Textractor API textractor-720 (20091120123250)
java.lang.Object
  textractor.database.DocumentIndexManager
public final class DocumentIndexManager
The manager provides access to the inverted index files created by MG4J for a specific document collection. This version of the class supports multiple fields in the index. The normal field has an indexAlias of "text". Others that might exist include "authors". The fields to include are specified in the TextractorDocumentFactory or a similar class.
| Field Summary | |
|---|---|
static String |
DEFAULT_PROPERTY_FILE_SUFFIX
Default property filename used with a basename if not provided. |
static String |
DEFAULT_TERM_SUFFIX
Default term suffix used with a basename if not provided. |
static int |
NO_SUCH_TERM
This special term index indicates that the term does not exist in the full text index. |
| Constructor Summary | |
|---|---|
DocumentIndexManager(String base)
Initializes the DocumentIndexManager using the basename with a default term suffix and property file suffix. |
|
DocumentIndexManager(TextractorWordReader reader,
TermProcessor processor)
Initialize this document index manager for text splitting only. |
|
| Method Summary | |
|---|---|
void |
close()
Close all loaded indexes. |
DocumentIterator |
extendOnLeft(DocumentQueryResult cachedResult,
String leftWord)
Extend the result of an exact-order query on the left by one word. |
DocumentIterator |
extendOnRight(DocumentQueryResult cachedResult,
String rightWord)
Extend the result of an exact-order query on the right by one word. |
int[] |
extractSpaceDelimitedTerms(CharSequence documentContent)
Extracts indexed terms from a document. |
int[] |
extractTerms(CharSequence documentContent)
Extracts indexed terms from a document using the "text" index. |
int[] |
extractTerms(CharSequence documentContent,
char internalSeparator)
Deprecated. Use the version that processes a MutableString documentContent instead. |
int[] |
extractTerms(IndexDetails indexDetailsToUse,
CharSequence documentContent)
Extract the int terms from a document using the given indexAlias index. |
int[] |
extractTerms(IndexDetails indexDetailsToUse,
CharSequence documentContent,
Character internalSeparator)
Convert the string in documentContent into index term int's within the index associated with indexAlias. |
List<PositionedTerm> |
extractTerms(Sentence sentence)
Extract the positioned terms for the given sentence. |
int |
findTermIndex(CharSequence term)
|
int |
findTermIndex(IndexDetails indexDetails,
CharSequence term)
Obtain the index of the term in the index. |
int |
frequency(int term)
Obtain frequency information for a single term in the index for the indexAlias named "text". |
int |
frequency(String[] query,
TermFrequency frequency)
Calculates the number of times a query matches a corpus. |
int |
frequency(String indexAlias,
int term)
Obtain frequency information for a single term in the index for the specified indexAlias. |
Map<String,IndexDetails> |
getAllAliasesToIndexMap()
Get the set of index alias names for all indices. |
String |
getBasename()
Get the basename for the "text" index. |
AbstractTextractorDocumentFactory |
getDocumentFactory()
Return the document factory that is being used. |
int |
getDocumentNumber()
Get the number of documents on the text index. |
int |
getDocumentNumber(String indexAlias)
Get the number of documents for the index specified by indexAlias. |
Index |
getIndex()
Get the text index. |
Index |
getIndex(String indexAlias)
Get the index for the specified index alias. |
IndexDetails |
getIndexDetails(String indexAlias)
Get the index details for the specified index alias. |
List<TextractorFieldInfo> |
getIndexFields()
Retrieve the indexFields being used with this index. |
int |
getMaxDocumentSize()
Get the maximum document size in the "text" index. |
int |
getMaxDocumentSize(String indexAlias)
Get the maximum document size in the index specified by indexAlias. |
int |
getNumberOfTerms()
Returns the number of terms for the "text" index. |
int |
getNumberOfTerms(String indexAlias)
Returns the number of terms for the specified index alias. |
static String |
getPropertiesFilenameFromBasename(String basename)
Return the properties filename for the given basename. |
static Properties |
getPropertiesForBasename(String basename)
Return the actual Properties for the given basename. |
QueryParser |
getQueryParser()
Returns a query parser configured against this document index for the indexAlias "text". |
QueryParser |
getQueryParser(String indexAlias)
Returns a query parser configured against this document index for the specified indexAlias. |
TermProcessor |
getTermProcessor()
Get the term processor for the "text" index. |
TermProcessor |
getTermProcessor(String indexAlias)
Get the term processor for the index specified by indexAlias. |
TermIterator |
getTerms()
Gets iterator over the terms of the "text" index. |
TermIterator |
getTerms(String indexAlias)
Gets iterator over the terms of the index related to indexAlias. |
Map<String,IndexDetails> |
getTextAliasesToIndexMap()
Get the set of index alias names for text indices. |
Properties |
getTextractorProperties()
Obtain the properties used to configure the document index manager. |
TextractorWordReader |
getWordReader()
Get the word reader for the "text" index. |
TextractorWordReader |
getWordReader(String indexAlias)
Return the word reader for the specified indexAlias. |
Map<String,TextractorWordReader> |
getWordReadersMap()
Obtain the map of alias to word readers. |
int[] |
intersection(int[] array1,
int[] array2)
Returns the intersection of the two sets of arrays. |
int[] |
iteratorToInts(DocumentIterator documentIterator)
Converts a document iterator into a document number int array. |
String |
multipleWordTermAsString(int[] indexedTerm)
Given the specified index ints, return the corresponding string from the "text" index. |
String |
multipleWordTermAsString(String indexAlias,
int[] indexedTerm)
Given the specified index ints, return the corresponding string from the indexAlias index. |
boolean |
processTerm(IndexDetails indexDetails,
MutableString rawTerm)
Run the term processor associated with the indexDetails index on the given term. |
boolean |
processTerm(MutableString rawTerm)
Run the term processor associated with the "text" index on the given term. |
int[] |
query(String keyword)
Returns the documents that contain the keywords. |
int[] |
query(String keyword,
TermDocumentPositions positions)
Returns the documents that contain the keywords. |
int[] |
queryAnd(Collection<String> keywords)
Returns the documents that contain the intersection of the keywords. |
TermDocumentPositions |
queryAndExactOrder(String[] currentTerms)
Run a query using "and exact order". |
DocumentIterator |
queryAndExactOrderMg4jNative(String[] currentTerms)
Run an mg4j query using "and exact order". |
DocumentIterator |
queryAndExactOrderMg4jNativeWithIntArray(int[] currentIndices)
Run an mg4j query using "and exact order" specifying int terms instead of strings with the query. |
DocumentIterator |
queryAndMg4jNative(int[] terms)
Run an mg4j query using "and" specifying int terms instead of strings with the query. |
int[] |
queryOr(Collection<String> keywords)
Returns the documents that contain the union of the keywords. |
void |
removeIndexFiles()
Remove all index files for all loaded indices. |
void |
removeIndexFiles(String indexAlias)
Removes the files that contain the index data for the specified index alias. |
void |
removeIndexFilesOnExit()
Remove all index files for all loaded indices when the JVM exits. |
void |
setTermMap(String indexAlias,
TermMap map)
Install a new term map. |
void |
setTermMap(TermMap map)
Install a new term map. |
int |
splitText(CharSequence text,
MutableString result)
Returns the text of this sentence in a format where words are delimited by a single space character. |
int |
splitText(WordReader currentWordReader,
CharSequence text,
MutableString result)
Returns the text of this sentence in a format where words are delimited by a single space character. |
boolean |
suggestIgnoreTerm(String term,
int termCount)
Suggest if a term should be ignored in a query. |
CharSequence |
termAsCharSequence(IndexDetails indexDetails,
int termIndex)
Convert a termIndex index into a character sequence using the index associated with indexDetails. |
CharSequence |
termAsCharSequence(int termIndex)
Convert a termIndex index into a character sequence using the "text" index. |
String |
termAsString(IndexDetails indexDetails,
int term)
Return the given term in the indexAlias index as a string. |
String |
termAsString(int term)
Return the given term in the "text" index as a string. |
MutableString |
toText(int[] documentText)
Returns a human readable text for the document content within the "text" index. |
MutableString |
toText(String indexAlias,
int[] documentText)
Returns a human readable text for the document content within the specified indexAlias index. |
int[] |
union(int[] array1,
int[] array2)
Returns the union of the two sets of arrays. |
| Methods inherited from class java.lang.Object |
|---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Field Detail |
|---|
public static final int NO_SUCH_TERM
public static final String DEFAULT_TERM_SUFFIX
public static final String DEFAULT_PROPERTY_FILE_SUFFIX
| Constructor Detail |
|---|
public DocumentIndexManager(String base)
throws ConfigurationException,
IOException
base - String of characters (excluding spaces and characters that
cannot be part of a filename) that uniquely identifies a collection
of documents indexed with MG4J.
only try to open the "text" index.
ConfigurationException - if there is an error setting up properties
using the file specified by base and the property file suffix
IOException - if there was a problem setting up the index or word reader
public DocumentIndexManager(TextractorWordReader reader,
TermProcessor processor)
reader - The word reader implementation that should be used
to split text.
processor - The term processor implementation that should be used
| Method Detail |
|---|
public List<TextractorFieldInfo> getIndexFields()
public static String getPropertiesFilenameFromBasename(String basename)
basename - the index basename
public static Properties getPropertiesForBasename(String basename)
throws ConfigurationException
basename - the index basename
ConfigurationException
public Index getIndex()
public Index getIndex(String indexAlias)
indexAlias - the index alias to find the index for
public IndexDetails getIndexDetails(String indexAlias)
indexAlias - the index alias to find the index details for
public void removeIndexFiles()
public Map<String,IndexDetails> getTextAliasesToIndexMap()
public Map<String,IndexDetails> getAllAliasesToIndexMap()
public void removeIndexFiles(String indexAlias)
indexAlias - the index alias to remove files for
public void removeIndexFilesOnExit()
public int getNumberOfTerms()
public int getNumberOfTerms(String indexAlias)
indexAlias - the index alias to get the number of terms for
public void close()
close in interface Closeable
public int getDocumentNumber()
public int getDocumentNumber(String indexAlias)
indexAlias - the index alias to get the number of documents for
public List<PositionedTerm> extractTerms(Sentence sentence)
sentence - the sentence to get the positioned terms for
public int[] extractTerms(CharSequence documentContent)
documentContent - The document to be converted.
public int[] extractTerms(IndexDetails indexDetailsToUse,
CharSequence documentContent)
indexDetailsToUse - the index to get the int terms from
documentContent - the document to get the terms for
@Deprecated
public int[] extractTerms(CharSequence documentContent,
char internalSeparator)
documentContent - The document to be converted.
internalSeparator - A character used as an extra delimiter.
public int[] extractSpaceDelimitedTerms(CharSequence documentContent)
documentContent - The document to be converted.
public int[] extractTerms(IndexDetails indexDetailsToUse,
CharSequence documentContent,
Character internalSeparator)
indexDetailsToUse - the index alias to find the term int's for
documentContent - the string to obtain index term int's for
internalSeparator - the OPTIONAL internal separator; the text is split on
space (' ') and, if internalSeparator is not null, on this value as well
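The splitting rule described for internalSeparator can be sketched as follows. This is a minimal, self-contained illustration and not the Textractor implementation: the `termToIndex` map stands in for the index's term lookup, the class and helper names are hypothetical, and the value `-1` for `NO_SUCH_TERM` is an assumption, since this page does not state the constant's value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; not the Textractor implementation. */
class TermExtractionSketch {
    /** Assumed sentinel; the real value of NO_SUCH_TERM is not given on this page. */
    static final int NO_SUCH_TERM = -1;

    /**
     * Splits documentContent on ' ' and, when internalSeparator is non-null,
     * on that character too, then maps each token to an int term using the
     * hypothetical termToIndex lookup. Unknown tokens map to NO_SUCH_TERM.
     */
    static int[] extractTerms(Map<String, Integer> termToIndex,
                              CharSequence documentContent,
                              Character internalSeparator) {
        List<Integer> terms = new ArrayList<>();
        StringBuilder token = new StringBuilder();
        for (int i = 0; i <= documentContent.length(); i++) {
            // Treat the end of input as a final delimiter so the last token is flushed.
            boolean atEnd = i == documentContent.length();
            char c = atEnd ? ' ' : documentContent.charAt(i);
            boolean isDelimiter = c == ' '
                    || (internalSeparator != null && c == internalSeparator);
            if (isDelimiter) {
                if (token.length() > 0) {
                    terms.add(termToIndex.getOrDefault(token.toString(), NO_SUCH_TERM));
                    token.setLength(0);
                }
            } else {
                token.append(c);
            }
        }
        int[] result = new int[terms.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = terms.get(i);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> termToIndex = new HashMap<>();
        termToIndex.put("protein", 0);
        termToIndex.put("kinase", 1);
        termToIndex.put("c", 2);
        // With '-' as the internal separator, "kinase-c" yields two terms.
        System.out.println(Arrays.toString(
                extractTerms(termToIndex, "protein kinase-c", '-'))); // prints [0, 1, 2]
    }
}
```

Passing `null` as the separator leaves "kinase-c" as a single token, which in this sketch maps to the assumed `NO_SUCH_TERM` value because it is not in the lookup.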
public TermIterator getTerms()
throws FileNotFoundException
FileNotFoundException - if the terms file cannot be loaded.
public TermIterator getTerms(String indexAlias)
throws FileNotFoundException
indexAlias - the index alias to obtain the terms for
FileNotFoundException - if the terms file cannot be loaded.
public DocumentIterator queryAndExactOrderMg4jNative(String[] currentTerms)
throws IOException
currentTerms - the terms to query
IOException - error executing query
public DocumentIterator queryAndExactOrderMg4jNativeWithIntArray(int[] currentIndices)
throws IOException
currentIndices - the terms int's to query
IOException - error executing query
public DocumentIterator queryAndMg4jNative(int[] terms)
throws IOException
terms - the terms int's to query
IOException - error executing query
public DocumentIterator extendOnLeft(DocumentQueryResult cachedResult,
String leftWord)
throws IOException
cachedResult - Result of a previous query
leftWord - Returned documents will match leftWord |PreviousQuery|,
in this exact order.
IOException - If an error occurred reading the full
text index.
public DocumentIterator extendOnRight(DocumentQueryResult cachedResult,
String rightWord)
throws IOException
cachedResult - Result of a previous query
rightWord - Returned documents will match |PreviousQuery| rightWord,
in this exact order.
IOException - If an error occurred reading the full
text index.
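The idea of extending an exact-order match by one word can be illustrated on term positions within a single document. The sketch below is an assumption about how such an extension might work, not the actual extendOnRight code: it takes the end positions of the previous matches and the positions of rightWord (both hypothetical inputs, sorted ascending) and keeps the matches that are immediately followed by rightWord.

```java
import java.util.Arrays;

/** Illustrative sketch only; not the Textractor or MG4J implementation. */
class PhraseExtensionSketch {
    /**
     * matchEnds: sorted positions of the last word of each previous match.
     * wordPositions: sorted positions of rightWord in the same document.
     * Returns the end positions of the matches extended by rightWord.
     */
    static int[] extendOnRight(int[] matchEnds, int[] wordPositions) {
        int[] extended = new int[matchEnds.length];
        int count = 0;
        for (int end : matchEnds) {
            // The extension holds only if rightWord occurs at the very next position.
            if (Arrays.binarySearch(wordPositions, end + 1) >= 0) {
                extended[count++] = end + 1;
            }
        }
        return Arrays.copyOf(extended, count);
    }

    public static void main(String[] args) {
        int[] previousMatchEnds = {2, 7, 11};
        int[] rightWordPositions = {3, 9, 12};
        System.out.println(Arrays.toString(
                extendOnRight(previousMatchEnds, rightWordPositions))); // prints [3, 12]
    }
}
```

Extending on the left is symmetric under the same assumptions: use the start positions of the previous matches and look for leftWord at `start - 1`.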
public TermDocumentPositions queryAndExactOrder(String[] currentTerms)
throws IOException
currentTerms - the terms to query
IOException - error executing query
public int[] queryAnd(Collection<String> keywords)
throws IOException
keywords - Collection of Strings. Each string is a keyword.
IOException - xx
public int[] intersection(int[] array1,
int[] array2)
array1 - First array of integer.
array2 - Second array of integer.
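For readers unfamiliar with set operations on document-number arrays, here is a minimal sketch of an intersection. It assumes both arrays are sorted in ascending order without duplicates; this page does not say whether the real method requires that, so treat the precondition and the class name as assumptions.

```java
import java.util.Arrays;

/** Illustrative sketch only; assumes sorted, duplicate-free inputs. */
class SortedIntersectionSketch {
    static int[] intersection(int[] array1, int[] array2) {
        int[] out = new int[Math.min(array1.length, array2.length)];
        int i = 0;
        int j = 0;
        int n = 0;
        // Walk both arrays in step, keeping only values present in both.
        while (i < array1.length && j < array2.length) {
            if (array1[i] < array2[j]) {
                i++;
            } else if (array1[i] > array2[j]) {
                j++;
            } else {
                out[n++] = array1[i];
                i++;
                j++;
            }
        }
        return Arrays.copyOf(out, n);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(
                intersection(new int[]{1, 3, 5, 8}, new int[]{3, 4, 8}))); // prints [3, 8]
    }
}
```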
public int[] query(String keyword)
throws IOException
keyword - single keyword.
IOException - xx
public int[] query(String keyword,
TermDocumentPositions positions)
throws IOException
keyword - single keyword.
positions - Where to store positions of the keyword in the document
if needed, int[0] otherwise.
IOException - xx
public void setTermMap(TermMap map)
map - The new term map.
public void setTermMap(String indexAlias,
TermMap map)
indexAlias - the index to associate the term map with
map - The new term map.
public int findTermIndex(CharSequence term)
public int findTermIndex(IndexDetails indexDetails,
CharSequence term)
indexDetails - the index to find the term int within
term - The term
public CharSequence termAsCharSequence(int termIndex)
termIndex - The termIndex
public CharSequence termAsCharSequence(IndexDetails indexDetails,
int termIndex)
indexDetails - the index to use
termIndex - The termIndex
public boolean processTerm(MutableString rawTerm)
rawTerm - the raw term to process
public boolean processTerm(IndexDetails indexDetails,
MutableString rawTerm)
indexDetails - the index whose associated term processor we will use
rawTerm - the raw term to process
public int getMaxDocumentSize()
public int getMaxDocumentSize(String indexAlias)
indexAlias - the indexAlias to find the number of documents in
public int[] queryOr(Collection<String> keywords)
throws IOException
keywords - Collection of Strings. Each string is a keyword that the
document returned will contain.
IOException - xx
public int[] union(int[] array1,
int[] array2)
array1 - First array of integer.
array2 - Second array of integer.
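A union of two document-number arrays can be sketched as a merge. As with any such sketch, the sorted, duplicate-free precondition and the class name are assumptions made for illustration; the actual method may behave differently.

```java
import java.util.Arrays;

/** Illustrative sketch only; assumes sorted, duplicate-free inputs. */
class SortedUnionSketch {
    static int[] union(int[] array1, int[] array2) {
        int[] out = new int[array1.length + array2.length];
        int i = 0;
        int j = 0;
        int n = 0;
        // Merge the two arrays, emitting values common to both only once.
        while (i < array1.length && j < array2.length) {
            if (array1[i] < array2[j]) {
                out[n++] = array1[i++];
            } else if (array1[i] > array2[j]) {
                out[n++] = array2[j++];
            } else {
                out[n++] = array1[i];
                i++;
                j++;
            }
        }
        // Copy whatever remains in either array.
        while (i < array1.length) {
            out[n++] = array1[i++];
        }
        while (j < array2.length) {
            out[n++] = array2[j++];
        }
        return Arrays.copyOf(out, n);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(
                union(new int[]{1, 3, 5}, new int[]{3, 4}))); // prints [1, 3, 4, 5]
    }
}
```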
public String multipleWordTermAsString(int[] indexedTerm)
indexedTerm - the index term int's
public String multipleWordTermAsString(String indexAlias,
int[] indexedTerm)
indexAlias - the index alias to use to find the terms
indexedTerm - the index term int's
public String termAsString(int term)
term - the index int term to find
public String termAsString(IndexDetails indexDetails,
int term)
indexDetails - the index to use to find the terms
term - the index int term to find
public int[] iteratorToInts(DocumentIterator documentIterator)
throws IOException
documentIterator - Iterator over the documents.
IOException - xx
public int splitText(CharSequence text,
MutableString result)
text - The input text to split into terms.
result - A mutable string where the result will be stored.
Result is the text delimited by single space character.
public int splitText(WordReader currentWordReader,
CharSequence text,
MutableString result)
currentWordReader - the word reader to use to split the text
text - The input text to split into terms.
result - A mutable string where the result will be stored.
Result is the text delimited by single space character.
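The "words delimited by a single space" output format can be sketched without the Textractor classes. In this illustration a `StringBuilder` stands in for `MutableString`, words are taken to be runs of letters or digits in place of a real `WordReader`, and the returned int is assumed to be the number of words written; none of these choices is confirmed by this page.

```java
/** Illustrative sketch only; a real WordReader may tokenize differently. */
class SplitTextSketch {
    /**
     * Writes the words of text into result, separated by single spaces.
     * Returns the number of words written (an assumed meaning for the int).
     */
    static int splitText(CharSequence text, StringBuilder result) {
        result.setLength(0);
        int words = 0;
        boolean inWord = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                if (!inWord) {
                    // Starting a new word: separate it from the previous one.
                    if (words > 0) {
                        result.append(' ');
                    }
                    words++;
                    inWord = true;
                }
                result.append(c);
            } else {
                inWord = false;
            }
        }
        return words;
    }

    public static void main(String[] args) {
        StringBuilder result = new StringBuilder();
        int count = splitText("Hello,  world!\tfoo", result);
        System.out.println(count + ": " + result); // prints 3: Hello world foo
    }
}
```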
public String getBasename()
public TermProcessor getTermProcessor()
public TermProcessor getTermProcessor(String indexAlias)
indexAlias - the index alias to find the term processor for
public TextractorWordReader getWordReader()
public TextractorWordReader getWordReader(String indexAlias)
indexAlias - the index alias to get the word reader for.
public Map<String,TextractorWordReader> getWordReadersMap()
public int frequency(String[] query,
TermFrequency frequency)
throws IOException
query - Successive terms whose frequencies will be counted.
frequency - Where frequencies will be written.
IOException - error getting frequency
public int frequency(int term)
throws IOException
term - Term to get frequency for
IOException - error getting frequency
public int frequency(String indexAlias,
int term)
throws IOException
indexAlias - the index to get the frequency for the term
term - Term to get frequency for
IOException - error getting frequency
public MutableString toText(int[] documentText)
documentText - the list of int terms to obtain the document text for
public MutableString toText(String indexAlias,
int[] documentText)
indexAlias - the indexAlias to use when obtaining the text
documentText - the list of int terms to obtain the document text for
public QueryParser getQueryParser()
public QueryParser getQueryParser(String indexAlias)
indexAlias - the indexAlias to obtain the query parser for
public Properties getTextractorProperties()
public boolean suggestIgnoreTerm(String term,
int termCount)
term - the term to determine if it should be ignored
termCount - The number of terms at top level in a disjunctive query
(A | B | C |(D|E)) has four words.
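The termCount convention, where a nested group counts as one top-level word, can be made concrete with a small counter. This is a hypothetical helper written for illustration, not part of the documented API; it assumes the query uses only `|` and parentheses as shown in the example.

```java
/** Illustrative sketch only; hypothetical helper, not part of the documented API. */
class DisjunctCountSketch {
    /** Counts the alternatives at the top level of a disjunctive query string. */
    static int topLevelTermCount(String query) {
        String q = stripEnclosingParens(query.trim());
        if (q.isEmpty()) {
            return 0;
        }
        int depth = 0;
        int count = 1;
        for (char c : q.toCharArray()) {
            if (c == '(') {
                depth++;
            } else if (c == ')') {
                depth--;
            } else if (c == '|' && depth == 0) {
                // Only separators outside any nested group add a top-level term.
                count++;
            }
        }
        return count;
    }

    /** Removes parentheses that wrap the entire query, e.g. "(A|B)" becomes "A|B". */
    static String stripEnclosingParens(String q) {
        while (q.length() >= 2 && q.charAt(0) == '(' && enclosesWhole(q)) {
            q = q.substring(1, q.length() - 1).trim();
        }
        return q;
    }

    /** True when the '(' at index 0 is closed by the final character of q. */
    static boolean enclosesWhole(String q) {
        int depth = 0;
        for (int i = 0; i < q.length(); i++) {
            char c = q.charAt(i);
            if (c == '(') {
                depth++;
            } else if (c == ')') {
                depth--;
                if (depth == 0) {
                    return i == q.length() - 1;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(topLevelTermCount("(A | B | C |(D|E))")); // prints 4
    }
}
```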
public AbstractTextractorDocumentFactory getDocumentFactory()