Textractor API textractor-720 (20091120123250)

textractor.database
Class DocumentIndexManager

java.lang.Object
  extended by textractor.database.DocumentIndexManager
All Implemented Interfaces:
Closeable

public final class DocumentIndexManager
extends Object
implements Closeable

The manager provides access to the inverted index files created by MG4J for a specific document collection. This version of the class supports multiple fields in the index. The normal field has an indexAlias of "text". Others that might exist include "authors". One specifies the fields to include in the TextractorDocumentFactory or similar class.

Author:
Fabien Campagne, Kevin Dorff

Field Summary
static String DEFAULT_PROPERTY_FILE_SUFFIX
          Default property filename used with a basename if not provided.
static String DEFAULT_TERM_SUFFIX
          Default term suffix used with a basename if not provided.
static int NO_SUCH_TERM
          This special term index indicates that the term does not exist in the full text index.
 
Constructor Summary
DocumentIndexManager(String base)
          Initializes the DocumentIndexManager using the basename with a default term suffix and property file suffix.
DocumentIndexManager(TextractorWordReader reader, TermProcessor processor)
          Initialize this document index manager for text splitting only.
 
Method Summary
 void close()
          Close all loaded indexes.
 DocumentIterator extendOnLeft(DocumentQueryResult cachedResult, String leftWord)
          Extend the result of a query exact order on the left, by one word.
 DocumentIterator extendOnRight(DocumentQueryResult cachedResult, String rightWord)
          Extend the result of a query exact order on the right, by one word.
 int[] extractSpaceDelimitedTerms(CharSequence documentContent)
          Extracts indexed terms from a document.
 int[] extractTerms(CharSequence documentContent)
          Extracts indexed terms from a document using the "text" index.
 int[] extractTerms(CharSequence documentContent, char internalSeparator)
          Deprecated. Use the version that processes a MutableString documentContent instead.
 int[] extractTerms(IndexDetails indexDetailsToUse, CharSequence documentContent)
          Extract the int terms from a document using the given indexAlias index.
 int[] extractTerms(IndexDetails indexDetailsToUse, CharSequence documentContent, Character internalSeparator)
          Convert the string in documentContent into index term int's within the index associated with indexAlias.
 List<PositionedTerm> extractTerms(Sentence sentence)
          Extract the positioned terms for the given sentence.
 int findTermIndex(CharSequence term)
           
 int findTermIndex(IndexDetails indexDetails, CharSequence term)
          Obtain the index of the term in the index.
 int frequency(int term)
          Obtain frequency information for a single term in the index for the indexAlias named "text".
 int frequency(String[] query, TermFrequency frequency)
          Calculates the number of times a query matches a corpus.
 int frequency(String indexAlias, int term)
          Obtain frequency information for a single term in the index for the specified indexAlias.
 Map<String,IndexDetails> getAllAliasesToIndexMap()
          Get the map of index alias names to index details for all indices.
 String getBasename()
          Get the basename for the "text" index.
 AbstractTextractorDocumentFactory getDocumentFactory()
          Return the document factory that is being used.
 int getDocumentNumber()
          Get the number of documents in the text index.
 int getDocumentNumber(String indexAlias)
          Get the number of documents for the index specified by indexAlias.
 Index getIndex()
          Get the text index.
 Index getIndex(String indexAlias)
          Get the index for the specified index alias.
 IndexDetails getIndexDetails(String indexAlias)
          Get the index details for the specified index alias.
 List<TextractorFieldInfo> getIndexFields()
          Retrieve the indexFields being used with this index.
 int getMaxDocumentSize()
          Get the number of documents in the "text" index.
 int getMaxDocumentSize(String indexAlias)
          Get the number of documents in the index specified by indexAlias.
 int getNumberOfTerms()
          Returns the number of terms for the "text" index.
 int getNumberOfTerms(String indexAlias)
          Returns the number of terms for the specified index alias.
static String getPropertiesFilenameFromBasename(String basename)
          Return the properties file for the given basename.
static Properties getPropertiesForBasename(String basename)
          Return the actual Properties file for the given basename.
 QueryParser getQueryParser()
          Returns a query parser configured against this document index for the indexAlias "text".
 QueryParser getQueryParser(String indexAlias)
          Returns a query parser configured against this document index for the specified indexAlias.
 TermProcessor getTermProcessor()
          Get the term processor for the "text" index.
 TermProcessor getTermProcessor(String indexAlias)
          Get the term processor for the index specified by indexAlias.
 TermIterator getTerms()
          Gets iterator over the terms of the "text" index.
 TermIterator getTerms(String indexAlias)
          Gets iterator over the terms of the index related to indexAlias.
 Map<String,IndexDetails> getTextAliasesToIndexMap()
          Get the map of index alias names to index details for text indices.
 Properties getTextractorProperties()
          Obtain the properties used to configure the document index manager.
 TextractorWordReader getWordReader()
          Get the word reader for the "text" index.
 TextractorWordReader getWordReader(String indexAlias)
          Return the word reader for the specified indexAlias.
 Map<String,TextractorWordReader> getWordReadersMap()
          Obtain the map of alias to word readers.
 int[] intersection(int[] array1, int[] array2)
          Returns the intersection of the two arrays.
 int[] iteratorToInts(DocumentIterator documentIterator)
          Converts a document iterator into a document number int array.
 String multipleWordTermAsString(int[] indexedTerm)
          Given the specified index int's, return the string given the "text" index.
 String multipleWordTermAsString(String indexAlias, int[] indexedTerm)
          Given the specified index int's, return the string given the indexAlias index.
 boolean processTerm(IndexDetails indexDetails, MutableString rawTerm)
          Run the term processor associated with the indexDetails index on the given term.
 boolean processTerm(MutableString rawTerm)
          Run the term processor associated with the "text" index on the given term.
 int[] query(String keyword)
          Returns the documents that contain the keywords.
 int[] query(String keyword, TermDocumentPositions positions)
          Returns the documents that contain the keywords.
 int[] queryAnd(Collection<String> keywords)
          Returns the documents that contain the intersection of the keywords.
 TermDocumentPositions queryAndExactOrder(String[] currentTerms)
          Run a query using "and exact order".
 DocumentIterator queryAndExactOrderMg4jNative(String[] currentTerms)
          Run an mg4j query using "and exact order".
 DocumentIterator queryAndExactOrderMg4jNativeWithIntArray(int[] currentIndices)
          Run an mg4j query using "and exact order" specifying int terms instead of strings with the query.
 DocumentIterator queryAndMg4jNative(int[] terms)
          Run an mg4j query using "and" specifying int terms instead of strings with the query.
 int[] queryOr(Collection<String> keywords)
          Returns the documents that contain the union of the keywords.
 void removeIndexFiles()
          Remove all index files for all loaded indices.
 void removeIndexFiles(String indexAlias)
          Removes the files that contain the index data for the specified index alias.
 void removeIndexFilesOnExit()
          Remove all index files for all loaded indices when the JVM exits.
 void setTermMap(String indexAlias, TermMap map)
          Install a new term map.
 void setTermMap(TermMap map)
          Install a new term map.
 int splitText(CharSequence text, MutableString result)
          Returns the text of this sentence in a format where words are delimited by a single space character.
 int splitText(WordReader currentWordReader, CharSequence text, MutableString result)
          Returns the text of this sentence in a format where words are delimited by a single space character.
 boolean suggestIgnoreTerm(String term, int termCount)
          Suggest if a term should be ignored in a query.
 CharSequence termAsCharSequence(IndexDetails indexDetails, int termIndex)
          Convert a termIndex index into a character sequence using the index associated with indexAlias.
 CharSequence termAsCharSequence(int termIndex)
          Convert a termIndex index into a character sequence using the "text" index.
 String termAsString(IndexDetails indexDetails, int term)
          Return the given term in the indexAlias index as a string.
 String termAsString(int term)
          Return the given term in the "text" index as a string.
 MutableString toText(int[] documentText)
          Returns a human readable text for the document content within the "text" index.
 MutableString toText(String indexAlias, int[] documentText)
          Returns a human readable text for the document content within the specified indexAlias index.
 int[] union(int[] array1, int[] array2)
          Returns the union of the two arrays.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

NO_SUCH_TERM

public static final int NO_SUCH_TERM
This special term index indicates that the term does not exist in the full text index.

See Also:
Constant Field Values

DEFAULT_TERM_SUFFIX

public static final String DEFAULT_TERM_SUFFIX
Default term suffix used with a basename if not provided.

See Also:
Constant Field Values

DEFAULT_PROPERTY_FILE_SUFFIX

public static final String DEFAULT_PROPERTY_FILE_SUFFIX
Default property filename used with a basename if not provided.

See Also:
Constant Field Values
Constructor Detail

DocumentIndexManager

public DocumentIndexManager(String base)
                     throws ConfigurationException,
                            IOException
Initializes the DocumentIndexManager using the basename with a default term suffix and property file suffix. Use THIS version if you want to attempt to open all of the TEXT indexes in the List<TextractorFieldInfo>.

Parameters:
base - String of characters (excluding spaces and characters that cannot be part of a filename) that uniquely identifies a collection of documents indexed with MG4J. only try to open the "text" index.
Throws:
ConfigurationException - if there is an error setting up properties using the file specified by base and the property file suffix
IOException - if there was a problem setting up the index or word reader

DocumentIndexManager

public DocumentIndexManager(TextractorWordReader reader,
                            TermProcessor processor)
Initialize this document index manager for text splitting only. When initialized through this constructor, the index manager can only be used to call the following methods: Other methods will throw a variety of exceptions

Parameters:
reader - The word reader implementation that should be used to split text.
processor - The term processor implementation that should be used
Method Detail

getIndexFields

public List<TextractorFieldInfo> getIndexFields()
Retrieve the indexFields being used with this index.

Returns:
the indexFields

getPropertiesFilenameFromBasename

public static String getPropertiesFilenameFromBasename(String basename)
Return the properties file for the given basename.

Parameters:
basename - the index basename

getPropertiesForBasename

public static Properties getPropertiesForBasename(String basename)
                                           throws ConfigurationException
Return the actual Properties file for the given basename.

Parameters:
basename - the index basename
Throws:
ConfigurationException

getIndex

public Index getIndex()
Get the text index.

Returns:
Index text index

getIndex

public Index getIndex(String indexAlias)
Get the index for the specified index alias.

Parameters:
indexAlias - the index alias to find the index for
Returns:
Index the index for the specified index alias

getIndexDetails

public IndexDetails getIndexDetails(String indexAlias)
Get the index details for the specified index alias.

Parameters:
indexAlias - the index alias to find the index details for
Returns:
IndexDetails the index details for the specified index alias

removeIndexFiles

public void removeIndexFiles()
Remove all index files for all loaded indices.


getTextAliasesToIndexMap

public Map<String,IndexDetails> getTextAliasesToIndexMap()
Get the map of index alias names to index details for text indices.

Returns:
the map of index alias names to index details

getAllAliasesToIndexMap

public Map<String,IndexDetails> getAllAliasesToIndexMap()
Get the map of index alias names to index details for all indices.

Returns:
the map of index alias names to index details

removeIndexFiles

public void removeIndexFiles(String indexAlias)
Removes the files that contain the index data for the specified index alias.

Parameters:
indexAlias - the index alias to remove files for

removeIndexFilesOnExit

public void removeIndexFilesOnExit()
Remove all index files for all loaded indices when the JVM exits.


getNumberOfTerms

public int getNumberOfTerms()
Returns the number of terms for the "text" index.

Returns:
Number of terms in the index.

getNumberOfTerms

public int getNumberOfTerms(String indexAlias)
Returns the number of terms for the specified index alias.

Parameters:
indexAlias - the index alias to get the number of terms for
Returns:
Number of terms in the index.

close

public void close()
Close all loaded indexes.

Specified by:
close in interface Closeable

getDocumentNumber

public int getDocumentNumber()
Get the number of documents in the text index.

Returns:
the number of documents in the text index

getDocumentNumber

public int getDocumentNumber(String indexAlias)
Get the number of documents for the index specified by indexAlias.

Parameters:
indexAlias - the index alias to get the number of documents for
Returns:
the number of documents for the index specified by indexAlias

extractTerms

public List<PositionedTerm> extractTerms(Sentence sentence)
Extract the positioned terms for the given sentence. The terms will be from the "text" index.

Parameters:
sentence - the sentence to get the positioned terms for
Returns:
the list of positioned terms for the given sentence

extractTerms

public int[] extractTerms(CharSequence documentContent)
Extracts indexed terms from a document using the "text" index.

Parameters:
documentContent - The document to be converted.
Returns:
An array of ints. Each int is the index of the term in the document index manager.

extractTerms

public int[] extractTerms(IndexDetails indexDetailsToUse,
                          CharSequence documentContent)
Extract the int terms from a document using the given indexAlias index.

Parameters:
indexDetailsToUse - the index to get the int terms from
documentContent - the document to get the terms for
Returns:
the int terms

extractTerms

@Deprecated
public int[] extractTerms(CharSequence documentContent,
                                     char internalSeparator)
Deprecated. Use the version that processes a MutableString documentContent instead.

Extracts indexed terms from a document. Conversion is done with the same method used by MG4J when indexing the documents, except that an extra word delimiter is considered (internalSeparator). Any words that contain this delimiter are split at the delimiter boundary and the parts are resolved independently in the full text index. For instance, let's assume "A B_C,D_E" is the document content, and space and comma delimit words in the index. If internalSeparator is '_', then the following words will be converted: [A B C D E]. In contrast, extractTerms without the extra delimiter would return [A TermNotFound TermNotFound], assuming terms B_C and D_E have not been indexed, or [A B_C D_E] if terms B_C and D_E exist in the index.

This method returns an array of int. Each int of the array represents a term that occurs in the document. The order of terms in the document is preserved in the int array. The int[] is often more convenient to use than a String, since parsing into terms has already been done. Direct access into the array of terms can be useful. The terms will be from the "text" index.

Parameters:
documentContent - The document to be converted.
internalSeparator - A character used as an extra delimiter.
Returns:
An array of ints. Each int is the index of the term in the document index manager.
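The "A B_C,D_E" example above can be reproduced with a small, self-contained sketch. It is not the Textractor implementation: the class name InternalSeparatorSketch, the in-memory term map, and the value -1 used for the "term not found" marker are all assumptions made for illustration, and the sketch treats only space and comma as word delimiters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of splitting on an extra internal separator. */
final class InternalSeparatorSketch {
    /** Stand-in for NO_SUCH_TERM; the real constant's value is an assumption here. */
    static final int NO_SUCH_TERM = -1;

    /** Tiny stand-in for an index vocabulary: A=0, B=1, C=2, D=3, E=4. */
    static Map<String, Integer> demoTermMap() {
        final Map<String, Integer> termMap = new HashMap<String, Integer>();
        final String[] terms = {"A", "B", "C", "D", "E"};
        for (int i = 0; i < terms.length; i++) {
            termMap.put(terms[i], i);
        }
        return termMap;
    }

    /** Splits on space, comma and (if not null) internalSeparator, then looks up each word. */
    static int[] extractTerms(final Map<String, Integer> termMap, final String content,
                              final Character internalSeparator) {
        final List<Integer> terms = new ArrayList<Integer>();
        final StringBuilder word = new StringBuilder();
        for (int i = 0; i <= content.length(); i++) {
            final boolean atEnd = i == content.length();
            final char c = atEnd ? ' ' : content.charAt(i);
            final boolean delimiter = atEnd || c == ' ' || c == ','
                    || (internalSeparator != null && c == internalSeparator.charValue());
            if (!delimiter) {
                word.append(c);
            } else if (word.length() > 0) {
                // A completed word: resolve it, or record that it is unknown.
                final Integer index = termMap.get(word.toString());
                terms.add(index == null ? NO_SUCH_TERM : index);
                word.setLength(0);
            }
        }
        final int[] result = new int[terms.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = terms.get(i);
        }
        return result;
    }

    public static void main(final String[] args) {
        final Map<String, Integer> termMap = demoTermMap();
        // With '_' as the extra delimiter every part is resolved: [0, 1, 2, 3, 4]
        System.out.println(Arrays.toString(extractTerms(termMap, "A B_C,D_E", '_')));
        // Without it, B_C and D_E are looked up whole and not found: [0, -1, -1]
        System.out.println(Arrays.toString(extractTerms(termMap, "A B_C,D_E", null)));
    }
}
```

The order of the words in the input is preserved in the returned array, which is the property the description above relies on.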

extractSpaceDelimitedTerms

public int[] extractSpaceDelimitedTerms(CharSequence documentContent)
Extracts indexed terms from a document. Conversion is done with the same method used by MG4J when indexing the documents. This method returns an array of int. Each int of the array represents a term that occurs in the document. The order of terms in the document is preserved in the int array. The int[] is often more convenient to use than a String, since parsing into terms has already been done. Direct access into the array of terms can be useful. The terms will be from the "text" index.

Parameters:
documentContent - The document to be converted.
Returns:
An array of ints. Each int is the index of the term in the document index manager.

extractTerms

public int[] extractTerms(IndexDetails indexDetailsToUse,
                          CharSequence documentContent,
                          Character internalSeparator)
Convert the string in documentContent into index term int's within the index associated with indexAlias.

Parameters:
indexDetailsToUse - the index alias to find the term int's for
documentContent - the string to obtain index term int's for
internalSeparator - the OPTIONAL internal separator, it will split on space (' ') and this value if internalSeparator is not null
Returns:
the array of index term int's for the given string

getTerms

public TermIterator getTerms()
                      throws FileNotFoundException
Gets iterator over the terms of the "text" index. Call close() when you are done with the iterator.

Returns:
An iterator over the terms of this index.
Throws:
FileNotFoundException - if the terms file cannot be loaded.

getTerms

public TermIterator getTerms(String indexAlias)
                      throws FileNotFoundException
Gets iterator over the terms of the index related to indexAlias. Call close() when you are done with the iterator.

Parameters:
indexAlias - the index alias to obtain the terms for
Returns:
An iterator over the terms of this index.
Throws:
FileNotFoundException - if the terms file cannot be loaded.

queryAndExactOrderMg4jNative

public DocumentIterator queryAndExactOrderMg4jNative(String[] currentTerms)
                                              throws IOException
Run an mg4j query using "and exact order". This will run on the "text" index.

Parameters:
currentTerms - the terms to query
Returns:
the query document iterator
Throws:
IOException - error executing query

queryAndExactOrderMg4jNativeWithIntArray

public DocumentIterator queryAndExactOrderMg4jNativeWithIntArray(int[] currentIndices)
                                                          throws IOException
Run an mg4j query using "and exact order" specifying int terms instead of strings with the query. This will run on the "text" index.

Parameters:
currentIndices - the terms int's to query
Returns:
the query document iterator
Throws:
IOException - error executing query

queryAndMg4jNative

public DocumentIterator queryAndMg4jNative(int[] terms)
                                    throws IOException
Run an mg4j query using "and" specifying int terms instead of strings with the query. This will run on the "text" index.

Parameters:
terms - the terms int's to query
Returns:
the query document iterator
Throws:
IOException - error executing query

extendOnLeft

public DocumentIterator extendOnLeft(DocumentQueryResult cachedResult,
                                     String leftWord)
                              throws IOException
Extend the result of a query exact order on the left, by one word. This will run on the "text" index.

Parameters:
cachedResult - Result of a previous query
leftWord - Returned documents will match leftWord |PreviousQuery|, in this exact order.
Returns:
DocumentIterator that exactly matches the combined query.
Throws:
IOException - If an error occurred reading the full text index.

extendOnRight

public DocumentIterator extendOnRight(DocumentQueryResult cachedResult,
                                      String rightWord)
                               throws IOException
Extend the result of a query exact order on the right, by one word. This will run on the "text" index.

Parameters:
cachedResult - Result of a previous query
rightWord - Returned documents will match |PreviousQuery| rightWord, in this exact order.
Returns:
DocumentIterator that exactly matches the combined query.
Throws:
IOException - If an error occurred reading the full text index.
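One way to picture extending an exact-order match by one word on the right is with word positions inside a single document. The sketch below is an assumption-laden illustration, not the Textractor or MG4J implementation: the class ExtendOnRightSketch and its plain int[] position arrays are hypothetical stand-ins for DocumentQueryResult and DocumentIterator. Extending on the left would be symmetric, checking the position just before the start of each previous match.

```java
import java.util.Arrays;

/** Hypothetical illustration of extending an exact-order match by one word. */
final class ExtendOnRightSketch {
    /**
     * @param matchEnds positions of the last word of each previous match in one document
     * @param wordPositions ascending positions of the right word in the same document
     * @return positions of the last word of each extended match
     */
    static int[] extendOnRight(final int[] matchEnds, final int[] wordPositions) {
        final int[] buffer = new int[matchEnds.length];
        int n = 0;
        for (final int end : matchEnds) {
            // The extended phrase matches only if the right word follows immediately.
            if (Arrays.binarySearch(wordPositions, end + 1) >= 0) {
                buffer[n++] = end + 1;
            }
        }
        return Arrays.copyOf(buffer, n);
    }

    public static void main(final String[] args) {
        // Document: "a b c a b d"; the previous query "a b" ends at positions 1 and 4.
        final int[] previousEnds = {1, 4};
        System.out.println(Arrays.toString(extendOnRight(previousEnds, new int[] {2}))); // "a b c"
        System.out.println(Arrays.toString(extendOnRight(previousEnds, new int[] {5}))); // "a b d"
    }
}
```

Reusing the cached positions this way avoids re-running the whole phrase query when only one word is appended.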

queryAndExactOrder

public TermDocumentPositions queryAndExactOrder(String[] currentTerms)
                                         throws IOException
Run a query using "and exact order".

Parameters:
currentTerms - the terms to query
Returns:
the term document positions matching the query
Throws:
IOException - error executing query

queryAnd

public int[] queryAnd(Collection<String> keywords)
               throws IOException
Returns the documents that contain the intersection of the keywords.

Parameters:
keywords - Collection of Strings. Each string is a keyword.
Returns:
Documents that contain the intersection of keywords. Empty result sets are returned as an empty int array.
Throws:
IOException - if an error occurred reading the full text index
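The behaviour described for queryAnd, including the empty array for an empty result, can be sketched against a toy in-memory inverted index. Everything here is hypothetical: the class QueryAndSketch, the postings map, and the sample keywords are assumptions for illustration, and the real method reads MG4J index files instead.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical illustration of an AND query over a toy inverted index. */
final class QueryAndSketch {
    /** Toy postings: keyword to the numbers of the documents that contain it. */
    static Map<String, int[]> demoPostings() {
        final Map<String, int[]> postings = new HashMap<String, int[]>();
        postings.put("p53", new int[] {1, 2, 5});
        postings.put("binds", new int[] {2, 5, 7});
        postings.put("dna", new int[] {5});
        return postings;
    }

    /** Returns the documents that contain every keyword, or an empty array if there are none. */
    static int[] queryAnd(final Map<String, int[]> postings, final Collection<String> keywords) {
        TreeSet<Integer> result = null;
        for (final String keyword : keywords) {
            final TreeSet<Integer> documents = new TreeSet<Integer>();
            final int[] list = postings.get(keyword);
            if (list != null) {
                for (final int document : list) {
                    documents.add(document);
                }
            }
            if (result == null) {
                result = documents;          // first keyword seeds the result
            } else {
                result.retainAll(documents); // later keywords narrow it
            }
        }
        if (result == null) {
            return new int[0];
        }
        final int[] out = new int[result.size()];
        int n = 0;
        for (final int document : result) {
            out[n++] = document;
        }
        return out;
    }

    public static void main(final String[] args) {
        final Map<String, int[]> postings = demoPostings();
        System.out.println(Arrays.toString(queryAnd(postings, Arrays.asList("p53", "binds"))));
    }
}
```

Callers can therefore test the length of the returned array instead of checking for null.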

intersection

public int[] intersection(int[] array1,
                          int[] array2)
Returns the intersection of the two arrays. The two arrays must be sorted in ascending order or the behaviour of this method is undefined.

Parameters:
array1 - First array of integers.
array2 - Second array of integers.
Returns:
An array of integers that contains the intersection of array1 and array2. That is, the resulting array will contain an int value if and only if the value is contained in both array1 and array2.
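The requirement that both inputs be sorted in ascending order suggests a single merge pass. The following is a minimal sketch of that technique, not the Textractor source; the class name SortedIntersectionSketch is hypothetical, and the sketch additionally assumes that neither input contains duplicates.

```java
import java.util.Arrays;

/** Hypothetical illustration of intersecting two ascending int arrays. */
final class SortedIntersectionSketch {
    static int[] intersection(final int[] array1, final int[] array2) {
        final int[] buffer = new int[Math.min(array1.length, array2.length)];
        int i = 0;
        int j = 0;
        int n = 0;
        while (i < array1.length && j < array2.length) {
            if (array1[i] < array2[j]) {
                i++;                      // value only in array1, skip it
            } else if (array1[i] > array2[j]) {
                j++;                      // value only in array2, skip it
            } else {
                buffer[n++] = array1[i];  // value in both arrays, keep it
                i++;
                j++;
            }
        }
        return Arrays.copyOf(buffer, n);
    }

    public static void main(final String[] args) {
        // prints [3, 8]
        System.out.println(Arrays.toString(
                intersection(new int[] {1, 3, 5, 8}, new int[] {3, 4, 8, 9})));
    }
}
```

If either input is unsorted, the two cursors can step past a common value, which is why the behaviour is undefined in that case.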

query

public int[] query(String keyword)
            throws IOException
Returns the documents that contain the keywords.

Parameters:
keyword - single keyword.
Returns:
index of the documents that contain the keyword, or null if the keyword was not found in the index.
Throws:
IOException - if an error occurred reading the full text index

query

public int[] query(String keyword,
                   TermDocumentPositions positions)
            throws IOException
Returns the documents that contain the keywords. This will run on the "text" index.

Parameters:
keyword - single keyword.
positions - Where to store positions of the keyword in the document if needed, int[0] otherwise.
Returns:
index of the documents that contain the keyword, or null if the keyword was not found in the index.
Throws:
IOException - if an error occurred reading the full text index

setTermMap

public void setTermMap(TermMap map)
Install a new term map. The new term map will be used by this document manager whenever terms need to be converted to term indices and back. The client is responsible for populating the term map with terms/indices consistent with the inverted index that this manager provides access to. This sets the termMap on the "text" index.

Parameters:
map - The new term map.

setTermMap

public void setTermMap(String indexAlias,
                       TermMap map)
Install a new term map. The new term map will be used by this document manager whenever terms need to be converted to term indices and back. The client is responsible for populating the term map with terms/indices consistent with the inverted index that this manager provides access to. This sets the termMap on the indexAlias index.

Parameters:
indexAlias - the index to associate the term map with
map - The new term map.

findTermIndex

public int findTermIndex(CharSequence term)
Obtain the index of the term in the "text" index.

findTermIndex

public int findTermIndex(IndexDetails indexDetails,
                         CharSequence term)
Obtain the index of the term in the index. Prefer the method that processes a MutableString for best performance.

Parameters:
indexDetails - the index to find the term int within
term - The term
Returns:
Index of this term in the index.

termAsCharSequence

public CharSequence termAsCharSequence(int termIndex)
Convert a termIndex index into a character sequence using the "text" index.

Parameters:
termIndex - The termIndex
Returns:
Term that corresponds to this index in the full text index.

termAsCharSequence

public CharSequence termAsCharSequence(IndexDetails indexDetails,
                                       int termIndex)
Convert a termIndex index into a character sequence using the index associated with indexAlias.

Parameters:
indexDetails - the index to use
termIndex - The termIndex
Returns:
Term that corresponds to this index in the full text index.

processTerm

public boolean processTerm(MutableString rawTerm)
Run the term processor associated with the "text" index on the given term.

Parameters:
rawTerm - the raw term to process
Returns:
the value returned by the term processor for this term

processTerm

public boolean processTerm(IndexDetails indexDetails,
                           MutableString rawTerm)
Run the term processor associated with the indexDetails index on the given term.

Parameters:
indexDetails - the index whose associated term processor we will use
rawTerm - the raw term to process
Returns:
the value returned by the term processor for this term

getMaxDocumentSize

public int getMaxDocumentSize()
Get the number of documents in the "text" index.

Returns:
the number of documents in the "text" index

getMaxDocumentSize

public int getMaxDocumentSize(String indexAlias)
Get the number of documents in the index specified by indexAlias.

Parameters:
indexAlias - the indexAlias to find the number of documents in
Returns:
the number of documents in the index specified by indexAlias

queryOr

public int[] queryOr(Collection<String> keywords)
              throws IOException
Returns the documents that contain the union of the keywords.

Parameters:
keywords - Collection of Strings. Each string is a keyword that the document returned will contain.
Returns:
Documents that contain the union of keywords.
Throws:
IOException - if an error occurred reading the full text index

union

public int[] union(int[] array1,
                   int[] array2)
Returns the union of the two arrays. The two arrays must be sorted in ascending order or the behaviour of this method is undefined.

Parameters:
array1 - First array of integers.
array2 - Second array of integers.
Returns:
An array of integers that contains the union of array1 and array2. That is, the resulting array will contain an int value if and only if the value is contained in array1 or array2.
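Judging by the method name, this method merges two ascending arrays into their union. A minimal sketch of such a merge is shown below; it is an illustration under stated assumptions, not the Textractor source. The class name SortedUnionSketch is hypothetical, and the sketch assumes each input is sorted ascending without duplicates.

```java
import java.util.Arrays;

/** Hypothetical illustration of the union of two ascending int arrays. */
final class SortedUnionSketch {
    static int[] union(final int[] array1, final int[] array2) {
        final int[] buffer = new int[array1.length + array2.length];
        int i = 0;
        int j = 0;
        int n = 0;
        while (i < array1.length && j < array2.length) {
            if (array1[i] < array2[j]) {
                buffer[n++] = array1[i++];
            } else if (array1[i] > array2[j]) {
                buffer[n++] = array2[j++];
            } else {
                buffer[n++] = array1[i];  // value in both arrays, emit it once
                i++;
                j++;
            }
        }
        // Copy whatever remains in the array that was not exhausted.
        while (i < array1.length) {
            buffer[n++] = array1[i++];
        }
        while (j < array2.length) {
            buffer[n++] = array2[j++];
        }
        return Arrays.copyOf(buffer, n);
    }

    public static void main(final String[] args) {
        // prints [1, 3, 4, 5]
        System.out.println(Arrays.toString(union(new int[] {1, 3, 5}, new int[] {3, 4})));
    }
}
```

The output stays sorted, so it can be fed straight back into another merge.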

multipleWordTermAsString

public String multipleWordTermAsString(int[] indexedTerm)
Given the specified index int's, return the string given the "text" index.

Parameters:
indexedTerm - the index term int's
Returns:
a string for those terms

multipleWordTermAsString

public String multipleWordTermAsString(String indexAlias,
                                       int[] indexedTerm)
Given the specified index int's, return the string given the indexAlias index.

Parameters:
indexAlias - the index alias to use to find the terms
indexedTerm - the index term int's
Returns:
a string for those terms

termAsString

public String termAsString(int term)
Return the given term in the "text" index as a string.

Parameters:
term - the index int term to find
Returns:
the string for that term.

termAsString

public String termAsString(IndexDetails indexDetails,
                           int term)
Return the given term in the indexAlias index as a string.

Parameters:
indexDetails - the index to use to find the terms
term - the index int term to find
Returns:
the string for that term.

iteratorToInts

public int[] iteratorToInts(DocumentIterator documentIterator)
                     throws IOException
Converts a document iterator into a document number int array.

Parameters:
documentIterator - Iterator over the documents.
Returns:
Array of document numbers or null if there are no documents
Throws:
IOException - if an error occurred while iterating over the documents
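Because the documented return value is null when there are no documents, callers need a null check before using the array. The sketch below illustrates that contract with a plain java.util.Iterator standing in for the MG4J DocumentIterator; the class IteratorToIntsSketch and its growth strategy are assumptions for illustration only.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;

/** Hypothetical illustration of draining an iterator into an int array. */
final class IteratorToIntsSketch {
    /** Returns the document numbers in iteration order, or null if there are none. */
    static int[] iteratorToInts(final Iterator<Integer> documents) {
        int[] buffer = new int[16];
        int n = 0;
        while (documents.hasNext()) {
            if (n == buffer.length) {
                buffer = Arrays.copyOf(buffer, n * 2); // grow when full
            }
            buffer[n++] = documents.next();
        }
        return n == 0 ? null : Arrays.copyOf(buffer, n);
    }

    public static void main(final String[] args) {
        final int[] found = iteratorToInts(Arrays.asList(4, 9, 12).iterator());
        System.out.println(Arrays.toString(found));
        final int[] none = iteratorToInts(Collections.<Integer>emptyList().iterator());
        // Guard against the null result before touching the array.
        System.out.println(none == null ? "no documents" : Arrays.toString(none));
    }
}
```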

splitText

public int splitText(CharSequence text,
                     MutableString result)
Returns the text of this sentence in a format where words are delimited by a single space character. The algorithm used to split the sentence into terms is the same as used by MG4J, allowing direct calculation of the word positions with a StringTokenizer that would split on spaces. As a side effect, this method calculates and stores the number of terms in this sentence. That number is then available through getTermNumber(). This will use the word reader associated with the "text" index.

This method does not process the terms: each term is returned with the capitalization that it had in the input text.

Parameters:
text - The input text to split into terms.
result - A mutable string where the result will be stored. Result is the text delimited by single space character.
Returns:
The number of terms that were processed.
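The effect of the splitting can be sketched without MG4J. The example below is an approximation under stated assumptions: it treats any run of letters or digits as a word, which may differ from the word reader configured for a given index, and it uses a StringBuilder where the real method takes a MutableString. The class name SplitTextSketch is hypothetical.

```java
import java.util.StringTokenizer;

/** Hypothetical illustration of splitting text into space-delimited words. */
final class SplitTextSketch {
    /** Writes the words of text into result, separated by single spaces, and returns their count. */
    static int splitText(final CharSequence text, final StringBuilder result) {
        result.setLength(0);
        int termCount = 0;
        boolean inWord = false;
        for (int i = 0; i < text.length(); i++) {
            final char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                if (!inWord) {
                    if (termCount > 0) {
                        result.append(' '); // exactly one space between words
                    }
                    termCount++;
                    inWord = true;
                }
                result.append(c);           // capitalization is left untouched
            } else {
                inWord = false;
            }
        }
        return termCount;
    }

    public static void main(final String[] args) {
        final StringBuilder result = new StringBuilder();
        final int count = splitText("Hello,  World: p53 binds", result);
        System.out.println(count + " terms: " + result);
        // Word positions can now be recovered by tokenizing on spaces.
        final StringTokenizer tokens = new StringTokenizer(result.toString(), " ");
        int position = 0;
        while (tokens.hasMoreTokens()) {
            System.out.println(position++ + " " + tokens.nextToken());
        }
    }
}
```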

splitText

public int splitText(WordReader currentWordReader,
                     CharSequence text,
                     MutableString result)
Returns the text of this sentence in a format where words are delimited by a single space character. The algorithm used to split the sentence into terms is the same as used by MG4J, allowing direct calculation of the word positions with a StringTokenizer that would split on spaces. As a side effect, this method calculates and stores the number of terms in this sentence. That number is then available through getTermNumber(). This will use the specified word reader.

This method does not process the terms: each term is returned with the capitalization that it had in the input text.

Parameters:
currentWordReader - the word reader to use to split the text
text - The input text to split into terms.
result - A mutable string where the result will be stored. Result is the text delimited by single space character.
Returns:
The number of terms that were processed.

getBasename

public String getBasename()
Get the basename for the "text" index.

Returns:
Returns the basename.

getTermProcessor

public TermProcessor getTermProcessor()
Get the term processor for the "text" index.

Returns:
Returns the term processor.

getTermProcessor

public TermProcessor getTermProcessor(String indexAlias)
Get the term processor for the index specified by indexAlias.

Parameters:
indexAlias - the index alias to find the term processor for
Returns:
Returns the term processor.

getWordReader

public TextractorWordReader getWordReader()
Get the word reader for the "text" index.

Returns:
Returns the word reader.

getWordReader

public TextractorWordReader getWordReader(String indexAlias)
Return the word reader for the specified indexAlias.

Parameters:
indexAlias - the index alias to get the word reader for.
Returns:
the word reader for the specified index or null if the specified index does not exist.

getWordReadersMap

public Map<String,TextractorWordReader> getWordReadersMap()
Obtain the map of alias to word readers.

Returns:
Returns the word readers map.

frequency

public int frequency(String[] query,
                     TermFrequency frequency)
              throws IOException
Calculates the number of times a query matches a corpus.

Parameters:
query - Successive terms whose frequencies will be counted.
frequency - Where frequencies will be written.
Returns:
Occurrence frequency, for convenience.
Throws:
IOException - error getting frequency

frequency

public int frequency(int term)
              throws IOException
Obtain frequency information for a single term in the index for the indexAlias named "text".

Parameters:
term - Term to get frequency for
Returns:
The count of documents that contain this term in the index.
Throws:
IOException - error getting frequency

frequency

public int frequency(String indexAlias,
                     int term)
              throws IOException
Obtain frequency information for a single term in the index for the specified indexAlias.

Parameters:
indexAlias - the index to get the frequency for the term
term - Term to get frequency for
Returns:
The count of documents that contain this term in the index.
Throws:
IOException - error getting frequency

toText

public MutableString toText(int[] documentText)
Returns a human readable text for the document content within the "text" index.

Parameters:
documentText - the list of int terms to obtain the document text for
Returns:
A string, where each coded word was replaced by its human readable term.

toText

public MutableString toText(String indexAlias,
                            int[] documentText)
Returns a human readable text for the document content within the specified indexAlias index.

Parameters:
indexAlias - the indexAlias to use when obtaining the text
documentText - the list of int terms to obtain the document text for
Returns:
A string, where each coded word was replaced by its human readable term.

getQueryParser

public QueryParser getQueryParser()
Returns a query parser configured against this document index for the indexAlias "text". The query parser parses queries in the MG4J query syntax, then implements and executes them.

Returns:
query parser configured against this document index for indexAlias "text"

getQueryParser

public QueryParser getQueryParser(String indexAlias)
Returns a query parser configured against this document index for the specified indexAlias. The query parser parses queries in the MG4J query syntax, then implements and executes them.

Parameters:
indexAlias - the indexAlias to obtain the query parser for
Returns:
query parser configured against this document index for specified indexAlias

getTextractorProperties

public Properties getTextractorProperties()
Obtain the properties used to configure the document index manager.

Returns:
the properties

suggestIgnoreTerm

public boolean suggestIgnoreTerm(String term,
                                 int termCount)
Suggest if a term should be ignored in a query. Terms that occur in more than 50% of the documents could be ignored if they are involved in a top-level OR statement. (e.g. A | B).

Parameters:
term - the term to determine if it should be ignored
termCount - The number of terms at the top level of a disjunctive query. For example, (A | B | C | (D|E)) has four.
Returns:
True if the word could be ignored, false otherwise.
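The 50% rule can be written down as a short predicate. This is a sketch of the stated heuristic only, with assumptions labeled: the class IgnoreTermSketch is hypothetical, the document frequency and collection size are passed in directly instead of being read from an index, and "involved in a top-level OR" is interpreted here as termCount being greater than one.

```java
/** Hypothetical illustration of the "ignore very frequent terms in an OR" heuristic. */
final class IgnoreTermSketch {
    /**
     * @param documentFrequency number of documents that contain the term
     * @param numberOfDocuments number of documents in the collection
     * @param termCount number of terms at the top level of the disjunctive query
     */
    static boolean suggestIgnoreTerm(final int documentFrequency, final int numberOfDocuments,
                                     final int termCount) {
        // Assumption: a lone term is never ignored, since nothing else would remain.
        final boolean inTopLevelOr = termCount > 1;
        // "More than 50% of the documents", computed without integer division.
        final boolean tooFrequent = 2L * documentFrequency > numberOfDocuments;
        return inTopLevelOr && tooFrequent;
    }

    public static void main(final String[] args) {
        System.out.println(suggestIgnoreTerm(60, 100, 2)); // frequent term inside A | B
        System.out.println(suggestIgnoreTerm(50, 100, 2)); // exactly half is not "more than"
        System.out.println(suggestIgnoreTerm(60, 100, 1)); // single-term query
    }
}
```

A term matching most of the collection adds little selectivity to a disjunction, which is the motivation for dropping it.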

getDocumentFactory

public AbstractTextractorDocumentFactory getDocumentFactory()
Return the document factory that is being used.

Returns:
the document factory

Copyright © 2003-2008 Institute for Computational Biomedicine, All Rights Reserved.