Textractor API textractor-720 (20091120123250)

textractor.chain.indexer
Class Indexer

java.lang.Object
  extended by textractor.sentence.AbstractSentenceProcessor
      extended by textractor.chain.AbstractSentenceConsumer
          extended by textractor.chain.indexer.Indexer
All Implemented Interfaces:
Callable<Boolean>, EventListener, Command, SentenceProcessingCompleteListener, SentenceConsumer, SentenceProcessor, TextractorProcessor

public final class Indexer
extends AbstractSentenceConsumer

A Command that creates an index with mg4j.


Field Summary
static int DOCUMENT_SEPARATOR
           
 
Fields inherited from class textractor.chain.AbstractSentenceConsumer
productionCompleted, textractorContext
 
Fields inherited from interface org.apache.commons.chain.Command
CONTINUE_PROCESSING, PROCESSING_COMPLETE
 
Constructor Summary
Indexer()
          Create a new indexer Command.
 
Method Summary
static Object assureFieldType(TextractorFieldInfo fieldInfo, Object value)
          Assure that value is appropriate for fieldInfo.type.
 void consume(Article article, Collection<Sentence> sentences)
          Process sentences along with their associated article into individual sentences so that the IndexBuilder process can read them properly.
static String fieldErrorString(TextractorFieldInfo fieldInfo, Object value)
          The value is not valid for fieldInfo.type.
 String getBasename()
           
 int getCombineBufferSize()
           
 int getDocumentsPerBatch()
          Gets the number of documents per batch.
 int getHeight()
           
 int getIndexingQueueSize()
          Get the size of the indexing queue.
 int getMinimumDashSplitLength()
           
 int getNumberOfArticlesProcessed()
          Get the number of articles processed so far.
 int getNumberOfSentencesProcessed()
          Get the number of sentences processed so far.
 int getPasteBufferSize()
           
 Map<CompressionFlags.Component,CompressionFlags.Coding> getPayloadWriterFlags()
           
 int getQuantum()
           
 int getScanBufferSize()
           
 int getSkipBufferSize()
           
 Map<CompressionFlags.Component,CompressionFlags.Coding> getStandardWriterFlags()
           
 String getTermProcessorClass()
           
 TextractorWordReader getWordReader()
           
 String getZipDocumentCollectionName()
           
 boolean isLowercaseIndex()
           
 boolean isParenthesesAreWords()
           
 boolean isSkips()
           
 boolean okToComplete()
          Indicate that all processing is complete and it's ok to terminate.
 void setBasename(String name)
           
 void setCombineBufferSize(int size)
           
 void setDocumentFactoryClass(String factoryClassName)
           
 void setDocumentsPerBatch(int numberOfDocuments)
          Sets the number of documents per batch.
 void setHeight(int height)
           
 void setIndexConfigurationFile(String indexConfigurationFileVal)
           
 void setIndexingQueueSize(int size)
          Set the size of the indexing queue.
 void setLowercaseIndex(boolean lowercaseIndex)
           
 void setMinimumDashSplitLength(int minimumDashSplitLength)
           
 void setParenthesesAreWords(boolean parenthesesAreWords)
          Choose if parentheses should be indexed as words.
 void setPasteBufferSize(int size)
           
 void setPayloadWriterFlags(Map<CompressionFlags.Component,CompressionFlags.Coding> flags)
           
 void setQuantum(int quantum)
           
 void setScanBufferSize(int size)
           
 void setSkips(boolean skips)
           
 void setStandardWriterFlags(Map<CompressionFlags.Component,CompressionFlags.Coding> flags)
           
 void setTermProcessorClass(String termProcessorClass)
           
 void setWordReader(TextractorWordReader wordReader)
           
 void setWordReaderClass(String wordReader)
           
 void setZipDocumentCollectionName(String zipDocumentCollectionName)
           
 void size(int size)
           
protected  Properties storeProperties(String filename)
          Stores properties of this class into the given filename.
 
Methods inherited from class textractor.chain.AbstractSentenceConsumer
call, execute, processingComplete
 
Methods inherited from class textractor.sentence.AbstractSentenceProcessor
addSentenceProcessedListener, addSentenceProcessingCompleteListener, fireSentenceProcessedEvent, fireSentenceProcessingCompleteEvent, removeSentenceProcessedListener, removeSentenceProcessingCompleteListener
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface textractor.sentence.SentenceProcessor
addSentenceProcessedListener, addSentenceProcessingCompleteListener, removeSentenceProcessedListener, removeSentenceProcessingCompleteListener
 

Field Detail

DOCUMENT_SEPARATOR

public static final int DOCUMENT_SEPARATOR
See Also:
Constant Field Values
Constructor Detail

Indexer

public Indexer()
        throws IllegalAccessException,
               InstantiationException,
               ClassNotFoundException
Create a new indexer Command.

Throws:
IllegalAccessException - error creating object
InstantiationException - error creating object
ClassNotFoundException - error creating object
Method Detail

consume

public void consume(Article article,
                    Collection<Sentence> sentences)
Process sentences along with their associated article into individual sentences so that the IndexBuilder process can read them properly.

Parameters:
article - The article associated with the sentences.
sentences - A collection of Sentences to process.

storeProperties

protected Properties storeProperties(String filename)
                              throws ConfigurationException
Stores properties of this class into the given filename.

Parameters:
filename - Name of the file to store the properties in
Returns:
A property object containing the properties stored.
Throws:
ConfigurationException - if the file cannot be written properly

assureFieldType

public static Object assureFieldType(TextractorFieldInfo fieldInfo,
                                     Object value)
Assure that value is appropriate for fieldInfo.type. If fieldInfo.name is "text", this will just return null. If value is appropriate, this will return value. If value is inappropriate, this will throw IllegalArgumentException, because indexing should be aborted.

Parameters:
fieldInfo - the field we are indexing the value for
value - the value for the field
Returns:
the value or null
Throws:
IllegalArgumentException - if the value is not appropriate for fieldInfo.type
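
The three outcomes described above (null for the "text" field, the value itself when it fits, IllegalArgumentException otherwise) can be sketched as follows. This is a minimal illustration and not the Textractor implementation: FieldInfo and FieldType are hypothetical stand-ins for TextractorFieldInfo, the two field types and the instanceof checks are assumptions, and the message format is invented for the example.

```java
/** Hypothetical stand-in illustrating the documented contract; not the Textractor source. */
public class AssureFieldTypeSketch {
    /** Assumed field types; the real types of TextractorFieldInfo are not listed on this page. */
    public enum FieldType { TEXT, INT }

    /** Hypothetical minimal replacement for TextractorFieldInfo. */
    public static final class FieldInfo {
        public final String name;
        public final FieldType type;

        public FieldInfo(String name, FieldType type) {
            this.name = name;
            this.type = type;
        }
    }

    /** Returns null for the "text" field, the value if it fits the type, and throws otherwise. */
    public static Object assureFieldType(FieldInfo fieldInfo, Object value) {
        if ("text".equals(fieldInfo.name)) {
            return null; // documented special case: the "text" field always yields null
        }
        boolean appropriate;
        switch (fieldInfo.type) {
            case INT:
                appropriate = value instanceof Integer;
                break;
            case TEXT:
                appropriate = value instanceof CharSequence;
                break;
            default:
                appropriate = false;
        }
        if (!appropriate) {
            // Failing fast here stops indexing instead of writing a bad field value.
            throw new IllegalArgumentException(fieldErrorString(fieldInfo, value));
        }
        return value;
    }

    /** Builds the error message; the wording is an assumption made for this sketch. */
    public static String fieldErrorString(FieldInfo fieldInfo, Object value) {
        return "Value '" + value + "' is not valid for field '" + fieldInfo.name
                + "' of type " + fieldInfo.type;
    }
}
```

A caller that wants indexing to stop on bad data can simply let the IllegalArgumentException propagate rather than catching it.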

fieldErrorString

public static String fieldErrorString(TextractorFieldInfo fieldInfo,
                                      Object value)
The value is not valid for fieldInfo.type. Generate the error string.

Parameters:
fieldInfo - the field which is in error
value - the value which is invalid
Returns:
the error string

setParenthesesAreWords

public void setParenthesesAreWords(boolean parenthesesAreWords)
Choose if parentheses should be indexed as words.

Parameters:
parenthesesAreWords - True if the index should be built with parentheses indexed as words.

getBasename

public String getBasename()
Returns:
the basename

setBasename

public void setBasename(String name)
Parameters:
name - the basename to set

getDocumentsPerBatch

public int getDocumentsPerBatch()
Gets the number of documents per batch.

Returns:
the number of documents Scan will attempt to add to each batch.

setDocumentsPerBatch

public void setDocumentsPerBatch(int numberOfDocuments)
Sets the number of documents per batch.

Parameters:
numberOfDocuments - the number of documents Scan will attempt to add to each batch.

getMinimumDashSplitLength

public int getMinimumDashSplitLength()
Returns:
the minimumDashSplitLength

setMinimumDashSplitLength

public void setMinimumDashSplitLength(int minimumDashSplitLength)
Parameters:
minimumDashSplitLength - the minimumDashSplitLength to set

getWordReader

public TextractorWordReader getWordReader()
Returns:
the wordReader

setWordReader

public void setWordReader(TextractorWordReader wordReader)

setWordReaderClass

public void setWordReaderClass(String wordReader)
                        throws ClassNotFoundException,
                               IllegalAccessException,
                               InstantiationException
Throws:
ClassNotFoundException
IllegalAccessException
InstantiationException

getZipDocumentCollectionName

public String getZipDocumentCollectionName()
Returns:
the zipDocumentCollectionName

setZipDocumentCollectionName

public void setZipDocumentCollectionName(String zipDocumentCollectionName)
Parameters:
zipDocumentCollectionName - the zipDocumentCollectionName to set

isParenthesesAreWords

public boolean isParenthesesAreWords()
Returns:
the parenthesesAreWords

getHeight

public int getHeight()
Returns:
the height

setHeight

public void setHeight(int height)
Parameters:
height - the height to set

getQuantum

public int getQuantum()
Returns:
the quantum

setQuantum

public void setQuantum(int quantum)
Parameters:
quantum - the quantum to set

isSkips

public boolean isSkips()
Returns:
the skips

setSkips

public void setSkips(boolean skips)
Parameters:
skips - the skips to set

isLowercaseIndex

public boolean isLowercaseIndex()

setLowercaseIndex

public void setLowercaseIndex(boolean lowercaseIndex)

getStandardWriterFlags

public Map<CompressionFlags.Component,CompressionFlags.Coding> getStandardWriterFlags()

setStandardWriterFlags

public void setStandardWriterFlags(Map<CompressionFlags.Component,CompressionFlags.Coding> flags)

getPayloadWriterFlags

public Map<CompressionFlags.Component,CompressionFlags.Coding> getPayloadWriterFlags()

setPayloadWriterFlags

public void setPayloadWriterFlags(Map<CompressionFlags.Component,CompressionFlags.Coding> flags)

getTermProcessorClass

public String getTermProcessorClass()

setTermProcessorClass

public void setTermProcessorClass(String termProcessorClass)

setDocumentFactoryClass

public void setDocumentFactoryClass(String factoryClassName)

setIndexConfigurationFile

public void setIndexConfigurationFile(String indexConfigurationFileVal)

getNumberOfArticlesProcessed

public int getNumberOfArticlesProcessed()
Get the number of articles processed so far.

Returns:
The number of articles processed so far

getNumberOfSentencesProcessed

public int getNumberOfSentencesProcessed()
Get the number of sentences processed so far.

Returns:
The number of sentences processed so far

okToComplete

public boolean okToComplete()
Indicate that all processing is complete and it's ok to terminate. If false is returned the consumer thread will terminate without firing a SentenceProcessingCompleteEvent. Be aware of this and send the event if you override the default behavior.

Overrides:
okToComplete in class AbstractSentenceConsumer
Returns:
true if it's ok to complete.
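
The warning about overriding okToComplete can be illustrated with a self-contained sketch. None of the classes below are Textractor classes: Consumer, CompleteListener, finish and backgroundWorkDone are hypothetical names, and the way the base class consults okToComplete before firing the event is an assumption drawn only from the description above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of the okToComplete contract; not the Textractor source. */
public class OkToCompleteSketch {
    /** Stand-in for SentenceProcessingCompleteListener. */
    public interface CompleteListener {
        void processingComplete();
    }

    /** Stand-in for AbstractSentenceConsumer. */
    public static class Consumer {
        private final List<CompleteListener> listeners = new ArrayList<>();

        public void addListener(CompleteListener listener) {
            listeners.add(listener);
        }

        /** Default behavior: completing is fine as soon as the consumer finishes. */
        public boolean okToComplete() {
            return true;
        }

        protected void fireComplete() {
            for (CompleteListener listener : listeners) {
                listener.processingComplete();
            }
        }

        /** Assumed to run once when the consumer thread is about to terminate. */
        public final void finish() {
            if (okToComplete()) {
                fireComplete();
            }
            // If okToComplete() returned false, no event is fired from here.
        }
    }

    /** An override that defers completion and therefore must send the event itself. */
    public static class DeferredConsumer extends Consumer {
        @Override
        public boolean okToComplete() {
            return false;
        }

        /** Hypothetical hook called when the remaining work (e.g. an index build) is done. */
        public void backgroundWorkDone() {
            fireComplete();
        }
    }
}
```

In this model a listener attached to DeferredConsumer hears nothing from finish() and is only notified once backgroundWorkDone() runs, which is the situation the note above warns about.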

getCombineBufferSize

public int getCombineBufferSize()
Returns:
the combineBufferSize

setCombineBufferSize

public void setCombineBufferSize(int size)
Parameters:
size - the combineBufferSize to set

getPasteBufferSize

public int getPasteBufferSize()
Returns:
the pasteBufferSize

setPasteBufferSize

public void setPasteBufferSize(int size)
Parameters:
size - the pasteBufferSize to set

getScanBufferSize

public int getScanBufferSize()
Returns:
the scanBufferSize

setScanBufferSize

public void setScanBufferSize(int size)
Parameters:
size - the scanBufferSize to set

getSkipBufferSize

public int getSkipBufferSize()
Returns:
the skipBufferSize

size

public void size(int size)
Parameters:
size - the skipBufferSize to set

getIndexingQueueSize

public int getIndexingQueueSize()
Get the size of the indexing queue.

Returns:
The size of the queue.

setIndexingQueueSize

public void setIndexingQueueSize(int size)
Set the size of the indexing queue.

Parameters:
size - The size of the queue.

Copyright © 2003-2008 Institute for Computational Biomedicine, All Rights Reserved.