Textractor API textractor-720 (20091120123250)

textractor.chain.loader
Class AbstractFileLoader

java.lang.Object
  extended by textractor.sentence.AbstractSentenceProcessor
      extended by textractor.chain.AbstractSentenceProducer
          extended by textractor.chain.loader.AbstractFileLoader
All Implemented Interfaces:
Callable<Boolean>, EventListener, Chain, Command, SentenceProcessingCompleteListener, SentenceProcessor, SentenceProducer, TextractorProcessor
Direct Known Subclasses:
FastaFileLoader, Html2TextArticleLoader, OmimArticleLoader, OtmiArticleLoader, PubmedArticleLoader, SfnArticleLoader, TrecGov2ArticleLoader

public abstract class AbstractFileLoader
extends AbstractSentenceProducer

Abstract base class for loading articles from files and processing them into sentences, which can then be further processed, indexed, or stored in a database via an appropriate SentenceConsumer or SentenceProcessor.


Field Summary
protected  boolean appendSentencesInOneDocument
          Indicates that each article should produce a single sentence with all the article text.
protected  int currentIteration
          The iteration currently in progress (should start at 1).
protected  int numberOfIterations
          The number of times to iterate over a file or directory.
protected static String PARAGRAPH_BOUNDARY_TAG
          Separator to use between paragraphs in a document.
protected  String paragraphBoundary
          This string will be placed between paragraphs.
protected static String SENTENCE_BOUNDARY_TAG
          Separator to use when creating sentences in a single document.
protected  String sentenceBoundary
          This string will be placed between sentences.
 
Fields inherited from class textractor.chain.AbstractSentenceProducer
commands, frozen
 
Fields inherited from interface org.apache.commons.chain.Command
CONTINUE_PROCESSING, PROCESSING_COMPLETE
 
Constructor Summary
AbstractFileLoader()
          Create a new SentenceProducer that loads sentences from files.
 
Method Summary
protected  void beginIteration(int iteration)
          Called at the start of an iteration.
 Boolean call()
          Processes a file, directory or list of files; intended to be run in its own thread.
protected  void endIteration(int iteration)
          Called at the end of an iteration.
 String getDirectory()
          Get the name of the directory to process.
 List<String> getExtensionList()
          Get the filename extensions used to filter which files to process.
 String getExtensions()
          Get the filename extensions used to filter which files to process.
 String getFile()
          Get the name of the file to process.
 String getList()
          Get the name of the file containing the list of files to process.
 int getNumberOfSentencesProcessed()
          Get the number of sentences processed so far.
 String getParagraphBoundaryTag()
          Get the string that will be placed between paragraphs.
 String getProcessedFileLog()
          Get the name (if any) of the file to which the names of processed files are written.
 String getSentenceBoundary()
          Get the string that will be placed between sentences.
 boolean isAppendSentencesInOneDocument()
          Indicates whether each article should produce a single sentence with all the article text.
 boolean isRecursive()
          Indicates whether subdirectories should be processed as well as files.
static String padWithSpaces(String stringToPad)
           
 void processDirectory(String directory)
          Processes all the files in a given directory.
 void processDirectory(String directory, FilenameFilter filter)
          Processes all the files that match a filter in a given directory.
 void processDirectory(String directory, List<String> extensions)
          Processes all the files in a given directory that have one of the given extensions.
 void processFileList(String list)
          Process a list of filenames.
abstract  void processFilename(String filename)
          Process a single file designated by name.
 void processingComplete(SentenceProcessingCompleteEvent event)
          Called when sentence processing is complete.
 Sentence produce(Article article, CharSequence text)
          Produce a new Sentence.
 void setAppendSentencesInOneDocument(boolean value)
          Set whether each article should produce a single sentence with all the article text.
 void setDirectory(String name)
          Set the name of the directory to process.
 void setExtensions(String extensionString)
          Set the filename extensions used to filter which files to process.
 void setFile(String name)
          Set the name of the file to process.
 void setList(String name)
          Set the name of the file containing the list of files to process.
 void setParagraphBoundary(String boundaryString)
          Set the string that will be placed between paragraphs.
 void setProcessedFileLog(String name)
          Set the name of the file to which the names of processed files are written.
 void setRecursive(boolean value)
          Indicates whether subdirectories should be processed as well as files.
 void setSentenceBoundary(String boundaryString)
          This string will be placed between sentences.
 
Methods inherited from class textractor.chain.AbstractSentenceProducer
addCommand, execute, getWorkQueueSize, produce, setWorkQueueSize
 
Methods inherited from class textractor.sentence.AbstractSentenceProcessor
addSentenceProcessedListener, addSentenceProcessingCompleteListener, fireSentenceProcessedEvent, fireSentenceProcessingCompleteEvent, removeSentenceProcessedListener, removeSentenceProcessingCompleteListener
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface textractor.sentence.SentenceProcessor
addSentenceProcessedListener, addSentenceProcessingCompleteListener, getNumberOfArticlesProcessed, removeSentenceProcessedListener, removeSentenceProcessingCompleteListener
 

Field Detail

numberOfIterations

protected int numberOfIterations
The number of times to iterate over a file or directory.


currentIteration

protected int currentIteration
The iteration currently in progress (should start at 1).


appendSentencesInOneDocument

protected boolean appendSentencesInOneDocument
Indicates that each article should produce a single sentence with all the article text.


SENTENCE_BOUNDARY_TAG

protected static final String SENTENCE_BOUNDARY_TAG
Separator to use when creating sentences in a single document.

See Also:
Constant Field Values

PARAGRAPH_BOUNDARY_TAG

protected static final String PARAGRAPH_BOUNDARY_TAG
Separator to use between paragraphs in a document.

See Also:
Constant Field Values

sentenceBoundary

protected String sentenceBoundary
This string will be placed between sentences.


paragraphBoundary

protected String paragraphBoundary
This string will be placed between paragraphs.

Constructor Detail

AbstractFileLoader

public AbstractFileLoader()
Create a new SentenceProducer that loads sentences from files.

Method Detail

processDirectory

public final void processDirectory(String directory)
                            throws IOException
Processes all the files in a given directory.

Parameters:
directory - The name of the directory to process.
Throws:
IOException - if there is a problem reading the directory or processing the files in the directory.

processDirectory

public final void processDirectory(String directory,
                                   List<String> extensions)
                            throws IOException
Processes all the files in a given directory that have one of the given extensions.

Parameters:
directory - The name of the directory to process.
extensions - The file extensions to allow; must not be null.
Throws:
IOException - if there is a problem reading the directory or processing the files in the directory.

processDirectory

public final void processDirectory(String directory,
                                   FilenameFilter filter)
                            throws IOException
Processes all the files that match a filter in a given directory.

Parameters:
directory - The name of the directory to process.
filter - Filename filter
Throws:
IOException - if there is a problem reading the directory or processing the files in the directory.
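
A filter for the FilenameFilter overload can be written with the standard java.io.FilenameFilter interface. The following is a minimal sketch, not part of Textractor: the class name ExtensionFilterSketch is hypothetical, and matching by a case-sensitive filename suffix is an assumption about how extensions are compared.

```java
import java.io.File;
import java.io.FilenameFilter;
import java.util.List;

// Hypothetical helper: accepts any filename ending in one of the given extensions.
final class ExtensionFilterSketch implements FilenameFilter {
    private final List<String> extensions;

    ExtensionFilterSketch(List<String> extensions) {
        this.extensions = extensions;
    }

    @Override
    public boolean accept(File dir, String name) {
        // Assumption: plain, case-sensitive suffix match on the filename.
        for (String extension : extensions) {
            if (name.endsWith(extension)) {
                return true;
            }
        }
        return false;
    }
}
```

An instance of such a filter could then be passed as the second argument of processDirectory(String, FilenameFilter) on a concrete loader.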

processFileList

public final void processFileList(String list)
                           throws IOException
Process a list of filenames. Lists can contain comments, which are ignored. Comments are any line that starts with a '#' character.

Parameters:
list - The name of the file containing the list
Throws:
IOException - if there is a problem reading the list or processing the files in the list
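
The list format described above (one filename per line, with lines starting with '#' treated as comments) can be illustrated with a small standalone parser. This is a sketch only, not Textractor's implementation: the class FileListSketch and its parse method are hypothetical, and skipping blank lines and trimming surrounding whitespace are assumptions the documentation does not confirm.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that extracts filenames from the contents of a list file.
final class FileListSketch {
    private FileListSketch() { }

    static List<String> parse(String listContents) {
        List<String> filenames = new ArrayList<String>();
        for (String line : listContents.split("\\r?\\n")) {
            // Documented rule: a line that starts with '#' is a comment.
            if (line.startsWith("#")) {
                continue;
            }
            // Assumption: blank lines are skipped and whitespace is trimmed.
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                filenames.add(trimmed);
            }
        }
        return filenames;
    }
}
```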

processFilename

public abstract void processFilename(String filename)
                              throws IOException
Process a single file designated by name.

Parameters:
filename - The name of the file to process.
Throws:
IOException - if there is a problem reading the file.

call

public final Boolean call()
                   throws Exception
Processes a file, directory or list of files. Intended to be run in its own thread.

Returns:
true if processing completed with no problems
Throws:
IOException - if there is a problem processing the file(s).
Exception
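
Because the class implements Callable<Boolean>, a loader can be submitted to a standard java.util.concurrent executor. The sketch below shows one way to do that with any Callable<Boolean>; the class LoaderRunnerSketch and its runAndWait method are hypothetical, and mapping a failure to false is an assumption, since Textractor's own chain may schedule loaders differently.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper: runs a Callable<Boolean> on a worker thread and waits for it.
final class LoaderRunnerSketch {
    private LoaderRunnerSketch() { }

    static boolean runAndWait(Callable<Boolean> loader) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<Boolean> result = executor.submit(loader);
            // get() blocks until call() returns or throws.
            return Boolean.TRUE.equals(result.get());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } catch (ExecutionException e) {
            // call() threw (for example an IOException); treated here as failure.
            return false;
        } finally {
            executor.shutdown();
        }
    }
}
```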

produce

public final Sentence produce(Article article,
                              CharSequence text)
Produce a new Sentence.

Parameters:
article - The article the sentence will be associated with.
text - The text to be used for the sentence.
Returns:
The sentence object based on the article and text sequence.

getNumberOfSentencesProcessed

public int getNumberOfSentencesProcessed()
Get the number of sentences processed so far.

Returns:
The number of sentences processed so far

getDirectory

public final String getDirectory()
Get the name of the directory to process.

Returns:
the name of the directory to process

setDirectory

public final void setDirectory(String name)
Set the name of the directory to process.

Parameters:
name - the name of the directory to process

getFile

public final String getFile()
Get the name of the file to process.

Returns:
the name of the file to process

setFile

public final void setFile(String name)
Set the name of the file to process.

Parameters:
name - the name of the file to process

getList

public final String getList()
Get the name of the file containing the list of files to process.

Returns:
the name of the file containing the list of files to process

setList

public final void setList(String name)
Set the name of the file containing the list of files to process.

Parameters:
name - the name of the file containing the list of files to process

isRecursive

public boolean isRecursive()
Indicates whether subdirectories should be processed as well as files.

Returns:
true if subdirectories should be processed

setRecursive

public void setRecursive(boolean value)
Indicates whether subdirectories should be processed as well as files.

Parameters:
value - true if subdirectories should be processed

getExtensions

public final String getExtensions()
Get the filename extensions used to filter which files to process.

Returns:
A comma-separated list of filename extensions

setExtensions

public final void setExtensions(String extensionString)
Set the filename extensions used to filter which files to process.

Parameters:
extensionString - A comma-separated list of filename extensions

getExtensionList

public final List<String> getExtensionList()
Get the filename extensions used to filter which files to process.

Returns:
A list of filename extensions
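
The String form accepted by setExtensions and the List<String> form returned by getExtensionList describe the same set of extensions. The sketch below shows one plausible conversion between them; it is not Textractor's code. The class ExtensionListSketch is hypothetical, and trimming whitespace and dropping empty entries are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits a comma-separated extension string into a list.
final class ExtensionListSketch {
    private ExtensionListSketch() { }

    static List<String> parse(String extensionString) {
        List<String> extensions = new ArrayList<String>();
        if (extensionString == null) {
            return extensions;
        }
        for (String part : extensionString.split(",")) {
            // Assumption: surrounding whitespace and empty entries are ignored.
            String extension = part.trim();
            if (!extension.isEmpty()) {
                extensions.add(extension);
            }
        }
        return extensions;
    }
}
```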

getProcessedFileLog

public final String getProcessedFileLog()
Get the name (if any) of the file to which the names of processed files are written.

Returns:
A filename or null.

setProcessedFileLog

public final void setProcessedFileLog(String name)
Set the name of the file to which the names of processed files are written.

Parameters:
name - The name of the file to write to.

beginIteration

protected void beginIteration(int iteration)
Called at the start of an iteration. Interested parties should override this.

Parameters:
iteration - The iteration that is about to begin

endIteration

protected void endIteration(int iteration)
Called at the end of an iteration. Interested parties should override this.

Parameters:
iteration - The iteration that just completed
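
The two iteration hooks follow the usual template-method pattern: the base class drives the loop and subclasses override the hooks. The sketch below uses a stand-in base class, since how AbstractFileLoader itself schedules iterations is not documented here. IteratingLoaderSketch, LoggingLoaderSketch, processOnce and run are all hypothetical names; only numberOfIterations, currentIteration (starting at 1), beginIteration and endIteration mirror members documented on this page.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for the real base class; the loop structure is an assumption.
abstract class IteratingLoaderSketch {
    protected int numberOfIterations = 1;
    protected int currentIteration;

    // Hooks with empty default bodies, meant to be overridden.
    protected void beginIteration(int iteration) { }
    protected void endIteration(int iteration) { }

    // Hypothetical placeholder for one pass over the input.
    protected abstract void processOnce();

    final void run() {
        for (currentIteration = 1; currentIteration <= numberOfIterations; currentIteration++) {
            beginIteration(currentIteration);
            processOnce();
            endIteration(currentIteration);
        }
    }
}

// Example subclass that records the order in which the hooks fire.
final class LoggingLoaderSketch extends IteratingLoaderSketch {
    final List<String> events = new ArrayList<String>();

    LoggingLoaderSketch(int iterations) {
        numberOfIterations = iterations;
    }

    @Override
    protected void beginIteration(int iteration) {
        events.add("begin " + iteration);
    }

    @Override
    protected void processOnce() {
        events.add("process");
    }

    @Override
    protected void endIteration(int iteration) {
        events.add("end " + iteration);
    }
}
```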

processingComplete

public void processingComplete(SentenceProcessingCompleteEvent event)
Called when sentence processing is complete.

Specified by:
processingComplete in interface SentenceProcessingCompleteListener
Overrides:
processingComplete in class AbstractSentenceProducer
Parameters:
event - A SentenceProcessingCompleteEvent object describing the event source.

isAppendSentencesInOneDocument

public boolean isAppendSentencesInOneDocument()
Indicates whether each article should produce a single sentence with all the article text.

setAppendSentencesInOneDocument

public void setAppendSentencesInOneDocument(boolean value)
Set whether each article should produce a single sentence with all the article text.

getParagraphBoundaryTag

public String getParagraphBoundaryTag()
Get the string that will be placed between paragraphs.

Returns:
the String that will be placed between paragraphs

setParagraphBoundary

public void setParagraphBoundary(String boundaryString)
Set the string that will be placed between paragraphs.

Parameters:
boundaryString - the String that will be placed between paragraphs

getSentenceBoundary

public String getSentenceBoundary()
Get the string that will be placed between sentences.

Returns:
the String that will be placed between sentences

setSentenceBoundary

public void setSentenceBoundary(String boundaryString)
Set the string that will be placed between sentences. If null or an empty string is specified, a single space is used. Otherwise the specified String is inserted, padded by exactly one space on each side, even if the padding isn't manually specified or the string is over-padded.

Parameters:
boundaryString - the String that should be placed between sentences
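
The padding rule described for the sentence boundary can be sketched as a small standalone function. This illustrates the documented behaviour only and is not Textractor's implementation: the class BoundaryPaddingSketch and its normalize method are hypothetical, and treating a whitespace-only string like an empty one is an assumption.

```java
// Hypothetical helper mirroring the documented boundary padding rule.
final class BoundaryPaddingSketch {
    private BoundaryPaddingSketch() { }

    static String normalize(String boundary) {
        if (boundary == null) {
            return " ";
        }
        // Strip any existing padding so over-padded input collapses to one space per side.
        String trimmed = boundary.trim();
        if (trimmed.isEmpty()) {
            // Assumption: whitespace-only input is handled like the empty string.
            return " ";
        }
        return " " + trimmed + " ";
    }
}
```

Under these assumptions, "|", " | " and "   |   " all normalize to the same three-character boundary.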

padWithSpaces

public static String padWithSpaces(String stringToPad)


Copyright © 2003-2008 Institute for Computational Biomedicine, All Rights Reserved.