Textractor API textractor-720 (20091120123250)

textractor.datamodel
Class SingleBagOfWordFeatureCreationParameters

java.lang.Object
  extended by textractor.datamodel.FeatureCreationParameters
      extended by textractor.datamodel.SingleBagOfWordFeatureCreationParameters

public final class SingleBagOfWordFeatureCreationParameters
extends FeatureCreationParameters

Holds parameters to calculate features with the bag-of-words approach. In this approach, a window is centered around a word of interest (the position of this word is given by TextFragmentAnnotation#getWordPosition). The window has the size given by windowSize in this class. A set of words/terms is used in turn to perform boolean tests on the text within the window around the word of interest. The results of these tests provide the features (each feature is 0 or 1).

As an example, consider the following text, with a window centered around the term A23P:

The mutant A23P exhibits a strong phenotype.

If the window size is 2 and the terms considered for calculating the bag of words are "the", "mutant", and "strong" (in this order), then the features calculated for the text within the window will be:

1 1 0

The first 1 means that the word "the" occurs in the window around the word of interest. The second 1 means that the term "mutant" occurs. The last feature is 0 because the word "strong" does not occur within the window of size 2 centered on the word of interest.

Note that the order of the terms is significant: the features must be calculated in exactly the same order when exporting the training set and when performing predictions on the test set. If the order is not maintained, the behavior of prediction engines (such as SVMs) is undefined.

User: Fabien Campagne Date: Jan 19, 2004 Time: 11:42:11 AM
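The worked example in the class description (a window of size 2 around A23P) can be reproduced with a few lines of standalone Java. This is a minimal sketch and not the Textractor implementation: the class BagOfWordsSketch and its features method are hypothetical names, the text is passed as a pre-split array of words, and matching is assumed to ignore case so that "The" matches the term "the".

```java
import java.util.Arrays;

public class BagOfWordsSketch {

    /**
     * Returns one 0/1 feature per term, in the order the terms are given.
     * A feature is 1 when the term occurs within windowSize words on either
     * side of the word at position center (the center word is included).
     */
    public static int[] features(String[] words, int center, int windowSize, String[] terms) {
        // Clamp the window to the bounds of the text.
        int min = Math.max(0, center - windowSize);
        int max = Math.min(words.length - 1, center + windowSize);
        int[] result = new int[terms.length];
        for (int t = 0; t < terms.length; t++) {
            for (int i = min; i <= max; i++) {
                // Assumption: matching ignores case, so "The" matches "the".
                if (words[i].equalsIgnoreCase(terms[t])) {
                    result[t] = 1;
                    break;
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] words = {"The", "mutant", "A23P", "exhibits", "a", "strong", "phenotype"};
        String[] terms = {"the", "mutant", "strong"};
        // A23P is at position 2 (counting from 0); the window size is 2.
        System.out.println(Arrays.toString(features(words, 2, 2, terms)));
        // prints [1, 1, 0]
    }
}
```

Because each feature's position in the result follows the position of its term in the terms array, passing the terms in a different order at prediction time would silently permute the feature vector, which is why the order must stay fixed between training and test.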


Field Summary
static int LOCATION_CENTERED_ON_WORD
          The window is centered on the reference word.
static int LOCATION_LEFT_OF_WORD
          The window is at the left of the reference word.
static int LOCATION_RIGHT_OF_WORD
          The window is at the right of the reference word.
 
Fields inherited from class textractor.datamodel.FeatureCreationParameters
windowSize
 
Constructor Summary
SingleBagOfWordFeatureCreationParameters()
          Constructs default parameters.
 
Method Summary
 void addWordsToTerms(int[] indexedTerms, int wordOfInterestPosition, int termLength, int[] excludedPositions)
          Finds the terms that occur in the window and adds them to the non-redundant termsInWindows.
 int calculateMaxWindowWidth(int wordOfInterestPosition, int termLength, int numTerms)
           
 int calculateMinWindowIndex(int wordOfInterestPosition)
           
 void clearTerms()
          Clear the terms stored in this parameter.
 int getFirstFeatureNumber()
           
 int getLastFeatureNumber()
           
 String[] getTerms()
          Returns the terms for which features should be calculated.
 IntArrayList getTermsInWindows()
           
 int getWindowLocation()
          Window location with respect to the reference word.
 void removeTerm(String term, int indexedTerm)
           
 void setFirstFeatureNumber(int firstFeatureNumber)
          Sets the number of the first feature that this exporter will generate.
 void setTerms(DocumentIndexManager docmanager)
          Convert indexed terms to their string representation.
 void setTermsInWindows(IntArrayList termsInWindows)
           
 void setWindowLocation(int windowLocation)
           
 void setWindowSize(int windowSize)
          Sets the size of the window.
 void updateIndex(DocumentIndexManager docmanager)
          Update the index of the termsInWindows in SingleBagOfWordFeatureCreationParameters with the current index.
 boolean windowContainsTerm(int[] annotationIndexedTerms, int term, int windowCenter, int termLength, int minIndex, int maxIndex)
           
 
Methods inherited from class textractor.datamodel.FeatureCreationParameters
getParameterNumber, getWindowSize, positionIsExcluded, setParameterNumber
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

LOCATION_RIGHT_OF_WORD

public static final int LOCATION_RIGHT_OF_WORD
The window is at the right of the reference word. For instance a b >REF-WORD c d< is a window of size 3 at the right of the reference word.

See Also:
Constant Field Values

LOCATION_LEFT_OF_WORD

public static final int LOCATION_LEFT_OF_WORD
The window is at the left of the reference word. For instance >a b REF-WORD< is a window of size 3 at the left of the reference word.

See Also:
Constant Field Values

LOCATION_CENTERED_ON_WORD

public static final int LOCATION_CENTERED_ON_WORD
The window is centered on the reference word. For instance >a b REF-WORD c d< e is a window of size 3 centered on the reference word.

See Also:
Constant Field Values
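The three window examples above can be read as index ranges over the text. The sketch below follows those examples literally, so a window of size 3 covers the reference word plus two words on the relevant side or sides. That reading is an assumption: the class description's own example appears to count windowSize words on each side instead. WindowLocationSketch, bounds, and the constant values RIGHT, LEFT, and CENTERED are hypothetical stand-ins, since the real constant values are not shown on this page.

```java
public class WindowLocationSketch {
    // Stand-in values; the real constant values are not shown on this page.
    public static final int RIGHT = 0;
    public static final int LEFT = 1;
    public static final int CENTERED = 2;

    /**
     * Returns {firstIndex, lastIndex} of the window (both inclusive),
     * clamped to the bounds of a text of textLength words.
     */
    public static int[] bounds(int location, int refPosition, int size, int textLength) {
        // Assumption: the reference word counts toward the window size.
        int reach = size - 1;
        int first = refPosition;
        int last = refPosition;
        if (location == LEFT || location == CENTERED) {
            first = refPosition - reach;
        }
        if (location == RIGHT || location == CENTERED) {
            last = refPosition + reach;
        }
        return new int[] {Math.max(0, first), Math.min(textLength - 1, last)};
    }
}
```

For the text a b REF-WORD c d e (reference word at index 2, six words), a size of 3 gives the range 2 to 4 for RIGHT, 0 to 2 for LEFT, and 0 to 4 for CENTERED, matching the three marked windows.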
Constructor Detail

SingleBagOfWordFeatureCreationParameters

public SingleBagOfWordFeatureCreationParameters()
Constructs default parameters. By default, the window is centered on the word of interest and has size 1.

Method Detail

getFirstFeatureNumber

public int getFirstFeatureNumber()

setFirstFeatureNumber

public void setFirstFeatureNumber(int firstFeatureNumber)
Sets the number of the first feature that this exporter will generate. When this number is zero, the exporter assumes that it should export the label before the first feature.

Parameters:
firstFeatureNumber - the number of the first feature that this exporter will generate
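One way to picture the role of the first feature number is an exporter that writes the label only when numbering starts at zero. The sketch below is an illustration under stated assumptions, not the Textractor exporter: FeatureNumberingSketch and exportFeatures are hypothetical names, features are assumed to be numbered consecutively from firstFeatureNumber, and the "number:value" output format is chosen only for readability.

```java
public class FeatureNumberingSketch {

    /**
     * Formats one block of features. The label is written first only when
     * firstFeatureNumber is zero, i.e. when this block starts the line.
     */
    public static String exportFeatures(int label, int firstFeatureNumber, int[] features) {
        StringBuilder line = new StringBuilder();
        if (firstFeatureNumber == 0) {
            line.append(label);
        }
        for (int i = 0; i < features.length; i++) {
            if (line.length() > 0) {
                line.append(' ');
            }
            // Assumption: features are numbered consecutively from firstFeatureNumber.
            line.append(firstFeatureNumber + i).append(':').append(features[i]);
        }
        return line.toString();
    }
}
```

Under these assumptions, exportFeatures(1, 0, new int[] {1, 1, 0}) yields "1 0:1 1:1 2:0", while a later block exported with firstFeatureNumber 3 yields only its numbered features, such as "3:0 4:1".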

getLastFeatureNumber

public int getLastFeatureNumber()

getWindowLocation

public int getWindowLocation()
Window location with respect to the reference word.

Returns:
One of LOCATION_RIGHT_OF_WORD, LOCATION_LEFT_OF_WORD, or LOCATION_CENTERED_ON_WORD.

setWindowLocation

public void setWindowLocation(int windowLocation)

getTerms

public String[] getTerms()
Returns the terms for which features should be calculated. The terms are arranged in the array in the order in which they should be used to calculate features.

Specified by:
getTerms in class FeatureCreationParameters
Returns:
An array of strings. Each string is a term/word.

updateIndex

public void updateIndex(DocumentIndexManager docmanager)
Description copied from class: FeatureCreationParameters
Update the index of the termsInWindows in SingleBagOfWordFeatureCreationParameters with the current index.

Specified by:
updateIndex in class FeatureCreationParameters

clearTerms

public void clearTerms()
Description copied from class: FeatureCreationParameters
Clear the terms stored in this parameter. This method is used within an exporter FirstPass method to guarantee that no previous terms are stored in the parameters.

Specified by:
clearTerms in class FeatureCreationParameters

removeTerm

public void removeTerm(String term,
                       int indexedTerm)
Specified by:
removeTerm in class FeatureCreationParameters

setTerms

public void setTerms(DocumentIndexManager docmanager)
Convert indexed terms to their string representation. The side effect is that getTerms() returns accurate terms.

Specified by:
setTerms in class FeatureCreationParameters
Parameters:
docmanager - the index manager

setTermsInWindows

public void setTermsInWindows(IntArrayList termsInWindows)

getTermsInWindows

public IntArrayList getTermsInWindows()

setWindowSize

public void setWindowSize(int windowSize)
Description copied from class: FeatureCreationParameters
Sets the size of the window.

Specified by:
setWindowSize in class FeatureCreationParameters
Parameters:
windowSize - Size of the window (in words) on each side of the reference word. The total window width will be double this value (minus 1 if the reference word is ignored).

addWordsToTerms

public void addWordsToTerms(int[] indexedTerms,
                            int wordOfInterestPosition,
                            int termLength,
                            int[] excludedPositions)
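Finds the terms that occur in the window and adds them to the non-redundant termsInWindows. For example, given the text a b c d e f g h i j, a word of interest at position 5 (wp=5, counting from 0, the word f) and a window size of 2 (ws=2), the window contains: d e f g h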

Parameters:
indexedTerms -
wordOfInterestPosition -
termLength -
excludedPositions -
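The example in the description (wp=5, ws=2) can be sketched as a standalone method that collects each indexed term in the window once. This is only an illustration of the idea, not the actual method body: TermsInWindowSketch and termsInWindow are hypothetical names, a java.util list stands in for IntArrayList, the window is assumed to be centered, and the termLength parameter is left out because its role is not documented here.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class TermsInWindowSketch {

    /** Collects the distinct indexed terms found in the window, in text order. */
    public static List<Integer> termsInWindow(int[] indexedTerms, int wordPosition,
                                              int windowSize, int[] excludedPositions) {
        // Assumption: a centered window of windowSize words on each side.
        int min = Math.max(0, wordPosition - windowSize);
        int max = Math.min(indexedTerms.length - 1, wordPosition + windowSize);
        // A LinkedHashSet keeps each term once and preserves first-seen order.
        Set<Integer> seen = new LinkedHashSet<>();
        for (int i = min; i <= max; i++) {
            if (!isExcluded(i, excludedPositions)) {
                seen.add(indexedTerms[i]);
            }
        }
        return new ArrayList<>(seen);
    }

    private static boolean isExcluded(int position, int[] excludedPositions) {
        for (int excluded : excludedPositions) {
            if (excluded == position) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // The words a..j are represented by the term indexes 0..9.
        int[] text = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
        System.out.println(termsInWindow(text, 5, 2, new int[0]));
        // prints [3, 4, 5, 6, 7], i.e. d e f g h
    }
}
```

A term that appears several times inside the window is recorded only once, which is one reading of "non-redundant" in the description.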

calculateMinWindowIndex

public int calculateMinWindowIndex(int wordOfInterestPosition)

calculateMaxWindowWidth

public int calculateMaxWindowWidth(int wordOfInterestPosition,
                                   int termLength,
                                   int numTerms)

windowContainsTerm

public boolean windowContainsTerm(int[] annotationIndexedTerms,
                                  int term,
                                  int windowCenter,
                                  int termLength,
                                  int minIndex,
                                  int maxIndex)


Copyright © 2003-2008 Institute for Computational Biomedicine, All Rights Reserved.