Textractor API textractor-716 (20091105163204)

textractor.mg4j.document
Class AbstractTextractorDocumentFactory

java.lang.Object
  extended by it.unimi.dsi.mg4j.document.AbstractDocumentFactory
      extended by it.unimi.dsi.mg4j.document.PropertyBasedDocumentFactory
          extended by textractor.mg4j.document.AbstractTextractorDocumentFactory
All Implemented Interfaces:
DocumentFactory, FlyweightPrototype<DocumentFactory>, Serializable
Direct Known Subclasses:
ConfigurableTextractorDocumentFactory, SfnDocumentFactory, TextractorDocumentFactory

public abstract class AbstractTextractorDocumentFactory
extends PropertyBasedDocumentFactory

A factory that can produce MG4J documents from Textractor sentences.

A factory that provides a single field containing just the raw input stream; the encoding is set using the property PropertyBasedDocumentFactory.MetadataKeys.ENCODING. The field is named text, but you can change the name using the property fieldname.

By default, the WordReader provided by this factory is just a FastBufferedReader, but you can specify an alternative word reader using the property PropertyBasedDocumentFactory.MetadataKeys.WORDREADER. For instance, if you need to index a list of identifiers to retrieve documents from the collection more easily, you can use a LineWordReader to index each line of a file as a whole.

A default encoding can be provided using the property PropertyBasedDocumentFactory.MetadataKeys.ENCODING.
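
As a rough illustration of the property-driven configuration described above, the following sketch builds a java.util.Properties object of the kind that could be passed to the Properties-based constructor of a concrete subclass. The key spellings "encoding" and "wordreader" are assumptions derived from the MetadataKeys names; only "fieldname" and the default field name "text" come from the description above. The class FactoryPropertiesSketch and its helper methods are hypothetical and are not part of the Textractor API.

```java
import java.util.Properties;

final class FactoryPropertiesSketch {
    /** Builds an example property set; the key spellings are assumptions. */
    static Properties buildProperties() {
        Properties p = new Properties();
        // Assumed key for PropertyBasedDocumentFactory.MetadataKeys.ENCODING.
        p.setProperty("encoding", "UTF-8");
        // Renames the single field from its default name, "text".
        p.setProperty("fieldname", "abstract");
        // Assumed key for MetadataKeys.WORDREADER. The value is a placeholder;
        // a real configuration would most likely need a fully qualified class name.
        p.setProperty("wordreader", "LineWordReader");
        return p;
    }

    /** Returns the configured field name, falling back to the documented default. */
    static String fieldName(Properties p) {
        return p.getProperty("fieldname", "text");
    }

    public static void main(String[] args) {
        System.out.println(fieldName(buildProperties()));
        System.out.println(fieldName(new Properties()));
    }
}
```

The fallback in fieldName mirrors the statement that the field is named text unless the fieldname property overrides it.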

See Also:
Serialized Form

Nested Class Summary
static class AbstractTextractorDocumentFactory.MetadataKeys
          Case-insensitive keys for metadata passed to DocumentFactory.getDocument(java.io.InputStream, it.unimi.dsi.fastutil.objects.Reference2ObjectMap).
 
Nested classes/interfaces inherited from interface it.unimi.dsi.mg4j.document.DocumentFactory
DocumentFactory.FieldType
 
Field Summary
static int DEFAULT_MAXIMUM_LENGTH_CONSERVE_CASE
          The maximum length, in characters, up to which a term is guaranteed not to be downcased.
static int DEFAULT_MINIMUM_DASH_SPLIT_LENGTH
          Default minimum index at which a word can be split.
static String DEFAULT_OTHER_CHARACTER_DELIMITERS
          Default character delimiters to match Twease behaviour.
static Class DEFAULT_WORD_READER_CLASS
          Default value for the word reader class supported by this factory.
protected  List<TextractorFieldInfo> fields
          Fields and configuration associated with this DocumentFactory.
 
Fields inherited from class it.unimi.dsi.mg4j.document.PropertyBasedDocumentFactory
defaultMetadata
 
Constructor Summary
AbstractTextractorDocumentFactory()
          Construct a new DocumentFactory.
AbstractTextractorDocumentFactory(Properties properties)
          Construct a new DocumentFactory.
AbstractTextractorDocumentFactory(Reference2ObjectMap<Enum<?>,Object> defaultMetadata)
          Construct a new DocumentFactory.
AbstractTextractorDocumentFactory(String[] property)
          Construct a new DocumentFactory.
 
Method Summary
 int fieldIndex(String fieldName)
          Returns the index of a field, given its symbolic name.
 String fieldName(int field)
          Returns the symbolic name of a field.
 DocumentFactory.FieldType fieldType(int field)
          Returns the type of a field.
 Document getDocumentNoContent(Reference2ObjectMap<Enum<?>,Object> metadata)
          Returns the document obtained by parsing the given byte stream.
 List<TextractorFieldInfo> getFieldInfoList()
          Returns the list of fields for this index.
 int numberOfFields()
          Returns the number of fields present in the documents produced by this factory.
protected  boolean parseProperty(String key, String[] values, Reference2ObjectMap<Enum<?>,Object> metadata)
          Declare the properties the various wordReader implementations need to store in MG4J Metadata.
 
Methods inherited from class it.unimi.dsi.mg4j.document.PropertyBasedDocumentFactory
ensureJustOne, getInstance, getInstance, getInstance, getInstance, parseProperties, parseProperties, resolve, resolve, resolveNotNull, sameKey
 
Methods inherited from class it.unimi.dsi.mg4j.document.AbstractDocumentFactory
ensureFieldIndex, toString
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 
Methods inherited from interface it.unimi.dsi.mg4j.document.DocumentFactory
copy, getDocument
 

Field Detail

DEFAULT_WORD_READER_CLASS

public static final Class DEFAULT_WORD_READER_CLASS
Default value for the word reader class supported by this factory.


DEFAULT_MAXIMUM_LENGTH_CONSERVE_CASE

public static final int DEFAULT_MAXIMUM_LENGTH_CONSERVE_CASE
The maximum length, in characters, up to which a term is guaranteed not to be downcased.

See Also:
Constant Field Values
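
One possible reading of this constant, sketched under stated assumptions: terms no longer than the threshold keep their original case (useful for short acronyms), while longer terms are downcased. The threshold value 4, the class ConserveCaseSketch and the method normalize are hypothetical; the actual value is listed under Constant Field Values, and the real word readers may apply further rules.

```java
import java.util.Locale;

final class ConserveCaseSketch {
    // Hypothetical threshold, used only for this illustration.
    static final int MAX_LENGTH_CONSERVE_CASE = 4;

    /** Keeps the case of short terms and downcases longer ones. */
    static String normalize(String term, int maxLengthConserveCase) {
        if (term.length() <= maxLengthConserveCase) {
            return term; // short term: case is preserved
        }
        return term.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(normalize("DNA", MAX_LENGTH_CONSERVE_CASE));
        System.out.println(normalize("Protein", MAX_LENGTH_CONSERVE_CASE));
    }
}
```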

DEFAULT_OTHER_CHARACTER_DELIMITERS

public static final String DEFAULT_OTHER_CHARACTER_DELIMITERS
Default character delimiters to match Twease behaviour.

See Also:
Constant Field Values

DEFAULT_MINIMUM_DASH_SPLIT_LENGTH

public static final int DEFAULT_MINIMUM_DASH_SPLIT_LENGTH
Default minimum index at which a word can be split.

See Also:
Constant Field Values

fields

protected final List<TextractorFieldInfo> fields
Fields and configuration associated with this DocumentFactory.

Constructor Detail

AbstractTextractorDocumentFactory

public AbstractTextractorDocumentFactory()
                                  throws ConfigurationException,
                                         ClassNotFoundException,
                                         IllegalAccessException,
                                         InstantiationException
Construct a new DocumentFactory.

Throws:
ConfigurationException - if there is a problem with the configuration of the factory.
ClassNotFoundException - if the specified WordReader cannot be found.
IllegalAccessException - if the factory is unable to create an instance of the specified WordReader.
InstantiationException - if the factory is unable to create an instance of the specified WordReader.

AbstractTextractorDocumentFactory

public AbstractTextractorDocumentFactory(Reference2ObjectMap<Enum<?>,Object> defaultMetadata)
                                  throws ConfigurationException,
                                         ClassNotFoundException,
                                         IllegalAccessException,
                                         InstantiationException
Construct a new DocumentFactory.

Parameters:
defaultMetadata - metadata used to configure this factory
Throws:
ConfigurationException - if there is a problem with the configuration of the factory.
ClassNotFoundException - if the specified WordReader cannot be found.
IllegalAccessException - if the factory is unable to create an instance of the specified WordReader.
InstantiationException - if the factory is unable to create an instance of the specified WordReader.

AbstractTextractorDocumentFactory

public AbstractTextractorDocumentFactory(Properties properties)
                                  throws ConfigurationException,
                                         ClassNotFoundException,
                                         IllegalAccessException,
                                         InstantiationException
Construct a new DocumentFactory.

Parameters:
properties - properties used to configure this factory
Throws:
ConfigurationException - if there is a problem with the configuration of the factory.
ClassNotFoundException - if the specified WordReader cannot be found.
IllegalAccessException - if the factory is unable to create an instance of the specified WordReader.
InstantiationException - if the factory is unable to create an instance of the specified WordReader.

AbstractTextractorDocumentFactory

public AbstractTextractorDocumentFactory(String[] property)
                                  throws ConfigurationException,
                                         ClassNotFoundException,
                                         IllegalAccessException,
                                         InstantiationException
Construct a new DocumentFactory.

Parameters:
property - properties used to configure this factory
Throws:
ConfigurationException - if there is a problem with the configuration of the factory.
ClassNotFoundException - if the specified WordReader cannot be found.
IllegalAccessException - if the factory is unable to create an instance of the specified WordReader.
InstantiationException - if the factory is unable to create an instance of the specified WordReader.
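
The three reflection-related exceptions declared by every constructor are the ones that typically arise when a class is loaded and instantiated by name. The sketch below shows that general pattern with standard Java reflection; it is not the factory's actual code, and ReflectiveReaderSketch, instantiate and canInstantiate are hypothetical names. java.lang.StringBuilder stands in for a WordReader implementation.

```java
import java.lang.reflect.InvocationTargetException;

final class ReflectiveReaderSketch {
    /** Loads a class by name and creates an instance with its no-argument constructor. */
    static Object instantiate(String className)
            throws ClassNotFoundException, IllegalAccessException, InstantiationException {
        // Throws ClassNotFoundException if the name cannot be resolved.
        Class<?> clazz = Class.forName(className);
        try {
            // Throws InstantiationException for abstract classes and
            // IllegalAccessException for inaccessible constructors.
            return clazz.getDeclaredConstructor().newInstance();
        } catch (NoSuchMethodException | InvocationTargetException e) {
            // Folded into InstantiationException to match the declared exceptions.
            InstantiationException wrapped = new InstantiationException(className);
            wrapped.initCause(e);
            throw wrapped;
        }
    }

    /** Convenience wrapper that reports success or failure without throwing. */
    static boolean canInstantiate(String className) {
        try {
            return instantiate(className) != null;
        } catch (ClassNotFoundException | IllegalAccessException | InstantiationException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(canInstantiate("java.lang.StringBuilder"));
        System.out.println(canInstantiate("no.such.WordReader"));
    }
}
```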
Method Detail

parseProperty

protected boolean parseProperty(String key,
                                String[] values,
                                Reference2ObjectMap<Enum<?>,Object> metadata)
                         throws ConfigurationException
Declare the properties the various wordReader implementations need to store in MG4J Metadata. This should be delegated to the word reader, but it is unclear how at this time.

Overrides:
parseProperty in class PropertyBasedDocumentFactory
Parameters:
key - the property key
values - the property values
metadata - the properties metadata
Throws:
ConfigurationException - if an error occurs while configuring the factory.
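
A minimal sketch of how a parseProperty-style method might match case-insensitive keys and record their values, assuming that the boolean result signals whether the key was recognised. It uses a plain java.util.Map in place of Reference2ObjectMap and IllegalArgumentException in place of ConfigurationException so that it runs without MG4J on the classpath. The enum Keys and the class ParsePropertySketch are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

final class ParsePropertySketch {
    /** Hypothetical metadata keys; the real ones live in the MetadataKeys classes. */
    enum Keys { ENCODING, WORDREADER }

    /** Stores the value under the matching key; returns false for unrecognised keys. */
    static boolean parseProperty(String key, String[] values, Map<Enum<?>, Object> metadata) {
        for (Keys k : Keys.values()) {
            // Keys are matched case-insensitively against the enum constant name.
            if (k.name().equalsIgnoreCase(key)) {
                if (values.length != 1) {
                    throw new IllegalArgumentException(
                            "Property " + key + " requires exactly one value");
                }
                metadata.put(k, values[0]);
                return true;
            }
        }
        return false; // not handled here; a caller could try another handler
    }

    public static void main(String[] args) {
        Map<Enum<?>, Object> metadata = new HashMap<>();
        System.out.println(parseProperty("Encoding", new String[] {"UTF-8"}, metadata));
        System.out.println(metadata.get(Keys.ENCODING));
    }
}
```

The single-value check plays the role that the inherited ensureJustOne method presumably has in the real class hierarchy.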

numberOfFields

public final int numberOfFields()
Returns the number of fields present in the documents produced by this factory.

Returns:
the number of fields present in the documents produced by this factory.

getFieldInfoList

public final List<TextractorFieldInfo> getFieldInfoList()
Returns the list of fields for this index.

Returns:
The List<TextractorFieldInfo> for this index.

fieldName

public final String fieldName(int field)
Returns the symbolic name of a field.

Parameters:
field - the index of a field (between 0 inclusive and numberOfFields() exclusive).
Returns:
the symbolic name of the field-th field.

fieldIndex

public final int fieldIndex(String fieldName)
Returns the index of a field, given its symbolic name.

Parameters:
fieldName - the name of a field of this factory.
Returns:
the corresponding index, or -1 if there is no field with name fieldName.
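
The contract shared by numberOfFields(), fieldName(int) and fieldIndex(String) can be sketched with an ordered list of names. This is an illustration of the documented behaviour only, including the -1 result for unknown names; FieldLookupSketch is a hypothetical class that stores plain strings rather than TextractorFieldInfo objects.

```java
import java.util.Arrays;
import java.util.List;

final class FieldLookupSketch {
    private final List<String> names;

    FieldLookupSketch(String... names) {
        this.names = Arrays.asList(names);
    }

    /** Number of fields; valid indices run from 0 inclusive to this value exclusive. */
    int numberOfFields() {
        return names.size();
    }

    /** Symbolic name of the field at the given index. */
    String fieldName(int field) {
        if (field < 0 || field >= numberOfFields()) {
            throw new IndexOutOfBoundsException("No such field: " + field);
        }
        return names.get(field);
    }

    /** Index of the named field, or -1 if there is no field with that name. */
    int fieldIndex(String fieldName) {
        return names.indexOf(fieldName);
    }

    public static void main(String[] args) {
        FieldLookupSketch fields = new FieldLookupSketch("text", "title");
        System.out.println(fields.fieldIndex("title"));
        System.out.println(fields.fieldIndex("missing"));
    }
}
```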

fieldType

public final DocumentFactory.FieldType fieldType(int field)
Returns the type of a field.

The possible types are defined in DocumentFactory.FieldType.

Parameters:
field - the index of a field (between 0 inclusive and numberOfFields() exclusive).
Returns:
the type of the field-th field.

getDocumentNoContent

public final Document getDocumentNoContent(Reference2ObjectMap<Enum<?>,Object> metadata)
Returns the document obtained by parsing the given byte stream.

Parameters:
metadata - a map from enums (e.g., keys taken in PropertyBasedDocumentFactory) to various kinds of objects.
Returns:
the document obtained by parsing the given character sequence.

Copyright © 2003-2008 Institute for Computational Biomedicine, All Rights Reserved.