Textractor API textractor-716 (20091105163204)

textractor.mg4j.document
Class ConfigurableTextractorDocumentFactory

java.lang.Object
  extended by it.unimi.dsi.mg4j.document.AbstractDocumentFactory
      extended by it.unimi.dsi.mg4j.document.PropertyBasedDocumentFactory
          extended by textractor.mg4j.document.AbstractTextractorDocumentFactory
              extended by textractor.mg4j.document.ConfigurableTextractorDocumentFactory
All Implemented Interfaces:
DocumentFactory, FlyweightPrototype<DocumentFactory>, Serializable

public final class ConfigurableTextractorDocumentFactory
extends AbstractTextractorDocumentFactory

A factory that can produce MG4J documents from Textractor sentences.

A factory that provides multiple fields containing just the raw input stream; the encoding is set using the property PropertyBasedDocumentFactory.MetadataKeys.ENCODING. The field is named text, but you can change the name using the property fieldname.

By default, the WordReader provided by this factory is just a FastBufferedReader, but you can specify an alternative word reader using the property PropertyBasedDocumentFactory.MetadataKeys.WORDREADER. For instance, if you need to index a list of identifiers to retrieve documents from the collection more easily, you can use a LineWordReader to index each line of a file as a whole.

A default encoding can be provided using the property PropertyBasedDocumentFactory.MetadataKeys.ENCODING.
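
The configuration described above can be sketched with plain java.util.Properties. This is a minimal illustration, not the library's own example: the lowercase key names "encoding" and "wordreader" assume that keys are matched case-insensitively against the MetadataKeys enum names, the fully qualified name of LineWordReader is an assumption that depends on the MG4J/dsiutils version in use, and the field name "identifier" and the class FactoryConfigSketch are hypothetical.

```java
import java.util.Properties;

final class FactoryConfigSketch {
    /** Collects the three settings mentioned in the class description. */
    static Properties buildProperties(String encoding, String wordReaderClass, String fieldName) {
        Properties properties = new Properties();
        // Assumed key names: the MetadataKeys enum names, written in lowercase.
        properties.setProperty("encoding", encoding);
        properties.setProperty("wordreader", wordReaderClass);
        // "fieldname" is the property the description names for renaming the field.
        properties.setProperty("fieldname", fieldName);
        return properties;
    }

    public static void main(String[] args) {
        // The package of LineWordReader is an assumption; check the version on your classpath.
        Properties properties =
                buildProperties("UTF-8", "it.unimi.dsi.io.LineWordReader", "identifier");
        properties.list(System.out);
        // With Textractor and MG4J on the classpath, the properties would then be passed to
        // the (String, Properties) constructor, for example:
        // new ConfigurableTextractorDocumentFactory("my-basename", properties);
    }
}
```

The constructor call is left as a comment so that the sketch compiles without the Textractor and MG4J jars.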

See Also:
Serialized Form

Nested Class Summary
static class ConfigurableTextractorDocumentFactory.MetadataKeys
          Case-insensitive keys for metadata passed to DocumentFactory.getDocument(java.io.InputStream, it.unimi.dsi.fastutil.objects.Reference2ObjectMap).
 
Nested classes/interfaces inherited from interface it.unimi.dsi.mg4j.document.DocumentFactory
DocumentFactory.FieldType
 
Field Summary
static DocumentFactory.FieldType DEFAULT_FIELD_TYPE
          Default value for field type supported by this factory.
 
Fields inherited from class textractor.mg4j.document.AbstractTextractorDocumentFactory
DEFAULT_MAXIMUM_LENGTH_CONSERVE_CASE, DEFAULT_MINIMUM_DASH_SPLIT_LENGTH, DEFAULT_OTHER_CHARACTER_DELIMITERS, DEFAULT_WORD_READER_CLASS, fields
 
Fields inherited from class it.unimi.dsi.mg4j.document.PropertyBasedDocumentFactory
defaultMetadata
 
Constructor Summary
ConfigurableTextractorDocumentFactory(String basenameVal, Properties properties)
          Construct a new DocumentFactory.
ConfigurableTextractorDocumentFactory(String basenameVal, Reference2ObjectMap<Enum<?>,Object> defaultMetadata)
          Construct a new DocumentFactory.
 
Method Summary
 ConfigurableTextractorDocumentFactory copy()
          Creates a copy of this factory.
 Document getDocument(InputStream rawContent, Reference2ObjectMap<Enum<?>,Object> metadata)
          Returns the document obtained by parsing the given byte stream.
protected  boolean parseProperty(String key, String[] values, Reference2ObjectMap<Enum<?>,Object> metadata)
          Parse any properties specific to this class.
 
Methods inherited from class textractor.mg4j.document.AbstractTextractorDocumentFactory
fieldIndex, fieldName, fieldType, getDocumentNoContent, getFieldInfoList, numberOfFields
 
Methods inherited from class it.unimi.dsi.mg4j.document.PropertyBasedDocumentFactory
ensureJustOne, getInstance, getInstance, getInstance, getInstance, parseProperties, parseProperties, resolve, resolve, resolveNotNull, sameKey
 
Methods inherited from class it.unimi.dsi.mg4j.document.AbstractDocumentFactory
ensureFieldIndex, toString
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

DEFAULT_FIELD_TYPE

public static final DocumentFactory.FieldType DEFAULT_FIELD_TYPE
Default value for field type supported by this factory.

Constructor Detail

ConfigurableTextractorDocumentFactory

public ConfigurableTextractorDocumentFactory(String basenameVal,
                                             Reference2ObjectMap<Enum<?>,Object> defaultMetadata)
                                      throws ConfigurationException,
                                             ClassNotFoundException,
                                             IllegalAccessException,
                                             InstantiationException
Construct a new DocumentFactory.

Parameters:
basenameVal - the basename
defaultMetadata - metadata used to configure this factory
Throws:
ConfigurationException - if there is a problem with the configuration of the factory.
ClassNotFoundException - if the specified WordReader cannot be found.
IllegalAccessException - if the factory is unable to create an instance of the specified WordReader.
InstantiationException - if the factory is unable to create an instance of the specified WordReader.

ConfigurableTextractorDocumentFactory

public ConfigurableTextractorDocumentFactory(String basenameVal,
                                             Properties properties)
                                      throws ConfigurationException,
                                             ClassNotFoundException,
                                             IllegalAccessException,
                                             InstantiationException
Construct a new DocumentFactory.

Parameters:
basenameVal - the basename
properties - properties used to configure this factory
Throws:
ConfigurationException - if there is a problem with the configuration of the factory.
ClassNotFoundException - if the specified WordReader cannot be found.
IllegalAccessException - if the factory is unable to create an instance of the specified WordReader.
InstantiationException - if the factory is unable to create an instance of the specified WordReader.

Method Detail

parseProperty

protected boolean parseProperty(String key,
                                String[] values,
                                Reference2ObjectMap<Enum<?>,Object> metadata)
                         throws ConfigurationException
Parse any properties specific to this class. Declares the properties that the various WordReader implementations need to store in the MG4J metadata. This should be delegated to the word reader, but it is unclear how at this time.

Overrides:
parseProperty in class AbstractTextractorDocumentFactory
Parameters:
key - the property to parse
values - the values for that property
metadata - the metadata to parse the property into, if possible
Returns:
true if this method was able to parse the property
Throws:
ConfigurationException - error parsing property
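
The contract above (match a key, store its value in the metadata map, and report whether the key was handled) can be sketched as follows. This is a self-contained stand-in, not the actual implementation: the enum Keys and its constants are hypothetical, a java.util.Map replaces Reference2ObjectMap, and IllegalArgumentException replaces ConfigurationException so that the sketch needs no external libraries.

```java
import java.util.HashMap;
import java.util.Map;

final class ParsePropertySketch {
    /** Hypothetical keys standing in for the metadata keys a word reader might need. */
    enum Keys { MINIMUM_DASH_SPLIT_LENGTH, OTHER_CHARACTER_DELIMITERS }

    /**
     * Returns true if the key was recognized and its value stored in the metadata map,
     * false otherwise, so that a caller (such as a superclass) can try other keys.
     */
    static boolean parseProperty(String key, String[] values, Map<Enum<?>, Object> metadata) {
        for (Keys candidate : Keys.values()) {
            // Keys are compared case-insensitively against the enum constant name.
            if (candidate.name().equalsIgnoreCase(key)) {
                if (values.length != 1) {
                    // Stand-in for the ConfigurationException thrown by the real method.
                    throw new IllegalArgumentException("Expected exactly one value for " + key);
                }
                metadata.put(candidate, values[0]);
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<Enum<?>, Object> metadata = new HashMap<>();
        System.out.println(parseProperty("minimum_dash_split_length", new String[] { "8" }, metadata));
        System.out.println(parseProperty("unrelated", new String[] { "x" }, metadata));
    }
}
```

Returning false for an unknown key, rather than throwing, is what lets each class in the hierarchy handle only the properties it knows about.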

copy

public ConfigurableTextractorDocumentFactory copy()
Creates a copy of this factory.

Returns:
a copy of this factory.

getDocument

public Document getDocument(InputStream rawContent,
                            Reference2ObjectMap<Enum<?>,Object> metadata)
Returns the document obtained by parsing the given byte stream.

Parameters:
rawContent - the raw content from which the document should be extracted; it must not be closed, as resource management is a responsibility of the DocumentCollection.
metadata - a map from enums (e.g., keys taken in PropertyBasedDocumentFactory) to various kinds of objects.
Returns:
the document obtained by parsing the given byte stream.
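
The requirement that rawContent must not be closed can be illustrated with a short sketch that decodes a stream using a given encoding and leaves the stream open for its owner. The class RawContentSketch and the method readAll are hypothetical helpers, not part of Textractor or MG4J, and the sketch only shows the resource-handling convention, not how the factory builds a Document.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class RawContentSketch {
    /** Decodes the whole stream with the given encoding without closing the stream. */
    static String readAll(InputStream rawContent, String encoding) {
        try {
            Reader reader = new InputStreamReader(rawContent, encoding);
            StringBuilder text = new StringBuilder();
            char[] buffer = new char[4096];
            int read;
            while ((read = reader.read(buffer)) != -1) {
                text.append(buffer, 0, read);
            }
            // Deliberately no close(): closing the reader would also close rawContent,
            // which belongs to the caller (in MG4J, the DocumentCollection).
            return text.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        InputStream in = new ByteArrayInputStream(
                "a sentence to index".getBytes(StandardCharsets.UTF_8));
        System.out.println(readAll(in, "UTF-8"));
    }
}
```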


Copyright © 2003-2008 Institute for Computational Biomedicine, All Rights Reserved.