Textractor API textractor-716 (20091105163204)

textractor.mg4j.document
Class TextractorDocumentFactory

java.lang.Object
  extended by it.unimi.dsi.mg4j.document.AbstractDocumentFactory
      extended by it.unimi.dsi.mg4j.document.PropertyBasedDocumentFactory
          extended by textractor.mg4j.document.AbstractTextractorDocumentFactory
              extended by textractor.mg4j.document.TextractorDocumentFactory
All Implemented Interfaces:
DocumentFactory, FlyweightPrototype<DocumentFactory>, Serializable

public final class TextractorDocumentFactory
extends AbstractTextractorDocumentFactory

A factory that can produce MG4J documents from Textractor sentences. It supports only the "text" field; if you need more fields, use ConfigurableTextractorDocumentFactory.

The factory provides a single field containing just the raw input stream. The field is named text by default, and you can change the name using the property fieldname. The encoding is set using the property PropertyBasedDocumentFactory.MetadataKeys.ENCODING.

By default, the WordReader provided by this factory is just a FastBufferedReader, but you can specify an alternative word reader using the property PropertyBasedDocumentFactory.MetadataKeys.WORDREADER. For instance, if you need to index a list of identifiers to retrieve documents from the collection more easily, you can use a LineWordReader to index each line of a file as a whole.
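
As a rough illustration of the properties described above, the following sketch assembles a java.util.Properties object of the kind the TextractorDocumentFactory(Properties) constructor accepts. The lowercase key spellings (encoding, fieldname, wordreader), the package name given for LineWordReader, and the helper class FactoryConfigSketch are assumptions made for illustration and are not taken from this page. The constructor call is left in a comment because it needs Textractor and MG4J on the classpath.

```java
import java.util.Properties;

/** Hypothetical helper, not part of Textractor: collects factory settings. */
final class FactoryConfigSketch {

    /**
     * Builds the settings for a single-field factory.
     * The key names are assumed spellings of the properties named in the
     * class description; check them against the MG4J version in use.
     */
    static Properties buildProperties(String encoding, String fieldName, String wordReaderClass) {
        Properties properties = new Properties();
        properties.setProperty("encoding", encoding);          // MetadataKeys.ENCODING
        properties.setProperty("fieldname", fieldName);        // name of the single field
        properties.setProperty("wordreader", wordReaderClass); // MetadataKeys.WORDREADER
        return properties;
    }

    public static void main(String[] args) {
        // Assumed fully qualified class name of LineWordReader; adjust as needed.
        Properties properties =
                buildProperties("UTF-8", "text", "it.unimi.dsi.io.LineWordReader");
        properties.list(System.out);

        // With Textractor and MG4J available, the factory would then be created as:
        //   DocumentFactory factory = new TextractorDocumentFactory(properties);
    }
}
```

Leaving out the wordreader entry should fall back to the default FastBufferedReader mentioned above.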

See Also:
Serialized Form

Nested Class Summary
 
Nested classes/interfaces inherited from class textractor.mg4j.document.AbstractTextractorDocumentFactory
AbstractTextractorDocumentFactory.MetadataKeys
 
Nested classes/interfaces inherited from interface it.unimi.dsi.mg4j.document.DocumentFactory
DocumentFactory.FieldType
 
Field Summary
 
Fields inherited from class textractor.mg4j.document.AbstractTextractorDocumentFactory
DEFAULT_MAXIMUM_LENGTH_CONSERVE_CASE, DEFAULT_MINIMUM_DASH_SPLIT_LENGTH, DEFAULT_OTHER_CHARACTER_DELIMITERS, DEFAULT_WORD_READER_CLASS, fields
 
Fields inherited from class it.unimi.dsi.mg4j.document.PropertyBasedDocumentFactory
defaultMetadata
 
Constructor Summary
TextractorDocumentFactory()
          Construct a new DocumentFactory.
TextractorDocumentFactory(Properties properties)
          Construct a new DocumentFactory.
TextractorDocumentFactory(Reference2ObjectMap<Enum<?>,Object> defaultMetadata)
          Construct a new DocumentFactory.
TextractorDocumentFactory(String[] property)
          Construct a new DocumentFactory.
 
Method Summary
 TextractorDocumentFactory copy()
          Creates a copy of this factory.
 Document getDocument(InputStream rawContent, Reference2ObjectMap<Enum<?>,Object> metadata)
          Returns the document obtained by parsing the given byte stream.
static PropertyBasedDocumentFactory getInstance(Class<DocumentFactory> klass, String basename, Properties properties)
           
 
Methods inherited from class textractor.mg4j.document.AbstractTextractorDocumentFactory
fieldIndex, fieldName, fieldType, getDocumentNoContent, getFieldInfoList, numberOfFields, parseProperty
 
Methods inherited from class it.unimi.dsi.mg4j.document.PropertyBasedDocumentFactory
ensureJustOne, getInstance, getInstance, getInstance, getInstance, parseProperties, parseProperties, resolve, resolve, resolveNotNull, sameKey
 
Methods inherited from class it.unimi.dsi.mg4j.document.AbstractDocumentFactory
ensureFieldIndex, toString
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

TextractorDocumentFactory

public TextractorDocumentFactory()
                          throws ConfigurationException,
                                 ClassNotFoundException,
                                 IllegalAccessException,
                                 InstantiationException
Construct a new DocumentFactory.

Throws:
ConfigurationException - if there is a problem with the configuration of the factory.
ClassNotFoundException - if the specified WordReader cannot be found.
IllegalAccessException - if the factory is unable to create an instance of the specified WordReader.
InstantiationException - if the factory is unable to create an instance of the specified WordReader.

TextractorDocumentFactory

public TextractorDocumentFactory(Reference2ObjectMap<Enum<?>,Object> defaultMetadata)
                          throws ConfigurationException,
                                 ClassNotFoundException,
                                 IllegalAccessException,
                                 InstantiationException
Construct a new DocumentFactory.

Parameters:
defaultMetadata - metadata used to configure this factory
Throws:
ConfigurationException - if there is a problem with the configuration of the factory.
ClassNotFoundException - if the specified WordReader cannot be found.
IllegalAccessException - if the factory is unable to create an instance of the specified WordReader.
InstantiationException - if the factory is unable to create an instance of the specified WordReader.

TextractorDocumentFactory

public TextractorDocumentFactory(Properties properties)
                          throws ConfigurationException,
                                 ClassNotFoundException,
                                 IllegalAccessException,
                                 InstantiationException
Construct a new DocumentFactory.

Parameters:
properties - properties used to configure this factory
Throws:
ConfigurationException - if there is a problem with the configuration of the factory.
ClassNotFoundException - if the specified WordReader cannot be found.
IllegalAccessException - if the factory is unable to create an instance of the specified WordReader.
InstantiationException - if the factory is unable to create an instance of the specified WordReader.

TextractorDocumentFactory

public TextractorDocumentFactory(String[] property)
                          throws ConfigurationException,
                                 ClassNotFoundException,
                                 IllegalAccessException,
                                 InstantiationException
Construct a new DocumentFactory.

Parameters:
property - properties used to configure this factory
Throws:
ConfigurationException - if there is a problem with the configuration of the factory.
ClassNotFoundException - if the specified WordReader cannot be found.
IllegalAccessException - if the factory is unable to create an instance of the specified WordReader.
InstantiationException - if the factory is unable to create an instance of the specified WordReader.
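
The String[] constructor is not described further on this page. In MG4J's PropertyBasedDocumentFactory such arrays are, to the best of our knowledge, lists of key=value strings; the sketch below assumes that convention carries over here. The helper class PropertyArraySketch and the key names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import java.util.TreeSet;

/** Hypothetical helper, not part of Textractor. */
final class PropertyArraySketch {

    /** Converts properties to "key=value" strings, sorted by key for stable output. */
    static String[] toKeyValueArray(Properties properties) {
        List<String> pairs = new ArrayList<>();
        for (String key : new TreeSet<>(properties.stringPropertyNames())) {
            pairs.add(key + "=" + properties.getProperty(key));
        }
        return pairs.toArray(new String[0]);
    }

    public static void main(String[] args) {
        Properties properties = new Properties();
        properties.setProperty("encoding", "UTF-8");  // assumed key spelling
        properties.setProperty("fieldname", "text");  // assumed key spelling
        for (String pair : toKeyValueArray(properties)) {
            System.out.println(pair);
        }
        // Assuming the key=value convention holds, the array could be passed as:
        //   new TextractorDocumentFactory(toKeyValueArray(properties));
    }
}
```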

Method Detail

copy

public TextractorDocumentFactory copy()
Creates a copy of this factory.

Returns:
a copy of this factory.

getDocument

public Document getDocument(InputStream rawContent,
                            Reference2ObjectMap<Enum<?>,Object> metadata)
Returns the document obtained by parsing the given byte stream.

Parameters:
rawContent - the raw content from which the document should be extracted; it must not be closed, as resource management is a responsibility of the DocumentCollection.
metadata - a map from enums (e.g., keys taken from PropertyBasedDocumentFactory) to various kinds of objects.
Returns:
the document obtained by parsing the given byte stream.
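
The ownership rule for rawContent can be sketched with the standard library alone: the code that opens the stream is the one that closes it, while the parsing step only reads. The method readWithoutClosing below is a hypothetical stand-in for a factory's parsing step, not Textractor's implementation, and the UTF-8 encoding is an assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

/** Hypothetical illustration of who closes the raw content stream. */
final class StreamOwnershipSketch {

    /** Reads all characters from the stream but deliberately never closes it. */
    static String readWithoutClosing(InputStream rawContent) throws IOException {
        Reader reader = new InputStreamReader(rawContent, StandardCharsets.UTF_8);
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[1024];
        int read;
        while ((read = reader.read(buffer)) != -1) {
            text.append(buffer, 0, read);
        }
        return text.toString(); // the reader and the stream are left open
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = "A sentence to index.".getBytes(StandardCharsets.UTF_8);
        // The caller (in MG4J, the DocumentCollection) owns and closes the stream.
        try (InputStream in = new ByteArrayInputStream(bytes)) {
            System.out.println(readWithoutClosing(in));
        }
    }
}
```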

getInstance

public static PropertyBasedDocumentFactory getInstance(Class<DocumentFactory> klass,
                                                       String basename,
                                                       Properties properties)
                                                throws InstantiationException,
                                                       IllegalAccessException,
                                                       InvocationTargetException,
                                                       NoSuchMethodException
Throws:
InstantiationException
IllegalAccessException
InvocationTargetException
NoSuchMethodException

Copyright © 2003-2008 Institute for Computational Biomedicine, All Rights Reserved.