textractor.mg4j.document
Class TextractorDocumentFactory
java.lang.Object
it.unimi.dsi.mg4j.document.AbstractDocumentFactory
it.unimi.dsi.mg4j.document.PropertyBasedDocumentFactory
textractor.mg4j.document.AbstractTextractorDocumentFactory
textractor.mg4j.document.TextractorDocumentFactory
- All Implemented Interfaces:
- DocumentFactory, FlyweightPrototype<DocumentFactory>, Serializable
public final class TextractorDocumentFactory
- extends AbstractTextractorDocumentFactory
A factory that produces MG4J documents from Textractor sentences.
This factory supports ONLY the "text" field. If you need additional fields, use
ConfigurableTextractorDocumentFactory instead.
A factory that provides a single field containing just the raw input stream;
the encoding is set using the property
PropertyBasedDocumentFactory.MetadataKeys.ENCODING.
The field is named text, but you can change the name using the
property fieldname.
By default, the WordReader provided by this
factory is just a FastBufferedReader, but you
can specify an alternative word reader using the property
PropertyBasedDocumentFactory.MetadataKeys.WORDREADER.
For instance, if you need to index a list of identifiers to retrieve
documents from the collection more easily, you can use a
LineWordReader to index each line of a file as a
whole.
A default encoding can be provided using the property
PropertyBasedDocumentFactory.MetadataKeys.ENCODING.
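The difference between a token-based word reader and a line-based one can be illustrated without MG4J. The following is a minimal sketch in plain Java, not the MG4J implementation: the class WordReaderSketch and both of its methods are hypothetical names, and the token rule used here (maximal runs of letters and digits) is only an assumption about how a default reader typically splits text.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not part of MG4J or Textractor. */
final class WordReaderSketch {
    private WordReaderSketch() { }

    /** Splits text into maximal runs of letters and digits (assumed default-like behavior). */
    static List<String> wordsByToken(String text) {
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                // A non-word character ends the current word.
                words.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }

    /** Treats every non-empty line as a single word, as a line-based reader would. */
    static List<String> wordsByLine(String text) {
        List<String> words = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.isEmpty()) {
                    words.add(line);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return words;
    }
}
```

For the input "PMID-123\nPMID-456", wordsByToken yields PMID, 123, PMID, 456, whereas wordsByLine keeps the two identifiers intact, which is what the identifier-list use case relies on.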
- See Also:
- Serialized Form
Methods inherited from class it.unimi.dsi.mg4j.document.PropertyBasedDocumentFactory
ensureJustOne, getInstance, getInstance, getInstance, getInstance, parseProperties, parseProperties, resolve, resolve, resolveNotNull, sameKey
TextractorDocumentFactory
public TextractorDocumentFactory()
throws ConfigurationException,
ClassNotFoundException,
IllegalAccessException,
InstantiationException
- Construct a new
DocumentFactory.
- Throws:
ConfigurationException - if there is a problem with the
configuration of the factory.
ClassNotFoundException - if the specified WordReader
cannot be found.
IllegalAccessException - if the factory is unable to create an
instance of the specified WordReader.
InstantiationException - if the factory is unable to create an
instance of the specified WordReader.
TextractorDocumentFactory
public TextractorDocumentFactory(Reference2ObjectMap<Enum<?>,Object> defaultMetadata)
throws ConfigurationException,
ClassNotFoundException,
IllegalAccessException,
InstantiationException
- Construct a new
DocumentFactory.
- Parameters:
defaultMetadata - metadata used to configure this factory
- Throws:
ConfigurationException - if there is a problem with the
configuration of the factory.
ClassNotFoundException - if the specified WordReader
cannot be found.
IllegalAccessException - if the factory is unable to create an
instance of the specified WordReader.
InstantiationException - if the factory is unable to create an
instance of the specified WordReader.
TextractorDocumentFactory
public TextractorDocumentFactory(Properties properties)
throws ConfigurationException,
ClassNotFoundException,
IllegalAccessException,
InstantiationException
- Construct a new
DocumentFactory.
- Parameters:
properties - properties used to configure this factory
- Throws:
ConfigurationException - if there is a problem with the
configuration of the factory.
ClassNotFoundException - if the specified WordReader
cannot be found.
IllegalAccessException - if the factory is unable to create an
instance of the specified WordReader.
InstantiationException - if the factory is unable to create an
instance of the specified WordReader.
TextractorDocumentFactory
public TextractorDocumentFactory(String[] property)
throws ConfigurationException,
ClassNotFoundException,
IllegalAccessException,
InstantiationException
- Construct a new
DocumentFactory.
- Parameters:
property - properties used to configure this factory
- Throws:
ConfigurationException - if there is a problem with the
configuration of the factory.
ClassNotFoundException - if the specified WordReader
cannot be found.
IllegalAccessException - if the factory is unable to create an
instance of the specified WordReader.
InstantiationException - if the factory is unable to create an
instance of the specified WordReader.
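As a rough sketch of how an array of property strings could map onto a Properties object, assume each element has the form key=value; this page does not spell out the format, so treat that as an assumption. PropertyArraySketch and toProperties are hypothetical names used for illustration, not part of the Textractor or MG4J API.

```java
import java.util.Properties;

/** Hypothetical illustration only; not part of MG4J or Textractor. */
final class PropertyArraySketch {
    private PropertyArraySketch() { }

    /** Converts "key=value" strings (assumed format) into a Properties object. */
    static Properties toProperties(String[] property) {
        Properties properties = new Properties();
        for (String entry : property) {
            int pos = entry.indexOf('=');
            if (pos <= 0) {
                // No '=' at all, or an empty key before it.
                throw new IllegalArgumentException("Malformed property specification: " + entry);
            }
            // Only the first '=' separates key from value; later ones stay in the value.
            properties.setProperty(entry.substring(0, pos), entry.substring(pos + 1));
        }
        return properties;
    }
}
```

A call such as toProperties(new String[] { "encoding=UTF-8", "fieldname=abstract" }) would then give a Properties object suitable for the Properties-based constructor. The key spellings "encoding" and "fieldname" are assumed here and should be checked against the actual property names.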
copy
public TextractorDocumentFactory copy()
- Creates a copy of this factory.
- Returns:
- a copy of this factory.
getDocument
public Document getDocument(InputStream rawContent,
Reference2ObjectMap<Enum<?>,Object> metadata)
- Returns the document obtained by parsing the given byte stream.
- Parameters:
rawContent - the raw content from which the document should be
extracted; it must not be closed, as resource management is a
responsibility of the DocumentCollection.
metadata - a map from enums (e.g., keys taken in
PropertyBasedDocumentFactory) to various kinds of objects.
- Returns:
- the document obtained by parsing the given byte stream.
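The requirement that rawContent must not be closed can be illustrated with plain Java I/O. The sketch below decodes a byte stream with a configured encoding and deliberately leaves the stream open for its owner to close. StreamDecodeSketch and decode are hypothetical names; this is not how the factory itself is implemented, only an example of the contract described above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

/** Hypothetical illustration only; not part of MG4J or Textractor. */
final class StreamDecodeSketch {
    private StreamDecodeSketch() { }

    /** Reads all remaining bytes of rawContent as text in the given encoding, without closing it. */
    static String decode(InputStream rawContent, String encoding) {
        try {
            // No try-with-resources here: closing the reader would also close rawContent,
            // and the stream belongs to the caller.
            Reader reader = new InputStreamReader(rawContent, Charset.forName(encoding));
            StringBuilder text = new StringBuilder();
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
            return text.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The caller that opened the stream remains responsible for closing it once every document has been read.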
getInstance
public static PropertyBasedDocumentFactory getInstance(Class<DocumentFactory> klass,
String basename,
Properties properties)
throws InstantiationException,
IllegalAccessException,
InvocationTargetException,
NoSuchMethodException
- Throws:
InstantiationException
IllegalAccessException
InvocationTargetException
NoSuchMethodException
Copyright © 2003-2008 Institute for Computational Biomedicine, All Rights Reserved.