textractor.mg4j.document
Class ConfigurableTextractorDocumentFactory
java.lang.Object
it.unimi.dsi.mg4j.document.AbstractDocumentFactory
it.unimi.dsi.mg4j.document.PropertyBasedDocumentFactory
textractor.mg4j.document.AbstractTextractorDocumentFactory
textractor.mg4j.document.ConfigurableTextractorDocumentFactory
- All Implemented Interfaces:
- DocumentFactory, FlyweightPrototype<DocumentFactory>, Serializable
public final class ConfigurableTextractorDocumentFactory
- extends AbstractTextractorDocumentFactory
A factory that can produce MG4J documents from Textractor sentences.
A factory that provides multiple fields containing just the raw input stream;
the encoding is set using the property
PropertyBasedDocumentFactory.MetadataKeys.ENCODING.
The field is named text, but you can change the name using the
property fieldname.
By default, the WordReader provided by this
factory is just a FastBufferedReader, but you
can specify an alternative word reader using the property
PropertyBasedDocumentFactory.MetadataKeys.WORDREADER.
For instance, if you need to index a list of identifiers to retrieve
documents from the collection more easily, you can use a
LineWordReader to index each line of a file as a
whole.
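To illustrate the difference, the following stand-alone sketch uses only the JDK to contrast splitting text into words with treating each line as one word. It does not use MG4J's FastBufferedReader or LineWordReader; the class and method names are hypothetical, and the delimiter rule is an assumption about what a default word reader does.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration using only the JDK; these are not MG4J classes. */
final class LineAsWordSketch {
    /** Splits on anything that is not an ASCII letter or digit (assumed default behaviour). */
    static List<String> wordsByDelimiter(String text) {
        List<String> words = new ArrayList<>();
        for (String word : text.split("[^\\p{Alnum}]+")) {
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    /** Treats every non-empty line as a single word, so identifiers stay intact. */
    static List<String> wordsByLine(String text) {
        List<String> words = new ArrayList<>();
        for (String line : text.split("\\r?\\n")) {
            if (!line.isEmpty()) {
                words.add(line);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        String identifiers = "PMID-12345\nPMID-67890\n"; // made-up identifiers
        System.out.println(wordsByDelimiter(identifiers)); // [PMID, 12345, PMID, 67890]
        System.out.println(wordsByLine(identifiers));      // [PMID-12345, PMID-67890]
    }
}
```

With line-based reading, a query for a full identifier matches exactly one term instead of several fragments.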
A default encoding can be provided using the property
PropertyBasedDocumentFactory.MetadataKeys.ENCODING.
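As a rough sketch of how such a configuration might be assembled, the example below builds a java.util.Properties object with the three settings described above. Only the key fieldname is named on this page; the keys "encoding" and "wordreader" are assumed to be the lower-case names of the MetadataKeys constants, and the helper class and method are hypothetical.

```java
import java.util.Properties;

/** Hypothetical helper; it only prepares settings and does not touch MG4J. */
final class FactoryPropertiesSketch {
    /**
     * Builds the settings for a factory.
     * wordReaderClass may be null to keep the default word reader.
     */
    static Properties factoryProperties(String encoding, String fieldName, String wordReaderClass) {
        Properties properties = new Properties();
        properties.setProperty("encoding", encoding);    // assumed key for MetadataKeys.ENCODING
        properties.setProperty("fieldname", fieldName);  // key named in the class description
        if (wordReaderClass != null) {
            // assumed key for MetadataKeys.WORDREADER; the value is a class name
            properties.setProperty("wordreader", wordReaderClass);
        }
        return properties;
    }

    public static void main(String[] args) {
        Properties properties = factoryProperties("UTF-8", "text", null);
        // With the Textractor and MG4J jars on the classpath, this object could be
        // handed to the (String, Properties) constructor of this class.
        properties.list(System.out);
    }
}
```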
- See Also:
- Serialized Form
Methods inherited from class it.unimi.dsi.mg4j.document.PropertyBasedDocumentFactory
ensureJustOne, getInstance, getInstance, getInstance, getInstance, parseProperties, parseProperties, resolve, resolve, resolveNotNull, sameKey
DEFAULT_FIELD_TYPE
public static final DocumentFactory.FieldType DEFAULT_FIELD_TYPE
- Default value for field type supported by this factory.
ConfigurableTextractorDocumentFactory
public ConfigurableTextractorDocumentFactory(String basenameVal,
Reference2ObjectMap<Enum<?>,Object> defaultMetadata)
throws ConfigurationException,
ClassNotFoundException,
IllegalAccessException,
InstantiationException
- Construct a new
DocumentFactory.
- Parameters:
basenameVal - the basename
defaultMetadata - metadata used to configure this factory
- Throws:
ConfigurationException - if there
is a problem with the configuration of the factory.
ClassNotFoundException - if the specified WordReader
cannot be found.
IllegalAccessException - if the factory is unable to create an
instance of the specified WordReader.
InstantiationException - if the factory is unable to create an
instance of the specified WordReader.
ConfigurableTextractorDocumentFactory
public ConfigurableTextractorDocumentFactory(String basenameVal,
Properties properties)
throws ConfigurationException,
ClassNotFoundException,
IllegalAccessException,
InstantiationException
- Construct a new
DocumentFactory.
- Parameters:
basenameVal - the basename
properties - properties used to configure this factory
- Throws:
ConfigurationException - if
there is a problem with the configuration of the factory.
ClassNotFoundException - if the specified WordReader
cannot be found.
IllegalAccessException - if the factory is unable to create an
instance of the specified WordReader.
InstantiationException - if the factory is unable to create an
instance of the specified WordReader.
parseProperty
protected boolean parseProperty(String key,
String[] values,
Reference2ObjectMap<Enum<?>,Object> metadata)
throws ConfigurationException
- Parses any properties specific to this class.
Declares the properties that the various WordReader implementations need to store in MG4J metadata.
This should be delegated to the word reader, but it is unclear how at this time.
- Overrides:
parseProperty in class AbstractTextractorDocumentFactory
- Parameters:
key - the property to parse
values - the values for that property
metadata - the metadata to parse the property into, if possible
- Returns:
- true if this method was able to parse the property
- Throws:
ConfigurationException - error parsing property
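A common way to implement an overriding method with this contract is to handle the keys the subclass knows and hand everything else to the superclass. The sketch below shows that pattern with hypothetical stand-in classes; a plain Map replaces Reference2ObjectMap, IllegalArgumentException stands in for ConfigurationException, and the key names are assumptions. It is not the actual Textractor implementation.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the parent factory. */
class BaseFactorySketch {
    protected boolean parseProperty(String key, String[] values, Map<String, Object> metadata) {
        if ("encoding".equalsIgnoreCase(key)) {
            metadata.put("encoding", single(key, values));
            return true;
        }
        return false; // not recognised here; the caller decides how to react
    }

    /** Returns the only value, or fails if the property was given zero or several values. */
    protected static String single(String key, String[] values) {
        if (values.length != 1) {
            throw new IllegalArgumentException("Property " + key + " needs exactly one value");
        }
        return values[0];
    }
}

/** Hypothetical stand-in for the subclass: handles its own key, then delegates. */
final class ConfigurableFactorySketch extends BaseFactorySketch {
    @Override
    protected boolean parseProperty(String key, String[] values, Map<String, Object> metadata) {
        if ("fieldname".equalsIgnoreCase(key)) {
            metadata.put("fieldname", single(key, values));
            return true;
        }
        return super.parseProperty(key, values, metadata);
    }

    public static void main(String[] args) {
        Map<String, Object> metadata = new HashMap<>();
        ConfigurableFactorySketch factory = new ConfigurableFactorySketch();
        System.out.println(factory.parseProperty("fieldname", new String[] {"title"}, metadata)); // true
        System.out.println(factory.parseProperty("encoding", new String[] {"UTF-8"}, metadata));  // true
        System.out.println(factory.parseProperty("unknown", new String[] {"x"}, metadata));       // false
    }
}
```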
copy
public ConfigurableTextractorDocumentFactory copy()
- Creates a copy of this factory.
- Returns:
- a copy of this factory.
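Because the class implements FlyweightPrototype, copy() is presumably meant to give each thread its own instance that shares read-only configuration with the original. The following sketch shows that usage pattern with a hypothetical stand-in class; it is an assumption about intent, not the actual factory code.

```java
/** Hypothetical illustration of the copy-per-thread pattern; not the real factory. */
final class CopyPerThreadSketch {
    /** Stand-in for a factory: shared read-only configuration, private scratch buffer. */
    static final class Factory {
        private final String fieldName;                            // immutable, safe to share
        private final StringBuilder scratch = new StringBuilder(); // mutable, one per copy

        Factory(String fieldName) {
            this.fieldName = fieldName;
        }

        /** Returns a new instance with the same configuration and fresh mutable state. */
        Factory copy() {
            return new Factory(fieldName);
        }

        /** Uses the scratch buffer, so it must not be called concurrently on one instance. */
        String describe(String content) {
            scratch.setLength(0);
            return scratch.append(fieldName).append(':').append(content).toString();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Factory prototype = new Factory("text");
        Thread[] workers = new Thread[2];
        for (int i = 0; i < workers.length; i++) {
            Factory own = prototype.copy(); // one copy per thread
            String content = "doc" + i;
            workers[i] = new Thread(() -> System.out.println(own.describe(content)));
            workers[i].start();
        }
        for (Thread worker : workers) {
            worker.join();
        }
    }
}
```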
getDocument
public Document getDocument(InputStream rawContent,
Reference2ObjectMap<Enum<?>,Object> metadata)
- Returns the document obtained by parsing the given byte stream.
- Parameters:
rawContent - the raw content from which the document should be
extracted; it must not be closed, as resource management is the
responsibility of the
DocumentCollection.
metadata - a map from enums (e.g., keys taken from
PropertyBasedDocumentFactory) to various kinds of objects.
- Returns:
- the document obtained by parsing the given byte stream.
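The ownership rule for rawContent can be sketched with JDK classes alone: the code that parses the stream decodes it with the configured encoding but leaves it open, and the code that opened the stream closes it. The class and method names below are hypothetical and only illustrate the contract.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical illustration of "the parser reads, the owner closes". */
final class StreamOwnershipSketch {
    /** Decodes the whole stream with the given encoding; deliberately does not close it. */
    static String readAll(InputStream rawContent, Charset encoding) {
        try {
            Reader reader = new InputStreamReader(rawContent, encoding); // intentionally left open
            StringBuilder text = new StringBuilder();
            char[] buffer = new char[1024];
            int count;
            while ((count = reader.read(buffer)) != -1) {
                text.append(buffer, 0, count);
            }
            return text.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = "one sentence".getBytes(StandardCharsets.UTF_8);
        // The caller plays the role of the collection: it opens and closes the stream.
        try (InputStream in = new ByteArrayInputStream(bytes)) {
            System.out.println(readAll(in, StandardCharsets.UTF_8)); // one sentence
        }
    }
}
```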
Copyright © 2003-2008 Institute for Computational Biomedicine, All Rights Reserved.