Textractor API textractor-716 (20091105163204)
java.lang.Object
  it.unimi.dsi.mg4j.document.AbstractDocumentFactory
    it.unimi.dsi.mg4j.document.PropertyBasedDocumentFactory
      textractor.mg4j.document.AbstractTextractorDocumentFactory
public abstract class AbstractTextractorDocumentFactory
A factory that can produce MG4J documents from Textractor sentences.
This factory provides a single field containing just the raw input stream; the encoding is set using the property PropertyBasedDocumentFactory.MetadataKeys.ENCODING.
The field is named text, but you can change the name using the
property fieldname.
By default, the WordReader provided by this
factory is just a FastBufferedReader, but you
can specify an alternative word reader using the property
PropertyBasedDocumentFactory.MetadataKeys.WORDREADER.
For instance, if you need to index a list of identifiers to retrieve
documents from the collection more easily, you can use a
LineWordReader to index each line of a file as a
whole.
A default encoding can be provided using the property
PropertyBasedDocumentFactory.MetadataKeys.ENCODING.
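As a rough illustration of the property-driven configuration described above, the following self-contained sketch resolves the encoding, field name, and word reader from a java.util.Properties object, falling back to a default when a key is absent. It does not call MG4J or Textractor. The class FactoryPropertiesSketch, its resolve helper, the lowercase key spellings encoding and wordreader, and the UTF-8 fallback are assumptions made for illustration; only the fieldname property, the text default, and the FastBufferedReader default come from the description above.

```java
import java.util.Properties;

public class FactoryPropertiesSketch {
    /**
     * Looks up a property while ignoring the case of its key.
     * Returns the fallback when no matching key is present.
     * (Hypothetical helper, not part of MG4J or Textractor.)
     */
    public static String resolve(Properties properties, String key, String fallback) {
        for (String name : properties.stringPropertyNames()) {
            if (name.equalsIgnoreCase(key)) {
                return properties.getProperty(name);
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        Properties properties = new Properties();
        // Key spellings other than "fieldname" are assumed for this sketch.
        properties.setProperty("Encoding", "ISO-8859-1");
        properties.setProperty("wordreader", "LineWordReader");

        // An explicit value wins over the fallback, whatever the key's case.
        String encoding = resolve(properties, "encoding", "UTF-8");
        // "fieldname" is not set, so the documented default "text" applies.
        String fieldName = resolve(properties, "fieldname", "text");
        // The documented default word reader is FastBufferedReader.
        String wordReader = resolve(properties, "wordreader", "FastBufferedReader");

        System.out.println(encoding + " " + fieldName + " " + wordReader);
    }
}
```

In a real setup, a Properties object of this kind would presumably be handed to the AbstractTextractorDocumentFactory(Properties) constructor of a concrete subclass rather than resolved by hand.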
| Nested Class Summary | |
|---|---|
| static class | AbstractTextractorDocumentFactory.MetadataKeys: Case-insensitive keys for metadata passed to DocumentFactory.getDocument(java.io.InputStream, it.unimi.dsi.fastutil.objects.Reference2ObjectMap). |
| Nested classes/interfaces inherited from interface it.unimi.dsi.mg4j.document.DocumentFactory |
|---|
| DocumentFactory.FieldType |
| Field Summary | |
|---|---|
| static int | DEFAULT_MAXIMUM_LENGTH_CONSERVE_CASE: The maximum number of characters in a term that guarantees that the term is not downcased. |
| static int | DEFAULT_MINIMUM_DASH_SPLIT_LENGTH: Default minimum index at which a word can be split. |
| static String | DEFAULT_OTHER_CHARACTER_DELIMITERS: Default character delimiters to match Twease behaviour. |
| static Class | DEFAULT_WORD_READER_CLASS: Default value for the word reader class supported by this factory. |
| protected List<TextractorFieldInfo> | fields: Fields and configuration associated with this DocumentFactory. |
| Fields inherited from class it.unimi.dsi.mg4j.document.PropertyBasedDocumentFactory |
|---|
| defaultMetadata |
| Constructor Summary | |
|---|---|
| AbstractTextractorDocumentFactory() | Construct a new DocumentFactory. |
| AbstractTextractorDocumentFactory(Properties properties) | Construct a new DocumentFactory. |
| AbstractTextractorDocumentFactory(Reference2ObjectMap<Enum<?>,Object> defaultMetadata) | Construct a new DocumentFactory. |
| AbstractTextractorDocumentFactory(String[] property) | Construct a new DocumentFactory. |
| Method Summary | |
|---|---|
| int | fieldIndex(String fieldName): Returns the index of a field, given its symbolic name. |
| String | fieldName(int field): Returns the symbolic name of a field. |
| DocumentFactory.FieldType | fieldType(int field): Returns the type of a field. |
| Document | getDocumentNoContent(Reference2ObjectMap<Enum<?>,Object> metadata): Returns the document obtained by parsing the given byte stream. |
| List<TextractorFieldInfo> | getFieldInfoList(): Return the list of fields for this index. |
| int | numberOfFields(): Returns the number of fields present in the documents produced by this factory. |
| protected boolean | parseProperty(String key, String[] values, Reference2ObjectMap<Enum<?>,Object> metadata): Declare the properties the various wordReader implementations need to store in MG4J Metadata. |
| Methods inherited from class it.unimi.dsi.mg4j.document.PropertyBasedDocumentFactory |
|---|
| ensureJustOne, getInstance, getInstance, getInstance, getInstance, parseProperties, parseProperties, resolve, resolve, resolveNotNull, sameKey |
| Methods inherited from class it.unimi.dsi.mg4j.document.AbstractDocumentFactory |
|---|
| ensureFieldIndex, toString |
| Methods inherited from class java.lang.Object |
|---|
| clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait |
| Methods inherited from interface it.unimi.dsi.mg4j.document.DocumentFactory |
|---|
| copy, getDocument |
| Field Detail |
|---|
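public static final Class DEFAULT_WORD_READER_CLASS
Default value for the word reader class supported by this factory.
public static final int DEFAULT_MAXIMUM_LENGTH_CONSERVE_CASE
The maximum number of characters in a term that guarantees that the term is not downcased.
public static final String DEFAULT_OTHER_CHARACTER_DELIMITERS
Default character delimiters to match Twease behaviour.
public static final int DEFAULT_MINIMUM_DASH_SPLIT_LENGTH
Default minimum index at which a word can be split.
protected final List<TextractorFieldInfo> fields
Fields and configuration associated with this DocumentFactory.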
| Constructor Detail |
|---|
public AbstractTextractorDocumentFactory()
throws ConfigurationException,
ClassNotFoundException,
IllegalAccessException,
InstantiationException
Construct a new DocumentFactory.
Throws:
ConfigurationException - if there is a problem with the configuration of the factory.
ClassNotFoundException - if the specified WordReader cannot be found.
IllegalAccessException - if the factory is unable to create an instance of the specified WordReader.
InstantiationException - if the factory is unable to create an instance of the specified WordReader.
public AbstractTextractorDocumentFactory(Reference2ObjectMap<Enum<?>,Object> defaultMetadata)
throws ConfigurationException,
ClassNotFoundException,
IllegalAccessException,
InstantiationException
Construct a new DocumentFactory.
Parameters:
defaultMetadata - metadata used to configure this factory
Throws:
ConfigurationException - if there is a problem with the configuration of the factory.
ClassNotFoundException - if the specified WordReader cannot be found.
IllegalAccessException - if the factory is unable to create an instance of the specified WordReader.
InstantiationException - if the factory is unable to create an instance of the specified WordReader.
public AbstractTextractorDocumentFactory(Properties properties)
throws ConfigurationException,
ClassNotFoundException,
IllegalAccessException,
InstantiationException
Construct a new DocumentFactory.
Parameters:
properties - properties used to configure this factory
Throws:
ConfigurationException - if there is a problem with the configuration of the factory.
ClassNotFoundException - if the specified WordReader cannot be found.
IllegalAccessException - if the factory is unable to create an instance of the specified WordReader.
InstantiationException - if the factory is unable to create an instance of the specified WordReader.
public AbstractTextractorDocumentFactory(String[] property)
throws ConfigurationException,
ClassNotFoundException,
IllegalAccessException,
InstantiationException
Construct a new DocumentFactory.
Parameters:
property - properties used to configure this factory
Throws:
ConfigurationException - if there is a problem with the configuration of the factory.
ClassNotFoundException - if the specified WordReader cannot be found.
IllegalAccessException - if the factory is unable to create an instance of the specified WordReader.
InstantiationException - if the factory is unable to create an instance of the specified WordReader.
| Method Detail |
|---|
protected boolean parseProperty(String key,
String[] values,
Reference2ObjectMap<Enum<?>,Object> metadata)
throws ConfigurationException
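Declare the properties the various wordReader implementations need to store in MG4J Metadata.
Overrides:
parseProperty in class PropertyBasedDocumentFactory
Parameters:
key - the property key
values - the property values
metadata - the properties metadata
Throws:
ConfigurationException - error configuring

public final int numberOfFields()
Returns the number of fields present in the documents produced by this factory.

public final List<TextractorFieldInfo> getFieldInfoList()
Return the list of fields for this index.

public final String fieldName(int field)
Returns the symbolic name of a field.
Parameters:
field - the index of a field (between 0 inclusive and numberOfFields() exclusive).
Returns:
the symbolic name of the field-th field.

public final int fieldIndex(String fieldName)
Returns the index of a field, given its symbolic name.
Parameters:
fieldName - the name of a field of this factory.
Returns:
the index of the field named fieldName.

public final DocumentFactory.FieldType fieldType(int field)
Returns the type of a field. The possible types are defined in DocumentFactory.FieldType.
Parameters:
field - the index of a field (between 0 inclusive and numberOfFields() exclusive).
Returns:
the type of the field-th field.

public final Document getDocumentNoContent(Reference2ObjectMap<Enum<?>,Object> metadata)
Returns the document obtained by parsing the given byte stream.
DocumentCollection.
Parameters:
metadata - a map from enums (e.g., keys taken in PropertyBasedDocumentFactory) to various kinds of objects.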
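
The relationship between numberOfFields(), fieldName(int) and fieldIndex(String) can be pictured with a small stand-alone model. This is only a sketch under stated assumptions: the class FieldLookupSketch is hypothetical and backs the lookups with a plain list of names instead of the factory's List<TextractorFieldInfo>. Two behaviours are assumed and not stated on this page: that an unknown field name yields -1, and that an out-of-range index raises IndexOutOfBoundsException.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FieldLookupSketch {
    /** Field names in index order (stand-in for the factory's field list). */
    private final List<String> fieldNames;

    public FieldLookupSketch(List<String> fieldNames) {
        this.fieldNames = new ArrayList<>(fieldNames);
    }

    /** Number of fields; valid indices run from 0 inclusive to this value exclusive. */
    public int numberOfFields() {
        return fieldNames.size();
    }

    /** Symbolic name of the field-th field. */
    public String fieldName(int field) {
        ensureFieldIndex(field);
        return fieldNames.get(field);
    }

    /** Index of the named field, or -1 if absent (assumed convention). */
    public int fieldIndex(String fieldName) {
        return fieldNames.indexOf(fieldName);
    }

    /** Rejects indices outside [0, numberOfFields()) (assumed behaviour). */
    private void ensureFieldIndex(int field) {
        if (field < 0 || field >= numberOfFields()) {
            throw new IndexOutOfBoundsException("No such field: " + field);
        }
    }

    public static void main(String[] args) {
        FieldLookupSketch factory = new FieldLookupSketch(Arrays.asList("text", "title"));
        // fieldIndex and fieldName are inverses for every valid index.
        for (int i = 0; i < factory.numberOfFields(); i++) {
            String name = factory.fieldName(i);
            System.out.println(i + " -> " + name + " -> " + factory.fieldIndex(name));
        }
    }
}
```

The field names text and title are placeholders; the point is the round trip between index and name over the range 0 to numberOfFields() exclusive.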