Textractor API textractor-720 (20091120123250)

textractor.mg4j.docstore
Class DocumentStoreWriter

java.lang.Object
  extended by textractor.mg4j.docstore.DocumentStoreWriter
All Implemented Interfaces:
Closeable

public final class DocumentStoreWriter
extends Object
implements Closeable

Writes a document store. A DocumentStore provides random access to the documents in a corpus; this class writes documents to the store. After constructing the writer, call appendDocument() once, in sequence, for each document to be written to the store. Writing is strictly sequential: a document cannot be inserted between documents that have already been written. Gaps in the document numbering are supported, however. For instance, if document 0 is submitted, followed by document 4, documents 1, 2 and 3 are created with empty content, so that random access to documents after the gap works as expected.
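
The gap-filling rule can be modeled with a small in-memory sketch. The class GapFillingStoreSketch and its methods are hypothetical stand-ins, not the Textractor implementation. Rejecting an out-of-order document with an exception is also an assumption, since this page does not say how appendDocument() reacts in that case.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model of sequential appends with gap filling.
// This is not the Textractor DocumentStoreWriter; it only illustrates the rule.
class GapFillingStoreSketch {
    private final List<int[]> documents = new ArrayList<>();

    // Appends tokens at documentIndex, creating empty documents for skipped indices.
    // Throwing on an out-of-order index is an assumption made for this sketch.
    void append(int documentIndex, int[] tokens) {
        if (documentIndex < documents.size()) {
            throw new IllegalArgumentException(
                "document " + documentIndex + " precedes documents already written");
        }
        while (documents.size() < documentIndex) {
            documents.add(new int[0]); // empty placeholder for a gap
        }
        documents.add(tokens.clone());
    }

    int[] get(int documentIndex) {
        return documents.get(documentIndex);
    }

    int size() {
        return documents.size();
    }

    public static void main(String[] args) {
        GapFillingStoreSketch store = new GapFillingStoreSketch();
        store.append(0, new int[] {7, 8, 9});
        store.append(4, new int[] {1, 2});
        System.out.println(store.size());        // prints 5
        System.out.println(store.get(2).length); // prints 0
    }
}
```

Documents 1 to 3 exist but are empty, so looking up document 4 by its number still lands on the right content.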

A significant storage saving is realized if the writer is optimized (see optimizeTermOrdering()) before documents are appended: an optimized writer may require 50% less storage for the same content. Optimization leverages the term frequency data found in the inverted index, so that more frequent terms are coded with shorter bit sequences.
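
To see why frequency-based ordering saves space, consider the following sketch. It renumbers terms so that the most frequent term gets the smallest index, then measures a document's size under Elias gamma coding. The class TermOrderingSketch, its methods, and the choice of gamma coding are all assumptions made for illustration; this page does not state which code the document store actually uses.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: renumber terms by descending frequency, then compare
// the cost of coding a document with Elias gamma codes (an assumed code).
class TermOrderingSketch {
    // Returns rank[term] = new index of the term; the most frequent term gets 0.
    static int[] rankByFrequency(int[] frequencies) {
        List<Integer> terms = new ArrayList<>();
        for (int t = 0; t < frequencies.length; t++) {
            terms.add(t);
        }
        // Stable sort, so ties keep their original relative order.
        terms.sort((a, b) -> Integer.compare(frequencies[b], frequencies[a]));
        int[] rank = new int[frequencies.length];
        for (int r = 0; r < terms.size(); r++) {
            rank[terms.get(r)] = r;
        }
        return rank;
    }

    // Length in bits of the Elias gamma code of value + 1 (gamma codes start at 1).
    static int gammaBits(int value) {
        int n = value + 1;
        int log2 = 31 - Integer.numberOfLeadingZeros(n);
        return 2 * log2 + 1;
    }

    // Total bits for a token sequence; tokens are remapped through rank unless it is null.
    static int totalBits(int[] tokens, int[] rank) {
        int bits = 0;
        for (int token : tokens) {
            bits += gammaBits(rank == null ? token : rank[token]);
        }
        return bits;
    }

    public static void main(String[] args) {
        int[] frequencies = {1, 2, 1, 50}; // term 3 is by far the most frequent
        int[] document = {3, 3, 3, 3, 1};
        int[] rank = rankByFrequency(frequencies);
        System.out.println(totalBits(document, null)); // prints 23
        System.out.println(totalBits(document, rank)); // prints 7
    }
}
```

In this toy case the frequent term moves from index 3 (5 bits) to index 0 (1 bit). The actual saving depends on the corpus and on the code in use.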

User: Fabien Campagne Date: Oct 29, 2005 Time: 11:57:37 AM


Field Summary
static int DOCUMENT_NOT_FOUND
           
 
Constructor Summary
DocumentStoreWriter(DocumentIndexManager docmanager)
          Initialize a document store writer for the "text" index.
DocumentStoreWriter(IndexDetails indexDetailsVal, boolean writePmidsVal, boolean writePositionsVal)
          Initialize a document store writer.
 
Method Summary
 void addDocumentPMID(long documentNumber, long pmid)
          Add a pmid for a document number.
 int appendDocument(int documentIndex, int[] tokens)
          Append a document to this docstore at the given document number.
 int appendPositions(List<IntRange> ranges)
          Append position information to the document store.
 void close()
          Close.
protected  void finalize()
          Finalize (close).
 void flush()
          Flush the output files.
static String getDocumentDataFilename(String basename)
          Create the document data filename based on the index basename.
static String getOffsetFilename(String basename)
          Create the document offsets filename based on the index basename.
static String getPMIDMapFilename(String basename)
          Get the pmid map filename based on the basename.
static String getPositionFilename(String basename)
          Create the positions data filename based on the index basename.
static String getPositionOffsetFilename(String basename)
          Create the positions offsets filename based on the index basename.
static String getSmallIndexToTermFilename(String basename)
          Get the small index to term filename based on the basename.
 boolean getWritePositions()
          Get the value of writePositions.
 void optimizeTermOrdering()
          Optimize the term ordering.
 void writePMIDs()
          Write document PMID info to the pmid map file.
 int writtenDocuments()
          The number of documents written by this writer.
 
Methods inherited from class java.lang.Object
clone, equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

DOCUMENT_NOT_FOUND

public static final int DOCUMENT_NOT_FOUND
See Also:
Constant Field Values
Constructor Detail

DocumentStoreWriter

public DocumentStoreWriter(DocumentIndexManager docmanager)
                    throws FileNotFoundException
Initialize a document store writer for the "text" index. This will write pmid values. Positions will be written.

Parameters:
docmanager - the document index manager
Throws:
FileNotFoundException - When docstore files cannot be written with the given basename.

DocumentStoreWriter

public DocumentStoreWriter(IndexDetails indexDetailsVal,
                           boolean writePmidsVal,
                           boolean writePositionsVal)
                    throws FileNotFoundException
Initialize a document store writer.

Parameters:
indexDetailsVal - the document index to write the documents for
writePmidsVal - if true, the pmids file will be written
writePositionsVal - if true, the positions data will be written
Throws:
FileNotFoundException - When docstore files cannot be written with the given basename.
Method Detail

getDocumentDataFilename

public static String getDocumentDataFilename(String basename)
Create the document data filename based on the index basename.

Parameters:
basename - the index basename
Returns:
the document data filename

getOffsetFilename

public static String getOffsetFilename(String basename)
Create the document offsets filename based on the index basename.

Parameters:
basename - the index basename
Returns:
the document offsets filename

getPositionFilename

public static String getPositionFilename(String basename)
Create the positions data filename based on the index basename.

Parameters:
basename - the index basename
Returns:
the positions data filename

getPositionOffsetFilename

public static String getPositionOffsetFilename(String basename)
Create the positions offsets filename based on the index basename.

Parameters:
basename - the index basename
Returns:
the positions offsets filename

appendDocument

public int appendDocument(int documentIndex,
                          int[] tokens)
                   throws IOException
Append a document to this docstore at the given document number.

Parameters:
documentIndex - the document number
tokens - the document
Returns:
the number of bits written?
Throws:
IOException - error appending a document

appendPositions

public int appendPositions(List<IntRange> ranges)
                    throws IOException
Append position information to the document store.

Parameters:
ranges - The ranges for each term in the document
Returns:
The number of bits written to the document store.
Throws:
IOException - If the positions cannot be written

flush

public void flush()
           throws IOException
Flush the output files.

Throws:
IOException - error flushing

getWritePositions

public boolean getWritePositions()
Get the value of writePositions.

Returns:
true if we are writing positions

finalize

protected void finalize()
                 throws Throwable
Finalize (close).

Overrides:
finalize in class Object
Throws:
Throwable - error finalizing

close

public void close()
           throws IOException
Close.

Specified by:
close in interface Closeable
Throws:
IOException - error closing
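
Because the class implements Closeable, on Java 7 and later a writer can be managed with try-with-resources, which calls close() even when an append fails. The sketch below shows the pattern with a hypothetical stand-in class, RecordingWriter, so that it runs without Textractor. The stand-in narrows close() to throw no checked exception, whereas DocumentStoreWriter.close() declares IOException, so real calling code would need to handle or declare it.

```java
import java.io.Closeable;

// Hypothetical stand-in for a Closeable writer; not the Textractor class.
class RecordingWriter implements Closeable {
    boolean closed = false;
    int written = 0;

    // Counts an appended document; refusing appends after close is an assumption.
    void append(int[] tokens) {
        if (closed) {
            throw new IllegalStateException("writer is closed");
        }
        written++;
    }

    @Override
    public void close() { // narrowed: the real close() declares IOException
        closed = true;
    }
}

class CloseSketch {
    // Writes all documents and returns the writer so the caller can inspect it.
    static RecordingWriter writeAll(int[][] documents) {
        RecordingWriter writer = new RecordingWriter();
        try (RecordingWriter w = writer) {
            for (int[] tokens : documents) {
                w.append(tokens);
            }
        } // close() runs here, whether or not append threw
        return writer;
    }

    public static void main(String[] args) {
        RecordingWriter writer = writeAll(new int[][] {{1, 2}, {3}});
        System.out.println(writer.written); // prints 2
        System.out.println(writer.closed);  // prints true
    }
}
```

Closing explicitly this way avoids relying on finalize(), whose timing is left to the garbage collector.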

writePMIDs

public void writePMIDs()
                throws IOException
Write document PMID info to the pmid map file.

Throws:
IOException - if the file cannot be written or created

getPMIDMapFilename

public static String getPMIDMapFilename(String basename)
Get the pmid map filename based on the basename.

Parameters:
basename - the basename to create the filename for
Returns:
the pmid map filename

optimizeTermOrdering

public void optimizeTermOrdering()
                          throws IOException
Optimize the term ordering.

Throws:
IOException - error optimizing

getSmallIndexToTermFilename

public static String getSmallIndexToTermFilename(String basename)
Get the small index to term filename based on the basename.

Parameters:
basename - the basename to create the filename for
Returns:
the small index to term filename

writtenDocuments

public int writtenDocuments()
The number of documents written by this writer.

Returns:
The number of documents written by this writer.

addDocumentPMID

public void addDocumentPMID(long documentNumber,
                            long pmid)
Add a pmid for a document number.

Parameters:
documentNumber - the document number
pmid - the pmid for documentNumber

Copyright © 2003-2008 Institute for Computational Biomedicine, All Rights Reserved.