textractor.html
Class AbstractHtml2Text
java.lang.Object
textractor.html.AbstractHtml2Text
- Direct Known Subclasses:
- Html2Text2DB
public abstract class AbstractHtml2Text
- extends Object
Converts HTML to text. This translator replaces each image with its alt text.
This is useful because many journals use images for Greek symbols and supply
an alt attribute so that text-only browsers can render them.
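The alt-text substitution described above can be illustrated with a minimal, self-contained sketch. It is not the textractor implementation (which, judging by the declared ParserException, relies on org.htmlparser); it is a regex-based approximation, and the class name AltTextSketch and method name replaceImagesWithAltText are hypothetical names chosen for this example only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: a regex approximation, not the textractor code. */
final class AltTextSketch {
    // Matches a whole <img ...> tag (assumes no '>' inside attribute values).
    private static final Pattern IMG_TAG =
            Pattern.compile("<img\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Captures the value of a double-quoted or single-quoted alt attribute.
    private static final Pattern ALT_ATTR =
            Pattern.compile("\\balt\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)')",
                    Pattern.CASE_INSENSITIVE);

    /** Replaces every img tag with its alt text, or with nothing if alt is absent. */
    static String replaceImagesWithAltText(String html) {
        Matcher img = IMG_TAG.matcher(html);
        StringBuffer out = new StringBuffer();
        while (img.find()) {
            Matcher alt = ALT_ATTR.matcher(img.group());
            String replacement = "";
            if (alt.find()) {
                replacement = alt.group(1) != null ? alt.group(1) : alt.group(2);
            }
            // quoteReplacement keeps '$' and '\' in alt text from being
            // interpreted as group references.
            img.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        img.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "TGF-<img src=\"beta.gif\" alt=\"beta\"> signaling";
        System.out.println(replaceImagesWithAltText(html)); // prints "TGF-beta signaling"
    }
}
```

A real converter would use an HTML parser rather than regular expressions, since attribute values may contain characters that this sketch does not handle.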
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
appendSentencesInOneDocument
protected boolean appendSentencesInOneDocument
noSentenceBoundaryTag
protected boolean noSentenceBoundaryTag
verbose
protected final boolean verbose
articleChunkSize
protected int articleChunkSize
AbstractHtml2Text
public AbstractHtml2Text(String[] args)
process
public void process(String[] args)
throws IOException,
org.htmlparser.util.ParserException,
ConfigurationException,
SentenceProcessingException
- Throws:
IOException
org.htmlparser.util.ParserException
ConfigurationException
SentenceProcessingException
loadArticleSentences
protected final Collection<Sentence> loadArticleSentences(Article article,
String title,
String text,
Map<String,Object> additionalFieldsMap)
throws SentenceProcessingException
- Throws:
SentenceProcessingException
createSentenceOneDocument
protected final Collection<Sentence> createSentenceOneDocument(Article article,
Iterator<MutableString> sentencesAsTextIterator,
String title)
throws SentenceProcessingException
- Throws:
SentenceProcessingException
createSentences
protected final Collection<Sentence> createSentences(Article article,
Iterator<MutableString> sentencesAsTextIterator,
String title)
throws SentenceProcessingException
- Throws:
SentenceProcessingException
setConsumer
public abstract void setConsumer(TextConsumer consumer)
getConsumer
public abstract TextConsumer getConsumer()
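Because setConsumer and getConsumer are abstract, each concrete subclass decides how the consumer is stored. The sketch below shows one plausible field-backed pattern. It uses hypothetical stand-in types (TextConsumerStandIn, Html2TextStandIn, FieldBackedHtml2Text) so that it compiles on its own; the real TextConsumer interface and the rest of AbstractHtml2Text are not reproduced here, and nothing is assumed about their other members.

```java
// Hypothetical stand-in for textractor's TextConsumer; its real methods
// are not listed on this page, so none are declared here.
interface TextConsumerStandIn { }

// Hypothetical stand-in exposing only the two abstract accessors
// documented above.
abstract class Html2TextStandIn {
    public abstract void setConsumer(TextConsumerStandIn consumer);

    public abstract TextConsumerStandIn getConsumer();
}

// One possible concrete subclass: the consumer is held in a private field.
final class FieldBackedHtml2Text extends Html2TextStandIn {
    private TextConsumerStandIn consumer;

    @Override
    public void setConsumer(TextConsumerStandIn consumer) {
        this.consumer = consumer;
    }

    @Override
    public TextConsumerStandIn getConsumer() {
        return consumer;
    }
}
```

Whether the known subclass Html2Text2DB stores its consumer this way is not stated in this page.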
Copyright © 2003-2008 Institute for Computational Biomedicine, All Rights Reserved.