Textractor API textractor-720 (20091120123250)

textractor.parsers
Class PubmedExtractor

java.lang.Object
  extended by it.unimi.dsi.mg4j.util.parser.callback.DefaultCallback
      extended by textractor.parsers.PubmedExtractor
All Implemented Interfaces:
Callback
Direct Known Subclasses:
ParsePubmedFast, PubmedLoadExtractor

public abstract class PubmedExtractor
extends DefaultCallback

PubMed / MEDLINE XML parser.


Field Summary
static MutableString EMPTY_MUTABLE_STRING
          An empty mutable string.
 
Fields inherited from interface it.unimi.dsi.mg4j.util.parser.callback.Callback
EMPTY_CALLBACK_ARRAY
 
Constructor Summary
protected PubmedExtractor()
          Create the parser.
 
Method Summary
 boolean characters(char[] characters, int offset, int length, boolean flowBroken)
          Received XML characters.
 void configure(it.unimi.dsi.mg4j.util.parser.BulletParser parserVal)
          Configure the parser to parse text.
 void endDocument()
          End of document.
 boolean endElement(it.unimi.dsi.mg4j.util.parser.Element elementOrig)
          We have found an XML end-element tag.
abstract  boolean processAbstractText(MutableString pmidVal, MutableString titleVal, MutableString textVal, Map<String,Object> additionalFieldsMap)
          Process the text of this document.
abstract  void processNoticeOfRetraction(MutableString pmidVal, List<String> retractedPmidsVal, boolean createArticleVal)
          Process retraction notices.
 void setArticleElementName(String dummy)
          Deprecated. 
 void startDocument()
          Start of document.
 boolean startElement(it.unimi.dsi.mg4j.util.parser.Element elementOrig, Map attrMap)
          We have found an XML start-element tag.
 it.unimi.dsi.mg4j.util.parser.Element translateElement(it.unimi.dsi.mg4j.util.parser.Element elementOrig)
          If the element is in elementIgnoreSet this will return null.
 
Methods inherited from class it.unimi.dsi.mg4j.util.parser.callback.DefaultCallback
cdata, getInstance
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

EMPTY_MUTABLE_STRING

public static final MutableString EMPTY_MUTABLE_STRING
An empty mutable string.

Constructor Detail

PubmedExtractor

protected PubmedExtractor()
Create the parser. Make sure we have an empty value for all known fields within singleArticleFieldsMap.

Method Detail

setArticleElementName

@Deprecated
public void setArticleElementName(String dummy)
Deprecated. 

No longer supported. This property was dropped when the class was refactored for readability, because it no longer appeared to be necessary. If you find a use for it, please modify the class to support it again.

Parameters:
dummy - ignored value

configure

public final void configure(it.unimi.dsi.mg4j.util.parser.BulletParser parserVal)
Configure the parser to parse text.

Specified by:
configure in interface Callback
Overrides:
configure in class DefaultCallback
Parameters:
parserVal - parser to use

startDocument

public final void startDocument()
Start of document.

Specified by:
startDocument in interface Callback
Overrides:
startDocument in class DefaultCallback

endDocument

public void endDocument()
End of document.

Specified by:
endDocument in interface Callback
Overrides:
endDocument in class DefaultCallback

characters

public final boolean characters(char[] characters,
                                int offset,
                                int length,
                                boolean flowBroken)
Received XML characters. If they belong to an element we are interested in, they are stored in singleArticleFieldsMap so that they can be saved later.

Specified by:
characters in interface Callback
Overrides:
characters in class DefaultCallback
Parameters:
characters - the characters
offset - the offset of the characters we are interested in
length - the length of the characters we are interested in
flowBroken - true if the flow is broken
Returns:
true
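
As an illustration of how the offset and length parameters select a slice of the character buffer, the following is a minimal sketch and not the actual implementation. The class CharactersSketch, its enter/leave/field helpers, and the use of String keys and StringBuilder values are assumptions made for the example; the real class works with mg4j elements and MutableString, and this sketch ignores flowBroken.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in: not part of the Textractor API.
class CharactersSketch {
    // Assumed shape of the per-article field map: field name -> accumulated text.
    private final Map<String, StringBuilder> singleArticleFieldsMap = new HashMap<>();
    // Name of the field currently being read, or null outside any field of interest.
    private String currentField;

    void enter(String field) { currentField = field; }
    void leave() { currentField = null; }

    // Mirrors the callback signature; flowBroken is ignored in this sketch.
    boolean characters(char[] characters, int offset, int length, boolean flowBroken) {
        if (currentField != null) {
            // Append only the slice [offset, offset + length) of the buffer.
            singleArticleFieldsMap
                .computeIfAbsent(currentField, k -> new StringBuilder())
                .append(characters, offset, length);
        }
        return true; // the documented return value
    }

    // Returns the text collected so far for a field, or "" if none.
    String field(String name) {
        StringBuilder sb = singleArticleFieldsMap.get(name);
        return sb == null ? "" : sb.toString();
    }
}
```

Characters that arrive while no field of interest is open are discarded, and repeated calls for the same field are concatenated.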

startElement

public final boolean startElement(it.unimi.dsi.mg4j.util.parser.Element elementOrig,
                                  Map attrMap)
We have found an XML start-element tag.

Specified by:
startElement in interface Callback
Overrides:
startElement in class DefaultCallback
Parameters:
elementOrig - the current element
attrMap - attributes map for this element
Returns:
true

endElement

public final boolean endElement(it.unimi.dsi.mg4j.util.parser.Element elementOrig)
We have found an XML end-element tag.

Specified by:
endElement in interface Callback
Overrides:
endElement in class DefaultCallback
Parameters:
elementOrig - the current element
Returns:
true

translateElement

public it.unimi.dsi.mg4j.util.parser.Element translateElement(it.unimi.dsi.mg4j.util.parser.Element elementOrig)
If the element is in elementIgnoreSet this will return null. If the element is in the elementTranslationsMap map, this will return the associated element. Otherwise this will return elementOrig.

Parameters:
elementOrig - the element to translate
Returns:
the translated element as described above
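
The three-way lookup described above can be sketched as follows. This is only an illustration under stated assumptions: element names are plain Strings rather than mg4j Element objects, the class TranslateElementSketch and its ignore/translate helpers are hypothetical, and the element names used in examples are made up.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in: Strings replace it.unimi.dsi.mg4j.util.parser.Element.
class TranslateElementSketch {
    private final Set<String> elementIgnoreSet = new HashSet<>();
    private final Map<String, String> elementTranslationsMap = new HashMap<>();

    // Helper (assumed): mark an element as ignored.
    TranslateElementSketch ignore(String name) {
        elementIgnoreSet.add(name);
        return this;
    }

    // Helper (assumed): map one element onto another.
    TranslateElementSketch translate(String from, String to) {
        elementTranslationsMap.put(from, to);
        return this;
    }

    String translateElement(String elementOrig) {
        // 1. Ignored elements yield null.
        if (elementIgnoreSet.contains(elementOrig)) {
            return null;
        }
        // 2. Elements with a translation yield the associated element.
        String translated = elementTranslationsMap.get(elementOrig);
        // 3. Everything else is returned unchanged.
        return translated != null ? translated : elementOrig;
    }
}
```

Callers need to handle the null result, which signals that the element should be skipped.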

processAbstractText

public abstract boolean processAbstractText(MutableString pmidVal,
                                            MutableString titleVal,
                                            MutableString textVal,
                                            Map<String,Object> additionalFieldsMap)
                                     throws IOException,
                                            SentenceProcessingException
Process the text of this document.

Parameters:
pmidVal - the pmid of the document
titleVal - the title of the document
textVal - the text of the document
additionalFieldsMap - additional fields to index
Returns:
True if an article was created for this PMID.
Throws:
IOException - error processing sentence
SentenceProcessingException - error processing sentence

processNoticeOfRetraction

public abstract void processNoticeOfRetraction(MutableString pmidVal,
                                               List<String> retractedPmidsVal,
                                               boolean createArticleVal)
Process retraction notices.

Parameters:
pmidVal - the pmid of the retraction
retractedPmidsVal - the retracted pmids
createArticleVal - True if this method may create an article to represent the retraction notice
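
A subclass has to supply both abstract methods. The sketch below shows one possible shape, using a hypothetical stand-in base class so that it compiles without the Textractor and mg4j jars: ExtractorStandIn and CollectingExtractor are invented names, String replaces MutableString, and SentenceProcessingException is omitted. The rule that an empty text creates no article is also an assumption of this example, not documented behaviour.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in mirroring the two abstract methods of PubmedExtractor.
abstract class ExtractorStandIn {
    abstract boolean processAbstractText(String pmidVal, String titleVal, String textVal,
                                         Map<String, Object> additionalFieldsMap) throws IOException;

    abstract void processNoticeOfRetraction(String pmidVal, List<String> retractedPmidsVal,
                                            boolean createArticleVal);
}

// Collects articles and retraction notices in memory.
class CollectingExtractor extends ExtractorStandIn {
    final List<String> articles = new ArrayList<>();
    final Map<String, List<String>> retractions = new HashMap<>();

    @Override
    boolean processAbstractText(String pmidVal, String titleVal, String textVal,
                                Map<String, Object> additionalFieldsMap) {
        // Assumption: documents without text do not produce an article.
        if (textVal == null || textVal.isEmpty()) {
            return false;
        }
        articles.add(pmidVal + "\t" + titleVal);
        return true; // an article was created for this PMID
    }

    @Override
    void processNoticeOfRetraction(String pmidVal, List<String> retractedPmidsVal,
                                   boolean createArticleVal) {
        // Record which PMIDs this notice retracts.
        retractions.put(pmidVal, new ArrayList<>(retractedPmidsVal));
        // Only create an article for the notice itself when permitted.
        if (createArticleVal) {
            articles.add(pmidVal + "\tRetraction notice");
        }
    }
}
```

A real subclass would extend PubmedExtractor directly and would typically index or store the text rather than keep it in a list.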

Copyright © 2003-2008 Institute for Computational Biomedicine, All Rights Reserved.