ChemicalTagger

About

ChemicalTagger is a phrase-based semantic NLP tool for parsing the language of chemical experiments. It takes a string as input and produces an XML document as output. Tagging is based on a modular architecture and uses a combination of OSCAR4, domain-specific regex and English taggers to identify parts-of-speech. An ANTLR grammar is then used to structure the tagged tokens into tree-based phrases which are then converted into an XML document.

License and Warranty

ChemicalTagger is licensed under the Apache License, Version 2.0.

ChemicalTagger is made available in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Installation Instructions

This online version of ChemicalTagger is a demonstration and can be downloaded from here.

To use ChemicalTagger as a library, either:

Download chemicalTagger-1.3-jar-with-dependencies.jar from the downloads page.

Or pull it in through Maven by adding the following to your pom file.

Add our repository:

    <repository>
      <id>ucc-repo</id>
      <name>UCC Repository</name>
      <url>http://maven.ch.cam.ac.uk/m2repo</url>
    </repository>

Then add the following under dependencies:

    <dependency>
      <groupId>uk.ac.cam.ch</groupId>
      <artifactId>chemicalTagger</artifactId>
      <version>1.3</version>
    </dependency>

The latest version of the code can be downloaded from our Bitbucket repository.

ChemicalTagger Components

ChemicalTagger has two main classes:

ChemistryPOSTagger

This class adds syntactic structure to the input text.

It first pre-processes the text by:

  • Normalising the text
  • Running the SpectraTagger (optional; only used for detecting NMR spectra)

It then tokenises the text using one of the following tokenisers:

  • OscarTokeniser (default tokeniser and used for chemistry text)
  • WhitespaceTokeniser (used mainly for non-chemistry text)

Finally, it runs the following three taggers against the text:

  • OSCAR (for chemical entities)
  • Regex (for recognising chemistry-related entities)
  • OpenNLP (for English parts-of-speech)
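
The layering of several taggers over one token stream can be illustrated with a small, self-contained sketch. It does not use the ChemicalTagger, OSCAR or OpenNLP APIs: the Tagger interface, the toy dictionary, the regex rules, the tag names and the precedence order are all assumptions made for illustration, and the library's real combination rules may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class LayeredTaggerSketch {

    /** Hypothetical tagger: returns a tag, or null when it has no opinion on the token. */
    interface Tagger {
        String tag(String token);
    }

    // Toy stand-in for a chemical entity recogniser (plain dictionary lookup).
    static final List<String> CHEMICAL_NAMES = Arrays.asList("NaOH", "ethanol", "K");
    static final Tagger CHEMICAL = token -> CHEMICAL_NAMES.contains(token) ? "CHEMICAL" : null;

    // Toy stand-in for domain-specific regex rules.
    static final Pattern NUMBER = Pattern.compile("\\d+(\\.\\d+)?");
    static final Pattern UNIT = Pattern.compile("K|mL|g");
    static final Tagger REGEX = token -> {
        if (NUMBER.matcher(token).matches()) return "NUMBER";
        if (UNIT.matcher(token).matches()) return "UNIT";
        return null;
    };

    // Toy stand-in for an English part-of-speech tagger; always answers, so it acts as the fallback.
    static final Tagger ENGLISH = token -> token.endsWith("ed") ? "VERB" : "WORD";

    /**
     * Splits the text on whitespace and gives each token the tag of the first
     * tagger that recognises it. The flag only changes which of the two
     * domain taggers is consulted first (an assumed interpretation).
     */
    static List<String> tagText(String text, boolean prioritiseChemical) {
        List<Tagger> order = prioritiseChemical
                ? Arrays.asList(CHEMICAL, REGEX, ENGLISH)
                : Arrays.asList(REGEX, CHEMICAL, ENGLISH);
        List<String> tagged = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            String tag = null;
            for (Tagger tagger : order) {
                tag = tagger.tag(token);
                if (tag != null) {
                    break; // first tagger with an opinion wins
                }
            }
            tagged.add(token + "/" + tag);
        }
        return tagged;
    }

    public static void main(String[] args) {
        String text = "NaOH was added at 300 K";
        System.out.println(tagText(text, true));
        System.out.println(tagText(text, false));
    }
}
```

In this sketch the token "K" is ambiguous (potassium or kelvin), so switching the flag changes its tag. That is the kind of conflict a priority setting has to resolve when more than one tagger claims the same token.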

ChemistrySentenceParser

This class converts a tagged sentence into a parse tree as well as an XML document.

It first builds the AST (Abstract Syntax Tree) using the lexer and parser generated by compiling the ANTLR grammar file ChemicalChunker.g.

The AST is then converted into an XML document.
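
The tree-to-XML step can be pictured as a recursive walk in which each phrase node becomes an element and each tagged token becomes a leaf element holding its text. The sketch below is only an illustration under stated assumptions: TreeNode is a hypothetical stand-in for the ANTLR tree, the element names are invented, and it uses the JDK's built-in DOM classes rather than whatever XML library ChemicalTagger itself uses.

```java
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class TreeToXmlSketch {

    /** Hypothetical parse-tree node: a label plus either children (phrase) or text (token). */
    static final class TreeNode {
        final String label;
        final String text; // null for phrase nodes
        final List<TreeNode> children = new ArrayList<>();

        TreeNode(String label, String text) {
            this.label = label;
            this.text = text;
        }

        TreeNode add(TreeNode child) {
            children.add(child);
            return this;
        }
    }

    /** Converts the whole tree into a DOM document rooted at the tree's root label. */
    static Document toXml(TreeNode root) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            doc.appendChild(toElement(doc, root));
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No DOM implementation available", e);
        }
    }

    // Recursive step: one element per node, token text as element content.
    private static Element toElement(Document doc, TreeNode node) {
        Element element = doc.createElement(node.label);
        if (node.text != null) {
            element.setTextContent(node.text);
        }
        for (TreeNode child : node.children) {
            element.appendChild(toElement(doc, child));
        }
        return element;
    }

    /** Serialises the document to a string without the XML declaration. */
    static String serialise(Document doc) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Could not serialise document", e);
        }
    }

    public static void main(String[] args) {
        TreeNode tree = new TreeNode("Sentence", null)
                .add(new TreeNode("NounPhrase", null).add(new TreeNode("CHEMICAL", "NaOH")))
                .add(new TreeNode("VerbPhrase", null).add(new TreeNode("VERB", "added")));
        System.out.println(serialise(toXml(tree)));
    }
}
```

Node labels must be valid XML element names for createElement to accept them, which is one reason a real converter may need to map or sanitise tag names before building the document.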

Running ChemicalTagger

To run ChemicalTagger, you can either use the Utils convenience method:

    Document doc = Utils.runChemicalTagger(text);

Or, for a more step-by-step approach, try the following:

    // Create a tagger with the default tokeniser and taggers.
    ChemistryPOSTagger chemPos = ChemistryPOSTagger.getDefaultInstance();

    // Alternatively, to reconfigure the tokeniser and taggers, construct it directly:
    // ChemistryPOSTagger chemPos = new ChemistryPOSTagger(ctTokeniser, oscarTagger, regexTagger, openNLPTagger);

    // Tokenise and tag the input text.
    POSContainer posContainer = chemPos.runTaggers(text);

    // To toggle prioritiseOscar and useSpectraTagger, use this overload instead:
    // POSContainer posContainer = chemPos.runTaggers(text, prioritiseOscar, useSpectraTagger);

    // Parse the tagged tokens into a tree, then convert the tree to XML.
    ChemistrySentenceParser chemistrySentenceParser = new ChemistrySentenceParser(posContainer);
    chemistrySentenceParser.parseTags();
    Document doc = chemistrySentenceParser.makeXMLDocument();