In Section 6, we assumed the citation data was already segmented, i.e., it was known which text belonged to which field, making InField an evidence predicate. In general this is not the case, and we must segment the text into its fields ourselves. A common approach is to use a hidden Markov model (HMM), and we demonstrate how to do this with Alchemy on the unsegmented Cora data.
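As background, the HMM view of segmentation amounts to decoding the most likely sequence of fields for a token sequence, e.g. with the Viterbi algorithm: states are fields, observations are tokens. The following Python sketch is purely illustrative; the field names, probabilities, and smoothing constant are assumptions, not quantities learned from Cora.

```python
# Minimal Viterbi decoder for an HMM segmenter: states are citation
# fields, observations are tokens. All names and numbers used with it
# are illustrative, not taken from the Cora data.
import math

def viterbi(tokens, states, start_p, trans_p, emit_p):
    """Return the most likely state (field) sequence for the token list."""
    # V[i][s] = log-probability of the best path ending in state s at position i
    # (unseen tokens get a small smoothing probability of 1e-6)
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s].get(tokens[0], 1e-6))
          for s in states}]
    back = [{}]
    for i in range(1, len(tokens)):
        V.append({})
        back.append({})
        for s in states:
            best_prev = max(states,
                            key=lambda p: V[i - 1][p] + math.log(trans_p[p][s]))
            V[i][s] = (V[i - 1][best_prev] + math.log(trans_p[best_prev][s])
                       + math.log(emit_p[s].get(tokens[i], 1e-6)))
            back[i][s] = best_prev
    # Trace the best path backwards from the most likely final state
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for i in range(len(tokens) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))
```

The Markov logic formulation below replaces the hand-specified probability tables with weighted formulas whose weights are learned from data.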
Now the InField predicate is non-evidence, leaving the text, in the form of the Token predicate, as our only evidence. We model the observation matrix of our HMM with the rule
Token(+t, i, c) => InField(i, +f, c)
and the transition matrix via
InField(i, +f, c) => InField(i+1, +f, c)
We also want to impose the constraint that a position in a citation can belong to at most one field:
!(f1 = f2) => (!InField(i, +f1, c) v !InField(i, +f2, c))
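Putting the pieces together, the rules above would live in a single MLN file together with type and predicate declarations. The following sketch of seg.mln is illustrative only: the field constants and declarations are assumptions, and the actual file would typically declare further types and formulas. In Alchemy, a trailing period marks a formula as a hard constraint.

```
// Illustrative sketch of seg.mln (declarations and constants are assumed)
field = {Fauthor, Ftitle, Fvenue}
Token(token, position, citation)
InField(position, field, citation)

// Observation matrix: which tokens occur in which fields
Token(+t, i, c) => InField(i, +f, c)

// Transition matrix between fields of adjacent positions
InField(i, +f, c) => InField(i+1, +f, c)

// A position belongs to at most one field (hard constraint)
!(f1 = f2) => (!InField(i, +f1, c) v !InField(i, +f2, c)).
```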
We learn the weights with the call:
bin/learnwts -d -i seg.mln -o seg-out.mln -t cora-unseg-train.db -ne InField
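For reference, an Alchemy .db file simply lists true ground atoms, one per line. Since the weights are learned discriminatively (the -d flag) with InField as the non-evidence predicate, the training database contains both the Token evidence and the InField labels. A fragment with made-up constants:

```
// Illustrative fragment of cora-unseg-train.db (constants are made up)
Token(Tsmith, 0, C1)
Token(Tmarkov, 1, C1)
Token(Tlogic, 2, C1)
InField(0, Fauthor, C1)
InField(1, Ftitle, C1)
InField(2, Ftitle, C1)
```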
and use the output file to perform inference with:
bin/infer -ms -i seg-out.mln -r seg.result -e cora-unseg-test.db -q InField
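The result file lists one query atom per line followed by its marginal probability, as estimated here by MC-SAT (the -ms flag). The following Python sketch, which assumes that line format and uses illustrative constant names, picks the most probable field for each position of each citation:

```python
# Hedged sketch: pick, for each (citation, position), the field whose
# InField marginal is highest. Assumes each line of seg.result has the
# form "InField(pos,field,citation) probability"; the constant names
# used with it are illustrative.

def best_fields(result_lines, threshold=0.5):
    best = {}
    for line in result_lines:
        atom, prob_str = line.strip().rsplit(" ", 1)
        prob = float(prob_str)
        # Strip "InField(" and the closing ")" to get the argument list
        args = [a.strip() for a in atom[len("InField("):-1].split(",")]
        pos, field, cit = args
        key = (cit, pos)
        if prob >= threshold and (key not in best or prob > best[key][1]):
            best[key] = (field, prob)
    # Keep only the winning field name per (citation, position)
    return {k: f for k, (f, _) in best.items()}
```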
Here we have treated segmentation and entity resolution as two separate tasks. This is a valid approach; however, information obtained in the entity resolution step can also be used to improve the segmentation. For the state of the art in this type of joint inference for information extraction using Markov logic, see [2].