In Section 6, we assumed the citation data was already segmented, i.e., it was known which text belonged to which field, making InField an evidence predicate. In general this is not the case, and we must segment the text into its fields ourselves. A common approach is to use a hidden Markov model (HMM), and we demonstrate how to do this with Alchemy on the unsegmented Cora data.
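As background, the HMM view of segmentation amounts to decoding the most likely sequence of fields for a token sequence, e.g. with the Viterbi algorithm: states are fields, observations are tokens. The following Python sketch is purely illustrative; the field names, probabilities, and smoothing constant are assumptions, not quantities learned from Cora.

```python
# Minimal Viterbi decoder for an HMM segmenter: states are citation
# fields, observations are tokens. All names and numbers used with it
# are illustrative, not taken from the Cora data.
import math

def viterbi(tokens, states, start_p, trans_p, emit_p):
    """Return the most likely state (field) sequence for the token list."""
    # V[i][s] = log-probability of the best path ending in state s at position i
    # (unseen tokens get a small smoothing probability of 1e-6)
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s].get(tokens[0], 1e-6))
          for s in states}]
    back = [{}]
    for i in range(1, len(tokens)):
        V.append({})
        back.append({})
        for s in states:
            best_prev = max(states,
                            key=lambda p: V[i - 1][p] + math.log(trans_p[p][s]))
            V[i][s] = (V[i - 1][best_prev] + math.log(trans_p[best_prev][s])
                       + math.log(emit_p[s].get(tokens[i], 1e-6)))
            back[i][s] = best_prev
    # Trace the best path backwards from the most likely final state
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for i in range(len(tokens) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))
```

The Markov logic formulation below replaces the hand-specified probability tables with weighted formulas whose weights are learned from data.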
Now the InField predicate is non-evidence, leaving the text, in the form of the Token predicate, as our only evidence. We model the observation matrix of our HMM with the rule
Token(+t, i, c) => InField(i, +f, c)
and the transition matrix via
InField(i, +f, c) => InField(i+1, +f, c)
We also want to impose the constraint that a position in a citation can belong to at most one field:
!(f1 = f2) => (!InField(i, +f1, c) v !InField(i, +f2, c))
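Putting the pieces together, the rules above would live in a single MLN file together with type and predicate declarations. The following sketch of seg.mln is illustrative only: the field constants and declarations are assumptions, and the actual file would typically declare further types and formulas. In Alchemy, a trailing period marks a formula as a hard constraint.

```
// Illustrative sketch of seg.mln (declarations and constants are assumed)
field = {Fauthor, Ftitle, Fvenue}
Token(token, position, citation)
InField(position, field, citation)

// Observation matrix: which tokens occur in which fields
Token(+t, i, c) => InField(i, +f, c)

// Transition matrix between fields of adjacent positions
InField(i, +f, c) => InField(i+1, +f, c)

// A position belongs to at most one field (hard constraint)
!(f1 = f2) => (!InField(i, +f1, c) v !InField(i, +f2, c)).
```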
We learn the weights with the call:
bin/learnwts -d -i seg.mln -o seg-out.mln -t cora-unseg-train.db -ne InField
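For reference, an Alchemy .db file simply lists true ground atoms, one per line. Since the weights are learned discriminatively (the -d flag) with InField as the non-evidence predicate, the training database contains both the Token evidence and the InField labels. A fragment with made-up constants:

```
// Illustrative fragment of cora-unseg-train.db (constants are made up)
Token(Tsmith, 0, C1)
Token(Tmarkov, 1, C1)
Token(Tlogic, 2, C1)
InField(0, Fauthor, C1)
InField(1, Ftitle, C1)
InField(2, Ftitle, C1)
```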
and use the output file to perform inference with:
bin/infer -ms -i seg-out.mln -r seg.result -e cora-unseg-test.db -q InField
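The result file lists one query atom per line followed by its marginal probability, as estimated here by MC-SAT (the -ms flag). The following Python sketch, which assumes that line format and uses illustrative constant names, picks the most probable field for each position of each citation:

```python
# Hedged sketch: pick, for each (citation, position), the field whose
# InField marginal is highest. Assumes each line of seg.result has the
# form "InField(pos,field,citation) probability"; the constant names
# used with it are illustrative.

def best_fields(result_lines, threshold=0.5):
    best = {}
    for line in result_lines:
        atom, prob_str = line.strip().rsplit(" ", 1)
        prob = float(prob_str)
        # Strip "InField(" and the closing ")" to get the argument list
        args = [a.strip() for a in atom[len("InField("):-1].split(",")]
        pos, field, cit = args
        key = (cit, pos)
        if prob >= threshold and (key not in best or prob > best[key][1]):
            best[key] = (field, prob)
    # Keep only the winning field name per (citation, position)
    return {k: f for k, (f, _) in best.items()}
```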
Here we have treated segmentation and entity resolution as two separate tasks. This is a valid approach; however, information obtained in the entity resolution step can also be used to improve the segmentation. For the state of the art in this type of joint inference for information extraction using Markov logic, see [2].