Next: 7 Hidden Markov Models Up: The Alchemy Tutorial Previous: 5.2 Information Retrieval

# 6 Entity Resolution

Entity resolution is an important, well-studied step in data cleaning and information extraction. Markov logic allows an intuitive and elegant approach to this task. To demonstrate entity resolution with Alchemy, we look at the Cora dataset, which contains citations of computer science publications. Citations of the same paper often appear in different forms, and the task here is to determine which citations refer to the same paper. The model used here is based on that of [].

We start with one basic evidence predicate, HasToken, which tells us the text of each token and the "field" (author, title, or venue) in which it appears in each citation. The predicate HasToken(t, f, c) states that token t is present in field f of citation c.

Given this evidence, we want to predict which citations are the same, indicated by the predicate SameCitation. We determine identical citations by looking at each of the fields author, title, and venue and determining their similarity. This is expressed by the predicate SameField(f, c1, c2), where f is a field (author, title, or venue) and c1 and c2 are citations. To recap, the predicates we need are:

```
HasToken(token, field, citation)
SameField(field, citation, citation)
SameCitation(citation, citation)
```

The formulas we need to perform entity resolution are very compact thanks to the per-constant + operator. During weight learning, this operator produces a separate clause (and hence learns a separate weight) for each value of the variable to which it is applied. For example, the first rule we want to express for entity resolution is "If the same token occurs in the same field in two separate citations, then the field is the same"; we want one such rule for each token and field pair. In Markov logic, this looks like

```
HasToken(+t, +f, c1) ^ HasToken(+t, +f, c2) => SameField(+f, c1, c2)
```
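To make the effect of the + operator concrete, the following Python sketch enumerates the clauses that such a template stands for. It is only an illustration, not how Alchemy is implemented: the toy token and field constants are hypothetical, the helper `expand_per_constant` is our own name, and the rule is written with the declared HasToken predicate.

```python
from itertools import product

# Hypothetical toy domains; the real Cora vocabulary is far larger.
tokens = ["Learning", "Markov"]
fields = ["Author", "Title", "Venue"]

# The rule with its "+" variables written as format placeholders.
template = (
    "HasToken({t}, {f}, c1) ^ HasToken({t}, {f}, c2) "
    "=> SameField({f}, c1, c2)"
)

def expand_per_constant(template, tokens, fields):
    """Instantiate the template once per (token, field) pair.

    Each resulting clause would receive its own learned weight.
    """
    return [template.format(t=t, f=f) for t, f in product(tokens, fields)]

clauses = expand_per_constant(template, tokens, fields)
for clause in clauses:
    print(clause)
```

With two tokens and three fields the template yields six clauses. The citation variables c1 and c2 carry no +, so they stay as variables and each clause is shared across all citation pairs.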

Also, we want to make the connection from same field to same citation, doing it for each field:

```
SameField(+f, c1, c2) => SameCitation(c1, c2)
```
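The sketch below illustrates how the per-field weights combine. It makes strong simplifying assumptions: the SameField values for a citation pair are treated as fixed, and no other formula (such as transitivity) mentions SameCitation. Under those assumptions, each rule is violated only when the field matches and the citations are judged different, so the log-odds of SameCitation(c1, c2) is the sum of the weights of the matching fields. The weights shown are hypothetical, not learned values.

```python
import math

# Hypothetical weights, one per field; real values would come from er-out.mln.
field_weights = {"Author": 1.2, "Title": 2.5, "Venue": 0.4}

def same_citation_probability(matching_fields, weights):
    """P(SameCitation(c1, c2)) when only the per-field rules apply.

    The log-odds is the sum of the weights of the fields for which
    SameField(f, c1, c2) holds; the result is its logistic transform.
    """
    log_odds = sum(weights[f] for f in matching_fields)
    return 1.0 / (1.0 + math.exp(-log_odds))

p_none = same_citation_probability([], field_weights)
p_title = same_citation_probability(["Title"], field_weights)
p_all = same_citation_probability(["Author", "Title", "Venue"], field_weights)
print(p_none, p_title, p_all)
```

With no matching field the rules are indifferent and the probability is 0.5. Each additional matching field with a positive weight pushes the probability toward 1, and fields with larger weights count for more.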

Finally, we want to add transitivity to the model (if c1 and c2 are the same citation and c2 and c3 are the same citation, then c1 and c3 are the same citation):

```
SameCitation(c1, c2) ^ SameCitation(c2, c3) => SameCitation(c1, c3)
```
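As an illustration of what transitivity means for the final grouping, the sketch below closes a set of pairwise matches under transitivity with a small union-find structure. This is a deterministic analogue only: in the MLN the transitivity formula carries a weight and influences probabilities during inference rather than being enforced as a hard rule. The function name and citation constants are hypothetical.

```python
def cluster_citations(citations, same_pairs):
    """Group citations into clusters by closing `same_pairs` under transitivity."""
    parent = {c: c for c in citations}

    def find(c):
        # Follow parent links to the cluster representative,
        # shortening the path along the way.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    # Merge the clusters of every pair judged to be the same citation.
    for a, b in same_pairs:
        parent[find(a)] = find(b)

    clusters = {}
    for c in citations:
        clusters.setdefault(find(c), set()).add(c)
    return sorted(sorted(members) for members in clusters.values())

clusters = cluster_citations(
    ["C1", "C2", "C3", "C4"],
    [("C1", "C2"), ("C2", "C3")],
)
print(clusters)
```

Here C1 and C3 end up in the same cluster even though they were never directly matched, which is the behavior the transitivity formula encourages.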

We run weight learning on the MLN and data with the following command:

```
learnwts -d -i er.mln -o er-out.mln -t cora-seg-train.db \
  -ne SameField,SameCitation
```

which produces the clauses with learned weights in the file er-out.mln. We can use this to perform inference on the test data:

```
infer -ms -i er-out.mln -r er.result -e cora-seg-test.db \
  -q SameField,SameCitation
```

The file er.result then contains the marginal probabilities of the query predicates. Further refinements of this model improve the results significantly; for example, we could add transitivity on the SameField predicate. For the state-of-the-art model in Markov logic for entity resolution, see [3] and [2].
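To turn the marginal probabilities into match decisions, a small post-processing script can help. The sketch below assumes that each line of the result file has the form of a ground atom followed by its probability, separated by whitespace; check this against your own er.result before relying on it. The sample lines, the helper names, and the 0.5 threshold are hypothetical.

```python
def parse_marginals(lines):
    """Parse lines of the assumed form 'Pred(A,B) 0.93' into {atom: probability}."""
    marginals = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split on the last run of whitespace: atom on the left, probability on the right.
        atom, prob = line.rsplit(None, 1)
        marginals[atom] = float(prob)
    return marginals

def likely_matches(marginals, predicate="SameCitation", threshold=0.5):
    """Return the atoms of `predicate` whose marginal is at least `threshold`."""
    prefix = predicate + "("
    return sorted(
        atom for atom, prob in marginals.items()
        if atom.startswith(prefix) and prob >= threshold
    )

# Hypothetical sample lines; real values would be read from er.result.
sample = [
    "SameCitation(C1,C2) 0.97",
    "SameCitation(C1,C3) 0.08",
    "SameField(Title,C1,C2) 0.91",
]
marginals = parse_marginals(sample)
matches = likely_matches(marginals)
print(matches)
```

To apply this to real output, pass the open file object (for example, `open("er.result")`) to `parse_marginals` in place of `sample`. The threshold trades precision against recall and should be tuned on held-out data.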

Marc Sumner 2010-01-22