

5.1 Text Classification and Collective Classification

Text classification is a widely used application of machine learning. We demonstrate it on a subset of the WebKB dataset, which contains the text of web pages from four universities. Each page belongs to one of four classes: course, faculty, research project, or student.

If a given page contains a given word, then the predicate HasWord(word, page) is true for that pair; otherwise it is false. This information is given as evidence, and we wish to infer the class of each page, expressed by the predicate Topic(class, page). We therefore need a rule

HasWord(+w, p) => Topic(+c, p)

for each (word, class) pair and a rule stating that, given no evidence, a page does not belong to a class:

!Topic(c, p)
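To make the effect of the + prefix concrete, here is a minimal Python sketch that enumerates the per-constant copies of the first rule, one for each (word, class) pair, each of which would receive its own weight. The word and class constants are hypothetical placeholders, not the actual WebKB vocabulary, and the helper name is ours, not part of Alchemy.

```python
from itertools import product


def expand_per_constant_template(words, classes):
    """Return one formula string per (word, class) pair.

    This mirrors what the + prefix requests: a separately weighted copy of
    the formula for every combination of constants of the marked variables.
    The page variable p is left unbound because it carries no + prefix.
    """
    return [
        f"HasWord({w}, p) => Topic({c}, p)"
        for w, c in product(words, classes)
    ]


# Hypothetical constants chosen only for illustration.
words = ["Homework", "Professor", "Thesis"]
classes = ["Course", "Faculty", "ResearchProject", "Student"]

formulas = expand_per_constant_template(words, classes)
for formula in formulas:
    print(formula)
```

With 3 words and 4 classes this produces 12 formulas, which shows why the number of learned weights grows with the product of vocabulary size and number of classes.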

These two rules are the entire model. In fact, we can omit the second rule: unit clauses are helpful in many domains and, for this reason, Alchemy adds them to the MLN by default when weights are learned (this can be suppressed with the option -noAddUnitClauses). Running weight learning yields a weight for every (word, class) pair that represents how good a predictor the word is for the class. We can extend this model by stating in the predicate declaration that each page must belong to exactly one class:

Topic(class!, page)
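Once each page takes exactly one class, the text-only model behaves much like a multiclass logistic regression. For a single page, the only groundings of HasWord(w, p) => Topic(c, p) whose truth depends on the chosen class are those where the word occurs on the page and c is the chosen class; the unit clause contributes the same amount for every choice. Under that reading, and ignoring any other formulas, the class probabilities are a softmax over summed weights. The sketch below assumes hypothetical weights and is not output from Alchemy.

```python
import math


def class_probabilities(page_words, weights, classes):
    """Softmax over the summed weights of the words present on the page.

    weights maps (word, class) to a learned weight; missing pairs count as 0.
    """
    scores = {
        c: sum(weights.get((w, c), 0.0) for w in page_words)
        for c in classes
    }
    # Subtract the maximum score for numerical stability.
    max_score = max(scores.values())
    exp_scores = {c: math.exp(s - max_score) for c, s in scores.items()}
    total = sum(exp_scores.values())
    return {c: v / total for c, v in exp_scores.items()}


classes = ["Course", "Faculty", "ResearchProject", "Student"]
# Hypothetical weights, for illustration only.
weights = {
    ("Homework", "Course"): 2.0,
    ("Homework", "Student"): 0.5,
    ("Professor", "Faculty"): 1.5,
}

probs = class_probabilities({"Homework"}, weights, classes)
predicted = max(probs, key=probs.get)  # the most probable single class
print(predicted, probs)
```

Picking the highest-scoring class corresponds to the most probable assignment for that page in this simplified setting.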

We run weight learning with the following command:

learnwts -d -i text-class.mln -o text-class-out.mln \
  -t text-class-train.db -ne Topic
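The training database passed with -t holds the ground atoms for HasWord and Topic. As a rough sketch of how such a file could be generated from tokenized pages, the Python below emits one atom per line. It assumes constants must start with an uppercase letter, and the "P" and "W" prefixes, the page identifier, and the helper names are hypothetical; the WebKB files distributed with Alchemy may name their constants differently.

```python
def to_constant(prefix, raw):
    """Build an identifier that starts with an uppercase prefix.

    Assumption: constants begin with an uppercase letter, so we prepend one
    and drop characters that are not letters or digits.
    """
    cleaned = "".join(ch for ch in raw if ch.isalnum())
    return prefix + cleaned


def training_atoms(pages):
    """Turn {page_id: (label, words)} into a list of ground-atom strings."""
    atoms = []
    for page_id, (label, words) in pages.items():
        page = to_constant("P", page_id)
        atoms.append(f"Topic({label}, {page})")
        # Each distinct word yields one HasWord atom; repeats are ignored.
        for word in sorted(set(words)):
            atoms.append(f"HasWord({to_constant('W', word)}, {page})")
    return atoms


# Hypothetical page used only for illustration.
pages = {"cs101-home": ("Course", ["homework", "syllabus", "homework"])}
atoms = training_atoms(pages)
print("\n".join(atoms))
```

Because HasWord records only presence, word counts are discarded; a test database would be built the same way but without the Topic atoms.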

We can then use the classifier to classify test instances with

infer -m -i text-class-out.mln -r text-class.result \
  -e text-class-test.db -q Topic

We can extend this model to demonstrate how to perform hypertext classification with Alchemy. WebKB also contains information on which pages are linked to each other, via the predicate LinkTo(linkid, page, page). We can incorporate this into a rule to perform hypertext classification:

Topic(c, p1) ^ LinkTo(id, p1, p2) => Topic(c, p2)

We add LinkTo from the dataset to our evidence and this one formula extends our text classifier to a hypertext classifier. We learn weights and infer as before:

learnwts -d -i hypertext-class.mln -o hypertext-class-out.mln \
  -t text-class-train.db,links-train.db -ne Topic

infer -m -i hypertext-class-out.mln -r hypertext-class.result \
  -e hypertext-class-test.db -q Topic
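To see what the link rule rewards, it helps to count its groundings in a small world. Assuming each page has exactly one class, a grounding of Topic(c, p1) ^ LinkTo(id, p1, p2) => Topic(c, p2) is false only when c is the class of p1 and p2 has a different class, so each link whose endpoints disagree violates exactly one grounding. The Python sketch below uses hypothetical pages and links.

```python
def count_violated_link_groundings(topic_of, links):
    """Count false groundings of the link rule.

    topic_of maps each page to its single class; links is a list of
    (link_id, source_page, target_page) triples. A link contributes one
    violation when its two pages have different classes.
    """
    return sum(1 for _, p1, p2 in links if topic_of[p1] != topic_of[p2])


classes = ["Course", "Faculty", "ResearchProject", "Student"]
# Hypothetical world, for illustration only.
topic_of = {"PageA": "Course", "PageB": "Course", "PageC": "Faculty"}
links = [
    ("L1", "PageA", "PageB"),  # same class: no violation
    ("L2", "PageA", "PageC"),  # different classes: one violation
    ("L3", "PageC", "PageB"),  # different classes: one violation
]

violations = count_violated_link_groundings(topic_of, links)
# Every (link, class) combination is one grounding of the rule.
satisfied = len(links) * len(classes) - violations
print(violations, satisfied)
```

Since the rule carries a single weight, a positive learned weight favors assignments in which linked pages share a class, which is what couples the predictions for different pages.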

These rules represent a special case of collective classification. MLNs for many other collective classification tasks can be expressed in the same manner. For example, a social network can be modeled by replacing LinkTo() with Friends() and inferring the topics of the blogs in which people participate instead of the topics of web pages.


Marc Sumner 2010-01-22