This data and this README has been adapted from http://www.biostat.wisc.edu/~craven/webkb to work with Alchemy. This directory contains the "University Computer Science Department" data set used for the experiments in the publications: @article{craven.mlj01 ,author = "M. Craven and S. Slattery" ,title = "Relational Learning with Statistical Predicate Invention: Better Models for Hypertext" ,journal = "Machine Learning" ,year = 2001 ,volume = 43 ,number = "1-2" ,pages = "97--119" } @inproceedings{slattery.ilp98 ,author = "S. Slattery and M. Craven" ,title = "Combining Statistical and Relational Methods for Learning in Hypertext Domains" ,booktitle = "Proceedings of the Eighth International Conference on Inductive Logic Programming" ,year = 1998 } This data set consists of Web pages and hyperlinks collected from four computer science departments: Cornell University, The University of Texas, The University of Washington, and The University of Wisconsin. The following files contain data in the input format for Alchemy: common..db The background relations defined in this file are "LinkTo" which specifies hyperlink connections, and "AllWordsCapitalized" and "HasAlphanumericWord" which are Boolean predicates characterizing the anchor text of hyperlinks. page-words..db This set of files provides a bag-of-words representation of the words that occur in pages in the data set. Each predicate in these files specifies a stemmed word, and the instances of the predicate are those pages that contain the word. There are four files in this set -- one for each university. The files differ for each training/test partition because the vocabulary was pruned by considering the frequency of word occurrences only in the training set. anchor-words..db This is the analogous set of files for words that occur in the anchor text of hyperlinks. neighborhood-words..db This is the analogous set of files for words that occur in the "neighboring" text of hyperlinks. The neighborhood of a hyperlink includes words in a single paragraph, list item, table entry, title or heading in which the hyperlink is contained. page-classes..db This set of files contains a set of predicates indicating the class of each page in the data set. For training-set instances, the true class labels are used. For test-set instances, predicted class labels are used. These predictions were made using a method that combined statistical text classifiers with a URL-based clustering method. IMPORTANT: these files should be used only for the binary target relations (DepartmentOf, InstructorsOf, and membersOfProject). department-of..db These are the training and test set instances for the target relation "DepartmentOf". instructors-of..db These are the training and test set instances for the target relation "InstructorsOf". members-of-project..db These are the training and test set instances for the target relation "MembersOfProject". student..db These are the training and test set instances for the target relation "StudentPage". course..db These are the training and test set instances for the target relation "CoursePage". research.project..db These are the training and test set instances for the target relation "ResearchProjectPage". faculty..db These are the training and test set instances for the target relation "FacultyPage". To create input databases for Alchemy, concatenate the database files corresponding to the predicates you want. Create a separate database file for each university (this greatly speeds up learning by reducing the number of ground clauses constructed). For example, to train on Cornell, Texas and Washington while testing on Wisconsin for the "InstructorsOf" relation, you should, for each x in {cornell, texas, washington}, concatenate the following files: common.db page-words.x.db anchor-words.x.db neighborhood-words.x.db page-classes.x.db instructors-of.x.db and use each resulting .db file and the -multipleDatabases option to learn weights. To learn the "Student" target relation using the same training/test partition, you should concatenate the following files: common.db page-words.x.db anchor-words.x.db neighborhood-words.x.db student.x.db