Statistical Predicate Invention

Stanley Kok and Pedro Domingos


We propose statistical predicate invention as a key problem for statistical relational learning. SPI is the problem of discovering new concepts, properties and relations in structured data, and generalizes hidden variable discovery in statistical models and predicate invention in ILP. We propose an initial model for SPI based on second-order Markov logic, in which predicates as well as arguments can be variables, and the domain of discourse is not fully known in advance. Our approach iteratively refines clusters of symbols based on the clusters of symbols they appear in atoms with (e.g., it clusters relations by the clusters of the objects they relate). Since different clusterings are better for predicting different subsets of the atoms, we allow multiple cross-cutting clusterings. We show that this approach outperforms Markov logic structure learning and the recently introduced infinite relational model on a number of relational datasets.


Paper (PDF)

Datasets used:


Supplementary Information:

Parameters for MLN Structure Learning and Inference

We augmented each dataset by creating a unary predicate isX for each constant X in the dataset (e.g., isLeopard(animal)).
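The augmentation above can be sketched as follows. This is only an illustrative sketch, not the preprocessing code used in the experiments: the function name make_is_predicates, the string representation of atoms, and the assumption that isX is true only for the constant X itself are all hypothetical.

```python
def make_is_predicates(constants_by_type):
    """Build one unary isX predicate per constant X.

    constants_by_type maps a type name (e.g. "animal") to a list of
    non-empty constant names. Returns (declarations, ground_atoms).
    """
    declarations = []
    ground_atoms = []
    for type_name, constants in constants_by_type.items():
        for const in constants:
            # Predicate name is "is" followed by the capitalized constant name.
            pred = "is" + const[0].upper() + const[1:]
            declarations.append(f"{pred}({type_name})")
            # Assumption: the only true grounding of isX is isX(X).
            ground_atoms.append(f"{pred}({const})")
    return declarations, ground_atoms


decls, atoms = make_is_predicates({"animal": ["Leopard", "Dolphin"]})
print(decls)  # ['isLeopard(animal)', 'isDolphin(animal)']
print(atoms)  # ['isLeopard(Leopard)', 'isDolphin(Dolphin)']
```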

The beam search version of MLN structure learning (MSL) was allowed to run for 24 hours on all datasets. If MSL did not complete in 24 hours, we added the best clause in its current beam to the MLN it had found at that point, and relearned the MLN weights (by optimizing weighted pseudo-log-likelihood).

Unless stated otherwise, we used the default parameters of the Alchemy package.
For all datasets, we set minWt to 0, and used the startFromEmptyMLN parameter. For the Animals dataset, we did not sample atoms or clauses. The length penalty was set to 0.0001. For the UMLS and Kinship datasets, we had to sample atoms aggressively so that MSL could find some rules within 24 hours: we kept all the true atoms, and sampled the same number of false atoms as true ones. We set the length penalty to 0.001, and the maxNumPredicates parameter to four. For the Nations dataset, we did not sample atoms. The parameters maxVars and maxNumPredicates were both set to 10, and the length penalty was set to 0.01. All parameters were set using preliminary experiments.
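The balanced sampling of atoms described above (keep every true atom, draw an equal number of false atoms) could look roughly like the sketch below. It is an illustration only and does not reproduce Alchemy's internal sampling; the function name sample_balanced_atoms, the seed parameter, and the use of uniform sampling without replacement are assumptions.

```python
import random


def sample_balanced_atoms(true_atoms, false_atoms, seed=0):
    """Keep all true atoms and sample the same number of false atoms.

    Assumption: false atoms are drawn uniformly without replacement.
    If there are fewer false atoms than true atoms, all of them are kept.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    true_atoms = list(true_atoms)
    false_atoms = list(false_atoms)
    k = min(len(true_atoms), len(false_atoms))
    return true_atoms, rng.sample(false_atoms, k)


# Hypothetical toy atoms, for illustration only.
true_atoms = ["treats(Drug, Disease)", "diagnoses(Test, Disease)"]
false_atoms = [f"treats(Drug, C{i})" for i in range(100)]
kept_true, kept_false = sample_balanced_atoms(true_atoms, false_atoms)
print(len(kept_true), len(kept_false))  # 2 2
```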

To evaluate the test atoms in each fold of a dataset, we ran MC-SAT for 24 hours or 10,000,000 iterations, whichever came first.
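The stopping rule above (a time limit or an iteration limit, whichever is reached first) can be sketched generically as below. This is not MC-SAT and not Alchemy code: the function run_with_budget and the step callback, which stands in for one sampling iteration, are hypothetical names used only to show the dual budget.

```python
import time


def run_with_budget(step, max_iterations=10_000_000,
                    max_seconds=24 * 60 * 60, clock=time.monotonic):
    """Call step() until either the iteration or the time budget is used up.

    Returns the number of iterations actually performed. The time check
    happens before each step, so a step already in progress is not interrupted.
    """
    start = clock()
    iterations = 0
    while iterations < max_iterations and clock() - start < max_seconds:
        step()  # placeholder for one sampler iteration
        iterations += 1
    return iterations


# Small demonstration with a trivial step and a tiny iteration budget.
done = run_with_budget(lambda: None, max_iterations=1000)
print(done)  # 1000
```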

Examples of Multiple Clusterings MRC Learned for the UMLS Biomedical Dataset

In the above figure, the organisms are clustered in three different ways, according to what is found in them (red), their pathologic properties (blue), and whether they are animals/vertebrates (green).

In the above figure, "Injury or Poisoning" and the Abnormalities are clustered in two ways, according to what they are manifestations of (blue) and what they are associated with (red).

In the above figure, the relations "diagnoses", "prevents" and "treats" are clustered in three ways. "Antibiotic" and "Pharmacologic Substance" diagnose, prevent and treat diseases. "Diagnostic Procedure" and "Laboratory Procedure" only diagnose diseases; they do not prevent or treat them. "Drug Delivery Device" and "Medical Device" prevent and treat diseases but do not diagnose them.