## Statistical Predicate Invention

Stanley Kok and Pedro Domingos
### Abstract:

We propose statistical predicate invention as a key
problem for statistical relational learning. SPI is the problem of
discovering new concepts, properties and relations in structured data,
and generalizes hidden variable discovery in statistical models and
predicate invention in ILP. We propose an initial model for SPI based on
second-order Markov logic, in which predicates as well as arguments
can be variables, and the domain of discourse is not fully known in
advance. Our approach iteratively refines clusters of symbols based on
the clusters of symbols they appear in atoms with (e.g., it clusters
relations by the clusters of the objects they relate). Since different
clusterings are better for predicting different subsets of the atoms,
we allow multiple cross-cutting clusterings. We show that this approach
outperforms Markov logic structure learning and the recently introduced
infinite relational model on a number of relational datasets.
### Download:

Paper (PDF)

Slides

Video
### Datasets used:

Animals

Kinships

Nations

UMLS

### Supplementary Information:

#### Parameters for MLN Structure Learning and Inference

We augmented each dataset by creating an *isX* arity-1 predicate
for each constant *X* in the dataset
(e.g., *isLeopard(animal)*).
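
As an illustration of this augmentation, the following minimal sketch builds one unary *isX* predicate per constant. The helper name `make_is_predicates`, the input format (a list of capitalised constant names plus a type name), and the textual output format are assumptions made for this example; they are not the scripts used in the experiments.

```python
def make_is_predicates(constants, type_name):
    """Return (declaration, true ground atom) pairs, one per constant.

    Hypothetical helper: assumes each constant is a non-empty string such
    as "Leopard" and that type_name is the constant's type, e.g. "animal".
    """
    pairs = []
    for constant in constants:
        predicate = "is" + constant            # e.g. "isLeopard"
        declaration = f"{predicate}({type_name})"
        ground_atom = f"{predicate}({constant})"
        pairs.append((declaration, ground_atom))
    return pairs


# Example usage with two constants of type "animal".
animal_predicates = make_is_predicates(["Leopard", "Zebra"], "animal")
```

Each pair holds the predicate declaration (matching the *isLeopard(animal)* example above) and the single ground atom that is true for that constant.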

The beam search version of MLN structure learning (MSL) was allowed to run for
24 hours on all datasets. If MSL did not complete in 24 hours, we added the best
clause in its current beam to the MLN it had found at that point, and relearned
the MLN weights (by optimizing weighted pseudo-log-likelihood).

Unless stated otherwise, we used the default parameters of the Alchemy package.

For all datasets, we set *minWt* to 0 and used the
*startFromEmptyMLN* parameter. For the Animals dataset, we did not
sample atoms or clauses, and we set the length penalty to 0.0001. For the UMLS and
Kinships datasets, we had to sample atoms aggressively so that MSL
could find some rules within 24 hours: we kept all the true atoms and sampled
as many false atoms as true ones. We set the length penalty to
0.001 and the *maxNumPredicates* parameter to four. For the Nations
dataset, we did not sample atoms. The parameters *maxVars* and
*maxNumPredicates* were both set to 10, and the length penalty was set
to 0.01. All parameters were set using preliminary experiments.
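
The atom sampling described above (keep every true atom, then draw an equal number of false atoms) can be sketched as follows. This is only an illustration: the function name `sample_balanced`, the fixed seed, and the choice of uniform sampling without replacement are assumptions, since the text does not say how the false atoms were drawn.

```python
import random


def sample_balanced(true_atoms, false_atoms, seed=0):
    """Keep all true atoms and sample as many false atoms as true ones.

    Hypothetical helper: uniform sampling without replacement is an
    assumption, not a detail taken from the experiments.
    """
    rng = random.Random(seed)
    false_atoms = list(false_atoms)
    # Never request more false atoms than are available.
    k = min(len(true_atoms), len(false_atoms))
    sampled_false = rng.sample(false_atoms, k)
    return list(true_atoms), sampled_false


# Example usage with made-up atoms.
kept_true, kept_false = sample_balanced(
    ["Treats(Antibiotic, Disease)", "Diagnoses(LabProcedure, Disease)"],
    ["Treats(Disease, Antibiotic)", "Prevents(Disease, Device)",
     "Diagnoses(Disease, Disease)", "Treats(Device, Device)"],
)
```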

To evaluate the test atoms in each fold of a dataset, we ran MC-SAT for 24
hours or 10,000,000 iterations (whichever came first).

#### Examples of Multiple Clusterings MRC Learned for the UMLS Biomedical Dataset

In the above figure, the organisms are clustered in three different ways
according to: what is found in them (red), their pathologic properties (blue),
and whether they are animals/vertebrates (green).

In the above figure, there are two clusterings of "Injury or Poisoning" and
the Abnormalities according to what they are manifestations of (blue) and
what they are associated with (red).

In the above figure, the relations "diagnoses", "prevents" and "treats" are
clustered in three ways. "Antibiotic" and "Pharmacologic Substance" diagnose, prevent
and treat diseases. "Diagnostic Procedure" and "Laboratory Procedure" only diagnose
diseases; they do not prevent or treat them. "Drug Delivery Device" and "Medical Device"
prevent and treat diseases but do not diagnose them.