Extracting Product Information from Email Receipts using Markov Logic
Stanley Kok and Wen-tau Yih
Abstract:
Email receipts (e-receipts) frequently record e-commerce transactions between
users and online retailers, and contain a wealth of product
information. Such information could be used in a variety of applications if
it could be reliably extracted.
However, extracting product information from e-receipts poses
several challenges. For example, the high labor cost of annotating
e-receipts makes traditional supervised approaches infeasible.
E-receipts may also be generated from a variety of templates,
and are usually encoded in plain text rather than HTML, making it
difficult to discover the regularity of how product information is
presented.
In this paper, we present an
approach that addresses all these challenges. Our approach is based on
Markov logic, a language that combines probability
and logic. From a corpus of unlabeled e-receipts, we
identify all possible templates by jointly clustering
the e-receipts and the lines in them. From the non-template portions of
e-receipts,
we learn patterns describing how product information is laid out,
and use them to extract the product information.
Experiments on a corpus of real-world e-receipts demonstrate
that our approach performs well. Furthermore, the extracted information
can be reliably used as labeled data to bootstrap a supervised statistical
model, and our experiments show that such a model is able to
extract even more product information.
Download:
Paper (PDF)