Google Refine Blog: Progress on Generic Reconciliation Support

Gridworks 1.1 provides a feature for reconciling cells in a data set to topics on Freebase (e.g., matching "Tom Hanks" to the topic identified by "/en/tom_hanks" and viewable here). Reconciliation has a few advantages. First, it can match both "Tom Hanks" and "Thomas Jeffrey Hanks" to the same topic, making the data more internally consistent and allowing trends to emerge more clearly. Second, by connecting the data to Freebase topics, we can pull data from Freebase to augment the original data set, say, by adding nationality and birthplace to each person mentioned in the data set. Finally, there are implicit relationships between cells in the data set (for example, a "directed" relationship between a "Director" cell and a "Movie" cell), so once cells are connected to Freebase topics, we can load those relationships into Freebase, enriching it for other people to benefit from.

But of course, all those benefits apply to other databases, too. If you have your own private database, or if you work primarily with your university's data, or the Library of Congress's data, or any other data source, then it makes sense to reconcile your data set with that source. We are reworking Gridworks to support reconciliation against arbitrary data sources (discussion thread). Here is a brief update on the current development.

In the source trunk, the Reconcile dialog box has been changed to support registering standard reconciliation services that adhere to this developing API spec.

Some sample APIs can be experimented with "in the raw" here. Each reconciliation service can have its own semantics. For example, in the Netflix reconciliation service, "types" does not mean Freebase types but film genres.
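To make the idea of service-specific semantics concrete, here is a minimal sketch of a toy reconciliation function in Python. It is not the Netflix service or the developing API spec: the catalog, the identifiers, the `reconcile` function, and the result fields (`id`, `name`, `score`, `match`) are all assumptions made up for illustration. The only point it tries to show is that the service itself decides what "types" means, here film genres.

```python
from difflib import SequenceMatcher

# Hypothetical in-memory catalog; a real service would query its own database.
CATALOG = [
    {"id": "film-1", "name": "Chicago", "genres": ["musical"]},
    {"id": "film-2", "name": "Chicago Hope", "genres": ["drama"]},
    {"id": "film-3", "name": "Big", "genres": ["comedy"]},
]


def reconcile(query, types=None, limit=3):
    """Return candidate matches for `query`, best score first.

    This toy service interprets `types` as film genres; another service
    could give the same parameter a completely different meaning.
    """
    candidates = []
    for entry in CATALOG:
        # Skip entries that share no genre with the requested "types".
        if types and not set(types) & set(entry["genres"]):
            continue
        score = SequenceMatcher(None, query.lower(), entry["name"].lower()).ratio()
        if score > 0:
            candidates.append({
                "id": entry["id"],
                "name": entry["name"],
                "score": score,
                "match": score == 1.0,  # exact name match (assumed rule)
            })
    candidates.sort(key=lambda c: c["score"], reverse=True)
    return candidates[:limit]
```

Calling `reconcile("Chicago")` ranks the exact-name entry first, while `reconcile("Chicago", types=["drama"])` only considers entries tagged with that genre. The string-similarity scoring is a stand-in; a real service would likely use its own search backend.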

The service can also specify how to formulate URLs from identifiers. In the example here, the Netflix identifier is "60020675".
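One simple way a service could describe this mapping is a URL template with a placeholder for the identifier. The sketch below assumes a `{{id}}` placeholder and an `example.org` URL; both are illustrative guesses, not the actual format used by Gridworks or the Netflix service.

```python
from urllib.parse import quote

# Hypothetical template; the placeholder syntax and host are assumptions.
VIEW_URL_TEMPLATE = "http://example.org/movies/{{id}}"


def build_view_url(template, identifier):
    """Substitute a percent-encoded identifier into a URL template."""
    return template.replace("{{id}}", quote(identifier, safe=""))


url = build_view_url(VIEW_URL_TEMPLATE, "60020675")
```

With the identifier from the example, `url` would be `http://example.org/movies/60020675`. Percent-encoding the identifier keeps the URL valid even when an identifier contains slashes or spaces.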

The service can also customize various auto suggest widgets in the UI. For example, here, the Netflix service automatically suggests only film topics (rather than cities) for "Chicago".

If you're interested in this development, check out Gridworks' trunk and try out the feature. The Netflix reconciliation service mentioned is implemented as an Acre app (open source). Feel free to develop your own reconciliation service, plug it into Gridworks, and tell us what works and what doesn't (mailing list).

Tuesday, June 29, 2010