Wednesday, November 10, 2010

Google Refine 2.0

(Cross posted on Google Open Source blog.)

Our acquisition of Metaweb back in July also brought along Freebase Gridworks, an open source software project for cleaning and enhancing entire data sets. Today we’re announcing that the project has been renamed to Google Refine and version 2.0 is now available.

Google Refine is a power tool for working with messy data sets, including cleaning up inconsistencies, transforming them from one format into another, and extending them with new data from external web services or other databases. Version 2.0 introduces a new extensions architecture, a reconciliation framework for linking records to other databases (like Freebase), and a ton of new transformation commands and expressions.

Freebase Gridworks 1.0 has already been well received by the data journalism and open government data communities (you can read how the Chicago Tribune, ProPublica and have used it) and we are very excited by what they and others will be able to do with this new release. To learn more about what you can do with Google Refine 2.0, watch the following screencasts:

Tuesday, June 29, 2010

Progress on Generic Reconciliation Support

Gridworks 1.1 provides a feature for reconciling cells in a data set to topics on Freebase (e.g., matching "Tom Hanks" to the topic identified by "/en/tom_hanks" and viewable here). There are a few advantages to performing reconciliation. First, reconciliation can match both "Tom Hanks" and "Thomas Jeffrey Hanks" to the same topic, making the data more internally consistent and allowing trends to emerge more clearly. Second, by connecting the data to Freebase topics, we can pull data from Freebase to augment the original data set, say, by adding nationality and birthplace to each person mentioned in the data set. Finally, because there are implicit relationships between cells in the data set (such as a "directed" relationship between a "Director" cell and a "Movie" cell), connecting cells to Freebase topics lets us load those relationships into Freebase, enriching Freebase for other people to benefit from.
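The first benefit can be illustrated with a toy sketch. The alias table and the `reconcile` function below are hypothetical and are not Gridworks code; a real reconciliation service would match candidates against Freebase rather than do an exact lookup. Only the identifier "/en/tom_hanks" comes from the example above.

```python
# Toy alias table: every known spelling maps to one topic identifier.
# The table itself is an assumption made for illustration.
ALIASES = {
    "tom hanks": "/en/tom_hanks",
    "thomas jeffrey hanks": "/en/tom_hanks",
}


def reconcile(cell_value):
    """Return the topic id for a cell value, or None if nothing matches."""
    key = " ".join(cell_value.lower().split())  # normalize case and spacing
    return ALIASES.get(key)


cells = ["Tom Hanks", "Thomas Jeffrey Hanks", "tom  hanks", "Unknown Person"]
matches = {cell: reconcile(cell) for cell in cells}
```

Once both spellings resolve to the same identifier, counting or grouping by identifier instead of by raw text gives the more consistent view described above.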

But of course, all those benefits are applicable to other databases, too. If you have your own private database, or if you work primarily with your university's data, or the Library of Congress' data, or any other data source, then it makes sense to reconcile your data set with that source. We are reworking Gridworks to support reconciliation against arbitrary data sources (discussion thread). Here is a brief update on the current development.

In the source trunk, the Reconcile dialog box has been changed to support registering standard reconciliation services that adhere to this developing API spec:

Some sample APIs can be experimented with "in the raw" here. Each reconciliation service can have its own semantics. For example, for the Netflix reconciliation service below, "types" means film genres rather than Freebase types.

The service can also specify how to formulate URLs from identifiers. Here, the Netflix identifier is "60020675":
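As a rough sketch of the idea, a service could publish a URL template and the client could substitute the identifier into it. The `{{id}}` placeholder syntax, the `example.org` address, and the `build_view_url` helper are assumptions for illustration, not the developing spec or the real Netflix URL scheme.

```python
from urllib.parse import quote

# Hypothetical template a service might advertise; not a real Netflix URL.
VIEW_URL_TEMPLATE = "https://example.org/movie/{{id}}"


def build_view_url(template, identifier):
    """Substitute a percent-encoded identifier for the {{id}} placeholder."""
    return template.replace("{{id}}", quote(str(identifier), safe=""))


url = build_view_url(VIEW_URL_TEMPLATE, "60020675")
```

Percent-encoding the identifier keeps the URL valid when an identifier contains slashes, as Freebase identifiers such as "/en/tom_hanks" do.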

The service can also customize various auto-suggest widgets in the UI. For example, here, the Netflix service automatically suggests only film topics (rather than cities) for "Chicago".

If you're interested in this development, check out Gridworks' trunk and try out the feature. The Netflix reconciliation service mentioned is implemented as an Acre app (open source). Feel free to develop your own reconciliation service, plug it into Gridworks, and tell us what works and what doesn't (mailing list).

Thursday, May 27, 2010

Release 1.1

Your Gridworks' front page should have already alerted you to a new version: 1.1. Changes include:
  • Row/record sorting (Issue 32)
  • CSV exporter (Issue 59)
  • Mqlwrite exporter
  • Templating exporter (experimental)
  • Meta facet (Issue 58) - supported by the function facetCount()
  • Issue 34: "Behavior of Text Filter is unpredictable when "regular expression" mode is enabled." Regex was not compiled with case insensitivity flag.
  • Issue 4: "Match All bug with ZIP code". Numeric values in cells were not stringified first before comparison.
  • Issue 41: "Envelope quotation marks are removed by CSV importer"
  • Issue 19: "CSV import is too basic"
  • Issue 15: "Ability to rename projects"
  • Issue 16: "Column name collision when adding data from Freebase"
  • Issue 28: "mql-like preview is not properly unquoting numbers"
  • Issue 45: "Renaming Cells with Ctrl-Enter produced ERROR". Tentative fix for a concurrency bug.
  • Issue 46: "Array literals in GEL"
  • Issue 55: "Use stable sorting for text facets sorted by count"
  • Issue 53: "Moving the cursor inside the Text Filter box by clicking"
  • Issue 14: "Limiting Freebase load to starred records". We load whichever rows pass the current filters, not specifically starred rows.
  • Issue 49: "Add Edit Cells / Set Null"
  • Issue 30: "Transform dialog should remember preferred language."
  • Issue 62: "It'd be nice if URIs were hyperlinked in the data cells"
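
The causes noted for Issue 34 and Issue 4 are easy to reproduce in miniature. The sketch below is Python rather than the Gridworks source, and the function names are made up for illustration: it compiles a filter pattern with a case-insensitivity flag and stringifies cell values before comparing them.

```python
import re


def text_filter(cells, pattern):
    """Keep cells whose text matches the pattern, ignoring case."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [cell for cell in cells if regex.search(str(cell))]


def same_value(a, b):
    """Compare two cell values as strings so 60614 equals "60614"."""
    return str(a) == str(b)


filtered = text_filter(["Chicago", "CHICAGO HEIGHTS", "Boston"], r"^chicago")
zip_match = same_value(60614, "60614")
```

Without the flag, the pattern `^chicago` would match neither "Chicago" nor "CHICAGO HEIGHTS", and without stringifying, the number 60614 never equals the text "60614".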

Wednesday, May 12, 2010

Release 1.0.1 with important bug fixes

Your Gridworks' front page should have already alerted you to a new version: 1.0.1. There are no new features, but there are several important bug fixes:

Issue 2 is especially critical because in version 1.0, once you flag or star several rows, you cannot undo that operation or any other operation preceding it.

Monday, May 10, 2010

Freebase Gridworks 1.0 released!

We're happy to announce the availability of Freebase Gridworks 1.0, which you can find at the home page