Working with Freebase Data

I’ve been playing with the data produced by Freebase for a while now. This site, recently bought by Google, allows community editing of a shared knowledge base. The snapshot I’ve been using is 83 Gbytes in size, pushing it into the realm of “big data”: data that is too big to fit in memory (on most computers, at least).

The database dump from Freebase is in RDF Turtle format. The file contains a series of triples, grouped into subjects (which are like records) by having a common value in the first field, the subject field. The second and third fields on each line are a key/value pair, defining a property of the subject. Processing the data using Perl, I’ve generated indices of between 2 Gbytes and 4.3 Gbytes. Generation of a single index on my Toshiba laptop takes well over an hour.
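The grouping step can be sketched as follows. This is a minimal illustration in Python (the author’s actual code is in Perl and not shown here); it scans a tab-separated triple dump and reports, for each run of lines sharing a subject, the byte offset where the run starts and how many lines it spans — the two numbers an index would need to record.

```python
def scan_subjects(path):
    """Yield (subject, byte_offset, line_count) for each run of
    consecutive triples sharing the same subject (first field)."""
    with open(path, "rb") as f:
        current = None  # subject of the run being accumulated
        start = 0       # byte offset where the current run began
        count = 0       # number of lines in the current run
        offset = 0      # byte offset of the line about to be read
        for line in f:
            subject = line.split(b"\t", 1)[0]
            if subject != current:
                if current is not None:
                    yield current.decode(), start, count
                current, start, count = subject, offset, 0
            count += 1
            offset += len(line)
        if current is not None:
            yield current.decode(), start, count
```

Because the dump already groups triples by subject, a single sequential pass like this is enough; no sorting of the 83 Gbyte file itself is required.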

The index format I’m generating is simple: lines of the form value offset lines. Value is a value of the key being indexed (e.g. ns:american_football.football_coach for the key rdf:type). Offset is the byte offset in the RDF file of a subject that contains that key. Lines is the number of lines in the subject. The index lines are sorted, meaning that the values are in order and, within a value, subjects are in the order they appear in the RDF file.
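Because the index lines are sorted by value, a lookup can binary-search for the run of matching entries and then seek directly into the RDF file. Here is a simplified sketch of that idea (function and field names are my own, not from the post); for brevity it takes the index as an in-memory list of lines, whereas a real implementation would binary-search the multi-gigabyte index file on disk by seeking.

```python
import bisect

def find_subjects(index_lines, rdf_path, value):
    """Return the raw triple blocks for every subject indexed under
    value. Each index line has the form "value offset lines"."""
    # Extract the value column; the list is already sorted by it.
    values = [entry.split()[0] for entry in index_lines]
    lo = bisect.bisect_left(values, value)
    hi = bisect.bisect_right(values, value)
    blocks = []
    with open(rdf_path, "rb") as rdf:
        for entry in index_lines[lo:hi]:
            _, offset, nlines = entry.split()
            rdf.seek(int(offset))              # jump to the subject
            blocks.append([rdf.readline() for _ in range(int(nlines))])
    return blocks
```

Keeping subjects in file order within each value also means the seeks during a lookup move forward through the RDF file rather than jumping around, which is kinder to disk caches.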

At some point, I’ll put the code on GitHub.

About jimbelton

I'm a software developer and a writer of both fiction and non-fiction, and I blog about movies, books, and philosophy. My interest in religious philosophy and the search for the truth inspires much of my writing.
This entry was posted in programming. Bookmark the permalink.
