I’ve been playing with the data produced by http://www.freebase.com/ for a while now. This site, recently bought by Google, allows community editing of a shared knowledge base. The snapshot I’ve been using is 83 Gbytes in size, pushing it into the realm of “big data”, or data that is too big to fit in memory (on most computers, at least).
The database dump from Freebase is in RDF Turtle format. The file contains a series of triples, grouped into subjects (which are like records) by having a common value in the first field, the subject field. The second and third fields on each line are a key/value pair, defining a property of the subject. Processing the data using Perl, I’ve generated indices of between 2 Gbytes and 4.3 Gbytes. Generating a single index on my Toshiba laptop takes well over an hour.
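The grouping step can be sketched as follows. My actual code is Perl; this is a Python sketch of the same idea, and the field layout (subject as the first whitespace-separated field) is an assumption based on the description above.

```python
# Sketch: stream an RDF Turtle dump and group consecutive lines sharing
# a subject (the first whitespace-separated field) into records, while
# tracking each record's starting byte offset in the file.

def read_subjects(path):
    """Yield (subject, byte_offset, lines) for each run of triples
    sharing the same subject."""
    with open(path, "rb") as f:
        current_subject = None
        start_offset = 0
        lines = []
        offset = 0
        for raw in f:
            fields = raw.split(None, 1)
            subject = fields[0] if fields else b""
            if subject != current_subject:
                if lines:
                    yield current_subject, start_offset, lines
                current_subject = subject
                start_offset = offset
                lines = []
            lines.append(raw)
            offset += len(raw)
        if lines:
            yield current_subject, start_offset, lines
```

Because the file is read one line at a time, memory use stays constant no matter how large the dump is, which is the point when the data won't fit in RAM.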
The index format I’m generating is simple: value offset lines. Value is a value (e.g. ns:american_football.football_coach) of the key being indexed (e.g. rdf:type). Offset is the byte offset in the RDF file of a subject that contains that key. Lines is the number of lines in the subject. The index lines are sorted, meaning that the values are in order and, within a value, subjects appear in the same order as in the RDF file.
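A lookup against such a sorted index can be done with binary search on the value column. This is a Python sketch rather than my Perl code, and the sample entries are invented for illustration, not taken from the actual dump:

```python
import bisect

# Sketch of the index format described above: each entry is
# (value, offset, lines), sorted by value and then by file position.
# These example entries are hypothetical.
entries = [
    ("ns:american_football.football_coach", 1024, 12),
    ("ns:american_football.football_coach", 9876, 7),
    ("ns:book.author", 2048, 30),
]
entries.sort(key=lambda e: (e[0], e[1]))

def lookup(entries, value):
    """Return all (offset, lines) pairs for a value, in file order."""
    keys = [e[0] for e in entries]
    lo = bisect.bisect_left(keys, value)
    hi = bisect.bisect_right(keys, value)
    return [(off, n) for _, off, n in entries[lo:hi]]
```

Each (offset, lines) pair is enough to seek directly into the big RDF file and read exactly one subject, so the index turns a full scan into a seek plus a short sequential read.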
At some point, I’ll put the code on GitHub.