I just pushed the first bit of a tool for extracting information from wikidata.org. See my previous article for a description of the dump format. The tool can be found here:
It doesn’t do a lot so far, but it can list the labels of the objects in the dump, simplify the objects in various ways, and extract a map of property ids to labels. A sample property map is available in the data directory.
Working with the entire dump takes a long time. If you want to play with it more easily, you can take the head of a dump and run the tool against that.
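As a rough sketch of what "taking the head" can look like, the snippet below copies the first few lines of a gzip-compressed dump into a smaller file. It assumes the dump is line-oriented (one object per line) and gzip-compressed; the function name `head_of_dump` and the file names are made up for illustration and are not part of the tool.

```python
import gzip
import itertools


def head_of_dump(src_path, dst_path, n_lines=1000):
    """Copy the first n_lines lines of a gzip-compressed dump to dst_path.

    Assumption: the dump stores one object per line, so cutting on line
    boundaries never splits an object in half.
    """
    with gzip.open(src_path, "rt", encoding="utf-8") as src, \
         gzip.open(dst_path, "wt", encoding="utf-8") as dst:
        # islice stops after n_lines, so the rest of the dump is never read.
        for line in itertools.islice(src, n_lines):
            dst.write(line)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    head_of_dump("wikidata-dump.json.gz", "wikidata-head.json.gz", n_lines=1000)
```

If the full dump is a single JSON array, the truncated file will probably not parse as one JSON document, so whatever reads it has to work line by line.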
Right now, the tool will:
- Print labels only instead of dumping the objects in JSON
- Simplify the complex claims structures
- Collapse multilingual string tables to a single language of your choice
- Remove all sitelinks, or keep only a subset of them
- Select only properties or only items to be extracted
- Remove references
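
To make a few of these transformations concrete, here is a minimal sketch of how they could look on a single parsed object. This is not the tool's actual code. It assumes the object is a dict with `labels`, `descriptions`, `sitelinks` and `claims` keys, where each label is a `{"language": ..., "value": ...}` entry keyed by language code and each statement may carry a `references` list. The function names are hypothetical.

```python
import copy


def collapse_labels(entity, lang="en"):
    """Replace the multilingual label/description tables with one string.

    Assumption: each table maps a language code to {"language", "value"}.
    A missing language collapses to None.
    """
    out = dict(entity)
    for key in ("labels", "descriptions"):
        if key in entity:
            term = entity[key].get(lang)
            out[key] = term["value"] if term else None
    return out


def keep_sitelinks(entity, keep=()):
    """Keep only the sitelinks whose site id is in `keep` (empty = drop all)."""
    out = dict(entity)
    if "sitelinks" in entity:
        out["sitelinks"] = {
            site: link for site, link in entity["sitelinks"].items() if site in keep
        }
    return out


def remove_references(entity):
    """Return a copy of the entity with the references stripped from every statement."""
    out = copy.deepcopy(entity)  # deep copy so the caller's object is untouched
    for statements in out.get("claims", {}).values():
        for statement in statements:
            statement.pop("references", None)
    return out
```

Each function returns a new object rather than modifying its input, so the steps can be chained in any order, for example `remove_references(keep_sitelinks(collapse_labels(obj, "en"), keep=("enwiki",)))`.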