Wikidata Extraction Tool on GitHub

I just pushed the first bit of a tool for extracting information from wikidata.org. See my previous article for a description of the dump format. The tool can be found here:

It doesn’t do a lot so far, but is able to list the labels of the objects in the dump, simplify the objects in various ways, and extract a map of property ids to labels. A sample property map is available in the data directory.
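To give a feel for what extracting the property map involves, here is a minimal Python sketch. It is not the tool's actual code. It assumes the dump is a JSON array with one entity per line, each line ending in a comma, and that every entity carries `id`, `type`, and `labels` fields shaped like `{"en": {"language": "en", "value": "..."}}`. The function name `property_map` is made up for this illustration.

```python
import json


def property_map(lines, language="en"):
    """Map property ids (such as "P31") to their label in one language.

    Assumes one JSON entity per line, optionally followed by a comma,
    with the array brackets on lines of their own.
    """
    props = {}
    for line in lines:
        text = line.strip().rstrip(",")
        if text in ("", "[", "]"):
            continue  # skip blank lines and the enclosing array brackets
        entity = json.loads(text)
        if entity.get("type") != "property":
            continue  # items are ignored, only properties are mapped
        label = entity.get("labels", {}).get(language, {}).get("value")
        if label is not None:
            props[entity["id"]] = label
    return props
```

Because it takes any iterable of lines, you can pass it an open file object and it will read the dump one line at a time instead of loading it all into memory.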

Working with the entire dump takes a long time. If you want to play with it more easily, you can take the head of a dump and run the tool against that.
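If you would rather cut the sample from a script than from the shell, something like the following Python sketch should do. It assumes the dump is uncompressed, line-oriented text. The function name `head_of_dump` and the default of 1000 lines are arbitrary choices for this illustration, not part of the tool.

```python
import itertools


def head_of_dump(src_path, dst_path, max_lines=1000):
    """Copy the first max_lines lines of a dump into a smaller file.

    Returns the number of lines actually written, which is less than
    max_lines when the source file is shorter.
    """
    count = 0
    with open(src_path, "r", encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        # islice stops after max_lines, so the rest of the dump is never read
        for line in itertools.islice(src, max_lines):
            dst.write(line)
            count += 1
    return count
```

Note that a sample cut this way stops mid-array, so it is only useful to a reader that processes the dump line by line.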

Right now, the tool will:

  • Print labels only instead of dumping the objects in JSON
  • Simplify the complex claims structures
  • Collapse multilingual string tables to a single language of your choice
  • Remove all sitelinks, or keep only a subset of them
  • Select only properties or only items to be extracted
  • Remove references
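
Several of the options above boil down to small transformations on one parsed object. The Python sketch below illustrates three of them: collapsing string tables to one language, filtering sitelinks, and removing references. It is only an approximation of the idea, not the tool's implementation. It assumes each object is a dict with `labels` and `descriptions` keyed by language code, `sitelinks` keyed by site id, and `claims` keyed by property id with an optional `references` list on each statement. The name `simplify_entity` and its parameters are hypothetical.

```python
import copy


def simplify_entity(entity, language="en", keep_sites=(), keep_references=False):
    """Return a reduced copy of one entity dict; the input is not modified."""
    simple = copy.deepcopy(entity)

    # Collapse multilingual string tables to a single language.
    for table in ("labels", "descriptions"):
        if table in simple:
            entry = simple[table].get(language)
            simple[table] = entry["value"] if entry else None

    # Keep only the requested sitelinks (an empty keep_sites removes them all).
    if "sitelinks" in simple:
        simple["sitelinks"] = {
            site: link
            for site, link in simple["sitelinks"].items()
            if site in keep_sites
        }

    # Remove the references attached to each claim.
    if not keep_references:
        for statements in simple.get("claims", {}).values():
            for statement in statements:
                statement.pop("references", None)

    return simple
```

For example, `simplify_entity(obj, language="en", keep_sites=("enwiki",))` would leave a single English label string, only the English Wikipedia sitelink, and claims without their references.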
About jimbelton

I'm a software developer, and a writer of both fiction and non-fiction, and I blog about movies, books, and philosophy. My interest in religious philosophy and the search for the truth inspires much of my writing.
