Big Data: How Much Space is Needed to Unzip a Gzip?

When you start playing with big data files like Wikimedia dumps, you need a lot of disk space. Normally, running zcat -l file.gz tells you how large the uncompressed file will be. But for very large files, the answer is wrong:

wikidata/data $ zcat -l 20160215.json.gz 
         compressed        uncompressed  ratio uncompressed_name
         6644936388                   3 -221497878900.0% 20160215.json

This is due to a limitation in the gzip file format. At the end of a gzipped file, there is a trailer that contains two 4-byte integers. The first is a CRC (cyclic redundancy check) of the uncompressed data, which allows the content to be verified after it’s decompressed. The second is the size of the uncompressed file. Because this integer is only 32 bits, it can’t store a value larger than 4294967295. Therefore, any file of 4 GiB or more will have its size truncated; the number displayed will be the actual size modulo 4294967296. A 5 GiB file, for example, would be reported as 1 GiB.
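To look at the trailer value yourself, a rough Python sketch like the following reads the last four bytes of the file and decodes them as a little-endian unsigned integer. The function name gzip_trailer_size is made up for illustration. It also assumes a single-member gzip file: if several gzip members have been concatenated, the final four bytes describe only the last member.

```python
import gzip
import os
import struct
import tempfile


def gzip_trailer_size(path):
    """Return the size field from the gzip trailer (uncompressed size mod 2**32).

    Hypothetical helper; assumes the file holds a single gzip member.
    """
    with open(path, "rb") as f:
        f.seek(-4, os.SEEK_END)  # the size field occupies the last 4 bytes
        (isize,) = struct.unpack("<I", f.read(4))  # little-endian unsigned 32-bit
    return isize


if __name__ == "__main__":
    # Demo on a small throwaway file, where the trailer value is exact.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.gz")
        with gzip.open(path, "wb") as f:
            f.write(b"x" * 1000)
        print(gzip_trailer_size(path))
```

This is instant even on a multi-gigabyte file because nothing is decompressed, but it has the same 32-bit limit as zcat -l.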

The alternative is much slower: decompress the whole file and count the bytes.

[130]jim@bifrost ~/Code/gzip-1.6 $ time zcat ../wikidata/data/20160215.json.gz | wc
20114879 498583411 67435365902

real    18m38.375s

On my laptop, it took more than 18 minutes to decompress the gzipped file. The third number reported by wc is the byte count: around 67 gigabytes, substantially larger than the Wikidata dump from last March, which was a mere 41 gigabytes. If you have lots of room, you’re best off simply gunzipping the file. The zcat|wc method allows you to check the size without filling your disk if you don’t have enough space.
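The same count-without-storing idea can be sketched in Python, in case wc isn't handy or you want the number inside a script. The function name uncompressed_size and the 1 MiB chunk size are arbitrary choices for illustration, not anything gzip prescribes. Expect it to be slow on a large dump, since it still decompresses every byte.

```python
import gzip
import os
import tempfile


def uncompressed_size(path, chunk_size=1 << 20):
    """Decompress path chunk by chunk and return the total byte count.

    Hypothetical helper; nothing is written to disk, and only one
    chunk is held in memory at a time.
    """
    total = 0
    with gzip.open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # empty bytes means end of stream
                break
            total += len(chunk)
    return total


if __name__ == "__main__":
    # Demo on a small throwaway file built from two concatenated gzip members.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.gz")
        with open(path, "wb") as f:
            f.write(gzip.compress(b"a" * 700))
            f.write(gzip.compress(b"b" * 300))
        print(uncompressed_size(path))
```

Because it counts the bytes actually produced, the result isn't subject to the 32-bit limit of the trailer, and it covers files made of several concatenated gzip members.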

