Friday, December 11, 2015

For the love of bits, stop using gzip!

Every time you download a tar.gz, Zod kills a kitten. Every time you generate a tar.gz, well, let's keep this family-safe.

In 2015, there should be very few reasons to generate .tar.gz files. We might as well just use .zip for all the progress we have made since 1992 when gzip was initially released. xz has been a thing since 2009, yet I still see very little adoption of the format. Today, I downloaded Meteor and noticed that it downloads a .tar.gz. I manually downloaded the file and then recompressed it using xz:

$ du -h meteor-bootstrap-os.osx.x86_64.tar.*
139M meteor-bootstrap-os.osx.x86_64.tar.gz
67M meteor-bootstrap-os.osx.x86_64.tar.xz
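(For anyone who wants to reproduce the flavor of this without a 600MB download, here's a synthetic sketch. The filenames are made up, and highly redundant text compresses far better than a real tarball of binaries would, but the ordering usually holds.)

```shell
# ~1.3 MB of repetitive text as a stand-in payload.
seq 1 200000 > sample.txt

gzip -9 -c sample.txt > sample.txt.gz
xz   -9 -c sample.txt > sample.txt.xz

# xz should come out well ahead on redundant data like this.
ls -l sample.txt sample.txt.gz sample.txt.xz
```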

Seriously, less than half the size! Maybe it's the amount of time it takes to compress? Let's see:

$ cat meteor-bootstrap-os.osx.x86_64.tar | time xz -9 -c > /dev/null
xz -9 -c > /dev/null 165.21s user 1.14s system 99% cpu 2:47.70 total

$ cat meteor-bootstrap-os.osx.x86_64.tar | time gzip -9 -c > /dev/null
gzip -9 -c > /dev/null 35.03s user 0.23s system 99% cpu 35.583 total
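(Worth noting: newer xz releases — 5.2 and later, which is an assumption about what's installed — can spread that compression cost across cores with the -T/--threads flag. Multithreaded output can be slightly larger because the input is split into independent blocks, but it's still an ordinary .xz stream.)

```shell
# -T0 uses one worker thread per core.
seq 1 200000 > big.txt
xz -9 -T0 -c big.txt > big.txt.xz

# Any xz can decompress the result; verify the round trip.
xz -d -c big.txt.xz | cmp - big.txt && echo roundtrip-ok
```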

Ok, so compressing takes longer. But you only have to do it once, and it's still reasonable for something that produces a 600MB tarball in the first place. What about decompressing?

$ time xz -d -c meteor-bootstrap-os.osx.x86_64.tar.xz > /dev/null
4.25s user 0.08s system 99% cpu 4.327 total

$ time gzip -d -c meteor-bootstrap-os.osx.x86_64.tar.gz > /dev/null
1.35s user 0.04s system 99% cpu 1.389 total


... and decompressing takes a little longer. But wait a second: how long does it take to download the file in the first place? I'm on a decent connection, and say the file is being hosted on something that delivers the content at an average of 1.5 MB/s. That's about 93 seconds for the .tar.gz and 45 seconds for the .tar.xz. Since the content is streamed directly to tar (a la: curl ... | tar -xf - ), we don't actually see a slowdown because xz is slower to decompress; we see an overall speedup, because the slowest operation is getting the bits in the first place!
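Back-of-envelope with those sizes (the 1.5 MB/s rate is assumed, and decompression is counted sequentially here even though in a pipe it overlaps the download):

```shell
# download time + decompress time, using the sizes and timings above
awk 'BEGIN {
  rate = 1.5                                  # MB/s, assumed
  printf "tar.gz: ~%.0fs\n", 139/rate + 1.4   # 139 MB download + gunzip
  printf "tar.xz: ~%.0fs\n",  67/rate + 4.3   #  67 MB download + unxz
}'
```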

What about tooling?

OSX: tar -xf some.tar.xz (WORKS!)
Linux: tar -xf some.tar.xz (WORKS!)
Windows: ? (No idea, I haven't touched the platform in a while... should WORK!)
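On both of those platforms, the end-to-end pattern can be sketched with a local file standing in for the network (the curl URL below is hypothetical; any .tar.xz behaves the same way):

```shell
# Build a tiny .tar.xz to stand in for a real download.
mkdir -p demo && echo hello > demo/file.txt
tar -cf - demo | xz -9 -c > demo.tar.xz
rm -r demo

# The real-world shape (hypothetical URL):
#   curl -sL https://example.com/demo.tar.xz | tar -xJf -
# Locally, with cat standing in for curl:
cat demo.tar.xz | tar -xJf -
cat demo/file.txt   # prints "hello"
```

Reasonably recent GNU and BSD tar releases also auto-detect the compression with a plain `tar -xf -`, which is why the one-liner in the list above works unchanged.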

Why am I picking on Meteor? Well, they place the tagline of "Build apps that are a delight to use, faster than you ever thought possible" right on their homepage. I just ran their install incantation and timed it:

./install.sh 2.32s user 7.67s system 14% cpu 1:10.53 total

70 seconds! Nice job! I must have downloaded it slightly faster than in my initial testing. It also means that the install is extremely limited by download speed. So ... I can easily imagine this being twice as fast. All that needs to be done is to change the compression format, and I should be able to install this in 33 seconds!

So, who *does* use xz? kernel.org. Also, the Linux kernel itself optionally supports xz compression of initrd images. Vendors just need to pay attention and turn the flags on. Anyone else want to be part of the elite field of people who use xz? Please?

28 comments:

  1. You should also compare pxz and pigz for compression times (for multi-core systems)

    1. I agree.

      I'd also like to see a comparison against a much larger file (e.g. a 28 GB .sql dump). I have serious doubts that something other than gzip would prevail as more efficient.

  2. I would suspect that the subset of those who use a parallel compressor, much less a more space-efficient algorithm, is embarrassingly small. Add to that the fact that the resultant size is the dominating factor in delivery performance, aka the task you do more than once, and I wonder why talking about parallel compressors even fits into the question?

  3. On Windows, 7-Zip is confirmed to pack/unpack XZ files (source: http://www.7-zip.org/)

  4. It's a question of good enough. The vast majority of data stored and transferred today is photo and video. For those, lossy compression is used, and there is healthy innovation in that space. For the rest, people just use what is available to as many recipients of the compressed files as possible. Nobody is going to download a new decompressor just to install your app, but plenty will choose your video service if it's the only one that provides reliable 4K video over their midrange Internet connection.

  5. phpMyAdmin uses gzip as its compression method for importing and exporting databases.

  6. Arch Linux uses xz for their package system since 2009 I think

  8. For people who can't be bothered to learn specific commands for specific archives, no need for tar: on linux I use 7z for all archives and the command is the same for gz or xz: 7z x somefile.*z.

    1. Yes, there is a *need* for tar instead of 7zip on Linux; it's written right there in the 7z manual:

      DO NOT USE the 7-zip format for backup purpose on Linux/Unix because :
      - 7-zip does not store the owner/group of the file.

      On Linux/Unix, in order to backup directories you must use tar :
      - to backup a directory : tar cf - directory | 7za a -si directory.tar.7z
      - to restore your backup : 7za x -so directory.tar.7z | tar xf -

  9. What about Brotli? http://www.cio.com.au/article/585180/google-new-brotli-algorithm-compresses-data-faster-than-zopfli/

  10. for the pattern of compress a few times, decompress many times, consider lz4.

  11. See also: https://github.com/square/git-fastclone

  12. xz has built-in multiprocessing support.

  13. gzip is better (because faster) on small devices (1-2 cores on arm). xz and lzma are more cpu intensive.

  14. Slackware packages from package.slackware.org are compressed using XZ and have been for a while now. :)

  15. For packaging and static stuff like this (WORM, write-once read-many), ok. But the claim is too broad: "stop using gzip".
    On production servers, gzip will bring output I/O down significantly while barely using CPU. So while I agree that most compressors yield better results, gzip is far from useless. ;-)

  16. Can you be more wrong than by asking to replace something that's long been proven to work pretty much anywhere with something shinier? I doubt it. What about those embedded systems and Raspberry Pis and the half of the world that doesn't have the privilege of modern and powerful computers?

    Had you asked for choice instead of replacement, you'd have had a point, but as it is this just pushes people to ignore this little fit you've thrown over not seeing how privileged you are.

  17. FreeBSD uses tar and xz format = txz

  18. Most CTF games/organizations package their binaries in xz format

  19. here's my reply:
    https://t35t37.wordpress.com/2015/12/14/gzip-is-good-enough-to-stay/

  20. To all who have replied, thanks! I like the additional info and perspectives on other hardware variants. Indeed, I haven't tried this on my R-Pi or Intel SoC class nodes. To those who are upset and claim my privilege is clouding my judgement, I appreciate your perspective too. Still, I'm bandwidth-limited before I'm CPU-limited, and on my embedded devices I'm often space-limited. Either way, higher levels of compression would be nice.

  21. I don't know of browsers that have support for xz. That would be cool though! Some *very* bandwidth constrained users would probably like the feature. Browsers tend to be optimized to the bone for low-latency rendering though. Adding any amount of time on the decompressor would be a hard sell.
