Friday, July 18, 2025

How Reordering Files Reduced My Archive Size by Half

Background

I have a solar inverter made by Fronius that uploads JSON to an FTP server every ten seconds. There are two different JSON files (*.powerflow and *.inverter) with slightly different information. That works out to about 10k files per day, each usually between 500-1000 bytes. On the local filesystem that equates to about 40MB per day, since every one of those tiny files still occupies at least a full 4 KiB block. Over a year, we are talking about 3-4 million files (14-15GB), which can exhaust inodes on a reasonably sized ext4 filesystem long before space is a concern.

To manage this, I roll the stats up daily. The inverter publishes each day's files into a folder named YYYYMMDD; I slurp those up into InfluxDB and then create a compressed tar archive called YYYYMMDD.tar.xz as the final backup.
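Stripped of error handling (and the InfluxDB import, which happens before the archive step), that roll-up is roughly the following sketch:

day=$(date -d yesterday +%Y%m%d)             # e.g. 20250717
cd /anon-ftp/fronius/photosynth/processed
tar -cJf "$day.tar.xz" "$day/"               # cleanup of the raw files is left out here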

This is a good system: each year compresses into around 50-100 MB, which puts us roughly 150-300x ahead on space and about four orders of magnitude ahead on inode usage. Maybe we can do better?

Example File Listing

/anon-ftp/fronius/photosynth/processed/20250717# ls -la | head
total 40540
drwxr-xr-x  2 root root 512000 Jul 18 08:05 .
drwxr-xr-x 10 root root  40960 Jul 19 01:35 ..
-rw-------  1 ftp  ftp     738 Jul 17 17:14 100450.solarapi.v1.inverter
-rw-------  1 ftp  ftp     802 Jul 17 17:14 100450.solarapi.v1.powerflow
-rw-------  1 ftp  ftp     738 Jul 17 17:14 100500.solarapi.v1.inverter
-rw-------  1 ftp  ftp     802 Jul 17 17:14 100500.solarapi.v1.powerflow
-rw-------  1 ftp  ftp     738 Jul 17 17:15 100510.solarapi.v1.inverter
-rw-------  1 ftp  ftp     802 Jul 17 17:15 100510.solarapi.v1.powerflow
-rw-------  1 ftp  ftp     738 Jul 17 17:15 100520.solarapi.v1.inverter
...

When we do the obvious tar invocation, the files are added in whatever order the directory returns them, which is effectively random.

# tar -cJvf 20250717.tar.xz 20250717/ | head
20250717/
20250717/152720.solarapi.v1.inverter
20250717/193700.solarapi.v1.inverter
20250717/154110.solarapi.v1.powerflow
20250717/163020.solarapi.v1.inverter
20250717/230850.solarapi.v1.inverter
20250717/130120.solarapi.v1.inverter
20250717/102720.solarapi.v1.inverter
20250717/213220.solarapi.v1.powerflow
20250717/110950.solarapi.v1.inverter
...

and we are left with the resulting 20250717.tar.xz.

Does ordering the files make a difference?

Compressors like xz work by finding repeated byte sequences within a limited window, so it makes sense (in my head) that similar content placed right next to each other would compress better than the two file types interleaved. If I do a simple experiment, maybe I can validate this and see how much of an effect it has.

# tar -cJf 20250717-manual-sort.tar.xz 20250717/*.powerflow 20250717/*.inverter
# du -c *.tar.xz
116 20250717-manual-sort.tar.xz
212 20250717.tar.xz

Holy smokes! With almost no work, the sorted archive is about 54% of the size of the one from the original tar invocation! (Yes, I did validate that both produce the same directory structure and content with diff -r.)
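That check is quick to repeat: extract both archives into scratch directories (the paths below are arbitrary) and diff them.

# mkdir -p /tmp/check/{original,sorted}
# tar -xJf 20250717.tar.xz -C /tmp/check/original
# tar -xJf 20250717-manual-sort.tar.xz -C /tmp/check/sorted
# diff -r /tmp/check/original /tmp/check/sorted && echo identical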

Can I apply this anywhere else?

I decided to download a dataset with lots of files. The easiest one to think of was the Linux kernel tarball for 6.16-rc6, which has 89,677 files and unpacks to 1.7G on my local filesystem. To order files here, I needed to write a utility, since I couldn't use as simple a tar invocation as I did for my stats collection.

The script: https://github.com/imoverclocked/orderfs
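The script does more than this, but the core trick is just handing tar an explicitly ordered file list via -T/--files-from. Run from the directory containing the unpacked tree, a rough extension-then-path ordering (the same spirit as orderfs, though not its exact algorithm) looks like this, assuming no filenames contain spaces or newlines:

find linux-6.16-rc6 -type f \
  | awk '{ n = split($0, parts, "/"); base = parts[n];
           ext = match(base, /\.[^.]*$/) ? substr(base, RSTART) : "";
           print ext "\t" $0 }' \
  | sort \
  | cut -f2- \
  | tar -czf linux-6.16-rc6.ext-then-path.tar.gz -T -

GNU tar archives the names in exactly the order they arrive on stdin, which is all the control these experiments need.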

I played around with different orderings and with binning based on size. As it turns out, the Linux kernel source is structured in a way that is pretty close to optimal already:

# du -h *.tar.gz
241M linux-6.16-rc6.recompress.tar.gz
249M linux-6.16-rc6.sorted-bins.tar.gz
264M linux-6.16-rc6.sorted.tar.gz
241M linux-6.16-rc6.tar.gz

The variants above:

  • recompress - uncompress the original tar and then recompress it locally with vanilla gzip
  • sorted - a rough sort by file type (extension) and then by ascending size
  • sorted-bins - sort by extension, then bin by size, keeping the smaller files ordered by path

The vanilla recompression came out within a couple hundred bytes of the original, so my gzip behaves only slightly differently from the one kernel.org is using. Maybe they use --best or some other flag that I didn't bother with.
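If I ever do bother, the check is simple: recompress the same tar with --best and compare the sizes (the .recompress-best name is just mine).

# gunzip -k linux-6.16-rc6.tar.gz
# gzip --best -c linux-6.16-rc6.tar > linux-6.16-rc6.recompress-best.tar.gz
# du -h linux-6.16-rc6*.tar.gz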

Sorting by extension and then putting files in ascending size order actually made things worse. I suspect (SWAG) that having the pathnames in essentially random order made it harder for gzip to find larger common chunks in the tar headers.

Sorting by extension and then by path made things better, but still not as good as the default directory structure. (Well done, kernel folks!)

Summary

While not extensively tested, the script does provide a definite win for my local stats archives, roughly halving their size. The kernel source tree's default layout already gets most of this benefit by keeping similar files together, which is cool. Maybe others can use the script to see if restructuring their archives makes a significant difference for them.