Nailing Down a Years-Old Problem in Tapestry

Last Updated: 2023-12-20 05:00:00 -0600

Tapestry is the oldest code project of mine that can be said to fall under the Arcana Labs umbrella - in fact, it’s so old that the project itself has persisted through two full rebrands and renamings of the lab - Kensho Security Labs, and Patch Savage Labs before that. While I’m sure I’m now its only continuous user, the fact remains it is, at least for me, a profoundly useful backup utility… when it works.

I’ve been quite vocal about my complaints with Tapestry in the Arcana Labs Discord Server, floating possible fixes everywhere from an experiment I conducted to change the hashing algorithm used for validating file integrity on through to a wholesale replacement tool that would completely change my personal backup paradigm. That replacement tool, which I was calling “Gardener”, had some merits, but it also had some dangerous assumptions baked into it and would be a huge lift, so whether or not I choose to pursue it, I really had to get Tapestry working again.

Wait, How Long Was This Broken?

Broken is sort of a strong word. There was a brief window between an update released early in 2021 (2.1.1) and another I released later that same year (2.2.0) where a certain confluence of weird events could have broken it, and as it happened, we found that bug because I was hitting it. That exposed a mountain of problems I still haven’t solved in my backup methodology. More on that next year, in 2024; I actually have all the tools I need to solve that problem.

Once I got backups working again, though, I realized I had a new problem: my backups were taking over a full day to complete. This was a serious problem - I had to disable the on-build file validation because there was a risk of failing out an entire backup if I updated a file between when it was first indexed and when it was actually added to an archive.

To my shame, I let this condition persist for months.

What The Problem Wasn’t

I had some thoughts immediately after this problem came up about where I could see inefficiencies in the program. SHA256 is a slow hashing algorithm and there are faster ones out there, so I did some science and found, to my amazement, that of the commonly-available, standard-ish hashing packages for Python, SHA256 was actually the most performant; implementation matters more than specification for this kind of thing. Nothing major had changed in my compression settings or any of the libraries that underpin Tapestry, and switching from a spinning rust hard disk to a SATA SSD offered only marginal improvement in performance.
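For the curious, the experiment amounted to something along these lines; my actual harness was messier, the buffer size and candidate list here are arbitrary, and results will vary with your Python build and OpenSSL version:

```python
# A rough sketch of the hashing comparison; the buffer size, round count,
# and candidate list are illustrative choices, not what I actually ran.
import hashlib
import time

CANDIDATES = ["sha256", "sha1", "md5", "blake2b", "sha3_256"]

def time_hash(algorithm, data, rounds=10):
    """Hash the same buffer repeatedly and return the average wall time."""
    start = time.perf_counter()
    for _ in range(rounds):
        digest = hashlib.new(algorithm)
        digest.update(data)
        digest.hexdigest()
    return (time.perf_counter() - start) / rounds

if __name__ == "__main__":
    sample = b"\x00" * (64 * 1024 * 1024)  # 64 MB of throwaway data
    for name in CANDIDATES:
        print(f"{name}: {time_hash(name, sample):.4f} s per pass")
```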

At this point, I decided I must have simply outgrown what Tapestry was designed to handle, and started planning Gardener. I had a pie-in-the-sky idea of building in features from the never-built Tapestry management software (Loom) and switching the paradigm around to changes-only/incremental backups. The idea was and still is appealing, but it had a major problem: my backup system still wasn’t working for me right now.

Diagnosing and Working Around the Problem

By this point, I had a pretty clear idea that whatever was causing the performance leak had to do with the total number of files being backed up. I compared the size of my current output blocks to the size of past output blocks and realized that, in fact, I’d only grown my backups by a total of about 5%. That’s not nothing, but it’s also not nearly enough to explain or justify run times increasing by several hundred percent.

My first blind guess was one of the directories on my list, which contains all the locally-stored data for Thunderbird. Backing up my email has always been a pain point when working with Tapestry, and I figured I was probably getting snarled up in hundreds of lock or symlink files. I knew I wasn’t getting stuck in a symlink loop because the process did eventually complete, so I needed more data. To get it, I woke up a Python terminal, loaded up an actual .riff file from the most recent successful run, and after some poking around, wrote a few commands to send all the index material from the RIFF into a SQL-like DB so I could run queries on it.
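The session itself was nothing fancy. Reconstructed from memory, and assuming for illustration that the RIFF decodes as JSON with a per-file index of path, size, and hash fields (the real filename and field names may differ from what's shown here), it was roughly:

```python
# Throwaway forensics: pour the RIFF's index into an in-memory SQLite DB.
# The filename and the "index"/"path"/"size"/"sha256" keys are assumptions
# made for this sketch, not a documented description of the RIFF format.
import json
import sqlite3

with open("latest-run.riff") as riff_file:
    riff = json.load(riff_file)

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE idx (fpath TEXT, fsize INTEGER, fhash TEXT)")
db.executemany(
    "INSERT INTO idx VALUES (?, ?, ?)",
    ((entry["path"], entry["size"], entry["sha256"]) for entry in riff["index"]),
)
db.commit()
```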

It turned out that I was right - out of something like 58,000 total files in the backup, some 50,000 were less than a kilobyte in size, and many of those were less than 100 bytes. The vast majority of these came from a single source: the local clone of the QMK source code repo I had from working on the Model 2. QMK ships with makefiles, definitions, docs, and so forth for many hobby-industry keyboard layouts. None of those files are large, but it adds up.
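Continuing that throwaway session, the queries that told the story were about this simple (again, a sketch rather than a transcript):

```python
# Count the total entries, the sub-kilobyte entries, and the sub-100-byte ones.
total, = db.execute("SELECT COUNT(*) FROM idx").fetchone()
tiny, = db.execute("SELECT COUNT(*) FROM idx WHERE fsize < 1024").fetchone()
sub100, = db.execute("SELECT COUNT(*) FROM idx WHERE fsize < 100").fetchone()
print(f"{tiny} of {total} files are under 1 KB; {sub100} of those are under 100 bytes")

# Listing a slice of the tiny files is what made the QMK clone jump out.
for fpath, fsize in db.execute(
        "SELECT fpath, fsize FROM idx WHERE fsize < 1024 ORDER BY fpath LIMIT 25"):
    print(fsize, fpath)
```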

Sure enough, after tarring my working directory for the Model 2 and rerunning Tapestry, the backup went from over 24 hours to roughly 24 minutes to complete. That was shocking, because tarring those files as a standalone operation only took maybe 10 minutes, and the resulting tarball would have fit on a CD, so it wasn’t as if I’d removed a huge amount of data. In fact, the final backups I generated in 24 minutes today were a few megabytes larger than the last successful run in November. The index files themselves, however, were 25% of the size of the previous run’s, and that goes a long way toward explaining this problem.
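The workaround itself is almost embarrassingly small; something on the order of the following, where the paths are placeholders for my actual working directory:

```python
# Wrap the file-count-heavy working directory in one tarball before Tapestry
# ever walks it, so the backup sees a single large item instead of ~50,000 tiny ones.
import tarfile

with tarfile.open("/home/user/archives/model2-workdir.tar.bz2", "w:bz2") as tar:
    tar.add("/home/user/src/model2", arcname="model2")
```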

Why Was This The Problem?

Tapestry is one of those programs whose operation I feel the need to explain to people in its own terms rather than in vague feature terms, and my description of it has evolved significantly over the years. It leans on a bit of an illustration, but I think that will also help get the problem across.

If you think of the computer as one of the many warehouse-libraries that make up part of the Arcana Labs complex, like I do, Tapestry can be thought of as a robot used to help pack up parts of that library and transport them somewhere else. Being a robot, Tapestry is also capable of operating many seemingly-independent bodies at once.

When you call Tapestry, it boots up a single robot to come and grab the list you left it of all the parent directories it needs to search for objects to back up. As it searches these areas of the warehouse, it builds a list of all the items it finds, collecting each file’s hash (in SHA256) and size along with other metadata. This information becomes the index, the largest-by-far component of the Recovery Index File Format (RIFF) file that will be included in every block of the backup, like the packing list on a commercial shipment.
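Setting the robots aside for a second, that inventory step amounts to something like the following; this is my paraphrase of the behaviour, not Tapestry’s actual code, and the dictionary keys are illustrative rather than the RIFF’s real field names:

```python
# Walk every listed parent directory, hash each file, and record its size.
import hashlib
import os

def build_index(parent_dirs):
    index = []
    for parent in parent_dirs:
        for dirpath, _dirnames, filenames in os.walk(parent):
            for name in filenames:
                fpath = os.path.join(dirpath, name)
                sha = hashlib.sha256()
                with open(fpath, "rb") as f:
                    for chunk in iter(lambda: f.read(65536), b""):
                        sha.update(chunk)
                index.append({
                    "path": fpath,
                    "size": os.path.getsize(fpath),
                    "sha256": sha.hexdigest(),
                })
    return index
```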

This is the first place that the number of items becomes a time and size constraint. The inventory is conducted only by the first, mother Tapestry robot that boots up. It does all of this work alone, a single file at a time, and on small files most of the time spent checking an item out is actually spent getting ready to do so; there’s a minimum time per item that is not directly bounded by item size. You can actually see this in a property of the index itself: the size of the index file scales not with the total size of the backup (directly) but much more proportionally with the number of individual items in it. To take an extreme example, a Tapestry-wrapped backup of a single 2 TB tarball will have a RIFF of less than 1 KB; the equivalent list for 1 TB of tiny text files is going to be several hundred MB on its own.
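You don’t have to take my word for the per-item overhead; a quick experiment like this one, which builds its own throwaway files, shows the shape of it:

```python
# Hash 50,000 hundred-byte files, then one file of the same total size,
# and compare wall-clock times. Per-file setup cost dominates the first case.
import hashlib
import os
import tempfile
import time

def hash_file(path):
    sha = hashlib.sha256()
    with open(path, "rb") as f:
        sha.update(f.read())
    return sha.hexdigest()

with tempfile.TemporaryDirectory() as workdir:
    for i in range(50_000):
        with open(os.path.join(workdir, f"tiny_{i}"), "wb") as f:
            f.write(b"x" * 100)
    big = os.path.join(workdir, "one_big_file")
    with open(big, "wb") as f:
        f.write(b"x" * (100 * 50_000))  # same total payload, one file

    start = time.perf_counter()
    for name in os.listdir(workdir):
        if name.startswith("tiny_"):
            hash_file(os.path.join(workdir, name))
    print("50,000 tiny files:", round(time.perf_counter() - start, 2), "s")

    start = time.perf_counter()
    hash_file(big)
    print("one 5 MB file:    ", round(time.perf_counter() - start, 2), "s")
```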

This delay is also obvious to the user when running Tapestry in a terminal mode; you can hang on the “sorting the files to be archived - this could take a few minutes” step for upwards of an hour, in my case. I even briefly considered trying to figure out how to add a status bar to this step so that you could at least see it was working.
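If that status bar ever does happen, it would probably be as simple as wrapping the per-file loop in something like the third-party tqdm package; a toy illustration, not anything Tapestry currently does:

```python
# Toy progress bar over an indexing-style loop, assuming tqdm is installed.
import time
from tqdm import tqdm

files_to_sort = [f"file_{i}" for i in range(1_000)]  # stand-in for the real file list

for fpath in tqdm(files_to_sort, desc="Sorting the files to be archived", unit="file"):
    time.sleep(0.001)  # stand-in for hashing and sizing one file
```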

After this point, once the list is created, Tapestry does get clever. As a legacy feature that remains useful even in the cloud-storage age, Tapestry’s main utility at one point was that it divides a backup across the minimum number of blocks of some arbitrary maximum size. You can think of this as Tapestry summoning multiple moving trucks, each with a given maximum capacity. The mother program takes the index she just built and works out the minimum number of trucks she’ll need, basically by packing trucks full from the index sorted by file size. This packing method is actually pretty efficient in terms of time to run, but it introduces a new problem: we’re not done with file count being a choke point for performance.
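That truck-planning pass is recognizable as a greedy bin-packing job. A sketch of the general approach the paragraph above describes, with the caveat that Tapestry’s real packer may differ in its details, looks like this:

```python
# Sort the index largest-first, then drop each file into the first truck (block)
# that still has room; start a new truck when nothing fits. An oversized file
# effectively gets a truck to itself.
def pack_blocks(index, block_size):
    """index: iterable of (path, size) tuples; returns lists of paths per block."""
    blocks = []  # each entry: {"free": bytes remaining, "files": [paths]}
    for path, size in sorted(index, key=lambda item: item[1], reverse=True):
        for block in blocks:
            if block["free"] >= size:
                block["files"].append(path)
                block["free"] -= size
                break
        else:
            blocks.append({"free": block_size - size, "files": [path]})
    return [block["files"] for block in blocks]

# Hypothetical example: 64 GB blocks and three files of wildly different sizes.
example = [("video.mkv", 40 * 2**30), ("photos.tar", 30 * 2**30), ("notes.txt", 4096)]
print(pack_blocks(example, 64 * 2**30))
```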

This problem shows up once you start actually packing the trucks. Tapestry will wake up several other robots to do this heavy lifting, but those robots are set up in such a way that the following two statements are always true:

  1. There are never more truck-packing robots than there are CPU cores in your machine or trucks to be packed, whichever is fewer, and;
  2. Each truck-packing robot is responsible for one or more trucks on its own; that is, no two truck-packing robots can help each other pack the same truck.

This means that if most of the trucks have a couple hundred or even a couple thousand items to pack, they can complete fairly quickly, whereas if you have one final truck at the end with tens of thousands of tiny items to pack, that truck can take forever regardless of your disk write speed or how powerful your computer is. The robot has to pick up each of these tiny files, sometimes literally zero bytes in size, and put them on the truck one at a time.
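In code terms, the arrangement those two rules describe looks roughly like the following; this is an illustration of the bottleneck, not Tapestry’s actual implementation:

```python
# A pool capped at the CPU count (or the block count, whichever is smaller),
# where each worker owns an entire block's tarball by itself. One block full
# of 50,000 tiny files becomes one worker's problem alone.
import multiprocessing
import tarfile

def pack_one_block(job):
    block_path, file_list = job
    with tarfile.open(block_path, "w:bz2") as tar:
        for fpath in file_list:
            tar.add(fpath)  # one file at a time, no matter how small
    return block_path

def pack_all_blocks(blocks):
    """blocks: list of (output_tarball_path, [file_paths]) tuples."""
    workers = max(1, min(multiprocessing.cpu_count(), len(blocks)))
    # On Windows this must be called under an `if __name__ == "__main__":` guard.
    with multiprocessing.Pool(workers) as pool:
        return pool.map(pack_one_block, blocks)
```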

Once the trucks are packed you’re out of the file-count-bottleneck woods, but this was by far the longest part of the process: watching a single Tapestry child process pack files of 1,024 bytes or less, one at a time, into the final tarball of the backup.

In short, the problem here is simple: the big constraint on creating the packing list, and on creating the tarballs themselves, is only partially file size, especially now that modern drives are getting faster and faster. The real bugbear is a small number of worker processes, unable to help each other, writing one file at a time, combined with a certain minimum time per file that applies no matter how small the file is.

How to Solve For This

To be honest, I’ve been thinking about this problem for longer than I’ve understood that it existed. Finding a way to allow the packing worker processes to help each other write to the same tarball has been a holy-grail fix for Tapestry since before the 1.0 and 2.0 rewrites, precisely because this has always been the performance bottleneck, though I’ll admit I didn’t really understand why until I did this exact demonstration for myself. Right now, the differences in write locking between Unix and Windows systems, and the differences in inter-process communication between them, are the single largest source of complexity in the Tapestry source code; this step has to be handled differently on Unix and Windows, and there are wholly separate functions set aside for doing that. The most ideal solution would be to figure out how other programs are able to do things like tar the QMK repo in mere minutes when it takes Tapestry hours, but I suspect a huge portion of that difference lies in the performance gap between compiled and interpreted code.

The solution we’ll actually implement in Tapestry is to sidestep the problem. This hasn’t come up in any context other than my code repos, which are sometimes cloned from other people and often contain hundreds to thousands of files, whereas even my best efforts at media archival involve a fraction of that count. So I’m going to open development back up on Tapestry in order to add a feature I’ve wanted for a long time: the ability to detect, and skip, repos that are controlled by a version control system like Git or Mercurial.

My logic here is this: a git repo with a defined remote is likely being backed up more frequently than Tapestry is being run (at least in my use case) and is being backed up off-site (again, this assumes, for example, that I’m using a GitHub or GitLab instance hosted somewhere else, physically). That means including those files in a Tapestry backup is redundant. If I want to get really fancy with it, maybe I can further tweak Tapestry so that it backs up a marker indicating the repo was there, with information on how to retrieve it. This all involves planning that’s WAY out of scope for a blog post being written five days before Christmas, though. If you’re interested in the solution that does ultimately get implemented, I’d suggest following the RSS feed for this blog as well as Issue 41 on the Tapestry repo, where possible solutions to this problem will be discussed at length. If you have ideas about possible solutions, feel free to weigh in. It’s Open Source for a reason!
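Since the design work is what Issue 41 is for, this is purely speculative, but one shape the detection could take is pruning, during the directory walk, anything with a .git directory that defines a remote; the helper names here are invented for the sketch:

```python
# Speculative sketch: skip directories that are the root of a git repo with at
# least one configured remote, on the theory that the remote already has them.
import os

def has_git_remote(repo_root):
    """Crude check: does .git/config here declare any [remote "..."] section?"""
    config_path = os.path.join(repo_root, ".git", "config")
    try:
        with open(config_path, encoding="utf-8") as config:
            return any(line.lstrip().startswith('[remote "') for line in config)
    except OSError:
        return False

def walk_skipping_repos(parent):
    """Yield (dirpath, filenames) pairs, pruning repos that have a defined remote."""
    for dirpath, dirnames, filenames in os.walk(parent):
        if has_git_remote(dirpath):
            print(f"Skipping VCS-controlled directory: {dirpath}")
            dirnames[:] = []  # don't descend into the repo at all
            continue
        yield dirpath, filenames
```

A Mercurial version would be the same idea pointed at .hg/hgrc and its [paths] section.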


If you want to show your support financially for Arcana Labs projects like Tapestry or PETI, but don’t need a virtual pet development kit, your best avenue is via my GitHub Sponsors account or by making a one-time donation to Arcana Labs via Ko-Fi.com or through other avenues detailed here. Supporters also get access to a special patrons-only section of the Arcana Labs Discord Server, and new bonuses are soon to be introduced on the GitHub side!
