Much faster incremental apt updates

APT’s performance in applying PDiff files – the diff format used for Packages, Sources, and other files in the archive – has been slow.

Improving performance for uncompressed files

The reason for this is that our I/O is unbuffered: we were reading one byte at a time in order to read lines. This changed on December 24, when read buffering was added for reading lines, vastly improving the performance of rred.

But it was still slow, so today I profiled – using gperftools – the rred method running on a 430 MB uncompressed Contents file with a 75 KB patch. I noticed that our ReadLine() method was calling some method which took a long time (google-pprof told me it was some _nss method, but that was wrong [thank you, addr2line]).
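For reference, here is a minimal sketch of this kind of gperftools CPU-profiling setup – not APT’s actual build, and BusyWork() is just a hypothetical stand-in for the rred patching step: link against libprofiler, wrap the interesting code in ProfilerStart()/ProfilerStop(), and feed the resulting profile to google-pprof.

    // Build with: g++ -O2 profile.cc -lprofiler
    #include <gperftools/profiler.h>

    #include <cstdio>

    // Stand-in workload; in the real case this would be the patching step.
    static unsigned long BusyWork() {
        unsigned long sum = 0;
        for (unsigned long i = 0; i < 200000000UL; ++i)
            sum += i % 7;
        return sum;
    }

    int main() {
        ProfilerStart("rred.prof");   // start sampling, samples go to rred.prof
        unsigned long result = BusyWork();
        ProfilerStop();               // stop sampling and flush the profile
        std::printf("%lu\n", result);
        // Render a call graph with:
        //   google-pprof --svg ./a.out rred.prof > rred-profile.svg
        return 0;
    }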

Looking further into the code, I noticed that we set the size of the buffer to the length of the line, and whenever we moved some data out of the buffer, we called memmove() to shift the remaining data to the front of the buffer.

So I tried using a fixed buffer size of 4096 bytes (commit). Now memmove() spent less time moving memory around inside the buffer. This helped a lot, bringing the run time on my example file down from 46 seconds to about 2 seconds.

Later on, I rewrote the code to not use memmove() at all, opting for start and end variables instead and increasing the start variable when reading from the buffer (commit).

This further improved things, bringing the run time down to about 1.6 seconds. We could now also increase the buffer size again without any negative effect.
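To illustrate the idea, here is a simplified sketch (not APT’s actual FileFd/ReadLine code, and with error handling reduced to the bare minimum): a fixed-size buffer plus start and end offsets means consuming a line only advances start, and no bytes are ever moved to the front of the buffer.

    #include <unistd.h>

    #include <cstring>
    #include <string>

    class BufferedLineReader {
        int fd;
        char buf[4096];            // fixed buffer size, as in the first fix
        size_t start = 0, end = 0; // valid data lives in buf[start, end)

        bool fill() {
            if (start == end) {    // refill only once the buffer is drained
                start = end = 0;
                ssize_t n = read(fd, buf, sizeof(buf));
                if (n <= 0)
                    return false;
                end = static_cast<size_t>(n);
            }
            return true;
        }

    public:
        explicit BufferedLineReader(int filedes) : fd(filedes) {}

        // Store the next line (without the '\n') in 'line'; false on EOF or error.
        bool ReadLine(std::string &line) {
            line.clear();
            while (fill()) {
                const char *nl = static_cast<const char *>(
                    memchr(buf + start, '\n', end - start));
                if (nl != nullptr) {
                    line.append(buf + start, nl - (buf + start));
                    start = static_cast<size_t>(nl - buf) + 1; // advance start, no memmove()
                    return true;
                }
                // No newline in the buffered data: take it all and refill.
                line.append(buf + start, end - start);
                start = end;
            }
            return !line.empty(); // final line of a file without trailing newline
        }
    };

Refilling only happens once the buffer is fully drained, so a larger buffer now simply means fewer read() calls rather than more memory traffic.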

Effects on apt-get update

I measured the run time of apt-get update for the update from today’s 07:52 dinstall run to the 13:52 one. The configured sources are unstable and experimental with the amd64 and i386 architectures. The appstream and apt-file indexes are disabled for this test, so only Packages and Sources indexes are fetched.

The results are impressive:

  • For APT 1.1.6, updating with PDiffs enabled took 41 seconds.
  • For APT 1.1.7, updating with PDiffs enabled took 4 seconds.

That’s a tenfold increase in performance. By the way, running without PDiffs took 20 seconds, so there’s no reason not to use them.

Future work

Writes are still unbuffered and account for about 75% to 80% of our run time. That’s an obvious area for improvement.

[Image: rred-profile – profile graph of the rred method]
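The obvious fix is to buffer writes the same way: collect output in a fixed-size buffer and issue one write() per 64 KiB instead of one per line. A rough sketch of what that could look like – not APT’s code, error handling elided:

    #include <unistd.h>

    #include <algorithm>
    #include <cstring>
    #include <string>

    class BufferedWriter {
        int fd;
        char buf[65536];
        size_t used = 0;

    public:
        explicit BufferedWriter(int filedes) : fd(filedes) {}
        ~BufferedWriter() { Flush(); }

        void Flush() {
            size_t off = 0;
            while (off < used) {
                ssize_t n = write(fd, buf + off, used - off);
                if (n <= 0)
                    break;                 // error handling elided in this sketch
                off += static_cast<size_t>(n);
            }
            used = 0;
        }

        // Copy into the buffer; only hit the kernel when the buffer is full.
        void Write(const char *data, size_t size) {
            while (size > 0) {
                if (used == sizeof(buf))
                    Flush();
                size_t chunk = std::min(size, sizeof(buf) - used);
                memcpy(buf + used, data, chunk);
                used += chunk;
                data += chunk;
                size -= chunk;
            }
        }

        void WriteLine(const std::string &line) {
            Write(line.data(), line.size());
            Write("\n", 1);
        }
    };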

Performance for patching compressed files

Contents files are usually compressed with gzip and kept compressed locally, because they are about 500 MB uncompressed and only 30 MB compressed. I profiled this, and it turns out there is not much we can do about it: the majority of the time is spent inside zlib, mostly combining CRC checksums:

[Image: rred-gz-profile – profile of rred patching a gzip-compressed Contents file]

Going forward, I think a reasonable option might be to recompress Contents files using LZO – they will be a bit bigger (50 MB instead of 30 MB), but LZO is about 6 times as fast (compressing a 430 MB Contents file took 1 second instead of 6).
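For a rough idea of how such a comparison could be measured, here is an assumed benchmark sketch – not the measurement behind the numbers above. It uses zlib’s compress2() (the DEFLATE algorithm behind gzip) and LZO1X-1 from liblzo2, timing one in-memory compression pass each over a file given on the command line.

    // Build with: g++ -O2 bench.cc -lz -llzo2
    #include <lzo/lzo1x.h>
    #include <zlib.h>

    #include <chrono>
    #include <cstdio>
    #include <fstream>
    #include <iterator>
    #include <vector>

    static double SecondsSince(std::chrono::steady_clock::time_point t0) {
        return std::chrono::duration<double>(
                   std::chrono::steady_clock::now() - t0).count();
    }

    int main(int argc, char **argv) {
        if (argc < 2 || lzo_init() != LZO_E_OK)
            return 1;

        // Slurp the uncompressed Contents file into memory.
        std::ifstream in(argv[1], std::ios::binary);
        std::vector<unsigned char> data((std::istreambuf_iterator<char>(in)),
                                        std::istreambuf_iterator<char>());

        // zlib (DEFLATE, as used by gzip), default compression level.
        std::vector<unsigned char> zout(compressBound(data.size()));
        uLongf zlen = zout.size();
        auto t0 = std::chrono::steady_clock::now();
        compress2(zout.data(), &zlen, data.data(), data.size(), Z_DEFAULT_COMPRESSION);
        std::printf("zlib:  %zu -> %lu bytes in %.2f s\n",
                    data.size(), static_cast<unsigned long>(zlen), SecondsSince(t0));

        // LZO1X-1, the fast LZO variant.
        std::vector<unsigned char> lout(data.size() + data.size() / 16 + 64 + 3);
        std::vector<unsigned char> wrkmem(LZO1X_1_MEM_COMPRESS);
        lzo_uint llen = lout.size();
        t0 = std::chrono::steady_clock::now();
        lzo1x_1_compress(data.data(), data.size(), lout.data(), &llen, wrkmem.data());
        std::printf("lzo1x: %zu -> %lu bytes in %.2f s\n",
                    data.size(), static_cast<unsigned long>(llen), SecondsSince(t0));
        return 0;
    }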

62 thoughts on “Much faster incremental apt updates”

  1. Hi,

    As a Debian user, I want to just say: thank you, really a lot of thanks.

    I have disabled PDiffs ever since they appeared.

    It’s one of those things in Debian that every Debian supporter (on IRC, etc.) defends, but as an end user you see the opposite behavior (it’s not faster, it’s slower).

    You made my day by publishing your work on this. Thanks for the fixes, and for taking the time to write such a nice explanation for everybody. Great.

    Looking forward to the future work, and willing to test the results on my systems.

    Have a nice day !!!

  2. A relevant comment from elsewhere (not mine):

    “it’s curious that people seem to still think that using lzo makes sense when lz4 is always better to use instead of lzo.

    lz4 gives slightly faster compression (around 10%), very slightly smaller archives (around 1%), and much faster decompression (around 3 to 4x as fast!).”

      1. In recent LZ4 versions, the API has been changed; the old API has been deprecated and plenty of new functions have appeared. TBH, that’s not something I like about LZ4. Now the kernel can be compressed with it … but it needs the “obsolete” format? Oh, crap. An unstable API isn’t what makes a library fun to use.

    1. Actually, I’ve tested LZ4 vs LZO in quite a lot of cases. LZ4 is good, but…
      1) LZ4 usually compresses somewhat worse than LZO; it can’t beat LZO1X-999 even at the -HC 16 level (you can’t go higher). There is also LZ5 floating around, by the author of the lzbench I’ve referred to. He modified the algorithm a bit to improve compression while trying not to kill speed too much. It compresses better, but decompression is also slower. LZ4 makes some trade-offs to be fast; it tries hard to fit the cache at all costs, which implies a lower compression ratio than similar algorithms that are less inclined to fit the cache at all costs.
      2) Thanks to careful optimizations, LZ4 is VERY fast, beating LZO decompression in virtually all cases, on both x86-64 and ARM. Yet it comes at a compression-ratio price: LZ4 can’t compress better than LZO on many kinds of data due to its short lookup range, required to keep the cache hot. LZO is less picky about that; it somewhat hurts compression/decompression speed but improves the ratio. Try something like lzbench on your data, you’ll see🙂

      …and on a side note, computing CRC32 takes a heck of a lot of time, comparable to both LZO and LZ4!!! Whatever; if one wants THIS kind of speed, they want faster checksumming as well. But there is a nasty catch: we also want it to be cryptographically secure, so the data does not get tampered with on the way, and that means the checksums are going to be rather slow. Since that can’t be avoided, I think something like brotli is actually the better overall balance. It would be funny to save decompression time at the expense of increased download time and then also spend plenty of time computing cryptographically secure checksums anyway. And it’s not like you want unknown, unsigned, unchecksummed data coming from hell knows who.

      Optimization isn’t about blindly making things as fast as you can; it is about hitting a sweet spot that makes sense. And in this case that is fairly complicated, since one has to take plenty of factors into account: download times, machine properties, the nature of the data, and so on. Debian runs on really different machines with different properties – say, a Pi-like ARMv7 board vs. an x86-64 box can be quite different, etc. And it would be good if Debian worked great on both, right?🙂

  3. Really great work.

    And I personally love LZO and use it where possible. I’m just thinking of slow connections, where most of the time isn’t spent on decompression but on downloading. There, LZO would make things worse.

  4. Could there be a bug in the second patch, since the buffer.read return value is not always used, and what happens if the requested_size inside buffer.read changes?

    1. Not really, it only happens when the requested size is larger than the buffer size, and the code paths always call it with min(buffer.size(), the size they want). I’ll probably refactor this to be a bit clearer, though.

      1. Sweet, thanks!

        > Julian Andres Klode commented: “I profiled with the gperftools profiler, then used its google-pprof --svg to create an SVG which I converted to png using rsvg-convert.”

  5. > The reason for this is that our I/O is unbuffered, and we were reading one byte at a time in order to read lines.

    Will there be public inquiry and investigation into how this happen in first place?

    Why this not caught earlier?

    How much other suspect code in APT?

    I used to be trusting of Debian and APT but after I learn here of such bad bugs how can I be trusting in Debian again unless there is full public inquiry about how this bug got in there in first place?

    We need root cause analysis please!

    We need root cause analysis so we can again be trusting of Debian!

    – Johann

    1. It’s only a performance issue. We kind of forgot that ReadLine() was not implemented efficiently when we started using it for rred. The code in question was only in APT 1.1 releases and pre-releases uploaded during and after DebConf, so for about 4 months.

    2. Johann, just a guess, but we would only notice this when things like network performance to the repo becomes really fast, right? So if the code was tested on a slow network link, could that mask the issue? Is this a case of a chain only being as strong as the weakest link? As one link in the chain gets stronger, so we find the next weakest link, and so on. Just a guess on my part…

  6. That’s great

    but I’d rather see improvements in space management.

    I continuously run into upgrade failures because the apt tool (stupidly) thinks that I’m out of disk space, and I guess I’m not the only one.

    Right now, AFAIK, apt installs updates this way:
    download all packages,
    then install them all.

    It would be better if it worked this way:
    download a package / install it / clean up,
    then process the next package.

    Why would it be better?
    You wouldn’t have to wait for the whole bulk (long) installation, but could start using new versions as soon as they are installed.
    You would also be able to install more software:
    e.g. you only have 300 MB left and you want to update the kernel sources (200 MB) + GPU driver (150 MB) + Java junk (200 MB).
    Right now you’re out of luck (= you need to delete files).
    With partial/repeated installs you’d start with the Java junk first.
    After installation and cleaning you’d still have 300 MB of free space (minus some megabytes because the new version is probably a bit larger).
    After this installation you’re able to use your new Java tools.
    And then you can update the kernel sources…
    And then you can update/recompile the GPU driver…

    I think the apt tool needs a smarter algorithm that would install packages in the following order:

    update versions (with no deps) that free up space (if any)
    update versions (with no deps) that will be necessary for other packages
    update versions (with no deps) of popular packages first

    thanks for reading

    keep up the good work

    1. Yes. I also had this problem on my last computer. (Now I have more disk space.)
      Actually, if apt installed one update while downloading the next, it would speed up the whole install process and solve the disk space problem at the same time.

      Another speedup could be using IPFS (https://ipfs.io/) to increase download speed.

    2. This drove me nuts while using Debian on SheevaPlug – I had to constantly uninstall software in order to perform system update/upgrade.

      BTW, when it comes to updates as you describe it above, that’s pretty much how OpenBSD’s pkg_add works, albeit it goes through the list of installed packages alphabetically, obviously resolving dependencies as it goes. There are ideas out there to be learned from.

  7. I’ve always disabled pdiffs, because it takes approximately 15 minutes to download the pdiff files, whereas downloading the entire apt repository can take a couple of minutes. The reason it takes 15 minutes is that each incremental update – although they are small – has a significant pause in between each file. If there are 60 files (one per day since the last apt-get update), with a 10 second pause in between for HTTP or HTTPS connection setup and DNS resolution, and each file is downloaded at only 40 KB/sec (instead of 400 KB/sec for the entire apt repository), it is hardly surprising that the total time for a pdiff apt-get update is so insanely high.

    Fixing this should involve things like concatenating all the required pdiff files on the fly – for example by using an inline (uncompressed) tar command – and having the apt-get pdiff protocol specify the date range of required .pdiff files as a single parameter of the HTTP request made to the apt server. Such a solution would allow the massive latency associated with single-file downloads from these highly overloaded apt servers to be minimised.

    1. Using pdiffs is faster if done regularly; instead of disabling them you can just enable the scheduled update:

      # echo 'APT::Periodic::Update-Package-Lists "1";' >> /etc/apt/apt.conf.d/10periodic

      1. Seems like the on-the-fly concatenation would be helpful for those on high-latency connections (satellite) or those whose systems are shut down regularly and miss the periodic pdiff updates (laptops). Is this thinking wrong?

    2. >because each incremental update – although they are small – has a significant pause in between each file. if there are 60 files (one per day since the last apt-get update), with a 10 second pause in between for HTTP or HTTPS connection setup and DNS resolution,

      >fixing this should involve things like the concatenation of all pdiff files required, on-the-fly – for example by using an inline (uncompressed) tar command, and having the apt-get pdiff protocol specify the date-range of required .pdiff files to be downloaded as a single parameter to the HTTP request made to the apt server.

      I think this would be the wrong solution. There’s no reason HTTP keep-alive cannot be used (it’s the default since HTTP/1.1 – so for more than a decade) – and DNS resolution shouldn’t be done again until the TTL of the entry has passed. In that case there’s no extra overhead whether you fetch multiple files or not. The actual bug would be closing the connection in between in the first place.

      It’s nice that people start taking performance more seriously – but in this case, the http keep-alive would have a much larger payoff everywhere than one-off solutions like concatenating in this specific case.

      @wearebobo: satellite links already have a large window so that they don’t have to wait. HTTP allows pipelining – where you request many files without waiting for an answer in-between.

      I’m not sure why people always use the lowest-possible subset of HTTP. All these things are solved problems… no need to put a bandaid on some other place instead of just using the HTTP facilities that are already provided.

      1. @Danny Milosavljevic

        I agree, Keep-Alive (plus pipelining I guess? Is that the technical way of saying “send lots of requests at once?”) is the correct solution here. Only one TCP handshake, one DNS resolution (assuming connection does not drop).

        What do you mean when you say satellite connections have a larger window to compensate for latency? Are you talking about the receive window on TCP? Does it start with a larger receive window, or does it still have to go through TCP slow start (which I understand is now misnamed, since TCP actually doubles the sending window size during slow start now? I might have my implementations mixed up, not sure what Debian uses.)

        Thank you!

        Sounds like the install + cleanup one at a time is still useful for low-disk-space environments. You could always start fetching the next packages in parallel and cache them in progress until disk space frees up once the first one finishes installing and is cleaned up.

      2. …that’s where HTTP/2 rocks: it takes a big bite out of the latency issues AND can also pipeline stuff over the same connection with fairly small overhead. And these days modern servers (at least Apache and nginx) have already added HTTP/2 support. HTTP libraries are catching up as well. So, isn’t it time to get the best of these techs?😛

  8. > AFAIK: We’d love to do pipelining, but apparently some of our mirrors misbehave and we cannot detect that as we do not have the needed metadata yet.

    Then one of the requirements for being an official mirror should be that your server supports HTTP pipelining. Add a cron job that checks the state every hour and removes mirrors that misbehave from the official list of mirrors + redirector.

    1. As a good client, you have to handle pipelining failures anyway, and probably should support retrying the requests that failed in the pipeline. For example, if for any reason an HTTP connection pauses for more than the keep-alive timeout in between responses/requests, it will probably be dropped anyway (Apache2 defaults this to 5 seconds).

      1. I think we actually do – I cannot see that it was disabled. IIRC, we detect pipelining failures using hash sums, as some servers might send responses in the wrong order, and we disable pipelining if the hashes are wrong and try again. I’m not sure whether this was specifically disabled for PDiffs or not; I think I remember it was, but I don’t see anything in the code.

  9. Have you considered overlapping downloads and installs? Currently, apt pulls down all the files, then installs them all, but you could download and install them as they come, and start downloading the next ones while installing their dependencies.

    However, this might get a bit strange when you hit errors because a package was updated in the archive and apt-get update needs to be run again.

  10. There was a GSoC project about this some years ago, IIRC, but I have to say that I personally don’t like it very much. If installation of a package fails while others are still downloading, you might not be able to recover as well as you can now.

  11. Nice work! I can also recommend the linux perf tool for doing low-overhead CPU profiling. The “perf report” has a very nice text mode curses interface.

  12. Regarding pipelining and your mirrors not supporting it, I suggest you follow the rule “build it and they will come”.

    Once the feature is in place and working for some of the mirrors, the maintainers of the other mirrors will want to support it as well, at which point the feature will be available and ready for use.

    1. You still have to compute something like SHA-256 or better over ALL the data, otherwise it becomes really easy to patch your system in arbitrary ways by MITMing your traffic. So after speeding up CRC32 you’ll find there is a full-blown SHA to compute, which is much worse than any CRC. CRC32 is slow for sure, though – e.g. Adler-32 or a similar Fletcher checksum is much faster/smaller at comparable reliability. But that reliability isn’t enough for tamper-proof downloads anyway, so I guess the best idea is just to get rid of the CRC completely. You can’t get rid of the secure hashing, though, and it is doomed to be slow, since it MUST do good permutations; otherwise it will be cracked very easily.

  13. memmove() is only useful if you are shifting data over itself; memmove is EXPENSIVE to use. If the target bytes are distinct from the source bytes (determined before the move), that is, there is no overlap, then replace memmove() with memcpy() for more than a 50% improvement.

  14. That’s practically wrong, because the implementation is optimized to be basically equally fast, so you can just always use memmove() – for example, in the kernel, memcpy() is actually memmove().

    In this case, though, memmove should not be used at all.

  15. If you care about compression/decompression speed at the expense of compression ratio, LZ4 is even faster in most cases. Brotli has also proven to be interesting: it usually gives you LZMA-grade compression at zlib-grade decompression speed, reaching a very sweet spot – i.e. no loss in compression, but 2x faster to decompress on ARM and 3x faster on x86-64 compared to LZMA. Interestingly, zlib isn’t that bad either: it actually manages to reach a fairly good balance in terms of speed vs. compression ratio. Yet zstd usually does it better (a fairly interesting algorithm from the LZ4 author).

    But it is not that simple. How a compression algorithm performs also depends strongly on the nature of the machine and the data used. Say, a little ARM board and an x86-64 powerhouse will show entirely different patterns, and some data will just perform very well with some algorithms or very badly with others.

    I recently stumbled upon https://github.com/inikep/lzbench and really like it – this program lets you get a good idea of how a bunch of algorithms perform on YOUR data. It focuses on benchmarking LZ-based algorithms, most of which are noteworthy for good decompression speed and modest decompression memory requirements.

  16. Why not use memory mapped files through mmap and let the operating system deal with caching in a way that is most efficient for that specific case?

    1. I guess because mmapped I/O can’t handle I/O errors reasonably, and in a package manager you actually care about the status of all operations. Getting a borked system without even knowing it went nuts would be a rather sad experience. Also, “usual” file operations are buffered as well, unless you explicitly request them not to be for some reason – and a package manager MAY have such reasons. For a package manager it is really not an option to fail silently, get killed by SIGSEGV on I/O errors, etc. So mmapped I/O isn’t a package manager’s best friend. It also comes with many nasty limits on 32-bit systems, and Debian runs on plenty of 32-bit architectures.

  17. Just a quick note on gzip compression performance. I don’t know whether you can use this as a library, but pigz is a gzip-compatible tool that is multithreaded. This means performance scales with the number of cores. This way you can stay backwards compatible on the format, but get enormous performance gains on most systems. (There is something special about seeing a 1 GB file being compressed in 10 seconds, if you are doing it on a machine with 24 virtual cores, that is.)

    1. Those must be slow 24 cores then: One of my laptop cores compressed 1GiB in 15 seconds [70 MiB/s], and with pigz in 7s (dual-core) [160 MiB/s]. But those are still miles below what lz4 achieves: (less than) 2 seconds [588 MiB/s].
