
A year ends, a new year begins

2017 is ending. It’s been a rather uneventful year, I’d say. About 6 months ago I started working on my master’s thesis – it plays with adding linear types to Go – and I handed that in about 1.5 weeks ago. It’s not really complete, though – you cannot actually use it on a complete Go program. The source code is of course available on GitHub: a bunch of Go code for the implementation and a bunch of Markdown and LaTeX for the document. I’m happy about the code coverage, though: As a properly developed software project, it achieves about 96% code coverage – the missing parts are at the end, when time ran out 😉

I released apt 1.5 this year, and started 1.6 with seccomp sandboxing for methods.

I went to DebConf17 in Montreal. I unfortunately did not make it to DebCamp, nor the first day, but I at least made the rest of the conference. There, I gave a talk about APT development in the past year, and had a few interesting discussions. One thing that directly resulted from such a discussion was a new proposal for delta upgrades, with a very simple delta format based on a variant of bsdiff (with external compression, streamable patches, and constant rather than linear memory use). I hope we can implement this – the savings are enormous with practically no slowdown (there is no reconstruction phase; upgrades are streamed directly to the file system), which is especially relevant for people with slow or data-capped connections.

This month, I’ve been buying a few “toys”: I got a pair of speakers (JBL LSR 305), and I got noise-cancelling headphones (a Sony WH-1000XM2). Nice stuff. I’ve been wearing the headphones most of today, and they’re quite comfortable and really make things quiet, except for their own noise 😉 Well, both the headphones and the speakers have a white noise issue, but oh well, the prices were good.

This time of the year is not only a time to look back at the past year, but also to look forward to the year ahead. In one week, I’ll be joining Canonical to work on Ubuntu foundations stuff. It’s going to be interesting. I’ll also be moving places shortly: having lived in student housing (one room and a shared kitchen) for most of the past 6 years, I’ll be moving to a complete apartment.

On the APT front, I plan to introduce a few interesting changes. One of them involves automatic removal of unused packages: This should happen automatically during install, upgrade, and so on. Maybe not for all packages, though – we might have a list of “safe” autoremovals. I’d also be interested in adding metadata for transitions: For example, if libfoo1 replaces libfoo0, we can safely remove libfoo0 once nothing depends on it anymore. Maybe not for all “garbage” either – it might make sense to restrict this to new garbage, that is, packages that become unused as part of the operation. This is important for the safe handling of existing setups with automatically removable packages: We don’t want to suddenly remove them all when you run upgrade.

The other change is about sandboxing. You might have noticed that sometimes, sandboxing is disabled with a warning because the method would not be able to access the source or the target. The goal is to open these files in the main program and send file descriptors to the methods via a socket. This way, we can avoid permission problems, and we can also make the sandbox stronger – for example, by not giving it access to the partial/ directory anymore.
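The fd passing itself would presumably be the standard Unix mechanism: descriptors sent as SCM_RIGHTS ancillary data over a Unix domain socket. A minimal sketch of that mechanism (my illustration, not APT’s actual code):

#define _GNU_SOURCE
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Sketch: hand an already-open file descriptor to a sandboxed child
 * over a Unix domain socket. Illustrative only; minimal error handling. */
static int send_fd(int sock, int fd)
{
    char dummy = 'F';                        /* must send at least 1 byte */
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;                /* ensures proper alignment */
    } control;
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = control.buf, .msg_controllen = sizeof(control.buf),
    };
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;            /* "this message carries fds" */
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
    return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
}

On the receiving side, recvmsg() with a matching control buffer makes the kernel install the descriptor into the method’s fd table – so the sandboxed process never needs permission to open the file itself.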

Another change we need to work on is standardising the “Important” field, which is sort of like Essential – it marks an installed package as extra-hard to remove (but unlike Essential, it does not cause apt to install the package automatically). The latest draft calls it “Protected”, but I don’t think we have a consensus on that yet.

I also need to get Happy Eyeballs done – fast fallback from IPv6 to IPv4. I had a completely working solution some months ago, but it did not pass CI, so I decided to start from scratch with a cleaner design to figure out if I went wrong somewhere. Testing this is kind of hard, as it basically requires a broken IPv6 setup (well, unreachable IPv6 servers).
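For reference, the basic shape of such a fast-fallback connect is roughly the following (a sketch of the Happy Eyeballs idea, not APT’s actual implementation; address iteration, timeouts, and error handling are all simplified):

#define _GNU_SOURCE             /* for SOCK_NONBLOCK */
#include <errno.h>
#include <netdb.h>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

static int connect_nonblocking(const struct addrinfo *ai)
{
    int fd = socket(ai->ai_family, ai->ai_socktype | SOCK_NONBLOCK,
                    ai->ai_protocol);
    if (fd >= 0 && connect(fd, ai->ai_addr, ai->ai_addrlen) < 0 &&
        errno != EINPROGRESS) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Try IPv6 first; if it is not writable within ~250 ms, race an IPv4
 * attempt against it and return whichever socket connects first. */
int happy_eyeballs_connect(const struct addrinfo *v6,
                           const struct addrinfo *v4)
{
    struct pollfd pfd[2] = {
        { .fd = connect_nonblocking(v6), .events = POLLOUT },
        { .fd = -1, .events = POLLOUT },
    };
    if (poll(pfd, 1, 250) != 1 || pfd[0].revents != POLLOUT)
        pfd[1].fd = connect_nonblocking(v4);   /* IPv6 is slow: race v4 */

    while (pfd[0].fd >= 0 || pfd[1].fd >= 0) {
        if (poll(pfd, 2, -1) < 0)
            break;
        for (int i = 0; i < 2; i++) {
            if (pfd[i].fd < 0 || pfd[i].revents == 0)
                continue;
            int err = 0;
            socklen_t len = sizeof(err);
            getsockopt(pfd[i].fd, SOL_SOCKET, SO_ERROR, &err, &len);
            if (err == 0) {                    /* this one connected */
                if (pfd[1 - i].fd >= 0)
                    close(pfd[1 - i].fd);      /* cancel the loser */
                return pfd[i].fd;
            }
            close(pfd[i].fd);                  /* attempt failed; poll */
            pfd[i].fd = -1;                    /* ignores negative fds */
        }
    }
    return -1;
}

Testing indeed needs an environment where the IPv6 path stalls rather than fails fast – an immediate connection error already triggers the fallback on its own.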

Oh well, 2018 has begun, so I’m going to stop now. Let’s all do our best to make it awesome!


jak-linux.org moved / backing up

In the past two days, I moved my main web site jak-linux.org (and jak-software.de) from a very old contract at STRATO over to something else: The domains are registered with INWX and the hosting is handled by uberspace.de. Encryption is provided by Let’s Encrypt.

I requested the domain transfer from STRATO on Monday at 16:23, received the auth codes at 20:10, and the .de domain was transferred completely at 20:36 (about 20 minutes if you count my overhead). The .org domain I had to ACK, which I did at 20:46, and at 03:00 I received the notification that the transfer was successful (I think there was some registrar ACKing involved there). So the whole transfer took about 10½ hours, or 7 hours counting from when I retrieved the auth codes. I think that’s quite a good time 🙂

And, for those of you who don’t know: uberspace is a shared hoster that basically just gives you an SSH shell account, directories to drop files in for the HTTP server, and various tools to add subdomains, certificates, and virtual users to the mail server. You can also run your own custom-built software and open ports in their firewall. That’s quite cool.

I’m considering migrating the blog away from WordPress at some point in the future – having a more integrated experience is a bit nicer than having my web presence split over two sites. I’m also unsure whether I should add something like Cloudflare in front – I don’t want to overload the servers (but I only serve static pages, so how much load is this really going to get?).

In other news: off-site backups

I also recently started doing off-site backups via borg to a server operated by the wonderful rsync.net. For those of you who do not know rsync.net: You basically get SSH access to a server where you can upload your backups via common tools like rsync and scp, or you can go crazy and use git-annex, borg, or attic; or you could even just plain zfs send your stuff there.

The normal price is $0.08 per GB per month, but there is a special borg price of $0.03 (that price does not include snapshotting or support, though). You can also get a discounted normal account for $0.04 if you find the correct code on Hacker News, and there are other discounts for open source developers, students, etc. – you just have to send them an email.

Finally, I must say that uberspace and rsync.net feel similar in spirit. Both heavily emphasise the command line, and don’t really have any fancy click stuff. I like that.

Introducing TrieHash, an order-preserving minimal perfect hash function generator for C(++)

Abstract

I introduce TrieHash, an algorithm for constructing perfect hash functions from tries. The generated hash functions are pure C code, minimal, order-preserving, and outperform existing alternatives. Together with the generated header files, they can also be used as a generic string-to-enumeration mapper (the enums are created by the tool).

Introduction

APT (and dpkg) spend a lot of time parsing various files, especially Packages files. To hash fields in those files, APT currently uses a function called AlphaHash, which hashes the last 8 bytes of a word in a case-insensitive manner (dpkg just compares strings in an array of structs).
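For illustration, such a scheme might look roughly like this (a sketch of the idea only; APT’s actual AlphaHash mixes the bytes differently):

#include <stddef.h>

/* Hash only the last 8 bytes of a word, case-folded. The mixing
 * function here is hypothetical, shown just to illustrate the idea. */
static unsigned alpha_hash(const char *text, size_t length)
{
    unsigned result = 0;
    size_t start = length > 8 ? length - 8 : 0;
    for (size_t i = start; i < length; i++)
        result = result * 31 + (unsigned char)(text[i] | 32);
    return result;
}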

There is one obvious drawback to using a normal hash function: When we want to access the data in the hash table, we have to hash the key again, so every accessed key is hashed at least twice. It turned out that this costs something like 5 to 10% of the cache generation performance.

Enter perfect hash functions: A perfect hash function maps a set of words to constant values without collisions. You can thus use the value to index into your hash table directly, and do not have to hash again (if you generate the function at compile time and store the key constants) or handle collision resolution.
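The payoff looks roughly like this (a hypothetical illustration with made-up names, just to show the shape): the compile-time constant generated for a key doubles as its table slot, so lookups skip the second hashing step entirely.

/* Hypothetical illustration of direct indexing with a perfect hash. */
enum Field { Package, Version, Architecture, FieldCount, UnknownField };

static const char *values[FieldCount];   /* one slot per known key */

static void set_field(enum Field f, const char *value)
{
    if (f != UnknownField)
        values[f] = value;               /* direct index, no re-hashing */
}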

As #debian-apt people know, I happened to play around a bit with tries this week before guillem suggested perfect hashing. Let me tell you one thing: my trie implementation was very naive and did not really improve things a lot…

Enter TrieHash

Now, how is this related to hashing? The answer is simple: I wrote a perfect hash function generator that is based on tries. You give it a list of words, it puts them in a trie, and generates C code out of it, using recursive switch statements (see code generation below). The function achieves performance competitive with other hash functions, and usually even outperforms them.

Given a dictionary, it generates an enumeration (a C enum or C++ enum class) of all words in the dictionary, with the values corresponding to the order in the dictionary (the order-preserving property), and a function mapping strings to members of that enumeration.

By default, the first word is considered to be 0 and each subsequent word increases a counter by one (that is, it generates a minimal hash function). You can tweak that, however:

= 0
WordLabel ~ Word
OtherWord = 9

will return 0 for an unknown value, map “Word” to the enum member WordLabel, and map OtherWord to 9. That is, the input list functions like the body of a C enumeration. If no label is specified for a word, it will be generated from the word. For more details, see the documentation.

C code generation

switch(string[0] | 32) {
case 't':
    switch(string[1] | 32) {
    case 'a':
        switch(string[2] | 32) {
        case 'g':
            return Tag;
        }
    }
}
return Unknown;

Yes, really recursive switches – they directly represent the trie. Now, this is not quite a straightforward translation; there are some optimisations that make the whole thing faster and easier to look at:

First of all, the 32 you see is used to make the check case-insensitive, in case all cases of the switch body are alphabetical characters. If there are non-alphabetical characters, it will generate two cases per character, one uppercase and one lowercase (sharing a single break). I did not know before that lowercase and uppercase characters differ by only one bit; thanks to the clang compiler for pointing that out in its generated assembler code!
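You can check the bit trick directly – each ASCII letter pair differs only in bit 5 (value 32):

#include <assert.h>

int main(void)
{
    /* 'T' is 0x54 and 't' is 0x74: OR-ing with 32 (0x20) sets bit 5,
     * mapping both cases to the lowercase letter. */
    assert(('T' | 32) == 't');
    assert(('t' | 32) == 't');
    return 0;
}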

Secondly, we insert breaks only between cases. Initially, each case ended with a return Unknown, but guillem (the dpkg developer) suggested it might be faster to let them fall through where possible. It turns out that was not faster on a good compiler, but it’s still more readable anyway.

Finally, we build one trie per word length, and switch by the word length first. Like the 32 trick, this gives a huge improvement in performance.
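Concretely, the outer dispatch looks roughly like this (an illustrative sketch with made-up field names, not verbatim TrieHash output):

/* Dispatch on the length first; each per-length trie can then walk
 * the string without any bounds checks. */
enum Field { Tag, Size, Unknown };

static enum Field hash3(const char *s)      /* trie for 3-byte words */
{
    switch (s[0] | 32) {
    case 't':
        switch (s[1] | 32) {
        case 'a':
            switch (s[2] | 32) {
            case 'g': return Tag;
            }
        }
    }
    return Unknown;
}

static enum Field hash4(const char *s)      /* trie for 4-byte words */
{
    switch (s[0] | 32) {
    case 's':
        switch (s[1] | 32) {
        case 'i':
            switch (s[2] | 32) {
            case 'z':
                switch (s[3] | 32) {
                case 'e': return Size;
                }
            }
        }
    }
    return Unknown;
}

enum Field perfect_hash(const char *s, unsigned length)
{
    switch (length) {
    case 3: return hash3(s);
    case 4: return hash4(s);
    default: return Unknown;
    }
}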

Digging into the assembler code

The whole code translates to roughly 4 instructions per byte:

  1. A memory load,
  2. an or with 32,
  3. a comparison, and
  4. a conditional jump.

(On x86, the case-sensitive version actually only has a cmp-with-memory and a conditional jump.)

Due to GCC bug 77729 (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77729), this may be one instruction more: On some architectures, an unneeded zero-extend-byte instruction is inserted – this causes a 20% performance loss.

Performance evaluation

I ran the hash against all 82 words understood by APT in Packages and Sources files, 1,000,000 times for each word, and summed up the average run times (lower is better):

host      arch      Trie  TrieCase  GPerfCase  GPerf   DJB
plummer   ppc64el    540       601       1914   2000  1345
eller     mipsel    4728      5255      12018   7837  4087
asachi    arm64     1000      1603       4333   2401  1625
asachi    armhf     1230      1350       5593   5002  1784
barriere  amd64      689       950       3218   1982  1776
x230      amd64      465       504       1200    837   693

Suffice to say, GPerf does not really come close.

All hosts except the x230 are Debian porterboxes. The x230 is my laptop with a Core i5-3320M; barriere has an Opteron 23xx. I included the DJB hash function as another reference point.

Source code

The generator is written in Perl, licensed under the MIT license, and available from https://github.com/julian-klode/triehash – I initially prototyped it in Python, but guillem complained that this would add new build dependencies to dpkg, so I rewrote it in Perl.

The benchmark is available from https://github.com/julian-klode/hashbench.

Usage

See the script for POD documentation.

ThinkPad X230 UEFI broken by setting a setting

Today, I decided to set my X230 back to UEFI-only boot, after having changed that recently for a BIOS upgrade (to fix a resume bug). I then chose to save the settings and received several error messages telling me that the system ran out of resources (probably storage space for UEFI variables).

I rebooted my machine and saw no logo appearing – just something like an underscore on a text console. The system appears to boot normally otherwise, and once the i915 module is loaded (and we’re switching away from UEFI’s Graphics Output Protocol [GOP]), the screen works correctly.

So it seems the GOP broke.

What should I do next?

Review: ThinkPad X230

This week, a few hours after Lenovo announced the X240, I bought an X230. Normally, the X230 model I bought comes with a Core i5-3320M CPU, 4GB RAM, and a 500GB HDD. My model was a special set including a second 4GB RAM stick and a 128GB mSATA Plextor SSD. It came without Windows, and the ThinkVantage button is black instead of blue and has no label.

I put a fresh installation of unstable on it and tweaked it to save more power when on battery (enabled RC6++, and enabled autosuspend and other powertop suggestions with a script in /etc/pm/power.d); I also configured hdparm.conf to put my hard disk into standby after 5 seconds (it’s only used for big data anyway, so most of the time it is unused). The laptop now consumes 5W when idle with minimum brightness, and 8-10W at brightness 13 of 15. Consumption when surfing is 10-15 watts. Booting from GRUB to GDM is fast; I did not measure it, but it probably takes about 5 seconds.
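If memory serves, the hdparm.conf stanza for that looks something like this (device name illustrative; hdparm’s -S values from 1 to 240 are multiples of 5 seconds, so 1 means standby after 5 seconds):

/dev/sdb {
    # spindown_time uses hdparm -S units: here 1 = 5 seconds
    spindown_time = 1
}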

The IPS screen looks really good – much, much, much better than the screen in my 2010 Edge 15 (I reviewed that one in 2010). It seems a bit more glossy than that one, but still matte. The keyboard is good as well. The touchpad is crap, however. All hardware seems to work.

A comparison to the X240 for others who are thinking about buying one of them: The X240 is lighter and thinner (it’s an Ultrabook) and has an optional FullHD 12.5-inch screen. It also offers a much larger touchpad and palm rest. But compared to the X230, the X240 lacks many things: There are no dedicated buttons for the trackpoint (you need to use the click-pad), it’s limited to 8GB RAM, it uses a slower low-voltage Haswell CPU, and it uses the new M.2 connector (probably supporting only the shortest cards) instead of mini-PCIe – so it’s not really possible to add an additional SSD currently, as M.2 SSDs do not seem to be available yet. I also do not know whether the X240 offers a ThinkLight or LEDs for hard drive activity and the like.

Implicit preferences in OR dependencies

Debian packages commonly use OR dependencies of the form “a | b” to mean that a or b should be installed, while preferring option a over b. In general, to resolve an OR dependency, we try all options from left to right, preferring the left-most option. We also prefer real packages over virtual ones. If one of the alternatives is already installed, we use that. In pseudocode:

def solve_or(alternatives):
    # Resolve an OR dependency "a | b | ...": an already installed
    # alternative wins immediately; otherwise prefer the left-most
    # real package, then the left-most virtual package.
    best_real = None
    best_virtual = None
    for dep in alternatives:
        for target in dep:
            if target.name == dep.name and best_real is None:
                best_real = target        # first real package
            if target.name != dep.name and best_virtual is None:
                best_virtual = target     # first virtual package
            if target.is_installed():
                return target

    return best_real if best_real is not None else best_virtual

Now, this way of solving dependencies is slightly problematic. Let us consider a package that depends on: a | b, b. APT will likely choose to install ‘a’ to satisfy the first dependency and ‘b’ to satisfy the second. I currently have draft code around for a future version of APT that will cause it to revert unneeded changes later on, which means that APT will then only install ‘b’. This result closely matches that of the CUDF solvers and cupt’s solver.

On the topic of solving algorithms, we also have the problem that optimising solvers like the ones used with apt-cudf do not respect the order of dependencies, instead choosing to minimise the number of packages installed. This causes such a solver to often do things like selecting an SQLite database as the backend for some service rather than a larger SQL server, as that installs fewer packages.

To make such solvers aware of the implicit preferences, we can introduce a new type of dependency category: Weak conflicts, also known as Recommends-Not. If a package P defines a Recommends-Not dependency against a package Q, then this means that Q should not be installed if P is installed. Now, if we have a dependency like:

Depends: a | b | c

we can encode this as:

Recommends-Not: c, c, b

This causes the solver to prefer a, then b, and then c. It should be representable as a pseudo-boolean optimisation problem, as is common for the dependency problem, although I have not looked at that yet – it should work by taking the standard representation of conflicts, adding a relaxation variable, and then minimising (or maximising) the number of relaxation variables set.
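Sketching the encoding (my own notation, nothing official): with 0/1 variables $x_p$ for “package $p$ is installed” and one relaxation variable $r_i$ per Recommends-Not entry, the example above would become something like

\[ x_a + x_b + x_c \ge 1, \qquad r_1 \ge x_c,\quad r_2 \ge x_c,\quad r_3 \ge x_b, \]
\[ \min\; r_1 + r_2 + r_3 \;\; (= 2x_c + x_b), \]

so installing a costs nothing, b costs one relaxation variable, and c costs two – exactly the left-to-right preference.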

Memory allocators

Over the past few days, I looked at various memory allocators in a quest to improve performance in the multi-threaded test cases of my reference-counting language runtime / object allocator (a fun project).

It turns out that glibc’s memory allocator is relatively slow, especially if you do loops that create one element and destroy one element at the same time (for example, map() on a list you no longer use after you passed it to map). To fix this problem, I considered various options.

The first option was to add a thread-local cache around malloc(). So whenever we want to free() an object, we place it on a thread-local list instead, and if a malloc() request for an object of that size comes in, we reuse the old object.
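A minimal sketch of that idea, assuming a single size class for brevity (a real cache needs size classes, a bound on the list length, and a way to return memory to the system):

#include <stdlib.h>

enum { OBJECT_SIZE = 32 };               /* single size class */

struct free_node { struct free_node *next; };

static __thread struct free_node *cache; /* per-thread free list */

void *cached_alloc(void)
{
    if (cache != NULL) {                 /* reuse a cached object */
        void *p = cache;
        cache = cache->next;
        return p;
    }
    return malloc(OBJECT_SIZE);          /* fall back to malloc() */
}

void cached_free(void *p)
{
    struct free_node *node = p;          /* push onto the local list */
    node->next = cache;
    cache = node;
}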

This fixes the problem with the lists, but experiments showed another problem with the glibc allocator: allocating many objects without releasing others (let’s say, appending an element to a functional list). I started testing with tcmalloc instead, and noticed that it was several times faster, reducing run time from 6.75 seconds to 1.33 seconds (5 times faster). As I do not want to depend on a C++ code base, I decided to write a simple allocator that allocates blocks of memory via mmap(), splits those into objects, and puts the objects on a local free list. That allocator performed faster than tcmalloc, but it was also just a simple test case and not really usable, due to potential fragmentation problems (and because the code was statically linked in, causing thread-local storage to be considerably faster, as it does not need to call a function on every TLS access, but can rather go through a segment register).

I then discovered jemalloc and tried testing with it. It turned out that jemalloc was even faster than tcmalloc in all test cases, and it also required about the same amount of memory as tcmalloc (and slightly more than the custom test code, as that combined reference counting with the memory allocator’s metadata and thus had less metadata overhead). Using jemalloc, the list-building test now runs in 0.9X seconds (instead of tcmalloc’s 1.33 or glibc’s 6.75 seconds), requires 0 major page faults (tcmalloc has 2), and uses 5361472 kB of memory (tcmalloc uses 5290240, glibc requires 7866608).

Given that jemalloc is written in C, I can only recommend it to everyone in need of a scalable memory allocator. Depending on your workload, it might require less memory than tcmalloc (at least that’s the case at Facebook) and is most likely faster. It also provides tcmalloc-compatible profiling facilities (as Facebook needed them). Furthermore, jemalloc is the memory allocator used by FreeBSD and by Mozilla and Facebook projects, and should thus be relatively stable and usable.

All tests were run with 4 threads, on a dual-core laptop with hyper-threading, using (e)glibc 2.13, tcmalloc 2.0, and jemalloc 3.0. The tests may not be representative of real world performance, and do not account for fragmentation.

Should I try replacing Python’s custom caching of malloc() calls with simply calling jemalloc, and run a few Python benchmarks? Anyone interested in that?