APT2 progress report for the 1st half of December

This week was successful. I have pushed some changes from November to the repository. They change the license to LGPL-2.1+ (which makes bi-directional sharing of code with other projects easier, since most Vala code is under the same license) and implement HTTP using libsoup2.4 directly, instead of going through GIO and GVFS. I also added a parser for the sources.list format, which uses regular expressions to parse the file and is relatively fast. The code needs a current checkout of Vala’s git master to work correctly, as released versions have a bug which I noticed today and which Jürg Billeter fixed in Vala 25 minutes later. Thank you, Jürg.
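To illustrate the idea of a regular-expression based sources.list parser, here is a minimal sketch in Python rather than Vala. It is not the APT2 code: the names `SourceEntry` and `parse_sources_list` are made up for this example, and it only assumes the classic one-line format (`deb uri suite [component ...]`) with `#` comments.

```python
import re
from typing import List, NamedTuple


class SourceEntry(NamedTuple):
    """One parsed line of sources.list (hypothetical type for this sketch)."""
    kind: str              # "deb" or "deb-src"
    uri: str
    suite: str
    components: List[str]


# Classic one-line format: <type> <uri> <suite> [<component> ...]
_LINE_RE = re.compile(
    r"^(?P<kind>deb|deb-src)\s+"
    r"(?P<uri>\S+)\s+"
    r"(?P<suite>\S+)"
    r"(?P<components>(?:\s+\S+)*)\s*$"
)


def parse_sources_list(text: str) -> List[SourceEntry]:
    """Return the entries found in `text`, ignoring comments and blank lines."""
    entries = []
    for raw in text.splitlines():
        # Drop everything after '#' and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        match = _LINE_RE.match(line)
        if match is None:
            continue  # this sketch silently skips lines it does not understand
        entries.append(
            SourceEntry(
                match["kind"],
                match["uri"],
                match["suite"],
                match["components"].split(),
            )
        )
    return entries
```

Calling `parse_sources_list(open("/etc/apt/sources.list").read())` would return one `SourceEntry` per recognized line. A real parser would report malformed lines instead of skipping them.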

While nothing else happened in the public repository, the internal branch has seen a lot of new code, including SQLite 3 caches, Acquire text progress handling, and capt, the command-line advanced package tool. Most of the code will need to be reworked before it is published, but I hope to have this completed by Christmas. It will also depend on Vala 0.7.9 or newer, which is yet to be released.

The decision to use SQLite 3 as a backend means that we won’t see the size limitations APT has, and that development can be simplified by using SQL queries for filtering requests. It also means that APT2 will be very fast in most actions, such as searching, which currently takes 0.140 seconds (with unstable, experimental and more repositories enabled), whereas aptitude takes 1.101 seconds, cupt (which has no on-disk cache) 1.292 seconds, and apt-cache 0.475 seconds. Searching is performed by one SQL query. I also want to thank Jens Georg <mail@jensge.org>, who wrote Rygel’s Database class, which is used in APT2 with minor modifications (like defaulting to in-memory journals). Rygel.Database is a small wrapper around sqlite3 which makes it easier for Vala programmers to use.
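As a rough illustration of what “searching with one SQL query” can look like, here is a small Python sketch using the standard sqlite3 module. The schema (a `package` table with `name` and `description` columns), the helper names and the use of `LIKE` are assumptions made for this example; the actual APT2 cache layout and query are not shown in this post.

```python
import sqlite3


def build_cache(packages):
    """Create an in-memory cache from (name, description) pairs.

    The table layout is an assumption for this sketch, not APT2's schema.
    """
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE package (name TEXT PRIMARY KEY, description TEXT)")
    db.executemany("INSERT INTO package VALUES (?, ?)", packages)
    return db


def search(db, expression):
    """Find packages whose name or description contains `expression`.

    The whole search is a single SELECT. Note that '%' and '_' inside
    `expression` would act as LIKE wildcards in this simple version.
    """
    pattern = "%" + expression + "%"
    rows = db.execute(
        "SELECT name, description FROM package "
        "WHERE name LIKE ? OR description LIKE ? "
        "ORDER BY name",
        (pattern, pattern),
    )
    return rows.fetchall()
```

For example, with a cache built from a few `(name, description)` pairs, `search(db, "python-apt")` returns only the rows that contain that substring, sorted by name.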

The command-line application ‘capt’ provides a shell based on readline with history (and later on command completion) as well as direct usage like ‘capt config dump’ or ‘capt search python-apt’. Just as with Eugene’s cupt, ‘capt’ will be the only program in the core APT2 distribution and will provide the same functionality currently provided by apt-get, apt-config and friends. The name is not perfect and can easily be confused with ‘cupt’, but it was the closest option for now, considering that the name ‘apt’ is already used by Java (for its “Annotation Processing Tool”).

That’s all for now. I’ll tell you once all those features have passed my QA and there is really something usable in the repository. In the meantime, you can discuss external dependency solvers, database layouts and other stuff in their threads on deity@lists.debian.org.

And a ‘screenshot’ from capt:

jak@hp:~/Desktop/APT2:temp$ capt
apt$ help
APT2 0.0.20091213 command-line frontend

Commands:
  config dump               Dump the configuration
  config get OPTION         Get the given option
  config set OPTION VALUE   Set the given option
  search EXPRESSION         Search for the given expression
  show PACKAGE              Show all versions of the given package
  sources list              Print a list of all sources
  version                   Print the version of APT2
apt$ search python-apt
build-depends-python-apt - Dummy package to fulfill package dependencies
python-apt - Python interface to libapt-pkg
python-apt-dbg - Python interface to libapt-pkg (debug extension)
python-apt-dev - Python interface to libapt-pkg (development files)
python-aptdaemon - Python module for the server and client of aptdaemon
python-aptdaemon-gtk - Python GTK+ widgets to run an aptdaemon client
apt$ 

24 thoughts on “APT2 progress report for the 1st half of December”

  1. Hmm, does this mean that dpkg will be faster too, or just APT? I have no problem with the speed of APT, and I think the difference between 1-2 seconds and 0.1 seconds is not that big. _However_, when installing or removing packages I sometimes have a delay of more than a minute while dpkg is reading its database … It would be much more useful to shorten that, I think, to improve the usability of Debian-style package management. Thanks for your patience!

      1. Yes, I suspected as much; I just was not sure when you started to talk about a “new” APT. It’s also a good question whether it’s a good solution to have those tons of files; shouldn’t they be moved into some kind of database as well? I had that idea earlier too, but some people seem to get scared and mention that this is the case with rpm, where you can lose a binary database and then be in a hopeless situation. To me at least it does not sound too “modern” to have 10900 files under /var/lib/dpkg; I’ve just checked with find and wc. Also, in most filesystems that means wasting lots of disk space, if you don’t use extents at least. But disk is cheap, it is said 🙂 Sorry for my long comment anyway.

      2. BTW, I favor databases as caches, which could be corrupted at any time without any negative impact. If the database is corrupt, it can be recreated from the original files. It should not be the only way to access the data, just an additional one. This of course means more disk usage, but also a much faster package manager.

      1. I always felt that solution was a “hack”; file systems should be implemented to do their work better. Also, if you call readahead, which reads data just so that it ends up in the cache, why can’t that be done in dpkg directly, so that it loads the information into memory the same way readahead would? Then we would not waste resources reading everything once with readahead (to get it into the cache) and then again with dpkg itself. Somehow dpkg should be made “smarter” so that it reads the data in a “peaceful” and quick way, as readahead does. Sure, I am not too familiar with dpkg internals, so sorry if I am talking nonsense …

      2. LGB, the data is read by dpkg and not in the cache. That is the problem. Using readahead in dpkg would not make much sense, because it’s useless to readahead something you are already reading. The only way would be to do it in the background before running dpkg (readahead blocks until it has finished).

  2. @Julian: I know that dpkg reads the data, of course 🙂 I meant that you suggested using readahead to “precache” data so dpkg can read the real data faster. What I was saying is that this is kind of strange, since basically readahead reads the data just as dpkg does; its purpose is only not to store the data itself “in user space” but to read it so that it is in the cache, which makes it faster to read that data “from the disk” later. I understood the idea of having a separate process run readahead, since it blocks. What I can’t understand is why it is not possible to optimize dpkg to read its files in “one pass”, since this is a kind of two-pass approach: you have one thread running readahead to get your data into the cache, so that the other thread (which is dpkg itself, let’s say) can read it much faster, as no physical disk I/O is needed. I guess the problem here is that it’s really slow to open/read/etc. tons of files; that’s why I thought a structured database within one file (let’s say sqlite or whatever, even bdb/innodb/something) would be faster, not counting the fact that some queries can be done faster with some kind of RDBMS-like functionality, even if it’s an embedded one (and not a “real” RDBMS of course). Just think about /var/lib/dpkg/status, available, and such: I don’t know how they are managed, but I can imagine that any modification to them needs rewriting the whole file (or at least from the file position where the modification was made?).

    1. > why is it not possible to optimize dpkg to read its files in “one pass”,

      That’s what it does, and that’s what you are complaining about. I just said that we could e.g. readahead the files while fetching the packages or solving dependencies, as the disk is mostly idle during that time.

      1. Aha, OK, maybe I misunderstood your message then, sorry for that. My English is not the best either, as you may have noticed … My only motivation to comment was that my Debian-based systems (Debian and/or Ubuntu) seem to get slower and slower over time as the distribution evolves, in the sense that a simple “dpkg -i …” now takes longer than a minute at the “Reading database …” step, which is a bit annoying. Anyway, I have to live with it, so … 🙂

  3. There is one big problem in all (most?) applications that use SQLite – if the application abruptly terminates, then no changes are saved to the database, not even changes that were made an hour ago. This means that if something interrupts APT2 operations, then you will get an inconsistent state – SQLite can not be trusted to save the information that you gave to it, at least not until you close the db connection and fsync the db file after that.

    1. If it does not write anything, the mtime should not change, and thus we can detect whether there have been changes by comparing the mtime of the database with that of /var/lib/dpkg/status and then update the database. Changes are in fact only made after the dpkg call has completed, so the mtime check should work normally.

      1. Are you using the FTS3 extension?

        $ time sqlite3 share.db "select count(*) from file_text where file_text match 'документ мтс';"
        68

        real 0m0.006s
        user 0m0.008s
        sys 0m0.000s

        This is faster, isn’t it?

        About the test database:
        $ ls -lh share.db | awk '{print $5}'
        64M
        $ sqlite3 share.db "select count(*) from file_text;"
        3492

      2. And the results with a wildcard (prefix) search:

        $ time sqlite3 share.db "select count(*) from file_text where file_text match 'документ* мтс*';"
        142

        real 0m0.007s
        user 0m0.000s
        sys 0m0.004s

      3. FTS3 would probably increase the database size noticeably, and it has problems when searching for e.g. ‘python-apt’, because it finds all matches for Python (>1000 results), although it should only find ~20 results.

    1. Actually, something with OR seems to fail

      sqlite> select name, * from package WHERE name MATCH '"python-apt"' OR description MATCH '"python-apt"';
      Error: SQL logic error or missing database
