Blog of Julian Andres Klode

Debian Developer | Ubuntu Member | Fellow of FSFE | SPI contributing member

APT2 progress report for the 1st half of December

This week was successful. I have pushed some changes from November to the repository. They change the license to LGPL-2.1+ (which makes bi-directional sharing of code with other projects easier, since most Vala code is under the same license) and implement HTTP using libsoup2.4 directly instead of going through GIO and GVFS. I also added a parser for the sources.list format, which uses regular expressions to parse the file and is relatively fast. The code needs a current checkout of Vala’s git master to work correctly, as released versions had a bug which I noticed today and Jürg Billeter fixed in Vala 25 minutes later. Thank you, Jürg.
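
To illustrate the idea of a regular-expression-based sources.list parser, here is a minimal sketch in Python rather than Vala. It is not the APT2 parser: the pattern, the function name `parse_sources_list` and the example URL are assumptions for illustration, and it only handles the classic one-line `deb`/`deb-src` format.

```python
import re

# One-line entry: "deb [opt=val ...] URI SUITE [COMPONENT ...]"
ENTRY_RE = re.compile(
    r"""^\s*
        (?P<type>deb(?:-src)?)\s+          # entry type
        (?:\[(?P<options>[^\]]*)\]\s+)?    # optional [key=value ...] block
        (?P<uri>\S+)\s+                    # repository URI
        (?P<suite>\S+)                     # suite or distribution
        (?P<components>(?:\s+\S+)*)        # zero or more components
        \s*$""",
    re.VERBOSE,
)


def parse_sources_list(text):
    """Return one dict per entry; blank lines and comments are skipped."""
    entries = []
    for line in text.splitlines():
        line = line.split("#", 1)[0]  # drop comments (assumes no '#' in URIs)
        if not line.strip():
            continue
        match = ENTRY_RE.match(line)
        if match is None:
            raise ValueError(f"malformed sources.list line: {line!r}")
        entry = match.groupdict()
        # Turn "a=b c=d" into {"a": "b", "c": "d"}.
        entry["options"] = dict(
            opt.split("=", 1) for opt in (entry["options"] or "").split()
        )
        entry["components"] = entry["components"].split()
        entries.append(entry)
    return entries
```

Calling `parse_sources_list("deb http://deb.debian.org/debian unstable main")` yields a single entry whose `suite` is `unstable` and whose `components` list contains `main`.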

While nothing else happened in the public repository, the internal branch has seen a lot of new code, including SQLite 3 caches, Acquire text progress handling, and capt, the command-line advanced package tool. Most of the code will need to be reworked before it is published, but I hope to have this completed by Christmas. It will also depend on Vala 0.7.9 or newer, which is yet to be released.

The decision to use SQLite 3 as a backend means that we won’t see the size limitations APT has, and that development can be simplified by using SQL queries for filtering requests. It also means that APT2 will be very fast in most actions, such as searching, which currently takes 0.140 seconds (with unstable, experimental and more repositories enabled), whereas aptitude takes 1.101 seconds, cupt (which has no on-disk cache) 1.292 seconds, and apt-cache 0.475 seconds. Searching is performed by one SQL query. I also want to thank Jens Georg <mail@jensge.org>, who wrote Rygel’s Database class, which APT2 uses with minor modifications (like defaulting to in-memory journals). Rygel.Database is a small wrapper around sqlite3 which makes SQLite easier to use from Vala.
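
As a rough picture of what a single-query search over an SQLite cache can look like, here is a small Python sketch. The `package` table, its columns and the helper names are hypothetical, since the APT2 schema is not shown here. The use of a `REGEXP` function is likewise an assumption: SQLite only provides the operator syntax, so the application has to register the function itself.

```python
import re
import sqlite3


def _regexp(pattern, value):
    # SQLite rewrites "X REGEXP Y" into regexp(Y, X): pattern comes first.
    return int(value is not None and re.search(pattern, value) is not None)


def open_cache(path=":memory:"):
    """Open a package cache. The schema is made up for this example."""
    db = sqlite3.connect(path)
    # Keep the rollback journal in memory instead of in a file on disk.
    db.execute("PRAGMA journal_mode = MEMORY")
    db.create_function("regexp", 2, _regexp)
    db.execute(
        "CREATE TABLE IF NOT EXISTS package (name TEXT PRIMARY KEY, summary TEXT)"
    )
    return db


def search(db, expression):
    """Match name or summary against the expression with one SQL query."""
    return db.execute(
        "SELECT name, summary FROM package"
        " WHERE name REGEXP :e OR summary REGEXP :e ORDER BY name",
        {"e": expression},
    ).fetchall()
```

A regular-expression filter like this has to visit every row, so it cannot use an index; that is the price for accepting arbitrary expressions.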

The command-line application ‘capt’ provides a shell based on readline with history (and later on command completion) as well as direct usage like ‘capt config dump’ or ‘capt search python-apt’. Just as with Eugene’s cupt, ‘capt’ will be the only program in the core APT2 distribution and will provide the functionality currently provided by apt-get, apt-config and friends. The name is not perfect and can easily be confused with ‘cupt’, but it was the closest option for now, considering that the name ‘apt’ is already used by Java (for its “Annotation Processing Tool”).
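
The combination of an interactive shell and direct one-shot usage can be sketched with Python’s standard `cmd` module, which uses readline for line editing and history when it is available. This is only an illustration of the pattern, not capt itself: the class name `CaptShell`, the `main` helper and the in-memory configuration dictionary are assumptions.

```python
import cmd


class CaptShell(cmd.Cmd):
    """Toy shell: command names follow the post, the behaviour is made up."""

    prompt = "apt$ "

    def __init__(self, config=None, **kwargs):
        super().__init__(**kwargs)
        self.config = dict(config or {})

    def do_config(self, arg):
        """config dump | config get OPTION | config set OPTION VALUE"""
        words = arg.split()
        if words == ["dump"]:
            for key in sorted(self.config):
                print(f"{key} = {self.config[key]}", file=self.stdout)
        elif len(words) == 2 and words[0] == "get":
            print(self.config.get(words[1], ""), file=self.stdout)
        elif len(words) == 3 and words[0] == "set":
            self.config[words[1]] = words[2]
        else:
            print("usage: " + self.do_config.__doc__, file=self.stdout)

    def do_EOF(self, arg):
        """Leave the shell on Ctrl-D."""
        return True


def main(argv):
    shell = CaptShell()
    if argv:
        # Direct usage: the arguments form exactly one command.
        shell.onecmd(" ".join(argv))
    else:
        # Interactive usage: prompt, line editing and history.
        shell.cmdloop()
```

Calling `main(sys.argv[1:])` from a script gives both modes from the same command table, so the interactive shell and the direct form cannot drift apart.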

That’s all for now. I’ll tell you once all those features have passed my QA and there is really something usable in the repository. In the meantime, you can discuss external dependency solvers, database layouts and other stuff in their threads on deity@lists.debian.org.

And a ‘screenshot’ from capt:

jak@hp:~/Desktop/APT2:temp$ capt
apt$ help
APT2 0.0.20091213 command-line frontend

Commands:
  config dump               Dump the configuration
  config get OPTION         Get the given option
  config set OPTION VALUE   Set the given option
  search EXPRESSION         Search for the given expression
  show PACKAGE              Show all versions of the given package
  sources list              Print a list of all sources
  version                   Print the version of APT2
apt$ search python-apt
build-depends-python-apt - Dummy package to fulfill package dependencies
python-apt - Python interface to libapt-pkg
python-apt-dbg - Python interface to libapt-pkg (debug extension)
python-apt-dev - Python interface to libapt-pkg (development files)
python-aptdaemon - Python module for the server and client of aptdaemon
python-aptdaemon-gtk - Python GTK+ widgets to run an aptdaemon client
apt$ 


Written by Julian Andres Klode

December 13, 2009 at 21:07

Posted in APT2

24 Responses


  1. Since you are writing a Debian package manager, I am wondering about your take on bug 554373 (which relates to the computation of preferences in apt/cupt/…)

    Jean-Christophe Dubacq

    December 13, 2009 at 23:07

  2. Hmm, does this mean that dpkg will be faster too, or just APT? I have no problem with the speed of APT, and I think 1-2 seconds versus 0.1 is not much of a difference. _However_, when installing or removing packages I sometimes wait more than a minute while dpkg is reading its database … Shortening that would be much more useful, I think, for improving the usability of Debian-style package management. Thanks for your patience!

    LGB

    December 14, 2009 at 09:53

    • No, dpkg has its own database, consisting of thousands of text files in /var/lib/dpkg.

      Julian Andres Klode

      December 14, 2009 at 09:55

      • Yes, I suspected it had; I just was not sure once you started to talk about a “new” APT. It’s also a good question whether having those tons of files is a good solution. Shouldn’t they be moved into some kind of database as well? I had that idea earlier too, but some people seem to get scared and point to rpm, where you can lose a binary database and then be in a hopeless situation. To me, at least, having 10900 files under /var/lib/dpkg (I’ve just checked with find and wc) does not sound too “modern”. Also, on most filesystems that wastes a lot of disk space, at least if you don’t use extents. But disk is cheap, it is said :) Sorry for my long comment anyway.

        LGB

        December 14, 2009 at 10:54

      • BTW, I favor databases as caches, which could be corrupted at any time without any negative impact. If the database is corrupt, it can be recreated from the original files. It should not be the only way to access the data, just an additional one. This of course means more disk usage, but also a much faster package manager.

        Julian Andres Klode

        December 14, 2009 at 14:16

    • Letting APT/APT2 call readahead(2) in a second thread on all dpkg files could help, so that those files are already loaded when dpkg is started.

      Julian Andres Klode

      December 14, 2009 at 10:05

      • I have always felt that solution to be a “hack”; file systems should have been implemented to do this work better. Also, if you call readahead, which reads the data (so that it is in the cache afterwards), why can’t that be done in dpkg directly, so that it gets the data into memory in exactly the way readahead would? Then we would not waste resources reading once with readahead (to get the data into the cache) and again in dpkg itself. Somehow dpkg should be made “smarter” so that it reads in the same “peaceful” and quick way readahead does. Sure, I am not too familiar with dpkg internals, so sorry if I am talking nonsense …

        LGB

        December 14, 2009 at 11:02

      • LGB, the data is read by dpkg and is not in the cache. That is the problem. Using readahead in dpkg would not make much sense, because it’s useless to readahead something you are already reading. The only way would be to do it in the background before running dpkg (readahead blocks until it has finished).

        Julian Andres Klode

        December 14, 2009 at 14:19
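
The background prefetch discussed in this thread can be sketched as follows. Python’s `os` module has no wrapper for readahead(2), so this sketch swaps in `os.posix_fadvise` with `POSIX_FADV_WILLNEED`, which asks the kernel to load a file into the page cache, and falls back to plainly reading the file where that call is missing. The function names are hypothetical, and whether this actually speeds up dpkg is not measured here.

```python
import os
import threading


def prefetch(paths):
    """Hint the kernel to cache each file; return how many were handled."""
    handled = 0
    for path in paths:
        try:
            fd = os.open(path, os.O_RDONLY)
        except OSError:
            continue  # vanished or unreadable: prefetching is best effort
        try:
            if hasattr(os, "posix_fadvise"):
                size = os.fstat(fd).st_size
                os.posix_fadvise(fd, 0, size, os.POSIX_FADV_WILLNEED)
            else:
                # Portable fallback: reading the file also fills the cache.
                while os.read(fd, 1 << 20):
                    pass
            handled += 1
        except OSError:
            pass  # the hint failed; the later real read still works
        finally:
            os.close(fd)
    return handled


def prefetch_in_background(directory):
    """Prefetch every file below `directory` in a separate thread."""
    paths = [
        os.path.join(root, name)
        for root, _dirs, files in os.walk(directory)
        for name in files
    ]
    thread = threading.Thread(target=prefetch, args=(paths,), daemon=True)
    thread.start()
    return thread
```

A caller could start `prefetch_in_background("/var/lib/dpkg")` before downloading packages and simply never join the thread; if the hint arrives too late, nothing breaks, the files are just read from disk as before.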

  3. [...] Andres Klode described on his blog the progress of work on the APT2 project, a new implementation of the Debian package manager, which we wrote about recently. [...]

  4. @Julian: I know that dpkg reads the data, of course :) I meant that you suggested using readahead to “precache” the data so that dpkg can read the real data faster. That is what I was talking about: it’s kind of strange, since readahead basically reads the data just as dpkg does; its purpose is just not to store the data itself “in user space” but to get it into the cache, so that it is faster to use later than reading it “from the disk”. I understood the idea of having a separate process run readahead, since it blocks. What I can’t understand is why it is not possible to optimize dpkg to read its files in “one pass”, since this is a kind of two-pass approach: you have one thread running readahead to get your data into the cache, so that the other thread (which is dpkg itself, let’s say) can read it much faster because no physical disk I/O is needed. I guess the problem here is that it’s really slow to open/read/etc. tons of files. That’s why I thought a structured database within one file (let’s say SQLite or whatever, even bdb/innodb/something) would be faster, not counting the fact that some queries can be done faster with some kind of RDBMS-like functionality, even if it’s an embedded one (and not a “real” RDBMS, of course). Just think about /var/lib/dpkg/status, available, and such: I don’t know how they are managed, but I can imagine that any modification to them requires rewriting the whole file (or at least from the file position where the modification was made?).

    LGB

    December 14, 2009 at 15:11

    • > why is it not possible to optimize dpkg to read its files in “one pass”,

      That’s what it does, and that’s what you are complaining about. I just said that we could e.g. readahead the files while fetching the packages or solving dependencies, as the disk is mostly idle during that time.

      Julian Andres Klode

      December 14, 2009 at 15:58

      • Aha, OK, maybe I misunderstood your message then, sorry for that. My English is not the best either, as you may have noticed … My only motivation to comment was that my Debian-based systems (Debian and/or Ubuntu) seem to get slower and slower over time as the distributions evolve, in the sense that a simple “dpkg -i …” now spends more than a minute at “Reading database …”, which is a bit annoying. Anyway, I have to live with it, so … :)

        LGB

        December 14, 2009 at 16:18

  5. capt looks similar to aptsh (http://packages.debian.org/aptsh)

    Julian: I find APT2 very interesting, hope this project will be a success :)

    azhag

    December 14, 2009 at 17:30

  6. There is one big problem in all (most?) applications that use SQLite – if the application abruptly terminates, then no changes are saved to the database, not even changes that were made an hour ago. This means that if something interrupts APT2 operations, then you will get an inconsistent state – SQLite can not be trusted to save the information that you gave to it, at least not until you close the db connection and fsync the db file after that.

    Aigarius

    December 14, 2009 at 20:31

    • If it does not write anything, the mtime should not change, and thus we can detect whether there have been changes by comparing the mtime of the database with that of /var/lib/dpkg/status and update the database if needed. Changes are in fact only made after the dpkg call has completed, so the mtime check should work normally.

      Julian Andres Klode

      December 14, 2009 at 21:12
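
The mtime comparison described in this reply, combined with the idea of a cache that can always be recreated from the original files, might look like the following Python sketch. The function names and the `rebuild` callback are hypothetical, and the sketch deliberately treats equal timestamps as fresh, which a real implementation might want to handle more carefully.

```python
import os
import sqlite3


def cache_is_stale(cache_path, source_path):
    """True if the cache is missing or older than the file it was built from."""
    if not os.path.exists(cache_path):
        return True
    return os.stat(cache_path).st_mtime_ns < os.stat(source_path).st_mtime_ns


def open_cache(cache_path, source_path, rebuild):
    """Open the cache, recreating it first when the source file is newer.

    `rebuild(db, source_path)` is a caller-supplied function that fills an
    empty database from the original text file.
    """
    if cache_is_stale(cache_path, source_path):
        tmp_path = cache_path + ".new"
        if os.path.exists(tmp_path):
            os.remove(tmp_path)  # leftover from an interrupted rebuild
        db = sqlite3.connect(tmp_path)
        try:
            rebuild(db, source_path)
            db.commit()
        finally:
            db.close()
        # Swap the finished file into place so readers never see a
        # half-built cache.
        os.replace(tmp_path, cache_path)
    return sqlite3.connect(cache_path)
```

Because the cache is only ever derived data, an interrupted rebuild costs nothing but time: the next run sees a missing or outdated cache and builds it again.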

  7. [...] APT2 progress report for the 1st half of December [...]

  8. I think it’s very slow. Can you publish the test SQLite database and your queries? I will try to optimize them.

    Alexey Pechnikov

    December 15, 2009 at 21:05

    • The speed is probably related to the regular expressions which have to be used. Index-based searches are much faster (0.0002 seconds (or 0.002?) or similar in SQLite).

      Julian Andres Klode

      December 15, 2009 at 21:10

      • Are you using FTS3 extension?

        $ time sqlite3 share.db "select count(*) from file_text where file_text match 'документ мтс';"
        68

        real 0m0.006s
        user 0m0.008s
        sys 0m0.000s

        This is faster, isn’t it?

        About the test database:
        $ ls -lh share.db | awk '{print $5}'
        64M
        $ sqlite3 share.db "select count(*) from file_text;"
        3492

        Alexey Pechnikov

        December 16, 2009 at 08:05

      • And the results with regular expressions based search:

        $ time sqlite3 share.db "select count(*) from file_text where file_text match 'документ* мтс*';"
        142

        real 0m0.007s
        user 0m0.000s
        sys 0m0.004s

        Alexey Pechnikov

        December 16, 2009 at 08:06

      • FTS3 would probably increase the database size noticeably and has problems when searching for e.g. ‘python-apt’, because it finds all matches for Python (>1000 results); although it should only find ~20 results.

        Julian Andres Klode

        December 16, 2009 at 13:26

  9. Use a query like

    select snippet(…) from … where … match '"python-apt"';

    And you will get only the results that contain python-apt.

    Alexey Pechnikov

    December 16, 2009 at 15:14

    • Actually, something with OR seems to fail

      sqlite> select name, * from package WHERE name MATCH '"python-apt"' OR description MATCH '"python-apt"';
      Error: SQL logic error or missing database

      Julian Andres Klode

      December 16, 2009 at 15:49

      • The query can be written more simply:

        select name, * from package WHERE package MATCH '"python-apt"';

        Alexey Pechnikov

        December 16, 2009 at 16:56
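
The phrase query suggested in this thread can be tried out with a small Python sketch. It assumes an SQLite build with the FTS3 extension enabled and uses a hypothetical two-column `package` table with made-up rows. With the default tokenizer, `python-apt` is split into the tokens `python` and `apt`; the double quotes turn them into a phrase, so both tokens must appear next to each other and in that order.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Hypothetical full-text table; the rows below are made up for the example.
db.execute("CREATE VIRTUAL TABLE package USING fts3(name, description)")
db.executemany(
    "INSERT INTO package (name, description) VALUES (?, ?)",
    [
        ("python-apt", "Python interface to libapt-pkg"),
        ("python-gtk2", "Python bindings for the GTK+ widget set"),
        ("python-aptdaemon", "Python module for aptdaemon"),
        ("update-manager", "Uses python-apt to find upgrades"),
    ],
)


def phrase_search(db, phrase):
    """Return names of rows where any column contains the exact phrase."""
    # Wrap the text in double quotes to request a phrase query.
    quoted = '"' + phrase.replace('"', " ") + '"'
    # Matching against the table name searches all columns at once.
    rows = db.execute(
        "SELECT name FROM package WHERE package MATCH ? ORDER BY name",
        (quoted,),
    )
    return [row[0] for row in rows]
```

Here `phrase_search(db, "python-apt")` finds `python-apt` and `update-manager`, but not `python-aptdaemon`, whose second token is `aptdaemon` rather than `apt`.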


Comments are closed.
