Archive for the ‘Python’ Category
Recursive-descent in Python, using generators
Writing recursive descent parsers is easy, especially in Python (just like everything is easy in Python). But Python defines a low limit on the number of recursive calls that can be made, and recursive descent parsers are prone to exceed this limit.
We should thus write our parser without real recursion. Fortunately, Python offers us a way out: Coroutines, introduced in Python 2.5 as per PEP 342. Using coroutines and a simple trampoline function, we can convert every mutually recursive set of functions into a set of coroutines that require constant stack space.
Let’s say our parser looks like this (tokens being an iterator over characters):
def parse(tokens):
token = next(tokens)
if token.isalnum():
return token
elif token == '(':
result = parse(tokens)
if next(tokens) != ')': raise ...
return result
else:
raise ...
We now apply the following transformations:
return X => yield (X)
parse() => (yield parse())
That is, we yield the value we want to return, and we yield a generator whenever we want to call (using a yield expression). Our rewritten parser reads:
def parse(tokens):
token = next(tokens)
if token.isalnum():
yield token
elif token == '(':
result = yield parse(tokens)
if next(tokens) != ')': raise ...
return result
else:
raise ...
We obviously cannot call that generator like the previous function. What we need to introduce is a trampoline. A trampoline is a function that manages that yielded generators, calls them, and passes their result upwards. It looks like this:
def trampoline(generator):
"""Execute the generator using trampolining. """
queue = collections.deque()
def schedule(coroutine, stack=(), value=None):
def resume():
if 0:
global prev
now = stack_len(stack)
if now < prev:
print("Length", now)
prev = -1
elif prev != -1:
prev = now
result = coroutine.send(value)
if isinstance(result, types.GeneratorType): # Normal call
schedule(result, (coroutine, stack))
elif isinstance(result, Tail): # Tail call (if you want to)
schedule(result.value, stack)
elif stack: # Return to parent
schedule(stack[0], stack[1], result)
else: # Final Return
return result
queue.append(resume)
schedule(generator)
result = None
while queue:
func = queue.popleft()
result = func()
return result
This function is based on the code in PEP 342, the difference being that
- we do not correctly propagate exceptions through the stack, but directly unwind to the caller of the parser (we don’t handle exceptions inside our parser generators anyway)
- the code actually compiles (code in PEP used ‘value = coroutine.send(value)’ which does not work)
- the code returns a value (code in PEP was missing a return in schedule)
- we don’t use a class, and allow only one function to be scheduled at once (yes, we could get rid of the queue)
- we allow tail calls [where Tail = namedtuple("Tail", ["result"])] to save some more memory.
For a more generic version of that, you might want to re-add exception passing, but the exceptions will then have long tracebacks, so I’m not sure how well they will work if you have deep recursion.
Now, the advantage of that is that our parser now requires constant stack space, because the actual real stack is stored in the heap using tuples which are used like singly-linked lists in scheme here. So, the only recursion limit is available memory.
A disadvantage of that transformation is that the code will run slightly slower for small inputs that could be handled using a normally recursive parser.
Cleaning up the system with pseudo-boolean optimization
You can use a PBO solver to clean up your system from unneeded automatically installed packages. First of all, you convert the system state to PB, and add an optimization function telling it to remove as many automatically installed packages as possible. Then you run this thing through a solver (such as clasp, which seems the fastest solver for PBO instances in the Debian archive) and convert its output to human-readable package names.
Code is provided at
http://anonscm.debian.org/gitweb/?p=users/jak/cleanup.git
, under the MPL 2.0. You need to have python-apt and clasp installed to use it. There is potential minisat+ support, but it’s currently a bit broken.
To use, run python program_builder.py, and it will tell you which packages are no longer needed on your system. It ignores Suggests, if you want those in, you have to hack the code and replace {“Recommends”} by {“Recommends”, “Suggests”}. You can also turn of such dependencies by setting Program.hard_softdeps to False.
FAIL: Arch Linux switch python executable to Python 3
Today, I got an email from an user of one of my Python scripts asking why the script t does not work on Arch Linux anymore. As it turns out, the Arch Linux team decided to switch /usr/bin/python to Python 3.0 and use python2 for Python 2.X versions. By doing this, they decided to make their distribution incompatible to almost any Python script in the world.
Arch Linux’s decision to diverge from the rest of the world that uses python for Python 2.X and python3 for Python 3.X is stupid. And doing this without updating reverse dependencies beforehand and thus breaking packages in their own distribution is insane. In the end this means that if you use Arch Linux, you should consider switching to a distribution that does things the right way: Debian. You can also switch to Ubuntu, that’s normally just a bit less right (= faster). But using a distribution that does those crazy things in such an irresponsible way is really insane.
Really, how can they be so stupid in the Arch world?
Python 3.1 bug: Objects in modules (m_size=-1) not deallocated
Last year, in July, I reported an issue to Python’s issue tracker. This issue can be seen at
http://bugs.python.org/issue6483
. Since then, there has been no action on this bug from the developers.
The bug describes that every object stored in a module will not be deallocated if the module is deallocated and it’s m_size = -1 (which it should be if the module has no state). The problem seems to be that Python copies the dictionary of the module but forgets to decrease the reference count of the copy when the module is deallocated. This bug may have serious impact if objects are stored in the module which write status to a file when they are deallocated, because the deallocation functions are never called.
Python 3.1 and python-apt
So, I have started to port python-apt to Python 3. Most things work already, but there is one single problem. I can not access the attributes of the objects, only their methods.
In Python 2.5, everything works perfectly. In Python 3.1, the same code produces an error. An example is apt_pkg.GetCache().Packages. I defined the slots tp_getattro and tp_methods. In Python 3.1, tp_getattro seems to be ignored.
If you want to help,
http://bzr.debian.org/loggerhead/users/jak/python-apt/py3k/changes
for browsing the branch and
http://bzr.debian.org/users/jak/python-apt/py3k/
for branching it.
This branch contains the current state and allows you to build a python-apt package shipping with Python 3.1 modules and extensions. But as I wrote above, not everything works yet. But its far enough for about 5 hours of work.
Update 2009-04-12 00:17 CEST: I just fixed a bunch of problems, including the one listed above. Most of python-apt should work now, including the ‘apt’ package.
python-apt documentation – 2nd time
News:
http://bzr.debian.org/loggerhead/users/jak/python-apt/jak/revision/219
introduced again some new documentation.
doc/source/apt/debfile.rst | 6
doc/source/apt_pkg/cache.rst | 540 +++++++++++++++++++++++++++++++++-
doc/source/apt_pkg/index.rst | 24 +
doc/source/conf.py | 5
doc/source/examples/cache-packages.py | 22 +
doc/source/examples/cache-pkgfile.py | 29 +
doc/source/examples/missing-deps.py | 51 +++
7 files changed, 659 insertions(+), 18 deletions(-)
You can see them at
http://people.debian.org/~jak/python-apt-doc/
Missing seem to be:
– AcquireFile
– AcquireItem
– ActionGroup
– Configuration
– MetaIndex
– PackageIndexFile
– PkgManager
– PkgRecords
– PkgSourceList
– PkgSrcRecords
– ProblemResolver
– TagFile
– TagSection
But the weekend is not over yet. I will blog tomorrow again, and once I’m finished with everything (which may be tomorrow, too).
Have fun, read the documentation, find the mistakes. Some things are currently written as ‘???’, if you know what fits there, please tell me.
News from the python-apt front: NEW and COOL DOCUMENTATION
I have been working the whole week on python-apt and the result is the jak branch. This implements some of the proposals I made in my last post, but has one very interesting feature: REAL COOL DOCUMENTATION.
After Sandro Tosi told me in a comment in my last post that the real big problem with python-apt is a lack of documentation, I immediately started writing it. UsingreStructuredText and Sphinx, we now have a really cool and much more detailed documentation. (Although it is not really finished yet [it contains everything, but there is still room to improve]).
The whole documentation is available at
http://people.debian.org/~jak/python-apt-doc/
, and the source is in my branch at http://bzr.debian.org/users/jak/python-apt/jak”, which can be browsed via Loggerhead at:
http://bzr.debian.org/loggerhead/users/jak/python-apt/jak/changes
It also contains a lot of cleanup, whitespace removal (bundled in one commit), and improved docstrings. And apt.debfile and apt.gtk.widgets should work completely now. Oh, and apt.cdrom now supports sources.list.d.
Python Speed: ‘x in list’ vs ‘x in set’
Well, this is my second post about speed in Python. Today, I noticed that debimg’s dependency resolver was much much slower than before. I thought what the problem could be and finally realized that the problem was that I switched from sets to list. This is fixed now in commit d0fd700080de5c19cb5fd66918d14c5ffa26e805
Now, some benchmarks (using IPython):
In [1]: a = range(10**6)
In [2]: b = set(a)
In [3]: %timeit 10**6 in a
10 loops, best of 3: 31.8 ms per loop
In [4]: %timeit 10**6 in b
10000000 loops, best of 3: 98.6 ns per loop
1ms are 1 million ns. Therefore, using sets is about 322515 times faster than using lists (or tuples).
debimg can now calculate dependencies in 0.5 seconds again, instead of 1 minute with lists.
python-libisofs 0.0.1 available
A first preview of the python-libisofs bindings is available now. It’s currently located in a git branch at git.debian.org, but this may change at a later point.
The bindings support the creation of ISO Images and (almost) all options libisofs supports, like Rockridge, Joliet, and much more. Reading and Modifying existing images is not supported yet.
The code is written in Cython and you need cython installed for building from the git branch. It can be installed just like any other Python module/extension/package, using a setup.py.
Browse:
http://git.debian.org/?p=users/jak-guest/python-libisofs.git
Get: git clone git://git.debian.org/git/users/jak-guest/python-libisofs.git
I’ll package it for Debian within the next weeks after some further tests.
Python Extension for libisofs
I am working on a Python extension for the libisofs library. The extension is written in Cython , a Python-like language designed to write Python extensions.
At a later point, this extension will be used to create the ISO Images in debimg. It will be disabled by default, but you can enable it via a configuration option.
