State of the Python/PyPI dependency graph
I usually work in a Java/Maven environment, so when I explain to people that Python also has a package manager (a bit less heavyweight than Maven) and that it works pretty well, I always have to answer the same question: “Ok, but how does it solve transitive dependency hell?”
Also known historically as DLL Hell, JAR Hell, and so on. In short: when you depend on A and C, A depends on B (version 1.2), and C depends on B (version 1.5), how do you choose which version of B you will get?
I ended up trying to answer not exactly that question, but why I never really had that problem in Python. So this article is the first of a three-part series you could call “Dependency as a liability”.
In this part, I wanted to analyse the Python library world as a full dependency graph: how every library depends on the others.
When I talked with Tarek Ziadé about this, he told me how complicated things are right now. It seems that, for now, the only complete and reliable way to know what a package needs in terms of dependencies is to actually run its installation on every operating system. That was a bit out of scope for me, so I took another path, just to see where it would lead.
Analyzing setup.py files
For recent packages, following the Hitchhiker's Guide to Packaging, the metadata of a package is stored in a file called _setup.py_ that looks like this:
[sourcecode language="python"]
# install_requires is a setuptools feature, so setup is imported
# from setuptools rather than distutils.core.
from setuptools import setup

setup(
    name='TowelStuff',
    version='0.1.0',
    author='J. Random Hacker',
    author_email='jrh@example.com',
    packages=['towelstuff', 'towelstuff.test'],
    scripts=['bin/stowe-towels.py', 'bin/wash-towels.py'],
    url='http://pypi.python.org/pypi/TowelStuff/',
    license='LICENSE.txt',
    description='Useful towel-related stuff.',
    long_description=open('README.txt').read(),
    install_requires=[
        "Django >= 1.1.1",
        "caldav == 0.1.4",
    ],
)
[/sourcecode]
You can notice a few fields like _author_, _version_, _author_email_, _url_, _license_… and the one I was focusing on, the _install_requires_ parameter, where you declare all your dependencies. The problem is that, while it may sound simple, the _setup.py_ file is a Python script in itself, so the value of _install_requires_ can be computed when the script is executed.
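To illustrate, here is a hypothetical _setup.py_ (the conditional dependency is made up for the example) where the final dependency list only exists once the script runs, so no purely static parser can recover it:
[sourcecode language="python"]
# Hypothetical setup.py: install_requires is built at execution
# time, so a static parser cannot know the final dependency list.
import sys
from setuptools import setup

requires = ["caldav == 0.1.4"]
if sys.platform == "win32":
    # extra dependency only on Windows (made-up example)
    requires.append("pywin32")

setup(
    name='TowelStuff',
    version='0.1.0',
    install_requires=requires,
)
[/sourcecode]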
So I took my chances and decided to create a project that extracts the dependencies of all packages on PyPI from the _install_requires_ parameter, to see whether it is mainly used statically or dynamically. What the **meta-deps** project does is:
- extract the list of all packages from PyPI using the XML-RPC API (a minimal sketch of this step follows the list);
- download the releases and extract the _install_requires_ dependencies from each _setup.py_ file;
- store the results in a CSV file, pypi-deps.csv.
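For the first step, something along these lines does the package listing. This is only a minimal sketch, assuming PyPI's standard XML-RPC methods (_list_packages_ and _package_releases_), not the actual meta-deps code:
[sourcecode language="python"]
# Minimal sketch: enumerate every package on PyPI and its releases
# through the XML-RPC API (Python 3 standard library only).
import xmlrpc.client

client = xmlrpc.client.ServerProxy('http://pypi.python.org/pypi')
for name in client.list_packages():
    for version in client.package_releases(name):
        print(name, version)
[/sourcecode]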
If you want to re-use the raw data, you don't need to re-execute the whole process (and overload the PyPI servers in the meantime); just download the pypi-deps.csv file. It contains these columns:
- the name of the package;
- the version extracted;
- a base64-encoded JSON string storing the list of dependencies, so you just need to execute json.loads(b64decode(…)) (see the sketch after this list).
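In practice, reading the file could look like this; a minimal sketch, assuming a plain comma-separated layout with exactly the three columns described above:
[sourcecode language="python"]
# Minimal sketch: decode the dependency list of each package from
# pypi-deps.csv (assumes comma-separated name, version, base64 blob).
import csv
import json
from base64 import b64decode

with open('pypi-deps.csv') as f:
    for name, version, encoded in csv.reader(f):
        deps = json.loads(b64decode(encoded))
        print(name, version, deps)
[/sourcecode]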
Results
So what comes out of all this? This graph:

Ok, seen like that, you must think it looks like a huge jellyfish and that I'm just joking with you. So I spent a little time creating and optimizing an interactive graph of the PyPI dependencies (it seems best to open it with Chrome), where you can scroll and see all the dependencies with all the metrics and explanations needed.
The next step will be to do the same with Maven dependencies in the Java world, and to compute the metrics needed to compare the two.
Vale