I frequently have to re-create virtual environments from a requirements.txt, and I am already using $PIP_DOWNLOAD_CACHE. It still takes a long time, and I noticed that pip spends much of it between the following two lines:
Downloading/unpacking SomePackage==1.4 (from -r requirements.txt (line 2))
Using download cache from $HOME/.pip_download_cache/cached_package.tar.gz
It takes about 20 seconds on average for pip to decide that it is going to use the cached package; the install itself is then fast. That adds up when you have to install dozens of packages (enough to prompt this question).
What is going on in the background?
Is it running some sort of integrity check against the online package?
Is there a way to speed this up?
edit: Looking at:
time pip install -v Django==1.4
real 1m16.120s
user 0m4.312s
sys 0m1.280s
The full output is here: http://pastebin.com/e4Q2B5BA. It looks like pip is spending its time looking for a valid download link even though it already has a valid cached copy of http://pypi.python.org/packages/source/D/Django/Django-1.4.tar.gz.
Is there a way to look in the cache first and stop there if the versions match?
After spending some time studying the pip internals and profiling some package installations, I came to the conclusion that, even with a download cache, pip does the following for each package:
- goes to the main index URL, usually http://pypi.python.org/simple// (example)
- follows every link to fetch additional web pages
- extracts all links from all those pages
- checks the validity of all the links against the package name and version requirements
- selects the most recent version from the valid links
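The last two steps of the list above can be sketched roughly as follows. This is a simplified illustration, not pip's actual code: `parse_version` and `select_link` are hypothetical names, only `.tar.gz` source archives are considered, and version comparison is reduced to comparing the numeric parts.

```python
import re
from urllib.parse import urlparse


def parse_version(text):
    """Turn '1.3.1' into (1, 3, 1). Simplification: only digits are kept."""
    return tuple(int(part) for part in re.findall(r"\d+", text))


def select_link(links, name, pinned_version=None):
    """Return the URL of the newest matching archive, or None if none match.

    Hypothetical helper: keeps links whose file name looks like
    '<name>-<version>.tar.gz' and, if a version is pinned, only that version.
    """
    pattern = re.compile(
        r"^%s-(\d[\w.]*?)\.tar\.gz$" % re.escape(name), re.IGNORECASE
    )
    candidates = []
    for url in links:
        filename = urlparse(url).path.rsplit("/", 1)[-1]
        match = pattern.match(filename)
        if match is None:
            continue  # link does not belong to this package
        version = parse_version(match.group(1))
        if pinned_version is not None and version != parse_version(pinned_version):
            continue  # link does not satisfy the '==' requirement
        candidates.append((version, url))
    # max() compares the version tuples first, so the newest version wins.
    return max(candidates)[1] if candidates else None
```

The point of the sketch is that every candidate link has to be collected over the network before any selection can happen, which is where the time goes.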
Only now does pip have a download URL. It then checks the download cache folder, if one is configured, and skips the download if a local file named after the URL is present.
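A minimal sketch of that cache check, assuming the cache file is named after the percent-encoded download URL (my reading of the behaviour, not a documented guarantee). The function names `cache_path_for` and `needs_download` are made up for illustration.

```python
import os
from urllib.parse import quote


def cache_path_for(url, cache_dir):
    """Map a download URL to a file inside the cache folder.

    Assumption: the file name is the URL with every reserved character
    percent-encoded, so that it is a single valid file name.
    """
    return os.path.join(cache_dir, quote(url, safe=""))


def needs_download(url, cache_dir):
    """Return True when no cached file exists for this exact URL."""
    return not os.path.isfile(cache_path_for(url, cache_dir))
```

Because the lookup key is the full URL, the cache can only be consulted once the index crawl has produced that URL.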
My guess is that we could save a lot of time by checking the cache up front, but I do not understand the pip code base well enough to make the required modifications. Of course, this would only work for exact version requirements (==), because with other constraints, such as >, we still want to crawl the web looking for the latest version.
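To make the idea concrete, here is a rough sketch of what a cache-first lookup could look like. It is not part of pip: `find_cached_sdist` is a hypothetical helper, it assumes cache files are named after the percent-encoded download URL, and it only recognises `.tar.gz` archives named `<name>-<version>.tar.gz`.

```python
import os
import re
from urllib.parse import unquote


def find_cached_sdist(requirement, cache_dir):
    """Return the cached archive for an exact pin such as 'Django==1.4'.

    Returns None for any other kind of constraint (for example '>'),
    or when nothing matching is in the cache; in both cases the caller
    would have to fall back to crawling the index.
    """
    match = re.match(
        r"^\s*([A-Za-z0-9._-]+)\s*==\s*([A-Za-z0-9._]+)\s*$", requirement
    )
    if match is None:
        return None  # not an exact version requirement
    name, version = match.groups()
    wanted = ("%s-%s.tar.gz" % (name, version)).lower()
    for entry in sorted(os.listdir(cache_dir)):
        # Decode the file name back into the URL it was downloaded from.
        original_url = unquote(entry)
        if original_url.rsplit("/", 1)[-1].lower() == wanted:
            return os.path.join(cache_dir, entry)
    return None
```

A real implementation would also have to decide whether a cached file can be trusted without consulting the index at all, which this sketch does not address.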
Nevertheless, I was able to make a small pull request which will save us some time if merged.