Avoiding dependency confusion in Python

In the Python project I work on at my workplace, we install some packages from PyPI and some private company packages from Gemfury, using a standard requirements file.
I recently read this article on dependency confusion attacks: https://medium.com/@alex.birsan/dependency-confusion-4a5d60fec610.
Our requirements file looks something like:
--index-url <OUR_GEMFURY_URL>
--extra-index-url https://pypi.python.org/simple
aiohttp==3.7.1
simplejson==3.17.1
<our-package>==1.0.0
<our-other-package>==1.2.0
I tried reading some of pip's documentation, but I wasn't able to fully understand how it chooses where to download each package from.
For example, what happens if someone uploads a malicious version 1.0.0 of <our-package> to the public PyPI - how does pip know which of the two packages to take?
Is there maybe a way to tell pip to search for a specific package only in the --index-url?
How do you protect against dependency confusion in your code?
Thanks for the help!

The article mentions the algorithm pip uses:
Checks whether library exists on the specified (internal) package index
Checks whether library exists on the public package index (PyPI)
Installs whichever version is found. If the package exists on both, it defaults to installing from the source with the higher version number.
So if your script requires <our-other-package>>=1.2.0, you can get a malicious package from the public PyPI server if it carries a higher version than the one you intended to install.
The straightforward solution mentioned in the article is removing --extra-index-url.
If a package - internal or external - is present on the private PyPI server, it will be downloaded from there.
External packages will be downloaded from the public PyPI server through the internal one, which caches them for future use.
I'd also suggest pinning explicit versions in requirements.txt; this way you know exactly which versions you get and can do conscious upgrades by bumping the pins.
To sum up the guidelines (which are by no means exhaustive and do not protect against every possible security hole) - a full example follows the list:
remove --extra-index-url https://pypi.python.org/simple from pip.conf, requirements.txt and automation scripts.
specify explicit versions of internal and external packages in requirements.txt
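Putting both guidelines together, a hardened version of the requirements file from the question might look like the sketch below. It assumes your Gemfury index is configured to proxy and cache public PyPI, since pip will no longer contact PyPI directly:
# requirements.txt - one index only, every version pinned
--index-url <OUR_GEMFURY_URL>
aiohttp==3.7.1
simplejson==3.17.1
<our-package>==1.0.0
<our-other-package>==1.2.0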

Related

Per-line index URL in requirements.txt

Suppose I have the following PyPI indexes:
public PyPI (standard packages)
GitLab PyPI (because internal team ABC wanted to use this)
Artifactory PyPI (because contractor team DEF wanted to use this)
Now suppose a package named "apples" exists on all of them, but the three are entirely different packages. How do I do something in my requirements and setup.py to map each package name to the PyPI index it should come from?
Something like:
package_def==1.2.3 --index-url=artifactory
apples==1.08 --index-url=gitlab # NOT FROM PUBLIC OR FROM ARTIFACTORY
package_abc==1.2.3 --index-url=artifactory
package_efg==1.0.0 # public pypi
I don't even know how I'd configure the setup.py in this instance either.
I really don't want multiple requirements.txt files with different index URLs at the top. I also don't want --extra-index-url, due to the vulnerabilities it can introduce when using a private PyPI.
I tried googling around, messing with the order of requirements.txt, breaking it up into different files, etc. No luck. It seems the last --index-url is always used to install all packages.
Any ideas?
The question gets back to the idea that a package dependency specification is usually a statement of need that is independent of how that need should be satisfied.
So the dependency declaration "foo==1.0.0" (the thing declared as part of the package metadata) means "I need the package named foo at version 1.0.0", and that is in principle implementation-independent. You can install that package with pip from PyPI, but you could also use a different tool and/or a different source to satisfy that requirement (e.g. conda, installation from source, etc.).
This distinction is the reason why there's no good way to do this.
There are a few workarounds:
You can specify the full link to the wheel you want to pip install.
You can use an alternative tool like Poetry, which supports this a little more cleanly.
For my particular use case, I just listed the full link to the wheel I wanted to pip install, since switching to Poetry is out of scope at the moment.
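For reference, such a pin can be written as a plain URL or, on recent pip versions, as a PEP 508 direct reference; the host and file path below are made up for illustration:
# requirements.txt - pin a package to one exact wheel on one exact host
apples @ https://gitlab.example.com/path/to/apples-1.08-py3-none-any.whl
Because the requirement names a concrete file, no index lookup happens at all, so neither the public PyPI nor Artifactory is ever consulted for this package.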

Keep track of PyPI dependencies - who is using my packages

Is there any way, through either pip or PyPI, to identify which projects (published on PyPI) might be using my packages (also published on PyPI)? I would like to identify the user base for each package and possibly attempt to actively engage with them.
Thanks in advance for any answers - even if what I am trying to do isn't possible.
This is not really possible. There is no easily accessible public dataset that lets you produce a dependency graph.
At most you could scan all publicly available packages to parse their dependencies, but even then those dependencies are generated by running the setup.py script, so they can be set dynamically. It is quite common to adjust the dependencies based on the Python version (installing a backport of a standard-library module on older Python versions, for example). People have done this before, but it is not an easy, lightweight task.
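To illustrate the dynamic case, here is a hypothetical setup.py (the package names are made up); the final dependency list only exists once this code has actually run:
import sys
from setuptools import setup

# Dependencies are computed at build time, so a static scan of the
# repository cannot see the final list.
install_requires = ["requests"]
if sys.version_info < (3, 8):
    # stdlib backport needed only on older interpreters
    install_requires.append("importlib-metadata")

setup(
    name="example-package",
    version="1.0.0",
    install_requires=install_requires,
)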
Note that even then, you'll only find publicly declared dependencies. Any dependencies declared by private packages, not published to PyPI or another public repository, can't be accounted for.

Should a dependency in requirements.txt target a package on PyPI or the original GitHub repo?

Are there any technical indications to prefer referencing a package on PyPI over the original source on GitHub in requirements.txt?
The only thing that comes to my mind is that pinning a package to a certain version is very cumbersome with GitHub (package==1.0.0 vs git://github.com/{ username }/{ reponame }.git@{ tag name }#egg={ desired egg name }), but I'm not sure whether this can cause any problems.
Another thing is the necessity of installing git on the target machine.
Are there any other indications?
PyPI is the accepted de facto location for distributing released versions of a package, and it could be that not all Python packaging tools support installing from GitHub.
And as you already noticed, for pip to support GitHub you must have git installed; this limits the portability of your file.
Next, not all project maintainers remember to tag releases in GitHub; what is distributed to PyPI may be hard to locate on GitHub. The tag could also be wrong. You could end up installing a version subtly different from the one on PyPI, creating confusion when you run into a support issue.
On the other hand, if you must install a non-released development version (say, you need a critical bugfix but no release has been rolled since), then GitHub may be the only place you can get that version.
So, in short, you should prefer using PyPI over GitHub, as that ensures that you got an official release, and is more portable. Only use a GitHub URL in requirements.txt if there is no other source for a specific version.
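For comparison, here are the two forms side by side in a requirements file (the repository, tag, and package name are illustrative):
# Preferred: an official release from PyPI
package==1.0.0
# Fallback: an unreleased fix, pinned to an exact tag on GitHub
git+https://github.com/username/reponame.git@v1.0.1#egg=package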

How to `pip install` a package that has Git dependencies?

I have a private library called some-library (actual names have been changed) with a setup file looking somewhat like this:
setup(
    name='some-library',
    # Omitted some less important stuff here...
    install_requires=[
        'some-git-dependency',
        'another-git-dependency',
    ],
    dependency_links=[
        'git+ssh://git@github.com/my-organization/some-git-dependency.git#egg=some-git-dependency',
        'git+ssh://git@github.com/my-organization/another-git-dependency.git#egg=another-git-dependency',
    ],
)
All of these Git dependencies may be private, so installation via HTTP is not an option. I can use python setup.py install and python setup.py develop in some-library's root directory without problems.
However, installing over Git doesn't work:
pip install -vvv -e 'git+ssh://git@github.com/my-organization/some-library.git@1.4.4#egg=some-library'
The command fails when it looks for some-git-dependency, mistakenly assumes it needs to get the dependency from PyPI and then fails after concluding it's not on PyPI. My first guess was to try re-running the command with --process-dependency-links, but then this happened:
Cannot look at git URL git+ssh://git@github.com/my-organization/some-git-dependency.git#egg=some-git-dependency
Could not find a version that satisfies the requirement some-git-dependency (from some-library) (from versions: )
Why is it producing this vague error? What's the proper way to pip install a package with Git dependencies that might be private?
What's the proper way to pip install a package with Git dependencies that might be private?
Two options
Use dependency_links as you do. See below for details.
Alongside the dependency_links in your setup.py, use a special dependency-links.txt that collects all the required packages. Then add this file in requirements.txt. That's my recommended option, as explained below.
# dependency-links.txt
git+ssh://...@tag#egg=package-name1
git+ssh://...@tag#egg=package-name2
# requirements.txt (per deployed application)
-r dependency-links.txt
While option 2 adds some extra burden on package management, namely keeping dependency-links.txt up to date, it makes installing packages a lot easier because you can't forget to add the --process-dependency-links option on pip install.
Perhaps more importantly, using dependency-links.txt you get to specify the exact version to be installed on deployment, which is what you want in a CI/CD environment - nothing is riskier than installing some arbitrary version. From a package maintainer's perspective, however, it is common and considered good practice to specify a minimum version instead, such as:
# setup.py in a package
...
install_requires = [ 'foo>1.0', ... ]
That's great because it makes your packages work nicely with other packages that have similar dependencies, yet possibly on different versions. However, in a deployed application this can still cause mayhem if there are conflicting requirements among packages. E.g. package A is fine with foo>1.0, package B wants foo<=1.5, and the most recent version is foo==2.0. Using dependency-links.txt you can be specific, applying one version for all packages:
# dependency-links.txt
foo==1.5
The command fails when it looks for some-git-dependency,
To make it work, you need to add --process-dependency-links for pip to recognize the dependency on GitHub, e.g.
pip install --process-dependency-links -r private-requirements.txt
Note that since pip 8.1.0 you can add this option to requirements.txt. On the downside, it gets applied to all packages installed and may have unintended consequences. That said, I find using dependency-links.txt the safer and more manageable solution.
All of these Git dependencies may be private
There are three options:
Add collaborators on each of the required packages' repositories. These collaborators need to have their SSH keys set up with GitHub for this to work. Then use git+ssh://...
Add a deploy key to each of the repositories. The downside here is that you need to distribute the corresponding private key to all the machines that need to deploy. Again use git+ssh://...
Add a personal access token on the GitHub account that holds the private repositories. Then you can use git+https://accesstoken@github.com/... The downside is that the access token will have read + write access to all repositories, public and private, on the respective GitHub account. On the plus side, distributing and managing per-repository private keys is no longer necessary, and cycling the token is a lot simpler. In an all-in-house environment where every dev has access to all repositories, I have found this to be the most efficient, hassle-free way for everybody. YMMV
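As an illustration of the third option, a pinned install over HTTPS looks like this (the token, organization, and tag are placeholders):
# install a private dependency using a personal access token
pip install "git+https://<ACCESS_TOKEN>@github.com/my-organization/some-git-dependency.git@1.0.0#egg=some-git-dependency"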
This should work for private repositories as well:
dependency_links = [
    'git+ssh://git@github.com/my-organization/some-git-dependency.git@master#egg=some-git-dependency',
    'git+ssh://git@github.com/my-organization/another-git-dependency.git@master#egg=another-git-dependency'
],
You should use git+git when the URL carries an #egg fragment, like this:
-e git+git@repo.some.la:foo/my-repo.git#egg=my-repo
Use git+ssh in production without #egg; there you can specify a tag such as @1.1.6 or a branch such as @master:
git+ssh://git@repo.some.la/foo/my-repo.git@1.1.6
To work with app versions, use git tagging; see Git Basics - Tagging.
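Creating and publishing such a tag takes two commands (the version number is illustrative):
git tag 1.1.6          # mark the current commit as release 1.1.6
git push origin 1.1.6  # publish the tag so pip can fetch it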
Going by "pip install dependency links", you would refer not to the GitHub repo itself, but to the tarball image associated with that GitHub repo:
dependency_links=[
    'https://github.com/my-organization/some-git-dependency/tarball/master#egg=some-git-dependency',
    'https://github.com/my-organization/another-git-dependency/tarball/master#egg=another-git-dependency',
],
with "some-git-dependency" being the name and version of the dependency.
"Cannot look at git URL git+ssh://git#github.com/my-organization/some-git-dependency.git#egg=some-git-dependency" means pip cannot fetch an html page from this url to look for direct download links in the page, i.e, pip doesn't recognize the URL as a vcs checkout, because maybe some discrepancy between the requirement specifier and the fragment part in the vcs url.
In the case of a VCS checkout, you should also append #egg=project-version in order to identify for what package that checkout should be used.
Be sure to escape any dashes in the name or version by replacing them with underscores.
See Dependencies that aren’t in PyPI:
Replace - with _ in the package name and version string:
git+ssh://git@github.com/my-organization/some-git-dependency.git#egg=some_git_dependency
and --allow-all-external may be useful.

Does Python have a package/module management system?

Does Python have a package/module management system, similar to how Ruby has rubygems where you can do gem install packagename?
On Installing Python Modules, I only see references to python setup.py install, but that requires you to find the package first.
Recent progress
March 2014: Good news! Python 3.4 ships with Pip. Pip has long been Python's de-facto standard package manager. You can install a package like this:
pip install httpie
Wahey! This is the best feature of any Python release. It makes the community's wealth of libraries accessible to everyone. Newbies are no longer excluded from using community libraries by the prohibitive difficulty of setup.
However, there remain a number of outstanding frustrations with the Python packaging experience. Cumulatively, they make Python very unwelcoming for newbies. Also, the long history of neglect (i.e. not shipping with a package manager for 14 years, from Python 2.0 to Python 3.3) did damage to the community. I describe both below.
Outstanding frustrations
It's important to understand that while experienced users are able to work around these frustrations, they are significant barriers to people new to Python. In fact, the difficulty and general user-unfriendliness is likely to deter many of them.
PyPI website is counter-helpful
Every language with a package manager has an official (or quasi-official) repository for the community to download and publish packages. Python has the Python Package Index, PyPI. https://pypi.python.org/pypi
Let's compare its pages with those of RubyGems and Npm (the Node package manager).
https://rubygems.org/gems/rails RubyGems page for the package rails
https://www.npmjs.org/package/express Npm page for the package express
https://pypi.python.org/pypi/simplejson/ PyPI page for the package simplejson
You'll see the RubyGems and Npm pages both begin with a one-line description of the package, then large friendly instructions how to install it.
Meanwhile, woe to any hapless Python user who naively browses to PyPI. On https://pypi.python.org/pypi/simplejson/ , they'll find no such helpful instructions. There is, however, a large green 'Download' link. It's not unreasonable to follow it. Aha, they click! Their browser downloads a .tar.gz file. Many Windows users can't even open it, but if they persevere they may eventually extract it, then run setup.py and eventually, with the help of Google, setup.py install. Some will give up and reinvent the wheel.
Of course, all of this is wrong. The easiest way to install a package is with a Pip command. But PyPI didn't even mention Pip. Instead, it led them down an archaic and tedious path.
Error: Unable to find vcvarsall.bat
Numpy is one of Python's most popular libraries. Try to install it with Pip, and you get this cryptic error message:
Error: Unable to find vcvarsall.bat
Trying to fix that is one of the most popular questions on Stack Overflow: "error: Unable to find vcvarsall.bat"
Few people succeed.
For comparison, in the same situation, Ruby prints this message, which explains what's going on and how to fix it:
Please update your PATH to include build tools or download the DevKit from http://rubyinstaller.org/downloads and follow the instructions at http://github.com/oneclick/rubyinstaller/wiki/Development-Kit
Publishing packages is hard
Ruby and Nodejs ship with full-featured package managers, Gem (since 2007) and Npm (since 2011), and have nurtured sharing communities centred around GitHub. Npm makes publishing packages as easy as installing them, it already has 64k packages. RubyGems lists 72k packages. The venerable Python package index lists only 41k.
History
Flying in the face of its "batteries included" motto, Python shipped without a package manager until 2014.
Until Pip, the de facto standard was a command called easy_install. It was woefully inadequate. There was no command to uninstall packages.
Pip was a massive improvement. It had most of the features of Ruby's Gem. Unfortunately, Pip was--until recently--ironically difficult to install. In fact, the problem remains a top Python question on Stack Overflow: "How do I install pip on Windows?"
And just to provide a contrast, there's also pip.
The Python Package Index (PyPI) seems to be standard:
To install a package:
pip install MyProject
To update a package:
pip install --upgrade MyProject
To pin a version of a package:
pip install MyProject==1.0
You can install the package manager as follows:
curl -O http://python-distribute.org/distribute_setup.py
python distribute_setup.py
easy_install pip
References:
http://guide.python-distribute.org/
http://pypi.python.org/pypi/distribute
As a Ruby and Perl developer and learning-Python guy, I haven't found easy_install or pip to be the equivalent of RubyGems or CPAN.
I tend to keep my development systems running the latest versions of modules as the developers update them, and freeze my production systems at set versions. Both RubyGems and CPAN make it easy to find modules by listing what's available, then install and later update them individually or in bulk if desired.
easy_install and pip make it easy to install a module once I've located it via a browser search or learned about it by some other means, but they won't tell me what is available. I can explicitly name the module to be updated, but they won't tell me what has been updated, nor will they update everything in bulk if I want.
So, the basic functionality is there in pip and easy_install but there are features missing that I'd like to see that would make them friendlier and easier to use and on par with CPAN and RubyGems.
There are at least two: easy_install and its successor, pip.
As of at least late 2014, Continuum Analytics' Anaconda Python distribution with the conda package manager should be considered. It solves most of the serious issues people run into with Python in general (managing different Python versions, updating Python versions, package management, virtual environments, Windows/Mac compatibility) in one cohesive download.
It enables you to do pretty much everything you could want to with Python without having to change the system at all. My next preferred solution is pip + virtualenv, but you either have to install virtualenv into your system Python (and your system Python may not be the version you want), or build from source. Anaconda makes this whole process the click of a button, as well as adding a bunch of other features.
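For instance, the per-project workflow amounts to a few commands (the environment name and versions are illustrative):
conda create -n myproject python=3.5   # environment with its own Python version
source activate myproject              # newer conda versions: conda activate myproject
conda install numpy                    # prebuilt binary package, no compiler needed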
That'd be easy_install.
It's called setuptools. You run it with the "easy_install" command.
You can find the directory at http://pypi.python.org/
I don't see either MacPorts or Homebrew mentioned in other answers here, but since I do see them mentioned elsewhere on Stack Overflow for related questions, I'll add my own US$0.02: many folks seem to consider MacPorts not only a package manager for packages in general (as of today they list 16311 packages/ports, 2931 matching "python", albeit only for Macs), but also a decent (maybe better) package manager for Python packages/modules:
Question
"...what is the method that Mac python developers use to manage their modules?"
Answers
"MacPorts is perfect for Python on the Mac."
"The best way is to use MacPorts."
"I prefer MacPorts..."
"With my MacPorts setup..."
"I use MacPorts to install ... third-party modules tracked by MacPorts"
SciPy
"Macs (unlike Linux) don’t come with a package manager, but there are a couple of popular package managers you can install.
Macports..."
I'm still debating on whether or not to use MacPorts myself, but at the moment I'm leaning in that direction.
On Windows install http://chocolatey.org/ then
choco install python
Open a new cmd-window with the updated PATH. Next, do
choco install pip
After that you can
pip install pyside
pip install ipython
...
Since no one has mentioned pipenv here, I would like to describe my views on why everyone should use it for managing Python packages.
As @ColonelPanic mentioned, there are several issues with the Python Package Index, and with pip and virtualenv also.
Pipenv solves most of the issues with pip and provides additional features also.
Pipenv features
Pipenv is intended to replace pip and virtualenv, which means pipenv will automatically create a separate virtual environment for every project, thus avoiding conflicts between different Python versions/package versions across projects.
Enables truly deterministic builds, while easily specifying only what you want.
Generates and checks file hashes for locked dependencies.
Automatically installs required Pythons, if pyenv is available.
Automatically finds your project home, recursively, by looking for a Pipfile.
Automatically generates a Pipfile, if one doesn’t exist.
Automatically creates a virtualenv in a standard location.
Automatically adds/removes packages to a Pipfile when they are un/installed.
Automatically loads .env files, if they exist.
If you have worked on python projects before, you would realize these features make managing packages way easier.
Other Commands
check checks for security vulnerabilities and asserts that PEP 508 requirements are being met by the current environment (which I think is a great feature, especially after this - Malicious packages on PyPI).
graph will show you a dependency graph of your installed dependencies.
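In day-to-day use that boils down to a handful of commands (the package name is illustrative):
pipenv install requests   # creates Pipfile and virtualenv if needed, then locks the dependency
pipenv check              # audit the environment for known vulnerabilities
pipenv graph              # print the dependency tree of the environment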
You can read more about it here - Pipenv.
Installation
You can find the installation documentation here
P.S.: If you liked working with the Python package requests, you would be pleased to know that pipenv is by the same developer, Kenneth Reitz.
In 2019 poetry is the package and dependency manager you are looking for.
https://github.com/sdispater/poetry#why
It's modern, simple and reliable.
Poetry is what you're looking for. It takes care of dependency management, virtual environments, and running your project.
