Is there any way, through either pip or PyPI, to identify which projects (published on PyPI) might be using my packages (also published on PyPI)? I would like to identify the user base for each package and possibly attempt to actively engage with them.
Thanks in advance for any answers - even if what I am trying to do isn't possible.
This is not really possible. There is no easily accessible public dataset that lets you produce a dependency graph.
At most, you could scan all publicly available packages and parse their dependencies, but even then those dependencies are produced by running each project's setup.py script, so they can be set dynamically. It is quite common to adjust the dependencies based on the Python version (installing a backport of a standard-library module on older Python versions, for example). People have done this before, but it is not an easy, lightweight task.
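For illustration only, a hypothetical setup.py might compute its dependency list at run time like the sketch below (the project name and dependencies are made up), which is exactly why statically scraping the file is unreliable:

# setup.py -- the dependency list is computed when the script runs, so a
# static scan of the file cannot reliably tell you what gets installed.
import sys
from setuptools import setup

install_requires = ["requests>=2.20"]
if sys.version_info < (3, 8):
    # backport of the standard-library importlib.metadata for old interpreters
    install_requires.append("importlib-metadata")

setup(
    name="example-project",  # hypothetical name
    version="1.0.0",
    install_requires=install_requires,
)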
Note that even then, you'll only find publicly declared dependencies. Any dependencies declared by private packages, not published to PyPI or another public repository, can't be accounted for.
Related
We maintain a private PyPI repository on a GitLab instance, to which we upload hundreds of Python wheels for different platforms, architectures and Python versions in order to use them internally.
As the number of packages grows, we only want to upload packages that do not already exist in the repository, and thus want to check what's already available in advance. We have looked into pip search, but it only seems to work for single packages.
Is there any way to check for all available packages on a package repository including their versions, architectures, etc.?
I was hoping for something like pip list but for the repository.
Suppose I have the following PyPI indexes:
public PyPI (standard packages)
GitLab PyPI (because internal team ABC wanted to use this)
Artifactory PyPI (because contractor team DEF wanted to use this)
Now suppose a package exists on all of them under the same name, but the copies are not the same thing (for instance, "apples", which is three entirely different packages across the three indexes). How do I do something in my requirements and setup.py to map each package name to the index it should come from?
Something like:
package_def==1.2.3 --index-url=artifactory
apples==1.08 --index-url=gitlab # NOT FROM PUBLIC OR FROM ARTIFACTORY
package_abc==1.2.3 --index-url=artifactory
package_efg==1.0.0 # public pypi
I don't even know how I'd configure the setup.py in this instance either.
I really don't want multiple requirements.txt files with different index URLs at the top. I also don't want --extra-index-url, due to the dependency-confusion vulnerabilities it can introduce when using a private PyPI.
I tried googling around, messing with the order of entries in requirements.txt, breaking it up into different files, etc. No luck. It seems that the last --index-url given is always the one used to install all of the packages.
Any ideas?
The question comes back to the idea that a package dependency specification is usually a statement of need, independent of how that need should be satisfied.
So the dependency declaration "foo==1.0.0" (the thing declared as part of the package metadata) means "I need the package named foo at version 1.0.0", and that is in principle implementation-independent. You can install that package with pip from PyPI, but you could also use a different tool and/or a different source to satisfy that requirement (e.g. conda, installation from source, etc.).
This distinction is the reason why there's no good way to do this.
There are a few workarounds:
You can specify the full link (direct URL) to the wheel you want to pip install (see the example at the end of this answer)
You can use an alternative tool like Poetry, which does support this a little more cleanly.
For my particular use case, I just listed the full link to the wheel I wanted to pip install, since upgrading to Poetry is out of scope at the moment.
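For reference, this is roughly what that looks like, assuming pip's support for direct URL references ("name @ URL") in a requirements file; every host and path below is a placeholder for your own indexes, not a real URL:

# requirements.txt -- all URLs are placeholders
package_def @ https://artifactory.example.com/artifactory/api/pypi/pypi-local/package_def-1.2.3-py3-none-any.whl
apples @ https://gitlab.example.com/api/v4/projects/123/packages/pypi/files/apples-1.08-py3-none-any.whl
package_efg==1.0.0  # still resolved from the default --index-url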
In the Python project I work on at my workplace, we install some packages from PyPI, and some private company packages from Gemfury, using a standard requirements file.
After reading this article on dependency confusion (https://medium.com/@alex.birsan/dependency-confusion-4a5d60fec610), our requirements file looks something like:
--index-url <OUR_GEMFURY_URL>
--extra-index-url https://pypi.python.org/simple
aiohttp==3.7.1
simplejson==3.17.1
<our-package>==1.0.0
<our-other-package>==1.2.0
I tried reading some of pip's documentation, but I wasn't able to fully understand how it chooses where to download a package from.
For example, what happens if someone uploads a malicious version 1.0.0 to pypi-prod? How does pip know which one of the packages to take?
Is there a way to tell pip to search for a specific package only in the index given by --index-url?
How do you protect against dependency confusion in your code?
Thanks for the help!
The article mentions the algorithm pip uses:
Checks whether the library exists on the specified (internal) package index
Checks whether the library exists on the public package index (PyPI)
Installs whichever version is found. If the package exists on both, it defaults to installing from the source with the higher version number.
So if your script requires <our-other-package>>=1.2.0, you can end up with a malicious package from the public PyPI server if it carries a higher version number than the one you intended to install.
The straightforward solution mentioned in the article is removing --extra-index-url.
Whether a package is internal or external, if it is present on the private PyPI server it will be downloaded from there.
External packages will be downloaded from the public PyPI server through the internal PyPI server, which will cache them for future use.
I'd also suggest pinning explicit versions in requirements.txt; this way you know exactly which versions you get and you do conscious upgrades by bumping the pins.
To sum up the guidelines (which are by no means exhaustive and do not protect against all possible security holes):
remove --extra-index-url https://pypi.python.org/simple from pip.conf, requirements.txt and automation scripts.
specify explicit versions of internal and external packages in requirements.txt (see the sketch below).
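A minimal sketch of what the result can look like, assuming a single internal index that proxies and caches the public one (the URL and package names are placeholders):

# pip.conf -- a single index, no --extra-index-url anywhere
[global]
index-url = https://pypi.internal.example.com/simple

# requirements.txt -- explicit pins for internal and external packages
aiohttp==3.7.1
simplejson==3.17.1
<our-package>==1.0.0
<our-other-package>==1.2.0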
I've gone down the Python packaging and distribution rabbit-hole and am wondering:
Is there ever any reason to provide standard library modules/packages as dependencies to setup() when using setuptools?
And a side note: the only setuptools docs I have found are terrible. Is there anything better?
Thanks
In a word, no. The point of declaring dependencies is to make sure they are available when someone installs your library. Since standard-library modules are always available with a Python installation, you do not need to include them.
For a more user-friendly guide to packaging, check out Hynek's guide:
Sharing Your Labor of Love: PyPI Quick and Dirty
No. In fact, you should never specify standard modules as setup() requirements, because those requirements are only for downloading packages from PyPI (or VCS repositories or other package indexes). Adding, say, "itertools" to install_requires will make your package fail to install: its dependencies can't be satisfied, since there is (currently) no project named "itertools" on PyPI. Some standard modules do share their name with a project on PyPI; in the best case (e.g., argparse), the PyPI project is compatible with the standard module and only exists as a separate project for historical reasons/backwards compatibility. In the worst case... well, use your imagination.
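A minimal sketch, with hypothetical project and dependency names, of what does and does not belong in install_requires:

# setup.py -- only third-party distributions belong in install_requires;
# standard-library modules (itertools, json, os, ...) are never listed.
from setuptools import setup, find_packages

setup(
    name="example-project",          # hypothetical name
    version="1.0.0",
    packages=find_packages(),
    install_requires=[
        "requests>=2.20",            # third-party: belongs here
        # "itertools",               # stdlib: would break the install -- no such PyPI project
    ],
)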
I have packages A and B, both have their own git repository, PyPI page, etc... Package A depends on package B, and by using the install_requires keyword I can get A to automatically download and install B.
But suppose I want to go a step further for my especially non-savvy users; I want to actually include package B within the tar/zip for package A, so no download is necessary (this also lets them potentially make any by-hand edits to package B's setup.cfg).
Is there a suggested (ideally automated) way to:
Include B in A when I call sdist for A
Tell setuptools that B is bundled with A for resolving the dependency (something like a local dependency_links)
Thanks!
This is called 'vendorizing' and no, there is no support for such an approach.
It is also a bad idea; you want to leave installation to the specialized tools, which manage not only dependencies but also which versions are installed. With tools like buildout or pip's requirements.txt format you can control very precisely which versions are being used.
By bundling a version of a dependency inside, you either force on your users the version they get to use, or you make it harder for such tools to ensure that the versions used in a given installation are consistent. You are also potentially wasting bandwidth and space: if other packages bundle the same requirements, you end up with multiple copies. And if your dependency is updated to fix a critical security issue, you have to re-release every package that bundles it.
In the past, some packages did vendor their dependencies into their distribution. requests was a prime example; they stepped away from this approach because it complicated their release process. Every time there was a bug in one of the vendored packages, they had to produce a new release to include the fix, for example.
If you do want to persist in including packages, you'll have to write your own support. I believe requests simply added the vendored packages to their repository by hand, keeping a static copy that they updated from time to time. Alternatively, you could extend setup.py to download code when you are creating a distribution.
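A minimal sketch of that manual approach, with hypothetical package names: keep a static copy of B under a _vendor subpackage of A, refresh it by hand, and fall back to it only when a real installation of B is absent.

# package_a/_vendor/package_b/ is a hand-copied snapshot of B's source tree,
# committed to A's repository and refreshed manually when B changes.

# package_a/compat.py -- prefer an installed B, fall back to the bundled copy.
try:
    import package_b
except ImportError:  # B not installed separately; use the vendored snapshot
    from package_a._vendor import package_b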