It seems it's possible to remove packages (or package versions) from PyPI: How to remove a package from Pypi
This can be a problem if you've completed development of some software and expect to be able to pull the dependencies from PyPI when building.
What are the best practices to protect against this?
In fact, this is a much deeper problem than just preventing another left-pad incident.
In general, dependency-management practices are defined by community norms, which are often implicit. Python is especially bad in this respect: not only are the norms implicit, its package-management tools were also built on the premise that dependency compatibility is not guaranteed. Since the emergence of PyPI, package installers have not guaranteed that they install compatible versions of dependencies. If package A requires B==1.0 and C==1.0, and C requires B==0.8, then after installing A you might end up with B==0.8, i.e. A's dependencies won't be satisfied.
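As a rough illustration (the package names A, B, C are hypothetical), the failure mode looks like this; note that recent pip releases (20.3 and later) ship a backtracking resolver that reports such conflicts instead of silently creating them:

    # A declares: B==1.0, C==1.0
    # C declares: B==0.8
    pip install A     # older pip could leave B==0.8 installed, breaking A's pin
    pip check         # reports the broken requirement after the fact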
0. Choose wisely, use signals.
Both developers and package maintainers are aware of the situation. People try to minimize the chance of such a disruption by selecting "good" projects, i.e. projects with a healthy community in which no single person is likely to make such a decision, and which is capable of reviving the project in the unlikely case it happens.
The design and evaluation of these signals is an area of active research. The most commonly used factors are the number of package contributors (bus factor), healthy practices (tests, CI, quality of documentation), the number of forks on GitHub, stars, etc.
1. Copy package code under your source tree
This is the simplest option if you do not expect the package to change much but are afraid of deletion or breaking changes. It also gives you the advantage of customizing the package for your needs; on the other hand, package updates will now take quite a bit of effort.
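If you go this route, one low-effort way to vendor a package is to let pip copy it into your tree and import it from there; a minimal sketch (the directory name and the package are placeholders):

    pip install --target=./vendor somepackage==1.2.3   # copies the package (and its deps) into ./vendor
    # then, near the top of your entry point:
    #   import os, sys
    #   sys.path.insert(0, os.path.join(os.path.dirname(__file__), "vendor"))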
2. Republish a copy of the package on PyPI
I do not remember the exact names, but some high-profile packages have been republished by other developers under different names. In that case all you need to do is copy the package files and republish them, which is presumably less expensive than maintaining a copy under your source tree. It looks dirty, though, and I would discourage doing this.
3. A private PyPI mirror
A cleaner but more expensive version of #2.
4. Another layer of abstraction
Some people select a few competing alternatives and create an abstraction over them with the capability to use different "backends". The reason for this is usually not to cope with possible package deletion, and depending on the complexity of the interfaces it can be quite expensive. An example of such an abstraction is Keras, which provides a consistent neural-network interface on top of both the TensorFlow and Theano backends.
5. There are more options
More exotic options include distributing snapshots of virtual environments/containers/VMs, reimplementing the package (especially if there are licensing issues), etc.
You can create your own local PyPI mirror (where you update, add, and remove packages according to your own policies) and use it for future package installations and updates. A full PyPI mirror consumes about 30 GB of storage, so all you need is 30-40 GB of space.
There are many ways to create a local PyPI mirror, e.g. How to roll my own pypi?
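Once a mirror or private index is running, the main remaining step is pointing pip at it, either per install or globally; a sketch (the host name is a placeholder):

    pip install --index-url https://pypi.internal.example.com/simple/ somepackage

    # ~/.config/pip/pip.conf (or /etc/pip.conf for all users)
    [global]
    index-url = https://pypi.internal.example.com/simple/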
Related
I am currently developing a Machine Learning library that allows users to write CPU/GPU-agnostic code to achieve certain tasks. In order to run the GPU-enabled code, my package has certain dependencies which are only compatible with CUDA-enabled NVIDIA GPUs. Therefore, I would like my package to install those dependencies only when a user installs it on a device with a CUDA-enabled GPU. I researched how to do this, and at first it seemed as though my answer was going to be PEP 508. However, the environment markers described in this PEP do not include the presence of a GPU, let alone the manufacturer and/or drivers of that GPU.
Therefore, I am looking for the most "Pythonic" way to install certain dependencies based upon the presence of a CUDA-enabled NVIDIA GPU (if it is much easier to merely check for the presence of an NVIDIA GPU without checking for CUDA, I can do that as a solution for the moment, but I would definitely prefer to check for CUDA-enabled status).
I understand that setup.py can execute arbitrary Python code, so technically I could solve this problem with code in there. However, my understanding is that the Python community is moving away from relying on setup.py because of the inherent risks of executing arbitrary code during package setup, so I would like to follow the community's preferences and have my solution use pyproject.toml if possible, although I would also be happy to use setup.cfg if that is considered more appropriate. (To be honest, I do not feel I fully understand the distinction between these two files and which is more appropriate for what purpose, although from the mCoding video I watched to set up my testing it seems that setup.cfg is more for package metadata.)
Even though what I am looking to do was not expressly detailed in PEP 508, it feels very much "in the same spirit" so I would be surprised if there was no canonical way to do this.
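For reference, the closest declarative mechanism that does exist is optional extras, which shifts the decision to the user rather than auto-detecting the GPU; a sketch (the project name is a placeholder and cupy-cuda12x is just one example of a CUDA-only dependency):

    [project]
    name = "mypackage"
    version = "0.1.0"
    dependencies = ["numpy"]

    [project.optional-dependencies]
    gpu = ["cupy-cuda12x"]

Users on a CUDA-capable machine would then run pip install "mypackage[gpu]", while everyone else installs the plain package.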
Disclaimer: the question may seem opinion-based, but the main goal is to back one of the approaches with standards or style guides.
Problem
Say we have a monorepo with two completely independent Python projects. Some dependencies are shared, some are project specific.
Questions
does it make sense to use a shared dependencies list (requirements.txt, poetry.lock, etc.) for the whole repo?
are there any PEPs or other widely used guides that regulate this?
Pros and cons
As for me, the advantages of using isolated dependencies are:
simpler dependency resolution - sub-dependency constraints specific to one project won't affect the others. Also, old project-specific packages may be completely incompatible with modern ones because of conflicting sub-dependencies
lower risk of unexpected regressions - otherwise, changing a project-specific package may trigger a sub-dependency update and cause issues in a project that doesn't use that package at all
And the disadvantage of isolated dependencies is the technical debt caused by diverging versions of shared dependencies - a package may be bumped in one of the projects while the others keep using the older version, and this variance would likely increase over time.
Of course, I'd appreciate other thoughts on these options. And again, are there any Python or general guides on this?
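One middle-ground layout worth mentioning (file and directory names below are just placeholders) keeps per-project requirements but shares a single constraints file, so each project resolves independently while the pinned versions stay aligned:

    monorepo/
        constraints.txt            # shared pins for packages both projects use
        project_a/
            requirements.txt       # only project A's direct dependencies
        project_b/
            requirements.txt

    pip install -r project_a/requirements.txt -c constraints.txt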
While it is possible to simply use pip freeze to capture the current environment, that is not suitable: it would require an environment as bleeding-edge as the one I am used to.
Moreover, some developer tooling is only available in recent versions of packages (think type annotations), but is not needed by users.
My target users may want to use my package on slowly upgrading machines, and I want to get my requirements as low as possible.
For example, I cannot require better than Python 3.6 (and even then I think some users may be unable to use the package).
Similarly, I want to avoid requiring the last Numpy or Matplotlib versions.
Is there a (semi-)automatic way of determining the oldest compatible version of each dependency?
Alternatively, I can manually try to build a conda environment with old packages, but I would have to try combinations more or less at random.
Unfortunately, I inherited a medium-sized codebase (~10 KLoC) with no automated tests yet (I plan on writing some, but it takes time, and it sadly cannot be my priority).
The requirements were not properly defined either, so I don't know what the code was run with two years ago.
Because semantic versioning is not always honored (and because it may be difficult, from a developer's standpoint, to determine what exactly counts as a minor or major change for each possible user), and because only a human can parse release notes to understand what has changed, there is no simple solution.
My technical approach would be to create a virtual environment with a known working combination of Python and library versions. From there, downgrade one version at a time, one library at a time, verifying that everything still works (which may be difficult if the check is manual and/or slow).
My social solution would be to timebox the technical approach to take no more than a few hours. Then settle for what you have reached. Indicate in the README that lib requirements may be overblown and that help is welcome.
Without fast automated tests in which you are confident, there is no way to automate the exploration of the N-dimensional space (each library is a dimension) to find the minimums.
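If you do end up with even a rough smoke test, the downgrade loop can be scripted; a minimal sketch (the library, the candidate version list, and the use of pytest are assumptions about your project):

    for v in 1.15.4 1.16.6 1.17.5 1.18.5; do
        python -m venv "/tmp/minprobe-$v"
        "/tmp/minprobe-$v/bin/pip" install -q -e . pytest      # project plus its current deps
        "/tmp/minprobe-$v/bin/pip" install -q "numpy==$v"      # force the candidate version
        "/tmp/minprobe-$v/bin/pytest" -q && echo "numpy==$v passes" || echo "numpy==$v fails"
    done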
I have a big Python 3.7+ project and I am currently in the process of splitting it into multiple packages that can be installed separately. My initial thought was to have a single Git repository with multiple packages, each with its own setup.py. However, while doing some research on Google, I found people suggesting one repository per package: (e.g., Python - setuptools - working on two dependent packages (in a single repo?)). However, nobody provides a good explanation as to why they prefer such structure.
So, my questions are the following:
What are the implications of having multiple packages (each with its own setup.py) on the same GitHub repo?
Am I going to face issues with such a setup?
Are the common Python tools (documentation generators, PyPI packaging, etc.) compatible with such a setup?
Is there a good reason to prefer one setup over the other?
Please keep in mind that this is not an opinion-based question. I want to know if there are any technical issues or problems with any of the two approaches.
Also, I am aware (and please correct me if I am wrong) that setuptools now allows installing dependencies from GitHub repos, even if the setup.py is not at the root of the repository.
One aspect is covered here
https://pip.readthedocs.io/en/stable/reference/pip_install/#vcs-support
In particular, if setup.py is not in the root directory you have to specify the subdirectory where to find setup.py in the pip install command.
So if your repository layout is:
pkg_dir/
    setup.py          # setup.py for package pkg
    some_module.py
other_dir/
    some_file
some_other_file
You'll need to use pip install -e "vcs+protocol://repo_url/#egg=pkg&subdirectory=pkg_dir" (note the quotes: the & must be quoted or escaped so the shell doesn't interpret it).
"Best" approach? That's a matter of opinion, which is not the domain of SO. But here are a couple of justifications for creating separate packages:
Package is functionally independent of the other packages in your project.
That is, it doesn't import from them and performs a function that could be useful to other developers. Extra points if the function this package performs is similar to packages already on PyPI.
Extra points if the package has a stable API and clear documentation. Penalty points if the package is a thin grab bag of unrelated functions that you factored out of multiple packages for ease of maintenance, but that don't have a unifying principle.
The package is optional with respect to your main project, so there'd be cases where users could reasonably choose to skip installing it.
Perhaps one package is a "client" and the other is the "server". Or perhaps the package provides OS-specific capabilities.
Note that a package like this is not functionally independent of the main project and so does not qualify under the previous bullet point, but this would still be a good reason to separate it.
I agree with #boriska's point that the "single package" project structure is a maintenance convenience well worth striving for. But not (and this is just my opinion, I'm going to get downvoted for expressing it) at the expense of cluttering up the public package index with a large number of small packages that are never installed separately.
I am researching the same issue myself. The PyPA documentation recommends the layout described in the 'native' subdirectory of: https://github.com/pypa/sample-namespace-packages
I find the single-package structure described below very useful; see the discussion around testing the 'installed' version.
https://blog.ionelmc.ro/2014/05/25/python-packaging/#the-structure
I think this can be extended to multiple packages. Will post as I learn more.
The major problem I've faced when splitting two interdependent packages into two repos came from CI and testing, specifically branch protections.
Say you have package A and package B and you make some (breaking) changes in both. The automated tests for package A fail because they use the main branch of B (which is no longer compatible with the new version of A), so you can't merge A. And the same problem exists the other way around.
tldr:
After breaking changes, automated tests on merge will fail because they use the main branch of the other repo, making it impossible to merge.
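One common workaround (the repository URL and the CI branch variable below are hypothetical) is to have package A's CI install package B from a branch with the same name as the one under test, falling back to main:

    BRANCH="${CI_BRANCH:-main}"
    pip install "git+https://github.com/example-org/package-b.git@${BRANCH}#egg=package_b" \
        || pip install "git+https://github.com/example-org/package-b.git@main#egg=package_b"

That way paired breaking changes can be tested together before either branch is merged.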
I've been using zc.buildout more and more and I'm encountering problems with some recipes that I have solutions to.
These packages generally fall into several categories:
Package with no obvious links to a project site
Package with links to free hosted service like github or google code
Setup #2 is better than #1, but not much better, because in both situations I would have to wait for the developer to apply these changes before I can use the updated package in the buildout.
What I've been doing up to this point is basically forking the package, giving it a different name and uploading it to pypi, but this is creating redundancy and I think only aggravating the problem.
One possible solution is to use a personal package index server where I would upload updated versions of the code until the developer updates his or her package. This is doable, but it adds additional work that I would prefer to avoid.
Is there a better way to do this?
Thank you
Your "upload my personalized fork" solution sounds like a terrible idea. You should try http://pypi.python.org/pypi/collective.recipe.patch which lets you automatically patch eggs. Try setting up a local PyPi-compatible index. I think you can also point find-links = at a directory (not just a http:// url) containing your personal versions of those "almost good enough" packages. You can also try monkey patching the defective package, or take advantage of the Zope component model to override the necessary bits in a new package. Often the real authors are listed somewhere in the source code of a package, even if they decided not to put their names up on PyPi.
I've been trying to cut down on the number of custom versions of packages I use. Usually I work with customized packages as develop eggs by linking src/some.project to my checkout of that project's code. I don't have to build a new egg or reinstall every time I edit those packages.
A lot of Python packages used in buildouts are hosted in Plone's svn collective. It's relatively easy to get commit access to that repository.