I have completed my Python 3 application, and it uses multiple public modules from PyPI.
However, before I deploy it within my company's enterprise, where it will be handling customer credentials and accessing third-party APIs, I need to do due diligence to confirm that these modules are both secure and safe.
What steps must I perform?
How do I validate that the PyPI modules are secure and safe to use, keeping in mind that the target Python 3 app will be handling credentials?
What is the recommended way to validate a PyPI module's signature?
Can PyPI module signatures be trusted?
By the way, the Python 3 application will be running within a Docker container.
Thank you
These are 3 separate questions, so:
You'll have to audit the package (or have someone else do it) to know whether it's secure. There's no easy way around it.
All PyPI packages have an MD5 checksum attached (the link in parentheses after the file). Some authors also attach a PGP signature, which shows up in the same place, but it's up to the author whether one is published. (https://pypi.python.org/pypi/rpc4django, for example, includes both MD5 and PGP.) MD5 verifies integrity only; PGP verifies both integrity and origin, so it's the better choice when available.
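As a rough sketch (the package, file names, and key ID below are placeholders, and this assumes the author actually published a detached .asc signature next to the sdist), the manual check looks something like:

    # Download the sdist only, without installing or resolving dependencies.
    pip download rpc4django --no-deps --no-binary :all: -d ./downloads

    # Compare the MD5 checksum with the value shown on the PyPI file listing.
    md5sum downloads/rpc4django-*.tar.gz

    # If a detached PGP signature (.asc) is published alongside the file,
    # import the author's public key and verify integrity *and* origin.
    gpg --recv-keys <author-key-id>
    gpg --verify downloads/rpc4django-<version>.tar.gz.asc downloads/rpc4django-<version>.tar.gz

Keep in mind that either check only tells you the file matches what was uploaded; it says nothing about whether the code itself is safe, which is why the audit in the first answer still matters.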
Just as much as any other signature.
If you're worried about dependencies to that level, I think you should look at maintaining your own internal PyPI repository. It gives you better verification (just sign the packages yourself after the initial download and only accept your own signature). It gives you better reliability and speed (you can still build the software if PyPI goes down). And it avoids issues with replaced or updated packages which you haven't audited/approved yet.
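Once such a repository exists, pointing pip at it is a one-liner (the URL below is hypothetical):

    # One-off install against the internal index only.
    pip install --index-url https://pypi.internal.example.com/simple/ somepackage

    # Or make the internal index the default for every pip invocation.
    pip config set global.index-url https://pypi.internal.example.com/simple/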
Related
I work on a large Python code base with several teammates. We are often installing or updating dependencies on other Python packages, and inevitably this causes problems when someone else updates their master branch from git, or when we deploy on a new system.
I've seen many tools available for deploying environments on new computers, which are great. The problem is that these tools only work if everyone is consistently updating the relevant files (e.g. requirements.txt, setup.py, tarballs on a PyPI server...) every time they update or add a package.
We use Github's pull request system for code reviews. What would be great would be some means of indicating to the reviewer that the dependency structure has changed, prompting the reviewer to check for the necessary updates (also good would be to build in a checklist that the reviewer has to complete, reminding them to do the check).
How have other folks dealt with this problem?
I would enforce standards with a network proxy or network ACL: block the public sites and stand up internal services such as GitLab, Bitbucket, GitHub Enterprise, or an internal PyPI server, so that everyone is forced to go through the same channels.
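On top of that, a very small CI check can at least catch the "someone forgot to update requirements.txt" case. A sketch, assuming requirements.txt is generated with pip freeze and therefore pins the complete environment:

    # Fail the CI job if the installed environment drifts from requirements.txt.
    # Assumes requirements.txt is produced with "pip freeze", i.e. it pins the
    # complete environment rather than just the top-level dependencies.
    diff <(sort requirements.txt) <(pip freeze | sort) \
        || { echo "requirements.txt is out of date"; exit 1; }

Any fix then shows up as a change to requirements.txt in the pull request diff itself, which is exactly the cue the reviewer needs.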
I am working on a series of Python projects that involve making data available to other dev teams (C++/Python) via various services using gRPC. At the start of this initiative all files were self-contained within the server project, including the proto file. By the way, I version my projects with https://semver.org/.
Then I started implementing a client/API library, so I moved the proto file and generated code to the client project. I also made the server depend on the client library as a Python package.
The server gets packaged in an RPM due to the nature of my environment (no Docker), while the client generates two artifacts: 1) an RPM that can be statically linked for C++ projects, 2) a PyPI package that can be uploaded to and downloaded from a PyPI repo. The server downloads the PyPI dependency via pip.
The issue I have is that the SemVer on the client/API library gives the wrong meaning, as it ties together the version of the proto interface and the actual client version. This is a problem because if a bug in the client library forces a version bump, it gives the impression that the proto interface has changed, which is untrue.
At this point I am starting to think that I should have a third project that only contains the proto file and generated code. However, this will cause an explosion of small projects (3x) in my git repo each time I need to implement a new service, or perhaps I should group all my protos in a single project.
I would appreciate any suggestions or advice on how to share proto files while keeping the semantics behind my version number.
A Git submodule to manage the common proto files could solve your problem.
When you define a submodule, you are sure that this portion of code is common to all the repos pointing to that module.
Also, your submodule actually points to a specific commit, so if there is an update to the module (the proto model), you will have to pull the update into your business code manually.
On the other hand, if you are delivering a bug fix and you don't update the submodule, you will not redeliver the proto.
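A minimal sketch of the workflow (repository URLs and paths below are made up):

    # In each consuming repo (server and client alike), pull the shared protos
    # in as a submodule pinned to a specific commit.
    git submodule add https://git.example.com/team/protos.git third_party/protos
    git commit -m "Track shared proto definitions as a submodule"

    # Picking up a newer proto revision is an explicit, reviewable change:
    git submodule update --remote third_party/protos
    git commit -am "Bump protos submodule"

    # Fresh clones need the submodule populated:
    git clone --recurse-submodules https://git.example.com/team/server.git

The proto repository can then carry its own version tags, independent of the SemVer of the client library.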
I had a quick question here: I am used to devpi and was wondering what the difference is between devpi and the PyPI server?
Is one better than the other? Which of the two scales better?
PyPI (Python Package Index) is the official repository for third-party Python software packages. Every time you use e.g. pip to install a package that is not in the standard library, it will get downloaded from the PyPI server.
All of the packages that are on PyPI are publicly visible. So if you upload your own package then anybody can start using it. And obviously you need internet access in order to use it.
devpi (not sure what the acronym stands for) is a self-hosted private Python package server. Additionally, you can use it for testing and releasing your own packages.
Being self-hosted, it's ideal for proprietary work that you perhaps wouldn't want (or can't) share with the rest of the world.
So other features that devpi offers:
PyPI mirror - caches locally any packages that you download from PyPI. This is excellent for CI systems: you don't have to worry if a package or the server goes missing, and you can even keep using it if you don't have internet access.
multiple indexes - unlike PyPI (which has only one index), in devpi you can create multiple indexes. For example, a main index for packages that are rock solid, and a development index where you can release packages that are still under development. You do have to be careful with this, though, because a large number of indexes can make things hard to track.
The server has a simple web interface where you can view and search for packages.
You can integrate it with pip so that you can use your local devpi server as if you were using PyPI.
So answering your questions:
Is one better than the other? - well these are two different tools really. No clear answer here, depends on what your needs are.
Which scales better? - definitely devpi.
The official website is very useful with good examples: http://doc.devpi.net/latest/
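To give a feel for the workflow (the host name, user, and index names below are placeholders), the day-to-day commands look roughly like this:

    # Point the devpi client at the server, log in, and pick an index.
    devpi use https://devpi.example.com/
    devpi login alice --password=<secret>
    devpi use alice/dev

    # Build and upload the package from the current project directory.
    devpi upload

    # Consumers install straight from that index with plain pip.
    pip install --index-url https://devpi.example.com/alice/dev/+simple/ somepackage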
I am looking for a flexible solution for uploading generic builds to an artifact repository (in my case it would be Artifactory, but I would not mind if it also supported others, like Nexus).
Because I am not building Java code, adding Maven to the process would only add unneeded complexity.
Still, the entire infrastructure already supports bash and Python everywhere (including Windows), which makes me interested in finding something that involves those two.
I do know that I could code it myself, but now I am looking for a way to make it as easy and flexible as possible.
Gathering the metadata seems simple; only publishing it in the format required by the artifact repository seems to be the issue.
After discovering that the two existing Python packages related to Artifactory are rather useless, as neither is actively maintained, one is only usable as a query interface, and the other has serious bugs that prevent its use, I found something that seems close to what I was looking for: http://teamfruit.github.io/defend_against_fruit/
Still, it seems it was designed to deal only with Python packages, not with generic builds.
Some points to consider:
Tools like Maven and Gradle are capable of building more than Java projects. Artifactory already integrates with them and this includes gathering the metadata and publishing it together with the build artifacts.
The Artifactory Jenkins plugin supports generic (freestyle) builds. You can use this integration to deploy whatever type of files you like.
You can create your own integration based on Artifactory's open integration layer for CI build servers - build-info. This is an open source project and all the implementations are also open sourced. The relevant Artifactory REST APIs are documented here.
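For the bash-and-Python constraint in the question, the simplest form of the last option is just an authenticated HTTP PUT against the repository path. A sketch (host, repository, and file names are placeholders):

    # Deploy a generic build artifact with a single authenticated HTTP PUT.
    # (Key=value properties can be attached as ;key=value matrix parameters
    # appended to the target URL.)
    curl -u "$ARTIFACTORY_USER:$ARTIFACTORY_TOKEN" \
         -T build/myapp-1.2.3.tar.gz \
         "https://artifactory.example.com/artifactory/generic-local/myapp/1.2.3/myapp-1.2.3.tar.gz"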
Disclaimer: I'm affiliated with Artifactory
Where I currently work we've had a small debate about deploying our Python code to the production servers. I voted to build binary dependencies (like the Python MySQL drivers) on the server itself, just using pip install -r requirements.txt. This was quickly vetoed with no better explanation than "we don't put compilers on the live servers". As a result our deployment process is becoming convoluted and over-engineered simply to avoid this compilation step.
My question is this: What's the reason these days to avoid having a compiler on live servers?
In general, the prevailing wisdom on server installs is that they should be as stripped-down as possible. There are a few motivations for this, but they don't really apply all that directly to your question about a compiler:
Minimize resource usage. GCC might take up a little extra disk space, but probably not enough to matter - and it won't be running most of the time, so CPU/memory usage isn't a big concern.
Minimize complexity. Building on your server might add a few more failure modes to your build process (if you build elsewhere, then at least you will notice something wrong before you go mess with your production server), but otherwise, it won't get in the way.
Minimize attack surface. As others have pointed out, by the time an attacker can make use of a compiler, you're probably already screwed.
At my company, we generally don't care too much if compilers are installed on our servers, but we also never run pip on our servers, for a rather different reason. We're not so concerned about where packages are built, but when and how they are downloaded.
The particularly paranoid among us will take note that pip (and easy_install) will happily install packages from PyPI without any form of authentication (no SSL, no package signatures, ...). Further, many of these aren't actually hosted on PyPI; pip and easy_install follow redirects. So, there are two problems here:
If pypi - or any of the other sites on which your dependencies are hosted - goes down, then your build process will fail
If an attacker somehow manages to perform a man-in-the-middle attack against your server as it's attempting to download a dependency package, then he'll be able to insert malicious code into the download
So, we download packages when we first add a dependency, do our best to make sure the source is genuine (this is not foolproof), and add them into our own version-control system. We do actually build our packages on a separate build server, but this is less crucial; we simply find it useful to have a binary package we can quickly deploy to multiple instances.
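The deploy step itself then never needs to reach the network; a sketch of the install side (the vendor directory name is just our convention):

    # All dependency archives are vendored into ./vendor in our own VCS, so the
    # install never talks to PyPI or follows any redirects.
    pip install --no-index --find-links=./vendor -r requirements.txt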
I would suggest referring to this serverfault post.
It makes sense to avoid exploits being compiled remotely.
It also makes sense to me that, in terms of security, not having a compiler only makes the task harder for a hijacker than having one would, but it's not perfect.
Would it put a heavy strain on the server?