Hostnames/IP addresses of packages from PyPI, CRAN, Maven

We have a server behind a proxy and we want this server to be able to run commands such as:
python: pip install module
R: install.packages("fortunes")
...
Simply to install packages from these sources. Since we are behind a proxy, we cannot install these unless the proxy has them whitelisted (otherwise the proxy prohibits the connection between the server and wherever the package resides).
My question is: what should we whitelist to be able to run these commands?
I am not sure how the package websites actually work (whether they store the packages themselves or are just an index while the actual packages reside on other domains/hostnames/...). I believe PyPI is quite friendly here (the packages are actually found there), but for CRAN or Maven I don't know. We are running Spark servers, so our primary concerns are Python, R, Java and Scala libraries/packages.

Maven: the central repository actually stores the packages. Regarding mirroring, see this answer; it also contains the URL of the central repository.
PyPI: from the documentation on how to upload a package to the index, it seems like it also physically stores the packages.
CRAN: also hosts the packages. There are several mirrors; you will need to whitelist the one you want to use.
You might want to consider setting up an internal mirror where you put your dependencies once, and then don't need to go to the outside internet.
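If you do go the whitelist route, the hosts involved are, as far as I know, pypi.org plus files.pythonhosted.org for pip, repo1.maven.org (Maven Central) for Maven, and whichever CRAN mirror you pick (cloud.r-project.org is a common choice). A rough sketch of pinning the clients to specific whitelisted hosts (the proxy address below is a placeholder, not a real setting):

    # pip: go through the corporate proxy and only talk to PyPI
    pip install --proxy http://proxy.example.com:8080 --index-url https://pypi.org/simple/ somepackage
    # R: pick one CRAN mirror explicitly so only that host needs whitelisting
    Rscript -e 'install.packages("fortunes", repos = "https://cloud.r-project.org")'

For Maven you would point a <mirror> entry in settings.xml at Maven Central or, better, at an internal repository manager.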

Related

What URL should be whitelisted to use conda and pip? [duplicate]

I have a server, on which I want to use Python, that is behind a company firewall. I do not want to mess with the firewall, and the only thing I can do is ask for an exception for specific URLs/domains.
I also want to access packages located on PyPI, using pip or easy_install. Do you know which URLs I should ask to be listed in the firewall's exception rules, other than *.pypi.python.org?
You need to open up your firewall to the download locations of any package you need to install, or connect to a proxy server that has been given access.
Note that the download location is not necessarily on PyPI. The Python Package Index is a metadata service, one that happens to also provide storage for the indexed packages. As such, not all packages indexed on PyPI are actually downloaded from PyPI; the download location could be anywhere on the internet.
I'd say you start by opening pypi.python.org, then, as individual package installations fail, check their PyPI page and add the download locations listed for those.
I've solved it adding these domains to the firewall whitelist:
pypi.python.org
pypi.org
pythonhosted.org
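If you want to check where a particular package's files actually live before asking for a firewall exception, the PyPI JSON API lists the download URLs (requests is just an example package; these days the files themselves are typically served from files.pythonhosted.org):

    # list the download URLs for a package's latest release
    curl -s https://pypi.org/pypi/requests/json | python -c "import json,sys; print('\n'.join(f['url'] for f in json.load(sys.stdin)['urls']))"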

Difference between devpi and pypi server

A quick question here: I am used to devpi and was wondering what the difference is between devpi and pypi server?
Is one better than the other? Which of the two scales better?
PyPI (Python Package Index) is the official repository for third-party Python software packages. Every time you use e.g. pip to install a package that is not in the standard library, it gets downloaded from the PyPI server.
All of the packages that are on PyPI are publicly visible. So if you upload your own package then anybody can start using it. And obviously you need internet access in order to use it.
devpi (not sure what the acronym stands for) is a self-hosted private Python package server. Additionally, you can use it for testing and releasing your own packages.
Being self-hosted, it's ideal for proprietary work that maybe you wouldn't want (or can't) share with the rest of the world.
So other features that devpi offers:
PyPI mirror - caches locally any packages that you download from PyPI. This is excellent for CI systems: you don't have to worry if a package or the server goes missing, and you can even keep using it if you don't have internet access.
multiple indexes - unlike PyPI (which has only one index), in devpi you can create multiple indexes. For example, a main index for packages that are rock solid and a development index where you can release packages that are still under development. You have to be careful with this, though, because a large number of indexes can make things hard to track.
The server has a simple web interface where you can browse and search for packages.
You can integrate it with pip so that you can use your local devpi server as if you were using PyPI.
So, answering your questions:
Is one better than the other? - Well, these are really two different tools. There is no clear answer here; it depends on what your needs are.
Which scales better? - Definitely devpi.
The official website is very useful with good examples: http://doc.devpi.net/latest/
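For the pip integration mentioned above, a rough sketch (the host and port are whatever your devpi instance uses; 3141 is devpi's default, and root/pypi is its built-in mirroring index):

    # one-off: point pip at the local devpi server instead of pypi.org
    pip install --index-url http://localhost:3141/root/pypi/+simple/ somepackage

    # or make it the default in ~/.pip/pip.conf
    [global]
    index-url = http://localhost:3141/root/pypi/+simple/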

Reusable Django apps + Ansible provisioning

I'm a long-time Django developer and have just started using Ansible, after using Vagrant for the last 18 months. Historically I've created a single VM for development of all my projects, and symlinked the reusable Django apps (Python packages) I create to the site-packages directory.
I've got a working dev box for my latest Django project, but I can't really make changes to my own reusable apps without having to copy those changes back to a Git repo. Here's my ideal scenario:
I checkout all the packages I need to develop as Git submodules within the site I'm working on
I have some way (symlinking or a better method) to tell Ansible to setup the box and install my packages from these Git submodules
I run vagrant up or vagrant provision
It reads requirements.txt and installs the remaining packages (things like South, Pillow, etc), but it skips my set of tools because it knows they're already installed
I hope that makes sense. Basically, imagine I'm developing Django itself. How do I tell Vagrant (via Ansible, I assume) to find my local copy of Django, rather than the one from PyPI?
Currently the only way I can think of doing this is creating individual symlinks for each of those packages I'm developing, but I'm sure there's a more sensible model.
Thanks!
You should probably think of it slightly differently. You create a Vagrantfile which specifies Ansible as a provisioner. In that Vagrantfile you also specify which playbook to use for the vagrant provision step.
If your playbooks are written in an idempotent way, running them multiple times will skip steps that already match the desired state.
You should also think about what the desired end state of the VM should look like and write playbooks to accomplish that. Unless I'm misunderstanding something, all your playbook actions should happen inside the VM, not directly on your local machine.
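As a sketch of what the relevant playbook tasks could look like for the original question (the paths, app name and virtualenv location are made up; Vagrant's default synced folder is /vagrant):

    # install the pinned third-party requirements into a virtualenv
    - name: Install requirements
      pip:
        requirements: /vagrant/requirements.txt
        virtualenv: /home/vagrant/venv

    # install the reusable apps checked out as git submodules in editable mode,
    # so edits on the host are picked up without reinstalling
    - name: Install local reusable apps
      pip:
        name: /vagrant/libs/my-reusable-app
        editable: yes
        virtualenv: /home/vagrant/venv

Because pip skips requirements that are already satisfied, re-running vagrant provision stays reasonably cheap.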

Is it a good idea to store Python package eggs in Artifactory?

Currently I am developing an automated test framework. This test framework has different packages. These packages will be referenced in different projects and may be modified locally by the developer. I want to manage the Python package eggs. I am thinking of using Artifactory. I tried to look for Artifactory help for Python, but I couldn't find anything useful.
Should I use Artifactory or pip?
Edit:
Is there any way or command in Python which can help me put the eggs into Artifactory?
There are numerous reasons to prefer a binary repository manager over a simple shared directory/SCM binary storage:
Fine grained security.
Ability to proxy and cache remote repositories.
More efficient handling of binaries (because it's a tool that's tailored to do so).
Sharing the binaries with other teams and the world is a lot safer and easier.
Integration with many tools in the ecosystem.
Search and manipulation facilities.
Administration tools.
Artifactory exposes a very rich REST API and the deployment of any artifact can be achieved by a simple HTTP PUT request.
Take a look at the Defend Against Fruit project. It provides the previously missing glue between Python and Artifactory.
http://teamfruit.github.io/defend_against_fruit/
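As a hedged example of that PUT-based deployment (the host, repository name and credentials below are placeholders, not real defaults):

    # upload an sdist/egg to a repository in Artifactory with a plain HTTP PUT
    curl -u deployer:secret -T dist/mypackage-1.0.tar.gz "https://artifactory.example.com/artifactory/pypi-local/mypackage/mypackage-1.0.tar.gz"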
You can use "in house" PyPi (either with easy_install -f ... or pip -f ...).
For a server you can have just Apache serving a directory with all the eggs or something like http://pypi.python.org/pypi/pypiserver
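If you go the pypiserver route, a minimal sketch looks roughly like this (directory, port and host name are arbitrary; newer pypiserver releases use pypi-server run instead):

    pip install pypiserver
    mkdir -p ~/packages                 # drop your .tar.gz / .whl / .egg files here
    pypi-server -p 8080 ~/packages      # serve them over HTTP

    # on the clients
    pip install --index-url http://yourserver:8080/simple/ mypackage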

Caching Python requirements for production deployments

I'm building various python-based projects that use pip/buildout to install dependencies. But I don't like the idea of someone deleting a github project and crippling my apps, or a network outage meaning I can't perform a deployment.
How do other people solve this?
I've got various ideas, but I think perhaps the one that sounds most promising would be some kind of caching proxy server. I'd point pip to use this internal proxy server which would cache a copy of the downloaded project, and periodically check for updates (if there's a net connection) before serving cached versions.
Does anything like this already exist?
Use case:
I have a project which I deploy to web server 1. I add new features with a remote dependency, and when I come to update the production web server, PyPI is down so I can't deploy. Or perhaps when I come to set up a new web server, a dependency has disappeared from GitHub or wherever.
How can I make it so my deployments/dev environments can always be brought up regardless of what happens in the wider world?
Also, when I deploy, I won't deploy over the top of existing code. Rather I'll build a new virtualenv and switch over to it so I can rollback if anything goes wrong. So each time I deploy I'll need to rebuild my environment and will need dependencies to exist.
So I'm looking for a solution that will insulate me against short-term network outages to servers hosting dependencies, as well as guarding against projects being deleted.
You should keep a "reference copy" of the projects on which you depend.
If someone removes the project from GitHub (and PyPI and all the mirrors, and every other site on the net), then you still have the source and can distribute it yourself.
I have exactly the same requirements, and also use buildout to manage my deployments. I try not to install ANY of my package dependencies system-wide; I let buildout install eggs for all of them into my buildout. That way, if I depend on a newer version of some package in rev N+1 of my project, and at go-live time N+1 falls on its face, I can roll back to N and automatically get the package dependencies that N worked with.
We run a private eggbasket server, and configure buildout to fetch packages only from that. The server contents were initialized by allowing buildout to grab eggs from the network one time, then copying the downloaded eggs.
This way, upgrades to each package are totally under control and I can ensure that two successive buildouts of the same snapshot of my code will build out the same thing. When I want to upgrade everything, I let buildout fetch the most recent versions again, test, test, test, then copy my eggs to the eggbasket server to go into production mode.
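A rough sketch of the buildout side of such a setup, assuming the internal egg server is reachable at eggs.example.internal (a made-up host name):

    [buildout]
    # only ever talk to the internal index
    index = http://eggs.example.internal/simple
    allow-hosts = eggs.example.internal
    # keep a local copy of every download as an extra safety net
    download-cache = ${buildout:directory}/downloads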
This is what I'm looking for:
http://pypi.python.org/pypi/collective.eggproxy
