I am a Java developer writing stand-alone applications for Apache Spark. To create artifacts I use Gradle along with the Shadow (shadowJar) plugin.
A few teammates want to use Python. Currently they use JetBrains PyCharm to write their Python scripts and execute them remotely on the Spark cluster. However, this process does not scale well (what do we do once more than one file is involved?), so I am looking for a solution within the Python ecosystem. A problem is that neither I nor any of my teammates is a Python expert (in fact, the other teammates are not developers at all, but still have to write code; management decisions...), so we have no clue what best practice for Python development looks like.
I tried PyGradle, but it did not feel like it integrates smoothly, especially with Apache Spark. I also tripped over names like pip, Pex, setuptools, and virtualenv. What are those tools, and how do they interact with each other?
To prevent the X-Y problem: I want a codebase that can be built, (unit-)tested, and packaged with one command (like gradle build). The resulting artifact should be deployable and executable on a Spark cluster.
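For concreteness, here is a rough sketch of the kind of artifact I have in mind (the file names and the packaging commands in the comments are my assumptions, not an established workflow): a small PySpark entry point whose helper modules get zipped up and shipped via spark-submit --py-files. The question is exactly about how to automate the build/test/package steps around something like this.

```python
# main.py - hypothetical PySpark entry point; helper code would live in
# mypackage/ and be shipped alongside it, e.g.:
#   zip -r deps.zip mypackage/
#   spark-submit --py-files deps.zip main.py
from pyspark import SparkContext


def run(sc):
    # trivial placeholder job
    rdd = sc.parallelize(range(100))
    print(rdd.sum())


if __name__ == "__main__":
    sc = SparkContext(appName="example-job")
    run(sc)
    sc.stop()
```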
I am also new to this world and want to set up a process at an AI startup. I think http://pybuilder.github.io/ is at least a good starting point for automation, and I am trying to introduce it on our team.
Related
My situation is as follows. We have:
an "image" with all our dependencies in terms of software + our in-house python package
a "pod" in which such image is loaded on command (kubernetes pod)
a python project which has some uncommitted code of its own which leverages the in-house package
Also, please assume you cannot work on the machine (or cluster) directly (say, via a remote SSH interpreter). The cluster is multi-tenant, and we want to use it as efficiently as possible, with no idle time between trials.
Also, for now, forget about security issues; everything is locked down on our side, so there are no problems there.
We essentially want to "distribute" the workload (i.e. a script.py that is unfeasible to run on our local machines) remotely, without being constrained by Git commits, so we can do it "on the fly". This is necessary because all of the changes are experimental in nature (think ETL/pipeline-style analysis): we want to be able to experiment at scale, but with no bounds with regard to Git.
I tried dill, but I could not manage to make it work (probably due to the structure of the code). Ideally, I would like to replicate the concept MLeap applies to ML pipelines for Spark, but on a much smaller scale: basically packaging, with little to no constraints.
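For reference, this is roughly the workflow I was attempting with dill (the function, file, and pod names below are made up for illustration):

```python
# Sketch only: serialize an uncommitted function locally, copy the blob into
# the pod (e.g. `kubectl cp payload.pkl <pod>:/tmp/payload.pkl`), then load
# and run it there.
import dill


def my_etl_step(rows):
    # placeholder for the experimental code we want to ship
    return [r * 2 for r in rows]


# on the local machine
with open("payload.pkl", "wb") as f:
    dill.dump(my_etl_step, f)

# inside the pod
with open("payload.pkl", "rb") as f:
    func = dill.load(f)
print(func([1, 2, 3]))
```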
What would be the preferred route for this use case?
At my work I write numerous small Python scripts for DB management; most of them use one or two common libraries which I sometimes update.
Before distribution I freeze the scripts with cx_Freeze, copy over some resources, and upload everything to a server.
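For context, the freeze step is currently just a small cx_Freeze setup script per script, roughly like the sketch below (script and resource names are placeholders), run with python setup.py build:

```python
# setup.py - minimal cx_Freeze build script (names are placeholders)
from cx_Freeze import setup, Executable

setup(
    name="db_cleanup",
    version="0.1",
    description="One of the small DB management scripts",
    options={"build_exe": {"include_files": ["resources/"]}},
    executables=[Executable("db_cleanup.py")],
)
```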
I would like to set up a system that would allow me to automatically rebuild/freeze all of the scripts, copy over some files, archive everything, and upload it to the server.
I'm not sure where to even look for such a system, because most of what I found consists of complicated server-based systems like AHP for compiled languages, while I need something small for a single computer.
Obviously I can write something quick and dirty in Python, but it seems illogical that there isn't something ready-made for such simple requirements.
please forgive my ignorance, I'm still learning.
Have you considered using make? If you're familiar with it from another language, it might be the easiest option. There is also a list of such tools on the Python wiki.
Have a look at buildout. From their site:
Buildout is a Python-based build system for creating, assembling and deploying applications from multiple parts, some of which may be non-Python-based. It lets you create a buildout configuration and reproduce the same software later.
We are using it to do exactly what you describe. It is flexible and easy to extend through recipes.
I have a server which executes Python scripts from a certain directory path. Incidentally, this path is a checkout of the SVN trunk version of the scripts. However, I get the feeling that this isn't the right way to provide and update scripts on a server.
Do you suggest other approaches? (compile, copy, package, ant etc.)
In the end a web server will execute some Python script with parameters. How do I do the update process?
Also, I have trouble deciding how best to handle updated versions that should only apply to new projects on the server: if I update the Python scripts, only newly created web jobs should know how to handle the change. Do I deliver to one of many versioned directories and have the server pick the right one?
EDIT: The web server is basically an interface that runs some data analysis. That analysis is done by the actual scripts, which take some parameters and munge data. I don't really change the web interface; I only need to update the data scripts stored on the web server. Indeed, in some more advanced version the web server should also pick the right version of my data scripts, but at the moment I have no idea what the easiest way to do that would be.
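To make the idea concrete, here is a sketch of the versioned-directories approach I'm imagining (the directory layout and field names are invented for illustration): each release of the data scripts goes into its own directory, each job records the version it was created with, and the server resolves the path from that record.

```python
# Hypothetical version-resolution helper on the web server side.
import os

SCRIPT_ROOT = "/srv/analysis-scripts"   # e.g. /srv/analysis-scripts/1.2.0/


def script_dir_for(job):
    # jobs created before an update keep the version they were pinned to;
    # new jobs get whatever "current" points at
    version = job.get("script_version", "current")
    return os.path.join(SCRIPT_ROOT, version)
```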
The canonical way of distributing Python code/functionality is by publishing packages to a PyPI-compliant package index.
A list of available PyPI server implementations on python.org:
http://wiki.python.org/moin/PyPiImplementations
Instructions on setting up and using EggBasket:
http://chrisarndt.de/projects/eggbasket/#installation
Instructions on installing ChiShop:
http://justcramer.com/2011/04/04/setting-up-your-own-pypi-server/
Note that for this to work you need to distribute your code as "Eggs"; you can find out how to do this here: http://peak.telecommunity.com/DevCenter/setuptools
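If it helps, the setuptools side of this is usually just a small setup.py along these lines (the project name and dependency below are placeholders, not a prescription); python setup.py bdist_egg then produces the egg you upload to your index:

```python
# setup.py - minimal setuptools script; build an egg with `python setup.py bdist_egg`
from setuptools import setup, find_packages

setup(
    name="myscripts",                   # placeholder project name
    version="0.1.0",
    packages=find_packages(),
    install_requires=["simplejson"],    # example dependency
)
```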
A great blog post on the usage of eggs and the different parts in packaging: http://mxm-mad-science.blogspot.com/2008/02/python-eggs-simple-introduction.html
I have a Django site that needs to be rebuilt every night. I would like to check out the code from the Git repo and then do things like setting up the virtual environment, downloading the packages, etc. This would require no manual intervention, as it would be run from cron.
I'm really confused as to what to use for this. Should I write a Python script or a Shell script? Are there any tools that assist in this?
Thanks.
So what I'm looking for is CI and from what I've seen I'll probably end up using Jenkins or Buildbot for it. I've found the docs to be rather cryptic for someone who's never attempted anything like this before.
Do all CI tools like Buildbot/Jenkins simply run tests and send you reports, or do they actually set up a working Django environment that you can access through your browser?
You'll need to create some sort of build script that does everything but the Git checkout. I've never used any Python build tools, but perhaps something like SCons would work: http://www.scons.org/.
Once you've created a script you can use Jenkins to schedule a nightly build and report success/failure: http://jenkins-ci.org/. Jenkins will know how to checkout your code and then you can have it run your script.
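For example, the nightly script Jenkins invokes could be as simple as the following sketch (the paths, file names, and exact commands are assumptions): it recreates a clean virtualenv, installs the requirements, and runs the Django test suite, failing the build if any step fails.

```python
# build.py - hypothetical nightly build script run by Jenkins after checkout
import subprocess


def sh(cmd):
    print("+ " + cmd)
    subprocess.check_call(cmd, shell=True)   # raises (and fails the build) on error


if __name__ == "__main__":
    sh("virtualenv --clear env")
    sh("env/bin/pip install -r requirements.txt")
    sh("env/bin/python manage.py test")
```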
There are literally hundreds of different tools to do this. You can write Python scripts to be run from cron, you can write shell scripts, or you can use one of the hundreds of build tools.
Most Python/Django shops would likely recommend Fabric. This really is a matter of running through everything that needs to be done, making sure you understand it, and scripting it. Do you need to run a test suite before you deploy, to ensure nothing breaks? Do you need to run South database migrations? You really need to think about what needs to be done, and then you just write a Fabric script to do those things.
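Just to give a flavour of it, a minimal Fabric (1.x-style) fabfile might look like the sketch below; the host name, paths, and task breakdown are made-up examples, not a recommended layout:

```python
# fabfile.py - hedged sketch of a test-then-deploy workflow with Fabric 1.x
from fabric.api import env, local, run, cd

env.hosts = ["web1.example.com"]    # placeholder host


def test():
    # fail fast locally before touching the server
    local("python manage.py test")


def deploy():
    test()
    with cd("/srv/mysite"):
        run("git pull origin master")
        run("env/bin/pip install -r requirements.txt")
        run("env/bin/python manage.py migrate")   # or South's migrate on older Django
        run("touch django.wsgi")                  # nudge the app server to reload
```

You would then run it with fab deploy, e.g. from cron or from a Jenkins job.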
None of this even touches on the fact that what you're asking for overall is continuous integration, which itself has a whole slew of tools to help manage it.
What you are asking for is Continuous Integration.
There are many CI tools out there, but in the end it boils down to your personal preferences (as always, hopefully) and which one just works for you.
The Django project itself uses buildbot.
If you ask me, I would recommend continuous.io, which works out of the box with Django applications.
You can manually set how many times you would like to build your Django project, which is great.
You can, of course, write a shell script which rebuilds your Django project via cron, but you deserve better than that.
I'm looking for a tool to keep track of "what's running where". We have a bunch of servers, and on each of those a bunch of projects. These projects may be running at a specific version (an hg tag/commit number) and have their requirements at specific versions as well.
Fabric looks like a great start to do the actual deployments by automating the ssh part. However, once a deployment is done there is no overview of what was done.
Before reinventing the wheel I'd like to check here on SO as well (I did my best w/ Google but could be looking for the wrong keywords). Is there any such tool already?
(In practice I'm deploying Django projects, but I'm not sure that's relevant for the question; anything that keeps track of pip/virtualenv installs or server state in general should be fine)
many thanks,
Klaas
==========
EDIT FOR TEMP. SOLUTION
==========
For now, we've chosen to simply store this information in a simple key-value store (in our case: the filesystem) that we take great care to back up (in our case: using a DVCS). We keep track of this store with the same deployment tool that we use to do the actual deploys (in our case: Fabric).
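To sketch what that looks like in practice (the field names and paths below are illustrative, not our exact layout): the fabfile writes one small record per deploy into the store, alongside the pip freeze output.

```python
# Hedged sketch of the per-deploy record written into the key-value store.
import datetime
import json
import subprocess


def record_deploy(host, project, hg_rev, store="deploy-log"):
    entry = {
        "host": host,
        "project": project,
        "revision": hg_rev,
        "requirements": subprocess.check_output(["pip", "freeze"]).decode(),
        "deployed_at": datetime.datetime.utcnow().isoformat(),
    }
    with open("%s/%s_%s.json" % (store, host, project), "w") as f:
        json.dump(entry, f, indent=2)
```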
Passwords are stored inside a TrueCrypt volume that's stored inside our key-value store.
==========
I will still gladly accept any answer when some kind of Open Source solution to this problem pops up somewhere. I might share (part of) our solution somewhere myself in the near future.
pip freeze gives you a listing of all installed packages. Bonus: if you redirect the output to a file, you can use it as part of your deployment process to install all those packages (pip install -r <that file> will install everything listed in it).
I see you're already using virtualenv. Good. You can run pip freeze -E myvirtualenv > myproject.reqs to generate a dependency file that doubles as a status report of the Python environment.
Perhaps you want something like Opscode Chef.
In their own words:
Chef works by allowing you to write recipes that describe how you want a part of your server (such as Apache, MySQL, or Hadoop) to be configured. These recipes describe a series of resources that should be in a particular state - for example, packages that should be installed, services that should be running, or files that should be written. We then make sure that each resource is properly configured, only taking corrective action when it's necessary. The result is a safe, flexible mechanism for making sure your servers are always running exactly how you want them to be.
EDIT: Note that Chef is not a Python tool; it is a general-purpose tool, written in Ruby (it seems). But it is capable of supporting various "cookbooks", including ones for installing and maintaining Python apps.