I have a very long-running Python script that cannot be parallelized, so it is single-threaded and runs as a single process.
This job runs for several days on my own computer.
It does not benefit from any GPU support.
For analysis and parameter optimization, I expect to run this job several times, perhaps 10-20 times, each time with different parameters.
As my own computer's resources are limited, I would like to use a powerful cloud CPU for this task.
If the cloud CPUs turn out to be much faster than my own CPU, I will probably migrate the job from e.g. AWS EC2 (Amazon Web Services) to a cheaper flat-rate solution like Hetzner.
Given this use case: does it make sense to put my setup in a Docker container?
Or does this task not justify the engineering effort and the learning curve of Docker / Docker Compose etc.?
Well, you certainly don't need Docker for this, for a few reasons I will list here:
Docker is justified when you want an encapsulated environment, to obtain security and, above all, controlled access between container processes.
Another common attraction of Docker is the continuous integration/deployment side of container development: Docker containers are well suited to scaling with Kubernetes or to easy deployment with Jenkins, for example.
You can read more about it here: https://www.linode.com/docs/applications/containers/when-and-why-to-use-docker/
Now, since your application needs none of that, Docker is not the way to go. One more suggestion: if you need to run the job multiple times with only the parameters differing between executions, it is well worth launching several of those runs in parallel (one process per parameter set) so you actually benefit from a powerful multi-core CPU; see the sketch below.
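For example, a minimal sketch of such a parameter sweep, assuming the job can be started as a command-line program (the script name, the --alpha option and its values are placeholders, not your actual parameters):

    # Sketch: launch several independent runs of the single-threaded job,
    # one OS process per parameter set, so a multi-core cloud CPU is kept busy.
    import subprocess

    parameter_sets = ["0.1", "0.5", "1.0", "2.0"]  # one value per run (placeholders)

    processes = [
        subprocess.Popen(["python", "long_job.py", "--alpha", value])
        for value in parameter_sets
    ]

    # Wait for every run to finish and report its exit code.
    for proc, value in zip(processes, parameter_sets):
        print(f"run with --alpha {value} finished with exit code {proc.wait()}")

With 10-20 parameter sets you would probably cap the number of simultaneous runs at the number of physical cores, e.g. by launching them in batches.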
Docker will make it much easier for you to run your application in the cloud, in the sense that you will be able to switch machines much more easily. It will also make it cheaper to run, because you won't have to spend a lot of time spinning up your VMs, and you can spin VMs down cheaply and easily, knowing that bootstrapping the program takes only a docker run and no machine-specific python or yum install steps.
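To make that single docker run convenient, the script can take its parameters on the command line, so each cloud run is one command; a minimal sketch (the image name, option names and defaults below are placeholders, not your actual ones):

    # Sketch of a parameter-driven entry point, so each run on a fresh VM is just
    #   docker run my-image --alpha 0.5 --iterations 100000
    import argparse

    def main():
        parser = argparse.ArgumentParser(description="long-running single-threaded job")
        parser.add_argument("--alpha", type=float, required=True)
        parser.add_argument("--iterations", type=int, default=100_000)
        args = parser.parse_args()
        # ... the existing computation goes here, driven by args.alpha etc. ...

    if __name__ == "__main__":
        main()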
We developed a bioinformatics protocol and would like to dockerize it to make it easier for others to use. It consists of two pieces of software and several Python scripts to prepare and parse data. Locally, we run these modules on a cluster as several dependent jobs (some requiring a lot of resources) with a wrapper script, like this:
1. parsing the input data (Python)
2. running some 10-100 jobs on a cluster (each taking a piece of the output of step 1), where each step's jobs depend on the previous step finishing:
   a) a compiled C++ program run on each piece from step 1
   b) a parsing Python script run on each piece from 2a
   c) another, resource-intensive compiled C++ program, which uses mpirun to distribute all the output of 2b
3. finalizing the results (Python script) on all results from step 2
The dockerized version does not necessarily need to be organized the same way, but at least step 2c has to be distributed with mpirun, because users will run it on a cluster.
How could I organize this? Should I have X different containers in a workflow? Is there any other possible solution that does not involve multiple containers?
Thank you!
PS: I hope I described it clearly enough, but I can clarify further if needed.
I think that in your project it is important to distinguish between Docker images and Docker containers.
In a Docker image you package your code and the dependencies it needs to work. The first question is whether all your code should be in the same image: you have Python scripts and C++ programs, so it could make sense to have several images, each capable of running one job of your process.
A Docker container is a running instance of a Docker image. So if you decide to have several images, you will have several containers running during your process. If you decide to have only one image, you can run everything in one container by running your wrapper script inside it, or you could have a new wrapper script that instantiates a container for each step. The latter could be interesting, as you seem to use different hardware depending on the step; a sketch is given below.
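To illustrate that second option, here is a rough sketch of a wrapper that runs each step in its own container, with the steps sharing a mounted data directory (all image names, paths and commands are placeholders, not your actual tools):

    # Sketch: a wrapper that runs each pipeline step in its own container.
    # Image names, paths and commands are placeholders for illustration only.
    import subprocess

    def run_step(image, command):
        """Run one pipeline step in a fresh container sharing a host data directory."""
        subprocess.run(
            ["docker", "run", "--rm",
             "-v", "/shared/pipeline-data:/data",  # host directory mounted into the container
             image] + command,
            check=True,  # stop the pipeline if a step fails
        )

    # step 1: parse the input data
    run_step("myproto/parser", ["python", "parse_input.py", "/data/raw"])

    # steps 2a/2b: process each piece produced by step 1
    for piece in range(100):  # placeholder for the real number of pieces
        run_step("myproto/cpp-tool", ["cpp_tool", f"/data/piece_{piece}"])
        run_step("myproto/parser", ["python", "parse_piece.py", f"/data/piece_{piece}"])

    # step 2c would instead be launched with mpirun on the cluster

    # step 3: finalize the results
    run_step("myproto/finalizer", ["python", "finalize.py", "/data"])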
I can't give specifics about mpirun, as I'm not familiar with it.
So far, for web scraping projects, I've used GAppsScript, which means I can easily trigger the script to run once a day.
Is there an equivalent service for Python scripts? I have a Raspberry Pi, so I guess I could keep it on 24/7 and use cron jobs to trigger the script daily. But that seems rather wasteful, since I'm talking about a few small scripts that take only a few seconds to run.
Is there any service that lets me trigger a Python script once a day, without needing to keep a local machine on 24/7? The simpler the solution the better; I wouldn't want to over-engineer such a basic use case if a ready-made system already exists.
The only service I've found so far that does this is WayScript, and here's a Python example running in the cloud. The free tier should be enough for most simple/hobby-tier use cases.
My situation is as follows. We have:
an "image" with all our dependencies in terms of software + our in-house python package
a "pod" in which such image is loaded on command (kubernetes pod)
a python project which has some uncommitted code of its own which leverages the in-house package
Please also assume that you cannot work on the machine (or cluster) directly (say, via a remote SSH interpreter): the cluster is multi-tenant and we want to utilize it as much as possible, so no idle time between trials.
Also, for now, forget about security issues; everything is locked down on our side, so there are no issues there.
Essentially, we want to "distribute" the workload remotely, i.e. a script.py that is unfeasible to run on our local machines, without being constrained by git commits, so we can do it "on the fly". This is necessary because all of the changes are experimental in nature (think ETL/pipeline kinds of analysis): we want to be able to experiment at scale, but without any bounds imposed by git.
I tried dill, but I could not manage to make it work (probably due to the structure of the code). Ideally, I would like to replicate the concept MLeap applied to ML pipelines for Spark, but on a much smaller scale: basically packaging, with little to no constraints.
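For reference, the kind of dill round trip I attempted looks roughly like this (a simplified sketch; the function, file names and logic are placeholders):

    # Sketch of the dill round trip: serialize a locally defined function,
    # copy the file to the pod, and execute it there. Names are placeholders.
    import dill

    # --- on the local machine ---
    def experiment(df):
        # experimental, uncommitted transformation logic
        return df.dropna()

    with open("experiment.pkl", "wb") as f:
        dill.dump(experiment, f)

    # --- on the pod, after copying experiment.pkl over ---
    with open("experiment.pkl", "rb") as f:
        experiment = dill.load(f)
    # result = experiment(some_dataframe)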
What would be the preferred route for this use case?
My intention is to run Python code 24/7 over several months to collect data through API calls and alert me if certain conditions are met.
How can I do that without keeping the code running on my laptop 24/7? Is there a way of doing it in "the cloud"?
Preferably free but would consider paying. Simplicity also a plus.
There are plenty of free options. You can use Heroku servers: they have strong documentation for Python and a free tier (of course with limits on memory and usage time). Heroku is also good because it offers a lot of add-ons (databases, dashboards, loggers, etc.). To deploy your app on Heroku you need some experience with git (but not that much, don't worry).
Another possibility is Google Cloud Platform, which should also be easy to use. Whichever you pick, the code itself can stay as simple as the worker loop sketched below.
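A minimal sketch of that kind of long-running worker, assuming a hypothetical API endpoint and alert condition (the URL, the condition and the alerting mechanism are placeholders):

    # Sketch of a 24/7 worker: poll an API, keep the data, alert on a condition.
    # The URL, the condition and the alerting mechanism are placeholders.
    import json
    import time
    import urllib.request

    API_URL = "https://api.example.com/data"  # placeholder endpoint

    def check_once():
        with urllib.request.urlopen(API_URL, timeout=30) as response:
            payload = json.load(response)
        # persist payload somewhere (file, database, ...) here
        if payload.get("value", 0) > 100:  # placeholder condition
            print("ALERT: condition met", payload)  # replace with email/Slack/etc.

    while True:
        try:
            check_once()
        except Exception as exc:
            print("poll failed:", exc)  # keep the loop alive on transient errors
        time.sleep(300)  # poll every 5 minutes

On Heroku this kind of loop would typically run as a worker process declared in the Procfile.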
I have a highly threaded application running on Amazon EC2. I would like to convert this application to a cluster on EC2, and I would like to use StarCluster for this, as it makes the cluster easy to manage.
However, I am new to cluster/distributed computing. After googling, I found the following list of Python libraries for cluster computing:
http://wiki.python.org/moin/ParallelProcessing (look at the cluster computing section)
I would like to know whether all of these libraries will work with StarCluster. Is there anything I need to keep in mind, such as a dependency, when choosing a library, given that I want the application to work with StarCluster?
Basically, StarCluster is a tool to help you manage your cluster. It can add/remove nodes, place them in a placement group and security group, register them with the Open Grid Scheduler, and more. You can also easily create commands and plugins to help you in your work (a sketch of a plugin is given at the end of this answer).
How were you intending to use StarCluster?
If it's as a watcher to load balance your cluster then there shouldn't be any problems.
If it's as an actor (making it do the computation directly, by launching it with a command you craft yourself and parallelizing its execution across the cluster), then I don't know. It might be possible, but StarCluster was not designed for that. We can read on the website:
StarCluster has been designed to simplify the process of building, configuring, and managing clusters of virtual machines on Amazon’s EC2 cloud.
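To give an idea of the plugin mechanism mentioned above, here is a rough sketch of a StarCluster plugin that installs a Python library on every node. The class and method names are based on my reading of the StarCluster documentation; double-check them against the docs for your version.

    # Rough sketch of a StarCluster plugin; verify the interface against the
    # StarCluster documentation, as the details here may not be exact.
    from starcluster.clustersetup import ClusterSetup

    class InstallLibraries(ClusterSetup):
        """Install the parallel-computing library the application needs on each node."""

        def run(self, nodes, master, user, user_shell, volumes):
            for node in nodes:
                # node.ssh provides a command channel to that node
                node.ssh.execute("pip install joblib")

The plugin is then referenced from a cluster template in the StarCluster configuration file, so it runs whenever nodes are brought up.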