I currently have an executable that, when run, uses all the cores on my server. I want to add another server and have the jobs split between the two machines, with each job still using all the cores on the machine it is running on. If both machines are busy, the next job needs to queue until one of the two machines becomes free.
I thought this might be controlled with Python, but I am a novice and not sure which Python package would be best for this problem.
I liked the "heapq" package for queuing the jobs, but it looks designed for single-server use. I then looked into IPython.parallel, but it seems aimed at creating a separate, smaller job for every core (on either one or more servers).
I saw a huge list of different options here (https://wiki.python.org/moin/ParallelProcessing), but I could do with some guidance on which way to go for a problem like this.
Can anyone suggest a package that may help with this problem, or a different way of approaching it?
Celery does exactly what you want: it makes it easy to distribute a task queue across multiple (many) machines.
See the Celery tutorial to get started.
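As a hedged illustration of that setup (one multi-core job per machine, further jobs queuing until a worker is free), a minimal Celery sketch might look like the following; the module name, the Redis broker URL, and the executable path are assumptions for illustration, not details from the question.

```python
# tasks.py -- minimal Celery sketch (broker URL and executable path are assumptions)
import subprocess

from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0")

@app.task
def run_job(args):
    # Each task launches the multi-core executable and blocks until it finishes,
    # so a worker started with --concurrency=1 runs only one job per machine.
    subprocess.run(["/path/to/executable", *args], check=True)
```

You would then start one worker per server with celery -A tasks worker --concurrency=1 and submit jobs with run_job.delay([...]); Celery keeps any extra jobs queued on the broker until a worker becomes free.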
Alternatively, IPython has its own parallel computing library built in, based on ZeroMQ; see the introduction. I have not used it myself, but it looks pretty straightforward.
I have a page where the user selects a Python script, and then this script executes.
My issue is that some scripts take a while to execute (up to 30 minutes), so I'd like to run them in the background while the user can still navigate the website.
I tried to use Celery, but since I'm on Windows I couldn't do better than --pool=solo, which, while allowing the user to do something else, can only serve one user at a time.
I also saw this thread while searching for a solution, but I didn't manage to really understand how it worked or how to implement it, nor to determine whether it really answered my problem...
So here is my question: how can I have multiple threads/multiple processes in Celery while on Windows? Or, if there's another way, how can I execute several tasks simultaneously in the background?
Have you identified whether your slow scripts are CPU-bound or I/O-bound?
If they're I/O bound, you can use the eventlet and gevent pools, based on Strategy 1 in the blog post from distributedpython.com.
But if they're CPU bound, you may have to consider something like a dedicated Celery Windows box (or Windows Docker container) to work around Celery's billiard issue on Windows by setting the environment variable FORKED_BY_MULTIPROCESSING=1, based on Strategy 2 in the blog post from distributedpython.com (see the sketch below).
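A rough sketch of both strategies, assuming a Redis broker, a module named webapp, and that the eventlet package is installed (all assumptions, not details from the question); the environment variable has to be set before the worker starts its pool.

```python
# webapp.py -- minimal Celery app for Windows (names and broker URL are assumptions)
import os

# Strategy 2: make the default prefork pool usable on Windows.
os.environ.setdefault("FORKED_BY_MULTIPROCESSING", "1")

from celery import Celery

app = Celery("webapp", broker="redis://localhost:6379/0")

@app.task
def run_script(script_path):
    # Placeholder for executing the long-running, user-selected script.
    ...

# Strategy 1 (I/O-bound scripts):  celery -A webapp worker --pool=eventlet --concurrency=100
# Strategy 2 (CPU-bound scripts):  celery -A webapp worker --pool=prefork --concurrency=4
```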
I am pretty new to Python and to distributed systems.
I am using the ZeroMQ Ventilator-Worker-Sink configuration:
Ventilator - Worker - Sink
Everything is working fine at the moment. My problem is that I need a lot of workers, and every worker does the same work.
At the moment every worker lives in its own Python file and has its own output console.
If the program changes, I have to change (or copy) the code in every file.
The next problem is that I have to start/run every file, so it is quite annoying to start 12 files.
What are the best solutions here: threads, processes?
I should add that the goal is to run every worker on a different Raspberry Pi.
This appears to be more of a dev/ops problem. You have your worker code, which is presumably a single codebase, on multiple distributed machines or instances. You make a change to that codebase and you need the resulting code to be distributed to each instance, and then the process restarted.
To start, you should at minimum be using a source control system, like Git. With such a system you could at least go to each instance and pull the most recent commit and restart. Beyond that, you could set up a system like Ansible to go out and run those actions on each instance initiated from a single command.
There's a whole host of other tools, strategies, and services that will help you do these things in a myriad of different ways. Using Docker to create a single worker container and then distributing and running that container on your various instances is probably one of the more popular approaches, but it will require a more fundamental change to your infrastructure.
Hope this helps.
I develop Python applications for use in neuroscience and psychological research. These applications mostly present visual information and/or sounds and require input from the user (experiment subjects). Because of this, I need to solve two specific problems. First, applications frequently need to be distributed to many users with different environments and operating systems. This is a big headache for me, because the people receiving the applications aren't necessarily very "tech savvy", so I end up spending a lot of time troubleshooting small issues. Second, because these applications are required for research, I need them to be completely backwards compatible (as in, compatible 20 years later). This is because we occasionally need to re-run experiments from the past or revisit things we have done.
I've been playing around with Docker lately, and I'm feeling like it might be the answer to my problems (and maybe to those of a lot of academics). If I could containerize my applications, with environments set up with specific versions of specific packages, I could send them to anyone (who could run them from the container) and re-run things from the past in their original containers.
I feel like I'm getting conflicting information about the utility of Docker for non-web (desktop) applications. Is there any reason this wouldn't work? I often collect time-sensitive input (like reaction time): would running applications inside Docker (and somehow sharing the screen) substantially alter reaction-time data? Would I lose the millisecond precision I am going for? Is this not really what Docker is intended for?
I am using APScheduler and WMI to create and install new Python-based Windows services, where the service determines the type of job to be run. The services are installed across all the machines on the same network. Given this scenario, I want to make sure these services run on only one machine, not on all of them.
If a machine goes down, I still want the job to be run from another machine on the same network. How would I accomplish this?
I know I need some kind of synchronization across machines, but I'm not sure how to approach it.
I tried to include functionality like this in APScheduler 2.0, but it didn't pan out. The biggest issue is handling concurrent access to jobs and making sure jobs get run even if a particular node crashes. The nodes also need to communicate somehow.
Are you sure you don't want to use Celery instead?
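If Celery is off the table, one common pattern for the "run on only one machine, fail over if it dies" requirement is a shared lock with a timeout. A sketch of the idea using Redis follows; the Redis host, key name, and TTL are assumptions for illustration, not something APScheduler provides.

```python
import socket

import redis  # assumption: a Redis server that every machine can reach

r = redis.Redis(host="shared-redis-host", port=6379)

def run_if_lock_acquired(job, lock_key="job-lock:scheduled-task", ttl=300):
    # SET ... NX acquires the lock only if no other machine currently holds it;
    # the TTL lets another machine take over if the holder crashes before releasing it.
    if r.set(lock_key, socket.gethostname(), nx=True, ex=ttl):
        try:
            job()
        finally:
            r.delete(lock_key)
```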
Even though Python and Ruby have one kernel thread per interpreter thread, they have a global interpreter lock (GIL) that protects potentially shared data structures, and this inhibits multi-processor execution. Even though the portions of those languages that are written in C or C++ can be free-threaded, that's not possible with pure interpreted code unless you use multiple processes. What's the best way to achieve this? Using FastCGI? Creating a cluster or a farm of virtualized servers? Using their Java equivalents, JRuby and Jython?
I'm not totally sure which problem you want to solve, but if you deploy your Python/Django application via the Apache prefork MPM using mod_python, Apache will start several worker processes to handle different requests.
If a single request needs so many resources that you want to use multiple cores, have a look at pyprocessing. But I don't think that would be wise.
The 'standard' way to do this with Rails is to run a "pack" of Mongrel instances (i.e. four copies of the Rails application) and then use Apache or nginx or some other piece of software in front of them to act as a load balancer.
This is probably how it's done with other Ruby frameworks such as Merb, but I haven't used those personally.
The OS will take care of running each Mongrel on its own CPU.
If you install mod_rails, a.k.a. Phusion Passenger, it will start and stop multiple copies of the Rails process for you as well, so it ends up spreading the load across multiple CPUs/cores in a similar way.
Use an interface that runs each request in a separate interpreter, such as mod_wsgi for Python. This lets multi-threading be used without running into the GIL.
EDIT: Apparently, mod_wsgi no longer supports multiple interpreters per process, because too many extension modules were never written to handle sub-interpreters properly. It still supports running requests in separate processes, FastCGI-style, so that's apparently the current accepted solution.
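For reference, the entry point mod_wsgi loads is just a plain WSGI callable; a minimal sketch is below (the daemon-mode directive in the comment, with made-up names and numbers, is the part that spreads requests across separate processes).

```python
# wsgi.py -- "application" is the callable name mod_wsgi looks for by default.
# In the Apache config, something like:
#   WSGIDaemonProcess myapp processes=4 threads=1
# runs requests in separate daemon processes, so the GIL in one process
# does not block requests handled by the others.
def application(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello from a separate worker process\n"]
```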
In Python and Ruby, the only way to use multiple cores is to spawn new (heavyweight) processes.
The Java counterparts inherit the capabilities of the Java platform: you can simply use Java threads. That is, for example, one reason why Java application servers like GlassFish are sometimes (often) used for Ruby on Rails applications.
For Python, the PyProcessing project allows you to program with processes much as you would with threads. It is included in the standard library of the recently released 2.6 version as multiprocessing. The module has many features for establishing and controlling access to shared data structures (queues, pipes, etc.) and supports common idioms (e.g. managers and worker pools).
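A minimal sketch of the worker-pool idiom with multiprocessing (the function and the numbers are made up for illustration):

```python
from multiprocessing import Pool

def cpu_heavy(n):
    # Stand-in for CPU-bound work; each call runs in its own process,
    # so the GIL does not serialize the calls across cores.
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        results = pool.map(cpu_heavy, [2000000] * 8)
    print(results)
```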