Flexible task distribution in workers with Celery - python

TL;DR:
Is there a way to let the workers decide for themselves which tasks they can work on, depending on their (local) configuration and the task args/kwargs?
A quick and dirty solution I thought of would be to raise Reject() in all workers that find themselves unsuitable, but I was hoping there's a more elegant one.
Details:
The application is an (educational) programming-assignment evaluation tool, roughly comparable to continuous integration: a web application accepts source-code submissions for (previously specified) programming languages (or better: programming environments), which then need to be compiled and executed with several test cases. Especially for use in a high-performance computing course with GPUs, compiling and executing cannot happen on the host where the web application runs (and in other cases, security reasons alone rule that out).
To make this easily configurable for administrators, I'd like each worker to have a configuration file in which locally available resources, compiler types and paths, etc. are specified, and which the worker uses to decide whether or not to work on a given task.
Simply using different queues with a custom Router does not seem appealing to me, because the number and configuration of queues could vary at runtime, and it would look a little messy, I think.
Is there an elegant way to achieve something like that? To be honest, the documentation on Extensions and Bootsteps didn't give me much guidance on this.
Thanks in advance for any tips and pointers.
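For reference, the quick-and-dirty Reject() idea from the TL;DR might look roughly like the sketch below. Everything specific here (worker_config, the broker URL, the task signature) is made up for illustration; Reject() needs acks_late=True so the message goes back to the broker.

```python
# tasks.py -- a sketch of the Reject() approach, not a recommendation.
from celery import Celery
from celery.exceptions import Reject

app = Celery("grader", broker="amqp://localhost")  # hypothetical broker URL

# Hypothetical: each worker host loads its local capabilities from its own config file.
worker_config = {"languages": {"c", "cuda"}, "gpus": 1}

@app.task(bind=True, acks_late=True)  # late ack so a rejected message can be redelivered
def evaluate_submission(self, submission_id, language, needs_gpu=False):
    unsuitable = (language not in worker_config["languages"]
                  or (needs_gpu and worker_config["gpus"] == 0))
    if unsuitable:
        # Push the task back to the broker so another (suitable) worker can pick it up.
        raise Reject("worker cannot handle %s" % language, requeue=True)
    ...  # compile, run the test cases, report results
```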

Related

Is it convenient to use Airflow Architecture to orchestrate Complex Python Microservices?

I have a data product that receives real-time streams of vehicle data, processes the information and then exposes it through REST APIs. I am looking to rebuild the ETL side of the system in order to improve its reliability and architectural order. I am considering Apache Airflow, but have some doubts.
The Python microservices I need to orchestrate are complex and have many different dependencies, so a monolithic solution would be huge and tricky if implemented with Python operators in Airflow. Do you see any convenient way of using Airflow for these needs? Might a messaging system such as Kafka be a better solution?
Thanks
Airflow is definitely used to orchestrate ETL systems; however, if you require real-time ETL, a message queue is probably better.
That said, you can trigger DAGs externally if that suits your use case. Depending on how distributed you want the setup to be, you can also run multiple instances to parallelize and speed up your operations.
If you think batch processing will be helpful, you can have a pseudo real-time setup with a very fast schedule. This will (in my opinion) be easier to debug than a traditional message-queue system.
Airflow gives you a lot of modularity, which lets you break down a complex architecture and debug components more easily than with a messaging queue. It also has provisions for automatic requeueing, and it supports named queues, which you can use to send large workloads to a separate processing queue on a machine with more resources.
I have worked on a use case similar to yours, where we needed to process real-time data and display it via APIs, and we used a mixture of Airflow and messaging queues. In my experience it was always a lot easier to figure out where and why something went wrong with Airflow, thanks to the Airflow UI, which makes getting a high-level overview incredibly efficient.
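To make the external triggering concrete, here is a minimal sketch that starts a DAG run through Airflow's stable REST API. It assumes Airflow 2.x with an API auth backend enabled; the DAG id, credentials, and conf payload are placeholders.

```python
# trigger_dag.py -- illustrative only; adjust URL, auth, and DAG id to your deployment.
import requests

AIRFLOW_API = "http://localhost:8080/api/v1"
DAG_ID = "vehicle_stream_etl"  # hypothetical DAG

response = requests.post(
    f"{AIRFLOW_API}/dags/{DAG_ID}/dagRuns",
    auth=("admin", "admin"),                                   # basic-auth backend assumed
    json={"conf": {"window_start": "2024-01-01T00:00:00Z"}},   # handed to the DAG run
)
response.raise_for_status()
print(response.json()["dag_run_id"])
```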

worker queues vs. continuous deployment frameworks

There are many CI/CD solutions out there: http://www.devopsbookmarks.com/ci. However, looking at some Buildbot examples, the snippets of Python code seem very similar to those you would write for, say, RQ workers.
RQ seems fairly simple while Buildbot seems quite complex. Are the additional features of a full-blown CI/CD solution like Buildbot really worth it when it's possible to create queues and workers with a much simpler (yet not as fully featured) system like RQ?
In other words, what's the best way to frame the tradeoffs between CI/CD frameworks and worker queues?
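To make the "much simpler" side of the comparison concrete, a minimal RQ setup really is just a couple of files. This is a sketch: run_build, the repo URL, and the make target are invented, and a separate `rq worker` process (plus a local Redis) is assumed.

```python
# build_tasks.py -- the job function must live in a module the worker can import.
import subprocess

def run_build(repo_url):
    # Clone the repo and run its test suite; RQ records the result or the traceback.
    subprocess.run(["git", "clone", repo_url, "/tmp/build"], check=True)
    subprocess.run(["make", "-C", "/tmp/build", "test"], check=True)

# enqueue.py -- called from wherever a build should be kicked off.
from redis import Redis
from rq import Queue

from build_tasks import run_build

queue = Queue(connection=Redis())
job = queue.enqueue(run_build, "https://example.org/repo.git")  # hypothetical repo URL
print(job.id, job.get_status())
```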
We use Jenkins CI, and the bonuses you get with these larger frameworks are:
web interface not only for task definition, but also to review the results
wider set of plugins for different types of tasks
visualisation of test results
notification of users by e-mail
We were considering doing most of the tasks Jenkins CI does for us (running tests on data) by other means (such as AWS Lambda), but the visual interface is the main argument for staying with Jenkins, as it allows our users to see the results without us having to build those things ourselves.

Managing runs of on-server scripts

The title is a bit fuzzy because I don't know the right vocabulary.
Here's what I am trying to do: I have a script/program on the server that runs checks. Now my co-workers want to be able to start this script from a website and view its logs there. The checks can run for quite a long time, usually more than a few hours.
For that, I gather, I'd have to monitor the processes from the website script and show their logs. The chosen language would be either PHP or Python.
I'd very much like a hint or a view on how such a thing is generally done and what the best practices are, as I'm unsure how to start with this one. A reliable way to start and monitor the processes would be especially welcome.
If you choose Python, check out Celery (although it may be a bit of an overkill if you want to keep things simple). It allows you to run asynchronous tasks, and you can easily monitor them. There is also a Django integration for Celery (django-celery) that includes a web monitor for the tasks.
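A minimal sketch of that approach (the broker URL, task name, and result handling are assumptions, not the one true setup):

```python
# checks.py -- define the long-running check as a Celery task.
from celery import Celery

app = Celery("checks",
             broker="redis://localhost:6379/0",
             backend="redis://localhost:6379/0")

@app.task(bind=True)
def run_checks(self, target):
    # Hours-long work goes here; write log lines to a file the website can tail,
    # or use self.update_state() to expose coarse progress.
    ...
    return "finished"

# In the web layer (Python side) you would do something like:
#   result = run_checks.delay("nightly-data-check")
#   result.id     -> store it, look the task up again later
#   result.state  -> "PENDING", "SUCCESS", "FAILURE", ...
```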

Maximally simple django timed/scheduled tasks (e.g.: reminders)?

Question is relevant to this and this;
the difference is, I'd prefer something with possibly more precision and lower load (a per-minute cron job isn't preferable for that) and with minimal overhead (i.e. installing Celery with RabbitMQ seems like massive overkill).
An example of such a task is a personal reminders server (with reminders that can be edited over the web and sent out through e-mail or XMPP).
I'm probably looking for something more like node.js's setTimeout but for django (and though I might prefer to implement reminders in node.js anyway, it's still a possibly interesting question).
For example, it's possible to start new threads in a Django app (with functions consisting of sleep() and send()); in what ways can this be bad?
The problem with using threads for this is the typical set of problems with Python threads that always drives people towards multi-process solutions instead. It is compounded here by the fact that your thread isn't driven by the normal request-response cycle. This is summarized nicely by Malcolm Tredinnick here:
Have to disagree. Threads are not a good solution to this problem. The issue is process management. As written, your threads will never be rejoined. Webserver processes have a lifecycle uncontrollable by you (the MaxRequestsPerChild Apache parameter and similar things in other servers) and you are messing with that by using threads.
If you need a process with a lifecycle that is not matched by the request-response path — something long running and independent of the response — a completely separate process is definitely the right model to use. Using a thread is tying it to the response lifecycle, which will have unintended side-effects.
A possible solution for you might be to have a long-running process that performs your tasks and gets a wake-up signal from a light cron job.
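A bare-bones sketch of that separate-process model, assuming a hypothetical Reminder model with due_at/sent fields and a send() method (here the cron wake-up is replaced by a short polling sleep):

```python
# reminder_daemon.py -- run as its own process, outside the webserver lifecycle.
import os
import time

import django

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "myproject.settings")  # hypothetical project
django.setup()

from django.utils import timezone
from reminders.models import Reminder  # hypothetical app and model

POLL_SECONDS = 10  # sub-minute resolution without a per-minute cron job

while True:
    due = Reminder.objects.filter(sent=False, due_at__lte=timezone.now())
    for reminder in due:
        reminder.send()  # e-mail or XMPP delivery lives on the model in this sketch
        reminder.sent = True
        reminder.save(update_fields=["sent"])
    time.sleep(POLL_SECONDS)
```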
Another possibility would be to build something using 0mq, which is much lighter than AMQP-style queues (at the cost of some features, of course). Tarek Ziade is working on a Mozilla project called powerhose that uses 0mq; it looks super simple and has a heartbeat capability with resolution to the second.

having to run multiple instances of a web service for ruby/python seems like a hack to me

Is it just me or is having to run multiple instances of a web server to scale a hack?
Am I wrong in this?
Clarification
I am referring to the way I've read that people run multiple instances of a web service on a single server. I am not talking about a cluster of servers.
Not really: people were running multiple frontends across a cluster of servers before multicore CPUs became widespread.
So all the infrastructure for supporting sessions properly across multiple frontends had been around for quite some time before it became really advantageous to run a bunch of threads on one machine.
In fact, using asynchronous-style frontends gives better performance on the same hardware than a multithreaded approach, so I would say that not running multiple instances in favour of a multithreaded monster is the real hack.
Since we are now moving towards more cores rather than faster processors, in order to scale more and more you will need to run more instances.
So yes, I reckon you are wrong.
This does not by any means condone brain-dead programming with the excuse that you can just scale it horizontally; that approach just seems misguided.
With no details, it is very difficult to see what you are getting at. That being said, it is quite possible that you are simply not using the right approach for your problem.
Sometimes multiple separate instances are better. Sometimes your Python services are actually better deployed behind a single Apache instance (using mod_wsgi), which may elect to use more than a single process. I don't know enough about Ruby to offer an opinion there.
In short, if you want to make your service scalable, then the way to do so depends heavily on additional details. Is it scaling up or scaling out? What are the operating system and the available (or installable) server software? Is the service itself easily parallelized, and how much does it depend on the database? How is the database deployed?
Even if the Ruby/Python interpreters were perfect and could use all available CPU from a single process, you would still reach the maximum capacity of a single server sooner or later and have to scale across several machines, which brings you back to running several instances of your app.
I would hesitate to say that the issue is a "hack". Or indeed that threaded solutions are necessarily superior.
The situation is a result of design decisions used in the interpreters of languages like Ruby and Python.
I work with Ruby, so the details may be different for other languages.
BUT ... essentially, Ruby uses a Global Interpreter Lock to prevent threading issues:
http://en.wikipedia.org/wiki/Global_Interpreter_Lock
The side-effect of this is that, to achieve concurrency with frameworks like Rails, rather than relying on multiple threads within the VM, we use multiple processes, each with its own interpreter and its own instance of your framework and application code.
Each instance of the app handles a single request at a time. To achieve concurrency we have to spin up multiple instances.
In the olden days (2-3 years ago) we would run multiple Mongrel (or similar) instances behind a proxy (generally Apache). Passenger changed some of this, because it is smart enough to manage the processes itself rather than requiring manual setup. You tell Passenger how many processes it can use and off it goes.
The whole structure is actually not as bad as the thread-orthodoxy would have you believe. For a start, it's pretty easy to make this type of architecture work in a multicore environment. Any modern database is designed to handle highly concurrent loads, so having multiple processes has very little if any effect at that level.
If you use a language like JRuby you can deploy into a threaded app server like Tomcat and have a deployment that looks much more "java-like". However, this is not as big a win as you might think, because now your application needs to be much more thread-aware and you can see side effects and strangeness from threading issues.
Your assumption that Tomcat's and IIS's single-process-per-server model is superior is flawed. The choice between a multi-threaded server and a multi-process server depends on a lot of variables.
One major factor is the underlying operating system. Unix systems have always had great support for multi-processing because of the copy-on-write nature of the fork system call. This makes multiple processes a really attractive option, because web serving is usually very shared-nothing and you don't have to worry about locking. Windows, on the other hand, has much heavier processes and lighter threads, so programs like IIS gravitate towards a multi-threading model.
As for whether it's a hack to run multiple servers, that really depends on your perspective. If you look at Apache, it comes with a variety of pluggable engines to choose from. The MPM-prefork one is the default because it allows the programmer to easily use non-thread-safe C/Perl/database libraries without having to throw locks and semaphores all over the place. To some, that might be a hack to work around poorly implemented libraries. To me it's a brilliant way of leaving it to the OS to handle the problems and letting me get back to work.
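As a purely illustrative sketch of that preforked, process-per-request model (Unix only; a real server would add error handling, graceful shutdown, and proper HTTP parsing):

```python
# prefork_sketch.py -- N forked workers share one listening socket.
import os
import socket

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
listener.bind(("127.0.0.1", 8000))
listener.listen(64)

NUM_WORKERS = 4  # roughly one per core

children = []
for _ in range(NUM_WORKERS):
    pid = os.fork()               # copy-on-write child inherits the listening socket
    if pid == 0:
        while True:               # each worker handles one connection at a time
            conn, _addr = listener.accept()
            conn.sendall(b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok")
            conn.close()
    children.append(pid)

for pid in children:              # parent just supervises
    os.waitpid(pid, 0)
```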
A multi-process model also comes with a few features that would be very difficult to implement in a multi-threaded server. Because they are just processes, zero-downtime rolling updates are trivial; you can do them with a bash script.
It also has its shortcomings. In a single-process model, setting up a singleton that holds some global state is trivial, while in a multi-process model you have to serialize that state to a database or a Redis server. (Of course, if your single-process server outgrows a single machine you'll have to do that anyway.)
Is it a hack? Yes and no. Both original implementations (MRI and CPython) have Global Interpreter Locks that prevent a multi-core server from operating at its full potential. On the other hand, multi-process has its advantages (especially on the Unix side of the fence).
There's also nothing inherent in the languages themselves that makes them require a GIL, so you can run your application with Jython, JRuby, IronPython or IronRuby if you really want to share state inside a single process.
