Use a multiprocessing.Queue in luigi - python

I have a set of tasks in luigi which all need to access a database. I can have up to 8 tasks accessing my database at the same time provided they are on different ports (I have the list of allowed ports).
How should I best implement this restriction? It seems similar to the standard restriction on the number of workers, i.e. in my case a task should run when a worker is free AND a database port is free.
I tried creating a multiprocessing.Queue() in __main__ and passing it to the WrapperTask, which receives it as a luigi.Parameter(), but this produces a warning and then hangs:
UserWarning: Parameter "queue" with value "<multiprocessing.queues.Queue object at 0x00000000149E4518>" is not of type string.
warnings.warn('Parameter "{}" with value "{}" is not of type string.'.format(param_name, param_value))
The idea was that a .get() call would block a task while the queue is empty and let it continue once another task calls .put(port) again.
What is going wrong here? Or am I taking the completely wrong approach to managing the resource in luigi?

You should use the "resources" section of the Luigi configuration. It ensures that no more than the configured number of tasks use a given global resource at the same time. Find more here: https://luigi.readthedocs.io/en/stable/configuration.html#resources.
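For your case, a minimal sketch could look like the following (the resource name db_ports and the task below are illustrative assumptions; the [resources] config section and the per-task resources attribute are the Luigi features doing the work):

# luigi.cfg
[resources]
db_ports = 8

# tasks.py
import luigi

class LoadTable(luigi.Task):  # hypothetical task name
    # Each running instance claims one of the 8 db_ports slots; the scheduler
    # will not start a ninth instance until a slot is released.
    resources = {"db_ports": 1}

    def run(self):
        ...  # talk to the database here

    def output(self):
        return luigi.LocalTarget("load_table.done")

Note that resources only limit how many tasks run concurrently; picking which concrete port each task uses is still up to the task itself.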

Related

How to prevent the same task from being executed multiple times by celery?

I'm implementing a cache server that uses a celery task to update the cache in the background. There is only one task, which I call with different arguments (cache keys).
Since, after connecting this server to my main production server, it will receive tens of requests per second for the same cache key, I want to make sure there is never more than one update task with the same cache key inside the celery queue (it should work as a queue and a set at the same time).
I thought of using a redis set to ensure that before running the task, but I'm looking for a better way.
There is only one way: implement your own lock mechanism.
The official documentation has a nice example page.
The only limit is your imagination.
Hope this helps.
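For illustration, a minimal sketch of such a lock around the task using Redis (the task, key prefix, timeout and refresh_cache_entry helper are assumptions, not taken from the question; the pattern follows the "ensure a task is only executed one at a time" recipe from the Celery docs):

import redis
from celery import shared_task

r = redis.Redis()  # assumed local Redis instance

@shared_task
def update_cache(cache_key):
    lock_key = "lock:update_cache:{}".format(cache_key)
    # SET with nx=True succeeds only if the key does not exist yet, so only one
    # worker acquires the lock per cache key; ex=60 keeps a crashed worker from
    # leaking the lock forever.
    if not r.set(lock_key, "1", nx=True, ex=60):
        return  # another worker is already refreshing this key
    try:
        refresh_cache_entry(cache_key)  # hypothetical function doing the real work
    finally:
        r.delete(lock_key)

This deduplicates work at execution time; if you also want to avoid enqueuing duplicates in the first place, the same SET-nx check can be done before calling .delay().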

Python multiprocessing - function-like communication between two processes

I've got the following problem:
I have two different classes; let's call them the interface and worker. The interface is supposed to accept requests from outside and multiplex them to several workers.
Contrary to almost every example I have found, I have several peculiarities:
The workers are not supposed to be recreated for every request.
The workers are different; a request for workers[0] cannot be answered by workers[1]. This multiplexing is done in interface.
I have a number of function-like calls which are difficult to model via events or simple queues.
There are a few different requests, which would make one queue per request difficult.
For example, assume that each worker is storing a single integer number (let's say the number of calls this worker received). In non-parallel processing, I'd use something like this:
class interface(object):
    workers = None  # set somewhere else

    def get_worker_calls(self, worker_id):
        return self.workers[worker_id].get_calls()

class worker(object):
    calls = 0

    def get_calls(self):
        self.calls += 1
        return self.calls
This, obviously, doesn't work. What does?
Or, maybe more relevantly, I don't have experience with multiprocessing. Is there a design paradigm I'm missing that would easily solve the above?
Thanks!
For reference, I have considered several approaches, and I was unable to find a good one:
Use one request and answer queue. I've discarded this idea since it would either block the interface for the answer-time of the current worker (making it scale badly), or require me to send extra information around.
Use one request queue. Each message contains a pipe to return the answer to that request. After fixing the issue of being unable to send pipes via pipes, I ran into problems with the pipe closing unless both ends are sent over the connection.
Use one request queue. Each message contains a queue to return the answer to that request. This fails since I cannot send queues via queues, and the reduction trick doesn't work.
The above also applies to the respective Manager-generated objects.
Multiprocessing means you have two or more separate processes running. There is no way to directly access memory from one process in another (as you can with multithreading).
Your best shot is to use some kind of external queue mechanism; you can start with Celery or RQ. RQ is simpler, but Celery has built-in monitoring.
But you have to know that this only works if Celery/RQ are able to "pack" the needed functions/classes and send them to the other process. Therefore you have to use top-level functions (defined at the top of the file, not belonging to any class).
You can always implement it yourself; Redis is very simple, and ZeroMQ and RabbitMQ are also good.
The Beaver library is a good example of how to deal with multiprocessing in Python using a ZeroMQ queue.
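As an illustration of the queue-based, function-call-like pattern, here is a minimal sketch using pyzmq (the port, message format and single worker are assumptions for the example, not part of the question):

import zmq
from multiprocessing import Process

def worker(port):
    # Each worker binds its own REP socket; the interface addresses a specific
    # worker by connecting to that worker's port, so workers stay distinct.
    context = zmq.Context()
    socket = context.socket(zmq.REP)
    socket.bind("tcp://127.0.0.1:{}".format(port))
    calls = 0
    while True:
        request = socket.recv_json()           # e.g. {"op": "get_calls"}
        if request.get("op") == "get_calls":
            calls += 1
            socket.send_json({"calls": calls})
        else:
            socket.send_json({"error": "unknown op"})

if __name__ == "__main__":
    port = 5555  # assumed free port
    Process(target=worker, args=(port,), daemon=True).start()

    # Interface side: a REQ socket gives function-call-like semantics
    # (send a request, block until the matching reply arrives).
    context = zmq.Context()
    socket = context.socket(zmq.REQ)
    socket.connect("tcp://127.0.0.1:{}".format(port))
    socket.send_json({"op": "get_calls"})
    print(socket.recv_json())  # {'calls': 1}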

memcache.get returns wrong object (Celery, Django)

Here is what we have currently:
we're trying to get a cached Django model instance; the cache key includes the model name and the instance id. Django's standard memcached backend is used. This procedure is part of a common routine used very widely, not only in celery.
sometimes (randomly and/or very rarely) cache.get(key) returns the wrong object: either an int or a different model instance; even a same-model-different-id case appeared. We catch this by checking that the model name & id correspond to the cache key.
the bug appears only in the context of three of our celery tasks and never reproduces in the python shell or in other celery tasks. UPD: it appears under long-running CPU/RAM-intensive tasks only
the cache stores the correct value (we checked that manually at the moment the bug appeared)
calling the same task again with the same arguments might not reproduce the issue, although the probability is much higher, so bug appearances tend to "group" in the same period of time
restarting celery solves the issue for a random period of time (minutes to weeks)
*NEW* this isn't connected with memory overflow. We always have at least 2Gb free RAM when this happens.
*NEW* we have cache_instance = cache.get_cache("cache_entry") in static code. During the investigation I found that at the moment the bug happens cache_instance.get(key) returns the wrong value, although get_cache("cache_entry").get(key) on the next line returns the correct one. This means either the bug disappears too quickly, or the cache_instance object somehow got corrupted.
Isn't the cache instance object returned by django's cache thread-safe?
*NEW* we logged a very strange case: as another wrong object from the cache, we got a model instance without its id set. This means the instance was never saved to the DB and therefore couldn't have been cached. (I hope)
*NEW* At least one MemoryError was logged these days
I know, all of this sounds like some sort of magic. Really, any ideas how this is possible or how to debug it would be very appreciated.
PS: My current assumption is that this is connected with multiprocessing: if the cache instance is created in static code before the worker processes fork, all workers end up sharing the same socket (does that sound plausible?)
Solved it finally:
Celery has a dynamic scaling feature: it can add/kill workers according to load
It does this by forking an existing worker
Open sockets and files are copied to the forked process, so both processes share them, which leads to a race condition where one process reads a response meant for the other. Simply put, it's possible that one process reads the response intended for the second one, and vice versa.
The cache object from django.core.cache (from django.core.cache import cache) stores a pre-connected memcached socket. Don't use it when your process could be dynamically forked, and don't use stored connections, pools and the like.
OR store them keyed by the current PID, and check the PID each time you access the cache
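A minimal sketch of that PID check (the get_cache helper is hypothetical; it simply drops the inherited connection whenever it notices the process has changed):

import os
from django.core.cache import caches

_owner_pid = None

def get_cache():
    # If the current PID differs from the one that last used the cache, this
    # process was forked: close the inherited socket so the backend lazily
    # opens a fresh connection owned by this process.
    global _owner_pid
    cache = caches["default"]
    if _owner_pid != os.getpid():
        cache.close()
        _owner_pid = os.getpid()
    return cache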
This has been bugging me for a while until I found this question and answer. I just want to add some things I've learnt.
You can easily reproduce this problem with a local memcached instance:
from django.core.cache import cache
import os

def write_read_test():
    pid = os.getpid()
    cache.set(pid, pid)
    for x in range(5):
        value = cache.get(pid)
        if value != pid:
            print("Unexpected response {} in process {}. Attempt {}/5".format(
                value, pid, x + 1))
    os._exit(0)

cache.set("access cache", "before fork")

# Every child inherits the memcached socket opened by the parent's cache.set()
# above, so replies can get crossed between processes.
for x in range(5):
    if os.fork() == 0:
        write_read_test()
What you can do is close the cache client as Django does in the request_finished signal:
https://github.com/django/django/blob/master/django/core/cache/__init__.py#L128
If you put a cache.close() after the fork, everything works as expected.
For celery you could connect to a signal that is fired after the worker is forked and execute cache.close().
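For example, a minimal sketch using Celery's worker_process_init signal (assuming the Django cache is the connection that needs resetting):

from celery.signals import worker_process_init
from django.core.cache import cache

@worker_process_init.connect
def reset_cache_connection(**kwargs):
    # Drop the memcached socket inherited from the parent process so this
    # worker opens its own connection on first use.
    cache.close()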
This also affects gunicorn when preload is active and the cache is initialized before forking the workers.
For gunicorn, you could use post_fork in your gunicorn configuration:
def post_fork(server, worker):
    from django.core.cache import cache
    cache.close()

Unknown queue names show on Rabbitmq mgmt. when using Celery

I only created the last 2 queue names that show in the RabbitMQ management web UI in the table below:
The rest of the table contains hash-like queue names, which I don't recognize:
1- Who created them? (I know it is celery, but which process, task, etc.)
2- Why are they created, and what are they created for?
I notice that as the number of pushed messages increases, the number of those hash-like queues increases as well.
When using celery, RabbitMQ is used as the default result backend, and also to store errors of failing tasks (those that raised exceptions).
Every new task creates a new queue on the server; with thousands of tasks the broker may be overloaded with queues, and this will affect performance negatively.
Each queue in RabbitMQ is a separate Erlang process, so if you're planning to keep many results simultaneously you may have to increase the Erlang process limit and the maximum number of file descriptors your OS allows.
Old results will not be cleaned up automatically, so we have to tell RabbitMQ to do so. The configuration line below dictates the time to live of these temporary queues (the default is 1 day):
CELERY_AMQP_TASK_RESULT_EXPIRES = <number of seconds>
OR, we can change the result backend entirely, so results are not kept in RabbitMQ at all. The relevant setting is CELERY_BACKEND, whose default is "amqp" (i.e. RabbitMQ).
We may also ignore it:
CELERY_IGNORE_RESULT = True.
Also, when ignoring the result, we can still keep the errors of failing tasks stored for later use, which means one more queue for those tasks:
CELERY_STORE_ERRORS_EVEN_IF_IGNORED = True.
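Putting the settings mentioned above together, a minimal old-style configuration sketch could look like this (values are illustrative; the setting names follow the pre-4.0 uppercase form used in this answer):

# celeryconfig.py
# Expire the per-task result queues after one hour instead of the 1-day default.
CELERY_AMQP_TASK_RESULT_EXPIRES = 3600

# Or skip storing results entirely...
CELERY_IGNORE_RESULT = True

# ...while still keeping errors from failing tasks for later inspection.
CELERY_STORE_ERRORS_EVEN_IF_IGNORED = True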
I will not mark this question as answered, waiting for a better answer.
References:
This SO link
Celery documentation
Rabbitmq documentation

Set / get objects with memcached

In a Django Python app, I launch jobs with Celery (a task manager). When each job is launched, it returns an object (let's call it an instance of class X) that lets you check on the job and retrieve the return value or errors thrown.
Several people (someday, I hope) will be able to use this web interface at the same time; therefore, several instances of class X may exist at the same time, each corresponding to a job that is queued or running in parallel. It's difficult to come up with a way to hold onto these X objects because I cannot use a global variable (a dictionary that allows me to look up each X object from a key); this is because Celery uses different processes, not just different threads, so each would modify its own copy of the global table, causing mayhem.
Subsequently, I received the great advice to use memcached to share the memory across the tasks. I got it working and was able to set and get integer and string values between processes.
The trouble is this: after a great deal of debugging today, I learned that memcached's set and get don't seem to work for classes. This is my best guess: Perhaps under the hood memcached serializes objects to the shared memory; class X (understandably) cannot be serialized because it points at live data (the status of the job), and so the serial version may be out of date (i.e. it may point to the wrong place) when it is loaded again.
Attempts to use a SQLite database were similarly fruitless; not only could I not figure out how to serialize objects as database fields (using my Django models.py file), I would be stuck with the same problem: the handles of the launched jobs need to stay in RAM somehow (or use some fancy OS tricks underneath), so that they update as the jobs finish or fail.
My best guess is that (despite the advice that thankfully got me this far) I should be launching each job in some external queue (for instance Sun/Oracle Grid Engine). However, I couldn't come up with a good way of doing that without using a system call, which I thought may be bad style (and potentially insecure).
How do you keep track of jobs that you launch in Django or Django Celery? Do you launch them by simply putting the job arguments into a database and then have another job that polls the database and runs jobs?
Thanks a lot for your help, I'm quite lost.
I think django-celery does this work for you. Did you have a look at the tables created by django-celery? E.g. djcelery_taskstate holds all the data for a given task, such as state, worker_id and so on. For periodic tasks there is a table called djcelery_periodictask.
In a Django view you can access the TaskMeta object:
from djcelery.models import TaskMeta

task = TaskMeta.objects.get(task_id=task_id)
print(task.status)
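Building on that, a minimal sketch of how the original problem could be handled: instead of caching the X objects themselves, store only the task id (a plain string, which memcached handles fine) and look the state up on demand (my_task, job_name and the two helpers below are hypothetical):

from django.core.cache import cache
from djcelery.models import TaskMeta

def launch_job(job_name, *args):
    # my_task is a placeholder for whatever Celery task actually does the work;
    # .delay() returns an AsyncResult whose id is an ordinary string.
    result = my_task.delay(*args)
    cache.set("job:{}".format(job_name), result.id)
    return result.id

def job_status(job_name):
    task_id = cache.get("job:{}".format(job_name))
    if task_id is None:
        return "unknown"
    # django-celery records task state in the djcelery_taskmeta table.
    return TaskMeta.objects.get(task_id=task_id).status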
