FOR THOSE READING THIS: I have decided to use RQ instead which doesn't fail when running code that uses the multiprocessing module. I suggest you use that.
I am trying to use a multiprocessing pool from within a celery task using Python 3 and redis as the broker (running it on a Mac). However, I don't seem to be able to even create a multiprocessing Pool object from within the Celery task! Instead, I get a strange exception that I really don't know what to do with.
Can anyone tell me how to accomplish this?
The task:
from celery import Celery
from multiprocessing.pool import Pool
app = Celery('tasks', backend='redis', broker='redis://localhost:6379/0')
#app.task
def test_pool():
with Pool() as pool:
# perform some task using the pool
pool.close()
return 'Done!'
which I add to Celery using:
celery -A tasks worker --loglevel=info
and then running it via the following python script:
import tasks
tasks.test_pool.delay()
that returns the following celery output:
[2015-01-12 15:08:57,571: INFO/MainProcess] Connected to redis://localhost:6379/0
[2015-01-12 15:08:57,583: INFO/MainProcess] mingle: searching for neighbors
[2015-01-12 15:08:58,588: INFO/MainProcess] mingle: all alone
[2015-01-12 15:08:58,598: WARNING/MainProcess] celery#Simons-MacBook-Pro.local ready.
[2015-01-12 15:09:02,425: INFO/MainProcess] Received task: tasks.test_pool[38cab553-3a01-4512-8f94-174743b05369]
[2015-01-12 15:09:02,436: ERROR/MainProcess] Task tasks.test_pool[38cab553-3a01-4512-8f94-174743b05369] raised unexpected: AttributeError("'Worker' object has no attribute '_config'",)
Traceback (most recent call last):
File "/usr/local/lib/python3.4/site-packages/celery/app/trace.py", line 240, in trace_task
R = retval = fun(*args, **kwargs)
File "/usr/local/lib/python3.4/site-packages/celery/app/trace.py", line 438, in __protected_call__
return self.run(*args, **kwargs)
File "/Users/simongray/Code/etilbudsavis/offer-sniffer/tasks.py", line 17, in test_pool
with Pool() as pool:
File "/usr/local/Cellar/python3/3.4.2_1/Frameworks/Python.framework/Versions/3.4/lib/python3.4/multiprocessing/pool.py", line 150, in __init__
self._setup_queues()
File "/usr/local/Cellar/python3/3.4.2_1/Frameworks/Python.framework/Versions/3.4/lib/python3.4/multiprocessing/pool.py", line 243, in _setup_queues
self._inqueue = self._ctx.SimpleQueue()
File "/usr/local/Cellar/python3/3.4.2_1/Frameworks/Python.framework/Versions/3.4/lib/python3.4/multiprocessing/context.py", line 111, in SimpleQueue
return SimpleQueue(ctx=self.get_context())
File "/usr/local/Cellar/python3/3.4.2_1/Frameworks/Python.framework/Versions/3.4/lib/python3.4/multiprocessing/queues.py", line 336, in __init__
self._rlock = ctx.Lock()
File "/usr/local/Cellar/python3/3.4.2_1/Frameworks/Python.framework/Versions/3.4/lib/python3.4/multiprocessing/context.py", line 66, in Lock
return Lock(ctx=self.get_context())
File "/usr/local/Cellar/python3/3.4.2_1/Frameworks/Python.framework/Versions/3.4/lib/python3.4/multiprocessing/synchronize.py", line 163, in __init__
SemLock.__init__(self, SEMAPHORE, 1, 1, ctx=ctx)
File "/usr/local/Cellar/python3/3.4.2_1/Frameworks/Python.framework/Versions/3.4/lib/python3.4/multiprocessing/synchronize.py", line 59, in __init__
kind, value, maxvalue, self._make_name(),
File "/usr/local/Cellar/python3/3.4.2_1/Frameworks/Python.framework/Versions/3.4/lib/python3.4/multiprocessing/synchronize.py", line 117, in _make_name
return '%s-%s' % (process.current_process()._config['semprefix'],
AttributeError: 'Worker' object has no attribute '_config'
This is a known issue with celery. It stems from an issue introduced in the billiard dependency. A work-around is to manually set the _config attribute for the current process. Thanks to user #martinth for the work-around below.
from celery.signals import worker_process_init
from multiprocessing import current_process
#worker_process_init.connect
def fix_multiprocessing(**kwargs):
try:
current_process()._config
except AttributeError:
current_process()._config = {'semprefix': '/mp'}
The worker_process_init hook will execute the code upon worker process initialization. We simply check to see if _config exists, and set it if it does not.
Via a useful comment in the Celery issue report linked to in Davy's comment, I was able to solve this by importing the billiard module's Pool class instead.
Replace
from multiprocessing import Pool
with
from billiard.pool import Pool
A quick solution is to use the thread-based "dummy" multiprocessing implementation. Change
from multiprocessing import Pool # or whatever you're using
to
from multiprocessing.dummy import Pool
However since this parallelism is thread-based, the usual caveats (GIL) apply.
Related
EDIT: I found out it only happened on a Windows computer. Everything works fine on Linux server.
I am running a scrapy crawler in a celery process and keep getting this error. Any ideas what am I doing wrong?
[2021-08-18 11:28:42,294: INFO/MainProcess] Connected to sqla+sqlite:///celerydb.sqlite
[2021-08-18 11:28:42,313: INFO/MainProcess] celery#NP45086 ready.
[2021-08-18 09:46:58,330: INFO/MainProcess] Received task: app_celery.scraping_process_cli[e94dc192-e10e-4921-ad0c-bb932be9b568]
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "C:\Program Files\Python374\lib\multiprocessing\spawn.py", line 105, in spawn_main
exitcode = _main(fd)
File "C:\Program Files\Python374\lib\multiprocessing\spawn.py", line 115, in _main
self = reduction.pickle.load(from_parent)
ModuleNotFoundError: No module named 'app_celery'
[2021-08-18 09:46:58,773: INFO/MainProcess] Task app_celery.scraping_process_cli[e94dc192-e10e-4921-ad0c-bb932be9b568] succeeded in 0.4380000000091968s: None
My app_celery looks like this:
app = Celery('app_celery', backend=..., broker=...)
def scrape_data():
process = CrawlerProcess(get_project_settings())
crawler = process.create_crawler(spider_cls)
process.crawl(spider_cls, **kwargs)
process.start()
#app.task(name='app_celery.scraping_process_cli', time_limit=1200, max_retries=3)
def scraping_process_cli(company_id):
import multiprocessing
a = multiprocessing.Process(target=scrape_data())
a.start()
a.join()
I am running the celery as:
celery -A app_celery worker -c 4 -n worker1 --pool threads
Before do step 2, change directory to the project root of your scrapy project.
You have to tell CrawlerProcess() to load settings.py.
import os
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
def scrape_data():
os.chdir("/path/to/your/scrapy/project_root")
process = CrawlerProcess(get_project_settings())
process.crawl('spider_name')
process.start()
I need to use Celery to schedule a Fabric task on a set of hosts.
My code is:
tasks.py
from fabric_tasks import poll
from api.client import APIClient as client
from celery import Celery
celery = Celery(broker='redis://')
#celery.task
def poll_all():
actions = client().get_actions(status='SCHEDULED)
ids = [a['id'] for a in actions]
execute(poll, ids, hosts=hosts)
CELERYBEAT_SCHEDULE = {
'every-5-second': {
'task': 'tasks.poll_all',
'schedule': timedelta(seconds=5),
},
}
celery -A tasks.celery -B -l info
The task fail with the following exception:
[2016-12-11 17:54:07,928: ERROR/PoolWorker-3] Task tasks.poll_actions[7b26a083-b450-4a90-8971-97f3dd2f4f5d] raised unexpected: AttributeError("'Process' object has no attribute '_authkey'",)
Traceback (most recent call last):
File "/Users/ocervell/.virtualenvs/ndc-v3.3/lib/python2.7/site-packages/celery/app/trace.py", line 367, in trace_task
R = retval = fun(*args, **kwargs)
File "/Users/ocervell/.virtualenvs/ndc-v3.3/lib/python2.7/site-packages/celery/app/trace.py", line 622, in __protected_call__
return self.run(*args, **kwargs)
File "/Users/ocervell/Drive/workspace/ccc/svn/build-support/branches/development-ocervell/modules/ndc/v3.3/tasks.py", line 44, in poll_all
return execute(poll, 'ls', hosts=hosts)
File "/Users/ocervell/.virtualenvs/ndc-v3.3/lib/python2.7/site-packages/fabric/tasks.py", line 387, in execute
multiprocessing
File "/Users/ocervell/.virtualenvs/ndc-v3.3/lib/python2.7/site-packages/fabric/tasks.py", line 269, in _execute
p = multiprocessing.Process(target=inner, kwargs=kwarg_dict)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/process.py", line 98, in __init__
self._authkey = _current_process._authkey
AttributeError: 'Process' object has no attribute '_authkey'
Temporary fix
The following hack put in tasks.py helped me fix this bug (essentially it sets the attributes that caused the AttributeError exception):
from celery.signals import worker_process_init
#worker_process_init.connect
def fix_multiprocessing(**kwargs):
from multiprocessing import current_process
keys = {
'_authkey': '',
'_daemonic': False,
'_tempdir': ''
}
for k, v in keys.items():
try:
getattr(current_process(), k)
except AttributeError:
setattr(current_process(), k, v)
Related:
https://github.com/celery/celery/issues/1709
Python multiprocessing job to Celery task but AttributeError
Alternatives ?
I was wondering if there is a better way for Celery to schedule Fabric tasks without writing dirty hacks around multiprocessing library ?
Question detail: https://github.com/celery/celery/issues/3598
I want to run a scrapy spider with celery, which contain Djangoitems.
this is my celery task:
# coding_task.py
import sys
from celery import Celery
from collector.collector.crawl_agent import crawl
app = Celery('coding.net', backend='redis', broker='redis://localhost:6379/0')
app.config_from_object('celery_config')
#app.task
def period_task():
crawl()
collector.collector.crawl_agent.crawl contains a scrapy crawler who uses djangoitem as item.
the item like:
import django
os.environ['DJANGO_SETTINGS_MODULE'] = 'RaPo3.settings'
django.setup()
from scrapy_djangoitem import DjangoItem
from xxx.models import Collection
class CodingItem(DjangoItem):
django_model = Collection
amount = scrapy.Field(default=0)
role = scrapy.Field()
type = scrapy.Field()
duration = scrapy.Field()
detail = scrapy.Field()
extra = scrapy.Field()
when run: celery -A coding_task worker --loglevel=info --concurrency=1, it wil get some errors below:
[2016-11-16 17:33:41,934: ERROR/Worker-1] Process Worker-1
Traceback (most recent call last):
File "/usr/local/lib/python2.7/site-packages/billiard/process.py", line 292, in _bootstrap
self.run()
File "/usr/local/lib/python2.7/site-packages/billiard/pool.py", line 292, in run
self.after_fork()
File "/usr/local/lib/python2.7/site-packages/billiard/pool.py", line 395, in after_fork
self.initializer(*self.initargs)
File "/usr/local/lib/python2.7/site-packages/celery/concurrency/prefork.py", line 80, in process_initializer
signals.worker_process_init.send(sender=None)
File "/usr/local/lib/python2.7/site-packages/celery/utils/dispatch/signal.py", line 151, in send
response = receiver(signal=self, sender=sender, **named)
File "/usr/local/lib/python2.7/site-packages/celery/fixups/django.py", line 152, in on_worker_process_init
self._close_database()
File "/usr/local/lib/python2.7/site-packages/celery/fixups/django.py", line 181, in _close_database
funs = [self._db.close_connection] # pre multidb
AttributeError: 'module' object has no attribute 'close_connection'
[2016-11-16 17:33:41,942: INFO/MainProcess] Connected to redis://localhost:6379/0
[2016-11-16 17:33:41,957: INFO/MainProcess] mingle: searching for neighbors
[2016-11-16 17:33:42,962: INFO/MainProcess] mingle: all alone
/usr/local/lib/python2.7/site-packages/celery/fixups/django.py:199: UserWarning: Using settings.DEBUG leads to a memory leak, never use this setting in production environments!
warnings.warn('Using settings.DEBUG leads to a memory leak, never '
[2016-11-16 17:33:42,968: WARNING/MainProcess] /usr/local/lib/python2.7/site-packages/celery/fixups/django.py:199: UserWarning: Using settings.DEBUG leads to a memory leak, never use this setting in production environments!
warnings.warn('Using settings.DEBUG leads to a memory leak, never '
[2016-11-16 17:33:42,968: WARNING/MainProcess] celery#MacBook-Pro.local ready.
[2016-11-16 17:33:42,969: ERROR/MainProcess] Process 'Worker-1' pid:2777 exited with 'exitcode 1'
[2016-11-16 17:33:42,991: ERROR/MainProcess] Unrecoverable error: WorkerLostError('Could not start worker processes',)
Traceback (most recent call last):
File "/usr/local/lib/python2.7/site-packages/celery/worker/__init__.py", line 208, in start
self.blueprint.start(self)
File "/usr/local/lib/python2.7/site-packages/celery/bootsteps.py", line 127, in start
step.start(parent)
File "/usr/local/lib/python2.7/site-packages/celery/bootsteps.py", line 378, in start
return self.obj.start()
File "/usr/local/lib/python2.7/site-packages/celery/worker/consumer.py", line 271, in start
blueprint.start(self)
File "/usr/local/lib/python2.7/site-packages/celery/bootsteps.py", line 127, in start
step.start(parent)
File "/usr/local/lib/python2.7/site-packages/celery/worker/consumer.py", line 766, in start
c.loop(*c.loop_args())
File "/usr/local/lib/python2.7/site-packages/celery/worker/loops.py", line 50, in asynloop
raise WorkerLostError('Could not start worker processes')
WorkerLostError: Could not start worker processes
if i delete djangoitem in item:
from scrapy.item import Item
class CodingItem(item):
amount = scrapy.Field(default=0)
role = scrapy.Field()
type = scrapy.Field()
duration = scrapy.Field()
detail = scrapy.Field()
extra = scrapy.Field()
the task will play well and doesn't have any error.
What should i do if i want to use djangoitem in this celery-scrapy task?
Thanks!
You should check the ram usage. It might be possible celery is not getting enough ram
Upgrade Celery to 4.0 will solve the problem.
More detail: https://github.com/celery/celery/issues/3598
I'm trying the the example to use celery and cassandra together:
http://datastax.github.io/python-driver/cqlengine/third_party.html
But without luck.
I get this exception the I'm starting the worker with:
$ celery -A tasks worker -l INFO
[2016-06-12 14:11:53,609: ERROR/Worker-1] Process Worker-1
Traceback (most recent call last):
File "/Users/lutz/work/truncated/truncated-worker/venv/lib/python3.5/site-packages/billiard/process.py", line 292, in _bootstrap
self.run()
File "/Users/lutz/work/truncated/truncated-worker/venv/lib/python3.5/site-packages/billiard/pool.py", line 292, in run
self.after_fork()
File "/Users/lutz/work/truncated/truncated-worker/venv/lib/python3.5/site-packages/billiard/pool.py", line 395, in after_fork
self.initializer(*self.initargs)
File "/Users/lutz/work/truncated/truncated-worker/venv/lib/python3.5/site-packages/celery/concurrency/prefork.py", line 84, in process_initializer
signals.worker_process_init.send(sender=None)
File "/Users/lutz/work/truncated/truncated-worker/venv/lib/python3.5/site-packages/celery/utils/dispatch/signal.py", line 166, in send
response = receiver(signal=self, sender=sender, **named)
TypeError: cassandra_init() got an unexpected keyword argument 'sender'
I'm Using osx el Capitan, python 3.5.1, Celery 3.1.23 and cassandra 3.5.
So any help will be welcome.
Your cassandra_init signal handler function needs to accept arbitrary keyword arguments.
Simply change the line:
def cassandra_init():
into:
def cassandra_init(**kwargs):
For more information about Celery signals, see the user guide at:
http://docs.celeryproject.org/en/latest/userguide/signals.html#basics
Note: It would be helpful if you also submitted some kind of report to the author of that tutorial. Celery signal handlers have always required the keyword arguments, so it's sad to have non-working examples out there.
Consider the following code :
Server :
import sys
from multiprocessing.managers import BaseManager, BaseProxy, Process
def baz(aa) :
l = []
for i in range(3) :
l.append(aa)
return l
class SolverManager(BaseManager): pass
class MyProxy(BaseProxy): pass
manager = SolverManager(address=('127.0.0.1', 50000), authkey='mpm')
manager.register('solver', callable=baz, proxytype=MyProxy)
def serve_forever(server):
try :
server.serve_forever()
except KeyboardInterrupt:
pass
def runpool(n):
server = manager.get_server()
workers = []
for i in range(int(n)):
Process(target=serve_forever, args=(server,)).start()
if __name__ == '__main__':
runpool(sys.argv[1])
Client :
import sys
from multiprocessing.managers import BaseManager, BaseProxy
import multiprocessing, logging
class SolverManager(BaseManager): pass
class MyProxy(BaseProxy): pass
def main(args) :
SolverManager.register('solver')
m = SolverManager(address=('127.0.0.1', 50000), authkey='mpm')
m.connect()
print m.solver(args[1])._getvalue()
if __name__ == '__main__':
sys.exit(main(sys.argv))
If I run the server using only one process as python server.py 1
then the client works as expected. But if I spawn two processes (python server.py 2) listening for connections, I get a nasty error :
$python client.py ping
Traceback (most recent call last):
File "client.py", line 24, in <module>
sys.exit(main(sys.argv))
File "client.py", line 21, in main
print m.solver(args[1])._getvalue()
File "/usr/lib/python2.6/multiprocessing/managers.py", line 637, in temp
authkey=self._authkey, exposed=exp
File "/usr/lib/python2.6/multiprocessing/managers.py", line 894, in AutoProxy
incref=incref)
File "/usr/lib/python2.6/multiprocessing/managers.py", line 700, in __init__
self._incref()
File "/usr/lib/python2.6/multiprocessing/managers.py", line 750, in _incref
dispatch(conn, None, 'incref', (self._id,))
File "/usr/lib/python2.6/multiprocessing/managers.py", line 79, in dispatch
raise convert_to_error(kind, result)
multiprocessing.managers.RemoteError:
---------------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/lib/python2.6/multiprocessing/managers.py", line 181, in handle_request
result = func(c, *args, **kwds)
File "/usr/lib/python2.6/multiprocessing/managers.py", line 402, in incref
self.id_to_refcount[ident] += 1
KeyError: '7fb51084c518'
---------------------------------------------------------------------------
My idea is pretty simple. I want to create a server that will spawn a number of workers that will share the same socket and handle requests independently. Maybe I'm using the wrong tool here ?
The goal is to build a 3-tier structure where all requests are handled via an http server and then dispatched to nodes sitting in a cluster and from nodes to workers via the multiprocessing managers...
There is one public server, one node per machine and x number of workers on each machine depending on the number of cores... I know I can use a more sophisticated library, but for such a simple task (I'm just prototyping here) I would just use the multiprocessing library... Is this possible or I should explore directly other solutions ? I feel I'm very close to have something working here ... thanks.
You're trying to invent a wheel, many have invented before.
It sounds to me that you're looking for task queue where your server dispatches tasks to, and your workers execute this tasks.
I would recommend you to have a look at Celery.