Why does RabbitMQ generate lots of .idx files in the queue directory? - python

I set up a RabbitMQ server with Docker as shown below, then configured Celery to use it as the broker.
rabbit:
  hostname: rabbit
  image: rabbitmq:latest
  environment:
    - RABBITMQ_DEFAULT_USER=admin
    - RABBITMQ_DEFAULT_PASS=mypass
  ports:
    - "5673:5672"
worker:
  build:
    context: .
    dockerfile: dockerfile
  volumes:
    - .:/app
  links:
    - rabbit
  depends_on:
    - rabbit
And the Celery configuration is:
from __future__ import absolute_import
from celery import Celery

app = Celery('test_celery',
             broker='amqp://admin:mypass@host:5673',
             backend='rpc://',
             include=['test_celery.tasks'])
Run_tasks code:
from .tasks import longtime_add
import time

if __name__ == '__main__':
    url = ['http://example1.com', 'http://example2.com', 'http://example3.com',
           'http://example4.com', 'http://example5.com', 'http://example6.com',
           'http://example7.com', 'http://example8.com']  # change them to your url list
    for i in url:
        result = longtime_add.delay(i)
        print 'Task result:', result.result
Tasks code:
from __future__ import absolute_import
from test_celery.celery import app
import time, requests
from pymongo import MongoClient

client = MongoClient('10.1.1.234', 27018)  # change the ip and port to your mongo database's
db = client.mongodb_test
collection = db.celery_test
post = db.test


@app.task(bind=True, default_retry_delay=10)  # set a retry delay, 10 equals 10 s
def longtime_add(self, i):
    try:
        r = requests.get(i)
        if some_condition_happened:          # pseudocode: re-send the url to the queue
            longtime_add.delay(i)
        elif some_other_condition_happened:  # pseudocode: store the result in the database
            post.insert({'status': r.status_code, "creat_time": time.time()})
    except Exception as exc:
        raise self.retry(exc=exc)
The run_tasks code generates a list of URLs and sends them to RabbitMQ; the tasks then consume them and check whether certain conditions happened. If they did, the result is sent to RabbitMQ again; otherwise the data is stored in the database.
The question: when the tasks run for a long time, 24 hours or even longer, lots of .idx files are generated by RabbitMQ in the directory "mnesia/rabbit@rabbit/queues". The total size reaches 40 GB.
So the question is: how can I stop these large files from being generated automatically, or keep them small?

.idx files are queue index segments. Make your apps consume enqueued messages and acknowledge them, then queue indices will be updated and their segment files deleted when they are no longer necessary.
You can also purge or delete queues, of course.
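If the backlog itself is disposable, you can also purge it from Celery. A minimal sketch, assuming the test_celery app from the question and that the tasks go to the default queue:
# Sketch: purge all configured task queues so RabbitMQ can delete the
# corresponding queue index segments (this discards the pending messages).
from test_celery.celery import app

purged = app.control.purge()
print('Purged %d messages' % purged)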

Related

Background worker process crashing

I'm creating a web app that takes a username as input and performs some tasks with it. I'm using Flask for the web part and Heroku to deploy it. I have two main Python scripts that run the app: app.py and task.py. Everything is fine in my app.py file, where I use Flask to build the app, but in my task.py file I have an initialisation step:
from instabot import Bot
bot = Bot()
bot.login(username="myusername", password="mypassword")
Now I declared these scripts in the Procfile like this:
worker: python task.py
web: gunicorn app:app
But after deploying, when I check its logs I'm getting:
2022-01-28T18:30:06.376795+00:00 heroku[worker.1]: Starting process with command `python task.py`
2022-01-28T18:30:07.081356+00:00 heroku[worker.1]: State changed from starting to up
2022-01-28T18:30:08.213687+00:00 heroku[worker.1]: Process exited with status 0
2022-01-28T18:30:08.274498+00:00 heroku[worker.1]: State changed from up to crashed
As you can see, it's crashing. I don't know what my mistake is; can you please help me figure it out? :(
The code in my worker is almost entirely taken from the redis-rq documentation, with some changes to make it also work in our local development environment:
import os
from rq import Queue, Connection
from tools import get_redis_store

# we need to use a different worker when we are in Heroku
if 'DYNO' in os.environ:
    from rq.worker import HerokuWorker as Worker
else:
    from rq import Worker

redis_url = os.getenv('REDIS_URL')
if not redis_url:
    raise RuntimeError('Set up Redis first.')

listen = ['default']
conn = get_redis_store()

if __name__ == '__main__':
    with Connection(conn):
        worker = Worker(map(Queue, listen))
        worker.work()
In tools.py:
import os
import logging

import redis

logger = logging.getLogger(__name__)

_redis_store = None


def get_redis_store():
    '''
    Get a redis client based on the url configured
    in the env variable REDIS_URL

    Returns
    -------
    redis.Redis
    '''
    global _redis_store
    if not _redis_store:
        redis_url = os.getenv('REDIS_URL')
        if redis_url:
            logger.debug('starting redis: %s' % redis_url)
            if redis_url.startswith('rediss://'):
                # redis 6 encrypted connection;
                # heroku needs ssl verification disabled
                _redis_store = redis.from_url(redis_url, ssl_cert_reqs=None)
            else:
                _redis_store = redis.from_url(redis_url)
        else:
            logger.debug('redis not configured')
    return _redis_store
On the main app, to queue a task:
from rq import Queue
from tools import get_redis_store

redis_store = get_redis_store()
queue = Queue(connection=redis_store)


def _slow_task(param1, param2, paramn):
    # here I execute the slow task
    my_slow_code(param1, param2)


# when I need to execute the slow task
queue.enqueue(_slow_task, param1, param2)
There are multiple options for using Redis on Heroku; depending on which one you use, you may need to change the env variable REDIS_URL. The environment variable DYNO is checked to know whether the code is running on Heroku or on a developer machine.
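With this setup the Procfile would look roughly like this (a sketch, assuming the worker module above is saved as worker.py and the web app object is exposed as app in app.py):
web: gunicorn app:app
worker: python worker.py
Because worker.work() blocks and keeps polling the queue, the worker dyno stays up instead of exiting with status 0 the way a one-shot script does.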

How to share cache between a task and a view with Django?

In my Django project, I have a task running every 5 minutes (with Celery and Redis as the broker):
from datetime import datetime

from django.core.cache import cache


@shared_task()
@celery.task(base=QueueOnce)
def cache_date():
    cache.set('date', datetime.now())
    print('Cached date : ', cache.get('date'))
And it runs fine, printing the newly cached date every time it runs.
But then, somewhere in one of my views, I try to do this:
from django.core.cache import cache


def get_cached_date():
    print('Cached date :', cache.get('date'))
And then it prints "Cached date : None"
Here are my cache settings:
CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.filebased.FileBasedCache',
        'LOCATION': '/tmp/cache',
    }
}
I don't get it: why is the value available in one place and not in the other, while I'm using a single cache location? Am I doing something wrong?
UPDATE
@Satevg I use docker-compose, here is my file:
services:
  redis:
    image: redis
  worker:
    image: project
    depends_on:
      - redis
      - webserver
    command: project backend worker
    volumes:
      - cache:/tmp/cache
  webserver:
    image: project
    depends_on:
      - redis
    volumes:
      - cache:/tmp/cache
volumes:
  cache:
I tried to share the volume like this, but when my task tries to write to the cache I get:
OSError: [Errno 13] Permission denied: '/tmp/cache/tmpAg_TAc'
When I look at the filesystems in both containers I can see the /tmp/cache folder. The web app can even write to it, and when I look in the worker container's /tmp/cache folder I can see the updated cache.
UPDATE2:
The webapp can write to the cache.
cache.set('test', 'Test')
In the worker's container, I can see the cache file in the /tmp/cache folder.
When the task tries to read from the cache:
print(cache.get('test'))
It says :
None
When the task tries to write to the cache, it still gets Errno 13.
As we figured out in the comments, Celery and the app are running in different Docker containers.
django.core.cache.backends.filebased.FileBasedCache serializes and stores each cache value as a separate file. But these files are in different file systems.
The solution is to use docker volumes in order to share /tmp/cache folder between these two containers.
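To illustrate (a sketch; the key and value here are only examples): with FileBasedCache every cache.set() becomes a file under LOCATION, so a cache.get() in another container only finds it if both containers mount the same /tmp/cache volume.
from django.core.cache import cache

# Writes a file such as /tmp/cache/<hash>.djcache inside the container running this code.
cache.set('date', 'some value')

# A process in another container only sees a non-None value here if /tmp/cache
# is a shared volume mounted in both containers.
print(cache.get('date'))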

Heroku database management from worker

Wondering if anyone can help me with this or at least point me in the right direction.
I currently have a web and a worker process running. I need a task to run 24/7 while the dynos are online; its job is to access the database and remove records that have expired, by checking each record's "expiry" value against the current timestamp.
My worker.py file:
import os

import redis
from rq import Worker, Queue, Connection

listen = ['high', 'default', 'low']

redis_url = os.getenv('REDISTOGO_URL', 'redis://localhost:6379')
conn = redis.from_url(redis_url)

if __name__ == '__main__':
    with Connection(conn):
        worker = Worker(map(Queue, listen))
        worker.work()
This is as shown in the Heroku documentation.
Then in my app.py:
from datetime import datetime

from rq import Queue
from worker import conn

q = Queue(connection=conn)


def myFunction():
    while True:
        for item in Users.query.all():
            if int(item.expiry) < datetime.now().timestamp():
                db.session.delete(item)
                db.session.commit()


if __name__ == '__main__':
    q.enqueue(myFunction)
    app.run()
My Procfile looks like this:
web: gunicorn app:app
worker: python worker.py
When I run this, expired records are not removed from the database. Is there any way I can solve this or diagnose the issue further?
The code that enqueues your task is inside the __name__ == '__main__' block, so it only runs when your script is executed directly, e.g. via python app.py. But you are running this on Heroku via the Procfile, which loads it as a module into gunicorn, so that code is never executed. You need to put it somewhere else.
Note, though, that I can't see any reason for using rq here at all. It's meant for workers that dynamically run offline tasks enqueued by your web processes. But you seem to want one function to run continuously; rq is irrelevant here, and you should just run that code directly via the Procfile (see the sketch below).
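For example, a minimal sketch of such a standalone script (the file name clean_expired.py, the 60-second sleep, and the names imported from app.py are assumptions; adjust them to your project):
# clean_expired.py - hypothetical standalone worker script, run directly from the
# Procfile instead of going through rq.
import time
from datetime import datetime

from app import app, db, Users  # assumes app.py exposes the Flask app, db and the Users model


def remove_expired():
    while True:
        with app.app_context():  # Flask-SQLAlchemy queries need an application context
            for item in Users.query.all():
                if int(item.expiry) < datetime.now().timestamp():
                    db.session.delete(item)
            db.session.commit()
        time.sleep(60)  # pause between sweeps instead of looping as fast as possible


if __name__ == '__main__':
    remove_expired()
and in the Procfile:
web: gunicorn app:app
worker: python clean_expired.py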

How to list the queued items in celery?

I have a Django project on an Ubuntu EC2 node, which I have been using to set up an asynchronous task queue using Celery.
I am following http://michal.karzynski.pl/blog/2014/05/18/setting-up-an-asynchronous-task-queue-for-django-using-celery-redis/ along with the docs.
I've been able to get a basic task working at the command line, using:
(env1)ubuntu@ip-172-31-22-65:~/projects/tp$ celery --app=myproject.celery:app worker --loglevel=INFO
I just realized that I have a bunch of tasks in my queue that had not executed:
[2015-03-28 16:49:05,916: WARNING/MainProcess] Restoring 4 unacknowledged message(s).
(env1)ubuntu@ip-172-31-22-65:~/projects/tp$ celery -A tp purge
WARNING: This will remove all tasks from queue: celery.
There is no undo for this operation!
(to skip this prompt use the -f option)
Are you sure you want to delete all tasks (yes/NO)? yes
Purged 81 messages from 1 known task queue.
How do I get a list of the queued items from the command line?
If you want to get all scheduled tasks:
celery inspect scheduled
To find all active queues:
celery inspect active_queues
For status:
celery inspect stats
For all commands:
celery inspect
If you want to inspect it directly, since you are using Redis as the queue:
redis-cli
> KEYS *    # find all keys
Then find the key related to celery:
> LLEN KEY  # gives the length of the list
Here is a copy-paste solution for Redis:
def get_celery_queue_len(queue_name):
    from yourproject.celery import app as celery_app

    with celery_app.pool.acquire(block=True) as conn:
        return conn.default_channel.client.llen(queue_name)


def get_celery_queue_items(queue_name):
    import base64
    import json

    from yourproject.celery import app as celery_app

    with celery_app.pool.acquire(block=True) as conn:
        tasks = conn.default_channel.client.lrange(queue_name, 0, -1)
        decoded_tasks = []

    for task in tasks:
        j = json.loads(task)
        body = json.loads(base64.b64decode(j['body']))
        decoded_tasks.append(body)

    return decoded_tasks
It works with Django. Just don't forget to change yourproject.celery.
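A possible usage, assuming the default queue name 'celery' (adjust it to your routing):
print('Queued tasks:', get_celery_queue_len('celery'))
for body in get_celery_queue_items('celery'):
    print(body)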

Django Celery get task count

I am currently using django with celery and everything works fine.
However, I want to give users the opportunity to cancel a task if the server is overloaded, by checking how many tasks are currently scheduled.
How can I achieve this?
I am using Redis as the broker.
I just found this:
Retrieve list of tasks in a queue in Celery
It is somewhat related to my issue, but I don't need to list the tasks, just count them :)
Here is a broker-agnostic way to get the number of messages in a queue using Celery.
By using connection_or_acquire, you can minimize the number of open connections to your broker by utilizing celery's internal connection pooling.
celery = Celery(app)

with celery.connection_or_acquire() as conn:
    conn.default_channel.queue_declare(
        queue='my-queue', passive=True).message_count
You can also extend Celery to provide this functionality:
from celery import Celery as _Celery


class Celery(_Celery):

    def get_message_count(self, queue):
        '''
        Raises: amqp.exceptions.NotFound: if queue does not exist
        '''
        with self.connection_or_acquire() as conn:
            return conn.default_channel.queue_declare(
                queue=queue, passive=True).message_count


celery = Celery(app)
num_messages = celery.get_message_count('my-queue')
If your broker is configured as redis://localhost:6379/1, and your tasks are submitted to the general celery queue, then you can get the length by the following means:
import redis
queue_name = "celery"
client = redis.Redis(host="localhost", port=6379, db=1)
length = client.llen(queue_name)
Or, from a shell script (good for monitors and such):
$ redis-cli -n 1 -h localhost -p 6379 llen celery
If you have already configured redis in your app, you can try this:
from celery import Celery
QUEUE_NAME = 'celery'
celery = Celery(app)
client = celery.connection().channel().client
length = client.llen(QUEUE_NAME)
Get a redis client instance used by Celery, then check the queue length. Don't forget to release the connection every time you use it (use .acquire):
# Get a configured instance of celery:
from project.celery import app as celery_app


def get_celery_queue_len(queue_name):
    with celery_app.pool.acquire(block=True) as conn:
        return conn.default_channel.client.llen(queue_name)
Always acquire a connection from the pool, don't create it manually. Otherwise, your redis server will run out of connection slots and this will kill your other clients.
I'll expand on @StephenFuhry's answer around the not-found error, because a more or less broker-agnostic way of retrieving the queue length is beneficial even if Celery suggests messing with the broker directly. In Celery 4 (with the Redis broker) this error looks like:
ChannelError: Channel.queue_declare: (404) NOT_FOUND - no queue 'NAME' in vhost '/'
Observations:
ChannelError is a kombu exception (in fact, it's amqp's, and kombu "re-exports" it).
On the Redis broker, Celery/Kombu represent queues as Redis lists.
Redis collection-type keys are removed whenever the collection becomes empty.
If we look at what queue_declare does, it has these lines:
if passive and not self._has_queue(queue, **kwargs):
    raise ChannelError(...)
Kombu Redis virtual transport's _has_queue is this:
def _has_queue(self, queue, **kwargs):
    with self.conn_or_acquire() as client:
        with client.pipeline() as pipe:
            for pri in self.priority_steps:
                pipe = pipe.exists(self._q_for_pri(queue, pri))
            return any(pipe.execute())
The conclusion is that on a Redis broker ChannelError raised from queue_declare is okay (for an existing queue of course), and just means that the queue is empty.
Here's an example of how to output all active Celery queues' lengths (normally should be 0, unless your worker can't cope with the tasks).
from kombu.exceptions import ChannelError

# assumes a configured Celery app, e.g. from yourproject.celery import app as celery_app


def get_queue_length(name):
    with celery_app.connection_or_acquire() as conn:
        try:
            ok_nt = conn.default_channel.queue_declare(queue=name, passive=True)
        except ChannelError:
            return 0
        else:
            return ok_nt.message_count


for queue_info in celery_app.control.inspect().active_queues().values():
    print(queue_info[0]['name'], get_queue_length(queue_info[0]['name']))
