We have a Django app that needs to fetch lots of data using Celery. There are 20 or so celery workers running every few minutes. We're running on Google Kubernetes Engine with a Redis queue using Cloud memorystore.
The Redis instance we're using for celery is filling up, even when the queue is empty according to Flower. This results in the Redis DB eventually being full and Celery throwing errors.
In Flower I see tasks coming in and out, and I have increased workers to the point where the queue is always empty now.
If I run redis-cli --bigkeys I see:
# Scanning the entire keyspace to find biggest keys as well as
# average sizes per key type. You can use -i 0.1 to sleep 0.1 sec
# per 100 SCAN commands (not usually needed).
[00.00%] Biggest set found so far '_kombu.binding.my-queue-name-queue' with 1 members
[00.00%] Biggest list found so far 'default' with 611 items
[00.00%] Biggest list found so far 'my-other-queue-name-queue' with 44705 items
[00.00%] Biggest set found so far '_kombu.binding.celery.pidbox' with 19 members
[00.00%] Biggest list found so far 'my-queue-name-queue' with 727179 items
[00.00%] Biggest set found so far '_kombu.binding.celeryev' with 22 members
-------- summary -------
Sampled 12 keys in the keyspace!
Total key length in bytes is 271 (avg len 22.58)
Biggest list found 'my-queue-name-queue' has 727179 items
Biggest set found '_kombu.binding.celeryev' has 22 members
4 lists with 816144 items (33.33% of keys, avg size 204036.00)
0 hashs with 0 fields (00.00% of keys, avg size 0.00)
0 strings with 0 bytes (00.00% of keys, avg size 0.00)
0 streams with 0 entries (00.00% of keys, avg size 0.00)
8 sets with 47 members (66.67% of keys, avg size 5.88)
0 zsets with 0 members (00.00% of keys, avg size 0.00)
If I inspect the queue using LRANGE I see lots of objects like this:
"{\"body\": \"W1syNDQ0NF0sIHsicmVmZXJlbmNlX3RpbWUiOiBudWxsLCAibGF0ZXN0X3RpbWUiOiBudWxsLCAicm9sbGluZyI6IGZhbHNlLCAidGltZWZyYW1lIjogIjFkIiwgIl9udW1fcmV0cmllcyI6IDF9LCB7ImNhbGxiYWNrcyI6IG51bGwsICJlcnJiYWNrcyI6IG51bGwsICJjaGFpbiI6IG51bGwsICJjaG9yZCI6IG51bGx9XQ==\", \"content-encoding\": \"utf-8\", \"content-type\": \"application/json\", \"headers\": {\"lang\": \"py\", \"task\": \"MyDataCollectorClass\", \"id\": \"646910fc-f9db-48c3-b5a9-13febbc00bde\", \"shadow\": null, \"eta\": \"2019-08-20T02:31:05.113875+00:00\", \"expires\": null, \"group\": null, \"retries\": 0, \"timelimit\": [null, null], \"root_id\": \"beeff557-66be-451d-9c0c-dc622ca94493\", \"parent_id\": \"374d8e3e-92b5-423e-be58-e043999a1722\", \"argsrepr\": \"(24444,)\", \"kwargsrepr\": \"{'reference_time': None, 'latest_time': None, 'rolling': False, 'timeframe': '1d', '_num_retries': 1}\", \"origin\": \"gen1#celery-my-queue-name-worker-6595bd8fd8-8vgzq\"}, \"properties\": {\"correlation_id\": \"646910fc-f9db-48c3-b5a9-13febbc00bde\", \"reply_to\": \"e55a31ed-cbba-3d79-9ffc-c19a29e77aac\", \"delivery_mode\": 2, \"delivery_info\": {\"exchange\": \"\", \"routing_key\": \"my-queue-name-queue\"}, \"priority\": 0, \"body_encoding\": \"base64\", \"delivery_tag\": \"a83074a5-8787-49e3-bb7d-a0e69ba7f599\"}}"
We're using django-celery-results to store results, so these shouldn't be going in there, and we're using a separate Redis instance for Django's cache.
If I clear Redis with a FLUSHALL it slowly fills up again.
I'm kind of stumped at where to go next. I don't know Redis well - maybe I can do something to inspect the data to see what's filling this? Maybe it's Flower not reporting properly? Maybe Celery keeps completed tasks for a bit despite us using the Django DB for results?
Thanks loads for any help.
It sounds like Redis is not set up to delete completed items or report & delete failed items--i.e. it may be putting the tasks on the list, but it's not taking them off.
Check out pypi packages: rq, django-rq, django-rq-scheduler
You can read here a little bit about how this should work: https://python-rq.org/docs/
This seems to be a known (or intentional) issue with Celery, with various solutions/workarounds proposed:
https://github.com/celery/celery/issues/436
Related
I'm using Neo4J to identify the connections between different node labels.
Neo4J 4.4.4 Community Edition
DB rolled out in docker container with k8s orchestrating.
MATCH (source_node: Person) WHERE source_node.name in $inputs
MATCH (source_node)-[r]->(child_id:InternalId)
WHERE r.valid_from <= datetime($actualdate) < r.valid_to
WITH [type(r), toString(date(r.valid_from)), child_id.id] as child_path, child_id, false as filtered
CALL apoc.do.when(filtered,
'RETURN child_path as full_path, NULL as issuer_id',
'OPTIONAL MATCH p_path = (child_id)-[:HAS_PARENT_ID*0..50]->(parent_id:InternalId)
WHERE all(a in relationships(p_path) WHERE a.valid_from <= datetime($actualdate) < a.valid_to) AND
NOT EXISTS{ MATCH (parent_id)-[q:HAS_PARENT_ID]->() WHERE q.valid_from <= datetime($actualdate) < q.valid_to}
WITH DISTINCT last(nodes(p_path)) as i_source,
reduce(st = [], q IN relationships(p_path) | st + [type(q), toString(date(q.valid_from)), endNode(q).id])
as parent_path, CASE WHEN length(p_path) = 0 THEN NULL ELSE parent_id END as parent_id, child_path
OPTIONAL MATCH (i_source)-[r:HAS_ISSUER_ID]->(issuer_id:IssuerId)
WHERE r.valid_from <= datetime($actualdate) < r.valid_to
RETURN DISTINCT CASE issuer_id WHEN NULL THEN child_path + parent_path + [type(r), NULL, "NOT FOUND IN RELATION"]
ELSE child_path + parent_path + [type(r), toString(date(r.valid_from)), toInteger(issuer_id.id)]
END as full_path, issuer_id, CASE issuer_id WHEN NULL THEN true ELSE false END as filtered',
{filtered: filtered, child_path: child_path, child_id: child_id, actualdate: $actualdate}
)
YIELD value
RETURN value.full_path as full_path, value.issuer_id as issuer_id, value.filtered as filtered
When query executing on a large number of incoming names (Person), it is processed quickly for example for 100,000 inputs it takes ~2.5 seconds. However, if 100,000 names are divided into small batches and fore each batch query is executed sequentially, the overall processing time increases dramatically:
100 names batch is ~2 min
1000 names batch is ~10 sec
Could you please provide me a clue why exactly this is happening? And how I could get the same executions time as for the entire dataset regardless the batch size?
Is the any possibility to divide transactions into multiple processes? I tried Python multiprocessing using Neo4j Driver. It works faster but still cannot achieve the target execution time of 2.5 sec for some reasons.
Is it any possibility to keep entire graph into memory during the whole container lifecycle? Could it help resolve the issue with the execution speed on multiple batches instead the entire dataset?
Essentially, the goal is to use as small batches as possible in order to process the entire dataset.
Thank you.
PS: Any suggestions to improve the query are very welcome.)
You pass in a list - then it will use an index to efficiently filter down the results by passing the list to the index, and you do additional aggressive filtering on properties.
So if you run the query with PROFILE you will see how much data is loaded / touched at each step.
A single execution makes more efficient use of resources like heap and page-cache.
For individual batched executions it has to go through the whole machinery (driver, query-parsing, planning, runtime), depending if you execute your queries in parallel (do you?) or sequentially, the next query needs to wait until your previous one has finished.
Multiple executions also content for resources like memory, IO, network.
Python is also not the fastest driver esp. if you send/receive larger volumes of data, try one of the other languages if that serves you better.
Why don't you just always execute one large batch then?
With Neo4j EE (e.g. on Aura) or CE 5 you will also get better runtimes and execution.
Yes if you configure your page-cache large enough to hold the store, it will keep the graph in memory during the execution.
If you run PROFILE with your query you should also see page-cache faults, when it needs to fetch data from disk.
I'm trying to see how many users used my bot in the last 5 minute.
I've got an idea to every time a user used my bot I add his/hers id into a redis list with a timer. (reset the timer if user is already in the list)
And every time I want to check how many users are using the bot, I get the length of the list.
But i have no idea how to do that.
something like below code that expiers five minute later:
redis.setex('foo_var', 60 * 5, 'foo_value')
I've managed to add items to a list with :
redis.zadd('foo', {'item1': 0, 'item2': 1})
And get the length of the list like this (I don't know how to get full length of the list. (without using min and max)):
min = 0.0
max = 1000.0
redis.zcount('foo', min, max)
Right now the problem is how to expire items of a list on specific time.
Items within Lists, Sets, Hashes, and their ilk cannot be expired automatically by Redis. That said, it might be worth looking at Streams.
If you're not familiar, Streams are essentially a list of events with associated times and associated data. Think of it like a log file. The nice thing is you can add extra data to the event, like maybe the type of bot interaction.
So, just log an event every time the bot is used:
XADD bot_events * eventType login
The * here means to auto generate the ID of the event based on the server time. You can also provided one manually, but you almost never want to. An event ID is just a UNIX Epoch time in milliseconds and a sequence number separated by a dash like this: 1651232477183-0.
XADD can automatically trim the Stream for your time period so old record don't hang around. Do this by providing an event ID before which events will be deleted:
XADD bot_events MINID ~ 1651232477183-0 * eventType login
Note that ~ instructs Redis to trim Streams performantly. This means that it might not delete all the events. However, it will never delete more than you expect, only less. It can be replaced with = if you want exactness over performance.
Now that you have a Stream of events, you can then query that Stream for events over a specific time period, based on event IDs:
XRANGE bot_events 1651232477183-0 +
+ here, means until the end of the stream. The initial event ID could be replaces with - if you want all the stream events regardless of the time.
From here, you just count the number of results.
Note, all the examples here are presented as raw Redis commands, but it should be easy enough to translate them to Python.
I have a DynamoDB table say data. This table has 400k items. Each item has 4 fields -
id (string) this is my partition key
status (Y/N)
date_added
source
Right now all items have a status = "Y". How can I update all items and set the status to "N" for all 400k items irrespective of the key or any condition?
In MySQL, an equivalent statement would be -
UPDATE data SET status = 'N';
I am looking to do it either through the command line or preferable in python using boto3
There is no easy or cheap way to do what you want to do. What you'll basically need to do is to read and write the entire database:
write:
If you know the key of a single item, you can do a UpdateItem request with an UpdateExpression of "set status = :N". This will only modify the "status" attribute (leaving the rest unchanged), but the cost you will incur (or provisioned throughput you will use) will be the cost of writing the entire item. So the sum of all these operations will be the cost of re-writing the entire database.
You should add to the above UpdateItem a ConditionExpression that will only update the item if the item actually still exists (you can use a attribute_exists() condition on its key attribute to verify that an item exists). This will allow your workload to delete items while doing these changes.
Before starting this change process, change your client code to write new items with status = N. The change process may miss these new items, but it's fine if they are already created with status = N.
You can't use BatchWriteItems (batch_writer() in boto3) to modify a group of items together, because this batch operation can only replace items - not modify an attribute of existing items. In any case, a BatchWriteItems would not have reduced the costs (batches cost the same as the requests they contain).
read:
To get a list of all extant keys in the database, to do the above reads, you need to use a Scan operation, with Projection set to KEYS_ONLY to get only the keys (you don't need the data). The cost to you will be the same as read the entire item, unfortunately, not just reading the keys. So the sum of the cost of all these Scan operations will be reading the entire database.
If you are using provisioned capacity for this table, you may be able to use whatever excess capacity you have that is not used by client requests to do this change slowly, in the background, basically for "free".
Whether or not this makes sense in your case really depends on how much excess capacity (both read and write!) you have provisioned. If you do this, you'll need to watch out not to use too much capacity for this background operation and hurt your real users - you'll need to have some sort of controller that notices capacity-exceeded errors and reduce the amount of capacity used by the background process.
If you actually have a lot of excess provisioned capacity that you've already paid for, you can do this background operation as quickly as you want! The read part, a Scan, can be done in parallel as quickly as you want (using the "parallel scan" feature), and the write part for different keys can also, obviously, be done in parallel.
The following code uses batch_write_item DynamoDB API to update items in batches of size 25, which is the maximum number of items that batch_write_item can take in a single API call. You might need to tweak this number if your items are large.
Warning: This is just a proof of concept example. You should use at your own risk.
import boto3
def update_status(item):
item['status'] = {
'S': 'N'
}
return item
client = boto3.client('dynamodb', region_name='<ddb-region>')
paginator = client.get_paginator('scan')
operation_parameters = {
'TableName': '<ddb-table-name>',
'PaginationConfig': {
'PageSize': 25
}
}
page_iterator = paginator.paginate(**operation_parameters)
for page in page_iterator:
response = client.batch_write_item(RequestItems={
'<ddb-table-name>': [
{
'PutRequest': {
'Item': update_status(item)
}
}
for item in page['Items']
]
})
print(response)
Our Python application serves around 2 million API requests per day. We got a new requirement from our business to generate the report which should contain the count of unique request and response every day.
We would like to use Redis for queuing all the requests & responses.
Another worker instance will retrieve the above data from Redis queue and process it.
The processed results will be persisted to the database.
The simplest option is to use LPUSH and RPOP. But RPOP will return one value at a time which will affect the performance. Is there any way to do a bulk pop from Redis?
Other suggestions for the scenario would be highly appreciated.
A simple solution would be to use redis pipelining
In a single request you will be allowed to perform multiple RPOP instructions.
Most of redis drivers support it. In python with Redis-py it looks like this:
pipe = r.pipeline()
# The following RPOP commands are buffered
pipe.rpop('requests')
pipe.rpop('requests')
pipe.rpop('requests')
pipe.rpop('requests')
# the EXECUTE call sends all buffered commands to the server, returning
# a list of responses, one for each command.
pipe.execute()
Can approach this from a different angle. Your requirement is:
requirement ... to generate the report which should contain the count of unique request and response every day.
Rather than storing requests in the lists and then post-processing the results, why not use Redis features to solve the actual requirements and avoid the problem of bulk LPUSH/LPOP.
If all we want if to record the unique counts, then you may want to consider using sorted sets.
This may go like this:
Collect the request statistics
# Collect the request statistics in the sorted set.
# The key includes date so we can do the "by date" stats.
key = 'requests:date'
r.zincrby(key, request, 1)
Report request statistics
Can use ZSCAN to iterate over all members in batches, but this is unordered.
Can use ZRANGE to get all members in one go (or whatever), ordered.
Python code:
# ZSCAN: Iterate over all members in the set in batches of about 10.
# This will be unordered list.
# zscan_iter returns tuples (member, score)
batchSize = 10
for memberTuple in r.zscan_iter(key, match = None, count = batchSize):
member = memberTuple[0]
score = memberTuple[1]
print str(member) + ' --> ' + str(score)
# ZRANGE: Get all members in the set, ordered by score.
# Here there maxRank=-1 means "no max".
minRank = 0
maxRank = -1
for memberTuple in r.zrange(key, minRank, maxRank, desc = False, withscores = True):
member = memberTuple[0]
score = memberTuple[1]
print str(member) + ' --> ' + str(score)
Benefits of this approach
Solves the actual requirement - reports on the count of unique requests by day.
No need to post-process anything.
Can do additional queries like "top requests" out of the box :)
Another approach would be to use the Hyperloglog data structure.
It was especially designed for this kind of use case.
It allows counting unique items with a low error margin (0.81%) and with a very low memory usage.
Using HLL is really simple:
PFADD myHll "<request1>"
PFADD myHll "<request2>"
PFADD myHll "<request3>"
PFADD myHll "<request4>"
Then to get the count:
PFCOUNT myHll
The actual question was regarding Redis List, You can use lrange to get all values in a single call, below is solution;
import redis
r_server = redis.Redis("localhost")
r_server.rpush("requests", "Adam")
r_server.rpush("requests", "Bob")
r_server.rpush("requests", "Carol")
print r_server.lrange("requests", 0, -1)
print r_server.llen("requests")
print r_server.lindex("requests", 1)
I'm new to MongoDB and pymongo and looking for some guidance in terms of algorithms and performance for a specific task described below. I have posted a link to an image of the data sample and also my sample python code below.
I have a single collection that grows about 5 to 10 Million documents every month. It receives all this info from other systems, which I have no access to modify in any way (they are in different companies). Each document represent sort of a financial transaction. I need to group documents that are part of a same "transaction group".
Each document has hundreds of keys. Almost all keys vary between documents (which is why they moved from MySQL to MongoDB - no easy way to align schema). However, I found out that three keys are guaranteed to always be in all of them. I'll call these keys key1, key2 and key3 in this example. These keys are my only option to identify the transactions that are part of the same transaction group.
The basic rule is:
- If consecutive documents have the same key1, and the same key2, and the same key3, they are all in the same "transaction group". Then I must give it some integer id in a new key named 'transaction_group_id'
- Else, consecutive documents that do not matck key1, key2 and key3 are all in their own individual "transaction_groups".
It's really easy to understand it by looking at the screenshot of a data sample (better than my explanation anyway). See here:
As you can see in the sample:
- Documents 1 and 2 are in the same group, because they match key1, key2 and key3;
- Documents 3 and 4 also match and are in their own group;
- Following the same logic, documents 18 and 19 are a group obviously. However, even though they match the values of documents 1 and 3, they are not in the same group (because the documents are not consecutive).
I created a very simplified version of the current python function, to give you guys an idea of the current implementation:
def groupTransactions(mongo_host,
mongo_port,
mongo_db,
mongo_collection):
"""
Group transactions if Keys 1, 2 and 3 all match in consecutive docs.
"""
mc = MongoClient(mongo_host, mongo_port)
db = mc['testdb']
coll = db['test_collection']
# The first document transaction group must always be equal to 1.
first_doc_id = coll.find_one()['_id']
coll.update({'_id': first_doc_id},
{"$set": {"transaction_group_id": 1}},
upsert=False, multi=False)
# Cursor order is undetermined unless we use sort(), no matter what the _id is. We learned it the hard way.
cur = coll.find().sort('subtransaction_id', ASCENDING)
doc_count = cur.count()
unique_data = []
unique_data.append(cur[0]['key1'], cur[0]['key2'], cur[0]['key3'])
transaction_group_id = 1
i = 1
while i < doc_count:
doc_id = cur[i]['_id']
unique_data.append(cur[i]['key1'], cur[i]['key2'], cur[i]['key3'])
if unique_data[i] != unique_data[i-1]:
# New group find, increase group id by 1
transaction_group_id = transaction_group_id + 1
# Update the group id in the database
coll.update({'_id': doc_id},
{"$set": {"transaction_group_id": transaction_group_id}},
upsert=False, multi=False)
i = i + 1
print "%d subtransactions were grouped into %d transaction groups." % (doc_count, i)
return 1
This is the code, more or less, and it works. But it takes between 2 to 3 days to finish, which is starting to become unacceptable. The hardware is good: VMs in last generation Xeon, local MongoDB in SSD, 128GB RAM). It will probably run fast if we decide to run it on AWS, use threading/subprocesses, etc - which are all obviously good options to try at some point.
However, I'm not convinced this is the best algorithm. It's just the best I could come up with.There must be obvious ways to improve it that I'm not seeing.
Moving to c/c++ or out of NoSQL is out of the question at this point. I have to make it work the way it is.
So basically the question is: Is this the best possible algorithm (using MongoDB/pymongo) in terms of speed? If not, I'd appreciate it if you could point me in the right direction.
EDIT: Just so you can have an idea of how slow this code performance is: Last time I measured it, it took 22 hours to run on 1.000.000 results. As a quick workaround, I wrote something else to load the data to a Pandas DataFrame first and then apply the same logic of this code more or less. It took 3 to 4 minutes to group everything, using the same hardware. I mean, I know Pandas is efficient, etc. But there's something wrong, there can't be such a huge gap between between the two solutions performances (4min vs 1,320min).
It is the case that most of the time is spent writing to the database, which includes the round trip of sending work to the DB, plus the DB doing the work. I will point out a couple of places where you can speed up each of those.
Speeding up the back-and-forth of sending write requests to the DB:
One of the best ways to improve the latency of the requests to the DB is to minimize the number of round trips. In your case, it's actually possible because multiple documents will get updated with the same transaction_group_id. If you accumulate their values and only send a "multi" update for all of them, then it will cut down on the back-and-forth. The larger the transaction groups the more this will help.
In your code, you would replace the current update statement:
coll.update( {'_id': doc_id},
{"$set": {"transaction_group_id": transaction_group_id}},
upsert=False, multi=False)
With an accumulator of doc_id values (appending them to a list should be just fine). When you detect the "pattern" change and transaction group go to the next one, you would then run one update for the whole group as:
coll.update( {'_id': {$in: list-of-docids },
{"$set": {"transaction_group_id": transaction_group_id}},
upsert=False, multi=True)
A second way of increasing parallelism of this process and speeding up end-to-end work would be to split the job between more than one client - the downside of this is that you need a single unit of work to pre-calculate how many transaction_group_id values there will be and where the split points are. Then you can have multiple clients like this one which only handle range of specific subtransaction_id values and their transaction_group_id starting value is not 1 but whatever is given to them by the "pre-work" process.
Speeding up the actual write on the DB:
The reason I asked about existence of the transaction_group_id field is because if a field that's being $set does not exist, it will be created and that increases the document size. If there is not enough space for the increased document, it has to be relocated and that's less efficient than the in-place update.
MongoDB stores documents in BSON format. Different BSON values have different sizes. As a quick demonstration, here's a shell session that shows total document size based on the type and size of value stored:
> db.sizedemo.find()
{ "_id" : ObjectId("535abe7a5168d6c4735121c9"), "transaction_id" : "" }
{ "_id" : ObjectId("535abe7d5168d6c4735121ca"), "transaction_id" : -1 }
{ "_id" : ObjectId("535abe815168d6c4735121cb"), "transaction_id" : 9999 }
{ "_id" : ObjectId("535abe935168d6c4735121cc"), "transaction_id" : NumberLong(123456789) }
{ "_id" : ObjectId("535abed35168d6c4735121cd"), "transaction_id" : " " }
{ "_id" : ObjectId("535abedb5168d6c4735121ce"), "transaction_id" : " " }
> db.sizedemo.find().forEach(function(doc) { print(Object.bsonsize(doc)); })
43
46
46
46
46
53
Note how the empty string takes up three bytes fewer than double or NumberLong do. The string " " takes the same amount as a number and longer strings take proportionally longer. To guarantee that your updates that $set the transaction group never cause the document to grow, you want to set transaction_group_id to the same size value on initial load as it will be updated to (or larger). This is why I suggested -1 or some other agreed upon "invalid" or "unset" value.
You can check if the updates have been causing document moves by looking at the value in db.serverStatus().metrics.record.moves - this is the number of document moves caused by growth since the last time server was restarted. You can compare this number before and after your process runs (or during) and see how much it goes up relative to the number of documents you are updating.