Our Python application serves around 2 million API requests per day. We have a new business requirement to generate a daily report containing the count of unique requests and responses.
We would like to use Redis for queuing all the requests & responses.
Another worker instance will retrieve the above data from the Redis queue and process it.
The processed results will be persisted to the database.
The simplest option is to use LPUSH and RPOP, but RPOP returns only one value at a time, which will hurt performance. Is there any way to do a bulk pop from Redis?
Other suggestions for the scenario would be highly appreciated.
A simple solution would be to use Redis pipelining.
In a single request you can send multiple RPOP commands.
Most Redis clients support it. In Python with redis-py it looks like this:
import redis

r = redis.Redis()

pipe = r.pipeline()

# The following RPOP commands are buffered client-side
pipe.rpop('requests')
pipe.rpop('requests')
pipe.rpop('requests')
pipe.rpop('requests')

# execute() sends all buffered commands to the server in one round trip and
# returns a list of responses, one for each command.
responses = pipe.execute()
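To turn this into a bulk pop, you can wrap the pipeline in a small helper that issues N RPOPs per round trip. Below is a minimal sketch built on the pipelining idea above; the helper name bulk_rpop and the batch size of 100 are just illustrative, not part of redis-py:

import redis

r = redis.Redis()

def bulk_rpop(conn, key, batch_size=100):
    """Pop up to batch_size items from the tail of a list in one round trip."""
    pipe = conn.pipeline()
    for _ in range(batch_size):
        pipe.rpop(key)
    # RPOP returns None once the list is empty, so filter those out.
    return [item for item in pipe.execute() if item is not None]

# Example: drain the 'requests' queue in batches of 100.
while True:
    batch = bulk_rpop(r, 'requests', 100)
    if not batch:
        break
    # process(batch) ...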
You can approach this from a different angle. Your requirement is:
requirement ... to generate a report containing the count of unique requests and responses per day.
Rather than storing requests in lists and post-processing the results, why not use Redis features that solve the actual requirement and avoid the bulk LPUSH/RPOP problem altogether?
If all you want is to record unique counts, you may want to consider using sorted sets.
It could go like this:
Collect the request statistics
import datetime

# Collect the request statistics in a sorted set.
# The key includes the date so we can do "by date" stats, e.g. requests:2024-01-31.
key = 'requests:' + datetime.date.today().isoformat()
# redis-py 3.x signature: zincrby(name, amount, value); assumes r = redis.Redis()
r.zincrby(key, 1, request)
Report request statistics
You can use ZSCAN to iterate over all members in batches, but the results are unordered.
You can use ZRANGE to get all members in one call, ordered by score.
Python code:
# ZSCAN: iterate over all members of the set in batches of about 10.
# The iteration order is undefined.
# zscan_iter yields (member, score) tuples.
batch_size = 10
for member, score in r.zscan_iter(key, match=None, count=batch_size):
    print(member, '-->', score)
# ZRANGE: get all members of the set, ordered by score.
# max_rank = -1 means "no upper bound".
min_rank = 0
max_rank = -1
for member, score in r.zrange(key, min_rank, max_rank, desc=False, withscores=True):
    print(member, '-->', score)
Benefits of this approach
Solves the actual requirement: reports the count of unique requests by day.
No need to post-process anything.
Additional queries like "top requests" work out of the box :) (see the sketch below)
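For illustration, a rough sketch of the daily report built on these sorted sets; the key format follows the example above and the top-10 cutoff is arbitrary:

import datetime
import redis

r = redis.Redis(decode_responses=True)
key = 'requests:' + datetime.date.today().isoformat()

# Count of unique requests for the day.
print('unique requests:', r.zcard(key))

# Top 10 most frequent requests, highest count first.
for member, score in r.zrevrange(key, 0, 9, withscores=True):
    print(member, '-->', int(score))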
Another approach would be to use the HyperLogLog data structure.
It was designed specifically for this kind of use case.
It allows counting unique items with a low error margin (0.81%) and very low memory usage.
Using HLL is really simple:
PFADD myHll "<request1>"
PFADD myHll "<request2>"
PFADD myHll "<request3>"
PFADD myHll "<request4>"
Then to get the count:
PFCOUNT myHll
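In Python with redis-py the same idea looks roughly like this; the per-day key name is just an illustration:

import datetime
import redis

r = redis.Redis()
key = 'unique_requests:' + datetime.date.today().isoformat()

# Add each observed request to today's HyperLogLog.
r.pfadd(key, '<request1>', '<request2>', '<request3>', '<request4>')

# Approximate count of distinct requests (error margin about 0.81%).
print(r.pfcount(key))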
The actual question was about Redis lists. You can use LRANGE to get all values in a single call; here is a solution:
import redis

r_server = redis.Redis("localhost")

r_server.rpush("requests", "Adam")
r_server.rpush("requests", "Bob")
r_server.rpush("requests", "Carol")

# LRANGE returns the values without removing them from the list.
print(r_server.lrange("requests", 0, -1))
print(r_server.llen("requests"))
print(r_server.lindex("requests", 1))
Related
I am trying to implement the pull model to query the change feed using the Azure Cosmos Python SDK. I found that, to parallelise the querying process, the official documentation mentions obtaining FeedRange values and creating a FeedIterator to iterate through each range of partition key values obtained from the FeedRange.
Currently my code snippet to query the change feed looks like this, and it is pretty straightforward:
# Function to get items from the change feed based on a condition.
def get_response(container_client, condition):
    # Historical data read
    if condition:
        response = container_client.query_items_change_feed(
            is_start_from_beginning=True,
            # partition_key_range_id=0
        )
    # Reading from a checkpoint
    else:
        response = container_client.query_items_change_feed(
            is_start_from_beginning=False,
            continuation=last_continuation_token
        )
    return response
The problem with this approach is the efficiency when reading all the items from the beginning (historical data read). I tried this method with a pretty small dataset of 500 items and the response took around 60 seconds. When dealing with millions or even billions of items, the response might take too long to return.
Would querying change feed parallelly for each partition key range save time?
If yes, how to get PartitionKeyRangeId in Python SDK?
Are there any problems I need to consider when implementing this?
I hope I make sense!
I have to get data from all members of a list of Telegram chats – groups and supergroups –, but, as Pyrogram documentation alerts, it is only possible to get a total of 10,000 ChatMember results in a single query. Pyrogram's iter_chat_members method is limited to it and does not provide an offset parameter or some kind of pagination handling. So I tried to get 200-sized chunks of data with its get_chat_members method, but after the 50th chunk, which corresponds to the 10,000th ChatMember object, it starts to give me empty results. The draft code I used for testing is as follows:
from pyrogram import Client

def get_chat_members(app, target, offset=0, step=200):
    total = app.get_chat_members_count(target)
    itrs = (total // step) + 1
    members_list = []
    itr = 1
    while itr <= itrs:
        members = app.get_chat_members(target, offset)
        members_list.append(members)
        offset += step
        itr += 1
    return members_list

app = Client("my_account")

with app:
    results = get_chat_members(app, "example_chat_entity")
    print(results)
I thought that, even if neither of these methods gives me the full chat member data, there should be a workaround, given that what Pyrogram's documentation says about this limit refers to a single query. I wonder, then, if there is a way to do more than one query, without flooding the API, and without losing the offset state. Am I missing something, or is it impossible due to an API limitation?
This is a Server Limitation, not one of Pyrogram itself. The Server simply does not yield any more information after ~10k members. There is no way that a user would need to know detailed information about this many members anyway.
I have a DynamoDB table, say data. This table has 400k items. Each item has 4 fields:
id (string) this is my partition key
status (Y/N)
date_added
source
Right now all items have a status = "Y". How can I update all items and set the status to "N" for all 400k items irrespective of the key or any condition?
In MySQL, an equivalent statement would be -
UPDATE data SET status = 'N';
I am looking to do it either through the command line or, preferably, in Python using boto3.
There is no easy or cheap way to do what you want to do. What you'll basically need to do is to read and write the entire database:
write:
If you know the key of a single item, you can do an UpdateItem request with an UpdateExpression of "SET status = :n". This will only modify the "status" attribute (leaving the rest unchanged), but the cost you incur (or the provisioned throughput you use) will be the cost of writing the entire item. So the sum of all these operations will be the cost of re-writing the entire database.
You should add to the above UpdateItem a ConditionExpression that will only update the item if it actually still exists (you can use an attribute_exists() condition on its key attribute to verify this). This allows your workload to keep deleting items while these changes are running, without the updates re-creating deleted items (see the sketch below).
Before starting this change process, change your client code to write new items with status = N. The change process may miss these new items, but that's fine, since they are already created with status = N.
You can't use BatchWriteItem (batch_writer() in boto3) to modify a group of items together, because this batch operation can only replace items, not modify an attribute of existing items. In any case, BatchWriteItem would not reduce the cost (batches cost the same as the individual requests they contain).
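Here is a minimal boto3 sketch of that conditional UpdateItem, assuming the table name data and the id key from the question; status is a reserved word in DynamoDB expressions, so it goes through a name placeholder:

import boto3

client = boto3.client('dynamodb', region_name='<ddb-region>')

def set_status_n(item_id):
    """Set status = 'N' on one item, but only if the item still exists."""
    # If the item was deleted in the meantime, this raises a ClientError with
    # code ConditionalCheckFailedException, which can simply be ignored.
    client.update_item(
        TableName='data',
        Key={'id': {'S': item_id}},
        UpdateExpression='SET #s = :n',
        # Only touch items that still exist, so concurrent deletes are not undone.
        ConditionExpression='attribute_exists(#id)',
        ExpressionAttributeNames={'#s': 'status', '#id': 'id'},
        ExpressionAttributeValues={':n': {'S': 'N'}},
    )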
read:
To get a list of all extant keys in the database (needed for the above updates), you need to use a Scan operation, projecting only the key attributes since you don't need the data. Unfortunately, the cost to you will be the same as reading the entire item, not just its key. So the sum of the cost of all these Scan operations will be that of reading the entire database.
If you are using provisioned capacity for this table, you may be able to use whatever excess capacity you have that is not used by client requests to do this change slowly, in the background, basically for "free".
Whether or not this makes sense in your case really depends on how much excess capacity (both read and write!) you have provisioned. If you do this, you'll need to watch out not to use too much capacity for this background operation and hurt your real users - you'll need to have some sort of controller that notices capacity-exceeded errors and reduce the amount of capacity used by the background process.
If you actually have a lot of excess provisioned capacity that you've already paid for, you can do this background operation as quickly as you want! The read part, a Scan, can be done in parallel as quickly as you want (using the "parallel scan" feature), and the write part for different keys can also, obviously, be done in parallel.
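And a minimal sketch of the read side, using a keys-only projection and the parallel scan feature; the segment count of 4, the id key name, and the set_status_n helper from the sketch above are assumptions for illustration:

import boto3

client = boto3.client('dynamodb', region_name='<ddb-region>')

def scan_keys(segment, total_segments=4):
    """Yield the id of every item in one scan segment, fetching only the key."""
    paginator = client.get_paginator('scan')
    pages = paginator.paginate(
        TableName='data',
        ProjectionExpression='#id',
        ExpressionAttributeNames={'#id': 'id'},
        Segment=segment,
        TotalSegments=total_segments,
    )
    for page in pages:
        for item in page['Items']:
            yield item['id']['S']

# Each segment can run in its own worker; segment 0 shown here.
for item_id in scan_keys(segment=0):
    set_status_n(item_id)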
The following code uses the batch_write_item DynamoDB API to rewrite items in batches of 25, the maximum number of requests batch_write_item accepts in a single call. You might need to tweak this number if your items are large.
Warning: this is just a proof-of-concept example; use it at your own risk.
import boto3


def update_status(item):
    # Overwrite the status attribute on the scanned item.
    item['status'] = {'S': 'N'}
    return item


client = boto3.client('dynamodb', region_name='<ddb-region>')

paginator = client.get_paginator('scan')
operation_parameters = {
    'TableName': '<ddb-table-name>',
    'PaginationConfig': {
        'PageSize': 25
    }
}
page_iterator = paginator.paginate(**operation_parameters)

for page in page_iterator:
    # Each PutRequest replaces the whole item, so the full item from the scan
    # page is written back with only the status attribute changed.
    response = client.batch_write_item(RequestItems={
        '<ddb-table-name>': [
            {
                'PutRequest': {
                    'Item': update_status(item)
                }
            }
            for item in page['Items']
        ]
    })
    # Note: any UnprocessedItems in the response are not retried here.
    print(response)
I developed a software with PyQt and sqlite to manage scientific articles. Each article is stored in the sqlite database, and comes from a particular journal.
Sometimes, I need to perform some verifications on the articles of a journal. So I build two lists: one containing the DOIs of the articles (a DOI is just a unique id for an article), and one containing booleans, True if the article is ok, False if it is not:
def listDoi(self, journal_abb):
    """Function to get the DOIs from the database.
    Also returns a list of booleans to check if the data are complete"""

    list_doi = []
    list_ok = []

    query = QtSql.QSqlQuery(self.bdd)
    query.prepare("SELECT * FROM papers WHERE journal=?")
    query.addBindValue(journal_abb)
    query.exec_()

    while query.next():
        record = query.record()
        list_doi.append(record.value('doi'))

        if record.value('graphical_abstract') != "Empty":
            list_ok.append(True)
        else:
            list_ok.append(False)

    return list_doi, list_ok
This function returns the two lists. The lists can contain ~2000 items each. After that, to check if an article is ok, I just check if it is in the two lists.
EDIT: I also need to check if an article is only in list_doi.
So I wonder, because performance matters here: which is faster/better/more economical:
build the two lists, and check if the article is present in both lists
write the function another way, checkArticle(doi_article), where the function performs an SQL query for each article
What about the speed and the space in RAM ? Will the results be different if there are few items, or a lot of them ?
Use time.perf_counter() to determine how long this process takes currently.
import time

time_start = time.perf_counter()
# your code here
print(time.perf_counter() - time_start)
Based on that, if it is going too slowly, you can try each of your options and time them as well to look for an improvement in performance. As for checking the RAM usage, a simple way is this:
import os
import psutil

process = psutil.Process(os.getpid())
# Resident memory usage in MB (memory_info().rss is reported in bytes).
print(process.memory_info().rss / float(2 ** 20))
For a more in-depth memory usage check, look here: https://stackoverflow.com/a/110826/3841261 Always have a way to objectively measure when looking to improve speed/RAM usage/etc.
I would execute one SQL query that finds the articles that are OK in one go (perhaps in a function called find_articles() or something).
Think of it this way: why do something twice (copy all those rows and work with them) when you could do it once?
You basically want to execute something like this:
SELECT * from papers where (PAPERID in OTHERTABLE and OTHER RESTRAINT = "WHATEVER")
That's obviously just pseudocode, but I think you can figure it out.
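As a rough sketch of that idea against the schema from the question (a papers table with journal, doi and graphical_abstract columns), a single query can return each DOI together with its ok flag, so no second list or per-article query is needed; the method name doiStatus is hypothetical:

def doiStatus(self, journal_abb):
    """Return a dict mapping each DOI to True/False (data complete or not)."""
    status = {}

    query = QtSql.QSqlQuery(self.bdd)
    query.prepare("SELECT doi, graphical_abstract FROM papers WHERE journal=?")
    query.addBindValue(journal_abb)
    query.exec_()

    while query.next():
        record = query.record()
        status[record.value('doi')] = record.value('graphical_abstract') != "Empty"

    return status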
I'm trying to use GeoModel python module to quickly access geospatial data for my Google App Engine app.
I just have a few general questions for issues I'm running into.
There are two main methods, proximity_fetch and bounding_box_fetch, that you can use to return queries. They actually return a result set, not a filtered query, which means you need to fully prepare a filtered query before passing it in. It also prevents you from iterating over the query set, since the results are already fetched, and you don't have the option to pass an offset into the fetch.
Short of modifying the code, can anyone recommend a solution for specifying an offset into the query? My problem is that I need to check each result against a variable to see if I can use it, otherwise throw it away and test the next. I may run into cases where I need to do an additional fetch, but starting with an offset.
You can also work directly with the location_geocells of your model.
from geospatial import geomodel, geocell, geomath

# query is a db.GqlQuery
# location is a db.GeoPt
# A resolution of 4 is a box of roughly 150 km
bbox = geocell.compute_box(geocell.compute(geo_point.location, resolution=4))
cell = geocell.best_bbox_search_cells(bbox, geomodel.default_cost_function)
query.filter('location_geocells IN', cell)

# Keep only results within 100 km.
FETCHED = 200
DISTANCE = 100

def _func(x):
    x.dist = geomath.distance(geo_point.location, x.location)
    return x.dist

results = sorted(query.fetch(FETCHED), key=_func)
results = [x for x in results if x.dist <= DISTANCE]
There's no practical way to do this, because a call to geoquery devolves into multiple datastore queries, which it merges together into a single result set. If you were able to specify an offset, geoquery would still have to fetch and discard all the first n results before returning the ones you requested.
A better option might be to modify geoquery to support cursors, but each query would have to return a set of cursors, not a single one.