Find Maximum Index in Google Datastore (Pagination in Blog System) - python

I have a series of blog posts as entities. I will receive a URL that looks like this: /blog/page/1. In this case I would like to access the 5 most recent posts; in the case of /blog/page/2, I want the 6th through 10th most recent.
So allow me to ask an X-Y question, because I think this is the only way:
How do I find the maximum of a value in numerous entities with Google Cloud Platform Datastore? (I'm using ndb)
I can give each entity an ID value, and then fetch 5 from a query where ID < maxIndex - page * 5 sorted by ID.
But how do I find maxIndex? Do I fetch 1 from a query ordered by ID, find its ID, and then run the previous operation? That seems somewhat slow to do for every pageview.
How can I either A) Find the max index quickly or B) Implement pagination otherwise?
Thanks!

For cursors, you can store the cursor in a Datastore entity and send that entity's key back and forth. Obscuring the cursor by sending it via POST requests is another option.
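If you go the cursor route, here is a minimal sketch using ndb's fetch_page; the BlogPost model and the idea of handing the urlsafe cursor string to the client (rather than storing it in an entity) are assumptions for illustration, not part of this answer:
# Minimal cursor-based paging sketch (BlogPost and the urlsafe cursor
# round-trip are assumptions for illustration).
from google.appengine.datastore.datastore_query import Cursor
from google.appengine.ext import ndb

class BlogPost(ndb.Model):
    created = ndb.DateTimeProperty(auto_now_add=True)

def fetch_blog_page(cursor_str=None, page_size=5):
    cursor = Cursor(urlsafe=cursor_str) if cursor_str else None
    query = BlogPost.query().order(-BlogPost.created)
    posts, next_cursor, more = query.fetch_page(page_size, start_cursor=cursor)
    # Hand next_cursor.urlsafe() back to the client (or store it in a
    # Datastore entity, as suggested above) so the next request can resume.
    return posts, next_cursor.urlsafe() if (more and next_cursor) else None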
To answer your original question: to get the top entry, run a sorted query with a limit of 1. A regular query reads full entities, so do a keys-only query to get the ID (keys-only queries are free), then get the actual posts. So something like this:
class IndexPage(webapp2.RequestHandler):
    def get(self):
        maxIndex = Posts.query().order(-Posts.PostIndex).fetch(limit=1, keys_only=True)[0].get().PostIndex
        page_number = 0
        post_lists = getPageResults(maxIndex, page_number)
        while len(post_lists) > 1:
            self.response.write("====================PAGE NUMBER %i===================</br>" % page_number)
            for post in post_lists:
                self.response.write(str(post.get().PostIndex) + "</br>")
            page_number += 1
            post_lists = getPageResults(maxIndex, page_number)

def getPageResults(maxIndex, page):
    index_range = (maxIndex - (page * 5))
    post_index_list = range(index_range, index_range - 5, -1)
    return Posts.query(Posts.PostIndex.IN(post_index_list)).order(-Posts.PostIndex).fetch(limit=5, keys_only=True)
Keep in mind I threw this together in a few minutes to illustrate using keys_only and the other points I mentioned above.

Related

Is there a workaround to the 10,000 Telegram server query limit when trying to get all chat members' data with Pyrogram/Python?

I have to get data from all members of a list of Telegram chats (groups and supergroups), but, as the Pyrogram documentation warns, it is only possible to get a total of 10,000 ChatMember results in a single query. Pyrogram's iter_chat_members method is limited to that and does not provide an offset parameter or any kind of pagination handling. So I tried to get 200-sized chunks of data with its get_chat_members method, but after the 50th chunk, which corresponds to the 10,000th ChatMember object, it starts giving me empty results. The draft code I used for testing is as follows:
from pyrogram import Client

def get_chat_members(app, target, offset=0, step=200):
    total = app.get_chat_members_count(target)
    itrs = (total // step) + 1
    members_list = []
    itr = 1
    while itr <= itrs:
        members = app.get_chat_members(target, offset)
        members_list.append(members)
        offset += step
        itr += 1
    return members_list

app = Client("my_account")

with app:
    results = get_chat_members(app, "example_chat_entity")
    print(results)
I thought that, even though neither of these methods gives me the full chat member data, there should be a workaround, given that the limit Pyrogram's documentation mentions applies to a single query. I wonder, then, whether there is a way to do more than one query, without flooding the API and without losing the offset state. Am I missing something, or is this impossible due to an API limitation?
This is a Server Limitation, not one of Pyrogram itself. The Server simply does not yield any more information after ~10k members. There is no way that a user would need to know detailed information about this many members anyway.

Filtering data in python

I am working on a web crawler in Python that gathers information on posts by users on a site and compares their scores for posts that all provided users participate in. It is currently structured so that I receive the following data:
results is a dictionary indexed by username; each value is a dictionary of that user's history as post: points key-value pairs.
common is a list that starts with all the posts in the dictionary of the first user in results. This list should be filtered down to only the posts all users have in common.
points is a dictionary indexed by username that keeps a running total of points on shared posts.
My filtering code is below:
common = list(results.values()[0].keys())

for user in results:
    for post_hash in common:
        if post_hash not in results[user]:
            common.remove(post_hash)
        else:
            points[user] += results[user][post_hash]
The issue I'm encountering is that this doesn't actually filter out posts that aren't shared, and thus, doesn't provide accurate point values.
What am I doing wrong with my structure, and is there any easier way to find only the common posts?
I think you may have two issues:
Using a list for common means that when you remove an item via common.remove, it will only remove the first matching item it finds (there could be more).
You're not only adding points for posts shared by all users; you're adding points as you encounter each user, before you know whether that post is shared by everyone or not.
Without some actual data to play with, it's a little difficult to write working code, but try this:
# this should give us a list of posts shared by all users
common = set.intersection(*[set(k.keys()) for k in results.values()])

# there's probably a more efficient (functional) way of summing the points
# by user instead of looping, but simple is good.
for user in results:
    for post_hash in common:
        points[user] += results[user][post_hash]
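For illustration, a tiny made-up results dict (the usernames and post hashes are hypothetical) shows what the intersection produces:
results = {
    "alice": {"post_a": 3, "post_b": 1},
    "bob":   {"post_a": 2, "post_c": 5},
}
points = {user: 0 for user in results}

# Only "post_a" is shared by every user, so it is the only post counted.
common = set.intersection(*[set(k.keys()) for k in results.values()])
for user in results:
    for post_hash in common:
        points[user] += results[user][post_hash]
# points is now {"alice": 3, "bob": 2}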
from collections import Counter
from functools import reduce

posts = []
# Create a list of all the post hashes across all users
for p in results.values():
    posts.extend(p.keys())

# Use Counter to build a dictionary-like object where the key is the
# post hash and the value is the number of occurrences
posts = Counter(posts)

for user in results:
    # Sum the points for only the posts that show up more than once.
    points[user] = reduce(lambda x, y: x + y,
                          (results[user][post] for post in results[user] if posts[post] > 1),
                          0)
import functools

iterable = (v.keys() for v in results.values())
common = functools.reduce(lambda x, y: x & y, iterable)
points = {user: sum(posts[post] for post in common) for user, posts in results.items()}
See if this works.

Tumblr API paging bug when fetching followers?

I'm writing a little python app to fetch the followers of a given tumblr, and I think I may have found a bug in the paging logic.
The tumblr I am testing with has 593 followers, and I know the API returns followers in blocks limited to 20 per call. After successful authentication, the fetch logic looks like this:
offset = 0
while True:
    response = client.followers(blog, limit=20, offset=offset)
    bunch = len(response["users"])
    if bunch == 0:
        break
    j = 0
    while j < bunch:
        print response["users"][j]["name"]
        j = j + 1
    offset += bunch
What I observe is that on the third call into the API, with offset=40, the first name returned is one I saw in the previous group; it's actually the 38th name. This behavior (seeing one or more names I've seen before) repeats randomly from that point on, though not in every call to the API. Some calls give me a fresh 20 names. It's repeatable across multiple test runs, and the sequence I see them in is the same as on Tumblr's site; I just see many of them twice.
An interesting coincidence is that the total number of non-unique followers returned is the same as what the "Followers" count indicates on the blog itself (593), but only 516 of them are unique.
For what it's worth, running the query on Tumblr's console page returns the same results regardless of the language I choose, so I'm not inclined to think this is a bug in the PyTumblr client, but something lower, at the API level.
Any ideas?
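One way to quantify the duplication while paging (a sketch reusing the same client.followers call as above, not a fix for the underlying issue) is to collect the names in a set and compare counts:
seen = set()
total = 0
offset = 0
while True:
    response = client.followers(blog, limit=20, offset=offset)
    users = response["users"]
    if not users:
        break
    for user in users:
        total += 1
        seen.add(user["name"])
    offset += len(users)

print "fetched %d names, %d unique" % (total, len(seen))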

Redis as a Queue - Bulk Retrieval

Our Python application serves around 2 million API requests per day. We got a new requirement from our business to generate a report that should contain the count of unique requests and responses every day.
We would like to use Redis for queuing all the requests & responses.
Another worker instance will retrieve the above data from Redis queue and process it.
The processed results will be persisted to the database.
The simplest option is to use LPUSH and RPOP, but RPOP returns one value at a time, which hurts performance. Is there any way to do a bulk pop from Redis?
Other suggestions for the scenario would be highly appreciated.
A simple solution would be to use Redis pipelining.
In a single request you are allowed to perform multiple RPOP instructions.
Most Redis drivers support it. In Python with redis-py it looks like this:
pipe = r.pipeline()
# The following RPOP commands are buffered
pipe.rpop('requests')
pipe.rpop('requests')
pipe.rpop('requests')
pipe.rpop('requests')
# the EXECUTE call sends all buffered commands to the server, returning
# a list of responses, one for each command.
pipe.execute()
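Building on that, a small helper that drains the queue in fixed-size chunks could look like this (the function name and chunk size are my own choices, not from the answer above):
def bulk_rpop(r, key, chunk_size=100):
    # Pop up to chunk_size items from a Redis list in one round trip.
    pipe = r.pipeline()
    for _ in range(chunk_size):
        pipe.rpop(key)
    # execute() returns one result per buffered RPOP; None means the
    # list ran out of items.
    return [item for item in pipe.execute() if item is not None]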
You can approach this from a different angle. Your requirement is:
requirement ... to generate the report which should contain the count of unique requests and responses every day.
Rather than storing requests in lists and then post-processing the results, why not use Redis features that solve the actual requirement and avoid the bulk LPUSH/RPOP problem altogether?
If all we want is to record the unique counts, then you may want to consider using sorted sets.
This may go like this:
Collect the request statistics
# Collect the request statistics in the sorted set.
# The key includes the date so we can do "by date" stats.
key = 'requests:date'
r.zincrby(key, request, 1)
Report request statistics
You can use ZSCAN to iterate over all members in batches, but the result is unordered.
You can use ZRANGE to get all members in one go (or in ranges), ordered by score.
Python code:
# ZSCAN: iterate over all members of the set in batches of about 10.
# The iteration order is undefined.
# zscan_iter returns (member, score) tuples.
batchSize = 10
for memberTuple in r.zscan_iter(key, match=None, count=batchSize):
    member = memberTuple[0]
    score = memberTuple[1]
    print str(member) + ' --> ' + str(score)

# ZRANGE: get all members of the set, ordered by score.
# Here maxRank=-1 means "no max".
minRank = 0
maxRank = -1
for memberTuple in r.zrange(key, minRank, maxRank, desc=False, withscores=True):
    member = memberTuple[0]
    score = memberTuple[1]
    print str(member) + ' --> ' + str(score)
Benefits of this approach
Solves the actual requirement - reports on the count of unique requests by day.
No need to post-process anything.
Can do additional queries like "top requests" out of the box :)
Another approach would be to use the HyperLogLog data structure.
It was designed especially for this kind of use case.
It allows counting unique items with a low error margin (0.81%) and very low memory usage.
Using HLL is really simple:
PFADD myHll "<request1>"
PFADD myHll "<request2>"
PFADD myHll "<request3>"
PFADD myHll "<request4>"
Then to get the count:
PFCOUNT myHll
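For consistency with the redis-py snippets above, the same calls from Python look roughly like this (the key name myHll is just an example):
# Add each request as it is seen; re-adding the same item does not
# change the count.
r.pfadd('myHll', request)

# Approximate number of unique requests (within ~0.81% error).
unique_requests = r.pfcount('myHll')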
The original question was about a Redis list. You can use LRANGE to get all values in a single call; below is a solution:
import redis
r_server = redis.Redis("localhost")
r_server.rpush("requests", "Adam")
r_server.rpush("requests", "Bob")
r_server.rpush("requests", "Carol")
print r_server.lrange("requests", 0, -1)
print r_server.llen("requests")
print r_server.lindex("requests", 1)

Sorting entities and filtering ListProperty without incurring in exploding indexes

I'm developing a simple blogging/bookmarking platform and I'm trying to add a tags-explorer/drill-down feature à la Delicious to allow users to filter posts by specifying a list of tags.
Posts are represented in the datastore with this simplified model:
class Post(db.Model):
    title = db.StringProperty(required=True)
    link = db.LinkProperty(required=True)
    description = db.StringProperty(required=True)
    tags = db.ListProperty(str)
    created = db.DateTimeProperty(required=True, auto_now_add=True)
Post's tags are stored in a ListProperty and, in order to retrieve the list of posts tagged with a specific list of tags, the Post model exposes the following static method:
@staticmethod
def get_posts(limit, offset, tags_filter=[]):
    posts = Post.all()
    for tag in tags_filter:
        if tag:
            posts.filter('tags', tag)
    return posts.fetch(limit=limit, offset=offset)
This works well, although I've not stressed it too much.
The problem arises when I try to add a sort order to the get_posts method to keep the results ordered by "-created" date:
@staticmethod
def get_posts(limit, offset, tags_filter=[]):
    posts = Post.all()
    for tag in tags_filter:
        if tag:
            posts.filter('tags', tag)
    posts.order("-created")
    return posts.fetch(limit=limit, offset=offset)
Adding the sort order requires a composite index that covers the tag filters and the sort, leading to the dreaded exploding indexes problem.
One last thing that makes this more complicated is that the get_posts method should provide some pagination mechanism.
Do you know any Strategy/Idea/Workaround/Hack to solve this problem?
Queries involving keys use indexes just like queries involving properties. Queries on keys require custom indexes in the same cases as with properties, with a couple of exceptions: inequality filters or an ascending sort order on key do not require a custom index, but a descending sort order on Entity.KEY_RESERVED_PROPERTY_key_ does.
So use a sortable date string for the primary key of the entity:
class Post(db.Model):
    title = db.StringProperty(required=True)
    link = db.LinkProperty(required=True)
    description = db.StringProperty(required=True)
    tags = db.ListProperty(str)
    created = db.DateTimeProperty(required=True, auto_now_add=True)

    @classmethod
    def create(cls, *args, **kw):
        kw.update(dict(key_name=inverse_microsecond_str() + disambig_chars()))
        return cls(*args, **kw)

    ...

def inverse_microsecond_str():  # gives a string of 8 characters from ascii 23 to 'z' which sorts in reverse temporal order
    t = datetime.datetime.now()
    inv_us = int(1e16 - (time.mktime(t.timetuple()) * 1e6 + t.microsecond))  # no y2k for >100 yrs
    base_100_chars = []
    while inv_us:
        digit, inv_us = inv_us % 100, inv_us / 100
        base_100_chars = [chr(23 + digit)] + base_100_chars
    return "".join(base_100_chars)
Now, you don't even have to include a sort order in your queries, although it won't hurt to explicitly sort by key.
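For example, a sketch of the tag query using only the ascending key order (this reuses the question's Post model and assumes the key_names were generated by create() above; the tag values are just examples):
# Equality filters on 'tags' plus an ascending __key__ sort; per the
# docs quote above, an ascending sort on key requires no custom index,
# and the inverse-timestamp key_names make the order newest-first.
query = Post.all()
for tag in ['python', 'blog']:
    query.filter('tags =', tag)
posts = query.order('__key__').fetch(limit=5, offset=0)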
Things to remember:
This won't work unless you use the "create" here for all your Posts.
You'll have to migrate old data
No ancestors allowed.
The key is stored once per index, so it is worthwhile to keep it short; that's why I'm doing the base-100 encoding above.
This is not 100% reliable because of the possibility of key collisions. The above code, without disambig_chars, nominally gives reliability of the number of microseconds between transactions, so if you had 10 posts per second at peak times, it would fail 1/100,000. However, I'd shave off a couple orders of magnitude for possible app engine clock tick issues, so I'd actually only trust it for 1/1000. If that's not good enough, add disambig_chars; and if you need 100% reliability, then you probably shouldn't be on app engine, but I guess you could include logic to handle key collisions on save().
What if you inverted the relationship? Instead of a post with a list of tags you would have a tag entity with a list of posts.
class Tag(db.Model):
    tag = db.StringProperty()
    posts = db.ListProperty(db.Key, indexed=False)
To search for tags you would do tags = Tag.all().filter('tag IN', ['python','blog','async'])
This would hopefully give you three Tag entities, each with a list of posts that use that tag. You could then do post_union = set(tags[0].posts).intersection(tags[1].posts, tags[2].posts) to find the set of posts that have all three tags.
Then you could fetch those posts and order them by created (I think). Posts.all().filter('__key__ IN', post_union).order("-created")
Note: This code is off the top of my head, I can't remember if you can manipulate sets like that.
Edit: @Yasser pointed out that you can only do IN queries for < 30 items.
Instead you could have the key name for each post start with the creation time. Then you could sort the keys you retrieved via the first query and just do Posts.get(sorted_posts).
Don't know how this would scale to a system with millions of posts and/or tags.
Edit2: I meant set intersection, not union.
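Pulling those pieces together, a rough sketch of the inverted-relationship lookup (untested; the helper name is mine, and sorting in memory sidesteps the 30-item IN limit mentioned above):
def posts_for_tags(tag_names, limit=5, offset=0):
    tags = Tag.all().filter('tag IN', tag_names).fetch(len(tag_names))
    if len(tags) < len(tag_names):
        return []  # at least one of the requested tags does not exist
    # Keep only the posts that appear under every requested tag.
    post_keys = set(tags[0].posts)
    for tag in tags[1:]:
        post_keys &= set(tag.posts)
    posts = [p for p in db.get(list(post_keys)) if p]
    posts.sort(key=lambda p: p.created, reverse=True)
    return posts[offset:offset + limit]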
This question sounds similar to:
Data Modelling Advice for Blog Tagging system on Google App Engine
Mapping Data for a Google App Engine Blog Application:
parent->child relationships in appengine python (bigtable)
As pointed out by Robert Kluin in the last one, you could also consider using a pattern similar to the "Relation Index" described in this Google I/O presentation.
# Model definitions
class Article(db.Model):
    title = db.StringProperty()
    content = db.StringProperty()

class TagIndex(db.Model):
    tags = db.StringListProperty()

# Tags are child entities of Articles
article1 = Article(title="foo", content="foo content")
article1.put()
TagIndex(parent=article1, tags=["hop"]).put()

# Get all articles for a given tag
tags = db.GqlQuery("SELECT __key__ FROM TagIndex WHERE tags = :1", "hop")
keys = [t.parent() for t in tags]
articles = db.get(keys)
Depending on how many Articles you expect back from the tags query, sorting could either be done in memory or by making the date string representation part of the Article key_name.
Updated with StringListProperty and sorting notes after Robert Kluin's and Wooble's comments on the #appengine IRC channel.
One workaround could be this:
Sort and concatenate a post's tags with a delimiter like | and store them as a StringProperty when storing a post. When you receive the tags_filter, you can sort and concatenate it to create a single StringProperty filter for the posts. Obviously this would be an AND query and not an OR query, but that's what your current code seems to be doing as well.
EDIT: as rightly pointed out, this would only match an exact tag list, not a partial tag list, which is obviously not very useful.
EDIT: what if you model your Post entity with boolean placeholders for tags, e.g. b1, b2, b3, etc.? When a new tag is defined, you can map it to the next available placeholder, e.g. blog=b1, python=b2, async=b3, and keep the mapping in a separate entity. When a tag is assigned to a post, you just switch its equivalent placeholder value to True.
This way when you receive a tag_filter set, you can construct your query from the map e.g.
Post.all().filter("b1",True).filter("b2",True).order('-created')
can give you all the posts which have tags python and blog.
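A minimal sketch of that placeholder mapping (the TagMap entity and helper below are my own invention, and they assume the Post model has been given the boolean properties b1, b2, ... described above):
class TagMap(db.Model):
    tag = db.StringProperty()          # e.g. "python"
    placeholder = db.StringProperty()  # e.g. "b2"

def posts_with_all_tags(tag_names):
    query = Post.all()
    # Translate each tag to its boolean placeholder and AND the filters.
    for mapping in TagMap.all().filter('tag IN', tag_names):
        query.filter('%s =' % mapping.placeholder, True)
    return query.order('-created')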
