Tumblr API paging bug when fetching followers? - python

I'm writing a little Python app to fetch the followers of a given Tumblr blog, and I think I may have found a bug in the paging logic.
The Tumblr I am testing with has 593 followers, and I know the API returns followers in blocks of at most 20 per call. After successful authentication, the fetch logic looks like this:
offset = 0
while True:
    response = client.followers(blog, limit=20, offset=offset)
    bunch = len(response["users"])
    if bunch == 0:
        break
    j = 0
    while j < bunch:
        print response["users"][j]["name"]
        j = j + 1
    offset += bunch
What I observe is that on the third call into the API, with offset=40, the first name returned in the list is one I saw in the previous group. It's actually the 38th name. This behavior (seeing one or more names I've seen before) repeats randomly from that point on, though not in every call to the API. Some calls give me a fresh 20 names. It's repeatable across multiple test runs. The sequence I see them in is the same as on Tumblr's site; I just see many of them twice.
An interesting coincidence is that the total number of non-unique followers returned matches the "Followers" count shown on the blog itself (593), but only 516 of them are unique.
For what it's worth, running the query on Tumblr's console page returns the same results regardless of the language I choose, so I'm not inclined to think this is a bug in the PyTumblr client, but something lower, at the API level.
Any ideas?
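In case it helps, a defensive workaround while the duplication remains unexplained is to deduplicate by name while paging. A rough sketch, reusing the same client and blog objects as above:
seen = set()
offset = 0
while True:
    response = client.followers(blog, limit=20, offset=offset)
    users = response["users"]
    if not users:
        break
    for user in users:
        name = user["name"]
        if name not in seen:  # skip names the API already returned in an earlier block
            seen.add(name)
            print(name)
    offset += len(users)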

Related

Twitter pagination per page limit in downloading user profile Tweets

Here is the code I am using from this link. I have updated the original code as I need the full .json object. But I am having a problem with pagination as I am not getting the full 3200 Tweets.
api = tweepy.API(auth, parser=tweepy.parsers.JSONParser(), wait_on_rate_limit=True)
jsonFile = open(path + filname + '.json', "a+", encoding='utf-8')
page = 1
max_pages = 3200
result_limit = 2
last_tweet_id = False
while page <= max_pages:
    if last_tweet_id:
        tweet = api.user_timeline(screen_name=user,
                                  count=result_limit,
                                  max_id=last_tweet_id - 1,
                                  tweet_mode='extended',
                                  include_retweets=True
                                  )
    else:
        tweet = api.user_timeline(screen_name=user,
                                  count=result_limit,
                                  tweet_mode='extended',
                                  include_retweets=True)
    json_str = json.dumps(tweet, ensure_ascii=False, indent=4)
As per the author, "result_limit and max_pages are multiplied together to get the number of tweets called."
Shouldn't I then get 6400 Tweets by this definition? But the problem is that I am getting the same 2 Tweets 3200 times. I also updated the values to
max_pages=3200
result_limit=5000
You could think of it as an upper bound, so I should at least get 3200 Tweets. But in this case I got 200 Tweets repeated many times (before I terminated the code).
I just want 3200 Tweets per user profile, nothing fancy. Consider that I have a list of 100 users, so I want to do this efficiently. Currently it seems like I am just sending lots of requests and wasting time and resources.
Even if I update the code with a smaller value of max_pages, I am still not sure what that value should be. How am I supposed to know how many Tweets one page covers?
Note: the answer I found elsewhere is not useful, as it has an error at .item(), so please don't mark this as a duplicate.
You don't change last_tweet_id after setting it to False, so only the code in the else block is executing. None of the parameters in that method call change while looping, so you're making the same request and receiving the same response back over and over again.
Also, neither page nor max_pages changes within your loop, so this will loop infinitely.
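A minimal sketch of the corrected loop, assuming the same api, user, path, filname and json import from the question (with JSONParser, user_timeline returns plain dicts, so the last tweet's id can drive max_id):
page = 1
max_pages = 16        # 16 pages x 200 tweets covers the ~3200-tweet timeline cap
result_limit = 200    # the largest count user_timeline accepts per call
last_tweet_id = None

with open(path + filname + '.json', 'a+', encoding='utf-8') as jsonFile:
    while page <= max_pages:
        kwargs = dict(screen_name=user, count=result_limit, tweet_mode='extended')
        if last_tweet_id:
            kwargs['max_id'] = last_tweet_id - 1   # continue just below the oldest tweet seen
        tweets = api.user_timeline(**kwargs)
        if not tweets:
            break                                  # nothing older left to fetch
        jsonFile.write(json.dumps(tweets, ensure_ascii=False, indent=4))
        last_tweet_id = tweets[-1]['id']           # advance the pagination marker
        page += 1                                  # advance the page counter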
I would recommend looking into using tweepy.Cursor instead, as it handles pagination for you.
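A sketch of the Cursor approach, assuming the default ModelParser (Cursor does not always play nicely with JSONParser); each Status model still carries the raw dict in ._json, so you can keep writing plain JSON to your file:
import json
import tweepy

api = tweepy.API(auth, wait_on_rate_limit=True)

raw_tweets = []
for status in tweepy.Cursor(api.user_timeline,
                            screen_name=user,
                            count=200,            # largest page size user_timeline allows
                            tweet_mode='extended').items(3200):
    raw_tweets.append(status._json)               # the raw JSON dict behind each Status

with open(path + filname + '.json', 'a+', encoding='utf-8') as jsonFile:
    json.dump(raw_tweets, jsonFile, ensure_ascii=False, indent=4)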

Is there a workaround to the 10,000 Telegram server query limit when trying to get all chat members' data with Pyrogram/Python?

I have to get data on all members of a list of Telegram chats (groups and supergroups), but, as the Pyrogram documentation warns, it is only possible to get a total of 10,000 ChatMember results in a single query. Pyrogram's iter_chat_members method is limited to this and does not provide an offset parameter or any kind of pagination handling. So I tried to get 200-sized chunks of data with its get_chat_members method, but after the 50th chunk, which corresponds to the 10,000th ChatMember object, it starts giving me empty results. The draft code I used for testing is as follows:
from pyrogram import Client

def get_chat_members(app, target, offset=0, step=200):
    total = app.get_chat_members_count(target)
    itrs = (total // step) + 1
    members_list = []
    itr = 1
    while itr <= itrs:
        members = app.get_chat_members(target, offset)
        members_list.append(members)
        offset += step
        itr += 1
    return members_list

app = Client("my_account")

with app:
    results = get_chat_members(app, "example_chat_entity")
    print(results)
I thought that, even though neither of these methods gives me the full chat member data, there should be a workaround, given that what Pyrogram's documentation says about this limit refers to a single query. I wonder, then, if there is a way to do more than one query, without flooding the API and without losing the offset state. Am I missing something, or is this impossible due to an API limitation?
This is a Server Limitation, not one of Pyrogram itself. The Server simply does not yield any more information after ~10k members. There is no way that a user would need to know detailed information about this many members anyway.
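Within that cap, a cleaner version of the chunked fetch simply stops once the server returns an empty chunk. A sketch, assuming a Pyrogram version where get_chat_members(chat_id, offset, limit) returns a plain list, as the question's code does:
from pyrogram import Client

def fetch_members(app, target, step=200):
    members = []
    offset = 0
    while True:
        chunk = app.get_chat_members(target, offset, limit=step)
        if not chunk:
            break                      # the server stopped yielding results (~10k cap)
        members.extend(chunk)          # flatten the chunks into one list
        offset += len(chunk)
    return members

app = Client("my_account")
with app:
    print(len(fetch_members(app, "example_chat_entity")))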

"StatusCode.DEADLINE_EXCEEDED" error while using bigtable.scan() function

I have millions of articles in Bigtable, and to scan 50,000 articles I have used something like this:
for key, data in mytable.scan(limit=50000):
    print(key, data)
It works fine for a limit of up to 10,000, but as soon as I exceed 10,000 I get this error:
_Rendezvous: <_Rendezvous of RPC that terminated with (StatusCode.DEADLINE_EXCEEDED)
There was a fix for this problem, where the client automatically retries temporary failures like this one. That fix has not been released yet, but will hopefully be released soon.
I had a similar problem where I had to retrieve data from many rows at the same time. What you're using looks like the HBase client; my solution uses the native one, so I'll try to post both: the one I tested and one that might work for you.
I never found an example demonstrating how to simply iterate over the rows as they come using the consume_next() method, which is mentioned here, and I didn't manage to figure it out on my own. Calling consume_all() for too many rows yielded the same DEADLINE_EXCEEDED error.
LIMIT = 10000
previous_start_key = None
# start_key is assumed to be initialised beforehand to wherever you want the scan to begin
while start_key != previous_start_key:
    previous_start_key = start_key
    row_iterator = bt_table.read_rows(start_key=start_key, end_key=end_key,
                                      filter_=filter_, limit=LIMIT)
    row_iterator.consume_all()
    for _row_key, row in row_iterator.rows.items():
        row_key = _row_key.decode()
        if row_key == previous_start_key:  # Avoid repeated processing
            continue
        # do stuff
        print(row)
    start_key = row_key
So basically you can start with whatever start_key, retrieve 10k results, do consume_all(), then retrieve the next batch starting where you left off, and so on, until some reasonable condition applies.
For you it might be something like:
row_start = None
for i in range(5):
    for key, data in mytable.scan(row_start=row_start, limit=10000):
        if key == row_start:  # Avoid repeated processing
            continue
        print(key, data)
    row_start = key
There might be a much better solution, and I really want to know what it is, but this works for me for the time being.

Twitter error code 429 with Tweepy

I am trying to create a project that accesses a Twitter account using the Tweepy API, but I am faced with status code 429. Now, I've looked around and I see that it means I have made too many requests. However, I am only ever asking for 10 tweets at a time, and within those, only one should exist during my testing.
for tweet in tweepy.Cursor(api.search, q='#realtwitchess ', lang=' ').items(10):
    try:
        text = str(tweet.text)
        textparts = str.split(text)  # convert tweet into string array to dissect
        print(text)
        for x, string in enumerate(textparts):
            if (x < len(textparts) - 1):  # prevents error that arises with an incomplete call of the twitter bot to start a game
                if string == "gamestart" and textparts[x + 1][:1] == "#":  # find games
                    otheruser = api.get_user(screen_name=textparts[2][1:])  # drop the # sign (although it might not matter)
                    self.games.append((tweet.user.id, otheruser.id))
                elif (len(textparts[x]) == 4):  # find moves
                    newMove = Move(tweet.user.id, string)
                    print(newMove.getMove())
                    self.moves.append(newMove)
        if tweet.user.id == thisBot.id:  # ignore self tweets
            continue
    except tweepy.TweepError as e:
        print(e.reason)
        sleep(900)
        continue
    except StopIteration:  # stop iteration when last tweet is reached
        break
When the error does appear, it is raised on the first for loop line. The kind of weird part is that it doesn't complain every time, or even at consistent intervals. Sometimes it will work and other times, seemingly randomly, it won't.
We have tried adding longer sleep times in the loop and reducing the item count.
Add wait_on_rate_limit=True on the API call like this:
api = tweepy.API(auth, wait_on_rate_limit=True)
This will make the rest of the code obey the rate limit.
You found the correct information about the error code. In fact, the 429 code is returned when a request cannot be served because the application's rate limit has been exhausted for the resource (from the documentation).
I suppose that your problem regards not the quantity of data but the frequency.
Check the Twitter API rate limits (that are the same for tweepy).
Rate limits are divided into 15 minute intervals. All endpoints require authentication, so there is no concept of unauthenticated calls and rate limits.
There are two initial buckets available for GET requests: 15 calls every 15 minutes, and 180 calls every 15 minutes.
I think you can try to use the API within these limits to avoid the problem.
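If you prefer to handle the limit yourself rather than relying on wait_on_rate_limit (see the update below), a rough sketch for Tweepy 3.x, where tweepy.RateLimitError is raised when a 429 comes back, is to catch it and sleep out the 15-minute window:
import time
import tweepy

def search_with_backoff(api, query, n=10):
    while True:
        try:
            return list(tweepy.Cursor(api.search, q=query).items(n))
        except tweepy.RateLimitError:
            time.sleep(15 * 60)   # wait out the current 15-minute window, then retry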
Update
For the latest versions of Tweepy (from 3.2.0), the wait_on_rate_limit parameter has been introduced.
If set to True, it allows Tweepy to automatically avoid this problem.
From documentation:
wait_on_rate_limit – Whether or not to automatically wait for rate limits to replenish
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
This should help with handling the rate limit.

Find Maximum Index in Google Datastore (Pagination in Blog System)

I have a series of blog posts as entities. I will receive a URL that looks like this: /blog/page/1. I would like to access the most recent 5 posts in this case. In the case of /blog/page/2, I want the 6-10th most recent.
So allow me to do an X-Y, because I think this is the only way:
How do I find the maximum of a value in numerous entities with Google Cloud Platform Datastore? (I'm using ndb)
I can give each entity an ID value, and then fetch 5 from a query where ID < maxIndex - page * 5 sorted by ID.
But how do I find maxIndex? Do I fetch 1 from a query ordered by ID, find its ID, and then run the previous operation? That seems somewhat slow for every page view.
How can I either A) Find the max index quickly or B) Implement pagination otherwise?
Thanks!
For cursors, you can use a Datastore entry to store the cursor and then send the Datastore key back and forth. Obscuring it by sending the data via POST requests is another option.
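A sketch of cursor-based paging with ndb's fetch_page, assuming the same Posts model with a PostIndex property used in the code below; the urlsafe cursor string is what you would hand back and forth between requests:
from google.appengine.datastore.datastore_query import Cursor
from google.appengine.ext import ndb

def get_page(cursor_token=None, page_size=5):
    start = Cursor(urlsafe=cursor_token) if cursor_token else None
    posts, next_cursor, more = (Posts.query()
                                .order(-Posts.PostIndex)
                                .fetch_page(page_size, start_cursor=start))
    token = next_cursor.urlsafe() if (more and next_cursor) else None
    return posts, token   # posts for this page, plus the token for the next one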
To answer your original question: to get the top entry, do a sorted query with a limit of 1. This will actually read all the Datastore entries, so you should do a keys-only query to get the ID (keys-only is free) and then get the actual posts. So something like this:
class IndexPage(webapp2.RequestHandler):
    def get(self):
        maxIndex = Posts.query().order(-Posts.PostIndex).fetch(limit=1, keys_only=True)[0].get().PostIndex
        page_number = 0
        post_lists = getPageResults(maxIndex, page_number)
        while len(post_lists) > 1:
            self.response.write("====================PAGE NUMBER %i===================</br>" % page_number)
            for post in post_lists:
                self.response.write(str(post.get().PostIndex) + "</br>")
            page_number += 1
            post_lists = getPageResults(maxIndex, page_number)

def getPageResults(maxIndex, page):
    index_range = (maxIndex - (page * 5))
    post_index_list = range(index_range, index_range - 5, -1)
    return Posts.query(Posts.PostIndex.IN(post_index_list)).order(-Posts.PostIndex).fetch(limit=5, keys_only=True)
Keep in mind I threw this together in a few minutes to illustrate using keys_only and the other points I mentioned above.
