Too slow to get my wall from the Graph API - Python

I want to know how much time (in seconds) it should take to get my whole Facebook wall as JSON from the Graph API.
It takes about 190 seconds to get all of my wall's posts (roughly 2000 posts across 131 JSON pages).
The Python code is below; it does nothing but read the posts.
Is there any problem in my code, and how can I cut the response time?
import time
import json
import urllib

accessToken = "Secret"
requestURL = "https://graph.facebook.com/me/feed?access_token=" + accessToken
beforeSec = time.time() * 1000
pages = 1
while 1:
    read = urllib.urlopen(requestURL).read()
    read = json.loads(read)
    data = read["data"]
    for i in range(0, len(data)):
        pass  # just reading the posts
    try:
        requestURL = read["paging"]["next"]
        pages += 1
    except KeyError:
        break
afterSec = time.time() * 1000
print afterSec - beforeSec

It depends, of course, on how big the user's wall is. Facebook has released a batch request feature: http://developers.facebook.com/docs/reference/api/batch/
Maybe you can use that?
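Roughly, a batch call bundles several Graph requests into one POST. A minimal sketch, assuming the batch format from the docs linked above; the relative URLs and offsets are only illustrative, and requests is used here instead of urllib:
import json
import requests

access_token = "Secret"
batch = [
    {"method": "GET", "relative_url": "me/feed?limit=100"},
    {"method": "GET", "relative_url": "me/feed?limit=100&offset=100"},
    {"method": "GET", "relative_url": "me/feed?limit=100&offset=200"},
]
response = requests.post(
    "https://graph.facebook.com",
    data={"access_token": access_token, "batch": json.dumps(batch)},
)
for item in response.json():
    # each batch item carries its own JSON response as a string in "body"
    posts = json.loads(item["body"])["data"]
    # process posts here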

Your code is synchronous, so you download the pages one by one.
That is very slow; you could download several pages in parallel instead.
Greenlets are the new hype for parallel work in Python, so give gevent a try; see the sketch below.
This only helps, of course, if you can work out the next page's URL before you have downloaded the entire previous page. See whether you can obtain the next paging URL quickly.
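A minimal gevent sketch, assuming you already have (or can construct) the list of page URLs up front, which is the hard part, since each "next" URL normally comes from the previous response:
import gevent
from gevent import monkey
monkey.patch_all()  # make the standard socket module cooperative

import requests

def fetch(url):
    # download and decode one page of the feed
    return requests.get(url).json()

page_urls = []  # assumed known up front, e.g. built from limit/offset parameters
jobs = [gevent.spawn(fetch, url) for url in page_urls]
gevent.joinall(jobs)
pages = [job.value for job in jobs]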

Related

How can I speed up a get request, or what is a faster method?

I have some code inside an app that is slowing me down way too much, and it's a simple
'get' function...
This portion of the code just finds the location of the PDF on the internet and then extracts it. I thought it was the extraction process that was taking so long, but after some testing I believe it's the 'get' request. I am passing a variable into the URL because there are many different PDFs that the user can indirectly select. I have tried to use Kivy's UrlRequest, but I honestly can't get my head around getting a result from it; I have heard it is faster, though. I have another two 'post' sessions in different functions that work 10 times faster than this one, so I'm not sure what the issue is...
The rest of my program is working just fine; it's just this which sometimes adds upwards of 20-25 seconds to load times (which is unreasonable).
I will include a working extract of the problem below for you to try.
I have found that the first attempt at an "airport_loc" is the slowest; please try swapping out the airport_loc variable with some of these examples:
"YPAD"
"YMLT"
"YPPH"
What can I do differently here to speed it up or simply make it more efficient?
import requests
from html2text import html2text
import re

s = requests.session()

page = s.get('https://www.airservicesaustralia.com/aip/pending/dap/AeroProcChartsTOC.htm')
text = html2text(page.text)

airport_loc = "YSSY"

finding_airport = (re.search(r'.%s.' % re.escape(airport_loc), text)).group()
ap_id_loc = int(text.index(finding_airport)) + 6
ap_id_onward = text[ap_id_loc:]
next_loc = re.search(r'[(]Y\w\w\w[)]', ap_id_onward)
next_loc_stop = text.index(next_loc.group())
ap_id_to_nxt_ap = text[ap_id_loc:next_loc_stop]
needed_text = (html2text(ap_id_to_nxt_ap))

airport_id_less_Y = airport_loc[1:]
app_1 = re.search(r'%sGN.*' % re.escape(airport_id_less_Y), needed_text)
app_2 = re.search(r'%sII.*' % re.escape(airport_id_less_Y), needed_text)
try:
    if app_2.group():
        line_of_chart = (app_2.group())
except AttributeError:  # app_2 may be None if no match
    if app_1.group():
        line_of_chart = (app_1.group())

chart_title = (re.search(r'\w\w\w\w\w\d\d[-]\d*[_][\d\w]*[.]pdf', line_of_chart)).group()

# getting exact pdf now
chart_PDF = ('https://www.airservicesaustralia.com/aip/pending/dap/' + chart_title)

retrieve = s.get(chart_PDF)
content = retrieve.content
print(content)
# from here on is working fine.
I haven't included the code following this because it's not really relevant, I think.
Please help me speed this thing up :(
It still takes 3 seconds for me with just your code.
The latency might come from the server.
To make the request a little faster, I tried editing the HTTP adapter like this:
s.mount('http://', requests.adapters.HTTPAdapter(max_retries=0))
retrieve = s.get(chart_PDF)
It shows a little improvement (3 sec -> 2 sec), but there is a risk of failure.
Using asyncio or another async HTTP library is a better way.
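For illustration, a minimal asyncio/aiohttp sketch for fetching several charts concurrently (this assumes aiohttp is installed, and the URL list is a placeholder for whatever the TOC parsing produces; it is not a drop-in replacement for the scraping logic above):
import asyncio
import aiohttp

async def fetch_pdf(session, url):
    # download one PDF and return its raw bytes
    async with session.get(url) as resp:
        return await resp.read()

async def fetch_all(urls):
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_pdf(session, url) for url in urls]
        return await asyncio.gather(*tasks)

chart_urls = []  # e.g. built from the chart_title values parsed out of the TOC page
pdf_contents = asyncio.run(fetch_all(chart_urls))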

Python Script skipping for loop

I am trying to iterate through some JSON data, but the information is spread over several pages. I don't have a problem working through the first page, but the script just skips over the next set. The weird thing is that it executes fine in debug mode. I'm guessing it's a timing issue around the JSON loads, but I tried putting sleep timers around that code and the issue persisted.
url = apipath + query + apikey
response = requests.get(url)
data = json.loads(response.text)

for x in data["results"]:
    nameList.append(x["name"])
    latList.append(x["geometry"]["location"]["lat"])
    lonList.append(x["geometry"]["location"]["lng"])

pagetoken = "pagetoken=" + data["next_page_token"]
url = apipath + pagetoken + apikey
response = requests.get(url)
data = json.loads(response.text)

for x in data["results"]:
    nameList.append(x["name"])
    latList.append(x["geometry"]["location"]["lat"])
    lonList.append(x["geometry"]["location"]["lng"])
I would venture to guess that data["results"] comes back empty on the second request, in which case the for loop has nothing to iterate over and appears to be skipped. Have you tried putting a print above the loop? Try print(data["results"]) before entering the loop to make sure the data you want exists; if that doesn't show anything useful, try print(data) and see what the program is actually reading.
Well, it did end up being a timing issue. I placed a 2-second timer before the second request and now it loads the data just fine. I guess the second page simply wasn't ready yet.
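To generalize that, here is a hedged sketch that follows next_page_token in a loop and pauses between requests so each token has time to become active (apipath, query, apikey and the result lists reuse the names from the question; the 2-second delay is the value that worked above):
import time
import requests

def fetch_all_pages(apipath, query, apikey, delay=2, max_pages=10):
    # follow next_page_token until it disappears, pausing between requests
    names, lats, lons = [], [], []
    url = apipath + query + apikey
    for _ in range(max_pages):
        data = requests.get(url).json()
        for x in data.get("results", []):
            names.append(x["name"])
            lats.append(x["geometry"]["location"]["lat"])
            lons.append(x["geometry"]["location"]["lng"])
        token = data.get("next_page_token")
        if not token:
            break
        time.sleep(delay)  # give the token time to become active
        url = apipath + "pagetoken=" + token + apikey
    return names, lats, lons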

Python reddit API: efficiently parse all comments in a subreddit

I am trying to code a chatbot and have it scan through all the comments added to a subreddit.
Currently I do so by scanning the last Y comments every X seconds:
handle = praw.Reddit(username=config.username,
                     password=config.password,
                     client_id=config.client_id,
                     client_secret=config.client_secret,
                     user_agent="cristiano corrector v0.1a")

while True:
    last_comments = handle.subreddit(subreddit).comments(limit=Y)
    for comment in last_comments:
        # process comments
        pass
    time.sleep(X)
I am quite unsatisfied with this, as there can be a lot of overlap (which could be solved by tracking comment IDs), and some comments are scanned twice while others are missed. Is there a better way of doing this with the API?
I found a solution using streams in the PRAW API. Details are at https://praw.readthedocs.io/en/latest/tutorials/reply_bot.html
And in my code:
handle = praw.Reddit(username=config.username,
                     password=config.password,
                     client_id=config.client_id,
                     client_secret=config.client_secret,
                     user_agent="cristiano corrector v0.1a")

for comment in handle.subreddit(subreddit).stream.comments():
    # process comments
    pass
This should save some CPU and network load.
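As a side note (worth verifying against the PRAW docs for your version), the comment stream also accepts a skip_existing flag, so that restarting the bot does not replay the most recent historical comments:
for comment in handle.subreddit(subreddit).stream.comments(skip_existing=True):
    # only comments posted after the stream starts are yielded
    pass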

Tumblr API paging bug when fetching followers?

I'm writing a little Python app to fetch the followers of a given Tumblr blog, and I think I may have found a bug in the paging logic.
The Tumblr blog I am testing with has 593 followers, and I know the API limits each call to a block of 20. After successful authentication, the fetch logic looks like this:
offset = 0
while True:
    response = client.followers(blog, limit=20, offset=offset)
    bunch = len(response["users"])
    if bunch == 0:
        break
    j = 0
    while j < bunch:
        print response["users"][j]["name"]
        j = j + 1
    offset += bunch
What I observe is that on the third call into the API, with offset=40, the first name returned is one I saw in the previous group; it's actually the 38th name. This behavior (seeing one or more names I've already seen) repeats randomly from that point on, though not on every call to the API; some calls give me a fresh 20 names. It's repeatable across multiple test runs, and the sequence I see them in matches the one on Tumblr's site, I just see many of them twice.
An interesting coincidence is that the total number of non-unique followers returned is the same as the "Followers" count shown on the blog itself (593), but only 516 of them are unique.
For what it's worth, running the query on Tumblr's console page returns the same results regardless of the language I choose, so I'm not inclined to think this is a bug in the PyTumblr client, but rather something lower down, at the API level.
Any ideas?
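Not an explanation of the duplicates, but a possible workaround sketch: deduplicate by name while paging (client and blog as in the question; whether the API is actually expected to return unique users per page is the open question):
seen = set()
offset = 0
while True:
    response = client.followers(blog, limit=20, offset=offset)
    users = response["users"]
    if not users:
        break
    for user in users:
        if user["name"] not in seen:
            seen.add(user["name"])
            print user["name"]  # first time we see this follower
    offset += len(users)
print "unique followers:", len(seen)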

What's the best performing xml parsing for GAE (Python Version)?

I think we all know this page, but the benchmarks provided date from more than two years ago. So, I would like to know if you can point out the best XML parser around. As I need just an XML parser, the most important thing to me is speed over everything else.
My objective is to process some XML feeds (about 25k of them, each around 4 KB in size) as a daily task. As you probably know, I'm restricted by the 30-second request timeout. So, what's the best parser (Python only) that I can use?
Thanks for your answers.
Edit 01:
@Peter Recore
I will. I'm writing some code now and plan to run some profiling in the near future. Regarding your question, the answer is no: processing takes very little time compared with downloading the actual XML feed. But I can't increase Google's bandwidth, so I can only focus on the download side for now.
My only problem is that I need to do this as fast as possible, because my objective is to get a snapshot of a website's status. And, as the internet is live and people keep adding and changing its data, I need the fastest method, because any data inserted during the "downloading and processing" time span will mess with my statistical analysis.
I used to do this from my own computer, and the process took 24 minutes back then, but now the website has 12 times more information.
I know this doesn't answer my question directly, but it does what I needed.
I remembered that XML is not the only file type I could use, so instead of using an XML parser I chose to use JSON, which is about 2.5 times smaller in size, meaning a decrease in download time. I used simplejson as my JSON library.
I used from google.appengine.api import urlfetch to get the json feeds in parallel:
class GetEntityJSON(webapp.RequestHandler):
    def post(self):
        url = 'http://url.that.generates.the.feeds/'
        if self.request.get('idList'):
            idList = self.request.get('idList').split(',')
            try:
                asyncRequests = self._asyncFetch([url + id + '.json' for id in idList])
            except urlfetch.DownloadError:
                pass  # dealt with time out errors (#5) as these were very frequent
            for result in asyncRequests:
                if result.status_code == 200:
                    entityJSON = simplejson.loads(result.content)
                    # Filled a database entity with some json info. It goes like this:
                    # entity = Entity(
                    #     name = entityJSON['name'],
                    #     dateOfBirth = entityJSON['date_of_birth']
                    # ).put()
        self.redirect('/')

    def _asyncFetch(self, urlList):
        rpcs = []
        for url in urlList:
            rpc = urlfetch.create_rpc(deadline=10)
            urlfetch.make_fetch_call(rpc, url)
            rpcs.append(rpc)
        return [rpc.get_result() for rpc in rpcs]
I tried getting 10 feeds at a time, but most of the time an individual feed raised DownloadError #5 (time out). Then I increased the deadline to 10 seconds and started getting 5 feeds at a time.
But still, fetching 25k feeds 5 at a time means 5k calls. With a queue that can spawn 5 tasks a second, the total task time should come to about 17 minutes in the end.
