Python reddit API: efficiently parse all comments in a subreddit - python

I am trying to code a chatbot and have it scan through all the comments added to its subreddit.
Currently I do so by scanning the last Y comments every X seconds:
handle = praw.Reddit(username=config.username,
                     password=config.password,
                     client_id=config.client_id,
                     client_secret=config.client_secret,
                     user_agent="cristiano corrector v0.1a")

while True:
    last_comments = handle.subreddit(subreddit).comments(limit=Y)
    for comment in last_comments:
        pass  # process comments here
    time.sleep(X)
I am quite unsatisfied with this, as there can be a lot of overlap (which can be mitigated by tracking comment IDs), and depending on the timing some comments are scanned twice while others are missed entirely. Is there a better way of doing this with the API?
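For reference, the ID-tracking workaround mentioned above would look roughly like this (a sketch on top of the same polling loop; seen_ids grows without bound here, so a real bot would prune it):

seen_ids = set()

while True:
    for comment in handle.subreddit(subreddit).comments(limit=Y):
        if comment.id in seen_ids:
            continue  # already processed in an earlier pass
        seen_ids.add(comment.id)
        # process the comment here
    time.sleep(X)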

I found a solution that makes use of streams in the PRAW API. Details are in https://praw.readthedocs.io/en/latest/tutorials/reply_bot.html
And in my code:
handle = praw.Reddit(username=config.username,
                     password=config.password,
                     client_id=config.client_id,
                     client_secret=config.client_secret,
                     user_agent="cristiano corrector v0.1a")

for comment in handle.subreddit(subreddit).stream.comments():
    pass  # process comments here
This should save some CPU and network load.
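As a small variation (assuming a reasonably recent PRAW version), the stream can also skip the backlog of already-existing comments so the bot only sees comments posted after it starts:

for comment in handle.subreddit(subreddit).stream.comments(skip_existing=True):
    pass  # process comments here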

Related

How can I speed up a get request, or is there a faster method?

I have some code inside of an app that is slowing me down way too much, and it's a simple 'get' function...
This portion of the code just finds the location of the PDF on the internet, then extracts it. I thought it was the extraction process that was taking so long, but after some testing, I believe it's the 'get' request. I am passing a variable into the URL because there are many different PDFs that the user can indirectly select. I have tried to use Kivy's UrlRequest, but I honestly can't get my head around getting a result from it; I have heard it is faster, though. I have another two 'post' sessions in different functions that work ten times faster than this one, so I'm not sure what the issue is...
The rest of my program is working just fine; it's just this which sometimes adds upwards of 20-25 seconds to load times (which is unreasonable).
I will include a working extract of the problem below for you to try.
I have found that its first attempt at an "airport_loc" is the slowest; please try swapping out the airport_loc variable with some of these examples:
"YPAD"
"YMLT"
"YPPH"
What can I do different here to speed it up or simply make it more efficient?
import requests
from html2text import html2text
import re

s = requests.session()
page = s.get('https://www.airservicesaustralia.com/aip/pending/dap/AeroProcChartsTOC.htm')
text = html2text(page.text)

airport_loc = "YSSY"

finding_airport = (re.search(r'.%s.' % re.escape(airport_loc), text)).group()
ap_id_loc = int(text.index(finding_airport)) + 6
ap_id_onward = text[ap_id_loc:]
next_loc = re.search(r'[(]Y\w\w\w[)]', ap_id_onward)
next_loc_stop = text.index(next_loc.group())
ap_id_to_nxt_ap = text[ap_id_loc:next_loc_stop]
needed_text = html2text(ap_id_to_nxt_ap)

airport_id_less_Y = airport_loc[1:]
app_1 = re.search(r'%sGN.*' % re.escape(airport_id_less_Y), needed_text)
app_2 = re.search(r'%sII.*' % re.escape(airport_id_less_Y), needed_text)

try:
    if app_2.group():
        line_of_chart = app_2.group()
except:
    if app_1.group():
        line_of_chart = app_1.group()

chart_title = (re.search(r'\w\w\w\w\w\d\d[-]\d*[_][\d\w]*[.]pdf', line_of_chart)).group()

# getting exact pdf now
chart_PDF = 'https://www.airservicesaustralia.com/aip/pending/dap/' + chart_title
retrieve = s.get(chart_PDF)
content = retrieve.content
print(content)
# from here on is working fine.
I haven't included the code following this because it's not really relevant I think.
Please help me speed this thing up :(
It still takes 3 seconds for me with just your code, so the latency might come from the server.
To make the request a little faster, I tried editing the HTTP adapter like this:
s.mount('https://', requests.adapters.HTTPAdapter(max_retries=0))
retrieve = s.get(chart_PDF)
It shows a little improvement (3 sec -> 2 sec), but it comes with a risk of failure.
Using asyncio or another async HTTP library is a better approach; see the sketch below.

Is there any way to speed up PRAW's comment parsing?

I'm writing a script to scrape text data (posts and their comments) with praw, and comments add so much time to the download it's ridiculous.
If I only download posts and no comments, it gets ~100 per second, but if I download comments with the posts, it goes down to 1-2 per second, and that's just for top level comments. If I include nested comments, it takes ~5-10 minutes for 1 post (granted the post I tested on was the top post from /r/raskreddit, but still). Here's the method I'm using, please let me know if there's a way to make this any faster!
for top_level_comment in submission.comments.list():
    commentnumber += 1
    comment1 = ...  # long string built from the comment, not important
    savepost.write(comment1)
    for comment in top_level_comment.replies:
        parentvar = comment.parent_id
        parent = reddit.comment(parentvar[3:])
        parentauthor = str(parent.author)  # this lookup triggers an extra API request per reply
        comment2 = ...  # again, a long string, not important
        savepost.write(comment2)
I've also tried the following method, thinking that fewer for statements with comment requests might help, but it didn't:
for comment in submission.comments.list():
    if str(comment.parent_id[:2]) == "t1":
        parent = reddit.comment(comment.parent_id[3:])
        comment2 = ...  # super long string, not important
        if comment.parent_id not in commentidlist:
            commentidlist.append(comment.parent_id)
            comment1 = ...  # again, a long string
            savepost.write(comment1)
            print("Wrote parent comment Id: " + comment.parent_id[3:])
        savepost.write(comment2)
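Most of the slowdown comes from reddit.comment(...) followed by the .author lookup, which costs one extra API request per reply. A sketch of the usual workaround (the submission id below is hypothetical): fetch the whole comment forest up front with replace_more, then resolve parents from a local dictionary instead of asking the API again:

submission = reddit.submission(id="abc123")   # hypothetical submission id
submission.comments.replace_more(limit=None)  # resolve all "MoreComments" in batched requests

comments_by_fullname = {c.fullname: c for c in submission.comments.list()}

for comment in submission.comments.list():
    if comment.parent_id.startswith("t1_"):
        parent = comments_by_fullname.get(comment.parent_id)  # local lookup, no network call
        parentauthor = str(parent.author) if parent else "[unknown]"
    # build comment1/comment2 strings and write them to savepost here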

BioPython Pubmed Eutils url?

I'm trying to run some queries against Pubmed's Eutils service. If I run them on the website I get a certain number of records returned, in this case 13126 (link to pubmed).
A while ago I bodged together a python script to build a query to do much the same thing, and the resultant url returns the same number of hits (link to Eutils result).
Of course, not having any formal programming background, it was all a bit kludgy, so I'm trying to do the same thing using Biopython. I think the following code should be equivalent, but it returns a greater number of hits: 23303.
from Bio import Entrez

Entrez.email = "A.N.Other@example.com"
handle = Entrez.esearch(db="pubmed", term="stem+cell[All Fields]",
                        datetype="pdat", mindate="2012", maxdate="2012")
record = Entrez.read(handle)
print(record["Count"])
I'm fairly sure it's just down to some subtlety in how the url is being generated, but I can't work out how to see what url is being generated by Biopython. Can anyone give me some pointers?
Thanks!
EDIT:
It's something to do with how the url is being generated, as I can get back the original number of hits by modifying the code to include double quotes around the search term, thus:
handle = Entrez.esearch(db='pubmed', term='"stem+cell"[ALL]', datetype='pdat', mindate='2012', maxdate='2012')
I'm still interested in knowing what URL is being generated by Biopython, as it'll help me work out how I have to structure the search term when I want to do more complicated searches.
You've solved this already (Entrez likes explicit double quoting round combined search terms), but currently the generated URL is not exposed via the API. The simplest trick would be to edit the Bio/Entrez/__init__.py file and add a print statement inside the _open function.
Update: recent versions of Biopython now save the URL as an attribute of the returned handle, i.e. in this example try doing print(handle.url):
handle = Entrez.esearch(db="pubmed", term="stem+cell[All Fields]",
                        datetype="pdat", mindate="2012", maxdate="2012")
print(handle.url)
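A quick way to see the difference is to run both term forms and print the generated URL next to the count (this assumes a Biopython recent enough to have handle.url, as noted above):

from Bio import Entrez

Entrez.email = "A.N.Other@example.com"

for term in ['stem+cell[All Fields]', '"stem+cell"[ALL]']:
    handle = Entrez.esearch(db="pubmed", term=term, datetype="pdat",
                            mindate="2012", maxdate="2012")
    print(handle.url)  # the esearch URL Biopython generated
    print(term, Entrez.read(handle)["Count"])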

It's too slow to get my wall from the Graph API

I want to know how much time (in seconds) I need to get my whole Facebook wall as JSON from the Graph API.
It currently takes about 190 seconds to fetch all of my wall's posts (maybe 2000 posts across 131 pages of JSON).
Below is the Python code; it just reads the posts.
Is there any problem in my code, and how can I cut the response time?
accessToken = "Secret"
requestURL = "https://graph.facebook.com/me/feed?access_token="+accessToken
beforeSec = time.time()*1000
pages = 1
while 1:
read = urllib.urlopen(requestURL).read()
read = json.loads(read)
data = read["data"]
for i in range(0, len(data)):
pass
try:
requestURL = read["paging"]["next"]
pages+=1
except:
break
afterSec = time.time()*1000
print afterSec - beforeSec
It depends of course on how big the user's wall is... They have released a new batch function: http://developers.facebook.com/docs/reference/api/batch/
Maybe you can use that?
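A rough sketch of what a batch call could look like (Python 2, to match the question; the exact request format and the offset-based paging are assumptions, so check the linked docs before relying on this):

import json
import urllib
import urllib2

accessToken = "Secret"
# hypothetical: ask for five feed pages in one round trip
batch = [{"method": "GET", "relative_url": "me/feed?limit=25&offset=%d" % (25 * i)}
         for i in range(5)]
payload = urllib.urlencode({"access_token": accessToken,
                            "batch": json.dumps(batch)})
response = urllib2.urlopen("https://graph.facebook.com/", payload)  # POST
for item in json.loads(response.read()):
    if item and item.get("code") == 200:
        page = json.loads(item["body"])
        # process page["data"] here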
Your code is synchronous, so you download the pages one by one.
That's very slow; you could download several pages in parallel instead.
Greenlets are the new hype for parallel computing in Python, so give gevent a try.
Well, this is provided you can get the next page URL before downloading the entire previous page, of course. See if there is a quick way to work out the next paging URL.
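A minimal gevent sketch (Python 2, as in the question), assuming the page URLs can be worked out up front; otherwise the paging "next" links force you back into a sequential loop:

import gevent
from gevent import monkey
monkey.patch_all()  # make the standard socket module cooperative

import json
import urllib2

def fetch(url):
    return json.loads(urllib2.urlopen(url).read())

pageURLs = []  # hypothetical: one URL per page of the feed, if derivable up front
jobs = [gevent.spawn(fetch, url) for url in pageURLs]
gevent.joinall(jobs)
pages = [job.value for job in jobs]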

What's the best performing xml parsing for GAE (Python Version)?

I think we all know this page, but the benchmarks it provides date from more than two years ago. So, I would like to know if you could point out the best XML parser around. As I need just an XML parser, the most important thing to me is speed over everything else.
My objective is to process some XML feeds (about 25k of them, each around 4 KB in size); this will be a daily task. As you probably know, I'm restricted by the 30-second request timeout. So, what's the best parser (Python only) that I can use?
Thanks for your answers.
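For reference, the parsing I need per feed is minimal; a sketch with the standard library's cElementTree (assuming each ~4 KB feed is a flat list of item elements with a few text children) looks like this:

import xml.etree.cElementTree as ET

def parse_feed(xml_bytes):
    root = ET.fromstring(xml_bytes)
    return [{'title': item.findtext('title'),
             'link': item.findtext('link')}
            for item in root.findall('.//item')]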
Edit 01:
@Peter Recore
I will. I'm writing some code now and plan to run some profiling in the near future. Regarding your question, the answer is no: processing takes just a little time compared with downloading the actual XML feed. But I can't increase Google's bandwidth, so I can only focus on this right now.
My only problem is that I need to do this as fast as possible, because my objective is to get a snapshot of a website's status. And, as the internet is live and people keep adding and changing its data, I need the fastest method, because any data insertion during the "downloading and processing" time span will mess with my statistical analysis.
I used to do it from my own computer and the process took 24 minutes back then, but now the website has 12 times more information.
I know this doesn't answer my question directly, but it does what I needed.
I remembered that XML is not the only file type I could use, so instead of using an XML parser I chose to use JSON, which is about 2.5 times smaller in size. That means a decrease in download time. I used simplejson as my JSON library.
I used urlfetch (from google.appengine.api import urlfetch) to get the JSON feeds in parallel:
import simplejson
from google.appengine.api import urlfetch
from google.appengine.ext import webapp

class GetEntityJSON(webapp.RequestHandler):
    def post(self):
        url = 'http://url.that.generates.the.feeds/'
        if self.request.get('idList'):
            idList = self.request.get('idList').split(',')
            try:
                asyncRequests = self._asyncFetch([url + id + '.json' for id in idList])
            except urlfetch.DownloadError:
                pass  # Dealt with time out errors (#5) as these were very frequent
            for result in asyncRequests:
                if result.status_code == 200:
                    entityJSON = simplejson.loads(result.content)
                    # Filled a database entity with some json info. It goes like this:
                    # entity = Entity(
                    #     name = entityJSON['name'],
                    #     dateOfBirth = entityJSON['date_of_birth']
                    # ).put()
        self.redirect('/')

    def _asyncFetch(self, urlList):
        rpcs = []
        for url in urlList:
            rpc = urlfetch.create_rpc(deadline=10)
            urlfetch.make_fetch_call(rpc, url)
            rpcs.append(rpc)
        return [rpc.get_result() for rpc in rpcs]
I tried getting 10 feeds at a time, but most of the time an individual feed raised DownloadError #5 (timeout). Then I increased the deadline to 10 seconds and started getting 5 feeds at a time.
But still, 25k feeds fetched 5 at a time means 5k calls. With a queue that can spawn 5 tasks per second, the total task time should end up being about 17 minutes.
