Is there any way to speed up PRAW's comment parsing? - python

I'm writing a script to scrape text data (posts and their comments) with PRAW, and the comments add a ridiculous amount of time to the download.
If I only download posts and no comments, I get ~100 per second, but if I download comments along with the posts, it drops to 1-2 per second, and that's just for top-level comments. If I include nested comments, it takes ~5-10 minutes for a single post (granted, the post I tested on was the top post from /r/AskReddit, but still). Here's the method I'm using; please let me know if there's a way to make this any faster!
for top_level_comment in submission.comments.list():
    commentnumber += 1
    comment1 = ...  # long variable, not important
    savepost.write(comment1)
    for comment in top_level_comment.replies:
        parentvar = comment.parent_id
        parent = reddit.comment(parentvar[3:])
        parentauthor = str(parent.author)
        comment2 = ...  # again, long variable
        savepost.write(comment2)
I've also tried the method below, thinking that fewer for loops making comment requests might help, but it didn't:
for comment in submission.comments.list():
    if str(comment.parent_id[:2]) == "t1":
        parent = reddit.comment(comment.parent_id[3:])
        comment2 = ...  # super long variable, not important
        if comment.parent_id not in commentidlist:
            commentidlist.append(comment.parent_id)
            comment1 = ...  # again, long variable
            savepost.write(comment1)
            print("Wrote parent comment Id: " + comment.parent_id[3:])
        savepost.write(comment2)
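The likely bottleneck is that every reddit.comment(...) lookup is a separate network request, one per reply. A minimal sketch of one way around that, assuming all that is needed from the parent is its author: fetch the comment forest once, index it by id, and resolve parents locally (replace_more(limit=0) simply drops the unresolved "load more comments" stubs; raise the limit if those are needed).

submission.comments.replace_more(limit=0)  # resolve/drop MoreComments stubs up front
all_comments = submission.comments.list()
by_id = {c.id: c for c in all_comments}

for comment in all_comments:
    if comment.parent_id.startswith("t1_"):  # parent is another comment
        parent = by_id.get(comment.parent_id[3:])  # local lookup, no extra API call
        parentauthor = str(parent.author) if parent else "[unknown]"
    # build comment1/comment2 as before and savepost.write(...) them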

Related

Is there a way to get the starts and ends of numbers between a split from an array?

Sorry, I don't even know what title to give this problem.
I have an array of numbers I call pages.
They are pages I need to print out physically from say a browser.
pagesToPrint = [2,3,4,5,7,8,9,12,14,15,16,17,18,19,20]
Now, what is the problem with just printing 2,3,4,5,7...20?
When a page or pages are sent to the printer, it takes a while for them to be sent and processed, so to speed things up it is preferable to print in batches. Instead of printing 2-2, 3-3, 4-4, we can just print 2-5. We cannot print 2-20, because that would also print pages 6, 10, 11 and 13, which are not in the list.
I don't really care in which programming language the answer is but the logic behind it.
Ultimately I am trying to fix this problem in AutoHotkey.
Well you can solve this by a bit of "top-down" thinking. In an ideal world, there'd already be a function you could call: split_into_consecutive_batches(pages).
How would you describe, on a high level, how that would work? That's basically just a slightly more precise rewording of your initial post and requirements!
"Well as long as there's pages in left in the list of pages it should give me the next batch."
Aha!
def split_into_consecutive_batches(pages):
    batches = []
    while pages:
        batches.append(grab_next_batch(pages))
    return batches
Aha! That wasn't so bad, right? The big overall problem is now reduced to a slightly smaller, slightly simpler problem. How do we grab the very next batch? Well, we grab the first page. Then we check if the next page is a consecutive page or not. If it is, we add it to the batch and continue. If not, we consider the batch done and stop:
def grab_next_batch(pages):
    first_page = pages.pop(0)  # Grab (and delete) first page from list.
    batch = [first_page]
    while pages:
        # Check that next page is one larger than the last page in our batch:
        if pages[0] == batch[-1] + 1:
            # It is consecutive! So remove from pages and add to batch.
            batch.append(pages.pop(0))
        else:
            # Not consecutive! So the current batch is done! Return it!
            return batch
    # If we made it to here, we have removed all the pages. So we're done too!
    return batch
That should do it. Though it could be cleaned up a bit; maybe you don't like the side-effect of removing items from the pages list. And instead of copying stuff around you could just figure out the indices. I'll leave that as an exercise :)
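As a quick sanity check, feeding the list from the question into the two functions above should group the pages like this:

pagesToPrint = [2, 3, 4, 5, 7, 8, 9, 12, 14, 15, 16, 17, 18, 19, 20]
print(split_into_consecutive_batches(pagesToPrint))
# [[2, 3, 4, 5], [7, 8, 9], [12], [14, 15, 16, 17, 18, 19, 20]]

Each inner list is a consecutive run, so the print ranges are 2-5, 7-9, 12-12 and 14-20.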

Efficient way to get data from lotus notes view

I am trying to get all the data from a view (Lotus Notes) with LotusScript and Python (the noteslib module) and export it to CSV, but the problem is that this takes too much time. I have tried two ways, both looping through all documents:
import noteslib

db = noteslib.Database('database', 'file.nsf')
view = db.GetView('My View')
doc = view.GetFirstDocument()
data = list()
while doc:
    data.append(doc.ColumnValues)
    doc = view.GetNextDocument(doc)
Getting about 1000 lines of data took 70 seconds, but the view has about 85000 lines, so getting all the data this way will take far too long, especially since exporting everything to CSV manually via File->Export in Lotus Notes takes only about 2 minutes.
I also tried a second way, with AllEntries, but it was even slower:
database = []
ec = view.AllEntries
ent = ec.GetFirstEntry()
while ent:
    row = []
    for v in ent.ColumnValues:
        row.append(v)
    database.append(row)
    ent = ec.GetNextEntry(ent)
Everything that I found on the Internet is based on "NextDocument" or "AllEntries". Is there any way to do it faster?
It is (or at least used to be) very expensive from a time standpoint to open a Notes document, like you are doing in your code.
Since you are saying that you want to export the data that is being displayed in the view, you could use the NotesViewEntry class instead. It should be much faster.
Set col = view.AllEntries
Set entry = col.GetFirstEntry()
Do Until entry Is Nothing
    values = entry.ColumnValues '*** Array of column values
    '*** Do stuff here
    Set entry = col.GetNextEntry(entry)
Loop
I wrote a blog about this back in 2013:
http://blog.texasswede.com/which-is-faster-columnvalues-or-getitemvalue/
Something is going on with your code "outside" the view navigation: you already chose the most performant way to navigate a view, using "GetFirstDocument" and "GetNextDocument". Using the NotesViewNavigator mentioned in the comments will be slightly better, but not significantly so.
You might get a little bit of performance out of your code by setting view.AutoUpdate = False to prevent the view object from refreshing when something in the backend changes. But as you only read data and do not change view data, that will not give you much of a performance boost.
My suggestion: Identify the REAL bottleneck of your code by commenting out single sections to find out when it starts to get slower:
First attempt:

while doc:
    doc = view.GetNextDocument(doc)

Slow?
If not, then next attempt:

while doc:
    arr = doc.ColumnValues
    doc = view.GetNextDocument(doc)

Slow?
If yes: ColumnValues is your enemy...
If not, then next attempt:

while doc:
    arr = doc.ColumnValues
    data.append(arr)
    doc = view.GetNextDocument(doc)
I would be very interested to get your results of where it starts to become slow.
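To time those attempts in the asker's Python/noteslib setup, a throwaway harness along these lines could be used (a sketch reusing the names from the question; it only measures the calls already shown above):

import time
import noteslib

db = noteslib.Database('database', 'file.nsf')
view = db.GetView('My View')

def time_pass(label, use_columnvalues, append_to_list):
    start = time.time()
    data = []
    doc = view.GetFirstDocument()
    while doc:
        if use_columnvalues:
            arr = doc.ColumnValues
            if append_to_list:
                data.append(arr)
        doc = view.GetNextDocument(doc)
    print('%s: %.1f seconds' % (label, time.time() - start))

time_pass('navigation only', False, False)
time_pass('navigation + ColumnValues', True, False)
time_pass('navigation + ColumnValues + append', True, True)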
I would suspect the performance issue is the use of COM/ActiveX in Python to access Notes databases. Transferring data via COM involves datatype 'marshalling', possibly at every step, and especially for 'out-of-process' method/property calls.
I don't think there is any way around this in COM. You should consider arranging for a Notes 'agent' to do this for you instead (LotusScript or Java, maybe). Even a basic LotusScript agent can export thousands of documents per minute. A further alternative may be to look at the Notes C API (not an easy option, and it requires making the API calls from Python).

Python reddit API: efficiently parse all comments in a subreddit

I am trying to code a chatbot and have it scan through all the comments added to a given subreddit.
Currently I do so by scanning the last Y comments every X seconds:
import time

import praw

import config  # the bot's credentials live here

handle = praw.Reddit(username=config.username,
                     password=config.password,
                     client_id=config.client_id,
                     client_secret=config.client_secret,
                     user_agent="cristiano corrector v0.1a")

while True:
    last_comments = handle.subreddit(subreddit).comments(limit=Y)
    for comment in last_comments:
        pass  # process comments
    time.sleep(X)
I am quite unsatisfied with this, as there can be a lot of overlap (which could be solved by tracking comment IDs), and some comments are scanned twice while others are ignored. Is there a better way of doing this with the API?
I found a solution that makes use of streams in the PRAW API. Details at https://praw.readthedocs.io/en/latest/tutorials/reply_bot.html
And in my code:
handle = praw.Reddit(username=config.username,
                     password=config.password,
                     client_id=config.client_id,
                     client_secret=config.client_secret,
                     user_agent="cristiano corrector v0.1a")

for comment in handle.subreddit(subreddit).stream.comments():
    pass  # process comments
This should save some CPU and network load.
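One extra knob worth checking against the installed PRAW version: stream.comments() accepts a skip_existing argument, which avoids reprocessing the batch of historical comments the stream yields when the bot starts up. A minimal variant of the loop above:

# skip_existing=True starts the stream from "now" instead of replaying
# the most recent comments (verify your PRAW version supports it).
for comment in handle.subreddit(subreddit).stream.comments(skip_existing=True):
    pass  # process comments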

Too slow to get my whole wall from the Graph API

I want to know how much time (in seconds) it should take to get my whole Facebook wall as JSON from the Graph API.
Right now it takes about 190 seconds to get all of my wall's posts (maybe 2000 posts across 131 pages of JSON).
Below is the Python code; it just reads the posts.
Is there any problem in my code, and how can I cut the response time?
import json
import time
import urllib

accessToken = "Secret"
requestURL = "https://graph.facebook.com/me/feed?access_token=" + accessToken

beforeSec = time.time() * 1000
pages = 1
while 1:
    read = urllib.urlopen(requestURL).read()
    read = json.loads(read)
    data = read["data"]
    for i in range(0, len(data)):
        pass  # just reading the posts
    try:
        requestURL = read["paging"]["next"]
        pages += 1
    except KeyError:
        break
afterSec = time.time() * 1000
print afterSec - beforeSec
It depends, of course, on how big the user's wall is... They have released a new batch function: http://developers.facebook.com/docs/reference/api/batch/
Maybe you can use that?
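With the batch endpoint, several pages of the feed can be requested in a single HTTP round trip. A rough sketch of what that could look like, staying in the question's Python 2 style (the limit/offset paging parameters here are an assumption; check the linked docs for the exact request format):

import json
import urllib
import urllib2

# Ask for five 100-post chunks of the feed in one POST to the batch endpoint.
batch = [{"method": "GET", "relative_url": "me/feed?limit=100&offset=%d" % off}
         for off in range(0, 500, 100)]
payload = urllib.urlencode({"access_token": accessToken,
                            "batch": json.dumps(batch)})
responses = json.loads(urllib2.urlopen("https://graph.facebook.com/", payload).read())
for r in responses:
    posts = json.loads(r["body"])["data"]  # each "body" is itself a JSON string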
Your code is synchronous, so you download the pages one by one.
That is very slow; you could download several pages in parallel instead.
Greenlets are the current hype for parallel computing in Python, so give gevent a try.
This only helps, of course, if you can determine the next page's URL before you have downloaded the entire previous page. Try to see if there is a quick way to get the next paging URL.
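For illustration, a gevent version might look something like this, again in the question's Python 2 style (the assumption being that the paging URLs can be built up front, e.g. with limit/offset parameters, rather than taken from each response's "paging" field):

import json
import urllib

import gevent
from gevent import monkey
monkey.patch_all()  # make urllib's sockets cooperative


def fetch(url):
    return json.loads(urllib.urlopen(url).read())


urls = ["https://graph.facebook.com/me/feed?access_token=%s&limit=100&offset=%d"
        % (accessToken, off) for off in range(0, 2000, 100)]
jobs = [gevent.spawn(fetch, url) for url in urls]
gevent.joinall(jobs)
pages = [job.value for job in jobs]  # one decoded JSON page per request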

What's the best performing xml parsing for GAE (Python Version)?

I think we all know this page, but the benchmarks it provides date from more than two years ago. So, I would like to know if you can point out the best XML parser around. As I need just an XML parser, the most important thing to me is speed over everything else.
My objective is to process about 25k XML feeds of roughly 4 KB each (this will be a daily task). As you probably know, I'm restricted by the 30-second request timeout. So, what's the best parser (Python only) that I can use?
Thanks for your answers.
Edit 01:
@Peter Recore
I will. I'm writing some code now and plan to run some profiling in the near future. Regarding your question, the answer is no. Processing takes just a little time compared with downloading the actual XML feed. But I can't increase Google's bandwidth, so I can only focus on this right now.
My only problem is that I need to do this as fast as possible, because my objective is to get a snapshot of a website's status. And, as the internet is live and people keep adding and changing its data, any data inserted during the "downloading and processing" time span will mess with my statistical analysis.
I used to do this from my own computer, and the process took 24 minutes back then, but now the website has 12 times more information.
I know this doesn't answer my question directly, but it does what I needed.
I remembered that XML is not the only format I could use, so instead of an XML parser I chose to use JSON, which is about 2.5 times smaller in size and therefore quicker to download. I used simplejson as my JSON library.
I used urlfetch (from google.appengine.api import urlfetch) to get the JSON feeds in parallel:
import simplejson
from google.appengine.api import urlfetch
from google.appengine.ext import webapp


class GetEntityJSON(webapp.RequestHandler):
    def post(self):
        url = 'http://url.that.generates.the.feeds/'
        if self.request.get('idList'):
            idList = self.request.get('idList').split(',')
            try:
                asyncRequests = self._asyncFetch([url + id + '.json' for id in idList])
            except urlfetch.DownloadError:
                pass  # Dealt with time-out errors (#5), as these were very frequent
            for result in asyncRequests:
                if result.status_code == 200:
                    entityJSON = simplejson.loads(result.content)
                    # Filled a database entity with some JSON info. It goes like this:
                    # entity = Entity(
                    #     name = entityJSON['name'],
                    #     dateOfBirth = entityJSON['date_of_birth']
                    # ).put()
        self.redirect('/')

    def _asyncFetch(self, urlList):
        rpcs = []
        for url in urlList:
            rpc = urlfetch.create_rpc(deadline=10)
            urlfetch.make_fetch_call(rpc, url)
            rpcs.append(rpc)
        return [rpc.get_result() for rpc in rpcs]
I tried getting 10 feeds at a time, but most of the time an individual feed raised DownloadError #5 (time out). Then I increased the deadline to 10 seconds and started getting 5 feeds at a time.
But still, 25k feeds fetched 5 at a time means 5k calls. In a queue that can spawn 5 tasks a second, that is about 1000 seconds of queue time, so the total task time should be around 17 minutes in the end.
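For completeness, the enqueuing side of that workflow might look roughly like this on App Engine (a sketch: the '/getEntityJSON' route, the allIds list and the 5-ids-per-task grouping are assumptions based on the description above):

from google.appengine.api import taskqueue

batchSize = 5  # 5 feeds per task, as described above
for i in range(0, len(allIds), batchSize):
    batch = allIds[i:i + batchSize]
    taskqueue.add(url='/getEntityJSON', params={'idList': ','.join(batch)})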
