I'm looking to back up a subreddit to disk. So far, it doesn't seem to be easily possible with the way the Reddit API works. My best bet at getting a single JSON tree with all comments (and nested comments) seems to be storing them in a database and doing a pretty ridiculous recursive query to generate the JSON.
Is there a Reddit API method which will give me a tree containing all comments on a given post in the expected order?
The number of comments you get from the API has a hard limit, for performance reasons; to ensure you're getting all comments, you have to parse through the child nodes and make additional calls as necessary.
Be aware that the subreddit listing will only include the latest 1000 posts, so if your target subreddit has more than that, you probably won't be able to obtain a full backup anyways.
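For what it's worth, here's a rough sketch of that loop using PRAW (the question doesn't mention PRAW, so treat the subreddit name and credentials as placeholders). replace_more(limit=None) keeps resolving the "load more comments" stubs until the whole comment tree is in memory, at the cost of extra API calls on large threads:

import praw

# Placeholder credentials for a registered script app
reddit = praw.Reddit(
    client_id="CLIENT_ID",
    client_secret="CLIENT_SECRET",
    user_agent="subreddit-backup-sketch",
)

# The listing itself is capped at roughly 1000 submissions
for submission in reddit.subreddit("learnpython").new(limit=None):
    # Resolve every "load more comments" stub; big threads may need many extra calls
    submission.comments.replace_more(limit=None)
    for comment in submission.comments.list():
        print(comment.id, comment.parent_id, comment.body[:40])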
Related
I am trying to collect all posts from the "controversial" listing of a subreddit.
I tried the Reddit API, but there is a limit on how many posts you can collect.
Using the Pushshift API I can retrieve a large number of posts, but I can't find a way to collect from the controversial listing.
Is there a way to collect controversial posts?
PRAW provides a way to get controversial listings, and you can specify the time period as well.
From the docs
This method can be used like:
reddit.domain("imgur.com").controversial("week")
reddit.multireddit("samuraisam", "programming").controversial("day")
reddit.redditor("spez").controversial("month")
reddit.redditor("spez").comments.controversial("year")
reddit.redditor("spez").submissions.controversial("all")
reddit.subreddit("all").controversial("hour")
I am not familiar with the Pushshift API, but hopefully this helps.
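If it helps, a minimal sketch of collecting a subreddit's controversial listing with PRAW; the subreddit name and credentials are placeholders, and note that the listing is still capped at roughly 1000 items:

import praw

reddit = praw.Reddit(
    client_id="CLIENT_ID",          # placeholder credentials
    client_secret="CLIENT_SECRET",
    user_agent="controversial-collector-sketch",
)

posts = []
# time_filter accepts "all", "day", "hour", "month", "week" or "year"
for submission in reddit.subreddit("python").controversial(time_filter="all", limit=None):
    posts.append((submission.id, submission.title, submission.score))

print(len(posts))  # the listing API still returns at most ~1000 posts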
I want to get all the article names under a category and its sub-categories.
Options I'm aware of:
Using the Wikipedia API. Does it have such an option?
Downloading the dump. Which format would be better for my usage?
There is also an option to search Wikipedia with something like incategory:"music", but I didn't see an option to get those results as XML.
Please share your thoughts.
The following resource will help you to download all pages from the category and all its subcategories:
http://en.wikipedia.org/wiki/Wikipedia:CatScan
There is also an API available here:
https://www.mediawiki.org/wiki/API:Categorymembers
You can do this with the following two API calls:
For the article pages in the category:
YOUR_URL/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Music
To get the subcategories:
YOUR_URL/api.php?action=query&format=json&list=categorymembers&cmtype=subcat&cmtitle=Category:Music
You can get more info in the MediaWiki API documentation.
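As a rough sketch (assuming the English Wikipedia endpoint; swap in YOUR_URL/api.php), you can walk those two calls recursively in Python, following the API's continuation parameters and keeping a visited set so category cycles don't loop forever:

import requests

API = "https://en.wikipedia.org/w/api.php"  # substitute YOUR_URL/api.php

def category_members(title, cmtype):
    """Yield members of a category, following API continuation."""
    params = {
        "action": "query",
        "format": "json",
        "list": "categorymembers",
        "cmtitle": title,
        "cmtype": cmtype,   # "page" for articles, "subcat" for subcategories
        "cmlimit": "500",
    }
    while True:
        data = requests.get(API, params=params).json()
        for member in data["query"]["categorymembers"]:
            yield member["title"]
        if "continue" not in data:
            break
        params.update(data["continue"])

def all_articles(root, seen=None):
    """Collect article titles under a category and all of its subcategories."""
    seen = set() if seen is None else seen
    if root in seen:        # guard against category cycles
        return set()
    seen.add(root)
    articles = set(category_members(root, "page"))
    for subcat in category_members(root, "subcat"):
        articles |= all_articles(subcat, seen)
    return articles

# Category:Music is enormous; expect a very large number of requests
print(sorted(all_articles("Category:Music"))[:20])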
Note that Wikipedia's categorization system is not a tree, or even an acyclic graph. It is quite possible that by continually following subcategory links you will eventually wind up back where you started.
If you are going to be making many such queries, you would be best served by downloading a database dump. If this will be an infrequent thing and will only be dealing with small categories, you could probably get away with making repeated queries to list=categorymembers.
incategory:"music" does not appear to do subcategory searching.
I have a fairly basic GAE app that takes some input, fetches some data from a webpage, parses, then presents it to the user. Right now, the fairly spare input HTML form POSTs the arguments to the output 'file' which is wholly generated by the handler for that URL.
I'd like to do a couple things with the data (e.g. graph it at a landing page perhaps, then write it to an output file), but I don't know how I should pass the parsed data between the different handlers. I could maybe encode it then successively POST it to other handlers, but my gut says that I shouldn't need to HTTP the data back and forth within my app—it seems terribly inefficient (my gut is also hungry...).
In fairly broad strokes (or maybe with a link to an example), how should my handlers handle this?
Later thoughts (edit)
My very rough idea now is to have the form submitted to a page that 1) enters the subsequent query into a database (datastore?) keyed to some hash, then uses that to 2) grab and parse all the data. The parsed data would be stored in memory (memcache?) for near-immediate use to graph it and/or process it into a variety of tabular formats for download. The script that does said parsing redirects to a unique URL based on the hash which can be referred to in order to get the data.
The thought would be that you could save the URL, then if you visit it later when the data has been lost, it can re-query the source to get it back/update it.
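Something like the following is how I picture it on the old Python 2 runtime with webapp2, ndb and memcache; fetch_and_parse and render_graph are hypothetical stand-ins for the existing fetch/parse/graph code:

import hashlib
import webapp2
from google.appengine.api import memcache
from google.appengine.ext import ndb

class Query(ndb.Model):
    """Stores the submitted query so the result URL can re-run it later."""
    source_url = ndb.StringProperty()

class SubmitHandler(webapp2.RequestHandler):
    def post(self):
        source_url = self.request.get("source_url")
        key = hashlib.sha1(source_url).hexdigest()[:12]
        Query(id=key, source_url=source_url).put()

        parsed = fetch_and_parse(source_url)           # hypothetical parsing helper
        memcache.set("result:" + key, parsed, time=3600)

        self.redirect("/result/" + key)

class ResultHandler(webapp2.RequestHandler):
    def get(self, key):
        parsed = memcache.get("result:" + key)
        if parsed is None:                             # cache lost: re-query the source
            query = Query.get_by_id(key)
            parsed = fetch_and_parse(query.source_url)
            memcache.set("result:" + key, parsed, time=3600)
        self.response.write(render_graph(parsed))      # hypothetical rendering helper

app = webapp2.WSGIApplication([
    ("/submit", SubmitHandler),
    (r"/result/(\w+)", ResultHandler),
])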
Reasonable? Should I be looking at other things?
I can't find the "best" solution for a very simple problem (or maybe not so simple).
I have a classic set of data: posts attached to users, and comments attached to both a post and a user.
Now I can't decide how to build the schema/classes.
One way is to store the user_id inside each comment and inside each post.
But what happens when I have 200 comments on a page?
Or when I have N posts on a page?
That would mean 200 additional requests to the database just to display user info (such as name and avatar).
Another solution is to embed user data into each comment and each post.
But first, it is a huge overhead; second, the model system gets corrupted (I'm using mongoalchemy); third, a user can change his info (like his avatar). And what then? As I understand it, updating a huge collection of comments or posts is not a simple operation...
What would you suggest? Are 200 requests per page to MongoDB OK (I must aim for performance)?
Or maybe I am just missing something...
You can avoid the N+1 problem of hundreds of requests by using $in queries. Consider this:
Post {
    PosterId: ObjectId,
    Text: string,
    Comments: [ObjectId, ObjectId, ...]   // option 1
}

Comment {
    PostId: ObjectId,                     // option 2 (better)
    Created: dateTime,
    AuthorName: string,
    AuthorId: ObjectId,
    Text: string
}
Now you can find a post's comments with an $in query, and you can also easily find all comments made by a specific author.
Of course, you could also store the comments as an embedded array in post, and perform an $in query on the user information when you fetch the comments. That way, you don't need to de-normalize user names and still don't need hundreds of queries.
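For example, with the linked layout above, two queries are enough to render a page of comments together with their authors. This is pymongo syntax rather than mongoalchemy, the database name is a placeholder, and it assumes a separate users collection:

from pymongo import MongoClient

db = MongoClient().blog  # placeholder database name

def comments_with_authors(post_id):
    """Two queries instead of one per comment: fetch comments, then their authors."""
    comments = list(db.comments.find({"PostId": post_id}))
    author_ids = {c["AuthorId"] for c in comments}
    users = {u["_id"]: u for u in db.users.find({"_id": {"$in": list(author_ids)}})}
    for c in comments:
        c["Author"] = users.get(c["AuthorId"])
    return comments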
If you choose to denormalize the user names, you will have to update all comments ever made by a user when that user changes, e.g., his name. On the other hand, if such operations don't occur very often, it shouldn't be a big deal. Or maybe it's even better to store the name the user had when he made the comment, depending on your requirements.
A general problem with embedding is that different writers will write to the same object, so you will have to use the atomic modifiers (such as $push). This is sometimes harder to use with mappers (I don't know mongoalchemy though), and generally less flexible.
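If you do go with comments embedded in the post document, the append itself would look roughly like this (pymongo syntax, placeholder ids and database name):

import datetime
from bson import ObjectId
from pymongo import MongoClient

db = MongoClient().blog                           # placeholder database name
post_id = ObjectId("5f1d7f000000000000000001")    # placeholder post id

# $push appends to the embedded array atomically, so two users commenting
# at the same time cannot clobber each other's writes
db.posts.update_one(
    {"_id": post_id},
    {"$push": {"Comments": {
        "AuthorId": ObjectId("5f1d7f000000000000000002"),  # placeholder user id
        "Text": "Nice post!",
        "Created": datetime.datetime.utcnow(),
    }}},
)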
What I would do with MongoDB is embed the user id into the comments (which are part of the structure of the "post" document).
Three simple hints for better performances:
1) Make sure to create an index on user_id
2) Use a comment pagination method to avoid querying the database 200 times
3) Caching is your friend
You could cache your user objects so you don't have to query the database each time.
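A quick sketch of hints 1-3 in pymongo terms; the collection and field names here are guesses, not your actual schema:

from pymongo import MongoClient

db = MongoClient().blog  # placeholder database name

# Hint 1: index the field the comment lookups filter on
db.comments.create_index("user_id")

# Hint 2: paginate instead of loading every comment at once
def comment_page(post_id, page, per_page=20):
    return list(db.comments.find({"post_id": post_id})
                           .sort("created", -1)
                           .skip(page * per_page)
                           .limit(per_page))

# Hint 3: a tiny in-process cache so each user document is fetched only once
_user_cache = {}

def get_user(user_id):
    if user_id not in _user_cache:
        _user_cache[user_id] = db.users.find_one({"_id": user_id})
    return _user_cache[user_id]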
I like the idea of embedding user data into each post, but then you have to think about what happens when a user's profile is updated; you have to make sure that no post is missed.
I would recommend starting out just by skimming how mongo recommends you handle schemas.
Generally, for "contains" relationships between entities, embedding should be chosen. Use linking when not using linking would result in duplication of data.
http://www.mongodb.org/display/DOCS/Schema+Design
There's a pretty good use case from the MongoDB docs: http://docs.mongodb.org/manual/use-cases/storing-comments/
Conveniently it's also written in Python :-)
Additional questions regarding SilentGhost's initial answer to a problem I'm having parsing Twitter RSS feeds. See also partial code below.
First, could I insert tags[0], tags[1], etc., into the database, or is there a different/better way to do it?
Second, almost all of the entries have a url, but a few don't; likewise, many entries don't have the hashtags. So, would the thing to do be to create default values for url and tags? And if so, do you have any hints on how to do that? :)
Third, when you say the single-table db design is not optimal, do you mean I should create a separate table for tags? Right now, I have one table for the RSS feed urls and another table with all the RSS entry data (summary, date, etc.).
I've pasted in a modified version of the code you posted. I had some success getting a "tinyurl" variable into the sqlite database, but now it isn't working, and I'm not sure why.
Lastly, assuming I can get the whole thing up and running (smile), is there a central site where people might appreciate seeing my solution? Or should I just post something on my own blog?
Best,
Greg
I would suggest reading up on database normalisation, especially on the 1st and 2nd normal forms. Once you're done with it, I hope there won't be a need for default values, and your db schema will evolve into something more appropriate.
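To make that concrete, a normalised layout might look roughly like this; the table and column names below are guesses based on your description, not SilentGhost's original schema:

import sqlite3

conn = sqlite3.connect("tweets.db")   # placeholder file name

# A rough 2NF layout: entries hold the single-valued fields, and the
# repeating hashtags move to their own table keyed back to the entry
conn.executescript("""
CREATE TABLE IF NOT EXISTS feeds (
    id      INTEGER PRIMARY KEY,
    url     TEXT NOT NULL UNIQUE
);
CREATE TABLE IF NOT EXISTS entries (
    id      INTEGER PRIMARY KEY,
    feed_id INTEGER NOT NULL REFERENCES feeds(id),
    summary TEXT,
    url     TEXT,              -- NULL when the entry has no link
    posted  TEXT
);
CREATE TABLE IF NOT EXISTS entry_tags (
    entry_id INTEGER NOT NULL REFERENCES entries(id),
    tag      TEXT NOT NULL,
    PRIMARY KEY (entry_id, tag)
);
""")
conn.commit()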
There are plenty of options for sharing your source code on the web. Depending on which version control system you're most comfortable with, you might have a look at such well-known sites as Google Code, Bitbucket, GitHub, and many others.