Search every subreddit by keyword with PRAW - python

I'm having trouble understanding if this is possible in the PRAW API: I'd like to get a list of all posts that have comments mentioning a keyword, say "python". It seems like the search function is always called from a specific subreddit, as in
for submission in reddit.subreddit("all").search("python", sort="comments", limit=None):
    print(submission.title)
But won't this only return posts that have made it to r/all? How can I search all subreddits, without brute force searching one subreddit at a time?

Searching /r/all will search all subreddits (or at least all subreddits that have opted into /r/all).
"Made it to /r/all" covers essentially all posts, at least from subreddits that have opted into /r/all, which is most of them. A post might surface in different listings, such as hot or new, or it might not be reachable through any listing because of the 1000-item limit, even though it is theoretically still part of the listing, just further down. Regardless, all of these posts are searchable this way.

Related

Collect "controversial" posts of a subreddit

I am trying to collect all posts from the "controversial" listing of a subreddit.
I tried the Reddit API, but there is a limit on how many posts you can collect.
Using the Pushshift API I can retrieve a large number of posts, but I can't find a way to collect from the controversial listing.
Is there a way to collect controversial posts?
PRAW gives you a way to get controversial listings, and you can specify the time period as well. From the docs, this method can be used like:
reddit.domain("imgur.com").controversial("week")
reddit.multireddit("samuraisam", "programming").controversial("day")
reddit.redditor("spez").controversial("month")
reddit.redditor("spez").comments.controversial("year")
reddit.redditor("spez").submissions.controversial("all")
reddit.subreddit("all").controversial("hour")
I am not familiar with the Pushshift API, but hopefully this helps.
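For example, to page through a subreddit's controversial posts for the past week (a minimal sketch, assuming an authenticated reddit instance; the subreddit name is just an example):

# time_filter accepts: "all", "day", "hour", "month", "week", "year"
for submission in reddit.subreddit("AskReddit").controversial(time_filter="week", limit=None):
    print(submission.score, submission.title)

Note that Reddit's 1000-item listing cap (the limit mentioned in the question) still applies here.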

URL path parameters vs query parameters in Django

I've looked around for a little while now and can't seem to find anything that even touches on the differences. As the title states, I'm trying to find out what difference it actually makes whether you get your data via URL path parameters like /content/7, matched with regex in your urls.py, or via query parameters like /content?num=7, read with request.GET.get().
What are the pros and cons of each, and are there any scenarios where one would clearly be a better choice than the other?
Also, from what I can tell, Django's preferred method seems to be URL path params with regex. Is there any reason for this, other than potentially cleaner URLs? Any additional information pertinent to the topic is welcome.
This depends on what architectural pattern you would like to adhere to. For example, according to the REST architectural pattern (arguably the most common), you want to design URLs such that, without query params, they point to "resources", which roughly correspond to nouns in your application; HTTP verbs then correspond to the actions you can perform on those resources.
If, for instance, your application has users, you would want to design URLs like this:
GET /users/ # gets all users
POST /users/ # creates a new user
GET /users/<id>/ # gets the user with that id. Notice this URL still points to a user resource
PUT /users/<id> # updates an existing user's information
DELETE /users/<id> # deletes a user
You could then use query params to filter a set of users at a resource. For example, to get users that are active, your URL would look something like
/users?active=true
So to summarize, query params vs. path params depends on your architectural preference.
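As a minimal sketch of both styles side by side (using Django's path() converters; the regex-based url() form works the same way):

# urls.py
from django.urls import path
from . import views

urlpatterns = [
    path("users/", views.user_list),                  # query params are read inside the view
    path("users/<int:user_id>/", views.user_detail),  # path param captured by the URLconf
]

# views.py
from django.http import JsonResponse

def user_detail(request, user_id):
    # The path parameter identifies the resource itself
    return JsonResponse({"id": user_id})

def user_list(request):
    # /users/?active=true -- query params filter the collection
    active = request.GET.get("active")
    return JsonResponse({"active_filter": active})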
A more detailed explanation of REST: http://www.vinaysahni.com/best-practices-for-a-pragmatic-restful-api
Roy Fielding's version if you want to get really academic: http://www.ics.uci.edu/~fielding/pubs/dissertation/rest_arch_style.htm

Filter results from browse_release_groups by artist_id to get discography, python

I'm trying to retrieve discographies for various artists. Wikipedia and the manual web interface for MusicBrainz.org seem to agree on what albums make these up, for the artists I've checked. My first thought was to attempt to screen-scrape either of these resources, but that looks like hard work to do properly.
Direct queries of the musicbrainz data seemed to offer a quicker way to get clean data. I would ideally construct a request like this ...
data = get_release_groups(artist=mbid,
                          primary_type='Album',
                          status='Official',
                          includes=['first_release_date',
                                    'title',
                                    'secondary_type_list'])
I chose to use the Python wrapper musicbrainzngs, as I am fairly experienced with Python. It gave me a choice of three kinds of method: get_, search_ and browse_. get_ will not return sufficient records. browse_ appeared to be what I wanted, so I tried that first, especially as the Python examples documented search_ around looking for text rather than the MBID, which I already had.
When I did a browse_release_groups(artist=artist_id, ...), I got a list of release groups, each containing the data I wanted: album title, type and year. However, I also got a large number of other release groups that don't appear in the web results for, for example, The Rolling Stones: https://musicbrainz.org/artist/b071f9fa-14b0-4217-8e97-eb41da73f598
There didn't appear to be any way to filter in the query for status='Official', or to include the status as part of the results so I could filter manually.
In response to this question, Wieland has suggested I use the search_ query. I have tested search_release_groups(arid=mbid, status='official', primarytype='Album', strict=True, limit=...) and this returns many fewer release groups. As far as studio albums are concerned, it matches 1:1. There are still a few minor discrepancies in the compilations, which I can live with. However, this query did not return the first-release-date, and so far, it has been resistant to my attempts to find how to include it. I notice in the server search code linked to that every query starts off manipulating rgm.first_release_date_year etc, but it's not clear how/when this gets returned from a query.
It's just occurred to me that I can use both a browse_ and a search_, as together they give me all the information. So I have a workaround, but it feels rather agricultural.
TL;DR I want release groups (titles, dates, types, status) by artist ID. If I browse, I get dates, but can't include or filter by status. If I search, I can filter by status, but don't get dates. How can I get both in one query?
I'm not entirely sure what your question is, but the find_by_artist method of release groups (source here) is what does the filtering of release groups for the artist pages, in particular:
# Show only RGs with official releases by default, plus all-status-less ones so people fix the status
unless ($show_all) {
    push @$conditions, "(EXISTS (SELECT 1 FROM release where release.release_group = rg.id AND release.status = '1') OR
        NOT EXISTS (SELECT 1 FROM release where release.release_group = rg.id AND release.status IS NOT NULL))";
}
Unfortunately, I think it's not possible to express that condition in a normal web service call. You can, however, use the search web service to filter for release groups by the rolling stones that contain at least one "official" release: http://musicbrainz.org/ws/2/release-group/?query=arid:b071f9fa-14b0-4217-8e97-eb41da73f598%20AND%20status:official&offset=0. In python-musicbrainzngs, the call for this is
search_release_groups(arid="b071f9fa-14b0-4217-8e97-eb41da73f598", status="official", strict=True)
Unfortunately, the search results don't include the first-release-date field. There's an open ticket about it, but it's not going to be fixed in the near future.
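For reference, a minimal sketch of the browse-plus-search workaround described in the question (assuming musicbrainzngs; paging through result sets larger than 100 is omitted):

import musicbrainzngs

# A user agent is required by the MusicBrainz API; these values are placeholders
musicbrainzngs.set_useragent("discography-example", "0.1", "you@example.com")

mbid = "b071f9fa-14b0-4217-8e97-eb41da73f598"  # The Rolling Stones

# search_ can filter by status, but its results lack first-release-date
found = musicbrainzngs.search_release_groups(
    arid=mbid, status="official", primarytype="Album", strict=True, limit=100)
official_ids = {rg["id"] for rg in found["release-group-list"]}

# browse_ returns the dates; keep only the groups the search marked as official
browsed = musicbrainzngs.browse_release_groups(artist=mbid, limit=100)
for rg in browsed["release-group-list"]:
    if rg["id"] in official_ids:
        print(rg.get("first-release-date", "?"), rg["title"])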

How to get all Wikipedia articles from category and sub categories using Python? [duplicate]

I want to get all the article names under a category and its sub-categories.
Options I'm aware of:
Using the Wikipedia API. Does it have such an option?
Downloading the dump. Which format would be better for my usage?
There is also an option to search in Wikipedia for something like incategory:"music", but I didn't see an option to get those results as XML.
Please share your thoughts.
The following resource will help you to download all pages from the category and all its subcategories:
http://en.wikipedia.org/wiki/Wikipedia:CatScan
There is also an API available here:
https://www.mediawiki.org/wiki/API:Categorymembers
You can do this through the following two API methods:
For the article pages in a category:
YOUR_URL/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Music
For the subcategories:
YOUR_URL/api.php?action=query&format=json&list=categorymembers&cmtype=subcat&cmtitle=Category:Music
You can find more info in the MediaWiki API documentation.
Note that Wikipedia's categorization system is not a tree, or even an acyclic graph. It is quite possible that by continually following subcategory links you will eventually wind up back where you started.
If you are going to be making many such queries, you would be best served by downloading a database dump. If this will be an infrequent thing and will only be dealing with small categories, you could probably get away with making repeated queries to list=categorymembers.
incategory:"music" does not appear to do subcategory searching.
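Putting those pieces together, a minimal sketch in Python (using requests; the seen set guards against the cycles mentioned above):

import requests

API = "https://en.wikipedia.org/w/api.php"  # or YOUR_URL/api.php

def category_members(cmtitle, cmtype):
    # Yield members of one category, following the API's continuation tokens
    params = {"action": "query", "format": "json", "list": "categorymembers",
              "cmtitle": cmtitle, "cmtype": cmtype, "cmlimit": "500"}
    while True:
        data = requests.get(API, params=params).json()
        yield from data["query"]["categorymembers"]
        if "continue" not in data:
            break
        params.update(data["continue"])

def all_articles(root):
    # Breadth-first walk over subcategories; the seen set prevents revisiting in cycles
    seen, queue, articles = {root}, [root], []
    while queue:
        cat = queue.pop(0)
        articles.extend(m["title"] for m in category_members(cat, "page"))
        for sub in category_members(cat, "subcat"):
            if sub["title"] not in seen:
                seen.add(sub["title"])
                queue.append(sub["title"])
    return articles

# Warning: Category:Music is huge; try a smaller category first
print(len(all_articles("Category:Music")))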

Get a JSON tree of all comments of a post?

I'm looking to back up a subreddit to disk. So far, it doesn't seem to be easily possible, given the way the Reddit API works. My best bet at getting a single JSON tree with all comments (and nested comments) would seem to be storing them in a database and doing a pretty ridiculous recursive query to generate the JSON.
Is there a Reddit API method which will give me a tree containing all comments on a given post in the expected order?
The number of comments you get from the API has a hard limit, for performance reasons; to ensure you're getting all comments, you have to parse through the child nodes and make additional calls as necessary.
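In PRAW, that child-node expansion is handled by CommentForest.replace_more. A minimal sketch (assuming an authenticated reddit instance; the post id is a placeholder):

submission = reddit.submission(id="abc123")   # placeholder post id
submission.comments.replace_more(limit=None)  # keep fetching until no "more comments" stubs remain
for comment in submission.comments.list():    # flattened view of the whole tree
    print(comment.id, comment.parent_id, comment.body[:60])

From there you can serialize the tree yourself, either from the flattened list (parent_id links each comment to its parent) or by walking comment.replies recursively.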
Be aware that the subreddit listing will only include the latest 1000 posts, so if your target subreddit has more than that, you probably won't be able to obtain a full backup anyway.
