Getting all articles related to a query without a bias - python

I am trying to build a corpus of documents related to earthquakes. I want to download all news articles related to that event. My problem is that using Google search (stackoverflow.com/questions/…) biases the results toward what is relevant now. Instead, I want all articles irrespective of time or relevance.

The problem is that Google tries to guess the most relevant search results for a user entering your query, while you are interested in all of them.
You would be better served by a newspaper article database than by Google in this case. If you are currently enrolled in a university, ask your library about this kind of resource. If you have access to such a database, you will be able to search for every article containing a given keyword, and some search forms will even let you filter by publisher, date, geographical location, etc.
Eureka.cc is an example of such a database.
Some newspapers' websites give you access to their article archive; The New York Times is one of them.
Here is a result from searching their article database for "earthquake".
More info about newspaper article databases
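If you want to query such an archive programmatically from Python, the New York Times offers an Article Search API. Here is a minimal sketch, assuming the requests library and a placeholder API key from developer.nytimes.com; sorting by oldest sidesteps the recency bias you are seeing with Google:

import requests

# A minimal sketch of the NYT Article Search API; the key is a placeholder.
API_KEY = "YOUR_NYT_API_KEY"
url = "https://api.nytimes.com/svc/search/v2/articlesearch.json"
params = {"q": "earthquake", "sort": "oldest", "api-key": API_KEY}

response = requests.get(url, params=params)
for doc in response.json()["response"]["docs"]:
    print(doc["pub_date"], doc["headline"]["main"])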

Related

Restrict words in Google Books API

Is there a way to restrict certain words from appearing in a title in the Google Books API?
For example, I want to receive data about fantasy books, but I keep getting books such as "Guide to Literary Agents 2017" in my search. I was wondering if I could exclude some words, such as "Literary", from my search (or whether there is a better way to resolve this problem).
Also, this is my API link:
https://www.googleapis.com/books/v1/volumes?q=subject:fantasy+young%20adult&printType=books&langRestrict=en&maxResults=40&key=APIKey
Yes; in the Books API Developers Guide, I found this:
To exclude entries that match a given term, use the form q=-term.
So, in your example, you could try something like:
https://www.googleapis.com/books/v1/volumes?q=subject:fantasy+young%20adult+-Literary&printType=books&langRestrict=en&maxResults=40
I didn't see the title Guide to Literary Agents 2017 in the results, and when I tried excluding a few other keywords it did seem to filter out those titles as well.
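For reference, here is a minimal Python sketch of the same query using the requests library, keeping the subject filter and the exclusion together inside the single q parameter (no API key is needed for public volume data):

import requests

# A minimal sketch: the exclusion (-Literary) lives inside the single
# q parameter together with the subject filter, per the Developers Guide.
params = {
    "q": "subject:fantasy young adult -Literary",
    "printType": "books",
    "langRestrict": "en",
    "maxResults": 40,
}
response = requests.get("https://www.googleapis.com/books/v1/volumes", params=params)
for item in response.json().get("items", []):
    print(item["volumeInfo"].get("title"))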

Is there a way to analyse single user data in google analytics?

I'm trying to analyze (for business intelligence purposes) some Google Analytics data in Python.
After many tutorials, all I get is "aggregated" data, like the number of views in a day. What I need instead is something capable of tracking the behavior of a single user: which pages of the site they visited, their bounce rate, whether they used the e-commerce features, and so on.
I have seen many CSVs already prepared for this kind of analysis, but I'm starting from scratch with my own website.
You can use the User-ID feature: when you send Analytics an ID and related data from multiple sessions, your reports tell a more unified, holistic story about a user’s relationship with your business:
https://support.google.com/analytics/answer/3123662?hl=en
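As a rough illustration of what "sending Analytics an ID" can look like from Python, here is a minimal sketch using the Universal Analytics Measurement Protocol; the tracking ID, client ID, user ID, and page path below are all placeholders:

import requests

# A minimal sketch: send a pageview hit tagged with your own User ID
# via the Measurement Protocol. All IDs below are placeholders.
payload = {
    "v": "1",              # protocol version
    "tid": "UA-XXXXX-Y",   # your tracking/property ID (placeholder)
    "cid": "555",          # client ID for the device/browser
    "uid": "user-42",      # your User ID, used to stitch sessions together
    "t": "pageview",       # hit type
    "dp": "/example-page", # page path (hypothetical)
}
requests.post("https://www.google-analytics.com/collect", data=payload)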
Otherwise, you can examine individual user behavior at the session level in the User Explorer report, which lets you isolate and examine individual rather than aggregate user behavior. Individual user behavior is associated with either a Client ID or a User ID.
https://support.google.com/analytics/answer/6339208?hl=en

How to get all Wikipedia articles from category and sub categories using Python? [duplicate]

I want to get the names of all articles under a category and its sub-categories.
Options I'm aware of:
Using the Wikipedia API. Does it have such an option?
Downloading the dump. Which format would be better for my usage?
There is also an option to search Wikipedia with something like incategory:"music", but I didn't see a way to get those results as XML.
Please share your thoughts.
The following resource will help you to download all pages from the category and all its subcategories:
http://en.wikipedia.org/wiki/Wikipedia:CatScan
There is also an API available here:
https://www.mediawiki.org/wiki/API:Categorymembers
You can do this through the following two API methods:
To get the article pages in a category:
YOUR_URL/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Music
To get the subcategories:
YOUR_URL/api.php?action=query&format=json&list=categorymembers&cmtype=subcat&cmtitle=Category:Music
You can find more information in the MediaWiki API documentation; a sketch combining the two calls follows below.
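Here is a minimal Python sketch combining the two calls, assuming the English Wikipedia endpoint and the requests library; it follows the API's continuation protocol and guards against revisiting categories with a visited set (see the note below on why that matters):

import requests

API_URL = "https://en.wikipedia.org/w/api.php"  # assumption: English Wikipedia

def category_members(category, cmtype):
    # Yield member titles of a category, following API continuation.
    params = {
        "action": "query",
        "format": "json",
        "list": "categorymembers",
        "cmtitle": category,
        "cmtype": cmtype,  # "page" for articles, "subcat" for subcategories
        "cmlimit": "500",
    }
    while True:
        data = requests.get(API_URL, params=params).json()
        for member in data["query"]["categorymembers"]:
            yield member["title"]
        if "continue" not in data:
            break
        params.update(data["continue"])

def all_articles(category, seen=None):
    # Collect article titles in a category and all its subcategories,
    # skipping categories already visited to avoid infinite loops.
    if seen is None:
        seen = set()
    if category in seen:
        return set()
    seen.add(category)
    articles = set(category_members(category, "page"))
    for subcat in category_members(category, "subcat"):
        articles |= all_articles(subcat, seen)
    return articles

print(len(all_articles("Category:Music")))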
Note that Wikipedia's categorization system is not a tree, or even an acyclic graph. It is quite possible that by continually following subcategory links you will eventually wind up back where you started.
If you are going to be making many such queries, you would be best served by downloading a database dump. If this will be an infrequent thing and will only be dealing with small categories, you could probably get away with making repeated queries to list=categorymembers.
incategory:"music" does not appear to do subcategory searching.

Extract statistical information from Wikipedia article

I'm currently extracting data from DBpedia articles using SPARQLWrapper for Python, but I can't seem to find how to extract the number of watchers (and other statistical information) for a given article.
Is there an easy way to achieve this? I don't mind whether it's through DBpedia or directly through Wikipedia (using wget, for example).
Thanks for any advice.
Getting the number of watchers for an arbitrary article is deliberately restricted, as it would be a security leak if everyone could find unwatched pages. For example, only privileged users have access to Special:UnwatchedPages. There is a Toolserver tool (which has database access) showing the number of watchers, but for the same reason it is restricted to pages with more than 30 watchers, at least for unauthenticated users.
The MediaWiki query API mostly exposes content and status information about articles, though you can also query and evaluate the public logs or revision histories to get statistical data about (public) user actions. For more statistics about the Wikimedia sites, have a look at Meta:Statistics, which lists various data sources (mostly http://stats.wikimedia.org/) and visualisations of them.
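As an example of the kind of statistics that are publicly available, here is a minimal Python sketch that counts edits per year from an article's revision history, assuming the English Wikipedia endpoint and the requests library (the article title is just an example):

import requests
from collections import Counter

API_URL = "https://en.wikipedia.org/w/api.php"  # assumption: English Wikipedia

params = {
    "action": "query",
    "format": "json",
    "prop": "revisions",
    "titles": "Earthquake",  # hypothetical example article
    "rvprop": "timestamp",
    "rvlimit": "500",
}
edits_per_year = Counter()
while True:
    data = requests.get(API_URL, params=params).json()
    page = next(iter(data["query"]["pages"].values()))
    for rev in page.get("revisions", []):
        edits_per_year[rev["timestamp"][:4]] += 1  # timestamps are ISO 8601
    if "continue" not in data:
        break
    params.update(data["continue"])

print(sorted(edits_per_year.items()))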

Performing multiple searches of a term in Twitter

I have little working knowledge of Python. I know that there is something called the Twitter Search API, but I'm not really sure what I'm doing. I know what I need to do:
I need point data for a class. I thought I would pull up a map of the world in a GIS application, select cities that have population x or larger, then export those selections to a new table. That table would have a key and a city name.
Next, I randomly select 100 of those cities. Then I perform a search for a certain term (in this case, Gaddafi) for each of those 100 cities. All I need to know is how many posts there were on a certain day (or over a few days, depending on the volume of tweets).
I just have a feeling that something already exists that does this, and I'm having a hard time finding it. I've downloaded and installed python-twitter but have no idea how to get this search done. Does anyone know where I can find, or how I can make, this tool? Any suggestions would really help. Thanks!
A tweet itself can carry a geotag, but this is a new feature and the majority of tweets do not have one. So it is not possible to search for all tweets containing "Gaddafi" from a city given just the city name.
What you could do is the reverse: search for "Gaddafi" first (regardless of geolocation) using the Search API. Then, for each tweet, find the location of the poster (either through the RESTful API or with some sort of web scraping).
So basically you can classify the collected tweets according to the location of the poster.
I think tweepy is the only library with access to both the Twitter Search API and the RESTful API.
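A rough sketch of that approach with tweepy, assuming its classic v3-style API where search is exposed as api.search (renamed search_tweets in tweepy 4.x); all credentials below are placeholders:

import tweepy
from collections import Counter

# Placeholders: register an app at developer.twitter.com for real keys.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth)

location_counts = Counter()
for tweet in tweepy.Cursor(api.search, q="Gaddafi").items(500):
    # The profile location is free text set by the user and often empty,
    # so it needs cleanup before matching against your list of cities.
    if tweet.user.location:
        location_counts[tweet.user.location] += 1

print(location_counts.most_common(20))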
