How to scrape the data from Wikipedia by category? - python

I would like to use only medical data from Wikipedia for analysis. I use Python for scraping.
I have used this library to search by a query word:
import wikipedia
import requests
import pprint
from bs4 import BeautifulSoup

wikipedia.set_lang("en")
query = raw_input()
WikiPage = wikipedia.page(title=query, auto_suggest=True)
cat = WikiPage.categories
for i in cat:
    print i
and get the categories.
But my problem is the reverse:
I want to give a category, for example health or medical terminology, and get all articles of that type.
How can I do this?

Edit: actual answer
There is API:Categorymembers, which documents usage, parameters and gives examples on "how to retrieve lists of pages in a given category, ordered by title". It won't save you from having to descend through the category tree (cf. below) yourself, but you get a nice entry point and machine-readable results.
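A minimal sketch of that entry point with requests; the category name is just an example, and the loop follows the continue block the API returns for long categories:

import requests

API_URL = "https://en.wikipedia.org/w/api.php"

def category_members(category, limit=500):
    """Yield page titles in a category via list=categorymembers."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": "Category:" + category,
        "cmlimit": limit,
        "format": "json",
    }
    while True:
        data = requests.get(API_URL, params=params).json()
        for member in data["query"]["categorymembers"]:
            yield member["title"]
        if "continue" not in data:
            break
        params.update(data["continue"])  # follow cmcontinue into the next batch

for title in category_members("Medical terminology"):
    print(title)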
Old answer: related information
A very brief pointer is given on the Help:Category page, section Searching for articles in categories:
In addition to browsing through hierarchies of categories, it is
possible to use the search tool to find specific articles in specific
categories. To search for articles in a specific category, type
incategory:"CategoryName" in the search box.
An "OR" can be added to join the contents of one category with the
contents of another. For example, enter
incategory:"Suspension bridges" OR incategory:"Bridges in New York City"
to return all pages that belong to either (or both) of the categories,
as here.
Note that using search to find categories will not find articles which
have been categorized using templates. This feature also doesn't
return pages in subcategories.
To address the subcategory problem, the page Special:CategoryTree can be used instead. However, that page does not link to any obvious documentation, so I think the <form> fields would have to be searched for manually in the page source to build a programmatic API.
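For completeness, the incategory: filter can also be driven programmatically through the search API (list=search), with the same caveats about templates and subcategories; a small sketch reusing the example query above:

import requests

API_URL = "https://en.wikipedia.org/w/api.php"

params = {
    "action": "query",
    "list": "search",
    "srsearch": 'incategory:"Suspension bridges" OR incategory:"Bridges in New York City"',
    "srlimit": 50,
    "format": "json",
}
data = requests.get(API_URL, params=params).json()
for hit in data["query"]["search"]:
    print(hit["title"])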

Related

Extract pageviews from all the pages of a wiki glossary?

How do I go about extracting only page views (all time, or at most year-wise; I'm not really interested in daily, monthly, etc.) for all the pages linked from a glossary page?
Example: https://en.wikipedia.org/wiki/Glossary_of_areas_of_mathematics
I found this tool, but it only works on categories.
Is there a way, in Python or otherwise, to GET pageviews for all the links listed on a page?
There won't be a simple way to do this because that article content is unstructured, unlike for a category.
You'll need to extract the page titles manually by parsing the article and then pass each title to the API to get the pageviews. The URL structure is documented here: https://pageviews.toolforge.org/pageviews/url_structure/. You can pass multiple titles by separating them with |, but there is a limit on the number, so you will need to make multiple queries.
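If you want to stay in Python rather than click through the web tool, the same data is available from the Wikimedia REST API (pageviews per-article endpoint). A hedged sketch that first collects the linked titles via the MediaWiki API and then sums monthly views for each; the date range, User-Agent string and helper names are illustrative, and titles containing slashes would need extra encoding:

import requests

HEADERS = {"User-Agent": "pageviews-example/0.1 (you@example.com)"}  # Wikimedia asks for a descriptive User-Agent
WIKI_API = "https://en.wikipedia.org/w/api.php"
PV_API = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"

def linked_titles(page):
    """Titles of article-namespace pages linked from `page` (prop=links)."""
    params = {"action": "query", "titles": page, "prop": "links",
              "plnamespace": 0, "pllimit": "max", "format": "json"}
    while True:
        data = requests.get(WIKI_API, params=params, headers=HEADERS).json()
        for p in data["query"]["pages"].values():
            for link in p.get("links", []):
                yield link["title"]
        if "continue" not in data:
            break
        params.update(data["continue"])

def monthly_views(title, start="2023010100", end="2023123100"):
    """Sum of monthly pageviews for one article over the given range."""
    url = "{}/en.wikipedia/all-access/all-agents/{}/monthly/{}/{}".format(
        PV_API, title.replace(" ", "_"), start, end)
    items = requests.get(url, headers=HEADERS).json().get("items", [])
    return sum(item["views"] for item in items)

for t in linked_titles("Glossary of areas of mathematics"):
    print(t, monthly_views(t))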

Scraping Multiple Sites for Similar Information

I need to scrape several different sites for the same information. Basically, I am looking for similar information, but the sites could belong to different vendors and have different HTML structures. For example, if I am trying to scrape the data related to textbooks from Barnes & Noble and Biblio (these are only two, but there could be many) and get the book name, author, and price for each book, how would one do that?
https://www.barnesandnoble.com/b/textbooks/mathematics/algebra/_/N-8q9Z18k3
https://www.biblio.com/search.php?stage=1&result_type=works&keyisbn=algebra
Of course, I can parse the two sites independently, but I am looking for a general methodology that can be easily applied to other vendors as well to extract the same information.
In a separate but related question, I would also like to know how Google shows product information from different sources when you search for a product. For example, if you google "MacBook Pro", at the top of the page you get a carousel of products from different vendors. I assume Google is scraping this information from different sources automatically to suggest to the user what is available.
Have a look at scrapely. It can really be helpful if you don't want to manually parse different HTML structures.
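For reference, scrapely usage looks roughly like this: you train it on one example page whose field values you already know, and it infers an extraction template it can apply to similar pages from the same site. The URLs and values below are placeholders:

from scrapely import Scraper

s = Scraper()

# Train on one page whose values you already know (placeholder URL and values).
s.train("https://www.example-bookstore.com/book/12345",
        {"name": "Algebra for Beginners",
         "author": "Jane Doe",
         "price": "$29.99"})

# The learned template can then be applied to other pages with the same layout.
print(s.scrape("https://www.example-bookstore.com/book/67890"))

You would still train one template per vendor, but the per-site work drops to supplying one annotated example instead of hand-writing parsing code.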

How to scrape tag information from questions on Stack Exchange

My problem is that I want to create a database of all of the questions, answers, and most importantly, the tags, from a certain (somewhat small) Stack Exchange. The relationships among tags (e.g. tags more often used together have a strong relation) could reveal a lot about the structure of the community and the popularity of or interest in certain subfields.
So, what is the easiest way to go through a list of questions (that are positively ranked) and extract the tag information using Python?
The easiest way to get the shared-tag count for all questions is to use the Stack Exchange API.
import requests
r = requests.get(
    'http://api.stackexchange.com/2.2/tags/python/related?pagesize=3&site=stackoverflow')
for item in r.json()['items']:
    print("{name} shares {count} tags with Python".format(**item))
If this doesn't satisfy your need, there are many other API queries available.
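For example, the /questions endpoint returns each question's score and tags directly, so something along these lines would collect positively ranked questions (the min parameter applies to the sort field, here votes; the site name is whichever Stack Exchange you care about):

import requests

def questions_with_tags(site, pages=5):
    """Yield (title, score, tags) for positively scored questions via the API."""
    for page in range(1, pages + 1):
        data = requests.get("https://api.stackexchange.com/2.2/questions", params={
            "site": site,
            "sort": "votes",
            "order": "desc",
            "min": 1,        # only questions with a positive score
            "pagesize": 100,
            "page": page,
        }).json()
        for q in data["items"]:
            yield q["title"], q["score"], q["tags"]
        if not data.get("has_more"):
            break

for title, score, tags in questions_with_tags("stackoverflow"):
    print(score, tags, title)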
Visit the site to find the URL that shows the information you want, then look at the page source to see how it has been formatted.
In order to scrape the pages use the urllib2 library.
Parse the text using the BeautifulSoup library.
Place the data into a database.
The difficult thing is going to be structuring your database and developing queries that reveal what you want.
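A bare-bones skeleton of that pipeline, using requests in place of urllib2; the URL and the CSS classes are assumptions about the question-list markup and must be checked against the real page source:

import sqlite3
import requests
from bs4 import BeautifulSoup

conn = sqlite3.connect("questions.db")
conn.execute("CREATE TABLE IF NOT EXISTS question_tags (question TEXT, tag TEXT)")

html = requests.get("https://stackoverflow.com/questions?sort=votes").text
soup = BeautifulSoup(html, "html.parser")

# .question-summary, .question-hyperlink and .post-tag are guesses at the markup.
for summary in soup.select(".question-summary"):
    title = summary.select_one(".question-hyperlink").get_text(strip=True)
    for tag in summary.select(".post-tag"):
        conn.execute("INSERT INTO question_tags VALUES (?, ?)", (title, tag.get_text()))

conn.commit()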

Wikipedia philosophy game diagram in python and R

So I am relatively new to Python, and in order to learn, I have started writing a program that goes to Wikipedia, finds the first link in the overview section of a random article, follows that link, and keeps going until it either enters a loop or finds the Philosophy page (as detailed here), and then repeats this process for a new random article a specified number of times. I then want to collect the results in some useful data structure, so that I can pass the data to R using the Rpy library and draw some sort of network diagram (R is pretty good at drawing things like that), with each node in the diagram representing a page visited, and the arrows representing the paths taken from the starting article to the Philosophy page.
I have no problem getting Python to return the fairly structured HTML from Wikipedia, but there are some problems that I can't quite figure out. Up till now I have selected the first link using a CSS selector from the lxml library. It selects the first link (an a tag) that is a direct descendant of a p tag, which is itself a direct descendant of a div tag with class="mw-content-ltr", like this:
import urllib
import urllib2
from lxml.html import parse

def get_first_link(url):   # the wrapping function reflects the description below; the name is arbitrary
    user_agent = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT)'
    values = {'name': 'David Kavanagh',
              'location': 'Belfast',
              'language': 'Python'}
    headers = {'User-Agent': user_agent}
    encodes = urllib.urlencode(values)
    req = urllib2.Request(url, encodes, headers)
    page = urllib2.urlopen(req)
    root = parse(page).getroot()
    return root.cssselect("div.mw-content-ltr>p>a")[0].get('href')
This code resides in a function which I use to find the first link in the page. It works for the most part, but the problem is that if the first link is inside some other tag rather than being a direct descendant of a p tag (say a b tag), then I miss it. As you can see from the wiki article above, links in italics or inside parentheses aren't eligible for the game. That means I never get a link in italics (good), but I frequently do get links that are inside parentheses (bad), and I sometimes miss the first link on a page, like the first link on the Chair article, which is stool: it is in bold, so I don't get it. I have tried removing the direct-descendant stipulation, but then I frequently get links that are "above" the overview section, usually in the side box, in a p tag, in a table, in the same div as the overview section.
So the first part of my question is:
How could I use CSS selectors, or some other function or library, to select the first link in the overview section that is not inside parentheses or in italics? I thought about using regular expressions to look through the raw HTML, but that seems like a very clunky solution, and I thought there might be something a bit nicer out there that I haven't thought of.
So currently I am storing the results in a list of lists. So I have a list called paths, in which there are lists that contain strings that contain the title of the wiki article.
The second part of the question is:
How can I traverse this list of lists to represent the multiple convergent paths? Is storing the results like this a good idea? Since the end diagram should look something like an upside-down tree, I thought about making some kind of tree class, but that seems like a lot of work for something that is conceptually fairly simple.
Any ideas or suggestions would be greatly appreciated.
Cheers,
Davy
I'll just answer the second question:
For a start, just keep a dict mapping one Wikipedia article title to the next. This will make it easy and fast to check if you've hit an article you've already found before. Essentially this is just storing a directed graph's edges, indexed by their origins.
If you get to the point where a Python dict is not efficient enough (it does have a significant memory overhead, once you have millions of items memory can be an issue) you can find a more efficient graph data structure to suit your needs.
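A minimal sketch of that bookkeeping (names are illustrative):

# next_page maps each visited article title to the title of the first link it leads to.
next_page = {}

def record_path(path_titles):
    """Store the successor of each title; return the first already-known article, if any."""
    for current, successor in zip(path_titles, path_titles[1:]):
        if current in next_page:
            # We've reached an article seen on an earlier walk; the rest of the
            # route to Philosophy is already stored, so we can stop here.
            return current
        next_page[current] = successor
    return None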
EDIT
Ok, I'll answer the first question as well...
For the first part, I highly recommend using the MediaWiki API instead of getting the HTML version and parsing it. The API enables querying for certain types of links, for instance just inter-wiki links or just inter-language links. Additionally, there are Python client libraries for this API, which should make using it from Python code simple.
Don't parse a website's HTML if it provides a comprehensive and well-documented API!
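As a taste of what the API gives you, here is a sketch that lists a page's main-namespace links with prop=links. One caveat: the API returns links sorted by title rather than in article order, so for the first-link game you would still need the parsed content (e.g. action=parse) to recover positions:

import requests

API_URL = "https://en.wikipedia.org/w/api.php"

def page_links(title):
    """Main-namespace links of a page, as reported by prop=links (sorted by title)."""
    params = {"action": "query", "titles": title, "prop": "links",
              "plnamespace": 0, "pllimit": "max", "format": "json"}
    data = requests.get(API_URL, params=params).json()
    page = next(iter(data["query"]["pages"].values()))
    return [link["title"] for link in page.get("links", [])]

print(page_links("Chair")[:10])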
For the first part, it's not possible to find brackets with CSS selectors, because as far as the HTML is concerned, brackets are just text.
If I were you, I'd use selectors to find all the relevant paragraph elements that are valid for the game. Then I'd look at the text of each paragraph element and remove anything that is not valid - for instance, anything between brackets and anything between italic tags. I'd then search this processed text for the link elements I need. This is slightly nicer than manually processing the whole HTML document.
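A rough illustration of that idea with lxml: drop the italic elements, then walk each paragraph in document order while tracking parenthesis depth in the surrounding text, and return the first link found at depth zero. The selector and the details will need tuning:

import requests
from lxml import html

def first_game_link(url):
    """First body link that is neither in italics nor inside parentheses."""
    root = html.fromstring(requests.get(url).text)
    for para in root.cssselect("div.mw-content-ltr p"):
        for italic in para.cssselect("i"):   # italicised links never count
            italic.drop_tree()
        href = first_link_outside_parens(para)
        if href:
            return href
    return None

def first_link_outside_parens(elem, depth=None):
    """Walk elem in document order, counting '(' and ')' in text and tail nodes."""
    if depth is None:
        depth = [0]
    depth[0] += (elem.text or "").count("(") - (elem.text or "").count(")")
    for child in elem:
        if child.tag == "a" and depth[0] == 0:
            return child.get("href")
        found = first_link_outside_parens(child, depth)
        if found:
            return found
        depth[0] += (child.tail or "").count("(") - (child.tail or "").count(")")
    return None

print(first_game_link("https://en.wikipedia.org/wiki/Chair"))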
I'm not sure I follow exactly what you are doing for the second part, but as for representing the results of this search as a tree: This is a bad idea as you are looking for cycles, which trees can't represent.
For a data structure, I would have lists of 'nodes', where a node represents a page and has a URL and a count of occurrences. I'd then use a brute-force algorithm to compare lists of nodes - if two lists have a node that is the same, you could merge them, increasing the 'occurrences' count for each mirrored node.
I would not use the standard python 'list' as this can't loop back on itself. Maybe create your own linked list implementation to contain the nodes.

python method to extract content (excluding navigation) from an HTML page

Of course an HTML page can be parsed using any number of python parsers, but I'm surprised that there don't seem to be any public parsing scripts to extract meaningful content (excluding sidebars, navigation, etc.) from a given HTML doc.
I'm guessing it's something like collecting DIV and P elements and then checking them for a minimum amount of text content, but I'm sure a solid implementation would include plenty of things that I haven't thought of.
Try the Beautiful Soup library for Python. It has very simple methods to extract information from an html file.
Trying to generically extract data from webpages would require people to write their pages in a similar way... but there's an almost infinite number of ways to convey a page that looks identical, let alone all the combinations you can have to convey the same information.
Was there a particular type of information you were trying to extract or some other end goal?
You could try extracting any content in 'div' and 'p' markers and comparing the relative sizes of all the information in the page. The problem then is that people probably group information into collections of 'div's and 'p's (or at least they do if they're writing well-formed HTML!).
Maybe if you formed a tree of how the information is related (nodes would be the 'p' or 'div' or whatever, and each node would contain the associated text) you could do some sort of analysis to identify the smallest 'p' or 'div' that encompasses what appears to be the majority of the information?
[EDIT] Maybe if you can get it into the tree structure I suggested, you could then use a similar points system to spam assassin. Define some rules that attempt to classify the information. Some examples:
+1 points for every 100 words
+1 points for every child element that has > 100 words
-1 points if the section name contains the word 'nav'
-2 points if the section name contains the word 'advert'
If you have lots of low-scoring rules which add up when you find more relevant-looking sections, I think that could evolve into a fairly powerful and robust technique.
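A toy version of such a scoring pass over a parsed page, using only the rules suggested above and reading the "section name" from the id and class attributes (both the weights and that interpretation are assumptions to experiment with):

from bs4 import BeautifulSoup

def score(element):
    """Score an element with the rough rules above; higher means more content-like."""
    points = len(element.get_text(" ", strip=True).split()) // 100   # +1 per 100 words
    for child in element.find_all(["div", "p"], recursive=False):
        if len(child.get_text(" ", strip=True).split()) > 100:
            points += 1                                              # wordy child elements
    name = " ".join([element.get("id", "")] + element.get("class", [])).lower()
    if "nav" in name:
        points -= 1
    if "advert" in name:
        points -= 2
    return points

soup = BeautifulSoup(open("page.html").read(), "html.parser")   # any saved page to test with
best = max(soup.find_all(["div", "p"]), key=score)
print(best.get_text(" ", strip=True)[:500])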
[EDIT2] Looking at readability, it seems to be doing pretty much exactly what I just suggested! Maybe it could be improved to try and understand tables better?
Have a look at templatemaker: http://www.holovaty.com/writing/templatemaker/
It's written by one of the founders of Django. Basically you feed it a few example html files and it will generate a "template" that you can then use to extract just the bits that are different (which is usually the meaningful content).
Here's an example from the google code page:
# Import the Template class.
>>> from templatemaker import Template
# Create a Template instance.
>>> t = Template()
# Learn a Sample String.
>>> t.learn('<b>this and that</b>')
# Output the template so far, using the "!" character to mark holes.
# We've only learned a single string, so the template has no holes.
>>> t.as_text('!')
'<b>this and that</b>'
# Learn another string. The True return value means the template gained
# at least one hole.
>>> t.learn('<b>alex and sue</b>')
True
# Sure enough, the template now has some holes.
>>> t.as_text('!')
'<b>! and !</b>'
You might use the boilerpipe Web application to fetch and extract content on the fly.
(This is not specific to Python, as you only need to issue an HTTP GET request to a page on Google AppEngine.)
Cheers,
Christian
What is meaningful and what is not depends on the semantics of the page. If the semantics are poor, your code won't "guess" what is meaningful. I use readability, which you linked in the comment, and I see that on many pages I try it does not provide any result at all, let alone a decent one.
If someone puts the content in a table, you're doomed. Try readability on a phpBB forum and you'll see what I mean.
If you want to do it, go with a regexp on <p></p>, or parse the DOM.
Goose is just the library for this task. To quote their README:
Goose will try to extract the following information:
Main text of an article
Main image of article
Any Youtube/Vimeo movies embedded in article
Meta Description
Meta tags
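Basic usage looks roughly like this (attribute names as in the python-goose README; the URL is a placeholder):

from goose import Goose

g = Goose()
article = g.extract(url="http://example.com/some-news-story")  # placeholder URL

print(article.title)
print(article.meta_description)
print(article.cleaned_text[:500])                              # main text of the article
print(article.top_image.src if article.top_image else None)   # main image, if one was found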
