How do I go about extracting only the page views (all-time, or at most year-wise; I'm not really interested in daily or monthly figures) of all the subpages linked from a glossary page?
Example: https://en.wikipedia.org/wiki/Glossary_of_areas_of_mathematics
I found this tool, but it only works on categories.
Is there a way, in Python or otherwise, that I can implement to get the pageviews for all the links listed on a page?
There won't be a simple way to do this because that article's content is unstructured, unlike a category's.
You'll need to extract the page titles manually by parsing the article and then pass each of the titles to the API to get the pageviews. It is documented here: https://pageviews.toolforge.org/pageviews/url_structure/. You can pass multiple titles by separating them with |, but there is a limit on the number, so you may need to make multiple queries.
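A rough sketch of that approach in Python, using the MediaWiki action API for the link list and the underlying Wikimedia REST pageviews API (one request per title) rather than the Pageviews tool URL above; the date range and the five-title limit are just illustrative:

import requests

HEADERS = {"User-Agent": "pageview-script/0.1 (contact: example@example.com)"}  # placeholder contact

# 1. Get all links from the glossary article via the action API
params = {
    "action": "parse",
    "page": "Glossary of areas of mathematics",
    "prop": "links",
    "format": "json",
}
data = requests.get("https://en.wikipedia.org/w/api.php", params=params, headers=HEADERS).json()
titles = [link["*"] for link in data["parse"]["links"] if link["ns"] == 0]  # main-namespace links only

# 2. Sum monthly pageviews per title from the Wikimedia REST API
for title in titles[:5]:  # limited to five titles for the example
    article = requests.utils.quote(title.replace(" ", "_"), safe="")
    url = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
           "en.wikipedia/all-access/all-agents/" + article + "/monthly/2020010100/2020123100")
    resp = requests.get(url, headers=HEADERS)
    if resp.ok:
        print(title, sum(item["views"] for item in resp.json()["items"]))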
I need to scrape several different sites for the same information. Basically, I am looking for similar information, but the sites belong to different vendors and can have different HTML structures. For example, if I am trying to scrape textbook data from Barnes & Noble and Biblio (only two here, but there could be many) to get the book name, author, and price for each book, how would one do that?
https://www.barnesandnoble.com/b/textbooks/mathematics/algebra/_/N-8q9Z18k3
https://www.biblio.com/search.php?stage=1&result_type=works&keyisbn=algebra
Of course, I can parse the two sites independently, but I am looking for a general methodology that can be easily applied to other vendors as well to extract the same information.
In a separate but related question, I would also like to know how Google shows product information from different sources when you search for a product. For example, if you google "MacBook Pro", at the top of the page you get a carousel of products from different vendors. I assume Google is scraping this information from different sources automatically to suggest to the user what is available.
Have a look at scrapely. It can really be helpful if you don't want to manually parse different HTML structures.
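For what it's worth, a minimal sketch of how scrapely is typically used: you train it on one example page by handing it the field values you want, and it infers an extraction template that can be applied to similarly structured pages. The URLs and field values below are placeholders, not tested against the real vendor sites:

from scrapely import Scraper

s = Scraper()

# Train on one product page by telling scrapely which values you care about;
# it learns the surrounding HTML pattern instead of requiring explicit selectors.
s.train("https://www.example.com/book/some-algebra-textbook",          # placeholder URL
        {"name": "Algebra Textbook", "author": "J. Doe", "price": "$59.99"})

# Then scrape other pages with the learned template.
print(s.scrape("https://www.example.com/book/another-textbook"))       # placeholder URL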
I am new to Scrapy and am trying to extract Google News results from the link given below:
https://www.google.co.in/search?q=cholera+news&safe=strict&source=lnms&tbm=nws&sa=X&ved=0ahUKEwik0KLV-JfYAhWLpY8KHVpaAL0Q_AUICigB&biw=1863&bih=966
"cholera" key word was provided that shows small blocks of various news associated with cholera key world further I try this with scrapy to extract the each block that contents individual news.
fetch("https://www.google.co.in/search?q=cholera+news&safe=strict&source=lnms&tbm=nws&sa=X&ved=0ahUKEwik0KLV-JfYAhWLpY8KHVpaAL0Q_AUICigB&biw=1863&bih=966")
response.css(".ts._JGs._KHs._oGs._KGs._jHs::text").extract()
where .ts._JGs._KHs._oGs._KGs._jHs::text represents the div with class="ts _JGs _KHs _oGs _KGs _jHs" for each block of news,
but it returns nothing.
After struggling, I found a way to scrape the desired data with a very simple trick:
fetch("https://www.google.co.in/search?q=cholera+news&safe=strict&source=lnms&tbm=nws&sa=X&ved=0ahUKEwik0KLV-JfYAhWLpY8KHVpaAL0Q_AUICigB&biw=1863&bih=966")
and the CSS selector for class="g" can be used to extract the desired blocks like this:
response.css(".g").extract()
which returns a list of all the individual news blocks, which can then be accessed by list index like this:
response.css(".g").extract()[0]
or
response.css(".g").extract()[1]
In the Scrapy shell, use view(response) and you will see in a web browser what you fetched with fetch().
Google uses JavaScript to display data, but it can also serve a page that doesn't use JavaScript. However, the page without JavaScript usually has different tags and classes.
You can also turn off JavaScript in your browser and then open Google to see those tags.
Try this:
response.css('#search td ::text').extract()
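Put together, a typical shell session for checking this looks like the following; the exact output depends on whatever markup Google currently serves:

fetch("https://www.google.co.in/search?q=cholera+news&safe=strict&source=lnms&tbm=nws&sa=X&ved=0ahUKEwik0KLV-JfYAhWLpY8KHVpaAL0Q_AUICigB&biw=1863&bih=966")
view(response)                               # opens the fetched, JavaScript-free page in your browser
response.css('#search td ::text').extract()[:10]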
The problem
I'm using the Wikipedia API to get page HTML which I parse. I use queries like this one to get the HTML for the first section of a page.
The MediaWiki API provides a handy parameter, redirects, which causes the API to automatically follow pages that redirect to other pages. For example, if I search for 'Cats' with https://en.wikipedia.org/w/api.php?page=Cats&redirects, I will be shown the results for Cat because Cats redirects to Cat.
I'd like a similar function for disambiguation pages such as this, by which if I arrive at a disambiguation page, I am automatically redirected to the first link. For example, if I make a request to a page like Mercury, I'd automatically be redirected to Mercury (element), as it is the first link listed in the page.
The Python HTML parser BeautifulSoup is fairly slow on large documents. By only requesting the first section of articles (that's all I need for my use), using section=0, I can parse it quickly. This is perfect for most articles. But for disambiguation pages, the first section does not include any of the links to specific pages, making it a poor solution. But if I request more than the first section, the HTML loading slows down, which is unnecessary for most articles. See this query for an example of a disambiguation page in which links are not included in the first section.
What I have so far
As of right now, I've gotten as far as detecting when a disambiguation page is reached. I use code like
bs4.BeautifulSoup(page_html, "html.parser").find("p", recursive=False).get_text().endswith(("refer to:", "refers to:"))
I also spent a while trying to write code that automatically followed a link, before I realized that the links were not included in the first section.
My constraints
I'd prefer to keep the number of requests made to a minimum. I also need to parse as little HTML as possible, because speed is essential for my application.
Possible solutions (which I need help executing)
I could envision several solutions to this problem:
A way within the MediaWiki API to automatically follow the first link from disambiguation pages
A method within the Mediawiki API that allows it to return different amounts of HTML content based on a condition (like presence of a disambiguation template)
A way to dramatically improve the speed of bs4 so that it doesn't matter if I end up having to parse the entire page HTML
As Tgr and everybody said, no, such a feature doesn't exist because it doesn't make sense. The first link in a disambiguation page doesn't have any special status or meaning.
As for the existing API, see https://www.mediawiki.org/wiki/Extension:Disambiguator#API_usage
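In practice, the check documented there amounts to querying the disambiguation page prop that the Disambiguator extension sets; a minimal sketch in Python (the title is just an example):

import requests

params = {
    "action": "query",
    "prop": "pageprops",
    "ppprop": "disambiguation",
    "titles": "Mercury",
    "format": "json",
}
pages = requests.get("https://en.wikipedia.org/w/api.php", params=params).json()["query"]["pages"]
for page in pages.values():
    is_dab = "disambiguation" in page.get("pageprops", {})
    print(page["title"], "is a disambiguation page" if is_dab else "is not a disambiguation page")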
By the way, the "bot policy" you linked does not really apply to crawlers/scrapers; the only relevant policy/guideline is the User-Agent policy.
I would like to use only medical data from Wikipedia for analysis. I use Python for scraping.
I have used this library for searching by word in query:
import wikipedia
import requests
import pprint
from bs4 import BeautifulSoup
wikipedia.set_lang("en")
query = raw_input()
WikiPage = wikipedia.page(title = query,auto_suggest = True)
cat = WikiPage.categories
for i in cat:
    print i
and get the categories.
But my problem is the reverse:
I want to give a category, for example health or medical terminology, and get all the articles of that type.
How can I do this?
Edit: actual answer
There is API:Categorymembers, which documents usage, parameters and gives examples on "how to retrieve lists of pages in a given category, ordered by title". It won't save you from having to descend through the category tree (cf. below) yourself, but you get a nice entry point and machine-readable results.
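A minimal sketch of calling that API from Python, following the continuation parameters so that large categories are fully listed (the category name is just an example):

import requests

def category_members(category, cmtype="page"):
    """Yield titles of category members via list=categorymembers, following continuation."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": "Category:" + category,
        "cmtype": cmtype,
        "cmlimit": "500",
        "format": "json",
    }
    while True:
        data = requests.get("https://en.wikipedia.org/w/api.php", params=params).json()
        for member in data["query"]["categorymembers"]:
            yield member["title"]
        if "continue" not in data:
            break
        params.update(data["continue"])

for title in category_members("Medical terminology"):
    print(title)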
Old answer: related information
A very brief pointer is given on the Help:Category page, section Searching for articles in categories:
In addition to browsing through hierarchies of categories, it is
possible to use the search tool to find specific articles in specific
categories. To search for articles in a specific category, type
incategory:"CategoryName" in the search box.
An "OR" can be added to join the contents of one category with the
contents of another. For example, enter
incategory:"Suspension bridges" OR incategory:"Bridges in New York City"
to return all pages that belong to either (or both) of the categories,
as here.
Note that using search to find categories will not find articles which
have been categorized using templates. This feature also doesn't
return pages in subcategories.
To address the subcategory problem, the page Special:CategoryTree can be used instead. However, that page does not point to any obvious documentation, so I think the <form> fields must be searched for manually in the page source to build programmatic requests.
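Alternatively, the subcategory problem can be handled without Special:CategoryTree by walking the tree yourself with the same categorymembers API, using cmtype=subcat. A rough sketch building on the category_members helper above; the depth limit and category name are arbitrary:

def pages_in_tree(category, depth=2, seen=None):
    """Collect page titles from a category and its subcategories down to a given depth."""
    seen = seen if seen is not None else set()
    if category in seen or depth < 0:
        return []
    seen.add(category)
    titles = list(category_members(category, cmtype="page"))
    for sub in category_members(category, cmtype="subcat"):
        # subcategory titles come back as "Category:Foo"; strip the prefix before recursing
        titles += pages_in_tree(sub.split(":", 1)[1], depth - 1, seen)
    return titles

print(len(pages_in_tree("Medical terminology", depth=1)))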
My problem is that I want to create a database of all of the questions, answers, and, most importantly, the tags from a certain (somewhat small) Stack Exchange. The relationships among tags (e.g., tags that are often used together have a strong relation) could reveal a lot about the structure of the community and the popularity of, or interest in, certain subfields.
So, what is the easiest way to go through a list of questions (that are positively ranked) and extract the tag information using Python?
The easiest way to get the shared-tag count for all questions is to use the Stack Exchange API.
import requests
r = requests.get(
    'http://api.stackexchange.com/2.2/tags/python/related?pagesize=3&site=stackoverflow')
for item in r.json()['items']:
    print("{name} shares {count} tags with Python".format(**item))
If this doesn't satisfy your need, there are many other API queries available.
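Since the goal is the questions themselves together with their tags, the /questions endpoint is probably the more direct route. A sketch; I believe min=1 combined with sort=votes restricts the results to positively scored questions, and the site parameter should be set to the target Stack Exchange:

import requests

params = {
    "site": "stackoverflow",   # replace with the target Stack Exchange site
    "sort": "votes",
    "order": "desc",
    "min": 1,                  # only questions with score >= 1
    "pagesize": 100,
    "page": 1,
}
data = requests.get("http://api.stackexchange.com/2.2/questions", params=params).json()
for q in data["items"]:
    print(q["score"], q["title"], q["tags"])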
Visit the site to find the URL that shows the information you want, then look at the page source to see how it has been formatted.
To scrape the pages, use the urllib2 library.
Parse the text using the BeautifulSoup library.
Place the data into a database.
The difficult thing is going to be structuring your database and developing queries that reveal what you want.
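A bare-bones sketch of those steps; the listing URL and the CSS classes are assumptions about the page markup that need to be checked against the actual page source, and urllib2 is Python 2 (use urllib.request in Python 3):

import sqlite3
import urllib2
from bs4 import BeautifulSoup

# Fetch a question listing page (URL is just an example)
html = urllib2.urlopen("https://math.stackexchange.com/questions?sort=votes").read()
soup = BeautifulSoup(html, "html.parser")

conn = sqlite3.connect("questions.db")
conn.execute("CREATE TABLE IF NOT EXISTS questions (title TEXT, tags TEXT)")

# The class names below are guesses at the markup and may need adjusting
for summary in soup.select(".question-summary"):
    link = summary.select_one(".question-hyperlink")
    if link is None:
        continue
    tags = [t.get_text() for t in summary.select(".post-tag")]
    conn.execute("INSERT INTO questions VALUES (?, ?)", (link.get_text(strip=True), ",".join(tags)))

conn.commit()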