I need to scrape several different sites for the same information. Basically, I am looking for similar information, but the sites could belong to different vendors and can have different HTML structures. For example, if I am trying to scrape the data related to textbooks on Barnes & Noble and Biblio (these are only two, but there could be many) and get the book name, author and price for each book, how would one do that?
https://www.barnesandnoble.com/b/textbooks/mathematics/algebra/_/N-8q9Z18k3
https://www.biblio.com/search.php?stage=1&result_type=works&keyisbn=algebra
Of course, I can parse the two sites independently, but I am looking for a general methodology that can be easily applied to other vendors as well to extract the same information.
In a separate but related question, I would also like to know how Google shows product information from different sources when you search for a product. For example, if you google "MacBook Pro", at the top of the page you get a carousel of products from different vendors. I assume Google is scraping this information from different sources automatically to suggest to the user what is available.
Have a look at scrapely. It can really be helpful if you don't want to manually parse different HTML structures.
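To make that concrete, here is a minimal sketch of scrapely's train/scrape workflow. The URLs and field values are placeholders, not real pages; in practice you would train one Scraper per vendor layout and reuse it for that vendor's other pages.

from scrapely import Scraper

s = Scraper()

# Train on one example page by telling scrapely which values you want extracted.
# Placeholder URL and strings -- use a real product page and the exact text it shows.
train_url = 'https://www.example-bookstore.com/book/12345'
s.train(train_url, {'name': 'Algebra and Trigonometry',
                    'author': 'Jane Doe',
                    'price': '$79.99'})

# Scrape a similar page from the same vendor; scrapely infers the extraction
# rules from the trained example, so no per-site HTML parsing code is needed.
print(s.scrape('https://www.example-bookstore.com/book/67890'))

The idea is that adding a new vendor then only means training on one more example page rather than writing a new parser.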
I want to collect contact information from all county governments. I do not have a list of their websites. I want to do three things with Python: 1) create a list of county government websites, 2) extract names, email addresses, and phone numbers of government officials, and 3) convert the URLs and all the contact information into an Excel sheet or CSV.
I am a beginner in Python, and any guidance would be greatly appreciated. Thanks!
For creating tables you can use a package called pandas; for extracting info from websites, a package called beautifulsoup4 is commonly used.
For scraping a website, you should first define what kind of search you want to start from: do you want to search Google or a specific website? Either way, you need the requests library to fetch a site (or to send a query to Google, like typing in the search bar) and get the HTML back. For parsing the data you have fetched, you can choose BeautifulSoup. Both libraries have good documentation, and you should read it; don't worry, it's easy.
Because there are a lot of county governments to cover, you should also manage your data; for that I recommend pandas. Finally, after processing the data, you can convert it to whatever file type you need with DataFrame.to_excel, DataFrame.to_csv, and so on.
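A rough sketch of steps 2 and 3 along those lines. The URL and the CSS selectors are placeholders: every county site will have its own markup, so the selectors have to be adapted per site (or per group of similar sites).

import requests
from bs4 import BeautifulSoup
import pandas as pd

# Placeholder list of county sites -- building this list is step 1 of the question.
county_urls = [
    'https://www.example-county.gov/officials',
]

rows = []
for url in county_urls:
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, 'html.parser')
    # Placeholder selectors; check each site's page source for the real ones.
    for card in soup.select('.official'):
        rows.append({
            'url': url,
            'name': card.select_one('.name').get_text(strip=True),
            'email': card.select_one('.email').get_text(strip=True),
            'phone': card.select_one('.phone').get_text(strip=True),
        })

# Step 3: dump everything to CSV (DataFrame.to_excel works the same way for .xlsx).
pd.DataFrame(rows).to_csv('county_contacts.csv', index=False)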
How do I go about extracting only page views (all-time, or at most year-wise; I am not really interested in daily, monthly, etc.) for all of the pages linked from a glossary page?
Example: https://en.wikipedia.org/wiki/Glossary_of_areas_of_mathematics
I found this tool, but it works on categories.
Is there a way, in Python or something similar, which I can implement to get pageviews for all the links listed on a page?
There won't be a simple way to do this because that article content is unstructured, unlike for a category.
You'll need to manually extract the page titles by parsing the article and then pass each of the titles to the API to get the pageviews. It is documented here: https://pageviews.toolforge.org/pageviews/url_structure/. You can pass multiple titles by separating them with |, but there is a limit to the number, so you will need to make multiple queries.
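For illustration, here is a sketch of that two-step approach using the Wikimedia pageviews REST API that backs the tool. The User-Agent string, the date range, and the link-filtering rule are assumptions to adjust; pageview data only goes back to mid-2015, so summing monthly counts is as close to "all time" as the API gets.

import requests
from bs4 import BeautifulSoup
from urllib.parse import quote

# Wikimedia asks API clients to send a descriptive User-Agent; this one is a placeholder.
HEADERS = {'User-Agent': 'glossary-pageviews-script/0.1 (contact: you@example.com)'}

# Step 1: collect the titles of the articles linked from the glossary page.
glossary = 'https://en.wikipedia.org/wiki/Glossary_of_areas_of_mathematics'
soup = BeautifulSoup(requests.get(glossary, headers=HEADERS).text, 'html.parser')
titles = sorted({
    a['title']
    for a in soup.select('div.mw-parser-output a[href^="/wiki/"]')
    if a.has_attr('title') and ':' not in a['href']  # skip File:, Help:, etc. namespaces
})

# Step 2: ask the pageviews REST API for monthly counts per article and sum them.
api = ('https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/'
       'en.wikipedia/all-access/all-agents/{title}/monthly/2015070100/2024010100')
for title in titles[:5]:  # only a few here; loop over everything in practice
    r = requests.get(api.format(title=quote(title.replace(' ', '_'), safe='')),
                     headers=HEADERS)
    if r.ok:
        total = sum(item['views'] for item in r.json()['items'])
        print(title, total)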
I would like to use only medical data from Wikipedia for analysis. I use python for scraping.
I have used this library for searching by a query word:
import wikipedia
import requests
import pprint
from bs4 import BeautifulSoup
wikipedia.set_lang("en")
query = raw_input()
WikiPage = wikipedia.page(title = query,auto_suggest = True)
cat = WikiPage.categories
for i in cat:
    print i
and get the categories.
But my problem is the reverse:
I want to give a category, for example health or medical terminology, and get all the articles in that category.
How can I do this?
Edit: actual answer
There is API:Categorymembers, which documents usage, parameters and gives examples on "how to retrieve lists of pages in a given category, ordered by title". It won't save you from having to descend through the category tree (cf. below) yourself, but you get a nice entry point and machine-readable results.
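As a sketch of what a categorymembers request could look like against the MediaWiki API (the category name is just the example from the question; note this returns direct members only, so subcategories still have to be walked separately, e.g. with cmtype=subcat):

import requests

session = requests.Session()
API = 'https://en.wikipedia.org/w/api.php'

params = {
    'action': 'query',
    'list': 'categorymembers',
    'cmtitle': 'Category:Medical terminology',  # example category from the question
    'cmlimit': '500',
    'format': 'json',
}

# Keep following the continuation token until the category is exhausted.
while True:
    data = session.get(API, params=params).json()
    for member in data['query']['categorymembers']:
        print(member['title'])
    if 'continue' not in data:
        break
    params.update(data['continue'])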
Old answer: related information
A very brief pointer is given on the Help:Category page, section Searching for articles in categories:
In addition to browsing through hierarchies of categories, it is
possible to use the search tool to find specific articles in specific
categories. To search for articles in a specific category, type
incategory:"CategoryName" in the search box.
An "OR" can be added to join the contents of one category with the
contents of another. For example, enter
incategory:"Suspension bridges" OR incategory:"Bridges in New York City"
to return all pages that belong to either (or both) of the categories,
as here.
Note that using search to find categories will not find articles which
have been categorized using templates. This feature also doesn't
return pages in subcategories.
To address the subcategory problem, the page Special:CategoryTree can be used instead. However, that page does not point to any obvious documentation, so I think you would have to inspect its <form> fields in the page source to use it programmatically.
My problem is that I want to create a database of all of the questions, answers, and most importantly, the tags, from a certain (somewhat small) Stack Exchange site. The relationships among tags (e.g. tags more often used together have a strong relation) could reveal a lot about the structure of the community and the popularity of, or interest in, certain subfields.
So, what is the easiest way to go through a list of questions (that are positively ranked) and extract the tag information using Python?
The easiest way to get the shared-tag count for all questions is to use the Stack Exchange API.
import requests
r = requests.get(
    'http://api.stackexchange.com/2.2/tags/python/related?pagesize=3&site=stackoverflow')
for item in r.json()['items']:
    print("{name} shares {count} tags with Python".format(**item))
If this doesn't satisfy your need, there are many other API queries available.
Visit the site to find the URL that shows the information you want, then look at the page source to see how it has been formatted.
In order to scrape the pages, use the urllib2 library (on Python 3 that is urllib.request, or you can use the requests package).
Parse the text using the BeautifulSoup library.
Place the data into a database.
The difficult thing is going to be structuring your database and developing queries that reveal what you want.
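A bare-bones sketch of those steps, storing question titles and tags in SQLite. The URL is just an example site, and the CSS class names are assumptions about the current page markup that you would need to verify against the page source (and, as the other answer notes, the API is usually the friendlier route).

import sqlite3
import urllib.request  # urllib2's Python 3 successor
from bs4 import BeautifulSoup

# Example: a votes-sorted question list on the Stack Exchange site of interest.
url = 'https://math.stackexchange.com/questions?sort=votes'
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')

conn = sqlite3.connect('questions.db')
conn.execute('CREATE TABLE IF NOT EXISTS question_tags (question TEXT, tag TEXT)')

# Assumed class names -- check them in the page source before relying on this.
for q in soup.select('.s-post-summary'):
    title = q.select_one('.s-post-summary--content-title').get_text(strip=True)
    for tag in q.select('.post-tag'):
        conn.execute('INSERT INTO question_tags VALUES (?, ?)',
                     (title, tag.get_text(strip=True)))

conn.commit()
conn.close()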
I'm pretty new to web development and I have an idea for something that I would like to explore and I'd like some advice on what tools I should use. I know python and have been learning django recently so I would ideally like to incorporate them.
What I want to do is related to some basic HTML parsing and, I think, use of regular expressions. Basically, I want to be able to aggregate certain bits of useful information from several websites into one site. Suppose, for example, there are a dozen high schools whose graduation dates, times, and locations I'm interested in knowing. The information on each high school site is presented in roughly similar ways, so I want to extract the data that follows words like "location" or "venue", "time", "date", etc., and then have that automatically posted on my site. I would also like it updated if any of the info happens to change on any of the high school sites.
What would you use to accomplish this task? Also, if you know of any useful tutorials, resources, etc that you could point me to, that would be much appreciated!
For the extraction part I think your best bet would be Beautiful Soup, mostly because it's easy to use and will try to parse anything, even broken XML/HTML.
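As a rough sketch of the label-based extraction described in the question (the HTML snippet is made up, and real school pages will need per-site tweaks):

from bs4 import BeautifulSoup

# Made-up HTML standing in for one school's announcement page.
html = """
<p>Date: June 12, 2025</p>
<p>Time: 10:00 AM</p>
<p>Location: Memorial Stadium</p>
"""

soup = BeautifulSoup(html, 'html.parser')

details = {}
for label in ('date', 'time', 'location', 'venue'):
    # Find the first text node that starts with the label and keep what follows it.
    node = soup.find(string=lambda s, label=label: s and s.strip().lower().startswith(label))
    if node:
        details[label] = node.split(':', 1)[-1].strip()

print(details)  # {'date': 'June 12, 2025', 'time': '10:00 AM', 'location': 'Memorial Stadium'}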
Check out BeautifulSoup
Update:
If you want to fill in forms, you can use mechanize.
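A minimal sketch of driving a form with mechanize; the URL and the field name 'q' are hypothetical and would need to match the actual form on the page:

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)  # mechanize honours robots.txt by default; disable only where appropriate

# Hypothetical page with a search form for event details.
br.open('https://www.example-highschool.edu/search')
br.select_form(nr=0)              # pick the first form on the page
br['q'] = 'graduation ceremony'   # 'q' is an assumed field name; inspect the form to find the real one
response = br.submit()
print(response.read()[:500])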