Scraping Biography.com using urllib2


So I've scraped websites before, but this time I am stumped. I am attempting to search for a person on Biography.com and retrieve his/her biography. But whenever I search the site using urllib2 and query the URL: http://www.biography.com/search/ I get a blank page with no data in it.
When I look into the source generated in the browser by clicking View Source, I still do not see any data. When I use Chrome's developer tools, I find some data but still no links leading to the biography.
I have tried changing the User Agent, adding referrers, using cookies in Python but to no avail. If someone could help me out with this task it would be really helpful.
I am planning to use this text for my NLP project and worst case, I'll have to manually copy-paste the text. But I hope it doesn't come to that.

Chrome/Chromium's Developer Tools (or Firebug) is definitely your friend here. I can see that the initial search on Biography's site is made via a call to a Google API, e.g.
https://www.googleapis.com/customsearch/v1?q=Barack%20Obama&key=AIzaSyCMGfdDaSfjqv5zYoS0mTJnOT3e9MURWkU&cx=011223861749738482324%3Aijiqp2ioyxw&num=8&callback=angular.callbacks._0
The search term I used is in the q= part of the query string: q=Barack%20Obama.
This returns JSON containing a "link" key whose value is the URL of the article of interest:
"link": "http://www.biography.com/people/barack-obama-12782369"
Visiting that page shows me that this is generated by a request to:
http://api.saymedia-content.com/:apiproxy-anon/content-sites/cs01a33b78d5c5860e/content-customs/#published/#by-custom-type/ContentPerson/#by-slug/barack-obama-12782369
which returns JSON containing HTML.
So, replacing the last part of the link barack-obama-12782369 with the relevant info for the person of interest in the saymedia-content link may well pull out what you want.
To implement:
1. Use urllib2 (or requests) to do the search via their Google API call, using urllib2.urlopen(url) or requests.get(url). Replace Barack%20Obama with a URL-escaped search string, e.g. Bill%20Clinton.
2. Parse the JSON using Python's json module to extract the string that gives you the http://www.biography.com/people link. From this, extract the part of the link of interest (barack-obama-12782369 above).
3. Use urllib2 or requests to do a saymedia-content API request, replacing barack-obama-12782369 after #by-slug/ with whatever you extracted in step 2; i.e. do another urllib2.urlopen on this URL.
4. Parse the JSON from the response of this second request to extract the content you want.
(Caveat: This is provided that there are no session-based strings in those two API calls that might expire.)
Alternatively, you can use Selenium to visit the website, do the search and then extract the content.
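The first two steps above could look roughly like this in Python 3, where urllib2's functionality lives in urllib.request and urllib.parse. This is only a sketch under some assumptions: the function names are made up for illustration, the response is assumed to follow the usual Custom Search layout with an "items" list whose entries carry "link", and the callback parameter is left out on the assumption that the endpoint then returns plain JSON instead of a JSONP wrapper. The key and cx values have to come from the request you see in the developer tools.

```python
from urllib.parse import quote, urlencode, urlsplit

SEARCH_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_search_url(name, key, cx, num=8):
    """Build the search URL for a person's name (hypothetical helper)."""
    # quote_via=quote encodes spaces as %20, as in the URL shown above
    query = urlencode({"q": name, "key": key, "cx": cx, "num": num},
                      quote_via=quote)
    return SEARCH_ENDPOINT + "?" + query


def first_people_link(search_json):
    """Return the first result link that points at a /people/ page, if any."""
    # Assumption: results sit in an "items" list, each with a "link" key
    for item in search_json.get("items", []):
        link = item.get("link", "")
        if "/people/" in link:
            return link
    return None


def slug_from_link(link):
    """Return the last path segment of the link, e.g. barack-obama-12782369."""
    return urlsplit(link).path.rstrip("/").rsplit("/", 1)[-1]
```

You would fetch the URL from build_search_url with urllib.request.urlopen (or requests.get), run the body through json.loads, and pass the result to first_people_link. The slug then goes at the end of the saymedia-content URL for the second request.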

You will most likely need to manually copy and paste, as biography.com is a completely javascript-based site, so it can't be scraped with traditional methods.

You can discover an API URL with HttpFox (a Firefox add-on), e.g. http://www.biography.com/.api/item/search?config=published&query=marx
This returns JSON which you can search for /people/ to retrieve the biography links.
Or you can use a browser-driven crawler like Selenium.
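Since the layout of that JSON isn't shown here, one way to search it for /people/ without knowing the structure is to walk the whole parsed object. A minimal sketch; the function name and the sample data are made up for illustration:

```python
import json


def find_people_links(node):
    """Recursively collect every string containing '/people/' from parsed JSON."""
    links = []
    if isinstance(node, dict):
        for value in node.values():
            links.extend(find_people_links(value))
    elif isinstance(node, list):
        for value in node:
            links.extend(find_people_links(value))
    elif isinstance(node, str) and "/people/" in node:
        links.append(node)
    return links


# Made-up sample; the real response layout is not known here
sample = json.loads(
    '{"results": [{"url": "/people/example-person-123", "title": "Example"},'
    ' {"url": "/news/some-story"}]}'
)
print(find_people_links(sample))
```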

Related

Where is the information stored in a html? (web-scraping)

A beginner here.
I want to extract all the jobs from Barclays (https://search.jobs.barclays/search-jobs)
I got through scraping the first page but am struggling to go to the next page, as the URL doesn't change.
I tried to scrape the URL on the next page button, but that href brings me back to the homepage.
Does that mean that all the job data is actually stored within the original html?
If so, how can I extract it?
Thanks!

I analyzed the website: it communicates with the server through an API, so you can get the data directly from it as JSON.
This is the API link in this specific case (as seen from my computer): https://search.jobs.barclays/search-jobs/results?ActiveFacetID=44699&CurrentPage=2&RecordsPerPage=15&Distance=50&RadiusUnitType=0&Keywords=&Location=&ShowRadius=False&IsPagination=False&CustomFacetName=&FacetTerm=&FacetType=0&SearchResultsModuleName=Search+Results&SearchFiltersModuleName=Search+Filters&SortCriteria=0&SortDirection=0&SearchType=5&PostalCode=&fc=&fl=&fcf=&afc=&afl=&afcf=
For you the URL might be different, but the concept is the same.
As you can see, there is a 'CurrentPage=2' inside the URL, which you can change to get any of the pages using requests, then extract what you need from the JSON.
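To generate the URL for any page, you can rewrite just the CurrentPage parameter and leave the rest of the query string alone. A small sketch using only the standard library; page_url is a name I made up, and the shortened URL in the example is only for illustration:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def page_url(api_url, page):
    """Return api_url with its CurrentPage parameter set to the given page."""
    parts = urlsplit(api_url)
    # keep_blank_values=True preserves empty parameters such as Keywords=
    params = [
        (key, str(page) if key == "CurrentPage" else value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
    ]
    return urlunsplit(parts._replace(query=urlencode(params)))


base = ("https://search.jobs.barclays/search-jobs/results"
        "?ActiveFacetID=44699&CurrentPage=2&RecordsPerPage=15&Keywords=")
for page in range(1, 4):
    print(page_url(base, page))
```

Each of those URLs can then be fetched with requests.get and the JSON read from the response.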

How to extract hidden html content with scrapy?

I'm using Scrapy (on PyCharm v2020.1.3) to build a spider that crawls this webpage: "https://www.woolworths.com.au/shop/browse/drinks/cordials-juices-iced-teas/iced-teas". I want to extract the product names and the breadcrumb in a list format, and save the results in a CSV file.
I tried the following code, but it returns empty brackets []. After inspecting the HTML I discovered that the content is hidden in AngularJS format.
If someone has a solution for that it would be great.
Thank you
import scrapy


class ProductsSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://www.woolworths.com.au/shop/browse/drinks/cordials-juices-iced-teas/iced-teas']

    def parse(self, response):
        product = response.css('a.shelfProductTile-descriptionLink::text').extract()
        # a spider has to yield an item (e.g. a dict), not a bare string
        yield {'productnames': product}

You won't be able to get the desired products by parsing the HTML. The page is heavily JavaScript-oriented, and Scrapy doesn't execute JavaScript, so the product data never appears in the response it parses.
The simplest way to get the product names (I'm not sure what you mean by breadcrumbs) is to re-engineer the HTTP requests. The Woolworths website generates the product details via an API. If we can mimic the request the browser makes to obtain that product information, we can get the information in a nice, neat format.
First you have to set ROBOTSTXT_OBEY = False in settings.py. Be careful about protracted scrapes of this data, because your IP will probably get banned at some point.
Code Example
import scrapy


class TestSpider(scrapy.Spider):
    name = 'test'
    # must match the .com.au host of the API URL below
    allowed_domains = ['woolworths.com.au']
    # extra data attached to the request through the meta argument
    data = {
        'excludeUnavailable': 'true',
        'source': 'RR-Best Sellers'}

    def start_requests(self):
        url = 'https://www.woolworths.com.au/apis/ui/products/58520,341057,305224,70660,208073,69391,69418,65416,305227,305084,305223,427068,201688,427069,341058,305195,201689,317793,714860,57624'
        yield scrapy.Request(url=url, meta=self.data, callback=self.parse)

    def parse(self, response):
        # the API returns a JSON list with one dictionary per product
        data = response.json()
        for a in data:
            yield {
                'name': a['Name'],
            }
Explanation
We start off with our defined URL in start_requests. This is the specific URL of the API Woolworths uses to obtain the information for iced tea. For any other page on Woolworths, the part of the URL after /products/ will be specific to that part of the website.
The reason we're using this is that driving a browser is slow and brittle. This is fast, and the information we get is usually highly structured, which is much better for scraping.
So how do we get the URL, you may be asking? You need to inspect the page and find the correct request. Open the network tools and then reload the website, and you'll see a bunch of requests. Usually the largest request is the one with all the data. Clicking that and then clicking Preview gives you a box on the right-hand side with all the details of the products.
We can then get the request URL and anything else from this request.
I will often copy this request as cURL (bash) and enter it into curl.trillworks.com. This can convert cURL to Python, giving you nicely formatted headers and any other data needed to mimic the request.
Now, putting this into Jupyter and playing about, you actually only need the params, NOT the headers, which is much better.
So back to the code. We make a request and use the meta argument to pass on the data; because data is defined outside the function, we have to refer to it as self.data. We then specify parse as the callback.
We can use the response.json() method to convert the JSON to a set of Python dictionaries, one per product. You MUST have Scrapy 2.2 or later to use this method. Otherwise you could use data = json.loads(response.text), but you'll have to import json at the top of the script.
From the preview, and from playing about with the JSON in requests, we can see these Python dictionaries are actually within a list, so we can use a for loop to go round each product, which is what we are doing here.
We then yield a dictionary to extract the data. a refers to each product, which is its own dictionary, and a['Name'] looks up the key 'Name' in that dictionary and gives us the correct value. To get a better feel for this, I always use the requests package in Jupyter to figure out the correct way to get the data I want.
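If you are on a Scrapy version older than 2.2, the json.loads route mentioned above does the same job. Here is a minimal sketch you can try outside of Scrapy; product_names is a made-up helper, and the sample string only imitates the shape described above (a list of dictionaries with a 'Name' key):

```python
import json


def product_names(body_text):
    """Return the 'Name' of every product in a JSON list of product dictionaries."""
    products = json.loads(body_text)
    return [product["Name"] for product in products]


# Made-up sample standing in for response.text
sample = '[{"Name": "Iced Tea Lemon"}, {"Name": "Iced Tea Peach"}]'
print(product_names(sample))
```

Inside the spider, the equivalent would be data = json.loads(response.text) in place of data = response.json().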
The only thing left to do is to run scrapy crawl test -o products.csv to output this to a CSV file.
I can't really help you more than this until you specify any other data you want from this page. Please remember that you're going against what the site wants you to scrape, and that for any other pages on that website you will need to find the specific API URL to get those products. I have given you the way to do this; if you want to automate it, I suggest it would be worth your while to struggle with it a bit. We are here to help, but an attempt on your part is how you're going to learn coding.
Additional Information on the Approach to Dynamic Content
There is a wealth of information on this topic. Here are some guidelines to think about when looking at JavaScript-heavy websites. The default is that you should try to re-engineer the requests the browser makes to load the page's information. This is what the JavaScript on this site and many other sites is doing: it provides a dynamic way to display new information without reloading the page, by making an HTTP request. If we can mimic that request, we can get the information we desire. This is the most efficient way to get dynamic content.
In order of preference:
1. Re-engineering the HTTP requests
2. Scrapy-splash
3. Scrapy-selenium
4. Importing the selenium package into your scripts
Scrapy-splash is slightly better than the selenium package, as it pre-renders the page, giving you access to the selectors with the data. Selenium is slow and prone to errors, but it will allow you to mimic browser activity.
There are multiple ways to include Selenium in your scripts; see below for an overview.
Recommended Reading/Research
Look at the Scrapy documentation on handling dynamic content.
This will give you an overview of the steps for handling dynamic content. Generally speaking, Selenium should be thought of as a last resort. It's pretty inefficient when doing larger-scale scraping.
If you are considering adding the selenium package to your script: this might be the lowest barrier to entry for getting your script working, but it is not necessarily efficient. At the end of the day Scrapy is a framework, but there is a lot of flexibility for adding in third-party packages. The spider scripts are just a Python class importing the Scrapy architecture in the background. As long as you're mindful of the response and translate some of the Selenium calls to work with Scrapy, you should be able to use Selenium in your scripts. I would say this solution is probably the least efficient, though.
Consider using scrapy-splash. Splash pre-renders the page and allows you to add in JavaScript execution. See the scrapy-splash docs and the Scrapinghub article on it.
Scrapy-selenium is a package with a custom Scrapy downloader middleware that allows you to do Selenium actions and execute JavaScript. See its docs. You'll need to have a play around to get the login procedure from this; it doesn't have the same level of detail as the selenium package itself.

Python - Scrapy ecommerce website

I'm trying to scrape the price of this product
http://www.asos.com/au/fila/fila-vintage-plus-ringer-t-shirt-with-small-logo-in-green/prd/9065343?clr=green&SearchQuery=&cid=7616&gridcolumn=2&gridrow=1&gridsize=4&pge=1&pgesize=72&totalstyles=4699
with the following code, but it returns an empty array:
response.xpath('//*[@id="product-price"]/div/span[2]/text()').extract()
Any help is appreciated, Thanks.

That's because the site is dynamic. When I open the page with the view(response) command in scrapy shell, the price info doesn't come out.
Solutions:
1. Splash
2. Selenium + PhantomJS
It might also help to check this answer: Empty List From Scrapy When Using Xpath to Extract Values

The price is added later by the browser, which renders the page using JavaScript code found in the HTML. If you disable JavaScript in your browser, you will notice that the page looks a bit different. Also, take a look at the page source, which is usually unaltered, to see that the tag you're looking for doesn't exist (yet).
Scrapy doesn't execute any JavaScript code. It receives the plain HTML, and that's what you have to work with.
If you want to extract data from pages as they look in the browser, I recommend using a headless browser like Splash (if you're already using Scrapy): https://github.com/scrapinghub/splash
You can programmatically tell it to download your page, render it and select the data points you're interested in.
The other way is to look for the request made to the Asos API which asks for the product data. In your case, for this product:
http://www.asos.com/api/product/catalogue/v2/stockprice?productIds=9065343&currency=AUD&keyStoreDataversion=0ggz8b-4.1&store=AU
I got this URL by looking at all the XMLHttpRequest (XHR) requests in the Network tab of the Developer Tools (in Google Chrome).
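If you want to do this for more than one product, you could pull the product ID out of the product page URL and rebuild the API URL from it. A rough sketch; both function names are made up, and I'm assuming the other query parameters (in particular keyStoreDataversion) can be reused as they appear above, which may stop being true if the site changes them:

```python
import re
from urllib.parse import urlencode, urlsplit

STOCKPRICE_ENDPOINT = "http://www.asos.com/api/product/catalogue/v2/stockprice"


def product_id_from_url(product_url):
    """Return the numeric ID after /prd/ in a product page URL, or None."""
    match = re.search(r"/prd/(\d+)", urlsplit(product_url).path)
    return match.group(1) if match else None


def stockprice_url(product_id, currency="AUD", store="AU",
                   data_version="0ggz8b-4.1"):
    """Build the stockprice API URL for one product."""
    # Assumption: data_version copied from the request seen in the browser
    query = urlencode({
        "productIds": product_id,
        "currency": currency,
        "keyStoreDataversion": data_version,
        "store": store,
    })
    return STOCKPRICE_ENDPOINT + "?" + query


page = ("http://www.asos.com/au/fila/fila-vintage-plus-ringer-t-shirt-with-"
        "small-logo-in-green/prd/9065343?clr=green")
print(stockprice_url(product_id_from_url(page)))
```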

You can try to find the JSON inside the HTML (using a regular expression) and parse it:
import json

json_string = response.xpath('//script[contains(., "function (view) {")]/text()').re_first(r'view\(\'([^\']+)')
data = json.loads(json_string)
price = data["price"]["current"]
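To try the same regular expression outside of Scrapy, a plain re version could look like this. It is only a sketch: embedded_price is a made-up name, and the sample HTML merely imitates the pattern that the expression above looks for, not the real page:

```python
import json
import re


def embedded_price(html):
    """Return data["price"]["current"] from JSON passed to view('...'), or None."""
    match = re.search(r"view\('([^']+)", html)
    if match is None:
        return None
    data = json.loads(match.group(1))
    return data["price"]["current"]


# Made-up sample imitating the script content targeted above
sample_html = """<script>(function (view) { })
view('{"price": {"current": 12.5}}');</script>"""
print(embedded_price(sample_html))
```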

How to retrieve data from API Explorer?

My question is more in the "concept" side, as I don't have any code to show yet. I've basically got access to an API Explorer for a website, but the information retrieved when I put a specific url in the API Explorer is not the same as the html information I'd get if I opened a webpage with the same url and "inspected" the elements. I'm honestly lost on how to retrieve the data I need, as they are only present in the API Explorer but can't be accessible via web scraping.
Here is an example to show you what I mean:
API Explorer link: https://platform.worldcat.org/api-explorer/apis/worldcatidentities/identity/Read,
and the specific url to request is: http://www.worldcat.org/identities/lccn-n80126307/
If I open the url (http://www.worldcat.org/identities/lccn-n80126307/) myself and "inspect element", the html I see does not have all the same data as the response in the API Explorer.
For example, the language count, audLevel, oclcnum and many others do not exist in the html version but are in the API Explorer, and with other authors, the genres count only exists in the API Explorer.
I realize that one is in xml and the other in html so is that why the data is not the same in both versions? And whatever is the reason, what can I do to retrieve the data present only in the API Explorer? (such as genres count, audLevel, oclcnum, etc.)
Any insight would be really helpful.

It's not unusual for sites not to show all the data that's in the underlying JSON/XML. That data often holds interesting content that isn't displayed anywhere on the site.
In this case the server gives you what you ask for. If you're going for the data using Python, all you really have to do is specify in your header what you're after. If you don't do that on this site, you get the HTML.
If you do it like this, you'll get the XML data you're interested in:
import requests
import xml.dom.minidom

url = 'https://www.worldcat.org/identities/lccn-n80126307/'
r = requests.get(url, headers={'Accept': 'application/json'})
# a couple of lines for pretty-printing the xml
# (named dom so the xml module is not shadowed)
dom = xml.dom.minidom.parseString(r.text)
pretty_xml_as_string = dom.toprettyxml()
print(pretty_xml_as_string)
Then all you have to do is extract the content you're after. That can be done in many ways. Let me know if this helps you.
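One of those ways is xml.etree.ElementTree from the standard library. A small sketch; texts_for_tag is a made-up helper, and the sample XML is invented just to show the idea (it borrows the oclcnum and audLevel names from the question, but the real document's nesting and any namespaces may differ):

```python
import xml.etree.ElementTree as ET


def texts_for_tag(xml_text, tag):
    """Return the text of every element named tag, anywhere in the document."""
    root = ET.fromstring(xml_text)
    return [element.text for element in root.iter(tag)]


# Invented sample; not the real WorldCat Identities layout
sample_xml = (
    "<Identity>"
    "<audLevel><level>0.42</level></audLevel>"
    "<work><oclcnum>123</oclcnum></work>"
    "<work><oclcnum>456</oclcnum></work>"
    "</Identity>"
)
print(texts_for_tag(sample_xml, "oclcnum"))
```

With the real response you would pass r.text instead of sample_xml.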

Download files after opening a webpage using Python

I have opened a webpage ('http://example.com/protected_page.php') using Python's requests library.
from requests import session

payload = {
    'action': 'login',
    'username': USERNAME,
    'password': PASSWORD
}
with session() as c:
    c.post('http://example.com/login.php', data=payload)
    response = c.get('http://example.com/protected_page.php')
Now there are around 15 links on that page to download files.
I wish to download files from only 2 of the links (say, linkA and linkB).
How can I specify this in my code, so that the 2 files get downloaded when I run it?

Can you please give more information about these links?
Are linkA and linkB always the same links?
If yes, then you can use:
r = requests.get(linkA, stream=True)
If the URLs are not the same all the time, then maybe you can find another way, for instance using the order of the links, if linkA and linkB are always the first and the second link on the page.
Another way is to use any unique class name or id from the page. But it would be better if you could provide us with more information.
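As a rough sketch of the id-based approach, here is one way to pick out just the wanted links with the standard library's HTMLParser. Everything specific in it is an assumption: the ids linkA and linkB, the class and function names, and the sample page are all made up, since the real HTML isn't known.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag whose id is in wanted_ids."""

    def __init__(self, wanted_ids, base_url):
        super().__init__()
        self.wanted_ids = set(wanted_ids)
        self.base_url = base_url
        self.links = {}

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        link_id = attributes.get("id")
        href = attributes.get("href")
        if link_id in self.wanted_ids and href:
            # resolve relative hrefs against the page URL
            self.links[link_id] = urljoin(self.base_url, href)


def find_links(html, wanted_ids, base_url):
    collector = LinkCollector(wanted_ids, base_url)
    collector.feed(html)
    return collector.links


# Made-up page standing in for response.text
sample_page = (
    '<a id="linkA" href="files/a.zip">A</a>'
    '<a id="other" href="files/x.zip">X</a>'
    '<a id="linkB" href="/files/b.zip">B</a>'
)
print(find_links(sample_page, ["linkA", "linkB"],
                 "http://example.com/protected_page.php"))
```

Each URL found this way can then be fetched inside the same logged-in session, e.g. with c.get(url, stream=True), and written to a file.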

What you are referring to is more precisely called web scraping, in which one scrapes specific content from a given website:
Web scraping is a computer software technique of extracting information from websites. This technique mostly focuses on the transformation of unstructured data (HTML format) on the web into structured data (database or spreadsheet).
Without knowing the structure of the HTML it is not possible to give you a snippet of code for exactly what you are looking for. But here I can advise you on some of the ways you can scrape your site.
1. Non-programming way:
For those who need a non-programming way to extract information from web pages, you can look at import.io. It provides a GUI-driven interface to perform all basic web scraping operations.
2. Programmer's way:
You will find many Python libraries that perform the same function, so it is necessary to find the best one to use. I prefer BeautifulSoup, since it is easy and intuitive to work with. More precisely, you use two Python modules for scraping data:
urllib2: a Python module which can be used for fetching URLs (in Python 3 this is urllib.request). It defines functions and classes to help with URL actions (basic and digest authentication, redirections, cookies, etc). For more detail refer to the documentation page.
BeautifulSoup: an incredible tool for pulling information out of a webpage. You can use it to extract tables, lists and paragraphs, and you can also apply filters to extract information from web pages. The latest available version is BeautifulSoup 4. You can look at the installation instructions on its documentation page.
BeautifulSoup does not fetch the web page for us. That's why we need to use urllib2 in combination with the BeautifulSoup library.
Python has several other options for HTML scraping in addition to BeautifulSoup. Here are some others:
mechanize
scrapemark
scrapy
