Web Scrape page with multiple sections - python

Pretty new to Python... and I'm trying my hand at my first project.
I've been able to replicate a few simple demos, but I think there are a few extra complexities with what I'm trying to do.
I'm trying to scrape the game logs from the NHL website.
Here is what I came up with. Similar code works for the top section of the site (e.g. getting the age), but it fails on the section with display logic (dependent on whether the user clicks on Career, Game Logs or Splits).
Thanks in advance for your help.
import urllib2
from bs4 import BeautifulSoup
url = 'https://www.nhl.com/player/ryan-getzlaf-8470612?stats=gamelogs-r-nhl&season=20162017'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
Test = soup.find_all('div', attrs={'id': "gamelogsTable"})

This happens with many web pages. It's because some of the content is downloaded by Javascript code that is part of the initial download. By doing this, designers are able to show visitors the most important parts of a page without waiting for the entire page to download.
When you want to scrape a page, the first thing you should do is examine its source code (often Ctrl-U in a Windows environment) to see if the content you require is available there. If not, then you will need something beyond BeautifulSoup.
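You can also make that check programmatically. A minimal sketch with requests, searching the raw HTML for the id used in your find_all call:

import requests

url = 'https://www.nhl.com/player/ryan-getzlaf-8470612?stats=gamelogs-r-nhl&season=20162017'
raw = requests.get(url).text
# -1 means the table is not in the initial HTML: it is rendered later
# by Javascript, so BeautifulSoup alone won't find it
print(raw.find('gamelogsTable'))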
>>> getzlafURL = 'https://www.nhl.com/player/ryan-getzlaf-8470612?stats=gamelogs-r-nhl&season=20162017'
>>> import selenium.webdriver as webdriver
>>> import lxml.html as html
>>> import lxml.html.clean as clean
>>> browser = webdriver.Chrome()
>>> browser.get(getzlafURL)
>>> content = browser.page_source
>>> cleaner = clean.Cleaner()
>>> content = cleaner.clean_html(content)
>>> doc = html.fromstring(content)
>>> type(doc)
<class 'lxml.html.HtmlElement'>
>>> open('c:/scratch/temp.htm', 'w').write(content)
775838
By searching within the file temp.htm for the heading 'Ryan Getzlaf Game Logs' I was able to find this section of HTML code. As you can see, it's just what you expected to find in the original downloaded HTML; the additional Selenium step is required to get at it.
</div>
</li>
</ul>
<h5 class="statistics__subheading">Ryan Getzlaf Game Logs</h5>
<div id="gamelogsTable"><div class="responsive-datatable">
I should mention that there are alternative ways of accessing such code, one of them being dryscrape. I simply can't be bothered installing that one on this Windows machine.

Related

How to scrape all the home page text content of a website?

So I am new to web scraping. I want to scrape all the text content of only the home page.
This is my code, but it is not working correctly.
from bs4 import BeautifulSoup
import requests
website_url = "http://www.traiteurcheminfaisant.com/"
ra = requests.get(website_url)
soup = BeautifulSoup(ra.text, "html.parser")
full_text = soup.find_all()
print(full_text)
When I print "full_text" it give me a lot of html content but not all, when I ctrl + f " traiteurcheminfaisant#hotmail.com" the email adress that is on the home page (footer)
is not found on full_text.
Thanks you for helping!
A quick glance at the website that you're attempting to scrape from makes me suspect that not all content is loaded when sending a simple get request via the requests module. In other words, it seems likely that some components on the site, such as the footer you mentioned, are being loaded asynchronously with Javascript.
If that is the case, you'll probably want to use some sort of automation tool to navigate to the page, wait for it to load and then parse the fully loaded source code. For this, the most common tool would be Selenium. It can be a bit tricky to set up the first time since you'll also need to install a separate webdriver for whatever browser you'd like to use. That said, the last time I set this up it was pretty easy. Here's a rough example of what this might look like for you (once you've got Selenium properly set up):
from bs4 import BeautifulSoup
from selenium import webdriver
import time
driver = webdriver.Firefox(executable_path='/your/path/to/geckodriver')
driver.get('http://www.traiteurcheminfaisant.com')
time.sleep(2)
source = driver.page_source
soup = BeautifulSoup(source, 'html.parser')
full_text = soup.find_all()
print(full_text)
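A refinement worth knowing: instead of a fixed time.sleep(2), Selenium can wait until a specific element actually exists. A minimal sketch, assuming the footer sits in a <footer> tag (inspect the page for the real selector):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox(executable_path='/your/path/to/geckodriver')
driver.get('http://www.traiteurcheminfaisant.com')
# wait up to 10 seconds for the (assumed) footer element to appear
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.TAG_NAME, 'footer'))
)
source = driver.page_source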
I haven't used BeautifulSoup before, but try using urlopen instead. This will store the webpage as a string, which you can use to find the email.
from urllib.request import urlopen
try:
    response = urlopen("http://www.traiteurcheminfaisant.com")
    html = response.read().decode(encoding="UTF8", errors="ignore")
    print(html.find("traiteurcheminfaisant#hotmail.com"))
except:
    print("Cannot open webpage")

How to fix Python requests/BeautifulSoup response from database

I am new to web scraping/coding, and I am trying to use Python requests/BeautifulSoup to parse through the html code in order to get some physical and chemical properties.
For some reason, although I have used the following script for other websites successfully, BeautifulSoup has only printed a few lines from the header and footer, and then pages of HTML code that doesn't really make sense. This is the code I have been using:
import requests
from bs4 import BeautifulSoup
url='https://comptox.epa.gov/dashboard/dsstoxdb/results?search=ammonia#properties'
response = requests.get(url).text
soup=BeautifulSoup(response,'lxml')
print(soup.prettify())
When I try to find the table or even a row, it gives no output. Is there something I haven't accounted for? Any help would be greatly appreciated!
It is present in one of the attributes. You can extract it as follows (there is a lot more info in there, but I subset to the physical properties):
import requests
from bs4 import BeautifulSoup as bs
import json
url = "https://comptox.epa.gov/dashboard/dsstoxdb/results?search=ammonia#properties"
r = requests.get(url)
soup = bs(r.content, 'lxml')
data = json.loads(soup.select_one('[data-result]')['data-result'])
properties = data['physprop']
print(properties)
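The exact schema of each record in physprop is an assumption on my part, so it's worth inspecting one entry before going further:

print(len(properties))  # how many property records came back
print(properties[0])    # shows which keys (name, value, units, ...) actually exist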
It's pretty common that, if a page is populated by JavaScript after you load it, requests and BeautifulSoup will not process the page correctly. The best thing to do is likely to switch to the selenium module, which allows your program to dynamically access the page and interact with elements. After loading (and maybe clicking on a couple of elements) you can feed the HTML to BeautifulSoup and process it however you wish. The basic framework I recommend you start with would look like this:
from selenium import webdriver

browser = webdriver.Chrome()  # you'll need the matching browser driver (e.g. ChromeDriver) installed
browser.implicitly_wait(10)   # probably unnecessary, just makes sure pages you visit fully load
browser.get('https://stips.co.il/explore')
while True:
    input('Press Enter to print HTML')
    HTML = browser.page_source
    print(HTML)
Just click around in the browser, and when you want to see whether the HTML is correct, click back to your prompt and press Enter. To avoid manually interacting with the page every run, you can also locate elements automatically, as in the sketch below.
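A rough example of that (the CSS selector is a placeholder, not something taken from the real page):

from selenium import webdriver

browser = webdriver.Chrome()
browser.implicitly_wait(10)
browser.get('https://stips.co.il/explore')
# placeholder selector: substitute one matching the element you want
element = browser.find_element_by_css_selector('div.some-class')
print(element.text)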

I want to get all links from a certain webpage using python

I want to be able to pull all URLs from the following webpage using Python: https://yeezysupply.com/pages/all. I tried some other suggestions I found, but they didn't seem to work with this particular website; I would end up not finding any URLs at all.
import urllib
import lxml.html

connection = urllib.urlopen('https://yeezysupply.com/pages/all')
dom = lxml.html.fromstring(connection.read())
for link in dom.xpath('//a/@href'):
    print link
Perhaps it would be useful for you to make use of modules specifically designed for this. Here's a quick and dirty script that gets the relative links on the page:
#!/usr/bin/python3
import requests, bs4
res = requests.get('https://yeezysupply.com/pages/all')
soup = bs4.BeautifulSoup(res.text,'html.parser')
links = soup.find_all('a')
for link in links:
    print(link.attrs['href'])
it generates output like this:
/pages/jewelry
/pages/clothing
/pages/footwear
/pages/all
/cart
/products/womens-boucle-dress-bleach/?back=%2Fpages%2Fall
/products/double-sleeve-sweatshirt-bleach/?back=%2Fpages%2Fall
/products/boxy-fit-zip-up-hoodie-light-sand/?back=%2Fpages%2Fall
/products/womens-boucle-skirt-cream/?back=%2Fpages%2Fall
etc...
Is this what you are looking for? requests and Beautiful Soup are amazing tools for scraping.
There are no links in the page source; they are inserted using Javascript after the page is loaded in the browser.
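If that's the case, a hedged sketch of the Selenium route (it needs a matching chromedriver installed; the parsing is the same as the requests version above):

from selenium import webdriver
from bs4 import BeautifulSoup

browser = webdriver.Chrome()
browser.get('https://yeezysupply.com/pages/all')
# the browser has executed the page's Javascript by now, so anchors
# inserted client-side are present in page_source
soup = BeautifulSoup(browser.page_source, 'html.parser')
for link in soup.find_all('a', href=True):
    print(link['href'])
browser.quit()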

Using Beautiful Soup in Python to check availability of a product online

I am using python 2.7 and version 4.5.1 of Beautiful Soup
I'm at my wit's end trying to make this very simple script work. My goal is to get the information on the online availability status of the NES console from Best Buy's website by parsing the HTML for the product's page and extracting the information in
<div class="status online-availability-status"> Sold out online </div>
This is my first time using the Beautiful Soup module so forgive me if I have missed something obvious. Here is the script I wrote to try to get the information above:
import requests
from bs4 import BeautifulSoup
page = requests.get('http://www.bestbuy.ca/en-CA/product/nintendo-nintendo-entertainment-system-nes-classic-edition-console-clvsnesa/10488665.aspx?path=922de2a5ceb066b0f058cc567ad3d547en02')
soup = BeautifulSoup(page.content, 'html.parser')
avail = soup.findAll('div', {"class": "status online-availability-status"})
But then I just get an empty list for avail. Any idea why?
Any help is greatly appreciated.
As the comments above suggest, it seems that you are looking for a tag which is generated client side by JavaScript; it shows up using 'inspect' on the loaded page, but not when viewing the page source, which is what the call to requests is pulling back. You might try using dryscrape (which you may need to install with pip install dryscrape).
import dryscrape
from bs4 import BeautifulSoup
session = dryscrape.Session()
url = 'http://www.bestbuy.ca/en-CA/product/nintendo-nintendo-entertainment-system-nes-classic-edition-console-clvsnesa/10488665.aspx?path=922de2a5ceb066b0f058cc567ad3d547en02'
session.visit(url)
response = session.body()
soup = BeautifulSoup(response, 'html.parser')
avail = soup.findAll('div', {"class": "status online-availability-status"})
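If the element was rendered, you can read the status text straight off the first match:

if avail:
    print(avail[0].get_text(strip=True))  # e.g. "Sold out online"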
This was the most popular solution in a question relating to scraping dynamically generated content:
Web-scraping JavaScript page with Python
If you try printing soup you'll see it probably returns something like Access Denied. This is because Best Buy requires an allowable User-Agent when making the GET request. As you do not have a User-Agent specified in the request headers, it is not returning anything.
Here is a link to generate a User Agent
How to use Python requests to fake a browser visit a.k.a and generate User Agent?
Or you could look up the User-Agent your own browser sends when you view the webpage:
https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent
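A minimal sketch of sending a User-Agent with requests (the header value is just an example desktop-browser string, not one Best Buy specifically requires):

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
page = requests.get('http://www.bestbuy.ca/en-CA/product/nintendo-nintendo-entertainment-system-nes-classic-edition-console-clvsnesa/10488665.aspx?path=922de2a5ceb066b0f058cc567ad3d547en02',
                    headers=headers)
print(page.status_code)  # 200 rather than an access-denied page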
Availability is loaded in JSON. You don't even need to parse HTML for that:
import urllib
import simplejson

sku = 10488665  # look at the URL of the web page: it is <blablah>//10488665.aspx
# change "locations" to get the right store
url = ('http://api.bestbuy.ca/availability/products?callback=apiAvailability'
       '&accept-language=en&skus={}'
       '&accept=application%2Fvnd.bestbuy.standardproduct.v1%2Bjson'
       '&postalCode=M5G2C3&locations=977%7C203%7C931%7C62%7C617&maxlos=3'.format(sku))
response = urllib.urlopen(url)
availability = simplejson.loads(response.read())
print availability[0]['shipping']['status']
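On Python 3 the same lookup would look roughly like this with requests (the endpoint and query string are copied from above; this is not a documented public API, so treat it as a sketch):

import requests

sku = 10488665
url = ('http://api.bestbuy.ca/availability/products?callback=apiAvailability'
       '&accept-language=en&skus={}'
       '&accept=application%2Fvnd.bestbuy.standardproduct.v1%2Bjson'
       '&postalCode=M5G2C3&locations=977%7C203%7C931%7C62%7C617&maxlos=3'.format(sku))
availability = requests.get(url).json()
print(availability[0]['shipping']['status'])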

source code of web page not available using urllib.urlopen()

I am trying to get video links from https://www.youtube.com/trendsdashboard#loc0=ind. When I inspect elements, the browser shows me the HTML source for each video. But the source retrieved using
urllib2.urlopen("https://www.youtube.com/trendsdashboard#loc0=ind").read()
does not contain the HTML for the videos. Is there any other way to do this?
<a href="/watch?v=dCdvyFkctOo" alt="Flipkart Wish Chain">
<img src="//i.ytimg.com/vi/dCdvyFkctOo/hqdefault.jpg" alt="Flipkart Wish Chain">
</a>
This markup appears when we inspect elements in the browser, but not in the source code retrieved by urllib.
To view the source code you need to use the read method.
If you just call urlopen, it gives you something like this:
In [12]: urllib2.urlopen('https://www.youtube.com/trendsdashboard#loc0=ind')
Out[12]: <addinfourl at 3054207052L whose fp = <socket._fileobject object at 0xb60a6f2c>>
To see the source, use read:
urllib2.urlopen('https://www.youtube.com/trendsdashboard#loc0=ind').read()
Whenever you compare source code between your Python code and a web browser, don't do it through Inspect Element; right-click on the webpage and click View Source, and then you will find the actual source. Inspect Element displays the aggregated source code produced by however many network requests were made, plus the effects of JavaScript being executed.
Keep the Developer Console open before opening the webpage, stay on the Network tab, and make sure 'Preserve Log' is checked in Chrome (or 'Persist' for Firebug in Firefox); then you will see all the network requests made.
works for me...
import urllib2
url = 'https://www.youtube.com/trendsdashboard#loc0=ind'
html = urllib2.urlopen(url).read()
IMO I'd use requests instead of urllib - it's a bit easier to use:
import requests
url = 'https://www.youtube.com/trendsdashboard#loc0=ind'
response = requests.get(url)
html = response.content
Edit
This will get you a list of all <a></a> tags with hyperlinks as per your edit. I use the library BeautifulSoup to parse the html:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
links = [tag for tag in soup.findAll('a') if tag.has_attr('href')]
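And to reduce that list of tags to the href strings themselves:

hrefs = [tag['href'] for tag in links]
print(hrefs)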
We also need to decode the data to UTF-8. Here is the code; with the requests response above, just use:
html = response.content.decode('utf-8')
print(html)
