I am new to web scraping and coding, and I am trying to use Python's requests and BeautifulSoup to parse the HTML of a page in order to get some physical and chemical properties.
Although I have used the following script successfully on other websites, here BeautifulSoup prints only a few lines from the header and footer, followed by pages of HTML that doesn't really make sense. This is the code I have been using:
import requests
from bs4 import BeautifulSoup
url = 'https://comptox.epa.gov/dashboard/dsstoxdb/results?search=ammonia#properties'
response = requests.get(url).text
soup = BeautifulSoup(response, 'lxml')
print(soup.prettify())
When I try to find the table or even a row, it gives no output. Is there something I haven't accounted for? Any help would be greatly appreciated!
The data is present in one of the tag attributes. You can extract it as follows (there is a lot more info in there, but I subset to the physical properties):
import requests
from bs4 import BeautifulSoup as bs
import json

url = "https://comptox.epa.gov/dashboard/dsstoxdb/results?search=ammonia#properties"
r = requests.get(url)
soup = bs(r.content, 'lxml')

# The page embeds its results as JSON in a data-result attribute
data = json.loads(soup.select_one('[data-result]')['data-result'])
properties = data['physprop']
print(properties)
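If you want to see what else is in the embedded payload before picking a section out, you can list its top-level keys (a quick exploratory step; physprop is just the subset shown above):
# See which other sections the embedded JSON exposes
print(data.keys())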
It's pretty common that if a page is populated by JavaScript after it loads, requests and BeautifulSoup will not see the final content. The best thing to do is likely to switch to the selenium module, which allows your program to drive a real browser, load the page dynamically, and interact with its elements. After loading (and maybe clicking on a couple of elements) you can feed the HTML to BeautifulSoup and process it however you wish. The basic framework I recommend you start with would look like:
from selenium import webdriver

browser = webdriver.Chrome()  # You'll need the ChromeDriver executable installed for this
browser.implicitly_wait(10)   # probably unnecessary; just makes sure pages you visit fully load
browser.get('https://stips.co.il/explore')

while True:
    input('Press Enter to print HTML')
    HTML = browser.page_source
    print(HTML)
Just click around in the browser, and when you want to check whether the HTML is correct, click back to your prompt and press Enter. Once things look right, you can locate elements programmatically so you don't have to interact with the page manually every time.
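As a minimal sketch of what that automated step might look like (the CSS selector here is a placeholder; swap in whatever matches the elements you actually want):
# Hypothetical example: locate elements by CSS selector instead of clicking around
elements = browser.find_elements_by_css_selector('div.some-target-class')
for el in elements:
    print(el.text)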
So I am new to web scraping, and I want to scrape all the text content of just the home page.
This is my code, but it is not working correctly:
from bs4 import BeautifulSoup
import requests
website_url = "http://www.traiteurcheminfaisant.com/"
ra = requests.get(website_url)
soup = BeautifulSoup(ra.text, "html.parser")
full_text = soup.find_all()
print(full_text)
When I print "full_text" it give me a lot of html content but not all, when I ctrl + f " traiteurcheminfaisant#hotmail.com" the email adress that is on the home page (footer)
is not found on full_text.
Thanks you for helping!
A quick glance at the website you're attempting to scrape makes me suspect that not all of the content is loaded when you send a simple GET request via the requests module. In other words, it seems likely that some components on the site, such as the footer you mentioned, are being loaded asynchronously with JavaScript.
If that is the case, you'll probably want to use some sort of automation tool to navigate to the page, wait for it to load and then parse the fully loaded source code. For this, the most common tool would be Selenium. It can be a bit tricky to set up the first time since you'll also need to install a separate webdriver for whatever browser you'd like to use. That said, the last time I set this up it was pretty easy. Here's a rough example of what this might look like for you (once you've got Selenium properly set up):
from bs4 import BeautifulSoup
from selenium import webdriver
import time

driver = webdriver.Firefox(executable_path='/your/path/to/geckodriver')
driver.get('http://www.traiteurcheminfaisant.com')
time.sleep(2)  # give the page a moment to load its asynchronous content

source = driver.page_source
soup = BeautifulSoup(source, 'html.parser')
full_text = soup.find_all()
print(full_text)
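To confirm the footer content actually made it into the rendered source, you could search the soup for the address directly (a quick sanity check; the search string is the one from the question):
# Look for the email address anywhere in the rendered text
match = soup.find(string=lambda s: s and 'hotmail' in s)
print(match)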
I haven't used BeautifulSoup before, but try using urlopen instead. This will store the webpage as a string, which you can search to find the email.
from urllib.request import urlopen

try:
    response = urlopen("http://www.traiteurcheminfaisant.com")
    html = response.read().decode(encoding="UTF8", errors='ignore')
    # str.find returns the index of the substring, or -1 if it is not present
    print(html.find("traiteurcheminfaisant#hotmail.com"))
except:
    print("Cannot open webpage")
I am trying to use Beautiful Soup 4 to help me download an image from Imgur, although I doubt the Imgur part is relevant. As an example, I'm using the webpage here: https://imgur.com/t/lenovo/mLwnorj
My code is as follows:
import webbrowser, time, sys, requests, os, bs4  # Not all libraries are used in this code snippet
from selenium import webdriver

browser = webdriver.Firefox()
browser.get("https://imgur.com/t/lenovo/mLwnorj")

res = requests.get("https://imgur.com/t/lenovo/mLwnorj")  # the URL needs to be a string
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, features="html.parser")
imageElement = soup.findAll('img', {'class': 'post-image-placeholder'})
print(imageElement)
The HTML code on the Imgur link contains a part that reads as:
<img alt="" src="//i.imgur.com/JfLsH5y.jpg" class="post-image-placeholder" style="max-width: 100%; min-height: 546px;" original-title="">
which I found by picking the first image element on the page using the point and click tool in Inspect Element.
The problem is that I would expect imageElement to contain two items, one for each image; however, the print function shows []. I have also tried other forms of soup.findAll('img', {'class': 'post-image-placeholder'}), such as soup.findall("img[class='post-image-placeholder']"), but that made no difference.
Furthermore, when I used
imageElement = soup.select("h1[class='post-title']")
just to test, the print function did return a match, which made me wonder if it had something to do with the tag:
[<h1 class="post-title">Cable management increases performance. </h1>]
Thank you for your time and effort.
The fundamental problem here seems to be that the actual <img ...> element is not present when the page is first loaded. The best solution to this, in my opinion, would be to take advantage of the selenium webdriver that you already have available to grab the image. Selenium will allow the page to properly render (with JavaScript and all), and then locate whatever elements you care about.
For example:
import webbrowser, time, sys, requests, os, bs4 # Not all libraries are used in this code snippet
from selenium import webdriver
# For pretty debugging output
import pprint
browser = webdriver.Firefox()
browser.get("https://imgur.com/t/lenovo/mLwnorj")
# Give the page up to 10 seconds of a grace period to finish rendering
# before complaining about images not being found.
browser.implicitly_wait(10)
# Find elements via Selenium's search
selenium_image_elements = browser.find_elements_by_css_selector('img.post-image-placeholder')
pprint.pprint(selenium_image_elements)
# Use page source to attempt to find them with BeautifulSoup 4
soup = bs4.BeautifulSoup(browser.page_source, features="html.parser")
soup_image_elements = soup.findAll('img', {'class': 'post-image-placeholder'})
pprint.pprint(soup_image_elements)
I cannot say that I had tested this code when I first wrote it, but the general concept should work.
Update:
I went ahead and tested this on my side, fixed some errors in the code, and then got the results I was hoping to see.
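To actually pull the image URLs out of the Selenium matches, something along these lines should work (a small follow-up sketch building on the selenium_image_elements list from the code above):
# Extract the src attribute from each matched <img> element
for el in selenium_image_elements:
    print(el.get_attribute('src'))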
If a website inserts objects after page load, you will need to use Selenium instead of requests.
from bs4 import BeautifulSoup
from selenium import webdriver

url = 'https://imgur.com/t/lenovo/mLwnorj'
browser = webdriver.Firefox()
browser.get(url)

soup = BeautifulSoup(browser.page_source, 'html.parser')
images = soup.find_all('img', {'class': 'post-image-placeholder'})
for image in images:
    print(image['src'])
# //i.imgur.com/JfLsH5yr.jpg
# //i.imgur.com/lLcKMBzr.jpg
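Since the original goal was downloading the images, here is a rough sketch of that final step (note the src values are protocol-relative, so a scheme must be prepended; the filename handling is just one simple choice):
import requests

for image in images:
    img_url = 'https:' + image['src']      # src starts with "//", so add a scheme
    filename = img_url.rsplit('/', 1)[-1]  # e.g. "JfLsH5yr.jpg"
    with open(filename, 'wb') as f:
        f.write(requests.get(img_url).content)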
I am using Python 2.7 and version 4.5.1 of Beautiful Soup.
I'm at my wits' end trying to make this very simple script work. My goal is to get the online availability status of the NES console from Best Buy's website by parsing the product page's HTML and extracting the information in
<div class="status online-availability-status"> Sold out online </div>
This is my first time using the Beautiful Soup module so forgive me if I have missed something obvious. Here is the script I wrote to try to get the information above:
import requests
from bs4 import BeautifulSoup
page = requests.get('http://www.bestbuy.ca/en-CA/product/nintendo-nintendo-entertainment-system-nes-classic-edition-console-clvsnesa/10488665.aspx?path=922de2a5ceb066b0f058cc567ad3d547en02')
soup = BeautifulSoup(page.content, 'html.parser')
avail = soup.findAll('div', {"class": "status online-availability-status"})
But then I just get an empty list for avail. Any idea why?
Any help is greatly appreciated.
As the comments above suggest, it seems that you are looking for a tag that is generated client-side by JavaScript; it shows up when using 'inspect' on the loaded page, but not when viewing the page source, which is what the call to requests pulls back. You might try using dryscrape (which you may need to install with pip install dryscrape).
import dryscrape
from bs4 import BeautifulSoup

session = dryscrape.Session()
url = 'http://www.bestbuy.ca/en-CA/product/nintendo-nintendo-entertainment-system-nes-classic-edition-console-clvsnesa/10488665.aspx?path=922de2a5ceb066b0f058cc567ad3d547en02'
session.visit(url)
response = session.body()
soup = BeautifulSoup(response, 'html.parser')
avail = soup.findAll('div', {"class": "status online-availability-status"})
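If the div is found, the status text itself can be pulled out of the first match, e.g.:
# Print the availability text, e.g. "Sold out online", if the div was found
if avail:
    print(avail[0].get_text(strip=True))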
This was the most popular solution in a question relating to scraping dynamically generated content:
Web-scraping JavaScript page with Python
If you try printing soup, you'll see it probably returns something like Access Denied. This is because Best Buy requires an acceptable User-Agent when making the GET request. As you do not have a User-Agent specified in the headers, it is not returning the page.
Here is a link to generate a User Agent
How to use Python requests to fake a browser visit a.k.a and generate User Agent?
or you could look up the User-Agent your own browser sends when you view the webpage:
https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent
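For example, a minimal sketch of passing a User-Agent header with requests (the User-Agent string below is only an example of a typical browser value; substitute the one from your own browser):
import requests
from bs4 import BeautifulSoup

# Hypothetical example User-Agent; any realistic browser string should do
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
url = 'http://www.bestbuy.ca/en-CA/product/nintendo-nintendo-entertainment-system-nes-classic-edition-console-clvsnesa/10488665.aspx?path=922de2a5ceb066b0f058cc567ad3d547en02'
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
avail = soup.findAll('div', {"class": "status online-availability-status"})
print(avail)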
Availability is loaded as JSON. You don't even need to parse the HTML for that:
import urllib
import simplejson

sku = 10488665  # look at the URL of the web page, it is <blablah>/10488665.aspx
# change "locations" to get the right store; the literal % signs in the URL are
# doubled so that the %s substitution doesn't trip over them
url = ('http://api.bestbuy.ca/availability/products?callback=apiAvailability'
       '&accept-language=en&skus=%s'
       '&accept=application%%2Fvnd.bestbuy.standardproduct.v1%%2Bjson'
       '&postalCode=M5G2C3&locations=977%%7C203%%7C931%%7C62%%7C617&maxlos=3') % sku
response = urllib.urlopen(url)
availability = simplejson.loads(response.read())
print availability[0]['shipping']['status']
I am trying to get video links from 'https://www.youtube.com/trendsdashboard#loc0=ind'. When I inspect elements in the browser, it shows me the HTML source for each video. In the source code retrieved using
urllib2.urlopen("https://www.youtube.com/trendsdashboard#loc0=ind").read()
the HTML for the videos does not appear. Is there any other way to do this?
<a href="/watch?v=dCdvyFkctOo" alt="Flipkart Wish Chain">
<img src="//i.ytimg.com/vi/dCdvyFkctOo/hqdefault.jpg" alt="Flipkart Wish Chain">
</a>
This markup appears when we inspect elements in the browser, but not in the source code retrieved by urllib.
To view the source code, you need to use the read method.
If you just call urlopen without it, you get something like this:
In [12]: urllib2.urlopen('https://www.youtube.com/trendsdashboard#loc0=ind')
Out[12]: <addinfourl at 3054207052L whose fp = <socket._fileobject object at 0xb60a6f2c>>
To see the source, use read:
urllib2.urlopen('https://www.youtube.com/trendsdashboard#loc0=ind').read()
Whenever you compare source code between Python and a web browser, don't do it through Inspect Element; instead, right-click on the webpage and click View Source, and you will see the actual source the server returned. Inspect Element displays the aggregated result of every network request made, as well as any JavaScript that has executed.
Keep the developer console open before opening the webpage, stay on the Network tab, and make sure 'Preserve Log' is checked in Chrome (or 'Persist' in Firebug for Firefox); then you will see all the network requests made.
works for me...
import urllib2

url = 'https://www.youtube.com/trendsdashboard#loc0=ind'
html = urllib2.urlopen(url).read()  # urllib2 here, matching the import above
IMO I'd use requests instead of urllib - it's a bit easier to use:
import requests
url = 'https://www.youtube.com/trendsdashboard#loc0=ind'
response = requests.get(url)
html = response.content
Edit
This will get you a list of all <a></a> tags with hyperlinks, as per your edit. I use the BeautifulSoup library to parse the html:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
links = [tag for tag in soup.findAll('a') if tag.has_attr('href')]
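From there, pulling out just the URLs is a one-liner, building on the links list above:
hrefs = [tag['href'] for tag in links]
print(hrefs)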
We also need to decode the data to UTF-8. Here is the code; just use:
html = response.content.decode('utf-8')
print(html)
I want to scrape pictures from a public Instagram account. I'm pretty familiar with bs4, so I started with that. Using the element inspector in Chrome, I noted the pictures are in an unordered list and each li has class 'photo', so I figured, what the hell -- can't be that hard to scrape with findAll, right?
Wrong: it doesn't return anything (code below), and I soon noticed that the code shown in the element inspector and the code I drew from requests were not the same, AKA no unordered list in the code I pulled from requests.
Any idea how I can get the code that shows up in element inspector?
Just for the record, this was my code to start, which didn't work because the unordered list was not there:
from bs4 import BeautifulSoup
import requests
import re

r = requests.get('http://instagram.com/umnpics/')
soup = BeautifulSoup(r.text)

for x in soup.findAll('li', {'class':'photo'}):
    print x
Thank you for your help.
If you look at the source code for the page, you'll see that some JavaScript generates the webpage. What you see in the element inspector is the webpage after the script has run, while BeautifulSoup only gets the raw HTML file. In order to parse the rendered webpage you'll need to use something like Selenium to render it for you.
So, for example, this is how it would look with Selenium:
from bs4 import BeautifulSoup
import selenium.webdriver as webdriver

url = 'http://instagram.com/umnpics/'
driver = webdriver.Firefox()
driver.get(url)
soup = BeautifulSoup(driver.page_source)

for x in soup.findAll('li', {'class':'photo'}):
    print x
Now the soup should be what you are expecting.
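If that works, the natural next step is pulling the image URLs out of each list item. A rough sketch (it assumes each li.photo wraps an <img> tag, which may differ on the live page):
# Collect the src of the <img> inside each photo item
image_urls = [li.find('img')['src'] for li in soup.findAll('li', {'class':'photo'}) if li.find('img')]
print image_urls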