I am trying to learn the Python library BeautifulSoup and, as an exercise, I would like to scrape a flight price from Google Flights.
So I connected to Google Flights, for example at this link, and I want to get the cheapest flight price.
I would like to get the value inside the div with the class "gws-flights-results__itinerary-price" (as in the figure).
Here is the simple code I wrote:
from bs4 import BeautifulSoup
import urllib.request
url = 'https://www.google.com/flights?hl=it#flt=/m/07_pf./m/05qtj.2019-04-27;c:EUR;e:1;sd:1;t:f;tt:o'
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
div = soup.find('div', attrs={'class': 'gws-flights-results__itinerary-price'})
But the resulting div is None (NoneType).
I also tried with
find_all('div')
but none of the divs I found that way was the one I was interested in.
Can someone help me?
It looks like JavaScript needs to run, so use a tool like Selenium:
from selenium import webdriver
url = 'https://www.google.com/flights?hl=it#flt=/m/07_pf./m/05qtj.2019-04-27;c:EUR;e:1;sd:1;t:f;tt:o'
driver = webdriver.Chrome()
driver.get(url)
print(driver.find_element_by_css_selector('.gws-flights-results__cheapest-price').text)
driver.quit()
It's great that you are learning web scraping! The reason you are getting NoneType as a result is that the website you are scraping loads its content dynamically. When the URL is fetched with urllib or requests, the response mostly contains JavaScript, and the div with the class "gws-flights-results__itinerary-price" has not been rendered yet, so the scraping approach you are using won't work on this website.
However, you can use other methods, such as fetching the page with a tool like Selenium or Splash to render the JavaScript, and then parsing the content.
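For the Splash route, a minimal sketch could look like the one below, assuming a Splash instance is already running locally on port 8050 (for example via Docker) and that the price keeps the class name from your question:
import requests
from bs4 import BeautifulSoup
url = 'https://www.google.com/flights?hl=it#flt=/m/07_pf./m/05qtj.2019-04-27;c:EUR;e:1;sd:1;t:f;tt:o'
# Ask the local Splash service to render the page; 'wait' gives the scripts a couple of seconds to populate the results
rendered = requests.get('http://localhost:8050/render.html', params={'url': url, 'wait': 2})
soup = BeautifulSoup(rendered.text, 'html.parser')
div = soup.find('div', attrs={'class': 'gws-flights-results__itinerary-price'})
print(div.get_text(strip=True) if div else 'price div not found')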
BeautifulSoup is a great tool for extracting parts of an HTML or XML document, but here it looks like you only need the URL for another GET request that returns a JSON object.
This is the website I'm trying to scrape with Python:
https://www.ebay.de/sch/i.html?_from=R40&_nkw=iphone+8&_sacat=0&LH_Sold=1&LH_Complete=1&rt=nc&LH_ItemCondition=3000
I want to access the 'ul' element with the class of 'srp-results srp-list clearfix'. This is what I tried with requests and BeautifulSoup:
from bs4 import BeautifulSoup
import requests
url = 'https://www.ebay.de/sch/i.html?_from=R40&_nkw=iphone+8&_sacat=0&LH_Sold=1&LH_Complete=1&rt=nc&LH_ItemCondition=3000'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
uls = soup.find_all('ul', attrs = {'class': 'srp-results srp-list clearfix'})
And the output is always an empty list.
I also tried scraping the website with Selenium WebDriver and got the same result.
At first I was a little confused by your error, but after a bit of debugging I figured it out: eBay generates that ul dynamically with JavaScript.
So since you can't execute JavaScript with BeautifulSoup, you have to use Selenium and wait until the JavaScript has loaded that ul, as in the sketch below.
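A minimal sketch of that with an explicit wait might look like this (the selector comes from your question; the Chrome driver and the 10-second timeout are assumptions):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
url = 'https://www.ebay.de/sch/i.html?_from=R40&_nkw=iphone+8&_sacat=0&LH_Sold=1&LH_Complete=1&rt=nc&LH_ItemCondition=3000'
driver = webdriver.Chrome()
driver.get(url)
# Wait until the results list has actually been inserted into the DOM
ul = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, 'ul.srp-results.srp-list.clearfix')))
print(ul.text)
driver.quit()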
It is probably because the content you are looking for is rendered by JavaScript after the page loads in a web browser; the browser fills in that content by running JavaScript, which you cannot get with a plain requests.get call from Python.
I would suggest learning Selenium to scrape the data you want.
I am currently trying to make a web scraper using Python. My objective is for the scraper to find the name and the price of a stock. Here is my code:
import requests
from bs4 import BeautifulSoup
page = requests.get('https://finance.yahoo.com/quote/MA?p=MA&.tsrc=fin-srch')
soup = BeautifulSoup(page.content, "html.parser")
stock_name = soup.find(class_="D(ib) Fz(18px)")
print(stock_name)
But when I run it, I get this:
C:\Users\baribal\Desktop>py web_scraper.py
None
Thank you in advance!
Your request just gives you the raw HTML of the webpage. The elements you are trying to retrieve are React components that are rendered in the browser after the HTML source is loaded.
You need to drive a real (optionally headless) browser, for example with Selenium, instead.
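A minimal sketch of that, driving a headless Chrome and handing the rendered page back to BeautifulSoup (the class string is taken from your question and may change whenever Yahoo updates its markup):
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
options = Options()
options.add_argument('--headless')  # run without a visible browser window
driver = webdriver.Chrome(options=options)
driver.get('https://finance.yahoo.com/quote/MA?p=MA&.tsrc=fin-srch')
# Parse the rendered DOM rather than the raw HTML source
soup = BeautifulSoup(driver.page_source, 'html.parser')
stock_name = soup.find(class_='D(ib) Fz(18px)')
print(stock_name.get_text(strip=True) if stock_name else None)
driver.quit()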
I'm working on a web crawler right now, and it seems that I can't get a class that is inside a div on a particular website. Below is my code; I use BeautifulSoup with Python 3.
import requests
from bs4 import BeautifulSoup as bs
response = requests.get('https://e27.co/startup/flipkart').text
soup = bs(response, 'html.parser')
content_div = soup.find_all('h1', class_='profile-startup')
print(content_div)
I want to extract the text inside the h1 that has the class "profile-startup". The above code returns nothing. Can you guys help me?
This website populates its data using JavaScript. If you take a look at the contents of response, you will see that there is no h1 in it. You have to check whether they have an API you can use to retrieve the information you need, or consider using a browser automation technology like Selenium: http://selenium-python.readthedocs.io/installation.html#introduction
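One quick way to confirm that the content is injected by JavaScript is to check whether the class name appears anywhere in the raw response at all, for example:
import requests
response = requests.get('https://e27.co/startup/flipkart').text
# If this prints False, the h1 is not in the server-side HTML and a browser-based tool (or the site's API, if any) is needed
print('profile-startup' in response)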
I am trying to scrape a website, but I have encountered a problem: the HTML I get from Python differs from what I see with Chrome's Inspect tool. I get this with http://edition.cnn.com/election/results/states/arizona/house/01 where I tried to scrape the election results. I used the script below to check the HTML of the webpage, and I noticed that the two differ; the classes I need, like section-wrapper, are not there.
page =requests.get('http://edition.cnn.com/election/results/states/arizona/house/01')
soup = BeautifulSoup(page.content, "lxml")
print(soup)
Does anyone know what the problem is?
http://data.cnn.com/ELECTION/2016/AZ/county/H_d1_county.json
This site uses JavaScript to fetch its data; you can check the URL above.
You can find this URL in Chrome's dev tools; there are many links there, check it out:
Chrome >> F12 >> Network tab >> F5 (refresh page) >> double-click the .json URL >> open it in a new tab
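A minimal sketch of fetching that JSON directly with requests (the exact structure of the payload is an assumption, so inspect it once before relying on any keys):
import requests
url = 'http://data.cnn.com/ELECTION/2016/AZ/county/H_d1_county.json'
resp = requests.get(url)
resp.raise_for_status()
data = resp.json()
# Print the top-level keys (or the length) to see how the results are organised
print(list(data) if isinstance(data, dict) else len(data))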
import requests
from bs4 import BeautifulSoup

page = requests.get('http://edition.cnn.com/election/results/states/arizona/house/01')
soup = BeautifulSoup(page.content, 'html.parser')
# You can try all sorts of tags here; I used class "ec-placeholder" and class "ad"
g_data = soup.find_all("div", {"class": "ec-placeholder"})
h_data = soup.find_all("div", {"class": "ad"})
for item in g_data:
    print(item)
# for item in h_data:
#     print(item)
I am using Python 2.7 and version 4.5.1 of Beautiful Soup.
I'm at my wits' end trying to make this very simple script work. My goal is to get the online availability status of the NES console from Best Buy's website by parsing the HTML of the product page and extracting the information in
<div class="status online-availability-status"> Sold out online </div>
This is my first time using the Beautiful Soup module so forgive me if I have missed something obvious. Here is the script I wrote to try to get the information above:
import requests
from bs4 import BeautifulSoup
page = requests.get('http://www.bestbuy.ca/en-CA/product/nintendo-nintendo-entertainment-system-nes-classic-edition-console-clvsnesa/10488665.aspx?path=922de2a5ceb066b0f058cc567ad3d547en02')
soup = BeautifulSoup(page.content, 'html.parser')
avail = soup.findAll('div', {"class": "status online-availability-status"})
But then I just get an empty list for avail. Any idea why?
Any help is greatly appreciated.
As the comments above suggest, the tag you are looking for is generated client-side by JavaScript; it shows up when you use 'Inspect' on the loaded page, but not when viewing the page source, which is what the call to requests pulls back. You might try using dryscrape (which you may need to install with pip install dryscrape).
import dryscrape
from bs4 import BeautifulSoup
session = dryscrape.Session()
url = 'http://www.bestbuy.ca/en-CA/product/nintendo-nintendo-entertainment-system-nes-classic-edition-console-clvsnesa/10488665.aspx?path=922de2a5ceb066b0f058cc567ad3d547en02'
session.visit(url)
response = session.body()
soup = BeautifulSoup(response, 'html.parser')
avail = soup.findAll('div', {"class": "status online-availability-status"})
This was the most popular solution in a question relating to scraping dynamically generated content:
Web-scraping JavaScript page with Python
If you try printing soup you'll see that it probably returns something like Access Denied. This is because Best Buy requires an acceptable User-Agent when making the GET request. Since you do not specify a User-Agent in the headers, it does not return the page.
Here is a link about generating a User-Agent:
How to use Python requests to fake a browser visit a.k.a and generate User Agent?
Or you could look up the User-Agent your own browser sends when you view the webpage:
https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent
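A minimal sketch of the same request with a browser-like User-Agent header (the UA string below is just an example; any current browser string should work):
import requests
from bs4 import BeautifulSoup
url = 'http://www.bestbuy.ca/en-CA/product/nintendo-nintendo-entertainment-system-nes-classic-edition-console-clvsnesa/10488665.aspx?path=922de2a5ceb066b0f058cc567ad3d547en02'
# Pretend to be a regular desktop browser so the request is not rejected
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0 Safari/537.36'}
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.find_all('div', {'class': 'status online-availability-status'}))
Whether the availability div then shows up still depends on it being present in the server-rendered HTML, as the other answers point out.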
Availability is loaded in JSON. You don't even need to parse HTML for that:
import urllib
import simplejson

sku = 10488665  # look at the URL of the web page, it is <blablah>//10488665.aspx
# change locations to get the right store
response = urllib.urlopen('http://api.bestbuy.ca/availability/products?callback=apiAvailability&accept-language=en&skus={}&accept=application%2Fvnd.bestbuy.standardproduct.v1%2Bjson&postalCode=M5G2C3&locations=977%7C203%7C931%7C62%7C617&maxlos=3'.format(sku))
availability = simplejson.loads(response.read())
print availability[0]['shipping']['status']