This is the website I'm trying to scrape with Python:
https://www.ebay.de/sch/i.html?_from=R40&_nkw=iphone+8&_sacat=0&LH_Sold=1&LH_Complete=1&rt=nc&LH_ItemCondition=3000
I want to access the 'ul' element with the class of 'srp-results srp-list clearfix'. This is what I tried with requests and BeautifulSoup:
from bs4 import BeautifulSoup
import requests
url = 'https://www.ebay.de/sch/i.html?_from=R40&_nkw=iphone+8&_sacat=0&LH_Sold=1&LH_Complete=1&rt=nc&LH_ItemCondition=3000'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
uls = soup.find_all('ul', attrs = {'class': 'srp-results srp-list clearfix'})
And the output is always an empty list.
I also tried scraping the website with Selenium WebDriver and got the same result.
At first I was a little confused by your error, but after a bit of debugging I figured out that eBay generates that ul dynamically with JavaScript.
Since BeautifulSoup can't execute JavaScript, you have to use Selenium and wait until the JavaScript has loaded that ul.
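A minimal sketch of that approach, assuming ChromeDriver/Chrome is available and that eBay still serves the list with the class 'srp-results srp-list clearfix' (eBay's markup changes often, so treat the selector as a placeholder):
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
url = 'https://www.ebay.de/sch/i.html?_from=R40&_nkw=iphone+8&_sacat=0&LH_Sold=1&LH_Complete=1&rt=nc&LH_ItemCondition=3000'
driver = webdriver.Chrome()
driver.get(url)
# Wait up to 10 seconds until the results list has been rendered by JavaScript
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, 'ul.srp-results')))
# Hand the fully rendered HTML over to BeautifulSoup
soup = BeautifulSoup(driver.page_source, 'html.parser')
uls = soup.find_all('ul', attrs={'class': 'srp-results srp-list clearfix'})
print(len(uls))
driver.quit()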
It is probably because the content you are looking for is rendered by JavaScript after the page loads in a web browser. The browser only produces that content by running JavaScript, so you cannot get it with a requests.get request from Python.
I would suggest learning Selenium to scrape the data you want.
For context, I am pretty new to Python. I am trying to use bs4 to parse some data out of https://bigfuture.collegeboard.org/college-university-search/university-of-california-los-angeles
To be exact, I want to obtain the 57% number in the "paying" section of the webpage.
My problem is that bs4 only returns the first layer of the HTML, while the data I want is deeply nested in the code. I think it's under 17 divs.
Here is my Python code:
import requests
import bs4
url = 'https://bigfuture.collegeboard.org/college-university-search/university-of-california-los-angeles'
res = requests.get(url)
soup = bs4.BeautifulSoup(res.text, "html.parser")
print(soup.find_all("div", {"id": "gwtDiv"}))
(This returns [<div class="clearfix margin60 marginBottomOnly" id="gwtDiv" style="min-height: 300px;height: 300px;height: auto;"></div>]; none of the elements inside it are shown.)
If the page uses JavaScript to render content inside the element, then requests will not be able to get that content, since it is rendered client-side in the browser. I'd recommend using ChromeDriver and Selenium along with BeautifulSoup.
You can download ChromeDriver from here: https://chromedriver.chromium.org/
Put it in the same folder in which you're running your program.
Try something like this:
from selenium import webdriver
from bs4 import BeautifulSoup
url = 'https://bigfuture.collegeboard.org/college-university-search/university-of-california-los-angeles'
driver = webdriver.Chrome()
driver.get(url)
# Grab the HTML after the browser has executed the page's JavaScript
html = driver.execute_script("return document.documentElement.outerHTML")
sel_soup = BeautifulSoup(html, 'html.parser')
print(sel_soup.find_all("div", {"id": "gwtDiv"}))
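One caveat, continuing from the snippet above (reusing its driver): if the outerHTML is grabbed immediately, the gwtDiv may still be empty, so an explicit wait before reading it can help. A sketch, assuming the div eventually receives child elements:
from selenium.webdriver.support.ui import WebDriverWait
# Wait up to 15 seconds until the gwtDiv actually contains rendered children
WebDriverWait(driver, 15).until(lambda d: d.execute_script("return document.getElementById('gwtDiv').children.length > 0"))
html = driver.execute_script("return document.documentElement.outerHTML")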
I am trying to scrape some information from webpages. On one page it works fine, but on another page it does not, because I only get None as the return value.
This code / webpage is working fine:
# https://realpython.com/beautiful-soup-web-scraper-python/
import requests
from bs4 import BeautifulSoup
URL = "https://www.monster.at/jobs/suche/?q=Software-Devel&where=Graz"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
name_box = soup.findAll("div", attrs={"class": "company"})
print (name_box)
But with this code / webpage I only get None as the return value:
# https://www.freecodecamp.org/news/how-to-scrape-websites-with-python-and-beautifulsoup-5946935d93fe/
import requests
from bs4 import BeautifulSoup
URL = "https://www.bloomberg.com/quote/SPX:IND"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
name_box = soup.find("h1", attrs={"class": "companyName__99a4824b"})
print (name_box)
Why is that?
(At first I thought the class name on the second webpage, "companyName__99a4824b", changes dynamically because of the number in it, but this is not the case: when I refresh the webpage it is still the same class name...)
The reason you get None is that the Bloomberg page uses JavaScript to load its content while the user is on the page.
requests simply returns the HTML of the page as it is first served, and BeautifulSoup parses that, which does not contain the companyName__99a4824b class tag.
Only after the browser has waited for the page to fully load does the HTML include the desired tag.
If you want to scrape that data, you'll need to use something like Selenium, which you can instruct to wait until the desired element of the page is ready.
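A minimal sketch of that, assuming ChromeDriver is set up; since the hashed part of companyName__99a4824b may change with a new build, the selector below matches any h1 whose class starts with "companyName":
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
driver.get("https://www.bloomberg.com/quote/SPX:IND")
# Wait until an h1 whose class starts with "companyName" has been rendered
name = WebDriverWait(driver, 15).until(EC.presence_of_element_located((By.CSS_SELECTOR, 'h1[class^="companyName"]')))
print(name.text)
driver.quit()
If the element never appears, the site may be serving a bot-detection page instead, which is what the next answer points out.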
The website blocks scrapers; check the title:
print(soup.find("title"))
To bypass this you must use a real browser which can run JavaScript.
A tool called Selenium can do that for you.
I am currently trying to make a web scraper using Python. My objective is for the web scraper to find the name and the price of a stock. Here is my code:
import requests
from bs4 import BeautifulSoup
page = requests.get('https://finance.yahoo.com/quote/MA?p=MA&.tsrc=fin-srch')
soup = BeautifulSoup(page.content, "html.parser")
stock_name = soup.find({ "class" : "D(ib) Fz(18px)"})
print(stock_name)
But when I run it I get this:
C:\Users\baribal\Desktop>py web_scraper.py
None
Thank you in advance!
Your request just gives you the raw HTML of the webpage. The elements you are trying to retrieve are React components that are rendered in the browser after the HTML source is loaded.
You need to drive a real browser, for example with Selenium (optionally headless), instead.
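A rough sketch of that, assuming headless Chrome and that Yahoo still uses the D(ib) Fz(18px) class from the question (Yahoo's class names change frequently, so the selector is only a placeholder):
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
options = Options()
options.add_argument("--headless=new")  # run Chrome without opening a window (plain --headless on older Chrome)
driver = webdriver.Chrome(options=options)
driver.get('https://finance.yahoo.com/quote/MA?p=MA&.tsrc=fin-srch')
# Parse the browser-rendered HTML instead of the raw response from requests
soup = BeautifulSoup(driver.page_source, "html.parser")
stock_name = soup.find(class_="D(ib) Fz(18px)")
print(stock_name.get_text(strip=True) if stock_name else "element not found")
driver.quit()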
I am trying to learn to use the Python library BeautifulSoup. I would like, for example, to scrape the price of a flight on Google Flights.
So I connected to Google Flights, for example at the link used in the code below, and I want to get the cheapest flight price.
So I would get the value inside the div with this class "gws-flights-results__itinerary-price" (as in the figure).
Here is the simple code I wrote:
from bs4 import BeautifulSoup
import urllib.request
url = 'https://www.google.com/flights?hl=it#flt=/m/07_pf./m/05qtj.2019-04-27;c:EUR;e:1;sd:1;t:f;tt:o'
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
div = soup.find('div', attrs={'class': 'gws-flights-results__itinerary-price'})
But the resulting div is None.
I also tried
find_all('div')
but among all the divs found that way, the one I was interested in was not there.
Can someone help me?
It looks like JavaScript needs to run, so use a method like Selenium:
from selenium import webdriver
from selenium.webdriver.common.by import By
url = 'https://www.google.com/flights?hl=it#flt=/m/07_pf./m/05qtj.2019-04-27;c:EUR;e:1;sd:1;t:f;tt:o'
driver = webdriver.Chrome()
driver.get(url)
# The price is rendered by JavaScript, so read it from the live DOM
print(driver.find_element(By.CSS_SELECTOR, '.gws-flights-results__cheapest-price').text)
driver.quit()
It's great that you are learning web scraping! The reason you are getting NoneType as a result is that the website you are scraping loads its content dynamically. When the requests library fetches the URL, the response mostly contains JavaScript, and the div with the class "gws-flights-results__itinerary-price" isn't rendered yet, so it won't be possible to scrape this website with the approach you are using.
However, you can use other methods, such as fetching the page with a tool like Selenium or Splash to render the JavaScript and then parsing the content.
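For the Splash route, a rough sketch, assuming a Splash instance is already running locally on its default port 8050 (for example via the official Docker image):
import requests
from bs4 import BeautifulSoup
url = 'https://www.google.com/flights?hl=it#flt=/m/07_pf./m/05qtj.2019-04-27;c:EUR;e:1;sd:1;t:f;tt:o'
# Ask Splash to render the page (executing its JavaScript) and return the final HTML
resp = requests.get('http://localhost:8050/render.html', params={'url': url, 'wait': 2})
soup = BeautifulSoup(resp.text, 'html.parser')
div = soup.find('div', attrs={'class': 'gws-flights-results__itinerary-price'})
print(div.get_text(strip=True) if div else 'not rendered yet')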
BeautifulSoup is a great tool for extracting parts of HTML or XML, but here it looks like you only need to find the URL of another GET request that returns a JSON object.
(I am not at a computer now; I can update with an example tomorrow.)
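The general pattern would look like the sketch below; the endpoint is purely hypothetical and has to be replaced with the request URL found in the browser dev tools (Network tab, XHR/Fetch filter), nothing here is a confirmed Google Flights API:
import requests
# Hypothetical endpoint: replace with the request URL copied from the Network tab
json_url = 'https://www.example.com/flights/api/results?query=...'
data = requests.get(json_url, headers={'User-Agent': 'Mozilla/5.0'}).json()
# Drill into the JSON once its structure has been inspected, e.g.:
# print(data['trips'][0]['price'])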
I want to load the list of images on this page in Python. When I open the page in my browser (Chrome or Safari) and open the dev tools, the inspector shows the list of images as <img class="grid-item--image">....
However, when I tried to parse the page in Python, the result was different. Specifically, I got the list of images as <img class="carousel--image"...>, whereas soup.findAll("img", "grid-item--image") returned an empty list. Also, when I tried saving the images using their srcset attribute, most of the saved images were NOT the ones listed on the web page.
I think the web page uses some sort of technique when rendering. How can I parse web pages like this successfully?
I used BeautifulSoup 4 on Python 3.5. I loaded the page as follows:
import requests
from bs4 import BeautifulSoup
def get_soup(url):
    html = requests.get(url).text
    soup = BeautifulSoup(html, "html.parser", from_encoding="utf-8")
    return soup
You would do better to use something like Selenium for this, as follows:
from bs4 import BeautifulSoup
from selenium import webdriver
browser = webdriver.Firefox()
browser.get("http://www.vogue.com/fashion-shows/fall-2016-menswear/fendi#collection")
html_source = browser.page_source
soup = BeautifulSoup(html_source, "html.parser")
for item in soup.find_all("img", {"class":"grid-item--image"}):
    print(item.get('srcset'))
This would display the following kind of output:
http://assets.vogue.com/photos/569d37e434324c316bd70f04/master/w_195/_FEN0016.jpg
http://assets.vogue.com/photos/569d37e5d928983d20a78e4f/master/w_195/_FEN0027.jpg
http://assets.vogue.com/photos/569d37e834324c316bd70f0a/master/w_195/_FEN0041.jpg
http://assets.vogue.com/photos/569d37e334324c316bd70efe/master/w_195/_FEN0049.jpg
http://assets.vogue.com/photos/569d37e702e08d8957a11e32/master/w_195/_FEN0059.jpg
...
...
...
http://assets.vogue.com/photos/569d3836486d6d3e20ae9625/master/w_195/_FEN0616.jpg
http://assets.vogue.com/photos/569d381834324c316bd70f3b/master/w_195/_FEN0634.jpg
http://assets.vogue.com/photos/569d3829fa6d6c9057f91d2a/master/w_195/_FEN0649.jpg
http://assets.vogue.com/photos/569d382234324c316bd70f41/master/w_195/_FEN0663.jpg
http://assets.vogue.com/photos/569d382b7dcd2a8a57748d05/master/w_195/_FEN0678.jpg
http://assets.vogue.com/photos/569d381334324c316bd70f2f/master/w_195/_FEN0690.jpg
http://assets.vogue.com/photos/569d382dd928983d20a78eb1/master/w_195/_FEN0846.jpg
This allows the full rendering of the page to take place inside the browser, and the resulting HTML can then be obtained.
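If opening a visible Firefox window is not wanted, the same idea works headless; a small sketch, assuming Selenium 4 and a reasonably recent Firefox:
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
options = Options()
options.add_argument("-headless")  # run Firefox without opening a window
browser = webdriver.Firefox(options=options)
browser.get("http://www.vogue.com/fashion-shows/fall-2016-menswear/fendi#collection")
html_source = browser.page_source
browser.quit()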