same CSS, different outcome in browser and bs4 .select() method - python

I'm trying to retrieve some info from the following web page:
https://web.archive.org/web/19990421025223/http://www.rbc.ru
I constructed a selector which does highlight the desired table in Chrome's Inspection mode:
selector = 'body > table:nth-of-type(2) > tbody:nth-of-type(1)>tr:nth-of-type(1)>td:nth-of-type(5)>table:nth-of-type(1)>tbody:nth-of-type(1)'
however when running a script with bs4 .select() method:
import requests
from bs4 import BeautifulSoup
import lxml
url = 'https://web.archive.org/web/19990421025223/http://www.rbc.ru'
headers = {
'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'
}
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, 'lxml')
selector = 'body > table:nth-of-type(2) > tbody:nth-of-type(1)>tr:nth-of-type(1)>td:nth-of-type(5)>table:nth-of-type(1)>tbody:nth-of-type(1)'
print(soup.select(selector=selector))
the output is [], which is not what I expected, since the same selector matches the table's HTML in the browser.
What am I missing here?

You cannot expect browser-generated selectors to work reliably in BeautifulSoup. The browser changes the markup while it builds and renders the page, whereas your Python code does no rendering and only receives the initial HTML as the server sent it.
Here, you have to come up with your own CSS selector or another way to locate the table element.
As the markup of the page is not really HTML-parsing-friendly, I'd locate the table element by one of its column names:
table = soup.find("b", text="спрос").find_parent("table")
Note that it only worked for me when I parsed the page with a lenient html5lib parser:
soup = BeautifulSoup(response.content, "html5lib")
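A minimal sketch of the same lookup on a small inline document. The HTML and the id values are made up for illustration, and it assumes the cell text is exactly "спрос". It uses the string= argument, which recent Beautiful Soup versions prefer over text=, and the built-in html.parser so it runs without extra packages.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the archived page (nested layout tables).
html = """
<table id="layout"><tr><td>
  <table id="quotes">
    <tr><td><b>спрос</b></td><td><b>предложение</b></td></tr>
    <tr><td>24.10</td><td>24.35</td></tr>
  </table>
</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# Find the <b> tag whose only content is the column name,
# then climb to the nearest enclosing <table>.
header = soup.find("b", string="спрос")
table = header.find_parent("table") if header else None

# Collect the cell texts row by row.
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all("td")]
    for tr in table.find_all("tr")
]
print(table["id"], rows)
```

Anchoring on visible text like this tends to survive layout changes better than a long positional selector, but it breaks if the label itself changes.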

Since JavaScript can render the page at run time quite differently from its source, bs4 on its own is not a good fit for websites that change dynamically.
I would recommend using Selenium, as it actually opens the website and lets you wait until a certain element has been rendered before you search for it. There are also headless browser libraries that emulate the browser environment silently if you don't want to see a browser window pop up.
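A rough sketch of what waiting for an element could look like with Selenium's explicit waits. The function name fetch_rendered_html and its parameters are hypothetical, and it assumes Chrome plus a matching driver are available on the machine. The Selenium imports sit inside the function so that merely defining it does not require Selenium.

```python
def fetch_rendered_html(url, css_selector, timeout=10):
    """Load url in headless Chrome and return the HTML once css_selector is present."""
    # Imported lazily so this module can be loaded without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")  # no visible browser window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until the element exists in the DOM; raises TimeoutException otherwise.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        return driver.page_source
    finally:
        driver.quit()  # always release the browser process
```

The returned string can then be handed to BeautifulSoup as usual, for example BeautifulSoup(fetch_rendered_html(url, "table"), "html.parser").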

You have two problems in your code. First, if you want to use a CSS selector in BeautifulSoup, the combinators +, > and ~ need to be separated by spaces (see here if you want to patch bs4).
Second, as in my previous answer to your question, there is no tbody in the page source; it is generated by the browser.
Here is the fixed CSS selector:
selector = 'body > table:nth-of-type(2) > tr:nth-of-type(1) > td:nth-of-type(5) > table:nth-of-type(1)'
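The tbody point can be reproduced on a tiny inline document. This is only a sketch with made-up markup, parsed with the built-in html.parser, which keeps the tree as written; parsers that follow the browser's parsing rules more closely, such as html5lib, may add the tbody themselves.

```python
from bs4 import BeautifulSoup

# Made-up source markup: there is no <tbody>, just as in the downloaded page.
html = "<html><body><table><tr><td>a</td><td>b</td></tr></table></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Selector as copied from the browser, which sees an implicit <tbody>.
with_tbody = soup.select("body > table > tbody > tr > td:nth-of-type(2)")
# Same path with the <tbody> step removed, matching the raw source.
without_tbody = soup.select("body > table > tr > td:nth-of-type(2)")

print(with_tbody)
print([td.get_text() for td in without_tbody])
```

The first selector finds nothing because no tbody exists in the parsed tree, while the second one reaches the second cell.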

Related

The copied CSS selector from the browser returns a different result using BeautifulSoup4 in Python

Usually when I want to scrape a particular text from a website, I right click the text and select inspect. Then in the HTML code, I look for the text I am interested in and right-click -> 'copy' -> 'copy selector'.
Then I paste that string of text I just copied within soup.select('enter copied text here') and save it to a variable. I can then perform text stripping functions to get the key text I need.
Now for the situation I am working with, I want to get the total number of cars shown on this webpage in the header h1: cars.com/cars/used/.
This is my code:
import requests
from bs4 import BeautifulSoup as bs
url = "https://www.cars.com/used"
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.41 Safari/537.36'}
res = requests.get(url,headers = headers)
res.raise_for_status()
soup = bs(res.text, 'html.parser')
total_cars_element = soup.select('body > div.listing > div.container.listing-container.has-header-sticky > div.row.flex-nowrap.no-gutters > div:nth-child(1) > div:nth-child(1) > div')
print(total_cars_element)
# the above prints an empty list.
I really just want to know why this is not working. I understand there are other workarounds, but I want to stick with the soup.select method.
Any insights are much appreciated!
Thanks!
The issue stems from the fact that the HTML fetched via Python is not the same as the one that gets generated in your browser. Try printing soup and see for yourself.
One particular tag, which is part of your query, is troublesome. In the browser, it looks like this:
<div class="container listing-container has-header-sticky">
but your Python code sees this instead:
<div class="container listing-container">
Change your selector to:
body > div.listing > div.container.listing-container > div.row.flex-nowrap.no-gutters > div:nth-child(1) > div:nth-child(1) > div
and you'll get the expected result.
This behaviour is considered normal since the page you're trying to scrape is dynamic. That means that JavaScript adds or removes certain parts of the original HTML page after the page loads.
If you want to scrape a dynamic web page using Python, you'll need something more than just Beautiful Soup. See https://scrapingant.com/blog/scrape-dynamic-website-with-python for more info on that subject.
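The effect of the missing class can be shown on a small inline document. The markup below is a made-up stand-in for the static HTML, reduced to the relevant nesting, and the h1 text is invented.

```python
from bs4 import BeautifulSoup

# Hypothetical static HTML as the server might send it:
# the "has-header-sticky" class has not been added yet.
html = """
<body>
  <div class="listing">
    <div class="container listing-container">
      <h1 class="title">Used cars for sale</h1>
    </div>
  </div>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

# Selector copied from the browser, which includes the script-added class.
browser_selector = "div.listing > div.container.listing-container.has-header-sticky > h1"
# Same selector restricted to classes present in the static HTML.
static_selector = "div.listing > div.container.listing-container > h1"

browser_matches = soup.select(browser_selector)
static_match = soup.select_one(static_selector)

print(len(browser_matches))
print(static_match.get_text(strip=True))
```

A compound class selector only matches when every listed class is present, so a single class added later by JavaScript is enough to make the copied selector return an empty list.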
In addition to Janez Kuhar's nice answer, you could also use:
total_cars_element = soup.select('h1.title')
print(total_cars_element[0].text)
more about CSS Select

How can I find out correct div, class, span when scraping a html page

I am new to web scraping. I tried to implement it after reading various web tutorials like this and this. Those articles are about scraping Amazon and Netflix, and there are lots of other tutorials on IMDb, Rotten Tomatoes and others. They give me an overview of which attributes to look for, such as class attributes and div tags. Different websites structure those tags differently, but the tags are the fundamental elements of web scraping. When I follow those tutorials I can implement the code, but when I try to parse a website other than the ones they cover, I fail. Recently, I tried the code block below on Priceline, but I got lost in the amount of HTML.
My code for Priceline:
import requests
from bs4 import BeautifulSoup

headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36", "Accept-Encoding":"gzip, deflate", "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "DNT":"1","Connection":"close", "Upgrade-Insecure-Requests":"1"}
url= 'https://www.priceline.com/relax/in/3000005381/from/20210301/to/20210319/rooms/1?vrid=8848a774a531423bde3ed4ff3486f8bb'
r = requests.get(url, headers=headers)#, proxies=proxies)
content = r.content
soup = BeautifulSoup(content)
name=[]
hotel_div = soup.find_all('div', class_='Box-sc-8h3cds-0.Flex-sc-1ydst80-0.iNmVhl')
for container in hotel_div:
    name = container.find('span', attrs={'class':'Box-sc-8h3cds-0 Flex-sc-1ydst80-0 BadgeRow__BadgeContainer-fofgl-0 kmpPcP SummaryHeader__BadgeRowWithMB-m5g1dm-0 dQyPUf SummaryHeader__BadgeRowWithMB-m5g1dm-0 dQyPUf'})
    row={}
    if name is not None:
        # the hotel name is stored in the alt text of the badge image
        n = name.find_all('img', alt=True)
        #print(n[0]['alt'])
        row['Name'] = n[0]['alt']
    else:
        row['Name'] = "unknown-product"
print(name)
It returns an empty array.
Can anyone suggest a tutorial or blog that would help me identify the correct HTML tags for any website?
Thank you for the help
Each web developer will choose to name their classes and tags differently.
To check how a new site is structured you can right click on what you want to scrape and then click on inspect and a tab should appear where you can find the tag, class name, etc
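One detail worth checking once you have the class names from the inspector is how they are passed to Beautiful Soup. The sketch below uses made-up markup with shortened stand-ins for the generated class names. It compares the dotted form given to class_ with a plain class name and with a CSS selector.

```python
from bs4 import BeautifulSoup

# Made-up markup; the class names are shortened stand-ins for generated ones.
html = '<div class="Box Flex iNmVhl"><span class="name">Hotel A</span></div>'
soup = BeautifulSoup(html, "html.parser")

# class_ takes a class name, not a CSS selector, so the dots are taken literally.
dotted = soup.find_all("div", class_="Box.Flex.iNmVhl")
# A single class name matches any element that carries that class.
single = soup.find_all("div", class_="Box")
# select() does understand CSS syntax: all three classes must be present.
selected = soup.select("div.Box.Flex.iNmVhl")

print(len(dotted), len(single), len(selected))
```

In this sketch the dotted class_ lookup finds nothing, while the other two find the div. If a lookup still comes back empty with the right syntax, the element is probably absent from the HTML that requests received.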
(UPDATED) Now it works:
from bs4 import BeautifulSoup
from selenium import webdriver
url = 'https://www.priceline.com/relax/in/3000005381/from/20210301/to/20210319/rooms/1?vrid=04bab06455d612983ec0c76e621d7c48'
# let a real browser load the page so that the JavaScript runs
driver = webdriver.Chrome()
driver.get(url)
html = driver.page_source
soup = BeautifulSoup(html, "lxml")
container = soup.find('a',{'class':'Link-sc-16qjtx7-0 TitleLink__TitleLinkText-vs18lp-0 jtrNVn'}).text
print(container)

Is there a way to print this unshowed tag text? [duplicate]

I'm trying to scrape a webpage's inventory counts, but they don't show up in the output of my Python script.
Here's the original tag as it appears in the browser, with the text I want to scrape:
<span class="currentInv">251</span>
" in stock"
and this is the tag after parsing it with BeautifulSoup and the lxml parser (I even tried other parsers like html.parser and html5lib):
<span class="currentInv"></span>
Here's my full Python script:
import requests
from bs4 import BeautifulSoup as bs
url = f'https://www.hancocks.co.uk/buy-wholesale-sweets?warehouse=1983&p=1'
parser = 'lxml'
headers = {'User-Agent' : 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'}
response = requests.get(url, headers=headers)
data = response.text
soup = bs(data, parser)
print(soup.find('span', class_ = 'currentInv').text)
The output is empty
I tried many times over and over, but nothing seems to work well for me
Any help would be so much appreciated.
If you view the source of the page (view-source:https://www.hancocks.co.uk/buy-wholesale-sweets?warehouse=1983&p=1), you'll see that the server-rendered HTML sent to the browser also contains no value in that span tag.
The value 251 is likely getting added client-side after the DOM is loaded via JavaScript.
I'd go through this answer Web-scraping JavaScript page with Python for more ways to try and extract that JavaScript value.
Most likely the page you see in your browser contains dynamic content. This means that when you inspect the page, you see the final result after some JavaScript code ran and manipulated the DOM that is rendered in the browser. When you load the same page in Python code using Beautiful Soup, you get the raw HTML that comes from the request. The JavaScript code for the dynamic content isn't executed, so you will not see the same results.
One solution is to use Selenium instead of Beautiful Soup. Selenium will load a page in a browser and provides an API to interact with that page.
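Before switching tools, it can help to confirm that the value really is missing from the raw response. A minimal sketch follows; the helper name static_text is hypothetical, and the sample string is a made-up stand-in for the response body.

```python
from bs4 import BeautifulSoup


def static_text(html, css_selector):
    """Return the text of the first match in the raw HTML, or None if nothing matches."""
    element = BeautifulSoup(html, "html.parser").select_one(css_selector)
    return None if element is None else element.get_text(strip=True)


# Made-up response body mirroring the empty span from the question.
raw_html = '<p><span class="currentInv"></span> in stock</p>'

text = static_text(raw_html, "span.currentInv")
missing = static_text(raw_html, "span.doesNotExist")
print(repr(text), repr(missing))
```

An empty string means the element exists but is filled in later, which points to JavaScript. None means the selector itself does not match the static HTML.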

Is there a way to extract CSS from a webpage using BeautifulSoup?

I am working on a project which requires me to view a webpage, but to use the HTML further, I have to see it fully and not as a bunch of lines mixed in with pictures. Is there a way to parse the CSS along with the HTML using BeautifulSoup?
Here is my code:
import requests
from bs4 import BeautifulSoup
def get_html(url, name):
    # download the page and decode it as UTF-8
    r = requests.get(url)
    r.encoding = 'utf8'
    return r.text
link = 'https://www.labirint.ru/books/255282/'
with open('labirint.html', 'w', encoding='utf-8') as file:
    file.write(get_html(link, '255282'))
WARNING: The page: https://www.labirint.ru/books/255282/ has a redirect to https://www.labirint.ru/books/733371/.
If your goal is to truly parse the css:
There are some various methods here: Prev Question w/ Answers
I also have used a nice example from this site: Python Code Article
Beautiful soup will pull the entire page - and it does include the header, styles, scripts, linked in css and js, etc. I have used the method in the pythonCodeArticle before and retested it for the link you provided.
import requests
from bs4 import BeautifulSoup as bs
from urllib.parse import urljoin
# URL of the web page you want to extract
url = "ENTER YOUR LINK HERE"
# initialize a session & set User-Agent as a regular browser
session = requests.Session()
session.headers["User-Agent"] = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"
# get the HTML content
html = session.get(url).content
# parse HTML using beautiful soup
soup = bs(html, "html.parser")
print(soup)
By looking at the soup output (it is very long, so I will not paste it here), you can see it is a complete page. Just make sure to paste in your specific link.
Now, if you want to parse the result to pick up all CSS URLs, you can add this (I am still using parts of the code from the well-described Python Code article linked above):
# get the CSS files
css_files = []
for css in soup.find_all("link"):
    if css.attrs.get("href"):
        # if the link tag has the 'href' attribute
        css_url = urljoin(url, css.attrs.get("href"))
        css_files.append(css_url)
print(css_files)
The output css_files will be a list of all css files. You can now go visit those separately and see the styles that are being imported.
NOTE: this particular site mixes inline styles with the HTML (i.e. it does not always use CSS files to set the style properties; sometimes the styles are inside the HTML content).
This should get you started.
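Because link tags also cover icons and other resources, and because of the inline styles mentioned in the note, a slightly stricter variant might look like the sketch below. The function name collect_css, the sample markup and the example.com URL are all made up for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def collect_css(html, base_url):
    """Return (stylesheet URLs, <style> block contents, inline style attributes)."""
    soup = BeautifulSoup(html, "html.parser")
    sheets = []
    for link in soup.find_all("link", href=True):
        # Keep only links that declare themselves as stylesheets.
        rel = link.get("rel") or []
        if "stylesheet" in rel:
            sheets.append(urljoin(base_url, link["href"]))
    # CSS embedded directly in the page inside <style> tags.
    blocks = [style.string or "" for style in soup.find_all("style")]
    # CSS written on individual elements through the style attribute.
    inline = [tag["style"] for tag in soup.find_all(style=True)]
    return sheets, blocks, inline


# Made-up page used only to exercise the function.
sample = """
<html><head>
<link rel="stylesheet" href="/css/main.css">
<link rel="icon" href="/favicon.ico">
<style>p { color: red; }</style>
</head><body><p style="margin: 0">text</p></body></html>
"""
sheets, blocks, inline = collect_css(sample, "https://example.com/books/1/")
print(sheets, blocks, inline)
```

To save a page that looks right offline, each URL in sheets would still have to be downloaded separately; Beautiful Soup only locates them.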

Python Beautiful Soup Returns Nonetype

I am trying to develop a program that can grab runes for a specific champion in League of Legends.
And here is my code:
import requests
import re
from bs4 import BeautifulSoup
url = 'https://www.leagueofgraphs.com/zh/champions/builds/darius'
response = requests.get(url).text
soup = BeautifulSoup(response,'lxml')
tables = soup.find('div',class_ = 'img-align-block')
print(tables)
And here is the original HTML File:
<img src="//cdn2.leagueofgraphs.com/img/perks/10.8/64/8010.png" alt="征服者" tooltip="<itemname><img src="//cdn2.leagueofgraphs.com/img/perks/10.8/64/8010.png" width="24" height="24" alt="征服者" /> 征服者</itemname><br/><br/>基礎攻擊或技能在命中敵方英雄時獲得 2 層征服者效果,持續 6 秒,每層效果提供 2-5 適性之力。 最多可以疊加 10 次。遠程英雄每次普攻只會提供 1 層效果。<br><br>在疊滿層數後,你對英雄造成的 15% 傷害會轉化為對自身的回復效果(遠程英雄則為 8%)。" height="36" width="36" class="requireTooltip">
I am not able to access this part, parse it, or find the img src, although I can see it when I browse their website.
How could I fix this issue?
The part you are interested in is not in the HTML. You can double check by searching:
soup.prettify()
Parts of the website are probably loaded with JavaScript, so you could use code that opens a browser and visits the page. For example, you could use Selenium:
from selenium import webdriver
import time
driver = webdriver.Firefox()
driver.get(url)
time.sleep(6) # give the website some time to load
page = driver.page_source
soup = BeautifulSoup(page,'lxml')
tables = soup.find('div', class_='img-align-block')
print(tables)
The website uses JavaScript processing, so you need to use Selenium or another scraping tool that supports JS loading.
Try setting a User-Agent in the headers of your request. Without it, the website sends different content. For example:
import requests
from bs4 import BeautifulSoup
url = 'https://www.leagueofgraphs.com/zh/champions/builds/darius'
h = {"User-Agent" : "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:75.0) Gecko/20100101 Firefox/75.0"}
response = requests.get(url, headers=h).text
soup = BeautifulSoup(response,'html.parser')
images = soup.find_all('img', {"class" : 'mainPicture'})
for img in images:
    print(img['src'])
//cdn2.leagueofgraphs.com/img/perks/10.8/64/8010.png
//cdn2.leagueofgraphs.com/img/perks/10.8/64/8010.png
//cdn2.leagueofgraphs.com/img/perks/10.8/64/8230.png
//cdn2.leagueofgraphs.com/img/perks/10.8/64/8230.png
//cdn2.leagueofgraphs.com/img/perks/10.8/64/8230.png
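The src values above are protocol-relative (they start with //) and contain repeats. If full URLs are needed, for example to download the images, a small post-processing step along these lines could be used. This is a sketch that takes a few of the printed values as input.

```python
from urllib.parse import urljoin

page_url = "https://www.leagueofgraphs.com/zh/champions/builds/darius"
srcs = [
    "//cdn2.leagueofgraphs.com/img/perks/10.8/64/8010.png",
    "//cdn2.leagueofgraphs.com/img/perks/10.8/64/8010.png",
    "//cdn2.leagueofgraphs.com/img/perks/10.8/64/8230.png",
]

# urljoin fills in the scheme from the page URL;
# dict.fromkeys drops repeats while keeping the original order.
absolute = list(dict.fromkeys(urljoin(page_url, src) for src in srcs))
print(absolute)
```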
