I have to scrape a web page using BeautifulSoup in Python. I need to extract the complete div that holds the relevant information, which looks like this:
<div data-v-24a74549="" class="row row-mg-mod term-row">
I wrote soup.find('div',{'class':'row row-mg-mod term-row'}).
But it returns nothing. I guess it has something to do with the data-v attribute.
Can someone tell me the exact syntax for scraping this type of data?
Give this a try:
from bs4 import BeautifulSoup
content = """
<div data-v-24a74549="" class="row row-mg-mod term-row">"""
soup = BeautifulSoup(content,'html.parser')
for div in soup.find_all("div", {"class": "row"}):
    print(div)
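The data-v-24a74549 attribute does not take part in class matching, so it can be ignored when selecting by class. As a minimal sketch (the closing tag and the inner text "info" are made up here so the snippet is self-contained), a CSS selector can require all three classes at once, regardless of the order in which they appear in the attribute:

```python
from bs4 import BeautifulSoup

# The closing tag and the "info" text are assumed for illustration.
content = '<div data-v-24a74549="" class="row row-mg-mod term-row">info</div>'
soup = BeautifulSoup(content, "html.parser")

# Each ".name" must be present in the tag's class list; order does not matter.
div = soup.select_one("div.row.row-mg-mod.term-row")
matched_text = div.get_text() if div is not None else None
print(matched_text)
```

If a selector like this matches a saved copy of the page but not the downloaded HTML, the div is probably not in the response at all, for example because it is rendered by JavaScript.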
This is an example of the kind of HTML block I'm targeting with BeautifulSoup:
<div class="fighter_list left">
<meta itemprop="image" content="/image_crop/44/44/_images/fighter/1406924569376_20140801011731_Picture17.JPG">
<img class="lazy" src="/image_crop/44/44/_images/fighter/1406924569376_20140801011731_Picture17.JPG" data-original="/image_crop/44/44/_images/fighter/1406924569376_20140801011731_Picture17.JPG" alt="Jason DeLucia" title="Jason DeLucia" />
<div class="fighter_result_data">
<a itemprop="url" href="/fighter/Jason-DeLucia-22"><span itemprop="name">Jason<br />DeLucia</span></a><br>
This is one of multiple blocks like this for each "fighter_list left" on the page.
I want to get all of the itemprop="url" href links that are in the "fighter_list left" class (i.e. /fighter/Jason-DeLucia-22)
When I try the below code I get nothing.
for link in html.find_all('a', class_="fighter_List left", itemprop="url"):
    print(link.get('href'))
The closest I can get is getting every itemprop=url link on the page when I omit the class_= part.
But I only want the ones under the fighter_list left class.
This is the website https://www.sherdog.com/events/UFC-1-The-Beginning-7
You can use a CSS selector for this:
import requests
from bs4 import BeautifulSoup
url = "https://www.sherdog.com/events/UFC-1-The-Beginning-7"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
for link in soup.select('.fighter_list.left [itemprop="url"]'):
    print(link["href"])
Prints:
/fighter/Jason-DeLucia-22
/fighter/Royce-Gracie-19
/fighter/Gerard-Gordeau-15
/fighter/Ken-Shamrock-4
/fighter/Royce-Gracie-19
/fighter/Kevin-Rosier-17
/fighter/Gerard-Gordeau-15
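As for why the original find_all call found nothing: class_ and itemprop were both applied to the <a> tag itself, but the "fighter_list left" classes sit on an enclosing div (the class was also typed as fighter_List, and BeautifulSoup compares class names as exact strings). The selector instead reads as "any element with itemprop="url" that is a descendant of an element carrying both classes". Here is an offline sketch of that idea; the second block with class "fighter_list right" and its link are invented for contrast and are not taken from the page:

```python
from bs4 import BeautifulSoup

# The "right" block below is a made-up contrast case.
html = """
<div class="fighter_list left">
  <div class="fighter_result_data">
    <a itemprop="url" href="/fighter/Jason-DeLucia-22"><span itemprop="name">Jason<br />DeLucia</span></a>
  </div>
</div>
<div class="fighter_list right">
  <div class="fighter_result_data">
    <a itemprop="url" href="/fighter/Example-Fighter-0"><span itemprop="name">Example<br />Fighter</span></a>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# ".fighter_list.left" = one element with both classes;
# the space = "any descendant of it"; [itemprop="url"] = attribute match.
left_links = [a["href"] for a in soup.select('.fighter_list.left [itemprop="url"]')]
print(left_links)
```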
I want to scrape 2015 from the HTML below:
I use the code below, but I am only able to scrape "Annee":
soup.find('span', {'class':'optionLabel'}).get_text()
Can someone please help?
I am a new learner.
Simply find the next span, which holds the text you want to scrape:
soup.find('span', {'class':'optionLabel'}).find_next('span').get_text()
or use a CSS selector with the adjacent sibling combinator:
soup.select_one('span.optionLabel + span').get_text()
Example
html='''
<span class="optionLabel"><button>Année</button></span> :
<span>2015</span>'''
from bs4 import BeautifulSoup
# name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, 'html.parser')
print(soup.find('span', {'class':'optionLabel'}).find_next('span').get_text())
Output
2015
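For comparison, a small sketch of the CSS-selector variant on the same markup, with a guard for the case where nothing matches (select_one then returns None, and calling get_text() on it would raise AttributeError):

```python
from bs4 import BeautifulSoup

html = '''
<span class="optionLabel"><button>Année</button></span> :
<span>2015</span>'''
soup = BeautifulSoup(html, "html.parser")

# "A + B" selects B only when it is the element immediately following A
# under the same parent; the " :" text between the spans is not an element.
value_span = soup.select_one("span.optionLabel + span")
year = value_span.get_text(strip=True) if value_span is not None else None
print(year)
```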
I want to build a program that automatically gets the live price of the German stock index (DAX). For this I use a website whose price provider is FXCM.
In my code I use the beautifulsoup and requests packages. The div box where the current value is stored looks like this:
<div class="left" data-item="quoteContainer" data-bg_quotepush="133962:74:bid">
<div class="wrapper cf">
<div class="left">
<span class="quote quote_standard" data-bg_quotepush="quote" data-bg_quotepush_i="133962:74:bid" data-bg_quotepush_f="quote" data-bg_quotepush_c="40">13.599,24</span>
<span class="label" data-bg_quotepush="time" data-bg_quotepush_i="133962:74:bid" data-bg_quotepush_f="time" data-bg_quotepush_c="41">25.12.2020</span>
<span class="label"> • </span>
<span class="label" data-item="currency"></span>
</div>
<div class="right">
<span class="percent up" data-bg_quotepush="percent" data-bg_quotepush_i="133962:74:bid" data-bg_quotepush_f="percent" data-bg_quotepush_c="42">+0,00<span>%</span></span>
<span class="label up" data-bg_quotepush="change" data-bg_quotepush_i="133962:74:bid" data-bg_quotepush_f="change" data-bg_quotepush_c="43">0,00</span>
</div>
</div>
</div>
The value I want is the text of the span with data-bg_quotepush_c="40", which is 13.599,24.
My Python code looks like this:
import requests as rq
from bs4 import BeautifulSoup as bs
url = "https://news.guidants.com/#Ticker/Profil/?i=133962&e=74"
response = rq.get(url)
soup = bs(response.text, "lxml")
price = soup.find_all("div", {"class":"left"})[0].find("span")
print(price["data-bg_quotepush_c"])
It returns the following error:
File "C:\Users\Felix\anaconda3\lib\site-packages\bs4\element.py", line 1406, in __getitem__
return self.attrs[key]
KeyError: 'data-bg_quotepush_c'
Use Selenium instead of requests when working with dynamically generated content
What is going on?
Requesting the website with requests only returns the initial content, which does not contain the dynamically generated information, so you cannot find what you are looking for.
To wait until the website has loaded completely, use Selenium with sleep() as a simple method, or Selenium waits as a more advanced one.
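A minimal offline sketch of that point. The HTML string below is made up to imitate a page whose quote is filled in by JavaScript (the id "quote" and the script are assumptions, not the real site's markup). BeautifulSoup only parses the text it is given and never runs the script, so the span stays empty:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for what requests.get(url).text might return:
# the span is empty until a browser runs the script.
initial_html = """
<div class="left">
  <span id="quote"></span>
  <script>document.getElementById("quote").textContent = "13.599,24";</script>
</div>
"""
soup = BeautifulSoup(initial_html, "html.parser")

quote_text = soup.find("span", id="quote").get_text()
script_source = soup.find("script").string
print(repr(quote_text))  # the value only exists inside the unexecuted script
```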
Avoiding the error
Use price.text to get the text of the element that looks like this:
<span class="quote quote_standard" data-bg_quotepush="quote" data-bg_quotepush_c="40" data-bg_quotepush_f="quote" data-bg_quotepush_i="133962:74:bid">13.599,24</span>
Example
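from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup
url = "https://news.guidants.com/#Ticker/Profil/?i=133962&e=74"
# Selenium 4 takes the driver path via a Service object (executable_path= was the Selenium 3 form)
driver = webdriver.Chrome(service=Service(r'C:\Program Files\ChromeDriver\chromedriver.exe'))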
driver.get(url)
driver.implicitly_wait(3)
soup = BeautifulSoup(driver.page_source,"html5lib")
price = soup.find_all("div", {"class":"left"})[0].find("span")
print(price.text)
driver.close()
Output
13.599,24
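On the KeyError itself: tag["name"] raises KeyError when the tag has no such attribute, which is what happens if the first span found is not the quote span. The number the question asks for is also the element's text, not an attribute value. A short sketch on the span from the question (the attribute selector and the float conversion are illustrative additions):

```python
from bs4 import BeautifulSoup

html = ('<span class="quote quote_standard" data-bg_quotepush="quote" '
        'data-bg_quotepush_i="133962:74:bid" data-bg_quotepush_f="quote" '
        'data-bg_quotepush_c="40">13.599,24</span>')
soup = BeautifulSoup(html, "html.parser")

# Select by the attribute value instead of relying on element order.
span = soup.select_one('span[data-bg_quotepush_c="40"]')

attr_value = span["data-bg_quotepush_c"]  # the attribute value: "40"
missing = span.get("data-item")           # .get() returns None instead of raising
price_text = span.get_text()              # the visible text

# German number format: "." groups thousands, "," is the decimal mark.
price = float(price_text.replace(".", "").replace(",", "."))
print(attr_value, missing, price_text, price)
```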
If you are scraping the value of a div class, try this example:
from selenium import webdriver
from bs4 import BeautifulSoup
# start the browser; pass the path to your driver here if it is not found automatically
driver = webdriver.Chrome()
# URL to load
url = 'https://news.guidants.com/#Ticker/Profil/?i=133962&e=74'
driver.get(url)
# parse the rendered page
soup = BeautifulSoup(driver.page_source,"html5lib")
# collect every div with class "left"
prices = soup.find_all("div", attrs={"class":"left"})
for price in prices:
    total_price = price.find('span')
    # some divs may not contain a span at all
    if total_price is not None:
        print(total_price.text)
# close the driver
driver.close()
If you are using the requests module, try a different parser.
You can install one with pip, for example html5lib:
pip install html5lib
thanks
I'm trying to get links to group members:
response.css('.text--ellipsisOneLine::attr(href)').getall()
Why isn't this working?
html:
<div class="flex flex--row flex--noGutters flex--alignCenter">
<div class="flex-item _memberItem-module_name__BSx8i">
<a href="/ru-RU/Connect-IT-Meetup-in-Chisinau/members/280162178/profile/?returnPage=1">
<h4 class="text--bold text--ellipsisOneLine">Liviu Cernei</h4>
</a>
</div>
</div>
Your selector isn't working because you are looking for an attribute (href) that this element doesn't have.
response.css('.text--ellipsisOneLine::attr(href)').getall()
This selector searches for href on elements with the class text--ellipsisOneLine. In your HTML snippet that class matches only this element:
<h4 class="text--bold text--ellipsisOneLine">Liviu Cernei</h4>
As you can see, it has no href attribute. If you want the text inside this h4 element, use the ::text pseudo-element:
response.css('.text--ellipsisOneLine::text').getall()
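If the goal is the link rather than the name, the href sits on the <a> that wraps the h4, so the lookup has to step up to that parent. The sketch below shows the idea with BeautifulSoup rather than Scrapy (a deliberate swap, so that it runs on a plain string without a Scrapy response), using the HTML from the question:

```python
from bs4 import BeautifulSoup

html = """
<div class="flex flex--row flex--noGutters flex--alignCenter">
  <div class="flex-item _memberItem-module_name__BSx8i">
    <a href="/ru-RU/Connect-IT-Meetup-in-Chisinau/members/280162178/profile/?returnPage=1">
      <h4 class="text--bold text--ellipsisOneLine">Liviu Cernei</h4>
    </a>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

members = []
for h4 in soup.select("h4.text--ellipsisOneLine"):
    link = h4.find_parent("a")  # nearest enclosing <a>, or None
    if link is not None and link.has_attr("href"):
        members.append((h4.get_text(strip=True), link["href"]))
print(members)
```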
I realize this isn't Scrapy, but for web scraping I personally use the requests module and BeautifulSoup4. The following snippet prints the list of users with those modules:
import requests
from bs4 import BeautifulSoup
response = requests.get('https://www.meetup.com/ru-RU/Connect-IT-Meetup-in-Chisinau/members/')
if response.status_code == 200:
    html_doc = response.text
    html_source = BeautifulSoup(html_doc, 'html.parser')
    users = html_source.find_all('h4')
    for user in users:
        print(user.text)
css:
response.css('.member-item .flex--alignCenter a::attr(href)').getall()
I'm working on a personal project where I scrape data from a website. I'm trying to use Beautiful Soup for this, but I came across elements that share the same class and differ only in an attribute. For example:
<div class="pi--secondary-price">
<span class="pi--price">$11.99 /<abbr title="Kilogram">kg</abbr></span>
<span class="pi--price">$5.44 /<abbr title="Pound">lb.</abbr></span>
</div>
How do I just get $11.99/kg? Right now I'm getting
$11.99 /kg
$5.44 /lb.
I've done x.select('.pi--secondary-price') but it returns both prices. How do I only get 1 price ($11.99 /kg)?
You could first get the <abbr> tag and then search for the respective parent tag. Like this:
from bs4 import BeautifulSoup
html = '''
<div class="pi--secondary-price">
<span class="pi--price">$11.99 /<abbr title="Kilogram">kg</abbr></span>
<span class="pi--price">$5.44 /<abbr title="Pound">lb.</abbr></span>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
kg = soup.find(title="Kilogram")
print(kg.parent.text)
This gives you the desired output $11.99 /kg. For more information, see the BeautifulSoup documentation.
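As an alternative sketch, the same element can be picked in one step with a CSS selector, assuming the installed BeautifulSoup ships with Soup Sieve and therefore supports the :has() pseudo-class. The replace() call at the end is an illustrative extra that drops the space to get the $11.99/kg form written in the question:

```python
from bs4 import BeautifulSoup

html = '''
<div class="pi--secondary-price">
<span class="pi--price">$11.99 /<abbr title="Kilogram">kg</abbr></span>
<span class="pi--price">$5.44 /<abbr title="Pound">lb.</abbr></span>
</div>
'''
soup = BeautifulSoup(html, "html.parser")

# Pick the price span that contains an <abbr> titled "Kilogram".
kg_span = soup.select_one('span.pi--price:has(abbr[title="Kilogram"])')
kg_price = kg_span.get_text()
compact = kg_price.replace(" /", "/")
print(kg_price)
print(compact)
```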