Finding a specific span element with BeautifulSoup - python

I am trying to create a script to scrape price data from Udemy courses.
I'm struggling with navigating the HTML tree because the element I'm looking for is located inside multiple nested divs.
Here's the structure of the HTML element I'm trying to access:
What I tried:
response = requests.get(url)
print(response)
doc = BeautifulSoup(response.text, "html.parser")
parent_div = doc.find(class_="sidebar-container--purchase-section--17KRp")
print(parent_div.find_all("span"))
and even:
response = requests.get(url)
print(response)
doc = BeautifulSoup(response.text, "html.parser")
main = doc.find(class_="generic-purchase-section--main-cta-container--3xxeM")
title = main.select_one("div span span span")
Here’s the URL: https://www.udemy.com/course/the-complete-web-development-bootcamp/
I also tried searching all the spans in the HTML, and the specific span I'm looking for doesn't appear. Maybe that's because it's nested inside a div?
would appreciate a little guidance!

The price is loaded by JavaScript, so it cannot be scraped from the static HTML with BeautifulSoup.
The data is loaded from an API Endpoint which takes in the course-id of the course.
Course-id of this course: 1565838
You can directly get the info from that endpoint like this.
import requests
course_id = '1565838'
url= f'https://www.udemy.com/api-2.0/course-landing-components/{course_id}/me/?components=price_text'
response = requests.get(url)
x = response.json()
print(x['price_text']['data']['pricing_result']['price'])
{'amount': 455.0, 'currency': 'INR', 'price_string': '₹455', 'currency_symbol': '₹'}
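If you don't want to hard-code the course id, it can often be pulled out of the course page's own markup. The sketch below is my addition and assumes the id is exposed in a data-clp-course-id attribute, which is an assumption about Udemy's page rather than something stated in the answer above:
import requests
from bs4 import BeautifulSoup

course_url = 'https://www.udemy.com/course/the-complete-web-development-bootcamp/'
response = requests.get(course_url)
soup = BeautifulSoup(response.text, 'html.parser')

# Assumption: some element carries the course id in a 'data-clp-course-id' attribute
tag = soup.find(attrs={'data-clp-course-id': True})
if tag is not None:
    print(tag['data-clp-course-id'])  # expected to print something like '1565838'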

I tried your first approach several times and it works more-or-less for me, although it has returned a different number of span elements on different attempts (10 is the usual number but I have seen as few as 1 on one occasion):
import requests
from bs4 import BeautifulSoup
url = 'https://www.udemy.com/course/the-complete-web-development-bootcamp/'
response = requests.get(url)
doc = BeautifulSoup(response.text, "html.parser")
parent_div = doc.find(class_="sidebar-container--purchase-section--17KRp")
spans = parent_div.find_all("span")
print(len(spans))
for span in spans:
    print(span)
Prints:
10
<span data-checked="checked" data-name="u872-accordion--3" data-type="radio" id="u872-accordion-panel--4" style="display:none"></span>
<span class="purchase-options--option-radio--1zjJ_ udlite-fake-toggle-input udlite-fake-toggle-radio udlite-fake-toggle-radio-small"></span>
<span class="udlite-accordion-panel-title">Subscribe</span>
<span>Try it free for 7 days</span>
<span class="udlite-text-xs purchase-section-container--cta-subscript--349MH">$29.99 per month after trial</span>
<span class="purchase-section-container--line--159eG"></span>
<span data-checked="" data-name="u872-accordion--3" data-type="radio" id="u872-accordion-panel--6" style="display:none"></span>
<span class="purchase-options--option-radio--1zjJ_ udlite-fake-toggle-input udlite-fake-toggle-radio udlite-fake-toggle-radio-small"></span>
<span class="udlite-accordion-panel-title">Buy Course</span>
<span class="money-back">30-Day Money-Back Guarantee</span>
As far as your second approach goes, your main div does not have that many nested span elements, so it is bound to fail. Try just one span element:
import requests
from bs4 import BeautifulSoup
url = 'https://www.udemy.com/course/the-complete-web-development-bootcamp/'
response = requests.get(url)
doc = BeautifulSoup(response.text, "html.parser")
main = doc.find(class_="generic-purchase-section--main-cta-container--3xxeM")
title = main.select_one("div span")
print(title)
Prints:
<span class="money-back">30-Day Money-Back Guarantee</span>

Related

Retrieving td value from tr that has certain other td value

I need to get the links from a td in rows that have a certain other td value.
Below is a tr from the table. I want to get the link from the "Match" div if the "Home team" div has a certain value. There are many rows and I want to find every matching link. I have tried the code below, and every time I only get the first row of the table. Here is the link: https://wp.nif.no/PageTournamentDetailWithMatches.aspx?tournamentId=403373&seasonId=200937&number=all . Note that I translated some of the values to English in the examples below.
homegames = browser.find_elements_by_xpath('//div[@data-title = "Home team"]/a[text()="Cleveland"]//parent::div//parent::td//parent::tr')
for link in homegames:
    print(link.find_element_by_xpath('//td[3]/div/a').get_attribute('href'))
<td><div data-title="Date">23.10.2021</div></td>
<td><div data-title="Tid">16:15</div></td>
<td>div data-title="Matchnr">
2121503051
</div>
</td><td><div data-title="Home team">Cleveland</div></td>
<td><div data-title="Away team">
Ohio Travellers</div></td>
<td><div data-title="Court">F21</div></td><td><div data-title="Result">71 - 64</div></td>
<td><div data-title="Referee">John Doe<br>Will Smith<br></div></td></tr>```
The data is within the HTML source (so there is no need to use Selenium). But regardless of whether you use Selenium, what you can do here is let BeautifulSoup find the specific tags you are after.
Without Selenium it requires a little manipulation to decode the HTML:
import requests
from bs4 import BeautifulSoup
import json
import html

keyword = 'Askim'
url = 'https://wp.nif.no/PageTournamentDetailWithMatches.aspx?tournamentId=403373&seasonId=200937&number=all'

response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# The row data is stored as a JSON string in the value of an <input> next to the table div
jsonStr = soup.find('div', {'class': 'xwp_table_bg'}).find_next('input')['value']
jsonData = json.loads(jsonStr)

links_list = []
for each in jsonData['data']:
    # Each entry holds escaped HTML fragments: join them, unescape, and re-parse with BeautifulSoup
    htmlStr = ''.join(each)
    htmlStr = html.unescape(htmlStr)
    soup = BeautifulSoup(htmlStr, 'html.parser')
    # 'Hjemmelag' = home team, 'Kampnr' = match number
    if soup.find('div', {'data-title': 'Hjemmelag'}, text=keyword):
        link = soup.find('div', {'data-title': 'Kampnr'}).find('a')['href']
        links_list.append(link)

BeautifulSoup parsing issues some div not showing

I'm trying to parse this page: https://www.ldlc.com/fr-be/informatique/pieces-informatique/carte-professionnelle/c4685/
The problem is with this element: https://gyazo.com/e544be64a41a121bdb0c0f71aef50692 .
I want the div that contains the price. If you inspect the page, the HTML code for this part looks like this:
<div class="price">
<div class"price">
"thePrice"
<sup>93</sup>
</div>
</div>
BUT, when using page_soup = soup(my_html_page, 'html.parser') or page_soup = soup(my_html_page, 'lxml') or page_soup = soup(my_html_page, 'html5lib') I only get this as the result for that part:
<div class="price"></div>
And that's it. I've been searching for hours on the internet to figure out why that inner div doesn't get parsed.
Three different parsers, and none seems to get past the fact that the inner child shares the same class name as its parent, if that is even the issue.
Hope this helps you.
from bs4 import BeautifulSoup
import requests
url = 'https://www.ldlc.com/fr-be/informatique/pieces-informatique/carte-professionnelle/c4685/'
html = BeautifulSoup(requests.get(url).content, 'html.parser')
prices = html.find_all("div", {"class": "price"})
for price in prices:
    print(price.text)
Prints:
561€95
169€94
165€95
1 165€94
7 599€95
267€95
259€94
599€95
511€94
1 042€94
2 572€94
783€95
2 479€94
2 699€95
499€94
386€95
169€94
2 343€95
783€95
499€94
499€94
259€94
267€95
165€95
169€94
2 399€95
561€95
2 699€95
2 699€95
6 059€95
7 589€95
10 991€95
9 619€94
2 479€94
3 135€95
7 589€95
511€94
1 042€94
386€95
599€95
1 165€94
2 572€94
783€95
2 479€94
2 699€95
499€94
169€94
2 343€95
2 699€95
3 135€95
6 816€95
7 589€95
561€95
267€95
To scrape all prices where class="price", see this example:
import requests
from bs4 import BeautifulSoup
url = 'https://www.ldlc.com/fr-be/informatique/pieces-informatique/carte-professionnelle/c4685/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
# Select all the 'price' classes
for tag in soup.select('div.price'):
    print(tag.text)
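If you need numeric values rather than the display strings above, the printed text (e.g. 1 165€94) can be normalised with a little string handling. This is a minimal sketch of my own, assuming the € sign separates euros and cents and that spaces are thousands separators:
def parse_price(text):
    # "1 165€94" -> 1165.94; assumes '€' separates euros and cents,
    # and spaces / non-breaking spaces are thousands separators
    euros, _, cents = text.strip().partition('€')
    euros = euros.replace('\xa0', '').replace(' ', '')
    return float(f"{euros}.{cents or '00'}")

print(parse_price('1 165€94'))  # 1165.94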

python nested Tags (beautiful Soup)

I used Beautiful Soup with Python to get data from a specific website,
but I don't know how to get just one of these prices; I want the price in grams (g).
As shown below, this is the HTML code:
<div class="promoPrice margBottom7">16,000
L.L./200g<br/><span class="kiloPrice">79,999
L.L./Kg</span></div>
I use this code:
p_price = product.findAll("div", {"class": "promoPrice margBottom7"})[0].text
my result was:
16,000 L.L./200g 79,999 L.L./Kg
but I want to have:
16,000 L.L./200g
only
You will need to first decompose the span inside the div element:
from bs4 import BeautifulSoup
h = """
<div class="promoPrice margBottom7">16,000 L.L./200g<br/>
<span class="kiloPrice">79,999 L.L./Kg</span></div>
"""
soup = BeautifulSoup(h, "html.parser")
element = soup.find("div", {'class': 'promoPrice'})
element.span.decompose()
print(element.text)
#16,000 L.L./200g
Try using soup.select_one('div.promoPrice').contents[0]
from bs4 import BeautifulSoup
html = """<div class="promoPrice margBottom7">16,000 L.L./200g<br/>
<span class="kiloPrice">79,999 L.L./Kg</span></div>"""
soup = BeautifulSoup(html, features='html.parser')
# value = soup.select('div.promoPrice > span') # for 79,999 L.L./Kg
value = soup.select_one('div.promoPrice').contents[0]
print(value)
Prints
16,000 L.L./200g
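Another option (my own sketch, not from either answer above) is to grab the div's first text node directly with find(string=True):
from bs4 import BeautifulSoup

html = """<div class="promoPrice margBottom7">16,000 L.L./200g<br/>
<span class="kiloPrice">79,999 L.L./Kg</span></div>"""

soup = BeautifulSoup(html, 'html.parser')
div = soup.find('div', class_='promoPrice')

# The first string inside the div is the per-200g price; strip() removes surrounding whitespace
print(div.find(string=True).strip())  # 16,000 L.L./200g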

What is the proper syntax for .find() in bs4?

I am trying to scrape the bitcoin price off of coinbase and cannot find the proper syntax. When I run the program (without the line with question marks) I get the block of html that I need, but I don't know how to narrow down and retrieve the price itself. Any help appreciated, thanks.
import requests
from bs4 import BeautifulSoup
url = 'https://www.coinbase.com/charts'
data = requests.get(url)
nicedata = data.text
soup = BeautifulSoup(nicedata, 'html.parser')
prettysoup = soup.prettify()
bitcoin = soup.find('h4', {'class': 'Header__StyledHeader-sc-1q6y56a-0 hZxUBM TextElement__Spacer-sc-18l8wi5-0 hpeTzd'})
price = bitcoin.find('???')
print(price)
The attached image contains the html
To get the text from the item:
price = bitcoin.text
But this page has many <h4> items with this class, and find() gets only the first one, which has the text Bitcoin, not the price from your image. You may need find_all() to get a list with all the items; then you can use an index [index] or a slice [start:end] to pick some of them, or use a for-loop to work with every item in the list.
import requests
from bs4 import BeautifulSoup
url = 'https://www.coinbase.com/charts'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
all_h4 = soup.find_all('h4', {'class': 'Header__StyledHeader-sc-1q6y56a-0 hZxUBM TextElement__Spacer-sc-18l8wi5-0 hpeTzd'})
for h4 in all_h4:
    print(h4.text)
It can be easier to work with the data if you keep it in a list of lists, an array, or a DataFrame. To create a list of lists, it is easier to find the rows <tr> first and then search for <h4> inside every row:
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}  # define request headers (e.g. a browser-like User-Agent)

url = 'https://www.coinbase.com/charts'
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')

all_tr = soup.find_all('tr')

data = []
for tr in all_tr:
    row = []
    for h4 in tr.find_all('h4'):
        row.append(h4.text)
    if row:  # skip empty rows
        data.append(row)

for row in data:
    print(row)
This way you don't need the class to get all the <h4> elements.
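If you want to keep that list of lists as a DataFrame (mentioned above), here is a minimal sketch with pandas; it is my addition, not part of the original answer, and it reuses the data list built in the snippet above:
import pandas as pd

# rows of unequal length are padded with NaN; column names are just integer defaults
df = pd.DataFrame(data)
print(df.head())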
BTW: This page uses JavaScript to append new rows when you scroll, but requests and BeautifulSoup can't run JavaScript, so if you need all the rows you may need Selenium to control a web browser, which does run JavaScript.
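A minimal sketch of that Selenium hand-off (my addition, assuming Chrome and a matching chromedriver are installed): render the page in a real browser, then pass the resulting HTML to BeautifulSoup as before.
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()  # assumes chromedriver is available on PATH
driver.get('https://www.coinbase.com/charts')

# page_source holds the HTML after JavaScript has run
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

print(len(soup.find_all('tr')))
You would still have to scroll the page (for example with driver.execute_script) before grabbing page_source if you want the extra rows the site appends on scroll; the sketch only shows the hand-off to BeautifulSoup.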

Scraping multiple data tags from HTML using beautiful Soup

I am attempting to scrape HTML to create a dictionary that includes a pitcher's name and his handed-ness. The data-tags are buried--so far I've only been able to collect the pitcher's name from the data set. The HTML output (for each player) is as follows:
<div class="pitcher players">
<input name="import-data" type="hidden" value="%5B%7B%22slate_id%22%3A20190%2C%22type%22%3A%22classic%22%2C%22player_id%22%3A%2210893103%22%2C%22salary%22%3A%2211800%22%2C%22position%22%3A%22SP%22%2C%22fpts%22%3A14.96%7D%2C%7B%22slate_id%22%3A20192%2C%22type%22%3A%22classic%22%2C%22player_id%22%3A%2210894893%22%2C%22salary%22%3A%2211800%22%2C%22position%22%3A%22SP%22%2C%22fpts%22%3A14.96%7D%2C%7B%22slate_id%22%3A20193%2C%22type%22%3A%22classic%22%2C%22player_id%22%3A%2210895115%22%2C%22salary%22%3A%2211800%22%2C%22position%22%3A%22SP%22%2C%22fpts%22%3A14.96%7D%5D"/>
<a class="player-popup" data-url="https://rotogrinders.com/players/johnny-cueto-11193?site=draftkings" href="https://rotogrinders.com/players/johnny-cueto-11193">Johnny Cueto</a>
<span class="meta stats">
<span class="stats">
R
</span>
<span class="salary" data-role="salary" data-salary="$11.8K">
$11.8K
</span>
<span class="fpts" data-fpts="14.96" data-product="56" data-role="authorize" title="Projected Points">14.96</span>
I've tinkered and am coming up empty--I'm sure I'm overthinking this. Here is the code I have so far:
import requests
from bs4 import BeautifulSoup
url = "https://rotogrinders.com/lineups/mlb?site=draftkings"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "html.parser")
players_confirmed = {}
results = soup.find_all("div", {'class': 'pitcher players'})
What's the best way to loop through the results set for the more granular data tag information I need?
I need the name text from the <a class="player-popup"> tag, and the handed-ness from the <span class="stats"> tag.
Optimally, I would have a dictionary with the following:
{Johnny Cueto : R, Player 2 : L, ...}
import requests
from bs4 import BeautifulSoup
url = "https://rotogrinders.com/lineups/mlb?site=draftkings"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "html.parser")
players_confirmed = {}
results = soup.find_all("div", {'class': 'pitcher players'})
dicti = {}
for j in results:
    dicti[j.a.text] = j.select(".stats")[1].text.strip("\n").strip()
Just use the select or find function on the found element, and you will be able to iterate over the results.
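The same idea written as a dictionary comprehension (a sketch based on the structure shown in the question, reusing the soup object from above):
pitchers = {
    div.a.text: div.select(".stats")[1].text.strip()
    for div in soup.find_all("div", {"class": "pitcher players"})
}
print(pitchers)  # e.g. {'Johnny Cueto': 'R', ...}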
