How to extract exact information by span class using Beautiful Soup - python

This is my code and output for my price monitoring code:
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.find(id="result_0_name").get_text()
price = soup.find("span", class_ = "normal_price")
#converted_price = price[0:3]
print(price.get_text())
print(title.strip())
the output is as follows
Starting at:
$0.70 USD
$0.67 USD
Operation Broken Fang Case
and html of the page is as so
<span class="market_table_value normal_price">Starting at:<br/>
<span class="normal_price" data-currency="1" data-price="69">$0.69 USD</span>
<span class="sale_price">$0.66 USD</span>
</span>
as you can see there is no ID, so I cannot use that, I only wish to display the 'normal_price' and not the other data in that span. Any ideas?

Just make the selection of the span more specific, for example use the fact, that it is an element inside an element:
soup.select_one('span > span.normal_price').get_text()
Example
from bs4 import BeautifulSoup
html='''
<span class="market_table_value normal_price">
Starting at:<br/>
<span class="normal_price" data-currency="1" data-price="69">$0.69 USD</span>
<span class="sale_price">$0.66 USD</span>
</span>
'''
soup = BeautifulSoup(html,'html.parser')
price = soup.select_one('span > span.normal_price').get_text()
price
Output
$0.69 USD

You can also try like this
from bs4 import BeautifulSoup
html ="""<span class="market_table_value normal_price">
Starting at:<br/>
<span class="normal_price" data-currency="1" data-price="69">$0.69 USD</span>
<span class="sale_price">$0.66 USD</span>
</span>"""
soup = BeautifulSoup(html, 'html.parser')
using attribute
price = soup.select_one('span[data-currency="1"]').get_text()
exact attribute
price = soup.select_one('span[data-currency^="1"]').get_text()
print(price) #$0.69 USD

Related

Web scrape a a title after a specific class by python

I'm trying to scrape some information about the positions, artists and songs from a ranking list online. Here is the ranking list website: https://kma.kkbox.com/charts/weekly/newrelease?terr=my&lang=en
I'm was trying to use the following code to scrape:
import requests
from bs4 import BeautifulSoup
page = requests.get('https://kma.kkbox.com/charts/weekly/newrelease?terr=my&lang=en')
print(page.status_code)
soup = BeautifulSoup(page.content, 'html.parser')
all_songs = soup.find_all(class_="charts-list-song")
all_artists = soup.find_all(class_="charts-list-artist")
print(all_songs)
print(all_artists)
However, the output only shows:
[<span class="charts-list-desc">
<span class="charts-list-song"></span>
<span class="charts-list-artist"></span>
</span>, <span class="charts-list-desc">
<span class="charts-list-song"></span>
...
and
<span class="charts-list-song"></span>, <span class="charts-list-song"></span>, <span class="charts-list-song"></span>, <span class="charts-list-song"></span>, <span class="charts-list-song"></span>, <span class="charts-list-song"></span>,
My expected output should be:
Pos artist songs
1 張哲瀚 洪荒劇場Primordial Theater
2 張哲瀚 冰川消失那天Lost Glacier
3 告五人 又到天黑
Use view source in Chrome, you can see that the actual chart content is at the end of the html source code and loaded as chart variable.
code
import requests
from bs4 import BeautifulSoup
import json, re
page = requests.get('https://kma.kkbox.com/charts/weekly/newrelease?terr=my&lang=en')
print(page.status_code)
soup = BeautifulSoup(page.content, 'html.parser')
data = soup.select('script')[-2].string
m = re.search(r'var chart = (\[{.*}\])', data)
songs = json.loads(m.group(1))
for song in songs:
print(song['rankings']['this_period'], song['artist_name'], song['song_name'])
output
1 張哲瀚 洪荒劇場Primordial Theater
2 張哲瀚 冰川消失那天Lost Glacier
3 告五人 又到天黑
4 孫盛希 Shi Shi 眼淚記得你 (Remembered)
5 陳零九 Nine Chen 夢裡的女孩 (The Girl)
6 告五人 一念之間
7 苏有朋 玫瑰急救箱
8 林俊傑 想見你想見你想見你
...

BeautifulSoup - extracting text from multiple span elements w/o classes

So that's how HTML looks:
<p class="details">
<span>detail1</span>
<span class="number">1</span>
<span>detail2</span>
<span>detail3</span>
</p>
I need to extract detail2 & detail3.
But with this piece of code I only get detail1.
info = data.find("p", class_ = "details").span.text
How do I extract the needed items?
Thanks in advance!
Select your elements more specific in your case all sibling <span> of <span> with class number:
soup.select('span.number ~ span')
Example
from bs4 import BeautifulSoup
html='''<p class="details">
<span>detail1</span>
<span class="number">1</span>
<span>detail2</span>
<span>detail3</span>
</p>'''
soup = BeautifulSoup(html)
[t.text for t in soup.select('span.number ~ span')]
Output
['detail2', 'detail3']
You can find all <span>s and do normal indexing:
from bs4 import BeautifulSoup
html_doc = """\
<p class="details">
<span>detail1</span>
<span class="number">1</span>
<span>detail2</span>
<span>detail3</span>
</p>"""
soup = BeautifulSoup(html_doc, "html.parser")
spans = soup.find("p", class_="details").find_all("span")
for s in spans[-2:]:
print(s.text)
Prints:
detail2
detail3
Or CSS selectors:
spans = soup.select(".details span:nth-last-of-type(-n+2)")
for s in spans:
print(s.text)
Prints:
detail2
detail3

How to just get the content of the tag when you use findAll in Beautiful Soup?

So on the project I'm building I want to find the price contained on the multiple results I got with the findAll() command. Here's the code:
soup = BeautifulSoup(driver.page_source, 'html.parser')
price = soup.find_all(class_='search-result__market-price--value')
print(price)
And this is what I get:
[<span class="search-result__market-price--value" tabindex="-1"> $0.11 </span>, <span class="search-result__market-price--value" tabindex="-1"> $0.24 </span>, ... ]
I tried using this code I found somewhere else price = soup.find_all(class_='search-result__market-price--value')[0].string but it just gave the error IndexError: list index out of range.
What can I do to just get the numbers?
Iterate the ResultSet created by find_all():
soup = BeautifulSoup(driver.page_source, 'html.parser')
for price in soup.find_all(class_='search-result__market-price--value'):
print(price.text)
or to just get the numbers
print(price.text.split('$')[-1])
Example
from bs4 import BeautifulSoup
html='''
<span class="search-result__market-price--value" tabindex="-1"> $0.11 </span>
<span class="search-result__market-price--value" tabindex="-1"> $0.24 </span>
'''
soup = BeautifulSoup(html, 'html.parser')
for tag in soup.find_all('span'):
print(tag.text.split('$')[-1])
Output
0.11
0.24

Beautiful Soup - getting value from tag definition

HTML Code:
<option data-formated='<span class="price">AUD $100.08</span>' data-qtyid="qty-219" value="1">
Unit Price </option>
how do i get AUD $100.08 out of this?
You can use
find('option')['data-formated']
to get text <span class="price">AUD $100.08</span> and then you can slice/span()/etc. - ie [20:-7] - or you can use again BeautifulSoup to search in this HTML.
from bs4 import BeautifulSoup
text = '''<option data-formated='<span class="price">AUD $100.08</span>' data-qtyid="qty-219" value="1">
Unit Price </option>'''
soup = BeautifulSoup(text, 'html.parser')
item = soup.find('option')['data-formated']
print(item[20:-7]) # AUD $100.08
soup = BeautifulSoup(item, 'html.parser')
print(soup.find('span').text) # AUD $100.08

Scrape 2 inner texts within div as one value

I have the following html
<div class="price-block__highlight"><span class="promo-price" data-
test="price">102,
<sup class="promo-price__fraction" data-test="price-fraction">99</sup>
</span>
</div>
I want to print the price of this html without comma, so
print price should result in:
102.99
I have the following code
pricea = page_soup.find("div", {"class":"price-block__highlight"})
price = str(pricea.text.replace('-','').replace(',','.').strip())
print price
This however results in:
102.
99
When writing in a csv it creates multiple rows. How to get both numbers in one value?
i think you are using bs4
from bs4 import BeautifulSoup
html_doc = """
<div class="price-block__highlight"><span class="promo-price" data-
test="price">102,
<sup class="promo-price__fraction" data-test="price-fraction">99</sup>
</span>
</div>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
price_div = soup.find("div", {"class": 'price-block__highlight'})
texts = [x.strip() for x in price_div.text.split(',')]
print('.'.join(texts))
Output
102.99

Categories