How to extract text that is surrounded by two <span> elements? - python

I am in the middle of scraping data from a website, and I ran into the following markup. The parsed element is stored in code:
<li class="price-current">
<span class="price-current-label">
</span>₹ 7,372 
<span class="price-current-range">
<abbr title="to">–</abbr>
</span>
</li>
I need to extract only "₹ 7,372".
I have tried the following:
1. code.text
which results in
'\n\n₹ 7,372\xa0\r\n \n–\n\n'
2. code.text.strip()
which results in
'₹ 7,372\xa0\r\n \n–'
Is there a method that returns just the price?
Please let me know, so that I can complete my project.

OK, I managed to clean the data you need. This approach is a little ugly, but it works =)
from bs4 import BeautifulSoup as BS
html= """<li class="price-current">
<span class="price-current-label">
</span>₹ 7,372
<span class="price-current-range">
<abbr title="to">–</abbr>
</span>
</li> """
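soup = BS(html, 'html.parser')
li = soup.find('li').text
# Repeat the pass: stripping one character can expose another at the ends of the string.
for j in range(3):
    for i in ['\n', ' ', '–', '\xa0', '\r', '\x20', '\x0a', '\x09', '\x0c', '\x0d']:
        li = li.strip(i)
print(li)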
output:
₹ 7,372
The inner list contains all the ASCII whitespace characters (as far as I know) plus the other symbols that appear in your output.
The outer loop runs three times because a single pass does not remove everything: stripping one character can expose another at the end of the string. You can check the value after each iteration in the variable explorer.
Optionally, you can also try to work out exactly which character produces the apparent run of spaces between the <span> tags.
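A possibly simpler route, sketched here as an alternative to repeated stripping: the price is a text node that directly follows the first <span>, so it can be read through next_sibling. This assumes the price-current-label span always sits immediately before the price; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Same markup as in the question, including the non-breaking space after the price
html = """<li class="price-current">
<span class="price-current-label">
</span>₹ 7,372\xa0
<span class="price-current-range">
<abbr title="to">–</abbr>
</span>
</li>"""

soup = BeautifulSoup(html, "html.parser")
label = soup.find("span", class_="price-current-label")

# next_sibling is the text node between the two spans;
# str.strip() with no arguments also removes '\xa0', '\r' and '\n'
price = label.next_sibling.strip()
print(price)
```

If the label span can be missing on some pages, label will be None and the lookup needs a guard before next_sibling is read.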

from bs4 import BeautifulSoup as bs
code = '''<li class="price-current">
<span class="price-current-label">
</span>₹ 7,372
<span class="price-current-range">
<abbr title="to">–</abbr>
</span>
</li>'''
soup = bs(code,'html.parser')
w = soup.find_all('li')
l = []
for item in w:
    l.append(item)
words = str(l)
t = words.split('\n')
print(t[2][7:])
Output:
₹ 7,372

Related

Find tags based on an attribute value with the highest number

I am trying to find a way to find the <span> tag with the highest number in the data-resolution attribute value.
This is what I have managed to do so far.
from bs4 import BeautifulSoup
html_doc = """
<span data-type="video-src" data-url="https://1080.com" data-resolution="1080"></span>
<span data-type="video-src" data-url="https://720.com" data-resolution="720"></span>
<span data-type="video-src" data-url="https://480.com" data-resolution="480"></span>
<span data-type="video-src" data-url="https://360.com" data-resolution="360"></span>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
x = soup.select('span[data-resolution="1080"]')
print(x)
The problem here is that 1080 is a fixed value. I guess I could create a conditional if statement with all numbers I assume will be available, but I would rather just pick the highest number.
So how can I make Beautiful Soup look for the <span> tag with the highest number in the data-resolution attribute value?
You can narrow down on just the data-resolution items and then use max on that attribute since each item is essentially a dictionary.
def custom_key(css):
    val = css.get("data-resolution", 0)
    return int(val)
max(soup.select('span[data-resolution]'), key=custom_key)
Output:
<span data-resolution="1080" data-type="video-src" data-url="https://1080.com"></span>
There may be better and more succinct ways to do this but here's one idea.
Iterate over all the span elements. Build a tuple comprised of the data-resolution (as an integer) and the entire span element. Append that tuple to a list. Sort the list. Extract the data of interest.
from bs4 import BeautifulSoup
html_doc = """
<span data-type="video-src" data-url="https://1080.com" data-resolution="1080"></span>
<span data-type="video-src" data-url="https://720.com" data-resolution="720"></span>
<span data-type="video-src" data-url="https://480.com" data-resolution="480"></span>
<span data-type="video-src" data-url="https://360.com" data-resolution="360"></span>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
spanlist = []
for span in soup.select('span'):
    spanlist.append((int(span['data-resolution']), span))
print(sorted(spanlist)[-1][1])
Output:
<span data-resolution="1080" data-type="video-src" data-url="https://1080.com"></span>
Or, if you want something that's hard to understand but functional then:
print(max([((int(span['data-resolution']), span)) for span in soup.select('span')])[1])
Because we all love one-liners:
result = max(soup.select('span[data-resolution]'), key=lambda span: int(span.get('data-resolution')))
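The snippets above assume every span carries a numeric data-resolution. As a hedged sketch of a more defensive variant, the code below skips spans whose value is missing or not a plain number and returns None when nothing qualifies. The "auto" value and the span without the attribute are hypothetical additions for illustration, not part of the original markup.

```python
from bs4 import BeautifulSoup

html_doc = """
<span data-type="video-src" data-url="https://1080.com" data-resolution="1080"></span>
<span data-type="video-src" data-url="https://720.com" data-resolution="720"></span>
<span data-type="video-src" data-url="https://auto.com" data-resolution="auto"></span>
<span data-type="video-src" data-url="https://none.com"></span>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Keep only spans whose data-resolution is made up of digits
candidates = [
    span for span in soup.select("span[data-resolution]")
    if span["data-resolution"].isdigit()
]

# default=None avoids a ValueError when no span qualifies
best = max(candidates, key=lambda span: int(span["data-resolution"]), default=None)
best_url = best["data-url"] if best is not None else None
print(best_url)
```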

Extracting text :after an element with Beautiful Soup

I would like to extract the text :after the <strong> element.
<li data-toggle="tooltip" title="" data-original-title=""><strong>06:25</strong> vP</li>
I've tried the following
medmar_live_departures_table = list(soup.select('li.tratta'))
for li in medmar_live_departures_table:
    info = li.text
but I'm getting both texts. I could use re to split the string but I was wondering if there was a more efficient and straightforward way of doing it.
Output
16:40  vP
Desired output
vP
You can get the last text child node of each li as below:
medmar_live_departures_table = soup.select('li.tratta')
for li in medmar_live_departures_table:
    info = [text for text in li.stripped_strings][-1]
    print(info)
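Another option, shown as a sketch rather than the definitive way: read the node that directly follows the <strong> element through next_sibling. The class="tratta" attribute is added to the sample markup as an assumption so that the selector from the question matches; the labels list is illustrative.

```python
from bs4 import BeautifulSoup

# class="tratta" is assumed here; the sample in the question omits it
html = '<ul><li class="tratta" data-toggle="tooltip" title=""><strong>06:25</strong> vP</li></ul>'
soup = BeautifulSoup(html, "html.parser")

labels = []
for li in soup.select("li.tratta"):
    strong = li.find("strong")
    # next_sibling is the node right after </strong>; here it is the text " vP"
    tail = strong.next_sibling if strong is not None else None
    # Only text nodes are kept; a following tag or a missing node yields None
    labels.append(tail.strip() if isinstance(tail, str) else None)
print(labels)
```

Compared with taking the last entry of stripped_strings, this stays tied to the <strong> element, which may matter if more text is later added to the li.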

Extracting from BS4 and storing as list elements in Python

I have extracted data wrapped within multiple HTML p tags from a webpage using BeautifulSoup4. I have stored all of the extracted data in a list. But I want each of the extracted data as separate list elements separated by a comma.
HTML content structure:
<ul>
<li>
<p>
<span class="TextRun">
<span class="NormalTextrun"> Data 1 </span>
</span>
</p>
</li>
<li>
<p>
<span class="TextRun">
<span class="NormalTextrun"> Data 2 </span>
</span>
</p>
</li>
<li>
<p>
<span class="TextRun">
<span class="NormalTextrun"> Data 3 </span>
</span>
</p>
</li>
</ul>
Code to extract:
for data in elem.find_all('span', class_="TextRun"):
    data = ''.join([' '.join(item.text.split()) for item in elem.select(".NormalTextRun")])
    data = data.replace(u'\xa0', '')
    events_parsed_thisweek.append(data)
print (events_parsed_thisweek)
Current output:
[Data1Data2Data3]
Expected output:
[Data1, Data2, Data3]
Any help is much appreciated!
data = [x.text.strip() for x in elem.find_all('span', {'class': 'NormalTextrun'})]
Printing data will give you: ['Data 1', 'Data 2', 'Data 3']
I think what @Sagun Shrestha suggests works. To handle details such as the inner span and the extra spaces, you could try:
data = [s.text.strip() for s in b.find_all('span', class_='NormalTextrun')]
print(data)
If you specifically want the string output without the quotation marks, you can try this:
data = [s.text.strip() for s in b.find_all('span', class_='NormalTextrun')]
print('[', ', '.join(data), ']', sep='')
Hope it's what you want.
This should solve your problem
data = [x.text for x in elem.find_all('span', attrs={'class':'TextRun'})]
This gives the correct output:
data = [ele.text for ele in soup.find_all('span', {'class':'NormalTextrun'})]
print(data)
Output:
[' Data 1 ', ' Data 2 ', ' Data 3 ']

Beautiful Soup / Regular Expressions: Extract a portion of text from NavigableString

I'm really new to learning python so this could be really obvious, but I have extracted a NavigableString from BeautifulSoup and I need to find data in the string. However, it's not as easy as some of the examples I've seen online.
My end goal is to create a dictionary that looks something like this:
dict = {'Fandom':'Undertale (Video Game)', 'Works':15341}
Here are two examples of the strings:
<li>
<a class="tag" href="/tags/Undertale%20(Video%20Game)/works">Undertale (Video Game)</a>
(15341)
</li>
<li>
<a class="tag" href="/tags/Sherlock%20Holmes%20*a*%20Related%20Fandoms/works">Sherlock Holmes & Related Fandoms</a>
(101015)
</li>
I've already succeeded in extracting the fandom from the string, but now I need the works count in parentheses. How would I use Beautiful Soup and/or regular expressions to do this?
I also need to do error handling because while a fandom will always be displayed, it may not have a work count next to it.
<li>
<a class="tag" href="/tags/Composer%20-%20Fandom/works">Composer - Fandom</a>
</li>
Here are the relevant pieces of code:
for each_f in cate:
    #print(each_f)
    result = each_f.find('a')
    if result !=-1:
        #here is where I grab the Fandom vals
        fandom_name = result.contents
        #print(result.contents)
NOTE: I know I'm missing the code to append to the dictionary, I haven't made it that far yet. I'm just trying to get the values to print to the screen.
Use dict.fromkeys(('Fandom', 'Works')) to get a dict with both keys preset to None:
In [17]: dict.fromkeys(('Fandom', 'Works'))
Out[17]: {'Fandom': None, 'Works': None}
Use zip to pair the keys with the strings in the li tag. zip stops at the shortest input, so an li without a count yields only the Fandom pair:
zip(('Fandom', 'Works'),li.stripped_strings)
[('Fandom', 'Undertale (Video Game)'), ('Works', '(15341)')]
[('Fandom', 'Sherlock Holmes & Related Fandoms'), ('Works', '(101015)')]
[('Fandom', 'Composer - Fandom')]
Then update the dict with that data:
In [20]: for li in soup.find_all('li'):
    ...:     d = dict.fromkeys(('Fandom', 'Works'))
    ...:     out = zip(('Fandom', 'Works'), li.stripped_strings)
    ...:     d.update(out)
    ...:     print(d)
Output:
{'Works': '(15341)', 'Fandom': 'Undertale (Video Game)'}
{'Works': '(101015)', 'Fandom': 'Sherlock Holmes & Related Fandoms'}
{'Works': None, 'Fandom': 'Composer - Fandom'}
You can use stripped_strings and unpack the values to get your blocks of text. You can store the results in a dict so that you can use them later.
Example:
from bs4 import BeautifulSoup
example = """<li>
<a class="tag" href="/tags/Undertale%20(Video%20Game)/works">Undertale (Video Game)</a>
(15341)
</li>
<li><a class="tag" href="/tags/Sherlock%20Holmes%20*a*%20Related%20Fandoms/works">Sherlock Holmes & Related Fandoms</a>
(101015)
</li>
<li>
<a class="tag" href="/tags/Composer%20-%20Fandom/works">Composer - Fandom</a>
</li>"""
soup = BeautifulSoup(example, "html.parser")
Fandom = {"Fandom" : []}
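for li in soup.find_all("li"):
    try:
        # Unpacking succeeds only when the li holds exactly two strings: title and count
        fandom, count = li.stripped_strings
        Fandom["Fandom"].append({fandom.strip() : count[1:-1]})
    except ValueError:
        # No count next to the title, so record 0
        fandom = li.text.strip()
        Fandom["Fandom"].append({fandom.strip() : 0})
print (Fandom)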
This outputs:
{'Fandom': [{'Undertale (Video Game)': '15341'}, {'Sherlock Holmes & Related Fandoms': '101015'}, {'Composer - Fandom': 0}]}
The try/except catches any unpacking that doesn't yield exactly two values: your fandom title and the works count.
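For the dictionary shape asked for in the question ('Fandom' and 'Works' keys, with the count as an integer), a regular expression on the text node after the link is one possible approach. This is a minimal sketch under the assumption that the count, when present, is the text directly following the <a> tag; the records list and the use of None for a missing count are illustrative choices.

```python
import re
from bs4 import BeautifulSoup

example = """<li>
<a class="tag" href="/tags/Undertale%20(Video%20Game)/works">Undertale (Video Game)</a>
(15341)
</li>
<li>
<a class="tag" href="/tags/Composer%20-%20Fandom/works">Composer - Fandom</a>
</li>"""

soup = BeautifulSoup(example, "html.parser")
records = []
for li in soup.find_all("li"):
    link = li.find("a", class_="tag")
    if link is None:
        continue  # no fandom link in this li
    tail = link.next_sibling  # text node after </a>, or None
    # Capture the digits inside the parentheses, if there are any
    match = re.search(r"\((\d+)\)", str(tail)) if tail is not None else None
    records.append({
        "Fandom": link.get_text(strip=True),
        "Works": int(match.group(1)) if match else None,
    })
print(records)
```

Storing None rather than 0 keeps "no count shown" distinguishable from a real count of zero; swap in 0 if that distinction does not matter.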

Scrapy extract non-wrapped data

I'm trying to extract specific price data with scrapy, but the price is not wrapped in a tag.
How can I compose the price from this HTML?
<div class="value">
<span class="currency">€</span>
39
<span class="decimal">.98</span>
<span class="type">New</span>
</div>
How can I refine this to 39,98?
Solution used:
"".join(response.xpath("//section[@class='order order--new order--left']//div[@class='spec price']//div[@class='value']//text()").re("[0-9.]+"))
Get all the text nodes inside the element and join:
$ scrapy shell file://$PWD/index.html
In [1]: "".join(response.xpath("//div[@class = 'value']//text()").re("[0-9.]+"))
Out[1]: u'39.98'
The .re() call keeps only the digits and dots from each text node.
If there are multiple values in the document:
In [2]: ["".join(value.xpath(".//text()").re(r"[0-9.]+")) for value in response.xpath("//div[@class = 'value']")]
Out[2]: [u'39.98']
