Python - BeautifulSoup - Unable to extract Span Value - python

I have an XML with mutiple Div Classes/Span Classes and I'm struggling to extract a text value.
<div class="line">
<span class="html-tag">
"This is a Heading that I dont want"
</span>
<span>This is the text I want</span>
So far I have written this:
html = driver.page_source
soup = BeautifulSoup(html, "lxml")
spans = soup.find_all('span', attrs={'class': 'html-tag'})[29]
print(spans.text)
This unfortunately only prints out the "This is a Heading that I dont want" value e.g.
This is the heading I dont want
Number [29] in the code is the position where the text I need will always appear.
I'm unsure how to retrieve the span value I need.
Please can you assist. Thanks

You can search by <div class="line"> and then select second <span>.
For example:
txt = '''
# line 1
<div class="line">
<span class="html-tag">
"This is a Heading that I dont want"
</span>
<span>This is the text I dont want</span>
</div>
# line 2
<div class="line">
<span class="html-tag">
"This is a Heading that I dont want"
</span>
<span>This is the text I dont want</span>
</div>
# line 3
<div class="line">
<span class="html-tag">
"This is a Heading that I dont want"
</span>
<span>This is the text I want</span> <--- this is I want
</div>'''
soup = BeautifulSoup(txt, 'html.parser')
s = soup.select('div.line')[2].select('span')[1] # select 3rd line 2nd span
print(s.text)
Prints:
This is the text I want

Related

Extracting text from multiple spans with different classes using BeautifulSoup

I am trying to extract some data from a webpage that I've parsed through BeautifulSoup.
<div class="product-data-list data-points-en_GB">
<div class="float-left in-left col-totalNetAssets" style="height: 36px;">
<span class="caption">
Net Assets of Share Class
<span class="as-of-date">
as of 20-Jul-20
</span>
</span>
<span class="data">
USD 36,636,694,134
</span>
</div>
<div class="float-left in-right col-totalNetAssetsFundLevel">
<span class="caption">
Net Assets of Fund
<span class="as-of-date">
as of 20-Jul-20
</span>
</span>
<span class="data">
USD 37,992,258,237
</span>
</div>
<div class="float-left in-left col-baseCurrencyCode" style="height: 16px;">
<span class="caption">
Fund Base Currency
<span class="as-of-date">
</span>
</span>
<span class="data">
USD
</span>
</div>
I want to capture the information from the 'caption', 'as-of-date' and 'data' spans to create something like:
[('Net Assets of Share Class','20-Jul-20','USD 36,636,694,134'),
('Net Assets of Fund','20-Jul-20','USD 37,992,258,237'),
('Fund Base Currency','','USD')]
This is my code:
data=[]
for tag in soup.findAll("div", {"id": "keyFundFacts"}):
for span in tag.findAll("div", {"class": "product-data-list data-points-en_GB"}):
a = span.find("span", {"class": "caption"}).text
b = span.find("span", {"class": "as-of-date"}).text
c = span.find("span", {"class": "data"}).text
data.append((a,b,c))
however, I only get 1 result when I look at the list 'data':
<pre>
[('\nNet Assets of Share Class\n\nas of 20-Jul-20\n\n', '\nas of 20-Jul-20\n', '\nUSD 36,636,694,134\n')]
</pre>
Aside from needing to strip out the new lines, I know I am missing something to get the script to go through all the other spans but have been staring at the screen for so long, it isn't getting any clearer.
Can anyone help put me out of my misery?!
One solution is to cycle through all the div elements that are under your main "div", {"class": "product-data-list data-points-en_GB" element. This way for each div element you will get the elements you want.
for tag in soup.findAll("div", {"id": "keyFundFacts"}):
for element in tag.findAll("div", {"class": "product-data-list data-points-en_GB"}):
for divEle in element.findAll('div')
a = divEle.find("span", {"class": "caption"}).text
b = divEle.find("span", {"class": "as-of-date"}).text
c = divEle.find("span", {"class": "data"}).text
This makes for a lot of nested loops so I don't recommend this. I suggest finding a more precise way. If you have a url with the html I could take a look.
I have stumbled upon a solution which seems to do the trick:
data=[]
for tag in soup.findAll("div", {"id": "keyFundFacts"}):
for element in tag.findAll("div", {"class": "product-data-list data-points-en_GB"}):
for thing in element.findChildren('div'):
a = thing.findNext("span", {"class": "caption"}).text
b = thing.findNext("span", {"class": "as-of-date"}).text
c = thing.findNext("span", {"class": "data"}).text
data.append((a,b,c))
Its not perfect but hopefully functional.
thanks all

How to exclude inner tags with beautifulsoup

Hey Im currently trying to parse through a website and I'm almost done, but there's a little problem. I wannt to exclude inner tags from a html code
<span class="moto-color5_5">
<strong>Text 1 </strong>
<span style="font-size:8px;">Text 2</span>
</span>
I tried using
...find("span", "moto-color5_5") but this returns
Text 1 Text 2
instead of only returning Text 1
Any suggestions?
sincierly :)
Excluding inner tags would also exclude Text 1 because it's in an inner tag <strong>.
You can however just find strong inside of your current soup:
html = """<span class="moto-color5_5">
<strong>Text 1 </strong>
<span style="font-size:8px;">Text 2</span>
</span>
"""
soup = BeautifulSoup(html)
result = soup.find("span", "moto-color5_5").find('strong')
print(result.text) # Text 1

Unable to fetch <div> tag values in python

The required value is present within the div tag:
<div class="search-page-text">
<span class="upc grey-text sml">Cost for 2: </span>
Rs. 350
</div>
I am using the below code to fetch the value "Rs. 350":
soup.select('div.search-page-text'):
But in the output i get "None". Could you pls help me resolve this issue?
An element with both a sub-element and string content can be accessed using strippe_strings:
from bs4 import BeautifulSoup
h = """<div class="search-page-text">
<span class="upc grey-text sml">Cost for 2: </span>
Rs. 350
</div>"""
soup = BeautifulSoup(h)
for s in soup.select("div.search-page-text")[0].stripped_strings:
print(s)
Output:
Cost for 2:
Rs. 350
The problem is that this includes both the strong content of the span and the div. But if you know that the div first contains the span with text, you could get the intersting string as
list(soup.select("div.search-page-text")[0].stripped_strings)[1]
If you know you only ever want the string that is the immediate text of the <div> tag and not the <span> child element, you could do this.
from bs4 import BeautifulSoup
txt = '''<div class="search-page-text">
<span class="upc grey-text sml">Cost for 2: </span>
Rs. 350
</div>'''
soup = BeautifulSoup(txt)
for div in soup.find_all("div", { "class" : "search-page-text" }):
print ''.join(div.find_all(text=True, recursive=False)).strip()
#print div.find_all(text=True, recursive=False)[1].strip()
One of the lines returned by div.find_all is just a newline. That could be handled in a variety of ways. I chose to join and strip it rather than rely on the text being at a certain index (see commented line) in the resultant list.
Python 3
For python 3 the print line should be
print (''.join(div.find_all(text=True, recursive=False)).strip())

How to print the text inside of a child tag and the href of a grandchild element with a single BeautifulSoup Object?

I have a document which contains several div.inventory siblings.
<div class="inventory">
<span class="item-number">123</span>
<span class="cost">
$1.23
</span>
</div>
I would like to iterate over them to print the item number and link of the item.
123 http://linktoitem
456 http://linktoitem2
789 http://linktoitem3
How do I parse these two values after selecting the div.inventory element?
import requests
from bs4 import BeautifulSoup
htmlSource = requests.get(url).text
soup = BeautifulSoup(htmlSource)
matches = soup.select('div.inventory')
for match in matches:
#prints 123
#prints http://linktoitem
Also - what is the difference between the select function and find* functions?
You can find both items using find() relying on the class attributes:
soup = BeautifulSoup(data)
for inventory in soup.select('div.inventory'):
number = inventory.find('span', class_='item-number').text
link = inventory.find('span', class_='cost').a.get('href')
print number, link
Example:
from bs4 import BeautifulSoup
data = """
<body>
<div class="inventory">
<span class="item-number">123</span>
<span class="cost">
$1.23
</span>
</div>
<div class="inventory">
<span class="item-number">456</span>
<span class="cost">
$1.23
</span>
</div>
<div class="inventory">
<span class="item-number">789</span>
<span class="cost">
$1.23
</span>
</div>
</body>
"""
soup = BeautifulSoup(data)
for inventory in soup.select('div.inventory'):
number = inventory.find('span', class_='item-number').text
link = inventory.find('span', class_='cost').a.get('href')
print number, link
Prints:
123 http://linktoitem
456 http://linktoitem2
789 http://linktoitem3
Note the use of select() - this method allows to use CSS Selectors for searching over the page. Also note the use of class_ argument - underscore is important since class is a reversed keyword in Python.

Beautiful Soup to Find and Regex Replace text 'not within <a></a> '

I am using Beautiful Soup to parse a html to find all text that is
1.Not contained inside any anchor elements
I came up with this code which finds all links within href but not the other way around.
How can I modify this code to get only plain text using Beautiful Soup, so that I can do some find and replace and modify the soup?
for a in soup.findAll('a',href=True):
print a['href']
EDIT:
Example:
<html><body>
<div> test1 </div>
<div><br></div>
<div>test2</div>
<div><br></div><div><br></div>
<div>
This should be identified
Identify me 1
Identify me 2
<p id="firstpara" align="center"> This paragraph should be<b> identified </b>.</p>
</div>
</body></html>
Output:
This should be identified
Identify me 1
Identify me 2
This paragraph should be identified.
I am doing this operation to find text not within <a></a> : then find "Identify" and do replace operation with "Replaced"
So the final output will be like this:
<html><body>
<div> test1 </div>
<div><br></div>
<div>test2</div>
<div><br></div><div><br></div>
<div>
This should be identified
Repalced me 1
Replaced me 2
<p id="firstpara" align="center"> This paragraph should be<b> identified </b>.</p>
</div>
</body></html>
Thanks for your time !
If I understand you correct, you want to get the text that is inside an a element that contains an href attribute. If you want to get the text of the element, you can use the .text attribute.
>>> soup = BeautifulSoup.BeautifulSoup()
>>> soup.feed('this is some text')
>>> soup.findAll('a', href=True)[0]['href']
u'http://something.com'
>>> soup.findAll('a', href=True)[0].text
u'this is some text'
Edit
This finds all the text elements, with identified in them:
>>> soup = BeautifulSoup.BeautifulSoup()
>>> soup.feed(yourhtml)
>>> [txt for txt in soup.findAll(text=True) if 'identified' in txt.lower()]
[u'\n This should be identified \n\n Identify me 1 \n\n Identify me 2 \n ', u' identified ']
The returned objects are of type BeautifulSoup.NavigableString. If you want to check if the parent is an a element you can do txt.parent.name == 'a'.
Another edit:
Here's another example with a regex and a replacement.
import BeautifulSoup
import re
soup = BeautifulSoup.BeautifulSoup()
html = '''
<html><body>
<div> test1 </div>
<div><br></div>
<div>test2</div>
<div><br></div><div><br></div>
<div>
This should be identified
Identify me 1
Identify me 2
<p id="firstpara" align="center"> This paragraph should be<b> identified </b>.</p>
</div>
</body></html>
'''
soup.feed(html)
for txt in soup.findAll(text=True):
if re.search('identi',txt,re.I) and txt.parent.name != 'a':
newtext = re.sub(r'identi(\w+)', r'replace\1', txt.lower())
txt.replaceWith(newtext)
print(soup)
<html><body>
<div> test1 </div>
<div><br /></div>
<div>test2</div>
<div><br /></div><div><br /></div>
<div>
this should be replacefied
replacefy me 1
replacefy me 2
<p id="firstpara" align="center"> This paragraph should be<b> replacefied </b>.</p>
</div>
</body></html>

Categories