Extracting text from multiple spans with different classes using BeautifulSoup - python

I am trying to extract some data from a webpage that I've parsed through BeautifulSoup.
<div class="product-data-list data-points-en_GB">
<div class="float-left in-left col-totalNetAssets" style="height: 36px;">
<span class="caption">
Net Assets of Share Class
<span class="as-of-date">
as of 20-Jul-20
</span>
</span>
<span class="data">
USD 36,636,694,134
</span>
</div>
<div class="float-left in-right col-totalNetAssetsFundLevel">
<span class="caption">
Net Assets of Fund
<span class="as-of-date">
as of 20-Jul-20
</span>
</span>
<span class="data">
USD 37,992,258,237
</span>
</div>
<div class="float-left in-left col-baseCurrencyCode" style="height: 16px;">
<span class="caption">
Fund Base Currency
<span class="as-of-date">
</span>
</span>
<span class="data">
USD
</span>
</div>
I want to capture the information from the 'caption', 'as-of-date' and 'data' spans to create something like:
[('Net Assets of Share Class','20-Jul-20','USD 36,636,694,134'),
('Net Assets of Fund','20-Jul-20','USD 37,992,258,237'),
('Fund Base Currency','','USD')]
This is my code:
data=[]
for tag in soup.findAll("div", {"id": "keyFundFacts"}):
    for span in tag.findAll("div", {"class": "product-data-list data-points-en_GB"}):
        a = span.find("span", {"class": "caption"}).text
        b = span.find("span", {"class": "as-of-date"}).text
        c = span.find("span", {"class": "data"}).text
        data.append((a,b,c))
However, I only get one result when I look at the list 'data':
[('\nNet Assets of Share Class\n\nas of 20-Jul-20\n\n', '\nas of 20-Jul-20\n', '\nUSD 36,636,694,134\n')]
Aside from needing to strip out the new lines, I know I am missing something to get the script to go through all the other spans but have been staring at the screen for so long, it isn't getting any clearer.
Can anyone help put me out of my misery?!

One solution is to loop over all the div elements nested under your main "div", {"class": "product-data-list data-points-en_GB"} element. That way, each inner div yields its own caption, as-of-date and data spans.
data = []
for tag in soup.findAll("div", {"id": "keyFundFacts"}):
    for element in tag.findAll("div", {"class": "product-data-list data-points-en_GB"}):
        # visit each inner div so every caption/date/data group is read
        for divEle in element.findAll('div'):
            a = divEle.find("span", {"class": "caption"}).text
            b = divEle.find("span", {"class": "as-of-date"}).text
            c = divEle.find("span", {"class": "data"}).text
            data.append((a, b, c))
This makes for a lot of nested loops, so I don't recommend it as a final solution. I suggest finding a more precise selector. If you have a URL for the page, I could take a look.

I have stumbled upon a solution which seems to do the trick:
data=[]
for tag in soup.findAll("div", {"id": "keyFundFacts"}):
    for element in tag.findAll("div", {"class": "product-data-list data-points-en_GB"}):
        for thing in element.findChildren('div'):
            a = thing.findNext("span", {"class": "caption"}).text
            b = thing.findNext("span", {"class": "as-of-date"}).text
            c = thing.findNext("span", {"class": "data"}).text
            data.append((a,b,c))
It's not perfect but hopefully functional.
Thanks, all.
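A flatter alternative to the nested loops, as a minimal sketch: pick out each inner div with one CSS selector and clean the text while collecting it. The trimmed HTML sample, the variable names, and the choice to drop the "as of " prefix are assumptions for illustration; the selector may need adjusting for the real page.

```python
from bs4 import BeautifulSoup

# Trimmed sample based on the question's HTML (assumed wrapper id "keyFundFacts")
html = """
<div id="keyFundFacts">
<div class="product-data-list data-points-en_GB">
<div class="float-left in-left col-totalNetAssets">
<span class="caption">
Net Assets of Share Class
<span class="as-of-date">
as of 20-Jul-20
</span>
</span>
<span class="data">
USD 36,636,694,134
</span>
</div>
<div class="float-left in-left col-baseCurrencyCode">
<span class="caption">
Fund Base Currency
<span class="as-of-date">
</span>
</span>
<span class="data">
USD
</span>
</div>
</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
data = []
# "> div" matches only the direct child divs of the product-data-list container
for block in soup.select("#keyFundFacts div.product-data-list > div"):
    caption = block.find("span", class_="caption")
    date = caption.find("span", class_="as-of-date")
    value = block.find("span", class_="data")
    # The first non-empty string in the caption is its own label,
    # which comes before the nested as-of-date span
    label = next(caption.stripped_strings, "")
    as_of = date.get_text(strip=True).replace("as of ", "") if date else ""
    data.append((label, as_of, value.get_text(strip=True)))

print(data)
```

Using get_text(strip=True) also takes care of the stray newlines mentioned in the question.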

Related

Web scraping from the span element

I am working on a scraping project and I am looking to scrape the following.
<div class="spec-subcat attributes-religion">
<span class="h5">Faith:</span>
<span>Christian</span>
<span>Islam</span>
</div>
I want to extract only Christian, Islam as the output (without the 'Faith:').
This is my try:
faithdiv = soup.find('div', class_='spec-subcat attributes-religion')
faith = faithdiv.find('span').text.strip()
How can I do this?
There are several ways to fix this. I would suggest the following: find all <span> elements in the <div> that do not have class="h5":
soup.select('div.spec-subcat.attributes-religion span:not(.h5)')
Example
from bs4 import BeautifulSoup
html_text = '''
<div class="spec-subcat attributes-religion">
<span class="h5">Faith:</span>
<span>Christian</span>
<span>Islam</span>
</div>
'''
soup = BeautifulSoup(html_text, 'lxml')
print(', '.join([x.get_text() for x in soup.select('div.spec-subcat.attributes-religion span:not(.h5)')]))
Output
Christian, Islam

How to get child values of a div separately using BeautifulSoup

I have a div tag which contains three anchor tags, each with a URL in its href.
I am able to print those three hrefs, but they get merged into one value.
Is there a way I can get three separate values?
The div looks like this:
<div class="speaker_social_wrap">
<a href="https://twitter.com/Sigve_telenor" target="_blank">
<i aria-hidden="true" class="x-icon x-icon-twitter" data-x-icon-b=""></i>
</a>
<a href="https://no.linkedin.com/in/sigvebrekke" target="_blank">
<i aria-hidden="true" class="x-icon x-icon-linkedin-in" data-x-icon-b=""></i>
</a>
<a href="https://www.facebook.com/sigve.telenor" target="_blank">
<i aria-hidden="true" class="x-icon x-icon-facebook-f" data-x-icon-b=""></i>
</a>
What I have tried so far:
social_media_url = soup.find_all('div', class_ = 'foo')
for url in social_media_url:
    print(url)
Expected Result:
http://twitter-url
http://linkedin-url
http://facebook-url
My Output
<div><a twitter-url><a linkedin-url><a facebook-url></div>
You can do it like this:
from bs4 import BeautifulSoup
import requests
url = 'https://dtw.tmforum.org/speakers/sigve-brekke-2/'
r = requests.get(url)
soup = BeautifulSoup(r.text,'lxml')
a = soup.find('div', class_='speaker_social_wrap').find_all('a')
for i in a:
    print(i['href'])
Output:
https://twitter.com/Sigve_telenor
https://no.linkedin.com/in/sigvebrekke
https://www.facebook.com/sigve.telenor
Your selector gives you the div, not the list of URLs. You need something more like:
# find() returns a single Tag; the list returned by find_all() has no find_all() of its own
social_media_div = soup.find('div', class_ = 'foo')
social_media_anchors = social_media_div.find_all('a')
for anchor in social_media_anchors:
    print(anchor.get('href'))
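The same idea can also be written with a single CSS selector. This is a minimal sketch that parses a condensed copy of the question's div from a string instead of fetching the page; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Condensed copy of the div from the question
html = """
<div class="speaker_social_wrap">
  <a href="https://twitter.com/Sigve_telenor" target="_blank"><i class="x-icon x-icon-twitter"></i></a>
  <a href="https://no.linkedin.com/in/sigvebrekke" target="_blank"><i class="x-icon x-icon-linkedin-in"></i></a>
  <a href="https://www.facebook.com/sigve.telenor" target="_blank"><i class="x-icon x-icon-facebook-f"></i></a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# a[href] matches only anchors that actually carry an href attribute
urls = [a["href"] for a in soup.select("div.speaker_social_wrap a[href]")]
for url in urls:
    print(url)
```

Each href ends up as its own list item, so the values can be printed or stored separately.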

Python - BeautifulSoup - Unable to extract Span Value

I have an XML document with multiple div and span classes, and I'm struggling to extract a text value.
<div class="line">
<span class="html-tag">
"This is a Heading that I dont want"
</span>
<span>This is the text I want</span>
So far I have written this:
html = driver.page_source
soup = BeautifulSoup(html, "lxml")
spans = soup.find_all('span', attrs={'class': 'html-tag'})[29]
print(spans.text)
Unfortunately, this only prints the heading value, e.g.:
This is the heading I dont want
The index [29] in the code is the position where the text I need will always appear.
I'm unsure how to retrieve the span value I need.
Please can you assist. Thanks
You can search by <div class="line"> and then select second <span>.
For example:
from bs4 import BeautifulSoup
txt = '''
# line 1
<div class="line">
<span class="html-tag">
"This is a Heading that I dont want"
</span>
<span>This is the text I dont want</span>
</div>
# line 2
<div class="line">
<span class="html-tag">
"This is a Heading that I dont want"
</span>
<span>This is the text I dont want</span>
</div>
# line 3
<div class="line">
<span class="html-tag">
"This is a Heading that I dont want"
</span>
<span>This is the text I want</span> <--- this is I want
</div>'''
soup = BeautifulSoup(txt, 'html.parser')
s = soup.select('div.line')[2].select('span')[1] # select 3rd line 2nd span
print(s.text)
Prints:
This is the text I want
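Another option, closer to the code in the question, is to keep the html-tag lookup and step to the sibling span that follows it. This is a minimal sketch: the two-line sample and the index 1 are assumptions standing in for the real page, where the question uses position 29.

```python
from bs4 import BeautifulSoup

# Hypothetical two-line sample standing in for the real page source
html = """
<div class="line">
  <span class="html-tag">"Heading A"</span>
  <span>Value A</span>
</div>
<div class="line">
  <span class="html-tag">"Heading B"</span>
  <span>Value B</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
headings = soup.find_all("span", class_="html-tag")
index = 1  # assumed position for this sample; the question uses 29
# find_next_sibling skips whitespace text nodes and returns the next <span> sibling
value_span = headings[index].find_next_sibling("span")
value = value_span.get_text(strip=True) if value_span else None
print(value)
```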

I need the text from only the first p tags of both lists

I only want the first text from both lists. I have added a * next to the ones I want.
<h3 class="icon fa-comment">Colorado Locations</h3>
<section class="lists" id="locate">
<div class="border" id="ranch">
<p *class="lists" id="stores">Colorado Ranch Market</p>
<p class="lists" id="stores">11505 E. Colfax Ave</p>
<p class="lists" id="stores">Aurora, CO 80010</p>
<p class="lists" id="stores">PH: 720-343-2195</p>
<p class="lists" id="stores">FAX: 720-343-2196</p>
</div>
<div class="border">
<p *class="lists" id="stores">Save-A-Lot</p>
<p class="lists" id="stores">4255 W Florida Ave</p>
<p class="lists" id="stores">Denver, CO 80219</p>
<p class="lists" id="stores">PH: 303-935-0880</p>
<p class="lists" id="stores">FAX: 303-935-4002</p>
</div>
a = soup.find_all('div', {'class': 'border'})[0]
That can be done with a CSS selector:
a = soup.select("div.border p:first-child")
It returns a list containing the first p element of each div with class border.
Another version, using :nth-of-type() selector:
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'lxml')
print([p.text for p in soup.select('div > p:nth-of-type(1)')])
Prints:
['Colorado Ranch Market', 'Save-A-Lot']
Might as well have the full suite of CSS examples:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
items = [item.text for item in soup.select('.border p:nth-child(1)')]
print(items)
Something you wouldn't want to do (it only works for the particular example HTML given):
items = [item.text for item in soup.select('#locate p:nth-child(3n+1):nth-child(odd)')]
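For comparison, the same result can be reached without CSS selectors, since find() returns only the first match inside each div. This is a minimal sketch on a shortened copy of the question's HTML, with the * markers and duplicate ids left out.

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's HTML
html = """
<section class="lists" id="locate">
  <div class="border" id="ranch">
    <p class="lists">Colorado Ranch Market</p>
    <p class="lists">11505 E. Colfax Ave</p>
  </div>
  <div class="border">
    <p class="lists">Save-A-Lot</p>
    <p class="lists">4255 W Florida Ave</p>
  </div>
</section>
"""

soup = BeautifulSoup(html, "html.parser")
names = []
for div in soup.find_all("div", class_="border"):
    first_p = div.find("p")  # find() returns the first match, or None
    if first_p is not None:
        names.append(first_p.get_text(strip=True))
print(names)
```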

How to print the text inside of a child tag and the href of a grandchild element with a single BeautifulSoup Object?

I have a document which contains several div.inventory siblings.
<div class="inventory">
<span class="item-number">123</span>
<span class="cost">
<a href="http://linktoitem">$1.23</a>
</span>
</div>
I would like to iterate over them to print the item number and link of the item.
123 http://linktoitem
456 http://linktoitem2
789 http://linktoitem3
How do I parse these two values after selecting the div.inventory element?
import requests
from bs4 import BeautifulSoup
htmlSource = requests.get(url).text
soup = BeautifulSoup(htmlSource)
matches = soup.select('div.inventory')
for match in matches:
    # prints 123
    # prints http://linktoitem
Also, what is the difference between the select function and the find* functions?
You can find both items using find(), relying on the class attributes:
soup = BeautifulSoup(data, 'html.parser')
for inventory in soup.select('div.inventory'):
    number = inventory.find('span', class_='item-number').text
    link = inventory.find('span', class_='cost').a.get('href')
    print(number, link)
Example:
from bs4 import BeautifulSoup
data = """
<body>
<div class="inventory">
<span class="item-number">123</span>
<span class="cost">
<a href="http://linktoitem">$1.23</a>
</span>
</div>
<div class="inventory">
<span class="item-number">456</span>
<span class="cost">
<a href="http://linktoitem2">$1.23</a>
</span>
</div>
<div class="inventory">
<span class="item-number">789</span>
<span class="cost">
<a href="http://linktoitem3">$1.23</a>
</span>
</div>
</body>
"""
soup = BeautifulSoup(data, 'html.parser')
for inventory in soup.select('div.inventory'):
    number = inventory.find('span', class_='item-number').text
    # .a is the first <a> inside the cost span
    link = inventory.find('span', class_='cost').a.get('href')
    print(number, link)
Prints:
123 http://linktoitem
456 http://linktoitem2
789 http://linktoitem3
Note the use of select(): this method lets you use CSS selectors to search the page. Also note the class_ argument: the trailing underscore is important because class is a reserved keyword in Python.
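On the select versus find* part of the question, a rough side-by-side sketch: find() and find_all() filter by tag name and attributes, while select() and select_one() take a CSS selector string. The sample HTML, including the anchor inside span.cost, and the variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Assumed sample: one inventory item with a link inside the cost span
html = """
<div class="inventory">
  <span class="item-number">123</span>
  <span class="cost"><a href="http://linktoitem">$1.23</a></span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all: tag name plus keyword filters; returns a list of matches
via_find_all = soup.find_all("span", class_="item-number")
# select: the same lookup written as a CSS selector; also returns a list
via_select = soup.select("span.item-number")

# find and select_one each return the first match, or None when nothing matches
first_find = soup.find("a", href=True)
first_select = soup.select_one("div.inventory span.cost > a")
missing = soup.find("table")

print([tag.text for tag in via_find_all], [tag.text for tag in via_select])
print(first_find["href"], first_select["href"], missing)
```

Selectors tend to read better when the match depends on nesting, as in the parent > child step above; the find* methods are convenient for simple name and attribute filters.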
