Find element with multiple classes using BeautifulSoup - python

How can I get the h1 text "Mini Militia - Doodle Army 2 apk" from this page?
https://www.apkmonk.com/app/com.appsomniacs.da2/
I tried this, but I got None:
title = soup.find('div', class_='col l8 s8')
Please note that there are multiple elements on the page with the classes "hide-on-med-and-down" and "hide-on-large-only".
<div class="col l8 s8">
<h1 class="hide-on-med-and-down" style="font-size: 2em;">Mini Militia - Doodle Army 2 apk</h1>
<h1 class="hide-on-large-only" style="font-size: 1.5em;margin:0px; padding: 0px;">Mini Militia - Doodle Army 2 apk</h1>
<p class="hide-on-small-only" style="font-size: 1.2em;"><span class="item" style="display:none !important;"><span class="fn">Download Mini Militia - Doodle Army 2 APK Latest Version</span></span><b>App Rating</b>: <span class="rating"><span class="average">4.1</span>/<span class="best">5</span></span></p>
<a class="hide-on-med-and-up" onclick="ga('send', 'event', 'nav', 'similar_mob_link', 'com.appsomniacs.da2');" style="font-size: 1.2em;" href="#similar">(<b>Similar Apps</b>)</a>
</div>

Select the h1 directly by one of its classes:
title = soup.find("h1", {'class':'hide-on-med-and-down'}).text

What happens?
The page has two <h1> elements with the same content. Only their class names differ, which controls which one is displayed at different screen resolutions.
How to fix?
Because the content is identical, just select the first <h1> in the tree to get your result. The class names do not matter in this case, because the result is always the same:
title = soup.find('h1').text
Output
Mini Militia - Doodle Army 2 apk
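If you still want to go through the container div, a CSS selector is one option. A string such as class_='col l8 s8' is compared against the exact value of the class attribute, while a selector like div.col.l8.s8 matches each class independently of order. The sketch below runs against a trimmed copy of the HTML from the question rather than the live page, and it guards against the container being missing.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup shown in the question (not fetched from the site).
html = """
<div class="col l8 s8">
<h1 class="hide-on-med-and-down" style="font-size: 2em;">Mini Militia - Doodle Army 2 apk</h1>
<h1 class="hide-on-large-only" style="font-size: 1.5em;">Mini Militia - Doodle Army 2 apk</h1>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Each ".name" part matches one class, regardless of attribute order.
container = soup.select_one("div.col.l8.s8")

# Guard against None so a missing container does not raise AttributeError.
title = container.find("h1").get_text(strip=True) if container else None
print(title)
```

If container comes back as None on the real page, the markup you received probably differs from what the browser shows, so it is worth printing the downloaded HTML to check.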

Related

Extracting text from multiple spans with different classes using BeautifulSoup

I am trying to extract some data from a webpage that I've parsed through BeautifulSoup.
<div class="product-data-list data-points-en_GB">
<div class="float-left in-left col-totalNetAssets" style="height: 36px;">
<span class="caption">
Net Assets of Share Class
<span class="as-of-date">
as of 20-Jul-20
</span>
</span>
<span class="data">
USD 36,636,694,134
</span>
</div>
<div class="float-left in-right col-totalNetAssetsFundLevel">
<span class="caption">
Net Assets of Fund
<span class="as-of-date">
as of 20-Jul-20
</span>
</span>
<span class="data">
USD 37,992,258,237
</span>
</div>
<div class="float-left in-left col-baseCurrencyCode" style="height: 16px;">
<span class="caption">
Fund Base Currency
<span class="as-of-date">
</span>
</span>
<span class="data">
USD
</span>
</div>
I want to capture the information from the 'caption', 'as-of-date' and 'data' spans to create something like:
[('Net Assets of Share Class','20-Jul-20','USD 36,636,694,134'),
('Net Assets of Fund','20-Jul-20','USD 37,992,258,237'),
('Fund Base Currency','','USD')]
This is my code:
data = []
for tag in soup.findAll("div", {"id": "keyFundFacts"}):
    for span in tag.findAll("div", {"class": "product-data-list data-points-en_GB"}):
        a = span.find("span", {"class": "caption"}).text
        b = span.find("span", {"class": "as-of-date"}).text
        c = span.find("span", {"class": "data"}).text
        data.append((a, b, c))
However, I only get one result when I look at the list 'data':
[('\nNet Assets of Share Class\n\nas of 20-Jul-20\n\n', '\nas of 20-Jul-20\n', '\nUSD 36,636,694,134\n')]
Aside from needing to strip out the new lines, I know I am missing something to get the script to go through all the other spans but have been staring at the screen for so long, it isn't getting any clearer.
Can anyone help put me out of my misery?!
One solution is to loop over all the div elements nested under your main div (the one with class "product-data-list data-points-en_GB"). That way, you get the elements you want for each inner div.
for tag in soup.findAll("div", {"id": "keyFundFacts"}):
    for element in tag.findAll("div", {"class": "product-data-list data-points-en_GB"}):
        for divEle in element.findAll('div'):
            a = divEle.find("span", {"class": "caption"}).text
            b = divEle.find("span", {"class": "as-of-date"}).text
            c = divEle.find("span", {"class": "data"}).text
This makes for a lot of nested loops, so I don't recommend it. I suggest finding a more precise way. If you have a URL for the HTML, I could take a look.
I have stumbled upon a solution which seems to do the trick:
data = []
for tag in soup.findAll("div", {"id": "keyFundFacts"}):
    for element in tag.findAll("div", {"class": "product-data-list data-points-en_GB"}):
        for thing in element.findChildren('div'):
            a = thing.findNext("span", {"class": "caption"}).text
            b = thing.findNext("span", {"class": "as-of-date"}).text
            c = thing.findNext("span", {"class": "data"}).text
            data.append((a, b, c))
It's not perfect, but hopefully functional.
Thanks, all.
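The original loop returned one tuple because it iterated over the single outer product-data-list div, and find() only returns the first match inside it. A flatter alternative is to select the inner divs directly with a CSS child selector and strip the text as you go. This is a sketch against a condensed copy of the HTML from the question; the wrapping div with id keyFundFacts is assumed from the asker's code, and removing the "as of " prefix is an assumption about the desired output.

```python
from bs4 import BeautifulSoup

# Condensed copy of the markup from the question; the #keyFundFacts wrapper
# is assumed from the asker's code and is not shown in the original snippet.
html = """
<div id="keyFundFacts">
<div class="product-data-list data-points-en_GB">
  <div class="float-left in-left col-totalNetAssets">
    <span class="caption">
      Net Assets of Share Class
      <span class="as-of-date">as of 20-Jul-20</span>
    </span>
    <span class="data">USD 36,636,694,134</span>
  </div>
  <div class="float-left in-right col-totalNetAssetsFundLevel">
    <span class="caption">
      Net Assets of Fund
      <span class="as-of-date">as of 20-Jul-20</span>
    </span>
    <span class="data">USD 37,992,258,237</span>
  </div>
  <div class="float-left in-left col-baseCurrencyCode">
    <span class="caption">
      Fund Base Currency
      <span class="as-of-date">
      </span>
    </span>
    <span class="data">USD</span>
  </div>
</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []

# "> div" visits each direct child div of the list, one per data point.
for block in soup.select("#keyFundFacts div.product-data-list > div"):
    caption = block.find("span", class_="caption")
    # First direct text node of the caption, i.e. the label without the date.
    label = caption.find(string=True, recursive=False).strip()
    as_of = caption.find("span", class_="as-of-date").get_text(strip=True)
    date = as_of.replace("as of ", "")
    value = block.find("span", class_="data").get_text(strip=True)
    rows.append((label, date, value))

print(rows)
```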

split entries from source in dataframe - and put them all in one entry

A rather tricky question today, for me at least. I want to split the entries in the result so that each one gets its own line, but all within the same DataFrame entry. Can anyone help? Thanks!
Here's my HTML:
html ='''<div data-itf-inject="BeneficialNames"><div><ul class="bullet_list" data-six-show-max="2"><li>Mr. Fox</li><li>Mr. Gander</li><li style="display: none;">Mr. Daepp</li><li style="display: none;">Power&Brothers Memory Fund III GP Ltd</li></ul><a data-six-showmore="true" href="#" style="display: inline-block;"><i class="fa fa-chevron-circle-down" title="Mehr anzeigen"></i> Mehr anzeigen</a></div></div>'''
I put it into BS:
h = BeautifulSoup(html, 'html.parser')
then I get the text out.
BN = h.find('div', {'data-itf-inject': "BeneficialNames"}).text
Which returns a rather messy result.
Now I'd like to put that into a single DataFrame entry, much like a multi-index, but within one DataFrame.
The rest of the DataFrame already exists. With the addition, it looks like this:
ISSUER      SHARE   BN
'Company'   '95'    'Mr. FoxMr. GanderMr. DaeppPower&Brothers Memory Fund III GP Ltd'
But I want it to look like this:
ISSUER      SHARE   BN
'Company'   '95'    'Mr. Fox'
                    'Mr. Gander'
                    'Mr. Daepp'
                    'Power&Brothers Memory Fund III GP Ltd'
What do I do? Thanks!
What about this solution?
from simplified_scrapy.simplified_doc import SimplifiedDoc
html ='''
<div data-itf-inject="BeneficialNames">
<div>
<ul class="bullet_list" data-six-show-max="2">
<li>Mr. Fox</li>
<li>Mr. Gander</li>
<li style="display: none;">Mr. Daepp</li>
<li style="display: none;">Power&Brothers Memory Fund III GP Ltd</li>
</ul><a data-six-showmore="true" href="#" style="display: inline-block;"><i class="fa fa-chevron-circle-down"
title="Mehr anzeigen"></i> Mehr anzeigen</a>
</div>
</div>
'''
doc = SimplifiedDoc(html)
div = doc.select('div#data-itf-inject=BeneficialNames')
lis = div.ul.lis
print ([li.text for li in lis])
Result:
['Mr. Fox', 'Mr. Gander', 'Mr. Daepp', 'Power&Brothers Memory Fund III GP Ltd']
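The same list can also be built with BeautifulSoup, which the question already uses. The messy result came from calling .text on the whole div, which concatenates every list item. Collecting the li elements one by one keeps the names separate, and joining them with a newline gives a single string that could be stored in one cell. The variable names names and bn_cell are illustrative, and how the multi-line string is displayed depends on how you render the DataFrame.

```python
from bs4 import BeautifulSoup

html = '''<div data-itf-inject="BeneficialNames"><div><ul class="bullet_list" data-six-show-max="2"><li>Mr. Fox</li><li>Mr. Gander</li><li style="display: none;">Mr. Daepp</li><li style="display: none;">Power&Brothers Memory Fund III GP Ltd</li></ul><a data-six-showmore="true" href="#" style="display: inline-block;"><i class="fa fa-chevron-circle-down" title="Mehr anzeigen"></i> Mehr anzeigen</a></div></div>'''

h = BeautifulSoup(html, "html.parser")
container = h.find("div", {"data-itf-inject": "BeneficialNames"})

# One entry per <li>, instead of the concatenated text of the whole div.
names = [li.get_text(strip=True) for li in container.find_all("li")]

# A single string with one name per line, suitable for one cell.
bn_cell = "\n".join(names)
print(bn_cell)
```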

Get value from multiple children that have the same parent name and child name in Selenium using Python

I want to fetch the value only for "Publisher". Both parent elements have the same class name, and I can't figure out how to do it.
<div class="block-record-info">
<div class="title3">Publisher</div>
<p class="FR_field">
<value>INFORMS, 5521 RESEARCH PARK DR, SUITE 200, CATONSVILLE, MD 21228 USA</value>
</p>
</div>
<div class="block-record-info">
<div class="title3">Categories / Classification</div>
<p class="FR_field">
<span class="FR_label">Research Areas:</span>
Computer Science; Operations Research & Management Science
</p>
The code I used:
valuexpath1 = '//div[@class="block-record-info"]'
valueElement1 = driver.find_element_by_xpath(valuexpath1)
valuexpath2 = '//*'
valueElement2 = valueElement1.find_element_by_xpath(valuexpath2)
valueValue2 = valueElement2.text
print(valueValue2)
It gives me the values of both "Categories / Classification" and "Publisher", but I want only the publisher.
As the first div doesn't have a span element, you can exclude any div that contains p/span:
valuexpath1 = '//div[@class="block-record-info" and not(p/span)]'
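Another way to think about it is to anchor on the "Publisher" title itself instead of on the absence of a span, so the match does not depend on what the other blocks contain. The sketch below illustrates that idea offline with the standard library's xml.etree.ElementTree, which supports only a limited XPath subset. It is a stand-in for the Selenium driver, not Selenium code. The root wrapper, the closing div and the escaped ampersand were added so the snippet parses as XML.

```python
import xml.etree.ElementTree as ET

# Well-formed copy of the markup from the question. The <root> wrapper,
# the final </div> and "&amp;" are additions needed by the XML parser.
markup = """
<root>
  <div class="block-record-info">
    <div class="title3">Publisher</div>
    <p class="FR_field">
      <value>INFORMS, 5521 RESEARCH PARK DR, SUITE 200, CATONSVILLE, MD 21228 USA</value>
    </p>
  </div>
  <div class="block-record-info">
    <div class="title3">Categories / Classification</div>
    <p class="FR_field">
      <span class="FR_label">Research Areas:</span>
      Computer Science; Operations Research &amp; Management Science
    </p>
  </div>
</root>
"""

root = ET.fromstring(markup)

# Pick the block whose child <div> has the text "Publisher",
# then read the <value> inside its paragraph.
xpath = ".//div[@class='block-record-info'][div='Publisher']/p/value"
publisher = root.find(xpath).text.strip()
print(publisher)
```

In a full XPath engine such as the browser's, a comparable expression would test the title3 div's text in a predicate; treat the exact string as something to verify against your page.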

Retrieve bbc weather data with identical span class and nested spans

I am trying to pull data from BBC Weather with a view to using it in a home automation dashboard.
I can pull the HTML fine, and I can pull one set of temperatures, but it only pulls the first.
</li>
<li class="daily__day-tab day-20150418 ">
<a data-ajax-href="/weather/en/2646504/daily/2015-04-18?day=3" href="/weather/2646504?day=3" rel="nofollow">
<div class="daily__day-header">
<h3 class="daily__day-date">
<span aria-label="Saturday" class="day-name">Sat</span>
</h3>
</div>
<span class="weather-type-image weather-type-image-40" title="Sunny"><img alt="Sunny" src="http://static.bbci.co.uk/weather/0.5.327/images/icons/tab_sprites/40px/1.png"/></span>
<span class="max-temp max-temp-value"> <span class="units-values temperature-units-values"><span class="units-value temperature-value temperature-value-unit-c" data-unit="c">13<span class="unit">°C</span></span><span class="unit-types-separator"> </span><span class="units-value temperature-value temperature-value-unit-f" data-unit="f">55<span class="unit">°F</span></span></span></span>
<span class="min-temp min-temp-value"> <span class="units-values temperature-units-values"><span class="units-value temperature-value temperature-value-unit-c" data-unit="c">5<span class="unit">°C</span></span><span class="unit-types-separator"> </span><span class="units-value temperature-value temperature-value-unit-f" data-unit="f">41<span class="unit">°F</span></span></span></span>
<span class="wind wind-speed windrose-icon windrose-icon--average windrose-icon-40 windrose-icon-40--average wind-direction-ene" data-tooltip-kph="31 km/h, East North Easterly" data-tooltip-mph="19 mph, East North Easterly" title="19 mph, East North Easterly">
<span class="speed"> <span class="wind-speed__description wind-speed__description--average">Wind Speed</span>
<span class="units-values windspeed-units-values"><span class="units-value windspeed-value windspeed-value-unit-kph" data-unit="kph">31 <span class="unit">km/h</span></span><span class="unit-types-separator"> </span><span class="units-value windspeed-value windspeed-value-unit-mph" data-unit="mph">19 <span class="unit">mph</span></span></span></span>
<span class="description blq-hide">East North Easterly</span>
</span>
This is my code which isn’t working
import urllib2
import pprint
from bs4 import BeautifulSoup
htmlFile=urllib2.urlopen('http://www.bbc.co.uk/weather/2646504?day=1')
htmlData = htmlFile.read()
soup = BeautifulSoup(htmlData)
table=soup.find("div","daily-window")
temperatures=[str(tem.contents[0]) for tem in table.find_all("span",class_="units-value temperature-value temperature-value-unit-c")]
mintemp=[str(min.contents[0]) for min in table.find_("span",class_="min-temp min-temp-value")]
maxtemp=[str(min.contents[0]) for min in table.find_all("span",class_="max-temp max-temp-value")]
windspeeds=[str(speed.contents[0]) for speed in table.find_all("span",class_="units-value windspeed-value windspeed-value-unit-mph")]
pprint.pprint(zip(temperatures,temp2,windspeeds))
Your min and max temperature extraction is wrong. You find the whole min temp span, which includes both the °C and °F values, so taking the first item of its contents gives you only whitespace rather than a number.
Also, the class that identifies the min temp tag (min-temp min-temp-value) is not the same as the class on the Celsius value (temperature-value-unit-c), so I suggest using a CSS selector.
For example, all of your min temp spans can be found with:
table.select('span.min-temp.min-temp-value span.temperature-value-unit-c')
This selects all spans with class temperature-value-unit-c that are descendants of spans with the classes min-temp and min-temp-value.
The other lists, such as max temp and wind, can be built the same way.
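Here is a small sketch of that selector approach, run against a trimmed copy of one day tab from the question instead of the live BBC page. The wrapping div with class daily-window is assumed from the asker's code, and the helper name celsius is made up for this example.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the max/min spans from the question; the
# <div class="daily-window"> wrapper is assumed from the asker's code.
html = """
<div class="daily-window">
<span class="max-temp max-temp-value"> <span class="units-values temperature-units-values"><span class="units-value temperature-value temperature-value-unit-c" data-unit="c">13<span class="unit">°C</span></span><span class="units-value temperature-value temperature-value-unit-f" data-unit="f">55<span class="unit">°F</span></span></span></span>
<span class="min-temp min-temp-value"> <span class="units-values temperature-units-values"><span class="units-value temperature-value temperature-value-unit-c" data-unit="c">5<span class="unit">°C</span></span><span class="units-value temperature-value temperature-value-unit-f" data-unit="f">41<span class="unit">°F</span></span></span></span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("div", "daily-window")


def celsius(parent_class):
    """Return the Celsius values nested under spans with parent_class."""
    selector = "span.%s span.temperature-value-unit-c" % parent_class
    # contents[0] is the number; the unit sits in a nested <span>.
    return [int(tag.contents[0]) for tag in table.select(selector)]


max_c = celsius("max-temp")
min_c = celsius("min-temp")
print(list(zip(max_c, min_c)))
```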

Get span text from a website using selenium

The website I'm trying to scrape looks like this:
<div align="center" class="movietable">
<span style="width:45px;height:47px;vertical-align:middle;display:table-cell;">
<img border="0" src="styles/images/cat/hd.png" alt="HdO">
</span>
</div>
<div align="left" class="movietable">
<span style="padding:0px 5px;width:455px;height:47px;vertical-align:middle;display:table-cell;">
<a data-toggle="tooltip" data-placement="bottom" data-html="true" title="" href="details.php?id=578197" data-original-title="<img src='https://trasd.tmdb.org//tqistSlQGQVlvDZHweD.jpg'>">
<b>GET THIS TEXT</b></a><br><font class="small">[Action, Horror, Sci-Fi]</font>
</span>
</div>
How can I extract:
The text in the <b> tag - in this case GET THIS TEXT
The content of the font element with class 'small' - in this case Action, Horror, Sci-Fi
.movietable b works great!!
The img src link - in this case https://trasd.tmdb.org//tqistSlQGQVlvDZHweD.jpg
I have no idea how to do this.
Below are CSS selectors you can use:
driver.find_element_by_css_selector('div[align=left] b')
driver.find_element_by_css_selector('div[align=left] .small')
driver.find_element_by_css_selector('a[title]').get_attribute('data-original-title')
You can access all of them using xpath:
1) [parents before this div]/div[2]/span/a/b
2) [parents before this div]/div[2]/span/font
3) [parents before this div]/div[1]/span/a/img
[parents before this div] should be /html/body/...
Based on the HTML you have shared, you can extract the items with the following XPath expressions:
GET THIS TEXT:
driver.find_element_by_xpath("//div[@class='movietable' and @align='left']/span/a[@data-toggle='tooltip' and @data-placement='bottom']/b").get_attribute("innerHTML")
[Action, Horror, Sci-Fi]:
driver.find_element_by_xpath("//div[@class='movietable' and @align='left']/span//font[@class='small']").get_attribute("innerHTML")
https://trasd.tmdb.org//tqistSlQGQVlvDZHweD.jpg:
img_src = driver.find_element_by_xpath("//div[@class='movietable' and @align='left']/span/a[@data-toggle='tooltip' and @data-placement='bottom']").get_attribute("data-original-title")
src = img_src.replace("'", "-").split("-")
print(src[1])
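Turning the quotes into hyphens and splitting on them works for this URL, but it would cut the link short if the URL itself contained a hyphen. A slightly more robust sketch is to pull the src value out with a regular expression once you have the attribute string. The two input strings below are copied from the HTML in the question and stand in for what get_attribute("data-original-title") and the font element's text would return. No browser is involved here.

```python
import re

# Stand-in for the value of the data-original-title attribute.
tooltip = "<img src='https://trasd.tmdb.org//tqistSlQGQVlvDZHweD.jpg'>"
# Stand-in for the text of the <font class="small"> element.
genres_text = "[Action, Horror, Sci-Fi]"

# Capture everything between the quotes that follow src=.
match = re.search(r"src=['\"]([^'\"]+)['\"]", tooltip)
img_url = match.group(1) if match else None

# Drop the surrounding brackets and split the genres into a list.
genres = [g.strip() for g in genres_text.strip("[]").split(",")]

print(img_url)
print(genres)
```

Note that the find_element_by_* helpers used in the answers above were removed in Selenium 4, where the equivalent call is driver.find_element(By.XPATH, ...) or driver.find_element(By.CSS_SELECTOR, ...).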
