Retrieve bbc weather data with identical span class and nested spans - python

I am trying to pull data from BBC Weather to use in a home automation dashboard.
I can fetch the HTML fine and I can extract one set of temperatures, but my code only pulls the first.
</li>
<li class="daily__day-tab day-20150418 ">
<a data-ajax-href="/weather/en/2646504/daily/2015-04-18?day=3" href="/weather/2646504?day=3" rel="nofollow">
<div class="daily__day-header">
<h3 class="daily__day-date">
<span aria-label="Saturday" class="day-name">Sat</span>
</h3>
</div>
<span class="weather-type-image weather-type-image-40" title="Sunny"><img alt="Sunny" src="http://static.bbci.co.uk/weather/0.5.327/images/icons/tab_sprites/40px/1.png"/></span>
<span class="max-temp max-temp-value"> <span class="units-values temperature-units-values"><span class="units-value temperature-value temperature-value-unit-c" data-unit="c">13<span class="unit">°C</span></span><span class="unit-types-separator"> </span><span class="units-value temperature-value temperature-value-unit-f" data-unit="f">55<span class="unit">°F</span></span></span></span>
<span class="min-temp min-temp-value"> <span class="units-values temperature-units-values"><span class="units-value temperature-value temperature-value-unit-c" data-unit="c">5<span class="unit">°C</span></span><span class="unit-types-separator"> </span><span class="units-value temperature-value temperature-value-unit-f" data-unit="f">41<span class="unit">°F</span></span></span></span>
<span class="wind wind-speed windrose-icon windrose-icon--average windrose-icon-40 windrose-icon-40--average wind-direction-ene" data-tooltip-kph="31 km/h, East North Easterly" data-tooltip-mph="19 mph, East North Easterly" title="19 mph, East North Easterly">
<span class="speed"> <span class="wind-speed__description wind-speed__description--average">Wind Speed</span>
<span class="units-values windspeed-units-values"><span class="units-value windspeed-value windspeed-value-unit-kph" data-unit="kph">31 <span class="unit">km/h</span></span><span class="unit-types-separator"> </span><span class="units-value windspeed-value windspeed-value-unit-mph" data-unit="mph">19 <span class="unit">mph</span></span></span></span>
<span class="description blq-hide">East North Easterly</span>
</span>
This is my code, which isn't working:
import urllib2
import pprint
from bs4 import BeautifulSoup
htmlFile=urllib2.urlopen('http://www.bbc.co.uk/weather/2646504?day=1')
htmlData = htmlFile.read()
soup = BeautifulSoup(htmlData)
table=soup.find("div","daily-window")
temperatures=[str(tem.contents[0]) for tem in table.find_all("span",class_="units-value temperature-value temperature-value-unit-c")]
mintemp=[str(min.contents[0]) for min in table.find_all("span",class_="min-temp min-temp-value")]
maxtemp=[str(min.contents[0]) for min in table.find_all("span",class_="max-temp max-temp-value")]
windspeeds=[str(speed.contents[0]) for speed in table.find_all("span",class_="units-value windspeed-value windspeed-value-unit-mph")]
pprint.pprint(zip(temperatures,temp2,windspeeds))

Your min and max temperature extraction is wrong. You select the whole min-temp span (which contains both the °C and °F values), so taking the first item of its contents gives you only a whitespace string.
The min-temp span (class="min-temp min-temp-value") is also not the same element as the Celsius value span (class "temperature-value-unit-c") nested inside it. So I suggest you use a CSS selector.
For example, finding all of your min temp spans could be:
table.select('span.min-temp.min-temp-value span.temperature-value-unit-c')
This selects every span with class temperature-value-unit-c that is a descendant of a span with classes min-temp and min-temp-value.
The same approach works for the other lists, such as max temperature and wind speed.
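A minimal sketch of that approach, assuming Python 3 with bs4 installed and a trimmed copy of the HTML from the question instead of a live request (the helper name celsius_values is made up for illustration):

```python
from bs4 import BeautifulSoup

# Trimmed copy of one day tab from the question's HTML (assumption: the live
# page wraps these tabs in <div class="daily-window">, as the question's code implies).
html = """
<div class="daily-window">
<li class="daily__day-tab day-20150418 ">
<span class="max-temp max-temp-value"> <span class="units-values temperature-units-values"><span class="units-value temperature-value temperature-value-unit-c" data-unit="c">13<span class="unit">°C</span></span><span class="units-value temperature-value temperature-value-unit-f" data-unit="f">55<span class="unit">°F</span></span></span></span>
<span class="min-temp min-temp-value"> <span class="units-values temperature-units-values"><span class="units-value temperature-value temperature-value-unit-c" data-unit="c">5<span class="unit">°C</span></span><span class="units-value temperature-value temperature-value-unit-f" data-unit="f">41<span class="unit">°F</span></span></span></span>
<span class="units-values windspeed-units-values"><span class="units-value windspeed-value windspeed-value-unit-mph" data-unit="mph">19 <span class="unit">mph</span></span></span>
</li>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("div", "daily-window")


def celsius_values(parent_selector):
    """Return the Celsius readings nested under spans matching parent_selector."""
    spans = table.select(parent_selector + " span.temperature-value-unit-c")
    # contents[0] is the bare number; the nested <span class="unit"> follows it.
    return [str(span.contents[0]).strip() for span in spans]


max_temps = celsius_values("span.max-temp.max-temp-value")
min_temps = celsius_values("span.min-temp.min-temp-value")
wind_mph = [str(span.contents[0]).strip()
            for span in table.select("span.windspeed-value-unit-mph")]

print(list(zip(max_temps, min_temps, wind_mph)))
```

Selecting the nested Celsius span directly means contents[0] is the number itself rather than the whitespace that precedes the inner spans.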

Related

Parse complex <li> tag using beautifulSoup

I have a webpage with following code:
<li>
Thalassery (<a class="mw-redirect" href="/wiki/Malayalam_language" title="Malayalam language">Malayalam</a>: <span lang="ml">തലശ്ശേരി</span>), from
<i>Tellicherry</i></li>
<li>Thanjavur (Tamil: <span lang="ta">தஞ்சாவூர்</span>), from British name <i>Tanjore</i></li>
<li>Thane (Marathi: <span lang="mr">ठाणे</span>), from British name <i>Tannah</i></li>
<li>Thoothukudi (Tamil: <span lang="ta">தூத்துக்குடி</span>), from <i>Tuticorin</i> and its short form <i>Tuty</i></li>
I need to parse this so that the result contains words like: Thalassery, Tellicherry, Thanjavur, Tanjore, Thane, Tannah, Thoothukudi, Tuticorin
Can anyone please help with this?
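You can use .findAll() to get all the li elements, take the text before the first '(' as the current name, and use find() on the 'i' tag for the former name. (The 'a' tag only holds the language link and is missing from most items, so it cannot be used for the name.)
for item in soup.findAll('li'):
    name = item.get_text().split('(')[0].strip()
    print(name, item.find('i').text)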
>>>
Thalassery Tellicherry
Thanjavur Tanjore
Thane Tannah
Thoothukudi Tuticorin
You could also try simplified_scrapy, which is fault-tolerant:
from simplified_scrapy.simplified_doc import SimplifiedDoc
html='''
<li>
Thalassery (<a class="mw-redirect" href="/wiki/Malayalam_language" title="Malayalam language">Malayalam</a>: <span lang="ml">തലശ്ശേരി</span>), from
<i>Tellicherry</i></li>
<li>Thanjavur (Tamil: <span lang="ta">தஞ்சாவூர்</span>), from British name <i>Tanjore</i></li>
<li>Thane (Marathi: <span lang="mr">ठाणे</span>), from British name <i>Tannah</i></li>
<li>Thoothukudi (Tamil: <span lang="ta">தூத்துக்குடி</span>), from <i>Tuticorin</i> and its short form <i>Tuty</i></li>
'''
doc = SimplifiedDoc(html)
lis = doc.lis
print ([(li.a.text,li.i.text if li.i else '') for li in lis])
Result:
[('Thalassery', 'Tellicherry'), ('Thanjavur', 'Tanjore'), ('Thane', 'Tannah'), ('Thoothukudi', 'Tuticorin')]

Encode() not working in all cases

I am using Beautiful Soup 4 to scan through an HTML file and extract certain features. Specifically, I am using it to find soccer player names, clubs, leagues, stats, etc. Since many player and club names have accent marks, I am looking for a way to print out these accent marks rather than seeing output like "Kak\xe1". I was able to make it work by using:
# open html page
fut_page = open('futhead1.html','r')
# read content from html page
fut_read = fut_page.read()
# html parsed page
fut_soup = BeautifulSoup(fut_read, "html.parser")
# grabs all players
players = fut_soup.findAll('li',{'class':'list-group-item list-group-table-row player-group-item dark-hover'})
player = players[2]
# name_tag contains tag with player's name
name_tag = player.find("span",{"class":"player-name"})
# extract just the player's name
player_name = name_tag.text
print player_name.encode('utf-8')
This prints out the proper player name: "Kaká". However, I do not see the same result when using a regex to extract the club name, for example:
regex_club = re.compile(ur'\[.*?</strong>\\n\s+\|\s\\n\s+(.*?)\\n', re.MULTILINE)
# extract club name
player_club = re.match(regex_club, str(pos_clb_lge_tag))
print player_club.group(1).encode('utf-8')
This code prints the club name as "Atl\xe9tico Madrid", but encode() does not replace "\xe9" with "é".
Below is the piece of the HTML file where I apply the regex:
<li class="list-group-item list-group-table-row player-group-item dark-hover">
<div class="content player-item font-24">
<a class="display-block padding-0" href="/fifa-mobile/17/players/33194/jan-oblak/">
<span class="player-rating stream-col-50 text-center">
<span class="revision-gradient shadowed font-12 fut elite">100</span>
</span>
<span class="player-info">
<img class="player-image" src="http://futhead.cursecdn.com/static/img/fm/17/players/200389_SASC.png">
<img class="player-program" src="http://futhead.cursecdn.com/static/img/fm/17/resources/program_17_VSATTACK.png">
<span class="player-name">Jan Oblak</span>
<span class="player-club-league-name">
<strong>GK</strong>
|
Atlético Madrid
|
LaLiga Santander
</span>
</span>
<span class="player-right text-center hidden-xs">
<span class="player-stat stream-col-60 hidden-md hidden-sm"><span class="value">83</span><span class="hover-label">PAC</span></span><span class="player-stat stream-col-60 hidden-md hidden-sm"><span class="value">50</span><span class="hover-label">SHO</span></span><span class="player-stat stream-col-60 hidden-md hidden-sm"><span class="value">66</span><span class="hover-label">PAS</span></span><span class="player-stat stream-col-60 hidden-md hidden-sm"><span class="value">55</span><span class="hover-label">DRI</span></span><span class="player-stat stream-col-60 hidden-md hidden-sm"><span class="value">58</span><span class="hover-label">DEF</span></span><span class="player-stat stream-col-60 hidden-md hidden-sm"><span class="value">85</span><span class="hover-label">PHY</span></span><span class="player-stat stream-col-60 font-12 font-medium text-upper">35</span>
</span>
<span class="player-right slide hidden-sm hidden-xs" data-direction="right" data-max="-482px">
<span class="slide-content text-upper">
<span class="trigger icon icon-dots-three-horizontal"></span>
<span class="player-stat stream-col-80">
<span class="value">+2</span>
<span class="hover-label">MRK</span>
</span>
<span class="player-stat stream-col-80">
<span class="value">+1</span>
<span class="hover-label">OVR</span>
</span>
<span class="player-stat stream-col-100"><span class="value">right</span><span class="hover-label">Strong Foot</span></span>
<span class="player-stat stream-col-100"><span class="value">18<span class="icon icon-star gold margin-l-4"></span></span><span class="hover-label">Weak Foot</span></span>
</span>
</span>
</a>
</div>
So basically, why does encode() not work when I use a regex as an intermediate step? If any further clarification is needed, please let me know. Thank you.
I suspect you haven't shown all the code (see [mcve]), but calling str on the Unicode object is the wrong thing to do, and should have given:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 40: ordinal not in range(128)
I suspect you've called setdefaultencoding, which is a bad habit.
What str() did was convert the Unicode string into a byte string with escape code text, e.g. '\\n' (two characters) instead of '\n' (one character), and it did the same for the non-ASCII character.
If your terminal is configured correctly, you should not have to manually encode the final result when printing, either.
Here's a working example that uses BeautifulSoup to retrieve just the text to parse:
from bs4 import BeautifulSoup
import re
# open html page
fut_page = open('futhead1.html','r')
# read content from html page
fut_read = fut_page.read()
# html parsed page
fut_soup = BeautifulSoup(fut_read, "html.parser")
# grabs all players
players = fut_soup.findAll('li',{'class':'list-group-item list-group-table-row player-group-item dark-hover'})
player = players[0]
# name_tag contains the tag with the player's position, club and league
name_tag = player.find("span",{"class":"player-club-league-name"})
# take the last text node inside the tag (the text after <strong>)
pos_clb_lge_tag = name_tag.contents[-1]
regex_club = re.compile(ur'\n\s+\|\s\n\s+(.*?)\n')
# extract club name
player_club = regex_club.match(pos_clb_lge_tag)
print player_club.group(1)
Atlético Madrid
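
The escaping effect described in this answer can be reproduced in isolation. The following is a small Python 3 sketch (the code in this thread is Python 2), using only the standard library. It assumes the damage is equivalent to a unicode_escape round trip, which is an approximation of what happened in the question rather than a reconstruction of it:

```python
club = "Atl\u00e9tico Madrid"   # real text: 'é' is a single character

# Simulate the damage: the non-ASCII character becomes literal backslash text.
escaped = club.encode("unicode_escape").decode("ascii")
print(escaped)

# Encoding the damaged string to UTF-8 cannot help: the backslash, 'x', 'e'
# and '9' are ordinary ASCII characters and are encoded unchanged.
still_escaped = escaped.encode("utf-8")
print(still_escaped)

# Reversing the escape recovers the original text, but avoiding the
# conversion in the first place (working on the text node) is simpler.
restored = escaped.encode("ascii").decode("unicode_escape")
print(restored)
```

This is why calling encode('utf-8') after the regex made no difference: by that point the string no longer contained the character é at all.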

Python Selenium get_attribute not returning value

I am trying to extract the data-message-id from the following HTML. My original goal is to extract the data-message-id for the span containing a particular text and then click on the star_button to star it.
<div class="message_content_header">
<div class="message_content_header_left">
krishnag0902
<span class="ts_tip_float message_current_status ts_tip ts_tip_top ts_tip_multiline ts_tip_delay_150 color_U5TPDSMQQ color_9f69e7 hidden ts_tip_hidden">
<span class="ts_tip_tip ts_tip_inner_current_status">
<span class="ts_tip_multiline_inner">
</span>
</span>
</span>
<i class="copy_only">[</i>4:34 PM<i class="copy_only">]</i><span class="ts_tip_tip"><span class="ts_tip_multiline_inner">Yesterday at 4:34:07 PM</span></span>
<span class="message_star_holder">
Star this message
</div>
</div>
<span class="message_body">hoho<span class="constrain_triple_clicks"></span></span>
<div class="rxn_panel rxns_key_message-1498084447_119862-C5UGEFBS9"></div>
<i class="copy_only"><br></i>
<span id="msg_1498084447_119862_label" class="message_aria_label hidden">
<strong>krishnag0902</strong>.
hoho.
four thirty-four PM.
</span>
I am using the following code on the span above (message_star_holder), which returns None:
data_mess = star_button_span.find_element_by_xpath("//button[@class="
    "'star ts_icon ts_icon_star_o ts_icon_inherit ts_tip_top star_message "
    "ts_tip ts_tip_float ts_tip_hidden btn_unstyle']")
print data_mess.get_attribute("innerHTML")
print star_button_span.get_attribute("data-msg-id")
star_button_span doesn't have a data-msg-id attribute; data_mess does:
print data_mess.get_attribute("data-msg-id")
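
The same point can be checked without a browser. The sketch below uses the standard library's xml.etree.ElementTree rather than Selenium, and the button markup is hypothetical: the question's HTML does not show the button, so its class list is shortened and the data-msg-id value is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the real button was not included in the question,
# so the class list and the data-msg-id value here are assumptions.
snippet = """
<span class="message_star_holder">
  <button class="star star_message" data-msg-id="1498084447.119862">Star this message</button>
</span>
"""

star_button_span = ET.fromstring(snippet.strip())

# The leading ".//" keeps the search relative to the span element.
button = star_button_span.find(".//button[@class='star star_message']")

print(star_button_span.get("data-msg-id"))  # the span has no such attribute
print(button.get("data-msg-id"))            # the attribute lives on the button
```

Looking up an attribute on an element never falls through to its children, which is why the span lookup comes back empty.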

How to find the number of text objects in BeautifulSoup object

I'm scraping a Wikipedia page using BeautifulSoup in Python and I was wondering whether there is any way to know the number of text objects in an HTML object. For example, the following code gets me the following HTML:
soup.find_all(class_ = 'toctext')
<span class="toctext">Actors and actresses</span>, <span class="toctext">Archaeologists and anthropologists</span>, <span class="toctext">Architects</span>, <span class="toctext">Artists</span>, <span class="toctext">Broadcasters</span>, <span class="toctext">Businessmen</span>, <span class="toctext">Chefs</span>, <span class="toctext">Clergy</span>, <span class="toctext">Criminals</span>, <span class="toctext">Conspirators</span>, <span class="toctext">Economists</span>, <span class="toctext">Engineers</span>, <span class="toctext">Explorers</span>, <span class="toctext">Filmmakers</span>, <span class="toctext">Historians</span>, <span class="toctext">Humourists</span>, <span class="toctext">Inventors / engineers</span>, <span class="toctext">Journalists / newsreaders</span>, <span class="toctext">Military: soldiers/sailors/airmen</span>, <span class="toctext">Monarchs</span>, <span class="toctext">Musicians</span>, <span class="toctext">Philosophers</span>, <span class="toctext">Photographers</span>, <span class="toctext">Politicians</span>, <span class="toctext">Scientists</span>, <span class="toctext">Sportsmen and sportswomen</span>, <span class="toctext">Writers</span>, <span class="toctext">Other notables</span>, <span class="toctext">English expatriates</span>, <span class="toctext">References</span>, <span class="toctext">See also</span>
I can get the first text object by running the following:
soup.find_all(class_ = 'toctext')[0].text
My goal here is to get and store all of the text objects in a list. I'm doing this with a for loop; however, I don't know how many text objects there are in the HTML block. Naturally, I would hit an error if I got to an index that doesn't exist. Is there an alternative?
You can use a for...in loop.
In [13]: [t.text for t in soup.find_all(class_ = 'toctext')]
Out[13]:
['Actors and actresses',
'Archaeologists and anthropologists',
'Architects',
'Artists',
'Broadcasters',
'Businessmen',
'Chefs',
'Clergy',
'Criminals',
'Conspirators',
'Economists',
'Engineers',
'Explorers',
'Filmmakers',
'Historians',
'Humourists',
'Inventors / engineers',
'Journalists / newsreaders',
'Military: soldiers/sailors/airmen',
'Monarchs',
'Musicians',
'Philosophers',
'Photographers',
'Politicians',
'Scientists',
'Sportsmen and sportswomen',
'Writers',
'Other notables',
'English expatriates',
'References',
'See also']
Try the following code:
for txt in soup.find_all(class_ = 'toctext'):
    print(txt.text)
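
If the count itself is needed, the result of find_all behaves like a list, so len() works on it. A short sketch, assuming Python 3 with bs4 installed and using a three-item excerpt of the HTML from the question:

```python
from bs4 import BeautifulSoup

# Three of the spans from the question, enough to show the idea.
html = (
    '<span class="toctext">Actors and actresses</span>'
    '<span class="toctext">Architects</span>'
    '<span class="toctext">Artists</span>'
)
soup = BeautifulSoup(html, "html.parser")

spans = soup.find_all(class_="toctext")
count = len(spans)                      # number of matching elements
texts = [span.text for span in spans]   # no index bookkeeping needed

print(count)
print(texts)
```

Iterating over the result directly avoids any index that could run past the end.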

Get span text from a website using selenium

The website I'm trying to scrape looks like this:
<div align="center" class="movietable">
<span style="width:45px;height:47px;vertical-align:middle;display:table-cell;">
<img border="0" src="styles/images/cat/hd.png" alt="HdO">
</span>
</div>
<div align="left" class="movietable">
<span style="padding:0px 5px;width:455px;height:47px;vertical-align:middle;display:table-cell;">
<a data-toggle="tooltip" data-placement="bottom" data-html="true" title="" href="details.php?id=578197" data-original-title="<img src='https://trasd.tmdb.org//tqistSlQGQVlvDZHweD.jpg'>">
<b>GET THIS TEXT</b></a><br><font class="small">[Action, Horror, Sci-Fi]</font>
</span>
</div>
How can I extract:
The text in the <b> tag - in this case GET THIS TEXT
The content of the <font class="small"> tag - in this case this would be Action, Horror, Sci-Fi
.movietable b works great!!
The img src link - in this case it would be https://trasd.tmdb.org//tqistSlQGQVlvDZHweD.jpg
I have no idea how to do this.
Below are CSS selectors you can use:
driver.find_element_by_css_selector('div[align=left] b')
driver.find_element_by_css_selector('div[align=left] .small')
driver.find_element_by_css_selector('a[title]').get_attribute('data-original-title')
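These selectors can be tried offline before running a browser. The sketch below assumes Python 3 with bs4 installed and uses BeautifulSoup's select_one in place of Selenium, on a shortened copy of the HTML from the question:

```python
from bs4 import BeautifulSoup

# Shortened copy of the second div from the question.
html = """
<div align="left" class="movietable">
<span style="padding:0px 5px;">
<a data-toggle="tooltip" title="" href="details.php?id=578197" data-original-title="<img src='https://trasd.tmdb.org//tqistSlQGQVlvDZHweD.jpg'>">
<b>GET THIS TEXT</b></a><br><font class="small">[Action, Horror, Sci-Fi]</font>
</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Same CSS selectors as above, evaluated by BeautifulSoup instead of the driver.
title = soup.select_one("div[align=left] b").get_text()
genres = soup.select_one("div[align=left] .small").get_text()
tooltip = soup.select_one("a[title]")["data-original-title"]

print(title)
print(genres)
print(tooltip)
```

Note that the third selector returns the whole tooltip markup, so the image URL still has to be cut out of that string.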
You can access all of them using xpath:
1) [parents before this div]/div[2]/span/a/b
2) [parents before this div]/div[2]/span/font
3) [parents before this div]/div[1]/span/a/img
[parents before this div] should be /html/body/...
As per the HTML you have shared, you can extract the items with the following solution:
GET THIS TEXT:
driver.find_element_by_xpath("//div[@class='movietable' and @align='left']/span/a[@data-toggle='tooltip' and @data-placement='bottom']/b").get_attribute("innerHTML")
[Action, Horror, Sci-Fi]:
driver.find_element_by_xpath("//div[@class='movietable' and @align='left']/span//font[@class='small']").get_attribute("innerHTML")
https://trasd.tmdb.org//tqistSlQGQVlvDZHweD.jpg:
img_src = driver.find_element_by_xpath("//div[@class='movietable' and @align='left']/span/a[@data-toggle='tooltip' and @data-placement='bottom']").get_attribute("data-original-title")
# the attribute holds "<img src='...'>"; the URL is the part between the single quotes
src = img_src.split("'")
print(src[1])
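
Splitting on quote characters works for this exact string but is fragile if the tooltip markup changes. A slightly more defensive sketch using only the standard library's re module is shown below. The tooltip value is copied from the question, and the pattern assumes the src attribute is quoted:

```python
import re

# Value of the data-original-title attribute, as shown in the question.
tooltip = "<img src='https://trasd.tmdb.org//tqistSlQGQVlvDZHweD.jpg'>"

# Capture whatever sits between the quotes that follow src=.
match = re.search(r"src=['\"]([^'\"]+)['\"]", tooltip)
img_url = match.group(1) if match else None

print(img_url)
```

Returning None when nothing matches avoids the IndexError that a bare split would raise on an unexpected tooltip.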
