Python Selenium get_attribute not returning value - python

I am trying to extract the data-msg-id from the following HTML. My goal is to find the data-msg-id for the span containing a particular text and then click the star button to star that message.
<div class="message_content_header">
<div class="message_content_header_left">
krishnag0902
<span class="ts_tip_float message_current_status ts_tip ts_tip_top ts_tip_multiline ts_tip_delay_150 color_U5TPDSMQQ color_9f69e7 hidden ts_tip_hidden">
<span class="ts_tip_tip ts_tip_inner_current_status">
<span class="ts_tip_multiline_inner">
</span>
</span>
</span>
<i class="copy_only">[</i>4:34 PM<i class="copy_only">]</i><span class="ts_tip_tip"><span class="ts_tip_multiline_inner">Yesterday at 4:34:07 PM</span></span>
<span class="message_star_holder">
Star this message
</div>
</div>
<span class="message_body">hoho<span class="constrain_triple_clicks"></span></span>
<div class="rxn_panel rxns_key_message-1498084447_119862-C5UGEFBS9"></div>
<i class="copy_only"><br></i>
<span id="msg_1498084447_119862_label" class="message_aria_label hidden">
<strong>krishnag0902</strong>.
hoho.
four thirty-four PM.
</span>
I am running the following code on the span above (message_star_holder), but it returns None:
data_mess = star_button_span.find_element_by_xpath("//button[@class='star ts_icon ts_icon_star_o ts_icon_inherit ts_tip_top star_message ts_tip ts_tip_float ts_tip_hidden btn_unstyle']")
print data_mess.get_attribute("innerHTML")
print star_button_span.get_attribute("data-msg-id")

star_button_span doesn't have a data-msg-id attribute; data_mess (the button) does, so read the attribute from there:
print data_mess.get_attribute("data-msg-id")
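To illustrate why the call on the span returns None, here is a minimal sketch that uses the standard library's xml.etree.ElementTree instead of Selenium. The fragment is hypothetical: the question's HTML does not show the button, so its class and data-msg-id value are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the question's HTML omits the button element,
# so the attributes below are assumed for illustration.
fragment = (
    '<span class="message_star_holder">'
    '<button class="star star_message" data-msg-id="1498084447.119862">'
    'Star this message</button>'
    '</span>'
)

star_holder = ET.fromstring(fragment)
button = star_holder.find("./button")

# The attribute is defined on the button, not on the enclosing span,
# so looking it up on the span yields None.
holder_id = star_holder.get("data-msg-id")
button_id = button.get("data-msg-id")

print(holder_id)
print(button_id)
```

The same reasoning applies to Selenium's get_attribute: it only reads attributes of the element it is called on. Note also that in Selenium an XPath starting with // searches the whole document even when called on an element; prefix it with a dot (.//button) to search only inside star_button_span.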

Related

Find multiple tags with multiple search parameters with Beautifulsoup

I have html code that looks something like this (the soup):
<label for="02" class="highlited">"Some text here"</label>
<span class="type3 type3-display">
<label for="01" class="highlited">"Some text here"</label>
<span class="type1 type1-display">
<label> Somete text here </label>
<span class="type999 type999-display">
<span class="type1 type1-display">
I have to grab both the labels and the spans from the page, but with different search criteria:
for the labels, I need only those that have a for attribute (with any value);
for the spans, I need only those whose class contains a word from a list, e.g.
myList = ['type1', 'type2', 'type3']
The order in which they appear on the page must be preserved.
The result I need would look like this:
<label for="02" class="highlited">"Some text here"</label>
<span class="type3 type3-display">
<label for="01" class="highlited">"Some text here"</label>
<span class="type1 type1-display">
<span class="type1 type1-display">
To find the labels that have any value for the "for" attribute, I use the following code:
soup.find_all('label', {'for': re.compile('.*')}) # it works as expected
But now I also need to find all the spans with specific class words, while preserving the order in which they appear on the page.
I tried this, but it didn't work:
soup.find_all(['label', 'span'], [{'for': re.compile('.*')}, {'class': 'type1'}], recursive=False) # here I just used {'class': 'type1'} because I don't know how to pass a list to soup to search for a match
Thank you in advance!
Edit: I also tried to combine two find_all() searches with +, but then I lose the order.
You can do that without regex as well.
from bs4 import BeautifulSoup
data='''<label for="02" class="highlited">"Some text here"</label>
<span class="type3 type3-display"></span>
<label for="01" class="highlited">"Some text here"</label>
<span class="type1 type1-display"></span>
<label> Somete text here </label>
<span class="type999 type999-display"></span>
<span class="type1 type1-display"></span>'''
myList = ['type1', 'type2', 'type3']
soup=BeautifulSoup(data,'html.parser')
for item in soup.find_all():
    if (item.name=='label') and 'for' in item.attrs :
        print(item)
    if (item.name == 'span') and item['class'][0] in myList :
        print(item)
Output:
<label class="highlited" for="02">"Some text here"</label>
<span class="type3 type3-display"></span>
<label class="highlited" for="01">"Some text here"</label>
<span class="type1 type1-display"></span>
<span class="type1 type1-display"></span>
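The answer above only tests the first class of each span and would raise a KeyError on a span without a class attribute. As a sketch of the same single-pass, order-preserving idea, here is a variant that swaps BeautifulSoup for the standard library's html.parser and checks every class word. The class name OrderedTagCollector is made up for this example.

```python
from html.parser import HTMLParser


class OrderedTagCollector(HTMLParser):
    """Collect matching <label> and <span> start tags in document order."""

    def __init__(self, wanted_classes):
        super().__init__()
        self.wanted_classes = set(wanted_classes)
        self.matches = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "label" and "for" in attr_map:
            # Any label that has a "for" attribute qualifies.
            self.matches.append((tag, attr_map["for"]))
        elif tag == "span":
            # Split the class attribute so every class word is checked.
            classes = (attr_map.get("class") or "").split()
            if self.wanted_classes.intersection(classes):
                self.matches.append((tag, attr_map["class"]))


data = '''<label for="02" class="highlited">"Some text here"</label>
<span class="type3 type3-display"></span>
<label for="01" class="highlited">"Some text here"</label>
<span class="type1 type1-display"></span>
<label> Somete text here </label>
<span class="type999 type999-display"></span>
<span class="type1 type1-display"></span>'''

collector = OrderedTagCollector(['type1', 'type2', 'type3'])
collector.feed(data)
matches = collector.matches

for match in matches:
    print(match)
```

Because the parser visits tags in the order they occur in the source, the page order is preserved without merging two separate searches.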

How to find the number of text objects in BeautifulSoup object

I'm scraping a Wikipedia page using BeautifulSoup in Python, and I was wondering whether there is any way to know the number of text objects in an HTML object. For example, the following code gets me the following HTML:
soup.find_all(class_ = 'toctext')
<span class="toctext">Actors and actresses</span>, <span class="toctext">Archaeologists and anthropologists</span>, <span class="toctext">Architects</span>, <span class="toctext">Artists</span>, <span class="toctext">Broadcasters</span>, <span class="toctext">Businessmen</span>, <span class="toctext">Chefs</span>, <span class="toctext">Clergy</span>, <span class="toctext">Criminals</span>, <span class="toctext">Conspirators</span>, <span class="toctext">Economists</span>, <span class="toctext">Engineers</span>, <span class="toctext">Explorers</span>, <span class="toctext">Filmmakers</span>, <span class="toctext">Historians</span>, <span class="toctext">Humourists</span>, <span class="toctext">Inventors / engineers</span>, <span class="toctext">Journalists / newsreaders</span>, <span class="toctext">Military: soldiers/sailors/airmen</span>, <span class="toctext">Monarchs</span>, <span class="toctext">Musicians</span>, <span class="toctext">Philosophers</span>, <span class="toctext">Photographers</span>, <span class="toctext">Politicians</span>, <span class="toctext">Scientists</span>, <span class="toctext">Sportsmen and sportswomen</span>, <span class="toctext">Writers</span>, <span class="toctext">Other notables</span>, <span class="toctext">English expatriates</span>, <span class="toctext">References</span>, <span class="toctext">See also</span>
I can get the first text object by running the following:
soup.find_all(class_ = 'toctext')[0].text
My goal is to get and store all of the text objects in a list. I'm doing this with a for loop, but I don't know how many text objects there are in the HTML block. Naturally, I would hit an error if I reach an index that doesn't exist. Is there an alternative?
You can use a for...in loop, which iterates over every result without needing an index or a count:
In [13]: [t.text for t in soup.find_all(class_ = 'toctext')]
Out[13]:
['Actors and actresses',
'Archaeologists and anthropologists',
'Architects',
'Artists',
'Broadcasters',
'Businessmen',
'Chefs',
'Clergy',
'Criminals',
'Conspirators',
'Economists',
'Engineers',
'Explorers',
'Filmmakers',
'Historians',
'Humourists',
'Inventors / engineers',
'Journalists / newsreaders',
'Military: soldiers/sailors/airmen',
'Monarchs',
'Musicians',
'Philosophers',
'Photographers',
'Politicians',
'Scientists',
'Sportsmen and sportswomen',
'Writers',
'Other notables',
'English expatriates',
'References',
'See also']
Try the following code:
for txt in soup.find_all(class_ = 'toctext'):
    print(txt.text)
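If the count itself is needed, len() on the collected list gives it; find_all returns a list-like result, so len() works on that directly as well. The sketch below shows the collect-then-count pattern using only the standard library's html.parser rather than BeautifulSoup. The class name TocTextCollector and the shortened three-item input are assumptions for illustration.

```python
from html.parser import HTMLParser


class TocTextCollector(HTMLParser):
    """Collect the text of every <span class="toctext"> element."""

    def __init__(self):
        super().__init__()
        self.texts = []
        self._in_toctext = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "toctext" in classes:
            self._in_toctext = True
            self.texts.append("")

    def handle_data(self, data):
        # Only keep text that sits inside a toctext span.
        if self._in_toctext:
            self.texts[-1] += data

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_toctext = False


# Shortened sample of the markup shown in the question.
html = (
    '<span class="toctext">Actors and actresses</span>, '
    '<span class="toctext">Architects</span>, '
    '<span class="toctext">See also</span>'
)

collector = TocTextCollector()
collector.feed(html)
texts = collector.texts
count = len(texts)  # no need to know the count before iterating

print(count, texts)
```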

Errno 2 - No such file or directory

I have a script that scans through a directory and copies the content (actually I only want part of it) of all files that contain a certain string into a new file.
import os
dr = "C:/Python34/Downloaded Files LALME/"; out = "C:/Python34/output_though_only.txt"; tag = ".txt"
files = os.listdir(dr)
for f in files:
    if f.endswith(tag):
        content = open(dr+"/"+f).readlines()
        all_lines = content
        if ">though<" in all_lines:
            open(out, "a+").write(all_lines)
infile = "C:/Python34/output_though_only.txt"
outfile = "C:/Python34/cleaned_output_though_only.txt"
delete_list = ['']
fin = open(infile)
fout = open(outfile, "a+")
for line in fin:
    for word in delete_list:
        line = line.replace(word, "")
    fout.write(line)
fin.close()
fout.close()
When I run it from the shell it returns the following error message:
C:\Python34>python LALME_script_though_extract_clean.txt -w
Traceback (most recent call last):
  File "LALME_script_though_extract_clean.txt", line 13, in <module>
    fin = open(infile)
FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Python34/output_though_only.txt'
Of course, the file does not exist (yet), but the script is supposed to create it if it doesn't exist. This is a modified version of code that I used for other strings and files, and there it worked just fine.
Can anyone find what went wrong here?
Edit: here is an older version of the script that works and was used for different files/directories.
import os
dr = "C:\Python34/tag/"; out = "C:\\Users/Yorishimo/Desktop/codes/LAEME/though/output_laeme_though/output_laeme_though.txt"; tag = ".tag"
files = os.listdir(dr)
for f in files:
    if f.endswith(tag):
        content = open(dr+"/"+f).readlines()
        all_lines = content
        needed_lines = content[21:27]
        lines_totest = ("").join(all_lines)+"\n"
        final_lines = ("").join(needed_lines)+"\n"
        if "$though/" in lines_totest:
            open(out, "a+").write(final_lines)
infile = "C:\\Users/Yorishimo/Desktop/codes/LAEME/though/output_laeme_though/output_laeme_though.txt"
outfile = "C:\\Users/Yorishimo/Desktop/codes/LAEME/though/output_laeme_though/cleaned_output_though_only.txt"
delete_list = ['</span></li><li><span class="list">', '<style type="text/css"> UL LI { list-style: none } </style><ul><li><span class="list">']
fin = open(infile)
fout = open(outfile, "a+")
for line in fin:
    for word in delete_list:
        line = line.replace(word, "")
    fout.write(line)
fin.close()
fout.close()
Edit2:
Here is an example of a file that I am trying to have scanned for the string >though<:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>eLALME</title>
<meta http-equiv="content-type" content="text/html; charset=UTF-8" />
<meta name="Title" content="">
<meta name="Description" content="">
<meta name="Keywords" content="">
<meta name="Author" content="">
<meta name="Publisher" content="">
<link rel="stylesheet" href="http://archive.ling.ed.ac.uk/ihd/elalme_scripts/lib/css/elalme_actionstyle.css" type="text/css" target="taskarea">
</head>
<!-- MAIN CONTENT BELOW HERE -->
<body>
<style type="text/css">
UL LI { list-style: none}
</style>
<p><span class="emphasis">LP 1</span></p><p>Dublin, Trinity College 154 (A.6.12). <span class="contr">ca.</span> 1400. MS in one hand. ff. 1r-105r: religious treatises. Analysis from ff. 1r-41r, then scan. LP 1. Grid 478 332. Leicestershire.</p></p><table><tr><td>1</td><td> <span class="smcap">THE</span>: </td><td>y<sup>e</sup></td></tr><tr><td>2</td><td> <span class="smcap">THESE</span>: </td><td>thes (theys) ((these))</td></tr><tr><td>3</td><td> <span class="smcap">THOSE</span>: </td><td>those (tho)</td></tr><tr><td>4</td><td> <span class="smcap">SHE</span>: </td><td>sche, she</td></tr><tr><td>5</td><td> <span class="smcap">HER</span>: </td><td>hir, hyr</td></tr><tr><td>6</td><td> <span class="smcap">IT</span>: </td><td>it</td></tr><tr><td>7</td><td> <span class="smcap">THEY</span>: </td><td>they ((thay, thei))</td></tr><tr><td>8</td><td> <span class="smcap">THEM</span>: </td><td>theym, them (thayme, theyme) ((yem, thaym, theme, yam))</td></tr><tr><td>9</td><td> <span class="smcap">THEIR</span>: </td><td>theyr (ther, y<span class="contr">er</span>) ((thayre, thayr, theyre, thare))</td></tr><tr><td>10</td><td> <span class="smcap">SUCH</span>: </td><td>suche ((syche))</td></tr><tr><td>11</td><td> <span class="smcap">WHICH</span>: </td><td>wiche, which (wyche, whiche, whyche) ((y<sup>e</sup>-wiche))</td></tr><tr><td>13</td><td> <span class="smcap">MANY</span>: </td><td>mony (many)</td></tr><tr><td>14</td><td> <span class="smcap">MAN</span>: </td><td>man, ma<span class="contr">n</span></td></tr><tr><td>15</td><td> <span class="smcap">ANY</span>: </td><td>any</td></tr><tr><td>16</td><td> <span class="smcap">MUCH</span>: </td><td>myche (mych, moche, meche, muche)</td></tr><tr><td>17</td><td> <span class="smcap">ARE</span>: </td><td>be (ar) ((are, er, byn))</td></tr><tr><td>18</td><td> <span class="smcap">WERE</span>: </td><td>were (ware) ((wher))</td></tr><tr><td>19</td><td> <span class="smcap">IS</span>: </td><td>is</td></tr><tr><td>21</td><td> <span 
class="smcap">WAS</span>: </td><td>was</td></tr><tr><td>22</td><td> <span class="smcap">SHALL</span> <span class="contr">sg</span>: </td><td>shal, schal, schall</td></tr><tr><td>22-30</td><td> <span class="smcap">SHALL</span> <span class="contr">pl</span>: </td><td>shall (shal) ((schall))</td></tr><tr><td>23</td><td> <span class="smcap">SHOULD</span> <span class="contr">sg</span>: </td><td>sholde (schulde) ((shulde))</td></tr><tr><td>23-30</td><td> <span class="smcap">SHOULD</span> <span class="contr">pl</span>: </td><td>sholde (schulde, shulde)</td></tr><tr><td>24</td><td> <span class="smcap">WILL</span> <span class="contr">sg</span>: </td><td>wyll</td></tr><tr><td>24-30</td><td> <span class="smcap">WILL</span> <span class="contr">pl</span>: </td><td>wyll</td></tr><tr><td>25</td><td> <span class="smcap">WOULD</span> <span class="contr">sg</span>: </td><td>wolde</td></tr><tr><td>26-30</td><td> <span class="smcap">TO</span> <span class="contr">prep</span> +V: </td><td>to</td></tr><tr><td>27</td><td> <span class="smcap">TO</span> <span class="contr">+inf</span> +C: </td><td>to</td></tr><tr><td>28</td><td> <span class="smcap">FROM</span>: </td><td>from (frome) ((fro))</td></tr><tr><td>29</td><td> <span class="smcap">AFTER</span>: </td><td>after</td></tr><tr><td>30</td><td> <span class="smcap">THEN</span>: </td><td>then ((than))</td></tr><tr><td>31</td><td> <span class="smcap">THAN</span>: </td><td>then (than) ((yen))</td></tr><tr><td>32</td><td> <span class="smcap">THOUGH</span>: </td><td>alof (yof) ((yoff))</td></tr><tr><td>33</td><td> <span class="smcap">IF</span>: </td><td>yff, yff-that, yf ((yf-y<sup>t</sup>, yff-y<sup>t</sup>, gyffe-y<sup>t</sup>))</td></tr><tr><td>34</td><td> <span class="smcap">AS</span>: </td><td>as</td></tr><tr><td>35</td><td> <span class="smcap">AS..AS</span>: </td><td>as+as</td></tr><tr><td>36</td><td> <span class="smcap">AGAINST</span>: </td><td>agance (agaynst, agayns, agayne)</td></tr><tr><td>39-20</td><td> <span 
class="smcap">SINCE</span> <span class="contr">conj</span>: </td><td>sythen-y<sup>t</sup>, syn</td></tr><tr><td>40</td><td> <span class="smcap">YET</span>: </td><td>ȝet ((ȝett))</td></tr><tr><td>41</td><td> <span class="smcap">WHILE</span>: </td><td>whylys-y<sup>t</sup></td></tr><tr><td>42</td><td> <span class="smcap">STRENGTH</span>: </td><td>strenght ((strenghe))</td></tr><tr><td>42-20</td><td> <span class="smcap">STRENGTHEN</span> <span class="contr">vb</span>: </td><td>strenght</td></tr><tr><td>44</td><td> <span class="smcap">WH-</span>: </td><td>wh- ((w-))</td></tr><tr><td>46</td><td> <span class="smcap">NOT</span>: </td><td>not, nott</td></tr><tr><td>47</td><td> <span class="smcap">NOR</span>: </td><td>nor (ne)</td></tr><tr><td>48</td><td> <span class="smcap">OE</span>, <span class="smcap">ON</span> <span class="contr">ā</span> (‘a’, ‘o’): </td><td>o</td></tr><tr><td>49</td><td> <span class="smcap">WORLD</span>: </td><td>woorlde, worlde, warlde, world</td></tr><tr><td>50</td><td> <span class="smcap">THINK</span> <span class="contr">vb</span>: </td><td>thynke, thyngke</td></tr><tr><td>51</td><td> <span class="smcap">WORK</span> <span class="contr">sb</span>: </td><td>werke</td></tr><tr><td>51-10</td><td> <span class="smcap">WORK</span> <span class="contr">pres stem</span>: </td><td>werke</td></tr><tr><td>52</td><td> <span class="smcap">THERE</span>: </td><td>ther ((y<span class="contr">er</span>))</td></tr><tr><td>53</td><td> <span class="smcap">WHERE</span>: </td><td>wher-, where</td></tr><tr><td>54</td><td> <span class="smcap">MIGHT</span> <span class="contr">vb</span>: </td><td>myght</td></tr><tr><td>55</td><td> <span class="smcap">THROUGH</span>: </td><td>throughe (throghe) ((throught, through))</td></tr><tr><td>56</td><td> <span class="smcap">WHEN</span>: </td><td>when</td></tr><tr><td>57</td><td> <span class="contr">Sb pl</span>: </td><td>-ys (-s, -<span class="contr">es</span>) ((-es, -is))</td></tr><tr><td>58</td><td> <span class="contr">Pres 
part</span>: </td><td>-yng</td></tr><tr><td>59</td><td> <span class="contr">Vbl sb</span>: </td><td>-yng</td></tr><tr><td>61</td><td> <span class="contr">Pres 3sg</span>: </td><td>-ys (-yth, -eth, -<span class="contr">es</span>, -s)</td></tr><tr><td>62</td><td> <span class="contr">Pres pl</span>: </td><td>-th</td></tr><tr><td>65</td><td> <span class="contr">Weak ppl</span>: </td><td>-ed (-et) ((-yd))</td></tr><tr><td>66</td><td> <span class="contr">Str ppl</span>: </td><td>-en, -on, -yne</td></tr><tr><td>70-20</td><td> <span class="smcap">ABOUT</span> <span class="contr">pr</span>: </td><td>abowte</td></tr><tr><td>71-20</td><td> <span class="smcap">ABOVE</span> <span class="contr">pr</span>: </td><td>a-boue, abowe</td></tr><tr><td>73</td><td> <span class="smcap">AFTERWARDS</span>: </td><td>afterward</td></tr><tr><td>75</td><td> <span class="smcap">ALL</span>: </td><td>all, al</td></tr><tr><td>77</td><td> <span class="smcap">AMONG</span> <span class="contr">adv</span>: </td><td>emong</td></tr><tr><td>77-20</td><td> <span class="smcap">AMONG</span> <span class="contr">pr</span>: </td><td>emong</td></tr><tr><td>78-20</td><td> <span class="smcap">ANSWER</span> <span class="contr">vb</span>: </td><td>answer</td></tr><tr><td>80</td><td> <span class="smcap">ASK</span> <span class="contr">vb</span>: </td><td>aske</td></tr><tr><td>81</td><td> <span class="smcap">AT</span><span class="contr">+inf</span>: </td><td>at</td></tr><tr><td>83</td><td> <span class="smcap">AWAY</span>: </td><td>away</td></tr><tr><td>84-20</td><td> <span class="smcap">BE</span> <span class="contr">ppl</span>: </td><td>beyn (byn)</td></tr><tr><td>85-20</td><td> <span class="smcap">BEFORE</span> <span class="contr">adv-time</span>: </td><td>before, befor</td></tr><tr><td>85-31</td><td> <span class="smcap">BEFORE</span> <span class="contr">pr-place</span>: </td><td>be-fore, be-for</td></tr><tr><td>89</td><td> <span class="smcap">BETWEEN</span> <span class="contr">pr</span>: 
</td><td>betwene</td></tr><tr><td>93</td><td> <span class="smcap">BLESSED</span> <span class="contr">adj/ppl</span>: </td><td>blessyd (blessed) ((blyssed))</td></tr><tr><td>94</td><td> <span class="smcap">BOTH</span>: </td><td>bothe</td></tr><tr><td>96</td><td> <span class="smcap">BROTHER</span>: </td><td>broder (brother)</td></tr><tr><td>99</td><td> <span class="smcap">BUSY</span> <span class="contr">adj</span>: </td><td>besy ((busy))</td></tr><tr><td>99-20</td><td> <span class="smcap">BUSY</span> <span class="contr">vb</span>: </td><td>besy-, busy</td></tr><tr><td>100</td><td> <span class="smcap">BUT</span>: </td><td>bot (bott)</td></tr><tr><td>102</td><td> <span class="smcap">BY</span>: </td><td>by</td></tr><tr><td>103-30</td><td> <span class="smcap">CALLED</span> <span class="contr">ppl</span>: </td><td>called</td></tr><tr><td>104</td><td> <span class="smcap">CAME</span> <span class="contr">sg</span>: </td><td>came</td></tr><tr><td>105-20</td><td> CAN <span class="contr">1/3sg</span>: </td><td>can</td></tr><tr><td>106</td><td> <span class="smcap">CAST</span> <span class="contr">vb</span>: </td><td>cast</td></tr><tr><td>108</td><td> <span class="smcap">CHURCH</span>: </td><td>churche</td></tr><tr><td>109</td><td> <span class="smcap">COULD</span> <span class="contr">1/3sg</span>: </td><td>cowthe</td></tr><tr><td>112</td><td> <span class="smcap">DAY</span>: </td><td>day</td></tr><tr><td>113</td><td> <span class="smcap">DEATH</span>: </td><td>dethe</td></tr><tr><td>114</td><td> <span class="smcap">DIE</span> <span class="contr">vb</span>: </td><td>dye</td></tr><tr><td>115-70</td><td> <span class="smcap">DID</span> <span class="contr">pl</span>: </td><td>dyd</td></tr><tr><td>116</td><td> <span class="smcap">DOWN</span>: </td><td>downe</td></tr><tr><td>119</td><td> <span class="smcap">EARTH</span>: </td><td>erthe</td></tr><tr><td>125</td><td> <span class="smcap">ENOUGH</span>: </td><td>enoughe</td></tr><tr><td>129</td><td> <span class="smcap">FAR</span>: 
</td><td>far</td></tr><tr><td>130</td><td> <span class="smcap">FATHER</span>: </td><td>fad<span class="contr">er</span>, fader</td></tr><tr><td>132</td><td> <span class="smcap">FELLOW</span>: </td><td>felou-</td></tr><tr><td>134</td><td> <span class="smcap">FIGHT</span> <span class="contr">pres</span>: </td><td>feght-</td></tr><tr><td>137</td><td> <span class="smcap">FIRE</span>: </td><td>fyer, fyre, fire</td></tr><tr><td>138</td><td> <span class="smcap">FIRST</span> <span class="contr">undiff</span>: </td><td>ferst, first</td></tr><tr><td>139</td><td> <span class="smcap">FIVE</span>: </td><td>fyve</td></tr><tr><td>139-20</td><td> <span class="smcap">FIFTH</span>: </td><td>feyfte</td></tr><tr><td>140</td><td> <span class="smcap">FLESH</span>: </td><td>fleshe, flesche</td></tr><tr><td>141</td><td> <span class="smcap">FOLLOW</span> <span class="contr">vb</span>: </td><td>folow, folo-</td></tr><tr><td>144-20</td><td> <span class="smcap">FOURTH</span>: </td><td>fawrte</td></tr><tr><td>146</td><td> <span class="smcap">FRIEND</span>: </td><td>frend-</td></tr><tr><td>147</td><td> <span class="smcap">FRUIT</span>: </td><td>frutt-</td></tr><tr><td>153</td><td> <span class="smcap">GIVE</span> <span class="contr">pres</span>: </td><td>gyue (gyff-)</td></tr><tr><td>155</td><td> <span class="smcap">GOOD</span>: </td><td>good, gud</td></tr><tr><td>157</td><td> <span class="smcap">GROW</span> <span class="contr">pres</span>: </td><td>groue (growe)</td></tr><tr><td>160</td><td> <span class="smcap">HAVE</span> <span class="contr">pres</span>: </td><td>haue</td></tr><tr><td>160-40</td><td> <span class="smcap">HAS</span> <span class="contr">3sg</span>: </td><td>hayth ((haythe, haith))</td></tr><tr><td>164</td><td> <span class="smcap">HEAVEN</span>: </td><td>hewyn</td></tr><tr><td>165</td><td> <span class="smcap">HEIGHT</span>: </td><td>heghte</td></tr><tr><td>166</td><td> <span class="smcap">HELL</span>: </td><td>hell</td></tr><tr><td>168</td><td> <span class="smcap">HIGH</span>: 
</td><td>hegh (heghe, hyghe)</td></tr><tr><td>168-20</td><td> <span class="smcap">HIGHER</span>: </td><td>hyer</td></tr><tr><td>171</td><td> <span class="smcap">HIM</span>: </td><td>hym</td></tr><tr><td>175</td><td> <span class="smcap">HOLY</span>: </td><td>holy</td></tr><tr><td>176</td><td> <span class="smcap">HOW</span>: </td><td>how (howe)</td></tr><tr><td>181</td><td> <span class="smcap">KNOW</span> <span class="contr">pres</span>: </td><td>knawe, knaw</td></tr><tr><td>185</td><td> <span class="smcap">LAW</span>: </td><td>lawe</td></tr><tr><td>187</td><td> <span class="smcap">LESS</span>: </td><td>lesse, leysse</td></tr><tr><td>190</td><td> <span class="smcap">LIFE</span>: </td><td>lyue</td></tr><tr><td>191</td><td> <span class="smcap">LITTLE</span>: </td><td>lyttyll (lyttyl)</td></tr><tr><td>192</td><td> <span class="smcap">LIVE</span> <span class="contr">vb</span>: </td><td>lyff-</td></tr><tr><td>194</td><td> <span class="smcap">LORD</span>: </td><td>lorde ((lord))</td></tr><tr><td>196</td><td> <span class="smcap">LOVE</span> <span class="contr">sb</span>: </td><td>loue (luffe, luff)</td></tr><tr><td>196-20</td><td> <span class="smcap">LOVE</span> <span class="contr">vb</span>: </td><td>loue</td></tr><tr><td>197</td><td> <span class="smcap">LOW</span>: </td><td>low-</td></tr><tr><td>199-10</td><td> <span class="smcap">MAY</span> <span class="contr">1/3sg</span>: </td><td>may</td></tr><tr><td>202</td><td> <span class="smcap">MOON</span>: </td><td>mone</td></tr><tr><td>203</td><td> <span class="smcap">MOTHER</span>: </td><td>mother, moder</td></tr><tr><td>204</td><td> <span class="smcap">MY</span> +C: </td><td>my</td></tr><tr><td>204-20</td><td> <span class="smcap">MY</span> <span class="contr">+h</span>: </td><td>my</td></tr><tr><td>205</td><td> <span class="smcap">NAME</span> <span class="contr">sb</span>: </td><td>name</td></tr><tr><td>210</td><td> <span class="smcap">NEITHER</span> <span class="contr">pron</span>: 
</td><td>nawther</td></tr><tr><td>211</td><td> <span class="smcap">NEITHER..NOR</span>: </td><td>nawther+ne, nother+nor, nawther+then</td></tr><tr><td>212</td><td> <span class="smcap">NEVER</span>: </td><td>neu<span class="contr">er</span></td></tr><tr><td>213</td><td> <span class="smcap">NEW</span>: </td><td>new-</td></tr><tr><td>214</td><td> <span class="smcap">NIGH</span>: </td><td>neyr, nere-</td></tr><tr><td>218</td><td> <span class="smcap">NOW</span>: </td><td>nowe, now</td></tr><tr><td>219</td><td> <span class="smcap">OLD</span>: </td><td>holde</td></tr><tr><td>220-20</td><td> <span class="smcap">ONE</span> <span class="contr">pron</span>: </td><td>one</td></tr><tr><td>221</td><td> <span class="smcap">OR</span>: </td><td>or</td></tr><tr><td>222</td><td> <span class="smcap">OTHER</span>: </td><td>other</td></tr><tr><td>224</td><td> <span class="smcap">OUR</span>: </td><td>oure</td></tr><tr><td>225</td><td> <span class="smcap">OUT</span>: </td><td>oute, out, owt</td></tr><tr><td>226</td><td> <span class="smcap">OWN</span> <span class="contr">adj</span>: </td><td>awne</td></tr><tr><td>227</td><td> <span class="smcap">PEOPLE</span>: </td><td>people, peopyll</td></tr><tr><td>228</td><td> <span class="smcap">POOR</span>: </td><td>pooer, poer, poore</td></tr><tr><td>229</td><td> <span class="smcap">PRAY</span> <span class="contr">vb</span>: </td><td>pray</td></tr><tr><td>235</td><td> <span class="smcap">SAY</span> <span class="contr">pres</span>: </td><td>say</td></tr><tr><td>235-21</td><td> <span class="smcap">SAYS</span> <span class="contr">3sg</span>: </td><td>sayth</td></tr><tr><td>235-30</td><td> <span class="smcap">SAY</span> <span class="contr">pl</span>: </td><td>sayth</td></tr><tr><td>235-40</td><td> <span class="smcap">SAID</span> <span class="contr">sg</span>: </td><td>sayde</td></tr><tr><td>235-60</td><td> <span class="smcap">SAID</span> <span class="contr">ppl</span>: </td><td>sayd</td></tr><tr><td>236</td><td> <span class="smcap">SEE</span> <span 
class="contr">vb</span>: </td><td>se</td></tr><tr><td>236-21</td><td> <span class="smcap">SEES</span> <span class="contr">3sg</span>: </td><td>seeth</td></tr><tr><td>237</td><td> <span class="smcap">SEEK</span> <span class="contr">pres</span>: </td><td>seke</td></tr><tr><td>238</td><td> <span class="smcap">SELF</span>: </td><td>selffe, selfe</td></tr><tr><td>242</td><td> <span class="smcap">SIN</span> <span class="contr">sb</span>: </td><td>synn-, syn ((syne))</td></tr><tr><td>242-30</td><td> <span class="smcap">SIN</span> <span class="contr">vb</span>: </td><td>synn-</td></tr><tr><td>243</td><td> <span class="smcap">SISTER</span>: </td><td>sust<span class="contr">er</span>, suster</td></tr><tr><td>244</td><td> <span class="smcap">SIX</span>: </td><td>sex</td></tr><tr><td>246</td><td> <span class="smcap">SOME</span>: </td><td>su<span class="contr">m</span>me ((sume))</td></tr><tr><td>248</td><td> <span class="smcap">SORROW</span> <span class="contr">sb</span>: </td><td>sorow, sorowe, sorousys<<span class="contr">pl</span>></td></tr><tr><td>249</td><td> <span class="smcap">SOUL</span>: </td><td>soule (saule)</td></tr><tr><td>249-20</td><td> <span class="smcap">SOULS</span>: </td><td>soules, saulis, sawl<span class="contr">es</span></td></tr><tr><td>254</td><td> <span class="smcap">STEAD</span>: </td><td>sted-</td></tr><tr><td>261</td><td> <span class="smcap">THOU</span>: </td><td>y<sup>u</sup></td></tr><tr><td>262</td><td> <span class="smcap">THEE</span>: </td><td>ye</td></tr><tr><td>263</td><td> <span class="smcap">THY</span> +C: </td><td>y<sup>i</sup></td></tr><tr><td>266</td><td> <span class="smcap">THOUSAND</span>: </td><td>thousande</td></tr><tr><td>267-20</td><td> <span class="smcap">THIRD</span>: </td><td>thryde, therde, threde</td></tr><tr><td>268</td><td> <span class="smcap">TOGETHER</span>: </td><td>to-gedder, to-gether</td></tr><tr><td>270</td><td> <span class="smcap">TRUE</span>: </td><td>true</td></tr><tr><td>273</td><td> <span 
class="smcap">TWELVE</span>: </td><td>twelue</td></tr><tr><td>275</td><td> <span class="smcap">TWO</span>: </td><td>too</td></tr><tr><td>278</td><td> <span class="smcap">UPON</span>: </td><td>apon (appon) ((vppon))</td></tr><tr><td>281</td><td> <span class="smcap">WELL</span> <span class="contr">adv</span>: </td><td>well ((wel))</td></tr><tr><td>282</td><td> <span class="smcap">WENT</span>: </td><td>went</td></tr><tr><td>285</td><td> <span class="smcap">WHETHER</span>: </td><td>whether (whed<span class="contr">er</span>, wether)</td></tr><tr><td>286</td><td> <span class="smcap">WHITHER</span>: </td><td>wheder</td></tr><tr><td>291</td><td> <span class="smcap">WHY</span>: </td><td>why</td></tr><tr><td>292-20</td><td> <span class="smcap">WIT</span> <span class="contr">1/3sg <span class="smcap">KNOW</span></span>: </td><td>wote</td></tr><tr><td>295</td><td> <span class="smcap">WITHOUT</span> <span class="contr">pr</span>: </td><td>w<sup>t</sup>-owte, w<sup>t</sup>-owt, w<sup>t</sup>-owtyn (w<sup>t</sup>-out)</td></tr><tr><td>297</td><td> <span class="smcap">WORSHIP</span> <span class="contr">sb</span>: </td><td>worschippe, worschip</td></tr><tr><td>298</td><td> <span class="smcap">YE</span>: </td><td>ȝe</td></tr><tr><td>299</td><td> <span class="smcap">YOU</span>: </td><td>you ((youe))</td></tr><tr><td>300</td><td> <span class="smcap">YOUR</span>: </td><td>you<span class="contr">r</span> (youre)</td></tr><tr><td>303</td><td> <span class="smcap">YOUNG</span>: </td><td>yong</td></tr><tr><td>304</td><td> <span class="smcap">-ALD</span>: </td><td>-old</td></tr><tr><td>306</td><td> <span class="smcap">-AND</span>: </td><td>-and (-ond)</td></tr><tr><td>307</td><td> <span class="smcap">-ANG</span>: </td><td>-ong ((-ang))</td></tr><tr><td>308</td><td> <span class="smcap">-ANK</span>: </td><td>-ank, -angk</td></tr><tr><td>309</td><td> <span class="smcap">-DOM</span>: </td><td>-dome, -dom</td></tr><tr><td>312</td><td> <span class="smcap">-ER</span>: </td><td>-er (-<span 
class="contr">er</span>) ((-ar))</td></tr><tr><td>313</td><td> <span class="smcap">-EST</span> <span class="contr">sup</span>: </td><td>-est</td></tr><tr><td>314</td><td> <span class="smcap">-FUL</span>: </td><td>-full</td></tr><tr><td>315</td><td> <span class="smcap">-HOOD</span>: </td><td>-hede, -hed</td></tr><tr><td>316</td><td> <span class="smcap">-LESS</span>: </td><td>-les</td></tr><tr><td>317</td><td> <span class="smcap">-LY</span>: </td><td>-ly</td></tr><tr><td>318</td><td> <span class="smcap">-NESS</span>: </td><td>-nes</td></tr><tr><td>319</td><td> <span class="smcap">-SHIP</span>: </td><td>-schippe, -schip</td></tr></table>
If ">though<" is supposed to be a whole line, it won't be found in the list returned by readlines, because each line in that list ends with a newline. You need to add a newline to the search string or str.rstrip the lines in the list.
In [15]: l = [">though<\n"]
In [16]: ">though<" in l
Out[16]: False
In [17]: ">though<\n" in l
Out[17]: True
This checks for a match properly, uses glob to find all the .txt files in the directory, and avoids storing all lines in a list:
dr = "C:\\Python34/Downloaded Files LALME/"
out = "C:\\Python34/output_though_only.txt"
from glob import glob
tag = ".txt"
files = glob(dr + "*" + tag)
with open(out, "w") as f:
    for fle in files:
        with open(fle) as f2:
            for line in f2:
                if line.rstrip() == ">though<":
                    f2.seek(0)  # rewind the input so the whole file is copied
                    f.writelines(f2)
                    break
Then reopen the output file after writing to it:
outfile = "C:\\Python34/cleaned_output_though_only.txt"
delete_list = ['']
with open(out) as fin, open(outfile, "w") as fout:
    for line in fin:
        for word in delete_list:
            line = line.replace(word, "")
        fout.write(line)
But your delete list contains only an empty string, so I'm not sure what that step is meant to do.
Your code from your edit works as you are checking a string not a list:
In [20]: f = "foo"
In [21]: l = ["foo\n"]
In [22]: f in l
Out[22]: False
In [23]: s = "".join(l)
In [24]: s
Out[24]: 'foo\n'
In [25]: f in s
Out[25]: True
In the edited code you are checking for a substring, so you get a match even without the newline. With a list you need an exact match, and that is what your original code does: it uses readlines and never calls join on the result the way the edited code does.
So the big difference between the two is that one checks for a substring in a string, while the other checks whether an exact match is in the list, which fails because of the newline character.
Based on your new edit you are not checking for a whole line, you are checking for a substring with a case-insensitive match, so use:
`if ">though<" in line.lower()`
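The two kinds of check discussed above can be put side by side in a minimal sketch. The helper names contains_line and contains_substring are made up for illustration and are not from the original code; both accept any iterable of lines, such as an open file or the result of readlines.

```python
def contains_line(lines, target):
    """True if some line equals target once its trailing newline is removed."""
    return any(line.rstrip("\n") == target for line in lines)


def contains_substring(lines, needle):
    """True if needle occurs anywhere in any line, ignoring case."""
    needle = needle.lower()
    return any(needle in line.lower() for line in lines)


lines = ["<td>>THOUGH<</td>\n", "something else\n"]
print(contains_line(lines, ">though<"))       # exact, case-sensitive line match
print(contains_substring(lines, ">though<"))  # case-insensitive substring match
```

Passing the open file object instead of a list keeps memory use low, since the lines are read one at a time.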

Retrieve bbc weather data with identical span class and nested spans

I am trying to pull data from BBC weather with a view to using it in a home automation dashboard.
I can pull the HTML fine, and I can pull one set of temps, but it only pulls the first.
</li>
<li class="daily__day-tab day-20150418 ">
<a data-ajax-href="/weather/en/2646504/daily/2015-04-18?day=3" href="/weather/2646504?day=3" rel="nofollow">
<div class="daily__day-header">
<h3 class="daily__day-date">
<span aria-label="Saturday" class="day-name">Sat</span>
</h3>
</div>
<span class="weather-type-image weather-type-image-40" title="Sunny"><img alt="Sunny" src="http://static.bbci.co.uk/weather/0.5.327/images/icons/tab_sprites/40px/1.png"/></span>
<span class="max-temp max-temp-value"> <span class="units-values temperature-units-values"><span class="units-value temperature-value temperature-value-unit-c" data-unit="c">13<span class="unit">°C</span></span><span class="unit-types-separator"> </span><span class="units-value temperature-value temperature-value-unit-f" data-unit="f">55<span class="unit">°F</span></span></span></span>
<span class="min-temp min-temp-value"> <span class="units-values temperature-units-values"><span class="units-value temperature-value temperature-value-unit-c" data-unit="c">5<span class="unit">°C</span></span><span class="unit-types-separator"> </span><span class="units-value temperature-value temperature-value-unit-f" data-unit="f">41<span class="unit">°F</span></span></span></span>
<span class="wind wind-speed windrose-icon windrose-icon--average windrose-icon-40 windrose-icon-40--average wind-direction-ene" data-tooltip-kph="31 km/h, East North Easterly" data-tooltip-mph="19 mph, East North Easterly" title="19 mph, East North Easterly">
<span class="speed"> <span class="wind-speed__description wind-speed__description--average">Wind Speed</span>
<span class="units-values windspeed-units-values"><span class="units-value windspeed-value windspeed-value-unit-kph" data-unit="kph">31 <span class="unit">km/h</span></span><span class="unit-types-separator"> </span><span class="units-value windspeed-value windspeed-value-unit-mph" data-unit="mph">19 <span class="unit">mph</span></span></span></span>
<span class="description blq-hide">East North Easterly</span>
</span>
This is my code which isn’t working
import urllib2
import pprint
from bs4 import BeautifulSoup
htmlFile=urllib2.urlopen('http://www.bbc.co.uk/weather/2646504?day=1')
htmlData = htmlFile.read()
soup = BeautifulSoup(htmlData)
table=soup.find("div","daily-window")
temperatures=[str(tem.contents[0]) for tem in table.find_all("span",class_="units-value temperature-value temperature-value-unit-c")]
mintemp=[str(min.contents[0]) for min in table.find_("span",class_="min-temp min-temp-value")]
maxtemp=[str(min.contents[0]) for min in table.find_all("span",class_="max-temp max-temp-value")]
windspeeds=[str(speed.contents[0]) for speed in table.find_all("span",class_="units-value windspeed-value windspeed-value-unit-mph")]
pprint.pprint(zip(temperatures,temp2,windspeeds))
Your min and max temp extraction is wrong. You find the whole min temp span (which wraps both the °C and °F values), and the first item of its contents is only whitespace, so you get what looks like an empty string.
The tag that identifies the min temp (class="min-temp min-temp-value") is also not the same tag as the one that holds the Celsius value (class temperature-value-unit-c), so I suggest you use a CSS selector.
E.g., finding all of your min temp spans could be:
table.select('span.min-temp.min-temp-value span.temperature-value-unit-c')
This selects every span with class temperature-value-unit-c that is a descendant of a span with classes min-temp and min-temp-value.
The other lists, such as the max temps and the wind speeds, can be built the same way.
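Here is a small sketch of that selector approach, run against a trimmed copy of one day tab from the question rather than the live page. The surrounding div with class daily-window is assumed from the question's own find call, the helper first_text is a made-up name, and the parser choice ("html.parser") is an assumption as well.

```python
from bs4 import BeautifulSoup

# Trimmed copy of one day tab; the live page may differ.
html = """
<div class="daily-window"><ul>
<li class="daily__day-tab day-20150418 ">
<span class="max-temp max-temp-value"> <span class="units-values temperature-units-values"><span class="units-value temperature-value temperature-value-unit-c" data-unit="c">13<span class="unit">°C</span></span><span class="units-value temperature-value temperature-value-unit-f" data-unit="f">55<span class="unit">°F</span></span></span></span>
<span class="min-temp min-temp-value"> <span class="units-values temperature-units-values"><span class="units-value temperature-value temperature-value-unit-c" data-unit="c">5<span class="unit">°C</span></span><span class="units-value temperature-value temperature-value-unit-f" data-unit="f">41<span class="unit">°F</span></span></span></span>
<span class="wind wind-speed"><span class="speed"><span class="units-values windspeed-units-values"><span class="units-value windspeed-value windspeed-value-unit-mph" data-unit="mph">19 <span class="unit">mph</span></span></span></span></span>
</li>
</ul></div>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("div", "daily-window")


def first_text(tags):
    """Return the leading text node of each tag, stripped of whitespace."""
    return [str(tag.contents[0]).strip() for tag in tags]


# A space between the two compound selectors means "descendant of".
max_temps = first_text(table.select("span.max-temp.max-temp-value span.temperature-value-unit-c"))
min_temps = first_text(table.select("span.min-temp.min-temp-value span.temperature-value-unit-c"))
wind_mph = first_text(table.select("span.windspeed-value-unit-mph"))

print(list(zip(max_temps, min_temps, wind_mph)))
```

With several day tabs in the page, each list gets one entry per day, so zip pairs the values up day by day.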

Get span text from a website using selenium

The website I'm trying to scrape looks like this:
<div align="center" class="movietable">
<span style="width:45px;height:47px;vertical-align:middle;display:table-cell;">
<img border="0" src="styles/images/cat/hd.png" alt="HdO">
</span>
</div>
<div align="left" class="movietable">
<span style="padding:0px 5px;width:455px;height:47px;vertical-align:middle;display:table-cell;">
<a data-toggle="tooltip" data-placement="bottom" data-html="true" title="" href="details.php?id=578197" data-original-title="<img src='https://trasd.tmdb.org//tqistSlQGQVlvDZHweD.jpg'>">
<b>GET THIS TEXT</b></a><br><font class="small">[Action, Horror, Sci-Fi]</font>
</span>
</div>
How can I extract:
The text in the <b> tag - in this case GET THIS TEXT
The content of the <font class="small"> tag - in this case Action, Horror, Sci-Fi
.movietable b works great!!
The img src link - in this case https://trasd.tmdb.org//tqistSlQGQVlvDZHweD.jpg
I have no idea how to do this.
Below are CSS selectors you can use:
driver.find_element_by_css_selector('div[align=left] b')
driver.find_element_by_css_selector('div[align=left] .small')
driver.find_element_by_css_selector('a[title]').get_attribute('data-original-title')
You can access all of them using xpath:
1) [parents before this div]/div[2]/span/a/b
2) [parents before this div]/div[2]/span/font
3) [parents before this div]/div[1]/span/a/img
[parents before this div] should be /html/body/...
As per the HTML you have shared, you can extract the items with the following solution:
GET THIS TEXT:
driver.find_element_by_xpath("//div[@class='movietable' and @align='left']/span/a[@data-toggle='tooltip' and @data-placement='bottom']/b").get_attribute("innerHTML")
[Action, Horror, Sci-Fi]:
driver.find_element_by_xpath("//div[@class='movietable' and @align='left']/span//font[@class='small']").get_attribute("innerHTML")
https://trasd.tmdb.org//tqistSlQGQVlvDZHweD.jpg:
img_src = driver.find_element_by_xpath("//div[@class='movietable' and @align='left']/span/a[@data-toggle='tooltip' and @data-placement='bottom']").get_attribute("data-original-title")
src = img_src.replace("'", "-").split("-")
print(src[1])
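Splitting on "-" breaks as soon as the image URL itself contains a hyphen. A regular expression is one alternative for pulling the URL out of the attribute value; the sketch below works on the plain string that get_attribute("data-original-title") would return, so it needs no browser. The function name extract_img_src is made up for illustration.

```python
import re


def extract_img_src(title_value):
    """Return the src URL from a string like "<img src='...'>", or None."""
    match = re.search(r"""src=['"]([^'"]+)['"]""", title_value)
    return match.group(1) if match else None


# The attribute value as it appears in the question's HTML.
title_value = "<img src='https://trasd.tmdb.org//tqistSlQGQVlvDZHweD.jpg'>"
print(extract_img_src(title_value))
```

In the Selenium code above, img_src would be passed in place of title_value.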
