I have a webpage with the following code:
<li>
Thalassery (<a class="mw-redirect" href="/wiki/Malayalam_language" title="Malayalam language">Malayalam</a>: <span lang="ml">തലശ്ശേരി</span>), from
<i>Tellicherry</i></li>
<li>Thanjavur (Tamil: <span lang="ta">தஞ்சாவூர்</span>), from British name <i>Tanjore</i></li>
<li>Thane (Marathi: <span lang="mr">ठाणे</span>), from British name <i>Tannah</i></li>
<li>Thoothukudi (Tamil: <span lang="ta">தூத்துக்குடி</span>), from <i>Tuticorin</i> and its short form <i>Tuty</i></li>
I need to parse this so that the result contains words like: Thalassery, Tellicherry, Thanjavur, Tanjore, Thane, Tannah, Thoothukudi, Tuticorin
Can anyone please help with this?
You can use .findAll() to get all the li elements, then call find() on each one for its 'a' and 'i' tags:
for item in soup.findAll('li'):
    print(item.find('a').text, item.find('i').text)
>>>
Thalassery Tellicherry
Thanjavur Tanjore
Thane Tannah
Thoothukudi Tuticorin
Try simplified_scrapy's solution; it is fault tolerant:
from simplified_scrapy.simplified_doc import SimplifiedDoc
html='''
<li>
Thalassery (<a class="mw-redirect" href="/wiki/Malayalam_language" title="Malayalam language">Malayalam</a>: <span lang="ml">തലശ്ശേരി</span>), from
<i>Tellicherry</i></li>
<li>Thanjavur (Tamil: <span lang="ta">தஞ்சாவூர்</span>), from British name <i>Tanjore</i></li>
<li>Thane (Marathi: <span lang="mr">ठाणे</span>), from British name <i>Tannah</i></li>
<li>Thoothukudi (Tamil: <span lang="ta">தூத்துக்குடி</span>), from <i>Tuticorin</i> and its short form <i>Tuty</i></li>
'''
doc = SimplifiedDoc(html)
lis = doc.lis
print ([(li.a.text,li.i.text if li.i else '') for li in lis])
Result:
[('Thalassery', 'Tellicherry'), ('Thanjavur', 'Tanjore'), ('Thane', 'Tannah'), ('Thoothukudi', 'Tuticorin')]
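In the snippet as posted, only the first li contains an a tag, and it links to the language rather than the place, so a version that does not rely on a may be useful. Below is a minimal sketch using only the standard library's html.parser. The class name PlaceNameParser and the shortened sample markup are illustrative assumptions; it takes the first word of each li and the text of its first i tag.

```python
from html.parser import HTMLParser


class PlaceNameParser(HTMLParser):
    """Collect (first word of <li>, text of first <i>) pairs."""

    def __init__(self):
        super().__init__()
        self.pairs = []      # finished (name, old_name) tuples
        self._name = None    # first word seen in the current <li>
        self._old = None     # text of the first <i> in the current <li>
        self._in_li = False
        self._in_i = False

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._in_li, self._name, self._old = True, None, None
        elif tag == "i" and self._in_li:
            self._in_i = True

    def handle_endtag(self, tag):
        if tag == "i":
            self._in_i = False
        elif tag == "li" and self._in_li:
            self.pairs.append((self._name, self._old))
            self._in_li = False

    def handle_data(self, data):
        if not self._in_li:
            return
        if self._in_i and self._old is None:
            self._old = data.strip()
        elif self._name is None and data.strip():
            self._name = data.split()[0]


# Shortened stand-in for the question's HTML (native-script text replaced by x/y/z).
sample = """
<li>
Thalassery (<a href="/wiki/Malayalam_language">Malayalam</a>: <span lang="ml">x</span>), from
<i>Tellicherry</i></li>
<li>Thane (Marathi: <span lang="mr">y</span>), from British name <i>Tannah</i></li>
<li>Thoothukudi (Tamil: <span lang="ta">z</span>), from <i>Tuticorin</i> and its short form <i>Tuty</i></li>
"""

parser = PlaceNameParser()
parser.feed(sample)
pairs = parser.pairs
print(pairs)
```

Only the first i per item is kept, so the short form Tuty is skipped, which matches the list of words asked for in the question.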
I am trying to extract the data-message-id from the following HTML. My original goal is to extract the data-message-id for the span containing a particular text and then click on the star_button to star it.
<div class="message_content_header">
<div class="message_content_header_left">
krishnag0902
<span class="ts_tip_float message_current_status ts_tip ts_tip_top ts_tip_multiline ts_tip_delay_150 color_U5TPDSMQQ color_9f69e7 hidden ts_tip_hidden">
<span class="ts_tip_tip ts_tip_inner_current_status">
<span class="ts_tip_multiline_inner">
</span>
</span>
</span>
<i class="copy_only">[</i>4:34 PM<i class="copy_only">]</i><span class="ts_tip_tip"><span class="ts_tip_multiline_inner">Yesterday at 4:34:07 PM</span></span>
<span class="message_star_holder">
Star this message
</div>
</div>
<span class="message_body">hoho<span class="constrain_triple_clicks"></span></span>
<div class="rxn_panel rxns_key_message-1498084447_119862-C5UGEFBS9"></div>
<i class="copy_only"><br></i>
<span id="msg_1498084447_119862_label" class="message_aria_label hidden">
<strong>krishnag0902</strong>.
hoho.
four thirty-four PM.
</span>
I am using the following code on the span above (message_star_holder), which returns None:
data_mess = star_button_span.find_element_by_xpath("//button[@class='star ts_icon ts_icon_star_o ts_icon_inherit ts_tip_top star_message ts_tip ts_tip_float ts_tip_hidden btn_unstyle']")
print data_mess.get_attribute("innerHTML")
print star_button_span.get_attribute("data-msg-id")
star_button_span doesn't have a data-msg-id attribute; data_mess does:
print data_mess.get_attribute("data-msg-id")
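The point that the attribute sits on the button rather than on the surrounding span can be illustrated without a browser. This is a minimal sketch using the standard library's xml.etree.ElementTree instead of Selenium. The markup, the shortened class value and the data-msg-id value are made-up stand-ins, since the button itself is not visible in the HTML above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: the real button is not shown in the question.
markup = """
<div class="message_content_header">
  <span class="message_star_holder">
    <button class="star star_message" data-msg-id="1498084447.119862">Star this message</button>
  </span>
</div>
"""

root = ET.fromstring(markup.strip())
holder = root.find(".//span[@class='message_star_holder']")

# XPath attribute tests use @; the leading . keeps the search inside holder.
button = holder.find(".//button[@class='star star_message']")

print(holder.get("data-msg-id"))  # the span has no such attribute
print(button.get("data-msg-id"))  # the button carries it
```

Asking the span for the attribute gives None, while the nested button returns the value, which mirrors the behaviour described in the answer.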
I have a script that I use to scan through a directory and copy the content (actually I only want part of it...) of all files that contain a certain string into a new file.
import os
dr = "C:/Python34/Downloaded Files LALME/"; out = "C:/Python34/output_though_only.txt"; tag = ".txt"
files = os.listdir(dr)
for f in files:
    if f.endswith(tag):
        content = open(dr+"/"+f).readlines()
        all_lines = content
        if ">though<" in all_lines:
            open(out, "a+").write(all_lines)
infile = "C:/Python34/output_though_only.txt"
outfile = "C:/Python34/cleaned_output_though_only.txt"
delete_list = ['']
fin = open(infile)
fout = open(outfile, "a+")
for line in fin:
    for word in delete_list:
        line = line.replace(word, "")
    fout.write(line)
fin.close()
fout.close()
When I run it from the shell it returns the following error message:
C:\Python34>python LALME_script_though_extract_clean.txt -w
Traceback (most recent call last):
  File "LALME_script_though_extract_clean.txt", line 13, in <module>
    fin = open(infile)
FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Python34/output_though_only.txt'
Of course, the file does not exist (yet), but the script is supposed to create it if it doesn't exist. This is a modified version of code that I used for other strings and files, and there it worked just fine.
Can anyone find what went wrong here?
Edit: here is an older version of the script that works and was used for different files/directories.
import os
dr = "C:\Python34/tag/"; out = "C:\\Users/Yorishimo/Desktop/codes/LAEME/though/output_laeme_though/output_laeme_though.txt"; tag = ".tag"
files = os.listdir(dr)
for f in files:
    if f.endswith(tag):
        content = open(dr+"/"+f).readlines()
        all_lines = content
        needed_lines = content[21:27]
        lines_totest = ("").join(all_lines)+"\n"
        final_lines = ("").join(needed_lines)+"\n"
        if "$though/" in lines_totest:
            open(out, "a+").write(final_lines)
infile = "C:\\Users/Yorishimo/Desktop/codes/LAEME/though/output_laeme_though/output_laeme_though.txt"
outfile = "C:\\Users/Yorishimo/Desktop/codes/LAEME/though/output_laeme_though/cleaned_output_though_only.txt"
delete_list = ['</span></li><li><span class="list">', '<style type="text/css"> UL LI { list-style: none } </style><ul><li><span class="list">']
fin = open(infile)
fout = open(outfile, "a+")
for line in fin:
    for word in delete_list:
        line = line.replace(word, "")
    fout.write(line)
fin.close()
fout.close()
Edit2:
Here is an example of a file that I am trying to have scanned for the string >though<:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>eLALME</title>
<meta http-equiv="content-type" content="text/html; charset=UTF-8" />
<meta name="Title" content="">
<meta name="Description" content="">
<meta name="Keywords" content="">
<meta name="Author" content="">
<meta name="Publisher" content="">
<link rel="stylesheet" href="http://archive.ling.ed.ac.uk/ihd/elalme_scripts/lib/css/elalme_actionstyle.css" type="text/css" target="taskarea">
</head>
<!-- MAIN CONTENT BELOW HERE -->
<body>
<style type="text/css">
UL LI { list-style: none}
</style>
<p><span class="emphasis">LP 1</span></p><p>Dublin, Trinity College 154 (A.6.12). <span class="contr">ca.</span> 1400. MS in one hand. ff. 1r-105r: religious treatises. Analysis from ff. 1r-41r, then scan. LP 1. Grid 478 332. Leicestershire.</p></p><table><tr><td>1</td><td> <span class="smcap">THE</span>: </td><td>y<sup>e</sup></td></tr><tr><td>2</td><td> <span class="smcap">THESE</span>: </td><td>thes (theys) ((these))</td></tr><tr><td>3</td><td> <span class="smcap">THOSE</span>: </td><td>those (tho)</td></tr><tr><td>4</td><td> <span class="smcap">SHE</span>: </td><td>sche, she</td></tr><tr><td>5</td><td> <span class="smcap">HER</span>: </td><td>hir, hyr</td></tr><tr><td>6</td><td> <span class="smcap">IT</span>: </td><td>it</td></tr><tr><td>7</td><td> <span class="smcap">THEY</span>: </td><td>they ((thay, thei))</td></tr><tr><td>8</td><td> <span class="smcap">THEM</span>: </td><td>theym, them (thayme, theyme) ((yem, thaym, theme, yam))</td></tr><tr><td>9</td><td> <span class="smcap">THEIR</span>: </td><td>theyr (ther, y<span class="contr">er</span>) ((thayre, thayr, theyre, thare))</td></tr><tr><td>10</td><td> <span class="smcap">SUCH</span>: </td><td>suche ((syche))</td></tr><tr><td>11</td><td> <span class="smcap">WHICH</span>: </td><td>wiche, which (wyche, whiche, whyche) ((y<sup>e</sup>-wiche))</td></tr><tr><td>13</td><td> <span class="smcap">MANY</span>: </td><td>mony (many)</td></tr><tr><td>14</td><td> <span class="smcap">MAN</span>: </td><td>man, ma<span class="contr">n</span></td></tr><tr><td>15</td><td> <span class="smcap">ANY</span>: </td><td>any</td></tr><tr><td>16</td><td> <span class="smcap">MUCH</span>: </td><td>myche (mych, moche, meche, muche)</td></tr><tr><td>17</td><td> <span class="smcap">ARE</span>: </td><td>be (ar) ((are, er, byn))</td></tr><tr><td>18</td><td> <span class="smcap">WERE</span>: </td><td>were (ware) ((wher))</td></tr><tr><td>19</td><td> <span class="smcap">IS</span>: </td><td>is</td></tr><tr><td>21</td><td> <span 
class="smcap">WAS</span>: </td><td>was</td></tr><tr><td>22</td><td> <span class="smcap">SHALL</span> <span class="contr">sg</span>: </td><td>shal, schal, schall</td></tr><tr><td>22-30</td><td> <span class="smcap">SHALL</span> <span class="contr">pl</span>: </td><td>shall (shal) ((schall))</td></tr><tr><td>23</td><td> <span class="smcap">SHOULD</span> <span class="contr">sg</span>: </td><td>sholde (schulde) ((shulde))</td></tr><tr><td>23-30</td><td> <span class="smcap">SHOULD</span> <span class="contr">pl</span>: </td><td>sholde (schulde, shulde)</td></tr><tr><td>24</td><td> <span class="smcap">WILL</span> <span class="contr">sg</span>: </td><td>wyll</td></tr><tr><td>24-30</td><td> <span class="smcap">WILL</span> <span class="contr">pl</span>: </td><td>wyll</td></tr><tr><td>25</td><td> <span class="smcap">WOULD</span> <span class="contr">sg</span>: </td><td>wolde</td></tr><tr><td>26-30</td><td> <span class="smcap">TO</span> <span class="contr">prep</span> +V: </td><td>to</td></tr><tr><td>27</td><td> <span class="smcap">TO</span> <span class="contr">+inf</span> +C: </td><td>to</td></tr><tr><td>28</td><td> <span class="smcap">FROM</span>: </td><td>from (frome) ((fro))</td></tr><tr><td>29</td><td> <span class="smcap">AFTER</span>: </td><td>after</td></tr><tr><td>30</td><td> <span class="smcap">THEN</span>: </td><td>then ((than))</td></tr><tr><td>31</td><td> <span class="smcap">THAN</span>: </td><td>then (than) ((yen))</td></tr><tr><td>32</td><td> <span class="smcap">THOUGH</span>: </td><td>alof (yof) ((yoff))</td></tr><tr><td>33</td><td> <span class="smcap">IF</span>: </td><td>yff, yff-that, yf ((yf-y<sup>t</sup>, yff-y<sup>t</sup>, gyffe-y<sup>t</sup>))</td></tr><tr><td>34</td><td> <span class="smcap">AS</span>: </td><td>as</td></tr><tr><td>35</td><td> <span class="smcap">AS..AS</span>: </td><td>as+as</td></tr><tr><td>36</td><td> <span class="smcap">AGAINST</span>: </td><td>agance (agaynst, agayns, agayne)</td></tr><tr><td>39-20</td><td> <span 
class="smcap">SINCE</span> <span class="contr">conj</span>: </td><td>sythen-y<sup>t</sup>, syn</td></tr><tr><td>40</td><td> <span class="smcap">YET</span>: </td><td>ȝet ((ȝett))</td></tr><tr><td>41</td><td> <span class="smcap">WHILE</span>: </td><td>whylys-y<sup>t</sup></td></tr><tr><td>42</td><td> <span class="smcap">STRENGTH</span>: </td><td>strenght ((strenghe))</td></tr><tr><td>42-20</td><td> <span class="smcap">STRENGTHEN</span> <span class="contr">vb</span>: </td><td>strenght</td></tr><tr><td>44</td><td> <span class="smcap">WH-</span>: </td><td>wh- ((w-))</td></tr><tr><td>46</td><td> <span class="smcap">NOT</span>: </td><td>not, nott</td></tr><tr><td>47</td><td> <span class="smcap">NOR</span>: </td><td>nor (ne)</td></tr><tr><td>48</td><td> <span class="smcap">OE</span>, <span class="smcap">ON</span> <span class="contr">ā</span> (‘a’, ‘o’): </td><td>o</td></tr><tr><td>49</td><td> <span class="smcap">WORLD</span>: </td><td>woorlde, worlde, warlde, world</td></tr><tr><td>50</td><td> <span class="smcap">THINK</span> <span class="contr">vb</span>: </td><td>thynke, thyngke</td></tr><tr><td>51</td><td> <span class="smcap">WORK</span> <span class="contr">sb</span>: </td><td>werke</td></tr><tr><td>51-10</td><td> <span class="smcap">WORK</span> <span class="contr">pres stem</span>: </td><td>werke</td></tr><tr><td>52</td><td> <span class="smcap">THERE</span>: </td><td>ther ((y<span class="contr">er</span>))</td></tr><tr><td>53</td><td> <span class="smcap">WHERE</span>: </td><td>wher-, where</td></tr><tr><td>54</td><td> <span class="smcap">MIGHT</span> <span class="contr">vb</span>: </td><td>myght</td></tr><tr><td>55</td><td> <span class="smcap">THROUGH</span>: </td><td>throughe (throghe) ((throught, through))</td></tr><tr><td>56</td><td> <span class="smcap">WHEN</span>: </td><td>when</td></tr><tr><td>57</td><td> <span class="contr">Sb pl</span>: </td><td>-ys (-s, -<span class="contr">es</span>) ((-es, -is))</td></tr><tr><td>58</td><td> <span class="contr">Pres 
part</span>: </td><td>-yng</td></tr><tr><td>59</td><td> <span class="contr">Vbl sb</span>: </td><td>-yng</td></tr><tr><td>61</td><td> <span class="contr">Pres 3sg</span>: </td><td>-ys (-yth, -eth, -<span class="contr">es</span>, -s)</td></tr><tr><td>62</td><td> <span class="contr">Pres pl</span>: </td><td>-th</td></tr><tr><td>65</td><td> <span class="contr">Weak ppl</span>: </td><td>-ed (-et) ((-yd))</td></tr><tr><td>66</td><td> <span class="contr">Str ppl</span>: </td><td>-en, -on, -yne</td></tr><tr><td>70-20</td><td> <span class="smcap">ABOUT</span> <span class="contr">pr</span>: </td><td>abowte</td></tr><tr><td>71-20</td><td> <span class="smcap">ABOVE</span> <span class="contr">pr</span>: </td><td>a-boue, abowe</td></tr><tr><td>73</td><td> <span class="smcap">AFTERWARDS</span>: </td><td>afterward</td></tr><tr><td>75</td><td> <span class="smcap">ALL</span>: </td><td>all, al</td></tr><tr><td>77</td><td> <span class="smcap">AMONG</span> <span class="contr">adv</span>: </td><td>emong</td></tr><tr><td>77-20</td><td> <span class="smcap">AMONG</span> <span class="contr">pr</span>: </td><td>emong</td></tr><tr><td>78-20</td><td> <span class="smcap">ANSWER</span> <span class="contr">vb</span>: </td><td>answer</td></tr><tr><td>80</td><td> <span class="smcap">ASK</span> <span class="contr">vb</span>: </td><td>aske</td></tr><tr><td>81</td><td> <span class="smcap">AT</span><span class="contr">+inf</span>: </td><td>at</td></tr><tr><td>83</td><td> <span class="smcap">AWAY</span>: </td><td>away</td></tr><tr><td>84-20</td><td> <span class="smcap">BE</span> <span class="contr">ppl</span>: </td><td>beyn (byn)</td></tr><tr><td>85-20</td><td> <span class="smcap">BEFORE</span> <span class="contr">adv-time</span>: </td><td>before, befor</td></tr><tr><td>85-31</td><td> <span class="smcap">BEFORE</span> <span class="contr">pr-place</span>: </td><td>be-fore, be-for</td></tr><tr><td>89</td><td> <span class="smcap">BETWEEN</span> <span class="contr">pr</span>: 
</td><td>betwene</td></tr><tr><td>93</td><td> <span class="smcap">BLESSED</span> <span class="contr">adj/ppl</span>: </td><td>blessyd (blessed) ((blyssed))</td></tr><tr><td>94</td><td> <span class="smcap">BOTH</span>: </td><td>bothe</td></tr><tr><td>96</td><td> <span class="smcap">BROTHER</span>: </td><td>broder (brother)</td></tr><tr><td>99</td><td> <span class="smcap">BUSY</span> <span class="contr">adj</span>: </td><td>besy ((busy))</td></tr><tr><td>99-20</td><td> <span class="smcap">BUSY</span> <span class="contr">vb</span>: </td><td>besy-, busy</td></tr><tr><td>100</td><td> <span class="smcap">BUT</span>: </td><td>bot (bott)</td></tr><tr><td>102</td><td> <span class="smcap">BY</span>: </td><td>by</td></tr><tr><td>103-30</td><td> <span class="smcap">CALLED</span> <span class="contr">ppl</span>: </td><td>called</td></tr><tr><td>104</td><td> <span class="smcap">CAME</span> <span class="contr">sg</span>: </td><td>came</td></tr><tr><td>105-20</td><td> CAN <span class="contr">1/3sg</span>: </td><td>can</td></tr><tr><td>106</td><td> <span class="smcap">CAST</span> <span class="contr">vb</span>: </td><td>cast</td></tr><tr><td>108</td><td> <span class="smcap">CHURCH</span>: </td><td>churche</td></tr><tr><td>109</td><td> <span class="smcap">COULD</span> <span class="contr">1/3sg</span>: </td><td>cowthe</td></tr><tr><td>112</td><td> <span class="smcap">DAY</span>: </td><td>day</td></tr><tr><td>113</td><td> <span class="smcap">DEATH</span>: </td><td>dethe</td></tr><tr><td>114</td><td> <span class="smcap">DIE</span> <span class="contr">vb</span>: </td><td>dye</td></tr><tr><td>115-70</td><td> <span class="smcap">DID</span> <span class="contr">pl</span>: </td><td>dyd</td></tr><tr><td>116</td><td> <span class="smcap">DOWN</span>: </td><td>downe</td></tr><tr><td>119</td><td> <span class="smcap">EARTH</span>: </td><td>erthe</td></tr><tr><td>125</td><td> <span class="smcap">ENOUGH</span>: </td><td>enoughe</td></tr><tr><td>129</td><td> <span class="smcap">FAR</span>: 
</td><td>far</td></tr><tr><td>130</td><td> <span class="smcap">FATHER</span>: </td><td>fad<span class="contr">er</span>, fader</td></tr><tr><td>132</td><td> <span class="smcap">FELLOW</span>: </td><td>felou-</td></tr><tr><td>134</td><td> <span class="smcap">FIGHT</span> <span class="contr">pres</span>: </td><td>feght-</td></tr><tr><td>137</td><td> <span class="smcap">FIRE</span>: </td><td>fyer, fyre, fire</td></tr><tr><td>138</td><td> <span class="smcap">FIRST</span> <span class="contr">undiff</span>: </td><td>ferst, first</td></tr><tr><td>139</td><td> <span class="smcap">FIVE</span>: </td><td>fyve</td></tr><tr><td>139-20</td><td> <span class="smcap">FIFTH</span>: </td><td>feyfte</td></tr><tr><td>140</td><td> <span class="smcap">FLESH</span>: </td><td>fleshe, flesche</td></tr><tr><td>141</td><td> <span class="smcap">FOLLOW</span> <span class="contr">vb</span>: </td><td>folow, folo-</td></tr><tr><td>144-20</td><td> <span class="smcap">FOURTH</span>: </td><td>fawrte</td></tr><tr><td>146</td><td> <span class="smcap">FRIEND</span>: </td><td>frend-</td></tr><tr><td>147</td><td> <span class="smcap">FRUIT</span>: </td><td>frutt-</td></tr><tr><td>153</td><td> <span class="smcap">GIVE</span> <span class="contr">pres</span>: </td><td>gyue (gyff-)</td></tr><tr><td>155</td><td> <span class="smcap">GOOD</span>: </td><td>good, gud</td></tr><tr><td>157</td><td> <span class="smcap">GROW</span> <span class="contr">pres</span>: </td><td>groue (growe)</td></tr><tr><td>160</td><td> <span class="smcap">HAVE</span> <span class="contr">pres</span>: </td><td>haue</td></tr><tr><td>160-40</td><td> <span class="smcap">HAS</span> <span class="contr">3sg</span>: </td><td>hayth ((haythe, haith))</td></tr><tr><td>164</td><td> <span class="smcap">HEAVEN</span>: </td><td>hewyn</td></tr><tr><td>165</td><td> <span class="smcap">HEIGHT</span>: </td><td>heghte</td></tr><tr><td>166</td><td> <span class="smcap">HELL</span>: </td><td>hell</td></tr><tr><td>168</td><td> <span class="smcap">HIGH</span>: 
</td><td>hegh (heghe, hyghe)</td></tr><tr><td>168-20</td><td> <span class="smcap">HIGHER</span>: </td><td>hyer</td></tr><tr><td>171</td><td> <span class="smcap">HIM</span>: </td><td>hym</td></tr><tr><td>175</td><td> <span class="smcap">HOLY</span>: </td><td>holy</td></tr><tr><td>176</td><td> <span class="smcap">HOW</span>: </td><td>how (howe)</td></tr><tr><td>181</td><td> <span class="smcap">KNOW</span> <span class="contr">pres</span>: </td><td>knawe, knaw</td></tr><tr><td>185</td><td> <span class="smcap">LAW</span>: </td><td>lawe</td></tr><tr><td>187</td><td> <span class="smcap">LESS</span>: </td><td>lesse, leysse</td></tr><tr><td>190</td><td> <span class="smcap">LIFE</span>: </td><td>lyue</td></tr><tr><td>191</td><td> <span class="smcap">LITTLE</span>: </td><td>lyttyll (lyttyl)</td></tr><tr><td>192</td><td> <span class="smcap">LIVE</span> <span class="contr">vb</span>: </td><td>lyff-</td></tr><tr><td>194</td><td> <span class="smcap">LORD</span>: </td><td>lorde ((lord))</td></tr><tr><td>196</td><td> <span class="smcap">LOVE</span> <span class="contr">sb</span>: </td><td>loue (luffe, luff)</td></tr><tr><td>196-20</td><td> <span class="smcap">LOVE</span> <span class="contr">vb</span>: </td><td>loue</td></tr><tr><td>197</td><td> <span class="smcap">LOW</span>: </td><td>low-</td></tr><tr><td>199-10</td><td> <span class="smcap">MAY</span> <span class="contr">1/3sg</span>: </td><td>may</td></tr><tr><td>202</td><td> <span class="smcap">MOON</span>: </td><td>mone</td></tr><tr><td>203</td><td> <span class="smcap">MOTHER</span>: </td><td>mother, moder</td></tr><tr><td>204</td><td> <span class="smcap">MY</span> +C: </td><td>my</td></tr><tr><td>204-20</td><td> <span class="smcap">MY</span> <span class="contr">+h</span>: </td><td>my</td></tr><tr><td>205</td><td> <span class="smcap">NAME</span> <span class="contr">sb</span>: </td><td>name</td></tr><tr><td>210</td><td> <span class="smcap">NEITHER</span> <span class="contr">pron</span>: 
</td><td>nawther</td></tr><tr><td>211</td><td> <span class="smcap">NEITHER..NOR</span>: </td><td>nawther+ne, nother+nor, nawther+then</td></tr><tr><td>212</td><td> <span class="smcap">NEVER</span>: </td><td>neu<span class="contr">er</span></td></tr><tr><td>213</td><td> <span class="smcap">NEW</span>: </td><td>new-</td></tr><tr><td>214</td><td> <span class="smcap">NIGH</span>: </td><td>neyr, nere-</td></tr><tr><td>218</td><td> <span class="smcap">NOW</span>: </td><td>nowe, now</td></tr><tr><td>219</td><td> <span class="smcap">OLD</span>: </td><td>holde</td></tr><tr><td>220-20</td><td> <span class="smcap">ONE</span> <span class="contr">pron</span>: </td><td>one</td></tr><tr><td>221</td><td> <span class="smcap">OR</span>: </td><td>or</td></tr><tr><td>222</td><td> <span class="smcap">OTHER</span>: </td><td>other</td></tr><tr><td>224</td><td> <span class="smcap">OUR</span>: </td><td>oure</td></tr><tr><td>225</td><td> <span class="smcap">OUT</span>: </td><td>oute, out, owt</td></tr><tr><td>226</td><td> <span class="smcap">OWN</span> <span class="contr">adj</span>: </td><td>awne</td></tr><tr><td>227</td><td> <span class="smcap">PEOPLE</span>: </td><td>people, peopyll</td></tr><tr><td>228</td><td> <span class="smcap">POOR</span>: </td><td>pooer, poer, poore</td></tr><tr><td>229</td><td> <span class="smcap">PRAY</span> <span class="contr">vb</span>: </td><td>pray</td></tr><tr><td>235</td><td> <span class="smcap">SAY</span> <span class="contr">pres</span>: </td><td>say</td></tr><tr><td>235-21</td><td> <span class="smcap">SAYS</span> <span class="contr">3sg</span>: </td><td>sayth</td></tr><tr><td>235-30</td><td> <span class="smcap">SAY</span> <span class="contr">pl</span>: </td><td>sayth</td></tr><tr><td>235-40</td><td> <span class="smcap">SAID</span> <span class="contr">sg</span>: </td><td>sayde</td></tr><tr><td>235-60</td><td> <span class="smcap">SAID</span> <span class="contr">ppl</span>: </td><td>sayd</td></tr><tr><td>236</td><td> <span class="smcap">SEE</span> <span 
class="contr">vb</span>: </td><td>se</td></tr><tr><td>236-21</td><td> <span class="smcap">SEES</span> <span class="contr">3sg</span>: </td><td>seeth</td></tr><tr><td>237</td><td> <span class="smcap">SEEK</span> <span class="contr">pres</span>: </td><td>seke</td></tr><tr><td>238</td><td> <span class="smcap">SELF</span>: </td><td>selffe, selfe</td></tr><tr><td>242</td><td> <span class="smcap">SIN</span> <span class="contr">sb</span>: </td><td>synn-, syn ((syne))</td></tr><tr><td>242-30</td><td> <span class="smcap">SIN</span> <span class="contr">vb</span>: </td><td>synn-</td></tr><tr><td>243</td><td> <span class="smcap">SISTER</span>: </td><td>sust<span class="contr">er</span>, suster</td></tr><tr><td>244</td><td> <span class="smcap">SIX</span>: </td><td>sex</td></tr><tr><td>246</td><td> <span class="smcap">SOME</span>: </td><td>su<span class="contr">m</span>me ((sume))</td></tr><tr><td>248</td><td> <span class="smcap">SORROW</span> <span class="contr">sb</span>: </td><td>sorow, sorowe, sorousys<<span class="contr">pl</span>></td></tr><tr><td>249</td><td> <span class="smcap">SOUL</span>: </td><td>soule (saule)</td></tr><tr><td>249-20</td><td> <span class="smcap">SOULS</span>: </td><td>soules, saulis, sawl<span class="contr">es</span></td></tr><tr><td>254</td><td> <span class="smcap">STEAD</span>: </td><td>sted-</td></tr><tr><td>261</td><td> <span class="smcap">THOU</span>: </td><td>y<sup>u</sup></td></tr><tr><td>262</td><td> <span class="smcap">THEE</span>: </td><td>ye</td></tr><tr><td>263</td><td> <span class="smcap">THY</span> +C: </td><td>y<sup>i</sup></td></tr><tr><td>266</td><td> <span class="smcap">THOUSAND</span>: </td><td>thousande</td></tr><tr><td>267-20</td><td> <span class="smcap">THIRD</span>: </td><td>thryde, therde, threde</td></tr><tr><td>268</td><td> <span class="smcap">TOGETHER</span>: </td><td>to-gedder, to-gether</td></tr><tr><td>270</td><td> <span class="smcap">TRUE</span>: </td><td>true</td></tr><tr><td>273</td><td> <span 
class="smcap">TWELVE</span>: </td><td>twelue</td></tr><tr><td>275</td><td> <span class="smcap">TWO</span>: </td><td>too</td></tr><tr><td>278</td><td> <span class="smcap">UPON</span>: </td><td>apon (appon) ((vppon))</td></tr><tr><td>281</td><td> <span class="smcap">WELL</span> <span class="contr">adv</span>: </td><td>well ((wel))</td></tr><tr><td>282</td><td> <span class="smcap">WENT</span>: </td><td>went</td></tr><tr><td>285</td><td> <span class="smcap">WHETHER</span>: </td><td>whether (whed<span class="contr">er</span>, wether)</td></tr><tr><td>286</td><td> <span class="smcap">WHITHER</span>: </td><td>wheder</td></tr><tr><td>291</td><td> <span class="smcap">WHY</span>: </td><td>why</td></tr><tr><td>292-20</td><td> <span class="smcap">WIT</span> <span class="contr">1/3sg <span class="smcap">KNOW</span></span>: </td><td>wote</td></tr><tr><td>295</td><td> <span class="smcap">WITHOUT</span> <span class="contr">pr</span>: </td><td>w<sup>t</sup>-owte, w<sup>t</sup>-owt, w<sup>t</sup>-owtyn (w<sup>t</sup>-out)</td></tr><tr><td>297</td><td> <span class="smcap">WORSHIP</span> <span class="contr">sb</span>: </td><td>worschippe, worschip</td></tr><tr><td>298</td><td> <span class="smcap">YE</span>: </td><td>ȝe</td></tr><tr><td>299</td><td> <span class="smcap">YOU</span>: </td><td>you ((youe))</td></tr><tr><td>300</td><td> <span class="smcap">YOUR</span>: </td><td>you<span class="contr">r</span> (youre)</td></tr><tr><td>303</td><td> <span class="smcap">YOUNG</span>: </td><td>yong</td></tr><tr><td>304</td><td> <span class="smcap">-ALD</span>: </td><td>-old</td></tr><tr><td>306</td><td> <span class="smcap">-AND</span>: </td><td>-and (-ond)</td></tr><tr><td>307</td><td> <span class="smcap">-ANG</span>: </td><td>-ong ((-ang))</td></tr><tr><td>308</td><td> <span class="smcap">-ANK</span>: </td><td>-ank, -angk</td></tr><tr><td>309</td><td> <span class="smcap">-DOM</span>: </td><td>-dome, -dom</td></tr><tr><td>312</td><td> <span class="smcap">-ER</span>: </td><td>-er (-<span 
class="contr">er</span>) ((-ar))</td></tr><tr><td>313</td><td> <span class="smcap">-EST</span> <span class="contr">sup</span>: </td><td>-est</td></tr><tr><td>314</td><td> <span class="smcap">-FUL</span>: </td><td>-full</td></tr><tr><td>315</td><td> <span class="smcap">-HOOD</span>: </td><td>-hede, -hed</td></tr><tr><td>316</td><td> <span class="smcap">-LESS</span>: </td><td>-les</td></tr><tr><td>317</td><td> <span class="smcap">-LY</span>: </td><td>-ly</td></tr><tr><td>318</td><td> <span class="smcap">-NESS</span>: </td><td>-nes</td></tr><tr><td>319</td><td> <span class="smcap">-SHIP</span>: </td><td>-schippe, -schip</td></tr></table>
If ">though<" is supposed to be a whole line, it won't be found in the list returned by readlines, because every element of that list ends with a newline. You need to add a newline to the search string or str.rstrip the lines in the list.
In [15]: l = [">though<\n"]
In [16]: ">though<" in l
Out[16]: False
In [17]: ">though<\n" in l
Out[17]: True
This will check for a match properly and use glob to find all the txt files in the directory, without storing all lines in a list:
from glob import glob

dr = "C:/Python34/Downloaded Files LALME/"
out = "C:/Python34/output_though_only.txt"
files = glob(dr + "*.txt")
with open(out, "w") as f:
    for fle in files:
        with open(fle) as f2:
            for line in f2:
                if line.rstrip() == ">though<":
                    # rewind the input file so the whole file is copied
                    f2.seek(0)
                    f.writelines(f2)
                    break
Then reopen the file after writing to it:
outfile = "C:\\Python34/cleaned_output_though_only.txt"
delete_list = ['']
with open(out) as fin, open(outfile, "w") as fout:
    for line in fin:
        for word in delete_list:
            line = line.replace(word, "")
        fout.write(line)
But your delete list contains only an empty string, so I am not sure what that is meant to do.
Your code from your edit works as you are checking a string not a list:
In [20]: f = "foo"
In [21]: l = ["foo\n"]
In [22]: f in l
Out[22]: False
In [23]: s = "".join(l)
In [24]: s
Out[24]: 'foo\n'
In [25]: f in s
Out[25]: True
In the edited code you are checking for a substring in a string, so you get a match even without the newline. With a list you need an exact match, and that is what your original code requires: it uses readlines and never calls join the way the edited code does.
So the big difference between the two is that one checks for a substring in a string, while the other checks whether an exact match is in the list, which fails because of the newline character.
Based on your new edit, you are not checking for a whole line; you are checking for a substring with a case-insensitive match, so use:
`if ">though<" in line.lower()`
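Putting the pieces together, here is a minimal sketch of the scan step with a case-insensitive substring test. The function name copy_matching_files is hypothetical, and the demo runs on a temporary directory with two made-up files rather than the real LALME downloads.

```python
import glob
import os
import tempfile


def copy_matching_files(src_dir, out_path, needle=">though<"):
    """Write every .txt file in src_dir that contains needle (any case) to out_path."""
    matched = []
    # Opening with "w" creates the output file even when nothing matches.
    with open(out_path, "w", encoding="utf-8") as out:
        for path in sorted(glob.glob(os.path.join(src_dir, "*.txt"))):
            with open(path, encoding="utf-8") as src:
                text = src.read()
            # Substring test on the whole file, lower-cased so >THOUGH< matches too.
            if needle in text.lower():
                out.write(text)
                matched.append(os.path.basename(path))
    return matched


with tempfile.TemporaryDirectory() as tmp:
    # Two made-up input files: only a.txt contains the search string.
    with open(os.path.join(tmp, "a.txt"), "w", encoding="utf-8") as f:
        f.write('<span class="smcap">THOUGH</span>: alof\n')
    with open(os.path.join(tmp, "b.txt"), "w", encoding="utf-8") as f:
        f.write('<span class="smcap">THESE</span>: thes\n')
    out_file = os.path.join(tmp, "result.out")
    matched = copy_matching_files(tmp, out_file)
    with open(out_file, encoding="utf-8") as f:
        copied = f.read()

print(matched)
```

Because the output file is always created, the later open(infile) step no longer fails when no input file matches, which is what happened in the original script: the list test never succeeded, so open(out, "a+") was never reached.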
I am trying to pull data from BBC Weather with a view to using it in a home automation dashboard.
I can pull the HTML fine and I can pull one set of temperatures, but it only pulls the first.
</li>
<li class="daily__day-tab day-20150418 ">
<a data-ajax-href="/weather/en/2646504/daily/2015-04-18?day=3" href="/weather/2646504?day=3" rel="nofollow">
<div class="daily__day-header">
<h3 class="daily__day-date">
<span aria-label="Saturday" class="day-name">Sat</span>
</h3>
</div>
<span class="weather-type-image weather-type-image-40" title="Sunny"><img alt="Sunny" src="http://static.bbci.co.uk/weather/0.5.327/images/icons/tab_sprites/40px/1.png"/></span>
<span class="max-temp max-temp-value"> <span class="units-values temperature-units-values"><span class="units-value temperature-value temperature-value-unit-c" data-unit="c">13<span class="unit">°C</span></span><span class="unit-types-separator"> </span><span class="units-value temperature-value temperature-value-unit-f" data-unit="f">55<span class="unit">°F</span></span></span></span>
<span class="min-temp min-temp-value"> <span class="units-values temperature-units-values"><span class="units-value temperature-value temperature-value-unit-c" data-unit="c">5<span class="unit">°C</span></span><span class="unit-types-separator"> </span><span class="units-value temperature-value temperature-value-unit-f" data-unit="f">41<span class="unit">°F</span></span></span></span>
<span class="wind wind-speed windrose-icon windrose-icon--average windrose-icon-40 windrose-icon-40--average wind-direction-ene" data-tooltip-kph="31 km/h, East North Easterly" data-tooltip-mph="19 mph, East North Easterly" title="19 mph, East North Easterly">
<span class="speed"> <span class="wind-speed__description wind-speed__description--average">Wind Speed</span>
<span class="units-values windspeed-units-values"><span class="units-value windspeed-value windspeed-value-unit-kph" data-unit="kph">31 <span class="unit">km/h</span></span><span class="unit-types-separator"> </span><span class="units-value windspeed-value windspeed-value-unit-mph" data-unit="mph">19 <span class="unit">mph</span></span></span></span>
<span class="description blq-hide">East North Easterly</span>
</span>
This is my code which isn’t working
import urllib2
import pprint
from bs4 import BeautifulSoup
htmlFile=urllib2.urlopen('http://www.bbc.co.uk/weather/2646504?day=1')
htmlData = htmlFile.read()
soup = BeautifulSoup(htmlData)
table=soup.find("div","daily-window")
temperatures=[str(tem.contents[0]) for tem in table.find_all("span",class_="units-value temperature-value temperature-value-unit-c")]
mintemp=[str(min.contents[0]) for min in table.find_all("span",class_="min-temp min-temp-value")]
maxtemp=[str(min.contents[0]) for min in table.find_all("span",class_="max-temp max-temp-value")]
windspeeds=[str(speed.contents[0]) for speed in table.find_all("span",class_="units-value windspeed-value windspeed-value-unit-mph")]
pprint.pprint(zip(temperatures,temp2,windspeeds))
Your min and max temperature extraction is wrong. You are selecting the whole min temp span, which includes both the °C and °F values, so taking the first item of its contents gives you only a whitespace string.
The min temp container (class="min-temp min-temp-value") is not the same element as the Celsius value span (class containing temperature-value-unit-c) nested inside it, so I suggest you use a CSS selector.
For example, all of the min temp Celsius spans could be found with:
table.select('span.min-temp.min-temp-value span.temperature-value-unit-c')
This selects all spans with class temperature-value-unit-c that are descendants of spans with class min-temp min-temp-value.
The other lists, such as max temp and wind speed, can be built the same way.
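To make the nesting logic behind that selector explicit, here is a minimal sketch that does the same descendant check by hand with the standard library's html.parser, so it runs without BeautifulSoup. The class name MinTempParser and the trimmed sample markup are illustrative assumptions; with BeautifulSoup the select call above is the shorter route.

```python
from html.parser import HTMLParser


class MinTempParser(HTMLParser):
    """Collect Celsius values found inside spans that carry the min-temp class."""

    def __init__(self):
        super().__init__()
        self.values = []
        self._stack = []  # one entry per open <span>: its set of classes

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            classes = set((dict(attrs).get("class") or "").split())
            self._stack.append(classes)

    def handle_endtag(self, tag):
        if tag == "span" and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        if not self._stack or not data.strip():
            return
        innermost = self._stack[-1]
        # Descendant test: some enclosing span must have the min-temp class.
        inside_min = any("min-temp" in classes for classes in self._stack[:-1])
        if "temperature-value-unit-c" in innermost and inside_min:
            self.values.append(data.strip())


# Trimmed version of the max/min temperature spans from the question.
sample = (
    '<span class="max-temp max-temp-value"> <span class="units-values">'
    '<span class="units-value temperature-value-unit-c" data-unit="c">13<span class="unit">&deg;C</span></span>'
    '<span class="units-value temperature-value-unit-f" data-unit="f">55<span class="unit">&deg;F</span></span>'
    '</span></span>'
    '<span class="min-temp min-temp-value"> <span class="units-values">'
    '<span class="units-value temperature-value-unit-c" data-unit="c">5<span class="unit">&deg;C</span></span>'
    '<span class="units-value temperature-value-unit-f" data-unit="f">41<span class="unit">&deg;F</span></span>'
    '</span></span>'
)

parser = MinTempParser()
parser.feed(sample)
parser.close()
min_temps = parser.values
print(min_temps)
```

Only the Celsius value nested under the min-temp span is collected; the max-temp value and both Fahrenheit values are ignored.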
The website I'm trying to scrape looks like this:
<div align="center" class="movietable">
<span style="width:45px;height:47px;vertical-align:middle;display:table-cell;">
<img border="0" src="styles/images/cat/hd.png" alt="HdO">
</span>
</div>
<div align="left" class="movietable">
<span style="padding:0px 5px;width:455px;height:47px;vertical-align:middle;display:table-cell;">
<a data-toggle="tooltip" data-placement="bottom" data-html="true" title="" href="details.php?id=578197" data-original-title="<img src='https://trasd.tmdb.org//tqistSlQGQVlvDZHweD.jpg'>">
<b>GET THIS TEXT</b></a><br><font class="small">[Action, Horror, Sci-Fi]</font>
</span>
</div>
How can I extract:
The text in the <b> tag - in this case GET THIS TEXT
The content of the font tag with class 'small' - in this case this would be Action, Horror, Sci-Fi
.movietable b works great!!
The img src link - in this case it would be https://trasd.tmdb.org//tqistSlQGQVlvDZHweD.jpg
I have no idea how to do this.
Below are CSS selectors you can use:
driver.find_element_by_css_selector('div[align=left] b')
driver.find_element_by_css_selector('div[align=left] .small')
driver.find_element_by_css_selector('a[title]').get_attribute('data-original-title')
You can access all of them using xpath:
1) [parents before this div]/div[2]/span/a/b
2) [parents before this div]/div[2]/span/font
3) [parents before this div]/div[1]/span/a/img
[parents before this div] should be /html/body/...
As per the HTML you have shared to extract the items you can use the following solution:
GET THIS TEXT:
driver.find_element_by_xpath("//div[@class='movietable' and @align='left']/span/a[@data-toggle='tooltip' and @data-placement='bottom']/b").get_attribute("innerHTML")
[Action, Horror, Sci-Fi]:
driver.find_element_by_xpath("//div[@class='movietable' and @align='left']/span//font[@class='small']").get_attribute("innerHTML")
https://trasd.tmdb.org//tqistSlQGQVlvDZHweD.jpg:
img_src = driver.find_element_by_xpath("//div[@class='movietable' and @align='left']/span/a[@data-toggle='tooltip' and @data-placement='bottom']").get_attribute("data-original-title")
# the URL is the text between the two single quotes
src = img_src.split("'")[1]
print(src)
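The attribute value is itself a small piece of HTML, so the URL can also be pulled out with a regular expression rather than by splitting on a delimiter. A minimal sketch, assuming the value has the same shape as in the question; the variable names are illustrative.

```python
import re

# Value of data-original-title as it appears in the question's HTML.
title_attr = "<img src='https://trasd.tmdb.org//tqistSlQGQVlvDZHweD.jpg'>"

# Capture everything between src=' and the closing single quote.
match = re.search(r"src='([^']+)'", title_attr)
img_url = match.group(1) if match else None
print(img_url)
```

Returning None when nothing matches avoids an IndexError on entries whose tooltip has no image.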