Errno 2 - No such file or directory - python

I have a script that I use to scan through a directory and copy the content (actually I only want part of it...) of all files that contain a certain string into a new file.
import os
dr = "C:/Python34/Downloaded Files LALME/"; out = "C:/Python34/output_though_only.txt"; tag = ".txt"
files = os.listdir(dr)
for f in files:
    if f.endswith(tag):
        content = open(dr+"/"+f).readlines()
        all_lines = content
        if ">though<" in all_lines:
            open(out, "a+").write(all_lines)
infile = "C:/Python34/output_though_only.txt"
outfile = "C:/Python34/cleaned_output_though_only.txt"
delete_list = ['']
fin = open(infile)
fout = open(outfile, "a+")
for line in fin:
    for word in delete_list:
        line = line.replace(word, "")
    fout.write(line)
fin.close()
fout.close()
When I run it from the shell it returns the following error message:
C:\Python34>python LALME_script_though_extract_clean.txt -w
Traceback (most recent call last):
File "LALME_script_though_extract_clean.txt", line 13, in <module>
fin = open(infile)
FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Python34/output_though_only.txt'
Of course, the file does not exist (yet), but the script is supposed to create it if it doesn't exist. This is a modified version of code that I used for other strings and files, and there it worked just fine.
Can anyone find what went wrong here?
Edit: here is an older version of the script that works and was used for different files/directories.
import os
dr = "C:\Python34/tag/"; out = "C:\\Users/Yorishimo/Desktop/codes/LAEME/though/output_laeme_though/output_laeme_though.txt"; tag = ".tag"
files = os.listdir(dr)
for f in files:
    if f.endswith(tag):
        content = open(dr+"/"+f).readlines()
        all_lines = content
        needed_lines = content[21:27]
        lines_totest = ("").join(all_lines)+"\n"
        final_lines = ("").join(needed_lines)+"\n"
        if "$though/" in lines_totest:
            open(out, "a+").write(final_lines)
infile = "C:\\Users/Yorishimo/Desktop/codes/LAEME/though/output_laeme_though/output_laeme_though.txt"
outfile = "C:\\Users/Yorishimo/Desktop/codes/LAEME/though/output_laeme_though/cleaned_output_though_only.txt"
delete_list = ['</span></li><li><span class="list">', '<style type="text/css"> UL LI { list-style: none } </style><ul><li><span class="list">']
fin = open(infile)
fout = open(outfile, "a+")
for line in fin:
    for word in delete_list:
        line = line.replace(word, "")
    fout.write(line)
fin.close()
fout.close()
Edit2:
Here is an example of a file that I am trying to have scanned for the string >though<:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>eLALME</title>
<meta http-equiv="content-type" content="text/html; charset=UTF-8" />
<meta name="Title" content="">
<meta name="Description" content="">
<meta name="Keywords" content="">
<meta name="Author" content="">
<meta name="Publisher" content="">
<link rel="stylesheet" href="http://archive.ling.ed.ac.uk/ihd/elalme_scripts/lib/css/elalme_actionstyle.css" type="text/css" target="taskarea">
</head>
<!-- MAIN CONTENT BELOW HERE -->
<body>
<style type="text/css">
UL LI { list-style: none}
</style>
<p><span class="emphasis">LP 1</span></p><p>Dublin, Trinity College 154 (A.6.12). <span class="contr">ca.</span> 1400. MS in one hand. ff. 1r-105r: religious treatises. Analysis from ff. 1r-41r, then scan. LP 1. Grid 478 332. Leicestershire.</p></p><table><tr><td>1</td><td> <span class="smcap">THE</span>: </td><td>y<sup>e</sup></td></tr><tr><td>2</td><td> <span class="smcap">THESE</span>: </td><td>thes (theys) ((these))</td></tr><tr><td>3</td><td> <span class="smcap">THOSE</span>: </td><td>those (tho)</td></tr><tr><td>4</td><td> <span class="smcap">SHE</span>: </td><td>sche, she</td></tr><tr><td>5</td><td> <span class="smcap">HER</span>: </td><td>hir, hyr</td></tr><tr><td>6</td><td> <span class="smcap">IT</span>: </td><td>it</td></tr><tr><td>7</td><td> <span class="smcap">THEY</span>: </td><td>they ((thay, thei))</td></tr><tr><td>8</td><td> <span class="smcap">THEM</span>: </td><td>theym, them (thayme, theyme) ((yem, thaym, theme, yam))</td></tr><tr><td>9</td><td> <span class="smcap">THEIR</span>: </td><td>theyr (ther, y<span class="contr">er</span>) ((thayre, thayr, theyre, thare))</td></tr><tr><td>10</td><td> <span class="smcap">SUCH</span>: </td><td>suche ((syche))</td></tr><tr><td>11</td><td> <span class="smcap">WHICH</span>: </td><td>wiche, which (wyche, whiche, whyche) ((y<sup>e</sup>-wiche))</td></tr><tr><td>13</td><td> <span class="smcap">MANY</span>: </td><td>mony (many)</td></tr><tr><td>14</td><td> <span class="smcap">MAN</span>: </td><td>man, ma<span class="contr">n</span></td></tr><tr><td>15</td><td> <span class="smcap">ANY</span>: </td><td>any</td></tr><tr><td>16</td><td> <span class="smcap">MUCH</span>: </td><td>myche (mych, moche, meche, muche)</td></tr><tr><td>17</td><td> <span class="smcap">ARE</span>: </td><td>be (ar) ((are, er, byn))</td></tr><tr><td>18</td><td> <span class="smcap">WERE</span>: </td><td>were (ware) ((wher))</td></tr><tr><td>19</td><td> <span class="smcap">IS</span>: </td><td>is</td></tr><tr><td>21</td><td> <span 
class="smcap">WAS</span>: </td><td>was</td></tr><tr><td>22</td><td> <span class="smcap">SHALL</span> <span class="contr">sg</span>: </td><td>shal, schal, schall</td></tr><tr><td>22-30</td><td> <span class="smcap">SHALL</span> <span class="contr">pl</span>: </td><td>shall (shal) ((schall))</td></tr><tr><td>23</td><td> <span class="smcap">SHOULD</span> <span class="contr">sg</span>: </td><td>sholde (schulde) ((shulde))</td></tr><tr><td>23-30</td><td> <span class="smcap">SHOULD</span> <span class="contr">pl</span>: </td><td>sholde (schulde, shulde)</td></tr><tr><td>24</td><td> <span class="smcap">WILL</span> <span class="contr">sg</span>: </td><td>wyll</td></tr><tr><td>24-30</td><td> <span class="smcap">WILL</span> <span class="contr">pl</span>: </td><td>wyll</td></tr><tr><td>25</td><td> <span class="smcap">WOULD</span> <span class="contr">sg</span>: </td><td>wolde</td></tr><tr><td>26-30</td><td> <span class="smcap">TO</span> <span class="contr">prep</span> +V: </td><td>to</td></tr><tr><td>27</td><td> <span class="smcap">TO</span> <span class="contr">+inf</span> +C: </td><td>to</td></tr><tr><td>28</td><td> <span class="smcap">FROM</span>: </td><td>from (frome) ((fro))</td></tr><tr><td>29</td><td> <span class="smcap">AFTER</span>: </td><td>after</td></tr><tr><td>30</td><td> <span class="smcap">THEN</span>: </td><td>then ((than))</td></tr><tr><td>31</td><td> <span class="smcap">THAN</span>: </td><td>then (than) ((yen))</td></tr><tr><td>32</td><td> <span class="smcap">THOUGH</span>: </td><td>alof (yof) ((yoff))</td></tr><tr><td>33</td><td> <span class="smcap">IF</span>: </td><td>yff, yff-that, yf ((yf-y<sup>t</sup>, yff-y<sup>t</sup>, gyffe-y<sup>t</sup>))</td></tr><tr><td>34</td><td> <span class="smcap">AS</span>: </td><td>as</td></tr><tr><td>35</td><td> <span class="smcap">AS..AS</span>: </td><td>as+as</td></tr><tr><td>36</td><td> <span class="smcap">AGAINST</span>: </td><td>agance (agaynst, agayns, agayne)</td></tr><tr><td>39-20</td><td> <span 
class="smcap">SINCE</span> <span class="contr">conj</span>: </td><td>sythen-y<sup>t</sup>, syn</td></tr><tr><td>40</td><td> <span class="smcap">YET</span>: </td><td>ȝet ((ȝett))</td></tr><tr><td>41</td><td> <span class="smcap">WHILE</span>: </td><td>whylys-y<sup>t</sup></td></tr><tr><td>42</td><td> <span class="smcap">STRENGTH</span>: </td><td>strenght ((strenghe))</td></tr><tr><td>42-20</td><td> <span class="smcap">STRENGTHEN</span> <span class="contr">vb</span>: </td><td>strenght</td></tr><tr><td>44</td><td> <span class="smcap">WH-</span>: </td><td>wh- ((w-))</td></tr><tr><td>46</td><td> <span class="smcap">NOT</span>: </td><td>not, nott</td></tr><tr><td>47</td><td> <span class="smcap">NOR</span>: </td><td>nor (ne)</td></tr><tr><td>48</td><td> <span class="smcap">OE</span>, <span class="smcap">ON</span> <span class="contr">ā</span> (‘a’, ‘o’): </td><td>o</td></tr><tr><td>49</td><td> <span class="smcap">WORLD</span>: </td><td>woorlde, worlde, warlde, world</td></tr><tr><td>50</td><td> <span class="smcap">THINK</span> <span class="contr">vb</span>: </td><td>thynke, thyngke</td></tr><tr><td>51</td><td> <span class="smcap">WORK</span> <span class="contr">sb</span>: </td><td>werke</td></tr><tr><td>51-10</td><td> <span class="smcap">WORK</span> <span class="contr">pres stem</span>: </td><td>werke</td></tr><tr><td>52</td><td> <span class="smcap">THERE</span>: </td><td>ther ((y<span class="contr">er</span>))</td></tr><tr><td>53</td><td> <span class="smcap">WHERE</span>: </td><td>wher-, where</td></tr><tr><td>54</td><td> <span class="smcap">MIGHT</span> <span class="contr">vb</span>: </td><td>myght</td></tr><tr><td>55</td><td> <span class="smcap">THROUGH</span>: </td><td>throughe (throghe) ((throught, through))</td></tr><tr><td>56</td><td> <span class="smcap">WHEN</span>: </td><td>when</td></tr><tr><td>57</td><td> <span class="contr">Sb pl</span>: </td><td>-ys (-s, -<span class="contr">es</span>) ((-es, -is))</td></tr><tr><td>58</td><td> <span class="contr">Pres 
part</span>: </td><td>-yng</td></tr><tr><td>59</td><td> <span class="contr">Vbl sb</span>: </td><td>-yng</td></tr><tr><td>61</td><td> <span class="contr">Pres 3sg</span>: </td><td>-ys (-yth, -eth, -<span class="contr">es</span>, -s)</td></tr><tr><td>62</td><td> <span class="contr">Pres pl</span>: </td><td>-th</td></tr><tr><td>65</td><td> <span class="contr">Weak ppl</span>: </td><td>-ed (-et) ((-yd))</td></tr><tr><td>66</td><td> <span class="contr">Str ppl</span>: </td><td>-en, -on, -yne</td></tr><tr><td>70-20</td><td> <span class="smcap">ABOUT</span> <span class="contr">pr</span>: </td><td>abowte</td></tr><tr><td>71-20</td><td> <span class="smcap">ABOVE</span> <span class="contr">pr</span>: </td><td>a-boue, abowe</td></tr><tr><td>73</td><td> <span class="smcap">AFTERWARDS</span>: </td><td>afterward</td></tr><tr><td>75</td><td> <span class="smcap">ALL</span>: </td><td>all, al</td></tr><tr><td>77</td><td> <span class="smcap">AMONG</span> <span class="contr">adv</span>: </td><td>emong</td></tr><tr><td>77-20</td><td> <span class="smcap">AMONG</span> <span class="contr">pr</span>: </td><td>emong</td></tr><tr><td>78-20</td><td> <span class="smcap">ANSWER</span> <span class="contr">vb</span>: </td><td>answer</td></tr><tr><td>80</td><td> <span class="smcap">ASK</span> <span class="contr">vb</span>: </td><td>aske</td></tr><tr><td>81</td><td> <span class="smcap">AT</span><span class="contr">+inf</span>: </td><td>at</td></tr><tr><td>83</td><td> <span class="smcap">AWAY</span>: </td><td>away</td></tr><tr><td>84-20</td><td> <span class="smcap">BE</span> <span class="contr">ppl</span>: </td><td>beyn (byn)</td></tr><tr><td>85-20</td><td> <span class="smcap">BEFORE</span> <span class="contr">adv-time</span>: </td><td>before, befor</td></tr><tr><td>85-31</td><td> <span class="smcap">BEFORE</span> <span class="contr">pr-place</span>: </td><td>be-fore, be-for</td></tr><tr><td>89</td><td> <span class="smcap">BETWEEN</span> <span class="contr">pr</span>: 
</td><td>betwene</td></tr><tr><td>93</td><td> <span class="smcap">BLESSED</span> <span class="contr">adj/ppl</span>: </td><td>blessyd (blessed) ((blyssed))</td></tr><tr><td>94</td><td> <span class="smcap">BOTH</span>: </td><td>bothe</td></tr><tr><td>96</td><td> <span class="smcap">BROTHER</span>: </td><td>broder (brother)</td></tr><tr><td>99</td><td> <span class="smcap">BUSY</span> <span class="contr">adj</span>: </td><td>besy ((busy))</td></tr><tr><td>99-20</td><td> <span class="smcap">BUSY</span> <span class="contr">vb</span>: </td><td>besy-, busy</td></tr><tr><td>100</td><td> <span class="smcap">BUT</span>: </td><td>bot (bott)</td></tr><tr><td>102</td><td> <span class="smcap">BY</span>: </td><td>by</td></tr><tr><td>103-30</td><td> <span class="smcap">CALLED</span> <span class="contr">ppl</span>: </td><td>called</td></tr><tr><td>104</td><td> <span class="smcap">CAME</span> <span class="contr">sg</span>: </td><td>came</td></tr><tr><td>105-20</td><td> CAN <span class="contr">1/3sg</span>: </td><td>can</td></tr><tr><td>106</td><td> <span class="smcap">CAST</span> <span class="contr">vb</span>: </td><td>cast</td></tr><tr><td>108</td><td> <span class="smcap">CHURCH</span>: </td><td>churche</td></tr><tr><td>109</td><td> <span class="smcap">COULD</span> <span class="contr">1/3sg</span>: </td><td>cowthe</td></tr><tr><td>112</td><td> <span class="smcap">DAY</span>: </td><td>day</td></tr><tr><td>113</td><td> <span class="smcap">DEATH</span>: </td><td>dethe</td></tr><tr><td>114</td><td> <span class="smcap">DIE</span> <span class="contr">vb</span>: </td><td>dye</td></tr><tr><td>115-70</td><td> <span class="smcap">DID</span> <span class="contr">pl</span>: </td><td>dyd</td></tr><tr><td>116</td><td> <span class="smcap">DOWN</span>: </td><td>downe</td></tr><tr><td>119</td><td> <span class="smcap">EARTH</span>: </td><td>erthe</td></tr><tr><td>125</td><td> <span class="smcap">ENOUGH</span>: </td><td>enoughe</td></tr><tr><td>129</td><td> <span class="smcap">FAR</span>: 
</td><td>far</td></tr><tr><td>130</td><td> <span class="smcap">FATHER</span>: </td><td>fad<span class="contr">er</span>, fader</td></tr><tr><td>132</td><td> <span class="smcap">FELLOW</span>: </td><td>felou-</td></tr><tr><td>134</td><td> <span class="smcap">FIGHT</span> <span class="contr">pres</span>: </td><td>feght-</td></tr><tr><td>137</td><td> <span class="smcap">FIRE</span>: </td><td>fyer, fyre, fire</td></tr><tr><td>138</td><td> <span class="smcap">FIRST</span> <span class="contr">undiff</span>: </td><td>ferst, first</td></tr><tr><td>139</td><td> <span class="smcap">FIVE</span>: </td><td>fyve</td></tr><tr><td>139-20</td><td> <span class="smcap">FIFTH</span>: </td><td>feyfte</td></tr><tr><td>140</td><td> <span class="smcap">FLESH</span>: </td><td>fleshe, flesche</td></tr><tr><td>141</td><td> <span class="smcap">FOLLOW</span> <span class="contr">vb</span>: </td><td>folow, folo-</td></tr><tr><td>144-20</td><td> <span class="smcap">FOURTH</span>: </td><td>fawrte</td></tr><tr><td>146</td><td> <span class="smcap">FRIEND</span>: </td><td>frend-</td></tr><tr><td>147</td><td> <span class="smcap">FRUIT</span>: </td><td>frutt-</td></tr><tr><td>153</td><td> <span class="smcap">GIVE</span> <span class="contr">pres</span>: </td><td>gyue (gyff-)</td></tr><tr><td>155</td><td> <span class="smcap">GOOD</span>: </td><td>good, gud</td></tr><tr><td>157</td><td> <span class="smcap">GROW</span> <span class="contr">pres</span>: </td><td>groue (growe)</td></tr><tr><td>160</td><td> <span class="smcap">HAVE</span> <span class="contr">pres</span>: </td><td>haue</td></tr><tr><td>160-40</td><td> <span class="smcap">HAS</span> <span class="contr">3sg</span>: </td><td>hayth ((haythe, haith))</td></tr><tr><td>164</td><td> <span class="smcap">HEAVEN</span>: </td><td>hewyn</td></tr><tr><td>165</td><td> <span class="smcap">HEIGHT</span>: </td><td>heghte</td></tr><tr><td>166</td><td> <span class="smcap">HELL</span>: </td><td>hell</td></tr><tr><td>168</td><td> <span class="smcap">HIGH</span>: 
</td><td>hegh (heghe, hyghe)</td></tr><tr><td>168-20</td><td> <span class="smcap">HIGHER</span>: </td><td>hyer</td></tr><tr><td>171</td><td> <span class="smcap">HIM</span>: </td><td>hym</td></tr><tr><td>175</td><td> <span class="smcap">HOLY</span>: </td><td>holy</td></tr><tr><td>176</td><td> <span class="smcap">HOW</span>: </td><td>how (howe)</td></tr><tr><td>181</td><td> <span class="smcap">KNOW</span> <span class="contr">pres</span>: </td><td>knawe, knaw</td></tr><tr><td>185</td><td> <span class="smcap">LAW</span>: </td><td>lawe</td></tr><tr><td>187</td><td> <span class="smcap">LESS</span>: </td><td>lesse, leysse</td></tr><tr><td>190</td><td> <span class="smcap">LIFE</span>: </td><td>lyue</td></tr><tr><td>191</td><td> <span class="smcap">LITTLE</span>: </td><td>lyttyll (lyttyl)</td></tr><tr><td>192</td><td> <span class="smcap">LIVE</span> <span class="contr">vb</span>: </td><td>lyff-</td></tr><tr><td>194</td><td> <span class="smcap">LORD</span>: </td><td>lorde ((lord))</td></tr><tr><td>196</td><td> <span class="smcap">LOVE</span> <span class="contr">sb</span>: </td><td>loue (luffe, luff)</td></tr><tr><td>196-20</td><td> <span class="smcap">LOVE</span> <span class="contr">vb</span>: </td><td>loue</td></tr><tr><td>197</td><td> <span class="smcap">LOW</span>: </td><td>low-</td></tr><tr><td>199-10</td><td> <span class="smcap">MAY</span> <span class="contr">1/3sg</span>: </td><td>may</td></tr><tr><td>202</td><td> <span class="smcap">MOON</span>: </td><td>mone</td></tr><tr><td>203</td><td> <span class="smcap">MOTHER</span>: </td><td>mother, moder</td></tr><tr><td>204</td><td> <span class="smcap">MY</span> +C: </td><td>my</td></tr><tr><td>204-20</td><td> <span class="smcap">MY</span> <span class="contr">+h</span>: </td><td>my</td></tr><tr><td>205</td><td> <span class="smcap">NAME</span> <span class="contr">sb</span>: </td><td>name</td></tr><tr><td>210</td><td> <span class="smcap">NEITHER</span> <span class="contr">pron</span>: 
</td><td>nawther</td></tr><tr><td>211</td><td> <span class="smcap">NEITHER..NOR</span>: </td><td>nawther+ne, nother+nor, nawther+then</td></tr><tr><td>212</td><td> <span class="smcap">NEVER</span>: </td><td>neu<span class="contr">er</span></td></tr><tr><td>213</td><td> <span class="smcap">NEW</span>: </td><td>new-</td></tr><tr><td>214</td><td> <span class="smcap">NIGH</span>: </td><td>neyr, nere-</td></tr><tr><td>218</td><td> <span class="smcap">NOW</span>: </td><td>nowe, now</td></tr><tr><td>219</td><td> <span class="smcap">OLD</span>: </td><td>holde</td></tr><tr><td>220-20</td><td> <span class="smcap">ONE</span> <span class="contr">pron</span>: </td><td>one</td></tr><tr><td>221</td><td> <span class="smcap">OR</span>: </td><td>or</td></tr><tr><td>222</td><td> <span class="smcap">OTHER</span>: </td><td>other</td></tr><tr><td>224</td><td> <span class="smcap">OUR</span>: </td><td>oure</td></tr><tr><td>225</td><td> <span class="smcap">OUT</span>: </td><td>oute, out, owt</td></tr><tr><td>226</td><td> <span class="smcap">OWN</span> <span class="contr">adj</span>: </td><td>awne</td></tr><tr><td>227</td><td> <span class="smcap">PEOPLE</span>: </td><td>people, peopyll</td></tr><tr><td>228</td><td> <span class="smcap">POOR</span>: </td><td>pooer, poer, poore</td></tr><tr><td>229</td><td> <span class="smcap">PRAY</span> <span class="contr">vb</span>: </td><td>pray</td></tr><tr><td>235</td><td> <span class="smcap">SAY</span> <span class="contr">pres</span>: </td><td>say</td></tr><tr><td>235-21</td><td> <span class="smcap">SAYS</span> <span class="contr">3sg</span>: </td><td>sayth</td></tr><tr><td>235-30</td><td> <span class="smcap">SAY</span> <span class="contr">pl</span>: </td><td>sayth</td></tr><tr><td>235-40</td><td> <span class="smcap">SAID</span> <span class="contr">sg</span>: </td><td>sayde</td></tr><tr><td>235-60</td><td> <span class="smcap">SAID</span> <span class="contr">ppl</span>: </td><td>sayd</td></tr><tr><td>236</td><td> <span class="smcap">SEE</span> <span 
class="contr">vb</span>: </td><td>se</td></tr><tr><td>236-21</td><td> <span class="smcap">SEES</span> <span class="contr">3sg</span>: </td><td>seeth</td></tr><tr><td>237</td><td> <span class="smcap">SEEK</span> <span class="contr">pres</span>: </td><td>seke</td></tr><tr><td>238</td><td> <span class="smcap">SELF</span>: </td><td>selffe, selfe</td></tr><tr><td>242</td><td> <span class="smcap">SIN</span> <span class="contr">sb</span>: </td><td>synn-, syn ((syne))</td></tr><tr><td>242-30</td><td> <span class="smcap">SIN</span> <span class="contr">vb</span>: </td><td>synn-</td></tr><tr><td>243</td><td> <span class="smcap">SISTER</span>: </td><td>sust<span class="contr">er</span>, suster</td></tr><tr><td>244</td><td> <span class="smcap">SIX</span>: </td><td>sex</td></tr><tr><td>246</td><td> <span class="smcap">SOME</span>: </td><td>su<span class="contr">m</span>me ((sume))</td></tr><tr><td>248</td><td> <span class="smcap">SORROW</span> <span class="contr">sb</span>: </td><td>sorow, sorowe, sorousys<<span class="contr">pl</span>></td></tr><tr><td>249</td><td> <span class="smcap">SOUL</span>: </td><td>soule (saule)</td></tr><tr><td>249-20</td><td> <span class="smcap">SOULS</span>: </td><td>soules, saulis, sawl<span class="contr">es</span></td></tr><tr><td>254</td><td> <span class="smcap">STEAD</span>: </td><td>sted-</td></tr><tr><td>261</td><td> <span class="smcap">THOU</span>: </td><td>y<sup>u</sup></td></tr><tr><td>262</td><td> <span class="smcap">THEE</span>: </td><td>ye</td></tr><tr><td>263</td><td> <span class="smcap">THY</span> +C: </td><td>y<sup>i</sup></td></tr><tr><td>266</td><td> <span class="smcap">THOUSAND</span>: </td><td>thousande</td></tr><tr><td>267-20</td><td> <span class="smcap">THIRD</span>: </td><td>thryde, therde, threde</td></tr><tr><td>268</td><td> <span class="smcap">TOGETHER</span>: </td><td>to-gedder, to-gether</td></tr><tr><td>270</td><td> <span class="smcap">TRUE</span>: </td><td>true</td></tr><tr><td>273</td><td> <span 
class="smcap">TWELVE</span>: </td><td>twelue</td></tr><tr><td>275</td><td> <span class="smcap">TWO</span>: </td><td>too</td></tr><tr><td>278</td><td> <span class="smcap">UPON</span>: </td><td>apon (appon) ((vppon))</td></tr><tr><td>281</td><td> <span class="smcap">WELL</span> <span class="contr">adv</span>: </td><td>well ((wel))</td></tr><tr><td>282</td><td> <span class="smcap">WENT</span>: </td><td>went</td></tr><tr><td>285</td><td> <span class="smcap">WHETHER</span>: </td><td>whether (whed<span class="contr">er</span>, wether)</td></tr><tr><td>286</td><td> <span class="smcap">WHITHER</span>: </td><td>wheder</td></tr><tr><td>291</td><td> <span class="smcap">WHY</span>: </td><td>why</td></tr><tr><td>292-20</td><td> <span class="smcap">WIT</span> <span class="contr">1/3sg <span class="smcap">KNOW</span></span>: </td><td>wote</td></tr><tr><td>295</td><td> <span class="smcap">WITHOUT</span> <span class="contr">pr</span>: </td><td>w<sup>t</sup>-owte, w<sup>t</sup>-owt, w<sup>t</sup>-owtyn (w<sup>t</sup>-out)</td></tr><tr><td>297</td><td> <span class="smcap">WORSHIP</span> <span class="contr">sb</span>: </td><td>worschippe, worschip</td></tr><tr><td>298</td><td> <span class="smcap">YE</span>: </td><td>ȝe</td></tr><tr><td>299</td><td> <span class="smcap">YOU</span>: </td><td>you ((youe))</td></tr><tr><td>300</td><td> <span class="smcap">YOUR</span>: </td><td>you<span class="contr">r</span> (youre)</td></tr><tr><td>303</td><td> <span class="smcap">YOUNG</span>: </td><td>yong</td></tr><tr><td>304</td><td> <span class="smcap">-ALD</span>: </td><td>-old</td></tr><tr><td>306</td><td> <span class="smcap">-AND</span>: </td><td>-and (-ond)</td></tr><tr><td>307</td><td> <span class="smcap">-ANG</span>: </td><td>-ong ((-ang))</td></tr><tr><td>308</td><td> <span class="smcap">-ANK</span>: </td><td>-ank, -angk</td></tr><tr><td>309</td><td> <span class="smcap">-DOM</span>: </td><td>-dome, -dom</td></tr><tr><td>312</td><td> <span class="smcap">-ER</span>: </td><td>-er (-<span 
class="contr">er</span>) ((-ar))</td></tr><tr><td>313</td><td> <span class="smcap">-EST</span> <span class="contr">sup</span>: </td><td>-est</td></tr><tr><td>314</td><td> <span class="smcap">-FUL</span>: </td><td>-full</td></tr><tr><td>315</td><td> <span class="smcap">-HOOD</span>: </td><td>-hede, -hed</td></tr><tr><td>316</td><td> <span class="smcap">-LESS</span>: </td><td>-les</td></tr><tr><td>317</td><td> <span class="smcap">-LY</span>: </td><td>-ly</td></tr><tr><td>318</td><td> <span class="smcap">-NESS</span>: </td><td>-nes</td></tr><tr><td>319</td><td> <span class="smcap">-SHIP</span>: </td><td>-schippe, -schip</td></tr></table>

If ">though<" is supposed to be a whole line, it won't be found in the readlines list, because each element ends with a newline after ">though<". You need to add a newline to the search string or str.rstrip the lines in the list.
In [15]: l = [">though<\n"]
In [16]: ">though<" in l
Out[16]: False
In [17]: ">though<\n" in l
Out[17]: True
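One minimal way to sidestep the newline problem (a sketch, not the only option) is to strip each line before testing membership:

```python
lines = [">though<\n", "other line\n"]

# Strip trailing newlines so the membership test compares bare strings
stripped = [line.rstrip("\n") for line in lines]
print(">though<" in lines)     # False: the element still has its newline
print(">though<" in stripped)  # True
```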
This will check for a match properly, and it uses glob to find all the txt files in the directory, without storing all the lines in a list:
dr = "C:\Python34/Downloaded Files LALME/"
out = "C:\\Python34/output_though_only.txt"
from glob import glob
tag = ".txt"
files = glob(dr+"*.txt")
with open(out, "w") as f:
    for fle in files:
        with open(fle) as f2:
            for line in f2:
                if line.rstrip() == ">though<":
                    # match found: rewind the input file and copy it in full
                    f2.seek(0)
                    f.writelines(f2)
                    break
And then reopen the file after writing to it:
outfile = "C:\\Python34/cleaned_output_though_only.txt"
delete_list = ['']
with open(out) as fin, open(outfile, "w") as fout:
    for line in fin:
        for word in delete_list:
            line = line.replace(word, "")
        fout.write(line)
But your delete_list contains only an empty string, so it's not clear exactly what that replace is meant to do.
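For illustration, here is what the replace loop does once delete_list holds real markers (the markers below are hypothetical, modelled on the HTML tags in your older script):

```python
delete_list = ['<li><span class="list">', '</span></li>']  # hypothetical markers
line = '<li><span class="list">some text</span></li>\n'

# Each replace strips one marker from the line; an empty string removes nothing
for word in delete_list:
    line = line.replace(word, "")
print(line, end="")  # some text
```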
Your code from your edit works as you are checking a string not a list:
In [20]: f = "foo"
In [21]: l = ["foo\n"]
In [22]: f in l
Out[22]: False
In [23]: s = "".join(l)
In [24]: s
Out[24]: 'foo\n'
In [25]: f in s
Out[25]: True
In the edited code you are checking for a substring, so even without the newline you will get a match; with a list you need an exact match, which is what you have in your original code using readlines, since there you don't call join as you do in the edited code.
So the big difference between the two is that one checks for a substring in a string, while the other checks for an exact match in the list, which fails because of the newline character.
Based on your new edit, you are not checking for a whole line; you are checking for a substring and a case-insensitive match, so use:
`if ">though<" in line.lower()`
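For example (a quick sketch), lowering each line first makes upper- and mixed-case variants match too:

```python
lines = [">THOUGH<: alof (yof)\n", "no match here\n"]

# Lower-case each line before the substring test for a case-insensitive match
for line in lines:
    if ">though<" in line.lower():
        print("matched:", line.rstrip())
```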

Related

Python Selenium get_attribute not returning value

I am trying to extract the data-message-id from the following HTML. My original goal is to extract the data-message-id for the span containing a particular text and then click on the star_button to star it.
<div class="message_content_header">
<div class="message_content_header_left">
krishnag0902
<span class="ts_tip_float message_current_status ts_tip ts_tip_top ts_tip_multiline ts_tip_delay_150 color_U5TPDSMQQ color_9f69e7 hidden ts_tip_hidden">
<span class="ts_tip_tip ts_tip_inner_current_status">
<span class="ts_tip_multiline_inner">
</span>
</span>
</span>
<i class="copy_only">[</i>4:34 PM<i class="copy_only">]</i><span class="ts_tip_tip"><span class="ts_tip_multiline_inner">Yesterday at 4:34:07 PM</span></span>
<span class="message_star_holder">
Star this message
</div>
</div>
<span class="message_body">hoho<span class="constrain_triple_clicks"></span></span>
<div class="rxn_panel rxns_key_message-1498084447_119862-C5UGEFBS9"></div>
<i class="copy_only"><br></i>
<span id="msg_1498084447_119862_label" class="message_aria_label hidden">
<strong>krishnag0902</strong>.
hoho.
four thirty-four PM.
</span>
And I am using the following code on the above span (message_star_holder), which is returning None:
data_mess = star_button_span.find_element_by_xpath("//button[@class='star ts_icon ts_icon_star_o ts_icon_inherit ts_tip_top star_message ts_tip ts_tip_float ts_tip_hidden btn_unstyle']")
print data_mess.get_attribute("innerHTML")
print star_button_span.get_attribute("data-msg-id")
star_button_span doesn't have a data-msg-id attribute; data_mess has it:
print data_mess.get_attribute("data-msg-id")

How do I scrape nested data using selenium and Python?

I basically want to scrape Feb 2016 - Present under <span class="visually-hidden">, but I can't seem to get to it. Here's the HTML code:
<div class="pv-entity__summary-info">
<h3 class="Sans-17px-black-85%-semibold">Litigation Paralegal</h3>
<h4>
<span class="visually-hidden">Company Name</span>
<span class="pv-entity__secondary-title Sans-15px-black-55%">Olswang</span>
</h4>
<div class="pv-entity__position-info detail-facet m0"><h4 class="pv-entity__date-range Sans-15px-black-55%">
<span class="visually-hidden">Dates Employed</span>
<span>Feb 2016 – Present</span>
</h4><h4 class="pv-entity__duration de Sans-15px-black-55% ml0">
<span class="visually-hidden">Employment Duration</span>
<span class="pv-entity__bullet-item">1 yr 2 mos</span>
</h4><h4 class="pv-entity__location detail-facet Sans-15px-black-55% inline-block">
<span class="visually-hidden">Location</span>
<span class="pv-entity__bullet-item">London, United Kingdom</span>
</h4></div>
</div>
And here is what I've been doing at the moment with selenium in my code:
date = browser.find_element_by_xpath('.//div[@class = "pv-entity__duration de Sans-15px-black-55% ml0"]').text
print date
But this gives no results. How would I go about pulling the date?
There is no div with class="pv-entity__duration de Sans-15px-black-55% ml0", but an h4. If you want to get the text of the div, then try:
date = browser.find_element_by_xpath('.//div[@class = "pv-entity__position-info detail-facet m0"]').text
print date
If you want to get "Feb 2016 - Present", then try:
date = browser.find_element_by_xpath('//h4[@class="pv-entity__date-range Sans-15px-black-55%"]/span[2]').text
print date
You can rewrite your xpath code something like this:
# -*- coding: utf-8 -*-
from lxml import html
import unicodedata
html_str = """
<div class="pv-entity__summary-info">
<h3 class="Sans-17px-black-85%-semibold">Litigation Paralegal</h3>
<h4>
<span class="visually-hidden">Company Name</span>
<span class="pv-entity__secondary-title Sans-15px-black-55%">Olswang</span>
</h4>
<div class="pv-entity__position-info detail-facet m0"><h4 class="pv-entity__date-range Sans-15px-black-55%">
<span class="visually-hidden">Dates Employed</span>
<span>Feb 2016 – Present</span>
</h4><h4 class="pv-entity__duration de Sans-15px-black-55% ml0">
<span class="visually-hidden">Employment Duration</span>
<span class="pv-entity__bullet-item">1 yr 2 mos</span>
</h4><h4 class="pv-entity__location detail-facet Sans-15px-black-55% inline-block">
<span class="visually-hidden">Location</span>
<span class="pv-entity__bullet-item">London, United Kingdom</span>
</h4></div>
</div>
"""
root = html.fromstring(html_str)
# For fetching Feb 2016 – Present :
txt = root.xpath('//h4[@class="pv-entity__date-range Sans-15px-black-55%"]/span/text()')[1]
# For fetching 1 yr 2 mos :
txt1 = root.xpath('//h4[@class="pv-entity__duration de Sans-15px-black-55% ml0"]/span/text()')[1]
print txt
print txt1
This will result in:
Feb 2016 – Present
1 yr 2 mos

How to find the number of text objects in BeautifulSoup object

I'm web scraping a Wikipedia page using BeautifulSoup in Python, and I was wondering whether there is any way to know the number of text objects in an HTML object. For example, the following code gets me the following HTML:
soup.find_all(class_ = 'toctext')
<span class="toctext">Actors and actresses</span>, <span class="toctext">Archaeologists and anthropologists</span>, <span class="toctext">Architects</span>, <span class="toctext">Artists</span>, <span class="toctext">Broadcasters</span>, <span class="toctext">Businessmen</span>, <span class="toctext">Chefs</span>, <span class="toctext">Clergy</span>, <span class="toctext">Criminals</span>, <span class="toctext">Conspirators</span>, <span class="toctext">Economists</span>, <span class="toctext">Engineers</span>, <span class="toctext">Explorers</span>, <span class="toctext">Filmmakers</span>, <span class="toctext">Historians</span>, <span class="toctext">Humourists</span>, <span class="toctext">Inventors / engineers</span>, <span class="toctext">Journalists / newsreaders</span>, <span class="toctext">Military: soldiers/sailors/airmen</span>, <span class="toctext">Monarchs</span>, <span class="toctext">Musicians</span>, <span class="toctext">Philosophers</span>, <span class="toctext">Photographers</span>, <span class="toctext">Politicians</span>, <span class="toctext">Scientists</span>, <span class="toctext">Sportsmen and sportswomen</span>, <span class="toctext">Writers</span>, <span class="toctext">Other notables</span>, <span class="toctext">English expatriates</span>, <span class="toctext">References</span>, <span class="toctext">See also</span>
I can get the first text object by running the following:
soup.find_all(class_ = 'toctext')[0].text
My goal here is to get and store all of the text objects in a list. I'm doing this by using a for loop; however, I don't know how many text objects there are in the HTML block. Naturally I would hit an error if I get to an index that doesn't exist. Is there an alternative?
You can use a list comprehension (a for...in loop), iterating over the results directly instead of indexing:
In [13]: [t.text for t in soup.find_all(class_ = 'toctext')]
Out[13]:
['Actors and actresses',
'Archaeologists and anthropologists',
'Architects',
'Artists',
'Broadcasters',
'Businessmen',
'Chefs',
'Clergy',
'Criminals',
'Conspirators',
'Economists',
'Engineers',
'Explorers',
'Filmmakers',
'Historians',
'Humourists',
'Inventors / engineers',
'Journalists / newsreaders',
'Military: soldiers/sailors/airmen',
'Monarchs',
'Musicians',
'Philosophers',
'Photographers',
'Politicians',
'Scientists',
'Sportsmen and sportswomen',
'Writers',
'Other notables',
'English expatriates',
'References',
'See also']
Try the following code:
for txt in soup.find_all(class_ = 'toctext'):
    print(txt.text)
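Since find_all returns a list, len() answers the "how many" part directly. A minimal sketch, using a small inline HTML snippet in place of the Wikipedia page:

```python
from bs4 import BeautifulSoup

# Hypothetical minimal HTML standing in for the Wikipedia page
html = """
<span class="toctext">Actors and actresses</span>
<span class="toctext">Architects</span>
<span class="toctext">Writers</span>
"""
soup = BeautifulSoup(html, "html.parser")

spans = soup.find_all(class_="toctext")  # a list of matching tags
print(len(spans))                        # number of text objects
texts = [s.text for s in spans]          # all text contents as a list
print(texts)
```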

Retrieve bbc weather data with identical span class and nested spans

I am trying to pull data from BBC Weather, with a view to using it in a home automation dashboard.
I can pull the HTML fine, and I can pull one set of temps, but it only pulls the first.
</li>
<li class="daily__day-tab day-20150418 ">
<a data-ajax-href="/weather/en/2646504/daily/2015-04-18?day=3" href="/weather/2646504?day=3" rel="nofollow">
<div class="daily__day-header">
<h3 class="daily__day-date">
<span aria-label="Saturday" class="day-name">Sat</span>
</h3>
</div>
<span class="weather-type-image weather-type-image-40" title="Sunny"><img alt="Sunny" src="http://static.bbci.co.uk/weather/0.5.327/images/icons/tab_sprites/40px/1.png"/></span>
<span class="max-temp max-temp-value"> <span class="units-values temperature-units-values"><span class="units-value temperature-value temperature-value-unit-c" data-unit="c">13<span class="unit">°C</span></span><span class="unit-types-separator"> </span><span class="units-value temperature-value temperature-value-unit-f" data-unit="f">55<span class="unit">°F</span></span></span></span>
<span class="min-temp min-temp-value"> <span class="units-values temperature-units-values"><span class="units-value temperature-value temperature-value-unit-c" data-unit="c">5<span class="unit">°C</span></span><span class="unit-types-separator"> </span><span class="units-value temperature-value temperature-value-unit-f" data-unit="f">41<span class="unit">°F</span></span></span></span>
<span class="wind wind-speed windrose-icon windrose-icon--average windrose-icon-40 windrose-icon-40--average wind-direction-ene" data-tooltip-kph="31 km/h, East North Easterly" data-tooltip-mph="19 mph, East North Easterly" title="19 mph, East North Easterly">
<span class="speed"> <span class="wind-speed__description wind-speed__description--average">Wind Speed</span>
<span class="units-values windspeed-units-values"><span class="units-value windspeed-value windspeed-value-unit-kph" data-unit="kph">31 <span class="unit">km/h</span></span><span class="unit-types-separator"> </span><span class="units-value windspeed-value windspeed-value-unit-mph" data-unit="mph">19 <span class="unit">mph</span></span></span></span>
<span class="description blq-hide">East North Easterly</span>
</span>
This is my code, which isn't working:
import urllib2
import pprint
from bs4 import BeautifulSoup
htmlFile=urllib2.urlopen('http://www.bbc.co.uk/weather/2646504?day=1')
htmlData = htmlFile.read()
soup = BeautifulSoup(htmlData)
table=soup.find("div","daily-window")
temperatures=[str(tem.contents[0]) for tem in table.find_all("span",class_="units-value temperature-value temperature-value-unit-c")]
mintemp=[str(min.contents[0]) for min in table.find_("span",class_="min-temp min-temp-value")]
maxtemp=[str(min.contents[0]) for min in table.find_all("span",class_="max-temp max-temp-value")]
windspeeds=[str(speed.contents[0]) for speed in table.find_all("span",class_="units-value windspeed-value windspeed-value-unit-mph")]
pprint.pprint(zip(temperatures,temp2,windspeeds))
Your min and max temp extraction is wrong. You are finding the whole min-temp span (which includes both the °C and °F values), so taking the first item of its contents gives you an empty string.
Also, the min-temp tag is identified by class="min-temp min-temp-value", which is not the same as the Celsius value's class="temperature-value-unit-c", so I suggest you use a CSS selector.
E.g., finding all of your min-temp values could be:
table.select('span.min-temp.min-temp-value span.temperature-value-unit-c')
This selects all class="temperature-value-unit-c" spans that are descendants of class="min-temp min-temp-value" spans.
Do the same for the other lists, like max_temp and wind.
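A minimal sketch of the CSS-selector approach, using a small inline fragment that mimics the question's markup (the live BBC page may differ):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment mimicking the markup in the question
html = """
<div class="daily-window">
  <span class="max-temp max-temp-value">
    <span class="units-value temperature-value temperature-value-unit-c">13<span class="unit">C</span></span>
  </span>
  <span class="min-temp min-temp-value">
    <span class="units-value temperature-value temperature-value-unit-c">5<span class="unit">C</span></span>
  </span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
table = soup.find("div", "daily-window")

# The CSS selector scopes the Celsius value to its min/max parent span,
# so contents[0] is the temperature text node rather than an empty string
min_temps = [t.contents[0] for t in
             table.select("span.min-temp.min-temp-value span.temperature-value-unit-c")]
max_temps = [t.contents[0] for t in
             table.select("span.max-temp.max-temp-value span.temperature-value-unit-c")]
print(min_temps, max_temps)
```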

Get span text from a website using selenium

The website I'm trying to scrape looks like this:
<div align="center" class="movietable">
<span style="width:45px;height:47px;vertical-align:middle;display:table-cell;">
<img border="0" src="styles/images/cat/hd.png" alt="HdO">
</span>
</div>
<div align="left" class="movietable">
<span style="padding:0px 5px;width:455px;height:47px;vertical-align:middle;display:table-cell;">
<a data-toggle="tooltip" data-placement="bottom" data-html="true" title="" href="details.php?id=578197" data-original-title="<img src='https://trasd.tmdb.org//tqistSlQGQVlvDZHweD.jpg'>">
<b>GET THIS TEXT</b></a><br><font class="small">[Action, Horror, Sci-Fi]</font>
</span>
</div>
How can I extract:
The text in the <b> tag - in this case GET THIS TEXT
The content of the font class='small' - in this case it would be Action, Horror, Sci-Fi
The img src link - in this case it would be https://trasd.tmdb.org//tqistSlQGQVlvDZHweD.jpg
I have no idea how to do this.
Below are CSS selectors you can use:
driver.find_element_by_css_selector('div[align=left] b')
driver.find_element_by_css_selector('div[align=left] .small')
driver.find_element_by_css_selector('a[title]').get_attribute('data-original-title')
You can access all of them using xpath:
1) [parents before this div]/div[2]/span/a/b
2) [parents before this div]/div[2]/span/font
3) [parents before this div]/div[1]/span/a/img
[parents before this div] should be /html/body/...
As per the HTML you have shared to extract the items you can use the following solution:
GET THIS TEXT:
driver.find_element_by_xpath("//div[@class='movietable' and @align='left']/span/a[@data-toggle='tooltip' and @data-placement='bottom']/b").get_attribute("innerHTML")
[Action, Horror, Sci-Fi]:
driver.find_element_by_xpath("//div[@class='movietable' and @align='left']/span//font[@class='small']").get_attribute("innerHTML")
https://trasd.tmdb.org//tqistSlQGQVlvDZHweD.jpg:
img_src = driver.find_element_by_xpath("//div[@class='movietable' and @align='left']/span/a[@data-toggle='tooltip' and @data-placement='bottom']").get_attribute("data-original-title")
src = img_src.replace("'", "-").split("-")
print(src[1])
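The replace/split trick above would break if the URL ever contained a hyphen; pulling the src out with a regular expression is more robust. A small sketch on the attribute string itself (no browser needed):

```python
import re

# The data-original-title value from the question's HTML
attr = "<img src='https://trasd.tmdb.org//tqistSlQGQVlvDZHweD.jpg'>"

# Capture everything between the single quotes after src=
match = re.search(r"src='([^']+)'", attr)
if match:
    print(match.group(1))
```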
