Element.text data loss - python

I am writing a web scraper that collects data from a sports site. There are tables where I want to write the text from each tr into an array. For some rows it isn't possible to get the whole text.
While debugging at a breakpoint placed after the line t = e.text:
element_table = WebDriverWait(driver, 20).until(
    EC.presence_of_all_elements_located((By.XPATH, '//table//tbody//tr')))
for count, e in enumerate(element_table):
    if count > 3:
        line = e.text.splitlines()
        t = e.text
In the debugger, e's text was
text= {str} 'Salzburg\n4-3-1-2\n57%\n2 1.42\n14/4\n28.57%\n594/489\n82.32%\n66.7\n130\n12/43/75\n108\n38/48/22\n210/85\n40.48%'
but when I look at t
t = {str} 'Salzburg\n4-3-1-2\n2 1.42\n14/4\n594/489\n66.7\n130\n108\n210/85',
So does element.text not give me all the text that is in the tr? It also happens on only a few rows.
Here is the row that doesn't work, followed by a row that does work:
<tr>
<td>Salzburg</td>
<td class="Index__video-cell-widget___3PDlg"><span class="Index__stat-wrapper___n5jnZ"><em>4-3-1-2</em><small>57%</small></span></td>
<td class="Index__video-cell___s1IHu"><span class="Index__stat-wrapper___n5jnZ">2</span><div class="Index__video-cell-icon___3Pnub"></div></td>
<td class="Index__simple-cell-widget___1BYWx"><span class="Index__stat-wrapper___n5jnZ">1.42</span></td>
<td class="Index__video-cell-widget___3PDlg"><span class="Index__stat-wrapper___n5jnZ"><em>14/4</em><small> 28.57%</small></span><div class="Index__video-cell-icon___3Pnub"></div></td>
<td class="Index__video-cell-widget___3PDlg"><span class="Index__stat-wrapper___n5jnZ"><em>594/489</em><small> 82.32%</small></span><div class="Index__video-cell-icon___3Pnub"></div></td>
<td class="Index__simple-cell-widget___1BYWx"><span class="Index__stat-wrapper___n5jnZ">66.7</span></td>
<td class="Index__video-cell-widget___3PDlg"><span class="Index__stat-wrapper___n5jnZ"><em>130</em><small>12/43/75</small></span><div class="Index__video-cell-icon___3Pnub"></div></td>
<td class="Index__video-cell-widget___3PDlg"><span class="Index__stat-wrapper___n5jnZ"><em>108</em><small>38/48/22</small></span><div class="Index__video-cell-icon___3Pnub"></div></td>
<td class="Index__video-cell-widget___3PDlg"><span class="Index__stat-wrapper___n5jnZ"><em>210/85</em><small> 40.48%</small></span><div class="Index__video-cell-icon___3Pnub"></div></td>
</tr>
<tr>
<td>Sturm Graz</td>
<td class="Index__video-cell-widget___3PDlg"><span class="Index__stat-wrapper___n5jnZ"><em>3-4-3</em><small>80%</small></span></td>
<td class="Index__video-cell___s1IHu"><span class="Index__stat-wrapper___n5jnZ">3</span><div class="Index__video-cell-icon___3Pnub"></div></td><td class="Index__simple-cell-widget___1BYWx"><span class="Index__stat-wrapper___n5jnZ">1.73</span></td>
<td class="Index__video-cell-widget___3PDlg"><span class="Index__stat-wrapper___n5jnZ"><em>14/7</em><small> 50%</small></span><div class="Index__video-cell-icon___3Pnub"></div></td>
<td class="Index__video-cell-widget___3PDlg"><span class="Index__stat-wrapper___n5jnZ"><em>484/400</em><small> 82.64%</small></span><div class="Index__video-cell-icon___3Pnub"></div></td>
<td class="Index__simple-cell-widget___1BYWx"><span class="Index__stat-wrapper___n5jnZ">49.41</span></td><td class="Index__video-cell-widget___3PDlg"><span class="Index__stat-wrapper___n5jnZ"><em>128</em><small>9/50/69</small></span><div class="Index__video-cell-icon___3Pnub"></div></td>
<td class="Index__video-cell-widget___3PDlg"><span class="Index__stat-wrapper___n5jnZ"><em>101</em><small>33/50/18</small></span><div class="Index__video-cell-icon___3Pnub"></div></td>
<td class="Index__video-cell-widget___3PDlg"><span class="Index__stat-wrapper___n5jnZ"><em>228/87</em><small> 38.16%</small></span><div class="Index__video-cell-icon___3Pnub"></div></td>
</tr>

Well, I can't reproduce the problem you are reporting using Python 2.7.10.
If I were to speculate: you mentioned that you were looking at "t" in a debugger at some later point...did some other code manipulate "t"?
I would also suggest that if you want to split out all of the different components of each row, you should target those "em" and "small" elements as separate entities. Here is some code demonstrating that:
from selenium.common.exceptions import NoSuchElementException

driver.get('file://path_to_html_from_above/text_attribute_missing_td_content.html')
rows = WebDriverWait(self.driver, 5).until(
    EC.presence_of_all_elements_located((By.XPATH, '//table/tbody/tr')))
for count, e in enumerate(rows):
    line = e.text.splitlines()
    t = e.text
    # this demonstrates that they have the same content
    self.assertEqual(line, t.splitlines())

# store a list of lists representing rows of text, splitting the content
# of a td in two if it contains em and small HTML elements
table_content = list()
for row in rows:
    # pull out each column
    cols = row.find_elements_by_xpath('./td')
    r = list()
    for col in cols:
        # if the column has em and small elements, grab those
        try:
            em = col.find_element_by_tag_name('em')
            r.append(em.text)
            small = col.find_element_by_tag_name('small')
            r.append(small.text)
        except NoSuchElementException:
            # otherwise, just get the straight text
            r.append(col.text)
    table_content.append(r)
print(table_content)
Notice that I changed your XPath to be specific about finding just the tr elements under table/tbody.

At first I thought it was the WebDriver, so I switched to the Firefox one. The solution was to use innerHTML instead of .text.
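For reference, a minimal sketch of how that looks with Selenium's get_attribute, reusing the e from the loop in the question; innerHTML returns the row's raw markup, while textContent returns all of its text whether or not it is rendered as visible:

raw_html = e.get_attribute('innerHTML')       # raw markup of the row
full_text = e.get_attribute('textContent')    # all text nodes, visible or hidden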

Related

Python: Accessing a new <tr> while inside a different <tr> with BeautifulSoup4

I am trying to gather some data by web scraping a local HTML file using BeautifulSoup4. The problem is that the information I'm trying to get is on different rows that have the same class tags, and I'm not sure how to access them. The following HTML screenshot contains the two rows I'm accessing, with the data I need highlighted (sensitive info is scribbled out).
The code I have currently is:
def find_data(fileName):
    with open(fileName) as html_file:
        soup = bs(html_file, "lxml")
        hline1 = soup.find("td", class_="headerTableEntry")
        hline2 = hline1.find_next_sibling("td")
        hline3 = hline2.find_next_sibling("td")
        hline4 = hline3.find_next_sibling("td", class_="headerTableEntry")
        line1 = hline1.text
        line2 = hline2.text
        line3 = hline3.text
        # Nothing yet for lines 4, 5, 6
The first 3 lines work great and give 13, 39, and 33.3% as they should. But for line 4 (which should be the first td with class=headerTableEntry in the second tr) I get the error "'NoneType' object is not callable".
My question is, is there a different way to go at this so I can access all 6 data cells or is there a way to edit how I wrote line 4 to work? Thank you for your help, it is very much appreciated!
The <tr> tag is not inside another <tr> tag; as you can see, the first <tr> is closed with </tr>. So the next <td> is not a sibling of the previous one, which is why you get None back: it sits inside the next <tr> tag.
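A rough sketch of one way around that, if you simply want all six headerTableEntry cells, is to skip the sibling chaining and collect them in a single pass:

# grabs every data cell, regardless of which <tr> it sits in
cells = soup.find_all("td", class_="headerTableEntry")
values = [c.text for c in cells]
# e.g. ['13', '39', '33.3 %', '10', '12', '83.3 %']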
Pandas is a great package for parsing html <table> tags (which this is). It actually uses BeautifulSoup under the hood. Just read the full table and slice it down to the columns you want:
html_file = '''<table>
<tr>
<td class="headerName">File:</td>
<td class="HeaderValue">Some Value</td>
<td></td>
<td class="headerName">Lines:</td>
<td class="headerTableEntry">13</td>
<td class="headerTableEntry">39</td>
<td class="headerTableEntry" style="back-ground-color:LightPink">33.3 %</td>
</tr>
<tr>
<td class="headerName">Date:</td>
<td class="HeaderValue">2020-06-18 11:15:19</td>
<td></td>
<td class="headerName">Branches:</td>
<td class="headerTableEntry">10</td>
<td class="headerTableEntry">12</td>
<td class="headerTableEntry" style="back-ground-color:#FFFF55">83.3 %</td>
</tr>
</table>'''
import pandas as pd
df = pd.read_html(html_file)[0]
df = df.iloc[:,3:]
So for your code:
def find_data(fileName):
    with open(fileName) as html_file:
        df = pd.read_html(html_file)[0].iloc[:,3:]
        print (df)
Output:
print (df)
3 4 5 6
0 Lines: 13 39 33.3 %
1 Branches: 10 12 83.3 %
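If you then need the individual numbers, plain positional indexing on the sliced DataFrame should do it, for example:

print (df.iloc[0, 1])   # 13, the first 'Lines' value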

Parsing webpage in Python using Beautiful Soup

This is the web page source code that I am scraping using Beautiful Soup.
<tr>
<td>
1
</td>
<td style="cipher1">
<img class="cipher2" src="http://cipher3.png" alt="cipher4" title="cipher5" />
<span class="cipher8">t</span>cipher9
</td>
<td>
112
</td>
<td>
3510
</td>
// Pattern Repeated
<tr >
<td>
2
</td>
<td style="cipher1">
I wrote some code using BeautifulSoup but I am getting more results than I want due to multiple occurrences of the pattern.
I have used
row1 = soup.find_all('a', class_ = "cipher7")
for row in row1:
    f.write(row['title'] + "\n")
But with this I get multiple occurrences of 'cipher7', since it occurs multiple times in the web page.
So what I can use is this
<td style="cipher1">...
since it is unique to the items I want.
So, how do I modify my code to do this?
You can use a convenient select method which takes a CSS selector as an argument:
row = soup.select("td[style=cipher1] > a.cipher7")
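As a rough sketch, that result can be fed straight into your existing loop (assuming f is the file object from your code):

for a in soup.select("td[style=cipher1] > a.cipher7"):
    f.write(a['title'] + "\n")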
You can first find the td tag (since you said it is unique) and then find the specified a tag inside it.
all_as = []
rows = soup.find_all('td', {'style':'cipher1'})
for row in rows:
    all_as.append(row.find_all('a', class_ = "cipher7"))
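Note that all_as ends up as a list of lists (one inner list per matching td), so writing the titles out would look roughly like this:

for anchors in all_as:
    for a in anchors:
        f.write(a['title'] + "\n")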

Python + BS: Picking a specific word (location) from a webpage table

Hello all… I want to pick a word at a specific location from a table on a webpage. The source code is like:
table = '''
<TABLE class=form border=0 cellSpacing=1 cellPadding=2 width=500>
<TBODY>
<TR>
<TD vAlign=top colSpan=3><IMG class=ad src="/images/ad.gif" width=1 height=1></TD></TR>
<TR>
<TH vAlign=top width=22>Code:</TH>
<TD class=dash vAlign=top width=5 lign="left"> </TD>
<TD class=dash vAlign=top width=30 align=left><B>BAN</B></TD></TR>
<TR>
<TH vAlign=top>Color:</TH>
<TD class=dash vAlign=top align=left> </TD>
<TD class=dash vAlign=top align=left>White</TD></TR>
<TR>
<TD colSpan=3> </TD></TR></TBODY></TABLE>
'''
I want to pick out the color here (it could be “White”, "red" or something else). What I tried is:
soup = BeautifulSoup(table)
for a in soup.find_all('table')[0].find_all('tr')[2:3]:
    print a.text
It gives:
Color:
 
White
It looks like 4 lines. I tried adding them to a list and then removing the unwanted parts, but without success.
What’s the best way to only pick the color in the table?
Many thanks.
This will match all instances of 'white', case-insensitively ...
soup = BeautifulSoup(table)
res = []
for a in soup.find_all('table')[0].find_all('tr')[2:3]:
    if 'white' in a.text.lower():
        text = a.text.encode('ascii', 'ignore').replace(':','').split()
        res.append(text)
slightly better implementation ...
# this will iterate through all 'table' and 'tr' tags within each 'table'
res = [tr.text.encode('ascii', 'ignore').replace(':','').split()
       for table in soup.findAll('table') for tr in table.findAll('tr')
       if 'color' in tr.text.lower()]
print res
[['Color', 'White']]
to only return the colors themselves, do...
# Assuming the same format throughout the html
# if format is changing just add more logic
tr.text.encode('ascii', 'ignore').replace(':','').split()[1]
...
print res
['White']
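Another rough sketch, if you would rather anchor on the 'Color:' header cell than scan rows for a known color value (this assumes the th/td layout shown above):

th = soup.find('th', text='Color:')                    # the header cell labelled 'Color:'
color = th.find_next_siblings('td')[-1].text.strip()   # last td in that row
print color
# 'White'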

How to extract info from varying table entries: Text vs. DIV vs. SPAN

I am new to Python and have searched the internet for an answer to my problem, but so far without success...
The problem: my aim is to extract data from websites, more specifically from the tables on those websites. The relevant snippet of the website code is in "data" in my Python code example here:
from bs4 import BeautifulSoup
data = '''<table class="ds-table">
<tr>
<td class="data-label">year of birth:</td>
<td class="data-value">1994</td>
</tr>
<tr>
<td class="data-label">reporting period:</td>
<td class="data-value">
<span class="editable" id="c-scope_beginning_date">
? </span>
-
<span class="editable" id="c-scope_ending_date">
? </span>
</td>
</tr>
<tr>
<td class="data-label">reporting cycle:</td>
<td class="data-value">
<span class="editable" id="c-periodicity">
- </span>
</td>
</tr>
<tr>
<td class="data-label">grade:</td>
<td class="data-value">1.3, upper 10% of class</td>
</tr>
<tr>
<td class="data-label">status:</td>
<td class="data-value"></td>
</tr>
</table>
<table class="ds-table">
<tr>
<td class="data-label">economics:</td>
<td class="data-value"><span class="positive-value"></span></td>
</tr>
<tr>
<td class="data-label">statistics:</td>
<td class="data-value"><span class="negative-value"></span></td>
</tr>
<tr>
<td class="data-label">social:</td>
<td class="data-value"><div id="music_id" class="trigger"><span class="negative-value"></span></div></td>
</tr>
<tr>
<td class="data-label">misc:</td>
<td class="data-value">
<div id="c_assurance" class="">
<span class="positive-value"></span> </div>
</td>
</tr>
<tr>
<td class="data-label">recommendation:</td>
<td class="data-value">
<span class="negative-value"></span> </td>
</tr>
</table>'''
soup = BeautifulSoup(data)
For the class="data-label" so far I successfully implemented...
box_cdl = []
for i, cdl in enumerate(soup.findAll('td', attrs={'class': 'data-label'})):
    box_cdl.append(cdl.contents[0])
print box_cdl
...which extracts the text from the columns, in the (for me satisfying) output:
[u'year of birth:',
u'reporting period:',
u'reporting cycle:',
u'grade:',
u'status:',
u'economics:',
u'statistics:',
u'social:',
u'misc:',
u'recommendation:']
Where I get stuck is the part for class="data-value", with its div and span fields, and the fact that some of the relevant information is hidden in the span class. Moreover, the number of tr rows can change from website to website, e.g. "status" may come right after "reporting cycle" (instead of after "grade").
However, when I do...
box_cdv = []
for j, cdv in enumerate(soup.findAll('td', attrs={'class': 'data-value'})):
    box_cdv.append(cdv.contents[0])
print box_cdv
...I get the error:
Traceback (most recent call last):
File "<ipython-input-53-7d5c095cf647>", line 3, in <module>
box_cdv.append(cdv.contents[0])
IndexError: list index out of range
What I would like to get instead is something like this (corresponding to the above "data"-example):
[u'1994',
u'? - ?',
u'-',
u'1.3, upper 10% of class',
u'',
u'positive-value',
u'negative-value',
u'negative-value',
u'positive-value',
u'negative-value']
The Question: how can I extract this information and collect the relevant data from each tr-row, given that the adequate extraction-code depends on the type of the category (year of birth, reporting period, ..., recommendation)?
Or, asking differently: what code extracts me, depending on the category (year of birth, reporting period, ..., recommendation), the corresponding value (1994, ..., negative-value)?
Since the number and type of table entries can differ between websites, a simple "on the i-th entry do the following" procedure is not applicable. What I am looking for, I think, is something like "if you find the text "recommendation:", then extract the class type from the span field". But unfortunately I have no clue how to translate that into Python.
Any help is highly appreciated.
You get that error because one of the tags doesn't have any children, so indexing into its contents list fails.
You can approach this in the following way:
1) Search for the data-label tags;
2) Find the next data-value td;
3) Check if that td has text:
   a) if so, create a dict entry with the data-label text as the key and the td text as its value;
   b) if not, check whether its first child has a class containing -value and use that class name as the value;
4) Parse the data.
Example:
soup = BeautifulSoup(data, 'lxml')
result = {}
for tag in soup.find_all("td", { "class" : "data-label" }):
    NextSibling = tag.find_next("td", { "class" : "data-value" }).get_text(strip = True)
    if not NextSibling and len(tag.find_next("td").select('span[class*=-value]')) > 0:
        NextSibling = tag.find_next("td").select('span[class*=-value]')[0]["class"][0]
    result[tag.get_text(strip = True)] = NextSibling
print (result)
Result:
{
'year of birth:': '1994',
'reporting period:': '?-?',
'reporting cycle:': '-',
'grade:': '1.3, upper 10% of class',
'status:': '',
'economics:': 'positive-value',
'statistics:': 'negative-value',
'social:': 'negative-value',
'misc:': 'positive-value',
'recommendation:': 'negative-value'
}
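Since the labels become dictionary keys, lookups no longer depend on the row order, e.g.:

print (result['recommendation:'])   # 'negative-value'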

Using Python and Beautifulsoup how do I select the desired table in a div?

I would like to be able to select the table containing the "Accounts Payable" text, but I'm not getting anywhere with what I'm trying and I'm pretty much just guessing with findAll. Can someone show me how I would do this?
For example this is what I start with:
<div>
<tr>
<td class="lft lm">Accounts Payable
</td>
<td class="r">222.82</td>
<td class="r">92.54</td>
<td class="r">100.34</td>
<td class="r rm">99.95</td>
</tr>
<tr>
<td class="lft lm">Accrued Expenses
</td>
<td class="r">36.49</td>
<td class="r">33.39</td>
<td class="r">31.39</td>
<td class="r rm">36.47</td>
</tr>
</div>
And this is what I would like to get as a result:
<tr>
<td class="lft lm">Accounts Payable
</td>
<td class="r">222.82</td>
<td class="r">92.54</td>
<td class="r">100.34</td>
<td class="r rm">99.95</td>
</tr>
You can select the td elements with class lft lm and then examine the element.string to determine if you have the "Accounts Payable" td:
import sys
from BeautifulSoup import BeautifulSoup

# where so_soup.txt is your html
f = open("so_soup.txt", "r")
data = f.readlines()
f.close()

soup = BeautifulSoup("".join(data))
cells = soup.findAll('td', {"class": "lft lm"})
for cell in cells:
    # You can compare cell.string against "Accounts Payable"
    print(cell.string)
If you would like to examine the following siblings for Accounts Payable for instance, you could use the following:
if cell.string.strip() == "Accounts Payable":
    sibling = cell.findNextSibling()
    while sibling:
        print("\t" + sibling.string)
        sibling = sibling.findNextSibling()
Update for Edit
If you would like to print out the original HTML, just for the siblings that follow the Accounts Payable element, this is the code for that:
lines = ["<tr>"]
for cell in cells:
lines.append (cell.prettify().decode('ascii'))
if (cell.string.strip () == "Accounts Payable"):
sibling = cell.findNextSibling ()
while (sibling):
lines.append (sibling.prettify().decode('ascii'))
sibling = sibling.findNextSibling ()
lines.append ("</tr>")
f = open ("so_soup_out.txt", "wt")
f.writelines (lines)
f.close ()
