I am trying to gather some data by web scraping a local HTML file using BeautifulSoup4. The problem is that the information I'm trying to get is on different rows that have the same class tags, and I'm not sure how to access them. The following HTML screenshot contains the two rows I'm accessing, with the data I need highlighted (sensitive info is scribbled out).
The code I have currently is:
from bs4 import BeautifulSoup as bs

def find_data(fileName):
    with open(fileName) as html_file:
        soup = bs(html_file, "lxml")
        hline1 = soup.find("td", class_="headerTableEntry")
        hline2 = hline1.find_next_sibling("td")
        hline3 = hline2.find_next_sibling("td")
        hline4 = hline3.find_next_sibling("td", class_="headerTableEntry")
        line1 = hline1.text
        line2 = hline2.text
        line3 = hline3.text
        # Nothing yet for lines 4, 5, 6
The first 3 lines work great and give 13, 39, and 33.3% as they should. But for line 4 (which should be the first td with class="headerTableEntry" in the second row) I get the error "'NoneType' object is not callable".
My question is, is there a different way to go at this so I can access all 6 data cells or is there a way to edit how I wrote line 4 to work? Thank you for your help, it is very much appreciated!
The cell you are after is not a sibling of the previous <td>: as you can see, the first <tr> tag is closed with </tr>, and the next <td> lives inside the following <tr>. That is why find_next_sibling() returns None.
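If you want to stay with plain BeautifulSoup, one option (a minimal sketch, reusing the soup object from your function) is to loop over the rows instead of chaining find_next_sibling, since each row contains its own set of headerTableEntry cells:

values = []
for tr in soup.find_all("tr"):
    for td in tr.find_all("td", class_="headerTableEntry"):
        values.append(td.text)
# with the table below this collects all six cells:
# ['13', '39', '33.3 %', '10', '12', '83.3 %']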
Pandas is a great package for parsing html <table> tags (which this is); it can use BeautifulSoup under the hood. Just read the full table and slice out the columns you want:
html_file = '''<table>
<tr>
<td class="headerName">File:</td>
<td class="HeaderValue">Some Value</td>
<td></td>
<td class="headerName">Lines:</td>
<td class="headerTableEntry">13</td>
<td class="headerTableEntry">39</td>
<td class="headerTableEntry" style="back-ground-color:LightPink">33.3 %</td>
</tr>
<tr>
<td class="headerName">Date:</td>
<td class="HeaderValue">2020-06-18 11:15:19</td>
<td></td>
<td class="headerName">Branches:</td>
<td class="headerTableEntry">10</td>
<td class="headerTableEntry">12</td>
<td class="headerTableEntry" style="back-ground-color:#FFFF55">83.3 %</td>
</tr>
</table>'''
import pandas as pd
df = pd.read_html(html_file)[0]
df = df.iloc[:,3:]
So for your code:
def find_data(fileName):
    with open(fileName) as html_file:
        df = pd.read_html(html_file)[0].iloc[:,3:]
        print (df)
Output:
print (df)
3 4 5 6
0 Lines: 13 39 33.3 %
1 Branches: 10 12 83.3 %
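Individual cells can then be read straight off the sliced DataFrame; a small usage sketch based on the output above (the variable names are just illustrative):

lines_covered, lines_total, lines_pct = df.iloc[0, 1:4]           # 13, 39, '33.3 %'
branches_covered, branches_total, branches_pct = df.iloc[1, 1:4]  # 10, 12, '83.3 %'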
I am reading in a .html file that looks similar to the following format:
html = '''
<tr>
<td class="SmallFormText" colspan="3">hours per response:</td><td class="SmallFormTextR">23.8</td>
</tr>
<hr>
<table width="100%" border="0" cellspacing="0" cellpadding="4" summary="Form 13F-NT Header Information">
<tbody>
<tr>
<td class="FormTextC">COLUMN 1</td><td class="FormTextC">COLUMN 2</td><td class="FormTextC">COLUMN 3</td><td class="FormTextR">COLUMN 4</td><td class="FormTextC" colspan="3">COLUMN 5</td><td class="FormTextC">COLUMN 6</td><td class="FormTextR">COLUMN 7</td><td class="FormTextC" colspan="3">COLUMN 8</td>
</tr>
<tr>
<td class="FormText"></td><td class="FormText"></td><td class="FormText"></td><td class="FormTextR">VALUE</td><td class="FormTextR">SHRS OR</td><td class="FormText">SH/</td><td class="FormText">PUT/</td><td class="FormText">INVESTMENT</td><td class="FormTextR">OTHER</td><td class="FormTextC" colspan="3">VOTING AUTHORITY</td>
</tr>
<tr>
<td class="FormText">NAME OF ISSUER</td><td class="FormText">TITLE OF CLASS</td><td class="FormText">CUSIP</td><td class="FormTextR">(x$1000)</td><td class="FormTextR">PRN AMT</td><td class="FormText">PRN</td><td class="FormText">CALL</td><td class="FormText">DISCRETION</td><td class="FormTextR">MANAGER</td><td class="FormTextR">SOLE</td><td class="FormTextR">SHARED</td><td class="FormTextR">NONE</td>
</tr>
<tr>
<td class="FormData">1ST SOURCE CORP</td><td class="FormData">COM</td><td class="FormData">336901103</td><td class="FormDataR">8</td><td class="FormDataR">335</td><td class="FormData">SH</td><td> </td><td class="FormData">SOLE</td><td class="FormData">7</td><td class="FormDataR">335</td><td class="FormDataR">0</td><td class="FormDataR">0</td>
</tr>
<tr>
<td class="FormData">1ST UNITED BANCORP INC FLA</td><td class="FormData">COM</td><td class="FormData">33740N105</td><td class="FormDataR">7</td><td class="FormDataR">989</td><td class="FormData">SH</td><td> </td><td class="FormData">SOLE</td><td class="FormData">7</td><td class="FormDataR">989</td><td class="FormDataR">0</td><td class="FormDataR">0</td>
</tr> '''
In this code, I am trying to extract the information between the <tr> and </tr> tags. In particular, I want to assign a given piece of information, such as "NAME OF ISSUER", to a column called "NAME_OF_ISSUER", using Beautiful Soup. However, when I run the following code, I get an error that looks like it should be simple to solve (it's more or less a data formatting issue). Being new to Python, I have been stuck for a few hours trying alternative solutions. I would appreciate any comments or feedback.
Here is my code (please run the above code as well to obtain the html data):
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
rows = soup.find_all('tr')[11:]
positions = []
dic = {}
position = rows.find_all('td')
dic["NAME_OF_ISSUER"] = position[0].text
dic["CUSIP"] = position[2].text
dic["VALUE"] = int(position[3].text.replace(',', ''))*1000
dic["SHARES"] = int(position[4].text.replace(',', ''))
positions.append(dic)
df = pd.DataFrame(positions)
I am getting an "AttributeError" right after defining position, stating that the list object has no attribute "find_all".
What exactly does this mean? Also, how would I need to transform the html data to avoid this issue?
Edited part:
Here is the full stack trace:
position = rows.find_all('td')
Traceback (most recent call last):
File "<ipython-input-8-37353b5ab2ef>", line 1, in <module>
position = rows.find_all('td')
AttributeError: 'list' object has no attribute 'find_all'
soup.find_all returns a python list of elements. All you need to do is iterate through the list and grab data from those elements.
from bs4 import BeautifulSoup
import pandas as pd

soup = BeautifulSoup(html, 'lxml')
rows = soup.find_all('tr')

# scan for the header row and trim the list
for index, row in enumerate(rows):
    cells = row.find_all('td')
    if cells and "NAME OF ISSUER" in cells[0].text.upper():
        del rows[:index + 1]
        break

# convert the remaining html rows to dicts to build the dataframe
positions = []
for position in rows:
    dic = {}
    cells = position.find_all('td')
    dic["NAME_OF_ISSUER"] = cells[0].text
    dic["CUSIP"] = cells[2].text
    dic["VALUE"] = int(cells[3].text.replace(',', '')) * 1000
    dic["SHARES"] = int(cells[4].text.replace(',', ''))
    positions.append(dic)

df = pd.DataFrame(positions)
This is the web page source code that I am scraping using Beautiful Soup.
<tr>
<td>
1
</td>
<td style="cipher1">
<img class="cipher2" src="http://cipher3.png" alt="cipher4" title="cipher5" />
<span class="cipher8">t</span>cipher9
</td>
<td>
112
</td>
<td>
3510
</td>
<!-- Pattern Repeated -->
<tr >
<td>
2
</td>
<td style="cipher1">
I wrote some code using BeautifulSoup but I am getting more results than I want due to multiple occurrences of the pattern.
I have used
row1 = soup.find_all('a', class_="cipher7")
for row in row1:
    f.write(row['title'] + "\n")
But with this I get multiple occurrences of 'cipher7', since it appears multiple times in the web page. What I could use instead is
<td style="cipher1">...
since it is unique to the things I want.
So, how can I modify my code to do this?
You can use a convenient select method which takes a CSS selector as an argument:
row = soup.select("td[style=cipher1] > a.cipher7")
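select returns a list of the matching tags, so it can be looped over just like a find_all result; a short sketch, assuming the same open file handle f and title attributes as in the question:

for a in soup.select("td[style=cipher1] > a.cipher7"):
    f.write(a['title'] + "\n")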
You can first find the td tag (since you said it is unique) and then find the specified a tag from it.
all_as = []
rows = soup.find_all('td', {'style': 'cipher1'})
for row in rows:
    all_as.append(row.find_all('a', class_="cipher7"))
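Note that row.find_all(...) itself returns a list, so all_as ends up as a list of lists. If you want a flat list of the a tags instead, extend rather than append (a small variation, not part of the original answer):

all_as = []
for row in soup.find_all('td', {'style': 'cipher1'}):
    all_as.extend(row.find_all('a', class_="cipher7"))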
I have a large table from the web, accessed via requests and parsed with BeautifulSoup. Part of it looks something like this:
<table>
<tbody>
<tr>
<td>265</td>
<td> <a href="/j/jones03.shtml">JonesBlue</a></td>
<td>29</td>
</tr>
<tr >
<td>266</td>
<td> <a href="/s/smith01.shtml">Smith</a></td>
<td>34</td>
</tr>
</tbody>
</table>
When I convert this to pandas using pd.read_html(tbl) the output is like this:
0 1 2
0 265 JonesBlue 29
1 266 Smith 34
I need to keep the information in the <A HREF ... > tag, since the unique identifier is stored in the link. That is, the table should look like this:
0 1 2
0 265 jones03 29
1 266 smith01 34
I'm fine with various other outputs (for example, jones03 Jones would be even more helpful) but the unique ID is critical.
Other cells also have html tags in them, and in general I don't want those to be saved, but if that's the only way of getting the uid I'm OK with keeping those tags and cleaning them up later, if I have to.
Is there a simple way of accessing this information?
Since this parsing job requires the extraction of both text and attribute values, it cannot be done entirely "out-of-the-box" by a function such as pd.read_html. Some of it has to be done by hand.
Using lxml, you could extract the attribute values with XPath:
import lxml.html as LH
import pandas as pd
content = '''
<table>
<tbody>
<tr>
<td>265</td>
<td> <a href="/j/jones03.shtml">JonesBlue</a></td>
<td >29</td>
</tr>
<tr >
<td>266</td>
<td> <a href="/s/smith01.shtml">Smith</a></td>
<td>34</td>
</tr>
</tbody>
</table>'''
table = LH.fromstring(content)
for df in pd.read_html(content):
    df['refname'] = table.xpath('//tr/td/a/@href')
    df['refname'] = df['refname'].str.extract(r'([^./]+)[.]')
    print(df)
yields
0 1 2 refname
0 265 JonesBlue 29 jones03
1 266 Smith 34 smith01
The above may be useful since it requires only a few extra lines of code to add the refname column. But both LH.fromstring and pd.read_html parse the HTML, so its efficiency could be improved by removing pd.read_html and parsing the table once with LH.fromstring:
table = LH.fromstring(content)

# extract the text from the `<td>` tags
data = [[elt.text_content() for elt in tr.xpath('td')]
        for tr in table.xpath('//tr')]
df = pd.DataFrame(data, columns=['id', 'name', 'val'])
for col in ('id', 'val'):
    df[col] = df[col].astype(int)

# extract the href attribute values
df['refname'] = table.xpath('//tr/td/a/@href')
df['refname'] = df['refname'].str.extract(r'([^./]+)[.]')
print(df)
yields
id name val refname
0 265 JonesBlue 29 jones03
1 266 Smith 34 smith01
You could simply parse the table manually like this:
import BeautifulSoup
import pandas as pd
TABLE = """<table>
<tbody>
<tr>
<td>265</td>
<td> <a href="/j/jones03.shtml">JonesBlue</a></td>
<td >29</td>
</tr>
<tr >
<td>266</td>
<td> <a href="/s/smith01.shtml">Smith</a></td>
<td>34</td>
</tr>
</tbody>
</table>"""
table = BeautifulSoup.BeautifulSoup(TABLE)
records = []
for tr in table.findAll("tr"):
    trs = tr.findAll("td")
    record = []
    record.append(trs[0].text)
    record.append(trs[1].a["href"])
    record.append(trs[2].text)
    records.append(record)

df = pd.DataFrame(data=records)
df
which gives you
0 1 2
0 265 /j/jones03.shtml 29
1 266 /s/smith01.shtml 34
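The snippet above uses the old BeautifulSoup 3 API (import BeautifulSoup, findAll). A roughly equivalent sketch with the current bs4 package would be:

from bs4 import BeautifulSoup
import pandas as pd

table = BeautifulSoup(TABLE, "html.parser")
records = []
for tr in table.find_all("tr"):
    tds = tr.find_all("td")
    # id, link target, value
    records.append([tds[0].text, tds[1].a["href"], tds[2].text])

df = pd.DataFrame(data=records)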
You could use regular expressions to modify the text first and remove the html tags:
import re, pandas as pd
tbl = """<table>
<tbody>
<tr>
<td>265</td>
<td> <a href="/j/jones03.shtml">JonesBlue</a></td>
<td>29</td>
</tr>
<tr >
<td>266</td>
<td> <a href="/s/smith01.shtml">Smith</a></td>
<td>34</td>
</tr>
</tbody>
</table>"""
tbl = re.sub('<a.*?href="(.*?)">(.*?)</a>', '\\1 \\2', tbl)
pd.read_html(tbl)
which gives you
[ 0 1 2
0 265 /j/jones03.shtml JonesBlue 29
1 266 /s/smith01.shtml Smith 34]
This is available now in pandas 1.5.0+ using the extract_links parameter.
extract_links - possible options: {None, “all”, “header”, “body”, “footer”}
Table elements in the specified section(s) with <a> tags will have their href extracted.
Documentation
Example
html_table = """
<table>
<tr>
<th>GitHub</th>
</tr>
<tr>
<td><a href="https://github.com/pandas-dev/pandas">pandas</a>
</td>
</tr>
</table>
"""
df = pd.read_html(
    html_table,
    extract_links="all"
)[0]
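With extract_links, each cell (and, with "all", each header label) comes back as a (text, link) tuple, where the link is None if the cell has no <a> tag. A small sketch of pulling the pieces apart, assuming pandas >= 1.5 and the table above:

first_cell = df.iloc[0, 0]
text, href = first_cell    # e.g. ('pandas', 'https://github.com/pandas-dev/pandas')

# a Series of just the links from the first column
links = df.iloc[:, 0].apply(lambda cell: cell[1])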
Hello all… I want to pick a word at a specific location from a table on a webpage. The source code is like:
table = '''
<TABLE class=form border=0 cellSpacing=1 cellPadding=2 width=500>
<TBODY>
<TR>
<TD vAlign=top colSpan=3><IMG class=ad src="/images/ad.gif" width=1 height=1></TD></TR>
<TR>
<TH vAlign=top width=22>Code:</TH>
<TD class=dash vAlign=top width=5 align="left"> </TD>
<TD class=dash vAlign=top width=30 align=left><B>BAN</B></TD></TR>
<TR>
<TH vAlign=top>Color:</TH>
<TD class=dash vAlign=top align=left> </TD>
<TD class=dash vAlign=top align=left>White</TD></TR>
<TR>
<TD colSpan=3> </TD></TR></TBODY></TABLE>
'''
I want to pick the color value here (it could be “White”, "red" or something else). What I tried is:
soup = BeautifulSoup(table)
for a in soup.find_all('table')[0].find_all('tr')[2:3]:
    print a.text
It gives:
Color:
White
It looks like 4 lines. I tried adding them to a list and then removing the unwanted ones, but without success.
What’s the best way to only pick the color in the table?
Many thanks.
This will match all instances of 'white', case-independent ...
soup = BeautifulSoup(table)
res = []
for a in soup.find_all('table')[0].find_all('tr')[2:3]:
    if 'white' in a.text.lower():
        text = a.text.encode('ascii', 'ignore').replace(':', '').split()
        res.append(text)
slightly better implementation ...
# this will iterate through all 'table' tags and the 'tr' tags within each 'table'
res = [tr.text.encode('ascii', 'ignore').replace(':', '').split()
       for table in soup.findAll('table') for tr in table.findAll('tr')
       if 'color' in tr.text.lower()]
print res
[['Color', 'White']]
to only return the colors themselves, do...
# Assuming the same format throughout the html
# if the format changes, just add more logic
tr.text.encode('ascii', 'ignore').replace(':','').split()[1]
...
print res
['White']
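If you would rather look the value up by its label than by row index, here is a small sketch (assuming, as in the snippet above, that the label sits in a <th> and the value in the last <td> of the same row):

th = soup.find('th', string=lambda s: s and 'Color' in s)
color = th.find_parent('tr').find_all('td')[-1].get_text(strip=True)
print(color)   # White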
I am new to python and searched the internet to find an answer to my problem, but so far I failed...
The problem: my aim is to extract data from websites, more specifically from the tables on these websites. You can find the relevant snippet of the website code in "data" in my Python code example here:
from bs4 import BeautifulSoup
data = '''<table class="ds-table">
<tr>
<td class="data-label">year of birth:</td>
<td class="data-value">1994</td>
</tr>
<tr>
<td class="data-label">reporting period:</td>
<td class="data-value">
<span class="editable" id="c-scope_beginning_date">
? </span>
-
<span class="editable" id="c-scope_ending_date">
? </span>
</td>
</tr>
<tr>
<td class="data-label">reporting cycle:</td>
<td class="data-value">
<span class="editable" id="c-periodicity">
- </span>
</td>
</tr>
<tr>
<td class="data-label">grade:</td>
<td class="data-value">1.3, upper 10% of class</td>
</tr>
<tr>
<td class="data-label">status:</td>
<td class="data-value"></td>
</tr>
</table>
<table class="ds-table">
<tr>
<td class="data-label">economics:</td>
<td class="data-value"><span class="positive-value"></span></td>
</tr>
<tr>
<td class="data-label">statistics:</td>
<td class="data-value"><span class="negative-value"></span></td>
</tr>
<tr>
<td class="data-label">social:</td>
<td class="data-value"><div id="music_id" class="trigger"><span class="negative-value"></span></div></td>
</tr>
<tr>
<td class="data-label">misc:</td>
<td class="data-value">
<div id="c_assurance" class="">
<span class="positive-value"></span> </div>
</td>
</tr>
<tr>
<td class="data-label">recommendation:</td>
<td class="data-value">
<span class="negative-value"></span> </td>
</tr>
</table>'''
soup = BeautifulSoup(data)
For the class="data-label" so far I successfully implemented...
box_cdl = []
for i, cdl in enumerate(soup.findAll('td', attrs={'class': 'data-label'})):
box_cdl.append(cdl.contents[0])
print box_cdl
...which extracts the text from the columns, in the (for me satisfying) output:
[u'year of birth:',
u'reporting period:',
u'reporting cycle:',
u'grade:',
u'status:',
u'economics:',
u'statistics:',
u'social:',
u'misc:',
u'recommendation:']
Where I get stuck is the part for class="data-value", with its div and span fields, and the fact that some of the relevant information is hidden in the span class. Moreover, the number of tr rows can change from website to website, e.g. "status" may come right after "reporting cycle" (instead of after "grade").
However, when I do...
box_cdv = []
for j, cdv in enumerate(soup.findAll('td', attrs={'class': 'data-value'})):
    box_cdv.append(cdv.contents[0])

print box_cdv
...I get the error:
Traceback (most recent call last):
File "<ipython-input-53-7d5c095cf647>", line 3, in <module>
box_cdv.append(cdv.contents[0])
IndexError: list index out of range
What I would like to get instead is something like this (corresponding to the above "data"-example):
[u'1994',
u'? - ?',
u'-',
u'1.3, upper 10% of class',
u'',
u'positive-value',
u'negative-value',
u'negative-value',
u'positive-value',
u'negative-value']
The question: how can I extract this information and collect the relevant data from each tr row, given that the appropriate extraction code depends on the category (year of birth, reporting period, ..., recommendation)?
Or, put differently: what code extracts, for each category (year of birth, reporting period, ..., recommendation), the corresponding value (1994, ..., negative-value)?
Since the number and type of table entries can differ between websites, a simple "on the i-th entry do the following" procedure is not applicable. What I am looking for, I think, is something like "if you find the text 'recommendation:', then extract the class name from the span field", but unfortunately I have no clue how to translate that into Python.
Any help is highly appreciated.
You get that error because one of the tags doesn't have any children, so indexing its contents list fails.
You can approach this the following way:
1) search for the data-label tags;
2) find the next data-value td;
3) check whether that td has text:
   a) if it does, create a dict entry with the data-label text as the key and the td text as the value;
   b) if it does not, check whether the td contains a span whose class contains -value and use that class name as the value;
4) parse the data.
Example:
soup = BeautifulSoup(data, 'lxml')
result = {}

for tag in soup.find_all("td", {"class": "data-label"}):
    NextSibling = tag.find_next("td", {"class": "data-value"}).get_text(strip=True)
    if not NextSibling and len(tag.find_next("td").select('span[class*=-value]')) > 0:
        NextSibling = tag.find_next("td").select('span[class*=-value]')[0]["class"][0]
    result[tag.get_text(strip=True)] = NextSibling

print(result)
Result:
{
'year of birth:': '1994',
'reporting period:': '?-?',
'reporting cycle:': '-',
'grade:': '1.3, upper 10% of class',
'status:': '',
'economics:': 'positive-value',
'statistics:': 'negative-value',
'social:': 'negative-value',
'misc:': 'positive-value',
'recommendation:': 'negative-value'
}