I have this HTML:
<tr class="BgWhite">
<td headers="th0" valign="top">
3
</td>
<td headers="th1" style="width: 125px;" valign="top">
8340-01-551-1310
</td>
I want to find the number "8340-01-551-1310", so I used this code:
test = container1.find_all("td", {"headers": "th1"})
test1 = test.find_all("a", {"title":"go to NSN view"})
but it displays this message
"ResultSet object has no attribute '%s'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?" % key
AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
What am I doing wrong, and how do I fix this?
Here is one way:
from bs4 import BeautifulSoup
data = """<tr class="BgWhite">
<td headers="th0" valign="top">
3
</td>
<td headers="th1" style="width: 125px;" valign="top">
8340-01-551-1310
</td>"""
soup = BeautifulSoup(data, "lxml")
for td in soup.find_all('td', {"headers": "th1"}):
    for a in td.find_all('a'):
        print(a.text)
Output:
8340-01-551-1310
However, if you are sure there is only one "th1" cell (or you only want the first one), and only one "a" element inside it (or you only want the first one), you could try:
print(soup.find('td', {"headers": "th1"}).find('a').text)
Which returns the same output.
EDIT:
Just noticed it could be simplified to:
print(soup.find('td', {"headers": "th1"}).a.text)
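To make the difference between the two calls concrete, here is a minimal sketch. It assumes the cell wraps the number in an "a" element with title "go to NSN view", as the question's code implies; that element and its href are made up here, since the posted HTML does not show them. find_all() returns a list-like ResultSet, so find_all() has to be called on each element inside it rather than on the ResultSet itself.

```python
from bs4 import BeautifulSoup

# Assumption: the <a> element (and its href) is made up for illustration;
# the snippet in the question does not show it.
data = """<table><tr class="BgWhite">
<td headers="th0" valign="top">3</td>
<td headers="th1" style="width: 125px;" valign="top">
<a href="#" title="go to NSN view">8340-01-551-1310</a>
</td>
</tr></table>"""

soup = BeautifulSoup(data, "html.parser")

cells = soup.find_all("td", {"headers": "th1"})  # list-like ResultSet of Tag objects

# Calling find_all() on the ResultSet itself reproduces the error from the question.
try:
    cells.find_all("a")
    error_name = None
except AttributeError as exc:
    error_name = type(exc).__name__

# Calling find_all() on each Tag inside the ResultSet works.
nsn_values = [
    a.get_text(strip=True)
    for td in cells
    for a in td.find_all("a", {"title": "go to NSN view"})
]
print(error_name)
print(nsn_values)
```

If only the first match matters, find() returns a single Tag (or None), which is why the chained find(...).find(...) form works without a loop.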
I am reading in a .html file that looks similar to the following format:
html = '''
<tr>
<td class="SmallFormText" colspan="3">hours per response:</td><td class="SmallFormTextR">23.8</td>
</tr>
<hr>
<table width="100%" border="0" cellspacing="0" cellpadding="4" summary="Form 13F-NT Header Information">
<tbody>
<tr>
<td class="FormTextC">COLUMN 1</td><td class="FormTextC">COLUMN 2</td><td class="FormTextC">COLUMN 3</td><td class="FormTextR">COLUMN 4</td><td class="FormTextC" colspan="3">COLUMN 5</td><td class="FormTextC">COLUMN 6</td><td class="FormTextR">COLUMN 7</td><td class="FormTextC" colspan="3">COLUMN 8</td>
</tr>
<tr>
<td class="FormText"></td><td class="FormText"></td><td class="FormText"></td><td class="FormTextR">VALUE</td><td class="FormTextR">SHRS OR</td><td class="FormText">SH/</td><td class="FormText">PUT/</td><td class="FormText">INVESTMENT</td><td class="FormTextR">OTHER</td><td class="FormTextC" colspan="3">VOTING AUTHORITY</td>
</tr>
<tr>
<td class="FormText">NAME OF ISSUER</td><td class="FormText">TITLE OF CLASS</td><td class="FormText">CUSIP</td><td class="FormTextR">(x$1000)</td><td class="FormTextR">PRN AMT</td><td class="FormText">PRN</td><td class="FormText">CALL</td><td class="FormText">DISCRETION</td><td class="FormTextR">MANAGER</td><td class="FormTextR">SOLE</td><td class="FormTextR">SHARED</td><td class="FormTextR">NONE</td>
</tr>
<tr>
<td class="FormData">1ST SOURCE CORP</td><td class="FormData">COM</td><td class="FormData">336901103</td><td class="FormDataR">8</td><td class="FormDataR">335</td><td class="FormData">SH</td><td> </td><td class="FormData">SOLE</td><td class="FormData">7</td><td class="FormDataR">335</td><td class="FormDataR">0</td><td class="FormDataR">0</td>
</tr>
<tr>
<td class="FormData">1ST UNITED BANCORP INC FLA</td><td class="FormData">COM</td><td class="FormData">33740N105</td><td class="FormDataR">7</td><td class="FormDataR">989</td><td class="FormData">SH</td><td> </td><td class="FormData">SOLE</td><td class="FormData">7</td><td class="FormDataR">989</td><td class="FormDataR">0</td><td class="FormDataR">0</td>
</tr> '''
In this code, I am trying to extract the information between the <tr> and </tr> tags. In particular, I want to store each piece of data under a column name taken from its header, such as "NAME_OF_ISSUER" for "NAME OF ISSUER", using Beautiful Soup. However, when I run the following code I get an error that looks simple to solve (it is more or less a data-formatting issue). I am new to Python and have been stuck for a few hours trying alternatives. I would appreciate any comments or feedback.
Here is my code (please run the above code as well to obtain the html data):
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
rows = soup.find_all('tr')[11:]
positions = []
dic = {}
position = rows.find_all('td')
dic["NAME_OF_ISSUER"] = position[0].text
dic["CUSIP"] = position[2].text
dic["VALUE"] = int(position[3].text.replace(',', ''))*1000
dic["SHARES"] = int(position[4].text.replace(',', ''))
positions.append(dic)
df = pd.DataFrame(positions)
I am getting an "AttributeError" right after defining position, stating that the list object has no attribute "find_all".
What exactly does this mean? Also, how would I need to transform the html data to avoid this issue?
Edited part:
Here is the full stack trace:
position = rows.find_all('td')
Traceback (most recent call last):
File "<ipython-input-8-37353b5ab2ef>", line 1, in <module>
position = rows.find_all('td')
AttributeError: 'list' object has no attribute 'find_all'
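soup.find_all returns a list-like ResultSet of elements, and slicing it with [11:] gives a plain Python list. Neither has a find_all method; only the individual elements do. All you need to do is iterate through the list and grab the data from each element.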
import pandas as pd
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
rows = soup.find_all('tr')
# scan for header row and trim list
for index, row in enumerate(rows):
    cells = row.find_all('td')
    if cells and "NAME OF ISSUER" in cells[0].text.upper():
        del rows[:index+1]
        break
# convert remaining html rows to dict to create dataframe
positions = []
for position in rows:
    dic = {}
    cells = position.find_all('td')
    dic["NAME_OF_ISSUER"] = cells[0].text
    dic["CUSIP"] = cells[2].text
    dic["VALUE"] = int(cells[3].text.replace(',', ''))*1000
    dic["SHARES"] = int(cells[4].text.replace(',', ''))
    positions.append(dic)
df = pd.DataFrame(positions)
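For anyone who wants to try the row-by-row approach without the full page, here is a self-contained sketch. The HTML is a trimmed copy of the table in the question (only the first five columns), the helper to_int and the length check are my own additions, and the final pd.DataFrame(positions) step is left out so it runs with bs4 alone.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the table from the question: first five columns only.
html = """<table><tbody>
<tr><td class="FormText">NAME OF ISSUER</td><td class="FormText">TITLE OF CLASS</td><td class="FormText">CUSIP</td><td class="FormTextR">(x$1000)</td><td class="FormTextR">PRN AMT</td></tr>
<tr><td class="FormData">1ST SOURCE CORP</td><td class="FormData">COM</td><td class="FormData">336901103</td><td class="FormDataR">8</td><td class="FormDataR">335</td></tr>
<tr><td class="FormData">1ST UNITED BANCORP INC FLA</td><td class="FormData">COM</td><td class="FormData">33740N105</td><td class="FormDataR">7</td><td class="FormDataR">989</td></tr>
</tbody></table>"""


def to_int(text):
    """Hypothetical helper: parse '1,234' style cell text into an int."""
    return int(text.replace(",", "").strip())


soup = BeautifulSoup(html, "html.parser")

positions = []
header_seen = False
for row in soup.find_all("tr"):      # iterate: each row is a Tag
    cells = row.find_all("td")       # find_all is valid on a single Tag
    if len(cells) < 5:
        continue                     # skip rows that cannot hold a position
    if not header_seen:
        # Everything up to and including the header row is skipped.
        header_seen = "NAME OF ISSUER" in cells[0].get_text().upper()
        continue
    positions.append({
        "NAME_OF_ISSUER": cells[0].get_text(strip=True),
        "CUSIP": cells[2].get_text(strip=True),
        "VALUE": to_int(cells[3].get_text()) * 1000,
        "SHARES": to_int(cells[4].get_text()),
    })

print(positions)
```

The resulting list of dicts can be passed to pd.DataFrame exactly as in the code above.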
I have HTML as follows:
<table class="stocksTable" summary="株価詳細">
<tr>
<th class="symbol"><h1>(株)みずほフィナンシャルグループ</h1></th>
<td class="stoksPrice realTimChange">
<div class="realTimChangeMod">
</div>
</td>
td class="stoksPrice">191.1</td>
<td class="change"><span class="yjSt">前日比</span><span class="icoUpGreen yjMSt">+2.5(+1.33%)</span></td>
</tr>
</table>
I tried to extract 191.1 from a line containing td class="stoksPrice">191.1</td>.
soup = BeautifulSoup(html)
res = soup.find_all('stoksPrice')
print (res)
But the result is [].
How can I find it?
There seem to be two issues:
First, your usage of find_all is invalid. As written, you are searching for a tag named stoksPrice, which does not exist: your tags are table, tr, td, div and span. You need to change that to:
>>> res = soup.find_all(class_='stoksPrice')
to search for tags with that class.
Second, your HTML is malformed. The line with stoksPrice is:
</td>
td class="stoksPrice">191.1</td>
it should have been:
</td>
<td class="stoksPrice">191.1</td>
(Note the < before the td.)
I'm not sure whether that was a copy error when posting or whether the HTML is malformed at the source, but as written it is not going to be easy to parse.
Since there are multiple tags having the same class, you can use CSS selectors to get an exact match.
html = '''<table class="stocksTable" summary="株価詳細">
<tr>
<th class="symbol"><h1>(株)みずほフィナンシャルグループ</h1></th>
<td class="stoksPrice realTimChange">
<div class="realTimChangeMod">
</div>
</td>
<td class="stoksPrice">191.1</td>
<td class="change"><span class="yjSt">前日比</span><span class="icoUpGreen yjMSt">+2.5(+1.33%)</span></td>
</tr>
</table>'''
soup = BeautifulSoup(html, 'lxml')
print(soup.select_one('td[class="stoksPrice"]').text)
# 191.1
Or, you could use lambda and find to get the same.
print(soup.find(lambda t: t.name == 'td' and t['class'] == ['stoksPrice']).text)
# 191.1
Note: BeautifulSoup converts multi-valued class attributes into lists. So, the classes of the two td tags look like ['stoksPrice'] and ['stoksPrice', 'realTimChange'].
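A small sketch of why the exact match matters, using a shortened version of the table above (the Japanese header cell and the inner spans are dropped for brevity). Searching with class_ matches any td that has stoksPrice among its classes, while the attribute selector and the list comparison only match the td whose class is exactly stoksPrice. I use t.get("class") instead of t["class"] so that a td without a class attribute would not raise a KeyError; that is my own tweak.

```python
from bs4 import BeautifulSoup

# Shortened version of the table from the question.
html = '''<table class="stocksTable">
<tr>
<td class="stoksPrice realTimChange"><div class="realTimChangeMod"></div></td>
<td class="stoksPrice">191.1</td>
<td class="change">+2.5(+1.33%)</td>
</tr>
</table>'''

soup = BeautifulSoup(html, "html.parser")

# class_ matches every td that carries the class, alone or with others.
loose = soup.find_all("td", class_="stoksPrice")
class_lists = [td["class"] for td in loose]

# Exact matches: the whole class attribute must be just "stoksPrice".
by_selector = soup.select_one('td[class="stoksPrice"]')
by_lambda = soup.find(lambda t: t.name == "td" and t.get("class") == ["stoksPrice"])

price = float(by_selector.get_text(strip=True))
print(len(loose), class_lists, price)
```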
Here is one way to do it using findAll.
Because the other stoksPrice cells are empty, the only one left is the one with the price.
You can use a try/except clause to check whether the text is a floating-point number.
If it is not, the loop continues; if it is, the value is printed.
res = soup.findAll("td", {"class": "stoksPrice"})
for r in res:
    try:
        t = float(r.text)
        print(t)
    except ValueError:
        # not a number (e.g. an empty cell), so keep looking
        pass
191.1
I am practicing parsing HTML with BeautifulSoup 4.
I ran into a problem using the find function.
Here is my code.
soup1 = BeautifulSoup(a,"html.parser")
tables1 = soup1.find('div', {'id':'auction_container'}).findAll('table')
for table in tables1:
    if '매각기일' in table.get_text():
        clue1 = table.find('td', {'class': 'head_con center'})
        pro_clue1 = clue1.find('span', {'class':'bold'})
        pro_clue2 = clue1.find('span',{'class':'no'})
        clue2 = table.find('tr', {'valign': 'bottom'})
        print(clue2.find('span', {'class': 'num'}))
The variable a holds the page source, which is too long to include here, so I posted the full script on my blog. You can get the script at this link: http://blog.naver.com/khm2963/220983094160
When I execute this code, I get the output below:
None
<span class="num"><span class="f20">2015</span>타경<span class="f20">2321</span></span>
And when I append .get_text() to clue2.find('span', {'class': 'num'}), as in print(clue2.find('span', {'class': 'num'}).get_text()), I get the error below.
Traceback (most recent call last):
File "D:/python code/auction_crawl/test bs4.py", line 5895, in <module>
print(clue2.find('span', {'class': 'num'}).get_text())
AttributeError: 'NoneType' object has no attribute 'get_text'
If I print clue2 without .find('span', {'class': 'num'}),
I get the result below:
<tr valign="bottom">
<td class="head_num left"><img alt="굿옥션로고" height="26"
src="/img/common/top_logo.gif" width="100"><span class="logo_pid"></span>
</img>
</td>
<td class="head_con center">
<span class="bold">서울남부지방법원 본원 8계(02-2192-1338)</span> / 매각기일 :
<span class="bold"><span class="no">2017.04.12(水)</span> <span class="no">
(10:00)</span>
</span></td>
</tr>
<tr valign="bottom">
<td class="head_num bold no left" style="width:190px;padding:10px 0 2px
0;font-size:15px"><span class="num"><span class="f20">2015</span>타경<span
class="f20">2321</span></span></td>
<td class="head_con center" style="padding-bottom:6px"><div>
<span class="ltblue"><img src="/img/icon/point_blue.gif" style="vertical-
align:middle"/></span> <span class="blue bold">서울남부지방법원 본원
</span> <span class="ltblue"><img src="/img/icon/point_blue.gif"
style="vertical-align:middle"/></span> 매각기일 : <span class="blue bold
no">2017.04.12(水) (10:00)</span> <span class="ltblue"><img
src="/img/icon/point_blue.gif" style="vertical-align:middle"/></span> <span
class="blue bold">경매 8계</span>(전화:02-2192-1338)</div>
</td>
</tr>
So I stored the HTML above in a variable d and wrote another piece of code, shown below.
d = ''' HTML code above '''
soup4= BeautifulSoup(d,"html.parser")
clue = soup4.find('span', {'class': 'num'})
print(clue.get_text().strip())
When I run the code above, I get 2015타경2321.
This is what I want.
I want to get 2015타경2321 from the first piece of code. How can I get it?
You can just verify if your clue2.find('span', {'class': 'num'}) has results and if so, print the result:
...
clue2number = clue2.find('span', {'class': 'num'})
if clue2number is not None:
    print (clue2number.get_text(strip=True))
Which outputs:
2015타경2321
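The None check can be tried on its own with a rough sketch like the one below. The markup is a heavily simplified stand-in for the page in the question (two rows with valign="bottom", only the second containing the span with class "num"), so treat the structure as an assumption. find() returns None when nothing matches, which is where the 'NoneType' error comes from; looping over every matching row and skipping the misses avoids it.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the page in the question (assumed structure).
d = '''<table>
<tr valign="bottom"><td class="head_con center"><span class="bold">2017.04.12</span></td></tr>
<tr valign="bottom"><td class="head_num"><span class="num"><span class="f20">2015</span>타경<span class="f20">2321</span></span></td></tr>
</table>'''

soup = BeautifulSoup(d, "html.parser")
table = soup.find("table")

# find() only looks at the first matching row, which has no span.num here.
first_row = table.find("tr", {"valign": "bottom"})
missing = first_row.find("span", {"class": "num"})  # -> None

# Check every matching row and keep only the ones where the span exists.
case_numbers = []
for row in table.find_all("tr", {"valign": "bottom"}):
    span = row.find("span", {"class": "num"})
    if span is not None:
        case_numbers.append(span.get_text(strip=True))

print(missing)
print(case_numbers)
```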
<tr bgcolor="#FFFFFF">
<td class="tablecontent" scope="row" rowspan="1">
ZURICH AMERICAN INSURANCE COMPANY
</td>
<td class="tablecontent" scope="row" rowspan="1">
FARMERS GROUP INC (14523)
</td>
<td class="tablecontent" scope="row">
znaf
</td>
<td class="tablecontent" scope="row">
anhm
</td>
</tr>
I have an HTML document which contains multiple tr tags. I want to extract the href link from the first td and data from third td tag onwards under every tr tag. How can this be achieved?
You can find all tr elements, iterate over them, then do the context-specific searches for the inner td elements and get the first and the third:
for tr in soup.find_all('tr'):
    cells = tr.find_all('td')
    if len(cells) < 3:
        continue  # safety pillow
    link = cells[0].a['href']  # assuming every first td has an "a" element
    data = cells[2].get_text()
    print(link, data)
As a side note, and depending on what you are trying to accomplish, I usually find pandas.read_html() a convenient way to parse HTML tables into dataframes, which are convenient data structures to work with afterwards.
You can use the CSS selector nth-of-type to navigate through the tds.
Here's a sample:
soup = BeautifulSoup(html, 'html.parser')
a = soup.select('td:nth-of-type(1) a')[0]
href = a['href']
td = soup.select("td:nth-of-type(3)")[0]
text = td.get_text(strip=True)
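Putting the per-row idea together, here is a hedged sketch that collects the link from the first cell and the text of every cell from the third onwards, for each row. The href value and the second, link-less row are invented for illustration, since the posted HTML does not show the "a" element. Note that indexing a select() result with [0] only gives the first match in the whole document, so a loop over the rows is needed to cover every tr.

```python
from bs4 import BeautifulSoup

# Assumption: the <a href> and the short second row are made up for illustration.
html = """<table>
<tr bgcolor="#FFFFFF">
<td class="tablecontent"><a href="/company/1">ZURICH AMERICAN INSURANCE COMPANY</a></td>
<td class="tablecontent">FARMERS GROUP INC (14523)</td>
<td class="tablecontent">znaf</td>
<td class="tablecontent">anhm</td>
</tr>
<tr><td>row without enough cells</td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")

records = []
for tr in soup.find_all("tr"):
    cells = tr.find_all("td")
    if len(cells) < 3:
        continue  # not a data row
    link = cells[0].find("a")
    href = link.get("href") if link is not None else None  # tolerate a missing link
    data = [td.get_text(strip=True) for td in cells[2:]]   # third td onwards
    records.append((href, data))

print(records)
```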
This is the Web page Source code which I am scraping using Beautiful Soup.
<tr>
<td>
1
</td>
<td style="cipher1">
<img class="cipher2" src="http://cipher3.png" alt="cipher4" title="cipher5" />
<span class="cipher8">t</span>cipher9
</td>
<td>
112
</td>
<td>
3510
</td>
// Pattern Repeated
<tr >
<td>
2
</td>
<td style="cipher1">
I wrote some code using BeautifulSoup but I am getting more results than I want due to multiple occurrences of the pattern.
I have used
row1 = soup.find_all('a' ,class_ = "cipher7" )
for row in row1:
    f.write( row['title'] + "\n")
But with this I get too many results, since 'cipher7' occurs multiple times in the web page.
So I would like to use this
<td style="cipher1">...
since it is unique to the things I want.
So, how do I modify my code to do this?
You can use a convenient select method which takes a CSS selector as an argument:
row = soup.select("td[style=cipher1] > a.cipher7")
You can first find the td tag (since you said it is unique) and then find the specified a tag inside it.
all_as = []
rows = soup.find_all('td', {'style':'cipher1'})
for row in rows:
    # extend keeps all_as a flat list of tags instead of a list of result sets
    all_as.extend(row.find_all('a', class_ = "cipher7"))
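Here is a small sketch of the selector approach on made-up markup. The "a" elements with class "cipher7" and their titles are assumptions, since the posted source only shows the img and span. The child combinator > requires the "a" to be a direct child of the td; if it is nested deeper, a plain space (descendant combinator) would be needed instead.

```python
from bs4 import BeautifulSoup

# Assumption: the a.cipher7 elements and their titles are invented for illustration.
html = """<table>
<tr><td>1</td><td style="cipher1"><a class="cipher7" title="wanted" href="#">x</a></td></tr>
<tr><td>2</td><td style="other"><a class="cipher7" title="unwanted" href="#">y</a></td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")

# Every a.cipher7 on the page, which is the over-matching case from the question.
all_titles = [a["title"] for a in soup.find_all("a", class_="cipher7")]

# Only a.cipher7 elements that sit directly inside td[style="cipher1"].
titles = [a["title"] for a in soup.select('td[style="cipher1"] > a.cipher7')]

print(all_titles)
print(titles)
```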