Can't extract tables from HTML code - python

I am trying to parse the HTML table given below (it is a section of the complete HTML code), but the code is not working. Can someone please help me? There is an error saying "table has no attribute findall".
The code is:
import re
import HTMLParser
from urllib2 import urlopen
import urllib2
from bs4 import BeautifulSoup
url = 'http://164.100.47.132/LssNew/Members/Biography.aspx?mpsno=4064'
url_data = urlopen(url).read()
html_page = urllib2.urlopen(url)
soup = BeautifulSoup(html_page)
title = soup.title
final_tit = title.string
table = soup.find('table',id = "ctl00_ContPlaceHolderMain_Bioprofile1_Datagrid1")
tr = table.findall('tr')
for tr in table:
    cols = tr.findAll('td')
    for td in cols:
        text = ''.join(td.find(text=True))
        print text+"|",
    print
<table style="WIDTH: 565px">
<tr>
<td vAlign="top" align="left"><img id="ctl00_ContPlaceHolderMain_Bioprofile1_Image1" src="http://164.100.47.132/mpimage/photo/4064.jpg" style="height:140px;border-width:0px;" /></td>
<td vAlign="top"><table cellspacing="0" rules="all" border="2" id="ctl00_ContPlaceHolderMain_Bioprofile1_Datagrid1" style="border-color:#FAE3C3;border-width:2px;border-style:Solid;width:433px;border-collapse:collapse;">
<tr>
<td>
<table align="center" height="30px">
<tr valign="top">
<td align="center" valign="top" class="gridheader1">Aaroon Rasheed,Shri J.M.</td>
</tr>
</table>
<table height="110px">
<tr>
<td align="left" class="darkerb" width="133px" valign="top">Constituency :</td>
<td align="left" valign="top" class="griditem2" width="300px">Theni (Tamil Nadu )</td>
</tr>
<tr>
<td align="left" width="133px" class="darkerb" valign="top">
Party Name :</td>
<td align="left" width="300px" valign="top" class="griditem2">Indian National Congress(INC)</td>
</tr>
<tr>
<td align="left" class="darkerb" valign="top" width="133px">
Email Address :
</td>
<td align="left" valign="top" class="griditem2" width="300px">jm.aaronrasheed#sansad.nic.in</td>
</tr>
</table>
</td>
</tr>
</table></td>
</tr>
</table>

The method is called find_all(), not findall:
tr = table.find_all('tr')
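For reference, a minimal Python 3 sketch of the corrected loop. It runs against a trimmed copy of the table from the question rather than the live URL, so the `html_doc` string and the `rows` name are assumptions made for illustration, and the real page may need extra handling:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (assumed to match the live page's shape).
html_doc = """
<table id="ctl00_ContPlaceHolderMain_Bioprofile1_Datagrid1">
<tr><td>
<table><tr><td class="gridheader1">Aaroon Rasheed,Shri J.M.</td></tr></table>
<table>
<tr><td class="darkerb">Constituency :</td><td class="griditem2">Theni (Tamil Nadu )</td></tr>
<tr><td class="darkerb">
Party Name :</td><td class="griditem2">Indian National Congress(INC)</td></tr>
</table>
</td></tr>
</table>
"""

soup = BeautifulSoup(html_doc, "html.parser")
table = soup.find("table", id="ctl00_ContPlaceHolderMain_Bioprofile1_Datagrid1")

rows = []
for tr in table.find_all("tr"):  # find_all, not findall
    # recursive=False keeps only this row's own cells, not cells of nested tables
    cells = [td.get_text(strip=True) for td in tr.find_all("td", recursive=False)]
    # keep only label/value rows; wrapper and heading rows have a single cell
    if len(cells) == 2:
        rows.append(cells)

for label, value in rows:
    print(label + " | " + value)
```

Filtering on two-cell rows is a choice made for this particular layout, where the outer table only wraps the nested ones.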

Related

how to skip the first table, and skip the second table head during parsing a local html file in python?

I am trying to parse a local HTML file, and I don't know why the same code gives different results for the sample HTML text and for the whole HTML file. Can anyone help? I would really appreciate it.
The sample html text:
s = '''
<table width=90%>
<tr>
<td align="center" width=18%></td>
<td align="left" width=15%></td>
</tr>
</table>
<table border>
<tr>
<td nowrap="nowrap"><b>Rec</b></td>
<td align="RIGHT" nowrap="nowrap"><b>ID</b></td>
</tr>
<tr>
<td align="RIGHT" nowrap="nowrap" VALIGN=TOP>1</td>
<td nowrap="nowrap" VALIGN=TOP>ID<br />100100</td>
</tr>
</table>
<p>
<style type="text/css">
.....
</style>
<table border>
<tr>
<td nowrap="nowrap"><b>Rec</b></td>
<td align="RIGHT" nowrap="nowrap"><b>ID</b></td>
</tr>
<tr>
<td align="RIGHT" nowrap="nowrap" VALIGN=TOP>2</td>
<td nowrap="nowrap" VALIGN=TOP>ID<br />101101</td>
</tr>
</table>
'''
I have tried the following:
# with open('myfile.html', 'r', encoding='utf-8') as f: # when use the whole file
# s = f.read() # when use the whole file
soup = BeautifulSoup(s, "html.parser")
tables = [
[
[td.get_text(strip=True) for td in tr.find_all('td')]
for tr in table.find_all('tr')
]
for table in soup.find_all('table')
]
table_data = [i.text for i in soup.find_all('td')]
print(table_data)
expected output:
Rec ID
1 ID100100
2 ID101101
the current output is:
['', '', 'Rec', 'ID', '1', 'ID100100', 'Rec', 'ID', '2', 'ID101101']
Also, when I ran the same code on the whole HTML file, the result included something like the following. Did I miss something here?
'', '</tr>', '', '</table>', '', '</table>', '', '</center>', '', '<hr />', '', '<center>', '',

You can apply list slicing
from bs4 import BeautifulSoup
s = '''
<table width=90%>
<tr>
<td align="center" width=18%></td>
<td align="left" width=15%></td>
</tr>
</table>
<table border>
<tr>
<td nowrap="nowrap"><b>Rec</b></td>
<td align="RIGHT" nowrap="nowrap"><b>ID</b></td>
</tr>
<tr>
<td align="RIGHT" nowrap="nowrap" VALIGN=TOP>1</td>
<td nowrap="nowrap" VALIGN=TOP>ID<br />100100</td>
</tr>
</table>
<p>
<style type="text/css">
.....
</style>
<table border>
<tr>
<td nowrap="nowrap"><b>Rec</b></td>
<td align="RIGHT" nowrap="nowrap"><b>ID</b></td>
</tr>
<tr>
<td align="RIGHT" nowrap="nowrap" VALIGN=TOP>2</td>
<td nowrap="nowrap" VALIGN=TOP>ID<br />101101</td>
</tr>
</table>
'''
soup = BeautifulSoup(s, "html.parser")
table = soup.find_all('table')[2]
#print(len(table))
data=[]
table_data = [i.text for i in soup.find_all('td')]
rec=table_data[-3]
num_1= table_data[-5]
num_2= table_data[-1]
data.append([rec,num_1,num_2])
print(data)
Output:
[['ID', 'ID100100', 'ID101101']]
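A different way to reach the expected output from the question is to slice at the table and row level instead of indexing into the flat cell list. The sketch below assumes the first table is always a layout table and that every remaining table repeats the same header row. The `row_cells` helper and the `header` and `records` names are made up for illustration, and the sample markup is trimmed (the `<p>` and `<style>` elements are omitted):

```python
from bs4 import BeautifulSoup

s = """
<table width=90%>
<tr><td align="center" width=18%></td><td align="left" width=15%></td></tr>
</table>
<table border>
<tr><td nowrap="nowrap"><b>Rec</b></td><td align="RIGHT" nowrap="nowrap"><b>ID</b></td></tr>
<tr><td align="RIGHT" nowrap="nowrap" VALIGN=TOP>1</td><td nowrap="nowrap" VALIGN=TOP>ID<br />100100</td></tr>
</table>
<table border>
<tr><td nowrap="nowrap"><b>Rec</b></td><td align="RIGHT" nowrap="nowrap"><b>ID</b></td></tr>
<tr><td align="RIGHT" nowrap="nowrap" VALIGN=TOP>2</td><td nowrap="nowrap" VALIGN=TOP>ID<br />101101</td></tr>
</table>
"""

soup = BeautifulSoup(s, "html.parser")
data_tables = soup.find_all("table")[1:]  # drop the first (layout) table


def row_cells(tr):
    """Return the stripped text of every cell in one row (hypothetical helper)."""
    return [td.get_text(strip=True) for td in tr.find_all("td")]


header = row_cells(data_tables[0].find("tr"))  # take the header only once
records = [
    row_cells(tr)
    for table in data_tables
    for tr in table.find_all("tr")[1:]  # skip each table's header row
]

print(" ".join(header))
for rec in records:
    print(" ".join(rec))
```

If the full file nests tables inside cells, `find_all` would also return the inner tables, so the slicing would need adjusting.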

How to beautifulsoup in this case without class or id

How do I get the text 'Wow, you get it!'? I can print the date, but I can't get the td that comes right after it.
<table border="0" cellpadding="4" cellspacing="1" width="100%">
<tr bgcolor="#505050">
<td class="white" colspan="2">
<b>
Account Here
</b>
</td>
</tr>
<tr bgcolor="#F1E0C6">
<td colspan="2">
There is nothing
</td>
</tr>
</table>
<br/>
<br/>
<table border="0" cellpadding="4" cellspacing="1" width="100%">
<tr bgcolor="#505050">
<td class="white" colspan="2">
<b>
Death
</b>
</td>
</tr>
<tr bgcolor="#F1E0C6">
<td valign="top" width="25%">
Aug 15 2021, 18:36:22 CEST
</td>
<td>
Wow, you get it!
</td>
</tr>
<tr bgcolor="#D4C0A1">
<td valign="top" width="25%">
Aug 01 2021, 21:25:39 CEST
</td>
<td>
Next Time
</td>
</tr>
</table>
I got the date with this code:
print(soup.find_all('td', {'valign': 'top'})[0].get_text())
which shows this:
Aug 15 2021, 18:36:22 CEST
but I can't find any way to get the td that follows the date.

If html_doc contains the HTML snippet from the question:
soup = BeautifulSoup(html_doc, "html.parser")
txt = soup.select_one('td[valign="top"] + td').get_text(strip=True)
print(txt)
Prints:
Wow, you get it!
Or:
txt = soup.find("td", {"valign": "top"}).find_next("td").get_text(strip=True)
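If every date/message pair is needed rather than just the first, the same idea can be looped. This is a sketch, assuming each date cell carries `valign="top"` and the message is the next `td` in the same row. The `entries` name is made up for illustration, and the markup is trimmed from the question:

```python
from bs4 import BeautifulSoup

html_doc = """
<table>
<tr bgcolor="#505050"><td class="white" colspan="2"><b>Death</b></td></tr>
<tr bgcolor="#F1E0C6">
<td valign="top" width="25%">Aug 15 2021, 18:36:22 CEST</td>
<td>Wow, you get it!</td>
</tr>
<tr bgcolor="#D4C0A1">
<td valign="top" width="25%">Aug 01 2021, 21:25:39 CEST</td>
<td>Next Time</td>
</tr>
</table>
"""

soup = BeautifulSoup(html_doc, "html.parser")

entries = []
for date_td in soup.find_all("td", {"valign": "top"}):
    # find_next_sibling stays inside the same row, unlike find_next
    message_td = date_td.find_next_sibling("td")
    if message_td is not None:
        entries.append(
            (date_td.get_text(strip=True), message_td.get_text(strip=True))
        )

for date, message in entries:
    print(date, "->", message)
```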

How to read the data from HTML file and write the data to CSV file using python?

I have an .html report file that contains data in tables, along with pass/fail criteria. I want to write this data to a .csv file using Python 3.
Please suggest how to proceed.
For example, the data looks like this:
<h2>Sequence Evaluation of Entire Project <em class="contentlink">[Contents]</em> </h2>
<table width="100%" class="coverage">
<tr class="nohover">
<td colspan="8" class="tableabove">Test Sequence State</td>
</tr>
<tr>
<th colspan="2" style="white-space:nowrap;">Metric</th>
<th colspan="2">Percentage</th>
<th>Target</th>
<th>Total</th>
<th>Reached</th>
<th>Unreached</th>
</tr>
<tr>
<td colspan="2">Test Sequence Work Progress</td>
<td>100.0%</td>
<td>
<table class="metricbar">
<tr class="borderX">
<td class="white"></td>
<td class="target"></td>
<td class="white" colspan="2"></td>
</tr>
<tr>
<td class="covreached" width="99%"></td>
<td class="target" width="1%"></td>
<td class="covreached" width="0%"></td>
<td class="covnotreached" width="0%"></td>
</tr>
<tr class="borderX">
<td class="white"></td>
<td class="target"></td>
<td class="white" colspan="2"></td>
</tr>
</table>
</td>
<td>100%</td>
<td>24</td>
<td>-</td>
<td>-</td>
</tr>
<tr>

Assuming you know the header and really only need the associated percentage, with bs4 4.7.1 you can use :contains to target the header cell and then take the next td (newer Soup Sieve releases deprecate :contains in favour of :-soup-contains). You would read your HTML from the file into the html variable shown.
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
html = '''
<h2>Sequence Evaluation of Entire Project <em class="contentlink">[Contents]</em> </h2>
<table width="100%" class="coverage">
<tr class="nohover">
<td colspan="8" class="tableabove">Test Sequence State</td>
</tr>
<tr>
<th colspan="2" style="white-space:nowrap;">Metric</th>
<th colspan="2">Percentage</th>
<th>Target</th>
<th>Total</th>
<th>Reached</th>
<th>Unreached</th>
</tr>
<tr>
<td colspan="2">Test Sequence Work Progress</td>
<td>100.0%</td>
<td>
<table class="metricbar">
<tr class="borderX">
<td class="white"></td>
<td class="target"></td>
<td class="white" colspan="2"></td>
</tr>
<tr>
<td class="covreached" width="99%"></td>
<td class="target" width="1%"></td>
<td class="covreached" width="0%"></td>
<td class="covnotreached" width="0%"></td>
</tr>
<tr class="borderX">
<td class="white"></td>
<td class="target"></td>
<td class="white" colspan="2"></td>
</tr>
</table>
</td>
<td>100%</td>
<td>24</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
'''
soup = bs(html, 'lxml') # 'html.parser' if lxml not installed
header = 'Test Sequence Work Progress'
result = soup.select_one('td:contains("' + header + '") + td').text
df = pd.DataFrame([result], columns = [header])
print(df)
df.to_csv(r'C:\Users\User\Desktop\data.csv', sep=',', encoding='utf-8-sig',index = False )

from bs4 import BeautifulSoup

path = "my.html"  # add the path of your local file here
with open(path) as f:
    soup = BeautifulSoup(f, 'html.parser')

# Write the text of every matching tag on its own line
with open('out.csv', 'w', encoding='utf-8') as out:
    for link in soup.find_all('p'):  # add the tag which you want to extract
        a = link.get_text()
        out.write(a)
        out.write('\n')
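To keep the table's row and column structure in the CSV, the cells can be collected row by row and handed to the standard `csv` module. The sketch below is only an illustration under a few assumptions: the markup is a trimmed version of the report table from the question, the nested `metricbar` table is a purely visual bar that can be left as an empty cell, and the `rows` and `buffer` names are invented here. It writes to an in-memory buffer; swapping that for `open("report.csv", "w", newline="")` (a hypothetical file name) would write to disk instead.

```python
import csv
import io

from bs4 import BeautifulSoup

html = """
<table width="100%" class="coverage">
<tr class="nohover"><td colspan="8" class="tableabove">Test Sequence State</td></tr>
<tr>
<th colspan="2">Metric</th><th colspan="2">Percentage</th>
<th>Target</th><th>Total</th><th>Reached</th><th>Unreached</th>
</tr>
<tr>
<td colspan="2">Test Sequence Work Progress</td>
<td>100.0%</td>
<td><table class="metricbar"><tr><td class="covreached" width="99%"></td></tr></table></td>
<td>100%</td><td>24</td><td>-</td><td>-</td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="coverage")

rows = []
# recursive=False skips the rows and cells of the nested metricbar table
for tr in table.find_all("tr", recursive=False):
    cells = tr.find_all(["th", "td"], recursive=False)
    rows.append([cell.get_text(strip=True) for cell in cells])

buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(rows)
print(buffer.getvalue())
```

Because of the `colspan` attributes, the header row has fewer cells than the data row, so the columns may need realigning before further processing.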

Using BeautifulSoup to pick up text in table, on webpages

I want to use BeautifulSoup to pick up the ‘Model Type’ values on a company’s webpages, which are generated by code like that shown below.
It forms 2 tables shown on the webpage, side by side.
Updated source code of the webpage:
<TR class=tableheader>
<TD width="12%"> </TD>
<TD style="TEXT-ALIGN: left" width="12%">Group </TD>
<TD style="TEXT-ALIGN: left" width="15%">Model Type </TD>
<TD style="TEXT-ALIGN: left" width="15%">Design Year </TD></TR>
<TR class=row1>
<TD width="10%"> </TD>
<TD class=row1>South West</TD>
<TD>VIP QB662FG (Registered) </TD>
<TD>2013 (Registered) </TD></TR></TBODY></TABLE></TD></TR>
I am using the following; however, it doesn’t return the ‘VIP QB662FG’ value I want:
from bs4 import BeautifulSoup
import urllib2
url = "http://www.thewebpage.com"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
find_it = soup.find_all(text=re.compile("Model Type "))
the_value = find_it[0].findNext('td').contents[0]
print the_value
How can I get it? I'm using Python 2.7.

You are looking for the next row, then the next cell in the same position. The latter is tricky; we could assume it is always the 3rd column:
header_text = soup.find(text=re.compile("Model Type "))
value = header_text.find_next('tr').select('td:nth-of-type(3)')[0].get_text()
If you just ask for the next td, you get the Design Year column instead.
There could well be better methods to get to your one cell; if we assume there is only one tr row with the class row1, for example, the following would get your value in one step:
value = soup.select('tr.row1 td:nth-of-type(3)')[0].get_text()
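If the column position should not be hard-coded, one option is to look up the index of the 'Model Type' header cell and read the cell at the same index in the following row. This is a sketch, assuming the header row has the class `tableheader` and the value row is its next sibling, as in the markup from the question. The `headers`, `column` and `value_row` names are made up for illustration:

```python
from bs4 import BeautifulSoup

html = """
<TABLE class=tableforms><TBODY>
<TR class=tableheader>
<TD width="12%"> </TD>
<TD style="TEXT-ALIGN: left" width="12%">Group </TD>
<TD style="TEXT-ALIGN: left" width="15%">Model Type </TD>
<TD style="TEXT-ALIGN: left" width="15%">Design Year </TD></TR>
<TR class=row1>
<TD width="10%"> </TD>
<TD class=row1>South West</TD>
<TD>VIP QB662FG (Registered) </TD>
<TD>2013 (Registered) </TD></TR></TBODY></TABLE>
"""

soup = BeautifulSoup(html, "html.parser")

# Locate the header row and work out which column holds "Model Type"
header_row = soup.find("tr", class_="tableheader")
headers = [td.get_text(strip=True) for td in header_row.find_all("td")]
column = headers.index("Model Type")

# Read the cell in the same position from the row that follows the header
value_row = header_row.find_next_sibling("tr")
value = value_row.find_all("td")[column].get_text(strip=True)
print(value)
```

On the full page there is an extra `tableheader` row ("Remarks") before this one, so the header row would have to be chosen more carefully there, for example by checking which row contains the 'Model Type' text.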

Find all the tr's and output the third child of each, unless it is the first row:
import bs4
data = """
<TR class=tableheader>
<TD width="12%"> </TD>
<TD style="TEXT-ALIGN: left" width="12%">Group </TD>
<TD style="TEXT-ALIGN: left" width="15%">Model Type </TD>
<TD style="TEXT-ALIGN: left" width="15%">Design Year </TD></TR>
<TR class=row1>
<TD width="10%"> </TD>
<TD class=row1>South West</TD>
<TD>VIP QB662FG (Registered) </TD>
<TD>2013 (Registered) </TD>
"""
soup = bs4.BeautifulSoup(data, 'html.parser')
# The sample above has no enclosing <table>, so iterate over the rows directly
for i, tr in enumerate(soup.find_all('tr')):
    if i > 0:  # skip the header row
        for idx, td in enumerate(tr.find_all('td')):
            if idx == 2:  # the third cell holds the model type
                print(td.get_text().replace('(Registered)', '').strip())

I think you can do as follows :
from bs4 import BeautifulSoup
html = """<TD colSpan=3>Desinger </TD></TR>
<TR>
<TD class=row2bold width="5%"> </TD>
<TD class=row2bold width="30%" align=left>Gender </TD>
<TD class=row1 width="20%" align=left>Male </TD></TR>
<TR>
<TD class=row2bold width="5%"> </TD>
<TD class=row2bold width="30%" align=left>Born Country </TD>
<TD class=row1 width="20%" align=left>DE </TD></TR></TBODY></TABLE></TD>
<TD height="100%" vAlign=top>
<TABLE class=tableforms>
<TBODY>
<TR class=tableheader>
<TD colSpan=4>Remarks </TD></TR>
<TR class=tableheader>
<TD width="12%"> </TD>
<TD style="TEXT-ALIGN: left" width="12%">Group </TD>
<TD style="TEXT-ALIGN: left" width="15%">Model Type </TD>
<TD style="TEXT-ALIGN: left" width="15%">Design Year </TD></TR>
<TR class=row1>
<TD width="10%"> </TD>
<TD class=row1>South West</TD>
<TD>VIP QB662FG (Registered) </TD>
<TD>2013 (Registered) </TD></TR></TBODY></TABLE></TD></TR>"""
soup = BeautifulSoup(html, "html.parser")
soup = soup.find('table',{'class':'tableforms'})
dico = {}
l1 = soup.findAll('tr')[1].findAll('td')
l2 = soup.findAll('tr')[2].findAll('td')
for i in range(len(l1)):
    dico[l1[i].getText().strip()] = l2[i].getText().replace('(Registered)','').strip()
print dico['Model Type']
It prints : u'VIP QB662FG'

Extract a row in html file without the html tags

I need to extract a row containing a specific string, but the following code returns the HTML tags along with it.
from BeautifulSoup import BeautifulSoup
import re
import os
import codecs
import sys
get_company = "ABB LTD"
OUTFILE = os.path.join('company', 'a', 'viewids')
soup = BeautifulSoup(open("/company/a/searches/a"))
rows = soup.findAll("table",{"id":"cos"})[0].findAll('tr')
userrows = [t for t in rows if t.findAll(text=re.compile(get_company))]
print userrows
This is my table format:
<table id="cos" width="500" cellpadding="3" cellspacing="0" border="1">
<tr>
<th>Company Name</th>
<th>CIK Number</th>
<th>SIC Code</th>
</tr>
<tr valign="top">
<td>A CONSULTING TEAM INC</td>
<td align="right">1040792</td>
<td align="right">7380</td>
</tr>
<tr valign="top">
<td>A J&J PHARMA CORP</td>
<td align="right">1140452</td>
<td align="right">9995</td>
</tr>
</table>
So if I need A J&J PHARMA CORP's CIK number, how do I get it? Right now it gives me output like this:
[<tr valign="top">
<td>A J&J PHARMA CORP</td>
<td align="right">1140452</td>
<td align="right">9995</td>
</tr>]

import re
from BeautifulSoup import BeautifulSoup
html= '''
<table id="cos" width="500" cellpadding="3" cellspacing="0" border="1">
<tr>
<th>Company Name</th>
<th>CIK Number</th>
<th>SIC Code</th>
</tr>
<tr valign="top">
<td>A CONSULTING TEAM INC</td>
<td align="right">1040792</td>
<td align="right">7380</td>
</tr>
<tr valign="top">
<td>A J&J PHARMA CORP</td>
<td align="right">1140452</td>
<td align="right">9995</td>
</tr>
</table>
'''
soup = BeautifulSoup(html)
table = soup.find("table", {"id":"cos"})
td = table.find('td', text='A J&J PHARMA CORP')
# ^ This returns a text node, not the td.
print(td.parent.parent.findAll('td')[1].string)
prints
1140452
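The answer above uses the old BeautifulSoup 3 import. With bs4 on Python 3, a comparable lookup might look like the sketch below. The `cik_for` helper is hypothetical, and the ampersand is written as `&amp;` here (an assumption, so the markup is valid HTML) instead of the bare `&` in the question:

```python
from bs4 import BeautifulSoup

html = """
<table id="cos" width="500" cellpadding="3" cellspacing="0" border="1">
<tr><th>Company Name</th><th>CIK Number</th><th>SIC Code</th></tr>
<tr valign="top"><td>A CONSULTING TEAM INC</td><td align="right">1040792</td><td align="right">7380</td></tr>
<tr valign="top"><td>A J&amp;J PHARMA CORP</td><td align="right">1140452</td><td align="right">9995</td></tr>
</table>
"""


def cik_for(soup, company):
    """Return the CIK number for an exact company name, or None (hypothetical helper)."""
    table = soup.find("table", id="cos")
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        # the header row has only <th> cells, so its list of <td> texts is empty
        if cells and cells[0] == company:
            return cells[1]
    return None


soup = BeautifulSoup(html, "html.parser")
print(cik_for(soup, "A J&J PHARMA CORP"))
```

Comparing plain cell text avoids the text-node versus tag distinction noted in the comment above, at the cost of requiring an exact name match.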
