Python parsing tables with BeautifulSoup

HTML page structure:
<table>
<tbody>
<tr>
<th>Timestamp</th>
<th>Call</th>
<th>MHz</th>
<th>SNR</th>
<th>Drift</th>
<th>Grid</th>
<th>Pwr</th>
<th>Reporter</th>
<th>RGrid</th>
<th>km</th>
<th>az</th>
</tr>
<tr>
<td align="right"> 2019-12-10 14:02 </td>
<td align="left"> DL1DUZ </td>
<td align="right"> 10.140271 </td>
<td align="right"> -26 </td>
<td align="right"> 0 </td>
<td align="left"> JO61tb </td>
<td align="right"> 0.2 </td>
<td align="left"> F4DWV </td>
<td align="left"> IN98bc </td>
<td align="right"> 1162 </td>
<td align="right"> 260 </td>
</tr>
<tr>
<td align="right"> 2019-10-10 14:02 </td>
<td align="left"> DL23UH </td>
<td align="right"> 11.0021 </td>
<td align="right"> -20 </td>
<td align="right"> 0 </td>
<td align="left"> JO61tb </td>
<td align="right"> 0.2 </td>
<td align="left"> F4DWV </td>
<td align="left"> IN98bc </td>
<td align="right"> 1162 </td>
<td align="right"> 260 </td>
</tr>
</tbody>
</table>
...and so on with more tr/td rows.
My code:
from bs4 import BeautifulSoup as bs
import requests
import csv

base_url = 'some_url'
session = requests.Session()
request = session.get(base_url)
val_th = []
val_td = []
if request.status_code == 200:
    soup = bs(request.content, 'html.parser')
    table = soup.findChildren('table')
    tr = soup.findChildren('tr')
    my_table = table[0]
    my_tr_th = tr[0]
    my_tr_td = tr[1]
    rows = my_table.findChildren('tr')
    row_th = my_tr_th.findChildren('th')
    row_td = my_tr_td.findChildren('td')
    for r_th in row_th:
        heading = r_th.text
        val_th.append(heading)
    for r_td in row_td:
        data = r_td.text
        val_td.append(data)
    with open('output.csv', 'w') as f:
        a_pen = csv.writer(f)
        a_pen.writerow(val_th)
        a_pen.writerow(val_td)
1) This only writes one row of td values. How can I make sure all the td rows on the page end up in the CSV?
2) There are many td tags on the page.
3) If I write my_tr_td = tr[1:50] instead of my_tr_td = tr[1], it fails.
How do I write the data from all tr/td rows to a CSV file?
Thanks in advance.

Let's try it this way:
import lxml.html
import csv
import requests

url = "http://wsprnet.org/drupal/wsprnet/spots"
res = requests.get(url)
doc = lxml.html.fromstring(res.text)

cols = []
# first, we need to extract the column headers, stuck all the way at the top,
# with the first one in a particular location and format
cols.append(doc.xpath('//table/tr/node()/text()')[0])
for item in doc.xpath('//table/tr/th'):
    typ = str(type(item.getnext()))
    if not 'NoneType' in typ:
        cols.append(item.getnext().text)

# now for the actual data
inf = []
for item in doc.xpath('//table//tr//td'):
    inf.append(item.text.replace('\xa0', '').strip())  # text info needs to be cleaned (non-breaking spaces)

# this will take all the data and split it into rows for each column
rows = [inf[x:x+len(cols)] for x in range(0, len(inf), len(cols))]

# finally, write to file:
with open("output.csv", "w", newline='') as f:
    writer = csv.writer(f)
    writer.writerow(cols)
    for l in rows:
        writer.writerow(l)
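For reference, the same result can be had with BeautifulSoup itself by looping over every tr of the table instead of only tr[1]. A minimal sketch, assuming the header cells are ordinary th elements as in the HTML shown above; each data row becomes one CSV line, so every spot on the page lands in output.csv:

from bs4 import BeautifulSoup as bs
import requests
import csv

url = "http://wsprnet.org/drupal/wsprnet/spots"
resp = requests.get(url)

soup = bs(resp.content, 'html.parser')
table = soup.find('table')  # first table on the page

with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for tr in table.find_all('tr'):
        ths = tr.find_all('th')
        if ths:   # header row
            writer.writerow([th.get_text(strip=True) for th in ths])
        else:     # data row
            writer.writerow([td.get_text(strip=True) for td in tr.find_all('td')])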

Related

Can't scrape some data out of a table in a customized way

I'm trying to parse tabular content out of some HTML elements and arrange it in a customized manner so that I can write it accordingly to a csv file later.
The table looks almost exactly like this.
The HTML elements are like this (truncated):
<tr>
<td align="center" colspan="4" class="header">ATLANTIC</td>
</tr>
<tr>
<td class="black10bold">Facility</td>
<td class="black10bold">Type</td>
<td class="black10bold">Funding</td>
</tr>
<tr>
<td style="width: 55%">
Complete Care at Linwood, LLC
</td>
</tr>
<tr>
<td style="width: 55%">
The Health Center At Galloway
</td>
</tr>
<tr>
<td align="center" colspan="4" class="header">BERGEN</td>
</tr>
<tr>
<td class="black10bold">Facility</td>
<td class="black10bold">Type</td>
<td class="black10bold">Funding</td>
</tr>
<tr>
<td style="width: 55%">
The Actors Fund Homes
</td>
</tr>
<tr>
<td style="width: 55%">
Actors Fund Home, The
</td>
</tr>
I've tried so far:
for item in soup.select("tr"):
    try:
        header = item.select_one("td.header").text
    except AttributeError:
        header = ""
    try:
        item_name = item.select_one("td > a").text
    except AttributeError:
        item_name = ""
    print(item_name, header)
Output it produces:
ATLANTIC
Complete Care at Linwood, LLC
The Health Center At Galloway
BERGEN
The Actors' Fund Homes
Actors Fund Home, The
Output I would like to have:
Complete Care at Linwood, LLC ATLANTIC
The Health Center At Galloway ATLANTIC
The Actors' Fund Homes BERGEN
Actors Fund Home, The BERGEN
This should produce the output the way you want it:
for item in soup.select("tr"):
    if item.select_one("td.header"):
        header = item.select_one("td.header").text
    elif item.select_one("td > a"):
        item_name = item.select_one("td > a").text
        print(item_name, header)
Hope it helps you.
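If the pairs are meant to end up in a CSV file (as the question mentions), the same loop can feed a writer directly. A minimal sketch, assuming soup is already built from the page and the facility names really sit inside td > a as the question's own code implies; the column names here are placeholders:

import csv

rows = []
header = ""
for item in soup.select("tr"):
    if item.select_one("td.header"):
        header = item.select_one("td.header").get_text(strip=True)
    elif item.select_one("td > a"):
        item_name = item.select_one("td > a").get_text(strip=True)
        rows.append([item_name, header])

with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Facility', 'Header'])  # placeholder column names
    writer.writerows(rows)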
import os
import csv
from bs4 import BeautifulSoup
html = """<tr>
<td align="center" colspan="4" class="header">ATLANTIC</td>
</tr>
<tr>
<td class="black10bold">Facility</td>
<td class="black10bold">Type</td>
<td class="black10bold">Funding</td>
</tr>
<tr>
<td style="width: 55%">
Complete Care at Linwood, LLC
</td>
</tr>
<tr>
<td style="width: 55%">
The Health Center At Galloway
</td>
</tr>
<tr>
<td align="center" colspan="4" class="header">BERGEN</td>
</tr>
<tr>
<td class="black10bold">Facility</td>
<td class="black10bold">Type</td>
<td class="black10bold">Funding</td>
</tr>
<tr>
<td style="width: 55%">
The Actors Fund Homes
</td>
</tr>
<tr>
<td style="width: 55%">
Actors Fund Home, The
</td>
</tr>"""
soup = BeautifulSoup(html, 'lxml')
output_rows = []
for table_row in soup.findAll('tr'):
    columns = table_row.findAll('td')
    output_row = []
    for column in columns:
        output_row.append(column.text)
    output_rows.append(output_row)
print(output_rows)

with open('output.csv', 'w') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(output_rows)

BeautifulSoup4 extract multiple data from TD tags within TR

Using BeautifulSoup 4 to scrape an HTML table.
I want to display the values from one of the table rows and remove any empty td fields.
The rows in the source being scraped share the same classes, so those can't be used to tell them apart.
So is there any way to pull the data from just one row, using data-name="Georgia" in the HTML source below?
Using: beautifulsoup4
Current code:
import bs4 as bs
from urllib.request import FancyURLopener

class MyOpener(FancyURLopener):
    version = 'My new User-Agent'  # Set this to a string you want for your user agent

myopener = MyOpener()
sauce = myopener.open('')
soup = bs.BeautifulSoup(sauce, 'lxml')
#table = soupe.table
table = soup.find('table')
table_rows = table.find_all_next('tr')
for tr in table_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    print(row)
HTML SOURCE
<tr>
<td class="text--gray">
<span class="save-button" data-status="unselected" data-type="country" data-name="Kazakhstan">★</span>
Kazakhstan
</td>
<td class="text--green">
81
</td>
<td class="text--green">
9
</td>
<td class="text--green">
12.5
</td>
<td class="text--red">
0
</td>
<td class="text--red">
0
</td>
<td class="text--red">
0
</td>
<td class="text--blue">
0
</td>
<td class="text--yellow">
0
</td>
</tr>
<tr>
<td class="text--gray">
<span class="save-button" data-status="unselected" data-type="country" data-name="Georgia">★</span>
Georgia
</td>
<td class="text--green">
75
</td>
<td class="text--green">
0
</td>
<td class="text--green">
0
</td>
<td class="text--red">
0
</td>
<td class="text--red">
0
</td>
<td class="text--red">
0
</td>
<td class="text--blue">
10
</td>
<td class="text--yellow">
1
</td>
</tr>
Are you talking about something like:
tr.find_all('td',{'data-name' : True})
That should find any td that has a data-name attribute. I could be reading your question all wrong though.
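To restrict it to the one row, another option is to locate the span whose data-name is "Georgia" and walk up to its enclosing tr. A minimal sketch, assuming soup is already built and the markup matches the snippet above:

# find the star span for Georgia, then take the row it belongs to
span = soup.find('span', attrs={'data-name': 'Georgia'})
if span is not None:
    row = span.find_parent('tr')
    # keep only cells whose text is non-empty after stripping whitespace
    values = [td.get_text(strip=True) for td in row.find_all('td')
              if td.get_text(strip=True)]
    print(values)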

How to read the data from HTML file and write the data to CSV file using python?

I have an .html report file which contains the data in tables along with pass/fail criteria. I want this data to be written to a .csv file using Python 3.
Please suggest how to proceed.
For example, the data will be like this:
<h2>Sequence Evaluation of Entire Project <em class="contentlink">[Contents]</em> </h2>
<table width="100%" class="coverage">
<tr class="nohover">
<td colspan="8" class="tableabove">Test Sequence State</td>
</tr>
<tr>
<th colspan="2" style="white-space:nowrap;">Metric</th>
<th colspan="2">Percentage</th>
<th>Target</th>
<th>Total</th>
<th>Reached</th>
<th>Unreached</th>
</tr>
<tr>
<td colspan="2">Test Sequence Work Progress</td>
<td>100.0%</td>
<td>
<table class="metricbar">
<tr class="borderX">
<td class="white"></td>
<td class="target"></td>
<td class="white" colspan="2"></td>
</tr>
<tr>
<td class="covreached" width="99%"></td>
<td class="target" width="1%"></td>
<td class="covreached" width="0%"></td>
<td class="covnotreached" width="0%"></td>
</tr>
<tr class="borderX">
<td class="white"></td>
<td class="target"></td>
<td class="white" colspan="2"></td>
</tr>
</table>
</td>
<td>100%</td>
<td>24</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
Assuming you know the header and really only need the associated percentage, with bs4 4.7.1+ you can use :contains to target the header and then take the next td. You would be reading your HTML from file into the html variable shown.
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
html = '''
<h2>Sequence Evaluation of Entire Project <em class="contentlink">[Contents]</em> </h2>
<table width="100%" class="coverage">
<tr class="nohover">
<td colspan="8" class="tableabove">Test Sequence State</td>
</tr>
<tr>
<th colspan="2" style="white-space:nowrap;">Metric</th>
<th colspan="2">Percentage</th>
<th>Target</th>
<th>Total</th>
<th>Reached</th>
<th>Unreached</th>
</tr>
<tr>
<td colspan="2">Test Sequence Work Progress</td>
<td>100.0%</td>
<td>
<table class="metricbar">
<tr class="borderX">
<td class="white"></td>
<td class="target"></td>
<td class="white" colspan="2"></td>
</tr>
<tr>
<td class="covreached" width="99%"></td>
<td class="target" width="1%"></td>
<td class="covreached" width="0%"></td>
<td class="covnotreached" width="0%"></td>
</tr>
<tr class="borderX">
<td class="white"></td>
<td class="target"></td>
<td class="white" colspan="2"></td>
</tr>
</table>
</td>
<td>100%</td>
<td>24</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
'''
soup = bs(html, 'lxml') # 'html.parser' if lxml not installed
header = 'Test Sequence Work Progress'
result = soup.select_one('td:contains("' + header + '") + td').text
df = pd.DataFrame([result], columns = [header])
print(df)
df.to_csv(r'C:\Users\User\Desktop\data.csv', sep=',', encoding='utf-8-sig',index = False )
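If more metric rows are needed later, the same :contains lookup can be repeated per header name and gathered into one frame. A sketch under the assumption that each name appears in a td exactly as written; any extra names you add to the list are placeholders:

headers = ['Test Sequence Work Progress']  # add further metric names as needed
data = {}
for h in headers:
    cell = soup.select_one('td:contains("' + h + '") + td')
    if cell is not None:
        data[h] = cell.text
df = pd.DataFrame([data])
df.to_csv('data.csv', index=False, encoding='utf-8-sig')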
import csv
from bs4 import BeautifulSoup

out = open('out.csv', 'w', encoding='utf-8')
path = "my.html"  # add the path of your local file here
soup = BeautifulSoup(open(path), 'html.parser')
for link in soup.find_all('p'):  # add the tag which you want to extract
    a = link.get_text()
    out.write(a)
    out.write('\n')
out.close()
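For the table data in this question, the same read-a-local-file pattern works with tr/td in place of p; note that nested tables (such as the metricbar) will appear as extra rows. A minimal sketch, assuming the report is saved locally as my.html:

import csv
from bs4 import BeautifulSoup

path = "my.html"  # local copy of the report
soup = BeautifulSoup(open(path, encoding='utf-8'), 'html.parser')

with open('out.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    for tr in soup.find_all('tr'):
        cells = [c.get_text(strip=True) for c in tr.find_all(['th', 'td'])]
        if any(cells):  # skip completely empty rows
            writer.writerow(cells)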

Using BeautifulSoup to pick up text in table, on webpages

I want to use BeautifulSoup to pick up the ‘Model Type’ values on a company's webpages, which come from code like the below:
It forms 2 tables shown on the webpage, side by side.
Updated source code of the webpage:
<TR class=tableheader>
<TD width="12%"> </TD>
<TD style="TEXT-ALIGN: left" width="12%">Group </TD>
<TD style="TEXT-ALIGN: left" width="15%">Model Type </TD>
<TD style="TEXT-ALIGN: left" width="15%">Design Year </TD></TR>
<TR class=row1>
<TD width="10%"> </TD>
<TD class=row1>South West</TD>
<TD>VIP QB662FG (Registered) </TD>
<TD>2013 (Registered) </TD></TR></TBODY></TABLE></TD></TR>
I am using the following, however it doesn't get the ‘VIP QB662FG’ value wanted:
from bs4 import BeautifulSoup
import urllib2
import re

url = "http://www.thewebpage.com"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
find_it = soup.find_all(text=re.compile("Model Type "))
the_value = find_it[0].findNext('td').contents[0]
print the_value
in what way I can get it? I'm using Python 2.7.
You are looking for the next row, then the next cell in the same position. The latter is tricky; we could assume it is always the 3rd column:
header_text = soup.find(text=re.compile("Model Type "))
header_cell = header_text.parent  # the <td> holding the header text
value = header_cell.find_next('tr').select('td:nth-of-type(3)')[0].get_text()
If you just ask for the next td, you get the Design Year column instead.
There could well be better methods to get to your one cell; if we assume there is only one tr row with the class row1, for example, the following would get your value in one step:
value = soup.select('tr.row1 td:nth-of-type(3)')[0].get_text()
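Put together on the markup from the question, that looks roughly like this (a sketch; it assumes the data row always carries class="row1"):

from bs4 import BeautifulSoup

html = """<table class="tableforms"><tbody>
<tr class="tableheader">
<td> </td><td>Group </td><td>Model Type </td><td>Design Year </td></tr>
<tr class="row1">
<td> </td><td class="row1">South West</td>
<td>VIP QB662FG (Registered) </td><td>2013 (Registered) </td></tr>
</tbody></table>"""

soup = BeautifulSoup(html, 'html.parser')
value = soup.select('tr.row1 td:nth-of-type(3)')[0].get_text()
print(value.replace('(Registered)', '').strip())  # VIP QB662FG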
Find all tr's and output the third child of each, unless it's the first row:
import bs4
data = """
<TR class=tableheader>
<TD width="12%"> </TD>
<TD style="TEXT-ALIGN: left" width="12%">Group </TD>
<TD style="TEXT-ALIGN: left" width="15%">Model Type </TD>
<TD style="TEXT-ALIGN: left" width="15%">Design Year </TD></TR>
<TR class=row1>
<TD width="10%"> </TD>
<TD class=row1>South West</TD>
<TD>VIP QB662FG (Registered) </TD>
<TD>2013 (Registered) </TD>
"""
soup = bs4.BeautifulSoup(data)
#table = soup.find('tr', {'class':'tableheader'}).parent
table = soup.find('table', {'class':'tableforms'})
for i, tr in enumerate(table.findChildren()):
    if i > 0:
        for idx, td in enumerate(tr.findChildren()):
            if idx == 2:
                print td.get_text().replace('(Registered)','').strip()
I think you can do it as follows:
from bs4 import BeautifulSoup
html = """<TD colSpan=3>Desinger </TD></TR>
<TR>
<TD class=row2bold width="5%"> </TD>
<TD class=row2bold width="30%" align=left>Gender </TD>
<TD class=row1 width="20%" align=left>Male </TD></TR>
<TR>
<TD class=row2bold width="5%"> </TD>
<TD class=row2bold width="30%" align=left>Born Country </TD>
<TD class=row1 width="20%" align=left>DE </TD></TR></TBODY></TABLE></TD>
<TD height="100%" vAlign=top>
<TABLE class=tableforms>
<TBODY>
<TR class=tableheader>
<TD colSpan=4>Remarks </TD></TR>
<TR class=tableheader>
<TD width="12%"> </TD>
<TD style="TEXT-ALIGN: left" width="12%">Group </TD>
<TD style="TEXT-ALIGN: left" width="15%">Model Type </TD>
<TD style="TEXT-ALIGN: left" width="15%">Design Year </TD></TR>
<TR class=row1>
<TD width="10%"> </TD>
<TD class=row1>South West</TD>
<TD>VIP QB662FG (Registered) </TD>
<TD>2013 (Registered) </TD></TR></TBODY></TABLE></TD></TR>"""
soup = BeautifulSoup(html, "html.parser")
soup = soup.find('table',{'class':'tableforms'})
dico = {}
l1 = soup.findAll('tr')[1].findAll('td')
l2 = soup.findAll('tr')[2].findAll('td')
for i in range(len(l1)):
    dico[l1[i].getText().strip()] = l2[i].getText().replace('(Registered)','').strip()
print dico['Model Type']
It prints: u'VIP QB662FG'

Cant extract tables from a html code

I am working on parsing an HTML table given below (it's a section of the complete HTML code), but the code is not working. Can someone please help me? There is an error saying "table has no attribute findall".
The code is:
import re
import HTMLParser
from urllib2 import urlopen
import urllib2
from bs4 import BeautifulSoup
url = 'http://164.100.47.132/LssNew/Members/Biography.aspx?mpsno=4064'
url_data = urlopen(url).read()
html_page = urllib2.urlopen(url)
soup = BeautifulSoup(html_page)
title = soup.title
final_tit = title.string
table = soup.find('table',id = "ctl00_ContPlaceHolderMain_Bioprofile1_Datagrid1")
tr = table.findall('tr')
for tr in table:
    cols = tr.findAll('td')
    for td in cols:
        text = ''.join(td.find(text=True))
        print text+"|",
    print
<table style="WIDTH: 565px">
<tr>
<td vAlign="top" align="left"><img id="ctl00_ContPlaceHolderMain_Bioprofile1_Image1" src="http://164.100.47.132/mpimage/photo/4064.jpg" style="height:140px;border-width:0px;" /></td>
<td vAlign="top"><table cellspacing="0" rules="all" border="2" id="ctl00_ContPlaceHolderMain_Bioprofile1_Datagrid1" style="border-color:#FAE3C3;border-width:2px;border-style:Solid;width:433px;border-collapse:collapse;">
<tr>
<td>
<table align="center" height="30px">
<tr valign="top">
<td align="center" valign="top" class="gridheader1">Aaroon Rasheed,Shri J.M.</td>
</tr>
</table>
<table height="110px">
<tr>
<td align="left" class="darkerb" width="133px" valign="top">Constituency :</td>
<td align="left" valign="top" class="griditem2" width="300px">Theni (Tamil Nadu )</td>
</tr>
<tr>
<td align="left" width="133px" class="darkerb" valign="top">
Party Name :</td>
<td align="left" width="300px" valign="top" class="griditem2">Indian National Congress(INC)</td>
</tr>
<tr>
<td align="left" class="darkerb" valign="top" width="133px">
Email Address :
</td>
<td align="left" valign="top" class="griditem2" width="300px">jm.aaronrasheed#sansad.nic.in</td>
</tr>
</table>
</td>
</tr>
</table></td>
</tr>
</table>
The method is called find_all(), not findall:
tr = table.find_all('tr')
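Applied to the code in the question, the fixed loop looks roughly like this (a sketch in the question's Python 2 style; get_text() replaces the join over find(text=True) so cells with nested markup don't raise):

table = soup.find('table', id="ctl00_ContPlaceHolderMain_Bioprofile1_Datagrid1")
for tr in table.find_all('tr'):   # find_all, not findall
    cols = tr.find_all('td')
    print '|'.join(td.get_text(strip=True) for td in cols)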
