parse table using beautifulsoup in python - python

I want to traverse each row and capture the text of every td. The problem is that the table does not have a class, and all the td elements share the same class name. I want to traverse each row and get the following output:
1st row)"AMERICANS SOCCER CLUB","B11EB - AMERICANS-B11EB-WARZALA","Cameron Coya","Player 228004","2016-09-10","player persistently infringes the laws of the game","C" (new line)
2nd row) "AVIATORS SOCCER CLUB","G12DB - AVIATORS-G12DB-REYNGOUDT","Saskia Reyes","Player 224463","2016-09-11","player/sub guilty of unsporting behavior"," C" (new line)
<div style="overflow:auto; border:1px #cccccc solid;">
<table cellspacing="0" cellpadding="3" align="left" border="0" width="100%">
<tbody>
<tr class="tblHeading">
<td colspan="7">AMERICANS SOCCER CLUB</td>
</tr>
<tr bgcolor="#CCE4F1">
<td colspan="7">B11EB - AMERICANS-B11EB-WARZALA</td>
</tr>
<tr bgcolor="#FFFFFF">
<td width="19%" class="tdUnderLine"> Cameron Coya </td>
<td width="19%" class="tdUnderLine">
Rozel, Max
</td>
<td width="06%" class="tdUnderLine">
09-11-2016
</td>
<td width="05%" class="tdUnderLine" align="center">
228004
</td>
<td width="16%" class="tdUnderLine" align="center">
09/10/16 02:15 PM
</td>
<td width="30%" class="tdUnderLine"> player persistently infringes the laws of the game </td>
<td class="tdUnderLine"> Cautioned </td>
</tr>
<tr class="tblHeading">
<td colspan="7">AVIATORS SOCCER CLUB</td>
</tr>
<tr bgcolor="#CCE4F1">
<td colspan="7">G12DB - AVIATORS-G12DB-REYNGOUDT</td>
</tr>
<tr bgcolor="#FBFBFB">
<td width="19%" class="tdUnderLine"> Saskia Reyes </td>
<td width="19%" class="tdUnderLine">
HollaenderNardelli, Eric
</td>
<td width="06%" class="tdUnderLine">
09-11-2016
</td>
<td width="05%" class="tdUnderLine" align="center">
224463
</td>
<td width="16%" class="tdUnderLine" align="center">
09/11/16 06:45 PM
</td>
<td width="30%" class="tdUnderLine"> player/sub guilty of unsporting behavior </td>
<td class="tdUnderLine"> Cautioned </td>
</tr>
<tr class="tblHeading">
<td colspan="7">BERGENFIELD SOCCER CLUB</td>
</tr>
<tr bgcolor="#CCE4F1">
<td colspan="7">B11CW - BERGENFIELD-B11CW-NARVAEZ</td>
</tr>
<tr bgcolor="#FFFFFF">
<td width="19%" class="tdUnderLine"> Christian Latorre </td>
<td width="19%" class="tdUnderLine">
Coyle, Kevin
</td>
<td width="06%" class="tdUnderLine">
09-10-2016
</td>
<td width="05%" class="tdUnderLine" align="center">
226294
</td>
<td width="16%" class="tdUnderLine" align="center">
09/10/16 11:00 AM
</td>
<td width="30%" class="tdUnderLine"> player persistently infringes the laws of the game </td>
<td class="tdUnderLine"> Cautioned </td>
</tr>
I tried with following code.
import requests
from bs4 import BeautifulSoup
import re
try:
    import urllib.request as urllib2
except ImportError:
    import urllib2
url = r"G:\Freelancer\NC Soccer\Northern Counties Soccer Association ©.html"
page = open(url, encoding="utf8")
soup = BeautifulSoup(page.read(),"html.parser")
#tableList = soup.findAll("table")
for tr in soup.find_all("tr"):
    for td in tr.find_all("td"):
        print(td.text.strip())
But obviously this returns the text from all the td elements, so I cannot identify a particular column or determine where a new record starts. I want to know:
1) how to identify each column (the class name is the same for all of them), given that there are headings as well (I would appreciate code for that)
2) how to identify a new record in such a structure

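If the data is really structured like a table, there's a good chance you can read it into pandas directly. Note that pd.read_table(), documented at the link below, is meant for delimited text files; for tables inside an HTML page the pandas reader is pd.read_html(), which also accepts URLs.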
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_table.html

count = 0
string = ""
for td in soup.find_all("td"):
    string += "\""+td.text.strip()+"\","
    count +=1
    # each record spans 9 td cells: 2 heading cells plus 7 data cells
    if(count % 9 ==0):
        print(string[:-1] + "\n\n") # string[:-1] to remove the last ","
        string = ""
Since the table is not in the required format, we can just iterate over the td elements directly rather than going into each row and then into each td of that row, which complicates the work. I just used a string; you can append the data to a list of lists instead and process it later.
Hope this solves your problem.
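The list-of-lists variant mentioned above could look roughly like the sketch below. It assumes, as in the question's HTML, that every record is exactly 9 td cells long (2 heading cells plus 7 data cells); the `html` sample is a shortened stand-in for the real page and `CELLS_PER_RECORD` is a name made up for this example.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the page in the question: each record is
# 2 heading cells followed by 7 data cells.
html = """
<table><tbody>
<tr class="tblHeading"><td colspan="7">AMERICANS SOCCER CLUB</td></tr>
<tr bgcolor="#CCE4F1"><td colspan="7">B11EB - AMERICANS-B11EB-WARZALA</td></tr>
<tr bgcolor="#FFFFFF">
<td class="tdUnderLine"> Cameron Coya </td>
<td class="tdUnderLine"> Rozel, Max </td>
<td class="tdUnderLine"> 09-11-2016 </td>
<td class="tdUnderLine"> 228004 </td>
<td class="tdUnderLine"> 09/10/16 02:15 PM </td>
<td class="tdUnderLine"> player persistently infringes the laws of the game </td>
<td class="tdUnderLine"> Cautioned </td>
</tr>
<tr class="tblHeading"><td colspan="7">AVIATORS SOCCER CLUB</td></tr>
<tr bgcolor="#CCE4F1"><td colspan="7">G12DB - AVIATORS-G12DB-REYNGOUDT</td></tr>
<tr bgcolor="#FBFBFB">
<td class="tdUnderLine"> Saskia Reyes </td>
<td class="tdUnderLine"> HollaenderNardelli, Eric </td>
<td class="tdUnderLine"> 09-11-2016 </td>
<td class="tdUnderLine"> 224463 </td>
<td class="tdUnderLine"> 09/11/16 06:45 PM </td>
<td class="tdUnderLine"> player/sub guilty of unsporting behavior </td>
<td class="tdUnderLine"> Cautioned </td>
</tr>
</tbody></table>
"""

CELLS_PER_RECORD = 9  # assumption: 2 heading cells + 7 data cells per record

soup = BeautifulSoup(html, "html.parser")
cells = [td.get_text(strip=True) for td in soup.find_all("td")]

# Slice the flat list of cell texts into one list per record.
records = [
    cells[i:i + CELLS_PER_RECORD]
    for i in range(0, len(cells), CELLS_PER_RECORD)
]

for record in records:
    print(",".join('"{}"'.format(value) for value in record))
```

This only works while every record has the same number of cells; if a club heading is followed by several teams or players, the fixed chunk size no longer lines up.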

from __future__ import print_function
import re
import datetime
from bs4 import BeautifulSoup
soup = ""
with open("/tmp/a.html") as page:
    soup = BeautifulSoup(page.read(),"html.parser")
table = soup.find('div', {'style': 'overflow:auto; border:1px #cccccc solid;'}).find('table')
trs = table.find_all('tr')
table_dict = {}
game = ""
section = ""
for tr in trs:
    if tr.has_attr('class'):
        game = tr.text.strip('\n')
    if tr.has_attr('bgcolor'):
        if tr['bgcolor'] == '#CCE4F1':
            section = tr.text.strip('\n')
        else:
            tds = tr.find_all('td')
            extracted_text = [re.sub(r'([^\x00-\x7F])+','', td.text) for td in tds]
            extracted_text = [x.strip() for x in extracted_text]
            extracted_text = list(filter(lambda x: len(x) > 2, extracted_text))
            extracted_text.pop(1)
            extracted_text[2] = "Player " + extracted_text[2]
            extracted_text[3] = datetime.datetime.strptime(extracted_text[3], '%m/%d/%y %I:%M %p').strftime("%Y-%m-%d")
            extracted_text = ['"' + x + '"' for x in [game, section] + extracted_text]
            print(','.join(extracted_text))
And when run:
$ python a.py
"AMERICANS SOCCER CLUB","B11EB - AMERICANS-B11EB-WARZALA","Cameron Coya","Player 228004","2016-09-10","player persistently infringes the laws of the game","C"
"AVIATORS SOCCER CLUB","G12DB - AVIATORS-G12DB-REYNGOUDT","Saskia Reyes","Player 224463","2016-09-11","player/sub guilty of unsporting behavior","C"
"BERGENFIELD SOCCER CLUB","B11CW - BERGENFIELD-B11CW-NARVAEZ","Christian Latorre","Player 226294","2016-09-10","player persistently infringes the laws of the game","C"
Based on further conversation with the OP, the input was https://paste.fedoraproject.org/428111/87928814/raw/ and the output after running the above code is: https://paste.fedoraproject.org/428110/38792211/raw/

There seems to be a pattern: every data row holds 7 td cells, after which a new line is needed.
So you can keep a counter starting from 0; when it reaches 7, print a new line and reset it to 0. (The single-cell heading rows have to be skipped for the count to line up.)
counter = 0
for tr in soup.find_all("tr"):
    for td in tr.find_all("td"):
        # place code
        counter = counter + 1
        if counter == 7:
            print("\n")
            counter = 0
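A variation on the counting idea is to decide per row rather than per cell: a row with a single cell is a heading, and a row with 7 cells is a record. The sketch below assumes that pattern holds for the whole page. The names in `COLUMNS` are guesses made up for this example (the page itself does not label the columns), and `html` is a shortened stand-in for the real file.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the page in the question.
html = """
<table><tbody>
<tr class="tblHeading"><td colspan="7">AMERICANS SOCCER CLUB</td></tr>
<tr bgcolor="#CCE4F1"><td colspan="7">B11EB - AMERICANS-B11EB-WARZALA</td></tr>
<tr bgcolor="#FFFFFF">
<td class="tdUnderLine"> Cameron Coya </td>
<td class="tdUnderLine"> Rozel, Max </td>
<td class="tdUnderLine"> 09-11-2016 </td>
<td class="tdUnderLine"> 228004 </td>
<td class="tdUnderLine"> 09/10/16 02:15 PM </td>
<td class="tdUnderLine"> player persistently infringes the laws of the game </td>
<td class="tdUnderLine"> Cautioned </td>
</tr>
<tr class="tblHeading"><td colspan="7">AVIATORS SOCCER CLUB</td></tr>
<tr bgcolor="#CCE4F1"><td colspan="7">G12DB - AVIATORS-G12DB-REYNGOUDT</td></tr>
<tr bgcolor="#FBFBFB">
<td class="tdUnderLine"> Saskia Reyes </td>
<td class="tdUnderLine"> HollaenderNardelli, Eric </td>
<td class="tdUnderLine"> 09-11-2016 </td>
<td class="tdUnderLine"> 224463 </td>
<td class="tdUnderLine"> 09/11/16 06:45 PM </td>
<td class="tdUnderLine"> player/sub guilty of unsporting behavior </td>
<td class="tdUnderLine"> Cautioned </td>
</tr>
</tbody></table>
"""

# Hypothetical column names; the source table has no header for these cells.
COLUMNS = ["player", "official", "date", "player_id", "game_time", "reason", "action"]

soup = BeautifulSoup(html, "html.parser")
records = []
club = team = None

for tr in soup.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if len(cells) == 1:
        # Heading row: the club rows carry class="tblHeading", team rows do not.
        if "tblHeading" in tr.get("class", []):
            club = cells[0]
        else:
            team = cells[0]
    elif len(cells) == len(COLUMNS):
        # Data row: a new record starts here, under the latest club and team.
        record = {"club": club, "team": team}
        record.update(zip(COLUMNS, cells))
        records.append(record)

for record in records:
    print(record["club"], "|", record["team"], "|", record["player"], "|", record["reason"])
```

Because each record remembers the most recent headings, this keeps working when one club or team is followed by several player rows.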

Related

BeautifulSoup to find a HTML tag that contains tags with specific class

How to find all tags that include tags with certain class?
The data is:
<tr>
<td class="TDo1" width=17%>Tournament</td>
<td class="TDo2" width=8%>Date</td>
<td class="TDo2" width=6%>Pts.</td>
<td class="TDo2" width=34%>Pos. Player (team)</td>
<td class="TDo5" width=35%>Pos. Opponent (team)</td>
</tr>
<tr>
<td class=TDq1>GpWl(op) 4.01/02</td>
<td class=TDq2>17.02.02</td>
<td class=TDq3>34/75</td>
<td class=TDq5>39. John Deep</td>
<td class=TDq9>68. Mark Deep</td>
</tr>
<tr>
<td class=TDp1>GpWl(op) 4.01/02</td>
<td class=TDp2>17.02.02</td>
<td class=TDp3>34/75</td>
<td class=TDp6>39. John Deep</td>
<td class=TDp8>7. Darius Star</td>
</tr>
I am trying
for mtable in bs.find_all('tr', text=re.compile(r'class=TD?3')):
    print(mtable)
but this returns zero results.
I suppose you want to find all <tr> that contain any tag with class TD<any character>3:
import re
from bs4 import BeautifulSoup
# `html` contains your html from the question
soup = BeautifulSoup(html, "html.parser")
pat = re.compile(r"TD.3")
for tr in soup.find_all(
    lambda tag: tag.name == "tr"
    and tag.find(class_=lambda cl: cl and pat.match(cl))
):
    print(tr)
Prints:
<tr>
<td class="TDq1">GpWl(op) 4.01/02</td>
<td class="TDq2">17.02.02</td>
<td class="TDq3">34/75</td>
<td class="TDq5">39. John Deep</td>
<td class="TDq9">68. Mark Deep</td>
</tr>
<tr>
<td class="TDp1">GpWl(op) 4.01/02</td>
<td class="TDp2">17.02.02</td>
<td class="TDp3">34/75</td>
<td class="TDp6">39. John Deep</td>
<td class="TDp8">7. Darius Star</td>
</tr>
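You need to match on the td elements, like this (note that TD\w\d accepts any trailing digit, so every cell is returned, not only those ending in 3):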
In [1]: bs.find_all('td', {"class": re.compile(r'TD\w\d')})
Out[1]:
[<td class="TDo1" width="17%">Tournament</td>,
<td class="TDo2" width="8%">Date</td>,
<td class="TDo2" width="6%">Pts.</td>,
<td class="TDo2" width="34%">Pos. Player (team)</td>,
<td class="TDo5" width="35%">Pos. Opponent (team)</td>,
<td class="TDq1">GpWl(op) 4.01/02</td>,
<td class="TDq2">17.02.02</td>,
<td class="TDq3">34/75</td>,
<td class="TDq5">39. John Deep</td>,
<td class="TDq9">68. Mark Deep</td>,
<td class="TDp1">GpWl(op) 4.01/02</td>,
<td class="TDp2">17.02.02</td>,
<td class="TDp3">34/75</td>,
<td class="TDp6">39. John Deep</td>,
<td class="TDp8">7. Darius Star</td>]
This may help you:
from bs4 import BeautifulSoup
import re
t = 'your page source'
pat = re.compile(r'class=TD.3')
classes = re.findall(pat,t)
classes = [j[6:] for j in classes]
soup = BeautifulSoup(t, "html.parser")
result = list()
for i in classes:
    item = soup.find_all(attrs={"class": i})
    result.extend(item)
for i in result:
    print(i.parent)
You could use css selectors to get the tags with class = "TD...3" and then get their parent tags
mtables = [s.parent for s in bs.select('*[class^="TD"][class$="3"]')]
# if you want tr only:
# mtables = [s.parent for s in bs.select('tr *[class^="TD"][class$="3"]')]
mtables = list(set(mtables)) # no duplicates
but this will not work if there are multiple classes (unless the first one starts with "TD" and the last ends in "3"), and you can't limit the characters in between.
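To make that limitation concrete, here is a small sketch comparing the attribute selector with a regex passed to `class_`. The `html` sample, including the extra `wide` class, is invented for the illustration. The attribute selector tests the whole class attribute string, while `class_` is checked against each individual class name.

```python
import re
from bs4 import BeautifulSoup

# Invented sample: the second cell has two classes.
html = """
<table>
<tr><td class="TDq3">single class</td></tr>
<tr><td class="wide TDp3">two classes</td></tr>
<tr><td class="TDo1">no match</td></tr>
</table>
"""

bs = BeautifulSoup(html, "html.parser")

# Attribute selector: compares against the full attribute value "wide TDp3",
# which does not start with "TD", so the second cell is missed.
by_selector = bs.select('tr [class^="TD"][class$="3"]')

# class_ with a regex: tested against each class name separately,
# so "TDp3" is found even though "wide" comes first.
by_regex = bs.find_all("td", class_=re.compile(r"^TD.3$"))

print(len(by_selector), [td.get_text() for td in by_selector])
print(len(by_regex), [td.get_text() for td in by_regex])
```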
You could use find with lambda twice like this
tagPat = '^TD.3$'
# tagPat = '^TD.*3$' # if there might be more than one character between TD and 3
mtables = bs.find_all(
    lambda p:
        p.name == 'tr' and # remove if you want all tags and not just tr
        p.find(
            lambda t: t.get('class') is not None and
                len([c for c in t.get('class') if re.search(tagPat, c)]) > 0
            , recursive=False # prevent getting <tr><tr><td class="TDo3">someval</td></tr></tr>
        ))
If you don't want to use lambda, you can replace the select in the first method with regex and find
tagPat = '^TD.3$' # '^TD.*3$' #
mtables = [
    s.parent for s in bs.find_all(class_ = re.compile(tagPat))
    if s.parent.name == 'tr' # remove if you want all tags and not just tr
]
mtables = list(set(mtables)) # no duplicates
For the html in your question all 3 methods would lead to same data - you can print with
for mtable in mtables: print('---\n', mtable, '\n---')
and get the output
---
<tr>
<td class="TDq1">GpWl(op) 4.01/02</td>
<td class="TDq2">17.02.02</td>
<td class="TDq3">34/75</td>
<td class="TDq5">39. John Deep</td>
<td class="TDq9">68. Mark Deep</td>
</tr>
---
---
<tr>
<td class="TDp1">GpWl(op) 4.01/02</td>
<td class="TDp2">17.02.02</td>
<td class="TDp3">34/75</td>
<td class="TDp6">39. John Deep</td>
<td class="TDp8">7. Darius Star</td>
</tr>
---

BeautifulSoup4 extract multiple data from TD tags within TR

Using beautifulsoup 4 to scrape an HTML table.
I want to display the values from one of the table rows and remove any empty td fields.
The rows in the source being scraped share the same class values.
So is there any way to pull the data from just one row, using
data-name="Georgia" in the HTML source below?
Using: beautifulsoup4
Current code
import bs4 as bs
from urllib.request import FancyURLopener
class MyOpener(FancyURLopener):
    version = 'My new User-Agent' # Set this to a string you want for your user agent
myopener = MyOpener()
sauce = myopener.open('')
soup = bs.BeautifulSoup(sauce,'lxml')
#table = soupe.table
table = soup.find('table')
table_rows = table.find_all_next('tr')
for tr in table_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    print(row)
HTML SOURCE
<tr>
<td class="text--gray">
<span class="save-button" data-status="unselected" data-type="country" data-name="Kazakhstan">★</span>
Kazakhstan
</td>
<td class="text--green">
81
</td>
<td class="text--green">
9
</td>
<td class="text--green">
12.5
</td>
<td class="text--red">
0
</td>
<td class="text--red">
0
</td>
<td class="text--red">
0
</td>
<td class="text--blue">
0
</td>
<td class="text--yellow">
0
</td>
</tr>
<tr>
<td class="text--gray">
<span class="save-button" data-status="unselected" data-type="country" data-name="Georgia">★</span>
Georgia
</td>
<td class="text--green">
75
</td>
<td class="text--green">
0
</td>
<td class="text--green">
0
</td>
<td class="text--red">
0
</td>
<td class="text--red">
0
</td>
<td class="text--red">
0
</td>
<td class="text--blue">
10
</td>
<td class="text--yellow">
1
</td>
</tr>
Are you talking about something like:
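tr.find_all('span',{'data-name' : True})
That should find any span that carries a data-name attribute (in the HTML shown, data-name sits on the span inside the first td, not on the td itself). I could be reading your question all wrong though.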
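To pull just the Georgia row, one possible approach is to locate the span whose data-name is "Georgia", climb to its parent tr, and read the cells from there. The sketch below uses a shortened copy of the question's HTML; the empty td in the Georgia row is added here only to show the filtering and is not in the original source.

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's rows; the empty <td> is an addition
# for this example so that the "remove empty fields" step has something to do.
html = """
<table>
<tr>
<td class="text--gray">
<span class="save-button" data-status="unselected" data-type="country" data-name="Kazakhstan">★</span>
Kazakhstan
</td>
<td class="text--green">
81
</td>
<td class="text--red">
0
</td>
</tr>
<tr>
<td class="text--gray">
<span class="save-button" data-status="unselected" data-type="country" data-name="Georgia">★</span>
Georgia
</td>
<td class="text--green">
75
</td>
<td class="text--green">
</td>
<td class="text--yellow">
1
</td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# The data-name attribute lives on the <span>, so search for the span first.
marker = soup.find("span", attrs={"data-name": "Georgia"})
row = marker.find_parent("tr")
cells = row.find_all("td")

# Take the country name from the attribute (the cell text also contains the star),
# then the stripped text of the remaining cells.
values = [marker["data-name"]] + [td.get_text(strip=True) for td in cells[1:]]

# Drop cells that are empty after stripping; "0" is kept because it is a non-empty string.
values = [v for v in values if v]
print(values)
```

If no span matches, `soup.find` returns None, so real code would want to check `marker` before calling `find_parent`.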

Python parsing tables with BeautifulSoup

HTML-page structure:
<table>
<tbody>
<tr>
<th>Timestamp</th>
<th>Call</th>
<th>MHz</th>
<th>SNR</th>
<th>Drift</th>
<th>Grid</th>
<th>Pwr</th>
<th>Reporter</th>
<th>RGrid</th>
<th>km</th>
<th>az</th>
</tr>
<tr>
<td align="right"> 2019-12-10 14:02 </td>
<td align="left"> DL1DUZ </td>
<td align="right"> 10.140271 </td>
<td align="right"> -26 </td>
<td align="right"> 0 </td>
<td align="left"> JO61tb </td>
<td align="right"> 0.2 </td>
<td align="left"> F4DWV </td>
<td align="left"> IN98bc </td>
<td align="right"> 1162 </td>
<td align="right"> 260 </td>
</tr>
<tr>
<td align="right"> 2019-10-10 14:02 </td>
<td align="left"> DL23UH </td>
<td align="right"> 11.0021 </td>
<td align="right"> -20 </td>
<td align="right"> 0 </td>
<td align="left"> JO61tb </td>
<td align="right"> 0.2 </td>
<td align="left"> F4DWV </td>
<td align="left"> IN98bc </td>
<td align="right"> 1162 </td>
<td align="right"> 260 </td>
</tr>
</tbody>
</table>
and so on tr-td...
My code:
from bs4 import BeautifulSoup as bs
import requests
import csv
base_url = 'some_url'
session = requests.Session()
request = session.get(base_url)
val_th = []
val_td = []
if request.status_code == 200:
    soup = bs(request.content, 'html.parser')
    table = soup.findChildren('table')
    tr = soup.findChildren('tr')
    my_table = table[0]
    my_tr_th = tr[0]
    my_tr_td = tr[1]
    rows = my_table.findChildren('tr')
    row_th = my_tr_th.findChildren('th')
    row_td = my_tr_td.findChildren('td')
    for r_th in row_th:
        heading = r_th.text
        val_th.append(heading)
    for r_td in row_td:
        data = r_td.text
        val_td.append(data)
    with open('output.csv', 'w') as f:
        a_pen = csv.writer(f)
        a_pen.writerow(val_th)
        a_pen.writerow(val_td)
1) This writes only one row of td values. How can I make sure that all the td rows on the page end up in the csv?
2) There are many td tags on the page.
3) Writing my_tr_td = tr[1:50] instead of my_tr_td = tr[1] gives an error.
How do I write all the data from the tr-td rows to a csv file?
Thanks in advance.
Let's try it this way:
import lxml.html
import csv
import requests
url = "http://wsprnet.org/drupal/wsprnet/spots"
res = requests.get(url)
doc = lxml.html.fromstring(res.text)
cols = []
#first, we need to extract the column headers, stuck all the way at the top, with the first one in a particular location and format
cols.append(doc.xpath('//table/tr/node()/text()')[0])
for item in doc.xpath('//table/tr/th'):
    typ = str(type(item.getnext()))
    if not 'NoneType' in typ:
        cols.append(item.getnext().text)
#now for the actual data
inf = []
for item in doc.xpath('//table//tr//td'):
    inf.append(item.text.replace('\\xa02', '').strip()) #text info needs to be cleaned
#this will take all the data and split it into rows for each column
rows = [inf[x:x+len(cols)] for x in range(0, len(inf), len(cols))]
#finally, write to file:
with open("output.csv", "w", newline='') as f:
    writer = csv.writer(f)
    writer.writerow(cols)
    for l in rows:
        writer.writerow(l)
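For comparison, the same job can probably be done while staying with BeautifulSoup, as in the question: loop over every tr instead of picking tr[1], and write one csv row per tr. This is a sketch on a cut-down copy of the question's table (3 of the 11 columns). It writes to an in-memory buffer so it can run without touching the disk; swapping the buffer for `open("output.csv", "w", newline="")` would write a file instead.

```python
import csv
import io
from bs4 import BeautifulSoup

# Cut-down copy of the table from the question (only the first 3 columns).
html = """<table><tbody>
<tr><th>Timestamp</th><th>Call</th><th>MHz</th></tr>
<tr><td align="right"> 2019-12-10 14:02 </td><td align="left"> DL1DUZ </td><td align="right"> 10.140271 </td></tr>
<tr><td align="right"> 2019-10-10 14:02 </td><td align="left"> DL23UH </td><td align="right"> 11.0021 </td></tr>
</tbody></table>"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

# Column headings come from the <th> cells.
header = [th.get_text(strip=True) for th in table.find_all("th")]

# One list of cell texts per <tr>; the heading row has no <td>, so it yields
# an empty list and is skipped.
rows = []
for tr in table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:
        rows.append(cells)

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerows(rows)
print(buffer.getvalue())
```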

Using BeautifulSoup to pick up text in table, on webpages

I want to use BeautifulSoup to pick up the ‘Model Type’ values on a company’s webpages, which are generated from code like the one below.
It forms 2 tables shown side by side on the webpage.
updated source code of the webpage
<TR class=tableheader>
<TD width="12%"> </TD>
<TD style="TEXT-ALIGN: left" width="12%">Group </TD>
<TD style="TEXT-ALIGN: left" width="15%">Model Type </TD>
<TD style="TEXT-ALIGN: left" width="15%">Design Year </TD></TR>
<TR class=row1>
<TD width="10%"> </TD>
<TD class=row1>South West</TD>
<TD>VIP QB662FG (Registered) </TD>
<TD>2013 (Registered) </TD></TR></TBODY></TABLE></TD></TR>
I am using the following, however it doesn’t get the ‘VIP QB662FG’ I want:
import re
from bs4 import BeautifulSoup
import urllib2
url = "http://www.thewebpage.com"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
find_it = soup.find_all(text=re.compile("Model Type "))
the_value = find_it[0].findNext('td').contents[0]
print the_value
How can I get it? I'm using Python 2.7.
You are looking for the next row, then the next cell in the same position. The latter is tricky; we could assume it is always the 3rd column:
header_text = soup.find(text=re.compile("Model Type "))
value = header_text.find_next('tr').select('td:nth-of-type(3)')[0].get_text()
If you just ask for the next td, you get the Design Year column instead.
There could well be better methods to get to your one cell; if we assume there is only one tr row with the class row1, for example, the following would get your value in one step:
value = soup.select('tr.row1 td:nth-of-type(3)')[0].get_text()
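If hard-coding the third column feels fragile, one alternative is to look up the position of the ‘Model Type’ heading and read the cell at the same position in the next row. The sketch below assumes the heading row and the data row have the same number of cells, as in the question's snippet; the `html` string is that snippet wrapped in a table tag, with the style attributes dropped for brevity.

```python
from bs4 import BeautifulSoup

# The question's two rows, wrapped in <table> so the sample is complete.
html = """<table><tbody>
<tr class="tableheader">
<td width="12%"> </td>
<td width="12%">Group </td>
<td width="15%">Model Type </td>
<td width="15%">Design Year </td></tr>
<tr class="row1">
<td width="10%"> </td>
<td class="row1">South West</td>
<td>VIP QB662FG (Registered) </td>
<td>2013 (Registered) </td></tr>
</tbody></table>"""

soup = BeautifulSoup(html, "html.parser")

# Find which column holds the "Model Type" heading.
header_row = soup.find("tr", class_="tableheader")
headers = [td.get_text(strip=True) for td in header_row.find_all("td")]
col = headers.index("Model Type")

# Read the cell at the same position in the row that follows the heading row.
data_row = header_row.find_next_sibling("tr")
value = data_row.find_all("td")[col].get_text(strip=True)
model = value.replace("(Registered)", "").strip()
print(model)
```

On the real page there may be more than one row with class tableheader, so the `find` call would need to target the one that actually contains the column headings.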
Find all the tr elements and output the third child of each, except for the first row:
import bs4
data = """
<TR class=tableheader>
<TD width="12%"> </TD>
<TD style="TEXT-ALIGN: left" width="12%">Group </TD>
<TD style="TEXT-ALIGN: left" width="15%">Model Type </TD>
<TD style="TEXT-ALIGN: left" width="15%">Design Year </TD></TR>
<TR class=row1>
<TD width="10%"> </TD>
<TD class=row1>South West</TD>
<TD>VIP QB662FG (Registered) </TD>
<TD>2013 (Registered) </TD>
"""
soup = bs4.BeautifulSoup(data, "html.parser")
# On the full page you could first narrow down with soup.find('table', {'class':'tableforms'});
# the fragment above has no <table> tag, so the rows are taken from the soup directly.
for i,tr in enumerate(soup.find_all('tr')):
    if i>0:
        for idx,td in enumerate(tr.find_all('td')):
            if idx==2:
                print(td.get_text().replace('(Registered)','').strip())
I think you can do as follows :
from bs4 import BeautifulSoup
html = """<TD colSpan=3>Desinger </TD></TR>
<TR>
<TD class=row2bold width="5%"> </TD>
<TD class=row2bold width="30%" align=left>Gender </TD>
<TD class=row1 width="20%" align=left>Male </TD></TR>
<TR>
<TD class=row2bold width="5%"> </TD>
<TD class=row2bold width="30%" align=left>Born Country </TD>
<TD class=row1 width="20%" align=left>DE </TD></TR></TBODY></TABLE></TD>
<TD height="100%" vAlign=top>
<TABLE class=tableforms>
<TBODY>
<TR class=tableheader>
<TD colSpan=4>Remarks </TD></TR>
<TR class=tableheader>
<TD width="12%"> </TD>
<TD style="TEXT-ALIGN: left" width="12%">Group </TD>
<TD style="TEXT-ALIGN: left" width="15%">Model Type </TD>
<TD style="TEXT-ALIGN: left" width="15%">Design Year </TD></TR>
<TR class=row1>
<TD width="10%"> </TD>
<TD class=row1>South West</TD>
<TD>VIP QB662FG (Registered) </TD>
<TD>2013 (Registered) </TD></TR></TBODY></TABLE></TD></TR>"""
soup = BeautifulSoup(html, "html.parser")
soup = soup.find('table',{'class':'tableforms'})
dico = {}
l1 = soup.findAll('tr')[1].findAll('td')
l2 = soup.findAll('tr')[2].findAll('td')
for i in range(len(l1)):
    dico[l1[i].getText().strip()] = l2[i].getText().replace('(Registered)','').strip()
print dico['Model Type']
It prints: VIP QB662FG

scraping tables with beautifulsoup

I seem to be stuck. If I had the following table:
<table align=center cellpadding=3 cellspacing=0 border=1>
<tr bgcolor="#EEEEFF">
<td align="center">
40 </td>
<td align="center">
44 </td>
<td align="center">
<font color="green"><b>+4</b></font>
</td>
<td align="center">
1,000</td>
<td align="center">
15,000 </td>
<td align="center">
44,000 </td>
<td align="center">
<font color="green"><b><nobr>+193.33%</nobr></b></font>
</td>
</tr>
what would be the ideal way to use find_all to pull the 44,000 td from the table?
If it is a recurring position in the table that you would like to scrape, I would use Beautiful Soup to extract all the td elements in the table and then index into them. See the sketch below, where soup is the parsed BeautifulSoup object.
known_position = 5
tds = soup.find_all('td')
number = tds[known_position].text.strip()
On the other hand, if you're specifically searching for a given number, I would just iterate over the list.
tds = soup.find_all('td')
for td in tds:
    if td.text.strip() == 'number here':
        pass # do your stuff
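Both ideas can be tried against the table from the question. The sketch below parses that table (closed with a `</table>` tag here so the sample is complete) and picks the cell both by position and by value; index 5 is simply where 44,000 happens to sit in this row.

```python
from bs4 import BeautifulSoup

# The table from the question, with a closing </table> added.
html = """<table align=center cellpadding=3 cellspacing=0 border=1>
<tr bgcolor="#EEEEFF">
<td align="center">
40 </td>
<td align="center">
44 </td>
<td align="center">
<font color="green"><b>+4</b></font>
</td>
<td align="center">
1,000</td>
<td align="center">
15,000 </td>
<td align="center">
44,000 </td>
<td align="center">
<font color="green"><b><nobr>+193.33%</nobr></b></font>
</td>
</tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")
tds = soup.find_all("td")

# By position: the sixth cell (index 5) of the row.
known_position = 5
number = tds[known_position].get_text(strip=True)
print(number)

# By value: collect every cell whose stripped text equals the wanted number.
matches = [td for td in tds if td.get_text(strip=True) == "44,000"]
print(len(matches))
```

Stripping matters here because the cell text in the source is surrounded by newlines and spaces.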
