Python - multiple HTML page parsing to a single TXT file - python

I'm trying to parse specific content from an X number of HTML files to a single TXT file.
I have dirtily coded the following:
#!/usr/bin/python
import sys, mechanize, BeautifulSoup
def parsedata():
##do stuff
prvitekst = soup.find(text='Random Number')
prvikesh = prvitekst.findNextSiblings('td')
drugitekst = soup.find(text='Random Month/Yeare')
drugikesh = drugitekst.findNextSiblings('td')
trechitekst = soup.find(text='Small Random Number')
trechikesh = trechitekst.findNextSiblings('td')
content = prvikesh + ";" + drugikesh + ";" + trechikesh + ";"
writeFile(content);
def readFile(id):
fi = open('result/page-%s.html' % id, 'r');
def writeFile(content):
f = open('parsed.txt', 'a')
f.write(content,"\n")
f.close();
def main(start):
##initialize vars
id = int(start)
page = readFile(id)
soup = BeautifulSoup(page)
print soup.prettify()
readFile(id)
for id in range(1000000000):
parsedata()
id = id + 1
continue
main(sys.argv[1]);
While the HTML part im trying to scrape looks like this
<tr style="height:40px; background-color:#f0f0f0;"><td colspan="4" class="textLargeBold" style="border-bottom: solid 1px #c4c4c4;">Random Details</td></tr>
<tr class="text">
<td align="left" valign="top"><b>Type</b></td>
<td align="left" valign="top">Color</td>
<td align="left" valign="top"><b>Random Number</b></td>
<td align="left" valign="top">213523123123123</td>
</tr>
<tr class="text"
<td align="left" valign="top"><b>Random Month/Year</b></td>
<td align="left" valign="top">12/13</td>
<td align="left" valign="top"><b>Small Random Number</b></td>
<td align="left" valign="top">13233</td>
</tr>
I want the details that come after the first one. thus if I'm searching for Typem I want it to show me Color.
and ofcourse in the end I'd like the gotten contents to be parsed in a format similar to CSV.
Type;Random Number;Random Month/Year
it should parse
Color;213523123123123;12/13
ofcourse in the code i made already im not searching for type, but that can easily be changed.
EDIT : Fixed intendation

html='''
<tr style="height:40px; background-color:#f0f0f0;"><td colspan="4" class="textLargeBold" style="border-bottom: solid 1px #c4c4c4;">Random Details</td></tr> <tr class="text"> <td align="left" valign="top"><b>Type</b></td> <td align="left" valign="top">Color</td> <td align="left" valign="top"><b>Random Number</b></td> <td align="left" valign="top">213523123123123</td> </tr> <tr class="text" <td align="left" valign="top"><b>Random Month/Year</b></td> <td align="left" valign="top">12/13</td> <td align="left" valign="top"><b>Small Random Number</b></td> <td align="left" valign="top">13233</td> </tr>
'''
import htql;
a=[x for x in htql.HTQL(html, "<b sep>2-0 {name=<b>:tx; value=<td>1:tx } ")];
a

Related

How to beautifulsoup in this case without class or id

How to get the text of 'Wow, you get it!' i can print the Date, but i cant get the td that come next of the date.
<table border="0" cellpadding="4" cellspacing="1" width="100%">
<tr bgcolor="#505050">
<td class="white" colspan="2">
<b>
Account Here
</b>
</td>
</tr>
<tr bgcolor="#F1E0C6">
<td colspan="2">
There is nothing
</td>
</tr>
</table>
<br/>
<br/>
<table border="0" cellpadding="4" cellspacing="1" width="100%">
<tr bgcolor="#505050">
<td class="white" colspan="2">
<b>
Death
</b>
</td>
</tr>
<tr bgcolor="#F1E0C6">
<td valign="top" width="25%">
Aug 15 2021, 18:36:22 CEST
</td>
<td>
Wow, you get it!
</td>
</tr>
<tr bgcolor="#D4C0A1">
<td valign="top" width="25%">
Aug 01 2021, 21:25:39 CEST
</td>
<td>
Next Time
</td>
</tr>
</table>
i got the date with this code:
print(soup.find_all('td', {'valign': 'top'})[0].get_text())
show this
Aug 15 2021, 18:36:22 CEST
but i cant find any solution to get the next td of the date
If html_doc contains the HTML snippet from the question:
soup = BeautifulSoup(html_doc, "html.parser")
txt = soup.select_one('td[valign="top"] + td').get_text(strip=True)
print(txt)
Prints:
Wow, you get it!
Or:
txt = soup.find("td", {"valign": "top"}).find_next("td").get_text(strip=True)

BeautifulSoup returns [ ] when string is present

I am scraping a web page and it worked pretty good except a part where re.compile() returns empty [] while the text passed to it is present. Here is my scrape code
dob = soup.find(text = re.compile('Date of Birth')).findNext('td').text
print(dob)
father_name = soup.find(text = re.compile("Father's Name")).findNext('td').text
print(father_name)
mob_no_parent = soup.find(text = re.compile("Mobile Number")).findNext('td').text
print(mob_no_parent)
mob_no_student = soup.findAll(text = re.compile("Mobile Number(Student)"))
print(mob_no_student)
email = soup.find(text = re.compile("E - Mail Address")).findNext('td').text
print(email)
p_address = soup.find(text = re.compile("PermanentAddress")).findNext('td').text
print(p_address)
The above code works fine on every text except
mob_no_student = soup.findAll(text = re.compile("Mobile Number(Student)"))
print(mob_no_student)
The above one returns []
Here is my html code
<td align="left" width="50%" class="inner_padding_even"> Registration No </td>
<td align="left" width="50%" class="inner_padding_even">CPT0000</td>
</tr>
<tr>
<td align="left" width="50%" class="inner_padding_odd"> Name of Candidate</td>
<td align="left" width="50%" class="inner_padding_odd"><font face=arial size=2>KKKKKKK B.</font></td>
</tr>
<tr>
<td align="left" class="inner_padding_even"> Date of Birth</td>
<td align="left" class="inner_padding_even">16.11.1900</td>
</tr>
<tr>
<td align="left" class="inner_padding_even"> Father's Name</td>
<td align="left" class="inner_padding_even">BBBBBBBB.</td>
</tr>
<tr>
<td align="left" class="inner_padding_even"> Mobile Number</font>(Parent)</td>
<td align="left" class="inner_padding_even">99999999999</td>
</tr>
<tr>
<td align="left" class="inner_padding_odd"> Mobile Number(Student)</td>
<td align="left" class="inner_padding_odd">9999999999</td>
</tr>
<tr>
<td align="left" class="inner_padding_even"> E - Mail Address</td>
<td align="left" class="inner_padding_even">keyansgm#gmail.com</td>
</tr>
<tr>
<td width="50%" align="left" class="inner_padding_even"> Permanent Address</td>
<td width="50%" align="left" class="inner_padding_even">Blah blah</td>
</tr>
What am i missing here ?
In a regex, you need to escape brackets, If not it will refer to a group
Try this
mob_no_student = soup.findAll(text = re.compile("Mobile Number\(Student\)"))

parse table using beautifulsoup in python

I want to traverse through each row and capture values of td.text. However problem here is table does not have class. and all the td got same class name. I want to traverse through each row and want following output:
1st row)"AMERICANS SOCCER CLUB","B11EB - AMERICANS-B11EB-WARZALA","Cameron Coya","Player 228004","2016-09-10","player persistently infringes the laws of the game","C" (new line)
2nd row) "AVIATORS SOCCER CLUB","G12DB - AVIATORS-G12DB-REYNGOUDT","Saskia Reyes","Player 224463","2016-09-11","player/sub guilty of unsporting behavior"," C" (new line)
<div style="overflow:auto; border:1px #cccccc solid;">
<table cellspacing="0" cellpadding="3" align="left" border="0" width="100%">
<tbody>
<tr class="tblHeading">
<td colspan="7">AMERICANS SOCCER CLUB</td>
</tr>
<tr bgcolor="#CCE4F1">
<td colspan="7">B11EB - AMERICANS-B11EB-WARZALA</td>
</tr>
<tr bgcolor="#FFFFFF">
<td width="19%" class="tdUnderLine"> Cameron Coya </td>
<td width="19%" class="tdUnderLine">
Rozel, Max
</td>
<td width="06%" class="tdUnderLine">
09-11-2016
</td>
<td width="05%" class="tdUnderLine" align="center">
228004
</td>
<td width="16%" class="tdUnderLine" align="center">
09/10/16 02:15 PM
</td>
<td width="30%" class="tdUnderLine"> player persistently infringes the laws of the game </td>
<td class="tdUnderLine"> Cautioned </td>
</tr>
<tr class="tblHeading">
<td colspan="7">AVIATORS SOCCER CLUB</td>
</tr>
<tr bgcolor="#CCE4F1">
<td colspan="7">G12DB - AVIATORS-G12DB-REYNGOUDT</td>
</tr>
<tr bgcolor="#FBFBFB">
<td width="19%" class="tdUnderLine"> Saskia Reyes </td>
<td width="19%" class="tdUnderLine">
HollaenderNardelli, Eric
</td>
<td width="06%" class="tdUnderLine">
09-11-2016
</td>
<td width="05%" class="tdUnderLine" align="center">
224463
</td>
<td width="16%" class="tdUnderLine" align="center">
09/11/16 06:45 PM
</td>
<td width="30%" class="tdUnderLine"> player/sub guilty of unsporting behavior </td>
<td class="tdUnderLine"> Cautioned </td>
</tr>
<tr class="tblHeading">
<td colspan="7">BERGENFIELD SOCCER CLUB</td>
</tr>
<tr bgcolor="#CCE4F1">
<td colspan="7">B11CW - BERGENFIELD-B11CW-NARVAEZ</td>
</tr>
<tr bgcolor="#FFFFFF">
<td width="19%" class="tdUnderLine"> Christian Latorre </td>
<td width="19%" class="tdUnderLine">
Coyle, Kevin
</td>
<td width="06%" class="tdUnderLine">
09-10-2016
</td>
<td width="05%" class="tdUnderLine" align="center">
226294
</td>
<td width="16%" class="tdUnderLine" align="center">
09/10/16 11:00 AM
</td>
<td width="30%" class="tdUnderLine"> player persistently infringes the laws of the game </td>
<td class="tdUnderLine"> Cautioned </td>
</tr>
I tried with following code.
import requests
from bs4 import BeautifulSoup
import re
try:
import urllib.request as urllib2
except ImportError:
import urllib2
url = r"G:\Freelancer\NC Soccer\Northern Counties Soccer Association ©.html"
page = open(url, encoding="utf8")
soup = BeautifulSoup(page.read(),"html.parser")
#tableList = soup.findAll("table")
for tr in soup.find_all("tr"):
for td in tr.find_all("td"):
print(td.text.strip())
but it is obvious that it will return text form all td and I will not able to identify particular column name or will not able to determine start of new record. I want to know
1) how to identify each column(because class name is same) and there are headings as well (I will appreciate if you provide code for that)
2) how to identify new record in such structure
If the data is really structured like a table, there's a good chance you can read it into pandas directly with pd.read_table(). Note that it accepts urls in the filepath_or_buffer argument.
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_table.html
count = 0
string = ""
for td in soup.find_all("td"):
string += "\""+td.text.strip()+"\","
count +=1
if(count % 9 ==0):
print string[:-1] + "\n\n" # string[:-1] to remove the last ","
string = ""
as the table is not in the proper required format we shall just go with the td rather than going into each row then going into td in each row which complicates the work. I just used a string you can append the data into a list of lists and get process it for later use.
Hope this solves your problem
from __future__ import print_function
import re
import datetime
from bs4 import BeautifulSoup
soup = ""
with open("/tmp/a.html") as page:
soup = BeautifulSoup(page.read(),"html.parser")
table = soup.find('div', {'style': 'overflow:auto; border:1px #cccccc solid;'}).find('table')
trs = table.find_all('tr')
table_dict = {}
game = ""
section = ""
for tr in trs:
if tr.has_attr('class'):
game = tr.text.strip('\n')
if tr.has_attr('bgcolor'):
if tr['bgcolor'] == '#CCE4F1':
section = tr.text.strip('\n')
else:
tds = tr.find_all('td')
extracted_text = [re.sub(r'([^\x00-\x7F])+','', td.text) for td in tds]
extracted_text = [x.strip() for x in extracted_text]
extracted_text = list(filter(lambda x: len(x) > 2, extracted_text))
extracted_text.pop(1)
extracted_text[2] = "Player " + extracted_text[2]
extracted_text[3] = datetime.datetime.strptime(extracted_text[3], '%m/%d/%y %I:%M %p').strftime("%Y-%m-%d")
extracted_text = ['"' + x + '"' for x in [game, section] + extracted_text]
print(','.join(extracted_text))
And when run:
$ python a.py
"AMERICANS SOCCER CLUB","B11EB - AMERICANS-B11EB-WARZALA","Cameron Coya","Player 228004","2016-09-10","player persistently infringes the laws of the game","C"
"AVIATORS SOCCER CLUB","G12DB - AVIATORS-G12DB-REYNGOUDT","Saskia Reyes","Player 224463","2016-09-11","player/sub guilty of unsporting behavior","C"
"BERGENFIELD SOCCER CLUB","B11CW - BERGENFIELD-B11CW-NARVAEZ","Christian Latorre","Player 226294","2016-09-10","player persistently infringes the laws of the game","C"
Based on further conversation with the OP, the input was https://paste.fedoraproject.org/428111/87928814/raw/ and the output after running the above code is: https://paste.fedoraproject.org/428110/38792211/raw/
There seems to be a pattern. After every 7 tr(s), there is a new line.
So, what you can do is keep a counter starting from 1, when it touches 7, append a new line and restart it to 0.
counter = 1
for tr in find_all("tr"):
for td in tr.find_all("td"):
# place code
counter = counter + 1
if counter == 7:
print "\n"
counter = 1

Using BeautifulSoup to pick up text in table, on webpages

I want to use BeautifulSoup to pick up the ‘Model Type’ values on company’s webpages which from codes like below:
it forms 2 tables shown on the webpage, side by side.
updated source code of the webpage
<TR class=tableheader>
<TD width="12%"> </TD>
<TD style="TEXT-ALIGN: left" width="12%">Group </TD>
<TD style="TEXT-ALIGN: left" width="15%">Model Type </TD>
<TD style="TEXT-ALIGN: left" width="15%">Design Year </TD></TR>
<TR class=row1>
<TD width="10%"> </TD>
<TD class=row1>South West</TD>
<TD>VIP QB662FG (Registered) </TD>
<TD>2013 (Registered) </TD></TR></TBODY></TABLE></TD></TR>
I am using following however it doesn’t get the ‘VIP QB662FG’ wanted:
from bs4 import BeautifulSoup
import urllib2
url = "http://www.thewebpage.com"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
find_it = soup.find_all(text=re.compile("Model Type "))
the_value = find_it[0].findNext('td').contents[0]
print the_value
in what way I can get it? I'm using Python 2.7.
You are looking for the next row, then the next cell in the same position. The latter is tricky; we could assume it is always the 3rd column:
header_text = soup.find(text=re.compile("Model Type "))
value = header_cell.find_next('tr').select('td:nth-of-type(3)')[0].get_text()
If you just ask for the next td, you get the Design Year column instead.
There could well be better methods to get to your one cell; if we assume there is only one tr row with the class row1, for example, the following would get your value in one step:
value = soup.select('tr.row1 td:nth-of-type(3)')[0].get_text()
Find all tr's and output it's third child unless it's first row
import bs4
data = """
<TR class=tableheader>
<TD width="12%"> </TD>
<TD style="TEXT-ALIGN: left" width="12%">Group </TD>
<TD style="TEXT-ALIGN: left" width="15%">Model Type </TD>
<TD style="TEXT-ALIGN: left" width="15%">Design Year </TD></TR>
<TR class=row1>
<TD width="10%"> </TD>
<TD class=row1>South West</TD>
<TD>VIP QB662FG (Registered) </TD>
<TD>2013 (Registered) </TD>
"""
soup = bs4.BeautifulSoup(data)
#table = soup.find('tr', {'class':'tableheader'}).parent
table = soup.find('table', {'class':'tableforms'})
for i,tr in enumerate(table.findChildren()):
if i>0:
for idx,td in enumerate(tr.findChildren()):
if idx==2:
print td.get_text().replace('(Registered)','').strip()
I think you can do as follows :
from bs4 import BeautifulSoup
html = """<TD colSpan=3>Desinger </TD></TR>
<TR>
<TD class=row2bold width="5%"> </TD>
<TD class=row2bold width="30%" align=left>Gender </TD>
<TD class=row1 width="20%" align=left>Male </TD></TR>
<TR>
<TD class=row2bold width="5%"> </TD>
<TD class=row2bold width="30%" align=left>Born Country </TD>
<TD class=row1 width="20%" align=left>DE </TD></TR></TBODY></TABLE></TD>
<TD height="100%" vAlign=top>
<TABLE class=tableforms>
<TBODY>
<TR class=tableheader>
<TD colSpan=4>Remarks </TD></TR>
<TR class=tableheader>
<TD width="12%"> </TD>
<TD style="TEXT-ALIGN: left" width="12%">Group </TD>
<TD style="TEXT-ALIGN: left" width="15%">Model Type </TD>
<TD style="TEXT-ALIGN: left" width="15%">Design Year </TD></TR>
<TR class=row1>
<TD width="10%"> </TD>
<TD class=row1>South West</TD>
<TD>VIP QB662FG (Registered) </TD>
<TD>2013 (Registered) </TD></TR></TBODY></TABLE></TD></TR>"""
soup = BeautifulSoup(html, "html.parser")
soup = soup.find('table',{'class':'tableforms'})
dico = {}
l1 = soup.findAll('tr')[1].findAll('td')
l2 = soup.findAll('tr')[2].findAll('td')
for i in range(len(l1)):
dico[l1[i].getText().strip()] = l2[i].getText().replace('(Registered)','').strip()
print dico['Model Type']
It prints : u'VIP QB662FG'

Cant extract tables from a html code

I am working to parse a html table given below(its a section of complete html code) But the code is not working. Can some one please help me.There is an error saying "table has no attribute findall".
The code is:
import re
import HTMLParser
from urllib2 import urlopen
import urllib2
from bs4 import BeautifulSoup
url = 'http://164.100.47.132/LssNew/Members/Biography.aspx?mpsno=4064'
url_data = urlopen(url).read()
html_page = urllib2.urlopen(url)
soup = BeautifulSoup(html_page)
title = soup.title
final_tit = title.string
table = soup.find('table',id = "ctl00_ContPlaceHolderMain_Bioprofile1_Datagrid1")
tr = table.findall('tr')
for tr in table:
cols = tr.findAll('td')
for td in cols:
text = ''.join(td.find(text=True))
print text+"|",
print
<table style="WIDTH: 565px">
<tr>
<td vAlign="top" align="left"><img id="ctl00_ContPlaceHolderMain_Bioprofile1_Image1" src="http://164.100.47.132/mpimage/photo/4064.jpg" style="height:140px;border-width:0px;" /></td>
<td vAlign="top"><table cellspacing="0" rules="all" border="2" id="ctl00_ContPlaceHolderMain_Bioprofile1_Datagrid1" style="border-color:#FAE3C3;border-width:2px;border-style:Solid;width:433px;border-collapse:collapse;">
<tr>
<td>
<table align="center" height="30px">
<tr valign="top">
<td align="center" valign="top" class="gridheader1">Aaroon Rasheed,Shri J.M.</td>
</tr>
</table>
<table height="110px">
<tr>
<td align="left" class="darkerb" width="133px" valign="top">Constituency :</td>
<td align="left" valign="top" class="griditem2" width="300px">Theni (Tamil Nadu )</td>
</tr>
<tr>
<td align="left" width="133px" class="darkerb" valign="top">
Party Name :</td>
<td align="left" width="300px" valign="top" class="griditem2">Indian National Congress(INC)</td>
</tr>
<tr>
<td align="left" class="darkerb" valign="top" width="133px">
Email Address :
</td>
<td align="left" valign="top" class="griditem2" width="300px">jm.aaronrasheed#sansad.nic.in</td>
</tr>
</table>
</td>
</tr>
</table></td>
</tr>
</table>
The method is called find_all(), not findall:
tr = table.find_all('tr')

Categories