Scrape Table HTML with beautifulSoup - python

I'm trying to scrape a website that is built entirely with tables. Here is a link to an example page: http://www.rc2.vd.ch/registres/hrcintapp-pub/companyReport.action?rcentId=5947621600000055031025&lang=FR&showHeader=false
My goal is to get the last name and first name: Lass Christian (screenshot below).
[![enter image description here][1]][1]
[1]: http://i.stack.imgur.com/q3nMb.png
I've already scraped many websites, but with this one I have absolutely no idea how to proceed. There are only tables without any id or class attributes, and I can't figure out where to start.
Here's an example of the HTML code:
<table border="1" cellpadding="1" cellspacing="0" width="100%">
<tbody><tr bgcolor="#f0eef2">
<th colspan="3">Associés, gérants et personnes ayant qualité pour signer</th>
</tr>
<tr bgcolor="#f0eef2">
<th>
<a class="hoverable" onclick="document.forms[0].rcentId.value='5947621600000055031025';document.forms[0].lang.value='FR';document.forms[0].searchLang.value='FR';document.forms[0].order.value='N';document.forms[0].rad.value='N';document.forms[0].goToAdm.value='true';document.forms[0].showHeader.value=false;document.forms[0].submit();event.returnValue=false; return false;">
Nom et Prénoms, Origine, Domicile, Part sociale
</a>
</th>
<th>
<a class="hoverable" onclick="document.forms[0].rcentId.value='5947621600000055031025';document.forms[0].lang.value='FR';document.forms[0].searchLang.value='FR';document.forms[0].order.value='F';document.forms[0].rad.value='N';document.forms[0].goToAdm.value='true';document.forms[0].showHeader.value=false;document.forms[0].submit();event.returnValue=false; return false;">
Fonctions
</a>
<img src="/registres/hrcintapp-pub/img/down_r.png" align="bottom" border="0" alt="">
</th>
<th>Mode Signature</th>
</tr>
<tr bgcolor="#ffffff">
<td>
<span style="text-decoration: none;">
Lass Christian, du Danemark, à Yverdon-les-Bains, avec 200 parts de CHF 100
</span>
</td>
<td><span style="text-decoration: none;">associé gérant </span> </td>
<td><span style="text-decoration: none;">signature individuelle</span> </td>
</tr>
</tbody></table>

This will get the name from the page. The table comes right after the anchor with the id adm; once you have located it, there are numerous ways to get what you need:
from bs4 import BeautifulSoup
import requests
r = requests.get('http://www.rc2.vd.ch/registres/hrcintapp-pub/companyReport.action?rcentId=5947621600000055031025&lang=FR&showHeader=false')
soup = BeautifulSoup(r.content,"lxml")
table = soup.select_one("#adm").find_next("table")
name = table.select_one('td span[style^="text-decoration:"]').text.split(",", 1)[0].strip()
print(name)
Output:
Lass Christian
Or:
table = soup.select_one("#adm").find_next("table")
name = table.find("tr",bgcolor="#ffffff").td.span.text.split(",", 1)[0].strip()

Something like this?
results = soup.find_all("tr", {"bgcolor" : "#ffffff"})
for result in results:
    # the span text starts with a newline, so strip it after splitting
    the_name = result.td.span.get_text().split(',')[0].strip()
    print(the_name)
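For anyone who wants to try the row-based approach without hitting the live site, here is a minimal offline sketch. It assumes the page looks like the HTML excerpt in the question (trimmed here), uses the built-in html.parser instead of lxml, and skips rows that have no td so a header row cannot raise an AttributeError. The names list is only an illustrative variable.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the table from the question (assumed representative of the page).
html = """
<table border="1" cellpadding="1" cellspacing="0" width="100%">
<tbody><tr bgcolor="#f0eef2">
<th colspan="3">Associés, gérants et personnes ayant qualité pour signer</th>
</tr>
<tr bgcolor="#ffffff">
<td><span style="text-decoration: none;">
Lass Christian, du Danemark, à Yverdon-les-Bains, avec 200 parts de CHF 100
</span></td>
<td><span style="text-decoration: none;">associé gérant </span> </td>
<td><span style="text-decoration: none;">signature individuelle</span> </td>
</tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")

names = []
for row in soup.find_all("tr", bgcolor="#ffffff"):
    first_cell = row.find("td")
    if first_cell is None:
        # Row without data cells (for example a header row): nothing to read.
        continue
    # The name is everything before the first comma in the first cell.
    text = first_cell.get_text(strip=True)
    names.append(text.split(",", 1)[0].strip())

print(names)
```

If the real page has several people listed, each white row should contribute one entry to the list.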

Related

How to get a text of certain elements BeautifulSoup Python

I have this kind of html code
<tr>
<td class="a">...</td>
<td class="a">...</td>
<td class="a">
<p>
<sup>
Name Name Name
</sup>
</p>
</td>
<td class="a">...</td>
<td class="a">...</td>
<td class="a">
<p>
<sup>25.01.1980</sup>
</p>
</td>
<td class="a">...</td>
<td class="a">...</td>
</tr>
<tr>...</tr>
<tr>...</tr>
I need to get the text of the 3rd and 5th td of every tr.
Apparently this doesn't work :)
from bs4 import BeautifulSoup
import index
soup = BeautifulSoup(index.index_doc, 'lxml')
for i in soup.find_all('tr')[2:]:
    print(i[2].text, i[4].text)
You could use CSS selectors and the pseudo-class :nth-of-type() to select your elements (I assumed you need the date, so I selected the 6th td):
data = [e.get_text(strip=True) for e in soup.select('tr td:nth-of-type(3),tr td:nth-of-type(6)')]
And to get a list of (name, date) tuples, pair the even-indexed entries with the odd-indexed ones:
list(zip(data[::2], data[1::2]))
Example
from bs4 import BeautifulSoup
html = '''
<tr>
<td class="a">...</td>
<td class="a">...</td>
<td class="a">
<p>
<sup>
Name Name Name
</sup>
</p>
</td>
<td class="a">...</td>
<td class="a">...</td>
<td class="a">
<p>
<sup>25.01.1980</sup>
</p>
</td>
<td class="a">...</td>
<td class="a">...</td>
</tr>
<tr>...</tr>
<tr>...</tr>
'''
soup = BeautifulSoup(html, 'html.parser')
data = [e.get_text(strip=True) for e in soup.select('tr td:nth-of-type(3),tr td:nth-of-type(6)')]
# pair each name with the date that follows it
list(zip(data[::2], data[1::2]))
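An alternative that avoids pairing up a flat list is to go row by row and index the cells directly, which is close to what the question's loop was attempting. This is only a sketch: the sample table and its values are made up for illustration, and rows with fewer than six cells are simply skipped.

```python
from bs4 import BeautifulSoup

# Hypothetical sample data shaped like the rows in the question.
html = """
<table>
<tr><td>a</td><td>b</td><td><p><sup> Name One </sup></p></td><td>d</td><td>e</td><td><p><sup>25.01.1980</sup></p></td></tr>
<tr><td>a</td><td>b</td><td><p><sup>Name Two</sup></p></td><td>d</td><td>e</td><td><p><sup>01.02.1990</sup></p></td></tr>
<tr><td>only one cell</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
for tr in soup.find_all("tr"):
    # A Tag cannot be indexed by position, so collect its direct td children first.
    cells = tr.find_all("td", recursive=False)
    if len(cells) < 6:
        continue  # skip short rows instead of raising IndexError
    rows.append((cells[2].get_text(strip=True), cells[5].get_text(strip=True)))

print(rows)
```

Because each tuple is built from a single row, a row that is missing one of the two cells cannot shift the pairing for the rows after it.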

How to extract text from a table by moving between <tr> tags using BeautifulSoup

I need to extract text from a table using BeautifulSoup.
Below is the code which I have written and output
HTML:
<div class="Tech">
<div class="select">
<span>Selection is mandatory</span>
</div>
<table id="product">
<tbody>
<tr class="feature">
<td class="title" rowspan="3">
<h2>Information</h2>
</td>
<td class="label">
<h3>Design</h3>
</td>
<td class="checkbox">product</td>
</tr>
<tr>
<td class="label">
<h3>Marque</h3>
</td>
<td class="checkbox">
<input type="checkbox">
<label>retro</label>
<a href="link">
Landlord
</a>
</td>
</tr>
<tr>
<td class="label">
<h3>Model</h3>
</td>
<td class="checkbox">model123</td>
</tr>
import requests
from bs4 import BeautifulSoup
url='someurl.com'
source2= requests.get(url,timeout=30).text
soup2=BeautifulSoup(source2,'lxml')
element2= soup2.find('div',class_='Tech')
pin= element2.find('table',id='product').tbody.tr.text
print(pin)
Output that I am getting is:
Information
Design
product
How do I move between <tr>s? I need the output to be: model123.
To get output model123, you can try:
# search for the <h3> that contains "Model" (:-soup-contains replaces the deprecated :contains)
h3 = soup.select_one('h3:-soup-contains("Model")')
# search next <td>
model = h3.find_next("td").text
print(model)
Prints:
model123
Or without CSS selectors:
model = (
    soup.find(lambda tag: tag.name == "h3" and tag.text.strip() == "Model")
    .find_next("td")
    .text
)
print(model)
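If you need more than one value, another option is to walk every tr and build a label-to-value mapping. The sketch below assumes each row of interest has one td with class label and one with class checkbox, as in the HTML from the question (closed off here so it parses cleanly); the features dict is just an illustrative name.

```python
from bs4 import BeautifulSoup

# The table from the question, condensed and with the closing tags added.
html = """<table id="product"><tbody>
<tr class="feature"><td class="title" rowspan="3"><h2>Information</h2></td><td class="label"><h3>Design</h3></td><td class="checkbox">product</td></tr>
<tr><td class="label"><h3>Marque</h3></td><td class="checkbox"><input type="checkbox"><label>retro</label><a href="link"> Landlord </a></td></tr>
<tr><td class="label"><h3>Model</h3></td><td class="checkbox">model123</td></tr>
</tbody></table>"""

soup = BeautifulSoup(html, "html.parser")

features = {}
for tr in soup.select("table#product tr"):
    label = tr.find("td", class_="label")
    value = tr.find("td", class_="checkbox")
    if label is None or value is None:
        continue  # row does not follow the label/value layout
    # Join nested pieces of text with a space so "retro" and "Landlord" stay separate.
    features[label.get_text(strip=True)] = value.get_text(" ", strip=True)

print(features["Model"])
```

This moves between the rows explicitly, so any label can be looked up afterwards instead of only Model.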

Finding sibling tag in BeautifulSoup with no attributes

Sorry, kind of a beginner question about BeautifulSoup, but I can't find the answer.
I'm having trouble figuring out how to scrape HTML tags without attributes.
Here's the section of code.
<tr bgcolor="#ffffff">
<td>
No-Lobbying List
</td>
<tr bgcolor="#efefef">
<td rowspan="2" valign="top">
6/24/2019
</td>
<td>
<a href="document.cfm?id=322577" target="_blank">
Brian Manley, Chief of Police, Austin Police Department
</a>
<a href="document.cfm?id=322577" target="_blank">
<img alt="Click here to download the PDF document" border="0" height="16" src="https://assets.austintexas.gov/edims/images/pdf_icon.gif" width="16"/>
</a>
</td>
<tr bgcolor="#efefef">
<td>
Preliminary 2018 Annual Crime Report - Executive Summary
</td>
</tr>
</tr>
</tr>
How can I navigate to the tag with the text "Preliminary 2018 Annual Crime Report - Executive Summary"?
I have tried moving from a tag with an attribute and using .next_sibling, but I've failed miserably.
Thank you.
trgrewy = soup.findAll('tr', {'bgcolor':'#efefef'}) #the cells alternate colors
trwhite = soup.findAll('tr', {'bgcolor':'#ffffff'})
trs = trgrewy + trwhite #merge them into a list
for item in trs:
    mdate = item.find('td', {'rowspan':'2'}) #find if it's today's date
    if mdate:
        datetime_object = datetime.strptime(mdate.text, '%m/%d/%Y')
        if datetime_object.date() == now.date():
            sender = item.find('a').text
            pdf = item.find('a')['href']
            link = baseurl + pdf
            title = item.findAll('td')[2] #this is where i've failed
You can use CSS selectors:
data = '''
<tr bgcolor="#ffffff">
<td>
No-Lobbying List
</td>
<tr bgcolor="#efefef">
<td rowspan="2" valign="top">
6/24/2019
</td>
<td>
<a href="document.cfm?id=322577" target="_blank">
Brian Manley, Chief of Police, Austin Police Department
</a>
<a href="document.cfm?id=322577" target="_blank">
<img alt="Click here to download the PDF document" border="0" height="16" src="https://assets.austintexas.gov/edims/images/pdf_icon.gif" width="16"/>
</a>
</td>
<tr bgcolor="#efefef">
<td>
Preliminary 2018 Annual Crime Report - Executive Summary
</td>
</tr>
</tr>
</tr>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'lxml')
# This will find date
print(soup.select_one('td[rowspan="2"]').get_text(strip=True))
# This will find next row after the row with date
print(soup.select_one('tr:has(td[rowspan="2"]) + tr').get_text(strip=True))
Prints:
6/24/2019
Preliminary 2018 Annual Crime Report - Executive Summary
Further reading:
CSS Selectors Reference
I think you should try this:
page = BeautifulSoup(HTML_TEXT, 'html.parser')
# direct text nodes of the first <td>, not text from nested tags
text = page.find('td').find_all(string=True, recursive=False)
for i in text:
    print(i)
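The same navigation can also be done without CSS selectors, using find_parent and find_next_sibling. This sketch assumes the rows end up as siblings once parsed. That appears to be what the lxml parser does with the nested markup above, but here the sample is written with properly closed rows and parsed with html.parser to keep the example predictable.

```python
from bs4 import BeautifulSoup

# Rows from the question, rewritten as well-formed sibling rows (an assumption).
html = """<table>
<tr bgcolor="#ffffff"><td>No-Lobbying List</td></tr>
<tr bgcolor="#efefef"><td rowspan="2" valign="top">6/24/2019</td>
<td><a href="document.cfm?id=322577" target="_blank">Brian Manley, Chief of Police, Austin Police Department</a></td></tr>
<tr bgcolor="#efefef"><td>Preliminary 2018 Annual Crime Report - Executive Summary</td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")

# The date cell is the only td with rowspan="2".
date_cell = soup.find("td", rowspan="2")
date_row = date_cell.find_parent("tr")

# find_next_sibling("tr") skips the whitespace text nodes between rows,
# which is what trips up a bare .next_sibling.
title_row = date_row.find_next_sibling("tr")

date = date_cell.get_text(strip=True)
sender = date_row.a.get_text(strip=True)
title = title_row.td.get_text(strip=True)
print(date, sender, title, sep="\n")
```

A plain .next_sibling often returns the newline string between two tags rather than the next tag, which is a common reason that approach seems to fail.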

Parsing nested tags with BeautifulSoup and requests

I'm new to BeautifulSoup. I am trying to parse an HTML web page fetched with requests. The code I have so far is:
import requests
from bs4 import BeautifulSoup
link = "SOME_URL"
f = requests.get(link)
soup = BeautifulSoup(f.text, 'html.parser')
for el in (soup.findAll("td",{"class": "g-res-tab-cell"})):
    print(el)
exit
The output is as follows:
<td class="g-res-tab-cell">
<div style="padding:8px;">
<div style="padding-top:8px;">
<table cellspacing="0" cellpadding="0" border="0" style="width:100%;">
<tr>
<td valign="top">
<div itemscope itemtype="URL">
<table cellspacing="0" cellpadding="0" style="width:100%;">
<tr>
<td valign="top" class="g-res-tab-cell" style="width:100%;">
<div style="width:100%;padding-left:4px;">
<div class="subtext_view_med" itemprop="name">
NAME1
</div>
<div style="direction:ltr;padding-left:5px;margin-bottom:2px;" class="smtext">
<span class="Gray">In English:</span> ENGLISH_NAME1
</div>
<div style="padding-bottom:2px;padding-top:8px;font-size:14px;text-align:justify;min-height:158px;" itemprop="description">DESCRIPTION1</div>
</div>
</td>
</tr>
</table>
</div>
</td>
</tr>
</table>
<table cellspacing="0" cellpadding="0" border="0" style="width:100%;">
<tr>
<td valign="top">
<div itemscope itemtype="URL">
<table cellspacing="0" cellpadding="0" style="width:100%;">
<tr>
<td valign="top" class="g-res-tab-cell" style="width:100%;">
<div style="width:100%;padding-left:4px;">
<div class="subtext_view_med" itemprop="name">
NAME2
</div>
<div style="direction:ltr;padding-left:5px;margin-bottom:2px;" class="smtext">
<span class="Gray">In English:</span> ENGLISH_NAME2
</div>
</div>
<div style="padding-bottom:2px;padding-top:8px;font-size:14px;text-align:justify;min-height:158px;" itemprop="description">DESCRIPTION2</div>
</td>
</tr>
</table>
</div>
</td>
</tr>
</table>
</div>
</div>
</td>
Now I got stuck. I'm trying to parse the NAME, DESCRIPTION and ENGLISH_NAME for each block. I would like to print each one of them so the output will be:
name = NAME1
en_name = ENGLISH_NAME1
description = DESCRIPTION1
name = NAME2
en_name = ENGLISH_NAME2
description = DESCRIPTION2
I tried to read the docs but could not find how to handle nested tags, especially ones without a class or id. As I understand it, each block starts with <table cellspacing="0" cellpadding="0" border="0" style="width:100%;">. In each block I should find the a tag that has itemprop="url" and get the NAME. Then I should get the en_name from the text after <span class="Gray">In English:</span>, and the description from the element with itemprop="description". But I feel like BeautifulSoup can't do it (or at least makes it very hard). How can I solve this?
You can iterate over each td with class g-res-tab-cell using soup.find_all:
from bs4 import BeautifulSoup as soup
d = soup(content, 'html.parser').td.find_all('td', {'class':'g-res-tab-cell'})
results = [[i.find('div', {'class':'subtext_view_med'}).a.text, i.find('div', {'class':'smtext'}).contents[1].text, i.find('div', {'itemprop':'description'}).text] for i in d]
Output:
[['NAME1', 'In English:', 'DESCRIPTION1'], ['NAME2', 'In English:', 'DESCRIPTION2']]
Edit: from link:
import requests
from bs4 import BeautifulSoup as soup
d = soup(requests.get('https://www.sratim.co.il/browsenewmovies.php?page=1').text, 'html.parser')
movies = d.find_all('div', {'itemtype':'http://schema.org/Movie'})
result = [[getattr(i.find('a', {'itemprop':'url'}), 'text', 'N/A'), getattr(i.find('div', {'class':'smtext'}), 'text', 'N/A'), getattr(i.find('div', {'itemprop':'description'}), 'text', 'N/A')] for i in movies]
Here is another way. As that information is present for all films you should have a fully populated result set.
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
r = requests.get('https://www.sratim.co.il/browsenewmovies.php?page=1')
soup = bs(r.content, 'lxml')
names = [item.text for item in soup.select('[itemprop=url]')] #32
english_names = [item.next_sibling for item in soup.select('.smtext:-soup-contains("In English: ") span')]
descriptions = [item.text for item in soup.select('[itemprop=description]')]
results = list(zip(names, english_names, descriptions))
df = pd.DataFrame(results, columns = ['Name', 'English_Name', 'Description'])
print(df)
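For the sample output posted in the question (which has no a tag inside the name div), a per-block loop may be easier to follow than three parallel lists. The sketch below is based on a simplified version of that sample, not on the live site: it treats every div carrying an itemscope attribute as one block and falls back to 'N/A' when a piece is missing. The text_or_default helper is a hypothetical name introduced for this example.

```python
from bs4 import BeautifulSoup

# Simplified version of the blocks shown in the question's output.
html = """
<div itemscope itemtype="URL">
  <div class="subtext_view_med" itemprop="name"> NAME1 </div>
  <div class="smtext"><span class="Gray">In English:</span> ENGLISH_NAME1</div>
  <div itemprop="description">DESCRIPTION1</div>
</div>
<div itemscope itemtype="URL">
  <div class="subtext_view_med" itemprop="name"> NAME2 </div>
  <div class="smtext"><span class="Gray">In English:</span> ENGLISH_NAME2</div>
  <div itemprop="description">DESCRIPTION2</div>
</div>
"""

def text_or_default(tag, default="N/A"):
    """Return the stripped text of a tag, or a default when the tag is missing."""
    return tag.get_text(strip=True) if tag is not None else default

soup = BeautifulSoup(html, "html.parser")

records = []
# attrs={"itemscope": True} matches any div that has the attribute at all.
for block in soup.find_all("div", attrs={"itemscope": True}):
    label = block.find("span", class_="Gray")
    # The English name is the text node right after the "In English:" span.
    sibling = label.next_sibling if label is not None else None
    records.append({
        "name": text_or_default(block.find(attrs={"itemprop": "name"})),
        "en_name": str(sibling).strip() if sibling is not None else "N/A",
        "description": text_or_default(block.find(attrs={"itemprop": "description"})),
    })

for rec in records:
    print(f"name = {rec['name']}")
    print(f"en_name = {rec['en_name']}")
    print(f"description = {rec['description']}")
```

Keeping the three lookups inside one loop means a block with a missing field cannot shift the values of the blocks after it, which can happen when zipping independent lists.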

Beautiful Soup Table, stop getting info

Hey everyone I have some html that I am parsing, here it is:
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title></title>
</head>
<body>
<table class="dayinner">
<tr class="lun">
<td class="mealname" colspan="3">LUNCH</td>
</tr>
<tr class="lun">
<td class="station"> Deli</td>
<td class="menuitem">
<div class="menuitem">
<input class="chk" id="S1L0000010000047598_35356" onclick=
"rptlist(this);" onmouseout="wschk(0);" onmouseover=
"wschk(1);" type="checkbox" /> <span class="ul" onclick=
"nf('0000047598_35356');" onmouseout="pcls(this);"
onmouseover="ws(this);">Made to Order Deli Core</span>
</div>
</td>
<td class="price"></td>
</tr>
<tr class="lun">
<td class="station"> </td>
<td class="menuitem">
<div class="menuitem">
<input class="chk" id="S1L0000020000046033_63436" onclick=
"rptlist(this);" onmouseout="wschk(0);" onmouseover=
"wschk(1);" type="checkbox" /> <span class="ul" onclick=
"nf('0000046033_63436');" onmouseout="pcls(this);"
onmouseover="ws(this);">Chicken Caesar Wrap</span>
</div>
</td>
<td class="price"></td>
</tr>
<tr class="lun">
<td colspan="3" style="height:3px;"></td>
</tr>
<tr class="lun">
<td colspan="3" style="background-color:#c0c0c0; height:1px;"></td>
</tr>
<tr class="lun">
<td class="station"> Dessert</td>
<td class="station"> </td>
<td class="menuitem">
<div class="menuitem">
<input class="chk" id="S1L0000020000046033_63436" onclick=
"rptlist(this);" onmouseout="wschk(0);" onmouseover=
"wschk(1);" type="checkbox" /> <span class="ul" onclick=
"nf('0000046033_63436');" onmouseout="pcls(this);"
onmouseover="ws(this);">Chicken Caesar Wrap</span>
</div>
</td>
</tr>
</table>
</body>
</html>
Here is the code I have. I want just the items under the Deli section, and normally I won't know how many there are. Is there a way to do this?
soup = BeautifulSoup(open("upperMenu.html"))
title = soup.find('td', class_='station').text.strip()
spans = soup.find_all('span', class_='ul')[:2]
But this only works if there are exactly two items. How can I make it work when the number of items is unknown?
Thanks in advance.
You can pass a regular expression to find_all to (1) find all the station cells that contain the substring Deli, then (2) loop through each matching row and find the spans in that row whose class is ul.
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(text, 'html.parser')
# station cells whose text contains "Deli"
tds_deli = soup.find_all('td', class_='station', string=re.compile('Deli'))
for td in tds_deli:
    # the row that holds this station cell
    tr = td.find_parent('tr')
    spans = tr.find_all('span', class_='ul')
    for span in spans:
        # do something
        print(span.text)
    print('------------one row -------------')
Sample Output in this case:
Made to Order Deli Core
------------one row -------------
Not sure if I am understanding the problem correctly but I think my code might help you get started.
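If the rows with a blank station cell are meant to belong to the station named above them (so that Chicken Caesar Wrap also counts as a Deli item), one possible approach is to remember the last non-empty station while walking the rows. This is a sketch under that assumption, on a condensed copy of the table; items_for_station is a hypothetical helper name and "Dessert Item" is a made-up placeholder.

```python
from bs4 import BeautifulSoup

# Condensed copy of the menu table; "Dessert Item" is a placeholder value.
html = """<table class="dayinner">
<tr class="lun"><td class="mealname" colspan="3">LUNCH</td></tr>
<tr class="lun"><td class="station"> Deli</td><td class="menuitem"><span class="ul">Made to Order Deli Core</span></td><td class="price"></td></tr>
<tr class="lun"><td class="station"> </td><td class="menuitem"><span class="ul">Chicken Caesar Wrap</span></td><td class="price"></td></tr>
<tr class="lun"><td colspan="3" style="height:3px;"></td></tr>
<tr class="lun"><td class="station"> Dessert</td><td class="menuitem"><span class="ul">Dessert Item</span></td><td class="price"></td></tr>
</table>"""

def items_for_station(soup, wanted):
    """Collect menu items from the wanted station up to the next named station."""
    items = []
    current = None
    for tr in soup.find_all("tr"):
        station = tr.find("td", class_="station")
        if station is None:
            continue  # spacer or heading row
        name = station.get_text(strip=True)
        if name:
            current = name  # a new station starts here
        if current == wanted:
            items.extend(s.get_text(strip=True) for s in tr.find_all("span", class_="ul"))
    return items

soup = BeautifulSoup(html, "html.parser")
deli_items = items_for_station(soup, "Deli")
print(deli_items)
```

This does not depend on how many items a station has, since collection simply stops once a different station name appears.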
