Python, Beautiful Soup: how to get the desired element - python

I am trying to reach a specific element while parsing a site's source code.
This is a snippet of the part I'm trying to parse (shown up to Friday); the structure is the same for every day of the week:
<div id="intForecast">
<h2>Forecast for Rome</h2>
<table cellspacing="0" cellpadding="0" id="nonCA">
<tr>
<td onclick="showDetails('1');return false" id="day1" class="on">
<span>Thursday</span>
<div class="intIcon"><img src="http://icons.wunderground.com/graphics/conds/2005/sunny.gif" alt="sunny" /></div>
<div>Clear</div>
<div><span class="hi">H <span>22</span>°</span> / <span class="lo">L <span>11</span>°</span></div>
</td>
<td onclick="showDetails('2');return false" id="day2" class="off">
<span>Friday</span>
<div class="intIcon"><img src="http://icons.wunderground.com/graphics/conds/2005/partlycloudy.gif" alt="partlycloudy" /></div>
<div>Partly Cloudy</div>
<div><span class="hi">H <span>21</span>°</span> / <span class="lo">L <span>15</span>°</span></div>
</td>
</tr>
</table>
</div>
...and so on for all the days.
I actually got my result, but in what I think is an ugly way:
forecastFriday= soup.find('div',text='Friday').findNext('div').findNext('div').string
As you can see, I walk down the elements by repeating .findNext('div') and finally arrive at .string.
I want to get the "Partly Cloudy" text for Friday.
Is there a more Pythonic way to do this?
Thanks!

Simply find all of the <td>s and iterate over them:
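from bs4 import BeautifulSoup
soup = BeautifulSoup(your_html, 'html.parser')
div = soup('div', {'id': 'intForecast'})[0]
tds = div.find('table').findAll('td')
for td in tds:
    day = td('span')[0].text       # first <span> holds the day name
    forecast = td('div')[1].text   # second <div> holds the condition text
    print(day, forecast)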
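If you only need one day, the same traversal can be wrapped in a lookup table. This is just a minimal sketch, assuming the markup from the question (trimmed here) and bs4 with the built-in html.parser; forecasts_by_day is a name I made up for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (image URLs shortened).
HTML = """
<div id="intForecast">
<table id="nonCA"><tr>
<td id="day1" class="on"><span>Thursday</span>
<div class="intIcon"><img src="sunny.gif" alt="sunny" /></div>
<div>Clear</div>
<div><span class="hi">H <span>22</span></span> / <span class="lo">L <span>11</span></span></div>
</td>
<td id="day2" class="off"><span>Friday</span>
<div class="intIcon"><img src="partlycloudy.gif" alt="partlycloudy" /></div>
<div>Partly Cloudy</div>
<div><span class="hi">H <span>21</span></span> / <span class="lo">L <span>15</span></span></div>
</td>
</tr></table>
</div>
"""


def forecasts_by_day(html):
    """Map each day name to its condition text (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    for td in soup.select("#intForecast td"):
        day = td.find("span").get_text(strip=True)
        # The condition text is the <div> right after the icon <div>.
        condition = td.find("div", class_="intIcon").find_next_sibling("div")
        result[day] = condition.get_text(strip=True)
    return result


forecasts = forecasts_by_day(HTML)
print(forecasts["Friday"])
```

Anchoring on the intIcon div instead of a positional index means the lookup does not depend on how many divs come before the condition text.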

Related

python beautifulsoup - parsing an HTML table row

I'm using BeautifulSoup to parse a bunch of combined tables' rows, row by row, column by column in order to import it into Pandas. I can't use to_html() because one of the columns has a list of tag links in each cell. The data structure is the same in all the tables.
I can't figure out the correct way to skip a td.div tag containing the attribute {'class': ['stars']}. The following code works, but it doesn't seem correct. I can't just do if col.div: continue because some of the required columns have extra <div> tags that I need later.
def rebuild_row(self, row):
    new_row = []
    for col in row.find_all('td'):
        if col.img:
            continue
        if col.div and 'star' in str(col.div.attrs):
            continue
        if col.a:
            new_row.append(self.handle_links(col))
        else:
            if not col.text or not col.text.strip():
                new_row.append(['NaN'])
            else:
                new_text = self.clean_tag_text(col)
                new_row.append(new_text)
    return new_row
I first tried if 'stars' in col.div['class']: but it choked on key 'class'. So then I tried to find the error:
if col.div:
    if not hasattr(col.div, 'class'):
        continue
    else:
        print(f"{col.div['class']}")
But I get the output and error below, which I don't understand: shouldn't the not hasattr() check catch it?
['stars']
['stars']
Exception has occurred: KeyError
'class'
HTML row example:
<tr id="groupBook3144889">
<td width="5%"><img alt="The 7th of Victorica by Beau Schemery" src="https://i.gr-assets.com/images/S/compressed.photo.goodreads.com/books/1531785241l/40851529._SY75_.jpg" title="The 7th of Victorica by Beau Schemery"/></td>
<td width="30%">
The 7th of Victorica (Gadgets and Shadows, #2)
</td>
<td width="10%">
Schemery, Beau
<span title="Goodreads Author!">*</span>
</td>
<td width="1%">
<div class="stars" data-rating="0" data-resource-id="40851529" data-restore-rating="null" data-submit-url="/review/rate/40851529?stars_click=false" data-user-id="0"><a class="star off" href="#" ref="" title="did not like it">1 of 5 stars</a><a class="star off" href="#" ref="" title="it was ok">2 of 5 stars</a><a class="star off" href="#" ref="" title="liked it">3 of 5 stars</a><a class="star off" href="#" ref="" title="really liked it">4 of 5 stars</a><a class="star off" href="#" ref="" title="it was amazing">5 of 5 stars</a></div>
</td>
<td width="1%">
<a class="actionLinkLite" href="/group/bookshelf/64285?shelf=read">read</a>,
<a class="actionLinkLite" href="/group/bookshelf/64285?shelf=genre-action-adventure">genre-action-adve...</a>,
<a class="actionLinkLite" href="/group/bookshelf/64285?shelf=genre-steampunk-dieselpunk">genre-steampunk-d...</a>,
<a class="actionLinkLite" href="/group/bookshelf/64285?shelf=genre-young-adult">genre-young-adult</a>
</td>
<td width="1%">
</td>
<td width="1%">
</td>
<td width="1%">
Meghan
</td>
<td width="1%">2022/12/25</td>
<td class="view" width="1%">
<a class="actionLink" href="/group/show_book/64285?group_book_id=3144889" style="white-space: nowrap">view activity »</a>
</td>
</tr>
Exception has occurred: KeyError
'class'
You could try to avoid the error with
if col.div:
    if not col.div.get('class'):
        continue
    print(f"{col.div['class']}")
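As for why hasattr() did not catch it: as far as I know, Beautiful Soup treats attribute access on a tag as a search for a child tag with that name, so hasattr(col.div, 'class') looks for a child <class> tag rather than the HTML class attribute, and it comes back True either way because the lookup yields None instead of raising. The ['stars'] lines are printed for divs that have a class, and the KeyError is presumably raised by the first div without one. A small sketch with made-up HTML (not the Goodreads page):

```python
from bs4 import BeautifulSoup

# Two cells: one div with a class attribute, one without (made-up HTML).
soup = BeautifulSoup(
    '<td><div class="stars">x</div></td><td><div>y</div></td>',
    "html.parser",
)
with_class, without_class = soup.find_all("div")

# hasattr() triggers a child-tag lookup, so it says nothing about HTML attributes.
attr_check = hasattr(without_class, "class")

# .get() returns None for a missing HTML attribute instead of raising KeyError.
safe_lookup = without_class.get("class")

# has_attr() is the explicit test for an HTML attribute.
has_attr_result = without_class.has_attr("class")

# A default of [] keeps the membership test safe for tags without a class.
is_stars = "stars" in with_class.get("class", [])

print(attr_check, safe_lookup, has_attr_result, is_stars)
```

So either col.div.get('class') or col.div.has_attr('class') does what the hasattr() check was meant to do.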
Regarding "figure out the correct method to skip a td.div tag containing the attribute {'class': ['stars']}":
(I'm assuming that by td.div tag you mean a td tag containing a certain type of div.)
If you use select (instead of find_all) with CSS selectors, you can filter those cells out right from the start:
# for col in row.select('td:not(:has(div.stars)):not(:has(img))'):
for col in row.select('td:not(:has(div.stars))'):
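To see what the selector keeps, here is a minimal sketch on a shortened, made-up row (not the full Goodreads row from the question), assuming a bs4 version whose select() is backed by Soup Sieve so that :has() and :not() are available:

```python
from bs4 import BeautifulSoup

# Shortened stand-in for one table row; the cell contents are illustrative.
ROW = """
<table><tr>
<td><img alt="cover" src="cover.jpg"/></td>
<td>The 7th of Victorica</td>
<td><div class="stars"><a class="star off" href="#">1 of 5 stars</a></div></td>
<td><div>kept</div></td>
<td>2022/12/25</td>
</tr></table>
"""

row = BeautifulSoup(ROW, "html.parser").tr

# Drop cells that contain a div.stars or an img; keep everything else,
# including cells whose div has no class at all.
kept = [
    td.get_text(strip=True)
    for td in row.select("td:not(:has(div.stars)):not(:has(img))")
]
print(kept)
```

The cell with a plain div survives, which is the case the if col.div: continue shortcut would have dropped.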

Finding sibling tag in BeautifulSoup with no attributes

Sorry, this is kind of a beginner question about BeautifulSoup, but I can't find the answer.
I'm having trouble figuring out how to scrape HTML tags that have no attributes.
Here's the relevant section of HTML:
<tr bgcolor="#ffffff">
<td>
No-Lobbying List
</td>
<tr bgcolor="#efefef">
<td rowspan="2" valign="top">
6/24/2019
</td>
<td>
<a href="document.cfm?id=322577" target="_blank">
Brian Manley, Chief of Police, Austin Police Department
</a>
<a href="document.cfm?id=322577" target="_blank">
<img alt="Click here to download the PDF document" border="0" height="16" src="https://assets.austintexas.gov/edims/images/pdf_icon.gif" width="16"/>
</a>
</td>
<tr bgcolor="#efefef">
<td>
Preliminary 2018 Annual Crime Report - Executive Summary
</td>
</tr>
</tr>
</tr>
How can I navigate to the tag with the text "Preliminary 2018 Annual Crime Report - Executive Summary"?
I have tried starting from a tag with an attribute and using .next_sibling, but I've failed miserably.
Thank you.
trgrewy = soup.findAll('tr', {'bgcolor':'#efefef'}) #the cells alternate colors
trwhite = soup.findAll('tr', {'bgcolor':'#ffffff'})
trs = trgrewy + trwhite #merge them into a list
for item in trs:
    mdate = item.find('td', {'rowspan':'2'}) #find if it's today's date
    if mdate:
        # strip() removes surrounding whitespace that would break strptime
        datetime_object = datetime.strptime(mdate.text.strip(), '%m/%d/%Y')
        if datetime_object.date() == now.date():
            sender = item.find('a').text
            pdf = item.find('a')['href']
            link = baseurl + pdf
            title = item.findAll('td')[2] #this is where i've failed
You can use CSS selectors:
data = '''
<tr bgcolor="#ffffff">
<td>
No-Lobbying List
</td>
<tr bgcolor="#efefef">
<td rowspan="2" valign="top">
6/24/2019
</td>
<td>
<a href="document.cfm?id=322577" target="_blank">
Brian Manley, Chief of Police, Austin Police Department
</a>
<a href="document.cfm?id=322577" target="_blank">
<img alt="Click here to download the PDF document" border="0" height="16" src="https://assets.austintexas.gov/edims/images/pdf_icon.gif" width="16"/>
</a>
</td>
<tr bgcolor="#efefef">
<td>
Preliminary 2018 Annual Crime Report - Executive Summary
</td>
</tr>
</tr>
</tr>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'lxml')
# This will find date
print(soup.select_one('td[rowspan="2"]').get_text(strip=True))
# This will find next row after the row with date
print(soup.select_one('tr:has(td[rowspan="2"]) + tr').get_text(strip=True))
Prints:
6/24/2019
Preliminary 2018 Annual Crime Report - Executive Summary
Further reading:
CSS Selectors Reference
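The same lookup also works with Beautiful Soup's navigation methods instead of a sibling selector. A minimal sketch, assuming the rows end up as siblings: I closed each <tr> in this trimmed sample myself, and with the unclosed rows from the question the result depends on how the chosen parser repairs the markup.

```python
from bs4 import BeautifulSoup

# Trimmed sample with every <tr> closed (an assumption, see above).
HTML = """
<table>
<tr bgcolor="#efefef">
<td rowspan="2" valign="top">6/24/2019</td>
<td><a href="document.cfm?id=322577" target="_blank">Brian Manley, Chief of Police, Austin Police Department</a></td>
</tr>
<tr bgcolor="#efefef">
<td>Preliminary 2018 Annual Crime Report - Executive Summary</td>
</tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Start from the only cell that has a distinguishing attribute.
date_cell = soup.find("td", rowspan="2")
date_row = date_cell.find_parent("tr")

# find_next_sibling skips the whitespace text nodes that trip up .next_sibling.
title_row = date_row.find_next_sibling("tr")

sender = date_row.a.get_text(strip=True)
title = title_row.td.get_text(strip=True)
print(sender)
print(title)
```

Plain .next_sibling often returns the newline between two tags, which may be why the attempt in the question failed; find_next_sibling('tr') only considers matching tags.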
I think you should try this:
page = BeautifulSoup(HTML_TEXT, 'html.parser')
text = page.find('td').findAll(text=True, recursive=False)
for i in text:
    print(i)

Scrape Table HTML with beautifulSoup

I'm trying to scrape a website that is built with tables. Here is a link to an example page: http://www.rc2.vd.ch/registres/hrcintapp-pub/companyReport.action?rcentId=5947621600000055031025&lang=FR&showHeader=false
My goal is to get the last name and first name: Lass Christian (screenshot below).
Screenshot: http://i.stack.imgur.com/q3nMb.png
I've already scraped many websites, but with this one I have absolutely no idea how to proceed. There are only tables without any ID or class attributes, and I can't figure out where I'm supposed to start.
Here's an example of the HTML code:
<table border="1" cellpadding="1" cellspacing="0" width="100%">
<tbody><tr bgcolor="#f0eef2">
<th colspan="3">Associés, gérants et personnes ayant qualité pour signer</th>
</tr>
<tr bgcolor="#f0eef2">
<th>
<a class="hoverable" onclick="document.forms[0].rcentId.value='5947621600000055031025';document.forms[0].lang.value='FR';document.forms[0].searchLang.value='FR';document.forms[0].order.value='N';document.forms[0].rad.value='N';document.forms[0].goToAdm.value='true';document.forms[0].showHeader.value=false;document.forms[0].submit();event.returnValue=false; return false;">
Nom et Prénoms, Origine, Domicile, Part sociale
</a>
</th>
<th>
<a class="hoverable" onclick="document.forms[0].rcentId.value='5947621600000055031025';document.forms[0].lang.value='FR';document.forms[0].searchLang.value='FR';document.forms[0].order.value='F';document.forms[0].rad.value='N';document.forms[0].goToAdm.value='true';document.forms[0].showHeader.value=false;document.forms[0].submit();event.returnValue=false; return false;">
Fonctions
</a>
<img src="/registres/hrcintapp-pub/img/down_r.png" align="bottom" border="0" alt="">
</th>
<th>Mode Signature</th>
</tr>
<tr bgcolor="#ffffff">
<td>
<span style="text-decoration: none;">
Lass Christian, du Danemark, à Yverdon-les-Bains, avec 200 parts de CHF 100
</span>
</td>
<td><span style="text-decoration: none;">associé gérant </span> </td>
<td><span style="text-decoration: none;">signature individuelle</span> </td>
</tr>
</tbody></table>
This will get the name from the page. The table comes right after the anchor with the id adm; once you have that table, there are numerous ways to get what you need:
from bs4 import BeautifulSoup
import requests
r = requests.get('http://www.rc2.vd.ch/registres/hrcintapp-pub/companyReport.action?rcentId=5947621600000055031025&lang=FR&showHeader=false')
soup = BeautifulSoup(r.content,"lxml")
table = soup.select_one("#adm").find_next("table")
name = table.select_one('td span[style^="text-decoration:"]').text.split(",", 1)[0].strip()
print(name)
Output:
Lass Christian
Or:
table = soup.select_one("#adm").find_next("table")
name = table.find("tr",bgcolor="#ffffff").td.span.text.split(",", 1)[0].strip()
Something like this?
results = soup.find_all("tr", {"bgcolor" : "#ffffff"})
for result in results:
    the_name = result.td.span.get_text().split(',')[0]
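For trying this out without hitting the site, here is a minimal offline sketch on a shortened copy of the table from the question. It assumes, as the answers above do, that data rows are the ones with bgcolor="#ffffff" and that the name is everything before the first comma in the first cell.

```python
from bs4 import BeautifulSoup

# Shortened copy of the table from the question (onclick handlers removed).
HTML = """
<table border="1">
<tbody>
<tr bgcolor="#f0eef2"><th>Nom et Prénoms, Origine, Domicile, Part sociale</th><th>Fonctions</th></tr>
<tr bgcolor="#ffffff">
<td><span style="text-decoration: none;">
Lass Christian, du Danemark, à Yverdon-les-Bains, avec 200 parts de CHF 100
</span></td>
<td><span style="text-decoration: none;">associé gérant </span></td>
</tr>
</tbody></table>
"""

soup = BeautifulSoup(HTML, "html.parser")

names = []
for row in soup.select('tr[bgcolor="#ffffff"]'):
    first_cell = row.find("td")
    if first_cell is None:
        continue  # skip rows that have no data cells
    # Everything before the first comma is the person's name.
    names.append(first_cell.get_text(strip=True).split(",", 1)[0])

print(names)
```

Collecting into a list means the same loop keeps working when a company has more than one person listed.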

Beautiful Soup Table, stop getting info

Hey everyone, I have some HTML that I am parsing. Here it is:
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title></title>
</head>
<body>
<table class="dayinner">
<tr class="lun">
<td class="mealname" colspan="3">LUNCH</td>
</tr>
<tr class="lun">
<td class="station"> Deli</td>
<td class="menuitem">
<div class="menuitem">
<input class="chk" id="S1L0000010000047598_35356" onclick=
"rptlist(this);" onmouseout="wschk(0);" onmouseover=
"wschk(1);" type="checkbox" /> <span class="ul" onclick=
"nf('0000047598_35356');" onmouseout="pcls(this);"
onmouseover="ws(this);">Made to Order Deli Core</span>
</div>
</td>
<td class="price"></td>
</tr>
<tr class="lun">
<td class="station"> </td>
<td class="menuitem">
<div class="menuitem">
<input class="chk" id="S1L0000020000046033_63436" onclick=
"rptlist(this);" onmouseout="wschk(0);" onmouseover=
"wschk(1);" type="checkbox" /> <span class="ul" onclick=
"nf('0000046033_63436');" onmouseout="pcls(this);"
onmouseover="ws(this);">Chicken Caesar Wrap</span>
</div>
</td>
<td class="price"></td>
</tr>
<tr class="lun">
<td colspan="3" style="height:3px;"></td>
</tr>
<tr class="lun">
<td colspan="3" style="background-color:#c0c0c0; height:1px;"></td>
</tr>
<tr class="lun">
<td class="station"> Dessert</td>
<td class="station"> </td>
<td class="menuitem">
<div class="menuitem">
<input class="chk" id="S1L0000020000046033_63436" onclick=
"rptlist(this);" onmouseout="wschk(0);" onmouseover=
"wschk(1);" type="checkbox" /> <span class="ul" onclick=
"nf('0000046033_63436');" onmouseout="pcls(this);"
onmouseover="ws(this);">Chicken Caesar Wrap</span>
</div>
</td>
</tr>
</table>
</body>
</html>
Here is the code I have. I want just the items under the Deli section, and normally I won't know how many there are. Is there a way to do this?
soup = BeautifulSoup(open("upperMenu.html"))
title = soup.find('td', class_='station').text.strip()
spans = soup.find_all('span', class_='ul')[:2]
But this only works if there are two items. How can I make it work when the number of items is unknown?
Thanks in advance.
You can use the text argument of find_all to (1) find all the cells of class station whose text contains the substring Deli, and then (2) loop through the parent row of each cell and find the spans in that row whose class is ul.
import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(text, 'html.parser')
tds_deli = soup.find_all(name='td', attrs={'class': 'station'}, text=re.compile('Deli'))
for td in tds_deli:
    tr = td.find_parent('tr')
    spans = tr.find_all('span', {'class': 'ul'})
    for span in spans:
        # do something with each menu item
        print(span.text)
    print('------------one row -------------')
Sample Output in this case:
Made to Order Deli Core
------------one row -------------
Not sure if I am understanding the problem correctly but I think my code might help you get started.
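One thing the snippet above misses is that the second Deli item sits in a row whose station cell is blank. A possible way around that, as a sketch only: walk the rows in order and carry the last non-empty station name forward. It assumes a blank station cell means "same station as the row above"; the HTML is a trimmed version of the question's table, the Brownie item is made up, and items_by_station is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Trimmed version of the menu table; "Brownie" is an invented placeholder item.
HTML = """
<table class="dayinner">
<tr class="lun"><td class="mealname" colspan="3">LUNCH</td></tr>
<tr class="lun"><td class="station"> Deli</td>
<td class="menuitem"><span class="ul">Made to Order Deli Core</span></td></tr>
<tr class="lun"><td class="station"> </td>
<td class="menuitem"><span class="ul">Chicken Caesar Wrap</span></td></tr>
<tr class="lun"><td colspan="3"></td></tr>
<tr class="lun"><td class="station"> Dessert</td>
<td class="menuitem"><span class="ul">Brownie</span></td></tr>
</table>
"""


def items_by_station(html):
    """Group menu items under the most recent non-empty station name."""
    soup = BeautifulSoup(html, "html.parser")
    menu = {}
    current = None
    for row in soup.select("table.dayinner tr"):
        station = row.find("td", class_="station")
        if station is None:
            continue  # header and spacer rows have no station cell
        name = station.get_text(strip=True)
        if name:
            current = name  # a new station starts here
        if current is None:
            continue
        for span in row.find_all("span", class_="ul"):
            menu.setdefault(current, []).append(span.get_text(strip=True))
    return menu


menu = items_by_station(HTML)
print(menu["Deli"])
```

This returns however many items the Deli section has, without slicing to a fixed count.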

Parse the date string from html in lxml

s = """
<tbody>
<tr>
<td style="border-bottom: none">
<span class="graytext" style="font-weight: bold;"> Reply #3 - </span>
<span class="graytext" style="font-size: 11px">
05/13/09 2:02am
<br>
</span>
</td>
</tr>
</tbody>
"""
I need to extract the date string from this HTML string.
I tried it this way:
import lxml.html
doc = lxml.html.fromstring(s)
doc.xpath('//span[@class="graytext" and @style="font-size: 11px"]')
But this is not working. I only need the date string.
Your query is selecting the span; you need to grab the text from it:
>>> doc.xpath('//span[@class="graytext" and @style="font-size: 11px"]')
[<Element span at 1c9d4c8>]
Most queries return a sequence, so I normally use a helper function that gets the first item.
from lxml import etree
s = """
<tbody>
<tr>
<td style="border-bottom: none">
<span class="graytext" style="font-weight: bold;"> Reply #3 - </span>
<span class="graytext" style="font-size: 11px">
05/13/09 2:02am
<br>
</span>
</td>
</tr>
</tbody>
"""
doc = etree.HTML(s)
def first(sequence, default=None):
    for item in sequence:
        return item
    return default
Then:
>>> doc.xpath('//span[@class="graytext" and @style="font-size: 11px"]')
[<Element span at 1c9d4c8>]
>>> doc.xpath('//span[@class="graytext" and @style="font-size: 11px"]/text()')
['\n 05/13/09 2:02am\n ']
>>> first(doc.xpath('//span[@class="graytext" and @style="font-size: 11px"]/text()'),'').strip()
'05/13/09 2:02am'
Try the following instead of the last line:
print(doc.xpath('//span[@class="graytext" and @style="font-size: 11px"]/text()')[0])
The first part of the XPath expression is correct: //span[@class="graytext" and @style="font-size: 11px"] selects all matching span nodes. You then need to specify what to select from each node; text() selects the node's text content.
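If the goal is an actual date rather than a string, XPath's normalize-space() can do the whitespace cleanup, and datetime can parse the result. A minimal sketch, with two assumptions of mine: the snippet is wrapped in a <table> so the parser sees complete markup, and the timestamp is month/day/two-digit year with a 12-hour clock (the page does not say which format it uses).

```python
from datetime import datetime

import lxml.html

# The snippet from the question, wrapped in <table> (my addition).
S = """
<table><tbody><tr>
<td style="border-bottom: none">
<span class="graytext" style="font-weight: bold;"> Reply #3 - </span>
<span class="graytext" style="font-size: 11px">
05/13/09 2:02am
<br>
</span>
</td>
</tr></tbody></table>
"""

doc = lxml.html.fromstring(S)

# normalize-space() takes the first matching node, trims it and collapses
# internal whitespace, so no indexing or .strip() is needed afterwards.
date_text = doc.xpath(
    'normalize-space(//span[@class="graytext" and @style="font-size: 11px"])'
)

# Assumed format: MM/DD/YY followed by a 12-hour time with am/pm.
posted = datetime.strptime(date_text, "%m/%d/%y %I:%M%p")
print(date_text)
print(posted.isoformat())
```

Note that matching on the exact style string is brittle; if the site changes the inline style, the expression stops matching and normalize-space() returns an empty string.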
