Python BeautifulSoup - parsing an HTML table row

I'm using BeautifulSoup to parse a bunch of combined tables' rows, row by row and column by column, in order to import them into Pandas. I can't use to_html() because one of the columns has a list of tag links in each cell. The data structure is the same in all the tables.
I can't figure out the correct method to skip a td.div tag containing the attribute {'class': ['stars']}. The following code works, but it doesn't seem correct. I can't just do if col.div: continue because some of the required columns have extra <div> tags that I need later.
def rebuild_row(self, row):
    new_row = []
    for col in row.find_all('td'):
        if col.img:
            continue
        if col.div and 'star' in str(col.div.attrs):
            continue
        if col.a:
            new_row.append(self.handle_links(col))
        else:
            if not col.text or not col.text.strip():
                new_row.append(['NaN'])
            else:
                new_text = self.clean_tag_text(col)
                new_row.append(new_text)
    return new_row
I first tried if 'stars' in col.div['class']:, but it choked on the key 'class'. So then I tried to track down the error:
if col.div:
    if not hasattr(col.div, 'class'):
        continue
    else:
        print(f"{col.div['class']}")
but I get the following output and error, which I don't understand: shouldn't the not hasattr() check catch it?
['stars']
['stars']
Exception has occurred: KeyError
'class'
HTML row example:
<tr id="groupBook3144889">
<td width="5%"><img alt="The 7th of Victorica by Beau Schemery" src="https://i.gr-assets.com/images/S/compressed.photo.goodreads.com/books/1531785241l/40851529._SY75_.jpg" title="The 7th of Victorica by Beau Schemery"/></td>
<td width="30%">
The 7th of Victorica (Gadgets and Shadows, #2)
</td>
<td width="10%">
Schemery, Beau
<span title="Goodreads Author!">*</span>
</td>
<td width="1%">
<div class="stars" data-rating="0" data-resource-id="40851529" data-restore-rating="null" data-submit-url="/review/rate/40851529?stars_click=false" data-user-id="0"><a class="star off" href="#" ref="" title="did not like it">1 of 5 stars</a><a class="star off" href="#" ref="" title="it was ok">2 of 5 stars</a><a class="star off" href="#" ref="" title="liked it">3 of 5 stars</a><a class="star off" href="#" ref="" title="really liked it">4 of 5 stars</a><a class="star off" href="#" ref="" title="it was amazing">5 of 5 stars</a></div>
</td>
<td width="1%">
<a class="actionLinkLite" href="/group/bookshelf/64285?shelf=read">read</a>,
<a class="actionLinkLite" href="/group/bookshelf/64285?shelf=genre-action-adventure">genre-action-adve...</a>,
<a class="actionLinkLite" href="/group/bookshelf/64285?shelf=genre-steampunk-dieselpunk">genre-steampunk-d...</a>,
<a class="actionLinkLite" href="/group/bookshelf/64285?shelf=genre-young-adult">genre-young-adult</a>
</td>
<td width="1%">
</td>
<td width="1%">
</td>
<td width="1%">
Meghan
</td>
<td width="1%">2022/12/25</td>
<td class="view" width="1%">
<a class="actionLink" href="/group/show_book/64285?group_book_id=3144889" style="white-space: nowrap">view activity »</a>
</td>
</tr>

Exception has occurred: KeyError
'class'
The not hasattr() check doesn't catch it because HTML attributes aren't Python attributes of a Tag object; they live in the tag's .attrs dictionary and are read with indexing or .get(). You could avoid the error with
if col.div:
    if not col.div.get('class'):
        continue
    print(f"{col.div['class']}")
figure out the correct method to skip a td.div tag containing the attribute {'class': ['stars']}
(I'm assuming that by td.div tag you mean a td tag containing a certain type of div.)
If you use select [instead of find_all] with CSS Selectors, you can filter them out right from the start.
# for col in row.select('td:not(:has(div.stars)):not(:has(img))'):
for col in row.select('td:not(:has(div.stars))'):
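The rest of the logic from rebuild_row stays as it was. Here is a minimal sketch of the loop with that selector; it reuses the handle_links and clean_tag_text helpers from the question and assumes soupsieve (installed alongside bs4) for the :has() pseudo-class:
def rebuild_row(self, row):
    new_row = []
    # the selector already drops the cover-image cell and the star-rating cell
    for col in row.select('td:not(:has(div.stars)):not(:has(img))'):
        if col.a:
            new_row.append(self.handle_links(col))
        elif not col.text.strip():
            new_row.append(['NaN'])
        else:
            new_row.append(self.clean_tag_text(col))
    return new_row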

Related

Unwanted CSV Output scraped from a website|Using Python and Selenium

I'm having trouble with the CSV export of the data I'm scraping from a website.
Output problems:
The output ends up in a single column, and only the first column of data is written.
Or the output ends up in rows, but just one row.
I just want it to come out the typical way, one table row per CSV row.
Here's a segment of the whole site's html where my particular target is:
<tbody id="sitesList">
<tr data-value="11230" class="item-row">
<td class="text-left">example.com <i class="fa fa-external-link"></i>
<br><span>» view site details</span></td>
<td>92</td>
<td>71</td>
<td>Do Follow</td>
<td style="font-size:12px;font-family:sans-serif !important;">Education
<br>Family & Parenting
<br>Food & Drink
<br>
</td>
<td>Included</td>
<td><strong>$1</strong></td>
<td><span data-id="11230" class="btn btn-success btn-sm addtocart">Buy Website $1</span></td>
</tr>
<tr data-value="11229" class="item-row">
<td class="text-left">example1.com <i class="fa fa-external-link"></i>
<br><span>» view site details</span></td>
<td>65</td>
<td>34</td>
<td>Do Follow</td>
<td style="font-size:12px;font-family:sans-serif !important;">Business & Finance
<br>General: Multi-Niche
<br>
</td>
<td>Included</td>
<td><strong>$2</strong></td>
<td><span data-id="11229" class="btn btn-success btn-sm addtocart">Buy Website $2</span></td>
</tr>
<tr data-value="11228" class="item-row">
<td class="text-left">example2.com <i class="fa fa-external-link"></i>
<div class="tooltip owner_tooltip" style="float: right;opacity: 1;width: 20px;height: 20px;background-size: 100%;"><span class="tooltiptext">Owner Verified</span></div>
<br><span>» view site details</span></td>
<td>27</td>
<td>26</td>
<td>Do Follow</td>
<td style="font-size:12px;font-family:sans-serif !important;">Cryptocurrency
<br>
</td>
<td>Not Included</td>
<td><strong>$3</strong></td>
<td><span data-id="11228" class="btn btn-success btn-sm addtocart">Buy Website $3</span></td>
</tr>
<tr data-value="11227" class="item-row">
<td class="text-left">example3.com <i class="fa fa-external-link"></i>
<br><span>» view site details</span></td>
<td>23</td>
<td>29</td>
<td>Do Follow</td>
<td style="font-size:12px;font-family:sans-serif !important;">Business & Finance
<br>Health
<br>SEO & Digital Marketing
<br>
</td>
<td>Included</td>
<td><strong>$4</strong></td>
<td><span data-id="11227" class="btn btn-success btn-sm addtocart">Buy Website $4</span></td>
</tr>
</tbody>
I'm using selenium and here's my code:
siteList_tds = driver.find_elements(By.XPATH, "//tbody[@id='sitesList']//tr//td")
with open('test.csv', 'w') as f:
    write = csv.writer(f)
    for s in siteList_tds:
        write.writerow(s.text)  # or write.writerow([s.text])
I would try the following:
siteList_trs = driver.find_elements(By.XPATH, "//tbody[@id='sitesList']//tr")
with open('test.csv', 'w', newline='') as f:  # newline='' keeps csv from adding blank lines on Windows
    write = csv.writer(f)
    for r in siteList_trs:
        href = r.find_element(By.XPATH, './/a').get_attribute('href')
        tds = r.find_elements(By.XPATH, './/td')
        data = []
        for td in tds:
            data.append(td.text)
        data.insert(0, href)  # insert() mutates the list and returns None...
        write.writerow(data)  # ...so write the completed list afterwards
What this code does differently:
find the tr tags instead of the td tags to iterate through
grab the href attribute from the a tag that descends from each tr
grab the rest of that row's td elements
iterate through the tds to build a list of each td.text
finally, insert the href string at the beginning of the list and write the whole list out as one CSV row (list.insert() returns None, so it can't be passed to writerow() directly)

How to extract text from a table, moving between <tr> tags, using BeautifulSoup

I need to extract text from a table using BeautifulSoup.
Below are the HTML, the code I have written, and the output I'm getting.
HTML:
<div class="Tech">
<div class="select">
<span>Selection is mandatory</span>
</div>
<table id="product">
<tbody>
<tr class="feature">
<td class="title" rowspan="3">
<h2>Information</h2>
</td>
<td class="label">
<h3>Design</h3>
</td>
<td class="checkbox">product</td>
</tr>
<tr>
<td class="label">
<h3>Marque</h3>
</td>
<td class="checkbox">
<input type="checkbox">
<label>retro</label>
<a href="link">
Landlord
</a>
</td>
</tr>
<tr>
<td class="label">
<h3>Model</h3>
</td>
<td class="checkbox">model123</td>
</tr>
import requests
from bs4 import BeautifulSoup
url='someurl.com'
source2= requests.get(url,timeout=30).text
soup2=BeautifulSoup(source2,'lxml')
element2= soup2.find('div',class_='Tech')
pin= element2.find('table',id='product').tbody.tr.text
print(pin)
Output that I am getting is:
Information
Design
product
How do I move between <tr>s? I need the output to be: model123.
To get output model123, you can try:
# search <h3> that contains "Model"
h3 = soup.select_one('h3:contains("Model")')
# search next <td>
model = h3.find_next("td").text
print(model)
Prints:
model123
Or without CSS selectors:
model = (
    soup.find(lambda tag: tag.name == "h3" and tag.text.strip() == "Model")
    .find_next("td")
    .text
)
print(model)
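The same pattern also works if you need every label/value pair from the table rather than a single cell; a minimal sketch against the HTML above (soup is the parsed document):
# pair each <h3> label with the text of the cell that follows it
for label in soup.select('#product td.label h3'):
    value = label.find_next('td').get_text(' ', strip=True)
    print(label.text.strip(), '->', value)
# expected to print roughly:
# Design -> product
# Marque -> retro Landlord
# Model -> model123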

How to parse column values and their hrefs with Selenium

I'm new to Selenium and to parsing data from websites.
The problem: I have a website table with the following HTML:
<table width="580" cellspacing="1" cellpadding="3" bgcolor="#ffffff" id="restab">
<tbody>
<tr align="center" valign="middle">
<td width="40" bgcolor="#555555"><font color="#ffffff">№</font></td>
<td width="350" bgcolor="#555555"><font color="#ffffff">Название организации</font></td>
<td width="100" bgcolor="#555555"><font color="#ffffff">Город</font></td>
<td width="60" bgcolor="#555555"><span title="Число публикаций данной организации на eLibrary.Ru"><font color="#ffffff">Публ.</font></span></td><td width="30" bgcolor="#555555"><span title="Число ссылок на публикации организации"><font color="#ffffff">Цит.</font></span></td>
</tr>
<tr valign="middle" bgcolor="#f5f5f5" id="a18098">
<td align="center"><font color="#00008f">1</font></td>
<td align="left"><font color="#00008f"><a href="org_about.asp?orgsid=18098">
"Академия информатизации образования" по Ленинградской области</a></font></td>
<td align="center"><font color="#00008f">Гатчина</font></td>
<td align="right"><font color="#00008f">0<img src="/pic/1pix.gif" hspace="16"></font></td>
<td align="center"><font color="#00008f">0</font></td>
</tr>
<tr valign="middle" bgcolor="#f5f5f5" id="a17954">
<td align="center"><font color="#00008f">2</font></td>
<td align="left"><font color="#00008f"><a href="org_about.asp?orgsid=17954">
"Академия талантов" Санкт-Петербурга</a></font></td>
<td align="center"><font color="#00008f">Санкт-Петербург</font></td>
<td align="right"><font color="#00008f">3<img src="/pic/stat.gif" width="12" height="13" hspace="10" border="0"></font></td>
<td align="center"><font color="#00008f">0</font></td>
</tr>
</tbody>
</table>
and I need to get all of the table values plus the href of each value in the left column.
I tried to use XPath, but it raises an error; how can I do this better?
In the end I need a dataframe with the table values plus an extra column holding the href from the left column.
First try to use pandas.read_html(). See code example below.
If that doesn't work, then use the right-click menu in a browser such as Mozilla Firefox (Inspect Element) or Google Chrome (Developer Tools) to find the CSS selector or XPath. Then feed that CSS selector or XPath into Selenium.
Another useful tool for finding complicated CSS/Xpath is the Inspector Gadget browser plug-in.
import pandas as pd
# this is the website you want to read ... table with "Minimum Level for Adult Cats"
str_url = 'http://www.felinecrf.org/catfood_data_how_to_use.htm'
# use pandas.read_html()
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html
list_df = pd.read_html(str_url, match='DMA')
print('Number of dataframes on the page: ', len(list_df))
print()
for idx, each_df in enumerate(list_df):
    print(f'Show dataframe number {idx}:')
    print(each_df.head())
    print()
# use table 2 on the page
df_target = list_df[2]
# create column headers
# https://chrisalbon.com/python/data_wrangling/pandas_rename_column_headers/
header_row = df_target.iloc[0]
# Replace the dataframe with a new one which does not contain the first row
df_target = df_target[1:]
# Rename the dataframe's column values with the header variable
df_target.columns = header_row
print(df_target.head())
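read_html() gives you the cell text but not the hrefs (newer pandas versions also accept an extract_links argument for that). Here is a minimal sketch of collecting the table values plus an extra href column with BeautifulSoup, assuming page_source holds the HTML shown in the question (e.g. driver.page_source):
import pandas as pd
from bs4 import BeautifulSoup

soup = BeautifulSoup(page_source, 'lxml')

rows = []
for tr in soup.find('table', id='restab').find_all('tr')[1:]:  # skip the header row
    cells = [td.get_text(' ', strip=True) for td in tr.find_all('td')]
    link = tr.find('a')
    cells.append(link['href'] if link else None)  # extra column with the href
    rows.append(cells)

df = pd.DataFrame(rows, columns=['№', 'Название организации', 'Город', 'Публ.', 'Цит.', 'href'])
print(df)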

Selenium Python how to locate a specified row and a column from a table of elements

I have a table with some rows and columns. I would like to iterate through the table and locate rows 81 and 82 and column index 4 (the Surname column).
I would like to check the value in column 4 for rows 81 and 82.
The data is fixed, so identifying rows 81 and 82 for my test purposes is fine.
I have made a start with some code to get the table and iterate through the rows.
How do I go to row 81 and column 4 directly?
My code snippet is:
def is_surname_Campbell_and_CAMPBELL_together(self):  # is the surname "Campbell" and "CAMPBELL" together
    try:
        table_id = WebDriverWait(self.driver, 20).until(
            EC.presence_of_element_located((By.ID, 'data_configuration_view_preview_dg_main_body')))
        rows = table_id.find_elements(By.TAG_NAME, "tr")
        for row in rows:
            # Get the columns
            col_sname = row.find_elements(By.TAG_NAME, "td")[4]  # This is the SNAME column
            print "col_sname.text = "
            if col_sname.text == "Campbell":
                return True
        return False
    except NoSuchElementException, e:
        print "Element not found "
        print e
        self.save_screenshot("is_surname_Campbell_and_CAMPBELL_together")
        return False
HTML snippet (a small section otherwise it will be too long to paste)
<table id="data_configuration_view_preview_dg_main_body" cellspacing="0" style="table-layout: fixed; width: 100%; margin-bottom: 17px;">
<colgroup>
<tbody>
<tr class="GJPPK2LBJM GJPPK2LBAN" __gwt_subrow="0" __gwt_row="0">
<td class="GJPPK2LBIM GJPPK2LBKM GJPPK2LBLM GJPPK2LBBN">
<td class="GJPPK2LBIM GJPPK2LBKM GJPPK2LBBN">
<div __gwt_cell="cell-gwt-uid-756" style="outline-style:none;" tabindex="0">
<span class="" title="f1/48" style="white-space:nowrap;overflow:hidden;text-overflow:ellipsis;empty-cells:show;display:block;padding-right: 1px;">f1/48</span>
</div>
</td>
<td class="GJPPK2LBIM GJPPK2LBKM GJPPK2LBBN">
<div __gwt_cell="cell-gwt-uid-757" style="outline-style:none;">
<span class="" title="" style="white-space:nowrap;overflow:hidden;text-overflow:ellipsis;empty-cells:show;display:block;padding-right: 1px;"/>
</div>
</td>
<td class="GJPPK2LBIM GJPPK2LBKM GJPPK2LBBN">
<div __gwt_cell="cell-gwt-uid-758" style="outline-style:none;">
<span class="" title="Keith" style="white-space:nowrap;overflow:hidden;text-overflow:ellipsis;empty-cells:show;display:block;padding-right: 1px;">Keith</span>
</div>
</td>
<td class="GJPPK2LBIM GJPPK2LBKM GJPPK2LBBN">
<div __gwt_cell="cell-gwt-uid-759" style="outline-style:none;">
<span class="" title="Campbell" style="background-color:yellow;white-space:nowrap;overflow:hidden;text-overflow:ellipsis;empty-cells:show;display:block;padding-right: 1px;">Campbell</span>
</div>
</td>
<td class="GJPPK2LBIM GJPPK2LBKM GJPPK2LBBN">
<div __gwt_cell="cell-gwt-uid-760" style="outline-style:none;">
<span class="" title="" style="white-space:nowrap;overflow:hidden;text-overflow:ellipsis;empty-cells:show;display:block;padding-right: 1px;"/>
</div>
</td>
<td class="GJPPK2LBIM GJPPK2LBKM GJPPK2LBBN">
<td class="GJPPK2LBIM GJPPK2LBKM GJPPK2LBBN">
<td class="GJPPK2LBIM GJPPK2LBKM GJPPK2LBBN">
</tbody>
</table>
Thanks,
Riaz
In the general case, the value of the cell at row 81, column 4 can be located as
driver.find_element_by_xpath('//tr[81]/td[4]').text
P.S. XPath indexing starts at [1], not at [0] as in Python.
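A minimal sketch of the check the question describes, built on the same XPath idea (Selenium 4 syntax is assumed; the table id comes from the question's HTML, and the 1-based row/column positions may need adjusting to the real markup):
from selenium.webdriver.common.by import By

def cell_text(driver, row, col):
    # XPath positions are 1-based, so row 81 / column 4 maps to tr[81]/td[4]
    xpath = ("//table[@id='data_configuration_view_preview_dg_main_body']"
             f"//tr[{row}]/td[{col}]")
    return driver.find_element(By.XPATH, xpath).text

surnames = {cell_text(driver, 81, 4), cell_text(driver, 82, 4)}
print(surnames == {"Campbell", "CAMPBELL"})  # True if the two rows hold both spellings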

Beautiful Soup Table, stop getting info

Hey everyone, I have some HTML that I am parsing; here it is:
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title></title>
</head>
<body>
<table class="dayinner">
<tr class="lun">
<td class="mealname" colspan="3">LUNCH</td>
</tr>
<tr class="lun">
<td class="station"> Deli</td>
<td class="menuitem">
<div class="menuitem">
<input class="chk" id="S1L0000010000047598_35356" onclick=
"rptlist(this);" onmouseout="wschk(0);" onmouseover=
"wschk(1);" type="checkbox" /> <span class="ul" onclick=
"nf('0000047598_35356');" onmouseout="pcls(this);"
onmouseover="ws(this);">Made to Order Deli Core</span>
</div>
</td>
<td class="price"></td>
</tr>
<tr class="lun">
<td class="station"> </td>
<td class="menuitem">
<div class="menuitem">
<input class="chk" id="S1L0000020000046033_63436" onclick=
"rptlist(this);" onmouseout="wschk(0);" onmouseover=
"wschk(1);" type="checkbox" /> <span class="ul" onclick=
"nf('0000046033_63436');" onmouseout="pcls(this);"
onmouseover="ws(this);">Chicken Caesar Wrap</span>
</div>
</td>
<td class="price"></td>
</tr>
<tr class="lun">
<td colspan="3" style="height:3px;"></td>
</tr>
<tr class="lun">
<td colspan="3" style="background-color:#c0c0c0; height:1px;"></td>
</tr>
<tr class="lun">
<td class="station"> Dessert</td>
<td class="station"> </td>
<td class="menuitem">
<div class="menuitem">
<input class="chk" id="S1L0000020000046033_63436" onclick=
"rptlist(this);" onmouseout="wschk(0);" onmouseover=
"wschk(1);" type="checkbox" /> <span class="ul" onclick=
"nf('0000046033_63436');" onmouseout="pcls(this);"
onmouseover="ws(this);">Chicken Caesar Wrap</span>
</div>
</td>
</tr>
</table>
</body>
</html>
Here is the code I have. I want just the items under the Deli section, and normally I won't know how many there are; is there a way to do this?
soup = BeautifulSoup(open("upperMenu.html"))
title = soup.find('td', class_='station').text.strip()
spans = soup.find_all('span', class_='ul')[:2]
but this only works if there are exactly two items. How can I make it work when the number of items is unknown?
Thanks in advance
You can use the text argument of find_all to (1) find all the station cells whose text contains the substring Deli, then (2) loop through each matching row and collect the spans within that row whose class is ul.
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(text, 'html.parser')  # text holds the HTML above
tds_deli = soup.find_all(name='td', attrs={'class': 'station'}, text=re.compile('Deli'))
for td in tds_deli:
    try:
        tr = td.find_parent()
        spans = tr.find_all('span', {'class': 'ul'})
        for span in spans:
            # do something
            print(span.text)
        print('------------one row -------------')
    except:
        pass
Sample Output in this case:
Made to Order Deli Core
------------one row -------------
Not sure if I am understanding the problem correctly but I think my code might help you get started.
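If the goal is every station's items (with an unknown number of items per station), here is a sketch of the same idea that walks the rows in order and groups items under the most recent non-blank station cell:
from bs4 import BeautifulSoup

soup = BeautifulSoup(open("upperMenu.html"), "html.parser")

menu = {}
current_station = None
for tr in soup.find_all('tr', class_='lun'):
    station_td = tr.find('td', class_='station')
    # a non-blank station cell starts a new section
    if station_td and station_td.text.strip():
        current_station = station_td.text.strip()
        menu.setdefault(current_station, [])
    for span in tr.find_all('span', class_='ul'):
        if current_station:
            menu[current_station].append(span.text)

print(menu)
# e.g. {'Deli': ['Made to Order Deli Core', 'Chicken Caesar Wrap'], 'Dessert': ['Chicken Caesar Wrap']}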
