Unwanted CSV Output scraped from a website|Using Python and Selenium - python

I'm having trouble with the CSV export of data I am scraping from a website.
Output problems:
Either the output goes into columns, but only the first column is filled,
or the output goes into rows, but only one row is written.
I just want a normal table: one CSV row per table row and one column per cell.
Here's the part of the site's HTML that contains my target:
<tbody id="sitesList">
<tr data-value="11230" class="item-row">
<td class="text-left">example.com <i class="fa fa-external-link"></i>
<br><span>» view site details</span></td>
<td>92</td>
<td>71</td>
<td>Do Follow</td>
<td style="font-size:12px;font-family:sans-serif !important;">Education
<br>Family & Parenting
<br>Food & Drink
<br>
</td>
<td>Included</td>
<td><strong>$1</strong></td>
<td><span data-id="11230" class="btn btn-success btn-sm addtocart">Buy Website $1</span></td>
</tr>
<tr data-value="11229" class="item-row">
<td class="text-left">example1.com <i class="fa fa-external-link"></i>
<br><span>» view site details</span></td>
<td>65</td>
<td>34</td>
<td>Do Follow</td>
<td style="font-size:12px;font-family:sans-serif !important;">Business & Finance
<br>General: Multi-Niche
<br>
</td>
<td>Included</td>
<td><strong>$2</strong></td>
<td><span data-id="11229" class="btn btn-success btn-sm addtocart">Buy Website $2</span></td>
</tr>
<tr data-value="11228" class="item-row">
<td class="text-left">example2.com <i class="fa fa-external-link"></i>
<div class="tooltip owner_tooltip" style="float: right;opacity: 1;width: 20px;height: 20px;background-size: 100%;"><span class="tooltiptext">Owner Verified</span></div>
<br><span>» view site details</span></td>
<td>27</td>
<td>26</td>
<td>Do Follow</td>
<td style="font-size:12px;font-family:sans-serif !important;">Cryptocurrency
<br>
</td>
<td>Not Included</td>
<td><strong>$3</strong></td>
<td><span data-id="11228" class="btn btn-success btn-sm addtocart">Buy Website $3</span></td>
</tr>
<tr data-value="11227" class="item-row">
<td class="text-left">example3.com <i class="fa fa-external-link"></i>
<br><span>» view site details</span></td>
<td>23</td>
<td>29</td>
<td>Do Follow</td>
<td style="font-size:12px;font-family:sans-serif !important;">Business & Finance
<br>Health
<br>SEO & Digital Marketing
<br>
</td>
<td>Included</td>
<td><strong>$4</strong></td>
<td><span data-id="11227" class="btn btn-success btn-sm addtocart">Buy Website $4</span></td>
</tr>
</tbody>
I'm using Selenium, and here's my code:
siteList_tds = driver.find_elements(By.XPATH, "//tbody[@id='sitesList']//tr//td")
with open('test.csv', 'w') as f:
    write = csv.writer(f)
    for s in siteList_tds:
        write.writerow(s.text)  # or write.writerow([s.text])

I would try the following:
siteList_trs = driver.find_elements(By.XPATH, "//tbody[@id='sitesList']//tr")
with open('test.csv', 'w', newline='') as f:
    write = csv.writer(f)
    for r in siteList_trs:
        # link target of the first <a> element inside this row
        href = r.find_element(By.XPATH, './/a').get_attribute('href')
        tds = r.find_elements(By.XPATH, './/td')
        data = []
        for td in tds:
            data.append(td.text)
        # list.insert() returns None, so insert first and then write the list
        data.insert(0, href)
        write.writerow(data)
What this code does differently:
Finds the tr tags instead of the td tags to iterate through
Grabs the href attribute from the a tag that descends from each tr
Grabs the rest of that row's td elements
Iterates through the td's to create a list that consists of each td.text
Finally, inserts the href string at the beginning of the data list and writes the list as one row.
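
Both symptoms in the question come from how csv.writer.writerow treats its argument: it expects a sequence of fields, so a bare string is split into one field per character, while a one-element list produces a single column. Here is a minimal sketch using only the standard library; the `rows` data and the `to_csv` helper are made up for illustration and stand in for the scraped td texts.

```python
import csv
import io

# Hypothetical cell texts standing in for the scraped <td> values of two table rows.
rows = [
    ["example.com", "92", "71", "Do Follow"],
    ["example1.com", "65", "34", "Do Follow"],
]


def to_csv(write_rows):
    """Run write_rows against an in-memory CSV writer and return the CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    write_rows(writer)
    return buffer.getvalue()


# A bare string is iterated character by character: one field per character.
per_character = to_csv(lambda w: w.writerow("92"))

# A one-element list per cell gives one field per line, i.e. a single column.
single_column = to_csv(lambda w: [w.writerow([cell]) for cell in rows[0]])

# One list per table row gives the usual row-and-column layout.
tabular = to_csv(lambda w: w.writerows(rows))

print(per_character)
print(single_column)
print(tabular)
```

Collecting the td texts of each tr into one list, as the answer does, corresponds to the `tabular` case.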

Related

python beautifulsoup - parsing an HTML table row

I'm using BeautifulSoup to parse a bunch of combined tables' rows, row by row, column by column in order to import it into Pandas. I can't use to_html() because one of the columns has a list of tag links in each cell. The data structure is the same in all the tables.
I can't figure out the correct method to skip a td.div tag containing the attribute {'class': ['stars']}. My code below works, but it doesn't seem correct. I can't just do if col.div: continue, because some of the required columns have extra <div> tags that I need later.
def rebuild_row(self, row):
    new_row = []
    for col in row.find_all('td'):
        if col.img:
            continue
        if col.div and 'star' in str(col.div.attrs):
            continue
        if col.a:
            new_row.append(self.handle_links(col))
        else:
            if not col.text or not col.text.strip():
                new_row.append(['NaN'])
            else:
                new_text = self.clean_tag_text(col)
                new_row.append(new_text)
    return new_row
I first tried if 'stars' in col.div['class']: but it choked on key 'class'. So then I tried to find the error:
if col.div:
    if not hasattr(col.div, 'class'):
        continue
    else:
        print(f"{col.div['class']}")
but I get this output and an error that I don't understand. Shouldn't the not hasattr() check catch it?
['stars']
['stars']
Exception has occurred: KeyError
'class'
HTML row example:
<tr id="groupBook3144889">
<td width="5%"><img alt="The 7th of Victorica by Beau Schemery" src="https://i.gr-assets.com/images/S/compressed.photo.goodreads.com/books/1531785241l/40851529._SY75_.jpg" title="The 7th of Victorica by Beau Schemery"/></td>
<td width="30%">
The 7th of Victorica (Gadgets and Shadows, #2)
</td>
<td width="10%">
Schemery, Beau
<span title="Goodreads Author!">*</span>
</td>
<td width="1%">
<div class="stars" data-rating="0" data-resource-id="40851529" data-restore-rating="null" data-submit-url="/review/rate/40851529?stars_click=false" data-user-id="0"><a class="star off" href="#" ref="" title="did not like it">1 of 5 stars</a><a class="star off" href="#" ref="" title="it was ok">2 of 5 stars</a><a class="star off" href="#" ref="" title="liked it">3 of 5 stars</a><a class="star off" href="#" ref="" title="really liked it">4 of 5 stars</a><a class="star off" href="#" ref="" title="it was amazing">5 of 5 stars</a></div>
</td>
<td width="1%">
<a class="actionLinkLite" href="/group/bookshelf/64285?shelf=read">read</a>,
<a class="actionLinkLite" href="/group/bookshelf/64285?shelf=genre-action-adventure">genre-action-adve...</a>,
<a class="actionLinkLite" href="/group/bookshelf/64285?shelf=genre-steampunk-dieselpunk">genre-steampunk-d...</a>,
<a class="actionLinkLite" href="/group/bookshelf/64285?shelf=genre-young-adult">genre-young-adult</a>
</td>
<td width="1%">
</td>
<td width="1%">
</td>
<td width="1%">
Meghan
</td>
<td width="1%">2022/12/25</td>
<td class="view" width="1%">
<a class="actionLink" href="/group/show_book/64285?group_book_id=3144889" style="white-space: nowrap">view activity »</a>
</td>
</tr>
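Regarding the error
Exception has occurred: KeyError
'class'
hasattr() does not help here: BeautifulSoup resolves unknown Python attributes on a tag by searching for a child tag with that name (returning None if there is none), so hasattr(col.div, 'class') is always True and says nothing about the HTML class attribute.
You could avoid the error with
if col.div:
    if not col.div.get('class'):
        continue
    print(f"{col.div['class']}")
As for how to
figure out the correct method to skip a td.div tag containing the attribute {'class': ['stars']}
(I'm assuming that by td.div tag you mean a td tag containing a certain type of div.)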
If you use select [instead of find_all] with CSS Selectors, you can filter them out right from the start.
# for col in row.select('td:not(:has(div.stars)):not(:has(img))'):
for col in row.select('td:not(:has(div.stars))'):

Selenium find similar links based on conditional tags

I need to find a specific href link. Below is an example of three rows. The rows are very similar but differ slightly. I need the link for the Product ABC row that is SQLServer (MSSQL) and CS.
<tr>
<th class=\"align-middle\" scope=\"row\">
<span class=\"badge bg-primary position-relative py-2\">Product ABC
<span class=\"position-absolute top-0 start-100 translate-middle badge rounded-pill bg-secondary\">P3
</span>
</span>
</th>
<td class=\"align-middle small\">MySQL</td>
<td class=\"align-middle small\">MR</td>
<td class=\"align-middle small\">
<div class=\"btn-group\" role=\"group\">
<span data-bs-placement=\"left\" data-bs-toggle=\"tooltip\" title=\"\" data-bs-original-title=\"Show Application\" aria-label=\"Show Application\">
<a class=\"btn btn-sm btn-outline-primary\" href=\"/repo/applications/328\">
<svg class=\"bi flex-shrink-0\" height=\"18\" role=\"img\" width=\"18\">
<use href=\"#icon_eye\"></use>
</svg>
</a>
</span>
</div>
</td>
</tr>
<tr>
<th class=\"align-middle\" scope=\"row\">
<span class=\"badge bg-primary position-relative py-2\">Product ABC
<span class=\"position-absolute top-0 start-100 translate-middle badge rounded-pill bg-secondary\">P3
</span>
</span>
</th>
<td class=\"align-middle small\">MySQL</td>
<td class=\"align-middle small\">MR</td>
<td class=\"align-middle small\">
<div class=\"btn-group\" role=\"group\">
<span data-bs-placement=\"left\" data-bs-toggle=\"tooltip\" title=\"\" data-bs-original-title=\"Show Application\" aria-label=\"Show Application\">
<a class=\"btn btn-sm btn-outline-primary\" href=\"/repo/applications/329\">
<svg class=\"bi flex-shrink-0\" height=\"18\" role=\"img\" width=\"18\">
<use href=\"#icon_eye\"></use>
</svg>
</a>
</span>
</div>
</td>
</tr>
<tr>
<th class=\"align-middle\" scope=\"row\">
<span class=\"badge bg-primary position-relative py-2\">Product ABC
<span class=\"position-absolute top-0 start-100 translate-middle badge rounded-pill bg-secondary\">P3
</span>
</span>
</th>
<td class=\"align-middle small\">SQLServer</td>
<td class=\"align-middle small\">CS</td>
<td class=\"align-middle small\">
<div class=\"btn-group\" role=\"group\">
<span data-bs-placement=\"left\" data-bs-toggle=\"tooltip\" title=\"\" data-bs-original-title=\"Show Application\" aria-label=\"Show Application\">
<a class=\"btn btn-sm btn-outline-primary\" href=\"/repo/applications/330\">
<svg class=\"bi flex-shrink-0\" height=\"18\" role=\"img\" width=\"18\">
<use href=\"#icon_eye\"></use>
</svg>
</a>
</span>
</div>
</td>
</tr>
I currently have this
element = driver.find_element(By.XPATH, "//tr[.//span[contains(.,'Product ABC')]]//a")
element.get_attribute("href")
The code above works, but it returns the first Product ABC that it sees. In some cases that is OK, but sometimes it is incorrect. How do I filter my XPath so that it returns the href applications/330 and not the others?
If you want to select the a element containing the desired href based on both the Product ABC value and the SQLServer value, the XPath locator is as follows:
element = driver.find_element(By.XPATH, "//tr[.//span[contains(.,'Product ABC')] and .//td[contains(.,'SQLServer')]]//a")
If you also need to depend on CS, that condition can be added in the same way:
element = driver.find_element(By.XPATH, "//tr[.//span[contains(.,'Product ABC')] and .//td[contains(.,'SQLServer')] and .//td[contains(.,'CS')]]//a")
Locating the link based on MySQL and/or MR can be done in the same manner.
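
The same row-level filtering can be checked offline, without a browser. The sketch below swaps Selenium for the standard library's xml.etree.ElementTree, which supports only a subset of XPath, so the text conditions are checked in plain Python instead. The `TABLE` fragment is a trimmed, hypothetical copy of the markup from the question, and `find_href` is an illustrative helper, not part of any library.

```python
import xml.etree.ElementTree as ET

# Trimmed, hypothetical version of the table from the question (well-formed XML).
TABLE = """
<table>
  <tr>
    <th><span>Product ABC <span>P3</span></span></th>
    <td>MySQL</td><td>MR</td>
    <td><a href="/repo/applications/328">show</a></td>
  </tr>
  <tr>
    <th><span>Product ABC <span>P3</span></span></th>
    <td>SQLServer</td><td>CS</td>
    <td><a href="/repo/applications/330">show</a></td>
  </tr>
</table>
"""


def find_href(root, product, *cell_values):
    """Return the href of the first row whose header contains `product`
    and whose cells include every value in `cell_values`, else None."""
    for row in root.iter("tr"):
        header = "".join(row.find("th").itertext())
        # Exact cell texts; this is stricter than XPath's contains().
        cells = {"".join(td.itertext()).strip() for td in row.findall("td")}
        if product in header and all(value in cells for value in cell_values):
            return row.find(".//a").get("href")
    return None


root = ET.fromstring(TABLE)
print(find_href(root, "Product ABC", "SQLServer", "CS"))
```

As in the XPath locators, every condition is tested against the same tr, so a value found in a different row cannot produce a match.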

How to extract text from table moving between <tr> tags using Beautifulsoup

I need to extract text from a table using BeautifulSoup.
Below is the code which I have written and output
HTML:
<div class="Tech">
<div class="select">
<span>Selection is mandatory</span>
</div>
<table id="product">
<tbody>
<tr class="feature">
<td class="title" rowspan="3">
<h2>Information</h2>
</td>
<td class="label">
<h3>Design</h3>
</td>
<td class="checkbox">product</td>
</tr>
<tr>
<td class="label">
<h3>Marque</h3>
</td>
<td class="checkbox">
<input type="checkbox">
<label>retro</label>
<a href="link">
Landlord
</a>
</td>
</tr>
<tr>
<td class="label">
<h3>Model</h3>
</td>
<td class="checkbox">model123</td>
</tr>
import requests
from bs4 import BeautifulSoup
url='someurl.com'
source2= requests.get(url,timeout=30).text
soup2=BeautifulSoup(source2,'lxml')
element2= soup2.find('div',class_='Tech')
pin= element2.find('table',id='product').tbody.tr.text
print(pin)
Output that I am getting is:
Information
Design
product
How do I move between <tr>s? I need the output to be: model123.
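To get the output model123, you can try the following (soup is the parsed document, named soup2 in the question):
# search for the <h3> that contains "Model"
# (:-soup-contains replaces the deprecated :contains in newer soupsieve releases)
h3 = soup.select_one('h3:-soup-contains("Model")')
# search for the next <td>
model = h3.find_next("td").text
print(model)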
Prints:
model123
Or without CSS selectors:
model = (
    soup.find(lambda tag: tag.name == "h3" and tag.text.strip() == "Model")
    .find_next("td")
    .text
)
print(model)
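
If several values are needed rather than just Model, one option is to walk every tr once and build a label-to-value mapping. This is only a sketch: it uses the standard library's xml.etree.ElementTree instead of BeautifulSoup so that it runs without extra packages, the `TABLE` fragment is a simplified, well-formed copy of the markup from the question, and `label_values` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed copy of the table from the question.
TABLE = """
<table id="product">
  <tr>
    <td class="title" rowspan="3"><h2>Information</h2></td>
    <td class="label"><h3>Design</h3></td>
    <td class="checkbox">product</td>
  </tr>
  <tr>
    <td class="label"><h3>Marque</h3></td>
    <td class="checkbox"><label>retro</label><a href="link">Landlord</a></td>
  </tr>
  <tr>
    <td class="label"><h3>Model</h3></td>
    <td class="checkbox">model123</td>
  </tr>
</table>
"""


def label_values(table):
    """Map each row's <h3> label to the text of its 'checkbox' cell."""
    values = {}
    for row in table.iter("tr"):
        label = row.find("td[@class='label']/h3")
        value = row.find("td[@class='checkbox']")
        if label is not None and value is not None:
            # Join the non-empty text pieces of the cell with single spaces.
            pieces = [t.strip() for t in value.itertext() if t.strip()]
            values[label.text.strip()] = " ".join(pieces)
    return values


values = label_values(ET.fromstring(TABLE))
print(values["Model"])
```

The lookup by label does not depend on the row order, so it keeps working if rows are added or reordered.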

Beautifulsoup to parse html table for text and links

I have a table with several columns. The last column may contain links to documents; the number of links per cell is not fixed (zero or more).
<tbody>
<tr>
<td>
<h2>Table Section</h2>
</td>
</tr>
<tr>
<td>
Object 1
</td>
<td>Param 1</td>
<td>
<span class="text-nowrap">Param 2</span>
</td>
<td class="text-nowrap"></td>
</tr>
<tr>
<td>
Object 2
</td>
<td>Param 1</td>
<td>
<span class="text-nowrap">Param 2</span>
<td>
<ul>
<li>
<small>
TitleNotes
</small>
</li>
<li>
<small>
Title2Notes2
</small>
</li>
</ul>
</td>
</tr>
</tbody>
So basic parsing is not a problem. I'm stuck on getting those links with their titles and notes and appending them to a Python list (or NumPy array).
from bs4 import BeautifulSoup
with open("new 1.html", encoding="utf8") as dump:
    soup = BeautifulSoup(dump, features="lxml")
data = []
table_body = soup.find('tbody')
rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append(cols)
    a = row.find_all('a')
    for ele1 in a:
        if ele1.get('href') != "#":
            data.append([ele1.get('href')])
print(*data, sep='\n')
Output:
['Table Section']
['Object 1', 'Param 1', 'Param 2', '']
['Object 2', 'Param 1', 'Param 2', 'TitleNotes\n\t\t\t \n\n\n\nTitle2Notes2']
['link_to.doc']
['another_link_to.doc']
Is there any way to append the links to the row's own list instead? I want the list for the second row to look like this:
['Object 2', 'Param 1', 'Param 2', 'Title', 'Notes', 'link_to.doc', ' Title2', 'Notes2', 'another_link_to.doc']
Something like this
from bs4 import BeautifulSoup
html = '''<tbody>
<tr>
<td>
<h2>Table Section</h2>
</td>
</tr>
<tr>
<td>
Object 1
</td>
<td>Param 1</td>
<td>
<span class="text-nowrap">Param 2</span>
</td>
<td class="text-nowrap"></td>
</tr>
<tr>
<td>
Object 2
</td>
<td>Param 1</td>
<td>
<span class="text-nowrap">Param 2</span>
<td>
<ul>
<li>
<small>
TitleNotes
</small>
</li>
<li>
<small>
Title2Notes2
</small>
</li>
</ul>
</td>
</tr>
</tbody>'''
soup = BeautifulSoup(html, features="lxml")
smalls = soup.find_all('small')
links = [s.contents[1].attrs['href'] for s in smalls]
print(links)
output
['link_to.doc', 'another_link_to.doc']

Beautiful Soup Table, stop getting info

Hey everyone, I have some HTML that I am parsing. Here it is:
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title></title>
</head>
<body>
<table class="dayinner">
<tr class="lun">
<td class="mealname" colspan="3">LUNCH</td>
</tr>
<tr class="lun">
<td class="station"> Deli</td>
<td class="menuitem">
<div class="menuitem">
<input class="chk" id="S1L0000010000047598_35356" onclick=
"rptlist(this);" onmouseout="wschk(0);" onmouseover=
"wschk(1);" type="checkbox" /> <span class="ul" onclick=
"nf('0000047598_35356');" onmouseout="pcls(this);"
onmouseover="ws(this);">Made to Order Deli Core</span>
</div>
</td>
<td class="price"></td>
</tr>
<tr class="lun">
<td class="station"> </td>
<td class="menuitem">
<div class="menuitem">
<input class="chk" id="S1L0000020000046033_63436" onclick=
"rptlist(this);" onmouseout="wschk(0);" onmouseover=
"wschk(1);" type="checkbox" /> <span class="ul" onclick=
"nf('0000046033_63436');" onmouseout="pcls(this);"
onmouseover="ws(this);">Chicken Caesar Wrap</span>
</div>
</td>
<td class="price"></td>
</tr>
<tr class="lun">
<td colspan="3" style="height:3px;"></td>
</tr>
<tr class="lun">
<td colspan="3" style="background-color:#c0c0c0; height:1px;"></td>
</tr>
<tr class="lun">
<td class="station"> Dessert</td>
<td class="station"> </td>
<td class="menuitem">
<div class="menuitem">
<input class="chk" id="S1L0000020000046033_63436" onclick=
"rptlist(this);" onmouseout="wschk(0);" onmouseover=
"wschk(1);" type="checkbox" /> <span class="ul" onclick=
"nf('0000046033_63436');" onmouseout="pcls(this);"
onmouseover="ws(this);">Chicken Caesar Wrap</span>
</div>
</td>
</tr>
</table>
</body>
</html>
Here is the code I have. I want just the items under the Deli section, and normally I won't know how many there are. Is there a way to do this?
soup = BeautifulSoup(open("upperMenu.html"))
title = soup.find('td', class_='station').text.strip()
spans = soup.find_all('span', class_='ul')[:2]
But this only works if there are two items. How can I make it work if the number of items is unknown?
Thanks in advance
You can use the text argument of find_all to (1) find all the station cells whose text contains the substring Deli, and then (2) go up to each cell's row and find the spans within that row whose class is ul.
import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(text, 'html.parser')  # text holds the HTML shown above
tds_deli = soup.find_all(name='td', attrs={'class': 'station'}, text=re.compile('Deli'))
for td in tds_deli:
    tr = td.find_parent()  # the <tr> that contains this station cell
    spans = tr.find_all('span', {'class': 'ul'})
    for span in spans:
        # do something
        print(span.text)
    print('------------one row -------------')
Sample Output in this case:
Made to Order Deli Core
------------one row -------------
Not sure if I am understanding the problem correctly but I think my code might help you get started.
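
To handle an unknown number of items, one possible approach is to walk the rows in order and remember the most recent non-empty station name. This sketch rests on an assumption the question does not confirm: that a row with a blank station cell belongs to the station named above it. It uses the standard library's xml.etree.ElementTree rather than BeautifulSoup so that it runs without extra packages; the `MENU` fragment is a trimmed copy of the markup, the "Cookie" item is invented for illustration, and `items_by_station` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the menu table; "Cookie" is an invented item for illustration.
MENU = """
<table class="dayinner">
  <tr>
    <td class="station"> Deli</td>
    <td class="menuitem"><span class="ul">Made to Order Deli Core</span></td>
  </tr>
  <tr>
    <td class="station"> </td>
    <td class="menuitem"><span class="ul">Chicken Caesar Wrap</span></td>
  </tr>
  <tr><td colspan="3"></td></tr>
  <tr>
    <td class="station"> Dessert</td>
    <td class="menuitem"><span class="ul">Cookie</span></td>
  </tr>
</table>
"""


def items_by_station(table):
    """Group menu items under the most recent non-empty station name."""
    stations = {}
    current = None
    for row in table.iter("tr"):
        station = row.find("td[@class='station']")
        name = (station.text or "").strip() if station is not None else ""
        if name:
            # A named station cell starts a new section.
            current = name
        for span in row.iter("span"):
            if current is not None and span.get("class") == "ul":
                stations.setdefault(current, []).append(span.text.strip())
    return stations


stations = items_by_station(ET.fromstring(MENU))
print(stations["Deli"])
```

Because the grouping follows the row order rather than a fixed slice such as [:2], the Deli list grows or shrinks with the page.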
