I'm new to Selenium and to parsing data from websites.
The problem: I have a website table with the following HTML code:
<table width="580" cellspacing="1" cellpadding="3" bgcolor="#ffffff" id="restab">
<tbody>
<tr align="center" valign="middle">
<td width="40" bgcolor="#555555"><font color="#ffffff">№</font></td>
<td width="350" bgcolor="#555555"><font color="#ffffff">Название организации</font></td>
<td width="100" bgcolor="#555555"><font color="#ffffff">Город</font></td>
<td width="60" bgcolor="#555555"><span title="Число публикаций данной организации на eLibrary.Ru"><font color="#ffffff">Публ.</font></span></td><td width="30" bgcolor="#555555"><span title="Число ссылок на публикации организации"><font color="#ffffff">Цит.</font></span></td>
</tr>
<tr valign="middle" bgcolor="#f5f5f5" id="a18098">
<td align="center"><font color="#00008f">1</font></td>
<td align="left"><font color="#00008f"><a href="org_about.asp?orgsid=18098">
"Академия информатизации образования" по Ленинградской области</a></font></td>
<td align="center"><font color="#00008f">Гатчина</font></td>
<td align="right"><font color="#00008f">0<img src="/pic/1pix.gif" hspace="16"></font></td>
<td align="center"><font color="#00008f">0</font></td>
</tr>
<tr valign="middle" bgcolor="#f5f5f5" id="a17954">
<td align="center"><font color="#00008f">2</font></td>
<td align="left"><font color="#00008f"><a href="org_about.asp?orgsid=17954">
"Академия талантов" Санкт-Петербурга</a></font></td>
<td align="center"><font color="#00008f">Санкт-Петербург</font></td>
<td align="right"><font color="#00008f">3<img src="/pic/stat.gif" width="12" height="13" hspace="10" border="0"></font></td>
<td align="center"><font color="#00008f">0</font></td>
</tr>
</tbody>
</table>
and I need to get all of these table values, plus the href of each link in the left-hand name column.
I tried to use XPath, but it raises an error. How can I do this better?
In the end I need a dataframe with the table values plus an extra column holding the href from the name column.
First, try pandas.read_html(). See the pandas code example further below.
If that doesn't work, use the right-click menu in a browser such as Mozilla Firefox (Inspect Element) or Google Chrome (Developer Tools) to find the CSS selector or XPath, then feed it into Selenium.
Another useful tool for finding complicated CSS selectors/XPaths is the Inspector Gadget browser plug-in.
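If you do end up feeding a locator to Selenium, here is a minimal sketch of that workflow (the URL is a placeholder; the restab id comes from the table in the question):

from selenium import webdriver

driver = webdriver.Chrome()
driver.get('http://example.com/page-with-table')  # placeholder URL
# paste the CSS selector or XPath copied from the browser's inspector
cells = driver.find_elements_by_xpath("//table[@id='restab']//tr/td")
print([cell.text for cell in cells])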
import pandas as pd
# this is the website you want to read ... table with "Minimum Level for Adult Cats"
str_url = 'http://www.felinecrf.org/catfood_data_how_to_use.htm'
# use pandas.read_html()
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html
list_df = pd.read_html(str_url, match='DMA')
print('Number of dataframes on the page: ', len(list_df))
print()
for idx, each_df in enumerate(list_df):
print(f'Show dataframe number {idx}:')
print(each_df.head())
print()
# use table 2 on the page
df_target = list_df[2]
# create column headers
# https://chrisalbon.com/python/data_wrangling/pandas_rename_column_headers/
header_row = df_target.iloc[0]
# Replace the dataframe with a new one which does not contain the first row
df_target = df_target[1:]
# Rename the dataframe's column values with the header variable
df_target.columns = header_row
print(df_target.head())
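Note that read_html() captures only the cell text, not the link targets. Below is a minimal sketch of one way to build the extra href column with BeautifulSoup, assuming html_source holds the page HTML (e.g. driver.page_source) and using the restab table id from the question:

import pandas as pd
from bs4 import BeautifulSoup

# html_source is assumed to hold the page HTML, e.g. driver.page_source
soup = BeautifulSoup(html_source, 'html.parser')
table = soup.find('table', id='restab')

data = []
for row in table.find_all('tr')[1:]:  # skip the header row
    cells = [td.get_text(strip=True) for td in row.find_all('td')]
    link = row.find('a')
    cells.append(link['href'] if link is not None else None)
    data.append(cells)

df = pd.DataFrame(
    data,
    columns=['№', 'Название организации', 'Город', 'Публ.', 'Цит.', 'href'])
print(df.head())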
Related
I have a webpage which looks like this:
<table class="data" width="100%" cellpadding="0" cellspacing="0">
<tbody>
<tr>
<th>1</th>
<th>2</th>
<th>3 by</th>
</tr>
<tr>
<td width="10%">5120432</td>
<td width="70%">INTERESTED_SITE1/</td>
<td width="20%">foo2</td>
</tr>
<tr class="alt">
<td width="10%">5120431</td>
<td width="70%">INTERESTED_SITE2</td>
<td width="20%">foo2</td>
</tr>
</tbody>
</table>
I want to put those two sites somewhere (interested_site1 and interested_site2). I tried doing something like this:
chrome = webdriver.Chrome(chrome_path)
chrome.get("fooSite")
time.sleep(.5)
alert = chrome.find_element_by_xpath("/div/table/tbody/tr[2]/td[2]").text
print (alert)
But I can't find the first site. If I can't do this in a for loop, I don't mind getting every link separately. How can I get to that link?
It would be easier to use a CSS query:
driver.find_element_by_css_selector("td:nth-child(2)")
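For example, here is a sketch using the question's chrome driver and the table's class from the HTML above; find_elements (plural) returns every match at once:

# second cell of every row in the table with class "data"
cells = chrome.find_elements_by_css_selector("table.data tr td:nth-child(2)")
for cell in cells:
    print(cell.text)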
You can use an XPath expression to deal with this by looping over each row.
XPath expression: html/body/table/tbody/tr[i]/td[2]
Get the number of rows first:
totals_rows = chrome.find_elements_by_xpath("html/body/table/tbody/tr")
total_rows_length = len(totals_rows)
# XPath positions are 1-based, so count from 1 up to the row count
for counter in range(1, total_rows_length + 1):
    site = "html/body/table/tbody/tr[" + str(counter) + "]/td[2]"
    print("site name is: " + chrome.find_element_by_xpath(site).text)
Basically, loop through each row and get the value in the second column (td[2]).
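Alternatively, here is a sketch that searches within each row element already found, so the XPath string does not have to be rebuilt each time (the leading ./ scopes the query to that row):

for row in totals_rows:
    print("site name is: " + row.find_element_by_xpath("./td[2]").text)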
I have a table (screenshot below) where I want to mark off the checkboxes that have the text "Xatu Auto Test" in the same row using selenium python.
I've tried following these two posts:
Iterating Through a Table in Selenium Very Slow
Get row & column values in web table using python web driver
But I couldn't get those solutions to work on my code.
My code:
form = self.browser.find_element_by_id("quotes-form")
try:
rows = form.find_elements_by_tag_name("tr")
for row in rows:
columns = row.find_elements_by_tag_name("td")
for column in columns:
if column.text == self.group_name:
column.find_element_by_name("quote_id").click()
except NoSuchElementException:
pass
The checkboxes are never clicked and I am wondering what I am doing wrong.
This is the HTML when I inspect with FirePath:
<form id="quotes-form" action="/admin/quote/delete_multiple" method="post" name="quotesForm">
<table class="table table-striped table-shadow">
<thead>
<tbody id="quote-rows">
<tr>
<tr>
<td class="document-column">
<td>47</td>
<td class="nobr">
<td class="nobr">
<td class="nobr">
<td class="nobr">
<a title="Xatu Auto Test Data: No" href="http://192.168.56.10:5001/admin/quote/47/">Xatu Auto Test</a>
</td>
<td>$100,000</td>
<td style="text-align: right;">1,000</td>
<td class="nobr">Processing...</td>
<td class="nobr">192.168....</td>
<td/>
<td>
<input type="checkbox" value="47" name="quote_id"/>
</td>
</tr>
<tr>
</tbody>
<tbody id="quote-rows-footer">
</table>
<div class="btn-toolbar" style="text-align:center; width:100%;">
At a quick look, I reckon this line needs changing: you're trying to access the column's quote_id when it should be the row's:
From:
column.find_element_by_name("quote_id").click()
To:
row.find_element_by_name("quote_id").click()
P.S. This assumes that, as @Saifur commented, you have your text comparison done correctly.
Update:
I have run a simulation, and the checkbox is indeed ticked after changing column to row. Simplified version:
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('your-form-sample.html')
form = driver.find_element_by_id("quotes-form")
rows = form.find_elements_by_tag_name("tr")
for row in rows:
columns = row.find_elements_by_tag_name("td")
for column in columns:
# I changed this to the actual string provided your comparison is correct
if column.text == 'Xatu Auto Test':
# you need to change from column to row, and it will work
row.find_element_by_name("quote_id").click()
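As an alternative sketch, a single XPath can locate the row that contains the link text from the question and tick its checkbox in one query (assuming the same markup as above):

checkbox = driver.find_element_by_xpath(
    "//tr[.//a[contains(., 'Xatu Auto Test')]]//input[@name='quote_id']")
checkbox.click()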
I seem to be stuck. If I had the following table:
<table align=center cellpadding=3 cellspacing=0 border=1>
<tr bgcolor="#EEEEFF">
<td align="center">
40 </td>
<td align="center">
44 </td>
<td align="center">
<font color="green"><b>+4</b></font>
</td>
<td align="center">
1,000</td>
<td align="center">
15,000 </td>
<td align="center">
44,000 </td>
<td align="center">
<font color="green"><b><nobr>+193.33%</nobr></b></font>
</td>
</tr>
What would be the ideal way to use find_all to pull the 44,000 <td> from the table?
If it is a recurring position in the table you would like to scrape, I would use Beautiful Soup to extract all the <td> elements in the table and then pull out that data by position. See the pseudo code below.
known_position = 5  # the 44,000 cell is the sixth <td>, zero-indexed
tds = soup.find_all('td')
number = tds[known_position].text
On the other hand, if you're specifically searching for a given value, I would just iterate over the list:
tds = soup.find_all('td')
for td in tds:
    if td.text.strip() == 'number here':
        # do your stuff
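A more direct alternative for this particular table is a sketch like the following, which matches the cell text while ignoring the surrounding whitespace visible in the HTML above (soup is the parsed document):

td = soup.find('td', text=lambda s: s and s.strip() == '44,000')
print(td.text.strip())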
I have a question about selecting a list of tags (or single tags) using a condition on one of the attributes of its children. Specifically, given the HTML code:
<tbody>
<tr class="" data-row="0">
<tr class="" data-row="1">
<tr class="" data-row="2">
<td align="right" csk="13">13</td>
<td align="left" csk="Jones,Andre">Andre Jones
</td>
<tr class="" data-row="3">
<td align="right" csk="7">7</td>
<td align="left" csk="Jones,DeAndre">DeAndre Jones
</td>
<tr class="" data-row="4">
<tr class="" data-row="5">
I have a unicode variable Player coming from an external loop, and I am trying to look through each row in the table to extract the <tr> tags where Player == Table.tr.a.text, and to identify duplicate player names in Table. So, for instance, if there is more than one player with Player=Andre Jones, the MyRow object should return all <tr> tags that contain that player's name, while if there is only one row with Player=Andre Jones, then MyRow should just contain the single <tr> element whose anchor text equals Andre Jones. I've been trying things like
Table = soup.find('tbody')
MyRow = Table.find_all(lambda X: X.name=='tr' and Player == X.text)
But this returns [] for MyRow. If I use
MyRow = Table.find_all(lambda X: X.name=='tr' and Player in X.text)
This will pick up any <tr> that has Player as a substring of X.text. In the example code above, it extracts both the <tr> tag with Table.tr.td.a.text=='Andre Jones' and the one with Table.tr.td.a.text=='DeAndre Jones'. Any help would be appreciated.
You could do this easily with XPath and lxml:
import lxml.html
root = lxml.html.fromstring('''...''')
td = root.xpath('//tr[.//a[text() = "FooName"]]')
The BeautifulSoup "equivalent" would be something like:
rows = soup.find('tbody').find_all('tr')
td = next(row for row in rows if row.find('a', text='FooName'))
Or if you think about it backwards:
td = soup.find('a', text='FooName').find_parent('tr')
Whatever you desire. :)
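Since the question wants every row for a duplicated player name, not just the first, here is a sketch that collects all matches (Player is the variable from the question):

rows = [a.find_parent('tr') for a in soup.find_all('a', text=Player)]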
Solution 1
Logic: find the first tag whose tag name is tr and whose text (including the text of its children) matches 'FooName'.
# Exact match (tag.text is unicode; turn it into str for the comparison)
print(Table.find(lambda tag: tag.name == 'tr' and 'FooName' == tag.text.encode('utf-8')))
# Fuzzy match
# print(Table.find(lambda tag: tag.name == 'tr' and 'FooName' in tag.text))
Output:
<tr class="" data-row="2">
<td align="right" csk="3">3</td>
<td align="left" csk="Wentz,Parker">
FooName
</td>
</tr>
Solution 2
Logic: find the element whose text contains 'FooName' (the anchor tag in this case), then go up the tree and find its nearest ancestor whose tag name is tr.
# Exact match
print(Table.find(text='FooName').find_parent('tr'))
# Fuzzy match
# import re
# print(Table.find(text=re.compile('FooName')).find_parent('tr'))
Output:
<tr class="" data-row="2">
<td align="right" csk="3">3</td>
<td align="left" csk="Wentz,Parker">
FooName
</td>
</tr>
I am parsing an HTML document using Beautiful Soup 4.
Here is an example of a table in the document:
<tr>
<td class="nob"></td>
<td class="">Time of price</td>
<td class=" pullElement pullData-DE000BWB14W0.teFull">08/06/2012</td>
<td class=" pullElement pullData-DE000BWB14W0.PriceTimeFull">11:43:08 </td>
<td class="nob"></td>
</tr>
<tr>
<td class="nob"></td>
<td class="">Daily volume (units)</td>
<td colspan="2" class=" pullElement pullData-DE000BWB14W0.EWXlume">0</td>
<td class="nob"></td>
</tr>
I would like to extract 08/06/2012 and 11:43:08, the daily volume, 0, etc.
This is my code to find the specific table and all of its data:
html = open("some_file.html")
soup = BeautifulSoup(html)
t = soup.find(id="ctnt-2308")
dat = [[str(td) for td in row.findAll("td")] for row in t.findAll("tr")]
I get a list of data that needs to be organized.
Any suggestions on how to do this in a simple way?
Thank you
list(soup.stripped_strings)
will give you all the strings in that soup (with surrounding whitespace stripped).
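For example, applied to the question's table (a sketch; the id comes from the question's own code), you can group the strings row by row:

from bs4 import BeautifulSoup

soup = BeautifulSoup(open("some_file.html"), "html.parser")
t = soup.find(id="ctnt-2308")
# one list of cell strings per row, whitespace stripped
dat = [list(tr.stripped_strings) for tr in t.findAll("tr")]
print(dat)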