I have a webpage which looks like this:
<table class="data" width="100%" cellpadding="0" cellspacing="0">
<tbody>
<tr>
<th>1</th>
<th>2</th>
<th>3 by</th>
</tr>
<tr>
<td width="10%">5120432</td>
<td width="70%">INTERESTED_SITE1/</td>
<td width="20%">foo2</td>
</tr>
<tr class="alt">
<td width="10%">5120431</td>
<td width="70%">INTERESTED_SITE2</td>
<td width="20%">foo2</td>
</tr>
</tbody>
</table>
I want to extract those two sites (INTERESTED_SITE1 and INTERESTED_SITE2). I tried something like this:
chrome = webdriver.Chrome(chrome_path)
chrome.get("fooSite")
time.sleep(.5)
alert = chrome.find_element_by_xpath("/div/table/tbody/tr[2]/td[2]").text
print (alert)
But I can't find the first site. If I can't do this in a for loop, I don't mind getting every link separately. How can I get to that link?
It would be easier to use a CSS selector (note that find_element_by_css_selector returns only the first match; use the plural find_elements_by_css_selector to collect every second-column cell):
driver.find_element_by_css_selector("td:nth-child(2)")
Alternatively, you can loop over each row with an XPath expression of the form html/body/table/tbody/tr[i]/td[2], where i is the 1-based row index.
Get all the rows first, then build the XPath for each one:
totals_rows = chrome.find_elements_by_xpath("html/body/table/tbody/tr")
count = 1
for row in totals_rows:
    site = "html/body/table/tbody/tr[" + str(count) + "]/td[2]"
    print("site name is: " + chrome.find_element_by_xpath(site).text)
    count += 1
Basically, loop through each row and get the value in the second column (td[2]).
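To sanity-check the extraction logic without a browser, here is a minimal sketch using only the standard library's ElementTree on a trimmed, well-formed copy of the question's table (the list indexing mirrors the td[2] step in the XPath above; the Selenium calls themselves are unchanged):

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the question's table.
snippet = """<table class="data"><tbody>
<tr><th>1</th><th>2</th><th>3 by</th></tr>
<tr><td>5120432</td><td>INTERESTED_SITE1/</td><td>foo2</td></tr>
<tr><td>5120431</td><td>INTERESTED_SITE2</td><td>foo2</td></tr>
</tbody></table>"""

root = ET.fromstring(snippet)
# Take the second <td> of every data row; the header row has no <td>
# elements, so the filter skips it.
sites = [tr.findall('td')[1].text for tr in root.iter('tr') if tr.findall('td')]
print(sites)  # ['INTERESTED_SITE1/', 'INTERESTED_SITE2']
```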
I'm new to Selenium and to parsing data from websites.
The problem: I have a website table with the following HTML:
<table width="580" cellspacing="1" cellpadding="3" bgcolor="#ffffff" id="restab">
<tbody>
<tr align="center" valign="middle">
<td width="40" bgcolor="#555555"><font color="#ffffff">№</font></td>
<td width="350" bgcolor="#555555"><font color="#ffffff">Название организации</font></td>
<td width="100" bgcolor="#555555"><font color="#ffffff">Город</font></td>
<td width="60" bgcolor="#555555"><span title="Число публикаций данной организации на eLibrary.Ru"><font color="#ffffff">Публ.</font></span></td><td width="30" bgcolor="#555555"><span title="Число ссылок на публикации организации"><font color="#ffffff">Цит.</font></span></td>
</tr>
<tr valign="middle" bgcolor="#f5f5f5" id="a18098">
<td align="center"><font color="#00008f">1</font></td>
<td align="left"><font color="#00008f"><a href="org_about.asp?orgsid=18098">
"Академия информатизации образования" по Ленинградской области</a></font></td>
<td align="center"><font color="#00008f">Гатчина</font></td>
<td align="right"><font color="#00008f">0<img src="/pic/1pix.gif" hspace="16"></font></td>
<td align="center"><font color="#00008f">0</font></td>
</tr>
<tr valign="middle" bgcolor="#f5f5f5" id="a17954">
<td align="center"><font color="#00008f">2</font></td>
<td align="left"><font color="#00008f"><a href="org_about.asp?orgsid=17954">
"Академия талантов" Санкт-Петербурга</a></font></td>
<td align="center"><font color="#00008f">Санкт-Петербург</font></td>
<td align="right"><font color="#00008f">3<img src="/pic/stat.gif" width="12" height="13" hspace="10" border="0"></font></td>
<td align="center"><font color="#00008f">0</font></td>
</tr>
</tbody>
</table>
I need to get all of the table values, plus the href of each link in the left column.
I tried to use XPath, but it raises an error. What is a better way to do this?
In the end, I need a dataframe with the table values plus an extra column holding the href from the left column.
First try to use pandas.read_html(). See code example below.
If that doesn't work, use the right-click menu in a browser such as Mozilla Firefox (Inspect Element) or Google Chrome (Developer Tools) to find the CSS or XPath. Then feed the CSS or XPath into Selenium.
Another useful tool for finding complicated CSS/XPath selectors is the Inspector Gadget browser plug-in.
import pandas as pd
# this is the website you want to read ... table with "Minimum Level for Adult Cats"
str_url = 'http://www.felinecrf.org/catfood_data_how_to_use.htm'
# use pandas.read_html()
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html
list_df = pd.read_html(str_url, match='DMA')
print('Number of dataframes on the page: ', len(list_df))
print()
for idx, each_df in enumerate(list_df):
    print(f'Show dataframe number {idx}:')
    print(each_df.head())
    print()
# use table 2 on the page
df_target = list_df[2]
# create column headers
# https://chrisalbon.com/python/data_wrangling/pandas_rename_column_headers/
header_row = df_target.iloc[0]
# Replace the dataframe with a new one which does not contain the first row
df_target = df_target[1:]
# Rename the dataframe's column values with the header variable
df_target.columns = header_row
print(df_target.head())
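One caveat: pandas.read_html() keeps only the cell text, so the href the asker wants has to be collected separately and joined on as an extra column. A minimal, stdlib-only sketch of the href collection, run on a simplified, well-formed copy of the table (the orgsid values come from the question's HTML; everything else here is illustrative):

```python
import xml.etree.ElementTree as ET

snippet = """<table id="restab"><tbody>
<tr><td>1</td><td><a href="org_about.asp?orgsid=18098">Org A</a></td></tr>
<tr><td>2</td><td><a href="org_about.asp?orgsid=17954">Org B</a></td></tr>
</tbody></table>"""

root = ET.fromstring(snippet)
# One href per row, in document order.
hrefs = [a.get('href') for a in root.iter('a')]
print(hrefs)  # ['org_about.asp?orgsid=18098', 'org_about.asp?orgsid=17954']
```

Since the hrefs come back in row order, they could then be attached with something like df_target['href'] = hrefs (assuming one link per row).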
I want to edit a table of an .htm file, which roughly looks like this:
<table>
<tr>
<td>
parameter A
</td>
<td>
value A
</td>
</tr>
<tr>
<td>
parameter B
</td>
<td>
value B
</td>
</tr>
...
</table>
I made a preformatted template in Word, which has nicely formatted style="" attributes. I insert parameter values into the appropriate tds from a poorly formatted .html file (this is the output of a scientific program). My job is basically to automate the creation of HTML tables so that they can be used in a paper.
This works fine as long as the template has empty td instances in each tr. But when I try to create additional tds inside a tr (over which I iterate), I get stuck. The .append and .insert_after methods on the rows just move the existing td instances around. I need to create new tds, since I want to set the number of columns dynamically, and I need to iterate over up to 5 unformatted input .html files.
from bs4 import BeautifulSoup

with open('template.htm') as template:
    template = BeautifulSoup(template)

template = template.find('table')
lines_template = template.findAll('tr')
for line in lines_template:
    newtd = line.findAll('td')[-1]
    newtd['control_string'] = 'this_is_new'
    line.append(newtd)
=> No new tds appear; the last one is just modified in place, and no new column is created.
I want to copy and paste the last td in a row, because it will have the correct style="" for that row. Is it possible to just copy a bs4.element with all the formatting and add it as the last td in a tr? If not, what module/approach should I use?
Thanks in advance.
You can copy the attributes by passing the last cell's attrs to soup.new_tag():
data = '''<table>
<tr>
<td style="color:red;">
parameter A
</td>
<td style="color:blue;">
value A
</td>
</tr>
<tr>
<td style="color:red;">
parameter B
</td>
<td style="color:blue;">
value B
</td>
</tr>
</table>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'lxml')
for i, tr in enumerate(soup.select('tr'), 1):
    tds = tr.select('td')
    new_td = soup.new_tag('td', attrs=tds[-1].attrs)
    new_td.append('This is data for row {}'.format(i))
    tr.append(new_td)
print(soup.table.prettify())
Prints:
<table>
<tr>
<td style="color:red;">
parameter A
</td>
<td style="color:blue;">
value A
</td>
<td style="color:blue;">
This is data for row 1
</td>
</tr>
<tr>
<td style="color:red;">
parameter B
</td>
<td style="color:blue;">
value B
</td>
<td style="color:blue;">
This is data for row 2
</td>
</tr>
</table>
So I have a table that can have from 0 to x rows and always has 7 columns. Something like below:
Type Price Store Weight For-sale Stock Discount
x
x
x
x
x
Here is how the HTML looks:
<table id="my_table" class="datatable" cellspacing="0" cellpadding="0" border="0" style="border-width:0px;border-collapse:collapse;">
<tbody>
<tr>
<tr class="row" style="cursor:pointer;" onclick="javascript:__doPostBack('my$table','Select$0')">
<td>
<td class="first">Meat</td>
<td>75</td>
<td>Adams grocery</td>
<td align="center">1kg</td>
<td>Yes</td>
<td>Full</td>
<td>Yes</td>
<td>
</tr>
<tr class="row" style="cursor:pointer;" onclick="javascript:__doPostBack('my$table','Select$1')">
<td>
<td class="first">Vegetable</td>
<td>25</td>
<td>Adams grocery</td>
<td align="center">0.5kg</td>
<td>No</td>
<td>Empty</td>
<td>No</td>
<td>
</tr>
</tbody>
</table>
</div>
What I want to do is click each row (if any exist) that contains the text "Adams grocery" (which is in column 3), so that each one opens in a separate tab, and then give new instructions to all the tabs at once. For example: click the button "welcome" on every tab.
I have a feeling the above might be a little too complicated for me as a beginner, so maybe just clicking on one of the rows is a good start.
I've been thinking about this the whole day; thanks for any help!
Do you need something like this? Tested against this HTML:
http://jsfiddle.net/zvhrm6tf/
from selenium.webdriver.support.wait import WebDriverWait
td_list = WebDriverWait(driver, 10).until(lambda driver: driver.find_elements_by_css_selector("#my_table tr td"))
for td in td_list:
    if td.text == "Adams grocery":
        td.click()
And if you need the table row itself, you can go up from the matching cell:
tr = td.find_element_by_xpath("..")
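Alternatively, the rows can be matched in one step with an XPath predicate instead of walking up from each cell, e.g. driver.find_elements_by_xpath("//tr[td='Adams grocery']") (a hypothetical selector, not from the question). ElementTree supports the same predicate form, so the idea can be checked offline on a cut-down table:

```python
import xml.etree.ElementTree as ET

snippet = """<table><tbody>
<tr><td>Meat</td><td>Adams grocery</td></tr>
<tr><td>Vegetable</td><td>Other store</td></tr>
</tbody></table>"""

root = ET.fromstring(snippet)
# Select whole rows that have a <td> whose text equals the store name.
matching_rows = root.findall(".//tr[td='Adams grocery']")
print(len(matching_rows))  # 1
```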
I have a table (screenshot below) where I want to mark off the checkboxes that have the text "Xatu Auto Test" in the same row using selenium python.
I've tried following these two posts:
Iterating Through a Table in Selenium Very Slow
Get row & column values in web table using python web driver
But I couldn't get those solutions to work on my code.
My code:
form = self.browser.find_element_by_id("quotes-form")
try:
    rows = form.find_elements_by_tag_name("tr")
    for row in rows:
        columns = row.find_elements_by_tag_name("td")
        for column in columns:
            if column.text == self.group_name:
                column.find_element_by_name("quote_id").click()
except NoSuchElementException:
    pass
The checkboxes are never clicked and I am wondering what I am doing wrong.
This is the HTML when I inspect with FirePath:
<form id="quotes-form" action="/admin/quote/delete_multiple" method="post" name="quotesForm">
<table class="table table-striped table-shadow">
<thead>
<tbody id="quote-rows">
<tr>
<tr>
<td class="document-column">
<td>47</td>
<td class="nobr">
<td class="nobr">
<td class="nobr">
<td class="nobr">
<a title="Xatu Auto Test Data: No" href="http://192.168.56.10:5001/admin/quote/47/">Xatu Auto Test</a>
</td>
<td>$100,000</td>
<td style="text-align: right;">1,000</td>
<td class="nobr">Processing...</td>
<td class="nobr">192.168....</td>
<td/>
<td>
<input type="checkbox" value="47" name="quote_id"/>
</td>
</tr>
<tr>
</tbody>
<tbody id="quote-rows-footer">
</table>
<div class="btn-toolbar" style="text-align:center; width:100%;">
With a quick look, I reckon this line needs changing: you're trying to access the column's quote_id, when it should be the row's.
From:
column.find_element_by_name("quote_id").click()
To:
row.find_element_by_name("quote_id").click()
P.S. Provided that, as @Saifur commented, you have your comparison done correctly.
Updated:
I have run a simulation, and indeed the checkbox is ticked after changing column to row. Simplified version:
from selenium import webdriver

driver = webdriver.Firefox()
driver.get('your-form-sample.html')

form = driver.find_element_by_id("quotes-form")
rows = form.find_elements_by_tag_name("tr")
for row in rows:
    columns = row.find_elements_by_tag_name("td")
    for column in columns:
        # changed this to the actual string, provided the comparison is correct
        if column.text == 'Xatu Auto Test':
            # change column to row, and it will work
            row.find_element_by_name("quote_id").click()
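The row-then-checkbox logic can also be verified offline on a cut-down version of the form. A sketch with the standard library's ElementTree standing in for the driver (the .click() is replaced by collecting the checkbox value; the HTML here is a simplified stand-in, not the asker's real page):

```python
import xml.etree.ElementTree as ET

snippet = """<form id="quotes-form"><table><tbody>
<tr><td>47</td><td>Xatu Auto Test</td>
<td><input type="checkbox" value="47" name="quote_id"/></td></tr>
<tr><td>48</td><td>Other Group</td>
<td><input type="checkbox" value="48" name="quote_id"/></td></tr>
</tbody></table></form>"""

root = ET.fromstring(snippet)
ticked = []
for row in root.iter('tr'):
    # Same test as the answer: does any cell in this row match the text?
    if any(td.text == 'Xatu Auto Test' for td in row.iter('td')):
        checkbox = row.find(".//input[@name='quote_id']")
        ticked.append(checkbox.get('value'))
print(ticked)  # ['47']
```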
I'm trying to extract some fields from the output at the end of this question with the following code:
doc = LH.fromstring(html2)
tds = (td.text_content() for td in doc.xpath("//td[not(*)]"))
for a, b, c in zip(*[tds] * 3):
    print(a, b, c)
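(As an aside, zip(*[tds]*3) groups the generator into consecutive triples because all three arguments to zip are the same iterator object, so each output tuple consumes three items from it. A standalone illustration:)

```python
# The same grouping idiom on a plain list of values.
tds = iter(['a', 'b', 'c', 'd', 'e', 'f'])
triples = list(zip(*[tds] * 3))
print(triples)  # [('a', 'b', 'c'), ('d', 'e', 'f')]
```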
What I expect is to extract only the fields notificationNodeName, packageName, and notificationEnabled.
The main problem is that I want to put the result into a database, so the fields need to line up. Instead of what I need, the actual code returns:
('JDBCAdapter', 'JDBCAdapter', 'Package:Notif')
('Package', 'yes', 'Package_2:Notif')
('Package_2', 'yes')
What I need:
('Package:Notif','Package', 'yes')
('Package_2:Notif','Package_2', 'yes')
An ugly solution that I found was:
doc = LH.fromstring(html2)
tds = (td.text_content() for td in doc.xpath("//td"))
for td, val in zip(*[tds] * 2):
    if td == 'notificationNodeName':
        notificationNodeName = val
    elif td == 'packageName':
        packageName = val
    elif td == 'notificationEnabled':
        notificationEnabled = val
        print(notificationNodeName, packageName, notificationEnabled)
It works, but it doesn't seem right to me; I'm sure there is a better way to do it.
Original HTML Output:
<tbody><tr>
<td valign="top"><b>adapterTypeName</b></td>
<td>JDBCAdapter</td>
</tr>
<tr>
<td valign="top"><b>adapterTypeNameList</b></td>
<td>
<table>
<tbody><tr>
<td>JDBCAdapter</td>
</tr>
</tbody></table>
</td>
</tr>
<tr>
<td valign="top"><b>notificationDataList</b></td>
<td>
<table>
<tbody><tr>
<td><table bgcolor="#dddddd" border="1">
<tbody><tr>
<td valign="top"><b>notificationNodeName</b></td>
<td>package:Notif</td>
</tr>
<tr>
<td valign="top"><b>packageName</b></td>
<td>Package</td>
</tr>
<tr>
<td valign="top"><b>notificationEnabled</b></td>
<td>unsched</td>
</tr>
</tbody></table>
</td>
</tr>
<tr>
<td><table bgcolor="#dddddd" border="1">
<tbody><tr>
<td valign="top"><b>notificationNodeName</b></td>
<td>Package_2:notif</td>
</tr>
<tr>
<td valign="top"><b>packageName</b></td>
<td>package_2</td>
</tr>
<tr>
<td valign="top"><b>notificationEnabled</b></td>
<td>yes</td>
</tr>
and it continues with more ... non-relevant, repetitive data.
I would recommend using the excellent lxml and its cssselect functionality for most HTML parsing.
You can then select each field you are interested in like this:
from lxml import html

root = html.parse(open('your/file.html')).getroot()

sibling_content = lambda x: [b.getparent().getnext().text_content()
                             for b in root.cssselect("td b:contains('{0}')".format(x))]

fields = ['notificationNodeName', 'packageName', 'notificationEnabled']
for item in zip(*[sibling_content(field) for field in fields]):
    print(item)
I would also recommend lxml - it's the de facto standard for parsing XML or HTML with Python.
As an alternative to David's approach, here's a solution using xpaths:
from lxml import html
from lxml import etree

html_file = open('test.html', 'r')
root = html.parse(html_file).getroot()

# Strip those annoying <b> tags for easier xpaths
etree.strip_tags(root, 'b')

data_list = root.xpath("//td[text()='notificationDataList']/following-sibling::*")[0]
node_names = data_list.xpath("//td[text()='notificationNodeName']/following-sibling::*/text()")
package_names = data_list.xpath("//td[text()='packageName']/following-sibling::*/text()")
enableds = data_list.xpath("//td[text()='notificationEnabled']/following-sibling::*/text()")

print(list(zip(node_names, package_names, enableds)))
Output:
[('package:Notif', 'Package', 'unsched'),
('Package_2:notif', 'package_2', 'yes')]
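Both answers rely on the same structural fact: each label cell is immediately followed by its value cell. That pairing can be demonstrated with the standard library alone; a minimal sketch on a simplified copy of one of the inner tables (the <tbody> fragment here is trimmed to be well-formed XML):

```python
import xml.etree.ElementTree as ET

snippet = """<tbody>
<tr><td><b>notificationNodeName</b></td><td>package:Notif</td></tr>
<tr><td><b>packageName</b></td><td>Package</td></tr>
<tr><td><b>notificationEnabled</b></td><td>unsched</td></tr>
</tbody>"""

root = ET.fromstring(snippet)
record = {}
for tr in root.iter('tr'):
    tds = tr.findall('td')
    # Label is the bold text in the first cell; value is the second cell.
    record[tds[0].find('b').text] = tds[1].text
print(record)  # {'notificationNodeName': 'package:Notif', ...}
```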