Python Selenium - iterate through search results

I need to verify that every search result contains the search key. This is the HTML source code:
<div id="content">
<h1>Search Results</h1>
<a id="main-content" tabindex="-1"></a>
<ul class="results">
<li><img alt="Icon for Metropolitan trains" title="Metropolitan trains" src="themes/transport-site/images/jp/iconTrain.png" class="resultIcon"/> <strong>Stop</strong> Sunshine Railway Station (Sunshine)</li>
<li><img alt="Icon for Metropolitan trains" title="Metropolitan trains" src="themes/transport-site/images/jp/iconTrain.png" class="resultIcon"/> <strong>Stop</strong> Albion Railway Station (Sunshine North)</li>
</ul>
</div>
I have written this code to enter the search key and get the results, but it fails to loop through the search results:
from selenium import webdriver
driver = webdriver.Chrome('C:/Users/lovea/OneDrive/Documents/Semester 2 2016/ISYS1087/w3-4/chromedriver')
driver.get('http://www.ptv.vic.gov.au')
next5Element = driver.find_element_by_link_text('Next 5 departures')
next5Element.click()
searchBox = driver.find_element_by_id('Form_ModeSearchForm_Search')
searchBox.click()
searchBox.clear()
searchBox.send_keys('Sunshine')
submitBtn = driver.find_element_by_id('Form_ModeSearchForm_action_doModeSearch')
submitBtn.click()
assert "Sorry, there were no results for your search." not in driver.page_source
results = driver.find_elements_by_xpath("//ul[@class='results']/li/a")
for result in results:
    assert "Sunshine" in result  # Error: argument of type 'WebElement' is not iterable
Can anyone tell me the proper way to do that? Thank you!

You should check whether the text of each element contains the key string, not the element itself, so try
for result in results:
    assert "Sunshine" in result.text

I made a mistake in the assert statement.
Because result is a WebElement, there is no text to look up in it directly.
I just changed it to: assert "Sunshine" in result.text
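To make the difference concrete, here is a minimal sketch of why the membership test must run against .text. The FakeElement class is a hypothetical stand-in for a live WebElement (a real one exposes the same .text property), so the snippet runs without a browser:

```python
# Hypothetical stand-in for Selenium's WebElement, for illustration only;
# a real WebElement exposes the same .text property.
class FakeElement:
    def __init__(self, text):
        self.text = text

results = [
    FakeElement("Stop Sunshine Railway Station (Sunshine)"),
    FakeElement("Stop Albion Railway Station (Sunshine North)"),
]

# "Sunshine" in result       -> TypeError: a WebElement is not iterable
# "Sunshine" in result.text  -> plain substring test on the element's text
for result in results:
    assert "Sunshine" in result.text
```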

Related

(Beautiful Soup) Get data inside a button tag

I am trying to scrape an imageId out of a button tag, and want to get the result:
"25511e1fd64e99acd991a22d6c2d6b6c".
When I try:
drawing_url = drawing_url.find_all('button', class_='inspectBut')['onclick']
it doesn't work, giving an error:
TypeError: list indices must be integers or slices, not str
Input =
for article in soup.find_all('div', class_='dojoxGridRow'):
    drawing_url = article.find('td', class_='dojoxGridCell', idx='3')
    drawing_url = drawing_url.find_all('button', class_='inspectBut')
    if drawing_url:
        for e in drawing_url:
            print(e)
Output =
<button class="inspectBut" href="#"
onclick="window.open('getImg?imageId=25511e1fd64e99acd991a22d6c2d6b6c&
timestamp=1552011572288','_blank', 'toolbar=0,
menubar=0, modal=yes, scrollbars=1, resizable=1,
height='+$(window).height()+', width='+$(window).width())"
title="Open Image" type="button">
</button>
...
...
Try this one.
import re

# for all the buttons
btn_onclick_list = [a.get('onclick') for a in soup.find_all('button')]
for click in btn_onclick_list:
    a = re.findall(r"imageId=(\w+)", click)[0]
    print(a)
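The regex can be checked in isolation against the onclick value from the question, without any scraping (the onclick string below is copied from the button in the question, truncated to the part the regex cares about):

```python
import re

# onclick value copied from the button in the question, truncated
# to the portion that matters for the regex:
onclick = ("window.open('getImg?imageId=25511e1fd64e99acd991a22d6c2d6b6c&"
           "timestamp=1552011572288','_blank')")

# \w+ stops at the '&', so only the hex id is captured
image_id = re.findall(r"imageId=(\w+)", onclick)[0]
print(image_id)  # 25511e1fd64e99acd991a22d6c2d6b6c
```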
You first need to check whether the attribute is present or not.
tag.attrs returns a dictionary of the attributes present in the current tag.
Consider the following Code.
Code:
from bs4 import BeautifulSoup
a="""
<td>
<button class='hi' onclick="This Data">
<button class='hi' onclick="This Second">
</td>"""
soup = BeautifulSoup(a,'lxml')
print([btn['onclick'] for btn in soup.find_all('button',class_='hi') if 'onclick' in btn.attrs])
Output:
['This Data', 'This Second']
or you can simply do this
[btn['onclick'] for btn in soup.find_all('button', attrs={'class' : 'hi', 'onclick' : True})]
You should be searching for
button_list = soup.find_all('button', {'class': 'inspectBut'})
That will give you the list of buttons, and you can later get the URL field with
[button['getimg?imageid'] for button in button_list]
You will still need to do some parsing, but I hope this can get you on the right track.
Your mistake here was not searching for the correct class and not looking at the correct HTML attribute, which is, ironically, getimg?imageid.

BeautifulSoup find - exclude nested tag from block of interest

I have a scraper that looks for pricing on particular product pages. I'm only interested in the current price - whether the product is on sale or not.
I store the identifying tags like this in a JSON file:
{
    "some_ecommerce_site" : {
        "product_name" : ["span", "data-test", "product-name"],
        "breadcrumb" : ["div", "class", "breadcrumbs"],
        "sale_price" : ["span", "data-test", "sale-price"],
        "regular_price" : ["span", "data-test", "product-price"]
    }
}
And have these functions to select current price and clean up the price text:
def get_pricing(rpi, spi):
    sale_price = self.soup_object.find(spi[0], {spi[1] : spi[2]})
    regular_price = self.soup_object.find(rpi[0], {rpi[1] : rpi[2]})
    return sale_price if sale_price else regular_price

def get_text(obj):
    return re.sub(r'\s\s+', '', obj.text.strip()).encode('utf-8')
Which are called by:
def get_ids(name_of_ecommerce_site):
    with open('site_identifiers.json') as j:
        return json.load(j)[name_of_ecommerce_site]

def get_data():
    rpi = self.site_ids['regular_price']
    spi = self.site_ids['sale_price']
    product_price = self.get_text( self.get_pricing(rpi, spi) )
This works for all but one site so far because their pricing is formatted like so:
<div class="product-price">
<h3>
£15.00
<span class="price-standard">
£35.00
</span>
</h3>
</div>
So what product_price returns is "£15£35" instead of the desired "£15".
Is there a simple way to exclude the nested <span> which won't break for the working sites?
I thought a solution would be to get a list and select index 0, but checking the tag's contents, that won't work as it's a single item in the list:
>> print(type(regular_price))
>> <class 'bs4.element.Tag'>
>> print(regular_price.contents)
>> [u'\n', <h3>\n\n\xa325.00\n\n<span class="price-standard">\n\n\xa341.00\n</span>\n</h3>, u'\n']
I've tried creating a list out of the result's NavigableString elements then filtering out the empty strings:
filter(None, [self.get_text(unicode(x)) for x in sale_price.find_all(text=True)])
This fixes that one case, but breaks a few of the others (since they often have the currency in a different tag than the value amount) - I get back "£".
If you want to get the text without the text of the child element, you can do it like this:
from bs4 import BeautifulSoup,NavigableString
html = """
<div class="product-price">
<h3>
£15.00
<span class="price-standard">
£35.00
</span>
</h3>
</div>
"""
bs = BeautifulSoup(html, "xml")
result = bs.find("div", {"class": "product-price"})
fr = [element for element in result.h3 if isinstance(element, NavigableString)]
print(fr[0])
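The same "direct text only" idea can also be sketched with the standard library alone, in case BeautifulSoup is unavailable. The PriceParser class below is purely illustrative: it keeps only text that is a direct child of the h3, skipping anything inside the nested span:

```python
from html.parser import HTMLParser

# Illustrative stdlib sketch: collect only text that sits *directly*
# inside <h3>, ignoring text nested in child tags such as <span>.
class PriceParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth_in_h3 = 0   # 0 = outside h3, 1 = directly inside h3
        self.direct_text = []

    def handle_starttag(self, tag, attrs):
        if self.depth_in_h3:
            self.depth_in_h3 += 1      # entering a nested tag like <span>
        elif tag == "h3":
            self.depth_in_h3 = 1

    def handle_endtag(self, tag):
        if self.depth_in_h3:
            self.depth_in_h3 -= 1

    def handle_data(self, data):
        if self.depth_in_h3 == 1:      # text directly under <h3> only
            self.direct_text.append(data.strip())

doc = """
<div class="product-price">
<h3>
£15.00
<span class="price-standard">
£35.00
</span>
</h3>
</div>
"""
parser = PriceParser()
parser.feed(doc)
price = "".join(t for t in parser.direct_text if t)
print(price)  # £15.00
```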

invalid xpath expression python

I am trying to get all elements whose id starts with some text, but when I test it out, it says my xpath is invalid.
This is my xpath:
xpath = "//*[contains(@id,'ClubDetails_" + str(team) + "']"
Which outputs when I print:
//*[contains(@id,'ClubDetails_1']
Here is HTML:
<div id="ClubDetails_1_386" class="fadeout">
<div class="MapListDetail">
<div>Bev Garman</div>
armadaswim@aol.com
</div>
<div class="MapListDetail">
<div>Rolling Hills Country Day School</div>
<div>26444 Crenshaw Blvd</div>
<div>Rolling Hills Estates, CA 90274</div>
</div>
<div class="MapListDetailOtherLocs_1">
<div>This club also swims at other locations</div>
<span class="show_them_link">show them...</span>
</div>
</div>
What am I missing ?
An alternative is to use a pattern to match IDs starting with some text, get all those elements in a list, then iterate through the list one by one and check whether the id attribute of each element contains the team you want.
You can match the pattern with a CSS selector as below:
div[id^='ClubDetails_']
And this is the equivalent in XPath:
//div[starts-with(@id, 'ClubDetails_')]
Code sample:
expectedid = "1_386"
clubdetails = driver.find_elements_by_css_selector("div[id^='ClubDetails_']")
for item in clubdetails:
    elementid = item.get_attribute('id')
    if expectedid in elementid:
        # further code
        pass
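The containment check itself is plain string logic, so it can be sketched without a browser. The id values below are made up to mirror the HTML in the question:

```python
# Plain-string sketch of the matching logic; the ids are invented
# to mirror the HTML in the question.
ids = ["ClubDetails_1_386", "ClubDetails_2_100", "MapListDetailOtherLocs_1"]
expectedid = "1_386"

# keep only ids with the right prefix that also contain the expected team id
matches = [i for i in ids if i.startswith("ClubDetails_") and expectedid in i]
print(matches)  # ['ClubDetails_1_386']
```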

How to find text with a particular value BeautifulSoup python2.7

I have the following HTML. I'm trying to get the following values saved as variables: Available Now, 7, 148.89, Hatchback, Good. The problem I'm running into is that I'm not able to pull them out independently, since they don't have a class attached to them. I'm wondering how to solve this. The following is the HTML, then my futile code to solve it.
</div>
<div class="car-profile-info">
<div class="col-md-12 no-padding">
<div class="col-md-6 no-padding">
<strong>Status:</strong> <span class="statusAvail"> Available Now </span><br/>
<strong>Min. Booking </strong>7 Days ($148.89)<br/>
<strong>Style: </strong>Hatchback<br/>
<strong>Transmission: </strong>Automatic<br/>
<strong>Condition: </strong>Good<br/>
</div>
Python 2.7 code (this gives me the entire HTML!):
soup = BeautifulSoup(html)
print soup.find("span", {"class": "statusAvail"}).getText()
for i in soup.select("strong"):
    if i.getText() == "Min. Booking ":
        print i.parent.getText().replace("Min. Booking ", "")
Find all the strong elements under the div element with class="car-profile-info" and, for each element found, get the .next_siblings until you meet the br element:
from bs4 import BeautifulSoup, Tag

for strong in soup.select(".car-profile-info strong"):
    label = strong.get_text()
    value = ""
    for elm in strong.next_siblings:
        if getattr(elm, "name") == "br":
            break
        if isinstance(elm, Tag):
            value += elm.get_text(strip=True)
        else:
            value += elm.strip()
    print(label, value)
You can use ".next_sibling" to navigate to the text you want like this:
for i in soup.select("strong"):
    if i.get_text(strip=True) == "Min. Booking":
        print(i.next_sibling)  # this will print: 7 Days ($148.89)
See also http://www.crummy.com/software/BeautifulSoup/bs4/doc/#going-sideways

Looping over multiple tooltips

I am trying to get the names and affiliations of authors from a series of articles from this page (you'll need access to Proquest to view it). What I want to do is open all the tooltips at the top of the page and extract some HTML text from them. This is my code:
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
browser = webdriver.Firefox()
url = 'http://search.proquest.com/econlit/docview/56607849/citation/2876523144F544E0PQ/3?accountid=13042'
browser.get(url)
#insert your username and password here
n_authors = browser.find_elements_by_class_name('zoom') #zoom is the class name of the three tooltips that I want to open in my loop
author = []
institution = []
for a in n_authors:
    print(a)
    ActionChains(browser).move_to_element(a).click().perform()
    html_author = browser.find_element_by_xpath('//*[@id="authorResolveLinks"]/li/div/a').get_attribute('innerHTML')
    html_institution = browser.find_element_by_xpath('//*[@id="authorResolveLinks"]/li/div/p').get_attribute('innerHTML')
    author.append(html_author)
    institution.append(html_institution)
Although n_authors has three entries that are apparently different from one another, selenium fails to get the info from all tooltips, instead returning this:
author
#['Nuttall, William J.',
#'Nuttall, William J.',
#'Nuttall, William J.']
And the same happens for the institution. What am I getting wrong? Thanks a lot
EDIT:
The array containing the xpaths of the tooltips:
n_authors
#[<selenium.webdriver.remote.webelement.WebElement (session="277c8abc-3883-
#43a8-9e93-235a8ded80ff", element="{008a2ade-fc82-4114-b1bf-cc014d41c40f}")>,
#<selenium.webdriver.remote.webelement.WebElement (session="277c8abc-3883-
#43a8-9e93-235a8ded80ff", element="{c4c2d89f-3b8a-42cc-8570-735a4bd56c07}")>,
#<selenium.webdriver.remote.webelement.WebElement (session="277c8abc-3883-
#43a8-9e93-235a8ded80ff", element="{9d06cb60-df58-4f90-ad6a-43afeed49a87}")>]
Which has length 3, and the three elements are different, which is why I don't understand why selenium won't distinguish them.
EDIT 2:
Here is the relevant HTML
<span class="titleAuthorETC small">
<span style="display:none" class="title">false</span>
Jamasb, Tooraj
<a class="zoom" onclick="return false;" href="#">
<img style="margin-left:4px; border:none" alt="Visualizza profilo" id="resolverCitation_previewTrigger_0" title="Visualizza profilo" src="/assets/r20161.1.0-4/ctx/images/scholarUniverse/ar_button.gif">
</a><script type="text/javascript">Tips.images = '/assets/r20161.1.0-4/pqc/javascript/prototip/images/prototip/';</script>; Nuttall, William J
<a class="zoom" onclick="return false;" href="#">
<img style="margin-left:4px; border:none" alt="Visualizza profilo" id="resolverCitation_previewTrigger_1" title="Visualizza profilo" src="/assets/r20161.1.0-4/ctx/images/scholarUniverse/ar_button.gif">
</a>; Pollitt, Michael G
<a class="zoom" onclick="return false;" href="#">
<img style="margin-left:4px; border:none" alt="Visualizza profilo" id="resolverCitation_previewTrigger_2" title="Visualizza profilo" src="/assets/r20161.1.0-4/ctx/images/scholarUniverse/ar_button.gif">
</a>.
UPDATE:
@parishodak's answer, for some reason, does not work using Firefox unless I manually hover over the tooltips first. It works with chromedriver, but only if I first hover over the tooltips, and only if I allow time.sleep(), as in
for i in itertools.count():
    try:
        tooltip = browser.find_element_by_xpath('//*[@id="resolverCitation_previewTrigger_' + str(i) + '"]')
        print(tooltip)
        ActionChains(browser).move_to_element(tooltip).perform()
    except NoSuchElementException:
        break
time.sleep(2)
elements = browser.find_elements_by_xpath('//*[@id="authorResolveLinks"]/li/div/a')
author = []
for e in elements:
    print(e)
    attribute = e.get_attribute('innerHTML')
    author.append(attribute)
The reason it is returning the same element is that the xpath does not change across loop iterations.
Two ways to deal with it:
Use array notation for the xpath, as described below:
browser.find_element_by_xpath('//*[@id="authorResolveLinks"]/li/div/a[1]').get_attribute('innerHTML')
browser.find_element_by_xpath('//*[@id="authorResolveLinks"]/li/div/a[2]').get_attribute('innerHTML')
browser.find_element_by_xpath('//*[@id="authorResolveLinks"]/li/div/a[3]').get_attribute('innerHTML')
Or
Instead of find_element_by_xpath, use find_elements_by_xpath:
elements = browser.find_elements_by_xpath('//*[@id="authorResolveLinks"]/li/div/a')
Then loop over the elements and call get_attribute('innerHTML') on each one.
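Sketched with a hypothetical stand-in for WebElement (a real one exposes the same get_attribute method), the second approach looks like this; the author names are taken from the HTML in the question:

```python
# Hypothetical stand-in for the WebElements returned by
# find_elements_by_xpath, to show the shape of the loop.
class FakeElement:
    def __init__(self, inner):
        self._inner = inner
    def get_attribute(self, name):
        return self._inner if name == "innerHTML" else None

elements = [FakeElement("Jamasb, Tooraj"),
            FakeElement("Nuttall, William J"),
            FakeElement("Pollitt, Michael G")]

# One get_attribute call per element, instead of re-querying the same
# (unchanging) xpath on every iteration:
authors = [e.get_attribute("innerHTML") for e in elements]
print(authors)
```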
