Issue extracting specific data with Selenium

I am back with a somewhat similar problem. I have now learnt that you cannot find elements inside an iframe unless you switch to it first, which helped a lot, but I still have trouble locating elements even when they are not in an iframe.
I would also welcome any advice on my script in general and how to improve it. Yes, I will replace implicitly_wait with WebDriverWait, but beyond that: is it okay to structure a script this way, as task -> task -> task and so forth, or is it simply bad practice?
I don't really see how I would bring in object-oriented programming, or what I would gain from it apart from the learning aspect, unless I wanted to customise the script in a major way.
In any case, here is the code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
import accandpass as login
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import datetime
x = datetime.datetime.now()
x = x.strftime("%d")
driver = webdriver.Firefox()
driver.get("https://connect.garmin.com/modern/activities")
driver.implicitly_wait(2)
iframe = driver.find_element(By.ID, "gauth-widget-frame-gauth-widget")
driver.switch_to.frame(iframe)
element = driver.find_element("name", "username")
element.send_keys(login.username)
element = driver.find_element("name", "password")
element.send_keys(login.password)
element.send_keys(Keys.RETURN)
driver.switch_to.default_content()
driver.implicitly_wait(10)
element = driver.find_element("name", "search")
element.send_keys("Reading")
element.send_keys(Keys.RETURN)
element = driver.find_element(By.CLASS_NAME, "unit")
print(element)
So everything actually works fine so far, to my great surprise. Printing the element gives <selenium.webdriver.remote.webelement.WebElement (session="0ef84b2e-e0af-4b0c-b04c-94d5371356c5", element="a70f4ee1-e840-457c-a255-4b2df603efec")>, which isn't really what I was looking for.
What I want is a check that the element with class name unit has the same date as x, which is today. So basically:
# Minutes read (pseudocode)
minutes = 0
for i in element:
    if element == x:
        minutes += (element with time)
That is, a for loop that runs through all the elements and checks each one's date; if the date matches, the minutes read that day are added to the integer minutes, giving the total minutes read today, for example.
Then I would do the same for my other activities: running, hiking and meditating.
Questions:
How do I get the integer from the element, so I can check it with x?
Is the for loop -> if statement -> add time from element a good solution to the case at hand?
Is it bad practice to structure a script this way, and how would you improve it?
Thanks in advance

Part of your question sounds like you want a code review. If you do, you'll want to post your code over on https://codereview.stackexchange.com but review their question requirements carefully before posting.
I think the main issue you are asking about is wanting to compare the date on the page to the current system date and you're getting some session and element GUIDs instead. You are printing the element object and not the contained text. You want
print(element.text)
or add an assert to compare it to the current system date in a specific format, something like...
assert element.text == datetime.datetime.today().strftime('%m %d')
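For the summing step the question describes, one possible sketch follows. It assumes the date text and duration text of each activity row have already been read into pairs of strings (for example from element.text). The '%m %d' date format and the 'H:MM:SS' / 'MM:SS' duration format are assumptions, since the actual page markup is not shown, and the helper names duration_to_minutes and total_minutes_for_day are hypothetical.

```python
import datetime


def duration_to_minutes(text):
    """Convert 'H:MM:SS' or 'MM:SS' text (assumed format) into minutes."""
    parts = [int(p) for p in text.strip().split(":")]
    # Pad on the left so 'MM:SS' becomes [0, MM, SS].
    while len(parts) < 3:
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return hours * 60 + minutes + seconds / 60


def total_minutes_for_day(rows, day, date_format="%m %d"):
    """Sum the durations of all rows whose date text matches `day`.

    rows: iterable of (date_text, duration_text) pairs, e.g. built from
    element.text values. date_format is an assumption about the page.
    """
    wanted = day.strftime(date_format)
    return sum(
        duration_to_minutes(duration_text)
        for date_text, duration_text in rows
        if date_text.strip() == wanted
    )


# Made-up sample rows standing in for text scraped from the page.
rows = [("03 05", "30:00"), ("03 05", "1:15:00"), ("03 04", "20:00")]
print(total_minutes_for_day(rows, datetime.date(2023, 3, 5)))
```

Keeping the parsing separate from the Selenium calls means the date and duration handling can be checked without opening a browser.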
Some quick additional feedback since you asked for some...
It sounds like you've already been informed that .implicitly_wait() is a bad practice and should be replaced with WebDriverWait for each instance where you need to wait.
If you aren't going to reuse a variable, don't declare one. In most cases you don't need to use one.
element = driver.find_element("name", "username")
element.send_keys(login.username)
can be written
driver.find_element("name", "username").send_keys(login.username)
If you are going to use variables, don't reuse the same name over and over, e.g. element. Give each variable a meaningful name so that the next person (or maybe yourself in a few weeks/months) will be able to more easily read and understand your code.
element = driver.find_element("name", "search")
element.send_keys("Reading")
element.send_keys(Keys.RETURN)
should instead be
search = driver.find_element("name", "search")
search.send_keys("Reading")
search.send_keys(Keys.RETURN)
If you are going to continue writing scripts, do some reading on the page object model. Done right, it will clean up your code significantly, make maintenance much faster and easier, and make writing new scripts much faster. Basically you create a class for each page of the site and then add methods to the class for actions you need to take on the page.
I don't have a garmin account so I can't log in but from the screenshot you posted you might have some methods like .search_activities(string), .get_search_results(), .filter_activities(string), etc. Once you've created those methods, they can be called repeatedly from the same script or many scripts.
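As a rough illustration of the page object idea, here is a minimal sketch. The class name ActivitiesPage is hypothetical, the two locators are simply the ones used in the question's script, and RETURN_KEY holds the character that Selenium's Keys.RETURN stands for so that the sketch does not need further imports; treat it as an outline rather than a finished design.

```python
class ActivitiesPage:
    """Hypothetical page object for the activities page in the question."""

    # Locators taken from the question's script (by, value).
    SEARCH_BOX = ("name", "search")
    RESULT_UNITS = ("class name", "unit")
    # Same character as selenium.webdriver.common.keys.Keys.RETURN.
    RETURN_KEY = "\ue006"

    def __init__(self, driver):
        self.driver = driver

    def search_activities(self, text):
        """Type a search term into the search box and submit it."""
        box = self.driver.find_element(*self.SEARCH_BOX)
        box.send_keys(text)
        box.send_keys(self.RETURN_KEY)

    def get_search_results(self):
        """Return the visible text of every matching result element."""
        elements = self.driver.find_elements(*self.RESULT_UNITS)
        return [element.text for element in elements]
```

A script would then read page = ActivitiesPage(driver), page.search_activities("Reading") and page.get_search_results(), and a locator that changes on the site only has to be updated in one place.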

Related

Selenium only finding certain elements in Python

I'm having some trouble finding elements with Selenium in Python. It works fine for every element on all the other websites I have tested, yet on a game website it can only find certain elements.
Here is the code I'm using:
from selenium import webdriver
import time
driver = webdriver.Chrome("./chromedriver")
driver.get("https://www.jklm.fun")
passSelf = input("Press enter when in game...")
time.sleep(1)
syllable = driver.find_element_by_xpath("/html/body/div[2]/div[2]/div[2]/div[2]/div").text
print(syllable)
Upon running the code, the element /html/body/div[2]/div[2]/div[2]/div[2]/div isn't found. In the image you can see the element it is trying to find:
Element the code is trying to find
However running the same code but replacing the XPath with something outside of the main game (for example the room code in the top right) it successfully finds the element:
Output of the code being run on a different element
I've tried using the class name, name, CSS selector and XPath to find the original element, but to no avail. The only things I can think of that might be affecting it are:
The elements are changing periodically (not sure if this affects it)
The elements are in the "Canvas area" and it is somehow blocking it.
I'm not certain whether these things matter, as I'm new to using Selenium. Any help is appreciated. The website the game is on is https://www.jklm.fun/ if you want to have a look through the elements.

The element you are trying to access is inside an iframe. Switch to the frame first, like this:
driver.switch_to.frame(driver.find_element_by_xpath("//div[@class='game']/iframe[contains(@src,'jklm.fun')]"))

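# imports needed in addition to those in the question
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver.get("https://jklm.fun/JXUS")
WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.XPATH, "//button[@class='styled']"))).click()
time.sleep(10)
driver.switch_to.frame(0)
# keep reading the text of the current round
while True:
    Get_Text = driver.find_element_by_xpath("//div[@class='round']").text
    print(Get_Text)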
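Since the element's text changes periodically, a loop like the one above prints the same value many times over. A small sketch of one alternative is to report the text only when it changes. The helper name watch_text and its parameters are hypothetical; read_text is any callable that returns the current text, for instance a lambda wrapping the find_element_by_xpath call above.

```python
import time


def watch_text(read_text, on_change, polls=50, delay=0.2):
    """Poll read_text() and call on_change(text) whenever the value changes."""
    last = None
    for _ in range(polls):
        current = read_text()
        if current != last:
            on_change(current)
            last = current
        time.sleep(delay)  # short pause between polls


# Stand-in for a changing page element, so the sketch runs without a browser.
samples = iter(["AB", "AB", "CD", "CD", "EF"])
watch_text(lambda: next(samples), print, polls=5, delay=0)
```

With a real driver, the first argument would be something like lambda: driver.find_element_by_xpath("//div[@class='round']").text.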

Selenium not finding list of sections with classes?

I am attempting to get a list of games on
https://www.xbox.com/en-US/live/gold#gameswithgold
According to Firefox's dev console, it seems that I found the correct class: https://i.imgur.com/M6EpVDg.png
Since there are 3 games, I should get a list of 3 objects with this code: https://pastebin.com/raw/PEDifvdX (the wait is there so Selenium can load the page)
But Selenium says the element does not exist: https://i.imgur.com/DqsIdk9.png
I do not understand what I am doing wrong. I even tried CSS selectors like this:
listOfGames = driver.find_element_by_css_selector("section.m-product-placement-item f-size-medium context-game gameDiv")
Still nothing. What am I doing wrong?

You are trying to get three different games, so you need to give a different element path for each one, or you can use a loop like this one:
i = 1
while i < 4:
    link = f"//*[@id='ContentBlockList_11']/div[2]/section[{i}]/a/div/h3"
    listGames = str(driver.find_element_by_xpath(link).text)
    print(listGames)
    i += 1
You can use this kind of loop wherever the XPath, CSS selector or class differs only slightly between elements. It loops over the web elements one by one and gets the list of games. Since I think you are trying to get the name, you need to use .text, which returns only the name and nothing else.

Another option is a selector that doesn't have to be changed on each pass of a loop. It is also less dependent on the page structure and a little easier to read:
//a[starts-with(@data-loc-link,'keyLinknowgame')]//h3
Here's sample code:
from selenium import webdriver
from selenium.common.exceptions import StaleElementReferenceException
driver = webdriver.Chrome()
url = f"https://www.xbox.com/en-US/live/gold#gameswithgold"
driver.get(url)
driver.implicitly_wait(10)
listOfGames = driver.find_elements_by_xpath("//a[starts-with(@data-loc-link,'keyLinknowgame')]//h3")
for game in listOfGames:
    try:
        print(game.text)
    except StaleElementReferenceException:
        pass
If you're after more than just the title, remove the //h3 selection:
//a[starts-with(@data-loc-link,'keyLinknowgame')]
And add whatever additional Xpath you want to narrow things down to the content/elements that you're after.
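A side note on the CSS selector attempted in the question: in CSS a space means "descendant", so section.m-product-placement-item f-size-medium context-game gameDiv looks for nested elements named f-size-medium, context-game and gameDiv rather than one section carrying all four classes. Classes on the same element are chained with dots instead. The helper below, css_for_classes, is a hypothetical illustration of that conversion, not part of Selenium.

```python
def css_for_classes(tag, class_attribute):
    """Build a compound CSS selector from a tag and a class attribute value."""
    classes = class_attribute.split()
    return tag + "".join("." + name for name in classes)


selector = css_for_classes(
    "section", "m-product-placement-item f-size-medium context-game gameDiv"
)
print(selector)
```

The resulting string can be passed to find_elements_by_css_selector to match elements that have all of the listed classes.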

How to access text inside div tags using Selenium in Python?

I am trying to make a program in Python using Selenium which prints out the quotes from https://www.brainyquote.com/quote_of_the_day
EDIT:
I was able to access the quotes and the associated authors like so:
authors = driver.find_elements_by_css_selector("""div.col-xs-4.col-md-4 a[title="view author"]""")
for quote,author in zip(quotes,authors):
    print('Quote: ', quote.text)
    print('Author: ', author.text)
I am not able to group the topics in the same way. Doing
total_topics = driver.find_elements_by_css_selector("""div.col-xs-4.col-md-4 a.qkw-btn.btn.btn-xs.oncl_list_kc""")
would make an undesired list
Earlier I was using Beautiful Soup which did the job perfectly except the fact that the requests library was able to access only the static website. However, I wanted to be able to scroll the website continuously to keep accessing new quotes. For that purpose, I'm trying to use Selenium.
This is how I did it using Soup:
for quote_data in soup.find_all('div', class_='col-xs-4 col-md-4'):
    quote = quote_data.find('a',title='view quote').text
    print('Quote: ',quote)
However, I am unable to find the same using Selenium.
My code in Selenium for basic testing:
driver.maximize_window()
driver.get('https://www.brainyquote.com/quote_of_the_day')
elem = driver.find_element_by_tag_name("body")
elem.send_keys(Keys.PAGE_DOWN)
time.sleep(0.2)
quote = driver.find_element_by_xpath('//div[@title="view quote"]')
I also tried CSS selectors:
print(driver.find_element_by_css_selector('div.col-xs-4 col-md-4'))
The latter gave a NoSuchElementException and the former is not giving any output at all. I would love to get some tips on where I am going wrong and how I could tackle this.
Thanks!

First scroll to the bottom of the page, then collect the quotes with:
quotes = driver.find_elements_by_xpath('//a[@title="view quote"]')

You might need to write some kind of loop to scroll and click on the quotes links until there are no more elements found. Here's a bit of an outline of how I would do that:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver.get('https://www.brainyquote.com/quote_of_the_day')
while True:
    # wait for all quote elements to appear
    quote_links = WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.XPATH, "//a[@title='view quote']")))
    # todo - need to check for the end condition. page has infinite scrolling
    # break
    # iterate by index, because the list is re-fetched after every click
    for index in range(len(quote_links)):
        quote_links[index].click()
        driver.back()
        # now quote_links has gone stale because we are on a different page
        quote_links = WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.XPATH, "//a[@title='view quote']")))
The above code enters a loop that searches for all of the 'view quote' links on the page. Then we iterate over the list of links and click on each one. After navigating away and back, the elements in the quote_links list have gone stale because the page they belonged to no longer exists, so we re-find the elements with WebDriverWait before clicking another link.
This is just a rough outline and some extra work will need to be done to determine an end case for the infinite scrolling of the page, and you will need to write in the operations to perform on the quote pages themselves, but hopefully you see the idea here.
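One common way to detect the end of an infinite-scrolling page, offered here only as a sketch, is to keep scrolling until the document height stops growing. The helper name scroll_to_end, the pause length and the round limit are all assumptions; the only driver method it relies on is execute_script.

```python
import time


def scroll_to_end(driver, pause=1.0, max_rounds=50):
    """Scroll down until the page height stops growing; return rounds used."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to load more content
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return rounds  # nothing new was loaded: treat this as the end
        last_height = new_height
    return max_rounds  # gave up after max_rounds scrolls
```

A fixed pause is a simplification: a page that loads slowly can look finished when it is not, so the pause may need tuning for the site in question.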

Efficient download of images from website with Python and selenium

Disclaimer: I do not have any background in web scraping, HTML, JavaScript, CSS and the like, but I know a bit of Python.
My end goal is to download the 4th image view of each of the 3515 car models on the ShapeNet website, together WITH the associated tag.
For instance, the first of the 3515 pairs would be the image that can be found in the collapsible menu on the right of this picture (which can be loaded by clicking on the first item of the first page and then on Images), with the associated tag "sport utility" as can be seen in the first picture (first car, top left).
To do that I wrote, with the help of @DebanjanB, a snippet of code that clicks on the sport utility car in the first picture, opens the iframe, clicks on Images and then downloads the 4th picture (link to my question). The full working code is this one:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import time
import os
profile = webdriver.FirefoxProfile()
profile.set_preference("network.proxy.type", 1)
profile.set_preference("network.proxy.socks", "yourproxy")
profile.set_preference("network.proxy.socks_port", yourport)
#browser = webdriver.Firefox(firefox_profile=profile)
browser = webdriver.Firefox()
browser.get('https://www.shapenet.org/taxonomy-viewer')
#Page is long to load
wait = WebDriverWait(browser, 30)
element = wait.until(EC.element_to_be_clickable((By.XPATH, "//*[@id='02958343_anchor']")))
linkElem = browser.find_element_by_xpath("//*[@id='02958343_anchor']")
linkElem.click()
#Page is also long to display iframe
element = wait.until(EC.element_to_be_clickable((By.ID, "model_3dw_bcf0b18a19bce6d91ad107790a9e2d51")))
linkElem = browser.find_element_by_id("model_3dw_bcf0b18a19bce6d91ad107790a9e2d51")
linkElem.click()
#iframe slow to be displayed
wait.until(EC.frame_to_be_available_and_switch_to_it((By.ID, 'viewerIframe')))
#iframe = browser.find_elements_by_id('viewerIframe')
#browser.switch_to_frame(iframe[0])
element = wait.until(EC.element_to_be_clickable((By.XPATH, "/html/body/div[3]/div[3]/h4")))
time.sleep(10)
linkElem = browser.find_element_by_xpath("/html/body/div[3]/div[3]/h4")
linkElem.click()
img = browser.find_element_by_xpath("/html/body/div[3]/div[3]//div[@class='searchResult' and @id='image.3dw.bcf0b18a19bce6d91ad107790a9e2d51.3']/img[@class='enlarge']")
src = img.get_attribute('src')
os.system("wget %s --no-check-certificate"%src)
There are several issues with this. First, I need to know by hand the id model_3dw_bcf0b18a19bce6d91ad107790a9e2d51 for each model, and I also need to extract the tag. Both can be found by inspecting the element, so I need to extract them by inspecting every image displayed. Then I need to switch pages (there are 22 pages) and maybe even scroll down on each page to be sure I have everything. Second, I had to use time.sleep twice because the other method, waiting for the element to be clickable, does not seem to work as intended.
I have two questions. The first one is obvious: is this the right way of proceeding? Even if this could be quite fast without the time.sleep calls, it feels very much like what a human would do and therefore must be terribly inefficient. Second, if it is indeed the way to go: how could I write a double for loop over pages and items to extract the tag and model id efficiently?
EDIT 1: It seems that:
l=browser.find_elements_by_xpath("//div[starts-with(@id,'model_3dw')]")
might be the first step towards completion
EDIT 2: Almost there but the code is filled with time.sleep. Still need to get the tag name and to loop through the pages
EDIT 3: Got the tag name still need to loop through the pages and will post first draft of solution

Let me try to understand correctly what you mean and then see if I can help you solve the problem. I do not know Python, so excuse my syntax errors.
You want to click on each and every one of the 183533 cars, and then download the 4th image within the iframe that pops up. Correct?
If that is the case, let's look at the first thing you need: the elements on the page with all the cars on it.
To get all 160 cars on page 1, you are going to need:
elements = browser.find_elements_by_xpath("//img[@class='resultImg lazy']")
This returns 160 image elements, which is exactly the number of images displayed (on page 1).
Then you can say:
for el in elements:
    # here you place the code you need to download the 4th image,
    # so switch to the iframe, click on the 4th image, etc.
    pass
Now, for the first page, you have a loop that downloads the 4th image for every vehicle on it.
This doesn't entirely solve your problem, as you have multiple pages. Thankfully, the page navigation links, previous and next, are greyed out on the first and/or last page.
So you can just say:
browser.find_element_by_xpath("//a[@class='next']").click()
Just make sure you handle the case where the element is not clickable, as it will be greyed out on the last page.

Rather than scraping the site, you might consider examining the URLs that the webpage uses to query the data, then use the Python 'requests' package to make API requests directly to the server. I'm not a registered user on the site, so I can't provide you with any examples, but the paper that describes the shapenet.org site specifically mentions:
"To provide convenient access to all of the model and annotation data contained within ShapeNet, we construct an index over all the 3D models and their associated annotations using the Apache Solr framework. Each stored annotation for a given 3D model is contained within the index as a separate attribute that can be easily queried and filtered through a simple web-based UI. In addition, to make the dataset conveniently accessible to researchers, we provide a batched download capability."
This suggests that it might be easier to do what you want via API, as long as you can learn what their query language provides. A search in their QA/Forum may be productive too.

I came up with this answer, which kind of works, but I don't know how to remove the several calls to time.sleep. I will not accept my answer until someone finds something more elegant (also, it fails when it reaches the end of the last page):
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import time
import os
profile = webdriver.FirefoxProfile()
profile.set_preference("network.proxy.type", 1)
profile.set_preference("network.proxy.socks", "yourproxy")
profile.set_preference("network.proxy.socks_port", yourport)
#browser = webdriver.Firefox(firefox_profile=profile)
browser = webdriver.Firefox()
browser.get('https://www.shapenet.org/taxonomy-viewer')
#Page is long to load
wait = WebDriverWait(browser, 30)
element = wait.until(EC.element_to_be_clickable((By.XPATH, "//*[@id='02958343_anchor']")))
linkElem = browser.find_element_by_xpath("//*[@id='02958343_anchor']")
linkElem.click()
tag_names=[]
page_count=0
while True:
    if page_count>0:
        browser.find_element_by_xpath("//a[@class='next']").click()
        time.sleep(2)
    wait.until(EC.presence_of_element_located((By.XPATH, "//div[starts-with(@id,'model_3dw')]")))
    list_of_items_on_page=browser.find_elements_by_xpath("//div[starts-with(@id,'model_3dw')]")
    list_of_ids=[e.get_attribute("id") for e in list_of_items_on_page]
    for i,item in enumerate(list_of_items_on_page):
        #Page is also long to display iframe
        current_id=list_of_ids[i]
        element = wait.until(EC.element_to_be_clickable((By.ID, current_id)))
        car_image=browser.find_element_by_id(current_id)
        original_tag_name=car_image.find_element_by_xpath("./div[@style='text-align: center']").get_attribute("innerHTML")
        count=0
        tag_name=original_tag_name
        while tag_name in tag_names:
            tag_name=original_tag_name+"_"+str(count)
            count+=1
        tag_names.append(tag_name)
        car_image.click()
        wait.until(EC.frame_to_be_available_and_switch_to_it((By.ID, 'viewerIframe')))
        element = wait.until(EC.element_to_be_clickable((By.XPATH, "/html/body/div[3]/div[3]/h4")))
        time.sleep(10)
        linkElem = browser.find_element_by_xpath("/html/body/div[3]/div[3]/h4")
        linkElem.click()
        img = browser.find_element_by_xpath("/html/body/div[3]/div[3]//div[@class='searchResult' and @id='image.3dw.%s.3']/img[@class='enlarge']"%current_id.split("_")[2])
        src = img.get_attribute('src')
        os.system("wget %s --no-check-certificate -O %s.png"%(src,tag_name))
        browser.switch_to.default_content()
        browser.find_element_by_css_selector(".btn-danger").click()
        time.sleep(1)
    page_count+=1
One can also import NoSuchElementException from selenium.common.exceptions and use a while True loop with try/except to get rid of the arbitrary time.sleep calls.
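The try/except idea can be sketched as a small generic retry helper. The name retry_until and its parameters are hypothetical; with Selenium, the exceptions argument would be NoSuchElementException, and the action would be the lookup that currently needs a time.sleep in front of it.

```python
import time


def retry_until(action, exceptions, timeout=30.0, poll=0.5):
    """Call action() repeatedly until it stops raising one of `exceptions`.

    Returns action()'s result, or re-raises the last exception once
    `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            return action()
        except exceptions:
            if time.monotonic() >= deadline:
                raise
            time.sleep(poll)  # brief pause before trying again
```

For example, retry_until(lambda: browser.find_element_by_xpath("/html/body/div[3]/div[3]/h4").click(), NoSuchElementException) would keep trying until the element exists or 30 seconds have passed, rather than always waiting a fixed 10 seconds.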

Selenium Page Source is Missing Elements

I have a basic Selenium script that makes use of the chromedriver binary. I'm trying to display a page with recaptcha on it and then hang until the answer has been completed and then store that in a variable for future use.
The roadblock I'm hitting is that I am unable to find the recaptcha element.
#!/bin/env python2.7
import os
from selenium import webdriver
driverBin=os.path.expanduser("~/Desktop/chromedriver")
driver=webdriver.Chrome(driverBin)
driver.implicitly_wait(5)
driver.get('http://patrickhlauke.github.io/recaptcha/')
Is there anything special needed to be able to see this element?
Also, is there a way to grab the token after the user solves the captcha, without refreshing the page?
As it is now, the input with the id recaptcha-token has type hidden. After the captcha is solved, a second element with the recaptcha-token id is created. This is the value I wish to store in a variable. I was thinking of looping and checking the number of elements found with that id, and parsing once it is greater than 1. But I'm unsure whether the source updates per se.
UPDATE:
With more research, it seems to have to do with the nature of the element, in particular the tag <input type="hidden". So, to rephrase my question: how does one extract the value of a hidden element?

The element you are looking for (the input) is in an iframe. You'll need to switch to the iframe before you can locate the element and interact with it.
import os
from selenium import webdriver
driver=webdriver.Chrome()
try:
    driver.implicitly_wait(5)
    driver.get('http://patrickhlauke.github.io/recaptcha/')
    # Find the iframe and switch to it
    iframe_path = '//iframe[@title="recaptcha widget"]'
    iframe = driver.find_element_by_xpath(iframe_path)
    driver.switch_to.frame(iframe)
    # Find the input element
    input_elem = driver.find_element_by_id("recaptcha-token")
    print("Found the input element: ", input_elem)
finally:
    driver.quit()
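To get the value of the hidden input rather than the element object, get_attribute("value") can be used once the driver is inside the iframe. The sketch below wraps that in a hypothetical helper, read_hidden_value, which also switches back to the main document afterwards; the XPath and id passed in would be the ones from the code above.

```python
def read_hidden_value(driver, frame_xpath, element_id):
    """Switch into an iframe, read an input's value attribute, switch back."""
    frame = driver.find_element("xpath", frame_xpath)
    driver.switch_to.frame(frame)
    try:
        return driver.find_element("id", element_id).get_attribute("value")
    finally:
        # Always return to the top-level document, even if the lookup fails.
        driver.switch_to.default_content()
```

A call might look like token = read_hidden_value(driver, '//iframe[@title="recaptcha widget"]', "recaptcha-token"). Calling it again after the user has solved the captcha reads the then-current value without reloading the page, assuming the token still lives in that same iframe.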
