Scrape data not visible in the source file (Python)

I want to scrape the data on the website https://www.climatechangecommunication.org/climate-change-opinion-map/. I am somewhat familiar with Selenium, but the data I need, which is below the map and in the tooltip on the map, is not visible in the page source. I have read some posts about using PhantomJS and others; however, I am not sure where and how to start. Can someone please help me get started?
Thanks,
Rexon

You can use this sample code:
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
driver.get("https://www.climatechangecommunication.org/climate-change-opinion-map/")
# switch to iframe
WebDriverWait(driver, 10).until(EC.frame_to_be_available_and_switch_to_it((By.XPATH, "//iframe[@src = 'https://environment.yale.edu/ycom/factsheets/MapPage/2017Rev/?est=happening&type=value&geo=county']")))
# do your stuff
united_states = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, "//*[@id='document']/div[4]//*[name()='svg']")))
print(united_states.text)
# switch back to default content
driver.switch_to.default_content()
Output:
50%
No
12%
Yes
70%
United States
Explanation: first of all, to be able to interact with elements below the map you have to switch to the iframe content; otherwise it is not possible to interact with these elements. The data below the map is in svg tags, which are also not trivial to handle. To do this, use the sample I have provided.
PS: I have used WebDriverWait in my code. With WebDriverWait your code becomes faster and more stable, since Selenium waits for a particular condition, such as the visibility or clickability of an element. In the sample code the driver waits up to 10 seconds for the expected condition to be satisfied.
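If you need the svg text in a structured form, here is a small follow-up sketch (my addition, assuming the printed lines alternate between value and label as in the output above) that pairs them up:
# Pair up alternating value/label lines from the svg text (assumes strict alternation)
lines = united_states.text.splitlines()
pairs = list(zip(lines[0::2], lines[1::2]))
print(pairs)  # e.g. [('50%', 'No'), ('12%', 'Yes'), ('70%', 'United States')]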

Related

Python/Selenium - "no such element: Unable to locate element"

I'm having a really hard time locating elements on this website. My ultimate aim here is to scrape the scorecard data for each country. The way I've envisioned doing that is to have Selenium click the list view (as opposed to the default country view) and loop in and out of all the country profiles, scraping the data for each country along the way. But methods to locate elements on the page that worked for me in the past have been fruitless with this site.
Here's some sample code outlining my issue.
from selenium import webdriver
driver = webdriver.Chrome(executable_path= "C:/work/chromedriver.exe")
driver.get('https://www.weforum.org/reports/global-gender-gap-report-2021/in-full/economy-profiles#economy-profiles')
# click the `list` view option
driver.find_element_by_xpath('//*[@id="root"]/div/div[1]/div[2]/div[2]/svg[2]')
As you can see, I've only gotten as far as step 1 of my plan. I've tried what other questions have suggested as far as adding waits, but to no avail. I see the site fully loaded in my DOM, but no xpaths are working for any element on there I can find. I apologize if this question is posted too often, but do know that any and all help is immensely appreciated. Thank you!
The element is inside an iframe; you need to switch to it to access the element.
Use WebDriverWait() to wait for frame_to_be_available_and_switch_to_it():
driver.get("https://www.weforum.org/reports/global-gender-gap-report-2021/in-full/economy-profiles#economy-profiles")
wait = WebDriverWait(driver, 10)
wait.until(EC.frame_to_be_available_and_switch_to_it((By.ID, "iFrameResizer0")))
wait.until(EC.element_to_be_clickable((By.XPATH, "//*[name()='svg' and @class='sc-gxMtzJ bRxjeC']"))).click()
You need the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
Another update:
driver.get("https://www.weforum.org/reports/global-gender-gap-report-2021/in-full/economy-profiles#economy-profiles")
wait = WebDriverWait(driver, 10)
wait.until(EC.frame_to_be_available_and_switch_to_it((By.ID, "iFrameResizer0")))
driver.find_element_by_xpath("//*[name()='svg' and @class='sc-gxMtzJ bRxjeC']").click()
You are clicking List view incorrectly: your locator has to be stable. I checked your locator, and it did not work.
So, in order to click the icon, use WebDriverWait:
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
wait = WebDriverWait(driver, timeout=30)
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, ".sc-gzOgki.ftxBlu>.background")))
list_view = driver.find_element_by_css_selector(".sc-gzOgki.ftxBlu>.background")
list_view.click()
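Since element_to_be_clickable() returns the located element, a slightly tighter variant (same selector, just reusing the wait's return value instead of finding the element twice) would be:
# element_to_be_clickable() returns the WebElement once the condition holds
list_view = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, ".sc-gzOgki.ftxBlu>.background")))
list_view.click()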
Next, to get the unique row locator, use the following css selector:
.sc-jbKcbu.kynSUT
It will give you the list of all countries.
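For example, a minimal sketch (assuming those generated class names are still current) that walks the country rows:
# Collect every country row matched by the selector and print its text
rows = driver.find_elements_by_css_selector(".sc-jbKcbu.kynSUT")
for row in rows:
    print(row.text)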

Unable to select Radio button with Selenium (python)

I've been trying to select a radio button using selenium and I'm having no luck. All other selectors (Login, drop down, etc) have all worked fine.
driver.find_element_by_xpath('//*[@id="content_grid"]/div[1]/div[2]/div[4]/div[2]/div[3]/label/div[1]/input').click()
This is the radio button I'm trying to select: https://i.stack.imgur.com/X64k8.png
Here is the webpage - https://stathead.com/basketball/pgl_finder.cgi
Appreciate any help! First time poster and noobie coder :)
A few options to solve the problem.
The problem occurs because, upon retrieving the element and scrolling down to click on it, the site adds a banner that may obstruct the element.
One option is to always open the browser in fullscreen.
driver = webdriver.Chrome()
driver.maximize_window()
This should help keep the banner out of the way: less scrolling means less chance for the banner to overlap with the radio button.
Selenium also offers a way to return an element once a condition is met, using WebDriverWait:
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
el = WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.CSS_SELECTOR, '[value="E"]')))
Basically, the above will return the element once it is reported as clickable. Clickability might not be the cause here, though: this only ensures that the element found can be clicked, and it does not validate whether something overlaps it (which I suspect is the issue here).
As a third option, you may use JavaScript to click on the element:
driver.execute_script("arguments[0].click();", el)
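A related workaround (my addition, not from the original answer) is to scroll the element to the middle of the viewport before a normal click, so a banner pinned to the top or bottom of the page is less likely to cover it:
# Center the element in the viewport, then click it normally
el = WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.CSS_SELECTOR, '[value="E"]')))
driver.execute_script("arguments[0].scrollIntoView({block: 'center'});", el)
el.click()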

Using selenium driver.find_element().click() on a subdivided webpage

First of all, I'm completely new in web scraping, html and selenium, so my question might not seem meaningful to you.
What I'm trying to do: automated click on an image on a webpage. Manual click on this image allows to display new information on a part of the webpage.
How I tried to do it: I found something specific to this image, for example:
<img src="images/btn_qmj.gif" border="0">
So I just entered in python:
xpath = '//img[@src="images/btn_qmj.gif"]'
driver.find_elements_by_xpath(xpath)
Problem: this returns me an empty list. I partially understood why, but I don't have any solution.
(image of the HTML code inspector)
I included here an image of the html tree I obtain on the web inspector. As one can see, the tree to reach my line of interest gives "/html/frameset/frame1/#document/html/body/center/table/tbody/tr[2]/td/a/img".
The issue is that I cannot access, by using an XPath, anything that comes after the #document. If I try to copy the XPath of my line of interest, it tracks back only up to the 2nd "/html". It is as if the page were subdivided, compartmentalized, with inner parts I cannot access. I suppose there is a way to access the #document content; maybe it refers to another webpage, but I'm stuck at that point.
I would be very grateful if anyone could help me on this.
Have you tried switching to the frame first?
driver.switch_to.frame('gauche')
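A fuller sketch of that idea (the frame name 'gauche' comes from this answer; the XPath is the one from the question):
driver.switch_to.frame('gauche')  # enter the frame that contains the image
driver.find_element_by_xpath('//img[@src="images/btn_qmj.gif"]').click()
driver.switch_to.default_content()  # switch back once done inside the frame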
You need to switch control to the frame first to handle the image inside it. Please see the code below for reference:
driver.switch_to.frame("gauche")
WebDriverWait(driver, 20).until(
    EC.visibility_of_element_located((By.XPATH, "//img[@src='images/btn_qmj.gif']"))).click()
driver.switch_to.default_content()
Note: you need to add the imports below.
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
The element is present inside an iframe. In order to access the element you need to switch to it first.
Use WebDriverWait() with frame_to_be_available_and_switch_to_it(), then WebDriverWait() with visibility_of_element_located():
WebDriverWait(driver, 10).until(EC.frame_to_be_available_and_switch_to_it((By.NAME, "gauche")))
WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, "//img[@src='images/btn_qmj.gif']"))).click()
You need to import the following libraries:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

How to periodically re-check a webpage using selenium in python

I am new to selenium in python (and all web-interface applications of python) and I have a task to complete for my present internship.
My script successfully navigates to an online database and inputs information from my data tables, but then the webpage in question takes anywhere from 30 seconds to several minutes to compute an output.
How do I go about instructing Python to re-check the page every 30 seconds until the output appears, so that I can parse it for the data I need? For instance, which functions might I start with?
This will be part of a loop repeated for over 200 entries, and hundreds more if I am successful so it is worth my time to automate it.
Thanks
You should use Selenium's waits, as pointed out by G_M and Sam Holloway.
The one I use most is expected_conditions:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Firefox()
driver.get("http://somedomain/url_that_delays_loading")
try:
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "myDynamicElement"))
    )
finally:
    driver.quit()
It will wait until there is an element with id "myDynamicElement" and then continue executing the try block, which should contain the rest of your work.
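Since the question asks about re-checking every 30 seconds over several minutes, note that WebDriverWait also accepts a poll_frequency argument (in seconds). A variant for that case (the element id is still just a placeholder):
# Check every 30 seconds, for up to 10 minutes, until the output element appears
element = WebDriverWait(driver, 600, poll_frequency=30).until(
    EC.presence_of_element_located((By.ID, "myDynamicElement"))
)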
I prefer to use the class By.XPATH, but if you use By.XPATH with the method presence_of_element_located, add another pair of parentheses so that the locator becomes the required tuple, as noted in this answer:
from selenium.webdriver.common.by import By
driver.find_element(By.XPATH, '//button[contains(text(),"Some text")]')
driver.find_element(By.XPATH, '//div[#id="id1"]')
driver.find_elements(By.XPATH, '//a')
The easiest way (for me) to find the XPath of an element is to open developer tools in Chrome (F12), press Ctrl+F, and use the mouse with the inspector, composing an XPath specific enough to find just the expected element, or as few elements as possible.
All the examples are from (or based on) the great Selenium documentation.
If you just want to space out checks, the time.sleep() function should work.
However, as G_M's comment says, you should look into Selenium waits. Think about this: is there an element on the page that will indicate that the result is loaded? If so, use a Selenium wait on that element to make sure your program is only pausing until the result is loaded and not wasting any time afterwards.
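If there is no single element you can wait on, a rough manual polling loop (sketch only; the locator is a placeholder) would look like this:
import time
# Poll up to 10 times, 30 seconds apart, until the result element shows up
for _ in range(10):
    results = driver.find_elements_by_id("myDynamicElement")  # placeholder locator
    if results:  # find_elements returns [] instead of raising when nothing matches
        break
    time.sleep(30)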

How do I click on links that are contained in a div container fluid using python selenium

I'm writing a piece of python code to test whether users are able to click on dynamic tabs.
I'm using find_elements_by_xpath() to search for attributes of the tab. Now, this is a dynamic set of tabs, so they will never be viewable in the page source. But they do exist when I click on inspect element.
I'm using something like this to identify the tabs:
elem = self.browser.find_elements_by_xpath("//a[contains(@role,'tab')]")
for i in elem:
    print(i.text)
I tried the get_attribute feature but that did not work. Is there any way I can use python selenium to click on a dynamic tab (that does not appear on page source)?
You can wait some time until the required element is present in the DOM and clickable:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait as wait
from selenium.webdriver.support import expected_conditions as EC
wait(self.browser, 10).until(EC.element_to_be_clickable((By.XPATH, '//a[contains(@role, "tab")]'))).click()
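To visit each tab in turn, a minimal sketch (my addition; it assumes clicking one tab does not detach the others from the DOM, otherwise the stored references would go stale):
# Wait for all tabs to be present, then click them one by one
tabs = wait(self.browser, 10).until(
    EC.presence_of_all_elements_located((By.XPATH, '//a[contains(@role, "tab")]')))
for tab in tabs:
    tab.click()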
