Selenium JS map scrape with Python

I'm new to Selenium (I've only been trying it since yesterday) and have found some interesting things you can do with it in Python.
I found some information on how to scrape and interact with JS pages, but my doubt is how to get data from a clickable map with Selenium. I looked for hidden links in the page, but there aren't any. I noticed that when I move my mouse over any button in the map, the x,y position changes (of course...), and after I click the button I can scrape my data. Using a static model I could scrape all the data that I want.
So my question is: how can I simulate the mouse movement over the map and the click action?

If you have the x,y position on the map and the length and width of the map, then you can try something like:
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("http://www.your_web_page.com")  # specify webpage
element = driver.find_element_by_xpath("provide_map_selector")  # specify correct xpath (note: find_element, singular)
x = 25        # set actual value
y = 50        # set actual value
length = 500  # set actual value
width = 300   # set actual value
action = webdriver.common.action_chains.ActionChains(driver)
action.move_to_element_with_offset(element, width - y, length - x)  # offset within the map element
action.click()
action.perform()
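Note that this offset arithmetic assumes the older Selenium behaviour, where move_to_element_with_offset measured from the element's top-left corner. In recent Selenium 4 releases (4.3 and later for the Python bindings) the offset is measured from the element's in-view center, so coordinates taken from the top-left corner need converting. A minimal sketch, assuming x,y are measured from the map's top-left corner and the page/selector are placeholders:

from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("http://www.your_web_page.com")  # hypothetical page
element = driver.find_element(By.XPATH, "provide_map_selector")  # hypothetical selector

x, y = 25, 50  # target point, measured from the map's top-left corner
size = element.size

# Selenium >= 4.3 measures offsets from the element's center, so shift the origin.
offset_x = int(x - size['width'] / 2)
offset_y = int(y - size['height'] / 2)

ActionChains(driver).move_to_element_with_offset(element, offset_x, offset_y).click().perform()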

Related

How To Scroll Inside An Element On A Webpage (Selenium Python)

How can I scroll down in a certain element of a webpage in Selenium?
Basically my goal is to scroll down in this element until new profile results stop loading.
Let's say that there should be 100 profile results that I'm trying to gather.
By default, the webpage will load 30 results.
I need to scroll down IN THIS SECTION, wait a few seconds for 30 more results to load, repeat (until all results have loaded).
I am able to count the number of results with:
len(driver.find_elements(By.XPATH, "//div[@class='virtual-box']"))
I already have all the other code written, I just need to figure out the line of code to get Selenium to scroll down like 2 inches.
I've looked around a bunch and can't seem to find a good answer (that or I suck at googling).
This is a section of my code:
(getting the total number of profiles currently on the page = max_prof)
while new_max_prof > max_prof:
    scroll_and_wait(profile_number)
    if max_prof != new_max_prof:  # to make sure that they are the same
        max_prof = new_max_prof
...and here is the function that it is calling (which currently doesn't work because I can't get it to scroll)
def scroll_and_wait(profile_number=profile_number):  # This doesn't work yet
    global profile_xpath
    global new_max_prof
    global max_prof
    print('scrolling!')
    # driver.execute_script("window.scrollTo(0,1080);")  # does not work
    temp_xpath = profile_xpath + str(max_prof) + ']'
    element = driver.find_element(By.XPATH, temp_xpath)
    ActionChains(driver).scroll_to_element(element).perform()  # scrolls to the last profile
    element.click()  # selects the last profile
    # Tested, and this does not seem to load the new profiles unless you scroll down.
    print('did the scroll!!!')
    time.sleep(5)
    new_max_prof = int(len(driver.find_elements(By.XPATH, "//div[@class='virtual-box']")))
    print('new max prof is: ' + str(new_max_prof))
    time.sleep(4)
I tried:
1. driver.execute_script("window.scrollTo(0,1080);") and driver.execute_script("window.scrollTo(0, document.body.scrollHeight);"), but neither seemed to do anything.
2. ActionChains(driver).scroll_to_element(element).perform(), hoping that if I scrolled to the last profile on the page, it would load the next one (it doesn't).
3. Using pywin32's win32api.mouse_event(MOUSEEVENTF_WHEEL, -300, 0) to simulate mouse scrolling. It didn't seem to work, but even if it did, I'm not sure it would solve the problem, because the scrolling really needs to happen inside the element on the webpage, not just go to the bottom of the webpage.
OKAY! I found something that works. (If anyone knows a better solution please let me know)
You can use this code to scroll to the bottom of the page:
driver.find_element(By.TAG_NAME, 'html').send_keys(Keys.END)  # works, but not inside an element (requires: from selenium.webdriver.common.keys import Keys)
What I had to do was more complicated though (since I am trying to scroll down IN AN ELEMENT on the page, and not just to the bottom of the page).
IF YOUR SCROLL BAR HAS ARROW BUTTONS at the top/bottom, try just clicking them with .click() or .click_and_hold(); that's a much easier solution than trying to scroll, and it does the same thing.
IF, LIKE ME, YOUR SCROLL BAR HAS NO ARROW BUTTONS, you can still click on the scroll bar track at the bottom/top and it will move. If you find the XPath to your scroll bar and click it, the click lands in the middle (not helpful), but you can offset it on the x/y axis with .move_by_offset(x, y). For example:
# import ActionChains
from selenium.webdriver.common.action_chains import ActionChains
scroll_bar_xpath = "//div[@ng-if='::vm.isVirtual']/div[@class='ps-scrollbar-y-rail']"
element = driver.find_element(By.XPATH, scroll_bar_xpath)
# Do stuff
ActionChains(driver).move_to_element(element).move_by_offset(0,50).click().perform()
Now normally, you wouldn't want to use a fixed pixel amount (50 on the y axis) because if you change the browser size, or run the program on a different monitor, it could mess up.
To solve this, you just need to figure out the size of the scroll bar, so that you know where the bottom of it is. All you have to do is:
element = driver.find_element(By.XPATH, scroll_bar_xpath)
size = element.size
w = size['width']
h = size['height']
print('size is: ' + str(size))
print(h)
print(w)
This will give you the size of the element. You want to click at the bottom of it, so you'd think you can just take the height and pass it into move_by_offset like this: .move_by_offset(0, h). You can't do that, because when you select an element the cursor starts from its middle, so you want to cut that number in half (and round it down so that you don't have a decimal). This is what I ended up doing, and it worked:
# import ActionChains
from selenium.webdriver.common.action_chains import ActionChains
import math
scroll_bar_xpath = "//div[@ng-if='::vm.isVirtual']/div[@class='ps-scrollbar-y-rail']"
element = driver.find_element(By.XPATH, scroll_bar_xpath)
size = element.size
w = size['width']
h = size['height']
#Calculate where to click
click_place = math.floor(h / 2)
# Do Stuff
ActionChains(driver).move_to_element(element).move_by_offset(0, click_place).click().perform() #50 worked
Hope it helps!
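As a follow-up to this answer: another way to scroll inside a scrollable element (rather than clicking its scroll bar) is to set the container's scrollTop directly from JavaScript. A minimal sketch, assuming a hypothetical container_xpath pointing at the scrollable div:

import time
from selenium.webdriver.common.by import By

container_xpath = "//div[@class='scrollable-results']"  # hypothetical selector
container = driver.find_element(By.XPATH, container_xpath)

# Scroll the container itself (not the window) down by 500px per step.
for _ in range(10):
    driver.execute_script("arguments[0].scrollTop += 500;", container)
    time.sleep(2)  # give new results time to load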

How to load in the entirety of a website for selenium to collect data from, and keep everything loaded in?

I am trying to scrape the terms and definitions, using the selenium chrome driver in python, from this website here: https://quizlet.com/433328443/ap-us-history-flash-cards/. There are 533 terms...so many in fact that quizlet makes you click a See more button if you want to see all the terms. The following code successfully extracts terms and definitions (I have tested it on other quizlet sites with less terms). There are also if() statements to deal with popups and the See more button. Again, my goal is to get the terms and definitions for every single term-definition pair on the page; however, to do this, the entire page needs to be loaded in, which is the basis of my problem.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome(executable_path = chrome_driver_path)
driver.get("https://quizlet.com/433328443/ap-us-history-flash-cards/")

# IN CASE OF POPUP, CLICK AWAY
if len(driver.find_elements_by_xpath("//button[@class='UILink UILink--revert']")) > 0:
    popup = driver.find_element_by_xpath("//button[@class='UILink UILink--revert']")
    popup.click()
    del popup

# SCROLL TO BOTTOM TO LOAD IN ALL TERMS, AND THEN BACK TO THE TOP
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

# IN CASE OF "SEE MORE" BUTTON AT BOTTOM, CLICK IT
if len(driver.find_elements_by_xpath("//button[@class='UIButton UIButton--fill' and @aria-label='See more']")) > 0:
    see_more = driver.find_element_by_xpath("//button[@class='UIButton UIButton--fill' and @aria-label='See more']")
    see_more.click()
    del see_more
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

# list of terms
quizlet_terms = tuple(map(lambda a: a.text,
                          driver.find_elements_by_class_name("SetPageTerm-wordText")))
# list of definitions
quizlet_definitions = tuple(map(lambda a: a.text,
                                driver.find_elements_by_class_name("SetPageTerm-definitionText")))
In my code, I have tried the scrolling-down trick to load in everything, but this does not work. This is because as I scroll down, while terms in my browser window are loaded, terms above and below my browser window get unloaded. Obviously, this is done for memory reasons, but I do not care about memory; I just want all the terms to be loaded at once so I can access their contents. My code works on smaller quizlet sites (with say 100 terms), but it breaks on this site, generating the following error:
selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: element is not attached to the page document
This stackoverflow page explains the error message: Python with Selenium "element is not attached to the page document".
From reading the aforementioned page, I have come to the conclusion that because the website is so large, as I scroll down the quizlet page, the terms I am currently looking at in my browser window are loaded, but terms that I have scrolled past and are no longer in my view are unloaded and stored in some funky way that I cannot properly access, generating the error message.
How would one go about in keeping the entirety of the page loaded-in so I can access the contents of all 533 terms? Ideally, I would like a solution that keeps everything I have scrolled past fully-loaded in, and does not unload anything. Another idea is that the whole page is loaded in from the get-go. It would also be nice if there is some memory-saving solution to this, perhaps by simply accessing just the raw html code and no fancy graphics or anything. Has anyone ever encountered this problem, and if so, how did you solve it? Thank you, any help is appreciated.
Much thanks to @Abhishek Dhoundiyal's comment. My working code:
driver.execute_script("window.scrollTo(800, 800);")
terms_in_this_set = int(sub("\D", "", (driver.find_element_by_xpath("//h4[#class='UIHeading UIHeading--assembly UIHeading--four']")).text))
chunk_size = 15000
quizlet = numpy.empty(shape = (0, 2), dtype = "str")
# done in while loop so that terms and definitions can be extracted while scrolling (while making sure there are no duplicate entries)
while len(quizlet) != terms_in_this_set:
# INCASE OF "SEE MORE" BUTTON, CLICK IT TO SEE MORE
if len(driver.find_elements_by_xpath("//button[#class='UIButton UIButton--fill' and #aria-label='See more']")) > 0:
see_more = driver.find_element_by_xpath("//button[#class='UIButton UIButton--fill' and #aria-label='See more']")
see_more.click()
del see_more
# CHECK IF THERE ARE TERMS
quizlet_terms_classes = driver.find_elements_by_class_name("SetPageTerm-wordText")
quizlet_definitions_classes = driver.find_elements_by_class_name("SetPageTerm-definitionText")
if (len(quizlet_terms_classes) > 0) and (len(quizlet_definitions_classes) > 0):
# append current iteration terms and definitions to full quizlet terms and definitions
quizlet = numpy.vstack((quizlet, numpy.transpose([list(map(lambda term: remove_whitespace(term.text), quizlet_terms_classes)), list(map(lambda definition: remove_whitespace(definition.text), quizlet_definitions_classes))])))
# get unique rows
quizlet = numpy.unique(quizlet, axis = 0)
del quizlet_terms_classes, quizlet_definitions_classes
driver.execute_script(f"window.scrollBy(0, {chunk_size})")
del terms_in_this_set
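One more note on the StaleElementReferenceException above: since the page unloads elements as you scroll, a reference can go stale between finding an element and reading its .text. A minimal sketch of a retry pattern for that situation, re-finding the element by its XPath whenever the old reference goes stale (the function name is a hypothetical helper, not part of the original code):

from selenium.common.exceptions import StaleElementReferenceException
from selenium.webdriver.common.by import By

def read_text_with_retry(driver, xpath, attempts=3):
    # Re-find the element and read its text, retrying if the reference goes stale.
    for _ in range(attempts):
        try:
            return driver.find_element(By.XPATH, xpath).text
        except StaleElementReferenceException:
            continue  # the DOM re-rendered; find the element again
    raise StaleElementReferenceException("still stale after " + str(attempts) + " attempts: " + xpath)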

Scrolling down on Python Selenium

I am currently attempting to scrape a DropBox Folder using Selenium on Python. Apparently, if I try to select all hyperlinks (or all elements containing hyperlinks), I only get the first 20 or so results. To give a minimum working example:
from selenium import webdriver
browser = webdriver.Chrome()
page = "https://www.dropbox.com/FolderName"  # placeholder folder URL
browser.get(page)
elementlist = browser.find_elements_by_class_name('brws-file-name-cell-filename')
# or alternatively, you can simply use the find_elements_by_tag_name('a') method, which yields similar results
elength = len(elementlist)
Usually, elength is in the order of 20 to 30 elements, which grows to 30 or 40 when I add a command to scroll down to the bottom of the page. I know for a fact that there are well over 200 elements in the folder I am trying to scrape. My question is, thus: is there any way to scroll down the page progressively, rather than going all the way to the bottom right away? I have seen that many questions asked on the same topic focus on pages with infinite loading, like Facebook or other social media. My page, on the other hand, has a fixed length. Is there a way I can scroll down step by step, rather than all at once?
UPDATE
I tried following the advice given to me by the community and by the answer you can find here. Unfortunately, I am still struggling to iterate over the height, which is my variable of interest and which seems to be stuck in a string. This has been my best attempt at creating a for loop over the height, and needless to say, it still did not work.
# Get current height
height = browser.execute_script("return document.body.scrollHeight")
while True:
    # Scroll down
    browser.execute_script('window.scrollTo(0, window.scroll'+str(height)+' + 200)')
    # Wait to load page
    time.sleep(SCROLL_PAUSE_TIME)
    # Calculate new scroll height and compare with last scroll height
    new_height = browser.execute_script("return document.body.scrollHeight")
    if new_height == height:
        break
    else:
        height = new_height
UPDATE 2
I think I've found the issue. Dropbox basically has a 'page within the page' structure. The whole of the page is visible to me, but there's an inner archive which I need to navigate. Any idea how to do that?
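Regarding UPDATE 2: when the scrollable area is an inner container rather than the window, window.scrollTo has no effect on it; you have to scroll the container element itself. A minimal sketch, with a hypothetical CSS selector standing in for Dropbox's inner list:

import time
from selenium.webdriver.common.by import By

inner = browser.find_element(By.CSS_SELECTOR, "div.scrollable-file-list")  # hypothetical selector

last_height = browser.execute_script("return arguments[0].scrollHeight", inner)
while True:
    # scroll the inner container, not the window
    browser.execute_script("arguments[0].scrollTop = arguments[0].scrollHeight", inner)
    time.sleep(2)  # wait for more entries to load
    new_height = browser.execute_script("return arguments[0].scrollHeight", inner)
    if new_height == last_height:
        break  # nothing new loaded; we've reached the end of the inner list
    last_height = new_height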
You could try this answer. Instead of going to the bottom, you could create a for loop with a fixed height and iterate till you reach the bottom.
browser.execute_script('window.scrollTo(0, window.scroll'+str(height)+' + 200)')
The second argument inside the JavaScript method seems odd to me. Let's assume your height variable is 800px; then this is the JavaScript that gets executed inside execute_script (execute_script is a Selenium method which lets you run JavaScript):
window.scrollTo(0, window.scroll800 + 200)
I assume this will throw an error and stop the execution. I think you should change your code to this:
browser.execute_script('window.scrollTo(0,'+str(height)+' + 200)')
This code will scroll your window to the bottom of the page. (One tip: you can just open the console in your browser's devtools and try the JavaScript code there; if it works, you can come back to Selenium.) At this point you should make your driver instance sleep. Once the page has loaded (make sure to give it enough time), you should assign the new height value to a new variable. If the page has loaded more elements at the bottom, the first height and the new height should differ, and that requires another scroll to the bottom. But before scrolling, assign the new height to the first height variable, so that in the next loop your first height is the second height from the previous loop.
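Putting that advice together, a minimal sketch of the corrected loop (SCROLL_PAUSE_TIME is assumed to be defined; 2 seconds is a placeholder value):

import time

SCROLL_PAUSE_TIME = 2  # assumed pause; tune for your connection

height = browser.execute_script("return document.body.scrollHeight")
while True:
    # scroll to (current height + 200); this is the corrected JavaScript call
    browser.execute_script("window.scrollTo(0, " + str(height) + " + 200)")
    time.sleep(SCROLL_PAUSE_TIME)  # wait for new content to load
    new_height = browser.execute_script("return document.body.scrollHeight")
    if new_height == height:
        break  # nothing new loaded; we've reached the real bottom
    height = new_height  # carry the new height into the next iteration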

Web-scraping with Python: How to scroll into a view by pixels?

I'm using Python and Selenium. I'm looking to scroll inside a view by pixels, not by elements. The point is to loop until I've scrolled to the end of the list. As practice, I've been trying to scroll through the whole list of people who liked this Instagram post: https://www.instagram.com/p/BuT_u-UAKn1/ . I know how to scroll by elements:
elements = driver.find_elements_by_xpath("//*[@id]/div/a")
driver.execute_script("return arguments[0].scrollIntoView();", elements[-1])
But I would like to scroll by pixels. I've tried to do the following:
driver.execute_script("return arguments[0].scrollIntoView(true);", elements)
driver.execute_script("window.scrollBy(0,200);")
When doing so, this error occurs:
JavascriptException: Message: TypeError: arguments[0].scrollIntoView is not a function
Does anyone know how to scroll into a view by pixels?
Thanks
Below has worked for me.
#first move to the element
self.driver.execute_script("return arguments[0].scrollIntoView(true);", element)
#then scroll by x, y values, in this case 10 pixels up
self.driver.execute_script("window.scrollBy(0, -10);")
When you scroll by (0, 200), the positive number means scroll DOWN. If you want to scroll UP, use a negative value such as -200.
Also see the documentation here: https://developer.mozilla.org/en-US/docs/Web/API/Window/scrollBy
If you are using a browser that does not support scrollToOptions, then switch to a better, more supported browser.
Another possible solution is to implement a WebDriverWait for the specific element to be visible in the HTML DOM:
element = WebDriverWait(self.driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "element_css")))
self.driver.execute_script("return arguments[0].scrollIntoView(true);", element)
You can also try using ActionChains:
element = driver.find_element_by_id("id") # the element you want to scroll to
ActionChains(driver).move_to_element(element).perform()
After you move to the element, then you can use the scroll code
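For instance, a minimal sketch combining the two steps (the element id is hypothetical):

from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By

element = driver.find_element(By.ID, "some-id")  # hypothetical id of the element to scroll to
ActionChains(driver).move_to_element(element).perform()  # move to the element first
driver.execute_script("window.scrollBy(0, 200);")        # then scroll down 200px from there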
You can also try adding in an offset. Some webpages will not load new content if you jump all the way to the very bottom; they only load new content as you approach the end of the page.
document.documentElement.scrollHeight-10
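For example, scrolling to just short of the bottom with that offset (assuming it is run through execute_script):

# scroll to 10px above the bottom of the document
driver.execute_script("window.scrollTo(0, document.documentElement.scrollHeight - 10);")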
A less conventional way would be to execute JavaScript within your code.
Also try maximizing your window with Selenium; sometimes the size of the window affects how Selenium operates.
driver.maximize_window()
findThis = driver.find_element_by_css_selector("CSS SELECTOR HERE")
jsScript = """
function move_up(element) {
element.scrollTop = element.scrollTop - 1000;
}
function move_down(element) {
console.log('Position before: ' + element.scrollTop);
element.scrollTop = element.scrollTop + 1000;
console.log('Position after: ' + element.scrollTop);
}
move_up(arguments[0]);
"""
driver.execute_script(jsScript, findThis)

Can't get "WebDriver" element data if not "eye-visible" in browser using Selenium and Python

I'm doing a scraping with Selenium in Python. My problem is that after I found all the WebElements, I'm unable to get their info (id, text, etc) if the element is not really VISIBLE in the browser opened with Selenium.
What I mean is:
First image
Second image
As you can see from the first and second images, I have the first 4 "tables" that are "visible" to me and to the code. There are, however, 2 other tables (5 & 6, Gettho lucky dip & Sue Specs) that are not "visible" until I drag down the right scroll bar.
Here's what I get when I try to get the element info, without "seeing it" in the page:
Third image
Manually dragging the page to the bottom, and therefore making it "visible" to the human eye (and also to the code???), is the only way I can get the data from the WebDriver element I need:
Fourth image
What am I missing? Why can't Selenium do it in the background? Is there a way to solve this problem without going up and down the page?
PS: the page could be any kind of dog race page in http://greyhoundbet.racingpost.com/. Just click City - Time - and then FORM.
Here's part of my code:
# I call this function with the URL and it returns the driver object
def open_main_page(url):
    chrome_path = r"c:\chromedriver.exe"
    driver = webdriver.Chrome(chrome_path)
    driver.get(url)
    # Wait for page to load
    loading(driver, "//*[@id='showLandingLADB']/h4/p", 0)
    element = driver.find_element_by_xpath("//*[@id='showLandingLADB']/h4/p")
    element.click()
    # Wait for second element to load, after click
    loading(driver, "//*[@id='landingLADBStart']", 0)
    element = driver.find_element_by_xpath("//*[@id='landingLADBStart']")
    element.click()
    # Wait for main page to load.
    loading(driver, "//*[@id='whRadio']", 0)
    return driver
Now I have the browser "driver" which I can use to find the elements I want
url = "http://greyhoundbet.racingpost.com/#card/race_id=1640848&r_date=2018-
09-21&tab=form"
browser = open_main_page(url)
# Find dog names
names = []
text: str
tags = browser.find_elements_by_xpath("//strong")
Now "TAGS" is a list of WebDriver elements as in the figures.
I'm pretty new to this area.
UPDATE:
I've solved the problem with a code workaround.
tags = driver.find_elements_by_tag_name("strong")
for tag in tags:
    driver.execute_script("arguments[0].scrollIntoView();", tag)
    print(tag.text)
In this manner the browser will move to the element's position and will be able to get its information.
However, I still have no idea why, with this page in particular, I'm not able to read elements that aren't visible in the browser area until I scroll and literally see them.
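One likely explanation, and a possible alternative to the scrolling workaround: WebElement.text returns the rendered text, which for elements that the page hasn't actually rendered can come back empty, whereas reading the textContent attribute pulls the text straight from the DOM without requiring visibility. A minimal sketch of that approach:

tags = driver.find_elements_by_tag_name("strong")
for tag in tags:
    # .text only returns rendered text; textContent reads the raw DOM text,
    # so it can work even when the element isn't scrolled into view
    print(tag.get_attribute("textContent"))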
