I am trying to scrape a page with Selenium, and I need to load all of its content by scrolling to the bottom of the site. But when I execute:
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
the program doesn't do anything at all. I wonder if it's because the page I'm scraping has a customized scrollbar.
That's because document.body.scrollHeight is zero, so it doesn't scroll anything.
You can scroll to an arbitrarily large value instead, or use document.documentElement.scrollHeight.
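For example, a minimal sketch of both options (assuming driver is your already-created WebDriver):
# Use the document element's height instead of the body's
driver.execute_script("window.scrollTo(0, document.documentElement.scrollHeight);")
# Or simply scroll to an arbitrarily large offset
driver.execute_script("window.scrollTo(0, 999999);")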
Check this question for more details.
I'm new to Selenium, and I am having trouble figuring out how to click through all instances of a specific element. To clarify, I can't even get it to click through one, since it acts as a dropdown but is defined as a plain element.
I am trying to scrape FanDuel; when you click on a specific game you are presented with a bunch of main title bets, and to get the information I need, I have to click the dropdowns that reveal it. There is also another dropdown labeled "See More", which is a similar problem, but once this one is fixed I assume I will be able to figure that out.
So far, I have tried to use:
find_element_by_class_name()
find_element_by_css_selector()
I have also used the plural find_elements_* variants and tried to loop through the list and click each one, but that did not work.
Any ideas would be much appreciated.
FYI: I am using Beautiful Soup to scrape the information from the website; I figured Selenium would be helpful for making the information that isn't currently accessible, accessible.
This image shows the dropdowns that I am trying to access, in this case the dropdown 'Win Margin'. The HTML code is shown to the left of it.
This also shows that there are multiple dropdowns, whose number varies depending on the game.
You can also try using ActionChains from Selenium:
from selenium.webdriver.common.action_chains import ActionChains

menu = driver.find_element_by_css_selector(".nav")
hidden_submenu = driver.find_element_by_css_selector(".nav #submenu1")
ActionChains(driver).move_to_element(menu).click(hidden_submenu).perform()
Source: here
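Since the question is about clicking every dropdown rather than just one, a hedged sketch of combining this with find_elements (the ".accordion" selector is only a placeholder for whatever class the dropdown headers actually use):
from selenium.webdriver.common.action_chains import ActionChains

# Placeholder selector: replace with the real class of the dropdown headers
dropdowns = driver.find_elements_by_css_selector(".accordion")
for dropdown in dropdowns:
    ActionChains(driver).move_to_element(dropdown).click(dropdown).perform()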
I am trying to go through a webpage with Selenium and create a set of all elements with certain class names, so I have been using:
elements = set(driver.find_elements_by_class_name('class name'))
However, in some cases there are thousands of elements on the page (if I scroll down), and I've noticed that this code only finds the first 18-20 elements on the page (only about 14-16 are visible to me at once). Do I need to scroll, or am I doing something else wrong? Is there any way to instantaneously get all of the elements I want in the HTML into a list without having to visually see them on the screen?
It depends on your webpage. Just look at the HTML source code (or the network log) before you scroll down. If there are only the 18-20 elements, then the page lazy-loads the next items (e.g. Twitter or Instagram). This means the server only renders the next items once you reach a certain point on the webpage. Otherwise all thousand items would be loaded at once, which would increase the page size, loading time and server load.
In this case, you have to scroll down to the end and then get the source code to parse all items.
You could probably use more advanced methods, like treating each chunk as a kind of page for a pagination approach (i.e. not saying "go to next page" but saying "scroll down"). But I guess you're a beginner, so I would start with simply scrolling down to the end (scroll, wait, scroll, ... until there are no new elements), then fetching the HTML and parsing it.
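A minimal sketch of that scroll-wait-repeat loop (assuming driver is your WebDriver, and 'class name' stands in for whatever class you are actually collecting):
import time

last_count = 0
while True:
    driver.execute_script("window.scrollTo(0, document.documentElement.scrollHeight);")
    time.sleep(2)  # give the page a moment to lazy-load the next batch
    elements = driver.find_elements_by_class_name('class name')
    if len(elements) == last_count:
        break  # nothing new appeared, so we have reached the end
    last_count = len(elements)

# All items are now in the DOM; fetch the source once and parse it
html = driver.page_source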
For the life of me I can't think of a better title...
I have a Python WebDriver-based scraper that goes to Google, enters a local search such as chiropractors+new york+ny, which, after clicking on More chiropractors+New York+NY, ends up on a page like this
The goal of the scraper is to grab the phone number and full address (including the suite number etc.) of each of the 20 results on such a results page. In order to do so, I need to have WebDriver click each of the 20 entries to bring up an overlay over the Google Map:
This is mighty slow. If I didn't have to trigger each of these overlays, I could do everything up to that point with the much faster lxml, by going straight to the final URL of the results page and extracting via XPath. But I appear to be stuck: I can't get the data from the overlay without first clicking the link that brings it up.
Is there a way to get the data out of this page element without having to click the associated links?
I'm trying to make a little script that looks at the main page of a website and finds ads.
The problem is that there are web pages that use infinite scroll. If this code were built for one particular web page, I could locate elements and scroll to them.
But I can't figure out how to make Selenium scroll to the very bottom of an arbitrary page.
self.driver.execute_script("window.scrollTo(0, something);")
PS: If the page is very large, break off after several seconds of scrolling.
Do you know how to do that?
Here's another method that I used in Java: get the window size and then scroll by that amount using JavaScript. Here's how to do it in Java (hopefully you can implement the concept in Python too):
// Get the viewport height, then scroll down by that amount via JavaScript
int pageHeight = driver.manage().window().getSize().getHeight();
((JavascriptExecutor) driver).executeScript("window.scrollBy(0," + pageHeight + ")");
If you are dealing with infinite scroll, you can put the executeScript() line in a loop. Hope it helps.
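Since the question is asked in Python, here is a rough, hedged translation of the same idea (assuming driver is your WebDriver; the 10-second cut-off is only an example for very large pages):
import time

# Viewport height, so each scroll advances roughly one screen
page_height = driver.get_window_size()["height"]

start = time.time()
while True:
    driver.execute_script("window.scrollBy(0, arguments[0]);", page_height)
    time.sleep(1)  # give lazily loaded content time to appear
    at_bottom = driver.execute_script(
        "return window.innerHeight + window.pageYOffset"
        " >= document.documentElement.scrollHeight;")
    if at_bottom or time.time() - start > 10:
        break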
The web page that I want to parse has several thousand links. It also has an infinite scrolling feature, which means I need to use send_keys(Keys.PAGE_DOWN) in Selenium to extend the page and load more content.
Is it possible to use Selenium to scroll down the browser and, in the meantime, parse only the new content? I don't want to repeatedly parse the old content, or wait until the web page reaches the bottom and only then parse, since the page has a huge number of links.
Any suggestions? If there is a better Python library that can help me do that, please let me know too. Thank you.
You can write a simple loop that extracts only the newly rendered links using XPath. Without knowing more about the page you're parsing, I'll assume that all <a> tags are fair game:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Firefox()

links = []
while True:
    # Get any links beyond the ones we already have
    elements = driver.find_elements_by_xpath(
        "//a[position()>{}]".format(len(links))
    )
    # If there are no more links, stop
    if not len(elements):
        break
    # "Parse" the links
    links += elements
    # Page down to trigger load of next batch
    driver.find_element_by_tag_name("html").send_keys(Keys.PAGE_DOWN)
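Two caveats with this sketch, depending on the page you are scraping. First, position() in //a[position()>N] is evaluated relative to each parent element, so if the links are spread across different containers you may need (//a)[position()>{}] instead, so the count runs across the whole document. Second, the loop exits as soon as a query returns nothing new, so on a slow connection it can stop before the next batch has rendered; adding a short explicit wait (e.g. time.sleep) after the send_keys call, and pulling the URL out immediately with element.get_attribute("href") rather than holding on to WebElement references (which can go stale if the page re-renders), makes it a bit more robust.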