I am trying to create an Instagram scraping bot that collects a list of Followers and Following using Python + Selenium.
However, that list keeps on loading when the user scrolls until the list is exhausted. I am attaching a screenshot for reference (some content hidden due to privacy reasons):
Now, I believe I have two ways to achieve this:
Keep reading the usernames, and then keep on scrolling.
Keep scrolling till the end, and then read all usernames together from the source code.
I've been trying to figure this out using the second method. However, I am not able to figure out how to know when there is no more content to scroll. How can I achieve this (provided that I don't know anything about the length of this element)?
Reason for not using Method 1: When scrolling, the DOM keeps getting refreshed, so it is hard to keep track of which usernames have been read.
One way to do this is to keep track of the number of child elements in the div that contains the li elements for the followers. If it doesn't increase after a scroll event, you've reached the end of the list.
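A minimal sketch of that idea, assuming the followers appear as li rows inside a scrollable container (the "div[role='dialog'] ul" selector is a guess and will differ on the real page; "css selector" is the literal locator string that By.CSS_SELECTOR resolves to):

```python
import time

def scroll_until_exhausted(driver, list_css="div[role='dialog'] ul", pause=1.0):
    """Scroll an infinite-scroll container until its <li> count stops growing,
    then return the loaded row elements."""
    prev_count = -1
    while True:
        items = driver.find_elements("css selector", list_css + " li")
        if len(items) == prev_count:
            return items  # no new rows appeared after the last scroll: end of list
        prev_count = len(items)
        container = driver.find_element("css selector", list_css)
        # Jump the container's scroll position to its current bottom
        driver.execute_script(
            "arguments[0].scrollTop = arguments[0].scrollHeight;", container)
        time.sleep(pause)  # give the lazy loader time to fetch the next chunk
```

Once the loop exits, all usernames are in the DOM and can be read in one pass.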
Related
I'm working on a project trying to autonomously monitor item prices on an Angular website.
Here's what a link to a particular item would look like:
https://www.<site-name>.com/categories/<sub-category>/products?prodNum=9999999
Using Selenium (in Python) on a page with product listings, I can get some useful information about the items, but what I really want is the prodNum parameter.
The onClick attribute for the items is clickOnItem(item, $index).
I do have some information for the items, including the presumable item and $index values, which are visible within the HTML, but I'm doubtful there is a way of seeing what actually happens inside clickOnItem.
I've tried looking around using dev-tools to find where clickOnItem is defined, but I haven't been successful.
Considering that I don't see any way of getting prodNum without clicking, I'm wondering: is there a way I could simulate a click to see where it would redirect, but without actually loading the link, as this would take way too much time to do for each item?
Note: I want to get the specific prodNum. I want to be able to hit the item page directly without first going through the main listing page.
I'm new to using Selenium, and I am having trouble figuring out how to click through all iterations of a specific element. To clarify, I can't even get it to click through one, as it's a dropdown but is defined as an element.
I am trying to scrape fanduel; when clicking on a specific game, you are presented with a bunch of main title bets, and in order to get the information I need, I have to click the dropdowns to get to it. There is also another dropdown labeled "See More", which is a similar problem, but assuming this gets fixed, I'm assuming I will be able to figure that out too.
So far, I have tried to use:
find_element_by_class_name()
find_element_by_css_selector()
I have also used their plural find_elements variants, and tried to loop through and click on each index of the list, but that did not work.
If there are any ideas, they would be much appreciated.
FYI: I am using Beautiful Soup to scrape the website for the information; I figured Selenium would help make the information that isn't currently accessible, accessible.
This image shows the dropdowns that I am trying to access, in this case the dropdown 'Win Margin'. The HTML code is shown to the left of it.
This also shows that there are multiple dropdowns, varying in amount based off the game.
You can also try using action chains from Selenium:
from selenium.webdriver.common.action_chains import ActionChains

menu = driver.find_element_by_css_selector(".nav")
hidden_submenu = driver.find_element_by_css_selector(".nav #submenu1")
ActionChains(driver).move_to_element(menu).click(hidden_submenu).perform()
Source: here
I'm learning how to do RPA, and a lot of what I want to accomplish involves manipulating checkboxes and text fields on Chrome. I'm trying to find a way to either:
Return a list of all elements of a specific type in Chrome (For example, all the element IDs of checkboxes, buttons, or text fields).
or
Have a user click on an element or use a hot key while hovering over an element to get the element ID.
The idea is that if I can have the element to manipulate selected by the user while the program is running, I can automate tasks more efficiently without having to change the parameters by manually inspecting the element in Chrome and changing the code.
Basically I'm trying to create a crude element selector in Python, or to just display a list of elements so the user can choose which element to interact with.
Currently I'm attempting to find the syntax to return all elements in the active window using pywinauto, and am exploring the use of Beautiful Soup to parse the HTML. However, I assume there must be a simple one- or two-line way to do this, so if possible I would like to learn how to do it correctly rather than hack together a crude function that accomplishes it ineffectively.
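If Selenium is an option, one way to sketch this is a lookup of generic CSS selectors per element kind (the selectors and the helper name here are my own illustration, not tied to any particular page or library API):

```python
def list_elements(driver, kind="checkbox"):
    """Return (id, tag_name) pairs for every element of one kind on the page,
    so a user can pick which one to interact with."""
    # Generic selectors per element kind; extend as needed for your pages.
    selectors = {
        "checkbox": "input[type='checkbox']",
        "button": "button, input[type='button'], input[type='submit']",
        "text": "input[type='text'], textarea",
    }
    # "css selector" is the literal locator string By.CSS_SELECTOR resolves to.
    found = driver.find_elements("css selector", selectors[kind])
    return [(el.get_attribute("id"), el.tag_name) for el in found]
```

The returned list could then be shown to the user as a crude element picker.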
I am trying to go through a webpage with Selenium and create a set of all elements with certain class names, so I have been using:
elements = set(driver.find_elements_by_class_name('class name'))
However, in some cases there are thousands of elements on the page (if I scroll down), and I've noticed that this code only finds the first 18-20 elements on the page (only about 14-16 are visible to me at once). Do I need to scroll, or am I doing something else wrong? Is there any way to instantaneously get all of the elements I want in the HTML into a list without having to visually see them on the screen?
It depends on your webpage. Just look at the HTML source code (or the network log) before you scroll down. If there are just the 18-20 elements, then the page lazy-loads the next items (e.g. Twitter or Instagram). This means the server only renders the next items once you reach a certain point on the page. Otherwise, all thousand items would be loaded at once, which would increase the page size, loading time, and server load.
In this case, you have to scroll down until the end and then get the source code to parse all items.
You could probably use more advanced methods, like treating each chunk as a page in a pagination scheme (i.e. instead of "go to next page", "scroll down"). But I guess you're a beginner, so I would start with simply scrolling down to the end (scroll, wait, scroll, ... until there are no new elements), then fetching the HTML and parsing it.
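That scroll-wait-scroll loop can be sketched by comparing the page height before and after each scroll; the pause length and round limit below are arbitrary guesses you'd tune for the target site:

```python
import time

def scroll_page_to_end(driver, pause=1.5, max_rounds=50):
    """Scroll the window to the bottom until the page height stops growing,
    then return the full page source for parsing."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the lazy loader time to fetch the next chunk
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new loaded: we've reached the end
        last_height = new_height
    return driver.page_source
```

The returned source can then be handed to a parser such as Beautiful Soup.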
I'm using python and Webdriver to scrape data from a page that dynamically loads content as the user scrolls down the page (lazy load). I have a total of 30 data elements, while only 15 are displayed without first scrolling down.
I am locating my elements, and getting their values in the following way, after scrolling to the bottom of the page multiple times until each element has loaded:
# Get All Data Items
all_data = self.driver.find_elements_by_css_selector('div[some-attribute="some-attribute-value"]')

# Iterate Through Each Item, Get Value
data_value_list = []
for d in all_data:
    # Get Value for Each Data Item
    data_value = d.find_element_by_css_selector('div[class="target-class"]').get_attribute('target-attribute')
    # Save Data Value to List
    data_value_list.append(data_value)
When I execute the above code using ChromeDriver, while leaving the browser window up on my screen, I get all 30 data values to populate my data_value_list. When I execute the above code using ChromeDriver, with the window minimized, my list data_value_list is only populated with the initial 15 data values.
The same issue occurs while using PhantomJS, limiting my data_value_list to only the initially-visible data values on the page.
Is there a way to load these types of elements while having the browser minimized and, ideally, while utilizing PhantomJS?
NOTE: I'm using an action chain to scroll down using the following approach .send_keys(Keys.PAGE_DOWN).perform() for a calculated number of times.
I had the exact same issue. The solution I found was to execute JavaScript code in the virtual browser to force the element to scroll to the bottom.
Before putting the JavaScript command into Selenium, I recommend opening your page in Firefox and inspecting the elements to find the scrollable container. The element should encompass all of the dynamic rows, but it should not include the scrollbar. Then, after selecting the element with JavaScript, you can scroll it to the bottom by setting its scrollTop attribute to its scrollHeight attribute.
Next, test scrolling the content in the browser. The easiest way to select the element is by ID, if it has one, but other locators work too. To select an element with the id "scrollableContent" and scroll it to the bottom, execute the following in your browser's JavaScript console:
e = document.getElementById('scrollableContent'); e.scrollTop = e.scrollHeight;
Of course, this only scrolls the content to its current bottom; you will need to repeat it after new content loads if you need to scroll multiple times. Also, I haven't found a reliable way to pick the exact element; for me it was trial and error.
This is some code I tried out. However, I feel it can and should be improved for applications that are intended to test code or scrape unpredictable pages. I couldn't figure out how to explicitly wait until more elements were loaded (perhaps: get the number of elements, scroll to the bottom, then wait for the count to grow by one, and exit the loop if it doesn't), so I hardcoded 5 scroll events and used time.sleep. time.sleep is ugly and can lead to issues, partly because it depends on the speed of your machine.
import time

def scrollElementToBottom(driver, element_id):
    time.sleep(.2)
    for i in range(5):
        driver.execute_script("e = document.getElementById('" + element_id + "'); e.scrollTop = e.scrollHeight;")
        time.sleep(.2)
One caveat: this solution was tested with the Firefox driver, but I see no reason why it shouldn't work with your setup.
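The explicit wait described above can be sketched as a small polling helper: remember the element count, scroll, then poll until the count grows or a timeout passes. The helper name and timings here are my own illustration; Selenium's WebDriverWait does equivalent polling for you.

```python
import time

def wait_for_more_items(driver, css, old_count, timeout=10.0, poll=0.25):
    """Poll until more than old_count elements match css, or the timeout
    passes. Returns the new count (or old_count if nothing new appeared)."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        # "css selector" is the literal locator string By.CSS_SELECTOR maps to.
        count = len(driver.find_elements("css selector", css))
        if count > old_count:
            return count  # new content arrived: safe to scroll again
        time.sleep(poll)
    return old_count  # timed out with no new items: treat as end of content
```

Calling this between scroll events replaces the hardcoded time.sleep with a machine-speed-independent wait.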