Selenium Python script only scrapes part of the visible information

Sorry for the vague title. To better describe the problem: when you visit the following website:
There is a link on the right that says "See all". Once you click it, a list of links to various forks pops up. I am trying to scrape the hyperlinks for those forks.
One problem is that the scraper picks up not only the links for the forks but also the links for the profiles. The site doesn't use a specific class or ID for those links, so I've edited my script to work out which results are the right ones and which are not. That part works. However, the script only scrapes a few of the links and skips the others. At first I thought this was caused by elements not being visible to Selenium, since the list has a scrollbar, but that doesn't seem to be the issue, because links that are not scraped are plainly visible. The script only scrapes the first 5 links and completely skips the rest.
I am now unsure what to do, since there is no error or warning pointing to any issue with the code itself.
This is a short part of the code that scrapes the links.
driver.get(url)
wait.until(ec.presence_of_element_located((By.CSS_SELECTOR, "button.see-all-forks"))).click()
fork_count = wait.until(ec.presence_of_element_located((By.CSS_SELECTOR, "span.jsx-3602798114"))).text
forks = wait.until(ec.presence_of_all_elements_located((By.CSS_SELECTOR, "a.jsx-2470659356")))
j = 1
for i, fork in enumerate(forks):
    if j == 1:
        forks[i] = fork.get_attribute("href")
        print(forks[i])
    if j == 3:
        j = 1
    else:
        j += 1
In this case "url" variable is the link I provided above. The loop then skips 3 results after each one because every 4th one is the right one. I tried using XPath to filter out the results using the "contains" fuction however the names vary as the users name them on their own so this to my understanding is the only way to filter out the results.
This is the output that I get.
After that, no further results are printed and the program terminates without errors. What is happening here and what have I missed? I am confused about why Selenium only scrapes five results and then terminates.
Edit note - my code explained:
I've set up the if statements to pick out every 4th result, since that's the right one (and the first one is also right). If j != 3, then 1 is added to j; once j == 3, it is reset to 1, and when j == 1 the right result is printed. So the right result always falls on j == 1.

The problem here is that all of the expected conditions you are using pass as soon as at least one element is present.
So
forks = wait.until(ec.presence_of_all_elements_located((By.CSS_SELECTOR, "a.jsx-2470659356")))
does not catch all the elements, as its name suggests it literally should; you can never know how many it will return, only that there will be at least one.
That's why your forks list is so short.
The simplest way to overcome this is to add a hardcoded sleep after wait.until(ec.presence_of_all_elements_located((By.CSS_SELECTOR, "a.jsx-2470659356"))) and only after that get the list of the elements.
See this post for more details.
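For instance, a minimal sketch of that idea (the wait and ec names come from the question's code; the 3-second pause is a guess you would need to tune):

import time

# wait until at least one link is present, then give the rest time to render
wait.until(ec.presence_of_all_elements_located((By.CSS_SELECTOR, "a.jsx-2470659356")))
time.sleep(3)  # hardcoded pause; adjust the length as needed
forks = driver.find_elements(By.CSS_SELECTOR, "a.jsx-2470659356")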
In Java there is an expected condition, numberOfElementsToBeMoreThan, which could be used here with the threshold set to more than 95 etc., but in Python the list of built-in expected conditions is much shorter and there is no such option.
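That said, wait.until accepts any callable that takes the driver, so a rough equivalent can be hand-rolled. A sketch (the helper name is mine; 95 is just the example threshold from above):

def number_of_elements_more_than(locator, count):
    # resolves to the element list once more than `count` elements match
    def _condition(driver):
        elements = driver.find_elements(*locator)
        return elements if len(elements) > count else False
    return _condition

forks = wait.until(
    number_of_elements_more_than((By.CSS_SELECTOR, "a.jsx-2470659356"), 95)
)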

Related

How to read data from a dynamic website faster in Selenium

I've got a few dynamic websites (football live bets). There's no API, so I'm reading all of them with Selenium: I've got an infinite loop, finding the elements every time.
while True:
    elements = self.driver.find_elements_by_xpath(games_path)
    for e in elements:
        match = Match()
        match.betting_opened = len(e.find_elements_by_class_name('no_betting_odds')) == 0
The problem is it's one hundred times slower than I need it to be.
What's the alternative to this? Any other library or how to speed it up with Selenium?
One of the websites I'm scraping: https://www.betcris.pl/zaklady-live#/Soccer
The piece of code you posted has a while True loop without a break; that is an implementation of an infinite loop. From a short snippet I cannot tell whether this is the root cause of your "infinite loop" issue, but it may be, so check whether you have any break statements inside your while loop.
As for the other part of your question: I am not sure how you measure the performance of an infinite loop, but there is a way to speed up parsing pages with Selenium: not using Selenium. Grab a snapshot of the page and use that for evaluating states, values and the rest.
import lxml.html

# parse the current page source once, outside of Selenium
page_snapshot = lxml.html.document_fromstring(self.driver.page_source)
games = page_snapshot.xpath(games_path)
This approach is about two orders of magnitude faster than querying via the Selenium API. Grab the page once, parse the hell out of it real quick, and grab the page again later if you want to. If you just want to read stuff, you don't need webelements at all, only the tree of data. To interact with elements you will of course need the webelement, and hence Selenium, but for getting values and states a snapshot may be sufficient.
Or, with Selenium only, you could add the 'no_betting_odds' filter to the games_path XPath. It seems to me that you want to grab those elements which do not have a 'no_betting_odds' class. Then just add the predicate '[not(.//*[contains(@class, "no_betting_odds")])]' to games_path (which you did not share, so I can't update it).
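As a sketch (the container part of the XPath is hypothetical, since the original games_path was not shared; only the [not(...)] predicate is the point here):

# the div selector is a placeholder for the real games_path
games_path = ('//div[contains(@class, "game")]'
              '[not(.//*[contains(@class, "no_betting_odds")])]')
elements = driver.find_elements_by_xpath(games_path)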

looping in beautiful soup / no errors

I have been writing a program that would, hypothetically, find items on a website as soon as they were loaded onto it. As of now the script takes as input two values (keywords) used to describe an item and a color used to pick the color of the item. The parsing is spot on with items that are already on the website. But say I run my program before the website loads the items: instead of having to re-run the entire script, I'd like it to just refresh the page and re-parse the data until it finds them. I also included "no errors" in my question because in my example run of the script I entered keywords and a color not pertaining to any item on the website, and instead of getting an error I just got "Process finished with exit code 0". Thank you in advance to any who take the time to help!
Here is my code:
As another user suggested, you're probably better off using Selenium for the entire process rather than using it for only parts of your code and swapping between BSoup and Selenium.
As for reloading the page if certain items are not present: if you already know which items are supposed to be on the page, then you can simply search for each item by ID with Selenium, and if you can't find one or more of them, refresh the page with the following line of code:
driver.refresh()
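A minimal retry sketch, assuming the expected element IDs are known up front (the IDs below are placeholders, since the question's code was not posted):

import time
from selenium.common.exceptions import NoSuchElementException

expected_ids = ["item-123", "item-456"]  # placeholder element IDs

while True:
    try:
        for item_id in expected_ids:
            driver.find_element_by_id(item_id)
        break  # every expected item is present
    except NoSuchElementException:
        time.sleep(5)  # be polite: wait before hitting the site again
        driver.refresh()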

Selenium web scraping for some reason brings back only partial data instead of everything. Not sure if any dynamic data is loaded in the background

Python and Selenium beginner here. I'm trying to scrape the titles of the sections of a Udemy class. I've tried using find_elements_by_class_name and others, but for some reason it only brings back partial data.
page I'm scraping: https://www.udemy.com/selenium-webdriver-with-python3/
1) I want to get the title of the sections. They are the bold titles.
2) I want to get the title of the subsections.
from selenium import webdriver
driver = webdriver.Chrome()
url = 'https://www.udemy.com/selenium-webdriver-with-python3/'
driver.get(url)
main_titles = driver.find_elements_by_class_name("lecture-title-text")
sub_titles = driver.find_elements_by_class_name("title")
Problem
1) Using main_titles, I got the length to be only 10. It only goes from Introduction to Modules; Working With Files and the sections after it don't come out. However, the class names are exactly the same, so I'm not sure why. Modules / Working With Files is basically the cutoff point. The elements in the inspector also look different from that point on. They all have the same span class tag, but I'm not sure why only part of them is returned:
<span class="lecture-title-text">
(Screenshot: element inspection between the "Modules" title and the "Working With Files" title.)
At this point the web scrape breaks down. Not sure why.
2) Using sub_titles, I got the length to be 58 items, but when I print them out, I only get the top two:
Introduction
How to reach me anytime and ask questions? *** MUST WATCH ***
After this it's all blank lines. I'm not sure why it's only pulling the top two and not the rest, when all the tags have
<div class='title'>
Maybe I could try using BeautifulSoup, but currently I'm trying to get better with Selenium. Is there dynamic content throwing off the Selenium scrape, or am I not scraping it the proper way?
Thank you guys for the input. Sorry for the long post; I wanted to make sure I described the problem correctly.
The reason you're only getting the first 10 sections is that only the first ten are shown. You might be logged in in your browser, so when you go to check it out, it shows every section; but for me and your scraper it's only showing the first 10. You'll need to click that .section-container--more-sections button before looking for the titles.
As for the weird case of the titles not being scraped properly: when an element is hidden, its text attribute will always be empty, which is why it only works for the first section. I'd try using WebElement.get_attribute('textContent') to scrape the text.
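For example, a short sketch (element stands for any one of the matched title elements):

# .text is empty for hidden elements, but the DOM's textContent
# property is still populated
title = element.get_attribute('textContent').strip()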
OK, I've gone through the suggestions in the comments and have solved it. I'm writing it up here in case anyone in the future wants to see the solution.
1) Using the suggestions, I added a command to click on the '24 more sections' button to expand the tab and then scraped it, which worked perfectly!
driver.find_element_by_class_name("js-load-more").click()
titles = driver.find_elements_by_class_name("lecture-title-text")
for each in titles:
print (each.text)
This pulled all 34 section titles.
2) Using Matt's suggestion, I found the WebElement and used get_attribute('textContent') to pull out the text data. There were a bunch of surrounding spaces, so I used strip() to clean the strings up.
sub_titles = driver.find_elements_by_class_name("title")
for each in sub_titles:
print (each.get_attribute('textContent').strip())
This pulled all 210 subsection titles!

Python Selenium keeping elements after changed browser

I'm currently using Selenium with Python and have a question about it.
elements = driver.find_elements_by_css_selector("div.classname a")
for element in elements:
    element.click()
    driver.back()
After coming back to the previous page using back() in this code, Selenium can't find the elements anymore, even though I still need them.
If someone has any clue, please help me out.
Many thanks in advance.
Selenium creates a whole new set of objects when you change pages, whether you click a link or go back a page. If clicking on the element in line 3 causes Selenium to load a new page, you're getting a StaleElementReferenceException on the second iteration. So every time you execute driver.back(), you need to search for the element objects on the page again, as you do in the first line, and probably maintain at least a counter for how far down the list of elements you've already clicked (assuming they navigate away from the page). Make sense?
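An alternative sketch (assuming each click loads a new page): collect the href values first and visit them directly, so no element reference ever goes stale.

# hrefs are plain strings, so they can't go stale the way webelements do
links = [a.get_attribute("href")
         for a in driver.find_elements_by_css_selector("div.classname a")]
for href in links:
    driver.get(href)
    # ... do the work on the target page here ...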
You can take the number of elements first, then re-locate the list on each pass of the loop. For example:
elementList = driver.find_elements_by_css_selector("div.classname a")
for i in range(len(elementList)):
    element = driver.find_elements_by_css_selector("div.classname a")[i]
    element.click()
    driver.back()

Possible bottle-neck issue in web-scraping with Python

First of all I apologize for the vague title, but the problem is that I'm not sure what is causing the error.
I'm using Python to extrapolate some data from a website.
The code I created works perfectly when passing one link at a time, but somehow breaks when trying to collect the data from the 8000 pages I have (it actually breaks way before that). The process I need to follow is this:
Collect all the links from one single page (8000 links)
From each link extrapolate another link contained in an iframe
Scrape the data from the link in 2.
Point 1 is easy and works fine.
Points 2 and 3 work for a while, but then I get some errors, every time at a different point and never the same one. After some tests, I decided to try a different approach and run my code only up to point 2 on all the links from point 1, trying to collect all the links first. At this point I found out that it is probably during this stage that I get the error.
The code works like this: in a for loop I pass each item of a list of URLs to the function below. It's supposed to search for a link to the Disqus website. There should be only one link, and there is always one link. Because with a library like lxml it's not possible to look inside an iframe, I use Selenium with ChromeDriver.
def get_url(webpage_url):
    chrome_driver_path = '/Applications/chromedriver'
    driver = webdriver.Chrome(chrome_driver_path)
    driver.get(webpage_url)
    iframes = driver.find_elements_by_tag_name("iframe")
    list_urls = []
    urls = []
    # collects all the urls of all the iframe tags
    for iframe in iframes:
        driver.switch_to_frame(iframe)
        time.sleep(3)
        list_urls.append(driver.current_url)
        driver.switch_to_default_content()
    driver.quit()
    for item in list_urls:
        if item.startswith('http://disqus'):
            urls.append(item)
    if len(urls) > 1:
        print "too many urls collected in iframes"
    else:
        url = urls[0]
    return url
At the beginning there was no time.sleep and it worked for roughly 30 links. Then I put in a time.sleep(2) and it got to about 60. Now with time.sleep(3) it works for around 130 links. Of course, this cannot be a solution. The error I get now is always the same (index out of range at url=urls[0]), but each time on a different link. If I run my code with just the single link where it broke, the code works, so it can actually find urls there. And of course, sometimes it passes a link where it stopped before with no issue.
I suspect I get this because of a timeout of some sort, but of course I'm not sure.
So, how can I work out what the issue is here?
If the problem is that I'm making too many requests (even with the sleep), how can I deal with this?
Thank you.
From your description of the problem, it might be that the host throttles your client when you issue too many requests in a given time. This is a common protection against DoS attacks and ill-behaved robots - like yours.
The clean solution here is to check whether the site has a robots.txt file and, if so, parse it and respect the rules; otherwise, set a large enough wait time between two requests so you don't get kicked.
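A minimal sketch of that check with the standard library (the domain is a placeholder):

import time
import robotparser  # urllib.robotparser in Python 3

rp = robotparser.RobotFileParser()
rp.set_url("http://example.com/robots.txt")  # placeholder domain
rp.read()
if rp.can_fetch("*", "http://example.com/some-page"):
    time.sleep(5)  # a generous delay between two requests
    # ... fetch and parse the page ...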
Also, you can get quite a few other issues - 404s, lost network connections etc. - and even page load timing issues with selenium.webdriver, as documented here:
Dependent on several factors, including the OS/Browser combination, WebDriver may or may not wait for the page to load. In some circumstances, WebDriver may return control before the page has finished, or even started, loading. To ensure robustness, you need to wait for the element(s) to exist in the page using Explicit and Implicit Waits.
Regarding your IndexError: you blindly assume that you'll get at least one url (which means at least one iframe), which might not be the case for any of the reasons above (and a few others too). First you want to make sure you properly handle all the corner cases, then fix your code so it doesn't assume you have at least one url:
url = None
if len(urls) > 1:
    print "too many urls collected in iframes"
elif len(urls) == 1:
    url = urls[0]
else:
    print "no url found"
Also, if all you want is the first http://disqus url you can find, there's no need to collect them all, then filter them, then return the first:
def get_url(webpage_url):
    chrome_driver_path = '/Applications/chromedriver'
    driver = webdriver.Chrome(chrome_driver_path)
    driver.get(webpage_url)
    iframes = driver.find_elements_by_tag_name("iframe")
    # check each iframe's url, returning the first disqus one found
    for iframe in iframes:
        driver.switch_to_frame(iframe)
        time.sleep(3)
        if driver.current_url.startswith('http://disqus'):
            url = driver.current_url
            driver.quit()
            return url
        driver.switch_to_default_content()
    driver.quit()
    return None  # nothing found
