Possible bottleneck issue in web scraping with Python

First of all, I apologize for the vague title, but the problem is that I'm not sure what is causing the error.
I'm using Python to extract some data from a website.
The code I created works perfectly when passing one link at a time, but somehow breaks when I try to collect the data from the 8000 pages I have (it actually breaks way before that). The process I need to follow is this:
1. Collect all the links from one single page (8000 links).
2. From each link, extract another link contained in an iframe.
3. Scrape the data from the link found in step 2.
Point 1 is easy and works fine.
Points 2 and 3 work for a while and then I get some errors, every time at a different point and never on the same link. After some tests, I decided to try a different approach and run my code only up to point 2 on all the links from point 1, collecting all the links first. And at this stage I found out that this is probably where I get the error.
The code works like this: in a for loop I pass each item of a list of urls to the function below. It's supposed to find a link to the Disqus website; there should be only one such link, and there always is one. Because a library like lxml cannot look inside the iframe, I use Selenium and the ChromeDriver.
def get_url(webpage_url):
    chrome_driver_path = '/Applications/chromedriver'
    driver = webdriver.Chrome(chrome_driver_path)
    driver.get(webpage_url)
    iframes = driver.find_elements_by_tag_name("iframe")
    list_urls = []
    urls = []
    # collects all the urls of all the iframe tags
    for iframe in iframes:
        driver.switch_to_frame(iframe)
        time.sleep(3)
        list_urls.append(driver.current_url)
        driver.switch_to_default_content()
    driver.quit()
    for item in list_urls:
        if item.startswith('http://disqus'):
            urls.append(item)
    if len(urls) > 1:
        print "too many urls collected in iframes"
    else:
        url = urls[0]
    return url
At the beginning there was no time.sleep and it worked for roughly 30 links. Then I put in a time.sleep(2) and it got to about 60. Now with time.sleep(3) it works for around 130 links. Of course, this cannot be the solution. The error I get now is always the same (index out of range on url = urls[0]), but each time on a different link. If I rerun my code with just the single link where it broke, it works, so it can actually find the urls there. And of course, sometimes it passes a link where it stopped before with no issue at all.
I suspect this happens because of some kind of timeout, but of course I'm not sure.
So, how can I figure out what the issue is here?
If the problem is that I'm making too many requests (despite the sleep), how can I deal with that?
Thank you.

From your description of the problem, it might be that the host throttles your client when you issue too many requests in a given time. This is a common protection against DoS attacks and ill-behaved robots - like yours.
The clean solution here is to check whether the site has a robots.txt file and, if so, parse it and respect the rules; otherwise, set a large enough wait time between two requests so you don't get kicked.
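For illustration (not part of the original answer), a minimal sketch of how the robots.txt rules could be checked with the standard library; the example.com URLs and the 5-second fallback delay are placeholders:

import time
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://example.com/robots.txt")
rp.read()

page = "http://example.com/some/article.html"
if rp.can_fetch("*", page):
    # crawl_delay() returns None if the site does not specify one
    time.sleep(rp.crawl_delay("*") or 5)
    # ... fetch and scrape the page here ...
else:
    pass  # robots.txt disallows fetching this page; skip it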
You can also hit quite a few other issues - 404s, lost network connections, etc. - and even page-load timing issues with selenium.webdriver, as documented here:
Dependent on several factors, including the OS/Browser combination, WebDriver may or may not wait for the page to load. In some circumstances, WebDriver may return control before the page has finished, or even started, loading. To ensure robustness, you need to wait for the element(s) to exist in the page using Explicit and Implicit Waits.
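For illustration (this is an assumption on my part, not code from the question), an explicit wait for the iframes could replace the fixed time.sleep(3), assuming driver is the instance created in get_url:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for at least one iframe instead of sleeping blindly.
wait = WebDriverWait(driver, 10)
iframes = wait.until(
    EC.presence_of_all_elements_located((By.TAG_NAME, "iframe"))
)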
Regarding your IndexError: you blindly assume that you'll get at least one url (which means at least one iframe), which might not be the case for any of the reasons above (and a few others too). First you want to make sure you properly handle all those corner cases, then fix your code so it doesn't assume it has at least one url:
url = None
if len(urls) > 1:
    print "too many urls collected in iframes"
elif len(urls) == 1:
    url = urls[0]
else:
    print "no url found"
Also, if all you want is the first http://disqus url you can find, there's no need to collect them all, filter them and then return the first:
def get_url(webpage_url):
    chrome_driver_path = '/Applications/chromedriver'
    driver = webdriver.Chrome(chrome_driver_path)
    driver.get(webpage_url)
    iframes = driver.find_elements_by_tag_name("iframe")
    url = None
    # check each iframe's url and stop at the first Disqus one
    for iframe in iframes:
        driver.switch_to_frame(iframe)
        time.sleep(3)
        if driver.current_url.startswith('http://disqus'):
            url = driver.current_url
            break
        driver.switch_to_default_content()
    driver.quit()
    return url  # None if nothing was found

Related

Restriction on the site nb-bet when parsing

At https://nb-bet.com/Results, I try to access every match page, but the site seems to block me for a while after the first accesses. It turns out I can access only 60-80 matches, and then I get a 500 or 404 error (even though the page is available and no protection is shown if you open it through a browser; this happens only with match pages - https://nb-bet.com/Results itself will still open normally). The block disappears after about 30-40 minutes and I can access new matches again.
If I use time.sleep(random.uniform(5, 10)), I only get access to 5-7 matches. I've tried using fake_headers, fake_useragent and accessing the pages in random order, but to no avail. I need to find a solution without using proxies etc. Any help would be greatly appreciated.
For example, I provide links to 158 matches and show how I go through them; the goal is simply to get a 200 status code for each page in one pass (i.e. without a 30-40 minute break). The list of links to the matches had to be published on a separate site, because this site does not allow posting the question with such a large amount of text, I hope you understand.
The list of links is here - https://pastebin.com/LPeCP5bQ
import requests

# "links" is the list of match URLs from the pastebin above
s = requests.session()
for link in links:
    r = s.get(link)
    print(r.status_code)

How to read data from a dynamic website faster in Selenium

I've got a few dynamic websites (football live bets). There's no API, so I'm reading all of them in Selenium. I've got an infinite loop, finding the elements every time:
while True:
    elements = self.driver.find_elements_by_xpath(games_path)
    for e in elements:
        match = Match()
        match.betting_opened = len(e.find_elements_by_class_name('no_betting_odds')) == 0
The problem is it's one hundred times slower than I need it to be.
What's the alternative to this? Any other library or how to speed it up with Selenium?
One of the websites I'm scraping is https://www.betcris.pl/zaklady-live#/Soccer
Your piece of code has a while True loop without a break, which is an implementation of an infinite loop. From such a short snippet I cannot tell whether this is the root cause of your "infinite loop" issue, but it may be; check whether you have any break statements inside your while loop.
As for the other part of your question: I am not sure how you measure the performance of an infinite loop, but there is a way to speed up parsing pages with Selenium: not using Selenium. Grab a snapshot of the page and use that for evaluating states, values and so on.
import lxml.html
page_snapshot = lxml.html.document_fromstring(self.driver.page_source)
games = page_snapshot.xpath(games_path)
This approach is about two orders of magnitude faster than querying via the Selenium API. Grab the page once, parse the hell out of it real quick, and grab the page again later if you want to. If you only want to read stuff, you don't need WebElements at all, just the tree of data. To interact with elements you will of course need the WebElement via Selenium, but to get values and states a snapshot may be sufficient.
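As a rough illustration of that workflow (not from the original answer; self.driver, games_path and the Match class are the asker's own names, assumed here), the polling loop could look something like this:

import time
import lxml.html

while True:
    # one round-trip to the browser per iteration instead of one per element
    page_snapshot = lxml.html.document_fromstring(self.driver.page_source)
    for game in page_snapshot.xpath(games_path):
        match = Match()
        # pure in-memory XPath query on the snapshot, no Selenium call
        match.betting_opened = len(
            game.xpath('.//*[contains(@class, "no_betting_odds")]')) == 0
    time.sleep(1)  # poll interval between snapshots, adjust as needed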
Or, what you could do with Selenium only: add the 'no_betting_odds' check to the games_path XPath. It seems to me that you want to grab those elements which do not have a 'no_betting_odds' class. Then just add .//*[not(contains(@class, "no_betting_odds"))] to the games_path (which you did not share, so I can't update it).

Selenium Python script only scrapes part of the visible information

I am sorry for the title; to better describe the problem: when you visit the following website, there is a text on the right that says "See all". Once you click on that, a list of links to various forks pops up. I am trying to scrape the hyperlinks for those forks.
One problem is that the scraper scrapes not only the links for the forks but also the links for the profiles. They don't use a specific class nor ID for those links, so I've edited my script to calculate which result is the right one and which is not. That part works. However, the script only scrapes a few of the links and skips the others. This confused me, because at first I thought it was caused by the element not being visible to Selenium since there is a scrollbar present. That doesn't seem to be the issue, however, since other links that are not scraped are perfectly visible. The script only scrapes the first 5 links and completely skips the rest.
I am now unsure what to do, since there is no error or warning about any possible issue with the code itself.
This is a short part of the code that scrapes the links.
driver.get(url)
wait.until(ec.presence_of_element_located((By.CSS_SELECTOR, "button.see-all-forks"))).click()
fork_count = wait.until(ec.presence_of_element_located((By.CSS_SELECTOR, "span.jsx-3602798114"))).text
forks = wait.until(ec.presence_of_all_elements_located((By.CSS_SELECTOR, "a.jsx-2470659356")))
j = 1
for i, fork in enumerate(forks):
    if j == 1:
        forks[i] = fork.get_attribute("href")
        print(forks[i])
    if j == 3:
        j = 1
    else:
        j += 1
In this case the "url" variable is the link I provided above. The loop then skips 3 results after each one, because every 4th one is the right one. I tried using XPath to filter out the results with the "contains" function, however the names vary because the users name the forks themselves, so to my understanding this is the only way to filter the results.
This is the output that I get.
After which no results are ever printed out and the program terminates without errors. What is happening here and what have I missed? I am confused about why Selenium only scrapes five results and then terminates.
Edit note - my code explained:
I've set up the if statements to check for every 4th result, since that one is the right one (the first one is also right). If j != 3 then 1 is added to j; once j == 3 the next item is the right result, j is reset to 1, the if j == 1 branch runs and the right result is printed. So the right result always falls on j == 1.
The problem here is that all the expected conditions you are using pass as soon as at least one matching element is present.
So
forks = wait.until(ec.presence_of_all_elements_located((By.CSS_SELECTOR, "a.jsx-2470659356")))
does not necessarily catch all the elements, as its name literally suggests it should; you can never know how many it will catch, only that there will be at least one.
That's why your forks list is so short.
The simplest way to overcome this is to add a hardcoded sleep after the wait.until(ec.presence_of_all_elements_located((By.CSS_SELECTOR, "a.jsx-2470659356"))) call and only then get the list of elements.
See this post for more details.
In Java there is an expected condition numberOfElementsToBeMoreThan, so it could be used here with a condition of more than 95 elements, etc. In Python, however, the list of built-in expected conditions is much shorter and there is no such option.
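That said, WebDriverWait accepts any callable that takes the driver and returns a truthy value, so an equivalent condition can be written by hand. A minimal sketch, reusing the selector and the 95-element threshold from above (untested against the actual page):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

def more_than_n_elements(locator, n):
    """Custom wait condition: returns the elements once more than n are present."""
    def _condition(driver):
        elements = driver.find_elements(*locator)
        return elements if len(elements) > n else False
    return _condition

wait = WebDriverWait(driver, 15)
forks = wait.until(more_than_n_elements((By.CSS_SELECTOR, "a.jsx-2470659356"), 95))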

How to work with links that use javascript:window.location using Selenium in Python

I have dabbled with bits of simple code over the years. I am now interested in automating some repetitive steps in a web-based CRM used at work. I tried a few automation tools: I was not able to get AutoIT to work with the Chrome webdriver, and I then tried WinTask but did not make meaningful progress. I started exploring Python and Selenium last week.
I have now automated the first few steps of my project by Googling each step I wanted to achieve and learning from pages on Stack Overflow and other sites. Where I need help is that most of the links in the CRM are some sort of javascript links. Most of the text links or images have links formatted like this...
javascript:window.location = 'Reports/ResponseTimes.aspx?from=1%2f14%2f2021&to=1%2f14%2f2021&target=gn';
It looks like the many find_element_by functions in Selenium do not interact with these javascript links. Tonight I found a page that directed me to use driver.execute_script(javaScript), and eventually I found an example that made it clear I should pass the javascript link to that function. This works...
driver.execute_script("window.location = 'Reports/ResponseTimes.aspx?from=1%2f14%2f2021&to=1%2f14%2f2021&target=gn';")
My issue is that I now see the javascript links are actually generated dynamically. In the code above, the link gets updated with dates based on the current date. I can't reuse the driver.execute_script() code above since the dates have to be updated.
My hope is to find a way to locate the javascript links I need based on some part of the link that does not change. The link above always has "target=gn" at the end, and that is unique enough that if I could find the current version of the link, pull it into a variable and then run it through driver.execute_script(), I believe that would solve my current issue.
I expect such a solution could then be used for the next step I need to perform, where there is a list of new leads that all need to be updated in a manner that tells the system a human has reviewed the lead and "stopped the clock". To view each lead there are more javascript links. Each link is unique, since it includes a value that is the record number for the lead. Here are the first two...
javascript:top.viewItem(971244899);
javascript:top.viewItem(971312602);
I imagine the approach needed is to search the page for some or all of javascript:top.viewItem( in order to pull the current link, such as javascript:top.viewItem(971244899);, into a variable so that it can be passed to driver.execute_script().
Thanks for any suggestions. I have run many searches on this site and on Google for phrases that might teach me more about working with javascript links, but I am asking for guidance since I have not been able to move forward on my own. Here's my current code...
import selenium
PATH = r"C:\Program Files (x86)\chromedriver.exe"  # raw string so the backslashes aren't treated as escapes
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
driver = webdriver.Chrome(PATH)
driver.get("https://apps.vinmanager.com/cardashboard/login.aspx")
# log in
time.sleep(1)
search = driver.find_element_by_name("username")
search.send_keys("xxx")
search.send_keys(Keys.RETURN)
time.sleep(2)
search = driver.find_element_by_name("password")
search.send_keys("xxx")
search.send_keys(Keys.RETURN)
time.sleep(1)
# close news pop-up
driver.find_element_by_link_text("Close").click()
time.sleep(2)
# Nav to left pane
driver.switch_to.frame('leftpaneframe')
# Leads at No Contact link
driver.execute_script("window.location = 'Reports/ResponseTimes.aspx?from=1%2f14%2f2021&to=1%2f14%2f2021&target=gn';")
Eventually I found enough info online to recognize that I needed to replace the "//a" tag in the XPath find method with the proper tag, which was "//area" in my case, and then extract the href so that I could execute it...
## click no contact link ##
print('click no contact link...')
cncl = driver.find_element_by_xpath("//area[contains(@href, 'target=gn')]").get_attribute('href')
time.sleep(2)
driver.execute_script(cncl)
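The same pattern could presumably be extended to the top.viewItem(...) leads mentioned in the question; a rough, untested sketch (the //a tag, the 2-second sleep and the use of driver.back() are assumptions, not verified against the CRM):

# Collect every href containing "viewItem(" up front, so the references
# cannot go stale while navigating back and forth.
lead_links = driver.find_elements_by_xpath("//a[contains(@href, 'viewItem(')]")
hrefs = [link.get_attribute('href') for link in lead_links]
for href in hrefs:
    driver.execute_script(href)  # e.g. "javascript:top.viewItem(971244899);"
    time.sleep(2)                # give the lead page time to load
    # ... review/update the lead here ...
    driver.back()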

Click all links on the website with an action. The element reference is stale error

I am making a new script. I want to click() all the listed links on the website, find something, go back to the listed links, click() the next link, find something, and go back to the listed links again.
I start with a website which lists some links for me:
link 1
link 2
link 3
etc
links = driver.find_elements_by_xpath("myxpath")
for link in links:
    link.click()
    try:
        time.sleep(2)
        wantedelement = driver.find_element_by_xpath("xpath")
        wantedelement.click()
        # Save to file
        tofile = driver.find_element_by_xpath("xpath")
        print (tofile.text)
        myfile = open("file.txt", "a")
        myfile.write(tofile.text + "\n")
        driver.back()
    except (ElementNotVisibleException, NoSuchElementException):
        driver.back()
But my script checks only one link, and when it goes back to the listed links it prints this error:
selenium.common.exceptions.StaleElementReferenceException: Message: The element reference is stale. Either the element is no longer attached to the DOM or the page has been refreshed.
on line:
for link in links:
link.click() <----
How can I fix it? (Python 2.7)
A couple of things:
Don't use delays like "sleep" to synchronize. Use "wait" conditions instead.
Since you are using back(), you will only want to do that if you have successfully moved off the previous page, which is one of the reasons it is better to test your actions specifically - for example, how do you know that every click() is supposed to move you "forward"? There might be some other action involved. You are making assumptions about the DOM here, which is OK, but you'll need to check your conditions before performing further actions. For example, if you move back before the next page has loaded, you will lose your DOM.
Also, it is possible that back() is causing your DOM to refresh, which changes the element identifiers. If that's the case, you'll want to verify the page after each back() and iterate through the elements by index, looking each one up individually before clicking it.
I have always found it better practice to be more exacting about things like click operations: click the specific link and check for the specific result, and use wait conditions between actions to make sure you are exactly where you expect to be.
More about wait conditions here.
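A minimal sketch of the re-find-by-index idea (the "myxpath" locators are the asker's placeholders, and the exact wait conditions would depend on the actual pages):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 10)
count = len(driver.find_elements(By.XPATH, "myxpath"))

for i in range(count):
    # Re-locate the link list on every iteration so the reference is never stale.
    links = wait.until(
        EC.presence_of_all_elements_located((By.XPATH, "myxpath")))
    links[i].click()
    # ... scrape the detail page and save the text here ...
    driver.back()
    # Wait until the list page has reloaded before the next iteration.
    wait.until(EC.presence_of_all_elements_located((By.XPATH, "myxpath")))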
