Python Selenium scraping works locally but not on VPS - python

I'm building a Python web scraper using Selenium. I've been testing it on my local machine, where it finishes the "scraping" stage of my program and then processes the data into a cloud database.
This is an image of what the logs look like when I scrape on my local computer.
However, on a VPS (Ubuntu), the scraping reaches 100% but never actually finishes; the program just hangs forever. I have timeouts on my ChromeDriver that log whenever a website times out, so that shouldn't be the issue.
These are my options:
option = webdriver.ChromeOptions()
option.add_argument("--no-sandbox")
# option.add_argument("--disable-extensions")
# option.add_argument("--disable-setuid-sandbox")
# option.add_argument('--disable-application-cache')
# option.add_argument("enable-automation")
# option.add_argument("--disable-browser-side-navigation")
# option.add_argument("start-maximized")
option.add_argument("--headless")
option.add_argument("--disable-dev-sh-usage")
option.add_argument("--disable-gpu")
# option.add_argument("--blink-settings=imagesEnabled=false")
I commented out the options shown above to test whether it would make a difference. I feel like it did speed up the scraping on the VPS, but the scraping still does not finish.
How do I find out the issue, and what could it be?
Thanks.
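For reference, the page-load timeout plus logging arrangement described above might look roughly like the sketch below; the 30-second value, the log message, and the fetch helper are assumptions for illustration, not the asker's actual code.

import logging
from selenium import webdriver
from selenium.common.exceptions import TimeoutException

logging.basicConfig(level=logging.INFO)

option = webdriver.ChromeOptions()
option.add_argument("--no-sandbox")
option.add_argument("--headless")
option.add_argument("--disable-dev-shm-usage")
option.add_argument("--disable-gpu")

driver = webdriver.Chrome(options=option)
driver.set_page_load_timeout(30)  # assumed value; fail fast instead of hanging

def fetch(url):
    try:
        driver.get(url)
    except TimeoutException:
        logging.warning("timed out loading %s, skipping", url)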

Related

How to retrieve YouTube startup delay and rebuffering events from a client-side Chrome browser?

I am looking for a solution to accurately measure YouTube startup delay and rebuffering events from the Chrome web browser. Ideally I want to leverage a Selenium-based Python automation script to repeat the experiment for a large number of YouTube videos and collect the measurements. Below is the piece of code I started with; however, it doesn't return any meaningful figures. I must be missing something. Am I on the right track? Any help would be appreciated.
from time import sleep
from selenium import webdriver

# `s` (a Service object) and `options` are assumed to be defined earlier.
driver = webdriver.Chrome(service=s, options=options)
driver.get("https://www.youtube.com/<example-video-id>")
sleep(5)
player_status = driver.execute_script("return document.getElementById('movie_player').getPlayerState()")
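One way to turn this approach into figures is to poll getPlayerState() and timestamp the transitions. The sketch below assumes the watch page's movie_player element exposes the IFrame-API-style state codes (-1 unstarted, 0 ended, 1 playing, 2 paused, 3 buffering, 5 cued) and that playback actually starts (autoplay or a prior click); the polling interval and duration are arbitrary choices.

import time
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.youtube.com/<example-video-id>")

start = time.time()
startup_delay = None
rebuffer_events = 0
last_state = -1

# Poll the player for up to 60 seconds and record state transitions.
for _ in range(600):
    state = driver.execute_script(
        "return document.getElementById('movie_player').getPlayerState()")
    if state == 1 and startup_delay is None:
        startup_delay = time.time() - start   # first transition to PLAYING
    if state == 3 and last_state == 1:
        rebuffer_events += 1                  # PLAYING -> BUFFERING counts as a rebuffer
    last_state = state
    time.sleep(0.1)

print(f"startup delay: {startup_delay}s, rebuffering events: {rebuffer_events}")
driver.quit()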

Python Selenium script only works in the first execution (ERR_CONNECTION_CLOSED)

I'm trying to scrape a website that contains judicial information from my country (Colombia). I have a Python script that uses Selenium to open the website and then enter a process number:
from selenium import webdriver

pathDriver = 'yourpathdriver'
driver = webdriver.Chrome(executable_path=pathDriver)
url = 'https://consultaprocesos.ramajudicial.gov.co/Procesos/NumeroRadicacion'
driver.get(url)
However, the script only works the first time it is executed; in later executions I get this error:
selenium.common.exceptions.WebDriverException: Message: unknown error: net::ERR_CONNECTION_CLOSED
I have to wait about 30 minutes before trying the script again, but the result is the same: it only works the first time.
I've tried to open the browser with the --incognito flag but this doesn't work. Also, I've tried to find a way to send request headers with Selenium but it seems this feature is not supported.
I am using Windows 10 and ChromeDriver.
Is there any Selenium tip to overcome this issue?
Thanks
When I have seen this error, it was a network issue (the site was not accessible from the internal company network). To confirm or exclude this, try running the tests from a computer outside your company, for example your home computer. Here are more suggestions, but some of them are advanced (dangerous) and you should try them only if you know what you are doing.
Additionally, the site takes more than 20 seconds to load on my computer, and in the console I see this error:
GET https://consultaprocesos.ramajudicial.gov.co/js/chunk-3b114a7f.921eecf3.js net::ERR_CONNECTION_TIMED_OUT
However, this does not seem to cause the observed behavior.
Another possible reason could be an outdated browser/WebDriver or incorrect disposal (quit()) of the driver. If the issue does not reproduce manually (opening the site without Selenium), you can try another WebDriver. You are using Chrome, so try Firefox.
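A minimal sketch of that swap, assuming geckodriver is on the PATH:

from selenium import webdriver

driver = webdriver.Firefox()
driver.get('https://consultaprocesos.ramajudicial.gov.co/Procesos/NumeroRadicacion')
# ... insert the process number and scrape as before ...
driver.quit()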

MaxRetryError while web scraping workaround - Python, Selenium

I am having a really hard time figuring out how to web scrape while making multiple requests to the same website. I have to scrape 3000 products from a website. That implies making various requests to that server (for example searching for the product, clicking on it, going back to the home page) 3000 times.
I should state that I am using Selenium. If I only launch one instance of my Firefox webdriver I don't get a MaxRetryError, but as the search goes on my webdriver gets slower and slower, and when the program reaches about half of the searches it stops responding. I looked it up on some forums and found out this happens because of browser memory issues. So I tried quitting and re-instantiating the webdriver every n seconds (I tried 100, 200 and 300 secs), but when I do so I get the MaxRetryError because of too many requests to that URL using the same session.
I then tried making the program sleep for a minute when the exception occurs, but that hasn't worked (I can only make one more search before the exception is thrown again, and so on).
I am wondering if there is any workaround for this kind of issue.
It might be another library, a way of changing the IP or session dynamically, or something like that.
P.S. I would rather keep working with Selenium if possible.
This error is normally raised when the server detects a high request rate from your client.
As you mentioned, the server bans your IP from making further requests, so you can get around that with some available technologies. Look into Zalenium, and also see here for some other possible approaches.
Another possible (but tedious) way is to use a number of browser instances to make the calls; for example, an answer from here illustrates that:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

urlArr = ['https://link1', 'https://link2', '...']

for url in urlArr:
    chrome_options = Options()
    chromedriver = webdriver.Chrome(executable_path='C:/Users/andre/Downloads/chromedriver_win32/chromedriver.exe', options=chrome_options)
    with chromedriver as browser:
        browser.get(url)
        # your task
        chromedriver.close()  # will close only the current chrome window

browser.quit()  # should close all of the open windows
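Alternatively, here is a rough sketch of the restart-every-so-often approach the asker already tried, but counting products instead of seconds and pausing between sessions. scrape_product, the product list, the batch size, and the pause length are placeholders for illustration, not tested values.

import time
from selenium import webdriver

def scrape_product(browser, product):
    # placeholder for the real steps: search the product, click it, go back
    ...

products = ['product-1', 'product-2', '...']  # ~3000 entries in practice
BATCH_SIZE = 100             # restart the browser every N products
PAUSE_BETWEEN_SESSIONS = 60  # seconds; gives the server a breather

browser = webdriver.Firefox()
for i, product in enumerate(products, start=1):
    scrape_product(browser, product)
    if i % BATCH_SIZE == 0:
        browser.quit()                       # release browser memory
        time.sleep(PAUSE_BETWEEN_SESSIONS)   # space out the new session's requests
        browser = webdriver.Firefox()
browser.quit()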

python - Speeding up Chrome Webdriver in Selenium

I am making a simple bot with Selenium that will like, comment, and message people at certain intervals.
I am using chrome web driver:
browser = webdriver.Chrome()
Also, I am on an x64 Linux system. The distro is Ubuntu 15.04 and I am running the script with Python 3 from the terminal.
This works well and all, but it's pretty slow. I know that as my code grows, testing the app will become a pain. I've looked into this already and know it may have something to do with the proxy settings.
I am clueless when it comes to this type of stuff.
I fiddled with my system settings and changed my proxy settings to not require a connection, but nothing changed.
I notice that when the driver loads, I see 'Establishing secure connection' in the browser window for a few seconds. I feel this is the culprit.
Also, 'establishing host' shows up multiple times. I'd say it takes about 5-8 seconds just to get a page.
login_url = 'http://www.skout.com/login'
browser.get(login_url)
In what ways can I speed up ChromeDriver, and is it the proxy settings? It could definitely be something else.
Thanks for your time.
The Chrome webdriver can be clunky and a bit slow to initialize, as it spawns a fresh browser instance every time you instantiate the WebDriver object.
If speed is of the utmost importance I might recommend investing some time into looking at a headless alternative such as PhantomJS. This can save a significant amount of time if you are running multiple tests or instances of your application.
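A minimal sketch of that swap, assuming an older Selenium release (pre-4.x) where webdriver.PhantomJS is still available and the phantomjs binary is on the PATH; in current Selenium, headless Chrome plays the same role.

from selenium import webdriver

# PhantomJS renders no window, avoiding most of Chrome's startup and paint overhead.
browser = webdriver.PhantomJS()

login_url = 'http://www.skout.com/login'
browser.get(login_url)
# ... log in, like, comment, message ...
browser.quit()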

IE11 stuck at initial start page for webdriver server only when connection to the URL is slow

With the IE webdriver, it opens the IE browser and starts to load the local host page, but then stops (i.e., it never starts loading the URL). When the browser stops loading, it shows the message 'Initial start page for webdriver server'. The problem is that this does not occur every time I execute the test case, making it difficult to identify the cause of the issue. What I have noticed is that when this issue occurs, the URL takes ~25 secs to load manually on the same machine. When the issue does not occur, the URL loads within 3 secs.
All security settings are the same (Protected Mode enabled across all zones)
Enhanced Protected Mode is disabled
IE version 11
The URL is added as a trusted site.
Any clue why it does not load the URL sometimes?
I would try disabling IE native events. Sorry that I cannot provide the Python syntax right away; the following is C#, which should be fairly easy to convert.
var ieOptions = new InternetExplorerOptions
{ EnableNativeEvents = false };
ieOptions.EnsureCleanSession = true;
driver = new InternetExplorerDriver(ieOptions);
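A rough Python equivalent, assuming a Selenium version whose selenium.webdriver.ie.options.Options exposes native_events and ensure_clean_session (Selenium 3.14+ does):

from selenium import webdriver
from selenium.webdriver.ie.options import Options as IeOptions

ie_options = IeOptions()
ie_options.native_events = False        # mirrors EnableNativeEvents = false
ie_options.ensure_clean_session = True  # mirrors EnsureCleanSession = true
driver = webdriver.Ie(options=ie_options)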
Use a remote driver with the desired capability pageLoadStrategy (a Python sketch follows the release notes below).
Release notes from seleniumhq.org. Note that we had to use version 2.46 for the jar, IEDriverServer.exe, and the Python client driver in order for things to work correctly. It is unclear why 2.45 does not work, given the release notes below.
v2.45.0.2
Updates to JavaScript automation atoms.
Added pageLoadStrategy to IE driver. Setting a capability named pageLoadStrategy when creating a session with the IE driver will now change the wait behavior when navigating to a new page. The valid values are:
"normal" - Waits for document.readyState to be 'complete'. This is the default, and is the same behavior as all previous versions of the IE driver.
"eager" - Will abort the wait when document.readyState is 'interactive' instead of waiting for 'complete'.
"none" - Will abort the wait immediately, without waiting for any of the page to load.
Setting the capability to an invalid value will result in use of the "normal" page load strategy.
This hasn't been updated for a while, but recently I had a very similar issue: IEDriverServer would eventually open the page under test, but in most cases it just got stuck on the initial page of WebDriver.
What I found to be the root cause (in my case) was a startup setting of IE. I had 'Start with tabs from the last session' enabled; when I changed it back to 'Start with home page', the driver started to work like a charm, opening the page under test in 100% of tries.
