Selenium Python Chrome: Extremely slow. Are cookies the problem? - python

I read that Selenium Chrome can run faster if you use implicit waits, headless, ID and CSS selectors etc. Before implementing those changes, I want to know whether cookies or caching could be slowing me down.
Does Selenium store cookies and cache like a normal browser, or does it reload all assets every time it navigates to a new page on a website?
If yes, then this would slow down the process of scraping websites with millions of identical profile pages, where the scripts and images are similar for each profile.
If yes, is there a way to avoid this problem? I'm interested in using cookies and cache during a session and then destroying them after the browser is closed.
Edit, more details:
sel_options = {'proxy': {'https': pString}}
options = uc.ChromeOptions()  # options must be created before prefs/arguments are added
prefs = {'download.default_directory': dFolder}
options.add_experimental_option('prefs', prefs)
blocker = os.path.join(os.getcwd(), "extension_iijehicfndmapfeoplkdpinnaicikehn")
options.add_argument(f"--load-extension={blocker}")
options.add_argument(f"--window-size={s1},{s2}")
if headless == "yes":
    options.add_argument("--headless")
driver = uc.Chrome(seleniumwire_options=sel_options, options=options, use_subprocess=True, version_main=109)
stealth(driver, languages=["en-US", "en"], vendor="Google Inc.", platform="Win32", webgl_vendor="Intel Inc.", renderer="Intel Iris OpenGL Engine", fix_hairline=True)
driver.execute_cdp_cmd('Network.setUserAgentOverride', {"userAgent": agent})
navigate("https://linkedin.com")
I don't think my proxy or extension is the culprit, because I have a similar automation app running with no speed issue.

Yes, Selenium Chrome will automatically store cookies and cache assets just like a normal browser, but only for as long as that browser session lives. Scraping a large number of similar pages gets slow when each run starts from a fresh profile, because the shared assets and scripts are downloaded again every time.
One solution is to use a separate instance of the WebDriver for each session, and explicitly delete the cookies and clear storage at the end of each session.
Here is an example:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
options = Options()
options.add_argument("--disable-extensions")
options.add_argument("--disable-gpu")
options.add_argument("--disable-dev-shm-usage")
options.add_argument("--no-sandbox")
options.add_argument("--headless")
options.add_argument("--disable-features=VizDisplayCompositor")
options.add_argument("start-maximized")
options.add_argument("disable-infobars")
options.add_argument("--disable-logging")
options.add_argument("--disable-setuid-sandbox")
options.add_argument("--disable-seccomp-filter-sandbox")
driver = webdriver.Chrome(options=options)
# Your scraping code here...
driver.delete_all_cookies()  # drop all cookies from this session
driver.execute_script("window.sessionStorage.clear();")  # clear session storage
driver.execute_script("window.localStorage.clear();")  # clear local storage
driver.quit()  # end the session and discard the browser profile
The delete_all_cookies method deletes all cookies in the current session, and execute_script is used to clear session storage and local storage.
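If, as the question asks, you instead want to keep cache and cookies alive for the duration of a session and destroy them only when the browser closes, one approach is to point Chrome at a throwaway profile directory via --user-data-dir. This is a minimal sketch, not the only way to do it; the directory prefix is a placeholder:

```python
# Sketch: give Chrome a fresh, disposable profile so its HTTP cache and
# cookies persist for the whole session, then wipe the directory afterwards.
import shutil
import tempfile
from contextlib import contextmanager

@contextmanager
def temporary_profile(prefix="scrape_profile_"):
    """Yield a fresh directory for --user-data-dir; remove it on exit."""
    profile_dir = tempfile.mkdtemp(prefix=prefix)
    try:
        yield profile_dir
    finally:
        shutil.rmtree(profile_dir, ignore_errors=True)  # destroy cache/cookies

# usage (assuming selenium is installed):
# with temporary_profile() as profile_dir:
#     options = Options()
#     options.add_argument(f"--user-data-dir={profile_dir}")
#     driver = webdriver.Chrome(options=options)
#     ...  # assets cached on the first profile page are reused on the rest
#     driver.quit()
```

Within one such session, Chrome's normal disk cache applies, so identical scripts and images across profile pages should only be fetched once.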

Related

Selenium stealth not working, website still sending 403 error

The website redirects me to a captcha page (which is fine) but doesn't let me complete the captcha: it sends a 403 response that blocks the captcha widget from loading, so I cannot send it to 2captcha workers. I tried a VPN, tried switching to my friend's network, and I still get blocked. Is there any error in the code? Could it be the Chromium version (Chromium 104.0.5112.79 snap)?
from selenium import webdriver
from selenium_stealth import stealth
import time
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
# options.add_argument("--headless")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options, executable_path="/snap/chromium/2051/usr/lib/chromium-browser/chromedriver")
stealth(driver,
languages=["en-US", "en"],
vendor="Google Inc.",
platform="Win32",
webgl_vendor="Intel Inc.",
renderer="Intel Iris OpenGL Engine",
fix_hairline=True,
)
url = "https://www.ticketmaster.de/event/nfl-munich-game-seattle-seahawks-tampa-bay-buccaneers-tickets/467425?language=en-us"
driver.get(url)
time.sleep(5)
driver.quit()
Option 1: try clearing your cookies; you have probably landed on a blacklist.
Option 2: the website detects Selenium; in that case, see this question: Can a website detect when you are using Selenium with chromedriver?
It's not entirely clear why the website redirects you to a reCAPTCHA page. However, with an almost identical configuration using chrome=104.0 and chromedriver=104.0, I can access the page perfectly.
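Option 1 can be tried without restarting the browser. A minimal sketch, assuming `driver` is a chromedriver-backed instance (execute_cdp_cmd is Chrome-only); whether this actually gets you off the blacklist depends on how the site fingerprints you:

```python
# Sketch: wipe all cookies and the HTTP cache via the Chrome DevTools
# Protocol, so the site sees a "fresh" visitor on the next driver.get().
def reset_browser_state(driver):
    """Clear all cookies and the browser cache through CDP."""
    driver.execute_cdp_cmd("Network.clearBrowserCookies", {})
    driver.execute_cdp_cmd("Network.clearBrowserCache", {})
```

Unlike driver.delete_all_cookies(), which is scoped to the current domain context, the CDP command clears cookies for every site.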
Code block:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium_stealth import stealth
import time
options = Options()
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
s = Service('C:\\BrowserDrivers\\chromedriver.exe')
driver = webdriver.Chrome(service=s, options=options)
# Selenium Stealth settings
stealth(driver,
languages=["en-US", "en"],
vendor="Google Inc.",
platform="Win32",
webgl_vendor="Intel Inc.",
renderer="Intel Iris OpenGL Engine",
fix_hairline=True,
)
driver.get('https://www.ticketmaster.de/event/nfl-munich-game-seattle-seahawks-tampa-bay-buccaneers-tickets/467425?language=en-us') # not detected
driver.save_screenshot("ticketmaster.png")
Screenshot:
However, the second time I try to access the same page I face the Pardon the Interruption page:
which essentially implies the navigation is blocked.
References
You can find a relevant detailed discussion in:
Can a website detect when you are using Selenium with chromedriver?

driver.get() stopped working in Headless -mode (Chrome)

Recently a scraper I made stopped working in headless mode. I've tried with both Firefox and Chrome. Notable things are that I am using seleniumwire to access API requests, and that I am using ChromeDriverManager to get the driver. Current version is Chrome/93.0.4577.63.
I've tried modifying the User-Agent manually, as can be seen in the code below, in case the website added some checks blocking HeadlessChrome/93.0.4577.63, which is the original User-Agent. This did not help.
When running the script in regular mode, it works. When running in headless mode, the code below outputs [], meaning that driver.get(url) does not capture any requests. I run this code daily and it stopped working on 8.9.2021, I think, during the day.
from selenium.webdriver.chrome.options import Options as chromeOptions
from seleniumwire import webdriver
from webdriver_manager.chrome import ChromeDriverManager
options = {
'suppress_connection_errors': False,
'connection_timeout': None
}
chrome_options = chromeOptions()
chrome_options.add_argument("--start-maximized")
chrome_options.add_argument("--incognito")
chrome_options.add_argument('--log-level=2')
chrome_options.add_argument("--window-size=1920,1080")
chrome_options.add_argument("--disable-extensions")
chrome_options.add_argument('--allow-running-insecure-content')
chrome_options.add_argument('--headless')
driver = webdriver.Chrome(ChromeDriverManager().install(), seleniumwire_options=options, chrome_options=chrome_options)
userAgent = driver.execute_script("return navigator.userAgent;")
userAgent = userAgent.replace('Headless', '')
driver.execute_cdp_cmd('Network.setUserAgentOverride', {"userAgent": userAgent})
url = 'my URL goes here'
driver.get(url)
print(driver.requests)
Same issue with Firefox: headless does not work but regular browsing does. Any idea what might cause this problem and what could solve it? I've also tried adding the following arguments to Chrome options without any luck:
chrome_options.add_argument("--proxy-server='direct://'")
chrome_options.add_argument("--proxy-bypass-list=*")
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--ignore-certificate-errors')
chrome_options.add_argument('--headless')
chrome_options.add_argument('--ignore-certificate-errors-spki-list')
chrome_options.add_argument('--ignore-ssl-errors')
This may have been solved - I noticed that I first set the window size to maximized and after that set it to 1920,1080. When I removed the maximize argument, chrome_options.add_argument("--start-maximized"), the problem disappeared and now the script works once again.
I'm not sure if this actually solved it or whether it was something else, since Selenium is a bit finicky and sometimes data just won't load the same way for the same web page, but at least now it works.

Selenium Chrome driver not generating search results in headless mode

I was doing some scraping and came across this issue where, when the browser is run in headless mode, it does not produce search results but instead gives a blank body.
This issue is not encountered on every run; it randomly pops up between some runs.
Expected:
Actual (image saved by driver.save_screenshot()):
configuration of webdriver:
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument("--disable-extensions")
options.add_argument("--proxy-server='direct://'")
options.add_argument("--proxy-bypass-list=*")
options.add_argument("--start-maximized")
options.add_argument('--disable-gpu')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('--ignore-certificate-errors')
options.add_argument('--allow-running-insecure-content')
options.add_argument("--window-size=1920,1080")
This issue occurs only in headless mode.
Methods already tried include:
Implicit wait
time.sleep()
Browser reload
Environment: MacOS 11.5.2
Any help is appreciated
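Since implicit waits and time.sleep() were already tried, an explicit, condition-based wait is the usual next step: keep re-checking for the results element and proceed only once it exists. Conceptually, Selenium's WebDriverWait boils down to a polling loop like this stdlib-only sketch:

```python
# Sketch of an explicit wait: re-evaluate a condition until it becomes
# truthy or the timeout expires, instead of sleeping a fixed amount.
import time

def poll_until(predicate, timeout=20.0, interval=0.5):
    """Call `predicate` repeatedly; return its first truthy result."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = predicate()
        if result:
            return result
        time.sleep(interval)
    raise TimeoutError(f"condition not met within {timeout}s")
```

With Selenium installed, the equivalent is WebDriverWait(driver, 20).until(EC.presence_of_element_located(...)), or poll_until(lambda: driver.find_elements(By.CSS_SELECTOR, "div#results")) with a selector matching your results container (the selector here is a placeholder). If the timeout still fires only in headless mode, the blank body is more likely bot detection than a timing issue.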

selenium python generate data uri of an image with headless Chrome [duplicate]

My code works fine with ChromeDriver in 'normal' mode. When I change to headless mode, it doesn't download the file. I've already tried code I found around the internet, but it didn't work.
chrome_options = Options()
chrome_options.add_argument("--headless")
self.driver = webdriver.Chrome(chrome_options=chrome_options, executable_path=r'{}/chromedriver'.format(os.getcwd()))
self.driver.set_window_size(1024, 768)
self.driver.command_executor._commands["send_command"] = ("POST", '/session/$sessionId/chromium/send_command')
params = {'cmd': 'Page.setDownloadBehavior', 'params': {'behavior': 'allow', 'downloadPath': os.getcwd()}}
self.driver.execute("send_command", params)
Does anyone have any idea how to solve this problem?
PS: I don't necessarily need to use ChromeDriver. If it works with another driver, that's fine for me.
First the solution
Minimum Prerequisites:
Selenium client version: Selenium v3.141.59
Chrome version: Chrome v77.0
ChromeDriver version: ChromeDriver v77.0
To download the file clicking on the element with text as Download Data within this website you can use the following solution:
Code Block:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
options = Options()
options.add_argument("--headless")
options.add_argument("--window-size=1920,1080")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(chrome_options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe', service_args=["--log-path=./Logs/DubiousDan.log"])
print ("Headless Chrome Initialized")
params = {'behavior': 'allow', 'downloadPath': r'C:\Users\Debanjan.B\Downloads'}
driver.execute_cdp_cmd('Page.setDownloadBehavior', params)
driver.get("https://www.mockaroo.com/")
driver.execute_script("scroll(0, 250)")
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button#download"))).click()
print ("Download button clicked")
#driver.quit()
Console Output:
Headless Chrome Initialized
Download button clicked
File Downloading snapshot:
Details
Downloading files through Headless Chromium was one of the most sought functionality since Headless Chrome was introduced.
Since then there were different work-arounds published by different contributors and some of them are:
Downloading with chrome headless and selenium
Python equivalent of a given wget command
Now, the good news is that the Chromium team has officially announced the arrival of the functionality of downloading files through Headless Chromium.
In the discussion Headless mode doesn't save file downloads #eseckler mentioned:
Downloads in headless work a little differently. There's the Page.setDownloadBehavior devtools command to set a download folder. We're working on a way to use DevTools network interception to stream the downloaded file via DevTools as well.
A detailed discussion can be found at Issue 696481: Headless mode doesn't save file downloads
Finally, #bugdroid revision seems to have nailed the issue for us.
[ChromeDriver] Added support for headless mode to download files
Previously, Chromedriver running in headless mode would not properly download files due to the fact it sparsely parses the preference file given to it. Engineers from the headless chrome team recommended using DevTools's "Page.setDownloadBehavior" to fix this. This changelist implements this fix. Downloaded files default to the current directory and can be set using download_dir when instantiating a chromedriver instance. Also added tests to ensure proper download functionality.
Here is the revision and commit
From ChromeDriver v77.0.3865.40 (2019-08-20) release notes:
Resolved issue 2454: Headless mode doesn't save file downloads [Pri-2]
Solution
Update ChromeDriver to latest ChromeDriver v77.0 level.
Update Chrome to Chrome Version 77.0 level. (as per ChromeDriver v76.0 release notes)
Note: Chrome v77.0 is yet to be GAed/pushed for release so till then you can download and install a development build and test either from:
Chrome Canary
Latest build from the Dev Channel
Outro
However, Mac OS X users have to wait for their piece of the pie, as on ChromeDriver, headless Chrome crashes after sending Page.setDownloadBehavior on Mac OS X.
ChromeDriver Version: 95.0.4638.54
Chrome Version 95.0.4638.69
from selenium.webdriver.chrome.options import Options
options = Options()
options.add_argument("--headless")
options.add_argument("--start-maximized")
options.add_argument("--no-sandbox")
options.add_argument("--disable-extensions")
options.add_argument('--disable-dev-shm-usage')
options.add_argument("--disable-gpu")
options.add_argument('--disable-software-rasterizer')
options.add_argument("user-agent=Mozilla/5.0 (Windows Phone 10.0; Android 4.2.1; Microsoft; Lumia 640 XL LTE) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Mobile Safari/537.36 Edge/12.10166")
options.add_argument("--disable-notifications")
options.add_experimental_option("prefs", {
"download.default_directory": "C:\\link\\to\\folder",
"download.prompt_for_download": False,
"download.directory_upgrade": True,
"safebrowsing_for_trusted_sources_enabled": False,
"safebrowsing.enabled": False
}
)
What seemed to work was that I used "\\" instead of "/" for the address. The latter approach didn't throw any error, but didn't download any documents either. But, using double back slashes did the job.
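For what it's worth, the two backslash spellings are interchangeable in Python; both literals produce the identical string, so the fix above is really about avoiding forward slashes in this pref on Windows, not about which backslash syntax you use:

```python
# Both literals denote the same Windows path string:
escaped = "C:\\link\\to\\folder"  # backslashes escaped
raw = r"C:\link\to\folder"        # raw string literal, no escaping needed
assert escaped == raw
```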
For JavaScript, use the code below:
const chrome = require('selenium-webdriver/chrome');
let options = new chrome.Options();
options.addArguments('--headless --window-size=1500,1200');
options.setUserPreferences({ 'plugins.always_open_pdf_externally': true,
"profile.default_content_settings.popups": 0,
"download.default_directory": Download_File_Path });
driver = await new webdriver.Builder().setChromeOptions(options).forBrowser('chrome').build();
Then switch tabs as soon as you click the download button:
await driver.sleep(1000);
var Handle = await driver.getAllWindowHandles();
await driver.switchTo().window(Handle[1]);
This C# works for me
Note the new headless option https://www.selenium.dev/blog/2023/headless-is-going-away/
private IWebDriver StartBrowserChromeHeadlessDriver()
{
var chromeOptions = new ChromeOptions();
chromeOptions.AddArgument("--headless=new");
chromeOptions.AddArgument("--window-size=1920,1080");
chromeOptions.AddUserProfilePreference("download.default_directory", downloadFolder);
var chromeDownload = new Dictionary<string, object>
{
{ "behavior", "allow" },
{ "downloadPath", downloadFolder }
};
var driver = new ChromeDriver(driverFolder, chromeOptions, TimeSpan.FromSeconds(timeoutSecs));
driver.ExecuteCdpCommand("Browser.setDownloadBehavior", chromeDownload);
return driver;
}
import pathlib
from selenium.webdriver import Chrome
driver = Chrome()
driver.execute_cdp_cmd("Page.setDownloadBehavior", {
"behavior": "allow",
"downloadPath": str(pathlib.Path.home().joinpath("Downloads"))
})
I don't think you should be using the browser for downloading content; leave that to Chrome's developers/testers.
I believe you should instead get the href attribute of the element you want to download and fetch it using the requests library.
If your site requires authentication, you can fetch cookies from the browser instance and pass them to requests.Session.
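A minimal sketch of that hand-off, assuming an already-authenticated `driver`; the element selector and filename are placeholders, and the cookie-dict shape matches what driver.get_cookies() returns:

```python
# Sketch: copy cookies out of a live Selenium session into requests,
# then download the file outside the browser.
import requests

def session_from_cookies(cookie_dicts):
    """Build a requests.Session preloaded with Selenium-style cookie dicts."""
    session = requests.Session()
    for c in cookie_dicts:
        session.cookies.set(c["name"], c["value"], domain=c.get("domain", ""))
    return session

# usage (assuming an authenticated `driver`):
# s = session_from_cookies(driver.get_cookies())
# href = driver.find_element(By.CSS_SELECTOR, "a.download").get_attribute("href")
# with open("data.bin", "wb") as f:
#     f.write(s.get(href).content)
```

This sidesteps the headless-download problem entirely, since the browser never touches the file.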

Webpage Is Detecting Selenium Webdriver with Chromedriver as a bot

I am trying to scrape https://www.controller.com/ with Python. Since the page detected a bot when using pandas.get_html, and when using requests with user-agents and a rotating proxy, I resorted to Selenium WebDriver. However, this is also being detected as a bot, with the following message. Can anybody explain how I can get past this?
Pardon Our Interruption...
As you were browsing www.controller.com something about your browser made us think you were a bot. There are a few reasons this might happen:
You're a power user moving through this website with super-human speed.
You've disabled JavaScript in your web browser.
A third-party browser plugin, such as Ghostery or NoScript, is preventing JavaScript from running. Additional information is available in this support article.
To request an unblock, please fill out the form below and we will review it as soon as possible.
Here is my code:
from selenium import webdriver
import requests
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_argument("disable-infobars")
options.add_argument("--disable-extensions")
#options.add_argument('headless')
driver = webdriver.Chrome(chrome_options=options)
driver.get('https://www.controller.com/')
driver.implicitly_wait(30)
You have mentioned pandas.get_html only in your question and options.add_argument('headless') only in your code, so I am not sure whether you are implementing them. However, taking the minimum code from your attempt as follows:
Code Block:
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_argument("disable-infobars")
options.add_argument("--disable-extensions")
driver = webdriver.Chrome(chrome_options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
driver.get('https://www.controller.com/')
print(driver.title)
I have faced the same issue.
Browser Snapshot:
When I inspected the HTML DOM it was observed that the website refers the distil_referrer on window.onbeforeunload as follows:
<script type="text/javascript" id="">
window.onbeforeunload=function(a){"undefined"!==typeof sessionStorage&&sessionStorage.removeItem("distil_referrer")};
</script>
Snapshot:
This is a clear indication that the website is protected by Bot Management service provider Distil Networks and the navigation by ChromeDriver gets detected and subsequently blocked.
Distil
As per the article There Really Is Something About Distil.it...:
Distil protects sites against automatic content scraping bots by observing site behavior and identifying patterns peculiar to scrapers. When Distil identifies a malicious bot on one site, it creates a blacklisted behavioral profile that is deployed to all its customers. Something like a bot firewall, Distil detects patterns and reacts.
Further,
"One pattern with Selenium was automating the theft of Web content", Distil CEO Rami Essaid said in an interview last week. "Even though they can create new bots, we figured out a way to identify Selenium as the tool they're using, so we're blocking Selenium no matter how many times they iterate on that bot. We're doing that now with Python and a lot of different technologies. Once we see a pattern emerge from one type of bot, then we work to reverse engineer the technology they use and identify it as malicious".
Reference
You can find a couple of detailed discussions in:
Distil detects WebDriver driven Chrome Browsing Context
Selenium webdriver: Modifying navigator.webdriver flag to prevent selenium detection
Akamai Bot Manager detects WebDriver driven Chrome Browsing Context
Distil can detect if you are headless by doing some fingerprinting using html5 canvas. They also check things like browser plugins and user-agent. Selenium sets some browser flags that are also detectable.
Finally solved the problem and headless mode works as well.
chrome_options = Options()
chrome_options.add_argument("--disable-extensions")
chrome_options.add_argument("--headless")
driver = webdriver.Chrome("chromedriver.exe", options=chrome_options)
driver.execute_cdp_cmd('Network.setUserAgentOverride', {"userAgent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.53 Safari/537.36'})
driver.execute_script("Object.defineProperty(navigator, 'webdriver', {get: () => undefined})")
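One caveat worth noting: execute_script patches navigator.webdriver only on the page that is currently loaded, so the override disappears on the next navigation. To apply the patch before every new document loads, the CDP command Page.addScriptToEvaluateOnNewDocument is commonly used instead. A sketch, assuming a chromedriver-backed `driver`:

```python
# Sketch: register the navigator.webdriver patch so it runs at the start
# of every new document, not just the current one (Chrome-only CDP).
STEALTH_JS = "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"

def patch_before_load(driver, script=STEALTH_JS):
    """Register `script` to execute in every new document via CDP."""
    driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {"source": script})
```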
