I am using Selenium in Python for automation tasks, but they require Chrome profiles with special settings. So I am using
options.add_argument("--user-data-dir=path_to_chrome_file")
to load Chrome with that profile. I am also using the following options:
options.add_argument("start-maximized")
options.add_argument("--disable-gpu")
options.add_argument("--disable-web-security")
options.add_experimental_option("detach", True)
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option("useAutomationExtension", False)
from selenium.common.exceptions import NoSuchElementException, StaleElementReferenceException

ignored_exceptions = (
    NoSuchElementException,
    StaleElementReferenceException,
)
driver = webdriver.Chrome(options=options, executable_path="chromedriver")
stealth(
driver,
languages=["en-US", "en"],
vendor="Google Inc.",
platform="Win32",
webgl_vendor="Intel Inc.",
renderer="Intel Iris OpenGL Engine",
fix_hairline=True,
)
Some of these options may not be necessary; they were just added over time.
Problem:
I found that sometimes (around 1 in 20 runs) the profile data in "--user-data-dir=path_to_chrome_file" gets cleared and a fresh data directory is generated. Can someone help figure out why this happens?
The same code structure and options work well on Windows.
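For what it's worth, one defensive pattern while debugging this (an assumption on my part, not a confirmed root cause) is to back up the profile directory before each launch and always shut Chrome down cleanly, since an unclean exit, or two Chrome instances sharing the same --user-data-dir, can leave the profile in a state Chrome discards on the next start. A minimal sketch (the path is the same placeholder as above; note that quit() assumes the detach=True option is dropped):

import shutil
from pathlib import Path
from selenium import webdriver

PROFILE = Path("path_to_chrome_file")  # placeholder, same as above
BACKUP = PROFILE.with_name(PROFILE.name + "_backup")

# Refresh the backup before each run so a wiped profile can be restored
if PROFILE.exists():
    shutil.copytree(PROFILE, BACKUP, dirs_exist_ok=True)  # Python 3.8+

options = webdriver.ChromeOptions()
options.add_argument(f"--user-data-dir={PROFILE}")
driver = webdriver.Chrome(options=options)
try:
    pass  # automation tasks go here
finally:
    driver.quit()  # clean shutdown so Chrome flushes the profile to disk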
Related
So, I'm trying to write a script to log in at https://us.etrade.com/e/t/user/login
I am using Selenium for this, but the site somehow detects Selenium when it starts and shows a message saying the servers are crowded; when that happens, I can't log in. I've also tried undetected-chromedriver as well as selenium-stealth, but both got detected too. I really need to automate this login process. I've tried Python requests, but that doesn't work either. I'm open to any other technology or method that allows me to do this automation. Please help.
Here's my code:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium_stealth import stealth
import time
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
# chrome_options.add_argument('--browser')
chrome_options.add_argument('--no-sandbox')
# chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
chrome_options.add_experimental_option('useAutomationExtension', False)
wd = webdriver.Chrome('chromedriver', chrome_options=chrome_options)
stealth(wd,
languages=["en-US", "en"],
vendor="Google Inc.",
platform="Win32",
webgl_vendor="Intel Inc.",
renderer="Intel Iris OpenGL Engine",
fix_hairline=True,
)
wd.get("https://us.etrade.com/e/t/user/login")
Demo credentials would have helped us dig deeper into your specific use case.
However, using selenium-stealth I was able to bypass detection of the Selenium-driven, ChromeDriver-initiated google-chrome browsing context fairly easily.
Selenium 4 compatible code
Code Block:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium_stealth import stealth
options = Options()
options.add_argument("start-maximized")
# Chrome is controlled by automated test software
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
s = Service('C:\\BrowserDrivers\\chromedriver.exe')
driver = webdriver.Chrome(service=s, options=options)
# Selenium Stealth settings
stealth(driver,
languages=["en-US", "en"],
vendor="Google Inc.",
platform="Win32",
webgl_vendor="Intel Inc.",
renderer="Intel Iris OpenGL Engine",
fix_hairline=True,
)
driver.get("https://bot.sannysoft.com/")
driver.save_screenshot('bot_sannysoft.png')
Screenshot:
With the ETRADE login page
Code Block:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium_stealth import stealth
import time
options = Options()
options.add_argument("start-maximized")
# Chrome is controlled by automated test software
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
s = Service('C:\\BrowserDrivers\\chromedriver.exe')
driver = webdriver.Chrome(service=s, options=options)
# Selenium Stealth settings
stealth(driver,
languages=["en-US", "en"],
vendor="Google Inc.",
platform="Win32",
webgl_vendor="Intel Inc.",
renderer="Intel Iris OpenGL Engine",
fix_hairline=True,
)
driver.get("https://us.etrade.com/e/t/user/login")
driver.save_screenshot('etrade_com_login.png')
Screenshot:
I read that Selenium with Chrome can run faster if you use implicit waits, headless mode, ID and CSS selectors, etc. Before implementing those changes, I want to know whether cookies or caching could be slowing me down.
Does Selenium store cookies and cache like a normal browser, or does it reload all assets every time it navigates to a new page on a website?
If yes, this would slow down the process of scraping websites with millions of identical profile pages, where the scripts and images are similar for each profile.
If yes, is there a way to avoid this problem? I'm interested in using cookies and cache during a session and then destroying them after the browser is closed.
Edit, more details:
sel_options = {'proxy': {'https': pString}}
prefs = {'download.default_directory' : dFolder}
options.add_experimental_option('prefs', prefs)
blocker = os.path.join( os.getcwd(), "extension_iijehicfndmapfeoplkdpinnaicikehn")
options.add_argument(f"--load-extension={blocker}")
wS = "--window-size="+s1+","+s2
options.add_argument(wS)
if headless == "yes": options.add_argument("--headless")
driver = uc.Chrome(seleniumwire_options=sel_options, options=options, use_subprocess=True, version_main=109)
stealth(driver, languages=["en-US", "en"], vendor="Google Inc.", platform="Win32", webgl_vendor="Intel Inc.", renderer="Intel Iris OpenGL Engine", fix_hairline=True)
driver.execute_cdp_cmd('Network.setUserAgentOverride', {"userAgent": agent})
navigate("https://linkedin.com")
I don't think my proxy or extension is the culprit, because I have a similar automation app running with no speed issue.
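One rough way to check whether the cache is being hit at all is to read the browser's Navigation Timing and Resource Timing entries after each page load. A short sketch (the profile URLs are placeholders):

from selenium import webdriver

# URLs below are placeholders for the identical profile pages
driver = webdriver.Chrome()
for url in ["https://example.com/profile/1", "https://example.com/profile/2"]:
    driver.get(url)
    # Navigation Timing: total duration of this page load, in milliseconds
    load_ms = driver.execute_script(
        "const [nav] = performance.getEntriesByType('navigation');"
        "return nav ? nav.duration : null;"
    )
    # transferSize of 0 usually means the resource was served from cache
    # (it is also 0 for cross-origin resources without Timing-Allow-Origin)
    cached = driver.execute_script(
        "return performance.getEntriesByType('resource')"
        ".filter(r => r.transferSize === 0).length;"
    )
    print(url, "load_ms:", load_ms, "cached_resources:", cached)
driver.quit()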
Yes, Selenium will automatically store cookies and cache assets just like a normal browser would. This can slow down the process of scraping websites with a large number of similar pages, as assets and scripts may be reloaded every time a new page is navigated to.
One solution is to use a separate instance of the WebDriver for each session and explicitly delete the cookies and cache at the end of each session.
Here is an example:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
options = Options()
options.add_argument("--disable-extensions")
options.add_argument("--disable-gpu")
options.add_argument("--disable-dev-shm-usage")
options.add_argument("--no-sandbox")
options.add_argument("--headless")
options.add_argument("--disable-features=VizDisplayCompositor")
options.add_argument("start-maximized")
options.add_argument("disable-infobars")
options.add_argument("--disable-logging")
options.add_argument("--disable-setuid-sandbox")
options.add_argument("--disable-seccomp-filter-sandbox")
driver = webdriver.Chrome(options=options)
# Your scraping code here...
driver.delete_all_cookies()  # remove all cookies from the current session
driver.execute_script("window.sessionStorage.clear();")  # clear session storage
driver.execute_script("window.localStorage.clear();")  # clear local storage
driver.quit()  # shut the browser down and end the session
The delete_all_cookies method deletes all cookies in the current session, and the execute_script calls clear the session and local storage.
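Note that delete_all_cookies does not touch Chrome's HTTP cache. If you also want to drop cached assets at the end of a session, one option is the Chrome DevTools Protocol, which chromedriver exposes through execute_cdp_cmd (a short sketch, assuming the Chrome-based driver from the block above):

# Clear the HTTP cache and cookies via the Chrome DevTools Protocol
driver.execute_cdp_cmd("Network.clearBrowserCache", {})
driver.execute_cdp_cmd("Network.clearBrowserCookies", {})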
The website redirects me to a captcha page (which is fine) but doesn't let me complete the captcha: it sends a 403 response that blocks the captcha widget from loading, so I cannot send it to 2captcha workers. I tried a VPN and tried switching to my friend's network, and I still get blocked. Is there any error in the code? Could it be the Chromium version (Chromium 104.0.5112.79 snap)?
from selenium import webdriver
from selenium_stealth import stealth
import time
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
# options.add_argument("--headless")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options, executable_path="/snap/chromium/2051/usr/lib/chromium-browser/chromedriver")
stealth(driver,
languages=["en-US", "en"],
vendor="Google Inc.",
platform="Win32",
webgl_vendor="Intel Inc.",
renderer="Intel Iris OpenGL Engine",
fix_hairline=True,
)
url = "https://www.ticketmaster.de/event/nfl-munich-game-seattle-seahawks-tampa-bay-buccaneers-tickets/467425?language=en-us"
driver.get(url)
time.sleep(5)
driver.quit()
Option 1: try clearing your cookies; you have probably landed on a blacklist.
Option 2: the website detects Selenium. In that case, see this question: Can a website detect when you are using Selenium with chromedriver?
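For Option 1, clearing cookies is a one-liner with the driver and url already defined in the question's code (a sketch):

driver.delete_all_cookies()  # drop cookies that may have flagged this session
driver.get(url)  # retry the page with a clean cookie jar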
It's not entirely clear why the website redirects you to a reCAPTCHA page. However, with an almost identical configuration using chrome=104.0 and chromedriver=104.0, I can access the page perfectly.
Code block:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium_stealth import stealth
import time
options = Options()
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
s = Service('C:\\BrowserDrivers\\chromedriver.exe')
driver = webdriver.Chrome(service=s, options=options)
# Selenium Stealth settings
stealth(driver,
languages=["en-US", "en"],
vendor="Google Inc.",
platform="Win32",
webgl_vendor="Intel Inc.",
renderer="Intel Iris OpenGL Engine",
fix_hairline=True,
)
driver.get('https://www.ticketmaster.de/event/nfl-munich-game-seattle-seahawks-tampa-bay-buccaneers-tickets/467425?language=en-us') # not detected
driver.save_screenshot("ticketmaster.png")
Screenshot:
However, the second time I try to access the same page, I face the Pardon the Interruption page:
which essentially implies the navigation is blocked.
References
You can find a relevant detailed discussion in:
Can a website detect when you are using Selenium with chromedriver?
I was doing some scraping and came across an issue where, when the browser runs in headless mode, it does not produce search results but instead returns a blank body.
This issue is not encountered on every run; it randomly pops up between runs.
Expected:
Actual (image saved by driver.save_screenshot()):
Configuration of the WebDriver:
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument("--disable-extensions")
options.add_argument("--proxy-server='direct://'")
options.add_argument("--proxy-bypass-list=*")
options.add_argument("--start-maximized")
options.add_argument('--disable-gpu')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('--ignore-certificate-errors')
options.add_argument('--allow-running-insecure-content')
options.add_argument("--window-size=1920,1080")
The issue only appears in headless mode.
Methods already tried include:
Implicit wait
time.sleep()
Browser reload
Environment: macOS 11.5.2
Any help is appreciated.
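Since implicit waits and time.sleep() were already tried, one more thing worth checking is an explicit wait for the results container combined with a debug dump when the body comes back blank. A sketch, assuming the driver configured above ("#search" is a placeholder selector for the real results element):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

try:
    # Wait until the results container is actually attached to the DOM
    WebDriverWait(driver, 15).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "#search"))
    )
except TimeoutException:
    # Save what the headless browser actually received, for inspection
    with open("blank_body_debug.html", "w") as f:
        f.write(driver.page_source)
    driver.save_screenshot("blank_body_debug.png")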