I’m attempting to connect to the below website using selenium but the browser is held on a page which says “we’re making sure you’re not a bot”, which I believe is a Cloudflare block.
url: https:// www .aston martin austin. com /
Despite the page saying that it will connect after 5 seconds the browser is held on this site for longer. Implementing a sleep period sometimes results in the site loading, however, on other occasions the browser remains on this page without reaching the site.
I’ve tested various approaches including using a variety of chrome options (see code for a full list), selenium-wire to match headers against those when loading the site manually, a while loop with multiple gets and wait periods in between (which occasionally works - why is this?), but no method that I can code so far has proved to consistently work time after time i.e. connect to the home page first time. I've also tried changing the $cdc variable in the chrome driver executable.
The undetected-chromedriver package is the one package that does seem to work; however, I’d like to learn and understand which change I can make to my own code to make consistent successful requests without the help of an additional package.
Note – spaces should be removed from the URL when testing.
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
options = webdriver.ChromeOptions()
options.add_argument("window-size=1200x600")
options.add_argument("--headless")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option("useAutomationExtension", False)
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_argument("--no-sandbox")
options.add_argument("--ignore-certificate-errors-spki-list")
options.add_argument("--ignore-ssl-errors")
options.add_argument("--allow-insecure-localhost")
options.add_argument("--ignore-certificate-errors")
options.add_argument("--enable-javascript")
options.add_argument("javascript.enabled")
options.add_argument("--lang=en-US,en;q=0.8")
options.add_argument("--incognito")
options.add_argument("start-maximized")
options.add_argument("Sec-Fetch-Site=cross-site")
capabilities = options.to_capabilities()
capabilities["acceptInsecureCerts"] = True
options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.82 Safari/537.36")
ser = Service(r"C:\Users\user\Documents\Selenium\chromedriver.exe")
driver = webdriver.Chrome(options=options,service=ser,)
driver.get(url)
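The retry loop mentioned in the question (repeated checks with wait periods in between) could be sketched roughly as below. This only illustrates the polling idea and is not a fix for the block itself: wait_for_challenge_to_clear and its parameters are hypothetical names, the marker text is taken from the interstitial message quoted in the question, and driver is assumed to be any object exposing a page_source attribute, as a Selenium WebDriver does.

```python
import time


def wait_for_challenge_to_clear(driver, marker, timeout=30.0, poll=1.0,
                                clock=time.monotonic, sleep=time.sleep):
    """Return True once `marker` no longer appears in the page source.

    Returns False if the marker is still present when `timeout` seconds
    have passed. `clock` and `sleep` are injectable to make testing easy.
    """
    deadline = clock() + timeout
    wanted = marker.lower()
    while True:
        # Read the page once per iteration and look for the interstitial text.
        if wanted not in driver.page_source.lower():
            return True
        if clock() >= deadline:
            return False
        sleep(poll)
```

For example, after driver.get(url), calling wait_for_challenge_to_clear(driver, "not a bot", timeout=30) returns False if the interstitial is still showing after 30 seconds, which at least makes the failure explicit instead of relying on a fixed sleep.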
I have been noticing a problem with Selenium when I run it headless: some pages don't fully load or render some elements. I don't know exactly why they don't load 100%; maybe JavaScript isn't running?
My code:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from decouple import config
from time import sleep
DEBUG = config('DEBUG')
class DiscordME(object):
    def __init__(self):
        self.LINUX = config('LINUX', cast=bool)
        self.DRIVER_VERSION = config('DRIVER_VERSION')
        self.HEADLESS = True
        options = Options()
        options.add_argument('--no-sandbox')
        options.add_argument('--disable-gpu')
        options.add_argument('--ignore-certificate-errors')
        options.add_argument('--disable-extensions')
        options.add_argument('--disable-dev-shm-usage')
        if self.HEADLESS:
            options.add_argument('--headless')
            options.add_argument('--window-size=1920,1200')
        if self.LINUX:
            self.browser = webdriver.Chrome(executable_path=f'./drivers/chromedriver-{self.DRIVER_VERSION}', options=options)
        else:
            self.browser = webdriver.Chrome(executable_path=rf'.\drivers\chromedriver-{self.DRIVER_VERSION}.exe', options=options)

    def get_website(self):
        self.browser.get('https://discord.me/login')
        WebDriverWait(self.browser, 10).until(
            EC.url_changes('https://discord.me/login')
        )
        print(self.browser.current_url)
        print(self.browser.page_source)
        #print(self.browser.find_element_by_xpath('//*[@id="app-mount"]/div[2]/div/div[2]/div/div/form/div/div/div[1]/div[3]/div[1]/div/div[2]/input'))

DiscordME().get_website()
With this script, the login inputs don't load when it accesses the Discord API login page.
Looking at page_source, I noticed that the page is not being mounted, so that could be the problem.
from selenium import webdriver
from time import sleep
options = webdriver.ChromeOptions()
options.add_argument("--window-size=1920,1080")
options.add_argument("--headless")
options.add_argument("--disable-gpu")
options.add_argument(
"user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36")
browser = webdriver.Chrome(options=options)
Some websites use the user agent to detect whether the browser is running in headless mode, because a headless browser sends a different user agent than a normal browser. So set the user agent explicitly.
Headless browser detection
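As a rough illustration of the point above: the default user agent of headless Chrome has typically contained the token HeadlessChrome, which is one thing a site can look for. The helper names below are hypothetical, and driver is assumed to be any object with Selenium's execute_script method.

```python
def is_headless_user_agent(user_agent: str) -> bool:
    """Return True if the user agent string carries the HeadlessChrome token."""
    return "HeadlessChrome" in user_agent


def report_user_agent(driver) -> dict:
    """Ask the browser for its user agent and flag the headless token."""
    user_agent = driver.execute_script("return navigator.userAgent")
    return {
        "user_agent": user_agent,
        "looks_headless": is_headless_user_agent(user_agent),
    }
```

Calling report_user_agent(browser) before and after setting the user-agent option shows whether the override actually reached the page.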
Another thing to consider if you are having trouble loading a website with Selenium is processing power.
I was using a Micro AWS instance with a single CPU, which worked for many websites, but on a more complex one a search like find_elements_by_xpath('//a[@href]') intermittently returned 0 elements, while at other times it worked and found the hyperlinks. I upgraded the instance to one with more CPUs (4, but 2 would probably have been sufficient), and that allowed me to fully load the site and scrape the elements.
I would definitely try the other two solutions posted here first (chrome options or firefox browser), but processing power could be the problem as well.
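If slow rendering is the cause, polling until the elements appear is one alternative to a fixed sleep. A minimal sketch, assuming a Selenium 4 style driver.find_elements(by, value) call (the string "xpath" is the value of By.XPATH); find_with_retry and its parameters are hypothetical names.

```python
import time


def find_with_retry(driver, xpath, attempts=5, delay=1.0, sleep=time.sleep):
    """Poll find_elements until it returns a non-empty list or attempts run out."""
    for attempt in range(attempts):
        elements = driver.find_elements("xpath", xpath)
        if elements:
            return elements
        # Do not sleep after the final attempt.
        if attempt < attempts - 1:
            sleep(delay)
    return []
```

Selenium's own WebDriverWait with expected_conditions serves the same purpose and is usually the first thing to reach for; the loop above just makes the mechanism visible.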
I would just like to share my experience, as solving this issue consumed much of my time trying many options and settings for the Chrome webdriver.
The user-agent setting solved the problem for some websites I scraped, but for some other websites the only solution that worked for me was to use the Firefox webdriver instead of Chrome, as follows:
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
fireFoxOptions = Options()
fireFoxOptions.add_argument("--headless")
fireFoxOptions.add_argument("--window-size=1920,1080")
fireFoxOptions.add_argument('--start-maximized')
fireFoxOptions.add_argument('--disable-gpu')
fireFoxOptions.add_argument('--no-sandbox')
driver = webdriver.Firefox(options=fireFoxOptions,
executable_path=r'C:\[your path to firefox webdriver exe file]\geckodriver.exe')
driver.get('https://discord.me/login')
Use the link here to download the latest geckodriver for Firefox, and make sure the Firefox browser is already installed on your machine.
I am trying to scrape some data from the LV website with Selenium and keep getting an 'Access Denied' screen once the 'sign in' button is clicked. I feel like there is protection against this, because everything works fine when I do the same manually. Oddly, I need to click the 'sign in' button twice to be able to sign in manually.
My code:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options, executable_path=r'chromedriver.exe')
driver.get('https://secure.louisvuitton.com/eng-gb/mylv')
WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, "//span[@class='ucm-wrapper']")))
driver.find_element_by_xpath("//button[@class='ucm-button ucm-button--default ucm-choice__yes']").click()
driver.find_element_by_id('loginloginForm').send_keys('xxx@xxx.com')
driver.find_element_by_id('passwordloginForm').send_keys('xxxxxx')
driver.find_element_by_id('loginSubmit_').click()
Error:
You don't have permission to access "http://secure.louisvuitton.com/eng-gb/mylv;jsessionid=xxxxxxx.front61-prd?" on this server.
Is there a way to login with Selenium and bypass this?
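I took your code, added a few tweaks, and ran the test as follows:
Code Block:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
# Driver set up as in the question (Selenium 3 style executable_path).
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
driver = webdriver.Chrome(options=options, executable_path=r'chromedriver.exe')
driver.get('https://secure.louisvuitton.com/eng-gb/mylv')
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//span[text()='Accept and Continue']"))).click()
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//input[@id='loginloginForm']"))).send_keys("Mudyla@stackoverflow.com")
driver.find_element_by_xpath("//input[@id='passwordloginForm']").send_keys('Mudyla')
driver.find_element_by_xpath("//input[@id='loginSubmit_']").click()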
Observation
Similar to your observation, I have hit the same roadblock with no results as follows:
Deep Dive
It seems the click() on Sign In does happen. But while inspecting the DOM tree of the webpage, you will find that some of the <script> tags refer to JavaScript files containing the keyword akam. As an example:
akam-sw.js install script version 1.3.3 "serviceWorker"in navigator&&"find"in[]&&function()
<script type="text/javascript" src="https://secure.louisvuitton.com/akam/11/7f0e2ae6" defer=""></script>
<noscript><img src="https://secure.louisvuitton.com/akam/11/pixel_7f0e2ae6?a=dD0xOWNjNTRjMmMxYzdmNmMwZjI0NTUwOGZmZDM5ZTQzMWQ5NjI5ZmIwJmpzPW9mZg==" style="visibility: hidden; position: absolute; left: -999px; top: -999px;" /></noscript>
This is a clear indication that the website is protected by Bot Manager, an advanced bot detection service provided by Akamai, and that the response gets blocked.
Bot Manager
As per the article Bot Manager - Foundations:
Conclusion
So it can be concluded that the request for the data is detected as being performed by a Selenium-driven WebDriver instance, and the response is blocked.
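The manual DOM inspection described in this answer can be approximated in code. The sketch below, using only the standard library, lists script sources whose path contains /akam/, the pattern seen in the example tags quoted in this answer. The function and class names are hypothetical, and the /akam/ path test is an assumption drawn from that single example rather than a documented signature.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class _ScriptSrcCollector(HTMLParser):
    """Collect the src attribute of every <script> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def find_akam_scripts(page_source: str) -> list:
    """Return script URLs whose path contains '/akam/'."""
    collector = _ScriptSrcCollector()
    collector.feed(page_source)
    collector.close()
    return [src for src in collector.sources if "/akam/" in urlparse(src).path]
```

With Selenium, find_akam_scripts(driver.page_source) gives a quick hint about whether a page loads such scripts before you spend time tuning browser options.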
References
A couple of documentations:
Bot Manager
Bot Manager : Foundations
tl; dr
A couple of relevant discussions:
Selenium webdriver: Modifying navigator.webdriver flag to prevent selenium detection
Clicking on Get Data button for Monthly Settlement Statistics on nseindia.com doesn't fetch results using Selenium and Python
It's been a while since I posted this question, but if anyone is interested, below are the steps I took to solve the problem.
Open chromedriver.exe in hex editor, find the string $cdc and replace with something else of the same length. Then save and run modified binary. Read more in this answer and the replies to it.
Selenium python code:
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_argument('--disable-blink-features=AutomationControlled')
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options, executable_path='chromedriver.exe')
# Redefine navigator.webdriver in the current document.
driver.execute_script("Object.defineProperty(navigator, 'webdriver', {get: () => undefined})")
# Override the user agent through the Chrome DevTools Protocol.
driver.execute_cdp_cmd('Network.setUserAgentOverride', {"userAgent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                                                                     'AppleWebKit/537.36 (KHTML, like Gecko) '
                                                                     'Chrome/85.0.4183.102 Safari/537.36'})
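To check whether overrides like these actually took effect in the loaded page, the values can be read back from the browser. A small sketch, assuming driver offers Selenium's execute_script; read_back_overrides is a hypothetical helper name.

```python
def read_back_overrides(driver, expected_user_agent):
    """Read navigator.webdriver and navigator.userAgent from the current page."""
    webdriver_flag = driver.execute_script("return navigator.webdriver")
    user_agent = driver.execute_script("return navigator.userAgent")
    return {
        "webdriver_flag": webdriver_flag,
        # JavaScript undefined arrives as None; False is what a normal browser reports.
        "webdriver_hidden": not webdriver_flag,
        "user_agent_matches": user_agent == expected_user_agent,
    }
```

Because execute_script runs in the current document, it is probably worth calling this again after each navigation rather than assuming the earlier result still holds.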
For me it worked when I added the following line just after launching a driver:
driver.manage().deleteAllCookies();
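The line above is Java. In the Python bindings the corresponding method is driver.delete_all_cookies(). A minimal sketch (the function name get_with_clean_cookies is hypothetical; driver is any object with Selenium's delete_all_cookies and get methods):

```python
def get_with_clean_cookies(driver, url):
    """Clear the session's cookies, then load the page."""
    driver.delete_all_cookies()
    driver.get(url)
```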
Hi guys, I'm running a crawling script with Selenium and Python. I want to run Chrome in headless mode, so I set the headless option to True as below:
from selenium.webdriver.chrome.options import Options
from selenium import webdriver
options = Options()
options.headless = True
options.add_argument("--start-maximized")
options.add_argument("--window-size=1920,1080")
options.add_argument("--disable-gpu")
options.add_argument("--no-sandbox")
driver = webdriver.Chrome('chromedriver.exe', options=options)
But when running the script, Chrome returns the mobile version of the website (I captured a screenshot to check). Because of this, my script cannot run properly.
I've tried many ways to change it back to the desktop website: adding arguments like "--window-size=1920,1080" and "--start-maximized", then calling browser.maximize_window() and browser.set_window_size(). I also tried different chromedriver versions, but nothing works.
Can anyone help me please? Many thanks.
Yep, I had a very similar issue. The first thing you need to do is manually identify your user agent, check out this site. For example, your user agent could be a long string describing a Safari browser running on macOS that renders web pages using the WebKit engine.
Now go ahead and add an option to manually set your user agent. The value after user-agent= is the user agent string itself, without a "User-Agent:" label:
options.add_argument("user-agent=your user agent string here")
An example might look like this:
options.add_argument("user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/507.06 Safari/507.06")
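Chrome uses whatever follows user-agent= verbatim as the user agent, so a header label pasted along with the string ends up inside the value. A small helper can guard against that; user_agent_argument is a hypothetical name, and the sketch only builds the string to pass to options.add_argument.

```python
def user_agent_argument(user_agent: str) -> str:
    """Build the Chrome command-line switch for a user agent string."""
    value = user_agent.strip()
    prefix = "user-agent:"
    # Drop a leading "User-Agent:" label if one was pasted along with the value.
    if value.lower().startswith(prefix):
        value = value[len(prefix):].strip()
    if not value:
        raise ValueError("user agent string is empty")
    return f"--user-agent={value}"
```

Usage would then be options.add_argument(user_agent_argument(my_user_agent)), where my_user_agent is the string you identified for your own browser.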
I am trying to scrape https://www.controller.com/ with Python. Since the page detected a bot when I used pandas.get_html, and also when I used requests with user agents and a rotating proxy, I resorted to Selenium WebDriver. However, this is also being detected as a bot, with the following message. Can anybody explain how I can get past this?
Pardon Our Interruption...
As you were browsing www.controller.com something about your browser made us think you were a bot. There are a few reasons this might happen:
You're a power user moving through this website with super-human speed.
You've disabled JavaScript in your web browser.
A third-party browser plugin, such as Ghostery or NoScript, is preventing JavaScript from running. Additional information is available in this support article.
To request an unblock, please fill out the form below and we will review it as soon as possible.
Here is my code:
from selenium import webdriver
import requests
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_argument("disable-infobars")
options.add_argument("--disable-extensions")
#options.add_argument('headless')
driver = webdriver.Chrome(chrome_options=options)
driver.get('https://www.controller.com/')
driver.implicitly_wait(30)
You mention pandas.get_html only in the text of your question, and options.add_argument('headless') only as a commented-out line in your code, so I'm not sure whether you are actually using them. Taking the minimum code from your attempt:
Code Block:
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_argument("disable-infobars")
options.add_argument("--disable-extensions")
driver = webdriver.Chrome(chrome_options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
driver.get('https://www.controller.com/')
print(driver.title)
I have faced the same issue.
Browser Snapshot:
When I inspected the HTML DOM, I observed that the website references distil_referrer in the window.onbeforeunload handler, as follows:
<script type="text/javascript" id="">
window.onbeforeunload=function(a){"undefined"!==typeof sessionStorage&&sessionStorage.removeItem("distil_referrer")};
</script>
Snapshot:
This is a clear indication that the website is protected by the bot management service provider Distil Networks, and that navigation by ChromeDriver gets detected and subsequently blocked.
Distil
As per the article There Really Is Something About Distil.it...:
Distil protects sites against automatic content scraping bots by observing site behavior and identifying patterns peculiar to scrapers. When Distil identifies a malicious bot on one site, it creates a blacklisted behavioral profile that is deployed to all its customers. Something like a bot firewall, Distil detects patterns and reacts.
Further,
"One pattern with Selenium was automating the theft of Web content", Distil CEO Rami Essaid said in an interview last week. "Even though they can create new bots, we figured out a way to identify Selenium the a tool they're using, so we're blocking Selenium no matter how many times they iterate on that bot. We're doing that now with Python and a lot of different technologies. Once we see a pattern emerge from one type of bot, then we work to reverse engineer the technology they use and identify it as malicious".
Reference
You can find a couple of detailed discussions in:
Distil detects WebDriver driven Chrome Browsing Context
Selenium webdriver: Modifying navigator.webdriver flag to prevent selenium detection
Akamai Bot Manager detects WebDriver driven Chrome Browsing Context
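The tell-tale strings quoted in this answer can be checked for programmatically before scraping further. A rough sketch: the marker tuple is an assumption assembled only from the examples on this page (the distil_referrer script and the "Pardon Our Interruption" message), not an official or complete list, and distil_markers_found is a hypothetical name.

```python
# Markers taken from the examples quoted on this page; not exhaustive.
DISTIL_MARKERS = ("distil_referrer", "Pardon Our Interruption")


def distil_markers_found(page_source: str) -> list:
    """Return the known markers that occur in the page source (case-insensitive)."""
    lowered = page_source.lower()
    return [marker for marker in DISTIL_MARKERS if marker.lower() in lowered]
```

An empty list does not prove the page is unprotected; it only means none of these particular strings were present in driver.page_source.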
Distil can detect if you are headless by doing some fingerprinting using html5 canvas. They also check things like browser plugins and user-agent. Selenium sets some browser flags that are also detectable.
Finally solved the problem and headless mode works as well.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
chrome_options = Options()
chrome_options.add_argument("--disable-extensions")
chrome_options.add_argument("--headless")
driver = webdriver.Chrome("chromedriver.exe", options=chrome_options)
driver.execute_cdp_cmd('Network.setUserAgentOverride', {"userAgent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.53 Safari/537.36'})
driver.execute_script("Object.defineProperty(navigator, 'webdriver', {get: () => undefined})")
I have the following code in Python:
from selenium.webdriver import Firefox
from contextlib import closing
with closing(Firefox()) as browser:
    browser.get(url)
I would like to print the user-agent HTTP header and possibly change it. Is it possible?
There is no way in Selenium to read the request or response headers. You could do it by instructing your browser to connect through a proxy that records this kind of information.
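If all you need is to see which headers the browser sends, a lighter-weight alternative to a recording proxy is to point the browser at a small local server that logs request headers. This is not a proxy: it only shows headers for requests made to this local address, not to the real site. The sketch uses only the standard library; HeaderEchoHandler and start_header_echo_server are hypothetical names.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HeaderEchoHandler(BaseHTTPRequestHandler):
    """Record the headers of each GET request and echo them back as text."""

    def do_GET(self):
        # Lower-case the names so lookups do not depend on the client's casing.
        headers = {name.lower(): value for name, value in self.headers.items()}
        self.server.seen_headers.append(headers)
        body = "\n".join(f"{k}: {v}" for k, v in headers.items()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the console quiet


def start_header_echo_server(host="127.0.0.1", port=0):
    """Start the server on a background thread; port=0 picks a free port."""
    server = HTTPServer((host, port), HeaderEchoHandler)
    server.seen_headers = []
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server
```

With a running browser session, browser.get(f"http://127.0.0.1:{server.server_port}/") followed by server.seen_headers[-1].get("user-agent") shows the header as sent; call server.shutdown() when done.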
Setting the User Agent in Firefox
The usual way to change the user agent for Firefox is to set the variable "general.useragent.override" in your Firefox profile. Note that this is independent from Selenium.
You can direct Selenium to use a profile different from the default one, like this:
from selenium import webdriver
profile = webdriver.FirefoxProfile()
profile.set_preference("general.useragent.override", "whatever you want")
driver = webdriver.Firefox(profile)
Setting the User Agent in Chrome
With Chrome, what you want to do is use the user-agent command line option. Again, this is not a Selenium thing. You can invoke Chrome at the command line with chrome --user-agent=foo to set the agent to the value foo.
With Selenium you set it like this:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
opts = Options()
opts.add_argument("user-agent=whatever you want")
driver = webdriver.Chrome(chrome_options=opts)
Both methods above were tested and found to work. I don't know about other browsers.
Getting the User Agent
Selenium does not have methods to query the user agent from an instance of WebDriver. Even in the case of Firefox, you cannot discover the default user agent by checking what general.useragent.override would be if not set to a custom value. (This setting does not exist before it is set to some value.)
Once the browser is started, however, you can get the user agent by executing:
agent = driver.execute_script("return navigator.userAgent")
The agent variable will contain the user agent.
To build on Louis's helpful answer...
Setting the User Agent in PhantomJS
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
...
caps = DesiredCapabilities.PHANTOMJS
caps["phantomjs.page.settings.userAgent"] = "whatever you want"
driver = webdriver.PhantomJS(desired_capabilities=caps)
The only minor issue is that, unlike for Firefox and Chrome, this does not return your custom setting:
driver.execute_script("return navigator.userAgent")
So, if anyone figures out how to do that in PhantomJS, please edit my answer or add a comment below! Cheers.
This is a short solution to change the request UserAgent on the fly.
Change UserAgent of a request with Chrome
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
driver = webdriver.Chrome(driver_path)
driver.execute_cdp_cmd('Network.setUserAgentOverride', {"userAgent":"python 2.7", "platform":"Windows"})
driver.get('http://amiunique.org')
Then read back your user agent:
agent = driver.execute_script("return navigator.userAgent")
Some sources
The source code of webdriver.py from SeleniumHQ (https://github.com/SeleniumHQ/selenium/blob/11c25d75bd7ed22e6172d6a2a795a1d195fb0875/py/selenium/webdriver/chrome/webdriver.py) extends its functionalities through the Chrome Devtools Protocol
def execute_cdp_cmd(self, cmd, cmd_args):
    """
    Execute Chrome Devtools Protocol command and get returned result
We can use the Chrome Devtools Protocol Viewer to list more extended functionalities (https://chromedevtools.github.io/devtools-protocol/tot/Network#method-setUserAgentOverride) as well as the parameters type to use.
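As an illustration of reading the parameter list from the protocol viewer: Network.setUserAgentOverride takes a required userAgent plus optional platform and acceptLanguage values. The helper names below are hypothetical; the sketch only assembles the argument dict and passes it to Selenium's execute_cdp_cmd, which is available on Chromium-based drivers.

```python
def user_agent_override_params(user_agent, platform=None, accept_language=None):
    """Build the argument dict for the CDP command Network.setUserAgentOverride."""
    if not user_agent:
        raise ValueError("userAgent is required by Network.setUserAgentOverride")
    params = {"userAgent": user_agent}
    # Optional fields are only included when a value was supplied.
    if platform is not None:
        params["platform"] = platform
    if accept_language is not None:
        params["acceptLanguage"] = accept_language
    return params


def apply_user_agent_override(driver, user_agent, platform=None, accept_language=None):
    """Send the override through the driver's execute_cdp_cmd method."""
    params = user_agent_override_params(user_agent, platform, accept_language)
    driver.execute_cdp_cmd("Network.setUserAgentOverride", params)
    return params
```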
FirefoxProfile is deprecated; set the preference through Firefox options instead, like this:
from selenium.webdriver import FirefoxOptions
opts = FirefoxOptions()
opts.add_argument("--headless")
opts.add_argument("--width=800")
opts.add_argument("--height=600")
# The preference value is the user agent string itself.
opts.set_preference(
    "general.useragent.override",
    "Mozilla/5.0 (iPhone; CPU iPhone OS 15_4 like Mac OS X) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) CriOS/101.0.4951.44 Mobile/15E148 Safari/604.1")
To build on JJC's helpful answer that builds on Louis's helpful answer...
With PhantomJS 2.1.1-windows this line works:
driver.execute_script("return navigator.userAgent")
If it doesn't work, you can still get the user agent via the log (to build on Mma's answer):
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
import json
from fake_useragent import UserAgent
dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = UserAgent().random
driver = webdriver.PhantomJS(executable_path=r"your_path", desired_capabilities=dcap)
har = json.loads(driver.get_log('har')[0]['message']) # get the log
print('user agent: ', har['log']['entries'][0]['request']['headers'][1]['value'])
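Picking the header by position (headers[1]) depends on the order in which the headers happen to be logged. Since each HAR header entry carries a name and a value, looking it up by name is a little more robust. A minimal sketch; request_header_from_har is a hypothetical helper operating on the already-parsed har dict.

```python
def request_header_from_har(har: dict, name: str, entry_index: int = 0):
    """Return the value of a request header from a HAR dict, or None if absent."""
    headers = har["log"]["entries"][entry_index]["request"]["headers"]
    wanted = name.lower()
    for header in headers:
        # Header names are case-insensitive, so compare in lower case.
        if header["name"].lower() == wanted:
            return header["value"]
    return None
```

With the code above this would be used as print('user agent: ', request_header_from_har(har, 'User-Agent')).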