Docker Selenium Chromedriver: Unfortunately, automated access to this page was denied - python

I am using selenium chromedriver in my python project.
The application is running under Docker.
When I try to access the http://mobile.de website, I get rejected with the message:
Unfortunately, automated access to this page was denied.
Here is my initialization code:
CHROME_DRIVER_PATH = os.path.abspath('assets/chromedriver')
chrome_options = ChromeOptions()
chrome_options.binary_location = "/usr/bin/google-chrome"
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
self.web_driver_chrome = webdriver.Chrome(executable_path=CHROME_DRIVER_PATH, options=chrome_options)
And here is my send request code:
def get_page_content(self, url):
    url = "https://www.mobile.de/"
    self.web_driver_chrome.get(url)
    print(self.web_driver_chrome.page_source)
    return self.web_driver_chrome.page_source
Is there any way I can get past this "automated access check"?

When using --headless, Chrome appends HeadlessChrome to the user agent:
Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/71.0.3578.98 Safari/537.36
The solution is to add an argument that sets a normal user agent:
user_agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
chrome_options.add_argument('user-agent=' + user_agent)
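Putting that together with the initialization from the question, a minimal sketch of the full setup (Selenium 3 style, where executable_path is still accepted; paths are taken from the question and may need adjusting for your container):
import os
from selenium import webdriver
from selenium.webdriver.chrome.options import Options as ChromeOptions

CHROME_DRIVER_PATH = os.path.abspath('assets/chromedriver')
user_agent = ('Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36')

chrome_options = ChromeOptions()
chrome_options.binary_location = "/usr/bin/google-chrome"
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
# Replace the HeadlessChrome user agent with a regular one.
chrome_options.add_argument('user-agent=' + user_agent)

driver = webdriver.Chrome(executable_path=CHROME_DRIVER_PATH, options=chrome_options)
driver.get("https://www.mobile.de/")
# Quick check: the reported user agent should no longer contain "HeadlessChrome".
print(driver.execute_script("return navigator.userAgent;"))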

Related

Scrape data from a paginated website which was blocking the request when I use BeautifulSoup, Selenium

I want to scrape data from this site: https://www.findhelp.org/care/support-network--san-francisco-ca?postal=94105. I tried using Selenium and BeautifulSoup, but I am not able to scrape the data because the site is blocking scrapers. I want to find a way to scrape the data on this site.
Any solution to this problem would be highly appreciated.
I tried these two approaches:
import requests
from bs4 import BeautifulSoup
url="https://www.findhelp.org/care/support-network--san-francisco-ca?postal=94105"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246'}
reqs = requests.get(url, headers=headers)
soup = BeautifulSoup(reqs.text, 'lxml')
print(soup)
Second one:
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
#object of Options class
op = webdriver.ChromeOptions()
#add user Agent
op.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246")
#set chromedriver.exe path
driver = webdriver.Chrome(executable_path=r"C:\Users\Vinay Edula\Desktop\evva\findhelp\chromedriver.exe",options=op)
#maximize browser
driver.maximize_window()
#launch URL
driver.get("https://www.findhelp.org/search_results/94105")
time.sleep(15)
driver.quit()
I was able to get into the site by passing this user agent in the request headers:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246
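For reference, a minimal sketch of that fix applied to the first snippet in the question, with a status-code check added so you can tell whether the block page or the real page came back (the expected status codes are an assumption, not something from the original answer):
import requests
from bs4 import BeautifulSoup

url = "https://www.findhelp.org/care/support-network--san-francisco-ca?postal=94105"
headers = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                   '(KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246')
}

resp = requests.get(url, headers=headers, timeout=30)
print(resp.status_code)        # typically 200 once the block is bypassed, 403 otherwise
soup = BeautifulSoup(resp.text, 'lxml')
print(soup.title)              # quick sanity check that real content came back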

How to run a script in headless mode

There is a site-parsing script written in Python with Selenium. If I run it in headless mode, so that the browser window does not open, it cannot find the desired element or scrape the information from it. If I run it without headless mode, it works fine.
import time
from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_argument("--headless")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options, executable_path=r"chromedriver.exe")
driver.get('https://www.wildberries.ru/catalog/38862450/detail.aspx?targetUrl=SP')
time.sleep(20)
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(20)
element = driver.find_element(by=By.CLASS_NAME, value="address-rate-mini")
btn = driver.find_elements(by=By.CLASS_NAME, value="btn-base")
It can't find the btn element I need.
Browser Detection:
Some webpages can detect that you are scraping them by looking at your user agent. A normal user agent looks like this:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36
However, in headless mode, your user agent looks like this:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/95.0.4638.69 Safari/537.36
Whatever site you are scraping probably detected the "Headless" tag and is likely restricting the btn element from you.
Solution:
A simple solution to this would be to add the following Selenium option to programmatically change your user agent:
options.add_argument('--user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36"')
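Putting it together, a minimal sketch of the question's option set with the user-agent override added (assuming chromedriver is resolvable on PATH or via Selenium 4's manager; the version numbers are only illustrative):
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_argument("--headless")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
# Replace the HeadlessChrome user agent with a regular one.
options.add_argument('--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                     'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36')

driver = webdriver.Chrome(options=options)
# Verify the override took effect before looking for the btn-base elements.
print(driver.execute_script("return navigator.userAgent;"))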
More Information:
Python selenium headless mode missing elements

HotJar suspicious UserAgent error, nothing on google, Trying to run a python scraper tool for sports odds tracking

c:\Users\Administrator\Downloads\cloudbet(2).py:47: DeprecationWarning: use options instead of chrome_options
driver = webdriver.Chrome(chrome_options=chrome_options, executable_path=r"C:\Python\Python38\Scripts\chromedriver.exe")
DevTools listening on ws://127.0.0.1:54368/devtools/browser/77f2d50d-65c6-49bf-af5b-7923f3d40bfc
[0619/100243.424:INFO:CONSOLE(109)] "Hotjar not launching due to suspicious userAgent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/84.0.4147.56 Safari/537.36", source: https://www.cloudbet.com/static/js/2.08e072ab.chunk.js (109)
I have faced the same issue with some websites. It is related to running the Google Chrome webdriver in headless mode, which shows in your post details:
"Hotjar not launching due to suspicious userAgent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/84.0.4147.56 Safari/537.36"
The following options could help in resolving the issue:
Running the Chrome driver in normal mode instead of headless, i.e. removing the chrome_options.add_argument("--headless") option.
Running the Chrome driver in headless mode, but overriding the auto-generated user agent with a normal-mode user agent by adding the following argument (as an example of that scenario):
userAgent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.56 Safari/537.36"
chrome_options.add_argument(f'user-agent={userAgent}')
Trying headless Firefox instead (a minimal sketch follows below).
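A minimal sketch of that last option, assuming geckodriver is available on PATH and using a Selenium 3 style profile override (the Firefox user-agent string is only illustrative):
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.add_argument("-headless")

# Optionally also override the user agent via a profile preference.
profile = webdriver.FirefoxProfile()
profile.set_preference('general.useragent.override',
                       'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:90.0) Gecko/20100101 Firefox/90.0')

driver = webdriver.Firefox(firefox_profile=profile, options=options)
driver.get("https://www.cloudbet.com/")
print(driver.execute_script("return navigator.userAgent;"))
driver.quit()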
There is also a Cloudbet Odds API available at https://docs.cloudbet.com/ that should make it easier to access the odds.
Is there anything missing that you're looking for?

How to rotate various user agents using selenium python on each request

I want to make 10 requests to https://www.google.com/ but with random user agents, using Selenium and Python. I have a loop, and inside that loop I make 10 requests with random user agents (using fake-useragent). The main problem is that for every request the web driver opens a new instance of Google Chrome, and I want to do this in one single instance but with different user agents. How can I make this possible? One Google Chrome instance and 10 requests with 10 random user agents. Here is my code:
chrome_options = Options()
chrome_options.add_argument('no-sandbox')
chrome_options.add_argument("--start-maximized")
ua = UserAgent()
for i in range(0, 10):
    userAgent = ua.random
    chrome_options.add_argument('--user-agent="' + str(userAgent) + '"')
    driver1 = webdriver.Chrome(chrome_options=chrome_options,
                               executable_path="C:/Python34/chromedriver")
    driver1.get('https://www.google.com/')
    time.sleep(5)
First, the update (applicable only to the Selenium Python client):
execute_cdp_cmd(): With the availability of the execute_cdp_cmd(cmd, cmd_args) command, you can now easily execute google-chrome-devtools commands through Selenium. Using this feature you can modify the user agent to help prevent Selenium from getting detected.
Code Block:
from selenium import webdriver
driver = webdriver.Chrome(executable_path=r'C:\WebDrivers\chromedriver.exe')
print(driver.execute_script("return navigator.userAgent;"))
# Setting user agent as Chrome/83.0.4103.97
driver.execute_cdp_cmd('Network.setUserAgentOverride', {"userAgent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36'})
print(driver.execute_script("return navigator.userAgent;"))
# Setting user agent as Chrome/83.0.4103.53
driver.execute_cdp_cmd('Network.setUserAgentOverride', {"userAgent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.53 Safari/537.36'})
print(driver.execute_script("return navigator.userAgent;"))
driver.get('https://www.httpbin.org/headers')
Console Output:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.53 Safari/537.36
Originally answered Nov 6 '18 at 8:00
No. When you configure an instance of ChromeDriver with ChromeOptions to initiate a new Chrome browser session, the configuration of the ChromeDriver remains unchanged throughout the lifetime of the ChromeDriver and cannot be edited afterwards. So you can't change the user agent while the WebDriver instance is executing the loop making 10 requests.
Even if you manage to extract the ChromeDriver and Chrome session attributes, e.g. the user agent, session ID, cookies and other session attributes, from the already-initiated browsing session, you still won't be able to change those attributes of the ChromeDriver.
A cleaner way would be to call driver.quit() within the tearDown(){} method to close and destroy the ChromeDriver and Chrome browser instances gracefully, and then spawn a new set of ChromeDriver and Chrome browser instances with the new configuration.
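As a sketch of that pattern (the echo URL is illustrative; fake_useragent is assumed installed, as in the question; Selenium 4 style so the driver binary is resolved automatically):
import time
from fake_useragent import UserAgent
from selenium import webdriver
from selenium.webdriver.common.by import By

ua = UserAgent()

for i in range(10):
    options = webdriver.ChromeOptions()
    options.add_argument('--no-sandbox')
    options.add_argument('user-agent=' + ua.random)   # fresh user agent for this iteration
    driver = webdriver.Chrome(options=options)        # brand-new browser with the new configuration
    try:
        driver.get('https://www.httpbin.org/headers') # echoes the request headers back
        print(driver.find_element(By.TAG_NAME, 'body').text)
        time.sleep(1)
    finally:
        driver.quit()                                 # tear down gracefully before the next iteration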
Here you can find a relevant discussion on How can I reconnect to the browser opened by webdriver with selenium?
Reference
You can find a couple of relevant detailed discussions in:
How to change the User Agent using Selenium and Python
Selenium webdriver: Modifying navigator.webdriver flag to prevent selenium detection
Yes. It is possible now with CDP:
driver.execute_cdp_cmd("Network.enable", {})
driver.execute_cdp_cmd("Network.setExtraHTTPHeaders", {"headers": {"User-Agent": "browser1"}})
driver.execute_cdp_cmd("Network.setExtraHTTPHeaders", {"headers": {"User-Agent": "browser2"}})
driver.execute_cdp_cmd("Network.setExtraHTTPHeaders", {"headers": {"User-Agent": "browser3"}})
driver.get('https://www.httpbin.org/headers')
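Applied to the original question (ten requests through one Chrome instance), a sketch that rotates the user agent in place with the Network.setUserAgentOverride command shown earlier (fake_useragent assumed installed):
import time
from fake_useragent import UserAgent
from selenium import webdriver

ua = UserAgent()
driver = webdriver.Chrome()   # one single browser instance

for i in range(10):
    # Override the user agent via the Chrome DevTools Protocol before each request.
    driver.execute_cdp_cmd('Network.setUserAgentOverride', {"userAgent": ua.random})
    driver.get('https://www.httpbin.org/headers')   # echoes the headers, handy for verification
    print(driver.execute_script("return navigator.userAgent;"))
    time.sleep(1)

driver.quit()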
It opens 10 Chrome instances because you did not close() them; try:
...
...
driver1.get('https://www.whatsmyua.info/')
time.sleep(5)
driver1.close()

Changing User-Agent String for Every Get

I'm using the following code to change the user-agent string, but I'm wondering whether or not this will change the user-agent string for each and every browser.get request?
ua_strings = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/11.1.1 Safari/605.1.15',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36',
    ...
]

def parse(self, response):
    profile = webdriver.FirefoxProfile()
    profile.set_preference('general.useragent.override', random.choice(ua_strings))
    options = Options()
    options.add_argument('-headless')
    browser = webdriver.Firefox(profile, firefox_options=options)
    browser.get(self.start_urls[0])
    hrefs = WebDriverWait(browser, 60).until(
        EC.visibility_of_all_elements_located((By.XPATH, '//div[@class="discoverableCard"]/a'))
    )
    pages = []
    for href in hrefs:
        pages.append(href.get_attribute('href'))
    for page in pages:
        browser.get(page)
        """ scrape page """
    browser.close()
Or will I have to browser.close() and then create new instances of browser in order to use new user-agent strings for each request?
for page in pages:
    browser = webdriver.Firefox(profile, firefox_options=options)
    browser.get(page)
    """ scrape page """
    browser.close()
Since random.choice() is called only once, the user-agent string remains the same for all browser.get() requests. To ensure a constantly random user agent, you can create a set_preferences() function which you call on every loop iteration.
def set_preferences(self):
    user_agent_string = random.choice(ua_strings)
    # print the user agent on each loop iteration
    print(user_agent_string)
    profile = webdriver.FirefoxProfile()
    profile.set_preference('general.useragent.override', user_agent_string)
    options = Options()
    options.add_argument('-headless')
    browser = webdriver.Firefox(profile, firefox_options=options)
    return browser
Then your loop can be something like this:
for page in pages:
    browser = self.set_preferences()
    browser.get(page)
    """ scrape page """
    browser.close()
Hope this helps!
