Error using webdriver with headless chrome to download file - python

I want to use headless Chrome with ChromeDriver to download a PDF. Everything works fine when I download the PDF without headless mode. Here is part of my driver setup code:
options = webdriver.ChromeOptions()
prefs = {'profile.default_content_settings.popups': 0,
         "plugins.plugins_list": [{"enabled": False, "name": "Chrome PDF Viewer"}],  # Disable Chrome's PDF Viewer
         'download.default_directory': download_dir,
         "download.extensions_to_open": "applications/pdf"}
options.add_argument('--headless')
options.add_argument('--disable-gpu')
options.add_argument('--no-sandbox')
options.add_argument('--ignore-certificate-errors')
options.add_experimental_option('prefs', prefs)
driver.command_executor._commands["send_command"] = ("POST", '/session/$sessionId/chromium/send_command')
params = {'cmd': 'Page.setDownloadBehavior', 'params': {'behavior': 'allow', 'downloadPath': download_dir}}
driver.execute("send_command", params)
driver.get(url)
try:
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, 'check-pdf')))
finally:
    driver.find_element_by_class_name('check-pdf').click()
The error shows up when I run this file in cmd.
[0623/130628.966:INFO:CONSOLE(7)] "A parser-blocking, cross site (i.e. different eTLD+1) script, http://s11.cnzz.com/z_stat.php?id=1261865322, is invoked via document.write. The network request for this script MAY be blocked by the browser in this or a future page load due to poor network connectivity. If blocked in this page load, it will be confirmed in a subsequent console message. See https://www.chromestatus.com/feature/5718547946799104 for more details.", source: http://utrack.hexun.com/dp/hexun_uweb.js (7)
[0623/130628.968:INFO:CONSOLE(7)] "A parser-blocking, cross site (i.e. different eTLD+1) script, http://s11.cnzz.com/z_stat.php?id=1261865322, is invoked via document.write. The network request for this script MAY be blocked by the browser in this or a future page load due to poor network connectivity. If blocked in this page load, it will be confirmed in a subsequent console message. See https://www.chromestatus.com/feature/5718547946799104 for more details.", source: http://utrack.hexun.com/dp/hexun_uweb.js (7)
[0623/130628.974:INFO:CONSOLE(16)] "A parser-blocking, cross site (i.e. different eTLD+1) script, http://c.cnzz.com/core.php?web_id=1261865322&t=z, is invoked via document.write. The network request for this script MAY be blocked by the browser in this or a future page load due to poor network connectivity. If blocked in this page load, it will be confirmed in a subsequent console message. See https://www.chromestatus.com/feature/5718547946799104 for more details.", source: https://s11.cnzz.com/z_stat.php?id=1261865322 (16)
[0623/130628.976:INFO:CONSOLE(16)] "A parser-blocking, cross site (i.e. different eTLD+1) script, http://c.cnzz.com/core.php?web_id=1261865322&t=z, is invoked via document.write. The network request for this script MAY be blocked by the browser in this or a future page load due to poor network connectivity. If blocked in this page load, it will be confirmed in a subsequent console message. See https://www.chromestatus.com/feature/5718547946799104 for more details.", source: https://s11.cnzz.com/z_stat.php?id=1261865322 (16)
[0623/130629.038:INFO:CONSOLE(8)] "Uncaught ReferenceError: jQuery is not defined", source: http://img.hexun.com/zl/hx/index/js/appDplus.js (8)
[0623/130629.479:WARNING:render_frame_host_impl.cc(2750)] OnDidStopLoading was called twice
I am wondering what the error message means and how I can fix it.
Any ideas would be helpful!
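For what it's worth, the bracketed prefix on each line of the output above carries a severity field, and every line shown is tagged INFO or WARNING rather than ERROR. A minimal sketch for tallying severities from captured output, assuming the prefix shape seen in this particular log (the helper name count_severities is hypothetical, not part of Selenium or Chrome):

```python
import re
from collections import Counter

# Matches the prefix shape seen in the log above, e.g.
# [0623/130629.479:WARNING:render_frame_host_impl.cc(2750)]
PREFIX = re.compile(r"^\[\d{4}/\d{6}\.\d{3}:([A-Z]+):[^\]]+\]")


def count_severities(lines):
    """Count log lines per severity; lines without the prefix are skipped."""
    counts = Counter()
    for line in lines:
        match = PREFIX.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Two lines copied from the output in the question.
sample = [
    '[0623/130629.038:INFO:CONSOLE(8)] "Uncaught ReferenceError: jQuery is not defined", '
    "source: http://img.hexun.com/zl/hx/index/js/appDplus.js (8)",
    "[0623/130629.479:WARNING:render_frame_host_impl.cc(2750)] OnDidStopLoading was called twice",
]
print(dict(count_severities(sample)))
```

Other Chrome builds may add process and thread ids to the prefix, in which case the pattern would need adjusting.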

Related

How to login to website which is detecting bot usage using Selenium [duplicate]

I am running the Chrome driver over Selenium on an Ubuntu server behind a residential proxy network. Yet my Selenium session is being detected. Is there a way to make the Chrome driver and Selenium 100% undetectable?
I have been trying for so long that I have lost track of the many things I have done, including:
Trying different versions of Chrome
Adding several flags and removing some words from the Chrome driver file.
Running it behind a proxy (residential ones also) using incognito mode.
Loading profiles.
Random mouse movements.
Randomising everything.
I am looking for a version of Selenium that is truly 100% undetectable, if such a thing exists, or another automation approach that is not detectable by bot trackers.
This is part of the browser startup code:
sx = random.randint(1000, 1500)
sn = random.randint(3000, 4500)
display = Display(visible=0, size=(sx,sn))
display.start()
randagent = random.randint(0,len(useragents_desktop)-1)
uag = useragents_desktop[randagent]
#this is to prevent ip leaking
preferences = {
    "webrtc.ip_handling_policy": "disable_non_proxied_udp",
    "webrtc.multiple_routes_enabled": False,
    "webrtc.nonproxied_udp_enabled": False,
}
chrome_options.add_experimental_option("prefs", preferences)
chrome_options.add_argument("--disable-dev-shm-usage")
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--disable-impl-side-painting")
chrome_options.add_argument("--disable-setuid-sandbox")
chrome_options.add_argument("--disable-seccomp-filter-sandbox")
chrome_options.add_argument("--disable-breakpad")
chrome_options.add_argument("--disable-client-side-phishing-detection")
chrome_options.add_argument("--disable-cast")
chrome_options.add_argument("--disable-cast-streaming-hw-encoding")
chrome_options.add_argument("--disable-cloud-import")
chrome_options.add_argument("--disable-popup-blocking")
chrome_options.add_argument("--ignore-certificate-errors")
chrome_options.add_argument("--disable-session-crashed-bubble")
chrome_options.add_argument("--disable-ipv6")
chrome_options.add_argument("--allow-http-screen-capture")
chrome_options.add_argument("--start-maximized")
wsize = "--window-size=" + str(sx-10) + ',' + str(sn-10)
chrome_options.add_argument(str(wsize) )
prefs = {"profile.managed_default_content_settings.images": 2}
chrome_options.add_experimental_option("prefs", prefs)
chrome_options.add_argument("blink-settings=imagesEnabled=true")
chrome_options.add_argument("start-maximized")
chrome_options.add_argument("user-agent="+uag)
chrome_options.add_extension(pluginfile)#this is for the residential proxy
driver = webdriver.Chrome(executable_path="/usr/bin/chromedriver", chrome_options=chrome_options)
Whether a Selenium-driven WebDriver gets detected doesn't depend on any specific Selenium, Chrome or ChromeDriver version. Websites themselves can inspect the network traffic and identify the browser client as WebDriver-controlled.
However, some generic approaches to avoid getting detected while web scraping are as follows:
One of the first attributes a website can use to identify your script/program is your monitor size, so it is recommended not to use the conventional viewport.
If you need to send multiple requests to a website, keep changing the user agent on each request. You can find a detailed discussion in Way to change Google Chrome user agent in Selenium?
To simulate human-like behavior, you may need to slow down script execution even beyond WebDriverWait and expected_conditions by inducing time.sleep(secs). Here you can find a detailed discussion on How to sleep webdriver in python for milliseconds
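As a rough illustration of the user-agent rotation mentioned above, here is a minimal sketch that picks one user agent per session and builds the matching Chrome argument. The pool contents and the helper name user_agent_argument are placeholders for illustration only; a real pool would hold current user-agent strings.

```python
import random

# Hypothetical pool of user-agent strings (placeholders, not a vetted list).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
]


def user_agent_argument(pool, rng=random):
    """Return a Chrome command-line argument selecting one user agent from pool."""
    return "--user-agent=" + rng.choice(pool)


argument = user_agent_argument(USER_AGENTS)
print(argument)
```

The resulting string would be passed to chrome_options.add_argument(...) before the driver is created, in the same way the question's startup code sets its user agent.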
Antoine Vastel, in his blog post Detecting Chrome Headless, mentions several approaches that distinguish the Chrome browser from a headless Chrome browser.
User agent: The user agent attribute is commonly used to detect the OS as well as the browser of the user. With Chrome version 59 it has the following value:
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/59.0.3071.115 Safari/537.36
A check for the presence of Chrome headless can be done through:
if (/HeadlessChrome/.test(window.navigator.userAgent)) {
    console.log("Chrome headless detected");
}
Plugins: navigator.plugins returns an array of plugins present in the browser. Typically, on Chrome we find default plugins, such as Chrome PDF viewer or Google Native Client. On the opposite, in headless mode, the array returned contains no plugin.
A check for the presence of Plugins can be done through:
if(navigator.plugins.length == 0) {
    console.log("It may be Chrome headless");
}
Languages: In Chrome two Javascript attributes enable to obtain languages used by the user: navigator.language and navigator.languages. The first one is the language of the browser UI, while the second one is an array of string representing the user’s preferred languages. However, in headless mode, navigator.languages returns an empty string.
A check for the presence of Languages can be done through:
if(navigator.languages == "") {
    console.log("Chrome headless detected");
}
WebGL: WebGL is an API to perform 3D rendering in an HTML canvas. With this API, it is possible to query for the vendor of the graphic driver as well as the renderer of the graphic driver. With a vanilla Chrome and Linux, we can obtain the following values for renderer and vendor: Google SwiftShader and Google Inc.. In headless mode, we can obtain Mesa OffScreen, which is the technology used for rendering without using any sort of window system and Brian Paul, which is the program that started the open source Mesa graphics library.
A check for the presence of WebGL can be done through:
var canvas = document.createElement('canvas');
var gl = canvas.getContext('webgl');
var debugInfo = gl.getExtension('WEBGL_debug_renderer_info');
var vendor = gl.getParameter(debugInfo.UNMASKED_VENDOR_WEBGL);
var renderer = gl.getParameter(debugInfo.UNMASKED_RENDERER_WEBGL);
if(vendor == "Brian Paul" && renderer == "Mesa OffScreen") {
    console.log("Chrome headless detected");
}
Not all Chrome headless will have the same values for vendor and renderer. Others keep values that could also be found on non headless version. However, Mesa Offscreen and Brian Paul indicates the presence of the headless version.
Browser features: Modernizr library enables to test if a wide range of HTML and CSS features are present in a browser. The only difference we found between Chrome and headless Chrome was that the latter did not have the hairline feature, which detects support for hidpi/retina hairlines.
A check for the presence of hairline feature can be done through:
if(!Modernizr["hairline"]) {
    console.log("It may be Chrome headless");
}
Missing image: The last on our list also seems to be the most robust, comes from the dimension of the image used by Chrome in case an image cannot be loaded. In case of a vanilla Chrome, the image has a width and height that depends on the zoom of the browser, but are different from zero. In a headless Chrome, the image has a width and an height equal to zero.
A check for the presence of Missing image can be done through:
var body = document.getElementsByTagName("body")[0];
var image = document.createElement("img");
image.src = "http://iloveponeydotcom32188.jg";
image.setAttribute("id", "fakeimage");
body.appendChild(image);
image.onerror = function(){
    if(image.width == 0 && image.height == 0) {
        console.log("Chrome headless detected");
    }
}
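To see what your own session reports for some of these checks, the probes can be run from Python through execute_script. The sketch below covers three of the checks quoted above. The languages probe is a slight adaptation that tests for an empty value by length, and FakeDriver is a hypothetical stand-in so the example runs without a browser; a real Selenium driver would be passed in its place.

```python
# JavaScript probes mirroring three of the checks quoted above.
PROBES = {
    "headless_user_agent": "return /HeadlessChrome/.test(window.navigator.userAgent);",
    "no_plugins": "return navigator.plugins.length === 0;",
    "no_languages": "return !navigator.languages || navigator.languages.length === 0;",
}


def headless_signals(driver):
    """Run each probe through driver.execute_script and return name -> bool."""
    return {name: bool(driver.execute_script(js)) for name, js in PROBES.items()}


class FakeDriver:
    """Stand-in for a Selenium driver so the sketch runs without a browser."""

    def execute_script(self, script):
        # Pretend that only the user-agent probe fires.
        return "HeadlessChrome" in script


print(headless_signals(FakeDriver()))
```

Any probe that comes back True is a signal a website could read in the same way.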
References
You can find a couple of similar discussions in:
How to bypass Google captcha with Selenium and python?
How to make Selenium script undetectable using GeckoDriver and Firefox through Python?
tl; dr
Selenium webdriver: Modifying navigator.webdriver flag to prevent selenium detection
How does recaptcha 3 know I'm using selenium/chromedriver?
Selenium and non-headless browser keeps asking for Captcha
Why not try undetected-chromedriver?
Optimized Selenium Chromedriver patch which does not trigger anti-bot services like Distill Network / Imperva / DataDome / Botprotect.io Automatically downloads the driver binary and patches it.
Tested until current chrome beta versions
Works also on Brave Browser and many other Chromium based browsers
Python 3.6++
You can install it with: pip install undetected-chromedriver
There are important things you should be aware of:
Due to the inner workings of the module, you need to browse programmatically (i.e. using .get(url)). Never use the GUI to navigate. Using your keyboard and mouse for navigation can cause detection! New tabs: same story. If you really need multiple tabs, open the tab with the blank page (hint: the URL is data:, including the comma, and yes, the driver accepts it) and do your thing as usual. If you follow these "rules" (actually it's the default behaviour), then you will have a great time for now.
In [1]: import undetected_chromedriver as uc
In [2]: driver = uc.Chrome()
In [3]: driver.execute_script('return navigator.webdriver')
Out[3]: True # Detectable
In [4]: driver.get('https://distilnetworks.com') # starts magic
In [5]: driver.execute_script('return navigator.webdriver') # now returns None: undetectable!
For Python with Chrome or Chromium-based browsers, there's Selenium-Profiles
It currently supports:
Overwrite device metrics with fake-profiles
Mobile and Desktop emulation
Undetected by Google, Cloudflare, ..
Modifying headers supported using Selenium-Interceptor
Touch Actions
proxies with authentication
making single POST, GET or other requests using driver.requests.fetch(url, options) (syntax)
Installation
pip install selenium-profiles
Example script
from selenium_profiles.driver import driver as mydriver
from selenium_profiles.profiles import profiles
mydriver = mydriver()
driver = mydriver.start(profiles.Windows()) # or .Android
# get url
driver.get('https://nowsecure.nl/#relax') # test undetectability
input("Press ENTER to exit: ")
driver.quit() # Execute on the End!
Notes:
The package is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, which means that if you want to use it for something commercial, you need to ask the author first.
Headless support currently isn't guaranteed, but you can use pyvirtualdisplay.
What about:
import random
import time
from selenium import webdriver
driver = webdriver.Chrome("C:\\Users\\DusEck\\Desktop\\chromedriver.exe")
username = "username" # data_user
password = "password" # data_pass
driver.get("https://www.depop.com/login/") # get URL
driver.find_element_by_xpath('/html/body/div[1]/div/div[3]/div[2]/button[2]').click() # Accept cookies
split_char_pw = [] # Empty lists
split_char = []
n = 1 # Splitter
# Split the username into n-character chunks and type them with random pauses
for index in range(0, len(username), n):
    split_char.append(username[index: index + n])
for user_letter in split_char:
    time.sleep(random.uniform(0.1, 0.8))
    driver.find_element_by_id("username").send_keys(user_letter)
# Same for the password
for index in range(0, len(password), n):
    split_char_pw.append(password[index: index + n])
for pw_letter in split_char_pw:
    time.sleep(random.uniform(0.1, 0.8))
    driver.find_element_by_id("password").send_keys(pw_letter)
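The same idea can be folded into one reusable helper. This is only a sketch: type_like_human and FakeElement are hypothetical names, the delay bounds are the ones used above, and FakeElement stands in for a Selenium WebElement so the example runs without a browser.

```python
import random
import time


def type_like_human(element, text, min_delay=0.1, max_delay=0.8,
                    sleep=time.sleep, rng=random):
    """Send text one character at a time, pausing a random interval before each."""
    for char in text:
        sleep(rng.uniform(min_delay, max_delay))
        element.send_keys(char)


class FakeElement:
    """Stand-in for a Selenium WebElement; records what was typed."""

    def __init__(self):
        self.typed = []

    def send_keys(self, keys):
        self.typed.append(keys)


field = FakeElement()
type_like_human(field, "username", sleep=lambda seconds: None)  # skip real waits here
print("".join(field.typed))
```

With a real driver, the element returned by the username or password lookup would be passed in place of FakeElement, and the default sleep would be left as it is.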

Selenium webdriver does not open the correct url, rather it opens a blank page

I am using selenium webdriver to try scrape information from realestate.com.au, here is my code:
from selenium.webdriver import Chrome
from bs4 import BeautifulSoup
path = r'C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe'  # raw string so backslashes are not treated as escapes
url = 'https://www.realestate.com.au/buy'
url2 = 'https://www.realestate.com.au/property-house-nsw-castle+hill-134181706'
webdriver = Chrome(path)
webdriver.get(url)
soup = BeautifulSoup(webdriver.page_source, 'html.parser')
print(soup)
It works fine with url, but when I try to open url2 the same way, it opens a blank page. I checked the console and got the following:
"Failed to load resource: the server responded with a status of 429 ()
about:blank:1 Failed to load resource: net::ERR_UNKNOWN_URL_SCHEME
149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/fingerprint:1 Failed to load resource: the server responded with a status of 404 ()"
After opening url, I tried searching for something on the site, which also led to a blank page, just like url2.
It looks like the www.realestate.com.au website is using an Akamai security tool.
A quick DNS lookup shows that www.realestate.com.au resolves to dualstack.realestate.com.au.edgekey.net.
They are most likely using the Bot Manager product (https://www.akamai.com/us/en/products/security/bot-manager.jsp). I have encountered this on another website recently.
Typically, rotating user agents and IP addresses (ideally using residential proxies) should do the trick. You want to load up the site with a "fresh" browser profile each time. You should also check out https://github.com/67-6f-64/akamai-sensor-data-bypass
I think you should try adding driver.implicitly_wait(10) before your get line; this adds an implicit wait in case the page loads too slowly for the driver to pull the site. You should also consider trying the Firefox webdriver, since this bug appears to affect only Chromium browsers.

How to use existing login token for telegram web using selenium webdriver

I'm trying to read telegram messages from https://web.telegram.org with selenium.
When I open https://web.telegram.org in Firefox I'm already logged in, but when I open the same page from Selenium WebDriver (Firefox) I get the login page.
I saw that Telegram Web does not use cookies for auth but rather saves values in local storage. I can access the local storage with Selenium and it has keys such as "dc2_auth_key", "dc2_server_salt", "dc4_auth_key", ... but I'm not sure what to do with them in order to log in (and if I do need to do something with them, then why? It's the same browser, so why won't it work the same as when opening it without Selenium?)
To reproduce:
Open Firefox and log in to https://web.telegram.org, then run this code:
from selenium import webdriver
driver = webdriver.Firefox()
driver.get("https://web.telegram.org")
# my code is here but is irrelevant since im at the login page.
driver.close()
When you open https://web.telegram.org manually using Firefox, the Default Firefox Profile is used. As you log in and browse through the website, the website stores Authentication Cookies within your system. As the cookies get stored within the local storage of the Default Firefox Profile, you are automatically authenticated even when you reopen the browser.
But when GeckoDriver initiates a new web browsing session for your tests, a temporary new mozprofile is created every time Firefox is launched, which is evident from the following log:
mozrunner::runner INFO Running command: "C:\\Program Files\\Mozilla Firefox\\firefox.exe" "-marionette" "-profile" "C:\\Users\\ATECHM~1\\AppData\\Local\\Temp\\rust_mozprofile.fDJt0BIqNu0n"
You can find a detailed discussion in Is it Firefox or Geckodriver, which creates “rust_mozprofile” directory
Once the Test Execution completes and quit() is invoked the temporary mozprofile is deleted in the following process:
webdriver::server DEBUG -> DELETE /session/f84dbafc-4166-4a08-afd3-79b98bad1470
geckodriver::marionette TRACE -> 37:[0,3,"quit",{"flags":["eForceQuit"]}]
Marionette TRACE 0 -> [0,3,"quit",{"flags":["eForceQuit"]}]
Marionette DEBUG New connections will no longer be accepted
Marionette TRACE 0 <- [1,3,null,{"cause":"shutdown"}]
geckodriver::marionette TRACE <- [1,3,null,{"cause":"shutdown"}]
webdriver::server DEBUG Deleting session
geckodriver::marionette DEBUG Stopping browser process
So, when you open the same page using Selenium, GeckoDriver and Firefox the cookies which were stored within the local storage of the Default Firefox Profile aren't accessible and hence you are redirected to the Login Page.
To store and use the cookies within the local storage to get authenticated automatically you need to create and use a Custom Firefox Profile.
Here you can find a relevant discussion on webdriver.FirefoxProfile(): Is it possible to use a profile without making a copy of it?
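As a starting point for the custom-profile route, the directory of an existing named profile can be looked up from Firefox's profiles.ini. This is a sketch that assumes the usual profiles.ini layout (sections named Profile0, Profile1, ... with Name, Path and IsRelative keys). The helper name list_firefox_profiles and the sample profile name "selenium" are hypothetical, and the demo writes a throwaway file rather than reading a real installation.

```python
import configparser
import os
import tempfile


def list_firefox_profiles(profiles_ini):
    """Return {profile name: directory} for the ProfileN sections of profiles.ini."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(profiles_ini)
    base = os.path.dirname(os.path.abspath(profiles_ini))
    profiles = {}
    for section in parser.sections():
        if not section.startswith("Profile"):
            continue  # skip General/Install sections
        path = parser.get(section, "Path")
        if parser.get(section, "IsRelative", fallback="1") == "1":
            path = os.path.join(base, path)  # relative to the profiles.ini folder
        profiles[parser.get(section, "Name")] = os.path.normpath(path)
    return profiles


# Demo against a temporary profiles.ini so nothing on the machine is touched.
with tempfile.TemporaryDirectory() as tmp:
    ini = os.path.join(tmp, "profiles.ini")
    with open(ini, "w") as handle:
        handle.write("[Profile0]\nName=selenium\nIsRelative=1\nPath=Profiles/abcd.selenium\n")
    found = list_firefox_profiles(ini)
    print(found["selenium"])
```

The resulting directory is what Firefox would need to be pointed at, for example through the -profile argument visible in the geckodriver log above.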
You can authenticate using your current data from local storage.
Example (Java):
driver.get(TELEGRAM_WEB_URL);
LocalStorage localStorage = ((ChromeDriver) driver).getLocalStorage();
localStorage.clear();
localStorage.setItem("dc2_auth_key","<YOUR AUTH KEY>");
localStorage.setItem("user_auth","<YOUR USER INFO>");
driver.get(TELEGRAM_WEB_URL);
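Since the question uses Python, a rough equivalent can be written with execute_script, which Selenium's Python bindings provide. This is only a sketch: restore_local_storage is a hypothetical helper, FakeDriver stands in for a real driver so the example runs without a browser, and whether these two keys are enough to log in is taken from the answer above rather than verified. Unlike the Java example, it does not clear existing entries first.

```python
def restore_local_storage(driver, url, items):
    """Load url, write items into window.localStorage, then reload the page."""
    driver.get(url)  # localStorage is per-origin, so the origin must be loaded first
    for key, value in items.items():
        driver.execute_script(
            "window.localStorage.setItem(arguments[0], arguments[1]);", key, value
        )
    driver.get(url)  # reload so the page picks up the stored values


class FakeDriver:
    """Stand-in for a Selenium driver; records calls instead of driving a browser."""

    def __init__(self):
        self.calls = []

    def get(self, url):
        self.calls.append(("get", url))

    def execute_script(self, script, *args):
        self.calls.append(("set", args))


driver = FakeDriver()
restore_local_storage(
    driver,
    "https://web.telegram.org",
    {"dc2_auth_key": "<YOUR AUTH KEY>", "user_auth": "<YOUR USER INFO>"},
)
print(driver.calls)
```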

I cannot get Chrome to default to saving as a PDF when using Selenium

I'm trying to save some web pages to PDF using Python, Selenium, and Chrome, and I can't get the printer to default to Chrome's built-in "save as PDF" option.
I have found examples of how to do this in various places online, including in questions people have asked on Stack Overflow, but the way they all implement it doesn't work for me. I'm not sure whether something has changed in more recent versions of Chrome or whether I'm doing something wrong (for example, here is a page that has these settings: Missing elements when using selenium chrome driver to automatically 'Save as PDF').
I only included the default download location change in this code to verify it's accepting any changes at all - if you download one of the Python installs from that page, it will download to the new location and not to the standard download folder, so Chrome seems to be accepting these changes.
The problem appears to be the option "selectedDestinationId", which doesn't seem to do anything.
from selenium import webdriver
import time
import json
chrome_options = webdriver.ChromeOptions()
app_state = {
    'recentDestinations': [{
        'id': 'Save as PDF',
        'origin': 'local'
    }],
    'selectedDestinationId': 'Save as PDF',
    'version': 2
}
prefs = {
    'printing.print_preview_sticky_settings.appState': json.dumps(app_state),
    'download.default_directory': 'c:\\temp\\seleniumtesting\\'
}
chrome_options.add_experimental_option('prefs', prefs)
driver = webdriver.Chrome(executable_path='C:\\temp\\seleniumtesting\\chromedriver.exe', options=chrome_options)
driver.get('https://www.python.org/downloads/release/python-373/')
time.sleep(25)
driver.close()
After the page launches, hitting ctrl+p brings up the printing page, but it defaults to the default printer. If I bring up the same page in my standard Chrome installation, it defaults to printing to PDF. I want to get to the point where I can add kiosk printing and then call window.print(), but as of now all that does is send it to the actual paper printer.
Thanks for any help anyone can offer. I'm stumped, and at this point it probably would have been faster to just save all of these manually.
It seems that if you have network printers configured, they load after the dialog opens and override your selectedDestinationId.
There is a preference, "printing.default_destination_selection_rules", which seems to resolve this.
prefs = {
    "printing.print_preview_sticky_settings.appState": json.dumps(app_state),
    "download.default_directory": "c:\\temp\\seleniumtesting\\",
    "printing.default_destination_selection_rules": {
        "kind": "local",
        "namePattern": "Save as PDF",
    },
}
https://chromium.googlesource.com/chromium/src/+/master/chrome/common/pref_names.cc#1318
https://www.chromium.org/administrators/policy-list-3#DefaultPrinterSelection
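Putting the question's app_state and the extra preference together, the whole prefs dictionary can be built in one place. This is a sketch under the assumptions already made in this thread (the preference names and values are the ones quoted above; the helper name build_pdf_print_prefs is hypothetical), and it only constructs the dictionary without launching Chrome.

```python
import json


def build_pdf_print_prefs(download_dir):
    """Assemble the Chrome prefs discussed above for defaulting to "Save as PDF"."""
    app_state = {
        "recentDestinations": [{"id": "Save as PDF", "origin": "local"}],
        "selectedDestinationId": "Save as PDF",
        "version": 2,
    }
    return {
        # As in the question, the sticky settings are passed as a JSON string.
        "printing.print_preview_sticky_settings.appState": json.dumps(app_state),
        "download.default_directory": download_dir,
        "printing.default_destination_selection_rules": {
            "kind": "local",
            "namePattern": "Save as PDF",
        },
    }


prefs = build_pdf_print_prefs("c:\\temp\\seleniumtesting\\")
state = json.loads(prefs["printing.print_preview_sticky_settings.appState"])
print(state["selectedDestinationId"])
```

The result would be handed to chrome_options.add_experimental_option('prefs', prefs) exactly as in the question; kiosk printing, if wanted, would be a separate command-line argument.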

Monitoring JSON wire protocol logs

According to the Selenium documentation, interactions between the webdriver client and a browser are done via the JSON Wire Protocol. Basically, the client (written in Python, Ruby, Java or whatever) sends JSON messages to the web browser and the web browser responds with JSON too.
Is there a way to view/catch/log these JSON messages while running a selenium test?
For example (in Python):
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('http://google.com')
driver.close()
I want to see what JSON messages are going between the python selenium webdriver client and a browser when I instantiate the driver (in this case Chrome): webdriver.Chrome(), when I'm getting a page: driver.get('http://google.com') and when I'm closing it: driver.close().
FYI, in the #SFSE: Stripping Down Remote WebDriver tutorial, it is done via capturing the network traffic between the local machine where the script is running and the remote selenium server.
I'm tagging the question as Python specific, but really would be happy with any pointers.
When you use Chrome you can direct the chromedriver instance that will drive Chrome to log more information than what is available through the logging package. This information includes the commands sent to the browser and the responses it gets. Here's an example:
from selenium import webdriver
driver = webdriver.Chrome(service_log_path="/tmp/log")
driver.get("http://www.google.com")
driver.find_element_by_css_selector("input")
driver.quit()
The code above will output the log to /tmp/log. The part of the log that corresponds to the find_element_... call looks like this:
[2.389][INFO]: COMMAND FindElement {
"sessionId": "b6707ee92a3261e1dc33a53514490663",
"using": "css selector",
"value": "input"
}
[2.389][INFO]: Waiting for pending navigations...
[2.389][INFO]: Done waiting for pending navigations
[2.398][INFO]: Waiting for pending navigations...
[2.398][INFO]: Done waiting for pending navigations
[2.398][INFO]: RESPONSE FindElement {
"ELEMENT": "0.3367185448296368-1"
}
As far as I know, the commands and responses faithfully represent what is going on between the client and the server. I've submitted bug reports and fixes to the Selenium project on the basis of what I saw in these logs.
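Once such a log exists, the COMMAND and RESPONSE entries can be pulled out with a few lines of Python. This sketch assumes the line format shown in the excerpt above; other chromedriver versions may format entries differently, and the helper name command_timeline is hypothetical.

```python
import re

# Matches lines such as "[2.389][INFO]: COMMAND FindElement {"
ENTRY = re.compile(r"^\[(\d+\.\d+)\]\[(\w+)\]: (COMMAND|RESPONSE) (\w+)")


def command_timeline(lines):
    """Return (timestamp, direction, command name) for each COMMAND/RESPONSE line."""
    timeline = []
    for line in lines:
        match = ENTRY.match(line)
        if match:
            timeline.append((float(match.group(1)), match.group(3), match.group(4)))
    return timeline


# Lines taken from the excerpt above.
log_excerpt = """\
[2.389][INFO]: COMMAND FindElement {
"sessionId": "b6707ee92a3261e1dc33a53514490663",
"using": "css selector",
"value": "input"
}
[2.389][INFO]: Waiting for pending navigations...
[2.398][INFO]: RESPONSE FindElement {
"ELEMENT": "0.3367185448296368-1"
}
""".splitlines()

timeline = command_timeline(log_excerpt)
for entry in timeline:
    print(entry)
```

In practice the lines would come from the file given as service_log_path, for example by iterating over open("/tmp/log").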
I found one option that almost fits my needs.
Just piping the logger to stdout lets you see the underlying requests being made:
import logging
import sys
from selenium import webdriver
# pipe logs to stdout
logger = logging.getLogger()
logger.addHandler(logging.StreamHandler(sys.stdout))
logger.setLevel(logging.NOTSET)
# selenium specific code
driver = webdriver.Chrome()
driver.get('http://google.com')
driver.close()
It prints:
POST http://127.0.0.1:56668/session {"desiredCapabilities": {"platform": "ANY", "browserName": "chrome", "version": "", "javascriptEnabled": true, "chromeOptions": {"args": [], "extensions": []}}}
Finished Request
POST http://127.0.0.1:56668/session/5b6875595143b0b9993ed4f66f1f19fc/url {"url": "http://google.com", "sessionId": "5b6875595143b0b9993ed4f66f1f19fc"}
Finished Request
DELETE http://127.0.0.1:56668/session/5b6875595143b0b9993ed4f66f1f19fc/window {"sessionId": "5b6875595143b0b9993ed4f66f1f19fc"}
Finished Request
I don't see the responses, but this is already progress.
