Headless Selenium exits immediately - Python

I have a headless web scraper. When it runs, the scraper takes a base URL, scrapes the links on that page, and then scrapes each of the links it collected.
The problem I'm having is that when I run the scraper headless it exits almost immediately. When I run it normally (non-headless) it works perfectly fine.
These are my Selenium arguments:
import os
from selenium import webdriver

options = webdriver.ChromeOptions()
options.binary_location = os.environ.get('GOOGLE_CHROME_BIN')  # path to the Chrome binary on Heroku
options.add_argument('--headless')
options.add_argument('--disable-gpu')
options.add_argument('--no-sandbox')
driver = webdriver.Chrome(executable_path=os.environ.get('CHROMEDRIVER_PATH'),
                          options=options)
I've also tried adding these options but it gave me the same result:
options.add_argument('--disable-dev-shm-usage')
options.add_argument("--window-size=1920,1080")
options.add_argument("--start-maximized")
How can I solve this? I'm trying to deploy this scraper to Heroku, and none of the things I've tried above have worked.

Basically, some websites won't load in headless mode unless a user agent is specified.
To fix this I added:
user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'
options.add_argument(f'user-agent={user_agent}')
This fixed the problem of my scraper exiting immediately.
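Putting it together, a minimal sketch of the full headless configuration with the fix applied (reusing the GOOGLE_CHROME_BIN and CHROMEDRIVER_PATH environment variables and the Selenium 3 executable_path API from the question):
import os
from selenium import webdriver

user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'
options = webdriver.ChromeOptions()
options.binary_location = os.environ.get('GOOGLE_CHROME_BIN')
options.add_argument('--headless')
options.add_argument('--disable-gpu')
options.add_argument('--no-sandbox')
# The key line: some sites serve nothing to the default headless user agent
options.add_argument(f'user-agent={user_agent}')
driver = webdriver.Chrome(executable_path=os.environ.get('CHROMEDRIVER_PATH'),
                          options=options)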

Related

While I'm scraping with Selenium, it keeps telling me that I'm using an unusual browser and that I have to enable JavaScript

I just started learning programming, beginning with scraping in Python Selenium, but when I load the URL and send elements, the website keeps telling me: "Your browser is a bit unusual... Try disabling ad blockers and other extensions, enabling JavaScript, or using a different web browser."
I tried some of the solutions provided on the site, but none of them solved my problem.
Can you explain and solve the problem with Python, please?
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--incognito")
options.add_argument("--headless")
options.add_experimental_option('excludeSwitches', ['enable-logging'])
# Note: all options must be set before webdriver.Chrome() is called; in the
# original code --headless and excludeSwitches were added after the driver
# was created, so they were silently ignored.
driver = webdriver.Chrome('chromedriver.exe', options=options)
driver.set_window_size(620, 720)
driver.delete_all_cookies()
driver.implicitly_wait(5)
driver.execute_cdp_cmd('Network.setUserAgentOverride', {"userAgent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.53 Safari/537.36'})
driver.get('https://sso.godaddy.com/v1/account/create?realm=idp&path=%2Fcontact%2Fvalidate%3FcontactType%3DphoneMobile%26app%3Dsso%26path%3Dprofile%252Fedit%26profileUpdate%3DTrue%26userInteraction%3DPROFILE_UPDATE&app=sso&auth_reason=1&iframe=false')
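A pattern that is often tried for this kind of "unusual browser" message (a sketch borrowing the anti-detection options that appear in the other questions on this page; there is no guarantee it defeats every detector) is to hide the most common automation signals before loading the page:
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--incognito')
options.add_experimental_option('excludeSwitches', ['enable-automation'])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome('chromedriver.exe', options=options)
# Mask navigator.webdriver, the first property most detectors check
driver.execute_cdp_cmd('Page.addScriptToEvaluateOnNewDocument', {
    'source': "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
})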

Selenium Chrome does not return Facebook users' posts like normal Chrome does

So recently I have been trying to scrape Facebook pages/users without logging in (I know logging in could solve many issues, but it also risks an account ban). However, I have found that the page source returned does not match what I see in a normal Chrome browser.
For example, for a Facebook Page, you can see its posts like this in normal Chrome (not logged in, Incognito Mode).
For a Facebook User, you can see their posts like this in normal Chrome (not logged in, Incognito Mode).
However, in Selenium I am unable to retrieve the post content (actually, all I need is the newest post ID).
The code is here for reference
import os
from selenium import webdriver
chrome_options = webdriver.ChromeOptions()
chrome_options.binary_location = os.environ.get("GOOGLE_CHROME_BIN")
chrome_options.add_argument("--headless")
chrome_options.add_argument("--disable-dev-shm-usage")
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.75 Safari/537.36")
chrome_options.add_argument("--enable-javascript")
driver = webdriver.Chrome(executable_path=os.environ.get("CHROMEDRIVER_PATH"), chrome_options=chrome_options)
driver.execute_cdp_cmd('Network.setUserAgentOverride', {"userAgent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.75 Safari/537.36"})
driver.implicitly_wait(10)
driver.get("https://www.facebook.com/<Page or User>")
print(driver.page_source)
I have tried adding/removing user-agent settings, but with no luck. The user agent in the code is my current browser's.
What I want is to retrieve the post content (or just the post ID) with the Selenium Chrome driver, the way it appears in normal Chrome. Thank you.
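One thing worth ruling out before blaming detection: page_source is captured as soon as get() returns, which can be before Facebook's JavaScript has rendered the feed (implicitly_wait only affects find_element calls, not page_source). A sketch with an explicit wait; the CSS selector is a hypothetical placeholder, since Facebook's markup changes frequently:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 20)
# 'div[role="article"]' is a guess at the post container; inspect the live page to confirm
post = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, 'div[role="article"]')))
print(post.get_attribute('outerHTML'))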

How is my Selenium script getting detected?

My simple Python script using Selenium is not working properly. My hypothesis is that it's getting detected and flagged as a bot. The only purpose of the script is to log in to the zalando.pl website. No matter what I do, I get Error 403 ("Wystąpił błąd. Pracujemy nad jego usunięciem. Spróbuj ponownie później." — "An error occurred. We are working on removing it. Try again later.").
I've tried various methods to resolve the problem. I've tried to simulate human behavior using sleep with random durations (I've tried WebDriverWait as well). I've also tried passing various options to chromedriver, but that didn't help (I even edited the cdc_ string with a hex editor). On top of all that, I tried undetected-chromedriver, but it didn't help either. Is there any way to make my script work?
Here's the code:
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
options = webdriver.ChromeOptions()
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options, executable_path=r'/chromedriver.exe')
driver.execute_cdp_cmd('Network.setUserAgentOverride', {"userAgent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.53 Safari/537.36'})
print(driver.execute_script("return navigator.userAgent;"))
driver.get('https://www.zalando.pl/login')
time.sleep(7)
username_entry = driver.find_element(By.XPATH, '//*[@id="login.email"]')
username_entry.send_keys("login@mail.com")
time.sleep(1)
password_entry = driver.find_element(By.XPATH, '//*[@id="login.secret"]')
password_entry.send_keys("password")
time.sleep(4)
button_entry = driver.find_element(By.XPATH, '//*[#id="sso"]/div/div[2]/main/div/div[2]/div/div/div/form/button/span')
time.sleep(2)
button_entry.click()
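For comparison, the minimal undetected-chromedriver form looks like this (a sketch of the library the question already reports trying; if even this bare setup still returns a 403, the block is more likely IP-based or TLS-fingerprint-based than browser-fingerprint-based):
import undetected_chromedriver as uc

# uc.Chrome patches the cdc_ markers and automation switches automatically
driver = uc.Chrome()
driver.get('https://www.zalando.pl/login')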

Selenium can't access "bet365" on a Google Compute Engine VM

It is known that www.bet365.com implements some kind of bot detection that makes web scraping a bit difficult, but with the help of the Internet I got a Chromedriver configuration for my Python script that let me scrape that website flawlessly on both my Windows 10 host and my local virtualized Ubuntu.
The thing is, I uploaded said script to a GCE Ubuntu virtual machine, but Selenium seems unable to load this particular website. Every needed package/library is the same as on my local Linux machine, which works fine. I can get to any other website with the driver.get() method; it's just bet365 that keeps loading for around ten seconds and then throws this exception:
from unknown error: cannot determine loading status
from tab crashed
(Session info: headless chrome=87.0.4280.66)
My code is like this:
#Chrome Options
options = webdriver.ChromeOptions()
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
options.add_argument("start-maximized")
options.add_argument("headless")
options.add_argument("--no-sandbox")
options.add_argument("--ignore-certificate-errors")
options.add_argument('--disable-gpu')
options.add_argument("--disable-backgrounding-occluded-windows")
options.add_argument('--disable-dev-shm-usage')
driver = webdriver.Chrome(chrome_options=options)
# Avoid detection
driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
    "source": """
        Object.defineProperty(navigator, 'webdriver', {
            get: () => undefined
        })
    """
})
driver.execute_cdp_cmd('Network.setUserAgentOverride',
{"userAgent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.53 Safari/537.36'})
driver.get("https://www.bet365.com") #<-- Doesn't get past
time.sleep(3)
print("Website loaded")
I also tried geckodriver without success, even though it too works with this website on my local VM. Finally, I checked that the site is indeed reachable from the GCE machine with curl -Is http://www.bet365.com | head -1, which returned 200 OK.
Any idea of what can be the problem here?
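One way to get more detail than "tab crashed" (a diagnostic sketch, not a fix) is to enable chromedriver's verbose log via service_args, which the Selenium 3 API used here supports:
driver = webdriver.Chrome(chrome_options=options,
                          service_args=['--verbose', '--log-path=/tmp/chromedriver.log'])
# After the failure, /tmp/chromedriver.log records what the renderer was doing
# when the tab died, which helps distinguish a site block from a resource problem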

Python Selenium: identifying which browser instance is running

On my website, when a user executes an operation, the server starts a new Chrome WebDriver session (Python Selenium). For monitoring, I need to identify which browser was opened.
# NNR is assumed to be defined earlier; it parameterizes the user-agent string
UA = "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.{NNR} (KHTML, like Gecko) Chrome/42.0.2288.6 Safari/537.{NNR}".format(NNR=NNR)
options = webdriver.ChromeOptions()
options.add_argument('--user-agent={UA}'.format(UA=UA))
options.add_argument("--lang=it")
options.add_argument("--test-type")
self.driver = webdriver.Chrome(chrome_options=options)
I need the same kind of solution: when a browser is opened, I want it to be associated with a name that is visible to the human eye. How can I give a name to a browser in Selenium?
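One lightweight approach (a sketch; document.title only persists until the next navigation, so it has to be reapplied after each get()) is to write a label into the page title, which is what shows up in the window list and taskbar:
label = 'scraper-worker-1'  # hypothetical human-visible name for this session
self.driver.get('about:blank')
self.driver.execute_script("document.title = arguments[0];", label)
Alternatively, the {NNR} token already embedded in the user-agent string above can serve as a machine-readable session identifier, though it is not visible on screen.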
