How to hide Python Selenium automation from the browser - python

When I try to use Python Selenium, the browser detects that it is being driven by automation and I lose the ability to work with the page: it never loads fully, so I cannot call click() and other functions.
Note: fake-useragent does not work.
Alternatively, please suggest another method to click on the page, start the video player, and get the link from the button located inside the player. I only need that link, which is different every time; it is the download link for the video (JW Player 8.26.5).

You would need to bypass the website's detection that the browser is being driven by Selenium.
One way is to pass the --disable-blink-features=AutomationControlled option when starting the browser. This stops Chrome from exposing the navigator.webdriver property that websites commonly check to detect control by automation software like Selenium.
Here is an example with Chrome:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
# Stops Chrome from exposing the navigator.webdriver automation flag
options.add_argument("--disable-blink-features=AutomationControlled")
driver = webdriver.Chrome(options=options)
driver.get("YOUR_URL")
You may also need to set a custom user-agent string to further disguise your browser as a non-automated user:
options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36")

Related

Selenium Chrome does not return Facebook user's post like in normal Chrome

So recently I have been trying to scrape Facebook pages/users without logging in (I know logging in could solve many issues, but it also risks a potential account ban). However, I have found that the page source returned does not match how the page looks in a normal Chrome browser.
For both a Facebook Page and a Facebook User, the posts are visible in normal Chrome (not logged in, Incognito mode).
However, in Selenium I am unable to retrieve the post content (actually, all I need is the newest post ID).
The code is here for reference:
import os
from selenium import webdriver
chrome_options = webdriver.ChromeOptions()
chrome_options.binary_location = os.environ.get("GOOGLE_CHROME_BIN")
chrome_options.add_argument("--headless")
chrome_options.add_argument("--disable-dev-shm-usage")
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.75 Safari/537.36")
chrome_options.add_argument("--enable-javascript")
driver = webdriver.Chrome(executable_path=os.environ.get("CHROMEDRIVER_PATH"), options=chrome_options)
driver.execute_cdp_cmd('Network.setUserAgentOverride', {"userAgent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.75 Safari/537.36"})
driver.implicitly_wait(10)
driver.get("https://www.facebook.com/<Page or User>")
print(driver.page_source)
I have tried adding and removing user-agent settings, but with no luck. The user agent in the code is my current browser's.
What I want to do is retrieve the post content (or just the post ID) with the Selenium Chrome driver, the same way it works in normal Chrome. Thank you.

How is my Selenium script getting detected?

My simple Python script using Selenium is not working properly. My hypothesis is that it's getting detected and flagged as a bot. The only purpose of the script is to log in to the zalando.pl website. No matter what I do, I get Error 403 ("An error occurred. We are working on fixing it. Try again later.", shown in Polish).
I've tried various methods to resolve the problem. I've tried to simulate human behavior by sleeping for random amounts of time (I've tried WebDriverWait as well). I've also tried various chromedriver options, but they didn't help (I even edited the cdc_ string in the chromedriver binary with a hex editor). On top of that, I tried undetected-chromedriver, but it didn't help either. Is there any way to make my script work?
Here's the code:
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
options = webdriver.ChromeOptions()
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options, executable_path=r'/chromedriver.exe')
driver.execute_cdp_cmd('Network.setUserAgentOverride', {"userAgent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.53 Safari/537.36'})
print(driver.execute_script("return navigator.userAgent;"))
driver.get('https://www.zalando.pl/login')
time.sleep(7)
username_entry = driver.find_element(By.XPATH, '//*[@id="login.email"]')
username_entry.send_keys("login@mail.com")
time.sleep(1)
password_entry = driver.find_element(By.XPATH, '//*[@id="login.secret"]')
password_entry.send_keys("password")
time.sleep(4)
button_entry = driver.find_element(By.XPATH, '//*[@id="sso"]/div/div[2]/main/div/div[2]/div/div/div/form/button/span')
time.sleep(2)
button_entry.click()
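Since undetected-chromedriver was mentioned: for reference, the minimal invocation is just the following sketch (not a guaranteed fix; versions and options can matter):

import undetected_chromedriver as uc

# undetected-chromedriver patches chromedriver's cdc_ markers and
# automation flags itself, so a first attempt needs no extra options
driver = uc.Chrome()
driver.get('https://www.zalando.pl/login')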

Headless Selenium exits immediately

I have a headless web scraper. When run, the scraper takes a base URL, scrapes the links on that page, and then scrapes the links it got off those pages.
The problem I'm having is that when I run the scraper headless, it exits almost immediately. When I run it normally (non-headless), it works perfectly fine.
These are my selenium arguments:
options = webdriver.ChromeOptions()
options.binary_location = os.environ.get('GOOGLE_CHROME_BIN')
options.add_argument('--headless')
options.add_argument('--disable-gpu')
options.add_argument('--no-sandbox')
driver = webdriver.Chrome(executable_path=os.environ.get('CHROMEDRIVER_PATH'),
                          options=options)
I've also tried adding these options but it gave me the same result:
options.add_argument('--disable-dev-shm-usage')
options.add_argument("--window-size=1920,1080")
options.add_argument("--start-maximized")
How can I solve this? I'm trying to deploy this scraper to Heroku, and none of the things I've tried above worked.
Basically, some websites won't load in headless mode unless a user agent is specified.
To fix this I added:
user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'
options.add_argument(f'user-agent={user_agent}')
This fixed the problem of my scraper exiting immediately.
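Put together, a minimal self-contained sketch of the fix looks like this (the binary and driver paths come from the GOOGLE_CHROME_BIN/CHROMEDRIVER_PATH environment variables as in the question; the target URL is a placeholder):

import os
from selenium import webdriver

user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'

options = webdriver.ChromeOptions()
options.binary_location = os.environ.get('GOOGLE_CHROME_BIN')
options.add_argument('--headless')
options.add_argument('--disable-gpu')
options.add_argument('--no-sandbox')
# The user agent is the crucial part: without it some sites serve nothing
options.add_argument(f'user-agent={user_agent}')

driver = webdriver.Chrome(executable_path=os.environ.get('CHROMEDRIVER_PATH'),
                          options=options)
driver.get('https://example.com')  # placeholder URL
print(driver.title)
driver.quit()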

Why doesn't Instagram work with Selenium headless Chrome?

I'm trying to build an Instagram bot that works headless, but it doesn't seem to find the username and password fields (i.e. NoSuchElementException).
I ran this code to troubleshoot (it basically opens the Instagram homepage and takes a screenshot):
from selenium import webdriver
from time import sleep
options = webdriver.ChromeOptions()
options.headless = True
options.add_argument("--window-size=1920,1080")
browser = webdriver.Chrome(options=options)
browser.get("https://www.instagram.com")
browser.get_screenshot_as_file("screenshot.png")
and I got screenshots that basically say 'error, retry in a few minutes' in French.
I tried finding the 'connectez-vous' ('log in') button through Selenium, but every XPath I try fails, and it's impossible to find it through F12.
The bot will later be uploaded to PythonAnywhere so I can run it in the cloud (so if you think I might run into other problems there, let me know).
What do you suggest I do?
from selenium import webdriver
from time import sleep
options = webdriver.ChromeOptions()
#options.headless = True
options.add_argument("--window-size=1920,1080")
options.add_argument("--headless")
options.add_argument("--disable-gpu")
options.add_argument(
    "user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36")
browser = webdriver.Chrome(options=options)
browser.get("https://www.instagram.com")
sleep(5)
#browser.refresh()
browser.get_screenshot_as_file("screenshot.png")
For headless Chrome the user agent reports 'HeadlessChrome' instead of 'Chrome', which is how Instagram detects that you are running headless.
You can prevent this by specifying a hardcoded user agent:
open a normal Chrome window, go to the Network tab in DevTools, open a request's headers, copy the user-agent value, and substitute it in your code.
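You can see the difference yourself with a quick check like this sketch, which just prints the user agent headless Chrome reports before and after the override:

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless")
browser = webdriver.Chrome(options=options)
# Without an override this typically contains "HeadlessChrome"
print(browser.execute_script("return navigator.userAgent"))
browser.quit()

options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36")
browser = webdriver.Chrome(options=options)
# Now it reports the hardcoded desktop Chrome string instead
print(browser.execute_script("return navigator.userAgent"))
browser.quit()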
See also: Headless browser detection

Scraping AngularJS with Selenium in Python in Chrome headless mode

I want to crawl information from a webpage built with AngularJS.
My problem is that if I crawl the page in "--headless" mode, I do not receive my target element. Without "--headless" everything works fine.
Can somebody explain, or point me to a link describing, what is different in "--headless" mode?
I read http://allselenium.info/wait-for-elements-python-selenium-webdriver/ . What else could be the matter?
Thank you for any hints.
EDIT:
It also doesn't work with wait conditions in headless mode
Here is a solution that worked for me after some research and reading:
https://github.com/GoogleChrome/puppeteer/issues/665
https://intoli.com/blog/making-chrome-headless-undetectable/
The headless request is detected, so one has to set arguments that hide headless mode:
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--lang=de-DE')
# Note: the user-agent value must not contain nested quotes
options.add_argument('user-agent=Mozilla/5.0 (Windows NT 4.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36')
# Chrome expects width,height separated by a comma
options.add_argument('--window-size=1920,1080')
