I am trying to get selenium to use a proxy that will change at a certain point.
from seleniumwire import webdriver

def proxyManage():
    proxyChange("test.com", "8000", "user1", "password1")

def proxyChange(host, port, username, password):
    options = {
        'proxy': {
            'http': 'http://' + username + ':' + password + '@' + host + ':' + port,
            'https': 'https://' + username + ':' + password + '@' + host + ':' + port,
        }
    }
    PATH = "D:/Programming/undetectable chrome/chromedriver.exe"
    browser = webdriver.Chrome(PATH, options=options)
    browser.get("https://whatismyipaddress.com/")

proxyManage()
So I import seleniumwire, as I am unsure how plain selenium handles proxies. Now when I run the program to test on that website whether the proxy works, I get the error below:
Traceback (most recent call last):
  File "D:\Programming\Python\proxyTest.py", line 20, in <module>
    proxyManage()
  File "D:\Programming\Python\proxyTest.py", line 6, in proxyManage
    proxyChange("test.com", "8000", "user1", "password1")
  File "D:\Programming\Python\proxyTest.py", line 16, in proxyChange
    browser = webdriver.Chrome(PATH, options=options)
  File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\site-packages\seleniumwire\webdriver\browser.py", line 97, in __init__
    chrome_options.add_argument('proxy-bypass-list=<-loopback>')
AttributeError: 'dict' object has no attribute 'add_argument'
So first, is there a way I can pass these arguments to proxyChange() without getting this error? The idea is to have a counter in proxyManage() so that every time it runs it moves to the next line in the proxy.txt file and passes those values to proxyChange(), hopefully updating the proxy without the program closing again. It would also be nice to multithread this.
You can try to construct it first before passing it to the dict:
http_proxy = f"http://{username}:{password}@{host}:{port}"
https_proxy = f"https://{username}:{password}@{host}:{port}"
options = {
    'proxy': {
        'http': http_proxy,
        'https': https_proxy
    }
}
This only works for Python >= 3.6. If you're using a lower version, concatenating the strings should work as well.
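For lower versions, a minimal equivalent built with plain concatenation (same variables) would be:
http_proxy = "http://" + username + ":" + password + "@" + host + ":" + port
https_proxy = "https://" + username + ":" + password + "@" + host + ":" + port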
Selenium is not suitable for handling authentication; it is just web UI automation. Please check the SeleniumHQ site.
Two-Factor Authentication, commonly known as 2FA, is an authorization mechanism where a One-Time Password (OTP) is generated by “Authenticator” mobile apps such as “Google Authenticator” or “Microsoft Authenticator”, or sent by SMS or e-mail, to authenticate the user. Automating this seamlessly and consistently is a big challenge in Selenium. There are some ways to automate the process, but they add another layer on top of our Selenium tests and are not secure either, so it is best to avoid automating 2FA.
There are a few options to get around 2FA checks:
Disable 2FA for certain Users in the test environment, so that you can use those user credentials in the automation.
Disable 2FA in your test environment.
Disable 2FA if you login from certain IPs. That way we can configure our test machine IPs to avoid this.
Change
browser = webdriver.Chrome(PATH, options=options)
to
browser = webdriver.Chrome(PATH, seleniumwire_options=options)
This works for me.
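As a minimal sketch, here is that fix combined with the rotation idea from the question; the proxy.txt path and its user:pass@host:port line format are assumptions, not something confirmed in the original post:
from itertools import cycle
from seleniumwire import webdriver

PATH = "D:/Programming/undetectable chrome/chromedriver.exe"

def load_proxies(path="proxy.txt"):
    # Assumed file format: one "username:password@host:port" entry per line.
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

def proxy_change(proxy_entry):
    options = {
        'proxy': {
            'http': 'http://' + proxy_entry,
            'https': 'https://' + proxy_entry,
        }
    }
    # Pass the dict as seleniumwire_options, not options.
    browser = webdriver.Chrome(PATH, seleniumwire_options=options)
    browser.get("https://whatismyipaddress.com/")
    return browser

# Cycle through the proxies, rebuilding the browser for each one.
for proxy_entry in cycle(load_proxies()):
    browser = proxy_change(proxy_entry)
    # ... do the actual work with this proxy here ...
    browser.quit()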
I'm new to web scraping, and I want to download a CSV file that is generated at runtime (the button has no URL; it calls a JS function) after logging in. I have tried using https://curl.trillworks.com/#, and it works fine, but it relies on dynamic cookies.
import requests

cookies = {
    ...,
}
headers = {
    ...
}
data = {
    ...
}
s = requests.Session()
response = s.post(posturl, headers=headers, cookies=cookies, data=data, verify=False)
The cookies are dynamic, so every time I want to download files I have to get new cookies, so I tried something different using the same script:
payload = {
    'login': 'login',
    'username': 'My_name',
    'password': 'My_password',
}
logurl = "http:..."
posturl = 'http:...'
s = requests.Session()
response = s.post(logurl, headers=headers, data=data)
# response = s.post(posturl, data=payload,auth=(my_name, my_password)) #This too gives me the wrong output
But this doesn't give me the right output; it gives me the first page as text/html. The response headers give me two different content types:
print response.headers['Content-Type']
The right output is 'text/csv;charset=UTF-8', but it gives me 'text/html;charset=UTF-8', and the status_code for both is 200.
For information, the post URL for the CSV file is the same as the one for the HTML page.
After some deep searching, I found the following.
There are several quite different tools for web scraping:
1. requests or urllib: widely used tools that let us make POST and GET requests, log in, and create persistent cookies using Session(), etc. A useful helper is https://curl.trillworks.com/, but these alone are not enough for complicated data extraction.
2. BeautifulSoup or lxml: HTML parsers used to navigate the HTML source, something like a regular expression for extracting the desired element from an HTML page (get the title, find the div with id=12345). These tools cannot understand a JS button and cannot perform actions like POST, GET, clicking a button, or submitting a form; they are just a way to read data from a response.
3. Mechanize, RoboBrowser, or MechanicalSoup: great tools for web browsing, cookie handling, and browser history. We can consider them a mix of requests and BeautifulSoup, so we can make GET and POST requests, submit forms, and navigate the HTML content easily as with BeautifulSoup. However, they are not real browsers, so they cannot execute or understand JS, send asynchronous requests, move the scrollbar, or export selected data from a table. These tools are not enough for complicated requests.
4. Selenium: a powerful tool and a real browser; we can get the data we want, make GET and POST requests, search, submit, select, and move the scrollbar. It is used just like any browser; with Selenium almost nothing is impossible. We can use a real browser with a GUI, or the 'headless' option for a server environment.
Below I explain how to submit a form and click a JS button in a server environment, step by step.
A. Install the webdriver for a server environment:
Open a terminal:
sudo apt install chromium-chromedriver
sudo pip install selenium
If you want to use a webdriver with a GUI interface, download it from https://chromedriver.chromium.org/downloads
B. Example in Python 2.7 (it also works for Python 3; just edit the print lines):
import os
import time

from selenium.webdriver.chrome.options import Options
from selenium import webdriver

options = Options()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-gpu')
# define the download directory
options.add_experimental_option("prefs", {
    "download.default_directory": "/tmp/",
    "download.prompt_for_download": False,
})
browser = webdriver.Chrome(chrome_options=options)  # see edit for recent code change.

USERNAME = 'mail'
PASSWORD = 'password'

browser.get('http://www.example.com')
print browser.title

user_input = browser.find_element_by_id('mail_input')
user_input.send_keys(USERNAME)
pass_input = browser.find_element_by_id('pass_input')
pass_input.send_keys(PASSWORD)
login_button = browser.find_element_by_id("btn_123")
login_button.click()

csv_button = browser.find_element_by_id("btn45875465")
csv_button.click()

browser.close()  # close the current page; use `browser.quit()` instead to destroy the whole webdriver instance

# Check if the file was downloaded completely and successfully
file_path = '/tmp/file_name'
while not os.path.exists(file_path):
    time.sleep(1)
if os.path.isfile(file_path):
    print 'The file was downloaded completely and successfully'
I will start by describing the infrastructure I am working within. It contains multiple proxy servers behind a load balancer, which forwards user authentication to the appropriate proxy; the proxies are tied directly to an Active Directory. Authentication uses the credentials and source IP that were used to log into the computer the request is coming from. The server caches the IP and credentials for 60 minutes. I am using a test account specifically for this process, and it is only used on the unit-testing server.
I am working on some automation with Selenium webdriver on a remote server using a docker container. I am using Python as the scripting language. I am trying to run tests on both internal and external webpages/applications. I was able to get a basic test working on an internal website with the following script:
Note: 10.1.54.118 is the server hosting the docker container with the selenium web driver
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

browser = webdriver.Remote(command_executor='http://10.1.54.118:4444/wd/hub', desired_capabilities=DesiredCapabilities.CHROME)
browser.get("http://10.0.0.2")
print (browser.find_element_by_tag_name('body').text)
bodyText = browser.find_element_by_tag_name('body').text
print (bodyText)
if 'Hello' in bodyText:
    print ('Found hello in body')
else:
    print ('Hello not found in body')
browser.quit()
The script is able to access the internal webpage and print all the text on it.
However, I am experiencing problems trying to run test scripts against external websites.
I have tried the following articles and tutorials, but none of them seem to work for me:
https://www.seleniumhq.org/docs/04_webdriver_advanced.jsp
Pass driver ChromeOptions and DesiredCapabilities?
https://www.programcreek.com/python/example/100023/selenium.webdriver.Remote
https://github.com/webdriverio/webdriverio/issues/324
https://www.programcreek.com/python/example/96010/selenium.webdriver.common.desired_capabilities.DesiredCapabilities.CHROME
Running Selenium Webdriver with a proxy in Python
how do i set proxy for chrome in python webdriver
https://docs.proxymesh.com/article/4-python-proxy-configuration
I have tried creating 4 versions of a script to access an external site, i.e. google.com, and simply print the text from it. Every script returns a timeout error. I apologize for posting a lot of code, but maybe the community can see where I am going wrong with the coding aspect.
Code 1:
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

PROXY = "10.32.51.169:3128"  # IP:PORT or HOST:PORT

desired_capabilities = webdriver.DesiredCapabilities.CHROME.copy()
desired_capabilities['proxy'] = {
    "httpProxy": PROXY,
    "ftpProxy": PROXY,
    "sslProxy": PROXY,
    "socksUsername": "myusername",
    "socksPassword": "mypassword",
    "noProxy": None,
    "proxyType": "MANUAL",
    "class": "org.openqa.selenium.Proxy",
    "autodetect": False
}
browser = webdriver.Remote('http://10.1.54.118:4444/wd/hub', desired_capabilities)
browser.get("https://www.google.com/")
print (browser.find_element_by_tag_name('body').text)
bodyText = browser.find_element_by_tag_name('body').text
print (bodyText)
if 'Hello' in bodyText:
    print ('Found hello in body')
else:
    print ('Hello not found in body')
browser.quit()
Is my code incorrect in any way? Am I able to pass configuration parameters to the dockerized Chrome Selenium webdriver, or do I need to build the docker container with the proxy settings preconfigured? I look forward to your replies and any help that can point me in the right direction.
A little late on this one, but here are a couple of ideas and improvements:
Remove the user/pass from the socks proxy config and add them to your Proxy connection uri.
Use the selenium Proxy object to help abstract some of the other bits of the proxy capability.
Add the scheme to the proxy connection string.
Use a try/finally block to make sure the browser quits despite any failures
Note... I'm using Python3, selenium version 3.141.0, and I'm leaving out the FTP config for brevity/simplicity:
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.common.proxy import Proxy

# Note the addition of the scheme (http) and the user/pass in the connection string.
PROXY = 'http://myusername:mypassword@10.32.51.169:3128'

# Use the selenium Proxy object to add proxy capabilities
proxy_config = {'httpProxy': PROXY, 'sslProxy': PROXY}
proxy_object = Proxy(raw=proxy_config)
capabilities = DesiredCapabilities.CHROME.copy()
proxy_object.add_to_capabilities(capabilities)

browser = webdriver.Remote('http://10.1.54.118:4444/wd/hub', desired_capabilities=capabilities)

# Use try/finally so the browser quits even if there is an exception
try:
    browser.get("https://www.google.com/")
    print(browser.find_element_by_tag_name('body').text)
    bodyText = browser.find_element_by_tag_name('body').text
    print(bodyText)
    if 'Hello' in bodyText:
        print('Found hello in body')
    else:
        print('Hello not found in body')
finally:
    browser.quit()
So basically I am trying to use the Crawlera proxy from Scrapinghub with Selenium Chrome on Windows, using Python.
I checked the documentation and they suggested using Polipo like this:
1) adding the following lines to /etc/polipo/config
parentProxy = "proxy.crawlera.com:8010"
parentAuthCredentials = "<CRAWLERA_APIKEY>:"
2) adding this to selenium driver
polipo_proxy = "127.0.0.1:8123"
proxy = Proxy({
    'proxyType': ProxyType.MANUAL,
    'httpProxy': polipo_proxy,
    'ftpProxy': polipo_proxy,
    'sslProxy': polipo_proxy,
    'noProxy': ''
})
capabilities = dict(DesiredCapabilities.CHROME)
proxy.add_to_capabilities(capabilities)
driver = webdriver.Chrome(desired_capabilities=capabilities)
Now I'd like to skip Polipo and use the proxy directly.
Is there a way to replace the polipo_proxy variable and point it at the Crawlera one? Each time I try, it doesn't take it into account and runs without a proxy.
The Crawlera proxy format is like the following: [API KEY]:@[HOST]:[PORT]
I tried adding the proxy using the following line:
chrome_options.add_argument('--proxy-server=http://[API KEY]:@[HOST]:[PORT]')
but the problem is that I need to specify HTTP and HTTPS differently.
Thank you in advance!
Polipo is no longer maintained, and hence there are challenges in using it. Crawlera requires authentication, which the Chrome driver does not seem to support as of now. You can try using the Firefox webdriver; there you can set the proxy authentication in a custom Firefox profile and use that profile, as shown in Running selenium behind a proxy server and http://toolsqa.com/selenium-webdriver/http-proxy-authentication/.
I have been suffering from the same problem and got some relief out of this; hope it helps you as well. To solve it, you have to use the Firefox driver and its profile to supply the proxy information, this way:
profile = webdriver.FirefoxProfile()
profile.set_preference("network.proxy.type", 1)
profile.set_preference("network.proxy.http", "proxy.server.address")
profile.set_preference("network.proxy.http_port", port_number)  # the port preference expects an integer, e.g. 8080
profile.update_preferences()
driver = webdriver.Firefox(firefox_profile=profile)
This totally worked for me. For reference you can use the sites linked above.
Scrapinghub provides a project for this: zyte-smartproxy-headless-proxy. You set up a local forwarding proxy with your API key, and then point the webdriver at that forwarding proxy.
You can have a look.
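As a rough sketch of that setup, assuming the forwarding proxy is already running locally with your API key and listening on 127.0.0.1:3128 (address and port are assumptions; check the project's documentation), Chrome only needs to point at the unauthenticated local endpoint:
from selenium import webdriver

LOCAL_FORWARDING_PROXY = "127.0.0.1:3128"  # assumed address of the local forwarding proxy

chrome_options = webdriver.ChromeOptions()
# Chrome itself needs no credentials; the forwarding proxy adds the Zyte/Crawlera auth.
chrome_options.add_argument('--proxy-server=http://' + LOCAL_FORWARDING_PROXY)

driver = webdriver.Chrome(options=chrome_options)
driver.get("https://www.example.com/")
print(driver.title)
driver.quit()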
Is there any way to dynamically change the proxy being used by Firefox when using selenium webdriver?
Currently I have proxy support using a proxy profile, but is there a way to change the proxy while the browser is alive and running?
My current code:
proxy = Proxy({
    'proxyType': 'MANUAL',
    'httpProxy': proxy_ip,
    'ftpProxy': proxy_ip,
    'sslProxy': proxy_ip,
    'noProxy': ''  # set this value as desired
})
browser = webdriver.Firefox(proxy=proxy)
Thanks in advance.
This is a slightly old question, but it is actually possible to change the proxies dynamically through a "hacky" way.
I am going to use Selenium JS with Firefox, but you can follow along in the language you want.
Step 1: Visit "about:config"
driver.get("about:config");
Step 2: Run a script that changes the proxy
var setupScript = `var prefs = Components.classes["@mozilla.org/preferences-service;1"]
    .getService(Components.interfaces.nsIPrefBranch);
prefs.setIntPref("network.proxy.type", 1);
prefs.setCharPref("network.proxy.http", "${proxyUsed.host}");
prefs.setIntPref("network.proxy.http_port", "${proxyUsed.port}");
prefs.setCharPref("network.proxy.ssl", "${proxyUsed.host}");
prefs.setIntPref("network.proxy.ssl_port", "${proxyUsed.port}");
prefs.setCharPref("network.proxy.ftp", "${proxyUsed.host}");
prefs.setIntPref("network.proxy.ftp_port", "${proxyUsed.port}");
`;
//running script below
driver.executeScript(setupScript);
//sleep for 1 sec
driver.sleep(1000);
${abcd} is where you put your variables; in the example above I am using ES6 template literals, which handle the interpolation as shown. You can use other concatenation methods of your choice, depending on your language.
Step 3: Visit your site
driver.get("http://whatismyip.com");
Explanation: the above code takes advantage of Firefox's API to change the preferences using JavaScript code.
As far as I know there are only two ways to change the proxy setting: one via a profile (which you are using) and the other using the capabilities of a driver when you instantiate it. Sadly, neither of these methods does what you want, as they both happen before you create your driver.
I have to ask, why is it you want to change your proxy settings? The only solution I can easily think of is to point Firefox at a proxy that you can change at runtime. I am not sure, but that might be possible with browsermob-proxy.
One possible solution is to close the webdriver instance and create it again after each operation, passing a new proxy configuration in the browser profile.
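A rough sketch of that recreate-per-proxy approach, using the selenium 3-style FirefoxProfile API seen elsewhere in this thread (the proxy list is a placeholder):
from selenium import webdriver

def make_firefox_with_proxy(host, port):
    profile = webdriver.FirefoxProfile()
    profile.set_preference("network.proxy.type", 1)  # 1 = manual proxy configuration
    profile.set_preference("network.proxy.http", host)
    profile.set_preference("network.proxy.http_port", int(port))
    profile.set_preference("network.proxy.ssl", host)
    profile.set_preference("network.proxy.ssl_port", int(port))
    profile.update_preferences()
    return webdriver.Firefox(firefox_profile=profile)

# Placeholder proxies; a fresh browser is built for each one and torn down afterwards.
for host, port in [("10.0.0.1", 8080), ("10.0.0.2", 8080)]:
    browser = make_firefox_with_proxy(host, port)
    try:
        browser.get("http://whatismyip.com")
    finally:
        browser.quit()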
Give selenium-wire a try; it can even override header fields:
from seleniumwire import webdriver

options = {
    'proxy': {
        "http": "http://" + IP_PORT,
        "https": "http://" + IP_PORT,
        'custom_authorization': AUTH
    },
    'connection_keep_alive': True,
    'connection_timeout': 30,
    'verify_ssl': False
}

# Create a new instance of the Firefox driver
driver = webdriver.Firefox(seleniumwire_options=options)
driver.header_overrides = {
    'Proxy-Authorization': AUTH
}

# Go to the IP-check page
driver.get("http://whatismyip.com")
driver.close()
What I am trying to do is access the traffic meter data on my local Netgear router. It's easy enough to log in to it and click on the link, but ideally I would like a little app that sits in the system tray (Windows) that I can check whenever I want to see what my network traffic is.
I'm using Python to try to access the router's web page, but I've run into some snags. I originally tried modifying a script that reboots the router (found here: https://github.com/ncw/router-rebooter/blob/master/router_rebooter.py), but it just serves up the raw HTML, and I need the page after the onload JavaScript functions have run. This type of thing is described in many posts about web scraping, and people suggested using Selenium.
I tried Selenium and have run into two problems. First, it actually opens the browser window, which is not what I want. Second, it skips the stuff I put in to pass the HTTP authentication and pops up the login window anyway. Here is the code:
from selenium import webdriver

baseAddress = '192.168.1.1'
baseURL = 'http://%(user)s:%(pwd)s@%(host)s/traffic_meter.htm'
username = 'admin'
pwd = 'thisisnotmyrealpassword'

url = baseURL % {
    'user': username,
    'pwd': pwd,
    'host': baseAddress
}

profile = webdriver.FirefoxProfile()
profile.set_preference('network.http.phishy-userpass-length', 255)
driver = webdriver.Firefox(firefox_profile=profile)
driver.get(url)
So, my question is, what is the best way to accomplish what I want without having it launch a visible web browser window?
Update:
Okay, I tried sircapsalot's suggestion and modified the script to this:
from selenium import webdriver
from contextlib import closing

url = 'http://admin:notmyrealpassword@192.168.1.1/start.htm'

with closing(webdriver.Remote(desired_capabilities=webdriver.DesiredCapabilities.HTMLUNIT)) as driver:
    driver.get(url)
    print(driver.page_source)
This fixes the browser window being launched, but it fails the authentication. Any suggestions?
Okay, I found the solution and it was way easier than I thought. I did try John1024's suggestion and was able to download the proper webpage from the router using wget. However, I didn't like the fact that wget saved the result to a file, which I would then have to open and parse.
I ended up going back to the original router_rebooter.py script I had attempted to modify unsuccessfully the first time. My problem was that I was trying to make it too complicated. This is the final script I ended up using:
import urllib2
user = 'admin'
pwd = 'notmyrealpassword'
host = '192.168.1.1'
url = 'http://' + host + '/traffic_meter_2nd.htm'
passman = urllib2.HTTPPasswordMgrWithDefaultRealm()
passman.add_password(None, host, user, pwd)
authhandler = urllib2.HTTPBasicAuthHandler(passman)
opener = urllib2.build_opener(authhandler)
response = opener.open(url)
stuff = response.read()
response.close()
print stuff
This prints out the entire traffic meter webpage from my router, with its proper values loaded. I can then take this and parse the values out of it. The nice thing about this is that it has no external dependencies like Selenium, wget, or other libraries that need to be installed. Clean is good.
Thank you, everyone, for your suggestions. I wouldn't have gotten to this answer without them.
The web interface for my Netgear router (WNDR3700) is also filled with JavaScript. Yours may differ, but I have found that my scripts can get all the info they need without JavaScript.
The first step is finding the correct URL. Using Firefox, I went to the traffic page and then used "This Frame -> Show only this frame" to discover that the URL for the traffic page on my router is:
http://my_router_address/traffic.htm
After finding this URL, no web browser and no JavaScript are needed. I can, for example, capture this page with wget:
wget http://my_router_address/traffic.htm
Using a text editor on the resulting traffic.htm file, I see that the traffic data is available in a lengthy block that starts:
var traffic_today_time="1486:37";
var traffic_today_up="1,959";
var traffic_today_down="1,945";
var traffic_today_total="3,904";
. . . .
Thus, the traffic.htm file can be easily captured and parsed with the scripting language of your choice. No javascript ever needs to be executed.
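For example, a minimal Python parsing sketch, assuming the page was saved as traffic.htm and uses the variable naming shown above (which may differ between router models):
import re

# Match lines like: var traffic_today_up="1,959";
pattern = re.compile(r'var\s+(traffic_\w+)\s*=\s*"([^"]*)";')

with open("traffic.htm") as f:
    traffic = dict(pattern.findall(f.read()))

print(traffic.get("traffic_today_up"))     # e.g. "1,959"
print(traffic.get("traffic_today_total"))  # e.g. "3,904"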
UPDATE: I have a ~/.netrc file with a line in it like:
machine my_router_address login someloginname password somepassword
Before wget downloads from the router, it retrieves the login info from this file. This has security advantages: if one runs wget http://name:password@..., then the password is visible to everyone on your machine via the process list (ps a). Using .netrc, this never happens. Restrictive permissions can be set on .netrc, e.g. readable only by the user (chmod 400 ~/.netrc).
Before wget downloads from the router, it retrieves the login info from this file. This has security advantages. If one runs wget http://name#password..., then the password is viewable to all on your machine via the process list (ps a). Using .netrc, this never happens. Restrictive permissions can be set on .netrc, e.g. readable only by user (chmod 400 ~/.netrc).