Make Selenium grab all cookies - Python

I was told to do a cookie audit of our front-facing sites. We own a lot of domains, so I'm really not going to dig through each one manually and extract the cookies; I decided to go with Selenium. This works up until the point where I want to grab third-party cookies.
Currently (in Python) I can do
driver.get_cookies()
for all the cookies that are set by my domain, but this doesn't give me any Google, Twitter, Vimeo, or other third-party cookies.
I have tried modifying the cookie permissions in the Firefox driver, but it doesn't help. Does anyone know how I can get hold of them?

Your question has been answered on StackOverflow here
Step 1: You need to download and install the "Get All Cookies in XML" extension for Firefox from here (don't forget to restart Firefox after installing the extension).
Step 2: Execute this Python code to have Selenium's Firefox WebDriver save all cookies to an XML file, and then read that file:
from xml.dom import minidom
from selenium import webdriver
import os
import time

def determine_default_profile_dir():
    """
    Returns the path of Firefox's default profile directory.
    :return: directory_path
    """
    appdata_location = os.getenv('APPDATA')
    profiles_path = appdata_location + "/Mozilla/Firefox/Profiles/"
    dirs_files_list = os.listdir(profiles_path)
    default_profile_dir = ""
    for item_name in dirs_files_list:
        if item_name.endswith(".default"):
            default_profile_dir = profiles_path + item_name
    assert default_profile_dir, "did not find Firefox default profile directory"
    return default_profile_dir

# load Firefox with the default profile, so that the "Get All Cookies in XML" add-on is enabled
default_firefox_profile = webdriver.FirefoxProfile(determine_default_profile_dir())
driver = webdriver.Firefox(default_firefox_profile)

# trigger Firefox to save all cookie values into an XML file in the Firefox profile directory
driver.get("chrome://getallcookies/content/getAllCookies.xul")

# wait for a bit to give Firefox time to write all the cookies to the file
time.sleep(40)

# the cookies file is not saved into the default profile directory, but into a temp directory
current_profile_dir = driver.profile.profile_dir
cookie_file_path = current_profile_dir + "/cookie.xml"
print("Reading cookie data from cookie file: " + cookie_file_path)

# load the cookies file and do what you need with it
cookie_file = open(cookie_file_path, 'r')
xmldoc = minidom.parse(cookie_file)
cookie_file.close()
driver.close()

# process all cookies in the xmldoc object
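If it helps, a rough sketch of that last step with minidom might look like the following; the tag and attribute names ("cookie", "host", "name", "value") are guesses about the extension's output format, so inspect an actual cookie.xml and adjust them:
# hypothetical parsing sketch: tag/attribute names are assumptions, check your cookie.xml
for cookie_node in xmldoc.getElementsByTagName("cookie"):
    host = cookie_node.getAttribute("host")
    name = cookie_node.getAttribute("name")
    value = cookie_node.getAttribute("value")
    print(host + ": " + name + "=" + value)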

Selenium can only get the cookies of the current domain:
getCookies
java.util.Set<Cookie> getCookies()
Get all the cookies for the current domain. This is the equivalent of calling "document.cookie" and parsing the result.
Anyway, I have heard of somebody using a Firefox plugin that was able to save all the cookies in XML. As far as I know, that is your best option.

Yes, I don't believe Selenium allows you to interact with cookies for any domain other than the current one.
If you know the domains in question, then you could navigate to each of them in turn (sketched below), but I assume that is unlikely.
It would be a massive security risk if you could access cookies cross-site.
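If you do have such a list, a minimal sketch of that per-domain approach (the site list below is a placeholder):
from selenium import webdriver

# hypothetical list of sites to audit
sites = ["https://example.com", "https://example.org"]

driver = webdriver.Firefox()
all_cookies = {}
for site in sites:
    driver.get(site)
    # get_cookies() only returns cookies visible to the currently loaded domain
    all_cookies[site] = driver.get_cookies()
driver.quit()

for site, cookies in all_cookies.items():
    for cookie in cookies:
        print(site, cookie.get("name"), cookie.get("domain"))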

You can get any cookie out of the browser's SQLite database file in the profile folder.
I added a more complete answer here:
Selenium 2 get all cookies on domain
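For Firefox, a minimal sketch of that idea, assuming the profile's cookies.sqlite uses the standard moz_cookies table (column names can differ between Firefox versions, so check yours):
import sqlite3

# path to the Firefox profile's cookie database; adjust for your own profile folder
cookie_db = "/path/to/firefox/profile/cookies.sqlite"

conn = sqlite3.connect(cookie_db)
try:
    # moz_cookies is Firefox's cookie table
    for host, name, value in conn.execute("SELECT host, name, value FROM moz_cookies"):
        print(host, name, value)
finally:
    conn.close()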

Related

Trying to web scrape a site that requires login but is dynamically loaded, Python, Selenium

I'm trying to scrape my school's website for my upcoming assignments and add them to a file. However, I need to log in to find my assessments, and the website is dynamically loaded, so I need to use Selenium.
My problem is that I'm using the requests package to authenticate myself on the website, but I don't know how to carry that login over when I open the website with Selenium. I'm then hoping to take the HTML and scrape it with Beautiful Soup; I would prefer not to learn another framework.
Here is my code:
import json
from requests import Session
from bs4 import BeautifulSoup
from selenium import webdriver

# Login function that takes the username and password
def login(username, password):
    s = Session()
    payload = {
        'username': username,
        'password': password
    }
    res = s.post('https://www.website_url.com', json=payload)
    print(res.content)
    return s

session = login('username', "password")

driver_path = r'C:\Users\username\Downloads\edgedriver_win64\msedgedriver.exe'
url = 'https://www.website_url.com/assessments/upcoming'
driver = webdriver.Edge(driver_path)
driver.get(url)
The website loads up, but it reverts me to the login page.
P.S. I managed to open the website with Beautiful Soup, but since it is dynamically loaded I can't scrape it.
Edit:
Hey, thanks for the answer! I tried it, and it should work; sadly, it is throwing a lot of errors:
[9308:26392:0215/111025.239:ERROR:chrome_browser_main_extra_parts_metrics.cc(251)] START: GetDefaultBrowser(). If you don't see the END: message, this is crbug.com/1216328.
[9308:7708:0215/111025.270:ERROR:device_event_log_impl.cc(214)] [11:10:25.271] USB: usb_device_handle_win.cc:1049 Failed to read descriptor from node connection: A device attached to the system is not functioning. (0x1F)
[9308:7708:0215/111025.281:ERROR:device_event_log_impl.cc(214)] [11:10:25.287] USB: usb_device_handle_win.cc:1049 Failed to read descriptor from node connection: A device attached to the system is not functioning. (0x1F)
[9308:26392:0215/111025.313:ERROR:chrome_browser_main_extra_parts_metrics.cc(255)] END: GetDefaultBrowser()
I'm not sure what this is. I had a look at the XPath, and it seems to have changed when I resized the window, I think.
My teacher (who isn't familiar with Python) told me I should try logging in to the website in one window and opening another tab with Selenium, so I could avoid the login because I'm already logged in in the other tab. I've looked around for how to open a new tab rather than a new window, but I can't find anything.
Thank you!
Hey, I just found the answer. The problem was that the HTML id and XPath were changing on each reload, and I didn't realize I could use CSS selectors, so I did that. You've helped me a lot; I appreciate it.
login_box = driver.find_element_by_css_selector('body > div.login > div.auth > div.loginBox')
input_boxes = driver.find_elements_by_css_selector('.login>.auth label>input')
input_buttons = driver.find_elements_by_css_selector('.login>.auth button')
input_boxes[0].send_keys(username)
input_boxes[1].send_keys(password)
input_buttons[0].click()
You can use the Selenium webdriver to log in to your school's website, so that the session lives inside the webdriver, and then load the page you want to scrape.
from selenium import webdriver
driver_path = r'C:\Users\username\Downloads\edgedriver_win64\msedgedriver.exe'
url = 'https://www.website_url.com/assessments/upcoming'
login_url = 'https://www.website_url.com'
driver = webdriver.Edge(driver_path)
driver.get(login_url)
driver.find_element_by_xpath("username input xpath").send_keys(username)
driver.find_element_by_xpath("password input xpath").send_keys(password)
driver.find_element_by_xpath("submit button xpath").click()
# wait for the page to load
driver.get(url)
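The "# wait for the page to load" step can be made explicit with a WebDriverWait instead of a fixed sleep; a minimal sketch continuing from the snippet above (the CSS selector is a placeholder for an element that only appears once you are logged in):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver.get(url)
# wait up to 10 seconds for a known element of the assessments page to appear
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "placeholder-selector"))
)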
Plain Selenium has no method for sending HTTP requests directly, but if you use the third-party selenium-requests package (which wraps the webdriver classes), you can also POST the credentials to the login page:
driver.request('POST', login_url, data={"username": username, "password": password})
For the window-size part, this should help.
You can ignore those errors; they are just Selenium/webdriver log output.
I personally don't think you need a new tab, but you can try it out. This post has a lot of helpful answers.
Let me know if you need more help.

using python-requests to access a site with captcha

I've searched the web on how to access a website using requests; essentially, the site asks the user to complete a captcha form before they can access it.
As of now, I understand the process should be:
Visit the site using Selenium:
from selenium import webdriver
browser = webdriver.Chrome('chromedriver.exe')
browser.get('link-to-site')
Complete the captcha form.
Save the cookies from that Selenium session (since somehow these cookies will contain data showing that you've completed the captcha):
input('cookies ready ?')
pickle.dump( browser.get_cookies() , open("cookies.pkl","wb"))
Open a requests session
and get the site:
import requests
session = requests.session()
r = session.get('link-to-site')
Then load the cookies in:
import pickle
with open('cookies.pkl', 'rb') as f:
    saved_cookies = pickle.load(f)
# get_cookies() returns a list of dicts, so convert it to name/value pairs first
cookies = requests.utils.cookiejar_from_dict({c['name']: c['value'] for c in saved_cookies})
session.cookies.update(cookies)
But I'm still unable to access the site, so I'm assuming the Google captcha hasn't been marked as solved when I'm using requests.
So there must be a correct way to go about this; what am I missing?
You need to load the site after setting the cookies; otherwise, the response is what it would be without any cookies. Having said that, you will normally need to submit the form with Selenium and then list the cookies, as a captcha doesn't normally set a cookie in itself.
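In other words, update the session's cookies first and only then request the protected page; a minimal sketch of that order, reusing the pickled Selenium cookies (the URL is a placeholder):
import pickle
import requests

session = requests.session()

# load the cookies saved from the Selenium session *before* requesting the page
with open('cookies.pkl', 'rb') as f:
    saved_cookies = pickle.load(f)
for c in saved_cookies:
    session.cookies.set(c['name'], c['value'], domain=c.get('domain'), path=c.get('path', '/'))

# only now fetch the protected page
r = session.get('link-to-site')
print(r.status_code)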

Open url in browser and fill the form

I want to open a URL using a Python script, and then the same script should fill in the form but not submit it.
For example, the script should open https://www.facebook.com/ and fill in the name and password fields, but not submit them.
You can use Selenium to get this done smoothly. Here is sample code with Google search:
from selenium import webdriver
browser = webdriver.Firefox()
browser.get("http://www.google.com")
browser.find_element_by_id("lst-ib").send_keys("book")
# browser.find_element_by_name("btnK").click()
The last line is intentionally commented out in case you do not want to submit the search.
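Applied to the example from the question, a minimal sketch that fills Facebook's login form without submitting it; the field names "email" and "pass" are assumptions about the current page, so verify them by inspecting the form:
from selenium import webdriver

browser = webdriver.Firefox()
browser.get("https://www.facebook.com/")
# fill the fields but do not click the login button
browser.find_element_by_name("email").send_keys("your-name")      # assumed field name
browser.find_element_by_name("pass").send_keys("your-password")   # assumed field name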
Many websites don't allow web scraping; it may even expose you to a claim of illegal access.
But try using the requests library in Python.
You'll find it easy to do that kind of thing.
https://realpython.com/python-requests/
import requests

payload = {'inUserName': 'USERNAME/EMAIL', 'inUserPass': 'PASSWORD'}
url = 'http://www.locationary.com/home/index2.jsp'
requests.post(url, data=payload)

How to scrape a password-protected ASPX (PDF) page

I'm trying to scrape data about my band's upcoming shows from our agent's web service (such as venue capacity, venue address, set length, set start time ...).
With Python 3.6 and Selenium I've successfully logged in to the site, scraped a bunch of data from the main page, and opened the deal sheet, which is a PDF-like ASPX page. From there, though, I'm unable to scrape the deal sheet. I've successfully switched the Selenium driver to the deal sheet, but when I inspect that page, none of the content is there, just a list of JavaScript scripts.
I tried...
innerHTML = driver.execute_script("return document.body.innerHTML")
...but this yields the same list of scripts rather than the PDF content I can see in the browser.
I've tried the solution suggested here: Python scraping pdf from URL
But the HTML that solution returns is for the login page, not the deal sheet. My problem is different because the PDF is protected by a password.
You won't be able to read the PDF file using the Selenium Python API bindings; the solution would be:
Download the file from the web page using the requests library. Given that you need to be logged in, my expectation is that you might need to fetch cookies from the browser session via the driver.get_cookies() command and add them to the request that downloads the PDF file.
Once you download the file, you will be able to read its content using, for instance, PyPDF2.
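A minimal sketch of that cookie hand-off, assuming driver is the already-logged-in Selenium session and the PDF's URL is known (the URL and filename are placeholders):
import requests

# copy the authenticated Selenium session's cookies into a requests session
session = requests.Session()
for c in driver.get_cookies():
    session.cookies.set(c['name'], c['value'], domain=c.get('domain'), path=c.get('path', '/'))

# download the protected PDF with the copied cookies
pdf_url = "https://example.com/path/to/dealsheet.pdf"  # placeholder
response = session.get(pdf_url)
with open("dealsheet.pdf", "wb") as f:
    f.write(response.content)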
This 3-part solution works for me:
Part 1 (Get the URL for the password protected PDF)
from time import sleep

# with selenium
driver.find_element_by_xpath('xpath To The PDF Link').click()
# wait for the new window to load
sleep(6)
# switch to the new window that just popped up
driver.switch_to.window(driver.window_handles[1])
# get the URL to the PDF
plugin = driver.find_element_by_css_selector("#plugin")
url = plugin.get_attribute("src")
The element containing the URL might be different on your page. Michael Kennedy also suggested #embed and #content.
Part 2 (Create a persistent session with Python requests, as described in How to "log in" to a website using Python's Requests module?, and download the PDF.)
import requests

# Fill in your details here to be posted to the login form.
# Your parameter names are probably different. You can find them by inspecting the login page.
payload = {
    'logOnCode': username,
    'passWord': password
}
# Use 'with' to ensure the session context is closed after use.
with requests.Session() as session:
    session.post(logonURL, data=payload)
    # An authorized request.
    f = session.get(url)  # this is the protected url
    open('c:/yourFilename.pdf', 'wb').write(f.content)
Part 3 (Scrape the PDF with PyPDF2 as suggested by Dmitri T)
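For completeness, a minimal sketch of Part 3 using the older PyPDF2 1.x API from this era (newer releases renamed these methods to PdfReader, pages and extract_text):
from PyPDF2 import PdfFileReader

with open('c:/yourFilename.pdf', 'rb') as f:
    reader = PdfFileReader(f)
    for page_number in range(reader.getNumPages()):
        # extracted text quality depends on how the PDF was generated
        print(reader.getPage(page_number).extractText())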

How to access netgear router web interface

What I am trying to do is access the traffic meter data on my local Netgear router. It's easy enough to log in to it and click on the link, but ideally I would like a little app that sits in the system tray (Windows) that I can check whenever I want to see what my network traffic is.
I'm using Python to try to access the router's web page, but I've run into some snags. I originally tried modifying a script that reboots the router (found here: https://github.com/ncw/router-rebooter/blob/master/router_rebooter.py), but it just serves up the raw HTML, and I need it after the onload JavaScript functions have run. This type of thing is described in many posts about web scraping, and people suggested using Selenium.
I tried Selenium and ran into two problems. First, it actually opens the browser window, which is not what I want. Second, it skips the credentials I put in to pass the HTTP authentication and pops up the login window anyway. Here is the code:
from selenium import webdriver

baseAddress = '192.168.1.1'
baseURL = 'http://%(user)s:%(pwd)s@%(host)s/traffic_meter.htm'
username = 'admin'
pwd = 'thisisnotmyrealpassword'
url = baseURL % {
    'user': username,
    'pwd': pwd,
    'host': baseAddress
}
profile = webdriver.FirefoxProfile()
profile.set_preference('network.http.phishy-userpass-length', 255)
driver = webdriver.Firefox(firefox_profile=profile)
driver.get(url)
So, my question is, what is the best way to accomplish what I want without having it launch a visible web browser window?
Update:
Okay, I tried sircapsalot's suggestion and modified the script to this:
from selenium import webdriver
from contextlib import closing
url = 'http://admin:notmyrealpassword@192.168.1.1/start.htm'
with closing(webdriver.Remote(desired_capabilities=webdriver.DesiredCapabilities.HTMLUNIT)) as driver:
    driver.get(url)
    print(driver.page_source)
This fixes the web browser being loaded, but it failed the authentication. Any suggestions?
Okay, I found the solution, and it was way easier than I thought. I did try John1024's suggestion and was able to download the proper web page from the router using wget. However, I didn't like the fact that wget saved the result to a file, which I would then have to open and parse.
I ended up going back to the original router_rebooter.py script I had attempted to modify unsuccessfully the first time. My problem was that I was trying to make it too complicated. This is the final script I ended up using:
import urllib2
user = 'admin'
pwd = 'notmyrealpassword'
host = '192.168.1.1'
url = 'http://' + host + '/traffic_meter_2nd.htm'
passman = urllib2.HTTPPasswordMgrWithDefaultRealm()
passman.add_password(None, host, user, pwd)
authhandler = urllib2.HTTPBasicAuthHandler(passman)
opener = urllib2.build_opener(authhandler)
response = opener.open(url)
stuff = response.read()
response.close()
print stuff
This prints out the entire traffic meter web page from my router, with its proper values loaded. I can then take this and parse the values out of it. The nice thing about this is that it has no external dependencies like Selenium, wget, or other libraries that need to be installed. Clean is good.
Thank you, everyone, for your suggestions. I wouldn't have gotten to this answer without them.
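For anyone on Python 3, where urllib2 no longer exists, roughly the same approach with the standard library's urllib.request would look like this (same caveats about your router's page names):
import urllib.request

user = 'admin'
pwd = 'notmyrealpassword'
host = '192.168.1.1'
url = 'http://' + host + '/traffic_meter_2nd.htm'

# HTTP basic auth, equivalent to the urllib2 handlers above
passman = urllib.request.HTTPPasswordMgrWithDefaultRealm()
passman.add_password(None, host, user, pwd)
authhandler = urllib.request.HTTPBasicAuthHandler(passman)
opener = urllib.request.build_opener(authhandler)

with opener.open(url) as response:
    stuff = response.read().decode('utf-8', errors='replace')
print(stuff)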
The web interface for my Netgear router (WNDR3700) is also filled with JavaScript. Yours may differ, but I have found that my scripts can get all the info they need without JavaScript.
The first step is finding the correct URL. Using Firefox, I went to the traffic page and then used "This Frame -> Show Only This Frame" to discover that the URL for the traffic page on my router is:
http://my_router_address/traffic.htm
After finding this URL, no web browser and no JavaScript is needed. I can, for example, capture this page with wget:
wget http://my_router_address/traffic.htm
Using a text editor on the resulting traffic.htm file, I see that the traffic data is available in a lengthy block that starts:
var traffic_today_time="1486:37";
var traffic_today_up="1,959";
var traffic_today_down="1,945";
var traffic_today_total="3,904";
. . . .
Thus, the traffic.htm file can be easily captured and parsed with the scripting language of your choice. No javascript ever needs to be executed.
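As an illustration of that parsing step, a small sketch that pulls the var assignments out of a saved traffic.htm with a regular expression (variable names taken from the excerpt above; your firmware may use different ones):
import re

with open("traffic.htm") as f:
    page = f.read()

# match lines like: var traffic_today_up="1,959";
pattern = re.compile(r'var\s+(traffic_\w+)="([^"]*)";')
traffic = dict(pattern.findall(page))

print(traffic.get("traffic_today_up"))
print(traffic.get("traffic_today_down"))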
UPDATE: I have a ~/.netrc file with a line in it like:
machine my_router_address login someloginname password somepassword
Before wget downloads from the router, it retrieves the login info from this file. This has security advantages: if one runs wget http://name:password@..., then the password is visible to everyone on your machine via the process list (ps a). Using .netrc, this never happens. Restrictive permissions can be set on .netrc, e.g. readable only by the user (chmod 400 ~/.netrc).
