Python and seleniumrequests: getting request headers

I need to get a cookie from a specific request. The problem is that it gets generated out of my sight, so I need to use Selenium to simulate opening the page in a browser and generate it myself. The second problem is that I can't access the request cookie: the cookie I need is in the request, not the response.
from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
binary = FirefoxBinary('/usr/bin/firefox')
driver = webdriver.Firefox(firefox_binary=binary)
driver.get('http://www.princess.com/find/searchResults.do')
driver.find_elements_by_xpath('//*[@id="LSV010"]/div[3]/div[1]/div[1]/button')[0].click()
This code block opens the page and, on the second result, clicks the "View all dates and pricing" link. The cookie is sent with that request by the browser, not returned in a response. I need to get my hands on that cookie. Other libraries are OK if they can do the job.
If you go to the page manually, this is what I need: in the browser's network panel I have selected the request and the Cookie header I need, and as it shows, it is in the request, not the response. Is this possible to achieve?
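Since the question allows other libraries, one option is selenium-wire, which proxies the browser's traffic so the outgoing request headers (including Cookie) can be read. This is only a sketch based on that library's documented usage; the filter on the request URL is an assumption, not something taken from the question:

from seleniumwire import webdriver  # pip install selenium-wire (assumed available)

driver = webdriver.Firefox()
driver.get('https://www.princess.com/find/searchResults.do')
# ... perform the click that triggers the request you are interested in ...

for request in driver.requests:
    # each captured request exposes its outgoing headers, including Cookie
    if 'searchResults' in request.url:   # hypothetical filter for the target request
        print(request.url)
        print(request.headers.get('Cookie'))

driver.quit()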

I found out how this is done. Using the Selenium library, I managed to get this working:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


def fix_header():
    # profile is a FirefoxProfile defined elsewhere in the original code
    browser = webdriver.Firefox(executable_path='geckodriver.exe', firefox_profile=profile)
    browser.get('https://www.princess.com/find/searchResults.do')
    browser.find_element_by_xpath(
        "//*[@class='expand-table view-all-link util-link plain-text-btn gotham-bold']")
    WebDriverWait(browser, 60).until(EC.visibility_of_any_elements_located(
        (By.XPATH, "//*[@class='expand-table view-all-link util-link plain-text-btn gotham-bold']")))
    try:
        browser.find_element_by_xpath(
            "//*[@class='expand-table view-all-link util-link plain-text-btn gotham-bold']").click()
    except Exception:
        # a modal sometimes blocks the click; close it and try again
        browser.find_element_by_class_name('mfp-close').click()
        browser.find_element_by_xpath(
            "//*[@class='expand-table view-all-link util-link plain-text-btn gotham-bold']").click()
    cookies = browser.get_cookies()
    browser.close()

    # join the cookies into a single "name=value; name=value" header string
    chrome_cookie = ''
    for c in cookies:
        chrome_cookie += c['name']
        chrome_cookie += '='
        chrome_cookie += c['value']
        chrome_cookie += "; "
    return chrome_cookie[:-2]
Selenium actually opens the page in a real browser, "clicks" the link I need, and collects the cookies that were set.
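For completeness, here is a sketch (my own, not part of the answer) of how the returned string could be attached as a Cookie header on a follow-up requests call; which request actually needs it is not shown above, so the URL is just the search page from the question:

import requests

cookie_header = fix_header()
headers = {'Cookie': cookie_header}
# hypothetical follow-up request that needs the browser-generated cookie
response = requests.get('https://www.princess.com/find/searchResults.do', headers=headers)
print(response.status_code)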

Related

using python-requests to access a site with captcha

I've searched the web on how to access a website using requests; essentially, the site asks the user to complete a captcha form before they can access it.
As of now, I understand the process should be:
visit the site using selenium
from selenium import webdriver
browser = webdriver.Chrome('chromedriver.exe')
browser.get('link-to-site')
complete the captcha form
save the cookies from that Selenium session (since somehow these cookies will contain data showing that you've completed the captcha)
input('cookies ready ?')
pickle.dump(browser.get_cookies(), open("cookies.pkl", "wb"))
open a requests session
get the site
import requests
session = requests.session()
r = session.get('link-to-site')
then load the cookies in
with open('cookies.pkl', 'r') as f:
    cookies = requests.utils.cookiejar_from_dict(json.load(f))
session.cookies.update(cookies)
But I'm still unable to access the site, so I'm assuming the Google captcha hasn't been solved when I'm using requests.
There must be a correct way to go about this; what am I missing?
You need to load the site after setting the cookies; otherwise the response is whatever it would be without any cookies. Having said that, you will normally need to submit the captcha form with Selenium and then collect the cookies, as a captcha doesn't normally set a cookie by itself.
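A minimal sketch of the order the answer describes. Note that the question pickles the cookies but then reads them back with json; this sketch keeps pickle throughout and sets the cookies before requesting the page:

import pickle
import requests

session = requests.session()

# load the cookies saved from the Selenium session
with open('cookies.pkl', 'rb') as f:
    selenium_cookies = pickle.load(f)

# set them on the session first...
for cookie in selenium_cookies:
    session.cookies.set(cookie['name'], cookie['value'])

# ...and only then request the site
r = session.get('link-to-site')
print(r.status_code)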

Web scraping when scrolling down is needed

I want to scrape, e.g., the titles of the first 200 questions on the page https://www.quora.com/topic/Stack-Overflow-4/all_questions. I tried the following code:
import requests
from bs4 import BeautifulSoup
url = "https://www.quora.com/topic/Stack-Overflow-4/all_questions"
print("url")
print(url)
r = requests.get(url) # HTTP request
print("r")
print(r)
html_doc = r.text # Extracts the html
print("html_doc")
print(html_doc)
soup = BeautifulSoup(html_doc, 'lxml') # Create a BeautifulSoup object
print("soup")
print(soup)
It gave me the text at https://pastebin.com/9dSPzAyX. If we search for href='/, we can see that the HTML does contain the titles of some questions. However, there are not enough of them; on the actual web page, a user needs to scroll down manually to trigger extra loading.
Does anyone know how I could mimic "scrolling down" programmatically to load more of the page's content?
Infinite scrolling on a web page is driven by JavaScript. Therefore, to find out which URL we need to request and which parameters to use, we need to either study the JS code running inside the page or, preferably, examine the requests the browser makes when you scroll down. We can inspect those requests with the Developer Tools.
See the example for Quora:
The more you scroll down, the more such requests are generated. Your requests should then be sent to that URL instead of the normal page URL, but keep in mind that you have to send the correct headers and payload.
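A hedged sketch of that approach; the endpoint, headers, and payload below are placeholders standing in for whatever the Developer Tools actually show, not values from this answer:

import requests

# hypothetical values copied from the Network tab (placeholders, not real)
xhr_url = 'https://www.quora.com/ajax/some_paged_endpoint'
headers = {
    'User-Agent': 'Mozilla/5.0',
    'Content-Type': 'application/json',
}
payload = {'page': 2}  # whatever pagination parameters the real request carries

response = requests.post(xhr_url, headers=headers, json=payload)
print(response.status_code)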
Another, easier solution would be to use Selenium.
I couldn't find a way to do this with requests, but you can use Selenium. The code below first prints the number of questions at the initial load, then sends the End key to mimic scrolling down; you can see the number of questions go from 20 to 40 after the End key is sent.
I wait for 5 seconds before reading the DOM again, in case the script runs too fast for the new content to load. You can improve this by using expected conditions (EC) with Selenium.
The page loads 20 questions per scroll. So if you are looking to scrape 100 questions, then you need to send the End key 5 times.
To use the code below you need to install chromedriver.
http://chromedriver.chromium.org/downloads
import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

CHROMEDRIVER_PATH = ""
CHROME_PATH = ""
WINDOW_SIZE = "1920,1080"

chrome_options = Options()
# chrome_options.add_argument("--headless")
chrome_options.add_argument("--window-size=%s" % WINDOW_SIZE)
chrome_options.binary_location = CHROME_PATH
prefs = {'profile.managed_default_content_settings.images': 2}
chrome_options.add_experimental_option("prefs", prefs)

url = "https://www.quora.com/topic/Stack-Overflow-4/all_questions"

def scrape(url, times):
    if not url.startswith('http'):
        raise Exception('URLs need to start with "http"')
    driver = webdriver.Chrome(
        executable_path=CHROMEDRIVER_PATH,
        chrome_options=chrome_options
    )
    driver.get(url)
    counter = 1
    while counter <= times:
        # count the questions currently loaded
        q_list = driver.find_element_by_class_name('TopicAllQuestionsList')
        questions = [x for x in q_list.find_elements_by_xpath('//div[@class="pagedlist_item"]')]
        q_len = len(questions)
        print(q_len)
        # send the End key to trigger the next batch, then give it time to load
        html = driver.find_element_by_tag_name('html')
        html.send_keys(Keys.END)
        wait = WebDriverWait(driver, 5)
        time.sleep(5)
        questions2 = [x for x in q_list.find_elements_by_xpath('//div[@class="pagedlist_item"]')]
        print(len(questions2))
        counter += 1
    driver.close()

if __name__ == '__main__':
    scrape(url, 5)
I recommend using Selenium rather than BeautifulSoup here. Selenium can both control the browser (scroll down, click buttons, etc.) and parse the page.
This example scrolls down to collect all the users who liked a post on Instagram:
https://stackoverflow.com/a/54882356/5611675
If the content only loads on "scrolling down", this probably means that the page is using JavaScript to load the content dynamically.
You can try using a headless web client such as PhantomJS to load the page and execute the JavaScript in it, and simulate the scroll by injecting some JS such as document.body.scrollTop = sY; (see "Simulate scroll event using Javascript").
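The same idea works from Selenium by injecting the scroll with execute_script. This is a generic sketch; the loop condition and sleep interval are my own assumptions, not taken from the answers above:

import time
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.quora.com/topic/Stack-Overflow-4/all_questions')

last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    # scroll to the bottom of the page, then give the new content time to load
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(3)
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break  # no more content was loaded
    last_height = new_height

html_doc = driver.page_source  # hand this to BeautifulSoup as in the question
driver.quit()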

Selenium webdriver clears cookies when navigating to new domain

from selenium import webdriver

options = webdriver.ChromeOptions()
# options.add_argument('-headless')
browser = webdriver.Chrome(executable_path="./chromedriver", options=options)

browser.get("http://127.0.0.1:8080/")
print(browser.title)
browser.find_element_by_name('username').send_keys("admin")
browser.find_element_by_name("password").send_keys("hunter2")
browser.find_element_by_tag_name("button").click()

print(browser.get_cookies())

print('loading another page: ' + url)
# example: url = example.com
browser.get(url)
I'm trying to do an automated test involving CORS. So, my requirement is I login to domain A successfully and get some cookies set. This works, and I see the cookies set when I do get_cookies(). Next, I navigate to another domain B, which makes a CORS request to domain A (all CORS headers are properly set, and tested manually). But this request fails because it appears that when I navigate to domain B, the cookies are cleared, so the request is unsuccessful.
Is there any way to force the cookies not to be cleared?
Note: the behavior is the same with the Chrome and Firefox drivers on OS X.
Use browser.navigate().to(url) instead of get().
Credits: https://stackoverflow.com/a/42465954/4536543
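Note that the Python bindings do not expose a separate navigate() method; a rough workaround sometimes used is to change the location via JavaScript instead of calling get() again. This is my own sketch, not part of the linked answer:

# navigate by changing window.location instead of issuing a fresh driver.get()
browser.execute_script("window.location.href = arguments[0];", url)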

Download file from dynamic url by selenium & phantomjs

I'm trying to write a web crawler that downloads a CSV file from a dynamic URL.
The URL looks like http://aaa/bbb.mcv/Download?path=xxxx.csv
If I put this URL into my Chrome browser, the download starts immediately and the page doesn't change.
I can't even find the request in the developer tools.
I've tried two ways to get the file:
put the URL into Selenium:
driver.get(url)
get the file with the requests library:
requests.get(url)
Neither worked...
Any advice?
Output of the two approaches:
With Selenium, I take a screenshot and the page doesn't seem to change (just like in Chrome).
With requests, I print out the data I get and it looks like an HTML file; when I open it in the browser, it is a login page.
import requests

url = '...'
save_location = '...'
session = requests.session()
response = session.get(url)
with open(save_location, 'wb') as t:
    for chunk in response.iter_content(1024):
        t.write(chunk)
Thanks for everyone's help!
I finally found the problem...
I log in to the website with Selenium but use requests to download the file, and the requests session doesn't have any of the authentication information!
So my solution is to get the cookies from Selenium first, then pass them to requests.
Here is my code:
cookies = driver.get_cookies()  # Selenium web driver
s = requests.Session()
for cookie in cookies:
    s.cookies.set(cookie['name'], cookie['value'])
response = s.get(url)
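Putting the two halves together, here is a hedged end-to-end sketch; the URL and save path are placeholders, as in the question, and the login steps are elided:

import requests
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('http://aaa/bbb.mcv')  # placeholder: log in here with Selenium first
# ... perform the login steps ...

# copy the authenticated cookies from the browser into a requests session
s = requests.Session()
for cookie in driver.get_cookies():
    s.cookies.set(cookie['name'], cookie['value'])
driver.quit()

# stream the CSV download with the authenticated session
url = 'http://aaa/bbb.mcv/Download?path=xxxx.csv'  # placeholder from the question
save_location = 'download.csv'                     # assumed local file name
response = s.get(url, stream=True)
with open(save_location, 'wb') as t:
    for chunk in response.iter_content(1024):
        t.write(chunk)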

Staying logged in with Selenium in Python

I am trying to log into a website with Selenium and then, once logged in, navigate to a different page on the same site while staying logged in. However, when I navigate to the other page, I find I have been logged out.
I believe this is because I do not understand how the webdriver.Firefox().get() function works exactly.
My code:
from selenium import webdriver
from Code.Other import XMLParser
#Initialise driver and go to webpage
driver = webdriver.Firefox()
URL = 'http://www.website.com'
driver.get(URL)
#Login
UserName = XMLParser.XMLParse('./Config.xml','UserName')
Password = XMLParser.XMLParse('./Config.xml','Password')
element = driver.find_elements_by_id('UserName')
element[0].send_keys(UserName)
element = driver.find_elements_by_id('Password')
element[0].send_keys(Password)
element = driver.find_elements_by_id('Submit')
element[0].click()
#Go to new page
URL = 'http://www.website.com/page1'
driver.get(URL)
Unfortunately I am navigated to the new page but I am no longer logged in. How do I fix this?
It looks like the website doesn't have enough time to react to your submission of the login form: you click Submit but don't wait for the response before opening another URL.
Wait for some event that follows a successful login (such as the session cookies appearing, a change in the DOM, or simply time.sleep) and only then go to the other page.
P.S.: if that doesn't help, compare your cookies right after login and again after opening the new URL; the problem may lie with the authorization backend or the webdriver.
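A minimal sketch of that kind of wait; the element ID used to detect a successful login is a placeholder, since the real page structure isn't shown in the question:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get('http://www.website.com')
# ... fill in the credentials and click Submit as in the question ...

# wait up to 10 seconds for something that only exists once you are logged in
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, 'LogoutLink'))  # hypothetical element ID
)

driver.get('http://www.website.com/page1')  # the session cookies are now in place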
