displaying cookies through python

I want to print cookies to a text file, printcooki.txt, and then open the webpage https://www.google.co.in. But at the end I get a blank text file and the webpage does not open. What changes need to be made in my program? Please help me out.
from urllib2 import Request, build_opener, HTTPCookieProcessor, HTTPHandler
import cookielib

cookiejar = cookielib.CookieJar()  # renamed from `object` to avoid shadowing the builtin
opener = build_opener(HTTPCookieProcessor(cookiejar), HTTPHandler())
webreq = Request("https://www.google.co.in/")
f = opener.open(webreq)
html = f.read()
print html[:10]
print "the webpage has following cookies "
for cookie in cookiejar:
    print cookie

# to save cookies into printcooki.txt
createtext = open(r"C:\Users\****\Desktop\printcooki.txt", "w")
for cookie in cookiejar:
    print >> createtext, cookie
createtext.close()

opener.open('https://www.google.co.in')  # to open a webpage (fetches it again; does not launch a browser)

It is likely that your company makes use of a corporate firewall.
In such a case - and if you have the needed credentials - you can set a couple of environment variables to instruct urllib2 to use your corporate proxy.
For example in Bash you can run the following commands:
export HTTP_PROXY="http://<user_name>:<user_password>@<proxy_ip_address_or_name>:<proxy_port>"
export HTTPS_PROXY="$HTTP_PROXY"
before running your Python script.
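Alternatively, if you prefer not to rely on environment variables, you can wire the proxy into the opener itself with urllib2's ProxyHandler. A minimal sketch, with placeholder credentials:

import cookielib
import urllib2

# Placeholder proxy credentials -- substitute your own
proxies = {
    'http': 'http://user_name:user_password@proxy_host:proxy_port',
    'https': 'http://user_name:user_password@proxy_host:proxy_port',
}
opener = urllib2.build_opener(urllib2.ProxyHandler(proxies),
                              urllib2.HTTPCookieProcessor(cookielib.CookieJar()))
print opener.open('https://www.google.co.in/').read()[:10]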

Related

Downloading/Exporting a site's Search Results by using the Export button in Python

So I'm trying to use Python to scrape data from the following website (with a sample query): https://par.nsf.gov/search/fulltext:NASA%20NOAA%20coral
However, instead of scraping the search results, I realized it would be easier to trigger the Save Results as "CSV" link programmatically and work with that CSV data instead, since it would free me from having to navigate all the pages of search results.
I inspected the CSV link element and found it calls an exportSearch('csv') function.
By typing the function's name into the console, I found that the CSV link just sets window.location.href to: https://par.nsf.gov/export/format:csv/fulltext:NASA%20NOAA%20coral
If I follow that link in the same browser, a save prompt opens with a CSV to save.
My issue starts when I try to replicate this process using Python. If I call the export link directly using the Requests library, the response is empty.
url = "https://par.nsf.gov/export/format:csv/fulltext:" + urllib.parse.quote("NASA NOAA coral")
print("URL: ", url)
response = requests.get(url)
print("Response: ", len(response.content))
Can someone show me what I'm missing? I don't know how to first establish search results on the website's server that I can then access for export using Python.
I believe the link to download the CSV is here:
https://par.nsf.gov/export/format:csv//term:your_search_term
where your_search_term is URL-encoded.
In your case, the link is: https://par.nsf.gov/export/format:csv//filter-results:F/term:NASA%20NOAA%20coral
You can use the code below to download the file in Python with the urllib library.
import urllib.parse
import urllib.request

# Include your search term
url = "https://par.nsf.gov/export/format:csv//filter-results:F/term:" + urllib.parse.quote("NASA NOAA coral")
print("URL: ", url)
urllib.request.urlretrieve(url)
# You can also specify the path where the file should be saved, e.g.
# (the filename is an example; urlretrieve needs a file path, not a directory)
urllib.request.urlretrieve(url, '/Users/Downloads/results.csv')
# Run this command only if you face an SSL certificate error; it generally occurs for Mac users with Python 3.6.
# Run this in jupyter notebook
!open /Applications/Python\ 3.6/Install\ Certificates.command
# On terminal just run
open /Applications/Python\ 3.6/Install\ Certificates.command
Similarly, you can use the wget module to fetch your file:

import urllib.parse

import wget

url = "https://par.nsf.gov/export/format:csv//filter-results:F/term:" + urllib.parse.quote("NASA NOAA coral")
print("URL: ", url)
wget.download(url)
Turns out I was missing some cookies that don't come up when you do a simple requests GET (e.g. WT_FPC).
To get around this, I used selenium's webdriver to do an initial GET request and used the cookies from that request to set up the POST request that downloads the CSV data.
import urllib.parse

import requests
from selenium import webdriver

chrome_path = "path to chrome driver"

with requests.Session() as session:
    url = "https://par.nsf.gov/search/fulltext:" + urllib.parse.quote("NASA NOAA coral")

    # GET fetches the website plus the needed cookies
    browser = webdriver.Chrome(executable_path=chrome_path)
    browser.get(url)

    # Copy the webdriver's cookies into the requests session
    for c in browser.get_cookies():
        session.cookies.set(c['name'], c['value'])

    url = "https://par.nsf.gov/export/format:csv/fulltext:" + urllib.parse.quote("NASA NOAA coral")
    response = session.post(url)

    # No longer empty
    print(response.content.decode('utf-8'))

How do I handle a '403 Forbidden' response with python scraping?

I have scraped the URL of the picture I want, but when I use the requests module to download it, the server responds with 403 Forbidden.
I tried capturing the traffic with Chrome's F12 tools; there are many JS responses on the main page, and the request for the picture's URL shows up simply as type Doc.
import requests

# `headers` was referenced but not defined in the question; a minimal placeholder:
headers = {'User-Agent': 'Mozilla/5.0'}

lines = [
    'https://i.hamreus.com/ps4/0-9/9%E5%8F%B7%E6%9D%80%E6%89%8B%E6%B9%9B%E8%93%9D%E4%BB%BB%E5%8A%A1[%E9%AB%98%E6%A1%A5%E7%BE%8E%E7%94%B1%E7%BA%AA]/vol_02/seemh-001-a5f6.jpg.webp?cid=121333&md5=7dHbKv51JwzRC6jjd7p3oQ',
    'https://i.hamreus.com/ps4/0-9/9%E5%8F%B7%E6%9D%80%E6%89%8B%E6%B9%9B%E8%93%9D%E4%BB%BB%E5%8A%A1[%E9%AB%98%E6%A1%A5%E7%BE%8E%E7%94%B1%E7%BA%AA]/vol_02/seemh-002-c60d.jpg.webp?cid=121333&md5=7dHbKv51JwzRC6jjd7p3oQ',
    'https://i.hamreus.com/ps4/0-9/9%E5%8F%B7%E6%9D%80%E6%89%8B%E6%B9%9B%E8%93%9D%E4%BB%BB%E5%8A%A1[%E9%AB%98%E6%A1%A5%E7%BE%8E%E7%94%B1%E7%BA%AA]/vol_02/seemh-003-4b8a.jpg.webp?cid=121333&md5=7dHbKv51JwzRC6jjd7p3oQ',
    'https://i.hamreus.com/ps4/0-9/9%E5%8F%B7%E6%9D%80%E6%89%8B%E6%B9%9B%E8%93%9D%E4%BB%BB%E5%8A%A1[%E9%AB%98%E6%A1%A5%E7%BE%8E%E7%94%B1%E7%BA%AA]/vol_02/seemh-004-87ac.jpg.webp?cid=121333&md5=7dHbKv51JwzRC6jjd7p3oQ',
]

def download_pic(url, s):
    r = s.get(url, headers=headers)
    with open(url.split('/')[-1].split('.')[0] + '.jpg', 'wb') as fp:
        fp.write(r.content)  # r.content is a property, not a method

def main():
    s = requests.Session()
    main_url = 'https://www.manhuagui.com/comic/12087/121333.html'
    s.get(main_url, headers=headers)  # visit the main page first to pick up its cookies
    for each_url in lines:
        download_pic(each_url.strip(), s)

if __name__ == '__main__':
    main()
I can't download the picture I want
Some websites have a security provision against requests from external sources, particularly scripts. That is why you are getting the 403 error: if the request does not look like it comes from a browser, plain urllib or requests calls may be rejected.
My workaround was to call a shell script from Python, passing it the URL of the image. In the shell script I used $1 to access the URL passed in, and wget to download the image, as such:
Python:
import subprocess

# 'download.sh' is an assumed name for the shell script below
subprocess.call(['./download.sh', url])
Script (.sh)
wget "$1"
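That said, it is often enough to send browser-like headers with requests itself instead of shelling out. A hedged sketch (the header values are assumptions; copy the real ones from your browser's dev tools):

import requests

# Placeholder header values -- inspect a working browser request for the real ones
headers = {
    'User-Agent': 'Mozilla/5.0',
    'Referer': 'https://www.manhuagui.com/comic/12087/121333.html',
}

image_url = 'https://i.hamreus.com/ps4/0-9/...'  # one of the image URLs above
r = requests.get(image_url, headers=headers)
print(r.status_code)  # expect 200 instead of 403 if the headers are accepted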

how to run a python script to display a webpage for auto login with one click

I have a python script which I guess is fine. I want to keep the script on the desktop, and with one click I want it to run, open a browser (IE, Firefox or Chrome), and perform the script, i.e. log in to a website.
import urllib, urllib2, cookielib

username = 'xyzusername'
password = 'xyzpassword'

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
login_data = urllib.urlencode({'userid': username, 'pass': password})
resp = opener.open('http://www.gmail.com', login_data)  # pass the form data so it is actually POSTed
print resp.read()
How do I run the script? Please help.
Hmmmm, your script is logging into a web site, but it's not opening a browser.
If you want to open a browser, you might be better off looking at Selenium:
http://coreygoldberg.blogspot.com/2009/09/selenium-rc-with-python-in-30-seconds.html
http://jimmyg.org/blog/2009/getting-started-with-selenium-and-python.html
http://seleniumhq.org/
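For instance, a minimal Selenium sketch (the field names 'userid' and 'pass' are taken from your script and may not match the site's real login form):

from selenium import webdriver

driver = webdriver.Firefox()  # or webdriver.Chrome() / webdriver.Ie()
driver.get('http://www.gmail.com')

# Field names are assumptions based on the script above
driver.find_element_by_name('userid').send_keys('xyzusername')
driver.find_element_by_name('pass').send_keys('xyzpassword')
driver.find_element_by_name('pass').submit()

Save that as a .py file on the desktop; on Windows, double-clicking it will run it as long as .py files are associated with Python.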

Downloading Files with Python Urllib, Urllib2

I am trying to download files from a website using urllib as described in this thread: link text
import urllib
urllib.urlretrieve ("http://www.example.com/songs/mp3.mp3", "mp3.mp3")
I am able to download the files (mostly PDFs), but all I get are corrupted files that cannot be opened. I suspect it's because the website requires a login.
How can the above function be modified to handle cookies? I already know the names of the form fields that carry the username & password information. When I print the return values of urlretrieve I get messages like:
a, b = urllib.urlretrieve ("http://www.example.com/songs/mp3.mp3", "mp3.mp3")
print a, b
>> cache-control: no-cache, no-store, must-revalidate, s-maxage=300, proxy-revalidate
>> connection: close
I am able to download the files manually if I enter their URLs in the browser. Thanks
First, urllib2 actually supports cookies, and cookie handling should be easy. Second, you can check what kind of file you have downloaded. E.g. AFAIK all MP3s start with the bytes "ID3".
import cookielib, urllib2
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
r = opener.open("http://example.com/")
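To verify what actually came back, you can inspect the first few bytes of the response. A sketch: '%PDF' marks PDFs and 'ID3' most MP3s, while an HTML login page would start with something like '<' instead.

data = r.read()
if data[:4] == '%PDF':
    print 'looks like a PDF'
elif data[:3] == 'ID3':
    print 'looks like an MP3'
else:
    # probably an HTML login/error page rather than the file you wanted
    print 'unexpected content:', repr(data[:20])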
It might be possible that the server you are requesting is looking for certain headers, such as User-Agent. You may try mimicking browser behavior by sending additional headers.
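For instance, a sketch of sending a browser-like User-Agent with urllib2 (the UA string is a placeholder):

import urllib2

req = urllib2.Request("http://www.example.com/songs/mp3.mp3",
                      headers={'User-Agent': 'Mozilla/5.0'})  # placeholder UA
data = urllib2.urlopen(req).read()
with open('mp3.mp3', 'wb') as f:
    f.write(data)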

Python auth_handler not working for me

I've been reading about Python's urllib2's ability to open and read directories that are password protected, but even after looking at examples in the docs, and here on StackOverflow, I can't get my script to work.
import urllib2
# Create an OpenerDirector with support for Basic HTTP Authentication...
auth_handler = urllib2.HTTPBasicAuthHandler()
auth_handler.add_password(realm=None,
                          uri='https://webfiles.duke.edu/',
                          user='someUserName',
                          passwd='thisIsntMyRealPassword')
opener = urllib2.build_opener(auth_handler)
# ...and install it globally so it can be used with urlopen.
urllib2.install_opener(opener)
socks = urllib2.urlopen('https://webfiles.duke.edu/?path=/afs/acpub/users/a')
print socks.read()
socks.close()
When I print the contents, it prints the contents of the login screen that the URL I'm trying to open redirects to. Does anyone know why this is?
auth_handler is only for basic HTTP authentication. The site here uses an HTML form, so you'll need to submit your username/password as POST data.
I recommend using the mechanize module, which will simplify the login for you.
Quick example:
import mechanize
browser = mechanize.Browser()
browser.open('https://webfiles.duke.edu/?path=/afs/acpub/users/a')
browser.select_form(nr=0)
browser.form['user'] = 'username'
browser.form['pass'] = 'password'
req = browser.submit()
print req.read()
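If you would rather stay with plain urllib2, the same form login is roughly a cookie-aware POST. A sketch; the form's action URL and field names here are assumptions, so inspect the login page's HTML for the real ones:

import urllib, urllib2, cookielib

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

# Assumed action URL and field names -- check the login form's HTML
data = urllib.urlencode({'user': 'someUserName', 'pass': 'thisIsntMyRealPassword'})
resp = opener.open('https://webfiles.duke.edu/login', data)
print resp.read()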
