How to get cookies of a website using Python

How can I get the cookies of a website from the browser using Python? The code currently being used is:
get_title = lambda html: re.findall('<title>(.*?)</title>', html, flags=re.DOTALL)[0].strip()
url = config.base_url
public_html = urllib2.urlopen(url).read()
print get_title(public_html)
cj = browsercookie.firefox()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
login_html = opener.open(url).read()
print get_title(login_html)
This code comes after the application has logged in.
config.base_url = "https://10.194.13.71"
It is giving this error:
File "/root/Desktop/mysonicwallnew/testservice.py", line 26, in test_service
    public_html = urllib2.urlopen(url).read()
CertificateError: hostname '10.194.31.71' doesn't match either of 'www.abc.com', 'abc.com'
How do I fix this?

This works for me -
import requests
import browsercookie
import re
cj = browsercookie.chrome()
r = requests.get('http://stackoverflow.com', cookies=cj)
get_title = lambda html: re.findall('<title>(.*?)</title>', html, flags=re.DOTALL)[0].strip()
print r.content
print get_title(r.content)
Try updating the question with the error you are facing, or the exact thing you are looking for from the cookie, to get more specific answers.
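As for the CertificateError in the question: the appliance at https://10.194.13.71 presents a certificate issued for www.abc.com, so hostname verification fails. A minimal sketch of a workaround, assuming it is acceptable to skip verification for this internal host (shown with requests rather than urllib2):
import requests
import browsercookie
cj = browsercookie.firefox()
# verify=False disables certificate verification -- only reasonable for an
# internal host whose certificate is known to be self-signed or mismatched
r = requests.get('https://10.194.13.71', cookies=cj, verify=False)
print r.content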

Related

Is it possible to fetch hidden info in a webpage using the Python requests library (?srn=true)?

Here is the URL:
"https://www.gumtree.com/p/sofas/dfs-couches.-two-3-seaters.-one-teal-and-one-green.-pink-storage-footrest.-less-than-2-years-old.-/1265932994"
Login details:
username: life@tech69.com
password: shiva#123
When opening the page with the above credentials, we can see the contact info partially masked:
Contact details
0770228XXXX
However, adding ?srn=true at the end of the URL gives the full info
(https://www.gumtree.com/p/sofas/dfs-couches.-two-3-seaters.-one-teal-and-one-green.-pink-storage-footrest.-less-than-2-years-old.-/1265932994?srn=true):
Contact details
07702287887
The code I've used is below:
import requests
from bs4 import BeautifulSoup
s = requests.session()
login_data = dict(email='life@tech69.com', password='shiva#123')
s.post('https://my.gumtree.com/login', data=login_data)
r = s.get('https://www.gumtree.com/p/sofas/dfs-couches.-two-3-seaters.-one-teal-and-one-green.-pink-storage-footrest.-less-than-2-years-old.-/1265932994?srn=true')
soup = BeautifulSoup(r.content, 'lxml')
y = soup.find('strong' , 'txt-large txt-emphasis form-row-label').text
print str(y)
However, the above Python code is still giving the partial info:
0770228XXXX
How can I fetch the full info using Python?
That site is protected by reCAPTCHA, a technology specifically designed to prevent automated logins.
So the line s.post('https://my.gumtree.com/login', data=login_data)
results in a reCAPTCHA challenge rather than a logged-in session,
so when you try to go to the other URL you are not actually logged in, and it will not reveal the number...
There may be ways to circumvent this, but I'm not sure of any offhand...
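One way to confirm the CAPTCHA is the blocker (a hedged sketch, not part of the original answer) is to inspect the login response before requesting the listing:
import requests
s = requests.session()
login_data = dict(email='life@tech69.com', password='shiva#123')
resp = s.post('https://my.gumtree.com/login', data=login_data)
# if the login was intercepted, the body typically contains the reCAPTCHA
# widget instead of a logged-in account page
if 'recaptcha' in resp.text.lower():
    print 'login blocked by reCAPTCHA'
else:
    print 'login appears to have succeeded'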

Retrieving form results with requests

I want to submit a multipart/form-data form that sets the input for a simulation on TRILEGAL, and download the file available from a redirected page.
I studied the documentation of requests, urllib, Grab, mechanize, etc., and it seems that in mechanize my code would be:
from mechanize import Browser
browser = Browser()
browser.open("http://stev.oapd.inaf.it/cgi-bin/trilegal")
browser.select_form(nr=0)
browser['gal_coord'] = ["2"]
browser['eq_alpha'] = ["277.981111"]
browser['eq_delta'] = ["-19.0833"]
response = browser.submit()
content = response.read()
However, I could not test it because mechanize is not available in Python 3.
So I tried requests:
import requests
url = 'http://stev.oapd.inaf.it/cgi-bin/trilegal'
values = {'gal_coord': "2",
          'eq_alpha': "277.981111",
          'eq_delta': "-19.0833",
          'field': " 0.047117",
          }
r = requests.post(url, files=values)
but I can't figure out how to get to the results page. If I do
r.content
it displays the content of the form that I had just submitted, whereas if you open the actual website and click 'submit', you see a new page (following the method="post" action="./trilegal_1.6").
How can I get to that new page with requests (i.e. follow to the page that opens up when I click the submit button), and click the link on the results page to retrieve the results file ("The results will be available after about 2 minutes at THIS LINK.")?
If you can point me to any other tool that could do the job I would be really grateful - I spent hours looking through SO for something that could help solve this problem.
Thank you!
Chris
Here is a working solution for Python 2.7:
from mechanize import Browser
from urllib import urlretrieve # for download purpose
from bs4 import BeautifulSoup
browser = Browser()
browser.open("http://stev.oapd.inaf.it/cgi-bin/trilegal")
browser.select_form(nr=0)
browser['gal_coord'] = ["2"]
browser['eq_alpha'] = ["277.981111"]
browser['eq_delta'] = ["-19.0833"]
response = browser.submit()
content = response.read()
soup = BeautifulSoup(content, 'html.parser')
base_url = 'http://stev.oapd.inaf.it'
# fetch the relative url from the page source and append it to the base url
link = soup.findAll('a')[0]['href'].split('..')[1]
url = base_url + str(link)
filename = 'test.dat'
# now download the file
urlretrieve(url, filename)
Your file will be downloaded as test.dat. You can open it with the appropriate program.
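Since the asker could not use mechanize on Python 3, here is a rough requests-based sketch of the same flow. The field names come from the mechanize code above and the ./trilegal_1.6 action comes from the question; the real form may require additional default fields, so treat this as a starting point rather than a tested solution:
import requests
from bs4 import BeautifulSoup
base_url = 'http://stev.oapd.inaf.it'
# the form posts to ./trilegal_1.6 relative to /cgi-bin/ (per the question)
form_action = base_url + '/cgi-bin/trilegal_1.6'
values = {'gal_coord': '2',
          'eq_alpha': '277.981111',
          'eq_delta': '-19.0833',
          'field': '0.047117'}
r = requests.post(form_action, data=values)
soup = BeautifulSoup(r.content, 'html.parser')
# pull the first link out of the response page, as the mechanize answer does
link = soup.find_all('a')[0]['href'].split('..')[1]
result = requests.get(base_url + link)
with open('test.dat', 'wb') as f:
    f.write(result.content)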
I'm posting a separate answer because it would be too cluttered otherwise. Thanks to @ksai, this works in Python 2.7:
import re
import time
from mechanize import Browser
browser = Browser()
browser.open("http://stev.oapd.inaf.it/cgi-bin/trilegal")
browser.select_form(nr=0)
#set appropriate form contents
browser['gal_coord'] = ["2"]
browser['eq_alpha'] = "277.981111"
browser['eq_delta'] = "-19.0833"
browser['field'] = " 0.047117"
browser['photsys_file'] = ["tab_mag_odfnew/tab_mag_lsst.dat"]
browser["icm_lim"] = "3"
browser["mag_lim"] = "24.5"
response = browser.submit()
# wait 1 min while results are prepared
time.sleep(60)
# select the appropriate url
url = 'http://stev.oapd.inaf.it/' + str(browser.links()[0].url[3:])
# download the results file
browser.retrieve(url, 'test1.dat')
Thank you very much!
Chris
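As a refinement to the fixed 60-second sleep above (my addition, not from the answer): one could poll the results URL until the file is actually served, assuming the server errors or returns an empty body while the simulation is still running:
import time
import urllib2
def wait_for_results(url, attempts=10, delay=30):
    # retry until the results file is served, up to attempts * delay seconds
    for _ in range(attempts):
        try:
            data = urllib2.urlopen(url).read()
            if data:
                return data
        except urllib2.HTTPError:
            pass
        time.sleep(delay)
    raise RuntimeError('results not ready after polling')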

urllib.urlopen does not work for this URL though mechanize works

My code below doesn't work for nytimes.com URLs that are articles. Try changing the URL variable to something else and you'll see that it works. Why is that?
#url = "http://www.nytimes.com";
url = "http://www.nytimes.com/interactive/2014/07/07/upshot/how-england-italy-and-germany-are-dominating-the-world-cup.html"
htmlfile = urllib.urlopen(url);
htmltext = htmlfile.read();
print htmltext;
Please advise.
Thanks.
I think NYT validates your request with cookies. If the request isn't an ordinary web-browser request, the server returns a Location header (a redirect), and your request gets lost.
The solution is simple. Use cookiejar like this:
import cookielib, urllib2
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
url = "http://www.nytimes.com/interactive/2014/07/07/upshot/how-england-italy-and-germany-are-dominating-the-world-cup.html"
htmlfile = opener.open(url)
htmltext = htmlfile.read()
print htmltext
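For reference, a Python 3 equivalent of this answer would use http.cookiejar and urllib.request (a direct translation, untested against the NYT site):
import http.cookiejar
import urllib.request
cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
url = "http://www.nytimes.com/interactive/2014/07/07/upshot/how-england-italy-and-germany-are-dominating-the-world-cup.html"
htmltext = opener.open(url).read()
print(htmltext)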
By "doesn't work" I presume you mean it doesn't give you the expected content. I get an empty result when I access that URL using urllib so this is likely yet another aspect of the NYT's "paywall."

Submitting to a web form using python

I have seen questions like this asked many times, but none are helpful.
I'm trying to submit data to a form on the web. I've tried requests and urllib and neither has worked.
For example, here is code that should search for the [python] tag on SO:
import urllib
import urllib2
url = 'http://stackoverflow.com/'
# Prepare the data
values = {'q' : '[python]'}
data = urllib.urlencode(values)
# Send HTTP POST request
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
html = response.read()
# Print the result
print html
Yet when I run it I get the HTML source of the home page.
Here is an example using requests:
import requests
data = {'q': '[python]'}
r = requests.get('http://stackoverflow.com', data=data)
print r.text
Same result! I don't understand why these methods aren't working; I've tried them on various sites with no success, so if anyone has successfully done this please show me how!
Thanks so much!
If you want to pass q as a parameter in the URL using requests, use the params argument, not data (see Passing Parameters In URLs):
r = requests.get('http://stackoverflow.com', params=data)
This will request https://stackoverflow.com/?q=%5Bpython%5D, which isn't what you are looking for.
You really want to POST to a form. Try this:
r = requests.post('https://stackoverflow.com/search', data=data)
This is essentially the same as GET-ting https://stackoverflow.com/questions/tagged/python, but I think you'll get the idea from this.
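To see what each variant actually requests, it can help to print the final URL and status code (a small illustrative addition, not from the original answer):
r = requests.get('http://stackoverflow.com', params={'q': '[python]'})
print r.url          # shows the encoded query string, e.g. .../?q=%5Bpython%5D
print r.status_code  # 200 if the request itself succeeded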
import urllib
import urllib2
url = 'http://www.someserver.com/cgi-bin/register.cgi'
values = {'name' : 'Michael Foord',
'location' : 'Northampton',
'language' : 'Python' }
data = urllib.urlencode(values)
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
the_page = response.read()
This makes a POST request with the data specified in values. We need urllib to encode the data and then urllib2 to send the request.
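For comparison, the same POST is a one-liner with requests (an equivalent sketch; someserver.com is the placeholder from the example above):
import requests
values = {'name': 'Michael Foord',
          'location': 'Northampton',
          'language': 'Python'}
r = requests.post('http://www.someserver.com/cgi-bin/register.cgi', data=values)
the_page = r.content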
Python's mechanize library is also great, and even lets you submit forms. You can use the following code to create a browser object and make requests.
import mechanize, re
br = mechanize.Browser()
br.set_handle_robots(False)   # ignore robots
br.set_handle_refresh(False)  # can sometimes hang without this
br.addheaders = [('User-agent', 'Firefox')]
br.open("http://google.com")
br.select_form('f')
br.form['q'] = 'foo'
br.submit()
resp = None
for link in br.links():
    siteMatch = re.compile('www.foofighters.com').search(link.url)
    if siteMatch:
        resp = br.follow_link(link)
        break
content = resp.get_data()
print content
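Note that mechanize historically ran only on Python 2 (as an earlier question here observes). On Python 3, MechanicalSoup offers a similar interface; a rough, untested equivalent of the snippet above (Google may still block scripted searches):
import mechanicalsoup
br = mechanicalsoup.StatefulBrowser()
br.open("http://google.com")
br.select_form('form')  # select the first form on the page
br['q'] = 'foo'
resp = br.submit_selected()
print(resp.text)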

Performing a get request in Python

Please tell me why these similar pieces of code get different results.
The first one (yandex.ru) gets the page for the request, while the other one (moyareklama.ru) gets the main page of the site.
import urllib
base = "http://www.moyareklama.ru/single_ad_new.php?"
data = {"id":"201623465"}
url = base + urllib.urlencode(data)
print url
page = urllib.urlopen(url).read()
f = open ("1.html", "w")
f.write(page)
f.close()
print page
##base = "http://yandex.ru/yandsearch?"
##data = (("text","python"),("lr","192"))
##url = base + urllib.urlencode(data)
##print url
##page = urllib.urlopen(url).read()
##f = open ("1.html", "w")
##f.write(page)
##f.close()
##print page
Most likely the reason you get something different with urllib.urlopen than with your browser is that your browser can be redirected by JavaScript and meta-refresh tags as well as by standard HTTP 301/302 responses. I'm pretty sure the urllib module will only be redirected by HTTP 301/302 responses.
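A quick way to check which kind of redirect is in play (my sketch, using requests rather than urllib): requests records HTTP-level redirects in r.history, while JavaScript and meta-refresh redirects only show up in the body:
import re
import requests
r = requests.get("http://www.moyareklama.ru/single_ad_new.php?id=201623465")
for hop in r.history:
    print hop.status_code, hop.url  # HTTP 301/302 hops, if any
if re.search(r'http-equiv=["\']?refresh', r.text, re.I):
    print "page uses a meta refresh redirect"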
