Python passing cookies to header issue

What I need is to extract the uid cookie from the first web site and open the second one with it (it's a sort of authorisation).
It doesn't work with this code:
#!/usr/bin/env python
import urllib, urllib2, cookielib
import socket, Cookie

def extract(url):
    jar = cookielib.FileCookieJar("cookies")
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))
    opener.addheaders = [('User-agent',
        'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.14) Gecko/20110201 Firefox/2.0.0.14')]
    response = opener.open(url)
    for cookie in jar:
        precious_value = cookie.value
    return precious_value

site1 = "mysite1.com"
site2 = "mysite2.com"

cp = urllib2.HTTPCookieProcessor()
cj = cp.cookiejar
cj.set_cookie(cookielib.Cookie(0, cookie_name,
    extract(site1),
    '80', False, 'domain', True, False, '/path',
    True, False, None, False, None, None, None))
opener = urllib2.build_opener(urllib2.HTTPHandler(), cp)
opener.addheaders.append(('User-agent', 'Mozilla/5.0 (compatible)'))
print opener.open(site2).read()
nor this way:
#!/usr/bin/env python
import urllib, urllib2, cookielib

def extract(url):
    jar = cookielib.FileCookieJar("cookies")
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))
    opener.addheaders = [('User-agent',
        'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.14) Gecko/20110201 Firefox/2.0.0.14')]
    response = opener.open(url)
    for cookie in jar:
        precious_value = cookie
    return precious_value

site1 = "mysite1.com"
site2 = "mysite2.com"

jar = cookielib.FileCookieJar("cookies")
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))
opener.addheaders = [('User-Agent',
    'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.14) Gecko/20110201 Firefox/2.0.0.14')]
opener.addheaders = [('Cookies', extract(site1))]
response = opener.open(site2)
print response.read()
However, I've managed to succeed with the 'requests' library, and the code looks nice:
cookies= dict(mycid='9ti6cACUi6AqxXBG2H9AMPkrfRbBJPalKTAh_bLcuQ8c8C')
r = requests.get(url, cookies = cookies)
print r.text
It's fine for me and I don't have anything against requests... but still, what have I done wrong in the first two attempts? In both cases the extract procedure works fine and I can see that the uid has been properly extracted. I guess the problem is in the addheaders area. The answer is probably obvious, but I still can't get through it. Can someone help?
1) What is the proper way to pass a cookie into the headers with plain urllib or urllib2?
2) How can I pass it as a parameter that can be changed, not just a reference to the extracted object?
3) How should I properly pass it as a name/value object?
Thanks in advance

Your loop in def extract(url): has two problems:
It always returns the last value, which is not necessarily the cookie you want.
It makes an assumption about the order in which cookies are stored, which you can't know.
(I'm assuming precious_value is defined somewhere else, otherwise this code doesn't work.)
To know which key you should use to retrieve the particular cookie you're interested in, you can use Chrome's developer tools to see the name of the cookie set by the site you want.
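A minimal sketch of both fixes: pick the cookie out of the jar by name instead of relying on iteration order, then send it back to the second site as a plain Cookie header with urllib2. The cookie name 'uid' and the URLs are assumptions based on the question, not something I know about your sites.
import urllib2, cookielib

def extract(url, name):
    # collect the cookies set by the first site and return the one we want by name
    jar = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))
    opener.addheaders = [('User-Agent', 'Mozilla/5.0 (compatible)')]
    opener.open(url)
    for cookie in jar:
        if cookie.name == name:  # pick by name, not by position
            return cookie.value
    return None

uid = extract("http://mysite1.com", "uid")  # 'uid' is an assumed cookie name

# send it to the second site as a header; note the header name is 'Cookie', not 'Cookies'
req = urllib2.Request("http://mysite2.com")
req.add_header('Cookie', 'uid=%s' % uid)  # name=value pairs, ';'-separated if there are several
print urllib2.urlopen(req).read()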
Hope this helps.

Related

Python 3 Website detects scraper when using User-Agent spoofing

I'm trying to scrape some information from Indeed.com using urllib. Occasionally, the job link gets redirected to the hiring company's webpage. When this happens, Indeed throws up some html about using an incompatible browser or device, rather than continuing to the redirected page. After looking around, I found that in most cases spoofing urllib's user agent to look like a browser is enough to get around this, but this doesn't seem to be the case here.
Any suggestions on where to go beyond spoofing the User-Agent? Is it possible Indeed is able to realize the User-Agent is spoofed, and that there is no way around this?
Here's an example of the code:
import urllib
from fake_useragent import UserAgent
from http.cookiejar import CookieJar
ua = UserAgent()
website = 'http://www.indeed.com/rc/clk?jk=0fd52fac51427150&fccid=7f79c79993ec7e60'
req = urllib.request.Request(website)
cj = CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
opener.addheaders = [('User-Agent', ua.chrome)]
response = opener.open(req)
print(response.read().decode('utf-8'))
Thanks for the help!
This header usually works:
HDR = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'}
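As a sketch of how you might attach those headers with urllib (URL taken from the question; no guarantee Indeed's check won't still trigger):
import urllib.request

# reusing the HDR dict defined above
website = 'http://www.indeed.com/rc/clk?jk=0fd52fac51427150&fccid=7f79c79993ec7e60'
req = urllib.request.Request(website, headers=HDR)
with urllib.request.urlopen(req) as response:
    print(response.read().decode('utf-8'))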
Another option is to use the requests package.
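A minimal requests sketch with the same headers (requests follows redirects by default):
import requests

r = requests.get('http://www.indeed.com/rc/clk?jk=0fd52fac51427150&fccid=7f79c79993ec7e60',
                 headers=HDR)  # HDR as defined above
print(r.status_code)
print(r.text[:1000])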

using python 3 urllib to access website but failed

I am trying to use the below code to access websites in python 3 using urllib
url = "http://www.goal.com"
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
r = urllib.request.Request(url=url, headers=headers)
urllib.request.urlopen(r).read(1000)
It works fine when it access "yahoo.com", but it always returned error 403 when accessing sites such as "goal.com, hkticketing.com.hk" and I cannot figure out what I am missing. Appreciate for your help.
In Python 2.x you can use urllib2 to fetch the contents. Set the addheaders attribute to add the header information, then invoke the open method, read the contents, and finally print them.
import urllib2
import sys

print sys.version
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0')]
print opener.open('http://hkticketing.com.hk').read()
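Since the question itself is about Python 3, a rough equivalent there, assuming the missing User-Agent is what triggers the 403, would be:
import urllib.request

opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0')]
print(opener.open('http://www.goal.com').read(1000))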

ControlNotFoundError (ASP, Mechanize, Javascript, Python)

I am receiving the following response from the server:
ctrlDateTime%24txtSpecifyFromDate=05%2F02%2F2015&
ctrlDateTime%24rgApplicable=rdoApplicableFor&
ctrlDateTime%24txtSpecifyToDate=05%2F02%2F2015&
I am trying with
br["ctrlDateTime%24txtSpecifyFromDate"]="05%2F02%2F2015";
br["ctrlDateTime%24rgApplicable"]="rdoApplicableFor";
br["ctrlDateTime%24txtSpecifyToDate"]="05%2F02%2F2015";
How can I fix this ControlNotFoundError? Here is my code:
import mechanize
import re
br = mechanize.Browser()
br.addheaders = [('User-agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0')]
response = br.open("http://marketinformation.natgrid.co.uk/gas/frmDataItemExplorer.aspx")
br.select_form(nr=0)
br.set_all_readonly(False)
mnext = re.search("""<a id="lnkNext" href="javascript:__doPostBack('(.*?)','(.*?)')">XML""", html)
br["tvDataItem_ExpandState"]="cccccccceennncennccccccccc";
br["tvDataItem_SelectedNode"]="";
br["__EVENTTARGET"]="lbtnCSVDaily";
br["__EVENTARGUMENT"]="";
br["tvDataItem_PopulateLog"]="";
br["__VIEWSTATE"]="%2FwEP.....SNIP....%2F90SB9E%3D";
br["__VIEWSTATEGENERATOR"]="B2D04314";
br["__EVENTVALIDATION"]="%2FwEW...SNIP...uPceSw%3D%3D";
br["txtSearch"]="";
br["tvDataItemn11CheckBox"]="on";
br["tvDataItemn15CheckBox"]="on";
br["ctrlDateTime%24txtSpecifyFromDate"]="05%2F02%2F2015";
br["ctrlDateTime%24rgApplicable"]="rdoApplicableFor";
br["ctrlDateTime%24txtSpecifyToDate"]="05%2F02%2F2015";
br["btnViewData"]="View+Data+for+Data+Items";
br["hdnIsAddToList"]="";
response = br.submit()
print(response.read());
Thanks in advance.
P.
This is solved in two steps: 1) I replaced %24 with '$' in the control names; 2) some of the parameters required a plain true value to be passed and some had to be passed as ['',].
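A sketch of what the corrected assignments from the question would then look like (which controls need the list form depends on the actual form, so treat this as illustrative):
# '%24' is the URL-encoded '$'; mechanize expects the decoded control names
br["ctrlDateTime$txtSpecifyFromDate"] = "05/02/2015"
br["ctrlDateTime$txtSpecifyToDate"] = "05/02/2015"
br["ctrlDateTime$rgApplicable"] = ["rdoApplicableFor"]  # radio controls take a list of values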

python mechanize to access Sharepoint website

I'm trying to access SharePoint using mechanize but I get a 401 error. Here's the code I'm using:
import mechanize

url = "http://sharepoint:8080/foo/bar/foobar.aspx"
br = mechanize.Browser()
br.addheaders = [('User-agent', 'Mozilla/4.0(compatible; MSIE 7.0b; Windows NT 6.0)')]
br.add_password(url, 'domain\\user', 'myPassword')
r = br.open(url)
html = r.read()
Did i miss anything?
Did you happen to try Python NTLM for accessing SharePoint?
The examples in the NTLM docs explain how to use it with urllib2. Pasted below is the code for NTLM authentication using mechanize.
import mechanize
from ntlm import HTTPNtlmAuthHandler
pass_manager = mechanize.HTTPPasswordMgrWithDefaultRealm()
pass_manager.add_password(None, url, user, password)
auth_NTLM = HTTPNtlmAuthHandler.HTTPNtlmAuthHandler(pass_manager)
browser = mechanize.Browser()
browser.add_handler(auth_NTLM)
r = browser.open(url)
html = r.read()
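For reference, the plain urllib2 variant follows the same pattern from the python-ntlm docs (url, user, and password are placeholders here):
import urllib2
from ntlm import HTTPNtlmAuthHandler

passman = urllib2.HTTPPasswordMgrWithDefaultRealm()
passman.add_password(None, url, user, password)  # e.g. user = 'domain\\user'
auth_NTLM = HTTPNtlmAuthHandler.HTTPNtlmAuthHandler(passman)
opener = urllib2.build_opener(auth_NTLM)
response = opener.open(url)
html = response.read()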
Try with (note that Basic auth credentials have to be base64-encoded):
import base64
br.addheaders = [('User-agent', 'Mozilla/4.0(compatible; MSIE 7.0b; Windows NT 6.0)'), ('Authorization', 'Basic %s' % base64.b64encode('domain\\user:myPassword'))]
instead of
br.addheaders = [('User-agent', 'Mozilla/4.0(compatible; MSIE 7.0b; Windows NT 6.0)')]
This should work if your SharePoint server provides Basic Auth.
Looking at the usage in the mechanize docs, you only need to specify the username (e.g. 'john_doe'). Try this:
...
br.add_password(url, 'username_string', 'myPassword')
r = br.open(url)
html = r.get_data() # r.get_data() can be called many times without calling seek

Python http download page source

Hello there,
I was wondering if it is possible to connect to an HTTP host (e.g. google.com) and download the source of the webpage?
Thanks in advance.
Use urllib2 to download the page.
Google will block this request as it tries to block all robots, so add a User-Agent header to the request.
import urllib2
user_agent = 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.63 Safari/534.3'
headers = { 'User-Agent' : user_agent }
req = urllib2.Request('http://www.google.com', None, headers)
response = urllib2.urlopen(req)
page = response.read()
response.close() # it's always safe to close an open connection
You can also use pycurl:
import sys
import pycurl

class ContentCallback:
    def __init__(self):
        self.contents = ''

    def content_callback(self, buf):
        self.contents = self.contents + buf

t = ContentCallback()
curlObj = pycurl.Curl()
curlObj.setopt(curlObj.URL, 'http://www.google.com')
curlObj.setopt(curlObj.WRITEFUNCTION, t.content_callback)
curlObj.perform()
curlObj.close()
print t.contents
You can use urllib2 module.
import urllib2
url = "http://somewhere.com"
page = urllib2.urlopen(url)
data = page.read()
print data
See the doc for more examples
The documentation of httplib (low-level) and urllib (high-level) should get you started. Choose the one that's more suitable for you.
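For instance, a minimal httplib sketch of the same fetch (low-level, Python 2; google.com is only the example host from the question):
import httplib

conn = httplib.HTTPConnection('www.google.com')
conn.request('GET', '/', headers={'User-Agent': 'Mozilla/5.0 (compatible)'})
resp = conn.getresponse()
print resp.status, resp.reason
page = resp.read()
conn.close()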
Using requests package:
# Import requests
import requests
#url
url = 'https://www.google.com/'
# Create the binary string html containing the HTML source
html = requests.get(url).content
or with the urllib
from urllib.request import urlopen
#url
url = 'https://www.google.com/'
# Create the binary string html containing the HTML source
html = urlopen(url).read()
Here's another approach to this problem using mechanize. I found this to bypass a website's robot-checking system. I commented out set_all_readonly because for some reason it wasn't recognized as a method in mechanize.
import mechanize
url = 'http://www.example.com'
br = mechanize.Browser()
#br.set_all_readonly(False) # allow everything to be written to
br.set_handle_robots(False) # ignore robots
br.set_handle_refresh(False) # can sometimes hang without this
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')] # [('User-agent', 'Firefox')]
response = br.open(url)
print response.read() # the text of the page
response1 = br.response() # get the response again
print response1.read() # can apply lxml.html.fromstring()
