send/set cookie via urllib2 python

I am quite new to Python and have been stuck for a couple of days now trying to send a cookie with urllib2. On the page I want to get, I can see in Firebug that there is a "sent cookie" which looks like:
list_type=height
.. which basically arranges the list on the page in a certain order.
I would like to send this cookie via urllib2, so that the rendered page takes this setting into effect - and here is the code I am trying to write to make it work:
class Networksx(object):
    def __init__(self):
        self.cj = cookielib.CookieJar()
        self.opener = urllib2.build_opener()  # socks handler argument elided here
        self.opener.addheaders = [
            ('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13'),
            ('Accept-Charset', 'ISO-8859-1,utf-8;q=0.7,*;q=0.7'),
            ('Keep-Alive', '115'),
            ('Connection', 'keep-alive'),
            ('Cache-Control', 'max-age=0'),
            ('Referer', 'http://www.google.com'),
            ("Cookie", {"list_type": "height"}),  # note: a dict, not a header string
        ]
        urllib2.install_opener(self.opener)
        self.params = {'Set-Cookie': "list_type=height"}
        self.encoded_params = urllib.urlencode(self.params)

    def fullinfo(self, url):
        return self.opener.open(url, self.encoded_params).read()
..as you can see, I have tried a couple of things:
setting the parameter via a header
setting a cookie
however, these do not seem to render the page in the desired list order (height). I was wondering if someone could point me in the right direction as to how to send the cookie information with urllib2.
Thanks.

An easy way to generate a cookie.txt is this chrome extension: https://chrome.google.com/webstore/detail/cookietxt-export/lopabhfecdfhgogdbojmaicoicjekelh
import urllib2, cookielib
url = 'https://example.com/path/default.aspx'
txheaders = {'User-agent' : 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
cj = cookielib.LWPCookieJar()
# cj.load signature: filename=None, ignore_discard=False, ignore_expires=False
cj.load('/path/to/my/cookies.txt')
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)
req = urllib2.Request(url, None, txheaders)
handle = urllib2.urlopen(req)
[update]
Sorry, I was pasting from an old code snippet long forgotten. From the LWPCookieJar docstring:
The LWPCookieJar saves a sequence of "Set-Cookie3" lines. "Set-Cookie3" is the format used by the libwww-perl library, not known to be compatible with any browser, but which is easy to read and doesn't lose information about RFC 2965 cookies.
So it is not compatible with the cookies.txt generated by modern browsers. If you try to load one with LWPCookieJar you will get: LoadError: 'cookies.txt' does not look like a Set-Cookie3 (LWP) format file.
You can do as the OP and convert the file:
there is something wrong with the format of the output from the chrome extension. I just googled the lwp problem and found code.activestate.com/recipes/302930-cookielib-example - the code spits out the cookie in lwp format, and then I follow your steps as-is. - James W
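A minimal sketch of that conversion, assuming the export is in Netscape format (the filenames are placeholders):
import cookielib

# Load the browser-exported Netscape-format file...
mcj = cookielib.MozillaCookieJar('cookies.txt')
mcj.load(ignore_discard=True, ignore_expires=True)

# ...and re-save the same cookies in Set-Cookie3 (LWP) format
lcj = cookielib.LWPCookieJar()
for cookie in mcj:
    lcj.set_cookie(cookie)
lcj.save('cookies.lwp', ignore_discard=True, ignore_expires=True)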
You can also use this Firefox addon, and then "Tools->Export cookies". Make sure the first line in the cookies.txt file is "# Netscape HTTP Cookie File" and use:
cj = cookielib.MozillaCookieJar('/path/to/my/cookies.txt')
cj.load()
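And if all you actually need is the single list_type cookie from the question, you can skip cookie files entirely; a minimal sketch (the URL is a placeholder), the point being that a Cookie header value is a plain "name=value" string, not a dict:
import urllib2

opener = urllib2.build_opener()
# The Cookie header value must be a "name=value" string, not a dict
opener.addheaders = [('Cookie', 'list_type=height')]
print opener.open('http://example.com/list').read()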

You would do better to look into the requests module for Python, which makes HTTP much more approachable than the low-level urllib modules.
See
http://docs.python-requests.org/en/latest/user/quickstart/#cookies
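For instance, sending the cookie from the question is a one-liner with requests (the URL is a placeholder):
import requests

# requests builds the Cookie header from a plain dict
r = requests.get('http://example.com/list', cookies={'list_type': 'height'})
print(r.text)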

Related

Python 3 Website detects scraper when using User-Agent spoofing

I'm trying to scrape some information from Indeed.com using urllib. Occasionally, the job link gets redirected to the hiring company's webpage. When this happens, Indeed throws up some html about using an incompatible browser or device, rather than continuing to the redirected page. After looking around, I found that in most cases spoofing urllib's user agent to look like a browser is enough to get around this, but this doesn't seem to be the case here.
Any suggestions on where to go beyond spoofing the User-Agent? Is it possible Indeed is able to realize the User-Agent is spoofed, and that there is no way around this?
Here's an example of the code:
import urllib.request
from fake_useragent import UserAgent
from http.cookiejar import CookieJar
ua = UserAgent()
website = 'http://www.indeed.com/rc/clk?jk=0fd52fac51427150&fccid=7f79c79993ec7e60'
req = urllib.request.Request(website)
cj = CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
opener.addheaders = [('User-Agent', ua.chrome)]
response = opener.open(req)
print(response.read().decode('utf-8'))
Thanks for the help!
This header usually works:
HDR = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'}
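Applied to the question's code, a sketch of how that dict would be used (reusing HDR from above):
import urllib.request

website = 'http://www.indeed.com/rc/clk?jk=0fd52fac51427150&fccid=7f79c79993ec7e60'
req = urllib.request.Request(website, headers=HDR)
print(urllib.request.urlopen(req).read().decode('utf-8'))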
Another option is to use the requests package.
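With requests the same headers are passed per call; a sketch, again reusing HDR from above:
import requests

# requests sends HDR with the request and follows redirects by default
response = requests.get('http://www.indeed.com/rc/clk?jk=0fd52fac51427150&fccid=7f79c79993ec7e60',
                        headers=HDR)
print(response.status_code)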

How to Google in Python Using urllib or requests

What is the proper way to Google something in Python 3? I have tried requests and urllib for a Google page. When I simply res = requests.get("https://www.google.com/#q=" + query), the response does not contain the same HTML that I see when I inspect the Google page in Safari. The same happens with urllib, and a similar thing happens with Bing. I am familiar with AJAX; however, it seems that is now deprecated.
In Python, if you do not set the User-Agent header on your HTTP requests manually, Python adds a default one for you, which Google can detect and may block.
Try the following and see if it helps.
import urllib.request
yourUrl = "post it here"
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
req = urllib.request.Request(yourUrl, headers = headers)
page = urllib.request.urlopen(req)
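Note also that everything after # in the question's URL is a fragment, which the client never sends to the server, so the server only ever sees a request for the homepage. A sketch that targets the search endpoint instead (the /search path and q parameter are assumptions about Google's public URL scheme):
import urllib.parse
import urllib.request

query = "python urllib cookies"
# Build a query string the server actually receives
url = "https://www.google.com/search?" + urllib.parse.urlencode({'q': query})
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
req = urllib.request.Request(url, headers=headers)
page = urllib.request.urlopen(req)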

Reading walmart product page with urllib doesn't work when using "user-agent" string

I'm building a Django-based website where some data is dynamically loaded using Ajax from a user-specified URL. For this I'm using urllib2 and, later on, BeautifulSoup. I came across a strange thing with Walmart links. Take a look:
import urllib2
url_to_parse = 'http://www.walmart.com/ip/JVC-HARX300-High-Quality-Full-Size-Headphone/13241375'
# 1 - read the url without user-agent string
opened_url = urllib2.urlopen(url_to_parse)
print len(opened_url.read())
# prints 309316
# 2 - read the url with user-agent string
headers = { 'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0' }
req = urllib2.Request(url_to_parse, '', headers)
opened_url = urllib2.urlopen(req)
print len(opened_url.read())
# prints 0
My question is: why does #2 print zero? I use the same user-agent approach with other websites (like Amazon) without problems.
Wget is able to get the page content with no problems btw.
Your problem is not the User-Agent, it is your data parameter.
From the docs:
data may be a string specifying additional data to send to the server,
or None if no such data is needed.
It seems Walmart does not like your empty string: passing any string as data - even an empty one - makes urllib2 send a POST instead of a GET. Change your call to this:
req = urllib2.Request(url_to_parse, None, headers)
Now both ways print the same value.

unable to send data using urllib and urllib2 (python)

Hello everybody (first post here).
I am trying to send data to a webpage. This webpage requests two fields (a file and an e-mail address); if everything is ok, the webpage returns a page saying "everything is ok" and sends a file to the provided e-mail address. I execute the code below and I get nothing in my e-mail account.
import urllib, urllib2
params = urllib.urlencode({'uploaded': open('file'), 'email': 'user@domain.com'})
req = urllib2.urlopen('http://webpage.com', params)
print req.read()
the print command gives me the source of the home page (I assume it should instead give the source of the "everything is ok" page).
I think (based on a Google search) the poster module would do the trick, but I need to keep dependencies to a minimum, so I would like a solution using only the standard library (if that is possible).
Thanks in advance.
Thanks everybody for your answers. I solved my problem using the mechanize library.
import mechanize
br = mechanize.Browser()
br.open('http://webpage.com')
email = 'user@domain.com'
br.select_form(nr=0)
br['email'] = email
br.form.add_file(open('filename'), 'mime-type', 'filename')
br.form.set_all_readonly(False)
br.submit()
This site could be checking the Referer, the User-Agent and cookies.
The way to handle all of this is to use urllib2.OpenerDirector, which you get from urllib2.build_opener.
import urllib2
import cookielib

# Cookies handle
cj = cookielib.CookieJar()
CookieProcessor = urllib2.HTTPCookieProcessor(cj)
# Build the OpenerDirector
opener = urllib2.build_opener(CookieProcessor)
# Valid User-Agent from Firefox 3.6.8 on Ubuntu 10.04
user_agent = 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.8) Gecko/20100723 Ubuntu/10.04 (lucid) Firefox/3.6.8'
# Referer says that you send the request from the site's title page
referer = 'http://webpage.com'
opener.addheaders = [
    ('User-Agent', user_agent),
    ('Referer', referer),
    ('Accept-Charset', 'utf-8')
]
Then prepare the parameters with urllib.urlencode and send the request with opener.open(url, params).
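A sketch of that last step, reusing the opener built above (the URL and field name are placeholders; note that urlencode carries only text fields, so an actual file upload would need a multipart/form-data body, which the standard library does not build for you):
import urllib

params = urllib.urlencode({'email': 'user@domain.com'})
response = opener.open('http://webpage.com', params)
print response.read()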
Documentation for Python 2.7: cookielib, OpenerDirector

How would one log into a phpBB3 forum through a Python script using urllib, urllib2 and ClientCookie?

(ClientCookie is a module for (automatic) cookie-handling: http://wwwsearch.sourceforge.net/ClientCookie)
import urllib
import ClientCookie

# I encode the data I'll be sending:
data = urllib.urlencode({'username': 'mandark', 'password': 'deedee'})
# And I send it and read the page:
page = ClientCookie.urlopen('http://www.forum.com/ucp.php?mode=login', data)
output = page.read()
The script doesn't log in, but rather seems to get redirected back to the same login page asking it for a username and password. What am I doing wrong?
Any help would be greatly appreciated! Thanks!
Have you tried fetching the login page first?
I would suggest using Tamper Data to have a peek at exactly what's being sent when you request the login page and then log in normally using a web browser from a fresh start, with no initial cookies in place, so that your script can replicate it exactly.
That's the approach I used when writing the following, extracted from a script which needs to log in to an Invision Power Board forum, using cookielib and urllib2 - you may find it useful as a reference.
import cookielib
import logging
import sys
import urllib
import urllib2

cookies = cookielib.LWPCookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookies))
urllib2.install_opener(opener)

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.0; en-GB; rv:1.8.1.12) Gecko/20080201 Firefox/2.0.0.12',
    'Accept': 'text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5',
    'Accept-Language': 'en-gb,en;q=0.5',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.7',
}

# Fetch the login page to set initial cookies
urllib2.urlopen(urllib2.Request('http://www.rllmukforum.com/index.php?act=Login&CODE=00', None, headers))

# Login so we can access the Off Topic forum
# (RLLMUK_USERNAME and RLLMUK_PASSWORD are defined elsewhere in the original script)
login_headers = headers.copy()
login_headers.update({
    'Referer': 'http://www.rllmukforum.com/index.php?act=Login&CODE=00',
    'Content-Type': 'application/x-www-form-urlencoded',
})
html = urllib2.urlopen(urllib2.Request('http://www.rllmukforum.com/index.php?act=Login&CODE=01',
                                       urllib.urlencode({
                                           'referer': 'http://www.rllmukforum.com/index.php?',
                                           'UserName': RLLMUK_USERNAME,
                                           'PassWord': RLLMUK_PASSWORD,
                                       }),
                                       login_headers)).read()
if 'The following errors were found' in html:
    logging.error('RLLMUK login failed')
    logging.info(html)
    sys.exit(1)
I'd recommend taking a look at the mechanize library; it's designed for precisely this type of task. It's also far easier than doing it by hand.
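For reference, a hedged sketch of the same login with mechanize, assuming the phpBB3 login form is the first form on the page and uses the standard username/password field names:
import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)  # phpBB's robots.txt would otherwise block the script
br.open('http://www.forum.com/ucp.php?mode=login')
br.select_form(nr=0)  # assumption: the login form is the first form on the page
br['username'] = 'mandark'
br['password'] = 'deedee'
html = br.submit().read()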
