How can I perform a HEAD request with the mechanize library? - python

I know how to do a HEAD request with httplib, but I have to use mechanize for this site.
Essentially, what I need to do is grab a value from the header (filename) without actually downloading the file.
Any suggestions how I could accomplish this?

Mechanize itself only sends GETs and POSTs, but you can easily extend the Request class to send HEAD. Example:
import mechanize

class HeadRequest(mechanize.Request):
    def get_method(self):
        return "HEAD"

request = HeadRequest("http://www.example.com/")
response = mechanize.urlopen(request)
print response.info()
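
The question asks for a filename from the headers; assuming the server sends one in a Content-Disposition header (not every server does), a short follow-up along these lines, which is only a sketch and not part of the original answer, would pull it out:
import cgi
# response.info() is the header block returned by the HEAD request above.
headers = response.info()
disposition = headers.getheader("Content-Disposition")  # may be None
if disposition:
    _, params = cgi.parse_header(disposition)
    print params.get("filename")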

With mechanize there is no need to define a HeadRequest class; simply:
import mechanize
br = mechanize.Browser()
r = br.open("http://www.example.com/")
print r.info()
That's all.
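
One caveat: br.open() issues a GET, so the response body is still transferred. If avoiding the download is the whole point, the HeadRequest idea from the first answer can be combined with a Browser; this is only a sketch, assuming Browser.open accepts a Request object:
import mechanize

class HeadRequest(mechanize.Request):
    def get_method(self):
        return "HEAD"

br = mechanize.Browser()
response = br.open(HeadRequest("http://www.example.com/"))
print response.info()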

Related

How to handle dynamic cookies when crawling a website with Python?

I am a beginner with Python. I tried to crawl some product information from my www.Alibaba.com console. When I came to the visitor details page, I found that the cookie changed every time I clicked the search button, i.e. it changed with each request, so I cannot crawl the data the way I did on other pages where the cookies stayed fixed for a while.
After comparing the cookie data, I found that only 3 key-value pairs changed, and I think those 3 values are what make the crawl fail. So I want to know how to handle this situation.
For Python 3, urllib.request in the standard library can be configured to use an http.cookiejar CookieJar, which keeps track of cookies within the client automatically.
You can set this up like this:
import http.cookiejar, urllib.request
cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
r = opener.open("http://example.com/")
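After that call the jar holds whatever cookies the server set, and the opener resends them automatically on the next opener.open(), which is exactly what the changing key-value pairs in the question need. A quick illustrative check:
# Each cookie the server set is now in the jar and will be sent back
# automatically by the opener on subsequent requests.
for cookie in cj:
    print(cookie.name, cookie.value)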
If you're using Python 2, a similar approach works with urllib2:
import urllib2
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
r = opener.open("http://example.com/")

urllib.urlopen does not work for this URL though mechanize works

My code below doesn't work for nytimes.com URLs that point to articles. Try changing the URL variable to something else and you'll see that it works. Why is that?
#url = "http://www.nytimes.com";
url = "http://www.nytimes.com/interactive/2014/07/07/upshot/how-england-italy-and-germany-are-dominating-the-world-cup.html"
htmlfile = urllib.urlopen(url);
htmltext = htmlfile.read();
print htmltext;
Please advise.
Thanks.
I think the NYT validates your request with cookies. If the request isn't an ordinary browser request, the server responds with a Location header (a redirect) and your request gets lost.
The solution is simple. Use cookiejar like this:
import cookielib, urllib2
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
url = "http://www.nytimes.com/interactive/2014/07/07/upshot/how-england-italy-and-germany-are-dominating-the-world-cup.html"
htmlfile = opener.open(url)
htmltext = htmlfile.read()
print htmltext
By "doesn't work" I presume you mean it doesn't give you the expected content. I get an empty result when I access that URL using urllib so this is likely yet another aspect of the NYT's "paywall."

Are cookies kept in a Mechanize browser between opening URLs?

I have code similar to this:
import mechanize

br = mechanize.Browser()
br.open("https://mysite.com/")
br.select_form(nr=0)
#do stuff here
response = br.submit()
html = response.read()
#now that i have the login cookie i can do this...
br.open("https://mysite.com/")
html = response.read()
However, my script is responding like it's not logged in for the second request. I checked the first request and yes, it logs in successfully. My question is: do cookies in Mechanize browsers need to be managed or do I need to setup a CookieJar or something, or does it keep track of all of them for you?
The first example here talks about cookies being carried between requests, but they don't talk about browsers.
Yes, you will have to store the cookies between open requests in mechanize. Something similar to the code below should work: you can attach a cookie jar to the br object, and as long as that object exists it keeps the cookies.
import cookielib
import mechanize

cookiejar = cookielib.LWPCookieJar()
br = mechanize.Browser()
br.set_cookiejar(cookiejar)
br.open("https://mysite.com/")
br.select_form(nr=0)
#do stuff here
response = br.submit()
html = response.read()
# now that I have the login cookie I can do this...
response = br.open("https://mysite.com/")
html = response.read()
The Docs cover it in more detail.
I use Perl's Mechanize a lot, but not Python's, so I may have missed something Python-specific here; if I did, I apologize, but I did not want to answer with a simple yes.
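
If you want to confirm that the cookie really is carried between the two opens, a small hypothetical check (not part of the original answer) is to list the jar after logging in:
# Hypothetical check: the login cookie should show up here after br.submit().
for cookie in cookiejar:
    print cookie.name, cookie.domain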

Twitter authentication using cookielib

Can someone tell me why this doesn't work?
import cookielib
import urllib
import urllib2
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
data = urllib.urlencode({'session[username_or_email]':'twitter handle' , 'session[password]':'password'})
opener.open('https://twitter.com' , data)
stuff = opener.open('https://twitter.com')
print stuff.read()
Why doesn't this give the HTML of the page after logging in?
Please consider using an OAuth library for this task. Scraping the site with mechanize is not recommended because Twitter can change the HTML at any time, and then your code will break.
Check this out: Python-twitter at http://code.google.com/p/python-twitter/
Simplest example to post an update:
>>> import twitter
>>> api = twitter.Api(
...     consumer_key='yourConsumerKey',
...     consumer_secret='consumerSecret',
...     access_token_key='accessToken',
...     access_token_secret='accessTokenSecret')
>>> api.PostUpdate('Blah blah lbah!')
There can be many reasons why it is failing:
Twitter probably expects a User-Agent header, which you are not providing (see the sketch below).
I didn't look at the HTML, but maybe there's some JavaScript at play before the form is actually submitted (I actually think this is the case, because I vaguely remember writing a very detailed answer on this exact thing, and I don't seem to find the link to it!).
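For example, here is one way to add a User-Agent to the original snippet. This is only a sketch: the header string is an illustrative browser value, and Twitter's login form may still require more than this (hidden fields, tokens, JavaScript):
import cookielib
import urllib
import urllib2

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
# Send a browser-like User-Agent with every request made through this opener.
opener.addheaders = [('User-Agent',
                      'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36')]

data = urllib.urlencode({'session[username_or_email]': 'twitter handle',
                         'session[password]': 'password'})
opener.open('https://twitter.com', data)
stuff = opener.open('https://twitter.com')
print stuff.read()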

How to view the current URL using twill?

I intend to use twill to fill out a form on one page, hit the submit button, and then use BeautifulSoup to parse the resulting page. How can I feed BeautifulSoup the HTML page? I assume I have to read the current url, but I do not know how to actually return the url in order to do so. I have tried twill's TwillBrowser.get_url(), but it only returns None.
For any future sufferers, I have had better luck using mechanize instead of twill, since twill is an unmaintained thin shell around mechanize. The solution is as follows:
import mechanize

url = "http://foo.com"  # include the scheme
br = mechanize.Browser()
br.open(url)
br.select_form(name="YOURFORMNAMEHERE")  # make sure to keep the quotation marks
br["YOURINPUTFIELDNAMEHERE"] = ["YOURVALUEHERE"]  # this must be in a list even if it is only one value
response = br.submit()
print response.geturl()
Finally figured this out!
If you import twill like so:
import twill.commands as com
then you can get the current URL with:
url = com.browser.get_url()
Source: http://nullege.com/codes/search/twill.commands.browser.get_url
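
Since the goal was to hand the resulting page to BeautifulSoup, here is a sketch of the full round trip. It assumes twill's browser object exposes the page source via get_html(), and the form index and field names below are hypothetical placeholders:
import twill.commands as com
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3; use bs4 for version 4

com.go("http://foo.com")
com.fv("1", "YOURINPUTFIELDNAMEHERE", "YOURVALUEHERE")  # form "1", field name, value
com.submit()

html = com.browser.get_html()  # assumed accessor for the current page's HTML
soup = BeautifulSoup(html)
print com.browser.get_url(), soup.title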
