Downloading Files with Python Urllib, Urllib2

I am trying to download files from a website using urllib as described in this thread: link text
import urllib
urllib.urlretrieve ("http://www.example.com/songs/mp3.mp3", "mp3.mp3")
I am able to download the files (mostly PDF), but all I get are corrupted files that cannot be opened. I suspect it's because the website requires a login.
How can the above function be modified to handle cookies? I already know the names of the form fields that carry the username & password information. When I print the return values of urlretrieve, I get messages like:
a, b = urllib.urlretrieve ("http://www.example.com/songs/mp3.mp3", "mp3.mp3")
print a, b
>> **cache-control:** no-cache, no-store, must-revalidate, s-maxage=300, proxy-revalidate
>> **connection:** close
I am able to manually download the files if I enter their urls in the browser. Thanks

First, urllib2 actually supports cookies, and cookie handling should be easy. Second, you can check what kind of file you actually downloaded: e.g., AFAIK all MP3 files start with the bytes "ID3".
import cookielib, urllib2
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
r = opener.open("http://example.com/")
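To follow the second suggestion, here is a minimal sketch (not part of the original answer) of how you might check what was actually downloaded, using the example URL from the question. A real MP3 usually starts with the bytes "ID3" and a real PDF with "%PDF"; a login page returned instead will typically start with "<htm".
import urllib
filename, headers = urllib.urlretrieve("http://www.example.com/songs/mp3.mp3", "mp3.mp3")
with open(filename, "rb") as f:
    signature = f.read(4)  # the first few bytes identify the real file type
print(repr(signature))  # e.g. 'ID3\x03' for MP3, '%PDF' for PDF, '<htm' for an HTML error page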

It might be possible that the server you are requesting from is looking for certain headers, such as User-Agent. You may try mimicking browser behavior by sending additional headers.
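For example, a minimal sketch (using urllib2 and an illustrative User-Agent string) of sending extra headers with the request:
import urllib2
req = urllib2.Request("http://www.example.com/songs/mp3.mp3",
                      headers={"User-Agent": "Mozilla/5.0"})  # pretend to be a browser
response = urllib2.urlopen(req)
with open("mp3.mp3", "wb") as f:
    f.write(response.read())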

Related

How to handle the dynamic cookies when crawling a website by python?

I am a complete beginner with Python. I tried to crawl some product information from my www.Alibaba.com console. When I came to the visitor details page, I found that the cookie changed every time I clicked the search button; in fact, the cookie changed for each request. I cannot crawl the data the way I crawled other pages, where the cookies stayed fixed for a certain period.
After comparing the cookie data, I found that only 3 key-value pairs changed. I think those 3 values are what make my crawl fail, so I want to know how to handle this situation.
For Python 3, urllib.request in the standard library can be configured to use an http.cookiejar CookieJar, which will keep track of cookies within the client automatically.
You can set this up like this:
import http.cookiejar, urllib.request
cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
r = opener.open("http://example.com/")
If you're using Python 2, then a similar approach works with urllib2:
import urllib2
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
r = opener.open("http://example.com/")

Python Requests module error while logging into a WordPress site

I am writing a script to download files from a website.
import requests
import bs4 as bs
import urllib.request
import re
with requests.session() as c:  # making c denote the requests.session() object
    link = "https://gpldl.com/wp-login.php"  # login link
    initial = c.get(link)  # passing link through .get()
    headers = {
        'User-agent': 'Mozilla/5.0'
    }
    login_data = {"log": "****", "pwd": "****", "redirect_to": "https://gpldl.com/my-gpldl-account/", "redirect_to_automatic": 1, "rememberme": "forever"}  # login data for logging in
    page_int = c.post(link, data=login_data, headers=headers)  # posting the login data to the login link
    prefinal_link = "https://gpldl.com"  # initializing a part of link to be used later
    page = c.get("https://gpldl.com/repository/", headers=headers)  # passing the given URL through .get() to be used later
    good_data = bs.BeautifulSoup(page.content, "lxml")  # parsing the data from the previous statement into lxml form by BS4
    # loop for finding all required links
    for category in good_data.find_all("a", {"class": "dt-btn-m"}):
        inner_link = str(prefinal_link) + str(category.get("href"))
        my_var_2 = requests.get(inner_link)
        good_data_2 = bs.BeautifulSoup(my_var_2.content, "lxml")  # parsing each link with lxml
        for each in good_data_2.find_all("tr", {"class": "row-2"}):
            for down_link_pre in each.find_all("td", {"class": "column-4"}):  # downloading all files and getting their addresses to be entered into .csv file
                for down_link in down_link_pre.find_all("a"):
                    link_var = down_link.get("href")
                    file_name = link_var.split('/')[-1]
                    urllib.request.urlretrieve(str(down_link), str(file_name))
                    my_var.write("\n")
Using my code, when I access the website to download the files, the login keeps failing. Can anyone help me find what's wrong with my code?
Edit: I think the problem is with maintaining the logged-in state. When I access one page at a time, I'm able to reach the links that are accessible only when logged in, but when I navigate from there, the bot seems to get logged out and can no longer retrieve the download links or download the files.
Websites use cookies in every request to check login status, i.e. to tell whether the request is coming from a logged-in user or not, and modern browsers (Chrome/Firefox etc.) manage your cookies automatically. requests.session() supports cookies and handles them by default, so in your code, with requests.session() as c, c is like a miniature browser: cookies are included in every request made by c, and once you log in with c, you're able to use c.get() to browse all those login-accessible-only pages.
However, urllib.request.urlretrieve(str(down_link), str(file_name)) is used for downloading in your code; it has no idea of the previous login state, and that's why you're not able to download those files.
Instead, you should keep using c, which has the login state, to download all those files:
with open(str(file_name), 'wb') as download:  # binary mode, since response.content is bytes
    response = c.get(down_link)
    download.write(response.content)
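For larger files, a hedged alternative (reusing the same session c and the link_var/file_name variables from the question) is to stream the download in chunks instead of holding the whole body in memory:
response = c.get(link_var, headers=headers, stream=True)  # stream=True avoids loading the file all at once
with open(file_name, "wb") as download:
    for chunk in response.iter_content(chunk_size=8192):
        download.write(chunk)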

Problems saving cookies when making HTTP requests using Python

I'm trying to make a web spider using Python, but I've run into some problems when trying to log in to the website Pixiv. My code is below:
import sys
import urllib
import urllib2
import cookielib
url="https://www.secure.pixiv.net/login.php"
cookiename='123.txt'
cookie = cookielib.MozillaCookieJar(cookiename)
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
cookie.save()
values = {'model': 'login',
          'return_to': '/',
          'pixiv_id': 'username',
          'pass': 'password',
          'skip': '1'}
headers = { 'User-Agent' : 'User-Agent' }
data=urllib.urlencode(values)
req=urllib2.Request(url,data)
response=urllib2.urlopen(req)
the_page=response.read()
cookie.save()
To make sure it works, I used cookielib to save the cookies to a txt file. I ran the code and got a "cookie.txt", but when I opened the file I found that it was empty; in other words, my code didn't work.
I don't know what's wrong with it.
The problem is you're not using the opener that you created with the cookiejar attached to it in order to make the request. urllib2.urlopen has no way of knowing that you want to use that opener to start the request.
You can either use the opener's open method directly or, if you want to use this by default for the rest of your application, you can install it as the default opener for all requests made with urllib2 using urllib2.install_opener. So give that a try and see if it does the trick.
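A minimal sketch of both options, reusing the opener, req and cookie objects from the question:
# Option 1: call the opener directly, so the cookie processor sees the request
response = opener.open(req)
the_page = response.read()
cookie.save()

# Option 2: install it as the global default, after which urllib2.urlopen uses it
urllib2.install_opener(opener)
response = urllib2.urlopen(req)
the_page = response.read()
cookie.save()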

Python CookieJar saves cookie, but doesn't send it to website

I am trying to login to website using urllib2 and cookiejar. It saves the session id, but when I try to open another link, which requires authentication it says that I am not logged in. What am I doing wrong?
Here's the code, which fails for me:
import urllib
import urllib2
import cookielib
cookieJar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookieJar))
# Gives response saying that I logged in successfully
response = opener.open("http://site.com/login", "username=testuser&password=" + md5encode("testpassword"))
# Gives response saying that I am not logged in
response1 = opener.open("http://site.com/check")
Your implementation seems fine and should work; it should be sending the correct cookies. I see this as a case where the site is actually not logging you in, or where the cookies you are getting back are not the ones that authenticate you.
Use response.info() to see the headers of the responses and find out which cookies you are actually receiving (see the sketch after this list).
The site may not be logging you in because:
It may check the User-Agent header, which you are not setting; some sites accept only the major browsers in order to keep bots out.
The site might be looking for some special hidden form field that you are not sending.
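A minimal sketch (reusing the opener and cookieJar from the question) for inspecting what the server actually sends back:
response = opener.open("http://site.com/login", "username=testuser&password=" + md5encode("testpassword"))
print(response.info())  # all response headers, including any Set-Cookie lines
for cookie in cookieJar:
    print("%s=%s" % (cookie.name, cookie.value))  # cookies the jar actually stored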
One piece of advice:
from urllib import urlencode
# Use urlencode to encode your data
data = urlencode(dict(username='testuser', password=md5encode("testpassword")))
response = opener.open("http://site.com/login", data)
Moreover, one thing is strange here: you are MD5-hashing your password before sending it. Hashing is generally done by the server before comparing against the database, so sending a pre-hashed password is only correct if site.com implements the MD5 hashing in JavaScript on the client, which is a very rare case. Check that, because you may be providing the hashed form rather than the actual password, and the server would then be computing an MD5 of your MD5 hash.
I had a similar problem with my own test server, which worked fine with a browser, but not with the urllib2.build_opener solution.
The problem seems to be in urllib2. As these answers suggest, it's easy to use the more powerful mechanize library instead of urllib2:
import cookielib
import mechanize

cookieJar = cookielib.CookieJar()
browser = mechanize.Browser()
browser.set_cookiejar(cookieJar)
opener = mechanize.build_opener(*browser.handlers)
And the opener will work as expected!

Access to the cookies of the default browser

I want to write a program that opens the browser and opens a URL with a given cookie. I don't know how to do this. Maybe I could modify the cookies in the default location.
import urllib2
opener = urllib2.build_opener()
opener.addheaders.append(('Cookie', 'cookiename=cookievalue'))
f = opener.open("http://example.com/")
Modules to look into:
urllib2
cookielib
Cookie
In Python, you can emulate a browser with the mechanize library. Also, there is good documentation about mechanize and cookies.
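As a rough sketch (this does not read the browser's own cookie store; all values here are illustrative), you can also build a cookielib.Cookie by hand, put it in a CookieJar, and let urllib2 send it:
import cookielib, urllib2

cookie = cookielib.Cookie(
    version=0, name="cookiename", value="cookievalue",
    port=None, port_specified=False,
    domain="example.com", domain_specified=True, domain_initial_dot=False,
    path="/", path_specified=True,
    secure=False, expires=None, discard=True,
    comment=None, comment_url=None, rest={})
cj = cookielib.CookieJar()
cj.set_cookie(cookie)  # the opener will now attach this cookie to matching requests
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
f = opener.open("http://example.com/")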
