The problem I am currently having is trying to download an image that displays as an animated GIF but appears to be encoded as a JPG. I say that it appears to be encoded as a JPG because the file extension and MIME type are .jpg and image/jpeg.
When downloading the file to my local machine (Mac OS X) and then attempting to open it, I get the error:
The file could not be opened. It may be damaged or use a file format that Preview doesn’t recognize.
While I realize that some people might just ignore that image, I'm looking for a way to fix it if it can be fixed, not just ignore it.
The url in question is here:
http://www.supergrove.com/wp-content/uploads/2017/03/gif-images-22-1000-about-gif-on-pinterest.jpg
Here is my code, and I am open to suggestions:
from PIL import Image
import requests
response = requests.get(media, stream=True)
response.raise_for_status()
with open(uploadedFile, 'wb') as img:
    for chunk in response.iter_content(chunk_size=1024):
        if chunk:
            img.write(chunk)
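For reference, a quick way to check what the server actually returns and what the downloaded bytes really are (a sketch reusing the question's media and uploadedFile variables):
import imghdr
import requests

# re-fetch the url to see what the server claims to be sending back
response = requests.get(media, stream=True)
print(response.status_code, response.headers.get('Content-Type'))

# sniff the bytes already written to disk; None means not a recognized image format
print(imghdr.what(uploadedFile))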
According to Wheregoes, the link of the image:
http://www.supergrove.com/wp-content/uploads/2017/03/gif-images-22-1000-about-gif-on-pinterest.jpg
receives a 302 redirect to the page that contains it:
http://www.supergrove.com/gif-images/gif-images-22-1000-about-gif-on-pinterest/
Therefore, your code is trying to download a web page as an image.
I tried:
r = requests.get(the_url, headers=headers, allow_redirects=False)
But it returns zero content and status_code = 302.
(In hindsight, that was to be expected.)
This server is configured in a way that it will never fulfill that request.
Bypassing that limitation sounds difficult, to the best of my (perhaps limited) knowledge.
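For completeness, the redirect chain can also be inspected directly with requests (a sketch reusing the_url and headers from the attempt above):
import requests

# follow the redirects and print each hop, then the final URL that was actually fetched
r = requests.get(the_url, headers=headers, allow_redirects=True)
for hop in r.history:
    print(hop.status_code, hop.headers.get('Location'))
print(r.status_code, r.url)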
I had to answer my own question in this case, but the answer to this problem was to add a referer to the request. Most likely an .htaccess file on the image's server prevents direct access to the file unless the request comes from their own site.
from fake_useragent import UserAgent
from io import StringIO,BytesIO
import io
import imghdr
import requests
# Set url
mediaURL = 'http://www.supergrove.com/wp-content/uploads/2017/03/gif-images-22-1000-about-gif-on-pinterest.jpg'
# Create a user agent
ua = UserAgent()
# Create a request session
s = requests.Session()
# Set some headers for the request
s.headers.update({'User-Agent': ua.chrome, 'Referer': mediaURL})
# Make the request to get the image from the url
response = s.get(mediaURL, allow_redirects=False)
# The request was about to be redirected
if response.status_code == 302:
    # Get the next location that we would have been redirected to
    location = response.headers['Location']
    # Set the previous page url as referer
    s.headers.update({'referer': location})
    # Try the request again, this time with a referer
    response = s.get(mediaURL, allow_redirects=False, cookies=response.cookies)
print(response.headers)
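If the second request comes back with a 200, the image bytes can then be written to disk as usual (a sketch; the output filename is arbitrary):
# save the body only if the retried request actually returned the image
if response.status_code == 200:
    with open('gif-images.jpg', 'wb') as img:
        img.write(response.content)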
Hat tip to @raratiru for suggesting the use of allow_redirects.
Also noted in their answer is that the image's server might be intentionally blocking access to prevent general scrapers from viewing their images. Hard to tell, but regardless, this solution works.
Related
I am trying to make a simple program that will help with the confusing part of rooting.
I need to download the file from tiny.cc/latestmagisk
I am using this python code
import requests

url = "http://tiny.cc/latestmagisk"
r = requests.get(url)
r.content
The content it returns is the usual 403 Forbidden for nginx
I need this to work with the shortened URL; is there any way to make that happen?
It's not necessary to import the requests library.
All you need to do is import ssl and urllib, and pass ssl._create_unverified_context() as the context while sending the request.
Your code should look like this:
import ssl, urllib
certcontext = ssl._create_unverified_context()
f = open('image.jpg', 'wb')  # create a placeholder file
# fetch the image from the url and save it as `image.jpg`
f.write(urllib.urlopen("https://i.stack.imgur.com/IKh7E.png", context=certcontext).read())
f.close()
Note: this will save the image as an image.jpg file.
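Note that the snippet above is Python 2. Under Python 3, the same idea would presumably look like this, since urlopen lives in urllib.request there:
import ssl
import urllib.request

certcontext = ssl._create_unverified_context()

# fetch the image over the unverified SSL context and save it as image.jpg
with open('image.jpg', 'wb') as f:
    f.write(urllib.request.urlopen("https://i.stack.imgur.com/IKh7E.png",
                                   context=certcontext).read())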
Contrary to the other answer, you really should use requests for this as requests has better support for redirects.
For getting a page through a redirect from requests:
r=requests.get(url, allow_redirects=True)
For downloading files through redirects:
r = requests.get(url, allow_redirects=True, stream=True)
with open(filename, 'wb') as f:
    for chunk in r.iter_content(chunk_size=1024):
        if chunk:
            f.write(chunk)
However, in this case, either tiny.cc or XDA does not allow a simple requests.get; the 403 Forbidden is most likely triggered by the User-Agent or some other request header, since this method works well with bit.ly and other shortlink generators. You may need to fake headers, for example as sketched below.
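A minimal sketch of faking browser-like headers with requests; the header values and the output filename are assumptions, not something verified against tiny.cc or XDA:
import requests

# hypothetical browser-like headers; adjust to whatever the server expects
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Referer': 'https://www.google.com/',
}

r = requests.get('http://tiny.cc/latestmagisk', headers=headers,
                 allow_redirects=True, stream=True)
r.raise_for_status()

# r.url is the final URL after all redirects; write the payload to disk
with open('latest-magisk.zip', 'wb') as f:
    for chunk in r.iter_content(chunk_size=1024):
        if chunk:
            f.write(chunk)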
How can I get part of the response to a GET/POST request from the python-requests library? I need to get the URL content and analyze it quickly, but the web server may return a very large response (500 MB, for example).
Set stream to True.
example_url = 'http://www.example.com/somethingbig'
r = requests.get(example_url, stream=True)
At this point only the response headers have been downloaded and the connection remains open.
Documentation: body-content-workflow
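A minimal sketch of reading only the beginning of the body and dropping the rest; the 1 MB cut-off is an arbitrary assumption:
import requests

example_url = 'http://www.example.com/somethingbig'
r = requests.get(example_url, stream=True)

limit = 1024 * 1024  # read at most ~1 MB of the body
received = b''
for chunk in r.iter_content(chunk_size=8192):
    received += chunk
    if len(received) >= limit:
        break

r.close()  # release the connection without downloading the remainder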
A portion of code that I have that will parse a web site does not work.
I can trace the problem to the .read function of my urllib2.urlopen object.
page = urllib2.urlopen('http://magiccards.info/us/en.html')
data = page.read()
Until yesterday, this worked fine; but now the length of the data is always 69496 instead of 122989. When I open smaller pages, however, my code works fine.
I have tested this on Ubuntu, Linux Mint and windows 7. All have the same behaviour.
I'm assuming that something has changed on the web server, but the page is complete when I use a web browser. I have tried to diagnose the issue with Wireshark, and there too the page is received as complete.
Does anybody know why this may be happening or what I could try to determine the issue?
The page seems to be misbehaving unless you request the content encoded as gzip. Give this a shot:
import urllib2
import zlib
request = urllib2.Request('http://magiccards.info/us/en.html')
request.add_header('Accept-Encoding', 'gzip')
response = urllib2.urlopen(request)
data = zlib.decompress(response.read(), 16 + zlib.MAX_WBITS)
As Nathan suggested, you could also use the great Requests library, which accepts gzip by default.
import requests
data = requests.get('http://magiccards.info/us/en.html').text
Yes, the server is closing the connection, and you need keep-alive to be sent. urllib2 does not have that facility ( :-( ). There used to be urlgrabber, which provided an HTTPHandler that works alongside the urllib2 opener, but unfortunately I don't find that working either. At the moment, you could use other libraries, such as requests (as demonstrated in the other answer) or httplib2.
import httplib2
h = httplib2.Http(".cache")
resp, content = h.request("http://magiccards.info/us/en.html", "GET")
print len(content)
I'm slamming my head against the wall with this one. I've been trying every example, reading every last bit I can find online about basic http authorization with urllib2, but I can not figure out what is causing my specific error.
Adding to the frustration is that the code works for one page, and yet not for another.
Logging into www.mysite.com/adm goes absolutely smoothly; it authenticates with no problem. Yet if I change the address to 'http://mysite.com/adm/items.php?n=201105&c=200' I receive this error:
<h4 align="center" class="teal">Add/Edit Items</h4>
<p><strong>Client:</strong> </p><p><strong>Event:</strong> </p><p class="error">Not enough information to complete this task</p>
<p class="error">This is a fatal error so I am exiting now.</p>
Searching google has lead to zero information on this error.
The adm is a frame set page, I'm not sure if that's relevant at all.
Here is the current code:
import urllib2, urllib
import sys
import re
import base64
from urlparse import urlparse
theurl = 'http://xxxxxmedia.com/adm/items.php?n=201105&c=200'
username = 'XXXX'
password = 'XXXX'
passman = urllib2.HTTPPasswordMgrWithDefaultRealm()
passman.add_password(None, theurl,username,password)
authhandler = urllib2.HTTPBasicAuthHandler(passman)
opener = urllib2.build_opener(authhandler)
urllib2.install_opener(opener)
pagehandle = urllib2.urlopen(theurl)
url = 'http://xxxxxxxmedia.com/adm/items.php?n=201105&c=200'
values = {'AvAudioCD': 1,
          'AvAudioCDDiscount': 00, 'AvAudioCDPrice': 50,
          'ProductName': 'python test', 'frmSubmit': 'Submit'}
#opener2 = urllib2.build_opener(urllib2.HTTPCookieProcessor())
data = urllib.urlencode(values)
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
This is just one of the many versions I've tried. I've followed every example from Urllib2 Missing Manual but still receive the same error.
Can anyone point to what I'm doing wrong?
I ran into a similar problem today. I was using basic authentication on the website I am developing, and I couldn't authenticate any users.
Here are a few things you can use to debug your problem:
I used slumber.in and httplib2 for testing purposes. I ran both from ipython shell to see what responses I was receiving.
Slumber actually uses httplib2 beneath the covers so they acted similarly. I used tcpdump and later tcpflow (which shows information in a much more readable form) to see what was really being sent and received. If you want a GUI, see wireshark or alternatives.
I tested my website with curl and when I used curl with my username/password it worked correctly and showed the requested page. But slumber and httplib2 were still not working.
I tested my website against browserspy.dk to see what the differences were. The important thing is that browserspy's website works with basic authentication and mine did not, so I could compare the two. I had read in a lot of places that you need to send HTTP 401 Unauthorized so that the browser or tool you are using will send the username/password you provided. But what I didn't know was that you also need the WWW-Authenticate field in the header. That was the missing piece.
What made this whole situation odd was that while testing I would see httplib2 send basic authentication headers with most of the requests (tcpflow would show that). It turns out the library does not send username/password authentication on the first request. Only if "Status 401" AND "WWW-Authenticate" appear in the response are the credentials sent, on the second request and on all requests to that domain from then on.
So to sum up, your application may be correct, but you might not be returning the standard headers and status code needed for the client to send its credentials. Use your debugging tools to find out which it is. Also, httplib2 has a debug mode: just set httplib2.debuglevel = 1 so that debug information is printed to standard output. This is much more helpful than using tcpdump because it works at a higher level.
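For reference, a minimal sketch of that httplib2 debugging setup; the URL and credentials are placeholders:
import httplib2

httplib2.debuglevel = 1  # print the raw request/response exchange to stdout

h = httplib2.Http()
h.add_credentials('user', 'secret')  # only sent after a 401 + WWW-Authenticate challenge
resp, content = h.request('http://example.com/protected/', 'GET')
print resp.status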
Hope this helps someone.
About a year ago, I went through the same process and documented how I solved the problem, covering both the direct and simple way to authenticate and the standard one. Choose whichever you deem fit.
HTTP Authentication in Python
There is also a detailed explanation in the urllib2 Missing Manual.
From the HTML you posted, I still think that you authenticate successfully but hit an error afterwards, while your POST request is processed. I tried your URL and, failing authentication, I get a standard 401 page.
In any case, I suggest you run your code again and perform the same operation manually in Firefox, this time with Wireshark capturing the exchange. You can grab the full text of the HTTP request and response in both cases and compare the differences. In most cases that will lead you to the source of the error you get.
I also found that the passman approach doesn't work (sometimes?). Adding the base64 user/pass header as per this answer https://stackoverflow.com/a/18592800/623159 did work for me. I am accessing a Jenkins URL like this: http:///job//lastCompletedBuild/testReport/api/python
This works for me:
import urllib2
import base64
baseurl="http://jenkinsurl"
username=...
password=...
url="%s/job/jobname/lastCompletedBuild/testReport/api/python" % baseurl
base64string = base64.encodestring('%s:%s' % (username, password)).replace('\n', '')
request = urllib2.Request(url)
request.add_header("Authorization", "Basic %s" % base64string)
result = urllib2.urlopen(request)
data = result.read()
This doesn't work for me, error 403 each time:
import urllib2
baseurl="http://jenkinsurl"
username=...
password=...
##urllib2.HTTPError: HTTP Error 403: Forbidden
passman = urllib2.HTTPPasswordMgrWithDefaultRealm()
passman.add_password(None, url, username,password)
urllib2.install_opener(urllib2.build_opener(urllib2.HTTPBasicAuthHandler(passman)))
req = urllib2.Request(url)
result = urllib2.urlopen(req)
data = result.read()
I'm working on a simple HTML scraper for Hulu in python 2.6 and am having problems with logging on to my account. Here's my code so far:
import urllib
import urllib2
from cookielib import CookieJar
#make a cookie and redirect handlers
cookies = CookieJar()
cookie_handler= urllib2.HTTPCookieProcessor(cookies)
redirect_handler= urllib2.HTTPRedirectHandler()
opener = urllib2.build_opener(redirect_handler,cookie_handler)#make opener w/ handlers
#build the url
login_info = {'username':USER,'password':PASS}#USER and PASS are defined
data = urllib.urlencode(login_info)
req = urllib2.Request("http://www.hulu.com/account/authenticate",data)#make the request
test = opener.open(req) #open the page
print test.read() #print html results
The code compiles and runs, but all that prints is:
Login.onError("Please \074a href=\"/support/login_faq#cant_login\"\076enable cookies\074/a\076 and try again.");
I assume there is some error in how I'm handling cookies, but just can't seem to spot it. I've heard Mechanize is a very useful module for this type of program, but as this seems to be the only speed bump left, I was hoping to find my bug.
What you're seeing is an AJAX return. The page is probably using JavaScript to set the cookie, which is defeating your attempt to authenticate.
The error message you are getting back could be misleading. For example, the server might be looking at the User-Agent and seeing that it's not one of the supported browsers, or looking at HTTP_REFERER and expecting it to come from the Hulu domain. My point is that there are too many variables coming in with the request to keep guessing at them one by one.
I recommend using an HTTP analyzer tool, e.g. Charles or the one in Firebug, to figure out what the client sends to the server (header fields, cookies, parameters) when you do the Hulu login via a browser. This will give you the exact request that you need to construct in your Python code.
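As a rough illustration of that idea, once the analyzer has shown the real browser request, the same headers can be attached to the urllib2 request; the header values below are placeholders, not Hulu's actual requirements:
import urllib
import urllib2
from cookielib import CookieJar

cookies = CookieJar()
opener = urllib2.build_opener(urllib2.HTTPRedirectHandler(),
                              urllib2.HTTPCookieProcessor(cookies))

data = urllib.urlencode({'username': USER, 'password': PASS})
req = urllib2.Request("http://www.hulu.com/account/authenticate", data)

# copy whatever the analyzer showed the browser sending (these values are made up)
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36')
req.add_header('Referer', 'http://www.hulu.com/')
req.add_header('X-Requested-With', 'XMLHttpRequest')

print opener.open(req).read()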