Python Downloading Zip Files Damaged

So, I've been trying to make a simple downloader that downloads my zip file.
Code looks like this:
import urllib2
import os
import shutil
url = "https://dl.dropbox.com/u/29251693/CreeperCraft.zip"
file_name = url.split('/')[-1]
u = urllib2.urlopen(url)
f = open('c:\CreeperCraft.zip', 'w+')
meta = u.info()
file_size = int(meta.getheaders("Content-Length")[0])
print "Downloading: %s Bytes: %s" % (file_name, file_size)
file_size_dl = 0
block_sz = 8192
while True:
    buffer = u.read(block_sz)
    if not buffer:
        break
    file_size_dl += len(buffer)
    f.write(buffer)
    status = r"%10d [%3.2f%%]" % (file_size_dl, file_size_dl * 100. / file_size)
    status = status + chr(8)*(len(status)+1)
    print status,
f.close()
And the problem is, it downloads the file to the correct path, but when I open it, it's damaged: only one picture appears, and when I click on it, it says "File Damaged".
Please help.

f = open('c:\CreeperCraft.zip', 'wb+')

You are using "w+" as the mode, so Python opens the file in text mode:
Python on Windows makes a distinction between text and binary files;
the end-of-line characters in text files are automatically altered
slightly when data is read or written. This behind-the-scenes
modification to file data is fine for ASCII text files, but it’ll
corrupt binary data like that in JPEG or EXE files.
http://docs.python.org/tutorial/inputoutput.html#reading-and-writing-files
Also, note that you should escape the backslash or use raw strings, therefore use open('c:\\CreeperCraft.zip', 'wb+').
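For example, both of these spell the same Windows path; the unescaped version in the question only works by accident, because \C happens not to be a recognized escape sequence:
path_escaped = 'c:\\CreeperCraft.zip'   # escaped backslash
path_raw = r'c:\CreeperCraft.zip'       # raw string
assert path_escaped == path_raw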
I also would recommend that you do not copy raw byte strings by hand, but use shutil.copyfileobj - it makes your code more compact and easier to understand. I also like to use the with statement, which automatically cleans up resources (i.e. closes the files):
import urllib2, shutil
from contextlib import closing

url = "https://dl.dropbox.com/u/29251693/CreeperCraft.zip"
# urllib2 responses are not context managers in Python 2, so wrap them in closing()
with closing(urllib2.urlopen(url)) as source, open('c:\\CreeperCraft.zip', 'w+b') as target:
    shutil.copyfileobj(source, target)

import posixpath
import sys
import urlparse
import urllib
url = "https://dl.dropbox.com/u/29251693/CreeperCraft.zip"
filename = posixpath.basename(urlparse.urlsplit(url).path)
def print_download_status(block_count, block_size, total_size):
    sys.stderr.write('\r%10s bytes of %s' % (block_count*block_size, total_size))
filename, headers = urllib.urlretrieve(url, filename, print_download_status)

Related

How to download file by using python? [duplicate]

I have a small utility that I use to download an MP3 file from a website on a schedule; it then builds/updates a podcast XML file which I've added to iTunes.
The text processing that creates/updates the XML file is written in Python. However, I use wget inside a Windows .bat file to download the actual MP3 file. I would prefer to have the entire utility written in Python.
I struggled to find a way to actually download the file in Python, thus why I resorted to using wget.
So, how do I download the file using Python?
One more, using urlretrieve:
import urllib.request
urllib.request.urlretrieve("http://www.example.com/songs/mp3.mp3", "mp3.mp3")
(for Python 2 use import urllib and urllib.urlretrieve)
Use urllib.request.urlopen():
import urllib.request
with urllib.request.urlopen('http://www.example.com/') as f:
    html = f.read().decode('utf-8')
This is the most basic way to use the library, minus any error handling. You can also do more complex stuff such as changing headers.
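For example, a minimal sketch of sending a custom User-Agent header (the header value here is just an illustration):
import urllib.request

req = urllib.request.Request('http://www.example.com/',
                             headers={'User-Agent': 'my-downloader/1.0'})
with urllib.request.urlopen(req) as f:
    html = f.read().decode('utf-8')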
On Python 2, the method is in urllib2:
import urllib2
response = urllib2.urlopen('http://www.example.com/')
html = response.read()
In 2012, use the Python requests library:
>>> import requests
>>>
>>> url = "http://download.thinkbroadband.com/10MB.zip"
>>> r = requests.get(url)
>>> print len(r.content)
10485760
You can run pip install requests to get it.
Requests has many advantages over the alternatives because the API is much simpler. This is especially true if you have to do authentication. urllib and urllib2 are pretty unintuitive and painful in this case.
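For instance, HTTP basic auth is a single keyword argument with requests (the URL and credentials here are placeholders):
import requests

r = requests.get("https://www.example.com/protected/10MB.zip", auth=("user", "password"))
with open("10MB.zip", "wb") as f:
    f.write(r.content)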
Edit (2015-12-30):
People have expressed admiration for the progress bar. It's cool, sure. There are several off-the-shelf solutions now, including tqdm:
from tqdm import tqdm
import requests
url = "http://download.thinkbroadband.com/10MB.zip"
response = requests.get(url, stream=True)
with open("10MB", "wb") as handle:
    for data in tqdm(response.iter_content()):
        handle.write(data)
This is essentially the implementation @kvance described 30 months ago.
import urllib2
mp3file = urllib2.urlopen("http://www.example.com/songs/mp3.mp3")
with open('test.mp3','wb') as output:
    output.write(mp3file.read())
The wb in open('test.mp3','wb') opens a file (and erases any existing file) in binary mode so you can save data with it instead of just text.
Python 3
urllib.request.urlopen
import urllib.request
response = urllib.request.urlopen('http://www.example.com/')
html = response.read()
urllib.request.urlretrieve
import urllib.request
urllib.request.urlretrieve('http://www.example.com/songs/mp3.mp3', 'mp3.mp3')
Note: According to the documentation, urllib.request.urlretrieve is a "legacy interface" and "might become deprecated in the future" (thanks gerrit)
Python 2
urllib2.urlopen (thanks Corey)
import urllib2
response = urllib2.urlopen('http://www.example.com/')
html = response.read()
urllib.urlretrieve (thanks PabloG)
import urllib
urllib.urlretrieve('http://www.example.com/songs/mp3.mp3', 'mp3.mp3')
use wget module:
import wget
wget.download('url')
import os, requests

def download(url):
    get_response = requests.get(url, stream=True)
    file_name = url.split("/")[-1]
    with open(file_name, 'wb') as f:
        for chunk in get_response.iter_content(chunk_size=1024):
            if chunk:  # filter out keep-alive new chunks
                f.write(chunk)

download("https://example.com/example.jpg")
An improved version of the PabloG code for Python 2/3:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import ( division, absolute_import, print_function, unicode_literals )

import sys, os, tempfile, logging

if sys.version_info >= (3,):
    import urllib.request as urllib2
    import urllib.parse as urlparse
else:
    import urllib2
    import urlparse

def download_file(url, dest=None):
    """
    Download and save a file specified by url to dest directory,
    """
    u = urllib2.urlopen(url)

    scheme, netloc, path, query, fragment = urlparse.urlsplit(url)
    filename = os.path.basename(path)
    if not filename:
        filename = 'downloaded.file'
    if dest:
        filename = os.path.join(dest, filename)

    with open(filename, 'wb') as f:
        meta = u.info()
        meta_func = meta.getheaders if hasattr(meta, 'getheaders') else meta.get_all
        meta_length = meta_func("Content-Length")
        file_size = None
        if meta_length:
            file_size = int(meta_length[0])
        print("Downloading: {0} Bytes: {1}".format(url, file_size))

        file_size_dl = 0
        block_sz = 8192
        while True:
            buffer = u.read(block_sz)
            if not buffer:
                break

            file_size_dl += len(buffer)
            f.write(buffer)

            status = "{0:16}".format(file_size_dl)
            if file_size:
                status += " [{0:6.2f}%]".format(file_size_dl * 100 / file_size)
            status += chr(13)
            print(status, end="")
        print()

    return filename

if __name__ == "__main__":  # Only run if this file is called directly
    print("Testing with 10MB download")
    url = "http://download.thinkbroadband.com/10MB.zip"
    filename = download_file(url)
    print(filename)
Simple yet Python 2 & Python 3 compatible way comes with six library:
from six.moves import urllib
urllib.request.urlretrieve("http://www.example.com/songs/mp3.mp3", "mp3.mp3")
The following are the most commonly used calls for downloading files in Python:
urllib.urlretrieve ('url_to_file', file_name)
urllib2.urlopen('url_to_file')
requests.get(url)
wget.download('url', file_name)
Note: urlopen and urlretrieve are found to perform relatively badly when downloading large files (> 500 MB). requests.get stores the file in memory until the download is complete.
I wrote the wget library in pure Python just for this purpose. As of version 2.0, it is urlretrieve pumped up with these features.
In Python 3 you can use the urllib3 and shutil libraries.
Install urllib3 using pip or pip3 (depending on whether python3 is the default or not); shutil is part of the standard library, so it does not need to be installed:
pip3 install urllib3
Then run this code
import urllib.request
import shutil
url = "http://www.somewebsite.com/something.pdf"
output_file = "save_this_name.pdf"
with urllib.request.urlopen(url) as response, open(output_file, 'wb') as out_file:
    shutil.copyfileobj(response, out_file)
Note that you install urllib3 but use urllib.request in the code.
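If you actually want to use urllib3 itself, a rough sketch could look like this (preload_content=False streams the body instead of loading it all into memory):
import shutil
import urllib3

url = "http://www.somewebsite.com/something.pdf"
output_file = "save_this_name.pdf"

http = urllib3.PoolManager()
r = http.request('GET', url, preload_content=False)
with open(output_file, 'wb') as out_file:
    shutil.copyfileobj(r, out_file)
r.release_conn()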
I agree with Corey, urllib2 is more complete than urllib and should likely be the module used if you want to do more complex things, but to make the answers more complete, urllib is a simpler module if you want just the basics:
import urllib
response = urllib.urlopen('http://www.example.com/sound.mp3')
mp3 = response.read()
Will work fine. Or, if you don't want to deal with the "response" object you can call read() directly:
import urllib
mp3 = urllib.urlopen('http://www.example.com/sound.mp3').read()
If you have wget installed, you can use parallel_sync.
pip install parallel_sync
from parallel_sync import wget
urls = ['http://something.png', 'http://somthing.tar.gz', 'http://somthing.zip']
wget.download('/tmp', urls)
# or a single file:
wget.download('/tmp', urls[0], filenames='x.zip', extract=True)
Doc:
https://pythonhosted.org/parallel_sync/pages/examples.html
This is pretty powerful. It can download files in parallel, retry upon failure, and it can even download files on a remote machine.
You can get the progress feedback with urlretrieve as well:
import sys
import urllib

def report(blocknr, blocksize, size):
    current = blocknr * blocksize
    sys.stdout.write("\r{0:.2f}%".format(100.0 * current / size))

def downloadFile(url):
    print "\n", url
    fname = url.split('/')[-1]
    print fname
    urllib.urlretrieve(url, fname, report)
If speed matters to you, I made a small performance test for the modules urllib and wget; for wget I tried once with a status bar and once without. I took three different 500MB files to test with (different files, to eliminate the chance that there is some caching going on under the hood). Tested on a Debian machine with Python 2.
First, these are the results (they are similar in different runs):
$ python wget_test.py
urlretrive_test : starting
urlretrive_test : 6.56
==============
wget_no_bar_test : starting
wget_no_bar_test : 7.20
==============
wget_with_bar_test : starting
100% [......................................................................] 541335552 / 541335552
wget_with_bar_test : 50.49
==============
The way I performed the test is by using a "profile" decorator. This is the full code:
import wget
import urllib
import time
from functools import wraps

def profile(func):
    @wraps(func)
    def inner(*args):
        print func.__name__, ": starting"
        start = time.time()
        ret = func(*args)
        end = time.time()
        print func.__name__, ": {:.2f}".format(end - start)
        return ret
    return inner

url1 = 'http://host.com/500a.iso'
url2 = 'http://host.com/500b.iso'
url3 = 'http://host.com/500c.iso'

def do_nothing(*args):
    pass

@profile
def urlretrive_test(url):
    return urllib.urlretrieve(url)

@profile
def wget_no_bar_test(url):
    return wget.download(url, out='/tmp/', bar=do_nothing)

@profile
def wget_with_bar_test(url):
    return wget.download(url, out='/tmp/')

urlretrive_test(url1)
print '=============='
time.sleep(1)

wget_no_bar_test(url2)
print '=============='
time.sleep(1)

wget_with_bar_test(url3)
print '=============='
time.sleep(1)
urllib seems to be the fastest
Just for the sake of completeness, it is also possible to call any program for retrieving files using the subprocess package. Programs dedicated to retrieving files are more powerful than Python functions like urlretrieve. For example, wget can download directories recursively (-r), can deal with FTP, redirects, and HTTP proxies, can avoid re-downloading existing files (-nc), and aria2 can do multi-connection downloads, which can potentially speed up your downloads.
import subprocess
subprocess.check_output(['wget', '-O', 'example_output_file.html', 'https://example.com'])
In Jupyter Notebook, one can also call programs directly with the ! syntax:
!wget -O example_output_file.html https://example.com
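Similarly, a sketch of calling aria2 for a multi-connection download (assuming aria2c is installed and on the PATH; -x sets the maximum connections per server):
import subprocess

subprocess.check_output(['aria2c', '-x', '16', '-o', 'example_output_file.html', 'https://example.com'])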
Late answer, but for python>=3.6 you can use:
import dload
dload.save(url)
Install dload with:
pip3 install dload
Source code can be:
import urllib
sock = urllib.urlopen("http://diveintopython.org/")
htmlSource = sock.read()
sock.close()
print htmlSource
I wrote the following, which works in vanilla Python 2 or Python 3.
import sys

try:
    import urllib.request
    python3 = True
except ImportError:
    import urllib2
    python3 = False

def progress_callback_simple(downloaded, total):
    sys.stdout.write(
        "\r" +
        (len(str(total)) - len(str(downloaded))) * " " + str(downloaded) + "/%d" % total +
        " [%3.2f%%]" % (100.0 * float(downloaded) / float(total))
    )
    sys.stdout.flush()

def download(srcurl, dstfilepath, progress_callback=None, block_size=8192):
    def _download_helper(response, out_file, file_size):
        if progress_callback != None: progress_callback(0, file_size)
        if block_size == None:
            buffer = response.read()
            out_file.write(buffer)
            if progress_callback != None: progress_callback(file_size, file_size)
        else:
            file_size_dl = 0
            while True:
                buffer = response.read(block_size)
                if not buffer: break
                file_size_dl += len(buffer)
                out_file.write(buffer)
                if progress_callback != None: progress_callback(file_size_dl, file_size)
    with open(dstfilepath, "wb") as out_file:
        if python3:
            with urllib.request.urlopen(srcurl) as response:
                file_size = int(response.getheader("Content-Length"))
                _download_helper(response, out_file, file_size)
        else:
            response = urllib2.urlopen(srcurl)
            meta = response.info()
            file_size = int(meta.getheaders("Content-Length")[0])
            _download_helper(response, out_file, file_size)

import traceback
try:
    download(
        "https://geometrian.com/data/programming/projects/glLib/glLib%20Reloaded%200.5.9/0.5.9.zip",
        "output.zip",
        progress_callback_simple
    )
except:
    traceback.print_exc()
    input()
Notes:
Supports a "progress bar" callback.
Download is a 4 MB test .zip from my website.
You can use PycURL on Python 2 and 3.
import pycurl
FILE_DEST = 'pycurl.html'
FILE_SRC = 'http://pycurl.io/'
with open(FILE_DEST, 'wb') as f:
    c = pycurl.Curl()
    c.setopt(c.URL, FILE_SRC)
    c.setopt(c.WRITEDATA, f)
    c.perform()
    c.close()
Use Python Requests in 5 lines
import requests as req
remote_url = 'http://www.example.com/sound.mp3'
local_file_name = 'sound.mp3'
data = req.get(remote_url)
# Save file data to local copy
with open(local_file_name, 'wb') as file:
    file.write(data.content)
Now do something with the local copy of the remote file
This may be a little late, but I saw PabloG's code and couldn't help adding an os.system('cls') to make it look AWESOME! Check it out:
import urllib2,os
url = "http://download.thinkbroadband.com/10MB.zip"
file_name = url.split('/')[-1]
u = urllib2.urlopen(url)
f = open(file_name, 'wb')
meta = u.info()
file_size = int(meta.getheaders("Content-Length")[0])
print "Downloading: %s Bytes: %s" % (file_name, file_size)
os.system('cls')
file_size_dl = 0
block_sz = 8192
while True:
    buffer = u.read(block_sz)
    if not buffer:
        break
    file_size_dl += len(buffer)
    f.write(buffer)
    status = r"%10d [%3.2f%%]" % (file_size_dl, file_size_dl * 100. / file_size)
    status = status + chr(8)*(len(status)+1)
    print status,
f.close()
If running in an environment other than Windows, you will have to use something other than 'cls'. On Mac OS X and Linux it should be 'clear'.
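A small sketch of picking the right command portably:
import os

# 'cls' on Windows, 'clear' elsewhere
os.system('cls' if os.name == 'nt' else 'clear')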
urlretrieve and requests.get are simple, but the reality is not.
I have fetched data from a couple of sites, including text and images, and the above two probably solve most tasks. But for a more universal solution I suggest using urlopen. As it is included in the Python 3 standard library, your code can run on any machine that runs Python 3 without pre-installing a site package.
import urllib.request

# placeholder values -- set these for your own download
url = "http://www.example.com/file.zip"
filename = "file.zip"
buffer_size = 8192
headers = {'User-Agent': 'Mozilla/5.0'}  # some sites refuse the default Python user agent

url_request = urllib.request.Request(url, headers=headers)
url_connect = urllib.request.urlopen(url_request)

# remember to open file in bytes mode
with open(filename, 'wb') as f:
    while True:
        buffer = url_connect.read(buffer_size)
        if not buffer: break
        # an integer value of size of written data
        data_wrote = f.write(buffer)

# you could probably use with-open-as manner
url_connect.close()
This answer provides a solution to HTTP 403 Forbidden when downloading a file over HTTP using Python. I have tried only the requests and urllib modules; other modules may provide something better, but this is the one I used to solve most of the problems.
A new urllib3-based implementation:
>>> import urllib3
>>> http = urllib3.PoolManager()
>>> r = http.request('GET', 'your_url_goes_here')
>>> r.status
200
>>> r.data
*****Response Data****
More info: https://pypi.org/project/urllib3/
You can use Python requests:
import os
import requests
outfile = os.path.join(SAVE_DIR, file_name)
response = requests.get(URL, stream=True)
with open(outfile,'wb') as output:
    output.write(response.content)
You can use shutil
import os
import requests
import shutil
outfile = os.path.join(SAVE_DIR, file_name)
response = requests.get(url, stream = True)
with open(outfile, 'wb') as f:
    shutil.copyfileobj(response.raw, f)
If you are downloading from a restricted URL, don't forget to include the access token in the headers.
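For example, a minimal sketch assuming a bearer-token style API (the URL, header name, and token are placeholders):
import requests

url = "https://www.example.com/protected/file.zip"
headers = {"Authorization": "Bearer <your_access_token>"}
response = requests.get(url, headers=headers, stream=True)
with open("file.zip", 'wb') as output:
    output.write(response.content)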
I wanted to download all the files from a webpage. I tried wget but it was failing, so I decided to go the Python route and I found this thread.
After reading it, I made a little command line application, soupget, expanding on the excellent answers of PabloG and Stan and adding some useful options.
It uses BeautifulSoup to collect all the URLs of the page and then downloads the ones with the desired extension(s). Finally, it can download multiple files in parallel.
Here it is:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from __future__ import (division, absolute_import, print_function, unicode_literals)
import sys, os, argparse
from bs4 import BeautifulSoup

# --- insert Stan's script here ---
# if sys.version_info >= (3,):
#...
#...
# def download_file(url, dest=None):
#...
#...

# --- new stuff ---
def collect_all_url(page_url, extensions):
    """
    Recovers all links in page_url checking for all the desired extensions
    """
    conn = urllib2.urlopen(page_url)
    html = conn.read()
    soup = BeautifulSoup(html, 'lxml')
    links = soup.find_all('a')

    results = []
    for tag in links:
        link = tag.get('href', None)
        if link is not None:
            for e in extensions:
                if e in link:
                    # Fallback for badly defined links
                    # checks for missing scheme or netloc
                    if bool(urlparse.urlparse(link).scheme) and bool(urlparse.urlparse(link).netloc):
                        results.append(link)
                    else:
                        new_url = urlparse.urljoin(page_url, link)
                        results.append(new_url)
    return results

if __name__ == "__main__":  # Only run if this file is called directly
    # Command line arguments
    parser = argparse.ArgumentParser(
        description='Download all files from a webpage.')
    parser.add_argument(
        '-u', '--url',
        help='Page url to request')
    parser.add_argument(
        '-e', '--ext',
        nargs='+',
        help='Extension(s) to find')
    parser.add_argument(
        '-d', '--dest',
        default=None,
        help='Destination where to save the files')
    parser.add_argument(
        '-p', '--par',
        action='store_true', default=False,
        help="Turns on parallel download")
    args = parser.parse_args()

    # Recover files to download
    all_links = collect_all_url(args.url, args.ext)

    # Download
    if not args.par:
        for l in all_links:
            try:
                filename = download_file(l, args.dest)
                print(l)
            except Exception as e:
                print("Error while downloading: {}".format(e))
    else:
        from multiprocessing.pool import ThreadPool
        results = ThreadPool(10).imap_unordered(
            lambda x: download_file(x, args.dest), all_links)
        for p in results:
            print(p)
An example of its usage is:
python3 soupget.py -p -e <list of extensions> -d <destination_folder> -u <target_webpage>
And an actual example if you want to see it in action:
python3 soupget.py -p -e .xlsx .pdf .csv -u https://healthdata.gov/dataset/chemicals-cosmetics
Another possibility is with built-in http.client:
from http import HTTPStatus, client
from shutil import copyfileobj

# using https
connection = client.HTTPSConnection("www.example.com")
connection.request("GET", "/noise.mp3")  # request() sends the request but does not return the response
response = connection.getresponse()
if response.status == HTTPStatus.OK:
    with open("noise.mp3", "wb") as out_file:
        copyfileobj(response, out_file)
else:
    raise Exception("request needs work")
The HTTPConnection object is considered “low-level” in that it performs the desired request once and assumes the developer will subclass it or script in a way to handle the nuances of HTTP. Libraries such as requests tend to handle more special cases such as automatically following redirects and so on.
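For comparison, a rough sketch with requests, which follows redirects by default and records them:
import requests

r = requests.get("http://www.example.com/noise.mp3", allow_redirects=True)
print(r.history)  # any intermediate redirect responses
with open("noise.mp3", "wb") as f:
    f.write(r.content)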
You can use keras.utils.get_file to do it:
from tensorflow import keras
path_to_downloaded_file = keras.utils.get_file(
    fname="file name",
    origin="https://www.linktofile.com/link/to/file",
    extract=True,
    archive_format="zip",  # downloaded file format
    cache_dir="/",  # cache and extract in current directory
)
Another way is to call an external process such as curl.exe. Curl by default displays a progress bar, average download speed, time left, and more all formatted neatly in a table.
Put curl.exe in the same directory as your script
from subprocess import call
url = ""
call(["curl", url, '--output', "song.mp3"])
Note: You cannot specify an output path with curl, so do an os.rename afterwards
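A sketch of that, renaming the file after the download completes (the URL and target name are placeholders):
import os
from subprocess import call

url = "http://www.example.com/song.mp3"
call(["curl", url, '--output', "song.mp3"])
os.rename("song.mp3", "renamed_song.mp3")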

How to Replace \n, b and single quotes from Raw File from GitHub?

I am trying to download a file from GitHub (a raw file) and then run it as a .sql file.
import snowflake.connector
from codecs import open
import logging
import requests
from os import getcwd
import os
import sys
#logging
logging.basicConfig(
    filename='C:/Users/abc/Documents/Test.log',
    level=logging.INFO
)

url = "https://github.com/raw/abc/master/file_name?token=Anvn3lJXDks5ciVaPwA%3D%3D"
directory = getcwd()
filename = os.path.join(getcwd(),'VIEWS.SQL')
r = requests.get(url)
filename.decode("utf-8")

f = open(filename,'w')
f.write(str(r.content))

with open(filename,'r') as theFile, open(filename,'w') as outFile:
    data = theFile.read().split('\n')
    data = theFile.read().replace('\n','')
    data = theFile.read().replace("b'","")
    data = theFile.read()
    outFile.write(data)
However I get this error
syntax error line 1 at position 0 unexpected 'b'
My converted SQL file has a b at the beginning and a bunch of newline \n characters in the file. Also, the entire output file is in single quotes ('text'). Can anyone help me get rid of these? It looks like replace isn't working.
OS: Windows
Python Version: 3.7.0
You introduced a b'.. prefix by converting the response.content bytes value to a string with str():
>>> import requests
>>> r = requests.get("https://github.com/raw/abc/master/file_name?token=Anvn3lJXDks5ciVaPwA%3D%3D")
>>> r.content
b'Not Found'
>>> str(r.content)
"b'Not Found'"
Of course, the specific dummy URL you gave in your question produces a 404 Not Found response, hence the Not Found content of the response body:
>>> r.status_code
404
so the contents in this demonstration are not actually all that useful. However, even for your real URL you probably want to test for a 200 status code before moving to write the data to a file!
What is going wrong in the above is that str(bytesvalue) converts a bytes object to its representation. You'd normally want to decode a bytes value with a text codec, using the bytes.decode() method. But because you are writing the data to a file here, you should instead just open the file in binary mode and write the bytes object without decoding:
r = requests.get(url)
if r.status_code == 200:
    with open(filename, 'wb') as f:
        f.write(r.content)
The 'wb' mode opens the file for writing in binary mode. Writing binary content to a binary file is the most efficient; decoding it first then writing to a text file requires that it is encoded again. Better to avoid doing double work.
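To make the str() versus decode() distinction concrete:
data = b'SELECT 1;\n'
str(data)             # "b'SELECT 1;\\n'" -- the representation, with the b prefix and escapes
data.decode('utf-8')  # 'SELECT 1;\n'     -- the actual text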
As a side note: there is no need to join a local filename with getcwd(); relative paths always end up in the current working directory, and otherwise it's better to use os.path.abspath(filename).
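In other words, a bare relative name already resolves against the working directory:
import os

filename = 'VIEWS.SQL'
# the same file as os.path.join(os.getcwd(), 'VIEWS.SQL')
print(os.path.abspath(filename))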
You could also trust that GitHub sets the correct character set in the Content-Type headers and have response decode the value to str for you in the form of the response.text attribute:
r = requests.get(url)
if r.status_code == 200:
    with open(filename, 'w') as f:
        f.write(r.text)
but again, that's really doing extra work for nothing, first decoding the binary content from the request, then encoding again when writing to a text file.
Finally, for larger file responses it is better to stream the data and copy it directly to a file. The shutil.copyfileobj() function can take a raw response fileobject directly, provided you enable transparent transport decompression:
import shutil
r = requests.get(url, stream=True)
if r.status_code == 200:
    with open(filename, 'wb') as f:
        # enable transparent transport decompression handling
        r.raw.decode_content = True
        shutil.copyfileobj(r.raw, f)
Depending on your version of Python/OS, it could be as simple as changing the file to read/write in binary (and, if the stray characters are still there, altering where you have the replaces):
with open(filename,'rb') as theFile, open(filename,'wb') as outFile:
    outFile.write(str(r.content))
    data = theFile.read().split('\n')
    data = data.replace('\n','')
    data = data.replace("b'","")
    outFile.write(data)
It would help to have a copy of the file and the line the error is occurring on.

Python Requests: downloaded image file binary not identical to original?

I am saving an image file from the web using Python Requests. However, the saved file is slightly different from the original at the binary level, and a bit bigger. It still is a valid JPG file, but scrambled.
Here is the code:
import requests
import shutil
import os
if __name__ == "__main__":
    image_url = 'http://www.123.com/image.jpg'
    filename = 'out.jpg'
    username = 'myusername'
    password = 'mypasword'

    path = os.path.join('c:/', filename)

    r = requests.get(image_url, auth=(username, password), stream=True)
    if r.status_code == 200:
        with open(path, 'w') as f:
            r.raw.decode_content = False
            shutil.copyfileobj(r.raw, f)

    print 'The End'
What am I doing wrong?
open(path, 'w')
should be:
open(path, 'wb')
The b is for "binary". This will make sure Python won't try to convert character encodings and newlines and reads or writes everything exactly as it is byte-for-byte.
Also see the open() documentation

How to get python to successfully download large images from the internet

So I've been using
urllib.request.urlretrieve(URL, FILENAME)
to download images off the internet. It works great, but fails on some images. The ones it fails on seem to be the larger ones, e.g. http://i.imgur.com/DEKdmba.jpg. It downloads them fine, but when I try to open these files Photo Viewer gives me the error "Windows Photo Viewer can't open this picture because the file appears to be damaged, corrupted, or too large".
What might be the reason it can't download these, and how can I fix this?
EDIT: after looking further, I don't think the problem is large images - it manages to download larger ones. It just seems to be some random ones that it can never download whenever I run the script again. Now I'm even more confused.
In the past, I have used this code for copying from the internet. I have had no trouble with large files.
import urllib2

def download(url):
    file_name = raw_input("Name: ")
    u = urllib2.urlopen(url)
    f = open(file_name, 'wb')
    meta = u.info()
    file_size = int(meta.getheaders("Content-Length")[0])
    print "Downloading: %s Bytes: %s" % (file_name, file_size)

    file_size_dl = 0
    block_size = 8192
    while True:
        buffer = u.read(block_size)
        if not buffer:
            break
        file_size_dl += len(buffer)
        f.write(buffer)
    f.close()
Here's the sample code for Python 3 (tested in Windows 7):
import urllib.request

def download_very_big_image():
    url = 'http://i.imgur.com/DEKdmba.jpg'
    filename = 'C://big_image.jpg'
    conn = urllib.request.urlopen(url)
    output = open(filename, 'wb')  # binary flag needed for Windows
    output.write(conn.read())
    output.close()
For completeness sake, here's the equivalent code in Python 2:
import urllib2

def download_very_big_image():
    url = 'http://i.imgur.com/DEKdmba.jpg'
    filename = 'C://big_image.jpg'
    conn = urllib2.urlopen(url)
    output = open(filename, 'wb')  # binary flag needed for Windows
    output.write(conn.read())
    output.close()
This should work: use the requests module:
import requests
img_url = 'http://i.imgur.com/DEKdmba.jpg'
img_name = img_url.split('/')[-1]
img_data = requests.get(img_url).content
with open(img_name, 'wb') as handler:
    handler.write(img_data)

Using Pylzma with streaming and 7Zip compatibility

I have been using pylzma for a little bit, but I have to be able to create files compatible with the 7-Zip Windows application. The caveat is that some of my files are really large (3 to 4 GB, created by third-party software in a proprietary binary format).
I went over and over the instructions here: https://github.com/fancycode/pylzma/blob/master/doc/USAGE.md
I am able to create compatible files with the following code:
def Compacts(folder, f):
    os.chdir(folder)
    fsize = os.stat(f).st_size
    t = time.clock()
    i = open(f, 'rb')
    o = open(f + '.7z', 'wb')
    i.seek(0)
    s = pylzma.compressfile(i)
    result = s.read(5)
    result += struct.pack('<Q', fsize)
    s = result + s.read()
    o.write(s)
    o.flush()
    o.close()
    i.close()
    os.remove(f)
The smaller files (up to 2 GB) compress well with this code and are compatible with 7-Zip, but the larger files just crash Python after some time.
According to the user guide, to compact large files one should use streaming, but then the resulting file is not compatible with 7-Zip, as in the snippet below.
def Compacts(folder, f):
    os.chdir(folder)
    fsize = os.stat(f).st_size
    t = time.clock()
    i = open(f, 'rb')
    o = open(f + '.7z', 'wb')
    i.seek(0)
    s = pylzma.compressfile(i)
    while True:
        tmp = s.read(1)
        if not tmp: break
        o.write(tmp)
    o.flush()
    o.close()
    i.close()
    os.remove(f)
Any ideas on how I can incorporate the streaming technique present in pylzma while keeping the 7-Zip compatibility?
You still need to correctly write the header (.read(5)) and size, e.g. like so:
import os
import struct
import pylzma

def sevenzip(infile, outfile):
    size = os.stat(infile).st_size
    with open(infile, "rb") as ip, open(outfile, "wb") as op:
        s = pylzma.compressfile(ip)
        op.write(s.read(5))
        op.write(struct.pack('<Q', size))
        while True:
            # Read 128K chunks.
            # Not sure if this has to be 1 instead to trigger streaming in pylzma...
            tmp = s.read(1<<17)
            if not tmp:
                break
            op.write(tmp)

if __name__ == "__main__":
    import sys
    try:
        _, infile, outfile = sys.argv
    except:
        infile, outfile = __file__, __file__ + u".7z"
    sevenzip(infile, outfile)
    print("compressed {} to {}".format(infile, outfile))
