Relative beginner here. I'm trying to complete a basic task with Requests: downloading zip files. It works fine on most downloads, but it intermittently writes over-sized, corrupt zip files on large downloads (roughly 5 GB and up). For example, a zip file I know to be about 11 GB shows up anywhere between 16 and 20 GB, corrupted.
When unzipping in Windows Explorer, I get "The compressed (zipped) folder is invalid". 7-Zip will extract the archive, but says:
Headers Error --- Unconfirmed start of archive --- Warnings: There are some data after the end of the payload data
Interestingly, the 7-Zip dialog shows the correct file size as 11479 MB.
Here's my code:
save_dir = Path(f"{dirName}/{item_type}/{item_title}.zip")
file_to_resume = save_dir
try:
    with requests.get(url, stream=True, timeout=30) as g:
        with open(save_dir, 'wb') as sav:
            for chunk in g.iter_content(chunk_size=1024*1024):
                sav.write(chunk)
except:
    attempts = 0
    while attempts < 10:
        try:
            resume_header = {'Range': f'bytes = {Path(file_to_resume).stat().st_size}-'}
            with requests.get(url, stream=True, headers=resume_header, timeout=30) as f:
                with open(file_to_resume, 'ab') as sav:
                    for chunk in f.iter_content(chunk_size=1024*1024):
                        sav.write(chunk)
            break
        except:
            attempts += 1
It turned out the server did not support the Range header, so each resumed request re-sent the whole file, which my code then appended to the existing partial download; that is where the over-sized, corrupt archives came from. Thanks, @RonieMartinez.
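As a side note, the canonical Range syntax is bytes=N- with no spaces around the equals sign. A quick way to check up front whether a server honors Range at all (a minimal sketch; the URL is a placeholder) is to look at the Accept-Ranges header and at whether a ranged request comes back as 206 Partial Content instead of 200:

import requests

url = "https://example.com/big-file.zip"  # placeholder URL for this sketch

# A HEAD request is often enough to reveal Range support.
head = requests.head(url, timeout=30)
print("Accept-Ranges:", head.headers.get("Accept-Ranges", "not advertised"))

# Ask for a tiny slice: a server that supports Range answers 206,
# one that ignores it answers 200 and sends the whole body again.
probe = requests.get(url, headers={"Range": "bytes=0-99"}, stream=True, timeout=30)
if probe.status_code == 206:
    print("Server honors Range requests; resuming with append mode is safe.")
else:
    print("Server ignored Range (status %d); appending would duplicate data." % probe.status_code)
probe.close()

If the probe comes back as 200, a safer fallback is to restart the download from scratch instead of appending.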
I'm trying to download some pictures from a website. When I download them with the browser they are much smaller than the ones downloaded with my code. They have the same resolution as the ones downloaded with my code, but the difference in file size is huge!
def download(url, pathname):
    """
    Downloads a file given an URL and puts it in the folder `pathname`
    """
    # if path doesn't exist, make that path dir
    if not os.path.isdir(pathname):
        os.makedirs(pathname)
    # download the body of response by chunk, not immediately
    response = requests.get(url, stream=True)
    # get the total file size
    file_size = int(response.headers.get("Content-Length", 0))
    # get the file name
    filename = os.path.join(pathname, url.split("/")[-1])
    # progress bar, changing the unit to bytes instead of iteration (default by tqdm)
    progress = tqdm(response.iter_content(1024), f"Downloading {filename}", total=file_size, unit="B",
                    unit_scale=True, unit_divisor=1024, disable=True)
    with open(filename, "wb") as f:
        for data in progress.iterable:
            # write data read to the file
            f.write(requests.get(url).content)
            # update the progress bar manually
            progress.update(len(data))
Example: https://wallpaper-mania.com/wp-content/uploads/2018/09/High_resolution_wallpaper_background_ID_77700030695.jpg
Browser: About 222 KB
Code: 48.4 MB
How does this difference come about? How can I improve the code so that the downloaded images are smaller?
f.write(requests.get(url).content)
It looks like you're re-downloading the entire file for every 1024-byte chunk, so you end up with roughly 222 copies of the image concatenated into one file. Make that:
f.write(data)
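Put together, the fixed function might look like this (a sketch keeping the original names; the tqdm bar is driven manually with the size of each chunk, and disable=True is dropped so the bar actually shows):

import os

import requests
from tqdm import tqdm

def download(url, pathname):
    """Download the file at `url` into the folder `pathname`, streaming it in chunks."""
    if not os.path.isdir(pathname):
        os.makedirs(pathname)
    response = requests.get(url, stream=True)
    file_size = int(response.headers.get("Content-Length", 0))
    filename = os.path.join(pathname, url.split("/")[-1])
    # progress in bytes, updated manually with the length of each chunk
    progress = tqdm(total=file_size, desc=f"Downloading {filename}",
                    unit="B", unit_scale=True, unit_divisor=1024)
    with open(filename, "wb") as f:
        for data in response.iter_content(1024):
            f.write(data)              # write the chunk itself, not a fresh download
            progress.update(len(data))
    progress.close()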
I wrote a script to download certain files from multiple pages on the web. The downloads seem to work, but every file comes out corrupted and only 4 kB in size, no matter how I try to download them.
Where do I need to change or revise my code to fix the download problem?
while pageCounter < 3:
    soup_level1 = BeautifulSoup(driver.page_source, 'lxml')
    for div in soup_level1.findAll('div', attrs={'class': 'financial-report-download ng-scope'}):
        links = div.findAll('a', attrs={'class': 'ng-binding'}, href=re.compile("FinancialStatement"))
        for a in links:
            driver.find_element_by_xpath("//div[@ng-repeat = 'attachments in res.Attachments']").click()
            files = [url + a['href']]
            for file in files:
                file_name = file.split('/')[-1]
                print("Downloading file: %s" % file_name)
                # create response object
                r = requests.get(file, stream=True)
                # download started
                with open(file_name, 'wb') as f:
                    for chunk in r.iter_content(chunk_size=1024*1024):
                        if chunk:
                            f.write(chunk)
                print("%s downloaded!\n" % file_name)
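When every download comes out at the same tiny size, the response is very often an HTML error, redirect, or login page rather than the file itself. A minimal diagnostic sketch, reusing one of the file URLs built in the loop above (an assumption for illustration), is to inspect the response before writing it to disk:

import requests

r = requests.get(file, stream=True)           # 'file' is one of the URLs from the loop above
print(r.status_code)                          # anything other than 200 is suspicious
print(r.headers.get("Content-Type"))          # text/html here means you received a web page, not the document
print(r.headers.get("Content-Length"))        # compare with the size you expect
print(next(r.iter_content(chunk_size=16)))    # PDFs start with b'%PDF', ZIP/XLSX with b'PK'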
I need to take the contents of a user's directory and send them in a request. Unfortunately I cannot modify the service I'm making the request to, so I CANNOT zip all the files and send that; I must send all the files individually.
There's a limit on the total size of the files, but not on the number of files. Unfortunately, once I try to open too many, Python errors out with: [Errno 24] Too many open files. Here's my current code:
files_to_send = []
files_to_close = []
for file_path in all_files:
    file_obj = open(file_path, "rb")
    files_to_send.append(("files", (file_path, file_obj)))
    files_to_close.append(file_obj)

requests.post(url, files=files_to_send)

for file_to_close in files_to_close:
    file_to_close.close()
Is there any way to get around this open file limit given my circumstances?
Can you send the files one at a time, or maybe in batches? You can increase the number of allowed open files (see "IOError: [Errno 24] Too many open files:"), but in general that's finicky and not recommended.
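For completeness, on Unix-like systems the per-process limit can be raised with the standard resource module (a sketch; it only raises the soft limit up to the hard limit the OS allows, and it is not available on Windows):

import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("current limits:", soft, hard)
# Raise the soft limit to the hard limit for this process only.
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))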
You can send your requests in batches, like so:
BATCH_SIZE = 10  # Just an example batch size
while len(all_files) > 0:
    files_to_send = []
    # Open at most BATCH_SIZE files for this request.
    while all_files and len(files_to_send) < BATCH_SIZE:
        file_path = all_files.pop()
        file_obj = open(file_path, "rb")
        files_to_send.append(("files", (file_path, file_obj)))
    requests.post(url, files=files_to_send)
    # Close this batch's handles before opening the next batch.
    for _, (_, file_obj) in files_to_send:
        file_obj.close()
This will send files in batches of 10 at a time. If sending them one at a time works for you:

for file_path in all_files:
    with open(file_path, "rb") as outfile:
        requests.post(url, files=[("files", (file_path, outfile))])
It's generally bad practice to open too many files at once.
I have a video at a URL that I want to download using Python.
The problem is that when I execute the script, the final file is just 1 kB; it's as if the download never actually starts.
I tried the solution I saw in https://stackoverflow.com/a/16696317/5280246:
url_video = "https://abiu-tree.fruithosted.net/dash/m/cdtsqmlbpkbmmddq~1504839971~190.52.0.0~w7tv1per/init-a1.mp4"

rsp = requests.get(url_video, stream=True)
print("Downloading video...")
with open("video_test_10.mp4", 'wb') as outfile:
    for chunk in rsp.iter_content(chunk_size=1024):
        if chunk:
            outfile.write(chunk)
rsp.close()
I also tried like this:
url_video = "https://abiu-tree.fruithosted.net/dash/m/cdtsqmlbpkbmmddq~1504839971~190.52.0.0~w7tv1per/init-a1.mp4"

rsp = requests.get(url_video)
with open("out.mp4", 'wb') as f:
    f.write(rsp.content)
And with:

urllib.request.urlretrieve(url_video, "out.mp4")
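A 1 kB result usually means the server sent back something other than the video, for example an error page, a redirect, or a refusal because browser-like headers are missing. A minimal diagnostic sketch before writing anything to disk (same URL as above):

import requests

url_video = "https://abiu-tree.fruithosted.net/dash/m/cdtsqmlbpkbmmddq~1504839971~190.52.0.0~w7tv1per/init-a1.mp4"

rsp = requests.get(url_video, stream=True)
print(rsp.status_code)                      # 403/404 and friends would explain a tiny file
print(rsp.headers.get("Content-Type"))      # expect something like video/mp4
print(rsp.headers.get("Content-Length"))    # expect the real video size, not ~1 kB
print(rsp.history)                          # a non-empty list means the request was redirected
rsp.close()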
I wrote a Python script which I am using to download a large number of video files (50-400 MB each) from an HTTP server. It has worked well so far on long lists of downloads, but for some reason it occasionally hits a memory error.
The machine has about 1 GB of RAM free, but I don't think it's ever maxed out on RAM while running this script.
I've monitored the memory usage in the task manager and perfmon and it always behaves the same from what I've seen: slowly increases during the download, then returns to normal level after it finishes the download (There's no small leaks that creep up or anything like that).
The way the download behaves is that it creates the file, which remains at 0 KB until the download finishes (or the program crashes), then it writes the whole file at once and closes it.
for i in range(len(urls)):
    if os.path.exists(folderName + '/' + filenames[i] + '.mov'):
        print 'File exists, continuing.'
        continue

    # Request the download page
    req = urllib2.Request(urls[i], headers=headers)
    sock = urllib2.urlopen(req)
    responseHeaders = sock.headers
    body = sock.read()
    sock.close()

    # Search the page for the download URL
    tmp = body.find('/getfile/')
    downloadSuffix = body[tmp:body.find('"', tmp)]
    downloadUrl = domain + downloadSuffix

    req = urllib2.Request(downloadUrl, headers=headers)
    print '%s Downloading %s, file %i of %i' % (time.ctime(), filenames[i], i+1, len(urls))
    f = urllib2.urlopen(req)

    # Open our local file for writing, 'b' for binary file mode
    video_file = open(folderName + '/' + filenames[i] + '.mov', 'wb')

    # Write the downloaded data to the local file
    video_file.write(f.read())  ##### MemoryError: out of memory #####
    video_file.close()

    print '%s Download complete!' % (time.ctime())

    # Free up memory, in hopes of preventing memory errors
    del f
    del video_file
Here is the stack trace:
  File "downloadVideos.py", line 159, in <module>
    main()
  File "downloadVideos.py", line 136, in main
    video_file.write(f.read())
  File "c:\python27\lib\socket.py", line 358, in read
    buf.write(data)
MemoryError: out of memory
Your problem is here: f.read(). That call attempts to read the entire file into memory at once. Instead, read in chunks (chunk = f.read(4096)) and write each piece to the file as it arrives.
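A sketch of what that could look like with the same urllib2 setup as the question (variable names reused from the code above):

import urllib2

# downloadUrl, headers, folderName, filenames and i come from the loop in the question.
req = urllib2.Request(downloadUrl, headers=headers)
f = urllib2.urlopen(req)

video_file = open(folderName + '/' + filenames[i] + '.mov', 'wb')
while True:
    chunk = f.read(4096)        # read a small piece at a time
    if not chunk:               # an empty string means the response is exhausted
        break
    video_file.write(chunk)     # write it out immediately instead of buffering the whole file
video_file.close()
f.close()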