Python: Unpredictable memory error when downloading large files

I wrote a Python script that I am using to download a large number of video files (50-400 MB each) from an HTTP server. It has worked well so far on long lists of downloads, but every once in a while it hits a memory error.
The machine has about 1 GB of RAM free, and I don't think it ever maxes out on RAM while running this script.
I've monitored the memory usage in Task Manager and perfmon, and it always behaves the same from what I've seen: it slowly increases during the download, then returns to a normal level after the download finishes (there are no small leaks that creep up or anything like that).
The way the download behaves is that it creates the file, which remains at 0 KB until the download finishes (or the program crashes), then writes the whole file at once and closes it.
for i in range(len(urls)):
    if os.path.exists(folderName + '/' + filenames[i] + '.mov'):
        print 'File exists, continuing.'
        continue

    # Request the download page
    req = urllib2.Request(urls[i], headers=headers)
    sock = urllib2.urlopen(req)
    responseHeaders = sock.headers
    body = sock.read()
    sock.close()

    # Search the page for the download URL
    tmp = body.find('/getfile/')
    downloadSuffix = body[tmp:body.find('"', tmp)]
    downloadUrl = domain + downloadSuffix

    req = urllib2.Request(downloadUrl, headers=headers)
    print '%s Downloading %s, file %i of %i' \
        % (time.ctime(), filenames[i], i+1, len(urls))
    f = urllib2.urlopen(req)

    # Open our local file for writing, 'b' for binary file mode
    video_file = open(folderName + '/' + filenames[i] + '.mov', 'wb')

    # Write the downloaded data to the local file
    video_file.write(f.read())  ##### MemoryError: out of memory #####
    video_file.close()
    print '%s Download complete!' % (time.ctime())

    # Free up memory, in hopes of preventing memory errors
    del f
    del video_file
Here is the stack trace:
File "downloadVideos.py", line 159, in <module>
main()
File "downloadVideos.py", line 136, in main
video_file.write(f.read())
File "c:\python27\lib\socket.py", line 358, in read
buf.write(data)
MemoryError: out of memory

Your problem is here: f.read(). That call attempts to read the entire file into memory at once. Instead, read it in chunks (chunk = f.read(4096)) and save the pieces to a temporary file as you go.
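For example, the download part of the loop could be rewritten along these lines (a minimal sketch reusing the variable names from the question; the 4096-byte chunk size is an arbitrary choice):

f = urllib2.urlopen(req)
video_file = open(folderName + '/' + filenames[i] + '.mov', 'wb')
while True:
    chunk = f.read(4096)      # read at most 4 KB at a time
    if not chunk:             # an empty string means the download is complete
        break
    video_file.write(chunk)   # write each piece to disk immediately
video_file.close()
f.close()

This keeps only one small chunk in memory at any moment, so memory usage no longer grows with the size of the file being downloaded.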

Related

Requests download with retries creates large corrupt zip

Relative beginner here. I'm trying to complete a basic task with Requests: downloading zip files. It works fine on most downloads, but intermittently writes over-sized, corrupt zip files when working with large downloads (>5 GB or so). For example, a zip file I know to be ~11 GB shows up as anywhere between 16 and 20 GB, corrupted.
When unzipping in Windows Explorer, I get "The compressed (zipped) folder is invalid". 7-Zip will extract the archive, but says:
Headers Error
Unconfirmed start of archive
Warnings: There are some data after the end of the payload data
Interestingly, the 7-Zip dialog shows the correct file size as 11479 MB.
Here's my code:
save_dir = Path(f"{dirName}/{item_type}/{item_title}.zip")
file_to_resume = save_dir

try:
    with requests.get(url, stream=True, timeout=30) as g:
        with open(save_dir, 'wb') as sav:
            for chunk in g.iter_content(chunk_size=1024*1024):
                sav.write(chunk)
except:
    attempts = 0
    while attempts < 10:
        try:
            resume_header = {'Range': f'bytes = {Path(file_to_resume).stat().st_size}-'}
            with requests.get(url, stream=True, headers=resume_header, timeout=30) as f:
                with open(file_to_resume, 'ab') as sav:
                    for chunk in f.iter_content(chunk_size=1024*1024):
                        sav.write(chunk)
            break
        except:
            attempts += 1
It turned out the server did not support the Range header, so each retry re-downloaded the file from the beginning and appended it to what was already on disk, which is why the archive came out oversized and corrupt. Thanks, @RonieMartinez.
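One way to guard against this (a sketch, reusing the url and file_to_resume names from the question) is to check whether the server actually honoured the Range request before appending: a 206 Partial Content status means it did, while a plain 200 means the full body was sent again and the file should be overwritten rather than appended to. The canonical header format also has no spaces, i.e. bytes=N-.

import requests
from pathlib import Path

resume_from = Path(file_to_resume).stat().st_size
resume_header = {'Range': f'bytes={resume_from}-'}   # no spaces inside the header value

with requests.get(url, stream=True, headers=resume_header, timeout=30) as r:
    mode = 'ab' if r.status_code == 206 else 'wb'    # 206 = resume honoured, append;
                                                     # 200 = full body again, start over
    with open(file_to_resume, mode) as sav:
        for chunk in r.iter_content(chunk_size=1024 * 1024):
            sav.write(chunk)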

Writing to data in Python to a local file and uploading to FTP at the same time does not work

I have this weird issue with my code on a Raspberry Pi 4.
from gpiozero import CPUTemperature
from datetime import datetime
import ftplib
cpu = CPUTemperature()
now = datetime.now()
time = now.strftime('%H:%M:%S')
# Save data to file
f = open('/home/pi/temp/temp.txt', 'a+')
f.write(str(time) + ' - Temperature is: ' + str(cpu.temperature) + ' C\n')
# Login and store file to FTP server
ftp = ftplib.FTP('10.0.0.2', 'username', 'pass')
ftp.cwd('AiDisk_a1/usb/temperature_logs')
ftp.storbinary('STOR temp.txt', f)
# Close file and connection
ftp.close()
f.close()
With this code, the script doesn't write anything to the .txt file, and the file that is transferred to the FTP server has a size of 0 bytes.
When I remove this part of the code, the script writes to the file just fine.
# Login and store file to FTP server
ftp = ftplib.FTP('10.0.0.2', 'username', 'pass')
ftp.cwd('AiDisk_a1/usb/temperature_logs')
ftp.storbinary('STOR temp.txt', f)
...
ftp.close()
I also tried writing some random text to the file before running the script, and that file transferred normally.
Do you have any idea what I am missing?
After you write to the file, the file pointer is at the end. So when you then pass the file handle to FTP, it reads nothing from that position, and hence nothing is uploaded.
I do not have a direct explanation for why the local file ends up empty, but the unusual combination of appending to and reading from the same handle may well be the reason.
If you want to both append the data to a local file and upload it over FTP, I suggest you either:
- Append the data to the file, seek back to the original position, and upload the appended file contents (see the sketch below); or
- Write the data to memory first and then separately 1) dump the in-memory data to the file and 2) upload it.
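A minimal sketch of the first option, reusing the names from the question. Here the handle is rewound to the start so the whole log is re-uploaded (STOR replaces the remote file), and the file is opened in binary mode, which storbinary expects:

from gpiozero import CPUTemperature
from datetime import datetime
import ftplib

cpu = CPUTemperature()
time = datetime.now().strftime('%H:%M:%S')

with open('/home/pi/temp/temp.txt', 'a+b') as f:
    f.write(('%s - Temperature is: %s C\n' % (time, cpu.temperature)).encode())
    f.flush()    # make sure the new line has actually been written out
    f.seek(0)    # rewind so storbinary has data to read

    ftp = ftplib.FTP('10.0.0.2', 'username', 'pass')
    ftp.cwd('AiDisk_a1/usb/temperature_logs')
    ftp.storbinary('STOR temp.txt', f)   # uploads the complete log file
    ftp.close()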

Memory error while downloading large Gzip files and decompressing them

I am trying to download a dataset from https://datasets.imdbws.com/title.principals.tsv.gz, decompress the contents in my code itself (Python), and write the resulting file(s) to disk.
To do so I am using the following code snippet.
results = requests.get(config[sourceFiles]['url'])
with open(config[sourceFiles]['downloadLocation'] + config[sourceFiles]['downloadFileName'], 'wb') as f_out:
    print(config[sourceFiles]['downloadFileName'] + " starting download")
    f_out.write(gzip.decompress(results.content))
    print(config[sourceFiles]['downloadFileName'] + " downloaded successfully")
This code works fine for most files; however, for larger files it gives the following error message.
File "C:\Users\****\AppData\Local\Programs\Python\Python37-32\lib\gzip.py", line 532, in decompress
return f.read()
File "C:\Users\****\AppData\Local\Programs\Python\Python37-32\lib\gzip.py", line 276, in read
return self._buffer.read(size)
File "C:\Users\****\AppData\Local\Programs\Python\Python37-32\lib\gzip.py", line 471, in read
uncompress = self._decompressor.decompress(buf, size)
MemoryError
Is there a way to accomplish this without having to download the gzip file onto disk first and then decompress it to get at the actual data?
You can use a streaming request coupled with zlib:
import zlib
import requests

url = 'https://datasets.imdbws.com/title.principals.tsv.gz'
result = requests.get(url, stream=True)
f_out = open("result.txt", "wb")
chunk_size = 1024 * 1024
d = zlib.decompressobj(zlib.MAX_WBITS | 32)

for chunk in result.iter_content(chunk_size):
    buffer = d.decompress(chunk)
    f_out.write(buffer)

buffer = d.flush()
f_out.write(buffer)
f_out.close()
This snippet reads the data chunk by chunk and feeds it to zlib, which can handle streamed data. Depending on your connection speed and CPU/disk performance, you can experiment with different chunk sizes.
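An alternative sketch of the same idea uses the standard library's gzip module over the raw response stream: gzip.GzipFile decompresses lazily while shutil.copyfileobj copies the output in fixed-size chunks (this assumes requests passes the compressed bytes through untouched, which it does when reading result.raw directly):

import gzip
import shutil
import requests

url = 'https://datasets.imdbws.com/title.principals.tsv.gz'
with requests.get(url, stream=True) as result:
    # result.raw is a file-like object over the compressed bytes;
    # GzipFile decompresses it on the fly as it is read.
    with gzip.GzipFile(fileobj=result.raw) as gz, open("result.txt", "wb") as f_out:
        shutil.copyfileobj(gz, f_out)   # copies in fixed-size chunks, never the whole file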

urllib.request page larger than memory

Is there a way to process a stream of data from urllib.request.urlopen(req) so that it can be processed in chunks?
I have a machine with limited memory, and I am pulling from an API whose response is potentially larger than the memory available on the machine, which is causing out-of-memory exceptions.
I am currently performing the following command:
resp = json.loads(urllib.request.urlopen(req).read().decode())
Try processing it line by line, which may lower the memory usage:
req = urllib.request.urlopen(req)
data = ''
for line in req:
    line = line.decode()
    data += line

data = json.loads(data)
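If the parsed JSON itself is too large to hold in memory all at once, a streaming JSON parser can process it record by record instead. A minimal sketch using the third-party ijson package, assuming the response body is a top-level JSON array and process() is a hypothetical per-record handler:

import urllib.request
import ijson   # third-party streaming JSON parser: pip install ijson

with urllib.request.urlopen(req) as resp:       # req is the Request object from the question
    # 'item' selects each element of a top-level JSON array;
    # adjust the prefix to match the actual structure of the response.
    for item in ijson.items(resp, 'item'):
        process(item)                           # hypothetical per-record handler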

Python: ftp file stuck in buffer?

When I download a file with ftplib using this method:
ftp = ftplib.FTP()
ftp.connect("host", "port")
ftp.login("user", "pwd")
size = ftp.size('locked')

def handleDownload(block):
    f.write(block)
    pbar.update(pbar.currval + len(block))

f = open("locked", "wb")
pbar = ProgressBar(widgets=[FileTransferSpeed(), Bar('>'), ' ', ETA(), ' ', ReverseBar('<'), Percentage()], maxval=size).start()
ftp.retrbinary("RETR locked", handleDownload, 1024)
pbar.finish()
If the file is less than 1 MB, it stays stuck in the buffer until I download another file with enough data to push it out. I have tried to make a dynamic buffer size by dividing ftp.size(filename) by 20, but the same thing still happens. So how can I download single files smaller than 1 MB and still use the callback function?
As Wooble pointed out in the comments, I had not called f.close() on the file. Closing it fixed the problem.
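For reference, a corrected sketch of the same download (reusing the progressbar widgets from the question), using a with block so the file is guaranteed to be flushed and closed once the transfer finishes:

import ftplib
from progressbar import ProgressBar, FileTransferSpeed, Bar, ETA, ReverseBar, Percentage

ftp = ftplib.FTP()
ftp.connect("host", 21)                 # port must be an int; the question used a placeholder
ftp.login("user", "pwd")
size = ftp.size('locked')

pbar = ProgressBar(widgets=[FileTransferSpeed(), Bar('>'), ' ', ETA(), ' ', ReverseBar('<'), Percentage()], maxval=size).start()

with open("locked", "wb") as f:         # the with block flushes and closes the file when done
    def handleDownload(block):
        f.write(block)
        pbar.update(pbar.currval + len(block))

    ftp.retrbinary("RETR locked", handleDownload, 1024)

pbar.finish()
ftp.quit()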
