I'm using Python 3 to download a file:
local_file = open(file_name, "w" + file_mode)
local_file.write(f.read())
local_file.close()
This code works, but it copies the whole file into memory first, which is a problem with very big files because the program becomes very memory-hungry (going from 17 MB to 240 MB of memory for a 200 MB file).
I would like to know if there is a way in Python to download a small part of the file (a packet), write it to the local file, drop it from memory, and keep repeating the process until the file is completely downloaded.
Try using the method described here:
Lazy Method for Reading Big File in Python?
I am specifically referring to the accepted answer; let me also copy it here for clarity.
def read_in_chunks(file_object, chunk_size=1024):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1k."""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

f = open('really_big_file.dat')
for piece in read_in_chunks(f):
    process_data(piece)
This should be adaptable to your needs: it reads the file in smaller chunks, so you can process each piece without filling your entire memory; a sketch of applying it to your download loop follows below. Come back if you have any further questions.
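For illustration only (this is not from the original answer), here is a minimal sketch of applying the same chunking idea to the download case. It assumes the remote object comes from urllib.request.urlopen; the URL and local file name are placeholders, and read_in_chunks is the generator defined above.
from urllib.request import urlopen

url = 'http://example.com/big_file.bin'   # placeholder URL
file_name = 'big_file.bin'                # reuse whatever name your code already has

# Stream the download to disk in fixed-size chunks, so only one chunk
# is held in memory at a time.
with urlopen(url) as remote, open(file_name, 'wb') as local_file:
    for chunk in read_in_chunks(remote, chunk_size=64 * 1024):
        local_file.write(chunk)
The standard library's shutil.copyfileobj(remote, local_file) performs essentially the same chunked copy if you prefer not to write the loop yourself.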
I have a very big .BIN file and I am loading it into the available RAM (128 GB) by using:
ice.Load_data_to_memory("global.bin", True)
(see: https://github.com/iceland2k14/secp256k1)
Now I need to read the content of the file in chunks of 10 bytes, and for that I am using:
with open('global.bin', 'rb') as bf:
    while True:
        data = bf.read(10)
        if not data:
            break  # stop at end of file
        if data == y:
            pass  # do this!
This works well with the rest of the code if the .BIN file is small, but not if the file is big. My suspicion is that, by writing the code this way, I either open the .BIN file twice or get no result at all, because with open('global.bin', 'rb') as bf is not "synchronized" with ice.Load_data_to_memory("global.bin", True). Thus, I would like to find a way to read the 10-byte chunks directly from memory, without having to open the file again with "with open('global.bin', 'rb') as bf".
I found a working approach here: LOAD FILE INTO MEMORY
This works well with a small .BIN file containing 3 strings of 10 bytes each:
import mmap

with open('0x4.bin', 'rb') as f:
    # Size 0 will map the ENTIRE file into memory!
    m = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)  # file is open read-only
    # Proceed with your code here -- note the file is already in memory,
    # so reading here will be as fast as could be
    data = m.read(10)  # using read(10) instead of readline()
    while data:
        # do something!
        data = m.read(10)  # read the next 10-byte chunk
Now the point: with a much bigger .BIN file, it takes much more time to load the whole file into memory, and the while data: part starts working immediately, so I would need a delay here so that the script only starts working AFTER the file is completely loaded into memory...
I am reading in a large gzipped JSON file, ~4 GB. I want to read in the first n lines.
with gzip.open('/path/to/my/data/data.json.gz', 'rt') as f:
    line_n = f.readlines(1)
    print(ast.literal_eval(line_n[0])['events'])  # a dictionary object
This works fine when I want to read a single line. If I now try to read in a loop, e.g.
no_of_lines = 1
with gzip.open('/path/to/my/data/data.json.gz', 'rt') as f:
    for line in range(no_of_lines):
        line_n = f.readlines(line)
        print(ast.literal_eval(line_n[0])['events'])
My code takes forever to execute, even if the loop has length 1. I'm assuming this behaviour has something to do with how gzip reads files; perhaps when I loop it tries to obtain information about the file length, which causes the long execution time? Can anyone shed some light on this and potentially provide an alternative way of doing it?
An edited first line of my data:
['{"events": {"category": "EVENT", "mac_address": "123456", "co_site": "HSTH"}}\n']
You are using the readlines() method, which (without a size hint) reads all lines from the file at once. This can cause performance issues when reading huge files, as Python needs to load all the lines into memory.
An alternative is to iterate over the file object directly, which yields one line at a time without loading all the lines into memory at once:
with gzip.open('/path/to/my/data/data.json.gz', 'rt') as f:
    for line in f:
        print(ast.literal_eval(line)['events'])
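Since the question asks specifically for the first n lines, here is a small sketch (an editorial addition, not part of the original answer) that stops the iteration after n lines using itertools.islice:
import ast
import gzip
from itertools import islice

no_of_lines = 5  # illustrative value

with gzip.open('/path/to/my/data/data.json.gz', 'rt') as f:
    # islice stops the loop after no_of_lines lines, so only those lines
    # are ever decompressed and parsed
    for line in islice(f, no_of_lines):
        print(ast.literal_eval(line)['events'])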
I have a program constantly writing to a file, which is compressed on the fly:
import lzma
with lzma.open('file.lz', 'wb') as f:
    for ...:
        # do something
        f.write(item)
So, the file is append-only. At the same time, I need to be able to run another program which will read from this file - not streaming/following, just one-shot reading of the current content. Basically, this works:
import lzma
with lzma.open('file.lz', 'rb') as f:
    content = f.read()
But writes in the first program don't reach the file immediately; instead, the data is buffered for some time (I see buffers of 8k to 60k in size). When small writes happen infrequently, the file content can fall far behind the current state, and I'd like to flush it or do something similar (every n records or every n minutes). However, f.flush() doesn't seem to do anything. What's the best solution here? Maybe I overlooked something obvious.
I'm using Windows 7 and I have a super-simple script that goes over a directory of images, checking a specified condition for each image (in my case, whether there's a face in the image, using dlib), while writing the paths of images that fulfilled the condition to a text file:
def process_dir(dir_path):
    i = 0
    with open(txt_output, 'a') as f:
        for filename in os.listdir(dir_path):
            # loading image to check whether dlib detects a face:
            image_path = os.path.join(dir_path, filename)
            opencv_img = cv2.imread(image_path)
            dets = detector(opencv_img, 1)
            if len(dets) > 0:
                f.write(image_path)
                f.write('\n')
                i = i + 1
                print i
Now the following thing happens: there seems to be a significant lag in appending lines to the file. For example, I can see the script has "finished" checking several images (i.e., the console prints ~20, meaning 20 files that fulfil the condition have been found), but the .txt file is still empty. At first I thought there was a problem with my script, but after waiting a while I saw that the paths were in fact added to the file, only the file seems to be updated in "batches".
This may not seem like the most crucial issue (and it's definitely not), but still I'm wondering - what explains this behavior? As far as I understand, every time the f.write(image_path) line is executed the file is changed - then why do I see the update with a lag?
Data written to a file object won't necessarily show up on disk immediately.
In the interests of efficiency, most operating systems will buffer the writes, meaning that data is only written out to disk when a certain amount has accumulated (usually 4K).
If you want to write your data right now, use the flush() function, as others have said.
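As a small illustrative sketch (an editorial addition, reusing txt_output and image_path from the question): flush() pushes Python's buffer down to the OS, and os.fsync() additionally asks the OS to commit its own cache to disk.
import os

with open(txt_output, 'a') as f:
    f.write(image_path + '\n')
    f.flush()              # push Python's user-space buffer to the OS
    os.fsync(f.fileno())   # optional: ask the OS to write its cache to disk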
Did you try using buffer size 0: open(txt_output, 'a', 0)?
I'm not sure about Windows (please, someone correct me here if I'm wrong), but I believe this is because of how the write buffer is handled. Although you are requesting a write, the buffer is only written out every so often (when it is full) and when the file is closed. You can open the file with a smaller buffer:
with open(txt_output, 'a', 0) as f:
or manually flush it at the end of the loop:
if len(dets) > 0:
    f.write(image_path)
    f.write('\n')
    f.flush()
    i = i + 1
    print i
I would personally recommend flushing manually when you need to.
It sounds like you're running into file stream buffering.
In short, writing to a file is a very slow process (relative to other sorts of things that the processor does). Modifying the hard disk is about the slowest thing you can do, other than maybe printing to the screen.
Because of this, most file I/O libraries will "buffer" your output, meaning that as you write to the file the library will save your data in an in-memory buffer instead of modifying the hard disk right away. Only when the buffer fills up will it "flush" the buffer (write the data to disk), after which point it starts filling the buffer again. This often reduces the number of actual write operations by quite a lot.
To answer your question, the first thing to ask is: do you really need to append to the file immediately every time you find a face? It will probably slow down your processing by a noticeable amount, especially if you're processing a large number of files.
If you really do need to update immediately, you basically have two options:
Manually flush the write buffer each time you write to the file. In Python, this usually means calling f.flush(), as #JamieCounsell pointed out.
Tell Python to just not use a buffer, or more accurately to use a buffer of size 0. As #VikasMadhusudana pointed out, you can tell Python how big of a buffer to use with a third argument to open(): open(txt_output, 'a', 0) for a 0-byte buffer.
Again, you probably don't need this; the only case I can think that might require this sort of thing is if you have some other external operation that's watching the file and triggers off of new data being added to it.
Hope that helps!
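A quick sketch of option 2 (an editorial note, not part of the original answer): in Python 3, a zero-size buffer is only accepted in binary mode, so for a text file the closest equivalent is line buffering, which flushes every time a newline is written.
with open(txt_output, 'a', buffering=1) as f:   # line-buffered text file
    f.write(image_path + '\n')                  # flushed at the newline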
It's flush-related; try:
print(image_path, file=f, flush=True)  # Python 3
or
print >>f, image_path  # Python 2 (follow it with f.flush())
instead of:
f.write(image_path)
f.write('\n')
With flush=True, print flushes the stream for you.
Another good thing about print is that it gives you the newline for free.
Imagine the following simple script:
def reader():
    for line in open('logfile.log'):
        # do some stuff here like splitting the line or filtering etc.
        yield some_new_line

def writer(stream):
    with gzip.GzipFile('some_output_file.gz', 'w') as fh:
        for _s in stream:
            fh.write(_s + '\n')

stream = reader()
writer(stream)
So pretty simple - read lines using generators and write some result into a gzip file.
But how do I speed it up? The HDD seems to be the bottleneck. I saw that I can set a buffer size for reads using the open(file, mode, buffering) syntax, but I'm not quite sure it will help in my case (with generators).
Also, I didn't find any buffering parameter for the gzip.GzipFile call. From the source, it's based on some buffered class, but I don't see any further docs on that.
I have a (crazy?) idea to create an explicit cache and replace the open calls with it, so it reads the file in bigger chunks, say 8 MB, and then splits them into lines. As for writes, I thought of collecting the lines to write in a list (say, 5000 lines) and then dumping them into the file in one go (see the sketch below).
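For illustration only, here is a rough sketch of that idea under assumed sizes (8 MB read buffer, 5000-line write batches); it is not from the original post and is untested against this workload:
import gzip
from itertools import islice

def reader():
    # buffering is the third argument of open(): use a large (8 MB) read buffer
    with open('logfile.log', 'r', buffering=8 * 1024 * 1024) as src:
        for line in src:
            yield line  # real splitting/filtering would go here

def writer(stream):
    # 'wt' opens the gzip stream in text mode so plain strings can be written
    with gzip.open('some_output_file.gz', 'wt') as fh:
        while True:
            batch = list(islice(stream, 5000))  # collect up to 5000 lines
            if not batch:
                break
            fh.writelines(batch)  # dump the whole batch at once

writer(reader())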
Am I trying to re-invent the wheel? I'm not satisfied with the performance the script currently has, so I'm trying to speed it up as much as possible.
Update: I have around 4-5 different parallel workers running, and they all read and write. So I guess the HDD keeps jumping from one sector to another, and this is the reason why I want to implement some buffering and dump the data periodically in big chunks.
Thanks!
I can just propose more compact code:
def reader():
    for line in open('logfile.log'):
        # do some stuff here like splitting the line or filtering etc.
        yield some_new_line

def writer(stream):
    with gzip.GzipFile('some_output_file.gz', 'w') as fh:
        fh.writelines(stream)

writer(reader())
However, there is no real speed-up: Python will manage the streams for you, but if you cannot spare the memory to write the whole file at once, the gain will not be great.
The compression through gzip is the slowest step. The following function gives only a ~3% speed-up (disregarding the generator part).
def writer():
    f = open('logfile.log').read()
    gzip.GzipFile('some_output_file.gz', 'w').write(f)

writer()
So, if you need gzip, then you cannot do much.
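As an editorial aside (not from the original answer), one knob that can help when compression is the bottleneck is the compresslevel argument: a lower level trades compression ratio for speed.
import gzip

def writer(stream):
    # compresslevel ranges from 1 (fastest) to 9 (best compression, slowest);
    # the default is 9, so dropping to 1 speeds up the compression step
    # at the cost of a larger output file
    with gzip.open('some_output_file.gz', 'wt', compresslevel=1) as fh:
        fh.writelines(stream)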