How to limit memory overhead when streaming a file? - python

I'm trying to read a file in smallish chunks with Python to get around the issue of having < 1G of memory to play with. I'm able to write the file to disk and read in chunks, but no matter what I try I always end up getting a MemoryError. I originally didn't have the del/gc stuff, but put that in after reading a bit online.
Can anyone point me in the right direction so I can read this file in chunks (256M-512M) and drop each chunk from memory as soon as I'm done with it, before loading the next one?
with open(path) as in_file:
    current = 0
    total = os.stat(path).st_size
    while current < total:
        in_file.seek(current, 0)
        bytes_read = in_file.read(byte_count)
        # do other things with the bytes here
        in_file.close()
        del in_file
        gc.collect()
        current += byte_count
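For reference, a minimal sketch of the plain chunked-read pattern being asked about, with none of the close/del/gc calls; path, byte_count, and the processing step are placeholders taken from the question:

byte_count = 256 * 1024 * 1024              # 256M per chunk, as in the question

with open(path, 'rb') as in_file:
    while True:
        chunk = in_file.read(byte_count)    # read() returns at most byte_count bytes
        if not chunk:                       # an empty result means end of file
            break
        # process the chunk here; the previous chunk becomes collectable
        # as soon as this name is rebound on the next iteration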

Related

Reading a binary file from memory in chunks of 10 bytes with python

I have a very big .BIN file and I am loading it into the available RAM (128 GB) by using:
ice.Load_data_to_memory("global.bin", True)
(see: https://github.com/iceland2k14/secp256k1)
Now I need to read the content of the file in chunks of 10 bytes, and for that I am using:
with open('global.bin', 'rb') as bf:
    while True:
        data = bf.read(10)
        if not data:      # end of file
            break
        if data == y:
            pass          # do this!
This works well with the rest of the code if the .BIN file is small, but not if the file is big. My suspicion is that, by writing the code this way, I will open the .BIN file twice OR I won't get any result, because with open('global.bin', 'rb') as bf is not "synchronized" with ice.Load_data_to_memory("global.bin", True). Thus, I would like to find a way to read the 10-byte chunks directly from memory, without having to open the file again with "with open('global.bin', 'rb') as bf".
I found a working approach here: LOAD FILE INTO MEMORY
This works well with a small .BIN file containing 3 strings of 10 bytes each:
import mmap

with open('0x4.bin', 'rb') as f:
    # Size 0 will read the ENTIRE file into memory!
    m = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)  # file is mapped read-only
    # Proceed with your code here -- note the file is already in memory,
    # so "readline" here will be as fast as could be
    data = m.read(10)  # using read(10) instead of readline()
    while data:
        # do something with the 10-byte chunk!
        data = m.read(10)
Now the point: when using a much bigger .BIN file, it takes much more time to load the whole file into memory, and the while data: part starts working immediately, so I would need a delay here, so that the script only starts working AFTER the file is completely loaded into memory...
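Setting the ice loading step aside, a minimal sketch of scanning a memory-mapped file in 10-byte steps by slicing rather than read(), assuming the goal is a sequential scan; target_chunk is a placeholder for whatever value is being compared against:

import mmap

CHUNK = 10

with open('global.bin', 'rb') as f:
    m = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)    # map the whole file read-only
    try:
        for offset in range(0, len(m), CHUNK):
            data = m[offset:offset + CHUNK]              # a 10-byte bytes object (shorter at the very end)
            if data == target_chunk:                     # target_chunk is a placeholder
                pass                                     # do this!
    finally:
        m.close()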

Chunking with Python

Hi and happy holidays to everyone!
I have to cope with big csv files (around 5GB each) on a simple laptop, so I am learning to read files in chunks (I am a complete noob at this), using Python 2.7 in particular. I found this very nice example:
# chunked file reading
from __future__ import division
import os

def get_chunks(file_size):
    chunk_start = 0
    chunk_size = 0x20000  # 131072 bytes, default max ssl buffer size
    while chunk_start + chunk_size < file_size:
        yield(chunk_start, chunk_size)
        chunk_start += chunk_size
    final_chunk_size = file_size - chunk_start
    yield(chunk_start, final_chunk_size)

def read_file_chunked(file_path):
    with open(file_path) as file_:
        file_size = os.path.getsize(file_path)
        print('File size: {}'.format(file_size))
        progress = 0
        for chunk_start, chunk_size in get_chunks(file_size):
            file_chunk = file_.read(chunk_size)
            # do something with the chunk, encrypt it, write to another file...
            progress += len(file_chunk)
            print('{0} of {1} bytes read ({2}%)'.format(
                progress, file_size, int(progress / file_size * 100))
            )

if __name__ == '__main__':
    read_file_chunked('some-file.gif')
(source: https://gist.github.com/richardasaurus/21d4b970a202d2fffa9c)
but something still is not very clear to me. For example, let's say that I write a piece of code and I want to test it on a small fraction of my dataset, just to check that it runs properly. How could I read, let's say, only the first 10% of my csv file and run my code on that chunk, without having to store the rest of the dataset in memory?
I appreciate any hint - even some reading or an external reference is good, if related to chunking files with Python. Thank you!
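One possible sketch, for reference: reuse the get_chunks() generator from the snippet above and simply stop once roughly 10% of the file's bytes have been read. This is byte-based, so the last chunk may end mid-row; the 10% threshold is just an example:

def read_first_fraction(file_path, fraction=0.10):
    file_size = os.path.getsize(file_path)
    limit = int(file_size * fraction)            # stop after roughly 10% of the bytes
    with open(file_path) as file_:
        progress = 0
        for chunk_start, chunk_size in get_chunks(file_size):
            file_chunk = file_.read(chunk_size)
            # run the code being tested on file_chunk here
            progress += len(file_chunk)
            if progress >= limit:
                break                            # the remaining ~90% is never read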
Let's consider the following CSV file. If you open it with Notepad or any simple text editor, you see this:
CU-C2376;Airbus A380;50.00;259.00
J2-THZ;Boeing 737;233.00;213.00
SU-XBG;Embraer ERJ-195;356.00;189.00
TI-GGH;Boeing 737;39.00;277.00
HK-6754J;Airbus A380;92.00;93.00
6Y-VBU;Embraer ERJ-195;215.00;340.00
9N-ABU;Embraer ERJ-195;151.00;66.00
YV-HUI;Airbus A380;337.00;77.00
If you observe carefully, each line corresponds to one row and the values are separated by a ";".
Let's say I want to read only the first three rows:
with open('data.csv') as f:
    lines = list()
    for i in range(3):
        lines.append(f.readline())
    # Do some stuff with the first three lines
This is a better way of reading a chunk of a CSV file: if the file is, say, 10MB and you read the first 3MB by byte count, the last bytes you read may fall in the middle of a row and not represent a complete record.
Alternatively, you can use a library like pandas.
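For example, pandas can iterate over a CSV in fixed-size row chunks via the chunksize parameter of read_csv, which keeps only one chunk in memory at a time; the processing function here is a placeholder:

import pandas as pd

# each `chunk` is an ordinary DataFrame of at most 100000 rows
for chunk in pd.read_csv('data.csv', sep=';', header=None, chunksize=100000):
    process(chunk)   # process() stands in for your own code

While testing, you can simply break out of the loop after the first chunk, so the rest of the file is never read.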

Stop Python Script from Writing to File after it reaches a certain size in linux

Somewhat new to Python and new to Linux. I created a script that mines Twitter's streaming API. The script writes to a .csv file when things in the stream match my parameters.
I'd like to know if there's any way to stop my script once the file has reached 1 gig. I know cron can be used to time the script and everything, but I'm more concerned about the file size than the time it takes.
Thanks for your input and consideration.
In your case, you probably don't need os.stat, and os.stat may give you a false size in some cases (namely, when buffers haven't been flushed). Why not just use f.tell() to read the size, with something like this:
import csv

with open('out.txt', 'w', encoding='utf-8') as f:
    csvfile = csv.writer(f)
    maxsize = 1024                 # max file size in bytes
    for row in data():             # data() stands in for whatever yields your rows
        csvfile.writerow(row)
        if f.tell() > maxsize:     # f.tell() gives byte offset, no need to worry about multiwide chars
            break
Use Python's os.stat() to get info on the file, then check the total number of bytes of the existing file (fileInfo.st_size) plus the size of the data you are about to write.
import os
fileInfo = os.stat('twitter_stream.csv')
fileSize = fileInfo.st_size
print(fileSize)
# Now get data from twitter
# determine number of bytes in data
# write data if file size + data bytes < 1GB
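A hedged sketch of the check described in those trailing comments; stream_of_rows() is a hypothetical stand-in for the code that yields matching tweets, and the per-row size is only a rough estimate:

import csv
import os

MAX_BYTES = 1024 ** 3                        # 1 GB cap

with open('twitter_stream.csv', 'a', newline='') as f:
    writer = csv.writer(f)
    for row in stream_of_rows():             # hypothetical source of rows matching your parameters
        row_bytes = len(','.join(str(v) for v in row)) + 1   # rough size of the line about to be written
        f.flush()                            # flush so os.stat reflects what has already been written
        if os.stat('twitter_stream.csv').st_size + row_bytes >= MAX_BYTES:
            break                            # stop before the file crosses 1 GB
        writer.writerow(row)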

How to make a buffered writer?

So recently I took on, as a personal project, making my very own DB in Python, mainly because I hate messing around with most DBs and I needed something easy to set up, portable, and simple for studying large data sets.
I now find myself stuck on a problem: an efficient way to delete a line from the DB file (which is really just a text file). The way I found to do it is to write all of the content that comes after the line over the line itself, and then truncate the file (I'll take suggestions on better ways to do it). The problem arises when I need to write that trailing content, because doing it all at once could load millions of lines into RAM. The code follows:
ln = 11  # Line to be deleted
with open("test.txt", "r+") as f:
    readlinef = f.readline
    for i in xrange(ln):
        line = readlinef()
    length, start = (len(line), f.tell() - len(line))
    f.seek(0, 2)
    chunk = f.tell() - (start + length)   # size of everything after the deleted line
    f.seek(start + length, 0)
    # How to make this buffered?
    data = f.read(chunk)
    f.seek(start, 0)
    f.write(data)
    f.truncate()
Right now that's reading all of the data at once; how would I make that last code block work in a buffered fashion? The start position would shift every time a new chunk of data is written before it, and I was wondering what the most efficient and fastest (execution-time-wise) way to do this would be.
Thanks in advance.
Edit:
I've decided to follow the advice submitted here, but just for curiosity's sake I found a way to read and write in chunks. It follows:
with open("test.txt", "r+") as f:
readlinef = f.readline
for i in xrange(ln):
line = readlinef()
start, length = (f.tell()-len(line), len(line))
readf = f.read
BUFFER_SIZE = 1024 * 1024
x = 0
chunk = readf(BUFFER_SIZE)
while chunk:
f.seek(start, 0)
f.write(chunk)
start += BUFFER_SIZE
f.seek(start+length+(x*BUFFER_SIZE), 0)
chunk = readf(BUFFER_SIZE)
f.truncate()
Answering your question "How would I do that?" concerning indices and vacuum.
Disclaimer: This is a very simple example that in no way compares to an existing DBMS, and I strongly advise against it in practice.
Basic idea:
For each table in your DB, keep various files, some for your object ids (row ids, record ids) and some (page files) with the actual data. Let's suppose that each record is of variable length.
Each record has a table-unique OID. These are stored in the oid-files. Let's name the table "test" and the oid files "test.oidX". Each record in the oid file is of fixed length and each oid file is of fixed length.
Now if "test.oid1" reads:
0001:0001:0001:0015 #oid:pagefile:position:length
0002:0001:0016:0100
0004:0002:0001:0001
It means that record 1 is in page file 1, at position 1 and has length 15. Record 2 is in page file 1 at position 16 of length 100, etc.
Now when you want to delete a record, just touch the oid file. E.g. for deleting record 2, edit it to:
0001:0001:0001:0015
0000:0001:0016:0100 #0000 indicating empty cell
0004:0002:0001:0001
And don't even bother touching your page files.
This will create holes in your page files. Now you need to implement some "maintenance" routine which moves blocks in your page files around, etc, which could either run when requested by the user, or automatically when your DBMS has nothing else to do. Depending on which locking strategy you use, you might need to lock the concerned records or the whole table.
Also when you insert a new record, and you find a hole big enough, you can insert it there.
If your oid-files should also function as an index (slow inserts, fast queries), you will need to rebuild it (surely on insertion, maybe on deletion).
Operations on oid-files should be fast, as the files are of fixed length and contain fixed-length records.
This is just the very basic idea, not touching topics like search trees, hashing, etc, etc.
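A minimal sketch of the oid-file idea described above, with made-up fixed-width fields matching the "test.oid1" example; the record layout, file names, and the 0000 tombstone are all assumptions taken from this answer, not a real DBMS format:

RECORD_FMT = "{:04d}:{:04d}:{:04d}:{:04d}\n"      # oid:pagefile:position:length
RECORD_SIZE = len(RECORD_FMT.format(0, 0, 0, 0))  # every record is the same length, so slots are seekable

def mark_deleted(oid_path, slot):
    """Tombstone the record in the given slot by zeroing its oid field."""
    with open(oid_path, "r+", newline="") as f:
        f.seek(slot * RECORD_SIZE)
        fields = f.read(RECORD_SIZE).split(":")
        fields[0] = "0000"                         # 0000 indicates an empty cell
        f.seek(slot * RECORD_SIZE)
        f.write(":".join(fields))                  # the page files are never touched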
You can do this the same way that (effectively) memmove works: seek back and forth between the source range and the destination range:
count = (size + chunksize - 1) // chunksize   # number of chunks covering the tail (ceiling division)
for chunk in range(count):
    f.seek(start + chunk * chunksize + deleted_line_size, 0)   # source: just past the deleted line
    buf = f.read(chunksize)
    f.seek(start + chunk * chunksize, 0)                       # destination: where the deleted line began
    f.write(buf)
Using a temporary file and shutil makes it a lot simpler, and, despite what you'd expect, it may actually be faster. (There's twice as much writing, but a whole lot less seeking, and mostly block-aligned writing.) For example:
import shutil
import tempfile

with tempfile.TemporaryFile('w+') as ftemp:
    f.seek(start + deleted_line_size, 0)   # jump past the deleted line
    shutil.copyfileobj(f, ftemp)           # stash the tail of the file in the temp file
    ftemp.seek(0, 0)
    f.seek(start, 0)
    shutil.copyfileobj(ftemp, f)           # copy the tail back, starting where the deleted line was
    f.truncate()
However, if your files fit in your virtual memory space (which they probably do in 64-bit land, but may not in 32-bit land), it may be simpler to just mmap the file and let the OS/libc take care of the work:
import mmap

m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_WRITE)               # map the whole file writable
m[start:end - deleted_line_size] = m[start + deleted_line_size:end]  # slide the tail down over the deleted line
m.close()
f.seek(end - deleted_line_size)
f.truncate()

Download a file part by part in Python 3

I'm using Python 3 to download a file:
local_file = open(file_name, "w" + file_mode)
local_file.write(f.read())
local_file.close()
This code works, but it copies the whole file into memory first. This is a problem with very big files because my program becomes memory-hungry (going from 17M of memory to 240M for a 200 MB file).
I would like to know if there is a way in Python to download a small part of a file (packet), write it to file, erase it from memory, and keep repeating the process until the file is completely downloaded.
Try using the method described here:
Lazy Method for Reading Big File in Python?
I am specifically referring to the accepted answer. Let me also copy it here for clarity.
def read_in_chunks(file_object, chunk_size=1024):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1k."""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

f = open('really_big_file.dat')
for piece in read_in_chunks(f):
    process_data(piece)
This will likely be adaptable to your needs: it reads the file in smaller chunks, allowing for processing without filling your entire memory. Come back if you have any further questions.
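For the download case specifically, the same generator can be pointed at the HTTP response object, since response objects returned by urllib.request.urlopen support read(n). A minimal sketch, with the URL and local file name as placeholders:

import urllib.request

url = "http://example.com/big_file.bin"          # placeholder URL
with urllib.request.urlopen(url) as response, open("big_file.bin", "wb") as local_file:
    for piece in read_in_chunks(response, chunk_size=64 * 1024):
        local_file.write(piece)   # each piece is written out and then discarded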
