Chunking with Python - python

Hi and happy holidays to everyone!
I have to cope with big CSV files (around 5 GB each) on a simple laptop, so I am learning to read files in chunks (I am a complete noob at this), using Python 2.7 in particular. I found this very nice example:
# chunked file reading
from __future__ import division
import os

def get_chunks(file_size):
    chunk_start = 0
    chunk_size = 0x20000  # 131072 bytes, default max ssl buffer size
    while chunk_start + chunk_size < file_size:
        yield(chunk_start, chunk_size)
        chunk_start += chunk_size
    final_chunk_size = file_size - chunk_start
    yield(chunk_start, final_chunk_size)

def read_file_chunked(file_path):
    with open(file_path) as file_:
        file_size = os.path.getsize(file_path)
        print('File size: {}'.format(file_size))
        progress = 0
        for chunk_start, chunk_size in get_chunks(file_size):
            file_chunk = file_.read(chunk_size)
            # do something with the chunk, encrypt it, write to another file...
            progress += len(file_chunk)
            print('{0} of {1} bytes read ({2}%)'.format(
                progress, file_size, int(progress / file_size * 100))
            )

if __name__ == '__main__':
    read_file_chunked('some-file.gif')
(source: https://gist.github.com/richardasaurus/21d4b970a202d2fffa9c)
but something is still not very clear to me. For example, let's say that I write a piece of code and I want to test it on a small fraction of my dataset, just to check that it runs properly. How could I read, say, only the first 10% of my CSV file and run my code on that chunk without having to store the rest of the dataset in memory?
I appreciate any hint - even some reading or external reference is good, if related to chunking files with python. Thank you!

Let's consider the following CSV file. If you open it with Notepad or any plain text editor, you will see this:
CU-C2376;Airbus A380;50.00;259.00
J2-THZ;Boeing 737;233.00;213.00
SU-XBG;Embraer ERJ-195;356.00;189.00
TI-GGH;Boeing 737;39.00;277.00
HK-6754J;Airbus A380;92.00;93.00
6Y-VBU;Embraer ERJ-195;215.00;340.00
9N-ABU;Embraer ERJ-195;151.00;66.00
YV-HUI;Airbus A380;337.00;77.00
If you observe carefully, each line corresponds to one row and the values are separated with a ";".
Let's say I want to read only the first three rows, then:
with open('data.csv') as f:
    lines = list()
    for i in range(3):
        lines.append(f.readline())
    # Do some stuff with the first three lines
This is a better way of reading a chunk of the file than reading a fixed number of bytes: if the file is, say, 10 MB and you read just the first 3 MB, the chunk will almost certainly end in the middle of a row, so the last bytes you read may not represent anything on their own.
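If you do want to read roughly a fixed fraction of the file (for example the first 10% asked about in the question) while still stopping at a line boundary, a minimal sketch could look like this; read_first_fraction and its 10% default are illustrative names, not part of the answer above:

import os

def read_first_fraction(file_path, fraction=0.10):
    # Target roughly the requested fraction of the file, in bytes.
    target = int(os.path.getsize(file_path) * fraction)
    lines = []
    read_so_far = 0
    with open(file_path) as f:
        for line in f:                # the file is read lazily, line by line
            lines.append(line)
            read_so_far += len(line)
            if read_so_far >= target:
                break                 # stop after a complete line, never mid-row
    return lines

first_tenth = read_first_fraction('data.csv')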
Alternatively, you can use a library like pandas, as sketched below.
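For example, pandas can read just the first N rows, or iterate over the file in row-based chunks; this is a rough sketch assuming the semicolon-separated file shown above with no header row:

import pandas as pd

# Read only the first 1000 rows (the number here is arbitrary).
head = pd.read_csv('data.csv', sep=';', header=None, nrows=1000)
print(head.shape)

# Or iterate over the whole file in chunks of 1000 rows, without
# loading everything into memory at once.
for chunk in pd.read_csv('data.csv', sep=';', header=None, chunksize=1000):
    print(len(chunk))  # replace with whatever processing you need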

Related

How to limit memory overhead when streaming a file?

I'm trying to read a file in smallish chunks with Python to get around the issue of having < 1G of memory to play with. I'm able to write the file to disk and read in chunks, but no matter what I try I always end up getting a MemoryError. I originally didn't have the del/gc stuff, but put that in after reading a bit online.
Can anyone help point me in the right direction in a way to read this file in chunks (256M-512M) and dump the chunk out of memory as soon as it's done and before loading the next one?
with open(path) as in_file:
    current = 0
    total = os.stat(path).st_size
    while current < total:
        in_file.seek(current, 0)
        bytes_read = in_file.read(byte_count)
        # do other things with the bytes here
        in_file.close()
        del in_file
        gc.collect()
        current += byte_count
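For reference, a common pattern for this kind of loop is to rebind the chunk variable on each iteration rather than closing or deleting the file object inside the loop; the previous chunk then becomes unreferenced and can be freed. A minimal sketch, assuming path is defined as in the question and using an arbitrary 256 MB chunk size:

chunk_size = 256 * 1024 * 1024  # 256 MB per read

with open(path, 'rb') as in_file:
    while True:
        chunk = in_file.read(chunk_size)
        if not chunk:        # an empty bytes object means end of file
            break
        # do other things with the chunk here; on the next iteration the
        # name `chunk` is rebound, so the old data can be garbage collected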

How can I truncate an mp3 audio file by 30%?

I am trying to truncate an audio file by 30%: if the audio file was 4 minutes long, after truncating it, it should be around 72 seconds. I have written the code below to do it, but it only produces a 0-byte file. Please tell me where I went wrong.
def loadFile():
    with open('music.mp3', 'rb') as in_file:
        data = len(in_file.read())
        with open('output.mp3', 'wb') as out_file:
            ndata = newBytes(data)
            out_file.write(in_file.read()[:ndata])

def newBytes(bytes):
    newLength = (bytes/100) * 30
    return int(newLength)

loadFile()
You are trying to read your file a second time, which will return no data: the first call, len(in_file.read()), has already consumed the whole file. Instead, read the whole file into a variable and then calculate the length of that. The variable can then be used a second time.
def newBytes(bytes):
    # Keep 70% of the bytes; integer division avoids floats.
    return (bytes * 70) // 100

def loadFile():
    with open('music.mp3', 'rb') as in_file:
        data = in_file.read()              # read the file once, keep the bytes
        with open('output.mp3', 'wb') as out_file:
            ndata = newBytes(len(data))    # 70% of the original size
            out_file.write(data[:ndata])
Also it is better to multiply first and then divide to avoid having to work with floating point numbers.
You cannot reliably truncate an MP3 file by byte size and expect it to be equivalently truncated in audio time length.
MP3 frames can change bitrate. While your method will sort of work, it won't be all that accurate. Additionally, you'll undoubtedly break frames, leaving glitches at the end of the file. You will also lose ID3v1 tags (if you still use them... better to use ID3v2 anyway).
Consider executing FFmpeg with -acodec copy instead. This will simply copy the bytes over while maintaining the integrity of the file, and ensuring a good clean cut where you want it to be.
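As a rough illustration of that suggestion (the file names and the 72-second cut point are just placeholders), FFmpeg can be driven from Python like this:

import subprocess

keep_seconds = 72  # however much of the audio you want to keep

# -acodec copy copies the MP3 frames as-is (no re-encoding), and -t stops
# the output after keep_seconds, so the cut lands cleanly on a frame boundary.
subprocess.check_call([
    'ffmpeg', '-i', 'music.mp3',
    '-t', str(keep_seconds),
    '-acodec', 'copy',
    'output.mp3',
])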

Stop Python Script from Writing to File after it reaches a certain size in linux

Somewhat new to Python and new to Linux. I created a script that mines Twitter's streaming API. The script writes to a .csv file when things in the stream match my parameters.
I'd like to know if there's any way to stop my script once the file has reached 1 gig. I know cron can be used to time the script and everything, but I'm more concerned about the file size than the time it takes.
Thanks for your input and consideration.
In your case, you probably don't need os.stat, and os.stat may give you a false size in some cases (namely, buffers not being flushed). Why not just use f.tell() to read the size, with something like this:
import csv

with open('out.txt', 'w', encoding='utf-8') as f:
    csvfile = csv.writer(f)
    maxsize = 1024  # max file size in bytes
    for row in data():
        csvfile.writerow(row)
        if f.tell() > maxsize:  # f.tell() gives the byte offset, no need to worry about multiwide chars
            break
Use Python's os.stat() to get info on the file, then check the total number of bytes of the existing file (fileInfo.st_size) plus the size of the data you are about to write.
import os

fileInfo = os.stat('twitter_stream.csv')
fileSize = fileInfo.st_size
print(fileSize)

# Now get data from twitter
# determine number of bytes in data
# write data if file size + data bytes < 1GB
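A hypothetical sketch of the check described in those comments, building on the fileSize variable above (the 1 GB limit and the data_bytes placeholder are illustrative, not part of the answer):

MAX_BYTES = 1024 ** 3  # 1 GB limit

# data_bytes stands in for the next CSV line your stream handler produced.
data_bytes = 'example;row;of;data\n'

if fileSize + len(data_bytes) < MAX_BYTES:
    with open('twitter_stream.csv', 'a') as f:
        f.write(data_bytes)
else:
    print('Size limit reached, stopping.')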

How to make a buffered writer?

So recently I took on a personal project to make my very own DB in Python, mainly because I hate messing around with most DBs and I needed something easy to set up, portable, and simple to use for studying large data sets.
I now find myself stuck on a problem: an efficient way to delete a line from the DB file (which is really just a text file). The way I found to do it is to write all of the content that comes after the line back over it (shifting everything up), and then truncate the file (I'll take suggestions on better ways to do it). The problem arises when I need to write that trailing content, because doing it all at once could load millions of lines into RAM at once. The code follows:
ln = 11  # Line to be deleted
with open("test.txt", "r+") as f:
    readlinef = f.readline
    for i in xrange(ln):
        line = readlinef()
    length, start = (len(line), f.tell()-len(line))
    f.seek(0, 2)
    chunk = f.tell() - start+length
    f.seek(start+length, 0)
    # How to make this buffered?
    data = f.read(chunk)
    f.seek(start, 0)
    f.write(data)
    f.truncate()
Right now that's reading all of the data at once; how would I make that last code block work in a buffered fashion? The start position would shift every time a new chunk of data is written before it, and I was wondering what would be the most efficient and fastest (execution-time wise) way to do this.
Thanks in advance.
Edit:
I've decided to follow the advice submitted here, but just for curiosity's sake I found a way to read and write in chunks. It follows:
with open("test.txt", "r+") as f:
readlinef = f.readline
for i in xrange(ln):
line = readlinef()
start, length = (f.tell()-len(line), len(line))
readf = f.read
BUFFER_SIZE = 1024 * 1024
x = 0
chunk = readf(BUFFER_SIZE)
while chunk:
f.seek(start, 0)
f.write(chunk)
start += BUFFER_SIZE
f.seek(start+length+(x*BUFFER_SIZE), 0)
chunk = readf(BUFFER_SIZE)
f.truncate()
Answering your question "How would I do that?" concerning indices and vacuum.
Disclaimer: This is a very simple example that in no way compares to existing DBMSs, and I strongly advise against using it.
Basic idea:
For each table in your DB, keep various files, some for your object ids (row ids, record ids) and some (page files) with the actual data. Let's suppose that each record is of variable length.
Each record has a table-unique OID. These are stored in the oid-files. Let's name the table "test" and the oid files "test.oidX". Each record in the oid file is of fixed length and each oid file is of fixed length.
Now if "test.oid1" reads:
0001:0001:0001:0015 #oid:pagefile:position:length
0002:0001:0016:0100
0004:0002:0001:0001
It means that record 1 is in page file 1, at position 1 and has length 15. Record 2 is in page file 1 at position 16 of length 100, etc.
Now when you want to delete a record, just touch the oid file. E.g. for deleting record 2, edit it to:
0001:0001:0001:0015
0000:0001:0016:0100 #0000 indicating empty cell
0004:0002:0001:0001
And don't even bother touching your page files.
This will create holes in your page files. Now you need to implement some "maintenance" routine which moves blocks in your page files around, etc, which could either run when requested by the user, or automatically when your DBMS has nothing else to do. Depending on which locking strategy you use, you might need to lock the concerned records or the whole table.
Also when you insert a new record, and you find a hole big enough, you can insert it there.
If your oid-files should also function as an index (slow inserts, fast queries), you will need to rebuild it (surely on insertion, maybe on deletion).
Operations on oid-files should be fast, as the files are of fixed length and contain fixed-length records.
This is just the very basic idea, not touching topics like search trees, hashing, etc, etc.
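A minimal sketch of the oid-file idea (the binary record layout, the little-endian packing, and the function names are assumptions made here for illustration; the example above uses zero-padded decimal text fields instead):

import struct

# Hypothetical fixed-length oid record: oid, pagefile, position, length,
# each stored as a 4-byte unsigned integer.
OID_RECORD = struct.Struct('<IIII')

def read_oid_record(oid_file, record_no):
    # Fixed-length records make random access a simple seek + read.
    oid_file.seek(record_no * OID_RECORD.size)
    return OID_RECORD.unpack(oid_file.read(OID_RECORD.size))

def mark_deleted(oid_file, record_no):
    # "Deleting" only touches the oid file: the oid field is set to 0 to
    # mark the cell as empty; the page file is left alone.
    oid, pagefile, position, length = read_oid_record(oid_file, record_no)
    oid_file.seek(record_no * OID_RECORD.size)
    oid_file.write(OID_RECORD.pack(0, pagefile, position, length))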
You can do this the same way that (effectively) memmove works: seek back and forth between the source range and the destination range:
count = (size + chunksize - 1) // chunksize
for chunk in range(count):
    f.seek(start + chunk * chunksize + deleted_line_size, 0)
    buf = f.read(chunksize)
    f.seek(start + chunk * chunksize, 0)
    f.write(buf)
Using a temporary file and shutil makes it a lot simpler, and, despite what you might expect, it may actually be faster. (There's twice as much writing, but a whole lot less seeking, and mostly block-aligned writing.) For example:
with tempfile.TemporaryFile('w+') as ftemp:
    # f is positioned just past the deleted line: copy the rest into ftemp...
    shutil.copyfileobj(f, ftemp)
    ftemp.seek(0, 0)
    f.seek(start, 0)
    # ...then copy it back, starting where the deleted line began.
    shutil.copyfileobj(ftemp, f)
    f.truncate()
However, if your files are small enough to fit in your virtual memory space (which they probably are in 64-bit land, but may not be in 32-bit land), it may be simpler to just mmap the file and let the OS/libc take care of the work:
import mmap

m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_WRITE)  # length 0 maps the whole file
# Shift everything after the deleted line back over it, memmove-style.
m[start:end-deleted_line_size] = m[start+deleted_line_size:end]
m.close()
f.seek(end - deleted_line_size)
f.truncate()

Writing binary data to middle of a sparse file

I need to compile a binary file in pieces, with pieces arriving in random order (yes, it's a P2P project).
def write(filename, offset, data):
    file = open(filename, "ab")
    file.seek(offset)
    file.write(data)
    file.close()
Say I have a 32KB write(f, o, d) at offset 1MB into the file, and then another 32KB write(f, o, d) at offset 0.
I end up with a file 65KB in length (i.e. the gap consisting of 0s between 32KB and 1MB is truncated/disappears).
I am aware this may appear an incredibly stupid question, but I cannot seem to figure it out from the file.open(..) modes
Advice gratefully received.
*** UPDATE
My method to write P2P pieces ended up as follows (for those who may glean some value from it):
def writePiece(self, filename, pieceindex, bytes, ipsrc, ipdst, ts):
    file = open(filename, "r+b")
    if not self.piecemap[ipdst].has_key(pieceindex):
        little = struct.pack('<'+'B'*len(bytes), *bytes)
        # Seek to offset based on piece index
        file.seek(pieceindex * self.piecesize)
        file.write(little)
        file.flush()
        self.procLog.info("Wrote (%d) bytes of piece (%d) to %s" % (len(bytes), pieceindex, filename))
        # Remember we have this piece now in case duplicates arrive
        self.piecemap[ipdst][pieceindex] = True
    file.close()
Note: I also addressed some endianness issues, which plagued me for a while, using struct.pack.
For anyone wondering, the project I am working on is to analyse BT messages captured directly off the wire.
>>> import os
>>> filename = 'tempfile'
>>> def write(filename,data,offset):
...     try:
...         f = open(filename,'r+b')
...     except IOError:
...         f = open(filename,'wb')
...     f.seek(offset)
...     f.write(data)
...     f.close()
...
>>> write(filename,'1' * (1024*32),1024*1024)
>>> write(filename,'1' * (1024*32),0)
>>> os.path.getsize(filename)
1081344
You opened the file in append ("a") mode. All writes are going to the end of the file, irrespective of the calls to seek().
Try using 'r+b' rather than 'ab'.
It seems to me like there's not a lot of point in trying to assemble the file until all the pieces of it are there. Why not keep the pieces separate until all are present, then write them to the final file in order? That's what most P2P apps do, AFAIK.
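A minimal sketch of that approach (the per-piece file naming scheme is only an assumption): keep each piece in its own file as it arrives, then concatenate them in index order once all of them are present.

import shutil

def assemble(piece_count, out_path='assembled.bin'):
    # Concatenate piece_000000.part, piece_000001.part, ... in order.
    with open(out_path, 'wb') as out_file:
        for i in range(piece_count):
            with open('piece_%06d.part' % i, 'rb') as piece:
                shutil.copyfileobj(piece, out_file)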
