My goal is to feed large datasets to TensorFlow. I arrived at the implementation below. However, while HDF5 I/O is supposed to be very fast, my implementation is slow. Is this because I am not using the chunks option? I can't seem to get the dimensions right for the chunks; should I treat the chunk size as a third dimension, e.g. (4096, 7, 1000) for a chunk size of 1000?
Please note, I could have simplified my code below by finding a solution for a single generator. However, I think the data/label combination is very common and useful for others.
I use the following function to create two generators, one for the data and one for the corresponding labels.
def read_chunks(file, dim, batch_size=batch_size):
    chunk = np.empty(dim,)
    current_size = 1
    # read input file line by line
    for line in file:
        current_size += 1
        # build chunk
        chunk = np.vstack((chunk, np.genfromtxt(io.BytesIO(line.encode()))))
        # reaches batch size
        if current_size == batch_size:
            yield chunk
            # reset counters
            current_size = 1
            chunk = np.empty(dim,)
Then I want to move the data and labels produced by these generators to HDF5.
def write_h5(data_gen, label_gen, out_file, batch_size, h5_batch_size, data_dtype, label_dtype):
    # remove existing file
    if os.path.isfile(out_file):
        os.remove(out_file)
    with h5py.File(out_file, 'a') as f:
        # create a dataset and labelset in the same file
        d = f.create_dataset('data', (batch_size, data_dim), maxshape=(None, data_dim), dtype=data_dtype)
        l = f.create_dataset('label', (batch_size, label_dim), maxshape=(None, label_dim), dtype=label_dtype)
        # use generators to fill both sets
        for data in data_gen:
            d.resize(d.shape[0] + batch_size, axis=0)
            d[-batch_size:] = data
            l.resize(l.shape[0] + batch_size, axis=0)
            l[-batch_size:] = next(label_gen)
With the following constants I combined both functions like so:
batch_size = 4096
h5_batch_size = 1000
data_dim = 7 #[NUM_POINT, 9]
label_dim = 1 #[NUM_POINT]
data_dtype = 'float32'
label_dtype = 'uint8'
for data_file, label_file in data_label_files:
    print(data_file)
    with open(data_file, 'r') as data_f, open(label_file, 'r') as label_f:
        data_gen = read_chunks(data_f, dim=data_dim)
        label_gen = read_chunks(label_f, dim=label_dim)
        out_file = data_file[:-4] + '.h5'
        write_h5(data_gen, label_gen, out_file, batch_size, h5_batch_size, data_dtype, label_dtype)
The problem is not that HDF5 is slow. The problem is that you are reading a single line at a time with a Python loop, calling genfromtxt() once per line! That function is meant to read entire files. On top of that you use the anti-pattern array = np.vstack((array, newstuff)) inside the same loop.
In short, your performance problem starts here:
chunk = np.vstack((chunk, np.genfromtxt(io.BytesIO(line.encode()))))
You should just read the entire file at once. If you can't do that, read half of it (you can set a max number of lines to read each time, such as 1 million).
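A minimal sketch of that idea, assuming the same whitespace-delimited text files as in the question (read_blocks and max_lines are illustrative names, not a drop-in replacement for read_chunks):
import io
import itertools
import numpy as np

def read_blocks(file, max_lines=1_000_000):
    """Yield one 2-D float32 array per block of up to max_lines text lines."""
    while True:
        lines = list(itertools.islice(file, max_lines))
        if not lines:
            break
        # one parsing call per block instead of one genfromtxt + vstack per line
        yield np.loadtxt(io.StringIO(''.join(lines)), dtype='float32', ndmin=2)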
Related
Assume I am processing a very large text file.
I have the following pseudocode
xx_valueList = []
lines = []
for line in file:
    xx_value = calc_xxValue(line)
    xx_valueList.append(xx_value)
    lines.append(line)
# get_quantile_value is a function that returns the cutoff value at a specific quantile percent
cut_offvalue = get_quantile_value(xx_valueList, percent=0.05)
for line in lines:
    if calc_xxValue(line) > cut_offvalue:
        # do something here
Note that the file is very large and may come from a pipe, so I don't want to read it twice.
We must read the entire file before we can compute the cutoff used to filter it.
The above method works, but it consumes too much memory. Is there some algorithmic optimization that can improve efficiency and reduce memory consumption?
One option is to stream the file once, periodically recompute the cutoff from the values seen so far, and filter each line against the current estimate:
xx_value_list = []
cut_offvalue = 0
with open(file, 'r') as f:
    for line in f:
        xx_value = calc_xxValue(line)
        xx_value_list.append(xx_value)
        if len(xx_value_list) % 100 == 0:
            cut_offvalue = get_quantile_value(xx_value_list, percent=0.05)
        if xx_value < cut_offvalue:
            # do something here
            pass
Trying to tokenize and encode data to feed to a neural network.
I only have 25 GB of RAM, and every time I try to run the code below my Google Colab session crashes. Any idea how to prevent this from happening? "Your session crashed after using all available RAM"
I thought tokenizing/encoding chunks of 50000 sentences would work, but unfortunately it does not.
The code works on a dataset of length 1.3 million. The current dataset has a length of 5 million.
max_q_len = 128
max_a_len = 64
trainq_list = train_q.tolist()
batch_size = 50000

def batch_encode(text, max_seq_len):
    for i in range(0, len(trainq_list), batch_size):
        encoded_sent = tokenizer.batch_encode_plus(
            text,
            max_length=max_seq_len,
            pad_to_max_length=True,
            truncation=True,
            return_token_type_ids=False
        )
    return encoded_sent

# tokenize and encode sequences in the training set
tokensq_train = batch_encode(trainq_list, max_q_len)
The tokenizer comes from HuggingFace:
tokenizer = BertTokenizerFast.from_pretrained('bert-base-multilingual-uncased')
You should use generators and pass data to tokenizer.batch_encode_plus, no matter the size.
Conceptually, something like this:
Training list
This one probably holds list of sentences, which is read from some file(s). If this is a single large file, you could follow this answer to lazily read parts of the input (preferably of batch_size lines at once):
def read_in_chunks(file_object, chunk_size=1024):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1k."""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data
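Since the parenthetical above suggests reading batch_size lines at a time, here is a hedged line-based variant (read_in_line_chunks is a name I made up for this sketch, not something from the answer):
def read_in_line_chunks(file_object, batch_size=50000):
    """Lazily yield lists of up to batch_size lines, ready for the tokenizer."""
    batch = []
    for line in file_object:
        batch.append(line.rstrip("\n"))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # last, possibly smaller, batch
        yield batch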
Otherwise, split your data across multiple smaller files (each much smaller than memory, because the data will be far larger after encoding with BERT) and read them lazily, something like this:
import pathlib

def read_in_chunks(directory: pathlib.Path):
    # Use "*.txt" or any other extension your file might have
    for file in directory.glob("*"):
        with open(file, "r") as f:
            yield f.readlines()
Encoding
The encoder should take this generator and yield back encoded parts, something like this:
# Generator should create lists useful for encoding
def batch_encode(generator, max_seq_len):
    tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-uncased")
    for text in generator:
        yield tokenizer.batch_encode_plus(
            text,
            max_length=max_seq_len,
            pad_to_max_length=True,
            truncation=True,
            return_token_type_ids=False,
        )
Saving encoded files
As the files will be too large to fit in RAM, you should save them to disk (or use them as they are generated).
Something along those lines:
import numpy as np

# I assume np.arrays are created; adjust to PyTorch tensors or anything else if needed
def save(encoding_generator):
    for i, encoded in enumerate(encoding_generator):
        np.save(str(i), encoded)
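To tie the pieces above together, a hedged end-to-end sketch (the "data" directory name and the max_seq_len value are assumptions, not something given in the question):
import pathlib

sentences = read_in_chunks(pathlib.Path("data"))    # yields one list of sentences per file
encoded = batch_encode(sentences, max_seq_len=128)  # yields one encoded batch per list
save(encoded)                                       # writes 0.npy, 1.npy, ... to the working directory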
Following suggestions on an SO post, I also found PyTables append to be exceptionally time efficient. However, in my case the output file (earray.h5) is huge. Is there a way to append the data such that the output file is not as huge? For example, in my case (see link below) a 13 GB input file (dset_1: 2.1E8 x 4 and dset_2: 2.1E8 x 4) gives a 197 GB output file with just one column (2.5E10 x 1). All elements are float64.
I want to reduce the output file size such that the execution speed of the script is not compromised and reading the output file remains efficient for later use. Can saving the data along columns, and not just rows, help? Any suggestions on this? Given below is an MWE.
Output and input files' details here
# no. of chunks from dset-1 and dset-2 in inp.h5
loop_1 = 40
loop_2 = 20
# save to disk after these many rows
app_len = 10**6
# **********************************************
# Grabbing input.h5 file
# **********************************************
filename = 'inp.h5'
f2 = h5py.File(filename, 'r')
chunks1 = f2['dset_1']
chunks2 = f2['dset_2']
shape1, shape2 = chunks1.shape[0], chunks2.shape[0]
f1 = tables.open_file("table.h5", "w")
a = f1.create_earray(f1.root, "dataset_1", atom=tables.Float64Atom(), shape=(0, 4))
size1 = shape1//loop_1
size2 = shape2//loop_2
# ***************************************************
# Grabbing chunks to process and append data
# ***************************************************
for c in range(loop_1):
    h = c*size1
    # grab chunks from dset_1 of inp.h5
    chunk1 = chunks1[h:(h + size1)]
    for d in range(loop_2):
        g = d*size2
        chunk2 = chunks2[g:(g + size2)]  # grab chunks from dset_2 of inp.h5
        r1 = chunk1.shape[0]
        r2 = chunk2.shape[0]
        left, right = 0, 0
        for j in range(r1):  # grab col. 2 values from dataset-1
            e1 = chunk1[j, 1]
            # ...algebraic operations here to output a row containing 4 float64
            # ...append to a (earray) when no. of rows reaches a million
        del chunk2
    del chunk1
f2.close()
I wrote the answer you are referencing. That is a simple example that "only" writes 1.5e6 rows. I didn't do anything to optimize performance for very large files. You are creating a very large file, but did not say how many rows (obviously way more than 10**6). Here are some suggestions based on comments in another thread.
Here are the areas I recommend (3 related to PyTables code, and 2 based on external utilities).
PyTables code suggestions:
Enable compression when you create the file (add the filters= parameter to the open_file() call). Start with tb.Filters(complevel=1); a minimal sketch appears after this list.
Define the expectedrows= parameter in .create_table() (per the PyTables docs, 'this will optimize the HDF5 B-Tree and amount of memory used'). The default value is set in tables/parameters.py (look for EXPECTED_ROWS_TABLE; it's only 10000 in my installation). I suggest you set this to a larger value if you are creating 10**6 (or more) rows.
There is a side benefit to setting expectedrows=. If you don't define chunkshape, 'a sensible value is calculated based on the expectedrows parameter'. Check the value used. This won't decrease the created file size, but will improve I/O performance.
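A minimal sketch of the first two suggestions applied to the MWE's EArray (the expectedrows value is only a guess based on the ~2.5E10 rows mentioned in the question; set it to your real row count):
import tables as tb

filters = tb.Filters(complevel=1, complib='zlib')
f1 = tb.open_file("table.h5", "w", filters=filters)
a = f1.create_earray(f1.root, "dataset_1",
                     atom=tb.Float64Atom(), shape=(0, 4),
                     expectedrows=25_000_000_000)
print(a.chunkshape)  # check the chunk shape PyTables derived from expectedrows
f1.close()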
If you didn't use compression when you created the file, there are 2 methods to compress existing files:
External Utilities:
The PyTables utility ptrepack - run against an HDF5 file to create a new file (useful to go from uncompressed to compressed, or vice versa). It is delivered with PyTables and runs on the command line.
The HDF5 utility h5repack - works similarly to ptrepack. It is delivered with the HDF5 installer from The HDF Group.
There are trade-offs with file compression: it reduces the file size but increases access time (reduces I/O performance). I tend to keep the files I open frequently uncompressed (for best I/O performance). Then, when done, I convert them to compressed format for long-term archiving. You can continue to work with them in compressed format (the API handles this cleanly).
Because of a memory error, I have to split my csv files. I researched it and found this code from the Stack Overflow user Aziz Alto:
csvfile = open('#', 'r').readlines()
filename = 1
for i in range(len(csvfile)):
    if i % 10000000 == 0:
        open(str(filename) + '.csv', 'w+').writelines(csvfile[i:i+10000000])
        filename += 1
It works well, but for the second file the code does not add the header, which is very important for me. My question is: how can I add the header to the second file?
import pandas as pd

rows = pd.read_csv("csvfile.csv", chunksize=5000000)
for i, chunk in enumerate(rows):
    chunk.to_csv('out{}.csv'.format(i))  # i is the chunk number of each iteration
With chunksize you specify how many rows you want per file; in Excel you can have up to 1,048,576 rows.
This will save each chunk of 5,000,000 rows to its own file, with the header.
Hope this helps!
From the 2nd file up to the last one, you always have to add the 1st line of your original file (the one containing the header):
# this loads the first file fully into memory
with open('#', 'r') as f:
    csvfile = f.readlines()

linesPerFile = 1000000
filename = 1
# this is better than your former loop: it steps through 1000000 lines at a time,
# instead of incrementing 1000000 times and only writing on the millionth one
for i in range(0, len(csvfile), linesPerFile):
    with open(str(filename) + '.csv', 'w+') as f:
        if filename > 1:          # this is the second or later file, we need to write the
            f.write(csvfile[0])   # header again if 2nd.... file
        f.writelines(csvfile[i:i+linesPerFile])
    filename += 1
Fast csv file splitting
If you have a very big file and you have to try different partitions (say, to find the best way to split it), the above solutions are too slow.
Another way to solve this (and a very fast one) is to create an index file by record number. It takes about six minutes to create an index file of a csv file of 6867839 rows and 9 Gb, and an additional 2 minutes for joblib to store it on disk.
This method is particularly impressive if you are dealing with huge files, like 3 Gb or more.
Here's the code for creating the index file:
# Usage:
# creaidx.py filename.csv
# indexes a csv file by record number. This can be used to
# access any record directly or to split a file without the
# need of reading it all. The index file is joblib-stored as
# filename.index
# filename.csv is the file to create index for
import os, sys, joblib

BLKSIZE = 512

def checkopen(s, m='r', bz=None):
    if os.access(s, os.F_OK):
        if bz == None:
            return open(s, m)      # returns open file
        else:
            return open(s, m, bz)  # returns open file with buffer size
    else:
        return None

def get_blk():
    global ix, off, blk, buff
    while True:  # dealing with special cases
        if ix == 0:
            n = 0
            break
        if buff[0] == b'\r':
            n = 2
            off = 0
            break
        if off == BLKSIZE - 2:
            n = 0
            off = 0
            break
        if off == BLKSIZE - 1:
            n = 0
            off = 1
            break
        n = 2
        off = buff.find(b'\r')
        break
    while (off >= 0 and off < BLKSIZE - 2):
        idx.append([ix, blk, off + n])
        # g.write('{},{},{}\n'.format(ix,blk,off+n))
        print(ix, end='\r')
        n = 2
        ix += 1
        off = buff.find(b'\r', off + 2)

def crea_idx():
    global buff, blk
    buff = f.read(BLKSIZE)
    while len(buff) == BLKSIZE:
        get_blk()
        buff = f.read(BLKSIZE)
        blk += 1
    get_blk()
    idx[-1][2] = -1
    return

if len(sys.argv) == 1:
    sys.exit("Need to provide a csv filename!")

ix = 0
blk = 0
off = 0
idx = []
buff = b'0'

s = sys.argv[1]
f = checkopen(s, 'rb')
idxfile = s.replace('.csv', '.index')
if checkopen(idxfile) == None:
    with open(idxfile, 'w') as g:
        crea_idx()
    joblib.dump(idx, idxfile)
else:
    if os.path.getctime(idxfile) < os.path.getctime(s):
        with open(idxfile, 'w') as g:
            crea_idx()
        joblib.dump(idx, idxfile)
f.close()
Let's use a toy example:
strings,numbers,colors
string1,1,blue
string2,2,red
string3,3,green
string4,4,yellow
The index file will be:
[[0, 0, 0],
[1, 0, 24],
[2, 0, 40],
[3, 0, 55],
[4, 0, 72],
[5, 0, -1]]
Note the -1 in the last index element, which indicates the end of the index in case of sequential access. You can use a tool like this to access any individual row of the csv file:
def get_rec(n=1, binary=False):
    n = 1 if n < 0 else n + 1
    s = b'' if binary else ''
    if len(idx) == 0: return ''
    if idx[n-1][2] == -1: return ''
    f.seek(idx[n-1][1]*BLKSIZE + idx[n-1][2])
    buff = f.read(BLKSIZE)
    x = buff.find(b'\r')
    while x == -1:
        s = s + buff if binary else s + buff.decode()
        buff = f.read(BLKSIZE)
        x = buff.find(b'\r')
    return s + buff[:x] + b'\r\n' if binary else s + buff[:x].decode()
The first field of each index record is obviously unnecessary; it is kept there for debugging purposes. As a side note, if you substitute this field with any field of the csv record and sort the index file by that field, then you effectively have the csv file sorted by that field whenever you use the index to access it.
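A hedged usage sketch of get_rec() (it assumes the index was built for the toy file above, saved as toy.csv / toy.index, and that the module-level BLKSIZE, f and idx that get_rec() relies on are set up first):
import joblib

BLKSIZE = 512
f = open('toy.csv', 'rb')
idx = joblib.load('toy.index')

print(get_rec(0))  # the header row: 'strings,numbers,colors'
print(get_rec(3))  # the third data record: 'string3,3,green'
f.close()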
Now, once you have your index file created, you just call the following program with two command-line parameters: the filename (the one whose index was already created) and a number between 1 and 100, which is the percentage at which the file will be split:
import time
start_time = time.time()
BLKSIZE = 512
WSIZE = 1048576  # pow(2,20) 1Mb for faster reading/writing
import sys
import joblib
from common import Drv, checkopen

ix = 0
blk = 0
off = 0
idx = []
buff = b'0'

if len(sys.argv) < 3:
    sys.exit('Argument missing!')
s = Drv + sys.argv[1]
if sys.argv[2].isnumeric():
    pct = int(sys.argv[2])/100
else:
    sys.exit('Bad percentage: ' + sys.argv[2])
f = checkopen(s, 'rb')
idxfile = s.replace('.csv', '.index')
if checkopen(idxfile):
    print('Loading index...')
    idx = joblib.load(idxfile)
    print('Done loading index.')
else:
    sys.exit(idxfile + ' does not exist.')

head = get_rec(0, True)
n = int(pct*(len(idx) - 2))
off = idx[n+1][1]*BLKSIZE + idx[n+1][2] - len(head) - 1
num = off//WSIZE
res = off%WSIZE

sout = s.replace('.csv', '.part1.csv')
i = 0
with open(sout, 'wb') as g:
    g.write(head)
    f.seek(idx[1][1]*BLKSIZE + idx[1][2])
    for x in range(num):
        print(i, end='\r')
        i += 1
        buff = f.read(WSIZE)
        g.write(buff)
    buff = f.read(res)
    g.write(buff)
print()

i = 0
sout = s.replace('.csv', '.part2.csv')
with open(sout, 'wb') as g:
    g.write(head)
    f.seek(idx[n+1][1]*BLKSIZE + idx[n+1][2])
    buff = f.read(WSIZE)
    while len(buff) == WSIZE:
        g.write(buff)
        print(i, end='\r')
        i += 1
        buff = f.read(WSIZE)
    g.write(buff)
end_time = time.time()
The files are created using blocks of 1,048,576 bytes. You can play with that figure to make file creation faster or to tailor it to machines with fewer memory resources.
The file is split into only two partitions, each having the header of the original file. It is not too difficult to change the code to split files into more than two partitions.
Finally, to put things in perspective: to split a csv file of 6,867,839 rows and 9 GB in half, it took me roughly 6 minutes to create the index file and another 2 minutes for joblib to store it on disk. It took 3 additional minutes to split the file.
With the following code, I'm seeing longer and longer execution times as I increase the starting row in islice. For example, a start_row of 4 will execute in 1s but a start_row of 500004 will take 11s. Why does this happen and is there a faster way to do this? I want to be able to iterate over several ranges of rows in a large CSV file (several GB) and make some calculations.
import csv
import itertools
from collections import deque
import time

my_queue = deque()

start_row = 500004
stop_row = start_row + 50000

with open('test.csv', 'rb') as fin:
    # load into csv's reader
    csv_f = csv.reader(fin)
    # start logging time for performance
    start = time.time()
    for row in itertools.islice(csv_f, start_row, stop_row):
        my_queue.append(float(row[4])*float(row[10]))
    # stop logging time
    end = time.time()
    # display performance
    print "Initial queue populating time: %.2f" % (end-start)
For example, a start_row of 4 will execute in 1s but a start_row of 500004 will take 11s
That is islice being intelligent. Or lazy, depending on which term you prefer.
Thing is, files are "just" strings of bytes on your hard drive. They don't have any internal organization. \n is just another set of bytes in that long, long string. There is no way to access any particular line without looking at all of the information before it (unless your lines are of the exact same length, in which case you can use file.seek).
Line 4? Finding line 4 is fast; your computer just needs to find 3 \n. Line 500004? Your computer has to read through the file until it finds 500003 \n. No way around it, and if someone tells you otherwise, they either have some other sort of quantum computer or their computer is reading through the file just like every other computer in the world, just behind their back.
As for what you can do about it: Try to be smart when trying to grab lines to iterate over. Smart, and lazy. Arrange your requests so you're only iterating through the file once, and close the file as soon as you've pulled the data you need. (islice does all of this, by the way.)
In python
lines_I_want = [(start1, stop1), (start2, stop2), ...]

with open(filename) as f:
    for i, j in enumerate(f):
        if i >= lines_I_want[0][0]:
            if i >= lines_I_want[0][1]:
                lines_I_want.pop(0)
                if not lines_I_want:  # list is empty
                    break
            else:
                # j is a line I want. Do something
                pass
And if you have any control over making that file, make every line the same length so you can seek. Or use a database.
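For completeness, a hypothetical sketch of the fixed-length-record idea (RECORD_LEN is an assumed padded line length, newline included; it is not something your file has today):
RECORD_LEN = 64  # every line padded to exactly this many bytes, newline included

def read_fixed_row(path, n):
    """Jump straight to row n with a single seek instead of scanning the file."""
    with open(path, 'rb') as f:
        f.seek(n * RECORD_LEN)
        return f.read(RECORD_LEN).rstrip(b' \r\n')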
The problem with using islice() for what you're doing is that it iterates through all the lines before the first one you want before returning anything. Obviously the larger the starting row, the longer this will take. Another issue is that you're using a csv.reader to read these lines, which likely incurs unnecessary overhead, since one line of the csv file is usually one row of it. The only time that's not true is when the csv file has string fields that contain embedded newline characters, which in my experience is uncommon.
If this is a valid assumption for your data, it would likely be much faster to first index the file and build a table of (filename, offset, number-of-rows) tuples indicating the approximately equally-sized logical chunks of lines/rows in the file. With that, you can process them relatively quickly by first seeking to the starting offset and then reading the specified number of csv rows from that point on.
Another advantage of this approach is that it would allow you to process the chunks in parallel, which I suspect is the real problem you're trying to solve, based on a previous question of yours. So, even though you haven't mentioned multiprocessing here, the following has been written to be compatible with doing that, if that's the case.
import csv
from itertools import islice
import os
import sys

def open_binary_mode(filename, mode='r'):
    """ Open a file the proper way (depends on the Python version). """
    kwargs = (dict(mode=mode+'b') if sys.version_info[0] == 2 else
              dict(mode=mode, newline=''))
    return open(filename, **kwargs)

def split(infilename, num_chunks):
    infile_size = os.path.getsize(infilename)
    chunk_size = infile_size // num_chunks
    offset = 0
    num_rows = 0
    bytes_read = 0
    chunks = []
    with open_binary_mode(infilename, 'r') as infile:
        for _ in range(num_chunks):
            while bytes_read < chunk_size:
                try:
                    bytes_read += len(next(infile))
                    num_rows += 1
                except StopIteration:  # end of infile
                    break
            chunks.append((infilename, offset, num_rows))
            offset += bytes_read
            num_rows = 0
            bytes_read = 0
    return chunks
chunks = split('sample_simple.csv', num_chunks=4)

for filename, offset, rows in chunks:
    print('processing: {} rows starting at offset {}'.format(rows, offset))
    with open_binary_mode(filename, 'r') as fin:
        fin.seek(offset)
        for row in islice(csv.reader(fin), rows):
            print(row)
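Since the chunk table was written with multiprocessing in mind, here is a hedged sketch of that use; it reuses split(), open_binary_mode(), csv and islice from above and the float(row[4]) * float(row[10]) calculation from the question:
from multiprocessing import Pool

def process_chunk(chunk):
    filename, offset, rows = chunk
    total = 0.0
    with open_binary_mode(filename, 'r') as fin:
        fin.seek(offset)
        for row in islice(csv.reader(fin), rows):
            total += float(row[4]) * float(row[10])  # same calculation as in the question
    return total

if __name__ == '__main__':
    with Pool() as pool:
        results = pool.map(process_chunk, split('sample_simple.csv', num_chunks=4))
    print(results)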