Loading a very large txt file and taking transpose - python

I have a tab separated .txt file that stores numbers as a matrix. The number of lines is 904,652 and the number of columns is 26,600 (tab separated). The total size of the file is around 48 GB. I need to load this file as a matrix and take the transpose to extract training and testing data. I am using Python with the pandas and sklearn packages. I have a server with 500 GB of memory, but it is not enough to load the file with pandas. Could anyone help me with this problem?
The loading code part is below:
import pandas

csv_delimiter = "\t"  # the file is tab separated

def open_with_pandas_read_csv(filename):
    df = pandas.read_csv(filename, sep=csv_delimiter, header=None)
    data = df.values
    return data

If your server has 500GB of RAM, you should have no problem using numpy's loadtxt method.
import numpy as np

data = np.loadtxt("path_to_file").T

The fact that it's a text file makes it a little harder. As a first step, I would create a binary file out of it, where each number takes a constant number of bytes. It will probably also reduce the file size.
Then, I would make a number of passes, and in each pass I would write N rows in the output file.
Pseudo code:
for p in range(columns // N):                  # one pass per group of N output rows
    transposed_rows = [[] for _ in range(N)]   # length = N, reset every pass
    for row in range(rows):
        # read N consecutive numbers from input row `row`, starting at column p*N
        x = read_N_numbers_from_row_of_input_matrix(row, p * N)
        for i in range(N):
            transposed_rows[i].append(x[i])
    for i in range(N):
        append_to_output_file(transposed_rows[i])
Converting to a binary file makes it possible to read a sequence of numbers from the middle of a row (by seeking to a fixed byte offset).
N should be small enough for transposed_rows to fit in memory, i.e. N*rows should be reasonable.
N should be large enough so that we take advantage of caching. If N=1, this means we're wasting a lot of reads to generate a single row of output.
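A minimal sketch of the two pieces the pseudo code above relies on, assuming a float32 binary layout and the dimensions from the question (the helper names mirror the pseudo code and are otherwise made up):

import numpy as np

ROWS, COLS = 904_652, 26_600        # dimensions from the question
DTYPE = np.dtype(np.float32)        # assumed binary element type (4 bytes each)

def convert_txt_to_binary(txt_path, bin_path):
    """Stream the tab separated text file into a flat, row-major binary file."""
    with open(txt_path) as src, open(bin_path, "wb") as dst:
        for line in src:
            np.array(line.split("\t"), dtype=DTYPE).tofile(dst)

def read_N_numbers_from_row_of_input_matrix(bin_file, row, start_col, n):
    """Read n consecutive values from one row of the binary matrix."""
    bin_file.seek((row * COLS + start_col) * DTYPE.itemsize)
    return np.fromfile(bin_file, dtype=DTYPE, count=n)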

It sounds to me that you are working on genetic data. If so, consider using --transpose with plink; it is very quick: http://pngu.mgh.harvard.edu/~purcell/plink/dataman.shtml#recode

I found a solution (I believe there are still more efficient and logical ones) on Stack Overflow: the np.fromfile() method loads huge files more efficiently than np.loadtxt(), np.genfromtxt(), and even pandas.read_csv(). It took just around 274 GB without any modification or compression. I thank everyone who tried to help me with this issue.
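For reference, a sketch of how np.fromfile can be pointed at a whitespace separated text file (dimensions taken from the question; note that a space separator makes it split on any run of whitespace, including tabs and newlines):

import numpy as np

n_rows, n_cols = 904_652, 26_600
# sep=" " puts fromfile in text mode, splitting on any whitespace (tabs, newlines, ...)
data = np.fromfile("path_to_file.txt", dtype=np.float64, sep=" ")
matrix = data.reshape(n_rows, n_cols).T   # transpose is a view, no extra copy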

Related

Faster alternative than pandas.read_csv for reading txt files with data

I am working on a project where I need to process a large number of txt files (about 7000), each with 2 columns of floats and 12,500 rows.
I am using pandas, and it takes about 2 min 20 sec, which is a bit long. With MATLAB this takes about 1 min less. I would like to get close to MATLAB or faster.
Is there any faster alternative that I can implement with python?
I tried Cython and the speed was the same as with pandas.
Here is the code I am using. It reads files composed of column 1 (time) and column 2 (amplitude). I calculate the envelope and make a list with the resulting envelopes for all files. I extract the time from the first file using plain numpy.loadtxt(), which is slower than pandas but has no impact since it is just one file.
Any ideas?
For coding suggestions please try to use the same format as I use.
Cheers
Data example:
1.7949600e-05 -5.3232106e-03
1.7950000e-05 -5.6231098e-03
1.7950400e-05 -5.9230090e-03
1.7950800e-05 -6.3228746e-03
1.7951200e-05 -6.2978830e-03
1.7951600e-05 -6.6727570e-03
1.7952000e-05 -6.5727906e-03
1.7952400e-05 -6.9726562e-03
1.7952800e-05 -7.0726226e-03
1.7953200e-05 -7.2475638e-03
1.7953600e-05 -7.1725890e-03
1.7954000e-05 -6.9476646e-03
1.7954400e-05 -6.6227738e-03
1.7954800e-05 -6.4228410e-03
1.7955200e-05 -5.8480342e-03
1.7955600e-05 -6.1979166e-03
1.7956000e-05 -5.7980510e-03
1.7956400e-05 -5.6231098e-03
1.7956800e-05 -5.3482022e-03
1.7957200e-05 -5.1732611e-03
1.7957600e-05 -4.6484375e-03
20 files here:
https://1drv.ms/u/s!Ag-tHmG9aFpjcFZPqeTO12FWlMY?e=f6Zk38
import os
from os.path import join

import numpy as np
import pandas as pd
from scipy.signal import hilbert

folder_tarjet = r"D:\this"   # raw string so "\t" is not read as a tab

if len(folder_tarjet) > 0:
    print("You chose %s" % folder_tarjet)

list_of_files = os.listdir(folder_tarjet)
list_of_files.sort(key=lambda f: os.path.getmtime(join(folder_tarjet, f)))
num_files = len(list_of_files)

envs_a = []
for elem in list_of_files:
    file_name = os.path.join(folder_tarjet, elem)
    # header=None, so apply float64 to both columns rather than to a named 'amp' column
    amp = pd.read_csv(file_name, header=None, dtype=np.float64, delim_whitespace=True)
    # column 1 is the amplitude; take it as a 1-D array before computing the envelope
    env_amplitudes = np.abs(hilbert(amp[1].to_numpy()))
    envs_a.append(env_amplitudes)

envelopes = np.array(envs_a).T

# time axis taken from the first file only
file_name = os.path.join(folder_tarjet, list_of_files[0])
Time = np.loadtxt(file_name, usecols=0)
I would suggest not using CSV to store and load big quantities of data. If you already have your data in CSV you can still convert all of it at once into a faster format. For instance you can use pickle, HDF5, Feather, or Parquet; they are not human readable but have much better performance on every other metric.
After the conversion you will be able to load the data in a few seconds (if not less than a second) instead of minutes, so in my opinion it is a much better solution than chasing a marginal 50% optimization.
If the data is being generated, make sure to generate it in a compressed format. If you don't know which format to use, Parquet for instance is one of the fastest for reading and writing, and you can also load it from MATLAB.
Here you will find the IO options already supported by pandas.
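As a hedged example of the conversion idea (file names are placeholders; Parquet support needs pyarrow or fastparquet installed):

import numpy as np
import pandas as pd

# One-off: parse the text file once and save a binary copy next to it.
df = pd.read_csv("scan_0001.txt", header=None, names=["time", "amplitude"],
                 delim_whitespace=True, dtype=np.float64)
df.to_parquet("scan_0001.parquet")

# Later runs: reload the binary copy instead of re-parsing the text.
df = pd.read_parquet("scan_0001.parquet")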
As discussed in @Ziur Olpa's answer and the comments, a binary format is bound to be faster than parsing text.
The quick way to get those gains is to use NumPy's own NPY format, and have your reader function cache those onto disk; that way, when you re-(re-re-)run your data analysis, it will use the pre-parsed NPY files instead of the "raw" TXT files. Should the TXT files change, you can just remove all NPY files and wait a while longer for parsing to happen (or maybe add logic to compare the modification times of the NPY files with their corresponding TXT files).
Something like this; I hope I got your hilbert/abs/transposition logic the way you wanted it:
import glob
import os

import numpy as np
import pandas as pd
from scipy import signal


def read_scans(folder):
    """
    Read scan files from a given folder, yield tuples of filename/matrix.
    This will also create "cached" npy files for each txt file, e.g. a.txt -> a.txt.npy.
    """
    for filename in sorted(glob.glob(os.path.join(folder, "*.txt")), key=os.path.getmtime):
        np_filename = filename + ".npy"
        if os.path.exists(np_filename):
            yield (filename, np.load(np_filename))
        else:
            df = pd.read_csv(filename, header=None, dtype=np.float64, delim_whitespace=True)
            mat = df.values
            np.save(np_filename, mat)
            yield (filename, mat)


def read_data_opt(folder):
    times = None
    envs = []
    for filename, mat in read_scans(folder):
        if times is None:
            # If we hadn't got the times yet, grab them from this
            # matrix (by slicing the first column).
            times = mat[:, 0]
        env_amplitudes = signal.hilbert(mat[:, 1])
        envs.append(env_amplitudes)
    envs = np.abs(np.array(envs)).T
    return (times, envs)
On my machine, the first run (for your 20-file dataset) takes
read_data_opt took 0.13302898406982422 seconds
and the subsequent runs are 4x faster:
read_data_opt took 0.031115055084228516 seconds

How to optimize sequential writes with h5py to increase speed when reading the file afterwards?

I process some input data which, if I did it all at once, would give me a dataset of float32s and typical shape (5000, 30000000). (The length of the 0th axis is fixed, the 1st varies, but I do know what it will be before I start).
Since that's ~600GB and won't fit in memory I have to cut it up along the 1st axis and process it in blocks of (5000, blocksize). I cannot cut it up along the 0th axis, and due to RAM constraints blocksize is typically around 40000. At the moment I'm writing each block to an hdf5 dataset sequentially, creating the dataset like:
fout = h5py.File(fname, "a")
blocksize = 40000
block_to_write = np.random.random((5000, blocksize))
fout.create_dataset("data", data=block_to_write, maxshape=(5000, None))
and then looping through blocks and adding to it via
fout["data"].resize((fout["data"].shape[1] + blocksize), axis=1)
fout["data"][:, -blocksize:] = block_to_write
This works and runs in an acceptable amount of time.
The end product I need to feed into the next step is a binary file for each row of the output. It's someone else's software so unfortunately I have no flexibility there.
The problem is that reading in one row like
fin = h5py.File(fname, 'r')
data = fin['data']
a = data[0,:]
takes ~4min and with 5000 rows, that's way too long!
Is there any way I can alter my write so that my read is faster? Or is there anything else I can do instead?
Should I make each individual row its own data set within the hdf5 file? I assumed that doing lots of individual writes would be too slow but maybe it's better?
I tried writing the binary files directly - opening them outside of the loop, writing to them during the loops, and then closing them afterwards - but I ran into OSError: [Errno 24] Too many open files. I haven't tried it but I assume opening the files and closing them inside the loop would make it way too slow.
Your question is similar to a previous SO/h5py question I recently answered: h5py extremely slow writing. Apparently you are getting acceptable write performance, and want to improve read performance.
The 2 most important factors that affect h5py I/O performance are: 1) chunk size/shape, and 2) size of the I/O data block. h5py docs recommend keeping chunk size between 10 KB and 1 MB -- larger for larger datasets. Ref: h5py Chunked Storage. I have also found write performance degrades when I/O data blocks are "too small". Ref: pytables writes much faster than h5py. The size of your read data block is certainly large enough.
So, my initial hunch was to investigate chunk size influence on I/O performance. Setting the optimal chunk size is a bit of an art. Best way to tune the value is to enable chunking, let h5py define the default size, and see if you get acceptable performance. You didn't define the chunks parameter. However, because you defined the maxshape parameter, chunking was automatically enabled with a default size (based on the dataset's initial size). (Without chunking, I/O on a file of this size would be painfully slow.) An additional consideration for your problem: the optimal chunk size has to balance the size of the write data blocks (5000 x 40_000) vs the read data blocks (1 x 30_000_000).
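For illustration only, a sketch of setting the chunk shape explicitly when the dataset is created (the dimensions echo the question; the chunk shape matches the user-defined test further below):

import h5py

with h5py.File("example.h5", "w") as f:
    # 10 x 40_000 float32 chunks (~1.5 MB each) favour reading long slices of a few rows
    dset = f.create_dataset("data", shape=(5000, 40_000), dtype="f4",
                            maxshape=(5000, None), chunks=(10, 40_000))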
I parameterized your code so I could tinker with the dimensions. When I did, I discovered something interesting. Reading the data is much faster when I run it as a separate process after creating the file. And, the default chunk size seems to give adequate read performance. (Initially I was going to benchmark different chunk size values.)
Note: I only created a 78GB file (4_000_000 columns). This takes >13mins to run on my Windows system. I didn't want to wait 90mins to create a 600GB file. You can modify n_blocks=750 if you want to test 30_000_000 columns. :-) All code at the end of this post.
Next I created a separate program to read the data. Read performance was fast with the default chunk size: (40, 625). Timing output below:
Time to read first row: 0.28 (in sec)
Time to read last row: 0.28
Interestingly, I did not get the same read times with every test. Values above were pretty consistent, but occasionally I would get a read time of 7-10 seconds. Not sure why that happens.
I ran 3 tests (In all cases block_to_write.shape=(500,40_000)):
default chunksize=(40,625) [95KB]; for 500x40_000 dataset (resized)
default chunksize=(10,15625) [596KB]; for 500x4_000_000 dataset (not resized)
user defined chunksize=(10,40_000) [1.526MB]; for 500x4_000_000 dataset (not resized)
Larger chunks improve read performance, but speed with the default values is pretty fast. (Chunk size has a very small effect on write performance.) Output for all 3 below.
dataset chunkshape: (40, 625)
Time to read first row: 0.28
Time to read last row: 0.28
dataset chunkshape: (10, 15625)
Time to read first row: 0.05
Time to read last row: 0.06
dataset chunkshape: (10, 40000)
Time to read first row: 0.00
Time to read last row: 0.02
Code to create my test file below:
import time

import h5py
import numpy as np

fname = "test.h5"  # output HDF5 file for the benchmark

with h5py.File(fname, 'w') as fout:
    blocksize = 40_000
    n_blocks = 100
    n_rows = 5_000
    block_to_write = np.random.random((n_rows, blocksize))
    start = time.time()
    for cnt in range(n_blocks):
        incr = time.time()
        print(f'Working on loop: {cnt}', end='')
        if "data" not in fout:
            fout.create_dataset("data", shape=(n_rows, blocksize),
                                maxshape=(n_rows, None))  # , chunks=(10, blocksize))
        else:
            fout["data"].resize((fout["data"].shape[1] + blocksize), axis=1)
        fout["data"][:, cnt*blocksize:(cnt+1)*blocksize] = block_to_write
        print(f' - Time to add block: {time.time()-incr:.2f}')

print(f'Done creating file: {fname}')
print(f'Time to create {n_blocks}x{blocksize:,} columns: {time.time()-start:.2f}\n')
Code to read 2 different arrays from the test file below:
import time

import h5py

fname = "test.h5"  # the HDF5 file created above

with h5py.File(fname, 'r') as fin:
    print(f'dataset shape: {fin["data"].shape}')
    print(f'dataset chunkshape: {fin["data"].chunks}')
    start = time.time()
    data = fin["data"][0, :]
    print(f'Time to read first row: {time.time()-start:.2f}')
    start = time.time()
    data = fin["data"][-1, :]
    print(f'Time to read last row: {time.time()-start:.2f}')

Making a generator read multiple lines

I have a very large csv file (11 Million lines).
I would like to create batches of data. I cannot for the life of me figure out how to read n lines in a generator (where I specify what n is; sometimes I want it to be 50, sometimes 2). I came up with a kludge that works one time, but I could not get it to iterate a second time. Generators are quite new to me, so it took me a bit even to get the calling down. (For the record, this is a clean dataset with 29 values on every line.)
import numpy as np
import csv

def getData(filename):
    with open(filename, "r") as csv1:
        reader1 = csv.reader(csv1)
        for row1 in reader1:
            yield row1

def make_b(size, file):
    gen = getData(file)
    data = np.zeros((size, 29))
    for i in range(size):
        data[i, :] = next(gen)
    yield data[:, 0], data[:, 1:]
test=make_b(4,"myfile.csv")
next(test)
next(test)
The reason for this is to use an example of batching data in Keras. While it is possible to use different methods to get all of the data into memory, I am trying to introduce students to the concept of batching data from a large dataset. Since this is a survey course, I wanted to demonstrate batching in data from a large text file, which has proved frustratingly difficult for such an 'entry-level' task. (It's actually much easier in TensorFlow proper, but I am using Keras to introduce high-level concepts of the MLP.)
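Not the original code, but a minimal sketch of one way to make the batching repeatable, using itertools.islice to pull n rows per batch until the file is exhausted (the file name and the label-in-column-0 layout are taken from the question):

import csv
import itertools

import numpy as np

def batch_generator(filename, size):
    """Yield (labels, features) arrays of `size` rows each until the csv runs out."""
    with open(filename, "r") as f:
        reader = csv.reader(f)
        while True:
            chunk = list(itertools.islice(reader, size))
            if not chunk:
                break
            data = np.asarray(chunk, dtype=np.float64)
            yield data[:, 0], data[:, 1:]

batches = batch_generator("myfile.csv", 4)
y1, x1 = next(batches)   # first batch of 4 rows
y2, x2 = next(batches)   # next batch of 4 rows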

Loading a .csv file in python consuming too much memory

I'm loading a 4 GB .csv file in Python. Since it's only 4 GB, I expected it would be fine to load it all at once, but after a while my 32 GB of RAM gets completely filled.
Am I doing something wrong? Why does a 4 GB file end up taking so much more space in RAM?
Is there a faster way of loading this data?
fname = "E:\Data\Data.csv"
a = []
with open(fname) as csvfile:
reader = csv.reader(csvfile,delimiter=',')
cont = 0;
for row in reader:
cont = cont+1
print(cont)
a.append(row)
b = np.asarray(a)
You copied the entire content of the csv at least twice:
once in a and again in b.
Any additional work over that data consumes additional memory to hold intermediate values.
You could del a once you have b, but note that the pandas library provides a read_csv function, and the resulting DataFrame gives you the row count directly.
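A minimal sketch of that pandas route (path from the question):

import pandas as pd

df = pd.read_csv(r"E:\Data\Data.csv", header=None)
print(len(df))          # row count without a manual counter
b = df.to_numpy()       # one numeric array instead of a list of row lists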
You should be able to do an
a = list(reader)
which might be a little better.
Because it's Python :-D One of the easiest approaches: create your own class for rows that stores the data in __slots__, which can save a couple of hundred bytes per row (each row from the reader is a sequence of individual Python string objects, which is fairly heavy even for small rows). If you go further, you can store a binary representation of the data.
But maybe you can process the data without keeping the entire array in memory? That would consume significantly less memory.
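A rough sketch of the __slots__ idea (the field names are made up; a real row class would mirror whatever columns the csv has):

class Row:
    # no per-instance __dict__, so each row object stays small
    __slots__ = ("time", "amplitude")

    def __init__(self, time, amplitude):
        self.time = float(time)
        self.amplitude = float(amplitude)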

Load Text File as a matrix in MATLAB

I have a text file which is a huge set of data (around 9 GB). I have arranged the file as 244 x 3089987, with the data delimited by tabs. I would like to load this text file into MATLAB as a matrix. Here is what I have tried, and I have been unsuccessful (MATLAB hangs).
fread = fopen('merge.txt','r');
formatString = repmat('%f',244,3089987);
C = textscan(fread,formatString);
Am I doing something wrong or is my approach wrong? If this is easily possible in Python, could someone please suggest accordingly.
If you read the documentation for textscan you will see that you can define an input argument N so that:
textscan reads file data using the formatSpec N times, where N is a positive integer. To read additional data from the file after N cycles, call textscan again using the original fileID. If you resume a text scan of a file by calling textscan with the same file identifier (fileID), then textscan automatically resumes reading at the point where it terminated the last read.
You can also pass a blank formatSpec to textscan in order to read in an arbitrary number of columns. This is how dlmread, a wrapper for textscan operates.
For example:
fID = fopen('test.txt');
chunksize = 10; % Number of lines to read for each iteration
while ~feof(fID) % Iterate until we reach the end of the file
datachunk = textscan(fID, '', chunksize, 'Delimiter', '\t', 'CollectOutput', true);
datachunk = datachunk{1}; % Pull data out of cell array. Can take time for large arrays
% Do calculations
end
fclose(fID);
This will read in 10 line chunks until you reach the end of the file.
If you have enough RAM to store the data (a 244 x 3089987 array of double is just over 6 gigs) then you can do:
mydata = textscan(fID, '', 'Delimiter', '\t', 'CollectOutput', true);
mydata = mydata{1}; % Pull data out of cell array. Can take time for large arrays
try:
A = importdata('merge.txt', '\t');
http://es.mathworks.com/help/matlab/ref/importdata.html
and if the rows are not delimited by '\n':
[C, remaining] = vec2mat(A, 244)
http://es.mathworks.com/help/comm/ref/vec2mat.html
Another option in recent MATLAB releases is to use datastore. This has the advantage of being designed to allow you to page through the data, rather than read the whole lot at once. It can generally deduce all the formatting stuff.
http://www.mathworks.com/help/matlab/import_export/read-and-analyze-data-in-a-tabulartextdatastore.html
I'm surprised this even attempts to run; when I try something similar, textscan throws an error.
If you really want to use textscan you only need the format for a single row, so you can replace 244 in your code with 1 and it should work. Edit: having read your comment, note that the first element in your size is the number of columns, so you should do formatString = repmat('%f', 1, 244);. Also, you can apparently just leave the format empty ('') and it will work.
However, Matlab has several text import functions of which textscan is rarely the easiest way to do something.
In this case I would probably use dlmread, which handles any delimited numerical data. You want something like:
C=dlmread('merge.txt', '\t');
Also, as you are trying to load 9 GB of data, I assume you have enough memory; if you don't, you'll probably get an out-of-memory error, but that is something to consider.
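Since the question also asks about Python: a minimal sketch of the same chunked idea there, using pandas (the chunk size is arbitrary):

import pandas as pd

# Read merge.txt ten rows at a time instead of all at once.
for chunk in pd.read_csv("merge.txt", sep="\t", header=None, chunksize=10):
    block = chunk.to_numpy()
    # do calculations on this block of rows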
