I have a very large CSV file (11 million lines).
I would like to create batches of data. I cannot for the life of me figure out how to read n lines in a generator (where I specify what n is; sometimes I want it to be 50, sometimes 2). I came up with a kludge that works once, but I could not get it to yield a second time. Generators are quite new to me, so it took me a bit even to get the calling down. (For the record, this is a clean dataset with 29 values on every line.)
import numpy as np
import csv

def getData(filename):
    with open(filename, "r") as csv1:
        reader1 = csv.reader(csv1)
        for row1 in reader1:
            yield row1

def make_b(size, file):
    gen = getData(file)
    data = np.zeros((size, 29))
    for i in range(size):
        data[i, :] = next(gen)
    yield data[:, 0], data[:, 1:]

test = make_b(4, "myfile.csv")
next(test)
next(test)
The reason for this is to use it as an example of batching data in Keras. While it is possible to use other methods to get all of the data into memory, I am trying to introduce students to the concept of batching data from a large dataset. Since this is a survey course, I wanted to demonstrate batching data in from a large text file, which has proved frustratingly difficult for such an 'entry-level' task. (It's actually much easier in TensorFlow proper, but I am using Keras to introduce high-level concepts of the MLP.)
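For reference, a minimal sketch (not the original code) of a generator that keeps yielding batches of n rows until the file runs out, which is roughly what the question is after; the batch size, file name, and the 29-column layout are taken from the description above:

import csv
import numpy as np

def make_batches(filename, size):
    """Yield (first_column, remaining_columns) arrays for each batch of `size` rows."""
    with open(filename, "r") as csvfile:
        reader = csv.reader(csvfile)
        while True:
            # Pull up to `size` rows from the reader
            batch = [row for _, row in zip(range(size), reader)]
            if len(batch) < size:
                return  # drop the final short batch for simplicity
            data = np.asarray(batch, dtype=float)  # shape (size, 29)
            yield data[:, 0], data[:, 1:]

# Usage (file name as in the question):
# batches = make_batches("myfile.csv", 4)
# y, X = next(batches)   # can be called repeatedly until the file is exhausted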
I have a while loop that collects data from a microphone (replaced here by np.random.random() to make it more reproducible). I do some operations; let's say I take abs().mean() here, because my output will be a one-dimensional array.
This loop is going to run for a LONG time (e.g., once a second for a week) and I am wondering about my options for saving this. My main concerns are saving the data with acceptable performance and having the result be portable (e.g., .csv beats .npy).
The simple way: just append things to a .txt file. Could that be replaced by .csv.gz? Maybe using np.savetxt()? Would it be worth it? (See the sketch after the update below.)
The hdf5 way: this should be a nicer way, but reading the whole dataset just to append to it doesn't seem like good practice or better-performing than dumping into a text file. Is there another way to append to HDF5 files?
The npy way (code not shown): I could save this into a .npy file, but I would rather make it portable by using a format that can be read from any program.
from collections import deque

import numpy as np
import h5py

# Not defined in the original snippet; illustrative flush size
save_interval = save_interval_sec = 5

amplitudes = deque(maxlen=save_interval_sec)

# Read from the microphone in a continuous stream
while True:
    data = np.random.random(100)
    amplitude = np.abs(data).mean()
    print(amplitude, end="\r")
    amplitudes.append(amplitude)

    # Option 1: save the amplitudes to a text file every n iterations
    if len(amplitudes) == save_interval:
        with open("amplitudes.txt", "a") as f:
            for amp in amplitudes:
                f.write(str(amp) + "\n")
        amplitudes.clear()

    # Option 2: save the amplitudes to an HDF5 file every n iterations
    if len(amplitudes) == save_interval:
        # Convert the deque to a NumPy array
        amplitudes_array = np.array(amplitudes)
        # Open an HDF5 file
        with h5py.File("amplitudes.h5", "a") as f:
            # Get the existing dataset or create a new one if it doesn't exist
            dset = f.get("amplitudes")
            if dset is None:
                dset = f.create_dataset("amplitudes", data=amplitudes_array, dtype=np.float32,
                                        maxshape=(None,), chunks=True, compression="gzip")
            else:
                # Get the current size of the dataset
                current_size = dset.shape[0]
                # Resize the dataset to make room for the new data
                dset.resize((current_size + save_interval,))
                # Write the new data to the dataset
                dset[current_size:] = amplitudes_array
        # Clear the deque
        amplitudes.clear()

    # For debug only
    if len(amplitudes) > 3:
        break
Update
I get that the answer might depend a bit on the sampling frequency (once a second might be too slow) and the data dimensions (a single column might be too little). I guess I am asking because anything can work, and I always just dump to text; I am not sure where the breaking points are that tip the decision toward one method or the other.
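To make the np.savetxt() idea above concrete, here is a minimal sketch (not the original code) of appending each flushed batch to a gzip-compressed CSV; the file name and the stand-in batch are illustrative:

import gzip
import numpy as np

# Illustrative stand-in for a flushed deque of amplitude values
amplitudes = np.random.random(10)

# Open in append + text mode so each flush adds new rows to the compressed file
with gzip.open("amplitudes.csv.gz", "at") as f:
    np.savetxt(f, np.asarray(amplitudes), fmt="%.6f")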
I process some input data which, if I did it all at once, would give me a dataset of float32s with a typical shape of (5000, 30000000). (The length of the 0th axis is fixed; the 1st varies, but I do know what it will be before I start.)
Since that's ~600 GB and won't fit in memory, I have to cut it up along the 1st axis and process it in blocks of (5000, blocksize). I cannot cut it up along the 0th axis, and due to RAM constraints blocksize is typically around 40000. At the moment I'm writing each block to an HDF5 dataset sequentially, creating the dataset like:
fout = h5py.File(fname, "a")
blocksize = 40000
block_to_write = np.random.random((5000, blocksize))
fout.create_dataset("data", data=block_to_write, maxshape=(5000, None))
and then looping through blocks and adding to it via
fout["data"].resize((fout["data"].shape[1] + blocksize), axis=1)
fout["data"][:, -blocksize:] = block_to_write
This works and runs in an acceptable amount of time.
The end product I need to feed into the next step is a binary file for each row of the output. It's someone else's software so unfortunately I have no flexibility there.
The problem is that reading in one row like
fin = h5py.File(fname, 'r')
data = fin['data']
a = data[0,:]
takes ~4 min, and with 5000 rows that's way too long!
Is there any way I can alter my write so that my read is faster? Or is there anything else I can do instead?
Should I make each individual row its own dataset within the HDF5 file? I assumed that doing lots of individual writes would be too slow, but maybe it's better?
I tried writing the binary files directly - opening them outside of the loop, writing to them during the loops, and then closing them afterwards - but I ran into OSError: [Errno 24] Too many open files. I haven't tried it but I assume opening the files and closing them inside the loop would make it way too slow.
Your question is similar to a previous SO/h5py question I recently answered: h5py extremely slow writing. Apparently you are getting acceptable write performance, and want to improve read performance.
The 2 most important factors that affect h5py I/O performance are: 1) chunk size/shape, and 2) size of the I/O data block. h5py docs recommend keeping chunk size between 10 KB and 1 MB -- larger for larger datasets. Ref: h5py Chunked Storage. I have also found write performance degrades when I/O data blocks are "too small". Ref: pytables writes much faster than h5py. The size of your read data block is certainly large enough.
So, my initial hunch was to investigate chunk size influence on I/O performance. Setting the optimal chunk size is a bit of an art. Best way to tune the value is to enable chunking, let h5py define the default size, and see if you get acceptable performance. You didn't define the chunks parameter. However, because you defined the maxshape parameter, chunking was automatically enabled with a default size (based on the dataset's initial size). (Without chunking, I/O on a file of this size would be painfully slow.) An additional consideration for your problem: the optimal chunk size has to balance the size of the write data blocks (5000 x 40_000) vs the read data blocks (1 x 30_000_000).
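For illustration only (the shapes mirror the question, but this is not the original code), explicitly setting a chunk shape when creating a resizable dataset looks roughly like this:

import h5py

# Sketch: resizable dataset with an explicit chunk shape biased toward row reads.
# chunks=(10, 40_000) is ~1.5 MiB of float32, near the upper end of the recommended range.
with h5py.File("example.h5", "w") as f:
    f.create_dataset(
        "data",
        shape=(5_000, 40_000),
        maxshape=(5_000, None),   # resizable along axis 1
        chunks=(10, 40_000),      # each chunk spans 10 rows and 40,000 columns
        dtype="float32",
    )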
I parameterized your code so I could tinker with the dimensions. When I did, I discovered something interesting. Reading the data is much faster when I run it as a separate process after creating the file. And, the default chunk size seems to give adequate read performance. (Initially I was going to benchmark different chunk size values.)
Note: I only created a 78 GB file (4_000_000 columns). This takes >13 minutes to run on my Windows system. I didn't want to wait 90 minutes to create a 600 GB file. You can set n_blocks=750 if you want to test 30_000_000 columns. :-) All code is at the end of this post.
Next I created a separate program to read the data. Read performance was fast with the default chunk size: (40, 625). Timing output below:
Time to read first row: 0.28 (in sec)
Time to read last row: 0.28
Interestingly, I did not get the same read times with every test. Values above were pretty consistent, but occasionally I would get a read time of 7-10 seconds. Not sure why that happens.
I ran 3 tests (In all cases block_to_write.shape=(500,40_000)):
default chunksize=(40,625) [95KB]; for 500x40_000 dataset (resized)
default chunksize=(10,15625) [596KB]; for 500x4_000_000 dataset (not resized)
user defined chunksize=(10,40_000) [1.526MB]; for 500x4_000_000 dataset (not resized)
Larger chunks improve read performance, but speed with the default values is pretty fast. (Chunk size has a very small effect on write performance.) Output for all 3 below.
dataset chunkshape: (40, 625)
Time to read first row: 0.28
Time to read last row: 0.28
dataset chunkshape: (10, 15625)
Time to read first row: 0.05
Time to read last row: 0.06
dataset chunkshape: (10, 40000)
Time to read first row: 0.00
Time to read last row: 0.02
Code to create my test file below:
import time

import h5py
import numpy as np

fname = "test_file.h5"  # illustrative name; not given in the original

with h5py.File(fname, 'w') as fout:
    blocksize = 40_000
    n_blocks = 100
    n_rows = 5_000
    block_to_write = np.random.random((n_rows, blocksize))
    start = time.time()
    for cnt in range(n_blocks):
        incr = time.time()
        print(f'Working on loop: {cnt}', end='')
        if "data" not in fout:
            fout.create_dataset("data", shape=(n_rows, blocksize),
                                maxshape=(n_rows, None))  # , chunks=(10, blocksize))
        else:
            fout["data"].resize((fout["data"].shape[1] + blocksize), axis=1)
        fout["data"][:, cnt*blocksize:(cnt+1)*blocksize] = block_to_write
        print(f' - Time to add block: {time.time()-incr:.2f}')

print(f'Done creating file: {fname}')
print(f'Time to create {n_blocks}x{blocksize:,} columns: {time.time()-start:.2f}\n')
Code to read 2 different arrays from the test file below:
import time

import h5py

fname = "test_file.h5"  # same illustrative name as above

with h5py.File(fname, 'r') as fin:
    print(f'dataset shape: {fin["data"].shape}')
    print(f'dataset chunkshape: {fin["data"].chunks}')
    start = time.time()
    data = fin["data"][0, :]
    print(f'Time to read first row: {time.time()-start:.2f}')
    start = time.time()
    data = fin["data"][-1, :]
    print(f'Time to read last row: {time.time()-start:.2f}')
I have read numerous threads on similar topics on the forum. However, I believe what I am asking here is not a duplicate question.
I am reading a very large dataset (22 GB) in CSV format, with 350 million rows. I am trying to read the dataset in chunks, based on the solution provided by that link.
My current code is as follows.
import pandas as pd

def Group_ID_Company(chunk_of_dataset):
    return chunk_of_dataset.groupby(['id', 'company'])[['purchasequantity', 'purchaseamount']].sum()

chunk_size = 9000000
chunk_skip = 1

transactions_dataset_DF = pd.read_csv('transactions.csv', skiprows = range(1, chunk_skip), nrows = chunk_size)
Group_ID_Company(transactions_dataset_DF.reset_index()).to_csv('Group_ID_Company.csv')

for i in range(0, 38):
    chunk_skip += chunk_size
    transactions_dataset_DF = pd.read_csv('transactions.csv', skiprows = range(1, chunk_skip), nrows = chunk_size)
    Group_ID_Company(transactions_dataset_DF.reset_index()).to_csv('Group_ID_Company.csv', mode = 'a', header = False)
There is no issue with the code; it runs fine. But the groupby(['id', 'company'])[['purchasequantity', 'purchaseamount']].sum() only runs over 9000000 rows at a time, which is the declared chunk_size, whereas I need that statement to run over the entire dataset, not chunk by chunk.
The reason is that, when it is run chunk by chunk, only one chunk gets processed at a time; the other rows belonging to the same groups are scattered all over the dataset and end up in other chunks.
A possible solution is to run the code again on the newly generated "Group_ID_Company.csv". By doing so, the code would go through the new dataset once more and sum() the required columns. However, I am thinking there may be another (better) way of achieving that.
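A hedged sketch of that two-pass idea using pandas' own chunksize (column and file names as in the question; not tested on the actual data) would be to aggregate each chunk and then aggregate the partial results once more:

import pandas as pd

partials = []
for chunk in pd.read_csv('transactions.csv', chunksize=9_000_000):
    # Partial aggregation within each chunk
    partials.append(
        chunk.groupby(['id', 'company'])[['purchasequantity', 'purchaseamount']].sum()
    )

# Second pass: combine the per-chunk sums so groups split across chunks are merged
result = pd.concat(partials).groupby(level=['id', 'company']).sum()
result.to_csv('Group_ID_Company.csv')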
The solution to your problem is probably Dask. You may watch the introductory video, read the examples, and try them online in a live session (in JupyterLab).
The answer from MarianD worked perfectly; I am answering to share the solution code here.
Moreover, Dask is able to utilize all cores equally, whereas pandas was using only one core at 100%. That is another benefit of Dask over pandas I have noticed.
import dask.dataframe as dd
transactions_dataset_DF = dd.read_csv('transactions.csv')
Group_ID_Company_DF = transactions_dataset_DF.groupby(['id', 'company'])[['purchasequantity', 'purchaseamount']].sum().compute()
Group_ID_Company_DF.to_csv('Group_ID_Company.csv')
# to clear the memory
transactions_dataset_DF = None
Group_ID_Company_DF = None
Dask was able to read all 350 million rows (20 GB of data) at once, which I had not achieved with pandas previously; I had to create 37 chunks to process the entire dataset, and it took almost 2 hours to complete the processing using pandas.
However, with Dask it only took around 20 minutes to read, process, and then save the new dataset, all at once.
I have a large text file (>1 GB) of three comma-separated values that I want to read into a pandas DataFrame in chunks. An example of the DataFrame is below:
I'd like to filter through this file while reading it in and output a "clean" version. One issue I have is that some timestamps are out of order, but the problem is usually quite local (a tick is usually out of order by a few slots before or after where it belongs). Are there any ways to do localized, "sliding window" sorting?
Also, as I'm fairly new to Python and am learning about the I/O methods, I'm unsure of the best class/method to use for filtering large data files. TextIOBase?
This is a really interesting question, as the data's big enough to not fit in memory easily.
First of all, about I/O: if it's a CSV I'd use a standard library csv.reader() object, like so (I'm assuming Python 3):
import csv

with open('big.csv', newline='') as f:
    for row in csv.reader(f):
        ...
Then I'd probably keep a sliding window of rows in a collections.deque(maxlen=WINDOW_SIZE) instance, with the window size set to maybe 20 based on your description. Read the first WINDOW_SIZE rows into the deque, then enter the main read loop, which would output the left-most item (row) in the deque and then append the current row.
After appending each row, if the current row's timestamp comes before the timestamp of the previous row (window[-2]), sort the deque. You can't sort a deque in place, but you can do something like:
window = collections.deque(sorted(window), maxlen=WINDOW_SIZE)
Python's Timsort algorithm handles already-sorted runs efficiently, so this should be very fast (linear time).
If the window size and the number of out-of-order rows are small (as it sounds like they can be), I believe the overall algorithm will be O(N) where N is the number of rows in the data file, so linear time.
UPDATE: I wrote some demo code to generate a file like this and then sort it using the above technique -- see this Gist, tested on Python 3.5. It's much faster than the sort utility on the same data, and also faster than Python's sorted() function after about N = 1,000,000. Incidentally the function that generates a demo CSV is significantly slower than the sorting code. :-) My results timing process_sliding() for various N (definitely looks linear-ish):
N = 1,000,000: 3.5s
N = 2,000,000: 6.6s
N = 10,000,000: 32.9s
For reference, here's the code for my version of process_sliding():
import collections
import csv

def process_sliding(in_filename, out_filename, window_size=20):
    with open(in_filename, newline='') as fin, \
         open(out_filename, 'w', newline='') as fout:
        reader = csv.reader(fin)
        writer = csv.writer(fout)
        # Pre-fill the window with the first window_size rows, sorted
        first_window = sorted(next(reader) for _ in range(window_size))
        window = collections.deque(first_window, maxlen=window_size)
        for row in reader:
            writer.writerow(window.popleft())
            window.append(row)
            # If the new row's timestamp is out of order, re-sort the (small) window
            if row[0] < window[-2][0]:
                window = collections.deque(sorted(window), maxlen=window_size)
        # Flush the remaining rows in the window
        for row in window:
            writer.writerow(row)
I have a tab-separated .txt file that stores numbers as a matrix. The number of lines is 904,652 and the number of columns is 26,600 (tab-separated). The total size of the file is around 48 GB. I need to load this file as a matrix and take its transpose to extract training and testing data. I am using Python with the pandas and sklearn packages. I have a 500 GB memory server, but it is not enough to load the file with pandas. Could anyone help me with my problem?
The loading code is below:
import pandas

csv_delimiter = '\t'  # the file is tab-separated (not defined in the original snippet)

def open_with_pandas_read_csv(filename):
    df = pandas.read_csv(filename, sep=csv_delimiter, header=None)
    data = df.values
    return data
If your server has 500GB of RAM, you should have no problem using numpy's loadtxt method.
data = np.loadtxt("path_to_file").T
The fact that it's a text file makes it a little harder. As a first step, I would create a binary file out of it, where each number takes a constant number of bytes. It will probably also reduce the file size.
Then, I would make a number of passes, and in each pass I would write N rows in the output file.
Pseudo code:
for p in range(columns / N):              # one pass per block of N output rows
    transposed_rows = [ [], .... , [] ]   # length = N
    for row in range(rows):
        x = read_N_numbers_from_row_of_input_matrix(row, p*N)
        for i in range(N):
            transposed_rows[i].append(x[i])
    for i in range(N):
        append_to_output_file(transposed_rows[i])
The transformation to a binary file makes it possible to read a sequence of numbers from the middle of a row.
N should be small enough for transposed_rows to fit in memory, i.e. N*rows should be reasonable.
N should be large enough so that we take advantage of caching. If N=1, we waste a lot of reads to generate a single row of output.
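A hedged sketch of this multi-pass block transpose, using np.memmap over a pre-converted binary file (file names, dtype, and block size are illustrative, not from the answer), might look like:

import numpy as np

rows, cols, N = 904_652, 26_600, 1_000  # N = output rows written per pass

# The text matrix is assumed to have been converted to a flat float32 binary file
src = np.memmap("matrix.bin", dtype=np.float32, mode="r", shape=(rows, cols))
dst = np.memmap("matrix_T.bin", dtype=np.float32, mode="w+", shape=(cols, rows))

for start in range(0, cols, N):
    stop = min(start + N, cols)
    # Read a block of N columns from the source and write it as N rows of the output
    dst[start:stop, :] = src[:, start:stop].T

dst.flush()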
It sounds to me like you are working on genetic data. If so, consider using --transpose with plink; it is very quick: http://pngu.mgh.harvard.edu/~purcell/plink/dataman.shtml#recode
I found a solution (I believe there are still more efficient and logical ones) on Stack Overflow. The np.fromfile() method loads huge files more efficiently than np.loadtxt() and np.genfromtxt(), and even pandas.read_csv(). It took just around 274 GB without any modification or compression. I thank everyone who tried to help me with this issue.
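For illustration, a hedged sketch of the np.fromfile() approach mentioned above (the file name and dtype are assumptions; sep=" " makes NumPy parse whitespace-separated text, which covers tabs and newlines):

import numpy as np

rows, cols = 904_652, 26_600

# Parse the whitespace/tab-separated text file into a flat array, then reshape
data = np.fromfile("matrix.txt", dtype=np.float64, sep=" ").reshape(rows, cols)

# Transpose to extract training and testing splits
X = data.T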