Efficient way to process CSV file into a numpy array - python

The CSV file may not be clean (some lines have an inconsistent number of elements); unclean lines need to be disregarded.
String manipulation is required during processing.
Example input:
20150701 20:00:15.173,0.5019,0.91665
Desired output: float32 (pseudo-date, seconds in the day, f3, f4)
0.150701 72015.173 0.5019 0.91665 (+ the trailing trash floats usually get)
The CSV file is also very big: the numpy array would be expected to take 5-10 GB in memory, and the CSV file itself is over 30 GB.
Looking for an efficient way to process the CSV file and end up with a numpy array.
Current solution: use the csv module, process the file line by line, and use a list() as a buffer that later gets turned into a numpy array with asarray(). The problem is that during the conversion memory consumption doubles, and the copying adds execution overhead.
Numpy's genfromtxt and loadtxt don't appear to be able to process the data as desired.

If you know in advance how many rows are in the data, you could dispense with the intermediate list and write directly to the array.
import numpy as np

no_rows = 5
no_columns = 4
a = np.zeros((no_rows, no_columns), dtype=np.float32)  # np.float is removed in modern numpy

with open('myfile') as f:
    for i, line in enumerate(f):
        a[i, :] = cool_function_that_returns_formatted_data(line)
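If the row count isn't known up front, a cheap extra pass over the file can count the valid lines before allocating. A minimal sketch of the two-pass idea, with made-up sample rows and field count (here the lines come from an in-memory list; a real file would simply be iterated twice):

```python
import numpy as np

def load_clean(lines, n_fields=3):
    """Two passes: count valid rows, then fill a preallocated float32 array,
    disregarding lines with the wrong number of fields."""
    lines = list(lines)  # with a real file, re-open or seek(0) between passes
    n = sum(1 for ln in lines if len(ln.split(',')) == n_fields)
    a = np.zeros((n, n_fields), dtype=np.float32)
    i = 0
    for ln in lines:
        parts = ln.split(',')
        if len(parts) != n_fields:
            continue  # disregard unclean lines
        a[i, :] = [float(p) for p in parts]
        i += 1
    return a

rows = ["1.0,2.0,3.0", "bad,line", "4.0,5.0,6.0"]
arr = load_clean(rows)
print(arr.shape)  # (2, 3)
```

The second pass costs extra I/O, but it avoids both the intermediate list and the doubled peak memory of asarray().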

Did you consider using pandas read_csv (with engine='c')?
I find it one of the best and easiest ways of handling CSV. I worked with a 4 GB file and it worked for me.
import pandas as pd
df = pd.read_csv('abc.csv', engine='c')  # the engine name must be lowercase 'c'
print(df.head(10))

I think the I/O capability of pandas is the best way to get data into a numpy array. Specifically, the read_csv method will read into a pandas DataFrame. You can then access the underlying numpy array using the as_matrix method of the returned DataFrame (in modern pandas, the to_numpy method).
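For a file too large for one read, read_csv's chunksize parameter yields DataFrames piece by piece, and each chunk can be copied into a preallocated array. A sketch with made-up data and a shape assumed known in advance (to_numpy is the modern replacement for as_matrix):

```python
import io
import numpy as np
import pandas as pd

csv_text = "1.0,2.0\n3.0,4.0\n5.0,6.0\n7.0,8.0\n"  # stands in for the real file

# Preallocate the destination (row count assumed known, e.g. from a line count).
out = np.empty((4, 2), dtype=np.float32)

row = 0
for chunk in pd.read_csv(io.StringIO(csv_text), header=None,
                         engine='c', chunksize=2):
    vals = chunk.to_numpy(dtype=np.float32)  # modern name for as_matrix()
    out[row:row + len(vals)] = vals
    row += len(vals)

print(out.sum())  # 36.0
```

Peak memory stays around one chunk plus the destination array, rather than two full copies of the data.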

Related

How can I create a Numpy Array that is much bigger than my RAM from 1000s of CSV files?

I have 1000s of CSV files that I would like to append and create one big numpy array. The problem is that the numpy array would be much bigger than my RAM. Is there a way of writing a bit at a time to disk without having the entire array in RAM?
Also is there a way of reading only a specific part of the array from disk at a time?
When working with numpy and large arrays, there are several approaches depending on what you need to do with that data.
The simplest answer is to use less data. If your data has lots of repeating elements, it is often possible to use a sparse array from scipy because the two libraries are heavily integrated.
Another answer (IMO: the correct solution to your problem) is to use a memory mapped array. This will let numpy only load the necessary parts to ram when needed, and leave the rest on disk. The files containing the data can be simple binary files created using any number of methods, but the built-in python module that would handle this is struct. Appending more data would be as simple as opening the file in append mode, and writing more bytes of data. Make sure that any references to the memory mapped array are re-created any time more data is written to the file so the information is fresh.
Finally, there is compression. Numpy can compress arrays with savez_compressed, which can then be opened with numpy.load. Importantly, compressed numpy files cannot be memory-mapped and must be loaded into memory entirely. Loading one column at a time may get you under the threshold, and this could similarly be applied to the other methods to reduce memory usage. Numpy's built-in compression will only save disk space, not memory. There may exist other libraries that perform some sort of streaming compression, but that is beyond the scope of my answer.
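A minimal round trip with numpy's built-in compression, for illustration (note that loading decompresses the whole array into memory):

```python
import numpy as np

arr = np.arange(1000, dtype=np.float64)

# Save compressed; low-entropy or repetitive data compresses well on disk.
np.savez_compressed('data.npz', column=arr)

# np.load on an .npz returns an archive; indexing by name decompresses
# that array fully into memory (no memory-mapping for compressed files).
with np.load('data.npz') as archive:
    restored = archive['column']

print(np.array_equal(arr, restored))  # True
```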
Here is an example of putting binary data into a file then opening it as a memory-mapped array:
import numpy as np

# open a file for data of a single column
with open('column_data.dat', 'wb') as f:
    # for 1024 "csv files"
    for _ in range(1024):
        csv_data = np.random.rand(1024).astype(np.float64)  # represents one column of data
        f.write(csv_data.tobytes())

# open the array as a memory-mapped file
column_mmap = np.memmap('column_data.dat', dtype=np.float64)

# read some data
print(np.mean(column_mmap[0:1024]))

# write some data
column_mmap[0:512] = .5

# Deletion closes the memory-mapped file and flushes changes to disk.
# del isn't strictly needed, as python will garbage-collect objects that are no
# longer accessible. If, for example, you intend to read the entire array,
# you will need to periodically make sure the array gets deleted and re-created,
# or the entire thing will end up in memory again. This could be done with a
# function that loads and operates on part of the array; when the function
# returns and the memory-mapped array local to the function goes out of scope,
# it will be garbage collected. Calling such a function would not cause a
# build-up of memory usage.
del column_mmap

# write some more data to the array (not while the mmap is open)
with open('column_data.dat', 'ab') as f:
    # for 1024 "csv files"
    for _ in range(1024):
        csv_data = np.random.rand(1024).astype(np.float64)  # represents one column of data
        f.write(csv_data.tobytes())

Why is it not possible to read a whole 6GB csv file into memory (64GB) with numpy

I have a CSV file whose size is 6.8 GB, and I am not able to read it into memory into a numpy array, although I have 64 GB of RAM.
The CSV file has 10 million lines; each line has 131 fields (a mix of int and float).
I tried to read it into a float numpy array:
import numpy as np
data = np.genfromtxt('./data.csv', delimiter=';')
It failed due to memory.
When I read just one line and get its size:
data = np.genfromtxt('./data.csv', delimiter=';', max_rows=1)
data.nbytes
I get 1048 bytes.
So I would expect that 10,000,000 * 1048 B = 10.48 GB, which should fit in memory without any problem. Why doesn't it work?
Finally, I tried to optimize the array in memory by defining the types:
data = np.genfromtxt('./data.csv', delimiter=';', max_rows=1,
                     dtype="i1,i1,f4,f4,....,i2,f4,f4,f4")
data.nbytes
Now I get only 464 B per line, so the whole array would be only 4.56 GB, but it is still not possible to load it into memory.
Do you have any idea?
I need to use this array in Keras.
Thank you
genfromtxt is regular python code that converts the data to a numpy array only as a final step. During this last step, the RAM needs to hold a giant python list as well as the resultant numpy array, both at the same time. Maybe you could try numpy.fromfile, or the pandas csv reader. Since you know the type of data per column and the number of lines, you can also preallocate a numpy array yourself and fill it using a simple for-loop.
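A sketch of the preallocate-and-fill suggestion, where a tiny 3-field dtype and two rows of made-up data stand in for the real 131-field spec (StringIO plays the role of the open file):

```python
import io
import numpy as np

csv_text = "1;2;3.5\n4;5;6.5\n"
n_rows = 2  # known line count

# Preallocate with a compact per-column structured dtype instead of all-float64.
dt = np.dtype("i1,i2,f4")  # illustrative; substitute the real field spec
data = np.empty(n_rows, dtype=dt)

with io.StringIO(csv_text) as f:  # a real file object works the same way
    for i, line in enumerate(f):
        fields = line.rstrip('\n').split(';')
        data[i] = (int(fields[0]), int(fields[1]), float(fields[2]))

print(data.nbytes)  # 2 rows * (1 + 2 + 4) bytes = 14
```

The array is built at its final size, so no intermediate python list ever exists.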

numpy won't print full (unsummarized) array

I've looked at this response to try and get numpy to print the full array rather than a summarized view, but it doesn't seem to be working.
I have a CSV with named headers. Here are the first five rows
v0 v1 v2 v3 v4
1001 5529 24 56663 16445
1002 4809 30.125 49853 28069
1003 407 20 28462 8491
1005 605 19.55 75423 4798
1007 1607 20.26 79076 12962
I'd like to read in the data and be able to view it fully. I tried doing this:
import numpy as np
np.set_printoptions(threshold=np.inf)
main_df2=np.genfromtxt('file location', delimiter=",")
main_df2[0:3,:]
However this still returns the truncated array, and the performance seems greatly slowed. What am I doing wrong?
OK, in a regular Python session (I usually use Ipython instead), I set the print options, and made a large array:
>>> np.set_printoptions(threshold=np.inf, suppress=True)
>>> x=np.random.rand(25000,5)
When I execute the next line, it spends about 21 seconds formatting the array, and then writes the resulting string to the screen (with more lines than fit the terminal's window buffer).
>>> x
This is the same as
>>> print(repr(x))
The internal storage for x is a buffer of floats (which you can 'see' with x.tostring()). To print x it has to format it, creating a multiline string that contains a print representation of each number, all 125,000 of them. The result of repr(x) is a string 1,850,000 characters long, 25,000 lines. This is what takes 21 seconds. Displaying it on the screen is just limited by the terminal scroll speed.
I haven't looked at the details, but I think the numpy formatting is mostly written in Python, not compiled. It's designed more for flexibility than speed. It's normal to want to see 10-100 lines of an array. 25000 lines is an unusual case.
Somewhat curiously, writing this array as a csv is fast, with a minimal delay:
>>> np.savetxt('test.txt', x, fmt='%10f', delimiter=',')
And I know what savetxt does - it iterates on rows, and does a file write
f.write(fmt % tuple(row))
Evidently all the bells-n-whistles of the regular repr are expensive. It can summarize, it can handle many dimensions, it can handle complicated dtypes, etc. Simply formatting each row with a known fixed format is not the time consuming step.
Actually that savetxt route might be more useful, as well as fast. You can control the display format, and you can view the resulting text file in an editor or terminal window at your leisure. You won't be limited by the scroll buffer of your terminal window. But how will this savetxt file be different from the original csv?
I'm surprised you get an array at all, as your example does not use ',' as the delimiter. But maybe you forgot to include commas in your example file.
I would use the DataFrame functionality of pandas if I work with csv data. It uses numpy under the hood, so all numpy operation work on pandas DataFrames.
Pandas has many tricks for operating with table like data.
import pandas as pd
df = pd.read_csv('nothing.txt')
#==============================================================================
# next line remove blanks from the column names
#==============================================================================
df.columns = [name.strip(' ') for name in df.columns]
# pd.set_option('display.height', 1000)  # 'display.height' was removed in newer pandas
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
print(df)
When I copied and pasted the data here it was open in Excel, but the file is a CSV.
I'm doing a class exercise and we have to use numpy. One thing I noticed was that the results were quite illegible thanks to the scientific notation, so I did the following and things are much smoother:
np.set_printoptions(threshold=100000, suppress=True)
The suppress statement saved me a lot of formatting. The performance does suffer a lot when I change the threshold to something like 'nan' or inf, and I'm not sure why.

large csv file makes segmentation fault for numpy.genfromtxt

I'd really like to create a numpy array from a csv file, however, I'm having issues when the file is ~50k lines long (like the MNIST training set). The file I'm trying to import looks something like this:
0.0,0.0,0.0,0.5,0.34,0.24,0.0,0.0,0.0
0.0,0.0,0.0,0.4,0.34,0.2,0.34,0.0,0.0
0.0,0.0,0.0,0.34,0.43,0.44,0.0,0.0,0.0
0.0,0.0,0.0,0.23,0.64,0.4,0.0,0.0,0.0
It works fine for something that's 10k lines long, like the validation set:
import numpy as np
csv = np.genfromtxt("MNIST_valid_set_data.csv",delimiter = ",")
If I do the same with the training data (larger file), I'll get a c-style segmentation fault. Does anyone know any better ways besides breaking the file up and then piecing it together?
The end result is that I'd like to pickle the arrays into a similar mnist.pkl.gz file but I can't do that if I can't read in the data.
Any help would be greatly appreciated.
I think you really want to track down the actual problem and solve it, rather than just work around it, because I'll bet you have other problems with your NumPy installation that you're going to have to deal with eventually.
But, since you asked for a workaround that's better than manually splitting the files, reading them, and merging them, here are two:
First, you can split the files programmatically and dynamically, instead of manually. This avoids wasting a lot of your own human effort, and also saves the disk space needed for those copies, even though it's conceptually the same thing you already know how to do.
As the genfromtxt docs make clear, the fname argument can be a pathname, or a file object (open in 'rb' mode), or just a generator of lines (as bytes). Of course a file object is itself a generator of lines, but so is, say, an islice of a file object, or a group from a grouper. So:
import numpy as np
from more_itertools import grouper

def getfrombigtxt(fname, *args, **kwargs):
    with open(fname, 'rb') as f:
        return np.vstack([np.genfromtxt(group, *args, **kwargs)
                          for group in grouper(f, 5000, fillvalue=b'')])
If you don't want to install more_itertools, you can also just copy the 2-line grouper implementation from the Recipes section of the itertools docs, or even inline zipping the iterators straight into your code.
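For reference, the grouper recipe from the itertools docs is essentially:

```python
from itertools import zip_longest

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks: grouper('ABCDEFG', 3, 'x') -> ABC DEF Gxx"
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

print([''.join(g) for g in grouper('ABCDEFG', 3, 'x')])  # ['ABC', 'DEF', 'Gxx']
```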
Alternatively, you can parse the CSV file with the stdlib's csv module instead of with NumPy:
import csv
import numpy as np

def getfrombigtxt(fname, delimiter=','):
    with open(fname, 'r') as f:  # note text mode, not binary
        rows = [list(map(float, row)) for row in csv.reader(f)]
        return np.vstack(rows)
This is obviously going to be a lot slower… but if we're talking about turning 50ms of processing into 1000ms, and you only do it once, who cares?

Memory runs out when creating a numpy array from a list of lists in python

I want to read in a tab separated value file and turn it into a numpy array. The file has 3 lines. It looks like this:
Ann1 Bill1 Chris1 Dick1
Ann2 Bill2 Chris2 "Dick2
Ann3 Bill3 Chris3 Dick3
So, I used this simple line of code:
import csv
import numpy as np

new_list = []
with open('/home/me/my_tsv.tsv') as tsv:
    for line in csv.reader(tsv, delimiter="\t"):
        new_list.append(line)
new = np.array(new_list)
print(new.shape)
And because of that pesky " character, the shape of my fancy new numpy array is
(2,4)
That's not right! So, the solution is to include the argument, quoting, in the csv.reader call, as follows:
for line in csv.reader(tsv, delimiter="\t", quoting=csv.QUOTE_NONE):
This is great! Now my dimension is
(3,4)
as I hoped.
Now comes the real problem -- in all actuality, I have a 700,000 X 10 .tsv file, with long fields. I can read the file into Python with no problems, just like in the above situation. But, when I get to the step where I create new = np.array(job_posts), my wimpy 16 GB laptop cries and says...
MEMORY ERROR
Obviously, I cannot simultaneously have these two objects in memory: the Python list of lists and the numpy array.
My question is therefore this: how can I read this file directly into a numpy array, perhaps using genfromtxt or something similar...yet also achieve what I've achieved by using the quoting=csv.QUOTE_NONE argument in the csv.reader?
So far, I've found no analogy to the quoting=csv.QUOTE_NONE option available anywhere in the standard ways to read in a tsv file using numpy.
This is a tough little problem. I though about iteratively building the numpy array during the reading in process, but I can't figure it out.
I tried
nparray = np.genfromtxt("/home/me/my_tsv.tsv", delimiter="/t")
print(nparray.shape)
and got
(3,0)
If anyone has any advice I'd be greatly appreciative. Also, I know the real answer is probably to use Pandas...but at this point I'm committed to using numpy for a lot of compelling reasons...
Thanks in advance.
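One possible sketch of the combination being asked for: csv.reader with quoting=csv.QUOTE_NONE feeding a preallocated array row by row, so the full list of lists never exists. The data below is the three-line example from the question, the shape is assumed known (e.g. from a cheap counting pass), and StringIO stands in for the real file:

```python
import csv
import io
import numpy as np

tsv_text = ('Ann1\tBill1\tChris1\tDick1\n'
            'Ann2\tBill2\tChris2\t"Dick2\n'
            'Ann3\tBill3\tChris3\tDick3\n')
n_rows, n_cols = 3, 4  # assumed known in advance

# Preallocate once; rows are written in place instead of buffered in a list.
arr = np.empty((n_rows, n_cols), dtype=object)

with io.StringIO(tsv_text) as f:  # a real open(...) file works the same
    reader = csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
    for i, row in enumerate(reader):
        arr[i, :] = row

print(arr.shape)  # (3, 4)
```

With QUOTE_NONE the stray " is kept as an ordinary character, so the row count comes out right, and peak memory is roughly the array plus one row.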
