So I have been trying to read a 3.2 GB file into memory using pandas' read_csv, but I kept running into what looked like a memory leak: my memory usage would spike to 90%+.
As alternatives:
I tried defining dtype to avoid keeping the data in memory as strings, but saw similar behaviour.
I tried numpy's CSV reader, expecting different results, but was definitely wrong about that.
I tried reading the file line by line and ran into the same problem, only much more slowly.
I recently moved to Python 3, so I thought there could be a bug there, but I saw similar results on Python 2 + pandas.
The file in question is train.csv from the Kaggle Grupo Bimbo competition.
System info:
RAM: 16GB, Processor: i7 8cores
Let me know if you would like to know anything else.
Thanks :)
EDIT 1: It's a memory spike, not a leak (sorry, my bad).
EDIT 2: Sample of the csv file
Semana,Agencia_ID,Canal_ID,Ruta_SAK,Cliente_ID,Producto_ID,Venta_uni_hoy,Venta_hoy,Dev_uni_proxima,Dev_proxima,Demanda_uni_equil
3,1110,7,3301,15766,1212,3,25.14,0,0.0,3
3,1110,7,3301,15766,1216,4,33.52,0,0.0,4
3,1110,7,3301,15766,1238,4,39.32,0,0.0,4
3,1110,7,3301,15766,1240,4,33.52,0,0.0,4
3,1110,7,3301,15766,1242,3,22.92,0,0.0,3
EDIT 3: number of rows in the file: 74,180,465
Other than a simple pd.read_csv('filename', low_memory=False), I have also tried:
from numpy import genfromtxt
my_data = genfromtxt('data/train.csv', delimiter=',')
UPDATE
The code below just worked, but I still want to get to the bottom of this; there must be something wrong.
import pandas as pd
import gc

data = pd.DataFrame()
data_iterator = pd.read_csv('data/train.csv', chunksize=100000)
for sub_data in data_iterator:
    data = data.append(sub_data)  # append returns a new frame, so the result must be reassigned
    gc.collect()
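For reference, the same chunked read can be written more compactly by concatenating the chunks once at the end; a rough sketch of that variant:
import pandas as pd

# read the file in fixed-size chunks and stitch them together with a single concat
chunks = [chunk for chunk in pd.read_csv('data/train.csv', chunksize=100000)]
data = pd.concat(chunks, ignore_index=True)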
EDIT: Piece of code that worked.
Thanks for all the help, guys. I had messed up my dtypes by using Python types instead of numpy ones; once I fixed that, the code below worked like a charm.
import numpy as np
import pandas as pd

dtypes = {'Semana': np.int8,
          'Agencia_ID': np.int8,
          'Canal_ID': np.int8,
          'Ruta_SAK': np.int8,
          'Cliente_ID': np.int8,
          'Producto_ID': np.int8,
          'Venta_uni_hoy': np.int8,
          'Venta_hoy': np.float16,
          'Dev_uni_proxima': np.int8,
          'Dev_proxima': np.float16,
          'Demanda_uni_equil': np.int8}
data = pd.read_csv('data/train.csv', dtype=dtypes)
This brought the memory consumption down to just under 4 GB.
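A quick way to sanity-check the savings is to ask pandas what each column costs after the typed read; a rough check, assuming data is the frame produced above:
# bytes used per column (and the index), then the total in GiB
print(data.memory_usage(deep=True))
print(data.memory_usage(deep=True).sum() / 1024 ** 3)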
A file stored in memory as text is not as compact as a compressed binary format, but it is relatively compact data-wise. If it's a simple ASCII file, aside from any file header information, each character is only 1 byte. Python strings behave similarly: there's some overhead for Python's internal bookkeeping, but each extra character adds only 1 byte (based on testing with __sizeof__). Once you start converting to numeric types and collections (lists, arrays, data frames, etc.) the overhead grows. A list, for example, must store a type and a value for each position, whereas a string only stores a value.
>>> s = '3,1110,7,3301,15766,1212,3,25.14,0,0.0,3\r\n'
>>> l = [3,1110,7,3301,15766,1212,3,25.14,0,0.0,3]
>>> s.__sizeof__()
75
>>> l.__sizeof__()
128
A little bit of testing (assuming __sizeof__ is accurate):
import numpy as np
import pandas as pd
s = '1,2,3,4,5,6,7,8,9,10'
print ('string: '+str(s.__sizeof__())+'\n')
l = [1,2,3,4,5,6,7,8,9,10]
print ('list: '+str(l.__sizeof__())+'\n')
a = np.array([1,2,3,4,5,6,7,8,9,10])
print ('array: '+str(a.__sizeof__())+'\n')
b = np.array([1,2,3,4,5,6,7,8,9,10], dtype=np.dtype('u1'))
print ('byte array: '+str(b.__sizeof__())+'\n')
df = pd.DataFrame([1,2,3,4,5,6,7,8,9,10])
print ('dataframe: '+str(df.__sizeof__())+'\n')
returns:
string: 53
list: 120
array: 136
byte array: 106
dataframe: 152
Based on your second chart, it looks as though there's a brief period in time where your machine allocates an additional 4.368 GB of memory, which is approximately the size of your 3.2 GB dataset (assuming 1GB overhead, which might be a stretch).
I tried to track down a place where this could happen and haven't been super successful. Perhaps you can find it, though, if you're motivated. Here's the path I took:
This line reads:
def read(self, nrows=None):
    if nrows is not None:
        if self.options.get('skip_footer'):
            raise ValueError('skip_footer not supported for iteration')

    ret = self._engine.read(nrows)
Here, _engine references PythonParser.
That, in turn, calls _get_lines().
That makes calls to a data source.
Which looks like it reads the data in as strings from something relatively standard (see here), like TextIOWrapper.
So things are getting read in as standard text and then converted, which explains the slow ramp.
What about the spike? I think that's explained by these lines:
ret = self._engine.read(nrows)

if self.options.get('as_recarray'):
    return ret

# May alter columns / col_dict
index, columns, col_dict = self._create_index(ret)

df = DataFrame(col_dict, columns=columns, index=index)
ret becomes all the components of a data frame.
self._create_index() breaks ret apart into these components:
def _create_index(self, ret):
    index, columns, col_dict = ret
    return index, columns, col_dict
So far, everything can be done by reference, and the call to DataFrame() continues that trend (see here).
So, if my theory is correct, DataFrame() is either copying the data somewhere, or _engine.read() is doing so somewhere along the path I've identified.
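One way to probe that theory, independent of the read_csv path above, is to build a frame from a plain numpy array and check whether the resulting column still shares memory with it; a small sketch:
import numpy as np
import pandas as pd

col = np.arange(1000000, dtype=np.int64)
df = pd.DataFrame({'a': col})

# True means the frame is viewing the original buffer, False means it copied;
# the outcome depends on the pandas version and how the frame is constructed
print(np.shares_memory(col, df['a'].to_numpy()))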
Related
I've looked at this response to try and get numpy to print the full array rather than a summarized view, but it doesn't seem to be working.
I have a CSV with named headers. Here are the first five rows
v0 v1 v2 v3 v4
1001 5529 24 56663 16445
1002 4809 30.125 49853 28069
1003 407 20 28462 8491
1005 605 19.55 75423 4798
1007 1607 20.26 79076 12962
I'd like to read in the data and be able to view it fully. I tried doing this:
import numpy as np
np.set_printoptions(threshold=np.inf)
main_df2=np.genfromtxt('file location', delimiter=",")
main_df2[0:3,:]
However, this still returns a truncated array, and performance seems greatly slowed. What am I doing wrong?
OK, in a regular Python session (I usually use IPython instead), I set the print options and made a large array:
>>> np.set_printoptions(threshold=np.inf, suppress=True)
>>> x=np.random.rand(25000,5)
When I execute the next line, it spends about 21 seconds formatting the array, and then writes the resulting string to the screen (with more lines than fit the terminal's window buffer).
>>> x
This is the same as
>>> print(repr(x))
The internal storage for x is a buffer of floats (which you can 'see' with x.tostring()). To print x it has to be formatted: a multiline string is built containing a print representation of each number, all 125000 of them. The result of repr(x) is a string 1850000 characters long, spanning 25000 lines. That is what takes 21 seconds. Displaying it on the screen is then just limited by the terminal's scroll speed.
I haven't looked at the details, but I think the numpy formatting is mostly written in Python, not compiled. It's designed more for flexibility than speed. It's normal to want to see 10-100 lines of an array. 25000 lines is an unusual case.
Somewhat curiously, writing this array as a csv is fast, with a minimal delay:
>>> np.savetxt('test.txt', x, fmt='%10f', delimiter=',')
And I know what savetxt does - it iterates on rows, and does a file write
f.write(fmt % tuple(row))
Evidently all the bells-n-whistles of the regular repr are expensive. It can summarize, it can handle many dimensions, it can handle complicated dtypes, etc. Simply formatting each row with a known fixed format is not the time consuming step.
Actually that savetxt route might be more useful, as well as fast. You can control the display format, and you can view the resulting text file in an editor or terminal window at your leisure. You won't be limited by the scroll buffer of your terminal window. But how will this savetxt file be different from the original csv?
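If you want to measure the gap yourself, here is a rough timing sketch (the numbers will vary by machine and numpy version):
import time
import numpy as np

np.set_printoptions(threshold=np.inf, suppress=True)
x = np.random.rand(25000, 5)

t0 = time.perf_counter()
s = repr(x)                                        # build the full formatted string
t1 = time.perf_counter()
np.savetxt('test.txt', x, fmt='%10f', delimiter=',')
t2 = time.perf_counter()

print('repr:    %.1f s' % (t1 - t0))
print('savetxt: %.1f s' % (t2 - t1))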
I'm surprised you get an array at all, as your example does not use ',' as the delimiter. But maybe you just forgot to include the commas in your example file.
I would use the DataFrame functionality of pandas when working with CSV data. It uses numpy under the hood, so all numpy operations work on pandas DataFrames.
Pandas also has many tricks for operating on table-like data.
import pandas as pd
df = pd.read_csv('nothing.txt')
#==============================================================================
# next line remove blanks from the column names
#==============================================================================
df.columns = [name.strip(' ') for name in df.columns]
pd.set_option('display.height', 1000)
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
print(df)
When I copied and pasted the data here it was open in Excel, but the file itself is a CSV.
I'm doing a class exercise and we have to use numpy. One thing I noticed was that the results were quite illegible thanks to the scientific notation, so I did the following and things are much smoother:
np.set_printoptions(threshold=100000, suppress=True)
The suppress option saved me a lot of formatting. Performance does suffer a lot when I change the threshold to something like 'nan' or inf, though, and I'm not sure why.
I've been trying to process a 1.4 GB CSV file with pandas, but I keep having memory problems. I have tried different things in an attempt to make pandas' read_csv work, to no avail.
It didn't work when I used the iterator=True and chunksize=number parameters. Moreover, the smaller the chunksize, the slower it was to process the same amount of data.
(Simple added overhead doesn't explain it, because it was far slower when the number of chunks was large. I suspect that when processing each chunk, pandas needs to go through all the chunks before it to "get to it", instead of jumping right to the start of the chunk. That seems to be the only way to explain it.)
Then, as a last resort, I split the CSV file into 6 parts and tried to read them one by one, but I still get a MemoryError.
(I monitored Python's memory usage while running the code below, and found that each time Python finishes processing a file and moves on to the next, the memory usage goes up. It seemed quite obvious that pandas didn't release the memory for the previous file when it had finished processing it.)
The code may not make sense but that's because I removed the part where it writes into an SQL database to simplify it and isolate the problem.
import csv
import glob
import pandas as pd

filenameStem = 'Crimes'
counter = 0
for filename in glob.glob(filenameStem + '_part*.csv'):  # reading files Crimes_part1.csv through Crimes_part6.csv
    chunk = pd.read_csv(filename)
    df = chunk.iloc[:, [5, 8, 15, 16]]
    df = df.dropna(how='any')
    counter += 1
    print(counter)
You may try to parse only those columns that you need (as @BrenBarn said in the comments):
import glob
import pandas as pd

def get_merged_csv(flist, **kwargs):
    return pd.concat([pd.read_csv(f, **kwargs) for f in flist], ignore_index=True)

fmask = 'Crimes_part*.csv'
cols = [5, 8, 15, 16]

df = get_merged_csv(glob.glob(fmask), index_col=None, usecols=cols).dropna(how='any')

print(df.head())
PS this will include only 4 out of at least 17 columns in your resulting data frame
Thanks for the reply.
After some debugging, I located the problem. The "iloc" subsetting in pandas created a circular reference, which prevented garbage collection. A detailed discussion can be found here
I have run into the same issue with large CSV files. Read the CSV in chunks by fixing a chunksize: use the chunksize or iterator parameter to return the data in chunks.
Syntax:
csv_onechunk = pandas.read_csv(filepath, sep=delimiter, skiprows=1, chunksize=10000)
Then concatenate the chunks (this is only valid with the C parser).
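For example, a minimal sketch of that chunk-and-concatenate pattern (the file name and column numbers are taken from the question above, the chunk size is arbitrary):
import pandas as pd

# read one part in fixed-size chunks, keep only the needed columns,
# drop incomplete rows, then concatenate everything at the end
pieces = []
for chunk in pd.read_csv('Crimes_part1.csv', chunksize=10000, usecols=[5, 8, 15, 16]):
    pieces.append(chunk.dropna(how='any'))
df = pd.concat(pieces, ignore_index=True)
print(df.shape)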
I have a list data_list and I save it as follows:
data_array = np.array(data_list)
np.savez("File", data_array)
In order to load "File"
a = np.load("File.npz")
b = a['arr_0']
I used this code until two weeks ago and it worked fine.
Today I tried to run my program, but it ends with a memory error at the line
b = a['arr_0']
"File" is 300 MB on disk, so I don't think it's a memory problem.
Any idea about how it happened?
What data are you storing? Do you get the same problem using a similarly shaped/sized array from np.random.randn?
Furthermore, it is probably useful to know that you can assign names to the arrays you store with np.savez by specifying them as kwargs, i.e.
np.savez("File", data_array=data_array)
then you can use
a = np.load("File.npz")
b = a['data_array']
Also, note that np.savez itself does not compress, but np.savez_compressed does. If you used the compressed variant, the array in memory can be much larger than the resulting file, so the problem might still be the sheer size of your array even though the file itself is not that large.
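One quick way to check this, as suggested above, is to run the same save/load cycle on random data of a similar shape and compare the on-disk size with the in-memory size; a small sketch:
import os
import numpy as np

data_array = np.random.randn(1000, 1000)              # stand-in for the real data
np.savez_compressed("File", data_array=data_array)

print("on disk:   %d bytes" % os.path.getsize("File.npz"))
print("in memory: %d bytes" % data_array.nbytes)

# loading decompresses back to the full in-memory size
a = np.load("File.npz")
b = a['data_array']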
I'm attempting some statistics using SciPy, but my input dataset is quite large (~1.9 GB) and in dbf format.
The file is large enough that numpy returns an error message when I try to create an array with genfromtxt. (I've got 3 GB of RAM, but I'm running 32-bit Windows.)
i.e.:
Traceback (most recent call last):
File "<pyshell#5>", line 1, in <module>
ind_sum = numpy.genfromtxt(r"W:\RACER_Analyses\Terrestrial_Heterogeneity\IND_SUM.dbf", dtype = (int, int, int, float, float, int), names = True, usecols = (5))
File "C:\Python26\ArcGIS10.0\lib\site-packages\numpy\lib\npyio.py", line 1335, in genfromtxt
for (i, line) in enumerate(itertools.chain([first_line, ], fhd)):
MemoryError
From other posts, I see that the chunked array provided by PyTables could be useful, but my problem is reading this data in the first place. In other words, PyTables or PyHDF can easily create the HDF5 output I want, but what should I do to get my data into an array first?
For instance:
import numpy, scipy, tables

h5file = tables.openFile(r"W:\RACER_Analyses\Terrestrial_Heterogeneity\HET_IND_SUM2.h5", mode="w", title="Diversity Index Results")
group = h5file.createGroup("/", "IND_SUM", "Aggregated Index Values")
and then I could either create a table or array, but how do I refer back to the original dbf data? In the description?
Thanks for any thoughts you might have!
If the data is too big to fit in memory, you can work with a memory-mapped file (it's like a numpy array but stored on disk - see docs here), though you may be able to get similar results using HDF5 depending on what operations you need to perform on the array. Obviously this will make many operations slower but this is better than not being able to do them at all.
Because you are hitting a memory limit, I think you cannot use genfromtxt. Instead, you should iterate through your text file one line at a time, and write the data to the relevant position in the memmap/hdf5 object.
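A rough sketch of that line-by-line fill into a memmap (the file names, row count, and column count here are placeholders, not taken from your data):
import numpy as np

n_rows, n_cols = 1000000, 6                 # assumed dimensions; use the real shape of your table
out = np.memmap('ind_sum.dat', dtype='float64', mode='w+', shape=(n_rows, n_cols))

# 'ind_sum.csv' stands in for a comma-delimited text export of the dbf file, with one header line
with open('ind_sum.csv') as fh:
    next(fh)                                # skip the header
    for i, line in enumerate(fh):
        out[i] = [float(v) for v in line.split(',')]

out.flush()                                 # push everything to disk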
It is not clear what you mean by "referring back to the original dbf data"? Obviously you can just store the filename it came from somewhere. HDF5 objects have "attributes" which are designed to store this kind of meta-data.
Also, I have found that using h5py is a much simpler and cleaner way to access hdf5 files than pytables, though this is largely a matter of preference.
If the data is in a dbf file, you might try my dbf package -- it only keeps the records in memory that are being accessed, so you should be able to cycle through the records pulling out the data that you need:
import dbf

table = dbf.Table(r"W:\RACER_Analyses\Terrestrial_Heterogeneity\IND_SUM.dbf")

sums = [0, 0, 0, 0.0, 0.0, 0]
for record in table:
    for index in range(5):
        sums[index] += record[index]
I have a data.frame in R. It contains a lot of data: gene expression levels from many (125) arrays. I'd like the data in Python, due mostly to my incompetence in R and the fact that this was supposed to be a 30-minute job.
I would like the following code to work. To understand this code, know that the variable path contains the full path to my data set which, when loaded, gives me a variable called immgen. Know that immgen is an object (a Bioconductor ExpressionSet object) and that exprs(immgen) returns a data frame with 125 columns (experiments) and tens of thousands of rows (named genes). (Just in case it's not clear, this is Python code, using robjects.r to call R code)
import numpy as np
import rpy2.robjects as robjects
# ... some code to build path
robjects.r("load('%s')"%path) # loads immgen
e = robjects.r['data.frame']("exprs(immgen)")
expression_data = np.array(e)
This code runs, but expression_data is simply array([[1]]).
I'm pretty sure that e doesn't represent the data frame generated by exprs() due to things like:
In [40]: e._get_ncol()
Out[40]: 1
In [41]: e._get_nrow()
Out[41]: 1
But then again who knows? Even if e did represent my data.frame, that it doesn't convert straight to an array would be fair enough - a data frame has more in it than an array (rownames and colnames) and so maybe life shouldn't be this easy. However I still can't work out how to perform the conversion. The documentation is a bit too terse for me, though my limited understanding of the headings in the docs implies that this should be possible.
Anyone any thoughts?
This is the most straightforward and reliable way I've found to transfer a data frame from R to Python.
To begin with, I think exchanging the data through the R bindings is an unnecessary complication. R provides a simple method to export data, likewise, NumPy has decent methods for data import. The file format is the only common interface required here.
data(iris)
iris$Species = unclass(iris$Species)
write.table(iris, file="/path/to/my/file/np_iris.txt", row.names=F, sep=",")
# now start a python session
import numpy as NP
fpath = "/path/to/my/file/np_iris.txt"
A = NP.loadtxt(fpath, comments="#", delimiter=",", skiprows=1)
# print(type(A))
# returns: <type 'numpy.ndarray'>
print(A.shape)
# returns: (150, 5)
print(A[1:5,])
# returns:
[[ 4.9 3. 1.4 0.2 1. ]
[ 4.7 3.2 1.3 0.2 1. ]
[ 4.6 3.1 1.5 0.2 1. ]
[ 5. 3.6 1.4 0.2 1. ]]
According to the Documentation (and my own experience for what it's worth) loadtxt is the preferred method for conventional data import.
You can also pass loadtxt a tuple of data types (the argument is dtype), one item in the tuple for each column. Notice skiprows=1 to step over the row of column headers (skiprows counts lines from the top of the file, while column indices such as usecols start from 0).
Finally, I converted the data frame's factor to an integer (which is actually the underlying data type of a factor) prior to exporting; unclass is probably the easiest way to do this.
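Continuing the session above, here is a sketch with a structured dtype (the field names are made up for this iris export):
# one (name, type) pair per column
col_types = NP.dtype([('sl', 'f4'), ('sw', 'f4'), ('pl', 'f4'), ('pw', 'f4'), ('species', 'i4')])
B = NP.loadtxt(fpath, delimiter=",", skiprows=1, dtype=col_types)
print(B['species'][:5])      # columns can then be pulled out by field name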
If you have big data (i.e., you don't want to load the entire data file into memory but still need to access it), NumPy's memory-mapped data structure ('memmap') is a good choice:
from tempfile import mkdtemp
import os.path as path

filename = path.join(mkdtemp(), 'tempfile.dat')

# now create a memory-mapped file with shape and data type
# based on the original R data frame:
A = NP.memmap(filename, dtype="float32", mode="w+", shape=(150, 5))

# methods are 'flush' (writes any changes you make to the array to disk) and 'close'

# to write data to the memmap array (actually an array-like memory map to
# the data stored on disk), where somedata is your in-memory array:
A[:] = somedata[:]
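A later session can then map the same file back read-only, without pulling it all into RAM (a short follow-up, reusing filename from above):
# reopen the same file; the data stays on disk and pages in on demand
A_ro = NP.memmap(filename, dtype="float32", mode="r", shape=(150, 5))
print(A_ro[0:5, :])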
Why go through a data.frame when exprs(immgen) returns a matrix and your end goal is to have your data in a matrix?
Passing the matrix to numpy is straightforward (and can even be made without making a copy):
http://rpy.sourceforge.net/rpy2/doc-2.1/html/numpy.html#from-rpy2-to-numpy
This should beat, in both simplicity and efficiency, the suggestion of going through a text representation of numerical data in flat files as a way to exchange data.
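A minimal sketch of that route (assuming rpy2's numpy2ri conversion module is available; the exact activation API differs between rpy2 versions):
import numpy as np
import rpy2.robjects as robjects
from rpy2.robjects import numpy2ri

numpy2ri.activate()                        # older-style global R <-> numpy conversion
robjects.r("library(Biobase)")             # provides exprs() for ExpressionSet objects
robjects.r("load('%s')" % path)            # `path` built as in the question; loads immgen
expression_data = np.asarray(robjects.r("exprs(immgen)"))
print(expression_data.shape)               # expected: (number of genes, 125)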
You seem to be working with bioconductor classes, and might be interested in the following:
http://pypi.python.org/pypi/rpy2-bioconductor-extensions/