I have been trying to read a 3.2 GB file into memory using pandas' read_csv function, but I kept running into what looked like a memory leak: my memory usage would spike to 90%+.
As alternatives:
I tried defining dtype to avoid keeping the data in memory as strings, but saw similar behaviour.
I tried numpy's CSV reader, thinking I would get different results, but was definitely wrong about that.
I tried reading line by line and ran into the same problem, only much more slowly.
I recently moved to Python 3, so I thought there could be a bug there, but I saw similar results on Python 2 + pandas.
The file in question is the train.csv file from the Kaggle Grupo Bimbo competition.
System info:
RAM: 16 GB, Processor: i7, 8 cores
Let me know if you would like to know anything else.
Thanks :)
EDIT 1: It's a memory spike, not a leak (sorry, my bad).
EDIT 2: Sample of the csv file
Semana,Agencia_ID,Canal_ID,Ruta_SAK,Cliente_ID,Producto_ID,Venta_uni_hoy,Venta_hoy,Dev_uni_proxima,Dev_proxima,Demanda_uni_equil
3,1110,7,3301,15766,1212,3,25.14,0,0.0,3
3,1110,7,3301,15766,1216,4,33.52,0,0.0,4
3,1110,7,3301,15766,1238,4,39.32,0,0.0,4
3,1110,7,3301,15766,1240,4,33.52,0,0.0,4
3,1110,7,3301,15766,1242,3,22.92,0,0.0,3
EDIT 3: Number of rows in the file: 74,180,465
Other than a simple pd.read_csv('filename', low_memory=False)
I have tried
from numpy import genfromtxt
my_data = genfromtxt('data/train.csv', delimiter=',')
UPDATE
The code below just worked, but I still want to get to the bottom of this problem; there must be something wrong.
import pandas as pd
import gc
data = pd.DataFrame()
data_iterator = pd.read_csv('data/train.csv', chunksize=100000)
for sub_data in data_iterator:
    data = data.append(sub_data)  # DataFrame.append returns a new frame, so assign the result
    gc.collect()
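For comparison, a minimal sketch of the more common chunked pattern: collect the chunks in a list and concatenate once at the end, so the growing frame isn't rebuilt on every iteration (same file path and chunk size assumed):

import pandas as pd

chunks = []
for sub_data in pd.read_csv('data/train.csv', chunksize=100000):
    chunks.append(sub_data)                    # list append is cheap; earlier chunks are not copied
data = pd.concat(chunks, ignore_index=True)    # single concatenation at the end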
EDIT: Piece of Code that worked.
Thanks for all the help, guys. I had messed up my dtypes by using Python types instead of numpy ones. Once I fixed that, the code below worked like a charm.
import numpy as np
import pandas as pd

# Caution: int8 covers only -128..127, and the sample rows above contain larger
# values (e.g. Agencia_ID=1110, Cliente_ID=15766), so those ID columns may need
# wider types (int16/int32) in practice.
dtypes = {'Semana': np.int8,
          'Agencia_ID': np.int8,
          'Canal_ID': np.int8,
          'Ruta_SAK': np.int8,
          'Cliente_ID': np.int8,
          'Producto_ID': np.int8,
          'Venta_uni_hoy': np.int8,
          'Venta_hoy': np.float16,
          'Dev_uni_proxima': np.int8,
          'Dev_proxima': np.float16,
          'Demanda_uni_equil': np.int8}

data = pd.read_csv('data/train.csv', dtype=dtypes)
This brought the memory consumption down to just under 4 GB.
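To check where the memory actually goes, pandas can report per-column usage; for example (deep=True also counts string/object storage):

print(data.memory_usage(deep=True))                     # bytes per column
print(data.memory_usage(deep=True).sum() / 1e9, 'GB')   # total for the DataFrame
data.info(memory_usage='deep')                          # same information, summarised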
A file stored in memory as text is not as compact as a compressed binary format, but it is relatively compact data-wise. If it's a simple ASCII file, aside from any file header information, each character is only 1 byte. Python strings have a similar relation: there's some overhead for the internal Python object header, but each extra character adds only 1 byte (from testing with __sizeof__). Once you start converting to numeric types and collections (lists, arrays, data frames, etc.) the overhead grows. A list, for example, has to store a reference to a full Python object (type plus value) for each position, whereas a string stores only the raw character data.
>>> s = '3,1110,7,3301,15766,1212,3,25.14,0,0.0,3\r\n'
>>> l = [3,1110,7,3301,15766,1212,3,25.14,0,0.0,3]
>>> s.__sizeof__()
75
>>> l.__sizeof__()
128
A little bit of testing (assuming __sizeof__ is accurate):
import numpy as np
import pandas as pd

s = '1,2,3,4,5,6,7,8,9,10'
print('string: ' + str(s.__sizeof__()) + '\n')

l = [1,2,3,4,5,6,7,8,9,10]
print('list: ' + str(l.__sizeof__()) + '\n')

a = np.array([1,2,3,4,5,6,7,8,9,10])
print('array: ' + str(a.__sizeof__()) + '\n')

b = np.array([1,2,3,4,5,6,7,8,9,10], dtype=np.dtype('u1'))
print('byte array: ' + str(b.__sizeof__()) + '\n')

df = pd.DataFrame([1,2,3,4,5,6,7,8,9,10])
print('dataframe: ' + str(df.__sizeof__()) + '\n')
returns:
string: 53
list: 120
array: 136
byte array: 106
dataframe: 152
Based on your second chart, it looks as though there's a brief period where your machine allocates an additional 4.368 GB of memory, which is approximately the size of your 3.2 GB dataset (assuming 1 GB of overhead, which might be a stretch).
I tried to track down a place where this could happen and haven't been super successful. Perhaps you can find it, though, if you're motivated. Here's the path I took:
This line reads:
def read(self, nrows=None):
    if nrows is not None:
        if self.options.get('skip_footer'):
            raise ValueError('skip_footer not supported for iteration')

    ret = self._engine.read(nrows)
Here, _engine references PythonParser.
That, in turn, calls _get_lines().
That makes calls to a data source.
Which looks like it reads the data in as strings from something relatively standard (see here), like TextIOWrapper.
So things are getting read in as standard text and then converted, which explains the slow ramp.
What about the spike? I think that's explained by these lines:
ret = self._engine.read(nrows)

if self.options.get('as_recarray'):
    return ret

# May alter columns / col_dict
index, columns, col_dict = self._create_index(ret)

df = DataFrame(col_dict, columns=columns, index=index)
ret becomes all the components of a data frame.
self._create_index() breaks ret apart into these components:
def _create_index(self, ret):
    index, columns, col_dict = ret
    return index, columns, col_dict
So far, everything can be done by reference, and the call to DataFrame() continues that trend (see here).
So, if my theory is correct, DataFrame() is either copying the data somewhere, or _engine.read() is doing so somewhere along the path I've identified.
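If you want to see the spike from inside Python, one rough check (a sketch: Unix-only, and the file path is the one from the question) is to compare the process's peak resident set size before and after the read using the standard resource module:

import resource
import pandas as pd

def peak_rss_mb():
    # ru_maxrss is reported in kilobytes on Linux (bytes on macOS)
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

print('peak RSS before read: %.0f MB' % peak_rss_mb())
data = pd.read_csv('data/train.csv')
print('peak RSS after read:  %.0f MB' % peak_rss_mb())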
The CSV file may not be clean (some lines have an inconsistent number of elements); unclean lines would need to be disregarded.
String manipulation is required during processing.
Example input:
20150701 20:00:15.173,0.5019,0.91665
Desired output: float32 (pseudo-date, seconds in the day, f3, f4)
0.150701 72015.173 0.5019 0.91665 (+ the trailing trash floats usually get)
The CSV file is also very big: the numpy array in memory would be expected to take 5-10 GB, and the CSV file itself is over 30 GB.
Looking for an efficient way to process the CSV file and end up with a numpy array.
Current solution: use the csv module, process the file line by line, and use a list() as a buffer that later gets turned into a numpy array with asarray(). The problem is that the conversion doubles memory consumption, and the copying adds execution overhead.
Numpy's genfromtxt and loadtxt don't appear to be able to process the data as desired.
If you know in advance how many rows are in the data, you could dispense with the intermediate list and write directly to the array.
import numpy as np

no_rows = 5
no_columns = 4

a = np.zeros((no_rows, no_columns), dtype=np.float64)  # np.float is deprecated/removed in newer numpy

with open('myfile') as f:
    for i, line in enumerate(f):
        a[i, :] = cool_function_that_returns_formatted_data(line)
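The parsing helper above is left abstract; a rough sketch of what it might look like for the line format given in the question (the function name is the hypothetical placeholder used above, not a real library call), with a try/except in the filling loop to discard unclean lines:

def cool_function_that_returns_formatted_data(line):
    # "20150701 20:00:15.173,0.5019,0.91665" -> [0.150701, 72015.173, 0.5019, 0.91665]
    date_time, f3, f4 = line.strip().split(',')
    date_part, time_part = date_time.split(' ')
    pseudo_date = float(date_part[2:]) / 1e6                  # "20150701" -> 0.150701
    h, m, s = time_part.split(':')
    seconds_in_day = int(h) * 3600 + int(m) * 60 + float(s)   # "20:00:15.173" -> 72015.173
    return [pseudo_date, seconds_in_day, float(f3), float(f4)]

row = 0
with open('myfile') as f:
    for line in f:
        try:
            a[row, :] = cool_function_that_returns_formatted_data(line)
        except ValueError:        # wrong number of fields, unparseable numbers, etc.
            continue              # disregard unclean lines
        row += 1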
Did you consider using pandas read_csv (with engine='c')?
I find it one of the best and easiest solutions for handling CSV. I worked with a 4 GB file and it worked for me.
import pandas as pd

df = pd.read_csv('abc.csv', engine='c')
print(df.head(10))
I think the I/O capability of pandas is the best way to get data into a numpy array. Specifically, the read_csv method will read into a pandas DataFrame. You can then access the underlying numpy array using the as_matrix method of the returned DataFrame (in newer pandas, .values or .to_numpy()).
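A minimal sketch of that route (the file name is just a placeholder):

import pandas as pd

df = pd.read_csv('data.csv')
arr = df.to_numpy()   # .values also works; as_matrix() on older pandas versions
print(arr.shape, arr.dtype)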
I want to read in a tab separated value file and turn it into a numpy array. The file has 3 lines. It looks like this:
Ann1 Bill1 Chris1 Dick1
Ann2 Bill2 Chris2 "Dick2
Ann3 Bill3 Chris3 Dick3
So, I used this simple line of code:
import csv
import numpy as np

new_list = []
with open('/home/me/my_tsv.tsv') as tsv:
    for line in csv.reader(tsv, delimiter="\t"):
        new_list.append(line)

new = np.array(new_list)
print new.shape
And because of that pesky " character, the shape of my fancy new numpy array is
(2,4)
That's not right! So, the solution is to include the argument, quoting, in the csv.reader call, as follows:
for line in csv.reader(tsv, delimiter="\t", quoting=csv.QUOTE_NONE):
This is great! Now my dimension is
(3,4)
as I hoped.
Now comes the real problem -- in all actuality, I have a 700,000 X 10 .tsv file, with long fields. I can read the file into Python with no problems, just like in the above situation. But, when I get to the step where I create new = np.array(new_list), my wimpy 16 GB laptop cries and says...
MEMORY ERROR
Obviously, I cannot simultaneously have these 2 objects in memory -- the Python list of lists, and the numpy array.
My question is therefore this: how can I read this file directly into a numpy array, perhaps using genfromtxt or something similar...yet also achieve what I've achieved by using the quoting=csv.QUOTE_NONE argument in the csv.reader?
So far, I've found no analogy to the quoting=csv.QUOTE_NONE option available anywhere in the standard ways to read in a tsv file using numpy.
This is a tough little problem. I thought about iteratively building the numpy array during the reading-in process, but I can't figure it out.
I tried
nparray = np.genfromtxt("/home/me/my_tsv.tsv", delimiter="/t")
print nparray.shape
and got
(3,0)
If anyone has any advice I'd be greatly appreciative. Also, I know the real answer is probably to use Pandas...but at this point I'm committed to using numpy for a lot of compelling reasons...
Thanks in advance.
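For what it's worth, one direction that matches both constraints (no intermediate list of lists, and csv.QUOTE_NONE behaviour) is to preallocate the array and fill it row by row from csv.reader. A rough sketch, assuming the row count is known in advance and that a fixed-width string dtype is acceptable for the text fields:

import csv
import numpy as np

no_rows, no_cols = 700000, 10
data = np.empty((no_rows, no_cols), dtype='U100')   # assumes each field fits in 100 characters

with open('/home/me/my_tsv.tsv') as tsv:
    reader = csv.reader(tsv, delimiter="\t", quoting=csv.QUOTE_NONE)
    for i, row in enumerate(reader):
        data[i, :] = row

print(data.shape)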
I have multiple files which I process using Numpy and SciPy, but I am required to deliver an Excel file. How can I efficiently copy/paste a huge numpy array to Excel?
I have tried converting to Pandas' DataFrame object, which has the very useful function to_clipboard(excel=True), but I spend most of my time converting the array into a DataFrame.
I cannot simply write the array to a CSV file then open it in excel, because I have to add the array to an existing file; something very hard to achieve with xlrd/xlwt and other Excel tools.
My best solution here would be to turn the array into a string, then use win32clipboard to send it to the clipboard. This is not a cross-platform solution, but then again, Excel is not available on every platform anyway.
Excel uses tabs (\t) to mark column change, and \r\n to indicate a line change.
The relevant code would be:
import win32clipboard as clipboard

def toClipboardForExcel(array):
    """
    Copies an array into a string format acceptable by Excel.
    Columns separated by \t, rows separated by \r\n.
    """
    # Create string from array
    line_strings = []
    for line in array:
        line_strings.append("\t".join(line.astype(str)).replace("\n", ""))
    array_string = "\r\n".join(line_strings)

    # Put string into clipboard (open, clear, set, close)
    clipboard.OpenClipboard()
    clipboard.EmptyClipboard()
    clipboard.SetClipboardText(array_string)
    clipboard.CloseClipboard()
I have tested this code with random arrays of shape (1000,10000) and the biggest bottleneck seems to be passing the data to the function. (When I add a print statement at the beginning of the function, I still have to wait a bit before it prints anything.)
EDIT: The previous paragraph relates my experience in Python Tools for Visual Studio. In this environment, it seems like the print statement is delayed. In a direct command line interface, the bottleneck is in the loop, as expected.
import pandas as pd
pd.DataFrame(arr).to_clipboard()
I think this is one of the easiest ways, using the pandas package.
If I needed to process multiple files loaded into Python and then write them out to Excel, I would probably build some tooling around xlwt.
That said, may I offer my recipe, Pasting python data into a spread sheet, open for any edits, complaints or feedback. It uses no third-party libraries and should be cross-platform.
As of today, you can also use xlwings. It's open source, and fully compatible with Numpy arrays and Pandas DataFrames.
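A minimal sketch of writing an array into an existing workbook with xlwings (the file and sheet names here are just assumptions):

import numpy as np
import xlwings as xw

arr = np.random.rand(1000, 10)

wb = xw.Book('existing_file.xlsx')              # opens the existing workbook in Excel
wb.sheets['Sheet1'].range('A1').value = arr     # writes the array starting at cell A1
wb.save()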
I extended PhilMacKay's answer to:
- include one-dimensional arrays, and
- allow a comma as the decimal separator (decimal=","):
import numpy as np
import win32clipboard as clipboard

def to_clipboard(array, decimal=","):
    """
    Copies an array into a string format acceptable by Excel.
    Columns separated by \t, rows separated by \r\n.
    """
    # Create string from array
    try:
        n, m = np.shape(array)
    except ValueError:
        n, m = 1, 0
    line_strings = []
    if m > 0:
        for line in array:
            if decimal == ",":
                line_strings.append("\t".join(line.astype(str)).replace(
                    "\n", "").replace(".", ","))
            else:
                line_strings.append("\t".join(line.astype(str)).replace(
                    "\n", ""))
        array_string = "\r\n".join(line_strings)
    else:
        if decimal == ",":
            array_string = "\r\n".join(array.astype(str)).replace(".", ",")
        else:
            array_string = "\r\n".join(array.astype(str))

    # Put string into clipboard (open, clear, set, close)
    clipboard.OpenClipboard()
    clipboard.EmptyClipboard()
    clipboard.SetClipboardText(array_string)
    clipboard.CloseClipboard()
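Usage is the same as for the original function: call it on an array, then paste into Excel (the arrays here are just examples):

import numpy as np

to_clipboard(np.arange(12.0).reshape(3, 4))   # 2-D: paste with Ctrl+V into the open worksheet
to_clipboard(np.array([1.5, 2.5, 3.5]))       # 1-D: pastes as a single column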
You could also look into the pyxll project.
I have a data.frame in R. It contains a lot of data: gene expression levels from many (125) arrays. I'd like the data in Python, due mostly to my incompetence in R and the fact that this was supposed to be a 30 minute job.
I would like the following code to work. To understand this code, know that the variable path contains the full path to my data set which, when loaded, gives me a variable called immgen. Know that immgen is an object (a Bioconductor ExpressionSet object) and that exprs(immgen) returns a data frame with 125 columns (experiments) and tens of thousands of rows (named genes). (Just in case it's not clear, this is Python code, using robjects.r to call R code)
import numpy as np
import rpy2.robjects as robjects
# ... some code to build path
robjects.r("load('%s')"%path) # loads immgen
e = robjects.r['data.frame']("exprs(immgen)")
expression_data = np.array(e)
This code runs, but expression_data is simply array([[1]]).
I'm pretty sure that e doesn't represent the data frame generated by exprs() due to things like:
In [40]: e._get_ncol()
Out[40]: 1
In [41]: e._get_nrow()
Out[41]: 1
But then again who knows? Even if e did represent my data.frame, that it doesn't convert straight to an array would be fair enough - a data frame has more in it than an array (rownames and colnames) and so maybe life shouldn't be this easy. However I still can't work out how to perform the conversion. The documentation is a bit too terse for me, though my limited understanding of the headings in the docs implies that this should be possible.
Anyone any thoughts?
This is the most straightforward and reliable way I've found to transfer a data frame from R to Python.
To begin with, I think exchanging the data through the R bindings is an unnecessary complication. R provides a simple method to export data, likewise, NumPy has decent methods for data import. The file format is the only common interface required here.
data(iris)
iris$Species = unclass(iris$Species)
write.table(iris, file="/path/to/my/file/np_iris.txt", row.names=F, sep=",")
# now start a python session
import numpy as NP
fpath = "/path/to/my/file/np_iris.txt"
A = NP.loadtxt(fpath, comments="#", delimiter=",", skiprows=1)
# print(type(A))
# returns: <type 'numpy.ndarray'>
print(A.shape)
# returns: (150, 5)
print(A[1:5,])
# returns:
[[ 4.9 3. 1.4 0.2 1. ]
[ 4.7 3.2 1.3 0.2 1. ]
[ 4.6 3.1 1.5 0.2 1. ]
[ 5. 3.6 1.4 0.2 1. ]]
According to the documentation (and my own experience, for what it's worth), loadtxt is the preferred method for conventional data import.
You can also pass loadtxt a tuple of data types (the argument is dtype), one item in the tuple for each column. Notice skiprows=1 to step over the row of column headers (skiprows counts lines from the top of the file).
Finally, I converted the dataframe factor to integer (which is actually the underlying data type for a factor) prior to exporting; unclass is probably the easiest way to do this.
If you have big data (i.e., you don't want to load the entire data file into memory but still need to access it), NumPy's memory-mapped data structure (memmap) is a good choice:
from tempfile import mkdtemp
import os.path as path

filename = path.join(mkdtemp(), 'tempfile.dat')

# now create a memory-mapped file with shape and data type
# based on the original R data frame:
A = NP.memmap(filename, dtype="float32", mode="w+", shape=(150, 5))

# useful methods are 'flush' (writes any changes you make to the array to disk) and 'close'

# to write data to the memmap array (actually an array-like memory-map to
# the data stored on disk):
A[:] = somedata[:]   # somedata stands in for the array loaded earlier (e.g. the loadtxt result)
Why go through a data.frame when exprs(immgen) returns a matrix and your end goal is to have your data in a matrix?
Passing the matrix to numpy is straightforward (and can even be done without making a copy):
http://rpy.sourceforge.net/rpy2/doc-2.1/html/numpy.html#from-rpy2-to-numpy
This should beat the suggestion of going through a text representation of numerical data in flat files, in both simplicity and efficiency, as a way to exchange data.
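A minimal sketch of that route (the exact calls depend on the rpy2 version; this assumes the numpy2ri converter available in rpy2 2.x/3.x and that immgen has already been loaded into the embedded R session):

import numpy as np
import rpy2.robjects as robjects
from rpy2.robjects import numpy2ri

numpy2ri.activate()                        # register the R-matrix -> numpy converter

robjects.r("library(Biobase)")             # exprs() comes from Biobase
expression_matrix = robjects.r("exprs(immgen)")
expression_data = np.asarray(expression_matrix)
print(expression_data.shape)               # expected: (number of genes, 125)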
You seem to be working with bioconductor classes, and might be interested in the following:
http://pypi.python.org/pypi/rpy2-bioconductor-extensions/