pandas' memory usage for list of SparseSeries - python

I'm trying to create a list of SparseSeries from a scipy sparse matrix. Creating the lil_matrix is fast and does not consume a lot of memory (in reality my dimensions are more on the order of millions, i.e. 15 million samples and 4 million features). I have read a previous topic on this, but that solution also seems to eat up all my memory, freezing my computer. On the surface it looks like the pandas SparseSeries is not really sparse, or am I doing something wrong? The ultimate goal is to create a SparseDataFrame from this (just like in the other topic I referred to).
from scipy.sparse import lil_matrix, csr_matrix
from numpy import random
import pandas as pd
nsamples = 10**5
nfeatures = 10**4
rm = lil_matrix((nsamples,nfeatures))
for i in xrange(nsamples):
    index = random.randint(0, nfeatures, size=4)
    rm[i, index] = 1
l = []
for i in xrange(nsamples):
    l.append(pd.Series(rm[i, :].toarray().ravel()).to_sparse(fill_value=0))

Since your goal is a sparse dataframe, I skipped the Series stage and went straight to a dataframe. I only had the patience to do this on a smaller array size:
nsamples = 10**3
nfeatures = 10**2
Creation of rm is the same, but instead of loading into a list, I do this:
df = pd.DataFrame(rm[0, :].toarray().ravel()).to_sparse(0)
for i in xrange(1, nsamples):
    df[i] = rm[i, :].toarray().ravel()
This is unfortunately much slower to run than what you have, but the result is a dataframe, not a list. I played around with this a little and, as best I can tell, there is no fast way to build a large, sparse dataframe (even one full of zeros) column by column, rather than all at once (which is not memory efficient). All of the examples in the documentation that I could find start with a dense structure and convert to sparse in one step.
In any event, this way should be fairly memory efficient, because it compresses one column at a time, so you never hold the full array/dataframe uncompressed at once. The resulting dataframe is definitely sparse:
In [39]: type(df)
Out[39]: pandas.sparse.frame.SparseDataFrame
and definitely saves space (almost 25x compression):
In [40]: df.memory_usage().sum()
Out[40]: 31528
In [41]: df.to_dense().memory_usage().sum()
Out[41]: 800000
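For completeness: on newer pandas releases (0.25 and later) there is a constructor that builds a frame with sparse columns directly from a scipy sparse matrix, which avoids the per-column loop entirely. A minimal sketch, assuming the rm matrix built in the question:
# Sketch: construct sparse columns straight from the scipy matrix (pandas >= 0.25).
sdf = pd.DataFrame.sparse.from_spmatrix(rm.tocsr())
print(sdf.sparse.density)          # fraction of explicitly stored values
print(sdf.memory_usage().sum())    # far below the dense footprint
On pandas 0.20 through 0.25, the (since removed) pd.SparseDataFrame(rm) constructor played a similar role.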

Related

How to reduce the time taken working on a big Data Frame

I think my code is inefficient and that there may be a better way to do it.
The code takes an Excel listing and relates each element of a column to every other element of the same column; when certain conditions are met, the pair is stored in a new data frame with the joint information. In my case the file has more than 16,000 rows, so the exercise performs roughly 16,000 x 16,000 = 256,000,000 iterations, and it takes days to process.
The code I have is the following:
import pandas as pd
import numpy as np
excel1="Cs.xlsx"
dataframe1=pd.read_excel(excel1)
col_names=["Eb","Eb_n","Eb_Eb","L1","Ll1","L2","Ll2","D"]
my_df =pd.DataFrame(columns=col_names)
count_row = dataframe1.shape[0]
print(count_row)
for n in range(0, count_row):
    for p in range(0, count_row):
        if (abs(dataframe1.iloc[n, 1] - dataframe1.iloc[p, 1]) < 0.27 and
                abs(dataframe1.iloc[n, 2] - dataframe1.iloc[p, 2]) < 0.27):
            Nb_Nb = dataframe1.iloc[n, 0] + "_" + dataframe1.iloc[p, 0]
            myrow = pd.Series([dataframe1.iloc[n, 0], dataframe1.iloc[p, 0], Nb_Nb,
                               dataframe1.iloc[n, 1], dataframe1.iloc[n, 2],
                               dataframe1.iloc[p, 1], dataframe1.iloc[p, 2]],
                              index=["Eb", "Eb_n", "Eb_Eb", "L1", "Ll1", "L2", "Ll2"])
            my_df = my_df.append(myrow, ignore_index=True)
print(my_df.head(5))
To start with, you can try using a different Python structure for accumulating the results; dataframes take a lot of memory and are slow to grow row by row (see the sketch after this list). Ordered from simple structures with more efficient processing to complex structures with less efficient processing:
Lists
Dictionaries
Numpy Arrays
Pandas Series
Pandas Dataframes
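A minimal sketch of that idea, assuming dataframe1, count_row and col_names from the question: accumulate plain tuples in a list and build the DataFrame once at the end, instead of calling my_df.append inside the loop, which copies the whole frame on every call.
# Sketch: collect rows in a plain list, create the DataFrame once at the end.
vals = dataframe1.values          # avoids the repeated .iloc lookups
rows = []
for n in range(count_row):
    for p in range(count_row):
        if abs(vals[n, 1] - vals[p, 1]) < 0.27 and abs(vals[n, 2] - vals[p, 2]) < 0.27:
            rows.append((vals[n, 0], vals[p, 0], vals[n, 0] + "_" + vals[p, 0],
                         vals[n, 1], vals[n, 2], vals[p, 1], vals[p, 2]))
# The question's "D" column is never filled, so it is left out here.
my_df = pd.DataFrame(rows, columns=col_names[:-1])
This does not remove the quadratic number of comparisons, but it does remove the quadratic cost of growing the dataframe.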

Why does a pandas dataframe with sparse columns take up more memory?

I am working on a dataset with mixed sparse/dense columns. As the sparse columns greatly outnumber the dense ones, I wanted to see whether I could store them efficiently using the sparse data structures in pandas. However, while testing the functionality I found that dataframes with sparse columns appear to take up more memory. Consider the following example:
import numpy as np
import pandas as pd
a = np.zeros(10000000)
b = np.zeros(10000000)
a[3000:3100] = 2
b[300:310] = 1
df = pd.DataFrame({'a':pd.SparseArray(a), 'b':pd.SparseArray(b)})
print(df.info())
This prints memory usage: 228.9 MB.
Next:
df = pd.DataFrame({'a':a, 'b':b})
print(df.info())
This prints memory usage: 152.6 MB.
Does the non-sparse dataframe take up less space? Am I misunderstanding?
Installation info:
pandas 0.25.0
python 3.7.2
I've reproduced those exact numbers. From the docs:
Pandas provides data structures for efficiently storing sparse data.
These are not necessarily sparse in the typical “mostly 0”. Rather,
you can view these objects as being “compressed” where any data
matching a specific value (NaN / missing value, though any value can
be chosen, including 0) is omitted. The compressed values are not
actually stored in the array.
Which means you have to specify that it's the 0 elements that should be compressed. You can do that by using fill_value=0, like so:
df = pd.DataFrame({'a':pd.SparseArray(a, fill_value=0), 'b':pd.SparseArray(b, fill_value=0)})
The result of df.info() is 1.4 KB of memory usage in this case, quite a dramatic difference.
As to why it is initially bigger than the normal "uncompressed" dataframe, my guess is that the sparse index data is stored on top of all the original values, which are still there (including the zeros, since the default fill value does not match them). That is only a guess, though.
Additional reading in the docs will tell you that 0 is the default fill_value only for arrays of integer dtype, which yours were not.
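A quick way to see that dtype dependence (a sketch using the same pandas 0.25 API as in the question):
import numpy as np
import pandas as pd
int_sparse = pd.SparseArray(np.zeros(5, dtype=int))
float_sparse = pd.SparseArray(np.zeros(5))
print(int_sparse.fill_value)    # 0   -> the zeros are compressed away
print(float_sparse.fill_value)  # nan -> the zeros are stored explicitly
print(len(int_sparse.sp_values), len(float_sparse.sp_values))  # 0 stored vs 5 stored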

Vectorization - how to append array without loop for

I have the following code:
import numpy as np

x = range(100)
M = len(x)
sample = np.zeros((M, 41632))
for i in range(M):
    lista = np.load('sample' + str(i) + '.npy')
    for j in range(41632):
        sample[i, j] = np.array(lista[j])
    print i
to create an array made of sample_i numpy arrays.
sample0, sample1, sample2, etc. are numpy arrays saved on disk, and my expected output is an Mx41632 array like this:
sample = [[sample0],[sample1],[sample2],...]
How can I make this operation more compact and faster, without the for loops? M can also reach 1 million.
Or, how can I append to my sample array if the starting index is, for example, 1000 instead of 0?
Thanks in advance
Initial load
You can make your code a lot faster by avoiding the inner loop and not initialising sample to zeros.
x = range(100)
M = len(x)
sample = np.empty((M, 41632))
for i in range(M):
    sample[i, :] = np.load('sample' + str(i) + '.npy')
In my tests this took the reading code from 3 seconds down to 60 milliseconds!
Adding rows
In general it is very slow to change the size of a numpy array. You can append a row once you have loaded the data in this way:
sample = np.insert(sample, len(sample), newrow, axis=0)
but this is almost never what you want to do, because it is so slow.
Better storage: HDF5
Also if M is very large you will probably start running out of memory.
I recommend that you have a look at PyTables which will allow you to store your sample results in one HDF5 file and manipulate the data without loading it into memory. This will in general be a lot faster than the .npy files you are using now.
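A minimal sketch of that route with PyTables, assuming the M and sample_i.npy files from the question (the file name samples.h5 and the node name are just placeholders); an enlargeable array lets you append one sample at a time without ever holding all M rows in RAM:
import numpy as np
import tables

with tables.open_file('samples.h5', mode='w') as f:
    arr = f.create_earray(f.root, 'sample',
                          atom=tables.Float64Atom(), shape=(0, 41632))
    for i in range(M):
        row = np.load('sample' + str(i) + '.npy')
        arr.append(row.reshape(1, -1))   # rows land on disk, not in memory
Reading back a slice later is just f.root.sample[1000:2000], so the "start at 1000" case in the question is covered as well.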
It is quite simple with numpy. Consider this example:
import numpy as np
l = [[1,2,3],[4,5,6],[7,8,9],[10,11,12]]
#create an array with 4 rows and 3 columns
arr = np.zeros([4,3])
arr[:,:] = l
You can also insert rows or columns separately:
#insert the first row
arr[0,:] = l[0]
You just have to make sure that the dimensions match.

Shuffling multiple HDF5 datasets in-place

I have multiple HDF5 datasets saved in the same file, my_file.h5. These datasets have different dimensions, but the same number of observations in the first dimension:
features.shape = (1000000, 24, 7, 1)
labels.shape = (1000000,)
info.shape = (1000000, 4)
It is important that the info/label data is correctly connected to each set of features and I therefore want to shuffle these datasets with an identical seed. Furthermore, I would like to shuffle these without ever loading them fully into memory. Is that possible using numpy and h5py?
Shuffling arrays on disk will be time consuming, as it means that you have to allocate new arrays in the HDF5 file and then copy all the rows in a different order. You can iterate over rows (or use chunks of rows) with PyTables or h5py if you want to avoid loading all the data into memory at once.
An alternative approach could be to keep your data as it is and simply to map new row numbers to old row numbers in a separate array (that you can keep fully loaded in RAM, since it will be only 4MB with your array sizes). For instance, to shuffle a numpy array x,
x = np.random.rand(5)
idx_map = np.arange(x.shape[0])
np.random.shuffle(idx_map)
Then you can use advanced numpy indexing to access your shuffled data,
x[idx_map[2]] # equivalent to x_shuffled[2]
x[idx_map] # equivalent to x_shuffled[:], etc.
This will also work with arrays saved to HDF5. Of course there will be some overhead compared to writing the shuffled arrays to disk, but it could be sufficient depending on your use case.
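A minimal sketch of that index-map idea against the file from the question (my_file.h5 and the dataset names come from the question); reading one row at a time keeps memory use flat:
import numpy as np
import h5py

with h5py.File('my_file.h5', 'r') as f:
    n = f['features'].shape[0]
    idx_map = np.arange(n)
    np.random.shuffle(idx_map)            # one permutation shared by all datasets
    for new_row, old_row in enumerate(idx_map):
        feat = f['features'][old_row]     # single-row reads, nothing loaded in bulk
        lab = f['labels'][old_row]
        inf = f['info'][old_row]
        # ... hand (feat, lab, inf) to the consumer in shuffled order ...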
Shuffling arrays like this in numpy is straightforward:
Create the large shuffling index I (shuffle np.arange(1000000)) and index the arrays
features = features[I, ...]
labels = labels[I]
info = info[I, :]
This isn't an in-place operation. labels[I] is a copy of labels, not a slice or view.
An alternative
features[I,...] = features
looks on the surface like it is an in-place operation. I doubt that it is, down in the C code. It has to be buffered, because the I values are not guaranteed to be unique. In fact there is a special ufunc .at method for unbuffered operations.
But look at what h5py says about this same sort of 'fancy indexing':
http://docs.h5py.org/en/latest/high/dataset.html#fancy-indexing
labels[I] selection is implemented, but with restrictions.
List selections may not be empty
Selection coordinates must be given in increasing order
Duplicate selections are ignored
Very long lists (> 1000 elements) may produce poor performance
Your shuffled I is, by definition, not in increasing order. And it is very large.
Also, I don't see anything about using this fancy indexing on the left-hand side, labels[I] = ....
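If you do want to pull a whole shuffled dataset through h5py's fancy indexing despite those restrictions, one workaround (a sketch, not taken from the answers here, and it does load the full result into memory; the "very long lists" performance caveat above still applies) is to read with the indices sorted and undo the sort afterwards, where dset stands for one open h5py dataset and I for the shuffled index:
order = np.argsort(I)                    # positions that put I into increasing order
shuffled = dset[I[order]]                # h5py accepts the increasing index array
shuffled = shuffled[np.argsort(order)]   # reorder in memory; equals the desired dset[I]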
import numpy as np
import h5py

data = h5py.File('original.h5py', 'r')
with h5py.File('output.h5py', 'w') as out:
    indexes = np.arange(data['some_dataset_in_original'].shape[0])
    np.random.shuffle(indexes)
    for key in data.keys():
        print(key)
        feed = np.take(data[key], indexes, axis=0)
        out.create_dataset(key, data=feed)

doing better than numpy's in1d mask function: ordered arrays?

This operation needs to be applied as fast as possible, as the actual arrays contain millions of elements. This is a simplified version of the problem.
So, I have a random array of unique integers (normally millions of elements).
totalIDs = [5,4,3,1,2,9,7,6,8 ...]
I have other arrays (normally tens of thousands of elements each) of unique integers from which I can create a mask.
subsampleIDs1 = [5,1,9]
subsampleIDs2 = [3,7,8]
subsampleIDs3 = [2,6,9]
...
I can use numpy to do
mask = np.in1d(totalIDs,subsampleIDs,assume_unique=True)
I can then extract the information I want from another array using the mask (say column 0 contains the values I want).
variable = allvariables[mask][:,0]
Now, given that the IDs are unique in both arrays, is there any way to speed this up significantly? It takes a long time to construct the mask for a few thousand points (subsampleIDs) matched against millions of IDs (totalIDs).
I thought of going through it once and writing out a binary file of an index (to speed up future searches).
for i in range(0,3):
    mask = np.in1d(totalIDs, subsampleIDs, assume_unique=True)
    index[mask] = i
where X is in subsampleIDsX. Then I can just do:
for i in range(0,3):
    if index[i] == i:
        rowmatch = i
        break
variable = allvariables[rowmatch:len(subsampleIDs), 0]
right? But this is also slow because there is a conditional in the loop to find when it first matches. Is there a faster way to find when a number first appears in an ordered array so the conditional doesn't slow the loop?
I suggest you use a DataFrame in pandas. The index of the DataFrame is totalIDs, and you can select subsampleIDs by df.ix[subsampleIDs].
Create some test data first:
import numpy as np
N = 2000000
M = 5000
totalIDs = np.random.randint(0, 10000000, N)
totalIDs = np.unique(totalIDs)
np.random.shuffle(totalIDs)
v1 = np.random.rand(len(totalIDs))
v2 = np.random.rand(len(totalIDs))
subsampleIDs = np.random.choice(totalIDs, M)
subsampleIDs = np.unique(subsampleIDs)
np.random.shuffle(subsampleIDs)
Then convert your data into a DataFrame:
import pandas as pd
df = pd.DataFrame(data = {"v1":v1, "v2":v2}, index=totalIDs)
df.ix[subsampleIDs]
A DataFrame uses a hashtable to map the index to its location, which is very fast.
Often this kind of indexing is best performed using a DB (with proper column-indexing).
Another idea is to sort totalIDs once, as a preprocessing stage, and implement your own version of in1d, which avoids sorting everything. The numpy implementation of in1d (at least in the version that I have installed) is fairly simple, and should be easy to copy and modify.
EDIT:
Or, even better, use bucket sort (or radix sort). That should give you O(N+M), N being the size of totalIDs, and M the size of sampleIDs (times a constant you can play with by changing the number of buckets). Here too, you can split totalIDs to buckets only once, which gives you a nifty O(N+M1+M2+...).
Unfortunately, I'm not aware of a numpy implementation, but I did find this: http://en.wikipedia.org/wiki/Radix_sort#Example_in_Python
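A minimal sketch of that sort-once idea, using np.searchsorted instead of a hand-rolled in1d (variable names follow the question; note that the rows come back in subsample order rather than in totalIDs order):
# Preprocessing, done once: sort the IDs and remember their original positions.
ids = np.asarray(totalIDs)
order = np.argsort(ids)
sorted_ids = ids[order]

def rows_for(subsampleIDs):
    sub = np.asarray(subsampleIDs)
    # Binary search into the sorted IDs: O(M log N) per subsample set.
    pos = np.searchsorted(sorted_ids, sub)
    pos = np.clip(pos, 0, len(sorted_ids) - 1)   # guard IDs beyond the largest value
    hit = sorted_ids[pos] == sub                 # drop IDs that are not present
    return order[pos[hit]]                       # row numbers into allvariables

variable = allvariables[rows_for(subsampleIDs1), 0]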
