I have a large DataFrame object (1,440,000,000 rows) and I am operating at the memory (swap included) limit.
I need to extract a subset of the rows that have a certain value in a field. However, if I do this:
>>> SUBSET = DATA[DATA.field == value]
I end up with either a MemoryError exception or a crash.
Is there any way to filter rows explicitly, without calculating the intermediate mask (DATA.field == value)?
I have found the DataFrame.filter() and DataFrame.select() methods, but they operate on column labels/row indices rather than on the row data.
Use query; it should be a bit faster:
df = df.query("field == value")
If by any chance all the data in the DataFrame are of the same type, use a numpy array instead; it's more memory efficient and faster. You can convert your dataframe to a numpy matrix with df.as_matrix().
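A minimal, self-contained sketch of that numpy route with toy data (as_matrix() was deprecated in later pandas versions in favour of .values / .to_numpy(), which do the same job; the column name 'field' and value = 1 are placeholders):
import pandas as pd
DATA = pd.DataFrame({'field': [1, 2, 1, 3], 'other': [10, 20, 30, 40]})   # toy stand-in
value = 1
arr = DATA.values                        # plain ndarray; all columns share one dtype
col = DATA.columns.get_loc('field')      # positional index of the column to filter on
SUBSET = arr[arr[:, col] == value]       # boolean indexing directly on the ndarray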
You might also want to check how much memory the dataframe already takes:
import sys
sys.getsizeof(DATA)
That returns the size in bytes.
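For what it's worth (my addition, not part of the answer above), pandas' own accounting tends to be more informative than sys.getsizeof for a DataFrame:
DATA.memory_usage(deep=True).sum()   # bytes per column, summed; deep=True also counts object-dtype contents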
Related
I am working on a dataset with mixed sparse / dense columns. As the sparse columns greatly outnumber the dense ones, I wanted to see if I could store them efficiently using sparse data structures in pandas. However, while testing the functionality I found that dataframes with sparse columns appear to take up more memory. Consider the following example:
import numpy as np
import pandas as pd
a = np.zeros(10000000)
b = np.zeros(10000000)
a[3000:3100] = 2
b[300:310] = 1
df = pd.DataFrame({'a':pd.SparseArray(a), 'b':pd.SparseArray(b)})
print(df.info())
This prints memory usage: 228.9 MB.
Next:
df = pd.DataFrame({'a':a, 'b':b})
print(df.info())
This prints memory usage: 152.6 MB.
Does the non-sparse dataframe really take up less space? Am I misunderstanding something?
Installation info:
pandas 0.25.0
python 3.7.2
I've reproduced those exact numbers. From the docs:
Pandas provides data structures for efficiently storing sparse data.
These are not necessarily sparse in the typical “mostly 0”. Rather,
you can view these objects as being “compressed” where any data
matching a specific value (NaN / missing value, though any value can
be chosen, including 0) is omitted. The compressed values are not
actually stored in the array.
Which means you have to specify that it's the 0 elements that should be compressed. You can do that by using fill_value=0, like so:
df = pd.DataFrame({'a':pd.SparseArray(a, fill_value=0), 'b':pd.SparseArray(b, fill_value=0)})
The result of df.info() in this case is 1.4 KB of memory usage, quite a dramatic difference.
As to why it's initially bigger in your example than a normal "uncompressed" array, my guess is that it has to do with the sparse indexing data added on top of all the normal data that is still there (including the zeros, in your case). Anyway, that's just a guess.
Additional reading in the docs tells you that 0 is the default fill_value only for arrays of integer dtype, which yours weren't.
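A quick way to check that default (using the same pandas 0.25 API as above; the arrays here are just small stand-ins):
import numpy as np
import pandas as pd
a_int = np.zeros(10, dtype=int)
a_flt = np.zeros(10)
print(pd.SparseArray(a_int).fill_value)   # 0   -> zeros get compressed away
print(pd.SparseArray(a_flt).fill_value)   # nan -> zeros are stored explicitly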
In Pandas, there is a method DataFrame.shift(n) which shifts the contents of an array by n rows, relative to the index, similarly to np.roll(a, n). I can't seem to find a way to get a similar behaviour working with Dask. I realise things like row-shifts may be difficult to manage with Dask's chunked system, but I don't know of a better way to compare each row with the subsequent one.
What I'd like to be able to do is this:
import numpy as np
import pandas as pd
import dask.dataframe as dd
with pd.HDFStore(path) as store:
    data = dd.from_hdf(store, 'sim')[col1]
shifted = data.shift(1)
idx = data.apply(np.sign) != shifted.apply(np.sign)
in order to create a boolean series indicating the locations of sign changes in the data. (I am aware that this method would also catch changes from a signed value to zero.)
I would then use the boolean series to index a different Dask dataframe for plotting.
Rolling functions
Currently dask.dataframe does not implement the shift operation. It could, though, if you raise an issue. In principle this is not so dissimilar from the rolling operations that dask.dataframe does support, like rolling_mean, rolling_sum, etc.
Actually, if you were to create a Pandas function that adheres to the same API as these pandas.rolling_foo functions then you can use the dask.dataframe.rolling.wrap_rolling function to turn your pandas style rolling function into a dask.dataframe rolling function.
dask.dataframe.rolling_sum = wrap_rolling(pandas.rolling_sum)
The following code might help to shift down the series.
s = dd_df['column'].rolling(window=2).sum() - dd_df['column']
Edit (03/09/2019):
When you are rolling and finding the sum, for a particular row,
result[i] = row[i-1] + row[i]
Then by subtracting the old value of the column from the result, you are doing the following operation:
final_row[i] = result[i] - row[i]
Which equals:
final_row[i] = row[i-1] + row[i] - row[i]
Which ultimately results in the whole column getting shifted down once.
Tip:
If you want to shift it down multiple rows, you should actually execute the whole operation again that many times with the same window.
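A small, self-contained illustration of the trick above (my own sketch; dd_df is built here from a toy pandas frame, and dask handles the rolling window across partition boundaries):
import pandas as pd
import dask.dataframe as dd
pdf = pd.DataFrame({'column': [3.0, -1.0, 2.0, -5.0, 4.0]})
dd_df = dd.from_pandas(pdf, npartitions=2)
# rolling(2).sum() yields row[i-1] + row[i]; subtracting row[i] leaves row[i-1]
shifted = dd_df['column'].rolling(window=2).sum() - dd_df['column']
print(shifted.compute())   # NaN, 3.0, -1.0, 2.0, -5.0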
A bit of context: I am writing code to save the data I plot to a text file. This data should be stored in such a way that it can be loaded back with a script so it can be displayed again (but this time without performing any calculations). The initial idea was to store the data in columns with the format x1,y1,x2,y2,x3,y3...
I am using code that simplifies to something like this (incidentally, I am not sure whether using a list to group my arrays is the most efficient approach):
import numpy as np
MatrixResults = []
x1 = np.array([1,2,3,4,5,6])
y1 = np.array([7,8,9,10,11,12])
x2 = np.array([0,1,2,3])
y2 = np.array([0,1,4,9])
MatrixResults.append(x1)
MatrixResults.append(y1)
MatrixResults.append(x2)
MatrixResults.append(y2)
MatrixResults = np.array(MatrixResults)
TextFile = open('/Users/UserName/Desktop/Datalog.txt',"w")
np.savetxt(TextFile, np.transpose(MatrixResults))
TextFile.close()
However, this code gives an error when any of the data sets have different lengths. Reading similar questions:
Can numpy.savetxt be used on N-dimensional ndarrays with N>2?
Table, with the different length of columns
However, these require breaking the format (either by flattening or by adding filler strings to pad the shorter arrays).
My issue can be summarised as:
1) Is there any method that transposes the arrays and saves each of them as a separate, consecutive column?
2) Or is there any way to append columns to a text file (given a certain number of rows and columns to skip)?
3) Should I try this with another library such as pandas?
Thank you very much for any advice.
Edit 1:
After looking a bit more, it seems that leaving blank spaces is more inefficient than filling the lists.
In the end I wrote my own function (not sure if there is a numpy function for this) in which I pad the arrays to the same length with "nan" values.
To get the data back I use the genfromtxt method and then this line:
x = x[~np.isnan(x)]
to remove these cells from the arrays.
If I find a better solution I will post it :)
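For reference, a sketch of how that nan-padding could look (my own code, reusing the arrays from the question; the file name is a placeholder):
import numpy as np
x1 = np.array([1,2,3,4,5,6]); y1 = np.array([7,8,9,10,11,12])
x2 = np.array([0,1,2,3]);     y2 = np.array([0,1,4,9])
arrays = [x1, y1, x2, y2]
maxlen = max(len(a) for a in arrays)
# pad every array with nan up to the longest one (float cast needed to hold nan)
padded = [np.pad(a.astype(float), (0, maxlen - len(a)), mode='constant', constant_values=np.nan)
          for a in arrays]
np.savetxt('Datalog.txt', np.transpose(padded))
# reading back and stripping the padding, as described in the edit above
data = np.genfromtxt('Datalog.txt')
x2_back = data[:, 2]
x2_back = x2_back[~np.isnan(x2_back)]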
To save your arrays you can use np.savez and read them back with np.load:
# Write to file
np.savez(filename, matrixResults)
# Read back
matrixResults = np.load(filename + '.npz')['arr_0']   # 'arr_0' is the default key for a positional argument
As a side note, you should follow naming conventions, i.e. only class names start with upper-case letters.
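Since the arrays have different lengths, a variant worth mentioning (my addition, not part of the answer above) is to save each array under its own key, which sidesteps the ragged object array entirely:
import numpy as np
x1 = np.array([1,2,3,4,5,6]); y1 = np.array([7,8,9,10,11,12])
x2 = np.array([0,1,2,3]);     y2 = np.array([0,1,4,9])
np.savez('Datalog.npz', x1=x1, y1=y1, x2=x2, y2=y2)   # file name is a placeholder
loaded = np.load('Datalog.npz')
print(loaded['x2'])   # array([0, 1, 2, 3])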
This operation needs to be applied as fast as possible, as the actual arrays contain millions of elements. This is a simplified version of the problem.
So, I have a random array of unique integers (normally millions of elements).
totalIDs = [5,4,3,1,2,9,7,6,8 ...]
I have other arrays (normally tens of thousands of elements each) of unique integers from which I can create a mask.
subsampleIDs1 = [5,1,9]
subsampleIDs2 = [3,7,8]
subsampleIDs3 = [2,6,9]
...
I can use numpy to do
mask = np.in1d(totalIDs,subsampleIDs,assume_unique=True)
I can then extract the information I want from another array using the mask (say column 0 contains the values I want).
variable = allvariables[mask][:,0]
Now, given that the IDs are unique in both arrays, is there any way to speed this up significantly? It takes a long time to construct the mask for a few thousand points (subsampleIDs) matching against millions of IDs (totalIDs).
I thought of going through it once and writing out a binary file of an index (to speed up future searches).
for i in range(0,3):
    mask = np.in1d(totalIDs, subsampleIDs, assume_unique=True)
    index[mask] = i
where on each iteration the mask is built from the corresponding subsampleIDsX. Then I can just do:
for i in range(0,3):
    if index[i] == i:
        rowmatch = i
        break
variable = allvariables[rowmatch:len(subsampleIDs),0]
right? But this is also slow because there is a conditional in the loop to find when it first matches. Is there a faster way to find when a number first appears in an ordered array so the conditional doesn't slow the loop?
I suggest you use a DataFrame in Pandas. The index of the DataFrame is totalIDs, and you can select the subsampleIDs with df.ix[subsampleIDs].
Create some test data first:
import numpy as np
N = 2000000
M = 5000
totalIDs = np.random.randint(0, 10000000, N)
totalIDs = np.unique(totalIDs)
np.random.shuffle(totalIDs)
v1 = np.random.rand(len(totalIDs))
v2 = np.random.rand(len(totalIDs))
subsampleIDs = np.random.choice(totalIDs, M)
subsampleIDs = np.unique(subsampleIDs)
np.random.shuffle(subsampleIDs)
Then convert your data into a DataFrame:
import pandas as pd
df = pd.DataFrame(data = {"v1":v1, "v2":v2}, index=totalIDs)
df.ix[subsampleIDs]
DataFrame uses a hashtable to map index labels to their locations, so this lookup is very fast.
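If you want to verify the speed-up on the test data above, a rough timing harness could look like this (my own snippet; df.loc is the modern replacement for df.ix, and actual timings will depend on your machine):
import timeit
print(timeit.timeit(lambda: np.in1d(totalIDs, subsampleIDs, assume_unique=True), number=10))
print(timeit.timeit(lambda: df.loc[subsampleIDs], number=10))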
Often this kind of indexing is best performed using a DB (with proper column-indexing).
Another idea is to sort totalIDs once, as a preprocessing stage, and implement your own version of in1d, which avoids sorting everything. The numpy implementation of in1d (at least in the version that I have installed) is fairly simple, and should be easy to copy and modify.
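A sketch of that sort-once idea using np.searchsorted instead of a hand-rolled in1d (my own code on toy data; it assumes every subsample ID actually occurs in totalIDs, and note that the rows come back in subsample order rather than in totalIDs order):
import numpy as np
totalIDs = np.array([5, 4, 3, 1, 2, 9, 7, 6, 8])
allvariables = np.arange(len(totalIDs) * 2).reshape(-1, 2)   # toy data; column 0 is the one we want
subsampleIDs = np.array([5, 1, 9])
# one-off preprocessing
order = np.argsort(totalIDs)
sorted_ids = totalIDs[order]
# per-subsample lookup: O(M log N) instead of re-sorting everything each time
pos = np.searchsorted(sorted_ids, subsampleIDs)
variable = allvariables[order[pos], 0]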
EDIT:
Or, even better, use bucket sort (or radix sort). That should give you O(N+M), N being the size of totalIDs and M the size of subsampleIDs (times a constant you can play with by changing the number of buckets). Here too, you can split totalIDs into buckets only once, which gives you a nifty O(N+M1+M2+...).
Unfortunately, I'm not aware of a numpy implementation, but I did find this: http://en.wikipedia.org/wiki/Radix_sort#Example_in_Python