Selecting rows from sparse dataframe by index position - python

In a typical pandas DataFrame, it's easy to select desired rows based on index:
df.ix[list_of_inds] or df.loc[list_of_inds]
However, using this method to take a substantial subset of a large, sparse DataFrame (73,000 rows by 8,000 columns, specifically) seems to be extremely memory-intensive - my memory usage shoots up and my computer crashes.
I did notice that indexing using a range like this...
df.ix[1:N]
works fine, while using a list of indices like this...
df.ix[np.arange(1, N)]
is what overloads memory.
Is there another way to select rows from a sparse dataframe that's computationally easier? Or, can I convert this dataframe to an actual sparse matrix...
sparse_df = scipy.sparse.csc(df)
and select only the indices I want from that?
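To be concrete, something like the following is what I have in mind (a sketch on my part; scipy.sparse.csc above is only my guess at the name, and I'm not sure whether CSR or CSC is the right format):

import numpy as np
import pandas as pd
import scipy.sparse

# Sketch: build a scipy sparse matrix and select rows from it.
# CSR is assumed here because row selection is cheap on CSR.
sparse_mat = scipy.sparse.csr_matrix(df.values)
# (If the columns use pandas sparse dtypes, df.sparse.to_coo().tocsr() might avoid
# densifying via .values, but I haven't verified the memory behaviour.)
subset = sparse_mat[np.asarray(list_of_inds)]   # fancy row indexing on the sparse matrix
subset_df = pd.DataFrame(subset.toarray())      # back to a dense DataFrame only if needed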

The issue you are facing could be related to view vs copy semantics.
df.ix[1:N] # uses slicing => operates on a view
df.ix[np.arange(1,N)] # uses fancy indexing => "probably" creates a copy first
I created a DataFrame of shape 73000x8000 on my machine and my memory spiked to 4.4 GB, so I wouldn't be surprised by crashes. That said, if you do need to create a new array from the index list, then you're out of luck. However, if you only need to modify the original DataFrame, you should be able to do it one row at a time, or a few sliced rows at a time, at the expense of speed, e.g.:
for i in arbitrary_list_of_indices:
    df.ix[i] = new_values
Btw, you could try working off numpy arrays directly, which I feel have clearer semantics around which operations create copies vs. views. You can always create a DataFrame from the array with hardly any memory overhead, since it just keeps a reference to the original array.
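For instance, a quick way to convince yourself of the no-copy claim (my own check, not a guarantee across every pandas version or copy-on-write setting):

import numpy as np
import pandas as pd

arr = np.random.rand(1000, 100)               # smaller stand-in for the 73000x8000 case
wrapped = pd.DataFrame(arr)                   # typically wraps arr without copying
print(np.shares_memory(arr, wrapped.values))  # True on the pandas versions I've tried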
Also, indexing in numpy seems much faster, even without slicing. Here's a simple test case:
In [66]: df
Out[66]:
0 1 2 3
0 3 14 5 1
1 9 19 14 4
2 5 4 5 5
3 13 14 4 7
4 8 12 3 16
5 15 3 17 12
6 11 0 12 0
In [68]: df.ix[[1,3,5]] # fancy index version
Out[68]:
0 1 2 3
1 9 19 14 4
3 13 14 4 7
5 15 3 17 12
In [69]: df.ix[1:5:2] # sliced version of the same
Out[69]:
0 1 2 3
1 9 19 14 4
3 13 14 4 7
5 15 3 17 12
In [71]: %timeit df.ix[[1,3,5]] = -1 # use fancy index version
1000 loops, best of 3: 251 µs per loop
In [72]: %timeit df.ix[1:5:2] = -2 # faster sliced version
10000 loops, best of 3: 157 µs per loop
In [73]: arr = df.values
In [74]: arr
Out[74]:
array([[ 3, 14,  5,  1],
       [-2, -2, -2, -2],
       [ 5,  4,  5,  5],
       [-2, -2, -2, -2],
       [ 8, 12,  3, 16],
       [-2, -2, -2, -2],
       [11,  0, 12,  0]])
In [75]: %timeit arr[[1,3,5]] = -1 # much faster than DataFrame
The slowest run took 23.49 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 4.56 µs per loop
In [77]: %timeit arr[1:5:2] = -3 # really fast but restricted to slicing
The slowest run took 19.46 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 821 ns per loop
Good luck!

Related

Classify DataFrame rows based on first matching condition

I have a pandas DataFrame in which each column represents a quarter, with the most recent quarters to the right. Not all the information arrives at the same time, so some columns may be missing values (NaN).
I would like to create a new column with the number of the first criterion that the row matches, or zero if it matches none.
Each criterion is applied to the 3 most recent columns that have data (integers, ignoring NaNs), and a row matches if each of those values is greater than or equal to its corresponding value in the criterion list.
I tried using apply, but I couldn't make it work and my failed attempts were slow.
import pandas as pd
import numpy as np
criteria_dict = {
    1: [10, 0, 10],
    2: [0, 10, 10]
}

list_of_tuples = [
    (78, 7, 11, 15),       # classify as 2 since 7 >= 0, 11 >= 10, 15 >= 10
    (98, -5, np.NaN, 18),  # classify as 0; ignoring NaN it doesn't match any criteria because of the -5
    (-78, 20, 64, 28),     # classify as 1 since 20 >= 10, 64 >= 0, 28 >= 10
    (35, 63, 27, np.NaN),  # classify as 1; the NaN value should be ignored, 35 >= 10, 63 >= 0, 27 >= 10
    (-11, 0, 56, 10)       # classify as 2 since 0 >= 0, 56 >= 10, 10 >= 10
]

df = pd.DataFrame(
    list_of_tuples,
    index=['A', 'B', 'C', 'D', 'E'],
    columns=['2021Q2', '2021Q3', '2021Q4', '2022Q1']
)
print(df)
Applying a custom function to each row should work.
def func(x):
    x = x.dropna().to_numpy()[-3:]
    if len(x) < 3:
        return 0
    for k, v in criteria_dict.items():
        if np.all(x >= v):
            return k
    return 0

df.apply(func, axis=1)
Using apply is probably the most straightforward option, but I wanted to try a numpy solution, which should be faster for DataFrames with many rows.
import numpy as np

# Work on the underlying array.
df_arr = df.to_numpy()
# Find NaNs.
nans = np.nonzero(np.isnan(df_arr))
# Roll the rows so that the latest three columns with valid data are all to the right.
for row, col in zip(*nans):
    df_arr[row, :] = np.roll(df_arr[row, :], shift=4-col)
# Check for matching criteria.
df['criteria'] = np.select([np.all((df_arr[:, 1:] - criteria_dict[crit]) >= 0, axis=1) for crit in criteria_dict],
                           [crit for crit in criteria_dict])
print(df)
2021Q2 2021Q3 2021Q4 2022Q1 criteria
A 78 7 11.0 15.0 2.0
B 98 -5 NaN 18.0 0.0
C -78 20 64.0 28.0 1.0
D 35 63 27.0 NaN 1.0
E -11 0 56.0 10.0 2.0
Some timings on df = pd.concat([df]*10000), where numpy and pandas_apply are wrappers around the two solutions above:
# 103 ms ± 1.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit numpy(df)
# 1.32 s ± 14.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit pandas_apply(df)
So it is ~10x faster.
It is possible to do a fully vectorized comparison. Note that the bottleneck is the broadcasting step, which creates an intermediate array of size K*M*N, where M*N is the size of the compared subset of the DataFrame (here 5*3) and K*N that of the criteria (here 2*3). You need enough memory to hold this array.
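As a rough sanity check of that memory cost (a back-of-the-envelope sketch; the broadcast comparison materialises a boolean array):

K = len(criteria_dict)   # number of criteria (2 here)
M = len(df)              # number of rows (5 here)
N = 3                    # most recent non-NaN values compared per row
# The broadcast b >= c[:, None] creates a (K, M, N) boolean intermediate,
# i.e. roughly K * M * N bytes; make sure this fits in memory for your real data.
print(K * M * N, "bytes for this toy example")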
Step by step procedure:
First, get the last 3 non-NaN values per row as b:
N = 3
a = df.to_numpy()
from scipy.stats import rankdata
b = a[rankdata(~np.isnan(a), method='ordinal', axis=1)>(a.shape[1]-N)].reshape(-1,N)
array([[ 7., 11., 15.],
       [98., -5., 18.],
       [20., 64., 28.],
       [35., 63., 27.],
       [ 0., 56., 10.]])
Then craft an array with the conditions as c:
c = np.array(list(criteria_dict.values()))
array([[10,  0, 10],
       [ 0, 10, 10]])
Broadcast the comparison of b against c and require all values per row to satisfy >=:
d = (b>=c[:, None]).all(2)
array([[False, False,  True,  True, False],
       [ True, False,  True,  True,  True]])
Get index of first True using the criteria_dict keys (else 0):
e = np.where(d.any(0), np.array(list(criteria_dict))[np.argmax(d, axis=0)], 0)
array([2, 0, 1, 1, 2])
Assign to DataFrame:
df['criteria'] = e
2021Q2 2021Q3 2021Q4 2022Q1 criteria
A 78 7 11.0 15.0 2
B 98 -5 NaN 18.0 0
C -78 20 64.0 28.0 1
D 35 63 27.0 NaN 1
E -11 0 56.0 10.0 2
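For convenience, here are the same steps strung together in one function (just a sketch of the procedure above, nothing new):

import numpy as np
from scipy.stats import rankdata

def classify(df, criteria_dict, N=3):
    a = df.to_numpy()
    # last N non-NaN values per row (b)
    b = a[rankdata(~np.isnan(a), method='ordinal', axis=1) > (a.shape[1] - N)].reshape(-1, N)
    # criteria as an array (c)
    c = np.array(list(criteria_dict.values()))
    # broadcast comparison; all N values must satisfy >= (d)
    d = (b >= c[:, None]).all(2)
    # key of the first matching criterion, else 0 (e)
    return np.where(d.any(0), np.array(list(criteria_dict))[np.argmax(d, axis=0)], 0)

df['criteria'] = classify(df, criteria_dict)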

How to efficiently partial argsort Pandas dataframe across columns

I would like to replace values with column labels according to the largest 3 values for each row. Let's assume this input:
p1 p2 p3 p4
0 0 9 1 4
1 0 2 3 4
2 1 3 10 7
3 1 5 3 1
4 2 3 7 10
Given n = 3, I am looking for:
Top1 Top2 Top3
0 p2 p4 p3
1 p4 p3 p2
2 p3 p4 p2
3 p2 p3 p1
4 p4 p3 p2
I'm not concerned about duplicates, e.g. for index 3, Top3 can be 'p1' or 'p4'.
Attempt 1
My first attempt is a full sort using np.ndarray.argsort:
res = pd.DataFrame(df.columns[df.values.argsort(1)]).iloc[:, len(df.index): 0: -1]
But in reality I have more than 4 columns and this will be inefficient.
Attempt 2
Next I tried np.argpartition. But since values within each partition are not sorted, this required a subsequent sort:
n = 3
parts = np.argpartition(-df.values, n, axis=1)[:, :-1]
args = (-df.values[np.arange(df.shape[0])[:, None], parts]).argsort(1)
res = pd.DataFrame(df.columns[parts[np.arange(df.shape[0])[:, None], args]],
                   columns=[f'Top{i}' for i in range(1, n+1)])
This, in fact, works out slower than the first attempt for larger dataframes. Is there a more efficient way which takes advantage of partial sorting? You can use the below code for benchmarking purposes.
Benchmarking
# Python 3.6.0, NumPy 1.11.3, Pandas 0.19.2
import pandas as pd, numpy as np
df = pd.DataFrame({'p1': [0, 0, 1, 1, 2],
                   'p2': [9, 2, 3, 5, 3],
                   'p3': [1, 3, 10, 3, 7],
                   'p4': [4, 4, 7, 1, 10]})

def full_sort(df):
    return pd.DataFrame(df.columns[df.values.argsort(1)]).iloc[:, len(df.index): 0: -1]

def partial_sort(df):
    n = 3
    parts = np.argpartition(-df.values, n, axis=1)[:, :-1]
    args = (-df.values[np.arange(df.shape[0])[:, None], parts]).argsort(1)
    return pd.DataFrame(df.columns[parts[np.arange(df.shape[0])[:, None], args]])
df = pd.concat([df]*10**5)
%timeit full_sort(df) # 86.3 ms per loop
%timeit partial_sort(df) # 158 ms per loop
With a decent number of columns, we can use np.argpartition with some slicing and indexing, like so -
def topN_perrow_colsindexed(df, N):
    # Extract array data
    a = df.values

    # Get top N indices per row with not necessarily sorted order
    idxtopNpart = np.argpartition(a, -N, axis=1)[:, -1:-N-1:-1]

    # Index into input data with those and use argsort to force sorted order
    sidx = np.take_along_axis(a, idxtopNpart, axis=1).argsort(1)
    idxtopN = np.take_along_axis(idxtopNpart, sidx[:, ::-1], axis=1)

    # Index into column values with those for final output
    c = df.columns.values
    return pd.DataFrame(c[idxtopN], columns=[['Top'+str(i+1) for i in range(N)]])
Sample run -
In [65]: df
Out[65]:
p1 p2 p3 p4
0 0 9 1 4
1 0 2 3 4
2 1 3 10 7
3 1 5 3 1
4 2 3 7 10
In [66]: topN_perrow_colsindexed(df, N=3)
Out[66]:
Top1 Top2 Top3
0 p2 p4 p3
1 p4 p3 p2
2 p3 p4 p2
3 p2 p3 p4
4 p4 p3 p2
Timings -
In [143]: np.random.seed(0)
In [144]: df = pd.DataFrame(np.random.rand(10000,30))
In [145]: %timeit full_sort(df)
...: %timeit partial_sort(df)
...: %timeit topN_perrow_colsindexed(df,N=3)
100 loops, best of 3: 7.96 ms per loop
100 loops, best of 3: 13.9 ms per loop
100 loops, best of 3: 5.47 ms per loop
In [146]: df = pd.DataFrame(np.random.rand(10000,100))
In [147]: %timeit full_sort(df)
...: %timeit partial_sort(df)
...: %timeit topN_perrow_colsindexed(df,N=3)
10 loops, best of 3: 34 ms per loop
10 loops, best of 3: 56.1 ms per loop
100 loops, best of 3: 13.6 ms per loop
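One small caveat: np.take_along_axis is a relatively recent addition (NumPy 1.15+, if I remember correctly). On older NumPy the same indexing can be written with an explicit row indexer, something along these lines:

import numpy as np
import pandas as pd

def topN_perrow_colsindexed_legacy(df, N):
    a = df.values
    # top-N column indices per row, not yet ordered by value
    idxtopNpart = np.argpartition(a, -N, axis=1)[:, -1:-N-1:-1]
    rows = np.arange(a.shape[0])[:, None]        # explicit row indexer instead of take_along_axis
    sidx = a[rows, idxtopNpart].argsort(1)       # order the partitioned top-N by value
    idxtopN = idxtopNpart[rows, sidx[:, ::-1]]   # descending
    return pd.DataFrame(df.columns.values[idxtopN],
                        columns=['Top' + str(i+1) for i in range(N)])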

How to find the maximum consecutive number for multiple columns?

I need to identify the highest number of consecutive values that meet a certain criterion, for multiple columns.
If my df is:
A B C D E
26 24 21 23 24
26 23 22 15 23
24 19 17 11 15
27 22 28 24 24
26 27 30 23 11
26 26 29 27 29
I want to know the maximum consecutive times that numbers over 25 occur for each column. So the output would be:
A 3
B 2
C 3
D 1
E 1
Using the following code, I can obtain the outcome for one column at a time; is there a way to create a table as above rather than repeating for each column (I have over 40 columns in total).
df.A.isnull().astype(int).groupby(df.A.notnull().astype(int).cumsum()).sum().max()
Thanks in advance.
Is this what you want? A pandas approach (PS: never thought I could make it one line, LOL):
(df>25).apply(lambda x :x.groupby(x.diff().ne(0).cumsum()).cumcount()+1).mask(df<25).max()
Out[320]:
A 3.0
B 2.0
C 3.0
D 1.0
E 1.0
dtype: float64
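In case the one-liner is hard to read, here is an unrolled version of the same idea (my restatement; I mask with ~gt rather than df<25, which also excludes values equal to 25, as the >25 condition implies):

gt = df > 25                                                            # condition mask, per column
# within each run of identical mask values, cumcount()+1 is the running streak length
streak = gt.apply(lambda x: x.groupby(x.diff().ne(0).cumsum()).cumcount() + 1)
# keep the lengths only where the condition actually holds, then take each column's max
result = streak.mask(~gt).max()
print(result)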
One option using numpy to calculate the max consecutive:
def max_consecutive(arr):
    # calculate the indices where the condition changes
    split_indices = np.flatnonzero(np.ediff1d(arr.values, to_begin=1, to_end=1))
    # calculate the chunk length of consecutive values and pick every other value based on
    # the initial value
    try:
        max_size = np.diff(split_indices)[not arr.iat[0]::2].max()
    except ValueError:
        max_size = 0
    return max_size
df.gt(25).apply(max_consecutive)
#A 3
#B 2
#C 3
#D 1
#E 1
#dtype: int64
Timing compared with the other approach:
%timeit df.gt(25).apply(max_consecutive)
# 520 µs ± 6.92 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit (df>25).apply(lambda x :x.groupby(x.diff().ne(0).cumsum()).cumcount()+1).mask(df<25).max(0)
# 10.3 ms ± 221 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Here's one with NumPy -
# mask is 2D boolean array representing islands as True values per col
def max_island_len_cols(mask):
m,n = mask.shape
out = np.zeros(n,dtype=int)
b = np.zeros((m+2,n),dtype=bool)
b[1:-1] = mask
for i in range(mask.shape[1]):
idx = np.flatnonzero(b[1:,i] != b[:-1,i])
if len(idx)>0:
out[i] = (idx[1::2] - idx[::2]).max()
return out
output = pd.Series(max_island_len_cols(df.values>25), index=df.columns)
Sample run -
In [690]: df
Out[690]:
A B C D E
0 26 24 21 23 24
1 26 23 22 15 23
2 24 19 17 11 15
3 27 22 28 24 24
4 26 27 30 23 11
5 26 26 29 27 29
In [691]: pd.Series(max_island_len_cols(df.values>25), index=df.columns)
Out[691]:
A 3
B 2
C 3
D 1
E 1
dtype: int64
Runtime test
Inspired by the given sample, which has numbers in the range (24, 28) and 40 columns, let's set up a bigger input dataframe and test out all the solutions -
# Input dataframe
In [692]: df = pd.DataFrame(np.random.randint(24,28,(1000,40)))
# Proposed in this post
In [693]: %timeit pd.Series(max_island_len_cols(df.values>25), index=df.columns)
1000 loops, best of 3: 539 µs per loop
# @Psidom's solution
In [694]: %timeit df.gt(25).apply(max_consecutive)
1000 loops, best of 3: 1.81 ms per loop
# @Wen's solution
In [695]: %timeit (df>25).apply(lambda x :x.groupby(x.diff().ne(0).cumsum()).cumcount()+1).mask(df<25).max(0)
10 loops, best of 3: 95.2 ms per loop
An approach using pandas and scipy.ndimage.label, for fun.
import pandas as pd
from scipy.ndimage import label
struct = [[0, 1, 0],  # Structure used for segmentation
          [0, 1, 0],  # Equivalent to axis=0 in `numpy`
          [0, 1, 0]]  # Or 'columns' in `pandas`
labels, nlabels = label(df > 25, structure=struct)
>>> labels # Labels for each column-wise block of consecutive numbers > 25
Out[]:
array([[1, 0, 0, 0, 0],
       [1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [2, 0, 3, 0, 0],
       [2, 4, 3, 0, 0],
       [2, 4, 3, 5, 6]])
labels_df = pd.DataFrame(columns=df.columns, data=labels)  # Add original column names

res = (labels_df.apply(lambda x: x.value_counts())  # Execute `value_counts` on each column
                .iloc[1:]                           # slice results for labels > 0
                .max())                             # and get max value
>>> res
Out[]:
A 3.0
B 2.0
C 3.0
D 1.0
E 1.0
dtype: float64

Quickly Find Non-Zero Intervals

I am writing an algorithm to determine the intervals of the "mountains" on a density plot. The plot is taken from the depth data of a Kinect, if anyone is interested. Here is a quick visual example of what this algorithm finds (with the small mountains removed):
My current algorithm:
import numpy as np

def find_peak_intervals(data):
    previous = 0
    peak = False
    ranges = []
    begin_range = 0
    end_range = 0
    for current in xrange(len(data)):
        if (not peak) and ((data[current] - data[previous]) > 0):
            peak = True
            begin_range = current
        if peak and (data[current] == 0):
            peak = False
            end_range = current
            ranges.append((begin_range, end_range))
        previous = current
    return np.array(ranges)
The function works but it takes nearly 3 milliseconds on my laptop, and I need to be able to run my entire program at at least 30 frames per second. This function is rather ugly and I have to run it 3 times per frame for my program, so I would like any hints as to how to simplify and optimize this function (maybe something from numpy or scipy that I missed).
Assuming a pandas dataframe like so:
Value
0 0
1 3
2 2
3 2
4 1
5 2
6 3
7 0
8 1
9 3
10 0
11 0
12 0
13 1
14 0
15 3
16 2
17 3
18 1
19 0
You can get the contiguous non-zero ranges by using df["Value"].shift(x) where x could either be 1 or -1 so you can check if it's bounded by zeroes. Once you get the boundaries, you can just store their index pairs and use them later on when filtering the data.
The following code is based on the excellent answer here by @behzad.nouri.
import pandas as pd

df = pd.read_csv("data.csv")
# Or you can use df = pd.DataFrame.from_dict({'Value': {0: 0, 1: 3, 2: 2, 3: 2, 4: 1, 5: 2, 6: 3, 7: 0, 8: 1, 9: 3, 10: 0, 11: 0, 12: 0, 13: 1, 14: 0, 15: 3, 16: 2, 17: 3, 18: 1, 19: 0}})

# --
# from https://stackoverflow.com/questions/24281936
# credits to @behzad.nouri
df['tag'] = df['Value'] > 0
fst = df.index[df['tag'] & ~ df['tag'].shift(1).fillna(False)]
lst = df.index[df['tag'] & ~ df['tag'].shift(-1).fillna(False)]
pr = [(i, j) for i, j in zip(fst, lst)]
# --

for i, j in pr:
    print df.loc[i:j, "Value"]
This gives the result:
1 3
2 2
3 2
4 1
5 2
6 3
Name: Value, dtype: int64
8 1
9 3
Name: Value, dtype: int64
13 1
Name: Value, dtype: int64
15 3
16 2
17 3
18 1
Name: Value, dtype: int64
Timing it in IPython gives the following:
%timeit find_peak_intervals(df)
1000 loops, best of 3: 1.49 ms per loop
This is not too far from your attempt speed-wise. An alternative is to convert the pandas Series to a numpy array and operate from there. Let's take another excellent answer, this one by @Warren Weckesser, and modify it to suit your needs. Let's time it as well.
In [22]: np_arr = np.array(df["Value"])
In [23]: def greater_than_zero(a):
    ...:     isntzero = np.concatenate(([0], np.greater(a, 0).view(np.int8), [0]))
    ...:     absdiff = np.abs(np.diff(isntzero))
    ...:     ranges = np.where(absdiff == 1)[0].reshape(-1, 2)
    ...:     return ranges
In [24]: %timeit greater_than_zero(np_arr)
100000 loops, best of 3: 17.1 µs per loop
Not so bad at 17.1 microseconds, and it gives the same ranges as well.
[1 7] # Basically same as indices 1-6 in pandas.
[ 8 10] # 8, 9
[13 14] # 13, 13
[15 19] # 15, 18
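If you then need the actual segments (what find_peak_intervals is ultimately used for), note that the ranges are half-open, so they slice directly:

ranges = greater_than_zero(np_arr)
for start, end in ranges:          # end is exclusive, e.g. [1 7] covers indices 1 to 6
    segment = np_arr[start:end]    # the values of one "mountain"
    print(segment)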

Count number of elements in each column less than x

I have a DataFrame which looks like below. I am trying to count the number of elements less than 2.0 in each column, then I will visualize the result in a bar plot. I did it using lists and loops, but I wonder if there is a "Pandas way" to do this quickly.
x = []
for i in range(6):
    x.append(df[df.ix[:, i] < 2.0].count()[i])
then I can get a bar plot using list x.
A B C D E F
0 2.142 1.929 1.674 1.547 3.395 2.382
1 2.077 1.871 1.614 1.491 3.110 2.288
2 2.098 1.889 1.610 1.487 3.020 2.262
3 1.990 1.760 1.479 1.366 2.496 2.128
4 1.935 1.765 1.656 1.530 2.786 2.433
In [96]:
df = pd.DataFrame({'a':randn(10), 'b':randn(10), 'c':randn(10)})
df
Out[96]:
a b c
0 -0.849903 0.944912 1.285790
1 -1.038706 1.445381 0.251002
2 0.683135 -0.539052 -0.622439
3 -1.224699 -0.358541 1.361618
4 -0.087021 0.041524 0.151286
5 -0.114031 -0.201018 -0.030050
6 0.001891 1.601687 -0.040442
7 0.024954 -1.839793 0.917328
8 -1.480281 0.079342 -0.405370
9 0.167295 -1.723555 -0.033937
[10 rows x 3 columns]
In [97]:
df[df > 1.0].count()
Out[97]:
a 0
b 2
c 2
dtype: int64
So in your case:
df[df < 2.0 ].count()
should work
EDIT
some timings
In [3]:
%timeit df[df < 1.0 ].count()
%timeit (df < 1.0).sum()
%timeit (df < 1.0).apply(np.count_nonzero)
1000 loops, best of 3: 1.47 ms per loop
1000 loops, best of 3: 560 us per loop
1000 loops, best of 3: 529 us per loop
So @DSM's suggestions are correct and much faster than my suggestion.
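Since the end goal in the question is a bar plot, the boolean sum feeds straight into pandas plotting (a quick sketch):

import matplotlib.pyplot as plt

counts = (df < 2.0).sum()     # per-column count of values below the threshold
counts.plot.bar()             # equivalently: counts.plot(kind='bar')
plt.show()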
Method chaining is also possible (the comparison operators have method equivalents, e.g. < is lt() and <= is le()):
df.lt(2).sum()
If you have multiple conditions to consider, e.g. counting the number of values between 2 and 10, you can combine two boolean Series with boolean operators:
(df.gt(2) & df.lt(10)).sum()
or you can use pd.eval():
pd.eval("2 < df < 10").sum()
Count the number of values less than 2 or greater than 10:
(df.lt(2) | df.gt(10)).sum()
# or
pd.eval("df < 2 or df > 10").sum()
