Let's assume that we have an array A of shape (100,) and B of shape (10,). Both contain values in [0,1].
How do we get the count of elements in A greater than each value in B? I expect an output of shape (10,), where the first element is "how many in A are greater than B[0]", the second is "how many in A are greater than B[1]", etc.
Without using loops.
I tried the following, but it didn't work:
import numpy as np
import numpy.random as rdm
A = rdm.rand(100)
B = np.linspace(0,1,10)
def occ(z: float) -> float:
    return np.count_nonzero(A > z)
occ(B)
Python won't apply my function elementwise to B; instead the comparison inside it tries to broadcast A against B, which is why I get:
operands could not be broadcast together with shapes (10,) (100,)
I've also tried np.greater, but I got the same issue.
Slow But Simple
The error message is cryptic if you don't understand it, but it's telling you what to do. Array dimensions are broadcast together by lining them up starting with the right edge. This is especially helpful if you split your operation into two parts:
Create a (100, 10) mask showing which elements of A are greater than which elements of B:
mask = A[:, None] > B
Sum the result of the previous operation along the axis corresponding to A:
result = np.count_nonzero(mask, axis=0)
OR
result = np.sum(mask, axis=0)
This can be written as a one-liner:
(A[:, None] > B).sum(0)
OR
np.count_nonzero(A[:, None] > B, axis=0)
You can switch the dimensions and place B in the first axis to get the same result:
(A > B[:, None]).sum(1)
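To make the shape bookkeeping concrete, here is a small check (a sketch; np.broadcast_shapes requires NumPy >= 1.20, and all three one-liners above should agree):
import numpy as np

print(np.broadcast_shapes((100, 1), (10,)))   # (100, 10): A[:, None] against B
print(np.broadcast_shapes((10, 1), (100,)))   # (10, 100): B[:, None] against A

A = np.random.rand(100)
B = np.linspace(0, 1, 10)
r0 = (A[:, None] > B).sum(0)
r1 = (A > B[:, None]).sum(1)
r2 = np.count_nonzero(A[:, None] > B, axis=0)
assert (r0 == r1).all() and (r0 == r2).all()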
Fast and Elegant
Taking a totally different (but likely much more efficient) approach, you can use np.searchsorted:
A.sort()
result = A.size - np.searchsorted(A, B)
By default, searchsorted returns, for each element of B, the left-most index at which it could be inserted into A while keeping A sorted. Subtracting that index from A.size immediately tells you how many elements of A are at least that large, which equals the "greater than" count whenever there are no exact ties between A and B.
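If exact ties are possible, side='right' gives a strictly "greater than" count. A minimal sketch that also checks the result against the broadcasting approach (np.sort is used here so the original A is left untouched):
import numpy as np

A = np.random.rand(100)
B = np.linspace(0, 1, 10)

A_sorted = np.sort(A)  # sorted copy; the answer above sorts A in place instead
# side='right' counts elements strictly greater than each B value;
# the default side='left' would count elements >= each B value.
result = A_sorted.size - np.searchsorted(A_sorted, B, side='right')

assert np.array_equal(result, (A[:, None] > B).sum(0))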
Benchmarks
Here, the algos are labeled as follows:
B0: (A[:, None] > B).sum(0)
B1: (A > B[:, None]).sum(1)
HH: np.cumsum(np.histogram(A, bins=B)[0][::-1])[::-1]
SS: A.sort(); A.size - np.searchsorted(A, B)
+--------+--------+----------------------------------------+
| A.size | B.size | Time (B0 / B1 / HH / SS) |
+--------+--------+----------------------------------------+
| 100 | 10 | 20.9 µs / 15.7 µs / 68.3 µs / 8.87 µs |
+--------+--------+----------------------------------------+
| 1000 | 10 | 118 µs / 57.2 µs / 139 µs / 17.8 µs |
+--------+--------+----------------------------------------+
| 10000 | 10 | 987 µs / 288 µs / 1.23 ms / 131 µs |
+--------+--------+----------------------------------------+
| 100000 | 10 | 9.48 ms / 2.77 ms / 13.4 ms / 1.42 ms |
+--------+--------+----------------------------------------+
| 100 | 100 | 70.7 µs / 63.8 µs / 71 µs / 11.4 µs |
+--------+--------+----------------------------------------+
| 1000 | 100 | 518 µs / 388 µs / 148 µs / 21.6 µs |
+--------+--------+----------------------------------------+
| 10000 | 100 | 4.91 ms / 2.67 ms / 1.22 ms / 137 µs |
+--------+--------+----------------------------------------+
| 100000 | 100 | 57.4 ms / 35.6 ms / 13.5 ms / 1.42 ms |
+--------+--------+----------------------------------------+
Memory layout matters: B1 is consistently faster than B0, because summing contiguous (cached) elements along the last axis in C order is faster than skipping across rows to reach the next element. Broadcasting performs well for small B. Keep in mind that both the time and space complexity of B0 and B1 is O(A.size * B.size). The complexity of HH and SS is roughly O(A.size * log(A.size)), but SS is implemented much more efficiently than HH because it can make stronger assumptions about the data.
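For reference, a minimal sketch for reproducing one row of the table on your own machine (absolute numbers will of course differ):
import timeit
import numpy as np

A = np.random.rand(100_000)
B = np.linspace(0, 1, 100)

def b1():
    return (A > B[:, None]).sum(1)

def ss():
    a = np.sort(A)  # sort a copy so A itself stays untouched
    return a.size - np.searchsorted(a, B)

print(timeit.timeit(b1, number=100) / 100)  # seconds per call
print(timeit.timeit(ss, number=100) / 100)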
I think you can use np.histogram for this job
A = rdm.rand(100)
B = np.linspace(0,1,10)
np.histogram(A, bins=B)[0]
Gives the output
array([10, 9, 8, 11, 9, 14, 10, 12, 17])
Note that only 9 counts are returned (one per bin between consecutive edges); the count for B[9] = 1.0 itself is always 0 because there are no values > 1.
Then compute the cumulative sum backwards:
np.cumsum(np.histogram(A, bins=B)[0][::-1])[::-1]
Output
array([100, 90, 81, 73, 62, 53, 39, 29, 17])
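This gives 9 values while the question asks for shape (10,). Since the count for the last edge B[9] = 1.0 is always 0 here, one option (assuming a trailing zero is the desired fill) is simply to append it:
result = np.append(np.cumsum(np.histogram(A, bins=B)[0][::-1])[::-1], 0)
result.shape  # (10,)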
np.sum(A>B.reshape((-1,1)), axis=1)
Explanation
You need to understand broadcasting and reshaping for this. By reshaping B to shape (len(B), 1), it can be broadcast against A to produce an array of shape (len(B), len(A)) containing all pairwise comparisons. Then you sum over axis 1 (along A).
In other words, A > B does not work directly because A has 100 entries and B has 10. If you read the broadcasting rules, you will see that NumPy starts with the last dimension: if the two sizes are equal, it compares them one-to-one; if one of them is 1, that dimension is stretched ("copied") to match the other; if they are unequal and neither is 1, it fails.
With a shorter example:
A = np.array([0.5112744 , 0.21696187, 0.14710105, 0.98581087, 0.50053359,
              0.54954654, 0.81217522, 0.50009166, 0.42990167, 0.56078499])
B = np.array([0.25, 0.5 , 0.75])
The transpose of (A > B.reshape((-1,1))) (shown this way for readability) is:
np.array([[ True,  True, False],
          [False, False, False],
          [False, False, False],
          [ True,  True,  True],
          [ True,  True, False],
          [ True,  True, False],
          [ True,  True,  True],
          [ True,  True, False],
          [ True, False, False],
          [ True,  True, False]])
and np.sum(A>B.reshape((-1,1)), axis=1) is
array([8, 7, 2])
Related
I have a pandas DataFrame in which each column represents a quarter, with the most recent quarters placed to the right. Not all the information arrives at the same time, so some columns might be missing information (NaN values).
I would like to create a new column with the number of the first criteria that the row matches, or zero if it doesn't match any criteria.
The criteria are applied to the 3 most recent columns that have data (integers, ignoring NaNs), and a match means that each value in the DataFrame is greater than or equal to its corresponding value in the criteria list.
I tried using apply, but I couldn't make it work, and the failed attempts were slow.
import pandas as pd
import numpy as np
criteria_dict = {
    1: [10, 0, 10],
    2: [0, 10, 10],
}
list_of_tuples = [
    (78, 7, 11, 15),       # classify as 2 since 7 >= 0, 11 >= 10, 15 >= 10
    (98, -5, np.nan, 18),  # classify as 0: ignoring the NaN, it doesn't match any criteria because of the -5
    (-78, 20, 64, 28),     # classify as 1 since 20 >= 10, 64 >= 0, 28 >= 10
    (35, 63, 27, np.nan),  # classify as 1: the NaN is ignored, 35 >= 10, 63 >= 0, 27 >= 10
    (-11, 0, 56, 10),      # classify as 2 since 0 >= 0, 56 >= 10, 10 >= 10
]
df = pd.DataFrame(
    list_of_tuples,
    index=['A', 'B', 'C', 'D', 'E'],
    columns=['2021Q2', '2021Q3', '2021Q4', '2022Q1'],
)
print(df)
Applying a custom function to each row should work.
def func(x):
    x = x.dropna().to_numpy()[-3:]
    if len(x) < 3:
        return 0
    for k, v in criteria_dict.items():
        if np.all(x >= v):
            return k
    return 0

df.apply(func, axis=1)
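For the sample DataFrame above, this should return a Series matching the expected labels from the setup comments:
A    2
B    0
C    1
D    1
E    2
dtype: int64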
Using apply is probably the most straightforward option, but I wanted to try a solution with numpy, which should be faster for DataFrames with many rows.
import numpy as np
# Work on the underlying array.
df_arr = df.to_numpy()
# Find NaNs.
nans = np.nonzero(np.isnan(df_arr))
# Roll the rows so that the latest three columns with valid data are all to the right
# (this assumes at most one NaN per row, as in the sample data).
for row, col in zip(*nans):
    df_arr[row, :] = np.roll(df_arr[row, :], shift=4-col)
# Check for matching criteria.
df['criteria'] = np.select([np.all((df_arr[:, 1:] - criteria_dict[crit]) >= 0, axis=1) for crit in criteria_dict],
                           [crit for crit in criteria_dict])
print(df)
2021Q2 2021Q3 2021Q4 2022Q1 criteria
A 78 7 11.0 15.0 2.0
B 98 -5 NaN 18.0 0.0
C -78 20 64.0 28.0 1.0
D 35 63 27.0 NaN 1.0
E -11 0 56.0 10.0 2.0
Some timings on df = pd.concat([df]*10000):
%timeit numpy(df)
# 103 ms ± 1.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit pandas_apply(df)
# 1.32 s ± 14.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
So the NumPy version is ~10x faster.
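The numpy and pandas_apply wrappers used in the %timeit calls are not shown; presumably they just wrap the two approaches above into functions, along these lines (names taken from the timing calls, bodies assumed; criteria_dict and func are the ones defined earlier):
def pandas_apply(df):
    return df.apply(func, axis=1)

def numpy(df):  # shadows the module name, but the module is imported as np above
    # select just the quarter columns in case a 'criteria' column was already added
    df_arr = df[['2021Q2', '2021Q3', '2021Q4', '2022Q1']].to_numpy()
    nans = np.nonzero(np.isnan(df_arr))
    for row, col in zip(*nans):
        df_arr[row, :] = np.roll(df_arr[row, :], shift=4-col)
    return np.select(
        [np.all((df_arr[:, 1:] - criteria_dict[crit]) >= 0, axis=1) for crit in criteria_dict],
        [crit for crit in criteria_dict])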
It is possible to achieve a fully vectorized comparison. Note that the bottleneck is the broadcasting step, which creates an intermediate array of size K*N*M, where M*N is the size of the subset of the DataFrame (here 5*3) and K*N that of the criteria (here 2*3). You need enough memory to hold this array.
Step by step procedure:
First get last 3 non-nan values as b:
from scipy.stats import rankdata

N = 3
a = df.to_numpy()
b = a[rankdata(~np.isnan(a), method='ordinal', axis=1) > (a.shape[1]-N)].reshape(-1, N)
array([[ 7., 11., 15.],
       [98., -5., 18.],
       [20., 64., 28.],
       [35., 63., 27.],
       [ 0., 56., 10.]])
Then craft an array with the conditions as c:
c = np.array(list(criteria_dict.values()))
array([[10,  0, 10],
       [ 0, 10, 10]])
Broadcast the comparison of b against c and require every value in the row to satisfy >=:
d = (b>=c[:, None]).all(2)
array([[False, False,  True,  True, False],
       [ True, False,  True,  True,  True]])
Get index of first True using the criteria_dict keys (else 0):
e = np.where(d.any(0), np.array(list(criteria_dict))[np.argmax(d, axis=0)], 0)
array([2, 0, 1, 1, 2])
Assign to DataFrame:
df['criteria'] = e
2021Q2 2021Q3 2021Q4 2022Q1 criteria
A 78 7 11.0 15.0 2
B 98 -5 NaN 18.0 0
C -78 20 64.0 28.0 1
D 35 63 27.0 NaN 1
E -11 0 56.0 10.0 2
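The steps above can be bundled into a single helper (a sketch that reuses exactly the operations shown; the function name classify is made up):
import numpy as np
from scipy.stats import rankdata

def classify(frame, criteria_dict, N=3):
    a = frame.to_numpy()
    # keep the last N non-NaN values of each row, in their original order
    b = a[rankdata(~np.isnan(a), method='ordinal', axis=1) > (a.shape[1] - N)].reshape(-1, N)
    c = np.array(list(criteria_dict.values()))   # (K, N) criteria matrix
    d = (b >= c[:, None]).all(2)                 # (K, rows) full-match flags
    return np.where(d.any(0), np.array(list(criteria_dict))[np.argmax(d, axis=0)], 0)

df['criteria'] = classify(df[['2021Q2', '2021Q3', '2021Q4', '2022Q1']], criteria_dict)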
I have 9.5M rows in a DataFrame of the following form:
Id | X | Y | Pass_Fail_Status
-------+---+---+-----------------
w0 | 0 | 0 | Pass
w0 | 0 | 1 | Fail
...
w1 | 0 | 0 | Fail
...
w6000 | 45| 45| Pass
What is the most efficient way to select a subset DataFrame for each "Id" and do processing with it?
As of now I'm doing the following (I already have the set of possible "Id"s from another DataFrame):
for id in uniqueIds:
    subsetDF = mainDF[mainDF["Id"] == id]
    predLabel = predict(subsetDF)
But this seems to have a severe performance issue, as there are 6.7K such possible Ids, each repeating about 1.4K times. I've done some profiling using cProfile; it does not point to this line, but I see a scalar-op call taking time that has exactly 6.7K calls.
EDIT2: The requirement for the subset DataFrame is that all rows should have the same Id. In the end, the 'Id' itself is not that important for training or prediction, but the X,Y location and the pass/fail status at that location are.
The subsetDF should be of following form:
Id | X | Y | Pass_Fail_Status
-------+---+---+-----------------
w0 | 0 | 0 | Pass
w0 | 0 | 1 | Fail
...
w1399 | 0 | 0 | Fail
...
w1399 |45 |45 | Pass
Conclusion:
Winner: groupby
According to the results of my experiments, the most efficient way to select a subset DataFrame for each "Id" and do processing with it is to use the groupby method.
Code (Jupyter Lab):
# Preparation:
import pandas as pd
import numpy as np
# Create a sample dataframe
n = 6700 * 1400 # = 9380000
freq = 1400
mainDF = pd.DataFrame({
    'Id': ['w{:04d}'.format(i//freq) for i in range(n)],
    'X': np.random.randint(0, 46, n),
    'Y': np.random.randint(0, 46, n),
    'Pass_Fail_Status': [('Pass', 'Fail')[i] for i in np.random.randint(0, 2, n)]
})
uniqueIds = set(mainDF['Id'])
# Experiments:
# Experiment (a): apply pandas mask (the OP's method)
def exp_a():
    for _id in uniqueIds:
        subsetDF = mainDF[mainDF['Id'] == _id]

print('Experiment (a):')
%timeit exp_a()

# Experiment (b): use set_index
def exp_b():
    df_b = mainDF.set_index('Id')
    for _id in uniqueIds:
        subsetDF = df_b.loc[_id]

print('Experiment (b):')
%timeit exp_b()

# Experiment (c): use groupby
def exp_c():
    for _, subsetDF in mainDF.groupby('Id'):
        pass

print('Experiment (c):')
%timeit exp_c()
Output:
Experiment (a): # apply pandas mask (the OP's method)
39min 46s ± 992 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Experiment (b): # use set_index
1.19 s ± 7.49 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Experiment (c): # use groupby
997 ms ± 2.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Sample dataframe:
              Id   X   Y Pass_Fail_Status
0          w0000   9  28             Fail
1          w0000  42  28             Pass
2          w0000  26  36             Pass
...          ...  ..  ..              ...
9379997    w6699  12  14             Fail
9379998    w6699   8  40             Fail
9379999    w6699  17  21             Pass
IIUC, you could use groupby + sample to randomly sample a certain fraction of the original df to split into train and test DataFrames:
train = df.groupby('Id').sample(frac=0.7)
test = df[~df.index.isin(train.index)]
For example, in the sample you have in the OP, the above code produces:
train:
Id X Y Pass_Fail_Status
0 w0 0 0 Pass
2 w1 0 0 Fail
3 w6000 45 45 Pass
test:
Id X Y Pass_Fail_Status
1 w0 0 1 Fail
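If reproducibility of the split matters, DataFrameGroupBy.sample also accepts a random_state (pandas >= 1.1), e.g.:
train = df.groupby('Id').sample(frac=0.7, random_state=42)
test = df[~df.index.isin(train.index)]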
I have tried the 'groupby' suggestion from both 'enke' and 'quasi-human'. It improved the overall performance (including other operations) by 6x (I measured the numbers 3 times for each approach and this gain is based on the average). Now the for loop looks like the following:
for id, subsetDF in mainDF.groupby("Id", as_index=False):
    predLabel = predict(subsetDF)
I would like to replace values with column labels according to the largest 3 values for each row. Let's assume this input:
p1 p2 p3 p4
0 0 9 1 4
1 0 2 3 4
2 1 3 10 7
3 1 5 3 1
4 2 3 7 10
Given n = 3, I am looking for:
Top1 Top2 Top3
0 p2 p4 p3
1 p4 p3 p2
2 p3 p4 p2
3 p2 p3 p1
4 p4 p3 p2
I'm not concerned about duplicates, e.g. for index 3, Top3 can be 'p1' or 'p4'.
Attempt 1
My first attempt is a full sort using np.ndarray.argsort:
res = pd.DataFrame(df.columns[df.values.argsort(1)]).iloc[:, len(df.index): 0: -1]
But in reality I have more than 4 columns and this will be inefficient.
Attempt 2
Next I tried np.argpartition. But since values within each partition are not sorted, this required a subsequent sort:
n = 3
parts = np.argpartition(-df.values, n, axis=1)[:, :-1]
args = (-df.values[np.arange(df.shape[0])[:, None], parts]).argsort(1)
res = pd.DataFrame(df.columns[parts[np.arange(df.shape[0])[:, None], args]],
                   columns=[f'Top{i}' for i in range(1, n+1)])
This, in fact, works out slower than the first attempt for larger dataframes. Is there a more efficient way which takes advantage of partial sorting? You can use the below code for benchmarking purposes.
Benchmarking
# Python 3.6.0, NumPy 1.11.3, Pandas 0.19.2
import pandas as pd, numpy as np
df = pd.DataFrame({'p1': [0, 0, 1, 1, 2],
                   'p2': [9, 2, 3, 5, 3],
                   'p3': [1, 3, 10, 3, 7],
                   'p4': [4, 4, 7, 1, 10]})
def full_sort(df):
    return pd.DataFrame(df.columns[df.values.argsort(1)]).iloc[:, len(df.index): 0: -1]

def partial_sort(df):
    n = 3
    parts = np.argpartition(-df.values, n, axis=1)[:, :-1]
    args = (-df.values[np.arange(df.shape[0])[:, None], parts]).argsort(1)
    return pd.DataFrame(df.columns[parts[np.arange(df.shape[0])[:, None], args]])
df = pd.concat([df]*10**5)
%timeit full_sort(df) # 86.3 ms per loop
%timeit partial_sort(df) # 158 ms per loop
With a decent number of columns, we can use np.argpartition with some slicing and indexing, like so -
def topN_perrow_colsindexed(df, N):
    # Extract array data
    a = df.values

    # Get top N indices per row in not-necessarily-sorted order
    idxtopNpart = np.argpartition(a, -N, axis=1)[:, -1:-N-1:-1]

    # Index into the input data with those and use argsort to force sorted order
    sidx = np.take_along_axis(a, idxtopNpart, axis=1).argsort(1)
    idxtopN = np.take_along_axis(idxtopNpart, sidx[:, ::-1], axis=1)

    # Index into the column values with those for the final output
    c = df.columns.values
    return pd.DataFrame(c[idxtopN], columns=['Top' + str(i+1) for i in range(N)])
Sample run -
In [65]: df
Out[65]:
p1 p2 p3 p4
0 0 9 1 4
1 0 2 3 4
2 1 3 10 7
3 1 5 3 1
4 2 3 7 10
In [66]: topN_perrow_colsindexed(df, N=3)
Out[66]:
Top1 Top2 Top3
0 p2 p4 p3
1 p4 p3 p2
2 p3 p4 p2
3 p2 p3 p4
4 p4 p3 p2
Timings -
In [143]: np.random.seed(0)
In [144]: df = pd.DataFrame(np.random.rand(10000,30))
In [145]: %timeit full_sort(df)
...: %timeit partial_sort(df)
...: %timeit topN_perrow_colsindexed(df,N=3)
100 loops, best of 3: 7.96 ms per loop
100 loops, best of 3: 13.9 ms per loop
100 loops, best of 3: 5.47 ms per loop
In [146]: df = pd.DataFrame(np.random.rand(10000,100))
In [147]: %timeit full_sort(df)
...: %timeit partial_sort(df)
...: %timeit topN_perrow_colsindexed(df,N=3)
10 loops, best of 3: 34 ms per loop
10 loops, best of 3: 56.1 ms per loop
100 loops, best of 3: 13.6 ms per loop
I am trying to understand the code provided below (which I found online but do not fully understand). I want to remove user names that do not appear in my dataframe at least 4 times (other than removing these names, I do not want to modify the dataframe in any other way). Does the following code solve this problem, and if so, can you explain how filter combined with the lambda achieves this? I have the following:
df.groupby('userName').filter(lambda x: len(x) > 4)
I am also open to alternative solutions/approaches that are easy to understand.
You can check the documentation on filtration: groupby('userName') splits the frame into one sub-DataFrame per user, the lambda receives each sub-DataFrame and returns True when it has more than 4 rows, and filter keeps only the rows of the groups for which the lambda returned True.
A faster solution on a bigger DataFrame is transform with boolean indexing:
df[df.groupby('userName')['userName'].transform('size') > 4]
Sample:
df = pd.DataFrame({'userName':['a'] * 5 + ['b'] * 3 + ['c'] * 6})
print (df.groupby('userName').filter(lambda x: len(x) > 4))
userName
0 a
1 a
2 a
3 a
4 a
8 c
9 c
10 c
11 c
12 c
13 c
print (df[df.groupby('userName')['userName'].transform('size') > 4])
userName
0 a
1 a
2 a
3 a
4 a
8 c
9 c
10 c
11 c
12 c
13 c
Timings:
np.random.seed(123)
N = 1000000
L = np.random.randint(1000,size=N).astype(str)
df = pd.DataFrame({'userName': np.random.choice(L, N)})
print (df)
In [128]: %timeit (df.groupby('userName').filter(lambda x: len(x) > 1000))
1 loop, best of 3: 468 ms per loop
In [129]: %timeit (df[df.groupby('userName')['userName'].transform(len) > 1000])
1 loop, best of 3: 661 ms per loop
In [130]: %timeit (df[df.groupby('userName')['userName'].transform('size') > 1000])
10 loops, best of 3: 96.9 ms per loop
Using numpy
def pir(df, k):
    names = df.userName.values
    f, u = pd.factorize(names)  # f: integer code per row, u: the unique names
    c = np.bincount(f)          # number of occurrences of each unique name
    m = c[f] > k                # mask: rows whose name occurs more than k times
    return df[m]
pir(df, 4)
userName
0 a
1 a
2 a
3 a
4 a
8 c
9 c
10 c
11 c
12 c
13 c
Timing
#jezrael's large data
np.random.seed(123)
N = 1000000
L = np.random.randint(1000,size=N).astype(str)
df = pd.DataFrame({'userName': np.random.choice(L, N)})
pir(df, 1000).equals(
    df[df.groupby('userName')['userName'].transform('size') > 1000]
)
True
%timeit df[df.groupby('userName')['userName'].transform('size') > 1000]
%timeit pir(df, 1000)
10 loops, best of 3: 78.4 ms per loop
10 loops, best of 3: 61.9 ms per loop
In a typical pandas DataFrame, it's easy to select desired rows based on their index:
df.ix[list_of_inds] or df.loc[list_of_inds]
However, using this method to take a substantial subset of a large, sparse DataFrame (73,000 rows, 8,000 columns specifically) seems to be extremely intensive: my memory usage shoots up and my computer crashes.
I did notice that indexing using a range like this..
df.ix[1:N]
works fine, while using a list of indices like this...
df.ix[np.arange(1,N)]
is what makes the memory overload.
Is there another way to select rows from a sparse dataframe that's computationally easier? Or, can I convert this dataframe to an actual sparse matrix...
sparse_df = scipy.sparse.csc(df)
and select only the indices I want from that?
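On the conversion idea: the actual constructors are scipy.sparse.csc_matrix / csr_matrix (there is no scipy.sparse.csc callable), and CSR supports row selection directly. A minimal sketch, assuming the DataFrame is entirely numeric:
import numpy as np
import scipy.sparse as sp

sparse_mat = sp.csr_matrix(df.values)          # CSR is the convenient format for row slicing
subset = sparse_mat[np.asarray(list_of_inds)]  # still sparse; call .toarray() only if it fits in memory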
The issue you are facing could be related to view vs copy semantics.
df.ix[1:N] # uses slicing => operates on a view
df.ix[np.arange(1,N)] # uses fancy indexing => "probably" creates a copy first
I created a DataFrame of shape 73000x8000 on my machine and my memory spiked to 4.4 GB, so I wouldn't be surprised by crashes. That said, if you do need to create a new array from the index list, then you're out of luck. However, to modify the original DataFrame, you should be able to update it one row at a time, or a few sliced rows at a time, at the expense of speed, e.g.:
for i in arbitrary_list_of_indices:
    df.ix[i] = new_values
Btw, you could try working off NumPy arrays directly; I find their documentation gives clearer descriptions of which operations result in copies vs. views. You can always create a DataFrame from the array with hardly any memory overhead, since it just creates a reference to the original array.
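Whether a given construction really avoids a copy can be checked directly; behaviour varies across pandas versions and dtype mixes, so treat this as a sanity check rather than a guarantee:
import numpy as np
import pandas as pd

arr = df.to_numpy()                        # df: any single-dtype frame, e.g. the test case below
df2 = pd.DataFrame(arr, copy=False)        # ask pandas not to copy the data
print(np.shares_memory(arr, df2.values))   # True when the DataFrame is really just a view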
Also, indexing in numpy seems much faster, even without slicing. Here's a simple test case:
In [66]: df
Out[66]:
0 1 2 3
0 3 14 5 1
1 9 19 14 4
2 5 4 5 5
3 13 14 4 7
4 8 12 3 16
5 15 3 17 12
6 11 0 12 0
In [68]: df.ix[[1,3,5]] # fancy index version
Out[68]:
0 1 2 3
1 9 19 14 4
3 13 14 4 7
5 15 3 17 12
In [69]: df.ix[1:5:2] # sliced version of the same
Out[69]:
0 1 2 3
1 9 19 14 4
3 13 14 4 7
5 15 3 17 12
In [71]: %timeit df.ix[[1,3,5]] = -1 # use fancy index version
1000 loops, best of 3: 251 µs per loop
In [72]: %timeit df.ix[1:5:2] = -2 # faster sliced version
10000 loops, best of 3: 157 µs per loop
In [73]: arr = df.values
In [74]: arr
Out[74]:
array([[ 3, 14,  5,  1],
       [-2, -2, -2, -2],
       [ 5,  4,  5,  5],
       [-2, -2, -2, -2],
       [ 8, 12,  3, 16],
       [-2, -2, -2, -2],
       [11,  0, 12,  0]])
In [75]: %timeit arr[[1,3,5]] = -1 # much faster than DataFrame
The slowest run took 23.49 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 4.56 µs per loop
In [77]: %timeit arr[1:5:2] = -3 # really fast but restricted to slicing
The slowest run took 19.46 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 821 ns per loop
Good luck!