Efficient way to select subset dataframe - python

I have 9.5M rows in a DataFrame of the following form:
Id | X | Y | Pass_Fail_Status
-------+---+---+-----------------
w0 | 0 | 0 | Pass
w0 | 0 | 1 | Fail
...
w1 | 0 | 0 | Fail
...
w6000 | 45| 45| Pass
What is the most efficient way to select a subset DataFrame for each "Id" and do processing with it?
As of now I'm doing the following. I already have the set of possible "Id"s from another DataFrame:
for id in uniqueIds:
    subsetDF = mainDF[mainDF["Id"] == id]
    predLabel = predict(subsetDF)
But this seems to have a severe performance issue: there are 6.7K such possible ids, each repeating 1.4K times. Profiling with cProfile does not point to this line directly, but I do see a scalar-op call taking time that has exactly 6.7K calls.
EDIT2: The requirement for the subset DataFrame is that all its rows have the same Id. For the training or prediction the 'Id' itself is not important; what matters is the X,Y location and the pass/fail status at that location.
The subsetDF should be of the following form:
Id | X | Y | Pass_Fail_Status
-------+---+---+-----------------
w0 | 0 | 0 | Pass
w0 | 0 | 1 | Fail
...
w1399 | 0 | 0 | Fail
...
w1399 |45 |45 | Pass

Conclusion:
Winner: groupby
According to the results of my experiments, the most efficient way to select a subset DataFrame for each "Id" and process it is to use the groupby method.
Code (Jupyter Lab):
# Preparation:
import pandas as pd
import numpy as np
# Create a sample dataframe
n = 6700 * 1400 # = 9380000
freq = 1400
mainDF = pd.DataFrame({
    'Id': ['w{:04d}'.format(i // freq) for i in range(n)],
    'X': np.random.randint(0, 46, n),
    'Y': np.random.randint(0, 46, n),
    'Pass_Fail_Status': [('Pass', 'Fail')[i] for i in np.random.randint(0, 2, n)]
})
uniqueIds = set(mainDF['Id'])
# Experiments:
# Experiment (a): apply pandas mask (the OP's method)
def exp_a():
    for _id in uniqueIds:
        subsetDF = mainDF[mainDF['Id'] == _id]
print('Experiment (a):')
%timeit exp_a()
# Experiment (b): use set_index
def exp_b():
    df_b = mainDF.set_index('Id')
    for _id in uniqueIds:
        subsetDF = df_b.loc[_id]
print('Experiment (b):')
%timeit exp_b()
# Experiment (c): use groupby
def exp_c():
    for _, subsetDF in mainDF.groupby('Id'):
        pass
print('Experiment (c):')
%timeit exp_c()
Output:
Experiment (a): # apply pandas mask (the OP's method)
39min 46s ± 992 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Experiment (b): # use set_index
1.19 s ± 7.49 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Experiment (c): # use groupby
997 ms ± 2.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Sample dataframe:
            Id   X   Y Pass_Fail_Status
0        w0000   9  28             Fail
1        w0000  42  28             Pass
2        w0000  26  36             Pass
...        ...  ..  ..              ...
9379997  w6699  12  14             Fail
9379998  w6699   8  40             Fail
9379999  w6699  17  21             Pass
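One more knob worth noting (an aside, not part of the experiments above): the pandas docs point out that turning off sorting of the group keys via sort=False can give better groupby performance when the order of the groups does not matter. A minimal variant of experiment (c), using the same mainDF:
def exp_c_unsorted():
    for _, subsetDF in mainDF.groupby('Id', sort=False):
        pass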

IIUC, you could use groupby + sample to randomly sample a certain fraction of the original df to split into train and test DataFrames:
train = df.groupby('Id').sample(frac=0.7)
test = df[~df.index.isin(train.index)]
For example, in the sample you have in the OP, the above code produces:
train:
Id X Y Pass_Fail_Status
0 w0 0 0 Pass
2 w1 0 0 Fail
3 w6000 45 45 Pass
test:
Id X Y Pass_Fail_Status
1 w0 0 1 Fail
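A small aside (not part of the original answer): groupby.sample (pandas 1.1+) also accepts a random_state argument, so the split can be made reproducible:
train = df.groupby('Id').sample(frac=0.7, random_state=42)
test = df[~df.index.isin(train.index)]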

I have tried the 'groupby' suggestion from both 'enke' and 'quasi-human'. It improved the overall performance (including other operations) by 6X (I measured three times for each approach; the gain is based on the average). The for loop now looks like this:
for id, subsetDF in mainDF.groupby("Id", as_index=False):
    predLabel = predict(subsetDF)
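For completeness, a minimal sketch of how this loop might collect one prediction per Id instead of overwriting predLabel on every pass (predict() is the question's own, unspecified function):
predictions = {}
for _id, subsetDF in mainDF.groupby("Id"):
    predictions[_id] = predict(subsetDF)   # one predicted label per Id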

Related

Count "greater" occurrences without loop

Let's assume that we have an array A of shape (100,) and B of shape (10,). Both contain values in [0,1].
How do we get the count of elements in A greater than each value in B? I expect an output of shape (10,), where the first element is "how many elements of A are greater than B[0]", the second is "how many are greater than B[1]", etc.
Without using loops.
I tried the following, but it didn't work:
import numpy as np
import numpy.random as rdm

A = rdm.rand(100)
B = np.linspace(0, 1, 10)

def occ(z: float) -> float:
    return np.count_nonzero(A > z)

occ(B)
Python won't use my function as a scalar function on B, that's why I get:
operands could not be broadcast together with shapes (10,) (100,)
I've also tried with np.greater but I've got the same issue ...
Slow But Simple
The error message is cryptic if you don't understand it, but it's telling you what to do. Array dimensions are broadcast together by lining them up starting with the right edge. This is especially helpful if you split your operation into two parts:
Create a (100, 10) mask showing which elements of A are greater than which elements of B:
mask = A[:, None] > B
Sum the result of the previous operation along the axis corresponding to A:
result = np.count_nonzero(mask, axis=0)
OR
result = np.sum(mask, axis=0)
This can be written as a one-liner:
(A[:, None] > B).sum(0)
OR
np.count_nonzero(A[:, None] > B, axis=0)
You can switch the dimensions and place B in the first axis to get the same result:
(A > B[:, None]).sum(1)
Fast and Elegant
Taking a totally different (but likely much more efficient) approach, you can use np.searchsorted:
A.sort()
result = A.size - np.searchsorted(A, B)
By default, searchsorted returns the left insertion index for each element of B, i.e. the number of elements of A strictly below it. Subtracting that from A.size gives the count of elements at or above each value, which (ties aside, negligible for random floats) is the "greater than" count you want.
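A tiny worked example of that bookkeeping (made-up numbers, not the benchmark data):
import numpy as np

A = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
B = np.array([0.0, 0.4, 0.8])

A.sort()                       # searchsorted requires a sorted array
idx = np.searchsorted(A, B)    # side='left': count of elements of A strictly below each B[j]
result = A.size - idx          # elements of A at or above each B[j]
print(result)                  # [5 3 1]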
Benchmarks
Here, the algos are labeled as follows:
B0: (A[:, None] > B).sum(0)
B1: (A > B[:, None]).sum(1)
HH: np.cumsum(np.histogram(A, bins=B)[0][::-1])[::-1]
SS: A.sort(); A.size - np.searchsorted(A, B)
+--------+--------+----------------------------------------+
| A.size | B.size | Time (B0 / B1 / HH / SS) |
+--------+--------+----------------------------------------+
| 100 | 10 | 20.9 µs / 15.7 µs / 68.3 µs / 8.87 µs |
+--------+--------+----------------------------------------+
| 1000 | 10 | 118 µs / 57.2 µs / 139 µs / 17.8 µs |
+--------+--------+----------------------------------------+
| 10000 | 10 | 987 µs / 288 µs / 1.23 ms / 131 µs |
+--------+--------+----------------------------------------+
| 100000 | 10 | 9.48 ms / 2.77 ms / 13.4 ms / 1.42 ms |
+--------+--------+----------------------------------------+
| 100 | 100 | 70.7 µs / 63.8 µs / 71 µs / 11.4 µs |
+--------+--------+----------------------------------------+
| 1000 | 100 | 518 µs / 388 µs / 148 µs / 21.6 µs |
+--------+--------+----------------------------------------+
| 10000 | 100 | 4.91 ms / 2.67 ms / 1.22 ms / 137 µs |
+--------+--------+----------------------------------------+
| 100000 | 100 | 57.4 ms / 35.6 ms / 13.5 ms / 1.42 ms |
+--------+--------+----------------------------------------+
Memory layout matters: B1 is consistently faster than B0, because summing contiguous (cached) elements along the last axis in C-order is faster than skipping across rows to fetch the next element. Broadcasting performs well when B is small. Keep in mind that both the time and space complexity of B0 and B1 is O(A.size * B.size). The complexity of HH and SS should be about O(A.size * log(A.size)), but SS is implemented much more efficiently than HH because it can assume more about the data.
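To make the O(A.size * B.size) point concrete, here is a quick look at the intermediate mask that B0/B1 materialize (a sketch using the sizes from the question):
import numpy as np

A = np.random.rand(100)
B = np.linspace(0, 1, 10)
mask = A[:, None] > B
print(mask.shape, mask.nbytes)   # (100, 10) 1000 -> every comparison is held in memory before the reduction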
I think you can use np.histogram for this job
A = rdm.rand(100)
B = np.linspace(0,1,10)
np.histogram(A, bins=B)[0]
Gives the output
array([10, 9, 8, 11, 9, 14, 10, 12, 17])
Note there are only 9 counts for the 10 bin edges; the count of values above B[9] = 1 is omitted because it is always zero for data in [0, 1].
And compute the cumsum backwards
np.cumsum(np.histogram(A, bins=B)[0][::-1])[::-1]
Output
array([100, 90, 81, 73, 62, 53, 39, 29, 17])
np.sum(A>B.reshape((-1,1)), axis=1)
Explanation
You need to understand broadcasting and reshaping for this. By reshaping B to shape (len(B), 1), it can be broadcast with A to produce an array with shape (len(B), len(A)) containing all comparisons. Then you sum over axis 1 (along A).
In other words, A < B does not work because A has 100 entries, and B has 10. If you read the broadcasting rules, you will see that numpy will start with the last dimension, and if they are the same size, then it can compare one-to-one. If one of the two is 1, then this dimension is stretched or “copied” to match the other. If they are not equal and none of them is equal to 1, it fails.
With a shorter example:
A = np.array([0.5112744 , 0.21696187, 0.14710105, 0.98581087, 0.50053359,
              0.54954654, 0.81217522, 0.50009166, 0.42990167, 0.56078499])
B = np.array([0.25, 0.5, 0.75])
the transpose of (A > B.reshape((-1, 1))) is (transposed for readability):
np.array([[ True, True, False],
[False, False, False],
[False, False, False],
[ True, True, True],
[ True, True, False],
[ True, True, False],
[ True, True, True],
[ True, True, False],
[ True, False, False],
[ True, True, False]])
and np.sum(A>B.reshape((-1,1)), axis=1) is
array([8, 7, 2])

Optimising itertools combination with grouped DataFrame and post filter

I have a DataFrame as
Locality money
1 3
1 4
1 10
1 12
1 15
2 16
2 18
I have to compute combinations with replacement of the money column within a groupby on Locality, with a filter on the money difference. The target must look like:
Locality money1 money2
1 3 3
1 3 4
1 4 4
1 10 10
1 10 12
1 10 15
1 12 12
1 12 15
1 15 15
2 16 16
2 16 18
2 18 18
Note that combinations are only formed between values in the same Locality, and only pairs whose difference is less than 6 are kept.
My current code is
from itertools import combinations_with_replacement
import numpy as np
import pandas as pd

def generate_graph(input_series, out_cols):
    return pd.DataFrame(list(combinations_with_replacement(input_series, r=2)), columns=out_cols)

df = (
    df.groupby(['Locality'])['money'].apply(
        lambda x: generate_graph(x, out_cols=['money1', 'money2'])
    ).reset_index().drop(columns=['level_1'], errors='ignore')
)
# Ensure the distance between money values is within the permissible limit
df = df.loc[(
    df['money2'] - df['money1'] < 6
)]
The issue is that my actual DataFrame has 100,000 rows and my code takes almost 33 seconds to process it. I need to optimize the time taken, probably using numpy. I am looking to optimize both the groupby and the post-filter, which cost extra space and time. For sample data, you can use this code to generate the DataFrame:
# Generate dummy data
t1 = list(range(0, 100000))
b = np.random.randint(100, 10000, 100000)
a = (b/100).astype(int)
df = pd.DataFrame({'Locality': a, 'money': t1})
df = df.sort_values(by=['Locality', 'money'])
To gain both a running-time speedup and reduced space consumption:
Instead of post-filtering, apply an extended function (say combine_values) that builds each group's DataFrame from a generator expression yielding already-filtered combinations.
(factor below is a default argument standing for the permissible limit mentioned above)
In [48]: def combine_values(values, out_cols, factor=6):
...: return pd.DataFrame(((m1, m2) for m1, m2 in combinations_with_replacement(values, r=2)
...: if m2 - m1 < factor), columns=out_cols)
...:
In [49]: df_result = (
...: df.groupby(['Locality'])['money'].apply(
...: lambda x: combine_values(x, out_cols=['money1', 'money2'])
...: ).reset_index().drop(columns=['level_1'], errors='ignore')
...: )
Execution time performance:
In [50]: %time df.groupby(['Locality'])['money'].apply(lambda x: combine_values(x, out_cols=['money1', 'money2'])).reset_index().drop(columns=['level_1'], errors='ignore')
CPU times: user 2.42 s, sys: 1.64 ms, total: 2.42 s
Wall time: 2.42 s
Out[50]:
Locality money1 money2
0 1 34 34
1 1 106 106
2 1 123 123
3 1 483 483
4 1 822 822
... ... ... ...
105143 99 99732 99732
105144 99 99872 99872
105145 99 99889 99889
105146 99 99913 99913
105147 99 99981 99981
[105148 rows x 3 columns]
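A hedged alternative sketch (not from the answer above, and assuming money values are distinct within each Locality, as in the dummy data): because the frame is sorted by money within each Locality, a vectorized self-merge plus filter yields the same ordered pairs without a Python-level combinations loop, at the cost of materializing the full within-group cross join first.
pairs = df.merge(df, on='Locality', suffixes=('1', '2'))   # within-Locality cross join
pairs = pairs[(pairs['money2'] >= pairs['money1']) &       # keep m1 <= m2, i.e. combinations with replacement
              (pairs['money2'] - pairs['money1'] < 6)]     # permissible-limit filter
df_alt = pairs[['Locality', 'money1', 'money2']].reset_index(drop=True)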

How to pass condition into lambda?

I have a dictionary like this:
Dict={'A':0.0697,'B':0.1136,'C':0.2227,'D':0.2725,'E':0.4555}
I want my output like this:
Return A,B,C,D,E if the value in my dataframe is LESS THAN 0.0697,0.1136,0.2227,0.2725,0.4555 respectively; else return F
I tried:
TrainTest['saga1'] = TrainTest['saga'].apply(lambda x,v: Dict[x] if x<=v else 'F')
But it returns an error:
TypeError: <lambda>() takes exactly 2 arguments (1 given)
Let's make some test data:
saga = pd.Series([0.1, 0.2, 0.3, 0.4, 0.5, 0.9])
Next, recognize that Dict is a plain dict and is not sorted by value, so let's sort its items by value in descending order:
thresh = sorted(Dict.items(), key=lambda t: t[1], reverse=True)
Finally, solve the problem by looping not over saga but over thresh, because loops (and apply()) in Python/Pandas are slow and we assume saga is much longer than thresh:
result = pd.Series('F', saga.index)  # all F's to start
for name, value in thresh:
    result[saga < value] = name
Now result is a Series of values A,B,C,D,E,F as appropriate. We loop over the thresholds in descending order because, for example, 0 is smaller than all of them and should end up labeled A, not E.
Regarding run-times:
In [160]: %%timeit
# loop over thresh (short), not over saga (much longer)
for name, value in thresh:
    result[saga < value] = name
100 loops, best of 3: 2.59 ms per loop
Here are pandas run-times:
saga1 = pd.DataFrame([0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.9], columns=['c1'])

def mapF(s):
    # thresh is sorted descending, so the last matching (smallest) threshold wins
    curr = 'F'
    for name, value in thresh:
        if s < value:
            curr = name
    return curr
Using map/apply:
In [149]: %%timeit
saga1['result'] = saga1['c1'].map(lambda x: mapF(x) )
1000 loops, best of 3: 311 µs per loop
Using vectorization:
In [166]: %%timeit
import numpy as np
saga1['result'] = np.vectorize(mapF)(saga1['c1'])
1000 loops, best of 3: 244 µs per loop
saga1:
+---+------+--------+
| | c1 | result |
+---+------+--------+
| 0 | 0.05 | A |
| 1 | 0.1 | B |
| 2 | 0.2 | C |
| 3 | 0.3 | E |
| 4 | 0.4 | E |
| 5 | 0.5 | F |
| 6 | 0.9 | F |
+---+------+--------+
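For reference, a vectorized alternative (not part of this answer) is pd.cut with the sorted thresholds as bin edges; a minimal sketch, assuming the Dict and saga1 defined above:
import numpy as np

edges = [-np.inf] + sorted(Dict.values()) + [np.inf]
labels = sorted(Dict, key=Dict.get) + ['F']        # ['A', 'B', 'C', 'D', 'E', 'F']
saga1['result_cut'] = pd.cut(saga1['c1'], bins=edges, labels=labels, right=False)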

Removing usernames from a dataframe that do not appear a certain number of times?

I am trying to understand the code provided below (which I found online, but do not fully understand). I essentially want to remove user names that do not appear in my dataframe at least 4 times (other than removing those names, I do not want to modify the dataframe in any other way). Does the following code solve this problem, and if so, can you explain how the filter combined with the lambda achieves this? I have the following:
df.groupby('userName').filter(lambda x: len(x) > 4)
I am also open to alternative solutions/approaches that are easy to understand.
You can check the docs on filtration: groupby.filter calls the lambda once per group (each x is that group's sub-DataFrame) and keeps a group's rows only when the lambda returns True, so len(x) > 4 keeps user names appearing more than 4 times (use >= 4 if "at least 4 times" is meant literally).
A faster solution on a bigger DataFrame is transform with boolean indexing:
df[df.groupby('userName')['userName'].transform('size') > 4]
Sample:
df = pd.DataFrame({'userName':['a'] * 5 + ['b'] * 3 + ['c'] * 6})
print (df.groupby('userName').filter(lambda x: len(x) > 4))
userName
0 a
1 a
2 a
3 a
4 a
8 c
9 c
10 c
11 c
12 c
13 c
print (df[df.groupby('userName')['userName'].transform('size') > 4])
userName
0 a
1 a
2 a
3 a
4 a
8 c
9 c
10 c
11 c
12 c
13 c
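To see why the boolean indexing works, look at the intermediate Series that transform('size') produces on the same sample df: each group's row count is broadcast back onto every row of that group, giving a mask aligned with the original index.
counts = df.groupby('userName')['userName'].transform('size')
print(counts.tolist())        # [5, 5, 5, 5, 5, 3, 3, 3, 6, 6, 6, 6, 6, 6]
print((counts > 4).tolist())  # True for the 'a' and 'c' rows, False for the 'b' rows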
Timings:
np.random.seed(123)
N = 1000000
L = np.random.randint(1000,size=N).astype(str)
df = pd.DataFrame({'userName': np.random.choice(L, N)})
print (df)
In [128]: %timeit (df.groupby('userName').filter(lambda x: len(x) > 1000))
1 loop, best of 3: 468 ms per loop
In [129]: %timeit (df[df.groupby('userName')['userName'].transform(len) > 1000])
1 loop, best of 3: 661 ms per loop
In [130]: %timeit (df[df.groupby('userName')['userName'].transform('size') > 1000])
10 loops, best of 3: 96.9 ms per loop
Using numpy
def pir(df, k):
    names = df.userName.values
    f, u = pd.factorize(names)   # integer code per row, array of unique names
    c = np.bincount(f)           # occurrence count per unique name
    m = c[f] > k                 # broadcast counts back onto rows -> boolean mask
    return df[m]
pir(df, 4)
userName
0 a
1 a
2 a
3 a
4 a
8 c
9 c
10 c
11 c
12 c
13 c
Timing
#jezrael's large data
np.random.seed(123)
N = 1000000
L = np.random.randint(1000,size=N).astype(str)
df = pd.DataFrame({'userName': np.random.choice(L, N)})
pir(df, 1000).equals(
df[df.groupby('userName')['userName'].transform('size') > 1000]
)
True
%timeit df[df.groupby('userName')['userName'].transform('size') > 1000]
%timeit pir(df, 1000)
10 loops, best of 3: 78.4 ms per loop
10 loops, best of 3: 61.9 ms per loop

Count number of elements in each column less than x

I have a DataFrame which looks like below. I am trying to count the number of elements less than 2.0 in each column, then I will visualize the result in a bar plot. I did it using lists and loops, but I wonder if there is a "Pandas way" to do this quickly.
x = []
for i in range(6):
    x.append(df[df.ix[:, i] < 2.0].count()[i])   # note: .ix is deprecated; .iloc is the modern equivalent
Then I can get a bar plot from the list x.
A B C D E F
0 2.142 1.929 1.674 1.547 3.395 2.382
1 2.077 1.871 1.614 1.491 3.110 2.288
2 2.098 1.889 1.610 1.487 3.020 2.262
3 1.990 1.760 1.479 1.366 2.496 2.128
4 1.935 1.765 1.656 1.530 2.786 2.433
In [96]:
df = pd.DataFrame({'a':randn(10), 'b':randn(10), 'c':randn(10)})
df
Out[96]:
a b c
0 -0.849903 0.944912 1.285790
1 -1.038706 1.445381 0.251002
2 0.683135 -0.539052 -0.622439
3 -1.224699 -0.358541 1.361618
4 -0.087021 0.041524 0.151286
5 -0.114031 -0.201018 -0.030050
6 0.001891 1.601687 -0.040442
7 0.024954 -1.839793 0.917328
8 -1.480281 0.079342 -0.405370
9 0.167295 -1.723555 -0.033937
[10 rows x 3 columns]
In [97]:
df[df > 1.0].count()
Out[97]:
a 0
b 2
c 2
dtype: int64
So in your case:
df[df < 2.0].count()
should work.
EDIT
some timings
In [3]:
%timeit df[df < 1.0 ].count()
%timeit (df < 1.0).sum()
%timeit (df < 1.0).apply(np.count_nonzero)
1000 loops, best of 3: 1.47 ms per loop
1000 loops, best of 3: 560 us per loop
1000 loops, best of 3: 529 us per loop
So #DSM's suggestions are correct and much faster than my suggestion
Method-chaining is possible (comparison operators have their respective methods, e.g. < = lt(), <= = le()):
df.lt(2).sum()
If you have multiple conditions to consider, e.g. counting the number of values between 2 and 10, you can use boolean operators on two boolean Series:
(df.gt(2) & df.lt(10)).sum()
or you can use pd.eval():
pd.eval("2 < df < 10").sum()
Count the number of values less than 2 or greater than 10:
(df.lt(2) | df.gt(10)).sum()
# or
pd.eval("df < 2 or df > 10").sum()
