Python Pandas Percentage Calculations - python

Trying to Calculate Percentages on Python Pandas
   1    2
0  A    0
1  A    1
2  A    2
3  B    0
4  B    0
5  B  1.5
6  B    0
Desired output: the percentage of A's and B's with a score higher than 0.
A = 66%
B = 25%
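For reference, a minimal construction of this sample frame (my assumption: the columns are literally named '1' and '2', since the answers below index them that way):

import pandas as pd

df = pd.DataFrame({'1': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   '2': [0, 1, 2, 0, 0, 1.5, 0]})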

Create a boolean mask on the second column: (df['2'] > 0)
Group it by the first column
Aggregate with sum and size (sum counts the rows that satisfy the condition)
Divide sum by size to get the percentage:
res = (df['2'] > 0).groupby(df['1']).agg(['sum', 'size'])
res['sum'] / res['size']
Out:
1
A 0.666667
B 0.250000
dtype: float64
This can be done in a more compact way with a lambda expression:
df.groupby('1')['2'].agg(lambda x: (x > 0).sum() / x.size)
Out:
1
A 0.666667
B 0.250000
Name: 2, dtype: float64
but I suspect that the first one is more efficient.

In [3]: df['2'].gt(0).groupby(df['1']).mean()
Out[3]:
1
A 0.666667
B 0.250000
Name: 2, dtype: float64
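Side note (mine, not from the answer): this works because the mean of a boolean Series is simply the proportion of True values, so no explicit sum/size division is needed:

import pandas as pd

s = pd.Series([True, True, False])
print(s.sum() / s.size)  # 0.666...
print(s.mean())          # same result: the mean of booleans is the share of True values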

Don't mind me...
I'm on a kick where I'm solving everything with np.bincount and pd.factorize
f, u = df['1'].factorize()
pd.Series(
    np.bincount(f, df['2'].values > 0) / np.bincount(f),
    u
)
A 0.666667
B 0.250000
dtype: float64
One-liner version for fun!
(lambda w, g, f, u: pd.Series(g(f, w) / g(f), u))(
    df['2'].values > 0, np.bincount, *pd.factorize(df['1'].values)
)
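A quick sketch (my addition) of the factorize/bincount mechanics used here: factorize turns the labels into integer codes, a plain bincount of the codes gives the group sizes, and a weighted bincount gives per-group sums of the boolean condition:

import numpy as np
import pandas as pd

codes, uniques = pd.factorize(pd.Series(['A', 'A', 'B']))
print(codes, uniques)                          # [0 0 1]  Index(['A', 'B'], dtype='object')
print(np.bincount(codes))                      # group sizes: [2 1]
print(np.bincount(codes, weights=[1, 0, 1]))   # per-group sums of the condition: [1. 1.]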
Naive Timing
%timeit df['2'].gt(0).groupby(df['1']).mean()
%timeit df.groupby('1')['2'].agg(lambda x: (x > 0).sum() / x.size)
%timeit (lambda w, g, f, u: pd.Series(g(f, w) / g(f), u))(df['2'].values > 0, np.bincount, *pd.factorize(df['1'].values))
1000 loops, best of 3: 697 µs per loop
1000 loops, best of 3: 1 ms per loop
10000 loops, best of 3: 117 µs per loop

Related

Cumulative apply within window defined by other columns

I am trying to apply a function, cumulatively, to values that lie within a window defined by 'start' and 'finish' columns. So, 'start' and 'finish' define the intervals where the value is 'active'; for each row, I want to get a sum of all 'active' values at the time.
Here is a 'bruteforce' example that does what I am after - is there a more elegant, faster or more memory efficient way of doing this?
df = pd.DataFrame(data=[[1, 3, 100], [2, 4, 200], [3, 6, 300], [4, 6, 400], [5, 6, 500]],
                  columns=['start', 'finish', 'val'])
df['dummy'] = 1
df = df.merge(df, on=['dummy'], how='left')
df = df[(df['start_y'] <= df['start_x']) & (df['finish_y'] > df['start_x'])]
val = df.groupby('start_x')['val_y'].sum()
Originally, df is:
   start  finish  val
0      1       3  100
1      2       4  200
2      3       6  300
3      4       6  400
4      5       6  500
The result I am after is:
1 100
2 300
3 500
4 700
5 1200
numba
import numpy as np
from numba import njit

@njit
def pir_numba(S, F, V):
    mn = S.min()
    mx = F.max()
    out = np.zeros(mx)
    for s, f, v in zip(S, F, V):
        out[s:f] += v
    return out[mn:]

pir_numba(*[df[c].values for c in ['start', 'finish', 'val']])
np.bincount
s, f, v = [df[col].values for col in ['start', 'finish', 'val']]
np.bincount([i - 1 for r in map(range, s, f) for i in r], v.repeat(f - s))
array([ 100., 300., 500., 700., 1200.])
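How this line works, as I read it (not spelled out in the original answer): each row contributes its value to every integer time point in [start, finish); the comprehension lists those time points shifted to be 0-based, v.repeat(f - s) supplies one copy of the value per active time point, and np.bincount sums the weights per time point:

import numpy as np

s = np.array([1, 2, 3, 4, 5])
f = np.array([3, 4, 6, 6, 6])
v = np.array([100, 200, 300, 400, 500])

idx = [i - 1 for r in map(range, s, f) for i in r]  # 0-based active time points per row
w = v.repeat(f - s)                                 # each value repeated once per active point
print(idx)                                          # [0, 1, 1, 2, 2, 3, 4, 3, 4, 4]
print(np.bincount(idx, weights=w))                  # [ 100.  300.  500.  700. 1200.]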
Comprehension
This depends on the index being unique
pd.Series({
    (k, i): v
    for i, s, f, v in df.itertuples()
    for k in range(s, f)
}).sum(level=0)
1 100
2 300
3 500
4 700
5 1200
dtype: int64
With no dependence on index
pd.Series({
    (k, i): v
    for i, (s, f, v) in enumerate(zip(*map(df.get, ['start', 'finish', 'val'])))
    for k in range(s, f)
}).sum(level=0)
Using numpy broadcasting. Unfortunately it is still an O(n*m) solution, but it should be faster than the groupby. So far, based on my tests, Pir's solution performs best.
s1=df['start'].values
s2=df['finish'].values
np.sum(((s1<=s1[:,None])&(s2>=s2[:,None]))*df.val.values,1)
Out[44]: array([ 100, 200, 300, 700, 1200], dtype=int64)
Some timing
#df=pd.concat([df]*1000)
%timeit merged(df)
1 loop, best of 3: 5.02 s per loop
%timeit npb(df)
1 loop, best of 3: 283 ms per loop
%timeit PIR(df)
100 loops, best of 3: 9.8 ms per loop
def merged(df):
    df['dummy'] = 1
    df = df.merge(df, on=['dummy'], how='left')
    df = df[(df['start_y'] <= df['start_x']) & (df['finish_y'] > df['start_x'])]
    val = df.groupby('start_x')['val_y'].sum()
    return val

def npb(df):
    s1 = df['start'].values
    s2 = df['finish'].values
    return np.sum(((s1 <= s1[:, None]) & (s2 >= s2[:, None])) * df.val.values, 1)

How to count the number of state changes in pandas?

I have the dataframe below, whose columns contain 0/1 values, and I want to count the number of 0->1 and 1->0 transitions in every column. In the dataframe below, column 'a' has 6 state changes, column 'b' has 3, and column 'c' has 2, but I don't know how to code this in pandas.
number a b c
1 0 0 0
2 1 0 1
3 0 1 1
4 1 1 1
5 0 0 0
6 1 0 0
7 0 1 0
I actually have no idea how to do this in pandas, because until recently I only used R; now I have to use Python pandas, so I'm in a bit of a difficult situation. Can anybody help? Thanks in advance!
Use rolling and compare each value, then count all True values by sum:
df = df[['a','b','c']].rolling(2).apply(lambda x: x[0] != x[-1], raw=True).sum().astype(int)
a 6
b 3
c 2
dtype: int64
Bitwise xor (^)
Use the Numpy array df.values and compare the shifted elements with ^
This is meant to be a fast solution.
Xor is true only when exactly one of the two operands is true, as shown in this truth table
A B XOR
T T F
T F T
F T T
F F F
And replicated in 0/1 form
a = np.array([1, 1, 0, 0])
b = np.array([1, 0, 1, 0])
pd.DataFrame(dict(A=a, B=b, XOR=a ^ b))
A B XOR
0 1 1 0
1 1 0 1
2 0 1 1
3 0 0 0
Demo
v = df.values
pd.Series((v[1:] ^ v[:-1]).sum(0), df.columns)
a 6
b 3
c 2
dtype: int64
Time Testing
Functions
def pir_xor(df):
    v = df.values
    return pd.Series((v[1:] ^ v[:-1]).sum(0), df.columns)

def pir_diff1(df):
    v = df.values
    return pd.Series(np.abs(np.diff(v, axis=0)).sum(0), df.columns)

def pir_diff2(df):
    v = df.values
    return pd.Series(np.diff(v.astype(bool), axis=0).sum(0), df.columns)

def cold(df):
    return df.ne(df.shift(-1)).sum(0) - 1

def jez(df):
    return df.rolling(2).apply(lambda x: x[0] != x[-1]).sum().astype(int)

def naga(df):
    return df.diff().abs().sum().astype(int)
Testing
from timeit import timeit

np.random.seed([3, 1415])
idx = [10, 30, 100, 300, 1000, 3000, 10000, 30000, 100000, 300000]
col = 'pir_xor pir_diff1 pir_diff2 cold jez naga'.split()
res = pd.DataFrame(np.nan, idx, col)

for i in idx:
    df = pd.DataFrame(np.random.choice([0, 1], size=(i, 3)), columns=[*'abc'])
    for j in col:
        stmt = f"{j}(df)"
        setp = f"from __main__ import {j}, df"
        res.at[i, j] = timeit(stmt, setp, number=100)
Results
res.div(res.min(1), 0)
pir_xor pir_diff1 pir_diff2 cold jez naga
10 1.06203 1.119769 1.000000 21.217555 16.768532 6.601518
30 1.00000 1.075406 1.115743 23.229013 18.844025 7.212369
100 1.00000 1.134082 1.174973 22.673289 21.478068 7.519898
300 1.00000 1.119153 1.166782 21.725495 26.293712 7.215490
1000 1.00000 1.106267 1.167786 18.394462 37.925160 6.284253
3000 1.00000 1.118554 1.342192 16.053097 64.953310 5.594610
10000 1.00000 1.163557 1.511631 12.008129 106.466636 4.503359
30000 1.00000 1.249835 1.431120 7.826387 118.380227 3.621455
100000 1.00000 1.275272 1.528840 6.690012 131.912349 3.150155
300000 1.00000 1.279373 1.528238 6.301007 140.667427 3.190868
res.plot(loglog=True, figsize=(15, 8))
shift and compare:
df.ne(df.shift(-1)).sum(0) - 1
a 6
b 3
c 2
dtype: int64
...Assuming "number" is the index, otherwise precede your solution with
df.set_index('number', inplace=True).
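A note on the trailing - 1 (my reading, not stated in the answer): shift(-1) introduces a NaN at the last row, which always compares as unequal, so exactly one spurious True per column is subtracted:

import pandas as pd

s = pd.Series([0, 1, 1])
print(s.ne(s.shift(-1)))             # last element compares against NaN -> True
print(s.ne(s.shift(-1)).sum() - 1)   # 1, the single real change (0 -> 1)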
You can also try taking the difference with the previous row and summing the absolute values:
df.diff().abs().sum().astype(int)
Out:
1 6
2 3
3 2
dtype: int32
Use:
def agg_columns(x):
    shifted = x.shift()
    return sum(x[1:] != shifted[1:])

df[['a','b','c']].apply(agg_columns)
a 6
b 3
c 2
dtype: int64
You can also try:
((df!=df.shift()).cumsum() - 1).iloc[-1:]

Removing usernames from a dataframe that do not appear a certain number of times?

I am trying to understand the code provided below (which I found online, but do not fully understand). I want to essentially remove user names that do not appear in my dataframe at least 4 times (other than removing these names, I do not want to modify the dataframe in any other way). Does the following code solve this problem, and if so, can you explain how the filter combined with the lambda achieves this? I have the following:
df.groupby('userName').filter(lambda x: len(x) > 4)
I am also open to alternative solutions/approaches that are easy to understand.
You can check the groupby documentation on filtration.
A faster solution for bigger DataFrames is transform with boolean indexing:
df[df.groupby('userName')['userName'].transform('size') > 4]
Sample:
df = pd.DataFrame({'userName':['a'] * 5 + ['b'] * 3 + ['c'] * 6})
print (df.groupby('userName').filter(lambda x: len(x) > 4))
userName
0 a
1 a
2 a
3 a
4 a
8 c
9 c
10 c
11 c
12 c
13 c
print (df[df.groupby('userName')['userName'].transform('size') > 4])
userName
0 a
1 a
2 a
3 a
4 a
8 c
9 c
10 c
11 c
12 c
13 c
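To answer the "how does the filter/lambda combination achieve this" part of the question (my explanation, not from the original answer): groupby('userName') splits the frame into one sub-frame per user, the lambda receives each sub-frame and returns True when it has more than 4 rows, and filter keeps only the rows belonging to groups for which the lambda returned True. A rough spelled-out equivalent:

import pandas as pd

df = pd.DataFrame({'userName': ['a'] * 5 + ['b'] * 3 + ['c'] * 6})

kept = []
for name, group in df.groupby('userName'):   # one sub-frame per user
    if len(group) > 4:                        # the lambda's test
        kept.append(group)                    # filter keeps these rows
manual = pd.concat(kept)

print(manual.equals(df.groupby('userName').filter(lambda x: len(x) > 4)))  # True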
Timings:
np.random.seed(123)
N = 1000000
L = np.random.randint(1000,size=N).astype(str)
df = pd.DataFrame({'userName': np.random.choice(L, N)})
print (df)
In [128]: %timeit (df.groupby('userName').filter(lambda x: len(x) > 1000))
1 loop, best of 3: 468 ms per loop
In [129]: %timeit (df[df.groupby('userName')['userName'].transform(len) > 1000])
1 loop, best of 3: 661 ms per loop
In [130]: %timeit (df[df.groupby('userName')['userName'].transform('size') > 1000])
10 loops, best of 3: 96.9 ms per loop
Using numpy
def pir(df, k):
    names = df.userName.values
    f, u = pd.factorize(names)   # integer code per user name
    c = np.bincount(f)           # occurrences per name
    m = c[f] > k                 # mask: keep rows whose name occurs more than k times
    return df[m]

pir(df, 4)
userName
0 a
1 a
2 a
3 a
4 a
8 c
9 c
10 c
11 c
12 c
13 c
Timing
#jezrael's large data
np.random.seed(123)
N = 1000000
L = np.random.randint(1000,size=N).astype(str)
df = pd.DataFrame({'userName': np.random.choice(L, N)})
pir(df, 1000).equals(
    df[df.groupby('userName')['userName'].transform('size') > 1000]
)
True
%timeit df[df.groupby('userName')['userName'].transform('size') > 1000]
%timeit pir(df, 1000)
10 loops, best of 3: 78.4 ms per loop
10 loops, best of 3: 61.9 ms per loop

Dataframe conditional logic

I have a dataframe ('dayData') with the columns 'Power1' and 'Power2'
Power1 Power2
1.049246442 -0.231991505
-0.950753558 0.276990531
-0.950753558 0.531481549
0 -0.231991505
-0.464648091 -0.231991505
1.049246442 -1.204952258
0.455388896 -0.486482523
0.879383766 0.226092327
-0.50417844 0.83687077
0.152025349 -0.359237014
I'm trying to use conditional logic to create the 'ResultPower' column. For each row, the logic I'm trying to implement is:
if (Power1 >= 0 AND Power2 <= 0) OR (Power1 <= 0 AND Power2 >= 0) then 0, otherwise return the value of Power1.
So when the resultPower column is added the dataframe would look like:
Power1 Power2 ResultPower
1.049246442 -0.231991505 0
-0.950753558 0.276990531 0
-0.950753558 0.531481549 0
0 -0.231991505 0
-0.464648091 -0.231991505 -0.464648091
1.049246442 -1.204952258 0
0.455388896 -0.486482523 0
0.879383766 0.226092327 0.879383766
-0.50417844 0.83687077 0
0.152025349 -0.359237014 0
I have used basic conditional logic in pandas before, for example I would be able to check one of the logic conditions i.e.
dayData['ResultPower'] = np.where(dayData.Power1 > 0, 0, dayData.Power1)
but I can't find how to combine logic conditions with AND / OR. I want to build something like:
dayData['ResultPower'] = np.where(dayData.Power1 >= 0 and dayData.Power2 =< 0 or dayData.Power1 =< 0 and dayData.Power2 >= 0, 0, dayData.Power1)
Could someone let me know if this is possible and the syntax for doing this please?
Dataframe reproduction
import pandas as pd
from io import StringIO
datastring = StringIO("""\
Power1 Power2
1.049246442 -0.231991505
-0.950753558 0.276990531
-0.950753558 0.531481549
0 -0.231991505
-0.464648091 -0.231991505
1.049246442 -1.204952258
0.455388896 -0.486482523
0.879383766 0.226092327
-0.50417844 0.83687077
0.152025349 -0.359237014
""")
df = pd.read_table(datastring, sep=r'\s+', engine='python')
df['ResultPower'] = df['Power1']
cond1 = (df.Power1 >= 0) & (df.Power2 <= 0)
cond2 = (df.Power1 <= 0) & (df.Power2 >= 0)
df.loc[cond1 | cond2, 'ResultPower'] = 0
Using timeit: 100 loops, best of 3: 1.87 ms per loop
When you want element-wise logical operations on pandas objects, you need to use & for "and" and | for "or". So this is what you are looking for:
In [15]: dayData
Out[15]:
Power1 Power2
0 1.049246 -0.231992
1 -0.950754 0.276991
2 -0.950754 0.531482
3 0.000000 -0.231992
4 -0.464648 -0.231992
5 1.049246 -1.204952
6 0.455389 -0.486483
7 0.879384 0.226092
8 -0.504178 0.836871
9 0.152025 -0.359237
In [16]: dayData['ResultsPower'] = np.where(((dayData.Power1 >= 0) & (dayData.Power2 <= 0)) | ((dayData.Power1 <= 0) & (dayData.Power2 >=0)),0, dayData.Power1)
In [17]: dayData
Out[17]:
Power1 Power2 ResultsPower
0 1.049246 -0.231992 0.000000
1 -0.950754 0.276991 0.000000
2 -0.950754 0.531482 0.000000
3 0.000000 -0.231992 0.000000
4 -0.464648 -0.231992 -0.464648
5 1.049246 -1.204952 0.000000
6 0.455389 -0.486483 0.000000
7 0.879384 0.226092 0.879384
8 -0.504178 0.836871 0.000000
9 0.152025 -0.359237 0.000000
Read more about it here:
http://pandas.pydata.org/pandas-docs/version/0.13.1/gotchas.html#bitwise-boolean
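For background (my note, not part of the answer): plain and/or try to reduce each Series to a single boolean, which raises "The truth value of a Series is ambiguous"; & and | work element-wise, but each comparison needs its own parentheses because the bitwise operators bind more tightly than >= and <=:

import pandas as pd

s = pd.Series([1.0, -2.0, 3.0])
try:
    (s >= 0) and (s <= 0)          # plain `and` forces bool(Series) and fails
except ValueError as err:
    print(err)                      # "The truth value of a Series is ambiguous..."
print((s >= 0) & (s <= 0))          # element-wise combination works fine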
Another approach is to use the apply method of dataframes, which applies a function to the rows or columns of the dataframe. First, define your function:
In [18]: def my_function(S):
....: if ((S.Power1 >=0) and (S.Power2 <=0)) or ((S.Power1 <=0) and (S.Power2 >= 0)):
....: return 0
....: else:
....: return S.Power1
....:
Now use the apply method with axis=1 if you want to work with each row:
In [29]: dayData.apply(my_function, axis=1)
Out[29]:
0 0.000000
1 0.000000
2 0.000000
3 0.000000
4 -0.464648
5 0.000000
6 0.000000
7 0.879384
8 0.000000
9 0.000000
dtype: float64
Now we can compare the speed of each of these operations:
In [31]: timeit np.where(((dayData.Power1 >= 0) & (dayData.Power2 <= 0)) | ((dayData.Power1 <= 0) & (dayData.Power2 >=0)),0, dayData.Power1)
100 loops, best of 3: 2.21 ms per loop
In [32]: timeit dayData.apply(my_function, axis=1)
1000 loops, best of 3: 990 µs per loop
So it seems that in this case using apply is faster, but that is likely because the frame here is tiny; on larger frames the vectorized np.where approach is usually much faster than a row-wise apply.

Count number of elements in each column less than x

I have a DataFrame which looks like below. I am trying to count the number of elements less than 2.0 in each column, then I will visualize the result in a bar plot. I did it using lists and loops, but I wonder if there is a "Pandas way" to do this quickly.
x = []
for i in range(6):
    x.append(df[df.ix[:, i] < 2.0].count()[i])
then I can get a bar plot using list x.
A B C D E F
0 2.142 1.929 1.674 1.547 3.395 2.382
1 2.077 1.871 1.614 1.491 3.110 2.288
2 2.098 1.889 1.610 1.487 3.020 2.262
3 1.990 1.760 1.479 1.366 2.496 2.128
4 1.935 1.765 1.656 1.530 2.786 2.433
In [96]:
df = pd.DataFrame({'a':randn(10), 'b':randn(10), 'c':randn(10)})
df
Out[96]:
a b c
0 -0.849903 0.944912 1.285790
1 -1.038706 1.445381 0.251002
2 0.683135 -0.539052 -0.622439
3 -1.224699 -0.358541 1.361618
4 -0.087021 0.041524 0.151286
5 -0.114031 -0.201018 -0.030050
6 0.001891 1.601687 -0.040442
7 0.024954 -1.839793 0.917328
8 -1.480281 0.079342 -0.405370
9 0.167295 -1.723555 -0.033937
[10 rows x 3 columns]
In [97]:
df[df > 1.0].count()
Out[97]:
a 0
b 2
c 2
dtype: int64
So in your case:
df[df < 2.0 ].count()
should work
EDIT
some timings
In [3]:
%timeit df[df < 1.0 ].count()
%timeit (df < 1.0).sum()
%timeit (df < 1.0).apply(np.count_nonzero)
1000 loops, best of 3: 1.47 ms per loop
1000 loops, best of 3: 560 us per loop
1000 loops, best of 3: 529 us per loop
So @DSM's suggestions are correct and much faster than my suggestion
Method chaining is also possible (each comparison operator has a corresponding method, e.g. < is lt(), <= is le()):
df.lt(2).sum()
If you have multiple conditions to consider, e.g. counting the number of values between 2 and 10, you can combine two boolean DataFrames with a boolean operator:
(df.gt(2) & df.lt(10)).sum()
or you can use pd.eval():
pd.eval("2 < df < 10").sum()
Count the number of values less than 2 or greater than 10:
(df.lt(2) | df.gt(10)).sum()
# or
pd.eval("df < 2 or df > 10").sum()
