Dataframe conditional logic - python

I have a dataframe ('dayData') with the columns 'Power1' and 'Power2'
Power1 Power2
1.049246442 -0.231991505
-0.950753558 0.276990531
-0.950753558 0.531481549
0 -0.231991505
-0.464648091 -0.231991505
1.049246442 -1.204952258
0.455388896 -0.486482523
0.879383766 0.226092327
-0.50417844 0.83687077
0.152025349 -0.359237014
I'm trying to use conditional logic to create the 'ResultPower' column. For each row, the logic I'm trying to implement is:
if (Power1 >= 0 AND Power2 <= 0) OR (Power1 <= 0 AND Power2 >= 0) then 0, otherwise return the value of Power1.
So when the ResultPower column is added, the dataframe would look like:
Power1 Power2 ResultPower
1.049246442 -0.231991505 0
-0.950753558 0.276990531 0
-0.950753558 0.531481549 0
0 -0.231991505 0
-0.464648091 -0.231991505 -0.464648091
1.049246442 -1.204952258 0
0.455388896 -0.486482523 0
0.879383766 0.226092327 0.879383766
-0.50417844 0.83687077 0
0.152025349 -0.359237014 0
I have used basic conditional logic in pandas before; for example, I can check a single condition, i.e.
dayData['ResultPower'] = np.where(dayData.Power1 > 0, 0, dayData.Power1)
but I can't find how to combine conditions with AND / OR. I want to build something like:
dayData['ResultPower'] = np.where(dayData.Power1 >= 0 and dayData.Power2 <= 0 or dayData.Power1 <= 0 and dayData.Power2 >= 0, 0, dayData.Power1)
Could someone let me know if this is possible and the syntax for doing this please?
Dataframe reproduction
import pandas as pd
from io import StringIO
datastring = StringIO("""\
Power1 Power2
1.049246442 -0.231991505
-0.950753558 0.276990531
-0.950753558 0.531481549
0 -0.231991505
-0.464648091 -0.231991505
1.049246442 -1.204952258
0.455388896 -0.486482523
0.879383766 0.226092327
-0.50417844 0.83687077
0.152025349 -0.359237014
""")
df = pd.read_table(datastring, sep=r'\s+', engine='python')

df['ResultPower'] = df['Power1']
cond1 = (df.Power1 >= 0) & (df.Power2 <= 0)
cond2 = (df.Power1 <= 0) & (df.Power2 >= 0)
df.loc[cond1 | cond2, 'ResultPower'] = 0
Using timeit: 100 loops, best of 3: 1.87 ms per loop

When you want element-wise logical operations on pandas objects, you need to use & instead of and, and | instead of or. So this is what you are looking for:
In [15]: dayData
Out[15]:
Power1 Power2
0 1.049246 -0.231992
1 -0.950754 0.276991
2 -0.950754 0.531482
3 0.000000 -0.231992
4 -0.464648 -0.231992
5 1.049246 -1.204952
6 0.455389 -0.486483
7 0.879384 0.226092
8 -0.504178 0.836871
9 0.152025 -0.359237
In [16]: dayData['ResultsPower'] = np.where(((dayData.Power1 >= 0) & (dayData.Power2 <= 0)) | ((dayData.Power1 <= 0) & (dayData.Power2 >=0)),0, dayData.Power1)
In [17]: dayData
Out[17]:
Power1 Power2 ResultsPower
0 1.049246 -0.231992 0.000000
1 -0.950754 0.276991 0.000000
2 -0.950754 0.531482 0.000000
3 0.000000 -0.231992 0.000000
4 -0.464648 -0.231992 -0.464648
5 1.049246 -1.204952 0.000000
6 0.455389 -0.486483 0.000000
7 0.879384 0.226092 0.879384
8 -0.504178 0.836871 0.000000
9 0.152025 -0.359237 0.000000
Read more about it here:
http://pandas.pydata.org/pandas-docs/version/0.13.1/gotchas.html#bitwise-boolean
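One thing to watch out for: & and | bind more tightly than comparisons, so each condition needs its own parentheses, exactly as written above. A minimal illustration (dayData as defined in the question):
# Without parentheses this parses as dayData.Power1 >= (0 & dayData.Power2) <= 0,
# which raises an error instead of comparing element-wise
# dayData.Power1 >= 0 & dayData.Power2 <= 0

# Parenthesised, each side is a boolean Series and & combines them element-wise
(dayData.Power1 >= 0) & (dayData.Power2 <= 0)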
Another approach is to use the apply method of dataframes, which applies a function to a row or columns of the dataframe. First, define your function:
In [18]: def my_function(S):
    ....:     if ((S.Power1 >= 0) and (S.Power2 <= 0)) or ((S.Power1 <= 0) and (S.Power2 >= 0)):
    ....:         return 0
    ....:     else:
    ....:         return S.Power1
    ....:
Now use the apply method with axis=1 if you want to work with each row:
In [29]: dayData.apply(my_function, axis=1)
Out[29]:
0 0.000000
1 0.000000
2 0.000000
3 0.000000
4 -0.464648
5 0.000000
6 0.000000
7 0.879384
8 0.000000
9 0.000000
dtype: float64
Now we can compare the speed of each of these operations:
In [31]: timeit np.where(((dayData.Power1 >= 0) & (dayData.Power2 <= 0)) | ((dayData.Power1 <= 0) & (dayData.Power2 >=0)),0, dayData.Power1)
100 loops, best of 3: 2.21 ms per loop
In [32]: timeit dayData.apply(my_function, axis=1)
1000 loops, best of 3: 990 µs per loop
So on this tiny frame apply comes out faster, probably because of the fixed overhead of building the boolean masks and calling np.where; on larger dataframes the vectorised approach will generally win.
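To check how this scales, you can repeat the comparison on a larger frame (a hypothetical setup; exact numbers depend on the machine, but the vectorised np.where version typically pulls far ahead once there are more than a few thousand rows):
In [33]: big = pd.concat([dayData] * 100000, ignore_index=True)  # ~1,000,000 rows
In [34]: timeit np.where(((big.Power1 >= 0) & (big.Power2 <= 0)) | ((big.Power1 <= 0) & (big.Power2 >= 0)), 0, big.Power1)
In [35]: timeit big.apply(my_function, axis=1)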

Related

How to count no of occurrence for each value in a given column of dataframe for a certain class interval?

This is my first question on Stack Overflow.
I have two dataframes of different sizes: df1 (266808 rows) and df2 (201 rows).
I want to add to df2['count'] the count of values in df1['WS_140m'] that fall in the class interval given in df2['Class_interval'].
I have tried
1)
df2['count'] = pd.cut(x=df1['WS_140m'], bins=df2['Class_interval'])
2)
df2['count'] = df1['WS_140m'].groupby(df1['Class_interval'])
3)
for anum in df1['WS_140m']:
    if anum in df2['Class_interval']:
        df2['count'] = df2['count'] + 1
Please guide, if someone knows.
Please try something like:
def in_class_interval(value, interval):
    # TODO: return True if value falls inside interval
    ...

def in_class_interval_closure(interval):
    return lambda x: in_class_interval(x, interval)

df2['count'] = df2['Class_interval'].apply(
    lambda interval: df1['WS_140m'].apply(in_class_interval_closure(interval)).sum()
)
Define your function in_class_interval(value, interval), which returns a boolean.
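For example, if the intervals are stored as strings like "(0.05,0.15]" (as they appear in the data shown further down), one possible implementation is:
def in_class_interval(value, interval):
    # Treat the interval as open on the left, closed on the right, e.g. "(0.05,0.15]"
    if pd.isnull(interval):
        return False
    low, high = (float(b) for b in interval.strip('([])').split(','))
    return low < value <= high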
I guess something like this would do it:
In [330]: df1
Out[330]:
WS_140m
0 5.10
1 5.16
2 5.98
3 5.58
4 4.81
In [445]: df2
Out[445]:
count Class_interval
0 0 NaN
1 0 (0.05,0.15]
2 0 (0.15,0.25]
3 0 (0.25,0.35]
4 0 (3.95,5.15]
In [446]: df2.Class_interval = df2.Class_interval.str.replace(']', ')')
In [451]: from ast import literal_eval
In [449]: for i, v in df2.Class_interval.iteritems():
     ...:     if pd.notnull(v):
     ...:         df2.at[i, 'Class_interval'] = literal_eval(df2.Class_interval[i])
In [342]: df2['falls_in_range'] = df1.WS_140m.between(df2.Class_interval.str[0], df2.Class_interval.str[1])
You can then increase the count wherever the value is True, like below:
In [360]: df2['count'] = df2.loc[df2.index[df2['falls_in_range'] == True].tolist()]['count'] +1
In [361]: df2
Out[361]:
count Class_interval falls_in_range
0 NaN NaN False
1 NaN (0.05, 0.15) False
2 NaN (0.15, 0.25) False
3 NaN (0.25, 0.35) False
4 1.0 (3.95, 5.15) True
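As a more compact alternative, pd.cut also accepts an IntervalIndex as bins. A sketch, assuming df2.Class_interval already holds the parsed (lo, hi) tuples from the loop above and treating the intervals as right-closed, as the original strings suggest:
# Build an IntervalIndex from the non-null tuples and count df1 values per interval
intervals = pd.IntervalIndex.from_tuples(df2['Class_interval'].dropna().tolist(), closed='right')
counts = pd.cut(df1['WS_140m'], bins=intervals).value_counts(sort=False)
df2.loc[df2['Class_interval'].notnull(), 'count'] = counts.values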

Speed up turn probabilities into binary features

I have a dataframe with 3 columns; each row holds the probability that the feature T takes the value 1, 2 or 3 for that row.
import pandas as pd
import numpy as np
np.random.seed(42)
df = pd.DataFrame({"T1" : [0.8,0.5,0.01],"T2":[0.1,0.2,0.89],"T3":[0.1,0.3,0.1]})
For row 0, T is 1 with 80% chance, 2 with 10% and 3 with 10%
I want to simulate the value of T for each row and turn the columns T1, T2, T3 into binary features.
I have a solution, but it needs to loop over the rows of the dataframe and is really slow (my real dataframe has over 1 million rows):
possib = df.columns
for i in range(df.shape[0]):
    probas = df.iloc[i][possib].tolist()
    choix_transp = np.random.choice(possib, 1, p=probas)[0]
    for pos in possib:
        if pos == choix_transp:
            df.iloc[i][pos] = 1
        else:
            df.iloc[i][pos] = 0
Is there a way to vectorize this code ?
Thank you !
Here's a vectorized equivalent of random.choice for a given matrix of probabilities, based on the cumulative sum -
def matrixprob_to_onehot(ar):
    # Get one-hot encoded boolean array based on matrix of probabilities
    c = ar.cumsum(axis=1)
    idx = (np.random.rand(len(c), 1) < c).argmax(axis=1)
    ar_out = np.zeros(ar.shape, dtype=bool)
    ar_out[np.arange(len(idx)), idx] = 1
    return ar_out
ar_out = matrixprob_to_onehot(df.values)
df_out = pd.DataFrame(ar_out.view('i1'), index=df.index, columns=df.columns)
Verify against the given probabilities with a large number of draws -
In [139]: df = pd.DataFrame({"T1" : [0.8,0.5,0.01],"T2":[0.1,0.2,0.89],"T3":[0.1,0.3,0.1]})
In [140]: df
Out[140]:
T1 T2 T3
0 0.80 0.10 0.1
1 0.50 0.20 0.3
2 0.01 0.89 0.1
In [141]: p = np.array([matrixprob_to_onehot(df.values) for i in range(100000)]).argmax(2)
In [142]: np.array([np.bincount(p[:,i])/100000.0 for i in range(len(df))])
Out[142]:
array([[0.80064, 0.0995 , 0.09986],
       [0.50051, 0.20113, 0.29836],
       [0.01015, 0.89045, 0.0994 ]])
In [145]: np.round(_,2)
Out[145]:
array([[0.8 , 0.1 , 0.1 ],
       [0.5 , 0.2 , 0.3 ],
       [0.01, 0.89, 0.1 ]])
Timings on 1,000,000 rows -
# Setup input
In [169]: N = 1000000
...: a = np.random.rand(N,3)
...: df = pd.DataFrame(a/a.sum(1,keepdims=1), columns=['T1','T2','T3'])
# #gmds's soln
In [171]: %timeit pd.get_dummies((np.random.rand(len(df), 1) > df.cumsum(axis=1)).idxmin(axis=1))
1 loop, best of 3: 4.82 s per loop
# Soln from this post
In [172]: %%timeit
...: ar_out = matrixprob_to_onehot(df.values)
...: df_out = pd.DataFrame(ar_out.view('i1'), index=df.index, columns=df.columns)
10 loops, best of 3: 43.1 ms per loop
We can use numpy for this:
result = pd.get_dummies((np.random.rand(len(df), 1) > df.cumsum(axis=1)).idxmin(axis=1))
This generates a single column of random values and compares it to the column-wise cumsum of the dataframe, which results in a DataFrame of values where the first False value shows which "bucket" the random value falls in. With idxmin, we can get the label of this bucket, which we can then convert back with pd.get_dummies.
Example:
import numpy as np
import pandas as pd
np.random.seed(0)
data = np.random.rand(10, 3)
normalised = data / data.sum(axis=1)[:, np.newaxis]
df = pd.DataFrame(normalised)
result = pd.get_dummies((np.random.rand(len(df), 1) > df.cumsum(axis=1)).idxmin(axis=1))
print(result)
Output:
0 1 2
0 1 0 0
1 0 0 1
2 0 1 0
3 0 1 0
4 1 0 0
5 0 0 1
6 0 1 0
7 0 1 0
8 0 0 1
9 0 1 0
A note:
Most of the slowdown comes from pd.get_dummies; if you use Divakar's method of pd.DataFrame(result.view('i1'), index=df.index, columns=df.columns), it gets a lot faster.
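A sketch of that combination (df being any frame of row-wise probabilities, as in the timing setup above): pick the bucket with the cumsum/idxmin trick, then build the 0/1 frame directly instead of calling pd.get_dummies:
choice = (np.random.rand(len(df), 1) > df.cumsum(axis=1)).idxmin(axis=1)
onehot = df.columns.values == choice.values[:, None]   # boolean (n_rows, n_cols) matrix
df_out = pd.DataFrame(onehot.view('i1'), index=df.index, columns=df.columns)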

Python Pandas Percentage Calculations

Trying to Calculate Percentages on Python Pandas
   1    2
0  A    0
1  A    1
2  A    2
3  B    0
4  B    0
5  B  1.5
6  B    0
Desired output: the percentage of A's and B's with a score higher than 0.
A = 66%
B = 25%
Create a boolean filter on the second column ((df['2'] > 0))
Group it by the first column
Aggregate with sum and size (sum will count the ones that satisfy the condition)
Divide sum by size to get the percentage:
res = (df['2'] > 0).groupby(df['1']).agg(['sum', 'size'])
res['sum'] / res['size']
Out:
1
A 0.666667
B 0.250000
dtype: float64
This can be done in a more compact way with a lambda expression:
df.groupby('1')['2'].agg(lambda x: (x > 0).sum() / x.size)
Out:
1
A 0.666667
B 0.250000
Name: 2, dtype: float64
but I suspect that the first one is more efficient.
In [3]: df['2'].gt(0).groupby(df['1']).mean()
Out[3]:
1
A 0.666667
B 0.250000
Name: 2, dtype: float64
Don't mind me...
I'm on a kick where I'm solving everything with np.bincount and pd.factorize
f, u = df['1'].factorize()
pd.Series(
    np.bincount(f, df['2'].values > 0) / np.bincount(f),  # per-group True count / group size
    u
)
A 0.666667
B 0.250000
dtype: float64
One-liner version for fun!
(lambda w, g, f, u: pd.Series(g(f, w) / g(f)))(
    df['2'].values > 0, np.bincount, *pd.factorize(df['1'].values)
)
Naive Timing
%timeit df['2'].gt(0).groupby(df['1']).mean()
%timeit df.groupby('1')['2'].agg(lambda x: (x > 0).sum() / x.size)
%timeit (lambda w, g, f, u: pd.Series(g(f, w) / g(f)))(df['2'].values > 0, np.bincount, *pd.factorize(df['1'].values))
1000 loops, best of 3: 697 µs per loop
1000 loops, best of 3: 1 ms per loop
10000 loops, best of 3: 117 µs per loop

Pandas apply but only for rows where a condition is met

I would like to use Pandas df.apply but only for certain rows
As an example, I want to do something like this, but my actual issue is a little more complicated:
import pandas as pd
import math
z = pd.DataFrame({'a':[4.0,5.0,6.0,7.0,8.0],'b':[6.0,0,5.0,0,1.0]})
z.where(z['b'] != 0, z['a'] / z['b'].apply(lambda l: math.log(l)), 0)
What I want in this example is the value in 'a' divided by the log of the value in 'b' for each row, and for rows where 'b' is 0, I simply want to return 0.
The other answers are excellent, but I thought I'd add one other approach that can be faster in some circumstances – using broadcasting and masking to achieve the same result:
import numpy as np
mask = (z['b'] != 0)
z_valid = z[mask]
z['c'] = 0
z.loc[mask, 'c'] = z_valid['a'] / np.log(z_valid['b'])
Especially with very large dataframes, this approach will generally be faster than solutions based on apply().
You can just use an if statement in a lambda function.
z['c'] = z.apply(lambda row: 0 if row['b'] in (0,1) else row['a'] / math.log(row['b']), axis=1)
I also excluded 1, because log(1) is zero.
Output:
a b c
0 4 6 2.232443
1 5 0 0.000000
2 6 5 3.728010
3 7 0 0.000000
4 8 1 0.000000
Hope this helps. It is easy and readable:
z['c'] = z['b'].apply(lambda x: 0 if x == 0 else math.log(x))
You can use a lambda with a conditional to return 0 if the input value is 0 and skip the whole where clause:
z['c'] = z.apply(lambda x: math.log(x.b) if x.b > 0 else 0, axis=1)
You also have to assign the results to a new column (z['c']).
Use np.where() which divides a by the log of the value in b if the condition is met and returns 0 otherwise:
import numpy as np
z['c'] = np.where(z['b'] != 0, z['a'] / np.log(z['b']), 0)
Output:
a b c
0 4.0 6.0 2.232443
1 5.0 0.0 0.000000
2 6.0 5.0 3.728010
3 7.0 0.0 0.000000
4 8.0 1.0 inf
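Note that np.where evaluates both branches for every row, which is why row 4 comes out as inf (log(1) is 0) and why you may see divide warnings for the rows where b is 0. If you also want those cases to come out as 0, a minimal sketch along the same lines (this zeroes out every non-finite result, which goes slightly beyond what the question asked for):
import numpy as np

with np.errstate(divide='ignore', invalid='ignore'):
    c = z['a'] / np.log(z['b'])
z['c'] = c.where(np.isfinite(c), 0)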

Count number of elements in each column less than x

I have a DataFrame which looks like below. I am trying to count the number of elements less than 2.0 in each column, then I will visualize the result in a bar plot. I did it using lists and loops, but I wonder if there is a "Pandas way" to do this quickly.
x = []
for i in range(6):
    x.append(df[df.ix[:, i] < 2.0].count()[i])
then I can get a bar plot using list x.
A B C D E F
0 2.142 1.929 1.674 1.547 3.395 2.382
1 2.077 1.871 1.614 1.491 3.110 2.288
2 2.098 1.889 1.610 1.487 3.020 2.262
3 1.990 1.760 1.479 1.366 2.496 2.128
4 1.935 1.765 1.656 1.530 2.786 2.433
In [96]:
df = pd.DataFrame({'a': np.random.randn(10), 'b': np.random.randn(10), 'c': np.random.randn(10)})
df
Out[96]:
a b c
0 -0.849903 0.944912 1.285790
1 -1.038706 1.445381 0.251002
2 0.683135 -0.539052 -0.622439
3 -1.224699 -0.358541 1.361618
4 -0.087021 0.041524 0.151286
5 -0.114031 -0.201018 -0.030050
6 0.001891 1.601687 -0.040442
7 0.024954 -1.839793 0.917328
8 -1.480281 0.079342 -0.405370
9 0.167295 -1.723555 -0.033937
[10 rows x 3 columns]
In [97]:
df[df > 1.0].count()
Out[97]:
a 0
b 2
c 2
dtype: int64
So in your case:
df[df < 2.0].count()
should work
EDIT
some timings
In [3]:
%timeit df[df < 1.0].count()
%timeit (df < 1.0).sum()
%timeit (df < 1.0).apply(np.count_nonzero)
1000 loops, best of 3: 1.47 ms per loop
1000 loops, best of 3: 560 µs per loop
1000 loops, best of 3: 529 µs per loop
So @DSM's suggestions are correct and much faster than mine.
Method chaining is also possible (each comparison operator has a corresponding method, e.g. < is lt() and <= is le()):
df.lt(2).sum()
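On the five rows shown in the question this gives:
df.lt(2.0).sum()
# A    2
# B    5
# C    5
# D    5
# E    0
# F    0
# dtype: int64

df.lt(2.0).sum().plot(kind='bar')  # bar chart of the per-column counts, as the question wants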
If you have multiple conditions to consider, e.g. counting the number of values between 2 and 10, you can combine two boolean Series with a boolean operator:
(df.gt(2) & df.lt(10)).sum()
or you can use pd.eval():
pd.eval("2 < df < 10").sum()
Count the number of values less than 2 or greater than 10:
(df.lt(2) | df.gt(10)).sum()
# or
pd.eval("df < 2 or df > 10").sum()
