Find first true value in a row of Pandas dataframe - python

I have two dataframes of boolean values.
The first one looks like this:
import pandas as pd

b1 = pd.DataFrame([[ True, False, False, False, False],
                   [False, False,  True, False, False],
                   [False,  True, False, False, False],
                   [False, False, False, False, False]])
b1
Out[88]:
0 1 2 3 4
0 True False False False False
1 False False True False False
2 False True False False False
3 False False False False False
If I am just interested in whether each row has any True value, I can use the any method:
b1.any(axis=1)
Out[89]:
0 True
1 True
2 True
3 False
dtype: bool
However, I want to have an added constraint based on a second dataframe that looks like the following:
b2 = pd.DataFrame([[ True, False,  True, False, False],
                   [False, False,  True,  True,  True],
                   [ True,  True, False, False, False],
                   [ True,  True,  True, False, False]])
b2
Out[91]:
0 1 2 3 4
0 True False True False False
1 False False True True True
2 True True False False False
3 True True True False False
I want to identify rows that have a True value in the first dataframe ONLY if it is also the first True value in the corresponding row of the second dataframe.
For example, this would exclude row 2 because although it has a True value in the first dataframe, that position holds the 2nd True value in the second dataframe. In contrast, rows 0 and 1 have a True value in dataframe 1 that is also the first True value in dataframe 2. The output should be the following:
0 True
1 True
2 False
3 False
dtype: bool

One way would be to use cumsum to help find the first:
In [123]: (b1 & b2 & (b2.cumsum(axis=1) == 1)).any(axis=1)
Out[123]:
0 True
1 True
2 False
3 False
dtype: bool
This works because b2.cumsum(axis=1) gives us the cumulative number of Trues seen, and cases where that number is 1 and b2 itself is True must be the first one.
In [124]: b2.cumsum(axis=1)
Out[124]:
0 1 2 3 4
0 1 1 2 2 2
1 0 0 1 2 3
2 1 2 2 2 2
3 1 2 3 3 3

As a variation on @DSM's clever answer, this approach seemed a little more intuitive to me. The first part should be pretty self-explanatory, and the second part finds the column label of the first True value (with axis=1) in each dataframe and compares them.
b1.any(axis=1) & (b1.idxmax(axis=1) == b2.idxmax(axis=1))

Worked out a solution which turned out to be similar to pshep123's solution.
# The part to the right of & checks whether the first True position in b1 matches the first True position in b2.
b1.any(axis=1) & (b1.values.argmax(axis=1) == b2.values.argmax(axis=1))
Out[823]:
0 True
1 True
2 False
3 False
dtype: bool
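One caveat worth spelling out (a sketch, not part of the answers above): argmax over an all-False row returns 0 rather than "no True found", so row 3 would compare equal (0 == 0) and wrongly pass were it not for the b1.any(axis=1) guard.
import numpy as np

np.argmax([False, False, False, False, False])        # -> 0, not "no True"
b1.values.argmax(axis=1) == b2.values.argmax(axis=1)  # -> array([ True,  True, False,  True])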

Related

Vectorizing the aggregation operation on different columns of a Pandas dataframe

I have a Pandas dataframe, mostly containing boolean columns. A small example is:
import pandas as pd
df = pd.DataFrame({"A": [1, 2, 3, 1, 2, 3],
"B": ['a', 'b', 'c', 'a', 'b', 'c'],
"f1": [True, True, True, True, True, False],
"f2": [True, True, True, True, False, True],
"f3": [True, True, True, False, True, True],
"f4": [True, True, False, True, True, True],
"f5": [True, False, True, True, True, True],
"target1": [True, False, True, True, False, True],
"target2": [False, True, True, False, True, False]})
df
Output:
A B f1 f2 f3 f4 f5 target1 target2
0 1 a True True True True True True False
1 2 b True True True True False False True
2 3 c True True True False True True True
3 1 a True True False True True True False
4 2 b True False True True True False True
5 3 c False True True True True True False
For each True and False class of each f column, and for every group defined by the ("A", "B") columns, I want to sum the target1 and target2 columns. Using a loop over the f columns:
for col in ["f1", "f2", "f3", "f4", "f5"]:
    print(col, "\n",
          df[df[col]].groupby(["A", "B"]).agg({"target1": "sum", "target2": "sum"}), "\n",
          df[~df[col]].groupby(["A", "B"]).agg({"target1": "sum", "target2": "sum"}))
Now, I need to do it without the for loop; I mean a vectorization over the f columns to reduce the computation time (ideally close to the time needed for a single f column).
Use DataFrame.melt, so it is possible to aggregate by the melted column names (the f columns, held in variable) and their True/False values:
df = df.melt(['A','B','target1','target2'])
df1 = df.groupby(["A", "B","variable","value"]).agg({"target1": "sum", "target2": "sum"})
print (df1)
target1 target2
A B variable value
1 a f1 True 2 0
f2 True 2 0
f3 False 1 0
True 1 0
f4 True 2 0
f5 True 2 0
2 b f1 True 0 2
f2 False 0 1
True 0 1
f3 True 0 2
f4 True 0 2
f5 False 0 1
True 0 1
3 c f1 False 1 0
True 1 1
f2 True 2 1
f3 True 2 1
f4 False 1 1
True 1 0
f5 True 2 1
Then selecting is possible by:
print (df1.query("variable=='f1' and value==True").droplevel([-1,-2]))
target1 target2
A B
1 a 2 0
2 b 0 2
3 c 1 1
Or:
idx = pd.IndexSlice
print (df1.loc[idx[:, :, 'f1', True],:].droplevel([-1,-2]))
target1 target2
A B
1 a 2 0
2 b 0 2
3 c 1 1
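For validation, here is a minimal sketch (not part of the original answer; it assumes the melt result is assigned to a new name, here melted and agg, so the original wide frame is still available as df) that cross-checks one slice of the vectorized result against the per-column loop:
# Hypothetical cross-check: melt into a separate frame instead of overwriting df.
melted = df.melt(['A', 'B', 'target1', 'target2'])
agg = melted.groupby(['A', 'B', 'variable', 'value']).agg({'target1': 'sum', 'target2': 'sum'})

# The slice for f1 == True should match the loop's df[df['f1']] aggregation.
loop_f1 = df[df['f1']].groupby(['A', 'B']).agg({'target1': 'sum', 'target2': 'sum'})
melt_f1 = agg.query("variable == 'f1' and value == True").droplevel([-1, -2])
pd.testing.assert_frame_equal(loop_f1, melt_f1, check_dtype=False)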

Create column indicating historical existence of specific value based on other column

Suppose I have df below:
df = pd.DataFrame({
    'A': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'B': [False, True, False, False, True, False, False, True]
})
df is already sorted by A (obviously) and by time (descending). So for each group defined by A, the values in B are sorted by time, descending. What I want to do is to add a column C which, for each group, is True if there is a True value in B in the past. The result would look like:
A B C
0 a False True
1 a True False
2 a False False
3 a False False
4 b True True
5 b False True
6 b False True
7 b True False
I suspect I need to use groupby() and idxmax() somehow but haven't been able to make it work. Any ideas?
idxmax is the way, combined with transform:
df['New'] = df.index < df.iloc[::-1].groupby('A').B.transform('idxmax').sort_index()
df
A B New
0 a False True
1 a True False
2 a False False
3 a False False
4 b True True
5 b False True
6 b False True
7 b True False
If a group can be all False:
s1 = df.index < df.iloc[::-1].groupby('A').B.transform('idxmax').sort_index()
s2 = df.groupby('A').B.transform('any')
df['New'] = s1 & s2
IIUC here's one way:
rev_cs = df[::-1].groupby('A', group_keys=False).B.apply(lambda x: x.cumsum().shift(fill_value=0.).gt(0))
df['C'] = rev_cs[::-1]
print(df)
A B C
0 a False True
1 a True False
2 a False False
3 a False False
4 b True True
5 b False True
6 b False True
7 b True False
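A minimal alternative sketch (not one of the posted answers; it assumes the df with columns A and B from the question): since rows are time-sorted descending, "a True in the past" means "a True in a later row" within the group, which a bottom-up cumulative count plus a one-row shift captures without apply.
rev = df[::-1]                                             # bottom-up view of each group
seen = rev.groupby('A')['B'].cumsum().gt(0)                # True at or after a True, per group
df['C'] = seen.groupby(rev['A']).shift(fill_value=False)   # strictly after -> "in the past"; aligns by index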

What happened to python's ~ when working with boolean?

In a pandas DataFrame, I have a series of boolean values. In order to filter to rows where the boolean is True, I can use: df[df.column_x]
I thought in order to filter to only rows where the column is False, I could use: df[~df.column_x]. I feel like I have done this before, and have seen it as the accepted answer.
However, this fails because ~df.column_x converts the values to integers. See below.
import pandas as pd  # version 0.24.2
a = pd.Series(['a', 'a', 'a', 'a', 'b', 'a', 'b', 'b', 'b', 'b'])
b = pd.Series([True, True, True, True, True, False, False, False, False, False], dtype=bool)
c = pd.DataFrame(data=[a, b]).T
c.columns = ['Classification', 'Boolean']
print(~c.Boolean)
0 -2
1 -2
2 -2
3 -2
4 -2
5 -1
6 -1
7 -1
8 -1
9 -1
Name: Boolean, dtype: object
print(~b)
0 False
1 False
2 False
3 False
4 False
5 True
6 True
7 True
8 True
9 True
dtype: bool
Basically, I can use c[~b], but not c[~c.Boolean]
Am I just dreaming that this used to work?
Ah, since you created c by using the DataFrame constructor and then T (transpose),
first let us look at what we have before T:
pd.DataFrame([a, b])
Out[610]:
0 1 2 3 4 5 6 7 8 9
0 a a a a b a b b b b
1 True True True True True False False False False False
So pandas makes each column have a single dtype; if the values are mixed, it falls back to object.
After T, what data type do we have for each column?
The dtypes in your c :
c.dtypes
Out[608]:
Classification object
Boolean object
The Boolean column became object dtype, which is why you get unexpected output for ~c.Boolean.
How to fix it? Use concat:
c = pd.concat([a, b], axis=1)
c.columns = ['Classification', 'Boolean']
~c.Boolean
Out[616]:
0 False
1 False
2 False
3 False
4 False
5 True
6 True
7 True
8 True
9 True
Name: Boolean, dtype: bool
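Alternatively, a minimal sketch (not from the answer above): if you keep the constructor-and-transpose construction from the question, you can simply cast the column back to bool before negating.
c['Boolean'] = c['Boolean'].astype(bool)  # object -> bool
print(~c['Boolean'])                      # now gives the expected boolean negation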

Label contiguous groups of True elements within a pandas Series

I have a pandas series of Boolean values, and I would like to label contiguous groups of True values. How is it possible to do this? Is it possible to do this in a vectorised manner? Any help would be hugely appreciated!
Data:
A
0 False
1 True
2 True
3 True
4 False
5 False
6 True
7 False
8 False
9 True
10 True
Desired:
A Label
0 False 0
1 True 1
2 True 1
3 True 1
4 False 0
5 False 0
6 True 2
7 False 0
8 False 0
9 True 3
10 True 3
Here's an unlikely but simple and working solution:
import numpy as np
from scipy import ndimage

labeled, clusters = ndimage.label(df.A.values)
# labeled is what you want, clusters is the number of clusters.
df['Label'] = labeled  # put it into df (df.Labels = ... would not create a column)
Tested as:
a = np.array([False, False, True, True, True, False, True, False, False,
              True, False, True, True, True, True, True, True, True,
              False, True])
labeled, clusters = ndimage.label(a)
>>> labeled
array([0, 0, 1, 1, 1, 0, 2, 0, 0, 3, 0, 4, 4, 4, 4, 4, 4, 4, 0, 5], dtype=int32)
>>> clusters
5
With cumsum:
import numpy as np

a = df.A.values
z = np.zeros(a.shape, int)
z[a] = pd.factorize((~a).cumsum()[a])[0] + 1
df.assign(Label=z)
A Label
0 False 0
1 True 1
2 True 1
3 True 1
4 False 0
5 False 0
6 True 2
7 False 0
8 False 0
9 True 3
10 True 3
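To see why the factorize trick works, here are the intermediate values for the sample column (a sketch using the a defined above):
(~a).cumsum()                           # [1 1 1 1 2 3 3 4 5 5 5]  - count of Falses so far
(~a).cumsum()[a]                        # [1 1 1 3 5 5]            - one value per True row, constant within a run
pd.factorize((~a).cumsum()[a])[0] + 1   # [1 1 1 2 3 3]            - dense run labels, assigned back via z[a]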
You can use cumsum and groupby + ngroup to mark groups.
v = (~df.A).cumsum().where(df.A).bfill()
df['Label'] = (
    v.groupby(v).ngroup().add(1).where(df.A).fillna(0, downcast='infer'))
df
A Label
0 False 0
1 True 1
2 True 1
3 True 1
4 False 0
5 False 0
6 True 2
7 False 0
8 False 0
9 True 3
10 True 3
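Another common sketch (not from the answers above, assuming the same df with boolean column A): count run starts with a shifted comparison and blank out the False rows.
starts = df['A'] & ~df['A'].shift(fill_value=False)  # True only on the first row of each True run
df['Label'] = starts.cumsum().where(df['A'], 0)      # label runs 1, 2, 3, ... and zero elsewhere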

How to nicely measure runs of same-data in a pandas dataframe

I want to give a function an arbitrary dataframe, date index, and column, and ask it to return how many continuous preceding rows (including itself) had the same value. I've been able to keep most of my pandas code vectorized, but I'm struggling to think how to do this cleanly.
Below is a small toy dataset and examples of what outputs I'd want from the function.
bar foo
2016-06-01 False True
2016-06-02 True False
2016-06-03 True True
2016-06-06 True False
2016-06-07 False False
2016-06-08 True False
2016-06-09 True False
2016-06-10 False True
2016-06-13 False True
2016-06-14 True True
import pandas as pd
rng = pd.bdate_range('6/1/2016', periods=10)
cola = [True, False, True, False, False, False, False, True, True, True]
colb = [False, True, True, True, False, True, True, False, False, True]
d = {'foo': pd.Series(cola, index=rng), 'bar': pd.Series(colb, index=rng)}
df = pd.DataFrame(d)
"""
consec('foo', '2016-06-09') => 4 # it's the fourth continuous 'False' in a row
consec('foo', '2016-06-08') => 3 # it's the third continuous 'False' in a row
consec('bar', '2016-06-02') => 1 # it's the first continuous True in a row
consec('foo', '2016-06-14') => 3 # it's the third continuous True
"""
==================
I ended up using the itertools answer below, with a small change, because it got me exactly what I wanted (slightly more involved than my original question spec). Thanks for the many suggestions.
import itertools

rng = pd.bdate_range('6/1/2016', periods=100)
cola = [True, False, True, False, False, False, False, True, True, True]*10
colb = [False, True, True, True, False, True, True, False, False, True]*10
d = {'foo': pd.Series(cola, index=rng), 'bar': pd.Series(colb, index=rng)}
df2 = pd.DataFrame(d)

def make_new_col_of_consec(df, col_list):
    for col_name in col_list:
        lst = []
        for state, repeat_values in itertools.groupby(df[col_name]):
            if state:
                lst.extend([i + 1 for i, v in enumerate(repeat_values)])
            else:
                lst.extend([0 for i, v in enumerate(repeat_values)])
        df[col_name + "_consec"] = lst
    return df

print(make_new_col_of_consec(df2, ["bar", "foo"]))
The output is as follows:
bar foo bar_consec foo_consec
2016-06-01 False True 0 1
2016-06-02 True False 1 0
2016-06-03 True True 2 1
2016-06-06 True False 3 0
2016-06-07 False False 0 0
2016-06-08 True False 1 0
2016-06-09 True False 2 0
2016-06-10 False True 0 1
2016-06-13 False True 0 2
2016-06-14 True True 1 3
2016-06-15 False True 0 4
2016-06-16 True False 1 0
2016-06-17 True True 2 1
2016-06-20 True False 3 0
2016-06-21 False False 0 0
2016-06-22 True False 1 0
Here's an alternative method which creates a new column with the relevant consecutive count for each row. I tested this on a dataframe with 10000 rows and it took 24 ms. It uses groupby from itertools, taking advantage of the fact that a break is created whenever the key value (in this case foo or bar) changes, so we can just use the index from there.
import itertools

rng = pd.bdate_range('6/1/2016', periods=10000)
cola = [True, False, True, False, False, False, False, True, True, True]*1000
colb = [False, True, True, True, False, True, True, False, False, True]*1000
d = {'foo': pd.Series(cola, index=rng), 'bar': pd.Series(colb, index=rng)}
df1 = pd.DataFrame(d)

def make_new_col_of_consec(df, col_list):
    for col_name in col_list:
        lst = []
        for state, repeat_values in itertools.groupby(df[col_name]):
            lst.extend([i + 1 for i, v in enumerate(repeat_values)])
        df[col_name + "_consec"] = lst
    return df

print(make_new_col_of_consec(df1, ["bar", "foo"]))
Output:
bar foo bar_consec foo_consec
2016-06-01 False True 1 1
2016-06-02 True False 1 1
2016-06-03 True True 2 1
2016-06-06 True False 3 1
2016-06-07 False False 1 2
2016-06-08 True False 1 3
...
[10000 rows x 4 columns]
10 loops, best of 3: 24.1 ms per loop
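For comparison, a groupby-based sketch (hypothetical, not one of the posted answers; make_consec_vectorized is an assumed name) produces the same *_consec columns without iterating over rows in Python: each column is grouped by its own run id and counted cumulatively, counting both True and False runs like the output above.
def make_consec_vectorized(df, col_list):
    out = df.copy()
    for col in col_list:
        run_id = (out[col] != out[col].shift()).cumsum()    # new id whenever the value changes
        out[col + '_consec'] = out.groupby(run_id).cumcount() + 1
    return out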
try this:
In [135]: %paste
def consec(df, col, d):
    return (df[:d].groupby((df[col] != df[col].shift())
                           .cumsum())[col]
            .transform('size').tail(1).iloc[0])
## -- End pasted text --
In [137]: consec(df, 'foo', '2016-06-09')
Out[137]: 4
In [138]: consec(df, 'foo', '2016-06-08')
Out[138]: 3
In [139]: consec(df, 'bar', '2016-06-02')
Out[139]: 1
In [140]: consec(df, 'bar', '2016-06-14')
Out[140]: 1
Explanation:
In [141]: (df.foo != df.foo.shift()).cumsum()
Out[141]:
2016-06-01 1
2016-06-02 2
2016-06-03 3
2016-06-06 4
2016-06-07 4
2016-06-08 4
2016-06-09 4
2016-06-10 5
2016-06-13 5
2016-06-14 5
Freq: B, Name: foo, dtype: int32
In [142]: df.groupby((df.foo != df.foo.shift()).cumsum()).foo.transform('size')
Out[142]:
2016-06-01 1
2016-06-02 1
2016-06-03 1
2016-06-06 4
2016-06-07 4
2016-06-08 4
2016-06-09 4
2016-06-10 3
2016-06-13 3
2016-06-14 3
Freq: B, dtype: int64
In [143]: df.groupby((df.foo != df.foo.shift()).cumsum()).foo.transform('size').tail(1)
Out[143]:
2016-06-14 3
Freq: B, dtype: int64
You can use:
# reverse the rows of df (newest date first)
df = df[::-1]

def consec(col, date):
    # select rows from date backwards in time
    df1 = df.loc[date:, :]
    # keep only the first run of equal values
    colconsec = (df1[col] != df1[col].shift()).cumsum() == 1
    return 'Value is ' + str(df1[col].iloc[0]) + ', Len is: ' + str(len(df1[colconsec]))

print(consec('foo', '2016-06-09'))
print(consec('foo', '2016-06-08'))
print(consec('bar', '2016-06-02'))
print(consec('foo', '2016-06-14'))
Value is False, Len is: 4
Value is False, Len is: 3
Value is True, Len is: 1
Value is True, Len is: 3
Another solution finds the last value of the Series colconsec with iat to build a mask (this variant works on the original, ascending df):
def consec(col, date):
    df1 = df.loc[:date, :]
    colconsec = (df1[col] != df1[col].shift()).cumsum()
    mask = colconsec == colconsec.iat[-1]
    return 'Value is ' + str(df1[col].iat[-1]) + ', Len is: ' + str(len(df1[mask]))
print (consec('foo', '2016-06-09'))
print (consec('foo', '2016-06-08'))
print (consec('bar', '2016-06-02'))
print (consec('foo', '2016-06-14'))
Value is False, Len is: 4
Value is False, Len is: 3
Value is True, Len is: 1
Value is True, Len is: 3
