Vectorizing the aggregation operation on different columns of a Pandas dataframe

I have a Pandas dataframe, mostly containing boolean columns. A small example is:
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 1, 2, 3],
                   "B": ['a', 'b', 'c', 'a', 'b', 'c'],
                   "f1": [True, True, True, True, True, False],
                   "f2": [True, True, True, True, False, True],
                   "f3": [True, True, True, False, True, True],
                   "f4": [True, True, False, True, True, True],
                   "f5": [True, False, True, True, True, True],
                   "target1": [True, False, True, True, False, True],
                   "target2": [False, True, True, False, True, False]})
df
Output:
   A  B     f1     f2     f3     f4     f5  target1  target2
0  1  a   True   True   True   True   True     True    False
1  2  b   True   True   True   True  False    False     True
2  3  c   True   True   True  False   True     True     True
3  1  a   True   True  False   True   True     True    False
4  2  b   True  False   True   True   True    False     True
5  3  c  False   True   True   True   True     True    False
For each True and False class of each f column, and for each group defined by the ("A", "B") columns, I want to sum the target1 and target2 columns. Using a loop over the f columns, we have:
for col in ["f1", "f2", "f3", "f4", "f5"]:
    print(col, "\n",
          df[df[col]].groupby(["A", "B"]).agg({"target1": "sum", "target2": "sum"}), "\n",
          df[~df[col]].groupby(["A", "B"]).agg({"target1": "sum", "target2": "sum"}))
Now I need to do it without the for loop; I mean a vectorization over the f columns to reduce the computation time (the computation time should be close to the time needed for a single f column).

Use DataFrame.melt, so it is possible to aggregate by the f column names (variable) and their True/False values (value):
df = df.melt(['A','B','target1','target2'])
df1 = df.groupby(["A", "B","variable","value"]).agg({"target1": "sum", "target2": "sum"})
print (df1)
                    target1  target2
A B variable value
1 a f1       True         2        0
    f2       True         2        0
    f3       False        1        0
             True         1        0
    f4       True         2        0
    f5       True         2        0
2 b f1       True         0        2
    f2       False        0        1
             True         0        1
    f3       True         0        2
    f4       True         0        2
    f5       False        0        1
             True         0        1
3 c f1       False        1        0
             True         1        1
    f2       True         2        1
    f3       True         2        1
    f4       False        1        1
             True         1        0
    f5       True         2        1
Then selecting a particular slice is possible with query:
print (df1.query("variable=='f1' and value==True").droplevel([-1,-2]))
     target1  target2
A B
1 a        2        0
2 b        0        2
3 c        1        1
Or:
idx = pd.IndexSlice
print (df1.loc[idx[:, :, 'f1', True],:].droplevel([-1,-2]))
     target1  target2
A B
1 a        2        0
2 b        0        2
3 c        1        1
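If you prefer DataFrame.xs for MultiIndex selection, the same slice can be written as below; this is a sketch equivalent to the two selections above, since drop_level=True (the default) removes the matched levels automatically:
print (df1.xs(('f1', True), level=['variable', 'value']))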

Related

Is there a Pandas equivalent to tidyr's uncount?

Let's assume we have a table with groupings of variable and their frequencies:
In R:
> df
# A tibble: 3 x 3
Cough Fever cases
<lgl> <lgl> <dbl>
1 TRUE FALSE 1
2 FALSE FALSE 2
3 TRUE TRUE 3
Then we could use tidyr::uncount to get a dataframe with the individual cases:
> uncount(df, cases)
# A tibble: 6 x 2
Cough Fever
<lgl> <lgl>
1 TRUE FALSE
2 FALSE FALSE
3 FALSE FALSE
4 TRUE TRUE
5 TRUE TRUE
6 TRUE TRUE
Is there an equivalent in Python/Pandas?
You can take the row index and repeat it according to the counts; for example, in R you can do:
df[rep(1:nrow(df),df$cases),]
First, to get data like yours:
df = pd.DataFrame({'x':[1,1,2,2,2,2],'y':[0,1,0,1,1,1]})
counts = df.groupby(['x','y']).size().reset_index()
counts.columns = ['x','y','n']
x y n
0 1 0 1
1 1 1 1
2 2 0 1
3 2 1 3
Then:
import numpy as np

counts.iloc[np.repeat(np.arange(len(counts)), counts.n), :2]
x y
0 1 0
1 1 1
2 2 0
3 2 1
3 2 1
3 2 1
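The same repeat can also be written label-based with loc, if you find that clearer (a sketch equivalent to the iloc line above):
counts.loc[counts.index.repeat(counts.n), ['x', 'y']]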
I haven't found an equivalent function in Python, but this works:
df2 = df.pop('cases')
df = pd.DataFrame(df.values.repeat(df2, axis=0), columns=df.columns)
pop removes the cases column from df and returns it as df2; then the new DataFrame repeats each row of the original according to the count in df2.
In addition to the other solutions, you could combine take, repeat and drop:
import pandas as pd

df = pd.DataFrame({'Cough': [True, False, True],
                   'Fever': [False, False, True],
                   'cases': [1, 2, 3]})

df.take(df.index.repeat(df.cases)).drop(columns="cases")
Cough Fever
0 True False
1 False False
1 False False
2 True True
2 True True
2 True True
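If you also want the fresh 0..n-1 row index that tidyr's uncount produces (the output above keeps the repeated source labels), you could chain reset_index:
df.take(df.index.repeat(df.cases)).drop(columns="cases").reset_index(drop=True)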
It's as easy as using tidyr's API, via the datar package:
>>> from datar.all import f, tribble, uncount
>>> df = tribble(
... f.Cough, f.Fever, f.cases,
... True, False, 1,
... False, False, 2,
... True, True, 3
... )
>>> uncount(df, f.cases)
Cough Fever
<bool> <bool>
0 True False
1 False False
2 False False
3 True True
4 True True
5 True True
I am the author of the package. Feel free to submit issues if you have any questions.

Create column indicating historical existence of specific value based on other column

Suppose I have df below:
df = pd.DataFrame({
    'A': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'B': [False, True, False, False, True, False, False, True]
})
df is already sorted by A (obviously) and by time, descending. So within each group defined by A, the values in B are sorted by time in descending order. What I want to do is add a column C which, for each group, is True if there is a True value in B in the past. The result would look like:
A B C
0 a False True
1 a True False
2 a False False
3 a False False
4 b True True
5 b False True
6 b False True
7 b True False
I suspect I need to use groupby() and idxmax() somehow but haven't been able to make it work. Any ideas?
idxmax is the way, combined with transform:
df['New'] = df.index < df.iloc[::-1].groupby('A').B.transform('idxmax').sort_index()
df
A B New
0 a False True
1 a True False
2 a False False
3 a False False
4 b True True
5 b False True
6 b False True
7 b True False
If a group could be all False, guard with any:
s1 = df.index < df.iloc[::-1].groupby('A').B.transform('idxmax').sort_index()
s2 = df.groupby('A').B.transform('any')
df['New'] = s1 & s2
IIUC here's one way:
rev_cs = df[::-1].groupby('A').B.apply(lambda x: x.cumsum().shift(fill_value=0.).gt(0))
df['C'] = rev_cs[::-1]
print(df)
A B C
0 a False True
1 a True False
2 a False False
3 a False False
4 b True True
5 b False True
6 b False True
7 b True False
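As an aside, here is a sketch of another way to express the same idea, assuming df as defined in the question: reverse the frame so "the past" comes first, take a shifted running maximum of B per group, and let index alignment restore the original row order.
rev = df[::-1]  # oldest rows first within each group
df['C'] = rev.groupby('A')['B'].transform(lambda s: s.shift(fill_value=False).cummax())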

Label contiguous groups of True elements within a pandas Series

I have a pandas Series of Boolean values, and I would like to label contiguous groups of True values. How can I do this? Is it possible in a vectorised manner? Any help would be hugely appreciated!
Data:
A
0 False
1 True
2 True
3 True
4 False
5 False
6 True
7 False
8 False
9 True
10 True
Desired:
A Label
0 False 0
1 True 1
2 True 1
3 True 1
4 False 0
5 False 0
6 True 2
7 False 0
8 False 0
9 True 3
10 True 3
Here's an unlikely but simple and working solution:
import scipy.ndimage.measurements as mnts

labeled, clusters = mnts.label(df.A.values)
# labeled is what you want; clusters is the number of clusters
df['Label'] = labeled  # puts it into df
Tested as:
import numpy as np

a = np.array([False, False, True, True, True, False, True, False, False,
              True, False, True, True, True, True, True, True, True,
              False, True], dtype=bool)
labeled, clusters = mnts.label(a)
>>> labeled
array([0, 0, 1, 1, 1, 0, 2, 0, 0, 3, 0, 4, 4, 4, 4, 4, 4, 4, 0, 5], dtype=int32)
>>> clusters
5
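A note on newer SciPy (my addition, not part of the original answer): the scipy.ndimage.measurements namespace is deprecated, and the same function is exposed directly as scipy.ndimage.label:
from scipy import ndimage

labeled, clusters = ndimage.label(df['A'].to_numpy())
df['Label'] = labeled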
With cumsum:
import numpy as np

a = df.A.values
z = np.zeros(a.shape, int)
z[a] = pd.factorize((~a).cumsum()[a])[0] + 1
df.assign(Label=z)
A Label
0 False 0
1 True 1
2 True 1
3 True 1
4 False 0
5 False 0
6 True 2
7 False 0
8 False 0
9 True 3
10 True 3
You can use cumsum and groupby + ngroup to mark groups.
v = (~df.A).cumsum().where(df.A).bfill()
df['Label'] = (v.groupby(v).ngroup().add(1)
                .where(df.A).fillna(0, downcast='infer'))
df
A Label
0 False 0
1 True 1
2 True 1
3 True 1
4 False 0
5 False 0
6 True 2
7 False 0
8 False 0
9 True 3
10 True 3
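A pure-pandas sketch of the same labelling: count run starts with a shift, cumulate, and zero out the False rows (this reproduces the Label column above).
starts = df.A & ~df.A.shift(fill_value=False)  # True at the first row of each True run
df['Label'] = starts.cumsum() * df.A           # 1, 2, 3, ... for True runs, 0 elsewhere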

Find first true value in a row of Pandas dataframe

I have two dataframes of boolean values.
The first one looks like this:
b1 = pd.DataFrame([[ True, False, False, False, False],
                   [False, False,  True, False, False],
                   [False,  True, False, False, False],
                   [False, False, False, False, False]])
b1
Out[88]:
0 1 2 3 4
0 True False False False False
1 False False True False False
2 False True False False False
3 False False False False False
If I am just interested in whether each row has any True value I can use the any method:
b1.any(1)
Out[89]:
0 True
1 True
2 True
3 False
dtype: bool
However, I want to have an added constraint based on a second dataframe that looks like the following:
b2 = pd.DataFrame([[ True, False,  True, False, False],
                   [False, False,  True,  True,  True],
                   [ True,  True, False, False, False],
                   [ True,  True,  True, False, False]])
b2
Out[91]:
0 1 2 3 4
0 True False True False False
1 False False True True True
2 True True False False False
3 True True True False False
I want to identify rows that have a True value in the first dataframe ONLY if it is the first True value in a row of the second dataframe.
For example, this would exclude row 2 because, although it has a True value in the first dataframe, it is the second True value in the corresponding row of the second dataframe. In contrast, rows 0 and 1 have a True value in dataframe 1 that is also the first True value in dataframe 2. The output should be the following:
0 True
1 True
2 False
3 False
dtype: bool
One way would be to use cumsum to help find the first:
In [123]: (b1 & b2 & (b2.cumsum(axis=1) == 1)).any(axis=1)
Out[123]:
0 True
1 True
2 False
3 False
dtype: bool
This works because b2.cumsum(axis=1) gives us the cumulative number of Trues seen, and cases where that number is 1 and b2 itself is True must be the first one.
In [124]: b2.cumsum(axis=1)
Out[124]:
0 1 2 3 4
0 1 1 2 2 2
1 0 0 1 2 3
2 1 2 2 2 2
3 1 2 3 3 3
As a variation on @DSM's clever answer, this approach seemed a little more intuitive to me. The first part should be pretty self-explanatory, and the second part finds the first True column (with axis=1) in each dataframe and compares them:
b1.any(axis=1) & (b1.idxmax(axis=1) == b2.idxmax(axis=1))
Worked out a solution which turned out to be similar to pshep123's solution.
# the part on the right of & is to check if the first True position in b1 matches the first True position in b2.
b1.any(1) & (b1.values.argmax(axis=1) == b2.values.argmax(axis=1))
Out[823]:
0 True
1 True
2 False
3 False
dtype: bool
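One caveat worth noting for the idxmax/argmax variants (my addition): idxmax returns the first column label even for an all-False row, so a row that is all False in b2 but has a True in column 0 of b1 would slip through. Guarding on b2.any fixes that:
b1.any(axis=1) & b2.any(axis=1) & (b1.idxmax(axis=1) == b2.idxmax(axis=1))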

How to nicely measure runs of same-data in a pandas dataframe

I want to give a function an arbitrary dataframe, date index, and column, and have it return how many continuous preceding rows (including the row itself) had the same value. I've been able to keep most of my pandas code vectorized, but I'm struggling to see how to do this cleanly.
Below is a small toy dataset and examples of what outputs I'd want from the function.
bar foo
2016-06-01 False True
2016-06-02 True False
2016-06-03 True True
2016-06-06 True False
2016-06-07 False False
2016-06-08 True False
2016-06-09 True False
2016-06-10 False True
2016-06-13 False True
2016-06-14 True True
import pandas as pd

rng = pd.bdate_range('6/1/2016', periods=10)
cola = [True, False, True, False, False, False, False, True, True, True]
colb = [False, True, True, True, False, True, True, False, False, True]
d = {'foo': pd.Series(cola, index=rng), 'bar': pd.Series(colb, index=rng)}
df = pd.DataFrame(d)
"""
consec('foo', '2016-06-09') => 4  # it's the fourth continuous False in a row
consec('foo', '2016-06-08') => 3  # it's the third continuous False in a row
consec('bar', '2016-06-02') => 1  # it's the first continuous True in a row
consec('foo', '2016-06-14') => 3  # it's the third continuous True in a row
"""
==================
I ended up using the itertools answer below, with a small change, because it got me exactly what I wanted (slightly more involved than my original question spec). Thanks for the many suggestions.
import itertools

rng = pd.bdate_range('6/1/2016', periods=100)
cola = [True, False, True, False, False, False, False, True, True, True] * 10
colb = [False, True, True, True, False, True, True, False, False, True] * 10
d = {'foo': pd.Series(cola, index=rng), 'bar': pd.Series(colb, index=rng)}
df1 = pd.DataFrame(d)

def make_new_col_of_consec(df, col_list):
    for col_name in col_list:
        lst = []
        for state, repeat_values in itertools.groupby(df[col_name]):
            if state:
                lst.extend([i + 1 for i, v in enumerate(repeat_values)])
            else:
                lst.extend([0 for i, v in enumerate(repeat_values)])
        df[col_name + "_consec"] = lst
    return df

print(make_new_col_of_consec(df1, ["bar", "foo"]))
The output as follows:
bar foo bar_consec foo_consec
2016-06-01 False True 0 1
2016-06-02 True False 1 0
2016-06-03 True True 2 1
2016-06-06 True False 3 0
2016-06-07 False False 0 0
2016-06-08 True False 1 0
2016-06-09 True False 2 0
2016-06-10 False True 0 1
2016-06-13 False True 0 2
2016-06-14 True True 1 3
2016-06-15 False True 0 4
2016-06-16 True False 1 0
2016-06-17 True True 2 1
2016-06-20 True False 3 0
2016-06-21 False False 0 0
2016-06-22 True False 1 0
Here's an alternative method which creates a new column with the relevant consecutive count for each row. I tested this on a dataframe with 10,000 rows and it took 24 ms. It uses groupby from itertools, taking advantage of the fact that a new group starts whenever the key value (here foo or bar) changes, so we can just use the enumeration index within each group.
import itertools

rng = pd.bdate_range('6/1/2016', periods=10000)
cola = [True, False, True, False, False, False, False, True, True, True] * 1000
colb = [False, True, True, True, False, True, True, False, False, True] * 1000
d = {'foo': pd.Series(cola, index=rng), 'bar': pd.Series(colb, index=rng)}
df1 = pd.DataFrame(d)

def make_new_col_of_consec(df, col_list):
    for col_name in col_list:
        lst = []
        for state, repeat_values in itertools.groupby(df[col_name]):
            lst.extend([i + 1 for i, v in enumerate(repeat_values)])
        df[col_name + "_consec"] = lst
    return df

print(make_new_col_of_consec(df1, ["bar", "foo"]))
Output:
bar foo bar_consec foo_consec
2016-06-01 False True 1 1
2016-06-02 True False 1 1
2016-06-03 True True 2 1
2016-06-06 True False 3 1
2016-06-07 False False 1 2
2016-06-08 True False 1 3
...
[10000 rows x 4 columns]
10 loops, best of 3: 24.1 ms per loop
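For reference, the same per-row run counter can be computed fully vectorized with groupby/cumcount, avoiding the Python-level itertools loop; this is a sketch that reproduces the _consec columns above:
for col in ['bar', 'foo']:
    runs = (df1[col] != df1[col].shift()).cumsum()  # new run id whenever the value changes
    df1[col + '_consec'] = df1.groupby(runs).cumcount() + 1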
try this:
In [135]: %paste
def consec(df, col, d):
    return (df[:d].groupby((df[col] != df[col].shift())
                           .cumsum())[col]
            .transform('size').iloc[-1])
## -- End pasted text --
In [137]: consec(df, 'foo', '2016-06-09')
Out[137]: 4
In [138]: consec(df, 'foo', '2016-06-08')
Out[138]: 3
In [139]: consec(df, 'bar', '2016-06-02')
Out[139]: 1
In [140]: consec(df, 'bar', '2016-06-14')
Out[140]: 1
Explanation:
In [141]: (df.foo != df.foo.shift()).cumsum()
Out[141]:
2016-06-01 1
2016-06-02 2
2016-06-03 3
2016-06-06 4
2016-06-07 4
2016-06-08 4
2016-06-09 4
2016-06-10 5
2016-06-13 5
2016-06-14 5
Freq: B, Name: foo, dtype: int32
In [142]: df.groupby((df.foo != df.foo.shift()).cumsum()).foo.transform('size')
Out[142]:
2016-06-01 1
2016-06-02 1
2016-06-03 1
2016-06-06 4
2016-06-07 4
2016-06-08 4
2016-06-09 4
2016-06-10 3
2016-06-13 3
2016-06-14 3
Freq: B, dtype: int64
In [143]: df.groupby((df.foo != df.foo.shift()).cumsum()).foo.transform('size').tail(1)
Out[143]:
2016-06-14 3
Freq: B, dtype: int64
You can use:
# reverse the row order of df
df = df[::-1]

def consec(col, date):
    # select rows from date backwards (the index is now descending)
    df1 = df.loc[date:, :]
    # get the first group == 1
    colconsec = (df1[col] != df1[col].shift()).cumsum() == 1
    return 'Value is ' + str(df1[col].iat[0]) + ', Len is: ' + str(len(df1[colconsec]))
print (consec('foo', '2016-06-09'))
print (consec('foo', '2016-06-08'))
print (consec('bar', '2016-06-02'))
print (consec('foo', '2016-06-14'))
Value is False, Len is: 4
Value is False, Len is: 3
Value is True, Len is: 1
Value is True, Len is: 3
Another solution: find the last value of the Series colconsec with iat to create a mask. This version works on the original, ascending df, so first undo the reversal:
df = df[::-1]  # restore the original ascending order

def consec(col, date):
    df1 = df.loc[:date, :]
    colconsec = (df1[col] != df1[col].shift()).cumsum()
    mask = colconsec == colconsec.iat[-1]
    return 'Value is ' + str(df1[col].iat[-1]) + ', Len is: ' + str(len(df1[mask]))
print (consec('foo', '2016-06-09'))
print (consec('foo', '2016-06-08'))
print (consec('bar', '2016-06-02'))
print (consec('foo', '2016-06-14'))
Value is False, Len is: 4
Value is False, Len is: 3
Value is True, Len is: 1
Value is True, Len is: 3
