How to nicely measure runs of same-data in a pandas dataframe - python

I want to give a function an arbitrary dataframe, dateindex, and column and ask it to return how many continuous preceding rows (including itself) had the same value. I've been able to keep most of my pandas code vectorized, but I'm struggling to think how to do this cleanly.
Below is a small toy dataset and examples of what outputs I'd want from the function.
bar foo
2016-06-01 False True
2016-06-02 True False
2016-06-03 True True
2016-06-06 True False
2016-06-07 False False
2016-06-08 True False
2016-06-09 True False
2016-06-10 False True
2016-06-13 False True
2016-06-14 True True
import pandas as pd

rng = pd.bdate_range('6/1/2016', periods=10)
cola = [True, False, True, False, False, False, False, True, True, True]
colb = [False, True, True, True, False, True, True, False, False, True]
d = {'foo': pd.Series(cola, index=rng), 'bar': pd.Series(colb, index=rng)}
df = pd.DataFrame(d)
"""
consec('foo','2016-06-09') => 4 # it's the fourth continuous 'False' in a row
consec('foo', '2016-06-08') => 3 # It's the third continuous'False' in a row
consec('bar', '2016-06-02') => 1 # It's the first continuou true in a row
consec('foo', '2016-06-14') => 3 # It's the third continuous True
"""
==================
I ended up using the itertools answer below, with a small change, because it got me exactly what I wanted (slightly more involved than my original question spec). Thanks for the many suggestions.
import itertools

rng = pd.bdate_range('6/1/2016', periods=100)
cola = [True, False, True, False, False, False, False, True, True, True] * 10
colb = [False, True, True, True, False, True, True, False, False, True] * 10
d = {'foo': pd.Series(cola, index=rng), 'bar': pd.Series(colb, index=rng)}
df2 = pd.DataFrame(d)

def make_new_col_of_consec(df, col_list):
    # Walk the runs of identical values with itertools.groupby: number the rows
    # inside a True run 1, 2, 3, ... and set the rows of a False run to 0.
    for col_name in col_list:
        lst = []
        for state, repeat_values in itertools.groupby(df[col_name]):
            if state:
                lst.extend([i + 1 for i, v in enumerate(repeat_values)])
            else:
                lst.extend([0 for i, v in enumerate(repeat_values)])
        df[col_name + "_consec"] = lst
    return df

print(make_new_col_of_consec(df2, ["bar", "foo"]))
The output is as follows (first rows shown):
bar foo bar_consec foo_consec
2016-06-01 False True 0 1
2016-06-02 True False 1 0
2016-06-03 True True 2 1
2016-06-06 True False 3 0
2016-06-07 False False 0 0
2016-06-08 True False 1 0
2016-06-09 True False 2 0
2016-06-10 False True 0 1
2016-06-13 False True 0 2
2016-06-14 True True 1 3
2016-06-15 False True 0 4
2016-06-16 True False 1 0
2016-06-17 True True 2 1
2016-06-20 True False 3 0
2016-06-21 False False 0 0
2016-06-22 True False 1 0
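For reference, the same _consec columns can also be built without the Python-level loop over rows, using a run id plus cumcount. This is a vectorized sketch of mine against the df2 above, not part of the original answers:

def make_new_col_of_consec_vectorized(df, col_list):
    # Identify runs of identical values, number the rows inside each run
    # starting at 1, and zero out the rows where the value itself is False.
    for col_name in col_list:
        run_id = (df[col_name] != df[col_name].shift()).cumsum()
        within_run = df.groupby(run_id).cumcount() + 1
        df[col_name + "_consec"] = within_run.where(df[col_name], 0)
    return df

print(make_new_col_of_consec_vectorized(df2.copy(), ["bar", "foo"]))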

Here's an alternative method which creates a new column with the relevant consecutive count for each row. I tested this on a dataframe with 10000 rows and it took 24 ms. It uses groupby from itertools, taking advantage of the fact that a new group starts whenever the key value (here foo or bar) changes, so we can simply use the enumeration index within each group.
import itertools
import pandas as pd

rng = pd.bdate_range('6/1/2016', periods=10000)
cola = [True, False, True, False, False, False, False, True, True, True] * 1000
colb = [False, True, True, True, False, True, True, False, False, True] * 1000
d = {'foo': pd.Series(cola, index=rng), 'bar': pd.Series(colb, index=rng)}
df1 = pd.DataFrame(d)

def make_new_col_of_consec(df, col_list):
    # Number the rows within each run of identical values, restarting at 1
    # whenever the value changes (itertools.groupby starts a new group there).
    for col_name in col_list:
        lst = []
        for state, repeat_values in itertools.groupby(df[col_name]):
            lst.extend([i + 1 for i, v in enumerate(repeat_values)])
        df[col_name + "_consec"] = lst
    return df

print(make_new_col_of_consec(df1, ["bar", "foo"]))
Output:
bar foo bar_consec foo_consec
2016-06-01 False True 1 1
2016-06-02 True False 1 1
2016-06-03 True True 2 1
2016-06-06 True False 3 1
2016-06-07 False False 1 2
2016-06-08 True False 1 3
...
[10000 rows x 4 columns]
10 loops, best of 3: 24.1 ms per loop

try this:
In [135]: %paste
def consec(df, col, d):
    return (df[:d].groupby((df[col] != df[col].shift())
                           .cumsum())[col]
                  .transform('size').tail(1)[0])
## -- End pasted text --
In [137]: consec(df, 'foo', '2016-06-09')
Out[137]: 4
In [138]: consec(df, 'foo', '2016-06-08')
Out[138]: 3
In [139]: consec(df, 'bar', '2016-06-02')
Out[139]: 1
In [140]: consec(df, 'bar', '2016-06-14')
Out[140]: 1
Explanation:
In [141]: (df.foo != df.foo.shift()).cumsum()
Out[141]:
2016-06-01 1
2016-06-02 2
2016-06-03 3
2016-06-06 4
2016-06-07 4
2016-06-08 4
2016-06-09 4
2016-06-10 5
2016-06-13 5
2016-06-14 5
Freq: B, Name: foo, dtype: int32
In [142]: df.groupby((df.foo != df.foo.shift()).cumsum()).foo.transform('size')
Out[142]:
2016-06-01 1
2016-06-02 1
2016-06-03 1
2016-06-06 4
2016-06-07 4
2016-06-08 4
2016-06-09 4
2016-06-10 3
2016-06-13 3
2016-06-14 3
Freq: B, dtype: int64
In [143]: df.groupby((df.foo != df.foo.shift()).cumsum()).foo.transform('size').tail(1)
Out[143]:
2016-06-14 3
Freq: B, dtype: int64
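On recent pandas versions, positional access via tail(1)[0] on a date-indexed Series may warn or change behaviour. Here is a sketch of my own adaptation of the same idea, using .loc slicing and .iloc[-1]:

def consec(df, col, d):
    # Length of the run of identical values that ends at date d.
    sub = df.loc[:d, col]
    grp = (sub != sub.shift()).cumsum()
    return sub.groupby(grp).transform('size').iloc[-1]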

You can use:
#reorder index in df
df = df[::-1]

def consec(col, date):
    #select df by date
    df1 = df.ix[date:, :]
    #get first group == 1
    colconsec = (df1[col] != df1[col].shift()).cumsum() == 1
    return 'Value is ' + str(df1.ix[0, col]) + ', Len is: ' + str(len(df1[colconsec]))

print (consec('foo', '2016-06-09'))
print (consec('foo', '2016-06-08'))
print (consec('bar', '2016-06-02'))
print (consec('foo', '2016-06-14'))
Value is False, Len is: 4
Value is False, Len is: 3
Value is True, Len is: 1
Value is True, Len is: 3
Another solution, which works on the original (unreversed) df, finds the last value of the Series colconsec with iat to create the mask:
def consec(col, date):
    df1 = df.ix[:date, :]
    colconsec = (df1[col] != df1[col].shift()).cumsum()
    mask = colconsec == colconsec.iat[-1]
    return 'Value is ' + str(df1[col].iat[-1]) + ', Len is: ' + str(len(df1[mask]))

print (consec('foo', '2016-06-09'))
print (consec('foo', '2016-06-08'))
print (consec('bar', '2016-06-02'))
print (consec('foo', '2016-06-14'))
Value is False, Len is: 4
Value is False, Len is: 3
Value is True, Len is: 1
Value is True, Len is: 3
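Since .ix has been removed from pandas 1.0+, here is a sketch of the second function rewritten with .loc and .iat; this is my own adaptation and assumes the original, ascending df, with the behaviour otherwise unchanged:

def consec(col, date):
    df1 = df.loc[:date]                                   # rows up to and including `date`
    colconsec = (df1[col] != df1[col].shift()).cumsum()   # run id per row
    mask = colconsec == colconsec.iat[-1]                 # rows in the final run
    return 'Value is ' + str(df1[col].iat[-1]) + ', Len is: ' + str(len(df1[mask]))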

Related

Vectorizing the aggregation operation on different columns of a Pandas dataframe

I have a Pandas dataframe, mostly containing boolean columns. A small example is:
import pandas as pd
df = pd.DataFrame({"A": [1, 2, 3, 1, 2, 3],
"B": ['a', 'b', 'c', 'a', 'b', 'c'],
"f1": [True, True, True, True, True, False],
"f2": [True, True, True, True, False, True],
"f3": [True, True, True, False, True, True],
"f4": [True, True, False, True, True, True],
"f5": [True, False, True, True, True, True],
"target1": [True, False, True, True, False, True],
"target2": [False, True, True, False, True, False]})
df
Output:
A B f1 f2 f3 f4 f5 target1 target2
0 1 a True True True True True True False
1 2 b True True True True False False True
2 3 c True True True False True True True
3 1 a True True False True True True False
4 2 b True False True True True False True
5 3 c False True True True True True False
For each True and False class of each f column, and for all groups in the ("A", "B") columns, I want to sum the target1 and target2 columns. Using a loop over the f columns, we have:
for col in ["f1", "f2", "f3", "f4", "f5"]:
print(col, "\n",
df[df[col]].groupby(["A", "B"]).agg({"target1": "sum", "target2": "sum"}), "\n",
df[~df[col]].groupby(["A", "B"]).agg({"target1": "sum", "target2": "sum"}))
Now, I need to do it without the for loop; I mean a vectorization over the f columns to reduce the computation time (ideally almost equal to the time needed for a single f column).
Use DataFrame.melt, so it is possible to aggregate by the f column names (variable) and their True/False values (value):
df = df.melt(['A','B','target1','target2'])
df1 = df.groupby(["A", "B","variable","value"]).agg({"target1": "sum", "target2": "sum"})
print (df1)
target1 target2
A B variable value
1 a f1 True 2 0
f2 True 2 0
f3 False 1 0
True 1 0
f4 True 2 0
f5 True 2 0
2 b f1 True 0 2
f2 False 0 1
True 0 1
f3 True 0 2
f4 True 0 2
f5 False 0 1
True 0 1
3 c f1 False 1 0
True 1 1
f2 True 2 1
f3 True 2 1
f4 False 1 1
True 1 0
f5 True 2 1
Then you can select, for example f1/True, by:
print (df1.query("variable=='f1' and value==True").droplevel([-1,-2]))
target1 target2
A B
1 a 2 0
2 b 0 2
3 c 1 1
Or:
idx = pd.IndexSlice
print (df1.loc[idx[:, :, 'f1', True],:].droplevel([-1,-2]))
target1 target2
A B
1 a 2 0
2 b 0 2
3 c 1 1
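If you prefer the True/False classes side by side rather than stacked in the index, one further option (my own suggestion on top of the df1 computed above) is to unstack the value level:

# Move the True/False level from the index into the columns, so each
# (A, B, variable) group shows its True and False sums next to each other.
wide = df1.unstack("value")
print(wide.loc[:, ("target1", True)])   # e.g. target1 sums where the f column is True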

Is there a Pandas equivalent to tidyr's uncount?

Let's assume we have a table with groupings of variable and their frequencies:
In R:
> df
# A tibble: 3 x 3
Cough Fever cases
<lgl> <lgl> <dbl>
1 TRUE FALSE 1
2 FALSE FALSE 2
3 TRUE TRUE 3
Then we could use tidyr::uncount to get a dataframe with the individual cases:
> uncount(df, cases)
# A tibble: 6 x 2
Cough Fever
<lgl> <lgl>
1 TRUE FALSE
2 FALSE FALSE
3 FALSE FALSE
4 TRUE TRUE
5 TRUE TRUE
6 TRUE TRUE
Is there an equivalent in Python/Pandas?
You can take the row index and repeat it according to the counts; for example, in R you can do:
df[rep(1:nrow(df),df$cases),]
First, to get data like yours:
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1, 1, 2, 2, 2, 2], 'y': [0, 1, 0, 1, 1, 1]})
counts = df.groupby(['x', 'y']).size().reset_index()
counts.columns = ['x', 'y', 'n']
x y n
0 1 0 1
1 1 1 1
2 2 0 1
3 2 1 3
Then:
counts.iloc[np.repeat(np.arange(len(counts)),counts.n),:2]
x y
0 1 0
1 1 1
2 2 0
3 2 1
3 2 1
3 2 1
I haven't found an equivalent function in Python, but this works:
df2 = df.pop('cases')
df = pd.DataFrame(df.values.repeat(df2, axis=0), columns=df.columns)
pop removes the 'cases' column from df and returns it as df2; then you create a new DataFrame with the rows of the original DataFrame repeated according to the counts in df2. Please let me know if it helps.
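Applied to the Cough/Fever table from the question, this gives the following (a small illustrative check; note that the result gets a fresh 0..5 index rather than repeating the original labels):

import pandas as pd

df = pd.DataFrame({'Cough': [True, False, True],
                   'Fever': [False, False, True],
                   'cases': [1, 2, 3]})
df2 = df.pop('cases')                     # remove 'cases' and keep it separately
out = pd.DataFrame(df.values.repeat(df2, axis=0), columns=df.columns)
print(out)
#    Cough  Fever
# 0   True  False
# 1  False  False
# 2  False  False
# 3   True   True
# 4   True   True
# 5   True   True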
In addition to the other solutions, you could combine take, repeat and drop:
import pandas as pd
df = pd.DataFrame({'Cough': [True, False, True],
                   'Fever': [False, False, True],
                   'cases': [1, 2, 3]})
df.take(df.index.repeat(df.cases)).drop(columns="cases")
Cough Fever
0 True False
1 False False
1 False False
2 True True
2 True True
2 True True
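If you also want a fresh 0..n-1 index instead of the repeated labels, appending reset_index(drop=True) does it (a minor addition of mine):

df.take(df.index.repeat(df.cases)).drop(columns="cases").reset_index(drop=True)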
It's as easy as using tidyr's API, via datar:
>>> from datar.all import f, tribble, uncount
>>> df = tribble(
... f.Cough, f.Fever, f.cases,
... True, False, 1,
... False, False, 2,
... True, True, 3
... )
>>> uncount(df, f.cases)
Cough Fever
<bool> <bool>
0 True False
1 False False
2 False False
3 True True
4 True True
5 True True
I am the author of the package. Feel free to submit issues if you have any questions.

What happened to python's ~ when working with boolean?

In a pandas DataFrame, I have a series of boolean values. In order to filter to rows where the boolean is True, I can use: df[df.column_x]
I thought in order to filter to only rows where the column is False, I could use: df[~df.column_x]. I feel like I have done this before, and have seen it as the accepted answer.
However, this fails because ~df.column_x converts the values to integers. See below.
import pandas as pd  # version 0.24.2
a = pd.Series(['a', 'a', 'a', 'a', 'b', 'a', 'b', 'b', 'b', 'b'])
b = pd.Series([True, True, True, True, True, False, False, False, False, False], dtype=bool)
c = pd.DataFrame(data=[a, b]).T
c.columns = ['Classification', 'Boolean']
print(~c.Boolean)
0 -2
1 -2
2 -2
3 -2
4 -2
5 -1
6 -1
7 -1
8 -1
9 -1
Name: Boolean, dtype: object
print(~b)
0 False
1 False
2 False
3 False
4 False
5 True
6 True
7 True
8 True
9 True
dtype: bool
Basically, I can use c[~b], but not c[~c.Boolean]
Am I just dreaming that this use to work?
Ah, since you created c by using the DataFrame constructor and then T, let us first look at what we have before the T:
pd.DataFrame([a, b])
Out[610]:
0 1 2 3 4 5 6 7 8 9
0 a a a a b a b b b b
1 True True True True True False False False False False
Pandas makes each column hold a single dtype; if the values are mixed, the column is converted to object. After the T, the dtypes in your c are:
c.dtypes
Out[608]:
Classification object
Boolean object
The Boolean column became object dtype, which is why you get the unexpected output for ~c.Boolean.
How to fix it? Use concat:
c = pd.concat([a, b], axis=1)
c.columns = ['Classification', 'Boolean']
~c.Boolean
Out[616]:
0 False
1 False
2 False
3 False
4 False
5 True
6 True
7 True
8 True
9 True
Name: Boolean, dtype: bool
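Alternatively (my own suggestion, not part of the answer above), you can keep the constructor-plus-T approach and just cast the column back to bool before negating:

c['Boolean'] = c['Boolean'].astype(bool)
print(~c['Boolean'])   # now a bool Series, so ~ negates element-wise as expected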

Make a Pandas mask based on a column vector

I have a dataframe and, for each row, I would like to select the values that are above that row's given percentile.
Let's consider this dataframe:
df = pd.DataFrame({'A' : [5,6,3,4, 0,5,9], 'B' : [1,2,3, 5,7,0,1]})
A B
0 5 1
1 6 2
2 3 3
3 4 5
4 0 7
5 5 0
6 9 1
And a given vector of the 20th percentile for each row:
rowsQuantiles = df.quantile(0.2, axis=1)
0 1.8
1 2.8
2 3.0
3 4.2
4 1.4
5 1.0
6 2.6
I would like to be able to filter out, for each row, the values that are below the row's quantile, to get the following result:
quantileMask = df > rowsQuantiles
A B
0 True False
1 True False
2 False False
3 False True
4 False True
5 True False
6 True False
EDIT:
I really liked both approaches by @andrew_reece and @Andy Hayden, so I decided to see which one was the fastest/best-implemented:
import random
import time
import pandas as pd

N = 10000000
df = pd.DataFrame({'A': [random.random() for i in range(N)], 'B': [random.random() for i in range(N)]})
rowsQuantiles = df.quantile(0.2, axis=1)
t0 = time.time()
mask = (df.T > rowsQuantiles).T
#mask = df.apply(lambda row: row > rowsQuantiles)
print(str(time.time() - t0))
Results are pretty straightforward (after several repeated tests):
220ms for mask=(df.T>rowsQuantiles).T
65ms for mask=df.apply(lambda row: row > rowsQuantiles)
21ms for df.gt(rowsQuantiles,0), the accepted answer.
Also, using only gt:
df.gt(rowsQuantiles,0)
Out[288]:
A B
0 True False
1 True False
2 False False
3 False True
4 False True
5 True False
6 True False
Using add:
df.add(-rowsQuantiles,0).gt(0)
Out[284]:
A B
0 True False
1 True False
2 False False
3 False True
4 False True
5 True False
6 True False
There's a transpose error with your mask, but assuming you want to replace the values with NaN, the method you're looking for is where:
In [11]: df.T > rowsQuantiles
Out[11]:
0 1 2 3 4 5 6
A True True False False False True True
B False False False True True False False
In [12]: (df.T > rowsQuantiles).T
Out[12]:
A B
0 True False
1 True False
2 False False
3 False True
4 False True
5 True False
6 True False
In [13]: df.where((df.T > rowsQuantiles).T)
Out[13]:
A B
0 5.0 NaN
1 6.0 NaN
2 NaN NaN
3 NaN 5.0
4 NaN 7.0
5 5.0 NaN
6 9.0 NaN
df.apply(lambda row: row > rowsQuantiles)
A B
0 True False
1 True False
2 False False
3 False True
4 False True
5 True False
6 True False
An alternative I could get behind is np.where:
np.where(df.values > rowsQuantiles[:, None], True, False)
array([[ True, False],
[ True, False],
[False, False],
[False, True],
[False, True],
[ True, False],
[ True, False]], dtype=bool)
Which returns a numpy array, if you're okay with that.
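If you want the result back as a DataFrame rather than a bare array, one option (a small sketch of mine; it uses .values so it does not rely on slicing the Series with [:, None]) is:

import numpy as np
import pandas as pd

mask = pd.DataFrame(df.values > rowsQuantiles.values[:, None],
                    index=df.index, columns=df.columns)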
Timings
%timeit df.T > rowsQuantiles
1 loop, best of 3: 251 ms per loop
%timeit df.where((df.T > rowsQuantiles).T)
1 loop, best of 3: 583 ms per loop
%timeit np.where(df.values > rowsQuantiles[:, None], True, False)
10 loops, best of 3: 136 ms per loop
%timeit df.add(-rowsQuantiles,0).gt(0)
10 loops, best of 3: 141 ms per loop
%timeit df.gt(rowsQuantiles,0)
10 loops, best of 3: 25.4 ms per loop
%timeit df.apply(lambda row: row > rowsQuantiles)
10 loops, best of 3: 60.6 ms per loop

Find first true value in a row of Pandas dataframe

I have two dataframes of boolean values.
The first one looks like this:
b1 = pd.DataFrame([[True, False, False, False, False],
                   [False, False, True, False, False],
                   [False, True, False, False, False],
                   [False, False, False, False, False]])
b1
Out[88]:
0 1 2 3 4
0 True False False False False
1 False False True False False
2 False True False False False
3 False False False False False
If I am just interested in whether each row has any True value I can use the any method:
b1.any(1)
Out[89]:
0 True
1 True
2 True
3 False
dtype: bool
However, I want to have an added constraint based on a second dataframe that looks like the following:
b2 = pd.DataFrame([[True, False, True, False, False],
                   [False, False, True, True, True],
                   [True, True, False, False, False],
                   [True, True, True, False, False]])
b2
Out[91]:
0 1 2 3 4
0 True False True False False
1 False False True True True
2 True True False False False
3 True True True False False
I want to identify rows that have a True value in the first dataframe ONLY if it is the first True value in a row of the second dataframe.
For example, this would exclude row 2 because, although it has a True value in the first dataframe, it is the 2nd True value in the second dataframe. In contrast, rows 0 and 1 have a True value in dataframe 1 that is also the first True value in dataframe 2. The output should be the following:
0 True
1 True
2 False
3 False
dtype: bool
One way would be to use cumsum to help find the first:
In [123]: (b1 & b2 & (b2.cumsum(axis=1) == 1)).any(axis=1)
Out[123]:
0 True
1 True
2 False
3 False
dtype: bool
This works because b2.cumsum(axis=1) gives us the cumulative number of Trues seen, and cases where that number is 1 and b2 itself is True must be the first one.
In [124]: b2.cumsum(axis=1)
Out[124]:
0 1 2 3 4
0 1 1 2 2 2
1 0 0 1 2 3
2 1 2 2 2 2
3 1 2 3 3 3
As a variation on @DSM's clever answer, this approach seemed a little more intuitive to me. The first part should be pretty self-explanatory, and the second part finds the first column number (with axis=1) that is True in each dataframe and compares them.
b1.any(axis=1) & (b1.idxmax(axis=1) == b2.idxmax(axis=1))
I worked out a solution which turned out to be similar to pshep123's solution.
# the part on the right of & is to check if the first True position in b1 matches the first True position in b2.
b1.any(1) & (b1.values.argmax(axis=1) == b2.values.argmax(axis=1))
Out[823]:
0 True
1 True
2 False
3 False
dtype: bool
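A quick sanity check that all three approaches agree on the example frames (my own verification sketch, assuming b1 and b2 as defined above):

import pandas as pd

expected = pd.Series([True, True, False, False])

m1 = (b1 & b2 & (b2.cumsum(axis=1) == 1)).any(axis=1)                         # cumsum approach
m2 = b1.any(axis=1) & (b1.idxmax(axis=1) == b2.idxmax(axis=1))                # idxmax approach
m3 = b1.any(axis=1) & (b1.values.argmax(axis=1) == b2.values.argmax(axis=1))  # argmax approach

assert m1.equals(expected) and m2.equals(expected) and m3.equals(expected)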
