Group by sequence of True - python

I have the following df:
df = pd.DataFrame({"val_a":[True,True,False,False,False,True,False,False,True,True,True,True,False,True,True]})
val_a
0 True
1 True
2 False
3 False
4 False
5 True
6 False
7 False
8 True
9 True
10 True
11 True
12 False
13 True
14 True
and I wish to have the following result:
val_a tx
0 True 0
1 True 0
2 False None
3 False None
4 False None
5 True 1
6 False None
7 False None
8 True 2
9 True 2
10 True 2
11 True 2
12 False None
13 True 3
14 True 3
Explanation: each consecutive run of True values counts as one group, so index 0 and 1 share the same tx (0); later comes a single True (index 5), so mark it as 1.
What have I tried: I know that cumsum and groupby must come into play here, but I couldn't figure out how.
g = (df['val_a']==True).cumsum()
df['tx'] = df.groupby(g).ffill()

Identify the groups with cumsum, then filter the rows having True values and use factorize to assign an ordinal number to each unique group:
m = df['val_a']
df.loc[m, 'tx'] = (~m).cumsum()[m].factorize()[0]
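To see why this works, here is a minimal breakdown of the intermediate values, reusing the mask m from above (my own sketch, not part of the original answer):
groups = (~m).cumsum()              # the label increments at every False
print(groups[m].tolist())           # [0, 0, 3, 5, 5, 5, 5, 6, 6]: one label per run of True
print(groups[m].factorize()[0])     # [0 0 1 2 2 2 2 3 3]: ordinal code per run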
Alternatively, you can also use groupby + ngroup:
m = df['val_a']
df['tx'] = m[m].groupby((~m).cumsum()).ngroup()
val_a tx
0 True 0.0
1 True 0.0
2 False NaN
3 False NaN
4 False NaN
5 True 1.0
6 False NaN
7 False NaN
8 True 2.0
9 True 2.0
10 True 2.0
11 True 2.0
12 False NaN
13 True 3.0
14 True 3.0
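The tx column comes out as float because NaN forces the dtype; if you would rather keep whole numbers alongside the missing values, you can cast to the nullable integer dtype (a small follow-up, assuming pandas 0.24+):
df['tx'] = df['tx'].astype('Int64')   # 0, 1, 2, 3 with <NA> on the False rows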

Related

Pandas - Set value based on idxmax of group, including NaN

I am trying to set the value of a column "C" based on the idxmax() of column "B" within a groupby. To make it a bit more complicated though, in the event of a NaN or 0, I would like it to return the min value excluding the NaN or 0, if such a value exists. Here is an example dataframe:
Index  A    B    C
0      1    5    False
1      1    10   False
2      2    9    False
3      2    NaN  False
4      3    3    False
5      3    5    False
6      4    NaN  False
7      4    NaN  False
8      5    0    False
9      5    5    False
I am trying to set column "C" to True for the idxmax() of column B, split by a groupby on column "A":
   A    B    C
0  1    5    True
1  1    10   False
2  2    9    True
3  2    NaN  False
4  3    3    True
5  3    5    False
6  4    NaN  True
7  4    NaN  False
8  5    0    False
9  5    5    True
Thanks!
Let's use groupby with transform like this:
df['C_new'] = df.groupby('A')['B'].transform('idxmax') == df.index
Output:
Index A B C C_new
0 0 1 5.0 False False
1 1 1 10.0 False True
2 2 2 9.0 False True
3 3 2 NaN False False
4 4 3 3.0 False False
5 5 3 5.0 False True
Or, with idxmin:
df['C_new'] = df.groupby('A')['B'].transform('idxmin') == df.index
Output:
Index A B C C_new
0 0 1 5.0 False True
1 1 1 10.0 False False
2 2 2 9.0 False True
3 3 2 NaN False False
4 4 3 3.0 False True
5 5 3 5.0 False False
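The one-liners above cover the plain idxmax/idxmin cases; if you also need the rule stated in the question (take the min while ignoring NaN and 0, falling back to the group's first row when every value is NaN or 0), a sketch along these lines reproduces the desired output above; treat it as a starting point rather than a tested answer:
masked = df['B'].where(df['B'].ne(0))            # treat 0 like NaN
winners = df.groupby('A').apply(
    lambda g: masked.loc[g.index].idxmin()       # min excluding NaN/0 ...
              if masked.loc[g.index].notna().any()
              else g.index[0]                    # ... else the group's first row
)
df['C'] = df.index.isin(winners)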

Pandas: Setting True to False in a column, if it appears less than n times in a row

I have a boolean column in a data frame. In my case, n is 4, so if True appears less than 4 times in a row I want to set these True values to False. The following code can pull that off:
example_data = [False,False,False,False,True,True,False,False,True,False,False,
False,True,True,True,False,False,False,True,True,True,True,
True,False]
import pandas as pd
df = pd.DataFrame(example_data,columns=["input"])
# At the beginning the output is equal to the input.
df["output"] = df["input"]
# This counter will count how often a True appeared in a row.
true_count = 0
# The smallest number of Trues that have to appear in a row to keep them.
n = 4
for index, row in df.iterrows():
    # If the current value is True, the true_count is increased.
    if row["input"] == True:
        true_count += 1
    # If the value is False and the previous value was False as well,
    # nothing will happen.
    elif true_count == 0:
        pass
    # If the true_count is smaller than n, the previous true_count
    # True values are set to False, counting back from the previous row.
    # After that the true_count is reset to 0.
    elif true_count < n:
        for i in range(0, true_count):
            df.at[index - (i + 1), "output"] = False
        true_count = 0
    # In case the true_count is n or greater, it is simply reset to 0.
    else:
        true_count = 0
The data frame will look something like this:
input output
0 False False
1 False False
2 False False
3 False False
4 True False
5 True False
6 False False
7 False False
8 True False
9 False False
10 False False
11 False False
12 True False
13 True False
14 True False
15 False False
16 False False
17 False False
18 True True
19 True True
20 True True
21 True True
22 True True
23 False False
My question is whether there is a more "pandas" way to do this, as iterating over the data is quite slow. I thought about some functionality that matches given sequences, for example False, True, True, True, False, and replaces them, but I didn't find anything like that.
Thanks in advance for any helpful answer.
The idea is to create a group label for each run of consecutive True values with Series.cumsum on the inverted boolean mask, replace the non-matching values with NaN via Series.where, then count the values of each group with Series.map and Series.value_counts and compare against the threshold with Series.ge:
s = (~df['input']).cumsum().where(df['input'])
df['out'] = s.map(s.value_counts()).ge(4)
print (df)
input output out
0 False False False
1 False False False
2 False False False
3 False False False
4 True False False
5 True False False
6 False False False
7 False False False
8 True False False
9 False False False
10 False False False
11 False False False
12 True False False
13 True False False
14 True False False
15 False False False
16 False False False
17 False False False
18 True True True
19 True True True
20 True True True
21 True True True
22 True True True
23 False False False
Details:
s = (~df['input']).cumsum().where(df['input'])
print (df.assign(inv = (~df['input']),
                 cumsum = (~df['input']).cumsum(),
                 s = (~df['input']).cumsum().where(df['input']),
                 count = s.map(s.value_counts()),
                 out = s.map(s.value_counts()).ge(4)))
input output inv cumsum s count out
0 False False True 1 NaN NaN False
1 False False True 2 NaN NaN False
2 False False True 3 NaN NaN False
3 False False True 4 NaN NaN False
4 True False False 4 4.0 2.0 False
5 True False False 4 4.0 2.0 False
6 False False True 5 NaN NaN False
7 False False True 6 NaN NaN False
8 True False False 6 6.0 1.0 False
9 False False True 7 NaN NaN False
10 False False True 8 NaN NaN False
11 False False True 9 NaN NaN False
12 True False False 9 9.0 3.0 False
13 True False False 9 9.0 3.0 False
14 True False False 9 9.0 3.0 False
15 False False True 10 NaN NaN False
16 False False True 11 NaN NaN False
17 False False True 12 NaN NaN False
18 True True False 12 12.0 5.0 True
19 True True False 12 12.0 5.0 True
20 True True False 12 12.0 5.0 True
21 True True False 12 12.0 5.0 True
22 True True False 12 12.0 5.0 True
23 False False True 13 NaN NaN False
Here's a way to do that:
N = 4
df["group_size"] = df.assign(group = (df.input==False).cumsum()).groupby("group").transform("count")
df.loc[(df.group_size > N) & df.input, "output"] = True
df.output.fillna(False, inplace = True)
The output is below (note that group_size is always the actual run length + 1, because the leading False of each group is counted too), but the final result is correct:
input group_size output
0 False 1 False
1 False 1 False
2 False 1 False
3 False 3 False
4 True 3 False
5 True 3 False
6 False 1 False
7 False 2 False
8 True 2 False
9 False 1 False
10 False 1 False
11 False 4 False
12 True 4 False
13 True 4 False
14 True 4 False
15 False 1 False
16 False 1 False
17 False 6 False
18 True 6 True
19 True 6 True
20 True 6 True
21 True 6 True
22 True 6 True
23 False 1 False
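For the record, the two ideas combine into a compact variant that needs no "+ 1" adjustment, because it sums the True values inside each run directly (my own sketch, with n = 4 as in the question):
runs = (~df['input']).cumsum()                        # one label per False-then-Trues block
run_len = df.groupby(runs)['input'].transform('sum')  # number of Trues in each block
df['output'] = df['input'] & run_len.ge(4)            # keep only runs of 4 or more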

Is there a function in pandas like cumsum() but for the mean? I need to apply it based on a condition

I need to extract the cumulative mean only while my column A is different from zero. Each time it is zero, the cumulative mean should restart. Thanks so much in advance; I am not so good with Python.
Input:
ColumnA
0 5
1 6
2 7
3 0
4 0
5 1
6 2
7 3
8 0
9 5
10 10
11 15
Expected Output:
ColumnA CumulativeMean
0 5 5.0
1 6 5.5
2 7 6.0
3 0 0.0
4 0 0.0
5 1 1.0
6 2 1.5
7 3 2.0
8 0 0.0
9 5 5.0
10 10 7.5
11 15 10.0
You can try cumsum to make the groups and then expanding + mean to compute the cumulative mean:
groups=df.ColumnA.eq(0).cumsum()
df.groupby(groups).apply(lambda x: x[x.ne(0)].expanding().mean()).fillna(0)
Details:
Make the groups where the column is equal to 0 with eq and cumsum; eq gives you a mask of True and False values, and cumsum takes these values as 1 or 0:
groups=df.ColumnA.eq(0).cumsum()
groups
0 0
1 0
2 0
3 1
4 2
5 2
6 2
7 2
8 3
9 3
10 3
11 3
Name: ColumnA, dtype: int32
Then group by those groups and use apply to take the cumulative mean over the elements different from 0:
df.groupby(groups).apply(lambda x: x[x.ne(0)].expanding().mean())
ColumnA
0 5.0
1 5.5
2 6.0
3 NaN
4 NaN
5 1.0
6 1.5
7 2.0
8 NaN
9 5.0
10 7.5
11 10.0
And finally use fillna to fill the NaN values with 0:
df.groupby(groups).apply(lambda x: x[x.ne(0)].expanding().mean()).fillna(0)
ColumnA
0 5.0
1 5.5
2 6.0
3 0.0
4 0.0
5 1.0
6 1.5
7 2.0
8 0.0
9 5.0
10 7.5
11 10.0
You can use boolean indexing to compare rows that are == 0 and != 0 against the previous rows with .shift(). Then just take the .cumsum() to separate the frame into groups, according to where the zeros are within ColumnA.
df['CumulativeMean'] = (df.groupby((((df.shift()['ColumnA'] != 0) & (df['ColumnA'] == 0)) |
                                    (df.shift()['ColumnA'] == 0) & (df['ColumnA'] != 0))
                                   .cumsum())['ColumnA'].apply(lambda x: x.expanding().mean()))
Out[6]:
ColumnA CumulativeMean
0 5 5.0
1 6 5.5
2 7 6.0
3 0 0.0
4 0 0.0
5 1 1.0
6 2 1.5
7 3 2.0
8 0 0.0
9 5 5.0
10 10 7.5
11 15 10.0
I've broken the logic of the boolean indexing within the .groupby statement down into multiple columns that build up to the final result in the column abcd_cumsum. From there, ['ColumnA'].apply(lambda x: x.expanding().mean()) takes the mean of the group up to any given row in that group. For example, the second row (index of 1) takes the grouped mean of the first and second row, but excludes the third row.
df['a'] = (df.shift()['ColumnA'] != 0)
df['b'] = (df['ColumnA'] == 0)
df['ab'] = (df['a'] & df['b'])
df['c'] = (df.shift()['ColumnA'] == 0)
df['d'] = (df['ColumnA'] != 0)
df['cd'] = (df['c'] & df['d'])
df['abcd'] = (df['ab'] | df['cd'])
df['abcd_cumsum'] = (df['ab'] | df['cd']).cumsum()
df['CumulativeMean'] = (df.groupby(df['abcd_cumsum'])['ColumnA'].apply(lambda x: x.expanding().mean()))
Out[7]:
ColumnA a b ab c d cd abcd abcd_cumsum \
0 5 True False False False True False False 0
1 6 True False False False True False False 0
2 7 True False False False True False False 0
3 0 True True True False False False True 1
4 0 False True False True False False False 1
5 1 False False False True True True True 2
6 2 True False False False True False False 2
7 3 True False False False True False False 2
8 0 True True True False False False True 3
9 5 False False False True True True True 4
10 10 True False False False True False False 4
11 15 True False False False True False False 4
CumulativeMean
0 5.0
1 5.5
2 6.0
3 0.0
4 0.0
5 1.0
6 1.5
7 2.0
8 0.0
9 5.0
10 7.5
11 10.0
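Both answers lean on groupby(...).apply, which can be slow on long frames; the same result can be had from cumulative sums alone (my own fully vectorized sketch, not from either answer):
g = df['ColumnA'].eq(0).cumsum()      # start a new group at every zero
nz = df['ColumnA'].ne(0)
df['CumulativeMean'] = (df['ColumnA'].groupby(g).cumsum()
                        / nz.groupby(g).cumsum()).where(nz, 0)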

how to reindex python dataframe based on column grouping

I am a newbie in Python. Please assist. I have a huge dataframe consisting of thousands of rows. An example of the df is shown below.
STATE VOLUME
INDEX
1 on 10
2 on 15
3 on 10
4 off 20
5 off 30
6 on 15
7 on 20
8 off 10
9 off 30
10 off 10
11 on 20
12 off 25
I want to be able to index this data based on the 'STATE' column such that the first batch of 'on' and 'off' registers as index 1, the next batch of 'on' and 'off' registers as index 2, and so on. I want to be able to select a group of data by selecting the rows with index 1.
STATE VOLUME
INDEX
1 on 10
1 on 15
1 on 10
1 off 20
1 off 30
2 on 15
2 on 20
2 off 10
2 off 30
2 off 10
3 on 20
3 off 25
You can try this with pd.Series.eq and pd.Series.shift, then take the cumulative sum using pd.Series.cumsum:
df.index = (df['STATE'].eq('off').shift() & df['STATE'].eq('on')).cumsum() + 1
df.index.name = 'INDEX'
STATE VOLUME
INDEX
1 on 10
1 on 15
1 on 10
1 off 20
1 off 30
2 on 15
2 on 20
2 off 10
2 off 30
2 off 10
3 on 20
3 off 25
Details
The idea is to find where an off is followed by an on.
# (df['STATE'].eq('off').shift() & df['STATE'].eq('on')).cumsum() + 1
eq(off).shift eq(on) eq(off).shift & eq(on)
INDEX
1 NaN True False
2 False True False
3 False True False
4 False False False
5 True False False
6 True True True
7 False True False
8 False False False
9 True False False
10 True False False
11 True True True
12 False False False
You could try this with pd.Series.shift and pd.Series.cumsum:
df.index=((df.STATE.shift(-1) != df.STATE)&df.STATE.eq('off')).shift(fill_value=0).cumsum()+1
Same as this with np.where:
temp=pd.Series(np.where((df.STATE.shift(-1) != df.STATE)&(df.STATE.eq('off')),1,0))
df.index=temp.shift(1,fill_value=0).cumsum().astype(int).add(1)
Output:
df
STATE VOLUME
1 on 10
1 on 15
1 on 10
1 off 20
1 off 30
2 on 15
2 on 20
2 off 10
2 off 30
2 off 10
3 on 20
3 off 25
Explanation:
With (df.STATE.shift(-1) != df.STATE) & df.STATE.eq('off'), you get a mask that marks the last row of each run of 'off' values:
(df.STATE.shift(-1) != df.STATE)&df.STATE.eq('off')
1 False
2 False
3 False
4 False
5 True
6 False
7 False
8 False
9 False
10 True
11 False
12 True
Then you shift it down one row so that last value stays inside its batch, and then you take the cumsum(), with True counted as 1 and False as 0:
((df.STATE.shift(-1) != df.STATE)&df.STATE.eq('off')).shift(fill_value=0)
1 0
2 False
3 False
4 False
5 False
6 True
7 False
8 False
9 False
10 False
11 True
12 False
((df.STATE.shift(-1) != df.STATE)&df.STATE.eq('off')).shift(fill_value=0).cumsum()
1 0
2 0
3 0
4 0
5 0
6 1
7 1
8 1
9 1
10 1
11 2
12 2
And finally you add 1 (+1) so the index starts at 1, giving the desired result.
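With the new index in place, selecting one batch is plain label-based indexing (a usage sketch):
first_batch = df.loc[1]   # every row of the first on/off cycle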

Pandas: How can I check multiple columns if there are any values that are smaller than previous value?

A solution for a single column is already provided here: Pandas: Check if column value is smaller than any previous column value.
However, my dataset consists of many columns and I don't want to brute-force the code for each one.
Example dataset:
c d e
0 3 5 8
1 1 5 8
2 5 6 8
3 6 7 8
4 2 1 9
5 9 3 3
Desired result:
c d e c_diff d_diff e_diff
0 3 5 8 False False False
1 1 5 8 True False False
2 5 6 8 False False False
3 6 7 8 False False False
4 2 1 9 True True False
5 9 3 3 False False True
Is there any way to perform this task with a few simple lines of Python/pandas code?
We can use DataFrame.diff with DataFrame.lt:
df.diff().lt(0).add_suffix('_diff')
c_diff d_diff e_diff
0 False False False
1 True False False
2 False False False
3 False False False
4 True True False
5 False False True
You can operate on the whole dataframe:
df.lt(df.shift()).add_suffix('_diff')
gives you
c_diff d_diff e_diff
0 False False False
1 True False False
2 False False False
3 False False False
4 True True False
5 False False True
And you can join:
df.join(df.lt(df.shift()).add_suffix('_diff'))
which gives:
c d e c_diff d_diff e_diff
0 3 5 8 False False False
1 1 5 8 True False False
2 5 6 8 False False False
3 6 7 8 False False False
4 2 1 9 True True False
5 9 3 3 False False True
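And if the goal is to act on those flags, the same comparison can filter the frame down to the rows where any column decreased (a usage sketch):
decreased = df.lt(df.shift()).any(axis=1)   # True where at least one column dropped
print(df[decreased])                        # rows 1, 4 and 5 in this example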
