Creating a column which keeps a running count of consecutive values - Python

I am trying to create a column (“consec”) which will keep a running count of consecutive values in another column (“binary”), without using a loop. This is what the desired outcome would look like:
   binary  consec
0       0       0
1       1       1
2       1       2
3       1       3
4       1       4
5       0       0
6       1       1
7       1       2
8       0       0
However, this...
df['consec'][df['binary']==1] = df['consec'].shift(1) + df['binary']
results in this...
   binary  consec
0       1     NaN
1       1       1
2       1       1
3       0       0
4       1       1
5       0       0
6       1       1
7       1       1
8       1       1
9       0       0
I see other posts which use grouping or sorting, but unfortunately, I don't see how that could work for me.

You can use the compare-cumsum-groupby pattern (which I really need to get around to writing up for the documentation), with a final cumcount:
>>> df = pd.DataFrame({"binary": [0,1,1,1,0,0,1,1,0]})
>>> df["consec"] = df["binary"].groupby((df["binary"] == 0).cumsum()).cumcount()
>>> df
   binary  consec
0       0       0
1       1       1
2       1       2
3       1       3
4       0       0
5       0       0
6       1       1
7       1       2
8       0       0
This works because first we get the positions where we want to reset the counter:
>>> (df["binary"] == 0)
0 True
1 False
2 False
3 False
4 True
5 True
6 False
7 False
8 True
Name: binary, dtype: bool
The cumulative sum of these gives us a different id for each group:
>>> (df["binary"] == 0).cumsum()
0 1
1 1
2 1
3 1
4 2
5 3
6 3
7 3
8 4
Name: binary, dtype: int64
And then we can pass this to groupby and use cumcount to get an increasing index in each group.
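To see that final step on its own, here is a small sketch reusing the df from above (group_ids is just an illustrative name, not part of the original answer):
>>> group_ids = (df["binary"] == 0).cumsum()
>>> df["binary"].groupby(group_ids).cumcount()
0    0
1    1
2    2
3    3
4    0
5    0
6    1
7    2
8    0
dtype: int64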

For those who ended up here looking for an answer to the "misunderstood" version:
To reset the count at each change in the binary column, so that consec really does "keep a running count of consecutive values", the following seems to work:
df["consec2"] = df["binary"].groupby((df["binary"] != df["binary"].shift()).cumsum()).cumcount()

Related

Grouping a set of numbers that reoccur in a pandas DataFrame

Say I have the following dataframe
holder
0
1
2
0
1
2
0
1
0
1
2
I want to group each set of numbers that starts at 0 and ends at the max value, and assign each such set its own group number.
So
holder group
0 1
1 1
2 1
0 2
1 2
2 2
0 3
1 3
0 4
1 4
2 4
I tried:
n = 3
df['group'] = [int(i / n) + 1 for i, x in enumerate(df.holder)]
But this returns
holder group
0 1
1 1
2 1
0 2
1 2
2 2
0 3
1 3
0 3
1 4
2 4
Assuming each group's numbers always increase, you can check whether the numbers are less than or equal to the ones before, then take the cumulative sum, which turns the booleans into group numbers.
df['group'] = df['holder'].diff().le(0).cumsum() + 1
Result:
holder group
0 0 1
1 1 1
2 2 1
3 0 2
4 1 2
5 2 2
6 0 3
7 1 3
8 0 4
9 1 4
10 2 4
(I'm using <= specifically instead of < in case of two adjacent 0s.)
This was inspired by Nickil Maveli's answer on "Groupby conditional sum of adjacent rows" but the cleaner method was posted by d.b in a comment here.
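For the curious, here is a quick sketch of the intermediates of the diff-based approach (the values in the comments are what the same df produces):
df['holder'].diff()                      # NaN, 1, 1, -2, 1, 1, -2, 1, -1, 1, 1
df['holder'].diff().le(0)                # False for the leading NaN, True exactly where a new group starts
df['holder'].diff().le(0).cumsum() + 1   # 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4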
Assuming holder is monotonically nondecreasing until another 0 occurs, you can identify the zeroes and create groups by taking the cumulative sum.
df = pd.DataFrame({'holder': [0, 1, 2, 0, 1, 2, 0, 1, 0, 1, 2]})
# identify 0s and create groups
df['group'] = df['holder'].eq(0).cumsum()
print(df)
holder group
0 0 1
1 1 1
2 2 1
3 0 2
4 1 2
5 2 2
6 0 3
7 1 3
8 0 4
9 1 4
10 2 4

Filtering out the end of a dataframe in pandas

I have a large dataframe with timestamps and one value column that first decreases, then stays at 0 for a while, and then increases again, starting the next cycle.
I would like to analyze the decreasing and stable parts, but not the increasing part.
Ideally, the code should check whether a 0 has already occurred in the df and, if so, exclude every value > 0 after it from the dataframe. It would also work to determine where the last value == 0 occurs and delete all the data after it.
Is there a possibility to do this?
Cheers!
import pandas as pd

data = {
    "Period_index": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "Value": [9, 7, 3, 0, 0, 0, 0, 2, 4, 6],
}
df = pd.DataFrame(data)
If your data always starts with a decrease, stays at 0 for a while and then increases, this should work:
import numpy as np

df[:np.where(df.Value == 0)[0].max()]
Period_index Value
0 1 9
1 2 7
2 3 3
3 4 0
4 5 0
5 6 0
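Note that this slice stops just before the row holding the last zero (Period_index 7 is dropped along with the increasing part), because df[:n] excludes position n. If you want to keep every zero row, extend the slice by one:
df[:np.where(df.Value == 0)[0].max() + 1]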
Literally as you say, keep a row if either:
there has never been a zero before it, via .gt(0).cummin(): .gt(0) is True for non-zero values, and the cumulative min turns everything after the first False into False;
or the value itself is zero, via .eq(0).
>>> df['Value'].gt(0).cummin()
0 True
1 True
2 True
3 False
4 False
5 False
6 False
7 False
8 False
9 False
Name: Value, dtype: bool
>>> df[df['Value'].gt(0).cummin() | df['Value'].eq(0)]
Period_index Value
0 1 9
1 2 7
2 3 3
3 4 0
4 5 0
5 6 0
6 7 0
If you’re afraid of picking up unconnected zeros again, you can pass the whole mask to cummin() too, i.e.:
df[(df['Value'].gt(0).cummin() | df['Value'].eq(0)).cummin()]
FWIW this is the only answer that really stops at the first zero:
>>> test = pd.Series([1, 0, 0, 1, 0])
>>> test[(test.gt(0).cummin() | test.eq(0)).cummin()]
0 1
1 0
2 0
dtype: int64
>>> test.loc[:test.eq(0).idxmax()]
0 1
1 0
dtype: int64
>>> test.loc[:test.eq(0)[::-1].idxmax()]
0 1
1 0
2 0
3 1
4 0
dtype: int64
@SeaBean’s answer is also good (or a variant thereof):
>>> test[test.gt(test.shift()).cumsum().eq(0)]
0 1
1 0
2 0
dtype: int64
This code cuts the series when it starts growing again.
The rolling window computes the difference between consecutive values; when the series starts growing again, the difference turns positive, which cummax captures:
df[df.Value.rolling(2).apply(lambda w: w.values[1] - w.values[0]).cummax().fillna(-1) <= 0]
Period_index Value
0 1 9
1 2 7
2 3 3
3 4 0
4 5 0
5 6 0
6 7 0
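The rolling difference above is exactly what Series.diff() computes, so an equivalent and shorter sketch of the same idea would be:
df[df['Value'].diff().cummax().fillna(-1).le(0)]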
try via idxmax() + iloc:
out = df.iloc[:df['Value'].eq(0).idxmax()]
# or: df.loc[:df['Value'].eq(0).idxmax() - 1]
output of out:
Period_index Value
0 1 9
1 2 7
2 3 3
OR
via idxmax() and loc:
out = df.loc[:df['Value'].eq(0)[::-1].idxmax()]
output of out:
Period_index Value
0 1 9
1 2 7
2 3 3
3 4 0
4 5 0
5 6 0
6 7 0
OR
If you don't want to get unconnected 0's then use argmin():
df = df.loc[:(df['Value'].eq(0) | (~df['Value'].gt(df['Value'].shift()))).argmin() - 1]

pandas create category column based on sequence repetition in another column

This is very likely a duplicate, but I'm not sure what to search for to find it.
I have a column in a dataframe that cycles from 0 to some value a number of times (in my example it cycles to 4 three times). I want to create another column that simply shows which cycle it is. Example:
import pandas as pd
df = pd.DataFrame({'A':[0,1,2,3,4,0,1,2,3,4,0,1,2,3,4]})
df['desired_output'] = [0,0,0,0,0,1,1,1,1,1,2,2,2,2,2]
print(df)
A desired_output
0 0 0
1 1 0
2 2 0
3 3 0
4 4 0
5 0 1
6 1 1
7 2 1
8 3 1
9 4 1
10 0 2
11 1 2
12 2 2
13 3 2
14 4 2
I was thinking maybe something along the lines of a groupby(), cumsum() and transform(), but I'm not quite sure how to implement it. Could be wrong though.
Compare with 0 using Series.eq, then take the Series.cumsum, and finally subtract 1:
df['desired_output'] = df['A'].eq(0).cumsum() - 1
print(df)
A desired_output
0 0 0
1 1 0
2 2 0
3 3 0
4 4 0
5 0 1
6 1 1
7 2 1
8 3 1
9 4 1
10 0 2
11 1 2
12 2 2
13 3 2
14 4 2
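One caveat worth noting (an addition, not from the original answer): if A does not start with 0, the rows before the first 0 land in cycle -1:
>>> s = pd.Series([3, 4, 0, 1, 2, 0, 1])
>>> (s.eq(0).cumsum() - 1).tolist()
[-1, -1, 0, 0, 0, 1, 1]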

Pandas - Assign unique ID to each group in grouped data

I have a dataframe with many attributes. I want to assign an id for all unique combinations of these attributes.
Assume this is my df:
df = pd.DataFrame(np.random.randint(1,3, size=(10, 3)), columns=list('ABC'))
A B C
0 2 1 1
1 1 1 1
2 1 1 1
3 2 2 2
4 1 2 2
5 1 2 1
6 1 2 2
7 1 2 1
8 1 2 2
9 2 2 1
Now, I need to append a new column with an id for unique combinations. It has to be 0 if the combination occurs only once. In this case:
A B C unique_combination
0 2 1 1 0
1 1 1 1 1
2 1 1 1 1
3 2 2 2 0
4 1 2 2 2
5 1 2 1 3
6 1 2 2 2
7 1 2 1 3
8 1 2 2 2
9 2 2 1 0
My first approach was to use a for loop and check, for every row, whether the row's combination of values occurs more than once in the dataframe, using .query:
unique_combination = 1  # acts as a counter
df['unique_combination'] = 0
for idx, row in df.iterrows():
    if len(df.query('A == @row.A & B == @row.B & C == @row.C')) > 1:
        # check if one occurrence of the combination already has a value > 0???
        df.loc[idx, 'unique_combination'] = unique_combination
        unique_combination += 1
However, I have no idea how to check whether an id has already been assigned for a combination (see the comment in the code). Additionally, my approach feels very slow and hacky (I have over 15000 rows). Do you data wranglers see a different approach to my problem?
Thank you very much!
Step 1: Assign a new column with the value 0:
df['new'] = 0
Step 2: Create a mask for combinations that occur more than once, i.e.
mask = df.groupby(['A', 'B', 'C'])['new'].transform(lambda x: len(x) > 1)
Step 3: Assign ids to the masked rows by factorizing, i.e.
df.loc[mask, 'new'] = df.loc[mask, ['A', 'B', 'C']].astype(str).sum(1).factorize()[0] + 1
# or
# df.loc[mask, 'new'] = df.loc[mask, ['A', 'B', 'C']].groupby(['A', 'B', 'C']).ngroup() + 1
Output:
A B C new
0 2 1 1 0
1 1 1 1 1
2 1 1 1 1
3 2 2 2 0
4 1 2 2 2
5 1 2 1 3
6 1 2 2 2
7 1 2 1 3
8 1 2 2 2
9 2 2 1 0
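Putting the three steps together (the df from the question reconstructed with literal values, so the sketch is self-contained and reproducible):
import pandas as pd

df = pd.DataFrame({'A': [2, 1, 1, 2, 1, 1, 1, 1, 1, 2],
                   'B': [1, 1, 1, 2, 2, 2, 2, 2, 2, 2],
                   'C': [1, 1, 1, 2, 2, 1, 2, 1, 2, 1]})

df['new'] = 0
mask = df.groupby(['A', 'B', 'C'])['new'].transform(lambda x: len(x) > 1)
df.loc[mask, 'new'] = df.loc[mask, ['A', 'B', 'C']].astype(str).sum(1).factorize()[0] + 1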
A new feature added in Pandas version 0.20.2 creates a column of unique ids automatically for you.
df['unique_id'] = df.groupby(['A', 'B', 'C']).ngroup()
gives the following output
A B C unique_id
0 2 1 2 3
1 2 2 1 4
2 1 2 1 1
3 1 2 2 2
4 1 1 1 0
5 1 2 1 1
6 1 1 1 0
7 2 2 2 5
8 1 2 2 2
9 1 2 2 2
The groups are given ids based on the order they would be iterated over.
See the documentation here: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#enumerate-groups

Apply a value to all instances of a number based on conditions

I have a df like this:
ID Number
1 0
1 0
1 1
2 0
2 0
3 1
3 1
3 0
I want to apply a 5 to any IDs that have a 1 anywhere in the Number column, and a 0 to those that don't. For example, if the number "1" appears anywhere in the Number column for ID 1, I want to place a 5 in the Total column for every instance of that ID.
My desired output would look as such
ID Number Total
1 0 5
1 0 5
1 1 5
2 0 0
2 0 0
3 1 5
3 1 5
3 0 5
I'm trying to think of a way to leverage applymap for this, but I'm not sure how to implement it.
Use transform to add a column to your df as a result of a groupby on 'ID':
In [6]:
df['Total'] = df.groupby('ID').transform(lambda x: 5 if (x == 1).any() else 0)
df
Out[6]:
ID Number Total
0 1 0 5
1 1 0 5
2 1 1 5
3 2 0 0
4 2 0 0
5 3 1 5
6 3 1 5
7 3 0 5
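Since Number only contains 0 and 1 here, the same result can also be had without a Python-level lambda (a sketch relying on that assumption):
df['Total'] = df.groupby('ID')['Number'].transform('max') * 5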
You can use DataFrame.groupby() on the ID column, take the max() of the Number column, turn that into a dictionary, and then use it to create the 'Total' column. Example -
grouped = df.groupby('ID')['Number'].max().to_dict()
df['Total'] = df.apply(lambda row: 5 if grouped[row['ID']] else 0, axis=1)
Demo -
In [44]: df
Out[44]:
ID Number
0 1 0
1 1 0
2 1 1
3 2 0
4 2 0
5 3 1
6 3 1
7 3 0
In [56]: grouped = df.groupby('ID')['Number'].max().to_dict()
In [58]: df['Total'] = df.apply((lambda row:5 if grouped[row['ID']] else 0), axis=1)
In [59]: df
Out[59]:
ID Number Total
0 1 0 5
1 1 0 5
2 1 1 5
3 2 0 0
4 2 0 0
5 3 1 5
6 3 1 5
7 3 0 5
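Equivalently, since grouped maps each ID to the max of Number (0 or 1 in this data), the row-wise apply can be replaced by a vectorized map (again assuming Number is binary):
df['Total'] = df['ID'].map(grouped) * 5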
