Grouping a set of numbers that reoccur in a pandas DataFrame

Say I have the following dataframe
holder
0
1
2
0
1
2
0
1
0
1
2
I want to group each set of numbers that starts at 0 and ends at the max value, and assign a value to that group.
So
holder group
0 1
1 1
2 1
0 2
1 2
2 2
0 3
1 3
0 4
1 4
2 4
I tried:
n=3
df['group'] = [int(i/n) for i,x in enumerate(df.holder)]
But this returns
holder group
0 1
1 1
2 1
0 2
1 2
2 2
0 3
1 3
0 3
1 4
2 4

Assuming each group's numbers always increase, you can check whether the numbers are less than or equal to the ones before, then take the cumulative sum, which turns the booleans into group numbers.
df['group'] = df['holder'].diff().le(0).cumsum() + 1
Result:
holder group
0 0 1
1 1 1
2 2 1
3 0 2
4 1 2
5 2 2
6 0 3
7 1 3
8 0 4
9 1 4
10 2 4
(I'm using <= specifically instead of < in case of two adjacent 0s.)
This was inspired by Nickil Maveli's answer on "Groupby conditional sum of adjacent rows" but the cleaner method was posted by d.b in a comment here.
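To illustrate the `<=` versus `<` remark above, here is a minimal sketch on a hypothetical series `s` (not from the question) that contains two adjacent 0s. The variable names are made up for the illustration.

```python
import pandas as pd

# Hypothetical data: a lone 0 immediately followed by a new run starting at 0
s = pd.Series([0, 0, 1, 2, 0, 1], name='holder')

# <= 0: a repeated value also starts a new group
groups_le = s.diff().le(0).cumsum() + 1

# < 0: only a strict decrease starts a new group, so the two 0s are merged
groups_lt = s.diff().lt(0).cumsum() + 1

print(groups_le.tolist())  # [1, 2, 2, 2, 3, 3]
print(groups_lt.tolist())  # [1, 1, 1, 1, 2, 2]
```

The first row's diff is NaN, which compares as False, so the first group always starts at 1.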

Assuming holder is monotonically nondecreasing until another 0 occurs, you can identify the zeroes and create groups by taking the cumulative sum.
df = pd.DataFrame({'holder': [0, 1, 2, 0, 1, 2, 0, 1, 0, 1, 2]})
# identify 0s and create groups
df['group'] = df['holder'].eq(0).cumsum()
print(df)
holder group
0 0 1
1 1 1
2 2 1
3 0 2
4 1 2
5 2 2
6 0 3
7 1 3
8 0 4
9 1 4
10 2 4
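Once the group column exists, it can be used as an ordinary groupby key. As a rough sketch (the `summary` name is only for illustration), this checks how long each run is and where it stops:

```python
import pandas as pd

df = pd.DataFrame({'holder': [0, 1, 2, 0, 1, 2, 0, 1, 0, 1, 2]})
df['group'] = df['holder'].eq(0).cumsum()

# One row per group: number of values in the run and its largest value
summary = df.groupby('group')['holder'].agg(['size', 'max'])
print(summary)
```

Here group 3 shows up with size 2 and max 1, which is the short run that broke the fixed-length `int(i/n)` attempt in the question.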

Related

pandas create category column based on sequence repetition in another column

This is very likely a duplicate, but I'm not sure what to search for to find it.
I have a column in a dataframe that cycles from 0 to some value a number of times (in my example it cycles to 4 three times). I want to create another column that simply shows which cycle it is. Example:
import pandas as pd
df = pd.DataFrame({'A':[0,1,2,3,4,0,1,2,3,4,0,1,2,3,4]})
df['desired_output'] = [0,0,0,0,0,1,1,1,1,1,2,2,2,2,2]
print(df)
A desired_output
0 0 0
1 1 0
2 2 0
3 3 0
4 4 0
5 0 1
6 1 1
7 2 1
8 3 1
9 4 1
10 0 2
11 1 2
12 2 2
13 3 2
14 4 2
I was thinking maybe something along the lines of a groupby(), cumsum() and transform(), but I'm not quite sure how to implement it. Could be wrong though.

Compare against 0 with Series.eq, take the Series.cumsum, then subtract 1:
df['desired_output'] = df['A'].eq(0).cumsum() - 1
print (df)
A desired_output
0 0 0
1 1 0
2 2 0
3 3 0
4 4 0
5 0 1
6 1 1
7 2 1
8 3 1
9 4 1
10 0 2
11 1 2
12 2 2
13 3 2
14 4 2
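The `eq(0)` approach relies on every cycle starting at 0. If that is not guaranteed, one possible variation (an assumption, not part of the answer above) is to start a new cycle wherever the value drops. The data and the `cycle` column name below are hypothetical:

```python
import pandas as pd

# Hypothetical column that cycles 1..3 instead of starting at 0
df = pd.DataFrame({'A': [1, 2, 3, 1, 2, 3, 1, 2]})

# A negative difference marks the first row of a new cycle
df['cycle'] = df['A'].diff().lt(0).cumsum()
print(df['cycle'].tolist())  # [0, 0, 0, 1, 1, 1, 2, 2]
```

Because the first diff is NaN and compares as False, the numbering starts at 0 without subtracting 1.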

Cumulative sum of a dataframe column with restart

I would like to perform the following function on a dataframe.
Calculate the cumulative sum of a column, notice:
It looks at the previous index only, not including the current one, e.g. the very first one will be zero as there is no previous data to look at.
When it doesn't cumulate, e.g the increment is zero, it restarts the count.
Number Cumulative
0 1 0
1 1 1
2 1 2
3 0 3
4 0 0
5 1 0
6 1 1
7 0 2
I know there is an expanding function, but it doesn't restart when it sees zero.

IIUC, this works by making groups according to whether the previous row was 0, then getting the cumulative count:
>>> df
Number
0 1
1 1
2 1
3 0
4 0
5 1
6 1
7 0
df['Cumulative'] = df.groupby(df.Number.shift().eq(0).cumsum()).cumcount()
>>> df
Number Cumulative
0 1 0
1 1 1
2 1 2
3 0 3
4 0 0
5 1 0
6 1 1
7 0 2
Alternatively, if it really is cumsum you want, then apply cumsum with the same grouping as above, and shift it 1 down:
df['Cumulative'] = df.groupby(df.Number.eq(0).cumsum())['Number'].cumsum().shift().fillna(0)
>>> df
Number Cumulative
0 1 0.0
1 1 1.0
2 1 2.0
3 0 3.0
4 0 0.0
5 1 0.0
6 1 1.0
7 0 2.0
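To see why the cumcount grouping works, it can help to look at the intermediate key. This is a small sketch with the question's data; `prev_was_zero` and `key` are names chosen here for illustration:

```python
import pandas as pd

df = pd.DataFrame({'Number': [1, 1, 1, 0, 0, 1, 1, 0]})

# True on every row whose previous row was 0 (the first row compares NaN, so False)
prev_was_zero = df['Number'].shift().eq(0)

# Each True starts a new block, so the running total is a block id
key = prev_was_zero.cumsum()
print(key.tolist())  # [0, 0, 0, 0, 1, 2, 2, 2]

# Position of each row inside its block
df['Cumulative'] = df.groupby(key).cumcount()
print(df['Cumulative'].tolist())  # [0, 1, 2, 3, 0, 0, 1, 2]
```

Note that cumcount counts rows rather than summing values, so it only matches a cumulative sum when the non-zero values are all 1, as they are here.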

Identify first non-zero element within group composed of multiple columns in pandas

I have a dataframe that looks like the following. The rightmost column is my desired column:
Group1 Group2 Value Target_Column
1 3 0 0
1 3 1 1
1 4 1 1
1 4 1 0
2 5 5 5
2 5 1 0
2 6 0 0
2 6 1 1
2 6 9 0
How do I identify the first non-zero value in a group that is made up of two columns (Group1 and Group2), and then create a column that shows the first non-zero value and zeroes everywhere else?
This question is very similar to one posed earlier here:
Identify first non-zero element within a group in pandas
but that solution gives an error on groups based on multiple columns.
I have tried:
import pandas as pd
dt = pd.DataFrame({'Group1': [1,1,1,1,2,2,2,2,2], 'Group2': [3,3,4,4,5,5,6,6,6], 'Value': [0,1,1,1,5,1,0,1,9]})
dt['Newcol']=0
dt.loc[dt.Value.ne(0).groupby(dt['Group1','Group2']).idxmax(),'Newcol']=dt.Value

Setup
df['flag'] = df.Value.ne(0)
Using numpy.where and assign (requires import numpy as np):
df.assign(
    target=np.where(df.index.isin(df.groupby(['Group1', 'Group2']).flag.idxmax()),
                    df.Value, 0)
).drop(columns='flag')
Using loc and assign:
df.assign(
    target=df.loc[df.groupby(['Group1', 'Group2']).flag.idxmax(), 'Value']
).fillna(0).astype(int).drop(columns='flag')
Both produce:
Group1 Group2 Value target
0 1 3 0 0
1 1 3 1 1
2 1 4 1 1
3 1 4 1 0
4 2 5 5 5
5 2 5 1 0
6 2 6 0 0
7 2 6 1 1
8 2 6 9 0
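For comparison, here is a self-contained sketch of a different way to reach the same result without idxmax: count the non-zero values cumulatively within each group and keep the row where that count first reaches 1. This is an alternative technique, not the method from the answer above, and the intermediate names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Group1': [1, 1, 1, 1, 2, 2, 2, 2, 2],
                   'Group2': [3, 3, 4, 4, 5, 5, 6, 6, 6],
                   'Value':  [0, 1, 1, 1, 5, 1, 0, 1, 9]})

nonzero = df['Value'].ne(0)

# Running count of non-zero values inside each (Group1, Group2) group
running = nonzero.astype(int).groupby([df['Group1'], df['Group2']]).cumsum()

# The first non-zero row is the non-zero row where the running count equals 1
is_first = nonzero & running.eq(1)

# Keep Value on those rows, 0 everywhere else
df['target'] = df['Value'].where(is_first, 0)
print(df['target'].tolist())  # [0, 1, 1, 0, 5, 0, 0, 1, 0]
```

A group that contains only zeroes never reaches a running count of 1, so all of its rows stay 0.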

The numbers may be off: when a group contains two identical values, I don't know which one you need.
Using user3483203's setup:
df['flag'] = df.Value.ne(0)
df['Target']=df.sort_values(['flag'],ascending=False).drop_duplicates(['Group1','Group2']).Value
df['Target'] = df['Target'].fillna(0)
df
Out[20]:
Group1 Group2 Value Target_Column Target
0 1 3 0 0 0.0
1 1 3 1 1 1.0
2 1 4 1 1 1.0
3 1 4 1 0 0.0
4 2 5 5 5 5.0
5 2 5 1 0 0.0
6 2 6 0 0 0.0
7 2 6 1 1 1.0

Pandas - Assign unique ID to each group in grouped data

I have a dataframe with many attributes. I want to assign an id for all unique combinations of these attributes.
assume, this is my df:
df = pd.DataFrame(np.random.randint(1,3, size=(10, 3)), columns=list('ABC'))
A B C
0 2 1 1
1 1 1 1
2 1 1 1
3 2 2 2
4 1 2 2
5 1 2 1
6 1 2 2
7 1 2 1
8 1 2 2
9 2 2 1
Now, I need to append a new column with an id for unique combinations. It has to be 0 if the combination occurs only once. In this case:
A B C unique_combination
0 2 1 1 0
1 1 1 1 1
2 1 1 1 1
3 2 2 2 0
4 1 2 2 2
5 1 2 1 3
6 1 2 2 2
7 1 2 1 3
8 1 2 2 2
9 2 2 1 0
My first approach was to use a for loop and check for every row, if I find more than one combination in the dataframe of the row's values with .query:
unique_combination = 1 #acts as a counter
df['unique_combination'] = 0
for idx, row in df.iterrows():
    if len(df.query('A == @row.A & B == @row.B & C == @row.C')) > 1:
        # check, if one occurrence of the combination already has a value > 0???
        df.loc[idx, 'unique_combination'] = unique_combination
        unique_combination += 1
However, I have no idea how to check whether an ID has already been assigned to a combination (see the comment in the code). Additionally, my approach feels very slow and hacky (I have over 15000 rows). Do you data wranglers see a different approach to my problem?
Thank you very much!

Step 1: Assign a new column with value 0:
df['new'] = 0
Step 2: Create a mask for combinations that occur more than once:
mask = df.groupby(['A','B','C'])['new'].transform(lambda x : len(x)>1)
Step 3: Assign ids by factorizing the masked rows:
df.loc[mask,'new'] = df.loc[mask,['A','B','C']].astype(str).sum(1).factorize()[0] + 1
# or (sort=False numbers groups in order of first appearance, matching the output below)
# df.loc[mask,'new'] = df.loc[mask,['A','B','C']].groupby(['A','B','C'], sort=False).ngroup()+1
Output:
A B C new
0 2 1 1 0
1 1 1 1 1
2 1 1 1 1
3 2 2 2 0
4 1 2 2 2
5 1 2 1 3
6 1 2 2 2
7 1 2 1 3
8 1 2 2 2
9 2 2 1 0
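The three steps can also be written as one runnable sketch. It uses `DataFrame.duplicated(keep=False)` for the mask, which is a swapped-in alternative to the transform shown above, and recreates the question's printed data explicitly since the original frame was random.

```python
import pandas as pd

# The question's printed data, typed out so the example is reproducible
df = pd.DataFrame({'A': [2, 1, 1, 2, 1, 1, 1, 1, 1, 2],
                   'B': [1, 1, 1, 2, 2, 2, 2, 2, 2, 2],
                   'C': [1, 1, 1, 2, 2, 1, 2, 1, 2, 1]})
cols = ['A', 'B', 'C']

# True for every row whose combination occurs more than once
mask = df.duplicated(cols, keep=False)

# Default id 0, then number the repeated combinations from 1
# in order of first appearance (sort=False)
df['new'] = 0
df.loc[mask, 'new'] = df[mask].groupby(cols, sort=False).ngroup() + 1
print(df['new'].tolist())  # [0, 1, 1, 0, 2, 3, 2, 3, 2, 0]
```

Grouping on the columns directly avoids the string concatenation step, which could in principle merge distinct combinations such as ('1', '12') and ('11', '2').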

A new feature added in Pandas version 0.20.2 creates a column of unique ids automatically for you.
df['unique_id'] = df.groupby(['A', 'B', 'C']).ngroup()
gives the following output
A B C unique_id
0 2 1 2 3
1 2 2 1 4
2 1 2 1 1
3 1 2 2 2
4 1 1 1 0
5 1 2 1 1
6 1 1 1 0
7 2 2 2 5
8 1 2 2 2
9 1 2 2 2
The groups are given ids based on the order they would be iterated over.
See the documentation here: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#enumerate-groups
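As a small illustration of that ordering, the sketch below compares the default (keys sorted) with `sort=False` (keys in order of first appearance) on a made-up frame:

```python
import pandas as pd

# Hypothetical data chosen so the two orderings differ
df = pd.DataFrame({'A': [2, 1, 2, 1], 'B': [1, 2, 1, 1]})

# Default: groups are numbered by sorted key, (1,1)=0, (1,2)=1, (2,1)=2
sorted_ids = df.groupby(['A', 'B']).ngroup()

# sort=False: groups are numbered as they are first seen, (2,1)=0, (1,2)=1, (1,1)=2
seen_ids = df.groupby(['A', 'B'], sort=False).ngroup()

print(sorted_ids.tolist())  # [2, 1, 2, 0]
print(seen_ids.tolist())    # [0, 1, 0, 2]
```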

Apply a value to all instances of a number based on conditions

I have a df like this:
ID Number
1 0
1 0
1 1
2 0
2 0
3 1
3 1
3 0
I want to apply a 5 to any ids that have a 1 anywhere in the number column and a zero to those that don't. For example, if the number "1" appears anywhere in the Number column for ID 1, I want to place a 5 in the total column for every instance of that ID.
My desired output would look as such
ID Number Total
1 0 5
1 0 5
1 1 5
2 0 0
2 0 0
3 1 5
3 1 5
3 0 5
I'm trying to think of a way to leverage applymap for this, but I'm not sure how to implement it.

Use transform to add a column to your df as a result of a groupby on 'ID':
In [6]:
df['Total'] = df.groupby('ID')['Number'].transform(lambda x: 5 if (x == 1).any() else 0)
df
Out[6]:
ID Number Total
0 1 0 5
1 1 0 5
2 1 1 5
3 2 0 0
4 2 0 0
5 3 1 5
6 3 1 5
7 3 0 5
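A variation that avoids the Python lambda, sketched here as one possible approach: build a boolean "this ID contains a 1" column with the built-in `'any'` aggregation and map it to 5 or 0 with `numpy.where`. The `has_one` name is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID':     [1, 1, 1, 2, 2, 3, 3, 3],
                   'Number': [0, 0, 1, 0, 0, 1, 1, 0]})

# True on every row of an ID that has at least one Number == 1
has_one = df['Number'].eq(1).groupby(df['ID']).transform('any')

df['Total'] = np.where(has_one, 5, 0)
print(df['Total'].tolist())  # [5, 5, 5, 0, 0, 5, 5, 5]
```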

You can use DataFrame.groupby() on ID column and then take max() of the Number column, and then make that into a dictionary and then use that to create the 'Total' column. Example -
grouped = df.groupby('ID')['Number'].max().to_dict()
df['Total'] = df.apply((lambda row:5 if grouped[row['ID']] else 0), axis=1)
Demo -
In [44]: df
Out[44]:
ID Number
0 1 0
1 1 0
2 1 1
3 2 0
4 2 0
5 3 1
6 3 1
7 3 0
In [56]: grouped = df.groupby('ID')['Number'].max().to_dict()
In [58]: df['Total'] = df.apply((lambda row:5 if grouped[row['ID']] else 0), axis=1)
In [59]: df
Out[59]:
ID Number Total
0 1 0 5
1 1 0 5
2 1 1 5
3 2 0 0
4 2 0 0
5 3 1 5
6 3 1 5
7 3 0 5
