Pandas - Assign unique ID to each group in grouped data - python

I have a dataframe with many attributes. I want to assign an id for all unique combinations of these attributes.
Assume this is my df:
df = pd.DataFrame(np.random.randint(1,3, size=(10, 3)), columns=list('ABC'))
A B C
0 2 1 1
1 1 1 1
2 1 1 1
3 2 2 2
4 1 2 2
5 1 2 1
6 1 2 2
7 1 2 1
8 1 2 2
9 2 2 1
Now, I need to append a new column with an id for unique combinations. It has to be 0 if the combination occurs only once. In this case:
A B C unique_combination
0 2 1 1 0
1 1 1 1 1
2 1 1 1 1
3 2 2 2 0
4 1 2 2 2
5 1 2 1 3
6 1 2 2 2
7 1 2 1 3
8 1 2 2 2
9 2 2 1 0
My first approach was to use a for loop and check for every row, if I find more than one combination in the dataframe of the row's values with .query:
unique_combination = 1  # acts as a counter
df['unique_combination'] = 0
for idx, row in df.iterrows():
    if len(df.query('A == @row.A & B == @row.B & C == @row.C')) > 1:
        # check, if one occurrence of the combination already has a value > 0???
        df.loc[idx, 'unique_combination'] = unique_combination
        unique_combination += 1
However, I have no idea how to check whether an ID has already been assigned to a combination (see the comment in the code). Additionally, my approach feels very slow and hacky (I have over 15000 rows). Do you data wranglers see a different approach to my problem?
Thank you very much!

Step 1: Assign a new column filled with 0
df['new'] = 0
Step 2: Create a mask that is True for combinations occurring more than once, i.e.
mask = df.groupby(['A','B','C'])['new'].transform(lambda x : len(x)>1)
Step 3: Factorize the masked rows and assign the resulting ids, i.e.
df.loc[mask,'new'] = df.loc[mask,['A','B','C']].astype(str).sum(1).factorize()[0] + 1
# or (note: ngroup numbers groups in sorted key order by default, so the ids may be assigned in a different order)
# df.loc[mask,'new'] = df.loc[mask,['A','B','C']].groupby(['A','B','C']).ngroup()+1
Output:
A B C new
0 2 1 1 0
1 1 1 1 1
2 1 1 1 1
3 2 2 2 0
4 1 2 2 2
5 1 2 1 3
6 1 2 2 2
7 1 2 1 3
8 1 2 2 2
9 2 2 1 0
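A minimal, self-contained sketch of the same mask-then-number idea on the question's sample data, assuming a reasonably recent pandas. It is a variation, not the answer's exact code: it swaps the lambda for transform("size") and numbers the groups with ngroup(sort=False), so that ids follow the order of first appearance. The names cols and repeated are only illustrative.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "A": [2, 1, 1, 2, 1, 1, 1, 1, 1, 2],
    "B": [1, 1, 1, 2, 2, 2, 2, 2, 2, 2],
    "C": [1, 1, 1, 2, 2, 1, 2, 1, 2, 1],
})
cols = ["A", "B", "C"]

# True for rows whose (A, B, C) combination occurs more than once
repeated = df.groupby(cols)["A"].transform("size") > 1

# Combinations that occur only once keep the id 0
df["unique_combination"] = 0

# sort=False numbers the groups in order of first appearance; +1 so ids start at 1
df.loc[repeated, "unique_combination"] = (
    df[repeated].groupby(cols, sort=False).ngroup() + 1
)
print(df)
```

Grouping on the columns directly also avoids joining the values into one string, which could make different combinations look identical once the values have more than one digit.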

GroupBy.ngroup, a feature added in pandas version 0.20.2, numbers the groups for you, so you can create a column of unique ids directly.
df['unique_id'] = df.groupby(['A', 'B', 'C']).ngroup()
This gives the following output (shown here for a different random df than the one in the question):
A B C unique_id
0 2 1 2 3
1 2 2 1 4
2 1 2 1 1
3 1 2 2 2
4 1 1 1 0
5 1 2 1 1
6 1 1 1 0
7 2 2 2 5
8 1 2 2 2
9 1 2 2 2
The groups are given ids in the order they would be iterated over, which by default is the sorted order of the group keys.
See the documentation here: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#enumerate-groups
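A small sketch of how the numbering order depends on the sort argument of groupby, using a made-up four-row frame (the data and variable names are illustrative only):

```python
import pandas as pd

df = pd.DataFrame({"A": [2, 1, 1, 2], "B": [1, 2, 2, 1]})

# Default sort=True: ids follow the sorted group keys, so (1, 2) -> 0 and (2, 1) -> 1
sorted_ids = df.groupby(["A", "B"]).ngroup()

# sort=False: ids follow the order of first appearance, so (2, 1) -> 0 and (1, 2) -> 1
appearance_ids = df.groupby(["A", "B"], sort=False).ngroup()

print(sorted_ids.tolist())
print(appearance_ids.tolist())
```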

Related

How to keep track of how many times a unique condition occurs

I have a df that looks like this:
time val
0 1
1 1
2 2
3 3
4 1
5 2
6 3
7 3
8 3
9 3
10 1
11 1
How do I create new columns that count how many times a condition occurs, without counting again while the condition stays the same? In this case, I want to create a column for each unique value in val that holds the cumulative number of occurrences up to the given row, but does not increment while the value doesn't change.
Expected outcome below:
time val sum_1 sum_2 sum_3
0 1 1 0 0
1 1 1 0 0
2 2 1 1 0
3 3 1 1 1
4 1 2 1 1
5 2 2 2 1
6 3 2 2 2
7 3 2 2 2
8 3 2 2 2
9 3 2 2 2
10 1 3 2 2
11 1 3 2 2
EDIT
To be more specific with the condition:
I want to count the number of times a unique value appears in val. For example, using the code below, I could get this result:
df['sum_1'] = (df['val'] == 1).cumsum()
df['sum_2'] = (df['val'] == 2).cumsum()
df['sum_3'] = (df['val'] == 3).cumsum()
time val sum_1 sum_2 sum_3
0 0 1 1 0 0
1 1 1 2 0 0
2 2 2 2 1 0
3 3 3 2 1 1
4 4 1 3 1 1
5 5 2 3 2 1
However, this code counts EVERY occurrence of a condition. For example, val shows 1 occurring 3 times in total. However, I want to treat consecutive occurrences of 1 as a single group, counting only the number of consecutive groupings that occur. In the example above, 1 occurs 3 times in total, but only 2 times as a consecutive grouping.
You can build a mask that marks the first row of each run of consecutive values by comparing the column with its shifted values (Series.ne with Series.shift), chain it with & (bitwise AND) to the test for each value, and run the code in a loop over all unique values of column val:
uniq = df['val'].unique()
m = df['val'].ne(df['val'].shift())
for c in uniq:
    df[f'sum_{c}'] = (df['val'].eq(c) & m).cumsum()
print (df)
time val sum_1 sum_2 sum_3
0 0 1 1 0 0
1 1 1 1 0 0
2 2 2 1 1 0
3 3 3 1 1 1
4 4 1 2 1 1
5 5 2 2 2 1
6 6 3 2 2 2
7 7 3 2 2 2
8 8 3 2 2 2
9 9 3 2 2 2
10 10 1 3 2 2
11 11 1 3 2 2
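A self-contained sketch of the same run-start idea on the question's data, in case the intermediate mask is easier to follow when spelled out. The names run_start, counts and result are illustrative, and the columns are built in one dictionary instead of a loop.

```python
import pandas as pd

df = pd.DataFrame({
    "time": range(12),
    "val": [1, 1, 2, 3, 1, 2, 3, 3, 3, 3, 1, 1],
})

# True on the first row of each run of equal consecutive values
run_start = df["val"].ne(df["val"].shift())

# One column per distinct value: how many runs of that value have started so far
counts = pd.DataFrame(
    {
        f"sum_{v}": (df["val"].eq(v) & run_start).cumsum()
        for v in sorted(df["val"].unique())
    },
    index=df.index,
)
result = df.join(counts)
print(result)
```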
For better performance (I hope), here is a NumPy alternative:
a = df['val'].to_numpy()
uniq = np.unique(a)
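# True at the first row of each run; comparing neighbours directly also works when the first value is 0
m = np.concatenate(([True], a[1:] != a[:-1]))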
arr = np.cumsum((a[:, None] == uniq) & m[:, None], axis=0)
df = df.join(pd.DataFrame(arr, index=df.index, columns=uniq).add_prefix('sum_'))
print (df)
time val sum_1 sum_2 sum_3
0 0 1 1 0 0
1 1 1 1 0 0
2 2 2 1 1 0
3 3 3 1 1 1
4 4 1 2 1 1
5 5 2 2 2 1
6 6 3 2 2 2
7 7 3 2 2 2
8 8 3 2 2 2
9 9 3 2 2 2
10 10 1 3 2 2
11 11 1 3 2 2

How to calculate count within the same group based on ID

My DataFrame looks like:
df = pd.DataFrame({"ID":['A','B','A','A','B','B','C','D','D','C'],
'count':[1,1,2,2,2,2,1,1,1,2]})
print(df)
ID count
0 A 1
1 B 1
2 A 2
3 A 2
4 B 2
5 B 2
6 C 1
7 D 1
8 D 1
9 C 2
I will have only the ID column, and I want to calculate the count column. The logic is that I want to cumulatively count the occurrences of an ID. If it is repeated immediately, as at index 2 and 3, both rows should get the same count. How can I achieve this?
My attempt, which does not give accurate results:
df['x'] = df['ID'].eq(df['ID'].shift(-1)).astype(int)
df.groupby('ID')['x'].transform('cumsum')+1
0 1
1 1
2 2
3 2
4 2
5 2
6 1
7 2
8 2
9 1
Name: x, dtype: int32
This is not a plain groupby cumulative count; it is slightly different.
We can filter to the first row of each consecutive run, count those rows per ID, and then reindex back to the original index:
(df[df.ID.ne(df.ID.shift())].groupby('ID').cumcount().add(1)
.reindex(df.index,method='ffill'))
Out[10]:
0 1
1 1
2 2
3 2
4 2
5 2
6 1
7 1
8 1
9 2
dtype: int64
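An alternative sketch, not the method used above: mark the first row of each consecutive run with 1 and take a cumulative sum of those marks within each ID. The name run_start is illustrative; the data is the ID column from the question.

```python
import pandas as pd

df = pd.DataFrame({"ID": ["A", "B", "A", "A", "B", "B", "C", "D", "D", "C"]})

# 1 on the first row of each consecutive run of the same ID, 0 otherwise
run_start = df["ID"].ne(df["ID"].shift()).astype(int)

# Within each ID, the running total of run starts is the number of the current run
df["count"] = run_start.groupby(df["ID"]).cumsum()
print(df)
```

Because the cumulative sum is computed on every row, no forward fill or reindex step is needed afterwards.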
You could also use groupby() with sort=False:
df['count2'] = df[(df.ID.ne(df.ID.shift()))].groupby('ID', sort=False).cumcount().add(1)
df['count2'] = df['count2'].ffill()
Output:
ID count count2
0 A 1 1
1 B 1 1
2 A 2 2
3 A 2 2
4 B 2 2
5 B 2 2
6 C 1 1
7 D 1 1
8 D 1 1
9 C 2 2

keep second level of multi-index intact while sorting on first one pandas python

I have sorted my first level of index using the following method : Custom sort order function for groupby pandas python
def my_func(group):
    return sum(group["B"]*group["C"])
idx=df.groupby('A').apply(my_func).reindex(df.index.get_level_values(0))
df.iloc[idx.argsort()]
The issue is that the ordering of the second level gets jumbled up after sorting on the first. How can I make sure that the intra-group order is kept?
from
A B C
1 0 1 8
1 3 3
2 0 1 2
1 2 2
3 0 1 3
1 2 4
to
A B C
2 0 1 2
1 2 2
3 0 1 3
1 2 4
1 0 1 8
1 3 3
and not (last 2 lines inverted)
A B C
2 0 1 2
1 2 2
3 0 1 3
1 2 4
1 1 3 3
0 1 8
I think you need a stable sorting algorithm such as mergesort, which keeps rows with equal sort keys in their original order:
idx=df.index.get_level_values(0).map(df.groupby('A').apply(my_func))
df = df.iloc[idx.argsort(kind='mergesort')]
print (df)
B C
A
2 0 1 2
1 2 2
3 0 1 3
1 2 4
1 0 1 8
1 3 3
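A self-contained sketch of the stable-sort idea, rebuilt from the tables in the question. It assumes the first index level is named A and the second level is unnamed, and it computes the per-group key with transform("sum") instead of apply, so it is a variation rather than the exact code above.

```python
import numpy as np
import pandas as pd

# Frame reconstructed from the question (second index level left unnamed)
index = pd.MultiIndex.from_product([[1, 2, 3], [0, 1]], names=["A", None])
df = pd.DataFrame({"B": [1, 3, 1, 2, 1, 2], "C": [8, 3, 2, 2, 3, 4]}, index=index)

# Per-row sort key: sum of B*C over each first-level group, repeated on every row of the group
key = (df["B"] * df["C"]).groupby(level="A").transform("sum")

# kind="stable" keeps rows with equal keys in their original relative order
order = np.argsort(key.to_numpy(), kind="stable")
result = df.iloc[order]
print(result)
```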

How to add incremental number to Dataframe using Pandas

I have original dataframe:
ID T value
1 0 1
1 4 3
2 0 0
2 4 1
2 7 3
For the missing T rows, the value should be the same as in the previous row.
The output should be like:
ID T value
1 0 1
1 1 1
1 2 1
1 3 1
1 4 3
2 0 0
2 1 0
2 2 0
2 3 0
2 4 1
2 5 1
2 6 1
2 7 3
... ... ...
I tried a loop, but it takes a long time to process.
Any idea how to solve this for large dataframe?
Thanks!
This solution requires unique integer values in T within each group.
Use groupby with a custom function: for each group, reindex over the full range of T, then replace the NaNs in the value column by forward filling with ffill:
# select the columns with a list; tuple-style ['T', 'value'] no longer works in recent pandas
df1 = (df.groupby('ID')[['T', 'value']]
         .apply(lambda x: x.set_index('T').reindex(np.arange(x['T'].min(), x['T'].max() + 1)))
         .ffill()
         .astype(int)
         .reset_index())
print (df1)
ID T value
0 1 0 1
1 1 1 1
2 1 2 1
3 1 3 1
4 1 4 3
5 2 0 0
6 2 1 0
7 2 2 0
8 2 3 0
9 2 4 1
10 2 5 1
11 2 6 1
12 2 7 3
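A sketch of the same reindex-and-forward-fill step written as an explicit loop over the groups, which may be easier to debug than the chained apply. It is an illustration under the same assumption (unique integer T per ID); the names parts, full_range and filled are made up for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 2, 2, 2],
    "T": [0, 4, 0, 4, 7],
    "value": [1, 3, 0, 1, 3],
})

parts = []
for id_, group in df.groupby("ID"):
    # Every integer T between the group's first and last observation
    full_range = np.arange(group["T"].min(), group["T"].max() + 1)
    # Missing T rows become NaN on reindex and are then filled from the previous row
    values = group.set_index("T")["value"].reindex(full_range).ffill().astype(int)
    parts.append(pd.DataFrame({"ID": id_, "T": full_range, "value": values.to_numpy()}))

filled = pd.concat(parts, ignore_index=True)
print(filled)
```

Forward filling inside each group also means a value can never leak from one ID into the next.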
If you get the error:
ValueError: cannot reindex from a duplicate axis
it means there are duplicated T values within a group, like:
print (df)
ID T value
0 1 0 1
1 1 4 3
2 2 0 0
3 2 4 1 <- T=4 is duplicated within group 2
4 2 4 3 <- T=4 is duplicated within group 2
5 2 7 3
The solution is to aggregate the values first so that T is unique within each group, e.g. by sum:
df = df.groupby(['ID', 'T'], as_index=False)['value'].sum()
print (df)
ID T value
0 1 0 1
1 1 4 3
2 2 0 0
3 2 4 4
4 2 7 3
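Before aggregating, it can help to look at which rows are actually duplicated. A short sketch using DataFrame.duplicated on the example above (the names dupes and deduped are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 2, 2, 2, 2],
    "T": [0, 4, 0, 4, 4, 7],
    "value": [1, 3, 0, 1, 3, 3],
})

# keep=False flags every row that shares its (ID, T) pair with another row
dupes = df[df.duplicated(["ID", "T"], keep=False)]
print(dupes)

# After aggregating, each (ID, T) pair appears exactly once
deduped = df.groupby(["ID", "T"], as_index=False)["value"].sum()
print(deduped)
```

Whether sum is the right aggregation depends on the data; first, last or mean may be more appropriate in other cases.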

How to apply cumulative count on multiple columns of a dataframe

Dataframe
a b c
0 0 1 1
1 0 1 1
2 0 0 1
3 0 0 1
4 1 1 0
5 1 1 1
6 1 1 1
7 0 0 1
I am trying to apply a cumulative count (cumcount) to multiple columns of a dataframe. I have tried applying the cumulative count by grouping on each column. Is there an easy way to achieve the expected output?
I have tried this code, but it is not working:
li =[]
for column in df.columns:
    li.append(df.groupby(column)[column].cumcount())
pd.concat(li,axis=1)
Expected output
a b c
0 1 1 1
1 1 2 2
2 1 1 3
3 1 1 4
4 1 1 1
5 2 2 1
6 3 3 2
7 1 1 3
Create consecutive groups by comparing each column with its shifted values, apply cumcount within those groups for each column, and finally set the result to 1 wherever the original value is 0 using a boolean mask:
df = (df.ne(df.shift()).cumsum()
.apply(lambda x: df.groupby(x).cumcount() + 1)
.mask(df == 0, 1))
print (df)
a b c
0 1 1 1
1 1 2 2
2 1 1 3
3 1 1 4
4 1 1 1
5 2 2 1
6 3 3 2
7 1 1 3
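The same logic can be sketched one column at a time, which makes the three steps (label runs, count within each run, reset zeros to 1) easier to check. The helper name position_in_run is hypothetical; the data is the frame from the question.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [0, 0, 0, 0, 1, 1, 1, 0],
    "b": [1, 1, 0, 0, 1, 1, 1, 0],
    "c": [1, 1, 1, 1, 0, 1, 1, 1],
})

def position_in_run(s):
    # A new run id starts whenever the value differs from the previous row
    run_id = s.ne(s.shift()).cumsum()
    # 1-based position of each row inside its run
    position = s.groupby(run_id).cumcount() + 1
    # Rows where the original value is 0 are always reported as 1
    return position.mask(s.eq(0), 1)

out = df.apply(position_in_run)
print(out)
```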
Another solution, if performance is important: count only the 1 values, and finally set 1 everywhere else with np.where:
a = df == 1
b = a.cumsum()
arr = np.where(a, b-b.mask(a).ffill().fillna(0).astype(int), 1)
df = pd.DataFrame(arr, index=df.index, columns=df.columns)
print (df)
a b c
0 1 1 1
1 1 2 2
2 1 1 3
3 1 1 4
4 1 1 1
5 2 2 1
6 3 3 2
7 1 1 3
