Column that counts up within subgroups pandas - python

I've got a df
df1
a b
4 0 1
5 0 1
6 0 2
2 0 3
3 1 2
15 1 3
12 1 3
13 1 1
15 3 1
14 3 1
8 3 3
9 3 2
10 3 1
The df should be grouped by a and b, and I need a column c that counts up from 1 to the number of distinct b groups within each subgroup of a:
df1
a b c
4 0 1 1
5 0 1 1
6 0 2 2
2 0 3 3
3 1 2 1
15 1 3 2
12 1 3 2
13 1 1 3
15 3 1 1
14 3 1 1
8 3 3 2
9 3 2 3
10 3 1 4
How can I do that?

We can use groupby + transform with factorize:
df['C']=df.groupby('a').b.transform(lambda x : x.factorize()[0]+1)
4 1
5 1
6 2
2 3
3 1
15 2
12 2
13 3
15 1
14 1
8 1
9 1
10 2
Name: b, dtype: int64
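To try this end to end, here is a minimal, self-contained sketch of the factorize approach on the question's data. It assumes a default integer index instead of the index shown in the question, and the column name c is simply the one the question asks for.

```python
import pandas as pd

# Question data, rebuilt with a default RangeIndex (an assumption for this sketch)
df = pd.DataFrame({
    "a": [0, 0, 0, 0, 1, 1, 1, 1, 3, 3, 3, 3, 3],
    "b": [1, 1, 2, 3, 2, 3, 3, 1, 1, 1, 3, 2, 1],
})

# factorize() labels values 0, 1, 2, ... in order of first appearance,
# so adding 1 yields a counter that restarts for every value of a
df["c"] = df.groupby("a")["b"].transform(lambda s: s.factorize()[0] + 1)
print(df)
```

A repeated (a, b) pair reuses the number from its first appearance, so the last row (a=3, b=1) gets 1 again.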

Just so we can see the loop version
from itertools import count
from collections import defaultdict
x = defaultdict(count)
y = {}
c = []
for a, b in zip(df.a, df.b):
    if (a, b) not in y:
        y[(a, b)] = next(x[a]) + 1
    c.append(y[(a, b)])
df.assign(C=c)
a b C
4 0 1 1
5 0 1 1
6 0 2 2
2 0 3 3
3 1 2 1
15 1 3 2
12 1 3 2
13 1 1 3
15 3 1 1
14 3 1 1
8 3 3 2
9 3 2 3
10 3 1 1

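One option is to group by a, then iterate through each group and group by b; within each group you can then use ngroup. Note that ngroup numbers the groups from 0 in sorted order of b, and stacking the per-group results assumes the rows are already sorted by a:
import numpy as np
df['c'] = np.hstack([g.groupby('b').ngroup().to_numpy() for _,g in df.groupby('a')])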
a b c
4 0 1 0
5 0 1 0
6 0 2 1
2 0 3 2
3 1 2 1
15 1 3 2
12 1 3 2
13 1 1 0
15 3 1 0
14 3 1 0
8 3 1 0
9 3 1 0
10 3 2 1
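A possible variant of the ngroup idea that avoids stacking arrays, offered as a sketch only: grouping with sort=False makes ngroup number the groups in order of first appearance, and adding 1 makes the counter start at 1. The small frame and the helper name number_groups are made up for illustration.

```python
import pandas as pd

# Made-up sample data for this sketch
df = pd.DataFrame({"a": [0, 0, 0, 1, 1, 1, 1], "b": [1, 1, 2, 2, 3, 3, 1]})

def number_groups(s):
    # sort=False keeps groups in order of first appearance; ngroup is 0-based
    return s.groupby(s, sort=False).ngroup() + 1

# transform keeps the original row order, so no sorting by a is required
df["c"] = df.groupby("a")["b"].transform(number_groups)
print(df)
```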

If you don't care about the order of appearance in the data (the numbers then follow the sorted values of b), you can use groupby.rank:
df['c'] = df.groupby('a')['b'].rank(method='dense').astype(int)
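To make the difference concrete, the sketch below compares a dense rank with the factorize approach on the a=1 rows from the question. The column names by_rank and by_appearance are made up for this comparison.

```python
import pandas as pd

# The a=1 subgroup from the question
df = pd.DataFrame({"a": [1, 1, 1, 1], "b": [2, 3, 3, 1]})

# Dense rank: numbers follow the sorted values of b (1 -> 1, 2 -> 2, 3 -> 3)
df["by_rank"] = df.groupby("a")["b"].rank(method="dense").astype(int)

# factorize: numbers follow the order in which values first appear
df["by_appearance"] = df.groupby("a")["b"].transform(lambda s: s.factorize()[0] + 1)
print(df)
```

Only the second column matches the numbering the question asks for; the first is fine when any consistent labelling of the subgroups will do.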

Related

pandas restart cumsum every time the value is zero

So I have a series that I want to cumsum, but start over every time I hit a 0, something like this:
orig wanted result
0 0 0
1 1 1
2 1 2
3 1 3
4 1 4
5 1 5
6 1 6
7 0 0
8 1 1
9 1 2
10 1 3
11 0 0
12 1 1
13 1 2
14 1 3
15 1 4
16 1 5
17 1 6
Any ideas? (pandas, pure python, other)
Use df['orig'].eq(0).cumsum() to generate groups starting on each 0, then cumcount to get the increasing values:
df['result'] = df.groupby(df['orig'].eq(0).cumsum()).cumcount()
output:
orig wanted result result
0 0 0 0
1 1 1 1
2 1 2 2
3 1 3 3
4 1 4 4
5 1 5 5
6 1 6 6
7 0 0 0
8 1 1 1
9 1 2 2
10 1 3 3
11 0 0 0
12 1 1 1
13 1 2 2
14 1 3 3
15 1 4 4
16 1 5 5
17 1 6 6
Intermediate:
df['orig'].eq(0).cumsum()
0 1
1 1
2 1
3 1
4 1
5 1
6 1
7 2
8 2
9 2
10 2
11 3
12 3
13 3
14 3
15 3
16 3
17 3
Name: orig, dtype: int64
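The same idea, written step by step:
import pandas as pd
condition = df['orig'].eq(0)  # True on every row where a new run starts
df['reset'] = condition.cumsum()  # group id that increments at each 0
df['result'] = df.groupby('reset').cumcount()  # position within each group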
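Note that cumcount counts rows, which matches the wanted result only because every non-zero value in the question is 1. As a sketch on made-up data, a grouped cumsum over the same group ids restarts the actual running sum at each 0; the column names count and running_sum are illustrative.

```python
import pandas as pd

# Made-up data with non-zero values other than 1
df = pd.DataFrame({"orig": [0, 1, 1, 1, 0, 1, 1, 0, 2, 3]})

# A new group starts at every 0
groups = df["orig"].eq(0).cumsum()

# cumcount gives the position within each group (0, 1, 2, ...)
df["count"] = df.groupby(groups).cumcount()

# A grouped cumsum adds up the actual values, restarting at every 0
df["running_sum"] = df.groupby(groups)["orig"].cumsum()
print(df)
```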

Change repeating groups in a column to incremental groups

I have the following dataframe:
df = pd.DataFrame({'group_nr':[0,0,1,1,1,2,2,3,3,0,0,1,1,2,2,2,3,3]})
print(df)
group_nr
0 0
1 0
2 1
3 1
4 1
5 2
6 2
7 3
8 3
9 0
10 0
11 1
12 1
13 2
14 2
15 2
16 3
17 3
and would like to change from repeating group numbers to incremental group numbers:
group_nr incremental_group_nr
0 0 0
1 0 0
2 1 1
3 1 1
4 1 1
5 2 2
6 2 2
7 3 3
8 3 3
9 0 4
10 0 4
11 1 5
12 1 5
13 2 6
14 2 6
15 2 6
16 3 7
17 3 7
I can't find a way of doing this without looping through the rows. Does someone have an idea how to implement this nicely?
You can check whether each value differs from the previous one, and take the cumsum of the resulting boolean series to generate the groups:
df['incremental_group_nr'] = df.group_nr.ne(df.group_nr.shift()).cumsum().sub(1)
print(df)
group_nr incremental_group_nr
0 0 0
1 0 0
2 1 1
3 1 1
4 1 1
5 2 2
6 2 2
7 3 3
8 3 3
9 0 4
10 0 4
11 1 5
12 1 5
13 2 6
14 2 6
15 2 6
16 3 7
17 3 7
Compare each value with the shifted values (Series.shift) using not-equal (Series.ne), then take the cumulative sum and subtract 1:
df['incremental_group_nr'] = df['group_nr'].ne(df['group_nr'].shift()).cumsum() - 1
print(df)
group_nr incremental_group_nr
0 0 0
1 0 0
2 1 1
3 1 1
4 1 1
5 2 2
6 2 2
7 3 3
8 3 3
9 0 4
10 0 4
11 1 5
12 1 5
13 2 6
14 2 6
15 2 6
16 3 7
17 3 7
Another idea is to backfill the first missing value created by shift using bfill, so that no subtraction is needed:
df['incremental_group_nr'] = df['group_nr'].ne(df['group_nr'].shift().bfill()).cumsum()
print(df)
group_nr incremental_group_nr
0 0 0
1 0 0
2 1 1
3 1 1
4 1 1
5 2 2
6 2 2
7 3 3
8 3 3
9 0 4
10 0 4
11 1 5
12 1 5
13 2 6
14 2 6
15 2 6
16 3 7
17 3 7
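The intermediate boolean mask is what makes this work, so here is a small self-contained sketch that keeps it in a separate variable. The name starts is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"group_nr": [0, 0, 1, 1, 1, 2, 2, 3, 3, 0, 0, 1, 1, 2, 2, 2, 3, 3]})

# True wherever a value differs from the one before it; the first row is
# always True because shift() puts NaN there
starts = df["group_nr"].ne(df["group_nr"].shift())

# Running count of block starts, shifted so the first block is 0
df["incremental_group_nr"] = starts.cumsum() - 1
print(df)
```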

Pandas: find duplicates in another dataframe based on a subset

Assume DF 1:
A B C
0 1 1 1
1 1 1 2
2 2 1 1
3 1 9 0
4 9 9 9
And DF 2
A B C
0 6 1 1
1 1 1 2
2 2 1 1
3 1 9 0
4 1 9 6
I would like to add a column to DF 1 with a count of duplicates in DF 2 based on a subset of columns:
For example, duplicates on columns A and B:
Result:
A B C Dupe
0 1 1 1 1
1 1 1 2 1
2 2 1 1 1
3 1 9 0 2
4 9 9 9 0
Sounds like you should group df2 by the subset columns, count the rows per group, and then merge the counts into df1:
df=df1.merge(df2.groupby(['A','B']).size().to_frame('DUP').reset_index(),how='left').fillna(0)
A B C DUP
0 1 1 1 1.0
1 1 1 2 1.0
2 2 1 1 1.0
3 1 9 0 2.0
4 9 9 9 0.0
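The same steps as a self-contained sketch on the question's data, with the join keys spelled out and the count cast back to integers. The variable names subset, counts and out are made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 1, 2, 1, 9], "B": [1, 1, 1, 9, 9], "C": [1, 2, 1, 0, 9]})
df2 = pd.DataFrame({"A": [6, 1, 2, 1, 1], "B": [1, 1, 1, 9, 9], "C": [1, 2, 1, 0, 6]})

subset = ["A", "B"]

# Count how often each (A, B) combination occurs in df2
counts = df2.groupby(subset).size().reset_index(name="DUP")

# A left join keeps every row of df1; combinations missing from df2 become NaN
out = df1.merge(counts, on=subset, how="left")

# Replace NaN with 0 and undo the float conversion caused by the NaN values
out["DUP"] = out["DUP"].fillna(0).astype(int)
print(out)
```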

Reshaping groupby dataframe to fixed dimensions

I have a dataframe df with the following data.
A B C D
1 1 3 1
1 2 9 8
1 3 3 9
2 1 2 9
2 2 1 4
2 3 9 5
2 4 6 4
3 1 4 1
3 2 0 4
4 1 2 6
5 1 2 4
5 2 8 3
grp = df.groupby('A')
Next, I want all groups of df (grouped on column A) to have the same number of rows, either by truncating extra rows or by padding with rows of 0. For the data above, I want every group to have 3 rows. I need the following result:
A B C D
1 1 3 1
1 2 9 8
1 3 3 9
2 1 2 9
2 2 1 4
2 3 9 5
3 1 4 1
3 2 0 4
3 0 0 0
4 1 2 6
4 0 0 0
4 0 0 0
5 1 2 4
5 2 8 3
5 0 0 0
Similarly, I may want to groupby on multiple columns, like
grp = df.groupby(['A','B'])
Use GroupBy.cumcount for a counter column, then DataFrame.reindex with a MultiIndex built by MultiIndex.from_product:
df['g'] = df.groupby('A').cumcount()
mux = pd.MultiIndex.from_product([df['A'].unique(), range(3)], names=('A','g'))
df = (df.set_index(['A','g'])
        .reindex(mux, fill_value=0)
        .reset_index(level=1, drop=True)
        .reset_index())
print (df)
A B C D
0 1 1 3 1
1 1 2 9 8
2 1 3 3 9
3 2 1 2 9
4 2 2 1 4
5 2 3 9 5
6 3 1 4 1
7 3 2 0 4
8 3 0 0 0
9 4 1 2 6
10 4 0 0 0
11 4 0 0 0
12 5 1 2 4
13 5 2 8 3
14 5 0 0 0
Another solution uses DataFrame.merge with a left join against a helper DataFrame:
from itertools import product
df['g'] = df.groupby('A').cumcount()
df1 = pd.DataFrame(list(product(df['A'].unique(), range(3))), columns=['A','g'])
df = df1.merge(df, how='left').fillna(0).astype(int).drop('g', axis=1)
print (df)
A B C D
0 1 1 3 1
1 1 2 9 8
2 1 3 3 9
3 2 1 2 9
4 2 2 1 4
5 2 3 9 5
6 3 1 4 1
7 3 2 0 4
8 3 0 0 0
9 4 1 2 6
10 4 0 0 0
11 4 0 0 0
12 5 1 2 4
13 5 2 8 3
14 5 0 0 0
EDIT: For grouping on multiple columns:
df['g'] = df.groupby(['A','B']).cumcount()
mux = pd.MultiIndex.from_product([df['A'].unique(),
                                  df['B'].unique(),
                                  range(3)], names=('A','B','g'))
df = (df.set_index(['A','B','g'])
        .reindex(mux, fill_value=0)
        .reset_index(level=2, drop=True)
        .reset_index())
print (df.head(10))
A B C D
0 1 1 3 1
1 1 1 0 0
2 1 1 0 0
3 1 2 9 8
4 1 2 0 0
5 1 2 0 0
6 1 3 3 9
7 1 3 0 0
8 1 3 0 0
9 1 4 0 0
from itertools import product
df['g'] = df.groupby(['A','B']).cumcount()
df1 = pd.DataFrame(list(product(df['A'].unique(),
                                df['B'].unique(),
                                range(3))), columns=['A','B','g'])
df = df1.merge(df, how='left').fillna(0).astype(int).drop('g', axis=1)
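For reuse, the cumcount + reindex approach can be wrapped in a small function. This is a sketch only: the helper name pad_or_truncate and its parameters are hypothetical, and it assumes a single grouping column and a fill value of 0.

```python
import pandas as pd

def pad_or_truncate(df, key, n):
    """Return a copy of df with exactly n rows per value of `key` (hypothetical helper)."""
    # Counter 0, 1, 2, ... within each group
    numbered = df.assign(g=df.groupby(key).cumcount())
    # Every (key value, position) pair that should exist in the result
    full = pd.MultiIndex.from_product([df[key].unique(), range(n)], names=[key, "g"])
    return (numbered.set_index([key, "g"])
                    .reindex(full, fill_value=0)  # drops g >= n, adds rows of 0
                    .reset_index(level="g", drop=True)
                    .reset_index())

df = pd.DataFrame({
    "A": [1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 5, 5],
    "B": [1, 2, 3, 1, 2, 3, 4, 1, 2, 1, 1, 2],
    "C": [3, 9, 3, 2, 1, 9, 6, 4, 0, 2, 2, 8],
    "D": [1, 8, 9, 9, 4, 5, 4, 1, 4, 6, 4, 3],
})
out = pad_or_truncate(df, "A", 3)
print(out)
```

The fourth row of group 2 is dropped because its counter value 3 is not part of the target index, while groups 3, 4 and 5 are padded with rows of 0.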

Repeating rows in a DataFrame based on a column

I have a dataframe now:
class1 class2 value value2
0 1 0 1 4
1 2 1 2 3
2 2 0 3 5
3 3 1 4 6
I want to repeat each row according to the difference between value and value2, and add a column that increments from value to value2. The resulting dataframe should look like this:
class1 class2 value value2 value3
0 1 0 1 4 1
1 1 0 1 4 2
2 1 0 1 4 3
3 1 0 1 4 4
4 2 1 2 3 2
5 2 1 2 3 3
6 2 0 3 5 3
7 2 0 3 5 4
8 2 0 3 5 5
9 3 1 4 6 4
10 3 1 4 6 5
11 3 1 4 6 6
I tried it like:
def func(x):
    copy = x.copy()
    num = x.value2+1-x.value
    return pd.concat([copy]*num.values[0])
df= df.groupby(['class1','class2']).apply(lambda x:func(x))
But there is an order problem, which leaves me not knowing how to add the column value3. I'd also like an elegant way of doing it.
Can anyone help me? Thanks in advance
Compute the difference and call Index.repeat:
idx = df.index.repeat(df.value2 - df.value + 1)
Now, either use reindex:
df = df.reindex(idx).reset_index(drop=True)
Or loc:
df = df.loc[idx].reset_index(drop=True)
And you get
df
class1 class2 value value2
0 1 0 1 4
1 1 0 1 4
2 1 0 1 4
3 1 0 1 4
4 2 1 2 3
5 2 1 2 3
6 2 0 3 5
7 2 0 3 5
8 2 0 3 5
9 3 1 4 6
10 3 1 4 6
11 3 1 4 6
For the second part of your question, you'll need groupby.cumcount:
s = idx.to_series()
df['value3'] = df['value'] + s.groupby(idx).cumcount().values
df
class1 class2 value value2 value3
0 1 0 1 4 1
1 1 0 1 4 2
2 1 0 1 4 3
3 1 0 1 4 4
4 2 1 2 3 2
5 2 1 2 3 3
6 2 0 3 5 3
7 2 0 3 5 4
8 2 0 3 5 5
9 3 1 4 6 4
10 3 1 4 6 5
11 3 1 4 6 6
Here's a sequence of things that would get you the desired output:
df.join(df
        .apply(lambda x: pd.Series(range(x.value, x.value2+1)), axis=1)
        .stack().astype(int)
        .reset_index(level=1, drop=True)
        .to_frame('value3')).reset_index(drop=True)
Out[]:
class1 class2 value value2 value3
0 1 0 1 4 1
1 1 0 1 4 2
2 1 0 1 4 3
3 1 0 1 4 4
4 2 1 2 3 2
5 2 1 2 3 3
6 2 0 3 5 3
7 2 0 3 5 4
8 2 0 3 5 5
9 3 1 4 6 4
10 3 1 4 6 5
11 3 1 4 6 6
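Putting the repeat and cumcount steps together, a compact sketch might look like this. The helper name expand_range and its parameters are hypothetical, and it assumes a unique index and value2 >= value in every row.

```python
import pandas as pd

def expand_range(df, start="value", stop="value2", new="value3"):
    """Repeat each row once per integer in [start, stop] (hypothetical helper)."""
    reps = df[stop] - df[start] + 1
    # Repeat every index label reps times and select those rows
    out = df.loc[df.index.repeat(reps)]
    # Position of each copy within its original row: 0, 1, 2, ...
    offset = out.groupby(level=0).cumcount().to_numpy()
    out = out.reset_index(drop=True)
    out[new] = out[start] + offset
    return out

df = pd.DataFrame({"class1": [1, 2, 2, 3], "class2": [0, 1, 0, 1],
                   "value": [1, 2, 3, 4], "value2": [4, 3, 5, 6]})
result = expand_range(df)
print(result)
```

Grouping on the original index labels, not on class1 and class2, keeps the counter tied to each source row even if two rows share the same class values.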
