pandas aggregate data while carrying a column unchanged

I have a data frame, a:
a=pd.DataFrame({'ID': [1,1,2,2,3,4], 'B': [1,5,3,2,4,1], 'C': [1,4,3,6,1,1]})
ID B C
0 1 1 1
1 1 5 4
2 2 3 3
3 2 2 6
4 3 4 1
5 4 1 1
And I want to aggregate it so that the resulting data frame is grouped by ID and contains the row corresponding to the min of B (so apply min() to B and carry C as is).
So the resulting data frame should be:
ID B C
0 1 1 1
1 2 2 6
2 3 4 1
3 4 1 1
How can I do this programmatically using pandas.groupby(), or is there another way to do it?

You can use groupby and transform to filter the rows:
a.loc[a['B'] == a.groupby('ID').B.transform('min')]
ID B C
0 1 1 1
3 2 2 6
4 3 4 1
5 4 1 1
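Note that this filter keeps every row that ties for the group minimum of B. If you want exactly one row per ID regardless of ties, a minimal sketch using idxmin:
# idxmin returns the index label of the (first) minimal B within each ID;
# .loc then pulls those full rows, carrying C along unchanged
a.loc[a.groupby('ID')['B'].idxmin()]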

Try sorting before your groupby, then taking first:
a.sort_values('B').groupby('ID',as_index=False).first()
ID B C
0 1 1 1
1 2 2 6
2 3 4 1
3 4 1 1
Or, a probably faster way: sort by ID and B, then drop duplicate IDs, keeping the first (the default behavior of drop_duplicates):
a.sort_values(['ID','B']).drop_duplicates('ID')
ID B C
0 1 1 1
3 2 2 6
4 3 4 1
5 4 1 1
(drop_duplicates keeps the original index labels; chain .reset_index(drop=True) if you want a fresh 0..n index.)

When there is sorting involved, and the grouping doesn't involve any calculations, I prefer to work on the underlying numpy arrays for performance.
Using argsort and numpy.unique:
import numpy as np

arr = a.values
# sort all rows by column 1 (B) so each ID's min-B row comes first
out = arr[np.argsort(arr[:, 1])]
# np.unique returns the index of the first occurrence of each ID in the sorted array
_, idx = np.unique(out[:, 0], return_index=True)
out[idx]
array([[1, 1, 1],
       [2, 2, 6],
       [3, 4, 1],
       [4, 1, 1]], dtype=int64)
To reassign the values to your DataFrame:
pd.DataFrame(out[idx], columns=a.columns)
ID B C
0 1 1 1
1 2 2 6
2 3 4 1
3 4 1 1
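One caveat, in case ties in B matter: np.argsort is not stable by default, so which of several tied rows survives is arbitrary. A stable sort keeps the row that appeared first (a sketch; note this numpy route also assumes a single shared dtype across columns, which holds here):
# kind='stable' preserves the original row order among equal B values
out = arr[np.argsort(arr[:, 1], kind='stable')]
_, idx = np.unique(out[:, 0], return_index=True)
pd.DataFrame(out[idx], columns=a.columns)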

Related

Autoincrement indexing after groupby with pandas on the original table

I cannot solve a very easy/simple problem in pandas. :(
I have the following table:
df = pd.DataFrame(data=dict(a=[1, 1, 1,2, 2, 3,1], b=["A", "A","B","A", "B", "A","A"]))
df
Out[96]:
a b
0 1 A
1 1 A
2 1 B
3 2 A
4 2 B
5 3 A
6 1 A
I would like to make an incrementing ID for each unique item, grouped by columns a and b. So the result would look like this (column c):
Out[98]:
a b c
0 1 A 1
1 1 A 1
2 1 B 2
3 2 A 3
4 2 B 4
5 3 A 5
6 1 A 1
I tried with:
df.groupby(["a", "b"]).nunique().cumsum().reset_index()
Result:
Out[105]:
a b c
0 1 A 1
1 1 B 2
2 2 A 3
3 2 B 4
4 3 A 5
Unfortunately this works only on the grouped dataset and not on the original one. As you can see, the original table has 7 rows while the groupby returns only 5.
So could someone please help me on how to get the desired table:
a b c
0 1 A 1
1 1 A 1
2 1 B 2
3 2 A 3
4 2 B 4
5 3 A 5
6 1 A 1
Thank you in advance!
groupby + ngroup
df['c'] = df.groupby(['a', 'b']).ngroup() + 1
a b c
0 1 A 1
1 1 A 1
2 1 B 2
3 2 A 3
4 2 B 4
5 3 A 5
6 1 A 1
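By default ngroup numbers the groups in sorted key order; on this data that happens to match the order of first appearance. If you need numbering strictly by first appearance, pass sort=False to groupby (a sketch):
# groups are numbered as they are first encountered rather than by sorted key
df['c'] = df.groupby(['a', 'b'], sort=False).ngroup() + 1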
Use pd.factorize after creating a tuple from the (a, b) columns:
df['c'] = pd.factorize(df[['a', 'b']].apply(tuple, axis=1))[0] + 1
print(df)
# Output
a b c
0 1 A 1
1 1 A 1
2 1 B 2
3 2 A 3
4 2 B 4
5 3 A 5
6 1 A 1
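pd.factorize also numbers keys in order of first appearance, which is exactly what the expected output wants. If apply(tuple, axis=1) is slow on a large frame, building the tuples with zip is a common alternative (a sketch under the same assumptions):
# zip avoids the row-wise apply; factorize hashes the resulting tuples
df['c'] = pd.factorize(list(zip(df['a'], df['b'])))[0] + 1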

Replace all values in Dataframe with their count per column

import pandas as pd
data = ([1, 1], [2, 3], [4, 5], [1, 2], [3, 4], [3, 2])
df = pd.DataFrame(data)
print(df)
0 1
0 1 1
1 2 3
2 4 5
3 1 2
4 3 4
5 3 2
The desired output would be a count of all occurrences for each value in df. Desired output below:
0 1
0 2 1
1 1 1
2 1 1
3 2 2
4 2 1
5 2 2
There are two "1"s in the first column, one "2", one "4", and two "3"s, etc. for each column.
You can map each column (a Series s) to its value counts:
>>> df.apply(lambda s: s.map(s.value_counts()))
0 1
0 2 1
1 1 1
2 1 1
3 2 2
4 2 1
5 2 2
One way is using transform on each column, grouping each column by itself:
df.apply(lambda x: x.groupby(x).transform('count'))
Or using Counter with replace:
from collections import Counter
df.replace(df.apply(Counter))
0 1
0 2 1
1 1 1
2 1 1
3 2 2
4 2 1
5 2 2
Alternatively you can leverage numpy.unique:
import numpy as np
def count_values(s):
    # return_inverse maps each element to its position in the unique array,
    # so cnt[idx] broadcasts each value's count back onto the original rows
    _, idx, cnt = np.unique(s, return_inverse=True, return_counts=True)
    return cnt[idx]
df.apply(count_values)
0 1
0 2 1
1 1 1
2 1 1
3 2 2
4 2 1
5 2 2
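All four approaches should agree on this input. A quick sanity check with pandas' testing helper (a sketch; assert_frame_equal is strict about dtypes, so this assumes each method returns plain integer counts, as it does here):
expected = df.apply(lambda s: s.map(s.value_counts()))
# raises an AssertionError if any method disagrees with the first
pd.testing.assert_frame_equal(df.apply(lambda x: x.groupby(x).transform('count')), expected)
pd.testing.assert_frame_equal(df.apply(count_values), expected)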

Replicating rows on Pandas Dataframe based on a value column and then affixing a counter column

Suppose I have this dataframe df:
A B count
0 1 2 3
1 3 4 2
2 5 6 1
3 7 8 2
Then I want to do a row-replication operation depending on the count column, and then add a new column that acts as a counter. So the resulting outcome is:
counter A B count
0 0 1 2 3
1 1 1 2 3
2 2 1 2 3
3 0 3 4 2
4 1 3 4 2
5 0 5 6 1
6 0 7 8 2
7 1 7 8 2
My idea was to duplicate the rows accordingly (using numpy and a pandas DataFrame), then add a counter column that increments for every repeated row and resets to 0 at each new row. But I suspect this may be slow. Is there an easier, faster way to do it?
Let's try index.repeat to scale up the DataFrame, then groupby with cumcount to build the counter and insert it into the DataFrame at the front:
df = df.loc[df.index.repeat(df['count'])]
df.insert(0, 'counter', df.groupby(level=0).cumcount())
df = df.reset_index(drop=True)
df:
counter A B count
0 0 1 2 3
1 1 1 2 3
2 2 1 2 3
3 0 3 4 2
4 1 3 4 2
5 0 5 6 1
6 0 7 8 2
7 1 7 8 2
DataFrame constructor:
import pandas as pd
df = pd.DataFrame({
    'A': [1, 3, 5, 7], 'B': [2, 4, 6, 8], 'count': [3, 2, 1, 2]
})
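In case the ordering logic isn't obvious: index.repeat duplicates the original index labels, and groupby(level=0).cumcount counts within each duplicated label, which is why the counter must be built before reset_index discards those labels. An equivalent way to build the counter directly, sketched with numpy (assuming numpy is imported as np):
out = df.loc[df.index.repeat(df['count'])].reset_index(drop=True)
# for each original row emit 0..count-1; concatenation keeps the row order
out.insert(0, 'counter', np.concatenate([np.arange(n) for n in df['count']]))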

How to unnest a column in a Pandas DataFrame?

I have the following data frame, where one of the columns is an object (a list-type cell):
I don't want to use explode (I'm on an older version of pandas). How can I do the same for a dataframe with three columns?
df
A B C
0 1 [1, 2] 3
1 1 [1, 2] 4
2 2 [3, 4] 5
My expected output is:
A B C
0 1 1 3
1 1 2 3
2 1 1 4
3 1 2 4
4 2 3 5
5 2 4 5
I found these two methods useful, but how can I add the third column to this code?
df.set_index('A').B.apply(pd.Series).stack().reset_index(level=0).rename(columns={0:'B'})
or
df = pd.DataFrame({'A': df.A.repeat(df.B.str.len()), 'B': np.concatenate(df.B.values)})
You set the index to be all of the columns you want to keep tied to the list you explode:
(df.set_index(['A', 'C'])['B']
   .apply(pd.Series).stack()
   .reset_index()
   .drop(columns='level_2').rename(columns={0: 'B'}))
A C B
0 1 3 1
1 1 3 2
2 1 4 1
3 1 4 2
4 2 5 3
5 2 5 4
Or, for the second method, also repeat 'C':
pd.DataFrame({'A': df.A.repeat(df.B.str.len()),
              'C': df.C.repeat(df.B.str.len()),
              'B': np.concatenate(df.B.to_numpy())})
You can use itertools to reshape your data:
from itertools import product, chain

pd.DataFrame(chain.from_iterable(product([a], b, [c])
                                 for a, b, c in df.to_numpy()),
             columns=df.columns)
A B C
0 1 1 3
1 1 2 3
2 1 1 4
3 1 2 4
4 2 3 5
5 2 4 5
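For future readers who are not bound by the version constraint: pandas 0.25 added DataFrame.explode, which makes this a one-liner (the question rules it out, shown only for completeness):
# requires pandas >= 0.25
df.explode('B').reset_index(drop=True)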

Repeating rows of a dataframe based on a column value

I have a data frame like this:
df1 = pd.DataFrame({'a': [1, 2],
                    'b': [3, 4],
                    'c': [6, 5]})
df1
Out[150]:
a b c
0 1 3 6
1 2 4 5
Now I want to create a df that repeats each row based on the difference between columns b and c, plus 1. The difference between b and c for the first row is 6 - 3 = 3, so I want to repeat that row 3 + 1 = 4 times. Similarly, for the second row the difference is 5 - 4 = 1, so I want to repeat it 1 + 1 = 2 times. A column d is added that runs from b up to c for each row (so for the first row it goes from 3 to 6). So I want to get this df:
a b c d
0 1 3 6 3
0 1 3 6 4
0 1 3 6 5
0 1 3 6 6
1 2 4 5 4
1 2 4 5 5
Do it with reindex + repeat, then use groupby with cumcount to assign the new column d:
(df1.reindex(df1.index.repeat(df1.eval('c-b').add(1)))
    .assign(d=lambda x: x.c - x.groupby('a').cumcount(ascending=False)))
Out[572]:
a b c d
0 1 3 6 3
0 1 3 6 4
0 1 3 6 5
0 1 3 6 6
1 2 4 5 4
1 2 4 5 5
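An alternative that builds d directly instead of deriving it from cumcount, sketched with numpy (assuming numpy is imported as np):
n = df1['c'] - df1['b'] + 1
out = df1.loc[df1.index.repeat(n)]
# d runs from b to c inclusive within each original row
out = out.assign(d=np.concatenate([np.arange(b, c + 1) for b, c in zip(df1['b'], df1['c'])]))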
