How can I get the positions of a group? - python

I have a pandas DataFrame like this:
import pandas as pd

data = [[1, 'a'], [2, 'a'], [3, 'b'], [4, 'b'], [5, 'a'], [6, 'c']]
df1 = pd.DataFrame(data, columns=['Id', 'Group'])
Id Group
 1     a
 2     a
 3     b
 4     b
 5     a
 6     c
Without changing the order, I need to get the position of every Id within its Group.
Basically, I want the output below:
Id Group  position
 1     a         1
 2     a         2
 3     b         1
 4     b         2
 5     a         3
 6     c         1

Try transform + cumcount:
df1['position'] = df1.groupby('Group').transform('cumcount') + 1
   Id Group  position
0   1     a         1
1   2     a         2
2   3     b         1
3   4     b         2
4   5     a         3
5   6     c         1

You can do it simply with .cumcount:
df1['position'] = df1.groupby('Group').cumcount() + 1
GroupBy.cumcount numbers the items in each group from 0 to the length of that group minus 1. It is NOT an aggregating function that produces a condensed result, so there is no need for .transform() to propagate an aggregated result back to each row of the group.
Result:
print(df1)
   Id Group  position
0   1     a         1
1   2     a         2
2   3     b         1
3   4     b         2
4   5     a         3
5   6     c         1
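To see why .transform() is unnecessary here: cumcount() already returns a Series aligned with df1's index, numbering the rows of each group from 0, so it can be assigned to a column directly:
print(df1.groupby('Group').cumcount())
0    0
1    1
2    0
3    1
4    2
5    0
dtype: int64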

Pandas cumulative count on new value

I have a DataFrame like the one below:
df = pd.DataFrame()
df['col_1'] = [1, 1, 1, 2, 2, 2, 3, 3, 3]
df['col_2'] = ['A', 'B', 'B', 'A', 'B', 'C', 'A', 'A', 'B']
df
   col_1 col_2
0      1     A
1      1     B
2      1     B
3      2     A
4      2     B
5      2     C
6      3     A
7      3     A
8      3     B
I need to group by col_1 and, within each group, update a cumulative count whenever a new value appears in col_2. Something like the DataFrame below.
   col_1 col_2  col_3
0      1     A      1
1      1     B      2
2      1     B      2
3      2     A      1
4      2     B      2
5      2     C      3
6      3     A      1
7      3     A      1
8      3     B      2
I could do this using lists and a dictionary, but I couldn't find a way using pandas built-in functions.
Use factorize with a lambda function in GroupBy.transform:
df['col_3'] = df.groupby('col_1')['col_2'].transform(lambda x: pd.factorize(x)[0]+1)
print (df)
   col_1 col_2  col_3
0      1     A      1
1      1     B      2
2      1     B      2
3      2     A      1
4      2     B      2
5      2     C      3
6      3     A      1
7      3     A      1
8      3     B      2
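To see what factorize contributes here: it encodes values by order of first appearance, returning integer codes plus the unique values, so adding 1 makes the codes 1-based. A minimal illustration on a plain list:
codes, uniques = pd.factorize(['A', 'B', 'B'])
print(codes)    # [0 1 1] -- first-appearance code of each element
print(uniques)  # ['A' 'B']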

use meshgrid for rows with common values in column

My DataFrames:
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.array([[1, 2, 3], [4, 2, 3], [7, 8, 8]]), columns=['a', 'b', 'c'])
df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 2, 3], [5, 8, 8]]), columns=['a', 'b', 'c'])
df1:
   a  b  c
0  1  2  3
1  4  2  3
2  7  8  8
df2:
   a  b  c
0  1  2  3
1  4  2  3
2  5  8  8
I want to combine the values of column a from both DataFrames in every combination, but only for rows where the values in columns b and c are equal.
Right now I only have a solution for the full cross product, with this code:
x = np.array(np.meshgrid(df1.a.values,
                         df2.a.values)).T.reshape(-1, 2)
df = pd.DataFrame(x)
print(df)
   0  1
0  1  1
1  1  4
2  1  5
3  4  1
4  4  4
5  4  5
6  7  1
7  7  4
8  7  5
Expected output for df1.a and df2.a, only for rows where df1.b == df2.b and df1.c == df2.c:
   0  1
0  1  1
1  1  4
2  4  1
3  4  4
4  7  5
So basically I need to match rows on the common values in the selected columns b and c.
You should try DataFrame.merge, which does an inner join by default:
df1.merge(df2, on=['b', 'c'])[['a_x', 'a_y']]
   a_x  a_y
0    1    1
1    1    4
2    4    1
3    4    4
4    7    5
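If you want the 0/1 column labels from the meshgrid version, a small optional step is to rename the merged columns (merge appends the default _x/_y suffixes to the overlapping column a):
out = df1.merge(df2, on=['b', 'c'])[['a_x', 'a_y']]
out.columns = [0, 1]  # match the labels of the meshgrid-based output
print(out)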

pandas aggregate data while carrying a column unchanged

I have a data frame, a:
a=pd.DataFrame({'ID': [1,1,2,2,3,4], 'B': [1,5,3,2,4,1], 'C': [1,4,3,6,1,1]})
   ID  B  C
0   1  1  1
1   1  5  4
2   2  3  3
3   2  2  6
4   3  4  1
5   4  1  1
And I want to aggregate it so that the resulting new DataFrame is grouped by ID and contains the row corresponding to the minimum of B (so apply min() to B and carry C along unchanged).
So the resulting data frame should be:
   ID  B  C
0   1  1  1
1   2  2  6
2   3  4  1
3   4  1  1
How can I do this programmatically using pandas.groupby(), or is there another way to do it?
You can use groupby and transform to build a boolean mask that filters the rows:
a.loc[a['B'] == a.groupby('ID').B.transform('min')]
   ID  B  C
0   1  1  1
3   2  2  6
4   3  4  1
5   4  1  1
Try sorting before your groupby, then taking first:
a.sort_values('B').groupby('ID',as_index=False).first()
   ID  B  C
0   1  1  1
1   2  2  6
2   3  4  1
3   4  1  1
Or, probably a faster way to do it is to sort by ID and B and then drop duplicate IDs, keeping the first (which is the default behavior of drop_duplicates):
a.sort_values(['ID','B']).drop_duplicates('ID')
   ID  B  C
0   1  1  1
3   2  2  6
4   3  4  1
5   4  1  1
(drop_duplicates keeps the original index; chain .reset_index(drop=True) if you want a fresh 0..n one)
When there is sorting involved, and the grouping doesn't involve any calculations, I prefer to work on the underlying numpy arrays for performance.
Using argsort and numpy.unique:
import numpy as np

arr = a.values
out = arr[np.argsort(arr[:, 1])]                   # sort the rows by column B
_, idx = np.unique(out[:, 0], return_index=True)   # first occurrence of each ID
out[idx]
array([[1, 1, 1],
       [2, 2, 6],
       [3, 4, 1],
       [4, 1, 1]], dtype=int64)
To reassign the values to your DataFrame:
pd.DataFrame(out[idx], columns=a.columns)
   ID  B  C
0   1  1  1
1   2  2  6
2   3  4  1
3   4  1  1
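Another idiom worth knowing, sketched here as an alternative to the answers above: GroupBy.idxmin returns the index label of each group's minimum, and .loc can then pull out the full rows (like drop_duplicates, it keeps the first row on ties):
a.loc[a.groupby('ID')['B'].idxmin()]
   ID  B  C
0   1  1  1
3   2  2  6
4   3  4  1
5   4  1  1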

Get values and column names

I have a pandas data frame that looks something like this:
data = {'1' : [0, 2, 0, 0], '2' : [5, 0, 0, 2], '3' : [2, 0, 0, 0], '4' : [0, 7, 0, 0]}
df = pd.DataFrame(data, index = ['a', 'b', 'c', 'd'])
df
   1  2  3  4
a  0  5  2  0
b  2  0  0  7
c  0  0  0  0
d  0  2  0  0
I know I can get the maximum value and the corresponding column name for each row by doing (respectively):
df.max(1)
df.idxmax(1)
How can I get the values and the column name for every cell that is not zero?
So in this case, I'd want 2 tables, one giving me each value != 0 for each row:
a 5
a 2
b 2
b 7
d 2
And one giving me the column names for those values:
a 2
a 3
b 1
b 4
d 2
Thanks!
You can use stack to get a Series, filter it by boolean indexing, then rename_axis and reset_index; finally, either drop the unwanted column or select the columns you need as a subset:
s = df.stack()
df1 = s[s != 0].rename_axis(['a', 'b']).reset_index(name='c')
print (df1)
   a  b  c
0  a  2  5
1  a  3  2
2  b  1  2
3  b  4  7
4  d  2  2
df2 = df1.drop('b', axis=1)
print (df2)
   a  c
0  a  5
1  a  2
2  b  2
3  b  7
4  d  2
df3 = df1.drop('c', axis=1)
print (df3)
   a  b
0  a  2
1  a  3
2  b  1
3  b  4
4  d  2
df3 = df1[['a','c']]
print (df3)
   a  c
0  a  5
1  a  2
2  b  2
3  b  7
4  d  2
df3 = df1[['a','b']]
print (df3)
   a  b
0  a  2
1  a  3
2  b  1
3  b  4
4  d  2
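An equivalent one-liner, sketched here as an alternative: DataFrame.where replaces the zeros with NaN, and stack then drops them by default (the surviving values come back as floats because of the NaN):
s = df.where(df != 0).stack()
print (s)
a  2    5.0
   3    2.0
b  1    2.0
   4    7.0
d  2    2.0
dtype: float64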

Populating a Pandas DataFrame from another DataFrame based on column names

I have a DataFrame of the following form:
   a  b  c
0  1  4  6
1  3  2  4
2  4  1  5
And I have a list of column names that I need to use to create a new DataFrame using the columns of the first DataFrame that correspond to each label. For example, if my list of columns is ['a', 'b', 'b', 'a', 'c'], the resulting DataFrame should be:
   a  b  b  a  c
0  1  4  4  1  6
1  3  2  2  3  4
2  4  1  1  4  5
I've been trying to figure out a fast way of performing this operation because I'm dealing with extremely large DataFrames, and I don't think looping is a reasonable option.
You can just use the list to select them:
In [44]:
cols = ['a', 'b', 'b', 'a', 'c']
df[cols]
Out[44]:
   a  b  b  a  c
0  1  4  4  1  6
1  3  2  2  3  4
2  4  1  1  4  5

[3 rows x 5 columns]
So there is no need for a loop: once you have created your DataFrame df, indexing it with a list of column names selects those columns (including repeats) and builds the DataFrame you want.
You can do that directly:
>>> df
   a  b  c
0  1  4  6
1  3  2  4
2  4  1  5
>>> column_names
['a', 'b', 'b', 'a', 'c']
>>> df[column_names]
   a  b  b  a  c
0  1  4  4  1  6
1  3  2  2  3  4
2  4  1  1  4  5

[3 rows x 5 columns]
From pandas 0.17 onwards you can also use reindex:
In [795]: cols = ['a', 'b', 'b', 'a', 'c']

In [796]: df.reindex(columns=cols)
Out[796]:
   a  b  b  a  c
0  1  4  4  1  6
1  3  2  2  3  4
2  4  1  1  4  5
Note: Ideally, you don't want to have duplicate column names.
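One practical difference between the two, worth knowing when the column list is generated dynamically: in recent pandas versions, plain df[cols] raises a KeyError if any label is missing, while reindex fills missing columns with NaN:
cols = ['a', 'b', 'b', 'a', 'x']  # 'x' is not a column of df
df.reindex(columns=cols)          # column 'x' comes back as all NaN
# df[cols]                        # would raise a KeyError for 'x'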
