Pandas groupby and pivot - python

I have the following pandas data frame
id category counts_mean
0 8 a 23
1 8 b 22
2 8 c 23
3 8 d 30
4 9 a 40
5 9 b 22
6 9 c 11
7 9 d 10
....
And I want to group by the id and transpose the category columns to get something like this:
id a b c d
0 8 23 22 23 30
1 9 40 22 11 10
I tried different things with groupby and pivot, but I'm not sure what should be the aggregation argument for the groupby...

Instead of combining groupby and pivot, you only need the pivot function: set its parameters (index, columns, values) to reshape your DataFrame.
import pandas as pd

# Create the DataFrame
data = {
    'id': [8, 8, 8, 8, 9, 9, 9, 9],
    'category': ['a', 'b', 'c', 'd', 'a', 'b', 'c', 'd'],
    'counts_mean': [23, 22, 23, 30, 40, 22, 11, 10]
}
df = pd.DataFrame(data)
# Use pivot to reshape the DataFrame
df_reshaped = df.pivot(index='id', columns='category', values='counts_mean')
print(df_reshaped)
output:
category a b c d
id
8 23 22 23 30
9 40 22 11 10
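If you also want id back as an ordinary column with a flat 0..n index, exactly as in the expected output, you can follow the pivot with reset_index. A small sketch of that follow-up step:

```python
import pandas as pd

data = {
    'id': [8, 8, 8, 8, 9, 9, 9, 9],
    'category': ['a', 'b', 'c', 'd', 'a', 'b', 'c', 'd'],
    'counts_mean': [23, 22, 23, 30, 40, 22, 11, 10]
}
df = pd.DataFrame(data)
df_reshaped = df.pivot(index='id', columns='category', values='counts_mean')

# move 'id' from the index back into a regular column
out = df_reshaped.reset_index()
# drop the leftover 'category' label on the column axis
out.columns.name = None
```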

Related

For every duplicated value in the Id column, how can I append the string 'duplicated' to that value?

I have created a dataframe:
df = pd.DataFrame({'Weather': [32,45,12,18,19,27,39,11,22,42],
                   'Id': [1,2,3,4,5,1,6,7,8,2]})
df.head()
You can see that the Ids at index 5 and 9 are duplicates, so I want to append the string --duplicated to the Id at those two indices.
df.loc[df['Id'].duplicated()]
Output
Weather Id
5 27 1
9 42 2
Expected output
Weather Id
5 27 1--duplicated
9 42 2--duplicated
Do you want an aggregated DataFrame, modifying your previous output with assign?
(df.loc[df['Id'].duplicated()]
.assign(Id=lambda d: d['Id'].astype(str).add('--duplicated'))
)
output:
Weather Id
5 27 1--duplicated
9 42 2--duplicated
Or, in place modification of the original DataFrame with boolean indexing?
m = df['Id'].duplicated()
df.loc[m, 'Id'] = df.loc[m, 'Id'].astype(str)+'--duplicated'
Output:
Weather Id
0 32 1
1 45 2
2 12 3
3 18 4
4 19 5
5 27 1--duplicated
6 39 6
7 11 7
8 22 8
9 42 2--duplicated
If you need to add the suffix only to the filtered rows, use DataFrame.loc with a mask:
m = df['Id'].duplicated()
df.loc[m, 'Id'] = df.loc[m, 'Id'].astype(str).add('--duplicated')
print(df)
Weather Id
0 32 1
1 45 2
2 12 3
3 18 4
4 19 5
5 27 1--duplicated
6 39 6
7 11 7
8 22 8
9 42 2--duplicated
Or use boolean indexing and then add suffix:
df1 = df[df['Id'].duplicated()].copy()
df1['Id'] = df1['Id'].astype(str) + '--duplicated'
print(df1)
Weather Id
5 27 1--duplicated
9 42 2--duplicated
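Note that Series.duplicated defaults to keep='first', so the answers above tag only the later occurrences (indices 5 and 9). If you instead wanted to flag every row whose Id appears more than once, keep=False does that. A sketch of that variant:

```python
import pandas as pd

df = pd.DataFrame({'Weather': [32, 45, 12, 18, 19, 27, 39, 11, 22, 42],
                   'Id': [1, 2, 3, 4, 5, 1, 6, 7, 8, 2]})

# keep=False marks *all* occurrences of repeated Ids, first ones included
m = df['Id'].duplicated(keep=False)
df.loc[m, 'Id'] = df.loc[m, 'Id'].astype(str) + '--duplicated'
```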

Sum column values in a dataframe if values in another column are next to each other

Hello I have a dataframe:
import pandas as pd
df1 = {'name': ["x","x","x","x","x","x","x","y","y","y","y","y","y","y"],
'a': [3,4,5,11,14,15,16,2,3,4,10,13,14,15],
'b': [9,8,7,12,23,22,21,8,7,6,11,22,21,20],
'val': [2,1,3,4,5,6,3,21,11,31,41,51,61,31]
}
df1 = pd.DataFrame(df1, columns=['name', 'a', 'b', 'val'])
I wish to sum the numbers in the 'val' column if the numbers in the 'a' column are next to one another. E.g. in 'a' you have 3,4,5 (all next to each other) so add together their associated numbers in the 'val' column (i.e. 2+1+3) and then create a new column where the added value is present. The harder bit for me is grouping these by 'name'.
I don't know how well I've explained this, but here is the dataframe i wish to end up with
df2 = {'name': ["x","x","x","x","x","x","x","y","y","y","y","y","y","y"],
'a': [3,4,5,11,14,15,16,2,3,4,10,13,14,15],
'b': [9,8,7,12,23,22,21,8,7,6,11,22,21,20],
'val': [2,1,3,4,5,6,3,21,11,31,41,51,61,31],
'sum_val': [6,6,6,4,14,14,14,63,63,63,41,143,143,143]
}
df2 = pd.DataFrame(df2, columns=['name', 'a', 'b', 'val', 'sum_val'])
Create group identifiers by checking, per name group, where the difference between consecutive a values is not equal to 1 and taking the cumulative sum in a lambda function, then pass the resulting Series to GroupBy.transform with 'sum':
g = df1.groupby('name')['a'].apply(lambda x: x.diff().ne(1).cumsum())
df1['sum_val'] = df1.groupby([g, 'name'])['val'].transform('sum')
print(df1)
name a b val sum_val
0 x 3 9 2 6
1 x 4 8 1 6
2 x 5 7 3 6
3 x 11 12 4 4
4 x 14 23 5 14
5 x 15 22 6 14
6 x 16 21 3 14
7 y 2 8 21 63
8 y 3 7 11 63
9 y 4 6 31 63
10 y 10 11 41 41
11 y 13 22 51 143
12 y 14 21 61 143
13 y 15 20 31 143
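As a variant that avoids groupby.apply (whose index handling has changed across pandas versions), you can build the run identifier with plain Series operations: a new run starts wherever the step in a is not exactly 1 or the name changes. A sketch:

```python
import pandas as pd

df1 = pd.DataFrame({'name': ["x"] * 7 + ["y"] * 7,
                    'a': [3, 4, 5, 11, 14, 15, 16, 2, 3, 4, 10, 13, 14, 15],
                    'b': [9, 8, 7, 12, 23, 22, 21, 8, 7, 6, 11, 22, 21, 20],
                    'val': [2, 1, 3, 4, 5, 6, 3, 21, 11, 31, 41, 51, 61, 31]})

# a run breaks when the step in 'a' is not 1 or the name changes
start = df1['a'].diff().ne(1) | df1['name'].ne(df1['name'].shift())
g = start.cumsum()  # run id, constant within each consecutive run

df1['sum_val'] = df1.groupby(g)['val'].transform('sum')
```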

Python Pandas - Save tuple returned from function into 2 separate columns

I am using pandas to run a function on each row of a dataframe and then save the result into a new column. The problem I am having is my function returns a tuple. The function returns for example...
(2345,4837)
And I am saving this as a new column by doing...
myDataFrame['col5'] = myDataFrame.apply(muFunction, axis=1)
This works, but how do I split the returned tuple into 2 columns? Something like...
myDataFrame['col5'] = myDataFrame.apply(muFunction, axis=1)
myDataFrame['col6'] = myDataFrame.apply(muFunction, axis=1)
...but with the first part of the tuple in col5 and the second in col6. Does anyone have an example?
Assume that the source DataFrame contains:
A B C
0 2 4 6
1 4 8 12
2 5 10 15
3 8 16 24
4 9 18 27
The function to apply to it, returning a 2-tuple, is:
def myFun(row):
    return row.C + 2, row.C * 2
To apply it and save its result in 2 new columns, you can run:
df[['X', 'Y']] = df.apply(myFun, axis=1).apply(pd.Series)
The result is:
A B C X Y
0 2 4 6 8 12
1 4 8 12 14 24
2 5 10 15 17 30
3 8 16 24 26 48
4 9 18 27 29 54
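The .apply(pd.Series) step builds one Series per row, which gets slow on large frames. DataFrame.apply also accepts result_type='expand', which expands the returned tuples into columns directly. A sketch of that alternative, using the same sample data:

```python
import pandas as pd

df = pd.DataFrame({'A': [2, 4, 5, 8, 9],
                   'B': [4, 8, 10, 16, 18],
                   'C': [6, 12, 15, 24, 27]})

def myFun(row):
    return row.C + 2, row.C * 2

# result_type='expand' turns the 2-tuples into a 2-column DataFrame
df[['X', 'Y']] = df.apply(myFun, axis=1, result_type='expand')
```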

How to subset pandas dataframe columns with idxmax output?

I have a pandas dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,40,size=(10,4)), columns=range(4), index = range(10))
df.head()
0 1 2 3
0 27 10 13 21
1 25 12 23 8
2 2 24 24 34
3 10 11 11 10
4 0 15 0 27
I'm using the idxmax function to get the columns that contain the maximum value.
df_max = df.idxmax(1)
df_max.head()
0 0
1 0
2 3
3 1
4 3
How can I use df_max along with df, to create a time-series of values corresponding to the maximum value in each row of df? This is the output I want:
0 27
1 25
2 34
3 11
4 27
5 37
6 35
7 32
8 20
9 38
I know I can achieve this using df.max(1), but I want to know how to arrive at this same output by using df_max, since I want to be able to apply df_max to other matrices (not df) which share the same columns and indices as df (but not the same values).
You may try df.lookup (note: DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0):
df.lookup(df_max.index, df_max)
Out[628]: array([27, 25, 34, 11, 27], dtype=int64)
If you want Series/DataFrame, you pass the output to the Series/DataFrame constructor
pd.Series(df.lookup(df_max.index, df_max), index=df_max.index)
Out[630]:
0 27
1 25
2 34
3 11
4 27
dtype: int64
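Since DataFrame.lookup no longer exists on current pandas, the same row-wise lookup can be done with NumPy integer indexing. A sketch, using a fixed df matching the head shown above instead of random data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[27, 10, 13, 21],
                   [25, 12, 23, 8],
                   [2, 24, 24, 34],
                   [10, 11, 11, 10],
                   [0, 15, 0, 27]], columns=range(4))
df_max = df.idxmax(1)

# pick, for each row, the value in the column named by df_max
vals = df.to_numpy()[np.arange(len(df)), df.columns.get_indexer(df_max)]
result = pd.Series(vals, index=df.index)
```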

Crosstab to fill with data from another column

I can't manage to populate a crosstab with data from another column; maybe crosstab is not the right solution...
Initial dataframe:
   id  id_m  X
0  10    10  a
1  10    11  b
2  10    12  c
3  11    10  d
4  11    11  e
5  11    12  f
6  12    10  g
7  12    11  h
8  12    12  i

Expected final result:
id_m 10 11 12
id
10    a  b  c
11    d  e  f
12    g  h  i
my code to help you:
import pandas as pd
df = pd.DataFrame({'id': [10, 11, 12]})
df_m = pd.merge(df.assign(key=0), df.assign(key=0), suffixes=('', '_m'), on='key').drop('key', axis=1)
# just a sample to populate the column
df_m['X'] = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']
If your original df is this
id id_m X
0 10 10 a
1 10 11 b
2 10 12 c
3 11 10 d
4 11 11 e
5 11 12 f
6 12 10 g
7 12 11 h
8 12 12 i
And all you want is this
id_m 10 11 12
id
10 a b c
11 d e f
12 g h i
You can groupby the id and id_m columns, take the max of the X column, then unstack the id_m column like this.
df.groupby(['id', 'id_m']).X.max().unstack()
If you really want to use pivot_table you can do this too
df.pivot_table(index='id', columns='id_m', values='X', aggfunc='max')
Same results.
Lastly, since your rows are unique with respect to the index and columns, you can use just pivot; pass values='X' so the result does not get an extra column level:
df.pivot(index='id', columns='id_m', values='X')
References
groupby
pivot_table
pivot
Yours is a bit trickier since you have text as values: pivot_table's default aggfunc ('mean') fails on strings, so you have to tell pandas the aggfunc explicitly, for example with a lambda function:
df_final = pd.pivot_table(df_m, index='id', columns='id_m', values='X', aggfunc=lambda x: ' '.join(x))
id_m 10 11 12
id
10 a b c
11 d e f
12 g h i
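Since the question asks about crosstab specifically: pd.crosstab also accepts values and aggfunc, and because each (id, id_m) pair occurs exactly once here, aggfunc='first' simply fills the table from X. A sketch built on the df_m from the question:

```python
import pandas as pd

df = pd.DataFrame({'id': [10, 11, 12]})
df_m = pd.merge(df.assign(key=0), df.assign(key=0),
                suffixes=('', '_m'), on='key').drop('key', axis=1)
df_m['X'] = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']

# each (id, id_m) pair is unique, so 'first' just picks that single value
tab = pd.crosstab(df_m['id'], df_m['id_m'], values=df_m['X'], aggfunc='first')
```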
