I have created a DataFrame:
import pandas as pd
df = pd.DataFrame({'Weather': [32, 45, 12, 18, 19, 27, 39, 11, 22, 42],
                   'Id': [1, 2, 3, 4, 5, 1, 6, 7, 8, 2]})
df.head()
The Id values at index 5 and 9 are duplicates of earlier rows, so I want to append the string --duplicated to the Id at those two positions.
df.loc[df['Id'].duplicated()]
Output
Weather Id
5 27 1
9 42 2
Expected output
Weather Id
5 27 1--duplicated
9 42 2--duplicated
Do you want a filtered DataFrame that contains only the duplicated rows, modifying your previous output with assign?
(df.loc[df['Id'].duplicated()]
.assign(Id=lambda d: d['Id'].astype(str).add('--duplicated'))
)
output:
Weather Id
5 27 1--duplicated
9 42 2--duplicated
Or, in place modification of the original DataFrame with boolean indexing?
m = df['Id'].duplicated()
df.loc[m, 'Id'] = df.loc[m, 'Id'].astype(str)+'--duplicated'
Output:
Weather Id
0 32 1
1 45 2
2 12 3
3 18 4
4 19 5
5 27 1--duplicated
6 39 6
7 11 7
8 22 8
9 42 2--duplicated
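Recent pandas versions may warn when strings are assigned into an integer column, which is what the in-place version does. A minimal sketch of a variation that sidesteps this, assuming it is acceptable for every Id to become a string (not only the duplicated ones): convert the column first, then use Series.mask to swap in the suffixed values where the mask is True.

```python
import pandas as pd

df = pd.DataFrame({'Weather': [32, 45, 12, 18, 19, 27, 39, 11, 22, 42],
                   'Id': [1, 2, 3, 4, 5, 1, 6, 7, 8, 2]})

# True for every repeat of an Id that already appeared earlier
m = df['Id'].duplicated()

# Convert the whole column to strings so the dtype stays consistent
ids = df['Id'].astype(str)

# mask() replaces values where the condition is True
df['Id'] = ids.mask(m, ids + '--duplicated')
print(df)
```

Unlike the in-place assignment, this leaves the column holding only strings, which is usually easier to work with afterwards.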
If you need to add the suffix only to the filtered rows, use DataFrame.loc with a boolean mask:
m = df['Id'].duplicated()
df.loc[m,'Id' ] = df.loc[m,'Id' ].astype(str).add('--duplicated')
print (df)
Weather Id
0 32 1
1 45 2
2 12 3
3 18 4
4 19 5
5 27 1--duplicated
6 39 6
7 11 7
8 22 8
9 42 2--duplicated
Or use boolean indexing and then add suffix:
df1 = df[df['Id'].duplicated()].copy()
df1['Id'] = df1['Id'].astype(str) + '--duplicated'
print (df1)
Weather Id
5 27 1--duplicated
9 42 2--duplicated
Hello, I have a DataFrame:
import pandas as pd
df1 = {'name': ["x","x","x","x","x","x","x","y","y","y","y","y","y","y"],
'a': [3,4,5,11,14,15,16,2,3,4,10,13,14,15],
'b': [9,8,7,12,23,22,21,8,7,6,11,22,21,20],
'val': [2,1,3,4,5,6,3,21,11,31,41,51,61,31]
}
df1 = pd.DataFrame (df1, columns = ['name','a','b','val'])
I wish to sum the numbers in the 'val' column if the numbers in the 'a' column are next to one another. E.g. in 'a' you have 3,4,5 (all next to each other) so add together their associated numbers in the 'val' column (i.e. 2+1+3) and then create a new column where the added value is present. The harder bit for me is grouping these by 'name'.
I don't know how well I've explained this, but here is the DataFrame I wish to end up with:
df2 = {'name': ["x","x","x","x","x","x","x","y","y","y","y","y","y","y"],
'a': [3,4,5,11,14,15,16,2,3,4,10,13,14,15],
'b': [9,8,7,12,23,22,21,8,7,6,11,22,21,20],
'val': [2,1,3,4,5,6,3,21,11,31,41,51,61,31],
'sum_val': [6,6,6,4,14,14,14,63,63,63,41,143,143,143]
}
df2 = pd.DataFrame (df2, columns = ['name','a','b','val','sum_val'])
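Create group labels for the consecutive runs: within each name, take the difference of 'a', flag the rows where it is not equal to 1 (the start of a new run), and take the cumulative sum of those flags. Then pass the resulting Series together with 'name' to groupby and use GroupBy.transform with sum. Using transform rather than apply for the first step keeps the original index, which matters in newer pandas versions where apply adds the group keys to the index:
g = df1.groupby('name')['a'].transform(lambda x: x.diff().ne(1).cumsum())
df1['sum_val'] = df1.groupby([g, 'name'])['val'].transform('sum')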
print (df1)
name a b val sum_val
0 x 3 9 2 6
1 x 4 8 1 6
2 x 5 7 3 6
3 x 11 12 4 4
4 x 14 23 5 14
5 x 15 22 6 14
6 x 16 21 3 14
7 y 2 8 21 63
8 y 3 7 11 63
9 y 4 6 31 63
10 y 10 11 41 41
11 y 13 22 51 143
12 y 14 21 61 143
13 y 15 20 31 143
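To make the grouping logic easier to follow, here is a sketch that spells out the intermediate steps on a shortened, made-up version of the data. The names step, new_run and run_id are only illustrative and are not part of the answer above.

```python
import pandas as pd

# Shortened, hypothetical sample with the same structure as df1
df1 = pd.DataFrame({'name': ['x', 'x', 'x', 'x', 'y', 'y'],
                    'a':    [3, 4, 5, 11, 2, 3],
                    'val':  [2, 1, 3, 4, 21, 11]})

# Difference to the previous 'a' within each name (NaN on the first row of a name)
step = df1.groupby('name')['a'].diff()

# A new run starts wherever the step is not exactly 1
new_run = step.ne(1)

# Running count of run starts within each name gives a run label
run_id = new_run.astype(int).groupby(df1['name']).cumsum()

# Sum 'val' over each (name, run) pair and broadcast it back to the rows
df1['sum_val'] = df1.groupby(['name', run_id])['val'].transform('sum')
print(df1)
```

Here the rows with a = 3, 4, 5 for name x share one run label, the row with a = 11 gets its own, and the labels restart for name y.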
I am using pandas to run a function on each row of a dataframe and then save the result into a new column. The problem I am having is my function returns a tuple. The function returns for example...
(2345,4837)
And I am saving this as a new column by doing...
myDataFrame['col5'] = myDataFrame.apply(muFunction, axis=1)
This works, but how do I split the returned tuple into two columns? Something like...
myDataFrame['col5'] = myDataFrame.apply(muFunction, axis=1)
myDataFrame['col6'] = myDataFrame.apply(muFunction, axis=1)
but with the first element of the tuple in col5 and the second in col6. Does anyone have an example?
Assume that the source DataFrame contains:
A B C
0 2 4 6
1 4 8 12
2 5 10 15
3 8 16 24
4 9 18 27
The function to apply to it, returning a 2-tuple, is:
def myFun(row):
    return row.C + 2, row.C * 2
To apply it and save its result in 2 new columns, you can run:
df[['X', 'Y']] = df.apply(myFun, axis=1).apply(pd.Series)
The result is:
A B C X Y
0 2 4 6 8 12
1 4 8 12 14 24
2 5 10 15 17 30
3 8 16 24 26 48
4 9 18 27 29 54
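As an alternative sketch, DataFrame.apply also accepts result_type='expand', which turns a returned tuple into columns directly and avoids the second apply(pd.Series) pass. The sample data below is rebuilt from the table above so the snippet runs on its own.

```python
import pandas as pd

df = pd.DataFrame({'A': [2, 4, 5, 8, 9],
                   'B': [4, 8, 10, 16, 18],
                   'C': [6, 12, 15, 24, 27]})

def myFun(row):
    # Returns a 2-tuple computed from column C
    return row.C + 2, row.C * 2

# result_type='expand' spreads each tuple over separate columns
df[['X', 'Y']] = df.apply(myFun, axis=1, result_type='expand')
print(df)
```

Both variants still call the function once per row, so for large frames a vectorized expression on the columns is usually faster when the function allows it.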
I have a pandas dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,40,size=(10,4)), columns=range(4), index = range(10))
df.head()
0 1 2 3
0 27 10 13 21
1 25 12 23 8
2 2 24 24 34
3 10 11 11 10
4 0 15 0 27
I'm using the idxmax function to get the columns that contain the maximum value.
df_max = df.idxmax(1)
df_max.head()
0 0
1 0
2 3
3 1
4 3
How can I use df_max along with df, to create a time-series of values corresponding to the maximum value in each row of df? This is the output I want:
0 27
1 25
2 34
3 11
4 27
5 37
6 35
7 32
8 20
9 38
I know I can achieve this using df.max(1), but I want to know how to arrive at this same output by using df_max, since I want to be able to apply df_max to other matrices (not df) which share the same columns and indices as df (but not the same values).
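You may try df.lookup. Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in pandas 2.0, so this only works on older versions: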
df.lookup(df_max.index, df_max)
Out[628]: array([27, 25, 34, 11, 27], dtype=int64)
If you want a Series or DataFrame, pass the output to the corresponding constructor:
pd.Series(df.lookup(df_max.index, df_max), index=df_max.index)
Out[630]:
0 27
1 25
2 34
3 11
4 27
dtype: int64
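Because DataFrame.lookup is no longer available in pandas 2.0 and later, here is a sketch of the same idea using positional NumPy indexing. The frame named other is a made-up stand-in for "another matrix with the same columns and index", and the five rows are copied from the df.head() output in the question.

```python
import pandas as pd

# First five rows from the question, used as fixed sample data
df = pd.DataFrame([[27, 10, 13, 21],
                   [25, 12, 23, 8],
                   [2, 24, 24, 34],
                   [10, 11, 11, 10],
                   [0, 15, 0, 27]], columns=range(4))
df_max = df.idxmax(axis=1)

# Hypothetical second frame sharing df's index and columns
other = df * 2

# Translate row and column labels into integer positions
rows = other.index.get_indexer(df_max.index)
cols = other.columns.get_indexer(df_max)

# Pick one value per row by position and wrap it in a Series
values = pd.Series(other.to_numpy()[rows, cols], index=df_max.index)
print(values)
```

Going through get_indexer keeps the lookup label-based, so it still works if the index or columns are not a plain 0..n range.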
I can't manage to populate a crosstab with data from another column; maybe a crosstab is not the right solution...
initial dataframe expected result
id id_m X
0 10 10 a
1 10 11 b id_m 10 11 12
2 10 12 c id
3 11 10 d -> 10 a b c
4 11 11 e 11 d e f
5 11 12 f 12 g h i
6 12 10 g
7 12 11 h
8 12 12 i
My code, to help you reproduce it:
import pandas as pd
df= pd.DataFrame({'id': [10, 11,12]})
df_m = pd.merge(df.assign(key=0), df.assign(key=0), suffixes=('', '_m'), on='key').drop('key', axis=1)
# just a sample to populate the column
df_m['X'] =['a','b' ,'c','d', 'e','f','g' ,'h', 'i']
If your original df is this
id id_m X
0 10 10 a
1 10 11 b
2 10 12 c
3 11 10 d
4 11 11 e
5 11 12 f
6 12 10 g
7 12 11 h
8 12 12 i
And all you want is this
id_m 10 11 12
id
10 a b c
11 d e f
12 g h i
You can groupby the id and id_m columns, take the max of the X column, then unstack the id_m column like this.
df.groupby([
'id',
'id_m'
]).X.max().unstack()
If you really want to use pivot_table you can do this too
df.pivot_table(index='id', columns='id_m', values='X', aggfunc='max')
Same results.
Lastly, you can use just pivot, since each (id, id_m) pair occurs only once. Pass values='X' so the result has plain id_m columns instead of a column MultiIndex:
df.pivot(index='id', columns='id_m', values='X')
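For reference, a self-contained sketch of the pivot variant, with the sample frame typed out by hand from the table above:

```python
import pandas as pd

df = pd.DataFrame({'id':   [10, 10, 10, 11, 11, 11, 12, 12, 12],
                   'id_m': [10, 11, 12, 10, 11, 12, 10, 11, 12],
                   'X':    list('abcdefghi')})

# Rows become id, columns become id_m, cells are taken from X
wide = df.pivot(index='id', columns='id_m', values='X')
print(wide)
```

No aggregation function is needed here because pivot only reshapes; it does not combine values.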
References
groupby
pivot_table
pivot
Your case is a bit trickier because the values are text, so you have to tell pandas explicitly which aggfunc to use. A lambda function works for that, for example:
df_final = pd.pivot_table(df_m, index='id', columns='id_m', values='X', aggfunc=lambda x: ' '.join(x) )
id_m 10 11 12
id
10 a b c
11 d e f
12 g h i
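The aggfunc only makes a visible difference when an (id, id_m) pair occurs more than once. A small sketch with made-up data containing such a repeat, where plain pivot would raise a ValueError about duplicate entries but the join-based aggfunc concatenates the texts:

```python
import pandas as pd

# Hypothetical data: the pair (10, 10) appears twice
df_dup = pd.DataFrame({'id':   [10, 10, 11],
                       'id_m': [10, 10, 11],
                       'X':    ['a', 'b', 'c']})

# Join all texts that fall into the same cell with a space
table = pd.pivot_table(df_dup, index='id', columns='id_m', values='X',
                       aggfunc=lambda x: ' '.join(x))
print(table)
```

Cells with no matching rows, such as id 10 with id_m 11 here, are left as missing values.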