I have a pandas dataframe 'df' that has:
name a b
greg 1 1
george 2 2
giles 3 3
giovanni 4 5
I want to run this dataframe through a calculate function to create two new columns, c and d, so that I get the following resulting dataframe:
name c d
greg 11 21
george 12 22
giles 13 23
giovanni 14 25
Currently, my code is as follows:
My calculate function:
def calculate(row):
    return row['a'] + 10, row['b'] + 20
My code to modify the dataframe:
df['c'] = df.apply(calculate, axis=1)
The resulting dataframe I am getting is this:
name a b c
greg 1 1 (11, 21)
george 2 2 (12, 22)
giles 3 3 (13, 23)
giovanni 4 5 (14, 25)
How do I get my dataframe to look like:
name c d
greg 11 21
george 12 22
giles 13 23
giovanni 14 25
Row iteration is very slow. You are much better off doing something like the following:
df['c'] = df.a + 10
df['d'] = df.b + 20
df.drop(['a', 'b'], axis='columns', inplace=True)
To implement your method, however, you would need to do this:
df['c'], df['d'] = zip(*df.apply(calculate, axis=1))
>>> df
name a b c d
0 greg 1 1 11 21
1 george 2 2 12 22
2 giles 3 3 13 23
3 giovanni 4 5 14 25
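As a side note, a sketch assuming pandas 0.23 or newer: apply can expand a returned tuple directly into new columns with result_type='expand', which avoids the zip(*...) step:
# assumes pandas >= 0.23 for result_type='expand';
# each returned tuple is expanded into the two target columns
df[['c', 'd']] = df.apply(calculate, axis=1, result_type='expand')
df = df.drop(['a', 'b'], axis='columns')
The vectorized version above is still the faster option; this just keeps your calculate function intact.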
I have a problem with groupby in pandas. At the beginning I have this DataFrame:
import pandas as pd
data = {'Code_Name':[1,2,3,4,1,2,3,4] ,'Name':['Tom', 'Nicko', 'Krish','Jack kr','Tom', 'Nick', 'Krishx', 'Jacks'],'Cat':['A', 'B','C','D','A', 'B','C','D'], 'T':[9, 7, 14, 12,4, 3, 12, 11]}
# Create DataFrame
df = pd.DataFrame(data)
df
I have this:
Code_Name Name Cat T
0 1 Tom A 9
1 2 Nicko B 7
2 3 Krish C 14
3 4 Jack kr D 12
4 1 Tom A 4
5 2 Nick B 3
6 3 Krishx C 12
7 4 Jacks D 11
Now I group by:
df.groupby(['Code_Name','Name','Cat'],as_index=False)['T'].sum()
I get this:
Code_Name Name Cat T
0 1 Tom A 13
1 2 Nick B 3
2 2 Nicko B 7
3 3 Krish C 14
4 3 Krishx C 12
5 4 Jack kr D 12
6 4 Jacks D 11
But I need this result:
Code_Name Name Cat T
0 1 Tom A 13
1 2 Nick B 10
2 3 Krish C 26
3 4 Jack D 23
I don't care about Name; Code_Name is the only thing that matters to me, along with the sum of T.
Thanks
There are two ways. To avoid losing the other columns, add an aggregation function for each of them: first, last, or ', '.join (obviously for string columns), and aggregation functions like sum or mean for numeric columns:
df = df.groupby('Code_Name',as_index=False).agg({'Name':'first', 'Cat':'first', 'T':'sum'})
print (df)
Code_Name Name Cat T
0 1 Tom A 13
1 2 Nicko B 10
2 3 Krish C 26
3 4 Jack kr D 23
Or, if some values are duplicated per group, like the Cat values here, add those columns to the groupby; only the column order changes in the output:
df = df.groupby(['Code_Name','Cat'],as_index=False).agg({'Name':'first', 'T':'sum'})
print (df)
Code_Name Cat Name T
0 1 A Tom 13
1 2 B Nicko 10
2 3 C Krish 26
3 4 D Jack kr 23
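A hedged alternative, assuming pandas 0.25 or newer: named aggregation expresses the same thing with explicit output column names:
# assumes pandas >= 0.25 for named aggregation
df = df.groupby('Code_Name', as_index=False).agg(
    Name=('Name', 'first'),
    Cat=('Cat', 'first'),
    T=('T', 'sum'),
)
This produces the same result as the dict-based agg above.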
If you don't care about the other variable then just group by the column of interest:
gb = df.groupby(['Code_Name'],as_index=False)['T'].sum()
print(gb)
Code_Name T
0 1 13
1 2 10
2 3 26
3 4 23
Now to get your output, you can take the last value of Name for each group:
gb = df.groupby(['Code_Name'],as_index=False).agg({'Name': 'last', 'Cat': 'first', 'T': 'sum'})
print(gb)
Code_Name Name Cat T
0 1 Tom A 13
1 2 Nick B 10
2 3 Krishx C 26
3 4 Jacks D 23
Perhaps you can try:
(df.groupby("Code_Name", as_index=False)
.agg({"Name":"first", "Cat":"first", "T":"sum"}))
See this link for the original answer: https://datascience.stackexchange.com/questions/53405/pandas-dataframe-groupby-and-then-sum-multi-columns-sperately
I have a pandas dataframe that looks something like this:
df = pd.DataFrame({'Name' : ['Kate', 'John', 'Peter','Kate', 'John', 'Peter'],'Distance' : [23,16,32,15,31,26], 'Time' : [3,5,2,7,9,4]})
df
Distance Name Time
0 23 Kate 3
1 16 John 5
2 32 Peter 2
3 15 Kate 7
4 31 John 9
5 26 Peter 4
I want to add a column that tells me, for each Name, what's the order of the times.
I want something like this:
Order Distance Name Time
0 16 John 5
1 31 John 9
0 23 Kate 3
1 15 Kate 7
0 32 Peter 2
1 26 Peter 4
I can do it using a for loop:
df2 = df[df['Name'] == 'aaa'].reset_index().reset_index() # I did this just to create an empty data frame with the columns I want
for name, row in df.groupby('Name').count().iterrows():
    table = df[df['Name'] == name].sort_values('Time').reset_index().reset_index()
    to_concat = [df2, table]
    df2 = pd.concat(to_concat)
df2.drop('index', axis = 1, inplace = True)
df2.columns = ['Order', 'Distance', 'Name', 'Time']
df2
This works; the problem is that (apart from being very unpythonic) it is extremely slow for large tables: my actual table has about 50 thousand rows and it takes about half an hour to run.
Can someone help me write this in a simpler way that runs faster?
I'm sorry if this has been answered somewhere, but I didn't really know how to search for it.
Best,
Use sort_values with cumcount:
df = df.sort_values(['Name','Time'])
df['Order'] = df.groupby('Name').cumcount()
print (df)
Distance Name Time Order
1 16 John 5 0
4 31 John 9 1
0 23 Kate 3 0
3 15 Kate 7 1
2 32 Peter 2 0
5 26 Peter 4 1
If need first column use insert:
df = df.sort_values(['Name','Time'])
df.insert(0, 'Order', df.groupby('Name').cumcount())
print (df)
Order Distance Name Time
1 0 16 John 5
4 1 31 John 9
0 0 23 Kate 3
3 1 15 Kate 7
2 0 32 Peter 2
5 1 26 Peter 4
In [67]: df = df.sort_values(['Name','Time']) \
.assign(Order=df.groupby('Name').cumcount())
In [68]: df
Out[68]:
Distance Name Time Order
1 16 John 5 0
4 31 John 9 1
0 23 Kate 3 0
3 15 Kate 7 1
2 32 Peter 2 0
5 26 Peter 4 1
PS I'm not sure this is the most elegant way to do this...
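For what it's worth, a hedged sketch of another route: groupby rank gives the same ordering without sorting the frame first (method='first' breaks ties by position in the data):
# rank Time within each Name group; rank is 1-based, so subtract 1
df['Order'] = df.groupby('Name')['Time'].rank(method='first').astype(int) - 1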
I have one massive pandas dataframe with this structure:
df1:
A B
0 0 12
1 0 15
2 0 17
3 0 18
4 1 45
5 1 78
6 1 96
7 1 32
8 2 45
9 2 78
10 2 44
11 2 10
And a second, smaller one like this:
df2
G H
0 0 15
1 1 45
2 2 31
I want to add a column to my first dataframe following this rule: column df1.C = df2.H when df1.A == df2.G
I managed to do it with for loops, but the dataframe is massive and the code runs really slowly, so I am looking for a pandas or NumPy way to do it.
Many thanks,
Boris
If you only want to fill in values for rows that match in both dataframes:
import pandas as pd
df1 = pd.DataFrame({'Name':['Sara'],'Special ability':['Walk on water']})
df1
Name Special ability
0 Sara Walk on water
df2 = pd.DataFrame({'Name':['Sara', 'Gustaf', 'Patrik'],'Age':[4,12,11]})
df2
Name Age
0 Sara 4
1 Gustaf 12
2 Patrik 11
df = df2.merge(df1, left_on='Name', right_on='Name', how='left')
df
Name Age Special ability
0 Sara 4 Walk on water
1 Gustaf 12 NaN
2 Patrik 11 NaN
This can also be done with more than one matching argument (in this example, Patrik from df1 does not exist in df2 because they have different ages and therefore will not merge):
df1 = pd.DataFrame({'Name':['Sara','Patrik'],'Special ability':['Walk on water','FireBalls'],'Age':[4,83]})
df1
Name Special ability Age
0 Sara Walk on water 4
1 Patrik FireBalls 83
df2 = pd.DataFrame({'Name':['Sara', 'Gustaf', 'Patrik'],'Age':[4,12,11]})
df2
Name Age
0 Sara 4
1 Gustaf 12
2 Patrik 11
df = df2.merge(df1,left_on=['Name','Age'],right_on=['Name','Age'],how='left')
df
Name Age Special ability
0 Sara 4 Walk on water
1 Gustaf 12 NaN
2 Patrik 11 NaN
You probably want to use a merge:
df = df1.merge(df2, left_on="A", right_on="G")
This gives you a dataframe with four columns (A, B, G, and H), so drop the join key and rename H to get the column names you want:
df = df.drop('G', axis=1).rename(columns={'H': 'C'})
You can use map with a Series created by set_index:
df1['C'] = df1['A'].map(df2.set_index('G')['H'])
print (df1)
A B C
0 0 12 15
1 0 15 15
2 0 17 15
3 0 18 15
4 1 45 45
5 1 78 45
6 1 96 45
7 1 32 45
8 2 45 31
9 2 78 31
10 2 44 31
11 2 10 31
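One hedged note on this approach: any df1.A value missing from df2.G maps to NaN rather than raising, so if that can happen you may want an explicit default (the -1 here is just an illustrative placeholder):
# hypothetical fallback value for unmatched keys
df1['C'] = df1['A'].map(df2.set_index('G')['H']).fillna(-1).astype(int)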
Or merge with drop and rename:
df = (df1.merge(df2, left_on="A", right_on="G", how='left')
          .drop('G', axis=1)
          .rename(columns={'H': 'C'}))
print (df)
A B C
0 0 12 15
1 0 15 15
2 0 17 15
3 0 18 15
4 1 45 45
5 1 78 45
6 1 96 45
7 1 32 45
8 2 45 31
9 2 78 31
10 2 44 31
11 2 10 31
Here's one vectorized NumPy approach -
idx = np.searchsorted(df2.G.values, df1.A.values)
df1['C'] = df2.H.values[idx]
idx could be computed more simply with df2.G.searchsorted(df1.A), but I don't think that would be any more efficient, because we want to use the underlying arrays via .values for performance, as done earlier. Note that searchsorted requires df2.G to be sorted, as it is here.
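A hedged defensive variant of the same idea, for the case where df2.G might not already be sorted:
import numpy as np

# sort df2 by G first so searchsorted's sorted-input precondition holds
df2_sorted = df2.sort_values('G')
idx = np.searchsorted(df2_sorted['G'].values, df1['A'].values)
df1['C'] = df2_sorted['H'].values[idx]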
Consider this data:
df = pd.DataFrame(np.random.randint(0,20,size=(5, 4)),
                  columns=list('ABCD'),
                  index=pd.date_range('2016-04-01', '2016-04-05'))
date A B C D
1/1/2016 15 5 19 2
2/1/2016 18 1 14 11
3/1/2016 10 16 8 8
4/1/2016 7 17 17 18
5/1/2016 10 15 18 18
where date is the index
what I want to get back is a tuple of (date, <max>, <series_name>) for each column:
2/1/2016, 18, 'A'
4/1/2016, 17, 'B'
1/1/2016, 19, 'C'
4/1/2016, 18, 'D'
How can this be done in idiomatic pandas?
You could use idxmax and max with axis=0 for that and then join them:
np.random.seed(632)
df = pd.DataFrame(np.random.randint(0,20,size=(5, 4)), columns=list('ABCD'))
In [28]: df
Out[28]:
A B C D
0 10 14 16 1
1 12 13 8 8
2 8 16 11 1
3 8 1 17 12
4 4 2 1 7
In [29]: df.idxmax(axis=0)
Out[29]:
A 1
B 2
C 3
D 3
dtype: int64
In [30]: df.max(axis=0)
Out[30]:
A 12
B 16
C 17
D 12
dtype: int32
In [32]: pd.concat([df.idxmax(axis=0) , df.max(axis=0)], axis=1)
Out[32]:
0 1
A 1 12
B 2 16
C 3 17
D 3 12
I think you can concat max and idxmax. Lastly, you can reset_index, rename the index column, and reorder the columns:
print df
A B C D
date
1/1/2016 15 5 19 2
2/1/2016 18 1 14 11
3/1/2016 10 16 8 8
4/1/2016 7 17 17 18
5/1/2016 10 15 18 18
print pd.concat([df.max(),df.idxmax()], axis=1, keys=['max','date'])
max date
A 18 2/1/2016
B 17 4/1/2016
C 19 1/1/2016
D 18 4/1/2016
df = (pd.concat([df.max(), df.idxmax()], axis=1, keys=['max','date'])
        .reset_index()
        .rename(columns={'index':'name'}))
#change order of columns
df = df[['date','max','name']]
print df
date max name
0 2/1/2016 18 A
1 4/1/2016 17 B
2 1/1/2016 19 C
3 4/1/2016 18 D
Another solution with rename_axis (new in pandas 0.18.0):
print pd.concat([df.max().rename_axis('name'), df.idxmax()], axis=1, keys=['max','date'])
max date
name
A 18 2/1/2016
B 17 4/1/2016
C 19 1/1/2016
D 18 4/1/2016
df = (pd.concat([df.max().rename_axis('name'), df.idxmax()], axis=1, keys=['max','date'])
        .reset_index())
#change order of columns
df = df[['date','max','name']]
print df
date max name
0 2/1/2016 18 A
1 4/1/2016 17 B
2 1/1/2016 19 C
3 4/1/2016 18 D
Setup
import numpy as np
import pandas as pd
np.random.seed(314)
df = pd.DataFrame(np.random.randint(0,20,size=(5, 4)),
                  columns=list('ABCD'),
                  index=pd.date_range('2016-04-01', '2016-04-05'))
print df
A B C D
2016-04-01 8 13 9 19
2016-04-02 10 14 16 7
2016-04-03 2 7 16 3
2016-04-04 12 7 4 0
2016-04-05 4 13 8 16
Solution
stacked = df.stack()
stacked = stacked[stacked.groupby(level=1).idxmax()]
produces
print stacked
2016-04-04 A 12
2016-04-02 B 14
C 16
2016-04-01 D 19
dtype: int32
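If you literally need (date, max, name) tuples, a small hedged follow-up (stacked has a (date, name) MultiIndex):
# unpack the MultiIndexed Series into (date, max, name) tuples
# (use iteritems() instead of items() on very old pandas)
tuples = [(date, value, name) for (date, name), value in stacked.items()]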
I am currently trying to make use of pandas' MultiIndex. I am trying to group an existing DataFrame object, df_original, based on its columns in a smart way, and was therefore thinking of a MultiIndex.
print df_original
by_currency by_portfolio A B C
1 AUD a 1 2 3
2 AUD b 4 5 6
3 AUD c 7 8 9
4 AUD d 10 11 12
5 CHF a 13 14 15
6 CHF b 16 17 18
7 CHF c 19 20 21
8 CHF d 22 23 24
Now, what I would like to have is a MultiIndex DataFrame object, with A, B and C, together with by_portfolio, as indices, looking like:
CHF AUD
A a 13 1
b 16 4
c 19 7
d 22 10
B a 14 2
b 17 5
c 20 8
d 23 11
C a 15 3
b 18 6
c 21 9
d 24 12
I have tried turning all columns of df_original, plus the sought-after indices, into list objects and creating a new DataFrame from there. This seems a bit cumbersome, and I can't figure out how to add the actual values afterwards.
Perhaps some sort of groupby is better for this purpose? The thing is, I will need to add this data to another, similar DataFrame later on, so the resulting DataFrame needs to support that.
Thanks
You can use a combination of stack and unstack:
In [50]: df.set_index(['by_currency', 'by_portfolio']).stack().unstack(0)
Out[50]:
by_currency AUD CHF
by_portfolio
a A 1 13
B 2 14
C 3 15
b A 4 16
B 5 17
C 6 18
c A 7 19
B 8 20
C 9 21
d A 10 22
B 11 23
C 12 24
To obtain your desired result, we only need to swap the levels of the index:
In [51]: df2 = df.set_index(['by_currency', 'by_portfolio']).stack().unstack(0)
In [52]: df2.columns.name = None
In [53]: df2.index = df2.index.swaplevel(0,1)
In [55]: df2 = df2.sort_index()
In [56]: df2
Out[56]:
AUD CHF
by_portfolio
A a 1 13
b 4 16
c 7 19
d 10 22
B a 2 14
b 5 17
c 8 20
d 11 23
C a 3 15
b 6 18
c 9 21
d 12 24
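Since the question mentions adding this data to another, similar DataFrame later, a hedged sketch of that step (other is a hypothetical DataFrame with the same MultiIndex layout):
# 'other' is hypothetical; add aligns on both index levels and on columns,
# treating entries missing on either side as 0
combined = df2.add(other, fill_value=0)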