I have a pandas dataframe 'df' that has:
name a b
greg 1 1
george 2 2
giles 3 3
giovanni 4 5
I want to run this dataframe through a calculate function to create two new columns, c and d, so that I get the following resulting dataframe:
name c d
greg 11 21
george 12 22
giles 13 23
giovanni 14 25
Currently, my code is as follows:
My calculate function:
def calculate(row):
    return row['a'] + 10, row['b'] + 20
My code to modify the dataframe:
df['c'] = df.apply(calculate, axis=1)
The resulting dataframe I am getting is this:
name a b c
greg 1 1 (11, 21)
george 2 2 (12, 22)
giles 3 3 (13, 23)
giovanni 4 5 (14, 25)
How do I get my dataframe to look like:
name c d
greg 11 21
george 12 22
giles 13 23
giovanni 14 25
Row iteration is very slow. You are much better off doing something like the following:
df['c'] = df.a + 10
df['d'] = df.b + 20
df.drop(['a', 'b'], axis='columns', inplace=True)
To implement your method, however, you would need to do this:
df['c'], df['d'] = zip(*df.apply(calculate, axis=1))
>>> df
name a b c d
0 greg 1 1 11 21
1 george 2 2 12 22
2 giles 3 3 13 23
3 giovanni 4 5 14 25
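As a side note, a sketch assuming pandas 0.23 or newer: apply can expand a returned tuple directly into new columns with result_type='expand', which avoids the zip(*...) step:
# assumes pandas >= 0.23 for result_type='expand';
# each returned tuple is expanded into the two target columns
df[['c', 'd']] = df.apply(calculate, axis=1, result_type='expand')
df = df.drop(['a', 'b'], axis='columns')
The vectorized version above is still the faster option; this just keeps your calculate function intact.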
I have a problem with groupby in pandas. At the beginning I have this DataFrame:
import pandas as pd
data = {'Code_Name':[1,2,3,4,1,2,3,4] ,'Name':['Tom', 'Nicko', 'Krish','Jack kr','Tom', 'Nick', 'Krishx', 'Jacks'],'Cat':['A', 'B','C','D','A', 'B','C','D'], 'T':[9, 7, 14, 12,4, 3, 12, 11]}
# Create DataFrame
df = pd.DataFrame(data)
df
I have this:
Code_Name Name Cat T
0 1 Tom A 9
1 2 Nicko B 7
2 3 Krish C 14
3 4 Jack kr D 12
4 1 Tom A 4
5 2 Nick B 3
6 3 Krishx C 12
7 4 Jacks D 11
Now I group by:
df.groupby(['Code_Name','Name','Cat'],as_index=False)['T'].sum()
I get this:
Code_Name Name Cat T
0 1 Tom A 13
1 2 Nick B 3
2 2 Nicko B 7
3 3 Krish C 14
4 3 Krishx C 12
5 4 Jack kr D 12
6 4 Jacks D 11
But I need this result:
Code_Name Name Cat T
0 1 Tom A 13
1 2 Nick B 10
2 3 Krish C 26
3 4 Jack D 23
I don't care about Name; Code_Name is the only thing that matters to me, along with the sum of T.
Thanks
There are two ways. To avoid losing the other columns, add an aggregation function for each of them: first, last, or ', '.join (obviously for string columns), and aggregation functions like sum or mean for numeric columns:
df = df.groupby('Code_Name',as_index=False).agg({'Name':'first', 'Cat':'first', 'T':'sum'})
print (df)
Code_Name Name Cat T
0 1 Tom A 13
1 2 Nicko B 10
2 3 Krish C 26
3 4 Jack kr D 23
Or, if some values are duplicated per group, like the Cat values here, add those columns to the groupby; only the column order changes in the output:
df = df.groupby(['Code_Name','Cat'],as_index=False).agg({'Name':'first', 'T':'sum'})
print (df)
Code_Name Cat Name T
0 1 A Tom 13
1 2 B Nicko 10
2 3 C Krish 26
3 4 D Jack kr 23
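A hedged alternative, assuming pandas 0.25 or newer: named aggregation expresses the same thing with explicit output column names:
# assumes pandas >= 0.25 for named aggregation
df = df.groupby('Code_Name', as_index=False).agg(
    Name=('Name', 'first'),
    Cat=('Cat', 'first'),
    T=('T', 'sum'),
)
This produces the same result as the dict-based agg above.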
If you don't care about the other variable then just group by the column of interest:
gb = df.groupby(['Code_Name'],as_index=False)['T'].sum()
print(gb)
Code_Name T
0 1 13
1 2 10
2 3 26
3 4 23
Now to get your output, you can take the last value of Name for each group:
gb = df.groupby(['Code_Name'],as_index=False).agg({'Name': 'last', 'Cat': 'first', 'T': 'sum'})
print(gb)
Code_Name Name Cat T
0 1 Tom A 13
1 2 Nick B 10
2 3 Krishx C 26
3 4 Jacks D 23
Perhaps you can try:
(df.groupby("Code_Name", as_index=False)
.agg({"Name":"first", "Cat":"first", "T":"sum"}))
See this link for the original answer: https://datascience.stackexchange.com/questions/53405/pandas-dataframe-groupby-and-then-sum-multi-columns-sperately
I have a pandas dataframe that looks something like this:
df = pd.DataFrame({'Name' : ['Kate', 'John', 'Peter','Kate', 'John', 'Peter'],'Distance' : [23,16,32,15,31,26], 'Time' : [3,5,2,7,9,4]})
df
Distance Name Time
0 23 Kate 3
1 16 John 5
2 32 Peter 2
3 15 Kate 7
4 31 John 9
5 26 Peter 4
I want to add a column that tells me, for each Name, what's the order of the times.
I want something like this:
Order Distance Name Time
0 16 John 5
1 31 John 9
0 23 Kate 3
1 15 Kate 7
0 32 Peter 2
1 26 Peter 4
I can do it using a for loop:
df2 = df[df['Name'] == 'aaa'].reset_index().reset_index() # I did this just to create an empty data frame with the columns I want
for name, row in df.groupby('Name').count().iterrows():
    table = df[df['Name'] == name].sort_values('Time').reset_index().reset_index()
    to_concat = [df2, table]
    df2 = pd.concat(to_concat)
df2.drop('index', axis = 1, inplace = True)
df2.columns = ['Order', 'Distance', 'Name', 'Time']
df2
This works; the problem is that (apart from being very unpythonic) it is extremely slow for large tables: my actual table has about 50 thousand rows and it takes about half an hour to run.
Can someone help me write this in a simpler way that runs faster?
I'm sorry if this has been answered somewhere, but I didn't really know how to search for it.
Best,
Use sort_values with cumcount:
df = df.sort_values(['Name','Time'])
df['Order'] = df.groupby('Name').cumcount()
print (df)
Distance Name Time Order
1 16 John 5 0
4 31 John 9 1
0 23 Kate 3 0
3 15 Kate 7 1
2 32 Peter 2 0
5 26 Peter 4 1
If need first column use insert:
df = df.sort_values(['Name','Time'])
df.insert(0, 'Order', df.groupby('Name').cumcount())
print (df)
Order Distance Name Time
1 0 16 John 5
4 1 31 John 9
0 0 23 Kate 3
3 1 15 Kate 7
2 0 32 Peter 2
5 1 26 Peter 4
In [67]: df = df.sort_values(['Name','Time']) \
.assign(Order=df.groupby('Name').cumcount())
In [68]: df
Out[68]:
Distance Name Time Order
1 16 John 5 0
4 31 John 9 1
0 23 Kate 3 0
3 15 Kate 7 1
2 32 Peter 2 0
5 26 Peter 4 1
PS I'm not sure this is the most elegant way to do this...
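For what it's worth, a hedged sketch of another route: groupby rank gives the same ordering without sorting the frame first (method='first' breaks ties by position in the data):
# rank Time within each Name group; rank is 1-based, so subtract 1
df['Order'] = df.groupby('Name')['Time'].rank(method='first').astype(int) - 1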
I have one massive pandas dataframe with this structure:
df1:
A B
0 0 12
1 0 15
2 0 17
3 0 18
4 1 45
5 1 78
6 1 96
7 1 32
8 2 45
9 2 78
10 2 44
11 2 10
And a second, smaller one like this:
df2
G H
0 0 15
1 1 45
2 2 31
I want to add a column to my first dataframe following this rule: column df1.C = df2.H when df1.A == df2.G
I managed to do it with for loops, but the dataframe is massive and the code runs really slowly, so I am looking for a pandas or NumPy way to do it.
Many thanks,
Boris
If you only want to fill in values for rows that match in both dataframes:
import pandas as pd
df1 = pd.DataFrame({'Name':['Sara'],'Special ability':['Walk on water']})
df1
Name Special ability
0 Sara Walk on water
df2 = pd.DataFrame({'Name':['Sara', 'Gustaf', 'Patrik'],'Age':[4,12,11]})
df2
Name Age
0 Sara 4
1 Gustaf 12
2 Patrik 11
df = df2.merge(df1, left_on='Name', right_on='Name', how='left')
df
Name Age Special ability
0 Sara 4 Walk on water
1 Gustaf 12 NaN
2 Patrik 11 NaN
This can also be done with more than one matching argument (in this example, Patrik from df1 does not exist in df2 because they have different ages and therefore will not merge):
df1 = pd.DataFrame({'Name':['Sara','Patrik'],'Special ability':['Walk on water','FireBalls'],'Age':[4,83]})
df1
Name Special ability Age
0 Sara Walk on water 4
1 Patrik FireBalls 83
df2 = pd.DataFrame({'Name':['Sara', 'Gustaf', 'Patrik'],'Age':[4,12,11]})
df2
Name Age
0 Sara 4
1 Gustaf 12
2 Patrik 11
df = df2.merge(df1,left_on=['Name','Age'],right_on=['Name','Age'],how='left')
df
Name Age Special ability
0 Sara 4 Walk on water
1 Gustaf 12 NaN
2 Patrik 11 NaN
You probably want to use a merge:
df = df1.merge(df2, left_on="A", right_on="G")
This gives you a dataframe with four columns (A, B, G, and H), so drop the join key and rename H to get the column names you want:
df = df.drop('G', axis=1).rename(columns={'H': 'C'})
You can use map with a Series created by set_index:
df1['C'] = df1['A'].map(df2.set_index('G')['H'])
print (df1)
A B C
0 0 12 15
1 0 15 15
2 0 17 15
3 0 18 15
4 1 45 45
5 1 78 45
6 1 96 45
7 1 32 45
8 2 45 31
9 2 78 31
10 2 44 31
11 2 10 31
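One hedged note on this approach: any df1.A value missing from df2.G maps to NaN rather than raising, so if that can happen you may want an explicit default (the -1 here is just an illustrative placeholder):
# hypothetical fallback value for unmatched keys
df1['C'] = df1['A'].map(df2.set_index('G')['H']).fillna(-1).astype(int)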
Or merge with drop and rename:
df = (df1.merge(df2, left_on="A", right_on="G", how='left')
          .drop('G', axis=1)
          .rename(columns={'H': 'C'}))
print (df)
A B C
0 0 12 15
1 0 15 15
2 0 17 15
3 0 18 15
4 1 45 45
5 1 78 45
6 1 96 45
7 1 32 45
8 2 45 31
9 2 78 31
10 2 44 31
11 2 10 31
Here's one vectorized NumPy approach -
idx = np.searchsorted(df2.G.values, df1.A.values)
df1['C'] = df2.H.values[idx]
idx could be computed more simply with df2.G.searchsorted(df1.A), but I don't think that would be any more efficient, because we want to use the underlying arrays via .values for performance, as done earlier. Note that searchsorted requires df2.G to be sorted, as it is here.
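A hedged defensive variant of the same idea, for the case where df2.G might not already be sorted:
import numpy as np

# sort df2 by G first so searchsorted's sorted-input precondition holds
df2_sorted = df2.sort_values('G')
idx = np.searchsorted(df2_sorted['G'].values, df1['A'].values)
df1['C'] = df2_sorted['H'].values[idx]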
Consider this data:
df = pd.DataFrame(np.random.randint(0,20,size=(5, 4)),
                  columns=list('ABCD'),
                  index=pd.date_range('2016-04-01', '2016-04-05'))
date A B C D
1/1/2016 15 5 19 2
2/1/2016 18 1 14 11
3/1/2016 10 16 8 8
4/1/2016 7 17 17 18
5/1/2016 10 15 18 18
where date is the index
what I want to get back is a tuple of (date, <max>, <series_name>) for each column:
2/1/2016, 18, 'A'
4/1/2016, 17, 'B'
1/1/2016, 19, 'C'
4/1/2016, 18, 'D'
How can this be done in idiomatic pandas?
You could use idxmax and max with axis=0 for that and then join them:
np.random.seed(632)
df = pd.DataFrame(np.random.randint(0,20,size=(5, 4)), columns=list('ABCD'))
In [28]: df
Out[28]:
A B C D
0 10 14 16 1
1 12 13 8 8
2 8 16 11 1
3 8 1 17 12
4 4 2 1 7
In [29]: df.idxmax(axis=0)
Out[29]:
A 1
B 2
C 3
D 3
dtype: int64
In [30]: df.max(axis=0)
Out[30]:
A 12
B 16
C 17
D 12
dtype: int32
In [32]: pd.concat([df.idxmax(axis=0) , df.max(axis=0)], axis=1)
Out[32]:
0 1
A 1 12
B 2 16
C 3 17
D 3 12
I think you can concat max and idxmax. Lastly, you can reset_index, rename the index column, and reorder the columns:
print df
A B C D
date
1/1/2016 15 5 19 2
2/1/2016 18 1 14 11
3/1/2016 10 16 8 8
4/1/2016 7 17 17 18
5/1/2016 10 15 18 18
print pd.concat([df.max(),df.idxmax()], axis=1, keys=['max','date'])
max date
A 18 2/1/2016
B 17 4/1/2016
C 19 1/1/2016
D 18 4/1/2016
df = (pd.concat([df.max(), df.idxmax()], axis=1, keys=['max','date'])
        .reset_index()
        .rename(columns={'index':'name'}))
#change order of columns
df = df[['date','max','name']]
print df
date max name
0 2/1/2016 18 A
1 4/1/2016 17 B
2 1/1/2016 19 C
3 4/1/2016 18 D
Another solution with rename_axis (new in pandas 0.18.0):
print pd.concat([df.max().rename_axis('name'), df.idxmax()], axis=1, keys=['max','date'])
max date
name
A 18 2/1/2016
B 17 4/1/2016
C 19 1/1/2016
D 18 4/1/2016
df = (pd.concat([df.max().rename_axis('name'), df.idxmax()], axis=1, keys=['max','date'])
        .reset_index())
#change order of columns
df = df[['date','max','name']]
print df
date max name
0 2/1/2016 18 A
1 4/1/2016 17 B
2 1/1/2016 19 C
3 4/1/2016 18 D
Setup
import numpy as np
import pandas as pd
np.random.seed(314)
df = pd.DataFrame(np.random.randint(0,20,size=(5, 4)),
                  columns=list('ABCD'),
                  index=pd.date_range('2016-04-01', '2016-04-05'))
print df
A B C D
2016-04-01 8 13 9 19
2016-04-02 10 14 16 7
2016-04-03 2 7 16 3
2016-04-04 12 7 4 0
2016-04-05 4 13 8 16
Solution
stacked = df.stack()
stacked = stacked[stacked.groupby(level=1).idxmax()]
produces
print stacked
2016-04-04 A 12
2016-04-02 B 14
C 16
2016-04-01 D 19
dtype: int32
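If you literally need (date, max, name) tuples, a small hedged follow-up (stacked has a (date, name) MultiIndex):
# unpack the MultiIndexed Series into (date, max, name) tuples
# (use iteritems() instead of items() on very old pandas)
tuples = [(date, value, name) for (date, name), value in stacked.items()]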
I am currently trying to make use of pandas' MultiIndex. I am trying to group an existing DataFrame object, df_original, based on its columns in a smart way, and was therefore thinking of a MultiIndex.
print df_original
by_currency by_portfolio A B C
1 AUD a 1 2 3
2 AUD b 4 5 6
3 AUD c 7 8 9
4 AUD d 10 11 12
5 CHF a 13 14 15
6 CHF b 16 17 18
7 CHF c 19 20 21
8 CHF d 22 23 24
Now, what I would like to have is a MultiIndex DataFrame object, with A, B and C, together with by_portfolio, as indices, looking like:
CHF AUD
A a 13 1
b 16 4
c 19 7
d 22 10
B a 14 2
b 17 5
c 20 8
d 23 11
C a 15 3
b 18 6
c 21 9
d 24 12
I have tried turning all columns of df_original, plus the sought-after indices, into list objects and creating a new DataFrame from there. This seems a bit cumbersome, and I can't figure out how to add the actual values afterwards.
Perhaps some sort of groupby is better for this purpose? The thing is, I will need to add this data to another, similar DataFrame later on, so the resulting DataFrame needs to support that.
Thanks
You can use a combination of stack and unstack:
In [50]: df.set_index(['by_currency', 'by_portfolio']).stack().unstack(0)
Out[50]:
by_currency AUD CHF
by_portfolio
a A 1 13
B 2 14
C 3 15
b A 4 16
B 5 17
C 6 18
c A 7 19
B 8 20
C 9 21
d A 10 22
B 11 23
C 12 24
To obtain your desired result, we only need to swap the levels of the index:
In [51]: df2 = df.set_index(['by_currency', 'by_portfolio']).stack().unstack(0)
In [52]: df2.columns.name = None
In [53]: df2.index = df2.index.swaplevel(0,1)
In [55]: df2 = df2.sort_index()
In [56]: df2
Out[56]:
AUD CHF
by_portfolio
A a 1 13
b 4 16
c 7 19
d 10 22
B a 2 14
b 5 17
c 8 20
d 11 23
C a 3 15
b 6 18
c 9 21
d 12 24
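Since the question mentions adding this data to another, similar DataFrame later, a hedged sketch of that step (other is a hypothetical DataFrame with the same MultiIndex layout):
# 'other' is hypothetical; add aligns on both index levels and on columns,
# treating entries missing on either side as 0
combined = df2.add(other, fill_value=0)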