I have the following pandas DataFrame:
df = pd.DataFrame({'country': ['US', 'FR', 'DE', 'SP'],
                   'energy_per_capita': [10, 8, 9, 7],
                   'pop_2014': [300, 70, 80, 60],
                   'pop_2015': [305, 72, 80, 'NaN']})
I'd like to create a new column:
df['total energy consumption']
which multiplies energy_per_capita by population.
I'd like it to use pop_2015 when available and fall back to pop_2014 when pop_2015 is NaN.
thanks
Make sure you read 10 Minutes to pandas. For this case we can use the pandas.DataFrame.fillna method:
import numpy as np
import pandas as pd

df = pd.DataFrame({'country': ['US', 'FR', 'DE', 'SP'],
                   'energy_per_capita': [10, 8, 9, 7],
                   'pop_2014': [300, 70, 80, 60],
                   'pop_2015': [305, 72, 80, np.nan]})

df['total energy consumption'] = df['energy_per_capita'] * df['pop_2015'].fillna(df['pop_2014'])
print(df)
Output:
country energy_per_capita pop_2014 pop_2015 total energy consumption
0 US 10 300 305.0 3050.0
1 FR 8 70 72.0 576.0
2 DE 9 80 80.0 720.0
3 SP 7 60 NaN 420.0
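As an aside, the same fallback can also be written with Series.combine_first, which takes pop_2015 where it is present and pop_2014 otherwise. A small sketch, assuming the df with np.nan defined above:

df['total energy consumption'] = (
    df['energy_per_capita'] * df['pop_2015'].combine_first(df['pop_2014'])
)
print(df)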
I have a dataframe like the one shown below:
df = pd.DataFrame(
    {'stud_name': ['ABC', 'ABC', 'ABC', 'DEF', 'DEF', 'DEF'],
     'qty': [123, 31, 490, 518, 70, 900],
     'trans_date': ['13/11/2020', '10/1/2018', '11/11/2017',
                    '27/03/2016', '13/05/2010', '14/07/2008']})
I would like to do the following:
a) For each stud_name, look at their past data (the full history) and compute the min, max and mean of the qty column.
Please note that the 1st record/row for every unique stud_name will be NA because there is no past data (history) to compute the aggregate statistics from.
I tried something like the below, but the output is incorrect:
df['trans_date'] = pd.to_datetime(df['trans_date'])
df.sort_values(by=['stud_name','trans_date'],inplace=True)
df['past_transactions'] = df.groupby('stud_name').cumcount()
df['past_max_qty'] = df.groupby('stud_name')['qty'].expanding().max().values
df['past_min_qty'] = df.groupby('stud_name')['qty'].expanding().min().values
df['past_avg_qty'] = df.groupby('stud_name')['qty'].expanding().mean().values
I expect my output to look like the one shown below.
We can use a custom function to calculate the past statistics per student:
def past_stats(q):
    return (q.expanding()
             .agg(['max', 'min', 'mean'])
             .shift()              # shift so each row only sees data from earlier rows
             .add_prefix('past_'))

df.join(df.groupby('stud_name')['qty'].apply(past_stats))
stud_name qty trans_date past_max past_min past_mean
2 ABC 490 2017-11-11 NaN NaN NaN
1 ABC 31 2018-10-01 490.0 490.0 490.0
0 ABC 123 2020-11-13 490.0 31.0 260.5
5 DEF 900 2008-07-14 NaN NaN NaN
4 DEF 70 2010-05-13 900.0 900.0 900.0
3 DEF 518 2016-03-27 900.0 70.0 485.0
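One caveat: depending on the pandas version, groupby().apply() can prepend stud_name to the result's index, which would break the join above. Passing group_keys=False keeps the original row index; a minimal sketch, assuming the same df and past_stats as above:

# keep the original index so the join aligns row-by-row
past = df.groupby('stud_name', group_keys=False)['qty'].apply(past_stats)
out = df.join(past)
print(out)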
I have a data file containing different foetal ultrasound measurements. The measurements are collected at different points during pregnancy, like so:
PregnancyID MotherID gestationalAgeInWeeks abdomCirc
0 0 14 150
0 0 21 200
1 1 20 294
1 1 25 315
1 1 30 350
2 2 8 170
2 2 9 180
2 2 18 NaN
Following this answer to a previous question I had asked, I used this code to summarise the ultrasound measurements using the maximum measurement recorded in a single trimester (13 weeks):
(df.assign(tm = (df['gestationalAgeInWeeks'] + 13 - 1) // 13)
   .drop(columns = 'gestationalAgeInWeeks')
   .groupby(['MotherID', 'PregnancyID', 'tm'])
   .agg('max')
   .unstack()
)
This results in the following output:
tm 1 2 3
MotherID PregnancyID
0 0 NaN 200.0 NaN
1 1 NaN 294.0 350.0
2 2 180.0 NaN NaN
However, MotherID and PregnancyID no longer appear as columns in the output of df.info(). Similarly, when I output the dataframe to a csv file, I only get columns 1,2 and 3. The id columns only appear when running df.head() as can be seen in the dataframe above.
I need to preserve the id columns as I want to use them to merge this dataframe with another one using the ids. Therefore, my question is, how do I preserve these id columns as part of my dataframe after running the code above?
Chain that with reset_index:
(df.assign(tm = (df['gestationalAgeInWeeks']+ 13 - 1 )// 13)
# .drop(columns = 'gestationalAgeInWeeks') # don't need this
.groupby(['MotherID', 'PregnancyID','tm'])['abdomCirc'] # change here
.max().add_prefix('abdomCirc_') # here
.unstack()
.reset_index() # and here
)
Or a more friendly version with pivot_table:
(df.assign(tm = (df['gestationalAgeInWeeks']+ 13 - 1 )// 13)
.pivot_table(index= ['MotherID', 'PregnancyID'], columns='tm',
values= 'abdomCirc', aggfunc='max')
.add_prefix('abdomCirc_') # remove this if you don't want the prefix
.reset_index()
)
Output:
tm MotherID PregnancyID abdomCirc_1 abdomCirc_2 abdomCirc_3
0 abdomCirc_0 abdomCirc_0 NaN 200.0 NaN
1 abdomCirc_1 abdomCirc_1 NaN 315.0 350.0
2 abdomCirc_2 abdomCirc_2 180.0 NaN NaN
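Note that in the first version add_prefix runs before unstack, so it also prefixes the MotherID/PregnancyID labels, which is why they appear as abdomCirc_0, abdomCirc_1, ... above. If you only want the prefix on the tm columns, one option is to prefix after unstacking, as the pivot_table version already does. A small sketch on the same df:

(df.assign(tm = (df['gestationalAgeInWeeks'] + 13 - 1) // 13)
   .groupby(['MotherID', 'PregnancyID', 'tm'])['abdomCirc']
   .max()
   .unstack()
   .add_prefix('abdomCirc_')  # prefix only the tm columns
   .reset_index()
)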
I have two dataframes read from Excel files which look like the below. The first dataframe has a multi-index header.
I am trying to find the correlation between each column in the first dataframe and the corresponding column of the second dataframe, matched on the currency (i.e. KRW, THB, USD, INR). At the moment, I am looping through each column, matching by index and the corresponding header, before computing the correlation.
for stock_name in index_data.columns.get_level_values(0):
    stock_prices = index_data.xs(stock_name, level=0, axis=1)
    stock_prices = stock_prices.dropna()
    fx = currency_data[stock_prices.columns.get_level_values(1).values[0]]
    fx = fx[fx.index.isin(stock_prices.index)]
    merged_df = pd.merge(stock_prices, fx, left_index=True, right_index=True)
    merged_df[0].corr(merged_df[1])
Is there a more panda-ish way of doing this?
So you wish to find the correlation between the stock price and its related currency. (Or stock price correlation to all currencies?)
# dummy data
import numpy as np
import pandas as pd

date_range = pd.date_range('2019-02-01', '2019-03-01', freq='D')
stock_prices = pd.DataFrame(
    np.random.randint(1, 20, (date_range.shape[0], 4)),
    index=date_range,
    columns=[['BYZ6DH', 'BLZGSL', 'MBT', 'BAP'],
             ['KRW', 'THB', 'USD', 'USD']])
fx = pd.DataFrame(np.random.randint(1, 20, (date_range.shape[0], 3)),
                  index=date_range, columns=['KRW', 'THB', 'USD'])
This is what it looks like; calculating correlations on this data won't be meaningful since it is random.
>>> print(stock_prices.head())
BYZ6DH BLZGSL MBT BAP
KRW THB USD USD
2019-02-01 15 10 19 19
2019-02-02 5 9 19 5
2019-02-03 19 7 18 10
2019-02-04 1 6 7 18
2019-02-05 11 17 6 7
>>> print(fx.head())
KRW THB USD
2019-02-01 15 11 10
2019-02-02 6 5 3
2019-02-03 13 1 3
2019-02-04 19 8 14
2019-02-05 6 13 2
Use apply to calculate the correlation between each stock column and its matching currency column.
def f(x, fx):
    # x.name is the (stock, currency) column label; pick the matching fx column
    correlation = x.corr(fx[x.name[1]])
    return correlation

correlation = stock_prices.apply(f, args=(fx,), axis=0)
>>> print(correlation)
BYZ6DH KRW -0.247529
BLZGSL THB 0.043084
MBT USD -0.471750
BAP USD 0.314969
dtype: float64
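If a plain ticker-to-correlation mapping is easier to work with downstream, the currency level can be dropped from the result's index. A small sketch, assuming the correlation Series from above:

# drop the currency level, keeping only the ticker labels
by_ticker = correlation.droplevel(1)
print(by_ticker)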
I have the following DataFrame:
actor Daily Total actor1 actor2
Day
2019-01-01 25 10 15
2019-01-02 30 15 15
Avg 27.5 12.5 15.0
How do I change the data type of the 'Avg' row to integer? How do I round the values in that row?
In pandas, after adding a new row filled with floats, all columns are converted to floats.
A possible solution is to round and convert all columns:
df = df.round().astype(int)
Or add the new Series already rounded and converted to integer:
df = df.append(df.mean().rename('Avg').round().astype(int))
print(df)
Daily Total actor1 actor2
actor
2019-01-01 25 10 15
2019-01-02 30 15 15
Avg 28 12 15
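Note that DataFrame.append was removed in pandas 2.0; on newer versions the same idea can be written with pd.concat. A small sketch, assuming the original integer df:

# build the rounded integer mean row and concatenate it as an 'Avg' row
avg = df.mean().round().astype(int).rename('Avg')
df = pd.concat([df, avg.to_frame().T])
print(df)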
If you want to convert only the columns whose 'Avg' values are whole numbers:
d = dict.fromkeys(df.columns[df.loc['Avg'] == df.loc['Avg'].astype(int)], 'int')
df = df.astype(d)
print(df)
Daily Total actor1 actor2
actor
2019-01-01 25.0 10.0 15
2019-01-02 30.0 15.0 15
Avg 27.5 12.5 15
Use loc to access the 'Avg' row by index, then apply numpy.round:
import numpy as np
df.loc['Avg'] = df.loc['Avg'].apply(np.round)
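Rounding in place keeps the columns as floats, though; if integer dtypes are also wanted, a cast is still needed afterwards. A minimal sketch, assuming the rounded df from above, where every value is now a whole number:

df = df.astype(int)  # safe because all values are whole numbers after rounding
print(df)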
I'm new to python and am struggling to figure something out. I'm doing some data analysis on an invoice database in pandas with columns of $ amounts, credits, date, and a unique company ID for each package bought.
I want to run every unique company ID through a function that calculates the average spend rate of these credits based on the differences between package purchase dates. I have the basics figured out in my function, and it returns a series indexed to the original dataframe with the values of the average amount of credits spent each day between packages. However, I only have it working with one company ID at a time, and I don't know how to combine all of these different series for each company ID so I can correctly add a new column to my dataframe with this average credit spend value for each package. Here's my code so far:
def creditspend(mylist=[]):
    for i in mylist:
        a = df.loc[df['CompanyId'] == i]
        a = a.sort_values(by=['Date'], ascending=False)
        days = a.Date.diff().map(lambda x: abs(x.days))
        spend = a['Credits'] / days
        print(spend)
If I call
creditspend(mylist=[8, 15])
(with multiple inputs) it obviously does not work. What do I need to do to complete this function?
Thanks in advance.
apply() is a very useful method in pandas that applies a function to every row or column of a DataFrame.
So, if your DataFrame is df:
def creditspend(row):
    # some calculation code here
    return spend

df['spend_rate'] = df.apply(creditspend, axis=1)
(By default apply() works column-wise; the axis=1 keyword applies the function to each row instead.)
Consider a groupby aggregation by CompanyID. The below demonstrates with random data:
import numpy as np
import pandas as pd

np.random.seed(7182018)

df = pd.DataFrame({'CompanyID': np.random.choice(['julia', 'pandas', 'r', 'sas', 'stata', 'spss'], 50),
                   'Date': np.random.choice(pd.Series(pd.date_range('2018-01-01', freq='D', periods=180)), 50),
                   'Credits': np.random.uniform(0, 1000, 50)
                   }, columns=['Date', 'CompanyID', 'Credits'])
# SORT ONCE OUTSIDE OF PROCESSING
df = df.sort_values(by=['CompanyID', 'Date'], ascending=[True, False]).reset_index(drop=True)
def creditspend(g):
    g['days'] = g.Date.diff().map(lambda x: abs(x.days))
    g['spend'] = g['Credits'] / g['days']
    return g

grp_df = df.groupby('CompanyID').apply(creditspend)
Output
print(grp_df.head(20))
# Date CompanyID Credits days spend
# 0 2018-06-20 julia 280.522287 NaN NaN
# 1 2018-06-12 julia 985.009523 8.0 123.126190
# 2 2018-05-17 julia 892.308179 26.0 34.319545
# 3 2018-05-03 julia 97.410360 14.0 6.957883
# 4 2018-03-26 julia 480.206077 38.0 12.637002
# 5 2018-03-07 julia 78.892365 19.0 4.152230
# 6 2018-03-03 julia 878.671506 4.0 219.667877
# 7 2018-02-25 julia 905.172807 6.0 150.862135
# 8 2018-02-19 julia 970.016418 6.0 161.669403
# 9 2018-02-03 julia 669.073067 16.0 41.817067
# 10 2018-01-23 julia 636.926865 11.0 57.902442
# 11 2018-01-11 julia 790.107486 12.0 65.842291
# 12 2018-06-16 pandas 639.180696 NaN NaN
# 13 2018-05-21 pandas 565.432415 26.0 21.747401
# 14 2018-04-22 pandas 145.232115 29.0 5.008004
# 15 2018-04-13 pandas 379.964557 9.0 42.218284
# 16 2018-04-12 pandas 538.168690 1.0 538.168690
# 17 2018-03-20 pandas 783.572993 23.0 34.068391
# 18 2018-03-14 pandas 618.354489 6.0 103.059081
# 19 2018-02-10 pandas 859.278127 32.0 26.852441
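For completeness, a vectorized sketch of the same computation without apply, assuming the sorted df from above (diff works within each CompanyID group, and the first row per company stays NaN just like in the output):

# days between consecutive purchases within each company (dates are sorted newest-first)
df['days'] = df.groupby('CompanyID')['Date'].diff().dt.days.abs()
df['spend'] = df['Credits'] / df['days']
print(df.head(20))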