groupby weighted average and sum in pandas dataframe - python

I have a dataframe:
Out[78]:
contract month year buys adjusted_lots price
0 W Z 5 Sell -5 554.85
1 C Z 5 Sell -3 424.50
2 C Z 5 Sell -2 424.00
3 C Z 5 Sell -2 423.75
4 C Z 5 Sell -3 423.50
5 C Z 5 Sell -2 425.50
6 C Z 5 Sell -3 425.25
7 C Z 5 Sell -2 426.00
8 C Z 5 Sell -2 426.75
9 CC U 5 Buy 5 3328.00
10 SB V 5 Buy 5 11.65
11 SB V 5 Buy 5 11.64
12 SB V 5 Buy 2 11.60
I need the sum of adjusted_lots and the weighted average of price (weighted by adjusted_lots), grouped by all the other columns, i.e. grouped by (contract, month, year, buys).
A similar solution in R was achieved with the following dplyr code, but I have been unable to do the same in pandas.
> newdf = df %>%
select ( contract , month , year , buys , adjusted_lots , price ) %>%
group_by( contract , month , year , buys) %>%
summarise(qty = sum( adjusted_lots) , avgpx = weighted.mean(x = price , w = adjusted_lots) , comdty = "Comdty" )
> newdf
Source: local data frame [4 x 6]
contract month year comdty qty avgpx
1 C Z 5 Comdty -19 424.8289
2 CC U 5 Comdty 5 3328.0000
3 SB V 5 Comdty 12 11.6375
4 W Z 5 Comdty -5 554.8500
Is the same possible with groupby or any other approach?

EDIT: updated the aggregation so it works with recent versions of pandas
To apply multiple functions to a groupby object, pass named-aggregation tuples pairing the column the function applies to with the aggregation function:
# Define a lambda function to compute the weighted mean:
wm = lambda x: np.average(x, weights=df.loc[x.index, "adjusted_lots"])
# Define a dictionary with the functions to apply for a given column:
# the following is deprecated since pandas 0.20:
# f = {'adjusted_lots': ['sum'], 'price': {'weighted_mean' : wm} }
# df.groupby(["contract", "month", "year", "buys"]).agg(f)
# Groupby and aggregate with named aggregation [1]:
df.groupby(["contract", "month", "year", "buys"]).agg(adjusted_lots=("adjusted_lots", "sum"),
price_weighted_mean=("price", wm))
adjusted_lots price_weighted_mean
contract month year buys
C Z 5 Sell -19 424.828947
CC U 5 Buy 5 3328.000000
SB V 5 Buy 12 11.637500
W Z 5 Sell -5 554.850000
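The same named aggregation can equivalently be spelled with pd.NamedAgg (available since pandas 0.25), reusing the wm lambda defined above:
# Equivalent spelling using pd.NamedAgg; wm is the weighted-mean lambda defined above
df.groupby(["contract", "month", "year", "buys"]).agg(
    adjusted_lots=pd.NamedAgg(column="adjusted_lots", aggfunc="sum"),
    price_weighted_mean=pd.NamedAgg(column="price", aggfunc=wm),
)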
You can see more here:
http://pandas.pydata.org/pandas-docs/stable/groupby.html#applying-multiple-functions-at-once
and in a similar question here:
Apply multiple functions to multiple groupby columns
[1] : https://pandas.pydata.org/pandas-docs/stable/whatsnew/v0.25.0.html#groupby-aggregation-with-relabeling

Computing the weighted average via groupby(...).apply(...) can be very slow (roughly 100x slower than the approach below).
See my answer (and others) on this thread.
def weighted_average(df, data_col, weight_col, by_col):
    df['_data_times_weight'] = df[data_col] * df[weight_col]
    df['_weight_where_notnull'] = df[weight_col] * pd.notnull(df[data_col])
    g = df.groupby(by_col)
    result = g['_data_times_weight'].sum() / g['_weight_where_notnull'].sum()
    del df['_data_times_weight'], df['_weight_where_notnull']
    return result
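A sketch of how this could be called on the question's dataframe (column names assumed from the question):
# Assumes df and the column names from the question; weighted_average is the helper above
avgpx = weighted_average(df, 'price', 'adjusted_lots', ['contract', 'month', 'year', 'buys'])
qty = df.groupby(['contract', 'month', 'year', 'buys'])['adjusted_lots'].sum()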

Wouldn't it be a lot simpler to do this?
Multiply adjusted_lots * price into a new column "X".
Use groupby().sum() on the columns "X" and "adjusted_lots" to get the grouped dataframe df_grouped.
Compute the weighted average on df_grouped as df_grouped['X'] / df_grouped['adjusted_lots'].
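A minimal sketch of those three steps, assuming the column names from the question:
# 1. multiply lots by price
df['X'] = df['adjusted_lots'] * df['price']
# 2. grouped sums of X and adjusted_lots
df_grouped = df.groupby(['contract', 'month', 'year', 'buys'])[['X', 'adjusted_lots']].sum()
# 3. weighted average = sum(lots * price) / sum(lots)
df_grouped['weighted_avg_price'] = df_grouped['X'] / df_grouped['adjusted_lots']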

The solution that uses a dict of aggregation functions with renaming is deprecated (pandas 0.22 raises a FutureWarning) and will be removed in a future version of pandas:
FutureWarning: using a dict with renaming is deprecated and will be removed in a future version
return super(DataFrameGroupBy, self).aggregate(arg, *args, **kwargs)
Use a groupby apply and return a Series to rename columns as discussed in:
Rename result columns from Pandas aggregation ("FutureWarning: using a dict with renaming is deprecated")
def my_agg(x):
    names = {'weighted_ave_price': (x['adjusted_lots'] * x['price']).sum() / x['adjusted_lots'].sum()}
    return pd.Series(names, index=['weighted_ave_price'])
produces the same result:
df.groupby(["contract", "month", "year", "buys"]).apply(my_agg)
weighted_ave_price
contract month year buys
C Z 5 Sell 424.828947
CC U 5 Buy 3328.000000
SB V 5 Buy 11.637500
W Z 5 Sell 554.850000

With datar, you don't have to learn pandas APIs to transition your R code:
>>> from datar.all import f, tibble, c, rep, select, summarise, sum, weighted_mean, group_by
>>> df = tibble(
... contract=c('W', rep('C', 8), 'CC', rep('SB', 3)),
... month=c(rep('Z', 9), 'U', rep('V', 3)),
... year=5,
... buys=c(rep('Sell', 9), rep('Buy', 4)),
... adjusted_lots=[-5, -3, -2, -2, -3, -2, -3, -2, -2, 5, 5, 5, 2],
...     price=[554.85, 424.50, 424.00, 423.75, 423.50, 425.50, 425.25, 426.00, 426.75, 3328.00, 11.65, 11.64, 11.60]
... )
>>> df
contract month year buys adjusted_lots price
0 W Z 5 Sell -5 554.85
1 C Z 5 Sell -3 424.50
2 C Z 5 Sell -2 424.00
3 C Z 5 Sell -2 423.75
4 C Z 5 Sell -3 423.50
5 C Z 5 Sell -2 425.50
6 C Z 5 Sell -3 425.25
7 C Z 5 Sell -2 426.00
8 C Z 5 Sell -2 426.75
9 CC U 5 Buy 5 3328.00
10 SB V 5 Buy 5 11.65
11 SB V 5 Buy 5 11.64
12 SB V 5 Buy 2 11.60
>>> newdf = df >> \
... select(f.contract, f.month, f.year, f.buys, f.adjusted_lots, f.price) >> \
... group_by(f.contract, f.month, f.year, f.buys) >> \
... summarise(
... qty = sum(f.adjusted_lots),
... avgpx = weighted_mean(x = f.price , w = f.adjusted_lots),
... comdty = "Comdty"
... )
[2021-05-24 13:11:03][datar][ INFO] `summarise()` has grouped output by ['contract', 'month', 'year'] (override with `_groups` argument)
>>>
>>> newdf
contract month year buys qty avgpx comdty
0 C Z 5 Sell -19 424.828947 Comdty
1 CC U 5 Buy 5 3328.000000 Comdty
2 SB V 5 Buy 12 11.637500 Comdty
3 W Z 5 Sell -5 554.850000 Comdty
[Groups: ['contract', 'month', 'year'] (n=4)]
I am the author of the package. Feel free to submit issues if you have any questions.

ErnestScribbler's answer is much faster than the accepted solution. Here is a multivariate analogue:
def weighted_average(df, data_col, weight_col, by_col):
    '''Now data_col can be a list of variables'''
    df_data = df[data_col].multiply(df[weight_col], axis='index')
    df_weight = pd.notnull(df[data_col]).multiply(df[weight_col], axis='index')
    df_data[by_col] = df[by_col]
    df_weight[by_col] = df[by_col]
    result = df_data.groupby(by_col).sum() / df_weight.groupby(by_col).sum()
    return result
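As a hedged usage sketch on the question's data (data_col is now a list, so price is passed as a single-element list; column names assumed from the question):
# Hypothetical call using the question's column names; any list of value columns would work
result = weighted_average(df, ['price'], 'adjusted_lots', ['contract', 'month', 'year', 'buys'])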

I came across this thread when confronted with a similar problem. In my case, I wanted to generate a weighted quarterback rating in case more than one quarterback attempted a pass in a given NFL game.
I may change the code if I start running into significant performance issues as I scale. For now, I preferred squeezing my solution into the .agg function alongside other transforms. Happy to see if someone has a simpler solution to achieve the same end. Ultimately, I employed a closure pattern.
The magic of the closure approach, if this is an unfamiliar pattern to a future reader, is that I can still return a simple function to pandas' .agg() method, but I get to do so with some additional information preconfigured from the top-level factory function.
def weighted_mean_factory(*args, **kwargs):
    weights = kwargs.get('w').copy()
    def weighted_mean(x):
        x_mask = ~np.isnan(x)
        w = weights.loc[x.index]
        if all(v is False for v in x_mask):
            raise ValueError('there are no non-missing x variable values')
        return np.average(x[x_mask], weights=w[x_mask])
    return weighted_mean

res_df = df.groupby(['game_id', 'team'])\
    .agg(pass_player_cnt=('attempts', count_is_not_zero),
         completions=('completions', 'sum'),
         attempts=('attempts', 'sum'),
         pass_yds=('pass_yards', 'sum'),
         pass_tds=('pass_tds', 'sum'),
         pass_int=('pass_int', 'sum'),
         sack_taken=('sacks_taken', 'sum'),
         sack_yds_loss=('sack_yds_loss', 'sum'),
         longest_completion=('longest_completion', 'max'),
         qbr_w_avg=('qb_rating', weighted_mean_factory(x='qb_rating', w=df['attempts']))
         )
Some basic benchmarking stats on a DataFrame with the shape (5436, 31) are below and are not cause for concern on my end in terms of performance at this stage:
149 ms ± 4.75 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This combines the original approach by jrjc with the closure approach by MB. It has the advantage of being able to reuse the closure function.
import numpy as np
import pandas as pd

def group_weighted_mean_factory(df: pd.DataFrame, weight_col_name: str):
    # Ref: https://stackoverflow.com/a/69787938/
    def group_weighted_mean(x):
        try:
            return np.average(x, weights=df.loc[x.index, weight_col_name])
        except ZeroDivisionError:
            return np.average(x)
    return group_weighted_mean

df = ...  # Define
group_weighted_mean = group_weighted_mean_factory(df, "adjusted_lots")
g = df.groupby(...)  # Define
agg_df = g.agg({'price': group_weighted_mean})
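A hedged sketch of how the "Define" placeholders might be filled in, with df being the question's dataframe and the groupby keys assumed from the question:
# Assumed fill-in for the placeholders above, using the question's columns
group_weighted_mean = group_weighted_mean_factory(df, "adjusted_lots")
g = df.groupby(["contract", "month", "year", "buys"])
agg_df = g.agg({'price': group_weighted_mean})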

Related

Groupby, sum by month and calculate standard deviation divided by mean in Python

For df:
id Date ITEM_ID TYPE VALUE YearMonth
0 13710750 2019-07-01 SLM607 O 10 2019-07
1 13710760 2019-07-01 SLM607 O 10 2019-07
2 13710770 2019-07-03 SLM607 O 2 2019-07
3 13710780 2019-09-03 SLM607 O 5 2019-09
4 13667449 2019-08-02 887643 O 7 2019-08
5 13667450 2019-08-02 792184 O 1 2019-08
6 13728171 2019-09-17 SLM607 I 1 2019-09
7 13667452 2019-08-02 794580 O 3 2019-08
reproducible example:
data = {
"id": [
13710750,
13710760,
13710770,
13710780,
13667449,
13667450,
13728171,
13667452,
],
"Date": [
"2019-07-01",
"2019-07-01",
"2019-07-03",
"2019-09-03",
"2019-08-02",
"2019-08-02",
"2019-09-17",
"2019-08-02",
],
"ITEM_ID": [
"SLM607",
"SLM607",
"SLM607",
"SLM607",
"887643",
"792184",
"SLM607",
"794580",
],
"TYPE": ["O", "O", "O", "O", "O", "O", "I", "O"],
"YearMonth": [
"2019-07",
"2019-07",
"2019-07",
"2019-09",
"2019-08",
"2019-08",
"2019-09",
"2019-08",
],
"VALUE": [10, 10, 2, 5, 7, 1, 1, 3],
}
df = pd.DataFrame(data)
I would like to group df by ITEM_ID and sum VALUE for each month using YearMonth; if there is no data for a month, create a row with a 0 value for that month. The time period should be from 2019-07 to 2020-06, which is the financial year. Then I want to calculate the standard deviation of the monthly summed VALUE, divided by the mean. So each ITEM_ID would have one final value of standard deviation/mean for the year.
I did the first step with
df.groupby(['ITEM_ID', 'YearMonth']).sum().reset_index()
to calculate the monthly sum, but I'm not sure how to continue from here. Any idea is appreciated, thx.
Here is an example of how the standard deviation is calculated for each ITEM_ID.
Using ITEM_ID == SLM607 as an example:
Month Sum of VALUE
2019-07 22 (10 + 10 + 2)
2019-09 6 (5 + 1)
For other months from 2019-07-01 to 2020-06-30, we assume 0 for each month.
Hence, the standard deviation for ITEM_ID ==SLM607 would be the standard deviation of the list [22, 6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], which gives the result:
np.std([22, 6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]) = 6.155395104206463 (or 6.4291005073286 for sample std dev)
Apologies for any confusion caused in the original question. I'm trying to understand the reason for the difference in magnitude between the suggested solutions.
Your first step gets you the sums by object + month, but I'll just copy it here for completeness (note that I only sum VALUE since the ID's are numerically meaningless; you can group by them if you want to keep them around):
In [1]: by_month = df.groupby(['ITEM_ID', 'YearMonth'])['VALUE'].sum().reset_index()
In [2]: by_month
Out[2]:
ITEM_ID YearMonth VALUE
0 792184 2019-08 1
1 794580 2019-08 3
2 887643 2019-08 7
3 SLM607 2019-07 22
4 SLM607 2019-09 6
Now you want three values: the number of months being considered, then the std and mean of each item over that span. This gets a bit trickier if different objects were available over different durations, so I'll stick with the easier case of assuming all objects were available over the same duration, which I'll call months = 12 (you specify that the time range of interest is a whole year).
The equations for mean and standard deviation both assume we know the correct N (12, in your case). We could pad the dataframe with 0's to allow Pandas' built-in mean and std equations to work, or we could implement our own functions that perform the padding on-demand. The latter sounds simpler to me.
In [3]: months = 12
In [4]: def mean_over_months(data):
...: return np.mean(list(data) + [0] * (months - len(data)))
...:
In [5]: def std_over_months(data):
...: return np.std(list(data) + [0] * (months - len(data)))
...:
In [6]: by_month.groupby('ITEM_ID')['VALUE'].agg(mean_over_months)
Out[6]:
ITEM_ID
792184 0.083333
794580 0.250000
887643 0.583333
SLM607 2.333333
Name: VALUE, dtype: float64
Now just perform the desired computation:
In [7]: item_means = by_month.groupby('ITEM_ID')['VALUE'].agg(mean_over_months)
In [8]: item_stds = by_month.groupby('ITEM_ID')['VALUE'].agg(std_over_months)
In [9]: item_stds / item_means
Out[9]:
ITEM_ID
792184 3.464102
794580 3.464102
887643 3.464102
SLM607 2.755329
Name: VALUE, dtype: float64
df = pd.DataFrame(data)
# convert date string to datetime
df['Date'] = pd.to_datetime(df['Date'])
# create a date range for your fiscal year
dr = pd.date_range('2019-07-01', '2020-06-01', freq='M').to_period('M')
# groupby year, month and item id and then sum the value column
g = df.groupby([df['Date'].dt.to_period('M'), 'ITEM_ID'])['VALUE'].sum()
# reindex the grouped multiindex with your new fiscal year date range and your item ids
new_df = g.reindex(pd.MultiIndex.from_product([dr, g.index.levels[1]], names=['Date', 'ITEM_ID']), fill_value=0).to_frame()
# create a groupby object
new_g = new_df.groupby(level=1)['VALUE']
# std divided by the mean of the groupby object
new_g.std()/new_g.mean()
ITEM_ID
792184 3.316625
794580 3.316625
887643 3.316625
SLM607 2.631636
Name: VALUE, dtype: float64

Equivalent of 'mutate_at' dplyr function in Python pandas

Thank you in advance for the help.
I am looking to create multiple new columns in a pandas dataframe by dividing a subset of existing columns by another existing column, with the new columns named dynamically with a suffix. Below is dummy code illustrating the general gist of what I want to do, except the real case involves 25+ columns with various transformations.
R code
library(dplyr)
player = c('John','Peter','Michael')
min = c(20, 23, 35)
points = c(10,12,14)
rebounds = c(5,7,9)
assists = c(4,6,7)
df = data.frame(player,min,points,rebounds,assists)
df = df %>%
mutate_at(vars(points:assists),.funs=funs(per_min=./min))
Expected output
player min points rebounds assists points_per_min rebounds_per_min assists_per_min
1 John 20 10 5 4 0.5000000 0.2500000 0.2000000
2 Peter 23 12 7 6 0.5217391 0.3043478 0.2608696
3 Michael 35 14 9 7 0.4000000 0.2571429 0.2000000
I know that I can reproduce the above in pandas as follows:
import pandas as pd
data = pd.DataFrame({'player':['John','Peter','Michael'],
'min':[20,23,35],
'points':[10,12,14],
'rebounds':[5,7,9],
'assists':[4,6,7]})
df = pd.DataFrame(data)
df['points_per_minute'] = df['points']/df['min']
df['rebounds_per_minute'] = df['rebounds']/df['min']
df['assists_per_minute'] = df['assists']/df['min']
df.head()
player min points rebounds assists points_per_minute rebounds_per_minute assists_per_minute
0 John 20 10 5 4 0.500000 0.250000 0.20000
1 Peter 23 12 7 6 0.521739 0.304348 0.26087
2 Michael 35 14 9 7 0.400000 0.257143 0.20000
However, I have to do this for 25+ columns with different transformations, and explicitly naming every column and operation becomes rather cumbersome. Is there a pandas equivalent of this?
Similar to base R, assign by block of columns with basic arithmetic. Often base R translates better to Numpy/Pandas.
R
cols <- c("points", "rebounds", "assists")
df[paste0(cols, "_per_min")] <- df[cols] / df$min
Python pandas
cols = ["points", "rebounds", "assists"]
df[[col+'_per_min' for col in cols]] = df[cols].div(df['min'], axis='index')
Method 1:
Take the list of columns (if you don't have a list of columns and want to get all columns after the min column, use cols = df.iloc[:, df.columns.get_loc('min')+1:].columns):
cols = ['points', 'rebounds', 'assists']
Create a copy of the subset of those columns with df.loc[] and add_suffix('_per_minute'), then divide them by the min column:
m=df.loc[:,cols].add_suffix('_per_minute')
df[m.columns]=m.div(df['min'],axis=0)
print(df)
Method 2: concat:
cols=['points','rebounds','assists']
df=pd.concat([df,df.loc[:,cols].add_suffix('_per_minute').div(df['min'],axis=0)],axis=1)
Method 3:
Directly assign them with string formatting, using the same logic:
cols=['points','rebounds','assists']
df[[f"{i}_per_minute" for i in cols]]=df.loc[:,cols].div(df['min'],axis=0)
print(df)
player min points rebounds assists points_per_minute \
0 John 20 10 5 4 0.500000
1 Peter 23 12 7 6 0.521739
2 Michael 35 14 9 7 0.400000
rebounds_per_minute assists_per_minute
0 0.250000 0.20000
1 0.304348 0.26087
2 0.257143 0.20000
mutate_at is superseded by mutate and across.
Here is how you can do it in a dplyr way in python:
>>> from datar.all import c, f, tibble, mutate, across
>>>
>>> player = c('John','Peter','Michael')
>>> min = c(20, 23, 35)
>>> points = c(10,12,14)
>>> rebounds = c(5,7,9)
>>> assists = c(4,6,7)
>>>
>>> df = tibble(player,min,points,rebounds,assists)
>>>
>>> df = df >> mutate(
... # f.min passed to lambda as y
... across(f[f.points:f.assists], {'per_min': lambda x, y: x / y}, f.min)
... )
>>> df
player min points rebounds assists points_per_min rebounds_per_min assists_per_min
<object> <int64> <int64> <int64> <int64> <float64> <float64> <float64>
0 John 20 10 5 4 0.500000 0.250000 0.20000
1 Peter 23 12 7 6 0.521739 0.304348 0.26087
2 Michael 35 14 9 7 0.400000 0.257143 0.20000
I am the author of the datar package. Feel free to submit issues if you have any questions.
With the specific goal of making this feel more like dplyr, I really prefer method-chaining solutions because of their syntactic similarity to piped dplyr code.
This solution uses pandas.DataFrame.assign and dictionary unpacking.
updated_data = data.assign(**{f"{col}_per_minute": lambda x, col=col: x[col] / x["min"]
                              for col in ["points", "rebounds", "assists"]})
Note the col=col default argument: without it, every lambda would close over the same loop variable and all three new columns would end up computed from the last column in the list.

Correlation between two dataframes column with matched headers

I have two dataframes from excels which look like the below. The first dataframe has a multi-index header.
I am trying to find the correlation between each column in the dataframe with the corresponding dataframe based on the currency (i.e KRW, THB, USD, INR). At the moment, I am doing a loop to iterate through each column, matching by index and corresponding header before finding the correlation.
for stock_name in index_data.columns.get_level_values(0):
    stock_prices = index_data.xs(stock_name, level=0, axis=1)
    stock_prices = stock_prices.dropna()
    fx = currency_data[stock_prices.columns.get_level_values(1).values[0]]
    fx = fx[fx.index.isin(stock_prices.index)]
    merged_df = pd.merge(stock_prices, fx, left_index=True, right_index=True)
    merged_df[0].corr(merged_df[1])
Is there a more panda-ish way of doing this?
So you wish to find the correlation between the stock price and its related currency. (Or stock price correlation to all currencies?)
# dummy data
date_range = pd.date_range('2019-02-01', '2019-03-01', freq='D')
stock_prices = pd.DataFrame(
np.random.randint(1, 20, (date_range.shape[0], 4)),
index=date_range,
columns=[['BYZ6DH', 'BLZGSL', 'MBT', 'BAP'],
['KRW', 'THB', 'USD', 'USD']])
fx = pd.DataFrame(np.random.randint(1, 20, (date_range.shape[0], 3)),
index=date_range, columns=['KRW', 'THB', 'USD'])
This is what it looks like; calculating correlations on this data won't make much sense since it is random.
>>> print(stock_prices.head())
BYZ6DH BLZGSL MBT BAP
KRW THB USD USD
2019-02-01 15 10 19 19
2019-02-02 5 9 19 5
2019-02-03 19 7 18 10
2019-02-04 1 6 7 18
2019-02-05 11 17 6 7
>>> print(fx.head())
KRW THB USD
2019-02-01 15 11 10
2019-02-02 6 5 3
2019-02-03 13 1 3
2019-02-04 19 8 14
2019-02-05 6 13 2
Use apply to calculate the correlation between columns with the same currency.
def f(x, fx):
correlation = x.corr(fx[x.name[1]])
return correlation
correlation = stock_prices.apply(f, args=(fx,), axis=0)
>>> print(correlation)
BYZ6DH KRW -0.247529
BLZGSL THB 0.043084
MBT USD -0.471750
BAP USD 0.314969
dtype: float64

Multiple input and multiple output function application to Pandas DataFrame raises shape exception

I have a dataframe with 6 columns (excluding the index), 2 of which are relevant inputs to a function and that function has two outputs. I'd like to insert these outputs to the original dataframe as columns.
I'm following toto_tico's answer here. I'm copying for convenience (with slight modifications):
import pandas as pd
df = pd.DataFrame({"A": [10,20,30], "B": [20, 30, 10], "C": [10, 10, 10], "D": [1, 1, 1]})
def fab(row):
    return row['A'] * row['B'], row['A'] + row['B']

df['newcolumn'], df['newcolumn2'] = zip(*df.apply(fab, axis=1))
This code works without a problem. My code, however, doesn't. My dataframe has the following structure:
Date Station Insolation Daily Total Temperature(avg) Latitude
0 2011-01-01 Aksaray 1.7 72927.6 -0.025000 38.3705
1 2011-01-02 Aksaray 5.6 145874.7 2.541667 38.3705
2 2011-01-03 Aksaray 6.3 147197.8 6.666667 38.3705
3 2011-01-04 Aksaray 2.9 100350.9 5.312500 38.3705
4 2011-01-05 Aksaray 0.7 42138.7 4.639130 38.3705
The function I'm applying takes a row as input, and returns two values based on Latitude and Date. Here's that function:
import numpy as np

def h0(row):
    # Get a row from a dataframe, give back H0 and day length
    # Leap year must be taken into account
    # row['Latitude'] and row['Date'] are the relevant inputs
    # phi is taken in degrees; all angles in the formulas are assumed to be in degrees as well
    # numpy defaults to radians, however...
    gsc = 1367
    phi = np.deg2rad(row['Latitude'])
    date = row['Date']
    year = pd.DatetimeIndex([date]).year[0]
    month = pd.DatetimeIndex([date]).month[0]
    day = pd.DatetimeIndex([date]).day[0]
    if year % 4 == 0:
        B = (day - 1) * (360 / 366)
    else:
        B = (day - 1) * (360 / 365)
    B = np.deg2rad(B)
    delta = (0.006918 - 0.399912*np.cos(B) + 0.070257*np.sin(B)
             - 0.006758*np.cos(2*B) + 0.000907*np.sin(2*B)
             - 0.002697*np.cos(3*B) + 0.00148*np.sin(3*B))
    ws = np.arccos(-np.tan(phi) * np.tan(delta))
    daylength = (2 / 15) * np.rad2deg(ws)
    if year % 4 == 0:
        dayangle = np.deg2rad(360 * day / 366)
    else:
        dayangle = np.deg2rad(360 * day / 365)
    h0 = (24*3600*gsc/np.pi) * (1 + 0.033*np.cos(dayangle)) * (np.cos(phi)*np.cos(delta)*np.sin(ws) +
                                                               ws*np.sin(phi)*np.sin(delta))
    return h0, daylength
When I use
ak['h0'], ak['N'] = zip(*ak.apply(h0, axis=1))
I get the error: Shape of passed values is (1816, 2), indices imply (1816, 6)
I'm unable to find what's wrong with my code. Can you help?
So, as mentioned in my previous comment: if you'd like to create multiple NEW columns in the DataFrame based on multiple EXISTING columns of the DataFrame, you can create new fields on the row Series WITHIN your h0 function.
Here's an overly simple example to showcase what I mean:
>>> def simple_func(row):
...     row['new_column1'] = row.lat * 1000
...     row['year'] = row.date.year
...     row['month'] = row.date.month
...     row['day'] = row.date.day
...     return row
...
>>> df
date lat
0 2018-01-29 1000
1 2018-01-30 5000
>>> df.date
0 2018-01-29
1 2018-01-30
Name: date, dtype: datetime64[ns]
>>> df.apply(simple_func, axis=1)
date lat new_column1 year month day
0 2018-01-29 1000 1000000 2018 1 29
1 2018-01-30 5000 5000000 2018 1 30
In your case, inside your h0 function, set row['h0'] = h0 and row['N'] = daylength, then return row. Then, when it comes to calling the function on the DataFrame, your line changes to ak = ak.apply(h0, axis=1).
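A minimal sketch of that change, with the body of h0 elided (the placeholder values stand in for the results computed in the question's function):
def h0(row):
    # ... same computations as in the question, producing h0_value and daylength ...
    h0_value, daylength = 0.0, 0.0  # placeholders for the real results
    row['h0'] = h0_value
    row['N'] = daylength
    return row

ak = ak.apply(h0, axis=1)  # each returned row becomes a row of the new DataFrame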

pandas dataframe column means [duplicate]

I am new to Python and Pandas. I have a panda dataframe with monthly columns ranging from 2000 (2000-01) to 2016 (2016-06).
I want to find the average of every three months and assign it to a new quarterly column (2000q1). I know I can do the following:
df['2000q1'] = df[['2000-01', '2000-02', '2000-03']].mean(axis=1)
df['2000q2'] = df[['2000-04', '2000-05', '2000-06']].mean(axis=1)
.
.
.
df['2016q2'] = df[['2016-04', '2016-05', '2016-06']].mean(axis=1)
But, this is very tedious. I appreciate it if someone helps me find a better way.
You can use groupby on columns:
df.groupby(np.arange(len(df.columns))//3, axis=1).mean()
Or, those can be converted to datetime. You can use resample:
df.columns = pd.to_datetime(df.columns)
df.resample('Q', axis=1).mean()
Here's a demo:
cols = pd.date_range('2000-01', '2000-06', freq='MS')
cols = cols.strftime('%Y-%m')
cols
Out:
array(['2000-01', '2000-02', '2000-03', '2000-04', '2000-05', '2000-06'],
dtype='<U7')
df = pd.DataFrame(np.random.randn(10, 6), columns=cols)
df
Out:
2000-01 2000-02 2000-03 2000-04 2000-05 2000-06
0 -1.263798 0.251526 0.851196 0.159452 1.412013 1.079086
1 -0.909071 0.685913 1.394790 -0.883605 0.034114 -1.073113
2 0.516109 0.452751 -0.397291 -0.050478 -0.364368 -0.002477
3 1.459609 -1.696641 0.457822 1.057702 -0.066313 -0.910785
4 -0.482623 1.388621 0.971078 -0.038535 0.033167 0.025781
5 -0.016654 1.404805 0.100335 -0.082941 -0.418608 0.588749
6 0.684735 -2.007105 0.552615 1.969356 -0.614634 0.021459
7 0.382475 0.965739 -1.826609 -0.086537 -0.073538 -0.534753
8 1.548773 -0.157250 0.494819 -1.631516 0.627794 -0.398741
9 0.199049 0.145919 0.711701 0.305382 -0.118315 -2.397075
First alternative:
df.groupby(np.arange(len(df.columns))//3, axis=1).mean()
Out:
0 1
0 -0.053692 0.883517
1 0.390544 -0.640868
2 0.190523 -0.139108
3 0.073597 0.026868
4 0.625692 0.006805
5 0.496162 0.029067
6 -0.256585 0.458727
7 -0.159465 -0.231609
8 0.628781 -0.467487
9 0.352223 -0.736669
Second alternative:
df.columns = pd.to_datetime(df.columns)
df.resample('Q', axis=1).mean()
Out:
2000-03-31 2000-06-30
0 -0.053692 0.883517
1 0.390544 -0.640868
2 0.190523 -0.139108
3 0.073597 0.026868
4 0.625692 0.006805
5 0.496162 0.029067
6 -0.256585 0.458727
7 -0.159465 -0.231609
8 0.628781 -0.467487
9 0.352223 -0.736669
You can assign this to a DataFrame:
res = df.resample('Q', axis=1).mean()
Change column names as you like:
res = res.rename(columns=lambda col: '{}q{}'.format(col.year, col.quarter))
res
Out:
2000q1 2000q2
0 -0.053692 0.883517
1 0.390544 -0.640868
2 0.190523 -0.139108
3 0.073597 0.026868
4 0.625692 0.006805
5 0.496162 0.029067
6 -0.256585 0.458727
7 -0.159465 -0.231609
8 0.628781 -0.467487
9 0.352223 -0.736669
And attach this to your current DataFrame by:
pd.concat([df, res], axis=1)
