I'm currently working with a dataframe that is routinely grouped into a MultiIndex of three or more levels, with fiscal Quarter always as the innermost level. A few calculated fields are then added to the frame as year/year percent changes within each unique index, easily obtained with a groupby over every level up to but not including Quarter, plus pd.pct_change().
Unfortunately, this only returns accurate values if a value exists for every possible Quarter. If I have a point for 2021Q1 and my next is 2021Q4, I need to pad in rows of zeroes for 2021Q2 and 2021Q3 so that the change at 2021Q4 is not computed as 2021Q4/2021Q1. My problem is that I often have at least six and up to fifty unique values at each level of the MultiIndex, and to pad correctly I need as many rows as there are unique combinations, which quickly becomes a combinatorial explosion that makes the code unusable.
My question: Is it possible to take a Quarter/Quarter value respecting the MultiIndex without padding out every missing quarter on the index?
Reproducible example:
import datetime as dt
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product([['Canine', 'Feline'],
                                  ['Chihuahua', 'Samoyed', 'Shorthair'],
                                  [dt.datetime(2021,4,1), dt.datetime(2021,7,1),
                                   dt.datetime(2021,10,1), dt.datetime(2022,1,1)]],
                                 names=['species', 'breed', 'cyq'])
data = pd.DataFrame(index=idx)
data.loc[:, 'paid'] = np.random.randint(100, 200, 24)

correct_ex = data.drop([('Canine','Shorthair'), ('Feline','Chihuahua'), ('Feline','Samoyed')])
correct_ex.loc[('Canine','Samoyed',dt.datetime(2021,7,1)), 'paid'] = 0
incorrect_ex = correct_ex.drop([('Canine','Samoyed',dt.datetime(2021,7,1))])

correct_ex.loc[:, 'paid_change'] = correct_ex.groupby(['species','breed'])['paid'].pct_change()
correct_ex = correct_ex.drop([('Canine','Samoyed',dt.datetime(2021,7,1))])

incorrect_ex.loc[:, 'paid_change'] = incorrect_ex.groupby(['species','breed'])['paid'].pct_change()
Correct Results:
Incorrect Results:
The correct_ex frame above contains the values that I would want to see if Samoyeds had no data for 7/1/2021, but the only way to get it is to keep a row with paid value 0 for that date. The incorrect_ex frame above is what I get if I attempt pct_change without the added row.
Thanks for the help!
You can calculate the percentage changes in a groupby-apply and mask rows with np.inf wherever the gap between consecutive cyq dates (the last level of the index) is more than 93 days:
import numpy as np
import pandas as pd
import datetime as dt

idx = pd.MultiIndex.from_product([['Canine', 'Feline'],
                                  ['Chihuahua', 'Samoyed', 'Shorthair'],
                                  [dt.datetime(2021,4,1), dt.datetime(2021,7,1),
                                   dt.datetime(2021,10,1), dt.datetime(2022,1,1)]],
                                 names=['species', 'breed', 'cyq'])

np.random.seed(0)
data = pd.DataFrame({'paid': np.random.randint(100, 200, 24)}, index=idx)
data = data.drop([('Canine','Samoyed',dt.datetime(2021,7,1))])
data = data.drop([('Canine','Shorthair'), ('Feline','Chihuahua'), ('Feline','Samoyed')])

# Within each (species, breed) group: plain pct_change, then overwrite with inf
# wherever the gap to the previous cyq date exceeds 93 days (i.e. a quarter is missing)
data['paid_change'] = data.groupby(['species','breed']).paid.apply(
    lambda x: x.pct_change().mask(
        x.index.get_level_values(-1).to_series().diff().gt(pd.Timedelta(days=93)),
        np.inf
    )
)
Result:
paid paid_change
species breed cyq
Canine Chihuahua 2021-04-01 144 NaN
2021-07-01 147 0.020833
2021-10-01 164 0.115646
2022-01-01 167 0.018293
Samoyed 2021-04-01 167 NaN
2021-10-01 183 inf
2022-01-01 121 -0.338798
Feline Shorthair 2021-04-01 181 NaN
2021-07-01 137 -0.243094
2021-10-01 125 -0.087591
2022-01-01 177 0.416000
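An alternative worth sketching (not part of the answer above; the column and variable names here are made up): shift the cyq level itself forward by one quarter and divide the aligned values, so a row whose previous quarter is missing gets NaN instead of a ratio against the wrong quarter.
prev = data['paid'].copy()
# Relabel each row's cyq three months later, so dividing aligns every row
# with the value from exactly one quarter earlier (NaN where that row is absent).
prev.index = prev.index.set_levels(
    prev.index.levels[-1] + pd.DateOffset(months=3), level='cyq'
)
data['paid_change_alt'] = data['paid'] / prev - 1
Because the division aligns on the full (species, breed, cyq) MultiIndex, no padding rows are needed; changing the offset to months=12 would give a year/year comparison instead.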
Related
In Python, how can I reference the previous row and calculate something against it? Specifically, I am working with dataframes in pandas - I have a dataframe full of stock price information that looks like this:
Date Close Adj Close
251 2011-01-03 147.48 143.25
250 2011-01-04 147.64 143.41
249 2011-01-05 147.05 142.83
248 2011-01-06 148.66 144.40
247 2011-01-07 147.93 143.69
Here is how I created this dataframe:
import pandas

url = 'http://ichart.finance.yahoo.com/table.csv?s=IBM&a=00&b=1&c=2011&d=11&e=31&f=2011&g=d&ignore=.csv'
data = pandas.read_csv(url)

## now I sorted the data frame ascending by date
data = data.sort_values(by='Date')
Starting with row number 2, or in this case, I guess it's 250 (PS - is that the index?), I want to calculate the difference between 2011-01-03 and 2011-01-04 for every entry in this dataframe. I believe the appropriate way is to write a function that takes the current row, figures out the previous row, and calculates the difference between them, then use the pandas apply function to update the dataframe with the value.
Is that the right approach? If so, should I be using the index to determine the difference? (note - I'm still in python beginner mode, so index may not be the right term, nor even the correct way to implement this)
I think you want to do something like this:
In [26]: data
Out[26]:
Date Close Adj Close
251 2011-01-03 147.48 143.25
250 2011-01-04 147.64 143.41
249 2011-01-05 147.05 142.83
248 2011-01-06 148.66 144.40
247 2011-01-07 147.93 143.69
In [27]: data.set_index('Date').diff()
Out[27]:
Close Adj Close
Date
2011-01-03 NaN NaN
2011-01-04 0.16 0.16
2011-01-05 -0.59 -0.58
2011-01-06 1.61 1.57
2011-01-07 -0.73 -0.71
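If you would rather keep the original frame and just add the day-over-day changes as extra columns (a small variant of the same idea; the _chg suffix is made up):
data = data.sort_values('Date')
# diff() works row by row, so sort by date first, then join the differenced
# price columns back onto the original frame under suffixed names.
changes = data[['Close', 'Adj Close']].diff().add_suffix('_chg')
data = data.join(changes)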
To calculate the difference of a single column, here is what you can do.
df=
A B
0 10 56
1 45 48
2 26 48
3 32 65
We want to compute the row-wise difference in A only, and then keep the rows where that difference is less than 15.
df['A_dif'] = df['A'].diff()
df=
A B A_dif
0 10 56 NaN
1 45 48 35.0
2 26 48 19.0
3 32 65 6.0
df = df[df['A_dif']<15]
df=
A B A_dif
3 32 65 6.0
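If you also want to keep the leading row whose difference is NaN (a comparison against NaN is False, so the plain filter drops it), a small variation is:
# df_small is just an illustrative name; keep rows with a difference below 15
# plus the rows where the difference is undefined.
df_small = df[df['A_dif'].lt(15) | df['A_dif'].isna()]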
I don't know pandas, and I'm pretty sure it has something specific for this; however, I'll give you the pure-Python solution, that might be of some help even if you need to use pandas:
import csv
import urllib.request

# This retrieves the CSV file and loads it into a list, converting
# all numeric values to floats
url = 'http://ichart.finance.yahoo.com/table.csv?s=IBM&a=00&b=1&c=2011&d=11&e=31&f=2011&g=d&ignore=.csv'
with urllib.request.urlopen(url) as resp:
    rows = list(csv.reader(resp.read().decode().splitlines(), delimiter=','))[1:]

# We sort the output list so the records are ordered by date
cleaned = sorted([[r[0]] + [float(v) for v in r[1:]] for r in rows])

for i, row in enumerate(cleaned):  # enumerate() yields two-tuples: (<index>, <item>)
    if i == 0:
        continue  # the first row has no previous row to compare against
    # Difference of each numeric field with the same field in the previous row
    print(row[0], [row[j] - cleaned[i - 1][j] for j in range(1, 7)])
I have two dataframes with particular data that I need to merge.
Date Greenland Antarctica
0 2002.29 0.00 0.00
1 2002.35 68.72 19.01
2 2002.62 -219.32 -59.36
3 2002.71 -242.83 46.55
4 2002.79 -209.12 63.31
.. ... ... ...
189 2020.79 -4928.78 -2542.18
190 2020.87 -4922.47 -2593.06
191 2020.96 -4899.53 -2751.98
192 2021.04 -4838.44 -3070.67
193 2021.12 -4900.56 -2755.94
[194 rows x 3 columns]
and
Date Mean Sea Level
0 1993.011526 -38.75
1 1993.038692 -39.77
2 1993.065858 -39.61
3 1993.093025 -39.64
4 1993.120191 -38.72
... ... ...
1021 2020.756822 62.83
1022 2020.783914 62.93
1023 2020.811006 62.98
1024 2020.838098 63.00
1025 2020.865190 63.00
[1026 rows x 2 columns]
My ultimate goal is to pull out the data from the second dataframe (the Mean Sea Level column) that comes from (roughly) the same time frame as the dates in the first dataframe, and then merge that back into the first dataframe.
However, every way I can think of for selecting out certain dates involves first converting all of the dates in the Date columns of both dataframes to something pandas recognizes, and I have been unable to figure out how to do that. I found some code (below) that can convert individual dates to a more common date format, but it has been difficult to apply it to all of the dates in the dataframe. I'm also not sure I can then get pandas to convert that into a date format it recognizes.
from datetime import datetime

def fraction2datetime(year_fraction: float) -> datetime:
    year = int(year_fraction)
    fraction = year_fraction - year
    first = datetime(year, 1, 1)
    aux = datetime(year + 1, 1, 1)
    return first + (aux - first)*fraction
I also looked at pandas.to_datetime but I don't see a way to have it read the format the dates are initially in.
So does anyone have any guidance on this? Firstly with the conversion of dates, but also with the task of picking out the dates from the second dataframe if possible. Any help would be greatly appreciated.
Suppose you have these two dataframes:
df1:
Date Greenland Antarctica
0 2020.79 -4928.78 -2542.18
1 2020.87 -4922.47 -2593.06
2 2020.96 -4899.53 -2751.98
3 2021.04 -4838.44 -3070.67
4 2021.12 -4900.56 -2755.94
df2:
Date Mean Sea Level
0 2020.756822 62.83
1 2020.783914 62.93
2 2020.811006 62.98
3 2020.838098 63.00
4 2020.865190 63.00
To convert the dates:
from datetime import datetime
import pandas as pd

def fraction2datetime(year_fraction: float) -> datetime:
    year = int(year_fraction)
    fraction = year_fraction - year
    first = datetime(year, 1, 1)
    aux = datetime(year + 1, 1, 1)
    return first + (aux - first) * fraction

df1["Date"] = df1["Date"].apply(fraction2datetime)
df2["Date"] = df2["Date"].apply(fraction2datetime)
print(df1)
print(df2)
Prints:
Date Greenland Antarctica
0 2020-10-16 03:21:35.999999 -4928.78 -2542.18
1 2020-11-14 10:04:47.999997 -4922.47 -2593.06
2 2020-12-17 08:38:24.000001 -4899.53 -2751.98
3 2021-01-15 14:23:59.999999 -4838.44 -3070.67
4 2021-02-13 19:11:59.999997 -4900.56 -2755.94
Date Mean Sea Level
0 2020-10-03 23:55:28.012795 62.83
1 2020-10-13 21:54:02.073603 62.93
2 2020-10-23 19:52:36.134397 62.98
3 2020-11-02 17:51:10.195198 63.00
4 2020-11-12 15:49:44.255992 63.00
For the join, you can use pd.merge_asof. For example, this will join on the nearest date within a 30-day tolerance (you can tweak these values as you want):
x = pd.merge_asof(
    df1, df2, on="Date", tolerance=pd.Timedelta(days=30), direction="nearest"
)
print(x)
Will print:
Date Greenland Antarctica Mean Sea Level
0 2020-10-16 03:21:35.999999 -4928.78 -2542.18 62.93
1 2020-11-14 10:04:47.999997 -4922.47 -2593.06 63.00
2 2020-12-17 08:38:24.000001 -4899.53 -2751.98 NaN
3 2021-01-15 14:23:59.999999 -4838.44 -3070.67 NaN
4 2021-02-13 19:11:59.999997 -4900.56 -2755.94 NaN
You can specify a timestamp format in to_datetime(). Otherwise, if you need a custom function, you can use apply(). If performance is a concern, be aware that apply() does not perform as well as built-in pandas methods.
To combine the DataFrames you can use an outer join on the date column, as sketched below.
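A minimal sketch of that outer-join idea (not the answerer's exact code): it assumes both frames already have a datetime Date column, e.g. via the fraction2datetime conversion above, and buckets the dates by month (an arbitrary choice here) so that keys from the two sources can actually match.
# 'Month', 'sea_level_monthly' and 'merged' are illustrative names.
df1['Month'] = df1['Date'].dt.to_period('M')
df2['Month'] = df2['Date'].dt.to_period('M')
# Average the (more frequent) sea-level readings per month before joining.
sea_level_monthly = df2.groupby('Month', as_index=False)['Mean Sea Level'].mean()
merged = pd.merge(df1, sea_level_monthly, on='Month', how='outer').sort_values('Month')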
I have a dataframe as follows:
Date Group Value Duration
2018-01-01 A 20 30
2018-02-01 A 10 60
2018-01-01 B 15 180
2018-02-01 B 30 210
2018-03-01 B 25 238
2018-01-01 C 10 235
2018-02-01 C 15 130
I want to use groupby dynamically, i.e. I do not wish to hard-code the column names on which the groupby is applied. Specifically, I want to compute the mean of each Group over the last two months.
As we can see, not every Group has data for every date in the dataframe above. So the tasks are as follows:
Add a dummy row based on the date, in case data for Date = 2018-03-01 is not present for a Group (e.g. add rows for A and C).
Perform the groupby to compute the mean using the last two months' Value and Duration.
So my approach is as follows:
For Task 1:
s = pd.MultiIndex.from_product([df['Date'].unique(), df['Group'].unique()], names=['Date','Group'])
df = df.set_index(['Date','Group']).reindex(s).reset_index().sort_values(['Group','Date']).ffill(axis=0)
Can we have a better method for achieving the 'add row' task? The reference is found here.
For Task 2:
def cond_grp_by(df, grp_by: str, cols_list: list, *args):
    df_grp = df.groupby(grp_by)[cols_list].transform(lambda x: x.tail(2).mean())
    return df_grp

df_cols = df.columns.tolist()
df = cond_grp_by(dealer_f_filt, 'Group', df_cols)
The reference for the above approach is found here.
The above code throws IndexError: Column(s) ['index','Group','Date','Value','Duration'] already selected
The expected output is
Group  Value  Duration
A      10     60        <--------- since a row is added for 2018-03-01 with the
B      27.5   224                  same values as 2018-02-01, we are
C      15     130       <--------- computing the mean of the last two values
Use GroupBy.agg instead of transform if you need the output filled with aggregate values:
def cond_grp_by(df, grp_by: str, cols_list: list, *args):
    return df.groupby(grp_by)[cols_list].agg(lambda x: x.tail(2).mean()).reset_index()

df = cond_grp_by(df, 'Group', df_cols)
print(df)
Group Value Duration
0 A 10.0 60.0
1 B 27.5 224.0
2 C 15.0 130.0
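A hypothetical usage matching the printed output above, with the value columns listed explicitly so that non-numeric columns (Date, and the Group key itself) are kept out of the aggregation:
# 'out' is an illustrative name; only the numeric columns are aggregated.
out = cond_grp_by(df, 'Group', ['Value', 'Duration'])
print(out)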
If you need the last value per group, use GroupBy.last:
def cond_grp_by(df, grp_by: str, cols_list: list, *args):
    return df.groupby(grp_by)[cols_list].last().reset_index()

df = cond_grp_by(df, 'Group', df_cols)
Given a pandas dataframe in the following format:
import pandas as pd

toy = pd.DataFrame({
    'id': [1, 2, 3,
           1, 2, 3,
           1, 2, 3],
    'date': ['2015-05-13', '2015-05-13', '2015-05-13',
             '2016-02-12', '2016-02-12', '2016-02-12',
             '2018-07-23', '2018-07-23', '2018-07-23'],
    'my_metric': [395, 634, 165,
                  144, 305, 293,
                  23, 395, 242]
})
# Make sure 'date' has datetime format
toy.date = pd.to_datetime(toy.date)
The my_metric column contains some (random) metric of which I wish to compute a time-dependent moving average, conditional on the column id and within some time interval that I specify myself. I will refer to this time interval as the "lookback time"; it could be 5 minutes or 2 years. To determine which observations are included in the lookback calculation, we use the date column (which could be the index if you prefer).
To my frustration, I have discovered that such a procedure is not easily performed using pandas built-ins, since I need to perform the calculation conditionally on id while, at the same time, only including observations within the lookback time (checked using the date column). Hence, the output dataframe should consist of one row for each id-date combination, with the my_metric column now being the average of all observations contained within the lookback time (e.g. 2 years, including today's date).
For clarity, I have included a figure with the desired output format (apologies for the oversized figure) when using a 2-year lookback time:
I have a solution but it does not make use of specific pandas built-in functions and is likely sub-optimal (combination of list comprehension and a single for-loop). The solution I am looking for will not make use of a for-loop, and is thus more scalable/efficient/fast.
Thank you!
Calculating lookback time: (Current_year - 2 years)
from dateutil.relativedelta import relativedelta
from dateutil import parser
import datetime
In [1691]: dt = '2018-01-01'
In [1695]: dt = parser.parse(dt)
In [1696]: lookback_time = dt - relativedelta(years=2)
Now, filter the dataframe on lookback time and calculate rolling average
In [1722]: toy['new_metric'] = ((toy.my_metric + toy[toy.date > lookback_time].groupby('id')['my_metric'].shift(1))/2).fillna(toy.my_metric)
In [1674]: toy.sort_values('id')
Out[1674]:
date id my_metric new_metric
0 2015-05-13 1 395 395.0
3 2016-02-12 1 144 144.0
6 2018-07-23 1 23 83.5
1 2015-05-13 2 634 634.0
4 2016-02-12 2 305 305.0
7 2018-07-23 2 395 350.0
2 2015-05-13 3 165 165.0
5 2016-02-12 3 293 293.0
8 2018-07-23 3 242 267.5
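If a genuinely time-windowed average per id is needed (rather than the pairwise average of the current and previous observation shown above), a hedged sketch using pandas' offset-based rolling windows, assuming a 2-year (roughly 730-day) lookback and the toy frame from the question, could look like this:
# 'rolled' and 'metric_2y_avg' are illustrative names.
# Sort so each id's dates are monotonic, take a 730-day rolling mean of
# my_metric per id over the date index, then merge the result back.
toy = toy.sort_values(['id', 'date'])
rolled = (
    toy.set_index('date')
       .groupby('id')['my_metric']
       .rolling('730D')
       .mean()
       .rename('metric_2y_avg')
       .reset_index()
)
toy = toy.merge(rolled, on=['id', 'date'], how='left')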
So, after some tinkering I found an answer that generalizes adequately. I used a slightly different 'toy' dataframe, slightly more relevant to my case.
Consider now the following code:
# Define a custom function which groups by time (using the index)
def rolling_average(x, dt):
    xt = x.sort_index().groupby(lambda x: x.time()).rolling(window=dt).mean()
    xt.index = xt.index.droplevel(0)
    return xt

dt = '730D'  # rolling average window: 730 days = 2 years

# Group by the 'id' column
g = toy.groupby('id')

# Apply the custom function
df = g.apply(rolling_average, dt=dt)

# Massage the data to appropriate format
df.index = df.index.droplevel(0)
df = df.reset_index().drop_duplicates(keep='last', subset=['id', 'date'])
The result is as expected.
From: Fill in missing row values in pandas dataframe
I have the following dataframe and would like to fill in missing values.
mukey hzdept_r hzdepb_r sandtotal_r silttotal_r
425897 0 61
425897 61 152 5.3 44.7
425911 0 30 30.1 54.9
425911 30 74 17.7 49.8
425911 74 84
I want each missing value to be the average of the values corresponding to that mukey. In this case, for example, the first row's missing values should be the averages of sandtotal_r and silttotal_r for mukey == 425897. pandas fillna doesn't seem to do the trick. Any help?
While the code works for the sample dataframe in that example, it is failing on the larger dataset I have uploaded here: https://www.dropbox.com/s/w3m0jppnq74op4c/www004.csv?dl=0
import pandas as pd
df = pd.read_csv('www004.csv')
# CSV file is here: https://www.dropbox.com/s/w3m0jppnq74op4c/www004.csv?dl=0
df1 = df.set_index('mukey')
df1.fillna(df.groupby('mukey').mean(),inplace=True)
df1.reset_index()
I get the error: InvalidIndexError. Why is it not working?
Use combine_first. It allows you to patch up the missing data in the left dataframe with the matching data in the right dataframe, based on the same index.
In this case, df1 is on the left and df2, the means, as the one on the right.
In [48]: df = pd.read_csv('www004.csv')
...: df1 = df.set_index('mukey')
...: df2 = df.groupby('mukey').mean()
In [49]: df1.loc[426178,:]
Out[49]:
hzdept_r hzdepb_r sandtotal_r silttotal_r claytotal_r om_r
mukey
426178 0 36 NaN NaN NaN 72.50
426178 36 66 NaN NaN NaN 72.50
426178 66 152 42.1 37.9 20 0.25
In [50]: df2.loc[426178,:]
Out[50]:
hzdept_r 34.000000
hzdepb_r 84.666667
sandtotal_r 42.100000
silttotal_r 37.900000
claytotal_r 20.000000
om_r 48.416667
Name: 426178, dtype: float64
In [51]: df3 = df1.combine_first(df2)
...: df3.loc[426178,:]
Out[51]:
hzdept_r hzdepb_r sandtotal_r silttotal_r claytotal_r om_r
mukey
426178 0 36 42.1 37.9 20 72.50
426178 36 66 42.1 37.9 20 72.50
426178 66 152 42.1 37.9 20 0.25
Note that the following rows still won't have values in the resulting df3
426162
426163
426174
426174
426255
because they were single rows to begin with, hence, .mean() doesn't mean anything to them (eh, see what I did there?).
The problem is the duplicate index values. When you use df1.fillna(df2), if you have multiple NaN entries in df1 where both the index and the column label are the same, pandas will get confused when trying to slice df1, and throw that InvalidIndexError.
Your sample dataframe works because even though you have duplicate index values there, only one of each index value is null. Your larger dataframe contains null entries that share both the index value and column label in some cases.
To make this work, you can do this one column at a time. For some reason, when operating on a series, pandas will not get confused by multiple entries of the same index, and will simply fill the same value in each one. Hence, this should work:
import pandas as pd

df = pd.read_csv('www004.csv')
# CSV file is here: https://www.dropbox.com/s/w3m0jppnq74op4c/www004.csv?dl=0
df1 = df.set_index('mukey')
grouped = df.groupby('mukey').mean()

for col in ['sandtotal_r', 'silttotal_r']:
    df1[col] = df1[col].fillna(grouped[col])

df1.reset_index()
NOTE: Be careful using the combine_first method if you ever have "extra" data in the dataframe you're filling from. The combine_first function will include ALL indices from the dataframe you're filling from, even if they're not present in the original dataframe.
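As one further alternative (a minimal sketch, not taken from either answer above), the per-mukey filling can also be done with groupby().transform, which sidesteps the duplicate-index alignment problem entirely:
import pandas as pd

df = pd.read_csv('www004.csv')
value_cols = ['sandtotal_r', 'silttotal_r']  # illustrative choice of the columns to fill
# Replace each NaN with the mean of that column within the same mukey group.
df[value_cols] = df.groupby('mukey')[value_cols].transform(lambda s: s.fillna(s.mean()))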