How to groupby and make calculations on consecutive rows of the group? - python

For example, let's consider the following dataframe:
Restaurant_ID Floor Cust_Arrival_Datetime
0 100 1 2021-11-17 17:20:00
1 100 1 2021-11-17 17:22:00
2 100 1 2021-11-17 17:25:00
3 100 1 2021-11-17 17:30:00
4 100 1 2021-11-17 17:50:00
5 100 1 2021-11-17 17:51:00
6 100 2 2021-11-17 17:25:00
7 100 2 2021-11-17 18:00:00
8 100 2 2021-11-17 18:50:00
9 100 2 2021-11-17 18:56:00
For the above toy example we can consider that the Cust_Arrival_Datetime is sorted as well as grouped by store and floor (as seen above). How could we, now, calculate things such as the median time interval that passes for a customer arrival for each unique store and floor group?
The desired output would be:
Restaurant_ID Floor Median Arrival Interval(in minutes)
0 100 1 3
1 100 2 35
The Median Arrival Interval is calculated as follows: for the first floor of the store we can see that by the time the second customer arrives 2 minutes have already passed since the first one arrived. Similarly, 3 minutes have elapsed between the 2nd and the 3rd customer and 5 minutes for the 3rd and 4th customer etc. The median for floor 1 and restaurant 100 would be 3.
I have tried something like this:
df.groupby(['Restaurant_ID', 'Floor'].apply(lambda row: row['Customer_Arrival_Datetime'].shift() - row['Customer_Arrival_Datetime']).apply(np.median)
but this does not work!
Any help is welcome!

IIUC, you can do
(df.groupby(['Restaurant_ID', 'Floor'])['Cust_Arrival_Datetime']
.agg(lambda x: x.diff().dt.total_seconds().median()/60))
and you get
Restaurant_ID Floor
100 1 3.0
2 35.0
Name: Cust_Arrival_Datetime, dtype: float64
You can chain reset_index() onto this if you need the group keys as regular columns.
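For anyone who wants to run the approach end to end, here is a minimal sketch that rebuilds the toy frame from the question and flattens the result with reset_index. The output column name median_interval_min is my own choice for illustration, not something from the question.

```python
import pandas as pd

# Toy data copied from the question.
df = pd.DataFrame({
    "Restaurant_ID": [100] * 10,
    "Floor": [1] * 6 + [2] * 4,
    "Cust_Arrival_Datetime": pd.to_datetime([
        "2021-11-17 17:20:00", "2021-11-17 17:22:00", "2021-11-17 17:25:00",
        "2021-11-17 17:30:00", "2021-11-17 17:50:00", "2021-11-17 17:51:00",
        "2021-11-17 17:25:00", "2021-11-17 18:00:00", "2021-11-17 18:50:00",
        "2021-11-17 18:56:00",
    ]),
})

# Within each (restaurant, floor) group: gaps between consecutive arrivals,
# converted to minutes, then the median of those gaps.
result = (
    df.groupby(["Restaurant_ID", "Floor"])["Cust_Arrival_Datetime"]
      .agg(lambda x: x.diff().dt.total_seconds().median() / 60)
      .reset_index(name="median_interval_min")  # hypothetical column name
)
print(result)
```

The first diff in each group is NaT, and median() skips it, so the groups do not leak into each other.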

Consider the following data frame:
df = pd.DataFrame({
'group': [1,1,1,2,2,2],
'time': pd.to_datetime(
['14:14', '14:17', '14:25', '17:29', '17:40','17:43']
)
})
Suppose, you'd like to apply a range of transformations:
def stats(group):
    diffs = group.diff().dt.total_seconds()/60
    return {
        'min': diffs.min(),
        'mean': diffs.mean(),
        'median': diffs.median(),
        'max': diffs.max()
    }
Then you simply have to apply these:
>>> df.groupby('group')['time'].agg(stats).apply(pd.Series)
min mean median max
group
1 3.0 5.5 5.5 8.0
2 3.0 7.0 7.0 11.0

Related

Historical Volatility from Prices of many different bonds in same column

I have a csv file with bid/ask prices of many bonds (using ISIN identifiers) for the past 1 yr. Using these historical prices, I'm trying to calculate the historical volatility for each bond. Although it should be typically an easy task, the issue is not all bonds have exactly same number of days of trading price data, while they're all in same column and not stacked. Hence if I need to calculate a rolling std deviation, I can't choose a standard rolling window of 252 days for 1 yr.
The data set has this format-
BusinessDate
ISIN
Bid
Ask
Date 1
ISIN1
P1
P2
Date 2
ISIN1
P1
P2
Date 252
ISIN1
P1
P2
Date 1
ISIN2
P1
P2
Date 2
ISIN2
P1
P2
......
& so on.
My current code is as follows-
vol_df = pd.read_csv('hist_prices.csv')
vol_df['BusinessDate'] = pd.to_datetime(vol_df['BusinessDate'])
vol_df['Mid Price'] = vol_df[['Bid', 'Ask']].mean(axis = 1)
vol_df['log_return'] = vol_df.groupby('ISIN')['Mid Price'].apply(lambda x: np.log(x) - np.log(x.shift(1)))
vol_df['hist_vol'] = vol_df['log_return'].std() * np.sqrt(252)
The last line of code seems to be giving all NaN values in the column. This is most likely because the operation for calculating the std deviation is happening on the same row number and not for a list of numbers. I tried replacing the last line to use rolling_std-
vol_df.set_index('BusinessDate').groupby('ISIN').rolling(window = 1, freq = 'A').std()['log_return']
But this doesn't help either. It gives 2 numbers for each ISIN. I also tried to use pivot() to place the ISINs in columns and BusinessDate as index, and the Prices as "values". But it gives an error. Also I've close to 9,000 different ISINs and hence putting them in columns to calculate std() for each column may not be the best way. Any clues on how I can sort this out?
I was able to resolve this in a crude way-
vol_df_2 = vol_df.groupby('ISIN')['log_return'].std()
vol_df_3 = vol_df_2.to_frame()
vol_df_3.rename(columns = {'log_return':'daily_std'}, inplace = True)
The first line returns a Series of standard deviations that is still named 'log_return', so the second and third lines convert it into a dataframe and rename the column to 'daily_std'. The annual vol can then be calculated by multiplying by sqrt(252).
If anyone has a better way to do it in the same dataframe instead of creating a series, that'd be great.
OK, this almost works now.
It still needs some math per ISIN to work out the rolling period. I just used 3 and 2 in my example; you probably need to count how many trading days each ISIN has in the year and fix the window at that number per ISIN.
You then need to merge the data back. The output actually has errors because it is updating a copy, but it shows roughly what I was looking for. I couldn't get the merge working; I'm sure someone who knows more could fix it from this point.
import numpy as np
import pandas as pd
toy_data={'BusinessDate': ['10/5/2020','10/6/2020','10/7/2020','10/8/2020','10/9/2020',
'10/12/2020','10/13/2020','10/14/2020','10/15/2020','10/16/2020',
'10/5/2020','10/6/2020','10/7/2020','10/8/2020'],
'ISIN': [1,1,1,1,1, 1,1,1,1,1, 2,2,2,2],
'Bid': [0.295,0.295,0.295,0.295,0.295,
0.296, 0.296,0.297,0.298,0.3,
2.5,2.6,2.71,2.8],
'Ask': [0.301,0.305,0.306,0.307,0.308,
0.315,0.326,0.337,0.348,0.37,
2.8,2.7,2.77,2.82]}
#vol_df = pd.read_csv('hist_prices.csv')
vol_df = pd.DataFrame(toy_data)
vol_df['BusinessDate'] = pd.to_datetime(vol_df['BusinessDate'])
vol_df['Mid Price'] = vol_df[['Bid', 'Ask']].mean(axis = 1)
vol_df['log_return'] = vol_df.groupby('ISIN')['Mid Price'].apply(lambda x: np.log(x) - np.log(x.shift(1)))
vol_df.dropna(subset = ['log_return'], inplace=True)
# do some math here to calculate how many days you want to roll for an ISIN
# maybe count how many days over a 1 year period exist???
# not really sure how you'd miss days unless stuff just doesnt trade
# (but I don't need to understand it anyway)
rolling = {1: 3, 2: 2}
for isin in vol_df['ISIN'].unique():
    roll = rolling[isin]
    print(f'isin={isin}, roll={roll}')
    df_single = vol_df[vol_df['ISIN']==isin]
    df_single['rolling'] = df_single['log_return'].rolling(roll).std()
    # i can't get the right syntax to merge data back, but this shows it
    vol_df[isin, 'rolling'] = df_single['rolling']
    print(df_single)
print(vol_df)
which outputs (minus the warning errors):
isin=1, roll=3
BusinessDate ISIN Bid Ask Mid Price log_return rolling
1 2020-10-06 1 0.295 0.305 0.3000 0.006689 NaN
2 2020-10-07 1 0.295 0.306 0.3005 0.001665 NaN
3 2020-10-08 1 0.295 0.307 0.3010 0.001663 0.002901
4 2020-10-09 1 0.295 0.308 0.3015 0.001660 0.000003
5 2020-10-12 1 0.296 0.315 0.3055 0.013180 0.006650
6 2020-10-13 1 0.296 0.326 0.3110 0.017843 0.008330
7 2020-10-14 1 0.297 0.337 0.3170 0.019109 0.003123
8 2020-10-15 1 0.298 0.348 0.3230 0.018751 0.000652
9 2020-10-16 1 0.300 0.370 0.3350 0.036478 0.010133
isin=2, roll=2
BusinessDate ISIN Bid ... log_return (1, rolling) rolling
11 2020-10-06 2 2.60 ... 2.220446e-16 NaN NaN
12 2020-10-07 2 2.71 ... 3.339828e-02 NaN 0.023616
13 2020-10-08 2 2.80 ... 2.522656e-02 NaN 0.005778
[3 rows x 8 columns]
BusinessDate ISIN Bid ... log_return (1, rolling) (2, rolling)
1 2020-10-06 1 0.295 ... 6.688988e-03 NaN NaN
2 2020-10-07 1 0.295 ... 1.665279e-03 NaN NaN
3 2020-10-08 1 0.295 ... 1.662511e-03 0.002901 NaN
4 2020-10-09 1 0.295 ... 1.659751e-03 0.000003 NaN
5 2020-10-12 1 0.296 ... 1.317976e-02 0.006650 NaN
6 2020-10-13 1 0.296 ... 1.784313e-02 0.008330 NaN
7 2020-10-14 1 0.297 ... 1.910886e-02 0.003123 NaN
8 2020-10-15 1 0.298 ... 1.875055e-02 0.000652 NaN
9 2020-10-16 1 0.300 ... 3.647821e-02 0.010133 NaN
11 2020-10-06 2 2.600 ... 2.220446e-16 NaN NaN
12 2020-10-07 2 2.710 ... 3.339828e-02 NaN 0.023616
13 2020-10-08 2 2.800 ... 2.522656e-02 NaN 0.005778
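One way to keep everything in the same dataframe, sketched here under my own assumptions, is groupby(...).transform, which returns a result aligned to the original rows so no merge is needed. The ISIN labels and prices below are made up for illustration. The window of 252 with min_periods=2 is also an assumption: it lets bonds with fewer than 252 observations use whatever history they have.

```python
import numpy as np
import pandas as pd

# Made-up data: two bonds with different numbers of trading days.
vol_df = pd.DataFrame({
    "ISIN": ["A"] * 5 + ["B"] * 4,
    "Mid Price": [100.0, 101.0, 100.5, 102.0, 101.5, 50.0, 50.5, 49.5, 50.0],
})

# Log returns computed within each ISIN, aligned to the original rows.
vol_df["log_return"] = (
    vol_df.groupby("ISIN")["Mid Price"].transform(lambda s: np.log(s).diff())
)

# One standard deviation per ISIN, broadcast back onto every row of that ISIN.
vol_df["daily_std"] = vol_df.groupby("ISIN")["log_return"].transform("std")
vol_df["hist_vol"] = vol_df["daily_std"] * np.sqrt(252)

# Rolling variant: up to 252 observations, but at least 2, per ISIN.
vol_df["rolling_vol"] = (
    vol_df.groupby("ISIN")["log_return"]
          .transform(lambda s: s.rolling(252, min_periods=2).std())
    * np.sqrt(252)
)
print(vol_df)
```

Because transform preserves the original index, the new columns line up row by row without any explicit join.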

Add a row for missing period and for the corresponding period calculate the average of last 3 Months

I am trying to write a code which adds missing periods to the dataframe and calculates their respective averages. Refer to the below example:
Invoice Date Amount
9 01/2020 227500
4 02/2020 56000
0 03/2020 22000
1 05/2020 25000
5 06/2020 75000
2 07/2020 27000
6 08/2020 48000
3 09/2020 35000
7 10/2020 115000
8 12/2020 85000
In the above dataframe, we see that there's a record missing for '11/2020'. I am trying to add a record for the period 11/2020 and calculate its mean from the last three months, i.e., if 11/2020 is missing, take the amounts of 12/2020, 10/2020 and 9/2020, calculate their mean, and append it to the dataframe.
Expected output:
Invoice Date Amount
10 01/2020 227500.00
4 02/2020 56000.00
0 03/2020 22000.00
5 04/2020 75000.00
1 05/2020 25000.00
6 06/2020 48000.00
2 07/2020 27000.00
7 08/2020 115000.00
3 09/2020 35000.00
8 10/2020 77000.00
11 11/2020 65666.67
9 12/2020 85000.00
Please note that, I am able to arrive at the above result with the following code:
import pandas as pd
FundAdmin = {
'Invoice Date': ['03/2020', '05/2020', '07/2020', '09/2020', '02/2020', '04/2020', '06/2020', '08/2020', '10/2020', '12/2020',
'01/2020'
],
'Amount': [22000, 25000, 27000, 35000, 56000, 75000, 48000, 115000, 77000, 85000, 227500]
}
expected_dates = ['01/2020', '02/2020', '03/2020', '04/2020', '05/2020', '06/2020', '07/2020', '08/2020', '09/2020', '10/2020', '11/2020',
'12/2020'
]
df = pd.DataFrame(FundAdmin, columns = ['Invoice Date', 'Amount'])
current_dates = df['Invoice Date']
missing_dates = list(set(expected_dates) - set(current_dates))
sorted_df = df.sort_values(by = 'Invoice Date')
for i in missing_dates:
    Top_3_Rows = sorted_df.tail(3)  # print(Top_3_Rows)
    Top_3_Rows_Amount = round(Top_3_Rows.mean(), 2)
    CalcDF = {
        'Invoice Date': i,
        'Amount': float(Top_3_Rows_Amount)
    }
    FullDF = df.append(CalcDF, ignore_index = True)
print(FullDF)
However, my code is not able to handle the calculation for missing records in the middle of the dataframe. Meaning, it adds missing period to dataframe, but is not able to pick up the values of the previous 3months and it is adding the same mean amount to all the missing periods. Example: If there's a record for 4/2020 missing, code should be able to add a new record for 4/2020 and assign the value of the mean generated out of 1/2020,2/2020 and 3/2020 to 4/2020. Instead, it is assigning the Mean value of other missing period. Please refer to the below:
Expected Output (if both 11/2020 and 4/2020 are missing):
Invoice Date Amount
10 01/2020 227500.00
4 02/2020 56000.00
0 03/2020 22000.00
5 04/2020 101833.33 <---- New Record Inserted for 4/2020 through the calculation the mean for 3/2020,2/2020,1/2020
1 05/2020 25000.00
6 06/2020 48000.00
2 07/2020 27000.00
7 08/2020 115000.00
3 09/2020 35000.00
8 10/2020 77000.00
11 11/2020 65666.67 <---- New Record Inserted for 11/2020 through the calculation the mean for 12/2020,10/2020,9/2020
9 12/2020 85000.00
My Output (if both 11/2020 and 4/2020 are missing):
Invoice Date Amount
10 01/2020 227500.00
4 02/2020 56000.00
0 03/2020 22000.00
5 04/2020 65666.67 <--- Value same as 11/2020
1 05/2020 25000.00
6 06/2020 48000.00
2 07/2020 27000.00
7 08/2020 115000.00
3 09/2020 35000.00
8 10/2020 77000.00
11 11/2020 65666.67 <--- This works fine.
9 12/2020 85000.00
From my observation, my code cannot fetch the three preceding records when the missing period falls in the middle of the dataframe: because I am using the tail() method, it always fetches the records for 9/2020, 10/2020 and 12/2020, calculates their mean and assigns the same value to 4/2020. I am a complete beginner in Python, and any help in resolving this issue would be greatly appreciated.
Would this work for you?
import numpy as np
import pandas as pd
from datetime import datetime, timedelta
from random import randint
df_len = 100
df = pd.DataFrame({
'Invoice': [randint(1, 10) for _ in range(df_len)],
'Dates' : [(datetime.today() - pd.DateOffset(months=mnths_ago)).date()
for mnths_ago in range(df_len)],
'Amount': [randint(1, 100000) for _ in range(df_len)],
})
# Drop 10 random rows
drop_indices = np.random.choice(df.index, 10, replace=False)
df = df.drop(drop_indices)
df
Invoice Dates Amount
0 1 2020-05-19 23797
1 6 2020-04-19 54101
2 10 2020-03-19 91522
3 5 2020-02-19 48762
4 1 2020-01-19 54497
.. ... ... ...
93 1 2012-08-19 56834
94 10 2012-07-19 21382
95 2 2012-06-19 33056
96 1 2012-05-19 93336
98 7 2012-03-19 12406
from dateutil import relativedelta
def get_prev_mean(date):
    return df[:df.loc[df.Dates == date].index[0]].tail(3)['Amount'].mean()
r = relativedelta.relativedelta(df.Dates.min(), df.Dates.max())
n_months = -(r.years * 12) + r.months
all_months = [(df.Dates.max() - pd.DateOffset(months=mnths_ago)).date() for mnths_ago in range(n_months)]
missing_months = [mnth for mnth in all_months if mnth in list(df.Dates)]
dct = {mnth: get_prev_mean(mnth) for mnth in missing_months}
to_merge = pd.DataFrame(data=dct.values(), index=dct.keys()).reset_index()
to_merge.columns = ['Dates', 'Amount']
out = pd.concat([df, to_merge], sort=False).sort_values(by='Dates').reset_index(drop=True)
out
Invoice Dates Amount
0 7.0 2012-03-19 12406.0
1 1.0 2012-05-19 93336.0
2 2.0 2012-06-19 33056.0
3 10.0 2012-07-19 21382.0
4 1.0 2012-08-19 56834.0
.. ... ... ...
171 10.0 2020-03-19 91522.0
172 NaN 2020-04-19 23797.0
173 6.0 2020-04-19 54101.0
174 NaN 2020-05-19 NaN
175 1.0 2020-05-19 23797.0
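As an alternative, here is a minimal sketch for the asker's month/year data that reindexes onto a full monthly range and fills each gap from the three rows before it. It assumes "last 3 months" means the three periods strictly before the missing one, so 11/2020 comes out as the mean of 08/2020 to 10/2020 rather than the 65666.67 shown in the question, which included 12/2020. A filled value also feeds into any later gap that follows it.

```python
import numpy as np
import pandas as pd

# The asker's data with 04/2020 and 11/2020 missing.
df = pd.DataFrame({
    "Invoice Date": ["03/2020", "05/2020", "07/2020", "09/2020", "02/2020",
                     "06/2020", "08/2020", "10/2020", "12/2020", "01/2020"],
    "Amount": [22000, 25000, 27000, 35000, 56000,
               48000, 115000, 77000, 85000, 227500],
})

# Parse "MM/YYYY" into monthly periods and index the amounts by them.
periods = pd.to_datetime(df["Invoice Date"], format="%m/%Y").dt.to_period("M")
amounts = df.set_index(periods)["Amount"].astype(float).sort_index()

# Reindex onto every month between the first and last period; gaps become NaN.
full = pd.period_range(amounts.index.min(), amounts.index.max(), freq="M")
values = amounts.reindex(full).to_numpy().copy()

# Fill each gap with the mean of the (up to) three preceding rows.
for i in range(len(values)):
    if np.isnan(values[i]):
        values[i] = round(values[max(0, i - 3):i].mean(), 2)

out = pd.DataFrame({"Invoice Date": full.strftime("%m/%Y"), "Amount": values})
print(out)
```

Walking the array in order is what lets each gap see only rows before it, which is the part tail(3) could not do.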

check if each user has consecutive dates in a python 3 pandas dataframe

Imagine there is a dataframe:
id date balance_total transaction_total
0 1 01/01/2019 102.0 -1.0
1 1 01/02/2019 100.0 -2.0
2 1 01/03/2019 100.0 NaN
3 1 01/04/2019 100.0 NaN
4 1 01/05/2019 96.0 -4.0
5 2 01/01/2019 200.0 -2.0
6 2 01/02/2019 100.0 -2.0
7 2 01/04/2019 100.0 NaN
8 2 01/05/2019 96.0 -4.0
here is the create dataframe command:
import pandas as pd
import numpy as np
users=pd.DataFrame(
[
{'id':1,'date':'01/01/2019', 'transaction_total':-1, 'balance_total':102},
{'id':1,'date':'01/02/2019', 'transaction_total':-2, 'balance_total':100},
{'id':1,'date':'01/03/2019', 'transaction_total':np.nan, 'balance_total':100},
{'id':1,'date':'01/04/2019', 'transaction_total':np.nan, 'balance_total':100},
{'id':1,'date':'01/05/2019', 'transaction_total':-4, 'balance_total':np.nan},
{'id':2,'date':'01/01/2019', 'transaction_total':-2, 'balance_total':200},
{'id':2,'date':'01/02/2019', 'transaction_total':-2, 'balance_total':100},
{'id':2,'date':'01/04/2019', 'transaction_total':np.nan, 'balance_total':100},
{'id':2,'date':'01/05/2019', 'transaction_total':-4, 'balance_total':96}
]
)
How could I check whether each id has consecutive dates or not? I tried the "shift" idea from "Calculating time difference between two rows", but it doesn't seem to work:
df['index_col'] = df.index
for id in df['id'].unique():
    # create an empty QA dataframe
    column_names = ["Delta"]
    df_qa = pd.DataFrame(columns = column_names)
    df_qa['Delta']=(df['index_col'] - df['index_col'].shift(1))
    if (df_qa['Delta'].iloc[1:] != 1).any() is True:
        print('id ' + id +' might have non-consecutive dates')
        # doesn't print any account => Each Customer's Daily Balance has Consecutive Dates
        break
Ideal output:
it should print id 2 might have non-consecutive dates
Thank you!
Use groupby and diff:
df["date"] = pd.to_datetime(df["date"],format="%m/%d/%Y")
df["difference"] = df.groupby("id")["date"].diff()
print (df.loc[df["difference"]>pd.Timedelta(1, unit="d")])
#
id date transaction_total balance_total difference
7 2 2019-01-04 NaN 100.0 2 days
Use DataFrameGroupBy.diff with Series.dt.days, compare with greater than 1, and filter only the id column with DataFrame.loc:
users['date'] = pd.to_datetime(users['date'])
i = users.loc[users.groupby('id')['date'].diff().dt.days.gt(1), 'id'].tolist()
print (i)
[2]
for val in i:
    print( f'id {val} might have non-consecutive dates')
id 2 might have non-consecutive dates
First step is to parse date:
users['date'] = pd.to_datetime(users.date)
Then add a shifted column on the id and date columns:
users['id_shifted'] = users.id.shift(1)
users['date_shifted'] = users.date.shift(1)
The difference between date and date_shifted columns is of interest:
>>> users.date - users.date_shifted
0 NaT
1 1 days
2 1 days
3 1 days
4 1 days
5 -4 days
6 1 days
7 2 days
8 1 days
dtype: timedelta64[ns]
You can now query the DataFrame for what you want:
users[(users.id_shifted == users.id) & (users.date - users.date_shifted != pd.Timedelta(days=1))]
That is, consecutive lines of the same user with a date difference != 1 day.
This solution does assume the data is sorted by (id, date).
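If the rows are not guaranteed to be sorted, a small variation is to sort first and let groupby do the per-id differencing, so no row is ever compared with a row from another id. The sketch below uses a deliberately shuffled copy of the question's id and date columns. The variable names ordered, gaps and flagged are my own.

```python
import pandas as pd

# Same ids and dates as in the question, deliberately out of order.
users = pd.DataFrame({
    "id": [2, 1, 1, 2, 1, 2, 1, 2, 1],
    "date": ["01/05/2019", "01/02/2019", "01/01/2019", "01/01/2019",
             "01/03/2019", "01/04/2019", "01/05/2019", "01/02/2019",
             "01/04/2019"],
})
users["date"] = pd.to_datetime(users["date"], format="%m/%d/%Y")

# Sort within each id, then take differences inside each group only.
ordered = users.sort_values(["id", "date"])
gaps = ordered.groupby("id")["date"].diff()

# An id is flagged if any of its gaps is longer than one day.
flagged = sorted(ordered.loc[gaps > pd.Timedelta(days=1), "id"].unique().tolist())
for val in flagged:
    print(f"id {val} might have non-consecutive dates")
```

The first row of each id has a NaT gap, which compares as False and is therefore never flagged.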

How to generate discrete data to pass into a contour plot using pandas and matplotlib?

I have two sets of continuous data that I would like to pass into a contour plot. The x-axis would be time, the y-axis would be mass, and the z-axis would be frequency (as in how many times that data point appears). However, most data points are not identical but rather very similar. Thus, I suspect it's easiest to discretize both the x-axis and y-axis.
Here's the data I currently have:
INPUT
import pandas as pd
df = pd.read_excel('data.xlsx')
df['Dates'].head(5)
df['Mass'].head(5)
OUTPUT
13 2003-05-09
14 2003-09-09
15 2010-01-18
16 2010-11-21
17 2012-06-29
Name: Date, dtype: datetime64[ns]
13 2500.0
14 3500.0
15 4000.0
16 4500.0
17 5000.0
Name: Mass, dtype: float64
I'd like to convert the data such that it groups up data points within the year (ex: all datapoints taken in 2003) and it groups up data points within different levels of mass (ex: all datapoints between 3000-4000 kg). Next, the code would count how many data points are within each of these blocks and pass that as the z-axis.
Ideally, I'd also like to be able to adjust the levels of slices. Ex: grouping points up every 100kg instead of 1000kg, or passing a custom list of levels that aren't equally distributed. How would I go about doing this?
I think the function you are looking for is pd.cut
import pandas as pd
import numpy as np
import datetime
n = 10
scale = 1e3
Min = 0
Max = 1e4
np.random.seed(6)
Start = datetime.datetime(2000, 1, 1)
Dates = np.array([Start + datetime.timedelta(days=i*180) for i in range(n)])
Mass = np.random.rand(n)*10000
df = pd.DataFrame(index = Dates, data = {'Mass':Mass})
print(df)
gives you:
Mass
2000-01-01 8928.601514
2000-06-29 3319.798053
2000-12-26 8212.291231
2001-06-24 416.966257
2001-12-21 1076.566799
2002-06-19 5950.520642
2002-12-16 5298.173622
2003-06-14 4188.074286
2003-12-11 3354.078493
2004-06-08 6225.194322
if you want to group your Masses by say 1000, or implement your own custom bins, you can do this:
Bins,Labels=np.arange(Min,Max+.1,scale),(np.arange(Min,Max,scale))+(scale)/2
EqualBins = pd.cut(df['Mass'],bins=Bins,labels=Labels)
df.insert(1,'Equal Bins',EqualBins)
Bins,Labels=[0,1000,5000,10000],['Small','Medium','Big']
CustomBins = pd.cut(df['Mass'],bins=Bins,labels=Labels)
df.insert(2,'Custom Bins',CustomBins)
If you want to just show the year, month, etc it is very simple:
df['Year'] = df.index.year
df['Month'] = df.index.month
but you can also do custom date ranges if you like:
Bins=[datetime.datetime(1999, 12, 31),datetime.datetime(2000, 9, 1),
datetime.datetime(2002, 1, 1),datetime.datetime(2010, 9, 1)]
Labels = ['Early','Middle','Late']
CustomDateBins = pd.cut(df.index,bins=Bins,labels=Labels)
df.insert(3,'Custom Date Bins',CustomDateBins)
print(df)
This yields something like what you want:
Mass Equal Bins Custom Bins Custom Date Bins Year Month
2000-01-01 8928.601514 8500.0 Big Early 2000 1
2000-06-29 3319.798053 3500.0 Medium Early 2000 6
2000-12-26 8212.291231 8500.0 Big Middle 2000 12
2001-06-24 416.966257 500.0 Small Middle 2001 6
2001-12-21 1076.566799 1500.0 Medium Middle 2001 12
2002-06-19 5950.520642 5500.0 Big Late 2002 6
2002-12-16 5298.173622 5500.0 Big Late 2002 12
2003-06-14 4188.074286 4500.0 Medium Late 2003 6
2003-12-11 3354.078493 3500.0 Medium Late 2003 12
2004-06-08 6225.194322 6500.0 Big Late 2004 6
The .groupby function is probably of interest to you as well:
yeargroup = df.groupby(df.index.year).mean()
massgroup = df.groupby(df['Equal Bins']).count()
print(yeargroup)
print(massgroup)
Mass Year Month
2000 6820.230266 2000.0 6.333333
2001 746.766528 2001.0 9.000000
2002 5624.347132 2002.0 9.000000
2003 3771.076389 2003.0 9.000000
2004 6225.194322 2004.0 6.000000
Mass Custom Bins Custom Date Bins Year Month
Equal Bins
500.0 1 1 1 1 1
1500.0 1 1 1 1 1
2500.0 0 0 0 0 0
3500.0 2 2 2 2 2
4500.0 1 1 1 1 1
5500.0 2 2 2 2 2
6500.0 1 1 1 1 1
7500.0 0 0 0 0 0
8500.0 2 2 2 2 2
9500.0 0 0 0 0 0
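To get from binned columns to the 2D grid of counts that a contour plot needs, one option is numpy's histogram2d, which is a different tool from the pd.cut approach above but takes the same kind of bin edges. The sketch below uses the five rows shown in the question. The edge choices (one-year and 1000 kg bins) are assumptions and can be replaced by any increasing list of edges, evenly spaced or not.

```python
import numpy as np
import pandas as pd

# The five sample rows from the question.
df = pd.DataFrame({
    "Dates": pd.to_datetime(["2003-05-09", "2003-09-09", "2010-01-18",
                             "2010-11-21", "2012-06-29"]),
    "Mass": [2500.0, 3500.0, 4000.0, 4500.0, 5000.0],
})

# Bin edges: one bin per year and one bin per 1000 kg (both assumptions).
year_edges = np.arange(2003, 2014)        # 2003, 2004, ..., 2013
mass_edges = np.arange(2000, 6001, 1000)  # 2000, 3000, ..., 6000

# counts[i, j] = number of points in year bin i and mass bin j.
counts, _, _ = np.histogram2d(
    df["Dates"].dt.year.to_numpy(),
    df["Mass"].to_numpy(),
    bins=[year_edges, mass_edges],
)

# Bin centres, usable as the x and y coordinates of the grid.
year_centers = (year_edges[:-1] + year_edges[1:]) / 2
mass_centers = (mass_edges[:-1] + mass_edges[1:]) / 2
print(counts)
```

With matplotlib, the grid could then be drawn with something like plt.contourf(year_centers, mass_centers, counts.T). The transpose is needed because contourf expects rows to follow the y-axis.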

Forward filling missing dates into Python Pandas Dataframe

I have a Panda's dataframe that is filled as follows:
ref_date tag
1/29/2010 1
2/26/2010 3
3/31/2010 4
4/30/2010 4
5/31/2010 1
6/30/2010 3
8/31/2010 1
9/30/2010 4
12/31/2010 2
Note how there are missing months (i.e. 7, 10, 11) in the data. I want to fill in the missing data through a forward filling method so that it looks like this:
ref_date tag
1/29/2010 1
2/26/2010 3
3/31/2010 4
4/30/2010 4
5/31/2010 1
6/30/2010 3
7/30/2010 3
8/31/2010 1
9/30/2010 4
10/29/2010 4
11/30/2010 4
12/31/2010 2
The tag of the missing date will have the tag of the previous. All dates represent the last business day of the month.
This is what I tried to do:
idx = pd.date_range(start='1/29/2010', end='12/31/2010', freq='BM')
df.ref_date.index = pd.to_datetime(df.ref_date.index)
df = df.reindex(index=[idx], columns=[ref_date], method='ffill')
It's giving me the error:
TypeError: Cannot compare type 'Timestamp' with type 'int'
where pd is pandas and df is the dataframe.
I'm new to Pandas Dataframe, so any help would be appreciated!
You were very close. You just need to set the dataframe's index to ref_date, reindex it to the business month-end index while specifying ffill as the method, then reset the index and rename the column back to the original:
# First ensure the dates are Pandas Timestamps.
df['ref_date'] = pd.to_datetime(df['ref_date'])
# Create a monthly index.
idx_monthly = pd.date_range(start='1/29/2010', end='12/31/2010', freq='BM')
# Set ref_date as the index, reindex to the monthly index with forward fill, then restore the column.
>>> (df
.set_index('ref_date')
.reindex(idx_monthly, method='ffill')
.reset_index()
.rename(columns={'index': 'ref_date'}))
ref_date tag
0 2010-01-29 1.0
1 2010-02-26 3.0
2 2010-03-31 4.0
3 2010-04-30 4.0
4 2010-05-31 1.0
5 2010-06-30 3.0
6 2010-07-30 3.0
7 2010-08-31 1.0
8 2010-09-30 4.0
9 2010-10-29 4.0
10 2010-11-30 4.0
11 2010-12-31 2.0
Thanks to the previous person that answered this question but deleted his answer. I got the solution:
df['ref_date'] = pd.to_datetime(df['ref_date'])
idx = pd.date_range(start='1/29/2010', end='12/31/2010', freq='BM')
df = df.set_index('ref_date').reindex(idx).ffill().reset_index().rename(columns={'index': 'ref_date'})
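For reference, here is a self-contained sketch of the same reindex-and-forward-fill idea on the data from the question. It passes the offset object pd.offsets.BMonthEnd() instead of the 'BM' string, since newer pandas releases warn that this alias is deprecated. Taking the start and end from the data itself is my own choice.

```python
import pandas as pd

# Data from the question; months 7, 10 and 11 are missing.
df = pd.DataFrame({
    "ref_date": ["1/29/2010", "2/26/2010", "3/31/2010", "4/30/2010",
                 "5/31/2010", "6/30/2010", "8/31/2010", "9/30/2010",
                 "12/31/2010"],
    "tag": [1, 3, 4, 4, 1, 3, 1, 4, 2],
})
df["ref_date"] = pd.to_datetime(df["ref_date"], format="%m/%d/%Y")

# Every last business day of the month between the first and last date.
idx = pd.date_range(df["ref_date"].min(), df["ref_date"].max(),
                    freq=pd.offsets.BMonthEnd())

# Reindex onto the full monthly index, carrying the previous tag forward.
filled = (
    df.set_index("ref_date")
      .reindex(idx, method="ffill")
      .rename_axis("ref_date")
      .reset_index()
)
print(filled)
```

Naming the index with rename_axis before reset_index avoids the separate rename of the 'index' column.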
