Replace "NaT" with next date based on previous date - python

My DF looks like below:
column1     column2
2020-11-01        1
2020-12-01        2
2021-01-01        3
NaT               4
NaT               5
NaT               6
Output should be like this:
column1     column2
2020-11-01        1
2020-12-01        2
2021-01-01        3
2021-02-01        4
2021-03-01        5
2021-04-01        6
I can't figure out how to create the next dates (only the month and year change) based on the last existing date in the df. Is there a pythonic way to do this? Thanks for any help!
Regards
Tomasz

This is how I would do it. You could probably tidy this up into more of a one-liner, but this will help illustrate the process a little more.
# convert to datetime (the sample dates are year-month-day)
df['column1'] = pd.to_datetime(df['column1'], format='%Y-%m-%d')
# forward-fill the last known date to create a group for each missing section
df['temp'] = df['column1'].ffill()
# count the row within this group
df['temp2'] = df.groupby(['temp']).cumcount()
# add that many months to the group's base date
df['column1'] = [x + pd.DateOffset(months=y) for x, y in zip(df['temp'], df['temp2'])]
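For reference, the same steps might be condensed into roughly the one-liner hinted at above (a sketch, not from the original answer), avoiding the helper columns:
# Same idea without helper columns: forward-fill the dates, then offset each
# row by its position within its forward-fill group.
base = df['column1'].ffill()
df['column1'] = [d + pd.DateOffset(months=n)
                 for d, n in zip(base, df.groupby(base).cumcount())]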

pandas has built-in support for time series data:
pd.date_range("2020-11-1", freq=pd.tseries.offsets.DateOffset(months=1), periods=10)
will give
DatetimeIndex(['2020-11-01', '2020-12-01', '2021-01-01', '2021-02-01',
               '2021-03-01', '2021-04-01', '2021-05-01', '2021-06-01',
               '2021-07-01', '2021-08-01'],
              dtype='datetime64[ns]', freq='<DateOffset: months=1>')
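Applied to the question's dataframe, the whole column could be regenerated from its first date (a sketch assuming the existing dates are already consecutive months starting at the first row, as in the example):
# Rebuild column1 as one monthly range anchored at the first date;
# this assumes the pre-NaT dates are consecutive months.
start = df['column1'].iloc[0]
df['column1'] = pd.date_range(start, freq=pd.tseries.offsets.DateOffset(months=1),
                              periods=len(df))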

Related

How to Calculate Year over Year Percentage Change in Dataframe with Datetime Index based on Date and not number of Periods

I have multiple Dataframes for macroeconomic timeseries. In each of these Dataframes I want to add a column showing the Year over Year percentage change. Ideally I would do this with a for loop so I don't have to repeat the process multiple times. However, the series do not have the same frequency. For example, GDP is quarterly, PCE is monthly and S&P returns are daily, so I cannot specify a single number of periods. Since my dataframes already have a Datetime index, I would like to specify that the percentage change should be calculated based on the dates. Is that possible?
Please see examples of my Dataframes below:
print(gdp):
Date GDP
1947-01-01 2.034450e+12
1947-04-01 2.029024e+12
1947-07-01 2.024834e+12
1947-10-01 2.056508e+12
1948-01-01 2.087442e+12
...
2021-04-01 1.936831e+13
2021-07-01 1.947889e+13
2021-10-01 1.980629e+13
2022-01-01 1.972792e+13
2022-04-01 1.969946e+13
[302 rows x 1 columns]
print(pce):
Date PCE
1960-01-01 1.695549
1960-02-01 1.706421
1960-03-01 1.692806
1960-04-01 1.863354
1960-05-01 1.911975
...
2022-02-01 6.274030
2022-03-01 6.638595
2022-04-01 6.269216
2022-05-01 6.324989
2022-06-01 6.758935
[750 rows x 1 columns]
print(spx):
Date SPX
1928-01-03 17.76
1928-01-04 17.72
1928-01-05 17.55
1928-01-06 17.66
1928-01-09 17.59
...
2022-08-19 4228.48
2022-08-22 4137.99
2022-08-23 4128.73
2022-08-24 4140.77
2022-08-25 4199.12
[24240 rows x 1 columns]
Instead of doing this:
gdp['GDP'] = gdp['GDP'].pct_change(4)
pce['PCE'] = pce['PCE'].pct_change(12)
spx['SPX'] = spx['SPX'].pct_change(252)
I would like a for loop to do it for all Dataframes without specifying the periods but specifying that I want the percentage change from Year to Year.
Given:
d = {'Date': ['2021-02-01', '2021-03-01', '2021-04-01', '2021-05-01', '2021-06-01',
              '2022-02-01', '2022-03-01', '2022-04-01', '2022-05-01', '2022-06-01'],
     'PCE': [1.695549, 1.706421, 1.692806, 1.863354, 1.911975,
             6.274030, 6.638595, 6.269216, 6.324989, 6.758935]}
pce = pd.DataFrame(d)
pce = pce.set_index('Date')
pce.index = pd.to_datetime(pce.index)
You could create a new dataframe with a copy of the datetime index as a new column, resample the new dataframe with annual frequency ('A') and count all unique values in the Date column.
pce_annual_rows = pce.index.to_frame()
resampled_annual = pce_annual_rows.resample('A').count()
Next you can get the second-to-last Date count value and use that as the periods value in the pct_change method. The second-to-last, because an incomplete year at the end would give you a wrong periods value. This assumes that you have more than one year of data in every dataframe; otherwise you'll get an IndexError.
periods_per_year = resampled_annual['Date'].iloc[-2]
pce['ROC'] = pce['PCE'].pct_change(periods_per_year)
This produces the following output:
PCE ROC
Date
2021-02-01 1.695549 NaN
2021-03-01 1.706421 NaN
2021-04-01 1.692806 NaN
2021-05-01 1.863354 NaN
2021-06-01 1.911975 NaN
2022-02-01 6.274030 2.700294
2022-03-01 6.638595 2.890362
2022-04-01 6.269216 2.703446
2022-05-01 6.324989 2.394411
2022-06-01 6.758935 2.535054
This solution isn't very nice; maybe someone will come up with a less complicated idea.
To build your for loop over every dataframe, you'd probably be better off using the same column name in each frame for the column you want to apply the pct_change method to.
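One possible shape for that loop (a sketch; the frame/column pairs are the question's own dataframes, and the periods logic is the one derived above, so differing column names can simply be carried along as pairs):
# For each frame, derive its own periods-per-year from the second-to-last
# annual observation count, then apply pct_change with that value.
for frame, col in [(gdp, 'GDP'), (pce, 'PCE'), (spx, 'SPX')]:
    periods_per_year = frame.index.to_frame().resample('A').count().iloc[-2, 0]
    frame['ROC'] = frame[col].pct_change(periods_per_year)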

Merge dataframes by dates

I have the two dataframes below:
lst=[['2021-01-01','A'],['2021-01-01','B'],['2021-02-01','A'],['2021-02-01','B'],['2021-03-01','A'],['2021-03-01','B']]
df1=pd.DataFrame(lst,columns=['Date','Pf'])
lst=[['2021-02-01','A','New']]
df22=pd.DataFrame(lst,columns=['Date','Pf','Status'])
I would like to merge them in order to obtain the df below:
lst=[['2021-01-01','A','NaN'],['2021-01-01','B','NaN'],['2021-02-01','A','New'],['2021-02-01','B','NaN'],['2021-03-01','A','New'],['2021-03-01','B','NaN']]
df3=pd.DataFrame(lst,columns=['Date','Pf','Status'])
For the period 2021-02-01 alone one could apply a plain merge. However, I would like the status "New" to apply as soon as the same Pf as in df22 appears, i.e. for all dates equal to or greater than 2021-02-01.
Do you have any idea how I could solve this question?
Thank you for your help
Use merge_asof with default direction='backward':
df1['Date'] = pd.to_datetime(df1['Date'])
df22['Date'] = pd.to_datetime(df22['Date'])
df = pd.merge_asof(df1, df22, on='Date', by='Pf')
print (df)
Date Pf Status
0 2021-01-01 A NaN
1 2021-01-01 B NaN
2 2021-02-01 A New
3 2021-02-01 B NaN
4 2021-03-01 A New
5 2021-03-01 B NaN
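Note that merge_asof requires both frames to be sorted by the on key, which the example data already is. For comparison, a sketch of the same carry-forward behaviour without merge_asof: an ordinary left merge, then a forward-fill of Status within each Pf group.
# A plain left merge only marks the exact (Date, Pf) matches ...
df = df1.merge(df22, on=['Date', 'Pf'], how='left')
# ... so propagate the status to all later dates of the same Pf.
df['Status'] = df.sort_values('Date').groupby('Pf')['Status'].ffill()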

How to create a dataframe with pandas.date_range for previous years?

I want to create a dataframe with dates from previous years. For example, something like this:
df = pd.DataFrame({'Years': pd.date_range('2021-09-21', periods=-5, freq='Y')})
but a negative periods value is not supported. How can I achieve this?
Use the end parameter in date_range and then add a DateOffset:
d = pd.to_datetime('2021-09-21')
df = pd.DataFrame({'Years': pd.date_range(end=d, periods=5, freq='Y') +
                            pd.DateOffset(day=d.day, month=d.month)})
print (df)
Years
0 2016-09-21
1 2017-09-21
2 2018-09-21
3 2019-09-21
4 2020-09-21
Or, if the current year should be the last value of the column, use YS (year start) instead:
d = pd.to_datetime('2021-09-21')
df = pd.DataFrame({'Years': pd.date_range(end=d, periods=5, freq='YS') +
                            pd.DateOffset(day=d.day, month=d.month)})
print (df)
Years
0 2017-09-21
1 2018-09-21
2 2019-09-21
3 2020-09-21
4 2021-09-21

Populating empty data frame from conditional checks on a table

I’m looking for some bearings on how to approach this problem.
I have a table 'AnnualContracts', which contains the following columns:
Year1Amount  Year1Payday  Year2Amount  Year2Payday  Year3Amount  Year3Payday
1000.0       2020-08-01   1000.0       2021-08-01   1000.0       2022-08-01
2400.0       2021-06-01   3400.0       2022-06-01   4400.0       2023-06-01
1259.0       2019-05-01   1259.0       2020-05-01   1259.0       2021-05-01
2150.0       2021-08-01   2150.0       2022-08-01   2150.0       2023-08-01
This ranges up to 5 years and 380+ rows. There are four types of customers, each with their own similarly structured table: annual paying, bi-annual paying, quarterly paying and monthly paying.
I also have an empty dataframe (SumsOfPayments) whose index is built from variables which update each month and whose columns are the customer types mentioned above. It looks like this:
             Annual  Bi-Annual  Quarterly  Monthly
12monthsago
11monthsago
10monthsago
etc., until it hits 60 months into the future.
The indexes on the SumOfPayments and the YearXPaydays are all set to the 1st of their respective month, so they can be matched with ==.
(as an example of how the index variables are set on the SumOfPayments table):
12monthsago = datetime.today().replace(day=1,hour=0,minute=0).replace(second=0,microsecond=0)+relativedelta(months=-12)
So if today's date is 13/08/2021, the above would produce 2020-08-01 00:00:00.
The intention behind this is to:
order the YearXPaydays by date and total the corresponding YearXAmounts for each grouped date
then check those grouped sums against the index of the SumOfPayments dataframe, and enter each sum wherever the dates match
Example (based on the tables above):
AnnualContracts:
Year1Amount  Year1Payday  Year2Amount  Year2Payday  Year3Amount  Year3Payday
1000.0       2020-08-01   1000.0       2021-08-01   1000.0       2022-08-01
2400.0       2021-06-01   3400.0       2022-06-01   4400.0       2023-06-01
1259.0       2019-05-01   1259.0       2020-05-01   1259.0       2021-05-01
2150.0       2021-08-01   2150.0       2022-08-01   2150.0       2023-08-01
SumOfPayments:
              Annual  Bi-Annual  Quarterly  Monthly
12monthsago   1000.0
11monthsago
10monthsago
9monthsago
8monthsago
7monthsago
6monthsago
5monthsago
4monthsago
3monthsago    1259.0
2monthsago    2400.0
1monthsago
currentmonth  3150.0
Any help on this would be massively appreciated, thanks in advance for any assistance.
You could use wide_to_long if your column names were a little different. Instead I'll just split and melt them to get the data in the right shape. If you're curious what's happening, just print out dt and amt to see what they look like after melting.
Then you can create your output table using 13 periods (this month plus the past 12 months) and start it from the beginning of the month one year ago.
You can create multiple tables for each level of aggregation you want, annual, bi-annual, etc. Then just merge them to the table with the date range.
import pandas as pd
from datetime import date, timedelta

df = pd.DataFrame({'Year1Amount': {0: 1000.0, 1: 2400.0, 2: 1259.0, 3: 2150.0},
                   'Year1Payday': {0: '2020-08-01', 1: '2021-06-01',
                                   2: '2019-05-01', 3: '2021-08-01'},
                   'Year2Amount': {0: 1000.0, 1: 3400.0, 2: 1259.0, 3: 2150.0},
                   'Year2Payday': {0: '2021-08-01', 1: '2022-06-01',
                                   2: '2020-05-01', 3: '2022-08-01'},
                   'Year3Amount': {0: 1000.0, 1: 4400.0, 2: 1259.0, 3: 2150.0},
                   'Year3Payday': {0: '2022-08-01', 1: '2023-06-01',
                                   2: '2021-05-01', 3: '2023-08-01'}})

hist = pd.DataFrame({'Date': pd.date_range(start=(date.today() - timedelta(days=365)).replace(day=1),
                                           freq=pd.offsets.MonthBegin(),
                                           periods=13)})
# Split and melt
dt = df[[x for x in df.columns if 'Payday' in x]].melt(value_name='Date')
amt = df[[x for x in df.columns if 'Amount' in x]].melt(value_name='Annual')
# Combine and make datetime
df = pd.concat([amt['Annual'], dt['Date']],axis=1)
df['Date'] = pd.to_datetime(df['Date'])
# Do all of your aggregations into new dataframes like such, you'll need one for each column
# here's how to do the annual one
annual_sum = df.groupby('Date', as_index=False).sum()
# For each aggregation, merge to the hist df
hist = hist.merge(annual_sum, on='Date', how='left')
Output
Date Annual
0 2020-08-01 1000.0
1 2020-09-01 NaN
2 2020-10-01 NaN
3 2020-11-01 NaN
4 2020-12-01 NaN
5 2021-01-01 NaN
6 2021-02-01 NaN
7 2021-03-01 NaN
8 2021-04-01 NaN
9 2021-05-01 1259.0
10 2021-06-01 2400.0
11 2021-07-01 NaN
12 2021-08-01 3150.0
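To cover the remaining customer types, the merges could be chained. A sketch below, where biannual_sum, quarterly_sum and monthly_sum are hypothetical frames built the same way as annual_sum, from their own contract tables, each melted with a matching column name:
# Hypothetical: each *_sum frame is built like annual_sum above from its
# own contract table, so every merge adds one customer-type column to hist.
for agg in [biannual_sum, quarterly_sum, monthly_sum]:
    hist = hist.merge(agg, on='Date', how='left')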

Pandas - Add at least one row for every day (datetimes include a time)

Edit: You can use the alleged duplicate solution with reindex() if your dates don't include times, otherwise you need a solution like the one by #kosnik. In addition, their solution doesn't need your dates to be the index!
I have data formatted like this
df = pd.DataFrame(data=[['2017-02-12 20:25:00', 'Sam', '8'],
                        ['2017-02-15 16:33:00', 'Scott', '10'],
                        ['2017-02-15 16:45:00', 'Steve', '5']],
                  columns=['Datetime', 'Sender', 'Count'])
df['Datetime'] = pd.to_datetime(df['Datetime'], format='%Y-%m-%d %H:%M:%S')
Datetime Sender Count
0 2017-02-12 20:25:00 Sam 8
1 2017-02-15 16:33:00 Scott 10
2 2017-02-15 16:45:00 Steve 5
I need there to be at least one row for every date, so the expected result would be
Datetime Sender Count
0 2017-02-12 20:25:00 Sam 8
1 2017-02-13 00:00:00 None 0
2 2017-02-14 00:00:00 None 0
3 2017-02-15 16:33:00 Scott 10
4 2017-02-15 16:45:00 Steve 5
I have tried to make datetime the index, add the dates and use reindex() like so
from datetime import timedelta

df.index = df['Datetime']
values = df['Datetime'].tolist()
for i in range(len(values) - 1):
    if values[i].date() + timedelta(days=1) < values[i + 1].date():
        values.insert(i + 1, pd.Timestamp(values[i].date() + timedelta(days=1)))
print(df.reindex(values, fill_value=0))
This makes every row lose its other column values, and the same thing happens with asfreq('D') or resample():
ID Sender Count
Datetime
2017-02-12 16:25:00 0 Sam 8
2017-02-13 00:00:00 0 0 0
2017-02-14 00:00:00 0 0 0
2017-02-15 20:25:00 0 0 0
2017-02-15 20:25:00 0 0 0
What would be the appropriate way of going about this?
I would create a new dataframe that contains all the required datetimes and then left-join it with your dataframe.
A working code example is the following
import datetime
import numpy as np
import pandas as pd

df['Datetime'] = pd.to_datetime(df['Datetime'])  # first convert to datetimes
datetimes = df['Datetime'].tolist()  # existing datetimes - the missing ones are appended below
dates = [x.date() for x in datetimes]  # these are converted to dates
min_date = min(dates)
max_date = max(dates)
for d in range((max_date - min_date).days):
    forward_date = min_date + datetime.timedelta(d)
    if forward_date not in dates:
        datetimes.append(np.datetime64(forward_date))
# create new dataframe, merge and fill 'Count' column with zeroes
df = pd.DataFrame({'Datetime': datetimes}).merge(df, on='Datetime', how='left')
df['Count'].fillna(0, inplace=True)
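A more compact variant of the same idea (a sketch, not from the answer; it starts again from the original three-row df): generate every calendar day in the span with date_range, keep only the days that have no row yet, and concatenate.
# All calendar days spanned by the data, normalized to midnight.
full_days = pd.DataFrame({'Datetime': pd.date_range(df['Datetime'].min().normalize(),
                                                    df['Datetime'].max().normalize(),
                                                    freq='D')})
# Keep only the days with no existing row, then combine with the original rows.
missing = full_days[~full_days['Datetime'].dt.date.isin(df['Datetime'].dt.date)]
df = pd.concat([df, missing]).sort_values('Datetime').reset_index(drop=True)
df['Count'] = df['Count'].fillna(0)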
