Populating empty data frame from conditional checks on a table - python

I’m looking for some bearings on how to approach this problem.
I have a table 'AnnualContracts' which contains the following types of data:
Year1Amount  Year1Payday  Year2Amount  Year2Payday  Year3Amount  Year3Payday
1000.0       2020-08-01   1000.0       2021-08-01   1000.0       2022-08-01
2400.0       2021-06-01   3400.0       2022-06-01   4400.0       2023-06-01
1259.0       2019-05-01   1259.0       2020-05-01   1259.0       2021-05-01
2150.0       2021-08-01   2150.0       2022-08-01   2150.0       2023-08-01
This ranges up to 5 years and 380+ rows. There are four types of customers (each with their own respective table, set up like the one above): Annual paying, Bi-Annual paying, Quarterly paying and Monthly paying.
I also have an empty dataframe (SumOfPayments) with an index based on variables which update each month, and columns based on the customer types mentioned above. It looks like this:
              Annual  Bi-Annual  Quarterly  Monthly
12monthsago
11monthsago
10monthsago
etc., until it hits 60 months into the future.
The index values on SumOfPayments and the YearXPayday dates are all set to the 1st of their respective month, so they can be matched with ==.
(as an example of how the index variables are set on the SumOfPayments table):
12monthsago = datetime.today().replace(day=1,hour=0,minute=0).replace(second=0,microsecond=0)+relativedelta(months=-12)
so if today's date is 13/08/2021, the above would produce 2020-08-01 00:00:00.
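For reference, a minimal sketch of building the whole index this way (the list comprehension and variable names are my assumptions, not the original code):

from datetime import datetime
from dateutil.relativedelta import relativedelta

first_of_month = datetime.today().replace(day=1, hour=0, minute=0,
                                          second=0, microsecond=0)
# 12 months back through 60 months into the future, all snapped to the 1st
month_index = [first_of_month + relativedelta(months=m) for m in range(-12, 61)]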
The intention behind this is to:
order the YearXPaydays by date, and total the YearXAmounts for each grouped date
then check those grouped sums against the index on the SumOfPayments dataframe, and enter the sum wherever the dates match
example (based on the tables above)
AnnualContracts:
Year1Amount  Year1Payday  Year2Amount  Year2Payday  Year3Amount  Year3Payday
1000.0       2020-08-01   1000.0       2021-08-01   1000.0       2022-08-01
2400.0       2021-06-01   3400.0       2022-06-01   4400.0       2023-06-01
1259.0       2019-05-01   1259.0       2020-05-01   1259.0       2021-05-01
2150.0       2021-08-01   2150.0       2022-08-01   2150.0       2023-08-01
SumOfPayments:
              Annual  Bi-Annual  Quarterly  Monthly
12monthsago   1000.0
11monthsago
10monthsago
9monthsago
8monthsago
7monthsago
6monthsago
5monthsago
4monthsago
3monthsago    1259.0
2monthsago    2400.0
1monthsago
currentmonth  3150.0
Any help on this would be massively appreciated, thanks in advance for any assistance.

You could use wide_to_long if your column names were a little different. Instead I'll just split and melt them to get the data in the right shape. If you're curious what's happening, just print out dt and amt to see what they look like after melting.
Then you can create your output table using 13 periods (this month plus the past 12 months), starting from the beginning of the month one year ago.
You can create multiple tables for each level of aggregation you want, annual, bi-annual, etc. Then just merge them to the table with the date range.
import pandas as pd
from datetime import date, timedelta

df = pd.DataFrame({'Year1Amount': {0: 1000.0, 1: 2400.0, 2: 1259.0, 3: 2150.0},
                   'Year1Payday': {0: '2020-08-01',
                                   1: '2021-06-01',
                                   2: '2019-05-01',
                                   3: '2021-08-01'},
                   'Year2Amount': {0: 1000.0, 1: 3400.0, 2: 1259.0, 3: 2150.0},
                   'Year2Payday': {0: '2021-08-01',
                                   1: '2022-06-01',
                                   2: '2020-05-01',
                                   3: '2022-08-01'},
                   'Year3Amount': {0: 1000.0, 1: 4400.0, 2: 1259.0, 3: 2150.0},
                   'Year3Payday': {0: '2022-08-01',
                                   1: '2023-06-01',
                                   2: '2021-05-01',
                                   3: '2023-08-01'}})

hist = pd.DataFrame({'Date': pd.date_range(start=(date.today() - timedelta(days=365)).replace(day=1),
                                           freq=pd.offsets.MonthBegin(),
                                           periods=13)})

# Split and melt
dt = df[[x for x in df.columns if 'Payday' in x]].melt(value_name='Date')
amt = df[[x for x in df.columns if 'Amount' in x]].melt(value_name='Annual')

# Combine and make datetime
df = pd.concat([amt['Annual'], dt['Date']], axis=1)
df['Date'] = pd.to_datetime(df['Date'])

# Do all of your aggregations into new dataframes like such, you'll need one for each column
# Here's how to do the annual one
annual_sum = df.groupby('Date', as_index=False).sum()

# For each aggregation, merge to the hist df
hist = hist.merge(annual_sum, on='Date', how='left')
Output
Date Annual
0 2020-08-01 1000.0
1 2020-09-01 NaN
2 2020-10-01 NaN
3 2020-11-01 NaN
4 2020-12-01 NaN
5 2021-01-01 NaN
6 2021-02-01 NaN
7 2021-03-01 NaN
8 2021-04-01 NaN
9 2021-05-01 1259.0
10 2021-06-01 2400.0
11 2021-07-01 NaN
12 2021-08-01 3150.0
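To cover the other customer types, the same split-and-melt can run in a loop. A sketch, assuming biannual_df, quarterly_df and monthly_df (names mine) hold the other contract tables with the same YearXAmount/YearXPayday layout:

for name, contracts in [('Bi-Annual', biannual_df),
                        ('Quarterly', quarterly_df),
                        ('Monthly', monthly_df)]:
    # same split-and-melt as above, one value column per customer type
    dt = contracts[[x for x in contracts.columns if 'Payday' in x]].melt(value_name='Date')
    amt = contracts[[x for x in contracts.columns if 'Amount' in x]].melt(value_name=name)
    combined = pd.concat([amt[name], dt['Date']], axis=1)
    combined['Date'] = pd.to_datetime(combined['Date'])
    hist = hist.merge(combined.groupby('Date', as_index=False).sum(),
                      on='Date', how='left')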


How to Calculate Year over Year Percentage Change in Dataframe with Datetime Index based on Date and not number of Periods

I have multiple Dataframes for macroeconomic timeseries. In each of these Dataframes I want to add a column showing the Year over Year percentage change. Ideally I would do this with a for loop so I don't have to repeat the process multiple times. However, the series do not have the same frequency. For example, GDP is quarterly, PCE is monthly and S&P returns are daily, so I cannot specify a fixed number of periods. Since my Dataframes already have a Datetime index, I would like to specify that the percentage change should be calculated based on the dates. Is that possible?
Please see examples of my Dataframes below:
print(gdp):
Date GDP
1947-01-01 2.034450e+12
1947-04-01 2.029024e+12
1947-07-01 2.024834e+12
1947-10-01 2.056508e+12
1948-01-01 2.087442e+12
...
2021-04-01 1.936831e+13
2021-07-01 1.947889e+13
2021-10-01 1.980629e+13
2022-01-01 1.972792e+13
2022-04-01 1.969946e+13
[302 rows x 1 columns]
print(pce):
Date PCE
1960-01-01 1.695549
1960-02-01 1.706421
1960-03-01 1.692806
1960-04-01 1.863354
1960-05-01 1.911975
...
2022-02-01 6.274030
2022-03-01 6.638595
2022-04-01 6.269216
2022-05-01 6.324989
2022-06-01 6.758935
[750 rows x 1 columns]
print(spx):
Date SPX
1928-01-03 17.76
1928-01-04 17.72
1928-01-05 17.55
1928-01-06 17.66
1928-01-09 17.59
...
2022-08-19 4228.48
2022-08-22 4137.99
2022-08-23 4128.73
2022-08-24 4140.77
2022-08-25 4199.12
[24240 rows x 1 columns]
Instead of doing this:
gdp['GDP'] = gdp['GDP'].pct_change(4)
pce['PCE'] = pce['PCE'].pct_change(12)
spx['SPX'] = spx['SPX'].pct_change(252)
I would like a for loop to do it for all Dataframes without specifying the periods but specifying that I want the percentage change from Year to Year.
Given:
import pandas as pd

d = {'Date': ['2021-02-01',
              '2021-03-01',
              '2021-04-01',
              '2021-05-01',
              '2021-06-01',
              '2022-02-01',
              '2022-03-01',
              '2022-04-01',
              '2022-05-01',
              '2022-06-01'],
     'PCE': [1.695549, 1.706421, 1.692806, 1.863354, 1.911975,
             6.274030, 6.638595, 6.269216, 6.324989, 6.758935]}
pce = pd.DataFrame(d)
pce = pce.set_index('Date')
pce.index = pd.to_datetime(pce.index)
You could create a new dataframe with a copy of the datetime index as a new column, resample the new dataframe with annual frequency ('A') and count the values in the Date column for each year.
pce_annual_rows = pce.index.to_frame()
resampled_annual = pce_annual_rows.resample('A').count()
Next you can get the second-last Date-count value and use that as the periods value in the pct_change method.
The second-last, because if there is an incomplete year at the end, you would probably end up with a wrong periods value. This assumes that you have more than one year of data in every dataframe; otherwise you'll get an IndexError.
periods_per_year = resampled_annual['Date'].iloc[-2]
pce['ROC'] = pce['PCE'].pct_change(periods_per_year)
This produces the following output:
PCE ROC
Date
2021-02-01 1.695549 NaN
2021-03-01 1.706421 NaN
2021-04-01 1.692806 NaN
2021-05-01 1.863354 NaN
2021-06-01 1.911975 NaN
2022-02-01 6.274030 2.700294
2022-03-01 6.638595 2.890362
2022-04-01 6.269216 2.703446
2022-05-01 6.324989 2.394411
2022-06-01 6.758935 2.535054
This solution isn't very nice; maybe someone will come up with a less complicated idea.
To build your for-loop over every dataframe, you'd probably be better off using the same column name in each frame for the column you want to apply the pct_change method to.
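That said, a rough sketch of such a loop, assuming gdp, pce and spx as printed above, keyed by the name of their value column:

# one pass per dataframe, inferring the periods-per-year from the index
for col, frame in {'GDP': gdp, 'PCE': pce, 'SPX': spx}.items():
    counts = frame.index.to_frame().resample('A').count()
    periods_per_year = counts.iloc[-2, 0]  # second-last year, assumed complete
    frame[col] = frame[col].pct_change(periods_per_year)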

Replace "NaT" with next date based on previous date

My DF looks like below:
column1 column2
2020-11-01 1
2020-12-01 2
2021-01-01 3
NaT 4
NaT 5
NaT 6
Output should be like this:
column1 column2
2020-11-01 1
2020-12-01 2
2021-01-01 3
2021-02-01 4
2021-03-01 5
2021-04-01 6
I can't work out how to create the next dates (only the months and years changing) based on the last existing date in the df. Is there any pythonic way to do this? Thanks for any help!
Regards
Tomasz
This is how I would do it. You could probably tidy this up into more of a one-liner, but this will help illustrate the process a little more.
# convert to datetime (the sample dates are year-month-day)
df['column1'] = pd.to_datetime(df['column1'], format='%Y-%m-%d')
# carry the last known date forward, creating a group for each missing section
df['temp'] = df['column1'].ffill()
# count the row within this group
df['temp2'] = df.groupby(['temp']).cumcount()
# add that many months to the last known date
df['column1'] = [x + pd.DateOffset(months=y) for x, y in zip(df['temp'], df['temp2'])]
pandas supports time series data
pd.date_range("2020-11-1", freq=pd.tseries.offsets.DateOffset(months=1), periods=10)
will give
DatetimeIndex(['2020-11-01', '2020-12-01', '2021-01-01', '2021-02-01',
               '2021-03-01', '2021-04-01', '2021-05-01', '2021-06-01',
               '2021-07-01', '2021-08-01'],
              dtype='datetime64[ns]', freq='<DateOffset: months=1>')
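Applied to this question, one rough way to use it (a sketch, assuming column1 is already datetime, the rows are in order, and the existing dates already follow the same monthly pattern):

# regenerate the whole column from the first known date onwards
start = df['column1'].dropna().iloc[0]
df['column1'] = pd.date_range(start,
                              freq=pd.tseries.offsets.DateOffset(months=1),
                              periods=len(df))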

Pandas: how to calculate percentage of a population from elsewhere

I found this data file on covid vaccinations, and I'd like to see the vaccination coverage in (parts of) the population. It'll probably become more clear with the actual example, so bear with me.
If I read the csv using df = pd.read_csv('https://epistat.sciensano.be/Data/COVID19BE_VACC.csv', parse_dates=['DATE']) I get this result:
DATE REGION AGEGROUP SEX BRAND DOSE COUNT
0 2020-12-28 Brussels 18-34 F Pfizer-BioNTech A 1
1 2020-12-28 Brussels 45-54 F Pfizer-BioNTech A 2
2 2020-12-28 Brussels 55-64 F Pfizer-BioNTech A 3
3 2020-12-28 Brussels 55-64 M Pfizer-BioNTech A 1
4 2020-12-28 Brussels 65-74 F Pfizer-BioNTech A 2
I'm particularly interested in the numbers by region & date.
So I regrouped using df.groupby(['REGION','DATE']).sum()
COUNT
REGION DATE
Brussels 2020-12-28 56
2020-12-30 5
2021-01-05 725
2021-01-06 989
2021-01-07 994
... ...
Wallonia 2021-06-18 49567
2021-06-19 43577
2021-06-20 2730
2021-06-21 37193
2021-06-22 16938
In order to compare vaccination 'speeds' in different regions I have to transform the data from absolute to relative numbers, using the population from each region.
I have found some posts explaining how to calculate percentages in a multi-index dataframe like this, but the problem is that I want to divide each COUNT by a population number that is not in the original dataframe.
The population numbers are here below
REGION POP
Flanders 6629143
Wallonia 3645243
Brussels 1218255
I think the solution must be in looping through the original df and checking both REGIONs or index levels, but I have absolutely no idea how. It's a technique I'd like to master, because it might come in handy when I want some other subsets with different populations (AGEGROUP or SEX maybe).
Thank you so much for reading this far!
Disclaimer: I've only just started out using Python, and this is my very first question on Stack Overflow, so please be gentle with me... The reason why I'm posting this is because I can't find an answer anywhere else. This is probably because I haven't got the terminology down and I don't exactly know what to look for ^_^
One option would be to reformat the population_df with set_index + rename:
population_df = pd.DataFrame({
    'REGION': {0: 'Flanders', 1: 'Wallonia', 2: 'Brussels'},
    'POP': {0: 6629143, 1: 3645243, 2: 1218255}
})
denom = population_df.set_index('REGION').rename(columns={'POP': 'COUNT'})
denom:
COUNT
REGION
Flanders 6629143
Wallonia 3645243
Brussels 1218255
Then div the results of groupby sum relative to level=0:
new_df = df.groupby(['REGION', 'DATE']).agg({'COUNT': 'sum'}).div(denom, level=0)
new_df:
COUNT
REGION DATE
Brussels 2020-12-28 0.000046
2020-12-30 0.000004
2021-01-05 0.000595
2021-01-06 0.000812
2021-01-07 0.000816
... ...
Wallonia 2021-06-18 0.013598
2021-06-19 0.011954
2021-06-20 0.000749
2021-06-21 0.010203
2021-06-22 0.004647
Or as a new column:
new_df = df.groupby(['REGION', 'DATE']).agg({'COUNT': 'sum'})
new_df['NEW'] = new_df.div(denom, level=0)
new_df:
COUNT NEW
REGION DATE
Brussels 2020-12-28 56 0.000046
2020-12-30 5 0.000004
2021-01-05 725 0.000595
2021-01-06 989 0.000812
2021-01-07 994 0.000816
... ... ...
Wallonia 2021-06-18 49567 0.013598
2021-06-19 43577 0.011954
2021-06-20 2730 0.000749
2021-06-21 37193 0.010203
2021-06-22 16938 0.004647
You could run reset_index() on the groupby and then run df.apply on a custom function that does the calculations:
import pandas as pd
df = pd.read_csv('https://epistat.sciensano.be/Data/COVID19BE_VACC.csv', parse_dates=['DATE'])
df = df.groupby(['REGION','DATE']).sum().reset_index()
def calculate(row):
    if row['REGION'] == 'Flanders':
        return row['COUNT'] / 6629143
    elif row['REGION'] == 'Wallonia':
        return row['COUNT'] / 3645243
    elif row['REGION'] == 'Brussels':
        return row['COUNT'] / 1218255

df['REL_COUNT'] = df.apply(calculate, axis=1)  # axis=1 takes the rows as input, axis=0 would run on columns
Output df.head():
     REGION       DATE  COUNT  REL_COUNT
0  Brussels 2020-12-28     56   0.000046
1  Brussels 2020-12-30      5   0.000004
2  Brussels 2021-01-05    725   0.000595
3  Brussels 2021-01-06    989   0.000812
4  Brussels 2021-01-07    994   0.000816
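A more compact variant of the same idea, sketched with Series.map instead of hard-coded branches (population_df as defined in the first answer):

# look up each row's population by region, then divide
pop_map = population_df.set_index('REGION')['POP']
df['REL_COUNT'] = df['COUNT'] / df['REGION'].map(pop_map)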

monthly resampling pandas with specific start day

I'm creating a pandas DataFrame with random dates and random integer values, and I want to resample it by month and compute the average of the integers. This can be done with the following code:
import numpy as np
import pandas as pd

def random_dates(start='2018-01-01', end='2019-01-01', n=300):
    # work in Unix seconds so we can draw uniform random timestamps
    start_u = pd.to_datetime(start).value // 10**9
    end_u = pd.to_datetime(end).value // 10**9
    return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')

start = pd.to_datetime('2018-01-01')
end = pd.to_datetime('2019-01-01')
dates = random_dates(start, end)
ints = np.random.randint(100, size=300)
df = pd.DataFrame({'Month': dates, 'Integers': ints})
print(df.resample('M', on='Month').mean())
The thing is that the resampled months always start from day one, and I want all months to start from day 15. I'm using pandas 1.1.4 and I've tried using origin='15/01/2018' or offset='15', and neither works with the 'M' resample rule (they do work when I use '30D', but that is of no use). I've also tried to use '2SM', but it also doesn't work.
So my question is if is there a way of changing the resample rule or I will have to add an offset in my data?
Assume that the source DataFrame is:
Month Amount
0 2020-05-05 1
1 2020-05-14 1
2 2020-05-15 10
3 2020-05-20 10
4 2020-05-30 10
5 2020-06-15 20
6 2020-06-20 20
To compute your "shifted" resample, first shift the Month column so that the 15th day of each month becomes the 1st:
df.Month = df.Month - pd.Timedelta('14D')
and then resample:
res = df.resample('M', on='Month').mean()
The result is:
Amount
Month
2020-04-30 1
2020-05-31 10
2020-06-30 20
If you want, change dates in the index to month periods:
res.index = res.index.to_period('M')
Then the result will be:
Amount
Month
2020-04 1
2020-05 10
2020-06 20
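Alternatively, if you want the index to show the true start of each period (the 15th) rather than month ends or periods, you could relabel the resampled index. A sketch, starting again from the shifted df:

res = df.resample('M', on='Month').mean()
# each month-end label corresponds to the period starting on the 15th of that month
res.index = res.index.to_period('M').to_timestamp() + pd.Timedelta('14D')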
Edit: Not a working solution for OP's request. See short discussion in the comments.
Interesting problem. I suggest resampling with 'SMS' - semi-month start frequency (1st and 15th). Instead of keeping just the mean values, keep the count and sum values and recalculate the weighted mean for each monthly period from its two sub-periods (for example: 15/1 to 15/2 is composed of 15/1-31/1 and 1/2-15/2).
The advantage here is that, unlike with an (improper use of an) offset, we are certain we always start on the 15th of the month and run until the 14th of the next month.
df_sm = df.resample('SMS', on='Month').aggregate(['sum', 'count'])
df_sm
Integers
sum count
Month
2018-01-01 876 16
2018-01-15 864 16
2018-02-01 412 10
2018-02-15 626 12
...
2018-12-01 492 10
2018-12-15 638 16
Rolling sum and rolling count; Find the mean out of them:
df_sm['sum_rolling'] = df_sm['Integers']['sum'].rolling(2).sum()
df_sm['count_rolling'] = df_sm['Integers']['count'].rolling(2).sum()
df_sm['mean'] = df_sm['sum_rolling'] / df_sm['count_rolling']
df_sm
            Integers       sum_rolling count_rolling       mean
                 sum count
Month
2018-01-01 876 16 NaN NaN NaN
2018-01-15 864 16 1740.0 32.0 54.375000
2018-02-01 412 10 1276.0 26.0 49.076923
2018-02-15 626 12 1038.0 22.0 47.181818
...
2018-12-01 492 10 1556.0 27.0 57.629630
2018-12-15 638 16 1130.0 26.0 43.461538
Now, just filter the odd indices of df_sm:
df_sm.iloc[1::2]['mean']
Month
2018-01-15 54.375000
2018-02-15 47.181818
2018-03-15 51.000000
2018-04-15 44.897436
2018-05-15 52.450000
2018-06-15 33.722222
2018-07-15 41.277778
2018-08-15 46.391304
2018-09-15 45.631579
2018-10-15 54.107143
2018-11-15 58.058824
2018-12-15 43.461538
Freq: 2SMS-15, Name: mean, dtype: float64
The code:
df_sm = df.resample('SMS', on='Month').aggregate(['sum', 'count'])
df_sm['sum_rolling'] = df_sm['Integers']['sum'].rolling(2).sum()
df_sm['count_rolling'] = df_sm['Integers']['count'].rolling(2).sum()
df_sm['mean'] = df_sm['sum_rolling'] / df_sm['count_rolling']
df_out = df_sm[1::2]['mean']
Edit: Changed a name of one of the columns to make it clearer

How to define a 4-4-5 week period in Pandas

My company uses a 4-4-5 calendar for reporting purposes. Each month (aka period) is 4-weeks long, except every 3rd month is 5-weeks long.
Pandas seems to have good support for custom calendar periods. However, I'm having trouble figuring out the correct frequency string or custom business month offset to achieve months for a 4-4-5 calendar.
For example:
import numpy as np
import pandas as pd

df_index = pd.date_range("2020-03-29", "2021-03-27", freq="D", name="date")
df = pd.DataFrame(
    index=df_index, columns=["a"], data=np.random.randint(0, 100, size=len(df_index))
)
df.groupby(pd.Grouper(level=0, freq="4W-SUN")).mean()
Grouping by 4-weeks starting on Sunday results in the following. The first three month start dates are correct but I need every third month to be 5-weeks long. The 4th month start date should be 2020-06-28.
a
date
2020-03-29 16.000000
2020-04-26 50.250000
2020-05-24 39.071429
2020-06-21 52.464286
2020-07-19 41.535714
2020-08-16 46.178571
2020-09-13 51.857143
2020-10-11 44.250000
2020-11-08 47.714286
2020-12-06 56.892857
2021-01-03 55.821429
2021-01-31 53.464286
2021-02-28 53.607143
2021-03-28 45.037037
Essentially what I'd like to achieve is something like this:
a
date
2020-03-29 20.000000
2020-04-26 50.750000
2020-05-24 49.750000
2020-06-28 49.964286
2020-07-26 52.214286
2020-08-23 47.714286
2020-09-27 46.250000
2020-10-25 53.357143
2020-11-22 52.035714
2020-12-27 39.750000
2021-01-24 43.428571
2021-02-21 49.392857
Pandas currently supports only yearly and quarterly 52-53-week fiscal calendars (aka the 4-4-5 calendar).
See pandas.tseries.offsets.FY5253 and pandas.tseries.offsets.FY5253Quarter.
df_index = pd.date_range("2020-03-29", "2021-03-27", freq="D", name="date")
df = pd.DataFrame(index=df_index)
df['a'] = np.random.randint(0, 100, df.shape[0])
So indeed you need some more work to get to week level while maintaining a 4-4-5 calendar. You could align to quarters using the native pandas offset and fill in the 4-4-5 week pattern manually.
def date_range(start, end, offset_array, name=None):
    start = pd.to_datetime(start)
    end = pd.to_datetime(end)
    index = []
    start -= offset_array[0]
    while start < end:
        for x in offset_array:
            start += x
            if start > end:
                break
            index.append(start)
    return pd.Series(index, name=name)
This function takes a list of offsets rather than a regular frequency period, so it allows to move from date to date following the offsets in the given array:
offset_445 = [
    pd.tseries.offsets.FY5253Quarter(weekday=6),
    4 * pd.tseries.offsets.Week(weekday=6),
    4 * pd.tseries.offsets.Week(weekday=6),
]
df_index_445 = date_range("2020-03-29", "2021-03-27", offset_445, name='date')
Out:
0 2020-05-03
1 2020-05-31
2 2020-06-28
3 2020-08-02
4 2020-08-30
5 2020-09-27
6 2020-11-01
7 2020-11-29
8 2020-12-27
9 2021-01-31
10 2021-02-28
Name: date, dtype: datetime64[ns]
Once the index is created, it's back to aggregation logic to get the data into the right row buckets. Assuming that you want the mean for the start of each 4- or 5-week period, according to the df_index_445 you have generated, it could look like this:
# calculate the mean on reindex groups
reindex = df_index_445.searchsorted(df.index, side='right') - 1
res = df.groupby(reindex).mean()
# filter valid output
res = res[res.index>=0]
res.index = df_index_445
Out:
a
2020-05-03 47.857143
2020-05-31 53.071429
2020-06-28 49.257143
2020-08-02 40.142857
2020-08-30 47.250000
2020-09-27 52.485714
2020-11-01 48.285714
2020-11-29 56.178571
2020-12-27 51.428571
2021-01-31 50.464286
2021-02-28 53.642857
Note that since the frequency is not regular, pandas will set the datetime index frequency to None.
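As a quick sanity check (a sketch), the gaps between consecutive index dates should show the 4-4-5 pattern in weeks:

# weeks between consecutive period starts: 4, 4, 5, 4, 4, 5, ...
df_index_445.diff().dt.days / 7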
