Custom PeriodIndex (Python / pandas equivalent to SAS INTNX)

I have a SAS background and I am new to Python. I would like to know how to use PeriodIndex in a similar way to how we use SAS intervals. This is my problem:
We have an official interest rate that is published more or less monthly. This interest rate is valid until the next one is published. My objective is to obtain for any given date (let’s call it reference_date), the valid interest rate for that day.
For instance:
df = pd.DataFrame({'publication_date': ['2012-07-03', '2012-08-02', '2012-09-04',
                                        '2012-10-02', '2012-11-03', '2012-12-04'],
                   'interest_value': [1.219, 1.061, 0.877, 0.74, 0.65, 0.588]})
interest_value publication_date
0 1.219 2012-07-03
1 1.061 2012-08-02
2 0.877 2012-09-04
3 0.740 2012-10-02
4 0.650 2012-11-03
5 0.588 2012-12-04
In SAS I would create a custom interval, (let’s call it INTEREST_INTERVAL). It would contain the periods (that is the BEGIN date and END date) for which each interest is valid. For the example above, the interval would be the following:
BEGIN END
03JUL12 01AUG12
02AUG12 03SEP12
04SEP12 01OCT12
02OCT12 02NOV12
03NOV12 03DEC12
Then I would use the INTNX function. INTNX allows me to "move" a number of periods up or down my custom interval and then return either the period's start date or end date.
In this case, I would use:
pub_date = INTNX(INTEREST_INTERVAL, reference_date, 0 , 'BEGINNING')
This instructs SAS to add zero intervals to the reference date and return the start date of the interval it falls in.
For instance, if the reference_date is equal to '2012-09-02', the above function would return 02AUG12. Then I would do a direct lookup (dictionary search) on the 'publication_date' / 'interest_value' table to obtain the valid interest rate for that day.
I thought that through pandas' PeriodIndex, with a second column for the interest rate value, I would be able to do something similar, but I could not find out:
How do I create a custom PeriodIndex?
How do I, from a specific date value (reference_date), return the row corresponding to the period it falls into?
What would be the best way to do this in pandas?
Thanks,
B.

My suggestion is actually more Python than pandas. Think about it this way: if your date is on or after the last date in the list, you want the last rate; if it's not, then if it is on or after the one before, you want that one, and so on. That is the logic the following code uses (I hope). Let me know if it works for you.
df.publication_date = pd.to_datetime(df.publication_date)

def get_interest_by_date(date):
    # Walk the publication dates from newest to oldest and return the rate
    # of the first one that is on or before the given date.
    for pub_date, interest in zip(df.publication_date.values[::-1],
                                  df.interest_value.values[::-1]):
        if date >= pub_date:
            return float(interest)
and testing:
get_interest_by_date(pd.to_datetime('2012-10-05'))
0.74
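For a more pandas-native alternative (not part of the answer above, just a sketch): Series.asof does exactly this kind of "last value at or before a date" lookup, provided the index is sorted:

```python
import pandas as pd

df = pd.DataFrame({'publication_date': ['2012-07-03', '2012-08-02', '2012-09-04',
                                        '2012-10-02', '2012-11-03', '2012-12-04'],
                   'interest_value': [1.219, 1.061, 0.877, 0.74, 0.65, 0.588]})
df['publication_date'] = pd.to_datetime(df['publication_date'])

# Index the rates by publication date; asof returns the last value whose
# index label is <= the lookup date, i.e. the rate in force on that day.
rates = df.set_index('publication_date')['interest_value']
rate = rates.asof(pd.Timestamp('2012-09-02'))  # -> 1.061 (the 02AUG12 rate)
```

This avoids the Python-level loop entirely and also vectorizes over many lookup dates at once if you pass a list of timestamps to asof.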

Related

How to best calculate average issue age for a range of historic dates (pandas)

Objective:
I need to show the trend in ageing of issues. e.g. for each date in 2021 show the average age of the issues that were open as at that date.
Starting data (historic issue list), "df":
ref  Created    resolved
a1   20/4/2021  15/5/2021
a2   21/4/2021  15/6/2021
a3   23/4/2021  15/7/2021
Endpoint, "df2":
Date      Avg_age
1/1/2021  x
2/1/2021  y
3/1/2021  z
where x,y,z are the calculated averages of age for all issues open on the Date.
Tried so far:
I got this to work, but in what feels like a very poor way:
create a date range: pd.date_range(start, finish, freq="D")
I loop through the dates in this range, and for each date I filter the "df" dataframe (boolean filtering) to show only issues live on the date in question. Then I calculate age (date - created) and average it for those issues. Each result is appended to a list.
once done, I just convert the list into a series for my final result, which I can then graph or whatever.
hist_dates = pd.date_range(start="2021-01-01", end="2021-12-31", freq="D")
result_list = []
for each_date in hist_dates:
    f1 = df.Created < each_date        # filter 1: created before the date
    f2 = df.Resolved >= each_date      # filter 2: not yet resolved on the date
    df['Refdate'] = each_date          # make column to allow refdate - created
    df['Age'] = (df.Refdate - df.Created)
    result_list.append(df[f1 & f2].Age.mean())
Problems:
This works, but it feels sloppy and it doesn't seem fast. The current data-set is small, but I suspect this wouldn't scale well. I'm trying not to solve everything with loops as I understand it is a common mistake for beginners like me.
I'll give you two solutions: the first one is step-by-step so you can understand the idea and process; the second one replicates the functionality in a much more condensed way, skipping some intermediate steps.
First, create a new column that holds your issue age, i.e. df['age'] = df.resolved - df.Created (I'm assuming your columns are of datetime type, if not, use pd.to_datetime to convert them)
You can then use groupby to group your data by creation date. This will internally slice your dataframe into several pieces, one for each distinct value of Created, grouping all values with the same creation date together. This way, you can then use aggregation on a creation date level to get the average issue age like so
# [['Created', 'age']] selects only the columns you are interested in
df[['Created', 'age']].groupby('Created').mean()
With an additional fourth data point [a4, 2021/4/20, 2021/4/30] (to enable some proper aggregation), this would end up giving you the following Series with the average issue age by creation date:
age
Created
2021-04-20 17 days 12:00:00
2021-04-21 55 days 00:00:00
2021-04-23 83 days 00:00:00
A more condensed way of doing this is to define a custom function and apply it to each creation-date grouping:
def issue_age(g: pd.DataFrame):
    # each group g is the sub-dataframe for one creation date
    return (g['resolved'] - g['Created']).mean()

df.groupby('Created').apply(issue_age)
This call will give you the same Series as before.

Recalculate fitting parameters after period of time loop

I have a dataframe similar to the one shown below and was wondering how I can loop through it and calculate fitting parameters every set number of days. For example, I would like to be able to input 30 days and get new constants for the first 30 days, then the first 60 days, and so on until the end of the date range.
ID date amount delta_t
1 2020/1/1 10.2 0
1 2020/1/2 11.2 1
2 2020/1/1 12.3 0
2 2020/1/2 13.3 1
I would like to have the parameters stored in another dataframe which is what I am currently doing for the entire dataset but that is over the whole time period rather than n day blocks. Then using the constants for each set period I will calculate the graph points and plot them.
Right now I am using groupby to group the wells by ID then using the apply method to calculate the constants for each ID. This works for the entire dataframe but the constants will change if I am only using 30 day periods.
I don't know if there is a way in the apply method to more easily do this and output the constants either to a new column or a separate dataframe that is one row per ID. Any input is greatly appreciated.
def parameters(x):
    # expo is the model function being fitted with scipy's curve_fit
    variables, _ = curve_fit(expo, x['delta_t'], x['amount'])
    return pd.Series({'param1': variables[0], 'param2': variables[1],
                      'param3': variables[2]})

param_series = df_filt.groupby('ID').apply(parameters)
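No answer is attached here, but the day-block idea can be sketched by extending the per-ID apply so that, inside each group, you fit on the first n days, then the first 2n days, and so on. The sample data and the linear numpy.polyfit below are stand-ins for the question's dataframe and its expo/curve_fit model:

```python
import numpy as np
import pandas as pd

# Hypothetical sample in the shape of the question's dataframe.
df = pd.DataFrame({
    'ID':      [1, 1, 1, 1, 2, 2, 2, 2],
    'delta_t': [0, 1, 2, 3, 0, 1, 2, 3],
    'amount':  [10.0, 12.0, 14.0, 16.0, 5.0, 6.0, 7.0, 8.0],
})

def fit_blocks(group, days=30):
    # Fit on the first `days` days, then the first 2*days, and so on,
    # returning one row of parameters per expanding window.
    rows = []
    for cutoff in range(days, int(group['delta_t'].max()) + days + 1, days):
        window = group[group['delta_t'] < cutoff]
        if len(window) < 2:          # need at least two points for a line fit
            continue
        slope, intercept = np.polyfit(window['delta_t'], window['amount'], 1)
        rows.append({'window_end_day': cutoff, 'param1': slope, 'param2': intercept})
    return pd.DataFrame(rows)

params = df.groupby('ID').apply(fit_blocks, days=2)
```

The result is one dataframe of constants per ID and per window, which can then be used to compute and plot the fitted curves for each period.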

groupby agg using a date offset or similar

Sample of dataset below
Trying to create a groupby that will give me the number of months that I specify eg last 12 months, last 36 months etc.
My groupby that rolls up my whole dataset for each 'client' is below. rolled_ret is just a custom function that geometrically links whatever performance array it gets; we can pretend it is sum().
df_client_perf = df_perf.groupby(df_perf.CLIENT_NAME)['GROUP_PERFORMANCE'].agg(Client_Return = rolled_ret)
If I put .rolling(12) I can take the most recent entry to get the previous 12 months but there is obviously a better way to do this.
Worth saying that the period column is a monthly period datetime type using to_period
thanks in advance
PERIOD,CLIENT_NAME,GROUP_PERFORMANCE
2020-03,client1,0.104
2020-04,client1,0.004
2020-05,client1,0.23
2020-06,client1,0.113
2020-03,client2,0.0023
2020-04,client2,0.03
2020-05,client2,0.15
2020-06,client2,0.143
Let's say, for example, that I wanted to do a groupby to SUM the latest three months of data; my expected output for the above would be
client1,0.347
client2,0.323
also - I would like a way to return nan if the dataset is missing the minimum number of periods, as you can do with the rolling function.
Here is my answer.
I've used a DatetimeIndex because the method last does not work with periods. First I sort values based on the PERIOD column, then I set it as the index to keep only the last 3 months (or whatever you provide), then I do the groupby the same way as you.
df['PERIOD'] = pd.to_datetime(df['PERIOD'])
(df.sort_values(by='PERIOD')
.set_index('PERIOD')
.last('3M')
.groupby('CLIENT_NAME')
.GROUP_PERFORMANCE
.sum())
# Result
CLIENT_NAME GROUP_PERFORMANCE
client1 0.347
client2 0.323
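To also get the NaN-below-minimum-periods behaviour the question asks for (not covered by the answer above), one sketch is to aggregate the sum and the period count together and mask clients that fall short. A manual date cutoff replaces .last, which is deprecated in recent pandas:

```python
import pandas as pd

# The question's sample data.
df = pd.DataFrame({'PERIOD': ['2020-03', '2020-04', '2020-05', '2020-06',
                              '2020-03', '2020-04', '2020-05', '2020-06'],
                   'CLIENT_NAME': ['client1'] * 4 + ['client2'] * 4,
                   'GROUP_PERFORMANCE': [0.104, 0.004, 0.23, 0.113,
                                         0.0023, 0.03, 0.15, 0.143]})
df['PERIOD'] = pd.to_datetime(df['PERIOD'])

months = 3
cutoff = df['PERIOD'].max() - pd.DateOffset(months=months)
recent = df[df['PERIOD'] > cutoff]

# Aggregate sum and count together, then blank out any client that does
# not have the full number of periods in the window.
agg = recent.groupby('CLIENT_NAME')['GROUP_PERFORMANCE'].agg(['sum', 'count'])
result = agg['sum'].where(agg['count'] >= months)
```

sum stands in for the custom rolled_ret here, as the question suggests.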

Python: Shift time series so they all match at a given y value

I'm writing my own code to analyse/visualise COVID-19 data from the European CDC.
https://opendata.ecdc.europa.eu/covid19/casedistribution/csv
I've got a simple code to extract the data and make plots with cumulative deaths against time, and am trying to add functionality.
My aim is something like the attached graph, with all countries time-shifted to match at (in this case) the 5th death. I want to make a general bit of code to shift countries to match at the nth death.
https://ourworldindata.org/grapher/covid-confirmed-deaths-since-5th-death
The current way I'm trying to do this is to have a maze of "if group is 'country' shift by ..." terms.
Where ... is a lookup to find the date for the particular 'country' when there were 'n' deaths, and to interpolate fractional dates where appropriate.
i.e. currently deaths are assigned at 00:00 on each day, but the data can be shifted by, say, 2/3 of a day, as below.
datetime cumulative deaths
00:00 15/02 80
00:00 16/02 110
my '...' should give 16:00 15/02
I'm working on this right now but it doesn't feel very efficient and I'm sure there must be a much simpler way that I'm not seeing.
Essentially despite copious googling I can't seem to find a simple way of automatically shifting a bunch of timeseries to match at a particular y value, which feels like it should have some built-in functionality, i.e. a Lookup with interpolation.
####Live url (I've downloaded my own csv and been calling that for code development)
url = 'https://opendata.ecdc.europa.eu/covid19/casedistribution/csv'
dataraw = pd.read_csv(url)
#### extract relevant columns
data = dataraw.loc[:,["dateRep","countriesAndTerritories","deaths"]]
####convert date format
data['dateRep'] = pd.to_datetime(data['dateRep'],dayfirst=True)
####sort by date
data = data.sort_values(["dateRep"],ascending=True)
data['cumdeaths'] = data.groupby('countriesAndTerritories')['deaths'].cumsum()
##### limit to countries with cumulative deaths > 500
data = data.groupby('countriesAndTerritories').filter(lambda x:x['cumdeaths'].max() >500)
###### remove China from data for now as it doesn't match so well with dates
data = data.groupby('countriesAndTerritories').filter(lambda x:(x['countriesAndTerritories'] != "China").any())
##### only recent dates
data = data[data['dateRep'] > '2020-03-01']
print(data)
You can use groupby('country') and the transform function to add a column that sets every row to the date on which its country hit the nth death.
Then you can do a vectorized subtraction of the date column and the new column to get the number of days.
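A sketch of that transform idea, on hypothetical minimal data in the shape of the question's frame (the nth-death date is broadcast per country with transform('min') over the masked dates):

```python
import pandas as pd

data = pd.DataFrame({
    'countriesAndTerritories': ['A', 'A', 'A', 'B', 'B', 'B'],
    'dateRep': pd.to_datetime(['2020-03-01', '2020-03-02', '2020-03-03',
                               '2020-03-03', '2020-03-04', '2020-03-05']),
    'cumdeaths': [2, 6, 9, 1, 5, 12],
})
n = 5  # align every country at its nth death

# Keep dateRep only where the threshold is reached, then broadcast the
# earliest such date back onto every row of that country with transform.
hit = data['dateRep'].where(data['cumdeaths'] >= n)
data['nth_date'] = (data.assign(hit=hit)
                        .groupby('countriesAndTerritories')['hit']
                        .transform('min'))
data['days_since_nth'] = (data['dateRep'] - data['nth_date']).dt.days
```

days_since_nth can then be used directly as the shifted x-axis when plotting each country.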

With pandas, how do I calculate a rolling number of events in the last second given timestamp data?

I have dataset where I calculate service times based on request and response times. I would like to add a calculation of requests in the last second to show the obvious relationship that as we get more requests per second the system slows. Here is the data that I have, for example:
serviceTimes.head()
Out[71]:
Id Req_Time Rsp_Time ServiceTime
0 3_1 2015-02-13 14:07:08.729000 2015-02-13 14:07:08.821000 00:00:00.092000
1 3_2 2015-02-13 14:07:08.929000 2015-02-13 14:07:08.929000 00:00:00
2 3_12 2015-02-13 14:11:53.908000 2015-02-13 14:11:53.981000 00:00:00.073000
3 3_14 2015-02-13 14:11:54.111000 2015-02-13 14:11:54.250000 00:00:00.139000
4 3_15 2015-02-13 14:11:54.111000 2015-02-13 14:11:54.282000 00:00:00.171000
For this I would like a rolling data set of something like:
0 14:07:08 2
1 14:11:53 1
2 14:11:54 2
I've tried rolling_sum and rolling_count, but unless I am using them wrong or not understanding the period function, they are not working for me.
For your problem, it looks like you want to summarize your data set using a split-apply-combine approach. See here for the documentation that will help you get your code working, but basically you'll want to do the following:
Create a new column (say, 'Req_Time_Sec') that includes Req_Time down to only second resolution (e.g. 14:07:08.729000 becomes 14:07:08)
use groups = serviceTimes.groupby('Req_Time_Sec') to separate your data set into sub-groups based on which second each request occurs in.
Finally, create a new data set by calculating the length of each sub-group (which represents the number of requests in that second) and aggregating the results into a single DataFrame (something like new_df = groups.aggregate(len))
The above is all untested pseudo-code, but the code, along with the link to the documentation, should help you get where you want to go.
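Those steps can be collapsed into a couple of lines with dt.floor, which avoids a string column entirely (a sketch on the question's sample data):

```python
import pandas as pd

serviceTimes = pd.DataFrame({
    'Id': ['3_1', '3_2', '3_12', '3_14', '3_15'],
    'Req_Time': pd.to_datetime(['2015-02-13 14:07:08.729', '2015-02-13 14:07:08.929',
                                '2015-02-13 14:11:53.908', '2015-02-13 14:11:54.111',
                                '2015-02-13 14:11:54.111']),
})

# Truncate every timestamp to whole seconds, then count rows per second.
requests_per_second = serviceTimes.groupby(serviceTimes['Req_Time'].dt.floor('s')).size()
```

The result is a Series indexed by second with the request count per second, which you can merge back onto the original frame to relate load to service time.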
You first need to transform the timestamp into a string, which you can then group by, showing the count and average service times:
serviceTimes['timestamp'] = [t.strftime('%y-%m-%d %H:%M') for t in serviceTimes.Req_Time]
serviceTimes.groupby('timestamp')['ServiceTime'].agg(['mean', 'count'])
Alternatively, create a data frame of the request time in the appropriate string format, e.g. 15-02-13 14:07, then count the occurrence of each time stamp using value_counts(). You can also plot the results quite easily.
df = pd.DataFrame([t.strftime('%y-%m-%d %H:%M') for t in serviceTimes.Req_Time],
                  columns=['timestamp'])
response = df.timestamp.value_counts()
response.plot(rot=90)
