Objective:
I need to show the trend in ageing of issues. e.g. for each date in 2021 show the average age of the issues that were open as at that date.
Starting data (Historic issue list), "df":

ref    Created      resolved
a1     20/4/2021    15/5/2021
a2     21/4/2021    15/6/2021
a3     23/4/2021    15/7/2021
Endpoint, "df2":

Date        Avg_age
1/1/2021    x
2/1/2021    y
3/1/2021    z
where x,y,z are the calculated averages of age for all issues open on the Date.
Tried so far:
I got this to work in what feels like a very poor way.
Create a date range with pd.date_range(start, finish, freq="D").
Loop through the dates in this range and, for each date, filter the "df" dataframe (boolean filtering) to show only the issues live on the date in question, then calculate each age (date - Created) and average them, appending each result to a list.
Once done, convert the list into a Series for the final result, which I can then graph or whatever.
hist_dates = pd.date_range(start="2021-01-01", end="2021-12-31", freq="D")
result_list = []

for each_date in hist_dates:
    f1 = df.Created < each_date       # filter 1: already created by this date
    f2 = df.Resolved >= each_date     # filter 2: not yet resolved on this date
    df['Refdate'] = each_date         # make column to allow Refdate - Created
    df['Age'] = (df.Refdate - df.Created)
    result_list.append(df[f1 & f2].Age.mean())
Problems:
This works, but it feels sloppy and it doesn't seem fast. The current data set is small, but I suspect this wouldn't scale well. I'm trying not to solve everything with loops, as I understand that's a common mistake for beginners like me.
I'll give you two solutions: the first is step-by-step so you can follow the idea and the process; the second replicates the functionality in a much more condensed way, skipping some intermediate steps.
First, create a new column that holds your issue age, i.e. df['age'] = df.resolved - df.Created (I'm assuming your columns are of datetime type; if not, use pd.to_datetime to convert them).
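For instance, a minimal sketch of that preparation, assuming the date columns are day-first strings like "20/4/2021" and named as in your sample data:

import pandas as pd

# assumption: dates are dd/mm/yyyy strings, hence dayfirst=True
df['Created'] = pd.to_datetime(df['Created'], dayfirst=True)
df['resolved'] = pd.to_datetime(df['resolved'], dayfirst=True)
df['age'] = df['resolved'] - df['Created']   # issue age as a Timedelta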
You can then use groupby to group your data by creation date. This will internally slice your dataframe into several pieces, one for each distinct value of Created, grouping all rows with the same creation date together. You can then aggregate at the creation-date level to get the average issue age, like so:
# [['Created', 'age']] selects only the columns you are interested in
df[['Created', 'age']].groupby('Created').mean()
With an additional fourth data point [a4, 2021/4/20, 2021/4/30] (to enable some proper aggregation), this would end up giving you the following Series with the average issue age by creation date:
                         age
Created
2021-04-20  17 days 12:00:00
2021-04-21  55 days 00:00:00
2021-04-23  83 days 00:00:00
A more condensed way of doing this is by defining a custom function and applying it to each creation-date group:

def issue_age(g: pd.DataFrame):
    # each group holds all issues that share a creation date
    return (g['resolved'] - g['Created']).mean()

df.groupby('Created').apply(issue_age)
This call will give you the same Series as before.
I have an n x 6 matrix with the columns: year, month, day, hour, minute, use.
I have to make a new matrix containing the aggregated measurements of use, per hour, so that all rows recorded within the same hour are combined.
Every time the hour value changes, the code needs to know that a new period starts.
I just tried something, but I don't know how to solve this.
Thank you. This is what I tried, plus a test:
def groupby_measurements(data):
    count = -1
    for i in range(9):
        array = np.split(data, np.where(data[i,3] != data[i+1,3])[0][:1])
    return array

print(groupby_measurements(np.array([[2006,2,11,1,1,55],
                                     [2006,2,11,1,11,79],
                                     [2006,2,11,1,32,2],
                                     [2006,2,11,1,41,66],
                                     [2006,2,11,1,51,76],
                                     [2006,2,11,10,2,89],
                                     [2006,2,11,10,3,33],
                                     [2006,2,11,14,2,22],
                                     [2006,2,11,14,5,34]])))
For this test case, I expect the intermediate output to be:
np.array([[2006,2,11,1,1,55],
[2006,2,11,1,11,79],
[2006,2,11,1,32,2],
[2006,2,11,1,41,66],
[2006,2,11,1,51,76]]),
np.array([[2006,2,11,10,2,89],
[2006,2,11,10,3,33]]),
np.array([[2006,2,11,14,2,22],
[2006,2,11,14,5,34]])
The final output should be:
np.array([2006,2,11,1,0,278]),
np.array([2006,2,11,10,0,122]),
np.array([2006,2,11,14,0,56])
(the sum of use in the three hourly periods)
I would recommend using pandas DataFrames, and then using groupby combined with sum:
import pandas as pd
import numpy as np
data = pd.DataFrame(np.array(
[[2006,2,11,1,1,55],
[2006,2,11,1,11,79],
[2006,2,11,1,32,2],
[2006,2,11,1,41,66],
[2006,2,11,1,51,76],
[2006,2,11,10,2,89],
[2006,2,11,10,3,33],
[2006,2,11,14,2,22],
[2006,2,11,14,5,34]]),
columns=['year','month','day','hour','minute','use'])
aggregated = data.groupby(['year','month','day','hour'])['use'].sum()
# you can also use .agg and pass which aggregation function you want as a string.
aggregated = data.groupby(['year','month','day','hour'])['use'].agg('sum')
year  month  day  hour
2006  2      11   1       278
                  10      122
                  14       56
aggregated is now a pandas Series; if you want it as a plain array, just do
aggregated.values
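If you also want to keep the grouping columns and get back rows shaped like [year, month, day, hour, 0, total], one possible sketch (setting minute to 0 is just a placeholder to mirror your expected output):

# turn the grouped Series back into ordinary columns
flat = aggregated.reset_index()
flat['minute'] = 0   # placeholder minute, as in the expected output
flat = flat[['year', 'month', 'day', 'hour', 'minute', 'use']]
flat.values
# array([[2006,    2,   11,    1,    0,  278],
#        [2006,    2,   11,   10,    0,  122],
#        [2006,    2,   11,   14,    0,   56]])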
I have a SAS background and I am new to Python. I would like to know how to use PeriodIndex in a similar way to how we use SAS intervals. This is my problem:
We have an official interest rate that is published more or less monthly. This interest rate is valid until the next one is published. My objective is to obtain, for any given date (let's call it reference_date), the interest rate valid on that day.
For instance:
df = pd.DataFrame({'publication_date': ['2012-07-03', '2012-08-02', '2012-09-04',
                                        '2012-10-02', '2012-11-03', '2012-12-04'],
                   'interest_value': [1.219, 1.061, 0.877, 0.74, 0.65, 0.588]})
interest_value publication_date
0 1.219 2012-07-03
1 1.061 2012-08-02
2 0.877 2012-09-04
3 0.740 2012-10-02
4 0.650 2012-11-03
5 0.588 2012-12-04
In SAS I would create a custom interval (let's call it INTEREST_INTERVAL). It would contain the periods (that is, the BEGIN and END dates) for which each interest rate is valid. For the example above, the interval would be the following:
BEGIN END
03JUL12 01AUG12
02AUG12 03SEP12
04SEP12 01OCT12
02OCT12 02NOV12
03NOV12 03DEC12
Then I would use the INTNX function. INTNX allows me to "move" a number of periods up or down my custom interval and then return either the period's start date or its end date.
In this case, I would use:
pub_date = INTNX(INTEREST_INTERVAL, reference_date, 0, 'BEGINNING')
This instructs SAS to add zero intervals to the reference date and return the start date of the interval it falls in.
For instance, if the reference_date is equal to '2012-09-02', the above function would return 02AUG12. Then I would do a direct lookup (dictionary search) on the 'publication_date' / 'interest_value' table to obtain the valid interest rate for that day.
I thought that through pandas' PeriodIndex, with a second column for the interest rate value, I would be able to do something similar, but I could not figure out:
How to create a custom PeriodIndex?
How, from a specific date value (reference_date), to return the row corresponding to the period it falls into?
What would be the best way to do this in pandas?
Thanks,
B.
My suggestion is actually more Python than pandas. Think about it this way: if your date is after the last date in the list, you want the last rate; if it isn't, check whether it is after the one before that, and so on. That is the logic the following code uses (I hope). Let me know if it works for you.
df.publication_date = pd.to_datetime(df.publication_date)

def get_interest_by_date(date):
    # walk the publication dates from newest to oldest and return the first
    # rate whose publication date is on or before the requested date
    for date_interest in zip(df.publication_date.values[::-1], df.interest_value.values[::-1]):
        if date >= date_interest[0]:
            return float(date_interest[1])
and testing:
get_interest_by_date(pd.to_datetime('2012-10-05'))
0.74
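If you would rather stay inside pandas, a possible alternative sketch uses Series.asof, which returns the last value whose index label is at or before the lookup date (this assumes publication_date has already been converted to datetime, as above):

# index the rates by publication date, then take the most recent rate on or before a given day
rates = df.set_index('publication_date')['interest_value'].sort_index()
rates.asof(pd.to_datetime('2012-10-05'))   # 0.74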