How to get the average of a column within a time period? - python

I need to get the average of a column (passed as an argument to my function) over a specific period of time:
In my case the date is the index, so I can get the week with index.week.
I would then like to compute some basic statistics every 2 weeks, for instance.
So I need to "slice" the dataframe every 2 weeks and then compute. The part of the dataframe that has already been computed can be discarded, but what is still in the dataframe must not be erased.
My first guess was to iterate over the rows and compare the week numbers:
# get the week number of the first row
start_week = temp.data.index.week[0]
# temp.data is my dataframe
for index, row in temp.data.iterrows():
    if index.week < start_week + 2:
        print(index.week)
but it is really slow, so it is probably not the proper way to do it.

Welcome to Stack Overflow. Please note that your question is not very specific, which makes it difficult to give you exactly what you want. Ideally, you would supply code to recreate your dataset and also post the expected outcome. I'll cover two parts: (i) working with dataframes sliced using time-specific functions and (ii) applying statistical functions using rolling window operations.
Working with Dataframes and time indices
The question is not how to get the mean of x, because you know how to do that (x.mean()). The question is how to get x: how do you select the elements of a dataframe whose timestamps satisfy certain conditions? I will use a series generated from the documentation, which I found after googling for one minute:
In[13]: ts
Out[13]:
2011-01-31 0.356701
2011-02-28 -0.814078
2011-03-31 1.382372
2011-04-29 0.604897
2011-05-31 1.415689
2011-06-30 -0.237188
2011-07-29 -0.197657
2011-08-31 -0.935760
2011-09-30 2.060165
2011-10-31 0.618824
2011-11-30 1.670747
2011-12-30 -1.690927
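For reference, a monthly series like this can be built along the following lines (a minimal sketch with random values, so the numbers will differ from those shown above):
import numpy as np
import pandas as pd

idx = pd.date_range('2011-01-01', '2011-12-31', freq='BM')  # business month ends
ts = pd.Series(np.random.randn(len(idx)), index=idx)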
Then, you can select some time series based on index weeks using
ts[(ts.index.week > 3) & (ts.index.week < 10)]
And specifically, if you want to get the mean of this series, you can do
ts[(ts.index.week > 3) & (ts.index.week < 10)].mean()
If you work with a dataframe, you might want to select the column first:
df[(df.index.week > 3) & (df.index.week < 10)]['someColumn'].mean()
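As a side note, DatetimeIndex.week is deprecated in recent pandas versions; assuming pandas >= 1.1, the same selection can be written with isocalendar():
weeks = ts.index.isocalendar().week
ts[(weeks > 3) & (weeks < 10)].mean()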
Rolling window operations
Now, if you want to apply rolling statistics to a time-indexed pandas object, have a look at this part of the manual.
Given a monthly time series, say I want a 3-month rolling mean, I'd do:
rolling_mean(ts, window=3)
Out[25]:
2011-01-31 NaN
2011-02-28 NaN
2011-03-31 0.308331
2011-04-29 0.391064
2011-05-31 1.134319
2011-06-30 0.594466
2011-07-29 0.326948
2011-08-31 -0.456868
2011-09-30 0.308916
2011-10-31 0.581076
2011-11-30 1.449912
2011-12-30 0.199548
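(Side note: rolling_mean has since been removed from pandas; in current versions the equivalent is the .rolling() method:)
ts.rolling(window=3).mean()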

Related

How to best calculate average issue age for a range of historic dates (pandas)

Objective:
I need to show the trend in ageing of issues. e.g. for each date in 2021 show the average age of the issues that were open as at that date.
Starting data (historic issue list), "df":
ref    Created      resolved
a1     20/4/2021    15/5/2021
a2     21/4/2021    15/6/2021
a3     23/4/2021    15/7/2021
Endpoint, "df2":
Date        Avg_age
1/1/2021    x
2/1/2021    y
3/1/2021    z
where x,y,z are the calculated averages of age for all issues open on the Date.
Tried so far:
I got this to work in what feels like a very poor way.
create a date range with pd.date_range(start, finish, freq="D")
I loop through the dates in this range and for each date I filter the "df" dataframe (boolean filtering) to show only issues live on the date in question, then calculate the age (date - Created) and average it for those; each result is appended to a list.
once done I just convert the list into a series for my final result, which I can then graph or whatever.
hist_dates = pd.date_range(start="2021-01-01", end="2021-12-31", freq="D")
result_list = []
for each_date in hist_dates:
    f1 = df.Created < each_date    # filter 1
    f2 = df.resolved >= each_date  # filter 2
    df['Refdate'] = each_date      # column to allow Refdate - Created
    df['Age'] = df.Refdate - df.Created
    result_list.append(df[f1 & f2].Age.mean())
Problems:
This works, but it feels sloppy and it doesn't seem fast. The current data-set is small, but I suspect this wouldn't scale well. I'm trying not to solve everything with loops as I understand it is a common mistake for beginners like me.
I'll give you two solutions: the first is step-by-step so you can understand the idea and process, the second replicates the functionality in a much more condensed way, skipping some intermediate steps.
First, create a new column that holds your issue age, i.e. df['age'] = df.resolved - df.Created (I'm assuming your columns are of datetime type, if not, use pd.to_datetime to convert them)
You can then use groupby to group your data by creation date. This will internally slice your dataframe into several pieces, one for each distinct value of Created, grouping all values with the same creation date together. This way, you can then use aggregation on a creation date level to get the average issue age like so
# [['Created', 'age']] selects only the columns you are interested in
df[['Created', 'age']].groupby('Created').mean()
With an additional fourth data point [a4, 2021/4/20, 2021/4/30] (to enable some proper aggregation), this would end up giving you the following Series with the average issue age by creation date:
age
Created
2021-04-20 17 days 12:00:00
2021-04-21 55 days 00:00:00
2021-04-23 83 days 00:00:00
A more condensed way of doing this is to define a custom function and apply it to each creation-date group:
def issue_age(group: pd.DataFrame):
    return (group['resolved'] - group['Created']).mean()

df.groupby('Created').apply(issue_age)
This call will give you the same Series as before.
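For reference, the example output above can be reproduced with a small frame along these lines (a sketch that includes the extra a4 row mentioned earlier):
import pandas as pd

df = pd.DataFrame({
    'ref': ['a1', 'a2', 'a3', 'a4'],
    'Created': pd.to_datetime(['2021-04-20', '2021-04-21', '2021-04-23', '2021-04-20']),
    'resolved': pd.to_datetime(['2021-05-15', '2021-06-15', '2021-07-15', '2021-04-30']),
})
df['age'] = df.resolved - df.Created
print(df[['Created', 'age']].groupby('Created').mean())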

Find indices of rows of dates falling within datetime range

Using datetime and a dataframe, I want to find which rows fall within the range of dates I have specified.
Sample dataframe:
times = pd.date_range(start="2018-01-01", end="2020-02-02")
values = np.random.rand(len(times))  # one value per timestamp
# Make df
df = pd.DataFrame({'Time': times,
                   'Value': values})
How do I easily select all values that fall within a certain month or range?
I feel like a good step is using:
pd.to_datetime(df['Time']).dt.to_period('M')
>>> df
0 2018-01
1 2018-02
2 2018-03
But I wouldn't know how to continue. I would like to be able to select a year/month like 2019-01 or a range 2019-01:2020-01 and find the matching indices in the dataframe.
I apparently did the right thing already, but had the wrong syntax.
It was quick, but just to be sure here is the answer:
np.where(pd.to_datetime(df['Time']).dt.to_period('M') == '2018-01')
Then a range can be specified as well.
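For example (a sketch, comparing against explicit Period objects):
periods = pd.to_datetime(df['Time']).dt.to_period('M')
mask = (periods >= pd.Period('2019-01', 'M')) & (periods <= pd.Period('2020-01', 'M'))
np.where(mask)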
With query you can also select date ranges pretty quickly:
df.query('"2019-01-01" <= Time < "2019-02-01"')
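If Time is set as the index, partial string slicing on the resulting DatetimeIndex works as well (a sketch):
df.set_index('Time').loc['2019-01':'2020-01']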

Get last row with specific value in column, Python

I have
df = pd.DataFrame(...)  # columns: 'data', 'car', 'ok'
df:
data car ok
2020-03-25 A
2020-04-01 A x
2020-04-15 A
2020-05-08 A x
2020-06-25 A x
2020-06-27 A
2020-07-15
I want to select the last row (the oldest one, in this case) where column 'ok' contains "x".
I want to obtain
2020-04-01 A x
Thanks!
The head() method on DataFrame can give you the first n rows of a DataFrame. Indexing of a DataFrame will allow you to select rows meeting your filter criteria - to narrow to a subset DataFrame with such rows. Together, you could use them to do:
r = df.loc[df.ok == 'x', :].head(1)
What you are doing here is narrowing to a subset DataFrame where ok is 'x' (the df.loc[df.ok == 'x', :] part), then taking the first row of it (the .head(1) part). This of course assumes the DataFrame is sorted by date as it is above.
Indexing is a huge and fundamental topic (think of it as the SQL WHERE of pandas) so you should spend time gaining a deep knowledge of it. Here is a great tutorial. This one and this one are also good.
This will also work when your data is not sorted:
sub = df[df.ok == 'x']
sub[sub.data == sub.data.min()]
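An alternative that avoids computing the minimum separately is to sort and take the first row (a sketch, assuming the data column holds comparable dates):
df[df.ok == 'x'].sort_values('data').head(1)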

Pandas - How to remove duplicates based on another series?

I have a dataframe that contains three series called Date, Element,
and Data_Value--their types are string, string, and numpy.int64
respectively. Date has dates in the form of yyyy-mm-dd; Element has
strings that say either TMIN or TMAX, and it denotes whether the
Data_Value is the minimum or maximum temperature of a particular date;
lastly, the Data_Value series just represents the actual temperature.
The date series has multiple duplicates of the same date. E.g. for the
date 2005-01-01, there are 19 entries for the temperature column, the
values start at 28 and go all the way up to 156. I want to create a
new dataframe with the date and the maximum temperature only--I'll
eventually want one for TMIN values too, but I figure that if I can do
one I can figure out the other. I'll post some pseudocode with
explanation below to show what I've tried so far.
So far I have pulled in the csv and assigned it to a variable, df.
Then I sorted the values by Date, Element and Temperature
(Data_Value). After that, I created a variable called tmax that grabs
the necessary dates (I only need the data from 2005-2014) that have
'TMAX' as its Element value. I cast tmax into a new DataFrame, reset
its index to get rid of the useless index data from the first
dataframe, and dropped the 'Element' column since it was redundant at
this point. Now I'm (ultimately) trying to create a list of all the
Temperatures for TMAX so that I can plot it with pyplot. But I can't
figure out for the life of me how to reduce the dataframe to just the
single date and max value for that date. If I could just get that then
I could easily convert the series to a list and plot it.
def record_high_and_low_temperatures():
    # read in csv
    df = pd.read_csv('somedata.csv')
    # sort values so they're in a nice order
    df.sort_values(by=['Date', 'Element', 'Data_Value'], inplace=True)
    # grab all entries for TMAX in the correct date range
    tmax = df[(df['Element'] == 'TMAX') & (df['Date'].between("2005-01-01", "2014-12-31"))]
    # cast to dataframe
    tmax = pd.DataFrame(tmax, columns=['Date', 'Data_Value'])
    # remove index column from previous dataframe
    tmax.reset_index(drop=True, inplace=True)
    # this is where I'm stuck: how do I get the max value per unique date?
    max_temp_by_date = tmax.loc[tmax['Data_Value'].idxmax()]
Any and all help is appreciated, let me know if I need to clarify anything.
TL;DR:
Ok...
input dataframe looks like
date | data_value
2005-01-01 28
2005-01-01 33
2005-01-01 33
2005-01-01 44
2005-01-01 56
2005-01-02 0
2005-01-02 12
2005-01-02 30
2005-01-02 28
2005-01-02 22
Expected df should look like:
date | data_value
2005-01-01 79
2005-01-02 90
2005-01-03 88
2005-01-04 44
2005-01-05 63
I just want a dataframe that has each unique date coupled with the highest temperature on that day.
If I understand you correctly, what you want to do, as Grzegorz already suggested in the comments, is to group by date (take all elements of one date) and then take the maximum within that date:
df.groupby('date').max()
This will take all your groups and reduce them to only one row, taking the maximum element of every group. In this case, max() is called the aggregation function of the group. As you mentioned that you will also need the minimum at some point, a nice way to do this (instead of two groupbys) is to do the following:
df.groupby('date').agg(['max', 'min'])
which will pass over all groups once and apply both aggregation functions max and min returning two columns for each input column. More documentation on aggregation is here.
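If you prefer the date back as a regular column (for example, for plotting), you can pass as_index=False or call reset_index() afterwards (a small sketch, using the lower-case column names from the TL;DR example):
df.groupby('date', as_index=False)['data_value'].max()
# or equivalently
df.groupby('date')['data_value'].max().reset_index()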
Try this:
df.groupby('Date')['Data_Value'].max()

With pandas, how do I calculate a rolling number of events in the last second given timestamp data?

I have dataset where I calculate service times based on request and response times. I would like to add a calculation of requests in the last second to show the obvious relationship that as we get more requests per second the system slows. Here is the data that I have, for example:
serviceTimes.head()
Out[71]:
Id Req_Time Rsp_Time ServiceTime
0 3_1 2015-02-13 14:07:08.729000 2015-02-13 14:07:08.821000 00:00:00.092000
1 3_2 2015-02-13 14:07:08.929000 2015-02-13 14:07:08.929000 00:00:00
2 3_12 2015-02-13 14:11:53.908000 2015-02-13 14:11:53.981000 00:00:00.073000
3 3_14 2015-02-13 14:11:54.111000 2015-02-13 14:11:54.250000 00:00:00.139000
4 3_15 2015-02-13 14:11:54.111000 2015-02-13 14:11:54.282000 00:00:00.171000
For this I would like a rolling data set of something like:
0 14:07:08 2
1 14:11:53 1
2 14:11:54 2
I've tried rolling_sum and rolling_count, but unless I am using them wrong or not understanding the period function, it is not working for me.
For your problem, it looks like you want to summarize your data set using a split-apply-combine approach. See here for the documentation that will help you get your code working, but basically you'll want to do the following:
Create a new column (say, 'Req_Time_Sec') that contains Req_Time truncated to second resolution (e.g. 14:07:08.729000 becomes 14:07:08)
use groups = serviceTimes.groupby('Req_Time_Sec') to separate your data set into sub-groups based on which second each request occurs in.
Finally, create a new data set by calculating the length of each sub-group (which represents the number of requests in that second) and aggregating the results into a single DataFrame (something like new_df = groups.aggregate(len))
The above is all untested pseudo-code, but the code, along with the link to the documentation, should help you get where you want to go.
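A minimal sketch of those steps (assuming Req_Time is already a datetime64 column, and using size() in place of aggregate(len)):
# truncate request times to whole seconds
serviceTimes['Req_Time_Sec'] = serviceTimes['Req_Time'].dt.floor('s')
# number of requests in each second
requests_per_sec = serviceTimes.groupby('Req_Time_Sec').size()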
You can first transform the timestamp into a string, then group by it to show the count and the average service time:
serviceTimes['timestamp'] = [t.strftime('%y-%m-%d %H:%M') for t in serviceTimes.Req_Time]
serviceTimes.groupby('timestamp')['ServiceTime'].agg(['mean', 'count'])
Alternatively, create a data frame of the request times in the appropriate string format, e.g. 15-02-13 14:07, then count the occurrences of each timestamp using value_counts(). You can also plot the results quite easily.
df = pd.DataFrame([t.strftime('%y-%m-%d %H:%M') for t in serviceTimes.Req_Time],
                  columns=['timestamp'])
response = df.timestamp.value_counts()
response.plot(rot=90)
