I have the following dataframe:
data = [
("10/10/2016","A"),
("10/10/2016","B"),
("09/12/2016","B"),
("09/12/2016","A"),
("08/11/2016","A"),
("08/11/2016","C")]
#Create DataFrame base
df = pd.DataFrame(data, columns=("Time","User"))
# Convert time column to correct format for time calculations
df["Time"] = pd.to_datetime(df["Time"], '%m/%d/%Y')
Each row represents when a user makes a specific action. I want to compute how frequently (in terms of days) each user makes that specific action.
Let's say user A transacted for the first time on 08/11/2016, then again on 09/12/2016, i.e. around 30 days later. He then transacted again on 10/10/2016, around 29 days after his second transaction. So his average frequency in days would be (29+30)/2.
What is the most efficient way to do that?
Thanks in advance!
Update
I wrote the following function that computes my desired output.
from datetime import timedelta
def averagetime(a):
    numdeltas = len(a) - 1
    sumdeltas = 0
    i = 1
    while i < len(a):
        delta = abs((a[i] - a[i-1]).days)
        sumdeltas += delta
        i += 1
    if numdeltas > 1:
        avg = sumdeltas / numdeltas
    else:
        avg = 'NaN'
    return avg
It works correctly, for example, when I pass the whole "Time" column:
averagetime(df["Time"])
But it gives me an error when I try to apply it after group by.
df.groupby('User')['Time'].apply(averagetime)
Any suggestions how I can fix the above?
You can use diff, convert the timedeltas to float days by dividing by np.timedelta64(1,'D'), take the absolute value and sum:
print (averagetime(df["Time"]))
12.0
su = ((df["Time"].diff() / np.timedelta64(1,'D')).abs().sum())
print (su / (len(df) - 1))
12.0
Then apply it per group. A guard for single-row groups is necessary, because otherwise you get:
ZeroDivisionError: float division by zero
print (df.groupby('User')['Time']
         .apply(lambda x: np.nan if len(x) == 1
                else (x.diff()/np.timedelta64(1,'D')).abs().sum()/(len(x)-1)))
User
A 30.0
B 28.0
C NaN
Name: Time, dtype: float64
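For reference, the same per-user figures can be obtained a little more compactly by sorting first and letting mean() handle the NaN produced by diff(); this is just a sketch equivalent to the lambda above, not a different method:
print(df.sort_values("Time")
        .groupby("User")["Time"]
        .apply(lambda s: s.diff().dt.days.abs().mean()))
mean() skips the leading NaN of each group and returns NaN for single-row groups, so no explicit len(x) == 1 check is needed.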
Building on @Jezrael's answer:
If by "how frequently" you mean how much time passes between each time a user performs the action, then here's an approach:
import pandas as pd
import numpy as np
data = [
("10/10/2016","A"),
("10/10/2016","B"),
("09/12/2016","B"),
("09/12/2016","A"),
("08/11/2016","A"),
("08/11/2016","C"),
]
# Create DataFrame base
df = pd.DataFrame(data, columns=("Time","User"))
# Convert time column to correct format for time calculations
df["Time"] = pd.to_datetime(df["Time"], dayfirst=True)
# Group the DF by min, max and count the number of instances
grouped = (df.groupby("User").agg([np.max, np.min, np.count_nonzero])
# This step is a bit messy and could be improved,
# but we need the count as an int
.assign(counter=lambda x: x["Time"]["count_nonzero"].astype(int))
# Use apply to calculate the time between first and last, then divide by frequency
.apply(lambda x: (x["Time"]["amax"] - x["Time"]["amin"]) / x["counter"].astype(int), axis=1)
)
# Output the DF if using an interactive prompt
grouped
Output:
User
A 20 days
B 30 days
C 0 days
Related
I have a DataFrame with events that each have a start and end date. I also have a reporting period with a start and end date and a reporting frequency, e.g. monthly. What I want to calculate is the number of "active" events in each reporting period bin. An active event is an event for which the time overlaps with the time interval of the reporting period bin.
After struggling too much with DataFrame aggregation functions, I have come up with the following code that does the job but is far from compact and elegant.
I'm pretty sure there is a way to write this more compactly but need some leads.
import numpy as np
import pandas as pd
import datetime as dt
# Example DF of events each with a start and end date provided as a string (my input data)
df = pd.DataFrame(columns=['id','start','end'], index=range(7),
                  data=[[1,'2006-01-01','2007-10-01'],
                        [2,'2007-10-02','2008-12-01'],
                        [3,'2010-01-15','2010-10-20'],
                        [4,'2009-04-04','2010-06-03'],
                        [5,'2010-05-12','2010-08-31'],
                        [6,'2016-05-12','2199-12-31'],
                        [7,'2016-05-12','2199-12-31']])
# Reporting period in which we want to calculate the number of "ongoing"/"active" events:
reporting_period_start = '2010-01-01'
reporting_period_end = '2011-01-01'
reporting_freq = 'MS'
print('Input data:')
print(df)
# Convert the string dates to timestamps
def to_timestamp(str):
    return pd.Timestamp(str)
df.start = df.start.apply(to_timestamp)
df.end = df.end.apply(to_timestamp)
# Create an additional column in the dataframe to capture the event time interval as a pandas.Interval
# pandas.Interval offers a nice .overlaps() function
def to_interval(s, e):
    return pd.Interval(s, e)
df['interval'] = df.apply(lambda row: to_interval(row.start, row.end), axis=1)
# Create a data range and a period range to create reporting intervals (e.g. monthly)
# for which we want to count the number of event intervals that overlap with the reporting interval.
bins = pd.date_range(reporting_period_start, reporting_period_end, freq=reporting_freq)
print(bins)
# Convert the date ranges into a list of reporting intervals
# This is ugly code that most probably can be written a lot more elegantly
intervals = []
n = bins.values.shape[0]
i = 0
for b in bins[:-1]:
    intervals.append(pd.Interval(pd.to_datetime(bins.values[i]), pd.to_datetime(bins.values[(i+1)%n]), closed='right'))
    i = i + 1
# Function for trying a pandas.DataFrame.apply / resample / groupby or something alike...
def overlaps(i1, i2):
    try:
        return i1.overlaps(i2)
    except:
        return None
result_list = np.zeros(len(intervals)).astype(int)
for index, row in df.iterrows():
    j = 0
    for interval in intervals:
        result_list[j] = result_list[j] + overlaps(intervals[j], row.interval)
        j = j + 1
print(result_list)
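As an aside, the double loop above can be condensed with a pandas.IntervalIndex built from the bin edges; this is only a sketch of the same counting logic using the df and bins defined above, not the approach in the answer that follows:
# Right-closed reporting intervals straight from the bin edges
bin_intervals = pd.IntervalIndex.from_breaks(bins, closed='right')
# For each reporting interval, count how many event intervals overlap it
counts = [sum(iv.overlaps(ev) for ev in df['interval']) for iv in bin_intervals]
print(counts)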
If you think of your intervals as step functions, which have a value of 1 for the duration of the interval, and 0 otherwise then this can be concisely solved with staircase which has been built upon pandas and numpy for analysis with step functions.
In this setup code I have changed the dates in year 2199 to None to indicate that the end time is not known. I'm assuming that's what you may have wanted; if this is not correct, don't make this change.
setup
import numpy as np
import pandas as pd
# Example DF of events each with a start and end date provided as a string
df = pd.DataFrame(
    columns=['id','start','end'],
    index=range(7),
    data=[[1,'2006-01-01','2007-10-01'],
          [2,'2007-10-02','2008-12-01'],
          [3,'2010-01-15','2010-10-20'],
          [4,'2009-04-04','2010-06-03'],
          [5,'2010-05-12','2010-08-31'],
          [6,'2016-05-12',None],
          [7,'2016-05-12',None]])
df["start"] = pd.to_datetime(df["start"])
df["end"] = pd.to_datetime(df["end"])
reporting_period_start = '2010-01-01'
reporting_period_end = '2011-01-01'
reporting_freq = 'MS'
solution
Your intervals do not start and end on month boundaries. We need to "floor" the start times to month boundaries, and "ceiling" the end times to month boundaries, to make sure that intervals that overlap a month overlap each other too. To my knowledge, there is currently no elegant way to do this, but the following will work
df["start"] = df["start"].dt.to_period("M").dt.to_timestamp()
df["end"] = (df["end"].dt.to_period("M")+1).dt.to_timestamp()
df now looks like this
id start end
0 1 2006-01-01 2007-11-01
1 2 2007-10-01 2009-01-01
2 3 2010-01-01 2010-11-01
3 4 2009-04-01 2010-07-01
4 5 2010-05-01 2010-09-01
5 6 2016-05-01 NaT
6 7 2016-05-01 NaT
Now we create a step function which is the combination of all intervals. When an interval starts the step function value increases by 1. When an interval finishes the value decreases by 1. So the value of the step function at any point will be the number of intervals overlapping that point. A step function is represented by the staircase.Stairs class. This class is to staircase as Series is to pandas.
import staircase as sc
stepfunction = sc.Stairs(df, "start", "end")
There are many things you can do with step functions in staircase including plotting.
stepfunction.plot(style="hlines")
Since the intervals now start and end at month boundaries, and the bins are month boundaries we can answer your question by finding the maximum value of the step function for each month.
bins = pd.date_range(reporting_period_start, reporting_period_end, freq=reporting_freq)
result = stepfunction.slice(bins).max()
result will be a pandas.Series indexed by a monthly interval index, whose values are the number of intervals overlapping with that month
[2010-01-01, 2010-02-01) 2.0
[2010-02-01, 2010-03-01) 2.0
[2010-03-01, 2010-04-01) 2.0
[2010-04-01, 2010-05-01) 2.0
[2010-05-01, 2010-06-01) 3.0
[2010-06-01, 2010-07-01) 3.0
[2010-07-01, 2010-08-01) 2.0
[2010-08-01, 2010-09-01) 2.0
[2010-09-01, 2010-10-01) 1.0
[2010-10-01, 2010-11-01) 1.0
[2010-11-01, 2010-12-01) 0.0
[2010-12-01, 2011-01-01) 0.0
dtype: float64
To recap, the solution (after imports and setup) is
df["start"] = df["start"].dt.to_period("M").dt.to_timestamp()
df["end"] = (df["end"].dt.to_period("M")+1).dt.to_timestamp()
result = sc.Stairs(df, "start", "end").slice(bins).max()
note: I am the creator of staircase. Please feel free to reach out with feedback or questions if you have any.
df['diff']
23:59:01
23:59:13
23:59:17
23:59:27
23:59:52
The hh:mm:ss data is obtained after calculating the difference between sessions via Timedelta.
I converted the time into seconds and found the median. How do I find the median in hh:mm:ss format?
The diff column needs to be converted to numerical seconds first.
import pandas as pd
def time2sec(t):
    (h, m, s) = t.split(':')
    return int(h) * 3600 + int(m) * 60 + int(s)
df = pd.DataFrame(['23:59:01','23:59:13','23:59:17','23:59:27','23:59:52'],columns=['diff'])
df['diff_sec'] = df['diff'].map(time2sec)
print(df)
median = df['diff_sec'].median()
print('median :',median)
diff diff_sec
0 23:59:01 86341
1 23:59:13 86353
2 23:59:17 86357
3 23:59:27 86367
4 23:59:52 86392
median : 86357.0
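If the median then needs to be displayed as hh:mm:ss again, one option (a sketch, not part of the answer above) is to round-trip through a Timedelta:
# 86357.0 seconds -> '23:59:17'
print(str(pd.Timedelta(seconds=median)).split()[-1])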
If your data is already in Timedelta format as you mentioned, you can just call .median() on the series.
You can try:
pd.to_timedelta(df['diff']).median()
pd.to_timedelta converts the duration strings to Timedelta. Then, we can use Series.median() to get the median.
Result:
Timedelta('0 days 23:59:17')
I have a 40-year time series in the format stn;yyyymmddhh;rainfall, where yyyy = year, mm = month, dd = day, hh = hour. The series is at an hourly resolution. I extracted the maximum values for each year with the following groupby method:
import pandas as pd
df = pd.read_csv('data.txt', delimiter = ";")
df['yyyy'] = df['yyyymmddhh'].astype(str).str[:4]
df.groupby(['yyyy'])['rainfall'].max().reset_index()
Now, I am trying to extract the maximum values for a 3-hour duration in each year. I tried the sliding-maxima approach below, but it is not working; k is the duration I am interested in. In simple words, I need the maximum precipitation sum for multiple durations in every year (e.g. 3h, 6h, etc.).
class AMS:
    def sliding_max(self, k, data):
        tp = data.values
        period = 24*365
        agg_values = []
        start_j = 1
        end_j = k*int(np.floor(period/k))
        for j in range(start_j, end_j + 1):
            start_i = j - 1
            end_i = j + k + 1
            agg_values.append(np.nansum(tp[start_i:end_i]))
        self.sliding_max = max(agg_values)
        return self.sliding_max
Any suggestions or improvements to my code, or is there a way I can implement it with groupby? I am a bit new to the Python environment, so please excuse me if the question isn't put properly.
Stn;yyyymmddhh;rainfall
xyz;1981010100;0.0
xyz;1981010101;0.0
xyz;1981010102;0.0
xyz;1981010103;0.0
xyz;1981010104;0.0
xyz;1981010105;0.0
xyz;1981010106;0.0
xyz;1981010107;0.0
xyz;1981010108;0.0
xyz;1981010109;0.4
xyz;1981010110;0.6
xyz;1981010111;0.1
xyz;1981010112;0.1
xyz;1981010113;0.0
xyz;1981010114;0.1
xyz;1981010115;0.6
xyz;1981010116;0.0
xyz;1981010117;0.0
xyz;1981010118;0.2
xyz;1981010119;0.0
xyz;1981010120;0.0
xyz;1981010121;0.0
xyz;1981010122;0.0
xyz;1981010123;0.0
xyz;1981010200;0.0
You first have to convert the column containing the datetimes to a Series of dtype datetime. You can do the parsing by providing the format of your datetimes.
df["yyyymmddhh"] = pd.to_datetime(df["yyyymmddhh"], format="%Y%M%d%H")
After having the correct data type you have to set that column as your index and can now use pandas functionality for time series data (resampling in your case).
First you resample the data to 3 hour windows and sum the values. From that you resample to yearly data and take the maximum value of all the 3 hour windows for each year.
df.set_index("yyyymmddhh").resample("3H").sum().resample("Y").max()
# Output
yyyymmddhh rainfall
1981-12-31 1.1
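Note that resample("3H") sums fixed, non-overlapping 3-hour bins. If a true sliding 3-hour window is wanted, as in the class in the question, a time-based rolling sum could be used instead; this is a sketch under that assumption, reusing the same datetime-indexed frame:
# Maximum running 3-hour rainfall total per year (windows may straddle bin edges)
(df.set_index("yyyymmddhh")["rainfall"]
   .rolling("3H")
   .sum()
   .resample("Y")
   .max())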
I have a half-hourly dataframe with two columns. I would like to take all the hours of a day, do some calculation which returns one number, and assign that to all half-hours of that day. Below is example code:
dates = pd.date_range("2003-01-01 08:30:00","2003-01-05",freq="30min")
data = np.transpose(np.array([np.random.rand(dates.shape[0]),np.random.rand(dates.shape[0])*100]))
data[0:50,0]=np.nan # my actual dataframe includes nan
df = pd.DataFrame(data=data, index=dates, columns=["DATA1","DATA2"])
print(df)
DATA1 DATA2
2003-01-01 08:30:00 NaN 79.990866
2003-01-01 09:00:00 NaN 5.461791
2003-01-01 09:30:00 NaN 68.892447
2003-01-01 10:00:00 NaN 44.823338
2003-01-01 10:30:00 NaN 57.860309
... ... ...
2003-01-04 22:00:00 0.394574 31.943657
2003-01-04 22:30:00 0.140950 78.275981
Then I would like to apply the following function which returns one number:
def my_f(data1,data2):
    y = data1[data2>20]
    return np.median(y)
This function selects all data in DATA1 based on a condition (DATA2>20) then takes the median of all these data.
How can I create a third column (let's say result) and assign back this fixed number (y) for all half-hours data of that day?
My guess is I should use something like this:
daily_tmp = df.resample('D').apply(my_f)
df['results'] = daily_tmp.reindex(df.index, method='ffill')
If this approach is correct, how can I pass my_f with two arguments to resample.apply()?
Or is there any other way to do the similar task?
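One way that matches the shape of the question (a sketch, not taken from either answer below; the wrapper name my_f_frame is made up here) is to group by calendar day with pd.Grouper, hand each day's sub-frame to a small wrapper around my_f, and forward-fill the daily result back onto the half-hourly index:
# Hypothetical wrapper that unpacks one day's two columns for my_f
def my_f_frame(day):
    return my_f(day["DATA1"], day["DATA2"])

daily_tmp = df.groupby(pd.Grouper(freq="D")).apply(my_f_frame)
df["results"] = daily_tmp.reindex(df.index, method="ffill")
reindex with method='ffill' carries each day's single value forward onto every half-hour timestamp of that day.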
My solution assumes that you have a fairly small dataset. Please let me know if it is not the case.
I would decompose your goal as follows:
(1) group data by day
(2) for each day, compute some complicated function
(3) assign the resulting value to the half-hours.
# specify the day for each datapoint
df['day'] = df.index.map(lambda x: x.strftime('%Y-%m-%d'))
# compute a complicated function for each day and store the result
mapping = {}
for day, data_for_the_day in df.groupby(by='day'):
    # assign to mapping[day] the result of a complicated function
    mapping[day] = np.mean(data_for_the_day[data_for_the_day['DATA2'] > 20]['DATA1'])
# assign the values to half-hours
df['result'] = df.index.map(lambda x: mapping.get(x.strftime('%Y-%m-%d'), np.nan) if x.strftime('%M')=='30' else np.nan)
That's not the neatest solution, but it is straightforward, easy to understand, and works well on small datasets.
Here is a fast way to do it.
First, import the libraries:
import time
import pandas as pd
import numpy as np
import datetime as dt
Second, the code to achieve it:
%%time
dates = pd.date_range("2003-01-01 08:30:00","2003-01-05",freq="30min")
data = np.transpose(np.array([np.random.rand(dates.shape[0]),np.random.rand(dates.shape[0])*100]))
data[0:50,0]=np.nan # my actual dataframe includes nan
df = pd.DataFrame(data = data,index =dates,columns=["DATA1","DATA2"])
#### Create a unique marker per hour
df['Date'] = df.index
df['Date'] = df['Date'].dt.strftime(date_format='%Y-%m-%d %H')
#### Then Stipulate some conditions
_condition_1 = df.Date == df.Date.shift(-1) # if full hour
_condition_2 = df.DATA2 > 20 # yours
_condition_3 = df.Date == df.Date.shift(1) # if half an hour
#### Now, report the median where conditions 1 and 2 are fulfilled
df['result'] = np.where(_condition_1 & _condition_2,(df.DATA1+df.DATA1.shift(-1))/2,0)
#### Fill the hours with median
df['result'] = np.where(_condition_3,df.result.shift(1),df.result)
#### Drop useless column
df = df.drop(['Date'],axis=1)
df[df.DATA2>20].tail(20)
Third: the output (shown as a screenshot in the original post).
I'm trying to calculate daily sums of values using pandas. Here's the test file - http://pastebin.com/uSDfVkTS
This is the code I came up so far:
import numpy as np
import datetime as dt
import pandas as pd
f = np.genfromtxt('test', dtype=[('datetime', '|S16'), ('data', '<i4')], delimiter=',')
dates = [dt.datetime.strptime(i, '%Y-%m-%d %H:%M') for i in f['datetime']]
s = pd.Series(f['data'], index = dates)
d = s.resample('D', how='sum')
Using the given test file this produces:
2012-01-02 1128
Freq: D
The first problem is that the calculated sum corresponds to the next day. I've been able to solve that by using the parameter loffset='-1d'.
Now the actual problem is that the data may not start at 00:30 of a day but at any time of the day. Also, the data has gaps filled with 'nan' values.
That said, is it possible to set a lower threshold on the number of values that are necessary to calculate daily sums? (e.g. if there are fewer than 40 values in a single day, put NaN instead of a sum)
I believe that it is possible to define a custom function to do that and refer to it in the 'how' parameter, but I have no clue how to code the function itself.
You can do it directly in Pandas:
s = pd.read_csv('test', header=None, index_col=0, parse_dates=True)
d = s.groupby(lambda x: x.date()).aggregate(lambda x: sum(x) if len(x) >= 40 else np.nan)
X.2
2012-01-01 1128
A much easier way is to use pd.Grouper:
d = s.groupby(pd.Grouper(freq='1D')).sum()
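pd.Grouper on its own drops the threshold requirement; to keep it, the same conditional aggregation can be plugged in (a sketch, assuming s is the frame produced by read_csv above and numpy is imported as np). Note that x.count() counts only non-NaN values, which matches the concern about gaps filled with nan:
d = s.groupby(pd.Grouper(freq='1D')).aggregate(
    lambda x: x.sum() if x.count() >= 40 else np.nan)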