Aggregate measurements per time period - python

I have an n x 6 matrix with the data: year, month, day, hour, minute, use.
I have to make a new matrix containing the aggregated measurements for use, grouped by the value 'hour', so that all rows recorded within the same hour are combined.
Every time the hour value changes, the code needs to know that a new period starts.
I tried something, but I don't know how to solve this.
Thank you. This is what I tried, plus a test:
def groupby_measurements(data):
    count = -1
    for i in range(9):
        array = np.split(data, np.where(data[i,3] != data[i+1,3])[0][:1])
    return array

print(groupby_measurements(np.array([[2006,2,11,1,1,55],
                                     [2006,2,11,1,11,79],
                                     [2006,2,11,1,32,2],
                                     [2006,2,11,1,41,66],
                                     [2006,2,11,1,51,76],
                                     [2006,2,11,10,2,89],
                                     [2006,2,11,10,3,33],
                                     [2006,2,11,14,2,22],
                                     [2006,2,11,14,5,34]])))
In this test case, I expect the output to be:
np.array([[2006,2,11,1,1,55],
          [2006,2,11,1,11,79],
          [2006,2,11,1,32,2],
          [2006,2,11,1,41,66],
          [2006,2,11,1,51,76]]),
np.array([[2006,2,11,10,2,89],
          [2006,2,11,10,3,33]]),
np.array([[2006,2,11,14,2,22],
          [2006,2,11,14,5,34]])
The final output should be:
np.array([2006,2,11,1,0,278]),
np.array([2006,2,11,10,0,122]),
np.array([2006,2,11,14,0,56])
(the sum of use in the three hour periods)

I would recommend using pandas DataFrames, and then using groupby combined with sum:
import pandas as pd
import numpy as np

data = pd.DataFrame(np.array(
    [[2006,2,11,1,1,55],
     [2006,2,11,1,11,79],
     [2006,2,11,1,32,2],
     [2006,2,11,1,41,66],
     [2006,2,11,1,51,76],
     [2006,2,11,10,2,89],
     [2006,2,11,10,3,33],
     [2006,2,11,14,2,22],
     [2006,2,11,14,5,34]]),
    columns=['year','month','day','hour','minute','use'])
aggregated = data.groupby(['year','month','day','hour'])['use'].sum()
# you can also use .agg and pass which aggregation function you want as a string.
aggregated = data.groupby(['year','month','day','hour'])['use'].agg('sum')
year  month  day  hour
2006  2      11   1       278
                  10      122
                  14      56
Name: use, dtype: int64
aggregated is now a pandas Series; if you want it as a NumPy array, just use
aggregated.values
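If you need the final result in the exact matrix layout you described (year, month, day, hour, a zeroed minute, summed use), you can move the group keys back into columns; a small sketch building on the aggregated Series above:
result = aggregated.reset_index()  # year/month/day/hour become columns again
result.insert(4, 'minute', 0)      # minute is 0 in the aggregated rows
print(result.values)
# [[2006    2   11    1    0  278]
#  [2006    2   11   10    0  122]
#  [2006    2   11   14    0   56]]
If you would rather stay in plain NumPy, close to your np.split attempt, you can split on changes across the whole hour column instead of a single element; a sketch where data_arr stands for your original n x 6 array:
# indices where the hour column (index 3) changes between consecutive rows
breaks = np.where(np.diff(data_arr[:, 3]) != 0)[0] + 1
groups = np.split(data_arr, breaks)
# keep year/month/day/hour from each group's first row, zero the minute, sum the use
result = np.array([[g[0, 0], g[0, 1], g[0, 2], g[0, 3], 0, g[:, 5].sum()]
                   for g in groups])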

Related

Generating rolling averages from a time series, but subselecting based on month

I have a long time series of weekly data. For a given observation, I want to compare that week's value against the average of the three previous years' average values for the same month.
Concrete example: for the 2019-02-15 datapoint, I want to compare it to the average of all the Feb-2018, Feb-2017, and Feb-2016 datapoints.
I want to populate the entire time series in this way (the first three years will be np.nans, of course).
I made a really gross single-datapoint example of the calculation I want to do, but I am not sure how to implement this as a vectorized solution. I am also not thrilled that I had to use the intermediate helper table "mth_avg".
import pandas as pd

ix = pd.date_range(freq='W-FRI', start='20100101', end='20190301')
df = pd.DataFrame({"foo": [x for x in range(len(ix))]}, index=ix)  # weekly data
mth_avg = df.resample("M").mean()  # data as a monthly average over time
mth_avg['month_hack'] = mth_avg.index.month

# average of previous three years' same-month averages
df['avg_prev_3_year_same-month'] = "?"

# single arbitrary example of my intention
df.loc['2019-02-15', "avg_prev_3_year_same-month"] = (
    mth_avg[mth_avg.month_hack == 2]
    .loc[:'2019-02-15']
    .iloc[-3:]
    .loc[:, 'foo']
    .mean()
)
df[-5:]
I think it's actually a nontrivial problem; there's no existing functionality in Pandas that I'm aware of for this. Making a helper table saves calculation time (in fact, I used two). My solution uses a loop (namely a list comprehension) and Pandas' datetime awareness to avoid your month_hack. Otherwise I think it was a good start. I'd be happy to see something more elegant!
import numpy as np
import pandas as pd

# your code
ix = pd.date_range(freq='W-FRI', start='20100101', end='20190301')
df = pd.DataFrame({"foo": [x for x in range(len(ix))]}, index=ix)
mth_avg = df.resample("M").mean()

# use a multi-index of month/year with month first
mth_avg.index = [mth_avg.index.month, mth_avg.index.year]

# per month: rolling mean over that year and the two before it
tmp = mth_avg.sort_index().groupby(level=0).rolling(3).foo.mean()
tmp.index = tmp.index.droplevel(0)

# get the rolling value from tmp for the previous year's same month
res = [tmp.xs((i.month, i.year - 1)) for i in df[df.index > '2010-12-31'].index]

# NaNs for 2010
df['avg_prev_3_year_same-month'] = np.NaN
df.loc[df.index > '2010-12-31', 'avg_prev_3_year_same-month'] = res

# output
df.sort_index(ascending=False).head()
            foo  avg_prev_3_year_same-month
2019-03-01  478                  375.833333
2019-02-22  477                  371.500000
2019-02-15  476                  371.500000
2019-02-08  475                  371.500000
2019-02-01  474                  371.500000
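To see why the comprehension looks up i.year - 1: after the groupby-rolling step, tmp holds, for each (month, year) pair, the mean of that month's monthly averages over that year and the two before it, so indexing the previous year yields exactly the three prior years' same-month average. A quick check for the 2019-02-15 example above:
# mean of the Feb-2016, Feb-2017 and Feb-2018 monthly averages,
# i.e. the value assigned to the Feb-2019 weekly rows
tmp.xs((2, 2018))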

Pandas calculated column from datetime index groups loop

I have a Pandas df with a DatetimeIndex. I want to loop over the following code with different values of strike, based on the index date value (a different strike for different time periods). Here is my code, which produces what I am after for a single strike across the whole time series:
import pandas as pd
import numpy as np
index = pd.date_range('2017-10-1 00:00:00', '2018-12-31 23:50:00', freq='30min')
df = pd.DataFrame(np.random.randn(len(index), 2).cumsum(axis=0),
                  columns=['A', 'B'], index=index)
strike = 40
payoffs = df[df>strike]-strike
mean_payoff = payoffs.fillna(0).mean()
dist = mean_payoff.describe(percentiles=[0.05,.5,.95])
print(dist)
I want to use different values of strike based on the time period (index value).
So far I have tried to create a categorical calculated column with the intention of using map or apply row-wise on the df. I have also played around with creating a dictionary and mapping the dict across the df.
Even if I get the calculated column with the correct strike value, I can't think how to subtract the calculated column value (strike) from all the other columns to get the payoffs from above.
I feel like I need to use a for loop, and potentially create groups of date chunks that get appended together at the end of the loop, maybe with pd.concat.
Thanks in advance.
I think you need to convert the DatetimeIndex to quarterly periods with to_period, then to strings, and finally map with a dict.
For the comparison, you need gt combined with sub:
d = {'2017Q4': 30, '2018Q1': 40, '2018Q2': 50, '2018Q3': 60, '2018Q4': 70}
strike = df.index.to_series().dt.to_period('Q').astype(str).map(d)

payoffs = df[df.gt(strike, axis=0)].sub(strike, axis=0)
mean_payoff = payoffs.fillna(0).mean()
dist = mean_payoff.describe(percentiles=[0.05, .5, .95])
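Since strike is a Series aligned on the df's DatetimeIndex, you can sanity-check the mapping before computing payoffs; a small sketch using the question's setup (the dict d above is just example strikes per quarter):
check = pd.DataFrame({'quarter': df.index.to_series().dt.to_period('Q').astype(str),
                      'strike': strike})
print(check.drop_duplicates('quarter'))  # one row per quarter: 2017Q4 -> 30, 2018Q1 -> 40, ...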
Mapping your DataFrame index through a dictionary can be a starting point.
import random

a = dict()
a[2017] = 30
a[2018] = 40
ranint = random.choices([30, 35, 40, 45], k=21936)

# given your index used in the example
df = pd.DataFrame({'values': ranint}, index=index)
df['year'] = df.index.year
df['strike'] = df['year'].map(a)
df['returns'] = df['values'] - df['strike']
df.head(3)
                     values  year  strike
2017-10-01 00:00:00      30  2017      30
2017-10-01 00:30:00      30  2017      30
2017-10-01 01:00:00      45  2017      30
Then you can extract the returns that are greater than 0:
df[df.returns > 0]

Finding a range in the series with the most frequent occurrence of the entries over a defined time (in Pandas)

I have a large dataset in Pandas in which the entries are marked with a timestamp. I'm looking for a way to find a range of a defined length (like 1 minute) with the highest occurrence of entries.
One solution could be to resample the data to a higher timeframe (such as a minute) and compare the sections with the highest number of values. However, that would only find ranges that align with the start and end times of the given timeframe.
I'd rather find any 1-minute range, no matter where it actually starts.
In the following example, I would be looking for the 1-minute “window” with the highest occurrence of entries, starting with the first signal in the range and ending with the last signal in the range:
8:50:00
8:50:01
8:50:03
8:55:00
8:59:10
9:00:01
9:00:02
9:00:03
9:00:04
9:05:00
Thus I would like to get the range 8:59:10 - 9:00:04.
Any hint on how to accomplish this?
You need to create 1-minute windows with a sliding start time of 1 second, and compute the maximum occurrence over all of the windows. In pandas 0.19.0 or greater, you can resample a time series using base as an argument to start the resampled windows at different offsets.
I used tempfile to copy your data as a toy data set below.
import tempfile
import pandas as pd
tf = tempfile.TemporaryFile()
tf.write(b'''8:50:00
8:50:01
8:50:03
8:55:00
8:59:10
9:00:01
9:00:02
9:00:03
9:00:04
9:05:00''')
tf.seek(0)
df = pd.read_table(tf, header=None)
df.columns = ['time']
df.time = pd.to_datetime(df.time)
max_vals = []
for t in range(60):
    # .max().max() is not a mistake; use it to return just the value
    max_vals.append(
        (t, df.resample('60s', on='time', base=t).count().max().max())
    )
max(max_vals, key=lambda x: x[-1])
# returns:
(5, 5)
For this toy dataset, an offset of 5 seconds for the window (i.e. windows at 8:49:05, 8:50:05, ...) gives the first occurrence of the maximum count: a 1-minute window containing 5 entries.
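As an alternative to looping over base offsets, pandas 0.19+ also supports time-based rolling windows, which slide at the data's own timestamp resolution rather than a fixed 1-second grid; a hedged sketch on the same toy data:
# count the entries in the 60-second window ending at each timestamp
s = pd.Series(1, index=df['time'])
counts = s.rolling('60s').count()
print(counts.idxmax(), counts.max())
# the window ending 9:00:04 spans 8:59:10 through 9:00:04 -> 5 entries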

Advanced array/dataframe slicing (numpy/pandas)

I'm trying to generate 50 random samples of 30 continuous day periods from a list of corn prices (which is indexed by date).
So far I've got 'select 50 random days' in line one. For the second line, what I really want is an array of DataFrames, each one containing 30 days from the sampled date. Currently it just returns the price on that day.
samples=np.random.choice(corn[:'1981'].index,50)
corn['Open'][samples] #line I need to fix
What's the cleanest way of doing that?
You could use
corn.loc[date:date+pd.Timedelta(days=29)]
to select 30 days worth of rows starting from date date. Note that .loc[start:end] includes both start and end (unlike Python slices, which use half-open intervals). Thus adding 29 days to date results in a DataFrame of length 30.
To get a list of DataFrames, use a list comprehension:
dfs = [corn.loc[date:date+pd.Timedelta(days=29)] for date in samples]
import numpy as np
import pandas as pd
N = 365
corn = pd.DataFrame({'Open': np.random.random(N)},
                    index=pd.date_range('1980-1-1', periods=N))
samples = np.random.choice(corn[:'1981'].index,50)
dfs = [corn.loc[date:date+pd.Timedelta(days=29)] for date in samples]
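If you then want the 50 windows as a single NumPy array (shape (50, 30)) rather than a list of DataFrames, you can stack them, provided every sampled start date leaves a full 30-day window inside the data; a hedged sketch that restricts the sampling accordingly:
# only sample start dates with at least 30 days of data after them
valid = corn.index[corn.index <= corn.index[-1] - pd.Timedelta(days=29)]
samples = np.random.choice(valid, 50)
windows = np.stack([corn.loc[d:d + pd.Timedelta(days=29), 'Open'].values
                    for d in samples])  # shape (50, 30)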

Average of daily count of records per month in a Pandas DataFrame

I have a pandas DataFrame with a TIMESTAMP column, which is of the datetime64 data type. Please keep in mind, initially this column is not set as the index; the index is just regular integers, and the first few rows look like this:
                 TIMESTAMP  TYPE
0  2014-07-25 11:50:30.640     2
1  2014-07-25 11:50:46.160     3
2  2014-07-25 11:50:57.370     2
There is an arbitrary number of records for each day, and there may be days with no data. What I am trying to get is the average number of daily records per month, and then to plot it as a bar chart with months on the x-axis (April 2014, May 2014, etc.). I managed to calculate these values using the code below,
dfWIM.index = dfWIM.TIMESTAMP
for i in range(dfWIM.TIMESTAMP.dt.year.min(), dfWIM.TIMESTAMP.dt.year.max() + 1):
    for j in range(1, 13):
        print(dfWIM[(dfWIM.TIMESTAMP.dt.year == i) & (dfWIM.TIMESTAMP.dt.month == j)].resample('D', how='count').TIMESTAMP.mean())
which gives the following output:
nan
nan
3100.14285714
6746.7037037
9716.42857143
10318.5806452
9395.56666667
9883.64516129
8766.03225806
9297.78571429
10039.6774194
nan
nan
nan
This is OK as it is, and with some more work I can map the results to the correct month names and then plot the bar chart. However, I am not sure if this is the correct/best way, and I suspect there might be an easier way to get the results using Pandas.
I would be glad to hear what you think. Thanks!
NOTE: If I do not set the TIMESTAMP column as the index, I get a "reduction operation 'mean' not allowed for this dtype" error.
I think you'll want to do two rounds of groupby, first to group by day and count the instances, and next to group by month and compute the mean of the daily counts. You could do something like this.
First I'll generate some fake data that looks like yours:
import numpy as np
import pandas as pd

# make 1000 random times throughout the year
N = 1000
times = pd.date_range('2014', '2015', freq='min')
ind = np.random.permutation(np.arange(len(times)))[:N]
data = pd.DataFrame({'TIMESTAMP': times[ind],
                     'TYPE': np.random.randint(0, 10, N)})
data.head()
Now I'll do the two groupbys using pd.TimeGrouper and plot the monthly average counts:
import seaborn as sns # for nice plot styles (optional)
daily = data.set_index('TIMESTAMP').groupby(pd.TimeGrouper(freq='D'))['TYPE'].count()
monthly = daily.groupby(pd.TimeGrouper(freq='M')).mean()
ax = monthly.plot(kind='bar')
The formatting along the x axis leaves something to be desired, but you can tweak that if necessary.
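Note that pd.TimeGrouper has since been removed from pandas (pd.Grouper replaces it); in current versions the same two-step aggregation can also be written with resample. A sketch under that assumption:
# entries per day, then the mean daily count per month
daily = data.set_index('TIMESTAMP')['TYPE'].resample('D').count()
monthly = daily.resample('M').mean()
ax = monthly.plot(kind='bar')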
