Generating rolling averages from a time series, but subselecting based on month - python

I have a long time series of weekly data. For a given observation, I want to compare that week's value against the average of the same month's monthly averages over the three previous years.
Concrete example: for the 2019-02-15 datapoint, I want to compare it to the average of all the Feb 2018, Feb 2017, and Feb 2016 datapoints.
I want to populate the entire time series in this way (the first three years will be np.nans, of course).
I made a really gross single-datapoint example of the calculation I want to do, but I am not sure how to implement it as a vectorized solution. I am also not thrilled that I had to use the intermediate helper table "mth_avg".
import pandas as pd
ix = pd.date_range(freq='W-FRI',start="20100101", end='20190301' )
df = pd.DataFrame({"foo": [x for x in range(len(ix))]}, index=ix) #weekly data
mth_avg = df.resample("M").mean() #data as a monthly average over time
mth_avg['month_hack'] = mth_avg.index.month
#average of previous three years' same-month averages
df['avg_prev_3_year_same-month'] = "?"
#single arbitrary example of my intention
df.loc['2019-02-15', "avg_prev_3_year_same-month"] = (
    mth_avg[mth_avg.month_hack == 2]
    .loc[:'2019-02-15']
    .iloc[-3:]
    .loc[:, 'foo']
    .mean()
)
df[-5:]

I think it's actually a nontrivial problem - there's no existing functionality in Pandas that I'm aware of for this. Making a helper table saves calculation time; in fact I used two. My solution uses a loop (namely a list comprehension) and Pandas' datetime awareness to avoid your month_hack. Otherwise I think it was a good start. Would be happy to see something more elegant!
# your code
import pandas as pd
import numpy as np

ix = pd.date_range(freq='W-FRI', start="20100101", end='20190301')
df = pd.DataFrame({"foo": [x for x in range(len(ix))]}, index=ix)
mth_avg = df.resample("M").mean()

# use a multi-index of month/year with month first
mth_avg.index = [mth_avg.index.month, mth_avg.index.year]
tmp = mth_avg.sort_index().groupby(level=0).rolling(3).foo.mean()
tmp.index = tmp.index.droplevel(0)

# get the rolling value from tmp for each weekly date
res = [tmp.xs((i.month, i.year - 1)) for i in df[df.index > '2010-12-31'].index]

# NaNs for 2010
df['avg_prev_3_year_same-month'] = np.nan
df.loc[df.index > '2010-12-31', 'avg_prev_3_year_same-month'] = res

# output
df.sort_index(ascending=False).head()
            foo  avg_prev_3_year_same-month
2019-03-01  478                  375.833333
2019-02-22  477                  371.500000
2019-02-15  476                  371.500000
2019-02-08  475                  371.500000
2019-02-01  474                  371.500000
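A possible loop-free variant of the same lookup (not part of the original answer, just a sketch built on the tmp table above) reindexes tmp with a MultiIndex made from each weekly date's month and previous year:

# Build (month, previous year) keys for every weekly timestamp and look them up
# in tmp, which is indexed by (month, year); missing keys become NaN automatically.
lookup = pd.MultiIndex.from_arrays([df.index.month, df.index.year - 1])
df['avg_prev_3_year_same-month'] = tmp.reindex(lookup).to_numpy()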

Python Pandas: How to get the maximum value per peak in multiple cycles

I am importing data from a machine that has thousands of cycles on it. Each cycle lasts a few minutes and has two peaks in pressure that I need to record. One example can be seen in the graph below.
In this cycle you can see there are two peaks, one at 807 psi and one at 936 psi. I need to record these values. I have already sorted the data so I can determine when a cycle is on or off, but now I need to figure out how to record these two maxima. I previously tried this:
df2 = df.groupby('group')['Pressure'].nlargest(2).rename_axis(index=['group', 'row_index'])
to get the maxes, but realized this will only give me the two largest values, which in some cycles occur right before the peak.
In this example dataframe I have provided one cycle:
import pandas as pd
data = {'Pressure' : [100,112,114,120,123,420,123,1230,1320,1,23,13,13,13,123,13,123,3,222,2303,1233,1233,1,1,30,20,40,401,10,40,12,122,1,12,333]}
df = pd.DataFrame(data)
The peak values for this should be 1320 and 2303, while ignoring the slow increase to these peaks.
Thanks for any help!
(This is also for a ton of cycles, so I need it to be able to go through and record the peaks for each cycle.)
Alright, I had a go, using the simple heuristic I suggested in my comment.
def filter_peaks(df):
    df["before"] = df["Pressure"].shift(1)
    df["after"] = df["Pressure"].shift(-1)
    df["max"] = df.max(axis=1)
    df = df.fillna(0)
    return df[df["Pressure"] == df["max"]]["Pressure"].to_frame()

filter_peaks(df)  # test one application
If you apply this once to your test dataframe, you get the following result:
You can see that it almost doesn't work: the value at line 21 only needed to be a little higher for it to exceed the true second peak at line 8.
You can get around this by iterating, i.e. with filter_peaks(filter_peaks(df)). You then end up with a clean dataframe that you can apply your .nlargest strategy to.
EDIT
Complete code example:
import pandas as pd

data = {'Pressure': [100,112,114,120,123,420,123,1230,1320,1,23,13,13,13,123,13,123,3,222,2303,1233,1233,1,1,30,20,40,401,10,40,12,122,1,12,333]}
df = pd.DataFrame(data)

def filter_peaks(df):
    df["before"] = df["Pressure"].shift(1)
    df["after"] = df["Pressure"].shift(-1)
    df["max"] = df.max(axis=1)
    df = df.fillna(0)
    return df[df["Pressure"] == df["max"]]["Pressure"].to_frame()

df2 = filter_peaks(df)  # or do it twice if you want to be sure: filter_peaks(filter_peaks(df))
df2["Pressure"].nlargest(2)
Output:
19 2303
8 1320
Name: Pressure, dtype: int64
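For comparison (not part of the original answer), if SciPy is available the same two peaks can be picked out with scipy.signal.find_peaks, applied to the df built above:

from scipy.signal import find_peaks

# find_peaks returns the positions of local maxima; nlargest(2) then keeps the
# two highest peaks and ignores the slow rise towards them.
peak_idx, _ = find_peaks(df['Pressure'])
print(df['Pressure'].iloc[peak_idx].nlargest(2))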

Faster way of iterating through an xarray and a dataframe

I'm new to Python and don't know all the aspects.
I want to loop through a dataframe (2D) and assign some of those values to an xarray (3D).
The coordinates of my xarray are company ticker symbols (1), financial variables (2) and daily dates (3).
The columns of the dataframe for each company are some of the same financial variables as in the xarray and the index is made up of quarterly dates.
My goal is to take an already generated dataframe for each company and look for a value that is in the column of a certain variable and the row of a certain date and assign it to its corresponding spot in the xarray.
Since some dates are not going to be in the index of the dataframe (only has 4 dates per calendar year), I want to assign either a 0 to that spot on the xarray or the value from the previous date on the xarray if that value is also not 0.
I have tried to do it using nested for loops, but it takes around 20 seconds to go through all the dates in just one variable.
My date list is made up of around 8,000 dates, the variable list has around 30 variables, and the company list has around 800 companies.
If I were to loop around all of that it would take me several days to complete the nested for loops.
Is there a faster way to assign these values to the xarray? My guess is something similar to iterrows() or iteritems() but in xarray.
Here is a sample of my program's code with shorter lists for the companies and variables:
import pandas as pd
from datetime import datetime, date, timedelta
import numpy as np
import xarray as xr
import time
start_time = time.time()
# We create the df. This is an auxiliary made-up df; it's a shorter version of the real df.
# The real df I want to use is much larger and comes from an external method.
cols = ['cashAndCashEquivalents', 'shortTermInvestments', 'cashAndShortTermInvestments', 'totalAssets',
'totalLiabilities', 'totalStockholdersEquity', 'netIncome', 'freeCashFlow']
rows = []
for year in range(1989, 2020):
    for month, day in zip([3, 6, 9, 12], [31, 30, 30, 31]):
        rows.append(date(year, month, day))
a = np.random.randint(100, size=(len(rows), len(cols)))
df = pd.DataFrame(data=a, columns=cols)
df.insert(column='date', value=rows, loc=0)
# This is just to set the date format so that I can later look up the values
for item, i in zip(df.iloc[:, 0], range(len(df.iloc[:, 0]))):
    df.iloc[i, 0] = datetime.strptime(str(item), '%Y-%m-%d')
df.set_index('date', inplace=True)
# Coordinates for the xarray:
companies = ['AAPL'] # This is actually longer (around 800 companies), but for the sake of the question, it is limited to just one company.
variables = ['totalAssets', 'totalLiabilities', 'totalStockholdersEquity'] # Same as with the companies (around 30 variables).
first_date = date(1998, 3, 25)
last_date = date.today() + timedelta(-300)
dates = pd.date_range(start=first_date, end=last_date).tolist()
# We create a zero xarray, so that we can later fill it up with values:
z = np.zeros((len(companies), len(variables), len(dates)))
ds = xr.DataArray(z, coords=[companies, variables, dates],
                  dims=['companies', 'variables', 'dates'])
# We assign values from the df to the ds
for company in companies:
    for variable in variables:
        first_value_found = False
        for date in dates:
            # Dates in the df are quarterly dates and dates in the ds are daily dates.
            # We start off by looking for a certain date in the df. If we don't find it, we give it the value 0 in the ds.
            # If we do find it, we assign it the value found in the df and note that the first value has been found.
            # Once the first value has been found, when we don't find a value in the df we take the value of the previous date instead of 0.
            if not first_value_found:
                try:
                    ds.loc[company, variable, date] = df.loc[date, variable]
                    first_value_found = True
                except:
                    ds.loc[company, variable, date] = 0
            else:
                try:
                    ds.loc[company, variable, date] = df.loc[date, variable]
                except:
                    ds.loc[company, variable, date] = ds.loc[company, variable, date + timedelta(-1)]
print("My program took", time.time() - start_time, "to run")
The main problem is with the for loops, as I have tested these loops on separate files and these seem to be what takes the most time.
One possible strategy is to loop over the actual index of the DataFrame, rather than all possible indices:
avail_dates = df.index
for date in avail_dates:
    # copy the data here
That should already reduce the number of iterations by quite a bit. You still have to make sure all the blanks are filled, so you'd do something like:
da.loc[company, variables, date:] = df.loc[date, variables]
That's right, you can index into a DataArray and a DataFrame with lists. (Also, I wouldn't use ds as a variable name for something from xarray other than a Dataset.)
What you probably want to use, though, is pandas.DataFrame.reindex().
If I understand what you're trying to do, this should more or less do the trick (not tested):
complete_df = df.reindex(dates, method='pad', fill_value=0)
da.loc[company, variables, :] = complete_df.loc[:, variables].T
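As an illustration only, here is a minimal end-to-end sketch of that reindex approach, reusing the made-up df, dates, companies and variables from the question's code and renaming the DataArray to da (assumptions on my part, not tested against the real data):

import numpy as np
import xarray as xr

# Empty DataArray with the same shape and coordinates as in the question.
da = xr.DataArray(np.zeros((len(companies), len(variables), len(dates))),
                  coords=[companies, variables, dates],
                  dims=['companies', 'variables', 'dates'])

for company in companies:
    # In the real program each company would have its own quarterly DataFrame;
    # here the single made-up df stands in for all of them.
    # Forward-fill the quarterly values onto the daily grid; dates before the
    # first quarterly observation become 0.
    complete_df = df.reindex(dates, method='pad', fill_value=0)
    # Assign all variables at once; .T puts variables on the first axis.
    da.loc[company, variables, :] = complete_df.loc[:, variables].T.to_numpy()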

Aggregate measurements per time period

I have a 6 x n matrix with the data: year, month, day, hour, minute, use.
I have to make a new matrix containing the aggregated measurements for use, grouped by the value of 'hour', so that all rows recorded within the same hour are combined.
Every time the hour number changes, the code needs to know that a new period starts.
I just tried something, but I don't know how to solve this.
Thank you. This is what I tried, plus a test:
import numpy as np

def groupby_measurements(data):
    count = -1
    for i in range(9):
        array = np.split(data, np.where(data[i, 3] != data[i+1, 3])[0][:1])
    return array

print(groupby_measurements(np.array([[2006, 2, 11, 1, 1, 55],
                                     [2006, 2, 11, 1, 11, 79],
                                     [2006, 2, 11, 1, 32, 2],
                                     [2006, 2, 11, 1, 41, 66],
                                     [2006, 2, 11, 1, 51, 76],
                                     [2006, 2, 11, 10, 2, 89],
                                     [2006, 2, 11, 10, 3, 33],
                                     [2006, 2, 11, 14, 2, 22],
                                     [2006, 2, 11, 14, 5, 34]])))
For the case I tried above, I expect the output to be:
np.array([[2006,2,11,1,1,55],
[2006,2,11,1,11,79],
[2006,2,11,1,32,2],
[2006,2,11,1,41,66],
[2006,2,11,1,51,76]]),
np.array([[2006,2,11,10,2,89],
[2006,2,11,10,3,33]]),
np.array([[2006,2,11,14,2,22],
[2006,2,11,14,5,34]])
The final output should be:
np.array([2006,2,11,1,0,278]),
np.array([2006,2,11,10,0,122]),
np.array([2006,2,11,14,0,56])
(the sum of use in each of the three hourly periods)
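As an aside, the np.split/np.where idea from the attempt above can be made to work by splitting wherever the hour column changes between consecutive rows; a sketch using the same sample data:

import numpy as np

arr = np.array([[2006, 2, 11, 1, 1, 55],
                [2006, 2, 11, 1, 11, 79],
                [2006, 2, 11, 1, 32, 2],
                [2006, 2, 11, 1, 41, 66],
                [2006, 2, 11, 1, 51, 76],
                [2006, 2, 11, 10, 2, 89],
                [2006, 2, 11, 10, 3, 33],
                [2006, 2, 11, 14, 2, 22],
                [2006, 2, 11, 14, 5, 34]])

# Split the matrix wherever column 3 (the hour) changes from one row to the next.
groups = np.split(arr, np.where(np.diff(arr[:, 3]) != 0)[0] + 1)
# Keep year/month/day/hour of the first row in each group, zero the minute,
# and sum the 'use' column.
result = np.array([[*g[0, :4], 0, g[:, 5].sum()] for g in groups])
# result:
# [[2006    2   11    1    0  278]
#  [2006    2   11   10    0  122]
#  [2006    2   11   14    0   56]]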
I would recommend using pandas DataFrames, and then using groupby combined with sum:
import pandas as pd
import numpy as np
data = pd.DataFrame(np.array([[2006, 2, 11, 1, 1, 55],
                              [2006, 2, 11, 1, 11, 79],
                              [2006, 2, 11, 1, 32, 2],
                              [2006, 2, 11, 1, 41, 66],
                              [2006, 2, 11, 1, 51, 76],
                              [2006, 2, 11, 10, 2, 89],
                              [2006, 2, 11, 10, 3, 33],
                              [2006, 2, 11, 14, 2, 22],
                              [2006, 2, 11, 14, 5, 34]]),
                    columns=['year', 'month', 'day', 'hour', 'minute', 'use'])
aggregated = data.groupby(['year','month','day','hour'])['use'].sum()
# you can also use .agg and pass which aggregation function you want as a string.
aggregated = data.groupby(['year','month','day','hour'])['use'].agg('sum')
year month day hour
2006 2 11 1 278
10 122
14 56
aggregated is now a pandas Series; if you want it as an array, just do
aggregated.values
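If the result needs to go back into the question's 6-column matrix form, with the minute column zeroed, a small follow-up sketch (using the aggregated Series from above):

# Turn the grouped Series back into columns, insert a zeroed 'minute' column
# before the summed 'use' values, and convert to a plain NumPy array.
out = aggregated.reset_index()
out.insert(4, 'minute', 0)
result = out.to_numpy()
# [[2006    2   11    1    0  278]
#  [2006    2   11   10    0  122]
#  [2006    2   11   14    0   56]]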

Pandas calculated column from datetime index groups loop

I have a Pandas df with a Datetime Index. I want to loop over the following code with different values of strike, based on the index date value (different strike for different time period). Here is my code that produces what I am after for 1 strike across the whole time series:
import pandas as pd
import numpy as np
index=pd.date_range('2017-10-1 00:00:00', '2018-12-31 23:50:00', freq='30min')
df=pd.DataFrame(np.random.randn(len(index),2).cumsum(axis=0),columns=['A','B'],index=index)
strike = 40
payoffs = df[df>strike]-strike
mean_payoff = payoffs.fillna(0).mean()
dist = mean_payoff.describe(percentiles=[0.05,.5,.95])
print(dist)
I want to use different values of strike based on the time period (index value).
So far I have tried to create a categorical calculated column with the intention of using map or apply row-wise on the df. I have also played around with creating a dictionary and mapping the dict across the df.
Even if I get the calculated column with the correct strike value, I can't think how to subtract the calculated column value (strike) from all the other columns to get the payoffs as above.
I feel like I need to use for loop and potentially create groups of date chunks that get appended together at the end of the loop, maybe with pd.concat.
Thanks in advance
I think you need to convert the DatetimeIndex to quarterly periods with to_period, then to strings, and finally map with a dict.
For the comparison, use gt together with sub:
d = {'2017Q4':30, '2018Q1':40, '2018Q2':50, '2018Q3':60, '2018Q4':70}
strike = df.index.to_series().dt.to_period('Q').astype(str).map(d)
payoffs = df[df.gt(strike, axis=0)].sub(strike, axis=0)
mean_payoff = payoffs.fillna(0).mean()
dist = mean_payoff.describe(percentiles=[0.05,.5,.95])
Mapping your dataframe index into a dictionary can be a starting point.
import random

a = {2017: 30, 2018: 40}

# given your index used in the example (21936 half-hourly rows)
ranint = random.choices([30, 35, 40, 45], k=len(index))
df = pd.DataFrame({'values': ranint}, index=index)
df['year'] = df.index.year
df['strike'] = df['year'].map(a)

                     values  year  strike
2017-10-01 00:00:00      30  2017      30
2017-10-01 00:30:00      30  2017      30
2017-10-01 01:00:00      45  2017      30

df['returns'] = df['values'] - df['strike']
Then you can extract the returns that are greater than 0:
df[df['returns'] > 0]

Average of daily count of records per month in a Pandas DataFrame

I have a pandas DataFrame with a TIMESTAMP column, which is of the datetime64 data type. Please keep in mind, initially this column is not set as the index; the index is just regular integers, and the first few rows look like this:
                 TIMESTAMP  TYPE
0  2014-07-25 11:50:30.640     2
1  2014-07-25 11:50:46.160     3
2  2014-07-25 11:50:57.370     2
There is an arbitrary number of records for each day, and there may be days with no data. What I am trying to get is the average number of daily records per month, and then to plot it as a bar chart with months on the x-axis (April 2014, May 2014, etc.). I managed to calculate these values using the code below
dfWIM.index = dfWIM.TIMESTAMP
for i in range(dfWIM.TIMESTAMP.dt.year.min(), dfWIM.TIMESTAMP.dt.year.max() + 1):
    for j in range(1, 13):
        print dfWIM[(dfWIM.TIMESTAMP.dt.year == i) & (dfWIM.TIMESTAMP.dt.month == j)].resample('D', how='count').TIMESTAMP.mean()
which gives the following output:
nan
nan
3100.14285714
6746.7037037
9716.42857143
10318.5806452
9395.56666667
9883.64516129
8766.03225806
9297.78571429
10039.6774194
nan
nan
nan
This is OK as it is, and with some more work I can map the results to the correct month names and then plot the bar chart. However, I am not sure if this is the correct/best way, and I suspect there might be an easier way to get the results using Pandas.
I would be glad to hear what you think. Thanks!
NOTE: If I do not set the TIMESTAMP column as the index, I get a "reduction operation 'mean' not allowed for this dtype" error.
I think you'll want to do two rounds of groupby: first group by day and count the instances, then group by month and compute the mean of the daily counts. You could do something like this.
First I'll generate some fake data that looks like yours:
import pandas as pd
import numpy as np

# make 1000 random times throughout the year
N = 1000
times = pd.date_range('2014', '2015', freq='min')
ind = np.random.permutation(np.arange(len(times)))[:N]
data = pd.DataFrame({'TIMESTAMP': times[ind],
                     'TYPE': np.random.randint(0, 10, N)})
data.head()
Now I'll do the two groupbys using pd.TimeGrouper and plot the monthly average counts:
import seaborn as sns # for nice plot styles (optional)
daily = data.set_index('TIMESTAMP').groupby(pd.TimeGrouper(freq='D'))['TYPE'].count()
monthly = daily.groupby(pd.TimeGrouper(freq='M')).mean()
ax = monthly.plot(kind='bar')
The formatting along the x axis leaves something to be desired, but you can tweak that if necessary.
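As a side note, pd.TimeGrouper has since been removed from pandas; an equivalent sketch on current versions (an assumption on my part, not from the original answer) uses resample for both steps:

# Count records per day, then average the daily counts within each month.
daily = data.set_index('TIMESTAMP')['TYPE'].resample('D').count()
monthly = daily.resample('M').mean()
ax = monthly.plot(kind='bar')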
