Pandas calculated column from datetime index groups loop - python

I have a Pandas df with a Datetime Index. I want to loop over the following code with different values of strike, based on the index date value (different strike for different time period). Here is my code that produces what I am after for 1 strike across the whole time series:
import pandas as pd
import numpy as np
index=pd.date_range('2017-10-1 00:00:00', '2018-12-31 23:50:00', freq='30min')
df=pd.DataFrame(np.random.randn(len(index),2).cumsum(axis=0),columns=['A','B'],index=index)
strike = 40
payoffs = df[df>strike]-strike
mean_payoff = payoffs.fillna(0).mean()
dist = mean_payoff.describe(percentiles=[0.05,.5,.95])
print(dist)
I want to use different values of strike based on the time period (index value).
So far I have tried to create a categorical calculated column with the intention of using map or apply row wise on the df. I have also played around with creating a dictionary and mapping the dict across the df.
Even if I get the calculated column with the correct strike value, I can't think how to subtract the calculated column value (strike) from all the other columns to get the payoffs from above.
I feel like I need to use a for loop and potentially create groups of date chunks that get appended together at the end of the loop, maybe with pd.concat.
Thanks in advance

I think you need to convert the DatetimeIndex to a quarterly period with to_period, then cast to string and finally map with a dict.
For the comparison, use gt together with sub (both with axis=0):
d = {'2017Q4':30, '2018Q1':40, '2018Q2':50, '2018Q3':60, '2018Q4':70}
strike = df.index.to_series().dt.to_period('Q').astype(str).map(d)
payoffs = df[df.gt(strike, axis=0)].sub(strike, axis=0)
mean_payoff = payoffs.fillna(0).mean()
dist = mean_payoff.describe(percentiles=[0.05,.5,.95])
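For reference, here is a tiny standalone sketch (made-up values, not the question's data) showing how the axis=0 alignment works: gt compares each row against that row's strike, and sub then subtracts it row-wise.
import pandas as pd
small = pd.DataFrame({'A': [35, 45], 'B': [50, 20]},
                     index=pd.to_datetime(['2017-10-01', '2018-01-01']))
row_strike = pd.Series([30, 40], index=small.index)
# axis=0 aligns the Series with the DataFrame rows, so every column in a row
# is compared against / reduced by that row's strike; values below strike become NaN
payoffs = small[small.gt(row_strike, axis=0)].sub(row_strike, axis=0)
print(payoffs)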

Mapping your dataframe index through a dictionary can be a starting point.
import random
import pandas as pd

a = dict()
a[2017] = 30
a[2018] = 40
# random example values, given the index used in your example (21936 half-hour rows)
ranint = random.choices([30, 35, 40, 45], k=21936)
df = pd.DataFrame({'values': ranint}, index=index)
df['year'] = df.index.year
df['strike'] = df['year'].map(a)
df.head(3)
                     values  year  strike
2017-10-01 00:00:00      30  2017      30
2017-10-01 00:30:00      30  2017      30
2017-10-01 01:00:00      45  2017      30
df['returns'] = df['values'] - df['strike']
Then you can extract the returns that are greater than 0:
df[df['returns'] > 0]

Faster way to iterate through xarray and dataframe

I'm new to Python and don't know all the aspects.
I want to loop through a dataframe (2D) and assign some of those values to an xarray (3D).
The coordinates of my xarray are company ticker symbols (1), financial variables (2) and daily dates (3).
The columns of the dataframe for each company are some of the same financial variables as in the xarray and the index is made up of quarterly dates.
My goal is to take an already generated dataframe for each company and look for a value that is in the column of a certain variable and the row of a certain date and assign it to its corresponding spot in the xarray.
Since some dates are not going to be in the index of the dataframe (only has 4 dates per calendar year), I want to assign either a 0 to that spot on the xarray or the value from the previous date on the xarray if that value is also not 0.
I have tried to do it using nested for loops, but it takes around 20 seconds to go through all the dates in just one variable.
My date list is made up of around 8000 dates, the variable list has around 30 variables and the company list is around 800 companies.
If I were to loop around all of that it would take me several days to complete the nested for loops.
Is there a faster way to assign these values to the xarray? My guess is something similar to iterrows() or iteritems() but in xarray.
Here is sample code of my program with shorter lists for the companies and variables:
import pandas as pd
from datetime import datetime, date, timedelta
import numpy as np
import xarray as xr
import time
start_time = time.time()
# We create the df. This is an auxiliary made-up df. It's a shorter version of the real df.
# The real df I want to use is much larger and comes from an external method.
cols = ['cashAndCashEquivalents', 'shortTermInvestments', 'cashAndShortTermInvestments', 'totalAssets',
        'totalLiabilities', 'totalStockholdersEquity', 'netIncome', 'freeCashFlow']
rows = []
for year in range(1989, 2020):
    for month, day in zip([3, 6, 9, 12], [31, 30, 30, 31]):
        rows.append(date(year, month, day))
a = np.random.randint(100, size=(len(rows), len(cols)))
df = pd.DataFrame(data=a, columns=cols)
df.insert(column='date', value=rows, loc=0)
# This is just to set the date format so that I can later look up the values
for item, i in zip(df.iloc[:, 0], range(len(df.iloc[:, 0]))):
    df.iloc[i, 0] = datetime.strptime(str(item), '%Y-%m-%d')
df.set_index('date', inplace=True)
# Coordinates for the xarray:
companies = ['AAPL'] # This is actually longer (around 800 companies), but for the sake of the question, it is limited to just one company.
variables = ['totalAssets', 'totalLiabilities', 'totalStockholdersEquity'] # Same as with the companies (around 30 variables).
first_date = date(1998, 3, 25)
last_date = date.today() + timedelta(-300)
dates = pd.date_range(start=first_date, end=last_date).tolist()
# We create a zero xarray, so that we can later fill it up with values:
z = np.zeros((len(companies), len(variables), len(dates)))
ds = xr.DataArray(z, coords=[companies, variables, dates],
                  dims=['companies', 'variables', 'dates'])
# We assign values from the df to the ds
for company in companies:
    for variable in variables:
        first_value_found = False
        for date in dates:
            # Dates in the df are quarterly dates and dates in the ds are daily dates.
            # We start off by looking for a certain date in the df. If we don't find it, we give it the value 0 in the ds.
            # If we do find it, we assign it the value found in the df and note that the first value has been found.
            # Once the first value has been found, when we don't find a value in the df, instead of giving it a value of 0, we give it the value of the previous date.
            if not first_value_found:
                try:
                    ds.loc[company, variable, date] = df.loc[date, variable]
                    first_value_found = True
                except:
                    ds.loc[company, variable, date] = 0
            else:
                try:
                    ds.loc[company, variable, date] = df.loc[date, variable]
                except:
                    ds.loc[company, variable, date] = ds.loc[company, variable, date + timedelta(-1)]
print("My program took", time.time() - start_time, "to run")
The main problem is with the for loops, as I have tested these loops on separate files and these seem to be what takes the most time.
One possible strategy is to loop over the actual index of the DataFrame, rather than all possible indices:
avail_dates = df.index
for date in avail_dates:
    # Copy the data
That should already reduce the number of iterations by quite a bit. You still have to make sure all the blanks are filled, so you'd do something like
da.loc[company, variables, date:] = df.loc[date, variables]
That's right, you can index into DataArray and DataFrame with lists. (Also, I wouldn't use ds as a variable name for anything from xarray other than a Dataset.)
What you probably want to use, though, is pandas.DataFrame.reindex().
If I understand what you're trying to do, this should more or less do the trick (not tested)
complete_df = df.reindex(dates, method='pad', fill_value=0)
da.loc[company, variables, :] = complete_df.loc[:, variables].T
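For intuition, here is a tiny self-contained sketch (made-up quarterly values, not the question's data) of what reindex with method='pad' does: each quarterly value is carried forward over the following daily dates, and dates before the first quarterly date can then be set to 0.
import pandas as pd
q_idx = pd.to_datetime(['2019-03-31', '2019-06-30'])
small = pd.DataFrame({'totalAssets': [10, 20]}, index=q_idx)
daily = pd.date_range('2019-03-29', '2019-07-02')
# forward-fill each quarterly value over the following days;
# days before the first quarterly date stay NaN and are set to 0 afterwards
filled = small.reindex(daily, method='pad').fillna(0)
print(filled.head())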

Pandas - How to remove duplicates based on another series?

I have a dataframe that contains three series called Date, Element,
and Data_Value--their types are string, string, and numpy.int64
respectively. Date has dates in the form of yyyy-mm-dd; Element has
strings that say either TMIN or TMAX, and it denotes whether the
Data_Value is the minimum or maximum temperature of a particular date;
lastly, the Data_Value series just represents the actual temperature.
The date series has multiple duplicates of the same date. E.g. for the
date 2005-01-01, there are 19 entries for the temperature column, the
values start at 28 and go all the way up to 156. I want to create a
new dataframe with the date and the maximum temperature only--I'll
eventually want one for TMIN values too, but I figure that if I can do
one I can figure out the other. I'll post some pseudocode with
explanation below to show what I've tried so far.
So far I have pulled in the csv and assigned it to a variable, df.
Then I sorted the values by Date, Element and Temperature
(Data_Value). After that, I created a variable called tmax that grabs
the necessary dates (I only need the data from 2005-2014) that have
'TMAX' as its Element value. I cast tmax into a new DataFrame, reset
its index to get rid of the useless index data from the first
dataframe, and dropped the 'Element' column since it was redundant at
this point. Now I'm (ultimately) trying to create a list of all the
Temperatures for TMAX so that I can plot it with pyplot. But I can't
figure out for the life of me how to reduce the dataframe to just the
single date and max value for that date. If I could just get that then
I could easily convert the series to a list and plot it.
def record_high_and_low_temperatures():
    # read in csv
    df = pd.read_csv('somedata.csv')
    # sort values so they're in a nice order
    df.sort_values(by=['Date', 'Element', 'Data_Value'], inplace=True)
    # grab all entries for TMAX in the correct date range
    tmax = df[(df['Element'] == 'TMAX') & (df['Date'].between("2005-01-01", "2014-12-31"))]
    # cast to dataframe
    tmax = pd.DataFrame(tmax, columns=['Date', 'Data_Value'])
    # remove index column from previous dataframe
    tmax.reset_index(drop=True, inplace=True)
    # this is where I'm stuck, how do I get the max value per unique date?
    max_temp_by_date = tmax.loc[tmax['Data_Value'].idxmax()]
Any and all help is appreciated, let me know if I need to clarify anything.
TL;DR:
Ok...
input dataframe looks like
date | data_value
2005-01-01 28
2005-01-01 33
2005-01-01 33
2005-01-01 44
2005-01-01 56
2005-01-02 0
2005-01-02 12
2005-01-02 30
2005-01-02 28
2005-01-02 22
Expected df should look like:
date | data_value
2005-01-01 79
2005-01-02 90
2005-01-03 88
2005-01-04 44
2005-01-05 63
I just want a dataframe that has each unique date coupled with the highest temperature on that day.
If I understand you correctly, what you want to do, as Grzegorz already suggested in the comments, is to group by date (take all elements of one date) and then take the maximum of that date:
df.groupby('date').max()
This will take all your groups and reduce each one to a single row, taking the maximum element of every group. In this case, max() is called the aggregation function of the group. As you mentioned that you will also need the minimum at some point, a nice way to do this (instead of two groupbys) is the following:
df.groupby('date').agg(['max', 'min'])
which will pass over all groups once and apply both aggregation functions max and min returning two columns for each input column. More documentation on aggregation is here.
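As a quick check, here is a minimal sketch using the sample values from your question (column names assumed to be Date and Data_Value):
import pandas as pd
df = pd.DataFrame({
    'Date': ['2005-01-01'] * 5 + ['2005-01-02'] * 5,
    'Data_Value': [28, 33, 33, 44, 56, 0, 12, 30, 28, 22],
})
print(df.groupby('Date')['Data_Value'].agg(['max', 'min']))
#              max  min
# Date
# 2005-01-01    56   28
# 2005-01-02    30    0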
Try this:
df.groupby("Date")['data_value'].max()

Generating rolling averages from a time series, but subselecting based on month

I have long time series of weekly data. For a given observation, I want to calculate that week's value versus the average of the three previous years' average value for the same month.
Concrete example: For the 2019-02-15 datapoint, I want to compare it to the value of the average of all the feb-2018, feb-2017, and feb-2016 datapoints.
I want to populate the entire timeseries in this way. (the first three years will be np.nans of course)
I made a really gross single-datapoint example of the calculation I want to do, but I am not sure how to implement this in a vectorized solution. I also am not impressed that I had to use this intermediate helper table "mth_avg".
import pandas as pd
ix = pd.date_range(freq='W-FRI',start="20100101", end='20190301' )
df = pd.DataFrame({"foo": [x for x in range(len(ix))]}, index=ix) #weekly data
mth_avg = df.resample("M").mean() #data as a monthly average over time
mth_avg['month_hack'] = mth_avg.index.month
#average of previous three years' same-month averages
df['avg_prev_3_year_same-month'] = "?"
#single arbitrary example of my intention
df.loc['2019-02-15', "avg_prev_3_year_same-month"] = (
    mth_avg[mth_avg.month_hack == 2]
    .loc[:'2019-02-15']
    .iloc[-3:]
    .loc[:, 'foo']
    .mean()
)
df[-5:]
I think it's actually a nontrivial problem - there's no existing functionality in Pandas I'm aware of for this. Making a helper table saves calculation time; in fact I used two. My solution uses a loop (namely a list comprehension) and Pandas datetime awareness to avoid your month_hack. Otherwise I think it was a good start. Would be happy to see something more elegant!
# your code (plus the numpy import used below)
import numpy as np
import pandas as pd

ix = pd.date_range(freq='W-FRI', start="20100101", end='20190301')
df = pd.DataFrame({"foo": [x for x in range(len(ix))]}, index=ix)
mth_avg = df.resample("M").mean()
# use multi-index of month/year with month first
mth_avg.index = [mth_avg.index.month, mth_avg.index.year]
tmp = mth_avg.sort_index().groupby(level=0).rolling(3).foo.mean()
tmp.index = tmp.index.droplevel(0)
# get rolling value from tmp
res = [tmp.xs((i.month, i.year - 1)) for i in df[df.index > '2010-12-31'].index]
# NaNs for 2010
df['avg_prev_3_year_same-month'] = np.NaN
df.loc[df.index > '2010-12-31', 'avg_prev_3_year_same-month'] = res
# output
df.sort_index(ascending=False).head()
foo avg_prev_3_year_same-month
2019-03-01 478 375.833333
2019-02-22 477 371.500000
2019-02-15 476 371.500000
2019-02-08 475 371.500000
2019-02-01 474 371.500000

Aggregate measurements per time period

I have a 6 x n matrix with the data: year, month, day, hour, minute, use.
I have to make a new matrix containing the aggregated measurements of use, per value of 'hour'. So all rows recorded within the same hour are combined.
So every time the hour number changes, the code needs to know that a new period starts.
I just tried something, but I don't know how to solve this.
Thank you. This is what I tried, plus a test:
import numpy as np

def groupby_measurements(data):
    count = -1
    for i in range(9):
        array = np.split(data, np.where(data[i, 3] != data[i+1, 3])[0][:1])
    return array

print(groupby_measurements(np.array([[2006,2,11,1,1,55],
                                     [2006,2,11,1,11,79],
                                     [2006,2,11,1,32,2],
                                     [2006,2,11,1,41,66],
                                     [2006,2,11,1,51,76],
                                     [2006,2,11,10,2,89],
                                     [2006,2,11,10,3,33],
                                     [2006,2,11,14,2,22],
                                     [2006,2,11,14,5,34]])))
For the case I tried, I expect the output to be:
np.array([[2006,2,11,1,1,55],
[2006,2,11,1,11,79],
[2006,2,11,1,32,2],
[2006,2,11,1,41,66],
[2006,2,11,1,51,76]]),
np.array([[2006,2,11,10,2,89],
[2006,2,11,10,3,33]]),
np.array([[2006,2,11,14,2,22],
[2006,2,11,14,5,34]])
The final output should be:
np.array([2006,2,11,1,0,278]),
np.array([2006,2,11,10,0,122]),
np.array([2006,2,11,14,0,56])
(the sum of use in the 3 hour periods)
I would recommend using pandas DataFrames, and then using groupby combined with sum:
import pandas as pd
import numpy as np

data = pd.DataFrame(np.array(
    [[2006, 2, 11, 1, 1, 55],
     [2006, 2, 11, 1, 11, 79],
     [2006, 2, 11, 1, 32, 2],
     [2006, 2, 11, 1, 41, 66],
     [2006, 2, 11, 1, 51, 76],
     [2006, 2, 11, 10, 2, 89],
     [2006, 2, 11, 10, 3, 33],
     [2006, 2, 11, 14, 2, 22],
     [2006, 2, 11, 14, 5, 34]]),
    columns=['year', 'month', 'day', 'hour', 'minute', 'use'])

aggregated = data.groupby(['year', 'month', 'day', 'hour'])['use'].sum()
# you can also use .agg and pass which aggregation function you want as a string:
aggregated = data.groupby(['year', 'month', 'day', 'hour'])['use'].agg('sum')
year  month  day  hour
2006  2      11   1       278
                  10      122
                  14       56
Aggregated is now a pandas Series; if you want it as an array, just do
aggregated.values
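If you also want the result back in your original row layout (year, month, day, hour, minute, use), a possible follow-up sketch (the zero minute column is only there to match your expected output):
out = aggregated.reset_index()   # year, month, day, hour, use as regular columns
out.insert(4, 'minute', 0)       # restore a zero "minute" column to match the original layout
print(out.values)
# [[2006    2   11    1    0  278]
#  [2006    2   11   10    0  122]
#  [2006    2   11   14    0   56]]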

Average of daily count of records per month in a Pandas DataFrame

I have a pandas DataFrame with a TIMESTAMP column, which is of the datetime64 data type. Please keep in mind, initially this column is not set as the index; the index is just regular integers, and the first few rows look like this:
TIMESTAMP TYPE
0 2014-07-25 11:50:30.640 2
1 2014-07-25 11:50:46.160 3
2 2014-07-25 11:50:57.370 2
There is an arbitrary number of records for each day, and there may be days with no data. What I am trying to get is the average number of daily records per month, then plot it as a bar chart with months on the x-axis (April 2014, May 2014... etc.). I managed to calculate these values using the code below
dfWIM.index = dfWIM.TIMESTAMP
for i in range(dfWIM.TIMESTAMP.dt.year.min(), dfWIM.TIMESTAMP.dt.year.max() + 1):
    for j in range(1, 13):
        print dfWIM[(dfWIM.TIMESTAMP.dt.year == i) & (dfWIM.TIMESTAMP.dt.month == j)].resample('D', how='count').TIMESTAMP.mean()
which gives the following output:
nan
nan
3100.14285714
6746.7037037
9716.42857143
10318.5806452
9395.56666667
9883.64516129
8766.03225806
9297.78571429
10039.6774194
nan
nan
nan
This is ok as it is, and with some more work, I can map the results to the correct month names, then plot the bar chart. However, I am not sure if this is the correct/best way, and I suspect there might be an easier way to get the results using Pandas.
I would be glad to hear what you think. Thanks!
NOTE: If I do not set the TIMESTAMP column as the index, I get a "reduction operation 'mean' not allowed for this dtype" error.
I think you'll want to do two rounds of groupby, first to group by day and count the instances, and next to group by month and compute the mean of the daily counts. You could do something like this.
First I'll generate some fake data that looks like yours:
import pandas as pd
import numpy as np

# make 1000 random times throughout the year
N = 1000
times = pd.date_range('2014', '2015', freq='min')
ind = np.random.permutation(np.arange(len(times)))[:N]
data = pd.DataFrame({'TIMESTAMP': times[ind],
                     'TYPE': np.random.randint(0, 10, N)})
data.head()
Now I'll do the two groupbys using pd.TimeGrouper and plot the monthly average counts:
import seaborn as sns # for nice plot styles (optional)
daily = data.set_index('TIMESTAMP').groupby(pd.TimeGrouper(freq='D'))['TYPE'].count()
monthly = daily.groupby(pd.TimeGrouper(freq='M')).mean()
ax = monthly.plot(kind='bar')
The formatting along the x axis leaves something to be desired, but you can tweak that if necessary.
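Note: pd.TimeGrouper has since been removed from pandas; on recent versions the same idea can be written with pd.Grouper and resample, roughly as in this untested sketch:
daily = data.set_index('TIMESTAMP').groupby(pd.Grouper(freq='D'))['TYPE'].count()
monthly = daily.resample('M').mean()
ax = monthly.plot(kind='bar')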
