Inexpensive way to add time series intensity in python pandas dataframe - python

I am trying to sum (and plot) a total from functions which change states at different times using Python's Pandas.DataFrame. For example:
Suppose we have 3 people whose states can be a) holding nothing, b) holding a 5 pound weight, and c) holding a 10 pound weight. Over time, these people pick weights up and put them down. I want to plot the total amount of weight being held. So, given:
My brute-force attempt:
import pandas as ps
import math
import numpy as np
person1=[3,0,10,10,10,10,10]
person2=[4,0,20,20,25,25,40]
person3=[5,0,5,5,15,15,40]
allPeopleDf=ps.DataFrame(np.array(zip(person1,person2,person3)).T)
allPeopleDf.columns=['count','start1', 'end1', 'start2', 'end2', 'start3','end3']
allPeopleDfNoCount=allPeopleDf[['start1', 'end1', 'start2', 'end2', 'start3','end3']]
uniqueTimes=sorted(ps.unique(allPeopleDfNoCount.values.ravel()))
possibleStates=[-1,0,1,2] #extra state 0 for initialization
stateData={}
comboStates={}
#initialize dict to add up all of the stateData
for time in uniqueTimes:
    comboStates[time]=0.0
allPeopleDf['track']=-1
allPeopleDf['status']=-1
numberState=len(possibleStates)
starti=-1
endi=0
startState=0
for i in range(3):
    starti=starti+2
    print starti
    endi=endi+2
    for time in uniqueTimes:
        def helper(row):
            start=row[starti]
            end=row[endi]
            track=row[7]
            if start <= time and time < end:
                return possibleStates[i+1]
            else:
                return possibleStates[0]
        def trackHelp(row):
            status=row[8]
            track=row[7]
            if track<=status:
                return status
            else:
                return track
        def Multiplier(row):
            x=row[8]
            if x==0:
                return 0.0*row[0]
            if x==1:
                return 5.0*row[0]
            if x==2:
                return 10.0*row[0]
            if x==-1:#numeric place holder for non-contributing
                return 0.0*row[0]
        allPeopleDf['status']=allPeopleDf.apply(helper,axis=1)
        allPeopleDf['track']=allPeopleDf.apply(trackHelp,axis=1)
        stateData[time]=allPeopleDf.apply(Multiplier,axis=1).sum()
    for k,v in stateData.iteritems():
        comboStates[k]=comboStates.get(k,0)+v
print allPeopleDf
print stateData
print comboStates
Plots of weight being held over time might look like the following:
And the sum of the intensities over time might look like the black line in the following:
with the black line defined with the Cartesian points: (0,0 lbs),(5,0 lbs),(5,5 lbs),(15,5 lbs),(15,10 lbs),(20,10 lbs),(20,15 lbs),(25,15 lbs),(25,20 lbs),(40,20 lbs). However, I'm flexible and don't necessarily need to define the combined intensity line as a set of Cartesian points. The unique times can be found with:
print list(set(uniqueTimes).intersection(allNoCountT[1].values.ravel())).sort()
but I can't come up with a slick way of getting the corresponding intensity values.
I started out with a very ugly function to break apart each "person's" graph so that all people had start and stop times (albeit many stop and start times without a state change) at the same time, and then I could add up all the "chunks" of time. This was cumbersome; there has to be a slick pandas way of handling this. If anyone can offer a suggestion or point me to another SO question that I might have missed, I'd appreciate the help!
In case my simplified example isn't clear, another might be plotting the intensity of sound coming from a piano: there are many notes being played for different durations with different intensities. I would like the sum of intensity coming from the piano over time. While my example is simplistic, I need a solution that is more on the scale of a piano song: thousands of discrete intensity levels per key, and many keys contributing over the course of a song.
Edit--Implementation of mgab's provided solution:
import pandas as ps
import math
import numpy as np
person1=['person1',3,0.0,10.0,10.0,10.0,10.0,10.0]
person2=['person2',4,0,20,20,25,25,40]
person3=['person3',5,0,5,5,15,15,40]
allPeopleDf=ps.DataFrame(np.array(zip(person1,person2,person3)).T)
allPeopleDf.columns=['id','intensity','start1', 'end1', 'start2', 'end2', 'start3','end3']
allPeopleDf=ps.melt(allPeopleDf,id_vars=['intensity','id'])
allPeopleDf.columns=['intensity','id','timeid','time']
df=ps.DataFrame(allPeopleDf).drop('timeid',1)
df[df.id=='person1'].drop('id',1) #easier to visualize one id for check
df['increment']=df.groupby('id')['intensity'].transform( lambda x: x.sub(x.shift(), fill_value= 0 ))
TypeError: unsupported operand type(s) for -: 'str' and 'int'
End Edit

Going with the piano keys example, let's assume you have three keys, with 30 levels of intensity.
I would try to keep the data in this format:
import pandas as pd
df = pd.DataFrame([[10,'A',5],
                   [10,'B',7],
                   [13,'C',10],
                   [15,'A',15],
                   [20,'A',7],
                   [23,'C',0]], columns=["time", "key", "intensity"])
time key intensity
0 10 A 5
1 10 B 7
2 13 C 10
3 15 A 15
4 20 A 7
5 23 C 0
where you record every change in intensity of any of the keys. From here you can already get the Cartesian coordinates for each individual key as (time,intensity) pairs
df[df.key=="A"].drop(columns='key')
time intensity
0 10 5
3 15 15
4 20 7
Then, you can easily create a new column increment that will indicate the change in intensity that occurred for that key at that time point (intensity indicates just the new value of intensity)
df["increment"]=df.groupby("key")["intensity"].transform(
    lambda x: x.sub(x.shift(), fill_value= 0 ))
df
time key intensity increment
0 10 A 5 5
1 10 B 7 7
2 13 C 10 10
3 15 A 15 10
4 20 A 7 -8
5 23 C 0 -10
And then, using this new column, you can generate the (time, total_intensity) pairs to use as Cartesian coordinates
df.groupby("time").sum()["increment"].cumsum()
time
10 12
13 22
15 32
20 24
23 14
dtype: int64
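As a self-contained check of the steps above, here is a minimal sketch that rebuilds the example frame and computes the running total in one pass. It uses `diff()` plus `fillna` instead of the `transform` lambda, which is my own substitution, and the name `total` is illustrative rather than part of the original answer.

```python
import pandas as pd

# Same change log as in the answer: one row per intensity change of a key.
df = pd.DataFrame(
    [[10, "A", 5], [10, "B", 7], [13, "C", 10],
     [15, "A", 15], [20, "A", 7], [23, "C", 0]],
    columns=["time", "key", "intensity"],
)

# Change in intensity per key; the first record of each key counts from 0,
# so its increment is simply its intensity.
df["increment"] = df.groupby("key")["intensity"].diff().fillna(df["intensity"])

# Sum simultaneous changes, then accumulate over time.
total = df.groupby("time")["increment"].sum().cumsum()
print(total)
```

Each entry of `total` is the combined intensity that holds from that time until the next index value, so it can be drawn as a step function.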
EDIT: applying the specific data presented in question
I'm assuming the data comes as a list of values, starting with the element id (person/piano key), then a factor multiplying the measured weight/intensities for this element, and then pairs of time values indicating the start and end of a series of known states (weight being carried/intensity being emitted). I'm not sure I got the data format right. From your question:
data1=['person1',3,0.0,10.0,10.0,10.0,10.0,10.0]
data2=['person2',4,0,20,20,25,25,40]
data3=['person3',5,0,5,5,15,15,40]
And if we know the weight/intensity of each one of the states, we can define:
known_states = [5, 10, 15]
DF_columns = ["time", "id", "intensity"]
Then, the easiest way I came up with to load the data includes this function:
import pandas as pd
def read_data(data, states, columns):
    id = data[0]
    factor = data[1]
    reshaped_data = []
    for i in range(len(states)):
        j = 2+2*i  # position of the start time of state i
        if not data[j] == data[j+1]:
            reshaped_data.append([data[j], id, factor*states[i]])
            reshaped_data.append([data[j+1], id, -1*factor*states[i]])
    return pd.DataFrame(reshaped_data, columns=columns)
Notice that the if not data[j] == data[j+1]: avoids loading data to the dataframe when start and end times for a given state are equal (seems uninformative, and wouldn't appear in your plots anyway). But take it out if you still want these entries.
Then, you load the data:
df = pd.concat([read_data(data1, known_states, DF_columns),
                read_data(data2, known_states, DF_columns),
                read_data(data3, known_states, DF_columns)],  # and so on...
               ignore_index=True)
And then you're right at the beginning of this answer (substituting 'key' by 'id' and the ids, of course)
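For the three records from the question, the whole path from raw lists to a combined total can be sketched as follows. One caveat, which is my reading rather than something the answer states: `read_data` as written already stores signed changes (plus the weight at the start time, minus the weight at the end time), so this sketch sums them directly instead of differencing them per id again. The names `records`, `changes` and `total` are illustrative.

```python
import pandas as pd

known_states = [5, 10, 15]
records = [
    ["person1", 3, 0, 10, 10, 10, 10, 10],
    ["person2", 4, 0, 20, 20, 25, 25, 40],
    ["person3", 5, 0, 5, 5, 15, 15, 40],
]

rows = []
for rec in records:
    ident, factor, times = rec[0], rec[1], rec[2:]
    for i, weight in enumerate(known_states):
        start, end = times[2 * i], times[2 * i + 1]
        if start != end:  # skip zero-length states, as in read_data
            rows.append((start, ident, factor * weight))   # state switches on
            rows.append((end, ident, -factor * weight))    # state switches off

changes = pd.DataFrame(rows, columns=["time", "id", "increment"])

# Net change at each time, accumulated into the total being held.
total = changes.groupby("time")["increment"].sum().cumsum()
print(total)
```

The total returns to zero at the last end time, which is a quick sanity check that every start has a matching end.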

Appears to be what .sum() is for:
In [10]:
allPeopleDf.sum()
Out[10]:
aStart 0
aEnd 35
bStart 35
bEnd 50
cStart 50
cEnd 90
dtype: int32

Related

How to find the average of data samples at random intervals in python?

I have temperature data stored in a csv file when plotted looks like the below image. How do I find the average during each interval when the temperature goes above 12. The result should be the T1, T2 ,T3 which should be the average temperature during the interval when its value is above 12.
Could you please suggest how to achieve this in python?
Highlighted the areas approximately over which I need to calculate the average:
Please find below sample data:
R3,R4
1,11
2,11
3,11
4,11
5,11
6,15.05938512
7,15.12975992
8,15.05850141
18,15.1677708
19,15.00921862
20,15.00686921
21,15.01168888
22,11
23,11
24,11
25,11
26,11
27,15.05938512
28,15.12975992
29,15.05850141
30,15.00328706
31,15.12622611
32,15.01479819
33,15.17778891
34,15.01411488
35,9
36,9
37,9
38,9
39,16.16042435
40,16.00091253
41,16.00419677
42,16.15381827
43,16.0471766
44,16.03725301
45,16.13925003
46,16.00072279
47,11
48,1
In pandas, an idea would be to group the data based on the condition T > 12 and use mean as agg func. Ex:
import pandas as pd
# a dummy df:
df = pd.DataFrame({'T': [11, 13, 13, 10, 14]})
# set the condition
m = df['T'] > 12
# define groups
grouper = (~m).cumsum().where(m)
# ...looks like
# 0 NaN
# 1 1.0
# 2 1.0
# 3 NaN
# 4 2.0
# Name: T, dtype: float64
# now we can easily calculate the mean for each group:
grp_mean = df.groupby(grouper)['T'].mean()
# T
# 1.0 13.0
# 2.0 14.0
# Name: T, dtype: float64
Note: if you have noisy data (T jumps up and down), it might be clever to apply a filter first (savgol, median etc. - whatever is appropriate) so you don't end up with groups caused by the noise.
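To illustrate the note about noise, here is a small sketch that compares run detection on raw values with run detection on a median-filtered copy. The data, the window length of 3 and the helper name `label_runs` are all made up for illustration; the right filter and window depend on the real signal.

```python
import pandas as pd

# Hypothetical readings: one long warm period with a single noisy dip (index 4).
df = pd.DataFrame({"T": [11, 11, 14, 14, 11, 14, 14, 11, 11]})

def label_runs(mask):
    """Give each consecutive run of True values its own label (NaN elsewhere)."""
    return (~mask).cumsum().where(mask)

# Without filtering, the dip splits the warm period into two groups.
raw_groups = label_runs(df["T"] > 12)

# A centered rolling median removes the one-sample dip before thresholding.
smooth = df["T"].rolling(3, center=True, min_periods=1).median()
smooth_groups = label_runs(smooth > 12)

# Means are still taken over the original (unfiltered) values.
raw_means = df.groupby(raw_groups)["T"].mean()
smooth_means = df.groupby(smooth_groups)["T"].mean()
print(len(raw_means), len(smooth_means))
```

Here the filter is used only to decide where an interval starts and ends; whether the mean should include the noisy sample is a separate choice.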
I couldn't find a good pattern for this - here's a clunky bit of code that does what you want, though.
In general, use .shift() to find transition points, and use groupby with transform to get your means.
import pandas as pd
#if you had a csv with Dates and Temps (and no header row), do this
#tempsDF = pd.read_csv("temps.csv", names=["Date","Temp"])
#tempsDF.set_index("Date", inplace=True)
#Using fake data since I don't have your csv
tempsDF = pd.DataFrame({'Temp': [0,13,14,13,8,7,5,0,14,16,16,0,0,0]})
#This is a bit clunky - I bet there's a more elegant way to do it
tempsDF["CumulativeFlag"] = 0
tempsDF.loc[tempsDF["Temp"]>12, "CumulativeFlag"]=1
#number each rising edge (flag goes 0 -> 1) to start a new group
tempsDF.loc[tempsDF["CumulativeFlag"] > tempsDF["CumulativeFlag"].shift(), "HighTempGroup"] = list(range(1,len(tempsDF.loc[tempsDF["CumulativeFlag"] > tempsDF["CumulativeFlag"].shift()])+1))
tempsDF["HighTempGroup"] = tempsDF["HighTempGroup"].ffill()
tempsDF.loc[tempsDF["Temp"]<=12, "HighTempGroup"]= None
tempsDF["HighTempMean"] = tempsDF.groupby("HighTempGroup")["Temp"].transform("mean")

first attempt at python, error ("IndexError: index 8 is out of bounds for axis 0 with size 8") and efficiency question

learning python, just began last week, havent otherwise coded for about 20 years and was never that advanced to begin with. I got the hello world thing down. Now im trying to back test FX pairs. Any help up the learning curve appreciated, and of course scouring this site while on my Lynda vids.
Getting a funky error, and also wondering if theres blatantly more efficient ways to loop through columns of excel data the way I am.
The spreadsheet being read is simple ... 56 FX pairs down column A, and 8 rows over where the column headers are dates, and the cells in each column are the respective FX pair closing price on that date. The strategy starts at the top of the 2nd column (so that there is a return % that can be calc'd vs the prior period) and calcs out period/period % returns for each pair, identifying which is the 'maximum value', and then "goes long" that highest performer ... whose performance in the subsequent period/period is recorded as PnL to the portfolio ("p" in the code), and loops through that until the current, most recent column is read.
The error relates to using 8 columns instead of 7 ... works when i limit the loop to 7 columns but not 8. When I used 8 I get a wall of text concluding with "IndexError: index 8 is out of bounds for axis 0 with size 8" Similar error when i use too many rows, 56 instead of 55, think im missing the bottom row.
Here's my code:
#set up imports
import pandas as pd
#import spreadsheet
x1 = pd.ExcelFile(r"C:\Users\Gamblor\Desktop\Python\test2020.xlsx")
df = pd.read_excel(x1, "Sheet1", header=1)
#define counters for loops
o = 1 # observation counter
c = 3 # column counter
r = 0 # active row counter for sorting through for max
#define identifiers for the portfolio
rpos = 0 # static row, for identifying which currency pair is in column 0 of that row
p = 100 # portfolio size starts at $100
#define the stuff we are evaluating for
pair = df.iat[r,0] # starting pair at 0,0 where each loop will begin
pair_pct_rtn = 0 # starts out at zero, becomes something at first evaluation, then gets compared to each subsequent eval
pair_pct_rtn_calc = 0 # a second version of above, for comparison to prior return
#runs a loop starting at the top to find the max period/period % return in a specific column
while (c < 8): # manually limiting this to 5 columns left to right
    while (r < 55): # i am manually limiting this to 55 data rows per the spreadsheet ... would be better if automatic
        pair_pct_rtn_calc = ((df.iat[r,c])/(df.iat[r,c-1]) - 1)
        if pair_pct_rtn_calc > pair_pct_rtn: # if its a higher return, it must be the "max" to that point
            pair = df.iat[r,0] # identifies the max pair for this column observation, so far
            pair_pct_rtn = pair_pct_rtn_calc # sets pair_pct_rtn as the new max
            rpos = r # identifies the max pair's ROW for this column observation, so far
        r = r + 1 # adds to r in order to jump down and calc the next row
    print('in obs #', o ,', ', pair ,'did best at' ,pair_pct_rtn ,'.')
    o = o + 1
    # now adjust the portfolio by however well USDMXN did in the subsequent week
    p = p * ( 1 + ((df.iat[rpos,c+1])/(df.iat[rpos,c]) - 1))
    print('then the subsequent period it did: ',(df.iat[rpos,c+1])/(df.iat[rpos,c]) - 1)
    print('resulting in portfolio value of', p)
    rpos = 0
    r = 0
    pair_pct_rtn = 0
    c = c + 1 # adds to c in order to move to the next period to the right
print(p)
Since indices are labelled from 0 onwards, the 8th element you are looking for will have index 7. Likewise, row index 55 (the 56th row) will be your last row.
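A likely reason the loop fails exactly at the last column is that the body reads `c+1` (the "subsequent period"), so the loop has to stop one column before the end. The sketch below derives both limits from `df.shape` instead of hard-coding them. The frame is a made-up stand-in for the spreadsheet, and unlike the original loop it picks the best performer even when every return is negative; treat both as assumptions.

```python
import pandas as pd

# Hypothetical stand-in for the spreadsheet: pair names first, then closing prices.
df = pd.DataFrame({
    "pair": ["AAA", "BBB", "CCC"],
    "d1": [1.00, 2.00, 4.00],
    "d2": [1.10, 2.00, 3.60],
    "d3": [1.21, 2.40, 3.60],
    "d4": [1.21, 2.64, 3.96],
})

n_rows, n_cols = df.shape
p = 100.0   # portfolio value
picks = []  # best pair chosen in each observation

# c needs a previous column (c - 1) and a following column (c + 1),
# so it runs from 2 up to n_cols - 2 inclusive.
for c in range(2, n_cols - 1):
    returns = df.iloc[:, c] / df.iloc[:, c - 1] - 1  # period/period return per pair
    rpos = int(returns.to_numpy().argmax())          # row position of the best pair
    picks.append(df.iat[rpos, 0])
    p *= df.iat[rpos, c + 1] / df.iat[rpos, c]       # hold it for the next period

print(picks, p)
```

Because the bounds come from the frame itself, adding rows or date columns to the spreadsheet does not require editing the loop.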

Python: numpy, pandas, and performing operations on the previous array value (smoothed averages): any way to not use FOR loop? EWMA?

Tbh, I'm not really sure how to ask this question. I've got an array of values, and I'm looking to take the smoothed average of these values moving forward. In Excel, the calculation process is:
average_val_1 = mean average of values through window_size
average_val_2 = (value at location window_size+1 * window_size-1 + average_val_1) / window_size
average_val_3 = (value at location window_size+2 * window_size-1 + average_val_2) / window_size
etc., etc.
In pandas and numpy, my code for this is the following
df = pd.DataFrame({'av':np.nan, 'values':np.random.rand(10)})
df = df[['values','av']]
window = 5
df['av'].iloc[5] = np.mean(df['values'][:5])
for i in range(window+1,len(df.index)):
    df['av'].iloc[i] = (df['values'].iloc[i] * (window-1) + df['av'].iloc[i-1])/window
Which returns:
values av
0 0.418498 NaN
1 0.570326 NaN
2 0.296878 NaN
3 0.308445 NaN
4 0.127376 NaN
5 0.381160 0.344305
6 0.239725 0.260641
7 0.928491 0.794921
8 0.711632 0.728290
9 0.319791 0.401491
These are the values I am looking for, but there has to be a better way than using for loops. I think the answer has something to do with using exponentially weighted moving averages, but I'll be damned if I can figure out the syntax to make any sense of that.
Any suggestions?
you can use ewm such as:
window = 5
df['av'] = np.nan
df['av'].iloc[window] = np.mean(df['values'][:window])
df.loc[window:,'av'] = (df.loc[window:,'av'].fillna(df['values'])
.ewm(adjust=False, alpha=(window-1.)/window).mean())
and you get the same result as with your for loop. For this to work, column 'av' must be NaN everywhere except the seed value; otherwise the fillna with column 'values' will not happen and the value calculated in 'av' will be wrong. The parameter alpha in ewm sets the weight given to the row you are calculating.
Note: while this code does as yours, I would recommend to have a look at this line in your code:
df['av'].iloc[5] = np.mean(df['values'][:5])
Because the upper bound is excluded when slicing with [:5], df['values'][:5] is:
0 0.418498
1 0.570326
2 0.296878
3 0.308445
4 0.127376
Name: values, dtype: float64
so I think that what you should do is df['av'].iloc[4] = np.mean(df['values'][:5]). If you agree, then my code above must be slightly changed:
df['av'].iloc[window-1] = np.mean(df['values'][:window])
df.loc[window-1:,'av'] = (df.loc[window-1:,'av'].fillna(df['values'])
.ewm(adjust=False, alpha=(window-1.)/window).mean())
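To see why `ewm` reproduces the loop: the recursion av_i = (value_i * (window-1) + av_{i-1}) / window can be rewritten as av_i = alpha * value_i + (1 - alpha) * av_{i-1} with alpha = (window-1)/window, which is the `adjust=False` form. The sketch below checks this numerically on random data. It works on plain Series rather than DataFrame columns to sidestep chained assignment, and the names `av_loop`, `seeded` and `av_ewm` are illustrative.

```python
import numpy as np
import pandas as pd

window = 5
rng = np.random.default_rng(0)
values = pd.Series(rng.random(10), name="values")

# Reference: the explicit loop from the question.
av_loop = pd.Series(np.nan, index=values.index)
av_loop.iloc[window] = values.iloc[:window].mean()
for i in range(window + 1, len(values)):
    av_loop.iloc[i] = (values.iloc[i] * (window - 1) + av_loop.iloc[i - 1]) / window

# ewm version: replace the seed position with the initial mean, then smooth
# from there with alpha = (window - 1) / window and adjust=False.
seeded = values.copy()
seeded.iloc[window] = values.iloc[:window].mean()
av_ewm = (seeded.iloc[window:]
          .ewm(alpha=(window - 1) / window, adjust=False)
          .mean()
          .reindex(values.index))  # NaN before the seed position

print(np.allclose(av_loop.iloc[window:], av_ewm.iloc[window:]))
```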

Rolling Product in PANDAS over 30-day time window

I am trying to get data ready for a financial event analysis and want to calculate the buy-and-hold abnormal return (BHAR). For a test data set I have three events (noted by event_id), and for each event I have 272 rows, going from t-252 days to t+20 days (noted by the variable time). For each day I also have the stock's return data (ret) as well as the expected return (Exp_Ret), which was calculated using a market model. Here's a sample of the data:
index event_id time ret vwretd Exp_Ret
0 0 -252 0.02905 0.02498 nan
1 0 -251 0.01146 -0.00191 nan
2 0 -250 0.01553 0.00562 nan
...
250 0 -2 -0.00378 0.00028 -0.00027
251 0 -1 0.01329 0.00426 0.00479
252 0 0 -0.01723 -0.00875 -0.01173
271 0 19 0.01335 0.01150 0.01398
272 0 20 0.00722 -0.00579 -0.00797
273 1 -252 0.01687 0.00928 nan
274 1 -251 -0.00615 -0.01103 nan
And here's the issue. I would like to calculate the following BHAR formula for each day:
So, using the above formula as an example, if I would like to calculate the 10-day buy-and-hold abnormal return, I would have to calculate (1+ret_t=0)x(1+ret_t=1)...x(1+ret_t=10), then do the same with the expected return, (1+Exp_Ret_t=0)x(1+Exp_Ret_t=1)...x(1+Exp_Ret_t=10), then subtract the latter from the former.
I have made some progress using rolling_apply but it doesn't solve all my problems:
df['part1'] = pd.rolling_apply(df['ret'], 10, lambda x : (1+x).prod())
This seems to correctly implement the left-side of the BHAR equation in that it will add in the correct product -- though it will enter the value two rows down (which can be solved by shifting). One problem, though, is that there are three different 'groups' in the dataframe (3 events), and if the window were to go forward more than 30 days it might start using products from the next event. I have tried to implement a groupby with rolling_apply but keep getting error: TypeError: 'Series' objects are mutable, thus they cannot be hashed
df.groupby('event_id').apply(pd.rolling_apply(df['ret'], 10, lambda x : (1+x).prod()))
I am sure I am missing something basic here so any help would be appreciated. I might just need to approach it from a different angle. Here's one thought: In the end, what I am most interested in is getting the 30-day and 60-day buy-and-hold abnormal returns starting at time=0. So, maybe it is easier to select each event at time=0 and then calculate the 30-day product going forward? I'm not sure how I could best approach that.
Thanks in advance for any insights.
import numpy as np
import pandas as pd
# Create sample data.
np.random.seed(0)
VOL = .3
df = pd.DataFrame({'event_id': [0] * 273 + [1] * 273 + [2] * 273,
                   'time': list(range(-252, 21)) * 3,
                   'ret': np.random.randn(273 * 3) * VOL / 252 ** .5,
                   'Exp_Ret': np.random.randn(273 * 3) * VOL / 252 ** .5})
# Pivot on time and event_id.
df = df.set_index(['time', 'event_id']).unstack('event_id')
# Calculated return difference from t=0.
df_diff = df.loc[df.index >= 0, 'ret'] - df.loc[df.index >= 0, 'Exp_Ret']
# Calculate cumulative abnormal returns.
cum_returns = (1 + df_diff).cumprod() - 1
# Get 10 day abnormal returns.
>>> cum_returns.loc[10]
event_id
0 -0.014167
1 -0.172599
2 -0.032647
Name: 10, dtype: float64
Edited so that final values of BHAR are included in the main DataFrame.
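def bhar(arr):
    # gross return over the window: product of (1 + r)
    return np.cumprod(arr+1)[-1]
grouped = df.groupby('event_id')
parts = []
for name, group in grouped:
    # 10-row rolling product per event, so windows never span two events
    parts.append(group['ret'].rolling(10).apply(bhar, raw=True) -
                 group['Exp_Ret'].rolling(10).apply(bhar, raw=True))
df['BHAR'] = pd.concat(parts)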
You can then slice the DataFrame using df[df['time']>=0] such that you get only the required part.
You can obviously collapse the loop in one line using .apply() on the group, but I like it this way. Shorter lines to read = better readability.
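The question's closing idea (take each event from time=0 and compound forward over a fixed horizon) can also be sketched without any rolling window. The tiny frame, the `horizon` value and the names `win`, `gross` and `bhar_by_event` below are made up for illustration; the column names follow the question.

```python
import pandas as pd

# Hypothetical long-format data: two events, a few days each.
df = pd.DataFrame({
    "event_id": [0] * 4 + [1] * 4,
    "time": [-1, 0, 1, 2] * 2,
    "ret": [0.05, 0.01, 0.02, -0.01, 0.00, 0.03, 0.00, 0.01],
    "Exp_Ret": [0.00, 0.00, 0.01, 0.00, 0.00, 0.01, 0.01, 0.00],
})

horizon = 2  # compound from time 0 through time `horizon`, inclusive

# Keep only the event window, then compound (1 + r) within each event.
win = df[(df["time"] >= 0) & (df["time"] <= horizon)]
gross = (1 + win[["ret", "Exp_Ret"]]).groupby(win["event_id"]).prod()

# Buy-and-hold abnormal return: product of actual minus product of expected.
bhar_by_event = gross["ret"] - gross["Exp_Ret"]
print(bhar_by_event)
```

Grouping by `event_id` before taking the product is what keeps one event's returns from leaking into the next.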
This is what I did:
((df+1.0) \
.apply(lambda x: np.log(x),axis=1)\
.rolling(365).sum() \
.apply(lambda x: np.exp(x),axis=1)-1.0)
result is a rolling product.
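The trick in this last answer rests on the identity prod(1 + r) = exp(sum(log(1 + r))), which turns a rolling product into a rolling sum. Here is a minimal check on made-up returns with a short window; `ret`, `direct` and `via_logs` are illustrative names, and the approach assumes every return is greater than -1 so the logarithm is defined.

```python
import numpy as np
import pandas as pd

# Hypothetical daily returns.
ret = pd.Series([0.01, -0.02, 0.03, 0.00, 0.015, -0.01])
window = 3

# Direct rolling product of gross returns.
direct = (1 + ret).rolling(window).apply(np.prod, raw=True)

# Same quantity through logs: sum the log gross returns, then exponentiate.
via_logs = np.exp(np.log1p(ret).rolling(window).sum())

print(np.allclose(direct.dropna(), via_logs.dropna()))
```

The rolling sum uses pandas' built-in implementation rather than calling a Python function per window, which is usually the reason to prefer the log form on long series.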

Selecting elements of a pandas dataframe that fall above a critical threshold

I have a pandas.df and I'm trying to remove all hypotheses that can be rejected.
Here is a snippet of the df in question:
best value p_value
0 11.9549 0.986927
1 11.9588 0.986896
2 12.1185 0.985588
3 12.1682 0.985161
4 12.3907 0.983131
5 12.4148 0.982899
6 12.6273 0.980750
7 12.9020 0.977680
8 13.4576 0.970384
9 13.5058 0.969679
10 13.5243 0.969405
11 13.5886 0.968439
12 13.8025 0.965067
13 13.9840 0.962011
14 14.1896 0.958326
15 14.3939 0.954424
16 14.6229 0.949758
17 14.6689 0.948783
18 14.9464 0.942626
19 15.1216 0.938494
20 15.5326 0.928039
21 17.7720 0.851915
22 17.8668 0.847993
23 17.9662 0.843822
24 19.2481 0.785072
25 19.5257 0.771242
I want to remove the elements with a p_value greater than a critical threshold alpha by selecting the ones that fall below alpha. The p value is calculated using scipy.stats.chisqprob(chisq,df) where chisq is the chi squared statistic and df is the degrees of freedom. This is all done using the custom method self.get_p_values shown below.
def reject_null_hypothesis(self,alpha,df):
assert alpha>0
assert alpha<1
p_value=self.get_p_values(df) #calculates the data frame above
return p_value.loc[p_value['best value']
I'm then calling this method using:
PE=Modelling_Tools.PE_Results(PE_file) #Modelling.Tools is the module and PE_Results is the class which is given the data 'PE_file'
print PE.reject_null_hypothesis(0.5,25)
From what I've read this should do what I want but I'm new to pandas.df and this code returns the unchanged
Are you getting any errors when you run this? I ask because:
print PE.reject_null_hypothesis(0.5, 25)
is passing into reject_null_hypothesis() 25, an int object instead of a pandas.DataFrame object, in the last argument position.
(Apologies. I would respond with this as a comment instead of an answer, but I only have 46 reputation at the moment, and 50 is needed to comment.)
Refer to indexing with a boolean array:
df[ df.p_value < threshold ]
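As a small illustration of boolean indexing on a few rows taken from the question's table: the comparison produces a True/False Series, and indexing with it keeps only the True rows. The threshold of 0.9 and the names `mask`, `below` and `not_below` are assumptions for the example; which side to keep depends on how the hypothesis test is being read.

```python
import pandas as pd

# A few rows from the question's table.
p_value = pd.DataFrame({
    "best value": [11.9549, 15.5326, 17.7720, 19.5257],
    "p_value": [0.986927, 0.928039, 0.851915, 0.771242],
})

alpha = 0.9  # illustrative threshold

mask = p_value["p_value"] < alpha  # boolean Series, one entry per row
below = p_value[mask]              # rows with p_value below alpha
not_below = p_value[~mask]         # the complementary rows

print(below)
print(not_below)
```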
Turns out there is a simple way to do what I want. Here is the code for those who want to know.
def reject_null_hypothesis(self,alpha,df):
    '''
    alpha = critical threshold for chisq statistic
    df=degrees of freedom
    values below this critical threshold are rejected.
    values above this threshold are not 'proven' but
    cannot be rejected and must therefore be subject to
    further statistics
    '''
    assert alpha>0
    assert alpha<1
    p_value=self.get_p_values(df)
    passed= p_value[p_value.loc[:,'p_value']>alpha].index
    return p_value[:max(passed)]
