Tbh, I'm not really sure how to ask this question. I've got an array of values, and I'm looking to take the smoothed average of these values moving forward. In Excel, the calculation process is:
average_val_1 = mean of the first window_size values
average_val_2 = (value at position window_size+1 * (window_size-1) + average_val_1) / window_size
average_val_3 = (value at position window_size+2 * (window_size-1) + average_val_2) / window_size
etc., etc.
In pandas and numpy, my code for this is the following:
df = pd.DataFrame({'av':np.nan, 'values':np.random.rand(10)})
df = df[['values','av']]
window = 5
df['av'].iloc[5] = np.mean(df['values'][:5])
for i in range(window+1,len(df.index)):
    df['av'].iloc[i] = (df['values'].iloc[i] * (window-1) + df['av'].iloc[i-1])/window
Which returns:
values av
0 0.418498 NaN
1 0.570326 NaN
2 0.296878 NaN
3 0.308445 NaN
4 0.127376 NaN
5 0.381160 0.344305
6 0.239725 0.260641
7 0.928491 0.794921
8 0.711632 0.728290
9 0.319791 0.401491
These are the values I am looking for, but there has to be a better way than using for loops. I think the answer has something to do with using exponentially weighted moving averages, but I'll be damned if I can figure out the syntax to make any sense of that.
Any suggestions?
You can use ewm, for example:
window = 5
df['av'] = np.nan
df['av'].iloc[window] = np.mean(df['values'][:window])
df.loc[window:,'av'] = (df.loc[window:,'av'].fillna(df['values'])
.ewm(adjust=False, alpha=(window-1.)/window).mean())
and you get the same result as with your for loop. For this to work, the 'av' column must be NaN below the seeded row; otherwise the fillna with the 'values' column will not happen and the values calculated in 'av' will be wrong. The alpha parameter of ewm is what weights the row you are calculating: with adjust=False, ewm computes y_t = (1 - alpha) * y_{t-1} + alpha * x_t, so alpha = (window-1)/window reproduces your recursion (value * (window-1) + previous_average) / window.
Note: while this code does the same as yours, I would recommend having a look at this line in your code:
df['av'].iloc[5] = np.mean(df['values'][:5])
Because the upper bound is excluded when slicing with [:5], df['values'][:5] is:
0 0.418498
1 0.570326
2 0.296878
3 0.308445
4 0.127376
Name: values, dtype: float64
so I think what you should do is df['av'].iloc[4] = np.mean(df['values'][:5]). If you agree, then my code above must be changed slightly:
df['av'].iloc[window-1] = np.mean(df['values'][:window])
df.loc[window-1:,'av'] = (df.loc[window-1:,'av'].fillna(df['values'])
.ewm(adjust=False, alpha=(window-1.)/window).mean())
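Putting the corrected pieces together, here is a minimal end-to-end sketch (assuming a small random frame like the one in the question; column and variable names follow the question):
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({'values': np.random.rand(10)})
window = 5

# Seed the first average with the plain mean of the first `window` values,
# then let ewm carry the recursion forward from that row.
df['av'] = np.nan
df.loc[window - 1, 'av'] = df['values'][:window].mean()
df.loc[window - 1:, 'av'] = (df.loc[window - 1:, 'av']
                             .fillna(df['values'])
                             .ewm(adjust=False, alpha=(window - 1.) / window)
                             .mean())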
I'm wondering if this is possible in a more efficient way.
I have a dataset in PostgreSQL that is structured like this:
Year, Sitename, Array (length = 4500)
For example:
1982, DANC, array([2,3,4,5,6,7,...])
1982, ANCH, array([5,6,4,3,5,7,...])
1983, DANC, array([3,3,4,6,3,6,...])
1983, ANCH, array([8,8,5,4,3,2,...])
What I want to do is add up the arrays (across rows) by year, e.g.:
1982 1982 1982
DANC ANCH TOT
2 5 7
3 6 9
4 4 8
5 3 8
6 5 11
7 7 14
... ... ...
My Django model looks like this:
class Abundance(models.Model):
    abundance_id = models.AutoField(primary_key=True)
    site = models.ForeignKey('Site')
    season = models.SmallIntegerField()
    samples = ArrayField(models.DecimalField(blank=True, decimal_places=3, max_digits=30))

    def __unicode__(self):
        return self.site
The following code in my Views.py works:
import numpy as np
import bottleneck as bn
...
def testview(request):
    s = ["ACUN","BRDM"]
    quants = []
    medians = []
    for yr in range(1982,2015):
        X = Abundance.objects.values_list('samples').filter(site__site_id__in = s).filter(season = yr)
        h = np.matrix(np.array(X,dtype=float))
        i = h.sum(axis=0)
        m = bn.median(i)
        up = np.percentile(i,95)
        down = np.percentile(i,5)
        qlist = [yr, round(down,3), round(up,3)]
        mlist = [yr, round(m,3)]
        quants.append(qlist)
        medians.append(mlist)
    return JsonResponse({'quants':quants, 'medians':medians})
However, the above code is very slow - especially when drawing many sites. I have tried playing with .aggregate() but I've not found a good solution.
Thanks in advance
You can probably use .aggregate() to push some of the load down to Postgres, but I think one of the bigger speed problems here is the Decimal field. It gives the highest precision, but it's also one of the more expensive types for Python to move in and out of.
That said, I'm not sure there's a quick way to get the percentiles out of the DB call, but the sums and medians can easily be pushed down to the DB via the Django ORM. The percentiles could probably be pushed down as well, but you would be delving into custom aggregates for Django (https://docs.djangoproject.com/en/1.9/ref/models/expressions/#creating-your-own-aggregate-functions). If you're going to go that far, it might be worth checking out something like aldjemy (https://github.com/Deepwalker/aldjemy/) and converting the entire query to SQLAlchemy so you have maximum control over it.
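Before going that far, a cheaper first step is to drop the per-year query loop and the repeated Decimal conversions: pull every season in one values_list() call and sum with numpy in Python. A rough, untested sketch of a rewritten testview along those lines (it assumes the same imports as the original views.py, i.e. numpy, bottleneck, JsonResponse and the Abundance model):
def testview(request):
    s = ["ACUN", "BRDM"]
    # One query for all seasons instead of one query per year.
    # Add .filter(season__range=(1982, 2014)) if you still need the year bounds.
    rows = Abundance.objects.filter(site__site_id__in=s) \
                            .values_list('season', 'samples')

    totals = {}
    for season, samples in rows:
        arr = np.asarray(samples, dtype=float)  # Decimal -> float once per row
        if season in totals:
            totals[season] += arr
        else:
            totals[season] = arr

    quants, medians = [], []
    for yr in sorted(totals):
        t = totals[yr]
        quants.append([yr, round(np.percentile(t, 5), 3), round(np.percentile(t, 95), 3)])
        medians.append([yr, round(bn.median(t), 3)])
    return JsonResponse({'quants': quants, 'medians': medians})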
I am trying to get data ready for a financial event analysis and want to calculate the buy-and-hold abnormal return (BHAR). For a test data set I have three events (noted by event_id), and for each event I have 272 rows, going from t-252 days to t+20 days (noted by the variable time). For each day I also have the stock's return data (ret) as well as the expected return (Exp_Ret), which was calculated using a market model. Here's a sample of the data:
index event_id time ret vwretd Exp_Ret
0 0 -252 0.02905 0.02498 nan
1 0 -251 0.01146 -0.00191 nan
2 0 -250 0.01553 0.00562 nan
...
250 0 -2 -0.00378 0.00028 -0.00027
251 0 -1 0.01329 0.00426 0.00479
252 0 0 -0.01723 -0.00875 -0.01173
271 0 19 0.01335 0.01150 0.01398
272 0 20 0.00722 -0.00579 -0.00797
273 1 -252 0.01687 0.00928 nan
274 1 -251 -0.00615 -0.01103 nan
And here's the issue. I would like to calculate the following BHAR formula for each day:
BHAR(0, T) = [(1 + ret_0) x (1 + ret_1) x ... x (1 + ret_T)] - [(1 + Exp_Ret_0) x (1 + Exp_Ret_1) x ... x (1 + Exp_Ret_T)]
So, using the above formula as an example, if I would like to calculate the 10-day buy-and-hold abnormal return, I would have to calculate (1+ret_t=0) x (1+ret_t=1) ... x (1+ret_t=10), then do the same with the expected return, (1+Exp_Ret_t=0) x (1+Exp_Ret_t=1) ... x (1+Exp_Ret_t=10), then subtract the latter from the former.
I have made some progress using rolling_apply but it doesn't solve all my problems:
df['part1'] = pd.rolling_apply(df['ret'], 10, lambda x : (1+x).prod())
This seems to correctly implement the left-hand side of the BHAR equation, in that it fills in the correct product -- though it enters the value two rows down (which can be solved by shifting). One problem, though, is that there are three different 'groups' in the dataframe (3 events), and if the window were to go forward more than 30 days it might start using products from the next event. I have tried to combine groupby with rolling_apply but keep getting the error TypeError: 'Series' objects are mutable, thus they cannot be hashed from this attempt:
df.groupby('event_id').apply(pd.rolling_apply(df['ret'], 10, lambda x : (1+x).prod()))
I am sure I am missing something basic here so any help would be appreciated. I might just need to approach it from a different angle. Here's one thought: In the end, what I am most interested in is getting the 30-day and 60-day buy-and-hold abnormal returns starting at time=0. So, maybe it is easier to select each event at time=0 and then calculate the 30-day product going forward? I'm not sure how I could best approach that.
Thanks in advance for any insights.
# Create sample data.
np.random.seed(0)
VOL = .3
df = pd.DataFrame({'event_id': [0] * 273 + [1] * 273 + [2] * 273,
                   'time': range(-252, 21) * 3,
                   'ret': np.random.randn(273 * 3) * VOL / 252 ** .5,
                   'Exp_Ret': np.random.randn(273 * 3) * VOL / 252 ** .5})
# Pivot on time and event_id.
df = df.set_index(['time', 'event_id']).unstack('event_id')
# Calculate the return difference from t=0 onward.
df_diff = df.loc[df.index >= 0, 'ret'] - df.loc[df.index >= 0, 'Exp_Ret']
# Calculate cumulative abnormal returns.
cum_returns = (1 + df_diff).cumprod() - 1
# Get 10 day abnormal returns.
>>> cum_returns.loc[10]
event_id
0 -0.014167
1 -0.172599
2 -0.032647
Name: 10, dtype: float64
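If you want the literal definition from the question (product of gross returns minus product of gross expected returns, rather than compounding the daily difference), the same pivoted frame supports that directly. A small sketch under the same setup:
# BHAR: product of (1 + ret) minus product of (1 + Exp_Ret), starting at t=0,
# computed separately for each event_id column.
post = df.loc[df.index >= 0]
bhar = (1 + post['ret']).cumprod() - (1 + post['Exp_Ret']).cumprod()
bhar.loc[10]  # 10-day BHAR per event_id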
Edited so that final values of BHAR are included in the main DataFrame.
BHAR = pd.Series()

def bhar(arr):
    return np.cumprod(arr + 1)[-1]

grouped = df.groupby('event_id')
for name, group in grouped:
    BHAR = BHAR.append(pd.rolling_apply(group['ret'], 10, bhar) -
                       pd.rolling_apply(group['Exp_Ret'], 10, bhar))
df['BHAR'] = BHAR
You can then slice the DataFrame using df[df['time']>=0] such that you get only the required part.
You can obviously collapse the loop into one line using .apply() on the groupby, but I like it this way. Shorter lines to read = better readability.
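For completeness, the collapsed version that remark alludes to might look roughly like this (a sketch using the same bhar helper and the same rolling_apply era of pandas; the reset_index step is only there to align the result back to df's index):
df['BHAR'] = (df.groupby('event_id')
                .apply(lambda g: pd.rolling_apply(g['ret'], 10, bhar)
                               - pd.rolling_apply(g['Exp_Ret'], 10, bhar))
                .reset_index(level=0, drop=True))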
This is what I did:
((df + 1.0)
 .apply(lambda x: np.log(x), axis=1)
 .rolling(365).sum()
 .apply(lambda x: np.exp(x), axis=1) - 1.0)
The result is a rolling product: since prod(1 + x) = exp(sum(log(1 + x))), a rolling sum of the logs followed by exp gives the rolling product of (1 + x), and subtracting 1 converts it back to a return.
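If the frame contains several event groups as in the question, the same log/exp trick can be applied per group so a window never crosses an event boundary. A rough sketch using the question's column names and a 10-day window (the bh_ret column name is just for illustration):
# exp(rolling sum of logs) == rolling product, computed within each event_id
df['bh_ret'] = (df.groupby('event_id')['ret']
                  .transform(lambda s: np.exp(np.log1p(s).rolling(10).sum()) - 1.0))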
I have a pandas DataFrame and I'm trying to remove all hypotheses that can be rejected.
Here is a snippet of the df in question:
best value p_value
0 11.9549 0.986927
1 11.9588 0.986896
2 12.1185 0.985588
3 12.1682 0.985161
4 12.3907 0.983131
5 12.4148 0.982899
6 12.6273 0.980750
7 12.9020 0.977680
8 13.4576 0.970384
9 13.5058 0.969679
10 13.5243 0.969405
11 13.5886 0.968439
12 13.8025 0.965067
13 13.9840 0.962011
14 14.1896 0.958326
15 14.3939 0.954424
16 14.6229 0.949758
17 14.6689 0.948783
18 14.9464 0.942626
19 15.1216 0.938494
20 15.5326 0.928039
21 17.7720 0.851915
22 17.8668 0.847993
23 17.9662 0.843822
24 19.2481 0.785072
25 19.5257 0.771242
I want to remove the elements with a p_value greater than a critical threshold alpha by selecting the ones that fall below alpha. The p-value is calculated using scipy.stats.chisqprob(chisq, df), where chisq is the chi-squared statistic and df is the degrees of freedom. This is all done via the custom method self.get_p_values, which is called in the method shown below.
def reject_null_hypothesis(self, alpha, df):
    assert alpha > 0
    assert alpha < 1
    p_value = self.get_p_values(df)  # calculates the data frame above
    return p_value.loc[p_value['best value']
I'm then calling this method using:
PE = Modelling_Tools.PE_Results(PE_file)  # Modelling_Tools is the module and PE_Results is the class which is given the data 'PE_file'
print PE.reject_null_hypothesis(0.5,25)
From what I've read this should do what I want, but I'm new to pandas DataFrames and this code returns the DataFrame unchanged.
Are you getting any errors when you run this? I ask because:
print PE.reject_null_hypothesis(0.5, 25)
is passing 25, an int object, into reject_null_hypothesis() as the last argument instead of a pandas.DataFrame object.
(Apologies. I would respond with this as a comment instead of an answer, but I only have 46 reputation at the moment, and 50 is needed to comment.)
Refer to indexing with a boolean array:
df[ df.p_value < threshold ]
Turns out there is a simple way to do what I want. Here is the code for those who want to know.
def reject_null_hypothesis(self, alpha, df):
    '''
    alpha = critical threshold for chisq statistic
    df = degrees of freedom

    values below this critical threshold are rejected.
    values above this threshold are not 'proven' but
    cannot be rejected and must therefore be subject to
    further statistics
    '''
    assert alpha > 0
    assert alpha < 1
    p_value = self.get_p_values(df)
    passed = p_value[p_value.loc[:, 'p_value'] > alpha].index
    return p_value[:max(passed)]
I am trying to sum (and plot) a total from functions which change states at different times using Python's Pandas.DataFrame. For example:
Suppose we have 3 people whose states can be a) holding nothing, b) holding a 5 pound weight, and c) holding a 10 pound weight. Over time, these people pick weights up and put them down. I want to plot the total amount of weight being held. So, given:
My brute force attempt:
import pandas as ps
import math
import numpy as np

person1 = [3, 0, 10, 10, 10, 10, 10]
person2 = [4, 0, 20, 20, 25, 25, 40]
person3 = [5, 0, 5, 5, 15, 15, 40]

allPeopleDf = ps.DataFrame(np.array(zip(person1, person2, person3)).T)
allPeopleDf.columns = ['count', 'start1', 'end1', 'start2', 'end2', 'start3', 'end3']
allPeopleDfNoCount = allPeopleDf[['start1', 'end1', 'start2', 'end2', 'start3', 'end3']]
uniqueTimes = sorted(ps.unique(allPeopleDfNoCount.values.ravel()))
possibleStates = [-1, 0, 1, 2]  # extra state 0 for initialization
stateData = {}
comboStates = {}
# initialize dict to add up all of the stateData
for time in uniqueTimes:
    comboStates[time] = 0.0
allPeopleDf['track'] = -1
allPeopleDf['status'] = -1
numberState = len(possibleStates)
starti = -1
endi = 0
startState = 0
for i in range(3):
    starti = starti + 2
    print starti
    endi = endi + 2
    for time in uniqueTimes:
        def helper(row):
            start = row[starti]
            end = row[endi]
            track = row[7]
            if start <= time and time < end:
                return possibleStates[i + 1]
            else:
                return possibleStates[0]

        def trackHelp(row):
            status = row[8]
            track = row[7]
            if track <= status:
                return status
            else:
                return track

        def Multiplier(row):
            x = row[8]
            if x == 0:
                return 0.0 * row[0]
            if x == 1:
                return 5.0 * row[0]
            if x == 2:
                return 10.0 * row[0]
            if x == -1:  # numeric placeholder for non-contributing
                return 0.0 * row[0]

        allPeopleDf['status'] = allPeopleDf.apply(helper, axis=1)
        allPeopleDf['track'] = allPeopleDf.apply(trackHelp, axis=1)
        stateData[time] = allPeopleDf.apply(Multiplier, axis=1).sum()
    for k, v in stateData.iteritems():
        comboStates[k] = comboStates.get(k, 0) + v
print allPeopleDf
print stateData
print comboStates
Plots of weight being held over time might look like the following:
And the sum of the intensities over time might look like the black line in the following:
with the black line defined with the Cartesian points: (0,0 lbs),(5,0 lbs),(5,5 lbs),(15,5 lbs),(15,10 lbs),(20,10 lbs),(20,15 lbs),(25,15 lbs),(25,20 lbs),(40,20 lbs). However, I'm flexible and don't necessarily need to define the combined intensity line as a set of Cartesian points. The unique times can be found with:
print list(set(uniqueTimes).intersection(allNoCountT[1].values.ravel())).sort()
but I can't come up with a slick way of getting the corresponding intensity values.
I started out with a very ugly function to break apart each "person's" graph so that all people had start and stop times (albeit many start and stop times without a state change) at the same times, and then I could add up all the "chunks" of time. This was cumbersome; there has to be a slick pandas way of handling this. If anyone can offer a suggestion or point me to another SO question like this that I might have missed, I'd appreciate the help!
In case my simplified example isn't clear, another might be plotting the intensity of sound coming from a piano: there are many notes being played for different durations with different intensities. I would like the sum of intensity coming from the piano over time. While my example is simplistic, I need a solution that is more on the scale of a piano song: thousands of discrete intensity levels per key, and many keys contributing over the course of a song.
Edit--Implementation of mgab's provided solution:
import pandas as ps
import math
import numpy as np
person1=['person1',3,0.0,10.0,10.0,10.0,10.0,10.0]
person2=['person2',4,0,20,20,25,25,40]
person3=['person3',5,0,5,5,15,15,40]
allPeopleDf=ps.DataFrame(np.array(zip(person1,person2,person3)).T)
allPeopleDf.columns=['id','intensity','start1', 'end1', 'start2', 'end2', 'start3','end3']
allPeopleDf=ps.melt(allPeopleDf,id_vars=['intensity','id'])
allPeopleDf.columns=['intensity','id','timeid','time']
df=ps.DataFrame(allPeopleDf).drop('timeid',1)
df[df.id=='person1'].drop('id',1) #easier to visualize one id for check
df['increment']=df.groupby('id')['intensity'].transform( lambda x: x.sub(x.shift(), fill_value= 0 ))
TypeError: unsupported operand type(s) for -: 'str' and 'int'
End Edit
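An aside on the TypeError in the edit above: the most likely cause is that np.array(zip(...)) on rows mixing strings and numbers upcasts every value to a string, so the melted 'intensity' column holds strings and x.sub() fails. A hedged sketch of a fix, casting the numeric columns back to floats before computing the increment:
# Cast the melted columns back to numeric types; np.array() on mixed
# str/number rows turned every value into a string.
df['intensity'] = df['intensity'].astype(float)
df['time'] = df['time'].astype(float)

df['increment'] = df.groupby('id')['intensity'].transform(
    lambda x: x.sub(x.shift(), fill_value=0))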
Going for the piano keys example, let's assume you have three keys, with 30 levels of intensity.
I would try to keep the data in this format:
import pandas as pd
df = pd.DataFrame([[10, 'A', 5],
                   [10, 'B', 7],
                   [13, 'C', 10],
                   [15, 'A', 15],
                   [20, 'A', 7],
                   [23, 'C', 0]], columns=["time", "key", "intensity"])
time key intensity
0 10 A 5
1 10 B 7
2 13 C 10
3 15 A 15
4 20 A 7
5 23 C 0
where you record every change in intensity of any of the keys. From here you can already get the Cartesian coordinates for each individual key as (time,intensity) pairs
df[df.key=="A"].drop('key',1)
time intensity
0 10 5
3 15 15
4 20 7
Then, you can easily create a new column increment that will indicate the change in intensity that occurred for that key at that time point (intensity indicates just the new value of intensity)
df["increment"]=df.groupby("key")["intensity"].transform(
lambda x: x.sub(x.shift(), fill_value= 0 ))
df
time key intensity increment
0 10 A 5 5
1 10 B 7 7
2 13 C 10 10
3 15 A 15 10
4 20 A 7 -8
5 23 C 0 -10
And then, using this new column, you can generate the (time, total_intensity) pairs to use as Cartesian coordinates
df.groupby("time").sum()["increment"].cumsum()
time
10 12
13 22
15 32
20 24
23 14
dtype: int64
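If you then want to draw the combined line like the black one in the question, a step plot of these (time, total_intensity) pairs is enough. A minimal matplotlib sketch, assuming the cumulative series above is assigned to a variable (here called totals, a name introduced just for the plot):
import matplotlib.pyplot as plt

totals = df.groupby("time").sum()["increment"].cumsum()

# Intensity is piecewise constant between change points, so draw a step plot.
plt.step(totals.index, totals.values, where='post')
plt.xlabel("time")
plt.ylabel("total intensity")
plt.show()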
EDIT: applying this to the specific data presented in the question.
Assume the data comes as a list of values: first the element id (person / piano key), then a factor multiplying the measured weights/intensities for this element, and then pairs of time values indicating the start and end of a series of known states (weight being carried / intensity being emitted). Not sure if I got the data format right. From your question:
data1=['person1',3,0.0,10.0,10.0,10.0,10.0,10.0]
data2=['person2',4,0,20,20,25,25,40]
data3=['person3',5,0,5,5,15,15,40]
And if we know the weight/intensity of each one of the states, we can define:
known_states = [5, 10, 15]
DF_columns = ["time", "id", "intensity"]
Then, the easiest way I came up with to load the data uses this function:
import pandas as pd
def read_data(data, states, columns):
    id = data[0]
    factor = data[1]
    reshaped_data = []
    for i in xrange(len(states)):
        j = 2 + 2 * i
        if not data[j] == data[j + 1]:
            reshaped_data.append([data[j], id, factor * states[i]])
            reshaped_data.append([data[j + 1], id, -1 * factor * states[i]])
    return pd.DataFrame(reshaped_data, columns=columns)
Notice that the if not data[j] == data[j+1]: avoids loading data to the dataframe when start and end times for a given state are equal (seems uninformative, and wouldn't appear in your plots anyway). But take it out if you still want these entries.
Then, you load the data:
df = read_data(data1, known_states, DF_columns)
df = df.append(read_data(data2, known_states, DF_columns), ignore_index=True)
df = df.append(read_data(data3, known_states, DF_columns), ignore_index=True)
# and so on...
And then you're back at the beginning of this answer (substituting 'key' with 'id', and the key letters with the actual ids, of course).
Appears to be what .sum() is for:
In [10]:
allPeopleDf.sum()
Out[10]:
aStart 0
aEnd 35
bStart 35
bEnd 50
cStart 50
cEnd 90
dtype: int32