I have some stock data based on daily close values. I need to be able to insert these values into a python list and get a median for the last 30 closes. Is there a python library that does this?
In pure Python, having your data in a Python list a, you could do
median = sum(sorted(a[-30:])[14:16]) / 2.0
(This assumes a has at least 30 items.)
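If the list may hold fewer than 30 closes, the standard library's statistics.median avoids the hard-coded indices. A minimal sketch, assuming the closes are plain numbers; the helper name median_of_last is made up for illustration:

```python
from statistics import median

def median_of_last(values, n=30):
    """Median of the last n values (or of all values if there are fewer than n)."""
    if not values:
        raise ValueError("need at least one value")
    return median(values[-n:])

closes = list(range(1, 41))        # hypothetical closes: 1, 2, ..., 40
print(median_of_last(closes))      # median of the last 30 values (11..40)
print(median_of_last([3, 1, 2]))   # fewer than 30 values still works
```

For an even number of values statistics.median averages the two middle ones, which matches the slice-based one-liner above.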
Using the NumPy package, you could use
median = numpy.median(a[-30:])
Have you considered pandas? It is based on NumPy and can automatically associate timestamps with your data, and it skips missing dates as long as you fill them with numpy.nan. It also offers some rather powerful graphing via matplotlib.
Basically it was designed for financial analysis in Python.
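As a rough illustration of the pandas route, Series.rolling can compute the 30-close median directly. This is a sketch with made-up data; the variable names are placeholders:

```python
import pandas as pd

# Hypothetical daily closes: 1.0, 2.0, ..., 40.0
closes = pd.Series(range(1, 41), dtype=float)

# Median over a sliding window of the last 30 closes.
# The first 29 entries are NaN because the window is not full yet.
rolling_median = closes.rolling(window=30).median()

latest = rolling_median.iloc[-1]   # median of the most recent 30 closes
print(latest)
```

Passing min_periods=1 to rolling would return a median for the shorter leading windows instead of NaN.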
Isn't the median just the middle value in a sorted range?
So, assuming your list is stock_data:
last_thirty = stock_data[-30:]
median = sorted(last_thirty)[15]
Now you just need to find and fix the off-by-one errors (with 30 values the median is the average of the two middle values, at indices 14 and 15) and also handle the case of stock_data having fewer than 30 elements...
Let us try that here:
def rolling_median(data, window):
    if len(data) < window:
        subject = data[:]
    else:
        subject = data[-window:]
    # Integer division; for an even count this picks the upper of the two middle values.
    return sorted(subject)[len(subject) // 2]
#found this helpful:
import numpy as np

values = [10, 20, 30, 40, 50]
med = []
# Median of each growing prefix values[0:1], values[0:2], ...
for j in range(len(values)):
    sub_set = values[0:j+1]
    med.append(np.median(sub_set))
print(med)
Here is a much faster method with w*|x| space complexity.
import numpy as np

def moving_median(x, w):
    shifted = np.zeros((len(x)+w-1, w))
    shifted[:,:] = np.nan
    for idx in range(w-1):
        shifted[idx:-w+idx+1, idx] = x
    shifted[idx+1:, idx+1] = x
    # print(shifted)
    medians = np.median(shifted, axis=1)
    for idx in range(w-1):
        medians[idx] = np.median(shifted[idx, :idx+1])
        medians[-idx-1] = np.median(shifted[-idx-1, -idx-1:])
    return medians[(w-1)//2:-(w-1)//2]
moving_median(np.arange(10), 4)
# Output
array([0.5, 1. , 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8. ])
The output has the same length as the input vector.
Rows with fewer than half of their entries filled are ignored. Two rows are exactly half NaN (this happens only for an even window width); of those, only the first one is returned. Here is the shifted matrix from above with the respective median values:
[[ 0. nan nan nan] -> -
[ 1. 0. nan nan] -> 0.5
[ 2. 1. 0. nan] -> 1.0
[ 3. 2. 1. 0.] -> 1.5
[ 4. 3. 2. 1.] -> 2.5
[ 5. 4. 3. 2.] -> 3.5
[ 6. 5. 4. 3.] -> 4.5
[ 7. 6. 5. 4.] -> 5.5
[ 8. 7. 6. 5.] -> 6.5
[ 9. 8. 7. 6.] -> 7.5
[nan 9. 8. 7.] -> 8.0
[nan nan 9. 8.] -> -
[nan nan nan 9.]]-> -
The behaviour can be changed by adapting the final slice medians[(w-1)//2:-(w-1)//2].
Benchmark:
%%timeit
moving_median(np.arange(1000), 4)
# 267 µs ± 759 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Alternative approach (the results will be shifted):
def moving_median_list(x, w):
    medians = np.zeros(len(x))
    for j in range(len(x)):
        medians[j] = np.median(x[j:j+w])
    return medians
%%timeit
moving_median_list(np.arange(1000), 4)
# 15.7 ms ± 115 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
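Both algorithms have linear time complexity in the length of x, so the difference lies in the constant factor.
In the benchmarks above, moving_median (267 µs) is roughly 60 times faster than moving_median_list (15.7 ms).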
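For comparison, a similar windowed median can be sketched with NumPy's sliding_window_view (available from NumPy 1.20). This is not the method above: it only returns full windows, so the result has len(x) - w + 1 entries and the edges are not padded. The function name is a placeholder:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def moving_median_windows(x, w):
    """Median of every full window of width w (no edge handling)."""
    # Each row of `windows` is one window; it is a view, not a copy of the data.
    windows = sliding_window_view(np.asarray(x), w)
    return np.median(windows, axis=1)

result = moving_median_windows(np.arange(10), 4)
print(result)
```

The interior values agree with the 1.5 ... 7.5 run in the output of moving_median above; only the partially filled edge rows are missing.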
Why does this code:
print(np.delete(MatrixAnalytics(Cmp),[0],1))
MyNewMatrix = np.delete(MatrixAnalytics(Cmp),[0],1)
print("SecondPrint")
print(MyNewMatrix)
return the following?
[[ 2. 2. 2. 2. 2.]
[ 1. 2. 2. 2. 2.]
[ 1. 2. 0. 2. 2.]
[ 2. 2. 2. 2. 2.]
[ 2. 2. 2. 0. 0.]
[ 1. 2. 2. 0. 2.]
[ 1. 2. 2. 2. 2.]
[ 1. 2. 2. 2. nan]
[ 2. 2. 2. 2. 2.]
[ 2. 2. 2. 2. nan]]
Second Print
[[-1. 0. 0. 0. 0.]
[-1. 0. 0. 0. 0.]
[-1. 0. -1. 0. 0.]
[-1. 0. 0. 0. 0.]
[-1. 0. 0. -1. 0.]
[-1. 0. 0. -1. 0.]
[-1. 0. 0. 0. 0.]
[-1. 0. 0. 0. nan]
[-1. 0. 0. 0. 0.]
[-1. 0. 0. 0. nan]]
This is weird, and I can't figure it out. Why would the values change when there is no other line of code between the prints?
def MatrixAnalytics(DataMatrix):
    AnalyzedMatrix = DataMatrix
    for i in range(len(AnalyzedMatrix)): #Browse Each Column
        for j in range(len(AnalyzedMatrix[i])): #Browse Each Line
            if j>0:
                if AnalyzedMatrix[i][j] > 50:
                    if AnalyzedMatrix[i][j] > AnalyzedMatrix[i][j-1]:
                        AnalyzedMatrix[i][j] = 2
                    else:
                        AnalyzedMatrix[i][j] = 1
                else:
                    if AnalyzedMatrix[i][j] <50:
                        if AnalyzedMatrix[i][j] > AnalyzedMatrix[i][j-1]:
                            AnalyzedMatrix[i][j] = 0
                        else:
                            AnalyzedMatrix[i][j] = -1
    return AnalyzedMatrix
The input array is :
[[55. 57.6 57.2 57. 51.1 55.9]
[55.3 54.7 56.1 55.8 52.7 55.5]
[55.5 52. 52.2 49.9 53.8 55.6]
[54.9 57.8 57.6 53.6 54.2 59.9]
[47.9 50.7 53.3 52.5 49.9 45.8]
[57. 56.2 58.3 55.4 47.9 56.5]
[56.6 54.2 57.6 54.7 50.1 53.6]
[54.7 53.4 52. 52. 50.9 nan]
[51.4 51.5 51.2 53. 50.1 50.1]
[55.3 58.7 59.2 56.4 53. nan]]
It seems that the function MatrixAnalytics is called again, but I don't understand why.
Doing this works:
MyNewMatrix = np.delete(MatrixAnalytics(Cmp),[0],1)
print(MyNewMatrix)
MyNewMatrix = np.delete(MatrixAnalytics(Cmp),[0],1)
print("SecondPrint")
print(MyNewMatrix)
I think I got the issue.
In this code:
def MatrixAnalytics(DataMatrix):
    AnalyzedMatrix = DataMatrix
    ...
    ...
    return AnalyzedMatrix
AnalyzedMatrix is not a copy of DataMatrix; it references the same object in memory!
So on the first call of MatrixAnalytics, you are actually modifying the object behind the reference given as argument (because arrays are mutable).
In the second call, you are passing the same reference as argument, so the array behind it has already been modified.
Note: the return AnalyzedMatrix statement just returns a new reference to the object referenced by the DataMatrix argument (not a copy).
Try replacing this line:
AnalyzedMatrix = DataMatrix
with this one (in your definition of MatrixAnalytics):
AnalyzedMatrix = np.copy(DataMatrix)
For more info:
mutable vs immutable
numpy.delete()
numpy.copy()
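The difference between an alias and a copy can be seen in a small standalone sketch. The two functions below are illustrative stand-ins, not the original MatrixAnalytics:

```python
import numpy as np

def scale_alias(data):
    result = data              # same object as the caller's array, no copy
    result[:] = result * 2     # modifies the caller's array in place
    return result

def scale_copy(data):
    result = np.copy(data)     # independent array
    result[:] = result * 2     # the caller's array is left untouched
    return result

original = np.array([1.0, 2.0, 3.0])
aliased = scale_alias(original)
print(aliased is original)     # True: the input itself was changed

fresh = np.array([1.0, 2.0, 3.0])
copied = scale_copy(fresh)
print(copied is fresh)         # False: the input keeps its values
```

Calling scale_alias twice on the same array doubles it twice, which is the same effect as calling MatrixAnalytics(Cmp) twice in the question.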
I believe you want the same output in both cases.
Note that np.delete does not change the array it is given; it returns a new array with the 0th column removed.
The values differ because MatrixAnalytics(Cmp) is called twice and modifies Cmp in place each time, so call it once, assign the result to a variable, and print that variable instead of calling the function again inside the print statement.
I'm looking for functionality that works like this:
lookup_dict = {5:1.0, 12:2.0, 39:2.0...}
# this is the missing magic:
lookup = vectorized_dict(lookup_dict)
x = numpy.array([5.0, 59.39, 39.49...])
xbins = numpy.trunc(x).astype(numpy.int_)
y = lookup.get(xbins, 0.0)
# the idea is that we get this as the postcondition:
for (result, input) in zip(y, xbins):
    assert(result==lookup_dict.get(input, 0.0))
Is there some flavor of sparse array in numpy (or scipy) that gets at this kind of functionality?
The full context is that I'm binning some samples of a 1-D feature.
As far as I know, numpy does not support different data types in the same array structures but you can achieve a similar result if you are willing to separate keys from values and maintain the keys (and corresponding values) in sorted order:
import numpy as np
keys = np.array([5,12,39])
values = np.array([1.0, 2.0, 2.0])
valueOf5 = values[keys.searchsorted(5)] # 1.0
k = np.array([5,5,12,39,12])
values[keys.searchsorted(k)] # array([1., 1., 2., 2., 2.])
This may not be as efficient as a hashing key but it does support the propagation of indirections from arrays with any number of dimensions.
Note that this assumes your keys are always present in the keys array. If not, rather than an error, you could be getting the value of the next key up (or an IndexError if the query is larger than every key).
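One way to guard against missing keys is to check that the position returned by searchsorted really holds the queried key and fall back to a default otherwise. A minimal sketch, assuming keys is sorted and non-empty; the function name lookup_with_default is made up for illustration:

```python
import numpy as np

def lookup_with_default(keys, values, query, default=0.0):
    """Vectorized dict-style get: values[i] where keys[i] == query, else default."""
    idx = np.searchsorted(keys, query)
    # Queries above the largest key give idx == len(keys); clip to stay in range.
    idx = np.clip(idx, 0, len(keys) - 1)
    found = keys[idx] == query
    return np.where(found, values[idx], default)

keys = np.array([5, 12, 39])
values = np.array([1.0, 2.0, 2.0])
xbins = np.array([5, 59, 39, 6])
print(lookup_with_default(keys, values, xbins))
```

Here 59 and 6 are not keys, so they map to the default of 0.0 instead of borrowing a neighbouring value.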
Using np.select to create boolean masks over the array, ([xbins == k for k in lookup_dict]), the values from the dict (lookup_dict.values()), and a default value of 0:
y = np.select(
    [xbins == k for k in lookup_dict],
    list(lookup_dict.values()),
    0.0
)
# In [17]: y
# Out[17]: array([1., 0., 2.])
This relies on the dictionary's keys and values being iterated in the same order, which Python guarantees as long as the dictionary is not modified in between.
OR overkill with pandas:
import pandas as pd
s = pd.Series(xbins)
s = s.map(lookup_dict).fillna(0)
Another approach is to use searchsorted to search a numpy array which holds the integer 'keys' and to return the initially loaded value for the range n <= x < n+1. This may be useful to somebody asking a similar question in the future.
import numpy as np
class NpIntDict:
    """ Class to simulate a python dict get for a numpy array. """
    def __init__( self, dict_in, default = np.nan ):
        """ dict_in: a dictionary with integer keys.
        default: the value to be returned for keys not in the dictionary.
        defaults to np.nan
        default must be consistent with the dtype of values
        """
        # Create list of dict items sorted by key.
        list_in = sorted([ item for item in dict_in.items() ])
        # Create three empty lists.
        key_list = []
        val_list = []
        is_def_mask = []
        for key, value in list_in:
            key = int(key)
            if not key in key_list: # key not yet in key list
                # Update the three lists for key as default.
                key_list.append( key )
                val_list.append( default )
                is_def_mask.append( True )
            # Update the lists for key+1. With searchsorted this gives the required results.
            key_list.append( key + 1 )
            val_list.append( value )
            is_def_mask.append( False )
        # Add the key > max(key) to the val and is_def_mask lists.
        val_list.append( default )
        is_def_mask.append( True )
        # np.int was removed in NumPy 1.24; np.int_ is the supported integer dtype.
        self.keys = np.array( key_list, dtype = np.int_ )
        self.values = np.array( val_list )
        self.default_mask = np.array( is_def_mask )
    def set_default( self, default = 0 ):
        """ Set the default to a new default value. Using self.default_mask.
        Changes the default value for all future self.get(arr).
        """
        self.values[ self.default_mask ] = default
    def get( self, arr, default = None ):
        """ Returns an array looking up the values in `arr` in the dict.
        default can be used to change the default value returned for this get only.
        """
        if default is None:
            values = self.values
        else:
            values = self.values.copy()
            values[ self.default_mask ] = default
        # side = 'right' to ensure key[ix] <= x < key[ix+1]
        # side = 'left' would mean key[ix] < x <= key[ix+1]
        return values[ np.searchsorted( self.keys, arr, side = 'right' ) ]
This could be simplified if there's no requirement to change the default returned after the NpIntDict is created.
To test it.
d = { 2: 5.1, 3: 10.2, 5: 47.1, 8: -6}
# x <2 Return default
# 2 <= x <3 return 5.1
# 3 <= x < 4 return 10.2
# 4 <= x < 5 return default
# 5 <= x < 6 return 47.1
# 6 <= x < 8 return default
# 8 <= x < 9 return -6.
# 9 <= x return default
test = NpIntDict( d, default = 0.0 )
arr = np.arange( 0., 100. ).reshape(10,10)/10
print( arr )
"""
[[0. 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9]
[1. 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9]
[2. 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9]
[3. 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9]
[4. 4.1 4.2 4.3 4.4 4.5 4.6 4.7 4.8 4.9]
[5. 5.1 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9]
[6. 6.1 6.2 6.3 6.4 6.5 6.6 6.7 6.8 6.9]
[7. 7.1 7.2 7.3 7.4 7.5 7.6 7.7 7.8 7.9]
[8. 8.1 8.2 8.3 8.4 8.5 8.6 8.7 8.8 8.9]
[9. 9.1 9.2 9.3 9.4 9.5 9.6 9.7 9.8 9.9]]
"""
print( test.get( arr ) )
"""
[[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ]
[ 5.1 5.1 5.1 5.1 5.1 5.1 5.1 5.1 5.1 5.1]
[10.2 10.2 10.2 10.2 10.2 10.2 10.2 10.2 10.2 10.2]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ]
[47.1 47.1 47.1 47.1 47.1 47.1 47.1 47.1 47.1 47.1]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ]
[-6. -6. -6. -6. -6. -6. -6. -6. -6. -6. ]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ]]
"""
This could be amended to raise an exception if any of the arr elements aren't in the key list. For me returning a default would be more useful.
I wonder if there is a way to set up a NumPy linspace function with multiple num parameters so that I can create sequence of evenly spaced values with different intervals without any for loop operations.
To illustrate a bit more my issue, I have the following np.array for which I want to subdivide 3 segments represented by their 2 respective vertices on the x,y,z axis:
*************************
3D SEGMENTS TO DISCRETIZE
*************************
SegmentToDiscretize = np.array([[[150.149, 167.483, 4.2 ],[160.149, 167.483, 4.2 ]],
                                [[148.594, 163.634, 25.8 ],[180.547, 170.667, 25.8 ]],
                                [[180.547, 170.667, 25.8 ],[200.547, 190.667, 25.8 ]]])
And the following code, which adds equidistant points between each pair of vertices:
******************************
EQUIDISTANT POINTS COMPUTATION
******************************
nbsubdiv = 10
addedpoint = np.linspace(SegmentToDiscretize[:,0],SegmentToDiscretize[:,1],nbsubdiv, dtype = float)
Thanks to the argument nbsubdiv, I can specify how many subdivisions I want.
But I would like to specify a different subdivision value for each of the 3 segments/rows contained in my SegmentToDiscretize np.array:
[[[150.149, 167.483, 4.2 ],[160.149, 167.483, 4.2 ]], <-- nbsubdiv = 4
[[148.594, 163.634, 25.8 ],[180.547, 170.667, 25.8 ]], <-- nbsubdiv = 30
[[180.547, 170.667, 25.8 ],[200.547, 190.667, 25.8 ]]] <-- nbsubdiv = 10
I tried to turn my nbsubdiv parameter into a list, but without success...
nbsubdiv = [4,30,10]
addedpoint = np.linspace(SegmentToDiscretize[:,0],SegmentToDiscretize[:,1],nbsubdiv[0], dtype = float)
With the above code, I obtain :
[[148.594 163.634 4.2 ]
[150.149 165.97833333 4.2 ]
[153.48233333 167.483 4.2 ]
[156.81566667 167.483 4.2 ]
[159.245 167.483 25.8 ]
[160.149 167.483 25.8 ]
[169.896 168.32266667 25.8 ]
[180.547 170.667 25.8 ]
[180.547 170.667 25.8 ]
[187.21366667 177.33366667 25.8 ]
[193.88033333 184.00033333 25.8 ]
[200.547 190.667 25.8 ]]
This is expected, since nbsubdiv[0] takes the first element of the list. But I did not succeed in finding a way to use each value in this list in turn without a for loop.
So I would be delighted if anyone could help me solve this challenge.
Thanks in advance
Warm regards,
Hervé
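np.linspace accepts only a single num, so one possible loop-free route is to build the interpolation parameter for all segments at once with np.repeat. The following is only a sketch under that assumption; the function name linspace_varying is made up, and the result is a flat (sum(counts), 3) array rather than one array per segment:

```python
import numpy as np

def linspace_varying(starts, stops, counts):
    """Evenly spaced points from starts[k] to stops[k], counts[k] points per segment."""
    starts = np.asarray(starts, dtype=float)
    stops = np.asarray(stops, dtype=float)
    counts = np.asarray(counts)
    seg = np.repeat(np.arange(len(counts)), counts)   # segment id of each output point
    first = np.cumsum(counts) - counts                # output index where each segment starts
    local = np.arange(counts.sum()) - first[seg]      # 0 .. count-1 within each segment
    t = local / np.maximum(counts - 1, 1)[seg]        # 0.0 .. 1.0 along each segment
    return starts[seg] + t[:, None] * (stops[seg] - starts[seg])

SegmentToDiscretize = np.array([[[150.149, 167.483, 4.2], [160.149, 167.483, 4.2]],
                                [[148.594, 163.634, 25.8], [180.547, 170.667, 25.8]],
                                [[180.547, 170.667, 25.8], [200.547, 190.667, 25.8]]])
nbsubdiv = [4, 30, 10]
points = linspace_varying(SegmentToDiscretize[:, 0], SegmentToDiscretize[:, 1], nbsubdiv)
print(points.shape)   # one row per generated point, all segments stacked
```

np.split(points, np.cumsum(nbsubdiv)[:-1]) would cut the stacked result back into one array per segment if that layout is more convenient.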
I have a list and I want to find shortest sublist with sum greater than 50.
For example my list is
[8.4 , 10.3 , 12.9 , 8.2 , 13.7 , 11.2 , 11.3 ,10.4 , 4.2 , 3.3 , 4.0 , 2.1]
and I want to find shortest sublist so that its sum is more than 50.
Output Should be like [12.9 , 13.7 , 11.2 , 11.3, 10.4]
This is a poor solution (it does not search all candidates to find the optimum), but the result is correct:
lis =[8.4 , 10.3 , 12.9 , 8.2 , 13.7 , 11.2 , 11.3 ,10.4 , 4.2 , 3.3 , 4.0 , 2.1]
from collections import defaultdict
dic = defaultdict(list)
for i in range(len(lis)):
    dic[lis[i]]+=[i]
tmp_lis = lis.copy()
tmp_lis.sort(reverse=True)
res =[]
for i in tmp_lis:
    if sum(res)>50 :
        break
    else:
        res.append(i)
res1 = [(i,dic[i]) for i in res]
res1.sort(key=lambda x:x[1])
solution =[i[0] for i in res1]
output
[12.9, 13.7, 11.2, 11.3, 10.4]
O(n) solution for list of positive numbers
Provided your list cannot contain negative numbers, then there is a linear solution using two-pointers traversal.
Track the sum between both pointers. Increment the right pointer whenever the sum does not exceed 50 and increment the left one otherwise.
This yields every candidate pair of pointers whose enclosed sum exceeds the bound. It suffices to use min to get the smallest interval out of those.
Due to the behaviour of min, this will return the left-most sublist with minimal length if more than one solution exists.
Code
def intervals_generator(lst, bound):
    i, j = 0, 0
    sum_ = 0
    while True:
        try:
            if sum_ <= bound:
                sum_ += lst[j]
                j += 1
            else:
                yield i, j
                sum_ -= lst[i]
                i += 1
        except IndexError:
            break
def smallest_sub_list(lst, bound):
    i, j = min(intervals_generator(lst, bound), key=lambda x: x[1] - x[0])
    return lst[i:j]
Examples
lst = [8.4 , 10.3 , 12.9 , 8.2 , 13.7 , 11.2 , 11.3 ,10.4 , 4.2 , 3.3 , 4.0 , 2.1]
print(smallest_sub_list(lst, 50)) # [8.4, 10.3, 12.9, 8.2, 13.7]
lst = [0, 10, 45, 55]
print(smallest_sub_list(lst, 50)) # [55]
Solution for general list of numbers
If the list can contain negative numbers then the above will not work and I believe there exists no solution more efficient than to iterate over all possible sublists.
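Iterating over all contiguous sublists can be sketched with prefix sums so that each candidate sum costs constant time. This is a brute-force illustration under the assumption that "sublist" means a contiguous slice; the function name is a placeholder:

```python
from itertools import accumulate

def shortest_sublist_brute_force(lst, bound):
    """Shortest contiguous slice whose sum exceeds bound, or None if there is none."""
    # prefix[k] is the sum of the first k elements, so
    # sum(lst[a:b]) == prefix[b] - prefix[a].
    prefix = [0] + list(accumulate(lst))
    n = len(lst)
    for length in range(1, n + 1):           # try the shortest lengths first
        for start in range(n - length + 1):
            if prefix[start + length] - prefix[start] > bound:
                return lst[start:start + length]
    return None

print(shortest_sublist_brute_force([30, -5, 30], 50))
print(shortest_sublist_brute_force([1, 2], 50))
```

Because it checks lengths in increasing order, the first hit is a shortest sublist, and negative values need no special handling.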
Sort it in descending order and sum the first elements until you exceed 50.0.
myList = [8.4 , 10.3 , 12.9 , 8.2 , 13.7 , 11.2 , 11.3 ,10.4 , 4.2 , 3.3 , 4.0 , 2.1]
mySublist = []
for i in sorted(myList, reverse=True):
    mySublist.append(i)
    if sum(mySublist) > 50.0:
        break
print(mySublist) # [13.7, 12.9, 11.3, 11.2, 10.4]
This assumes that what you want is the sublist that is smallest in size, not the one with the smallest sum.
If you are searching for any shortest sublist, this can be a solution (maybe to be optimized):
lst = [8.4 , 10.3 , 12.9 , 8.2 , 13.7 , 11.2 , 11.3 , 10.4 , 4.2 , 3.3 , 4.0 , 2.1]
def find_sub(lst, limit=50):
    for l in range(1, len(lst)+1):
        for i in range(len(lst)-l+1):
            sub = lst[i:i+l]
            if sum(sub) > limit:
                return sub
>>> print(find_sub(lst))
Output:
[8.4, 10.3, 12.9, 8.2, 13.7]
I have a dataframe that initially contains two columns: Home, which is 1 if a game was played at home, else 0, and PTS, which records the number of points a player scored in a given game. I want to end up with a third column, a rolling metric that represents how sensitive a player is to playing at home. I'll calculate this as follows:
Home Sensitivity = (Average PTS Home - Average PTS Away)/Average PTS
I did this successfully in the following code, but it felt cumbersome, as I created many columns I didn't need in the end. How can I solve this problem more directly?
import pandas as pd
testDF=pd.DataFrame({'Home':[1,0,1,0,1,0,1,0], 'PTS':[11, 10, 12, 11, 13, 12, 14, 12]})
df=testDF.copy()
df.loc[testDF['Home'] == 1, 'Home PTS'] = df['PTS']
df.loc[testDF['Home'] == 0, 'Away PTS'] = df['PTS']
df['Home PTS'] = df['Home PTS'].fillna(0)
df['Away PTS'] = df['Away PTS'].fillna(0)
df['Home Sum'] = df['Home PTS'].expanding(min_periods=1).sum()
df['Away Sum'] = df['Away PTS'].expanding(min_periods=1).sum()
df['Home Count']=df['Home'].expanding().sum()
df['Index']=df.index+1
df['Away Count']=df['Index']-df['Home Count']
df['Home Average']=df['Home Sum']/df['Home Count']
df['Away Average']=df['Away Sum']/df['Away Count']
df['Average']=df['PTS'].expanding().mean()
df['Metric']=(df['Home Average']-df['Away Average'])/df['Average']
Here is a naive way to do it: take increasingly larger slices of the DataFrame in a loop, do the math on each slice and store the result in a list, then assign the list to a new column of the DataFrame (using your testDF). Note that df[:i] excludes row i, so each value only covers the games before that row; use df[:i+1] to include the current game.
df = testDF
sens = []
for i in range(len(df)):
    d = df[:i]
    mean_pts = d.PTS.mean()
    home = d[d.Home == 1].PTS.mean()
    away = d[d.Home == 0].PTS.mean()
    #print(home, away, (home - away) / mean_pts)
    sens.append((home - away) / mean_pts)
df['sens'] = sens
>>> df
Home PTS sens
0 1 11 NaN
1 0 10 NaN
2 1 12 0.095238
3 0 11 0.136364
4 1 13 0.090909
5 0 12 0.131579
6 1 14 0.086957
7 0 12 0.126506
Using DataFrame.expanding(): Not quite there yet ...
>>> mean_pts = df.PTS.expanding(1).mean()
>>> away = df[df['Home'] == 0].PTS.expanding(1).mean()
>>> home = df[df['Home'] == 1].PTS.expanding(1).mean()
>>>
>>> home
0 11.0
2 11.5
4 12.0
6 12.5
Name: PTS, dtype: float64
>>> away
1 10.00
3 10.50
5 11.00
7 11.25
Name: PTS, dtype: float64
>>> mean_pts
0 11.000000
1 10.500000
2 11.000000
3 11.000000
4 11.400000
5 11.500000
6 11.857143
7 11.875000
Name: PTS, dtype: float64
>>>
To do the math will require more manipulation.
You cannot get the difference between home and away directly because the indices are different - but you can do ...
>>> home.values - away.values
array([ 1. , 1. , 1. , 1.25])
>>>
Also home and away only have four rows and mean_pts has eight.
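One possible way around the mismatched indices is to mask PTS with Series.where instead of filtering rows, so all three expanding means keep the full eight-row index. This is a sketch, not a tested replacement for the code above; the intermediate names are placeholders:

```python
import pandas as pd

df = pd.DataFrame({'Home': [1, 0, 1, 0, 1, 0, 1, 0],
                   'PTS': [11, 10, 12, 11, 13, 12, 14, 12]})

is_home = df['Home'] == 1
# where() keeps the index and puts NaN in the rows that do not match;
# the expanding mean skips those NaN values.
home_avg = df['PTS'].where(is_home).expanding().mean()
away_avg = df['PTS'].where(~is_home).expanding().mean()
overall_avg = df['PTS'].expanding().mean()

metric = (home_avg - away_avg) / overall_avg
print(metric)
```

The first row is NaN because no away game has been played yet; unlike the loop with df[:i], each row here includes the current game.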
I tried .expanding(1).apply() with the following function and didn't get what I expected: expanding doesn't pass both columns to the function; it appears to pass one column and then the other, so I punted...
def f(thing):
    print(thing, '***')
    return thing.mean()
>>> df.expanding(1).apply(f)
[ 1.] ***
[ 1. 0.] ***
[ 1. 0. 1.] ***
[ 1. 0. 1. 0.] ***
[ 1. 0. 1. 0. 1.] ***
[ 1. 0. 1. 0. 1. 0.] ***
[ 1. 0. 1. 0. 1. 0. 1.] ***
[ 1. 0. 1. 0. 1. 0. 1. 0.] ***
[ 11.] ***
[ 11. 10.] ***
[ 11. 10. 12.] ***
[ 11. 10. 12. 11.] ***
[ 11. 10. 12. 11. 13.] ***
[ 11. 10. 12. 11. 13. 12.] ***
[ 11. 10. 12. 11. 13. 12. 14.] ***
[ 11. 10. 12. 11. 13. 12. 14. 12.] ***