I have the following program. It seems that amp and period end up printing as a list of lists (see below), and I am unable to plot them (I want to plot period against amp).
I have tried the methods in How to make a flat list out of list of lists? to combine the output of amp and period so that they are plottable, but nothing worked.
import numpy as np
import matplotlib.pyplot as plt
from scipy.integrate import solve_ivp

def derivatives(t, y, q, F):
    return [y[1], -np.sin(y[0]) - q*y[1] + F*np.sin((2/3)*t)]

t = np.linspace(0.0, 100, 10000)

# initial conditions
theta0 = np.linspace(0.0, np.pi, 100)
q = 0.0  # alpha / (mass*g), resistive term
F = 0.0  # G*np.sin(2*t/3)

for i in range(0, 100):
    sol = solve_ivp(derivatives, (0.0, 100.0), (theta0[i], 0.0), method='RK45', t_eval=t, args=(q, F))
    velocity = sol.y[1]
    time = sol.t
    zero_cross = 0
    value = []
    amp = []
    period = []
    for k in range(len(velocity) - 1):
        if (velocity[k+1] * velocity[k]) < 0:
            zero_cross += 1
            value.append(k)
        else:
            zero_cross += 0
    zero_cross = zero_cross - zero_cross % 2  # makes the total number of zero-crossings even
    if zero_cross != 0:
        amp.append(theta0[i])
        # period calculated using the time evolved between the first and last zero-crossing detected
        period.append((2*(time[value[zero_cross - 1]] - time[value[0]])) / (zero_cross - 1))
If I print out amp inside the loop, it displays as follows:
[0.03173325912716963]
[0.06346651825433926]
[0.0951997773815089]
[0.12693303650867852]
[0.15866629563584814]
[0.1903995547630178]
[0.2221328138901874]
[0.25386607301735703]
[0.28559933214452665]
[0.3173325912716963]
[0.3490658503988659]
[0.3807991095260356]
[0.4125323686532052]
[0.4442656277803748]
[0.47599888690754444]
[0.5077321460347141]
[0.5394654051618837]
[0.5711986642890533]
[0.6029319234162229]
[0.6346651825433925]
[0.6663984416705622]
[0.6981317007977318]
[0.7298649599249014]
[0.7615982190520711]
[0.7933314781792408]
[0.8250647373064104]
[0.85679799643358]
[0.8885312555607496]
[0.9202645146879193]
[0.9519977738150889]
[0.9837310329422585]
[1.0154642920694281]
[1.0471975511965979]
[1.0789308103237674]
[1.110664069450937]
[1.1423973285781066]
[1.1741305877052763]
[1.2058638468324459]
[1.2375971059596156]
[1.269330365086785]
[1.3010636242139548]
[1.3327968833411243]
[1.364530142468294]
[1.3962634015954636]
[1.4279966607226333]
[1.4597299198498028]
[1.4914631789769726]
[1.5231964381041423]
[1.5549296972313118]
[1.5866629563584815]
[1.618396215485651]
[1.6501294746128208]
[1.6818627337399903]
[1.71359599286716]
[1.7453292519943295]
[1.7770625111214993]
[1.8087957702486688]
[1.8405290293758385]
[1.872262288503008]
[1.9039955476301778]
[1.9357288067573473]
[1.967462065884517]
[1.9991953250116865]
[2.0309285841388562]
[2.0626618432660258]
[2.0943951023931957]
[2.126128361520365]
[2.1578616206475347]
[2.1895948797747042]
[2.221328138901874]
[2.2530613980290437]
[2.284794657156213]
[2.3165279162833827]
[2.3482611754105527]
[2.379994434537722]
[2.4117276936648917]
[2.443460952792061]
[2.475194211919231]
[2.5069274710464007]
[2.53866073017357]
[2.57039398930074]
[2.6021272484279097]
[2.633860507555079]
[2.6655937666822487]
[2.6973270258094186]
[2.729060284936588]
[2.7607935440637577]
[2.792526803190927]
[2.824260062318097]
[2.8559933214452666]
[2.887726580572436]
[2.9194598396996057]
[2.9511930988267756]
[2.982926357953945]
[3.0146596170811146]
[3.141592653589793]
I am not sure what type of output that is or how to handle it; any help would be appreciated!
You are declaring the lists inside the loop, which means they are reset to empty at every iteration. Declare amp, period, and any other list that should start out empty only once (as initial state) before the loop, like so:
#initialize arrays, executed only once before the loop
amp = []
period = []
for i in range(0, 100):
    #your logic here, plus appending values to `amp` and `period`
#now `amp` and `period` should contain all desired values
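For completeness, a minimal sketch of what the whole loop could look like with the lists hoisted out, reusing the setup and zero-crossing logic from the question, followed by the requested plot of period against amp:

amp = []
period = []
for i in range(0, 100):
    sol = solve_ivp(derivatives, (0.0, 100.0), (theta0[i], 0.0),
                    method='RK45', t_eval=t, args=(q, F))
    velocity, time = sol.y[1], sol.t
    # indices where the velocity changes sign
    value = [k for k in range(len(velocity) - 1) if velocity[k+1] * velocity[k] < 0]
    zero_cross = len(value) - len(value) % 2  # keep an even number of zero-crossings
    if zero_cross != 0:
        amp.append(theta0[i])
        period.append(2 * (time[value[zero_cross - 1]] - time[value[0]]) / (zero_cross - 1))

plt.plot(amp, period, '.')
plt.xlabel('amplitude')
plt.ylabel('period')
plt.show()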
TL;DR: if I were to explain the problem in short:
I have signals:
np.random.seed(42)
x = np.random.randn(1000)
y = np.random.randn(1000)
z = np.random.randn(1000)
and human-readable string tuple logic like:
entry_sig_ = ((x,y,'crossup',False),)
exit_sig_ = ((x,z,'crossup',False), 'or_',(x,y,'crossdown',False))
where:
'entry_sig_' means the output becomes 1 when the time series unfolds from left to right and 'entry_sig_' is hit. (x,y,'crossup',False) means: x crossed y upward at a particular time i, and False means the signal has no "memory"; otherwise the number of hits accumulates.
'exit_sig_' means the output becomes '0' again when 'exit_sig_' is hit.
The output is generated through:
@njit
def run(x, entry_sig, exit_sig):
    '''
    x: np.array
    entry_sig, exit_sig: homogeneous tuples of tuple signals
    Returns: sequence of 0 and 1 satisfying entry and exit sigs
    '''
    L = x.shape[0]
    out = np.empty(L)
    out[0] = 0.0
    out[-1] = 0.0
    i = 1
    trade = True
    while i < L-1:
        out[i] = 0.0
        if reduce_sig(entry_sig, i) and i < L-1:
            out[i] = 1.0
            trade = True
            while trade and i < L-2:
                i += 1
                out[i] = 1.0
                if reduce_sig(exit_sig, i):
                    trade = False
        i += 1
    return out
reduce_sig(sig,i) is a function (see definition below) that parses the tuple and returns the resulting output for a given point in time.
Question:
As of now, an object of the SingleSig class is instantiated from scratch in the for loop for any given point in time; it therefore has no "memory", which cancels the merit of having a class at all (a bare function would do). Does there exist a workaround (a different class template, a different approach, etc.) so that:
- the combined tuple signal can be queried for its value at a particular point in time i, and
- the "memory" can be reset, i.e. e.g. MultiSig(sig_tuple).memory_field can be set to 0 at the level of the constituent signals?
The following code adds a memory to the signals, which can be wiped using MultiSig.reset() to reset the count of all signals to 0. The memory can be queried using MultiSig.query_memory(key), which returns the number of hits for that signal at that time.
For the memory function to work, I had to add unique keys to the signals to identify them.
from numba import njit, int64, float64, types
from numba.types import Array, string, boolean
from numba import jitclass
import numpy as np
np.random.seed(42)
x = np.random.randn(1000000)
y = np.random.randn(1000000)
z = np.random.randn(1000000)
# Example of "human-readable" signals
entry_sig_ = ((x,y,'crossup',False),)
exit_sig_ = ((x,z,'crossup',False), 'or_',(x,y,'crossdown',False))
# Turn signals into homogeneous tuple
#entry_sig_
entry_sig = (((x,y,'crossup',False),'NOP','1'),)
#exit_sig_
exit_sig = (((x,z,'crossup',False),'or_','2'),((x,y,'crossdown',False),'NOP','3'))
@njit
def cross(x, y, i):
    '''
    x, y: np.array
    i: int - point in time
    Returns: 1 or 0 when condition is met
    '''
    if (x[i - 1] - y[i - 1]) * (x[i] - y[i]) < 0:
        out = 1
    else:
        out = 0
    return out
kv_ty = (types.string, types.int64)

spec = [
    ('memory', types.DictType(*kv_ty)),
]
@njit
def single_signal(x, y, how, acc, i):
    '''
    i: int - point in time
    Returns either signal or accumulator
    '''
    if cross(x, y, i):
        if x[i] < y[i] and how == 'crossdown':
            out = 1
        elif x[i] > y[i] and how == "crossup":
            out = 1
        else:
            out = 0
    else:
        out = 0
    return out
@jitclass(spec)
class MultiSig:
    def __init__(self, entry, exit):
        '''
        initialize memory at single signal level
        '''
        memory_dict = {}
        for i in entry:
            memory_dict[str(i[2])] = 0
        for i in exit:
            memory_dict[str(i[2])] = 0
        self.memory = memory_dict

    def reduce_sig(self, sig, i):
        '''
        Parses multisignal
        sig: homogeneous tuple of tuples ("human-readable" signal definition)
        i: int - point in time
        Returns: resulting value of multisignal
        '''
        L = len(sig)
        out = single_signal(*sig[0][0], i)
        logic = sig[0][1]
        if out:
            self.update_memory(sig[0][2])
        for cnt in range(1, L):
            s = single_signal(*sig[cnt][0], i)
            if s:
                self.update_memory(sig[cnt][2])
            out = out | s if logic == 'or_' else out & s
            logic = sig[cnt][1]
        return out

    def update_memory(self, key):
        '''
        update memory
        '''
        self.memory[str(key)] += 1

    def reset(self):
        '''
        reset memory
        '''
        dicti = {}
        for i in self.memory:
            dicti[i] = 0
        self.memory = dicti

    def query_memory(self, key):
        '''
        return number of hits on signal
        '''
        return self.memory[str(key)]
@njit
def run(x, entry_sig, exit_sig):
    '''
    x: np.array
    entry_sig, exit_sig: homogeneous tuples of tuples
    Returns: sequence of 0 and 1 satisfying entry and exit sigs
    '''
    L = x.shape[0]
    out = np.empty(L)
    out[0] = 0.0
    out[-1] = 0.0
    i = 1
    multi = MultiSig(entry_sig, exit_sig)
    while i < L-1:
        out[i] = 0.0
        if multi.reduce_sig(entry_sig, i) and i < L-1:
            out[i] = 1.0
            trade = True
            while trade and i < L-2:
                i += 1
                out[i] = 1.0
                if multi.reduce_sig(exit_sig, i):
                    trade = False
        i += 1
    return out
run(x, entry_sig, exit_sig)
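Not part of the original answer, but assuming the class above compiles on your numba version, the memory described earlier could be exercised roughly like this (run() builds its own MultiSig internally, so to inspect counters you drive an instance directly):

# hypothetical usage sketch: drive MultiSig by hand to inspect its memory
multi = MultiSig(entry_sig, exit_sig)
for i in range(1, 1000):
    multi.reduce_sig(entry_sig, i)
    multi.reduce_sig(exit_sig, i)

print(multi.query_memory('1'))  # hits recorded so far for the entry signal (key '1')
multi.reset()                   # wipe all counters back to 0
print(multi.query_memory('1'))  # 0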
To reiterate what I said in the comments: | and & are bitwise operators, not logical operators. 1 & 2 evaluates to 0 (False), which is not what I believe you want here, so I made sure out and s can only be 0/1 in order for this to produce the expected output.
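A quick illustration of the difference (not from the original answer):

print(1 & 2)    # 0 -> bitwise AND of 0b01 and 0b10 shares no set bits
print(1 | 2)    # 3 -> bitwise OR sets both bits
print(1 and 2)  # 2 -> logical "and" returns the second operand when the first is truthy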
Are you aware that, because of
out = out | s if logic == 'or_' else out & s
the order of the time series inside entry_sig and exit_sig matters?
Let (output, logic) be tuples where output is 0 or 1 according to how crossup and crossdown would evaluate the passed information of the tuple, and logic is or_ or and_.
tuples = ((0,'or_'),(1,'or_'),(0,'and_'))
out = tuples[0][0]
logic = tuples[0][1]
for i in range(1, len(tuples)):
    s = tuples[i][0]
    out = out | s if logic == 'or_' else out & s
    out = s
    logic = tuples[i][1]
print(out)
0
changing the order of the tuple yields the other signal:
tuples = ((0,'or_'),(0,'and_'),(1,'or_'))
out = tuples[0][0]
logic = tuples[0][1]
for i in range(1, len(tuples)):
    s = tuples[i][0]
    out = out | s if logic == 'or_' else out & s
    out = s
    logic = tuples[i][1]
print(out)
1
The performance hinges on how many times the count needs to be updated. Using n=1,000,000 for all three time series, your code had a mean run-time of 0.6 s on my machine, mine 0.63 s.
I then changed the crossing logic a bit to save on if/else checks, so that the nested if/else is only triggered if the time series actually crossed, which can be checked with a single comparison. This further halved the difference in run-time, so the code above now sits at about 2.5% longer run-time than your original code.
I have two ordered lists of consecutive integers m = 0, 1, ... M and n = 0, 1, 2, ... N. Each value of m has a probability pm, and each value of n has a probability pn. I am trying to find the ordered list of unique values r = m/n and their probabilities pr. I am aware that r is infinite if n = 0 and can even be undefined if m = n = 0.
In practice, I would like to run with M and N each of the order of 2E4, meaning up to 4E8 values of r, which would mean about 3 GB of floats (assuming 8 bytes/float).
For this calculation, I have written the python code below.
The idea is to iterate over m and n, and for each new m/n, insert it in the right place with its probability if it isn't there yet, otherwise add its probability to the existing number. My assumption is that it is easier to sort things on the way instead of waiting until the end.
The cases related to 0 are added at the end of the loop.
I am using the Fraction class since we are dealing with fractions.
The code also tracks the multiplicity of each unique value of m/n.
I have tested up to M=N=100, and things are quite slow. Are there better approaches to the question, or more efficient ways to tackle the code?
Timing:
M=N=30: 1 s
M=N=50: 6 s
M=N=80: 30 s
M=N=100: 82 s
import numpy as np
from fractions import Fraction
import time  # For timing

start_time = time.time()  # Timing

M, N = 6, 4
mList, nList = np.arange(1, M+1), np.arange(1, N+1)  # From 1 to M inclusive, deal with 0 later
mProbList, nProbList = [1/(M+1)]*(M), [1/(N+1)]*(N)  # Probabilities, here assumed equal (not general case)

# Deal with m=0 or n=0 later
pmZero, pnZero = 1/(M+1), 1/(N+1)   # P(m=0) and P(n=0)
pNaN = pmZero * pnZero              # P(0/0) = P(m=0)P(n=0)
pZero = pmZero * (1 - pnZero)       # P(0) = P(m=0)P(n!=0)
pInf = pnZero * (1 - pmZero)        # P(inf) = P(m!=0)P(n=0)

# Main list of r=m/n, P(r) and mult(r)
# Start with first line, m=1
rList = [Fraction(mList[0], n) for n in nList[::-1]]       # Smallest first
rProbList = [mProbList[0] * nP for nP in nProbList[::-1]]  # Start with first line
rMultList = [1] * len(rList)                               # Multiplicity of each element

# Main loop
for m, mP in zip(mList[1:], mProbList[1:]):
    for n, nP in zip(nList[::-1], nProbList[::-1]):  # Pick an n value
        r, rP, rMult = Fraction(m, n), mP*nP, 1
        for i in range(len(rList)-1):  # See where it fits in existing list
            if r < rList[i]:
                rList.insert(i, r)
                rProbList.insert(i, rP)
                rMultList.insert(i, 1)
                break
            elif r == rList[i]:
                rProbList[i] += rP
                rMultList[i] += 1
                break
            elif r < rList[i+1]:
                rList.insert(i+1, r)
                rProbList.insert(i+1, rP)
                rMultList.insert(i+1, 1)
                break
            elif r == rList[i+1]:
                rProbList[i+1] += rP
                rMultList[i+1] += 1
                break
            if r > rList[-1]:
                rList.append(r)
                rProbList.append(rP)
                rMultList.append(1)
                break

# Deal with 0
rList.insert(0, Fraction(0, 1))
rProbList.insert(0, pZero)
rMultList.insert(0, N)
# Deal with infinity
rList.append(np.Inf)
rProbList.append(pInf)
rMultList.append(M)
# Deal with undefined case
rList.append(np.NAN)
rProbList.append(pNaN)
rMultList.append(1)

print(".... done in %s seconds." % round(time.time() - start_time, 2))
print("************** Final list\nr", 'Prob', 'Mult')
for r, rP, rM in zip(rList, rProbList, rMultList): print(r, rP, rM)
print("************** Checks")
print("mList", mList, 'nList', nList)
print("Sum of proba = ", np.sum(rProbList))
print("Sum of multi = ", np.sum(rMultList), "\t(M+1)*(N+1) = ", (M+1)*(N+1))
Based on the suggestion of @Prune, and on this thread about merging lists of tuples, I have modified the code as below. It is a lot easier to read, and runs about an order of magnitude faster for N=M=80 (I have omitted dealing with 0; it would be done the same way as in the original post). I assume there may be ways to tweak the merge and the conversion back to lists further.
# Do calculations
data = [(Fraction(m, n), mProb(m) * nProb(n)) for n in range(1, N+1) for m in range(1, M+1)]
data.sort()

# Merge duplicates using a dictionary
d = {}
for r, p in data:
    if not (r in d): d[r] = [0, 0]
    d[r][0] += p
    d[r][1] += 1

# Convert back to lists
rList, rProbList, rMultList = [], [], []
for k in d:
    rList.append(k)
    rProbList.append(d[k][0])
    rMultList.append(d[k][1])
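One possible tweak along those lines (a sketch, not tested against the original data): itertools.groupby can do the merge and the conversion back to lists in a single pass over the already sorted data:

from itertools import groupby

# data is the sorted list of (Fraction, probability) pairs built above
rList, rProbList, rMultList = [], [], []
for r, group in groupby(data, key=lambda pair: pair[0]):
    probs = [p for _, p in group]
    rList.append(r)
    rProbList.append(sum(probs))
    rMultList.append(len(probs))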
I expect that "things are quite slow" because you've chosen a known inefficient sort. A single list insertion is O(K) (later list elements have to be bumped over, and there is added storage allocation on a regular basis). Thus a full-list insertion sort is O(K^2). For your notation, that is O((M*N)^2).
If you want reasonable performance, research and use the best-known methods. The most straightforward way is to build your non-exception results with a simple list comprehension, and use the built-in sort for the penultimate list. Simply append your n=0 cases, and you're done in O(K log K) time.
In the expression below, I've assumed functions for the m and n probabilities. This is a notational convenience; you know how to compute them directly, and can substitute those expressions if you wish.
data = [(mProb(m) * nProb(n), Fraction(m, n))
        for n in range(1, N+1)
        for m in range(0, M+1)]
data.sort()
data.extend([])  # generate and append your "zero" cases here
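As a hedged sketch of what that placeholder might contain, reusing pInf and pNaN as computed in the question (the m=0 entries are already covered by the comprehension, since m starts at 0):

# assumes pInf = P(n=0)P(m!=0) and pNaN = P(n=0)P(m=0) as in the question
data.extend([
    (pInf, float('inf')),  # n = 0, m != 0 -> r is infinite
    (pNaN, float('nan')),  # n = 0, m = 0  -> r is undefined
])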
I'm trying to get a script to run on each individual column of a CSV file. I've figured out how to tell Python which column I would like to run the script on, but I want it to analyze column one, output the results, then move to column two, and continue on through the file. What I want is an "if etc goto etc" command. I've found how to do this with simple one-liners, but I have a larger script. Any help would be great, as I'm sure I'm just missing something. For example, if I could loop back to where I define my data (h=data) but tell it to choose the next column. Here is my script.
import numpy as np
import matplotlib.pyplot as plt
from pylab import *
import pylab
from scipy import linalg
import sys
import scipy.interpolate as interpolate
import scipy.optimize as optimize

a=raw_input("Data file name? ") #Name of the data file including the directory, must be .csv
datafile = open(a, 'r')
data = []
for row in datafile:
    data.append(row.strip().split(',')) #opening and organizing the csv file
print('Data points= ', len(data))
print data
c=raw_input("Is there a header row? y/n?") #Remove header line if present
if c is ('y'):
    del data[0]
    data2=data
    print('Raw data= ', data2)
else:
    print('Raw data= ', data)

'''
#if I wanted to select a column
b=input("What column to analyze?") #Asks what column depth data is in
if b is 1:
    h=[[rowa[i] for rowa in data] for i in range(1)] #first row
'''

h=data # all columns
g=reduce(lambda x,y: x+y,h) #prepares data for calculations
a=map(float, g)
a.sort()
print ('Organized data= ',a)
def GRLC(values):
    '''
    Calculate Gini index, Gini coefficient, Robin Hood index, and points of
    Lorenz curve based on the instructions given in
    www.peterrosenmai.com/lorenz-curve-graphing-tool-and-gini-coefficient-calculator
    Lorenz curve values are given as lists of x & y points [[x1, x2], [y1, y2]]
    @param values: List of values
    @return: [Gini index, Gini coefficient, Robin Hood index, [Lorenz curve]]
    '''
    n = len(values)
    assert(n > 0), 'Empty list of values'
    sortedValues = sorted(values) #Sort smallest to largest

    #Find cumulative totals
    cumm = [0]
    for i in range(n):
        cumm.append(sum(sortedValues[0:(i + 1)]))

    #Calculate Lorenz points
    LorenzPoints = [[], []]
    sumYs = 0           #Sum of all y values
    robinHoodIdx = -1   #Robin Hood index max(x_i, y_i)
    for i in range(1, n + 2):
        x = 100.0 * (i - 1)/n
        y = 100.0 * (cumm[i - 1]/float(cumm[n]))
        LorenzPoints[0].append(x)
        LorenzPoints[1].append(y)
        sumYs += y
        maxX_Y = x - y
        if maxX_Y > robinHoodIdx: robinHoodIdx = maxX_Y

    giniIdx = 100 + (100 - 2 * sumYs)/n #Gini index

    return [giniIdx, giniIdx/100, robinHoodIdx, LorenzPoints]

result = GRLC(a)
print 'Gini Index', result[0]
print 'Gini Coefficient', result[1]
print 'Robin Hood Index', result[2]
I'm ignoring all of that GRLC function and just solving the looping question. Give this a try. It uses while True: to loop forever (you can break out by ending the program; Ctrl+C on Windows, depends on the OS). Just load the data from the csv once, then on each pass of the loop re-build the variables you need. If you have questions, please ask. Also, I didn't test it as I don't have all the NumPy packages installed :)
import numpy as np
import matplotlib.pyplot as plt
from pylab import *
import pylab
from scipy import linalg
import sys
import scipy.interpolate as interpolate
import scipy.optimize as optimize

def GRLC(values):
    '''
    Calculate Gini index, Gini coefficient, Robin Hood index, and points of
    Lorenz curve based on the instructions given in
    www.peterrosenmai.com/lorenz-curve-graphing-tool-and-gini-coefficient-calculator
    Lorenz curve values are given as lists of x & y points [[x1, x2], [y1, y2]]
    @param values: List of values
    @return: [Gini index, Gini coefficient, Robin Hood index, [Lorenz curve]]
    '''
    n = len(values)
    assert(n > 0), 'Empty list of values'
    sortedValues = sorted(values) #Sort smallest to largest

    #Find cumulative totals
    cumm = [0]
    for i in range(n):
        cumm.append(sum(sortedValues[0:(i + 1)]))

    #Calculate Lorenz points
    LorenzPoints = [[], []]
    sumYs = 0           #Sum of all y values
    robinHoodIdx = -1   #Robin Hood index max(x_i, y_i)
    for i in range(1, n + 2):
        x = 100.0 * (i - 1)/n
        y = 100.0 * (cumm[i - 1]/float(cumm[n]))
        LorenzPoints[0].append(x)
        LorenzPoints[1].append(y)
        sumYs += y
        maxX_Y = x - y
        if maxX_Y > robinHoodIdx: robinHoodIdx = maxX_Y

    giniIdx = 100 + (100 - 2 * sumYs)/n #Gini index

    return [giniIdx, giniIdx/100, robinHoodIdx, LorenzPoints]

#Name of the data file including the directory, must be .csv
a=raw_input("Data file name? ")
datafile = open(a.strip(), 'r')
data = []

#opening and organizing the csv file
for row in datafile:
    data.append(row.strip().split(','))

#Remove header line if present
c=raw_input("Is there a header row? y/n?")
if c.strip().lower() == ('y'):
    del data[0]

while True:
    #if I want the first column, that's index 0.
    b=raw_input("What column to analyze?")
    # Validate that the column input data is correct here. Otherwise it might be out of range, etc.
    # Maybe try this. You might want more smarts in there, depending on your intent:
    b = int(b.strip())
    # If you expect the user to input "2" to mean the second column, you're going to use index 1 (list indexes are 0 based)
    h=[[rowa[b-1] for rowa in data] for i in range(1)]
    # prepares data for calculations
    g=reduce(lambda x,y: x+y,h)
    a=map(float, g)
    a.sort()
    print ('Organized data= ',a)
    result = GRLC(a)
    print 'Gini Index', result[0]
    print 'Gini Coefficient', result[1]
    print 'Robin Hood Index', result[2]
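If the goal from the question is to walk every column automatically instead of prompting each time, the while True block could be swapped for a plain for loop over the column count. A rough sketch (untested, same Python 2 style as above, and it assumes every row has the same number of fields):

num_columns = len(data[0])
for col in range(num_columns):
    column_values = [row[col] for row in data]  # pull one column out of the csv rows
    a = map(float, column_values)
    a.sort()
    result = GRLC(a)
    print 'Column', col + 1
    print 'Gini Index', result[0]
    print 'Gini Coefficient', result[1]
    print 'Robin Hood Index', result[2]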
I have two large vectors (~133000 values each) of different lengths. Each is sorted from small to large values. I want to find values that are similar within a given tolerance. This is my solution, but it is very slow. Is there a way to speed this up?
import numpy as np

for lv in range(np.size(vector1)):
    for lv_2 in range(np.size(vector2)):
        if np.abs(vector1[lv_2]-vector2[lv])<.02:
            print(vector1[lv_2],vector2[lv],lv,lv_2)
            break
Your algorithm is far from optimal. You compare far too many values. Assume you are at a certain position in vector1 and the current value in vector2 is already more than 0.02 bigger. Why compare against the rest of vector2?
Start with something like
pos1 = 0
pos2 = 0
Now compare the values at those positions in your vectors. If the difference is too big, move the position of the smaller one forward and check again. Continue until you reach the end of one vector.
I haven't tested it, but the following should work. The idea is to exploit the fact that the vectors are sorted:
lv_1, lv_2 = 0, 0
while lv_1 < len(vector1) and lv_2 < len(vector2):
    if np.abs(vector1[lv_1] - vector2[lv_2]) < .02:
        print(vector1[lv_1], vector2[lv_2], lv_1, lv_2)
        lv_1 += 1
        lv_2 += 1
    elif vector1[lv_1] < vector2[lv_2]: lv_1 += 1
    else: lv_2 += 1
The following code gives a nice increase in performance that depends upon how dense the numbers are. Using a set of 1000 random numbers, sampled uniformly between 0 and 100, it runs about 30 times faster than your implementation.
pos1_start = 0
for i in range(np.size(vector1)):
    for j in range(pos1_start, np.size(vector2)):
        if np.abs(vector1[i] - vector2[j]) < .02:
            results1 += [(vector1[i], vector2[j], i, j)]
        else:
            if vector2[j] < vector1[i]:
                pos1_start += 1
            else:
                break
The timing:
time new method: 0.112464904785
time old method: 3.59720897675
Which is produced by the following script:
import random
import numpy as np
import time

# initialize the vectors to be compared
vector1 = [random.uniform(0, 40) for i in range(1000)]
vector2 = [random.uniform(0, 40) for i in range(1000)]
vector1.sort()
vector2.sort()

# the arrays that will contain the results for the first method
results1 = []
# the arrays that will contain the results for the second method
results2 = []

pos1_start = 0
t_start = time.time()
for i in range(np.size(vector1)):
    for j in range(pos1_start, np.size(vector2)):
        if np.abs(vector1[i] - vector2[j]) < .02:
            results1 += [(vector1[i], vector2[j], i, j)]
        else:
            if vector2[j] < vector1[i]:
                pos1_start += 1
            else:
                break
t1 = time.time() - t_start
print "time new method:", t1

t = time.time()
for lv1 in range(np.size(vector1)):
    for lv2 in range(np.size(vector2)):
        if np.abs(vector1[lv1]-vector2[lv2])<.02:
            results2 += [(vector1[lv1], vector2[lv2], lv1, lv2)]
t2 = time.time() - t
print "time old method:", t2

# sort the results
results1.sort()
results2.sort()
print np.allclose(results1, results2)
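As a further option not covered in the answers above, a fully vectorized variant using np.searchsorted might be worth trying. A sketch under the same assumptions (both vectors sorted, tolerance 0.02; boundary handling at exactly the tolerance is approximate):

import numpy as np

v1 = np.asarray(vector1)
v2 = np.asarray(vector2)
tol = 0.02

# for each element of v1, locate the slice of v2 that lies within +/- tol
lo = np.searchsorted(v2, v1 - tol, side='left')
hi = np.searchsorted(v2, v1 + tol, side='right')

for i, (l, h) in enumerate(zip(lo, hi)):
    for j in range(l, h):
        print(v1[i], v2[j], i, j)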
data is a matrix containing 2500 time series from a measurement. I need to average each time series over time, discarding data points that were recorded around a spike (in the interval tspike-dt*10 ... tspike+10*dt). The number of spike times varies for each neuron and is stored in a dictionary with 2500 entries. My current code iterates over neurons and spike times and sets the masked values to NaN; then bottleneck.nanmean() is called. However, this code is too slow in the current version, and I am wondering whether there is a faster solution. Thanks!
import bottleneck
import numpy as np
from numpy.random import rand, randint

t = 1
dt = 1e-4
N = 2500
dtbin = 10*dt

data = np.float32(np.ones((N, int(round(t/dt)))))
times = np.arange(0, t, dt)
spiketimes = dict.fromkeys(np.arange(N))
for key in spiketimes:
    spiketimes[key] = rand(randint(100))

means = np.empty(N)
for i in range(N):
    spike_times = spiketimes[i]
    datarow = data[i]
    if len(spike_times) > 0:
        for spike_time in spike_times:
            start = max(spike_time - dtbin, 0)
            end = min(spike_time + dtbin, t)
            idx = np.all([times >= start, times <= end], 0)
            datarow[idx] = np.NaN
    means[i] = bottleneck.nanmean(datarow)
The vast majority of the processing time in your code comes from this line:
idx = np.all([times>=start,times<=end],0)
This is because for each spike, you are comparing every value in times against start and end. Since you have uniform time steps in this example (and I presume this is true in your data as well), it is much faster to simply compute the start and end indexes:
# This replaces the last loop in your example:
for i in range(N):
    spike_times = spiketimes[i]
    datarow = data[i]
    if len(spike_times) > 0:
        for spike_time in spike_times:
            start = max(spike_time - dtbin, 0)
            end = min(spike_time + dtbin, t)
            #idx = np.all([times>=start,times<=end],0)
            #datarow[idx] = np.NaN
            datarow[int(start/dt):int(end/dt)] = np.NaN
    ## replaced this with equivalent for testing
    means[i] = datarow[~np.isnan(datarow)].mean()
This reduces the run time for me from ~100s to ~1.5s.
You can also shave off a bit more time by vectorizing the loop over spike_times. The effect of this will depend on the characteristics of your data (should be most effective for high spike rates):
kernel = np.ones(20, dtype=bool)
for i in range(N):
    spike_times = spiketimes[i]
    datarow = data[i]
    mask = np.zeros(len(datarow), dtype=bool)
    indexes = (spike_times / dt).astype(int)
    mask[indexes] = True
    mask = np.convolve(mask, kernel)[10:-9]
    means[i] = datarow[~mask].mean()
Instead of using nanmean you could just index the values you need and use mean.
means[i] = datarow[(times < start) | (times > end)].mean()
If I misunderstood and you do need your indexing, you might try
means[i] = datarow[np.logical_not(np.all([times >= start, times <= end], 0))].mean()
Also, in the code you probably don't want to use if len(spike_times) > 0 (I assume you remove the spike time at each iteration, or else that statement will always be true and you'll have an infinite loop); just use for spike_time in spike_times.
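Putting that indexing suggestion together for all spikes of one neuron, a rough sketch (not from the original answer; variable names follow the question's loop):

for i in range(N):
    datarow = data[i]
    keep = np.ones(len(datarow), dtype=bool)
    for spike_time in spiketimes[i]:
        start = max(spike_time - dtbin, 0)
        end = min(spike_time + dtbin, t)
        keep &= ~((times >= start) & (times <= end))  # drop samples near this spike
    means[i] = datarow[keep].mean()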