How to decrease the run time of a covariance function in Python

I have written a function that computes a covariance matrix. The output I am getting is correct, but the problem is that the code takes too much time on high-dimensional datasets.
Could you please help me modify the code below so it produces the output in less time?
def cov_variance(norm_data, mean_of_mat):
    col = len(norm_data)     # number of observations (rows of the data set)
    row = len(norm_data[0])  # number of variables (columns of the data set)
    out = []
    i = 0
    sum_of_covar = 0
    freezrow = 0
    flag = 1
    while flag <= len(mean_of_mat):          # one pass per variable
        for r in range(row):
            for c in range(col):             # accumulate over all observations
                sum_of_covar += ((norm_data[c][freezrow]) - mean_of_mat[freezrow]) * \
                                ((norm_data[c][r]) - mean_of_mat[i])
            out.append(sum_of_covar)
            i += 1
            sum_of_covar = 0
        flag += 1
        freezrow += 1
        i = 0
    out1 = map(lambda x: x/col-1, out)
    cov_variance_output = reshape(out1, row)
    return cov_variance_output

Like doctorlove already said, don't implement your own. It will almost certainly be slower and/or less versatile (speaking from my own experience).
I tried commenting with this info but my rep is too low. You can find information on calculating covariance matrices with numpy here: https://docs.scipy.org/doc/numpy/reference/generated/numpy.cov.html
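A minimal sketch of what that looks like, assuming norm_data is laid out as in the question (one observation per row, one variable per column):
import numpy as np

data = np.asarray(norm_data)

# rowvar=False tells numpy that each column is a variable and each row an
# observation; np.cov also subtracts the column means internally, so the
# separate mean_of_mat step is not needed
cov_matrix = np.cov(data, rowvar=False)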

Related

How can I refactor my Python code to decrease the time complexity?

This code takes 9 seconds, which is a very long time. I guess the problem is the two nested loops in my code:
for symptom in symptoms:
    # check if the symptom is mentioned in the user text
    norm_symptom = symptom.replace("_", " ")
    for combin in list_of_combinations:
        print(getSimilarity([combin, norm_symptom]))
        if getSimilarity([combin, norm_symptom]) > 0.25:
            if symptom not in extracted_symptoms:
                extracted_symptoms.append(symptom)
I tried to use zip like this:
for symptom, combin in zip(symptoms, list_of_combinations):
    norm_symptom = symptom.replace("_", " ")
    if (getSimilarity([combin, norm_symptom]) > 0.25 and symptom not in extracted_symptoms):
        extracted_symptoms.append(symptom)
Indeed, your algorithm is slow because of the two nested loops.
It performs in O(N*M) (see more here https://www.freecodecamp.org/news/big-o-notation-why-it-matters-and-why-it-doesnt-1674cfa8a23c/),
N being the length of symptoms
and M being the length of list_of_combinations.
What can also take time is the getSimilarity computation; what is this operation?
Use a dict to store the results of getSimilarity for each combination and symptom. That way you avoid calling getSimilarity multiple times for the same combination and symptom, which makes it more efficient, and thus faster.
import collections

similarity_results = collections.defaultdict(dict)
for symptom in symptoms:
    norm_symptom = symptom.replace("_", " ")
    for combin in list_of_combinations:
        # Check if the similarity has already been computed
        if combin in similarity_results[symptom]:
            similarity = similarity_results[symptom][combin]
        else:
            similarity = getSimilarity([combin, norm_symptom])
            similarity_results[symptom][combin] = similarity
        if similarity > 0.25:
            if symptom not in extracted_symptoms:
                extracted_symptoms.append(symptom)
Update:
Alternatively, you could use an algorithm based on the Levenshtein distance, which is a measure of the minimum number of single-character edits (insertions, deletions, or substitutions) required to transform one string into another. The Python-Levenshtein library does that.
import Levenshtein

def getSimilarity(s1, s2):
    distance = Levenshtein.distance(s1, s2)
    return 1.0 - (distance / max(len(s1), len(s2)))

extracted_symptoms = []
for symptom, combin in zip(symptoms, list_of_combinations):
    norm_symptom = symptom.replace("_", " ")
    if (getSimilarity(combin, norm_symptom) > 0.25) and (symptom not in extracted_symptoms):
        extracted_symptoms.append(symptom)

Parallelising a big graph computation in Python

I'm working on graphs and a big dataset of complex networks. I run the SIR algorithm on them with the ndlib library, but each iteration takes about 1 second, which makes the code take 10-12 hours to complete.
I was wondering, is there any way to parallelise it?
The code is shown below; this line is the core:
sir = model.infected_SIR_MODEL(it, infectionList, False)
Is there any simple method to make it run multi-threaded or in parallel?
count = 500
for i in numpy.arange(1, count, 1):
    for it in model.get_nodes():
        sir = model.infected_SIR_MODEL(it, infectionList, False)
Each iteration does the following:
for u in self.graph.nodes():
    u_status = self.status[u]
    eventp = np.random.random_sample()
    neighbors = self.graph.neighbors(u)
    if isinstance(self.graph, nx.DiGraph):
        neighbors = self.graph.predecessors(u)
    if u_status == 0:
        infected_neighbors = len([v for v in neighbors if self.status[v] == 1])
        if eventp < self.BetaList[u] * infected_neighbors:
            actual_status[u] = 1
    elif u_status == 1:
        if eventp < self.params['model']['gamma']:
            actual_status[u] = 2
So, if the iterations are independent, then I don't see the point of iterating over count=500. Either way, the multiprocessing library might be of interest to you.
I've prepared two stub solutions (i.e. alter them to your exact needs).
The first expects that every input is static (as far as I understand the OP's question, the differences between runs arise from the random state generated inside each iteration). With the second, you can update the input data between iterations of i. I've not tried the code as I don't have the model, so it might not work directly.
import multiprocessing as mp

# if everything is independent (e.g. "infectionList" is static and does not change during the iterations)
def worker(model, infectionList):
    sirs = []
    for it in model.get_nodes():
        sir = model.infected_SIR_MODEL(it, infectionList, False)
        sirs.append(sir)
    return sirs

count = 500
infectionList = []
model = "YOUR MODEL INSTANCE"
data = [(model, infectionList) for _ in range(1, count+1)]

with mp.Pool() as pool:
    results = pool.starmap(worker, data)
The second proposed solution is for when "infectionList" or something else gets updated in each iteration of "i":
def worker2(model, it, infectionList):
    sir = model.infected_SIR_MODEL(it, infectionList, False)
    return sir

with mp.Pool() as pool:
    for i in range(1, count+1):
        data = [(model, it, infectionList) for it in model.get_nodes()]
        results = pool.starmap(worker2, data)
        # process results, update something, go to next iteration...
Edit: Updated the answer to separate proposals more clearly.
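One practical note, not from the original answer: under the spawn start method (the default on Windows and recent macOS), the multiprocessing pool must be created inside an if __name__ == "__main__": guard, otherwise each worker process re-imports the module and tries to start a pool of its own. Reusing worker and data from the first snippet:
import multiprocessing as mp

if __name__ == "__main__":
    # create the pool only in the main process, not in re-imported workers
    with mp.Pool() as pool:
        results = pool.starmap(worker, data)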

Smart indexing using numpy

So, this is more of a structural problem, but I think it's looking fairly ugly at the moment. I have code looking like this:
for i in range(length_of_tree):
    potential_ways = np.zeros((M, 2))
    for m in range(omega):
        for s in range(Z):
            potential_ways[m][s] = sum([quad[r][m][s] for r in range(reps)])
The code is currently working, but I've noticed that there are several ways to avoid for-loops using numpy. My question is therefore: is there a way for me to make this code a bit more minimalistic?
A sum over values in an array can always be changed into an inner product, which is optimised in numpy. As has been suggested here, I don't really understand the context of your question without examples, but you should be able to do something like the following:
import numpy as np

np.random.seed(1)

# your examples
M = 2
length_of_tree, reps = 100, 100
omega, Z = 2, 2

# a random matrix of values of shape 100,2,2
quad = np.random.normal(0, 1, size=(100, 2, 2))

# useful initializations
quadT = quad.T
dummy = np.ones(shape=(100,))

for i in range(length_of_tree):
    # option 1
    potential_ways = np.zeros((M, 2))
    for m in range(omega):
        for s in range(Z):
            potential_ways[m][s] = sum([quad[r][m][s] for r in range(reps)])

    # option 2
    potential_ways = quadT.dot(dummy).T
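As an aside, not part of the original answer: the dot product with a vector of ones simply sums quad over its first axis, so the same reduction can also be written directly as
    # option 3 (equivalent to option 2): sum over the repetitions axis
    potential_ways = quad.sum(axis=0)
which avoids the quadT and dummy helpers altogether.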

Speeding up this code snippet

I am rewriting some code from Matlab to Python.
The Matlab code looks like this:
validIntSize = length(prices1)-125; %valid interval size
interval720s = zeros(validIntSize,120+1);
interval360s = zeros(validIntSize,60+1);
interval180s = zeros(validIntSize,30+1);
for i = 1:intJump:validIntSize
    interval180s(i,:) = [prices1(i:i+29),priceDiff(i+29)];
    interval360s(i,:) = [prices1(i:i+59),priceDiff(i+59)];
    interval720s(i,:) = [prices1(i:i+119),priceDiff(i+119)];
end
In a nutshell, it populates the rows of these huge matrices. I do this through append in Python, and it takes way too long:
intervals = []
for mean in self.means:
    intervals.append([])
for i in range(stop_point):
    for idx, mean in enumerate(self.means):
        intervals[idx].append(list(prices1[i:(i+mean)]) + [price_diff[i+mean]])
I suspect that this is due to the append method that I am using?
P.S. It takes about 2 seconds in Matlab and about 2 minutes in Python.
P.P.S. Changing my code to a list comprehension, as suggested in the comments:
for idx, k_mean in enumerate(self.k_means):
    intervals[idx] = [list(prices1[i:(i+k_mean)]) + [price_diff[i+k_mean]] for i in range(stop_point)]
does not speed up the code. Takes virtually the same amount of time.
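For what it's worth, a minimal sketch of the usual fix: mirror the Matlab preallocation and build each interval matrix in one vectorised step with numpy fancy indexing instead of appending row by row. This assumes prices1 and price_diff are 1-D numpy arrays and that every window is wanted, as in the Python loop above (the toy data at the bottom is only there to make the sketch runnable):
import numpy as np

def build_intervals(prices1, price_diff, mean, stop_point):
    # row i holds prices1[i:i+mean] followed by price_diff[i+mean],
    # matching one row of the Matlab interval matrices
    idx = np.arange(stop_point)[:, None] + np.arange(mean)
    windows = prices1[idx]                                     # shape (stop_point, mean)
    tails = price_diff[np.arange(stop_point) + mean][:, None]  # shape (stop_point, 1)
    return np.hstack([windows, tails])

# toy data purely for illustration; in the question these come from self.means etc.
prices1 = np.arange(200, dtype=float)
price_diff = np.diff(prices1, prepend=prices1[0])
means = [30, 60, 120]
stop_point = len(prices1) - 125

intervals = [build_intervals(prices1, price_diff, m, stop_point) for m in means]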

update of weights in a neural network

I was trying to program the perceptron learning rule for an AND example. The unit has inputs x0, x1, x2 with x0 = 1, and the rule for updating the weights is theta_j <- theta_j + learnrate * (target - output) * x_j. I have made the following program in Python:
import math

def main():
    theta = [-0.8, 0.5, 0.5]
    learnrate = 0.1
    target = [0, 0, 0, 1]
    output = [0, 0, 0, 0]
    x = [[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]]
    for i in range(0, len(x)):
        output[i] = evaluate(theta, x[i])
    for j in range(0, 100):
        update(theta, x, learnrate, target, output)

def evaluate(theta, x):
    r = theta[0]*x[0] + theta[1]*x[1] + theta[2]*x[2]
    r = 1/(1+math.exp(-r))
    return r

def update(theta, x, n, target, output):
    for i in range(0, len(x)):
        for j in range(0, len(x[i])):
            delta = n*(target[i]-output[i])*x[i][j]
            theta[j] = theta[j]+delta
        print theta
        r = evaluate(theta, x[i])
        print r
        print "\n"

if __name__ == "__main__":
    main()
The problem occurs when I run the program, for the first set of theta values:
theta=[-0.8,0.5,0.5]
I got the values:
[-7.869649929246505, 0.7436243430418894, 0.7436243430418894]
0.000382022127989
[-7.912205677565339, 0.7436243430418894, 0.7010685947230553]
0.000737772440166
[-7.954761425884173, 0.7010685947230553, 0.7010685947230553]
0.000707056388635
[-7.90974482561542, 0.7460851949918075, 0.7460851949918075]
0.00162995036457
The bracketed terms are the updated theta values, while the other values are the results of the evaluation. In this case my results should be very close to 1 for the last case and close to 0 for the others, but this is not happening.
When I use these values:
theta=[-30,20,20]
the outputs neatly approach 1 for the last data set, and 0 for the others:
[-30.00044943890137, 20.0, 20.0]
9.35341823401e-14
[-30.000453978688242, 20.0, 19.99999546021313]
4.53770586567e-05
[-30.000458518475114, 19.99999546021313, 19.99999546021313]
4.53768526644e-05
[-30.000453978688242, 20.0, 20.0]
0.999954581518
and even when I try with another set:
theta=[-5,20,20]
my results are not as good as the previous ones:
[-24.86692245237865, 10.100003028432075, 10.100003028432075]
1.5864734081e-11
[-24.966922421788425, 10.100003028432075, 10.000003059022298]
3.16190904073e-07
[-25.0669223911982, 10.000003059022298, 10.000003059022298]
2.86101378609e-07
[-25.0669223911982, 10.000003059022298, 10.000003059022298]
0.00626235903
Am I missing some part, or is there something wrong with this implementation? I know that there is another algorithm that uses derivatives, but I would like to implement this naive case.
Thanks
The problem is that you are not recomputing the output after the weights change, so the error signal remains constant and the weights change in the same way on every iteration.
Change the code as follows:
def update(theta, x, n, target, output):
    for i in range(0, len(x)):
        output[i] = evaluate(theta, x[i])  # This line is added
        for j in range(0, len(x[i])):
            delta = n*(target[i]-output[i])*x[i][j]
            theta[j] = theta[j]+delta
        print theta
        r = evaluate(theta, x[i])
        print r
        print "\n"
and you should find it converges much better.
