Python/Pandas - String Comparisons

I have a list of strings/narratives that I need to compare, getting a distance measure between each pair of strings. My current code works, but for larger lists it takes a long time because it uses two nested for loops. I use the Levenshtein distance to measure the distance between strings.
The list of strings/narratives is stored in a dataframe.
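def edit_distance(s1, s2):
    # Dynamic-programming Levenshtein distance; tbl[i, j] is the distance
    # between the first i characters of s1 and the first j characters of s2.
    m = len(s1) + 1
    n = len(s2) + 1
    tbl = {}
    for i in range(m): tbl[i, 0] = i
    for j in range(n): tbl[0, j] = j
    for i in range(1, m):
        for j in range(1, n):
            cost = 0 if s1[i-1] == s2[j-1] else 1
            tbl[i, j] = min(tbl[i, j-1]+1, tbl[i-1, j]+1, tbl[i-1, j-1]+cost)
    # Read the final cell explicitly rather than relying on leftover loop variables.
    return tbl[m-1, n-1]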
import time
import numpy as np

def narrative_feature_extraction(df):
    startTime = time.time()
    leven_matrix = np.zeros((len(df['Narrative']), len(df['Narrative'])))
    for i in range(len(df['Narrative'])):
        for j in range(len(df['Narrative'])):
            leven_matrix[i][j] = edit_distance(df['Narrative'].iloc[i], df['Narrative'].iloc[j])
    endTime = time.time()
    total = (endTime - startTime)
    print("Feature Extraction (Leven) Runtime:" + str(total))
    return leven_matrix

X = narrative_feature_extraction(df)
If the list has n narratives, the resulting X is an n x n matrix whose rows are the narratives and whose columns are the narratives they are compared to. For example, entry (i, j) is the Levenshtein distance between narrative i and narrative j.
Is there a way to optimize this code so that it doesn't need so many for loops? Or is there a more Pythonic way of calculating this?

It's hard to give exact code without data or examples, but here are a few suggestions:
Use a list comprehension; it is usually faster than an explicit for ... in range ... loop.
Depending on your version of pandas, chained "df[i][j]" indexing can be very slow; use .iloc or .loc instead. If you need to mix labels and positions, use .iloc[df.index.get_loc("itemname"), df.columns.get_loc("itemname")] to convert labels to positions. (I think it is only slow when you get warnings about writing to a DataFrame slice, and it depends a lot on your Python/pandas version, but I have not tested this extensively.)
Better yet, run all the calculations first and then put the results into a DataFrame in one go, depending on your use case.
If you like how for loops read, at least try to avoid "in range" and iterate over the values directly, for example "for j in X[:,0]". I find this faster in most cases, and you can combine it with enumerate to keep the index values (example below).
Examples/timings:
import timeit
import numpy as np
import pandas as pd

def test1(): #list comprehension
    X=np.random.normal(size=(100,2))
    results=[[x*y for x in X[:,0]] for y in X[:,1]]
    df=pd.DataFrame(data=np.array(results))

def test2(): #enumerate, df at end
    X=np.random.normal(size=(100,2))
    results=np.zeros((100,100))
    for ind,i in enumerate(X[:,0]):
        for col,j in enumerate(X[:,1]):
            results[ind,col]=i*j
    df=pd.DataFrame(data=results)

def test3(): #in range, but df at end
    X=np.random.normal(size=(100,2))
    results=np.zeros((100,100))
    for i in range(len(X)):
        for j in range(len(X)):
            results[i,j]=X[i,0]*X[j,1]
    df=pd.DataFrame(data=results)

def test4(): #current method
    X=np.random.normal(size=(100,2))
    df=pd.DataFrame(data=np.zeros((100,100)))
    for i in range(len(X)):
        for j in range(len(X)):
            df[i][j]=(X[i,0]*X[j,1])

if __name__ == '__main__':
    # Time each variant with the same settings as before (10 runs each).
    for name in ("test1", "test2", "test3", "test4"):
        print(name+": "+str(timeit.timeit(name+"()", setup="from __main__ import "+name, number=10)))
output:
test1: 0.0492231889643
test2: 0.0587620022106
test3: 0.123777403419
test4: 12.6396287782
So the list comprehension is ~250 times faster than the current method (test4), and enumerate is about twice as fast as "for x in range" (test2 vs. test3). The real slowdown, though, is indexing individual elements of your DataFrame; even with .loc or .iloc this will still be your bottleneck, so I suggest working with arrays outside of the DataFrame if possible.
I hope this helps and that you can apply it to your case. I'd also recommend reading up on the map, filter and reduce functions (and maybe enumerate), as they are quite quick and might help you: http://book.pythontips.com/en/latest/map_filter.html
I am not really familiar with your use case, but I don't see a reason why this type of code tuning wouldn't be applicable to it.
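Applied to the original question, one further saving comes from the fact that the edit distance is symmetric and the diagonal is always zero, so only about half of the pairs need to be computed. The following is a minimal sketch of that idea, not the asker's code: the function name pairwise_distances is made up for illustration, and edit_distance is rewritten here with two rolling rows instead of a dictionary.

```python
from itertools import combinations

import numpy as np


def edit_distance(s1, s2):
    """Levenshtein distance using two rolling rows instead of a full table."""
    if len(s1) < len(s2):
        s1, s2 = s2, s1  # keep the inner row as short as possible
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(current[j - 1] + 1,       # insertion
                               previous[j] + 1,          # deletion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def pairwise_distances(narratives):
    """Fill an n x n distance matrix, computing each unordered pair once."""
    n = len(narratives)
    matrix = np.zeros((n, n))
    for i, j in combinations(range(n), 2):
        d = edit_distance(narratives[i], narratives[j])
        matrix[i, j] = d
        matrix[j, i] = d  # distance is symmetric
    return matrix
```

With the DataFrame from the question this would be called as X = pairwise_distances(df['Narrative'].tolist()), so the strings are pulled out of the DataFrame once rather than indexed inside the loop.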

Related

Python - Big For Loop

I'm running a very big for loop and I'll try to explain how it works. There are 4320 matrices (40x80 each) that have been taken from a MATLAB file.
The loop takes one matrix at a time and assigns the right values of H and T to each of its entries. Once finished, it moves on to the next matrix, and so on.
The resulting DataFrame is then written to a CSV file, which is needed to build a database of wave energy converter productivity.
The problem is that this code has been running for 9 days and is only halfway through the computation. Is there any way to drastically reduce the computation time?
indice_4 = 0
configuration_id=-1
n_configurations=4320
for z in range(0,n_configurations,1): #iteration on all the configurations
    print(z)
    power_matrix=P_mat[z]
    energy_wave_period_converted = pd.DataFrame([],columns=['energy_wave_period'])
    H_start=0.25
    H_end=10
    H_step=0.25
    T_start=3
    T_end=17
    T_step=0.177
    y=T_start
    relative_direction = int(direc[z])
    if relative_direction==0:
        configuration_id = configuration_id + 1
    print(configuration_id)
    r=0 #r=row
    c=0 #c=column
    while y <= T_end:
        energy_wave_period= float('%.2f'%y)
        x=H_start #initialize on the right wave heights
        r=0
        while x <= H_end:
            significant_wave_height= float('%.2f'%x)
            average_power=float('%.2f'%power_matrix[r,c])
            new_line_4 = pd.Series([indice_4 , configuration_id, significant_wave_height , energy_wave_period ,relative_direction ,average_power] , index =['id','configuration_id','significant_wave_height','energy_wave_period','relative_direction','average_output_power'])
            seastate_productivity = seastate_productivity.append([new_line_4], ignore_index=True)
            indice_4= indice_4 + 1
            r=r+1
            x=x+H_step
        c=c+1
        y = y + T_step
seastate_productivity.to_csv('seastate_productivity.csv',index=False,sep=';')

One of the main things slowing your code down is that you do pandas operations inside the iteration. Specifically, calling pd.Series and DataFrame.append in the loop (which runs more than 12 million times) really slows you down. When using pandas you should aim to vectorize your operations, meaning that you perform them in batches. When I tried your original code, every iteration of the outer loop took about 4 seconds, and the time increased gradually. After removing the DataFrame.append call every iteration only took 0.5 seconds, and after removing pd.Series it dropped even more.
I made some improvements by collecting the data in lists and converting them to a DataFrame in one go at the end, which took about 2 minutes to run to completion on my laptop:
import time
import numpy as np
import pandas as pd

# Generate random data for testing
P_mat = np.random.rand(4320,40,80)
direc=np.random.rand(4320)
H_start=0.25
H_end=10
H_step=0.25
T_start=3
T_end=17
T_step=0.177
indice_4 = 0
configuration_id=-1
n_configurations=4320
data = []

# Time it
t0 = time.perf_counter()
for z in range(n_configurations):
    power_matrix=P_mat[z]
    print(z)
    y=T_start
    relative_direction = int(direc[z])
    if relative_direction==0:
        configuration_id = configuration_id + 1
    r=0 #r=row
    c=0 #c=column
    while y <= T_end:
        energy_wave_period= float('%.2f'%y)
        x=H_start #initialize on the right wave heights
        r=0
        while x <= H_end:
            significant_wave_height= float('%.2f'%x)
            average_power=float('%.2f'%power_matrix[r,c])
            # Save data to list
            new_line_4 = [indice_4 , configuration_id, significant_wave_height , energy_wave_period ,relative_direction ,average_power]
            data.append(new_line_4) # Append to create a list of lists
            indice_4= indice_4 + 1
            r=r+1
            x=x+H_step
        c=c+1
        y = y + T_step

# Make dataframe from list of lists
seastate_productivity = pd.DataFrame.from_records(data,columns =['id','configuration_id','significant_wave_height','energy_wave_period','relative_direction','average_output_power'])
# Save data
seastate_productivity.to_csv('seastate_productivity.csv',index=False,sep=';')
# Print time it took
print("Done in:",time.perf_counter()-t0)
You could probably optimize this solution further by moving the rounding out of the loop and rounding the pandas columns instead. Also, since you are only moving data around, there is probably a completely vectorized solution (without a loop) as well, but this is probably sufficient for you.
A way to find out what the issue is with slow code is by timing portions of code. You can use the timeit module, or the time module like I used. You can then isolate lines of code, and run them and analyse the performance.
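For reference, a fully vectorized version might look like the sketch below. It is an illustration under stated assumptions, not the answerer's code: the function name build_table is made up, the H and T grids are passed in as arrays (for the question's ranges that would be roughly 0.25 * np.arange(1, 41) and 3 + 0.177 * np.arange(80)), and rounding is done once with DataFrame.round instead of per-value string formatting. The row order follows the loops above: configuration first, then period (column c), then wave height (row r).

```python
import numpy as np
import pandas as pd


def build_table(P_mat, direc, H_vals, T_vals):
    """Build the long-format table without Python-level loops.

    P_mat has shape (n_conf, len(H_vals), len(T_vals)).
    """
    n_conf, n_H, n_T = P_mat.shape
    per_conf = n_H * n_T
    rel_dir = np.asarray(direc).astype(int)
    # configuration_id starts at -1 and increases whenever the direction is 0
    conf_id = np.cumsum(rel_dir == 0) - 1
    table = pd.DataFrame({
        'id': np.arange(n_conf * per_conf),
        'configuration_id': np.repeat(conf_id, per_conf),
        # wave height varies fastest, then period, then configuration
        'significant_wave_height': np.tile(H_vals, n_conf * n_T),
        'energy_wave_period': np.tile(np.repeat(T_vals, n_H), n_conf),
        'relative_direction': np.repeat(rel_dir, per_conf),
        # transpose so that each column of a matrix is read top to bottom
        'average_output_power': P_mat.transpose(0, 2, 1).reshape(-1),
    })
    return table.round(2)
```

The result can be written out with to_csv exactly as in the loop-based version.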

You should consider using numpy. Using numpy's matrix operations you should be able to reduce computation time.

I suggest you also look into concurrent.futures.
It lets you run tasks in parallel and so reduce the run time.
You need to turn your code into a function and then hand it to an executor, which calls it on one element at a time.
The concurrent.futures module provides a high-level interface for asynchronously executing callables.
The asynchronous execution can be performed with threads, using ThreadPoolExecutor, or separate processes, using ProcessPoolExecutor.
https://docs.python.org/3/library/concurrent.futures.html
Here is a textbook example:
import concurrent.futures

nums = range(10)

def f(x):
    return x * x

def main():
    # Sequential version
    print([val for val in map(f, nums)])
    # Parallel version using separate processes
    with concurrent.futures.ProcessPoolExecutor() as executor:
        print([val for val in executor.map(f, nums)])

if __name__ == '__main__':
    main()

what is the Error in int object iteration?

What causes the error "'int' object is not subscriptable" in this code?
import math
import os
import random
import re
import sys

# Complete the hourglassSum function below.
def hourglassSum(arr):
    sum1=0
    result=0
    for i in range(4):
        for j in range(4):
            sum1=arr[i][j]+arr[i+1][j]+arr[i+2][j]+arr[i+1][j+1]+arr[i][j+2]+arr[i+1][j+2]+arr[i+2[j+2]]
            if sum1>result:
                result=sum1
    return result

if __name__ == '__main__':
    fptr = open(os.environ['OUTPUT_PATH'], 'w')
    arr = []
    for _ in range(6):
        arr.append(list(map(int, input().rstrip().split())))
    result = hourglassSum(arr)
    fptr.write(str(result) + '\n')
    fptr.close()

The very last part of this long line:
sum1=arr[i][j]+arr[i+1][j]+arr[i+2][j]+arr[i+1][j+1]+arr[i][j+2]+arr[i+1][j+2]+arr[i+2[j+2]]
(this part here):
arr[i+2[j+2]]
Is an error; you seem to be trying to refer to 2[j+2]. Clearly the integer 2 is not an array, so Python complains to you that it makes no sense to index an integer.
You probably want that last term to be:
arr[i+2][j+2]
Looking more closely at the long line, it seems like what you are trying to accomplish is obtain the sum of the elements in a 3x3 section of arr. But even the long line is missing some of the combinations. Rather than risk typing the list of addition problems incorrectly (because there are so many), use a set of nested loops to build up the sum of the 3x3 segment.
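If the exercise is the usual "hourglass" one (an assumption based only on the function name), the shape to sum is a top row of three cells, the single cell below its middle, and a bottom row of three cells, which is not quite what the long line adds up. A minimal sketch under that assumption, with the hypothetical name hourglass_sum, could look like this. It also starts from no result at all instead of 0, so that grids containing only negative numbers are handled.

```python
def hourglass_sum(arr):
    """Return the largest hourglass sum in a rectangular grid (list of lists).

    An hourglass at (i, j) is assumed to be:
        arr[i][j]   arr[i][j+1]   arr[i][j+2]
                    arr[i+1][j+1]
        arr[i+2][j] arr[i+2][j+1] arr[i+2][j+2]
    """
    best = None
    for i in range(len(arr) - 2):
        for j in range(len(arr[0]) - 2):
            total = (sum(arr[i][j:j + 3])
                     + arr[i + 1][j + 1]
                     + sum(arr[i + 2][j:j + 3]))
            if best is None or total > best:
                best = total
    return best
```

Slicing each row with arr[i][j:j + 3] avoids typing out every index by hand, which is where the original typo crept in.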

Interpreting Hamming Distance speed in python

I've been working on making my Python more Pythonic and toying with the runtimes of short snippets of code. My goal is to improve readability, but also to speed up execution.
This example conflicts with the best practices I've been reading about, and I'm interested in finding where the flaw in my thought process is.
The problem is to compute the hamming distance on two equal length strings. For example the hamming distance of strings 'aaab' and 'aaaa' is 1.
The most straightforward implementation I could think of is as follows:
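# Assumed imports (not shown in the original post); Python 2
import itertools as i
import operator

def hamming_distance_1(s_1, s_2):
    dist = 0
    for x in range(len(s_1)):
        if s_1[x] != s_2[x]: dist += 1
    return dist
Next I wrote two "pythonic" implementations:
def hamming_distance_2(s_1, s_2):
    # Note: countOf(a, b) is 1 when two single characters are equal,
    # so this sum counts matching positions rather than mismatches.
    return sum(i.imap(operator.countOf, s_1, s_2))
and
def hamming_distance_3(s_1, s_2):
    return sum(i.imap(lambda s: int(s[0]!=s[1]), i.izip(s_1, s_2)))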
In execution:
s_1 = (''.join(random.choice('ABCDEFG') for i in range(10000)))
s_2 = (''.join(random.choice('ABCDEFG') for i in range(10000)))
print 'ham_1 ', timeit.timeit('hamming_distance_1(s_1, s_2)', "from __main__ import s_1,s_2, hamming_distance_1",number=1000)
print 'ham_2 ', timeit.timeit('hamming_distance_2(s_1, s_2)', "from __main__ import s_1,s_2, hamming_distance_2",number=1000)
print 'ham_3 ', timeit.timeit('hamming_distance_3(s_1, s_2)', "from __main__ import s_1,s_2, hamming_distance_3",number=1000)
returning:
ham_1 1.84980392456
ham_2 3.26420593262
ham_3 3.98718094826
I expected that ham_3 would run slower than ham_2, because calling a lambda is treated as a function call, which is slower than calling the built-in operator.countOf.
However, I was surprised that I couldn't find a way to get a more Pythonic version to run faster than ham_1. I have trouble believing that ham_1 is the lower bound for pure Python.
Thoughts anyone?

The key is making fewer method lookups and function calls:
def hamming_distance_4(s_1, s_2):
    # a and b are used as loop names so they don't shadow the itertools alias i
    return sum(a != b for a, b in i.izip(s_1, s_2))
This runs at ham_4 1.10134792328 on my system.
ham_2 and ham_3 make lookups inside the loop, so they are slower.
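The snippets in this thread are Python 2 (itertools.imap and itertools.izip do not exist in Python 3, where the built-in map and zip are already lazy). A minimal Python 3 sketch of the same idea follows. The name hamming_distance and the length check are additions for illustration, and no timing claim is made for it.

```python
import operator


def hamming_distance(s_1, s_2):
    """Count the positions at which two equal-length sequences differ."""
    if len(s_1) != len(s_2):
        raise ValueError("sequences must have equal length")
    # operator.ne is applied pairwise by map; True counts as 1 in sum()
    return sum(map(operator.ne, s_1, s_2))
```

Passing operator.ne directly to map keeps the per-element work inside built-ins, which is the same principle as reducing lookups and function calls in the loop.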

I wonder if this might be a bit more Pythonic, in some broader sense. What if you use http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.hamming.html ... a module that already implements what you're looking for?

Iterate over two big arrays at once

I have to iterate over two arrays of size 1000x1000. I have already reduced the resolution to 100x100 to make the iteration faster, but it still takes about 15 minutes for ONE array!
So I tried to iterate over both at the same time, for which I found this:
for index, (x,y) in ndenumerate(izip(x_array,y_array)):
but then I get the error:
ValueError: too many values to unpack
Here is my full Python code. I hope you can help me make this a lot faster, because this is for my master's thesis and in the end I have to run it about 100 times...
area_length=11
d_circle=(area_length-1)/2
xdis_new=xdis.copy()
ydis_new=ydis.copy()
ie,je=xdis_new.shape
while (np.isnan(np.sum(xdis_new))) and (np.isnan(np.sum(ydis_new))):
    xdis_interpolated=xdis_new.copy()
    ydis_interpolated=ydis_new.copy()
    # itx=np.nditer(xdis_new,flags=['multi_index'])
    # for x in itx:
    #     print 'next x and y'
    for index, (x,y) in ndenumerate(izip(xdis_new,ydis_new)):
        if np.isnan(x):
            print 'index',index[0],index[1]
            print 'interpolate'
            # define indices of interpolation area
            i1=index[0]-(area_length-1)/2
            if i1<0:
                i1=0
            i2=index[0]+((area_length+1)/2)
            if i2>ie:
                i2=ie
            j1=index[1]-(area_length-1)/2
            if j1<0:
                j1=0
            j2=index[1]+((area_length+1)/2)
            if j2>je:
                j2=je
            # -->
            print 'i1',i1,'','i2',i2
            print 'j1',j1,'','j2',j2
            area_values=xdis_new[i1:i2,j1:j2]
            print area_values
            b=area_values[~np.isnan(area_values)]
            if len(b)>=((area_length-1)/2)*4:
                xi,yi=meshgrid(arange(len(area_values[0,:])),arange(len(area_values[:,0])))
                weight=zeros((len(area_values[0,:]),len(area_values[:,0])))
                d=zeros((len(area_values[0,:]),len(area_values[:,0])))
                weight_fac=zeros((len(area_values[0,:]),len(area_values[:,0])))
                weighted_area=zeros((len(area_values[0,:]),len(area_values[:,0])))
                d=sqrt((xi-xi[(area_length-1)/2,(area_length-1)/2])*(xi-xi[(area_length-1)/2,(area_length-1)/2])+(yi-yi[(area_length-1)/2,(area_length-1)/2])*(yi-yi[(area_length-1)/2,(area_length-1)/2]))
                weight=1/d
                weight[where(d==0)]=0
                weight[where(d>d_circle)]=0
                weight[where(np.isnan(area_values))]=0
                weight_sum=np.sum(weight.flatten())
                weight_fac=weight/weight_sum
                weighted_area=area_values*weight_fac
                print 'weight'
                print weight_fac
                print 'values'
                print area_values
                print 'weighted'
                print weighted_area
                m=nansum(weighted_area)
                xdis_interpolated[index]=m
                print 'm',m
            else:
                print 'insufficient elements'
        if np.isnan(y):
            print 'index',index[0],index[1]
            print 'interpolate'
            # define indices of interpolation area
            i1=index[0]-(area_length-1)/2
            if i1<0:
                i1=0
            i2=index[0]+((area_length+1)/2)
            if i2>ie:
                i2=ie
            j1=index[1]-(area_length-1)/2
            if j1<0:
                j1=0
            j2=index[1]+((area_length+1)/2)
            if j2>je:
                j2=je
            # -->
            print 'i1',i1,'','i2',i2
            print 'j1',j1,'','j2',j2
            area_values=ydis_new[i1:i2,j1:j2]
            print area_values
            b=area_values[~np.isnan(area_values)]
            if len(b)>=((area_length-1)/2)*4:
                xi,yi=meshgrid(arange(len(area_values[0,:])),arange(len(area_values[:,0])))
                weight=zeros((len(area_values[0,:]),len(area_values[:,0])))
                d=zeros((len(area_values[0,:]),len(area_values[:,0])))
                weight_fac=zeros((len(area_values[0,:]),len(area_values[:,0])))
                weighted_area=zeros((len(area_values[0,:]),len(area_values[:,0])))
                d=sqrt((xi-xi[(area_length-1)/2,(area_length-1)/2])*(xi-xi[(area_length-1)/2,(area_length-1)/2])+(yi-yi[(area_length-1)/2,(area_length-1)/2])*(yi-yi[(area_length-1)/2,(area_length-1)/2]))
                weight=1/d
                weight[where(d==0)]=0
                weight[where(d>d_circle)]=0
                weight[where(np.isnan(area_values))]=0
                weight_sum=np.sum(weight.flatten())
                weight_fac=weight/weight_sum
                weighted_area=area_values*weight_fac
                print 'weight'
                print weight_fac
                print 'values'
                print area_values
                print 'weighted'
                print weighted_area
                m=nansum(weighted_area)
                ydis_interpolated[index]=m
                print 'm',m
            else:
                print 'insufficient elements'
        else:
            print 'no need to interpolate'
    xdis_new=xdis_interpolated
    ydis_new=ydis_interpolated

Some advice:
Profile your code to see what is the slowest part. It may not be the iteration but the computations that need to be done each time.
Reduce function calls as much as possible. Function calls are not for free in Python.
Rewrite the slowest part as a C extension and then call that C function in your Python code (see Extending and Embedding the Python interpreter).
This page has some good advice as well.

You specifically asked about iterating over two arrays in a single loop. Here is a way to do that:
l1 = ["abc", "def", "hi"]
l2 = ["ghi", "jkl", "lst"]
for f,s in zip(l1,l2):
    print("%s : %s" %(f,s))
The above is for Python 3; in Python 2 you can use izip instead of zip to avoid building the full list of pairs.

You may use this as your for loop:
for index, x in ndenumerate((x_array,y_array)):
But it won't help you much, because the work for the two arrays is still done one element after the other, not at the same time.

Profiling is definitely a good start to identify where all the time spent actually goes.
I usually use the cProfile module, as it requires minimal overhead and gives me more than enough information.
import cProfile
import pstats
cProfile.run('main()', "ProfileData.txt", 'tottime')
p = pstats.Stats('ProfileData.txt')
p.sort_stats('cumulative').print_stats(100)
In your example you would have to wrap your code in a main() function to be able to use this snippet at the very end of your file.
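A self-contained variant that keeps the statistics in memory instead of writing a file could look like the sketch below (Python 3). The functions slow_sum, main and profile_to_text are placeholders invented for the example; only the cProfile, pstats and io calls are standard library.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """Stand-in for the expensive part of a real program."""
    total = 0
    for k in range(n):
        total += k * k
    return total


def main():
    return slow_sum(100000)


def profile_to_text(func, limit=10):
    """Run func under the profiler and return the report as a string."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(limit)
    return buffer.getvalue()


report = profile_to_text(main)
```

Printing report then shows one line per function, sorted by cumulative time, which is usually enough to see whether the time goes into the iteration itself or into the per-element computation.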

Comment #1: You don't want to use ndenumerate on the izip iterator, as it will just yield the iterator itself, which isn't what you want.
Comment #2:
i1=index[0]-(area_length-1)/2
if i1<0:
    i1=0
could be simplified to i1 = max(index[0]-(area_length-1)/2, 0), and you could store your (area_length+/-1)/2 values in dedicated variables.
Idea #1: try to iterate over flat versions of the arrays, i.e. with something like
for (i, (x, y)) in enumerate(izip(xdis_new.flat,ydis_new.flat)):
You could get the original indices via divmod(i, xdis_new.shape[-1]), as you should be iterating by rows first.
Idea #2: iterate only over the NaNs, i.e. index your arrays with np.isnan(xdis_new)|np.isnan(ydis_new); that could save you some iterations.
EDIT #1
You probably don't need to initialize d, weight_fac and weighted_area in your loop, as you compute them separately afterwards.
Expressions such as weight[where(d==0)] can be simplified to weight[d==0].
Do you need weight_fac? Can't you just compute weight and then normalize it in place? That should save you some temporary arrays.
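Idea #2 and the clamping from Comment #2 can be combined as in the following sketch. It is written for Python 3 (hence the integer division with //), and the function name nan_windows and its return format are made up for illustration. It only locates the cells that need interpolation and their clipped neighbourhoods; the weighting itself would stay as in the question.

```python
import numpy as np


def nan_windows(xdis, ydis, area_length=11):
    """List ((i, j), (i1, i2, j1, j2)) for every cell that is NaN in either array.

    The window bounds are clipped to the array edges and can be used
    directly as slices, e.g. xdis[i1:i2, j1:j2].
    """
    half = (area_length - 1) // 2
    n_rows, n_cols = xdis.shape
    mask = np.isnan(xdis) | np.isnan(ydis)
    windows = []
    for i, j in np.argwhere(mask):  # visits only the NaN positions
        i1 = max(i - half, 0)
        i2 = min(i + half + 1, n_rows)
        j1 = max(j - half, 0)
        j2 = min(j + half + 1, n_cols)
        windows.append(((int(i), int(j)), (int(i1), int(i2), int(j1), int(j2))))
    return windows
```

If only a small fraction of the cells are NaN, the loop body runs that many times instead of once per array element.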

Numpy/Python performing terribly vs. Matlab

Novice programmer here. I'm writing a program that analyzes the relative spatial locations of points (cells). The program gets boundaries and cell type off an array with the x coordinate in column 1, y coordinate in column 2, and cell type in column 3. It then checks each cell for cell type and appropriate distance from the bounds. If it passes, it then calculates its distance from each other cell in the array and if the distance is within a specified analysis range it adds it to an output array at that distance.
My cell marking program is in wxpython so I was hoping to develop this program in python as well and eventually stick it into the GUI. Unfortunately right now python takes ~20 seconds to run the core loop on my machine while MATLAB can do ~15 loops/second. Since I'm planning on doing 1000 loops (with a randomized comparison condition) on ~30 cases times several exploratory analysis types this is not a trivial difference.
I tried running a profiler and array calls are 1/4 of the time, almost all of the rest is unspecified loop time.
Here is the python code for the main loop:
for basecell in range (0, cellnumber-1):
    if firstcelltype == np.array((cellrecord[basecell,2])):
        xloc=np.array((cellrecord[basecell,0]))
        yloc=np.array((cellrecord[basecell,1]))
        xedgedist=(xbound-xloc)
        yedgedist=(ybound-yloc)
        if xloc>excludedist and xedgedist>excludedist and yloc>excludedist and yedgedist>excludedist:
            for comparecell in range (0, cellnumber-1):
                if secondcelltype==np.array((cellrecord[comparecell,2])):
                    xcomploc=np.array((cellrecord[comparecell,0]))
                    ycomploc=np.array((cellrecord[comparecell,1]))
                    dist=math.sqrt((xcomploc-xloc)**2+(ycomploc-yloc)**2)
                    dist=round(dist)
                    if dist>=1 and dist<=analysisdist:
                        arraytarget=round(dist*analysisdist/intervalnumber)
                        addone=np.array((spatialraw[arraytarget-1]))
                        addone=addone+1
                        targetcell=arraytarget-1
                        np.put(spatialraw,[targetcell,targetcell],addone)
Here is the matlab code for the main loop:
for basecell = 1:cellnumber;
    if firstcelltype==cellrecord(basecell,3);
        xloc=cellrecord(basecell,1);
        yloc=cellrecord(basecell,2);
        xedgedist=(xbound-xloc);
        yedgedist=(ybound-yloc);
        if (xloc>excludedist) && (yloc>excludedist) && (xedgedist>excludedist) && (yedgedist>excludedist);
            for comparecell = 1:cellnumber;
                if secondcelltype==cellrecord(comparecell,3);
                    xcomploc=cellrecord(comparecell,1);
                    ycomploc=cellrecord(comparecell,2);
                    dist=sqrt((xcomploc-xloc)^2+(ycomploc-yloc)^2);
                    if (dist>=1) && (dist<=100.4999);
                        arraytarget=round(dist*analysisdist/intervalnumber);
                        spatialsum(1,arraytarget)=spatialsum(1,arraytarget)+1;
                    end
                end
            end
        end
    end
end
Thanks!

Here are some ways to speed up your python code.
First: Don't make np arrays when you are only storing one value. You do this many times over in your code. For instance,
if firstcelltype == np.array((cellrecord[basecell,2])):
can just be
if firstcelltype == cellrecord[basecell,2]:
I'll show you why with some timeit statements:
>>> timeit.Timer('x = 111.1').timeit()
0.045882196294822819
>>> timeit.Timer('x = np.array(111.1)','import numpy as np').timeit()
0.55774970267830071
That's an order of magnitude in difference between those calls.
Second: The following code:
arraytarget=round(dist*analysisdist/intervalnumber)
addone=np.array((spatialraw[arraytarget-1]))
addone=addone+1
targetcell=arraytarget-1
np.put(spatialraw,[targetcell,targetcell],addone)
can be replaced with
arraytarget=round(dist*analysisdist/intervalnumber)-1
spatialraw[arraytarget] += 1
Third: You can get rid of the sqrt as Philip mentioned by squaring analysisdist beforehand. However, since you use analysisdist to get arraytarget, you might want to create a separate variable, analysisdist2 that is the square of analysisdist and use that for your comparison.
Fourth: You are looking for cells that match secondcelltype every time you get to that point rather than finding those one time and using the list over and over again. You could define an array:
comparecells = np.where(cellrecord[:,2]==secondcelltype)[0]
and then replace
for comparecell in range (0, cellnumber-1):
    if secondcelltype==np.array((cellrecord[comparecell,2])):
with
for comparecell in comparecells:
Fifth: Use psyco. It is a JIT compiler. Matlab has a built-in JIT compiler if you're using a somewhat recent version. This should speed-up your code a bit.
Sixth: If the code still isn't fast enough after all previous steps, then you should try vectorizing your code. It shouldn't be too difficult. Basically, the more stuff you can have in numpy arrays the better. Here's my try at vectorizing:
basecells = np.where(cellrecord[:,2]==firstcelltype)[0]
xlocs = cellrecord[basecells, 0]
ylocs = cellrecord[basecells, 1]
xedgedists = xbound - xlocs
yedgedists = ybound - ylocs
# whichcells holds positions within xlocs/ylocs, not row numbers of cellrecord
whichcells = np.where((xlocs>excludedist) & (xedgedists>excludedist) & (ylocs>excludedist) & (yedgedists>excludedist))[0]
comparecells = np.where(cellrecord[:,2]==secondcelltype)[0]
xcomplocs = cellrecord[comparecells,0]
ycomplocs = cellrecord[comparecells,1]
for k in whichcells:
    # one vectorized sqrt per base cell; the true distance is needed for the bin index
    dists = np.round(np.sqrt((xcomplocs-xlocs[k])**2 + (ycomplocs-ylocs[k])**2))
    inrange = (dists >= 1) & (dists <= analysisdist)
    arraytargets = np.round(dists[inrange]*analysisdist/intervalnumber).astype(int) - 1
    for target in arraytargets:
        spatialraw[target] += 1
You can probably take out that inner for loop, but you have to be careful because some of the elements of arraytargets could be the same. Also, I didn't actually try out all of the code, so there could be a bug or typo in there. Hopefully, it gives you a good idea of how to do this. Oh, one more thing: you could make analysisdist/intervalnumber a separate variable to avoid doing that division over and over again.
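One way to remove the inner loop while still counting repeated targets correctly is np.bincount, which tallies how often each integer occurs. The sketch below goes one step further and uses broadcasting to compute all base-to-compare distances at once. It is an untested-against-the-real-data illustration: the function name pair_distance_histogram and the n_bins parameter are assumptions, and it presumes that analysisdist/intervalnumber is large enough that no bin index becomes negative.

```python
import numpy as np


def pair_distance_histogram(cellrecord, firstcelltype, secondcelltype,
                            xbound, ybound, excludedist,
                            analysisdist, intervalnumber, n_bins):
    """Count base/compare cell pairs per distance bin without Python loops."""
    x, y, kind = cellrecord[:, 0], cellrecord[:, 1], cellrecord[:, 2]
    # base cells: right type and far enough from every boundary
    base = ((kind == firstcelltype)
            & (x > excludedist) & (xbound - x > excludedist)
            & (y > excludedist) & (ybound - y > excludedist))
    comp = kind == secondcelltype
    # (n_base, n_comp) matrices of coordinate differences via broadcasting
    dx = x[base][:, None] - x[comp][None, :]
    dy = y[base][:, None] - y[comp][None, :]
    dists = np.round(np.sqrt(dx * dx + dy * dy))
    in_range = (dists >= 1) & (dists <= analysisdist)
    targets = np.round(dists[in_range] * analysisdist / intervalnumber).astype(int) - 1
    # bincount handles repeated targets, unlike spatialraw[targets] += 1
    return np.bincount(targets, minlength=n_bins)
```

The distance matrix needs memory proportional to the number of base cells times the number of compare cells, so for very large inputs the per-base-cell loop above may still be the better trade-off.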

I'm not too sure about the slowness of Python, but your MATLAB code can be HIGHLY optimized. Nested for-loops tend to perform badly. You can replace the inner loop with vectorized operations, as below:
for basecell = 1:cellnumber;
    if firstcelltype==cellrecord(basecell,3);
        xloc=cellrecord(basecell,1);
        yloc=cellrecord(basecell,2);
        xedgedist=(xbound-xloc);
        yedgedist=(ybound-yloc);
        if (xloc>excludedist) && (yloc>excludedist) && (xedgedist>excludedist) && (yedgedist>excludedist);
            % for comparecell = 1:cellnumber;
            %     if secondcelltype==cellrecord(comparecell,3);
            %         xcomploc=cellrecord(comparecell,1);
            %         ycomploc=cellrecord(comparecell,2);
            %         dist=sqrt((xcomploc-xloc)^2+(ycomploc-yloc)^2);
            %         if (dist>=1) && (dist<=100.4999);
            %             arraytarget=round(dist*analysisdist/intervalnumber);
            %             spatialsum(1,arraytarget)=spatialsum(1,arraytarget)+1;
            %         end
            %     end
            % end
            %replace with:
            secondcelltype_mask = secondcelltype == cellrecord(:,3);
            xcomploc_vec = cellrecord(secondcelltype_mask ,1);
            ycomploc_vec = cellrecord(secondcelltype_mask ,2);
            % .^ squares element-wise; ^ would attempt a matrix power
            dist_vec = sqrt((xcomploc_vec-xloc).^2+(ycomploc_vec-yloc).^2);
            dist_mask = dist_vec>=1 & dist_vec<=100.4999;
            arraytarget_vec = round(dist_vec(dist_mask)*analysisdist/intervalnumber);
            % spatialsum is indexed as a row vector in the question
            count = accumarray(arraytarget_vec,1, [size(spatialsum,2),1]);
            spatialsum(1,:) = spatialsum(1,:)+count';
        end
    end
end
There may be some small errors in there since I don't have any data to test the code with, but it should give a ~10X speed-up of the MATLAB code.
From my experience with numpy, I've noticed that swapping out for-loops for vectorized/matrix-based arithmetic gives noticeable speed-ups as well. However, without the shapes of all of your variables it's hard to vectorize things.

You can avoid some of the math.sqrt calls by replacing the lines
dist=math.sqrt((xcomploc-xloc)**2+(ycomploc-yloc)**2)
dist=round(dist)
if dist>=1 and dist<=analysisdist:
    arraytarget=round(dist*analysisdist/intervalnumber)
with
dist=(xcomploc-xloc)**2+(ycomploc-yloc)**2
dist=round(dist)
if dist>=1 and dist<=analysisdist_squared:
    arraytarget=round(math.sqrt(dist)*analysisdist/intervalnumber)
where you have the line
analysisdist_squared = analysisdist * analysisdist
outside of the main loop of your function.
Since math.sqrt is called in the innermost loop, you should have from math import sqrt at the top of the module and just call the function as sqrt, which saves an attribute lookup on every call.
I would also try replacing
dist=(xcomploc-xloc)**2+(ycomploc-yloc)**2
with
dist=(xcomploc-xloc)*(xcomploc-xloc)+(ycomploc-yloc)*(ycomploc-yloc)
There's a chance it will produce faster byte code to do multiplication rather than exponentiation.
I doubt these will get you all the way to MATLABs performance, but they should help reduce some overhead.

If you have a multicore machine, you could give the multiprocessing module a try and use multiple processes to make use of all the cores.
Instead of sqrt you could use x**0.5, which is, if I remember correctly, slightly faster.
