Python - Big For Loop

Python - Big For Loop - python

I'm computing a very big for cycle and i'll try to explain how does it works. There are 4320 matrices (40x80 each) that have been taken from a matlab file.
This loop takes a matrix per time: it assign to each value the right value of H and T. Once finished, it pass to the next matrix and so on.
The dataframe created is then written on a csv file needed for the creation of a database for the wave energy converters productivity.
The problem is that this code is running since 9 days and it is at half on the total computations..Is there any way to drastically reduce the computational time?
indice_4 = 0
configuration_id=-1
n_configurations=4320
for z in range(0,n_configurations,1): #iteration on all the configurations
print(z)
power_matrix=P_mat[z]
energy_wave_period_converted = pd.DataFrame([],columns=['energy_wave_period'])
H_start=0.25
H_end=10
H_step=0.25
T_start=3
T_end=17
T_step=0.177
y=T_start
relative_direction = int(direc[z])
if relative_direction==0:
configuration_id = configuration_id + 1
print(configuration_id)
r=0 #r=row
c=0 #c=column
while y <= T_end:
energy_wave_period= float('%.2f'%y)
x=H_start #initialize on the right wave haights
r=0
while x <= H_end:
significant_wave_height= float('%.2f'%x)
average_power=float('%.2f'%power_matrix[r,c])
new_line_4 = pd.Series([indice_4 , configuration_id, significant_wave_height , energy_wave_period ,relative_direction ,average_power] , index =['id','configuration_id','significant_wave_height','energy_wave_period','relative_direction','average_output_power'])
seastate_productivity = seastate_productivity.append([new_line_4], ignore_index=True)
indice_4= indice_4 + 1
r=r+1
x=x+H_step
c=c+1
y = y + T_step
seastate_productivity.to_csv('seastate_productivity.csv',index=False,sep=';')
'

One of the main things slowing your code down is that you do pandas operations in an iteration. Specifically using pd.Series and pd.DataFrame.append in the loop (which runs for over 12 million times) really slows you down. When using pandas you should really aim to vectorize your operations (meaning performing operations in batch). When I tried your original code every iteration took about 4 seconds, but the time increased gradually. When removing the pd.append every iteration only took 0.5 seconds, and when removing the pd.Series it dropped even more.
I did some improvements by saving the data in lists and later to a dataframe in one go, which took about 2 minutes to run till completion on my laptop:
import time
import numpy as np
import pandas as pd
# Generate random data for testing
P_mat = np.random.rand(4320,40,80)
direc=np.random.rand(4320)
H_start=0.25
H_end=10
H_step=0.25
T_start=3
T_end=17
T_step=0.177
indice_4 = 0
configuration_id=-1
n_configurations=4320
data = []
# Time it
t0 = time.perf_counter()
for z in range(n_configurations):
power_matrix=P_mat[z]
print(z)
y=T_start
relative_direction = int(direc[z])
if relative_direction==0:
configuration_id = configuration_id + 1
r=0 #r=row
c=0 #c=column
while y <= T_end:
energy_wave_period= float('%.2f'%y)
x=H_start #initialize on the right wave haights
r=0
while x <= H_end:
significant_wave_height= float('%.2f'%x)
average_power=float('%.2f'%power_matrix[r,c])
# Save data to list
new_line_4 = [indice_4 , configuration_id, significant_wave_height , energy_wave_period ,relative_direction ,average_power]
data.append(new_line_4) # Append to create a list of lists
indice_4= indice_4 + 1
r=r+1
x=x+H_step
c=c+1
y = y + T_step
# Make dataframe from list of lists
seastate_productivity = pd.DataFrame.from_records(data,columns =['id','configuration_id','significant_wave_height','energy_wave_period','relative_direction','average_output_power'])
# Save data
seastate_productivity.to_csv('seastate_productivity.csv',index=False,sep=';')
# Print time it took
print("Done in:",time.perf_counter()-t0)
You could probably still optimize this solution, by moving the rounding from the loop to outside, by rounding the pandas columns. Also, since you are only moving data around, there is probably also a completely vectorized solution (without a loop) but this is probably sufficient for you.
A way to find out what the issue is with slow code is by timing portions of code. You can use the timeit module, or the time module like I used. You can then isolate lines of code, and run them and analyse the performance.

You should consider using numpy. Using numpy's matrix operations you should be able to reduce computation time.

I suggest you to dig also into concurrent.futures.
It specifically enables to run parallel tasks and reduce run time.
You need to convert your code into a function and then call it into the async func, each element at a time.
The concurrent.futures module provides a high-level interface for asynchronously executing callables.
The asynchronous execution can be performed with threads, using ThreadPoolExecutor, or separate processes, using ProcessPoolExecutor.
https://docs.python.org/3/library/concurrent.futures.html
this is a scolastic example
import concurrent.futures
nums = range(10)
def f(x):
return x * x
def main():
print([val for val in map(f, nums)])
with concurrent.futures.ProcessPoolExecutor() as executor:
print([val for val in executor.map(f, nums)])
if __name__ == '__main__':
main()

Related

Saving continuously generated simulation data with Python3

So my question is how I should save a large amount of simulation data to a file using Python (or update new data rows to the existing file).
Lets say I have NN=1000 particles, and I want to save the position and velocity data of each particle (x y z, vx vy vz). The data is in format [x1,y1,z1,vx1,vy1,vz1, x2,y2,z2,vx2,vy2,vz2, ...] and so on.
Simulation is working well, but I believe the methods I use for saving and keeping these information saved is not really optimal for me.
Pseudo code similar to my code
T_max = 1000 # for example
dt = 0.1 # time step
T = 0 # current time
iterations = int(T_max/dt) # number of iterations we are doing
NN = 1000 # Number of particles
ZZ = np.zeros( (iterations, 2+NN*6 ) ) # Here I generate whole data matrix at the beginning.
# ^ might not be the best idea as the system needs to keep everything in memory for the whole time
# So I guess saving could be done in chunks?
ZZ[0][0], ZZ[0][1] = T , dt
# ZZ[0][2:] = initialize_system(NN=NN) # so lets initialize the system.
# However, for this post I do this differently due to simplicity. See below
ZZ[0][2:] = np.random.uniform(-100,100,NN*6)
i = 0
while i < iteration:
T += dt
Z[i+1][0], Z[i+1][1] = T, dt
#Z[i+1][2:] = rk4(EOM_function, posvel=Z[i][2:])
# ^ Using this I would calculate new positions based on previous ones.
Z[i+1][2:] = np.random.uniform(-100,100,NN*6) #This is just for example here.
i += 1
# Now the simulation data is basically done, so one would need to save
# This one feels slow, as it takes 181s to save and is size of 1046246KB
np.savetxt('test1.txt', ZZ)
#other method with a bit less accuracy as I don't need to have all decimals saved
np.savetxt('test2.txt', ZZ, fmt='%1.6f') # Takes 125s and size is 426698KB
# Both of the above are kinda slow so I also tried to save to npy format
np.save('test.npy', ZZ) # It took 8.9s and size 164118KB
so this np.save() method seems to be fast, but I read somewhere that I can not append data to it. So this would not work if I keep saving the data in parts while calculating new positions.
So back to my question. How should/could I save the data efficiently (fast and memory friendly). I keep having some memory issues when NN and T_max gets larger because with this method I keep this whole ZZ all the time in memory.
So I guess I should calculate ZZ in parts, i.e. iterations/10 parts but then I should append this data to an existing file, and tests I have made felt slow. Any suggestions?
EDIT: feel free to ask more specifying questions as I feel like I forgot to explain something.

That highly depends on what you intend to use the output for. If it's stored for further calculations, .npy or some other binary format is always the way to go as it is faster, takes less space, and doesn't lose precision between loads and saves, instead of serializing it into a human readable format. If you need it to be readable, you might as well just output row by row to a csv file or something.
If you want to do it with binary, h5py allows you to extend a dataset after saving and append more stuff to it.
import numpy as np
import h5py
T_max = 10**4 # for example
dt = 0.1 # time step
T = 0 # current time
iterations = int(T_max/dt) # number of iterations we are doing
NN = 1000 # Number of particles
chunk_size = 10**3
ZZ = np.zeros( (chunk_size, 2+NN*6 ) )
ZZ[0][0], ZZ[0][1] = T , dt
# ZZ[0][2:] = initialize_system(NN=NN) # so lets initialize the system.
# However, for this post I do this differently due to simplicity. See below
ZZ[0][2:] = np.random.uniform(-100,100,NN*6)
with h5py.File("test.h5", "a") as f:
dset = f.create_dataset('ZZ', (0,2+NN*6), maxshape=(None,2+NN*6), dtype='float64', chunks=(chunk_size,2+NN+6))
for chunk in range(0, iterations, chunk_size):
for i in range(0, chunk_size - 1):
T += dt
ZZ[i + 1][0], ZZ[i + 1][1] = T, dt
#Z[i+1][2:] = rk4(EOM_function, posvel=Z[i][2:])
# ^ Using this I would calculate new positions based on previous ones.
ZZ[i + 1][2:] = np.random.uniform(-100,100,NN*6) #This is just for example here.
# Expand the file here to allow for more data.
dset.resize(dset.shape[0] + chunk_size, axis=0)
dset[chunk: chunk + chunk_size ] = ZZ
# update and initialize next chunk. the next chunk's first row should be the last row of the previous chunk + iteration
T += dt
ZZ[0][0], ZZ[0][1] = T, dt
#Z[0][2:] = rk4(EOM_function, posvel=Z[-1][2:])
# ^ Using this I would calculate new positions based on previous ones.
ZZ[0][2:] = np.random.uniform(-100,100,NN*6) #This is just for example here.
print(dset.shape)
This takes 70 seconds on the save step on my computer, generating a 45GB file, for a dataset that is 100 times your original code.
The above code is more general in case you are streaming your data and don't know your final size. If you know it from the start, you can replace the initial create_dataset with
dset = f.create_dataset('ZZ', (iterations,2+NN*6), dtype='float64')
and remove the dset.resize(dset.shape[0] + chunk_size, axis=0)
You'll probably also want to read it back in chunks afterwards for other processing, in which case you can follow the docs here: https://docs.h5py.org/en/latest/high/dataset.html#reading-writing-data

Okay so I'm continuing my question / providing possible answer to it based on the answer of EricChen1248. EDIT: Answer provided by EricChen1248 works now and is way better than this my code part. See his code
I do not yet still understand completely how this f.create_dataset () truly works (i.e. when does it write data to file in the loop etc).
Using the code provided by Eric, it created and saved the data files fastly, but when I read the file as follows
hf = h5py.File('temp/test.h5', 'r')
ZZ = np.array(hf['ZZ'])
hf.close()
and plotted the first column (time T column, which should increase by timestep dt after each iteration) I get the following figure
plt.plot(ZZ[:,0])
time T column plotted
and as can be seen, it grows to a time of 100, and then goes to zero. This happens after the first 'chunk_size' has been passed. I started to read docs provided by Eric, and using his code as reference I managed to write something like this
import numpy as np
import h5py
T_max = 10**4
dt = 0.1
T = 0
NN = 1000
iterations = int(T_max/dt)
chunk_size = 10**3
with h5py.File('temp/data12.h5', 'a') as hf:
dset = hf.create_dataset("ZZ", (chunk_size, 2+NN*6),maxshape=(None,2+NN*6) ,chunks=(chunk_size, 2+NN*6), dtype='f8' )
# ^ first I create data set equals to one chunk_size
# Here I initialize the system. Columns ; 0=T , 1=dt, 2=arbitrary data point, 3=sin(column2)
# all the rest columns are random numbers just to fill some numbers in
dset[0,0], dset[0,1] = T, dt
#dset[0,2:] = np.random.uniform(0,1,NN*6)
dset[0,2] = 1
dset[0,3] = np.sin(dset[0,2])
dset[0,4:] = np.random.uniform(0,1,NN*6 -2)
print('starts')
# Main difference down there is that I use dataset (dset)
# as a data matrix to be filled instead of matrix ZZ as in my question.
i = 0
#for j, s in enumerate(dset.iter_chunks()):
for j, s in enumerate(range(0, iterations, chunk_size )):
print(j, s)
while i < iterations and i < chunk_size*(j+1) -1:
#for i in range(chunk_size*j, chunk_size*(j+1)-1):
T += dt
dset[i+1,0], dset[i+1,1] = T, dt
#dset[i+1,2:] = np.sin(dset[i,2:]+dt)
dset[i+1,2] = dset[i,2] + dt
dset[i+1,3] = np.sin(dset[i,2]+dt)
dset[i+1,4:] = dset[i,4:] + np.random.uniform(-1,1,NN*6-2)
i+=1
print(dset.shape)
dset.resize(dset.shape[0] + chunk_size, axis=0)
This code runs in 1min 50s , and saves a file of size 4.47GB so I am happy with the speed, and what I'm really happy is that it do not use so much memory while iterating (I used to get into problem with huge RAM usage).
When I read the data file provided by my code (similarly as above) I get following image for time Time T column plotted, my code version and it grows nicely to T=10e4 as should be. It still generated one more chunk_size block to the end of dataset which is full of zeros. That I need to get rid of. One more proof that the code works and saves data without weird problems is this sinusoidal plot plt.plot(ZZ[500:1500,0] , ZZ[500:1500,3]). Sinusoidal image proof Note that the plot is limited for T ~ [50,150] so one could still see something there (if plotted the whole thing, one could not see lines well).
I believe this is not the best way to write this code, but it is the way I got this working. So if someone sees improvements, please let me know. Also, I am curious to know why the code provided by Eric did not work, at least for me.
EDIT : fixed typos

Large memory consumption by iPython Parallel module

I am using the ipyparallel module to speed up an all by all list comparison but I am having issues with huge memory consumption.
Here is a simplified version of the script that I am running:
From a SLURM script start the cluster and run the python script
ipcluster start -n 20 --cluster-id="cluster-id-dummy" &
sleep 60
ipython /global/home/users/pierrj/git/python/dummy_ipython_parallel.py
ipcluster stop --cluster-id="cluster-id-dummy"
In python, make two list of lists for the simplified example
import ipyparallel as ipp
from itertools import compress
list1 = [ [i, i, i] for i in range(4000000)]
list2 = [ [i, i, i] for i in range(2000000, 6000000)]
Then define my list comparison function:
def loop(item):
for i in range(len(list2)):
if list2[i][0] == item[0]:
return True
return False
Then connect to my ipython engines, push list2 to each of them and map my function:
rc = ipp.Client(profile='default', cluster_id = "cluster-id-dummy")
dview = rc[:]
dview.block = True
lview = rc.load_balanced_view()
lview.block = True
mydict = dict(list2 = list2)
dview.push(mydict)
trueorfalse = list(lview.map(loop, list1))
As mentioned, I am running this on a cluster using SLURM and getting the memory usage from the sacct command. Here is the memory usage that I am getting for each of the steps:
Just creating the two lists: 1.4 Gb
Creating two lists and pushing them to 20 engines: 22.5 Gb
Everything: 62.5 Gb++ (this is where I get an OUT_OF_MEMORY failure)
From running htop on the node while running the job, it seems that the memory usage is going up slowly over time until it reaches the maximum memory and fails.
I combed through this previous thread and implemented a few of the suggested solutions without success
Memory leak in IPython.parallel module?
I tried clearing the view with each loop:
def loop(item):
lview.results.clear()
for i in range(len(list2)):
if list2[i][0] == item[0]:
return True
return False
I tried purging the client with each loop:
def loop(item):
rc.purge_everything()
for i in range(len(list2)):
if list2[i][0] == item[0]:
return True
return False
And I tried using the --nodb and --sqlitedb flags with ipcontroller and started my cluster like this:
ipcontroller --profile=pierrj --nodb --cluster-id='cluster-id-dummy' &
sleep 60
for (( i = 0 ; i < 20; i++)); do ipengine --profile=pierrj --cluster-id='cluster-id-dummy' & done
sleep 60
ipython /global/home/users/pierrj/git/python/dummy_ipython_parallel.py
ipcluster stop --cluster-id="cluster-id-dummy" --profile=pierrj
Unfortunately none of this has helped and has resulted in the exact same out of memory error.
Any advice or help would be greatly appreciated!

Looking around, there seems to be lots of people complaining about LoadBalancedViews being very memory inefficient, and I have not been able to find any useful suggestions on how to fix this, for example.
However, I suspect given your example that's not the place to start. I assume that your example is a reasonable approximation of your code. If your code is doing list comparisons with several million data points, I would advise you to use something like numpy to perform the calculations rather than iterating in python.
If you restructure your algorithm to use numpy vector operations it will be much, much faster than indexing into a list and performing the calculation in python. numpy is a C library and calculation done within the library will benefit from compile time optimisations. Furthermore, performing operations on arrays also benefits from processor predictive caching (your CPU expects you to use adjacent memory looking forward and preloads it; you potentially lose this benefit if you access the data piecemeal).
I have done a very quick hack of your example to demonstrate this. This example compares your loop calculation with a very naïve numpy implementation of the same question. The python loop method is competitive with small numbers of entries, but it quickly heads towards x100 faster with the number of entries you are dealing with. I suspect looking at the way you structure data will outweigh the performance gain you are getting through parallelisation.
Note that I have chosen a matching value in the middle of the distribution; performance differences will obviously depend on the distribution.
import numpy as np
import time
def loop(item, list2):
for i in range(len(list2)):
if list2[i][0] == item[0]:
return True
return False
def run_comparison(scale):
list2 = [ [i, i, i] for i in range(4 * scale)]
arr2 = np.array([i for i in range(4 * scale)])
test_value = (2 * scale)
np_start = time.perf_counter()
res1 = test_value in arr2
np_end = time.perf_counter()
np_time = np_end - np_start
loop_start = time.perf_counter()
res2 = loop((test_value, 0, 0), list2)
loop_end = time.perf_counter()
loop_time = loop_end - loop_start
assert res1 == res2
return (scale, loop_time / np_time)
print([run_comparison(v) for v in [100, 1000, 10000, 100000, 1000000, 10000000]])
returns:
[
(100, 1.0315526939407524),
(1000, 19.066806587378263),
(10000, 91.16463510672537),
(100000, 83.63064249916434),
(1000000, 114.37531283123414),
(10000000, 121.09979997458508)
]

Assuming that a single task on the two lists is being divided up between the worker threads you will want to ensure that the individual workers are using the same copy of the lists. In most cases is looks like ipython parallel will pickle objects sent to workers (relevant doc). If you are able to use one of the types that are not copied (as stated in doc)
buffers/memoryviews, bytes objects, and numpy arrays.
the memory issue might be resolved since a reference is distributed. This answer also assumes that the individual tasks do not need to operate on the lists while working (thread-safe).
TL;DR It looks like moving the objects passed to the parallel workers into a numpy array may resolve the explosion in memory.

Using Cython to create parallel threads without prange

I have a recursive function that does something similar to the following:
import numpy as np
from copy import copy
shared_data = np.random.randn(6, 5, 3)
def grow(current_data, level):
grown_data = []
if level < shared_data.shape[0] - 1:
nlevel = level + 1
valid = ((shared_data[nlevel] - current_data[-1])**2).sum(axis=-1) < 1
for new_data in shared_data[nlevel, valid]:
continue_data = copy(current_data)
continue_data.append(new_data)
grown_data.extend(grow(continue_data, level+1))
else:
grown_data.append(current_data)
return grown_data
begin_data = np.random.randn(3)
print(grow([begin_data], 0))
I am wondering if there is some way to start a new parallel thread in cython to do the current processing on each entry for the grow function in order to speed this type of recursion up. While the above sample code runs relatively fast, the actual code is slower (a) because it does more than the simple distance calculation included above and (b) because the data it is operating on is more like the size (3000, 10, 3), which even for this simple example is prohibitively slow, at least on my machine.
One thought that I had was to use a list/queue to add recursive jobs to instead of calling them directly, then, on each return from grow, using a prange loop to process the jobs in the list/queue in parallel, but I'm afraid this will result in the recreation of threads all the time and decrease the efficiency.

How to correctly implement apply_async for data processing?

I am new to using parallel processing for data analysis. I have a fairly large array and I want to apply a function to each index of said array.
Here is the code I have so far:
import numpy as np
import statsmodels.api as sm
from statsmodels.regression.quantile_regression import QuantReg
import multiprocessing
from functools import partial
def fit_model(data,q):
#data is a 1-D array holding precipitation values
years = np.arange(1895,2018,1)
res = QuantReg(exog=sm.add_constant(years),endog=data).fit(q=q)
pointEstimate = res.params[1] #output slope of quantile q
return pointEstimate
#precipAll is an array of shape (1405*621,123,12) (longitudes*latitudes,years,months)
#find all indices where there is data
nonNaN = np.where(~np.isnan(precipAll[:,0,0]))[0] #481631 indices
month = 4
#holder array for results
asyncResults = np.zeros((precipAll.shape[0])) * np.nan
def saveResult(result,pos):
asyncResults[pos] = result
if __name__ == '__main__':
pool = multiprocessing.Pool(processes=20) #my server has 24 CPUs
for i in nonNaN:
#use partial so I can also pass the index i so the result is
#stored in the expected position
new_callback_function = partial(saveResult, pos=i)
pool.apply_async(fit_model, args=(precipAll[i,:,month],0.9),callback=new_callback_function)
pool.close()
pool.join()
When I ran this, I stopped it after it took longer than had I not used multiprocessing at all. The function, fit_model, is on the order of 0.02 seconds, so could the overhang associated with apply_async be causing the slowdown? I need to maintain order of the results as I am plotting this data onto a map after this processing is done. Any thoughts on where I need improvement is greatly appreciated!

If you need to use the multiprocessing module, you'll probably want to batch more rows together into each task that you give to the worker pool. However, for what you're doing, I'd suggest trying out Ray due to its efficient handling of large numerical data.
import numpy as np
import statsmodels.api as sm
from statsmodels.regression.quantile_regression import QuantReg
import ray
#ray.remote
def fit_model(precip_all, i, month, q):
data = precip_all[i,:,month]
years = np.arange(1895, 2018, 1)
res = QuantReg(exog=sm.add_constant(years), endog=data).fit(q=q)
pointEstimate = res.params[1]
return pointEstimate
if __name__ == '__main__':
ray.init()
# Create an array and place it in shared memory so that the workers can
# access it (in a read-only fashion) without creating copies.
precip_all = np.zeros((100, 123, 12))
precip_all_id = ray.put(precip_all)
result_ids = []
for i in range(precip_all.shape[0]):
result_ids.append(fit_model.remote(precip_all_id, i, 4, 0.9))
results = np.array(ray.get(result_ids))
Some Notes
The example above runs out of the box, but note that I simplified the logic a bit. In particular, I removed the handling of NaNs.
On my laptop with 4 physical cores, this takes about 4 seconds. If you use 20 cores instead and make the data 9000 times bigger, I'd expect it to take about 7200 seconds, which is quite a long time. One possible approach to speeding this up is to use more machines or to process multiple rows in each call to fit_model in order to amortize some of the overhead.
The above example actually passes the entire precip_all matrix into each task. This is fine because each fit_model task only has read access to a copy of the matrix stored in shared memory and so doesn't need to create its own local copy. The call to ray.put(precip_all) places the array in shared memory once up front.
For about the differences between Ray and Python multiprocessing. Note I'm helping develop Ray.

Numpy/Python performing terribly vs. Matlab

Novice programmer here. I'm writing a program that analyzes the relative spatial locations of points (cells). The program gets boundaries and cell type off an array with the x coordinate in column 1, y coordinate in column 2, and cell type in column 3. It then checks each cell for cell type and appropriate distance from the bounds. If it passes, it then calculates its distance from each other cell in the array and if the distance is within a specified analysis range it adds it to an output array at that distance.
My cell marking program is in wxpython so I was hoping to develop this program in python as well and eventually stick it into the GUI. Unfortunately right now python takes ~20 seconds to run the core loop on my machine while MATLAB can do ~15 loops/second. Since I'm planning on doing 1000 loops (with a randomized comparison condition) on ~30 cases times several exploratory analysis types this is not a trivial difference.
I tried running a profiler and array calls are 1/4 of the time, almost all of the rest is unspecified loop time.
Here is the python code for the main loop:
for basecell in range (0, cellnumber-1):
if firstcelltype == np.array((cellrecord[basecell,2])):
xloc=np.array((cellrecord[basecell,0]))
yloc=np.array((cellrecord[basecell,1]))
xedgedist=(xbound-xloc)
yedgedist=(ybound-yloc)
if xloc>excludedist and xedgedist>excludedist and yloc>excludedist and yedgedist>excludedist:
for comparecell in range (0, cellnumber-1):
if secondcelltype==np.array((cellrecord[comparecell,2])):
xcomploc=np.array((cellrecord[comparecell,0]))
ycomploc=np.array((cellrecord[comparecell,1]))
dist=math.sqrt((xcomploc-xloc)**2+(ycomploc-yloc)**2)
dist=round(dist)
if dist>=1 and dist<=analysisdist:
arraytarget=round(dist*analysisdist/intervalnumber)
addone=np.array((spatialraw[arraytarget-1]))
addone=addone+1
targetcell=arraytarget-1
np.put(spatialraw,[targetcell,targetcell],addone)
Here is the matlab code for the main loop:
for basecell = 1:cellnumber;
if firstcelltype==cellrecord(basecell,3);
xloc=cellrecord(basecell,1);
yloc=cellrecord(basecell,2);
xedgedist=(xbound-xloc);
yedgedist=(ybound-yloc);
if (xloc>excludedist) && (yloc>excludedist) && (xedgedist>excludedist) && (yedgedist>excludedist);
for comparecell = 1:cellnumber;
if secondcelltype==cellrecord(comparecell,3);
xcomploc=cellrecord(comparecell,1);
ycomploc=cellrecord(comparecell,2);
dist=sqrt((xcomploc-xloc)^2+(ycomploc-yloc)^2);
if (dist>=1) && (dist<=100.4999);
arraytarget=round(dist*analysisdist/intervalnumber);
spatialsum(1,arraytarget)=spatialsum(1,arraytarget)+1;
end
end
end
end
end
end
Thanks!

Here are some ways to speed up your python code.
First: Don't make np arrays when you are only storing one value. You do this many times over in your code. For instance,
if firstcelltype == np.array((cellrecord[basecell,2])):
can just be
if firstcelltype == cellrecord[basecell,2]:
I'll show you why with some timeit statements:
>>> timeit.Timer('x = 111.1').timeit()
0.045882196294822819
>>> t=timeit.Timer('x = np.array(111.1)','import numpy as np').timeit()
0.55774970267830071
That's an order of magnitude in difference between those calls.
Second: The following code:
arraytarget=round(dist*analysisdist/intervalnumber)
addone=np.array((spatialraw[arraytarget-1]))
addone=addone+1
targetcell=arraytarget-1
np.put(spatialraw,[targetcell,targetcell],addone)
can be replaced with
arraytarget=round(dist*analysisdist/intervalnumber)-1
spatialraw[arraytarget] += 1
Third: You can get rid of the sqrt as Philip mentioned by squaring analysisdist beforehand. However, since you use analysisdist to get arraytarget, you might want to create a separate variable, analysisdist2 that is the square of analysisdist and use that for your comparison.
Fourth: You are looking for cells that match secondcelltype every time you get to that point rather than finding those one time and using the list over and over again. You could define an array:
comparecells = np.where(cellrecord[:,2]==secondcelltype)[0]
and then replace
for comparecell in range (0, cellnumber-1):
if secondcelltype==np.array((cellrecord[comparecell,2])):
with
for comparecell in comparecells:
Fifth: Use psyco. It is a JIT compiler. Matlab has a built-in JIT compiler if you're using a somewhat recent version. This should speed-up your code a bit.
Sixth: If the code still isn't fast enough after all previous steps, then you should try vectorizing your code. It shouldn't be too difficult. Basically, the more stuff you can have in numpy arrays the better. Here's my try at vectorizing:
basecells = np.where(cellrecord[:,2]==firstcelltype)[0]
xlocs = cellrecord[basecells, 0]
ylocs = cellrecord[basecells, 1]
xedgedists = xbound - xloc
yedgedists = ybound - yloc
whichcells = np.where((xlocs>excludedist) & (xedgedists>excludedist) & (ylocs>excludedist) & (yedgedists>excludedist))[0]
selectedcells = basecells[whichcells]
comparecells = np.where(cellrecord[:,2]==secondcelltype)[0]
xcomplocs = cellrecords[comparecells,0]
ycomplocs = cellrecords[comparecells,1]
analysisdist2 = analysisdist**2
for basecell in selectedcells:
dists = np.round((xcomplocs-xlocs[basecell])**2 + (ycomplocs-ylocs[basecell])**2)
whichcells = np.where((dists >= 1) & (dists <= analysisdist2))[0]
arraytargets = np.round(dists[whichcells]*analysisdist/intervalnumber) - 1
for target in arraytargets:
spatialraw[target] += 1
You can probably take out that inner for loop, but you have to be careful because some of the elements of arraytargets could be the same. Also, I didn't actually try out all of the code, so there could be a bug or typo in there. Hopefully, it gives you a good idea of how to do this. Oh, one more thing. You make analysisdist/intervalnumber a separate variable to avoid doing that division over and over again.

Not too sure about the slowness of python but you Matlab code can be HIGHLY optimized. Nested for-loops tend to have horrible performance issues. You can replace the inner loop with a vectorized function ... as below:
for basecell = 1:cellnumber;
if firstcelltype==cellrecord(basecell,3);
xloc=cellrecord(basecell,1);
yloc=cellrecord(basecell,2);
xedgedist=(xbound-xloc);
yedgedist=(ybound-yloc);
if (xloc>excludedist) && (yloc>excludedist) && (xedgedist>excludedist) && (yedgedist>excludedist);
% for comparecell = 1:cellnumber;
% if secondcelltype==cellrecord(comparecell,3);
% xcomploc=cellrecord(comparecell,1);
% ycomploc=cellrecord(comparecell,2);
% dist=sqrt((xcomploc-xloc)^2+(ycomploc-yloc)^2);
% if (dist>=1) && (dist<=100.4999);
% arraytarget=round(dist*analysisdist/intervalnumber);
% spatialsum(1,arraytarget)=spatialsum(1,arraytarget)+1;
% end
% end
% end
%replace with:
secondcelltype_mask = secondcelltype == cellrecord(:,3);
xcomploc_vec = cellrecord(secondcelltype_mask ,1);
ycomploc_vec = cellrecord(secondcelltype_mask ,2);
dist_vec = sqrt((xcomploc_vec-xloc)^2+(ycomploc_vec-yloc)^2);
dist_mask = dist>=1 & dist<=100.4999
arraytarget_vec = round(dist_vec(dist_mask)*analysisdist/intervalnumber);
count = accumarray(arraytarget_vec,1, [size(spatialsum,1),1]);
spatialsum(:,1) = spatialsum(:,1)+count;
end
end
end
There may be some small errors in there since I don't have any data to test the code with but it should get ~10X speed up on the Matlab code.
From my experience with numpy I've noticed that swapping out for-loops for vectorized/matrix-based arithmetic has noticeable speed-ups as well. However, without the shapes the shapes of all of your variables its hard to vectorize things.

You can avoid some of the math.sqrt calls by replacing the lines
dist=math.sqrt((xcomploc-xloc)**2+(ycomploc-yloc)**2)
dist=round(dist)
if dist>=1 and dist<=analysisdist:
arraytarget=round(dist*analysisdist/intervalnumber)
with
dist=(xcomploc-xloc)**2+(ycomploc-yloc)**2
dist=round(dist)
if dist>=1 and dist<=analysisdist_squared:
arraytarget=round(math.sqrt(dist)*analysisdist/intervalnumber)
where you have the line
analysisdist_squared = analysis_dist * analysis_dist
outside of the main loop of your function.
Since math.sqrt is called in the innermost loop, you should have from math import sqrt at the top of the module and just call the function as sqrt.
I would also try replacing
dist=(xcomploc-xloc)**2+(ycomploc-yloc)**2
with
dist=(xcomploc-xloc)*(xcomploc-xloc)+(ycomploc-yloc)*(ycomploc-yloc)
There's a chance it will produce faster byte code to do multiplication rather than exponentiation.
I doubt these will get you all the way to MATLABs performance, but they should help reduce some overhead.

If you have a multicore, you could maybe give the multiprocessing module a try and use multiple processes to make use of all the cores.
Instead of sqrt you could use x**0.5, which is, if I remember correct, slightly faster.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python - Big For Loop - python

You should consider using numpy. Using numpy's matrix operations you should be able to reduce computation time.

Related

Saving continuously generated simulation data with Python3

Large memory consumption by iPython Parallel module

Using Cython to create parallel threads without prange

How to correctly implement apply_async for data processing?

Numpy/Python performing terribly vs. Matlab

Categories

Resources