I am running a simulation where particles interact with each other and with the walls. Here is the snippet that writes the particle data (time step number, velocity-x, velocity-y, velocity-z, position-x, position-y, position-z) to an individual file for each particle over a large number of time steps (sampled every 1000 steps). Right now I have 15 particles, but in the future there will be more.
N_max = sim.getNumTimeSteps()
particleData = [[] for x in range(len(sim.getParticleList()))]
for n in range(N_max):
    sim.runTimeStep()
    if n % 1000 == 0:
        particles = sim.getParticleList()
        for i in range(len(particles)):
            x, y, z = particles[i].getVelocity()
            x2, y2, z2 = particles[i].getPosition()
            particleData[i].append((n, x, y, z, x2, y2, z2))
for i in range(len(sim.getParticleList())):
    with open("{0:d}.dat".format(i), "w") as f:
        for j in particleData[i]:
            f.write("%f : %f,%f,%f : %f,%f,%f \n" % (j[0], j[1], j[2], j[3], j[4], j[5], j[6]))
sim.exit()
In my simulation, the top wall is fixed and the bottom wall is sheared (moving). I am interested in dividing my simulation into strips based on y-position. So if the domain is 10 units in the y direction, I want to split it into 10 strips of width 1. I am trying to collect the speeds of the particles throughout these strips (to compare speeds depending on proximity to each wall), which I will later average and graph with matplotlib.
I am very new to Python, so someone very good at it recommended I use binning, i.e. for each time step, after reading the particle position and velocity, I should check which strip that particle's y-position falls into. How do I bin like that, adding the particle to a list for each bin? They also recommended storing the average information in another array. I've Googled plenty on binning, but I'm overwhelmed by all the things that numpy and scipy can do, so the complicated/advanced examples are lost on me. Is this the best way to go about it? Does this all make sense?!
This is as far as I've gotten with reading the particle's data...
for i in range(10):
    with open("{}.dat".format(i), 'r') as csvfile:
        data = csv.reader(csvfile, delimiter=',')
        y2 = []
        for row in data:
            y2.append(float(row[5]))
Then I'm assuming the binning happens by putting y2 between certain values? Something like if (n / 10) <= y2 <= ((n+1) / 10):?
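For example, is something like this the idea (a rough sketch, assuming 1-unit strips and a hypothetical speeds_by_strip dict; I use math.floor so negative y-positions land in the right strip, and vx, vy, vz are placeholder names for the velocity columns)?
import math
speeds_by_strip = {b: [] for b in range(-10, 10)}   # one list per 1-unit strip in y
# ... after reading y2 and the velocity components for one row:
strip = math.floor(y2)                              # e.g. y2 = -7.001 -> strip -8
speed = (vx**2 + vy**2 + vz**2) ** 0.5
speeds_by_strip[strip].append(speed)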
Here is an example of the dat files:
0.0 : 0.999900,-0.999900,0.0 : -6.999000,-7.001000,0.0
1000.0 : -1.617575,-0.927360,0.0 : -6.032388,-9.007120,0.0
2000.0 : -1.019145,-0.939388,0.0 : -3.059924,-9.008897,0.0
3000.0 : 0.654350,-0.560711,0.0 : -4.575242,-9.242543,0.0
4000.0 : 0.592084,0.509928,0.0 : -3.952575,-9.275643,0.0
5000.0 : 2.288733,0.0,0.0 : -3.038456,-10.0,0.0
etc., until the end of the simulation at n=20000.
Each file belongs to an individual particle, so it shows that particle's movement and speed across the timesteps.
I am simulating 15 particles so I have 15 files.
For the strips, I want all the particles that are in that strip at any time.
I will average those numbers later.
If the simulation's domain is 10x10, the particles are anywhere between y=0 and y=10.
Here is a non-Pandas/NumPy/SciPy solution. If at some point in the future processing time becomes annoying, you could delve into those (there is a learning curve). There are other advantages, particularly with Pandas: subsequent analysis might be easier with it. But you can probably do all of the analysis without it.
For the strips, I want all the particles that are in that strip at any time.
You will need to identify each data point after you have lumped them all together. For simplicity I've used a namedtuple to make an object of each data point.
import csv
from collections import namedtuple
Particle = namedtuple('Particle',('name','t','x','y','z','x2','y2','z2'))
Choosing the correct container for your data is often important; you have to figure that out early, and it affects the mechanics of the processing later. Again, I've opted for simplicity with no thought of how it will be used later: a dictionary with a key/value pair for each strip. Each key is the left edge of the strip/bin; converting the y position to an integer easily categorizes it.
# positions in example data are all negative
bins = {-0:[],-1:[],-2:[],-3:[],-4:[],-5:[],-6:[],-7:[],-8:[],-9:[]}
Use the csv module to read all the files; make Particles; put them in bins.
for name in range(3):
    with open(f'{name}.dat') as f:
        reader = csv.reader(f, delimiter=':')
        # example row
        # 0.0 : 0.999900,-0.999900,0.0 : -6.999000,-7.001000,0.0
        for t, vel, pos in reader:
            t = float(t)
            x, y, z = map(float, vel.split(','))
            x2, y2, z2 = map(float, pos.split(','))
            p = Particle(name, t, x, y, z, x2, y2, z2)
            y = int(p.y2)
            #print(f'{y}:{p}')
            bins[y].append(p)
Partial bins made from some random data.
{-9: [Particle(name=1, t=1000.0, x=1.09185, y=2.13655, z=-1.96046, x2=-8.74504, y2=-9.89888, z2=-9.49985),...],
-8: [Particle(name=0, t=5000.0, x=1.2371, y=1.10508, z=-0.9939, x2=-9.47672, y2=-8.90004, z2=-8.06145),
Particle(name=2, t=7000.0, x=-0.82952, y=0.14332, z=-0.3446, x2=-2.76384, y2=-8.14855, z2=-7.2325)],
-7: [...,Particle(name=2, t=12000.0, x=1.06694, y=0.02654, z=-2.93894, x2=-8.62668, y2=-7.93497, z2=-6.18243)],
-6: [Particle(name=0, t=3000.0, x=0.01791, y=-2.67168, z=-1.39907, x2=-6.00256, y2=-6.64951, z2=-6.35569),...,
Particle(name=2, t=18000.0, x=2.41593, y=-2.27558, z=-1.1414, x2=-6.90592, y2=-6.42374, z2=-9.67672)],
-5: [...],
-4: [...],
...}
Random data maker.
import numpy as np
import csv

def make_data(q=3):
    for n in range(q):
        data = np.random.random((21, 6))
        np.add(data, [-.5, -.5, -.5, 0, 0, 0], out=data)
        np.multiply(data, [6, 6, 6, -10, -10, -10], out=data)
        np.round_(data, 5, data)
        t = np.linspace(0, 20000, 21)
        data = np.hstack((t[:, None], data))
        with open(f'{n}.dat', 'w', newline='') as f:
            writer = csv.writer(f, delimiter=':')
            writer.writerows(data.tolist())
If in the future you want finer strips, say hundredths of units, just multiply by that factor.
>>> factor = 100
>>> y2 = -1.20513
>>> int(y2*factor)
-120
>>> d = {n:[] for n in range(0,-10*factor,-1)}
>>> d[int(y2*factor)].append(str(y2))
>>> d[-120]
['-1.20513']
>>>
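If you later want the per-strip averages mentioned in the question, a minimal sketch along these lines could follow the binning (this assumes the bins dict of Particle namedtuples built above, and takes speed as the magnitude of the velocity vector):
averages = {}
for strip, plist in bins.items():
    if plist:
        speeds = [(p.x**2 + p.y**2 + p.z**2) ** 0.5 for p in plist]
        averages[strip] = sum(speeds) / len(speeds)
    else:
        averages[strip] = None   # no particle ever visited this strip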
Related
In order to speed up my code I want to replace my for loops with vectorization or other recommended tools. I found plenty of examples of replacing simple for loops, but nothing on replacing nested for loops combined with conditions that I was able to comprehend or that would have helped me...
With my code I want to check whether points (X, Y coordinates) can be connected by lineaments (linear structures). I started pretty simple, but over time the code outgrew itself and is now exhaustingly slow...
Here is a working example of the part that takes the most time:
import numpy as np
import matplotlib.pyplot as plt
from shapely.geometry import MultiLineString, LineString, Point
from shapely.affinity import rotate
from math import sqrt
from tqdm import tqdm
import random as rng

# creating random array of points
xys = rng.sample(range(201 * 201), 100)
points = [list(divmod(xy, 201)) for xy in xys]

# plot points
plt.scatter(*zip(*points))

# calculate length for rotating lines -> diagonal of bounds so all points are able to be reached
length = sqrt(2) * 200

# calculate angles to rotate lines
angles = []
for a in range(0, 360, 1):
    angle = np.deg2rad(a)
    angles.append(angle)

# copy points array to helper array (points_list) so the original array is not manipulated
points_list = points.copy()

# array to save final lines
lines = []

# iterate over every point in points array to search for connecting lines
for point in tqdm(points):
    # delete point from helper array to speed up iteration -> so points do not get
    # double, triple, ... checked
    if len(points_list) > 0:
        points_list.remove(point)
    else:
        break
    # create line from original point to point at end of line (x+length) - this line
    # gets rotated at calculated angles
    start = Point(point)
    end = Point(start.x + length, start.y)
    line = LineString([start, end])
    # iterate over angle array to rotate line by each angle
    for angle in angles:
        rot_line = rotate(line, angle, origin=start, use_radians=True)
        lst = list(rot_line.coords)
        # save starting point (a) and ending point (b) of rotated line for np.cross()
        # (cross product to check if points are on/near the rotated line)
        a = np.asarray(lst[0])
        b = np.asarray(lst[1])
        # counter to count number of points on/near line
        count = 0
        line_list = []
        # iterate over the manipulated points_list array (only points left for which there has
        # not been a line rotated yet)
        for poi in points_list:
            # check whether point (poi) is on/near the rotated line by calculating the cross
            # product (np.cross())
            p = np.asarray(poi)
            cross = np.cross(p - a, b - a)
            # check if poi is inside the accepted deviation from the cross product
            if cross > -750 and cross < 750:
                # check if more than 5 points (poi) are on/near the rotated line
                if count < 5:
                    line_list.append(poi)
                    count += 1
                # if 5 points are connected by the rotated line, sort the coordinates
                # of the points and check if the length of the line meets the criteria
                else:
                    line_list = sorted(line_list, key=lambda k: [k[1], k[0]])
                    line_length = LineString(line_list)
                    if line_length.length >= 10 and line_length.length <= 150:
                        lines.append(line_list)
                    break

# use shapely's MultiLineString to create lines from coordinates and plot them
# afterwards
multiLines = MultiLineString(lines)
fig, ax = plt.subplots()
ax.set_title("Lines")
for multiLine in MultiLineString(multiLines).geoms:
    # print(multiLine)
    plt.plot(*multiLine.xy)
As mentioned above, I was thinking about using pandas or numpy vectorization, and therefore built a pandas DataFrame for the points and lines (gdf) and one with the different angles (angles) to rotate the lines:
Name      Type       Size         Value
gdf       DataFrame  (122689, 6)  Column names: x, y, value, start, end, line
angles    DataFrame  (360, 1)     Column name: angle
But I ran out of ideas for replacing these nested for loops with conditions with pandas vectorization. I found this article on Medium, and halfway through the article conditions for vectorization are mentioned; I was wondering whether my code is perhaps not suitable for vectorization because of dependencies within the loops...
If that is right, it does not necessarily need to be vectorization; anything boosting the performance is welcome!
You can quite easily vectorize the most computationally intensive part: the innermost loop. The idea is to process all of points_list at once: np.cross can be applied to every point in a single call, and np.where can be used to filter the result (and get the IDs).
Here is the (barely tested) modified main loop:
for point in tqdm(points):
    if len(points_list) > 0:
        points_list.remove(point)
    else:
        break
    start = Point(point)
    end = Point(start.x + length, start.y)
    line = LineString([start, end])

    # CHANGED PART
    if len(points_list) == 0:
        continue
    p = np.asarray(points_list)
    for angle in angles:
        rot_line = rotate(line, angle, origin=start, use_radians=True)
        a, b = np.asarray(rot_line.coords)
        cross = np.cross(p - a, b - a)
        foundIds = np.where((cross > -750) & (cross < 750))[0]
        if foundIds.size > 5:
            # Similar to the initial part, not efficient, but rarely executed
            line_list = p[foundIds][:5].tolist()
            line_list = sorted(line_list, key=lambda k: [k[1], k[0]])
            line_length = LineString(line_list)
            if line_length.length >= 10 and line_length.length <= 150:
                lines.append(line_list)
This is about 15 times faster on my machine.
Most of the time is spent in the shapely module, which is very inefficient (especially rotate, and even np.asarray(rot_line.coords)). Indeed, each call to rotate takes about 50 microseconds, which is simply insane: it should take no more than 50 nanoseconds, that is, 1000 times faster (in fact, optimized native code should be able to do that in less than 20 ns on my machine). If you want faster code, then please consider not using this package (or improving its performance).
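For example, the rotated end point can be computed with a little trigonometry instead of shapely.affinity.rotate (a sketch, assuming the same start, length, angles and p as in the loop above; the original line points along +x, so rotating it by angle about start gives start + length*(cos(angle), sin(angle)), counter-clockwise like shapely):
a = np.array([start.x, start.y])              # rotation origin: start of the line
for angle in angles:
    # end point of the line rotated by `angle` around `a`
    b = a + length * np.array([np.cos(angle), np.sin(angle)])
    cross = np.cross(p - a, b - a)            # same point-on-line test as before
    foundIds = np.where((cross > -750) & (cross < 750))[0]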
In my work I have the task of reading in a CSV file and doing calculations with it. The CSV file consists of 9 different columns and about 150 lines with different values acquired from sensors. First, the horizontal acceleration was determined, from which the distance was derived by double integration. This is the lower of the two plots in the picture. The upper plot represents the so-called force data: the orange graph is plotted from the 9th column of the CSV file and the blue graph from the 7th column.
As you can see, I have drawn two vertical lines in the lower plot of the picture. These lines mark the x-values at which, in the upper plot, the orange function has its global minimum and at which it intersects the blue function. Now I want to do the following, but I need some help: first, I want the intersection point between the first vertical line and the graph to be (0,0), i.e. the function has to be moved down. How do I achieve this? Furthermore, the piece of the function before this first intersection point (shown in purple) should be omitted, so that the function really only starts at this point. How can I do this?
In the following picture I try to demonstrate how I would like to do that:
If you need my code, here you can see it:
import numpy as np
import matplotlib.pyplot as plt
import math as m
import loaddataa as ld
import scipy.integrate as inte
from scipy.signal import find_peaks
import pandas as pd
import os

# Loading of the values
print(os.path.realpath(__file__))
a, b = os.path.split(os.path.realpath(__file__))
print(os.chdir(a))
print(os.chdir('..'))
print(os.chdir('..'))
path = os.getcwd()
path = path + "\\Data\\1 Fabienne\\Test1\\left foot\\50cm"
print(path)
dataListStride = ld.loadData(path)
indexStrideData = 0
strideData = dataListStride[indexStrideData]

#%% Calculation of the horizontal acceleration
def horizontal(yAngle, yAcceleration, xAcceleration):
    a = ((m.cos(m.radians(yAngle))) * yAcceleration) - ((m.sin(m.radians(yAngle))) * xAcceleration)
    return a

resultsHorizontal = list()
for i in range(len(strideData)):
    strideData_yAngle = strideData.to_numpy()[i, 2]
    strideData_xAcceleration = strideData.to_numpy()[i, 4]
    strideData_yAcceleration = strideData.to_numpy()[i, 5]
    resultsHorizontal.append(horizontal(strideData_yAngle, strideData_yAcceleration, strideData_xAcceleration))
resultsHorizontal.insert(0, 0)
#plt.plot(x_values, resultsHorizontal)

#%%
# x-axis "converted" into time: 100 Hertz makes 0.01 seconds
scale_factor = 0.01
x_values = np.arange(len(resultsHorizontal)) * scale_factor

# Calculation of the global high and low points
heel_one = pd.Series(strideData.iloc[:, 7])
plt.scatter(heel_one.idxmax() * scale_factor, heel_one.max(), color='red')
plt.scatter(heel_one.idxmin() * scale_factor, heel_one.min(), color='blue')
heel_two = pd.Series(strideData.iloc[:, 9])
plt.scatter(heel_two.idxmax() * scale_factor, heel_two.max(), color='orange')
plt.scatter(heel_two.idxmin() * scale_factor, heel_two.min(), color='green')

# Plot of force data
plt.plot(x_values[:-1], strideData.iloc[:, 7])  # force heel
plt.plot(x_values[:-1], strideData.iloc[:, 9])  # force toe

# while-loop to calculate the point of intersection with the blue function
i = heel_one.idxmax()
while strideData.iloc[i, 7] > strideData.iloc[i, 9]:
    i = i - 1

# Length calculation between global minimum of the orange function and intersection with the blue function
laenge = (i - heel_two.idxmin()) * scale_factor
print(laenge)

#%% Integration of the horizontal acceleration
velocity = inte.cumtrapz(resultsHorizontal, x_values)
plt.plot(x_values[:-1], velocity)

#%% Integration of the velocity
s = inte.cumtrapz(velocity, x_values[:-1])
plt.plot(x_values[:-2], s)
I hope it's clear what I want to do. Thanks for helping me!
I didn't dig all the way through your code, but the following tricks may be useful.
Say you have x and y values:
x = np.linspace(0,3,100)
y = x**2
Now, you only want the values corresponding to, say, .5 < x < 1.5. First, create a boolean mask for the arrays as follows:
mask = np.logical_and(.5 < x, x < 1.5)
(If this seems magical, then run x < 1.5 in your interpreter and observe the results).
Then use this mask to select your desired x and y values:
x_masked = x[mask]
y_masked = y[mask]
Then, you can translate all these values so that the first x,y pair is at the origin:
x_translated = x_masked - x_masked[0]
y_translated = y_masked - y_masked[0]
Is this the type of thing you were looking for?
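Applied to the arrays in your script, that might look roughly like this (a sketch, assuming s and x_values from your code and that i is the intersection index found with your while loop; since s was plotted against x_values[:-2], I mask that slice):
mask = x_values[:-2] >= x_values[i]   # keep only the part from the intersection onwards
x_cut = x_values[:-2][mask]
s_cut = s[mask]
# shift so the first remaining point sits at (0, 0)
plt.plot(x_cut - x_cut[0], s_cut - s_cut[0])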
I would like to know how I can put the values resulting from my loops into an external file in the form of columns. I have this code:
Perhaps a very specific, brief explanation of what I would like to obtain could help a bit: I have a group of particles that fall on the surface of the earth around a point (0,0), each with its respective weight, and I would like to know the sum of the weights of the particles that fall within rings of internal radius Ri and external radius Rj (the external radius of one ring becomes the internal radius of the ring that follows it).
# Insert the radius values
# For example Ri_0=-20
# For example Rj_max=4000
# For example Bin=40
data = pd.read_csv("photons.txt", header=0, delim_whitespace=True)
df = pd.DataFrame(data)
# input() returns strings, so convert to int before building the ranges
Ri_0 = int(input("Insert the internal minimum radius value: "))
Rj_max = int(input("Insert the external maximum radius value: "))
Bin = int(input("Insert the bin: "))
R_internals = range(Ri_0, Rj_max + 1, Bin)
Ri = list(R_internals)
Rj = []
R = []
for m in Ri:
    R_externals = m + Bin
    Rj.append(R_externals)
for d, f in zip(Ri, Rj):
    R_average = (d + f) / 2
    R.append(R_average)
import zipfile
# Loops
count = 0
# I think the problem is in this loop
for i, j in zip(Ri, Rj):
    for r in df["radio"]:
        if r >= i and r <= j:
            d = df[df['radio'] == r]['ParWeight'].iloc[0]
            count = count + d
I have a problem adding up the weights of the particles that fall within each ring of internal radius Ri and external radius Rj, and then writing the sums to an external file as a data column with the corresponding value of R (the average of Ri and Rj) next to it. I get a systematic error: it adds up the weights of all the particles and does not separate them by ring. I attach the file "photons.txt" at the following link [https://drive.google.com/file/d/1YM0U3UN4p1OGvbiajZakMWtQSmyZoCkN/view?usp=sharing].
I have spent several days trying to solve this problem.
Thanks so much.
Bryan, before the function definition I am using your code and your data. My idea is then to find which interval of your Ri array each radius falls into. Once I know where each radius falls, I add up all the weights for each group. Please let me know if you have any questions. Cheers.
import pandas as pd
import numpy as np

data_file = r"D:\workspace\projects\misc\data\photons.txt"
df = pd.read_csv(data_file, delimiter="\t")

Ri_0 = -20
Rj_max = 4000
Bin = 40
R_internals = range(Ri_0, Rj_max + 1, Bin)
Ri = list(R_internals)

def getRadiusBoundary(x, listBoundaries):
    """ This function will place a specific radius into its boundaries.
    Indexes of external and internal radiuses are used.
    """
    for i in range(len(listBoundaries)):
        if x <= listBoundaries[0]:
            return 0
        elif x > listBoundaries[len(listBoundaries) - 1]:
            return len(listBoundaries)
    idx = [i+1 for i in range(len(listBoundaries)-1) if listBoundaries[i] < x and x <= listBoundaries[i+1]]
    return idx[0]

# this is a dictionary that maps radius groups to their boundaries
dictRadiusMapping = {i+1: [Ri[i], Ri[i+1]] for i in range(len(Ri)-1)}
dictRadiusMapping[0] = [-np.inf, Ri[0]]
dictRadiusMapping[len(Ri)] = [Ri[len(Ri) - 1], np.inf]

# here I am placing each radius into its appropriate group (finding its [external, internal] boundaries)
df["radius_group"] = df["radio"].apply(lambda x: getRadiusBoundary(x, Ri))
radius_sums = df.groupby("radius_group")["ParWeight"].sum().reset_index()  # summing for each radius group
radius_sums.columns = ["radius_group", "weight_sum"]
radius_sums["radiuses"] = radius_sums["radius_group"].map(dictRadiusMapping)
This problem is about using scipy.signal.find_peaks for extracting mean peak height from data files efficiently. I am a beginner with Python (3.7), so I am not sure if I have written my code in the most optimal way, with regard to speed and code quality.
I have a set of measurement files containing one million data points (30MB) each. The graph of this data is a signal with peaks at regular intervals, and with noise. Also, the baseline and the amplitude of the signal vary across parts of the signal. I attached an image of an example. The signal can be much less clean.
My goal is to calculate the mean height of the peaks for each file. In order to do this, first I use find_peaks to locate all the peaks. Then, I loop over each peak location and detect the peak in a small interval around the peak, to make sure I get the local height of the peak.
I then put all these heights in numpy arrays and calculate their mean and standard deviation afterwards.
Here is a barebones version of my code; it is a bit long, but I think that might also be because I am doing something wrong.
import numpy as np
from scipy.signal import find_peaks

# Allocate empty lists for values
mean_heights = []
std_heights = []
mean_baselines = []
std_baselines = []
temperatures = []

# Loop over several files, read them in and process data
for file in file_list:
    temperatures.append(file)
    # Load in data from a file of 30 MB
    t_dat, x_dat = np.loadtxt(file, delimiter='\t', unpack=True)
    # Find all peaks in this file
    peaks, peak_properties = find_peaks(x_dat, prominence=prom, width=0)
    # Calculate window size, make sure it is even
    if round(len(t_dat)/len(peaks)) % 2 == 0:
        n_points = len(t_dat) // len(peaks)
    else:
        n_points = len(t_dat) // len(peaks) + 1
    t_slice = t_dat[-1] / len(t_dat)
    # Allocate np arrays for storing heights
    baseline_list = np.zeros(len(peaks) - 2)
    height_list = np.zeros(len(peaks) - 2)
    # Loop over all found peaks, and re-detect the peak in a window around the peak to be able
    # to detect its local height without triggering on a baseline far away
    for i in range(len(peaks) - 2):
        # Making a window around a peak
        sub_t = t_dat[peaks[i+1] - n_points // 2: peaks[i+1] + n_points // 2]
        sub_x = x_dat[peaks[i+1] - n_points // 2: peaks[i+1] + n_points // 2]
        # Detect the peaks (2 versions, specific to the application I have)
        h_min = max(sub_x) - np.mean(sub_x)
        _, baseline_props = find_peaks(
            sub_x, prominence=h_min, distance=n_points - 1, width=0)
        _, height_props = find_peaks(np.append(
            min(sub_x) - 1, sub_x), prominence=h_min, distance=n_points - 1, width=0)
        # Add the heights to the np arrays storing the heights
        baseline_list[i] = baseline_props["prominences"]
        height_list[i] = height_props["prominences"]
    # Fill lists with values, taking the stdev and mean of the np arrays with the heights
    mean_heights.append(np.mean(height_list))
    std_heights.append(np.std(height_list))
    mean_baselines.append(np.mean(baseline_list))
    std_baselines.append(np.std(baseline_list))
It takes ~30 s to execute. Is this normal or too slow? If so, can it be optimised?
In the meantime I have improved the speed by getting rid of various inefficiencies that I found using the Python profiler. I will list the optimisations here, ordered by their significance for the speed:
Using pandas pd.read_csv() for I/O instead of np.loadtxt() cut off about 90% of the runtime. As also mentioned here, this saves a lot of time. This means changing this:
t_dat, x_dat = np.loadtxt(file, delimiter='\t', unpack=True)
to this:
data = pd.read_csv(file, delimiter = "\t", names=["t_dat", "x_dat"])
t_dat = data.values[:,0]
x_dat = data.values[:,1]
Removing redundant len() calls. I noticed that len() was called many times, and then noticed that this happened unnecessarily. Changing this:
if round(len(t_dat) / len(peaks)) % 2 == 0:
    n_points = int(len(t_dat) / len(peaks))
else:
    n_points = int(len(t_dat) / len(peaks) + 1)
to this:
n_points = round(len(t_dat) / len(peaks))
if n_points % 2 != 0:
    n_points += 1
proved to be also a significant improvement.
Lastly, a disproportionately large component of the computational time (about 20%) was used by the built-in Python functions min(), max() and sum(). Since I was already using numpy arrays, switching to the numpy equivalents of these functions resulted in an 84% improvement on this part. This means, for example, changing max(sub_x) to sub_x.max().
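In the peak loop above that amounts, for example, to (a sketch using the same sub_x window as before):
# before: Python built-ins applied to a numpy array
h_min = max(sub_x) - np.mean(sub_x)
# after: numpy methods operating directly on the array
h_min = sub_x.max() - sub_x.mean()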
These are all unrelated optimisations that I still think might be useful for a Python beginner like myself, and they do help a lot.
I have been playing around for months with how best to write a program that will analyze multiple tables for similarities in geographical coordinates. I have tried everything from nested for-loops to my current approach of using a KD-tree, which seems to be working great. However, I am not sure it is functioning properly when reading in my 3rd dimension, which in this case is defined as Z.
import numpy
from scipy import spatial
import math as ma

def d(a, b):
    d = ma.acos(ma.sin(ma.radians(a[1])) * ma.sin(ma.radians(b[1]))
                + ma.cos(ma.radians(a[1])) * ma.cos(ma.radians(b[1])) * (ma.cos(ma.radians((a[0] - b[0])))))
    return d

filename1 = "A"
pos1 = numpy.genfromtxt(filename1,
                        skip_header=1,
                        usecols=(1, 2))
z1 = numpy.genfromtxt(filename1,
                      skip_header=1,
                      usecols=(3))
filename2 = "B"
pos2 = numpy.genfromtxt(filename2,
                        #skip_header=1,
                        usecols=(0, 1))
z2 = numpy.genfromtxt(filename2,
                      #skip_header=1,
                      usecols=(2))
filename1 = "A"
data1 = numpy.genfromtxt(filename1,
                         skip_header=1)
                         #usecols=(0, 1))
filename2 = "B"
data2 = numpy.genfromtxt(filename2,
                         skip_header=1)
                         #usecols=(0, 1)
tree1 = spatial.KDTree(pos1)
match = tree1.query(pos2)
#print(match)
indices_pos1, indices_pos2 = [], []
for idx_pos1 in range(len(pos1)):
    # find indices in pos2 that match this position (idx_pos1)
    matching_indices_pos2 = numpy.where(match[1] == idx_pos1)[0]
    for idx_pos2 in matching_indices_pos2:
        # distance in spherical coordinates
        distance = d(pos1[idx_pos1], pos2[idx_pos2])
        if distance < 0.01 and z1[idx_pos1] - z2[idx_pos2] > 0.001:
            print(pos1[idx_pos1], pos2[idx_pos2], z1[idx_pos1], z2[idx_pos2], distance)
As you can see, I am first treating the (x, y) position as a single unit measured in spherical coordinates. Each element in file1 is compared to each element in file2. The problem lies somewhere in the Z dimension, but I can't seem to crack this issue. When the results are printed out, the Z coordinates are often nowhere near each other. It seems as if my program is entirely ignoring the and condition. Below I have posted a string of results from my data which shows the issue, namely that the z-values are in fact very far apart.
[ 358.98787832 -3.87297365] [ 358.98667162 -3.82408566] 0.694282 0.5310796 0.000853515096105
[ 358.98787832 -3.87297365] [ 359.00303872 -3.8962745 ] 0.694282 0.5132215 0.000484847441066
[ 358.98787832 -3.87297365] [ 358.99624509 -3.84617685] 0.694282 0.5128636 0.000489860962243
[ 359.0065807 -8.81507801] [ 358.99226267 -8.8451829 ] 0.6865379 0.6675241 0.000580562641945
[ 359.0292886 9.31398903] [ 358.99296163 9.28436493] 0.68445694 0.45485374 0.000811677349685
How the output is structured: [position1 (x, y)] [position2 (x, y)] [Z1] [Z2] distance
As you can see, specifically in the last example, the Z-coordinates are separated by about 0.23, which is way over the 0.001 restriction I typed for it above.
Any insights you could share would be really wonderful!
As for your original problem, you have a simple problem with the sign. You test if z1-z2 > 0.001, but you probably wanted abs(z1-z2) < 0.001 (notice the < instead of a >).
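In the loop above that would be, roughly:
# points match only if the z values are close together
if distance < 0.01 and abs(z1[idx_pos1] - z2[idx_pos2]) < 0.001:
    print(pos1[idx_pos1], pos2[idx_pos2], z1[idx_pos1], z2[idx_pos2], distance)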
You could have the tree also take the z coordinate into account; then you need to give it data as (x, y, z) and not only (x, y).
If it doesn't know the z value, it cannot use it.
It should be possible (although the sklearn API might not allow this) to query the tree directly for a window, where you bound the coordinate range and the z range independently. Think of a box that has different extensions in x,y,z. But because z will have a different value range, combining these different scales is difficult.
Beware that the k-d-tree does not know about spherical coordinates. A point at +180 degrees and one at -180 degrees (or one at 0 and one at 360) are very far apart for the k-d-tree, but very close in spherical distance. So it will miss some points!
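If you do want the tree itself to see z, a minimal sketch could look like this (assuming pos1/pos2/z1/z2 as loaded above; z_scale is an arbitrary factor you would have to tune so that one unit of z is comparable to one degree of position, and the wrap-around caveat above still applies):
z_scale = 10.0                                     # assumption: depends on your data
xyz1 = numpy.column_stack((pos1, z1 * z_scale))    # (x, y, scaled z) for file A
xyz2 = numpy.column_stack((pos2, z2 * z_scale))    # (x, y, scaled z) for file B
tree1 = spatial.KDTree(xyz1)
distances, indices = tree1.query(xyz2)             # nearest (x, y, z)-neighbour in A for each row of B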