Find line intersection with all possible combinations of points in dataframe - python

Let's say I have the following line:
l = Line(Point(25, 0), Point(25, 25))
and I have a dataframe (df) which contains 2500 points, something like:
x y
0 0 49
1 13 48
2 0 47
3 5 46
4 9 45
...
How can I efficiently check whether the lines formed by every combination of two of those points intersect the line above?
Note that I am using the intersection function from the sympy library.
Note also that using two nested loops takes forever, so that is not an option.
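One way to avoid the nested loops is to skip sympy for the bulk test and use NumPy broadcasting. The sketch below rests on two assumptions that are not stated in the question: each pair of points is treated as a line segment, and the reference line is the infinite vertical line x = 25 (which is what Line(Point(25, 0), Point(25, 25)) describes in sympy). Under those assumptions a segment touches the line exactly when its endpoints lie on opposite sides of x = 25, or on it. The function name and the extra sample point (30, 10) are made up for illustration.

```python
import numpy as np
import pandas as pd

def pairs_crossing_vertical_line(df, x0=25.0):
    """Return positional pairs (i, j), i < j, whose segment touches x = x0."""
    # signed horizontal offset of every point from the line
    side = df["x"].to_numpy(dtype=float) - x0
    # endpoints on opposite sides (or on the line) give a product <= 0
    crosses = side[:, None] * side[None, :] <= 0
    # keep each unordered pair once by looking only above the diagonal
    i, j = np.nonzero(np.triu(crosses, k=1))
    return np.column_stack([i, j])

# first five rows from the question plus one invented point right of the line
df = pd.DataFrame({"x": [0, 13, 0, 5, 9, 30], "y": [49, 48, 47, 46, 45, 10]})
pairs = pairs_crossing_vertical_line(df)
print(pairs)
```

For 2500 points the boolean matrix has 2500 x 2500 entries, which is small enough to hold in memory. If the reference is meant to be the finite segment from (25, 0) to (25, 25), the y value at the crossing would need an extra range check.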

Related

Find the shortest path fast

I want to make the shortest path between many points.
I generate an 8x8 matrix, with random values like:
[[ 0 31 33 0 43 10 0 0]
[31 0 30 0 0 13 0 0]
[33 30 0 11 12 5 6 0]
[ 0 0 11 0 15 0 38 11]
[43 0 12 15 0 39 0 0]
[10 13 5 0 39 0 3 49]
[ 0 0 6 38 0 3 0 35]
[ 0 0 0 11 0 49 35 0]]
Now I want to take the first row and find its smallest number, then see where it is in the row and take its position. Next I clear the first row so that the first point is forgotten, and put the next position in a new list that holds the path. Then the same is done for the new point. At the end, when all points are in my path list, it shows me the shortest way.
indm=0
lenm=[]
prochain=max(matrixF[indm])
chemin=[]
long=len(chemin)
while long != n:
    for i in range (n):
        if matrixF[indm,i] <= prochain and matrixF[indm,i]!=0:
            pluspetit=matrixF[indm,i]
    prochainpoint=np.where(matrixF == pluspetit)
    chemin.append(prochainpoint)
    indm=prochainpoint
    for i in range (n):
        matrixF[indm,i]=0
    long=len(chemin)
print(chemin)
print(matrixF)
But I got this error:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
This line computes all the indices where matrixF == pluspetit:
prochainpoint=np.where(matrixF == pluspetit)
Problem is, there's more than one, so on your first pass through, prochainpoint ends up as (array([0, 5]), array([5, 0])). You then set indm to prochainpoint, so on your next pass, instead of getting a single value with matrixF[indm,i], you retrieve a 2x2 array of (repeated) values, as if you'd done:
np.array([[matrixF[0,1], matrixF[5,1]],
          [matrixF[5,1], matrixF[0,1]]])
Comparing this to prochain (still a scalar value) produces a 2x2 array of boolean results, which you then try to test for truthiness, but numpy doesn't want to guess at whether you mean "are they all true?" or "are any of them true?", so it dumps that decision back on you with the error message.
I'm assuming the problem is with prochainpoint=np.where(matrixF == pluspetit), where you get many results when you presumably only want one, but I'm not clear what the real intent of the line is, so you'll have to figure out what you really intended to do there and replace it with something that consistently computes a single value.
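If the intent is "from the current point, move to the closest point not yet visited", one reading of the question, then np.argmin on the current row gives a single index instead of the tuple of arrays that np.where returns. The sketch below assumes that a 0 entry means "no edge" and that visited points must not be revisited; the function name greedy_path is made up. Note that this is a greedy nearest-neighbour walk, which is not guaranteed to be the shortest route overall.

```python
import numpy as np

def greedy_path(dist, start=0):
    """Nearest-neighbour walk over a distance matrix where 0 means 'no edge'."""
    dist = np.asarray(dist, dtype=float)
    n = len(dist)
    visited = np.zeros(n, dtype=bool)
    visited[start] = True
    path = [start]
    current = start
    while len(path) < n:
        row = dist[current].copy()
        # hide missing edges and already visited points
        row[(row == 0) | visited] = np.inf
        nxt = int(np.argmin(row))  # always one index, never an array
        if not np.isfinite(row[nxt]):
            break  # dead end: no unvisited neighbour is reachable
        path.append(nxt)
        visited[nxt] = True
        current = nxt
    return path

matrixF = np.array([
    [0, 31, 33, 0, 43, 10, 0, 0],
    [31, 0, 30, 0, 0, 13, 0, 0],
    [33, 30, 0, 11, 12, 5, 6, 0],
    [0, 0, 11, 0, 15, 0, 38, 11],
    [43, 0, 12, 15, 0, 39, 0, 0],
    [10, 13, 5, 0, 39, 0, 3, 49],
    [0, 0, 6, 38, 0, 3, 0, 35],
    [0, 0, 0, 11, 0, 49, 35, 0],
])
path = greedy_path(matrixF)
print(path)
```

On the matrix from the question the walk stops before every point is reached, because it arrives at a point whose remaining neighbours have all been visited. That is a property of the greedy strategy on this data, so the loop needs the dead-end check rather than running until the path has n entries.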

Lines between points by id

I have a dataframe with points on a 2-dimensional plane:
    index          x          y
0       0  -0.032836  49.268820
1       0   4.160005  49.268820
2       0   4.105928  68.330440
3       0  -0.062953  68.342125
4       1   4.166139  49.269398
5       1   8.497650  49.278310
6       1   8.592334  68.336560
7       1   4.041361  68.336560
8       2   8.426349  49.278890
9       2  13.480260  49.278890
10      2  13.446286  68.336560
11      2   8.467557  68.336560
12      3  13.438516  49.278374
13      3  17.356792  49.287285
14      3  17.378400  68.338240
15      3  13.382163  68.333786
16      4  17.295988  49.289800
17      4  21.418156  49.289800
18      4  21.336264  67.359630
19      4  17.313816  67.359630
and I've been trying to find a way to draw lines between the (x,y) coordinates for each index. The resulting plot should be closed rectangles.
Now, I've tried to approach this by defining series:
x = df['x']
y = df['y']
and then
index_l = df.index.tolist()
for i in index_l:
    plt.plot([df.x[i],df.y[i]])
This doesn't work at all. Any idea on how to proceed? A note: ideally I would like to have a rectangle, but if connecting the points diagonally is easier, I can live with that.
Thankful for any hints or solutions.
You can group by the index and then for x, y values of each group, append the first row to the end so that plt.plot plots a closed rectangle:
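import pandas as pd
for idx, points in df.groupby("index")[["x", "y"]]:
    # repeat the first corner at the end so the outline closes
    # (pd.concat is used because DataFrame.append no longer exists in pandas 2.x)
    points_to_plot = pd.concat([points, points.iloc[[0]]])
    plt.plot(points_to_plot.x, points_to_plot.y)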
This draws one closed outline per index value.

Discard points with X,Y coordinate close to eachother in Dataframe

I have the following dataframe (it is actually several hundred MB long):
X Y Size
0 10 20 5
1 11 21 2
2 9 35 1
3 8 7 7
4 9 19 2
I want to discard any X, Y point that has a Euclidean distance of less than delta=3 from any other X, Y point in the dataframe. In those cases I want to keep only the row with the bigger size.
In this example the intended result would be:
X Y Size
0 10 20 5
2 9 35 1
3 8 7 7
As the question is stated, the behavior of the desired algorithm is not clear about how to deal with the chaining of distances.
If chaining is allowed, one solution is to cluster the dataset using a density-based clustering algorithm such as DBSCAN.
You just need to set the neighborhood radius eps to delta and the min_samples parameter to 1 to allow isolated points as clusters. Then, you can find in each group which point has the maximum size.
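from sklearn.cluster import DBSCAN
X = df[['X', 'Y']]
# eps is the neighborhood radius (delta); min_samples=1 lets an isolated point form its own cluster
db = DBSCAN(eps=3, min_samples=1).fit(X)
df['grp'] = db.labels_
# keep the row with the largest Size in each cluster
df_new = df.loc[df.groupby('grp')['Size'].idxmax()]
print(df_new)
>>>
    X   Y  Size  grp
0  10  20     5    0
2   9  35     1    1
3   8   7     7    2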
You can use the script below as a starting point and try to improve it.
import pandas as pd
from itertools import chain
from sklearn.metrics.pairwise import euclidean_distances
# get all pairwise Euclidean distances using sklearn (an n x n array),
# then collect the index pairs whose distance is less than 3
Z = df[['X', 'Y']]
euc = euclidean_distances(Z, Z)
idx = [(i, j) for i in range(len(euc)-1) for j in range(i+1, len(euc)) if euc[i, j] < 3]
# collect every row that is closer than 3 to some other row and find the max Size among them;
# note that this keeps one row for all close points together, so it assumes they form a single group
# then build df_new from the rows that are not close to anything plus that max-Size row
df_idx = list(set(chain(*idx)))
df2 = df.iloc[df_idx]
idx_max = df2[df2['Size'] == df2['Size'].max()].index.tolist()
# idx_max holds index labels, so select with .loc rather than .iloc
df_new = pd.concat([df.loc[~df.index.isin(df_idx)], df2.loc[idx_max]])
df_new
Result:
X Y Size
2 9 35 1
3 8 7 7
0 10 20 5
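For a dataframe of several hundred MB the full pairwise distance matrix will not fit in memory. A possible alternative, sketched here under the assumption that chaining is not wanted, is to use a k-d tree from scipy and walk the rows from largest to smallest Size, dropping everything within delta of a row that has been kept. The function name thin_points is made up, and query_ball_point treats a distance equal to r as a match, whereas the question asks for strictly less than delta.

```python
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

def thin_points(df, delta=3.0):
    """Greedy thinning: larger rows win, rows within delta of a winner are dropped."""
    coords = df[["X", "Y"]].to_numpy(dtype=float)
    tree = cKDTree(coords)
    # positions sorted by Size, largest first (stable keeps ties in row order)
    order = np.argsort(-df["Size"].to_numpy(), kind="stable")
    handled = np.zeros(len(df), dtype=bool)
    keep = []
    for pos in order:
        if handled[pos]:
            continue  # already removed by a larger neighbour
        keep.append(pos)
        # mark the kept row and all rows within delta of it as handled
        for nb in tree.query_ball_point(coords[pos], r=delta):
            handled[nb] = True
    return df.iloc[sorted(keep)]

df = pd.DataFrame({"X": [10, 11, 9, 8, 9],
                   "Y": [20, 21, 35, 7, 19],
                   "Size": [5, 2, 1, 7, 2]})
result = thin_points(df)
print(result)
```

The Python-level loop is the slow part for very large inputs, so this is a starting point rather than a tuned solution.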

How to iterate over a range of permutations? [duplicate]

Say you have n items, each ranging from 1-100. How can I go over all possible variations within the range?
Example:
3 stocks A, B and C
Working to find possible portfolio allocation.
A -   0   0   0  ...   1   2  ...   1   1
B -   0   1   2  ...   0   0  ...   1   2
C - 100  99  98  ...  99  98  ...  98  97
Looking for an efficient way to get a matrix of all possible outcomes.
Sum should add up to 100 and cover all possible variations for n elements.
How I'd do it:
>>> import itertools
>>> cp = itertools.product(range(101),repeat=3)
>>> portfolios = list(p for p in cp if sum(p)==100)
But that creates unnecessary combinations. See discussions of integer partitioning to avoid that. E.g., Elegant Python code for Integer Partitioning
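As a sketch of how to avoid the wasted candidates, the allocations can be generated directly: fix the first weight, then recurse on the remaining total with one fewer item. The function name allocations is made up; it yields ordered tuples (so (0, 1, 99) and (1, 0, 99) are both produced), which matches the portfolio example where each position belongs to a specific stock.

```python
def allocations(total, n):
    """Yield every tuple of n non-negative integers that sums to total."""
    if n == 1:
        # the last item must take whatever is left
        yield (total,)
        return
    for first in range(total + 1):
        for rest in allocations(total - first, n - 1):
            yield (first,) + rest

portfolios = list(allocations(100, 3))
print(len(portfolios))
```

For three items this produces 5151 tuples, compared with the 101**3 candidates that the itertools.product version has to filter.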

Find Repeating Sublist Within Large List

I have a large list of sub-lists (approx. 16000) in which I want to find where the repeating pattern starts and ends. I am not 100% sure that there is a repeat, but I have a strong reason to believe so because of the diagonals that appear within the sub-list sequence. The structure of a list of sub-lists is preferred, as it is used that way for other things in this script. The data looks like this:
data = ['1100100100000010',
'1001001000000110',
'0010010000001100',
'0100100000011011', etc
I do not have any time constraints, but the fastest method would not be frowned upon. The code should be able to return the starting/ending sequence and its location within the list, to be called upon in the future. If there is an arrangement of the data that would be more useful, I can try to reformat it if necessary. Python is something that I have been learning for the past few months, so I am not yet able to create my own algorithms from scratch. Thank you!
Here's some fairly simple code that scans a string for adjacent repeating subsequences. Set minrun to the length of the smallest subsequences that you want to check. For each match, the code prints the starting index of the first subsequence, the length of the subsequence, and the subsequence itself.
data = [
    '1100100100000010',
    '1001001000000110',
    '0010010000001100',
    '0100100000011011',
]
data = ''.join(data)
minrun = 3
lendata = len(data)
for runlen in range(minrun, lendata // 2):
    i = 0
    while i < lendata - runlen * 2:
        s1 = data[i:i + runlen]
        s2 = data[i + runlen:i + runlen * 2]
        if s1 == s2:
            print(i, runlen, s1)
            i += runlen
        else:
            i += 1
output
1 3 100
4 3 100
8 3 000
15 3 010
18 3 010
23 3 000
32 3 001
38 3 000
47 3 001
53 3 000
17 15 001001000000110
32 15 001001000000110
Note that we get the same sequence of length 3 at index 15 and 18 = 15 + 3 : 010; that indicates that there are 3 adjacent copies of 010. Similarly, there are 3 adjacent copies of the sequence at index 17 of length 15.
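If the repeat is expected at the level of whole rows rather than characters, the same adjacent-copy idea can be applied to the list directly, so that the result is a row index and a period measured in rows. This is a sketch under that assumption; the function name find_cycle and the toy input are made up, and the brute-force search is not tuned for speed.

```python
def find_cycle(rows, min_repeats=2):
    """Return (start, period) of the first block of rows that is immediately
    followed by copies of itself, min_repeats copies in total, or None."""
    n = len(rows)
    for period in range(1, n // min_repeats + 1):
        for start in range(n - period * min_repeats + 1):
            block = rows[start:start + period]
            # compare the block with each of the following copies
            if all(rows[start + k * period:start + (k + 1) * period] == block
                   for k in range(1, min_repeats)):
                return start, period
    return None

rows = ['a', 'b', 'c', 'd', 'b', 'c', 'd', 'b']
cycle = find_cycle(rows)
print(cycle)
```

Shorter periods are tried first, so the function reports the smallest repeating block it can find, and rows[start:start + period] gives the repeating sequence itself.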
