In my algorithm, I am finding graphs at different thresholds. Each graph G = (V,E). These are undirected graphs found using breadth first search. I would like to determine if the vertices of another graph G' = (V',E') lie within graph G. I am unfamiliar with graph algorithms so please let me know if you would like to see code or a more thorough explanation.
For example, if I have a graph G1 which is a square with 'corner' vertices (among others, but reduced for simplicity) of {(1,1), (1,6), (6,6), (6,1)}, then a smaller square G2 defined by corner vertices {(2,2), (2,5), (5,5), (5,2)} would lie within G1, and a third, even smaller square G3 defined by corners {(3,3), (3,4), (4,4), (4,3)} would lie within G2. My algorithm produces the following figure for this configuration:
A square thresholded at t=2, surrounded by t=1, surrounded by t=0. (I need to fix the edges, but the vertices are correct.)
My algorithm works on the following matrix:
import numpy as np

A = np.zeros((7,7))
#A[A<1] = -1
for i in np.arange(1,6):
    for j in np.arange(1,6):
        A[i,j] = 1
for i in np.arange(2,5):
    for j in np.arange(2,5):
        A[i,j] = 2
for i in np.arange(3,4):
    for j in np.arange(3,4):
        A[i,j] = 3
print(A)
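which prints the three nested threshold levels:

[[0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 1. 1. 1. 1. 0.]
 [0. 1. 2. 2. 2. 1. 0.]
 [0. 1. 2. 3. 2. 1. 0.]
 [0. 1. 2. 2. 2. 1. 0.]
 [0. 1. 1. 1. 1. 1. 0.]
 [0. 0. 0. 0. 0. 0. 0.]]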
From this matrix my algorithm creates three graphs: the first at threshold 2, the second at threshold 1, and the third at threshold 0, with vertex lists:
v1 = [[(3.0, 2.25), (3.0, 3.75), (2.25, 3.0), (3.75, 3.0)]]
v2 = [[(2.0, 1.333333), (1.333333, 3.0), (1.333333, 2.0), (1.333333, 4.0), (2.0, 4.666667), (3.0, 4.666667), (4.0, 4.666667), (4.666667, 4.0), (4.666667, 3.0), (4.666667, 2.0), (4.0, 1.333333), (3.0, 1.333333)]]
v3 = [[(1.0, 0.5), (0.5, 2.0), (0.5, 1.0), (0.5, 3.0), (0.5, 4.0), (0.5, 5.0), (1.0, 5.5), (2.0, 5.5), (3.0, 5.5), (4.0, 5.5), (5.0, 5.5), (5.5, 5.0), (5.5, 4.0), (5.5, 3.0), (5.5, 2.0), (5.5, 1.0), (5.0, 0.5), (4.0, 0.5), (3.0, 0.5), (2.0, 0.5)]]
And edge lists:
e1 = [[[2.25, 3.0], [3.0, 2.25]], [[3.0, 3.75], [2.25, 3.0]], [[3.0, 2.25], [3.75, 3.0]], [[3.0, 3.75], [3.75, 3.0]]]
e2 = [[[1.333333, 2.0], [2.0, 1.333333]], [[1.333333, 3.0], [1.333333, 2.0]], [[1.333333, 4.0], [1.333333, 3.0]], [[2.0, 4.666667], [1.333333, 4.0]], [[2.0, 1.333333], [3.0, 1.333333]], [[2.0, 4.666667], [3.0, 4.666667]], [[3.0, 1.333333], [4.0, 1.333333]], [[3.0, 4.666667], [4.0, 4.666667]], [[4.0, 1.333333], [4.666667, 2.0]], [[4.666667, 3.0], [4.666667, 2.0]], [[4.666667, 4.0], [4.666667, 3.0]], [[4.0, 4.666667], [4.666667, 4.0]]]
e3 = [[[0.5, 1.0], [1.0, 0.5]], [[0.5, 2.0], [0.5, 1.0]], [[0.5, 3.0], [0.5, 2.0]], [[0.5, 4.0], [0.5, 3.0]], [[0.5, 5.0], [0.5, 4.0]], [[1.0, 5.5], [0.5, 5.0]], [[1.0, 0.5], [2.0, 0.5]], [[1.0, 5.5], [2.0, 5.5]], [[2.0, 0.5], [3.0, 0.5]], [[2.0, 5.5], [3.0, 5.5]], [[3.0, 0.5], [4.0, 0.5]], [[3.0, 5.5], [4.0, 5.5]], [[4.0, 0.5], [5.0, 0.5]], [[4.0, 5.5], [5.0, 5.5]], [[5.0, 0.5], [5.5, 1.0]], [[5.5, 2.0], [5.5, 1.0]], [[5.5, 3.0], [5.5, 2.0]], [[5.5, 4.0], [5.5, 3.0]], [[5.5, 5.0], [5.5, 4.0]], [[5.0, 5.5], [5.5, 5.0]]]
Again, this gives graphs that look like the figure above.
This is the real data I am working on: more complicated shapes.
Here, for example, I have a red shape inside of a green shape. Ideally, shapes that lie within other shapes would be grouped together in one object (say, an array of graphs).
The graphs are connected in a clockwise fashion. I don't really know how to describe it, but perhaps the plots above show this. There's a bug on two of the lines (visible in the top-right corner of the first plot), but the vertices are correct.
Hope this helps! I can attach a full working example, but it would include my whole algorithm and run to pages, with many functions! I basically want to input either g1, g2, and g3 (or e1, e2, and e3) into a function. The function would tell me that g3 is contained within g2, which is contained within g1.
Your problem really does not have much to do with networks. Fundamentally, you are trying to determine if a point is inside a region described by an ordered list of points. The simplest way to do this is to create a matplotlib Path, which has a contains_point method (there is also a contains_points method to test many points simultaneously).
#!/usr/bin/env python
"""
Determine if a point is within the area defined by a path.
"""
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.path import Path
from matplotlib.patches import PathPatch
point = [0.5, 0.5]
vertices = np.array([
    [0, 0],
    [0, 1],
    [1, 1],
    [1, 0],
    [0, 0],  # NOTE the repetition of the first vertex
])
path = Path(vertices, closed=True)
print(path.contains_point(point))
# True
# plot to check visually
fig, ax = plt.subplots(1, 1)
ax.add_patch(PathPatch(path))
ax.plot(point[0], point[1], 'ro')
plt.show()
Note that if a point is directly on the path, it is not considered inside the path. However, contains_point supports a radius argument that lets you add an increment to the extent of the area. Whether you need a positive or negative increment depends on the ordering of the points. IIRC, radius shifts the path to the left, relative to the direction of travel along the path, but don't quote me on that.
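For instance, a quick sketch reusing the path from above (the radius magnitude is arbitrary, and the sign you need may be the opposite, depending on your vertex ordering):

edge_point = [0.0, 0.5]  # lies exactly on the left edge of the square
print(path.contains_point(edge_point))               # typically False
print(path.contains_point(edge_point, radius=0.1))   # one of these two may flip to True,
print(path.contains_point(edge_point, radius=-0.1))  # depending on the path's orientation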
Related
I have a clustering problem in which I have to split a set S of samples into C clusters where C is known. Normally, I am able to perform the clustering operation with a simple KMeans clustering, which works just fine.
To complicate things, I have a known set D of pairs of samples that cannot under any circumstances be assigned to the same cluster. Currently I am not using this information and the clustering still works fine, but I would like to introduce it to improve robustness, since it comes for free from the problem I am trying to solve.
Example: S consists of 20 samples with 5 features each, C is 3, and D forces the following pairs {(1, 3), (3, 5), (10, 19)} to be in different clusters.
I am looking for a solution in python3, preferably with numpy/sklearn/scipy.
Do you know if there is some out-of-the-box clustering algorithm that takes into account this kind of constraint? I have looked into sklearn but found no such thing.
This sounds exactly like semi-supervised clustering with pairwise constraints. In it, the unsupervised k-means clustering is augmented by (imperfect) supervision through pairwise constraints for a subset of the data. Your particular example is a cannot-link constraint. In addition, must-link constraints could be added as well.
Unfortunately, most implementations I encountered in Python are rather brittle. For example, the Python library active-semi-supervised-clustering allows you to add ml (must-link) and cl (cannot-link) relations just as you describe. The code is:
import numpy as np
from matplotlib import pyplot as plt
from active_semi_clustering.semi_supervised.pairwise_constraints import PCKMeans
# data
S = [[-0.2, -1.0], [3.3, 3.9], [-2.0, 0.6], [2.3, -0.8], [1.1, 1.9], [2.8, -0.3], [4.2, 2.6], [1.8, 6.8], [1.4, -0.7], [2.6, 1.8], [2.6, 5.4], [0.8, -0.6], [3.0, 1.4], [-0.6, -0.4], [0.3, -0.2], [0.8, -0.4], [4.8, 5.1], [2.4, 5.2], [2.3, 5.3], [0.9, 0.3], [2.8, 4.1], [1.4, -0.7], [2.7, 5.6], [0.8, 0.8], [1.9, 5.3], [2.3, 5.3], [2.1, 0.5], [3.1, 5.3], [2.3, 0.8], [-0.2, -0.0], [2.4, 0.0], [3.6, -0.5], [1.3, -0.4], [3.0, 4.6], [0.4, -0.1], [-2.3, -1.4], [-1.9, -1.9], [4.2, 5.4], [-1.3, -0.9], [2.7, 0.2], [1.9, 6.5], [2.8, -0.8], [0.0, -0.3], [3.2, 5.9], [1.7, 4.6], [2.3, -0.3], [2.9, 1.2], [3.5, 2.0], [1.2, 2.3], [2.0, 1.5], [4.2, 5.8], [0.7, -2.0], [-0.8, -0.9], [4.7, 0.7], [-1.2, -1.8], [3.5, 5.1], [2.6, 0.7], [1.1, 3.0], [1.9, 6.5], [2.5, 6.5], [2.2, -0.2], [-0.9, -0.3], [3.1, 4.1], [-0.7, -0.3], [4.1, 5.2], [2.6, 0.8], [4.0, 3.5], [4.2, 4.3], [3.1, 1.1], [0.9, -0.1], [-0.3, 1.2], [0.2, -0.8], [0.1, -1.1], [0.4, -1.1], [-0.1, -0.7]]
S = np.array([np.array(s) for s in S])
# no. of clusters
C = 3
# constraints (indices of points in S)
D = [(1, 3), (3, 5), (10, 19), (7, 11), (4, 6)]
# color plots
colDict = {0: '#fc6A03', 1: 'green', 2: '#006699'}
plt.title('Input Data ($S$)', fontsize=20)
plt.scatter(x=[s[0] for s in list(S)], y=[s[1] for s in list(S)], c='darkgrey')
plt.show()
# Naïve Clustering
clust = PCKMeans(n_clusters=C, max_iter=1000)
clust.fit(S, cl=[], ml=[])
plt.title('Naïve (unconstrained) k-Means', fontsize=18)
plt.scatter(x=[s[0] for s in list(S)], y=[s[1] for s in list(S)], c=[colDict[c] for c in clust.labels_])
plt.show()
# Constr. Clustering
const_clust = PCKMeans(n_clusters=C, max_iter=10000)
const_clust.fit(S, ml=[], cl=D)
plt.title('Constrained k-Means', fontsize=18)
plt.scatter(x=[s[0] for s in S.tolist()], y=[s[1] for s in S.tolist()], c=[colDict[c] for c in const_clust.labels_])
plt.show()
which yields the plots above.
Although the constrained plot looks different, checking whether the cannot-link constraints are indeed met results in
[const_clust.labels_[d[0]] != const_clust.labels_[d[1]] for d in D]
>[True, False, True]
indicating that points with index 3 and 5 were assigned the same cluster label. Not good. However, the sample size and the distribution of the data points across the feature space seem to impact this greatly. Potentially, you will see no adverse effects when you apply it to your actual data.
Unfortunately, the repository does not allow you to set a seed (to make the iterative estimation procedure reproducible) and ignores one set via np.random.seed(567). Beware of reproducibility issues and rerun the code several times.
Other repositories, such as scikit-learn, suggest that some clustering routines may accommodate constraints, but do not document how this can be done.
Note that there are other variants of constrained k-means clustering, e.g. where the pairwise constraints are not certain (see this reference) or where the number of data points per cluster is constrained (see this Python library).
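If the available implementations prove too brittle for your data, one hedged fallback (my own suggestion, not part of any of the libraries above) is to rerun plain sklearn KMeans with different seeds and keep the first labelling that happens to satisfy the cannot-link pairs:

from sklearn.cluster import KMeans

def kmeans_with_retry(S, C, cannot_link, n_tries=50):
    # rerun k-means with different seeds until every cannot-link
    # pair lands in different clusters; return None if no try succeeds
    for seed in range(n_tries):
        labels = KMeans(n_clusters=C, n_init=10, random_state=seed).fit_predict(S)
        if all(labels[i] != labels[j] for i, j in cannot_link):
            return labels
    return None

labels = kmeans_with_retry(S, C, D)  # S, C, D as defined above
print(labels is not None)

Note this ignores the constraints during fitting and only checks them afterwards, so it is a pragmatic workaround rather than true constrained clustering.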
How can I determine whether one graph lies within another?
My algorithm works on the following matrix:
import numpy as np

A = np.zeros((9,9))
for i in np.arange(1,8):
    for j in np.arange(1,8):
        A[i,j] = 1
for i in np.arange(2,4):
    for j in np.arange(2,4):
        A[i,j] = 2
A[A == 0] = -1  # mark the background; this step is implied by the printed matrix below
print(A)
yields the matrix:
[[-1. -1. -1. -1. -1. -1. -1. -1. -1.]
[-1. 1. 1. 1. 1. 1. 1. 1. -1.]
[-1. 1. 2. 2. 1. 1. 1. 1. -1.]
[-1. 1. 2. 2. 1. 1. 1. 1. -1.]
[-1. 1. 1. 1. 1. 1. 1. 1. -1.]
[-1. 1. 1. 1. 1. 1. 1. 1. -1.]
[-1. 1. 1. 1. 1. 1. 1. 1. -1.]
[-1. 1. 1. 1. 1. 1. 1. 1. -1.]
[-1. -1. -1. -1. -1. -1. -1. -1. -1.]]
To create two graphs:
With vertices:
V1 = [[(2.0, 1.333333), (1.333333, 3.0), (1.333333, 2.0), (2.0, 3.666667), (3.0, 3.666667), (3.666667, 3.0), (3.666667, 2.0), (3.0, 1.333333)]]
V2 = [[(1.0, 0.5), (0.5, 2.0), (0.5, 1.0), (0.5, 3.0), (0.5, 4.0), (0.5, 5.0), (0.5, 6.0), (0.5, 7.0), (1.0, 7.5), (2.0, 7.5), (3.0, 7.5), (4.0, 7.5), (5.0, 7.5), (6.0, 7.5), (7.0, 7.5), (7.5, 7.0), (7.5, 6.0), (7.5, 5.0), (7.5, 4.0), (7.5, 3.0), (7.5, 2.0), (7.5, 1.0), (7.0, 0.5), (6.0, 0.5), (5.0, 0.5), (4.0, 0.5), (3.0, 0.5), (2.0, 0.5)]]
And edge lists:
e1 = [[[1.333333, 2.0], [2.0, 1.333333]], [[1.333333, 3.0], [1.333333, 2.0]], [[2.0, 3.666667], [1.333333, 3.0]], [[2.0, 1.333333], [3.0, 1.333333]], [[2.0, 3.666667], [3.0, 3.666667]], [[3.0, 1.333333], [3.666667, 2.0]], [[3.666667, 3.0], [3.666667, 2.0]], [[3.0, 3.666667], [3.666667, 3.0]]]
e2 = [[[0.5, 1.0], [1.0, 0.5]], [[0.5, 2.0], [0.5, 1.0]], [[0.5, 3.0], [0.5, 2.0]], [[0.5, 4.0], [0.5, 3.0]], [[0.5, 5.0], [0.5, 4.0]], [[0.5, 6.0], [0.5, 5.0]], [[0.5, 7.0], [0.5, 6.0]], [[1.0, 7.5], [0.5, 7.0]], [[1.0, 0.5], [2.0, 0.5]], [[1.0, 7.5], [2.0, 7.5]], [[2.0, 0.5], [3.0, 0.5]], [[2.0, 7.5], [3.0, 7.5]], [[3.0, 0.5], [4.0, 0.5]], [[3.0, 7.5], [4.0, 7.5]], [[4.0, 0.5], [5.0, 0.5]], [[4.0, 7.5], [5.0, 7.5]], [[5.0, 0.5], [6.0, 0.5]], [[5.0, 7.5], [6.0, 7.5]], [[6.0, 0.5], [7.0, 0.5]], [[6.0, 7.5], [7.0, 7.5]], [[7.0, 0.5], [7.5, 1.0]], [[7.5, 2.0], [7.5, 1.0]], [[7.5, 3.0], [7.5, 2.0]], [[7.5, 4.0], [7.5, 3.0]], [[7.5, 5.0], [7.5, 4.0]], [[7.5, 6.0], [7.5, 5.0]], [[7.5, 7.0], [7.5, 6.0]], [[7.0, 7.5], [7.5, 7.0]]]
As Prune suggests, the shapely package has what you need. While your line loops can be thought of as a graph, it's more useful to consider them as polygons embedded in the 2D plane.
By creating Polygon objects from your points and edge segments, you can use the contains method that all shapely objects have to test if one is inside the other.
You'll need to sort the edge segments into order. Clockwise or anti-clockwise probably doesn't matter, as shapely likely detects inside and outside by constructing a point at infinity and ensuring that it is 'outside'. A sketch of one way to do that sorting is below.
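Here is a minimal sketch of that sorting step (chain_edges is my own helper, not part of shapely); it assumes the segments form a single closed loop, as your e1 and e2 lists do:

def chain_edges(edges):
    # map each endpoint to its (two) neighbouring endpoints
    adj = {}
    for p, q in edges:
        adj.setdefault(tuple(p), []).append(tuple(q))
        adj.setdefault(tuple(q), []).append(tuple(p))
    # walk the loop from an arbitrary starting vertex
    start = tuple(edges[0][0])
    ring, prev, cur = [start], None, start
    while True:
        nxt = next(v for v in adj[cur] if v != prev)
        if nxt == start:
            break
        ring.append(nxt)
        prev, cur = cur, nxt
    return ring

from shapely.geometry import Polygon
inner = Polygon(chain_edges(e1))  # e1, e2 as given above
outer = Polygon(chain_edges(e2))
print(outer.contains(inner))
# True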
Here's a full example with a pair of nested squares like the ones from your post:
from shapely.geometry import Polygon
p1 = Polygon([(0,0), (0,8), (8,8), (8,0)])
p2 = Polygon([(2,2), (2,4), (4,4), (4,2)])
print(p1.contains(p2))  # True
Documentation for the Polygon object is at https://shapely.readthedocs.io/en/latest/manual.html#Polygon
and for the contains method at https://shapely.readthedocs.io/en/latest/manual.html#object.contains
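Putting this together for the three nested squares from the original example, a short sketch (using the corner coordinates listed in the question) that reports the whole containment hierarchy:

from shapely.geometry import Polygon

polys = {
    'g1': Polygon([(1,1), (1,6), (6,6), (6,1)]),
    'g2': Polygon([(2,2), (2,5), (5,5), (5,2)]),
    'g3': Polygon([(3,3), (3,4), (4,4), (4,3)]),
}
for name_a, a in polys.items():
    for name_b, b in polys.items():
        if name_a != name_b and a.contains(b):
            print(name_a, 'contains', name_b)
# g1 contains g2
# g1 contains g3
# g2 contains g3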
You can use ray-casting to figure this out. It's a common method for dealing with this problem, so you can find additional information on it elsewhere too. The general description of the algorithm is this:
G1 and G2 are graphs whose edges form simple polygons, and we are attempting to determine whether G2 is inside G1.
Choose some arbitrary direction in your space.
For each vertex in G2, cast a ray (a line that starts from one point and extends infinitely in a single direction) in the direction you chose.
If the ray from a vertex (a) intersects the edges of G1 an odd number of times, or (b) the vertex lies on one of those edges, then the vertex is inside G1. In all other cases, the vertex is not inside G1.
G2 is inside G1 if and only if each vertex of G2 is inside G1.
This will involve the following subtasks:
- Getting a list of vertices for G2
- Casting the rays
- Detecting and counting intersections
If you loop through each vertex and draw a line by adding the value you are using to represent G2 to all cells of your matrix in the chosen direction, the intersection value would then simply be the sum of the values you are using to represent G1 and G2. In your current case, because you're making squares, this is a little problematic; there may be a better algorithm for drawing the objects, or a better way to detect intersections.
Lastly, to detect whether a vertex lies on an edge of G1, run the check for intersections BEFORE you cast the rays. If any of your vertices produce the intersection value before the ray casting, that tells you the vertex is on an edge of G1. Mark that vertex as inside G1, remove it from the list of vertices that need to be checked, and make note of its value so it doesn't get counted as an extra intersection by the rays cast from the remaining vertices.
You may have to tweak this algorithm depending on, for example, whether you want to count vertices on the edge as inside or outside, or whether you need all vertices inside the figure, but I hope this is a helpful start. A minimal implementation of the core test is sketched below.
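As a concrete sketch of the even-odd rule described above, operating on plain vertex lists rather than on your matrix (my own minimal implementation; it does not handle degenerate cases such as a ray passing exactly through a vertex):

def point_in_polygon(x, y, poly):
    # cast a horizontal ray to the right of (x, y) and count
    # how many polygon edges it crosses (even-odd rule)
    inside = False
    n = len(poly)
    for i in range(n):
        (x1, y1), (x2, y2) = poly[i], poly[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the ray's height
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > x:       # crossing lies to the right
                inside = not inside
    return inside

def polygon_inside(inner, outer):
    # G2 is inside G1 iff every vertex of G2 is inside G1
    # (sufficient here because the loops do not cross)
    return all(point_in_polygon(x, y, outer) for x, y in inner)

outer = [(1, 1), (1, 6), (6, 6), (6, 1)]
inner = [(2, 2), (2, 5), (5, 5), (5, 2)]
print(polygon_inside(inner, outer))
# True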
I have two lists of lists, each sorted with respect to the first item of its inner lists (which represents a timestamp), containing data like this: [[time0, voltage0], [time1, voltage1], ...]
l1 =[[0,0],[1,1],[2,2],[3,3]]
l2 =[[0,0],[0.5,0.5],[1,1.2],[1.5,1.5],[2,2]]
The goal is to produce a single list of lists containing the elements from both lists, sorted with respect to the first item of the inner lists, BUT
if there is an item whose timestamp is the same in both lists, the final list should contain only one of them (the item from the first list).
For the example above, the output should be:
result = [[0,0],[0.5,0.5],[1,1],[1.5,1.5],[2,2],[3,3]]
I've tried to save a reference in each element specifying which list it came from, and then to go over the combined list to find duplicates and delete the ones that came from the second list, but finding duplicates isn't working, since ["first",0,0] isn't a duplicate of ["second",0,0].
# examples of lists
from operator import itemgetter

lFirst = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0], [5.0, 5.0]]
lSecond = [[0.0, 0.0], [0.5, 0.5], [1.0, 1.2], [1.5, 1.5], [2.0, 2.0], [2.5, 2.5], [3.0, 3.0], [3.5, 3.5], [4.0, 4.0], [4.5, 4.5]]
print("first list: {}".format(lFirst))
print("second list: {}".format(lSecond))
# this sorts but does not remove the duplicate timestamps
res = sorted(lFirst + lSecond, key=itemgetter(0))
print(res)
One way is to concatenate your lists, with l2 coming first, then create a dictionary from the result and sort its items():
print([list(x) for x in sorted(dict(l2 + l1).items())])
#[[0, 0], [0.5, 0.5], [1, 1], [1.5, 1.5], [2, 2], [3, 3]]
This works because dictionary keys are unique. You start with a key-value pair from l2, but if the key (timestamp) also exists in l1 it gets updated.
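Applied to the longer lFirst/lSecond lists from your attempt, the same trick gives:

lFirst = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0], [5.0, 5.0]]
lSecond = [[0.0, 0.0], [0.5, 0.5], [1.0, 1.2], [1.5, 1.5], [2.0, 2.0], [2.5, 2.5], [3.0, 3.0], [3.5, 3.5], [4.0, 4.0], [4.5, 4.5]]

# lSecond comes first, so shared timestamps are overwritten by lFirst's values
merged = [list(x) for x in sorted(dict(lSecond + lFirst).items())]
print(merged)
# [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0], [1.5, 1.5], [2.0, 2.0], [2.5, 2.5],
#  [3.0, 3.0], [3.5, 3.5], [4.0, 4.0], [4.5, 4.5], [5.0, 5.0]]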
You could remove all duplicates from the second list before merging.
lFirst = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0], [5.0, 5.0]]
lSecond = [[0.0, 0.0], [0.5, 0.5], [1.0, 1.2], [1.5, 1.5], [2.0, 2.0], [2.5, 2.5], [3.0, 3.0], [3.5, 3.5], [4.0, 4.0], [4.5, 4.5]]
print("first list: {0}".format(lFirst))
print("second list: {0}".format(lSecond))
lFirstTimes = [x[0] for x in lFirst]
lSecondFiltered = [x for x in lSecond if x[0] not in lFirstTimes]
print("second list without duplicates: {0}".format(lSecondFiltered))
res = lFirst+lSecondFiltered
res.sort()
print(res)
You can use heapq.merge (doc) to merge the lists and itertools.groupby (doc) to group the elements.
The list that comes first in merge() gets priority:
from heapq import merge
from itertools import groupby
from pprint import pprint

l1 = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0], [5.0, 5.0]]
l2 = [[0.0, 0.0], [0.5, 0.5], [1.0, 1.2], [1.5, 1.5], [2.0, 2.0], [2.5, 2.5], [3.0, 3.0], [3.5, 3.5], [4.0, 4.0], [4.5, 4.5]]

# merge keeps both lists' items in timestamp order; groupby then collapses
# items sharing a timestamp, and next(g) keeps the first one (from l1)
out = [next(g) for _, g in groupby(merge(l1, l2, key=lambda k: k[0]), lambda k: k[0])]
pprint(out)
Prints:
[[0.0, 0.0],
[0.5, 0.5],
[1.0, 1.0],
[1.5, 1.5],
[2.0, 2.0],
[2.5, 2.5],
[3.0, 3.0],
[3.5, 3.5],
[4.0, 4.0],
[4.5, 4.5],
[5.0, 5.0]]
EDIT: This works in Python 3.5+ (in Python 2.7, merge() doesn't have a key= argument).
When reading the documentation for pd.qcut, I simply couldn't understand it, particularly its examples; one of them is below:
>>> pd.qcut(range(5), 4)
... # doctest: +ELLIPSIS
[(-0.001, 1.0], (-0.001, 1.0], (1.0, 2.0], (2.0, 3.0], (3.0, 4.0]]
Categories (4, interval[float64]): [(-0.001, 1.0] < (1.0, 2.0] ...
Why did it return 5 elements in the list (although the code specifies 4 buckets), and why are the first 2 elements the same, (-0.001, 1.0]?
Thanks.
Because 0 is in (-0.001, 1], and so is 1.
range(5) # [0, 1, 2, 3, 4]
The corresponding categories of [0, 1, 2, 3, 4] are [(-0.001, 1.0], (-0.001, 1.0], (1.0, 2.0], (2.0, 3.0], (3.0, 4.0]].
Look at the range:
list(range(5))
Out[116]: [0, 1, 2, 3, 4]
It returns 5 numbers; when you do qcut, 0 and 1 fall into the same interval:
pd.qcut(range(5), 4)
Out[115]:
[(-0.001, 1.0], (-0.001, 1.0], (1.0, 2.0], (2.0, 3.0], (3.0, 4.0]]
Categories (4, interval[float64]): [(-0.001, 1.0] < (1.0, 2.0] < (2.0, 3.0] < (3.0, 4.0]]
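If it helps to see the bin edges qcut computed, retbins=True returns them alongside the categories (a quick check using the same call as above):

import pandas as pd

cats, bins = pd.qcut(range(5), 4, retbins=True)
print(bins)  # the five quantile edges delimiting the four buckets
print(pd.Series(cats).value_counts(sort=False))
# the first interval holds two values (0 and 1); the other three hold one each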