Confusing result for panda qcut function - python

When reading the documentation for pd.qcut, I simply couldn't understand it, particularly its examples. One of them is below:
>>> pd.qcut(range(5), 4)
... # doctest: +ELLIPSIS
[(-0.001, 1.0], (-0.001, 1.0], (1.0, 2.0], (2.0, 3.0], (3.0, 4.0]]
Categories (4, interval[float64]): [(-0.001, 1.0] < (1.0, 2.0] ...
Why did it return 5 elements in the list (although the code specifies 4 buckets), and why are the first 2 elements the same, (-0.001, 1.0]?
Thanks.

Because 0 is in (-0.001, 1], and so is 1.
list(range(5)) # [0, 1, 2, 3, 4]
The corresponding category of each of [0, 1, 2, 3, 4] is [(-0.001, 1.0], (-0.001, 1.0], (1.0, 2.0], (2.0, 3.0], (3.0, 4.0]].

Look at the range
list(range(5))
Out[116]: [0, 1, 2, 3, 4]
it returns 5 numbers, so there are 5 labels in the qcut output; 0 and 1 are placed into the same bin.
pd.qcut(range(5), 4)
Out[115]:
[(-0.001, 1.0], (-0.001, 1.0], (1.0, 2.0], (2.0, 3.0], (3.0, 4.0]]
Categories (4, interval[float64]): [(-0.001, 1.0] < (1.0, 2.0] < (2.0, 3.0] < (3.0, 4.0]]
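One way to see why 0 and 1 end up in the same bin: qcut places its bin edges at quantiles of the data, which you can check directly with np.percentile (a sketch of the idea, not how pandas computes it internally):

```python
import numpy as np

data = list(range(5))  # [0, 1, 2, 3, 4] -- five values, so five bin labels
# qcut puts bin edges at the quantiles of the data; for 4 buckets these
# are the 0th, 25th, 50th, 75th and 100th percentiles:
edges = np.percentile(data, [0, 25, 50, 75, 100])
print(edges)  # [0. 1. 2. 3. 4.]
```

Intervals are closed on the right, and the lowest edge is stretched just below the minimum (hence -0.001), so both 0 and 1 fall into the first interval (-0.001, 1.0].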


Determining if vertices lie within a set of vertices

In my algorithm, I am finding graphs at different thresholds. Each graph G = (V,E). These are undirected graphs found using breadth first search. I would like to determine if the vertices of another graph G' = (V',E') lie within graph G. I am unfamiliar with graph algorithms so please let me know if you would like to see code or a more thorough explanation.
For example, if I have a graph G1 which is a square with 'corner' vertices (among others, but reduced for simplicity) of {(1,1), (1,6), (6,6), (6,1)}, then a smaller square G2 defined by corner vertices {(2,2), (2,5), (5,5), (5,2)} would lie within G1. A third graph G3, defined by corners {(3,3), (3,4), (4,4), (4,3)}, would in turn lie within G2. My algorithm produces the following figure for this configuration:
A square thresholded at 2, surrounded by t=1, surrounded by t=0. (I need to fix the edges but the vertices are correct)
My algorithm works on the following matrix:
import numpy as np
A = np.zeros((7,7))
#A[A<1] = -1
for i in np.arange(1,6):
    for j in np.arange(1,6):
        A[i,j] = 1
for i in np.arange(2,5):
    for j in np.arange(2,5):
        A[i,j] = 2
for i in np.arange(3,4):
    for j in np.arange(3,4):
        A[i,j] = 3
print(A)
To create three graphs, the first at threshold 2, the second at threshold 1, the third at threshold 0.
v1 = [[(3.0, 2.25), (3.0, 3.75), (2.25, 3.0), (3.75, 3.0)]]
v2 = [[(2.0, 1.333333), (1.333333, 3.0), (1.333333, 2.0), (1.333333, 4.0), (2.0, 4.666667), (3.0, 4.666667), (4.0, 4.666667), (4.666667, 4.0), (4.666667, 3.0), (4.666667, 2.0), (4.0, 1.333333), (3.0, 1.333333)]]
v3 = [[(1.0, 0.5), (0.5, 2.0), (0.5, 1.0), (0.5, 3.0), (0.5, 4.0), (0.5, 5.0), (1.0, 5.5), (2.0, 5.5), (3.0, 5.5), (4.0, 5.5), (5.0, 5.5), (5.5, 5.0), (5.5, 4.0), (5.5, 3.0), (5.5, 2.0), (5.5, 1.0), (5.0, 0.5), (4.0, 0.5), (3.0, 0.5), (2.0, 0.5)]]
And edge lists:
e1 = [[[2.25, 3.0], [3.0, 2.25]], [[3.0, 3.75], [2.25, 3.0]], [[3.0, 2.25], [3.75, 3.0]], [[3.0, 3.75], [3.75, 3.0]]]
e2 = [[[1.333333, 2.0], [2.0, 1.333333]], [[1.333333, 3.0], [1.333333, 2.0]], [[1.333333, 4.0], [1.333333, 3.0]], [[2.0, 4.666667], [1.333333, 4.0]], [[2.0, 1.333333], [3.0, 1.333333]], [[2.0, 4.666667], [3.0, 4.666667]], [[3.0, 1.333333], [4.0, 1.333333]], [[3.0, 4.666667], [4.0, 4.666667]], [[4.0, 1.333333], [4.666667, 2.0]], [[4.666667, 3.0], [4.666667, 2.0]], [[4.666667, 4.0], [4.666667, 3.0]], [[4.0, 4.666667], [4.666667, 4.0]]]
e3 = [[[0.5, 1.0], [1.0, 0.5]], [[0.5, 2.0], [0.5, 1.0]], [[0.5, 3.0], [0.5, 2.0]], [[0.5, 4.0], [0.5, 3.0]], [[0.5, 5.0], [0.5, 4.0]], [[1.0, 5.5], [0.5, 5.0]], [[1.0, 0.5], [2.0, 0.5]], [[1.0, 5.5], [2.0, 5.5]], [[2.0, 0.5], [3.0, 0.5]], [[2.0, 5.5], [3.0, 5.5]], [[3.0, 0.5], [4.0, 0.5]], [[3.0, 5.5], [4.0, 5.5]], [[4.0, 0.5], [5.0, 0.5]], [[4.0, 5.5], [5.0, 5.5]], [[5.0, 0.5], [5.5, 1.0]], [[5.5, 2.0], [5.5, 1.0]], [[5.5, 3.0], [5.5, 2.0]], [[5.5, 4.0], [5.5, 3.0]], [[5.5, 5.0], [5.5, 4.0]], [[5.0, 5.5], [5.5, 5.0]]]
Again, this gives graphs that look like this
This is the real data that I am working on. More complicated shapes.
Here, for example, I have a red shape inside of a green shape. Ideally, red shapes would lie within red shapes. They would be grouped together in one object (say an array of graphs).
The graphs are connected in a clockwise fashion. I really don't know how to describe it, but perhaps the graphs in the link show this. There's a bug on two of the lines (as you can see in the first plot, in the top right corner), but the vertices are correct.
Hope this helps! I can attach a full workable example, but it would include my whole algorithm and be pages long, with many functions! I basically want to pass g1, g2, and g3 (or e1, e2, and e3) into a function. The function would tell me that g3 is contained within g2, which is contained within g1.
Your problem really does not have much to do with networks. Fundamentally, you are trying to determine if a point is inside a region described by an ordered list of points. The simplest way to do this is to create a matplotlib Path, which has a contains_point method (there is also a contains_points method to test many points simultaneously).
#!/usr/bin/env python
"""
Determine if a point is within the area defined by a path.
"""
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.path import Path
from matplotlib.patches import PathPatch
point = [0.5, 0.5]
vertices = np.array([
    [0, 0],
    [0, 1],
    [1, 1],
    [1, 0],
    [0, 0]  # NOTE the repetition of the first vertex
])
path = Path(vertices, closed=True)
print(path.contains_point(point))
# True
# plot to check visually
fig, ax = plt.subplots(1, 1)
ax.add_patch(PathPatch(path))
ax.plot(point[0], point[1], 'ro')
Note that if a point is directly on the path, it is not inside the path. However, contains_point supports a radius argument that allows you to grow (or shrink) the extent of the area. Whether you need a positive or negative increment depends on the ordering of the points. IIRC, radius shifts the path to the left in the direction of the path, but don't quote me on that.
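To decide whether one whole ring lies inside another, you can test all of its vertices at once with contains_points. A minimal sketch, assuming the outer ring is a simple rectangle rather than the question's actual v2/v3 data:

```python
import numpy as np
from matplotlib.path import Path

# hypothetical outer ring (a rectangle), closed by repeating the first vertex
outer = Path(np.array([[0.5, 1.0], [0.5, 5.0], [5.5, 5.0],
                       [5.5, 1.0], [0.5, 1.0]]), closed=True)
# candidate vertices of an inner ring: two inside, one outside
inner_vertices = np.array([[2.0, 2.0], [3.0, 3.0], [6.0, 6.0]])
inside = outer.contains_points(inner_vertices)
print(inside)        # [ True  True False]
print(inside.all())  # the ring is contained only if every vertex is inside
```

This assumes the rings don't intersect each other, which holds for nested threshold contours like the ones in the question.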

Determine if vertices lie inside of a set of vertices

How can I determine whether one graph lies within another?
My algorithm works on the following matrix:
import numpy as np
A = np.zeros((9,9))
for i in np.arange(1,8):
    for j in np.arange(1,8):
        A[i,j] = 1
for i in np.arange(2,4):
    for j in np.arange(2,4):
        A[i,j] = 2
A[A < 1] = -1  # mark the remaining border cells, matching the printed matrix
print(A)
yields the matrix:
[[-1. -1. -1. -1. -1. -1. -1. -1. -1.]
[-1. 1. 1. 1. 1. 1. 1. 1. -1.]
[-1. 1. 2. 2. 1. 1. 1. 1. -1.]
[-1. 1. 2. 2. 1. 1. 1. 1. -1.]
[-1. 1. 1. 1. 1. 1. 1. 1. -1.]
[-1. 1. 1. 1. 1. 1. 1. 1. -1.]
[-1. 1. 1. 1. 1. 1. 1. 1. -1.]
[-1. 1. 1. 1. 1. 1. 1. 1. -1.]
[-1. -1. -1. -1. -1. -1. -1. -1. -1.]]
To create two graphs:
With vertices:
V1 = [[(2.0, 1.333333), (1.333333, 3.0), (1.333333, 2.0), (2.0, 3.666667), (3.0, 3.666667), (3.666667, 3.0), (3.666667, 2.0), (3.0, 1.333333)]]
V2 = [[(1.0, 0.5), (0.5, 2.0), (0.5, 1.0), (0.5, 3.0), (0.5, 4.0), (0.5, 5.0), (0.5, 6.0), (0.5, 7.0), (1.0, 7.5), (2.0, 7.5), (3.0, 7.5), (4.0, 7.5), (5.0, 7.5), (6.0, 7.5), (7.0, 7.5), (7.5, 7.0), (7.5, 6.0), (7.5, 5.0), (7.5, 4.0), (7.5, 3.0), (7.5, 2.0), (7.5, 1.0), (7.0, 0.5), (6.0, 0.5), (5.0, 0.5), (4.0, 0.5), (3.0, 0.5), (2.0, 0.5)]]
And edge lists:
e1 = [[[1.333333, 2.0], [2.0, 1.333333]], [[1.333333, 3.0], [1.333333, 2.0]], [[2.0, 3.666667], [1.333333, 3.0]], [[2.0, 1.333333], [3.0, 1.333333]], [[2.0, 3.666667], [3.0, 3.666667]], [[3.0, 1.333333], [3.666667, 2.0]], [[3.666667, 3.0], [3.666667, 2.0]], [[3.0, 3.666667], [3.666667, 3.0]]]
e2 = [[[0.5, 1.0], [1.0, 0.5]], [[0.5, 2.0], [0.5, 1.0]], [[0.5, 3.0], [0.5, 2.0]], [[0.5, 4.0], [0.5, 3.0]], [[0.5, 5.0], [0.5, 4.0]], [[0.5, 6.0], [0.5, 5.0]], [[0.5, 7.0], [0.5, 6.0]], [[1.0, 7.5], [0.5, 7.0]], [[1.0, 0.5], [2.0, 0.5]], [[1.0, 7.5], [2.0, 7.5]], [[2.0, 0.5], [3.0, 0.5]], [[2.0, 7.5], [3.0, 7.5]], [[3.0, 0.5], [4.0, 0.5]], [[3.0, 7.5], [4.0, 7.5]], [[4.0, 0.5], [5.0, 0.5]], [[4.0, 7.5], [5.0, 7.5]], [[5.0, 0.5], [6.0, 0.5]], [[5.0, 7.5], [6.0, 7.5]], [[6.0, 0.5], [7.0, 0.5]], [[6.0, 7.5], [7.0, 7.5]], [[7.0, 0.5], [7.5, 1.0]], [[7.5, 2.0], [7.5, 1.0]], [[7.5, 3.0], [7.5, 2.0]], [[7.5, 4.0], [7.5, 3.0]], [[7.5, 5.0], [7.5, 4.0]], [[7.5, 6.0], [7.5, 5.0]], [[7.5, 7.0], [7.5, 6.0]], [[7.0, 7.5], [7.5, 7.0]]]
As Prune suggests, the shapely package has what you need. While your line loops can be thought of as a graph, it's more useful to consider them as polygons embedded in the 2D plane.
By creating Polygon objects from your points and edge segments, you can use the contains method that all shapely objects have to test if one is inside the other.
You'll need to sort the edge segments into order. Clockwise or anti-clockwise probably doesn't matter as shapely likely detects inside and outside by constructing a point at infinity and ensuring that is 'outside'.
Here's a full example with the original pair of squares from your post:
from shapely.geometry import Polygon
p1 = Polygon([(0,0), (0,8), (8,8), (8,0)])
p2 = Polygon([(2,2), (2,4), (4,4), (4,2)])
print(p1.contains(p2))
Documentation for the Polygon object is at https://shapely.readthedocs.io/en/latest/manual.html#Polygon
and for the contains method at https://shapely.readthedocs.io/en/latest/manual.html#object.contains
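Extending that idea to the question's three nested squares, contains can be applied pairwise to recover the full nesting order (a sketch using the simplified corner coordinates from the question, not the algorithm's v1/v2/v3 output):

```python
from shapely.geometry import Polygon

g1 = Polygon([(1, 1), (1, 6), (6, 6), (6, 1)])
g2 = Polygon([(2, 2), (2, 5), (5, 5), (5, 2)])
g3 = Polygon([(3, 3), (3, 4), (4, 4), (4, 3)])

# pairwise containment gives the nesting: g3 inside g2 inside g1
print(g1.contains(g2), g2.contains(g3), g3.contains(g1))
# True True False
```

Running this over all pairs of thresholded contours lets you group them into nested objects, as the question asks.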

How to produce equally sized bins with pandas cut?

In pandas own documentation on the cut method, it says that it produces equally sized bins. However, in the example they provide, it clearly doesn't:
>>> pd.cut(np.array([1, 7, 5, 4, 6, 3]), 3)
[(0.994, 3.0], (5.0, 7.0], (3.0, 5.0], (3.0, 5.0], (5.0, 7.0], ...
Categories (3, interval[float64]): [(0.994, 3.0] < (3.0, 5.0] ...
The first interval is larger than all the others, why is that?
Edit: even if the smallest number (1) in the array is made more than 1 (e.g. 1.001), it still produces bins of unequal width:
In [291]: pd.cut(np.array([1.001, 7, 5, 4, 6, 3]), 3)
Out[291]:
[(0.995, 3.001], (5.0, 7.0], (3.001, 5.0], (3.001, 5.0], (5.0, 7.0], (0.995, 3.001]]
Categories (3, interval[float64]): [(0.995, 3.001] < (3.001, 5.0] < (5.0, 7.0]]
The first interval only looks larger: pd.cut pads the lowest edge down by 0.1% of the data's range so that the minimum value is included in the first (right-closed) bin; the interior edges are still evenly spaced. For the kind of performance you get, I can live with this amount of fractional inaccuracy. However, if you know your data and want to get as close to evenly spaced bins as possible, use linspace for the bin spec (similar to here):
arr = np.array([1, 7, 5, 4, 6, 3])
pd.cut(arr, np.linspace(arr.min(), arr.max(), 3+1), include_lowest=True)
# [(0.999, 3.0], (5.0, 7.0], (3.0, 5.0], (3.0, 5.0], (5.0, 7.0], (0.999, 3.0]]
# Categories (3, interval[float64]): [(0.999, 3.0] < (3.0, 5.0] < (5.0, 7.0]]
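You can confirm that the underlying edges are evenly spaced apart from the padding by asking cut to return them with retbins=True:

```python
import numpy as np
import pandas as pd

arr = np.array([1, 7, 5, 4, 6, 3])
cats, edges = pd.cut(arr, 3, retbins=True)
print(edges)
# Only the first edge is moved: it is padded down by 0.1% of the range
# (6 * 0.001 = 0.006, so 1 - 0.006 = 0.994) so the smallest value lands
# inside the first right-closed bin. The interior edges 3.0 and 5.0
# are exactly evenly spaced.
```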

Merging two sorted lists of lists and if there are duplicate values in the inner lists, keep only those from the first one

I have two lists of lists, sorted with respect to the first item of each inner list (which represents a timestamp), containing data like this: [[time0, voltage0], [time1, voltage1], ...]
l1 =[[0,0],[1,1],[2,2],[3,3]]
l2 =[[0,0],[0.5,0.5],[1,1.2],[1.5,1.5],[2,2]]
the goal is to produce a single list of lists containing the elements from both lists and sorted with respect to the first item of the inner lists BUT
if there is an item whose timestamp is the same in both lists, the final list will contain the item from the first list.
for the example above the output should be:
result = [[0,0],[0.5,0.5],[1,1],[1.5,1.5],[2,2],[3,3]]
I've tried to save a reference in each element specifying which list the element came from, and then go over the list to find duplicates and delete those that came from the second list, but finding duplicates isn't working since ["first",0,0] isn't a duplicate of ["second",0,0].
# examples of lists
from operator import itemgetter
lFirst = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0], [5.0, 5.0]]
lSecond = [[0.0, 0.0], [0.5, 0.5], [1.0, 1.2], [1.5, 1.5], [2.0, 2.0], [2.5, 2.5], [3.0, 3.0], [3.5, 3.5], [4.0, 4.0], [4.5, 4.5]]
print("first list: {}".format(lFirst))
print("second list: {}".format(lSecond))
res = sorted(lFirst + lSecond, key=itemgetter(0))
print(res)
One way is to concatenate your lists, with l2 coming first. Then create a dictionary and sort the items():
print([list(x) for x in sorted(dict(l2 + l1).items())])
#[[0, 0], [0.5, 0.5], [1, 1], [1.5, 1.5], [2, 2], [3, 3]]
This works because dictionary keys are unique. You start with a key-value pair from l2, but if the key (timestamp) also exists in l1 it gets updated.
You could remove all duplicates from the second list before merging.
lFirst = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0], [5.0, 5.0]]
lSecond = [[0.0, 0.0], [0.5, 0.5], [1.0, 1.2], [1.5, 1.5], [2.0, 2.0], [2.5, 2.5], [3.0, 3.0], [3.5, 3.5], [4.0, 4.0], [4.5, 4.5]]
print("first list: {0}".format(lFirst))
print("second list: {0}".format(lSecond))
lFirstTimes = [x[0] for x in lFirst]
lSecondFiltered = [x for x in lSecond if x[0] not in lFirstTimes]
print("second list without duplicates: {0}".format(lSecondFiltered))
res = lFirst+lSecondFiltered
res.sort()
print(res)
You can use heapq.merge (doc) to merge the lists and itertools.groupby (doc) to group the elements.
The list which is first in merge() will get priority:
l1 = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0], [5.0, 5.0]]
l2 = [[0.0, 0.0], [0.5, 0.5], [1.0, 1.2], [1.5, 1.5], [2.0, 2.0], [2.5, 2.5], [3.0, 3.0], [3.5, 3.5], [4.0, 4.0], [4.5, 4.5]]
from heapq import merge
from itertools import groupby
out = [next(g) for _, g in groupby(merge(l1, l2, key=lambda k: k[0]), lambda k: k[0])]
from pprint import pprint
pprint(out)
Prints:
[[0.0, 0.0],
[0.5, 0.5],
[1.0, 1.0],
[1.5, 1.5],
[2.0, 2.0],
[2.5, 2.5],
[3.0, 3.0],
[3.5, 3.5],
[4.0, 4.0],
[4.5, 4.5],
[5.0, 5.0]]
EDIT: Works in Python 3.5+ (in Python 2.7, merge() doesn't have a key= argument)
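For older Pythons where merge() lacks key=, the same "first list wins" rule can be written as an explicit two-pointer merge (a sketch, not from any of the answers above):

```python
def merge_keep_first(l1, l2):
    """Merge two timestamp-sorted lists; on equal timestamps keep l1's entry."""
    out, i, j = [], 0, 0
    while i < len(l1) and j < len(l2):
        if l1[i][0] < l2[j][0]:
            out.append(l1[i]); i += 1
        elif l2[j][0] < l1[i][0]:
            out.append(l2[j]); j += 1
        else:  # equal timestamps: keep the item from the first list
            out.append(l1[i]); i += 1; j += 1
    return out + l1[i:] + l2[j:]

l1 = [[0, 0], [1, 1], [2, 2], [3, 3]]
l2 = [[0, 0], [0.5, 0.5], [1, 1.2], [1.5, 1.5], [2, 2]]
print(merge_keep_first(l1, l2))
# [[0, 0], [0.5, 0.5], [1, 1], [1.5, 1.5], [2, 2], [3, 3]]
```

This runs in O(len(l1) + len(l2)) since both inputs are already sorted.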

How to print categories in pandas.cut?

Notice that when you input pandas.cut into a dataframe, you get the bins of each element, Name:, Length:, dtype:, and Categories in the output. I just want the Categories array printed for me so I can obtain just the range of the number of bins I was looking for. For example, with bins=4 inputted into a dataframe of numbers "1,2,3,4,5", I would want the output to print solely the range of the four bins, i.e. (1, 2], (2, 3], (3, 4], (4, 5].
Is there any way I can do this? It can be anything, even if it doesn't require printing "Categories".
I guess that you would just like to get the 'bins' from pd.cut().
If so, you can simply set retbins=True, see the doc of pd.cut
For example:
In[01]:
data = pd.DataFrame({'a': [1, 2, 3, 4, 5]})
cats, bins = pd.cut(data.a, 4, retbins=True)
Out[01]:
cats:
0 (0.996, 2.0]
1 (0.996, 2.0]
2 (2.0, 3.0]
3 (3.0, 4.0]
4 (4.0, 5.0]
Name: a, dtype: category
Categories (4, interval[float64]): [(0.996, 2.0] < (2.0, 3.0] < (3.0, 4.0] < (4.0, 5.0]]
bins:
array([0.996, 2. , 3. , 4. , 5. ])
Then you can reuse the bins as you pleased.
e.g.,
lst = [1, 2, 3]
category = pd.cut(lst,bins)
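If all you want is the Categories listing itself, the result of pd.cut on a Series also exposes it directly through the .cat accessor:

```python
import pandas as pd

data = pd.DataFrame({'a': [1, 2, 3, 4, 5]})
cats = pd.cut(data.a, 4)
# the Categories array alone, as an IntervalIndex of the 4 bins
print(cats.cat.categories)
```

This prints only the interval ranges, without the per-element labels, Name, Length, or dtype lines.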
For anyone who has come here to see how to select a particular bin from the pd.cut function - we can use the pd.Interval function
df['bin'] = pd.cut(df['y'], [0.1, .2,.3,.4,.5, .6,.7,.8 ,.9])
print(df["bin"].value_counts())
Output
(0.2, 0.3] 697
(0.4, 0.5] 156
(0.5, 0.6] 122
(0.3, 0.4] 12
(0.6, 0.7] 8
(0.7, 0.8] 4
(0.1, 0.2] 0
(0.8, 0.9] 0
print(df.loc[df['bin'] == pd.Interval(0.7, 0.8)])
