How to produce equally sized bins with pandas cut? - python

In pandas' own documentation on the cut method, it says that it produces equally sized bins. However, in the example they provide, it clearly doesn't:
>>> pd.cut(np.array([1, 7, 5, 4, 6, 3]), 3)
[(0.994, 3.0], (5.0, 7.0], (3.0, 5.0], (3.0, 5.0], (5.0, 7.0], ...
Categories (3, interval[float64]): [(0.994, 3.0] < (3.0, 5.0] ...
The first interval is larger than all the others, why is that?
Edit: even if the smallest number (1) in the array is made more than 1 (e.g. 1.001), it still produces bins of unequal width:
In [291]: pd.cut(np.array([1.001, 7, 5, 4, 6, 3]), 3)
Out[291]:
[(0.995, 3.001], (5.0, 7.0], (3.001, 5.0], (3.001, 5.0], (5.0, 7.0], (0.995, 3.001]]
Categories (3, interval[float64]): [(0.995, 3.001] < (3.001, 5.0] < (5.0, 7.0]]

For the kind of performance you get, I can live with this amount of fractional inaccuracy. However, if you know your data and want to get as close to evenly spaced bins as possible, use linspace for the bin spec (similar to here):
arr = np.array([1, 7, 5, 4, 6, 3])
pd.cut(arr, np.linspace(arr.min(), arr.max(), 3+1), include_lowest=True)
# [(0.999, 3.0], (5.0, 7.0], (3.0, 5.0], (3.0, 5.0], (5.0, 7.0], (0.999, 3.0]]
# Categories (3, interval[float64]): [(0.999, 3.0] < (3.0, 5.0] < (5.0, 7.0]]
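To see where the unequal width comes from, it can help to inspect the bin edges directly. The pandas documentation notes that, when bins is an integer, the range of x is extended by 0.1% on each side so that the minimum and maximum values are included. The sketch below uses retbins=True to compare the edges pandas picks with the ones from np.linspace; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

arr = np.array([1, 7, 5, 4, 6, 3])

# Integer bin count: pandas computes the edges itself and nudges the
# lowest edge down so that the minimum value falls inside the first bin.
_, auto_edges = pd.cut(arr, 3, retbins=True)
auto_widths = np.diff(auto_edges)

# Explicit edges from np.linspace are evenly spaced; include_lowest=True
# keeps the minimum value in the first bin.
even_edges = np.linspace(arr.min(), arr.max(), 3 + 1)
_, used_edges = pd.cut(arr, even_edges, include_lowest=True, retbins=True)
even_widths = np.diff(used_edges)

print(auto_widths)  # first width is slightly larger than the others
print(even_widths)  # all widths equal
```

With explicit edges, only the displayed label of the first interval changes (0.999 instead of 1); the edges used for binning stay evenly spaced.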

Related

How to get max or min values elementwise in tuples in a tuple matrix using python?

rows = int(input("Enter the Number of rows : "))
column = int(input("Enter the Number of Columns: "))
print("Enter the elements of First Matrix:")
matrix_a = [[tuple(map(float, input().split(" "))) for i in range(column)] for i in range(rows)]
print("First Matrix is: ")
for n in matrix_a:
    print(n)
print("Enter the elements of Second Matrix:")
matrix_b = [[tuple(map(float, input().split(" "))) for i in range(column)] for i in range(rows)]
print("second Matrix is: ")
for n in matrix_b:
    print(n)
result = [[0 for i in range(column)] for i in range(rows)]
for i in range(rows):
    for j in range(column):
        res = tuple(map(lambda i, j : max(i, j) , matrix_a, matrix_b))
print("Maximum of Above two Matrices is : ")
for r in res:
    print(r)
I got the following output:
Enter the Number of rows :
2
Enter the Number of Columns:
2
Enter the elements of First Matrix:
5 6 7
3 4 5
1 2 3
7 8 9
First Matrix is:
[(5.0, 6.0, 7.0), (3.0, 4.0, 5.0)]
[(1.0, 2.0, 3.0), (7.0, 8.0, 9.0)]
Enter the elements of Second Matrix:
6 3 7
8 4 6
9 8 5
2 5 7
second Matrix is:
[(6.0, 3.0, 7.0), (8.0, 4.0, 6.0)]
[(9.0, 8.0, 5.0), (2.0, 5.0, 7.0)]
Maximum of Above two Matrices is :
[(6.0, 3.0, 7.0), (8.0, 4.0, 6.0)]
[(9.0, 8.0, 5.0), (2.0, 5.0, 7.0)]
What should I do to get the (max, min, min) values elementwise in the tuples of the matrices?
For example, if
matrix1 =
[(5.0, 6.0, 7.0), (3.0, 4.0, 5.0)]
[(1.0, 2.0, 3.0), (7.0, 8.0, 9.0)]
matrix2=
[(6.0, 3.0, 7.0), (8.0, 4.0, 6.0)]
[(9.0, 8.0, 5.0), (2.0, 5.0, 7.0)]
I need the result to be
[(6.0, 3.0, 7.0), (8.0, 4.0, 5.0)]
[(9.0, 2.0, 3.0), (7.0, 5.0, 7.0)]
i.e., consider the tuple1 in matrix1 (5.0, 6.0, 7.0) and tuple2 (6.0, 3.0, 7.0) in matrix2 then I want the resultant tuple to be
(max{5.0, 6.0}, min{6.0, 3.0}, min{7.0, 7.0}) =(6.0, 3.0, 7.0)
You can do that with a list comprehension:
result = [
    [
        (
            max(matrix_a[row][col][0], matrix_b[row][col][0]),
            *(min(a, b) for a, b in zip(matrix_a[row][col][1:], matrix_b[row][col][1:]))
        )
        for col in range(column)
    ]
    for row in range(rows)
]
print("Maximum of Above two Matrices is : ")
for r in result:
    print(r)
The expression
max(matrix_a[row][col][0], matrix_b[row][col][0])
computes the max of the first components of the pair of cells at the
same position in your two matrices.
Then the expression
*(min(a, b) for a, b in zip(matrix_a[row][col][1:], matrix_b[row][col][1:]))
zips the remaining elements (thanks to [1:]) of the two cells and
yields the min of each pair.
The * operator unpacks those values into the enclosing tuple, which means that, in the end, the
whole expression is equivalent to:
(
    max(matrix_a[row][col][0], matrix_b[row][col][0]),
    min(matrix_a[row][col][1], matrix_b[row][col][1]),
    min(matrix_a[row][col][2], matrix_b[row][col][2]),
    ...
)
which produces the expected result.
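The same idea can be checked without keyboard input by hard-coding the two matrices from the question. This is a minimal sketch; the helper names combine_cells and combine_matrices are made up for illustration and are not part of the answer above.

```python
def combine_cells(cell_a, cell_b):
    """Max of the first components, min of every remaining component."""
    first = max(cell_a[0], cell_b[0])
    rest = (min(a, b) for a, b in zip(cell_a[1:], cell_b[1:]))
    return (first, *rest)


def combine_matrices(matrix_a, matrix_b):
    """Apply combine_cells to cells at the same position in both matrices."""
    return [
        [combine_cells(a, b) for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(matrix_a, matrix_b)
    ]


matrix1 = [[(5.0, 6.0, 7.0), (3.0, 4.0, 5.0)],
           [(1.0, 2.0, 3.0), (7.0, 8.0, 9.0)]]
matrix2 = [[(6.0, 3.0, 7.0), (8.0, 4.0, 6.0)],
           [(9.0, 8.0, 5.0), (2.0, 5.0, 7.0)]]

result = combine_matrices(matrix1, matrix2)
for row in result:
    print(row)
# [(6.0, 3.0, 7.0), (8.0, 4.0, 5.0)]
# [(9.0, 2.0, 3.0), (7.0, 5.0, 7.0)]
```

Iterating with zip over rows and cells avoids indexing by row and col, so the function does not need to know the matrix dimensions.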

Determining if vertices lie within a set of vertices

In my algorithm, I am finding graphs at different thresholds. Each graph G = (V,E). These are undirected graphs found using breadth first search. I would like to determine if the vertices of another graph G' = (V',E') lie within graph G. I am unfamiliar with graph algorithms so please let me know if you would like to see code or a more thorough explanation.
For example, if I have a graph G1 which is a square with 'corner' vertices (among others, but reduced for simplicity) of {(1,1), (1,6), (6,6), (6,1)}, then a smaller square G2 defined by corner vertices {(2,2), (2,5), (5,5), (5,2)} would lie within G1. A third graph G3 is defined by corners {(3,3), (3,4), (4,4), (4,3)}. My algorithm produces the following figure for this configuration:
A square thresholded at 2, surrounded by t=1, surrounded by t=0. (I need to fix the edges but the vertices are correct)
My algorithm works on the following matrix:
import numpy as np
A = np.zeros((7,7))
#A[A<1] = -1
for i in np.arange(1,6):
    for j in np.arange(1,6):
        A[i,j] = 1
for i in np.arange(2,5):
    for j in np.arange(2,5):
        A[i,j] = 2
for i in np.arange(3,4):
    for j in np.arange(3,4):
        A[i,j] = 3
print(A)
It creates three graphs, the first at threshold 2, the second at threshold 1, the third at threshold 0, with the following vertex lists:
v1 = [[(3.0, 2.25), (3.0, 3.75), (2.25, 3.0), (3.75, 3.0)]]
v2 = [[(2.0, 1.333333), (1.333333, 3.0), (1.333333, 2.0), (1.333333, 4.0), (2.0, 4.666667), (3.0, 4.666667), (4.0, 4.666667), (4.666667, 4.0), (4.666667, 3.0), (4.666667, 2.0), (4.0, 1.333333), (3.0, 1.333333)]]
v3 = [[(1.0, 0.5), (0.5, 2.0), (0.5, 1.0), (0.5, 3.0), (0.5, 4.0), (0.5, 5.0), (1.0, 5.5), (2.0, 5.5), (3.0, 5.5), (4.0, 5.5), (5.0, 5.5), (5.5, 5.0), (5.5, 4.0), (5.5, 3.0), (5.5, 2.0), (5.5, 1.0), (5.0, 0.5), (4.0, 0.5), (3.0, 0.5), (2.0, 0.5)]]
And edge lists:
e1 = [[[2.25, 3.0], [3.0, 2.25]], [[3.0, 3.75], [2.25, 3.0]], [[3.0, 2.25], [3.75, 3.0]], [[3.0, 3.75], [3.75, 3.0]]]
e2 = [[[1.333333, 2.0], [2.0, 1.333333]], [[1.333333, 3.0], [1.333333, 2.0]], [[1.333333, 4.0], [1.333333, 3.0]], [[2.0, 4.666667], [1.333333, 4.0]], [[2.0, 1.333333], [3.0, 1.333333]], [[2.0, 4.666667], [3.0, 4.666667]], [[3.0, 1.333333], [4.0, 1.333333]], [[3.0, 4.666667], [4.0, 4.666667]], [[4.0, 1.333333], [4.666667, 2.0]], [[4.666667, 3.0], [4.666667, 2.0]], [[4.666667, 4.0], [4.666667, 3.0]], [[4.0, 4.666667], [4.666667, 4.0]]]
e3 = [[[0.5, 1.0], [1.0, 0.5]], [[0.5, 2.0], [0.5, 1.0]], [[0.5, 3.0], [0.5, 2.0]], [[0.5, 4.0], [0.5, 3.0]], [[0.5, 5.0], [0.5, 4.0]], [[1.0, 5.5], [0.5, 5.0]], [[1.0, 0.5], [2.0, 0.5]], [[1.0, 5.5], [2.0, 5.5]], [[2.0, 0.5], [3.0, 0.5]], [[2.0, 5.5], [3.0, 5.5]], [[3.0, 0.5], [4.0, 0.5]], [[3.0, 5.5], [4.0, 5.5]], [[4.0, 0.5], [5.0, 0.5]], [[4.0, 5.5], [5.0, 5.5]], [[5.0, 0.5], [5.5, 1.0]], [[5.5, 2.0], [5.5, 1.0]], [[5.5, 3.0], [5.5, 2.0]], [[5.5, 4.0], [5.5, 3.0]], [[5.5, 5.0], [5.5, 4.0]], [[5.0, 5.5], [5.5, 5.0]]]
Again, this gives graphs that look like this
This is the real data that I am working on. More complicated shapes.
Here, for example, I have a red shape inside of a green shape. Ideally, red shapes would lie within red shapes. They would be grouped together in one object (say an array of graphs).
The graphs are connected in a clockwise fashion. I really don't know how to describe it, but perhaps the graphs in the link show this. There's a bug on two of the lines (as you can see in the first plot, in the top right corner), but the vertices are correct.
Hope this helps! I can attach a full workable example, but it would include my whole algorithm and be pages long, with many functions! I basically want to input either g1, g2, and g3 (or e1, e2, and e3) into a function. The function would tell me that g3 is contained within g2, which is contained within g1.
Your problem really does not have much to do with networks. Fundamentally, you are trying to determine if a point is inside a region described by an ordered list of points. The simplest way to do this is to create a matplotlib Path, which has a contains_point method (there is also a contains_points method to test many points simultaneously).
#!/usr/bin/env python
"""
Determine if a point is within the area defined by a path.
"""
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.path import Path
from matplotlib.patches import PathPatch
point = [0.5, 0.5]
vertices = np.array([
    [0, 0],
    [0, 1],
    [1, 1],
    [1, 0],
    [0, 0]  # NOTE the repetition of the first vertex
])
path = Path(vertices, closed=True)
print(path.contains_point(point))
# True
# plot to check visually
fig, ax = plt.subplots(1,1)
ax.add_patch(PathPatch(path))
ax.plot(point[0], point[1], 'ro')
Note that if a point is directly on the path, it is not inside the path. However, contains_point supports a radius argument that allows you to add an increment to the extent of the area. Whether you need a positive or negative increment depends on the ordering of the points. IIRC, radius shifts the path left in the direction of the path, but don't quote me on that.
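For the nesting question itself, contains_points can test all vertices of one shape against another in a single call. The sketch below uses the three ordered corner lists from the question (G1, G2, G3). The helper name lies_within is hypothetical, and it assumes the outer vertices are listed in order along the boundary, which the raw v1, v2, v3 lists may not be.

```python
import numpy as np
from matplotlib.path import Path


def lies_within(inner_vertices, outer_vertices):
    """Return True if every inner vertex is strictly inside the outer polygon.

    Assumes outer_vertices are ordered along the polygon boundary.
    """
    outer = np.asarray(outer_vertices, dtype=float)
    # Repeat the first vertex so the polygon is explicitly closed.
    closed_outer = np.vstack([outer, outer[:1]])
    path = Path(closed_outer, closed=True)
    inside = path.contains_points(np.asarray(inner_vertices, dtype=float))
    return bool(inside.all())


g1 = [(1, 1), (1, 6), (6, 6), (6, 1)]
g2 = [(2, 2), (2, 5), (5, 5), (5, 2)]
g3 = [(3, 3), (3, 4), (4, 4), (4, 3)]

print(lies_within(g3, g2))  # True
print(lies_within(g2, g1))  # True
print(lies_within(g1, g2))  # False
```

Checking vertices only is enough for convex shapes like these squares. For more complicated outlines, an inner edge could still cross the outer boundary even when all inner vertices are inside, so treat this as a first approximation.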

Confusing result for pandas qcut function

When reading the documentation for pd.qcut, I couldn't understand its explanation, particularly the examples. One of them is below:
>>> pd.qcut(range(5), 4)
... # doctest: +ELLIPSIS
[(-0.001, 1.0], (-0.001, 1.0], (1.0, 2.0], (2.0, 3.0], (3.0, 4.0]]
Categories (4, interval[float64]): [(-0.001, 1.0] < (1.0, 2.0] ...
Why did it return 5 elements in the list (although the code specifies 4 buckets), and why are the first 2 elements the same, (-0.001, 1.0]?
Thanks.
Because 0 is in (-0.001, 1.0], and so is 1.
list(range(5)) # [0, 1, 2, 3, 4]
The corresponding categories of [0, 1, 2, 3, 4] are [(-0.001, 1.0], (-0.001, 1.0], (1.0, 2.0], (2.0, 3.0], (3.0, 4.0]]: one label per input element, drawn from 4 distinct bins.
Look at the range
list(range(5))
Out[116]: [0, 1, 2, 3, 4]
It returns 5 numbers. When you apply qcut, 0 and 1 fall into the same bin:
pd.qcut(range(5), 4)
Out[115]:
[(-0.001, 1.0], (-0.001, 1.0], (1.0, 2.0], (2.0, 3.0], (3.0, 4.0]]
Categories (4, interval[float64]): [(-0.001, 1.0] < (1.0, 2.0] < (2.0, 3.0] < (3.0, 4.0]]
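The difference between "one label per element" and "number of bins" can be made explicit by looking at the lengths and the integer codes of the result. This is a small sketch using the same input as the documentation example; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

values = np.arange(5)          # [0, 1, 2, 3, 4]
cats = pd.qcut(values, 4)      # 4 quantile-based bins

print(len(cats))               # 5: one bin label per input element
print(len(cats.categories))    # 4: number of distinct bins
print(list(cats.codes))        # [0, 0, 1, 2, 3]: 0 and 1 share the first bin
```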

How to print categories in pandas.cut?

Notice that when you input pandas.cut into a dataframe, you get the bins of each element, Name:, Length:, dtype:, and Categories in the output. I just want the Categories array printed for me so I can obtain just the range of the number of bins I was looking for. For example, with bins=4 inputted into a dataframe of numbers "1,2,3,4,5", I would want the output to print solely the range of the four bins, i.e. (1, 2], (2, 3], (3, 4], (4, 5].
Is there any way I can do this? It can be anything, even if it doesn't require printing "Categories".
I guess that you would just like to get the 'bins' from pd.cut().
If so, you can simply set retbins=True; see the documentation of pd.cut.
For example:
In[01]:
data = pd.DataFrame({'a': [1, 2, 3, 4, 5]})
cats, bins = pd.cut(data.a, 4, retbins=True)
Out[01]:
cats:
0 (0.996, 2.0]
1 (0.996, 2.0]
2 (2.0, 3.0]
3 (3.0, 4.0]
4 (4.0, 5.0]
Name: a, dtype: category
Categories (4, interval[float64]): [(0.996, 2.0] < (2.0, 3.0] < (3.0, 4.0] < (4.0, 5.0]]
bins:
array([0.996, 2. , 3. , 4. , 5. ])
Then you can reuse the bins as you please.
e.g.,
lst = [1, 2, 3]
category = pd.cut(lst,bins)
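If the goal is to print only the intervals rather than the numeric edges, the categories of the result can also be read through the .cat accessor of the returned Series. A minimal sketch, reusing the column from the example above:

```python
import pandas as pd

data = pd.DataFrame({"a": [1, 2, 3, 4, 5]})
cats = pd.cut(data["a"], 4)

# The .cat accessor exposes the categories as an IntervalIndex.
categories = cats.cat.categories
for interval in categories:
    print(interval)
```

Note that the first interval starts slightly below 1, because pandas lowers the first edge so that the minimum value is included.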
For anyone who has come here to see how to select a particular bin from the pd.cut function: we can use pd.Interval.
df['bin'] = pd.cut(df['y'], [0.1, .2,.3,.4,.5, .6,.7,.8 ,.9])
print(df["bin"].value_counts())
Output
(0.2, 0.3] 697
(0.4, 0.5] 156
(0.5, 0.6] 122
(0.3, 0.4] 12
(0.6, 0.7] 8
(0.7, 0.8] 4
(0.1, 0.2] 0
(0.8, 0.9] 0
print(df.loc[df['bin'] == pd.Interval(0.7,0.8)])
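A self-contained version of this selection, with a small made-up DataFrame standing in for the df used above (the y values are invented for illustration):

```python
import pandas as pd

# Hypothetical data; the original df is not shown in the answer.
df = pd.DataFrame({"y": [0.15, 0.25, 0.75, 0.78, 0.55]})
edges = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
df["bin"] = pd.cut(df["y"], edges)

# pd.Interval is closed on the right by default, matching pd.cut's default.
selected = df.loc[df["bin"] == pd.Interval(0.7, 0.8)]
print(selected)
```

The interval passed to pd.Interval has to match one of the categories exactly, including which side is closed.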

python- add col names to np.array

Why does the following work:
mat = np.array(
    [(0,0,0),
     (0,0,0),
     (0,0,0)],
    dtype=[('MSFT','float'),('CSCO','float'),('GOOG','float')]
)
while this doesn't:
mat = np.array(
    [[0]*3]*3,
    dtype=[('MSFT','float'),('CSCO','float'),('GOOG','float')]
)
TypeError: expected a readable buffer object
How can I create a matrix easily like
[[None]*M]*N
But with tuples in it to be able to assign names to columns?
When I make a zero array with your dtype
In [548]: dt=np.dtype([('MSFT','float'),('CSCO','float'),('GOOG','float') ])
In [549]: A = np.zeros(3, dtype=dt)
In [550]: A
Out[550]:
array([(0.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 0.0)],
dtype=[('MSFT', '<f8'), ('CSCO', '<f8'), ('GOOG', '<f8')])
notice that the display shows a list of tuples. That's intentional, to distinguish the dtype records from a row of a 2d (ordinary) array.
That also means that when creating the array, or assigning values, you also need to use a list of tuples.
For example let's make a list of lists:
In [554]: ll = np.arange(9).reshape(3,3).tolist()
In [555]: ll
In [556]: A[:]=ll
...
TypeError: a bytes-like object is required, not 'list'
but if I turn it into a list of tuples:
In [557]: llt = [tuple(i) for i in ll]
In [558]: llt
Out[558]: [(0, 1, 2), (3, 4, 5), (6, 7, 8)]
In [559]: A[:]=llt
In [560]: A
Out[560]:
array([(0.0, 1.0, 2.0), (3.0, 4.0, 5.0), (6.0, 7.0, 8.0)],
dtype=[('MSFT', '<f8'), ('CSCO', '<f8'), ('GOOG', '<f8')])
assignment works fine. That list also can be used directly in array.
In [561]: np.array(llt, dtype=dt)
Out[561]:
array([(0.0, 1.0, 2.0), (3.0, 4.0, 5.0), (6.0, 7.0, 8.0)],
dtype=[('MSFT', '<f8'), ('CSCO', '<f8'), ('GOOG', '<f8')])
Similarly assigning values to one record requires a tuple, not a list:
In [563]: A[0]=(10,12,14)
The other common way of setting values is on a field by field basis. That can be done with a list or array:
In [564]: A['MSFT']=[100,200,300]
In [565]: A
Out[565]:
array([(100.0, 12.0, 14.0), (200.0, 4.0, 5.0), (300.0, 7.0, 8.0)],
dtype=[('MSFT', '<f8'), ('CSCO', '<f8'), ('GOOG', '<f8')])
The np.rec.fromarrays method recommended in the other answer ends up using the copy-by-fields approach. Its code is, in essence:
arrayList = [sb.asarray(x) for x in arrayList]
<determine shape>
<determine dtype>
_array = recarray(shape, descr)
# populate the record array (makes a copy)
for i in range(len(arrayList)):
    _array[_names[i]] = arrayList[i]
If you have a number of 1D arrays (columns) you would like to merge while keeping column names, you can use np.rec.fromarrays:
>>> dt = np.dtype([('a', float),('b', float),('c', float),])
>>> np.rec.fromarrays([[0] * 3 ] * 3, dtype=dt)
rec.array([(0.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 0.0)], dtype=[('a', '<f8'), ('b', '<f8'), ('c', '<f8')])
This gives you a record/structured array in which columns can have names & different datatypes.
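Putting the pieces from both answers together, a list of lists can be turned into a named-column array by converting each row to a tuple first. This is a minimal sketch; the variable names rows and mat are illustrative, and np.zeros covers the "[[None]*M]*N"-style case where only an empty table is needed.

```python
import numpy as np

dt = np.dtype([("MSFT", float), ("CSCO", float), ("GOOG", float)])

# An empty table with N records: no nested lists needed.
empty = np.zeros(3, dtype=dt)

# Existing row data as a list of lists must become a list of tuples.
rows = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
mat = np.array([tuple(r) for r in rows], dtype=dt)

print(mat["CSCO"])   # column access by name: [1. 4. 7.]
print(mat[0])        # one record: (0., 1., 2.)

# Field-by-field assignment accepts a plain list.
mat["MSFT"] = [100, 200, 300]
print(mat["MSFT"])   # [100. 200. 300.]
```

The result is a 1-D array of records, so its shape is (3,) rather than (3, 3).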
