Make pandas.qcut output compatible with scipy.stats.binned_statistic_dd

From the scipy reference for scipy.stats.binned_statistic_dd, on bin edges:
All but the last (righthand-most) bin is half-open in each
dimension. In other words, if bins is [1, 2, 3, 4], then the first bin
is [1, 2) (including 1, but excluding 2) and the second [2, 3). The
last bin, however, is [3, 4], which includes 4.
I want to use pandas.qcut to generate the bin edges to pass to binned_statistic, but the edges are defined exactly the other way around.
import numpy as np
import pandas as pd
import scipy as sp
import scipy.stats

a = np.arange(0, 10, 1)
print(a)
[0 1 2 3 4 5 6 7 8 9]
where,
d,b = pd.qcut(a, 9, retbins=True)
print(d.value_counts())
print(b)
(-0.001, 1.0]    2
(1.0, 2.0]       1
(2.0, 3.0]       1
(3.0, 4.0]       1
(4.0, 5.0]       1
(5.0, 6.0]       1
(6.0, 7.0]       1
(7.0, 8.0]       1
(8.0, 9.0]       1
dtype: int64
[0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
If I now run the binned_statistic using this binning,
h,e,binning = sp.stats.binned_statistic_dd(values=a,sample=a,bins=[np.array(b)])
print(binning)
[1 2 3 4 5 6 7 8 9 9]
which is a different binning of course, due to the different definition of the bin edges.
Is there a way to get the edges from qcut with the closure reversed? Since these are real numbers, I cannot just shift the values.
Otherwise, does scipy have this capability in some way I cannot see? Does binned_statistic allow automatically defining the bins based on the data distribution somehow?
So, the expected output (which is not uniquely defined) for this particular case could be
be = [0, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7, 8.5, 9.5]
such that,
h,e,binning = sp.stats.binned_statistic_dd(values=a,sample=a,bins=[np.array(be)])
print(binning)
[1 1 2 3 4 5 6 8 8 9]
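One workaround (a sketch, not from the original post): pd.qcut returns right-closed edges while binned_statistic_dd treats every edge as left-closed, so nudging each interior edge up to the next representable float with np.nextafter reproduces qcut's assignment exactly:
import numpy as np
import pandas as pd
import scipy.stats

a = np.arange(0, 10, 1)
_, b = pd.qcut(a, 9, retbins=True)

# Shift every interior edge just above itself, so a value equal to an
# original edge now falls into the lower bin, mimicking pandas' (a, b]
be = b.copy()
be[1:-1] = np.nextafter(b[1:-1], np.inf)

h, e, binning = scipy.stats.binned_statistic_dd(values=a, sample=a, bins=[be])
print(binning)
# [1 1 2 3 4 5 6 7 8 9]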

calculate the average of each dimension defining the group in python

I have a dataframe (df) with three columns (user, vector, and group); the vector column holds multiple comma-separated values in each row.
df = pd.DataFrame({'user': ['user_1', 'user_2', 'user_3', 'user_4', 'user_5', 'user_6'], 'vector': [[1, 0, 2, 0], [1, 8, 0, 2],[6, 2, 0, 0], [5, 0, 2, 2], [3, 8, 0, 0],[6, 0, 0, 2]], 'group': ['A', 'B', 'C', 'B', 'A', 'A']})
I would like to calculate for each group, the sum of dimensions in all rows divided by the total number of rows for this group.
For example:
For group A: [(1+3+6)/3, (0+8+0)/3, (2+0+0)/3, (0+0+2)/3] = [3.3, 2.6, 0.6, 0.6].
For group B: [(1+5)/2, (8+0)/2, (0+2)/2, (2+2)/2] = [3, 4, 1, 2].
For group C: [6, 2, 0, 0]
So, the expected result is an array:
group A: [3.3, 2.6, 0.6, 0.6]
group B: [3,4,1,2]
group C: [6, 2, 0, 0]
I'm not sure if you were looking for the results stored in a single array/dataframe, or if you're just looking to get the results as separate arrays.
If the latter, something like this should work for you:
for group in df.group.unique():
    print(f'Group {group} results: ')
    tmp_df = pd.DataFrame(df[df.group == group]['vector'].tolist())
    print(tmp_df.mean().values)
Output:
Group A results:
[3.33333333 2.66666667 0.66666667 0.66666667]
Group B results:
[3. 4. 1. 2.]
Group C results:
[6. 2. 0. 0.]
It's a little clunky, but gets the job done if you're just looking to get the results.
It filters the dataframe by group, then turns that group's vectors into their own tmp_df and takes the mean of each column.
If you want you could easily take those arrays and save them for further manipulation or what have you.
Hope that helps!
Take advantage of numpy:
import numpy as np
out = (df.groupby('group')['vector']
         .agg(lambda x: np.vstack(x).mean(0).round(2))
       )
print(out)
Output:
group
A [3.33, 2.67, 0.67, 0.67]
B [3.0, 4.0, 1.0, 2.0]
C [6.0, 2.0, 0.0, 0.0]
Name: vector, dtype: object
as DataFrame
out = (df.groupby('group', as_index=False)['vector']
         .agg(lambda x: np.vstack(x).mean(0).round(2))
       )
Output:
group vector
0 A [3.33, 2.67, 0.67, 0.67]
1 B [3.0, 4.0, 1.0, 2.0]
2 C [6.0, 2.0, 0.0, 0.0]
as array
out = np.vstack(df.groupby('group')['vector']
                  .agg(lambda x: np.vstack(x).mean(0).round(2))
                )
Output:
[[3.33 2.67 0.67 0.67]
[3. 4. 1. 2. ]
[6. 2. 0. 0. ]]
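Another variant (a sketch along the same lines, not from the original answers): expand the lists into a wide frame once, then let groupby do the averaging, which keeps the result as a labelled DataFrame:
import pandas as pd

df = pd.DataFrame({'user': ['user_1', 'user_2', 'user_3', 'user_4', 'user_5', 'user_6'],
                   'vector': [[1, 0, 2, 0], [1, 8, 0, 2], [6, 2, 0, 0],
                              [5, 0, 2, 2], [3, 8, 0, 0], [6, 0, 0, 2]],
                   'group': ['A', 'B', 'C', 'B', 'A', 'A']})

# one column per vector dimension, rows indexed by group label
wide = pd.DataFrame(df['vector'].tolist(), index=df['group'])
print(wide.groupby(level=0).mean())
Output:
              0         1         2         3
group
A      3.333333  2.666667  0.666667  0.666667
B      3.000000  4.000000  1.000000  2.000000
C      6.000000  2.000000  0.000000  0.000000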

Whitespaces after addition to numpy array

Why, when I execute the code below, do I get those weird whitespaces in the output?
import numpy as np
str = 'a a b c a a d a g a'
string_array = np.array(str.split(" "))
char_indices = np.where(string_array == 'a')
array = char_indices[0]
print(array)
array += 2
print(array)
output:
[0 1 4 5 7 9]
[ 2 3 6 7 9 11]
That's just numpy's way of displaying data to make it appear aligned and more readable.
The alignment between your two lists changes
[0 1 4 5 7 9]
[ 2 3 6 7 9 11]
because there is a two-digit element in the second list.
With a 1-D array it is harder to appreciate, but it is very useful when we have more dimensions:
>>> a = np.random.uniform(0,1,(5,5))
>>> a[a>0.5] = 0
>>> print(a)
[[0. 0. 0.00460074 0.22880318 0.46584641]
[0.0455245 0. 0. 0. 0. ]
[0. 0.07891556 0.21795357 0.14944522 0.20732431]
[0. 0. 0. 0.3381172 0.08182367]
[0. 0. 0.10734559 0. 0.31228533]]
>>> print(a.tolist())
[[0.0, 0.0, 0.0046007414146133074, 0.22880318354923768, 0.4658464110307319], [0.04552450444387102, 0.0, 0.0, 0.0, 0.0], [0.0, 0.07891556038021574, 0.21795356574892966, 0.1494452184954096, 0.2073243102108967], [0.0, 0.0, 0.0, 0.33811719550156627, 0.08182367499758836], [0.0, 0.0, 0.10734558995972832, 0.0, 0.31228532775003903]]
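If the padding itself is unwanted for the asker's 1-D array, the same idea applies (a small sketch, not from the original answer): convert to a plain list, or hand numpy a per-element formatter so it skips the fixed-width alignment:
import numpy as np

array = np.array([2, 3, 6, 7, 9, 11])

print(array.tolist())
# [2, 3, 6, 7, 9, 11]

# a custom element formatter bypasses the default width-aligned rendering
print(np.array2string(array, formatter={'int': str}))
# [2 3 6 7 9 11]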

How to create a networkx Graph using 2D np array as input

My algorithm outputs the set of vertices describing objects in 3D space (x, y, z). In this case, there are two objects:
verts =
[[0.1 1. 1. ] [1. 1. 0.1] [1. 0.1 1. ] [1. 1. 1.9] [1. 1.9 1. ]
[1.9 1. 1. ] [7.1 8. 8. ] [8. 8. 7.1] [8. 7.1 8. ] [8. 8. 8.9]
[8. 8.9 8. ] [8.9 8. 8. ]]
There are two tetrahedrons: one centered on (1, 1, 1), the other on (8, 8, 8). My goal is to use breadth-first search to identify that the objects are separate, and then classify each. I have not been able to get the data in the correct form for my algorithm.
Instead, I intend to use the networkx module, specifically using the Graph class, which takes ndarrays as input. I have tried:
import networkx as nx
import numpy as np

graph = nx.Graph(verts)
for idx, graph in enumerate(nx.connected_components(graph)):
    print("Graph ", idx, " in ", graph, '\n\n', file=open("output.txt", "a"))
However, I cannot create the graph. Instead, I get the error:
networkx.exception.NetworkXError: Input is not a correct numpy matrix or array.
This confuses me because type(verts) is numpy.ndarray.
I am open to either using networkx for this task, or developing some other strategy. Additionally, please let me know if there are any edits that might make this post more clear.
Edit: One thing that may help is another output, faces. These 'define triangular faces via referencing vertex indices from verts.' I believe these can be used to 'connect' or draw lines from vertex to vertex, eventually to create a dictionary.
faces =
[[ 2 1 0] [ 0 3 2] [ 1 4 0] [ 0 4 3] [ 5 1 2] [ 3 5 2]
[ 5 4 1] [ 4 5 3] [ 8 7 6] [ 6 9 8] [ 7 10 6] [ 6 10 9]
[11 7 8] [ 9 11 8] [11 10 7] [10 11 9]]
A method has been proposed, and it works for this set of data. However, it does not work for all of them. This edit adds a new set of data.
verts =
[[0.1 1. 1. ] [1. 1. 0.1] [1. 0.1 1. ] [1. 1. 1.9] [1. 1.9 1. ] [1.9 1. 1. ]
[3.1 1. 4. ] [4. 1. 3.1] [4. 0.1 4. ] [4. 1. 4.9] [4. 1.9 4. ] [5. 1. 3.1]
[5. 0.1 4. ] [5. 1. 4.9] [5. 1.9 4. ] [5.9 1. 4. ] [7.1 8. 8. ]
[8. 8. 7.1] [8. 7.1 8. ] [8. 8. 8.9] [8. 8.9 8. ] [9. 8. 7.1]
[9. 7.1 8. ] [9. 8. 8.9] [9. 8.9 8. ] [9.9 8. 8. ]]
I was able to answer this by another approach. It is lengthy because I need to include extra pieces. As a general outlook, I solved this problem by utilizing faces, which defines each triangle with the indices of its vertices. faces tells me which vertices are connected. This allowed me to build a linelist, which contains all of the connections between vertices.
# using faces and verts from the original post
linelist = []
for idx, vert in enumerate(faces):
    for i, x in enumerate(vert):
        # connect this vertex to the next one in the triangle, wrapping around
        l = [np.ndarray.tolist(verts[faces[idx][i]]),
             np.ndarray.tolist(verts[faces[idx][(i + 1) % len(vert)]])]
        linelist.append(l)
Which yields elements like:
[[1.0, 0.10000000149011612, 1.0], [1.0, 1.0, 0.10000000149011612]]
Edit: Discovered a faster method:
tmp = [tuple(tuple(j) for j in i) for i in linelist]
graph = nx.Graph(tmp)
graphs = []
open('output.txt', 'w').close()  # clear the output file
for idx, graph in enumerate(nx.connected_components(graph)):
    graphs.append(graph)
    print("Graph ", idx, " corresponds to vertices: ", graph, '\n\n', file=open("output.txt", "a"))
These points are connected. Next, I used someone else's code to create a dictionary where each key is a vertex and each value is a tuple of connected vertices. And then I used breadth-first search on this dictionary. See the class below.
class MS_Graph():
    def __init__(self, linelist=None, vertices=None):
        self.linelist = linelist
        self.vertices = vertices

    def getGraph(self):
        '''
        Takes self.linelist and converts it to a dict mapping each
        vertex to a tuple of the vertices it connects to.
        '''
        linelist = self.linelist
        graph = {}
        # an edge list usually reads v1 -> v2; these are lines,
        # so symmetry is assumed and both directions are stored
        for l in linelist:
            v1, v2 = map(tuple, l)
            graph[v1] = graph.get(v1, ()) + (v2,)
            graph[v2] = graph.get(v2, ()) + (v1,)
        return graph

    def BFS(self, graph):
        """
        Breadth-first search over the adjacency dict, collecting one
        vertex list per connected component.
        """
        # get nodes
        nodes = list(graph)
        graphs = []
        # keep going until every node belongs to some component
        while nodes:
            # initialize the search from the first remaining node
            toCheck = [nodes[0]]
            discovered = []
            # run bfs; pop(0) takes the oldest entry (FIFO), which keeps
            # the traversal breadth-first
            while toCheck:
                startNode = toCheck.pop(0)
                for neighbor in graph[startNode]:
                    if neighbor not in discovered:
                        discovered.append(neighbor)
                        toCheck.append(neighbor)
                        nodes.remove(neighbor)
            # add the discovered component
            graphs.append(discovered)
        self.graphs = graphs
        return graphs
And, bringing it all together:
Graph = MS_Graph(linelist)
graph = Graph.getGraph()
graphs = Graph.BFS(graph)
print(len(graphs))
# output: 3
print(graphs)
# output:
[[(1.0, 1.0, 0.10000000149011612), (0.10000000149011612, 1.0, 1.0), (1.0, 1.0, 1.899999976158142), (1.899999976158142, 1.0, 1.0), (1.0, 0.10000000149011612, 1.0), (1.0, 1.899999976158142, 1.0)],
[(4.0, 1.0, 3.0999999046325684), (3.0999999046325684, 1.0, 4.0), (4.0, 1.0, 4.900000095367432), (5.0, 1.0, 3.0999999046325684), (5.0, 0.10000000149011612, 4.0), (4.0, 0.10000000149011612, 4.0), (5.0, 1.0, 4.900000095367432), (5.900000095367432, 1.0, 4.0), (5.0, 1.899999976158142, 4.0), (4.0, 1.899999976158142, 4.0)],
[(8.0, 8.0, 7.099999904632568), (7.099999904632568, 8.0, 8.0), (8.0, 8.0, 8.899999618530273), (8.899999618530273, 8.0, 8.0), (8.0, 7.099999904632568, 8.0), (8.0, 8.899999618530273, 8.0)]]
That said, I do wonder if there is a faster method.
Edit: There may be a faster way. Since faces contains the vertices of every single triangle, all triangles that belong to one object will form an unbroken chain; i.e., the set of vertices composing object 1 will be distinct from the set of vertices composing any other object.
For example, the set of faces for each object:
object_1_faces =
[ 2 1 0]
[ 0 3 2]
[ 1 4 0]
[ 0 4 3]
[ 5 1 2]
[ 3 5 2]
[ 5 4 1]
[ 4 5 3]
object_2_faces =
[ 8 7 6]
[ 6 9 8]
[ 7 10 6]
[ 6 10 9]
[11 7 8]
[ 9 11 8]
[11 10 7]
[10 11 9]
object_1_vertices = {0,1,2,3,4,5}
object_2_vertices = {6,7,8,9,10,11}
I imagine this means there is a faster way than finding all of the lines.
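A minimal sketch of that idea (assuming faces is the integer index array from the post): build the graph on vertex indices directly, adding one edge per triangle side, and skip the coordinate line list entirely:
import networkx as nx

G = nx.Graph()
for tri in faces:
    # each triangular face contributes its three sides as edges
    G.add_edges_from([(tri[0], tri[1]), (tri[1], tri[2]), (tri[2], tri[0])])

print(list(nx.connected_components(G)))
# e.g. [{0, 1, 2, 3, 4, 5}, {6, 7, 8, 9, 10, 11}]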
The problem is how you're constructing the graph. You should first create a new instance of a graph with G = nx.Graph(), and then use its methods to add nodes or edges. In this case, you want to add paths from the nested list:
G = nx.Graph()
for path in verts:
    nx.add_path(G, path)
And then obtain the connected components:
cc = list(nx.connected_components(G))
# [{0.1, 1.0, 1.9}, {7.1, 8.0, 8.9}]
Now if you wanted to find which component each path belongs to, you could iterate over the paths and check with which of the components they intersect:
from collections import defaultdict

subgraphs = defaultdict(list)
for path in verts:
    for ix, c in enumerate(cc):
        if c.intersection(path):
            subgraphs[ix].append(path)
print(subgraphs)
print(subgraphs)
defaultdict(list,
{0: [[0.1, 1.0, 1.0],
[1.0, 1.0, 0.1],
[1.0, 0.1, 1.0],
[1.0, 1.0, 1.9],
[1.0, 1.9, 1.0],
[1.9, 1.0, 1.0]],
1: [[7.1, 8.0, 8.0],
[8.0, 8.0, 7.1],
[8.0, 7.1, 8.0],
[8.0, 8.0, 8.9],
[8.0, 8.9, 8.0],
[8.9, 8.0, 8.0]]})

How to print categories in pandas.cut?

Notice that when you apply pandas.cut to a dataframe column, you get the bin of each element plus Name:, Length:, dtype:, and Categories in the output. I just want the Categories array printed, so I can obtain just the ranges of the bins I asked for. For example, with bins=4 applied to a dataframe of the numbers 1, 2, 3, 4, 5, I would want the output to print solely the ranges of the four bins, i.e. (1, 2], (2, 3], (3, 4], (4, 5].
Is there any way I can do this? It can be anything, even if it doesn't require printing "Categories".
I guess that you would just like to get the bins from pd.cut().
If so, you can simply set retbins=True, see the doc of pd.cut
For example:
In[01]:
data = pd.DataFrame({'a': [1, 2, 3, 4, 5]})
cats, bins = pd.cut(data.a, 4, retbins=True)
Out[01]:
cats:
0 (0.996, 2.0]
1 (0.996, 2.0]
2 (2.0, 3.0]
3 (3.0, 4.0]
4 (4.0, 5.0]
Name: a, dtype: category
Categories (4, interval[float64]): [(0.996, 2.0] < (2.0, 3.0] < (3.0, 4.0] < (4.0, 5.0]]
bins:
array([0.996, 2. , 3. , 4. , 5. ])
Then you can reuse the bins as you please.
e.g.,
lst = [1, 2, 3]
category = pd.cut(lst,bins)
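If the goal is literally just the interval ranges, they can also be read off the result directly (a small addition to the answer above): the categorical's categories attribute is an IntervalIndex holding exactly the bin ranges, accessed through .cat on a Series:
import pandas as pd

data = pd.DataFrame({'a': [1, 2, 3, 4, 5]})
cats = pd.cut(data.a, 4)

# just the bin ranges, no per-element output
print(cats.cat.categories)
# IntervalIndex([(0.996, 2.0], (2.0, 3.0], (3.0, 4.0], (4.0, 5.0]], ...)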
For anyone who has come here to see how to select a particular bin from the pd.cut output: we can use the pd.Interval function
df['bin'] = pd.cut(df['y'], [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
print(df["bin"].value_counts())
Output:
(0.2, 0.3] 697
(0.4, 0.5] 156
(0.5, 0.6] 122
(0.3, 0.4] 12
(0.6, 0.7] 8
(0.7, 0.8] 4
(0.1, 0.2] 0
(0.8, 0.9] 0
print(df.loc[df['bin'] == pd.Interval(0.7, 0.8)])

averaging elements in a matrix with the corresponding elements in another matrix (in python)

I have the following matrices:
m1:
1 2 3
4 5 6
7 8 9
m2:
2 3 4
5 6 7
8 9 10
I want to average the two to get:
1.5 2.5 3.5
4.5 5.5 6.5
7.5 8.5 9.5
What is the best way of doing this?
Thanks
List comprehensions and the zip function are your friends:
>>> from __future__ import division
>>> m1 = [[1,2,3], [4,5,6], [7,8,9]]
>>> m2 = [[2,3,4], [5,6,7], [8,9,10]]
>>> [[(x+y)/2 for x,y in zip(r1, r2)] for r1, r2 in zip(m1, m2)]
[[1.5, 2.5, 3.5], [4.5, 5.5, 6.5], [7.5, 8.5, 9.5]]
Of course, the numpy package makes these kinds of computations trivially easy:
>>> from numpy import array
>>> m1 = array([[1,2,3], [4,5,6], [7,8,9]])
>>> m2 = array([[2,3,4], [5,6,7], [8,9,10]])
>>> (m1 + m2) / 2
array([[ 1.5, 2.5, 3.5],
[ 4.5, 5.5, 6.5],
[ 7.5, 8.5, 9.5]])
The obvious answer would be:
m1 = np.arange(1,10,dtype=np.double).reshape((3,3))
m2 = 1. + m1
m_average = 0.5 * (m1 + m2)
print m_average
array([[ 1.5, 2.5, 3.5],
[ 4.5, 5.5, 6.5],
[ 7.5, 8.5, 9.5]])
Perhaps a more elegant way (although probably a bit slower) to do it would be to use the numpy.mean function on a stacked version of the two arrays:
m_average = np.dstack([m1,m2]).mean(axis=2)
print m_average
array([[ 1.5, 2.5, 3.5],
[ 4.5, 5.5, 6.5],
[ 7.5, 8.5, 9.5]])
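Equivalently (a minor variant, not in the original answers), np.mean accepts a sequence of arrays and does the stacking itself:
m_average = np.mean([m1, m2], axis=0)  # stacks along a new axis 0, then averages it away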
