How to generate tree on intents? - python

I have a feature matrix with mixed data types, and a corresponding output label for each row. I want to find the hierarchy among the output labels (i.e., classes or intents). Here is a sample:
token    probability    intent
------------------------------
t1       0.2            a
t2       0.7            a
t3       0.1            a
t1       0.3            b
t4       0.6            b
t3       0.1            b
t5       0.3            c
t6       0.3            c
t7       0.25           c
t8       0.15           c
t1       0.5            d
t2       0.5            d
Based on this data I want to generate a tree to represent a relationship among the output labels:
         ()
        /  \
      ()    \
     / | \   \
    a  b  d   c
I have looked into dendrograms, and for mixed data types the distance matrix can be computed with Gower distance. These seem like the right pieces, but I was not able to find a way to put them together. Also note that a node can have more than two children (a, b, d here). Is there a way to do this with these tools, or otherwise?
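Not from the question itself, but a minimal sketch of one way to put the pieces together, under the assumption that each intent is represented by its token-probability profile (a pivot of the sample data). With purely numeric features plain Euclidean distance suffices; the pdist call is the spot where a Gower matrix would be swapped in for genuinely mixed types:

```python
# A minimal sketch (an assumption, not from the question): represent each
# intent by its token-probability profile, then cluster hierarchically.
import io

import pandas as pd
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist

data = pd.read_csv(io.StringIO("""token,probability,intent
t1,0.2,a
t2,0.7,a
t3,0.1,a
t1,0.3,b
t4,0.6,b
t3,0.1,b
t5,0.3,c
t6,0.3,c
t7,0.25,c
t8,0.15,c
t1,0.5,d
t2,0.5,d
"""))

# One row per intent, one column per token; tokens an intent lacks get 0.
profiles = data.pivot_table(index='intent', columns='token',
                            values='probability', fill_value=0)

# For truly mixed data types, replace pdist(...) with a condensed Gower
# distance matrix (e.g. from the third-party `gower` package).
Z = linkage(pdist(profiles.values), method='average')
# dendrogram(Z, labels=profiles.index.tolist())  # draws the hierarchy
```

Note that scipy's linkage always merges clusters pairwise, so the raw dendrogram is binary; cutting it at a chosen height (scipy.cluster.hierarchy.fcluster) recovers multi-way groupings like (a, b, d).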

standardize pandas groupby results

I am using pandas to get subgroup averages, and the basics work fine. For instance,
import numpy as np
import pandas as pd

d = np.array([[1, 4], [1, 1], [0, 1], [1, 1]])
m = d.mean(axis=1)
p = pd.DataFrame(m, index='A1,A2,B1,B2'.split(','), columns=['Obs'])
print(p)
x = p.groupby([v[0] for v in p.index])
print(x.mean())
x = p.groupby([v[1] for v in p.index])
print(x.mean())
YIELDS:
Obs
A1 2.5
A2 1.0
B1 0.5
B2 1.0
Obs
A 1.75 <<<< 1.75 is (2.5 + 1.0) / 2
B 0.75
Obs
1 1.5
2 1.0
But, I also need to know how much A and B (1 and 2) deviate from their common mean. That is, I'd like to have tables like:
Obs Dev
A 1.75 0.50 <<< deviation of the Obs average, i.e., 1.75 - 1.25
B 0.75 -0.50 <<< 0.75 - 1.25 = -0.50
Obs Dev
1 1.5 0.25
2 1.0 -0.25
I can do this using loc, apply etc - but this seems silly. Can anyone think of an elegant way to do this using groupby or something similar?
Aggregate the means, then compute each group's difference from the mean of the group means:
(p.groupby(p.index.str[0])
   .agg(Obs=('Obs', 'mean'))
   .assign(Dev=lambda d: d['Obs'] - d['Obs'].mean())
)
Or, if the groups have variable numbers of items and you want the difference from the overall mean (not the mean of means!):
(p.groupby(p.index.str[0])
   .agg(Obs=('Obs', 'mean'))
   .assign(Dev=lambda d: d['Obs'] - p['Obs'].mean())  # note the p (not d)
)
output:
Obs Dev
A 1.75 0.5
B 0.75 -0.5
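The question's second table (grouping by the digit rather than the letter) follows from the same pattern; a self-contained sketch reusing the question's data:

```python
import numpy as np
import pandas as pd

d = np.array([[1, 4], [1, 1], [0, 1], [1, 1]])
p = pd.DataFrame(d.mean(axis=1), index='A1,A2,B1,B2'.split(','),
                 columns=['Obs'])

# Group by the second character of the index ('1' or '2').
out = (p.groupby(p.index.str[1])
         .agg(Obs=('Obs', 'mean'))
         .assign(Dev=lambda d: d['Obs'] - d['Obs'].mean()))
print(out)
#    Obs   Dev
# 1  1.5  0.25
# 2  1.0 -0.25
```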

Distances between subsets of rows in pandas

I have a matrix of the format
S1 S2 id var
0 1.2 3.2 A1 A
1 3.4 0.4 A2 A
2 -2.3 1.2 A3 A
3 0.1 -1.3 B1 B
4 4.5 1.3 B2 B
5 -2.3 -1.2 C1 C
And I want to compare the pairwise distances between all sets of A vs all B, then A vs C, and B vs C such that I get an average for dist_AB, dist_AC, and dist_BC. In other words:
dist_AB = ((A1 - B1) + (A1 - B2) + (A2 - B1) + (A2 - B2)) / 4
dist_AC = ((A1 - C1) + (A2 - C1)) / 2
dist_BC = ((B1 - C1) + (B2 - C1)) / 2
The challenge here is to do it on subsets. To implement this I can use loops:
import io
import itertools

import numpy as np
import pandas as pd

TESTDATA = """
S1 S2 id var
1.2 3.2 A1 A
3.4 0.4 A2 A
-2.3 1.2 A3 A
0.1 -1.3 B1 B
4.5 1.3 B2 B
-2.3 -1.2 C1 C
"""
df = pd.read_csv(io.StringIO(TESTDATA), sep=r"\s+")

vars_set = df[['id', 'var']].groupby('var')['id'].agg(list)
distances = pd.DataFrame()
for v1, v2 in itertools.combinations(vars_set.keys(), 2):
    print(v1 + v2)
    data1 = df.loc[df['var'] == v1]
    data2 = df.loc[df['var'] == v2]
    for row1 in data1.index:
        for row2 in data2.index:
            data1_row = data1.loc[row1]
            data2_row = data2.loc[row2]
            dist = np.linalg.norm(
                data1_row[['S1', 'S2']] - data2_row[['S1', 'S2']]
            )
            out = pd.Series([v1 + v2, data1_row['id'], data2_row['id'], dist],
                            index=['var', 'id1', 'id2', 'dist'])
            distances = pd.concat([distances, out], axis=1)
distances = distances.T
distances = distances.groupby('var')['dist'].agg('mean').reset_index()
distances
### returns the mean distances
var dist
0 AB 3.973345
1 AC 4.647527
2 BC 4.823540
My question is regarding the implementation. As I will be doing this calculation over many thousands of rows, this is very inefficient. Is there any more elegant and efficient way of doing it?
I have a solution without using itertools, but it involves a few steps. Let me know if it works with your larger dataset.
First we create a dataframe containing every combination using df.merge():
df2 = df.merge(df, 'cross')
Then we need to remove unwanted combinations: same-group pairs (e.g. A-A), and duplicates (A1-B1 is the same pair as B1-A1).
df2 = df2[df2.var_x != df2.var_y].reset_index(drop=True)
df2 = df2[pd.DataFrame(np.sort(df2[['id_x','id_y']].values, 1)).duplicated()]
Now we compute the distance:
df2['distance'] = np.linalg.norm(df2[['S1_x', 'S2_x']] - df2[['S1_y', 'S2_y']].values, axis = 1)
And finally using groupby we can compute the mean distance between the variables:
df2.groupby(['var_x', 'var_y']).distance.mean()
I hope it speeds up your computations!
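For scale, the loops can be avoided entirely. A sketch of an alternative (not part of the answer above) using scipy's cdist, which computes each between-group distance block in one call and then averages it:

```python
import io
import itertools

import pandas as pd
from scipy.spatial.distance import cdist

TESTDATA = """
S1 S2 id var
1.2 3.2 A1 A
3.4 0.4 A2 A
-2.3 1.2 A3 A
0.1 -1.3 B1 B
4.5 1.3 B2 B
-2.3 -1.2 C1 C
"""
df = pd.read_csv(io.StringIO(TESTDATA), sep=r"\s+")

# One coordinate array per group, then the mean of each pairwise block.
groups = {v: g[['S1', 'S2']].to_numpy() for v, g in df.groupby('var')}
means = {v1 + v2: cdist(groups[v1], groups[v2]).mean()
         for v1, v2 in itertools.combinations(sorted(groups), 2)}
print(means)  # {'AB': 3.97..., 'AC': 4.64..., 'BC': 4.82...}
```

This reproduces the question's mean distances while doing all the per-pair arithmetic inside one vectorized cdist call per group pair.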

How to measure similarity of inner observation variation without considering actual values?

I am sure this has been done before, but I am unsure how to even phrase the question for Google and have been racking my brain for a few hours. I can explain it with an example. Imagine you have the data below.
observation #   m1  m2  m3  m4  m5  m6
1               T   L   T   L   T   L
2               A   R   A   R   A   A
3               B   C   B   C   B   C
4               K   K   K   A   L   K
5               P   P   P   R   L   P
I want to generate some sort of similarity metric between observations that relates to the variation across the m1-6 variables. The actual values in the cells shouldn't matter at all.
Considering the table above, for example observations 1 and 3 are exactly the same as they vary the same across the m's (TLTLTL & BCBCBC). 1 & 3 are very similar to 2, and observations 4 and 5 are the same but not similar to 1-3.
I would like an output that captures all these relationships, for example:
observation #    1     2     3     4     5
1              1.0   0.8   1.0   0.1   0.1
2              0.8   1.0   0.8   0.2   0.2
3              1.0   0.8   1.0   0.1   0.1
4              0.1   0.2   0.1   1.0   1.0
5              0.1   0.2   0.1   1.0   1.0
A few notes: each cell can contain more than one letter, but again the actual contents of the cells don't matter, only the variation across the m's within each observation compared to other observations. Is there a name for what I am trying to do here? Also, I only know Python and R, so if you provide any code please use one of those (Python preferred).
It is driving me crazy that I can't figure this out. Thanks in advance for any help :)
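Each row effectively induces a partition of the columns (which m's share a value), so partition-comparison measures such as the adjusted Rand index are close relatives of this problem. A sketch of one simple construction (my own, and its off-diagonal values differ from the illustrative numbers above): recode each row by order of first appearance, so the actual symbols vanish, then score pairs by the fraction of matching positions:

```python
import numpy as np
import pandas as pd

rows = [list('TLTLTL'),
        list('ARARAA'),
        list('BCBCBC'),
        list('KKKALK'),
        list('PPPRLP')]

# Recode each row by order of first appearance, so the actual symbols
# no longer matter, only the pattern of repeats: TLTLTL and BCBCBC
# both become [0, 1, 0, 1, 0, 1].
codes = np.array([pd.factorize(r)[0] for r in rows])

n = len(codes)
sim = np.ones((n, n))
for i in range(n):
    for j in range(i + 1, n):
        sim[i, j] = sim[j, i] = (codes[i] == codes[j]).mean()
print(np.round(sim, 2))
```

As desired, observations 1 and 3 score 1.0, as do 4 and 5, while dissimilar patterns score lower; the exact off-diagonal values are a property of this particular metric, not the example table's numbers.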

Choosing a random value from a discrete distribution

I came across the following code while reading up on RL. The probs vector contains the probability of each action, and I believe the loop chooses an action at random from the given distribution. Why/how does this work?
a = 0
rand_select = np.random.rand()
while True:
    rand_select -= probs[a]
    if rand_select < 0 or a + 1 == n_actions:
        break
    a += 1
actions = a
After going through similar code, I realised that "actions" contains the final action to be taken.
You can view the probabilities as a distribution of contiguous parts on the line from 0.0 to 1.0.
If we have A: 0.2, B: 0.3, C: 0.5, the line could be:
0.0 --A--> 0.2
0.2 --B--> 0.5
0.5 --C--> 1.0
And in total 1.0.
The algorithm is choosing a random location between 0.0->1.0 and finds out where it "landed" (A, B or C) by sequentially ruling out parts.
Suppose we draw 0.73. We can "visualize" it like this (selection marked with *):
0.0 ---------------------------> 1.0
                        *
0.0 --A--> 0.2 --B--> 0.5 --C--> 1.0
0.73 - 0.2 > 0, so we subtract A's 0.2 (leaving 0.53) and are left with:
0.2 --B--> 0.5
0.5 --C--> 1.0
0.53 - 0.3 > 0, so we subtract B's 0.3 (leaving 0.23) and are left with:
0.5 --C--> 1.0
0.23 - 0.5 < 0, so we know the part we drew was C.
The selection distributes the same as the probabilities and the algorithm is O(n) where n is the number of probabilities.
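The loop can be wrapped up and checked empirically; a sketch (the function name and test setup are mine) confirming that the drawn frequencies track probs:

```python
import numpy as np

def sample_action(probs, rng):
    """Inverse-CDF sampling by sequentially subtracting probabilities,
    as in the loop from the question."""
    rand_select = rng.random()
    a = 0
    while True:
        rand_select -= probs[a]
        if rand_select < 0 or a + 1 == len(probs):
            break
        a += 1
    return a

rng = np.random.default_rng(0)
probs = [0.2, 0.3, 0.5]
draws = [sample_action(probs, rng) for _ in range(100_000)]
freq = np.bincount(draws, minlength=3) / len(draws)
print(np.round(freq, 2))  # close to [0.2, 0.3, 0.5]
```

Equivalently, np.searchsorted(np.cumsum(probs), rng.random()) lands on the same segment in O(log n) per draw instead of O(n).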

Difficulty piping with qhull through python

I'm having trouble piping commands through qhull in Python. I'm currently trying to do so like this:
input_command = "rbox c " + str(qpoints) + " | qconvex FQ FV n"
command = subprocess.Popen(input_command.split(" "), stdout=subprocess.PIPE)
print command.communicate()[0]
Here, qpoints is formatted so that input_command winds up as:
rbox c P0,0,0 P0,0,2 P0,2,0 P0,2,2 P2,0,0 P2,0,2 P2,2,0 P2,2,2 | qconvex FQ FV n
Unfortunately though, this just prints out the usage of qconvex:
qconvex- compute the convex hull. Qhull 2012.1 2012/02/18
input (stdin): dimension, number of points, point coordinates
comments start with a non-numeric character
options (qconvex.htm):
Qt - triangulated output
QJ - joggled input instead of merged facets
Tv - verify result: structure, convexity, and point inclusion
. - concise list of all options
- - one-line description of all options
output options (subset):
s - summary of results (default)
i - vertices incident to each facet
n - normals with offsets
p - vertex coordinates (includes coplanar points if 'Qc')
Fx - extreme points (convex hull vertices)
FA - report total area and volume
FS - compute total area and volume
o - OFF format (dim, n, points, facets)
G - Geomview output (2-d, 3-d, and 4-d)
m - Mathematica output (2-d and 3-d)
QVn - print facets that include point n, -n if not
TO file- output results to file, may be enclosed in single quotes
examples:
rbox c D2 | qconvex s n rbox c D2 | qconvex i
rbox c D2 | qconvex o rbox 1000 s | qconvex s Tv FA
rbox c d D2 | qconvex s Qc Fx rbox y 1000 W0 | qconvex s n
rbox y 1000 W0 | qconvex s QJ rbox d G1 D12 | qconvex QR0 FA Pp
rbox c D7 | qconvex FA TF1000
I have read some examples online of extra steps that have to be taken when piping in Python calls, but I can't get any of them to work, and there's been almost no explanation of what's going on. Can someone show me a code snippet here that will work, and explain why it works?
I have also tried reading in the result of one function from file. For instance, I have tried reading the result of rbox from file:
python code:
input_command = "qconvex FQ FV n < rbox.txt"
command = subprocess.Popen(input_command.split(" "), shell=True)
result = command.communicate()
return result
data:
3 rbox c P1,1,1 P1,1,3 P1,3,1 P1,3,3 P3,1,1 P3,1,3 P3,3,1 P3,3,3
16
1 1 1
1 1 3
1 3 1
1 3 3
3 1 1
3 1 3
3 3 1
3 3 3
-0.5 -0.5 -0.5
-0.5 -0.5 0.5
-0.5 0.5 -0.5
-0.5 0.5 0.5
0.5 -0.5 -0.5
0.5 -0.5 0.5
0.5 0.5 -0.5
0.5 0.5 0.5
This still just prints out the QConvex description though. The weird thing is that this works perfectly fine from the command line, just not through python. Even if I can't get piping to work, I absolutely need the reading in from file to work. Does anyone know what the trick is to making this function call?
Use shell=True if you use shell features such as |, or rewrite the command in pure Python (see "Replacing shell pipeline" in the subprocess docs).
If you use shell=True, then pass the command as a single string, as specified in the docs:
from subprocess import check_output as qx

output = qx("rbox c {qpoints} | qconvex FQ FV n".format(qpoints=qpoints),
            shell=True)
print output
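The "Replacing shell pipeline" route avoids the shell entirely by wiring the first process's stdout to the second's stdin. A sketch with portable stand-in commands (with qhull installed, the argument lists would be ['rbox', 'c', ...] and ['qconvex', 'FQ', 'FV', 'n'] instead):

```python
import subprocess
import sys

# Stand-ins for rbox and qconvex so the sketch runs anywhere: the
# producer emits numbers, the consumer sums whatever arrives on stdin.
producer = subprocess.Popen(
    [sys.executable, '-c', 'print("1 2 3")'],
    stdout=subprocess.PIPE)
consumer = subprocess.Popen(
    [sys.executable, '-c',
     'import sys; print(sum(map(int, sys.stdin.read().split())))'],
    stdin=producer.stdout, stdout=subprocess.PIPE, text=True)
producer.stdout.close()  # let the producer see SIGPIPE if the consumer exits
output = consumer.communicate()[0]
print(output)  # 6
```

Closing the producer's stdout in the parent matters: it ensures the producer receives SIGPIPE (rather than blocking) if the consumer exits before reading everything.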
