Measuring average cosine similarity between the groups

I have the following data frame:
Group Vector
1 [1 1 0 1 0 0]
1 [1 0 0 1 0 0]
1 [1 0 0 1 1 1]
1 [0 0 0 1 0 1]
2 [0 0 0 1 0 1]
2 [0 0 0 1 0 1]
2 [0 1 1 1 0 1]
2 [1 1 0 0 0 1]
How can I calculate the average cosine similarity within each group? This is the expected outcome (note that I made up the numbers):
Group Vector Average_Similarity
1 [1 1 0 1 0 0] 0.34
1 [1 0 0 1 0 0] 0.34
1 [1 0 0 1 1 1] 0.34
1 [0 0 0 1 0 1] 0.34
2 [0 0 0 1 0 1] 0.48
2 [0 0 0 1 0 1] 0.48
2 [0 1 1 1 0 1] 0.48
2 [1 1 0 0 0 1] 0.48

Suppose we read the data from your example like this:
import pandas as pd
from ast import literal_eval
# Parse the "Vector" column from text into Python lists while reading
df = pd.read_clipboard(sep="|", converters={"Vector": literal_eval})
df
Group Vector
0 1 [1, 1, 0, 1, 0, 0]
1 1 [1, 0, 0, 1, 0, 0]
2 1 [1, 0, 0, 1, 1, 1]
3 1 [0, 0, 0, 1, 0, 1]
4 2 [0, 0, 0, 1, 0, 1]
5 2 [0, 0, 0, 1, 0, 1]
6 2 [0, 1, 1, 1, 0, 1]
7 2 [1, 1, 0, 0, 0, 1]
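Then try:
from scipy.spatial.distance import pdist
# Note: metric="cosine" is the cosine distance (1 - similarity), so this column holds the
# average pairwise distance; use 1 - pdist(...) to get the average similarity instead.
df["Average_Similarity"] = df.groupby("Group")["Vector"].transform(
    lambda group: pdist(group.to_list(), metric="cosine").mean()
)
df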
Group Vector Average_Similarity
0 1 [1, 1, 0, 1, 0, 0] 0.380615
1 1 [1, 0, 0, 1, 0, 0] 0.380615
2 1 [1, 0, 0, 1, 1, 1] 0.380615
3 1 [0, 0, 0, 1, 0, 1] 0.380615
4 2 [0, 0, 0, 1, 0, 1] 0.365323
5 2 [0, 0, 0, 1, 0, 1] 0.365323
6 2 [0, 1, 1, 1, 0, 1] 0.365323
7 2 [1, 1, 0, 0, 0, 1] 0.365323
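SciPy's "cosine" metric is a distance, defined as 1 minus the cosine similarity, so the similarity itself needs one extra subtraction. A minimal sketch of that variant, assuming the same Group and Vector columns as in the question (the helper name mean_cosine_similarity and the group_means dictionary are made up for illustration):

```python
import pandas as pd
from scipy.spatial.distance import pdist

df = pd.DataFrame({
    "Group": [1, 1, 1, 1, 2, 2, 2, 2],
    "Vector": [
        [1, 1, 0, 1, 0, 0], [1, 0, 0, 1, 0, 0], [1, 0, 0, 1, 1, 1], [0, 0, 0, 1, 0, 1],
        [0, 0, 0, 1, 0, 1], [0, 0, 0, 1, 0, 1], [0, 1, 1, 1, 0, 1], [1, 1, 0, 0, 0, 1],
    ],
})

def mean_cosine_similarity(vectors):
    # pdist returns the cosine *distance* of every unordered pair of rows;
    # subtracting from 1 turns each distance into a similarity.
    return (1 - pdist(list(vectors), metric="cosine")).mean()

# Compute one value per group, then map it back onto the rows
group_means = {key: mean_cosine_similarity(rows)
               for key, rows in df.groupby("Group")["Vector"]}
df["Average_Similarity"] = df["Group"].map(group_means)
print(df)
```

With the question's data this should give roughly 0.619 for group 1 and 0.635 for group 2, i.e. 1 minus the distance averages.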

You can use groupby with apply to get the pairwise cosine similarity matrix of each group:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
# Stack each group's vectors into a 2-D array and compare every pair of rows
df.groupby('Group').apply(lambda x: cosine_similarity(np.array(x['Vector'].tolist())))
Group
1 [[1.0000000000000002, 0.816496580927726, 0.577...
2 [[0.9999999999999998, 0.9999999999999998, 0.70...
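A full similarity matrix contains every pair twice plus the diagonal of self-similarities, so it needs one more step to become a single average per group. A minimal NumPy-only sketch of that reduction, using the group 1 vectors from the question (the function name average_pairwise_similarity is hypothetical, and the sketch assumes no all-zero vectors, which would cause a division by zero):

```python
import numpy as np

def average_pairwise_similarity(vectors):
    x = np.asarray(vectors, dtype=float)
    # Normalise rows to unit length so the Gram matrix holds cosine similarities
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    sim = unit @ unit.T
    # Keep each unordered pair once: the strict upper triangle (k=1 skips the diagonal)
    upper = sim[np.triu_indices(len(x), k=1)]
    return upper.mean()

group_1 = [[1, 1, 0, 1, 0, 0], [1, 0, 0, 1, 0, 0], [1, 0, 0, 1, 1, 1], [0, 0, 0, 1, 0, 1]]
print(round(average_pairwise_similarity(group_1), 6))  # 0.619385
```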

Reconstruct your DataFrame so that each value in the vector is placed in its own column. Then self-merge within each group and use the index to de-duplicate comparisons (i.e. we compare 1 to 3 only once, not both 1 to 3 and 3 to 1).
Then we calculate the cosine distance (1 - cosine similarity) for each pair of rows and average within each group; drop the leading 1- to get the similarity itself.
import numpy as np
import pandas as pd
df = pd.concat([df['Group'], pd.DataFrame(df['Vector'].tolist())], axis=1).reset_index()
# Pair every row with every other row of the same group, keeping each pair once
m = (df.merge(df, on='Group').query('index_x > index_y')
       .drop(columns=['index_x', 'index_y'])
       .set_index('Group'))
X = m.filter(like='_x')
X.columns = X.columns.str.strip('_x')
Y = m.filter(like='_y')
Y.columns = Y.columns.str.strip('_y')
m['cos'] = 1-(X*Y).sum(1).div((np.sqrt((X**2).sum(1))*np.sqrt((Y**2).sum(1))), axis=0)
m.groupby(level=0)['cos'].mean()
Group
1 0.380615
2 0.365323
Name: cos, dtype: float64

Related

create a list of lists with a checkerboard pattern

I would like to change the values of this list by alternating the 0 and 1 values in a checkerboard pattern.
table =
1 1 1 1 1
1 1 1 1 1
1 1 1 1 1
1 1 1 1 1
1 1 1 1 1
I tried:
for i in range(len(table)):
    for j in range(0, len(table[i]), 2):  # I defined a step in the range function
        table[i][j] = 0
but the count starts again for each list, so the result is:
0 1 0 1 0
0 1 0 1 0
0 1 0 1 0
0 1 0 1 0
0 1 0 1 0
My question is: how can I change the loop to produce a checkerboard pattern?
I expect the result to look like this:
0 1 0 1 0
1 0 1 0 1
0 1 0 1 0
1 0 1 0 1
0 1 0 1 0

for i in range(len(table)):
    for j in range(len(table[i])):
        if (i+j)%2 == 0:
            table[i][j] = 0
Output:
[[0, 1, 0, 1, 0],
[1, 0, 1, 0, 1],
[0, 1, 0, 1, 0],
[1, 0, 1, 0, 1],
[0, 1, 0, 1, 0]]

The result doesn't appear to depend on the original values in the list, so it might be better to write a function that builds a new list in the required format, like this:
def checkboard(rows, columns):
    result = []
    for r in range(rows):
        # Each row starts with the opposite value of the row above it,
        # so the pattern also works when the number of columns is even
        e = r % 2
        c = []
        for _ in range(columns):
            c.append(e)
            e ^= 1
        result.append(c)
    return result
print(checkboard(5, 5))
print(checkboard(2, 3))
print(checkboard(4, 4))
Output:
[[0, 1, 0, 1, 0], [1, 0, 1, 0, 1], [0, 1, 0, 1, 0], [1, 0, 1, 0, 1], [0, 1, 0, 1, 0]]
[[0, 1, 0], [1, 0, 1]]
[[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]]
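If NumPy is available, the (i+j)%2 idea from the loop-based answer can also be written without explicit loops. This is only a sketch of one possible approach; the function name checkerboard is chosen here for illustration, and the result is a NumPy array rather than a list of lists:

```python
import numpy as np

def checkerboard(rows, columns):
    # np.indices gives the row index and column index of every cell;
    # a cell is 1 exactly when row + column is odd.
    r, c = np.indices((rows, columns))
    return (r + c) % 2

print(checkerboard(4, 4))
print(checkerboard(2, 3).tolist())  # convert back to a list of lists if needed
```

Because each cell depends only on its own coordinates, the pattern comes out right for both odd and even widths.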

How to keep track of row index of the rows I randomly select from a matrix?

I am trying to perform tournament selection in a GA, for which I need to select two rows at random. Is there a way of keeping track of the index values of the 2 random rows I select from the matrix self.population and storing them in variables?
At the moment it just outputs the two random rows, but I need to keep track of which rows were selected.
Below is what I have so far, although ideally I would like to store the two rows I select from my matrix in separate variables.
self.population = [[0 1 1 1 0 0 1 1 0 1]
[1 0 1 1 0 0 0 1 1 1]
[0 0 0 0 0 1 1 0 0 0]
[1 1 0 0 1 1 1 0 1 1]
[0 1 0 1 1 1 1 1 1 0]
[0 0 0 0 1 0 1 1 1 0]]
def tournament_select(self):
    b = np.random.randint(0, self.population[0], 2)
    return self.population[b]

Is this what you're looking for?
from random import sample
import numpy as np
population = np.array([[0, 1, 1, 1, 0, 0, 1, 1, 0, 1],
[1, 0, 1, 1, 0, 0, 0, 1, 1, 1],
[0, 0, 0, 0, 0, 1, 1, 0, 0, 0],
[1, 1, 0, 0, 1, 1, 1, 0, 1, 1],
[0, 1, 0, 1, 1, 1, 1, 1, 1, 0],
[0, 0, 0, 0, 1, 0, 1, 1, 1, 0]])
def tournament_select():
    row_indices = sample(range(len(population)), k=2)
    return row_indices, population[row_indices]
row_indices, candidates = tournament_select()
print(row_indices)
print(candidates)
Output:
[2, 3]
[[0 0 0 0 0 1 1 0 0 0]
[1 1 0 0 1 1 1 0 1 1]]
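The question also asks for the two rows to end up in separate variables. A minimal sketch of that variant using NumPy's random generator instead of random.sample; passing population and rng as arguments is a choice made here for illustration, not part of the original code:

```python
import numpy as np

rng = np.random.default_rng()
population = np.array([[0, 1, 1, 1, 0, 0, 1, 1, 0, 1],
                       [1, 0, 1, 1, 0, 0, 0, 1, 1, 1],
                       [0, 0, 0, 0, 0, 1, 1, 0, 0, 0],
                       [1, 1, 0, 0, 1, 1, 1, 0, 1, 1],
                       [0, 1, 0, 1, 1, 1, 1, 1, 1, 0],
                       [0, 0, 0, 0, 1, 0, 1, 1, 1, 0]])

def tournament_select(population, rng):
    # replace=False guarantees two different row indices
    i, j = rng.choice(len(population), size=2, replace=False)
    return (i, population[i]), (j, population[j])

(first_idx, first_row), (second_idx, second_row) = tournament_select(population, rng)
print(first_idx, first_row)
print(second_idx, second_row)
```

The indices are random, so the printed rows differ from run to run; seeding default_rng makes the selection reproducible.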

Cloning a column in 3d numpy array

Let's say I have a 3D array representing tic-tac-toe games (and their respective historical states):
[
[[0,0,0,1,1,0,0,0,1]], #<<--game 1
[[1,0,0,1,0,0,1,0,1]], #<<--game 2
[[1,0,0,1,0,0,1,0,1]] #<<--game 3
]
I would like to prepend a clone of these states, but keep the historical records growing out to the right, where they will act as an unadulterated historical record.
So the next iteration would look like this:
[
[[0,0,0,1,1,0,0,0,1], [0,0,0,1,1,0,0,0,1]], #<<--game 1
[[1,0,0,1,0,0,1,0,1], [1,0,0,1,0,0,1,0,1]], #<<--game 2
[[1,0,0,1,0,0,1,0,1], [1,0,0,1,0,0,1,0,1]] #<<--game 3
]
I will then edit these new columns. At a later time, I will copy it again.
So, I always want to copy this leftmost column (pass by value) - but I don't know how to perform this operation.

You can use concatenate:
import numpy as np
# initial array
a = np.array([
    [[0,0,0,1,1,0,0,0,1], [0,1,0,1,1,0,0,0,1]], #<<--game 1
    [[1,0,0,1,0,0,1,0,1], [1,1,0,1,0,0,1,0,1]], #<<--game 2
    [[1,0,0,1,0,0,1,0,1], [1,1,0,1,0,0,1,0,1]] #<<--game 3
])
# subset of this array (column 0)
b = a[:,0,:]
# reshape to add a dimension
b = b.reshape([-1,1,9])
print(a.shape, b.shape) # (3, 2, 9) (3, 1, 9)
# concatenate:
c = np.concatenate((a,b), axis=1)
print(c)
array([[[0, 0, 0, 1, 1, 0, 0, 0, 1],
[0, 1, 0, 1, 1, 0, 0, 0, 1],
[0, 0, 0, 1, 1, 0, 0, 0, 1]], # leftmost column copied
[[1, 0, 0, 1, 0, 0, 1, 0, 1],
[1, 1, 0, 1, 0, 0, 1, 0, 1],
[1, 0, 0, 1, 0, 0, 1, 0, 1]], # leftmost column copied
[[1, 0, 0, 1, 0, 0, 1, 0, 1],
[1, 1, 0, 1, 0, 0, 1, 0, 1],
[1, 0, 0, 1, 0, 0, 1, 0, 1]]]) # leftmost column copied

You can do this using hstack and slicing:
import numpy as np
start= np.asarray([[[0,0,0,1,1,0,0,0,1]],[[1,0,0,1,0,0,1,0,1]],[[1,0,0,1,0,0,1,0,1]]])
print(start)
print("duplicating...")
finish = np.hstack((start,start[:,:1,:]))
print(finish)
print("modifying...")
finish[0,1,2]=2
print(finish)
[[[0 0 0 1 1 0 0 0 1]]
[[1 0 0 1 0 0 1 0 1]]
[[1 0 0 1 0 0 1 0 1]]]
duplicating...
[[[0 0 0 1 1 0 0 0 1]
[0 0 0 1 1 0 0 0 1]]
[[1 0 0 1 0 0 1 0 1]
[1 0 0 1 0 0 1 0 1]]
[[1 0 0 1 0 0 1 0 1]
[1 0 0 1 0 0 1 0 1]]]
modifying...
[[[0 0 0 1 1 0 0 0 1]
[0 0 2 1 1 0 0 0 1]]
[[1 0 0 1 0 0 1 0 1]
[1 0 0 1 0 0 1 0 1]]
[[1 0 0 1 0 0 1 0 1]
[1 0 0 1 0 0 1 0 1]]]
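Both answers place the copy on the right. If the clone should sit on the left instead, as the question's wording suggests, swapping the argument order should be enough. A small sketch, assuming the same start array as above (the name grown is made up); it also checks that the clone is an independent copy rather than a view:

```python
import numpy as np

start = np.asarray([[[0, 0, 0, 1, 1, 0, 0, 0, 1]],
                    [[1, 0, 0, 1, 0, 0, 1, 0, 1]],
                    [[1, 0, 0, 1, 0, 0, 1, 0, 1]]])

# start[:, :1, :] keeps the middle axis (shape (3, 1, 9)), so no reshape is needed
grown = np.concatenate((start[:, :1, :], start), axis=1)
print(grown.shape)  # (3, 2, 9)

# concatenate allocates a new array, so editing the clone leaves the original untouched
grown[0, 0, 2] = 2
print(start[0, 0, 2])                  # 0
print(np.shares_memory(grown, start))  # False
```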

In OpenCV, how do I get a list of segmented regions?

I'm working on a project where I want to evaluate certain parameters on regions of a segmented image. So I have the following code:
col = cv2.imread("in.jpg",1)
col=cv2.resize(col,(width,height),interpolation=cv2.INTER_CUBIC)
res=cv2.pyrMeanShiftFiltering(col,20,45,3)
and would now like to somehow get a list of masks per region in res.
So for example if res was now something like this
1 1 0 2 1
1 0 0 2 1
0 0 2 2 1
I would like to get an output such as
1 1 0 0 0
1 0 0 0 0
0 0 0 0 0
,
0 0 1 0 0
0 1 1 0 0
1 1 0 0 0
,
0 0 0 1 0
0 0 0 1 0
0 0 1 1 0
,
0 0 0 0 1
0 0 0 0 1
0 0 0 0 1
So that is a mask for each group of connected pixels that share the same value. Maybe this could somehow involve the floodFill function? I can see that looping over every pixel, flood filling, and checking whether that set of pixels was already covered might work, but that seems very expensive, so is there something faster?
Here is an example image of res after the code has run: (image not shown)

Here's one approach with cv2.connectedComponents -
import cv2
import numpy as np
def list_seg_regs(a): # a is array
    out = []
    for i in np.unique(a):
        # label the connected regions among the pixels whose value is i
        ret, l = cv2.connectedComponents((a==i).astype(np.uint8))
        for j in range(1,ret): # label 0 is the background
            out.append((l==j).astype(int)) #skip .astype(int) for bool
    return out
Sample run -
In [53]: a = np.array([
...: [1, 1, 0, 2, 1],
...: [1, 0, 0, 2, 1],
...: [0, 0, 2, 2, 1]])
In [54]: out = list_seg_regs(a)
In [55]: out[0]
Out[55]:
array([[0, 0, 1, 0, 0],
[0, 1, 1, 0, 0],
[1, 1, 0, 0, 0]])
In [56]: out[1]
Out[56]:
array([[1, 1, 0, 0, 0],
[1, 0, 0, 0, 0],
[0, 0, 0, 0, 0]])
In [57]: out[2]
Out[57]:
array([[0, 0, 0, 0, 1],
[0, 0, 0, 0, 1],
[0, 0, 0, 0, 1]])
In [58]: out[3]
Out[58]:
array([[0, 0, 0, 1, 0],
[0, 0, 0, 1, 0],
[0, 0, 1, 1, 0]])
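For comparison, the flood-fill idea from the question does not have to be expensive, because each pixel only needs to be visited once. The following is a plain NumPy sketch of that idea, not the OpenCV implementation; the function name list_regions is hypothetical, and it assumes 4-connectivity (only up, down, left and right neighbours count as connected), which is a choice made here and may differ from the connectivity OpenCV uses by default:

```python
from collections import deque
import numpy as np

def list_regions(a):
    a = np.asarray(a)
    rows, cols = a.shape
    seen = np.zeros(a.shape, dtype=bool)
    masks = []
    for r0 in range(rows):
        for c0 in range(cols):
            if seen[r0, c0]:
                continue
            # Breadth-first flood fill from the first unvisited pixel
            mask = np.zeros(a.shape, dtype=int)
            queue = deque([(r0, c0)])
            seen[r0, c0] = True
            while queue:
                r, c = queue.popleft()
                mask[r, c] = 1
                for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                    if (0 <= nr < rows and 0 <= nc < cols
                            and not seen[nr, nc] and a[nr, nc] == a[r0, c0]):
                        seen[nr, nc] = True
                        queue.append((nr, nc))
            masks.append(mask)
    return masks

a = np.array([[1, 1, 0, 2, 1],
              [1, 0, 0, 2, 1],
              [0, 0, 2, 2, 1]])
masks = list_regions(a)
print(len(masks))  # 4
print(masks[0])
```

The masks come out in raster-scan order of each region's first pixel, which for this sample matches the order listed in the question.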

Lock Combinations for dynamic lock size

In the following I will give two examples that have different dimension values.
Lock-1
# numbers are the values shown on the lock, so in this case: 0,1,2,3,4,5
numbers = 5
# fields are those things i can turn to change my combination
fields = 4
So what I would expect as the full list of possibilities is
0 0 0 5
0 0 1 4
0 0 2 3
0 0 3 2
0 0 4 1
0 0 5 0
0 1 0 4
0 1 1 3
0 1 2 2
0 1 3 1
0 1 4 0
0 2 0 3
0 2 1 2
0 2 2 1
0 2 3 0
0 3 0 2
0 3 1 1
0 3 2 0
0 4 0 1
0 4 1 0
0 5 0 0
1 0 0 4
1 0 1 3
1 0 2 2
1 0 3 1
1 0 4 0
1 1 0 3
1 1 1 2
1 1 2 1
1 1 3 0
1 2 0 2
1 2 1 1
1 2 2 0
1 3 0 1
1 3 1 0
1 4 0 0
2 0 0 3
2 0 1 2
2 0 2 1
2 0 3 0
2 1 0 2
2 1 1 1
2 1 2 0
2 2 0 1
2 2 1 0
2 3 0 0
3 0 0 2
3 0 1 1
3 0 2 0
3 1 0 1
3 1 1 0
3 2 0 0
4 0 0 1
4 0 1 0
4 1 0 0
5 0 0 0
My second lock has the following values:
numbers = 3
fields = 3
So I would expect my possibilities to look like this
0 0 3
0 1 2
0 2 1
0 3 0
1 0 2
1 1 1
1 2 0
2 0 1
2 1 0
3 0 0
I know this can be done with itertools.permutations and so on, but I want to generate the rows by building them and not by overloading my RAM. I figured out that the last 2 columns are always built up the same way.
So I wrote a function which builds them for me:
def posibilities(value):
    all_pos = []
    for y in range(value + 1):
        posibility = []
        posibility.append(y)
        posibility.append(value)
        all_pos.append(posibility)
        value -= 1
    return all_pos
Now I want some way to fit the other values dynamically around my function, so e.g. Lock-2 would now look like this:
0 posibilities(3)
1 posibilities(2)
2 posibilities(1)
3 posibilities(0)
I know I should use while loops and so on, but I can't work out a solution for dynamic values.

You could do this recursively, but it's generally best to avoid recursion in Python unless you really need it, e.g. when processing recursive data structures (like trees). Recursion in standard Python (aka CPython) is not very efficient because it cannot do tail call elimination. Also, it applies a recursion limit (which is 1000 levels by default, but that can be modified by the user).
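To make the recursion limit concrete, here is a small sketch; the helper depth is made up for illustration, and the limit it reports depends on the environment:

```python
import sys

def depth(n):
    # Plain recursion: one Python stack frame per level
    return 0 if n == 0 else 1 + depth(n - 1)

limit = sys.getrecursionlimit()
print(limit)      # commonly 1000, but environments may change it
print(depth(50))  # 50, well within the limit

hit_limit = False
try:
    depth(limit + 10)  # deeper than the interpreter allows
except RecursionError as exc:
    hit_limit = True
    print("RecursionError:", exc)
```

For the lock problem the recursion depth only grows with the number of fields, so the limit is unlikely to matter in practice; the efficiency argument is the more relevant one.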
The sequences that you want to generate are known as weak compositions, and the Wikipedia article gives a simple algorithm which is easy to implement with the help of the standard itertools.combinations function.
#!/usr/bin/env python3
''' Generate the compositions of num of a given width
    Algorithm from
    https://en.wikipedia.org/wiki/Composition_%28combinatorics%29#Number_of_compositions
    Written by PM 2Ring 2016.11.11
'''
from itertools import combinations
def compositions(num, width):
    m = num + width - 1
    last = (m,)
    first = (-1,)
    for t in combinations(range(m), width - 1):
        yield [v - u - 1 for u, v in zip(first + t, t + last)]
# test
for t in compositions(5, 4):
    print(t)
print('- ' * 20)
for t in compositions(3, 3):
    print(t)
Output:
[0, 0, 0, 5]
[0, 0, 1, 4]
[0, 0, 2, 3]
[0, 0, 3, 2]
[0, 0, 4, 1]
[0, 0, 5, 0]
[0, 1, 0, 4]
[0, 1, 1, 3]
[0, 1, 2, 2]
[0, 1, 3, 1]
[0, 1, 4, 0]
[0, 2, 0, 3]
[0, 2, 1, 2]
[0, 2, 2, 1]
[0, 2, 3, 0]
[0, 3, 0, 2]
[0, 3, 1, 1]
[0, 3, 2, 0]
[0, 4, 0, 1]
[0, 4, 1, 0]
[0, 5, 0, 0]
[1, 0, 0, 4]
[1, 0, 1, 3]
[1, 0, 2, 2]
[1, 0, 3, 1]
[1, 0, 4, 0]
[1, 1, 0, 3]
[1, 1, 1, 2]
[1, 1, 2, 1]
[1, 1, 3, 0]
[1, 2, 0, 2]
[1, 2, 1, 1]
[1, 2, 2, 0]
[1, 3, 0, 1]
[1, 3, 1, 0]
[1, 4, 0, 0]
[2, 0, 0, 3]
[2, 0, 1, 2]
[2, 0, 2, 1]
[2, 0, 3, 0]
[2, 1, 0, 2]
[2, 1, 1, 1]
[2, 1, 2, 0]
[2, 2, 0, 1]
[2, 2, 1, 0]
[2, 3, 0, 0]
[3, 0, 0, 2]
[3, 0, 1, 1]
[3, 0, 2, 0]
[3, 1, 0, 1]
[3, 1, 1, 0]
[3, 2, 0, 0]
[4, 0, 0, 1]
[4, 0, 1, 0]
[4, 1, 0, 0]
[5, 0, 0, 0]
- - - - - - - - - - - - - - - - - - - -
[0, 0, 3]
[0, 1, 2]
[0, 2, 1]
[0, 3, 0]
[1, 0, 2]
[1, 1, 1]
[1, 2, 0]
[2, 0, 1]
[2, 1, 0]
[3, 0, 0]
FWIW, the above code can generate the 170544 sequences of compositions(15, 8) in around 1.6 seconds on my old 2GHz 32bit machine, running on Python 3.6 or Python 2.6. (The timing information was obtained by using the Bash time command).
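The number of sequences follows from the same "stars and bars" argument the algorithm is built on: each composition corresponds to a choice of width - 1 bar positions among num + width - 1 slots. A short sketch to check the counts (the function name count_compositions is made up, and math.comb requires Python 3.8 or later):

```python
from math import comb

def count_compositions(num, width):
    # Stars and bars: choose positions for width - 1 bars among num + width - 1 slots
    return comb(num + width - 1, width - 1)

print(count_compositions(5, 4))   # 56
print(count_compositions(3, 3))   # 10
print(count_compositions(15, 8))  # 170544
```

These match the 56 and 10 rows listed for the two locks and the 170544 sequences quoted for compositions(15, 8).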
FWIW, here's a recursive version taken from this answer by user3736966. I've modified it to use the same argument names as my code, to use lists instead of tuples, and to be compatible with Python 3.
def compositions(num, width, parent=[]):
    if width > 1:
        for i in range(num, -1, -1):
            yield from compositions(i, width - 1, parent + [num - i])
    else:
        yield parent + [num]
Somewhat surprisingly, this one is a little faster than the original version, clocking in at around 1.5 seconds for compositions(15, 8).
If your version of Python doesn't understand yield from, you can do this:
def compositions(num, width, parent=[]):
    if width > 1:
        for i in range(num, -1, -1):
            for t in compositions(i, width - 1, parent + [num - i]):
                yield t
    else:
        yield parent + [num]
To generate the compositions in descending order, simply reverse the range call, i.e. for i in range(num + 1):.
Finally, here's an unreadable one-line version. :)
def c(n, w, p=[]):
    yield from(t for i in range(n,-1,-1)for t in c(i,w-1,p+[n-i]))if w-1 else[p+[n]]
Being an inveterate tinkerer, I couldn't stop myself from making yet another version. :) This is simply the original version combined with the code for combinations listed in the itertools docs. Of course, the real itertools.combinations is written in C so it runs faster than the roughly equivalent Python code shown in the docs.
def compositions(num, width):
    r = width - 1
    indices = list(range(r))
    revrange = range(r-1, -1, -1)
    first = [-1]
    last = [num + r]
    yield [0] * r + [num]
    while True:
        for i in revrange:
            if indices[i] != i + num:
                break
        else:
            return
        indices[i] += 1
        for j in range(i+1, r):
            indices[j] = indices[j-1] + 1
        yield [v - u - 1 for u, v in zip(first + indices, indices + last)]
This version is about 50% slower than the original at doing compositions(15, 8): it takes around 2.3 seconds on my machine.
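With several variants around, a quick consistency check can be reassuring. The sketch below restates the combinations-based and recursive versions under distinct, made-up names (compositions_iter and compositions_rec, with tuples for the accumulated prefix) and confirms that they yield the same sequences in the same order for a few small inputs:

```python
from itertools import combinations

def compositions_iter(num, width):
    # Bar positions from combinations; gaps between bars are the parts
    m = num + width - 1
    for t in combinations(range(m), width - 1):
        yield [v - u - 1 for u, v in zip((-1,) + t, t + (m,))]

def compositions_rec(num, width, parent=()):
    # Fix the first part, then recurse on what is left of num
    if width > 1:
        for i in range(num, -1, -1):
            yield from compositions_rec(i, width - 1, parent + (num - i,))
    else:
        yield list(parent + (num,))

for num, width in [(5, 4), (3, 3), (0, 2), (4, 1)]:
    a = list(compositions_iter(num, width))
    b = list(compositions_rec(num, width))
    assert a == b, (num, width)
    # Every row has the right length and sums to num
    assert all(sum(row) == num and len(row) == width for row in a)
print("all variants agree")
```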
