how to efficiently construct an affinity matrix from rows of transactions? - python

Given transactions between nodes in a (potentially large, 2+ GB) JSON file, with ~1 million nodes and ~10 million transactions, each involving 10-1000 nodes, such as
{"transactions":
[
{"transaction 1": ["node1","node2","node7"], "weight":0.41},
{"transaction 2": ["node4","node2","node1","node3","node10","node7","node9"], "weight":0.67},
{"transaction 3": ["node3","node10","node11","node2","node1"], "weight":0.33},...
]
}
what would be the most elegant and efficient Pythonic way to convert this into a node affinity matrix, where the affinities are the sums of weighted transactions between the nodes?
affinity [i,j] = weighted transaction count between nodes[i] and nodes[j] = affinity [j,i]
e.g.
affinity[node1, node7] = [0.41 (transaction1) + 0.67 (transaction2)] / 2 = affinity[node7, node1]
Note: the affinity matrix will be symmetrical and thus computing lower triangle alone will suffice.
(Values not representative; structure example only!)
        node1  node2  node3  node4  ...
node1     1     .4     .1     .9
node2    .4      1     .6     .3
node3    .1     .6      1     .7
node4    .9     .3     .7      1
...

First of all I would clean the data, represent each node with an integer, and start with a list of dictionaries like this
data=[{'transaction': [1, 2, 7], 'weight': 0.41},
{'transaction': [4, 2, 1, 3, 10, 7, 9], 'weight': 0.67},
{'transaction': [3, 10, 11, 2, 1], 'weight': 0.33}]
Not sure if this is Pythonic enough, but it should be self-explanatory:
def weight(i, j, data_item):
    return data_item["weight"] if i in data_item["transaction"] and j in data_item["transaction"] else 0

def affinity(i, j):
    if j < i:  # the matrix is symmetric
        return affinity(j, i)
    weights = [weight(i, j, data_item) for data_item in data if weight(i, j, data_item) != 0]
    if len(weights) == 0:
        return 0
    return sum(weights) / len(weights)

ln = 10  # number of nodes
A = [[affinity(i, j) for j in range(1, ln + 1)] for i in range(1, ln + 1)]
To view the affinity matrix
import numpy as np
print(np.array(A))
[[ 0.47 0.47 0.5 0.67 0. 0. 0.54 0. 0.67 0.5 ]
[ 0.47 0.47 0.5 0.67 0. 0. 0.54 0. 0.67 0.5 ]
[ 0.5 0.5 0.5 0.67 0. 0. 0.67 0. 0.67 0.5 ]
[ 0.67 0.67 0.67 0.67 0. 0. 0.67 0. 0.67 0.67]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ]
[ 0.54 0.54 0.67 0.67 0. 0. 0.54 0. 0.67 0.67]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ]
[ 0.67 0.67 0.67 0.67 0. 0. 0.67 0. 0.67 0.67]
[ 0.5 0.5 0.5 0.67 0. 0. 0.67 0. 0.67 0.5 ]]
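At the stated scale (~1 million nodes, ~10 million transactions) the per-element affinity() calls above would be far too slow, since every entry rescans all transactions. One alternative sketch, assuming the same integer-coded sample data: build a node-by-transaction incidence matrix, then get the summed weights and the shared-transaction counts from matrix products. (For real data the incidence matrix would be a scipy.sparse matrix rather than a dense NumPy array; the idea is the same.)

```python
import numpy as np

# Same integer-coded sample data as in the answer above.
data = [{'transaction': [1, 2, 7], 'weight': 0.41},
        {'transaction': [4, 2, 1, 3, 10, 7, 9], 'weight': 0.67},
        {'transaction': [3, 10, 11, 2, 1], 'weight': 0.33}]

n_nodes = 11
# Incidence matrix B: B[i, t] = 1 if node i+1 appears in transaction t.
B = np.zeros((n_nodes, len(data)))
for t, item in enumerate(data):
    for node in item['transaction']:
        B[node - 1, t] = 1.0

w = np.array([item['weight'] for item in data])
weight_sum = (B * w) @ B.T   # B diag(w) B^T: summed weights of shared transactions
count = B @ B.T              # number of shared transactions per node pair
with np.errstate(invalid='ignore', divide='ignore'):
    affinity = np.where(count > 0, weight_sum / count, 0.0)
```

This reproduces the matrix printed above (extended to node 11): for example affinity[0, 6], i.e. node1/node7, is (0.41 + 0.67) / 2 = 0.54.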

Creating a Kernel matrix without for-loops in Python

I know there are other posts asking similar questions, but I didn't manage to find one that answers my specific question.
I have the code below:
def kernel_function(self, x1, x2):
    h = 0.5
    return np.exp(-(np.linalg.norm(x2 - x1) / h) ** 2)

for i, x1 in enumerate(train_x):
    for j, x2 in enumerate(train_x):
        K[i, j] = self.kernel_function(x1, x2)
where x1 and x2 are arrays of shape (2,). I need to vectorize it for performance. I looked at np.fromfunction and np.outer, but they don't seem to be what I am looking for.
Thank you in advance. Sorry if there is already an answer somewhere!
Assuming train_x has the following format:
>>> train_x = np.array(((-.2, -.1), (0, .1), (.2, 0), (.1, -.1)))
Executing your code you get:
>>> np.set_printoptions(precision=2)
>>> K
[[1. 0.73 0.51 0.7 ]
[0.73 1. 0.82 0.82]
[0.51 0.82 1. 0.92]
[0.7 0.82 0.92 1. ]]
You can reshape train_x:
>>> train_x_cols = train_x.T.reshape(2, -1, 1)
>>> train_x_rows = train_x.T.reshape(2, 1, -1)
So, thanks to broadcasting, you get all the combinations when you subtract them:
>>> train_x_rows - train_x_cols
[[[ 0. 0.2 0.4 0.3]
[-0.2 0. 0.2 0.1]
[-0.4 -0.2 0. -0.1]
[-0.3 -0.1 0.1 0. ]]
[[ 0. 0.2 0.1 0. ]
[-0.2 0. -0.1 -0.2]
[-0.1 0.1 0. -0.1]
[ 0. 0.2 0.1 0. ]]]
And you can rewrite kernel_function() to calculate the norm on the first axis only:
def kernel_function(x1, x2):
    h = 0.5
    return np.exp(-(np.linalg.norm(x2 - x1, axis=0) / h) ** 2)
Then you get:
>>> kernel_function(train_x_cols, train_x_rows)
[[1. 0.73 0.51 0.7 ]
[0.73 1. 0.82 0.82]
[0.51 0.82 1. 0.92]
[0.7 0.82 0.92 1. ]]
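Equivalently, you can keep train_x in its original (n, 2) shape and broadcast over a newly inserted axis, taking the norm over the last axis instead of the first; a sketch of the same idea:

```python
import numpy as np

train_x = np.array(((-.2, -.1), (0, .1), (.2, 0), (.1, -.1)))
h = 0.5

# diff[i, j] = train_x[i] - train_x[j]; shape (4, 1, 2) - (1, 4, 2) -> (4, 4, 2)
diff = train_x[:, None, :] - train_x[None, :, :]
K = np.exp(-(np.linalg.norm(diff, axis=-1) / h) ** 2)
```

This gives the same kernel matrix as above: symmetric, with ones on the diagonal.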

Loop over finite probability weights with SciPy/NumPy

Let us have a single event probability prob, a scalar between 0 and 1. If I want to iterate over every possible probability in 0.1 increments, then I can use:
prob = np.arange(0.01, 1, 0.1)
Now assume I have 5 events (independent, probabilities sum to 1), each with probability p_i. I would like to have multi-dimensional probability arrays such as:
1.0 - 0.0 - 0.0 - 0.0 - 0.0
0.9 - 0.1 - 0.0 - 0.0 - 0.0
0.9 - 0.0 - 0.1 - 0.0 - 0.0
0.9 - 0.0 - 0.0 - 0.1 - 0.0
0.9 - 0.0 - 0.0 - 0.0 - 0.1
0.8 - 0.1 - 0.1 - 0.0 - 0.0
0.8 - 0.1 - 0.0 - 0.1 - 0.0
. . . . .
. . . . .
. . . . .
0.2 - 0.2 - 0.2 - 0.2 - 0.2
Is there a more clever way than to consider all the combinations of 0 - 0.1 - ... - 1 and delete the rows not summing up to 1? If yes, what is the easiest way?
You can use itertools.product and filter to create all the combinations that sum to 10, and pass them to an array:
import itertools
f = filter(lambda x: sum(x) == 10, itertools.product(range(11), repeat=5))
x = np.array(list(f), dtype=float) / 10
x
>> array([[0. , 0. , 0. , 0. , 1. ],
[0. , 0. , 0. , 0.1, 0.9],
[0. , 0. , 0. , 0.2, 0.8],
...,
[0.9, 0. , 0.1, 0. , 0. ],
[0.9, 0.1, 0. , 0. , 0. ],
[1. , 0. , 0. , 0. , 0. ]])
EDIT
For the record, here's a more efficient way that avoids the filtering. Essentially you create k bins (in your example, 10) and "assign" them to the n samples (in your example, 5) in all possible combinations, using combinations_with_replacement.
Then you count how many bins each sample gets: this is your probability. This method is harder to understand but avoids the filter, and thus is much more efficient. You can try it with subdivisions of 0.01 (k = 100):
n = 3    # number of samples
k = 100  # number of subdivisions
f = itertools.combinations_with_replacement(range(n), k)  # your iterator
r = np.array(list(f))  # your array of combinations
x = np.vstack([(r == i).sum(1) for i in range(n)]).T / k  # your probability matrix
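The same construction at the question's own sizes (n = 5 events in steps of 0.1, so k = 10) produces each valid distribution exactly once; a small self-contained sketch:

```python
import itertools
import numpy as np

n, k = 5, 10  # 5 events, probability steps of 1/k = 0.1
combos = itertools.combinations_with_replacement(range(n), k)
r = np.array(list(combos))
# each row assigns the k tenths to the n events; count them per event
x = np.vstack([(r == i).sum(axis=1) for i in range(n)]).T / k
```

The number of rows is the number of multisets, C(n + k - 1, k) = C(14, 10) = 1001, and every row sums to 1, so no filtering step is needed.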
There's likely a more elegant solution using itertools, but this is probably fine and uses no dependencies:
for i in prob:
    for j in prob:
        for k in prob:
            for l in prob:
                m = 1 - i - j - k - l
                if m >= 0:
                    print(i, j, k, l, m)

how to know the percentage of an image prediction?

I want to know the predicted percentage for each class of one image:
classes = model.predict(image)
print(classes)
Output:
0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
I want it to show something like
[0.95, 0.20, 0.30, 0.0, 0.25, ...]
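model.predict returns whatever the model's final layer outputs, so if that layer uses a softmax activation the values are already per-class probabilities; a one-hot-looking output can simply mean a very confident model, or rounding in the printout. If the model instead exposes raw scores (logits), applying a softmax converts them to probabilities. A minimal NumPy sketch, with hypothetical logits standing in for the model output:

```python
import numpy as np

logits = np.array([2.0, -1.0, 0.5, 0.1])  # hypothetical raw class scores

# numerically stable softmax: shift by the max before exponentiating
probs = np.exp(logits - logits.max())
probs /= probs.sum()
```

Printing with np.set_printoptions(precision=2, suppress=True) also helps reveal small probabilities that would otherwise display as 0.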

transform an adjacency list into a sparse adjacency matrix using python

When using scipy, I was able to transform my data in the following format:
(row, col) (weight)
(0, 0) 5
(0, 47) 5
(0, 144) 5
(0, 253) 4
(0, 513) 5
...
(6039, 3107) 5
(6039, 3115) 3
(6039, 3130) 4
(6039, 3132) 2
How can I transform this into an array or sparse matrix, with zeros for the missing weight values, as below? (Based on the data above, columns 1 to 46 should be filled with zeros, and so on...)
    0  1  2  3 ... 47 48 49 50
1 [ 0  0  0  0 ...  5  0  0  0  0
2   2  0  1  0 ...  4  0  5  0  0
3   3  1  0  5 ...  1  0  0  4  2
4   0  0  0  4 ...  5  0  1  3  0
5   5  1  5  4 ...  0  0  3  0  1 ]
I know it is better in terms of memory to keep the data in the format above, but I need it as a matrix for experimentation.
scipy.sparse does it for you.
import numpy as np
from scipy.sparse import dok_matrix

your_data = [((2, 7), 1)]
XDIM, YDIM = 10, 10  # replace with your dimensions

smat = dok_matrix((XDIM, YDIM))
for (row, col), weight in your_data:
    smat[row, col] = weight

dense = smat.toarray()
print(dense)
'''
[[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
'''
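When the data is already available as parallel sequences of rows, columns and weights, as in the (row, col) weight listing above, coo_matrix can build the sparse matrix directly from the triplets, avoiding per-element inserts entirely. A sketch with a few hypothetical triplets:

```python
import numpy as np
from scipy.sparse import coo_matrix

# hypothetical (row, col, weight) triplets in the question's format
rows = np.array([0, 0, 2])
cols = np.array([0, 3, 1])
weights = np.array([5, 5, 4])

smat = coo_matrix((weights, (rows, cols)), shape=(4, 5))
dense = smat.toarray()  # missing entries come out as zeros
```

For large inputs this is the usual construction route; the COO matrix can then be converted with .tocsr() for fast arithmetic, or densified only for experimentation.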

Python: Element wise division operator error

I would like to know whether there is a better way to perform element-wise division in Python. The code below is supposed to divide row A1 by row B1 and row A2 by row B2, so my expected output is only two rows. However, the division computes A1 with B1, A1 with B2, A2 with B1, and A2 with B2. Can anyone help me?
The binary file holds the A, C, G, T representations using 1000, 0100, 0010, 0001.
The division file has four columns, one each for A, C, G, T, and therefore the values obtained earlier must be divided accordingly.
Code
import numpy as np
from numpy import genfromtxt
import csv

csvfile = open('output.csv', 'w', newline='')
writer = csv.writer(csvfile)

# open the csv files into arrays
with open('binary.csv') as actg:
    actg = actg.readlines()
with open('single.csv') as single:
    single = single.readlines()
with open('division.csv') as division:
    division = division.readlines()

# Convert each binary line and single line into 3 rows and 4 columns
# of binary values using reshape
for line in actg:
    myarray = np.fromstring(line, dtype=float, sep=',')
    myarray = myarray.reshape((-1, 3, 4))
    for line2 in single:
        single1 = np.fromstring(line2, dtype=float, sep=',')
        single1 = single1.reshape((-1, 4))
        # The division file has 2 rows and 4 columns: the 1st column
        # represents 1000, the 2nd 0100, the 3rd 0010, the 4th 0001 in
        # binary.csv. Therefore 1000's value should be divided by the
        # 1st column, 0010's by the 3rd column value, and so on.
        for line1 in division:
            division1 = np.fromstring(line1, dtype=float, sep=',')
            m = np.asmatrix(division1)
            m = np.array(m)
            res2 = (single1[np.newaxis, :, :] / m[:, np.newaxis, :] * myarray).sum(axis=-1)
            print(res2)
            writer.writerow(res2)
csvfile.close()
binary.csv
0,1,0,0,1,0,0,0,0,0,0,1
0,0,1,0,1,0,0,0,1,0,0,0
single.csv:
0.28,0.22,0.23,0.27,0.12,0.29,0.34,0.21,0.44,0.56,0.51,0.65
division.csv
0.4,0.5,0.7,0.1
0.2,0.8,0.9,0.3
Expected output
0.44,0.3,6.5
0.26,0.6,2.2
Actual output
0.44,0.3,6.5
0.275,0.6,2.16666667
0.32857143,0.3,1.1
0.25555556,0.6,2.2
Explanation of the error
Let the division file be as follows:
A,B,C,D
E,F,G,H
and let the result after the single and binary computation be as follows:
1,3,4
2,2,1
The numbers are assigned to the locations A,B,C,D, and the next row to E,F,G,H, so the desired output is
1/A,3/C,4/D
2/F,2/F,1/E
where 1 is divided by A, 3 by C, and so on. Basically this is what the code should do. Unfortunately, the division happens as described earlier: 2,2,1 also operates with B,B,C and 1,3,4 also operates with E,G,H, therefore the output has 4 rows, which is not what I want.
I don't know if this is what you are looking for, but here is a short way to get what (I think) you want:
import numpy as np
binary = np.genfromtxt('binary.csv', delimiter=',').reshape((2, 3, 4))
single = np.genfromtxt('single.csv', delimiter=',').reshape((1, 3, 4))
divisi = np.genfromtxt('division.csv', delimiter=',').reshape((2, 1, 4))
print(np.sum(single / divisi * binary, axis=-1))
Output:
[[ 0.44 0.3 6.5 ]
[ 0.25555556 0.6 2.2 ]]
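The same computation can be checked without the CSV files by pasting the values in directly: shapes (1, 3, 4) and (2, 1, 4) broadcast against (2, 3, 4), and summing over the last axis leaves one output row per binary line.

```python
import numpy as np

# inline copies of binary.csv, single.csv and division.csv from the question
binary = np.array([[0,1,0,0, 1,0,0,0, 0,0,0,1],
                   [0,0,1,0, 1,0,0,0, 1,0,0,0]], float).reshape((2, 3, 4))
single = np.array([0.28,0.22,0.23,0.27, 0.12,0.29,0.34,0.21,
                   0.44,0.56,0.51,0.65]).reshape((1, 3, 4))
divisi = np.array([[0.4,0.5,0.7,0.1],
                   [0.2,0.8,0.9,0.3]]).reshape((2, 1, 4))

# broadcast division, mask with the one-hot binary codes, sum out the columns
res = np.sum(single / divisi * binary, axis=-1)
```

This reproduces the two-row output shown above, confirming that each binary line is paired with its own division line.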
The output of your program looks kind of like this:
myarray
[ 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 1.]
[[[ 0. 1. 0. 0.]
[ 1. 0. 0. 0.]
[ 0. 0. 0. 1.]]]
single1
[ 0.28 0.22 0.23 0.27 0.12 0.29 0.34 0.21 0.44 0.56 0.51 0.65]
[[ 0.28 0.22 0.23 0.27]
[ 0.12 0.29 0.34 0.21]
[ 0.44 0.56 0.51 0.65]]
division
[ 0.4 0.5 0.7 0.1]
m
[[ 0.4 0.5 0.7 0.1]]
res2
[[ 0.44 0.3 6.5 ]]
division
[ 0.2 0.8 0.9 0.3]
m
[[ 0.2 0.8 0.9 0.3]]
res2
[[ 0.275 0.6 2.16666667]]
myarray
[ 0. 0. 1. 0. 1. 0. 0. 0. 1. 0. 0. 0.]
[[[ 0. 0. 1. 0.]
[ 1. 0. 0. 0.]
[ 1. 0. 0. 0.]]]
single1
[ 0.28 0.22 0.23 0.27 0.12 0.29 0.34 0.21 0.44 0.56 0.51 0.65]
[[ 0.28 0.22 0.23 0.27]
[ 0.12 0.29 0.34 0.21]
[ 0.44 0.56 0.51 0.65]]
division
[ 0.4 0.5 0.7 0.1]
m
[[ 0.4 0.5 0.7 0.1]]
res2
[[ 0.32857143 0.3 1.1 ]]
division
[ 0.2 0.8 0.9 0.3]
m
[[ 0.2 0.8 0.9 0.3]]
res2
[[ 0.25555556 0.6 2.2 ]]
So, with that in mind, it looks like the last two rows of the output, the ones you did not expect, are caused by the second line in binary.csv. So don't use that line in your calculations if you don't want 4 rows in your result.