Grouping pairs of combination data based on given condition - python

Suppose I have a huge array of data; a sample of it is:
x= [ 511.31, 512.24, 571.77, 588.35, 657.08, 665.49, -1043.45, -1036.56,-969.39, -955.33]
I used the following code to generate all possible pairs
Pairs=[(x[i],x[j]) for i in range(len(x)) for j in range(i+1, len(x))]
This gave me all possible pairs. Now I would like to group these pairs whose two values are within ±25 of each other and label them accordingly.
Any idea or advice on how to do this? Thanks in advance

If I understood your problem correctly, the code below should do the trick. The idea is to build a dictionary whose keys are the (truncated integer) mean of each pair, and keep appending matching pairs to it:
import numpy as np  # I use numpy for the mean.

# Your threshold
threshold = 25

# A dictionary will hold the relevant pairs
mylist = {}
for i in Pairs:
    # Check for the threshold and discard otherwise
    diff = abs(i[1] - i[0])
    if diff < threshold:
        # Name of the entry in the dictionary
        entry = str('%d' % int(np.mean(i)))
        # If the entry already exists, append. Otherwise, create a container list
        if entry in mylist:
            mylist[entry].append(i)
        else:
            mylist[entry] = [i]
which results in the following output:
{'-1040': [(-1043.45, -1036.56)],
 '-962': [(-969.39, -955.33)],
 '511': [(511.31, 512.24)],
 '580': [(571.77, 588.35)],
 '661': [(657.08, 665.49)]}
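For reference, here is a minimal sketch of the same idea using itertools.combinations and collections.defaultdict, which avoids writing the index arithmetic for Pairs by hand; it assumes the same x and threshold as above and uses the truncated integer mean directly as the group label:

from collections import defaultdict
from itertools import combinations

import numpy as np

threshold = 25
groups = defaultdict(list)           # label -> list of matching pairs
for a, b in combinations(x, 2):      # every unordered pair, like Pairs above
    if abs(a - b) < threshold:
        groups[int(np.mean((a, b)))].append((a, b))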

This should be a fast way to do that:
import numpy as np
from scipy.spatial.distance import pdist

# Input data
x = np.array([511.31, 512.24, 571.77, 588.35, 657.08,
              665.49, -1043.45, -1036.56, -969.39, -955.33])
thres = 25.0
# Compute pairwise distances
# The default distance metric is 'euclidean', which would be
# equivalent here but more expensive to compute
d = pdist(x[:, np.newaxis], 'cityblock')
# Find distances within threshold
d_idx = np.where(d <= thres)[0]
# Convert "condensed" distance indices to pairs of indices
r = np.arange(len(x))
c = np.zeros_like(r, dtype=np.int32)
np.cumsum(r[:0:-1], out=c[1:])
i = np.searchsorted(c[1:], d_idx, side='right')
j = d_idx - c[i] + r[i] + 1
# Get pairs of values
v_i = x[i]
v_j = x[j]
# Find means
m = np.round((v_i + v_j) / 2).astype(np.int32)
# Print result
for idx in range(len(m)):
    print(f'{m[idx]}: ({v_i[idx]}, {v_j[idx]})')
Output
512: (511.31, 512.24)
580: (571.77, 588.35)
661: (657.08, 665.49)
-1040: (-1043.45, -1036.56)
-962: (-969.39, -955.33)
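If the condensed-index bookkeeping feels opaque, a shorter sketch (with the same O(n^2) memory footprint) gets the same pairs with np.triu_indices and a boolean mask:

import numpy as np

x = np.array([511.31, 512.24, 571.77, 588.35, 657.08,
              665.49, -1043.45, -1036.56, -969.39, -955.33])
thres = 25.0

i, j = np.triu_indices(len(x), k=1)      # all index pairs with i < j
mask = np.abs(x[i] - x[j]) <= thres      # keep only pairs within the threshold
for a, b in zip(x[i][mask], x[j][mask]):
    print('%d: (%s, %s)' % (round((a + b) / 2), a, b))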

Related

Appending to a list sometimes gives 'IndexError: list index out of range' and the results are not as expected

So I'm still new to programming and I'm trying to implement an initialization method for a clustering problem using Python 2.7.
The steps are:
1. Pick a random data point from the dataset as the first centroid.
2. While the number of centroids < n_klas: calculate each data point's distance to the existing centroids.
3. Calculate the probability of every data point relative to its closest centroid using the formula P(x) = D(x)**2 / sum(D(x)**2), where D(x) is the Euclidean distance from data[x] to its closest centroid.
4. Pick the data point with the highest P(x), then loop back to step 2.
But when I try to append data I sometimes get the error 'IndexError: list index out of range', and sometimes the code works but only gives 2 different centroids: the 3rd to nth centroids have the same values as the 2nd centroid.
Where did I go wrong?
(Edit: I edited the steps above because I had them wrong.)
def pickcentroid(df):
    x = df.values.tolist()
    n_klas = 3
    # random.seed(2)
    idx_pusat_pertama = random.randint(0, len(df))
    centroid = []
    centroid_idx = []
    centroid.append(x[idx_pusat_pertama])
    centroid_idx.append(idx_pusat_pertama)
    prob_data = []
    while len(centroid) < n_klas:
        ac_mindist = 0
        for i in x:
            dist_ke_c = []
            for c in centroid:
                dist_ke_c.append(dist(i, c))
            ac_mindist += min(dist_ke_c)**2
        for idx in range(len(df)):
            if idx not in centroid_idx:
                dist_ke_c2 = []
                mindist_per_data = 0
                for c in centroid:
                    dist_ke_c2.append(dist(x[idx], c))
                mindist_per_data = min(dist_ke_c2)**2
                prob_data.append(mindist_per_data/ac_mindist)
            else:
                prob_data.append(0)
        new_cen_idx = prob_data.index(max(prob_data))
        centroid_idx.append(new_cen_idx)
        centroid.append(x[new_cen_idx])
        print(centroid)
    return centroid

def dist(x, y):
    r = np.array(x) - np.array(y)
    distance = np.linalg.norm(r)
    # print(distance)
    return distance

c = pickcentroid(df)
And the data looks like this:
-0.19864726098025476,-0.2174575876560727
-0.19427576174137176,-0.2658220115362011
0.24385376109048476,0.1555938625346895
-0.23636704446757748,0.14005058641250595
0.37563103051045826,0.33204816285389527
-0.13210748354848134,-0.0019122205360639893
-0.17120654390561796,0.04231258139538708
0.2865229979171536,0.34175192153482764
-0.328896319205639,-0.22737124434792602
0.03115098005450885,0.17089336362457433
Thank you very much for your kind help.
randint(a, b) returns a random integer from a to b, including b. So when you use randint(0, len(x)), you might get len(x) as output, which is out of range when used as an index.
For your use case, you could probably use random_value = random.choice(x) instead.
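A minimal sketch of that fix, assuming you also want to keep the chosen index (random.randrange(n) returns a value from 0 to n-1, so it is always a valid list index; df is the DataFrame from the question):

import random

x = df.values.tolist()
idx_pusat_pertama = random.randrange(len(x))   # never equal to len(x)
centroid = [x[idx_pusat_pertama]]
centroid_idx = [idx_pusat_pertama]

# If the index itself is not needed, random.choice works too:
# centroid = [random.choice(x)]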

How to apply my own function along each row and column with NumPy

I'm using NumPy to store data in matrices.
I'm struggling to make the Python code below perform better.
RESULT is the array I want to put the data into.
import numpy as np

TMP = np.array([[1, 1, 0], [0, 0, 1], [1, 0, 0], [0, 1, 1]])
n_row, n_col = TMP.shape[0], TMP.shape[0]
RESULT = np.zeros((n_row, n_col))

def do_something(array1, array2):
    intersect_num = np.bitwise_and(array1, array2).sum()
    union_num = np.bitwise_or(array1, array2).sum()
    try:
        return intersect_num / float(union_num)
    except ZeroDivisionError:
        return 0

for i in range(n_row):
    for j in range(n_col):
        if i >= j:
            continue
        RESULT[i, j] = do_something(TMP[i], TMP[j])
I guess it would be much faster if I could use a NumPy built-in function instead of for loops.
I looked through various questions around here, but I couldn't find the best fit for my problem.
Any suggestions? Thanks in advance!
Approach #1
You could do something like this as a vectorized solution -
# Store the number of rows in TMP as a parameter
N = TMP.shape[0]

# Get the indices that would be used as row indices to select rows off TMP and
# also as row, column indices for setting the output array. These basically
# correspond to the iterators involved in the loopy implementation
R, C = np.triu_indices(N, 1)

# Calculate intersect_num, union_num and the division results across all iterations
I = np.bitwise_and(TMP[R], TMP[C]).sum(-1)
U = np.bitwise_or(TMP[R], TMP[C]).sum(-1)
vals = np.true_divide(I, U)

# Set up the output array and assign vals into it
out = np.zeros((N, N))
out[R, C] = vals
Approach #2
For cases where TMP holds only 1s and 0s, np.bitwise_and and np.bitwise_or can be replaced with dot products, which could be faster. With those, we would have an implementation like so -
M = TMP.shape[1]
I = TMP.dot(TMP.T)
TMP_inv = 1-TMP
U = M - TMP_inv.dot(TMP_inv.T)
out = np.triu(np.true_divide(I,U),1)
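A quick way to sanity-check either approach against the original loop is a small self-contained sketch, reusing the TMP from the question:

import numpy as np

TMP = np.array([[1, 1, 0], [0, 0, 1], [1, 0, 0], [0, 1, 1]])
N = TMP.shape[0]

# Loop version from the question
expected = np.zeros((N, N))
for i in range(N):
    for j in range(i + 1, N):
        inter = np.bitwise_and(TMP[i], TMP[j]).sum()
        union = np.bitwise_or(TMP[i], TMP[j]).sum()
        expected[i, j] = inter / float(union) if union else 0

# Vectorized version (Approach #1)
R, C = np.triu_indices(N, 1)
out = np.zeros((N, N))
out[R, C] = np.true_divide(np.bitwise_and(TMP[R], TMP[C]).sum(-1),
                           np.bitwise_or(TMP[R], TMP[C]).sum(-1))

print(np.allclose(out, expected))  # should print True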

Transforming a 3 Column Matrix into an N x N Matrix in Numpy

I have a 2D NumPy array with 3 columns. Columns 1 and 2 are a list of connections between IDs. Column 3 is the strength of that connection. I would like to transform this 3-column matrix into a weighted adjacency matrix (an N x N matrix where each cell represents the strength of the connection between two IDs).
I have already done this in my code below. matrix is the 3-column 2D array and t1 is the weighted adjacency matrix. My problem is that this code is very slow because I am using nested for loops. I am familiar with the pandas function melt, which does this, but I am not able to use pandas. Is there a faster implementation that does not use pandas?
import numpy as np

a = np.arange(2000)
np.random.shuffle(a)
b = np.arange(2000)
np.random.shuffle(b)
c = np.random.rand(2000, 1)
matrix = np.column_stack((a, b, c))

# get unique value list of nm
flds = list(np.unique(matrix[:, 0]))
flds.extend(list(np.unique(matrix[:, 1])))
flds = np.asarray(flds)
flds = np.unique(flds)

# make lookup dicts
lookup = dict(zip(np.arange(0, len(flds)), flds))
lookup_rev = dict(zip(flds, np.arange(0, len(flds))))

# make empty n by n matrix with unique lists
t1 = np.zeros([len(flds), len(flds)])

# map values into the n by n matrix and make the rest 0
'''this takes a long time to run'''
# iterate through rows
for i in np.arange(0, len(lookup)):
    # iterate through columns
    for k in np.arange(0, len(lookup)):
        val = matrix[(matrix[:, 0] == lookup[i]) & (matrix[:, 1] == lookup[k])][:, 2]
        if val:
            t1[i, k] = sum(val)
Assuming that I understood the question correctly and that val is a scalar, you could use a vectorized approach that involves initializing with zeros and then indexing, like so -
out = np.zeros((len(flds),len(flds)))
out[matrix[:,0].astype(int),matrix[:,1].astype(int)] = matrix[:,2]
Please note that by my observation it looks like you can avoid using lookup.
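If the same (row, column) pair can occur more than once and the weights should be summed, as the original sum(val) does, an unbuffered np.add.at accumulates the duplicates in a single call. A sketch reusing flds and matrix from the question, under the same assumption that the IDs can be used directly as integer indices:

import numpy as np

out = np.zeros((len(flds), len(flds)))
np.add.at(out, (matrix[:, 0].astype(int), matrix[:, 1].astype(int)), matrix[:, 2])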
You need to iterate your matrix only once:
import numpy as np

size = 2000
a = np.arange(size)
np.random.shuffle(a)
b = np.arange(size)
np.random.shuffle(b)
c = np.random.rand(size, 1)
matrix = np.column_stack((a, b, c))

# get unique value list of nm
fields = np.unique(matrix[:, :2])
n = len(fields)

# make reverse lookup dict
lookup = dict(zip(fields, range(n)))

# make empty n by n matrix
t1 = np.zeros([n, n])

for src, dest, val in matrix:
    i = lookup[src]
    j = lookup[dest]
    t1[i, j] += val
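If even this single loop becomes a bottleneck for a larger connection list, the lookup itself can be vectorized with np.searchsorted, since np.unique returns a sorted array. A sketch reusing fields, matrix and n from the code above:

i = np.searchsorted(fields, matrix[:, 0])
j = np.searchsorted(fields, matrix[:, 1])
t1 = np.zeros((n, n))
np.add.at(t1, (i, j), matrix[:, 2])  # += semantics, so duplicate pairs are summed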
The main acceleration you can get is by not iterating through each element of the NxN matrix but instead iterating through your connection list, which is much smaller.
I tried to simplify your code a bit. It uses the list.index method, which can be slow, but it should still be faster than what you had.
import numpy as np

a = np.arange(2000)
np.random.shuffle(a)
b = np.arange(2000)
np.random.shuffle(b)
c = np.random.rand(2000, 1)
matrix = np.column_stack((a, b, c))

lookup = np.unique(matrix[:, :2]).tolist()  # You can call unique only once
t1 = np.zeros((len(lookup), len(lookup)))
for i, j, val in matrix:
    t1[lookup.index(i), lookup.index(j)] = val  # Fill the matrix

Find two disjoint pairs of pairs that sum to the same vector

This is a follow-up to Find two pairs of pairs that sum to the same value.
I have random 2d arrays which I make using
import numpy as np
from itertools import combinations
n = 50
A = np.random.randint(2, size=(m,n))
I would like to determine if the matrix has two disjoint pairs of pairs of columns which sum to the same column vector. I am looking for a fast method to do this. In the previous problem, ((0,1), (0,2)) was acceptable as a pair of pairs of column indices, but in this case it is not, as 0 is in both pairs.
The accepted answer to the previous question is so cleverly optimised that, unfortunately, I can't see how to make this simple-looking change. (I am interested in columns rather than rows in this question, but I can always just use A.transpose().)
Here is some code to show it testing all 4 by 4 arrays.
n = 4
nxn = np.arange(n*n).reshape(n, -1)
count = 0
for i in xrange(2**(n*n)):
    A = (i >> nxn) % 2
    p = 1
    for firstpair in combinations(range(n), 2):
        for secondpair in combinations(range(n), 2):
            if firstpair < secondpair and not set(firstpair) & set(secondpair):
                if np.array_equal(A[firstpair[0]] + A[firstpair[1]],
                                  A[secondpair[0]] + A[secondpair[1]]):
                    if p:
                        count += 1
                        p = 0
print count
This should output 3136.
Here is my solution, extended to do what I believe you want. It isn't entirely clear though; one may get an arbitrary number of row-pairs that sum to the same total; there may exist unique subsets of rows within them that sum to the same value. For instance:
Given this set of row-pairs that sum to the same total
[[19 19 30 30]
[11 16 11 16]]
There exists a unique subset of these rows that may still be counted as valid; but should it?
[[19 30]
[16 11]]
Anyway, I hope those details are easy to deal with, given the code below.
import numpy as np

n = 20
# also works for non-square A
A = np.random.randint(2, size=(n*6, n)).astype(np.int8)
##A = np.array( [[0, 0, 0], [1, 1, 1], [1, 1 ,1]], np.uint8)
##A = np.zeros((6,6))
# force the inclusion of some hits, to keep our algorithm on its toes
##A[0] = A[1]

def base_pack_lazy(a, base, dtype=np.uint64):
    """
    pack the last axis of an array as minimal base representation
    lazily yields packed columns of the original matrix
    """
    a = np.ascontiguousarray(np.rollaxis(a, -1))
    packing = int(np.dtype(dtype).itemsize * 8 / (float(base) / 2))
    for columns in np.array_split(a, (len(a)-1)//packing+1):
        R = np.zeros(a.shape[1:], dtype)
        for col in columns:
            R *= base
            R += col
        yield R

def unique_count(a):
    """returns counts of unique elements"""
    unique, inverse = np.unique(a, return_inverse=True)
    count = np.zeros(len(unique), np.int)
    np.add.at(count, inverse, 1)  # note: this scatter operation requires numpy 1.8; use a sparse matrix otherwise!
    return unique, count, inverse

def voidview(arr):
    """view the last axis of an array as a void object. can be used as a faster form of lexsort"""
    return np.ascontiguousarray(arr).view(np.dtype((np.void, arr.dtype.itemsize * arr.shape[-1]))).reshape(arr.shape[:-1])

def has_identical_row_sums_lazy(A, combinations_index):
    """
    compute the existence of combinations of rows summing to the same vector,
    given an nxm matrix A and an index matrix specifying all combinations
    naively, we need to compute the sum of each row combination at least once, giving n^3 computations
    however, this isnt strictly required; we can lazily consider the columns, giving an early exit opportunity
    all nicely vectorized of course
    """
    multiplicity, combinations = combinations_index.shape
    # list of indices into combinations_index, denoting possibly interacting combinations
    active_combinations = np.arange(combinations, dtype=np.uint32)
    # keep all packed columns; we might need them later
    columns = []
    for packed_column in base_pack_lazy(A, base=multiplicity+1):  # loop over packed cols
        columns.append(packed_column)
        # compute rowsums only for a fixed number of columns at a time.
        # this is O(n^2) rather than O(n^3), and after considering the first column,
        # we can typically already exclude almost all combinations
        partial_rowsums = sum(packed_column[I[active_combinations]] for I in combinations_index)
        # find duplicates in this column
        unique, count, inverse = unique_count(partial_rowsums)
        # prune those combinations which we can exclude as having different sums, based on columns inspected thus far
        active_combinations = active_combinations[count[inverse] > 1]
        # early exit; no pairs
        if len(active_combinations) == 0:
            return False
    """
    we now have a small set of relevant combinations, but we have lost the details of their particulars
    to see which combinations of rows do sum to the same value, we do need to consider rows as a whole
    we can simply apply the same mechanism, but for all columns at the same time,
    but only for the selected subset of row combinations known to be relevant
    """
    # construct full packed matrix
    B = np.ascontiguousarray(np.vstack(columns).T)
    # perform all relevant sums, over all columns
    rowsums = sum(B[I[active_combinations]] for I in combinations_index)
    # find the unique rowsums, by viewing rows as a void object
    unique, count, inverse = unique_count(voidview(rowsums))
    # if not, we did something wrong in deciding on active combinations
    assert(np.all(count > 1))
    # loop over all sets of rows that sum to an identical unique value
    for i in xrange(len(unique)):
        # set of indexes into combinations_index;
        # note that there may be more than two combinations that sum to the same value; we grab them all here
        combinations_group = active_combinations[inverse == i]
        # associated row-combinations
        # array of shape=(multiplicity, group_size)
        row_combinations = combinations_index[:, combinations_group]
        # if no duplicate rows involved, we have a match
        if len(np.unique(row_combinations[:, [0, -1]])) == multiplicity*2:
            print row_combinations
            return True
    # none of the identical rowsums met the uniqueness criteria
    return False

def has_identical_triple_row_sums(A):
    n = len(A)
    idx = np.array([(i, j, k)
                    for i in xrange(n)
                    for j in xrange(n)
                    for k in xrange(n)
                    if i < j and j < k], dtype=np.uint16)
    idx = np.ascontiguousarray(idx.T)
    return has_identical_row_sums_lazy(A, idx)

def has_identical_double_row_sums(A):
    n = len(A)
    idx = np.array(np.tril_indices(n, -1), dtype=np.int32)
    return has_identical_row_sums_lazy(A, idx)

from time import clock
t = clock()
for i in xrange(1):
    ## print has_identical_double_row_sums(A)
    print has_identical_triple_row_sums(A)
print clock() - t
Edit: code cleanup
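As a simpler (and far slower) reference implementation of the same disjointness check, one can bucket column pairs by their sum vector and then look for two index-disjoint pairs in any bucket. A sketch, mainly useful for validating results on small matrices:

import numpy as np
from itertools import combinations
from collections import defaultdict

def has_disjoint_pair_of_pairs(A):
    """True if two disjoint pairs of columns of A sum to the same column vector."""
    buckets = defaultdict(list)
    for i, j in combinations(range(A.shape[1]), 2):
        buckets[tuple(A[:, i] + A[:, j])].append((i, j))
    for pairs in buckets.values():
        # every pair in a bucket has the same column sum; require disjoint indices
        for p, q in combinations(pairs, 2):
            if not set(p) & set(q):
                return True
    return False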

Looping back to the next column in a csv

I'm trying to get a script to run on each individual column of a CSV file. I've figured out how to tell Python which column I would like to run the script on, but I want it to analyze column one, output the results, then move to column two and continue on through the file. What I want is an "if ... goto ..." kind of command. I've found how to do this with simple one-liners, but I have a larger script. Any help would be great, as I'm sure I'm just missing something: for example, a way to loop back to where I define my data (h=data) but tell it to choose the next column. Here is my script.
import numpy as np
import matplotlib.pyplot as plt
from pylab import *
import pylab
from scipy import linalg
import sys
import scipy.interpolate as interpolate
import scipy.optimize as optimize

a = raw_input("Data file name? ")  # Name of the data file including the directory, must be .csv
datafile = open(a, 'r')
data = []
for row in datafile:
    data.append(row.strip().split(','))  # opening and organizing the csv file
print('Data points= ', len(data))
print data

c = raw_input("Is there a header row? y/n?")  # Remove header line if present
if c is ('y'):
    del data[0]
    data2 = data
    print('Raw data= ', data2)
else:
    print('Raw data= ', data)

'''
#if I wanted to select a column
b=input("What column to analyze?") #Asks what column depth data is in
if b is 1:
    h=[[rowa[i] for rowa in data] for i in range(1)] #first row
'''

h = data  # all columns
g = reduce(lambda x, y: x + y, h)  # prepares data for calculations
a = map(float, g)
a.sort()
print ('Organized data= ', a)

def GRLC(values):
    '''
    Calculate Gini index, Gini coefficient, Robin Hood index, and points of
    Lorenz curve based on the instructions given in
    www.peterrosenmai.com/lorenz-curve-graphing-tool-and-gini-coefficient-calculator
    Lorenz curve values as given as lists of x & y points [[x1, x2], [y1, y2]]
    #param values: List of values
    #return: [Gini index, Gini coefficient, Robin Hood index, [Lorenz curve]]
    '''
    n = len(values)
    assert(n > 0), 'Empty list of values'
    sortedValues = sorted(values)  # Sort smallest to largest
    # Find cumulative totals
    cumm = [0]
    for i in range(n):
        cumm.append(sum(sortedValues[0:(i + 1)]))
    # Calculate Lorenz points
    LorenzPoints = [[], []]
    sumYs = 0  # Sum of all y values
    robinHoodIdx = -1  # Robin Hood index max(x_i, y_i)
    for i in range(1, n + 2):
        x = 100.0 * (i - 1)/n
        y = 100.0 * (cumm[i - 1]/float(cumm[n]))
        LorenzPoints[0].append(x)
        LorenzPoints[1].append(y)
        sumYs += y
        maxX_Y = x - y
        if maxX_Y > robinHoodIdx: robinHoodIdx = maxX_Y
    giniIdx = 100 + (100 - 2 * sumYs)/n  # Gini index
    return [giniIdx, giniIdx/100, robinHoodIdx, LorenzPoints]

result = GRLC(a)
print 'Gini Index', result[0]
print 'Gini Coefficient', result[1]
print 'Robin Hood Index', result[2]
I'm ignoring all of that GRLC function and just solving the looping question. Give this a try. It uses while True: to loop forever (you can break out by ending the program; Ctrl+C on Windows, depending on the OS). Just load the data from the csv once; then, each time it loops, you can rebuild the relevant variables. If you have questions, please ask. Also, I didn't test it, as I don't have all the NumPy packages installed :)
import numpy as np
import matplotlib.pyplot as plt
from pylab import *
import pylab
from scipy import linalg
import sys
import scipy.interpolate as interpolate
import scipy.optimize as optimize

def GRLC(values):
    '''
    Calculate Gini index, Gini coefficient, Robin Hood index, and points of
    Lorenz curve based on the instructions given in
    www.peterrosenmai.com/lorenz-curve-graphing-tool-and-gini-coefficient-calculator
    Lorenz curve values as given as lists of x & y points [[x1, x2], [y1, y2]]
    #param values: List of values
    #return: [Gini index, Gini coefficient, Robin Hood index, [Lorenz curve]]
    '''
    n = len(values)
    assert(n > 0), 'Empty list of values'
    sortedValues = sorted(values)  # Sort smallest to largest
    # Find cumulative totals
    cumm = [0]
    for i in range(n):
        cumm.append(sum(sortedValues[0:(i + 1)]))
    # Calculate Lorenz points
    LorenzPoints = [[], []]
    sumYs = 0  # Sum of all y values
    robinHoodIdx = -1  # Robin Hood index max(x_i, y_i)
    for i in range(1, n + 2):
        x = 100.0 * (i - 1)/n
        y = 100.0 * (cumm[i - 1]/float(cumm[n]))
        LorenzPoints[0].append(x)
        LorenzPoints[1].append(y)
        sumYs += y
        maxX_Y = x - y
        if maxX_Y > robinHoodIdx: robinHoodIdx = maxX_Y
    giniIdx = 100 + (100 - 2 * sumYs)/n  # Gini index
    return [giniIdx, giniIdx/100, robinHoodIdx, LorenzPoints]

# Name of the data file including the directory, must be .csv
a = raw_input("Data file name? ")
datafile = open(a.strip(), 'r')
data = []
# opening and organizing the csv file
for row in datafile:
    data.append(row.strip().split(','))
# Remove header line if present
c = raw_input("Is there a header row? y/n?")
if c.strip().lower() == ('y'):
    del data[0]

while True:
    # if I want the first column, that's index 0.
    b = raw_input("What column to analyze?")
    # Validate that the column input data is correct here. Otherwise it might be out of range, etc.
    # Maybe try this. You might want more smarts in there, depending on your intent:
    b = int(b.strip())
    # If you expect the user to input "2" to mean the second column, you're going to use index 1 (list indexes are 0 based)
    h = [[rowa[b-1] for rowa in data] for i in range(1)]
    # prepares data for calculations
    g = reduce(lambda x, y: x + y, h)
    a = map(float, g)
    a.sort()
    print ('Organized data= ', a)
    result = GRLC(a)
    print 'Gini Index', result[0]
    print 'Gini Coefficient', result[1]
    print 'Robin Hood Index', result[2]
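If you would rather have the script walk through every column automatically instead of prompting inside the while True loop, a plain for loop over the column count does it. A minimal sketch, assuming data has already been loaded and GRLC defined as above:

n_cols = len(data[0])
for col in range(n_cols):
    column_values = sorted(map(float, (row[col] for row in data)))
    result = GRLC(column_values)
    print('Column %d: Gini Index=%s, Gini Coefficient=%s, Robin Hood Index=%s'
          % (col + 1, result[0], result[1], result[2]))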
