I want to expand each integer in an array into a run of 1s, padding with 0s wherever another row holds a larger value in the same column.
Examples:
ex1 = np.array([[0],[3]])
=> array([[0,0,0],[1,1,1]])
ex2 = np.array([[2,1],[0,0]])
=> array([[1,1,1],[0,0,0]])
ex3 = np.array([ [2,1,2],[3,1,1] ])
=> array([[1,1,0,1,1,1],
          [1,1,1,1,1,0]])
How should I achieve this? Can it also be extended to N-dimensional arrays?
I came up with this approach:
def expand_multi_bin(a):
    # Create result array
    n = np.max(a, axis=0).sum()
    d = a.shape[0]
    newa = np.zeros(d*n).reshape(d,n)
    row = 0
    for x in np.nditer(a, flags=['external_loop'], order='F'):
        # Iterate each column
        for idx, c in enumerate(np.nditer(x)):
            # Store it to the result array
            newa[idx, row:row+c] = np.ones(c)
        row += np.max(x)
    return newa
Though, given the multiple loops, I'm highly skeptical that this is the best approach.
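A vectorized sketch of an alternative (my own addition, not part of the original post; the helper name expand_multi_bin_vec is made up): repeat each value across its column's block of the output with np.repeat, then compare against every position's offset within its block.

import numpy as np

def expand_multi_bin_vec(a):
    # hypothetical helper, sketched from the examples above
    maxes = a.max(axis=0)                         # widest run needed per column
    # offset of every output position within its column block: 0, 1, ..., maxes[j]-1
    within = np.concatenate([np.arange(m) for m in maxes])
    # repeat each input value across its block, then threshold against the offsets
    return (within < np.repeat(a, maxes, axis=1)).astype(int)

For ex3 above this returns array([[1, 1, 0, 1, 1, 1], [1, 1, 1, 1, 1, 0]]), matching the expected output.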
I have an array Zc containing n elements that are 2D arrays, stacked along the first axis. I want to find the index of each 2D array that equals np.ones(Zc[i].shape).
a = np.zeros((5,5))
b = np.ones((5,5))*4
c = np.ones((5,5))
d = np.ones((5,5))*2
Zc = np.stack((a,b,c,d))
for i in range(len(Zc)):
    a = np.ones(Zc[i].shape)
    b = Zc[i]
    if np.array_equal(a,b):
        print(i)
    else:
        pass
This returns 2. The code above works and returns the correct answer, but I want to know if there is a vectorized way to achieve the same result.
Going off of hpaulj's comment:
>>> allones = (Zc == np.ones(Zc.shape[1:])).all(axis=(1,2))
>>> np.where(allones)[0][0]
2
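Since broadcasting compares the scalar 1 against every element anyway, the explicit ones array can be dropped entirely (my addition):
>>> np.where((Zc == 1).all(axis=(1,2)))[0][0]
2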
I'm using NumPy to store data into matrices.
I'm struggling to make the below Python code perform better.
RESULT is the data store I want to put the data into.
TMP = np.array([[1,1,0],[0,0,1],[1,0,0],[0,1,1]])
n_row, n_col = TMP.shape[0], TMP.shape[0]
RESULT = np.zeros((n_row, n_col))
def do_something(array1, array2):
    intersect_num = np.bitwise_and(array1, array2).sum()
    union_num = np.bitwise_or(array1, array2).sum()
    try:
        return intersect_num / float(union_num)
    except ZeroDivisionError:
        return 0

for i in range(n_row):
    for j in range(n_col):
        if i >= j:
            continue
        RESULT[i, j] = do_something(TMP[i], TMP[j])
I guess it would be much faster if I could use some NumPy built-in function instead of for-loops.
I looked through various questions on here, but I couldn't find the best fit for my problem.
Any suggestion? Thanks in advance!
Approach #1
You could do something like this as a vectorized solution -
# Store number of rows in TMP as a parameter
N = TMP.shape[0]
# Get the indices that would be used as row indices to select rows off TMP and
# also as row,column indices for setting output array. These basically correspond
# to the iterators involved in the loopy implementation
R,C = np.triu_indices(N,1)
# Calculate intersect_num, union_num and division results across all iterations
I = np.bitwise_and(TMP[R],TMP[C]).sum(-1)
U = np.bitwise_or(TMP[R],TMP[C]).sum(-1)
vals = np.true_divide(I,U)
# Setup output array and assign vals into it
out = np.zeros((N, N))
out[R,C] = vals
Approach #2
For cases with TMP holding 1s and 0s, those np.bitwise_and and np.bitwise_or are replaceable with dot-products, which could be faster alternatives. The intersection count is simply the dot product of two rows, while the union count is M (the row length) minus the number of positions where both rows are 0, i.e. the dot product of the complemented rows. So, with those we would have an implementation like so -
M = TMP.shape[1]
I = TMP.dot(TMP.T)              # pairwise intersection counts
TMP_inv = 1-TMP
U = M - TMP_inv.dot(TMP_inv.T)  # union = M - (positions where both rows are 0)
out = np.triu(np.true_divide(I,U),1)
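As a quick sanity check (my addition, assuming the RESULT computed by the question's loop is still in scope), either approach reproduces the loopy output for the sample TMP:

assert np.allclose(out, RESULT)  # upper triangle matches the loopy version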
I have a 2D NumPy array with 3 columns. Columns 1 and 2 are a list of connections between IDs. Column 3 is the strength of that connection. I would like to transform this 3-column matrix into a weighted adjacency matrix (an N x N matrix where each cell represents the strength of the connection between two IDs).
I have already done this in my code below. matrix is the 3-column 2D array and t1 is the weighted adjacency matrix. My problem is that this code is very slow because I am using nested for loops. I am familiar with the pandas function melt, which does this, but I am not able to use pandas. Is there a faster implementation that doesn't use pandas?
import numpy as np
a = np.arange(2000)
np.random.shuffle(a)
b = np.arange(2000)
np.random.shuffle(b)
c = np.random.rand(2000,1)
matrix = np.column_stack((a,b,c))
#get unique value list of nm
flds = list(np.unique(matrix[:,0]))
flds.extend(list(np.unique(matrix[:,1])))
flds = np.asarray(flds)
flds = np.unique(flds)
#make lookup dict
lookup = dict(zip(np.arange(0,len(flds)), flds))
lookup_rev = dict(zip(flds, np.arange(0,len(flds))))
#make empty n by n matrix with unique lists
t1 = np.zeros([len(flds) , len(flds)])
#map values into the n by n matrix and make the rest 0
'''this takes a long time to run'''
#iterate through rows
for i in np.arange(0,len(lookup)):
    #iterate through columns
    for k in np.arange(0,len(lookup)):
        val = matrix[(matrix[:,0] == lookup[i]) & (matrix[:,1] == lookup[k])][:,2]
        if val.size:
            t1[i,k] = sum(val)
Assuming that I understood the question correctly and that val is a scalar, you could use a vectorized approach that involves initializing with zeros and then indexing, like so -
out = np.zeros((len(flds),len(flds)))
out[matrix[:,0].astype(int),matrix[:,1].astype(int)] = matrix[:,2]
Note that, as far as I can tell, you can avoid using lookup entirely.
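One caveat (my addition): if the same (src, dest) pair occurs more than once, plain fancy-indexed assignment keeps only the last value. np.add.at accumulates duplicates instead, matching the sum(val) behaviour of the original loop:

out = np.zeros((len(flds), len(flds)))
np.add.at(out, (matrix[:,0].astype(int), matrix[:,1].astype(int)), matrix[:,2])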
You need to iterate your matrix only once:
import numpy as np
size = 2000
a = np.arange(size)
np.random.shuffle(a)
b = np.arange(size)
np.random.shuffle(b)
c = np.random.rand(size,1)
matrix = np.column_stack((a,b,c))
#get unique value list of nm
fields = np.unique(matrix[:,:2])
n = len(fields)
#make reverse lookup dict
lookup = dict(zip(fields, range(n)))
#make empty n by n matrix
t1 = np.zeros([n, n])
for src, dest, val in matrix:
    i = lookup[src]
    j = lookup[dest]
    t1[i, j] += val
The main acceleration you can get is by not iterating through each element of the NxN matrix, but instead iterating through your connection list, which is much smaller.
I tried to simplify your code a bit. It uses the list.index method, which can be slow, but it should still be faster than what you had.
import numpy as np
a = np.arange(2000)
np.random.shuffle(a)
b = np.arange(2000)
np.random.shuffle(b)
c = np.random.rand(2000,1)
matrix = np.column_stack((a,b,c))
lookup = np.unique(matrix[:,:2]).tolist() # You can call unique only once
t1 = np.zeros((len(lookup),len(lookup)))
for i,j,val in matrix:
    t1[lookup.index(i),lookup.index(j)] = val # Fill the matrix
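To drop list.index as well, np.searchsorted can map every ID to its position in the sorted unique array in one vectorized call (my addition; like the loop above, plain assignment overwrites duplicate (i, j) pairs rather than summing them):

fields = np.unique(matrix[:,:2])
t1 = np.zeros((len(fields), len(fields)))
rows = np.searchsorted(fields, matrix[:,0])  # vectorized ID -> index mapping
cols = np.searchsorted(fields, matrix[:,1])
t1[rows, cols] = matrix[:,2]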
I'm trying to optimize the following code, potentially by rewriting it in Cython: it takes a low-dimensional but relatively long NumPy array, looks in one of its columns for 0 values, and marks those rows as -1 in a result array. The code is:
import numpy as np

def get_data():
    data = np.array([[1,5,1]] * 5000 + [[1,0,5]] * 5000 + [[0,0,0]] * 5000)
    return data

def get_cols(K):
    cols = np.array([2] * K)
    return cols

def test_nonzero(data):
    K = len(data)
    result = np.array([1] * K)
    # Index into columns of data
    cols = get_cols(K)
    # Mark zero points with -1
    idx = np.nonzero(data[np.arange(K), cols] == 0)[0]
    result[idx] = -1
    return result

import time
t_start = time.time()
data = get_data()
for n in range(5000):
    test_nonzero(data)
t_end = time.time()
print(t_end - t_start)
data is the data. cols is the array of column indices of data in which to look for zero values (for simplicity, I made them all the same column). The goal is to compute a NumPy array, result, which has a 1 for each row where the column of interest is non-zero, and a -1 for each row where the corresponding column of interest is zero.
Running this function 5000 times on a not-so-large array of 15,000 rows by 3 columns takes about 20 seconds. Is there a way this can be sped up? It appears that most of the work goes into finding the nonzero elements and retrieving them with indices (the call to nonzero and the subsequent use of its indices). Can this be optimized, or is this the best that can be done?
How could a Cython implementation gain speed on this?
cols = np.array([2] * K)
That's going to be really slow. It creates a very large Python list and then converts it into a NumPy array. Instead, do something like:
cols = np.ones(K, int)*2
That'll be way faster.
result = np.array([1] * K)
Here you should do:
result = np.ones(K, int)
That will produce the numpy array directly.
idx = np.nonzero(data[np.arange(K), cols] == 0)[0]
result[idx] = -1
cols is an array, but you can just pass a 2. Furthermore, using nonzero adds an extra step:
idx = data[np.arange(K), 2] == 0
result[idx] = -1
Should have the same effect.
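Putting those suggestions together (my sketch; it assumes the constant column 2 from the example, in which case data[np.arange(K), 2] reduces to data[:, 2], and the name test_nonzero_fast is mine):

import numpy as np

def test_nonzero_fast(data):
    K = len(data)
    result = np.ones(K, int)      # direct array creation, no Python list
    result[data[:, 2] == 0] = -1  # boolean mask instead of nonzero + fancy indexing
    return result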
I have several arrays of size (20000,1) with different contents. I'd like to randomly delete 25% of the rows of each array, in such a way that the same rows are deleted from every array.
A rather tedious way I found is the following:
import numpy as np
a=np.array(range(1000))
b=np.array(np.random.rand(1000))
seed=np.random.randint(0,100000000) #picking a random seed
np.random.seed(seed) #Setting the same seed for each deletion
a[np.random.rand(*a.shape) < .25] = 0
np.random.seed(seed)
b[np.random.rand(*b.shape) < .25] = 0
a=a[a !=0]
b=b[b !=0]
There are several problems with this approach, such as what if an array already contains zeros?
Is there a better way of doing this?
based on and extended from Joel Cornett's solution:
import numpy as np
length = 20000
limit = int(0.75*length)
keep = np.random.permutation(length)[:limit]
newArray = oldArray[keep]
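Indexing every array with the same keep array guarantees that the same rows survive everywhere (my addition; length must equal the arrays' common length):

a = a[keep]
b = b[keep]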
Here is a non-numpy solution in very general terms:
import random
to_keep = set(random.sample(range(total_rows), int(keep_ratio * total_rows)))
#do this for each array:
new_array = np.array([item for index, item in enumerate(old_array) if index in to_keep])
total_rows is the number of rows in each array (I think you said this was 20,000)
keep_ratio is the percentage of rows to keep, which according to you is 1 - 0.25 = 0.75
EDIT
You can also use numpy's compress() method.
import random
to_keep = set(random.sample(range(total_rows), int(keep_ratio * total_rows)))
kompressor = [1 if i in to_keep else 0 for i in range(total_rows)]
new_array = numpy.compress(kompressor, old_array, axis=0)
kompressor is just a 0/1 mask with one entry per row, marking the rows compress() should keep.
Similar to Theodros's answer, but preserves the original ordering of elements:
import numpy as np
mask = np.ones(len(a), dtype=bool)
mask[:len(a)//4] = 0
np.random.shuffle(mask)
a = a[mask]
b = b[mask]
I have no idea how well this works with numpy, but this is what I'd use in pure Python:
import random
total = len(a)
toss = int(0.25 * total)
keeping = [False] * toss + [True] * (total - toss)
random.shuffle(keeping)
a = [value for value, flag in zip(a, keeping) if flag]
b = [value for value, flag in zip(b, keeping) if flag]