Formatting multidimensional arrays - python

How can I write a function that separates the values in a 2-dimensional array according to the 0s and 1s in its first column? The function should split the second column into two groups, one per bit.
[[ 0 38846]
[ 1 51599]
[ 0 51599]
[ 1 52598]
[ 0 290480]
[ 0 360467]]
Expected Output:
Ones = 51599 ,52598
Zeroes = 38846, 51599, 290480, 360467

I am not sure I am following. Are you looking for a particular method, like with numpy or is list comprehension good enough? In any case, here are some examples:
import numpy as np
x = np.array([[ 0, 38846],
[ 1, 51599],
[ 0, 51599],
[ 1, 52598],
[ 0, 290480],
[ 0, 360467]])
#List comprehension
ones = [v for b,v in x if b]
zeros = [v for b,v in x if b==0]
#Numpy access
ones = x[x[:,0]==1][:,1]
zeros = x[x[:,0]==0][:,1]

The following code should do the trick:
import numpy as np
arr = np.array([[0, 38846],
[1, 51599],
[0, 51599],
[1, 52598],
[0, 290480],
[0, 360467]])
col1 = arr[:, 0]
col2 = arr[:, 1]
zeros = col2[col1 == 0]
ones = col2[col1 == 1]
The lines
col1 = arr[:, 0]
col2 = arr[:, 1]
get the first column (containing the 0s and 1s) and the second column (containing the numbers you want).
The expression col1 == 0 creates a boolean array that is True at the indexes where col1 is 0 and False elsewhere. Indexing with col2[col1 == 0] then keeps only the entries of col2 at the True positions, that is, the values in column 2 whose row has a 0 in column 1. The same applies to col1 == 1.
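Since the question asks for a function, the masking steps above can be wrapped up as follows. This is only a minimal sketch: the name split_by_flag is hypothetical, and it assumes the first column holds nothing but 0s and 1s.

```python
import numpy as np


def split_by_flag(arr):
    """Return (zeros, ones): second-column values grouped by the first-column flag."""
    arr = np.asarray(arr)
    flags = arr[:, 0]    # first column: the 0/1 flags
    values = arr[:, 1]   # second column: the values to group
    return values[flags == 0], values[flags == 1]


arr = np.array([[0, 38846],
                [1, 51599],
                [0, 51599],
                [1, 52598],
                [0, 290480],
                [0, 360467]])
zeros, ones = split_by_flag(arr)
print("Ones =", ones.tolist())      # Ones = [51599, 52598]
print("Zeroes =", zeros.tolist())   # Zeroes = [38846, 51599, 290480, 360467]
```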


Comparing two numpy arrays for compliance with two conditions

Consider two numpy arrays having the same shape, A and B, composed of 1s and 0s. A small example is shown:
A = [[1 0 0 1] B = [[0 0 0 0]
[0 0 1 0] [0 0 0 0]
[0 0 0 0] [1 1 0 0]
[0 0 0 0] [0 0 1 0]
[0 0 1 1]] [0 1 0 1]]
I now want to assign values to the two Boolean variables test1 and test2 as follows:
test1: Is there at least one instance where a 1 in an A column and a 1 in the SAME B column have row differences of exactly 1 or 2? If so, then test1 = True, otherwise False.
In the example above, column 0 of both arrays have 1s that are 2 rows apart, so test1 = True. (there are other instances in column 2 as well, but that doesn't matter - we only require one instance.)
test2: Do the 1 values in A and B all have different array addresses? If so, then test2 = True, otherwise False.
In the example above, both arrays have [4,3] = 1, so test2 = False.
I'm struggling to find an efficient way to do this and would appreciate some assistance.
Here is a simple way to test if two arrays have an entry one element apart in the same column (only in one direction):
(A[1:, :] * B[:-1, :]).any(axis=None)
So you can do
test1 = (A[1:, :] * B[:-1, :] + A[:-1, :] * B[1:, :]).any(axis=None) or (A[2:, :] * B[:-2, :] + A[:-2, :] * B[2:, :]).any(axis=None)
The second test can be done by converting the locations to indices, stacking them together, and using np.unique to count the number of duplicates. Duplicates can only come from the same index in two arrays since an array will never have duplicate indices. We can further speed up the calculation by using flatnonzero instead of nonzero:
test2 = np.all(np.unique(np.concatenate((np.flatnonzero(A), np.flatnonzero(B))), return_counts=True)[1] == 1)
A more efficient test would use np.intersect1d in a similar manner:
test2 = not np.intersect1d(np.flatnonzero(A), np.flatnonzero(B)).size
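Running both tests on the arrays from the question gives a quick sanity check. This is a minimal sketch; the helper name rows_apart is hypothetical (it is not a NumPy function) and simply generalises the shifted-product idea to an offset of k rows.

```python
import numpy as np

A = np.array([[1, 0, 0, 1],
              [0, 0, 1, 0],
              [0, 0, 0, 0],
              [0, 0, 0, 0],
              [0, 0, 1, 1]])
B = np.array([[0, 0, 0, 0],
              [0, 0, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 0],
              [0, 1, 0, 1]])


def rows_apart(A, B, k):
    """True if some column has a 1 in A and a 1 in B exactly k rows apart."""
    return bool((A[k:] * B[:-k]).any() or (A[:-k] * B[k:]).any())


test1 = rows_apart(A, B, 1) or rows_apart(A, B, 2)
# test2 is True only when no position holds a 1 in both arrays
test2 = np.intersect1d(np.flatnonzero(A), np.flatnonzero(B)).size == 0

print(test1)  # True
print(test2)  # False
```

Here test2 is False because both arrays hold a 1 at [4,3], as noted in the question.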
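You can use masked arrays. For the second task, mask the zeros and check that no position is unmasked in both arrays:
A_m = np.ma.masked_equal(A, 0)
B_m = np.ma.masked_equal(B, 0)
# (A_m == B_m).compressed() keeps only positions where both arrays hold a 1
test2 = not np.any((A_m==B_m).compressed())
And a naive way of doing the first task is:
# np.ma.vstack keeps the masks; plain np.vstack would drop them
test1 = np.any((np.ma.vstack((A_m[:-1],A_m[:-2],A_m[1:],A_m[2:]))==np.ma.vstack((B_m[1:],B_m[2:],B_m[:-1],B_m[:-2]))).compressed())
Output for the example above (test1, then test2):
True
False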
For test2, you could simply check whether any flat index holds a 1 in both arrays. The check below prints True when such a shared index exists, in which case test2 is False:
A = np.array([[1, 0, 0, 1],[0, 0, 1, 0],[0, 0, 0, 0],[0, 0, 0, 0],[0, 0, 1, 1]])
B = np.array([[0, 0, 0, 0],[0, 0, 0, 0],[1, 1, 0, 0],[0, 0, 1, 0],[0, 1, 0, 1]])
print(len(np.intersect1d(np.flatnonzero(A==1), np.flatnonzero(B==1))) > 0)

How can I get exactly the same amount elements replaced in numpy 2D matrix?

I have a symmetric 2D numpy matrix that contains only ones and zeros, and its diagonal elements are always 0.
I want to change some of the ones to zeros, and the result needs to stay symmetric. How many elements are selected depends on the parameter replace_rate.
Since the matrix is symmetric, I take half of it, randomly select elements whose value is 1, and change them from 1 to 0. A mirror operation then makes sure the whole matrix is still symmetric.
For example
com = np.array ([[0, 1, 1, 1, 1],
[1, 0, 1, 1, 1],
[1, 1, 0, 1, 1],
[1, 1, 1, 0, 1],
[1, 1, 1, 1, 0]])
replace_rate = 0.1
com = np.triu(com)
mask = np.random.choice([0,1],size=(com.shape),p=((1-replace_rate),replace_rate)).astype(bool)
r1 = np.random.rand(*com.shape)
com[mask] = r1[mask]
com += com.T - np.diag(com.diagonal())
com is a (5,5) symmetric matrix, and 10% of its elements (counting only those whose value is 1; the diagonal elements are excluded) should be randomly replaced with 0.
The question is: how can I make sure the number of changed elements stays the same each time?
Keeping the same replace_rate = 0.1, sometimes I get a result like:
com = np.array([[0 1 1 1 1]
[1 0 1 1 1]
[1 1 0 1 1]
[1 1 1 0 1]
[1 1 1 1 0]])
Nothing was changed this time, and if I repeat it, I get 2 elements changed:
com = np.array([[0 1 1 1 1]
[1 0 1 1 1]
[1 1 0 1 0]
[1 1 1 0 1]
[1 1 0 1 0]])
How can I fix the number of changed elements for a given replace_rate?
Thanks in advance!!
How about something like this:
import random

def make_transform(m, replace_rate):
    changed = []  # keep track of index pairs we already changed
    def get_random():
        # Draw a random pair of indices; reject diagonal and already-used pairs
        c1, c2 = random.choices(range(len(m)), k=2)
        if c1 == c2 or (c1, c2) in changed or (c2, c1) in changed:
            return get_random()  # recurse until we find a new pair i, j with i != j
        else:
            changed.append((c1, c2))
            return c1, c2
    n_changes = int(m.shape[0]**2 * replace_rate)  # the number of changes to make
    print(n_changes)
    for _ in range(n_changes):
        i, j = get_random()  # get a valid index pair
        m[i][j] = m[j][i] = 0
    return m
This is the solution I suggest:
def rand_zero(mat, replace_rate):
    triu_mat = np.triu(mat)
    _ind = np.where(triu_mat != 0)  # gets indices of non-zero elements, not just non-diagonals
    ind = [x for x in zip(*_ind)]
    chng = np.random.choice(range(len(ind)),  # select some indices, at rate 'replace_rate'
                            size = int(replace_rate*mat.size),
                            replace = False)  # do not select duplicates
    mod_mat = triu_mat
    for c in chng:
        mod_mat[ind[c]] = 0
    mod_mat = mod_mat + mod_mat.T
    return mod_mat
I use int() to truncate to an integer in size, but you can use round() if that's what you desire.
Hope this gives consistent results!
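For comparison, here is a sketch that fixes the count by sampling positions without replacement from the strictly upper triangle. The function name zero_fixed_fraction is hypothetical, and the sketch assumes that replace_rate is meant as a fraction of the 1s above the diagonal (so each selected pair changes two mirrored entries); adjust the count if you intend a different base.

```python
import numpy as np


def zero_fixed_fraction(mat, replace_rate, rng=None):
    """Zero out a fixed share of the 1s above the diagonal and mirror the change."""
    rng = np.random.default_rng() if rng is None else rng
    out = mat.copy()
    rows, cols = np.triu_indices(out.shape[0], k=1)   # strictly upper triangle
    is_one = out[rows, cols] == 1
    rows, cols = rows[is_one], cols[is_one]            # candidate positions holding a 1
    n_changes = int(round(replace_rate * rows.size))   # same count on every call
    picked = rng.choice(rows.size, size=n_changes, replace=False)
    out[rows[picked], cols[picked]] = 0
    out[cols[picked], rows[picked]] = 0                # keep the matrix symmetric
    return out


com = np.ones((5, 5), dtype=int) - np.eye(5, dtype=int)
result = zero_fixed_fraction(com, 0.1, rng=np.random.default_rng(0))
print(int(com.sum() - result.sum()))  # 2: one pair, i.e. two mirrored entries
```

Which pair is zeroed varies with the random generator, but the number of changed entries does not.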

Editing python 2-dimensional array without for-loop?

So, I have a given 2 dimensional matrix which is randomly generated:
a = np.random.randn(4,4)
which gives output:
array([[-0.11449491, -2.7777728 , -0.19784241, 1.8277976 ],
[-0.68511473, 0.40855461, 0.06003551, -0.8779363 ],
[-0.55650378, -0.16377137, 0.10348714, -0.53449633],
[ 0.48248298, -1.12199767, 0.3541335 , 0.48729845]])
I want to change all the negative values to 0 and all the positive values to 1.
How can I do this without a for loop?
You can use np.where()
import numpy as np
a = np.random.randn(4,4)
a = np.where(a<0, 0, 1)
print(a)
[[1 1 0 1]
[1 0 1 0]
[1 1 0 0]
[0 1 1 0]]
(a>0).astype(int)
This is one possible solution: convert the array to a boolean array according to your condition, then convert it from boolean to integer.
array([[ 0.63694991, -0.02785534, 0.07505496, 1.04719295],
[-0.63054947, -0.26718763, 0.34228736, 0.16134474],
[ 1.02107383, -0.49594998, -0.11044738, 0.64459594],
[ 0.41280766, 0.668819 , -1.0636972 , -0.14684328]])
And the result -
(a>0).astype(int)
>>> array([[1, 0, 1, 1],
[0, 0, 1, 1],
[1, 0, 0, 1],
[1, 1, 0, 0]])
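The np.where form and a boolean cast can be compared on a small fixed array. This is a minimal sketch with made-up values; the variable names are illustrative only.

```python
import numpy as np

a = np.array([[-0.5, 1.2],
              [3.0, -0.1]])

via_where = np.where(a < 0, 0, 1)   # 0 where negative, 1 elsewhere
via_cast = (a > 0).astype(int)      # True/False cast to 1/0

print(via_where.tolist())  # [[0, 1], [1, 0]]
print(via_cast.tolist())   # [[0, 1], [1, 0]]
```

The two forms differ only for exact zeros: np.where(a < 0, 0, 1) maps a 0.0 entry to 1, while (a > 0).astype(int) maps it to 0. With values drawn from np.random.randn an exact zero is extremely unlikely, but it is worth deciding which behaviour you want.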

One-hot encoding of categories

I have a list similar to this:
list = ['Opinion, Journal, Editorial',
'Opinion, Magazine, Evidence-based',
'Evidence-based']
where the commas split between categories eg. Opinion and Journal are two separate categories. The real list is much larger and has more possible categories. I would like to use one-hot encoding to transform the list so that it can be used for machine learning. For example, from that list I would like to produce a sparse matrix containing data like:
list = [[1, 1, 1, 0, 0],
[1, 0, 0, 1, 1],
[0, 0, 0, 0, 1]]
Ideally, I would like to use scikit-learn's one hot encoder as I presume this would be the most efficient.
In response to @nbrayns's comment:
The idea is to transform the list of categories from text to a vector whereby an item is assigned 1 if it belongs to a category and 0 otherwise. For the above example, the headings would be:
headings = ['Opinion', 'Journal', 'Editorial', 'Magazine', 'Evidence-based']
If you are able to use Pandas, this functionality is essentially built-in there:
import pandas as pd
l = ['Opinion, Journal, Editorial', 'Opinion, Magazine, Evidence-based', 'Evidence-based']
pd.Series(l).str.get_dummies(', ')
   Editorial  Evidence-based  Journal  Magazine  Opinion
0          1               0        1         0        1
1          0               1        0         1        1
2          0               1        0         0        0
If you'd like to stick with the sklearn ecosystem, you are looking for MultiLabelBinarizer, not for OneHotEncoder. As the name implies, OneHotEncoder only supports one level per sample per category, while your dataset has multiple.
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer() # pass sparse_output=True if you'd like
mlb.fit_transform(s.split(', ') for s in l)
[[1 0 1 0 1]
[0 1 0 1 1]
[0 1 0 0 0]]
To map the columns back to categorical levels, you can access mlb.classes_. For the above example, this gives ['Editorial' 'Evidence-based' 'Journal' 'Magazine' 'Opinion'].
One more way:
import numpy as np
l = ['Opinion, Journal, Editorial', 'Opinion, Magazine, Evidence-based', 'Evidence-based']
# Get list of unique classes (set order is arbitrary, so yours may differ)
classes = list(set([j for i in l for j in i.split(', ')]))
# => ['Journal', 'Opinion', 'Editorial', 'Evidence-based', 'Magazine']
# Get (row, column) indices in the matrix
indices = np.array([[k, classes.index(j)] for k, i in enumerate(l) for j in i.split(', ')])
# => array([[0, 1],
#           [0, 0],
#           [0, 2],
#           [1, 1],
#           [1, 4],
#           [1, 3],
#           [2, 3]])
# Generate output
output = np.zeros((len(l), len(classes)), dtype=int)
output[indices[:, 0], indices[:, 1]]=1
# => array([[ 1, 1, 1, 0, 0],
#           [ 0, 1, 0, 1, 1],
#           [ 0, 0, 0, 1, 0]])
This may not be the most efficient method, but probably easy to grasp.
If you don't already have a list of all possible words, you need to create that. In the code below it's called unique. The columns of the output matrix s will then correspond to those unique words; the rows will be the item from the list.
import numpy as np
lis = ['Opinion, Journal, Editorial','Opinion, Magazine, Evidence-based','Evidence-based']
unique=list(set(", ".join(lis).split(", ")))
print(unique)
# prints ['Opinion', 'Journal', 'Magazine', 'Editorial', 'Evidence-based']
s = np.zeros((len(lis), len(unique)))
for i, item in enumerate(lis):
    words = item.split(", ")  # split first so that only whole words match
    for j, notion in enumerate(unique):
        if notion in words:
            s[i,j] = 1
print(s)
# prints [[ 1. 1. 0. 1. 0.]
# [ 1. 0. 1. 0. 1.]
# [ 0. 0. 0. 0. 1.]]
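Because set ordering is arbitrary, the column order of the manual approaches can change between runs. A minimal pure-Python sketch that sorts the categories to get a reproducible column order is shown below; the function name multi_hot and its sep parameter are hypothetical.

```python
def multi_hot(items, sep=", "):
    """Return (classes, rows) with one 0/1 column per category, sorted by name."""
    split_items = [item.split(sep) for item in items]
    classes = sorted({label for labels in split_items for label in labels})
    column = {label: i for i, label in enumerate(classes)}  # category -> column index
    rows = []
    for labels in split_items:
        row = [0] * len(classes)
        for label in labels:
            row[column[label]] = 1
        rows.append(row)
    return classes, rows


data = ['Opinion, Journal, Editorial',
        'Opinion, Magazine, Evidence-based',
        'Evidence-based']
classes, rows = multi_hot(data)
print(classes)  # ['Editorial', 'Evidence-based', 'Journal', 'Magazine', 'Opinion']
print(rows)     # [[1, 0, 1, 0, 1], [0, 1, 0, 1, 1], [0, 1, 0, 0, 0]]
```

The dictionary lookup avoids the repeated list searches of classes.index, which matters once the list of categories grows.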
Very easy in pandas:
import pandas as pd
s = pd.Series(['a','b','c'])
pd.get_dummies(s)
Output:
a b c
0 1 0 0
1 0 1 0
2 0 0 1

Numpy sum running length of non-zero values

Looking for a fast vectorized function that returns the rolling number of consecutive non-zero values. The count should start over at 0 whenever encountering a zero. The result should have the same shape as the input array.
Given an array like this:
x = np.array([2.3, 1.2, 4.1 , 0.0, 0.0, 5.3, 0, 1.2, 3.1])
The function should return this:
array([1, 2, 3, 0, 0, 1, 0, 1, 2])
This post lists a vectorized approach which basically consists of two steps:
Initialize a zeros vector of the same size as input vector, x and set ones at places corresponding to non-zeros of x.
Next up, in that vector, we need to put minus of runlengths of each island right after the ending/stop positions for each "island". The intention is to use cumsum again later on, which would result in sequential numbers for the "islands" and zeros elsewhere.
Here's the implementation -
import numpy as np
#Append zeros at the start and end of input array, x
xa = np.hstack([[0],x,[0]])
# Get an array of ones and zeros, with ones for nonzeros of x and zeros elsewhere
xa1 =(xa!=0)+0
# Find consecutive differences on xa1
xadf = np.diff(xa1)
# Find start and stop+1 indices and thus the lengths of "islands" of non-zeros
starts = np.where(xadf==1)[0]
stops_p1 = np.where(xadf==-1)[0]
lens = stops_p1 - starts
# Mark indices where "minus ones" are to be put for applying cumsum
put_m1 = stops_p1[stops_p1 < x.size]
# Setup vector with ones for nonzero x's, "minus lens" at stops +1 & zeros elsewhere
vec = xa1[1:-1] # Note: this will change xa1, but it's okay as not needed anymore
vec[put_m1] = -lens[0:put_m1.size]
# Perform cumsum to get the desired output
out = vec.cumsum()
Sample run -
In [116]: x
Out[116]: array([ 0. , 2.3, 1.2, 4.1, 0. , 0. , 5.3, 0. , 1.2, 3.1, 0. ])
In [117]: out
Out[117]: array([0, 1, 2, 3, 0, 0, 1, 0, 1, 2, 0], dtype=int32)
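The runtime tests call a function named sumrunlen_vectorized whose definition is not shown. The following is a sketch of what such a wrapper could look like, assuming it simply packages the steps above; it works on a copy instead of modifying an intermediate array in place.

```python
import numpy as np


def sumrunlen_vectorized(x):
    """Running count of consecutive non-zero values, reset to 0 at every zero."""
    x = np.asarray(x)
    nonzero = (x != 0).astype(int)
    padded = np.concatenate(([0], nonzero, [0]))   # pad so every run has a start and a stop
    diff = np.diff(padded)
    starts = np.flatnonzero(diff == 1)             # index in x where a run begins
    stops = np.flatnonzero(diff == -1)             # index in x just past the end of a run
    vec = nonzero.copy()
    inside = stops < x.size                        # a run ending at the last element needs no reset
    vec[stops[inside]] = -(stops - starts)[inside] # subtract the run length right after each run
    return vec.cumsum()


x = np.array([2.3, 1.2, 4.1, 0.0, 0.0, 5.3, 0, 1.2, 3.1])
print(sumrunlen_vectorized(x).tolist())  # [1, 2, 3, 0, 0, 1, 0, 1, 2]
```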
Runtime tests -
Here are some runtime tests comparing the proposed approach against the other itertools.groupby based approach -
In [21]: N = 1000000
...: x = np.random.rand(1,N)
...: x[x>0.5] = 0.0
...: x = x.ravel()
...:
In [19]: %timeit sumrunlen_vectorized(x)
10 loops, best of 3: 19.9 ms per loop
In [20]: %timeit sumrunlen_loopy(x)
1 loops, best of 3: 2.86 s per loop
You can use itertools.groupby and np.hstack :
>>> import numpy as np
>>> x = np.array([2.3, 1.2, 4.1 , 0.0, 0.0, 5.3, 0, 1.2, 3.1])
>>> from itertools import groupby
>>> np.hstack([[i if j!=0 else j for i,j in enumerate(g,1)] for _,g in groupby(x,key=lambda x: x!=0)])
array([ 1., 2., 3., 0., 0., 1., 0., 1., 2.])
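groupby splits the array into consecutive runs of zero and non-zero elements. The list comprehension then uses enumerate(g, 1) to replace each non-zero element with its 1-based position within its run, while zeros are kept as they are, and np.hstack flattens the per-run lists into a single array.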
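The same grouping idea can be written out step by step without NumPy. This is a minimal sketch; the function name run_lengths_groupby is hypothetical, and it returns a plain list of integers rather than an array.

```python
from itertools import groupby


def run_lengths_groupby(values):
    """Running count of consecutive non-zero values using itertools.groupby."""
    out = []
    for is_nonzero, group in groupby(values, key=lambda v: v != 0):
        if is_nonzero:
            # number the members of a non-zero run 1, 2, 3, ...
            out.extend(i for i, _ in enumerate(group, 1))
        else:
            # every element of a zero run contributes 0
            out.extend(0 for _ in group)
    return out


print(run_lengths_groupby([2.3, 1.2, 4.1, 0.0, 0.0, 5.3, 0, 1.2, 3.1]))
# [1, 2, 3, 0, 0, 1, 0, 1, 2]
```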
This sub-problem came up in Kick Start 2021 Round A for me. My solution:
def current_run_len(a):
    a_ = np.hstack([0, a != 0, 0])  # first in starts and last in stops defined
    d = np.diff(a_)
    starts = np.where(d == 1)[0]
    stops = np.where(d == -1)[0]
    a_[stops + 1] = -(stops - starts)  # +1 for behind-last
    return a_[1:-1].cumsum()
In fact, the problem also required a version that counts down within each consecutive run. Here is another version with an optional keyword argument; with rev=False it behaves exactly like the function above:
def current_run_len(a, rev=False):
    a_ = np.hstack([0, a != 0, 0])  # first in starts and last in stops defined
    d = np.diff(a_)
    starts = np.where(d == 1)[0]
    stops = np.where(d == -1)[0]
    if rev:
        a_[starts] = -(stops - starts)
        cs = -a_.cumsum()[:-2]
    else:
        a_[stops + 1] = -(stops - starts)  # +1 for behind-last
        cs = a_.cumsum()[1:-1]
    return cs
Results:
a = np.array([1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1])
print('a = ', a)
print('current_run_len(a) = ', current_run_len(a))
print('current_run_len(a, rev=True) = ', current_run_len(a, rev=True))
a = [1 1 1 1 0 0 0 1 1 0 1 0 0 0 1]
current_run_len(a) = [1 2 3 4 0 0 0 1 2 0 1 0 0 0 1]
current_run_len(a, rev=True) = [4 3 2 1 0 0 0 2 1 0 1 0 0 0 1]
For an array that consists of 0s and 1s only, you can simplify [0, a != 0, 0] to [0, a, 0]. But the version as-posted also works for arbitrary non-zero numbers.
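When experimenting with vectorized variants like these, a slow but obvious loop is handy for cross-checking results in both directions. The sketch below is such a reference; the name run_len_reference is hypothetical, and it is meant for testing rather than speed.

```python
def run_len_reference(a, rev=False):
    """Loop-based reference: count up within each non-zero run, or count down if rev=True."""
    seq = list(a)[::-1] if rev else list(a)   # counting down equals counting up from the right
    out, count = [], 0
    for v in seq:
        count = count + 1 if v != 0 else 0    # extend the current run or reset at a zero
        out.append(count)
    return out[::-1] if rev else out


a = [1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1]
print(run_len_reference(a))            # [1, 2, 3, 4, 0, 0, 0, 1, 2, 0, 1, 0, 0, 0, 1]
print(run_len_reference(a, rev=True))  # [4, 3, 2, 1, 0, 0, 0, 2, 1, 0, 1, 0, 0, 0, 1]
```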
