How to shuffle a 2d binary matrix, preserving marginal distributions - python

Suppose I have an (n*m) binary matrix df similar to the following:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.binomial(1, .3, size=(6,8)))
0 1 2 3 4 5 6 7
------------------------------
0 | 0 0 0 0 0 1 1 0
1 | 0 1 0 0 0 0 0 0
2 | 0 0 0 0 1 0 0 0
3 | 0 0 0 0 0 1 0 1
4 | 0 1 1 0 1 0 0 0
5 | 1 0 1 1 1 0 0 1
I want to shuffle the values in the matrix to create a new_df of the same shape, such that both marginal distributions are the same, such as the following:
0 1 2 3 4 5 6 7
------------------------------
0 | 0 0 0 0 1 0 0 1
1 | 0 0 0 0 1 0 0 0
2 | 0 0 0 0 0 0 0 1
3 | 0 1 1 0 0 0 0 0
4 | 1 0 0 0 1 1 0 0
5 | 0 1 1 1 0 1 1 0
In the new matrix, the sum of each row is equal to the sum of the corresponding row in the original matrix, and likewise, columns in the new matrix have the same sum as the corresponding column in the original matrix.
The solution is pretty easy to check:
# rows have the same marginal distribution
assert(all(df.sum(axis=1) == new_df.sum(axis=1)))
# columns have the same marginal distribution
assert(all(df.sum(axis=0) == new_df.sum(axis=0)))
If n*m is small, I can use a brute-force approach to the shuffle:
def shuffle_2d(df):
    """Shuffles a 2D binary array, preserving marginal distributions"""
    # get a list of indices where the df is 1
    rowlist = []
    collist = []
    for i_row, row in df.iterrows():
        for i_col, val in row.items():  # .iteritems() was removed in pandas 2.x
            if val == 1:
                rowlist.append(i_row)
                collist.append(i_col)
    # create an empty df of the same shape
    new_df = pd.DataFrame(index=df.index, columns=df.columns, data=0)
    # shuffle until you get no repeat coordinates
    # this is so you don't increment the same cell in the matrix twice
    repeats = 999
    while repeats > 1:
        pairs = list(zip(np.random.permutation(rowlist), np.random.permutation(collist)))
        repeats = pd.Series(pairs).value_counts().max()  # pd.value_counts() is deprecated
    # populate new data frame at indicated points
    for i_row, i_col in pairs:
        new_df.at[i_row, i_col] += 1
    return new_df
The problem is that the brute force approach scales poorly. (As in that line from Indiana Jones and the Last Crusade: https://youtu.be/Ubw5N8iVDHI?t=3)
As a quick demo, for an n*n matrix, the number of attempts needed to get an acceptable shuffle looks like: (in one run)
n attempts
2 1
3 2
4 4
5 1
6 1
7 11
8 9
9 22
10 4416
11 800
12 66
13 234
14 5329
15 26501
16 27555
17 5932
18 668902
...
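For reference, here's a hedged sketch of how such attempt counts could be measured; count_attempts is a hypothetical helper that just repeats the rejection step from shuffle_2d and reports how many tries it took:

def count_attempts(df):
    """Count permutation attempts until a pairing with no repeated
    (row, col) coordinate is found (the rejection step in shuffle_2d)."""
    rows, cols = np.nonzero(df.values)
    attempts = 0
    while True:
        attempts += 1
        pairs = list(zip(np.random.permutation(rows), np.random.permutation(cols)))
        if not pairs or pd.Series(pairs).value_counts().max() == 1:
            return attempts

for n in range(2, 13):
    df = pd.DataFrame(np.random.binomial(1, .3, size=(n, n)))
    print(n, count_attempts(df))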
Is there a straightforward solution that preserves the exact marginal distributions (or tells you where no other pattern is possible that preserves that distribution)?
As a fallback, I could also use an approximation algorithm that could minimize the sum of squared errors on each row.
Thanks! =)
EDIT:
For some reason I wasn't finding existing answers before I wrote this question, but after posting it they all show up in the sidebar:
Is it possible to shuffle a 2D matrix while preserving row AND column frequencies?
Randomize matrix in perl, keeping row and column totals the same
Sometimes all you need to do is ask...

Thanks mostly to https://stackoverflow.com/a/2137012/6361632 for inspiration, here's a solution that seems to work:
def flip1(m):
    """
    Chooses a single (i0, j0) location in the matrix to 'flip'.
    Then randomly selects a different (i, j) location that creates
    a quad [(i0, j0), (i0, j), (i, j0), (i, j)] in which flipping every
    element leaves the marginal distributions unaltered.
    Changes those elements, and returns 1.
    If such a quad cannot be completed from the original position,
    does nothing and returns 0.
    """
    i0 = np.random.randint(m.shape[0])
    j0 = np.random.randint(m.shape[1])
    level = m[i0, j0]
    flip = 0 if level == 1 else 1  # the opposite value
    for i in np.random.permutation(range(m.shape[0])):  # try in random order
        if (i != i0 and  # don't swap with self
                m[i, j0] != level):  # maybe swap with a cell that holds the opposite value
            for j in np.random.permutation(range(m.shape[1])):
                if (j != j0 and  # don't swap with self
                        m[i, j] == level and  # check that the other corners work
                        m[i0, j] != level):
                    # make the swaps
                    m[i0, j0] = flip
                    m[i0, j] = level
                    m[i, j0] = level
                    m[i, j] = flip
                    return 1
    return 0
def shuffle(m1, n=100):
    m2 = m1.copy()
    f_success = np.mean([flip1(m2) for _ in range(n)])
    # f_success is the fraction of flip attempts that succeed, for diagnostics
    # print(f_success)
    # check the answer
    assert all(m1.sum(axis=1) == m2.sum(axis=1))
    assert all(m1.sum(axis=0) == m2.sum(axis=0))
    return m2
Which we can call as:
m1 = np.random.binomial(1, .3, size=(6,8))
m1
array([[0, 0, 0, 1, 1, 0, 0, 1],
       [1, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 1, 0, 1, 0, 1],
       [1, 1, 0, 0, 0, 1, 0, 1],
       [0, 0, 0, 0, 0, 1, 0, 0],
       [1, 0, 1, 0, 1, 0, 0, 0]])
m2 = shuffle(m1)
m2
array([[0, 0, 0, 0, 1, 1, 0, 1],
       [1, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 0, 1, 0, 0, 1, 1],
       [1, 1, 1, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0],
       [1, 0, 0, 1, 0, 0, 0, 1]])
How many iterations do we need to get to a steady-state distribution? I've set a default of 100 here, which is sufficient for these small matrices.
Below I plot the correlation between the original matrix and the shuffled matrix (500 times) for various numbers of iterations.
import matplotlib.pyplot as plt

for n_iters in [10, 25, 100, 250]:  # iteration counts to compare
    corrs = []
    for _ in range(500):
        m1 = np.random.binomial(1, .3, size=(9,9))  # create starting matrix
        m2 = shuffle(m1, n_iters)
        corrs.append(np.corrcoef(m1.flatten(), m2.flatten())[1, 0])
    plt.hist(corrs, bins=40, alpha=.4, label=n_iters)
plt.legend()
For a 9x9 matrix, we see improvements up until about 25 iterations, beyond which we're in a steady state.
For an 18x18 matrix, we see small gains going from 100 to 250 iterations, but not much beyond.
Note that the correlation between starting and ending distributions is lower for larger matrices, but it takes us longer to get there.

You have to look for two rows and two columns whose four intersection cells form a 2x2 submatrix with 1 0 on top and 0 1 on the bottom (or the other way around). Those four values you can flip (to 0 1 and 1 0) without changing any row or column sum.
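As a minimal sketch of one such swap, assuming a NumPy 0/1 array m (the function name and retry bound are illustrative):

import numpy as np

def checkerboard_swap(m, tries=100, rng=np.random.default_rng()):
    """Pick two random rows and two random columns; if their four
    intersection cells form [[1, 0], [0, 1]] or [[0, 1], [1, 0]],
    flip all four. Row and column sums are unchanged."""
    for _ in range(tries):
        i = rng.choice(m.shape[0], size=2, replace=False)
        j = rng.choice(m.shape[1], size=2, replace=False)
        quad = m[np.ix_(i, j)]
        if quad[0, 0] == quad[1, 1] and quad[0, 1] == quad[1, 0] and quad[0, 0] != quad[0, 1]:
            m[np.ix_(i, j)] = 1 - quad  # flip the 2x2 checkerboard in place
            return True
    return False  # no swappable quad found within the probe budget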
There is even an algorithm that can sample from all possible matrices with identical marginals, developed by Verhelst (2008, link to article page) and implemented in the R package RaschSampler.
A newer algorithm by Wang (2020, link), which is more efficient in some cases, is also available.

Related

In a boolean matrix, what is the best way to make every value adjacent to True/1 to True?

I have a numpy boolean 2d array with True/False. I want to make every cell adjacent to a True value become True as well. What's the best/fastest way of doing that in Python?
For Eg:
#Initial Matrix
1 0 0 0 0 1 0
0 0 0 1 0 0 0
0 0 0 0 0 0 0
#After operation
1 1 1 1 1 1 1
1 1 1 1 1 1 1
0 0 1 1 1 0 0
It looks like you want to do dilation. OpenCV might be your best tool
import cv2
import numpy as np

# src is the question's 0/1 matrix as a uint8 array
dilatation_dst = cv2.dilate(src, np.ones((3, 3), np.uint8))
https://docs.opencv.org/3.4/db/df6/tutorial_erosion_dilatation.html
You can use scipy.signal.convolve2d.
import numpy as np
from scipy.signal import convolve2d
result = convolve2d(src, np.ones((3,3)), mode='same').astype(bool).astype(int)
print(result)
Or we can use scipy.ndimage.
from scipy import ndimage
result = ndimage.binary_dilation(src, np.ones((3,3))).astype(int)
print(result)
Output:
[[1 1 1 1 1 1 1]
 [1 1 1 1 1 1 1]
 [0 0 1 1 1 0 0]]
Given
arr = np.array([[1, 0, 0, 0, 0, 1, 0],
                [0, 0, 0, 1, 0, 0, 0],
                [0, 0, 0, 0, 0, 0, 0]])
You can do
from scipy.ndimage import shift

arr2 = arr | shift(arr, (0, 1), cval=0) | shift(arr, (0, -1), cval=0)
arr3 = arr2 | shift(arr2, (1, 0), cval=0) | shift(arr2, (-1, 0), cval=0)

I need help filling a 2D array so it has only zeros above one of the diagonals

I have a Size x Size array that is initialized with '0' only. I want to fill it with randomly picked integers so that they form a triangle. I tried a for loop with two more nested inside its body, but it doesn't seem to work.
pyramid = [[1, 0, 0],
           [4, 8, 0],
           [1, 5, 3]]
This is the desirable format
pyramid = [[0]*rows]*rows
for i in range(0, 3, 1):
    for j in range(0, 3, 1):
        for k in range(0, i+1, 1):
            pyramid[i][k] = random.randint(1, 10)
You have an array that is initially:
0 1 2 3
-------
0|0 0 0 0
1|0 0 0 0
2|0 0 0 0
3|0 0 0 0
You want it to be
0 1 2 3
-------
0|4 0 0 0
1|1 5 0 0
2|9 8 4 0
3|2 5 7 1
Notice that if you iterate over rows, you only need to fill up that row until you hit a diagonal (including the diagonal).
So for row 0, you fill up until column 0. For row 1, you do columns 0 and 1. For row 2, you do columns 0, 1, and 2. And so on, so that for row i, you do columns 0, 1, ..., i-1, i.
You could iterate over rows in one loop (i.e. row will range between 0 and N inclusive) and then in an inner loop iterate over the columns for that row - i.e. column for row i should range between 0 and i inclusive.
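Putting that loop structure into code, a minimal sketch (N is a hypothetical size):

import random

N = 4
arr = [[0] * N for _ in range(N)]  # N x N zeros, one fresh list per row
for i in range(N):                 # for each row...
    for j in range(i + 1):         # ...fill columns 0..i, including the diagonal
        arr[i][j] = random.randint(1, 10)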
You will need to use the value of rows. Additionally, your value of j does not add anything. You need to loop through a 2-dimensional array, so 2 loops should be sufficient.
for i in range(rows):
    for j in range(i+1):
        pyramid[i][j] = random.randint(1, 10)
Additionally, you are initializing the 2-d list so that every row is a reference to the same underlying list object. So if you change pyramid[i][j], you also change pyramid[i+1][j].
To prevent this use pyramid = [[0 for i in range(rows)] for i in range(rows)]
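To see why this matters, a quick illustration of the aliasing (hypothetical 3x3 example):

rows = 3
bad = [[0] * rows] * rows  # all three rows are the SAME list object
bad[0][0] = 7
print(bad)   # [[7, 0, 0], [7, 0, 0], [7, 0, 0]]

good = [[0 for _ in range(rows)] for _ in range(rows)]  # independent rows
good[0][0] = 7
print(good)  # [[7, 0, 0], [0, 0, 0], [0, 0, 0]]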
This should work fine:
import random

pyramid = [
    [0, 0, 0],
    [0, 0, 0],
    [0, 0, 0]
]
x = len(pyramid)
y = len(pyramid[0])
for i in range(x):
    for j in range(y):
        if i >= j:
            pyramid[i][j] = random.randint(1, 10)

Function to use indexes in a matrix

I am trying to create a function which takes two inputs. One input is the matrix (n*m), and the second is K, an integer value. The distance between the cells A[3][2] and A[1][4] is |1-3| + |4-2| = 4. The expected output from the function is the count of cell pairs whose distance is greater than K.
Cell here is each entry in the given matrix A. For example, A[0][0] is a cell and it has an entry value of 1 in the matrix.
I have created a function like this:
A = [[1, 0, 0],
     [0, 0, 0],
     [0, 0, 1],
     [0, 1, 0]]
def findw(K, matrix):
    m_c = matrix.copy()
    result = 0
    for i, j in zip(range(len(matrix)), range(len(m_c))):
        for k, l in zip(range(len(matrix[i])), range(len(m_c[j]))):
            D = abs(i - l) + abs(j - k)
            print(i, k)
            print(j, l)
            print(D)
            if D > K:
                result += 1
    return result

findw(1, A)
The output I got from the above function for the given matrix A with K = 1 is 9, but I am expecting 3. From the output I also realized that my function always uses the same index pair for both matrices, for example (0,0) or (1,0). See the print output below.
findw(1, A)
0 0
0 0
0
0 1
0 1
2
0 2
0 2
4
1 0
1 0
2
1 1
1 1
0
1 2
1 2
2
2 0
2 0
4
2 1
2 1
2
2 2
2 2
0
3 0
3 0
6
3 1
3 1
4
3 2
3 2
2
Out[120]: 9
It looks like my function is not iterating for cells where the indexes for both matrices are different. For example, matrix[0][0] and m_c[0][1].
How can I resolve this issue?
Working under the assumption that it is only the positions which have the value 1 that you care about, you could first enumerate those indices and then loop over the pairs of such things. itertools is a natural tool to use here:
from itertools import product, combinations

def D(p, q):
    i, j = p
    k, l = q
    return abs(i - k) + abs(j - l)

def findw(k, matrix):
    m = len(matrix)
    n = len(matrix[0])
    result = 0
    indices = [(i, j) for i, j in product(range(m), range(n)) if matrix[i][j] == 1]
    for p, q in combinations(indices, 2):
        d = D(p, q)
        if d > k:
            print(p, q, d)
            result += 1
    return result

# test:
A = [[1, 0, 0],
     [0, 0, 0],
     [0, 0, 1],
     [0, 1, 0]]
print(findw(1, A))
Output:
(0, 0) (2, 2) 4
(0, 0) (3, 1) 4
(2, 2) (3, 1) 2
3

Numpy / Pandas slicing based on intervals

Trying to figure out a way to slice non-contiguous and non-equal length rows of a pandas / numpy matrix so I can set the values to a common value. Has anyone come across an elegant solution for this?
import numpy as np
import pandas as pd
x = pd.DataFrame(np.arange(12).reshape(3,4))
#x is the matrix we want to index into
"""
x before:
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
"""
y = pd.DataFrame([[0,3],[2,2],[1,2],[0,0]])
#y is a matrix where each row contains a start idx and an (inclusive) end idx per column of x
"""
   0  1
0  0  3
1  2  2
2  1  2
3  0  0
"""
What I'm looking for is a way to effectively select different length slices of x based on the rows of y
x[y] = 0
"""
x afterwards:
array([[ 0, 1, 2, 0],
[ 0, 5, 0, 7],
[ 0, 0, 0, 11]])
Masking can still be useful, because even if a loop cannot be entirely avoided, the main dataframe x would not need to be involved in the loop, so this should speed things up:
mask = np.zeros_like(x, dtype=bool)
for i in range(len(y)):
    # +1 because the end indices in y are inclusive
    mask[y.iloc[i, 0]:(y.iloc[i, 1] + 1), i] = True
x[mask] = 0
x
   0  1  2   3
0  0  1  2   0
1  0  5  0   7
2  0  0  0  11
As a further improvement, consider defining y as a NumPy array if possible.
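For instance, a small sketch of that improvement (converting y once with to_numpy, then looping over plain integers):

y_arr = y.to_numpy()  # avoids repeated .iloc lookups inside the loop
mask = np.zeros(x.shape, dtype=bool)
for col, (start, end) in enumerate(y_arr):
    mask[start:end + 1, col] = True  # end indices are inclusive
x[mask] = 0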
I customized this answer to your problem:
y_t = y.values.transpose()
r = np.arange(x.shape[0])
# the end indices in y are inclusive, so rows from start through end are masked
mask = ((y_t[0, :, None] <= r) & (y_t[1, :, None] >= r)).transpose()
res = x.where(~mask, 0)
res
#    0  1  2   3
# 0  0  1  2   0
# 1  0  5  0   7
# 2  0  0  0  11

Numpy - How to shift values at indexes where change happened

I would like to shift the values in a 1D numpy array at the positions where a change happened. The size of the shift should be configurable.
input = np.array([0,0,0,0,1,0,0,0,0,0,1,1,1,0,0,1,0,0,0,0])
shiftSize = 2
out = np.magic(input, shiftSize)  # np.magic is a placeholder for the function I'm looking for
print(out)
# desired: np.array([0,0,1,1,1,1,1,0,1,1,1,1,1,1,1,1,1,1,0,0])
For example, the first switch happens at index 4, so indices 2 and 3 become '1'.
The next switch (back to 0) happens at index 5, so indices 5 and 6 become '1'.
EDIT: It is also important to avoid an explicit for loop, because that might be slow (this is needed for large data sets).
EDIT2: indexes and variable name
I tried np.diff, so I get the positions where the changes happened, and then np.put, but with multiple index ranges that seems impossible.
Thank you for the help in advance!
What you want is called "binary dilation" and is contained in scipy.ndimage:
import numpy as np
import scipy.ndimage

input = np.array([0,0,0,0,1,0,0,0,0,0,1,1,1,0,0,1,0,0,0,0], dtype=bool)
out = scipy.ndimage.binary_dilation(input, iterations=2).astype(int)
# array([0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0])
Nils' answer seems good. Here is an alternative using NumPy only:
import numpy as np

def dilate(ar, amount):
    # Convolve with a kernel as big as the dilation scope
    dil = np.convolve(np.abs(ar), np.ones(2 * amount + 1), mode='same')
    # Crop in case the convolution kernel was bigger than the array
    dil = dil[-len(ar):]
    # Take non-zero and convert to input type
    return (dil != 0).astype(ar.dtype)

# Test
inp = np.array([0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0])
print(inp)
print(dilate(inp, 2))
Output:
[0 0 0 0 1 0 0 0 0 0 1 1 1 0 0 1 0 0 0 0]
[0 0 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 0 0]
Another NumPy solution:
def dilatation(seed, shift):
    out = seed.copy()
    for sh in range(1, shift + 1):
        out[sh:] |= seed[:-sh]
    for sh in range(-shift, 0):
        out[:sh] |= seed[-sh:]
    return out
Example (shift = 2) :
in : [0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0]
out: [0 0 0 0 0 0 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 0 0 0 0 1 1 1 1]
