Numpy / Pandas slicing based on intervals - python

Trying to figure out a way to slice non-contiguous and non-equal-length row ranges of a pandas / numpy matrix, one range per column, so I can set the values to a common value. Has anyone come across an elegant solution for this?
import numpy as np
import pandas as pd
x = pd.DataFrame(np.arange(12).reshape(3,4))
#x is the matrix we want to index into
"""
x before:
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
"""
y = pd.DataFrame([[0,3],[2,2],[1,2],[0,0]])
#y gives, for each column of x, a start idx and an (inclusive) end idx
"""
   0  1
0  0  3
1  2  2
2  1  2
3  0  0
"""
What I'm looking for is a way to effectively select a different-length slice of x per column, based on the rows of y, ideally something as simple as:
x[y] = 0
"""
x afterwards:
array([[ 0, 1, 2, 0],
[ 0, 5, 0, 7],
[ 0, 0, 0, 11]])

Masking can still be useful, because even if a loop cannot be entirely avoided, the main dataframe x would not need to be involved in the loop, so this should speed things up:
mask = np.zeros_like(x, dtype=bool)
for i in range(len(y)):
    mask[y.iloc[i, 0]:(y.iloc[i, 1] + 1), i] = True  # +1 because end indices are inclusive
x[mask] = 0
x
   0  1  2   3
0  0  1  2   0
1  0  5  0   7
2  0  0  0  11
As a further improvement, consider defining y as a NumPy array if possible.
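For instance, a minimal sketch of the same masking loop with y as a plain NumPy array (same inclusive start/end convention as above):
y_arr = y.to_numpy()  # or np.asarray(y) on older pandas
mask = np.zeros(x.shape, dtype=bool)
for col, (start, end) in enumerate(y_arr):
    mask[start:end + 1, col] = True  # +1 because the end index is inclusive
x[mask] = 0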

I customized this answer to your problem:
y_t = y.values.transpose()
r = np.arange(x.shape[0])
# the end indices in y are inclusive, so compare them to r with '>='
mask = ((y_t[0, :, None] <= r) & (y_t[1, :, None] >= r)).transpose()
res = x.where(~mask, 0)
res
#    0  1  2   3
# 0  0  1  2   0
# 1  0  5  0   7
# 2  0  0  0  11

Related

How to shuffle a 2d binary matrix, preserving marginal distributions

Suppose I have an (n*m) binary matrix df similar to the following:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.binomial(1, .3, size=(6,8)))
0 1 2 3 4 5 6 7
------------------------------
0 | 0 0 0 0 0 1 1 0
1 | 0 1 0 0 0 0 0 0
2 | 0 0 0 0 1 0 0 0
3 | 0 0 0 0 0 1 0 1
4 | 0 1 1 0 1 0 0 0
5 | 1 0 1 1 1 0 0 1
I want to shuffle the values in the matrix to create a new_df of the same shape in which both marginal distributions are preserved, for example:
0 1 2 3 4 5 6 7
------------------------------
0 | 0 0 0 0 1 0 0 1
1 | 0 0 0 0 1 0 0 0
2 | 0 0 0 0 0 0 0 1
3 | 0 1 1 0 0 0 0 0
4 | 1 0 0 0 1 1 0 0
5 | 0 1 1 1 0 1 1 0
In the new matrix, the sum of each row is equal to the sum of the corresponding row in the original matrix, and likewise, columns in the new matrix have the same sum as the corresponding column in the original matrix.
The solution is pretty easy to check:
# rows have the same marginal distribution
assert(all(df.sum(axis=1) == new_df.sum(axis=1)))
# columns have the same marginal distribution
assert(all(df.sum(axis=0) == new_df.sum(axis=0)))
If n*m is small, I can use a brute-force approach to the shuffle:
def shuffle_2d(df):
    """Shuffles a multidimensional binary array, preserving marginal distributions"""
    # get a list of indices where the df is 1
    rowlist = []
    collist = []
    for i_row, row in df.iterrows():
        for i_col, val in row.iteritems():
            if df.loc[i_row, i_col] == 1:
                rowlist.append(i_row)
                collist.append(i_col)
    # create an empty df of the same shape
    new_df = pd.DataFrame(index=df.index, columns=df.columns, data=0)
    # shuffle until you get no repeat coordinates
    # (so you don't increment the same cell in the matrix twice)
    repeats = 999
    while repeats > 1:
        pairs = list(zip(np.random.permutation(rowlist), np.random.permutation(collist)))
        repeats = pd.value_counts(pairs).max()
    # populate new data frame at the indicated points
    for i_row, i_col in pairs:
        new_df.at[i_row, i_col] += 1
    return new_df
The problem is that the brute force approach scales poorly. (As in that line from Indiana Jones and the Last Crusade: https://youtu.be/Ubw5N8iVDHI?t=3)
As a quick demo, for an n*n matrix, the number of attempts needed to get an acceptable shuffle looked like this in one run:
 n  attempts
 2       1
 3       2
 4       4
 5       1
 6       1
 7      11
 8       9
 9      22
10    4416
11     800
12      66
13     234
14    5329
15   26501
16   27555
17    5932
18  668902
...
Is there a straightforward solution that preserves the exact marginal distributions (or tells you where no other pattern is possible that preserves that distribution)?
As a fallback, I could also use an approximation algorithm that could minimize the sum of squared errors on each row.
Thanks! =)
EDIT:
For some reason I wasn't finding existing answers before I wrote this question, but after posting it they all show up in the sidebar:
Is it possible to shuffle a 2D matrix while preserving row AND column frequencies?
Randomize matrix in perl, keeping row and column totals the same
Sometimes all you need to do is ask...
Thanks mostly to https://stackoverflow.com/a/2137012/6361632 for inspiration, here's a solution that seems to work:
def flip1(m):
    """
    Chooses a single (i0, j0) location in the matrix to 'flip'.
    Then randomly selects a different (i, j) location that creates
    a quad [(i0, j0), (i0, j), (i, j0), (i, j)] in which flipping every
    element leaves the marginal distributions unaltered.
    Changes those elements, and returns 1.
    If such a quad cannot be completed from the original position,
    does nothing and returns 0.
    """
    i0 = np.random.randint(m.shape[0])
    j0 = np.random.randint(m.shape[1])
    level = m[i0, j0]
    flip = 0 if level == 1 else 1  # the opposite value
    for i in np.random.permutation(range(m.shape[0])):  # try in random order
        if (i != i0 and              # don't swap with self
                m[i, j0] != level):  # maybe swap with a cell that holds the opposite value
            for j in np.random.permutation(range(m.shape[1])):
                if (j != j0 and               # don't swap with self
                        m[i, j] == level and  # check that the other swaps work
                        m[i0, j] != level):
                    # make the swaps
                    m[i0, j0] = flip
                    m[i0, j] = level
                    m[i, j0] = level
                    m[i, j] = flip
                    return 1
    return 0
def shuffle(m1, n=100):
    m2 = m1.copy()
    f_success = np.mean([flip1(m2) for _ in range(n)])
    # f_success is the fraction of flip attempts that succeed, for diagnostics
    # print(f_success)
    # check the answer
    assert(all(m1.sum(axis=1) == m2.sum(axis=1)))
    assert(all(m1.sum(axis=0) == m2.sum(axis=0)))
    return m2
Which we can call as:
m1 = np.random.binomial(1, .3, size=(6,8))
array([[0, 0, 0, 1, 1, 0, 0, 1],
       [1, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 1, 0, 1, 0, 1],
       [1, 1, 0, 0, 0, 1, 0, 1],
       [0, 0, 0, 0, 0, 1, 0, 0],
       [1, 0, 1, 0, 1, 0, 0, 0]])
m2 = shuffle(m1)
array([[0, 0, 0, 0, 1, 1, 0, 1],
       [1, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 0, 1, 0, 0, 1, 1],
       [1, 1, 1, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0],
       [1, 0, 0, 1, 0, 0, 0, 1]])
How many iterations do we need to get to a steady-state distribution? I've set a default of 100 here, which is sufficient for these small matrices.
Below I plot the correlation between the original matrix and the shuffled matrix (500 times) for various numbers of iterations.
import matplotlib.pyplot as plt

for n_iters in (10, 25, 100, 250):  # a hypothetical set of iteration counts to compare
    corrs = []
    for _ in range(500):
        m1 = np.random.binomial(1, .3, size=(9, 9))  # create starting matrix
        m2 = shuffle(m1, n_iters)
        corrs.append(np.corrcoef(m1.flatten(), m2.flatten())[1, 0])
    plt.hist(corrs, bins=40, alpha=.4, label=n_iters)
plt.legend()
For a 9x9 matrix, we see improvements up until about 25 iterations, beyond which we're in a steady state.
For an 18x18 matrix, we see small gains going from 100 to 250 iterations, but not much beyond.
Note that the correlation between starting and ending distributions is lower for larger matrices, but it takes us longer to get there.
You have to look for two rows and two columns whose four intersection cells form a 2x2 submatrix with 1 0 on top and 0 1 on the bottom (or the other way around). Those four values can then be switched (to 0 1 and 1 0) without changing any row or column total.
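A minimal sketch of one such swap on a concrete 2x2 pattern (the indices i, j, k, l stand for rows and columns found beforehand):
import numpy as np
m = np.array([[1, 0],
              [0, 1]])
# the four cells form the pattern 1 0 / 0 1, so flipping all of them
# preserves every row total and column total
i, j, k, l = 0, 1, 0, 1
quad = ([i, i, j, j], [k, l, k, l])
m[quad] = 1 - m[quad]
print(m)  # [[0 1]
          #  [1 0]]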
There is even an algorithm, that can sample from all possible matrices with identical marginals (implemented in the R-package RaschSampler) developed by Verhelst (2008, link to article page).
A newer algorithm by Wang (2020, link), more efficient for some cases, is also available.

Add element after each element in numpy array python

I am just starting off with numpy and am trying to create a function that takes in an array (x), converts this into a np.array, and returns a numpy array with 0,0,0,0 added after each element.
It should look like so:
input array: [4,5,6]
output: [4,0,0,0,0,5,0,0,0,0,6,0,0,0,0]
I have tried the following:
import numpy as np
x = np.asarray([4,5,6])
y = np.array([])
for index, value in enumerate(x):
    y = np.insert(x, index+1, [0,0,0,0])
    print(y)
which returns:
[4 0 0 0 0 5 6]
[4 5 0 0 0 0 6]
[4 5 6 0 0 0 0]
So basically I need to combine the output into one single numpy array rather than three separate arrays.
Would anybody know how to solve this?
Many thanks!
Use the numpy .zeros function!
import numpy as np
inputArray = [4,5,6]
newArray = np.zeros(5*len(inputArray),dtype=int)
newArray[::5] = inputArray
In fact, you 'force' all the values at indexes 0, 5, and 10 to become 4, 5, and 6:
so _____[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
becomes [4 0 0 0 0 5 0 0 0 0 6 0 0 0 0]
>>> newArray
array([4, 0, 0, 0, 0, 5, 0, 0, 0, 0, 6, 0, 0, 0, 0])
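Wrapped up as the function the question asks for (a sketch; the name pad_zeros is made up here):
import numpy as np

def pad_zeros(x, n_zeros=4):
    x = np.asarray(x)
    out = np.zeros((n_zeros + 1) * len(x), dtype=x.dtype)
    out[::n_zeros + 1] = x  # every (n_zeros + 1)-th slot gets an original value
    return out

pad_zeros([4, 5, 6])
# array([4, 0, 0, 0, 0, 5, 0, 0, 0, 0, 6, 0, 0, 0, 0])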
I haven't used numpy to solve this problem, but this code seems to return your required output:
a = [4,5,6]
b = [0,0,0,0]
c = []
for x in a:
    c = c + [x] + b
print(c)
I hope this helps!

Index of identical rows in a NumPy array

I already asked a variation of this question, but I still have a problem regarding the runtime of my code.
Given a numpy array consisting of 15000 rows and 44 columns. My goal is to find out which rows are equal and add them to a list, like this:
1 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
1 0 0 0 0
1 2 3 4 5
Result:
equal_rows1 = [1,2,3]
equal_rows2 = [0,4]
What I did up till now is using the following code:
import numpy as np
input_data = np.load('IN.npy')
equal_inputs1 = []
equal_inputs2 = []
for i in range(len(input_data)):
    for j in range(i+1, len(input_data)):
        if np.array_equal(input_data[i], input_data[j]):
            equal_inputs1.append(i)
            equal_inputs2.append(j)
The problem is that it takes a lot of time to return the desired arrays and that this allows only 2 different "similar row lists" although there can be more. Is there any better solution for this, especially regarding the runtime?
This is pretty simple with pandas groupby:
df
   A  B  C  D  E
0  1  0  0  0  0
1  0  0  0  0  0
2  0  0  0  0  0
3  0  0  0  0  0
4  1  0  0  0  0
5  1  2  3  4  5
[g.index.tolist() for _, g in df.groupby(df.columns.tolist()) if len(g.index) > 1]
# [[1, 2, 3], [0, 4]]
If you are dealing with many rows and many unique groups, this might get a bit slow. The performance depends on your data. Perhaps there is a faster NumPy alternative, but this is certainly the easiest to understand.
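One such NumPy alternative might look like this (a sketch; np.unique only accepts the axis argument in NumPy 1.13+):
import numpy as np

arr = df.values  # df from the example above
_, inverse, counts = np.unique(arr, axis=0, return_inverse=True, return_counts=True)
groups = [np.flatnonzero(inverse == g).tolist()
          for g in range(len(counts)) if counts[g] > 1]
# [[1, 2, 3], [0, 4]]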
You can use collections.defaultdict, which retains the row values as keys:
from collections import defaultdict

dd = defaultdict(list)
for idx, row in enumerate(df.values):
    dd[tuple(row)].append(idx)

print(list(dd.values()))
# [[0, 4], [1, 2, 3], [5]]
print(dd)
# defaultdict(<class 'list'>, {(1, 0, 0, 0, 0): [0, 4],
#                              (0, 0, 0, 0, 0): [1, 2, 3],
#                              (1, 2, 3, 4, 5): [5]})
You can, if you wish, filter out unique rows via a dictionary comprehension.
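For example (a sketch):
{coords: idxs for coords, idxs in dd.items() if len(idxs) > 1}
# {(1, 0, 0, 0, 0): [0, 4], (0, 0, 0, 0, 0): [1, 2, 3]}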

Python and creating 2d numpy arrays from list

I'm running into some trouble converting values in columns into a Numpy 2d array. What I have as an output from my code is something like the following:
38617.0 0 0
40728.0 0 1
40538.0 0 2
40500.5 0 3
40214.0 0 4
40545.0 0 5
40352.5 0 6
40222.5 0 7
40008.0 0 8
40017.0 0 9
40126.0 0 10
40029.0 0 11
39681.5 0 12
39973.0 0 13
39903.0 0 14
39766.5 0 15
39784.0 0 16
39528.5 0 17
39513.5 0 18
And this continues for ~300,000 lines. The coords of the data are arranged as (z, x, y), and I want to convert it into a 2d array with dimensions 765x510 (x, y), with each z-value sitting at its respective (x, y) coordinate, so that I may write it to an image file.
Any ideas? I've been looking around and I haven't found anything on the matter.
EDIT:
This is the while-loop that's creating the above columns of data (it's actually two loops: the function below is called from within another while-loop):
def make_median_image(x, y):
    while y < 509:
        y = y + 1  # makes the first value (x,0), b/c Python is indexed at 0
        median_first_row0 = sc.median([a11[y,x], a22[y,x], a33[y,x], a44[y,x], a55[y,x],
                                       a66[y,x], a77[y,x], a88[y,x], a99[y,x], a1010[y,x]])
        print median_first_row0, x, y
        list1 = [median_first_row0, x, y]
        list = list1.append(

while x < 764:
    x = x + 1
    make_median_image(x, y)
You can directly pass a Python 2D list into a numpy array:
import numpy as np
l = [[1,2,3], [4,5,6], [7,8,9], [0,0,0]]
>>> np.array(l)
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9],
       [0, 0, 0]])
If you only want the latter two columns (which are your x,y values):
>>> np.array([[i[1], i[2]] for i in l])
array([[2, 3],
       [5, 6],
       [8, 9],
       [0, 0]])
Would it be possible to create an array from that data like so? (The data is the same three (z, x, y) columns shown above, continuing for ~100,000 lines, so you can guess why I'm adamant to find an answer.)
What I would like:
numpy_ndarray = [[38617.0, 40728.0, 40538.0, 40500.5, 40214.0, 40545.0, 40352.5, ... (continues until the last column value in the data above is 764) ], [begin next line, when x = 1, ... (until last y-value is 764)], ... [ ... (some last pixel value)]]
So it basically builds a matrix/image grid out of the pixel values in the first column of data that's associated with the (x,y) coordinate in the second and third columns.
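A minimal sketch of that placement step, assuming the triples have been loaded into an (N, 3) array (the filename here is hypothetical):
import numpy as np

data = np.loadtxt('medians.txt')  # hypothetical file of (z, x, y) rows
z = data[:, 0]
x_idx = data[:, 1].astype(int)
y_idx = data[:, 2].astype(int)
img = np.zeros((765, 510))  # dimensions from the question
img[x_idx, y_idx] = z       # each z lands at its (x, y) position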
Let's say your array is l:
import numpy as np
l = [[1,2,3], [4,5,6], [7,8,9], [0,0,0]]
>>> np.array(l)
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9],
       [0, 0, 0]])
This should work:
>>> np.array([[i[1] for i in l], [i[2] for i in l]])
array([[2, 5, 8, 0],
       [3, 6, 9, 0]])

Insert data from one sorted array into another sorted array

I apologize if this has been asked here - I've hunted around here and in the Tentative NumPy Tutorial for an answer.
I have 2 numpy arrays. The first array is similar to:
1 0 0 0 0
2 0 0 0 0
3 0 0 0 0
4 0 0 0 0
5 0 0 0 0
6 0 0 0 0
7 0 0 0 0
8 0 0 0 0
(etc... It's ~700x10 in actuality)
I then have a 2nd array similar to
3 1
4 18
5 2
(again, longer - maybe 400 or so rows)
The first column of the 2nd array is always completely contained within the first column of the first array.
What I'd like to do is to insert the 2nd column of the 2nd array into that first array as part of an existing column, i.e:
array a:
1 0 0 0 0
2 0 0 0 0
3 1 0 0 0
4 18 0 0 0
5 2 0 0 0
6 0 0 0 0
7 0 0 0 0
8 0 0 0 0
(I'd be filling in each of those columns in turn, but each covers a different range within the original)
My first try was along the lines of a[b[:,0],1] = b[:,1], which uses the values in b's first column directly as row indices rather than matching them against the values in a's first column (i.e., in my example above, the values landed one row off from where I wanted them). I should have realized that!
Since then, I've tried to make it work pretty inelegantly with where(), and I think I could make it work by finding the difference in the starting values of the first columns.
I'm new to python, so perhaps I'm overly optimistic - but it seems like there should be a more elegant way and I'm just missing it.
Thanks for any insights!
If the numbers in the first column of a are in sorted order, then you could use
a[a[:,0].searchsorted(b[:,0]),1] = b[:,1]
For example:
import numpy as np
a = np.array([(1,0,0,0,0),
              (2,0,0,0,0),
              (3,0,0,0,0),
              (4,0,0,0,0),
              (5,0,0,0,0),
              (6,0,0,0,0),
              (7,0,0,0,0),
              (8,0,0,0,0),
              ])
b = np.array([(3, 1),
              (5, 18),
              (7, 2)])
a[a[:,0].searchsorted(b[:,0]),1] = b[:,1]
print(a)
yields
[[ 1  0  0  0  0]
 [ 2  0  0  0  0]
 [ 3  1  0  0  0]
 [ 4  0  0  0  0]
 [ 5 18  0  0  0]
 [ 6  0  0  0  0]
 [ 7  2  0  0  0]
 [ 8  0  0  0  0]]
(I changed your example a bit to show that the values in b's first column do not have to be contiguous.)
If a[:,0] is not in sorted order, then you could use np.argsort to workaround this:
a = np.array([(1,0,0,0,0),
              (2,0,0,0,0),
              (5,0,0,0,0),
              (3,0,0,0,0),
              (4,0,0,0,0),
              (6,0,0,0,0),
              (7,0,0,0,0),
              (8,0,0,0,0),
              ])
b = np.array([(3, 1),
              (5, 18),
              (7, 2)])
perm = np.argsort(a[:,0])
# searchsorted works on the sorted view of a[:,0]; perm then maps the
# found positions back to the original row numbers
a[:,1][perm[a[:,0][perm].searchsorted(b[:,0])]] = b[:,1]
print(a)
yields
[[ 1  0  0  0  0]
 [ 2  0  0  0  0]
 [ 5 18  0  0  0]
 [ 3  1  0  0  0]
 [ 4  0  0  0  0]
 [ 6  0  0  0  0]
 [ 7  2  0  0  0]
 [ 8  0  0  0  0]]
The setup:
a = np.arange(20).reshape(2,10).T
b = np.array([[1, 100], [3, 300], [8, 800]])
This will work if you don't know anything about a[:, 0] except that it is sorted.
index = a[:, 0].searchsorted(b[:, 0])
a[index, 1] = b[:, 1]
print(a)
array([[  0,  10],
       [  1, 100],
       [  2,  12],
       [  3, 300],
       [  4,  14],
       [  5,  15],
       [  6,  16],
       [  7,  17],
       [  8, 800],
       [  9,  19]])
But if you know that a[:, 0] is a sequence of contiguous integers, as in your example, you can do:
index = b[:, 0] - a[0, 0]  # subtract the first value to turn values into row offsets
a[index, 1] = b[:, 1]
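As a quick check with a fresh copy of the setup above (a[0, 0] is 0 there, so the offset happens to be zero):
index = b[:, 0] - a[0, 0]  # array([1, 3, 8])
a[index, 1] = b[:, 1]
print(a[:, 1])  # [ 10 100  12 300  14  15  16  17 800  19]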
