I have the following two-dimensional array:
seq_length = 5
x = np.array([[0, 2, 0, 4], [5,6,7,8]])
x_repeated = np.repeat(x, seq_length, axis=1)
which gives:
[[0 0 0 0 0 2 2 2 2 2 0 0 0 0 0 4 4 4 4 4]
 [5 5 5 5 5 6 6 6 6 6 7 7 7 7 7 8 8 8 8 8]]
I want to shuffle x_repeated according to seq_length, so that all items of a sequence are shuffled together.
For example, possible shuffle:
[[0 0 0 0 0 6 6 6 6 6 0 0 0 0 0 8 8 8 8 8]
 [5 5 5 5 5 2 2 2 2 2 7 7 7 7 7 4 4 4 4 4]]
Thanks
You can do something like this:
import numpy as np
seq_length = 5
x = np.array([[0, 2, 0, 4], [5, 6, 7, 8]])
swaps = np.random.choice([False, True], size=4)  # one flag per column of x
for swap_index, swap in enumerate(swaps):
    if swap:
        x[[0, 1], swap_index] = x[[1, 0], swap_index]
x_repeated = np.repeat(x, seq_length, axis=1)
You can also rely on the fact that True is non-zero and replace the for loop with:
for swap_index in swaps.nonzero()[0]:
    x[[0, 1], swap_index] = x[[1, 0], swap_index]
The key is that the shuffling/swapping is done before the np.repeat call, which is much more efficient than doing it afterwards, while still meeting the requirement that whole sequences of values be swapped together. Each pair of same-valued sequences has a 50% chance of being swapped.
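For the two-row case, the Python loop can likewise be replaced by a single boolean-mask assignment before the repeat; a minimal sketch, assuming x has exactly two rows as in the question:

import numpy as np

seq_length = 5
x = np.array([[0, 2, 0, 4], [5, 6, 7, 8]])

# one independent 50% swap flag per column; x[::-1] is x with its rows reversed
swap = np.random.rand(x.shape[1]) < 0.5
x[:, swap] = x[::-1, swap]  # the masked read on the right makes a copy, so this is safe

x_repeated = np.repeat(x, seq_length, axis=1)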
Managed to solve it the following way:
items_count = x.shape[-1]
swap_flags = np.repeat(np.random.choice([0, 1], size=items_count), seq_length)
which gives, for example:
[1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 1 1 1 1 1]
for idx, flag in enumerate(swap_flags):
    if flag:
        x_repeated[0, idx], x_repeated[1, idx] = x_repeated[1, idx], x_repeated[0, idx]
Result:
[[5 5 5 5 5 6 6 6 6 6 0 0 0 0 0 8 8 8 8 8]
 [0 0 0 0 0 2 2 2 2 2 7 7 7 7 7 4 4 4 4 4]]
Still not a very elegant numpy way, though.
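The per-element loop can also be collapsed into one vectorized assignment driven by the same swap_flags; a minimal sketch of that idea (the boolean-mask read on the right-hand side makes a copy, so the swap is safe):

import numpy as np

seq_length = 5
x = np.array([[0, 2, 0, 4], [5, 6, 7, 8]])
x_repeated = np.repeat(x, seq_length, axis=1)

items_count = x.shape[-1]
swap_flags = np.repeat(np.random.choice([0, 1], size=items_count), seq_length).astype(bool)

# swap both rows at once wherever the flag is set
x_repeated[:, swap_flags] = x_repeated[::-1, swap_flags]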
Here's my attempt:
def randomize_blocks(arr):
    """Shuffle an n-dimensional array in consecutive blocks of equal numbers."""
    # label each run of equal values with a group id
    # (assumes adjacent blocks hold different values)
    groups = (np.diff(arr.ravel(), prepend=0) != 0).cumsum().reshape(arr.shape)
    # unique group ids and their sizes (all equal to the block length here)
    u, c = np.unique(groups, return_counts=True)
    np.random.shuffle(u)
    o = np.argsort(u)
    # repeat each shuffled id by its count, then argsort to get the gather order
    return arr.ravel()[np.argsort(np.repeat(u, c[o]))].reshape(arr.shape)
Breakdown
First we get the groups:
groups = (np.diff(arr.ravel(), prepend=0) != 0).cumsum().reshape(arr.shape)
array([[0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3],
       [4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7]])
Then we get the unique groups and their counts, and shuffle the unique group ids:
u, c = np.unique(groups, return_counts=True)
np.random.shuffle(u)
>>> u, c
(array([6, 0, 3, 5, 2, 4, 7, 1]),
 array([5, 5, 5, 5, 5, 5, 5, 5]))
Finally, we repeat each shuffled group id by its count and argsort that sequence, which gives the index order that rearranges the raveled array block by block:
o = np.argsort(u)
arr.ravel()[np.argsort(np.repeat(u, c[o]))].reshape(arr.shape)
Example usage:
>>> randomize_blocks(arr)
array([[0, 0, 0, 0, 0, 4, 4, 4, 4, 4, 0, 0, 0, 0, 0, 5, 5, 5, 5, 5],
       [7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 2, 2, 2, 2, 2, 6, 6, 6, 6, 6]])
>>> randomize_blocks(arr)
array([[6, 6, 6, 6, 6, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 2, 2, 2, 2, 2]])
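One caveat about the grouping trick: np.diff merges adjacent blocks that happen to end and start with the same value. Since the block length is fixed here, the groups can instead be derived from positions alone; a minimal sketch of that variant (randomize_fixed_blocks is an illustrative name, and like randomize_blocks it may move blocks between rows):

import numpy as np

def randomize_fixed_blocks(arr, block_len):
    # assumes arr.size is divisible by block_len
    flat = arr.ravel()
    n_blocks = flat.size // block_len
    order = np.random.permutation(n_blocks)  # random order of the blocks
    # gather indices: shuffled block p contributes block_len consecutive indices
    idx = (order[:, None] * block_len + np.arange(block_len)).ravel()
    return flat[idx].reshape(arr.shape)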
Here is a solution that works completely in-place and does not require allocating and generating random indices:
import numpy as np

def row_block_shuffle(a: np.ndarray, seq_len: int):
    cols = a.shape[1]
    rng = np.random.default_rng()
    # each "block" is a (rows, seq_len) view into a; shuffling a block
    # along its first axis shuffles the underlying array in place
    for block in a.T.reshape(cols // seq_len, seq_len, -1).transpose(0, 2, 1):
        rng.shuffle(block)

if __name__ == "__main__":
    seq_length = 5
    x = np.array([[0, 2, 0, 4], [5, 6, 7, 8]])
    x_repeated = np.repeat(x, seq_length, axis=1)
    row_block_shuffle(x_repeated, seq_length)
    print(x_repeated)
Output:
[[5 5 5 5 5 6 6 6 6 6 7 7 7 7 7 8 8 8 8 8]
 [0 0 0 0 0 2 2 2 2 2 0 0 0 0 0 4 4 4 4 4]]
What I do is create "blocks" that share memory with the original array:
>>> x_repeated.T.reshape(x_repeated.shape[1] // seq_length, seq_length, -1).transpose(0, 2, 1)
[[[0 0 0 0 0]
  [5 5 5 5 5]]

 [[2 2 2 2 2]
  [6 6 6 6 6]]

 [[0 0 0 0 0]
  [7 7 7 7 7]]

 [[4 4 4 4 4]
  [8 8 8 8 8]]]
Then I shuffle each "block", which in turn shuffles the original array as well. I believe this is the most efficient solution for large arrays, since it is as in-place as it can be. This answer at least backs up my hypothesis:
https://stackoverflow.com/a/5044364/13091658
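A quick way to verify the no-copy claim (a small sketch reusing the names defined above):

blocks = x_repeated.T.reshape(x_repeated.shape[1] // seq_length, seq_length, -1).transpose(0, 2, 1)
print(np.shares_memory(blocks, x_repeated))  # True: the blocks are views, not copies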
Also! The general problem you are facing is sorting "sliding window views" of your array, so if you would like to sort "windows" within your array that move both horizontally and vertically, see for example my previous answers to sliding-window problems here:
https://stackoverflow.com/a/67416335/13091658
https://stackoverflow.com/a/69924828/13091658
import numpy as np
m = np.array([[0, 2, 0, 4], [5, 6, 7, 8]])
def np_shuffle(m, m_rows=len(m), m_cols=len(m[0]), n_duplicate=5):
    # Flatten the matrix
    m = m.flatten()
    # Shuffle the flattened values in place
    np.random.shuffle(m)
    # Repeat each element n_duplicate times
    m = np.repeat(m, n_duplicate, axis=0)
    # Reshape back to m_rows rows
    return np.reshape(m, (m_rows, n_duplicate * m_cols))
r = np_shuffle(m)
print(r)
# [[8 8 8 8 8 5 5 5 5 5 2 2 2 2 2 0 0 0 0 0]
#  [0 0 0 0 0 7 7 7 7 7 4 4 4 4 4 6 6 6 6 6]]
Related
I am trying to write code that replaces every run of three or more consecutive equal values in a row with zeros, so the three 3s in the first row should become zeros. I wrote the code below, which in my mind should work, but when I execute it I seem to be stuck in an infinite loop.
import numpy as np

A = np.array([[1, 2, 3, 3, 3, 4],
              [1, 3, 2, 4, 2, 4],
              [1, 2, 4, 2, 4, 4],
              [1, 2, 3, 5, 5, 5],
              [1, 2, 1, 3, 4, 4]])

row_nmbr, column_nmbr = A.shape
row = 0
column = 0
while column < column_nmbr:
    next_col = column + 1
    next_col2 = next_col + 1
    if A[row][column] == A[row][next_col] and A[row][next_col] == A[row][next_col2]:
        A[row][column] = 0
    column =+ 1
print(A)
Don't use if-else; it gets messy easily. Here's an approach without it:
1. Iterate over each row and find the unique elements and their counts.
2. If an element occurs three or more times, filter it into an array.
3. Iterate over each filtered element (val).
4. Find the indices of val in the given row.
5. Do a groupby on the indices from step 4 to find blocks of contiguous indices.
6. Check whether a block of contiguous indices has three or more members.
7. If yes, do the replacement.
The following sample code is scalable and handles multiple contiguous runs per row.
from functools import partial
from itertools import groupby
from operator import itemgetter

import numpy as np

A = np.array([[3, 3, 5, 3, 3, 3, 5, 5, 5, 6, 6, 5, 5, 5],
              [1, 8, 8, 4, 7, 4, 7, 7, 7, 7, 1, 2, 3, 9],
              [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5],
              [1, 2, 3, 3, 3, 3, 3, 2, 1, 1, 1, 2, 2, 2],
              [1, 2, 1, 3, 4, 4, 9, 8, 8, 8, 8, 9, 9, 8]])

def func1d(row, replacement):
    # find and filter elements which occur three or more times
    vals, count = np.unique(row, return_counts=True)
    vals = vals[count >= 3]
    # iterate over each filtered element (val)
    for val in vals:
        # get indices of val in the row
        indices = (row == val).nonzero()[0]
        # group contiguous indices: index minus position is constant within a run
        for k, g in groupby(enumerate(indices), lambda t: t[1] - t[0]):
            l = list(map(itemgetter(1), g))
            # if a group has three or more indices, do the replacement
            if len(l) >= 3:
                row[l] = replacement
    return row

wrapper = partial(func1d, replacement=0)
np.apply_along_axis(wrapper, 1, A)
Output, when compared with A:
# original array
[[3 3 5 3 3 3 5 5 5 6 6 5 5 5]
 [1 8 8 4 7 4 7 7 7 7 1 2 3 9]
 [1 1 1 2 2 2 3 3 3 4 4 4 4 5]
 [1 2 3 3 3 3 3 2 1 1 1 2 2 2]
 [1 2 1 3 4 4 9 8 8 8 8 9 9 8]]

# array with replaced values
[[3 3 5 0 0 0 0 0 0 6 6 0 0 0]
 [1 8 8 4 7 4 0 0 0 0 1 2 3 9]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 5]
 [1 2 0 0 0 0 0 2 0 0 0 0 0 0]
 [1 2 1 3 4 4 9 0 0 0 0 9 9 8]]
Your loop will be infinite because `column =+ 1` parses as `column = +1`, so column is set to 1 on every pass and never advances past column_nmbr (the increment you want is `column += 1`).
Do it right like this:
for i in range(row_nmbr):
    m, k = np.unique(A[i], return_inverse=True)
    # values that occur more than twice in this row
    val = m[np.bincount(k) > 2]
    if len(val) > 0:
        aaa = A[i]
        # note: this zeroes every occurrence of such a value in the row,
        # not only the consecutive runs
        aaa[A[i] == val] = 0
print(A)
Output:
[[1 2 0 0 0 4]
 [1 3 2 4 2 4]
 [1 2 0 2 0 0]
 [1 2 3 0 0 0]
 [1 2 1 3 4 4]]
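If only consecutive runs should be zeroed (rather than every occurrence of a frequent value), here is a minimal vectorized sketch of one way to do it; the function name and min_len parameter are illustrative:

import numpy as np

def zero_runs(A, min_len=3):
    # Zero out runs of min_len or more equal consecutive values in each row.
    A = A.copy()
    for row in A:
        # True where a new run of equal values starts
        starts = np.r_[True, row[1:] != row[:-1]]
        run_id = np.cumsum(starts)             # 1, 2, 3, ... per run
        run_len = np.bincount(run_id)[run_id]  # length of the run each element is in
        row[run_len >= min_len] = 0
    return A

A = np.array([[1, 2, 3, 3, 3, 4],
              [1, 3, 2, 4, 2, 4]])
print(zero_runs(A))
# [[1 2 0 0 0 4]
#  [1 3 2 4 2 4]]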
I have a Pandas dataframe:
np.random.seed(42)
df = pd.DataFrame(np.random.randint(0, 5, size=(10, 1)), columns=['col0'])
i.e.
col0
0 3
1 4
2 2
3 4
4 4
5 1
6 2
7 2
8 2
9 4
I would like to get a column indicating, in each row, the indices of all the rows with the same value as the given row. I do:
df = df.assign(sameas = df.col0.apply(lambda val: [i for i, e in enumerate(df.col0) if e==val]))
I get:
col0 sameas
0 3 [0]
1 4 [1, 3, 4, 9]
2 2 [2, 6, 7, 8]
3 4 [1, 3, 4, 9]
4 4 [1, 3, 4, 9]
5 1 [5]
6 2 [2, 6, 7, 8]
7 2 [2, 6, 7, 8]
8 2 [2, 6, 7, 8]
9 4 [1, 3, 4, 9]
Which is the expected result. In my real-world application the df is much bigger, and this method does not complete in the required time.
I think the runtime scales with the square of the number of rows, which is bad. How can I do the above computation faster?
You can try to groupby col0 and convert each group's index to a list:
df['sameas'] = df['col0'].map(df.reset_index().groupby('col0')['index'].apply(list))
print(df)
col0 sameas
0 3 [0]
1 4 [1, 3, 4, 9]
2 2 [2, 6, 7, 8]
3 4 [1, 3, 4, 9]
4 4 [1, 3, 4, 9]
5 1 [5]
6 2 [2, 6, 7, 8]
7 2 [2, 6, 7, 8]
8 2 [2, 6, 7, 8]
9 4 [1, 3, 4, 9]
You can just do groupby with transform:
df['new'] = df.reset_index().groupby('col0')['index'].transform(lambda x: [x.tolist()] * len(x)).values
Out[146]:
0 [0]
1 [1, 3, 4, 9]
2 [2, 6, 7, 8]
3 [1, 3, 4, 9]
4 [1, 3, 4, 9]
5 [5]
6 [2, 6, 7, 8]
7 [2, 6, 7, 8]
8 [2, 6, 7, 8]
9 [1, 3, 4, 9]
Name: index, dtype: object
Try:
import pandas as pd
import numpy as np
np.random.seed(42)
df = pd.DataFrame(np.random.randint(0, 5, size=(10, 1)), columns=['col0'])
df
'''
col0
0 3
1 4
2 2
3 4
4 4
5 1
6 2
7 2
8 2
9 4
'''
Get a Series to use as a mapping:
ser = df.groupby('col0').apply(lambda x: x.index.to_list())
ser
col0
1 [5]
2 [2, 6, 7, 8]
3 [0]
4 [1, 3, 4, 9]
dtype: object
Use the mapping:
df.assign(col1=df.col0.map(ser))
'''
col0 col1
0 3 [0]
1 4 [1, 3, 4, 9]
2 2 [2, 6, 7, 8]
3 4 [1, 3, 4, 9]
4 4 [1, 3, 4, 9]
5 1 [5]
6 2 [2, 6, 7, 8]
7 2 [2, 6, 7, 8]
8 2 [2, 6, 7, 8]
9 4 [1, 3, 4, 9]
'''
One-liner method:
df['col1'] = [df[df.col0.values == i].index.tolist() for i in df.col0.values]
df
Output:
   col0          col1
0     3           [0]
1     4  [1, 3, 4, 9]
2     2  [2, 6, 7, 8]
3     4  [1, 3, 4, 9]
4     4  [1, 3, 4, 9]
5     1           [5]
6     2  [2, 6, 7, 8]
7     2  [2, 6, 7, 8]
8     2  [2, 6, 7, 8]
9     4  [1, 3, 4, 9]
I have a file with various columns, say:
1 2 3 4 5 6
2 4 5 6 7 4
3 4 5 6 7 6
2 0 1 5 6 0
2 4 6 8 9 9
I would like to select the rows whose value in the second column is in the range [0, 2], and save them (all columns) to a new file.
The new file should then contain
1 2 3 4 5 6
2 0 1 5 6 0
Kindly assist me. I prefer doing this with numpy in Python.
For array a, you can use:
a[(a[:,1] <= 2) & (a[:,1] >= 0)]
Here, the condition filters the values in your second column.
For your example:
>>> a
array([[1, 2, 3, 4, 5, 6],
       [2, 4, 5, 6, 7, 4],
       [3, 4, 5, 6, 7, 6],
       [2, 0, 1, 5, 6, 0],
       [2, 4, 6, 8, 9, 9]])
>>> a[(a[:,1] <= 2) & (a[:,1] >= 0)]
array([[1, 2, 3, 4, 5, 6],
       [2, 0, 1, 5, 6, 0]])
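Since the data lives in a text file, the same mask can be combined with np.loadtxt and np.savetxt; a minimal sketch, where "data.txt" and "filtered.txt" are placeholder file names:

import numpy as np

a = np.loadtxt("data.txt", dtype=int)          # read the whitespace-separated columns
filtered = a[(a[:, 1] >= 0) & (a[:, 1] <= 2)]  # keep rows with column two in [0, 2]
np.savetxt("filtered.txt", filtered, fmt="%d")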
If you have an x*n matrix, how do you check whether a row contains a certain number and, if so, how do you delete that row?
If you are using pandas, you can create a mask that you can use to index the dataframe, negating the mask with ~:
df = pd.DataFrame(np.arange(12).reshape(3, 4))
# 0 1 2 3
# 0 0 1 2 3
# 1 4 5 6 7
# 2 8 9 10 11
value = 2
If you want to check if the value is contained in a specific column:
df[~(df[2] == value)]
# 0 1 2 3
# 1 4 5 6 7
# 2 8 9 10 11
Or if it can be contained in any column:
df[~(df == value).any(axis=1)]
# 0 1 2 3
# 1 4 5 6 7
# 2 8 9 10 11
Just reassign it to df afterwards.
This also works if you are using just numpy:
x = np.arange(12).reshape(3, 4)
# array([[ 0,  1,  2,  3],
#        [ 4,  5,  6,  7],
#        [ 8,  9, 10, 11]])
x[~(x == value).any(axis=1)]
# array([[ 4,  5,  6,  7],
#        [ 8,  9, 10, 11]])
And finally, if you are using plain Python and have a list of lists, use the built-in any in a list comprehension:
y = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
[row for row in y if not any(x == value for x in row)]
# [[4, 5, 6, 7], [8, 9, 10, 11]]
I am trying to use the function as_strided from numpy.lib.stride_tricks to extract sub-series from a larger 2D array, but I struggled to find the right thing to write for the strides argument.
Let's say I have a matrix m which contains 5 1D arrays of length (a=)10. I want to extract sub 1D arrays of length (b=)4 from each 1D array in m.
import numpy
from numpy.lib.stride_tricks import as_strided
a, b = 10, 4
m = numpy.array([list(range(i, i + a)) for i in range(5)])

# first try
sub_m = as_strided(m, shape=(m.shape[0], m.shape[1] - b + 1, b))
print(sub_m.shape)        # (5, 7, 4), which is what I expected
print(sub_m[-1, -1, -1])  # some unexpected strange number: 8227625857902995061

# second try, with the strides argument
sub_m = as_strided(m, shape=(m.shape[0], m.shape[1] - b + 1, b), strides=(m.itemize, m.itemize, m.itemize))
# gives the error below
AttributeError: 'numpy.ndarray' object has no attribute 'itemize'
As you can see, I succeeded in getting the right shape for sub_m in my first try. However, I can't find what to write in strides=().
For information:
m = [[ 0  1  2  3  4  5  6  7  8  9]
     [ 1  2  3  4  5  6  7  8  9 10]
     [ 2  3  4  5  6  7  8  9 10 11]
     [ 3  4  5  6  7  8  9 10 11 12]
     [ 4  5  6  7  8  9 10 11 12 13]]
Expected output:
sub_m = [
[[0 1 2 3] [1 2 3 4] ... [5 6 7 8] [6 7 8 9]]
[[1 2 3 4] [2 3 4 5] ... [6 7 8 9] [7 8 9 10]]
[[2 3 4 5] [3 4 5 6] ... [7 8 9 10] [8 9 10 11]]
[[3 4 5 6] [4 5 6 7] ... [8 9 10 11] [9 10 11 12]]
[[4 5 6 7] [5 6 7 8] ... [9 10 11 12] [10 11 12 13]]
]
Edit: I have much more data; that's the reason why I want to use as_strided (efficiency).
Here's one approach with np.lib.stride_tricks.as_strided:
import numpy as np

def strided_lastaxis(a, L):
    s0, s1 = a.strides
    m, n = a.shape
    return np.lib.stride_tricks.as_strided(a, shape=(m, n - L + 1, L), strides=(s0, s1, s1))
A bit of explanation on the strides for as_strided:
We have 3D strides. The last/third axis increments by one element at a time, so its stride is s1. The second axis advances by the same one-element "distance", so it gets s1 too. For the first axis the striding is the same as the array's first-axis stride, since we move to the next row, so it gets s0.
Sample run:
In [46]: a
Out[46]:
array([[0, 5, 6, 2, 3, 6, 7, 1, 4, 8],
       [2, 1, 3, 7, 0, 3, 5, 4, 0, 1]])

In [47]: strided_lastaxis(a, L=4)
Out[47]:
array([[[0, 5, 6, 2],
        [5, 6, 2, 3],
        [6, 2, 3, 6],
        [2, 3, 6, 7],
        [3, 6, 7, 1],
        [6, 7, 1, 4],
        [7, 1, 4, 8]],

       [[2, 1, 3, 7],
        [1, 3, 7, 0],
        [3, 7, 0, 3],
        [7, 0, 3, 5],
        [0, 3, 5, 4],
        [3, 5, 4, 0],
        [5, 4, 0, 1]]])
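As an aside, on NumPy 1.20+ the same windows can be built without computing strides by hand, using sliding_window_view (a minimal sketch; note it returns a read-only view by default):

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

a = np.array([[0, 5, 6, 2, 3, 6, 7, 1, 4, 8],
              [2, 1, 3, 7, 0, 3, 5, 4, 0, 1]])
windows = sliding_window_view(a, window_shape=4, axis=1)
print(windows.shape)  # (2, 7, 4), the same windows as strided_lastaxis(a, L=4)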