Using numpy to select rows based on a condition of one column - python

I have a file with various columns. Say
1 2 3 4 5 6
2 4 5 6 7 4
3 4 5 6 7 6
2 0 1 5 6 0
2 4 6 8 9 9
I would like to select the rows whose value in column two is in the range [0, 2] and save them to a new file.
The answer in the new file should be
1 2 3 4 5 6
2 0 1 5 6 0
Kindly assist me. I prefer doing this with numpy in python.

For array a, you can use:
a[(a[:,1] <= 2) & (a[:,1] >= 0)]
Here, the condition filters the values in your second column.
For your example:
>>> a
array([[1, 2, 3, 4, 5, 6],
       [2, 4, 5, 6, 7, 4],
       [3, 4, 5, 6, 7, 6],
       [2, 0, 1, 5, 6, 0],
       [2, 4, 6, 8, 9, 9]])
>>> a[(a[:,1] <= 2) & (a[:,1] >= 0)]
array([[1, 2, 3, 4, 5, 6],
       [2, 0, 1, 5, 6, 0]])
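Since the question also asks to read the table from a file and save the selected rows to a new file, here is a minimal sketch using np.loadtxt and np.savetxt (the file names are placeholders, assuming a whitespace-delimited file of integers):
import numpy as np

a = np.loadtxt("data.txt", dtype=int)          # hypothetical input file holding the table above
mask = (a[:, 1] >= 0) & (a[:, 1] <= 2)         # rows whose second column lies in [0, 2]
np.savetxt("filtered.txt", a[mask], fmt="%d")  # write the selected rows to a new file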

Related

Get a column of lists of indices showing which rows are equal in a Pandas DataFrame

I have a Pandas dataframe:
np.random.seed(42)
df = pd.DataFrame(np.random.randint(0, 5, size=(10, 1)), columns=['col0'])
i.e.
col0
0 3
1 4
2 2
3 4
4 4
5 1
6 2
7 2
8 2
9 4
I would like to add a column that, for each row, lists the indices of all the rows with the same value as that row. I do:
df = df.assign(sameas = df.col0.apply(lambda val: [i for i, e in enumerate(df.col0) if e==val]))
I get:
col0 sameas
0 3 [0]
1 4 [1, 3, 4, 9]
2 2 [2, 6, 7, 8]
3 4 [1, 3, 4, 9]
4 4 [1, 3, 4, 9]
5 1 [5]
6 2 [2, 6, 7, 8]
7 2 [2, 6, 7, 8]
8 2 [2, 6, 7, 8]
9 4 [1, 3, 4, 9]
This is the expected result. In my real-world application the DataFrame is much bigger, and this method does not complete in the required time.
I think the runtime scales with the square of the number of rows, which is bad. How can I do this computation faster?
You can group by col0, convert each group's index to a list, and map it back onto the column:
df['sameas'] = df['col0'].map(df.reset_index().groupby('col0')['index'].apply(list))
print(df)
col0 sameas
0 3 [0]
1 4 [1, 3, 4, 9]
2 2 [2, 6, 7, 8]
3 4 [1, 3, 4, 9]
4 4 [1, 3, 4, 9]
5 1 [5]
6 2 [2, 6, 7, 8]
7 2 [2, 6, 7, 8]
8 2 [2, 6, 7, 8]
9 4 [1, 3, 4, 9]
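An equivalent sketch (not from the original answer) builds the mapping once via GroupBy.groups, which already maps each col0 value to the Index of the rows holding it, and avoids the reset_index:
# mapping: {col0 value -> list of row indices with that value}
mapping = {value: list(idx) for value, idx in df.groupby('col0').groups.items()}
df['sameas'] = df['col0'].map(mapping)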
You can also do groupby with transform; returning [x.tolist()] * len(x) from the lambda hands every row of the group the full list of indices:
df['new'] = df.reset_index().groupby('col0')['index'].transform(lambda x : [x.tolist()]*len(x)).values
Out[146]:
0 [0]
1 [1, 3, 4, 9]
2 [2, 6, 7, 8]
3 [1, 3, 4, 9]
4 [1, 3, 4, 9]
5 [5]
6 [2, 6, 7, 8]
7 [2, 6, 7, 8]
8 [2, 6, 7, 8]
9 [1, 3, 4, 9]
Name: index, dtype: object
try:
import pandas as pd
import numpy as np
np.random.seed(42)
df = pd.DataFrame(np.random.randint(0, 5, size=(10, 1)), columns=['col0'])
df
'''
col0
0 3
1 4
2 2
3 4
4 4
5 1
6 2
7 2
8 2
9 4
'''
Get a Series to use as a mapping:
ser = df.groupby('col0').apply(lambda x: x.index.to_list())
ser
col0
1 [5]
2 [2, 6, 7, 8]
3 [0]
4 [1, 3, 4, 9]
dtype: object
Use the mapping:
df.assign(col1=df.col0.map(ser))
'''
col0 col1
0 3 [0]
1 4 [1, 3, 4, 9]
2 2 [2, 6, 7, 8]
3 4 [1, 3, 4, 9]
4 4 [1, 3, 4, 9]
5 1 [5]
6 2 [2, 6, 7, 8]
7 2 [2, 6, 7, 8]
8 2 [2, 6, 7, 8]
9 4 [1, 3, 4, 9]
'''
One-liner method:
df['col1'] = [df[df.col0.values == i].index.tolist() for i in df.col0.values]
df
Output:
   col0          col1
0     3           [0]
1     4  [1, 3, 4, 9]
2     2  [2, 6, 7, 8]
3     4  [1, 3, 4, 9]
4     4  [1, 3, 4, 9]
5     1           [5]
6     2  [2, 6, 7, 8]
7     2  [2, 6, 7, 8]
8     2  [2, 6, 7, 8]
9     4  [1, 3, 4, 9]

Python dataframe repeat column data in each cell as a list

I am trying to repeat the whole data of a column in each cell of that column.
My code:
df3 = pd.DataFrame({
    'x': [1, 2, 3, 4, 5],
    'y': [10, 20, 30, 20, 10],
    'z': [5, 4, 3, 2, 1]
})
df3 =
x y z
0 1 10 5
1 2 20 4
2 3 30 3
3 4 20 2
4 5 10 1
df3['z'] = df['z'].agg(lambda x: list(x))
Present output:
KeyError: 'z'
Expected output:
df=
x y z
0 1 10 [5, 4, 3, 2, 1]
1 2 20 [5, 4, 3, 2, 1]
2 3 30 [5, 4, 3, 2, 1]
3 4 20 [5, 4, 3, 2, 1]
4 5 10 [5, 4, 3, 2, 1]
Another way is to use list(df.column.values):
df3.assign(z=[list(df3.z.values)]*len(df3))
   x   y                z
0  1  10  [5, 4, 3, 2, 1]
1  2  20  [5, 4, 3, 2, 1]
2  3  30  [5, 4, 3, 2, 1]
3  4  20  [5, 4, 3, 2, 1]
4  5  10  [5, 4, 3, 2, 1]
Check with:
df3['new_z'] = [df3.z.tolist()] * len(df3)
Safer, because [lst] * n repeats references to the same list object, while a comprehension builds a fresh list per row:
df3['new_z'] = [df3.z.tolist() for x in df3.index]
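A quick sketch (an addition, not part of the original answer) of why the per-row copies matter; this is plain Python list semantics rather than anything pandas-specific:
shared = [df3.z.tolist()] * len(df3)          # every cell references the same list object
shared[0].append(99)                          # mutating "one" cell...
print(shared[1])                              # [5, 4, 3, 2, 1, 99]  ...changes them all

copies = [df3.z.tolist() for _ in df3.index]  # an independent list per row
copies[0].append(99)
print(copies[1])                              # [5, 4, 3, 2, 1]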

Numpy 2D array shuffle elements between rows and not columns

I have the following two-dimensional array:
seq_length = 5
x = np.array([[0, 2, 0, 4], [5,6,7,8]])
x_repeated = np.repeat(x, seq_length, axis=1)
[[0 0 0 0 0 2 2 2 2 2 0 0 0 0 0 4 4 4 4 4]
[5 5 5 5 5 6 6 6 6 6 7 7 7 7 7 8 8 8 8 8]]
I want to shuffle x_repeated according to seq_length so that all items of a sequence are shuffled together.
For example, a possible shuffle:
[[0 0 0 0 0 6 6 6 6 6 0 0 0 0 0 8 8 8 8 8]
[5 5 5 5 5 2 2 2 2 2 7 7 7 7 7 4 4 4 4 4]]
Thanks
You can do something like this:
import numpy as np
seq_length = 5
x = np.array([[0, 2, 0, 4], [5, 6, 7, 8]])
swaps = np.random.choice([False, True], size=4)
for swap_index, swap in enumerate(swaps):
    if swap:
        x[[0, 1], swap_index] = x[[1, 0], swap_index]
x_repeated = np.repeat(x, seq_length, axis=1)
You can also rely on the fact that True is non-zero, and replace the for with:
for swap_index in swaps.nonzero()[0]:
    x[[0, 1], swap_index] = x[[1, 0], swap_index]
The key is that the shuffling/swapping happens before the np.repeat call, which is much more efficient than doing it afterwards (while still meeting your requirement that whole sequences of values be swapped together). Each pair of same-valued sequences has a 50% chance of being swapped.
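If you prefer to avoid the Python loop, the same column-wise swap can be vectorized with a boolean mask (a sketch, not part of the original answer):
import numpy as np

x = np.array([[0, 2, 0, 4], [5, 6, 7, 8]])
swap = np.random.random(x.shape[1]) < 0.5  # one 50/50 draw per column
x[:, swap] = x[::-1, swap]                 # swap the two rows in the selected columns
x_repeated = np.repeat(x, 5, axis=1)       # repeat each value seq_length times afterwards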
Managed to solve it the following way:
items_count = x.shape[-1]
# single_item_length here is the sequence length (5 in this example)
swap_flags = np.repeat(np.random.choice([0, 1], size=items_count), single_item_length)
gives:
[1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 1 1 1 1 1]
for idx, flag in enumerate(swap_flags):
    if flag:
        x_repeated[0, idx], x_repeated[1, idx] = x_repeated[1, idx], x_repeated[0, idx]
Result:
[[5 5 5 5 5 6 6 6 6 6 0 0 0 0 0 8 8 8 8 8]
[0 0 0 0 0 2 2 2 2 2 7 7 7 7 7 4 4 4 4 4]]
Still not a very elegant numpy way.
Here's my attempt:
def randomize_blocks(arr):
    """Shuffles an n-dimensional array given consecutive blocks of numbers."""
    groups = (np.diff(arr.ravel(), prepend=0) != 0).cumsum().reshape(arr.shape)
    u, c = np.unique(groups, return_counts=True)
    np.random.shuffle(u)
    o = np.argsort(u)
    return arr.ravel()[np.argsort(np.repeat(u, c[o]))].reshape(arr.shape)
Breakdown
First we get the groups
groups = (np.diff(arr.ravel(), prepend=0) != 0).cumsum().reshape(arr.shape)
array([[0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3],
       [4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7]])
Then, we get the unique groups and the count for each group (note that the order shown below is after the shuffle; np.unique itself returns them sorted):
u, c = np.unique(groups, return_counts=True)
>>> print(u, c)
(array([6, 0, 3, 5, 2, 4, 7, 1]),
array([5, 5, 5, 5, 5, 5, 5, 5]))
Finally, we shuffle our unique groups, reconstruct the array and use argsort to re-order the shuffled unique groups.
o = np.argsort(u)
arr.ravel()[np.argsort(np.repeat(u, c[o]))].reshape(arr.shape)
Example usage:
>>> randomize_blocks(arr)
array([[0, 0, 0, 0, 0, 4, 4, 4, 4, 4, 0, 0, 0, 0, 0, 5, 5, 5, 5, 5],
[7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 2, 2, 2, 2, 2, 6, 6, 6, 6, 6]])
>>> randomize_blocks(arr)
array([[6, 6, 6, 6, 6, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 2, 2, 2, 2, 2]])
Here is a solution that works completely in place and does not require allocating and generating random indices:
import numpy as np
def row_block_shuffle(a: np.ndarray, seq_len: int):
    cols = a.shape[1]
    rng = np.random.default_rng()
    for block in a.T.reshape(cols // seq_len, seq_len, -1).transpose(0, 2, 1):
        rng.shuffle(block)
if __name__ == "__main__":
    seq_length = 5
    x = np.array([[0, 2, 0, 4], [5, 6, 7, 8]])
    x_repeated = np.repeat(x, seq_length, axis=1)
    row_block_shuffle(x_repeated, seq_length)
    print(x_repeated)
Output:
[[5 5 5 5 5 6 6 6 6 6 7 7 7 7 7 8 8 8 8 8]
[0 0 0 0 0 2 2 2 2 2 0 0 0 0 0 4 4 4 4 4]]
What I do is create "blocks" that share memory with the original array:
>>> x_repeated.T.reshape(cols // seq_len, seq_len, -1).transpose(0, 2, 1)
[[[0 0 0 0 0]
  [5 5 5 5 5]]

 [[2 2 2 2 2]
  [6 6 6 6 6]]

 [[0 0 0 0 0]
  [7 7 7 7 7]]

 [[4 4 4 4 4]
  [8 8 8 8 8]]]
Then I shuffle each "block", which in turn shuffles the original array as well. I believe this is the most efficient solution for large arrays, since it is as in-place as it can be. This answer at least backs up my hypothesis:
https://stackoverflow.com/a/5044364/13091658
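One way to check the shared-memory claim directly is np.shares_memory (a small verification sketch, assuming the same x_repeated as above):
import numpy as np

x_repeated = np.repeat(np.array([[0, 2, 0, 4], [5, 6, 7, 8]]), 5, axis=1)
cols, seq_len = x_repeated.shape[1], 5
blocks = x_repeated.T.reshape(cols // seq_len, seq_len, -1).transpose(0, 2, 1)
print(np.shares_memory(blocks, x_repeated))  # True: shuffling the blocks mutates x_repeated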
Also: the general problem you are facing is sorting "sliding window views" of your array, so if you would like to sort "windows" within your array that move both horizontally and vertically, you can for example see my previous answers to sliding-window problems here:
https://stackoverflow.com/a/67416335/13091658
https://stackoverflow.com/a/69924828/13091658
import numpy as np
m = np.array([[0, 2, 0, 4], [5, 6, 7, 8]])
def np_shuffle(m, m_rows=len(m), m_cols=len(m[0]), n_duplicate=5):
    # Flatten the numpy matrix
    m = m.flatten()
    # Randomize the flattened matrix m
    np.random.shuffle(m)
    # Duplicate elements
    m = np.repeat(m, n_duplicate, axis=0)
    # Return the reshaped numpy array
    return np.reshape(m, (m_rows, n_duplicate * m_cols))
r = np_shuffle(m)
print(r)
# [[8 8 8 8 8 5 5 5 5 5 2 2 2 2 2 0 0 0 0 0]
# [0 0 0 0 0 7 7 7 7 7 4 4 4 4 4 6 6 6 6 6]]

reshaping 2-d array using specific block shape [duplicate]

This question already has answers here:
Flatten or group array in blocks of columns - NumPy / Python
(6 answers)
Closed 3 years ago.
I've got a problem with reshaping a simple 2-D array into another.
Let's assume the matrix:
[[4 1 2 1 2 4 1 2 4]
[2 3 0 3 0 2 3 0 2]
[5 5 1 5 1 5 5 1 5]
[6 6 6 6 6 6 6 6 6]]
What I want to do is reshape it to a (12, 3) matrix, taking it in (4, 3) blocks. What I mean is to get a matrix like:
[[4 1 2
  2 3 0
  5 5 1
  6 6 6

  1 2 4
  3 0 2
  5 1 5
  6 6 6

  1 2 4
  3 0 2
  5 1 5
  6 6 6]]
I have highlighted the "edge" of cutting this matrix with an additional newline.
I've tried numpy reshape (with every available value of the order parameter), but I still get an array with "mixed" values.
You can always do this manually for custom reshapes:
import numpy as np
data = [[4, 1, 2, 1, 2, 4, 1, 2, 4],
        [2, 3, 0, 3, 0, 2, 3, 0, 2],
        [5, 5, 1, 5, 1, 5, 5, 1, 5],
        [6, 6, 6, 6, 6, 6, 6, 6, 6]]
X = np.array(data)
Z = np.r_[X[:, 0:3], X[:, 3:6], X[:, 6:9]]
print(Z)
yields
array([[4, 1, 2],
       [2, 3, 0],
       [5, 5, 1],
       [6, 6, 6],
       [1, 2, 4],
       [3, 0, 2],
       [5, 1, 5],
       [6, 6, 6],
       [1, 2, 4],
       [3, 0, 2],
       [5, 1, 5],
       [6, 6, 6]])
Note the np.r_ helper, which concatenates arrays along the first axis (rows); it is essentially a convenient shorthand for np.concatenate along that axis.
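For wider inputs where writing out the slices by hand gets tedious, a reshape/transpose sketch (an addition, assuming the number of columns is an exact multiple of the block width) produces the same stacking:
import numpy as np

X = np.array([[4, 1, 2, 1, 2, 4, 1, 2, 4],
              [2, 3, 0, 3, 0, 2, 3, 0, 2],
              [5, 5, 1, 5, 1, 5, 5, 1, 5],
              [6, 6, 6, 6, 6, 6, 6, 6, 6]])

block_width = 3
# Split each row into chunks of block_width, move the chunk axis to the front,
# then merge the first two axes so the chunks are stacked vertically.
Z = X.reshape(X.shape[0], -1, block_width).transpose(1, 0, 2).reshape(-1, block_width)
print(Z)  # same (12, 3) result as the manual np.r_ version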

Filter a 2D numpy array from an array of values

Let's say I have a numpy array with the following shape :
nonSortedNonFiltered=np.array([[9,8,5,4,6,7,1,2,3],[1,3,2,6,4,5,7,9,8]])
I want to :
- Sort the array according to nonSortedNonFiltered[1]
- Filter the array according to nonSortedNonFiltered[0] and an array of values
I currently do the sorting with :
sortedNonFiltered=nonSortedNonFiltered[:,nonSortedNonFiltered[1].argsort()]
Which gives: np.array([[9, 5, 8, 6, 7, 4, 1, 3, 2], [1, 2, 3, 4, 5, 6, 7, 8, 9]])
Now I want to filter sortedNonFiltered from an array of values, for example :
sortedNonFiltered = np.array([[9, 5, 8, 6, 7, 4, 1, 3, 2], [1, 2, 3, 4, 5, 6, 7, 8, 9]])
listOfValues = np.array([8, 6, 5, 2, 1])
...Something here...
> np.array([[5, 8, 6, 1, 2], [2, 3, 4, 7, 9]])  # What I want to get in the end
Note: each value in a column of my 2D array is unique.
You can use np.in1d to get a boolean mask and use it to filter columns of the sorted array, something like this:
output = sortedNonFiltered[:,np.in1d(sortedNonFiltered[0],listOfValues)]
Sample run -
In [76]: nonSortedNonFiltered
Out[76]:
array([[9, 8, 5, 4, 6, 7, 1, 2, 3],
       [1, 3, 2, 6, 4, 5, 7, 9, 8]])
In [77]: sortedNonFiltered
Out[77]:
array([[9, 5, 8, 6, 7, 4, 1, 3, 2],
       [1, 2, 3, 4, 5, 6, 7, 8, 9]])
In [78]: listOfValues
Out[78]: array([8, 6, 5, 2, 1])
In [79]: sortedNonFiltered[:,np.in1d(sortedNonFiltered[0],listOfValues)]
Out[79]:
array([[5, 8, 6, 1, 2],
       [2, 3, 4, 7, 9]])
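A small side note: on recent NumPy versions, np.isin is the documented successor to np.in1d, so the same mask can be written as (assuming your NumPy version provides np.isin):
output = sortedNonFiltered[:, np.isin(sortedNonFiltered[0], listOfValues)]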
