I have a Pandas dataframe:
np.random.seed(42)
df = pd.DataFrame(np.random.randint(0, 5, size=(10, 1)), columns=['col0'])
i.e.
col0
0 3
1 4
2 2
3 4
4 4
5 1
6 2
7 2
8 2
9 4
I would like to get a column indicating, in each row, the indices of all the rows with the same value as the given row. I do:
df = df.assign(sameas = df.col0.apply(lambda val: [i for i, e in enumerate(df.col0) if e==val]))
I get:
col0 sameas
0 3 [0]
1 4 [1, 3, 4, 9]
2 2 [2, 6, 7, 8]
3 4 [1, 3, 4, 9]
4 4 [1, 3, 4, 9]
5 1 [5]
6 2 [2, 6, 7, 8]
7 2 [2, 6, 7, 8]
8 2 [2, 6, 7, 8]
9 4 [1, 3, 4, 9]
Which is the expected result. In my real-world application the df is much bigger, and this method does not complete in the required time.
I think the runtime scales with the square of the number of rows, which is bad. How can I do the above computation faster?
You can try to group by col0 and map each value to its group's index, converted to a list:
df['sameas'] = df['col0'].map(df.reset_index().groupby('col0')['index'].apply(list))
print(df)
col0 sameas
0 3 [0]
1 4 [1, 3, 4, 9]
2 2 [2, 6, 7, 8]
3 4 [1, 3, 4, 9]
4 4 [1, 3, 4, 9]
5 1 [5]
6 2 [2, 6, 7, 8]
7 2 [2, 6, 7, 8]
8 2 [2, 6, 7, 8]
9 4 [1, 3, 4, 9]
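For a rough sense of the scaling difference, here is a minimal timing sketch; the 2,000-row size and the timeit usage are illustrative assumptions, not from the question:
import timeit

import numpy as np
import pandas as pd

np.random.seed(42)
big = pd.DataFrame(np.random.randint(0, 5, size=(2_000, 1)), columns=['col0'])

def quadratic():
    # One full scan of the column per row: O(n^2)
    return big.col0.apply(lambda val: [i for i, e in enumerate(big.col0) if e == val])

def grouped():
    # One pass to build the groups, then a dict-like lookup per row: roughly O(n)
    return big['col0'].map(big.reset_index().groupby('col0')['index'].apply(list))

print('groupby/map:', timeit.timeit(grouped, number=1))
print('apply scan :', timeit.timeit(quadratic, number=1))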
You can just do groupby with transform:
df['new'] = df.reset_index().groupby('col0')['index'].transform(lambda x : [x.tolist()]*len(x)).values
Out[146]:
0 [0]
1 [1, 3, 4, 9]
2 [2, 6, 7, 8]
3 [1, 3, 4, 9]
4 [1, 3, 4, 9]
5 [5]
6 [2, 6, 7, 8]
7 [2, 6, 7, 8]
8 [2, 6, 7, 8]
9 [1, 3, 4, 9]
Name: index, dtype: object
Try:
import pandas as pd
import numpy as np
np.random.seed(42)
df = pd.DataFrame(np.random.randint(0, 5, size=(10, 1)), columns=['col0'])
df
'''
col0
0 3
1 4
2 2
3 4
4 4
5 1
6 2
7 2
8 2
9 4
'''
Get a Series to use as a mapping:
ser = df.groupby('col0').apply(lambda x: x.index.to_list())
ser
col0
1 [5]
2 [2, 6, 7, 8]
3 [0]
4 [1, 3, 4, 9]
dtype: object
Use the mapping:
df.assign(col1=df.col0.map(ser))
'''
col0 col1
0 3 [0]
1 4 [1, 3, 4, 9]
2 2 [2, 6, 7, 8]
3 4 [1, 3, 4, 9]
4 4 [1, 3, 4, 9]
5 1 [5]
6 2 [2, 6, 7, 8]
7 2 [2, 6, 7, 8]
8 2 [2, 6, 7, 8]
9 4 [1, 3, 4, 9]
'''
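As a hedged alternative sketch, GroupBy.groups already exposes the value-to-index mapping directly, so the intermediate Series can be built without apply (the dict comprehension is an illustrative choice):
# .groups maps each col0 value to the Index of the rows in that group
mapping = {k: list(v) for k, v in df.groupby('col0').groups.items()}
df.assign(col1=df.col0.map(mapping))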
One-liner method (note that this still scans the whole column once per row, so it scales quadratically like the original):
df['col1'] = [df[df.col0.values == i].index.tolist() for i in df.col0.values]
df
Output:
   col0          col1
0     3           [0]
1     4  [1, 3, 4, 9]
2     2  [2, 6, 7, 8]
3     4  [1, 3, 4, 9]
4     4  [1, 3, 4, 9]
5     1           [5]
6     2  [2, 6, 7, 8]
7     2  [2, 6, 7, 8]
8     2  [2, 6, 7, 8]
9     4  [1, 3, 4, 9]
I have a file with various columns. Say
1 2 3 4 5 6
2 4 5 6 7 4
3 4 5 6 7 6
2 0 1 5 6 0
2 4 6 8 9 9
I would like to select the rows whose value in the second column lies in the range [0, 2], and save them out to a new file.
The answer in the new file should be
1 2 3 4 5 6
2 0 1 5 6 0
Kindly assist me; I would prefer to do this with NumPy in Python.
For array a, you can use:
a[(a[:,1] <= 2) & (a[:,1] >= 0)]
Here, the condition filters the values in your second column.
For your example:
>>> a
array([[1, 2, 3, 4, 5, 6],
[2, 4, 5, 6, 7, 4],
[3, 4, 5, 6, 7, 6],
[2, 0, 1, 5, 6, 0],
[2, 4, 6, 8, 9, 9]])
>>> a[(a[:,1] <= 2) & (a[:,1] >= 0)]
array([[1, 2, 3, 4, 5, 6],
[2, 0, 1, 5, 6, 0]])
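Since the question asks to read the rows from a file and save the selection out to a new one, here is a minimal end-to-end sketch; the file names data.txt and filtered.txt and the whitespace-delimited integer format are assumptions:
import numpy as np

# Load whitespace-delimited integers from the input file (hypothetical name)
a = np.loadtxt('data.txt', dtype=int)

# Keep only the rows whose second column lies in [0, 2]
filtered = a[(a[:, 1] >= 0) & (a[:, 1] <= 2)]

# Write the selected rows to a new file, as integers
np.savetxt('filtered.txt', filtered, fmt='%d')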
I have following two dimensional array:
seq_length = 5
x = np.array([[0, 2, 0, 4], [5,6,7,8]])
x_repeated = np.repeat(x, seq_length, axis=1)
[[0 0 0 0 0 2 2 2 2 2 0 0 0 0 0 4 4 4 4 4]
[5 5 5 5 5 6 6 6 6 6 7 7 7 7 7 8 8 8 8 8]]
I want to shuffle x_repeated according to seq_length, so that all items of a sequence stay together and are shuffled as one unit.
For example, possible shuffle:
[[0 0 0 0 0 6 6 6 6 6 0 0 0 0 0 8 8 8 8 8]
[5 5 5 5 5 2 2 2 2 2 7 7 7 7 7 4 4 4 4 4]]
Thanks
You can do something like this:
import numpy as np
seq_length = 5
x = np.array([[0, 2, 0, 4], [5, 6, 7, 8]])
swaps = np.random.choice([False, True], size=4)
for swap_index, swap in enumerate(swaps):
    if swap:
        x[[0, 1], swap_index] = x[[1, 0], swap_index]
x_repeated = np.repeat(x, seq_length, axis=1)
You can also rely on the fact that True is non-zero, and replace the for loop with:
for swap_index in swaps.nonzero()[0]:
    x[[0, 1], swap_index] = x[[1, 0], swap_index]
The key is that I did the shuffling/swapping before the np.repeat call, which is much more efficient than doing it afterwards (while still meeting your requirement that whole sequences of values be swapped together). Each pair of same-position sequences has a 50% chance of being swapped.
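As a hedged aside, the per-column swap loop above can also be vectorized with boolean indexing; a sketch under the same two-row assumption (x[::-1] reverses the rows, so assigning it at the swapped columns exchanges each pair in one shot):
import numpy as np

seq_length = 5
x = np.array([[0, 2, 0, 4], [5, 6, 7, 8]])

# Decide independently for each column whether its pair of values gets swapped
swaps = np.random.choice([False, True], size=x.shape[1])
x[:, swaps] = x[::-1, swaps]

x_repeated = np.repeat(x, seq_length, axis=1)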
Managed to solve it the following way:
items_count = x.shape[-1]
swap_flags = np.repeat(np.random.choice([0, 1], size=items_count), seq_length)
gives:
[1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 1 1 1 1 1]
for idx, flag in enumerate(swap_flags):
    if flag:
        x_repeated[0, idx], x_repeated[1, idx] = x_repeated[1, idx], x_repeated[0, idx]
Result:
[[5 5 5 5 5 6 6 6 6 6 0 0 0 0 0 8 8 8 8 8]
[0 0 0 0 0 2 2 2 2 2 7 7 7 7 7 4 4 4 4 4]]
Still not a very elegant NumPy way, though.
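For reference, the flag loop can be collapsed into one vectorized assignment; a sketch reusing the names from the snippet above:
swap_flags = np.repeat(np.random.choice([0, 1], size=items_count), seq_length).astype(bool)
# Assigning the reversed-row view at the flagged columns swaps every pair at once
x_repeated[:, swap_flags] = x_repeated[::-1, swap_flags]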
Here's my attempt:
def randomize_blocks(arr):
    """Shuffle an n-dimensional array made of consecutive blocks of numbers."""
    groups = (np.diff(arr.ravel(), prepend=0) != 0).cumsum().reshape(arr.shape)
    u, c = np.unique(groups, return_counts=True)
    np.random.shuffle(u)
    o = np.argsort(u)
    return arr.ravel()[np.argsort(np.repeat(u, c[o]))].reshape(arr.shape)
Breakdown
First we get the groups:
groups = (np.diff(arr.ravel(), prepend=0) != 0).cumsum().reshape(arr.shape)
array([[0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3],
[4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7]])
Then we get the unique groups and the count of each.
u, c = np.unique(groups, return_counts=True)
>>> print(u, c)
(array([6, 0, 3, 5, 2, 4, 7, 1]),
array([5, 5, 5, 5, 5, 5, 5, 5]))
Finally, we shuffle our unique groups, then use argsort to re-order the shuffled groups and reconstruct the array.
o = np.argsort(u)
arr.ravel()[np.argsort(np.repeat(u, c[o]))].reshape(arr.shape)
Example usage:
>>> randomize_blocks(arr)
array([[0, 0, 0, 0, 0, 4, 4, 4, 4, 4, 0, 0, 0, 0, 0, 5, 5, 5, 5, 5],
[7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 2, 2, 2, 2, 2, 6, 6, 6, 6, 6]])
>>> randomize_blocks(arr)
array([[6, 6, 6, 6, 6, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 2, 2, 2, 2, 2]])
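For completeness, a minimal setup to reproduce the example usage; arr was not defined above, so building it from the question's data is an assumption:
import numpy as np

seq_length = 5
x = np.array([[0, 2, 0, 4], [5, 6, 7, 8]])
arr = np.repeat(x, seq_length, axis=1)

print(randomize_blocks(arr))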
Here is a solution that works completely in place and does not require allocating and generating random indices:
import numpy as np
def row_block_shuffle(a: np.ndarray, seq_len: int):
    cols = a.shape[1]
    rng = np.random.default_rng()
    # Each block is a (rows, seq_len) view; shuffling it mutates `a` in place
    for block in a.T.reshape(cols // seq_len, seq_len, -1).transpose(0, 2, 1):
        rng.shuffle(block)
if __name__ == "__main__":
    seq_length = 5
    x = np.array([[0, 2, 0, 4], [5, 6, 7, 8]])
    x_repeated = np.repeat(x, seq_length, axis=1)
    row_block_shuffle(x_repeated, seq_length)
    print(x_repeated)
Output:
[[5 5 5 5 5 6 6 6 6 6 7 7 7 7 7 8 8 8 8 8]
[0 0 0 0 0 2 2 2 2 2 0 0 0 0 0 4 4 4 4 4]]
What I do is create "blocks" that share memory with the original array:
>>> x_repeated.T.reshape(cols // seq_length, seq_length, -1).transpose(0, 2, 1)
[[[0 0 0 0 0]
[5 5 5 5 5]]
[[2 2 2 2 2]
[6 6 6 6 6]]
[[0 0 0 0 0]
[7 7 7 7 7]]
[[4 4 4 4 4]
[8 8 8 8 8]]]
Then I shuffle each "block", which in turn shuffles the original array as well. I believe this is the most efficient approach for large arrays, since the solution is as in-place as it can be. This answer at least backs up my hypothesis:
https://stackoverflow.com/a/5044364/13091658
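To check that the blocks really are views rather than copies, np.shares_memory can be used; a quick sketch with the names from the snippet above:
cols = x_repeated.shape[1]
blocks = x_repeated.T.reshape(cols // seq_length, seq_length, -1).transpose(0, 2, 1)
print(np.shares_memory(blocks, x_repeated))  # True: shuffling a block mutates x_repeated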
Also! The general problem you are facing is operating on "sliding window views" of your array, so if you would like to sort "windows" within your array that move both horizontally and vertically, see for example my previous answers to problems related to sliding windows:
https://stackoverflow.com/a/67416335/13091658
https://stackoverflow.com/a/69924828/13091658
import numpy as np
m = np.array([[0, 2, 0, 4], [5, 6, 7, 8]])
def np_shuffle(m, m_rows=len(m), m_cols=len(m[0]), n_duplicate=5):
    # Flatten the numpy matrix
    m = m.flatten()
    # Shuffle the flattened array in place
    np.random.shuffle(m)
    # Duplicate each element n_duplicate times
    m = np.repeat(m, n_duplicate, axis=0)
    # Return the reshaped numpy array
    return np.reshape(m, (m_rows, n_duplicate * m_cols))
r = np_shuffle(m)
print(r)
# [[8 8 8 8 8 5 5 5 5 5 2 2 2 2 2 0 0 0 0 0]
# [0 0 0 0 0 7 7 7 7 7 4 4 4 4 4 6 6 6 6 6]]
Say that I have a 2D numpy array that looks like this
x = np.array( [[ 3 , 4, 2 ,4, 7, 9, 7, 5, 2, 1, 7 ], [ 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1] ])
print(x)
>[[ 3 4 2 4 7 9 7 5 2 1 7]
> [11 10 9 8 7 6 5 4 3 2 1]]
I want to delete duplicate numbers in the first row and delete the corresponding number in the 2nd row. The occurrence that does not get deleted is the one with the biggest value in the 2nd row.
Here's what I want the output to look like.
>[[ 3 4 2 7 9 5 1 ]
> [11 10 9 7 6 4 2 ]]
All duplicates from row 1 have been deleted, as well as the corresponding value in row 2. The value that remains is always the biggest value in row 2.
If it helps, we can assume row 2 is always sorted in descending order like it is above.
Using np.unique with return_index=True. It returns the index of the first occurrence of each unique value in row 1; because row 2 is sorted in descending order, that first occurrence is exactly the one with the largest row-2 value, and np.sort(idx) restores the original column order:
_, idx = np.unique(x[0], return_index=True)
x[:, np.sort(idx)]
array([[ 3, 4, 2, 7, 9, 5, 1],
[11, 10, 9, 7, 6, 4, 2]])
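If row 2 were not guaranteed to be sorted in descending order (an assumption beyond what the question states), a hedged variant is to stably sort the columns by descending row-2 value first, so that the first occurrence np.unique finds is again the one with the largest value:
import numpy as np

# Stable sort of the columns by descending second-row value
order = np.argsort(-x[1], kind='stable')
xs = x[:, order]

# The first occurrence of each unique value now carries the largest row-2 entry
_, idx = np.unique(xs[0], return_index=True)
keep = order[idx]             # map back to original column positions
result = x[:, np.sort(keep)]  # restore the original column order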
I have a 4x4 matrix like this:
ds1=
4 13 6 9
7 12 5 7
7 0 4 22
9 8 12 0
and another file with two columns:
ds2 =
4 1
5 3
6 1
7 2
8 2
9 3
12 1
13 2
22 3
ds1 = ds1.apply(lambda x: ds2_mean[1] if [condition])
What condition should be added to compare the elements of ds1 and ds2 and check that they are equal? I want every value in matrix 1 that matches a col1 value of the 2nd matrix to be replaced by the corresponding col2 value, so the resulting matrix should look like:
1 2 1 3
2 1 3 2
2 0 1 3
3 2 1 0
Please see Replacing mean value from one dataset to another; it does not answer my question.
If you are working with numpy arrays, you could do this -
# Make a copy of ds1 to initialize output array
out = ds1.copy()
# Find out the row indices in ds2 that have intersecting elements between
# its first column and ds1
_,C = np.where(ds1.ravel()[:,None] == ds2[:,0])
# New values taken from the second column of ds2 to be put in output
newvals = ds2[C,1]
# Valid positions in output array to be changed
valid = np.in1d(ds1.ravel(),ds2[:,0])
# Finally make the changes to get desired output
out.ravel()[valid] = newvals
Sample input, output -
In [79]: ds1
Out[79]:
array([[ 4, 13, 6, 9],
[ 7, 12, 5, 7],
[ 7, 0, 4, 22],
[ 9, 8, 12, 0]])
In [80]: ds2
Out[80]:
array([[ 4, 1],
[ 5, 3],
[ 6, 1],
[ 7, 2],
[ 8, 2],
[ 9, 3],
[12, 1],
[13, 2],
[22, 3]])
In [81]: out
Out[81]:
array([[1, 2, 1, 3],
[2, 1, 3, 2],
[2, 0, 1, 3],
[3, 2, 1, 0]])
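As an alternative sketch, since the first column of ds2 happens to be sorted ascending in the sample (an assumption this variant relies on), np.searchsorted can look up the replacement positions without building the large pairwise comparison matrix:
out = ds1.copy()
flat = out.ravel()  # a view into out, so the writes below modify out

# Only replace values that actually appear in ds2's first column
valid = np.isin(flat, ds2[:, 0])
pos = np.searchsorted(ds2[:, 0], flat[valid])
flat[valid] = ds2[pos, 1]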
Here is another solution, using the DataFrame.replace() function:
df1.replace(to_replace=df2[0].tolist(), value=df2[1].tolist(), inplace=True)
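A hedged usage sketch with a dict mapping instead of two parallel lists (assuming, as the line above does, that df2's columns are labeled 0 and 1):
mapping = dict(zip(df2[0], df2[1]))
df1 = df1.replace(mapping)
print(df1)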