NumPy: slicing a matrix based on a condition on one column - python

X = np.arange(1, 26).reshape(5, 5)
X[:,1:2] % 2 == 0
The condition should only be applied to the second column.
I want a mask for the whole matrix that is True where the condition holds, like
[array([[False, True, False, False, False],
[ False, False, False, False, False],
[False, True, False, False, False],
[ False, False, False, False, False],
[False, True, False, False, False]])]
It's giving the error
IndexError: boolean index did not match indexed array along dimension 1; dimension is 5 but corresponding boolean dimension is 1

Is this what you want?
import numpy as np
X = np.arange(1, 26).reshape(5, 5)
X=[X[::] % 2 == 0]
print(X)
Output
[array([[False, True, False, True, False],
[ True, False, True, False, True],
[False, True, False, True, False],
[ True, False, True, False, True],
[False, True, False, True, False]])]

If you want the mask for the whole matrix where the condition is true, you can simply do this:
X % 2 == 0
If you want the mask for the second column only (index 1), then:
X[:, 1:2] % 2 == 0
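Neither expression above yields the 5x5 mask shown in the question, where only the second column can be True. Here is a minimal sketch of one way to build it, assuming the goal is a full-shape boolean mask (the name mask is illustrative and not from the original post):

```python
import numpy as np

X = np.arange(1, 26).reshape(5, 5)

# Start from an all-False mask with the same shape as X.
mask = np.zeros(X.shape, dtype=bool)

# Evaluate the condition on the second column only (index 1)
# and write the result into that column of the mask.
mask[:, 1] = X[:, 1] % 2 == 0

print(mask)
```

If the aim is instead to select the rows whose second column is even, X[X[:, 1] % 2 == 0] indexes with a 1-D boolean array whose length matches the number of rows, which avoids the IndexError from the question.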

Related

numpy isin for multi-dimensions

I have a big array of integers and a second array of arrays. I want to create a boolean mask for the first array based on data from the second array of arrays. Preferably I would use numpy.isin, but its documentation clearly states:
The values against which to test each value of element. This argument is flattened if it is an array or array_like. See notes for behavior with non-array-like parameters.
Do you maybe know some performant way of doing this instead of list comprehension?
So for example having those arrays:
a = np.array([[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]])
b = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]])
I would like to have result like:
np.array([
[True, True, False, False, False, False, False, False, False, False],
[False, False, True, True, False, False, False, False, False, False],
[False, False, False, False, True, True, False, False, False, False],
[False, False, False, False, False, False, True, True, False, False],
[False, False, False, False, False, False, False, False, True, True]
])
You can use broadcasting to avoid any loop (this is however more memory expensive):
(a == b[...,None]).any(-2)
Output:
array([[ True, True, False, False, False, False, False, False, False, False],
[False, False, True, True, False, False, False, False, False, False],
[False, False, False, False, True, True, False, False, False, False],
[False, False, False, False, False, False, True, True, False, False],
[False, False, False, False, False, False, False, False, True, True]])
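To see why the broadcasting expression works, it may help to look at the intermediate shapes. The sketch below uses the arrays from the question and checks the result against a plain list comprehension; the names expanded, compared, mask and reference are illustrative only.

```python
import numpy as np

a = np.array([[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]])          # shape (1, 10)
b = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]])   # shape (5, 2)

expanded = b[..., None]          # shape (5, 2, 1): one trailing axis added
compared = a == expanded         # broadcasts to shape (5, 2, 10)
mask = compared.any(axis=-2)     # collapse the "values per row of b" axis -> (5, 10)

# Reference result built row by row with np.isin.
reference = np.array([np.isin(a[0], row) for row in b])

print(expanded.shape, compared.shape, mask.shape)
print(np.array_equal(mask, reference))
```

The intermediate array holds len(b) * b.shape[1] * a.size booleans, which is where the extra memory cost mentioned above comes from.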
Try numpy.apply_along_axis to work with numpy.isin:
np.apply_along_axis(lambda x: np.isin(a, x), axis=1, arr=b)
returns
array([[[ True, True, False, False, False, False, False, False, False, False]],
[[False, False, True, True, False, False, False, False, False, False]],
[[False, False, False, False, True, True, False, False, False, False]],
[[False, False, False, False, False, False, True, True, False, False]],
[[False, False, False, False, False, False, False, False, True, True]]])
I will update with an edit comparing the runtime with a list comp
EDIT:
Whelp, I tested the runtime, and wouldn't you know, listcomp is faster
timeit.timeit("[np.isin(a,x) for x in b]",number=10000, globals=globals())
0.37380070000654086
vs
timeit.timeit("np.apply_along_axis(lambda x: np.isin(a, x), axis=1, arr=b) ",number=10000, globals=globals())
0.6078917000122601
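The other answer to this post, by @mozway, is much faster (note that this run uses number=100 rather than number=10000, so the totals are not directly comparable):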
timeit.timeit("(a == b[...,None]).any(-2)",number=100, globals=globals())
0.007107900004484691
and should probably be accepted.
This is a bit of a cheat, but an ultra-fast solution. The cheat is that I sort the second matrix beforehand so that I can use binary search.
import time
import numba as nb
import numpy as np

@nb.njit(parallel=True)
def isin_multi(a, b):
    out = np.zeros((b.shape[0], a.shape[0]), dtype=nb.boolean)
    for i in nb.prange(a.shape[0]):
        for j in nb.prange(b.shape[0]):
            # each row of b is sorted, so a binary search can look for a[i]
            index = np.searchsorted(b[j], a[i])
            if index >= len(b[j]) or b[j][index] != a[i]:
                out[j][i] = False
            else:
                out[j][i] = True
                break  # stops at the first row of b that contains a[i]
    return out

a = np.random.randint(200000, size=200000)
b = np.random.randint(200000, size=(50, 5000))
b = np.sort(b, axis=1)
start = time.perf_counter()
for _ in range(20):
    isin_multi(a, b)
print(f"isin_multi {time.perf_counter() - start:.3f} seconds")
start = time.perf_counter()
for _ in range(20):
    np.array([np.isin(a, ids) for ids in b])
print(f"comprehension {time.perf_counter() - start:.3f} seconds")
Results:
isin_multi 2.951 seconds.
comprehension 21.093 seconds

Mean of last N rows of pandas dataframe if the previous rows meet a condition

I have a pandas dataframe like
index start end label
0 2 5 0
1 3 8 1
2 4 8 0
3 5 9 1
4 6 10 0
5 7 10 1
6 8 11 1
7 9 12 0
I want a new column 'mean', where the value for each row is the mean of label over that row and the previous rows j that satisfy the condition df['start'] (current row) < df['end'] (row j).
Example,
for index 1, df['mean'] = (df[0]['label']+ df[1]['label'])/2
for index 3, df['mean'] = (df[1]['label']+ df[2]['label']+ df[3]['label'])/3 ; here we ignore index 0 because the condition df[3]['start']<df[0]['end'] is not satisfied.
similarly, for index 7, df['mean'] = (df[4]['label']+ df[5]['label']+ df[6]['label']+ df[7]['label'])/4 ; for index 0,1,2,3 the condition df[7]['start']<df[i]['end'] is not satisfied.
So the final output would be
index start end label mean
0 2 5 0 0
1 3 8 1 1/2
2 4 8 0 1/3
3 5 9 1 2/3
4 6 10 0 2/4
5 7 10 1 3/5
6 8 11 1 3/4
7 9 12 0 2/4
I was trying using cumsum; but I am not sure how to put the condition.
To be fair I decided to try and compare three types of approaches:
using loop
using parallel loop (numba)
using matrix
Here is the code
import pandas as pd
from numba import njit, prange
import numpy as np
from timeit import timeit
from pandas.testing import assert_frame_equal

big_df = pd.DataFrame(np.random.randint(0,100,size=(1000, 3)), columns=["start", "end", "label"])

def cond_cumsum_matrix(df):
    mask_matrix = (
        (df.start.to_numpy().reshape(1,-1).T < df.end.to_numpy())
        & (df.index.to_numpy() <= np.arange(0,len(df)).reshape(1, -1).T)
    )
    with np.errstate(divide='ignore', invalid='ignore'):
        df_add = pd.DataFrame(
            (np.matmul(
                (
                    (mask_matrix)
                ), df.label.to_numpy()
            )
            ) / (mask_matrix.sum(axis=-1)),
            columns = ["mean"]
        )
    return df_add

def cond_cumsum_parallel_loop(df):
    # note: without parallel=True, numba treats prange like range
    @njit
    def numba_cond_cumsum_parallel_loop(label, start, end):
        cumsum = []
        for i in prange(len(label)):
            running = 0
            count = 0
            for j in prange(i+1):
                if start[i] < end[j] :
                    running += label[j]
                    count += 1
            if count == 0:
                cumsum.append(np.nan)
            else:
                cumsum.append(running/count)
        return cumsum
    return pd.DataFrame(
        numba_cond_cumsum_parallel_loop(
            df.label.to_numpy(),
            df.start.to_numpy(),
            df.end.to_numpy(),
        ), columns=["mean"],)

def cond_cumsum_loop(df):
    start = df.start.tolist()
    end = df.end.tolist()
    label = df.label.tolist()
    cumsum = []
    for index, row in df.iterrows():
        running = 0
        count = 0
        for j in range(index+1):
            if row.start < end[j] :
                running += label[j]
                count += 1
        if count == 0:
            cumsum.append(np.nan)
        else:
            cumsum.append(running/count)
    return pd.DataFrame(
        cumsum,
        columns=["mean"],)

assert_frame_equal(cond_cumsum_matrix(big_df), cond_cumsum_loop(big_df))
assert_frame_equal(cond_cumsum_matrix(big_df), cond_cumsum_parallel_loop(big_df))
repetitions = 5
print(f"cond_cumsum_loop runs {timeit(lambda: cond_cumsum_loop(big_df), number=repetitions)/repetitions} seconds")
print(f"cond_cumsum_parallel_loop runs {timeit(lambda: cond_cumsum_parallel_loop(big_df), number=repetitions)/repetitions} seconds")
print(f"cond_cumsum_matrix runs {timeit(lambda: cond_cumsum_matrix(big_df), number=repetitions)/repetitions} seconds")
and here is what result it gives:
cond_cumsum_loop runs 1.2179410583339632 seconds
cond_cumsum_parallel_loop runs 0.07655967501923441 seconds
cond_cumsum_matrix runs 0.004219983238726854 seconds
Of course the code could be improved, so the comparison is not ideal. Still, the conclusion is that the matrix approach wins on performance at the cost of O(n^2) additional memory, while the parallel (numba) loop gives decent performance with only O(n) additional memory.
Here is a less performant solution (looping over each row should generally be avoided in Pandas) but one that is hopefully accessible as a starting point that you can then optimize:
import pandas as pd

df = pd.DataFrame([
    [2,5,0],
    [3,8,1],
    [4,8,0],
    [5,9,1],
    [6,10,0],
    [7,10,1],
    [8,11,1],
    [9,12,0]],columns=['start','end','label'])
for index, row in df.iterrows():
    if index == 0:
        df.at[index, 'cumulative_mean'] = 0
    else:
        current_row_start = row['start']
        # copy of all rows up to and including the current one
        # (.copy() avoids writing the helper column into a slice of df)
        previous_rows_as_df = df.loc[0:index].copy()
        for p_index, p_row in previous_rows_as_df.iterrows():
            if current_row_start < p_row['end']:
                previous_rows_as_df.at[p_index, 'include'] = True
        df.at[index, 'cumulative_mean'] = previous_rows_as_df[previous_rows_as_df['include'] == True]['label'].mean()
Here is your result.
import numpy as np
mask_matrix = (
    (df.start.to_numpy().reshape(1,-1).T < df.end.to_numpy())
    & (df.index.to_numpy() <= np.arange(0,len(df)).reshape(1, -1).T)
)
df_add = pd.DataFrame(
    (np.matmul(
        (
            (mask_matrix)
        ), df.label.to_numpy()
    )
    ) / (mask_matrix.sum(axis=-1)),
    columns = ["mean"]
)
df = pd.concat([df, df_add], axis=1)
When we create the matrix we use O(n^2) additional space. Hopefully that is not a problem. Otherwise you need to use a loop, which I personally don't like when vectorized computation is available.
A few additional comments:
df.start.to_numpy().reshape(1,-1).T < df.end.to_numpy() compares each row's start (one per matrix row) against every row's end (one per matrix column). This is the result:
array([[ True, True, True, True, True, True, True, True, True],
[ True, True, True, True, True, True, True, True, True],
[ True, True, True, True, True, True, True, True, True],
[False, True, True, True, True, True, True, True, True],
[False, True, True, True, True, True, True, True, True],
[False, True, True, True, True, True, True, True, True],
[False, False, False, True, True, True, True, True, False],
[False, False, False, False, True, True, True, True, False],
[False, False, False, False, True, True, True, True, False]])
(df.index.to_numpy() <= np.arange(0,len(df)).reshape(1, -1).T) restricts the previous result to the current row and the rows preceding it. This mask looks like this:
array([[ True, False, False, False, False, False, False, False, False],
[ True, True, False, False, False, False, False, False, False],
[ True, True, True, False, False, False, False, False, False],
[ True, True, True, True, False, False, False, False, False],
[ True, True, True, True, True, False, False, False, False],
[ True, True, True, True, True, True, False, False, False],
[ True, True, True, True, True, True, True, False, False],
[ True, True, True, True, True, True, True, True, False],
[ True, True, True, True, True, True, True, True, True]])
The final mask_matrix (elementwise AND of the previous two matrices) looks like this:
array([[ True, False, False, False, False, False, False, False, False],
[ True, True, False, False, False, False, False, False, False],
[ True, True, True, False, False, False, False, False, False],
[False, True, True, True, False, False, False, False, False],
[False, True, True, True, True, False, False, False, False],
[False, True, True, True, True, True, False, False, False],
[False, False, False, True, True, True, True, False, False],
[False, False, False, False, True, True, True, True, False],
[False, False, False, False, True, True, True, True, False]])
Now multiplying this mask_matrix by the vector df.label gives the sum of the labels selected in each row, which is almost what we need. We just need to divide elementwise by the number of True values in each row of mask_matrix.
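As a quick check of the steps above, here is a small sketch that applies the same masking idea to the eight rows from the question, using plain NumPy arrays in place of the DataFrame columns. The names overlaps, not_later, mask and means are illustrative, and np.tril is swapped in for the index comparison as an equivalent way to keep only rows j <= i.

```python
import numpy as np

# Columns of the example DataFrame from the question.
start = np.array([2, 3, 4, 5, 6, 7, 8, 9])
end = np.array([5, 8, 8, 9, 10, 10, 11, 12])
label = np.array([0, 1, 0, 1, 0, 1, 1, 0])
n = len(start)

# overlaps[i, j] is True when start[i] < end[j].
overlaps = start[:, None] < end[None, :]

# not_later[i, j] is True when j <= i (lower triangle incl. diagonal).
not_later = np.tril(np.ones((n, n), dtype=bool))

mask = overlaps & not_later

# Sum of selected labels per row divided by the number of selected rows.
means = (mask.astype(np.int64) @ label) / mask.sum(axis=1)

print(means)
```

Each row of mask selects at least the current row here (because start < end in every row), so the division never hits zero; for data where that is not guaranteed, the np.errstate guard used in the benchmark code would be needed.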

2d index to select elements from 1d array

I'm trying to use a 2d boolean array (ix) to pick elements from a 1d array (c) to create a 2d array (r). The resulting 2d array is also a boolean array. Each column stands for the unique value in c.
Example:
>>> ix
array([[ True, True, False, False, False, False, False],
[False, False, True, False, False, False, True],
[False, False, False, True, False, False, False]])
>>> c
array([1, 2, 3, 4, 8, 2, 4])
Expected result
1, 2, 3, 4, 8
r = [
[ True, True, False, False, False], # c[ix[0][0]] == 1 and c[ix[0][1]] == 2; it doesn't matter that ix[0][5] (pointing to `2` in `c`) is False as ix[0][1] was already True which is sufficient.
[False, False, True, True, False], # [3]
[False, False, False, True, False] # [4] as ix[2][3] is True
]
Can this be done in a vectorised way?
Let us try:
# unique values
uniques = np.unique(c)
# boolean index into each row
vals = np.tile(c,3)[ix.ravel()]
# search within the unique values
idx = np.searchsorted(uniques, vals)
# pre-populate output
out = np.full((len(ix), len(uniques)), False)
# index into the output:
out[np.repeat(np.arange(len(ix)), ix.sum(1)), idx ] = True
Output:
array([[ True, True, False, False, False],
[False, False, True, True, False],
[False, False, False, True, False]])
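A broadcasting-based alternative may be easier to follow for small inputs. This is a sketch rather than the answer's method: it compares every element of c with every unique value and then reduces over the positions selected by ix. The name onehot is illustrative, and the intermediate array has shape (rows, len(c), len(uniques)), so it trades memory for simplicity.

```python
import numpy as np

ix = np.array([
    [True, True, False, False, False, False, False],
    [False, False, True, False, False, False, True],
    [False, False, False, True, False, False, False],
])
c = np.array([1, 2, 3, 4, 8, 2, 4])

uniques = np.unique(c)             # sorted unique values of c

# onehot[k, u] is True when c[k] equals uniques[u]; shape (7, 5).
onehot = c[:, None] == uniques

# (3, 7, 1) & (7, 5) broadcasts to (3, 7, 5); reduce over positions in c.
r = (ix[:, :, None] & onehot).any(axis=1)

print(uniques)
print(r)
```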

Encoding for patterns with Numpy

I want to find up/down patterns in a time series. This is what I use for simple up/down:
diff = np.diff(source, n=1)
encoding = np.where(diff > 0, 1, 0)
Is there a way with Numpy to do that for patterns with a given lookback length without a slow loop? For example up/up/up = 0 down/down/down = 1 up/down/up = 2 up/down/down = 3.....
Thank you for your help.
I learned yesterday about np.lib.stride_tricks.as_strided from a StackOverflow answer to a similar question. It is an awesome trick and not as hard to understand as I expected. Once you get it, let's define a function called rolling that lists all the windows to check against the pattern:
import numpy as np

def rolling(a, window):
    shape = (a.size - window + 1, window)
    strides = (a.itemsize, a.itemsize)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

compare_with = [True, False, True]
bool_arr = np.random.choice([True, False], size=15)
patterns = rolling(bool_arr, len(compare_with))
And after that you can calculate the indexes of the pattern matches:
idx = np.where(np.all(patterns == compare_with, axis=1))
Sample run:
bool_arr
array([ True, False, True, False, True, True, False, False, False,
False, False, False, True, True, False])
patterns
array([[ True, False, True],
[False, True, False],
[ True, False, True],
[False, True, True],
[ True, True, False],
[ True, False, False],
[False, False, False],
[False, False, False],
[False, False, False],
[False, False, False],
[False, False, True],
[False, True, True],
[ True, True, False]])
idx
(array([0, 2], dtype=int64),)
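The question asks for one code per pattern rather than matches against a single pattern. A possible sketch, assuming NumPy 1.20 or later (for sliding_window_view) and assuming that any consistent numbering is acceptable: treat each window of up/down flags as a binary number. The sample source values, lookback and the resulting numbering (up = 1, down = 0, oldest move as the most significant bit) are illustrative choices, not the numbering from the question.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

source = np.array([1.0, 2.0, 1.5, 1.7, 1.6, 1.4, 1.9])  # made-up sample series
lookback = 3

up = np.diff(source) > 0                      # True = up, False = down

# Read-only view of all consecutive windows, shape (len(up) - lookback + 1, lookback).
windows = sliding_window_view(up, lookback)

# Binary weights, most significant bit first: [4, 2, 1] for lookback = 3.
weights = 1 << np.arange(lookback - 1, -1, -1)

# Each window becomes an integer in [0, 2**lookback).
codes = windows.astype(np.int64) @ weights

print(up)
print(codes)
```

If a specific numbering is required (for example up/up/up = 0), the integer codes can be remapped afterwards with a lookup array of length 2**lookback.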

Changing an array of True and False answers to a hex value Python

I have a list of True and False answers like this:
[True, True, True, False, False, True, False, False]
[True, True, False, False, True, False, False, True]
[True, False, False, True, False, False, True, True]
[False, False, True, False, False, True, True, True]
[False, True, False, False, True, True, True, False]
[True, False, False, True, True, True, False, False]
[False, False, True, True, True, False, False, True]
[False, True, True, True, False, False, True, False]
I want to give True a value of 1 and False a value of 0 and then convert that overall value to hexadecimal.
How would I go about doing that? Could I look at each value in the list in turn and, if it equals True, change that value to a 1 and, if it's False, change it to a 0? Or would there be an easier way to change the entire list straight to hex?
EDIT: Here's the full code on Pastebin: http://pastebin.com/1839NKCx
Thanks
lists = [
[True, True, True, False, False, True, False, False],
[True, True, False, False, True, False, False, True],
[True, False, False, True, False, False, True, True],
[False, False, True, False, False, True, True, True],
[False, True, False, False, True, True, True, False],
[True, False, False, True, True, True, False, False],
[False, False, True, True, True, False, False, True],
[False, True, True, True, False, False, True, False],
]
for l in lists:
    zero_one = map(int, l)  # convert True to 1, False to 0 using `int`
    n = int(''.join(map(str, zero_one)), 2)  # numbers to strings, join them
                                             # convert to number (base 2)
    print('{:02x}'.format(n))  # format them as hex string using `str.format`
output:
e4
c9
93
27
4e
9c
39
72
If you want to combine a series of boolean values into one value (as a bitfield, with the first element as the least significant bit), you could do something like this:
x = [True, False, True, False, True, False ]
v = sum(a<<i for i,a in enumerate(x))
print(hex(v))
No need for a two-step process if you use reduce (assuming the MSB is at the left, as usual):
from functools import reduce
b = [True, True, True, False, False, True, False, False]
val = reduce(lambda byte, bit: byte*2 + bit, b, 0)
print(val)
print(hex(val))
Displaying:
228
0xe4
This should do it:
def bool_list_to_hex(bits):
    # names changed from `list` and `bool` to avoid shadowing the builtins
    n = 0
    for bit in bits:
        n *= 2
        n += int(bit)
    return hex(n)
One-liner:
>>> lists = [
[True, True, True, False, False, True, False, False],
[True, True, False, False, True, False, False, True],
[True, False, False, True, False, False, True, True],
[False, False, True, False, False, True, True, True],
[False, True, False, False, True, True, True, False],
[True, False, False, True, True, True, False, False],
[False, False, True, True, True, False, False, True],
[False, True, True, True, False, False, True, False]]
>>> ''.join(hex(int(''.join('1' if boolValue else '0' for boolValue in byteOfBools),2))[2:] for byteOfBools in lists)
'e4c993274e9c3972'
Inner join produces a string of eight zeros and ones.
int(foo,2) turns the string into a number interpreting it as binary.
hex turns it to hex format.
[2:] removes the leading '0x' from the standard hex format
outer join does this to all sublists and, well, joins the results.
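One caveat with the [2:] slicing is that hex() does not pad, so a byte below 0x10 would contribute a single hex digit to the joined string. Here is a minimal sketch of a padded variant, assuming MSB-first input; the helper name bools_to_hex is hypothetical and not from any of the answers.

```python
def bools_to_hex(bits):
    """Convert a sequence of booleans (MSB first) to a zero-padded hex string."""
    n = 0
    for bit in bits:
        n = (n << 1) | int(bit)       # shift in one bit at a time
    width = (len(bits) + 3) // 4      # one hex digit per 4 bits, rounded up
    return format(n, "0{}x".format(width))


lists = [
    [True, True, True, False, False, True, False, False],
    [True, True, False, False, True, False, False, True],
    [True, False, False, True, False, False, True, True],
    [False, False, True, False, False, True, True, True],
    [False, True, False, False, True, True, True, False],
    [True, False, False, True, True, True, False, False],
    [False, False, True, True, True, False, False, True],
    [False, True, True, True, False, False, True, False],
]

joined = "".join(bools_to_hex(row) for row in lists)
print(joined)
print(bools_to_hex([False] * 4 + [True] * 4))  # leading zero is kept
```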
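The methods above all build one integer from the whole list, which may be a concern for lists longer than 64 bits (Python's own int has no such limit, but fixed-width integer types do).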
It could also be discussed whether it is efficient to convert the booleans several times, especially to strings, before the conversion to hex.
Here is a proposal, with the MSB on the left of bitlist:
from collections import deque
# Example input (assumed here; taken from the first row in the question)
bitlist = [True, True, True, False, False, True, False, False]
# (lazy) Pad False on the MSB side so that the bitlist length is a multiple of 4.
# Padded length can be zero
msb_padlen = (-len(bitlist))%4
bitlist = deque(bitlist)
bitlist.extendleft([False]*msb_padlen)
# (lazy) Re-pack list of bits into list of 4-bit tuples
pack4s = zip(* [iter(bitlist)]*4)
# Convert each 4-tuple into a hex digit
hexstring = [hex(sum(a<<i for i,a in enumerate(reversed(pack4))))[-1] for pack4 in pack4s ]
# Compact list of hex digits into a string
hexstring = '0x'+''.join(hexstring)
The 4-bit tuple pack4 is (msb,...,lsb) => it has to be reversed while calculating corresponding integer.
Alternative :
hexstring = [hex(sum(a<<3-i for i,a in enumerate(pack4)))[-1] for pack4 in pack4s ]
