applying a mask to an array in a for loop

I have this code:
import numpy as np
result = {}
result['depth'] = [1,1,1,2,2,2]
result['generation'] = [1,1,1,2,2,2]
result['dimension'] = [1,2,3,1,2,3]
result['data'] = [np.array([0,0,0]), np.array([0,0,0]), np.array([0,0,0]), np.array([0,0,0]), np.array([0,0,0]), np.array([0,0,0])]
for v in np.unique(result['depth']):
    temp_v = (result['depth'] == v)
    values_v = [result[string][temp_v] for string in result.keys()]
    this_v = dict(zip(result.keys(), values_v))
in which I want to create a new dict called 'this_v', with the same keys as the original dict result, but fewer values.
The line:
values_v = [result[string][temp_v] for string in result.keys()]
gives an error
TypeError: only integer scalar arrays can be converted to a scalar index
which I don't understand, since I can create ex = result[result.keys()[0]][temp_v] just fine. It just does not let me do this with a for loop so that I can fill the list.
Any idea as to why it does not work?

In order to solve your problem (finding and dropping duplicates) I encourage you to use pandas. It is a Python module that makes your life absurdly simple:
import numpy as np
result = {}
result['depth'] = [1,1,1,2,2,2]
result['generation'] = [1,1,1,2,2,2]
result['dimension'] = [1,2,3,1,2,3]
result['data'] = [np.array([0,0,0]), np.array([0,0,0]), np.array([0,0,0]),
                  np.array([0,0,0]), np.array([0,0,0]), np.array([0,0,0])]
# Here comes pandas!
import pandas as pd
# Converting your dictionary of lists into a beautiful dataframe
df = pd.DataFrame(result)
#> data depth dimension generation
# 0 [0, 0, 0] 1 1 1
# 1 [0, 0, 0] 1 2 1
# 2 [0, 0, 0] 1 3 1
# 3 [0, 0, 0] 2 1 2
# 4 [0, 0, 0] 2 2 2
# 5 [0, 0, 0] 2 3 2
# Dropping duplicates... in one single command!
df = df.drop_duplicates('depth')
#> data depth dimension generation
# 0 [0, 0, 0] 1 1 1
# 3 [0, 0, 0] 2 1 2
If you want your data back in the original format... again, you need just one line of code!
df.to_dict('list')
#> {'data': [array([0, 0, 0]), array([0, 0, 0])],
# 'depth': [1, 2],
# 'dimension': [1, 1],
# 'generation': [1, 2]}
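As for the original TypeError: result['depth'] == v does produce a boolean array (NumPy broadcasts the comparison against the list), but result[string] is still a plain Python list, and a list cannot be indexed with a boolean array. Here is a minimal NumPy-only sketch that keeps the loop structure, assuming you are happy to convert every value to an array up front:
import numpy as np
result = {
    'depth': np.array([1, 1, 1, 2, 2, 2]),
    'generation': np.array([1, 1, 1, 2, 2, 2]),
    'dimension': np.array([1, 2, 3, 1, 2, 3]),
    'data': np.zeros((6, 3)),  # stacks the six length-3 zero arrays
}
for v in np.unique(result['depth']):
    mask = result['depth'] == v  # boolean mask over all six entries
    this_v = {key: val[mask] for key, val in result.items()}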

Why can't Pandas replace NaN with an array of 0s using masks/replace?

I have a series like this
s = pd.Series([[1,2,3],[1,2,3],np.nan,[1,2,3],[1,2,3],np.nan])
and I simply want the NaN to be replaced by [0,0,0].
I have tried
s.fillna([0,0,0]) # TypeError: "value" parameter must be a scalar or dict, but you passed a "list"
s[s.isna()] = [[0,0,0],[0,0,0]] # just replaces the NaN with a single "0". WHY?!
s.fillna("NAN").replace({"NAN":[0,0,0]}) # ValueError: NumPy boolean array indexing assignment cannot
#assign 3 input values to the 2 output values where the mask is true
s.fillna("NAN").replace({"NAN":[[0,0,0],[0,0,0]]}) # TypeError: NumPy boolean array indexing assignment
# requires a 0 or 1-dimensional input, input has 2 dimensions
I really can't understand why the first two approaches won't work (maybe I get the first, but I can't wrap my head around the second).
Thanks to this SO-question and answer, we can do it by
is_na = s.isna()
s.loc[is_na] = s.loc[is_na].apply(lambda x: [0,0,0])
but since apply is often rather slow, I cannot understand why we can't use replace or the slicing as above
Pandas is painful to work with when the values are lists; here is a hacky solution:
s = s.fillna(pd.Series([[0,0,0]] * len(s), index=s.index))
print (s)
0 [1, 2, 3]
1 [1, 2, 3]
2 [0, 0, 0]
3 [1, 2, 3]
4 [1, 2, 3]
5 [0, 0, 0]
dtype: object
Another option is Series.reindex:
s.dropna().reindex(s.index, fill_value=[0, 0, 0])
0 [1, 2, 3]
1 [1, 2, 3]
2 [0, 0, 0]
3 [1, 2, 3]
4 [1, 2, 3]
5 [0, 0, 0]
dtype: object
The documentation indicates that this value cannot be a list.
Value to use to fill holes (e.g. 0), alternately a
dict/Series/DataFrame of values specifying which value to use for each
index (for a Series) or column (for a DataFrame). Values not in the
dict/Series/DataFrame will not be filled. This value cannot be a list.
This is probably a limitation of the current implementation, and short of patching the source code you must resort to workarounds (as provided below).
However, if you are not planning to work with jagged arrays, what you really want to do is probably replace pd.Series() with pd.DataFrame(), e.g.:
import numpy as np
import pandas as pd
s = pd.DataFrame(
    [[1, 2, 3],
     [1, 2, 3],
     [np.nan],
     [1, 2, 3],
     [1, 2, 3],
     [np.nan]],
    dtype=pd.Int64Dtype())  # to mix integers with NaNs
s.fillna(0)
# 0 1 2
# 0 1 2 3
# 1 1 2 3
# 2 0 0 0
# 3 1 2 3
# 4 1 2 3
# 5 0 0 0
If you do need to use jagged arrays, you could use any of the proposed workarounds from the other answers, or you could make one of your attempts work, e.g.:
ii = s.isna()
nn = ii.sum()
s[ii] = pd.Series([[0, 0, 0]] * nn).to_numpy()
# 0 [1, 2, 3]
# 1 [1, 2, 3]
# 2 [0, 0, 0]
# 3 [1, 2, 3]
# 4 [1, 2, 3]
# 5 [0, 0, 0]
# dtype: object
which basically uses NumPy masking to fill in the Series. The trick is to generate a compatible object for the assignment that works at the NumPy level.
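To see the trick in isolation, here is a minimal sketch (plain pandas/NumPy behavior): wrapping the replacement lists in a Series and calling .to_numpy() yields a 1-D object array, which is exactly the shape NumPy's boolean-mask assignment expects, whereas a bare list of lists would be treated as a 2-D input.
import pandas as pd
filler = pd.Series([[0, 0, 0]] * 2).to_numpy()
print(filler.shape, filler.dtype)  # (2,) object -- one list object per masked slot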
If there are too many NaNs in the input, it is probably more efficient / faster to work in a similar way but with s.notna() instead, e.g.:
import pandas as pd
result = pd.Series([[0, 0, 0]] * len(s))
result[s.notna()] = s[s.notna()]
Let's try to do some benchmarking, where:
replace_nan_isna() is from above
import pandas as pd
def replace_nan_isna(s, value, inplace=False):
    if not inplace:
        s = s.copy()
    ii = s.isna()
    nn = ii.sum()
    s[ii] = pd.Series([value] * nn).to_numpy()
    return s
replace_nan_notna() is also from above
import pandas as pd
def replace_nan_notna(s, value, inplace=False):
    if inplace:
        raise ValueError("In-place not supported!")
    result = pd.Series([value] * len(s))
    result[s.notna()] = s[s.notna()]
    return result
replace_nan_reindex() is from @ShubhamSharma's answer
def replace_nan_reindex(s, value, inplace=False):
    if not inplace:
        s = s.copy()
    s = s.dropna().reindex(s.index, fill_value=value)
    return s
replace_nan_fillna() is from @jezrael's answer
import pandas as pd
def replace_nan_fillna(s, value, inplace=False):
    if not inplace:
        s = s.copy()
    s = s.fillna(pd.Series([value] * len(s), index=s.index))
    return s
with the following code:
import numpy as np
import pandas as pd
def gen_data(n=5, k=2, p=0.7, obj=(1, 2, 3)):
    return pd.Series(([obj] * int(p * n) + [np.nan] * (n - int(p * n))) * k)

funcs = replace_nan_isna, replace_nan_notna, replace_nan_reindex, replace_nan_fillna
value = (0, 0, 0)

# : inspect results
s = gen_data(5, 1)
for func in funcs:
    print(f'{func.__name__:>20s} {func(s, value)}')
    print()

# : generate benchmarks
s = gen_data(100, 1000)
base = funcs[0](s, value)
for func in funcs:
    print(f'{func.__name__:>20s} {(func(s, value) == base).all()!s:>5}', end=' ')
    %timeit func(s, value)
# replace_nan_isna True 100 loops, best of 5: 16.5 ms per loop
# replace_nan_notna True 10 loops, best of 5: 46.5 ms per loop
# replace_nan_reindex True 100 loops, best of 5: 9.74 ms per loop
# replace_nan_fillna True 10 loops, best of 5: 36.4 ms per loop
indicating that reindex() may be the fastest approach.

How to parse through a list of lists and analyze the elements together to see how many times they occur over time?

Let's say I have a machine that sends out 4 bits every second and I want to see the number of times a certain bit signature is sent over time.
I am given an input list of lists that contain a message in bits that change over time.
For my output I would like a list of dictionaries, per bit pair, containing the unique bit pair as the key and the times it appears as the value.
Edit New Example:
For example this following data set would be a representation of that data. With the horizontal axis being bit position and the vertical axis being samples over time. So for the following example I have 4 total bits and 6 total samples.
a = [
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 0, 1, 0]]
For this data set I am trying to get a count of how many times a certain bit string occurs. The length should be able to vary, but for this example let's say I am doing 2 bits at a time.
So the first sample [0,0,1,1] would be split into this
[00,01,11] and the second would be [01,11,11] and the third would be [11,11,11] and so on. Producing a list like so:
y = [
    ['00', '01', '11'],
    ['01', '11', '11'],
    ['11', '11', '11'],
    ['11', '11', '11'],
    ['00', '00', '00'],
    ['10', '01', '10']]
From this I want to be able to count each unique signature and to produce a dictionary with keys corresponding to the signature and values to the counts.
The dictionary would look like this
z = [
    {'00': 2, '01': 1, '11': 2, '10': 1},
    {'00': 1, '01': 2, '11': 3},
    {'00': 1, '11': 4, '10': 1}]
Finding the counts is easy if I have a list of parsed items. However, getting from the raw data to that parsed list is where I am currently having some trouble. I have an implementation, but it's essentially 3 nested for loops and it runs really slowly over large datasets. Surely there is a better, more pythonic way to go about this?
I am using numpy for some additional calculation later on in my program so I would not be against using it here.
UPDATE:
I have been looking around at other approaches and came across this. Not sure if this is the best solution either.
import numpy as np
a = np.array([
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [1, 1, 1, 1]])
my_list = a.astype(str).tolist()
# width of each sliding window
n = 2
# using a list comprehension (sliding window of width n)
final = [[''.join(c[i:i + n]) for i in range(len(c) - n + 1)] for c in my_list]
# final == [['00', '01', '11'], ['01', '11', '11'], ['11', '11', '11']]
UPDATE 2:
I have run the following implementations and tested their speeds, and here is what I came up with.
Running the data on the small example of 4 bits and 3 samples with a width of 2:
x = [
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [1, 1, 1, 1]]
My implementation took 0.0003 seconds
Kasrâmvd's implementation took 0.0002 seconds
Chris' implementation took 0.0002 seconds
Paul's implementation took 0.0243 seconds
However when running against an actual dataset of 64 bits and 23,497 samples with a width of 2. I got these results:
My implementation took 1.5302 seconds
Kasrâmvd's implementation took 0.3913 seconds
Chris' implementation took 2.0802 seconds
Paul's implementation took 0.0204 seconds
Here is an approach using convolution. Fast convolution relies on the FFT and therefore computes with floats; a double has a 52-bit mantissa (53 significant bits), so 53 is the maximum pattern width we can handle.
import itertools as it
import numpy as np
import scipy.signal as ss

MAX_BITS = np.finfo(float).nmant + 1  # 53 significant bits for a double

def sliding_window(data, width, return_keys=True, return_dict=True, prune_empty=True):
    n, m = data.shape
    if width > MAX_BITS:
        raise ValueError(f"max window width is {MAX_BITS}")
    # encode each window of bits as a single integer via convolution
    patterns = ss.convolve(data, 1 << np.arange(width)[None], 'valid', 'auto').astype(int)
    # offset each window position into its own block of codes
    patterns += np.arange(m - width + 1) * (1 << width)
    # count occurrences of every code, one row per window position
    cnts = np.bincount(patterns.ravel(), None, (m - width + 1) * (1 << width)).reshape(m - width + 1, -1)
    if return_keys or return_dict:
        # all bit strings of the given width, e.g. b'000', b'001', ...
        keys = np.array([*map("".join, it.product(*width * ("01",)))], 'S')
        if return_dict:
            dt = np.dtype([('key', f'S{width}'), ('value', int)])
            aux = np.empty(cnts.shape, dt)
            aux['value'] = cnts
            aux['key'] = keys
            if prune_empty:
                # drop zero counts before building the per-position dicts
                i, j = np.where(cnts)
                return [*map(dict, np.split(aux[i, j],
                                            i.searchsorted(np.arange(1, m - width + 1))))]
            return [*map(dict, aux.tolist())]
        return keys, cnts
    return cnts

example = np.random.randint(0, 2, (10, 10))
print(example)
print(sliding_window(example, 3))
Sample run:
[[0 1 1 1 0 1 1 1 1 1]
[0 0 1 0 1 0 0 1 0 1]
[0 0 1 0 1 1 1 0 1 1]
[1 1 1 1 1 0 0 0 1 0]
[0 0 0 0 1 1 1 0 0 0]
[1 1 0 0 0 1 0 0 1 1]
[0 1 1 1 0 1 1 1 1 1]
[0 1 0 0 0 1 1 0 0 1]
[1 0 1 1 0 1 1 0 1 0]
[0 0 1 1 0 1 0 1 0 0]]
[{b'000': 1, b'001': 3, b'010': 1, b'011': 2, b'101': 1, b'110': 1, b'111': 1}, {b'000': 1, b'010': 2, b'011': 2, b'100': 2, b'111': 3}, {b'000': 2, b'001': 1, b'101': 2, b'110': 4, b'111': 1}, {b'001': 2, b'010': 1, b'011': 2, b'101': 4, b'110': 1}, {b'010': 2, b'011': 4, b'100': 2, b'111': 2}, {b'000': 1, b'001': 1, b'100': 1, b'101': 1, b'110': 4, b'111': 2}, {b'001': 2, b'010': 2, b'100': 2, b'101': 2, b'111': 2}, {b'000': 1, b'001': 1, b'010': 2, b'011': 2, b'100': 1, b'101': 1, b'111': 2}]
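As a sanity check, calling the same function on the question's 6x4 example with width 2 should reproduce the counts asked for (note that the keys come back as bytes rather than str):
a = np.array([[0, 0, 1, 1],
              [0, 1, 1, 1],
              [1, 1, 1, 1],
              [1, 1, 1, 1],
              [0, 0, 0, 0],
              [1, 0, 1, 0]])
print(sliding_window(a, 2))
# [{b'00': 2, b'01': 1, b'10': 1, b'11': 2},
#  {b'00': 1, b'01': 2, b'11': 3},
#  {b'00': 1, b'10': 1, b'11': 4}]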
If you want a geometrical or algebraic analysis/solution, you can do the following:
In [108]: x = np.array([[0,0,1,1],
     ...:               [0,1,1,1],
     ...:               [1,1,1,1]])

In [109]: pairs = np.dstack((x[:, :-1], x[:, 1:]))

In [110]: x, y, z = pairs.shape

In [111]: uniques = np.unique(pairs.reshape(x*y, z), axis=0)

In [112]: uniques
Out[112]:
array([[0, 0],
       [0, 1],
       [1, 1]])

# Note: 3D broadcasting like this is generally not recommended; see the NumPy docs for details.
In [113]: R = (uniques[:,None][:,None,:] == pairs).all(3).sum(-1)

In [114]: R
Out[114]:
array([[1, 0, 0],
       [1, 1, 0],
       [1, 2, 3]])
Each column of matrix R holds the counts of the unique pairs (the rows of uniques) for the corresponding row of your original array.
You can then get a Python object like what you want as following:
In [116]: [{tuple(i): j for i,j in zip(uniques, i) if j} for i in R.T]
Out[116]: [{(0, 0): 1, (0, 1): 1, (1, 1): 1}, {(0, 1): 1, (1, 1): 2}, {(1, 1): 3}]
This solution originally gave the bits as tuples rather than paired strings (though that is simple enough to change).
EDIT: it now forms strings of bits as needed.
from collections import Counter
x = [[0, 0, 1, 1],
     [0, 1, 1, 1],
     [1, 1, 1, 1]]

y = [[''.join(map(str, ref[j:j+2])) for j in range(len(x[0])-1)]
     for ref in x]

for bit in y:
    d = Counter(bit)
    print(d)
Prints
Counter({'00': 1, '01': 1, '11': 1})
Counter({'11': 2, '01': 1})
Counter({'11': 3})
EDIT: To increase the window from 2 to 3, you might add this to your code:
window = 3
offset = window - 1
y = [[''.join(map(str, ref[j:j+window])) for j in range(len(x[0])-offset)]
     for ref in x]

Tensorflow: tensor binarization

I want to transform this dataset in such a way that each tensor has a given size n and that the feature at index i of this new tensor is set to 1 if and only if there is an i in the original feature (modulo n).
I hope the following example will make things clearer
Let's suppose I have a dataset like:
t = tf.constant([
    [0, 3, 4],
    [12, 2, 4]])
ds = tf.data.Dataset.from_tensors(t)
I want to get (if n = 9)
t = tf.constant([
    [1, 0, 0, 1, 1, 0, 0, 0, 0],   # indices set to 1 are 0, 3 and 4
    [0, 0, 1, 1, 1, 0, 0, 0, 0]])  # indices set to 1 are 2, 4, and 12%9 = 3
I know how to apply the modulo to a tensor, but I can't figure out how to do the rest of the transformation.
thanks
That is similar to tf.one_hot, only for multiple values at the same time. Here is a way to do that:
import tensorflow as tf

def binarization(t, n):
    # One-hot encoding of each value
    t_1h = tf.one_hot(t % n, n, dtype=tf.bool, on_value=True, off_value=False)
    # Reduce across last dimension of the original tensor
    return tf.cast(tf.reduce_any(t_1h, axis=-2), t.dtype)

# Test
with tf.Graph().as_default(), tf.Session() as sess:
    t = tf.constant([
        [ 0, 3, 4],
        [12, 2, 4]
    ])
    t_m1h = binarization(t, 9)
    print(sess.run(t_m1h))
Output:
[[1 0 0 1 1 0 0 0 0]
[0 0 1 1 1 0 0 0 0]]
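The test above uses the TF 1.x Graph/Session style; as a sketch (assuming TensorFlow 2.x, which the answer itself does not cover), the same helper should work when called eagerly:
import tensorflow as tf
t = tf.constant([
    [ 0, 3, 4],
    [12, 2, 4]])
print(binarization(t, 9).numpy())  # same output as the session-based run above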

map matrix into specific vector with numpy

I have a matrix similar to this:
1 0 0
1 0 0
0 2 0
0 2 0
0 0 3
0 0 3
(Non-zero numbers denote parts that I'm interested in. The actual numbers inside the matrix could be random.)
And I need to produce vector like this:
[ 1 1 2 2 3 3 ].T
I can do this with loop:
result = np.zeros([rows])
for y in range(rows):
    x = y // (rows // cols)  # pick the index of the corresponding column
    result[y] = mat[y][x]
But I can't figure out how to do this in vector form.
This might be what you want.
import numpy as np
m = np.array([
    [1, 0, 0],
    [1, 0, 0],
    [0, 2, 0],
    [0, 2, 0],
    [0, 0, 3],
    [0, 0, 3]
])
rows, cols = m.shape
# row indices (axis 0)
y = np.arange(rows)
# column indices (axis 1)
x = y // (rows // cols)
result = m[y,x]
print(result)
Result:
[1 1 2 2 3 3]
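If, additionally, every row is guaranteed to contain exactly one marked entry and the marks are always non-zero (an extra assumption beyond the question, since a random value of interest could itself be 0), a shorter sketch is possible:
# relies on exactly one non-zero entry per row; True positions are scanned in row order
result = m[m != 0]
print(result)  # [1 1 2 2 3 3]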

In order to generate all combinations of 1's and 0's we use a simple binary table. How can I easily create this binary table in an array?

For example, the binary table for 3 bits:
0 0 0
0 0 1
0 1 0
0 1 1
1 0 0
1 0 1
1 1 0
1 1 1
And I want to store this into an n*n*2 array, so it would be:
0 0 0
0 0 1
0 1 0
0 1 1
1 0 0
1 0 1
1 1 0
1 1 1
For generating the combinations automatically, you can use the standard library's itertools.product, which generates all possible combinations of the supplied sequences, i.e. the Cartesian product across the input iterables. The repeat argument comes in handy here, as all of our sequences are identical ranges.
from itertools import product
x = [i for i in product(range(2), repeat=3)]
Now if we want an array instead of a list of tuples from that, we can just pass this to numpy.array.
import numpy as np
x = np.array(x)
# [[0 0 0]
# [0 0 1]
# [0 1 0]
# [0 1 1]
# [1 0 0]
# [1 0 1]
# [1 1 0]
# [1 1 1]]
If you want all elements in a single list, so you could index them with a single index, you could chain the iterable:
from itertools import chain, product
x = list(chain.from_iterable(product(range(2), repeat=3)))
result: [0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1]
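If you later need the tabular shape back from that flat list, a small sketch (assuming NumPy is acceptable here) is to reshape it into rows of 3:
import numpy as np
from itertools import chain, product
flat = list(chain.from_iterable(product(range(2), repeat=3)))
table = np.array(flat).reshape(-1, 3)  # back to the 8 x 3 binary table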
Most people would expect 2^n x n as in
np.c_[tuple(i.ravel() for i in np.mgrid[:2,:2,:2])]
# array([[0, 0, 0],
# [0, 0, 1],
# [0, 1, 0],
# [0, 1, 1],
# [1, 0, 0],
# [1, 0, 1],
# [1, 1, 0],
# [1, 1, 1]])
Explanation: np.mgrid as used here creates the coordinates of the corners of a unit cube, which happen to be all combinations of 0 and 1. The individual coordinates are then ravelled and joined as columns by np.c_.
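A quick look at the intermediate pieces may help (standard NumPy behavior): np.mgrid[:2, :2, :2] returns three stacked coordinate arrays, and ravelling each one yields one column of the final table.
import numpy as np
grid = np.mgrid[:2, :2, :2]  # shape (3, 2, 2, 2): one coordinate array per axis
print(grid[0].ravel())  # [0 0 0 0 1 1 1 1] -> most significant column
print(grid[2].ravel())  # [0 1 0 1 0 1 0 1] -> least significant column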
Here's a recursive, native python (no libraries) version of it:
def allBinaryPossiblities(maxLength, s=""):
    if len(s) == maxLength:
        return s
    else:
        temp = allBinaryPossiblities(maxLength, s + "0") + "\n"
        temp += allBinaryPossiblities(maxLength, s + "1")
        return temp

print(allBinaryPossiblities(3))
It prints all the possibilities:
000
001
010
011
100
101
110
111
