how to do this operation in numpy (chaining of tiling operation)? - python

I'm trying to generate a numpy array quickly, ideally without looping in Python.
I want to build a 1D index array from two inputs, the range lengths [2,3] and the repeat counts [2,4], which would return this:
[0,1,0,1,0,1,2,0,1,2,0,1,2,0,1,2]
Explanation:
I iterate from 0 to 2 (so [0,1] array) and repeat it 2 times : [0,1,0,1]
Then I iterate from 0 to 3 (so [0,1,2] array) and repeat it 4 times : [0,1,2,0,1,2,0,1,2,0,1,2]
Then I flatten everything.
Is there a way to do this fully in numpy?
For now I'm building each table separately in numpy with np.tile() and flattening everything afterwards, but I feel there is a more efficient way that would translate only to C function calls and no Python.

Here is a vectorized solution:
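import numpy as np
def cycles(spec):
    # spec is (lengths, repeats): emit each length once per cycle
    steps = np.repeat(*spec)
    ps = steps.cumsum()            # end position of every cycle
    psj = np.zeros(ps[-1], int)
    psj[ps[:-1]] = steps[:-1]      # at each new cycle start, step back by the previous cycle's length
    return np.arange(ps[-1]) - psj.cumsum()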
Demo:
>>> cycles(((2,3),(2,4)))
array([0, 1, 0, 1, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2])
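The same indices can also be produced with a modulo instead of the cumulative-sum trick. This is only a sketch: the name cycles_mod and its two-argument signature (lengths, repeats) are assumptions made for illustration and are not part of the answer above.

```python
import numpy as np

def cycles_mod(lengths, repeats):
    """Hypothetical helper: tile arange(length) `repeat` times per pair, flattened."""
    lengths = np.asarray(lengths)
    repeats = np.asarray(repeats)
    block = lengths * repeats                 # number of output elements per pair
    starts = np.cumsum(block) - block         # offset of each block in the output
    # position of every output element inside its own block
    pos = np.arange(block.sum()) - np.repeat(starts, block)
    # wrap the position by the length that belongs to that block
    return pos % np.repeat(lengths, block)

print(cycles_mod([2, 3], [2, 4]))
# [0 1 0 1 0 1 2 0 1 2 0 1 2 0 1 2]
```

Whether this is faster than the cumsum version depends on the input sizes, so it is worth timing both on real data.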

I am not entirely sure this is what you want. Here each tuple passed to func() contains first the range length and then the number of repeats.
import numpy
def func(tups):
    # preallocate the output, then fill one tiled block per tuple
    arr = numpy.empty(sum(n * reps for n, reps in tups), dtype=int)
    i = 0
    for n, reps in tups:
        arr[i:i + n * reps] = numpy.tile(numpy.arange(n), reps)
        i += n * reps
    return arr
arr = func([(2, 3), (3, 4)])
print(arr)
# [0 1 0 1 0 1 0 1 2 0 1 2 0 1 2 0 1 2]

Related

Why can't pandas replace NaN with an array of 0s using masks/replace?

I have a series like this
s = pd.Series([[1,2,3],[1,2,3],np.nan,[1,2,3],[1,2,3],np.nan])
and I simply want the NaN to be replaced by [0,0,0].
I have tried
s.fillna([0,0,0]) # TypeError: "value" parameter must be a scalar or dict, but you passed a "list"
s[s.isna()] = [[0,0,0],[0,0,0]] # just replaces the NaN with a single "0". WHY?!
s.fillna("NAN").replace({"NAN":[0,0,0]}) # ValueError: NumPy boolean array indexing assignment cannot
#assign 3 input values to the 2 output values where the mask is true
s.fillna("NAN").replace({"NAN":[[0,0,0],[0,0,0]]}) # TypeError: NumPy boolean array indexing assignment
# requires a 0 or 1-dimensional input, input has 2 dimensions
I really can't understand why the first two approaches won't work (maybe I get the first, but I can't wrap my head around the second).
Thanks to this SO-question and answer, we can do it by
is_na = s.isna()
s.loc[is_na] = s.loc[is_na].apply(lambda x: [0,0,0])
but since apply is often rather slow, I cannot understand why we can't use replace or the slicing as above.
pandas is painful to work with when cells hold lists. Here is a hacky solution:
s = s.fillna(pd.Series([[0,0,0]] * len(s), index=s.index))
print (s)
0 [1, 2, 3]
1 [1, 2, 3]
2 [0, 0, 0]
3 [1, 2, 3]
4 [1, 2, 3]
5 [0, 0, 0]
dtype: object
Another option is Series.reindex:
s.dropna().reindex(s.index, fill_value=[0, 0, 0])
0 [1, 2, 3]
1 [1, 2, 3]
2 [0, 0, 0]
3 [1, 2, 3]
4 [1, 2, 3]
5 [0, 0, 0]
dtype: object
The documentation for fillna() indicates that the fill value cannot be a list:
Value to use to fill holes (e.g. 0), alternately a
dict/Series/DataFrame of values specifying which value to use for each
index (for a Series) or column (for a DataFrame). Values not in the
dict/Series/DataFrame will not be filled. This value cannot be a list.
This is probably a limitation of the current implementation; short of patching the source code, you must resort to workarounds (as provided below).
However, if you are not planning to work with jagged arrays, what you probably want is to replace pd.Series() with pd.DataFrame(), e.g.:
import numpy as np
import pandas as pd
s = pd.DataFrame(
    [[1, 2, 3],
     [1, 2, 3],
     [np.nan],
     [1, 2, 3],
     [1, 2, 3],
     [np.nan]],
    dtype=pd.Int64Dtype())  # to mix integers with NaNs
s.fillna(0)
#    0  1  2
# 0  1  2  3
# 1  1  2  3
# 2  0  0  0
# 3  1  2  3
# 4  1  2  3
# 5  0  0  0
If you do need jagged arrays, you could use any of the workarounds proposed in the other answers, or you could make one of your attempts work, e.g.:
ii = s.isna()
nn = ii.sum()
s[ii] = pd.Series([[0, 0, 0]] * nn).to_numpy()
# 0 [1, 2, 3]
# 1 [1, 2, 3]
# 2 [0, 0, 0]
# 3 [1, 2, 3]
# 4 [1, 2, 3]
# 5 [0, 0, 0]
# dtype: object
which basically uses NumPy masking to fill in the Series. The trick is to generate a compatible object for the assignment that works at the NumPy level.
If the input contains many NaNs, it is probably faster to work in a similar way but with s.notna() instead, e.g.:
import pandas as pd
result = pd.Series([[0, 0, 0]] * len(s))
result[s.notna()] = s[s.notna()]
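One thing to watch with `[[0, 0, 0]] * n` is that it repeats the same list object n times, so mutating one filled cell later would change all of them. A minimal sketch that gives every missing cell its own list is shown below; the helper name fill_missing_lists is hypothetical, and it assumes that every cell which is not a list or tuple counts as missing.

```python
import numpy as np
import pandas as pd

def fill_missing_lists(s, value):
    """Hypothetical helper: replace non-list cells with a fresh copy of `value`."""
    filled = [x if isinstance(x, (list, tuple)) else list(value) for x in s]
    return pd.Series(filled, index=s.index, dtype=object)

s = pd.Series([[1, 2, 3], np.nan, [1, 2, 3], np.nan])
out = fill_missing_lists(s, [0, 0, 0])
print(out.tolist())
# [[1, 2, 3], [0, 0, 0], [1, 2, 3], [0, 0, 0]]
```

This loops in Python, so it is not expected to beat the vectorized assignments above; it trades speed for independent fill values.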
Let's try some benchmarking, where:
replace_nan_isna() is from above
import pandas as pd
def replace_nan_isna(s, value, inplace=False):
    if not inplace:
        s = s.copy()
    ii = s.isna()
    nn = ii.sum()
    s[ii] = pd.Series([value] * nn).to_numpy()
    return s
replace_nan_notna() is also from above
def replace_nan_notna(s, value, inplace=False):
    if inplace:
        raise ValueError("In-place not supported!")
    result = pd.Series([value] * len(s))
    result[s.notna()] = s[s.notna()]
    return result
replace_nan_reindex() is from @ShubhamSharma's answer
def replace_nan_reindex(s, value, inplace=False):
    if not inplace:
        s = s.copy()
    # reindex() returns a new Series; keep it instead of discarding it
    s = s.dropna().reindex(s.index, fill_value=value)
    return s
replace_nan_fillna() is from @jezrael's answer
def replace_nan_fillna(s, value, inplace=False):
    if not inplace:
        s = s.copy()
    # fillna() returns a new Series; keep it instead of discarding it
    s = s.fillna(pd.Series([value] * len(s), index=s.index))
    return s
with the following code:
import numpy as np
import pandas as pd
def gen_data(n=5, k=2, p=0.7, obj=(1, 2, 3)):
    return pd.Series(([obj] * int(p * n) + [np.nan] * (n - int(p * n))) * k)
funcs = replace_nan_isna, replace_nan_notna, replace_nan_reindex, replace_nan_fillna
value = (0, 0, 0)  # defined before first use
# : inspect results
s = gen_data(5, 1)
for func in funcs:
    print(f'{func.__name__:>20s} {func(s, value)}')
print()
# : generate benchmarks (%timeit is an IPython magic)
s = gen_data(100, 1000)
base = funcs[0](s, value)
for func in funcs:
    print(f'{func.__name__:>20s} {(func(s, value) == base).all()!s:>5}', end=' ')
    %timeit func(s, value)
# replace_nan_isna True 100 loops, best of 5: 16.5 ms per loop
# replace_nan_notna True 10 loops, best of 5: 46.5 ms per loop
# replace_nan_reindex True 100 loops, best of 5: 9.74 ms per loop
# replace_nan_fillna True 10 loops, best of 5: 36.4 ms per loop
indicating that reindex() may be the fastest approach.

How can I replace exactly the same number of elements in a numpy 2D matrix?

I have a symmetric 2D numpy matrix. It contains only ones and zeros, and the diagonal elements are always 0.
I want to replace some of the ones with zeros, and the result needs to stay symmetric too. How many elements are selected depends on the parameter replace_rate.
Since the matrix is symmetric, I take the upper half, randomly select elements whose value is 1, and change them from 1 to 0. Then, with a mirror operation, I make sure the whole matrix is still symmetric.
For example
com = np.array([[0, 1, 1, 1, 1],
                [1, 0, 1, 1, 1],
                [1, 1, 0, 1, 1],
                [1, 1, 1, 0, 1],
                [1, 1, 1, 1, 0]])
replace_rate = 0.1
com = np.triu(com)
mask = np.random.choice([0,1],size=(com.shape),p=((1-replace_rate),replace_rate)).astype(bool)
r1 = np.random.rand(*com.shape)
com[mask] = r1[mask]
com += com.T - np.diag(com.diagonal())
com is a (5,5) symmetric matrix, and 10% of its elements (counting only those whose value is 1; the diagonal elements are excluded) should be randomly replaced by 0.
The question is: how can I make sure the number of changed elements stays the same each time?
Keeping the same replace_rate = 0.1, sometimes I get a result like:
com = np.array([[0, 1, 1, 1, 1],
                [1, 0, 1, 1, 1],
                [1, 1, 0, 1, 1],
                [1, 1, 1, 0, 1],
                [1, 1, 1, 1, 0]])
Nothing changed this time, and if I repeat it, 2 elements change:
com = np.array([[0, 1, 1, 1, 1],
                [1, 0, 1, 1, 1],
                [1, 1, 0, 1, 0],
                [1, 1, 1, 0, 1],
                [1, 1, 0, 1, 0]])
I want to know how to fix the number of changed elements for a given replace_rate.
Thanks in advance!!
How about something like this:
import random
def make_transform(m, replace_rate):
    changed = []  # keep track of index pairs we already changed
    def get_random():
        # Get a random pair of indices which are not equal (i.e. not on the diagonal)
        c1, c2 = random.choices(range(len(m)), k=2)
        if c1 == c2 or (c1, c2) in changed or (c2, c1) in changed:
            return get_random()  # Recurse until we find an i,j pair with i != j that hasn't already been changed
        else:
            changed.append((c1, c2))
            return c1, c2
    n_changes = int(m.shape[0]**2 * replace_rate)  # the number of changes to make
    print(n_changes)
    for _ in range(n_changes):
        i, j = get_random()  # Get a valid index pair
        m[i][j] = m[j][i] = 0
    return m
This is the solution I suggest:
import numpy as np
def rand_zero(mat, replace_rate):
    triu_mat = np.triu(mat)
    _ind = np.where(triu_mat != 0)  # gets indices of non-zero elements, not just non-diagonals
    ind = [x for x in zip(*_ind)]
    chng = np.random.choice(range(len(ind)),  # select some indices, at rate 'replace_rate'
                            size=int(replace_rate * mat.size),
                            replace=False)  # do not select duplicates
    mod_mat = triu_mat
    for c in chng:
        mod_mat[ind[c]] = 0
    mod_mat = mod_mat + mod_mat.T
    return mod_mat
I use int() to truncate to an integer in size, but you can use round() if that's what you desire.
Hope this gives consistent results!
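The loop over chosen indices can also be replaced by fancy indexing. The sketch below is one way to do it, under an assumption that differs from the code above: the count is taken as a fraction of the ones in the strict upper triangle rather than of mat.size. The name zero_exact and the rng parameter are hypothetical.

```python
import numpy as np

def zero_exact(mat, replace_rate, rng=None):
    """Hypothetical helper: zero exactly round(rate * n_ones) symmetric pairs."""
    rng = np.random.default_rng() if rng is None else rng
    out = mat.copy()
    # coordinates of the ones strictly above the diagonal
    rows, cols = np.nonzero(np.triu(out, k=1))
    n_change = int(round(replace_rate * rows.size))
    # draw distinct positions, so the count is the same on every call
    pick = rng.choice(rows.size, size=n_change, replace=False)
    out[rows[pick], cols[pick]] = 0
    out[cols[pick], rows[pick]] = 0   # mirror to keep the matrix symmetric
    return out

com = np.ones((5, 5), dtype=int) - np.eye(5, dtype=int)
res = zero_exact(com, 0.2, rng=np.random.default_rng(0))
print(int(com.sum() - res.sum()))   # 4: two pairs, each mirrored
```

Because the selection is made without replacement from the existing ones, each call changes the same number of entries; only their positions vary.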

I want to take the XOR of all the elements of 1 list with another. How do I do it? [duplicate]

I have a bunch of lists in the form of say [0,0,1,0,1...], and I want to take the XOR of 2 lists and give the output as a list.
Like:
[ 0, 0, 1 ] XOR [ 0, 1, 0 ] -> [ 0, 1, 1 ]
res = []
tmp = []
for i in Employee_Specific_Vocabulary_Dict['Binary Vector']:
    for j in Course_Specific_Vocabulary_Dict['Binary Vector']:
        tmp = [i[index] ^ j[index] for index in range(len(i))]
        res.append(tmp)
Each of my lists / vectors has around 3500 elements, so I need something to save time, since this piece of code takes more than 20 minutes to run.
I have 3085 lists, each of which needs an XOR operation with 4089 other lists.
How do I do this without iterating through each list explicitly?
Use map (after import operator):
answer = list(map(operator.xor, lst1, lst2))
or zip:
answer = [x ^ y for x,y in zip(lst1, lst2)]
If you need something faster, consider using NumPy instead of Python lists to hold your data.
Assuming a and b are the same size you can use the xor operation (i.e. ^) with simple list indexing:
a = [0, 0, 1]
b = [0, 1, 1]
c = [a[index] ^ b[index] for index in range(len(a))]
print(c) # [0, 1, 0]
or you can use zip with the xor:
a = [0, 0, 1]
b = [0, 1, 1]
c = [x ^ y for x, y in zip(a, b)]
print(c) # [0, 1, 0]
zip will only go to the shortest list (if they are not the same size). If they are not the same size and you want to go to the longer list you can use zip_longest:
from itertools import zip_longest
a = [0, 0, 1, 1]
b = [0, 1, 1]
c = [x ^ y for x, y in zip_longest(a, b, fillvalue=0)]
print(c) # [0, 1, 0, 1]
With numpy you should see some performance gains. The function you need is bitwise_xor, like so:
import numpy as np
results = []
for i in Employee_Specific_Vocabulary_Dict['Binary Vector']:
    for j in Course_Specific_Vocabulary_Dict['Binary Vector']:
        results.append(np.bitwise_xor(i, j))
A proof of concept:
a = [1,0,0,1,1]
b = [1,1,0,0,1]
x = np.bitwise_xor(a,b)
print("a\tb\tres")
for i in range(len(a)):
    print("{}\t{}\t{}".format(a[i], b[i], x[i]))
output:
a       b       res
1       1       0
0       1       1
0       0       0
1       0       1
1       1       0
Edit
Note that if your arrays have the same size, you can simply do one operation and the bitwise_xor will still work, so:
a = [[1,1,0], [0,0,1]]
b = [[0,1,0], [1,0,1]]
res = np.bitwise_xor(a, b)
will still work, and you'll have:
res: [[1, 0, 0], [1, 0, 0]]
In your case, a workaround would possibly be:
results = []
n = len(Course_Specific_Vocabulary_Dict['Binary Vector'])
for a in Employee_Specific_Vocabulary_Dict['Binary Vector']:
    # Get an array of the same size as Course_Specific_Vocabulary_Dict['Binary Vector']
    repeated_a = np.repeat([a], n, axis=0)
    results.append(np.bitwise_xor(repeated_a, Course_Specific_Vocabulary_Dict['Binary Vector']))
However, I don't know whether that would actually improve performance, so it should be checked; it will certainly require some more memory.
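NumPy broadcasting can do the repetition implicitly, without np.repeat. The following is a sketch with hypothetical function names; it assumes the vectors are stored as rows of two 2D arrays of 0/1 values. For the sizes in the question (3085 x 4089 x 3500) the full result would hold about 4.4e10 elements, roughly 44 GB at one byte each, so the chunked variant is probably the more practical one.

```python
import numpy as np

def pairwise_xor(a, b):
    """XOR every row of a with every row of b; shape (len(a), len(b), width)."""
    a = np.asarray(a, dtype=np.uint8)
    b = np.asarray(b, dtype=np.uint8)
    # the inserted axes make each row of a broadcast against all rows of b
    return a[:, None, :] ^ b[None, :, :]

def pairwise_xor_chunks(a, b, chunk=64):
    """Yield the same result in blocks of `chunk` rows of a, to bound memory use."""
    a = np.asarray(a, dtype=np.uint8)
    b = np.asarray(b, dtype=np.uint8)
    for start in range(0, len(a), chunk):
        yield a[start:start + chunk, None, :] ^ b[None, :, :]

a = [[0, 0, 1], [1, 1, 0]]
b = [[0, 1, 0], [1, 1, 1]]
print(pairwise_xor(a, b)[0, 0])   # [0 1 1]
```

Each yielded block can be reduced or written out before the next one is computed, which keeps only one block in memory at a time.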

Compute the length of consecutive true values in a list

Essentially this problem can be split into two parts. I have a set of binary values that indicate whether a given signal is present or not. Given that each value also corresponds to a unit of time (in this case minutes), I am trying to determine how long the signal lasts on average, given its occurrences within the overall list of values throughout the period I'm analyzing. For example, if I have the following list:
[0,0,0,1,1,1,0,0,1,0,0,0,1,1,1,1,0]
I can see that the signal occurs 3 separate times for variable lengths of time (i.e. in the first case for 3 minutes). If I want to calculate the average length of time for each occurrence however I need an indication of how many independent instances of the signal exist (i.e. 3). I have tried various index based strategies such as:
arb_ops.index(1)
to find the next occurrence of true values and correspondingly finding the next occurrence of 0 to find the length but am having trouble contextualizing this into a recursive function for the entire array.
You could use itertools.groupby() to group consecutive equal elements. To calculate a group's length convert the iterator to a list and apply len() to it:
>>> from itertools import groupby
>>> lst = [0 ,0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0 ,1, 1, 1, 1, 0]
>>> for k, g in groupby(lst):
...     g = list(g)
...     print(k, g, len(g))
...
0 [0, 0, 0] 3
1 [1, 1, 1] 3
0 [0, 0] 2
1 [1] 1
0 [0, 0, 0] 3
1 [1, 1, 1, 1] 4
0 [0] 1
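Building on the groupby output above, the average duration asked for in the question can be obtained by keeping only the groups whose key is 1. This is a small sketch; the helper name run_lengths is made up for illustration.

```python
from itertools import groupby

def run_lengths(values):
    """Lengths of each run of consecutive 1s, in order of appearance."""
    return [sum(1 for _ in group) for key, group in groupby(values) if key == 1]

lst = [0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0]
lengths = run_lengths(lst)
print(lengths)                       # [3, 1, 4]
print(len(lengths))                  # 3 separate occurrences
print(sum(lengths) / len(lengths))   # 2.6666666666666665
```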
Another option may be MaskedArray.count, which counts non-masked elements of an array along a given axis:
import numpy.ma as ma
a = ma.arange(6).reshape((2, 3))
a[1, :] = ma.masked
a
masked_array(data =
  [[0 1 2]
   [-- -- --]],
  mask =
  [[False False False]
   [ True  True  True]],
  fill_value = 999999)
a.count()
3
You can extend Masked Arrays quite far...
@eugene-yarmash's solution with groupby is decent. However, if you want a solution that requires no import and does the grouping itself (for learning purposes), you could try this:
>>> l = [0,0,0,1,1,1,0,0,1,0,0,0,1,1,1,1,0]
>>> def size(xs):
...     sz = 0
...     for x in xs:
...         if x == 0 and sz > 0:
...             yield sz
...             sz = 0
...         if x == 1:
...             sz += 1
...     if sz > 0:
...         yield sz
...
>>> list(size(l))
[3, 1, 4]
I think this problem is actually pretty simple--you know you have a new signal if you see a value is 1, and the previous value is 0.
The code I provided is kind of long, but super simple, and done without imports.
signal = [0,0,0,1,1,1,0,0,1,0,0,0,1,1,1,1,0]
def find_number_of_signals(signal):
    signal_counter = 0
    signal_duration = 0
    for index in range(len(signal)):
        if signal[index] == 1:
            signal_duration += 1
            # a new signal starts at the first element or right after a 0
            if index == 0 or signal[index - 1] == 0:
                signal_counter += 1
    print(signal_counter)
    print(signal_duration)
    print(signal_duration / signal_counter)
find_number_of_signals(signal)
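The rule stated above (a signal starts where a 1 follows a 0) can also be expressed with np.diff, which avoids the explicit loop. This is a sketch assuming NumPy is acceptable here; run_lengths_np is a hypothetical name.

```python
import numpy as np

def run_lengths_np(signal):
    """Lengths of each run of 1s, found from the rising and falling edges."""
    x = np.asarray(signal, dtype=int)
    # pad with zeros so runs touching either end still have both edges
    edges = np.diff(np.concatenate(([0], x, [0])))
    starts = np.flatnonzero(edges == 1)    # index where each run begins
    ends = np.flatnonzero(edges == -1)     # index just past each run
    return ends - starts

signal = [0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0]
lengths = run_lengths_np(signal)
print(lengths)          # [3 1 4]
print(lengths.mean())
```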

Creating a special matrix in numpy

[a b c      ]
[  a b c    ]
[    a b c  ]
[      a b c]
Hello
For my economics course we are supposed to create an array that looks like this. The problem is that I am an economist, not a programmer. We are using numpy in Python. Our professor says college is not preparing us for the real world and wants us to learn programming (which is a good thing). We are not allowed to use any packages and must come up with original code. Does anybody out there have any idea how to make this matrix? I have spent hours trying code and browsing the internet looking for help, and have been unsuccessful.
This kind of matrix is called a Toeplitz matrix or constant diagonal matrix. Knowing this leads you to scipy.linalg.toeplitz:
import scipy.linalg
scipy.linalg.toeplitz([1, 0, 0, 0], [1, 2, 3, 0, 0, 0])
=>
array([[1, 2, 3, 0, 0, 0],
[0, 1, 2, 3, 0, 0],
[0, 0, 1, 2, 3, 0],
[0, 0, 0, 1, 2, 3]])
The method below fills one diagonal at a time:
import numpy as np
x = np.zeros((4, 6), dtype=int)
for i, v in enumerate((6,7,8)):
    np.fill_diagonal(x[:,i:], v)
array([[6, 7, 8, 0, 0, 0],
[0, 6, 7, 8, 0, 0],
[0, 0, 6, 7, 8, 0],
[0, 0, 0, 6, 7, 8]])
or you could do the one liner:
x = [6,7,8,0,0,0]
y = np.vstack([np.roll(x,i) for i in range(4)])
Personally, I prefer the first since it's easier to understand and probably faster since it doesn't build all the temporary 1D arrays.
Edit:
Since a discussion of efficiency has come up, it might be worthwhile to run a test. I also included timings for the toeplitz method suggested by chthonicdaemon (although personally I interpreted the question as excluding this approach, since it uses a package rather than original code, and speed isn't the point of the original question either).
import numpy as np
import timeit
import scipy.linalg as sl
vals = (6, 7, 8)  # the three diagonal values used by all methods
def a(m, n):
    x = np.zeros((m, n), dtype=int)
    for i, v in enumerate(vals):
        np.fill_diagonal(x[:,i:], v)
def b(m, n):
    x = np.zeros((n,))
    x[:3] = vals
    y = np.vstack([np.roll(x,i) for i in range(m)])
def c(m, n):
    x = np.zeros((n,))
    x[:3] = vals
    y = np.zeros((m,))
    y[0] = vals[0]
    r = sl.toeplitz(y, x)
    return r
m, n = 4, 6
print(timeit.timeit("a(m,n)", "from __main__ import np, a, b, m, n", number=1000))
print(timeit.timeit("b(m,n)", "from __main__ import np, a, b, m, n", number=1000))
print(timeit.timeit("c(m,n)", "from __main__ import np, c, sl, m, n", number=1000))
m, n = 1000, 1006
print(timeit.timeit("a(m,n)", "from __main__ import np, a, b, m, n", number=1000))
print(timeit.timeit("b(m,n)", "from __main__ import np, a, b, m, n", number=1000))
print(timeit.timeit("c(m,n)", "from __main__ import np, c, sl, m, n", number=100))  # note: 100 runs, not 1000
# which gives:
0.03525209 # fill_diagonal
0.07554483 # vstack
0.07058787 # toeplitz
0.18803215 # fill_diagonal
2.58780789 # vstack
1.57608604 # toeplitz
So the first method is about 2-3x faster for small arrays and 10-20x faster for larger arrays (note that the last toeplitz timing used only 100 runs).
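For comparison, the same banded layout can be built without any loop by comparing row and column indices. This is only a sketch and was not part of the timings above; the function name banded is hypothetical.

```python
import numpy as np

def banded(vals, rows):
    """Place vals on the main diagonal and the diagonals to its right."""
    vals = np.asarray(vals)
    cols = rows + len(vals) - 1
    # k[i, j] = j - i: which diagonal the cell (i, j) sits on
    k = np.arange(cols)[None, :] - np.arange(rows)[:, None]
    inside = (k >= 0) & (k < len(vals))
    out = np.zeros((rows, cols), dtype=vals.dtype)
    out[inside] = vals[k[inside]]   # diagonal k receives vals[k]
    return out

print(banded([6, 7, 8], 4))
# [[6 7 8 0 0 0]
#  [0 6 7 8 0 0]
#  [0 0 6 7 8 0]
#  [0 0 0 6 7 8]]
```

It allocates an index array of the same shape as the result, so it is not expected to beat fill_diagonal on large inputs.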
This is a simplified tridiagonal matrix, so it is essentially the same as this question.
def tridiag(a, b, c, k1=-1, k2=0, k3=1):
    return np.diag(a, k1) + np.diag(b, k2) + np.diag(c, k3)
a = [1, 1]; b = [2, 2, 2]; c = [3, 3]
A = tridiag(a, b, c)
print(A)
Result:
array([[2, 3, 0],
[1, 2, 3],
[0, 1, 2]])
Something along the lines of
import numpy as np
def createArray(theinput,rotations) :
    l = [theinput]
    for i in range(1,rotations) :
        l.append(l[i-1][:])
        l[i].insert(0,l[i].pop())
    return np.array(l)
print(createArray([1,2,3,0,0,0],4))
"""
[[1 2 3 0 0 0]
[0 1 2 3 0 0]
[0 0 1 2 3 0]
[0 0 0 1 2 3]]
"""
If you care about efficiency, it is hard to beat this:
import numpy as np
def create_matrix(diags, n):
    diags = np.asarray(diags)
    m = np.zeros((n,n+len(diags)-1), diags.dtype)
    s = m.strides
    # view m so that v[k, i] aliases m[i, i+k], i.e. row k of v is the k-th diagonal
    v = np.lib.stride_tricks.as_strided(
        m,
        (len(diags),n),
        (s[1],sum(s)))
    v[:] = diags[:,None]
    return m
print(create_matrix(['a','b','c'], 8))
Might be a little over your head, but then again that's good inspiration ;)
Or even better: a solution which has both O(n) storage and runtime requirements, rather than all the other solutions posted thus far, which are O(n^2):
import numpy as np
def create_matrix(diags, n):
    diags = np.asarray(diags)
    b = np.zeros(len(diags)+n*2, diags.dtype)
    b[n:][:len(diags)] = diags
    s = b.strides[0]
    # each row starts one element earlier in b, so v[i, j] aliases b[n - i + j]
    v = np.lib.stride_tricks.as_strided(
        b[n:],
        (n,n+len(diags)-1),
        (-s,s))
    return v
print(create_matrix(np.arange(1,4), 8))
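A view built with as_strided has rows that overlap in memory, so writing to it can silently change several cells at once. A possibly safer way to get the same O(n) storage is np.lib.stride_tricks.sliding_window_view (available in NumPy 1.20 and later), which returns a read-only view by default. The sketch below assumes that function is available; the name banded_view is hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def banded_view(diags, n):
    """Read-only (n, n + len(diags) - 1) view backed by a 1D buffer."""
    diags = np.asarray(diags)
    k = len(diags)
    # n - 1 zeros, then the diagonal values, then n - 1 zeros
    buf = np.zeros(2 * (n - 1) + k, dtype=diags.dtype)
    buf[n - 1:n - 1 + k] = diags
    # window t starts at buf[t]; reversing the windows shifts the values right row by row
    return sliding_window_view(buf, n + k - 1)[::-1]

print(banded_view(np.arange(1, 4), 4))
# [[1 2 3 0 0 0]
#  [0 1 2 3 0 0]
#  [0 0 1 2 3 0]
#  [0 0 0 1 2 3]]
```

Call .copy() on the result if an independent, writable matrix is needed.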
This is an old question, however some new input can always be useful.
I create tridiagonal matrices in python using list comprehension.
Say a matrix that is symmetric around "-2" and has a "1" on either side:
            -2  1  0
Tsym(3) =>   1 -2  1
             0  1 -2
This can be created using the following "one liner":
Tsym = lambda n: [ [ 1 if (i+1==j or i-1==j) else -2 if j==i else 0 for i in range(n) ] for j in range(n)] # Symmetric tridiagonal matrix (1,-2,1)
A different case (which several of the other answers have solved perfectly well) is:
              1 2 3 0 0 0
Tgen(4,6) =>  0 1 2 3 0 0
              0 0 1 2 3 0
              0 0 0 1 2 3
It can be made using the one liner shown below.
Tgen = lambda n,m: [ [ 1 if i==j else 2 if i==j+1 else 3 if i==j+2 else 0 for i in range(m) ] for j in range(n)] # General tridiagonal matrix (1,2,3)
Feel free to modify to suit your specific needs. These matrices are very common when modelling physical systems and I hope this is useful to someone (other than me).
Hello. Since your professor asked you not to import any external package, and most answers use numpy or scipy,
you are better off using only Python lists to create a 2D array (a nested list) and then populating its diagonals with the items you want. Find the code below:
def create_matrix(rows = 4, cols = 6):
    mat = [[0 for col in range(cols)] for row in range(rows)] # create a matrix filled with zeros of size (4,6)
    for row in range(len(mat)): # gives number of lists in the main list
        for col in range(len(mat[0])): # gives number of items in sub-list 0, but all sublists have the same length
            if row == col:
                mat[row][col] = "a"
            if col == row+1:
                mat[row][col] = "b"
            if col == row+2:
                mat[row][col] = "c"
    return mat
create_matrix(4, 6)
[['a', 'b', 'c', 0, 0, 0],
[0, 'a', 'b', 'c', 0, 0],
[0, 0, 'a', 'b', 'c', 0],
[0, 0, 0, 'a', 'b', 'c']]
Creating Band Matrix
Check out its definition on Wikipedia:
https://en.wikipedia.org/wiki/Band_matrix
You can use this function to create band matrices such as a diagonal matrix with offset=0, a tridiagonal matrix (the one you are asking about) with offset=1, or a pentadiagonal matrix with offset=2.
import numpy as np
def band(size=10, ones=False, low=0, high=100, offset=2):
    shape = (size, size)
    n_matrix = np.random.randint(low, high, shape) if not ones else np.ones(shape,dtype=int)
    n_matrix = np.triu(n_matrix, -1*offset)  # zero everything below the lowest kept diagonal
    n_matrix = np.tril(n_matrix, offset)     # zero everything above the highest kept diagonal
    return n_matrix
In your case you should use this
rand_tridiagonal = band(size=6,offset=1)
print(rand_tridiagonal)
