I'm running into a problem with numpy arrays.
I used CountVectorizer from sklearn, with a word set and the values of a pandas column, to build a bag-of-words (BoW) array of per-row count arrays. When I print the array and its shape, I get this result:
[[array([0, 5, 0, ..., 0, 0, 0])]
[array([0, 0, 0, ..., 0, 0, 0])]
[array([0, 0, 0, ..., 0, 0, 0])]
...
[array([0, 0, 0, ..., 0, 0, 0])]
[array([0, 0, 0, ..., 0, 0, 0])]
[array([0, 0, 0, ..., 0, 0, 0])]] (2800, 1)
An array of arrays with a (2800, 1) column-vector shape?
I checked that all rows have the same size.
Here is a way to reproduce my problem:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
data = pd.DataFrame(["blop blip blup", "bop bip bup", "boop boip boup"], columns=["corpus"])
# add labels column
data["label"] = ["blop", "bip", "boup"]
wordset = pd.Series([y for x in data["corpus"].str.split() for y in x]).unique()
cvec = CountVectorizer(vocabulary=wordset, ngram_range=(1, 2))
labels_count_np = data["label"].apply(lambda x: cvec.fit_transform([x]).toarray()[0]).values
print(labels_count_np, labels_count_np.shape)
Running it returns:
[array([1, 0, 0, 0, 0, 0, 0, 0, 0]) array([0, 0, 0, 0, 1, 0, 0, 0, 0])
array([0, 0, 0, 0, 0, 0, 0, 0, 1])] (3,)
Can someone explain why numpy behaves this way?
Also, I tried to find a way to concatenate multiple arrays like this:
A = [array([1, 0, 0, 0, 0, 0, 0, 0, 0]) array([0, 0, 0, 0, 1, 0, 0, 0, 0])
array([0, 0, 0, 0, 0, 0, 0, 0, 1])]
B = [array([0, 7, 2, 0]) array([1, 4, 0, 8])
array([6, 1, 0, 9])]
concatenate(A,B) =>
[
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 2, 0],
[0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 4, 0, 8],
[0, 0, 0, 0, 0, 0, 0, 0, 1, 6, 1, 0, 9]
]
But I have not found a good way to do it.
The .values of a DataFrame, even one with just a single column, will be 2d. The .values of a Series (one column of the frame) will be 1d.
If labels_count_np has shape (2800, 1), you can easily make it 1d with labels_count_np[:,0] or np.squeeze(labels...). That's just basic numpy.
It will still be an object-dtype array containing arrays (the elements of the dataframe cells). If those arrays are all the same size, then
np.stack(labels_count_np[:,0])
should create a 2d numeric array.
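Applied to the reproduction above, where labels_count_np comes from a Series and is therefore already 1d, a quick sketch (the column count assumes the 9-word vocabulary of that example):
stacked = np.stack(labels_count_np)  # object array of per-label count vectors -> plain 2d array
print(stacked.shape)                 # (3, 9) here; (2800, vocabulary size) for the real data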
Make a frame with array elements:
In [35]: df = pd.DataFrame([None,None,None], columns=['x'])
In [36]: df
Out[36]:
x
0 None
1 None
2 None
In [37]: for i in range(3):df['x'][i] = np.zeros(4,int)
In [38]: df
Out[38]:
x
0 [0, 0, 0, 0]
1 [0, 0, 0, 0]
2 [0, 0, 0, 0]
The 2d array from the frame:
In [39]: df.values
Out[39]:
array([[array([0, 0, 0, 0])],
[array([0, 0, 0, 0])],
[array([0, 0, 0, 0])]], dtype=object)
In [40]: _.shape
Out[40]: (3, 1)
from the Series:
In [41]: df['x'].values
Out[41]:
array([array([0, 0, 0, 0]), array([0, 0, 0, 0]), array([0, 0, 0, 0])],
dtype=object)
In [42]: _.shape
Out[42]: (3,)
Joining the Series values into one 2d array:
In [43]: np.stack(df['x'].values)
Out[43]:
array([[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0]])
You can concatenate using a list comprehension:
C = [np.append(x, B[i]) for i, x in enumerate(A)]
Output:
[array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 2, 0]),
array([0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 4, 0, 8]),
array([0, 0, 0, 0, 0, 0, 0, 0, 1, 6, 1, 0, 9])]
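If all the vectors in A share one length and all the vectors in B share another, a sketch of an alternative is to stack each collection into a 2d array first and then concatenate along the columns:
import numpy as np
A2 = np.stack(A)         # shape (3, 9)
B2 = np.stack(B)         # shape (3, 4)
C = np.hstack([A2, B2])  # shape (3, 13), one plain 2d numeric array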
I know you can iterate over a 2d matrix using two indexes like this:
import numpy as np
A = np.zeros((10,10))
for i in range(0,10):
    for j in range(0,10):
        if (i==j):
            A[i,j] = 4
Is there a way of doing this using only one for loop or using slices?
EDIT:
I also need to handle cases where i != j, for example:
A = np.zeros((10,10))
for i in range(0,10):
    for j in range(0,10):
        if (i==j):
            A[i,j] = 1
        if (i+1 ==j):
            A[i,j] = 2
        if (i-1==j):
            A[i,j] = 3
You can always collapse multiple loops into one by computing the indices each iteration with floor division and the modulo operator, like so:
import math
import numpy as np
A = np.zeros((10,10))
for x in range(100):
    i = math.floor(x/10)  # equivalently x // 10
    j = x % 10
    if (i==j):
        A[i,j] = 1
    if (i+1 ==j):
        A[i,j] = 2
    if (i-1==j):
        A[i,j] = 3
With only i==j it could be even simpler:
for i in range(10):
    A[i,i] = 4
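As an aside, numpy ships a helper that does the same thing without any loop (standard numpy, not taken from the answers above):
import numpy as np
A = np.zeros((10,10), int)
np.fill_diagonal(A, 4)  # sets A[i, i] = 4 for every i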
In [129]: A = np.zeros((10,10), int)
     ...: for i in range(0,10):
     ...:     for j in range(0,10):
     ...:         if (i==j):
     ...:             A[i,j] = 1
     ...:         if (i+1 ==j):
     ...:             A[i,j] = 2
     ...:         if (i-1==j):
     ...:             A[i,j] = 3
     ...:
You should have shown the resulting A:
In [130]: A
Out[130]:
array([[1, 2, 0, 0, 0, 0, 0, 0, 0, 0],
[3, 1, 2, 0, 0, 0, 0, 0, 0, 0],
[0, 3, 1, 2, 0, 0, 0, 0, 0, 0],
[0, 0, 3, 1, 2, 0, 0, 0, 0, 0],
[0, 0, 0, 3, 1, 2, 0, 0, 0, 0],
[0, 0, 0, 0, 3, 1, 2, 0, 0, 0],
[0, 0, 0, 0, 0, 3, 1, 2, 0, 0],
[0, 0, 0, 0, 0, 0, 3, 1, 2, 0],
[0, 0, 0, 0, 0, 0, 0, 3, 1, 2],
[0, 0, 0, 0, 0, 0, 0, 0, 3, 1]])
So you have set 3 diagonals:
In [131]: A[np.arange(10),np.arange(10)]
Out[131]: array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
In [132]: A[np.arange(9),np.arange(1,10)]
Out[132]: array([2, 2, 2, 2, 2, 2, 2, 2, 2])
In [133]: A[np.arange(1,10),np.arange(9)]
Out[133]: array([3, 3, 3, 3, 3, 3, 3, 3, 3])
The key to eliminating loops in numpy is to get a big picture of the task, rather than focusing on the iterative steps.
There are various tools for making a diagonal array. One is np.diag, which can be used thus:
In [139]: (np.diag(np.ones(10,int),0) +
     ...:  np.diag(np.ones(9,int)*2,1) +
     ...:  np.diag(np.ones(9,int)*3,-1))
Out[139]:
array([[1, 2, 0, 0, 0, 0, 0, 0, 0, 0],
[3, 1, 2, 0, 0, 0, 0, 0, 0, 0],
[0, 3, 1, 2, 0, 0, 0, 0, 0, 0],
[0, 0, 3, 1, 2, 0, 0, 0, 0, 0],
[0, 0, 0, 3, 1, 2, 0, 0, 0, 0],
[0, 0, 0, 0, 3, 1, 2, 0, 0, 0],
[0, 0, 0, 0, 0, 3, 1, 2, 0, 0],
[0, 0, 0, 0, 0, 0, 3, 1, 2, 0],
[0, 0, 0, 0, 0, 0, 0, 3, 1, 2],
[0, 0, 0, 0, 0, 0, 0, 0, 3, 1]])
Or, adapting [131] etc.:
In [140]: A = np.zeros((10,10), int)
...: A[np.arange(10),np.arange(10)]=1
...: A[np.arange(9),np.arange(1,10)]=2
...: A[np.arange(1,10),np.arange(9)]=3
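np.eye with an offset k builds the same bands, if you prefer it to np.diag (another standard-numpy option, mentioned as an aside):
1*np.eye(10, k=0, dtype=int) + 2*np.eye(10, k=1, dtype=int) + 3*np.eye(10, k=-1, dtype=int)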
Because you're only executing code when i == j, you can just use:
for i in range(0,10):
    A[i,i] = 4
I'd like to take the difference of non-adjacent values within a 2D numpy array along axis=-1 (per row). The array can have a large number of rows.
Each row is a selection of values along a timeline from 1 to N.
For N=12, the array could look like the 3x12 example below:
timeline = np.array([[ 0, 0, 0, 4, 0, 6, 0, 0, 9, 0, 11, 0],
[ 1, 0, 3, 4, 0, 0, 0, 0, 9, 0, 0, 12],
[ 0, 0, 0, 4, 0, 0, 0, 0, 9, 0, 0, 0]])
The desired result should look like this (the array size stays intact and positions matter):
diff = np.array([[ 0, 0, 0, 4, 0, 2, 0, 0, 3, 0, 2, 0],
[ 1, 0, 2, 1, 0, 0, 0, 0, 5, 0, 0, 3],
[ 0, 0, 0, 4, 0, 0, 0, 0, 5, 0, 0, 0]])
I am aware of the 1D solution (Diff on non-adjacent values):
imask = np.flatnonzero(timeline)
diff = np.zeros_like(timeline)
diff[imask] = np.diff(timeline[imask], prepend=0)
within which the last line can be replaced with
diff[imask[0]] = timeline[imask[0]]
diff[imask[1:]] = timeline[imask[1:]] - timeline[imask[:-1]]
and the first line can be replaced with
imask = np.where(timeline != 0)[0]
Attempting to generalise the 1D solution, I can see that imask = np.flatnonzero(timeline) is undesirable, as the rows become inter-dependent. Thus I am trying the alternative np.nonzero.
imask = np.nonzero(timeline)
diff = np.zeros_like(timeline)
diff[imask] = np.diff(timeline[imask], prepend=0)
However, this solution connects the last nonzero value of one row to the first of the next (the rows remain inter-dependent):
array([[ 0, 0, 0, 4, 0, 2, 0, 0, 3, 0, 2, 0],
[-10, 0, 2, 1, 0, 0, 0, 0, 5, 0, 0, 3],
[ 0, 0, 0, -8, 0, 0, 0, 0, 5, 0, 0, 0]])
How can I make the "prepend" start each row with a zero?
Wow. I did it... (It is an interesting problem for me too.)
I made a non_adjacent_diff function to be applied to every row, and apply it to each row using np.apply_along_axis.
Try this code.
timeline = np.array([[ 0, 0, 0, 4, 0, 6, 0, 0, 9, 0, 11, 0],
[ 1, 0, 3, 4, 0, 0, 0, 0, 9, 0, 0, 12],
[ 0, 0, 0, 4, 0, 0, 0, 0, 9, 0, 0, 0]])
def non_adjacent_diff(row):
    # indices of the nonzero entries in this row
    not_zero_index = np.where(row != 0)
    # differences between consecutive nonzero values
    diff = row[not_zero_index][1:] - row[not_zero_index][:-1]
    # write the differences back, leaving the first nonzero value as-is
    np.put(row, not_zero_index[0][1:], diff)
    return row
np.apply_along_axis(non_adjacent_diff, 1, timeline)
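For large arrays, a fully vectorized alternative (my sketch, not part of the answer above) avoids the per-row Python call by working on the flat list of nonzero positions and resetting the "previous value" wherever the row index changes; on the 3x12 example it reproduces the desired diff:
def non_adjacent_diff_2d(timeline):
    diff = np.zeros_like(timeline)
    rows, cols = np.nonzero(timeline)        # row-major order: row by row, left to right
    vals = timeline[rows, cols]
    prev = np.concatenate(([0], vals[:-1]))  # previous nonzero value in scan order
    new_row = np.concatenate(([True], rows[1:] != rows[:-1]))
    prev[new_row] = 0                        # first nonzero of each row diffs against 0
    diff[rows, cols] = vals - prev
    return diff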
I have initialized a numpy nd array like the following
arr = np.zeros((6, 6))
This empty array is passed as an input argument to a function,
def fun(arr):
    arr.append(1)  # this works for arr = [] initialization
    return arr
for i in range(0,12):
    fun(arr)
But append doesn't work for an nd array. I want to fill in the elements of the nd array row-wise.
Is there any way to use a python scalar index for the nd array? I could increment this index every time fun is called and append elements to arr.
Any suggestions?
In [523]: arr = np.zeros((6,6),int)
In [524]: arr
Out[524]:
array([[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0]])
In [525]: arr[0] = 1
In [526]: arr
Out[526]:
array([[1, 1, 1, 1, 1, 1],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0]])
In [527]: arr[1] = [1,2,3,4,5,6]
In [528]: arr[2,3:] = 2
In [529]: arr
Out[529]:
array([[1, 1, 1, 1, 1, 1],
[1, 2, 3, 4, 5, 6],
[0, 0, 0, 2, 2, 2],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0]])
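If the goal is the calling pattern from the question, a function invoked repeatedly that fills the next row, a minimal sketch is to pass the row index explicitly instead of appending (the names here are my own):
import numpy as np
arr = np.zeros((6, 6), int)
def fun(arr, row, values):
    arr[row] = values  # write one row of the preallocated array in place; no append needed
    return arr
for i in range(6):
    fun(arr, i, np.arange(6) + i)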
import numpy
import sympy
n = 7
k = 3
X = numpy.random.randn(n,k)
Px = X#numpy.linalg.inv(numpy.transpose(X)#X)#numpy.transpose(X) #X(X'X)^(-1)X'
print(sympy.Matrix(Px).rref())
As you may verify yourself, Px is singular. However, sympy.rref() returns this:
(Matrix([[1, 0, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0],
[0, 0, 1, 0, 0, 0, 0],
[0, 0, 0, 1, 0, 0, 0],
[0, 0, 0, 0, 1, 0, 0],
[0, 0, 0, 0, 0, 1, 0],
[0, 0, 0, 0, 0, 0, 1]]), (0, 1, 2, 3, 4, 5, 6))
Why doesn't it return the real rref? I read somewhere that I could pass simplify=True, but it didn't make any difference.
In [49]: Px
Out[49]:
array([[ 0.5418898 , 0.44245552, 0.04973693, -0.06834885, -0.19086119,
-0.07003176, 0.06325021],...
[ 0.06325021, -0.11080081, 0.21656224, -0.07445145, -0.28634725,
0.06648907, 0.19199866]])
In [50]: np.linalg.det(Px)
Out[50]: 2.141647537907433e-67
In [51]: np.linalg.inv(Px)
Out[51]:
array([[-7.18788695e+15, 4.95655702e+15, 7.52738018e+15,
-4.40875311e+15, -1.64015565e+16, 2.63785320e+15,
-3.03465003e+16],
[ 1.59176426e+16, ....
[ 3.31636798e+16, -3.39094560e+16, -3.60287970e+16,
-1.27160460e+16, 2.14338015e+16, 3.32345350e+15,
3.60287970e+16]])
Your Px is close to singular, but not exactly so. Contrast that with
In [52]: M = np.arange(9).reshape(3,3)
In [53]: np.linalg.det(M)
Out[53]: 0.0
In [55]: np.linalg.inv(M)
LinAlgError: Singular matrix
In [56]: sympy.Matrix(M).rref()
Out[56]:
(Matrix([
[1, 0, -1],
[0, 1, 2],
[0, 0, 0]]), (0, 1))
Numerically speaking your Px is not singular, just close:
In [57]: sympy.Matrix(Px).rref()
Out[57]:
(Matrix([
[1, 0, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0],
[0, 0, 1, 0, 0, 0, 0],
[0, 0, 0, 1, 0, 0, 0],
[0, 0, 0, 0, 1, 0, 0],
[0, 0, 0, 0, 0, 1, 0],
[0, 0, 0, 0, 0, 0, 1]]), (0, 1, 2, 3, 4, 5, 6))
But with a custom iszerofunc:
In [58]: sympy.Matrix(Px).rref(iszerofunc=lambda x: abs(x)<1e-16)
Out[58]:
(Matrix([
[1, 0, 0, 0.647383887198708, -1.91409951634531, -1.43377991000974, 0.578981680134581],
[0, 1, 0, -0.839184067893959, 1.88998490600173, 1.43367640627271, -0.611620902311026],
[0, 0, 1, -0.962221703397948, 0.203783478612254, 1.45929622452135, 0.404548167005728],
[0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0]]),
(0, 1, 2))
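If only the numerical rank is needed rather than the rref itself, np.linalg.matrix_rank already applies an SVD-based tolerance (an aside, not part of the rref discussion above):
print(np.linalg.matrix_rank(Px))  # expected: 3, the dimension of X's column space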
I've got a 219-by-219 np.array with mostly 0s and about 2% nonzeros, and I now want to create new arrays where each of the nonzero values has a 90% chance of becoming a zero.
I know how to change every n-th nonzero value to 0, but how do I work with probabilities?
Probably this can be modified:
index=0
for x in range(0, 219):
    for y in range(0, 219):
        if (index+1) % 10 == 0:
            B[x][y] = 0
        index+=1
print(B)
You could use np.random.random to create an array of random numbers to compare with 0.9, and then use np.where to select either the original value or 0. Since each draw is independent, it doesn't matter if we replace a 0 with a 0, so we don't need to treat zero and nonzero values differently. For example:
In [184]: A = np.random.randint(0, 2, (8,8))
In [185]: A
Out[185]:
array([[1, 1, 1, 0, 0, 0, 0, 1],
[1, 1, 1, 0, 0, 0, 0, 0],
[1, 1, 1, 1, 1, 0, 0, 0],
[0, 1, 0, 1, 0, 0, 0, 1],
[0, 1, 0, 1, 1, 1, 1, 0],
[1, 1, 0, 1, 1, 0, 0, 0],
[1, 0, 0, 1, 0, 0, 1, 0],
[1, 1, 0, 0, 0, 1, 0, 1]])
In [186]: np.where(np.random.random(A.shape) < 0.9, 0, A)
Out[186]:
array([[0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 1],
[0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0, 0]])
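The same idea can be written as a masked assignment on a copy (an equivalent sketch, reusing the question's name B for the result):
B = A.copy()
B[np.random.random(B.shape) < 0.9] = 0  # each entry independently zeroed with probability 0.9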
# first method
import numpy as np
prob = 0.3
print(np.random.choice([2, 5], (5,), p=[prob, 1 - prob]))
# second method (the one I prefer)
import random
def randomZerosOnes(a, b, N, prob):
    # draw a fixed count of a's and b's matching the probability, then shuffle
    if prob > 1 - prob:
        n1 = int((1 - prob) * N)
        n0 = N - n1
    else:
        n0 = int(prob * N)
        n1 = N - n0
    zo = np.concatenate(([a for _ in range(n0)], [b for _ in range(n1)]), axis=0)
    random.shuffle(zo)
    return zo
zo = randomZerosOnes(2, 5, N=5, prob=0.3)
print(zo)