I have run into a problem with NumPy arrays.
I used CountVectorizer from sklearn with a wordset and values (from a pandas column) to create an array of arrays that count words (bag of words). When I print the array and its shape, I get this result:
[[array([0, 5, 0, ..., 0, 0, 0])]
[array([0, 0, 0, ..., 0, 0, 0])]
[array([0, 0, 0, ..., 0, 0, 0])]
...
[array([0, 0, 0, ..., 0, 0, 0])]
[array([0, 0, 0, ..., 0, 0, 0])]
[array([0, 0, 0, ..., 0, 0, 0])]] (2800, 1)
Why does an array of arrays have the shape of a column vector?
I checked that all rows have the same size.
Here is a way to reproduce my problem:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
data = pd.DataFrame(["blop blip blup", "bop bip bup", "boop boip boup"], columns=["corpus"])
# add labels column
data["label"] = ["blop", "bip", "boup"]
wordset = pd.Series([y for x in data["corpus"].str.split() for y in x]).unique()
cvec = CountVectorizer(vocabulary=wordset, ngram_range=(1, 2))
labels_count_np = data["label"].apply(lambda x: cvec.fit_transform([x]).toarray()[0]).values
print(labels_count_np, labels_count_np.shape)
It should print:
[array([1, 0, 0, 0, 0, 0, 0, 0, 0]) array([0, 0, 0, 0, 1, 0, 0, 0, 0])
array([0, 0, 0, 0, 0, 0, 0, 0, 1])] (3,)
Can someone explain why NumPy behaves this way?
Also, I tried to find a way to concatenate multiple arrays like this:
A = [array([1, 0, 0, 0, 0, 0, 0, 0, 0]) array([0, 0, 0, 0, 1, 0, 0, 0, 0])
array([0, 0, 0, 0, 0, 0, 0, 0, 1])]
B = [array([0, 7, 2, 0]) array([1, 4, 0, 8])
array([6, 1, 0, 9])]
concatenate(A,B) =>
[
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 2, 0],
[0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 4, 0, 8],
[0, 0, 0, 0, 0, 0, 0, 0, 1, 6, 1, 0, 9]
]
But I have not found a good way to do it.
The values of a DataFrame, even if it has just one column, are 2d. The values of a Series (one column of the frame) are 1d.
If labels_count_np is (2800, 1) shape, you can easily make it 1d with labels_count_np[:,0] or np.squeeze(labels...). That's just basic numpy.
It will still be an object dtype array containing arrays, the elements of the dataframe cells. If those arrays are all the same size then
np.stack(labels_count_np[:,0])
should create a 2d numeric array.
Make a frame with array elements:
In [35]: df = pd.DataFrame([None,None,None], columns=['x'])
In [36]: df
Out[36]:
x
0 None
1 None
2 None
In [37]: for i in range(3):df['x'][i] = np.zeros(4,int)
In [38]: df
Out[38]:
x
0 [0, 0, 0, 0]
1 [0, 0, 0, 0]
2 [0, 0, 0, 0]
The 2d array from the frame:
In [39]: df.values
Out[39]:
array([[array([0, 0, 0, 0])],
[array([0, 0, 0, 0])],
[array([0, 0, 0, 0])]], dtype=object)
In [40]: _.shape
Out[40]: (3, 1)
from the Series:
In [41]: df['x'].values
Out[41]:
array([array([0, 0, 0, 0]), array([0, 0, 0, 0]), array([0, 0, 0, 0])],
dtype=object)
In [42]: _.shape
Out[42]: (3,)
Joining the Series values into one 2d array:
In [43]: np.stack(df['x'].values)
Out[43]:
array([[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0]])
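The steps above can be sketched without pandas or sklearn. This is a minimal illustration only: the (3, 1) object array is built by hand to stand in for labels_count_np, and its values are made up.

```python
import numpy as np

# Hand-built stand-in for the (n, 1) object array from the question:
# each cell holds a 1-D count vector of the same length (values are made up).
rows = [np.array([1, 0, 0]), np.array([0, 1, 0]), np.array([0, 0, 1])]
labels_count_np = np.empty((3, 1), dtype=object)
for i, r in enumerate(rows):
    labels_count_np[i, 0] = r

flat = labels_count_np[:, 0]   # drop the length-1 axis: shape (3,), still dtype=object
dense = np.stack(flat)         # join the equal-length vectors: shape (3, 3), numeric dtype

print(flat.shape, flat.dtype)
print(dense.shape, dense.dtype)
```

np.stack only works here because every inner array has the same length; with ragged rows it raises an error.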
You can concatenate using list comprehension:
C = [np.append(x, B[i]) for i, x in enumerate(A)]
OUTPUT
[array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 2, 0]),
array([0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 4, 0, 8]),
array([0, 0, 0, 0, 0, 0, 0, 0, 1, 6, 1, 0, 9])]
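If a single 2d numeric array is wanted rather than a list of arrays, one possible alternative is to stack each input first and then join them column-wise. This is a sketch that assumes A and B have the same number of rows and that the rows within each are of equal length, as in the question.

```python
import numpy as np

A = [np.array([1, 0, 0, 0, 0, 0, 0, 0, 0]),
     np.array([0, 0, 0, 0, 1, 0, 0, 0, 0]),
     np.array([0, 0, 0, 0, 0, 0, 0, 0, 1])]
B = [np.array([0, 7, 2, 0]),
     np.array([1, 4, 0, 8]),
     np.array([6, 1, 0, 9])]

# np.stack turns each list of rows into a 2d array;
# np.hstack then glues the two arrays together along the column axis.
C = np.hstack([np.stack(A), np.stack(B)])

print(C.shape)
print(C)
```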
I need to find the number of non-zero rows and put them in a 1D tensor (a kind of vector).
For example:
tensor = [
[
[1, 2, 3, 4, 0, 0, 0],
[4, 5, 6, 7, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0]
],
[
[4, 3, 2, 1, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0]
],
[
[0, 0, 0, 0, 0, 0, 0],
[4, 5, 6, 7, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0]
]
]
The tensor shape will be [None,45,7] in a real application, but here it is [3,3,7].
So I need to count the non-zero rows along dimension 1 and keep the counts in a 1D tensor.
non_zeros = [2,1,1] #result for the above tensor
I need to do it in TensorFlow; if it were NumPy, I would already have done it.
Can anyone help me with this?
Thanks in advance
You can use tf.math.count_nonzero combined with tf.reduce_sum:
>>> tf.math.count_nonzero(tf.reduce_sum(tensor,axis=2),axis=1)
<tf.Tensor: shape=(3,), dtype=int64, numpy=array([2, 1, 1])>
Try this code:
t = tf.math.not_equal(tensor, 0)
t = tf.reduce_any(t, -1)
t = tf.cast(t, tf.int32)
t = tf.reduce_sum(t, -1)
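The two approaches agree on the example tensor, but they can differ when a row contains values that cancel out. The sketch below mirrors both in NumPy purely for illustration, on the assumption that np.sum, np.count_nonzero and np.any reduce the same way as their TensorFlow counterparts; the data is made up to show the difference.

```python
import numpy as np

# Made-up data: the first row of the first block sums to zero
# even though it is not an all-zero row.
tensor = np.array([
    [[1, -1, 0], [4, 5, 6], [0, 0, 0]],
    [[0, 0, 0], [0, 0, 0], [0, 0, 0]],
])

# Sum-based count: a row is "non-zero" if its sum is non-zero.
by_sum = np.count_nonzero(tensor.sum(axis=2), axis=1)

# Any-based count: a row is "non-zero" if any single entry is non-zero.
by_any = np.any(tensor != 0, axis=-1).astype(np.int32).sum(axis=-1)

print(by_sum)  # the cancelling row is missed
print(by_any)  # the cancelling row is counted
```

With non-negative data, such as the tensor in the question, both give the same counts.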
I have the following adjacency matrix:
array([[0, 1, 1, 0, 0, 0, 0],
[1, 0, 1, 0, 0, 0, 0],
[1, 1, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 0, 1],
[0, 0, 0, 1, 0, 1, 0],
[0, 0, 0, 0, 1, 0, 1],
[0, 0, 0, 1, 0, 1, 0]])
It can be drawn like this (image not included):
My goal is to identify the connected subgraphs ABC and DEFG. It seems that the depth-first search algorithm is what I need, and that SciPy implements it. So here is my code:
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import depth_first_order
import numpy as np
test = np.asarray([
[0, 1, 1, 0, 0, 0, 0],
[1, 0, 1, 0, 0, 0, 0],
[1, 1, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 0, 1],
[0, 0, 0, 1, 0, 1, 0],
[0, 0, 0, 0, 1, 0, 1],
[0, 0, 0, 1, 0, 1, 0]
])
graph = csr_matrix(test)
result = depth_first_order(graph, 0)
But I don't understand the result:
>>> result
(array([0, 1, 2]), array([-9999, 0, 1, -9999, -9999, -9999, -9999]))
What is that array([-9999, 0, 1, -9999, -9999, -9999, -9999])? Also, the documentation talks about a sparse matrix, not an adjacency matrix. But an adjacency matrix seems to be a sparse matrix by definition, so this is not clear to me.
While you could indeed use DFS to find the connected components, SciPy makes it even easier with scipy.sparse.csgraph.connected_components. With your example:
In [3]: connected_components(test)
Out[3]: (2, array([0, 0, 0, 1, 1, 1, 1], dtype=int32))
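A self-contained version of this call, including the import, might look like the sketch below. The node names A to G are taken from the question; grouping the names by label is an illustrative addition, not part of the SciPy API.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

test = np.asarray([
    [0, 1, 1, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0, 1, 0],
    [0, 0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0, 1, 0],
])
names = np.array(list("ABCDEFG"))

# labels[i] is the component index of node i; directed=False treats
# every edge as two-way, which matches this symmetric matrix.
n_components, labels = connected_components(csr_matrix(test), directed=False)

# Collect the node names belonging to each component.
groups = ["".join(names[labels == k]) for k in range(n_components)]
print(n_components, groups)
```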
Well, to start, you have an undirected graph. Look at the documentation again and set the directed parameter to False, since the default is True.
The first array you get is the nodes reachable from where you start (node 0 = node a) including your starting node.
So you start at node a and you can reach b and c. You can't reach the rest of the graph since you have a disconnected graph. DFS is doing what it is supposed to do. You will need to do DFS on the d node to get the second graph.
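As a sketch of that second run: the second array returned by depth_first_order holds the predecessor of each node in the search tree, and -9999 marks nodes with no predecessor, meaning the start node itself and any node that was not reached. The variable names below are only for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import depth_first_order

test = np.asarray([
    [0, 1, 1, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0, 1, 0],
    [0, 0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0, 1, 0],
])
graph = csr_matrix(test)

# Search from node 0 (a): only a, b and c are reachable.
order_a, pred_a = depth_first_order(graph, 0, directed=False)

# Search from node 3 (d): reaches the other component.
order_d, pred_d = depth_first_order(graph, 3, directed=False)

print(order_a, pred_a)
print(order_d, pred_d)
```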
I have a sparse matrix stored on disk in coordinate format, (triplet format).
I would like to read chunks of the matrix into memory using scipy.sparse. However, when I do this, scipy always assumes the matrix is indexed from (0, 0), regardless of the chunk.
This means, for example, that scipy interprets the last 'chunk' of the sparse matrix as a huge matrix that only has values in the bottom right corner.
How can I correctly handle the chunks so that when doing toarray to create a dense matrix it only creates the subset corresponding to that chunk?
The reason for doing this is that, even sparse, the matrix is too large for memory (approx. 600 million 32-bit floating point values), and to display it on screen (the matrix represents a geospatial raster) I need to convert it to a dense matrix and store it in a geospatial format (e.g. GeoTIFF).
You should be able to tweak the row and col values when building the subset. For example:
In [84]: row=np.arange(10)
In [85]: col=np.random.randint(0,6,row.shape)
In [86]: data=np.ones(row.shape,dtype=int)*2
In [87]: M=sparse.coo_matrix((data,(row,col)),shape=(10,6))
In [88]: M.toarray()
Out[88]:
array([[0, 0, 2, 0, 0, 0],
[0, 0, 0, 0, 0, 2],
[0, 0, 0, 2, 0, 0],
[0, 0, 2, 0, 0, 0],
[0, 0, 2, 0, 0, 0],
[0, 2, 0, 0, 0, 0],
[2, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 2, 0],
[0, 0, 0, 2, 0, 0],
[0, 0, 0, 0, 0, 2]])
To build a matrix with a subset of the rows use:
In [89]: M1=sparse.coo_matrix((data[5:],(row[5:]-5,col[5:])),shape=(5,6))
In [90]: M1.toarray()
Out[90]:
array([[0, 2, 0, 0, 0, 0],
[2, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 2, 0],
[0, 0, 0, 2, 0, 0],
[0, 0, 0, 0, 0, 2]])
You'll have to decide whether you want to specify the shape for M1, or let it deduce it from the range of row and col.
If these coordinates are not sorted, or you also want to take a subrange of col, things could get more complicated. But I think this captures the basic idea.
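One way to handle unsorted coordinates and a column subrange at the same time is to select the triplets with a boolean mask and shift both indices. The helper name dense_chunk and the sample triplets below are hypothetical; this is a sketch that assumes the triplets for the chunk have already been read into NumPy arrays.

```python
import numpy as np
from scipy import sparse

def dense_chunk(row, col, data, row_range, col_range):
    """Return the dense block for half-open ranges [r0, r1) x [c0, c1)."""
    r0, r1 = row_range
    c0, c1 = col_range
    # Keep only the triplets that fall inside the window; order does not matter.
    keep = (row >= r0) & (row < r1) & (col >= c0) & (col < c1)
    # Shift indices so the window's top-left corner becomes (0, 0).
    block = sparse.coo_matrix(
        (data[keep], (row[keep] - r0, col[keep] - c0)),
        shape=(r1 - r0, c1 - c0),
    )
    return block.toarray()

# Made-up triplets for a 10 x 6 matrix.
row = np.array([0, 2, 5, 7, 9])
col = np.array([1, 3, 0, 4, 5])
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0], dtype=np.float32)

chunk = dense_chunk(row, col, data, (5, 10), (0, 6))      # last five rows
corner = dense_chunk(row, col, data, (5, 10), (3, 6))     # bottom right corner
print(chunk.shape, corner.shape)
```

Passing the shape explicitly keeps every chunk the same size even when its last rows or columns happen to be empty.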
How to construct sparse matrix from diagonal vectors like this:
Let's say my matrix is square with dimension N=6 and I have the following vector
vec = np.array([[1], [1,2]], dtype=object)
and I want to put those parts on diagonals
offset = np.array([2,3])
but vec[0] should start at Mat[0,2] and vec[1] should start at Mat[1,4]
I know about scipy.sparse.diags() but I don't think there is a way to specify just part of a diagonal where non-zero elements are present.
This is just an example to illustrate the problem. In reality I deal with very big arrays and I don't want to waste memory on useless zeros.
Is this the matrix that you want?
In [200]: sparse.dia_matrix(([[0,0,1,0,0,0],[0,0,0,0,1,2]],[2,3]),(6,6)).toarray()
Out[200]:
array([[0, 0, 1, 0, 0, 0],
[0, 0, 0, 0, 1, 0],
[0, 0, 0, 0, 0, 2],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0]])
Yes, the specification includes zeros, which could be annoying in larger cases.
spdiags just wraps dia_matrix, with the option of converting the result to another format. In your example that converts a sparse matrix with 7 stored elements to one with 3.
sparse.diags accepts a ragged list of values, but they still need to match the diagonals in length. And internally it converts them to the rectangular array that dia_matrix takes.
S3=sparse.diags([[1,0,0,0],[0,1,2]],[2,3],(6,6))
So if you really need to be parsimonious about the zeros, you need to go the coo route.
For example:
In [363]: starts = [[0,2],[1,4]]
In [364]: data = np.concatenate(vec)
In [365]: rows=np.concatenate([range(s[0],s[0]+len(v)) for s,v in zip(starts, vec)])
In [366]: cols=np.concatenate([range(s[1],s[1]+len(v)) for s,v in zip(starts, vec)])
In [367]: sparse.coo_matrix((data,(rows,cols)),(6,6)).toarray()
Out[367]:
array([[0, 0, 1, 0, 0, 0],
[0, 0, 0, 0, 1, 0],
[0, 0, 0, 0, 0, 2],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0]])
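The coo steps above can be wrapped in a small helper. The function name partial_diagonals is hypothetical, and the sketch assumes each piece runs down-right along a diagonal from its (row, col) start and stays inside the matrix.

```python
import numpy as np
from scipy import sparse

def partial_diagonals(pieces, starts, shape):
    """Build a coo_matrix from diagonal segments, storing no padding zeros."""
    data, rows, cols = [], [], []
    for values, (r0, c0) in zip(pieces, starts):
        k = np.arange(len(values))   # step along the diagonal
        data.append(np.asarray(values))
        rows.append(r0 + k)
        cols.append(c0 + k)
    return sparse.coo_matrix(
        (np.concatenate(data), (np.concatenate(rows), np.concatenate(cols))),
        shape=shape,
    )

vec = [[1], [1, 2]]            # ragged pieces kept as a plain list
starts = [(0, 2), (1, 4)]      # vec[0] starts at [0,2], vec[1] at [1,4]

M = partial_diagonals(vec, starts, (6, 6))
print(M.nnz)                   # number of stored values
print(M.toarray())
```

Only the three supplied values are stored, so memory use scales with the length of the pieces rather than with the length of the full diagonals.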