Create dense matrix from sparse matrix efficiently (numpy/scipy but NO sklearn) - python

I have a sparse.txt that looks like this:
# first column is label 0 or 1
# rest of the data is sparse data
# maximum value in the data is 4, so the future dense matrix will
# have 1+4 = 5 elements in a row
# file: sparse.txt
1 1:1 2:1 3:1
0 1:1 4:1
1 2:1 3:1 4:1
The required dense.txt is this:
# required file: dense.txt
1 1 1 1 0
0 1 0 0 1
1 0 1 1 1
Without using scipy coo_matrix, I did it in a simple way like this:
def create_dense(fsparse, fdense, fvocab):
    # number of lines in vocab
    lvocab = sum(1 for line in open(fvocab))
    # create dense file
    with open(fsparse) as fi, open(fdense, 'w') as fo:
        for i, line in enumerate(fi):
            words = line.strip('\n').split(':')
            words = " ".join(words).split()
            label = int(words[0])
            indices = [int(w) for (i, w) in enumerate(words) if int(i) % 2]
            row = [0] * (lvocab + 1)
            row[0] = label
            # use listcomps
            row = [1 if i in indices else row[i] for i in range(len(row))]
            l = " ".join(map(str, row)) + "\n"
            fo.write(l)
            print('Writing dense matrix line: ', i + 1)
Question:
How can we get the label and data directly from the sparse data without first creating a dense matrix, preferably using NumPy/SciPy?
Question:
How can we read the sparse data using numpy.fromregex?
My attempt is:
def read_file(fsparse):
    regex = r'([0-1]\s)([0-9]):(1\s)*([0-9]:1)' + r'\s*\n'
    data = np.fromregex(fsparse, regex, dtype=str)
    print(data, file=open('dense.txt', 'w'))
It did not work!
Related links:
Parsing colon separated sparse data with pandas and numpy

Tweaking your code to create the dense array directly, rather than via a file:
fsparse = 'stack47266965.txt'
def create_dense(fsparse, lvocab):
    # fdense parameter dropped: the dense rows are returned as a list, not written to file
    alist = []
    with open(fsparse) as fi:
        for i, line in enumerate(fi):
            words = line.strip('\n').split(':')
            words = " ".join(words).split()
            label = int(words[0])
            indices = [int(w) for (i, w) in enumerate(words) if int(i) % 2]
            row = [0] * (lvocab + 1)
            row[0] = label
            # use listcomps
            row = [1 if i in indices else row[i] for i in range(len(row))]
            alist.append(row)
    return alist

alist = create_dense(fsparse, 4)
print(alist)
import numpy as np
arr = np.array(alist)
from scipy import sparse
M = sparse.coo_matrix(arr)
print(M)
print(M.A)
produces
0926:~/mypy$ python3 stack47266965.py
[[1, 1, 1, 1, 0], [0, 1, 0, 0, 1], [1, 0, 1, 1, 1]]
(0, 0) 1
(0, 1) 1
(0, 2) 1
(0, 3) 1
(1, 1) 1
(1, 4) 1
(2, 0) 1
(2, 2) 1
(2, 3) 1
(2, 4) 1
[[1 1 1 1 0]
[0 1 0 0 1]
[1 0 1 1 1]]
If you want to skip the dense arr, you need to generate the equivalent of the M.row, M.col, and M.data attributes (order doesn't matter):
[0 0 0 0 1 1 2 2 2 2]
[0 1 2 3 1 4 0 2 3 4]
[1 1 1 1 1 1 1 1 1 1]
I don't use regex much so I won't try to fix that. I assume you want to convert
'1 1:1 2:1 3:1'
into
['1', '1', '1', '2', '1', '3', '1']
But that just gets you to the words/label stage.
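As for the numpy.fromregex question: fromregex builds a structured array from a fixed number of capture groups per line, so it is an awkward fit for rows that carry a variable number of index:value pairs, which is most likely why the attempt above fails. A minimal sketch (my own, with a hypothetical helper name) that tokenizes each line with re instead:

import re

def read_sparse_tokens(fsparse):
    labels, index_lists = [], []
    with open(fsparse) as fi:
        for line in fi:
            # every integer on the line, e.g. ['1', '1', '1', '2', '1', '3', '1']
            tokens = re.findall(r'\d+', line)
            labels.append(int(tokens[0]))
            # after the label, every other token is a column index
            index_lists.append([int(t) for t in tokens[1::2]])
    return labels, index_lists

From there the row/col/data lists can be built exactly as below.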
A direct-to-sparse version:
def create_sparse(fsparse, lvocab):
    row, col, data = [], [], []
    with open(fsparse) as fi:
        for i, line in enumerate(fi):
            words = line.strip('\n').split(':')
            words = " ".join(words).split()
            label = int(words[0])
            # column 0 holds the label
            row.append(i); col.append(0); data.append(label)
            indices = [int(w) for (i, w) in enumerate(words) if int(i) % 2]
            for j in indices:  # quick-n-dirty version
                row.append(i); col.append(j); data.append(1)
    return row, col, data

r, c, d = create_sparse(fsparse, 4)
print(r, c, d)
M = sparse.coo_matrix((d, (r, c)))
print(M)
print(M.A)
producing
[0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2] [0, 1, 2, 3, 0, 1, 4, 0, 2, 3, 4] [1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1]
....
The only difference is the one data item with value 0 (the label of row 1); sparse will take care of that.
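If that explicitly stored zero bothers you, one option (a side note, not strictly needed) is to convert to CSR, whose eliminate_zeros() drops it in place:

Mc = M.tocsr()
Mc.eliminate_zeros()  # removes the explicitly stored 0 (row 1's label)
print(Mc.nnz)         # 10 instead of 11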

(Answered before explicitly disallowing sklearn)
This is basically the svmlight / libsvm format.
Just use scikit-learn's load_svmlight_file or the more efficient svmlight-loader. No need to reinvent the wheel here!
from sklearn.datasets import load_svmlight_file
X, y = load_svmlight_file('C:/TEMP/sparse.txt')
print(X)
print(y)
print(X.todense())
Output (load_svmlight_file auto-detects the one-based indices in the file, so X comes back with 4 feature columns and the labels separately in y):
(0, 0) 1.0
(0, 1) 1.0
(0, 2) 1.0
(1, 0) 1.0
(1, 3) 1.0
(2, 1) 1.0
(2, 2) 1.0
(2, 3) 1.0
[ 1. 0. 1.]
[[ 1. 1. 1. 0.]
[ 1. 0. 0. 1.]
[ 0. 1. 1. 1.]]


inserting rows and columns of zeros to a sparse array in python

I have 50ish relatively large sparse arrays (in scipy.csr_array format but that can be changed) and I would like to insert rows and columns of zeros at certain locations. An example in dense format would look like:
A = np.asarray([[1, 2, 1], [2, 4, 5], [2, 1, 6]])
# A = array([[1, 2, 1],
#            [2, 4, 5],
#            [2, 1, 6]])
indices = np.asarray([-1, -1, 2, -1, 4, -1, -1, 7, -1])
# insert rows and columns of zeros where indices[i] == -1 to get B
B = np.asarray([[0, 0, 0, 0, 0, 0, 0, 0, 0],
                [0, 0, 0, 0, 0, 0, 0, 0, 0],
                [0, 0, 1, 0, 2, 0, 0, 1, 0],
                [0, 0, 0, 0, 0, 0, 0, 0, 0],
                [0, 0, 2, 0, 4, 0, 0, 5, 0],
                [0, 0, 0, 0, 0, 0, 0, 0, 0],
                [0, 0, 0, 0, 0, 0, 0, 0, 0],
                [0, 0, 2, 0, 1, 0, 0, 6, 0],
                [0, 0, 0, 0, 0, 0, 0, 0, 0]])
A is a sparse array of shape (~2000, ~2000) with ~20000 non-zero entries, and indices is of shape (4096,). I can imagine doing it in dense format, but I guess I don't know enough about the way data and indices are stored, and cannot find a way to do this sort of operation for sparse arrays in a quick and efficient way.
Anyone have any ideas or suggestions?
Thanks.
I would probably do this by passing the data and associated indices into a COO matrix constructor:
import numpy as np
from scipy.sparse import coo_matrix
A = np.asarray([[1,2,1],[2,4,5],[2,1,6]])
indices = np.asarray([-1, -1, 2, -1, 4, -1, -1, 7, -1])
idx = indices[indices >= 0]
col, row = np.meshgrid(idx, idx)
mat = coo_matrix((A.ravel(), (row.ravel(), col.ravel())),
                 shape=(len(indices), len(indices)))
print(mat)
# (2, 2) 1
# (2, 4) 2
# (2, 7) 1
# (4, 2) 2
# (4, 4) 4
# (4, 7) 5
# (7, 2) 2
# (7, 4) 1
# (7, 7) 6
print(mat.todense())
# [[0 0 0 0 0 0 0 0 0]
# [0 0 0 0 0 0 0 0 0]
# [0 0 1 0 2 0 0 1 0]
# [0 0 0 0 0 0 0 0 0]
# [0 0 2 0 4 0 0 5 0]
# [0 0 0 0 0 0 0 0 0]
# [0 0 0 0 0 0 0 0 0]
# [0 0 2 0 1 0 0 6 0]
# [0 0 0 0 0 0 0 0 0]]
You could try storing your non-zero values in one list and their respective indexes in another:
data_list = [[], [], [1, 2, 1], [], [2, 4, 5], [], [], [2, 1, 6], []]
index_list = [[], [], [2, 4, 7], [], [2, 4, 7], [], [], [2, 4, 7], []]
These two lists would then only have to store the nonzero values, rather than one list with 4,000,000 values.
If you then wanted to grab the value in position (4, 7):
def find_value(row, col):
    # Check to see if the given column is in our index list
    if col not in index_list[row]:
        return 0
    # Otherwise return the number in the data list
    return data_list[row][index_list[row].index(col)]

find_value(4, 7)
output: 5
Hope this helps!
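Incidentally, this per-row value/index layout is essentially what scipy's LIL ("list of lists") sparse format stores internally, so if scipy is available you get the same lookup without hand-rolling it. A minimal sketch:

from scipy.sparse import lil_matrix

B = lil_matrix((9, 9), dtype=int)
B[2, [2, 4, 7]] = [1, 2, 1]
B[4, [2, 4, 7]] = [2, 4, 5]
B[7, [2, 4, 7]] = [2, 1, 6]

print(B.rows[4])  # [2, 4, 7] -- the per-row index list
print(B.data[4])  # [2, 4, 5] -- the per-row value list
print(B[4, 7])    # 5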

compute density map D

You are given two integers n and r, such that 1 <= r < n,
and a two-dimensional array W of size n x n.
Each element of this array is either 0 or 1.
Your goal is to compute the density map D for array W, using a radius of r.
The output density map is also a two-dimensional array,
where each value represents the number of 1's in matrix W within the specified radius.
Given the following input array W of size 5 and radius 1 (n = 5, r = 1)
1 0 0 0 1
1 1 1 0 0
1 0 0 0 0
0 0 0 1 1
0 1 0 0 0
Output (using Python):
3 4 2 2 1
4 5 2 2 1
3 4 3 3 2
2 2 2 2 2
1 1 2 2 2
Logic: the value at the first row, first column of the input is 1, and r is 1, so we check the element to its right, to its left, on top, top-left, top-right, bottom, bottom-left, and bottom-right, and sum all of those elements together with the element itself.
Should not use any 3rd party library.
I did it using a for loop with an inner for loop, checking each element. Any better workaround?
Optimization: for each 1 in W, update the count of every location in whose neighborhood it lies.
For W of size n x n the following algorithm still takes O(n^2) steps; however, if W is sparse, i.e. the number of 1s (say k) is much smaller than n x n, then instead of the r x r x n x n steps of the approach stated in the question, the following takes n x n + r x r x k steps, which is much lower when k << n x n.
Given r assigned and W stored as
[[1, 0, 0, 0, 1],
[1, 1, 1, 0, 0],
[1, 0, 0, 0, 0],
[0, 0, 0, 1, 1],
[0, 1, 0, 0, 0]]
then the following

output = [[0 for i in range(5)] for j in range(5)]
for i in range(len(W)):
    for j in range(len(W[0])):
        if W[i][j] == 1:
            for off_i in range(-r, r + 1):
                for off_j in range(-r, r + 1):
                    if (0 <= i + off_i < len(W)) and (0 <= j + off_j < len(W[0])):
                        output[i + off_i][j + off_j] += 1

stores the required values in output.
For r = 1, output is as required:
[[3, 4, 2, 2, 1],
[4, 5, 2, 2, 1],
[3, 4, 3, 3, 2],
[2, 2, 2, 2, 2],
[1, 1, 2, 2, 2]]
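If k is not small, a different pure-Python option (my own sketch, still no third-party library) is a summed-area table: build 2-D prefix sums once in O(n^2), then read off each window count in O(1) regardless of r:

def density_map(W, r):
    n = len(W)
    # prefix[i][j] = number of 1s in the top-left i x j submatrix of W
    prefix = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(n):
            prefix[i + 1][j + 1] = (W[i][j] + prefix[i][j + 1]
                                    + prefix[i + 1][j] - prefix[i][j])
    D = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            top, left = max(0, i - r), max(0, j - r)
            bot, right = min(n, i + r + 1), min(n, j + r + 1)
            # 1s inside the clipped (2r+1) x (2r+1) window around (i, j)
            D[i][j] = (prefix[bot][right] - prefix[top][right]
                       - prefix[bot][left] + prefix[top][left])
    return D

For the W above, density_map(W, 1) reproduces the required output.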

how to do this operation in numpy (chaining of tiling operation)?

I'm trying to do fast generation of numpy array, possibly without passing through python.
I want to build a 1D index numpy array that would take this as an input:
[2, 3], and this: [2, 4], and would return this:
[0, 1, 0, 1, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]
Explanation:
I iterate from 0 to 2 (so the [0, 1] array) and repeat it 2 times: [0, 1, 0, 1].
Then I iterate from 0 to 3 (so the [0, 1, 2] array) and repeat it 4 times: [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2].
Then I flatten everything.
Is there a way to do this fully in numpy?
For now I'm building each table separately in numpy using np.tile() and flattening everything afterwards, but I feel like there is a more efficient way that would translate only to C function calls and no Python.
Here is a vectorized solution:
import numpy as np

def cycles(spec):
    # spec = (lengths, repeats): np.repeat((2, 3), (2, 4)) -> [2, 2, 3, 3, 3, 3],
    # the length of every individual cycle in the output, in order
    steps = np.repeat(*spec)
    # cumulative end position of each cycle in the flattened output
    ps = steps.cumsum()
    # at each cycle boundary, record the length of the cycle that just ended ...
    psj = np.zeros(ps[-1], int)
    psj[ps[:-1]] = steps[:-1]
    # ... so psj.cumsum() is the start offset of the cycle each position belongs to,
    # and a global ramp minus those offsets restarts the count at every boundary
    return np.arange(ps[-1]) - psj.cumsum()
Demo:
>>> cycles(((2,3),(2,4)))
array([0, 1, 0, 1, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2])
I am not entirely sure if this is what you want; here each tuple in the call to func() contains first the range and then the repeat.
import numpy

def func(tups):
    # total output length = sum of range * repeat over all tuples
    Arr = numpy.empty(numpy.sum([ele[0] * ele[1] for ele in tups]), dtype=int)
    i = 0
    for ele in tups:
        Arr[i:i + ele[0] * ele[1]] = numpy.tile(numpy.arange(ele[0]), ele[1])
        i += ele[0] * ele[1]
    return Arr

arr = func([(2, 3), (3, 4)])
print(arr)
# [0 1 0 1 0 1 0 1 2 0 1 2 0 1 2 0 1 2]

Find all points within distance 1 of specific point in 2D numpy matrix

I want to find a list of points that are within range 1 (or exactly diagonal) of a point in my numpy matrix:
For example say my matrix m is:
[[0 0 0 0 0]
[0 0 0 0 0]
[0 0 1 0 0]
[0 0 0 0 0]
[0 0 0 0 0]]
I would like to obtain a list of tuples or something representing all the coordinates of the 9 points with X's below:
[[0 0 0 0 0]
[0 X X X 0]
[0 X X X 0]
[0 X X X 0]
[0 0 0 0 0]]
Here is another example with the target point on the edge:
[[0 0 0 0 0]
[0 0 0 0 0]
[0 0 0 0 1]
[0 0 0 0 0]
[0 0 0 0 0]]
In this case there would only be 6 points within distance 1 of the target point:
[[0 0 0 0 0]
[0 0 0 X X]
[0 0 0 X X]
[0 0 0 X X]
[0 0 0 0 0]]
EDIT:
Using David Herring's answer/comment about Chebyshev distance, here is my attempt to solve example 2 above, assuming I know the coordinates of the target point:
from scipy.spatial import distance

point = [2, 4]
valid_points = []
for x in range(5):
    for y in range(5):
        if distance.chebyshev(point, [x, y]) <= 1:
            valid_points.append([x, y])
print(valid_points)  # [[1, 3], [1, 4], [2, 3], [2, 4], [3, 3], [3, 4]]
This seems a little inefficient for a bigger array, as I only really need to check a small set of cells, not the whole matrix.
I think you're making it a little too complicated - no need to rely on complicated functions
import numpy as np
# set up matrix
x = np.zeros((5,5))
# add a single point
x[2,-1] = 1
# get coordinates of point as array
r, c = np.where(x)
# convert to python scalars
r = r[0]
c = c[0]
# get boundaries of array
m, n = x.shape
coords = []
# loop over possible locations
for i in [-1, 0, 1]:
for j in [-1, 0, 1]:
# check if location is within boundary
if 0 <= r + i < m and 0 <= c + j < n:
coords.append((r + i, c + j))
print(coords)
>>> [(1, 3), (1, 4), (2, 3), (2, 4), (3, 3), (3, 4)]
There is no algorithm of interest here. If you don’t already know where the 1 is, first you have to find it, and you can’t do better than searching through every element. (You can get a constant-factor speedup by having numpy do this at C speed with argmax; use divmod to separate the flattened index into row and column.) Thereafter, all you do is add ±1 (or 0) to the coordinates unless it would take you outside the array bounds. You don’t ever construct coordinates only to discard them later.
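A minimal sketch of that approach (assuming a single 1 in the array, as in the examples):

import numpy as np

x = np.zeros((5, 5))
x[2, -1] = 1

m, n = x.shape
# find the 1 at C speed, then split the flat index into (row, col)
r, c = divmod(x.argmax(), n)

# generate only the in-bounds neighbor coordinates; nothing is discarded later
coords = [(i, j)
          for i in range(max(0, r - 1), min(m, r + 2))
          for j in range(max(0, c - 1), min(n, c + 2))]
print(coords)  # [(1, 3), (1, 4), (2, 3), (2, 4), (3, 3), (3, 4)]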
A simple way would be to get all possible coordinates with a Cartesian product.
Setup the data:

x = np.array([[0, 0, 0], [0, 1, 0], [0, 0, 0]])
x
array([[0, 0, 0],
       [0, 1, 0],
       [0, 0, 0]])

You know that the coordinates will be +/- 1 of your location:

import itertools

loc = np.argwhere(x == 1)[0]  # unless already known or pre-specified
v = [loc[0], loc[0] - 1, loc[0] + 1]
h = [loc[1], loc[1] - 1, loc[1] + 1]

output = []
for i in itertools.product(v, h):
    # note: this bounds check uses x.shape[0] for both axes, so it assumes a square matrix
    if not np.any(np.array(i) >= x.shape[0]) and not np.any(np.array(i) < 0):
        output.append(i)
print(output)
[(1, 1), (1, 0), (1, 2), (0, 1), (0, 0), (0, 2), (2, 1), (2, 0), (2, 2)]

map matrix into specific vector with numpy

I have a matrix similar to this:
1 0 0
1 0 0
0 2 0
0 2 0
0 0 3
0 0 3
(Non-zero numbers denote parts that I'm interested in. The actual numbers inside the matrix could be random.)
And I need to produce vector like this:
[ 1 1 2 2 3 3 ].T
I can do this with a loop:

result = np.zeros([rows])
for y in range(rows):
    x = y // (rows // cols)  # pick the index of the corresponding column
    result[y] = mat[y][x]
But I can't figure out how to do this in vector form.
This might be what you want.
import numpy as np

m = np.array([
    [1, 0, 0],
    [1, 0, 0],
    [0, 2, 0],
    [0, 2, 0],
    [0, 0, 3],
    [0, 0, 3]
])

rows, cols = m.shape
# axis-0 (row) indices
y = np.arange(rows)
# axis-1 (column) indices, one column per block of rows // cols rows
x = y // (rows // cols)
result = m[y, x]
print(result)
Result:
[1 1 2 2 3 3]
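The same column index can also be built with np.repeat instead of integer division; an equivalent reformulation (same assumption: each column owns a contiguous block of rows // cols rows):

x = np.repeat(np.arange(cols), rows // cols)  # [0 0 1 1 2 2]
result = m[np.arange(rows), x]                # [1 1 2 2 3 3]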
