Efficient multinomial sampling for sparse array/tensor in python

I have a sparse array/tensor like below.
import torch
from torch_sparse import SparseTensor
row = torch.tensor([0, 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3])
col = torch.tensor([1, 2, 3, 0, 2, 0, 1, 4, 5, 0, 2, 5, 2, 4])
value = torch.rand([14])
adj_t = SparseTensor(row=row, col=col, value=value, sparse_sizes=(4, 9))
I want to sample n_samples column indices, with or without replacement. I can do this by first converting adj_t to dense and then using torch.multinomial (or, similarly, numpy.random.choice).
sample = torch.multinomial(adj_t.to_dense(), num_samples=2, replacement=True)
But converting the sparse array to dense and then using torch.multinomial is not very efficient. Is there a sparse version of torch.multinomial? If not, how would one go about implementing it?

I am not sure if this can be done as efficiently as your one-liner.
From what I understand, one way to achieve what you want is to:
Group values by the row in which they appear in the sparse tensor, e.g. using this solution: np.split(value.numpy(), np.unique(row.numpy(), return_index=True)[1][1:])
Use numpy.random.multinomial to create the list of randomly chosen indices for every row
Map the indices to the respective values from col (i.e. index 0 in the 0th row is column 1, index 1 in the 1st row is column 2, index 2 in the 2nd row is column 4 - all according to the row and col values)
To keep performance from dropping, you might want to avoid any explicit Python loop; a sketch of these steps is shown below.
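A minimal sketch of those steps, using numpy.random.choice instead of numpy.random.multinomial to draw weighted indices directly (n_samples = 2 is an assumption). Note it still loops over rows in Python, so treat it as an illustration rather than a tuned implementation:
import numpy as np
# Step 1: group values and column indices by row.
# (float64 so the probabilities sum to 1 within choice's tolerance)
splits = np.unique(row.numpy(), return_index=True)[1][1:]
vals_per_row = np.split(value.numpy().astype(np.float64), splits)
cols_per_row = np.split(col.numpy(), splits)
# Steps 2-3: draw weighted samples per row, then map the drawn
# positions back to the column indices stored in col.
n_samples = 2
sample = np.stack([
    c[np.random.choice(len(v), size=n_samples, replace=True, p=v / v.sum())]
    for c, v in zip(cols_per_row, vals_per_row)
])
# sample has shape (4, n_samples): sampled column indices per sparse row.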

Related

How to select r% samples from a list based on their values?

Let's say I have a list a = [2, 1, 4, 3, 5]. I want to do the following:
I define a percentage value r. I would like to select the r% of samples having the lowest values and get their indices.
For example, if r=80, the output would be the indices of 1, 2, 3, 4, i.e. 1, 0, 3, 2.
Use np.percentile and np.where
a = np.array([2, 1, 4, 3, 5])
r = 80
np.where(a < np.percentile(a, r))
> (array([0, 1, 2, 3]),)
Note: in your example you return the order of the indices as if the elements were sorted. It's not clear if this is important for you, but if it is, it's easy in NumPy! Just replace the last line with
np.argsort(a)[a < np.percentile(a, r)]
> array([1, 0, 3, 2])
def perc(r, number_list):
    # Find the number of samples based on the percentage (rounding to the closest integer)
    number_of_samples = round(len(number_list) * (r / 100))
    number_list.sort()
    return [number_list[index] for index in range(number_of_samples)]
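As a quick usage check (this call is an illustration, not part of the original answer): perc(80, [2, 1, 4, 3, 5]) returns [1, 2, 3, 4]. Note that this returns the selected values rather than their indices, and that it sorts the input list in place.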

Create array/tensor of cycle shifted arrays

I want to create a 2d tensor (or numpy array, it doesn't really matter) where every row is a cyclically shifted version of the first row. I do it using a for loop:
import torch
import numpy as np
a = np.random.rand(33, 11)
miss_size = 64
lp_order = a.shape[1] - 1
inv_a = -np.flip(a, axis=1)
mtx_size = miss_size+lp_order # some constant
mtx_row = torch.cat((torch.from_numpy(inv_a), torch.zeros((a.shape[0], miss_size - 1 + a.shape[1]), dtype=torch.float64)), dim=1)  # dtype must match inv_a (float64)
mtx_full = mtx_row.unsqueeze(1)
for i in range(mtx_size):
    mtx_row = torch.roll(mtx_row, 1, 1)
    mtx_full = torch.cat((mtx_full, mtx_row.unsqueeze(1)), dim=1)
The unsqueezing is needed because I stack 2d tensors into a 3d tensor.
Is there a more efficient way to do that? Maybe a linear algebra trick or a more pythonic approach.
You can use scipy.linalg.circulant():
scipy.linalg.circulant([1, 2, 3])
# array([[1, 3, 2],
#        [2, 1, 3],
#        [3, 2, 1]])
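Since circulant() works on a 1d array, one way to get the batched 3d tensor from the question is to apply it per row and stack the results. Note that the shifted copies appear as the columns of circulant(), so a transpose is taken here to match the torch.roll direction used in the question; the shapes are assumptions for illustration:
import torch
from scipy.linalg import circulant
mtx_row = torch.rand(2, 5)
mtx_full = torch.stack([torch.from_numpy(circulant(r).T.copy()) for r in mtx_row.numpy()])
# mtx_full.shape == torch.Size([2, 5, 5]); row i of each matrix is the
# first row rolled i positions to the right.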
I believe you can achieve this using torch.gather by constructing the appropriate index tensor. This approach works with batches too.
If we take this approach, the objective is to construct an index tensor where each value refers to an index in mtx_row (along the last dimension, here dim=1). In this case, it would be shaped (3, 3):
tensor([[0, 1, 2],
        [2, 0, 1],
        [1, 2, 0]])
You can achieve this by broadcasting torch.arange with its own transpose and applying modulo on the resulting matrix (with n = 3 here):
>>> n = 3
>>> idx = (n - torch.arange(n)[None].T + torch.arange(n)[None]) % n
>>> idx
tensor([[0, 1, 2],
        [2, 0, 1],
        [1, 2, 0]])
Let mtx_row be shaped (2, 3):
>>> mtx_row
tensor([[0.3523, 0.0170, 0.1875],
        [0.2156, 0.7773, 0.4563]])
From there you need to expand idx and mtx_row so they have the same shapes:
>>> idx_ = idx[None].expand(len(mtx_row), -1, -1)
>>> val_ = mtx_row[:, None].expand(-1, n, -1)
Then we can apply torch.gather on the last dimension dim=2:
>>> val_.gather(-1, idx_)
tensor([[[0.3523, 0.0170, 0.1875],
         [0.1875, 0.3523, 0.0170],
         [0.0170, 0.1875, 0.3523]],

        [[0.2156, 0.7773, 0.4563],
         [0.4563, 0.2156, 0.7773],
         [0.7773, 0.4563, 0.2156]]])
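Putting the pieces together, a self-contained version of this approach (the batch size of 2 and n = 3 are assumptions for illustration):
import torch
n = 3
mtx_row = torch.rand(2, n)
# Row i of idx holds the column order for the i-th cyclic shift.
idx = (n - torch.arange(n)[None].T + torch.arange(n)[None]) % n
# Broadcast both tensors to a common (batch, n, n) shape, then gather.
idx_ = idx[None].expand(len(mtx_row), -1, -1)
val_ = mtx_row[:, None].expand(-1, n, -1)
mtx_full = val_.gather(-1, idx_)
# mtx_full.shape == torch.Size([2, 3, 3])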

Tensorflow Sparse Arithmetic

Hi, I'm learning tensorflow right now and I have a sparse dataset made up of three columns: date, bond, spread. I figured that storing this data in a sparse tensor with bond as one dimension and date as another will make operations on this tensor feel natural; do let me know if you think there is a better way.
I am trying to perform arithmetic on two slices of the tensor, where I add/subtract values on one date only if the corresponding tensor value is not empty. While I found some functions that help with the task, I can't shake off the feeling that I'm missing a really simple solution to the problem.
Using the data below:
import tensorflow as tf
tf.enable_eager_execution()
indicies = [[0, 0], [0, 1], [1, 0], [1, 2], [2, 2]]
values = [10 , 10 , 10 , 11 , 11 ]
spreads = tf.sparse.SparseTensor(indicies, values, [3, 3])
In the above example, bonds index the first dimension and dates the second, such that
tf.sparse.slice(spreads,[0,2],[3,1])
gives me all spreads for date 2. But apparently subtraction is not supported for SparseTensor, nor can I use tf.math.subtract, so I am no longer sure what is supported.
Specifically, what I want to accomplish in the above example is to subtract the date-0 spread from all other dates, but only where a bond has a spread on both dates. For example, bond 0 shows up on dates 0 and 1 but not date 2, so I want to subtract its date-0 spread from dates 0 and 1.
Final tensor would look like this:
indicies2 = [[0, 0], [0, 1], [1, 0], [1, 2]]
output = [0, 0, 0, 1]
tf.sparse.to_dense(tf.sparse.SparseTensor(indicies2, output, [3, 3]))
<tf.Tensor: id=128, shape=(3, 3), dtype=int32, numpy=
array([[0, 0, 0],
       [0, 0, 1],
       [0, 0, 0]])>
I guess the easy solution would be to use tf.sparse.to_dense, but that kind of defeats the whole point of using SparseTensor. So I'm not sure whether I missed something in the API docs that makes my approach possible, or whether I got it completely wrong by trying to use SparseTensor.
At the end of the day I am just looking to perform some math for each value of a tensor if that value has a match in another tensor.
UPDATE:
I realized that I can apply tf.math.negative to one of the slices to subtract two slices; the problem is that the output assumes that if a value is missing in one slice, it can be taken to be some default value (zero).
I'm not sure there is any simple trick to make that work easily. I would either do the dense computation or write the sparse computation myself. The latter is a bit trickier, so it is probably only worth it if the data is really very sparse and you would save a lot of memory and computation. Here is a way to do that:
import tensorflow as tf
tf.enable_eager_execution()
bonds = [0, 0, 1, 1, 2]
dates = [0, 1, 0, 2, 2]
values = [10, 10, 10, 11, 11]
# Find date 0 data
m0 = tf.equal(dates, 0)
bonds0 = tf.boolean_mask(bonds, m0)
values0 = tf.boolean_mask(values, m0)
# Find where date 0 bonds are
match = tf.equal(tf.expand_dims(bonds, 1), bonds0)
# Compute the amount to subtract from each data point
values_sub = tf.reduce_sum(values0 * tf.dtypes.cast(match, values0.dtype), 1)
# Compute new spread values
values_new = values - values_sub
# Mask null values
m_valid = tf.not_equal(values_new, 0)
bonds_new = tf.boolean_mask(bonds, m_valid)
dates_new = tf.boolean_mask(dates, m_valid)
values_new = tf.boolean_mask(values_new, m_valid)
# Make sparse tensor
indices_new = tf.dtypes.cast(tf.stack([bonds_new, dates_new], 1), tf.int64)
spreads_new = tf.sparse.SparseTensor(indices_new, values_new, [3, 3])
tf.print(spreads_new)
# 'SparseTensor(indices=[[1 2]
# [2 2]], values=[1 11], shape=[3 3])'
For the example that you give, I get the outputs (1, 2) => 1 and (2, 2) => 11; (2, 2) is unaffected because there was no spread for bond 2 on date 0. That is different from what you wrote, so let me know if that is not what you meant.
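If you want to sanity-check the result against the dense picture, a quick addition (not part of the answer above):
print(tf.sparse.to_dense(spreads_new).numpy())
# [[ 0  0  0]
#  [ 0  0  1]
#  [ 0  0 11]]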

Delete array of values from numpy array

This post is an extension of this question.
I would like to delete multiple elements from a numpy array that have certain values. That is, for
import numpy as np
a = np.array([1, 1, 2, 5, 6, 8, 8, 8, 9])
How do I delete one instance of each of the values [1, 5, 8], such that the output is [1, 2, 6, 8, 8, 9]? All I have found in the documentation for removing array values is np.setdiff1d, but this removes all instances of each number. How can this be updated?
Using outer comparison and argmax to only remove once. For large arrays this will be memory intensive, since the created mask has a.shape * r.shape elements.
r = np.array([1, 5, 8])
m = (a == r[:, None]).argmax(1)
np.delete(a, m)
array([1, 2, 6, 8, 8, 9])
This does assume that each value in r appears in a at least once, otherwise the value at index 0 will get deleted since argmax will not find a match, and will return 0.
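If some values of r might be absent from a, a possible guard (an addition to the answer above, not part of it) is to keep only the values that actually occur before taking argmax:
mask = (a == r[:, None])
present = mask.any(1)        # which values of r occur in a at all
m = mask[present].argmax(1)  # first occurrence of each present value
np.delete(a, m)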
delNums = [np.where(a == x)[0][0] for x in [1,5,8]]
a = np.delete(a, delNums)
Here, delNums contains the indices of the first occurrences of the values 1, 5, 8, and np.delete() removes the values at those indices.
OUTPUT:
[1 2 6 8 8 9]

Map a number to an id in python

Suppose I have a numpy array like [11, 30, 25]. These numbers represent categories of the objects corresponding to the indices. I know there are just 20 categories, but for some reason they are numbered from 11 to 30. I'd like to convert them to numbers in 0:19 and back. What would be a pythonic way to do this? Preferably in numpy.
EDIT: this is just a small example of a bigger problem, where the number of categories is in the thousands and some categories are never represented, so the maximum ID should correspond to the number of unique categories actually present.
Let's say arr is the input array of categories.
Forward Process/Encoding : From categories to IDs
To perform the encoding, use np.unique along with its optional return_inverse argument to get IDs that have values from 0 to N-1, where N is the number of categories in arr, like so -
unq,idx = np.unique(arr,return_inverse=True)
Backward Process/Decoding : From IDs to categories
To go back to the original categories from the IDs (idx), just index into unique categories saved earlier as unq, like so -
arr_out = unq[idx]
Sample run -
In [40]: arr # Input array of categories
Out[40]: array([7, 1, 1, 3, 8, 2, 7, 7, 0, 2])
In [41]: unq,idx = np.unique(arr,return_inverse=True)
In [42]: idx # ID array with values from 0 to 5 (6 categories)
Out[42]: array([4, 1, 1, 3, 5, 2, 4, 4, 0, 2])
In [43]: unq[idx] # Get back original array of categories
Out[43]: array([7, 1, 1, 3, 8, 2, 7, 7, 0, 2])
To be able to easily convert back and forth, I would use LabelEncoder from sklearn.preprocessing:
In [7]: from sklearn.preprocessing import LabelEncoder
In [8]: encoder = LabelEncoder()
In [9]: encoder.fit(range(11,31))
Out[9]: LabelEncoder()
In [10]: encoder.transform([11,30,25])
Out[10]: array([ 0, 19, 14])
In [11]: encoder.inverse_transform([18, 1, 15])
Out[11]: array([29, 12, 26])
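Per the EDIT (thousands of categories, some never represented), you can also fit the encoder on the observed data itself instead of a fixed range, so the IDs stay dense over the categories that actually occur; a small sketch continuing the session above:
In [12]: encoder.fit_transform([11, 30, 25])
Out[12]: array([0, 2, 1])
In [13]: encoder.inverse_transform([0, 2, 1])
Out[13]: array([11, 30, 25])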
