Hi, I'm learning TensorFlow right now and I have a sparse dataset made up of three columns: date, bond, spread. I figured that storing this data in a sparse tensor, with bond as one dimension and date as another, would make operations on the tensor feel natural; do let me know if you think there is a better way.
I am trying to perform arithmetic on two slices of the tensor, where I add/subtract values on one date only if the given tensor value is not empty. While I found some functions that help me with that task, I can't shake off the feeling that I'm missing a really simple solution to the problem.
Using the data below:
import tensorflow as tf
tf.enable_eager_execution()
indices = [[0, 0], [0, 1], [1, 0], [1, 2], [2, 2]]
values = [10, 10, 10, 11, 11]
spreads = tf.sparse.SparseTensor(indices, values, [3, 3])
In the above example I intend to use dimension 0 for bonds and dimension 1 for dates, so that
tf.sparse.slice(spreads, [0, 2], [3, 1])
gives me all spreads for date 2. But apparently subtraction is not supported for SparseTensor, nor can I use tf.math.subtract, so I am no longer sure what is supported.
Specifically, what I want to accomplish in the above example is to subtract date 0 from all other dates, but only if a bond has a spread on both dates. For example, bond 0 shows up on dates 0 and 1 but not date 2, so I want to subtract its date 0 spread from dates 0 and 1 only.
The final tensor would look like this:
indices2 = [[0, 0], [0, 1], [1, 0], [1, 2]]
output = [0, 0, 0, 1]
tf.sparse.to_dense(tf.sparse.SparseTensor(indices2, output, [3, 3]))
<tf.Tensor: id=128, shape=(3, 3), dtype=int32, numpy=
array([[0, 0, 0],
       [0, 0, 1],
       [0, 0, 0]])>
I guess an easy solution would be to use tf.sparse.to_dense, but that kind of defeats the whole point of using SparseTensor. So I'm not really sure if I missed something in the API docs that makes my solution possible, or whether I got it completely wrong by trying to use SparseTensor in the first place.
At the end of the day, I am just looking to perform some math on each value of a tensor if that value has a match in another tensor.
UPDATE:
I realized that I can apply tf.math.negative to one of the slices and then add the slices to perform the subtraction. The problem is that the output assumes that if a value is missing in one slice, it can be treated as some default value (zero).
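To illustrate, here is a minimal sketch of what I mean on the data above (negating the slice by rebuilding it with negated values, which works whether or not tf.math.negative accepts a SparseTensor in your version, and adding with tf.sparse.add; the d0/d1 names are just for this example):
import tensorflow as tf
tf.enable_eager_execution()

indices = [[0, 0], [0, 1], [1, 0], [1, 2], [2, 2]]
values = [10, 10, 10, 11, 11]
spreads = tf.sparse.SparseTensor(indices, values, [3, 3])

# Slice out the date 0 and date 1 columns
d0 = tf.sparse.slice(spreads, [0, 0], [3, 1])
d1 = tf.sparse.slice(spreads, [0, 1], [3, 1])

# Negate d0 by rebuilding it with negated values, then add the slices
neg_d0 = tf.sparse.SparseTensor(d0.indices, -d0.values, d0.dense_shape)
diff = tf.sparse.add(d1, neg_d0)

# diff contains -10 for bond 1 even though bond 1 has no spread on
# date 1: the missing value was treated as an implicit zero.
tf.print(diff)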
I'm not sure there is any simple trick to make that work easily. I would either do the dense computation or write the sparse computation myself. The latter is a bit trickier, so it is probably only worth it if the data is really very sparse and you would save a lot of memory and computation. Here is a way to do that:
import tensorflow as tf
tf.enable_eager_execution()
bonds = [0, 0, 1, 1, 2]
dates = [0, 1, 0, 2, 2]
values = [10, 10, 10, 11, 11]
# Find date 0 data
m0 = tf.equal(dates, 0)
bonds0 = tf.boolean_mask(bonds, m0)
values0 = tf.boolean_mask(values, m0)
# Find where date 0 bonds are
match = tf.equal(tf.expand_dims(bonds, 1), bonds0)
# Compute the amount to subtract from each data point
values_sub = tf.reduce_sum(values0 * tf.dtypes.cast(match, values0.dtype), 1)
# Compute new spread values
values_new = values - values_sub
# Mask null values
m_valid = tf.not_equal(values_new, 0)
bonds_new = tf.boolean_mask(bonds, m_valid)
dates_new = tf.boolean_mask(dates, m_valid)
values_new = tf.boolean_mask(values_new, m_valid)
# Make sparse tensor
indices_new = tf.dtypes.cast(tf.stack([bonds_new, dates_new], 1), tf.int64)
spreads_new = tf.sparse.SparseTensor(indices_new, values_new, [3, 3])
tf.print(spreads_new)
# 'SparseTensor(indices=[[1 2]
# [2 2]], values=[1 11], shape=[3 3])'
For the example that you give, I get the outputs (1, 2) => 1 and (2, 2) => 11; (2, 2) is unaffected because bond 2 had no spread on date 0. That is different from what you wrote, so let me know if that is not what you meant.
I have a sparse array/tensor like the one below.
import torch
from torch_sparse import SparseTensor
row = torch.tensor([0, 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3])
col = torch.tensor([1, 2, 3, 0, 2, 0, 1, 4, 5, 0, 2, 5, 2, 4])
value = torch.rand([14])
adj_t = SparseTensor(row=row, col=col, value=value, sparse_sizes=(4, 9))
I want to sample n_samples column indices, with or without replacement. I can do this by first converting adj_t to dense and then using torch.multinomial, or similarly with numpy.random.choice.
sample = torch.multinomial(adj_t.to_dense(), num_samples=2, replacement=True)
But converting the sparse array to dense and then calling torch.multinomial is not very efficient. Is there a sparse version of torch.multinomial? If not, how would one go about implementing it?
I am not sure if this can be done as efficiently as your one-liner.
From what I understand, one way to achieve what you want is to:
1. Group the values by the row in which they appear in the sparse tensor, e.g. using this solution: np.split(value.numpy(), np.unique(row.numpy(), return_index=True)[1][1:])
2. Use numpy.random.multinomial to create the list of randomly chosen indexes for every row
3. Map the indexes to the respective values from col (i.e. index 0 in the 0th row is 1, index 1 in the 1st row is 2, index 2 in the 2nd row is 4, all according to the row and col values)
You might not want to use any explicit loop, so that performance does not drop; a rough sketch of the steps follows below.
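For illustration, here is a minimal, unoptimized sketch of those three steps. It uses np.random.choice instead of numpy.random.multinomial for brevity, and a plain Python loop (which the note above suggests avoiding); n_samples and the variable names are my own:
import numpy as np
import torch

row = torch.tensor([0, 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3])
col = torch.tensor([1, 2, 3, 0, 2, 0, 1, 4, 5, 0, 2, 5, 2, 4])
value = torch.rand([14])

# Step 1: group the values (and their column indices) by row
splits = np.unique(row.numpy(), return_index=True)[1][1:]
value_groups = np.split(value.numpy(), splits)
col_groups = np.split(col.numpy(), splits)

# Steps 2 and 3: sample positions within each row, then map them to col
n_samples = 2
samples = []
for vals, cols in zip(value_groups, col_groups):
    probs = vals / vals.sum()  # normalize the row's values to probabilities
    picked = np.random.choice(len(cols), size=n_samples, replace=True, p=probs)
    samples.append(cols[picked])

sample = torch.from_numpy(np.stack(samples))  # shape (num_rows, n_samples)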
Hi, please help me either speed up this dictionary comprehension, offer a better way to do it, or help me gain a higher understanding of why it is so slow internally (for example, is the calculation slowing down as the dictionary grows in memory?). I'm sure there must be a quicker way without learning some C!
classes = {i : [1 if x in df['column'].str.split("|")[i] else 0 for x in df['column']] for i in df.index}
with the output:
{1:[0,1,0...0],......, 4000:[0,1,1...0]}
from a df like this:
import pandas as pd

data_ = {'drugbank_id': ['DB06605', 'DB06606', 'DB06607', 'DB06608', 'DB06609'],
         'drug-interactions': ['DB06605|DB06695|DB01254|DB01609|DB01586|DB0212',
                               'DB06605|DB06695|DB01254|DB01609|DB01586|DB0212',
                               'DB06606|DB06607|DB06608|DB06609',
                               'DB06606|DB06607',
                               'DB06608']
        }
df = pd.DataFrame(data=data_, index=range(0, 5))
I am performing it on a df with 4000 rows; the column df['column'] contains a string of IDs separated by |. The number of IDs in each row that needs splitting varies from 1 to 1000, and this is done for all 4000 indexes. I tested it on the head of the df and it seemed quick enough, but now the comprehension has been running for 24 hrs. So maybe it is just the sheer size of the job, but I feel like I could speed it up, and at this point I want to stop it and re-engineer. However, I am scared that will set me back without much increase in speed, so before I do that I wanted to get some thoughts, ideas and suggestions.
Beyond the 4000x4000 size, I suspect that using the Series and Index objects is another problem, and that I would be better off using lists. But given the size of the task I am not sure how much speed that would gain, and maybe I am better off using some other method, such as pd.apply(df, f(write line by line to json)). I am not sure; any help and education appreciated, thanks.
Here is one approach:
import pandas as pd
# create data frame
df = pd.DataFrame({'idx': [1, 2, 3, 4], 'col': ['1|2', '1|2|3', '2|3', '1|4']})
# split on '|' to convert string to list
df['col'] = df['col'].str.split('|')
# explode to get one row for each list element
df = df.explode('col')
# create a dummy value (this will become 1 in the final result)
df['dummy'] = 1
# use pivot to create dense matrix
df = (df.pivot(index='idx', columns='col', values='dummy')
        .fillna(0)
        .astype(int))
# convert each row to a list
df['test'] = df.apply(lambda x: x.to_list(), axis=1)
print(df)
col  1  2  3  4          test
idx
1    1  1  0  0  [1, 1, 0, 0]
2    1  1  1  0  [1, 1, 1, 0]
3    0  1  1  0  [0, 1, 1, 0]
4    1  0  0  1  [1, 0, 0, 1]
The output you want can be achieved using dummies. We split the column, stack, and use max to turn the result into dummy indicators based on the original index. Then we use reindex to get it in the order you want, based on the 'drugbank_id' column. Finally, to get the dictionary you want, we transpose and use to_dict:
classes = (pd.get_dummies(df['drug-interactions'].str.split('|', expand=True).stack())
             .max(level=0)
             .reindex(df['drugbank_id'], axis=1)
             .fillna(0, downcast='infer')
             .T.to_dict('list'))
print(classes)
{0: [1, 0, 0, 0, 0], #Has DB06605, No DB06606, No DB06607, No DB06608, No DB06609
1: [1, 0, 0, 0, 0],
2: [0, 1, 1, 1, 1],
3: [0, 1, 1, 0, 0],
4: [0, 0, 0, 1, 0]}
I am trying to calculate marketsheds using the skimage.graph.MCP_Geometric find_costs function. It has been working wonderfully for calculating least-cost routes, but rather than finding the travel cost to the nearest source, I want to calculate the index of the nearest source.
Sample Code
import numpy as np
import skimage.graph as graph
import copy
img = np.array([[1,1,2,2],[2,1,1,3],[3,2,1,2],[2,2,2,1]])
mcp = graph.MCP_Geometric(img)
destinations = [[0,0],[3,3]]
costs, traceback = mcp.find_costs(destinations)
print(costs)
[[0. 1. 2.5 4.5 ]
[1.5 1.41421356 2.41421356 4. ]
[4. 2.91421356 1.41421356 1.5 ]
[5.5 3.5 1.5 0. ]]
This works as expected, and creates a nice travel cost raster. However, I want (for each cell) to know which of the destinations is the closest. The best solution I have found is to run each of the destinations separately, then combine them through min calculations. It works, but is slow, and has not been working at scale.
all_c = []
for dest in destinations:
    costs, traceback = mcp.find_costs([dest])
    all_c.append(copy.deepcopy(costs))
res = np.dstack(all_c)
res_min = np.amin(res, axis=2)
output = np.zeros([res_min.shape[0], res_min.shape[1]])
for idx in range(0, res.shape[2]):
    cur_data = res[:, :, idx]
    cur_val = (cur_data == res_min).astype(np.byte) * idx
    output = output + cur_val
output = output.astype(np.byte)
print(output)
array([[0, 0, 0, 0],
[0, 0, 1, 1],
[0, 1, 1, 1],
[1, 1, 1, 1]], dtype=int8)
I have been looking into overloading the functions of MCP_Geometric and MCP_Flexible, but I cannot find anything providing information on the index of the destination.
Hope that provides enough information to replicate and understand what I want to do, thanks!
Ok, this is a bit of a ride, but it was fun to figure out. I'm unclear just how fast it'll be but I think it should be pretty fast in the case of many destinations and comfortably-in-RAM images.
The key is the traceback return value, which kinda-sorta tells you the neighbor index to get to the nearest destination. So with a bit of pathfinding you should be able to find that destination. Can that be fast? It turns out it can, with a bit of NumPy index wrangling, scipy.sparse matrices, and connected_components from scipy.sparse.csgraph!
Let's start with your same costs array and both destinations:
import numpy as np
image = np.array(
[[1, 1, 2, 2],
[2, 1, 1, 3],
[3, 2, 1, 2],
[2, 2, 2, 1]]
)
destinations = [[0, 0], [3, 3]]
We then make the graph, and get the costs and the traceback:
from skimage import graph
mcp = graph.MCP_Geometric(image)
costs, traceback = mcp.find_costs(destinations)
print(traceback)
gives:
[[-1 4 4 4]
[ 6 7 7 1]
[ 6 6 0 1]
[ 3 3 3 -1]]
Now, I had to look up the documentation for what traceback is:
Same shape as the costs array; this array contains the offset to
any given index from its predecessor index. The offset indices
index into the offsets attribute, which is a array of n-d
offsets. In the 2-d case, if offsets[traceback[x, y]] is (-1, -1),
that means that the predecessor of [x, y] in the minimum cost path
to some start position is [x+1, y+1]. Note that if the
offset_index is -1, then the given index was not considered.
For some reason, my mcp object didn't have an offsets attribute — possibly a Cython inheritance bug? Dunno, will investigate later — but searching the source code shows me that offsets is defined with the skimage.graph._mcp.make_offsets function. So I did a bad thing and imported from that private module, so I could claim what was rightfully mine — the offsets list, which translates from numbers in traceback to offsets in the image coordinates:
from skimage.graph import _mcp
offsets = _mcp.make_offsets(2, True)
print(offsets)
which gives:
[array([-1, -1]),
array([-1, 0]),
array([-1, 1]),
array([ 0, -1]),
array([0, 1]),
array([ 1, -1]),
array([1, 0]),
array([1, 1])]
Now, there's one last thing to do with the offsets: you'll note that destinations are marked in the traceback with "-1", which doesn't correspond to the last element of the offsets array. So we append np.array([0, 0]), and then every value in traceback corresponds to a real offset. In the case of destinations, you get a self-edge, but that's fine.
offsets.append(np.array([0, 0]))
offsets_arr = np.array(offsets) # shape (9, 2)
Now, we can build a graph from offsets, pixel coordinates, and pixel ids. First, we use np.indices to get an index for every pixel in the image:
indices = np.indices(traceback.shape)
print(indices.shape)
gives:
(2, 4, 4)
To get an array that has, for each pixel, the offset to its neighbor, we use fancy array indexing:
offset_to_neighbor = offsets_arr[traceback]
print(offset_to_neighbor.shape)
which gives:
(4, 4, 2)
The axes are different between the traceback and the numpy indices, but nothing a little transposition won't fix:
neighbor_index = indices - offset_to_neighbor.transpose((2, 0, 1))
Finally, we want to deal with integer pixel ids in order to create a graph of all the pixels, rather than coordinates. For this, we use np.ravel_multi_index.
ids = np.arange(traceback.size).reshape(image.shape)
neighbor_ids = np.ravel_multi_index(
    tuple(neighbor_index), traceback.shape
)
This gives me a unique ID for each pixel, and then a unique "step towards the destination" for each pixel:
print(ids)
print(neighbor_ids)
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]
[12 13 14 15]]
[[ 0 0 1 2]
[ 0 0 1 11]
[ 4 5 15 15]
[13 14 15 15]]
Then we can turn this into a graph using SciPy sparse matrices. We don't care about weights for this graph so we just use the value 1 for the edges.
from scipy import sparse
g = sparse.coo_matrix((
    np.ones(traceback.size),
    (ids.flat, neighbor_ids.flat),
), shape=(ids.size, ids.size)).tocsr()
(This uses the (value, (row, column)) or (data, (i, j)) input format for sparse COOrdinate matrices.)
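If that (data, (i, j)) format is unfamiliar, here is a tiny standalone example with made-up values:
from scipy import sparse

# data[k] is placed at (i[k], j[k]): 1.0 at (0, 2) and 2.0 at (1, 1)
m = sparse.coo_matrix(([1.0, 2.0], ([0, 1], [2, 1])), shape=(2, 3))
print(m.toarray())
# [[0. 0. 1.]
#  [0. 2. 0.]]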
Finally, we use connected components to get the graphs — the groups of pixels that are nearest to each destination. The function returns the number of components and the mapping of "pixel id" to component:
n, components = sparse.csgraph.connected_components(g)
basins = components.reshape(image.shape)
print(basins)
[[0 0 0 0]
[0 0 0 1]
[0 0 1 1]
[1 1 1 1]]
(Note that this result is slightly different from yours because the cost is identical to destination 0 and 1 for the pixels in question, so it's arbitrary which to label.)
print(costs)
[[0. 1. 2.5 4.5 ]
[1.5 1.41421356 2.41421356 4. ]
[4. 2.91421356 1.41421356 1.5 ]
[5.5 3.5 1.5 0. ]]
Hope this helps!
I am working on the argmax function of PyTorch, which is defined as:
torch.argmax(input, dim=None, keepdim=False)
Consider an example
a = torch.randn(4, 4)
print(a)
print(torch.argmax(a, dim=1))
Here, when I use dim=1, instead of searching along column vectors the function searches along row vectors, as shown below.
print(a) :
tensor([[-1.7739, 0.8073, 0.0472, -0.4084],
[ 0.6378, 0.6575, -1.2970, -0.0625],
[ 1.7970, -1.3463, 0.9011, -0.8704],
[ 1.5639, 0.7123, 0.0385, 1.8410]])
print(torch.argmax(a, dim=1))
tensor([1, 1, 0, 3])
As far as my assumption goes, dim=0 represents rows and dim=1 represents columns.
It's time to correctly understand how the axis or dim argument works in PyTorch:
The following example should make sense once you see how the dimensions run over the tensor: dim-0 runs down the rows and dim-1 runs across the columns.

         dim-1 --->
dim-0  [[-1.7739,  0.8073,  0.0472, -0.4084],
  |     [ 0.6378,  0.6575, -1.2970, -0.0625],
  v     [ 1.7970, -1.3463,  0.9011, -0.8704],
        [ 1.5639,  0.7123,  0.0385,  1.8410]]
# argmax (indices where max values are present) along dimension-1
In [215]: torch.argmax(a, dim=1)
Out[215]: tensor([1, 1, 0, 3])
Note: dim (short for 'dimension') is the torch equivalent of 'axis' in NumPy.
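A quick sketch of that equivalence, comparing torch.argmax against np.argmax on the same data (the round-trip through NumPy is just for the comparison):
import numpy as np
import torch

a = torch.randn(4, 4)

# dim in torch plays the role of axis in NumPy
torch_result = torch.argmax(a, dim=1)
numpy_result = np.argmax(a.numpy(), axis=1)
assert torch.equal(torch_result, torch.from_numpy(numpy_result))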
Dimensions are defined as shown in the excellent answer above. I have highlighted the way I understand dimensions in Torch and NumPy (dim and axis respectively) and hope that this will be helpful to others.
Notice that only the specified dimension's index varies during the argmax operation, and that the specified dimension's index range reduces to a single index once the operation is completed. Let tensor A have M rows and N columns, and consider the sum operation for simplicity. The shape of A is (M, N). If dim=0 is specified, then the vectors A[0,:], A[1,:], ..., A[M-1,:] are summed elementwise, and the result is another tensor with 1 row and N columns. Notice that only the 0th dimension's indices vary from 0 through M-1. Similarly, if dim=1 is specified, then the vectors A[:,0], A[:,1], ..., A[:,N-1] are summed elementwise, and the result is another tensor with M rows and 1 column.
An example is given below:
>>> A = torch.tensor([[1,2,3], [4,5,6]])
>>> A
tensor([[1, 2, 3],
[4, 5, 6]])
>>> S0 = torch.sum(A, dim = 0)
>>> S0
tensor([5, 7, 9])
>>> S1 = torch.sum(A, dim = 1)
>>> S1
tensor([ 6, 15])
In the above sample code, the first sum operation specifies dim=0, therefore A[0,:] and A[1,:], which are [1,2,3] and [4,5,6], are summed, resulting in [5, 7, 9]. When dim=1 is specified, the vectors A[:,0], A[:,1], and A[:,2], which are [1, 4], [2, 5], and [3, 6], are added elementwise to give [6, 15].
Note also that the specified dimension collapses. Again let A have the shape (M, N). If dim=0, then the result will have the shape (1, N), where dimension 0 is reduced from M to 1. Similarly, if dim=1, then the result will have the shape (M, 1), where N is reduced to 1. Note also that shapes (1, N) and (M, 1) are represented by a single-dimensional tensor with N and M elements respectively.
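A quick sketch to verify that collapse; keepdim=True, which keeps the reduced dimension around as size 1, is extra detail not covered above:
import torch

A = torch.tensor([[1, 2, 3], [4, 5, 6]])  # shape (2, 3), i.e. (M, N)

print(torch.sum(A, dim=0).shape)                # torch.Size([3]): dim 0 collapsed
print(torch.sum(A, dim=1).shape)                # torch.Size([2]): dim 1 collapsed
print(torch.sum(A, dim=0, keepdim=True).shape)  # torch.Size([1, 3])
print(torch.sum(A, dim=1, keepdim=True).shape)  # torch.Size([2, 1])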
Is there a built-in function (NumPy or pandas, I'm thinking) that would help combine multiple rows of one column in a dataframe, keeping the same dimensions but a different scale? Also, combined with that, summing the values from a different column between the intervals? Or is it something I just need to build from scratch? Example below; I'm not sure exactly how to ask. This would need to be scalable: the example is simple, but in reality I'm working with a 250-dim array and theoretically unlimited rows.
Ex:
import pandas as pd
import numpy as np
#Creating DF
df = pd.DataFrame([[[-2,-1,0,1,2],[-10,-5,5,5,-10]],
[[-.5,.5,1.5,2.5,3.5],[-3,-2,0,-2,-3]]])
output:
                            0                     1
0           [-2, -1, 0, 1, 2]  [-10, -5, 5, 5, -10]
1  [-0.5, 0.5, 1.5, 2.5, 3.5]    [-3, -2, 0, -2, -3]
where the answer is [-2, -0.625, 0.75, 2.125, 3.5] (column 0 combined, still with dim 5) and [-10, -5, 0, -5, -5] (sum of column 1 between the steps of column 0, where (interval-1) < x <= interval):
answer = pd.DataFrame([[[-2,-.625,.75,2.125,3.5],[-10,-5,0,-5,-5]]])
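If it helps clarify the intent, here is a minimal sketch that reproduces those numbers, under the assumption that the combined axis is a uniform np.linspace over the union of both ranges and that the sums use half-open intervals (prev, current]:
import numpy as np

x = np.concatenate([[-2, -1, 0, 1, 2], [-0.5, 0.5, 1.5, 2.5, 3.5]])
y = np.concatenate([[-10, -5, 5, 5, -10], [-3, -2, 0, -2, -3]])

# Combined axis: same number of points, spanning the union of both ranges
new_x = np.linspace(x.min(), x.max(), 5)
# array([-2.   , -0.625,  0.75 ,  2.125,  3.5  ])

# Assign each x to a half-open interval (prev, current] and sum its y values
bins = np.digitize(x, new_x, right=True)
new_y = np.array([y[bins == i].sum() for i in range(len(new_x))])
# array([-10.,  -5.,   0.,  -5.,  -5.])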