This is on TensorFlow 1.11.0. The documentation of tft.apply_buckets is not very descriptive. Specifically, I read:
"bucket_boundaries: The bucket boundaries represented as a rank 2 Tensor."
I assume this has to be bucket indices and bucket boundaries?
When I try with the toy example below:
import tensorflow as tf
import tensorflow_transform as tft
import numpy as np
tf.enable_eager_execution()
x = np.array([-1,9,19, 29, 39])
xt = tf.cast(
tf.convert_to_tensor(x),
tf.float32
)
boundaries = tf.cast(
tf.transpose(
tf.convert_to_tensor([[0, 1, 2, 3], [10, 20, 30, 40]])
),
tf.float32
)
buckets = tft.apply_buckets(xt, boundaries)
I get:
InvalidArgumentError: Expected sorted boundaries [Op:BucketizeWithInputBoundaries] name: assign_buckets
Note that in this case x and bucket_boundaries arguments are:
tf.Tensor([-1. 9. 19. 29. 39.], shape=(5,), dtype=float32)
tf.Tensor(
[[ 0. 10.]
[ 1. 20.]
[ 2. 30.]
[ 3. 40.]], shape=(4, 2), dtype=float32)
So, it seems like bucket_boundaries is not supposed to be indices and boundaries. Does anyone know how to properly use this method?
After some playing around, I found out that bucket_boundaries is supposed to be a 2 dimensional array where entries are bucket boundaries and the array is wrapped so it has two columns. See example below:
import tensorflow as tf
import tensorflow_transform as tft
import numpy as np
tf.enable_eager_execution()
x = np.array([-1,9,19, 29, 39])
xt = tf.cast(
tf.convert_to_tensor(x),
tf.float32
)
boundaries = tf.cast(
tf.transpose(
tf.convert_to_tensor([[0, 20, 40, 60], [10, 30, 50, 70]])
),
tf.float32
)
buckets = tft.apply_buckets(xt, boundaries)
So, the inputs, the resulting buckets, and the boundaries are:
print (xt)
print (buckets)
print (boundaries)
tf.Tensor([-1. 9. 19. 29. 39.], shape=(5,), dtype=float32)
tf.Tensor([0 1 2 3 4], shape=(5,), dtype=int64)
tf.Tensor(
[[ 0. 10.]
[20. 30.]
[40. 50.]
[60. 70.]], shape=(4, 2), dtype=float32)
Wanted to add to this as this is the only result for the Google search "tft.apply_buckets" :)
The example for me did not work in the latest version of TFT. The following code did work for me.
Note that the buckets are specified as a rank 2 tensor, but with only one element in the inner dimension.
(I'm using the wrong words but hopefully my example below will clarify)
import tensorflow as tf
import tensorflow_transform as tft
import numpy as np
tf.enable_eager_execution()
xt = tf.cast(tf.convert_to_tensor(np.array([-1,9,19, 29, 39])),tf.float32)
bds = [[0],[10],[20],[30],[40]]
boundaries = tf.cast(tf.convert_to_tensor(bds),tf.float32)
buckets = tft.apply_buckets(xt, boundaries)
Thanks for your help, as this answer got me most of the way there!
The rest I found from the TFT source code:
https://github.com/tensorflow/transform/blob/deb198d59f09624984622f7249944cdd8c3b733f/tensorflow_transform/mappers.py#L1697-L1698
I love this answer, just wanted to add some simplification, as enabling eager execution, casting, and numpy aren't really needed. Note that in the float case below no explicit cast is needed: writing one of the scalars as a float is enough, because TensorFlow then infers a float dtype for the whole constant.
The code below shows how this mapping works. The number of buckets created is the length of the bucket boundaries vector + 1 or, more intuitively (in my opinion), the number of commas in the boundary list + 2. The two extra buckets cover negative infinity up to the smallest boundary and the largest boundary up to infinity. A value that lies exactly on a bucket boundary goes to the bucket representing bigger numbers. What happens when the bucket boundaries aren't sorted is left as an exercise for the reader :)
import tensorflow as tf
import tensorflow_transform as tft
xt = tf.constant([-1., 9, 19, 29, 39, float('nan'), float('-inf'), float('inf')])
bucket_boundaries = tf.constant([[0], [10], [20], [30], [40]])
bucketed_floats = tft.apply_buckets(xt, bucket_boundaries)
for scalar, index in zip(xt, range(len(xt))):
    print(f"{scalar} was mapped to bucket {bucketed_floats[index]}.")
-1.0 was mapped to bucket 0.
9.0 was mapped to bucket 1.
19.0 was mapped to bucket 2.
29.0 was mapped to bucket 3.
39.0 was mapped to bucket 4.
nan was mapped to bucket 5.
-inf was mapped to bucket 0.
inf was mapped to bucket 5.
xt_int = tf.constant([-1, 9, 19, 29, 39, 41])
bucketed_ints = tft.apply_buckets(xt_int, bucket_boundaries)
for scalar, index in zip(xt_int, range(len(xt_int))):
    print(f"{scalar} was mapped to bucket {bucketed_ints[index]}.")
-1 was mapped to bucket 0.
9 was mapped to bucket 1.
19 was mapped to bucket 2.
29 was mapped to bucket 3.
39 was mapped to bucket 4.
41 was mapped to bucket 5.
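For readers who want to check the boundary rule without installing TFT, here is a minimal NumPy sketch of the same mapping. It is only an analogy to the behaviour described above, not TFT's implementation; np.digitize and the extra test values (0, 10, 40) are my own choices for illustration.

```python
import numpy as np

# Same boundaries as in the TFT example above, as a flat sorted vector.
boundaries = np.array([0, 10, 20, 30, 40])

# Includes values that sit exactly on a boundary (0, 10, 40).
x = np.array([-1, 0, 9, 10, 39, 40, 41])

# right=False means boundaries[i-1] <= value < boundaries[i] maps to i,
# so a value on a boundary lands in the higher bucket.
buckets = np.digitize(x, boundaries, right=False)

for value, bucket in zip(x, buckets):
    print(f"{value} was mapped to bucket {bucket}.")

# Number of distinct buckets possible: len(boundaries) + 1.
print("number of buckets:", len(boundaries) + 1)
```

With five boundaries the indices run from 0 to 5, matching the "length of boundaries + 1" count above.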
Note that there's also a function called tft.bucketize, which appears to require a full pass over the data. I'm not 100% clear on the nuance between tft.apply_buckets and tft.bucketize.
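As far as I understand it, the difference is where the boundaries come from: tft.apply_buckets takes boundaries you supply, while tft.bucketize first derives quantile boundaries from the whole dataset (hence the full pass) and then applies them. The NumPy sketch below imitates that two-step idea. The helper fit_boundaries and the sample data are hypothetical, and exact quantiles stand in for whatever approximation TFT uses.

```python
import numpy as np

def fit_boundaries(values, num_buckets):
    """Hypothetical helper: derive bucket boundaries from the data itself.

    Uses exact quantiles so that each bucket holds roughly the same
    number of training values.
    """
    inner_quantiles = np.linspace(0, 1, num_buckets + 1)[1:-1]
    return np.quantile(values, inner_quantiles)

# "Full pass" step: look at all the data to find the boundaries.
train = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
boundaries = fit_boundaries(train, num_buckets=4)

# "Apply" step: map values to buckets using the fitted boundaries.
assigned = np.digitize(train, boundaries, right=False)

print(boundaries)
print(assigned)
```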
Related
Say I have a tensor as following :
var = tf.constant([0,0.05,0.2,0,0])
inverse_var = tf.math.reciprocal(var)
print(inverse_var)
Output : tf.Tensor([inf 20. 5. inf inf], shape=(5,), dtype=float32)
I want to make a new tensor from inverse_var tensor such that the infinity values are replaced with zero in the new tensor.
Final vector required - [ 0, 20, 5, 0, 0 ]
Here is a solution done using tf.tensor_scatter_nd_update method
import tensorflow as tf
var = tf.constant([0,0.05,0.2,0,0])
inverse_var = tf.math.reciprocal(var)
print(inverse_var)
mask = tf.math.is_inf(inverse_var)
indices = tf.where(mask) # found indices where infinite values are
print(indices)
updates=tf.zeros(len(indices)) # create 1D matrix of length of infinite values
inverse_var_inf = tf.tensor_scatter_nd_update(inverse_var,indices,updates) #updated using scatter_nd_update method
print(inverse_var_inf)
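A shorter variant of the same masking idea is to select element-wise between zero and the original value. The sketch below shows it in NumPy so it can be run without TensorFlow; in TensorFlow the analogous expression should be tf.where(tf.math.is_inf(inverse_var), tf.zeros_like(inverse_var), inverse_var), which I have not run against the exact version used here.

```python
import numpy as np

var = np.array([0, 0.05, 0.2, 0, 0], dtype=np.float32)

# Division by zero produces inf; silence the warning since it is expected.
with np.errstate(divide="ignore"):
    inverse_var = np.reciprocal(var)

# Keep finite entries, replace infinite ones with zero.
cleaned = np.where(np.isinf(inverse_var), 0.0, inverse_var)

print(inverse_var)
print(cleaned)
```

This avoids building an index list and an updates tensor, at the cost of evaluating both branches for every element.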
I have the following code with which I am clustering hierarchically. My data object is an array of similarity distances I calculated earlier. I think I am executing the clustering properly. I thought I could just get the leaves of the Cluster, but when I compare that to the original input I get a mismatch.
I have two questions here:
Why is there a mismatch between the leaves of my cluster and my actual input data?
How can I extract the original data from a cluster by either the linkage matrix or clusternodes?
import numpy as np
import pandas
import scipy.cluster.hierarchy as sch

def list_difference(list1, list2):
    return [value for value in list1 if value not in list2]

if __name__ == '__main__':
    # example data for this question's purpose.
    data = [10, 11, 29, 288, 16]
    X = np.array([[i] for i in data])
    linkage_matrix = sch.average(X)
    rootnode, nodelist = sch.to_tree(linkage_matrix, rd=True)
    leaves = sch.leaves_list(linkage_matrix)
    print(list_difference(leaves, data))
I want to retrieve the original data points per cluster.
Given your data
data = [10, 11, 29, 288, 16]
the result is compatible with the dendrogram
sch.dendrogram(linkage_matrix);
Analyzing linkage_matrix we can confirm
print(linkage_matrix)
array([[ 0. , 1. , 1. , 2. ],
[ 4. , 5. , 5.5 , 3. ],
[ 2. , 6. , 16.66666667, 4. ],
[ 3. , 7. , 271.5 , 5. ]])
Row by row we have
element 0 and element 1 merge at distance 1 into a cluster of 2 elements (this cluster will be called 5)
element 4 merges with cluster 5 (the previous one) at distance 5.5, giving 3 elements (this cluster will be called 6)
element 2 merges with cluster 6 (the previous one) at distance 16.667, giving 4 elements (this cluster will be called 7)
element 3 merges with cluster 7 (the previous one) at distance 271.5, giving 5 elements
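The numbers in the linkage matrix (and in sch.leaves_list) are indices into the input, not the input values themselves, which is why comparing leaves with data shows a mismatch. A minimal sketch of walking the linkage matrix to recover the original data points per cluster follows. The matrix is copied from the output above so the snippet runs without SciPy, and the members dictionary is my own illustrative name.

```python
import numpy as np

data = [10, 11, 29, 288, 16]

# Linkage matrix copied from the printed output above.
# Columns: [child_a, child_b, distance, number_of_elements].
linkage_matrix = np.array([
    [0, 1, 1.0, 2],
    [4, 5, 5.5, 3],
    [2, 6, 16.66666667, 4],
    [3, 7, 271.5, 5],
])

n = len(data)

# Clusters 0..n-1 are the original observations; row i creates cluster n + i.
members = {i: [i] for i in range(n)}
for i, (a, b, _dist, _size) in enumerate(linkage_matrix):
    members[n + i] = members[int(a)] + members[int(b)]

# Translate index lists back into the original data values.
for cluster_id in range(n, n + len(linkage_matrix)):
    values = [data[j] for j in members[cluster_id]]
    print(f"cluster {cluster_id}: indices {members[cluster_id]} -> values {values}")
```

The same lookup, [data[i] for i in leaves], turns the result of sch.leaves_list back into data values.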
Right now I have a 2 by 2 numpy array. RobustScaler normalizes each column one at a time, whereas I wish to normalize everything at once. Is there any way to do that?
From the documentation the RobustScaler:
removes the median and scales the data according to the quantile range
So you need to compute the median and the quantile range over the whole array. For this you can use the np.median and np.percentile functions, which is essentially what sklearn does under the hood. The code:
import numpy as np
from sklearn.preprocessing import robust_scale
data = np.array([[3, 6],
[9, 12]], dtype=np.float64)
result = robust_scale(data, axis=0)
print(result)
reshape = data.reshape((1, 4))
result = robust_scale(reshape, axis=1)
me = np.median(data.flat) # 7.5
percentiles = np.percentile(data, (25.0, 75.0)) # 5.25 9.75
data -= me
data /= (percentiles[1] - percentiles[0])
print(data)
Output
[[-1. -1.]
[ 1. 1.]]
[[-1. -0.33333333]
[ 0.33333333 1. ]]
In the example I used (25.0, 75.0) because these are the default values for the quantile range. Also, the function robust_scale is equivalent to RobustScaler (see the See Also section of the documentation).
Given a Tensor with dimension greater than two, for example with shape (3, 3, 16, 32) - how can I rotate the 2D-matrices given by the first two dimension by an angle x?
Example:
kernel[:,:,0,0]
array([[ 0.14498544, 0.14481193, -0.18206167],
[ 0.06301615, 0.15354747, 0.176368 ],
[-0.16842318, -0.12931588, 0.0105814 ]], dtype=float32)
kernel_rot[:,:,0,0]
array([[-0.18206167, 0.176368 , 0.0105814 ],
[ 0.14481193, 0.15354747, -0.12931588],
[ 0.14498544, 0.06301615, -0.16842318]], dtype=float32)
I can easily do this in numpy using rotated = np.rot90(kernel) (for the special case of 90 degrees, which is easier to handle because no padding is needed). However, I am unsure whether TF can pick this up as part of a complex computational graph. The kernel in my case is actually a Variable, specifically the weight kernel of a 2D-Conv layer.
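For the 90-degree special case, the rotation can be written as a transpose of the first two axes followed by a flip, and both operations have graph-friendly TensorFlow counterparts (tf.transpose and tf.reverse). The sketch below checks the identity in NumPy on a random stand-in kernel. The TensorFlow line in the comment is my assumption of the equivalent call and is not executed here, and arbitrary angles are not covered.

```python
import numpy as np

# Random stand-in for the (3, 3, 16, 32) conv kernel from the question.
rng = np.random.default_rng(0)
kernel = rng.standard_normal((3, 3, 16, 32)).astype(np.float32)

# Swap the two spatial axes, then reverse the new first axis.
# Assumed TensorFlow analogue (not run here):
#   tf.reverse(tf.transpose(kernel, perm=[1, 0, 2, 3]), axis=[0])
kernel_rot = np.flip(np.transpose(kernel, (1, 0, 2, 3)), axis=0)

# np.rot90 rotates in the plane of the first two axes by default.
print(np.array_equal(kernel_rot, np.rot90(kernel)))

# First row of the rotated slice is the last column of the original slice.
print(kernel_rot[0, :, 0, 0], kernel[:, 2, 0, 0])
```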
There are N distributions that take integer values 0, 1, ... with associated probabilities. As an example, take 3 variables, each given as [value, prob] pairs:
import numpy as np
x = np.array([ [0,0.3],[1,0.2],[3,0.5] ])
y = np.array([ [10,0.2],[11,0.4],[13,0.1],[14,0.3] ])
z = np.array([ [21,0.3],[23,0.7] ])
As there are N variables, I first convolve x+y, then add z, and so on.
Unfortunately numpy.convolve() takes 1-d arrays as input, so it does not apply directly in this case. I experimented with expanding every variable to take all values 0,1,2,...,23 (with Pr=0 where a value is not known)... I feel like there is a much better solution.
Does anyone have a suggestion for making it more efficient? Thanks in advance.
I don't see a built-in method for this in SciPy; there's a way to define custom discrete random variables, but those don't support addition. Here is an approach using pandas, assuming import pandas as pd and x,y,z as in your example:
values = np.add.outer(x[:,0], y[:,0]).flatten()
probs = np.multiply.outer(x[:,1], y[:,1]).flatten()
df = pd.DataFrame({'values': values, 'probs': probs})
conv = df.groupby('values').sum()
result = conv.reset_index().values
The output is
array([[ 10. , 0.06],
[ 11. , 0.16],
[ 12. , 0.08],
[ 13. , 0.13],
[ 14. , 0.31],
[ 15. , 0.06],
[ 16. , 0.05],
[ 17. , 0.15]])
With more than two variables, you don't have to go back and forth between numpy and pandas: the additional variables can be included at the beginning.
values = np.add.outer(np.add.outer(x[:,0], y[:,0]), z[:,0]).flatten()
probs = np.multiply.outer(np.multiply.outer(x[:,1], y[:,1]), z[:,1]).flatten()
Aside: it would be better to keep values and probabilities in separate numpy arrays, if they have different intrinsic data types (integers vs reals).
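If you would rather stay in numpy entirely, the groupby-and-sum step can be replaced with np.unique plus np.bincount. This is a sketch of that idea; the function name add_distributions is my own, and it assumes each input is a 2-column [value, prob] array as in the question.

```python
from functools import reduce

import numpy as np

x = np.array([[0, 0.3], [1, 0.2], [3, 0.5]])
y = np.array([[10, 0.2], [11, 0.4], [13, 0.1], [14, 0.3]])
z = np.array([[21, 0.3], [23, 0.7]])

def add_distributions(a, b):
    """Distribution of the sum of two independent discrete variables."""
    # Every pairwise sum of values, with the product of its probabilities.
    values = np.add.outer(a[:, 0], b[:, 0]).ravel()
    probs = np.multiply.outer(a[:, 1], b[:, 1]).ravel()
    # Merge duplicate sums by adding up their probabilities.
    unique_values, inverse = np.unique(values, return_inverse=True)
    summed = np.bincount(inverse, weights=probs)
    return np.column_stack([unique_values, summed])

xy = add_distributions(x, y)
xyz = reduce(add_distributions, [x, y, z])

print(xy)
print(xyz[:, 1].sum())
```

Reducing pairwise keeps the intermediate arrays small, because duplicates are merged after each addition rather than only at the end.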