PyTorch NaNLabelEncoder to encode and decode a categorical target

I'm new to PyTorch, but it feels like this should be simple. How do I inverse-transform this tensor?
from pytorch_forecasting import TimeSeriesDataSet
from pytorch_forecasting.data import NaNLabelEncoder

classification_dataset = TimeSeriesDataSet(
    df,
    group_ids=["group"],
    target="target_col",  # categorical target
    time_idx="time_idx",
    min_encoder_length=60 * 60,  # how much history to use
    max_encoder_length=60 * 60,
    min_prediction_length=5,
    max_prediction_length=5,  # how far to predict into the future
    time_varying_unknown_reals=[
        # ...list of columns here
    ],
    target_normalizer=NaNLabelEncoder(),  # use NaNLabelEncoder to encode the categorical target
)
x, y = next(iter(classification_dataset.to_dataloader(batch_size=4)))
y[0]  # target values are encoded categories
Output:
tensor([[6, 6, 6, 6, 6],
        [5, 5, 5, 5, 5],
        [5, 5, 5, 5, 5],
        [1, 1, 1, 1, 1]])
classification_dataset.target_normalizer returns NaNLabelEncoder() but it's not fitted.

Ah, it's as simple as:
classification_dataset.target_normalizer.classes_
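For decoding, the fitted encoder's classes_ attribute holds the category-to-code mapping, and NaNLabelEncoder also provides an inverse_transform method. A minimal sketch, assuming the dataset above has been built (so the normalizer is fitted) and that inverse_transform accepts the encoded tensor directly (otherwise, invert the classes_ dict by hand):

encoder = classification_dataset.target_normalizer

# Mapping from original category to integer code, e.g. {'cat_a': 0, 'cat_b': 1, ...}
print(encoder.classes_)

# Decode the encoded target values back to the original categories
decoded = encoder.inverse_transform(y[0])
print(decoded)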

Related

Statsmodels contingency table ndarray 2 x 2 x k, cannot reshape

Consider the below list of 2x2 tables and the CMH (Cochran–Mantel–Haenszel) test results. We are trying to determine whether each specific centre was associated with the success of the treatment [data from Agresti, Categorical Data Analysis, second edition].
import statsmodels.api as sm

tables = [
    [[11, 10], [25, 27]],
    [[16, 22], [4, 10]],
    [[14, 7], [5, 12]],
    [[2, 1], [14, 16]],
    [[6, 0], [11, 12]],
    [[1, 0], [10, 10]],
    [[1, 1], [4, 8]],
    [[4, 6], [2, 1]],
]
cmh = sm.stats.contingency_tables.StratifiedTable(tables=tables)
print(cmh.test_null_odds())
Output:
pvalue ~ 0.012
statistic ~ 6.38
The tables parameter of StratifiedTable can also take a NumPy array of shape 2 x 2 x k, where each slice along k is one of the contingency tables.
I've been unable to wrap my head around the reshaping, since the list of lists above more intuitively corresponds to a shape of (8, 2, 2), at least for me.
Any thoughts on how to re-run this same test with an ndarray?
UPDATE: As suggested in a comment below, I've tried reshaping my tables variable in NumPy to a 2 x 2 x k ndarray using a transpose. The following TypeError is raised when running the same test:
TypeError: No loop matching the specified signature and casting was found for ufunc true_divide
Note: in R the following matrix would return the desired output
data = array(c(11, 10, 25, 27, 16, 22, 4, 10,
               14, 7, 5, 12, 2, 1, 14, 16,
               6, 0, 11, 12, 1, 0, 10, 10,
               1, 1, 4, 8, 4, 6, 2, 1),
             c(2, 2, 8))
mantelhaen.test(data, correct=F)
Just referencing Josef's comment as the answer: I missed / did not account for a dtype conversion.
"Your example worked for me with the transpose, .T. It looks like you have a separate problem with the dtype. Use float: tables = np.asarray(tables).T.astype(float). This was recently fixed: github.com/statsmodels/statsmodels/pull/7279"
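Put together, a minimal sketch of the fix described in that comment:

import numpy as np
import statsmodels.api as sm

# np.asarray(tables) has shape (8, 2, 2); .T gives the (2, 2, 8) layout that
# StratifiedTable expects, and .astype(float) avoids the true_divide TypeError
tables_arr = np.asarray(tables).T.astype(float)

cmh = sm.stats.contingency_tables.StratifiedTable(tables=tables_arr)
print(cmh.test_null_odds())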

Generating segment labels for a Tensor given a value indicating segment boundaries

Does anyone know of a way to generate a 'segment label' for a Tensor, given a unique value that represents segment boundaries within the Tensor?
For example, given a 1D input tensor where the value 1 represents a segment boundary,
x = torch.Tensor([5, 4, 1, 3, 6, 2])
the resulting segment label Tensor should have the same shape with values representing the two segments:
segment_label = torch.Tensor([1, 1, 1, 2, 2, 2])
Likewise, for a batch of inputs, e.g. batch size = 3,
x = torch.Tensor([
    [5, 4, 1, 3, 6, 2],
    [9, 4, 5, 1, 8, 10],
    [10, 1, 5, 4, 8, 9]
])
the resulting segment label Tensor (using 1 as the segment separator) should look something like this:
segment_label = torch.Tensor([
    [1, 1, 1, 2, 2, 2],
    [1, 1, 1, 1, 2, 2],
    [1, 1, 2, 2, 2, 2]
])
Context: I'm currently working with Fairseq's Transformer implementation in PyTorch for a seq2seq NLP task. I am looking for a way to incorporate BERT-like segment embeddings into the Transformer during the encoder's forward pass, rather than modifying an existing dataset used for translation tasks such as language_pair_dataset.
Thanks in advance!
You can use torch.cumsum to pull off the trick:
mask = (x == 1).to(x)  # mask with ones only at the boundaries
segment_label = mask.cumsum(dim=-1) - mask + 1
This yields the desired segment_label.
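A self-contained sketch of the same trick on the batched example from the question:

import torch

x = torch.tensor([
    [5, 4, 1, 3, 6, 2],
    [9, 4, 5, 1, 8, 10],
    [10, 1, 5, 4, 8, 9],
])

mask = (x == 1).to(x)                           # 1 wherever the boundary value occurs
segment_label = mask.cumsum(dim=-1) - mask + 1  # running count of boundaries seen, plus 1
# tensor([[1, 1, 1, 2, 2, 2],
#         [1, 1, 1, 1, 2, 2],
#         [1, 1, 2, 2, 2, 2]])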

Pose keypoints numpy averaging

I know you're supposed to give examples when you ask questions here, but I can't really think of anything that wouldn't involve pasting a massive project's worth of code, so I'll just try to describe this as well as possible.
I'm working on a project that uses keypoints generated by OpenPose. After some preprocessing to simplify everything, I end up with data formatted like this: [x0, y0, c0, x1, y1, c1, ...], where there are 18 points total, the x's and y's are their coordinates, and the c's are confidence values. I want to take a nested list that has the keypoints for a single person listed in the above manner for each frame, and output a new nested list of lists made up of the weighted average x's and y's (with the confidence values for each point as weights), along with the average confidences, grouped by second instead of by frame, in the same format as above.
I have already converted the original list into a 3-dimensional list, with each second holding its frames, each of which holds its keypoint list. I know I could write code to do all of this without numpy.average(), but I was hoping to avoid that because it quickly becomes confusing. Instead, I was wondering whether I could iterate over each second using that method in a reasonably simple manner, and just append the resulting lists to a new list, like this:
out = []
for second in lst:
    out.append(average(second, axis=1, weights=?, other params?))
Again, I'm sorry for not giving an example of some sort.
Maybe you could get some inspiration from this code:
import numpy as np

def pose_average(sequence):
    x, y, c = sequence[0::3], sequence[1::3], sequence[2::3]
    x_avg = np.average(x, weights=c)
    y_avg = np.average(y, weights=c)
    return x_avg, y_avg

sequence = [2, 4, 1, 5, 6, 3, 5, 2, 1]
pose_average(sequence)
>>> (4.4, 4.8)
For multiple sequences of grouped poses:
data = [[1, 2, 3, 2, 3, 4, 3, 4, 5], [1, 2, 3, 4, 5, 6, 7, 8, 9], [4, 1, 2, 5, 3, 3, 4, 1, 2]]
out = [pose_average(seq) for seq in data]
out
>>> [(2.1666666666666665, 3.1666666666666665),
     (5.0, 6.0),
     (4.428571428571429, 1.8571428571428572)]
Edit
By assuming that:
data is a list of sequences,
a sequence is a list of grouped poses (for example, grouped by second),
a pose is the coordinates of the joint positions: [x1, y1, c1, x2, y2, c2, ...],
the slightly modified code is now:
import numpy as np

data = [
    [[1, 2, 3, 2, 3, 4, 3, 4, 5], [9, 2, 3, 4, 5, 6, 7, 8, 9], [4, 1, 2, 5, 3, 3, 4, 1, 2], [5, 3, 4, 1, 10, 6, 5, 0, 0]],
    [[6, 9, 11, 0, 8, 6, 1, 5, 11], [3, 5, 4, 2, 0, 2, 0, 8, 8], [1, 5, 9, 5, 1, 0, 6, 6, 6]],
    [[9, 4, 7, 0, 2, 1], [9, 4, 7, 0, 2, 1], [9, 4, 7, 0, 2, 1]]
]

def pose_average(sequence):
    sequence = np.asarray(sequence)
    x, y, c = sequence[:, 0::3], sequence[:, 1::3], sequence[:, 2::3]
    x_avg = np.average(x, weights=c, axis=0)
    y_avg = np.average(y, weights=c, axis=0)
    return x_avg, y_avg

out = [pose_average(seq) for seq in data]
out
>>> [(array([4.83333333, 2.78947368, 5.375     ]),
      array([2.16666667, 5.84210526, 5.875     ])),
     (array([3.625, 0.5  , 1.88 ]), array([6.83333333, 6.        , 6.2       ])),
     (array([9., 0.]), array([4., 2.]))]
x_avg is now the array of x positions averaged over the sequence for each keypoint, weighted by c.
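Since the question also asks for the average confidences and for output in the original interleaved format, here is a small extension of the above; the helper name, the re-interleaving, and the plain (unweighted) mean for c are my assumptions based on the question's description:

import numpy as np

def pose_average_full(sequence):
    # sequence: list of poses for one second, each [x0, y0, c0, x1, y1, c1, ...]
    sequence = np.asarray(sequence, dtype=float)
    x, y, c = sequence[:, 0::3], sequence[:, 1::3], sequence[:, 2::3]
    x_avg = np.average(x, weights=c, axis=0)  # confidence-weighted mean per keypoint
    y_avg = np.average(y, weights=c, axis=0)
    c_avg = c.mean(axis=0)                    # plain mean confidence per keypoint
    # Re-interleave into the original [x0, y0, c0, x1, y1, c1, ...] layout
    return np.ravel(np.column_stack([x_avg, y_avg, c_avg])).tolist()

out = [pose_average_full(second) for second in data]  # one entry per second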

Conditional logic with Python ndimage generic_filter

I am trying to write a Python function to remove hot pixels in 2D image data. I want a function that takes the mean of the neighbors around each element in the 2D array and conditionally overwrites that element if its value exceeds the mean of its neighbors by a specific amount (for example, 3 sigma). This is where I am:
import numpy as np
from scipy import ndimage

def myFunction(values):
    if np.mean(values) + 3*np.std(values) < origin:
        return np.mean(values)

footprint = np.array([[1, 1, 1],
                      [1, 0, 1],
                      [1, 1, 1]])

correctedData = ndimage.generic_filter(data, myFunction, footprint=footprint)
'origin' in the above code is demonstrative. I know it isn't correct; I am just trying to show what I am trying to do. Is there a way to pass the value of the current element to the function called by generic_filter?
Thanks!
Your footprint is not passing the central value back to your function.
I find it easier to use size (equivalent to a footprint of all ones) and then deal with everything in the callback function. So in your case I'd extract the central value inside the callback. Something like this:
import numpy as np
from scipy.ndimage import generic_filter

def despike(values):
    centre = int(values.size / 2)
    avg = np.mean([values[:centre], values[centre+1:]])
    std = np.std([values[:centre], values[centre+1:]])
    if avg + 3 * std < values[centre]:
        return avg
    else:
        return values[centre]
Let's make some fake data:
data = np.random.randint(0, 10, (5, 5))
data[2, 2] = 100
This yields (for example):
array([[  2,   8,   4,   2,   4],
       [  9,   4,   7,   6,   5],
       [  9,   9, 100,   7,   3],
       [  0,   1,   0,   8,   0],
       [  9,   9,   7,   6,   0]])
Now you can apply the filter:
correctedData = generic_filter(data, despike, size=3)
Which removed the spike I added:
array([[2, 8, 4, 2, 4],
       [9, 4, 7, 6, 5],
       [9, 9, 5, 7, 3],
       [0, 1, 0, 8, 0],
       [9, 9, 7, 6, 0]])

How do I make a ragged batch in Tensorflow 2.0?

I'm trying to create a data input pipeline from a TensorFlow Dataset that consists of 1-D tensors of numerical data. I would like to create batches of ragged tensors; I do not want to pad the data.
For instance, if my data is of the form:
[
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [0, 1, 2, 3, 4],
    ...
]
I would like my dataset to consist of batches of the form:
<tf.Tensor [
    <tf.RaggedTensor [
        [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
        [0, 1, 2, 3, 4],
        ...]>,
    <tf.RaggedTensor [
        [ ... ],
        ...]>
]>
I've tried creating a RaggedTensor using a map, but I can't seem to do it on one-dimensional data.
I think this can be achieved with a little work before and after the batch.
# First, you can expand along the 0 axis for each data point
dataset = dataset.map(lambda x: tf.expand_dims(x, 0))
# Then create a RaggedTensor with a ragged rank of 1
dataset = dataset.map(lambda x: tf.RaggedTensor.from_tensor(x))
# Create batches
dataset = dataset.batch(BATCH_SIZE)
# Squeeze the extra dimension from the created batches
dataset = dataset.map(lambda x: tf.squeeze(x, axis=1))
Then the final output will be of the form:
<tf.RaggedTensor [
    <tf.Tensor [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]>,
    <tf.Tensor [0, 1, 2, 3]>,
    ...
]>
for each batch.
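For what it's worth, newer TensorFlow releases can do this in a single step with tf.data.experimental.dense_to_ragged_batch. A minimal sketch; the generator data here is made up for illustration:

import tensorflow as tf

# Hypothetical dataset of variable-length 1-D tensors
dataset = tf.data.Dataset.from_generator(
    lambda: iter([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [0, 1, 2, 3, 4]]),
    output_signature=tf.TensorSpec(shape=(None,), dtype=tf.int32),
)

# Batch directly into RaggedTensors without padding
dataset = dataset.apply(tf.data.experimental.dense_to_ragged_batch(batch_size=2))

for batch in dataset:
    print(batch)  # <tf.RaggedTensor [[0, 1, ..., 9], [0, 1, 2, 3, 4]]>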
