LightGBM produces the same probabilities on any input (C++) - python

I have trained an LGBM model (gbdt) with Python on a dataset with 5 classes (a classification problem), and I'm able to make correct inferences on a test set by loading that model in a Python script.
Now I need to use this model in a C++ program. To do this I have exported the model and loaded it in C++ to run inference. The problem is that in C++ the output probabilities are always the same, so I can't choose a winning class (each class always comes out as 0.2).
To save the model I've tried these two ways.
First, I tried saving the model as a string:
s = lgb_model.model_to_string(num_iteration=114)
f = open('model_out.txt','w')
f.write(s)
f.close()
Second, directly with the save_model method:
lgb_model.save_model('model_out.txt')
To load the model in C++ I've used this function, which returned no error:
int ret = LGBM_BoosterLoadModelFromString(model_string, &num_iter, &booster_handle);
To run inference I prepared an input buffer and passed it to this function:
int res = LGBM_BoosterPredictForMat(booster_handle, input_data, C_API_DTYPE_FLOAT64,
                                    n_row, n_cols, 1, C_API_PREDICT_NORMAL, 0, -1,
                                    "", &out_len, out_result);
I obtained a matrix with 5 rows and one column per sample, like this:
0.2
0.2
0.2
0.2
0.2
I have tried running inference with a lot of changes, but the results are always the same (random inputs, different parameters, etc.). Moreover, I have checked the loaded model by re-dumping it with this function, and the result seemed correct:
LGBM_BoosterDumpModel(booster_handle, 0, -1, C_API_FEATURE_IMPORTANCE_SPLIT, 1, &out_len, out_string);
Where am I going wrong?

I had a similar issue, and in my case I found that the problem was the is_linear property in the model.
I compared the model generated from the binary_classification example with the model I was using, and I noticed that the example model has the is_linear=0 property for each tree; on my model it was missing.
Then I checked the C++ code and found that if this property is missing, the variable that represents it ends up true. I set it to false as the default and that works for me.
I can't give more details, as I only recently began working with LGBM models and C++.
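As a quick check on the Python side, you can also verify whether the saved model text contains the is_linear property at all. This is only a sketch, and it assumes the standard LightGBM text model format, where each tree section starts with a Tree=<n> line and newer versions write an is_linear entry per tree:
# Count tree sections and explicit is_linear entries in the saved model text;
# if the second number is zero, the property discussed above is missing.
with open('model_out.txt') as f:
    model_text = f.read()

n_trees = model_text.count('Tree=')
n_is_linear = model_text.count('is_linear=')
print('tree sections:', n_trees, '| is_linear entries:', n_is_linear)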

Related

Number of parameters of a TFHub model

I would like to count the number of parameters of an Object Detection model loaded from TensorFlow Hub, for example https://tfhub.dev/tensorflow/ssd_mobilenet_v2/2.
I've tried this:
import tensorflow_hub as hub

hub_model = hub.load("https://tfhub.dev/tensorflow/ssd_mobilenet_v2/2")
print(len(hub_model.signatures['serving_default'].variables))
But the output is not very readable and I'm not even sure if it's correct.
I've also tried this way:
malli = hub.KerasLayer("https://tfhub.dev/tensorflow/ssd_mobilenet_v2/2")
print("Thickness of the model:", len(malli.weights))
But it returns just an empty list [] of length 0.
It would be nice to be able to use the Keras summary() method on these models, but it cannot be called on a KerasLayer, so would incorporating this layer into a model with Keras.Sequential work?
There is a tool that counts the total number of parameters in a checkpoint file: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/tools/inspect_checkpoint.py
$ python inspect_checkpoint.py --file_name=/checkpoint/file/name --all_tensors
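Alternatively, for the hub.load approach in the question, here is a rough sketch of counting parameters directly from the variables exposed by the serving signature. I'm assuming that this list covers every weight in the module, which may not hold for all hub models:
import numpy as np
import tensorflow_hub as hub

hub_model = hub.load("https://tfhub.dev/tensorflow/ssd_mobilenet_v2/2")
variables = hub_model.signatures['serving_default'].variables

# Sum the element counts of every variable tensor.
total_params = sum(int(np.prod(v.shape)) for v in variables)
print("number of variables:", len(variables))
print("total parameters:", total_params)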

HuggingFace T5 transformer model - how to prep a custom dataset for fine-tuning?

I am trying to use the HuggingFace library to fine-tune the T5 transformer model using a custom dataset. HF provide an example of fine-tuning with custom data, but it is for the distilbert model, not the T5 model I want to use. Their example says I need to implement __len__ and __getitem__ methods in my dataset subclass, but there doesn't seem to be much documentation about what to change when using T5 instead of distilbert. Here is the tokenizer code, followed by my attempt at changing __getitem__:
(screenshot: __getitem__ method code)
and the resulting error from trainer.train(), which says "KeyError: 'labels'":
(screenshot: trainer.train() error message)
I have seen the following discussion which seems to relate to this problem, but the answer offered still produces an error in trainer.train() which I can also post if useful.
Using the original example code from "fine-tuning with custom data", the dataset class is:
(screenshot: original code from the HF distilbert example applied to T5)
but then the error with the trainer changes:
(screenshot: trainer error using the HF distilbert example applied to T5)
which is what originally got me looking around for solutions. So using "fine-tuning with custom data" doesn't seem to be as simple as changing the model, the tokenizer, and the input/output data you are training on when switching from, say, distilbert to a text-to-text model like T5. distilbert doesn't have any output text to train on, so I would have thought (but what do I know?) that it would be handled differently from T5, but I can't find documentation on how. The bottom of this question seems to point to a direction to follow, but once again I don't know (much!).
I think I may have solved the problem (at least the trainer runs and completes). The distilbert model doesn't have output text; it has flags that are provided to the dataset class as a list of integers. The T5 model does have output text, so you assign the output encodings and rely upon DataCollatorForSeq2Seq() to prepare the data/features that the T5 model expects. See the changes (for T5), with the HF code (for distilbert) commented out, below:
(screenshot: changes for T5, with the distilbert code commented out)
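Since the screenshot isn't reproduced here, this is a rough sketch of that wiring. The model name t5-small, the train_texts/train_targets lists, and the output directory are placeholders of mine, and I'm using Seq2SeqTrainer/Seq2SeqTrainingArguments rather than the plain Trainer:
# Sketch: tokenize inputs and targets separately, store the target input_ids
# as 'labels', and let DataCollatorForSeq2Seq do the padding that T5 expects.
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

train_encodings = tokenizer(train_texts, truncation=True)
target_encodings = tokenizer(train_targets, truncation=True)

# One dict per example; the collator pads input_ids and labels per batch.
train_dataset = [
    {"input_ids": inp, "attention_mask": att, "labels": lab}
    for inp, att, lab in zip(train_encodings["input_ids"],
                             train_encodings["attention_mask"],
                             target_encodings["input_ids"])
]

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="t5_out"),
    train_dataset=train_dataset,
    data_collator=data_collator,
)
trainer.train()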
I raised an issue with HuggingFace and they advised that the fine-tuning with custom datasets example on their website was out of date and that I needed to work off their maintained examples.
Based on your screenshots, here's how I'd implement len and getitem.
class ToxicDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels['input_ids'][idx])
        return item

    def __len__(self):
        return len(self.labels['input_ids'])

Saving PyTorch model with no access to model class code

How can I save a PyTorch model without a need for the model class to be defined somewhere?
Disclaimer:
In Best way to save a trained model in PyTorch?, there is no solution (or working solution) for saving the model without access to the model class code.
If you plan to do inference with the PyTorch library available (i.e. PyTorch in Python, C++, or any of the other platforms it supports), then the best way to do this is via TorchScript.
I think the simplest thing is to use trace = torch.jit.trace(model, typical_input) and then torch.jit.save(trace, path). You can then load the traced model with torch.jit.load(path).
Here's a really simple example. We make two files:
train.py :
import torch

class Model(torch.nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.linear = torch.nn.Linear(4, 4)

    def forward(self, x):
        x = torch.relu(self.linear(x))
        return x

model = Model()
x = torch.FloatTensor([[0.2, 0.3, 0.2, 0.7], [0.4, 0.2, 0.8, 0.9]])

with torch.no_grad():
    print(model(x))

traced_cell = torch.jit.trace(model, (x))
torch.jit.save(traced_cell, "model.pth")
infer.py :
import torch

x = torch.FloatTensor([[0.2, 0.3, 0.2, 0.7], [0.4, 0.2, 0.8, 0.9]])
loaded_trace = torch.jit.load("model.pth")

with torch.no_grad():
    print(loaded_trace(x))
Running these sequentially gives results:
python train.py
tensor([[0.0000, 0.1845, 0.2910, 0.2497],
        [0.0000, 0.5272, 0.3481, 0.1743]])
python infer.py
tensor([[0.0000, 0.1845, 0.2910, 0.2497],
        [0.0000, 0.5272, 0.3481, 0.1743]])
The results are the same, so we are good. (Note that the result will be different each time here due to randomness of the initialisation of the nn.Linear layer).
TorchScript provides for much more complex architectures and graph definitions (including if statements, while loops, and more) to be saved in a single file, without needing to redefine the graph at inference time. See the docs (linked above) for more advanced possibilities.
I recommend you convert your PyTorch model to ONNX and save it. That is probably the best way to store a model without access to the class.
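For example (a sketch only: it reuses the toy model and the 4-feature input from the TorchScript example above, and assumes the onnx and onnxruntime packages are installed):
import torch

# Export the model with a dummy input of the right shape.
dummy_input = torch.randn(1, 4)
torch.onnx.export(model, dummy_input, "model.onnx",
                  input_names=["input"], output_names=["output"])

# At inference time you only need an ONNX runtime, not the model class.
import onnxruntime as ort

session = ort.InferenceSession("model.onnx")
outputs = session.run(None, {"input": dummy_input.numpy()})
print(outputs[0])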
Supplying an official answer by one of the core PyTorch devs (smth):
There are limitations to loading a pytorch model without code.
First limitation:
We only save the source code of the class definition. We do not save beyond that (like the package sources that the class is referring to).
For example:
import foo

class MyModel(...):
    def forward(input):
        foo.bar(input)
Here the package foo is not saved in the model checkpoint.
Second limitation:
There are limitations on robustly serializing Python constructs. For example, the default picklers cannot serialize lambdas. There are helper packages that can serialize more Python constructs than the standard ones, but they still have limitations. Dill is one such package.
Given these limitations, there is no robust way to have torch.load work without having the original source files.
There is no solution (or working solution) for saving the model without access to the class.
You can save whatever you like.
You can save the whole model with torch.save(model, filepath); this saves the model object itself.
You can save just the model state dict.
torch.save(model.state_dict(), filepath)
Further, you can save anything you like, since torch.save is just a pickle based save.
state = {
    'hello_text': 'just the optimizer sd will be saved',
    'optimizer': optimizer.state_dict(),
}
torch.save(state, filepath)
You may check what I wrote on torch.save some time ago.

How do I train a chain of two models in Keras, but with a function in between them?

I'm using the Keras functional API and attempting to stack and train two models with a non-linear step in between them.
Say I want to train a chain of two models, Model A and Model B, where the output of Model A is used as the input of Model B, as one model, Model C. My understanding of how to do this is:
input_A = Input(input_shape_A)
output_A = ModelA(input_A)
output_B = ModelB(output_A)
model_C = Model(input_A, output_B)
Source
The problem is that, in my case, I want to slice up the output of Model A before it goes into Model B, call another function on the slices which returns only one of them, and then use that slice as the input to Model B. So, Model B is only being trained on a fixed-size subset of Model A's output, which could be at arbitrary indices.
Something more like:
input_A = Input(input_shape_A)
output_A = ModelA(input_A)
input_B = custom_function(output_A)
output_B = ModelB(input_B)
model_C = Model(input_A, output_B)
I have not found any code examples so far that resemble this, and I am still trying to figure out if I can do it. The loss of Model B has to be integrated into Model A's training, but I need a function in between them. I was considering keeping them separate and trying to write a custom loss function for Model A, but custom loss functions in Keras seem to be very restrictive, and I haven't seen any examples for that approach so far either.
Is this possible?
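For what it's worth, here is a minimal sketch of the idea, assuming the custom function can be expressed with TensorFlow ops; model_A, model_B, input_shape_A, and the select_slice logic below are placeholders. Wrapping the function in a Lambda layer keeps the whole chain differentiable, so Model B's loss backpropagates into Model A:
from tensorflow.keras.layers import Input, Lambda
from tensorflow.keras.models import Model

def select_slice(x):
    # Placeholder: keep the first 8 features of Model A's output.
    # Replace with the real selection logic (tf.gather, tf.slice, masks, ...).
    return x[:, :8]

input_A = Input(shape=input_shape_A)       # input_shape_A defined elsewhere
output_A = model_A(input_A)                # model_A built elsewhere
input_B = Lambda(select_slice)(output_A)   # the function in between
output_B = model_B(input_B)                # model_B built elsewhere

model_C = Model(input_A, output_B)
model_C.compile(optimizer="adam", loss="mse")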

Store/Reload CNTK Trainer, Model, Inputs, Outputs

What is the best way to store a trainer and all necessary components?
1. Storing:
Store a checkpoint of the trainer: use its trainer.save_checkpoint(filename, external_state={}) function.
Additionally, store the model separately: use the z.save(filename) method that every CNTK operation has. You can also get z = trainer.model.
2. Reloading:
Restore the model: use C.load_model(...). (Don't get confused by the deprecated persist namespace from CNTK 1.)
Get the inputs from the restored model.
Restore the trainer itself: use trainer.restore_from_checkpoint, as e.g. shown here. The problem is that this function already needs a trainer object, which probably has to be initialized in the same way as the trainer used to create the checkpoint!?
How do I now restore the label inputs that go into the error function used by the trainer? In the following code I have marked the variables which I think I have to restore after having stored them.
z = C.layers.Dense(.... )
loss = error = C.squared_error(z, **l**)
**trainer** = C.Trainer(**z**, (loss, error), [mylearner], my_tensorboard_writer)
You can restore your trainer, but I actually prefer to just load my model m. The simple reason is that it is much easier to create a whole new trainer, because then you can change all the other parameters of the trainer more easily.
Then you can get the input variable from the loaded model (if your network has only one input):
input_var = m.arguments[0]
then you need the output of your model:
output = m(input_var)
and define the loss function using your target output target_output:
C.squared_error(output, target_output)
Using your model and the loss function, you can recreate your trainer from there, setting the learning rate etc. as you like.
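A minimal sketch of that recipe (the file name, the learner settings, and the squared-error loss below are placeholders matching the snippet in the question):
import cntk as C

# Reload the saved model and rebuild a fresh trainer around it.
m = C.load_model("model.cntk")

input_var = m.arguments[0]                      # assumes a single-input network
output = m(input_var)

target_output = C.input_variable(output.shape)  # fresh label input
loss = error = C.squared_error(output, target_output)

learner = C.sgd(m.parameters, lr=C.learning_parameter_schedule(0.01))
trainer = C.Trainer(output, (loss, error), [learner])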
