Improving prediction accuracy in Bayesian Causal Network - python

I would like to determine the causes of an unexpected outcome (or anomaly) in a thermodynamic process. I have continuous data for the associated variables and am trying to use a Bayesian Network (BN) to determine the causal relationships. For this purpose, I used a Python library called Causalnex.
I have followed the tutorial section of this library to build the DAG and BN model, and everything works fine up to the prediction step. The prediction results for minority/less-represented classes have an accuracy of around 60-70% (80-90% with SMOTE/SMOTETomek and a particular random state), whereas a stable accuracy of more than 90% is expected. I have implemented the following data-preprocessing steps:
Ensuring no missing/NaN values
Discretization (the only input type supported by the library)
SMOTE/SMOTETomek for data balancing
Various train/test size combinations
I am struggling to figure out ways to optimize the model and could not find any supporting material on the Internet.
Are there any guidelines or best practices for data pre-processing techniques and dataset requirements that particularly work for this library/BN model? Could you please suggest any troubleshooting methods to identify the causes of the low accuracy/metrics? Perhaps a misunderstood node-to-node causal relationship in the DAG causes the mediocre accuracy?
Any ideas/literature/other suitable library regarding this would be of great help!

A few tips that can help:
Change/tune the structure learning.
Try different thresholds. When using from_pandas, you can experiment with different w_threshold values (and the beta term if you are using from_pandas_lasso); a sketch appears after these tips.
This changes the density of the network: a denser structure implies a BN with more parameters, which may perform better. If it is too dense, though, you may not have enough data to train it and may overfit.
Center the data. Empirically, NOTEARS (the algorithm behind from_pandas) seems to work best if the data is centered, so subtracting the mean of the data before structure learning may be a good idea.
Ensure causality. NOTEARS does not ensure causality. So we need "experts" to judge the output and make the necessary modifications. If you see edges that don't make causal sense, you can either remove them or add them as tabu_edges and train your network again.
Experiment with discretisation. The performance can be very sensitive to how you discretise the data. Experimenting with various types of discretisation can help. You can use:
Methods available in Causalnex (uniform, for example)
fixed discretisations based on what thresholds make sense for your data
MDLP, a supervised way to discretise data. You can apply MDLP to each node, using one of its children as the "target". There are two main packages for MDLP on PyPI: mdlp and mdlp-discretization
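Putting a few of these tips together, a minimal sketch (based on the Causalnex tutorial's API; the column names, threshold value, forbidden edge, and split points below are placeholders, not recommendations) might look like:

```python
import pandas as pd
from causalnex.structure.notears import from_pandas
from causalnex.network import BayesianNetwork
from causalnex.discretiser import Discretiser

df = pd.read_csv("process_data.csv")          # hypothetical continuous process data
df_centered = df - df.mean()                  # centering tends to help NOTEARS

# Experiment with w_threshold (density) and tabu_edges (expert-forbidden relationships)
sm = from_pandas(
    df_centered,
    w_threshold=0.8,                          # try several values
    tabu_edges=[("outcome", "temperature")],  # hypothetical edge that makes no causal sense
)
sm = sm.get_largest_subgraph()

# Fixed, domain-driven discretisation (split points are placeholders)
discretised = df.copy()
discretised["temperature"] = Discretiser(
    method="fixed", numeric_split_points=[60, 80]
).transform(df["temperature"].values)

bn = BayesianNetwork(sm)                      # then fit node states and CPDs as in the tutorial
```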


In Leave One Out Cross Validation, How can I Use `shap.Explainer()` Function to Explain a Machine Learning Model?

Background of the Problem
I want to explain the outcome of machine learning (ML) models using SHapley Additive exPlanations (SHAP) which is implemented in the shap library of Python. As a parameter of the function shap.Explainer(), I need to pass an ML model (e.g. XGBRegressor()). However, in each iteration of the Leave One Out Cross Validation (LOOCV), the ML model will be different as in each iteration, I am training on a different dataset (1 participant’s data will be different). Also, the model will be different as I am doing feature selection in each iteration.
Then, My Question
In LOOCV, how can I use the shap.Explainer() function of the shap library to present the performance of a machine learning model? Note that I have checked several tutorials (e.g. this one, this one) and also several questions (e.g. this one) on SO, but I failed to find an answer to the problem.
Thanks for reading!
Update
I know that in LOOCV, the model found in each iteration can be explained by shap.Explainer(). However, as there are 250 participants' worth of data, if I apply SHAP to each model there will be 250 outputs! Thus, I want to get a single output that presents the performance of the 250 models.
You seem to be training a model on 250 data points while doing LOOCV. This is about choosing a model with hyperparameters that ensure the best generalization ability.
Model explanation is different from training in that you don't sift through different sets of hyperparameters -- note, 250-fold LOOCV is already overkill; will you do that with 250,000 rows? -- rather, you are trying to understand which features influence the output, in what direction, and by how much.
Training has its own limitations (availability of data, whether new data resembles the data the model was trained on, whether the model is good enough to pick up peculiarities of the data and generalize well, etc.), but don't overestimate the explanation exercise either. It's still an attempt to understand how inputs influence outputs. You may be willing to average 250 different matrices of SHAP values, but do you expect the result to be much different from a single random train/test split?
Note as well:
However, in each iteration of the Leave One Out Cross Validation (LOOCV), the ML model will be different as in each iteration, I am training on a different dataset (1 participant’s data will be different).
In each iteration of LOOCV the model is still the same (same features; hyperparameters may be different, depending on your definition of iteration). It's still the same dataset (same features).
Also, the model will be different as I am doing feature selection in each iteration.
Doesn't matter. Feed the resulting model to the SHAP explainer and you'll get what you want.
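If you do want one aggregated view across all folds, a minimal sketch (assuming a pandas DataFrame X, a target y, and a fixed final feature set so the SHAP columns line up across folds) is to collect the SHAP values of each held-out row and summarise them together:

```python
import numpy as np
import shap
from sklearn.model_selection import LeaveOneOut
from xgboost import XGBRegressor

loo = LeaveOneOut()
shap_rows = []

for train_idx, test_idx in loo.split(X):
    model = XGBRegressor().fit(X.iloc[train_idx], y.iloc[train_idx])
    explainer = shap.Explainer(model)                      # picks a tree explainer for XGBoost
    shap_rows.append(explainer(X.iloc[test_idx]).values)   # SHAP values for the held-out participant

all_shap = np.vstack(shap_rows)   # shape: (n_participants, n_features)
shap.summary_plot(all_shap, X)    # one plot summarising all 250 fold models
```

Averaging these rows per feature gives a single global importance ranking, but as noted above, don't expect it to differ dramatically from a single train/test split.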

neural network: find best hyperparameters or architecture first?

I'm implementing my first neural network for image classification.
I would like to know if I should find the best hyperparameters first and then modify my neural network architecture (e.g. number of layers, dropout, ...), or settle on the architecture first and then the hyperparameters.
First you should decide on an architecture and then play around with the hyperparameters. To compare different hyperparameters it is important to have the same base (architecture).
Of course you can also play around with the architecture (layers, nodes, ...). But I think it is easier to search for an architecture online, because often the same or a similar problem has already been solved or described in a tutorial/blog.
Dropout is also a (training) hyperparameter and not part of the architecture!
The answer is, as always: it depends.
What are you trying to achieve?
If you're hoping to make the world's best image classifier by trial and error, then you might want to ask yourself whether you have more compute available than the people who have already done this. For a really good classifier there are several that come with tensorflow/keras and can be easily implemented. If you're goofing around and learning the coding, then I'd recommend trying different architectures, because that's going to teach you more functions. If you have a dataset you don't think existing solutions will be good at analysing and you genuinely need the best network to classify it, then unfortunately it still depends...
How to decide:
First, decide on the rough order of magnitude for your overall parameter count (the literal number of parameters your model has). For a given number of parameters, architecture is likely to produce the biggest difference in results between representative hyperparameter choices (don't choke your network down to a single neuron in the middle and expect it to be representative of that architecture).
It's important to compare rough performance per parameter so you're not giving an edge to the networks with greater overfitting capacity. You don't need to use all your training data or even train to completion; mostly you'll find the better networks learn faster and finish better (mostly). In the past I've done grid searches with multiple trials at each point using significantly reduced data, then optimised the architecture with the most potential by considering the gradients of the grid search. Fun fact: with sufficient time you can use gradient descent methods on hyperparameters to find local minima. You might well find that there are many similarly top-performing models, all of which you can tune until a clear winner emerges.
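As a small illustration of keeping candidate architectures at a comparable parameter budget before tuning (the layer sizes and widths below are arbitrary assumptions, not a recommended design):

```python
import tensorflow as tf

def candidate(filters):
    """A tiny CNN whose width is controlled by `filters`."""
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(filters, 3, activation="relu", input_shape=(32, 32, 3)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(filters * 2, 3, activation="relu"),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])

for filters in (16, 32, 64):
    model = candidate(filters)
    # compare parameter counts so no candidate gets an unfair overfitting advantage
    print(filters, model.count_params())
```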

Is it possible to remove categories in a pretrained tensorflow model?

I am currently using Tensorflow Object Detection API for my human detection app.
I tried filtering in the API itself, which worked, but I am still not satisfied with it because it's slow. So I'm wondering if I could remove the other categories in the model itself to also make it faster.
If it is not possible, can you please give me other suggestions to make the API faster, since I will be using two cameras? Thanks in advance, and also pardon my English :)
Your question addresses several topics related to using pretrained neural network models.
Theoretical methods
In general, you can always neutralize categories by removing the corresponding neurons in the softmax layer and computing a new softmax only over the relevant rows of the weight matrix; a minimal sketch appears after this list.
This method will surely work (maybe that is what you meant by filtering) but will not accelerate the network computation time by much, since most of the flops (multiplications and additions) will remain.
Similar to decision trees, pruning is possible but may reduce performance. I will explain what pruning means below; note that accuracy on your categories may hold up, since you are not just trimming, you are also predicting fewer categories.
Transfer the learning to your problem; see Stanford's course in computer vision here. Most of the time, what I've seen work well is keeping the convolutional layers as-is and preparing a medium-size dataset of the objects you'd like to detect.
I will add more theoretical methods if you request, but the above are the most common and accurate I know.
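Here is a minimal sketch of the first method above (the class indices and logit shapes are assumptions): slice the logits down to the categories you care about and renormalize, instead of taking a softmax over every class. As noted, this trims the output but leaves almost all of the computation intact.

```python
import numpy as np

def masked_softmax(logits, keep_idx):
    """Softmax computed only over the classes listed in keep_idx (e.g. the 'person' class)."""
    kept = logits[..., keep_idx]
    kept = kept - kept.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    exp = np.exp(kept)
    return exp / exp.sum(axis=-1, keepdims=True)

# probs = masked_softmax(raw_logits, keep_idx=[1])   # hypothetical: class 1 = "person"
```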
Practical methods
Make sure you are serving your tensorflow model, and not just running inference from a Python script. This could significantly improve performance.
You can export the parameters of the network and load them in a faster framework such as CNTK or Caffe. These frameworks work in C++/C# and can run inference much faster. Make sure you load the weights correctly; some frameworks use a different order for tensor dimensions when saving/loading (little/big-endian-like issues).
If your application performs inference on several images, you can distribute the computation across several GPUs. This can also be done in tensorflow; see Using GPUs.
Pruning a neural network
Maybe this is the most interesting method of adapting big networks for simple tasks. You can see a beginner's guide here.
Pruning means that you remove parameters from your network, specifically whole nodes/neurons in a decision tree/neural network (respectively). To do that in object detection, you can do the following (simplest way):
Randomly prune neurons from the fully connected layers.
Train one more epoch (or more) with low learning rate, only on objects you'd like to detect.
(optional) Perform the above several times for validation and choose best network.
The above procedure is the most basic one, but you can find plenty of papers that suggest algorithms for doing so, for example Automated Pruning for Deep Neural Network Compression and An iterative pruning algorithm for feedforward neural networks.
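A simplistic sketch of step 1 of the procedure above (illustration only: it zeroes whole neurons in a Keras Dense layer rather than structurally removing them, so you would still need to fine-tune and compact the model to see a real speedup):

```python
import numpy as np

def prune_dense_neurons(layer, keep_fraction=0.7, seed=0):
    """Randomly zero out whole output neurons of a Keras Dense layer."""
    weights, biases = layer.get_weights()          # kernel: (in, out), bias: (out,)
    rng = np.random.default_rng(seed)
    keep = rng.random(weights.shape[1]) < keep_fraction   # one flag per output neuron
    weights[:, ~keep] = 0.0
    biases[~keep] = 0.0
    layer.set_weights([weights, biases])

# After pruning, recompile with a low learning rate and train a further epoch
# on the classes you actually need (step 2 above), then validate (step 3).
```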

How to calculate probability(confidence) of SVM classification for small data set?

Use case:
I have a small dataset with about 3-10 samples in each class. I am using sklearn's SVC with an RBF kernel to classify them.
I need the confidence of the prediction along with the predicted class, so I used the predict_proba method of SVC.
I was getting weird results with that. I searched a bit and found out that it makes sense only for larger datasets.
I found this question on Stack Overflow: Scikit-learn predict_proba gives wrong answers.
The author of that question verified this by duplicating the samples in the dataset.
My questions:
1) If I multiply my dataset by, let's say, 100, so that each sample appears 100 times, it increases the "correctness" of predict_proba. What side effects will it have? Overfitting?
2) Is there any other way I can calculate the confidence of the classifier? Like distance from the hyperplanes?
3) For this small sample size, is SVM a recommended algorithm or should I choose something else?
First of all: Your data set seems very small for any practical purposes. That being said, let's see what we can do.
SVMs are mainly popular in high-dimensional settings. It is currently unclear whether that applies to your project. They build planes on a handful of (or even single) supporting instances, and are often outperformed in situations with large training sets by neural nets. A priori they might not be your worst choice.
Oversampling your data will do little for an SVM-based approach. SVM is based on the notion of support vectors, which are basically the outliers of a class that define what is in the class and what is not. Oversampling will not construct new support vectors (I am assuming you are already using the train set as the test set).
Plain oversampling in this scenario will also not give you any new information on confidence, other than artifacts constructed by unbalanced oversampling, since the instances will be exact copies and no distribution changes will occur. You might be able to find some information by using SMOTE (Synthetic Minority Oversampling Technique), which basically generates synthetic instances based on the ones you have. In theory this will provide you with new instances that aren't exact copies of the ones you have, and might thus fall a little outside the normal classification. Note: by definition all these examples will lie in between the original examples in your sample space. That does not mean they will lie in between them in your projected SVM space, so you may end up learning effects that aren't really true.
Lastly, you can estimate confidence with the distance to the hyperplane. Please see: https://stats.stackexchange.com/questions/55072/svm-confidence-according-to-distance-from-hyperline
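A minimal sketch of that last option (X_train, y_train, X_test are placeholders), using the decision function instead of predict_proba:

```python
from sklearn.svm import SVC

clf = SVC(kernel="rbf", gamma="scale")   # note: probability=True is not needed for this
clf.fit(X_train, y_train)

preds = clf.predict(X_test)
margins = clf.decision_function(X_test)  # signed distance to the boundary; larger |margin| = more confident
# For multi-class problems, decision_function returns one value per pair of classes (one-vs-one).
```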

How to find the most important features learned during Deep Learning using CNN?

I followed the tutorial given at this site, which details how to perform text classification using a CNN. It uses the movie review dataset to predict positive and negative reviews.
My question is, is there any way to find the most important learned features from the model? Does Tensorflow/Theano have any support for this?
Thanks!
A word of warning: if you need to trace the classification back to specific input features, it's quite possible that a CNN is the wrong ML paradigm for your application. Most text processing uses RNNs, bag-of-words, bi-grams, and other simple linear combinations.
The structure of a CNN is generally antithetical to identifying the importance of individual features. Because of the various non-linear layers, it is rarely possible to pick out any one feature as important; rather, the combinations of inputs form small structures of inference, which then convolve to form more complex structures, until the final output is driven by a series of neighbor relationships, cut-offs, poolings, and other items.
This is why back-propagation is so important to running CNNs: the causation chain does not reverse cleanly. Otherwise, we'd reduce the process to a simple linear NN with one hidden layer.
If you want to analyze what's happening, try visualizing your intermediate layers. There are various modules to help with that; for instance, try a search for "+theano +visualize +CNN -news" (the last term removes the high-traffic references to Cable News Network). There are plenty of examples in image processing; we won't know how much it might help your text processing until you try it.
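For example, one modern way to inspect intermediate activations is via tf.keras rather than the raw graph API the tutorial uses (here `model`, the layer name, and `sample_batch` are hypothetical placeholders):

```python
import tensorflow as tf

# `model` is your trained CNN; "conv1d" is a hypothetical layer name taken from model.summary()
feature_model = tf.keras.Model(
    inputs=model.input,
    outputs=model.get_layer("conv1d").output,
)
activations = feature_model.predict(sample_batch)   # shape: (batch, ..., n_filters)
```

Plotting these activations per filter gives a rough sense of which learned patterns fire for which inputs.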
