Suggestions for nonparametric machine learning models - python

I am new to machine learning, but I have decent experience in Python. I am faced with a problem: I need to find a machine learning model that would work well to predict the speed of a boat given current environmental and physical conditions. I have looked into Scikit-Learn, PyTorch, and TensorFlow, but I am having trouble finding information on what type of model I should use. I am almost certain that linear regression models would be useless for this task. I have been told that non-parametric regression models would be ideal for this, but I am unable to find many in the scikit-learn library. Should I be trying to use regression models at all, or should I be looking more into neural networks? I'm open to any suggestions, thanks in advance.

I think a multiple linear regression model would work well for your case. I am assuming that the input data is just a set of environmental parameters with a corresponding boat speed for each sample. For such problems, regression usually works well. I would not recommend neural networks unless you have a lot of training data and each input sample is also fairly high-dimensional.
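As a starting point, here is a minimal sketch with scikit-learn, assuming your data is a table of environmental/physical features with one measured boat speed per row. The file name and column names are placeholders; RandomForestRegressor is one of the nonparametric regressors scikit-learn does ship, shown next to the linear baseline so you can compare.

```python
# A minimal sketch; "boat_data.csv" and the column names are hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("boat_data.csv")      # placeholder file of measurements
X = df.drop(columns=["boat_speed"])    # environmental/physical conditions
y = df["boat_speed"]                   # target: measured boat speed

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Baseline: multiple linear regression.
linear = LinearRegression().fit(X_train, y_train)

# Nonparametric alternative available in scikit-learn: a random forest.
forest = RandomForestRegressor(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)

print("linear R^2:", r2_score(y_test, linear.predict(X_test)))
print("forest R^2:", r2_score(y_test, forest.predict(X_test)))
```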

Related

What's the difference between scikit-learn and tensorflow? Is it possible to use them together?

I cannot get a satisfying answer to this question. As I understand it, TensorFlow is a library for numerical computations, often used in deep learning applications, and Scikit-learn is a framework for general machine learning.
But what is the exact difference between them, what is the purpose and function of TensorFlow? Can I use them together, and does it make any sense?
Your understanding is pretty much spot on, albeit very, very basic. TensorFlow is more of a low-level library. Basically, we can think of TensorFlow as the Lego bricks (similar to NumPy and SciPy) that we can use to implement machine learning algorithms, whereas Scikit-Learn comes with off-the-shelf algorithms, e.g., algorithms for classification such as SVMs, Random Forests, Logistic Regression, and many, many more. TensorFlow really shines if we want to implement deep learning algorithms, since it allows us to take advantage of GPUs for more efficient training. In TensorFlow you build machine learning models (and other computations) out of a set of simple operators, like "add", "matmul", "concat", etc.
Makes sense so far?
Scikit-Learn is a higher-level library that includes implementations of several machine learning algorithms, so you can define a model object in a single line or a few lines of code, then use it to fit a set of points or predict a value.
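To make the contrast concrete, here is a toy sketch (synthetic data, not a full pipeline): Scikit-Learn gives you a fitted model in a line or two, while TensorFlow exposes the low-level operators mentioned above. The choice of LogisticRegression is just for illustration.

```python
# Toy contrast between the two libraries on the same synthetic data.
import numpy as np
import tensorflow as tf
from sklearn.linear_model import LogisticRegression

X = np.random.rand(100, 3).astype("float32")
y = (X.sum(axis=1) > 1.5).astype(int)

# scikit-learn: an off-the-shelf algorithm, fit and predict in one line each.
clf = LogisticRegression().fit(X, y)
preds = clf.predict(X)

# TensorFlow: the same kind of computation expressed with basic operators.
W = tf.Variable(tf.zeros([3, 1]))
b = tf.Variable(tf.zeros([1]))
logits = tf.add(tf.matmul(X, W), b)  # "matmul" and "add" as building blocks
```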
Tensorflow is mainly used for deep learning while Scikit-Learn is used for machine learning.
Here is a link that shows you how to do Regression and Classification using TensorFlow. I would highly suggest downloading the data sets and running the code yourself.
https://stackabuse.com/tensorflow-2-0-solving-classification-and-regression-problems/
Of course, you can do many different kinds of regression and classification using Scikit-Learn without TensorFlow. I would suggest reading through the Scikit-Learn documentation when you have a chance.
https://scikit-learn.org/stable/user_guide.html
It's going to take a while to get through everything, but if you make it to the end, you will have learned a ton!!! Finally, you can get the 2,600+ page user guide for Scikit-Learn from the link below.
https://scikit-learn.org/stable/_downloads/scikit-learn-docs.pdf
TensorFlow is a library for constructing neural networks, while scikit-learn ships ready-to-use algorithms. TF can work with a variety of data types: tabular, text, images, audio. scikit-learn is intended to work with tabular data.
Yes, you can use both packages. But if you only need a classic multi-layer perceptron, then the MLPClassifier and MLPRegressor available in scikit-learn are a very good choice. I have run a comparison of an MLP implemented in TF vs. scikit-learn: there weren't significant differences in accuracy, and the scikit-learn MLP ran about 2 times faster than TF on CPU. You can read the details of the comparison in my blog post.
(Scatter plots of the performance comparison appeared here; see the linked blog post for the figures.)
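For reference, a minimal MLPRegressor sketch on synthetic placeholder data; the sizes and layer widths below are arbitrary choices, not recommendations.

```python
# Minimal MLPRegressor sketch; swap the synthetic X and y for your own data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

X = np.random.rand(500, 10)
y = X @ np.random.rand(10) + 0.1 * np.random.randn(500)  # noisy linear target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

mlp = MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=1000, random_state=0)
mlp.fit(X_train, y_train)
print("test R^2:", mlp.score(X_test, y_test))
```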
Both are third-party machine learning libraries, and both are good at what they do.
TensorFlow is the more popular of the two.
TensorFlow is typically used more for deep learning and neural networks.
scikit-learn is for more general machine learning.
And although I don't think I've come across anyone using both simultaneously, no one is saying you can't.

Improving prediction accuracy in Bayesian Causal Network

I would like to determine the causes of an unexpected outcome (or anomaly) in a thermodynamic process. I have continuous data for the associated variables and am trying to make use of a Bayesian Network (BN) to determine causal relationships. For this purpose, I used a library called CausalNex in Python.
I have followed the tutorial section of this library to build the DAG and BN model, and everything works fine up to the prediction step. The prediction results for minority/less-represented classes have an accuracy of around 60-70% (80-90% with SMOTE/SMOTETomek and a particular random state), whereas a stable accuracy of more than 90% is expected. I have implemented the following data-preprocessing steps (a sketch of the balancing step follows the list):
Ensuring no missing/NaN values
Discretization (only discrete data is supported by the library)
SMOTE/SMOTETomek for data balancing
Various train/test size combinations
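A hedged sketch of the balancing step with imbalanced-learn; the feature matrix and labels below are placeholders standing in for the discretised process data.

```python
# Placeholder data; replace X and y with the discretised features and labels.
import numpy as np
from imblearn.over_sampling import SMOTE

X = np.random.rand(200, 5)          # placeholder feature matrix
y = np.array([0] * 180 + [1] * 20)  # imbalanced class labels

X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("before:", np.bincount(y), "after:", np.bincount(y_res))
```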
I am struggling to figure out ways to optimize the model, and I could not find any supporting material on the Internet.
Are there any guidelines or 'best practices' for data pre-processing techniques and dataset requirements that particularly work for this library/BN model? Could you please suggest any troubleshooting methods to identify the causes of the low accuracy/metrics? Perhaps a misunderstood node-to-node causal relationship in the DAG causes the mediocre accuracy?
Any ideas/literature/other suitable library regarding this would be of great help!
A few tips that can help:
Changing/Tuning the Structure learning.
Trying different thresholds. When calling from_pandas, you can experiment with different w_threshold values (and the beta term, if you are using from_pandas_lasso).
This changes the density of the network: a denser structure implies a BN with more parameters, which may perform better. If it is too dense, though, you may not have enough data to train it and may overfit.
Center the data. Empirically, it seems that NOTEARS (the algorithm behind from_pandas) works best if the data is centered, so subtracting the mean of the data may be a good idea.
Ensure causality. NOTEARS does not ensure causality, so we need "experts" to judge the output and make the necessary modifications. If you see edges that don't make causal sense, you can either remove them or add them as tabu_edges and train your network again (see the sketch after this list).
Experiment with discretisation. The performance can be very sensitive to how you discretise the data. Experimenting with various types of discretisation can help. You can use:
Methods available in Causalnex (uniform, for example)
fixed discretisations based on what thresholds make sense for your data
MDLP, a supervised way to discretise data. You can apply MDLP to each node, using one of its children as the "target". There are two main packages for MDLP on PyPI: mdlp and mdlp-discretization.
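Putting the first few tips together, a hedged sketch using CausalNex's NOTEARS entry point; the file name, column names, and threshold values are placeholders you would tune on your own data.

```python
# Sketch of structure learning with CausalNex; values here are placeholders.
import pandas as pd
from causalnex.structure.notears import from_pandas

df = pd.read_csv("process_data.csv")  # hypothetical thermodynamic data
df_centered = df - df.mean()          # tip: center the data for NOTEARS

# Tip: experiment with w_threshold to control how dense the learned DAG is.
sm = from_pandas(df_centered, w_threshold=0.3)

# Tip: encode expert knowledge by forbidding edges that make no causal
# sense, then learn the structure again.
forbidden = [("outlet_temp", "inlet_temp")]  # hypothetical non-causal edge
sm = from_pandas(df_centered, w_threshold=0.3, tabu_edges=forbidden)
```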

Weight prediction using NNs

I’m relatively new to the topic of machine learning, so naturally I have a couple of issues that I hope you can help me with or lead me in the right direction on. I had a project before, during which we collected data of people walking normally and also with a stone in their shoe; we measured acceleration and also gyroscope signals. Based on this data I built a neural network that can classify the signals into normal or impaired walking. So two possible outputs.
Now my idea is this: I want to, using the same data, build a network that can predict the weights of the participants (it was also recorded).
Based on this my three questions:
- What kind of network structure is most suitable for such a task? (Dense, CNN, LSTM,…)
- Before, the network basically had two options to answer from (normal or impaired walking), but now I have a continuous range of answers… How can this be approached?
- How can I make sure the network initializes with a sensible prediction?
I hope all the questions make sense. Any help will be much appreciated!
You can use whichever NN architecture you prefer:
If you work with sequences, use 1D convolutions or RNNs.
As you are dealing with a regression problem, you need a single output neuron with no activation function.
Take a look here to learn how to solve a regression problem with RNNs.
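A minimal sketch of such a network in Keras, assuming fixed-length windows of accelerometer plus gyroscope samples; the shapes and layer sizes below are placeholder assumptions, not tuned values.

```python
# Regression network over sensor windows of shape (timesteps, channels).
import tensorflow as tf

timesteps, channels = 128, 6  # e.g. 3-axis accelerometer + 3-axis gyroscope

model = tf.keras.Sequential([
    tf.keras.layers.Conv1D(32, kernel_size=5, activation="relu",
                           input_shape=(timesteps, channels)),
    tf.keras.layers.Conv1D(64, kernel_size=5, activation="relu"),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),  # single output neuron, no activation
])

# Mean squared error (or MAE) is the usual loss for a continuous target.
model.compile(optimizer="adam", loss="mse", metrics=["mae"])
# model.fit(X_train, y_train, epochs=50, validation_split=0.2)
```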

Applying "reinforcement learning" on a supervised learning model

Is it possible to use "reinforcement learning" or a feedback loop on a supervised model?
I have worked on a machine learning problem using a supervised learning model, more precisely a linear regression model, but I would like to improve the results by creating a feedback loop on the outputs of the prediction, i.e, tell the algorithm if it made mistakes on some examples.
As far as I know, this is basically how reinforcement learning works: the model learns from positive and negative feedback.
I found out that we can implement supervised learning and reinforcement learning algorithms using PyBrain, but I couldn't find a way to combine the two.
Most (or maybe all) iterative supervised learning methods already use a feedback loop on the outputs of the prediction. In fact, this feedback is very informative, since it provides the exact amount of error in each sample. Think, for example, of stochastic gradient descent, where you compute the error of each sample to update the model parameters.
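A worked toy example of that per-sample feedback loop: stochastic gradient descent for one-dimensional linear regression. The data and learning rate here are made up for illustration.

```python
# SGD for y = w*x + b: each sample's exact error drives the update.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = 3.0 * x + 0.5 + 0.1 * rng.standard_normal(200)  # true w=3, b=0.5

w, b, lr = 0.0, 0.0, 0.1
for xi, yi in zip(x, y):
    error = (w * xi + b) - yi  # exact signed error for this one sample
    w -= lr * error * xi       # parameter update driven by that error
    b -= lr * error

print(f"learned w={w:.2f}, b={b:.2f}")
```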
In reinforcement learning the feedback signal (i.e., the reward) is much more limited than in supervised learning. Therefore, in the typical setup of adjusting some model parameters, if you have a set of input-output pairs (i.e., a training data set), it probably makes no sense to apply reinforcement learning.
If you are thinking of a more specific case/problem, you should be more specific in your question.
Reinforcement Learning has been used to tune hyper-parameters and/or select optimal Supervised Learning Models. There's also a paper on it: "Learning to optimize with Reinforcement Learning".
Following on from Pablo's answer, you may want to read up on "backpropagation". It may be what you are looking for.

Subject Extraction of a paragraph/document using NLP

I am trying to build a subject extractor: simply put, read all the sentences of a paragraph and make a calculated guess as to what the subject of the paragraph/article/document is. I might even upgrade it to a summarizer depending on the progress I make.
There is a great deal of information on the internet, and it is difficult to digest all of it and select a correct path, as I am not well versed in NLP.
I was hoping someone with some experience could point me in the right direction.
I am NOT looking for a linguistic computation model, but rather an n-gram or neural network approach, something that has been done recently.
I am also looking into coreference resolution using n-grams; if anyone has any leads on that, it would be much appreciated. I'm slightly familiar with the Stanford coreference resolver, but I don't want to use it as-is.
Any information, ideas and opinions are welcome.
@Dagger,
For finding the 'topic' of a whole document, there are several approaches you can try and research. The unsupervised approaches will be faster and will get you started, but they may not differentiate between closely related documents that have similar topics; they also don't require a neural network. The supervised techniques will recognise differences between similar documents better but require training a network. You should be able to easily find blogs about implementing these in your desired programming language.
Unsupervised
K-Means Clustering using TF-IDF On Text Words - see intro here
Latent Dirichlet Allocation
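Of the unsupervised options above, Latent Dirichlet Allocation is probably the quickest to try. A minimal sketch with scikit-learn, using made-up placeholder documents:

```python
# LDA topic extraction; the documents and topic count are placeholders.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the boat sailed across the harbour in strong wind",
    "the model was trained on sensor data from the hull",
    "wind and waves affected the speed of the vessel",
]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Top words per topic give a rough "subject" for each cluster of documents.
words = vec.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [words[j] for j in topic.argsort()[-5:][::-1]]
    print(f"topic {i}: {top}")
```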
Supervised
Text Classification models using SVM, Logistic Regressions and neural nets
LSTM/RNN models using neural net
The neural net models will require training on a set of known documents with associated topics first. They are best suited to picking the ONE most likely topic from their model, but multi-class topic implementations are possible.
If you post example data and/or domain along with programming language, I can give some more specifics for you to explore.
