Unsupervised NearestNeighbors Sklearn Pipeline Examples - python

I am working on a project that compares the performance of unsupervised ML models on a loan default dataset, and I am having trouble finding examples/tips on using unsupervised ML models within a scikit-learn pipeline. My pipeline is set up with pre-processing such as OneHotEncoder, StandardScaler, SimpleImputer, and a couple of custom transformers.
The lack of examples/tips available on the internet seems to suggest that this is not an advised route to go down for unsupervised ML models.
Has anyone come across any examples/tips, or does anyone have advice of their own to aid my project?
Any help would be greatly appreciated!
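For what it's worth, such a setup is possible: sklearn's Pipeline accepts any estimator with a fit method as its final step, including unsupervised ones like NearestNeighbors. Below is a minimal sketch under that assumption; the loan-themed column names and the toy DataFrame are invented for illustration.

```python
# Minimal sketch: unsupervised NearestNeighbors as the final step of a
# scikit-learn Pipeline. The DataFrame and column names are hypothetical.
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.neighbors import NearestNeighbors

df = pd.DataFrame({
    "loan_amount": [1000.0, 2500.0, None, 4000.0],
    "purpose": ["car", "home", "car", "education"],
})

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), ["loan_amount"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["purpose"]),
])

pipe = Pipeline([
    ("preprocess", preprocess),
    ("knn", NearestNeighbors(n_neighbors=2)),
])

pipe.fit(df)  # unsupervised: no y required

# NearestNeighbors has no predict/transform, so query the fitted
# steps directly after pushing the data through the preprocessing:
X = pipe.named_steps["preprocess"].transform(df)
distances, indices = pipe.named_steps["knn"].kneighbors(X)
print(indices)
```

The main caveat, and perhaps why examples are scarce, is that NearestNeighbors exposes kneighbors rather than predict or transform, so the pipeline fits end to end but queries have to go through the individual fitted steps as shown.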

Related

Suggestions for nonparametric machine learning models

I am new to machine learning, but I have decent experience in Python. I am faced with a problem: I need to find a machine learning model that would work well to predict the speed of a boat given current environmental and physical conditions. I have looked into scikit-learn, PyTorch, and TensorFlow, but I am having trouble finding information on what type of model I should use. I am almost certain that linear regression models would be useless for this task. I have been told that non-parametric regression models would be ideal for this, but I am unable to find many in the scikit-learn library. Should I be trying to use regression models at all, or should I be looking more into neural networks? I'm open to any suggestions; thanks in advance.
I think a multiple linear regression model would work well for your case. I am assuming that the input data is just a set of environmental parameters and that you have a boat speed corresponding to each observation. For such problems, regression usually works well. I would not recommend neural networks unless you have a lot of training data and each input sample is also quite large.
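Since the question specifically asks about non-parametric models, here is a hedged sketch of two non-parametric regressors that ship with scikit-learn, scored with cross-validation. The synthetic data and the feature meanings in the comment are invented for illustration.

```python
# Two non-parametric regressors from scikit-learn on synthetic data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))  # e.g. wind speed, current, load, heading
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=500)

for model in (KNeighborsRegressor(n_neighbors=5),
              RandomForestRegressor(n_estimators=200, random_state=0)):
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(type(model).__name__, round(scores.mean(), 3))
```

Neither model assumes linearity, which addresses the asker's concern that a linear fit would be useless.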

KNIME: relevant nodes for regression analysis (statistical models)

I need to use KNIME for regression analysis. I am a Python user; I know KNIME as well, but not in depth!
I usually use statsmodels in Python for regression analysis and for working on statistical models.
However, for solving a regression problem as a machine learning problem, I use sklearn's regression models. Each of these Python packages has its own benefits depending on your task, and also a different view of the output, which is really important for addressing the problem in the right way.
Here is my question: does KNIME offer any special nodes for statistical models? If I plan to do a regression analysis, which nodes are recommended?
Many thanks for your help
There's a Linear Regression Learner node under Analytics > Mining > Linear/Polynomial Regression in the node repository. Does that do what you need?
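To illustrate the different "views of output" the asker mentions (a distinction that also carries over to choosing between KNIME's statistics and mining nodes), here is a small, self-contained Python comparison; the synthetic data is made up for illustration.

```python
# statsmodels reports a full inference summary; scikit-learn exposes
# a fit/predict interface with coefficients only.
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 1.5 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.2, size=100)

# statsmodels: p-values, confidence intervals, R^2 in one report
ols = sm.OLS(y, sm.add_constant(X)).fit()
print(ols.summary())

# scikit-learn: coefficients and predictions, no inference report
lr = LinearRegression().fit(X, y)
print(lr.coef_, lr.intercept_)
```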

Subject Extraction of a paragraph/document using NLP

I am trying to build a subject extractor: simply put, read all the sentences of a paragraph and make a calculated guess as to what the subject of the paragraph/article/document is. I might even upgrade it to a summarizer, depending on the progress I make.
There is a great deal of information on the internet. It is difficult to understand all of it and select a correct path, as I am not well versed in NLP.
I was hoping someone with some experience could point me in the right direction.
I am NOT looking for a computational-linguistics model, but rather an n-gram or neural network approach, something that has been done recently.
I am also looking into coreference resolution using n-grams; if anyone has any leads on that, it is much appreciated. I am slightly familiar with the Stanford coreference resolver, but I don't want to use it as is.
Any information, ideas and opinions are welcome.
@Dagger,
For finding the 'topic' of the whole document, there are several approaches you can try and research. The unsupervised approaches will be faster and will get you started, but they may not differentiate between closely related documents that have similar topics. They also don't require a neural network. The supervised techniques will recognise differences between similar documents better, but they require training of networks. You should be able to easily find blogs about implementing these in your desired programming language.
Unsupervised
K-Means clustering using TF-IDF on text words - see intro here (a minimal sketch follows this answer)
Latent Dirichlet Allocation
Supervised
Text classification models using SVMs, logistic regression, and neural nets
LSTM/RNN models using neural net
The neural net models will require training on a set of known documents with associated topics first. They are best suited to picking the ONE most likely topic, but multi-label implementations that assign several topics per document are possible.
If you post example data and/or domain along with programming language, I can give some more specifics for you to explore.
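Here is the minimal K-Means-on-TF-IDF sketch referenced in the list above; the four toy documents and n_clusters=2 are placeholders, and a real corpus will need more careful preprocessing.

```python
# Cluster documents on TF-IDF features, then inspect the top terms
# per cluster as a rough "topic" label.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "the boat sailed across the harbour at dawn",
    "sailing ships and boats in open water",
    "neural networks learn representations from data",
    "training deep learning models on large datasets",
]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# The highest-weighted centroid terms hint at each cluster's topic
terms = tfidf.get_feature_names_out()
for i, centroid in enumerate(km.cluster_centers_):
    top = centroid.argsort()[::-1][:3]
    print(f"cluster {i}:", [terms[j] for j in top])
```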

Scikit-learn KNN (K Nearest Neighbors) parallelized using Apache Spark

I have been working on the KNN (K Nearest Neighbors) machine learning algorithm with Python and scikit-learn's machine learning API.
I have created sample code with a toy dataset using Python and scikit-learn, and my KNN works fine. But as we know, the scikit-learn API is built to work on a single machine, so once I replace my toy data with millions of records, performance will suffer.
I have searched for options, help, and code examples that would distribute my machine learning processing in parallel using Spark with the scikit-learn API, but I have not found any proper solution or examples.
Can you please let me know how I can increase performance with Apache Spark and scikit-learn's K Nearest Neighbors?
Thanks in advance!!
According to the discussion at https://issues.apache.org/jira/browse/SPARK-2336, MLlib (Apache Spark's machine learning library) does not have an implementation of KNN.
You could try https://github.com/saurfang/spark-knn.
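For reference, the single-machine scikit-learn baseline the asker describes looks roughly like this, using the bundled iris dataset as stand-in data. Distributing the neighbour search beyond one machine is exactly what scikit-learn alone won't do, which is why the answer points to a Spark-specific implementation.

```python
# Single-machine KNN baseline with scikit-learn's bundled iris data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print(knn.score(X_test, y_test))  # accuracy on the held-out split
```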

Building article Classifier - NLTK/ Scikit-learn/ Other NLP implementations

For my current project I have to build a topic modeling or classification utility which will process thousands of articles to classify them into various topics (perhaps 40-50 topics to start with). For example, it will go over database technology articles and classify whether an article is a NoSQL article, a relational DB article, or a graph database article (just an example).
I have a very basic NLP background, and our team mostly has Python backend scripting experience. I began looking into the various options available to implement it and came across NLTK and scikit-learn, which are Python based, and also Weka and Mallet, which are JVM based.
My understanding is that NLTK is more suited to learning and understanding various NLP techniques, like topic classification.
Can someone suggest what may be the best open source solution that we can use for our implementation?
Please let me know if I missed on any information that will help with the answers.
Building a topic classification model can be done in one of two ways.
If you have a training set where you have labels against the documents, you can always build a classifier using scikit-learn.
But if you don't have any training data, you can build something called a topic model. It basically gives you topics as groups of words.
You can use the Gensim package to implement this. Very crisp, fast, and easy to implement (look here).
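As a rough illustration of the Gensim route, here is a minimal LDA sketch; the toy corpus, the pre-tokenized input, and num_topics=2 are placeholders, not a recommendation for real data.

```python
# Minimal Gensim LDA: topics come back as weighted groups of words.
from gensim import corpora, models

texts = [
    ["nosql", "document", "store", "mongodb"],
    ["relational", "sql", "tables", "joins"],
    ["graph", "nodes", "edges", "neo4j"],
    ["sql", "relational", "schema", "normalization"],
]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary,
                      passes=10, random_state=0)

for topic_id, words in lda.print_topics(num_words=4):
    print(topic_id, words)
```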
