Subject Extraction of a paragraph/document using NLP - python

I am trying to build a subject extractor: simply put, it should read all the sentences of a paragraph and make a calculated guess at what the subject of the paragraph/article/document is. I might even upgrade it to a summarizer depending on the progress I make.
There is a great deal of information on the internet, and since I am not well versed in NLP it is difficult to digest it all and pick a sensible path.
I was hoping someone with some experience could point me in the right direction.
I am NOT looking for a computational linguistics model, but rather an n-gram or neural network approach, something that has been done recently.
I am also looking into coreference resolution using n-grams; if anyone has any leads on that, it would be much appreciated. I am slightly familiar with the Stanford coreference resolver, but I don't want to use it as is.
Any information, ideas and opinions are welcome.

@Dagger,
For finding the 'topic' of a whole document, there are several approaches you can try and research. The unsupervised approaches will be faster and will get you started, but they may not differentiate between closely related documents that have similar topics; they also don't require a neural network. The supervised techniques will recognise differences between similar documents better, but they require training a network first. You should be able to easily find blogs about implementing these in your desired programming language.
Unsupervised
K-means clustering using TF-IDF on the words of the text
Latent Dirichlet Allocation
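To get you started, here is a minimal sketch of both unsupervised approaches in scikit-learn; the toy documents are hypothetical:

```python
# A minimal sketch of both unsupervised options, using scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation

docs = [  # hypothetical toy documents
    "the court ruled on the contract dispute",
    "the judge dismissed the appeal",
    "the model was trained on labelled data",
    "neural networks need lots of training data",
]

# K-means on TF-IDF vectors: each cluster's top-weighted terms hint at its topic.
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
terms = tfidf.get_feature_names_out()
for i, centroid in enumerate(km.cluster_centers_):
    print("cluster", i, [terms[t] for t in centroid.argsort()[::-1][:3]])

# LDA works on raw counts; each learned component is a distribution over words.
counts = CountVectorizer(stop_words="english")
C = counts.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(C)
words = counts.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    print("topic", i, [words[t] for t in topic.argsort()[::-1][:3]])
```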
Supervised
Text classification models using SVMs, logistic regression, and neural nets
LSTM/RNN models using neural net
The neural net models will require training on a set of known documents with associated topics first. They are best suited to picking the ONE most likely topic under their model, but multi-label implementations that assign several topics are also possible.
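Before reaching for neural nets, a linear supervised baseline is often worth trying; here is a hedged sketch with hypothetical labelled documents:

```python
# A simple supervised baseline: TF-IDF features + logistic regression.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_docs = [  # hypothetical labelled documents
    "the court ruled on the contract dispute",
    "the striker scored twice in the final",
    "the judge dismissed the appeal",
    "the team won the championship match",
]
train_topics = ["legal", "sport", "legal", "sport"]

clf = make_pipeline(TfidfVectorizer(stop_words="english"), LogisticRegression())
clf.fit(train_docs, train_topics)
print(clf.predict(["the referee stopped the match"]))  # -> ['sport'] on this toy data
```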
If you post example data and/or domain along with programming language, I can give some more specifics for you to explore.

Related

Suggestions for nonparametric machine learning models

I am new to machine learning, but I have decent experience in Python. I am faced with a problem: I need to find a machine learning model that would work well to predict the speed of a boat given current environmental and physical conditions. I have looked into scikit-learn, PyTorch, and TensorFlow, but I am having trouble finding information on what type of model I should use. I am almost certain that linear regression models would be useless for this task. I have been told that non-parametric regression models would be ideal, but I am unable to find many in the scikit-learn library. Should I be trying to use regression models at all, or should I be looking more into neural networks? I'm open to any suggestions; thanks in advance.
I think a multiple linear regression model would work well for your case. I am assuming that each input is just a set of environmental parameters with a corresponding boat speed. For such problems, regression usually works well. I would not recommend neural networks unless you have a lot of training data and each input sample is also quite large.
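To make that concrete, here is a minimal scikit-learn sketch comparing a linear baseline against a non-parametric alternative (a random forest, one of the non-parametric regressors scikit-learn does ship). The data below is a synthetic stand-in for your environmental readings:

```python
# A sketch comparing a linear baseline with a non-parametric regressor.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))  # hypothetical stand-ins: wind speed, current, load
y = 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)  # fake boat speed

# Linear baseline first; only move to something more complex if it underfits.
print(cross_val_score(LinearRegression(), X, y, cv=5).mean())

# A non-parametric alternative that can capture non-linear effects.
print(cross_val_score(RandomForestRegressor(random_state=0), X, y, cv=5).mean())
```

If the linear score is already close to the forest's, the relationship is mostly linear and the simpler model wins.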

Machine Learning detecting random string

I do apologise in advance if something similar has been posted, but from the research I've done I can't find anything specific.
I'm currently looking at http://scikit-learn.org and the content there looks great, but I'm confused about which type of model I should be using for my problem.
I want to be able to have 2 labels.
**Suspicious**
1hbn34uqrup7a13t
qmr30zoyswr21cdxolg
1qmqnbetqx
**Not-Suspicious**
cheesemix
reg526
animato12
What type of machine learning algorithm could I feed the data above into, to teach it what I'd class as suspicious through supervised learning?
I'm leaning towards classification, but there are so many models to choose from that I'm slightly lost.
The first step in such machine learning problems is to think about the "features". You can't use e.g. a linear classifier directly on these strings, so you have to extract some meaningful features that describe each string. (In computer vision, these features are often edges, corner points, or SIFT descriptors.) You basically have two options:
Design features yourself.
Learn the features.
1) This is the "classical" machine learning approach: you manually design a list of representative features, which you can extract from your input data. In your case, you could start with e.g.
length of the string
number of different characters
number of special characters
something about the sorting?
...
That will give you a vector of numbers for each string. Now you can use any of the classifiers from scikit-learn to classify the data; this flowchart can help you choose an algorithm. Start with a simple model, e.g. a linear one (such as a linear SVM). If performance is not sufficient, use a more complex model (e.g. an SVM with kernels), or rethink your choice of features.
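To illustrate option 1, here is a minimal sketch using the labelled examples from your question; the three features are just starting points, not a definitive set:

```python
# Option 1: hand-designed features + a linear SVM (scikit-learn).
import numpy as np
from sklearn.svm import LinearSVC

def string_features(s):
    # Length, distinct-character ratio, digit ratio - just starting points.
    return [len(s), len(set(s)) / len(s), sum(c.isdigit() for c in s) / len(s)]

suspicious = ["1hbn34uqrup7a13t", "qmr30zoyswr21cdxolg", "1qmqnbetqx"]
normal = ["cheesemix", "reg526", "animato12"]

X = np.array([string_features(s) for s in suspicious + normal])
y = [1] * len(suspicious) + [0] * len(normal)  # 1 = suspicious

clf = LinearSVC().fit(X, y)
print(clf.predict(np.array([string_features("xk29fjq1mzp")])))  # hypothetical new string
```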
2) This is the "modern" approach, which is gaining more and more popularity. Designing the features is a crucial step in 1) and it requires good knowledge of your data. Now, by using a deep neural network, you can feed your raw data (the string) into the network, and let the network learn such "features" itself. This, however, requires a large amount of labeled training data, and a lot of processing power (GPUs).
LSTM networks are today's state of the art in natural language processing and similar tasks. An LSTM would be well suited to your task, as the input can be of variable length.
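For a concrete feel of option 2, here is a minimal character-level LSTM sketch with tf.keras (an assumption about your stack; any deep learning framework works). The toy dataset is far too small to train a real model and only shows the wiring:

```python
# Option 2: a character-level LSTM classifier with tf.keras.
# The toy dataset is hypothetical and far too small for real training.
import numpy as np
import tensorflow as tf

strings = ["1hbn34uqrup7a13t", "qmr30zoyswr21cdxolg", "cheesemix", "reg526"]
labels = np.array([1, 1, 0, 0])  # 1 = suspicious

# Map each character to an integer id; 0 is reserved for padding.
chars = sorted(set("".join(strings)))
char_id = {c: i + 1 for i, c in enumerate(chars)}
X = tf.keras.utils.pad_sequences([[char_id[c] for c in s] for s in strings])

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=len(chars) + 1, output_dim=8, mask_zero=True),
    tf.keras.layers.LSTM(16),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, labels, epochs=5, verbose=0)
```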
tl;dr: Either design features yourself and use a classifier of your choice, or dive into deep neural networks and let a network learn both the features and the classification.

How to find the most important features learned during Deep Learning using CNN?

I followed the tutorial given at this site, which detailed how to perform text classification using a CNN. It uses the movie review dataset to predict positive and negative reviews.
My question is: is there any way to find the most important features learned by the model? Do TensorFlow/Theano have any support for this?
Thanks!
A word of warning: if you need to trace the classification back to specific input features, it's quite possible that a CNN is the wrong ML paradigm for your application. Most text processing uses RNNs, bag-of-words, bi-grams, and other simple linear combinations.
The structure of a CNN is generally antithetical to identifying the importance of individual features. Because of the various non-linear layers, it is rarely possible to pick out any one feature as important; rather, the combinations of inputs form small structures of inference, which then convolve to form more complex structures, until the final output is driven by a series of neighbor relationships, cut-offs, poolings, and other items.
This is why back-propagation is so important to training CNNs: the causation chain does not reverse cleanly. Otherwise, we could reduce the process to a simple linear NN with one hidden layer.
If you want to analyze what's happening, try visualizing your intermediate layers. There are various modules to help with that; for instance, try a search for "+theano +visualize +CNN -news" (the last term removes the high-traffic references to the Cable News Network). There are plenty of examples in image processing; we won't know how much it might help your text processing until you try it.
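As one concrete way to do that (shown here with tf.keras rather than Theano, purely as an assumption about your stack), you can build a second model that stops at the layer you want to inspect. The tiny text-CNN and random token ids below are hypothetical stand-ins for the tutorial's model:

```python
# Build a second model that stops at the layer of interest and read out
# its activations. The tiny text-CNN and random ids are hypothetical.
import numpy as np
import tensorflow as tf

inp = tf.keras.Input(shape=(20,))                      # 20 token ids per example
emb = tf.keras.layers.Embedding(100, 16)(inp)
conv = tf.keras.layers.Conv1D(8, 3, activation="relu", name="conv1")(emb)
out = tf.keras.layers.Dense(1, activation="sigmoid")(
    tf.keras.layers.GlobalMaxPooling1D()(conv))
model = tf.keras.Model(inp, out)

probe = tf.keras.Model(inp, conv)                      # stops at "conv1"
x = np.random.randint(0, 100, size=(2, 20))            # two fake sequences
print(probe(x).shape)                                  # (2, 18, 8): activation per position and filter
```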

Signal feature identification

I am trying to identify phonemes in voices, using a training database of known ones.
I'm wondering if there is a way of identifying common features within my training sample and using that to classify a new one.
It seems like there are two paths:
Give the process raw/normalised data and it will return similar ones
Extract certain metrics such as pitch, formants etc and compare to training set
My interest is the first!
Any recommendations on machine learning or regression methods/algorithms?
Since you tagged Python, I highly recommend looking into scikit-learn, an excellent Python library for machine learning. Their docs are very thorough and should give you a good crash course in machine learning algorithms and implementation (including classification, regression, clustering, etc.).
Your points 1 and 2 are not that different: 1) is the end result of a classification problem, and 2) is the features that you give to the classifier. What you need is a good classifier (SVM, decision trees, hierarchical classifiers, etc.) and a good set of features (the pitch, formants, etc. that you mentioned).
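As a hedged sketch of that pipeline, the snippet below summarises each clip as its mean MFCC vector (via librosa, an assumption about your toolchain) and feeds the vectors to an SVM; the file names and labels are hypothetical:

```python
# Path 2 in miniature: mean MFCC vector per clip + an SVM classifier.
# File names and labels are hypothetical; librosa is assumed available.
import numpy as np
import librosa
from sklearn.svm import SVC

def mfcc_vector(path):
    # Summarise a clip as the mean of its MFCC frames (a fixed-length vector).
    y, sr = librosa.load(path, sr=None)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)

paths = ["phoneme_a_01.wav", "phoneme_a_02.wav", "phoneme_o_01.wav", "phoneme_o_02.wav"]
labels = ["a", "a", "o", "o"]

X = np.array([mfcc_vector(p) for p in paths])
clf = SVC().fit(X, labels)
print(clf.predict(X[:1]))
```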

Text summarization using deep learning techniques

I am trying to summarize text documents that belong to the legal domain.
I am referring to the site deeplearning.net on how to implement deep learning architectures. I have read quite a few research papers on document summarization (both single-document and multi-document), but I am unable to figure out how exactly the summary is generated for each document.
Once training is done, the network's weights are fixed during the testing phase. So even if I know the set of features that are learnt during training (which I have figured out), it is difficult to find out the importance of each feature during the testing phase, when I will be trying to generate a summary for each document.
I have tried to figure this out for a long time, but in vain.
If anybody has worked on this or has any idea regarding it, please give me some pointers. I really appreciate your help. Thank you.
I think you need to be a little more specific. When you say "I am unable to figure out how exactly the summary is generated for each document", do you mean that you don't know how to interpret the learned features, or that you don't understand the algorithm? Also, "deep learning techniques" covers a very broad range of models - which one are you actually trying to use?
In the general case, deep learning models do not learn features that are humanly interpretable (although you can of course try to look for correlations between the given inputs and the corresponding activations in the model). So if that's what you're asking, there really is no good answer. If you're having difficulties understanding the model you're using, I can probably help you :-) Let me know.
This blog series talks in much detail, from the very beginning, about how text summarization works. Recent research uses seq2seq deep-learning-based models, and the series begins by explaining that architecture before working up to the newest research approaches.
This repo also collects multiple implementations of text summarization models. It runs the models on Google Colab and hosts the data on Google Drive, so no matter how powerful your computer is, you can use Colab, which is a free service, to train your deep models.
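If you just want a quick feel for what extractive summarization does before diving into the seq2seq models those resources cover, here is a minimal TF-IDF baseline (plain scikit-learn, not taken from the blog series); it scores sentences by their average term weight and keeps the top k:

```python
# A naive extractive baseline: score sentences by mean TF-IDF weight, keep top k.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def summarize(text, k=2):
    # Splitting on periods is crude; fine for a sketch, not for real legal text.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    scores = np.asarray(tfidf.mean(axis=1)).ravel()   # average term weight per sentence
    keep = sorted(np.argsort(scores)[::-1][:k])       # top k, in original order
    return ". ".join(sentences[i] for i in keep) + "."

doc = ("The court heard the appeal on Monday. The weather was mild. "
       "The judge ruled that the contract was void. Lunch was served at noon.")
print(summarize(doc))
```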
If you would like to see text summarization in action, you can use this free API.
I truly hope this helps.
