Building an article classifier - NLTK / scikit-learn / other NLP implementations - Python

For my current project I have to build a topic modeling or classification utility that will process thousands of articles and classify them into various topics (perhaps 40-50 topics to start with). For example, it would go over database technology articles and decide whether each one is a NoSQL article, a relational DB article, or a graph database article (just an example).
I have a very basic NLP background, and our team mostly has Python backend scripting experience. I began looking into the options available to implement this and came across NLTK and scikit-learn, which are Python based, as well as Weka and Mallet, which are JVM based.
My understanding is that NLTK is better suited to learning and understanding various NLP techniques such as topic classification.
Can someone suggest what may be the best open source solution that we can use for our implementation?
Please let me know if I missed any information that would help with the answers.

Building a topic classification model can be done in two ways.
If you have a training set where you have labels for the documents, you can always build a classifier using scikit-learn.
But if you don't have any training data, you can build something that is called a topic model. It basically gives you topics as groups of words.
You can use the Gensim package to implement this. Very crisp, fast and easy to implement (Look Here).
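As a rough sketch of both routes (the toy documents, labels and parameter values below are placeholders for illustration, not anything from your project):

```python
# Supervised route: TF-IDF features + a linear classifier (scikit-learn)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = [
    "MongoDB stores schemaless JSON-like documents",
    "PostgreSQL enforces a relational schema with SQL joins",
    "Neo4j traverses relationships between graph nodes",
]
labels = ["nosql", "relational", "graph"]

clf = make_pipeline(TfidfVectorizer(stop_words="english"), LogisticRegression())
clf.fit(docs, labels)
print(clf.predict(["Cypher queries walk graph nodes and edges"]))

# Unsupervised route: LDA topic model (Gensim) when you have no labels
from gensim.corpora import Dictionary
from gensim.models import LdaModel

tokens = [d.lower().split() for d in docs]
dictionary = Dictionary(tokens)
corpus = [dictionary.doc2bow(t) for t in tokens]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10)
print(lda.print_topics())  # each topic is a weighted group of words
```

With 40-50 topics and labelled articles, the scikit-learn route scales to many more classes than this toy example; the Gensim route only gives you word groups that you still have to name yourself.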

Related

What method should be used to tag specific texts, when the dataset is too small for training a model?

My problem is the following:
The task is to use a Machine Learning Algorithm to perform tagging on texts. This would be simple enough if those texts were about common topics where large datasets for training are readily available.
But these texts describe company KPIs and products. The tagging should improve search inside the upcoming Data Catalog.
What complicates matters: the number of already human-tagged texts is very small (around 50).
So I was searching for a method to train a model on a dataset that has nothing to do with our dataset, since ours is highly specialized.
Is there a way to adapt pre-trained text classifiers to our data?
I'm fairly new to NLP/Machine Learning since I recently started studying this field. I mainly program in Python. If a solution in Python is available, it would be great!
Any help would be much appreciated!

Subject Extraction of a paragraph/document using NLP

I am trying to build a subject extractor; simply put, it reads all the sentences of a paragraph and makes a calculated guess as to what the subject of the paragraph/article/document is. I might even upgrade it to a summarizer depending on the progress I make.
There is a great deal of information on the internet. It is difficult to understand all of it and select a correct path, as I am not well versed in NLP.
I was hoping someone with some experience could point me in the right direction.
I am NOT looking for a linguistic computation model, but rather an n-gram or neural network approach, something that has been done recently.
I am also looking into coreference resolution using n-grams; if anyone has any leads on that, it is much appreciated. I am slightly familiar with the Stanford coreference resolver, but I don't want to use it as is.
Any information, ideas and opinions are welcome.
@Dagger,
For finding the 'topic' of the whole document, there are several approaches you can try and research. The unsupervised approaches will be faster and will get you started, but may not differentiate between closely related documents that have similar topics. These also don't require a neural network. The supervised techniques will be able to recognise differences between similar documents better, but require training of networks. You should be able to easily find blogs about implementing these in your desired programming language.
Unsupervised
K-Means clustering using TF-IDF on text words - see intro here (a short sketch appears at the end of this answer)
Latent Dirichlet Allocation
Supervised
Text classification models using SVM, logistic regression and neural nets
LSTM/RNN models using neural nets
The neural net models will require training on a set of known documents with associated topics first. They are best suited to picking the ONE most likely topic from their model, but multi-class topic implementations are possible.
If you post example data and/or domain along with programming language, I can give some more specifics for you to explore.
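In the meantime, here is a minimal sketch of the first unsupervised option (K-Means over TF-IDF vectors) in Python with scikit-learn; the documents and the cluster count are placeholders since I don't know your data yet:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

documents = [
    "the court ruled on the contract dispute",
    "the striker scored twice in the final",
    "the judge dismissed the appeal over the contract",
    "the team won the championship match",
]

# TF-IDF turns each document into a weighted bag-of-words vector
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(documents)

# Cluster the documents; each cluster's top terms act as a rough "topic"
km = KMeans(n_clusters=2, n_init=10, random_state=0)
km.fit(X)

terms = vectorizer.get_feature_names_out()
for i, centroid in enumerate(km.cluster_centers_):
    top_terms = [terms[j] for j in centroid.argsort()[::-1][:4]]
    print(f"cluster {i}: {top_terms}")
```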

Named Entity Recognition in practice

I am an NLP novice trying to learn, and would like to better understand how Named Entity Recognition (NER) is implemented in practice, for example in popular Python libraries such as spaCy.
I understand the basic concept behind it, but I suspect I am missing some details.
From the documentation, it is not clear to me for example how much preprocessing is done on the text and annotation data; and what statistical model is used.
Do you know if:
In order to work, does the text have to go through chunking before the model is trained? Otherwise it wouldn't be able to do anything useful?
Are the text and annotations typically normalized prior to training the model, so that a named entity at the beginning or in the middle of a sentence can still be recognised?
Specifically in spaCy, how are things implemented concretely? Is it an HMM, a CRF or something else that is used to build the model?
Apologies if this is all trivial; I am having some trouble finding easy-to-read documentation on NER implementations.
At https://spacy.io/models/en#en_core_web_md they say "English multi-task CNN trained on OntoNotes", so I imagine that's how they obtain the named entities. You can see that the pipeline is
tagger, parser, ner
and you can read more here: https://spacy.io/usage/processing-pipelines. I would try to remove the different components and see what happens; this way you could see what depends on what. I'm pretty sure NER depends on the tagger, but I'm not sure whether it requires the parser. All of them, of course, require the tokenizer.
I don't understand your second point. If an entity is at the beginning or in the middle of a sentence, that is fine; the NER system should be able to catch it. I don't see how the word "normalize" applies to an entity's position in the text.
Regarding the model, they mention a multi-task CNN, so I guess the CNN is the model for NER. Sometimes people use a CRF on top, but they don't mention it, so it's probably just the CNN. According to their performance figures, it's good enough.
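If it helps, here is a small sketch of how to poke at this yourself (it assumes the en_core_web_md model is installed via python -m spacy download en_core_web_md, and the sentence is made up):

```python
import spacy

nlp = spacy.load("en_core_web_md")
print(nlp.pipe_names)  # components in the pipeline, e.g. tagger / parser / ner

doc = nlp("Apple opened a new office in Berlin in 2019.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. Apple ORG, Berlin GPE, 2019 DATE

# Reload with a component disabled to see whether NER still works without it
nlp_no_parser = spacy.load("en_core_web_md", disable=["parser"])
doc = nlp_no_parser("Apple opened a new office in Berlin in 2019.")
print([(e.text, e.label_) for e in doc.ents])
```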

Text summarization using deep learning techniques

I am trying to summarize text documents that belong to legal domain.
I am referring to the site deeplearning.net for how to implement the deep learning architectures. I have read quite a few research papers on document summarization (both single-document and multi-document), but I am unable to figure out how exactly the summary is generated for each document.
Once training is done, the network stabilizes during the testing phase. So even if I know the set of features (which I have figured out) that are learnt during the training phase, it would be difficult to find out the importance of each feature (because the weight vector of the network is stabilized) during the testing phase, where I will be trying to generate a summary for each document.
I tried to figure this out for a long time, but in vain.
If anybody has worked on it or have any idea regarding the same, please give me some pointers. I really appreciate your help. Thank you.
I think you need to be a little more specific. When you say "I am unable to figure out how exactly the summary is generated for each document", do you mean that you don't know how to interpret the learned features, or that you don't understand the algorithm? Also, "deep learning techniques" covers a very broad range of models - which one are you actually trying to use?
In the general case, deep learning models do not learn features that are humanly interpretable (although you can of course try to look for correlations between the given inputs and the corresponding activations in the model). So, if that's what you're asking, there really is no good answer. If you're having difficulties understanding the model you're using, I can probably help you :-) Let me know.
This is a blog series that explains in detail, from the very beginning, how text summarization works. Recent research uses seq2seq deep-learning-based models, and the series starts by explaining this architecture and works up to the newest research approaches.
This repo also collects multiple implementations for building a text summarization model. It runs the models on Google Colab and hosts the data on Google Drive, so no matter how powerful your computer is, you can use Google Colab, which is free, to train your deep models.
If you would like to see text summarization in action, you can use this free API.
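To make the seq2seq idea a bit more concrete, here is a bare-bones encoder-decoder sketch in Keras; the vocabulary size, dimensions and training data are placeholders, and real summarization models add attention, beam search and a lot of preprocessing on top of this skeleton.

```python
from tensorflow.keras.layers import Input, LSTM, Dense, Embedding
from tensorflow.keras.models import Model

vocab_size, emb_dim, hidden = 10000, 128, 256  # placeholder sizes

# Encoder: read the article tokens and compress them into the final LSTM states
encoder_inputs = Input(shape=(None,))
enc_emb = Embedding(vocab_size, emb_dim)(encoder_inputs)
_, state_h, state_c = LSTM(hidden, return_state=True)(enc_emb)

# Decoder: generate the summary token by token, seeded with the encoder states
decoder_inputs = Input(shape=(None,))
dec_emb = Embedding(vocab_size, emb_dim)(decoder_inputs)
dec_outputs, _, _ = LSTM(hidden, return_sequences=True, return_state=True)(
    dec_emb, initial_state=[state_h, state_c]
)
summary_probs = Dense(vocab_size, activation="softmax")(dec_outputs)

model = Model([encoder_inputs, decoder_inputs], summary_probs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```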
I truly hope this helps.

I'm trying to do sentiment analysis, need advice on how to start

I have decided to start a sentiment analysis project and am not sure exactly how to start and what to use.
I have heard of the Naive Bayes classifier but I'm not sure how to use it in Python or how it works.
If I have to find the net sentiment, I have to find the value of each word. Is there an already-created database full of words and their sentiment, or must I create a list? Thanks.
Haven't looked at this for a couple years but...
at that time NLTK was the thing to use for natural language processing. There were a few papers and projects, with source code, demonstrating sentiment analysis with NLTK.
I just googled and found http://text-processing.com/demo/sentiment/ and the page has examples and links to others, including "Training a naive Bayes classifier for sentiment analysis using movie reviews".
There are plenty of datasets you can use as training datasets but I'd recommend you build your own to ensure your classifier is set up properly for your domain.
For example, you could start with the Cornell Movie Review dataset as a way to get started. I'd be very surprised if this dataset worked for your needs but it would allow you to get started building your classification system.
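As a concrete starting point, here is a rough sketch of the "naive Bayes on movie reviews" route both answers point at, using NLTK's bundled copy of the Cornell movie review corpus with simple word-presence features (as noted above, you would eventually swap in data from your own domain):

```python
import random
import nltk
from nltk.corpus import movie_reviews
from nltk.classify import NaiveBayesClassifier

nltk.download("movie_reviews")  # fetch the Cornell movie review corpus

def word_features(words):
    # "Word is present" features, the simplest bag-of-words representation
    return {w.lower(): True for w in words}

documents = [
    (word_features(movie_reviews.words(fileid)), category)
    for category in movie_reviews.categories()        # 'pos' / 'neg'
    for fileid in movie_reviews.fileids(category)
]
random.shuffle(documents)

train_set, test_set = documents[:1600], documents[1600:]
classifier = NaiveBayesClassifier.train(train_set)

print("accuracy:", nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(10)

# Classify a new piece of text
print(classifier.classify(word_features("a wonderful, heartfelt film".split())))
```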
