Index-based text clustering - Python

I am currently working on a project to cluster 2 million text memos. My objective is to create a standard for these memos (when I say memo, I mean a text containing the description of something). To do so, I first want to cluster similar memos (grouping those that probably have the same meaning) and then create a label for each cluster or group.
Since I am new to NLP, I want to know how to proceed and what references, materials, and similar prior projects exist on this subject.
I expect this is a classic problem in NLP and that many projects have tackled it.
I can work with R and Python.

Finding the hidden topics in unstructured data such as text, in a way that accurately represents the documents, is called topic modelling.
Gensim is a great library for finding memos with a similar topic. It has LSA and LDA implemented in Python. The two differ in how they are formulated; notably, Gensim's LSA is an online learning algorithm, which means that if the nature of the data changes, the model can be updated and will reorient itself.
topicmodels is the R package that implements LDA. Here is a quick tutorial on LDA.
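As a minimal sketch (not taken from the answer above), here is roughly how LDA could be run on a handful of toy memos with Gensim; the memos list is a hypothetical stand-in for your own texts:

from gensim import corpora, models
from gensim.utils import simple_preprocess

memos = ["invoice sent to customer for consulting services",
         "customer invoice for software license renewal",
         "server outage reported by the operations team"]

tokens = [simple_preprocess(m) for m in memos]        # lowercase and tokenize
dictionary = corpora.Dictionary(tokens)               # word <-> id mapping
corpus = [dictionary.doc2bow(t) for t in tokens]      # bag-of-words vectors

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
print(lda.print_topics())                             # top words per topic
print(lda[corpus[0]])                                 # topic mixture of the first memo

Each memo's topic mixture can then be used to group memos and to pick a label for each group.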

Related

Which clustering method is the standard way to go for text analytics?

Assume you have a lot of text sentences which may or may not be similar to one another. You want to cluster the similar sentences and find the centroid of each cluster. Which method is the preferred way of doing this kind of clustering? K-means with TF-IDF sounds promising. Nevertheless, are there more sophisticated or better algorithms? The data is tokenized and in a one-hot encoded format.
Basically, you can cluster texts using different techniques. As you pointed out, K-means with TF-IDF is one way to do it. Unfortunately, TF-IDF alone cannot "detect" semantics or project semantically similar texts near one another in the vector space. Instead of TF-IDF, you can use word embeddings such as word2vec or GloVe; there is a lot of information about them online.

Have you heard of topic models? Latent Dirichlet allocation (LDA) is a topic model: it treats each document as a mixture of a small number of topics, with each word's presence attributable to one of the document's topics (see the Wikipedia article). So, using a topic model, you can also do a kind of grouping and assign texts with similar topics to the same group. I recommend reading about topic models, since they are commonly used for problems like this connected with text clustering.
I hope my answer was helpful.
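For reference, a minimal sketch of the TF-IDF + K-means route with scikit-learn; the sentences and cluster count below are illustrative only:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

sentences = ["the cat sat on the mat",
             "a cat was sitting on a mat",
             "stock prices fell sharply today"]

X = TfidfVectorizer().fit_transform(sentences)    # sparse TF-IDF matrix
km = KMeans(n_clusters=2, random_state=0).fit(X)

print(km.labels_)            # cluster assignment for each sentence
print(km.cluster_centers_)   # centroids in TF-IDF space

Swapping the TF-IDF matrix for averaged word-embedding vectors is the usual way to bring in the semantic similarity discussed above.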
In my view, you can use LDA (latent Dirichlet allocation). It is more flexible than other clustering techniques because its alpha and beta priors can be adjusted to control the contribution of each topic to a document and of each word to a topic. This can help if the documents are not of similar length or quality.
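For example, in Gensim these priors are exposed as the alpha (document-topic) and eta (topic-word) parameters; the sketch below is only an illustration on a toy corpus, not a tuned setup:

from gensim import corpora
from gensim.models import LdaModel
from gensim.utils import simple_preprocess

docs = ["long detailed review of the new strategy game",
        "short note",
        "another detailed review of a strategy game"]
tokens = [simple_preprocess(d) for d in docs]
dictionary = corpora.Dictionary(tokens)
corpus = [dictionary.doc2bow(t) for t in tokens]

# alpha/eta set to "auto" lets Gensim learn asymmetric priors from the data,
# which can help with uneven document length or quality
lda = LdaModel(corpus, num_topics=2, id2word=dictionary,
               alpha="auto", eta="auto", passes=10)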

Perform Named Entity Recognition - NLP

I am trying to learn how to perform Named Entity Recognition.
I have a set of discharge summaries containing medical information about patients. I converted my unstructured data into structured data. Now, I have a DataFrame that looks like this:
Text | Target
normal coronary arteries... | R060
The Text column contains information about the diagnosis of a patient, and the Target column contains the code that will need to be predicted in a further task.
I have also constructed a dictionary that looks like this:
Code (Key) | Term (Value)
A00 | Cholera
This dictionary maps each diagnosis code to its corresponding term. The Term column will be used to identify the clinical entities in the corpus.
I will need to train a classifier and predict the code in order to automate the process of assigning codes to the discharge summaries (I explain this so you have an idea of the task I'm performing).
So far I have converted my data into a structured form. I am now trying to understand how I should perform Named Entity Recognition to label the medical terminology. I would like to try direct matching and fuzzy matching, but I am not sure what the preliminary steps are. Should I tokenize, stem, and lemmatize first? Or should I first find the medical terminology, since clinical named entities are often multi-token terms with nested structures that include other named entities inside them? Also, which Python packages or tools would you recommend?
I am new to this field, so any help will be appreciated. Thanks!
If you are asking about building a classification model, then deep learning is a good choice; it tends to perform very well on classification tasks.
When dealing with this type of language-processing task, I recommend first tokenizing your text and padding the sequences. Basic tokenization should be enough, but you can go further with basic string cleaning, since proper preprocessing can improve model accuracy by a few percent. For basic string processing you can use regular expressions (Python's built-in re module):
https://docs.python.org/3/library/re.html
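A small illustrative sketch of that kind of cleanup; the cleaning rules here are made up for illustration, not a prescription for clinical text:

import re

def clean(text):
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # drop punctuation and special characters
    text = re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace
    return text

print(clean("Normal coronary arteries -- no stenosis."))
# -> "normal coronary arteries no stenosis"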
I think you are doing the dictionary mapping after preprocessing. Mapping may be enough for a classification task, but I recommend learning about word embeddings; they will improve your model.
For all of these tasks I recommend TensorFlow. It is a well-known tool for machine learning, language processing, image processing, and much more. You can learn natural language processing from the official TensorFlow documentation; all the learning material is in the tutorials section:
https://www.tensorflow.org/tutorials/
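As a rough, hedged sketch of what such a pipeline could look like in TensorFlow (the texts, labels, and sizes below are toy placeholders, not your real data):

import tensorflow as tf

# Toy stand-ins for discharge-summary snippets and integer-encoded target codes
texts = ["normal coronary arteries", "acute cholera infection"]
labels = [0, 1]

vectorize = tf.keras.layers.TextVectorization(max_tokens=10000, output_sequence_length=32)
vectorize.adapt(texts)

model = tf.keras.Sequential([
    vectorize,                                                   # string -> token ids
    tf.keras.layers.Embedding(input_dim=10000, output_dim=64),   # learned word embeddings
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(tf.constant(texts), tf.constant(labels), epochs=3)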
I think this will help you. All the best with your work!
Thank you.

NER (Named Entity Recognition) similarity between sentences in documents

I have been using spaCy to find the named entities in sentences. My problem is that I have to calculate the NER similarity between sentences from two different documents. Is there a formula or Python package available for this?
TIA
I believe you are asking, how similar are two named entities?
This is not so trivial, since we have to define what "similar" means.
If we use a naive bag-of-words approach, two entities are more similar when more of their tokens are identical.
If we put the entity tokens into sets, the calculation is simply the Jaccard coefficient:
Sim(ent1, ent2) = |ent1 ∩ ent2| / |ent1 ∪ ent2|
In Python, that would be:
ent1 = set(map(str, spacy_entity1))   # token strings of the first entity
ent2 = set(map(str, spacy_entity2))   # token strings of the second entity
similarity = len(ent1 & ent2) / len(ent1 | ent2)   # Jaccard coefficient
Here spacy_entity1 and spacy_entity2 are entities extracted by spaCy (e.g. items from doc.ents). We build each entity set by taking the set of strings for the tokens that make up the entity.
I've been tackling the same problem; here is how I'm approaching it.
Since you've given no context about your problem, I'll keep the solution as general as possible.
Links to tools:
spaCy: spaCy similarity
Flair NLP: Flair NLP zero-shot / few-shot classification
Roam Research, for research articles and their approach: Roam Research home page
DISCLAIMER: THIS IS A TRIAL EXPERIMENT AND NOT THE SOLUTION.
Steps:
Try a large language model with a similarity coefficient (like GPT-3 or TARS) on your dataset.
(Instead of similarity you could also use zero-shot/few-shot classification and then inspect the accuracy manually, or compute it if you have labelled data.)
I then grab the verbs (assuming you have a corpus and not single-word inputs) and calculate their similarity, attaching the words at positions x-n and x+n (excluding stopwords according to your domain), where x is the position of the verb.
This step mainly lets you give more context to the large language model so that it is not biased by its own large training corpus (this is experimental).
Finally, I grab all the named entities and, where possible, their labels (for example, India: Country/Location) and, after constructing a prompt, ask the same LLM (large language model) how many of the entities fall into the same buckets/categories (perhaps as a probability comparison). A sketch of this zero-shot step follows below.
Even if you don't reproduce these steps, the key point is that these tools give you atomic information about your raw input; you have to apply mathematical functions on top to compare things and build the algorithm.
In my case I average the three similarity indices from the steps above and make sure all classification is multi-label classification.
And to validate any of this: human pattern matching, maybe.
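A hedged sketch of the zero-shot classification idea using Flair's TARS model; the sentence and candidate labels are made up for illustration, and the model name assumes the pretrained "tars-base" checkpoint:

from flair.models import TARSClassifier
from flair.data import Sentence

tars = TARSClassifier.load("tars-base")           # pretrained zero-shot classifier
sentence = Sentence("India announced new trade tariffs on Monday.")
tars.predict_zero_shot(sentence, ["Country/Location", "Person", "Organization"])
print(sentence.labels)                            # predicted buckets with confidence scores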
You probably need http://uima.apache.org/d/uimacpp-2.4.0/docs/Python.html/ plus a CoNLL-U parser attached to it: https://universaldependencies.org/format.html. With this approach, NER is based on a dictionary in a UIMA pipeline. You would need to develop some proprietary NER search/match algorithms (in Python or in another supported language).

Using LDA (topic model): the distribution of each topic over words is similar and "flat"

Latent Dirichlet Allocation (LDA) is a topic model for finding the latent variables (topics) underlying a collection of documents. I'm using the Python gensim package and have two problems:
I printed out the most frequent words for each topic (I tried 10, 20, and 50 topics) and found that the distribution over words is very "flat": even the most frequent word has only about a 1% probability.
Most of the topics are similar: the most frequent words of the different topics overlap a lot, and the topics share almost the same set of high-frequency words.
I guess the problem is probably due to my documents: they all belong to a specific category; for example, they are all documents introducing different online games. In my case, will LDA still work? Since the documents themselves are quite similar, a model based on bag-of-words may not be a good approach.
Could anyone give me some suggestions? Thank you!
I've found NMF to perform better when a corpus is smaller and more focused on a particular topic. In a corpus of ~250 documents all discussing the same issue, NMF was able to pull out 7 distinct, coherent topics. This has also been reported by other researchers:
"Another advantage that is particularly useful for the appli- cation
presented in this paper is that NMF is capable of identifying niche
topics that tend to be under-reported in traditional LDA approaches" (p.6)
Greene & Cross, Exploring the Political Agenda of the European Parliament Using a Dynamic Topic Modeling Approach, PDF
Unfortunately, Gensim doesn't have an implementation of NMF, but it is available in scikit-learn. To work effectively, you need to feed NMF TF-IDF-weighted word vectors rather than the raw frequency counts you would use with LDA.
If you're used to Gensim and have preprocessed everything that way, Gensim has some utilities to convert a corpus to scikit-learn-compatible structures. However, I think it would actually be simpler to just use scikit-learn throughout. There is a good example of using NMF here.
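A minimal sketch of that scikit-learn route (TF-IDF followed by NMF); the documents and topic count below are toy placeholders:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

docs = ["game servers were down for maintenance",
        "the new game patch changes character balance",
        "players reported login issues after the update"]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)                 # TF-IDF weighted document vectors

nmf = NMF(n_components=2, random_state=0).fit(X)
terms = vectorizer.get_feature_names_out()
for i, topic in enumerate(nmf.components_):
    top = [terms[j] for j in topic.argsort()[-5:][::-1]]   # top 5 words for this topic
    print(f"Topic {i}: {', '.join(top)}")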

Can I apply LDA (latent Dirichlet allocation) to a corpus in a different language?

I am trying to analyze a text corpus obtained from a Turkish virtual community website to examine the user-generated content during protests. Specifically, I plan to apply LDA to determine topics. I haven't used LDA before and I don't know exactly whether it is applicable in a different language setting.
Thanks
Yes, I see no reason why not. It might be hard to find out-of-the-box solutions for some preprocessing steps, but apparently it has been done before: http://ieeexplore.ieee.org/xpl/articleDetails.jsp?reload=true&arnumber=6830499
