I am creating a medical web app that takes audio input, converts it to text, and extracts keywords from that text, which are then fed into an ML model. We have the text, but the problem is that a person might say, "I have pain in my chest and legs", while the features in our model are chest_pain or leg_pain.
How do we map the different phrasings used by the user onto the features our model expects? Our basic approach would be to tokenize the text and then use NLTK to look up synonyms of each word, trying out multiple phrasings to match the ones we currently have, but that would take far too much time.
Is it possible to do this task using basic NLP?
Maybe an improvement of your first idea (a rough sketch follows these steps):
Split your keywords (chest_pain → ["chest", "pain"])
Find synonyms only of your keywords ([["chest", "thorax", ...], ["pain", "suffer", ...]])
For each word of the sentence, check whether it appears in your keyword synonym lists.
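A rough sketch of those steps, assuming NLTK's WordNet for the synonym lookup and the feature names from the question (chest_pain, leg_pain); it's a starting point under those assumptions, not a full solution:

import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download("wordnet", quiet=True)
nltk.download("punkt", quiet=True)

FEATURES = ["chest_pain", "leg_pain"]  # your model's feature names

def synonyms(word):
    # The word itself plus its WordNet synonyms, lowercased.
    syns = {word.lower()}
    for synset in wordnet.synsets(word):
        for lemma in synset.lemmas():
            syns.add(lemma.name().replace("_", " ").lower())
    return syns

def match_features(sentence, features=FEATURES):
    # A feature counts as present if every part of it (or a synonym) occurs in the sentence.
    lemmatizer = WordNetLemmatizer()
    tokens = {lemmatizer.lemmatize(t.lower()) for t in word_tokenize(sentence)}
    matched = []
    for feature in features:
        parts = feature.split("_")  # chest_pain -> ["chest", "pain"]
        if all(tokens & synonyms(part) for part in parts):
            matched.append(feature)
    return matched

print(match_features("I have pain in my chest and legs"))
# roughly expected: ['chest_pain', 'leg_pain']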
I want to implement document text processing with OCR in my Flutter app. Most of the OCRs that I have seen can only read text from images but cannot organize just the important information that is actually needed, e.g. name, last name, date of birth, genre, etc. They can only read the whole thing. I discovered this page called "Nanonets" which does exactly what I need: you train the AI with images indicating only the data that you want, and it works really well. The problem is that I cannot afford the pro plan, so I was wondering if there is an alternative way to create something similar on my own, maybe with TensorFlow or another tool.
Here's the page if you wanna take a look to see what I mean: https://nanonets.com/
In my opinion, you can't organize OCR text in a structured way without a trained AI model, and most AI model API services are paid unless you train your own model.
Another option is to clean your OCR text using natural language processing (NLP). However, it won't be as accurate as a trained AI model.
Apply regex to find emails, contact numbers, and other pattern-based data that are easy to identify, remove those from the original string, and then apply the NLP steps yourself to get a quick result.
A few NLP steps and how they work (a rough sketch follows the list):
Sentence tokenization - dividing a string of written language into its component sentences (the string is split at sentence boundaries, e.g. punctuation marks).
Word tokenization - dividing each sentence into its component words, to clean the string.
Stop-word removal - stop words are filtered out before or after processing to improve the output (removes low-content words like "and", "the", "a").
Then apply further steps such as lemmatization and stemming, another regex pass to clean the text, and representations like bag-of-words or TF-IDF.
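A minimal sketch of the regex-plus-NLP cleanup described above, assuming Python with NLTK; the regex patterns and field names are illustrative, not a complete extractor:

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s-]{7,}\d")

def clean_ocr_text(raw_text):
    # 1. Pull out pattern-based fields with regex first.
    emails = EMAIL_RE.findall(raw_text)
    phones = PHONE_RE.findall(raw_text)
    remainder = PHONE_RE.sub(" ", EMAIL_RE.sub(" ", raw_text))

    # 2. Sentence + word tokenization, stop-word removal, lemmatization.
    lemmatizer = WordNetLemmatizer()
    stops = set(stopwords.words("english"))
    tokens = []
    for sentence in sent_tokenize(remainder):
        for word in word_tokenize(sentence):
            if word.isalpha() and word.lower() not in stops:
                tokens.append(lemmatizer.lemmatize(word.lower()))

    return {"emails": emails, "phones": phones, "tokens": tokens}

print(clean_ocr_text("John Doe  DOB 01/02/1990  john.doe@example.com  +1 555 123 4567"))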
For accurate results there are paid AI models and services; check out this link, which you can use. They provide AI services like scanning business cards, scanning documents, etc.
I'm trying to train my NLTK model to recognize movie names (ex. "game of thrones")
I have a text file where each line is a movie name.
How do I train my NLTK model to recognize these movie names if it sees it in a sentence during tokenization?
I searched around but found no resources. Any help is appreciated
It sounds like you are talking about training a named entity recognition (NER) model for movie names. To train an NER model in the traditional way, you'll need more than a list of movie names - you'll need a tagged corpus that might look something like the following (based on the 'dataset format' here):
I PRP O
like VBP O
the DT O
movie NN O
Game NN B-MOV
of IN I-MOV
Thrones NN I-MOV
. Punc O
but going on for a very long time (say, a minimum of 10,000 words, to give enough examples of movie names in running text). Each word is followed by its part-of-speech (POS) tag and then its NER tag. B-MOV indicates that 'Game' is the start of a movie name, and I-MOV indicates that 'of' and 'Thrones' are 'inside' a movie name. (By the way, isn't Game of Thrones a TV series as opposed to a movie? I'm just reusing your example anyway...)
How would you create this dataset? By annotating by hand. It is a laborious process, but this is how state-of-the-art NER systems are trained, because whether or not something should be detected as a movie name depends on the context in which it appears. For example, there is a movie called 'Aliens', but the same word 'Aliens' is a movie title in the second sentence below and not in the first.
Aliens are hypothetical beings from other planets.
I went to see Aliens last week.
Tools like doccano exist to aid the annotation process. The dataset to be annotated should be selected depending on the final use case. For example, if you want to be able to find movie names in news articles, use a corpus of news articles. If you want to be able to find movie names in emails, use emails. If you want to be able to find movie names in any type of text, use a corpus with a wide range of text types.
This is a good place to get started if you decide to stick with training an NER model using NLTK, although some of the answers here suggest other libraries you might want to use, such as spaCy.
Alternatively, if the whole tagging process sounds like too much work and you just want to use your list of movie names, look at fuzzy string matching. In this case, I don't think NLTK is the library to use as I'm not aware of any fuzzy string matching features in NLTK. You might instead use fuzzysearch as per the answer here.
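If you go the fuzzy-matching route, here is a minimal sketch using the fuzzysearch package, with a small illustrative movie list standing in for your file of names:

# pip install fuzzysearch
from fuzzysearch import find_near_matches

# Illustrative list; in practice, read one movie name per line from your file.
movie_names = ["game of thrones", "aliens", "the godfather"]

sentence = "I stayed up all night watching Game of Throns again."

for name in movie_names:
    # max_l_dist is the allowed Levenshtein distance (typos, small variations).
    for m in find_near_matches(name, sentence.lower(), max_l_dist=1):
        print(f"Found '{name}' at characters {m.start}-{m.end}: {sentence[m.start:m.end]!r}")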
I've been experimenting with NLP and am using the Doc2Vec model.
My objective is a suggested-question feature for a forum. For example, if a user types a question, its vector is compared to the vectors of questions already asked. So far this has worked OK for comparing one asked question to another.
However, I would like to extend this to comparing the body of the question. For example, just like on Stack Overflow, I'm writing a description for my question.
I understand that Doc2Vec represents sentences through paragraph IDs. So for the question example I mentioned first, each sentence gets a unique paragraph ID. However, with paragraphs, i.e. the body of the question, sentences will share the same ID as the other sentences that are part of the same paragraph.
para = 'This is a sentence. This is another sentence'
[TaggedDocument(words=['This', 'is', 'a', 'sentence'], tags=[1]),
 TaggedDocument(words=['This', 'is', 'another', 'sentence'], tags=[1])]
I'm wondering how to go about doing this. How can I input a corpus like so:
['It is a nice day today. I wish I was outside in the sun. But I need to work.']
and compare that to another paragraph like this:
['It is a lovely day today. The sun is shining outside today. However, I am working.']
In which case I would expect a very close similarity between the two. Does similarity get calculated sentence to sentence, rather than paragraph to paragraph? I.e.
cosine_sim(['It is a nice day today.'], ['It is a lovely day today.'])
and do this for the other sentences and average out the similarity scores?
Thanks.
EDIT
What I am confused about is using the above sentences, say the vectors are like so
sent1 = [0.23,0.1,0.33...n]
sent2 = [0.78,0.2,-0.6...n]
sent3 = [0.55,-0.5,0.9...n]
# Average out these vectors
para = [0.5,0.2,0.3...n]
and then compare that vector to another paragraph's vector, obtained by the same process.
I'll presume you're talking about the Doc2Vec model in the Python Gensim library, based on the word2vec-like 'Paragraph Vector' algorithm. (There are many alternative ways to turn a text into a vector, and sometimes other approaches, including the very simple one of averaging word-vectors together, get called 'Doc2Vec' as well.)
Doc2Vec has no internal idea of sentences or paragraphs. It just considers texts: lists of word tokens. So you decide what-sized chunks of text to provide, and to associate with tag keys: multiword fragments, sentences, paragraphs, sections, chapters, articles, books, whatever.
Every tag you provide during initial bulk training will have an associated vector trained up, and stored in the model, based on the lists-of-words that you provided alongside it. So, you can retrieve those vectors from training via:
d2v_model.dv[tag]
You can also use that trained, frozen model to infer new vectors for new lists-of-words:
d2v_model.infer_vector(list_of_words)
(Note: these words should be preprocessed/tokenized the same way as those during training, and any words not known to the model from training will be silently ignored.)
And, once you have vectors for two different texts, from whatever method, you can compare them via cosine-similarity.
For creating your doc-vectors, you might want to run the question & body together into one text. (If the question is more important, you could even consider training on a pseudotext that repeats the question more than once, for example both before and after the body.) Or you might want to treat them separately, so that some downstream process can weight question->question similarities differently than body->body or question->body. What's best for your data & goals usually has to be determined via experimentation.
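A minimal sketch of that workflow with Gensim (the 4.x API is assumed, and the tiny toy corpus and parameters are only illustrative):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess

# Toy corpus: each post (question & body run together) becomes one tagged text.
posts = [
    "It is a nice day today. I wish I was outside in the sun. But I need to work.",
    "It is a lovely day today. The sun is shining outside today. However, I am working.",
    "How do I train a Doc2Vec model in Gensim?",
]
corpus = [TaggedDocument(words=simple_preprocess(text), tags=[i])
          for i, text in enumerate(posts)]

# Tiny parameters for the sketch; real data needs more dimensions, texts, and epochs.
model = Doc2Vec(vector_size=50, min_count=1, epochs=40)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

v0 = model.dv[0]  # vector learned during training for tag 0

# Infer a vector for a new, unseen text (tokenized the same way as the training data):
new_vec = model.infer_vector(simple_preprocess("It is a beautiful day but I am stuck working."))

# Cosine similarity between the new text and every trained tag:
print(model.dv.most_similar([new_vec], topn=3))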
I have ~138k records of user feedback that I'd like to analyze to understand broad patterns in what our users are most often saying. Each one has a rating between 1-5 stars, so I don't need to do any sort of sentiment analysis. I'm mostly interested in splitting the dataset into >=4 stars to see what we're doing well and <= 3 stars to see what we need to improve upon.
One key problem I'm running into is that I expect to see a lot of n-grams. Some of these I know, like "HOV lane", "carpool lane", "detour time", "out of my way", etc. But I also want to detect common bi- and tri-grams programmatically. I've been playing around with spaCy a bit, but it doesn't seem to have any capability to do analysis at the corpus level, only at the document level.
Ideally my pipeline would look something like this (I think):
Import a list of known n-grams into the tokenizer
Process each string into a tokenized document, removing punctuation, stopwords, etc., while respecting the known n-grams during tokenization (i.e., "HOV lane" should be a single noun token)
Identify the most common bi- and tri-grams in the corpus that I missed
Re-tokenize using the found n-grams
Split by rating (>=4 and <=3)
Find the most common topics for each split of data in the corpus
I can't seem to find a single tool, or even a collection of tools, that will allow me to do what I want here. Am I approaching this the wrong way somehow? Any pointers on how to get started would be greatly appreciated!
Bingo! State-of-the-art results for your problem!
It's called zero-shot learning.
State-of-the-art NLP models for text classification without annotated data.
For Code and details read the blog - https://joeddav.github.io/blog/2020/05/29/ZSL.html
Let me know if it works for you, or if you need any other help.
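For reference, a minimal sketch of zero-shot classification with the Hugging Face transformers pipeline, along the lines of that blog post; the candidate topic labels here are just guesses for this feedback use case:

# pip install transformers torch
from transformers import pipeline

# Zero-shot classifier; no annotated feedback data required.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

feedback = "The detour added 20 minutes and sent me way out of my way."
candidate_topics = ["routing quality", "travel time", "app usability", "map accuracy"]

result = classifier(feedback, candidate_labels=candidate_topics, multi_label=True)
for label, score in zip(result["labels"], result["scores"]):
    print(f"{label}: {score:.2f}")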
The VADER tool works well for sentiment analysis and NLP-based applications.
I think the proposed workflow is fine for this case study. Work closely on your feature extraction, as it matters a lot.
Most of the time, tri-grams make good sense for these use cases.
Using spaCy would be a good decision: compared with regular expressions, spaCy's rule-based match engine and components not only help you find the terms and sentences you are searching for, but also give you access to the tokens inside a text and their relationships.
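To make that concrete, here is a rough sketch of merging known n-grams into single tokens with spaCy's PhraseMatcher and surfacing new frequent bigrams with Gensim's Phrases model; the phrase list and thresholds are illustrative only:

# pip install spacy gensim && python -m spacy download en_core_web_sm
import spacy
from spacy.matcher import PhraseMatcher
from spacy.util import filter_spans
from gensim.models.phrases import Phrases, Phraser

nlp = spacy.load("en_core_web_sm")

# 1. Known n-grams: merge them into single tokens while tokenizing.
known_ngrams = ["HOV lane", "carpool lane", "detour time", "out of my way"]
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("KNOWN_NGRAMS", [nlp.make_doc(p) for p in known_ngrams])

def tokenize(text):
    doc = nlp(text)
    spans = filter_spans([doc[start:end] for _, start, end in matcher(doc)])
    with doc.retokenize() as retok:
        for span in spans:
            # "HOV lane" becomes a single token whose lemma is the full phrase.
            retok.merge(span, attrs={"LEMMA": span.text.lower()})
    return [t.lemma_.lower() for t in doc
            if not (t.is_stop or t.is_punct or t.is_space)]

feedback = [
    "The HOV lane was blocked and the detour time was crazy",
    "Sent me way out of my way because the carpool lane was closed",
]
tokenized = [tokenize(t) for t in feedback]

# 2. Find frequent bigrams you didn't already know about (raise thresholds on real data).
bigram_model = Phraser(Phrases(tokenized, min_count=1, threshold=1))
print([bigram_model[tokens] for tokens in tokenized])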
I have a text file with millions of rows that I want to convert into word vectors, so that later I can compare those vectors with a search keyword and see which texts are closest to it.
My dilemma is that all the training files I have seen for word2vec are in the form of paragraphs, so each word has some contextual meaning within the file. My file, however, has independent rows, each containing a different keyword.
My question is whether it is possible to create word embeddings using this text file, and if not, what the best approach is for matching a search keyword against these millions of texts.
**My file structure:**
Walmart
Home Depot
Home Depot
Sears
Walmart
Sams Club
GreenMile
Walgreen
Expected
Search text: 'WAL'
Result from My File:
WALGREEN
WALMART
WALMART
Embeddings
Let's step back and understand what word2vec is. Word2vec (like GloVe, FastText, etc.) is a way to represent words as vectors. ML models don't understand words, only numbers, so when we deal with words we want to convert them into numbers (vectors). One-hot encoding is one naive way of encoding words as vectors, but for a large vocabulary the one-hot vectors become too long, and there is no semantic relationship between one-hot encoded words.
With deep learning came distributed representations of words (called word embeddings). One important property of these word embeddings is that the vector distance between related words is small compared to the distance between unrelated words, i.e. distance(apple, orange) < distance(apple, cat).
So how are these embedding models trained? They are trained on (very) large corpora of text. When you have a huge corpus of text, the model will see that apple and orange are used (many times) in the same context and will learn that they are related. So to train a good embedding model you need a huge corpus of text, not independent words, because independent words have no context.
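For illustration, a minimal sketch of training word2vec on running text with Gensim; the toy sentences stand in for the large corpus such a model actually needs:

from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

# Toy corpus of running text; a real model needs many millions of words of context.
sentences = [
    simple_preprocess("I ate an apple and an orange for breakfast"),
    simple_preprocess("The apple tree and the orange tree grow in the garden"),
    simple_preprocess("My cat sleeps on the sofa all day"),
]

model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, epochs=50)

# Related words should end up closer than unrelated ones (only roughly, on a corpus this tiny).
print(model.wv.similarity("apple", "orange"))
print(model.wv.similarity("apple", "cat"))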
However, one rarely trains a word embedding model from scratch, because good embedding models are available as open source. If your text is domain-specific (say, medical), you can instead do transfer learning on openly available word embeddings.
Out of vocabulary (OOV) words
Word embeddings like word2vec and GloVe cannot return an embedding for OOV words. However, embeddings like FastText (thanks to #gojom for pointing it out) handle OOV words by breaking them into character n-grams and building a vector by summing the subword vectors that make up the word.
Problem
Coming to your problem,
Case 1: say the user enters the word WAL. First of all, it is not a valid English word, so it will not be in the vocabulary, and it is hard to assign a meaningful vector to it. Embeddings like FastText handle this by breaking the word into n-grams; this approach gives reasonable embeddings for misspelled words or slang.
Case 2: say the user enters the word WALL. If you plan to use vector similarity to find the closest word, it will never be close to Walmart, because semantically they are not related; it will rather be close to words like window, paint, and door.
Conclusion
If your search is for semantically similar words, then a solution using vector embeddings will work well. On the other hand, if your search is based on spelling (lexical matching), then vector embeddings will be of no help.
If you wanted to find walmart from a fragment like wal, you'd more likely use something like the following (a tiny sketch of the first option appears after this list):
a substring or prefix search through all entries; or
a reverse-index-of-character-n-grams; or
some sort of edit-distance calculated against all entries or a subset of likely candidates
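A tiny sketch of the first option in plain Python, over the example entries from the question (a real million-row file would call for an index rather than a linear scan):

entries = ["Walmart", "Home Depot", "Home Depot", "Sears",
           "Walmart", "Sams Club", "GreenMile", "Walgreen"]

def prefix_search(query, names):
    # Return every entry that starts with the query, case-insensitively.
    q = query.lower()
    return [name for name in names if name.lower().startswith(q)]

print(prefix_search("WAL", entries))
# ['Walmart', 'Walmart', 'Walgreen']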
That is, from your example desired output, this is not really a job for word-vectors, even though some algorithms, like FastText, will be able to provide rough vectors for word-fragments based on their overlap with trained words.
If in fact you want to find similar stores, word-vectors might theoretically be useful. But the problem given your example input is that such word-vector algorithms require examples of tokens used in context, from sequences-of-tokens that co-appear in natural-language-like relationships. And you want lots of data featuring varied examples-in-context, to capture subtle gradations of mutual relationships.
While your existing single-column of short entity-names (stores) can't provide that, maybe you have something applicable elsewhere, if you have richer data sources. Some ideas might be:
lists of stores visited by a single customer
lists of stores carrying the same product/UPC
text from a much larger corpus (such as web-crawled text, or maybe Wikipedia) in which there are sufficient in-context usages of each store-name. (You'd just throw out all the other words created from such training - but the vectors for your tokens-of-interest might still be of use in your domain.)