I would like to create a model that is given a series of keywords extracted from the description about a company and classifies the 'type' of the company. Let me illustrate with an example.
"Snapchat is an image messaging and multimedia mobile application created by Evan Spiegel, Bobby Murphy, and Reggie Brown,[3] former students at Stanford University, and developed by Snap Inc., originally Snapchat Inc. "
Sample Extracted Keywords: "image messaging" ; "multimedia mobile application"
(from Wikipedia page on Snapchat)
Given this info, my model will need to infer 'IT' and 'SNS' from "image messaging" and "multimedia mobile application".
(In case you are wondering why I don't just go with the extracted keywords: I would like to categorize all companies into as few labels as possible, so 'IT' and 'SNS' are more general terms than 'image messaging' and the like.)
Currently, my dataset is not too big: a few hundred entries, of which roughly 80% contain information in the form I want. Given this, I would like to process the keywords extracted from the company descriptions and assign them correct labels.
Any suggestions to aid me in this project would be great.
If you are targeting companies in specific domains, then even a small dataset may help you. One approach you could follow:
Use pre-trained word embeddings (e.g. from GloVe) of the extracted keywords and derive an embedding for each company. It is like constructing a phrase or sentence representation from word embeddings. Let's call them company embeddings! Companies of a similar type should have similar embeddings, so the ultimate idea is to form relationships like Google - Ford = Microsoft - Tesla, as we see with word embeddings. You can even think of other interesting arithmetic relations using embeddings, for example Google = search engine + youtube + android, where the right-hand-side terms are extracted keywords.
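Here is a minimal sketch of that idea, assuming gensim's downloader for the GloVe vectors and simple averaging over keyword tokens (the keyword lists below are made up for illustration):

```python
# Build "company embeddings" by averaging GloVe vectors of keyword tokens.
import numpy as np
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-100")  # pre-trained 100-d GloVe vectors

def company_embedding(keywords):
    # Collect all in-vocabulary tokens across the extracted keywords;
    # assumes at least one token is in the GloVe vocabulary.
    tokens = [t for kw in keywords for t in kw.lower().split() if t in glove]
    return np.mean([glove[t] for t in tokens], axis=0)

snapchat = company_embedding(["image messaging", "multimedia mobile application"])
netflix = company_embedding(["video streaming", "online entertainment"])

# Cosine similarity: companies of a similar type should score close to 1.
cos = np.dot(snapchat, netflix) / (np.linalg.norm(snapchat) * np.linalg.norm(netflix))
print(cos)
```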
You need company type labels for the final classification, but that part should be very simple with any machine learning classifier. You could accomplish your overall goal with a plain text classifier, but it would be interesting to achieve it using these NLP techniques.
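Continuing the sketch above, the classification step could be as simple as logistic regression over the company embeddings (scikit-learn; the two training examples are the toy vectors from the previous snippet):

```python
# Feed the company embeddings into a simple classifier.
from sklearn.linear_model import LogisticRegression

X = [snapchat, netflix]        # company embeddings from the sketch above
y = ["SNS", "Entertainment"]   # your general labels

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict([company_embedding(["photo sharing", "mobile app"])]))
```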
I want to implement document-text-processing OCR in my Flutter app. Most of the OCRs that I have seen can only read text from images but cannot organize just the important information that is actually needed, e.g. name, last name, date of birth, gender, etc.; they can only read the whole thing. I discovered a page called "Nanonets" which does exactly what I need: you train the AI with images, indicating only the data that you want, and it works really well. The problem is that I cannot afford the pro plan, so I was wondering if there is an alternative way to create something similar on my own, maybe with TensorFlow or another tool.
Here's the page if you wanna take a look to see what I mean: https://nanonets.com/
In my opinion, you can't handle OCR text in an organized manner without trained AI models, and most AI model API services are paid unless you train your own models.
Another way is to clean your OCR text data by applying natural language processing (NLP). However, it's not as accurate as a trained AI model.
Apply regex first to find emails, contact numbers, or other pattern-based data that regex can easily identify, eliminate those from your actual string, and then apply the NLP steps yourself to get quick output.
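For example, a rough sketch with Python's built-in re module (the patterns are illustrative, not exhaustive):

```python
# Strip pattern-based fields from OCR text with regex before further NLP.
import re

ocr_text = "John Doe john.doe@example.com +49 123 123 123 born 01.02.1990"

email_re = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
phone_re = re.compile(r"\+?\d[\d\s/-]{6,}\d")
date_re = re.compile(r"\b\d{1,2}[./-]\d{1,2}[./-]\d{2,4}\b")

emails = email_re.findall(ocr_text)
phones = phone_re.findall(ocr_text)
dates = date_re.findall(ocr_text)

# Remove the matched spans so downstream NLP only sees the remaining text.
rest = ocr_text
for rx in (email_re, phone_re, date_re):
    rest = rx.sub(" ", rest)

print(emails, phones, dates)
print(rest)
```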
A few NLP terms/rules and how they work:
Sentence tokenization - dividing a string of written language into its component sentences (the string is split at punctuation marks / sentence boundaries).
Word tokenization - dividing a sentence into its component words, to clean the string.
Stop words - words that are filtered out before or after processing to get more accurate output (remove irrelevant words like "and", "the", "a").
Then apply other NLP techniques such as lemmatization and stemming, clean the text again with regex, and use bag-of-words, TF-IDF, etc.
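A minimal NLTK sketch of the first three steps (assumes the "punkt" and "stopwords" resources have been downloaded once via nltk.download):

```python
# Sentence tokenization, word tokenization, and stop-word removal with NLTK.
import nltk
from nltk.corpus import stopwords

text = "The scanned card lists a name and an address. Call for details."

sentences = nltk.sent_tokenize(text)                 # sentence tokenization
words = [nltk.word_tokenize(s) for s in sentences]   # word tokenization

stop = set(stopwords.words("english"))
cleaned = [[w for w in ws if w.lower() not in stop and w.isalpha()]
           for ws in words]
print(cleaned)
```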
There are also paid AI models and services you can use for accurate results; they provide AI services like scanning business cards, scanning documents, etc.
I am very new to machine learning, and what I am genuinely looking for here is some direction. I have a dataset which has the following columns:
Name
Description of the company
Category it belongs to (tag)
e.g. Netflix | Netflix is an online platform that enables users to watch TV shows and movies on smart TVs, PCs, Macs, mobiles, tablets, and so on. | Digital Entertainment, Media and Entertainment, TV, Video, Video Streaming
I have thousands of such data about various companies. Is there a way to use this dataset to automatically generate tags when a new company name and company description is added?
I would really appreciate the name of the concept or some direction here.
You want a tag (analogous to an industry classification), right? With the latest technology, you would train (or rather fine-tune) a transformer model to predict the tag from the description text; start with DistilBERT and move on to a stronger model like DeBERTa. Take a look at the text classification example here: https://huggingface.co/transformers/examples.html
It will not generate new tags; it simply classifies a new company into the existing categories, if that's what you want.
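A minimal fine-tuning sketch with Hugging Face Transformers, assuming a single tag per company (the texts, labels, and tag ids below are toy placeholders for your dataset):

```python
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Toy data: texts are company descriptions, labels are integer tag ids.
texts = ["Netflix is an online platform for watching TV shows and movies.",
         "Snapchat is an image messaging and multimedia mobile application."]
labels = [0, 1]  # e.g. 0 = "Video Streaming", 1 = "Mobile Apps"
NUM_TAGS = 2

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=NUM_TAGS)

class TagDataset(torch.utils.data.Dataset):
    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding=True)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="tag-model", num_train_epochs=3),
    train_dataset=TagDataset(texts, labels),
)
trainer.train()
```

If a company can carry several tags at once (as in your Netflix example), the same setup works with problem_type="multi_label_classification" and multi-hot label vectors.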
For a new project I need to extract information from web pages, more precisely imprint (legal notice) information. I use brat to label the documents and have started first experiments with spaCy and NER. There are many videos and tutorials about this, but some basic questions remain.
Is it possible to include the context of an entity?
Example text:
Responsible for the content:
The Good Company GmbH 0331 Berlin
You can contact us via +49 123 123 123.
This website was created by good design GmbH, contact +49 12314 453 5.
Well, spaCy is very good at extracting the phone numbers. According to my latest tests, the error rate is less than two percent. I was able to achieve this after only 250 labeled documents; in the meantime I have labeled 450, and my goal is about 5,000 documents.
Now to the actual point: only the phone numbers shown in the context of the sentence "Responsible for the content" are relevant; the other phone numbers are not.
I could imagine training these introductory sentences as entities, because they are always somewhat similar. But how can I capture the context? Are there perhaps existing NER-based models that do just that?
Maybe someone has already come across hints about this somewhere? As a beginner the hurdle is relatively high, because the material is really deep (a little play on words).
Greetings from Germany!
If I understand your question and use-case correctly, I would advise the following approach:
Train/design some system that recognizes all phone numbers - it looks like you've already got that
Train a text classifier to recognize the "responsible for content" sentences.
Implement some heuristics (probably rule-based?) to determine whether or not a recognized phone number is connected to one of the predicted "responsible for content" sentences, using straightforward features such as the number of sentences in between, taking the first phone number after the sentence, etc.
So basically I would advise solving each NLP challenge separately and then connecting the information throughout the document; a rough sketch of the final linking step follows below.
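This is what the rule-based linking in step 3 might look like, assuming the sentence indices from the classifier and the phone entities from the NER model are already available (all values below are hypothetical):

```python
# Link each phone number to a nearby "responsible for content" sentence.
sentences = [
    "Responsible for the content:",  # flagged by the sentence classifier
    "The Good Company GmbH 0331 Berlin",
    "You can contact us via +49 123 123 123.",
    "This website was created by good design GmbH, contact +49 12314 453 5.",
]
trigger_idx = [0]                  # sentence indices flagged as triggers
phones = {2: "+49 123 123 123",    # sentence index -> phone entity from NER
          3: "+49 12314 453 5"}

MAX_DISTANCE = 2  # heuristic: at most 2 sentences after a trigger

relevant = [p for i, p in phones.items()
            if any(0 <= i - t <= MAX_DISTANCE for t in trigger_idx)]
print(relevant)  # ['+49 123 123 123']
```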
I am trying to learn how to perform Named Entity Recognition.
I have a set of discharge summaries containing medical information about patients. I converted my unstructured data into structured data. Now, I have a DataFrame that looks like this:
Text | Target
normal coronary arteries... R060
The Text column contains information about the diagnosis of a patient, and the Target column contains the code that will need to be predicted in a further task.
I have also constructed a dictionary that looks like this:
Code (Key) | Term (Value)
A00 Cholera
This dictionary maps each code to information about the corresponding diagnosis. The Term column will be used to identify the clinical entities in the corpus.
I will need to train a classifier and predict the code in order to automate the process of assigning codes to the discharge summaries (I am explaining this so you have an idea of the task I'm performing).
So far I have converted my data into structured form. I am trying to understand how I should perform Named Entity Recognition to label the medical terminology. I would like to try direct matching and fuzzy matching, but I am not sure what the preliminary steps are. Should I perform tokenizing, stemming, and lemmatizing first? Or should I first find the medical terminology, since clinical named entities are often multi-token terms with nested structures that include other named entities inside them? Also, which Python packages or tools would you recommend?
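For the direct and fuzzy matching, this is roughly what I have in mind (just a sketch using the standard library's difflib; the dictionary values are toy examples, not my real codes):

```python
# Direct and fuzzy matching of dictionary terms against the text.
import difflib

term_by_code = {"A00": "Cholera", "R060": "Dyspnea"}  # toy code -> term pairs
text_tokens = "patient presented with dyspnea and normal coronary arteries".split()

for code, term in term_by_code.items():
    # Direct match: the term appears verbatim among the tokens.
    if term.lower() in text_tokens:
        print("exact:", code, term)
    # Fuzzy match: the closest token above a similarity cutoff.
    close = difflib.get_close_matches(term.lower(), text_tokens, n=1, cutoff=0.8)
    if close:
        print("fuzzy:", code, term, "~", close[0])
```

(I know this only handles single-token terms; the multi-token entities are exactly what I'm unsure about.)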
I am new in this field so any help will be appreciated! Thanks!
If you are asking about building a classification model, then you should consider deep learning; it is highly effective for classification.
For this type of language processing task, I recommend first tokenizing your text and padding the sequences. Basic tokenization should be enough, but further preprocessing such as basic string cleaning can improve your model's accuracy by 3-4%. For basic string processing, you can use regex (the built-in re package) in Python.
https://docs.python.org/3/library/re.html
I assume you are doing the mapping after preprocessing. Mapping should be enough for a task like classification, but I recommend learning about word embeddings; they will improve your model.
For all these tasks, I recommend TensorFlow. TensorFlow is a popular tool for machine learning, language processing, image processing, and much more. You can learn natural language processing from the official TensorFlow documentation; all the learning material is in the tutorials section.
https://www.tensorflow.org/tutorials/
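A small sketch of the tokenize-and-pad step in TensorFlow/Keras, using the TextVectorization layer (the example texts are placeholders for your discharge summaries):

```python
# Tokenization plus padding with a single Keras preprocessing layer.
import tensorflow as tf

texts = ["normal coronary arteries", "acute respiratory distress"]

vectorize = tf.keras.layers.TextVectorization(
    max_tokens=10000,           # vocabulary size
    output_sequence_length=20,  # pad/truncate every text to 20 tokens
)
vectorize.adapt(tf.constant(texts))  # build the vocabulary from the corpus

padded = vectorize(tf.constant(texts))
print(padded.numpy())  # integer ids, zero-padded to length 20
```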
I think this will help you. All the best for your work!
Thank you.
Is there an efficient way to extract sub-topic explanations from a review using Python and the NLTK library? As an example, a user review of a mobile phone could be "This phone's battery is good but display is a bullshit".
I want to extract the above two features, like
"Battery is good"
"display is a bullshit"
The purpose of the above is that I am going to develop a rating system for products with respect to the features of the product.
The polarity analysis part is done.
But extracting the features of a review is difficult for me. I found a way to extract features using POS tag patterns with regular expressions like
<NN.?><VB.?>?<JJ.?>
using this pattern as a sub-topic. But the problem is that there could be lots of patterns in a review, depending on how users phrase their descriptions.
Is there any way to solve my problem efficiently?
Thank you !!
The question you posed is multi-faceted and not straightforward to answer.
Conceptually, you may want to go through the following steps:
Identify the names of the features of phones (and maybe create an ontology based on these features).
Create lists of synonyms for the feature names (and similarly for evaluative phrases, e.g. nice, bad, sucks, etc.).
Use one of NLTK's taggers to parse the reviews.
Create rules for extraction of features and their evaluation (the Information Extraction part). I am not sure if NLTK can directly support you with this, but see the sketch after this list.
Evaluate and refine the approach.
Or: create a larger annotated corpus and train a deep learning model on it using TensorFlow, Theano, or a similar framework.
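For steps 3 and 4, a minimal sketch with NLTK's tagger and RegexpParser, reusing the chunk pattern from your question (assumes the "punkt" and "averaged_perceptron_tagger" resources have been downloaded once):

```python
# Rule-based feature extraction: POS-tag the review, then chunk it with
# the <NN.?><VB.?>?<JJ.?> pattern from the question.
import nltk

review = "This phone's battery is good but display is a bullshit"

tokens = nltk.word_tokenize(review)
tagged = nltk.pos_tag(tokens)

grammar = "FEATURE: {<NN.?><VB.?>?<JJ.?>}"  # noun, optional verb, adjective
parser = nltk.RegexpParser(grammar)
tree = parser.parse(tagged)

for subtree in tree.subtrees(filter=lambda t: t.label() == "FEATURE"):
    print(" ".join(word for word, tag in subtree.leaves()))
# e.g. "battery is good" (exact output depends on the tagger's decisions)
```

The second clause ("display is a bullshit") will not match this particular pattern because of the intervening determiner, which illustrates your point that a single pattern is not enough; you would extend the grammar with more rules.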