Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
I am researching Natural Language Processing algorithms that can read a piece of text and, if the text seems to suggest a meeting request, set up that meeting for you automatically.
For example, if an email text reads:
"Let's meet tomorrow someplace in Downtown at 7pm."
The algorithm should be able to detect the time, date, and place of the event.
Does anyone know of existing NLP algorithms that I could use for this purpose? I have been researching NLP resources (like NLTK and some tools in R), but did not have much success.
Thanks
This is an application of information extraction, and can be solved more specifically with sequence segmentation algorithms like hidden Markov models (HMMs) or conditional random fields (CRFs).
For a software implementation, you might want to start with the MALLET toolkit from UMass Amherst; it's a popular library that implements CRFs for information extraction.
You would treat each token in a sentence as something to be labeled with the fields you are interested in (or 'x' for none of the above), as a function of word features (like part of speech, capitalization, dictionary membership, etc.)... something like this:
token label features
-----------------------------------
Let x POS=NNP, capitalized
's x POS=POS
meet x POS=VBP
tomorrow DATE POS=NN, inDateDictionary
someplace x POS=NN
in x POS=IN
Downtown LOCATION POS=NN, capitalized
at x POS=IN
7pm TIME POS=CD, matchesTimeRegex
. x POS=.
You will need to provide some hand-labeled training data first, though.
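Since the rest of this thread leans on Python, here is a minimal, hedged sketch of the same idea using the sklearn-crfsuite package instead of MALLET. The package choice, the feature function, and the single hand-labeled sentence are illustrative assumptions, not part of the original answer:
# Minimal CRF sketch with sklearn-crfsuite (pip install sklearn-crfsuite).
# The feature function and the one hand-labeled sentence are toy examples only.
import sklearn_crfsuite

def token_features(sent, i):
    # sent is a list of (token, POS, label) triples
    word = sent[i][0]
    return {
        'word.lower()': word.lower(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        'postag': sent[i][1],
    }

train_sent = [
    ("Let", "NNP", "x"), ("'s", "POS", "x"), ("meet", "VBP", "x"),
    ("tomorrow", "NN", "DATE"), ("someplace", "NN", "x"), ("in", "IN", "x"),
    ("Downtown", "NN", "LOCATION"), ("at", "IN", "x"), ("7pm", "CD", "TIME"),
    (".", ".", "x"),
]

X_train = [[token_features(train_sent, i) for i in range(len(train_sent))]]
y_train = [[label for _, _, label in train_sent]]

crf = sklearn_crfsuite.CRF(algorithm='lbfgs', max_iterations=100)
crf.fit(X_train, y_train)
# with real data you would label many sentences and predict on held-out ones
print(crf.predict(X_train))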
You should also have a look at the Apache OpenNLP Java toolkit: http://opennlp.apache.org
I think you should be able to do this with spacy.
I tried this in a Jupyter notebook:
import spacy
from spacy import displacy  # displacy is needed for the visualisation below

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'Over the last quarter in 2018-12-02 Apple sold nearly 20 thousand iPods for a profit of $6 million.')
displacy.render(doc, style='ent', jupyter=True)
Output
Over the last quarter [DATE] in 2018-12-02 [DATE] Apple [ORG] sold nearly 20 thousand [CARDINAL] iPods [PRODUCT] for a profit of $6 million [MONEY].
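For the meeting-request use case from the question, the same pipeline exposes the entities programmatically rather than just visually. A small sketch follows; which labels actually fire depends on the model, so treat it as a starting point:
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u"Let's meet tomorrow someplace in Downtown at 7pm.")

# keep only date/time/place candidates for building the meeting
for ent in doc.ents:
    if ent.label_ in ('DATE', 'TIME', 'GPE', 'LOC', 'FAC'):
        print(ent.text, ent.label_)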
This problem is still relevant today. If you are (still) looking for an algorithm, there are solutions using ANTLR, a parser generator with targets for many programming languages (C/C++/C#/JS/Java/...).
Some open-source references:
http://natty.joestelmach.com/doc.jsp
https://blog.dgunia.de/2017/08/18/using-antlr-to-parse-date-ranges-in-java-and-kotlin/
https://github.com/eib/java-date-parser
Related
Closed. This question needs to be more focused. It is not currently accepting answers.
I tagged a dataset of texts with independent categories. When running a CNN classifier in Keras, I receive an accuracy of > 90%.
My texts are customer reviews such as "I really liked the camera of this phone." Classes are, e.g., "phone camera", "memory", etc.
What I am looking for is whether I can tag the sentences with the categories that appear in them while the classifier marks the entities that indicate the class. More specifically: how can I extract those parts of the input sentence that made a CNN in Keras opt (i.e. classify) for one or more categories?
My pipeline (in general) for a similar task.
I don't use a NN to solve the whole task
First, I don't use NNs directly to label separate entities like "camera", "screen", etc. There are some good approaches which might be useful, like pointer networks or just attention, but they just didn't work in my case.
I guess these architectures don't work well because there is a lot of noise in my dataset, e.g. "I'm so glad I bought this TV" and the like. Roughly 75% of it is noise, and the rest of the data is not so clean either.
Because of this, I do some additional actions:
Split sentences into chunks (sometimes they contain the desired entities)
Label these chunks by hand as "non-useful" (e.g. "I'm so happy/so upset", etc.) or useful ("good camera", "bad phone", etc.)
Train classifier(s) to classify this data.
Details about the pipeline
How to "recognize" entities
I just used regexps and part-of-speech tags to split my data. But I work with a Russian-language dataset, and there is no good free syntax parser/library for Russian. If you work with English or another language that is well supported by the spaCy or NLTK libraries, you can use those for parsing to separate entities. Also, English grammar is quite strict compared to Russian, which probably makes your task easier.
Anyway, try to start with regexes and parsing.
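For an English dataset, a rough version of that regex-plus-POS starting point could look like the sketch below with NLTK. The chunk grammar is an illustrative guess, not the one used in this answer:
import nltk

# one-time downloads: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
text = "I really liked the camera of this phone, but the battery life is bad."
tagged = nltk.pos_tag(nltk.word_tokenize(text))

# toy grammar: an optional run of adjectives followed by one or more nouns
grammar = "CHUNK: {<JJ>*<NN.*>+}"
chunker = nltk.RegexpParser(grammar)
tree = chunker.parse(tagged)

for subtree in tree.subtrees(filter=lambda t: t.label() == 'CHUNK'):
    print(' '.join(word for word, tag in subtree.leaves()))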
Vocabularies with keywords for topics like "camera", "battery", ... are very helpful, too.
Another approach to recognizing entities is topic modelling - PLSA/LDA (gensim rocks) - but it's hard to tune, IMO, because there is a lot of noise in the texts. You'll get a lot of topics like {"happy", "glad", "bought", "family", ...} and so on, but you can try topic modelling anyway.
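If you do try topic modelling, a bare-bones LDA setup in gensim looks roughly like this (the tiny token lists and parameter values are placeholders; in practice you would lemmatize and drop stopwords first):
from gensim import corpora, models

# toy tokenized review chunks
texts = [
    ["good", "camera", "nice", "photos"],
    ["battery", "dies", "fast", "bad", "battery"],
    ["so", "happy", "bought", "phone", "family"],
]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
for topic_id, words in lda.print_topics():
    print(topic_id, words)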
You can also create a dataset with entity labels for each text and train a NN with attention, so you can recognize entities by high attention weights, but creating such a dataset is very tedious.
Create the dataset and train NNs
I start creating the dataset only when I've got acceptable quality for the "named entities", because if you change this foundational part later, you will probably have to throw away the dataset and start from scratch.
It's better to decide once which labels you will use and then not change them - it's a critical part of the work.
Training NNs on such data is probably the easiest part of the work - just use any good classifier, as for whole texts. Even simpler, non-NN classifiers might be useful - use blending, bagging, etc.
Possible troubles
There's a trap: some reviews/features are not so obvious for a NN classifier, or even for a human, like "loud sound" or "gets very hot". They are often context-dependent. So I use a little help from our team to mark the dataset - each entry is labeled by a group of humans to get better quality. I also use context labels - the category of the product - adding a context to each entity: "loud sound" carries opposite sentiment for an audio system versus a washing machine, and the model can learn this. In most cases, category labels are easily accessible through databases or web parsing.
Hope it helps, also I hope someone knows a better approach.
Closed. This question needs details or clarity. It is not currently accepting answers.
Let's say that I have n documents (resumes) in my list, and I want to weight each document (resume) of the same category against a Job description.txt as reference. I want to weight the documents as described below. My question: is there any other approach to weighting the documents in this kind of scenario? Thanks in advance.
Plan of action :
a) get resumes (eg. 10) related to same category (eg. java)
b) get bag of words from all the docs
for each document:
c) get feature names by using TF-IDF vectorizer scores
d) now I have a list of feature words
e) now compare these features against the "Job Description" bag of words
f) now compute a score for the document by adding up the columns, and weight the document accordingly
What I understood from the question is that you are looking to grade the resumes (documents) by seeing how similar they are to the job description document. One approach that can be used is to convert all documents to a TFIDF matrix including the job description. Each document can be seen as a vector in the word space. Once you have created the TFIDF matrix, you can calculate the similarity between two documents using cosine similarity.
There are additional things that you should do, like removing stopwords, lemmatizing and encoding. Additionally, you may also want to make use of n-grams.
You can refer to this book as well for more information.
EDIT:
Adding some setup code
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import string
import spacy
nlp = spacy.load('en_core_web_sm')
# to remove punctuations
translator = str.maketrans('', '', string.punctuation)
# some sample documents
resumes = ["Executive Administrative Assistant with over 10 years of experience providing thorough and skillful support to senior executives.",
"Experienced Administrative Assistant, successful in project management and systems administration.",
"10 years of administrative experience in educational settings; particular skill in establishing rapport with people from diverse backgrounds.",
"Ten years as an administrative support professional in a corporation that provides confidential case work.",
"A highly organized and detail-oriented Executive Assistant with over 15 years' experience providing thorough and skillful administrative support to senior executives.",
"More than 20 years as a knowledgeable and effective psychologist working with individuals, groups, and facilities, with particular emphasis on geriatrics and the multiple psychopathologies within that population.",
"Ten years as a sales professional with management experience in the fashion industry.",
"More than 6 years as a librarian, with 15 years' experience as an active participant in school-related events and support organizations.",
"Energetic sales professional with a knack for matching customers with optimal products and services to meet their specific needs. Consistently received excellent feedback from customers.",
"More than six years of senior software engineering experience, with strong analytical skills and a broad range of computer expertise.",
"Software Developer/Programmer with history of productivity and successful project outcomes."]
job_doc = ["""Executive Administrative with a knack for matching and effective psychologist with particular emphasis on geriatrics"""]
# combine the two
_all = resumes+job_doc
# convert each to spacy document
docs= [nlp(document) for document in _all]
# lemmatize words, remove stopwords, remove punctuation
docs_pp = [' '.join(token.lemma_.translate(translator) for token in doc if not token.is_stop) for doc in docs]
# get tfidf matrix
tfidf_vec = TfidfVectorizer()
tfidf_matrix = tfidf_vec.fit_transform(docs_pp).toarray()
# calculate similarity
cosine_similarity(tfidf_matrix[-1:], tfidf_matrix[:-1])
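To turn those similarity scores into the per-resume weighting the question asks about, you can sort the resumes by their similarity to the job description. A small continuation of the snippet above, reusing the variables defined there:
# rank resumes by similarity to the job description (the last row of the matrix)
scores = cosine_similarity(tfidf_matrix[-1:], tfidf_matrix[:-1]).ravel()
for idx in scores.argsort()[::-1]:
    print(round(float(scores[idx]), 3), resumes[idx])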
Closed. This question needs debugging details. It is not currently accepting answers.
What I'm trying to do is to ask the user to input a company name, for example Microsoft, and be able to predict that it is in the Computer Software industry. I have around 150 000 names and 60+ industries. Some of the names are not English company names.
I have tried training a Word2Vec model using Gensim based on company names only and averaged the word vectors before feeding them into scikit-learn's logistic regression, but had terrible results. My questions are:
Has anyone tried this kind of task? Googling short-text classification shows me results on classifying short sentences rather than pure names. If anyone has tried this before, would you mind sharing a few keywords or research papers regarding this task?
Would it be better if I had a brief description of each company instead of only using their names? How much would that help my Word2Vec model compared with using only the company names?
Your problem is essentially a company-to-industry relationship, so you should train your word2vec model on company description data, because word2vec works by finding words similar to a given word. If you train on company names only, that will give you bad results; if you train on the descriptions, you will get words related to each particular industry, and from that you can determine which industry a company belongs to.
If you want to train based on company names, a NER (Named Entity Recognizer) can be useful, but this will not be very accurate.
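As a hedged illustration of the description-based route suggested above, here is a rough gensim Word2Vec sketch with averaged document vectors. The toy descriptions and labels are made up, and the parameter names assume gensim >= 4:
import numpy as np
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression

# hypothetical toy data: tokenized company descriptions and their industries
descriptions = [
    "develops operating systems cloud services and productivity software".split(),
    "designs smartphones laptops and consumer electronics".split(),
    "operates fast food restaurants serving burgers and fries".split(),
    "runs a chain of coffee shops and roasts coffee beans".split(),
]
industries = ["software", "electronics", "food", "food"]

w2v = Word2Vec(descriptions, vector_size=50, window=5, min_count=1, epochs=50)

def doc_vector(tokens):
    # average the vectors of the tokens the model knows about
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.wv.vector_size)

X = np.vstack([doc_vector(d) for d in descriptions])
clf = LogisticRegression(max_iter=1000).fit(X, industries)
# with real data you would use thousands of descriptions, not four
print(clf.predict([doc_vector("sells espresso and pastries in coffee shops".split())]))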
Not sure what you want.
If the point is to use just company names, maybe break names into syllables/phonemes and train on that data (see the sketch after the next point).
If the point is to use Word2Vec, I'd recommend pulling the Wikipedia page for each company (easier to automate than an 'about me').
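For the names-only route in the first point above, character n-grams are a cheap stand-in for syllables/phonemes. A hedged scikit-learn sketch, with made-up toy names and industry labels:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# hypothetical toy data: company names and their industries
names = ["Microsoft", "Oracle", "Pfizer", "Novartis", "Goldman Sachs", "Barclays"]
industries = ["software", "software", "pharma", "pharma", "finance", "finance"]

# character n-grams (2-4 chars, within word boundaries) instead of whole words
model = make_pipeline(
    TfidfVectorizer(analyzer='char_wb', ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
model.fit(names, industries)
# a made-up name; with only six training names, take the output with a grain of salt
print(model.predict(["NovaBank Capital"]))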
Closed. This question needs to be more focused. It is not currently accepting answers.
Update: How would one approach the task of classifying any text on public forums (such as games) or blogs, so that derogatory comments/texts are filtered before being posted?
Original: "
I want to filter out adult content from tweets (or any text for that matter).
For spam detection, we have datasets that check whether a particular text is spam or ham.
For adult content, I found a dataset I want to use (extract below):
arrBad = [
'acrotomophilia',
'anal',
'anilingus',
'anus',
# ... etc.
'zoophilia']
Question
How can I use that dataset to filter text instances?
"
I would approach this as a Text Classification problem, because using blacklists of words typically does not work very well to classify full texts. The main reason why blacklists don't work is that you will have a lot of false positives (one example: your list contains the word 'sexy', which alone isn't enough to flag a document as being for adults). To do so you need a training set with documents tagged as being "adult content" and others "safe for work". So here is what I would do:
Check whether an existing labelled dataset can be used. You need several thousand documents of each class.
If you don't find any, create one. For instance you can create a scraper and download Reddit content. Read for instance Text Classification of NSFW Reddit Posts
Build a text classifier with NLTK. If you don't know how, read: Learning to Classify Text
This can be treated as a binary text classification problem. You should collect documents that contain 'adult content' as well as documents that do not ('universal'). It may happen that a word/phrase you have included in the list arrBad is present in a 'universal' document, for example 'girl on top' in the sentence 'She wanted to be the first girl on top of Mt. Everest.' You need to get a count vector of the number of times each word/phrase occurs in 'adult-content' documents and in 'universal' documents.
I'd suggest you consider using algorithms like Naive Bayes (which should work fairly well in your case). However, if you want to capture the context in which each phrase is used, you could consider the Support Vector Machines algorithm as well (but that would involve tweaking a lot of complex parameters).
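A minimal sketch of that count-vector plus Naive Bayes pipeline with scikit-learn; the four placeholder documents stand in for a real labelled corpus:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# placeholder labelled data: 1 = adult content, 0 = safe for work
texts = [
    "explicit adult description here ...",
    "she wanted to be the first girl on top of Mt. Everest",
    "another explicit adult description ...",
    "review of a great hiking trip in the mountains",
]
labels = [1, 0, 1, 0]

clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
clf.fit(texts, labels)
# with real training data, context rather than single keywords drives this prediction
print(clf.predict(["she climbed to the top of the mountain"]))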
You may be interested in something like TextRazor. By using their API you would be able to classify the input text.
For example, you can choose to remove all input texts that come with categories or keywords you don't want.
I think you rather need to explore filtering algorithms: study their usage, how multi-pattern searching works, and how you can use some of those algorithms (their implementations are freely available online, so it is not hard to find an existing implementation and customize it for your needs). Some pointers:
Check how the grep family of algorithms works, especially the bitap algorithm and the Wu-Manber implementation for fgrep. Depending upon how accurate you want to be, it may require adding some fuzzy-logic handling (think about why people type 'fukc' instead of 'fuck', right?).
You may find the Bloom filter interesting, since it won't produce any false negatives against your data set; the downside is that it may return false positives.
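To make the Bloom filter suggestion concrete, here is a toy sketch. The sizes and hashing scheme are arbitrary; a real setup would size the bit array from the expected vocabulary and the target false-positive rate:
import hashlib

class BloomFilter:
    def __init__(self, size=10_000, num_hashes=4):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, word):
        # derive several hash positions from salted MD5 digests
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{word}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, word):
        for pos in self._positions(word):
            self.bits[pos] = True

    def might_contain(self, word):
        # False means definitely absent; True means "probably present"
        return all(self.bits[pos] for pos in self._positions(word))

bf = BloomFilter()
for word in ['acrotomophilia', 'anal', 'anilingus', 'anus', 'zoophilia']:
    bf.add(word)
print(bf.might_contain('anus'))      # True
print(bf.might_contain('everest'))   # almost certainly False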
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
I am working on a project wherein I have to extract the following information from a set of articles (the articles could be on anything):
People: find the names of any people present, like "Barack Obama"
Topic: related tags for the article, like "Parliament", "World Energy"
Company/Organisation: I should be able to obtain the names of any companies or organisations mentioned, like "Apple" or "Google"
Is there an NLP framework/library of this sort available in Python which would help me accomplish this task?
#sel and #3kt gave really good answers. OP, you are looking for entity extraction, commonly referred to as Named Entity Recognition (NER). There are many APIs that perform this. But the first question you need to ask yourself is:
What is the structure of my DATA? or rather,
Are my sentences good English sentences?
That is, figure out whether the data you are working with is consistently grammatically correct, well capitalized, and well structured. These factors are paramount when it comes to extracting entities. The data I worked with were tweets. ABSOLUTE NIGHTMARE!! I performed a detailed analysis of the performance of various APIs on entity extraction, and I shall share with you what I found.
Here are APIs that perform fabulous entity extraction:
NLTK has a handy reference book which talks in depth about its functions with multiple examples. NLTK does not perform well on noisy data (tweets) because it has been trained on structured data. NLTK is absolute garbage for badly capitalized words (e.g. DUCK, Verb, CHAIR). Moreover, it is slightly less precise compared to other APIs. It is great for structured or curated data from news articles and scholarly reports, and it is a great learning tool for beginners.
Alchemy is simpler to implement and performs very well at categorizing named entities. It has great precision compared to the other APIs I have mentioned. However, it has a certain transaction cost: you can only perform 1000 queries a day! It identifies Twitter handles and can handle awkward capitalization.
IMHO the spaCy API is probably the best. It's open source. It outperforms the Alchemy API but is not as precise; it categorizes entities almost as well as Alchemy.
Choosing an API should be a simple problem for you now that you know how each API is likely to behave on the data you have.
EXTRA -
POLYGLOT is yet another API.
Here is a blog post that performs entity extraction in NLTK.
There is a beautiful paper by Alan Ritter that might go over your head, but it is the standard for entity extraction (particularly in noisy data) at a professional level. You could refer to it every now and then to understand complex concepts like LDA or SVM for capitalisation.
What you are actually looking for is called 'Named Entity Recognition' (NER) in the literature.
You might like to take a look at this tutorial:
http://textminingonline.com/how-to-use-stanford-named-entity-recognizer-ner-in-python-nltk-and-other-programming-languages
One easy way of partially solving this problem is using regular expressions to extract words matching the patterns you can find in this paper for extracting people's names. This might of course extend to extracting all the categories you are looking for, i.e. the topics and the company names as well.
There is also an API you can use, called Alchemy, that actually gives the results you are looking for. Unfortunately, no documentation is available to explain the method they use to extract the topics or the people's names.
Hope this helps.
You should take a look at NLTK.
Finding names and companies can be achieved by tagging the recovered text and extracting proper nouns (tagged NNP). Finding the topic is a bit more tricky, and may require some machine learning on a given set of articles.
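As a small illustration of that tagging approach, NLTK's off-the-shelf tagger and NE chunker can be combined like this (the example sentence is made up, and the listed downloads are needed once):
import nltk

# one-time downloads:
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
# nltk.download('maxent_ne_chunker'); nltk.download('words')
text = "Barack Obama met executives from Apple and Google to discuss world energy."
tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text)))

# named entities come back as labelled subtrees
for subtree in tree.subtrees():
    if subtree.label() in ('PERSON', 'ORGANIZATION', 'GPE'):
        print(subtree.label(), ' '.join(word for word, tag in subtree.leaves()))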
Also, since we're talking about articles, I recommend the newspaper module, which can recover articles from their URLs and do some basic NLP operations (summary, keywords).