calculate document weight using machine learning [closed]

Let's say that I have n documents (resumes) in my list, and I want to weigh each resume of the same category against Job description.txt as a reference. I want to weigh the documents as described below. My question: is there any other approach to weighing documents in this kind of scenario? Thanks in advance.
Plan of action:
a) get resumes (e.g. 10) related to the same category (e.g. java)
b) get the bag of words from all the docs
then, for each document:
c) get the feature names by using TfidfVectorizer scores
d) now I have the featured words in a list
e) compare these features against the "Job Description" bag of words
f) compute the score for the document by adding the columns, and weigh the document

What I understood from the question is that you are looking to grade the resumes (documents) by how similar they are to the job description document. One approach is to convert all documents, including the job description, to a TF-IDF matrix. Each document can then be seen as a vector in the word space. Once you have created the TF-IDF matrix, you can calculate the similarity between two documents using cosine similarity.
There are additional things that you should do, like removing stopwords, lemmatizing and encoding. Additionally, you may also want to make use of n-grams.
You can refer to this book as well for more information.
EDIT:
Adding some setup code
import string
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# load the small English pipeline (the 'en' shortcut no longer works in spaCy 3)
nlp = spacy.load('en_core_web_sm')
# translation table used to strip punctuation
translator = str.maketrans('', '', string.punctuation)
# some sample documents
resumes = ["Executive Administrative Assistant with over 10 years of experience providing thorough and skillful support to senior executives.",
"Experienced Administrative Assistant, successful in project management and systems administration.",
"10 years of administrative experience in educational settings; particular skill in establishing rapport with people from diverse backgrounds.",
"Ten years as an administrative support professional in a corporation that provides confidential case work.",
"A highly organized and detail-oriented Executive Assistant with over 15 years' experience providing thorough and skillful administrative support to senior executives.",
"More than 20 years as a knowledgeable and effective psychologist working with individuals, groups, and facilities, with particular emphasis on geriatrics and the multiple psychopathologies within that population.",
"Ten years as a sales professional with management experience in the fashion industry.",
"More than 6 years as a librarian, with 15 years' experience as an active participant in school-related events and support organizations.",
"Energetic sales professional with a knack for matching customers with optimal products and services to meet their specific needs. Consistently received excellent feedback from customers.",
"More than six years of senior software engineering experience, with strong analytical skills and a broad range of computer expertise.",
"Software Developer/Programmer with history of productivity and successful project outcomes."]
job_doc = ["""Executive Administrative with a knack for matching and effective psychologist with particular emphasis on geriatrics"""]
# combine the two, keeping the job description last
_all = resumes + job_doc
# convert each to a spaCy document
docs = [nlp(document) for document in _all]
# lemmatize words, remove stopwords, remove punctuation
docs_pp = [' '.join(token.lemma_.translate(translator) for token in doc if not token.is_stop) for doc in docs]
# get the TF-IDF matrix (kept sparse; cosine_similarity accepts sparse input)
tfidf_vec = TfidfVectorizer()
tfidf_matrix = tfidf_vec.fit_transform(docs_pp)
# similarity of the job description (last row) to every resume
cosine_similarity(tfidf_matrix[-1], tfidf_matrix[:-1])
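As a variation on the setup code above, here is a minimal sketch that stays within scikit-learn: it swaps the spaCy lemmatization for the vectorizer's built-in English stop-word list and word bigrams, then sorts the resumes by score. The sample texts and the rank_resumes helper are made up for illustration and are not part of the original answer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def rank_resumes(resumes, job_description):
    """Return (resume index, cosine similarity) pairs, best match first."""
    # unigrams and bigrams, with common English stop words dropped
    vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
    # fit on resumes and job description together so they share one vocabulary
    matrix = vectorizer.fit_transform(resumes + [job_description])
    # last row is the job description; compare it with every resume row
    scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)


# hypothetical sample data
sample_resumes = [
    "Java developer with Spring experience",
    "Sales professional in the fashion industry",
    "Librarian active in school events",
]
ranking = rank_resumes(sample_resumes, "Java developer, Spring")
for index, score in ranking:
    print(index, round(float(score), 3))
```

A resume that shares no terms with the job description scores exactly 0, so with real data it may be worth checking how many resumes fall into that bucket before trusting the ranking.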

Related

How to use ML for text classification in Python? [closed]

I have two columns of data, each roughly 25k rows long. The first column contains a list of income statement line items and was created from OCR, so there are lots of errors in there. For example, there might be 20 line items for 'Income', but they might show as 'I ncome' or 'Imcome' or '...Incom', etc.
The second column contains a list of classifications that have been hand-coded so that line items can be categorized. For example, 'Miscellaneous Fees', 'Application Fees', 'Insurance Fees' would all be classified as 'Other Income'.
I'd like to train a model using my existing dataset to predict that 'I ncome' should be placed in the 'Income' category, 'Mscelaneous Fees' should be placed in the 'Other Income' category and so on.
My experience with ML is limited to the examples I've worked on in classes that all use continuous variables in the data sets, so I have practically zero experience working with text classification. I could convert the text categories to numerical values, but wouldn't be able to do so with the line items so I don't know that it would help me.
Can I accomplish this with sklearn? Pytorch? Tensorflow? Spark?
I'd really appreciate it if someone could point me in the right direction!

First you have to correct the words, because pre-trained TensorFlow and PyTorch models expect properly spelled words. For this you can use pyspellchecker or autocorrect in Python, for instance.
After that you will have to prepare the data (try nltk or spacy): normalize lower/upper case, remove punctuation and special characters, and maybe apply stemming and lemmatization. Then you tokenize the phrases with nltk.word_tokenize.
Only after that can you map the first column to embeddings, i.e. vectors that represent each word or sentence.
For the embeddings, try this option, as it is one of the fastest ones (choose the language in TF Hub):
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text # Imports TF ops for preprocessing.
BERT_MODEL = "https://tfhub.dev/tensorflow/bert_multi_cased_L-12_H-768_A-12/4"
# the preprocessing model has to match the encoder's vocabulary and casing
PREPROCESS_MODEL = "https://tfhub.dev/tensorflow/bert_multi_cased_preprocess/3"
sentences = [
"Here We Go Then, You And I is a 1999 album by Norwegian pop artist Morten Abel. It was Abel's second CD as a solo artist.",
"If it rains, it pours.", "The quick brown fox jumps over the lazy dog."]
preprocess = hub.load(PREPROCESS_MODEL)
bert = hub.load(BERT_MODEL)
inputs = preprocess(sentences)
outputs = bert(inputs)
# one fixed-size vector per sentence, usable as classifier features
embeddings = outputs["pooled_output"]
Then you map X (the first column, now many embedding columns) to Y (the second column, the classes). In this step you can use whatever classification algorithm you want: logistic regression, naive Bayes, SVM, decision trees, random forests, gradient boosting or even a neural network.
Also remember to turn your second column of classes into numeric codes, for example with dataframe['column_2'].astype('category').cat.codes (the .cat accessor only works on a categorical column).
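Since the question asks whether scikit-learn alone is enough, here is a minimal sketch of a different, simpler technique than the BERT embeddings above: TF-IDF over character n-grams feeding a logistic regression. Character n-grams tend to tolerate OCR noise, which is why the spell-correction step is skipped here; whether that holds on the real 25k rows is an assumption to verify. The six training rows are taken from the examples in the question, and scikit-learn classifiers accept string labels directly, so no numeric encoding is done in this sketch.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# tiny illustrative training set built from the question's examples
line_items = ["Income", "Imcome", "...Incom",
              "Miscellaneous Fees", "Application Fees", "Insurance Fees"]
categories = ["Income", "Income", "Income",
              "Other Income", "Other Income", "Other Income"]

model = make_pipeline(
    # character n-grams inside word boundaries, 2 to 4 characters long
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
model.fit(line_items, categories)

# noisy OCR strings that do not appear verbatim in the training data
predictions = model.predict(["I ncome", "Mscelaneous Fees"])
print(list(predictions))
```

With real data, hold out part of the hand-coded rows to measure accuracy before relying on the predictions.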

How to parse: in one week into a date? [duplicate]

I am researching some Natural Language Processing algorithms to read a piece of text, and if the text seems to be trying to suggest a meeting request, it sets up that meeting for you automatically.
For example, if an email text reads:
"Let's meet tomorrow someplace in Downtown at 7pm".
The algorithm should be able to detect the time, date and place of the event.
Does someone know of some already existing NLP algorithms that I could use for this purpose? I have been researching some NLP resources (like NLTK and some tools in R), but did not have much success.
Thanks

This is an application of information extraction, and can be solved more specifically with sequence segmentation algorithms like hidden Markov models (HMMs) or conditional random fields (CRFs).
For a software implementation, you might want to start with the MALLET toolkit from UMass-Amherst. It's a popular library that implements CRFs for information extraction.
You would treat each token in a sentence as something to be labeled with the fields you are interested in (or 'x' for none of the above), as a function of word features (like part of speech, capitalization, dictionary membership, etc.)... something like this:
token       label      features
------------------------------------------------
Let         x          POS=NNP, capitalized
's          x          POS=POS
meet        x          POS=VBP
tomorrow    DATE       POS=NN, inDateDictionary
someplace   x          POS=NN
in          x          POS=IN
Downtown    LOCATION   POS=NN, capitalized
at          x          POS=IN
7pm         TIME       POS=CD, matchesTimeRegex
.           x          POS=.
You will need to provide some hand-labeled training data first, though.
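The per-token features in the table above can be sketched in plain Python. This is only an illustration: the DATE_WORDS set, the TIME_PATTERN regex and the token_features helper are hypothetical names, the part-of-speech feature is left out because it needs a tagger, and the exact input format a CRF trainer expects depends on the toolkit.

```python
import re

# hypothetical date dictionary and time pattern, for illustration only
DATE_WORDS = {"today", "tomorrow", "tonight", "monday", "tuesday", "wednesday",
              "thursday", "friday", "saturday", "sunday"}
TIME_PATTERN = re.compile(r"^\d{1,2}(:\d{2})?(am|pm)$", re.IGNORECASE)


def token_features(tokens, i):
    """Build a feature dictionary for the token at position i."""
    token = tokens[i]
    return {
        "lower": token.lower(),
        "capitalized": token[:1].isupper(),
        "inDateDictionary": token.lower() in DATE_WORDS,
        "matchesTimeRegex": bool(TIME_PATTERN.match(token)),
        # neighbouring words give the sequence model some context
        "previous": tokens[i - 1].lower() if i > 0 else "<START>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
    }


tokens = ["Let", "'s", "meet", "tomorrow", "someplace", "in", "Downtown", "at", "7pm", "."]
features = [token_features(tokens, i) for i in range(len(tokens))]
for token, feats in zip(tokens, features):
    print(token, feats)
```

Each dictionary would be paired with its hand-assigned label (DATE, TIME, LOCATION or x) to form one training example per token.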

You should have a look at the Java toolkit at http://opennlp.apache.org

I think you should be able to do this with spaCy.
I tried this in a Jupyter notebook:
import spacy
from spacy import displacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('Over the last quarter in 2018-12-02 Apple sold nearly 20 thousand iPods for a profit of $6 million.')
# render the recognized entities inline in the notebook
displacy.render(doc, style='ent', jupyter=True)
Output
Over the last quarter DATE in 2018-12-02 DATE Apple ORG sold nearly 20 thousand CARDINAL iPods PRODUCT for a profit of $6 million MONEY .

This problem is still relevant today. If you are (still) looking for an algorithm, there are solutions using ANTLR, a parser generator available for a large choice of programming languages (C/C++/C#/JS/Java/...).
Some open-source references:
http://natty.joestelmach.com/doc.jsp
https://blog.dgunia.de/2017/08/18/using-antlr-to-parse-date-ranges-in-java-and-kotlin/
https://github.com/eib/java-date-parser

How to automatically identify citations of the same paper?

Consider 3 ways to cite the same paper:
cite1 = "Yoshua Bengio, Réjean Ducharme, Pascal Vincent and Christian Jauvin, A Neural Probabilistic Language Model (2003), in: Journal of Machine Learning Research, 3(1137--1155)"
cite2 = "Yoshua Bengio, Réjean Ducharme, Pascal Vincent, Christian Jauvin. (2003) A Neural Probabilistic Language Model"
cite3 = "Bengio Y, Ducharme R, Vincent P, Jauvin C. (2003) A Neural Probabilistic Language Model"
A simple way of automatically identifying citations of the same paper is to compute the similarity of those citations with the difflib module in the Python Standard Library:
from difflib import SequenceMatcher as smatch
def similar(x, y): return smatch(None, x.strip(), y.strip()).ratio()
similar(cite1, cite2) # 0.721
similar(cite1, cite3) # 0.553
similar(cite2, cite3) # 0.802
Unfortunately, the similarity metric ranges from 0.553 to 0.802, so it's not clear what threshold should be set. If the threshold is too low, citations of different papers could be mistaken for the same paper. But if the threshold is too high, we miss some citations of the same paper.
Are there better solutions?

It is important to consider what makes a citation unique.
Based on your example, it appears that the combination of authors, the title of the article, and the year it was published constitutes a unique citation.
This means that you can parse the names, then compare how close they are (because the third example lists the names differently). Parse the title, and it should match 100%. Parse the year, and it should also be a 100% match.

Apart from neural networks and NLP, which would be a rather ... complicated approach, I would approach this problem by preprocessing the data.
A few things you can do:
- Create short names: Yoshua Bengio => Bengio Y
- Normalize the names: Réjean Ducharme -> rejean ducharme
- Extract the author part of the string, the title part of the string, and the "leftovers". Calculate the similarity for each of the parts and average the result.
- Extract the year of the publication and make it a three-variable problem.
- Use additional metadata if available (paper field, citation index, etc.).
The above approach works if your problem is limited to these three bibliography types.
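A minimal sketch of the preprocessing idea, under simplifying assumptions: instead of splitting each citation into author, title and leftover parts, it strips accents, drops single-character tokens (so initials such as "Y" are ignored), requires the year in parentheses to match, and then checks how much of the shorter citation's vocabulary appears in the longer one. The helper names, the 0.9 threshold and the made-up "other" citation are all illustrative choices, not something from the answer.

```python
import re
import unicodedata


def strip_accents(text):
    """Turn 'Réjean' into 'Rejean' by removing combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def key_tokens(citation):
    """Lowercased, accent-free tokens longer than one character."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", strip_accents(citation).lower())
    return {tok for tok in cleaned.split() if len(tok) > 1}


def year_of(citation):
    """First four-digit number in parentheses, or None."""
    match = re.search(r"\((\d{4})\)", citation)
    return match.group(1) if match else None


def same_paper(a, b, threshold=0.9):
    """Years must agree; then compare token overlap relative to the shorter citation."""
    if year_of(a) != year_of(b):
        return False
    tokens_a, tokens_b = key_tokens(a), key_tokens(b)
    if not tokens_a or not tokens_b:
        return False
    overlap = len(tokens_a & tokens_b) / min(len(tokens_a), len(tokens_b))
    return overlap >= threshold


cite1 = "Yoshua Bengio, Réjean Ducharme, Pascal Vincent and Christian Jauvin, A Neural Probabilistic Language Model (2003), in: Journal of Machine Learning Research, 3(1137--1155)"
cite2 = "Yoshua Bengio, Réjean Ducharme, Pascal Vincent, Christian Jauvin. (2003) A Neural Probabilistic Language Model"
cite3 = "Bengio Y, Ducharme R, Vincent P, Jauvin C. (2003) A Neural Probabilistic Language Model"
# made-up citation, used only as a negative example
other = "Doe J, Roe R. (2003) A Hypothetical Paper on Language Models"

print(same_paper(cite1, cite2), same_paper(cite1, cite3), same_paper(cite2, cite3))
print(same_paper(cite3, other))
```

Because initials are discarded, this sketch cannot tell two authors apart when only their first names differ, so it is a starting point rather than a complete matcher.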
If you have large variations amongst the bibliography entries (e.g. you apply it to the entire Springer/IEEE database), you should look into machine learning approaches.
While I can't suggest a correct model off the top of my head, I remember this paper being somewhere close to your problem.
Amongst other approaches, if you have a large bibliography dataset, you can attempt unsupervised or semi-supervised approaches like word2vec/node2vec or k-means and see if the resulting similarity score is accurate enough for you.
A word of advice:
In some cases you have very similar paper names coming from the same research teams, or short names that are identical when the long ones differ: W. Xu can be either Wang Xu or Wei Xu, and both are transcribed to Xu W.
In other cases the same author appears under different spellings: Réjean Ducharme and Rejean Ducharme.
Paper titles can have variations: "Conference of awesome discoveries" and "Awesome discoveries, conference of".

How to improve word assignment to different topics in LDA

I am working on a language that is not English, and I have scraped the data from different sources. I have done my preprocessing: punctuation removal, stop-word removal and tokenization. Now I want to extract domain-specific lexicons. Let's say that I have data related to sports, entertainment, etc., and I want to extract words that are related to these particular fields, like cricket, and place them in topics that are closely related. I tried to use LDA for this, but I am not getting the correct clusters. Also, a word that belongs to one topic appears in other topics as well.
How can I improve my results?
from gensim import corpora, models

# UrduCorpusReader, wordlists and remove_urdu_stopwords are defined elsewhere in the asker's code (not shown)
# URDU STOP WORDS REMOVAL
doc_clean = []
stopwords_corpus = UrduCorpusReader('./data', ['stopwords-ur.txt'])
stopwords = stopwords_corpus.words()
# print(stopwords)
for infile in wordlists.fileids():
    words = wordlists.words(infile)
    # print(words)
    finalized_words = remove_urdu_stopwords(stopwords, words)
    # list.append() returns None, so its result is not assigned
    doc_clean.append(finalized_words)
    print("\n==== WITHOUT STOPWORDS ===========\n")
    print(finalized_words)
# making dictionary and corpus
dictionary = corpora.Dictionary(doc_clean)
# convert tokenized documents into a document-term matrix
matrx = [dictionary.doc2bow(text) for text in doc_clean]
# generate LDA model
lda = models.ldamodel.LdaModel(corpus=matrx, id2word=dictionary, num_topics=5, passes=10)
for top in lda.print_topics():
    print("\n===topics from files===\n")
    print(top)

LDA and its drawbacks: The idea of LDA is to uncover latent topics from your corpus. A drawback of this unsupervised machine learning approach, is that you will end up with topics that may be hard to interpret by humans. Another drawback is that you will most likely end up with some generic topics including words that appear in every document (like 'introduction', 'date', 'author' etc.). Thirdly, you will not be able to uncover latent topics that are simply not present enough. If you have only 1 article about cricket, it will not be recognised by the algorithm.
Why LDA doesn't fit your case:
You are searching for explicit topics like cricket and you want to learn something about cricket vocabulary, correct? However, LDA will output some topics and you need to recognise cricket vocabulary in order to determine that e.g. topic 5 is concerned with cricket. Oftentimes LDA will identify topics that are mixed with other, related topics. Keeping this in mind, there are three scenarios:
1. You don't know anything about cricket, but you are able to identify the topic that's concerned with cricket.
2. You are a cricket expert and already know the cricket vocabulary.
3. You don't know anything about cricket and are not able to identify the semantic topic that the LDA produced.
In the first case, you will have the problem that you are likely to associate words with cricket, that are actually not related to cricket, because you count on the LDA output to provide high-quality topics that are only concerned with cricket and no other related topics or generic terms. In the second case, you don't need the analysis in the first place, because you already know the cricket vocabulary! The third case is likely when you are relying on your computer to interpret the topics. However, in LDA you always rely on humans to give a semantic interpretation of the output.
So what to do: There's a paper called Targeted Topic Modeling for Focused Analysis (Wang 2016), which tries to identify which documents are concerned with a pre-defined topic (like cricket). If you have a list of topics for which you'd like to get some topic-specific vocabulary (cricket, basketball, romantic comedies, ..), a starting point could be to first identify relevant documents to then proceed and analyse the word-distributions of the documents related to a certain topic.
Note that perhaps there are completely different methods that will perform exactly what you're looking for. If you want to stay in the LDA-related literature, I'm relatively confident that the article I linked is your best shot.
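The last suggestion (first pick out the documents about a topic, then look at their word distribution) can be sketched in plain Python. To be clear, this is not the method from the Targeted Topic Modeling paper: it is a much cruder stand-in that selects documents containing a seed word and ranks words by how much more often they occur in those documents than elsewhere. The topic_vocabulary helper, the seed word and the English toy tokens are all illustrative assumptions.

```python
from collections import Counter


def topic_vocabulary(docs, seed_words, top_n=5):
    """Rank words by (count in seed-matching docs) minus (count in the other docs)."""
    seeds = set(seed_words)
    relevant = [doc for doc in docs if seeds & set(doc)]
    others = [doc for doc in docs if not seeds & set(doc)]
    in_counts = Counter(word for doc in relevant for word in doc)
    out_counts = Counter(word for doc in others for word in doc)
    # words that are also frequent outside the topic are pushed down
    scores = {word: count - out_counts[word] for word, count in in_counts.items()}
    # sort by score, breaking ties alphabetically so the output is stable
    return sorted(scores, key=lambda word: (-scores[word], word))[:top_n]


# toy tokenized documents standing in for the preprocessed corpus
docs = [
    ["cricket", "bat", "wicket", "match"],
    ["cricket", "wicket", "bowler", "match"],
    ["film", "actor", "match", "award"],
    ["film", "actor", "director"],
]
print(topic_vocabulary(docs, ["cricket"], top_n=3))
```

A generic word such as "match" still makes the list here but ranks below the cricket-specific terms, which is the effect the answer is after.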
Edit:
If this answer is useful to you, you may find my paper interesting, too. It takes a labeled dataset of academic economics papers (600+ possible labels) and tries various LDA flavours to get the best predictions on new academic papers. The repo contains my code, documentation and also the paper itself

Text Classification - Label Pre Process [closed]

I have a data set of 1M+ observations of customer interactions with a call center. The text is free text written by the representative taking the call. The text is not well formatted nor is it close to being grammatically correct (a lot of short hand). None of the free text has a label on the data as I do not know what labels to provide.
Given the size of the data, would a random sample of the data (large enough to give a high level of confidence) be a reasonable first step in determining what labels to create? Is it possible to avoid manually labeling 400+ random observations from the data, or is there no other method to pre-process the data in order to determine a good set of labels to use for classification?
Appreciate any help on the issue.

Text Pre-Processing:
Convert all text to lower case, tokenize into unigrams, remove all stop words, and use a stemmer to normalize each token to its base word.
There are 2 approaches I can think of for classifying the documents, a.k.a. the free text you spoke about. Each free text is a document:
1) Supervised classification: Take some time to randomly pick a few sample documents and assign them a category. Do this until you have multiple documents per category and all categories that you want to predict are covered.
Next, create a Tf-Idf matrix from this text. Select the top K features (tune the value of K to get the best results). Alternatively, you can use SVD to reduce the number of features by combining correlated features into one. Please bear in mind that you can also use other features as predictors, like the department of the customer service executive and many others. Now train a machine learning model and test it out.
2) Unsupervised learning: If you know how many categories you have in your output variable, you can use that number as the number of clusters you want to create. Use the Tf-Idf vector from above technique and create k clusters. Randomly pick a few documents from each cluster and decide which category the documents belong to. Supposing you picked 5 documents and noticed that they belong to the category "Wanting Refund". Label all documents in this cluster to "Wanting Refund". Do this for all the remaining clusters.
The advantage of unsupervised learning is that it saves you the pain of pre-classification and data preparation, but beware of unsupervised learning. The accuracy might not be as good as supervised learning.
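A minimal sketch of the unsupervised route, assuming scikit-learn: TF-IDF vectors, k-means with a chosen number of clusters, and a printout of a couple of notes per cluster so a person can read them and name the cluster. The six call-center notes and the choice of two clusters are made up for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# hypothetical shorthand notes written by call-center representatives
notes = [
    "cust wants refund money back",
    "refund money requested by cust",
    "wants money back refund",
    "password reset login issue",
    "login issue password reset needed",
    "reset password cannot login",
]

vectors = TfidfVectorizer().fit_transform(notes)
# n_clusters is the number of categories you expect; fixed seed for repeatable runs
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(vectors)

# group the notes by cluster so a few from each can be read and labeled by hand
clusters = {}
for note, label in zip(notes, labels):
    clusters.setdefault(int(label), []).append(note)
for label, members in sorted(clusters.items()):
    print(label, members[:2])
```

The cluster numbers themselves carry no meaning; the label such as "Wanting Refund" still has to come from a person reading the sampled notes.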
The 2 methods explained above are an abstract overview of what can be done. Now that you have an idea, read up more on the topics and use a tool like RapidMiner to achieve your task much faster.

Manual annotation is a good option since you have a very good idea of an ideal document corresponding to your label.
However, with the large dataset size, I would recommend that you fit an LDA model to the documents and look at the topics generated. This will give you a good idea of the labels that you can use for text classification.
You can also use LDA for text classification eventually, by finding representative documents for your labels and then finding the documents closest to them by a similarity metric (say, cosine).
Alternatively, once you have an idea of the labels, you can also assign them without any manual intervention using LDA, but then you will be restricted to unsupervised learning.
Hope this helps!
P.S. - Be sure to remove all the stopwords and use a stemmer to group together words of a similar kind (for example: managing, manage, management) at the pre-processing stage.
