I need to classify/categorize the various sentences in the job_experience section of n=630 job descriptions. I'm particularly interested in extracting the work experience and ability-related sentences, but I need to be able to keep them attached to the job_title they are associated with.
Current state of these job descriptions: many different ways of saying similar things (e.g., "Needs Microsoft Office skills." "Experience using Microsoft Word, PowerPoint." "Minimum 3 years experience in related field." "Minimum three years experience in similar role.").
In the future, we will need to condense these job description statements so that, for example, the same statement can be applied to multiple jobs and managers can select from a drop-down list of job experience statements.
So I would like to categorize these individual sentences so that we can begin condensing them and deciding on which statements will be used going forward.
I've been researching what I should do and would appreciate any suggestions on which approach will be the most efficient. I am familiar with R but use it mostly for data wrangling and visualization. LDA, k-means text clustering, feature identification... these are the things I'm finding in my research (mostly on scikit-learn.org), and mostly with application in Python.
Is Python best for this kind of thing? Can I use R?
Which algorithmic approach is best for a beginner?
I know this isn't magic - just looking for the best approach to this task.
My data looks like:
df <- data.frame(job_title = c("Recruiter","Recruiter","Recruiter","Recruiter",
"File Clerk","File Clerk",
"Learning & Org. Development Specialist","Learning & Org. Development Specialist","Learning & Org. Development Specialist","Learning & Org. Development Specialist",
"CNA","CNA","CNA"),
job_experience = c("Minimum 1 year experience in recruitment or related human resources function.",
"Proficient in Microsoft Office Applications.",
"High school diploma required.",
"Bachelors Degree in Human Resources or related field preferred.",
"High School diploma preferred.",
"Ability to use relevant computer systems.",
"Bachelors Degree in related field (e.g., Human Resources, Education, Organizational Development).",
"Minimum 2 years experience applying L&OD principles and practices in an organizational setting.",
"Previous work experience in Human Resources preferred.",
"Experience with a learning management system (LMS).",
"High school diploma or GED equivalent.",
"Certified Nursing Assistant, certified by the Virginia Board of Health Professions.",
"CPR certification required at date of hire."))
My goal is to have a dataset like this (new column = job_exp_category):
job_title job_experience job_exp_category
"Recruiter" "Minimum 1 year experience in recruitment..." "Work experience"
"Recruiter" "Proficient in Microsoft Office Applicati..." "Skill/Ability"
"Recruiter" "High school diploma required." "Degree"
... ... ...
"CNA" "Certified Nursing Assistant, certificati..." "Certification/License"
"CNA" "CPR certification required at date of hire." "Certification/License"
Thank you for any insight SO community.
In case someone sees this post and has a similar need, here is what I (OP) ended up doing:
Following the content in this link, I used a combination of supervised learning (random forest) to classify the job description statements into four categories (degree, work experience, certification/license, and ksa) and unsupervised learning (kmeans cluster analysis) to bucket work experience statements that used similar words (e.g., cluster 1 = statements that referenced Microsoft office products).
The general process included:
PHASE 1 (identify work experience-related job description statements):
hand-coding a sample of the job description statements into the appropriate category
converting my data set to a tidytext dataframe that would prepare it for analysis
creating a training data set from the hand-coded job description statements and their associated categories, and a test data set containing the to-be-categorized job description statements
estimating a random forest model (supervised learning) using the caret package [details: method = "ranger", out-of-bag resampling, number of trees = 200]. My OOB prediction error was 2.96%.
using predict() to predict the job description categories for the remaining data (i.e., the test data set)
PHASE 2 (classify work experience statements into related buckets):
being comfortable with my prediction error for this task, I filtered to include only the job description statements that fell under 'work experience'
after cleaning the data set to avoid clustering on unimportant words (e.g., "preferred"), I used kmeans() cluster analysis (k = 200) to bucket the work experience statements into clusters based on the words used.
At this point, we're still ultimately deciding what the final work experience descriptions/categorizations will be, but this process is now more efficient with the trimmings gone and a bit of a head start on how to categorize appropriately.
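In case it helps anyone weighing the Python route I asked about above, here is a rough scikit-learn sketch of the same two-phase idea. My actual pipeline was built in R (tidytext + caret/ranger + kmeans); the file name, column names, and k below are purely illustrative.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans

# illustrative file/column names: job_title, job_experience, job_exp_category
df = pd.read_csv("job_descriptions.csv")
labeled = df[df["job_exp_category"].notna()]        # the hand-coded sample
unlabeled = df[df["job_exp_category"].isna()]       # still to be categorized

# Phase 1: supervised classification of statements into the four categories
vec = TfidfVectorizer(stop_words="english")
X_train = vec.fit_transform(labeled["job_experience"])
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=1)
rf.fit(X_train, labeled["job_exp_category"])
print("OOB accuracy:", rf.oob_score_)
df.loc[unlabeled.index, "job_exp_category"] = rf.predict(
    vec.transform(unlabeled["job_experience"]))

# Phase 2: unsupervised clustering of the work-experience statements
work = df[df["job_exp_category"] == "Work experience"]
km = KMeans(n_clusters=200, random_state=1)          # I used k = 200; tune for your own data
df.loc[work.index, "cluster"] = km.fit_predict(vec.transform(work["job_experience"]))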
I have a list of texts like this, and I want to count the number of lines in it. I thought of splitting it by ". ", but that will create extra lines for words like "a.m.". Please help!
[
'when people hear ai they often think about sentient robots and magic boxes. ai today is much more mundane and simple—but that doesn’t mean it’s not powerful. another misconception is that high-profile research projects can be applied directly to any business situation. ai done right can create an extreme return on investments (rois)—for instance through automation or precise prediction. but it does take thought, time, and proper implementation. we have seen that success and value generated by ai projects are increased when there is a grounded understanding and expectation of what the technology can deliver from the c-suite down.',
'“artificial intelligence (ai) is a science and a set of computational technologies that are inspired by—but typically operate quite differently from—the ways people use their nervous systems and bodies to sense, learn, reason and take action.”3 lately there has been a big rise in the day-to-day use of machines powered by ai. these machines are wired using cross-disciplinary approaches based on mathematics, computer science, statistics, psychology, and more.4 virtual assistants are becoming more common, most of the web shops predict your purchases, many companies make use of chatbots in their customer service and many companies use algorithms to detect fraud.',
'ai and deep learning technology employed in office entry systems will bring proper time tracking of each employee. as this system tries to learn each person with an image processing technology whose data is feed forwarded to a deep learning model where deep learning isn’t an algorithm per se, but rather a family of algorithms that implements deep networks (many layers). these networks are so deep that new methods of computation, such as graphics processing units (gpus), are required to train them, in addition to clusters of compute nodes. so using deep learning we can take detect the employee using face and person recognition scan and through which login, logout timing is recorded. using an ai system we can even identify each employee’s entry time, their working hours, non-working hours by tracking the movement of an employee in the office so that system can predict and report hr for the salary for each employee based on their working hours. our system can take feed from cctv to track movements of employees and this system is capable of recognizing a person even he/she is being masked as in this pandemic situation by taking their iris scan. with this system installed inside the office, the following are some of the benefits:',
'for several countries, regulations insist that the employer must keep documents available that can demonstrate the working hours performed by each employee. in the event of control from the labor inspectorate or a dispute with an employee, the employer must be able to explain and justify the working hours for the company. this can be made easy as our system is tracking employee movements',
'this is about monitoring user connection times to detect suspicious access times. in the event where compromised credentials are used to log on at 3 a.m. on a saturday, a notification on this access could alert the it team that an attack is possibly underway.',
'to manage and react to employees’ attendance, overtime thresholds, productivity, and suspicious access times, our system records and stores detailed and interactive reporting on users’ connection times. these records allow you to better manage users’ connection times and provide accurate, detailed data required by management.',
'4)if you want to avoid paying overtime, make sure that your employees respect certain working time quotas or even avoid suspicious access. our system will alert the hr officer about each employee’s office in and out time so that they can accordingly take action.',
'5)last but not least it reduces human resource needs to keep track of the records and sending the report to hr and hr officials has to check through the report so this system will reduce times and human resource needs',
'with the use of ai and deep learning technologies, we can automate some routines stuff with more functionality which humans need more resources to keep track thereby reducing time spent on manual data entry works rather companies can think of making their position high in the competitive world.'
]
Sentence splitting is not a trivial task. I'd suggest you use a ready-made library like the NLTK.
import nltk
text = "..." # your raw text
sentences = nltk.sent_tokenize(text)
This works fairly well, but don't expect perfect results.
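Applied to the list in the question, counting all the sentences could then look something like this (substitute your own list for the placeholder strings):

import nltk
# nltk.download("punkt")   # needed once

texts = ["first paragraph of your list goes here.", "second paragraph goes here."]
sentence_count = sum(len(nltk.sent_tokenize(t)) for t in texts)
print(sentence_count)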
I have a list of text like this- and I want to count no. of lines in this. I thought of splitting it by ". "
Nice idea, but if you want to count the number of lines, why don't you simply take the length of that list?
length = len(that_long_text_list)
You can then add to that length the number of "." sentence breaks inside each item, using a regex to exclude abbreviations like "a.m.". To do this, check this question about regex pattern matching for acronyms.
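If you do want the regex route, something like the negative lookbehind below keeps "a.m."/"p.m." intact; the abbreviation list is only illustrative and would need extending for your data:

import re

text = ("compromised credentials are used to log on at 3 a.m. on a saturday. "
        "a notification on this access could alert the it team.")

# split on ". " only when the period is not part of "a.m."/"p.m."
parts = re.split(r"(?<![ap]\.m)\.\s+", text)
print(len(parts))   # 2: the period in "a.m." is not treated as a sentence break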
Since we don't know how many characters fit into a line as you define it, we can't know for sure. My suggestion would be as follows:
Split the text into paragraphs, since the start of a new paragraph will, in any case, introduce a line break. This is easy - each item in your list will be a paragraph.
For each of these paragraphs, divide its character count by the number of characters in a line and round up to the nearest full integer.
So my code would be as follows:
raw_text = []          # your raw text in the form of a list of strings
line_count = 0
chars_per_line = 80    # whatever the number of characters in a line of yours is
for paragraph in raw_text:
    line_count += len(paragraph) // chars_per_line + 1   # full lines plus one for the partial last line
Not perfect because it doesn't take the rules of syllabification into account, but close enough, I guess.
I have a dataset of responses from around 300 people completing a questionnaire. The questionnaire dealt with user experiences and behaviour in public transport. We ran the survey for 3 bus companies. Most of the questions are "yes/no", "best among 3" or "worst among 3".
If possible, I want to build a model that will suggest the best company of the three, based on the answers. The questions cover things like availability of buses, reliability of buses, preferences of the user, and physical maintenance of the buses.
I expect the model to analyse the data set and return the best bus company, i.e., the one that is most available, clean and well maintained, reliable, and preferred by users.
Also, answers to questions like "Which bus do you prefer?" should have more weight in decision making.
I am pretty new to machine learning and would appreciate suggestions on which algorithm to start with to train the model.
Firstly, you should use pandas to do all the cleansing of data such as removing null values and data validation.
Secondly, if you need to visualise your data, the more popular choices would be seaborn or matplotlib.
Lastly, for your model: since it's machine learning and not deep learning, scikit-learn is a great library to train your model with.
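As a minimal sketch of that pipeline (the file and column names are made up; in particular it assumes a column holding each respondent's preferred company, which serves as the label):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("survey.csv").dropna()                      # basic cleaning with pandas
X = pd.get_dummies(df.drop(columns=["preferred_company"]))   # encode yes/no and categorical answers
y = df["preferred_company"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))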
I would recommend that you acquire more data as 300 is not sufficient. Not in the world of machine learning.
I would also add that you can use NLP libraries like BERT or NLTK, which also come with pretrained models.
At the preprocessing stage, since you have a classification problem, be careful to balance your data.
If you have a corpus of text, how can you identify all the categories (from a list of pre-defined categories) and the sentiment (positive/negative writing) associated with it?
I will be doing this in Python but at this stage I am not necessarily looking for a language specific solution.
Let's look at this question with an example to try and clarify what I am asking.
If I have a whole corpus of reviews for products e.g.:
Microsoft's Xbox One offers impressive graphics and a solid list of exclusive 2015 titles. The Microsoft console currently edges ahead of the PS4 with a better selection of media apps. The console's fall-2015 dashboard update is a noticeable improvement. The console has backward compatibility with around 100 Xbox 360 titles, and that list is poised to grow. The Xbox One's new interface is still more convoluted than the PS4's. In general, the PS4 delivers slightly better installation times, graphics and performance on cross-platform games. The Xbox One also lags behind the PS4 in its selection of indie games. The Kinect's legacy is still a blemish. While the PS4 remains our overall preferred choice in the game console race, the Xbox One's significant course corrections and solid exclusives make it a compelling alternative.
And I have a list of pre-defined categories e.g. :
Graphics
Game Play
Game Selection
Apps
Performance
Irrelevant/Other
I could take my big corpus of reviews and break them down by sentence. For each sentence in my training data I can hand-tag the appropriate categories. The problem is that there could be multiple categories in one sentence.
If it were one category per sentence, then any classification algorithm from scikit-learn would do the trick. When working with multiple classes I could use something like multi-label classification.
Adding in the sentiment is the trickier part. Identifying sentiment in a sentence is a fairly simple task, but if there is a mix of sentiment across different labels it becomes more difficult.
Take the example sentence "The Xbox One has a good selection of games but the performance is worse than the PS4". We can identify two of our pre-defined categories (game selection, performance), but we have positive sentiment towards game selection and negative sentiment towards performance.
What would be a way to identify all categories in text (from our pre-defined list) with their associated sentiment?
One simple method is to break your training set into minimal sentences using a parser and use that as the input for labelling and sentiment classification.
Your example sentence:
The Xbox One has a good selection of games but the performance is worse than the PS4
Using the Stanford Parser, take S tags that don't have child S tags (and thus are minimal sentences) and put the tokens back together. For the above sentence that would give you these:
The Xbox One has a good selection of games
the performance is worse than the PS4
Sentiment within an S tag should be consistent most of the time. If sentences like "The XBox has good games and terrible graphics" are common in your dataset, you may need to break it down to NP tags, but that seems unlikely.
Regarding labelling, as you mentioned any multi-label classification method should work.
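For example, a minimal scikit-learn multi-label setup on the two minimal sentences above might look like this (the label assignments are just hand-made examples):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

sentences = ["The Xbox One has a good selection of games",
             "the performance is worse than the PS4"]
labels = [["Game Selection"], ["Performance"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)                        # one binary column per category
vec = TfidfVectorizer()
X = vec.fit_transform(sentences)

clf = OneVsRestClassifier(LogisticRegression()).fit(X, Y)
new = vec.transform(["The Kinect's legacy is still a blemish"])
print(mlb.inverse_transform(clf.predict(new)))       # predicted category names (possibly several, or none)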
For more sophisticated methods, there's a lot of research on joint topic-sentiment models - a search for "topic sentiment model" turns up a lot of papers and code. Here's sample training data from a paper introducing a Hidden Topic Sentiment Model that looks right up your alley. Note how in the first sentence with labels there are two topics.
Hope that helps!
The only approach I can think of consists of a set of steps.
1) Use some library to extract entities from text and their relationships. For example, check this article:
http://www.nltk.org/book/ch07.html
By parsing each text you may figure out which entities you have in each text and which chunks of text are related to the entity.
2) Use NLTK's sentiment extraction to analyze the chunks specifically related to each entity and obtain their sentiment. That gives you the sentiment of each entity.
3) After that you need to come up with a way to map the entities you find in text to what you call 'topics'. Unfortunately, I don't see a way to automate this, since you clearly do not define topics conventionally through word frequency (as topic modelling algorithms like LDA and NMF do).
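For steps 1 and 2, a rough NLTK sketch might look like the following (entities via ne_chunk, sentence-level sentiment via the VADER analyzer; the mapping to your 'topics' in step 3 remains manual):

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
# one-time downloads: punkt, averaged_perceptron_tagger, maxent_ne_chunker, words, vader_lexicon

sia = SentimentIntensityAnalyzer()
text = "The Xbox One has a good selection of games but the performance is worse than the PS4"

for sent in nltk.sent_tokenize(text):
    tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent)))
    entities = [" ".join(tok for tok, tag in st)
                for st in tree.subtrees() if st.label() != "S"]
    print(entities, sia.polarity_scores(sent))       # entities found, plus the sentence's sentiment scores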
I'm trying to take a long list of objects (in this case, applications from the iTunes App Store) and classify them more specifically. For instance, there are a bunch of applications currently classified as "Education," but I'd like to label them as Biology, English, Math, etc.
Is this an AI/Machine Learning problem? I have no background in that area whatsoever but would like some resources or ideas on where to start for this sort of thing.
Yes, you are correct. Classification is a machine learning problem, and classifying stuff based on text data involves natural language processing.
The canonical classification problem is spam detection using a Naive Bayes classifier, which is very simple. The idea is as follows:
Gather a bunch of data (emails), and label them by class (spam, or not spam)
For each email, remove stopwords, and get a list of the unique words in that email
Now, for each word, calculate the probability that it appears in a spam email vs. a non-spam email (i.e., count occurrences in spam vs. non-spam)
Now you have a model: the probability of an email being spam, given that it contains a word. However, an email contains many words. In Naive Bayes, you assume the words occur independently of each other (which turns out to be an OK assumption), and multiply the probabilities of all the words in the email against each other.
You usually divide data into training and testing, so you'll have a set of emails you train your model on, and then a set of labeled stuff you test against where you calculate precision and recall.
I'd highly recommend playing around with NLTK, a python machine learning and nlp library. It's very user friendly and has good docs and tutorials, and is a good way to get acquainted with the field.
EDIT: Here's an explanation of how to build a simple NB classifier with code.
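In the meantime, a minimal sketch of that workflow using scikit-learn's multinomial Naive Bayes might look like this (the app descriptions and labels are made up for illustration):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

texts = ["Learn how cells grow and divide",
         "Practice algebra and geometry problems",
         "Grammar and vocabulary exercises for kids",
         "Explore plant and animal life cycles"]
labels = ["Biology", "Math", "English", "Biology"]

vec = CountVectorizer(stop_words="english")              # drop stopwords, count word occurrences
X = vec.fit_transform(texts)

X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.25, random_state=0)
clf = MultinomialNB().fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test), zero_division=0))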
Probably not. You'd need to do a fair bit of work to extract data in some usable form (such as names), and at the end of the day, there are probably few enough categories that it would simply be easier to manually identify a list of keywords for each category and set a parser loose on titles/descriptions.
For example, you could look through half a dozen biology apps, and realize that in the names/descriptions/whatever you have access to, the words "cell," "life," and "grow" appear fairly often - not as a result of some machine learning, but as a result of your own human intuition. So build a parser to classify everything with those words as biology apps, and do similar things for other categories.
Unless you're trying to classify the entire iTunes app store, that should be sufficient, and it would be a relatively small task for you to manually check any apps with multiple classifications or no classifications. The labor involved with using a simple parser + checking anomalies manually is probably far less than the labor involved with building a more complex parser to aid machine learning, setting up machine learning, and then checking everything again, because machine learning is not 100% accurate.
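For what it's worth, the keyword approach can be as small as this sketch (the keyword lists are invented here and would come from your own inspection of the apps):

# classify an app by checking its name/description against hand-picked keyword lists
KEYWORDS = {
    "Biology": {"cell", "life", "grow", "organism"},
    "Math": {"algebra", "geometry", "number", "equation"},
    "English": {"grammar", "vocabulary", "spelling", "reading"},
}

def classify(description):
    words = set(description.lower().split())
    matches = [category for category, kws in KEYWORDS.items() if words & kws]
    return matches   # zero or multiple matches get set aside for a manual check

print(classify("Watch cells grow and divide in this interactive lab"))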
I was wondering how a semantic service like Open Calais figures out the names of companies, people, tech concepts, keywords, etc. from a piece of text. Is it because they have a large database that they match the text against?
How would a service like Zemanta know what images to suggest to a piece of text for instance?
Michal Finkelstein from OpenCalais here.
First, thanks for your interest. I'll reply here but I also encourage you to read more on OpenCalais forums; there's a lot of information there including - but not limited to:
http://opencalais.com/tagging-information
http://opencalais.com/how-does-calais-learn
Also feel free to follow us on Twitter (#OpenCalais) or to email us at team#opencalais.com
Now to the answer:
OpenCalais is based on a decade of research and development in the fields of Natural Language Processing and Text Analytics.
We support the full "NLP Stack" (as we like to call it):
From text tokenization, morphological analysis and POS tagging, to shallow parsing and identifying nominal and verbal phrases.
Semantics come into play when we look for Entities (a.k.a. Entity Extraction, Named Entity Recognition). For that purpose we have a sophisticated rule-based system that combines discovery rules as well as lexicons/dictionaries. This combination allows us to identify names of companies/persons/films, etc., even if they don't exist in any available list.
For the most prominent entities (such as people, companies) we also perform anaphora resolution, cross-reference and name canonization/normalization at the article level, so we'll know that 'John Smith' and 'Mr. Smith', for example, are likely referring to the same person.
So the short answer to your question is - no, it's not just about matching against large databases.
Events/Facts are really interesting because they take our discovery rules one level deeper; we find relations between entities and label them with the appropriate type, for example M&As (relations between two or more companies), Employment Changes (relations between companies and people), and so on. Needless to say, Event/Fact extraction is not possible for systems that are based solely on lexicons.
For the most part, our system is tuned to be precision-oriented, but we always try to keep a reasonable balance between accuracy and entirety.
By the way there are some cool new metadata capabilities coming out later this month so stay tuned.
Regards,
Michal
I'm not familiar with the specific services listed, but the field of natural language processing has developed a number of techniques that enable this sort of information extraction from general text. As Sean stated, once you have candidate terms, it's not too difficult to search for those terms with some of the other entities in context and then use the results of that search to determine how confident you are that the term extracted is an actual entity of interest.
OpenNLP is a great project if you'd like to play around with natural language processing. The capabilities you've named would probably be best accomplished with Named Entity Recognizers (NER) (algorithms that locate proper nouns, generally, and sometimes dates as well) and/or Word Sense Disambiguation (WSD) (e.g., the word 'bank' has different meanings depending on its context, and that can be very important when extracting information from text. Given the sentences "the plane banked left", "the snow bank was high", and "they robbed the bank", you can see how disambiguation plays an important part in language understanding).
Techniques generally build on each other, and NER is one of the more complex tasks, so to do NER successfully, you will generally need accurate tokenizers (natural language tokenizers, mind you -- statistical approaches tend to fare the best), string stemmers (algorithms that conflate similar words to common roots: so words like informant and informer are treated equally), sentence detection ('Mr. Jones was tall.' is only one sentence, so you can't just check for punctuation), part-of-speech taggers (POS taggers), and WSD.
There is a python port of (parts of) OpenNLP called NLTK (http://nltk.sourceforge.net) but I don't have much experience with it yet. Most of my work has been with the Java and C# ports, which work well.
All of these algorithms are language-specific, of course, and they can take significant time to run (although, it is generally faster than reading the material you are processing). Since the state-of-the-art is largely based on statistical techniques, there is also a considerable error rate to take into account. Furthermore, because the error rate impacts all the stages, and something like NER requires numerous stages of processing, (tokenize -> sentence detect -> POS tag -> WSD -> NER) the error rates compound.
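To make that staging concrete, here is roughly what the pipeline (minus WSD) looks like in NLTK, the Python port mentioned above; the example sentence just extends the 'Mr. Jones' one:

import nltk
# one-time downloads: punkt, averaged_perceptron_tagger, maxent_ne_chunker, words

text = "Mr. Jones was tall. He worked for a large bank in London."
for sent in nltk.sent_tokenize(text):     # sentence detection ("Mr." is not a sentence break)
    tokens = nltk.word_tokenize(sent)     # tokenization
    tagged = nltk.pos_tag(tokens)         # part-of-speech tagging
    print(nltk.ne_chunk(tagged))          # named entity recognition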
Open Calais probably uses language parsing technology and language statistics to guess which words or phrases are Names, Places, Companies, etc. Then it is just another step to do some kind of search for those entities and return metadata.
Zemanta probably does something similar, but matches the phrases against metadata attached to images in order to acquire related results.
It certainly isn't easy.