How to programmatically classify a list of objects

How to programmatically classify a list of objects - python

I'm trying to take a long list of objects (in this case, applications from the iTunes App Store) and classify them more specifically. For instance, there are a bunch of applications currently classified as "Education," but I'd like to label them as Biology, English, Math, etc.
Is this an AI/Machine Learning problem? I have no background in that area whatsoever but would like some resources or ideas on where to start for this sort of thing.

Yes, you are correct. Classification is a machine learning problem, and classifying stuff based on text data involves natural language processing.
The canonical classification problem is spam detection using a Naive Bayes classifier, which is very simple. The idea is as follows:
Gather a bunch of data (emails), and label them by class (spam, or not spam)
For each email, remove stopwords, and get a list of the unique words in that email
Now, for each word, calculate the probability it appears in a spam email, vs a non-spam email (ie count occurrences in spam, vs non spam)
Now you have a model- the probability of a email being spam, given it contains a word. However, an email contains many words. In Naive Bayes, you assume the words occur independently of each other (which turns out to to be an ok assumption), and multiply the probabilities of all words in the email against each other.
You usually divide data into training and testing, so you'll have a set of emails you train your model on, and then a set of labeled stuff you test against where you calculate precision and recall.
I'd highly recommend playing around with NLTK, a python machine learning and nlp library. It's very user friendly and has good docs and tutorials, and is a good way to get acquainted with the field.
EDIT: Here's an explanation of how to build a simple NB classifier with code.

Probably not. You'd need to do a fair bit of work to extract data in some usable form (such as names), and at the end of the day, there are probably few enough categories that it would simply be easier to manually identify a list of keywords for each category and set a parser loose on titles/descriptions.
For example, you could look through half a dozen biology apps, and realize that in the names/descriptions/whatever you have access to, the words "cell," "life," and "grow" appear fairly often - not as a result of some machine learning, but as a result of your own human intuition. So build a parser to classify everything with those words as biology apps, and do similar things for other categories.
Unless you're trying to classify the entire iTunes app store, that should be sufficient, and it would be a relatively small task for you to manually check any apps with multiple classifications or no classifications. The labor involved with using a simple parser + checking anomalies manually is probably far less than the labor involved with building a more complex parser to aid machine learning, setting up machine learning, and then checking everything again, because machine learning is not 100% accurate.

Related

Which document embedding model for document similarity

First, I want to explain my task. I have a dataset of 300k documents with an average of 560 words (no stop word removal yet) 75% in German, 15% in English and the rest in different languages. The goal is to recommend similar documents based on an existing one. At the beginning I want to focus on the German and English documents.  
To achieve this goal I looked into several methods on feature extraction for document similarity, especially the word embedding methods have impressed me because they are context aware in contrast to simple TF-IDF feature extraction and the calculation of cosine similarity. 
I'm overwhelmed by the amount of methods I could use and I haven't found a proper evaluation of those methods yet. I know for sure that the size of my documents are too big for BERT, but there is FastText, Sent2Vec, Doc2Vec and the Universal Sentence Encoder from Google. My favorite method based on my research is Doc2Vec even though there aren't any or old pre-trained models which means I have to do the training on my own.
Now that you know my task and goal, I have the following questions:
Which method should I use for feature extraction based on the rough overview of my data?
My dataset is too small to train Doc2Vec on it. Do I achieve good results if I train the model on English / German Wikipedia?

You really have to try the different methods on your data, with your specific user tasks, with your time/resources budget to know which makes sense.
You 225K German documents and 45k English documents are each plausibly large enough to use Doc2Vec - as they match or exceed some published results. So you wouldn't necessarily need to add training on something else (like Wikipedia) instead, and whether adding that to your data would help or hurt is another thing you'd need to determine experimentally.
(There might be special challenges in German given compound words using common-enough roots but being individually rare, I'm not sure. FastText-based approaches that use word-fragments might be helpful, but I don't know a Doc2Vec-like algorithm that necessarily uses that same char-ngrams trick. The closest that might be possible is to use Facebook FastText's supervised mode, with a rich set of meaningful known-labels to bootstrap better text vectors - but that's highly speculative and that mode isn't supported in Gensim.)

Recommendation for ML algorithm to differentiate between ban or allowance

I am new to Machine Learning and wanted to see if any of you could recommend an algorithm I could apply onto a project I'm doing. Basically I want to scrape popular housing websites and look at their descriptions to see if they allow/disallow something, for example pets. The problem is a simple search for pets leads contradictory results: 'pets allowed' and 'no additional cost for pets' or 'no pets' or 'I don't accept at this time'. As seen from these examples, often negative keywords 'no' are used to indicate pets are allowed, whereas positive keywords like 'accept' are used to indicate a ban. As such, I was wondering if there was any algorithm I could use (preferably in python) differentiate between the two. (Note: I can't run training data to generate an algorithm myself as the thing I am actually looking for is quite niche).
Thank you very much for your help!!

The keyword you're looking for is "document classification". This is a document classification problem. You start with documents (i.e. webpages) and you want to classify them as "allows pets" or "doesn't allow pets" (or whatever). There are a lot of good tutorials out there for performing document classification but a full explanation is beyond the scope of a StackOverflow answer.
You won't be able to do this for your particular niche case without providing at least some training data but you could gather, say 30 example websites, extract their text, manually add labels ("does fit my niche" vs "doesn't fit my niche"), and then run through a standard document classification system and see if that gets you the accuracy you want. Also in order for this to work with a small amount of training data (like your 30 documents), you'll need to start from a pretrained model.
Good luck!

First time working with Word2Vec, try to cluster users based on their skill set

For my thesis I have to analyse the skill of candidates. I have to cluster the users and compare their skillsets. The information is classified so I made a random database which have the same structure so I can show how my data is build.
import random
listOfSkills = ["Dutch","Java OSGI","XML Transformation Query","Java Enterprise Edition","Functional Design","Scrum","Python","JavaScript","Ruby","Java","SQL","Data Analytics","Machine Learning","Deep Learning","English"]
rand_item = random.choice(listOfSkills)
n = 5
rand_items = random.sample(listOfSkills, n)
test_skillset = []
for i in range(5):
result = random.sample(listOfSkills, n)
string = ", ".join(result)
test_skillset.append(string)
test_id_ = np.arange(0, len(test_skillset)).tolist()
test_dict = {'id' : test_id_,
'skillset' : test_skillset
}
test_df = pd.DataFrame(test_dict)
After running this code I get a DataFrame which looks like this:
id, skillset
0, "Java, ruby, ..."
1, "Java, ruby, ..."
2, "Java, ruby, ..."
This is the same for the database I got.
The list of skills are some of the skills I found in the database. Also there are more users in the database which also have more skills.
I am quite new to machine learning and to using Word2Vec models. I tried some stuff but almost all the time I don't get a result which gives me extra information. Some of the skills have a long name which maybe messes up the model. Or I did something wrong.
One of my goals is to cluster the users and find similarities between each skill set.
My final goal is to match the vectors of the skill set with vacancies to check how good of a match a user can be with an open vacancy. But first I need to know if I can find similarities between the users.
So my question are:
How can I use Word2Vec to find individual similarities between skills?
How can I use Word2Vec cluster the users to find similar skillsets?
Sorry if my question is a bit vague, English isn't my native language and Python is a bit new to me.
I am open to clarify things if needed.

Word2vec was originally unveiled as an algorithm trained on long, real natural language texts, which include many subtly-varied examples of word usage, in original contexts.
You seem to be applying it to a smaller controlled-vocabulary of known-skills, using training-data that isn't full natural language communication – just lists.
Word2vec & similar algorithms can sometimes offer interesting results on such not-quite-real-language corpora, but can require more tinkering with the training data, and parameters further from the usual defaults for natural language texts, for better results.
In particular, if you are using a randomly-generated corpus – especially one generated by uniform sampling from a tiny list of just 15 'words'! – you shouldn't expect the word2vec algorithm to do anything useful. There's no language-like pattern of relative word-meanings in such an artificial corpus. (And, any tiny hints-of-correlation any one run might show would be noise from your random sample, totally unlike the gradations of meaning real languages have, and real training texts would show.)
(There are probably other errors, as well, in your "tried some stuff" experiments that didn't yield useful results – but this use of a random training set is just the most obvious problem, in what you've shown.)
To get useful "skills vectors" you'll need lots of realistic data, and to adjust things like the size/options of your training to match the limits of what you have. (As just one example, it's nonsensical to try to train even 20-dimensional 'dense embedding' vectors, like those from word2vec, for a vocabulary that's less than 20 tokens long – & you'd probably need 400+ unique tokens to make 20-dimensional vectors start making sense.)
With the right data, you should start to see meaningful relationships between such skills-vectors – with related skills nearer each other than unrelated skills, and even the directions-of-differences suggestive of certain human-describable aspects of skills (like "more enterprisey", or "more abstract mathy"). But you can't even really eyeball those results for sanity/improvement unless it's realistic data, with real relationships, which you can evaluate using your domain knowledge.
You might then be able to try alternate ways of composing those into per-candidate summaries (such as averaging the skills-vectors together), or just use some composite measure-of-distance which relies on word-vectors (like say "Word Mover's Distance"), in order to try candidate-level clustering.
Good luck!

Tweet classification into multiple categories on (Unsupervised data/tweets)

I want to classify the tweets into predefined categories (like: sports, health, and 10 more). If I had labeled data, I would be able to do the classification by training Naive Bayes or SVM. As described in http://cucis.ece.northwestern.edu/publications/pdf/LeePal11.pdf
But I cannot figure out a way with unlabeled data. One possibility could be using Expectation-Maximization and generating clusters and label those clusters. But as said earlier I have predefined set of classes, so clustering won't be as good.
Can anyone guide me on what techniques I should follow. Appreciate any help.

Alright by what i can understand i think there are multiple ways to attend to this case.
there will be trade offs and the accuracy rate may vary. because of the well know fact and observation
Each single tweet is distinct!
(unless you are extracting data from twitter stream api based on tags and other keywords). Please define the source of data and how are you extracting it. i am assuming you're just getting general tweets which can be about anything
The thing you can do is to generate a set of dictionary for each class you have
(i.e Music => pop , jazz , rap , instruments ...)
which will contain relevant words to that class. You can use NLTK for python or Stanford NLP for other languages.
You can start with extracting
Synonyms
Hyponyms
Hypernyms
Meronyms
Holonyms
Go see these NLP Lexical semantics slides. it will surely clear some of the concepts.
Once you have dictionaries for each classes. cross compare them with the tweets you have got. the tweet which has the most similarity (you can rank them according to the occurrences of words from the these dictionaries) you can label it to that class. This will make your tweets labeled like others.
Now the question is the accuracy! But it depends on the data and versatility of your classes. This may be an "Over kill" But it may come close to what you want.
Furthermore you can label some set of tweets this way and use Cosine Similarity to cross identify other tweets. This will help with the optimization part. But then again its up-to you. As you know what Trade offs you can bear
The real struggle will be the machine learning part and how you manage that.

Actually this seems as a typical use case of semi-supervised learning. There are plenty methods of use here, including clustering with constraints (where you force model to cluster samples from the same class together), transductive learning (where you try to extrapolate model from labeled samples onto distribution of unlabeled ones).
You could also simply cluster data as #Shoaib suggested, but then you will have to come up the the heuristic approach how to deal with clusters with mixed labeling. Futhermore - obviously solving optimziation problem not related to the task (labeling) will not be as good as actually using this knowledge.

You can use clustering for that task. For that you have to label some examples for each class first. Then using these labeled examples, you can identify the class of each cluster easily.

How do content discovery engines, like Zemanta and Open Calais work?

I was wondering how as semantic service like Open Calais figures out the names of companies, or people, tech concepts, keywords, etc. from a piece of text. Is it because they have a large database that they match the text against?
How would a service like Zemanta know what images to suggest to a piece of text for instance?

Michal Finkelstein from OpenCalais here.
First, thanks for your interest. I'll reply here but I also encourage you to read more on OpenCalais forums; there's a lot of information there including - but not limited to:
http://opencalais.com/tagging-information
http://opencalais.com/how-does-calais-learn
Also feel free to follow us on Twitter (#OpenCalais) or to email us at team#opencalais.com
Now to the answer:
OpenCalais is based on a decade of research and development in the fields of Natural Language Processing and Text Analytics.
We support the full "NLP Stack" (as we like to call it):
From text tokenization, morphological analysis and POS tagging, to shallow parsing and identifying nominal and verbal phrases.
Semantics come into play when we look for Entities (a.k.a. Entity Extraction, Named Entity Recognition). For that purpose we have a sophisticated rule-based system that combines discovery rules as well as lexicons/dictionaries. This combination allows us to identify names of companies/persons/films, etc., even if they don't exist in any available list.
For the most prominent entities (such as people, companies) we also perform anaphora resolution, cross-reference and name canonization/normalization at the article level, so we'll know that 'John Smith' and 'Mr. Smith', for example, are likely referring to the same person.
So the short answer to your question is - no, it's not just about matching against large databases.
Events/Facts are really interesting because they take our discovery rules one level deeper; we find relations between entities and label them with the appropriate type, for example M&As (relations between two or more companies), Employment Changes (relations between companies and people), and so on. Needless to say, Event/Fact extraction is not possible for systems that are based solely on lexicons.
For the most part, our system is tuned to be precision-oriented, but we always try to keep a reasonable balance between accuracy and entirety.
By the way there are some cool new metadata capabilities coming out later this month so stay tuned.
Regards,
Michal

I'm not familiar with the specific services listed, but the field of natural language processing has developed a number of techniques that enable this sort of information extraction from general text. As Sean stated, once you have candidate terms, it's not to difficult to search for those terms with some of the other entities in context and then use the results of that search to determine how confident you are that the term extracted is an actual entity of interest.
OpenNLP is a great project if you'd like to play around with natural language processing. The capabilities you've named would probably be best accomplished with Named Entity Recognizers (NER) (algorithms that locate proper nouns, generally, and sometimes dates as well) and/or Word Sense Disambiguation (WSD) (eg: the word 'bank' has different meanings depending on it's context, and that can be very important when extracting information from text. Given the sentences: "the plane banked left", "the snow bank was high", and "they robbed the bank" you can see how dissambiguation can play an important part in language understanding)
Techniques generally build on each other, and NER is one of the more complex tasks, so to do NER successfully, you will generally need accurate tokenizers (natural language tokenizers, mind you -- statistical approaches tend to fare the best), string stemmers (algorithms that conflate similar words to common roots: so words like informant and informer are treated equally), sentence detection ('Mr. Jones was tall.' is only one sentence, so you can't just check for punctuation), part-of-speech taggers (POS taggers), and WSD.
There is a python port of (parts of) OpenNLP called NLTK (http://nltk.sourceforge.net) but I don't have much experience with it yet. Most of my work has been with the Java and C# ports, which work well.
All of these algorithms are language-specific, of course, and they can take significant time to run (although, it is generally faster than reading the material you are processing). Since the state-of-the-art is largely based on statistical techniques, there is also a considerable error rate to take into account. Furthermore, because the error rate impacts all the stages, and something like NER requires numerous stages of processing, (tokenize -> sentence detect -> POS tag -> WSD -> NER) the error rates compound.

Open Calais probably use language parsing technology and language statics to guess which words or phrases are Names, Places, Companies, etc. Then, it is just another step to do some kind of search for those entities and return meta data.
Zementa probably does something similar, but matches the phrases against meta-data attached to images in order to acquire related results.
It certainly isn't easy.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.