Classification from a string of numbers and letters - python

I have a database full of items and I've been tasked with classifying them (they could be books, stationery, etc.). The options are either to go through 100k records manually and figure out what they are, or to automate the task.
The codes for each type of item follow some kind of pattern, so I'm hoping to use machine learning to solve this (I do not want to use regular expressions). Though I'm quite good at Python, my ML knowledge only goes as far as random forests and logistic regression.
Is this at all possible? The data looks like this:
Item  Code                 Type
1     4S2BDANC5L3247151    book
2     1N4AL3AP1JC236284    book
3     3R4BTTGC3L3237430    book
4     KNMAT2MT1KP546287    book
5     97806773062273208    pen
6     07356196706378892    Pen
7     97807345361169253    pen
8     01008130715194136    chair
9     01076305063010CCE44  chair
etc.
I'm happy to look up and learn whatever is necessary; I just don't know where to start.
Thanks!

I understood that you have 100k examples. You can use RNNs, LSTMs, or attention-based deep learning methods, because those models can track the patterns in the codes. Classical machine learning models can also solve this problem. In the end, your problem involves a specific type of pattern for each class, so you can separate those classes.
1) Start by finding an embedding to represent your codes. I guess you can use the ASCII codes of the numbers and letters. To make all the vectors the same length, use padding. Then normalize them to lie between 0 and 1.
2) Then my advice is to start with an SVM using a one-vs-all strategy for multi-class classification (a sketch follows below). After that you can try XGBoost, which is a powerful ML model. Or you can start with more basic ML models. The idea is to start simple and move to more complicated models.
3) If ML models are not enough for the task, move on to basic RNN models.
I don't know your data distribution among classes or the number of classes, but if the data is balanced and each class has enough examples, I guess you can easily automate the task.
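A minimal sketch of steps 1) and 2), assuming scikit-learn and NumPy are available; the codes and labels below are a made-up stand-in for the real 100k-row table:

    import numpy as np
    from sklearn.svm import LinearSVC

    # Stand-in data mirroring the question's table (made up, not real).
    codes = ["4S2BDANC5L3247151", "1N4AL3AP1JC236284",
             "97806773062273208", "97807345361169253",
             "01008130715194136", "01076305063010CCE44"]
    labels = ["book", "book", "pen", "pen", "chair", "chair"]

    max_len = max(len(c) for c in codes)

    def encode(code):
        # ASCII value per character, zero-padded to a fixed length,
        # then scaled into [0, 1].
        vec = [ord(ch) for ch in code] + [0] * (max_len - len(code))
        return np.array(vec, dtype=float) / 127.0

    X = np.vstack([encode(c) for c in codes])

    # LinearSVC trains one binary classifier per class, i.e. the
    # one-vs-rest (one-vs-all) strategy described in step 2).
    clf = LinearSVC()
    clf.fit(X, labels)
    print(clf.predict(X))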

Related

Why does the NN perform better with OneHotEncoding?

I have a question, just for a general case. I am working with the poker-hand dataset, which has 10 possible outputs from 0-9, where each number denotes a poker hand, for example a royal flush.
I read on the internet that it is necessary to use OHE in a multiclass problem, because otherwise there would be an artificial order, for example if you work with cities. But in my case with the poker hands there is an order, from one pair over flush and straight up to royal flush, right?
Even so, my NN performs better with OHE, although it also works (badly) without it.
So why does it work better with OHE? I built a dense network with 2 hidden layers.
Short answer: depending on how the feature is used in the classification, and on the implementation of the classifier you use, you decide whether to use OHE or not. If the feature is a category such that the rank has no meaning (for example, the suit of the card: 1=clubs, 2=hearts, ...), then you should use OHE (for frameworks that require a categorical distinction), because ranking it has no meaning. If the feature has a ranking meaning with regard to the classification, then keep it as-is (for example, the probability of getting a certain winning hand).
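For instance, a minimal sketch with scikit-learn's OneHotEncoder, using the suit example from above:

    import numpy as np
    from sklearn.preprocessing import OneHotEncoder

    # Suit is a pure category (1=clubs, 2=hearts, 3=spades, 4=diamonds);
    # one-hot encoding removes the meaningless numeric ordering.
    suits = np.array([[1], [2], [3], [4], [2]])
    onehot = OneHotEncoder().fit_transform(suits).toarray()
    print(onehot)  # e.g. 1 -> [1, 0, 0, 0], 2 -> [0, 1, 0, 0], ...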
Since you did not specify the task you are using the NN for, nor the loss function and a lot of other things, I can only assume that when you say "...my NN performs better with OHE" you want to classify a combination into a class of poker hands. In this scenario the data just presents the learner with classes to distinguish between (as categories, not ranks). You could add a feature for the probability and/or strength of the hand, etc., which would be a ranking feature; whether adding it improves the resulting classifier is a whole other topic (the question of number of features versus classification performance).
Hope I understood you correctly.
Note: this is a big question and there is a lot of hand-waving here, but that is the scope.

NLP steps or approach to classify text?

I'm working on a project to classify restaurant reviews on a sentiment (positive or negative) basis. I also want to classify whether these comments belong to a food, service, value-for-money, etc. category. I am unable to connect the steps or methodology provided on the internet. Can anyone provide a detailed method or steps to get to the solution?
How about using a bag-of-words model? It's been tried and tested for ages. It has some downsides compared to more modern methods, but you can still get decent results. And there is a ton of material on the internet to help you (a sketch of the pipeline follows these steps):
Normalize the documents into a form ingestible by your pipeline.
Convert the documents to vectors and apply TF-IDF to filter out irrelevant terms; there are good tutorials for this online.
Split your documents: take a subset and mark the ones that belong to the training data according to class (sentiment) and type of comment. For sentiment, your documents will clearly belong to two classes.
Apply some dimensionality reduction technique to make your models more robust; there are good discussions of this online.
Train your models on your training data. You need at least two models: one for sentiment and one for type. Some algorithms work with binary classes only, so you might need more than two models for comment type (food, value, service). This might be a good thing, because a comment can belong to more than one class (food quality and value, or value and service). Scikit-learn has a lot of good models; I also highly recommend the Orange toolbox, which is like a GUI for data science.
Validate your models using a validation set. If your accuracy is satisfactory (most classical methods like SVM should give you at least 90%), go ahead and use the model on incoming data.
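A minimal sketch of the sentiment half of this pipeline, assuming scikit-learn; the reviews and labels are invented for illustration:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    # Invented reviews; the real data would come from your corpus.
    reviews = [
        "The food was delicious and well priced",
        "Terrible service, we waited an hour",
        "Great value for money and friendly staff",
        "Bland dishes and a rude waiter",
    ]
    sentiment = ["positive", "negative", "positive", "negative"]

    # TfidfVectorizer covers normalization, vectorization, and TF-IDF
    # weighting in one step; LinearSVC is the sentiment classifier.
    model = make_pipeline(TfidfVectorizer(stop_words="english"), LinearSVC())
    model.fit(reviews, sentiment)
    print(model.predict(["The staff were friendly and the food was great"]))

A second pipeline of the same shape, trained on comment-type labels, would handle the food/service/value classification.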

Neural network architecture for OCR on printed documents

I'm learning neural networks by using TensorFlow to build an OCR system for printed documents.
Would you mind giving me advice on which neural network architecture is good for recognizing characters?
I'm confused because I'm a newbie and there are a lot of neural network designs.
I found MNIST classifiers, but their architectures only deal with digits.
I don't know whether those architectures can work with characters or not.
Thank you!
As you correctly point out, recognizing documents is a different thing from recognizing single characters. It is a complex system that will take time to implement from scratch. First, there is the problem of preprocessing: you need to find where the text is, perhaps slightly rotate it, etc. That can be done with heuristics and a library like OpenCV. You'll also have to detect things like page numbers, headers/footers, tables/figures, etc.
Then, in some cases, you could take the "easy" route and use heuristics to segment the text into characters. That works for block characters, but not cursive scripts.
If the segmentation is given, and you don't have to guess it, you have to solve multiple related problems: each is like MNIST, but they are related in that the decisions are not independent. You can look up MEMMs (Maximum-Entropy Markov Models) vs. HMMs (Hidden Markov Models), as well as Hidden Conditional Random Fields and Segmental Conditional Random Fields, and study the differences between them. You can also read about seq2seq.
So if you're making it simple for yourself, you can essentially run MNIST-style classifiers multiple times, once the segmentation is revealed (via some heuristic in OpenCV). On top of that, you have to run a dynamic program which finds the best final sequence based on the score of each decision, plus a "language model" which assigns likelihoods to letters occurring close to each other. A sketch of such a per-character classifier follows.
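As a rough sketch, here is a small MNIST-style CNN in the Keras API that ships with TensorFlow; switching from digits to letters only changes the number of classes and the training labels, and the layer sizes here are arbitrary choices, not a prescription:

    import tensorflow as tf

    num_classes = 10  # digits; use e.g. 62 for [0-9a-zA-Z] characters

    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(28, 28, 1)),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

    # MNIST here is only a placeholder; a character dataset would be
    # plugged in the same way.
    (x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
    x_train = x_train[..., None] / 255.0  # add channel dim, scale to [0, 1]
    model.fit(x_train, y_train, epochs=1, batch_size=128)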
If you're starting from scratch, it's not an easy thing. It may take months for you to get a basic understanding. Happy hacking!

Tweet classification into multiple categories on (Unsupervised data/tweets)

I want to classify tweets into predefined categories (like sports, health, and 10 more). If I had labeled data, I would be able to do the classification by training Naive Bayes or an SVM, as described in http://cucis.ece.northwestern.edu/publications/pdf/LeePal11.pdf
But I cannot figure out a way to do this with unlabeled data. One possibility could be using Expectation-Maximization to generate clusters and then labeling those clusters. But as said earlier, I have a predefined set of classes, so clustering won't be as good.
Can anyone guide me on what techniques I should follow? I'd appreciate any help.
Alright, from what I can understand, I think there are multiple ways to approach this case.
There will be trade-offs, and the accuracy rate may vary, because of the well-known fact and observation that
Each single tweet is distinct!
(unless you are extracting data from the Twitter streaming API based on tags and other keywords). Please define the source of the data and how you are extracting it; I am assuming you're just getting general tweets, which can be about anything.
What you can do is generate a dictionary for each class you have
(e.g. Music => pop, jazz, rap, instruments, ...)
which will contain words relevant to that class. You can use NLTK for Python, or Stanford NLP for other languages.
You can start by extracting:
Synonyms
Hyponyms
Hypernyms
Meronyms
Holonyms
Go look at some NLP lexical semantics slides; they will surely clarify some of the concepts.
Once you have a dictionary for each class, cross-compare them with the tweets you have. Whichever class a tweet is most similar to (you can rank classes according to the occurrences of words from these dictionaries), label the tweet with that class (a rough sketch follows). This will make your tweets labeled like the others.
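A rough sketch of this dictionary approach, assuming NLTK with the WordNet corpus installed; the seed words per class are made up:

    from nltk.corpus import wordnet as wn  # needs nltk.download("wordnet")

    # Made-up seed words per class; real dictionaries would be richer.
    seeds = {
        "sports": ["football", "cricket", "match"],
        "health": ["doctor", "fitness", "disease"],
    }

    def expand(words):
        # Grow each seed list with WordNet synonyms, hyponyms, hypernyms.
        vocab = set(words)
        for w in words:
            for syn in wn.synsets(w):
                vocab.update(l.name() for l in syn.lemmas())
                for rel in syn.hyponyms() + syn.hypernyms():
                    vocab.update(l.name() for l in rel.lemmas())
        return vocab

    dictionaries = {cls: expand(ws) for cls, ws in seeds.items()}

    def label(tweet):
        # Rank classes by how many dictionary words occur in the tweet.
        tokens = set(tweet.lower().split())
        return max(dictionaries, key=lambda c: len(tokens & dictionaries[c]))

    print(label("Great football match at the stadium today"))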
Now the question is accuracy. It depends on the data and the versatility of your classes. This may be overkill, but it may come close to what you want.
Furthermore, you can label some set of tweets this way and use cosine similarity to identify other tweets (a small sketch of that step follows as well). This will help with the optimization part. But then again, it's up to you, as you know what trade-offs you can bear.
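A small sketch of that cosine-similarity step, assuming scikit-learn and made-up tweets:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    labeled = ["great football match today", "new fitness routine for health"]
    labels = ["sports", "health"]
    new_tweet = "watched an amazing cricket match"

    vectors = TfidfVectorizer().fit_transform(labeled + [new_tweet])
    sims = cosine_similarity(vectors[-1], vectors[:-1])  # new vs. labeled
    print(labels[sims.argmax()])  # most similar labeled tweet wins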
The real struggle will be the machine learning part and how you manage it.
Actually, this seems like a typical use case for semi-supervised learning. There are plenty of methods of use here, including clustering with constraints (where you force the model to cluster samples from the same class together) and transductive learning (where you try to extrapolate a model from labeled samples onto the distribution of unlabeled ones).
You could also simply cluster the data as @Shoaib suggested, but then you will have to come up with a heuristic approach for dealing with clusters with mixed labeling. Furthermore, solving an optimization problem not related to the task (labeling) will obviously not be as good as actually using this knowledge.
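As one concrete instance of the semi-supervised route, a hedged sketch with scikit-learn's LabelSpreading, where -1 marks unlabeled samples; the tweets themselves are invented:

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.semi_supervised import LabelSpreading

    tweets = [
        "football match tonight",    # labeled: 0 = sports
        "doctor visit and checkup",  # labeled: 1 = health
        "cricket team won the cup",  # unlabeled
        "new fitness tips",          # unlabeled
    ]
    y = np.array([0, 1, -1, -1])  # -1 marks unlabeled samples

    X = TfidfVectorizer().fit_transform(tweets).toarray()
    model = LabelSpreading(kernel="knn", n_neighbors=2)
    model.fit(X, y)
    print(model.transduction_)  # inferred labels for all four tweets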
You can use clustering for this task. To do that, you first have to label some examples for each class. Then, using those labeled examples, you can easily identify the class of each cluster.
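A brief sketch of that clustering route with scikit-learn's KMeans; the tweets are made up, and the cluster-to-class mapping would come from your labeled examples:

    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    tweets = ["football match tonight", "cricket team won the cup",
              "doctor visit and checkup", "new fitness tips"]
    X = TfidfVectorizer().fit_transform(tweets)

    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    # A few hand-labeled examples then tell you which cluster is
    # "sports" and which is "health".
    print(km.labels_)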

How to programmatically classify a list of objects

I'm trying to take a long list of objects (in this case, applications from the iTunes App Store) and classify them more specifically. For instance, there are a bunch of applications currently classified as "Education," but I'd like to label them as Biology, English, Math, etc.
Is this an AI/Machine Learning problem? I have no background in that area whatsoever but would like some resources or ideas on where to start for this sort of thing.
Yes, you are correct. Classification is a machine learning problem, and classifying stuff based on text data involves natural language processing.
The canonical classification problem is spam detection using a Naive Bayes classifier, which is very simple. The idea is as follows:
Gather a bunch of data (emails), and label them by class (spam, or not spam)
For each email, remove stopwords, and get a list of the unique words in that email
Now, for each word, calculate the probability that it appears in a spam email vs. a non-spam email (i.e., count occurrences in spam vs. non-spam)
Now you have a model: the probability of an email being spam given that it contains a given word. However, an email contains many words. In Naive Bayes, you assume the words occur independently of each other (which turns out to be an OK assumption), and multiply the probabilities of all the words in the email together.
You usually divide the data into training and testing sets, so you'll have a set of emails you train your model on, and then a set of labeled emails you test against, where you calculate precision and recall (a compact sketch of the whole recipe follows).
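A compact sketch of the recipe above using NLTK's built-in Naive Bayes classifier; the tiny labeled corpus is invented, and the stopwords corpus must be downloaded once:

    import nltk
    from nltk.corpus import stopwords  # needs nltk.download("stopwords")

    # A tiny invented corpus standing in for real labeled emails.
    emails = [
        ("win a free prize now", "spam"),
        ("cheap meds online", "spam"),
        ("meeting agenda for monday", "ham"),
        ("lunch with the team", "ham"),
    ]
    stop = set(stopwords.words("english"))

    def features(text):
        # Bag-of-words features: which non-stopwords the email contains.
        return {w: True for w in text.split() if w not in stop}

    train = [(features(text), label) for text, label in emails]
    clf = nltk.NaiveBayesClassifier.train(train)
    print(clf.classify(features("claim your free prize")))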
I'd highly recommend playing around with NLTK, a python machine learning and nlp library. It's very user friendly and has good docs and tutorials, and is a good way to get acquainted with the field.
EDIT: Here's an explanation of how to build a simple NB classifier with code.
Probably not. You'd need to do a fair bit of work to extract data in some usable form (such as names), and at the end of the day, there are probably few enough categories that it would simply be easier to manually identify a list of keywords for each category and set a parser loose on titles/descriptions.
For example, you could look through half a dozen biology apps, and realize that in the names/descriptions/whatever you have access to, the words "cell," "life," and "grow" appear fairly often - not as a result of some machine learning, but as a result of your own human intuition. So build a parser to classify everything with those words as biology apps, and do similar things for other categories.
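A toy version of such a keyword parser; the keyword lists are hand-picked for illustration, not learned:

    # Hand-picked keyword lists per category (illustrative, not learned).
    keywords = {
        "biology": {"cell", "life", "grow"},
        "math": {"algebra", "number", "equation"},
    }

    def classify(description):
        words = set(description.lower().split())
        matches = [cat for cat, kws in keywords.items() if words & kws]
        return matches or ["unclassified"]  # flag for manual review

    print(classify("Watch a cell grow and learn about life"))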
Unless you're trying to classify the entire iTunes app store, that should be sufficient, and it would be a relatively small task for you to manually check any apps with multiple classifications or no classifications. The labor involved with using a simple parser + checking anomalies manually is probably far less than the labor involved with building a more complex parser to aid machine learning, setting up machine learning, and then checking everything again, because machine learning is not 100% accurate.
