Tweet classification into multiple categories on (Unsupervised data/tweets)

Tweet classification into multiple categories on (Unsupervised data/tweets) - python

I want to classify the tweets into predefined categories (like: sports, health, and 10 more). If I had labeled data, I would be able to do the classification by training Naive Bayes or SVM. As described in http://cucis.ece.northwestern.edu/publications/pdf/LeePal11.pdf
But I cannot figure out a way with unlabeled data. One possibility could be using Expectation-Maximization and generating clusters and label those clusters. But as said earlier I have predefined set of classes, so clustering won't be as good.
Can anyone guide me on what techniques I should follow. Appreciate any help.

Alright by what i can understand i think there are multiple ways to attend to this case.
there will be trade offs and the accuracy rate may vary. because of the well know fact and observation
Each single tweet is distinct!
(unless you are extracting data from twitter stream api based on tags and other keywords). Please define the source of data and how are you extracting it. i am assuming you're just getting general tweets which can be about anything
The thing you can do is to generate a set of dictionary for each class you have
(i.e Music => pop , jazz , rap , instruments ...)
which will contain relevant words to that class. You can use NLTK for python or Stanford NLP for other languages.
You can start with extracting
Synonyms
Hyponyms
Hypernyms
Meronyms
Holonyms
Go see these NLP Lexical semantics slides. it will surely clear some of the concepts.
Once you have dictionaries for each classes. cross compare them with the tweets you have got. the tweet which has the most similarity (you can rank them according to the occurrences of words from the these dictionaries) you can label it to that class. This will make your tweets labeled like others.
Now the question is the accuracy! But it depends on the data and versatility of your classes. This may be an "Over kill" But it may come close to what you want.
Furthermore you can label some set of tweets this way and use Cosine Similarity to cross identify other tweets. This will help with the optimization part. But then again its up-to you. As you know what Trade offs you can bear
The real struggle will be the machine learning part and how you manage that.

Actually this seems as a typical use case of semi-supervised learning. There are plenty methods of use here, including clustering with constraints (where you force model to cluster samples from the same class together), transductive learning (where you try to extrapolate model from labeled samples onto distribution of unlabeled ones).
You could also simply cluster data as #Shoaib suggested, but then you will have to come up the the heuristic approach how to deal with clusters with mixed labeling. Futhermore - obviously solving optimziation problem not related to the task (labeling) will not be as good as actually using this knowledge.

You can use clustering for that task. For that you have to label some examples for each class first. Then using these labeled examples, you can identify the class of each cluster easily.

Related

Which document embedding model for document similarity

First, I want to explain my task. I have a dataset of 300k documents with an average of 560 words (no stop word removal yet) 75% in German, 15% in English and the rest in different languages. The goal is to recommend similar documents based on an existing one. At the beginning I want to focus on the German and English documents.  
To achieve this goal I looked into several methods on feature extraction for document similarity, especially the word embedding methods have impressed me because they are context aware in contrast to simple TF-IDF feature extraction and the calculation of cosine similarity. 
I'm overwhelmed by the amount of methods I could use and I haven't found a proper evaluation of those methods yet. I know for sure that the size of my documents are too big for BERT, but there is FastText, Sent2Vec, Doc2Vec and the Universal Sentence Encoder from Google. My favorite method based on my research is Doc2Vec even though there aren't any or old pre-trained models which means I have to do the training on my own.
Now that you know my task and goal, I have the following questions:
Which method should I use for feature extraction based on the rough overview of my data?
My dataset is too small to train Doc2Vec on it. Do I achieve good results if I train the model on English / German Wikipedia?

You really have to try the different methods on your data, with your specific user tasks, with your time/resources budget to know which makes sense.
You 225K German documents and 45k English documents are each plausibly large enough to use Doc2Vec - as they match or exceed some published results. So you wouldn't necessarily need to add training on something else (like Wikipedia) instead, and whether adding that to your data would help or hurt is another thing you'd need to determine experimentally.
(There might be special challenges in German given compound words using common-enough roots but being individually rare, I'm not sure. FastText-based approaches that use word-fragments might be helpful, but I don't know a Doc2Vec-like algorithm that necessarily uses that same char-ngrams trick. The closest that might be possible is to use Facebook FastText's supervised mode, with a rich set of meaningful known-labels to bootstrap better text vectors - but that's highly speculative and that mode isn't supported in Gensim.)

How to categorize tweets(supportive vs. unsupportive) to predict elections results

The idea
I am collecting tweets talking about the three major candidates for the US presidency in November. After collecting the tweets from different states, I will score these tweets, and then analyze each candidate's followers/supporters on various aspects.
The problem
I am sure what method I should use to classify the tweets in order to produce reasonable outcomes. More precisely, I don't know how to tell if a tweet is supporting or opposing a specific candidate.
What I tried
I tried to use a library called textblob. Given a tweet, it returns a tuple of the form Sentiment(polarity, subjectivity). Polarity is a float which lies in the range of [-1,1] where 1 means positive statement and -1 means a negative statement. This method does not return reasonable results at all when applied. For example, given a tweet like "Donald Trump is horrible! I still support him tho.", it returns a polarity of -1.0 (negative), which does not make any sense at all.
Further Research
I looked for more examples and found this. In that example, the author uses a mood vocabulary (from the internet) and later assigns a mood to each tweet. I am planning to take a close look at that article and apply the method used there.
My questions
What is a proper way to categorize those tweets? Should I consider the one I mentioned in the further research section?
What if a tweet contains both names? something like "Trump will beat Biden". How something like this is going to be scored using a specific method?

Sentiment Analysis is a topic of NLP (Natural language processing). What you are wondering is about NLP which is one of many interesting branches of machine learning.
Your approach to weigh out the polarity of a tweet is correct. Since you have 3 different candidates aka. labels, this is a more interesting problem.
I would store a sentiment array of length 3 for each tweet in order to calculate weights for each candidate. For example; a tweet like "A is bad. B will do much better! But also C might too?" might produce [-1, 0.8, 0.4].
To do that, you need a corpus. The corpus aka your dataset should contain tweets and labels for each tweet so that your machine learning model can learn from the tweets.
There are many ways to build a machine learning model and train it with your dataset. That is a topic of data science. Data scientists try to improve some performance indicators in order to improve their model.
The simpliest would be something like, parse out all words out of tweet, increment their values in a hash map with labels and normalize. Now you have a hash map containing the sentiment value for each word.
But in reality, this would not work out well as outliers and lack of dataset would affect your result. Therefore, you need to look at your data, problem and choose the right machine learning model. Look at this article to learn more about building a sentiment classifier.

NLP-steps or approch to classify text?

I'm working on a project to classify restaurant reviews on sentiment(positive or negative) basis. Also I want to classify that if these comments belongs to food, service, value-for-money, etc category. I am unable to link the steps or the methodology provided on the internet. can anyone provide detailed method or steps to get to the solution.

How about using bag of words model. It's been tried and tested for ages. It has some downsides compared to more modern methods, but you can still get decent results. And there are tons of material on internet to help you:
Normalize documents to the form ingestable by your pipeline
Convert documents to vectors and perform TF-IDF to filter irrelevant terms.
Here is a good tutorial. And convert them to vector form.
Split your documents get some subset of documents and mark the ones that belong to training data according to classes ( Sentiment ) / type of comments. Clearly your documents will belong to two classes.
Apply some type of dimensionality reduction technique to make your models more robust, good discussion is here
Train your models on your training data. You need at least two models one for sentiment and one for type. Some algorithms work with binary classes only so you might need more than to models for comment type ( Food, Value, Service). This might be a good thing because a comment can belong to more than one class ( Food quality and Value, or Value and Service). Scikit-learn has a lot of good models, also I highly recommend orange toolbox it's like a GUI for data science.
Validate your models using validation set. If you accuracy is satisfactory (most classical methods like SVM should give you at leat 90%) go ahead and use it for incoming data

Text classification in Python based on large dict of string:string

I have a dataset that would be equivalent to a dict of 5 millions key-values, both strings.
Each key is unique but there are only a couple hundreds of different values.
Keys are not natural words but technical references. The values are "families", grouping similar technical references. Similar is meant in the sense of "having similar regex", "including similar characters", or some sort of pattern.
Example of key-values:
ADSF33344 : G1112
AWDX45603 : G1112
D99991111222 : X3334
E98881188393 : X3334
A30-00005-01 : B0007
B45-00234-07A : B0007
F50-01120-06 : B0007
The final goal is to feed an algorithm with a list of new references (never seen before) and the algorithm would return a suggested family for each reference, ideally together with a percentage of confidence, based on what it learned from the dataset.
The suggested family can only come from the existing families found in the dataset. No need to "invent" new family name.
I'm not familiar with machine learning so I don't really know where to start. I saw some solutions through Sklearn or TextBlob and I understand that I'm looking for a classifier algorithm but every tutorial is oriented toward analysis of large texts.
Somehow, I don't find how to handle my problem, although it seems to be a "simpler" problem than analysing newspaper articles in natural language...
Could you indicate me sources or tutorials that could help me?

Make a training dataset, and train a classifier. Most classifiers work on the values of a set of features that you define yourself. (The kind of features depends on the classifier; in some cases they are numeric quantities, in other cases true/false, in others they can take several discrete values.) You provide the features and the classifier decides how important each feature is, and how to interpret their combinations.
By way of a tutorial you can look at chapter 6 of the NLTK book. The example task, the classification of names into male and female, is structurally very close to yours: Based on the form of short strings (names), classify them into categories (genders).
You will translate each part number into a dictionary of features. Since you don't show us the real data, nobody give you concrete suggestions, but you should definitely make general-purpose features as in the book, and in addition you should make a feature out of every clue, strong or weak, that you are aware of. If supplier IDS differ in length, make a length feature. If the presence (or number or position) of hyphens is a clue, make that into a feature. If some suppliers' parts use a lot of zeros, ditto. Then make additional features for anything else, e.g. "first three letters" that might be useful. Once you have a working system, experiment with different feature sets and different classifier engines and algorithms, until you get acceptable performance.
To get good results with new data, don't forget to split up your training data into training, testing and evaluation subsets. You could use all this with any classifier, but the NLTK's Naive Bayes classifier is pretty quick to train so you could start with that. (Note that the features can be discrete values, e.g. first_letter can be the actual letter; you don't need to stick to boolean features.)

Python NLTK difference between a sentiment and an incident

Hi i want to implement a system which can identify whether the given sentence is an incident or a sentiment.
I was going through python NLTK and found out that there is a way to find out positivity or negativity of a sentense.
Found out the ref link: ref link
I want to achieve like
My new Phone is not as good as I expected should be treated as sentiment
and Camera of my phone is not working should be considered as incident.
I gave a Idea of making my own clusters for training my system for finding out such but not getting a desired solution is there a built-in way to find that or any idea on how can be approach for solution of same.
Advance thanks for your time.

If you have, or can construct, a corpus of appropriately categorized sentences, you could use it to train a classifier. There can be as many categories as you need (two, three or more).
You'll have to do some work (reading and experimenting) to find the best features to use for the task. I'd start by POS-tagging the sentence so you can pull out the verb(s), etc. Take a look at the NLTK book's chapter on classifiers.
Use proper training/testing methodology (always test on data that was not seen during training), and make sure you have enough training data-- it's easy to "overtrain" your classifier so that it does well on the training data, by using characteristics that coincidentally correlate with the category but will not recur in novel data.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.