classify website business domain - python

I want to identify the category/business_domain of the business to which a website belongs.
For example, the Superhuman website: the company makes an email client powered by buzzword features and UI.
So, in short, the category of the website could be Professional Email Services.
To get this done, one of my initial thoughts was to apply the LDA algorithm (Python module) to the About Us text of a website and the company's Facebook info page, given that we have both. But this approach is still not working in many cases. Any insights?
LDA details:
Using 20,000 passes and 1 topic, my result for the website http://aakritiartgallery.com/ is
[(0, u'0.050*art + 0.020*aakriti + 0.019*contemporary + 0.017*gallery + 0.015*new')]
How can I narrow down to the business category from these term probabilities given by LDA?

As Anony-Mousse said, it would help to make a road map instead of fixating on a single algorithm. Given your situation, this is what I would do.
Preprocessing/Feature Extraction
NMF, LSA, and LDA are unsupervised techniques mostly used in preprocessing to extract meaningful features. In NLP, this usually corresponds to extracting meaningful words from large amounts of text. By using these techniques, you can process raw data to obtain meaningful features. These algorithms do not offer predictions by themselves, and they are usually not enough to create a good model on their own.
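For illustration, a minimal gensim sketch of that feature-extraction step (the tokenised About Us texts below are made-up assumptions): run LDA and keep the top topic terms as candidate features.
from gensim import corpora, models

# Toy tokenised "About Us" texts (assumed examples)
docs = [
    ["art", "gallery", "contemporary", "aakriti", "exhibition"],
    ["email", "client", "productivity", "inbox", "superhuman"],
]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=50)
print(lda.print_topics(num_words=5))  # top terms per topic become candidate features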
Training
In your case, you would need structured data to train your model and make predictions. For instance, you can use the results of your LDA (in practice, the indices of those keywords) mapped to a business domain (your label), e.g.:
(label)IT : (features) java, python, server
(label)Zoo: (features) monkey, zebra, giraffe
(label)IT : (features) nlp, machine learning
After you have gathered some data (at the very least #features * #labels examples), you can train a supervised model of your choice (logistic regression, SVM, NN, etc.).
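A rough sketch of that supervised step with made-up texts and labels; scikit-learn's TfidfVectorizer and LogisticRegression stand in for whichever feature encoding and classifier you pick.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Keyword-style texts from the feature-extraction step, each mapped to a business-domain label
texts = [
    "java python server cloud hosting",
    "monkey zebra giraffe safari tickets",
    "nlp machine learning models data",
]
labels = ["IT", "Zoo", "IT"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(texts, labels)
print(clf.predict(["contemporary art gallery exhibition"]))  # needs an 'Art' class in the training data to be right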
Testing
Evaluate your prediction score, then deploy the algorithm.
Having said this, this would be no easy task. You would have to deal with identifying categories/subcategories, other means of extracting meaningful features, etc., so I would put a long timeframe on this project. Good luck!

Get training data
Train a classifier
Classify!

Related

Does fine-tuning a BERT model multiple times with different datasets make it more accurate?

I'm totally new to NLP and BERT models.
What I'm trying to do right now is sentiment analysis on Twitter trending hashtags ("neg", "neu", "pos") using a DistilBERT model, but the accuracy was about 50% (I tried with labelled data taken from Kaggle).
So here is my idea:
(1) First, I will fine-tune the DistilBERT model (Model 1) with the IMDB dataset.
(2) After that, since I've got some data taken from Twitter posts, I will run sentiment analysis on them with Model 1 and get Result 2.
(3) Then I will fine-tune Model 1 again on Result 2, expecting to get Model 3.
I'm not really sure whether this process does anything meaningful to make the model more accurate.
Thanks for reading my post.
I'm a little skeptical about your first step. Since the IMDB dataset is different from your target data, I do not think it will positively affect the outcome of your work. Thus, I would suggest fine-tuning on a dataset of tweets or other social media posts with hashtags; however, if you are only focusing on hashtags and do not care about the text, it might work. My limited experience with fine-tuning transformers like BART and BERT is that the dataset you fine-tune on should be very similar to your actual data. In general, though, you can fine-tune a model on different datasets, and if the datasets are structured toward one goal, it can improve the model's accuracy.
If you want to fine-tune a sentiment classification head of BERT for classifying tweets, then I'd recommend a different strategy:
The IMDB dataset covers a different kind of sentiment - the ratings do not really correspond to short-post sentiment, unless you want to focus on tweets about movies.
Using classifier output as input for further training of that classifier is not really a good approach, because if the classifier made many mistakes while classifying, these will be reflected in the training, and so the errors will deepen. This is basically creating endogenous labels, which will not really improve your real-world classification.
You should consider other ways of obtaining labelled training data. There are a few good examples for twitter:
Twitter datasets on Kaggle - there are plenty of datasets available containing millions of various tweets. Some of those even contain sentiment labels (usually inferred from emoticons, as these were proven to be more accurate than words in predicting sentiment - for explanation see e.g. Frasincar 2013). So that's probably where you should look.
Stocktwits (if you're interested in financial sentiment) - contains posts that authors can label for sentiment, and is thus a perfect way of mining labelled data if stocks/crypto is what you're looking for.
Another thing is picking a model better suited to your kind of language; I'd recommend this one. It has been pretrained on 80M tweets, so it should provide strong improvements. I believe it even comes with a sentiment classification head that you can use.
Roberta Twitter Base
Check out the model page for guidance on loading the model in your code - it's very easy; just use the following (this is for sentiment classification):
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "cardiffnlp/twitter-roberta-base-sentiment"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
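A short usage sketch for the code above (the example tweet is made up; the label order follows the model card):
import torch

text = "Loving the new features in this update!"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.softmax(logits, dim=-1)[0]
# Per the model card, the label order is negative, neutral, positive
print(dict(zip(["negative", "neutral", "positive"], probs.tolist())))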
Another benefit of this model is that it has been pretrained from scratch with a vocabulary that contains emojis, meaning it has a deep understanding of them, their typical contexts and co-occurrences. This can greatly benefit social media classification, as many researchers would agree that emojis/emoticons are better predictors of sentiment than normal words.

Generating dictionaries to categorize tweets into pre-defined categories using NLTK

I have a list of Twitter users (screen_names) and I need to categorise them into 7 pre-defined categories - Education, Art, Sports, Business, Politics, Automobiles, Technology - based on their interest area.
I have extracted last 100 tweets of the users in Python and created a corpus for each user after cleaning the tweets.
As mentioned in Tweet classification into multiple categories on (Unsupervised data/tweets):
I am trying to generate dictionaries of common words under each category so that I can use them for classification.
Is there a method to generate these dictionaries for a custom set of words automatically?
Then I can use these for classifying the twitter data using a tf-idf classifier and get the degree of correspondence of the tweet to each of the categories. The highest value will give us the most probable category of the tweet.
But since the categorisation is based on these pre-generated dictionaries, I am looking for a way to generate them automatically for a custom list of categories.
Sample dictionaries :
Education - ['book','teacher','student'....]
Automobiles - ['car','auto','expo',....]
Example I/O:
**Input :**
UserA - "students visited share learning experience eye opening
article important preserve linaugural workshop students teachers
others know coding like know alphabets vision driving codeindia office
initiative get students tagging wrong people apologies apologies real
people work..."
.
.
UserN - <another corpus of cleaned tweets>
**Expected output** :
UserA - Education (61%)
UserN - Automobiles (43%)
TL;DR
Labels are necessary for supervised machine learning. And if you don't have training data that contains Xs (input texts) and Y (output labels) then (i) supervised learning might not be what you're looking for or (ii) you have to create a dataset with texts and their corresponding labels.
In Long
Let's try to break it down and reflect on what you're looking for.
I have a list of twitter users (screen_names) and I need to categorise them into 7 pre-defined categories - Education, Art, Sports, Business, Politics, Automobiles, Technology
So your ultimate task is to label tweets into 7 categories.
I have extracted last 100 tweets of the users in Python and created a corpus for each user after cleaning the tweets.
100 data points is definitely insufficient to do anything if you want to train a supervised machine learning model from scratch.
Another thing is the definition of corpus. A corpus is a body of text, so it's not wrong to call any list of strings a corpus. However, to do any supervised training, each text should come with the corresponding label(s).
But I see some people do unsupervised classification without any labels!
Now, that's an oxymoron =)
Unsupervised Classification
Yes, there is "unsupervised learning", which often means learning a representation of the inputs; generally, the representation of the inputs is used to (i) generate or (ii) sample.
Generation from a representation means creating, from the representation, a data point that is similar to the data the unsupervised model has learnt from. In the case of text processing / NLP, this often means generating new sentences from scratch, e.g. https://transformer.huggingface.co/
Sampling from a representation means giving the unsupervised model a text, and the model is expected to provide some signal based on what it has learnt. E.g. given a language model and a novel sentence, we want to estimate the probability of the sentence, and then use this probability to compare across different sentences' probabilities.
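As a rough sketch of that second use, assuming GPT-2 as the language model, you can score sentences by their average per-token log-likelihood:
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def sentence_score(text):
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean negative log-likelihood per token
    return -loss.item()  # higher means more probable under the model

print(sentence_score("The cat sat on the mat."))
print(sentence_score("Mat the on sat cat the."))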
Algorithmia has a nice summary blogpost https://algorithmia.com/blog/introduction-to-unsupervised-learning and a more modern perspective https://sites.google.com/view/berkeley-cs294-158-sp20/home
That's a whole lot of information but you don't tell me how to #$%^&-ing do unsupervised classification!
Yes, the oxymoron explanation isn't finished. If we look at text classification, what are we exactly doing?
We are fitting the input text into some pre-defined categories. In your case, the labels are pre-defined but
Q: Where exactly would the signal come from?
A: From the tweets, of course, stop distracting me! Tell me how to do classification!!!
Q: How do you tell the model that a tweet should be this label and not another label?
A: From the unsupervised learning, right? Isn't that what unsupervised learning is supposed to do? To map the input texts to the output labels?
Precisely, that's the oxymoron:
Supervised learning maps the input texts to output labels, not unsupervised learning.
So what do I do? I need to use unsupervised learning and I want to do classification.
Then the question to ask is:
Do you have labelled data?
If no, then how to get labels?
Use proxies: find signals that tell you a certain tweet has a certain label, e.g. from the hashtags, or make some assumptions that some people always tweet about a certain category (see the sketch after this list)
Use existing tweet classifiers to label your data and then train the classification model on the data
Do I have to pay for these classifiers? Most often, yes you do. https://english.api.rakuten.net/search/text%20classification
If yes, then how much?
If it's too little,
then how to create more? Maybe https://machinelearningmastery.com/a-gentle-introduction-to-the-bootstrap-method/
or maybe use some modern post-training algorithm https://towardsdatascience.com/https-medium-com-chaturangarajapakshe-text-classification-with-transformer-models-d370944b50ca
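A small sketch of the proxy-label idea mentioned above; the hashtag-to-category mapping and example tweets are assumptions you would define yourself:
# Hypothetical hashtag-to-label mapping used as a weak labelling signal
HASHTAG_TO_LABEL = {
    "#exams": "Education", "#teachers": "Education",
    "#f1": "Automobiles", "#carshow": "Automobiles",
    "#elections": "Politics",
}

def proxy_label(tweet):
    """Return a category if a known hashtag appears in the tweet, else None."""
    for token in tweet.lower().split():
        if token in HASHTAG_TO_LABEL:
            return HASHTAG_TO_LABEL[token]
    return None  # unlabelled; drop it or label it manually

tweets = ["Visited the #carshow today!", "Revising hard for #exams this week"]
labelled = [(t, proxy_label(t)) for t in tweets if proxy_label(t) is not None]
print(labelled)  # weakly labelled pairs you can feed to a supervised classifier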
How about all this AI I keep hearing about, where I can do classification with 3 lines of code?
Don't they use unsupervised language models that sound like Sesame Street characters, e.g. ELMo, BERT, ERNIE?
I guess you mean something like https://github.com/ThilinaRajapakse/simpletransformers#text-classification
from simpletransformers.classification import ClassificationModel
import pandas as pd
# Train and Evaluation data needs to be in a Pandas Dataframe of two columns. The first column is the text with type str, and the second column is the label with type int.
train_data = [['Example sentence belonging to class 1', 1], ['Example sentence belonging to class 0', 0]]
train_df = pd.DataFrame(train_data)
eval_data = [['Example eval sentence belonging to class 1', 1], ['Example eval sentence belonging to class 0', 0]]
eval_df = pd.DataFrame(eval_data)
# Create a ClassificationModel
model = ClassificationModel('bert', 'bert-base-cased') # You can set class weights by using the optional weight argument
# Train the model
model.train_model(train_df)
Take careful notice of the comment:
Train and Evaluation data needs to be in a Pandas Dataframe of two columns. The first column is the text with type str, and the second column is the label with type int.
Yes, that's the more modern approach to:
First use a pre-trained language model to convert your texts into input representations
Then feed the input representations and their corresponding labels to a classifier
Note, you still can't avoid the fact that you need labels to train the supervised classifier.
Wait a minute, you mean all this AI I keep hearing about is not "unsupervised classification"?
Exactly. There's really no such thing as "unsupervised classification" (yet); somehow (i) the labels need to be manually defined, and (ii) the mapping between the inputs and the labels has to exist.
The right word for the paradigm would be transfer learning, where the language model is
learned in a self-supervised manner (it's actually not truly unsupervised) so that the model learns to convert any text into some numerical representation,
and then the numerical representation is used with labelled data to produce the classifier.

NLP data preparation and sorting for text-classification task

I have read a lot of tutorials on the web and topics on Stack Overflow, but one question is still foggy for me. Considering just the stage of collecting data for multi-label training, which of the ways below is better, and are both of them acceptable and effective?
Try to find 'pure' one-labeled examples at any cost.
Every example can be multi-labeled.
For instance, I have articles about war, politics, economics, and culture. Usually, politics is tied to economics, war is connected to politics, economic issues may appear in culture articles, etc. I can either assign strictly one main theme to each example and drop uncertain ones, or assign 2-3 topics.
I'm going to train using spaCy; the volume of data will be about 5-10 thousand examples per topic.
I'd be grateful for any explanation and/or a link to some relevant discussion.
You can try the OneVsAll / OneVsRest strategy. This will allow you to do both: predict exactly one category without needing to strictly assign a single label to each training example.
Also known as one-vs-all, this strategy consists in fitting one
classifier per class. For each classifier, the class is fitted against
all the other classes. In addition to its computational efficiency
(only n_classes classifiers are needed), one advantage of this
approach is its interpretability. Since each class is represented by
one and one classifier only, it is possible to gain knowledge about
the class by inspecting its corresponding classifier. This is the most
commonly used strategy for multiclass classification and is a fair
default choice.
This strategy can also be used for multilabel learning, where a
classifier is used to predict multiple labels for instance, by fitting
on a 2-d matrix in which cell [i, j] is 1 if sample i has label j and
0 otherwise.
Link to docs:
https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html
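A minimal sketch of that multilabel setup with made-up articles and labels (TfidfVectorizer and LogisticRegression are just example choices for features and base estimator):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

texts = [
    "parliament passed the new budget",
    "oil prices push inflation higher",
    "ceasefire talks continue at the border",
]
labels = [["politics", "economics"], ["economics"], ["war", "politics"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)  # 2-d matrix: cell [i, j] is 1 if sample i has label j

clf = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LogisticRegression(max_iter=1000)))
clf.fit(texts, Y)
pred = clf.predict(["the government raised interest rates"])
print(mlb.inverse_transform(pred))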

In Sklearn machine learning, is there any way to classify text without target labels?

I was wondering if there is any way to classify text data into different groups/categories based on the words in the text, using a combination of Python and scikit-learn machine learning.
For example:
text = [["request approval for access", "request approval to enter premises", "Laptop not working"], ["completed bw table loading"]]
So can I get categories like:
category_label = [[0,0,2], [1]]
categories = [["approval request", "approval request", "Laptop working"], ["bw table"]]
where
0 = approval request
2 = laptop working
1 = bw table
Basically the above would imply that there is no labelled training data or target labels.
This is readily possible in Scikit-Learn as well as in NLTK.
The features that you list:
0 = approval request
2 = laptop working
1 = bw table
are not ones that a clustering algorithm would naturally choose, and it's worthwhile to caution you against the possible mistake of clouding your statistical learning algorithm with heuristics. I suggest that you first try some clustering and classification, and then consider semi-supervised learning methods whereby you can label your clusters and propagate those labels.
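For example, a rough clustering sketch over the sample texts from the question (cluster ids will not correspond to your intended categories until you inspect and label the clusters yourself):
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

texts = [
    "request approval for access",
    "request approval to enter premises",
    "Laptop not working",
    "completed bw table loading",
]
X = TfidfVectorizer().fit_transform(texts)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)  # e.g. [0 0 2 1]; inspect each cluster before naming it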
You can try a clustering method, but there is no guarantee that the clusters you get will correspond to the categories you want, because you haven't clearly explained to the algorithm what you want.
What I would do is manually label some data (how long can it take to label 300 samples?) and train on that, so that your algorithm can learn the words that are correlated with each class.
If this is impossible, then your best bet is to calculate a cosine similarity between each sample and each class description, rank them, and assign the closest class. But in my opinion, by the time you finish coding this, you could have manually labeled some samples and trained a standard algorithm with much better precision.
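A hedged sketch of that fallback; the class descriptions below are assumptions you would write yourself:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

class_descriptions = {
    "approval request": "request approval access permission enter premises",
    "laptop working": "laptop computer hardware not working broken",
    "bw table": "bw table loading data warehouse",
}
samples = ["request approval for access", "completed bw table loading"]

vec = TfidfVectorizer()
X = vec.fit_transform(list(class_descriptions.values()) + samples)
class_vecs, sample_vecs = X[: len(class_descriptions)], X[len(class_descriptions):]

sims = cosine_similarity(sample_vecs, class_vecs)
for text, row in zip(samples, sims):
    print(text, "->", list(class_descriptions)[row.argmax()])  # closest class description wins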
#user1452759
Your problem is more specific than general machine learning, and you should use the NLTK package instead of sklearn. Look at classifying text with NLTK: http://www.nltk.org/book/ch06.html

Signal feature identification

I am trying to identify phonemes in voices using a training database of known ones.
I'm wondering if there is a way of identifying common features within my training sample and using that to classify a new one.
It seems like there are two paths:
Give the process raw/normalised data and it will return similar ones
Extract certain metrics such as pitch, formants, etc. and compare them to the training set
My interest is the first!
Any recommendations on machine learning or regression methods/algorithms?
Since you tagged Python, I highly recommend looking into scikit-learn, an excellent Python library for Machine Learning. Their docs are very thorough, and should give you a good crash course in Machine Learning algorithms and implementation (including classification, regression, clustering, etc)
Your points 1 and 2 are not very different: 1) is the end result of a classification problem, and 2) is the set of features you give to the classifier. What you need is a good classifier (SVM, decision trees, hierarchical classifiers, etc.) and a good set of features (the pitch, formants, etc. that you mentioned).
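A minimal sketch of option 2, assuming you have already extracted per-sample acoustic features (the pitch/formant numbers and phoneme labels below are made up):
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical feature rows: [pitch_hz, formant1_hz, formant2_hz]
X = [[120, 730, 1090], [210, 270, 2290], [130, 660, 1700], [220, 300, 870]]
y = ["a", "i", "ae", "u"]  # phoneme labels

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X, y)
print(clf.predict([[125, 700, 1100]]))  # should come out near "a"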
