I have a dataset of words in a non-semantic context, basically names. I want to group all the similar ones (say samantha, samanta, sammanta, samaynta, ...) into the same groups.
Since it is a non-semantic context, I cannot vectorize the data using TF-IDF or something similar, so I am using the data as it is.
Note that I have already tried clustering: I used DBSCAN with a custom distance metric (Levenshtein) and PolyFuzz. Both gave some decent results, but they were not enough; the former produced a lot of misclusterings, and the latter missed a lot of data. I tried searching the internet for ways to approach this but, oddly, couldn't find any; everything I found assumed a semantic context using TF-IDF and NLP techniques.
Note: the dataset is relatively big (around 400,000 or more names).
I have been stuck on this and would appreciate any help, insight, or suggestions.
For clustering you need a distance metric, such as the Levenshtein distance. If that does not give you the desired result, you need to use another one.
I would start by defining what you mean by similar: obviously it is not just similarity in spelling, as otherwise the Levenshtein distance should work. What else is there? From your example it seems like the initial characters are important, so maybe use a string comparison that is weighted towards the beginning of a word.
Another approach is to use an algorithm tailored to names, such as the phonetic encoding Soundex.
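A minimal sketch of the Soundex idea, assuming the third-party jellyfish library (not mentioned in the thread) and a made-up list of names:
import jellyfish
from collections import defaultdict

names = ["samantha", "samanta", "sammanta", "samaynta", "michael", "mikael"]

groups = defaultdict(list)
for name in names:
    groups[jellyfish.soundex(name)].append(name)  # bucket names by phonetic code

for code, members in groups.items():
    print(code, members)  # the four samantha variants should share one code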
I have a dataframe (more than 1 million rows) that has an open text column where customers can write whatever they want.
Misspelled words appear frequently and I'm trying to group comments that are grammatically the same.
For example:
ID | Comment
1  | I want to change my credit card
2  | I wannt change my creditt card
3  | I want change credit caurd
I have tried using Levenshtein Distance but computationally it is very expensive.
Can you tell me another way to do this task?
Thanks!
The Levenshtein distance between two strings of length N has time complexity O(N^2).
If you define a maximum distance you are interested in, say m, you can reduce the time complexity to O(N·m). The maximum distance, in your context, is the maximum number of typos you accept while still considering two comments identical.
If you cannot do that, you may try to parallelize the task.
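A small sketch of the bounded-distance idea, assuming the rapidfuzz library, whose Levenshtein implementation accepts a cutoff; the example comments are taken from the question:
from rapidfuzz.distance import Levenshtein

MAX_TYPOS = 2  # maximum edit distance still treated as "the same comment"

a = "I want to change my credit card"
b = "I wannt change my creditt card"

# score_cutoff lets the computation stop early; if the true distance exceeds it,
# rapidfuzz returns score_cutoff + 1
d = Levenshtein.distance(a, b, score_cutoff=MAX_TYPOS)
print("same" if d <= MAX_TYPOS else "different", d)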
This is not a trivial task. If faced with this problem, my approach would be:
1. Tokenise your sentences. There are many ways to tokenise a sentence; the most straightforward is to convert it to a list of words. E.g. I want to change my credit card becomes [I, want, to, change, my, credit, card]. Another way is to roll a window of size n across your sentence, e.g. I want to becomes ['I w', ' wa', 'wan', 'ant', ...] for window size 3.
2. After tokenising, create an embedding (vectorise), i.e. convert your tokens to a vector of numbers. The simplest way is to use a ready-made library like sklearn's TfidfVectorizer. If your data cares about the order of the words, then a more sophisticated vectoriser is needed.
3. After vectorising, use a clustering algorithm. The simplest one is K-Means.
Of course, this is a very complicated task, and there could be a lot of ways to approach this problem. What I described is the simplest out-of-the-box solution; a sketch of it follows below. Some clever people have used different strategies to get better results. One example is https://www.youtube.com/watch?v=nlKE4gvJjMo. You will need to research this field on your own.
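To make the pipeline concrete, here is a minimal sketch with scikit-learn, using character 3-grams and a toy comment list (the cluster count is arbitrary for the example):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

comments = [
    "I want to change my credit card",
    "I wannt change my creditt card",
    "I want change credit caurd",
    "please close my account",
    "close the account now",
]

# character 3-grams (the rolling-window tokenisation) are more robust to typos than word tokens
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 3))
X = vectorizer.fit_transform(comments)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
for comment, label in zip(comments, kmeans.labels_):
    print(label, comment)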
Edit: of course, your approach is fine for a small dataset. The difficult part lies in doing better than O(n^2) complexity.
I have two columns, both retrieved from different sources but with the same identifier, and I need to check whether they are similar (with perhaps only differences in spelling) or completely different.
If you want to check whether the two sentences are similar except for spelling differences, then you can use the normalized Levenshtein distance, or string edit distance.
s1= "Quick brown fox"
s2= "Quiqk drown fox"
The Levenshtein distance between the two sentences is two.
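A quick way to reproduce this number, assuming the python-Levenshtein package; the normalisation shown is just one common choice:
import Levenshtein

s1 = "Quick brown fox"
s2 = "Quiqk drown fox"

d = Levenshtein.distance(s1, s2)      # raw edit distance: 2 substitutions
norm = 1 - d / max(len(s1), len(s2))  # one common normalisation to [0, 1]
print(d, norm)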
If you want to check for semantic differences, then you will probably have to use a machine-learning-based model. The simplest thing you can do for semantic similarity is to use a model like Sentence2Vec or Doc2Vec, get semantic embeddings for the two sentences, and compute their dot product.
As shubh gupta noted above, there are distance measures for strings. They usually return a magnitude related to the difference in characters or substrings. The Levenshtein distance is one of the most common ones. You can find a really good article that explains how it works here.
Looking at how your question is stated, I do not think you're looking for the semantic difference between your two input strings; you would need an NLP model for that. Maybe you can restate your question and provide more information on exactly what difference you want to measure.
I'm working on a model that would predict an exam schedule for a given course and term. My input would be the term and the course name, and the output would be the date. I'm currently done with the data cleaning and preprocessing step; however, I can't wrap my head around a way to make a model whose input is two strings and whose output is two numbers (the day and month of the exam). One approach I thought of would be encoding my course names and writing the term as a binary list, i.e. input: encoded(course), [0, 0, 1]; output: day, month, and then feeding that to a regression model.
I hope someone more experienced can suggest a better approach.
Before I start answering your question:
/rant
I know this sounds dumb and doesn't really help your question, but why are you using Neural Networks for this?!
To me, this seems like the classical case of "everybody uses ML/AI in their area, so now I have to, too!" (which is simply not the case). /rant over
For string-like inputs, there are several ways to encode them; choosing the right one may depend on your specific task. Since you have a very "simple" (and predictable) input, i.e. you know the full set of course titles in advance (no new/unseen titles at testing/inference time) and you do not need contextual/semantic information, you can resort to something like scikit-learn's LabelEncoder, which will turn each title into a distinct class.
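A minimal sketch of the LabelEncoder route, with made-up course titles and terms:
from sklearn.preprocessing import LabelEncoder
import numpy as np

courses = ["Linear Algebra", "Databases", "Linear Algebra", "Operating Systems"]
terms = ["fall", "spring", "fall", "fall"]

course_enc = LabelEncoder().fit(courses)
term_enc = LabelEncoder().fit(terms)

# each row becomes a pair of integer class ids, usable as model input
X = np.column_stack([course_enc.transform(courses), term_enc.transform(terms)])
print(X)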
Alternatively, you could throw a more heavyweight encoding structure at the problem, one that embeds the values in a matrix. Most DL frameworks offer some form of built-in function for this, which basically requires you to pass a unique index for each input value and actively learns a k-dimensional embedding vector for it. Intuitively, these embedding dimensions correspond to semantic or topical directions. If you have, for example, 3-dimensional embeddings, the first could represent "social sciences course", the second "technical course", and the third "seminar".
Of course, this is a simplification, but it helps in imagining how it works.
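As a rough illustration of such a learned embedding layer, assuming PyTorch (the sizes here are made up):
import torch
import torch.nn as nn

num_courses = 50   # hypothetical number of distinct course titles
embedding_dim = 3  # the k-dimensional embedding described above

embedding = nn.Embedding(num_courses, embedding_dim)

course_ids = torch.tensor([0, 7, 42])  # integer indices for three courses
vectors = embedding(course_ids)        # shape (3, 3); trained jointly with the rest of the model
print(vectors.shape)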
For the output, predicting a specific date is actually a really good question. As I have never predicted dates myself, I can only pass on tips from other users. A nice answer on dates (as input) is given here.
If you can sacrifice a little bit of accuracy in the result, predicting the calendar week in which the exam happens might be a good idea. Otherwise, you could simply treat it as two regressed values, but you might end up with invalid combinations (e.g. negative days/months, or something like "31st February").
Depending on how much high-quality training data you have, results might vary quite heavily. Lastly, I would again recommend that you reconsider whether you actually need a neural network for this task, or whether there are simpler methods to do this.
Create dummy variables for the text columns and use a RandomForest; together they handle text input and numerical output.
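A possible reading of this suggestion, sketched with pandas and scikit-learn on made-up data; note that the forest itself still needs numeric features, which is what the dummy variables provide:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

df = pd.DataFrame({
    "course": ["Linear Algebra", "Databases", "Operating Systems"],
    "term":   ["fall", "spring", "fall"],
    "day":    [12, 3, 25],
    "month":  [1, 6, 1],
})

X = pd.get_dummies(df[["course", "term"]])  # dummy variables for the text columns
y = df[["day", "month"]]                    # two regression targets at once

model = RandomForestRegressor(random_state=0).fit(X, y)
print(model.predict(X[:1]))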
I want to try and learn Deep Learning with Python.
The first thing that came to my mind for a useful scenario would be a Duplicate-Check.
Let's say you have a customer table with name, address, tel, and email, and want to insert new customers.
E.g.:
In Table:
Max Test, Teststreet 5, 00642 / 58458, info@max.de
To Insert:
Max Test, NULL, (+49)0064258458, test@max.de
This should be recognised as a duplicate entry.
Are there already tutorials out there for this use case? Or is it even possible with deep learning?
Duplicate matching is a special case of similarity matching. You can define the input features as either individual characters or whole fields and then train your network. It's a binary classification problem (true/false), unless you want a similarity score (e.g. 95% match). The network should be able to learn that punctuation and whitespace are irrelevant, and something like an 'or' function over the fields, so that a match on at least one of them produces a true positive.
Sounds like a fairly simple case for deep learning.
I don't know of any specific tutorial for this, but I tried to give you some keywords to look for.
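One hypothetical way to set up that framing: compute one similarity score per field for each record pair and feed those features to a small neural network classifier. This assumes scikit-learn and the rapidfuzz library; the pairs and labels below are made up, and a real model would need a properly labelled training set:
from rapidfuzz import fuzz
from sklearn.neural_network import MLPClassifier
import numpy as np

def pair_features(rec_a, rec_b):
    # one similarity score (0-100) per field: name, address, tel, email
    return [fuzz.ratio(a or "", b or "") for a, b in zip(rec_a, rec_b)]

pairs = [
    (("Max Test", "Teststreet 5", "00642 / 58458", "info@max.de"),
     ("Max Test", None, "(+49)0064258458", "test@max.de")),
    (("Max Test", "Teststreet 5", "00642 / 58458", "info@max.de"),
     ("Anna Muster", "Hauptstr. 1", "030 / 123456", "anna@muster.de")),
]
labels = [1, 0]  # 1 = duplicate, 0 = not a duplicate (made-up labels)

X = np.array([pair_features(a, b) for a, b in pairs])
clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0).fit(X, labels)
print(clf.predict(X))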
You can use:
duplicates = dataset.duplicated()
It will return a boolean Series flagging rows that are exact duplicates of an earlier row. Then:
print(duplicates.sum())
to print the count of duplicated rows.
In your case, finding duplicates for numeric and categorical data should be simpler. The problem arises when it is free text. I think you should try out fuzzy matching techniques to start with. There is a good distance metric available in Python called the Levenshtein distance; the library for calculating it is python-Levenshtein, and it is pretty fast. See if you get good results using this distance metric; if you want to improve further, you can go for deep learning algorithms like RNNs, LSTMs, etc., which are good for text data.
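For example, with python-Levenshtein you can score a free-text field of two records and pick a similarity threshold; the records and the 0.9 threshold here are just illustrative:
import Levenshtein

a = "Max Test, Teststreet 5, info@max.de"
b = "Max  Test, Teststr. 5, test@max.de"

print(Levenshtein.distance(a, b))  # raw edit distance
print(Levenshtein.ratio(a, b))     # similarity in [0, 1]; e.g. treat > 0.9 as a match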
The problem of finding duplicate instances in a relational database is a traditional research topic in databases and data mining, called "entity matching" or "entity resolution". Deep learning has also been adopted in this domain.
Many related works can be found on Google Scholar by searching for "entity matching" + "deep learning".
I think it is easier to build some functions that can check the different input schemes than to train a network to do so. The hard part would be building a large enough dataset to train your network correctly.
I recently started working on document clustering using the scikit-learn module in Python. However, I am having a hard time understanding the basics of document clustering.
What I know:
Document clustering is typically done using TF-IDF, which essentially converts the words in the documents into a vector space model that is then input to the algorithm.
There are many algorithms, like k-means, neural networks, and hierarchical clustering, to accomplish this.
My Data:
I am experimenting with LinkedIn data; each document would be a LinkedIn profile summary. I would like to see if similar job documents get clustered together.
Current Challenges:
My data has huge summary descriptions, which end up becoming tens of thousands of words when I apply TF-IDF. Is there a proper way to handle this high-dimensional data?
K-means and other algorithms require that I specify the number of clusters (centroids); in my case I do not know the number of clusters upfront. This, I believe, is completely unsupervised learning. Are there algorithms which can determine the number of clusters themselves?
I've never worked with document clustering before; if you are aware of tutorials, textbooks, or articles which address this issue, please feel free to suggest them.
I went through the code on the scikit-learn webpage; it contains too many technical terms that I do not understand. If you have any code with a good explanation or comments, please share. Thanks in advance.
My data has huge summary descriptions, which end up becoming tens of thousands of words when I apply TF-IDF. Is there a proper way to handle this high-dimensional data?
My first suggestion is that you don't, unless you absolutely have to because of memory or execution-time problems.
If you must handle it, you should use dimensionality reduction (PCA, for example) or feature selection (probably better in your case; see chi2, for example).
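A sketch of the dimensionality-reduction route on toy documents; for sparse TF-IDF matrices, TruncatedSVD (latent semantic analysis) is the usual stand-in for PCA, and chi2-based feature selection would additionally require labels, so it is not shown here:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "data engineer with python and sql",
    "software engineer, java backend",
    "marketing manager, social media campaigns",
    "digital marketing and brand strategy",
]

X = TfidfVectorizer().fit_transform(docs)  # sparse matrix, one column per term

svd = TruncatedSVD(n_components=2, random_state=0)  # n_components would be in the hundreds for real data
X_reduced = svd.fit_transform(X)
print(X.shape, "->", X_reduced.shape)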
K-means and other algorithms require that I specify the number of clusters (centroids); in my case I do not know the number of clusters upfront. This, I believe, is completely unsupervised learning. Are there algorithms which can determine the number of clusters themselves?
If you look at the clustering algorithms available in scikit-learn, you'll see that not all of them require that you specify the number of clusters.
One that does not is hierarchical clustering, implemented in scipy. Also see this answer.
I would also suggest that you use KMeans and try to manually tweak the number of clusters until you are satisfied with the results.
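For instance, the scipy hierarchical route takes a distance threshold instead of a cluster count (toy documents, and the threshold is something you would tune):
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["python developer", "senior python engineer",
        "sales representative", "regional sales manager"]

X = TfidfVectorizer().fit_transform(docs).toarray()  # densify: fine only for small samples

Z = linkage(X, method="ward")                      # build the dendrogram
labels = fcluster(Z, t=1.0, criterion="distance")  # cut it at a distance threshold you tune
print(labels)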
I've never worked with document clustering before; if you are aware of tutorials, textbooks, or articles which address this issue, please feel free to suggest them.
Scikit-learn has a lot of tutorials for working with text data; just use the "text data" search query on their site. One is for KMeans, others are for supervised learning, but I suggest you go over those too to get more familiar with the library. From a coding, style, and syntax point of view, unsupervised and supervised learning are pretty similar in scikit-learn, in my opinion.
Document clustering is typically done using TF-IDF, which essentially converts the words in the documents into a vector space model that is then input to the algorithm.
Minor correction here: TF-IDF has nothing to do with clustering. It is simply a method for turning text data into numerical data. It does not care what you do with that data (clustering, classification, regression, search engine things etc.) afterwards.
I understand the message you were trying to get across, but it is incorrect to say that "clustering is done using TF-IDF". It's done using a clustering algorithm, TF-IDF only plays a preprocessing role in document clustering.
For the large matrix after the TF-IDF transformation, consider keeping it as a sparse matrix (scikit-learn's TfidfVectorizer already returns one) rather than converting it to a dense array.
You could try different k values. I am not an expert in unsupervised clustering algorithms, but I expect that with such algorithms and different parameters you could also end up with a varying number of clusters.
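One common way to explore this is to sweep k and score each clustering, for example with the silhouette score (toy documents):
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["python developer", "senior python engineer", "data scientist",
        "sales representative", "regional sales manager", "account executive"]
X = TfidfVectorizer().fit_transform(docs)

for k in range(2, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))  # higher silhouette = better-separated clusters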
This link might be useful; it provides a good amount of explanation of k-means clustering with visual output: http://brandonrose.org/clustering