Python Deep Learning find Duplicates

I want to try and learn Deep Learning with Python.
The first thing that came to my mind for a useful scenario would be a Duplicate-Check.
Let's say you have a customer-table with name,address,tel,email and want to insert new customers.
E.g.:
In Table:
Max Test,Teststreet 5, 00642 / 58458,info#max.de
To Insert:
Max Test, NULL, (+49)0064258458, test#max.de
This should be recognised as a duplicate entry.
Are there already tutorials out there for this use case? Or is it even possible with deep learning?

Duplicate matching is a special case of similarity matching. You can define the input features as either individual characters or whole fields and then train your network. It's a binary classification problem (true/false), unless you want a similarity score instead (e.g. "95% match"). The network should be able to learn that punctuation and whitespace are irrelevant, and to learn something like an OR function over the fields: if at least one of the fields matches, output a true positive.
Sounds like a fairly simple case for deep learning.
I don't know of any specific tutorial for this, but I tried to give you some keywords to look for.
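For illustration, here is a minimal sketch of that idea, assuming scikit-learn and TensorFlow/Keras are available; the record strings, labels, layer sizes and n-gram settings below are made up and would need tuning on real data:

import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from tensorflow import keras

# Vectorize each record as character n-grams so punctuation/whitespace noise is tolerated
vectorizer = HashingVectorizer(analyzer="char", ngram_range=(2, 3), n_features=512)

def encode_pair(record_a, record_b):
    a, b = vectorizer.transform([record_a, record_b]).toarray()
    return np.abs(a - b)  # element-wise difference of the two record vectors

# Assumed tiny labelled set: 1 = duplicate, 0 = different customer
pairs = [
    ("Max Test,Teststreet 5,00642/58458,info@max.de",
     "Max Test,,(+49)0064258458,test@max.de", 1),
    ("Max Test,Teststreet 5,00642/58458,info@max.de",
     "Erika Muster,Musterweg 1,0123/456,erika@example.de", 0),
]
X = np.array([encode_pair(a, b) for a, b, _ in pairs])
y = np.array([label for _, _, label in pairs])

model = keras.Sequential([
    keras.layers.Dense(64, activation="relu", input_shape=(512,)),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=10, verbose=0)

print(model.predict(X))  # per-pair score between 0 and 1, i.e. a "95% match" style output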

You can use duplicates = dataset.duplicated()
It returns a boolean Series marking rows that are exact duplicates of an earlier row.
Then:
print(sum(duplicates))
to print the count of duplicated rows.
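For example (note that duplicated() only catches exact row duplicates, so the fuzzy case from the question above would not be found this way):

import pandas as pd

# small made-up table containing one exact duplicate row
dataset = pd.DataFrame({
    "name":  ["Max Test", "Max Test", "Erika Muster"],
    "email": ["info@max.de", "info@max.de", "erika@example.de"],
})
duplicates = dataset.duplicated()   # boolean Series, True for repeated rows
print(sum(duplicates))              # -> 1
print(dataset[duplicates])          # the duplicate rows themselves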

In your case, finding duplicates for numbers and categorical data should be simpler. The problem arises when it is free text. I think you should try out fuzzy matching techniques to start with. There is a good distance metric available in Python called Levenshtein distance; the library for calculating it is python-Levenshtein, and it is pretty fast. See if you get good results using this distance metric. If you want to improve further, you can go for deep learning algorithms like RNNs and LSTMs, which are good for text data.
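A quick sketch with python-Levenshtein; the 0.8 threshold is only an assumed starting point you would tune on your own data:

import Levenshtein

a = "Teststreet 5"
b = "Teststr. 5"
print(Levenshtein.distance(a, b))   # number of single-character edits
print(Levenshtein.ratio(a, b))      # normalised similarity between 0 and 1

if Levenshtein.ratio(a, b) > 0.8:   # assumed threshold
    print("probable duplicate")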

The problem of finding duplicate instances in a relational database is a traditional research topic in databases and data mining, called "entity matching" or "entity resolution". Deep learning has also been adopted in this domain.
Many related works can be found on Google Scholar by searching for "entity matching" + "deep learning".

I think it's easier to build some functions that can check different input schemes than to train a network to do so. The hard part would be building a large enough dataset to train your network correctly.
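For example, a hand-written normalisation step for the fields from the question could look like this (only a sketch; the phone and e-mail rules are assumptions):

import re

def normalise_phone(tel):
    digits = re.sub(r"\D", "", tel or "")   # keep digits only
    digits = digits.lstrip("0")             # drop leading zeros
    if digits.startswith("49"):             # assumed German country code
        digits = digits[2:].lstrip("0")
    return digits

def normalise_email(email):
    return (email or "").strip().lower()

def is_probable_duplicate(row_a, row_b):
    # rows are (name, address, tel, email); NULL fields are passed as None
    same_name  = (row_a[0] or "").strip().lower() == (row_b[0] or "").strip().lower()
    same_phone = normalise_phone(row_a[2]) == normalise_phone(row_b[2])
    same_mail  = normalise_email(row_a[3]) == normalise_email(row_b[3])
    return same_name and (same_phone or same_mail)

existing = ("Max Test", "Teststreet 5", "00642 / 58458", "info@max.de")
incoming = ("Max Test", None, "(+49)0064258458", "test@max.de")
print(is_probable_duplicate(existing, incoming))   # True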

Related

Exclude values existing in a list that contains words like

I have a list of merchant category:
[
'General Contractors–Residential and Commercial',
'Air Conditioning, Heating and Plumbing Contractors',
'Electrical Contractors',
....,
'Insulation, Masonry, Plastering, Stonework and Tile Setting Contractors'
]
I want to exclude merchants from my dataframe if df['merchant_category'].str.contains() matches any of these merchant categories.
However, I cannot guarantee that the value in my dataframe has the long name as in the list of merchant categories. It could be that my dataframe value is just 'air conditioning'.
As such, df = df[~df['merchant_category'].isin(list_of_merchant_category)] will not work.
If you can collect a long list of positive examples (categories you definitely want to keep), & negative examples (categories you definitely want to exclude), you could try to train a text classifier on that data.
It would then be able to look at new texts and make a reasonable guess as to whether you want them included or excluded, based on their similarity to your examples.
So, as you're working in Python, I suggest you look for online tutorials and examples of "binary text classification" using Scikit-Learn.
While there's a bewildering variety of possible approaches to both representing/vectorizing your text, and then learning to make classifications from those vectors, you may have success with some very simple ones commonly used in intro examples. For example, you could represent your textual categories with bag-of-words and/or character-n-gram (word-fragment) representations. Then try NaiveBayes or SVC classifiers (and others if you need to experiment for possibly-better results).
Some of these will even report a sort of 'confidence' in their predictions - so you could potentially accept the strong predictions, but highlight the weak predictions for human review. When a human then looks at, and definitively rules on, a new 'category' string – because it was highlighted as an iffy prediction, or noticed as an error – you can then improve the overall system by:
adding that to the known set that are automatically included/excluded based on an exact literal comparison
re-training the system, so that it has a better chance at getting other new similar strings correct
(I know this is a very high-level answer, but once you've worked through some attempts based on other intro tutorials, and hit issues with your data, you'll be able to ask more specific questions here on SO to get over any specific issues.)
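A minimal scikit-learn sketch of that approach; the example categories and labels are invented, and character n-grams plus Naive Bayes is just one of the simple combinations mentioned above:

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

texts  = ["Electrical Contractors",
          "Air Conditioning, Heating and Plumbing Contractors",
          "Grocery Stores",
          "Book Shops"]
labels = ["exclude", "exclude", "keep", "keep"]   # assumed hand-labelled examples

clf = make_pipeline(TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5)),
                    MultinomialNB())
clf.fit(texts, labels)

print(clf.predict(["air conditioning"]))        # short strings still get a guess
print(clf.predict_proba(["air conditioning"]))  # 'confidence' you can threshold for human review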

Which document embedding model for document similarity

First, I want to explain my task. I have a dataset of 300k documents with an average of 560 words (no stop word removal yet); 75% are in German, 15% in English and the rest in other languages. The goal is to recommend similar documents based on an existing one. At the beginning I want to focus on the German and English documents.
To achieve this goal I looked into several methods on feature extraction for document similarity, especially the word embedding methods have impressed me because they are context aware in contrast to simple TF-IDF feature extraction and the calculation of cosine similarity. 
I'm overwhelmed by the number of methods I could use, and I haven't found a proper evaluation of them yet. I know for sure that my documents are too long for BERT, but there is FastText, Sent2Vec, Doc2Vec and the Universal Sentence Encoder from Google. My favorite method based on my research is Doc2Vec, even though there are no pre-trained models (or only old ones), which means I have to do the training on my own.
Now that you know my task and goal, I have the following questions:
Which method should I use for feature extraction based on the rough overview of my data?
My dataset is too small to train Doc2Vec on it. Do I achieve good results if I train the model on English / German Wikipedia? 
You really have to try the different methods on your data, with your specific user tasks, with your time/resources budget to know which makes sense.
Your 225k German documents and 45k English documents are each plausibly large enough to use Doc2Vec - as they match or exceed some published results. So you wouldn't necessarily need to add training on something else (like Wikipedia) instead, and whether adding that to your data would help or hurt is another thing you'd need to determine experimentally.
(There might be special challenges in German given compound words using common-enough roots but being individually rare, I'm not sure. FastText-based approaches that use word-fragments might be helpful, but I don't know a Doc2Vec-like algorithm that necessarily uses that same char-ngrams trick. The closest that might be possible is to use Facebook FastText's supervised mode, with a rich set of meaningful known-labels to bootstrap better text vectors - but that's highly speculative and that mode isn't supported in Gensim.)
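If you do try Doc2Vec, a bare-bones Gensim training loop looks roughly like this (vector_size, epochs and the toy corpus below are placeholders you would replace and tune):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# assumed: one token list per (German or English) document
docs = [["this", "is", "the", "first", "document"],
        ["another", "document", "about", "something", "else"]]
corpus = [TaggedDocument(words=tokens, tags=[i]) for i, tokens in enumerate(docs)]

model = Doc2Vec(corpus, vector_size=300, epochs=20, min_count=1, workers=4)

# similar document vectors are your recommendation candidates (model.dv in Gensim 4.x)
print(model.dv.most_similar(0, topn=5))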

Recommendation for ML algorithm to differentiate between ban or allowance

I am new to Machine Learning and wanted to see if any of you could recommend an algorithm I could apply to a project I'm doing. Basically I want to scrape popular housing websites and look at their descriptions to see if they allow or disallow something, for example pets. The problem is that a simple search for "pets" leads to contradictory results: 'pets allowed' and 'no additional cost for pets' versus 'no pets' or 'I don't accept at this time'. As these examples show, negative keywords like 'no' are often used to indicate pets are allowed, whereas positive keywords like 'accept' are used to indicate a ban. As such, I was wondering if there is any algorithm I could use (preferably in Python) to differentiate between the two. (Note: I can't run training data to generate an algorithm myself, as the thing I am actually looking for is quite niche.)
Thank you very much for your help!!
The keyword you're looking for is "document classification". This is a document classification problem. You start with documents (i.e. webpages) and you want to classify them as "allows pets" or "doesn't allow pets" (or whatever). There are a lot of good tutorials out there for performing document classification but a full explanation is beyond the scope of a StackOverflow answer.
You won't be able to do this for your particular niche case without providing at least some training data, but you could gather, say, 30 example websites, extract their text, manually add labels ("does fit my niche" vs "doesn't fit my niche"), and then run them through a standard document classification system and see if that gets you the accuracy you want. Also, in order for this to work with a small amount of training data (like your 30 documents), you'll need to start from a pretrained model.
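One way to lean on a pretrained model with almost no labelled data is zero-shot classification from the Hugging Face transformers library; this is my own suggestion rather than the only option, and the candidate labels below are assumptions:

from transformers import pipeline

classifier = pipeline("zero-shot-classification")   # downloads a pretrained NLI model

description = "Lovely flat in the city centre. No additional cost for pets."
labels = ["pets allowed", "pets not allowed"]

result = classifier(description, candidate_labels=labels)
print(result["labels"][0], result["scores"][0])     # top label and its score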
Good luck!

Text classification in Python based on large dict of string:string

I have a dataset that would be equivalent to a dict of 5 millions key-values, both strings.
Each key is unique, but there are only a couple hundred different values.
Keys are not natural words but technical references. The values are "families", grouping similar technical references. Similar is meant in the sense of "having similar regex", "including similar characters", or some sort of pattern.
Example of key-values:
ADSF33344 : G1112
AWDX45603 : G1112
D99991111222 : X3334
E98881188393 : X3334
A30-00005-01 : B0007
B45-00234-07A : B0007
F50-01120-06 : B0007
The final goal is to feed an algorithm with a list of new references (never seen before) and the algorithm would return a suggested family for each reference, ideally together with a percentage of confidence, based on what it learned from the dataset.
The suggested family can only come from the existing families found in the dataset. No need to "invent" new family name.
I'm not familiar with machine learning so I don't really know where to start. I saw some solutions through Sklearn or TextBlob and I understand that I'm looking for a classifier algorithm but every tutorial is oriented toward analysis of large texts.
Somehow, I don't find how to handle my problem, although it seems to be a "simpler" problem than analysing newspaper articles in natural language...
Could you point me to sources or tutorials that could help me?
Make a training dataset, and train a classifier. Most classifiers work on the values of a set of features that you define yourself. (The kind of features depends on the classifier; in some cases they are numeric quantities, in other cases true/false, in others they can take several discrete values.) You provide the features and the classifier decides how important each feature is, and how to interpret their combinations.
By way of a tutorial you can look at chapter 6 of the NLTK book. The example task, the classification of names into male and female, is structurally very close to yours: Based on the form of short strings (names), classify them into categories (genders).
You will translate each part number into a dictionary of features. Since you don't show us the real data, nobody can give you concrete suggestions, but you should definitely make general-purpose features as in the book, and in addition you should make a feature out of every clue, strong or weak, that you are aware of. If supplier IDs differ in length, make a length feature. If the presence (or number or position) of hyphens is a clue, make that into a feature. If some suppliers' parts use a lot of zeros, ditto. Then make additional features for anything else, e.g. "first three letters", that might be useful. Once you have a working system, experiment with different feature sets and different classifier engines and algorithms, until you get acceptable performance.
To get good results with new data, don't forget to split up your training data into training, testing and evaluation subsets. You could use all this with any classifier, but the NLTK's Naive Bayes classifier is pretty quick to train so you could start with that. (Note that the features can be discrete values, e.g. first_letter can be the actual letter; you don't need to stick to boolean features.)
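A toy version of that setup with NLTK's Naive Bayes classifier, using the sample references from the question (the feature functions below are just the kinds of clues described above, not a recommendation):

import nltk

def part_features(ref):
    return {
        "length": len(ref),
        "has_hyphen": "-" in ref,
        "first_three": ref[:3],
        "digit_count": sum(ch.isdigit() for ch in ref),
    }

labelled = [("ADSF33344", "G1112"), ("AWDX45603", "G1112"),
            ("D99991111222", "X3334"), ("E98881188393", "X3334"),
            ("A30-00005-01", "B0007"), ("F50-01120-06", "B0007")]
train_set = [(part_features(ref), family) for ref, family in labelled]

classifier = nltk.NaiveBayesClassifier.train(train_set)

new_ref = "B45-00234-07A"
print(classifier.classify(part_features(new_ref)))                     # suggested family
print(classifier.prob_classify(part_features(new_ref)).prob("B0007"))  # confidence for that family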

Python NLTK difference between a sentiment and an incident

Hi, I want to implement a system which can identify whether a given sentence is an incident or a sentiment.
I was going through Python NLTK and found out that there is a way to determine the positivity or negativity of a sentence.
I found this reference link: ref link
I want to achieve something like this:
'My new Phone is not as good as I expected' should be treated as a sentiment,
and 'Camera of my phone is not working' should be considered an incident.
I had the idea of building my own clusters to train the system to detect this, but I am not getting the desired results. Is there a built-in way to do this, or any idea on how to approach a solution?
Thanks in advance for your time.
If you have, or can construct, a corpus of appropriately categorized sentences, you could use it to train a classifier. There can be as many categories as you need (two, three or more).
You'll have to do some work (reading and experimenting) to find the best features to use for the task. I'd start by POS-tagging the sentence so you can pull out the verb(s), etc. Take a look at the NLTK book's chapter on classifiers.
Use proper training/testing methodology (always test on data that was not seen during training), and make sure you have enough training data-- it's easy to "overtrain" your classifier so that it does well on the training data, by using characteristics that coincidentally correlate with the category but will not recur in novel data.
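As a starting point, POS-tag based features plus NLTK's Naive Bayes classifier could look like this (the features and the two training sentences are only illustrative; you will need a real labelled corpus and the punkt / averaged_perceptron_tagger NLTK data):

import nltk
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")  # once, if missing

def sentence_features(sentence):
    tokens = nltk.word_tokenize(sentence)
    tags = nltk.pos_tag(tokens)
    verbs = [w.lower() for w, t in tags if t.startswith("VB")]
    return {
        "has_negation": any(w.lower() in ("not", "n't", "no") for w in tokens),
        "adjective_count": sum(1 for _, t in tags if t.startswith("JJ")),
        "last_verb": verbs[-1] if verbs else None,
    }

train = [("My new phone is not as good as I expected", "sentiment"),
         ("Camera of my phone is not working", "incident")]
train_set = [(sentence_features(s), label) for s, label in train]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print(classifier.classify(sentence_features("The screen stopped responding")))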
