I have a set of URLs retrieved for a person. I want to try to classify each URL as being about that person (his/her LinkedIn profile, blog, or a news article mentioning the person) or not about that person.
I am trying to apply a rudimentary approach where I tokenize each webpage and compare it to all the others to see how many words (excluding stop words) each pair of documents has in common, and then take the most similar webpages to be positive matches.
I am wondering if there is a machine learning approach I can take to this which will make my task easier and more accurate. Essentially I want to compare webpage content (tokenized into words) between two webpages and determine a score for how similar they are based on their content.
If you are familiar with Python, this NLP classifier should help you greatly:
http://www.nltk.org/api/nltk.classify.html#module-nltk.classify
For unsupervised clustering you can use this:
http://www.nltk.org/api/nltk.cluster.html#module-nltk.cluster
If you are simply looking for similarity scores then the metrics module should be useful:
http://www.nltk.org/api/nltk.metrics.html#module-nltk.metrics
The Natural Language Toolkit (NLTK) has the answer; just browse through the modules to find what you want, and don't implement it by hand.
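For instance, a minimal sketch of a token-overlap similarity score using jaccard_distance from nltk.metrics (assuming NLTK with the punkt and stopwords data downloaded; the two page strings are just placeholders):

from nltk.corpus import stopwords
from nltk.metrics import jaccard_distance
from nltk.tokenize import word_tokenize
# Requires: nltk.download("punkt"), nltk.download("stopwords")

stop_words = set(stopwords.words("english"))

def token_set(text):
    # Lowercase, keep alphabetic tokens, drop stop words
    return {t.lower() for t in word_tokenize(text)
            if t.isalpha() and t.lower() not in stop_words}

page_a = "John Smith is a software engineer at Acme Corp"            # placeholder page text
page_b = "John Smith, an engineer at Acme Corp, writes about software"

similarity = 1 - jaccard_distance(token_set(page_a), token_set(page_b))
print(similarity)

The score is the fraction of shared vocabulary between the two pages, so you can rank the retrieved URLs by it and treat the top-scoring ones as positive matches.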
I have ~138k records of user feedback that I'd like to analyze to understand broad patterns in what our users are most often saying. Each one has a rating between 1-5 stars, so I don't need to do any sort of sentiment analysis. I'm mostly interested in splitting the dataset into >=4 stars to see what we're doing well and <= 3 stars to see what we need to improve upon.
One key problem I'm running into is that I expect to see a lot of n-grams. Some of these I know, like "HOV lane", "carpool lane", "detour time", "out of my way", etc. But I also want to detect common bi- and tri-grams programmatically. I've been playing around with spaCy a bit, but it doesn't seem to have any capability to do analysis at the corpus level, only at the document level.
Ideally my pipeline would look something like this (I think):
Import a list of known n-grams into the tokenizer
Process each string into a tokenized document, removing punctuation, stopwords, etc., while respecting the known n-grams during tokenization (i.e., "HOV lane" should be a single noun token)
Identify the most common bi- and tri-grams in the corpus that I missed (see the collocation sketch after this question)
Re-tokenize using the found n-grams
Split by rating (>=4 and <=3)
Find the most common topics for each split of data in the corpus
I can't seem to find a single tool, or even a collection of tools, that will allow me to do what I want here. Am I approaching this the wrong way somehow? Any pointers on how to get started would be greatly appreciated!
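For the corpus-level bi- and tri-gram discovery step, a minimal sketch with NLTK's collocation finders might look like this (tokenized_feedback is a hypothetical placeholder for your already-tokenized feedback strings):

from nltk.collocations import BigramCollocationFinder, TrigramCollocationFinder
from nltk.metrics import BigramAssocMeasures, TrigramAssocMeasures

# tokenized_feedback: one token list per feedback string, from whatever tokenizer you settle on
tokenized_feedback = [["stuck", "in", "the", "hov", "lane"],
                      ["the", "carpool", "lane", "exit", "was", "closed"]]

# Simple version: chain all documents together (ignores document boundaries)
words = [w for doc in tokenized_feedback for w in doc]

bigram_finder = BigramCollocationFinder.from_words(words)
bigram_finder.apply_freq_filter(5)      # on real data, ignore pairs seen fewer than 5 times
print(bigram_finder.nbest(BigramAssocMeasures.pmi, 20))

trigram_finder = TrigramCollocationFinder.from_words(words)
trigram_finder.apply_freq_filter(5)
print(trigram_finder.nbest(TrigramAssocMeasures.pmi, 20))

The PMI ranking surfaces pairs and triples that co-occur more often than chance, which is a reasonable first pass at finding n-grams like the known ones above.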
Bingo: state-of-the-art results for your problem!
It's called zero-shot learning.
State-of-the-art NLP models for text classification without annotated data.
For code and details, read the blog post: https://joeddav.github.io/blog/2020/05/29/ZSL.html
Let me know if it works for you, or if you need any other help.
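A rough sketch of the NLI-based zero-shot approach described in that post, using the Hugging Face transformers pipeline (the feedback string and candidate labels are made-up examples):

from transformers import pipeline

# Loads an NLI model fine-tuned on MNLI; downloads the weights on first run
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

feedback = "The detour added twenty minutes and kept me out of the HOV lane."
candidate_labels = ["routing", "traffic", "app usability", "pricing"]   # hypothetical topics

result = classifier(feedback, candidate_labels=candidate_labels)
print(result["labels"], result["scores"])

You can run this over each split of your data (>=4 stars and <=3 stars) and tally the top-scoring labels to get a picture of the most common topics.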
The VADER tool works well for sentiment analysis and other NLP-based applications.
I think the proposed workflow is fine for this case study. Work closely on your feature extraction, as it matters a lot.
Most of the time, tri-grams make good sense in these use cases.
Using spaCy would be a good decision, as spaCy's rule-based match engines and components not only help you find the terms and phrases you are searching for, but also let you access the tokens inside a text and their relationships, which regular expressions cannot do.
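For example, a minimal sketch of merging the known n-grams into single tokens with spaCy's PhraseMatcher (assuming spaCy v3 and the en_core_web_sm model; the sentence is a made-up example):

import spacy
from spacy.matcher import PhraseMatcher
from spacy.util import filter_spans

nlp = spacy.load("en_core_web_sm")

known_ngrams = ["HOV lane", "carpool lane", "detour time", "out of my way"]
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")        # case-insensitive matching
matcher.add("KNOWN_NGRAMS", [nlp.make_doc(t) for t in known_ngrams])

doc = nlp("The HOV lane was blocked, so the detour took me out of my way.")
spans = [doc[start:end] for _, start, end in matcher(doc)]

# Merge each matched span into a single token, skipping overlaps
with doc.retokenize() as retokenizer:
    for span in filter_spans(spans):
        retokenizer.merge(span)

print([t.text for t in doc])

After merging, "HOV lane" and "out of my way" behave as single tokens for the rest of the pipeline (stopword removal, counting, topic analysis).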
Is there an efficient way to extract sub-topic explanations from a review using Python and the NLTK library? As an example, a user review of a mobile phone could be "This phone's battery is good but display is a bullshit".
I want to extract the two features above, like
"Battery is good"
"display is a bullshit"
The purpose of this is that I am going to develop a rating system for products with respect to the features of the product.
The polarity analysis part is done, but extracting the features from a review is difficult for me. I did find a way to extract features using POS tag patterns with regular expressions, like
<NN.?><VB.?>?<JJ.?>
as the sub-topic pattern. The problem is that there could be lots of patterns in a review, depending on how users describe things.
Is there any way to solve my problem efficiently?
Thank you!
The question you posed is multi-faceted and not straightforward to answer.
Conceptually, you may want to go through the following steps:
Identify the names of the features of phones (and maybe create an ontology based on these features).
Create lists of synonyms for the feature names (and similarly for evaluative phrases, e.g. nice, bad, sucks, etc.).
Use one of NLTK's taggers to parse the reviews.
Create rules for extracting the features and their evaluations (the Information Extraction part); a sketch of these last two steps follows below. I am not sure if NLTK can directly support you with this.
Evaluate and refine the approach.
Or: create a larger annotated corpus and train a deep learning model on it using TensorFlow, Theano, or something similar.
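For the tagging and rule-based extraction steps, a minimal sketch with NLTK's RegexpParser, reusing the noun/verb/adjective pattern from the question (the review text is just an illustrative example):

import nltk
# Requires: nltk.download("punkt"), nltk.download("averaged_perceptron_tagger")

grammar = "FEATURE: {<NN.*><VB.*>?<JJ.*>}"   # noun, optional verb, adjective
chunker = nltk.RegexpParser(grammar)

review = "This phone's battery is good but the display is awful"
tagged = nltk.pos_tag(nltk.word_tokenize(review))

tree = chunker.parse(tagged)
for subtree in tree.subtrees(filter=lambda t: t.label() == "FEATURE"):
    print(" ".join(word for word, tag in subtree.leaves()))
# Prints chunks along the lines of: "battery is good", "display is awful"

You would then map the noun in each chunk to a feature name (via your synonym lists) and feed the evaluative phrase into your existing polarity step.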
Hi, I want to implement a system which can identify whether a given sentence is an incident or a sentiment.
I was going through Python's NLTK and found out that there is a way to find the positivity or negativity of a sentence.
I found this reference: ref link
I want to achieve something like this:
"My new phone is not as good as I expected" should be treated as a sentiment,
and "Camera of my phone is not working" should be considered an incident.
I had the idea of making my own clusters to train my system to detect this, but I am not getting the desired results. Is there a built-in way to do that, or any idea on how to approach a solution?
Thanks in advance for your time.
If you have, or can construct, a corpus of appropriately categorized sentences, you could use it to train a classifier. There can be as many categories as you need (two, three or more).
You'll have to do some work (reading and experimenting) to find the best features to use for the task. I'd start by POS-tagging the sentence so you can pull out the verb(s), etc. Take a look at the NLTK book's chapter on classifiers.
Use proper training/testing methodology (always test on data that was not seen during training), and make sure you have enough training data; it's easy to "overtrain" your classifier so that it does well on the training data by keying on characteristics that coincidentally correlate with the category but will not recur in novel data.
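A minimal sketch of that idea with NLTK's naive Bayes classifier (the labelled sentences and the feature choices here are made up purely for illustration; you would need a much larger corpus and a real held-out test set):

import nltk
# Requires: nltk.download("punkt"), nltk.download("averaged_perceptron_tagger")

def sentence_features(sentence):
    # Bag-of-words features plus the verbs, so the classifier can key on them
    tokens = nltk.word_tokenize(sentence.lower())
    feats = {"contains(%s)" % t: True for t in tokens}
    for word, tag in nltk.pos_tag(tokens):
        if tag.startswith("VB"):
            feats["verb(%s)" % word] = True
    return feats

# Tiny hand-labelled corpus, purely illustrative
labelled = [
    ("My new phone is not as good as I expected", "sentiment"),
    ("The battery life is disappointing", "sentiment"),
    ("Camera of my phone is not working", "incident"),
    ("The app crashes every time I open it", "incident"),
]

featuresets = [(sentence_features(text), label) for text, label in labelled]
train_set, test_set = featuresets[:3], featuresets[3:]   # toy split only

classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
print(classifier.classify(sentence_features("The speaker stopped working yesterday")))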
I've been trying to read about NLP in general and NLTK in particular, to use with Python. I don't know for sure if what I am looking for exists out there, or if I perhaps need to develop it.
I have a program that collects text from different files; the text is extremely varied and talks about different things. Each file contains a paragraph, or three at most; my program opens the files and stores them in a table.
My question is: can I guess tags for what each paragraph is about? If anyone knows of an existing technology or approach, I would really appreciate it.
Thanks,
Your task is called "document classification", and the nltk book has a whole chapter on it. I'd start with that.
It all depends on your criteria for assigning tags. Are you interested in matching your documents against a pre-existing set of tags, or perhaps in topic extraction (select the N most important words or phrases in the text)?
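If you go the topic-extraction route, a minimal sketch that picks the N most frequent content words of a paragraph with NLTK (the paragraph is a made-up example):

from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.tokenize import word_tokenize
# Requires: nltk.download("punkt"), nltk.download("stopwords")

stop_words = set(stopwords.words("english"))

def top_keywords(paragraph, n=5):
    # Keep alphabetic, non-stopword tokens and return the n most frequent ones
    tokens = [t.lower() for t in word_tokenize(paragraph)
              if t.isalpha() and t.lower() not in stop_words]
    return [word for word, count in FreqDist(tokens).most_common(n)]

paragraph = ("The new battery lasts two days, but the battery indicator "
             "is confusing and the charger gets hot.")
print(top_keywords(paragraph))

These keywords can serve as automatic tags when you don't have a fixed tag set to match against.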
You should train a classifier. The easiest one to develop (and you don't really need to develop it, as NLTK provides one) is the naive Bayes classifier. The catch is that you'll need to manually classify a corpus of observations and then have the program guess which tag best fits a given paragraph (needless to say, the bigger the training corpus, the more precise your classifier will be; IMHO you can reach 80-85% correctness). Take a look at the docs.
I have a list of domain names and want to determine whether a domain's name looks like it belongs to a porn site or not. What is the best way to do this? A list of porn domains is at http://dumpz.org/56957/ . These domains can be used to teach the system what porn domain names look like. I also have another list, http://dumpz.org/56960/ ; many domains on that list are also porn, and I want to identify them by name.
Use a Bayesian filter, e.g. SpamBayes or Divmod's Reverend. You train it with the list you have, and it can then score how likely it is that a given domain is porn.
For a short overview, look at this article.
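If you would rather roll your own than wire up SpamBayes or Reverend, here is a rough sketch of the same idea (a naive Bayes model over character n-grams of the domain name) using scikit-learn instead; the domain lists and labels are placeholders:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Placeholder training data; use the labelled domain lists from the question
domains = ["hotsexygirls.example", "freeporntube.example",
           "cookingrecipes.example", "citylibrary.example"]
labels = ["porn", "porn", "safe", "safe"]

# Character 3-5 grams capture porn-ish substrings inside the domain name
model = make_pipeline(
    CountVectorizer(analyzer="char_wb", ngram_range=(3, 5)),
    MultinomialNB(),
)
model.fit(domains, labels)

print(model.predict(["sexyvideohub.example", "gardeningtips.example"]))
print(model.predict_proba(["sexyvideohub.example"]))   # probability per class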
You can't rely on the domain name for that; there are far too many porn domains with decent names, and a few with porn-like names but safe content.
It might depend on what your goals are. I'm guessing that you are mostly interested in minimizing false negatives (accidentally calling a domain a good domain if it isn't). This might be true if, for example, you want all porn links in a forum to be reviewed for spam before being posted. If some non-porn links get flagged for review, it's OK.
In this case, you could probably do something fairly simple. If you could come up with a list of porn'ish words, you could just mark all of the domains that contain any of those words as a substring. This would catch some safe domains though: expertsexchange.com could match "sex" or "sexchange", but "yahoo" wouldn't ever flag positive. Easy to implement, easy to understand, easy to tweak.
Lists of obscene words can be found using your favorite search engine. You could use your list of domains to extract common long substrings across the domains as words as well.
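A sketch of that simple substring flagging (the word list is just a placeholder to be extended from your own data):

porn_words = ["porn", "sex", "xxx"]          # placeholder list

def looks_like_porn(domain):
    # Flag the domain if any listed word appears anywhere in its name
    name = domain.lower()
    return any(word in name for word in porn_words)

for d in ["example-xxx-videos.com", "expertsexchange.com", "yahoo.com"]:
    print(d, looks_like_porn(d))
# Note: expertsexchange.com is flagged as a false positive, as described above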
If you want to really get the answers correct though, you'll need to see what is on those domains. Site-About-Kitty-Porn.com could be a lolcats domain or illegal porn. Impossible to know unless you do some crawling. If you fetch the actual content and matched against your list, you'd be doing a little better.
You could also try each domain against some third party service, such as a child-safe internet filter, or even trying to test if the domain will appear for safe-search results in your favorite search engine. Of course, make sure you are following each service's TOS and all of that.
As someone already pointed out, you need some kind of classification to achieve what you are trying to do. But overall accuracy (precision and recall) depends on the training dataset you have. You could use classifiers like SVM, decision trees, etc. for this purpose.
I would suggest going for a semi-supervised approach where you cluster the different URLs and check a few representative URLs from each cluster to see whether they are porn or not (a sketch follows below). The benefit is that you don't need any training data, and you can find porn URLs that your training dataset probably would not cover. Common clustering techniques are k-means, hierarchical clustering, DBSCAN, etc.
This will still not cover porn sites that do not have porn-like URLs. For those you have to grab the page and do similar training/clustering on the content of the webpage(s).
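A rough sketch of that semi-supervised route with scikit-learn (character n-gram TF-IDF over the domain names plus k-means; the domain list and the number of clusters are placeholders):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Placeholder domain list; use the full lists from the question
domains = ["freeporntube.example", "hotsexygirls.example",
           "cookingrecipes.example", "citylibrary.example",
           "sexyvideohub.example", "gardeningtips.example"]

vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 4))
X = vectorizer.fit_transform(domains)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Inspect a few representative domains per cluster, then label the whole cluster
for cluster_id in range(kmeans.n_clusters):
    members = [d for d, c in zip(domains, kmeans.labels_) if c == cluster_id]
    print(cluster_id, members[:5])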
Do you mean something like this?
scala> val pornList = List("porn1.com","porn2.com","porn3.com")
pornList: List[java.lang.String] = List(porn1.com, porn2.com, porn3.com)
scala> val sites = List("porn1.com","site1.com","porn3.com","site2.com","site3.com")
sites: List[java.lang.String] = List(porn1.com, site1.com, porn3.com, site2.com, site3.com)
scala> val result = sites filterNot { pornList contains _ }
result: List[java.lang.String] = List(site1.com, site2.com, site3.com)
Check out this blog post on classifying webpages by topic. Start with the list of bad sites as your positive examples, and use any heuristic for finding good sites (a basic web crawler seeded with some innocent Google searches) as negative examples. The post walks you through the process of extracting content from the pages and touches on Weka and how you might apply some of its basic learners.
Note that you may want to add additional data to your training set that is specific to the domain of your problem instead of just using page contents. For example the number of pictures or size of pictures on the page may be a factor that you may want to consider.