I have a list of domain names and want to determine whether a domain's name looks like it belongs to a porn site or not. What is the best way to do this? A list of porn domains is at http://dumpz.org/56957/ ; these domains can be used to teach the system what porn domain names look like. I also have another list - http://dumpz.org/56960/ - many domains in this list are also porn, and I want to identify them by name.
Use a Bayesian filter, e.g. SpamBayes or Divmod's Reverend. Train it with the list you have, and it can then score how likely it is that a given domain is porn.
For a short overview, look at this article.
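The answer above names SpamBayes/Reverend; here is a rough sketch of the same Bayesian idea using scikit-learn's naive Bayes over character n-grams instead. The two tiny domain lists are placeholders, not the lists from the question.

# Sketch: naive Bayes over character n-grams of domain names.
# Assumes scikit-learn is installed; the example lists below are placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

porn_domains = ["hotsexvideos.example", "xxxmovies.example"]      # placeholder positives
clean_domains = ["cookingrecipes.example", "dailynews.example"]   # placeholder negatives

X_text = porn_domains + clean_domains
y = [1] * len(porn_domains) + [0] * len(clean_domains)

# Character 3-5 grams capture substrings like "sex" or "xxx" inside names.
vectorizer = CountVectorizer(analyzer="char_wb", ngram_range=(3, 5))
X = vectorizer.fit_transform(X_text)

clf = MultinomialNB().fit(X, y)

# Probability that an unseen domain is porn-like, judged by its name alone.
print(clf.predict_proba(vectorizer.transform(["freesexcams.example"]))[:, 1])

With the real training lists instead of the placeholders, the predicted probability can be thresholded however strictly you like.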
You can't rely on the domain name alone for that: there are far too many porn domains with innocent-looking names, and a few with porn-like names but safe content.
It might depend on what your goals are. I'm guessing that you are mostly interested in minimizing false negatives (accidentally calling a domain a good domain if it isn't). This might be true if, for example, you want all porn links in a forum to be reviewed for spam before being posted. If some non-porn links get flagged for review, it's OK.
In this case, you could probably do something fairly simple. If you could come up with a list of porn'ish words, you could just mark all of the domains that contain any of those words as a substring. This would catch some safe domains though: expertsexchange.com could match "sex" or "sexchange", but "yahoo" wouldn't ever flag positive. Easy to implement, easy to understand, easy to tweak.
Lists of obscene words can be found with your favorite search engine. You could also use your list of known porn domains to extract common long substrings and treat those as words.
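A minimal sketch of that substring check; the three-word blocklist is just a stand-in for a real obscenity list.

# Flag a domain if it contains any word from a blocklist as a substring.
# The blocklist below is a tiny placeholder; use a real obscenity list.
BLOCKLIST = ["porn", "sex", "xxx"]

def looks_pornish(domain: str) -> bool:
    name = domain.lower()
    return any(word in name for word in BLOCKLIST)

print(looks_pornish("expertsexchange.com"))  # True  (a false positive, as noted above)
print(looks_pornish("yahoo.com"))            # False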
If you want to really get the answers correct though, you'll need to see what is on those domains. Site-About-Kitty-Porn.com could be a lolcats domain or illegal porn; it's impossible to know unless you do some crawling. If you fetch the actual content and match it against your list, you'll do a little better.
You could also try each domain against some third party service, such as a child-safe internet filter, or even trying to test if the domain will appear for safe-search results in your favorite search engine. Of course, make sure you are following each service's TOS and all of that.
As someone already pointed out, you need some kind of classification to achieve what you are trying to do. But overall accuracy (precision and recall) depends on the training dataset you have. You could use classifiers like SVM, decision trees, etc. for this purpose.
I would suggest going with a semi-supervised approach where you cluster your different URLs and check a few representative URLs from each cluster to see whether they are porn or not. The benefit is that you don't need any training data, and you can find porn URLs that your training dataset probably does not cover. Common clustering techniques are k-means, hierarchical clustering, DBSCAN, etc.
This will still not cover porn sites that do not have porn-like URLs. For those you have to fetch the page and do similar training/clustering on the content of the webpage(s).
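A rough sketch of that clustering idea on domain names, assuming scikit-learn and using character n-gram TF-IDF with k-means; the domain list and the cluster count are invented for illustration.

# Cluster domain names by character n-gram features, then eyeball a few
# representatives per cluster.  Domains and k are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

domains = ["hotsexvideos.example", "xxxmovies.example",
           "cookingrecipes.example", "dailynews.example",
           "freesexcams.example", "weatherreport.example"]

vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))
X = vectorizer.fit_transform(domains)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Group domains by cluster label so a human can inspect each cluster.
for label in set(kmeans.labels_):
    members = [d for d, l in zip(domains, kmeans.labels_) if l == label]
    print(label, members)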
Do you mean something like this?
scala> val pornList = List("porn1.com","porn2.com","porn3.com")
pornList: List[java.lang.String] = List(porn1.com, porn2.com, porn3.com)
scala> val sites = List("porn1.com","site1.com","porn3.com","site2.com","site3.com")
sites: List[java.lang.String] = List(porn1.com, site1.com, porn3.com, site2.com, site3.com)
scala> val result = sites filterNot { pornList contains _ }
result: List[java.lang.String] = List(site1.com, site2.com, site3.com)
Check out this blog post on classifying webpages by topic. Start with the list of bad sites as your positive examples and use any heuristic for finding good sites (a basic web crawler seeded with some innocent Google searches) as negative examples. The post walks you through the process of extracting content from the pages and touches on Weka and how you might apply some of its basic learners.
Note that you may want to add additional data to your training set that is specific to the domain of your problem instead of just using page contents. For example the number of pictures or size of pictures on the page may be a factor that you may want to consider.
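For instance, a hedged sketch of pulling those picture-related features out of a page with BeautifulSoup; the width/height attributes are only usable when the page declares them.

# Count <img> tags and collect declared sizes as extra training features.
# Assumes BeautifulSoup (bs4) is installed; html is the raw page source.
from bs4 import BeautifulSoup

def image_features(html: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")
    imgs = soup.find_all("img")
    sizes = []
    for img in imgs:
        w, h = img.get("width"), img.get("height")
        if w and h and str(w).isdigit() and str(h).isdigit():
            sizes.append(int(w) * int(h))
    return {
        "num_images": len(imgs),
        "max_declared_area": max(sizes) if sizes else 0,
    }

print(image_features('<html><body><img width="640" height="480"></body></html>'))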
Related
I am relatively new to the field of NLP/text processing. I would like to know how to identify domain-related important keywords from a given text.
For example, if I have to build a Q&A chatbot that will be used in the Banking domain, the Q would be like: What is the maturity date for TRADE:12345 ?
From the Q, I would like to extract the keywords: maturity date & TRADE:12345.
From the extracted information, I would frame a SQL-like query, search the DB, retrieve the SQL output and provide the response back to the user.
Any help would be appreciated.
Thanks in advance.
So, this is where the work comes in.
Normally people start with a stop word list. There are several, choose wisely. But more than likely you'll experiment and/or use a base list and then add more words to that list.
Depending on the list, it will take out
"what, is, the, for, ?"
Since this is a pretty easy example, they'll all do that. But you'll notice that what is being done is just the opposite of what you asked for: you wanted the domain-specific words, but what is happening is the removal of all the other cruft (by the library).
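For example, a minimal sketch with NLTK's English stop word list applied to the question above; the stopwords and punkt data need a one-time nltk.download.

# Remove stop words and punctuation from the example question.
# Requires: nltk.download("stopwords"); nltk.download("punkt")
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

question = "What is the maturity date for TRADE:12345 ?"
stop = set(stopwords.words("english"))

tokens = word_tokenize(question)
kept = [t for t in tokens if t.lower() not in stop and t not in string.punctuation]
print(kept)  # e.g. ['maturity', 'date', 'TRADE:12345'] (exact split depends on the tokenizer)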
From here it will depend on what you use. NLTK or Spacy are common choices. Regardless of what you pick, get a real understanding of concepts or it can bite you (like pretty much anything in Data Science).
Expect to start thinking in terms of linguistic patterns so, in your example:
What is the maturity date for TRADE:12345 ?
'What' is an interrogative, 'the' is a definite article, 'for' starts a prepositional phrase.
There may be other clues, such as the ':' or the fact that TRADE is in all caps. But there might not be.
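To see those linguistic clues programmatically, here is what NLTK's off-the-shelf tagger produces for the sentence; the exact tags can vary by version and model, so treat the output as illustrative.

# Part-of-speech tag the example question with NLTK's default tagger.
# Requires: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
import nltk

tokens = nltk.word_tokenize("What is the maturity date for TRADE:12345 ?")
print(nltk.pos_tag(tokens))
# Something like: [('What', 'WP'), ('is', 'VBZ'), ('the', 'DT'),
#                  ('maturity', 'NN'), ('date', 'NN'), ('for', 'IN'),
#                  ('TRADE:12345', 'NNP'), ('?', '.')]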
That should get you started but you might look at some of the other StackExchange sites for deeper expertise.
Finally, you'll want to break a question like this into more than one question (assuming you've done the research and determined each one hasn't already been asked -- repeatedly). NLTK and NLP questions are still fairly new territory, but SQL queries are usually just a Google search away.
I'm implementing a search engine, and so far I am done with the parts for web crawling, storing the results in the index, and retrieving results for the search keywords entered by the user. However, I would like the search results to be more specific. Let's say I'm searching for "Shoe shops in Hyderabad". Is there any NLP library in Python that can process the text and assign higher weights to important words, like "shoes" and "Hyderabad" in this case?
Thanks.
I don't think one approach is going to solve the entire problem here. Your question is broad, and it will take multiple steps to get the best results. Here is how I would approach the problem:
1. Create an n-gram analyser with Lucene and query it. Lucene also allows phrase queries; "Shoe shops in Hyderabad" is a good fit for that.
2. Use cosine similarity to treat "Shoe shops in Hyderabad" and "Footwear shops in Hyderabad" similarly (a small sketch follows this list).
3. Also think of the linguistic angle. Simple POS tagging and a role-based rule engine can help you get much smarter results for queries like "Shoe shops in Hyderabad" or "Shoes under 500 bucks", where a very finite word set of "in", "under", "on", etc. can be assigned rules on location/comparison, etc. This point assumes you are working with English; you will have to build this layer separately for each language.
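As a sketch of point 2, assuming scikit-learn: TF-IDF down-weights common words like "in" and up-weights distinctive ones, and cosine similarity then scores how close two queries or documents are. The tiny corpus is invented.

# Sketch: TF-IDF term weighting plus cosine similarity between queries.
# The corpus below is made up purely for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "Shoe shops in Hyderabad",
    "Footwear shops in Hyderabad",
    "Best restaurants in Hyderabad",
    "Shoe repair services in Mumbai",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

# Lower IDF = more common across the corpus = less informative
# ("in" gets the lowest score here).
for term, idx in sorted(vectorizer.vocabulary_.items()):
    print(term, round(vectorizer.idf_[idx], 3))

# Similarity of "Shoe shops in Hyderabad" to the other entries.
print(cosine_similarity(X[0], X[1:]))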
Hope this helps.
I think the question is a good one (I was looking for something similar last week) and, of course, as other people mention, it is too broad. But I think you can tackle it using an information retrieval system. I can recommend the Lemur project, and specifically Indri: it includes a lot of customization features for queries, so it's possible to weight using n-grams, TF-IDF (as the previous two answers suggest), or your own criteria. If you want to use Indri, check this tutorial/introduction; something about weighting is at page 56.
Good luck!
I have a set of urls retrieved for a person. I want to try and classify each url as being about that person (his/her linkedin profile or blog or news article mentioning the person) or not about that person.
I am trying to apply a rudimentary approach where I tokenize each webpage and compare to all others to see how many similar words (excluding stop words) there are between each document and then take the most similar webpages to be positive matches.
I am wondering if there is a machine learning approach I can take to this which will make my task easier and more accurate. Essentially I want to compare webpage content (tokenized into words) between two webpages and determine a score for how similar they are based on their content.
If you are familiar with python this NLP classifier should help you greatly:
http://www.nltk.org/api/nltk.classify.html#module-nltk.classify
For unsupervised clustering you can use this:
http://www.nltk.org/api/nltk.cluster.html#module-nltk.cluster
If you are simply looking for similarity scores then the metrics module should be useful:
http://www.nltk.org/api/nltk.metrics.html#module-nltk.metrics
NLTK has the answer; just browse through the modules to find what you want, and don't implement it by hand.
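If a plain similarity score between two tokenized pages is enough, a minimal sketch using NLTK's metrics module; the two text snippets stand in for real page content.

# Jaccard similarity between the word sets of two pages, ignoring stop words.
# Requires: nltk.download("punkt"); nltk.download("stopwords")
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.metrics.distance import jaccard_distance

stop = set(stopwords.words("english"))

def word_set(text: str) -> set:
    return {t.lower() for t in word_tokenize(text)
            if t.isalpha() and t.lower() not in stop}

page_a = "John Smith is a software engineer who writes about machine learning."
page_b = "The blog of John Smith, covering machine learning and software."

a, b = word_set(page_a), word_set(page_b)
print(1 - jaccard_distance(a, b))  # 1.0 = identical vocabularies, 0.0 = no overlap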
I have 300k+ HTML documents, which I want to extract postal addresses from. The data comes in different structures, so regex won't work.
I have done a heap of reading on NLP and NLTK for python, however I am still struggling on where to start with this.
Is this approach called Part-of-Speech Tagging or Chunking / Partial Parsing? I can't find any document on how to actually TAG a page so I can train a model on it, or even what I should be training.
So my questions:
What is this approach called?
How can I tag some documents to train from?
Qn: Which NLP task is closely related to this task?
Ans: The task of detecting postal addresses can be viewed as a Named-Entity Recognition (NER) task. But I suggest viewing the task as simply sequence labeling on the HTML (i.e. your input data) and then performing some standard machine learning classification.
Qn: How can I tag some documents to use as training data?
Ans: What you can do is:
Label each word or each line as Begin, Inside, or Outside (BIO tagging)
Choose a supervised classification method
Decide what the features are (here's a hint: feature selection)
Build the model (basically just run the classification software with the configured features)
Voila, the output should give you B, I, and O labels; then just delete all the instances labelled O and you will be left with the lines/words that are addresses. A rough code sketch of this follows below.
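Here is a very rough sketch of that pipeline with scikit-learn, using hand-made line features and a toy training set; a real system would use richer features or a proper sequence model such as a CRF.

# Toy sequence-labelling sketch: classify each text line as B/I/O using
# simple surface features.  Training lines and features are illustrative only.
import re
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def line_features(line: str) -> dict:
    return {
        "has_zip": bool(re.search(r"\b\d{5}(-\d{4})?\b", line)),
        "has_street_word": bool(re.search(r"\b(St|Street|Ave|Avenue|Rd|Road)\b", line)),
        "starts_with_digit": line.strip()[:1].isdigit(),
        "num_digits": sum(c.isdigit() for c in line),
    }

train_lines = [
    ("123 Main Street", "B"),
    ("Springfield, IL 62704", "I"),
    ("Welcome to our homepage", "O"),
    ("Contact us any time", "O"),
    ("45 Oak Avenue", "B"),
    ("Portland, OR 97205", "I"),
    ("All rights reserved", "O"),
]

vec = DictVectorizer()
X = vec.fit_transform([line_features(l) for l, _ in train_lines])
y = [label for _, label in train_lines]

clf = LogisticRegression(max_iter=1000).fit(X, y)

test = ["Our office: 77 Pine Road", "Seattle, WA 98101", "Read our latest blog post"]
print(list(zip(test, clf.predict(vec.transform([line_features(l) for l in test])))))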
Apple calls their software that does this "Data Detectors" (be careful, it's patented -- they won an injunction against HTC Android phones over this). More generally, I think this application is called Information Extraction.
Strip the text out of the HTML page (unless there is a way from the HTML to identify the address text such as div with a particular class) then build a set of rules that match the address formats used.
If there are postal addresses in several countries then the formats may be markedly different but within a country, the format is the same (with some tweaks) or it is not valid.
Within the US, for example, addresses are 3 or 4 lines (including the person's name). There is usually a ZIP code (5 digits, optionally followed by four more). Other countries have postal codes in different formats.
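As an illustration of one such US rule; a real rule set would need many more patterns, and the test lines here are invented.

# One illustrative rule: a line ending in "City, ST 12345" or "City, ST 12345-6789".
import re

CITY_STATE_ZIP = re.compile(r"[A-Za-z .'-]+,\s*[A-Z]{2}\s+\d{5}(?:-\d{4})?\s*$")

for line in ["Springfield, IL 62704", "Portland, OR 97205-1234", "Call us today!"]:
    print(line, "->", bool(CITY_STATE_ZIP.search(line)))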
Unless your goal is 100% accuracy on all addresses then you probably should aim for extracting as many addresses as you can within the budget for the task.
It doesn't seem like a task for NLP unless you want to use Named Entity identification to find cities, countries etc.
Your task is called information extraction, but that's a very, very broad concept. Luckily your task is more limited (street addresses), but you don't give a lot of information:
What countries are the addresses in? An address in Tokyo looks very different from one in Cleveland. Your odds of succeeding are much better if you're interested in addresses from a limited number of countries -- you can develop a solution for each of them. If we're talking about a very limited number, you could code a recognizer manually.
What kind of webpages are we talking about? Are they a random collection, or can you group them into a limited number of websites and formats? Where do the addresses appear? (I.e., are there any contextual clues you can use to zero in on them?)
I'll take a worst-case scenario for question 2: the pages are completely disparate and the address could be anywhere. I'm not sure what the state of the art is, but I'd approach it as a chunking problem.
To get any kind of decent results, you need a training set. At a minimum, a large collection of addresses from the same locations and in the same style (informal, incomplete, complete) as the addresses you'll be extracting. Then you can try to coax decent performance out of a chunker (probably with post-processing).
PS. I wouldn't just discard the html mark-up. It contains information about document structure which could be useful. I'd add structural mark-up (paragraphs, emphasis, headings, lists, displays) before you throw out the html tags.
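To make the chunking idea concrete, a hedged sketch with NLTK's RegexpParser; the tag pattern and example sentence are invented, and real pages would need a trained chunker plus post-processing.

# Illustrative chunk grammar over POS tags: number + proper nouns + city/state/zip.
# Requires: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
# The grammar is a toy; actual tags vary, so treat this as a starting point only.
import nltk

sentence = "Send mail to 123 Main Street , Springfield IL 62704 before Friday ."
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

grammar = r"""
  ADDRESS: {<CD><NNP.*>+<,>?<NNP.*>+<CD>}
"""
chunker = nltk.RegexpParser(grammar)
tree = chunker.parse(tagged)

for subtree in tree.subtrees(filter=lambda t: t.label() == "ADDRESS"):
    print(" ".join(word for word, tag in subtree.leaves()))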
I am a newbie in Python and have been trying my hand at different problems that introduce me to different modules and functionalities (I find it a good way of learning).
I have googled around a lot but haven't found anything close to a solution to the problem.
I have a large data set of Facebook posts from various groups on Facebook that use it as a medium to broadcast information.
I want to group these posts so that posts with essentially the same content end up together.
For example, one of the posts is "xyz.com is selling free domains. Go register at xyz.com"
and another is "Everyone needs to register again at xyz.com. Due to server failure, all data has been lost."
These are similar in that they both ask the reader to go to the group's website and register.
P.S.: Just a clarification: if either of the links had been abc.com, the posts wouldn't have been similar.
Priority goes to the source first and then to the action (the action here being registering).
Is there a simple way to do it in python? (a module maybe?)
I know it requires some sort of clustering algorithm (correct me if I am wrong); my question is: can Python make this job easier for me somehow? Some module or anything?
Any help is much appreciated!
Assuming you have a function called geturls that takes a string and returns a list of urls contained within, I would do it like this:
from collections import defaultdict

# Map each URL mentioned in a post to the list of posts that mention it.
groups = defaultdict(list)

for post in facebook_posts:
    for url in geturls(post):
        groups[url].append(post)
That greatly depends on your definition of being "content-wise the same". A straightforward approach is to use a so-called Term Frequency - Inverse Document Frequency (TF-IDF) model.
Simply put, make a long list of all words in all your posts, filter out stop words (articles, determiners, etc.), and for each document (= post) count how often each term occurs, multiplying that by the importance of the term (which is the inverse document frequency, calculated as the log of the ratio of the total number of documents to the number of documents in which the term occurs). This way, words which are very rare will be more important than common words.
You end up with a huge table in which every document (still, we're talking about group posts here) is represented by a (very sparse) vector of terms. Now you have a metric for comparing documents. As your documents are very short, only a few terms will be significantly high, so similar documents might be the ones where the same term achieved the highest score (ie. the highest component of the document vectors is the same), or maybe the euclidean distance between the three highest values is below some parameter. That sounds very complicated, but (of course) there's a module for that.
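As a sketch of "there's a module for that", assuming scikit-learn: build TF-IDF vectors for the posts and flag pairs whose cosine similarity clears a threshold. The threshold and the third post are made up.

# Build TF-IDF vectors for the posts and group pairs above a similarity threshold.
# Posts and the 0.2 threshold are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

posts = [
    "xyz.com is selling free domains. Go register at xyz.com",
    "Everyone needs to register again at xyz.com. Due to server failure, all data has been lost.",
    "Meeting moved to Friday, please update your calendars.",
]

X = TfidfVectorizer(stop_words="english").fit_transform(posts)
sims = cosine_similarity(X)

THRESHOLD = 0.2
for i in range(len(posts)):
    for j in range(i + 1, len(posts)):
        if sims[i, j] >= THRESHOLD:
            print(f"posts {i} and {j} look related (score {sims[i, j]:.2f})")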