Defining the context of a word - Python - python

I think this is an interesting question, at least for me.
I have a list of words, let's say:
photo, free, search, image, css3, css, tutorials, webdesign, tutorial, google, china, censorship, politics, internet
and I have a list of contexts:
Programming
World news
Technology
Web Design
I need to try and match words with the appropriate context/contexts if possible.
Maybe discovering word relationships in some way.
Any ideas?
Help would be much appreciated!

This sounds like it's more of a categorization/ontology problem than NLP. Try WordNet for a standard ontology.
I don't see any real NLP in your stated problem, but if you do need some semantic analysis or a parser try NLTK.

Where do these words come from? Do they come from real texts. If they are then it is a classic data mining problem. What you need to do is to your set of documents into the matrix where rows represent which document the word came from and the columns represent the words in the documents.
For example if you have two documents like this:
D1: Need to find meaning.
D2: Need to separate Apples from oranges
you matrix will look like this:
Need to find meaning Apples Oranges Separate From
D1: 1 1 1 1 0 0 0 0
D2: 1 1 0 0 1 1 1 1
This is called term by document matrix
Having collected this statistics you can use algorithms like K-Means to group similar documents together. Since you already know how many concepts you have your tasks should be soomewhat easier. K-Means is very slow algorithm, so you can try to optimize it using techniques such as SVD

I just found this a couple days ago: ConceptNet
It's a commonsense ontology, so it might not be as specific as you would like, but it has a python API and you can download their entire database (currently around 1GB decompressed). Just keep in mind their licensing restrictions.
If you read the papers that were published by the team that developed it, you may get some ideas on how to relate your words to concepts/contexts.

The answer to your question obviously depends on the target taxonomy you are trying to map your terms into. Once you have decided on this you need to figure out how fine-grained the concepts should be. WordNet, as it has been suggested in other responses, will give you synsets, i.e. sets of terms which are more or less synonymous but which you will have to map to concepts like 'Web Design' or 'World News' by some other mechanism since these are not encoded in WordNet. If you're aiming at a very broad semantic categorization, you could use WordNet's higher-level concept nodes which differentiate, e.g. (going up the hierarchy) human from animal, animates from plants, substances from solids, concrete from abstract things, etc.
Another kind-of-taxonomy which may be quite useful to you is the Wikipedia category system. This is not just a spontaneous idea I just came up with, but there has been a lot of work on deriving real ontologies from Wikipedia categories. Take a look at the Java Wikipedia Library - the idea would be to find a wikipedia article for the term in question (e.g. 'css3'), extract the categories this article belongs to, and pick the best ones with respect to some criterion (i.e. 'programming', 'technology', and 'web-development'). Depending on what you're trying to do this last step (choosing the best of several given categories) may or may not be difficult.
See here for a list of other ontologies / knowledge bases you could use.

Related

Ways of obtaining a similarity metric between two full text documents?

So imagine I have three text documents, for example (let 3 randomly generated texts).
Document 1:
"Whole every miles as tiled at seven or. Wished he entire esteem mr oh by. Possible bed you pleasure civility boy elegance ham. He prevent request by if in pleased. Picture too and concern has was comfort. Ten difficult resembled eagerness nor. Same park bore on be...."
Document 2:
"Style too own civil out along. Perfectly offending attempted add arranging age gentleman concluded. Get who uncommonly our expression ten increasing considered occasional travelling. Ever read tell year give may men call its. Piqued son turned fat income played end wicket..."
If I want to obtain in python (using libraries) a metric on how similar these 2 documents are to a third one (in other words, which one of the 2 documents is more similar to a third one) , what would be the best way to proceed?
edit: I have observed other questions that they answer by comparing individual sentences to other sentences, but I am not interested on that, as I want to compare a full text (consisting on related sentences) against another full text, and obtaining a number (which for example may be bigger than another comparison obtained with a different document which is less similar to the target one)
There is no simple answer to this question. As similarities will perform better or worse depending on the particular task you want to perform.
Having said that, you do have a couple of options regarding comparing blocks of text. This post compares and ranks several different ways of computing sentence similarity, which you can then aggregate to perform full document similarity. How to aggregate this? will also depend on your particular task. A simple, but often well-performing approach is to compute the average sentence similarities of the 2 (or more) documents.
Other useful links for this topics include:
Introduction to Information Retrieval (free book)
Doc2Vec (from gensim, for paragraph embeddings, which is probably very suitable for your case)
You could try the Simphile NLP text similarity library (disclosure: I'm the author). It offers several language agnostic methods: JaccardSimilarity, CompressionSimilarity, EuclidianSimilarity. Each has their advantages, but all work well on full document comparison:
Install:
pip install simphile
This example shows Jaccard, but is exactly the same with Euclidian or Compression:
from simphile import jaccard_similarity
text_a = "I love dogs"
text_b = "I love cats"
print(f"Jaccard Similarity: {jaccard_similarity(text_a, text_b)}")

Assigning different weights to search keywords

I'm implementing a search engine and so far I am done with the part for web crawling, storing the results in the index and retrieving results for the search keywords entered by the user. However I would like the search results to be more specific. Let's say I'm searching "Shoe shops in Hyderabad". Is there any NLP library in python that can just process the text and assign higher weights on important words like in this case "shoes" and "Hyderabad".
Thanks.
I don't think one approach is going to solve the entire problem here. Your question is broad and it will take multiple steps to get best results. Here is how I would approach the problem
Create N-grams analyser with Lucene and query. Lucene also allows
Phrase queries. Shoe shops in Hyderabad is a good fit for that.
Use cosine similarity to treat Shoe shops in Hyderabad and Footwear shops in Hyderabad similarly.
Also think of some linguistic angle. Simple POS tagging and role based rule engine can help you get much smarter results for queries like Shoe shops in Hyderabad or Shoe under 500 bucks where a very finite word set of in, under, on etc can be assigned rules on location/ comparison etc. This point assumes you are looking at English language. You will have to build this layer separately for each language though.
Hope this helps.
I think the question is good (I was looking something similar last week) and of course as other people mention the question is too broad. But I think you can face it using a information Retrieval system. I can recommend you Lemur project and spefically Indri, it includes a lot of customization features for queries, then is possible weighting using n-grams, tf-idf (as previuos two answer suggest) or just use you own criteria. If you want use Indri, check this is an tutorial/introduction, and something about weighting is at page 56.
Good look!

Clustering Company Curriculum Vitae (CV's) in python (Clustering pieces of text)

I am trying to classify (cluster) our companies Curriculum Vitae (CVs). There is about 100 CV's in total. The idea is to find similar people based on their CV content. I have already transformed the word docs into text files and read all of the candidates into a python dictionary with the format:
cvdict = { 'name1' : "cv text", 'name2', : 'cv text', ... }
I have also remove most punctuation, lowercased it, removed numbers etc., and removed words with a length less than x (4)
My questions:
Is clustering the correct approach? If not, which Machine Learning algorithm would be a suitable initial focus for this task.
Any pointers as to some python code i can use to transverse this dictionary and 'cluster' the content. Based on the clustering of the content it should output the ‘keys’=candidate names as clustered groups.
So from what I understood you want to see potential groups/clusters in the set of CVs.
the idea of cvdict is great, but you also need to convert all texts to numbers ! you are half way through. so think about matrix/excel sheet/table. where you have the profile of each employee in each line.
name1,cv_text1
name2,cv_text2
name3,cv_text3 ...
Yes, as you can guess, the length of cv_text can vary. Some people have a lengthy resume some other not ! which words can categorize the company employee. Some how we need to make them all equal size; Also, not all words are informative, you need to think which words can capture your idea; In Machine Learning they call it "Feature" vector or matrix. So my suggestion would be drive a set of words and mark if the person has mentioned that word in his skill.
managment marketing customers statistics programming
name1 1 1 0 0 0
name2 0 0 0 1 1
name3 0 0 1 1 0
or instead of a 0/1 matrix you can put how many times that word was mentioned in the resume.
again you can just extract all possible words from all resumes. NLTK is an awesome module for doing text analysis and it has some built-in function for you to polish you text. have a look at the first half of this slide.
Then you can use any kind of clustering method, for example hierarchical https://code.activestate.com/recipes/578834-hierarchical-clustering-heatmap-python/
there are already packages for doing such analysis; either in scipy or scikit and I am sure for each you can find a tons of examples. The key step is the one you are already working on; representing your data as a matrix.
Couple more hints to earlier comment:
I would not throw away words less than 4 characters long. Instead I would use a stop list of common words. You don't want to throw away things like C++ or C#
One good technique of building a matrix above is to use TF-IDF metric. What it is is essentially a measure of how frequently some word occurs in a particular document vs. how frequently it occurs in the entire collection. So things like 'the' are very common so they will be downgraded very quickly. If only 5 people in your company now C++ this will boost up the metric for this word a lot.
You might want to consider to use a stemmer like a 'porter algorithm'. This algorithm will combine words like 'statistics' and 'statistical'.
Most machine learning algorithms have a problem with very wide matrices. Unfortunately, your resume base is only 100 documents which is considered quite low vs how many potential terms you will have. The reason these techniques work for google and NSA is because human languages tend to have tens of thousands words in active use vs billions of documents they have to index. For your task I would try to shrink you dataset to no more than 30-40 columns. Be very aggressive on throwing away the common words.
Unfortunately the biggest weakness of most of the clustering techniques is that you have to set a number of clusters in advance. A common approach that people use is to set up some type of measure of how good your clusters are and start running the clustering algorithm first with very few clusters and keep increasing until your metrics starts to drop off. Look up Andrew Ng machine learning course on the interwebs. He explains these technique very well.
Of course hierarchical clustering is not affected by the point 5.
Instead of clustering you can try building decision tree. Although not super accurate, decision trees have a great advantage to visualize the built model. By looking at the three you can easily see the reason where built the way they are.
Besides scipy and scikit, which are very good. Take a look at Orange Toolbox. It has a lot of good algorithms with good visualization tools. They way you program it is just by connecting boxes with arrows. After you got satisfied with your model you can easily dump it out to the run as a script.
Hope this helps.

Unstructured Text to Structured Data

I am looking for references (tutorials, books, academic literature) concerning structuring unstructured text in a manner similar to the google calendar quick add button.
I understand this may come under the NLP category, but I am interested only in the process of going from something like "Levi jeans size 32 A0b293"
to: Brand: Levi, Size: 32, Category: Jeans, code: A0b293
I imagine it would be some combination of lexical parsing and machine learning techniques.
I am rather language agnostic but if pushed would prefer python, Matlab or C++ references
Thanks
You need to provide more information about the source of the text (the web? user input?), the domain (is it just clothes?), the potential formatting and vocabulary...
Assuming worst case scenario you need to start learning NLP. A very good free book is the documentation of NLTK: http://www.nltk.org/book . It is also a very good introduction to Python and the SW is free (for various usages). Be warned: NLP is hard. It doesn't always work. It is not fun at times. The state of the art is no where near where you imagine it is.
Assuming a better scenario (your text is semi-structured) - a good free tool is pyparsing. There is a book, plenty of examples and the resulting code is extremely attractive.
I hope this helps...
Possibly look at "Collective Intelligence" by Toby Segaran. I seem to remember that addressing the basics of this in one chapter.
After some researching I have found that this problem is commonly referred to as Information Extraction and have amassed a few papers and stored them in a Mendeley Collection
http://www.mendeley.com/research-papers/collections/3237331/Information-Extraction/
Also as Tai Weiss noted NLTK for python is a good starting point and this chapter of the book, looks specifically at information extraction
If you are only working for cases like the example you cited, you are better off using some manual rule-based that is 100% predictable and covers 90% of the cases it might encounter production..
You could enumerable lists of all possible brands and categories and detect which is which in an input string cos there's usually very little intersection in these two lists..
The other two could easily be detected and extracted using regular expressions. (1-3 digit numbers are always sizes, etc)
Your problem domain doesn't seem big enough to warrant a more heavy duty approach such as statistical learning.

How do content discovery engines, like Zemanta and Open Calais work?

I was wondering how as semantic service like Open Calais figures out the names of companies, or people, tech concepts, keywords, etc. from a piece of text. Is it because they have a large database that they match the text against?
How would a service like Zemanta know what images to suggest to a piece of text for instance?
Michal Finkelstein from OpenCalais here.
First, thanks for your interest. I'll reply here but I also encourage you to read more on OpenCalais forums; there's a lot of information there including - but not limited to:
http://opencalais.com/tagging-information
http://opencalais.com/how-does-calais-learn
Also feel free to follow us on Twitter (#OpenCalais) or to email us at team#opencalais.com
Now to the answer:
OpenCalais is based on a decade of research and development in the fields of Natural Language Processing and Text Analytics.
We support the full "NLP Stack" (as we like to call it):
From text tokenization, morphological analysis and POS tagging, to shallow parsing and identifying nominal and verbal phrases.
Semantics come into play when we look for Entities (a.k.a. Entity Extraction, Named Entity Recognition). For that purpose we have a sophisticated rule-based system that combines discovery rules as well as lexicons/dictionaries. This combination allows us to identify names of companies/persons/films, etc., even if they don't exist in any available list.
For the most prominent entities (such as people, companies) we also perform anaphora resolution, cross-reference and name canonization/normalization at the article level, so we'll know that 'John Smith' and 'Mr. Smith', for example, are likely referring to the same person.
So the short answer to your question is - no, it's not just about matching against large databases.
Events/Facts are really interesting because they take our discovery rules one level deeper; we find relations between entities and label them with the appropriate type, for example M&As (relations between two or more companies), Employment Changes (relations between companies and people), and so on. Needless to say, Event/Fact extraction is not possible for systems that are based solely on lexicons.
For the most part, our system is tuned to be precision-oriented, but we always try to keep a reasonable balance between accuracy and entirety.
By the way there are some cool new metadata capabilities coming out later this month so stay tuned.
Regards,
Michal
I'm not familiar with the specific services listed, but the field of natural language processing has developed a number of techniques that enable this sort of information extraction from general text. As Sean stated, once you have candidate terms, it's not to difficult to search for those terms with some of the other entities in context and then use the results of that search to determine how confident you are that the term extracted is an actual entity of interest.
OpenNLP is a great project if you'd like to play around with natural language processing. The capabilities you've named would probably be best accomplished with Named Entity Recognizers (NER) (algorithms that locate proper nouns, generally, and sometimes dates as well) and/or Word Sense Disambiguation (WSD) (eg: the word 'bank' has different meanings depending on it's context, and that can be very important when extracting information from text. Given the sentences: "the plane banked left", "the snow bank was high", and "they robbed the bank" you can see how dissambiguation can play an important part in language understanding)
Techniques generally build on each other, and NER is one of the more complex tasks, so to do NER successfully, you will generally need accurate tokenizers (natural language tokenizers, mind you -- statistical approaches tend to fare the best), string stemmers (algorithms that conflate similar words to common roots: so words like informant and informer are treated equally), sentence detection ('Mr. Jones was tall.' is only one sentence, so you can't just check for punctuation), part-of-speech taggers (POS taggers), and WSD.
There is a python port of (parts of) OpenNLP called NLTK (http://nltk.sourceforge.net) but I don't have much experience with it yet. Most of my work has been with the Java and C# ports, which work well.
All of these algorithms are language-specific, of course, and they can take significant time to run (although, it is generally faster than reading the material you are processing). Since the state-of-the-art is largely based on statistical techniques, there is also a considerable error rate to take into account. Furthermore, because the error rate impacts all the stages, and something like NER requires numerous stages of processing, (tokenize -> sentence detect -> POS tag -> WSD -> NER) the error rates compound.
Open Calais probably use language parsing technology and language statics to guess which words or phrases are Names, Places, Companies, etc. Then, it is just another step to do some kind of search for those entities and return meta data.
Zementa probably does something similar, but matches the phrases against meta-data attached to images in order to acquire related results.
It certainly isn't easy.

Categories