Removing punctuation with an exception in python

I am trying to remove punctuation from a given string in Python.
This works well, but the data I am using includes lots of ":D", ":)" and ":(".
When I analyse the data, I therefore end up removing these text smileys entirely or, in the ":D" case, removing only the ":".
Here is some example code:
import string
import nltk
example_string = 'I would like to remove some punctiation, \
however some stuff like \':D\' causes errors. \
How would I not get rid of \':\', \
if it is followed by a \'D\'? '
translator = str.maketrans('', '', string.punctuation)
line = example_string.translate(translator)
line = nltk.word_tokenize(line)
line = [word.lower() for word in line
if word not in ['\'', '’', '”', '“']]
This produces as output:
['i', 'would', 'like', 'to', 'remove', 'some', 'punctiation',
'however', 'some', 'stuff', 'like', 'd', 'causes', 'errors',
'how', 'would', 'i', 'not', 'get', 'rid', 'of', 'if', 'it',
'is', 'followed', 'by', 'a', 'd']
What I would like to produce is (check the second line 5th word):
['i', 'would', 'like', 'to', 'remove', 'some', 'punctiation',
'however', 'some', 'stuff', 'like', ':d', 'causes', 'errors',
'how', 'would', 'i', 'not', 'get', 'rid', 'of', 'if', 'it',
'is', 'followed', 'by', 'a', 'd']
It will also remove all of ":)" or ":(".
Is there a way to keep ":" if it is followed by a "d",
or to keep ")" or "(" if the previous character is ":"?
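One possible way to get the behaviour asked for above is to tokenise with a regular expression that tries an emoticon pattern before falling back to plain words, instead of deleting punctuation first. This is only a sketch: the emoticon pattern, `EMOTICON`, `TOKEN_RE` and `tokenize_keep_emoticons` are names and assumptions made up for illustration, and the pattern covers just a few smileys.

```python
import re

# Hypothetical emoticon pattern (an assumption, extend as needed):
# eyes ":" or ";", an optional nose "-", then one mouth character.
# The lookarounds stop it from firing inside text such as "note:do".
EMOTICON = r"(?<!\w)[:;]-?[dDpP()](?!\w)"

# Try the emoticon alternative first, then fall back to runs of word characters.
TOKEN_RE = re.compile(EMOTICON + r"|\w+")

def tokenize_keep_emoticons(text):
    """Return lowercase word tokens, keeping emoticons such as ':D' intact."""
    return [token.lower() for token in TOKEN_RE.findall(text)]

tokens = tokenize_keep_emoticons("Some stuff like ':D' causes errors :) really?")
print(tokens)
```

Everything that is neither an emoticon nor a word character is simply skipped, so no separate punctuation-stripping step is needed. Note that contractions such as "don't" come out as two tokens with this pattern.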


Word Counter loop keeps loading forever in Python

I have a DataFrame comments as seen below. I want to make a Counter of words for the Text field. I have made a list of UserId whose count of words is needed, those UserIds are stored in gold_users. But the loop to create Counter just keeps loading. Please help me fix this.
This is just part of dataframe, original has many rows.
Id | Text                                              | UserId
6  | Before the 2006 course, there was Allen Knutso... | 3
8  | Also, Theo Johnson-Freyd has some notes from M... | 1
# Text cleaning
import re
import string
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
punct = set(string.punctuation)
stopword = set(stopwords.words('english'))
lm = WordNetLemmatizer()
def clean_text(text):
    text = ''.join(char.lower() for char in text if char not in punct)
    tokens = re.split(r'\W+', text)
    text = [lm.lemmatize(word) for word in tokens if word not in stopword]
    return tuple(text)  # Writing only `return text` was giving unhashable error 'list'
comments['Text'] = comments['Text'].apply(lambda x: clean_text(x))
for index, rows in comments.iterrows():
    gold_comments = rows[comments.Text.loc[comments.UserId.isin(gold_users)]]
Expected Output
Given a DataFrame that already contains only your gold_users' ids and texts, the following pure-Python function returns what you need:
def word_count(df):
    counts = dict()
    for text in df['Text']:
        words = text.split()
        for word in words:
            if word in counts:
                counts[word] += 1
            else:
                counts[word] = 1
    return list(counts.items())
Hope it helps!
You overcomplicated the problem, I am afraid. In Pandas, it is almost never desirable to iterate through the rows. Select the rows that meet your condition, add their Texts, and apply a Counter to the combined list:
from collections import Counter
gold_users = [3,1]
golden_comments = comments[comments['UserId'].isin(gold_users)]
counter = Counter(golden_comments['Text'].sum())
If necessary, convert the counter to a list of lists:
[[k, v] for k, v in counter.items()]
# [['2006', 1], ['course', 1], ['allen', 1], ...]
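The same select-then-count idea can be sketched without pandas, which may help to see what the Counter is actually fed. The rows below are made-up stand-ins for the cleaned DataFrame (each `Text` is already a tuple of tokens), and `itertools.chain.from_iterable` is used here in place of `.sum()` to flatten the tuples in one pass.

```python
from collections import Counter
from itertools import chain

# Stand-in rows (UserId, tokens); hypothetical data, already cleaned and tokenised.
rows = [
    (3, ("2006", "course", "allen")),
    (1, ("also", "theo", "note")),
    (7, ("course", "unrelated")),
]
gold_users = {3, 1}

# Keep only the gold users' token tuples, then count every token in one pass.
gold_tokens = (tokens for user_id, tokens in rows if user_id in gold_users)
counter = Counter(chain.from_iterable(gold_tokens))
print(counter.most_common(3))
```

Summing tuples builds a new, longer tuple at every step, so for many rows flattening lazily is likely to be cheaper, though for small frames either approach should be fine.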
# Initialise packages in session:
import pandas as pd
import re
# comments => Data Frame
comments = pd.DataFrame({"Id": [6, 8],
"Text": ["Before the 2006 course, there was Allen Knutso...",
"Also, Theo Johnson-Freyd has some notes from M..."],
"UserId": [3, 1]})
# Stopwords to remove from text: stopwords_lst => list of strings
stopwords_lst = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've",
"you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself',
'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their',
'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these',
'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having',
'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as',
'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in',
'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when',
'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some',
'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can',
'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've',
'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn',
"hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't",
'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn',
"wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
# Clean lists of strings using regex: list of strings => function() => list of strings
def clean_string_list(str_lst):
    """Convert all alphanumeric characters in a list of strings to lowercase,
    replace non-alphanumeric characters with whitespace, and trim whitespace
    on both sides of each string.
    Args:
        str_lst (list): A list of strings.
    Returns:
        (list) A list of strings.
    """
    return [re.sub(r'\W+', ' ', x.lower()).strip() for x in str_lst]
# Store a list of gold user's UserIds: gold_user_ids => list of integers:
gold_user_ids = [3, 1]
# Take Subset of Data Frame containing only gold users: gold_users => Data Frame
gold_users = comments[comments["UserId"].isin(gold_user_ids)].copy()  # copy so a new column can be added safely
# Apply the function to the list of stopwords and build one regex that matches any stopword as a whole word: stopwords_re => string
stopwords_re = r'\b(?:' + '|'.join(clean_string_list(stopwords_lst)) + r')\b'
# Clean strings, and remove stopwords: cleaned_text => vector of strings
gold_users['cleaned_text'] = [*map(lambda y: re.sub(stopwords_re, ' ', y), clean_string_list(gold_users['Text']))]
# Split each word on whitespace: words => list of strings
words = (' '.join(gold_users['cleaned_text'])).split()
# Count the number of occurrences of each word: word_count => dict
word_count = dict(zip(words, [*map(lambda z: words.count(z), words)]))
# Print words to console: dictionary => stdout
print(word_count)

Deleting elements in list of list using list comprehensions(Python)

I have the following data:
So I have a list that consists of 2 other lists (in my real case there are 50,000 lists in one big list).
I want to delete all punctuation and stopwords like "the", "a", "of", etc.
Here is what I have coded:
import string
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
punct = list(string.punctuation)
stops = set(stopwords.words("english"))
res = [[word.lower() for word in sentence if word not in punct or word.lower() not in stops] for sentence in dataset]
But it returns me the same list of lists that I initially had.
What is wrong with my code?
You should use and instead of or:
res = [[word.lower() for word in sentence if word not in punct and word.lower() not in stops] for sentence in dataset]
Otherwise you keep every element, because each word is missing from at least one of stops and punct.
Since punct and stops do not overlap, every word is absent from at least one of them, so the or condition is always true; you want to keep only the words that are in neither.
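To make the difference concrete, here is a minimal sketch that runs both conditions on one short sentence. The `punct` and `stops` sets are tiny made-up stand-ins for `string.punctuation` and the NLTK stopword list, so it runs without NLTK.

```python
# Tiny stand-ins for string.punctuation and the NLTK stopword list (assumed values).
punct = {".", ",", "``", "''"}
stops = {"the", "of", "an"}
sentence = ["The", "jury", "said", ",", "an", "investigation", "."]

# `or` keeps a word if it is missing from EITHER collection, which is always true here.
kept_with_or = [w.lower() for w in sentence if w not in punct or w.lower() not in stops]

# `and` keeps a word only if it is missing from BOTH collections.
kept_with_and = [w.lower() for w in sentence if w not in punct and w.lower() not in stops]

print(kept_with_or)
print(kept_with_and)
```

The first list still contains every token (only lowercased), while the second keeps just the content words.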
Assuming it is OK to update stops, here is an alternative that avoids the two-level comprehension:
import string
import nltk
from nltk.corpus import stopwords
dataset = [
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an',
'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election',
'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities',
'took', 'place', '.'],
['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments',
'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had',
'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves',
'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta',
"''", 'for', 'the', 'manner',
'in', 'which', 'the', 'election', 'was', 'conducted', '.']
]
punct = list(string.punctuation)
stops = set(stopwords.words("english"))
# Union of punct and stops
stops.update(punct)
res1 = [[word for word in sentence if word.lower() not in stops]
        for sentence in dataset]
# Alternative solution that avoids an explicit 2-level list comprehension
def filter_the(sentence, stops):
    return [word for word in sentence if word.lower() not in stops]
res2 = [filter_the(sentence, stops) for sentence in dataset]
print(res1 == res2)

UnicodeDecodeError with word stemming in Python

I'm so stumped.
I have a list of a couple of thousand words
x = ['company', 'arriving', 'wednesday', 'and', 'then', 'beach', 'how', 'are', 'you', 'any', 'warmer', 'there', 'enjoy', 'your', 'day', 'follow', 'back', 'please', 'everyone', 'go', 'watch', 's', 'new', 'video', 'you', 'know', 'the', 'deal', 'make', 'sure', 'to', 'subscribe', 'and', 'like', '<http>', 'you', 'said', 'next', 'week', 'you', 'will', 'be', 'the', 'one', 'picking', 'me', 'up', 'lol', 'hindi', 'na', 'tl', 'huehue', 'that', 'works', 'you', 'said', 'everyone', 'of', 'us', 'my', 'little', 'cousin', 'keeps', 'asking', 'if', 'i', 'wanna', 'play', 'and', "i'm", 'like', 'yes', 'but', 'with', 'my', 'pals', 'not', 'you', "you're", 'welcome', 'pas', 'quand', 'tu', 'es', 'vers', '<num>', 'i', 'never', 'get', 'good', 'mornng', 'texts', 'sad', 'sad', 'moment', 'i', 'think', 'ima', 'go', 'get', 'a', 'glass', 'of', 'milk', 'ahah', 'for', 'the', 'first', 'time', 'i', 'actually', 'know', 'what', 'their', 'doing', 'd', 'thank', 'you', 'happy', 'birthday', 'hope', "you're"...........]
Now, I've confirmed the type of each element in this list to be a string
types = []
for word in x:
    types.append(type(word))
print set(types)
>>>set([<type 'str'>])
Now, I attempt to stem each word using NLTK's porter stemmer
import nltk
porter = nltk.PorterStemmer()
stemmed_x = [porter.stem(word) for word in x]
And I get this error that is clearly related to the stemming package and unicode somehow:
File "/Library/Python/2.7/site-packages/nltk-3.0.0b2-py2.7.egg/nltk/stem/", line 633, in stem
stem = self.stem_word(word.lower(), 0, len(word) - 1)
File "/Library/Python/2.7/site-packages/nltk-3.0.0b2-py2.7.egg/nltk/stem/", line 591, in stem_word
word = self._step1ab(word)
File "/Library/Python/2.7/site-packages/nltk-3.0.0b2-py2.7.egg/nltk/stem/", line 289, in _step1ab
if word.endswith("ied"):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 12: ordinal not in range(128)
I have tried everything, using, trying to explicitly encode each word as utf8 - still produce the same error.
Please advise.
I should mention that this code worked perfect on my PC running Ubuntu. I recently got a macbook pro and I'm getting this error. I've checked the terminal settings on my mac and it is set to utf8 encoding.
Interesting, with this piece of code, I have isolated the problem words:
for w in x:
    try:
        porter.stem(w)
    except UnicodeDecodeError:
        print w
Seems like what they all have in common is extra punctuation at the end of the word, except for the word #précieux
You probably have a multi-byte UTF-8 character lurking in the data: 0xe2 is the lead byte of a three-byte UTF-8 sequence (for example, the curly apostrophe ’ is encoded as 0xe2 0x80 0x99). As your program assumes ASCII characters, whose valid encoded values run from 0x00 to 0x7F, this byte is rejected.
You might be able to identify the "bad" values with a simple comprehension, then fix them by hand (I assume from your data that you only want to deal with ASCII characters):
print [value for value in x if '\xe2' in value]
Using word.decode('utf-8') should solve this error.
import nltk
porter = nltk.PorterStemmer()
stemmed_x = [porter.stem(word.decode('utf-8')) for word in x]
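For anyone reading this on Python 3, where str is already Unicode, the same failure can be reproduced on a bytes object. The sketch below uses a made-up token containing a curly apostrophe; it only illustrates why the ASCII codec stumbles on 0xe2 and why decoding as UTF-8 works, it is not the asker's actual data.

```python
# Hypothetical raw token: "you're" typed with a curly apostrophe (U+2019),
# which UTF-8 encodes as the three bytes 0xE2 0x80 0x99.
raw = b"you\xe2\x80\x99re"

try:
    raw.decode("ascii")
    bad_offset = None
except UnicodeDecodeError as exc:
    bad_offset = exc.start  # offset of the first byte that ASCII cannot represent

text = raw.decode("utf-8")                 # decoding with the right codec succeeds
ascii_only = text.replace("\u2019", "'")   # optional: normalise to a plain apostrophe
print(bad_offset, ascii_only)
```

Normalising typographic quotes before stemming is one option if the rest of the pipeline expects plain ASCII punctuation.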

nltk pos tagger looks to incorporate '.'

I am new to Python, NLP and NLTK, so please bear with me. I have a handful of articles (~200) that have been categorized by hand. I am looking to develop a heuristic to assist with or recommend categories. To start, I was hoping to build a relationship between current categories and the words within the document.
My premise is that the nouns are more important to the category than any other part of speech. For example the category "Energy" probably is driven nearly completely through the nouns: oil, battery, wind, etc.
The first thing I wanted to do was tag the parts and evaluate them. On the first article I encountered some issues. Some of the tokens are bound to punctuation.
for articles in articles[1]:
    articles_id, content = articles
    clean = nltk.clean_html(content).replace('’', "'")
    tokens = nltk.word_tokenize(clean)
    pos_document = nltk.pos_tag(tokens)
    pos = {}
    for pos_word in pos_document:
        word, part = pos_word
        if pos.has_key(part):
            pos[part].append(word)
        else:
            pos[part] = [word]
Formatted output:
'VBG': ['continuing', 'paying', 'falling', 'starting'],
'VBD': ['made', 'ended'], 'VBN': ['been', 'leaned', 'been', 'been'],
'VBP': ['know', 'hasn', 'have', 'continue', 'expect', 'take', 'see', 'have', 'are'],
'WDT': ['which', 'which'], 'JJ': ['negative', 'positive', 'top', 'modest', 'negative', 'real', 'financial', 'isn', 'important', 'long', 'short', 'next'],
'VBZ': ['is', 'has', 'is', 'leads', 'is', 'is'], 'DT': ['Another', 'the', 'the', 'any', 'any', 'the', 'the', 'a', 'the', 'the', 'the', 'the', 'a', 'the', 'a', 'a', 'the', 'a', 'the', 'any'],
'RP': ['back'],
'NN': [ 'listless', 'day', 'rsquo', 'll', 'progress', 'rsquo', 't', 'news', 'season', 'corner', 'surprise', 'stock', 'line', 'growth', 'question',
'stop', 'engineering', 'growth', 'isn', 'rsquo', 't', 'rsquo', 't', 'stock', 'market', 'look', 'junk', 'bond', 'market', 'turning', 'junk',
'rock', 'history', 'guide', 't', 'day', '%', '%', '%', 'level', 'move', 'isn', 'rsquo', 't', 'indication', 'way'],
',': [',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ','], '.': ['.'],
'TO': ['to', 'to', 'to', 'to', 'to', 'to', 'to'],
'PRP': ['them', 'they', 'they', 'we', 'you', 'they', 'it'],
'RB': ['then', 'there', 'just', 'just', 'always', 'so', 'so', 'only', 'there', 'right', 'there', 'much', 'typically', 'far', 'certainly'],
':': [';', ';', ';', ';', ';', ';', ';'],
'NNS': ['folks', 'companies', 'estimates', 'covers', 's', 'equities', 'bonds', 'equities', 'flats'],
'NNP': ['drift.', 'We', 'Monday', 'DC', 'note.', 'Earnings', 'EPS', 'same.', 'The', 'Street', 'now.', 'Since', 'points.', 'What', 'behind.', 'We', 'flat.', 'The'],
'VB': ['get', 'manufacture', 'buy', 'boost', 'look', 'see', 'say', 'let', 'rsquo', 'rsquo', 'be', 'build', 'accelerate', 'be'],
'WRB': ['when', 'where'],
'CC': ['&', 'and', '&', 'and', 'and', 'or', 'and', '&', '&', '&', 'and', '&', 'and', 'but', '&'],
'CD': ['47', '23', '30'],
'EX': ['there'],
'IN': ['on', 'if', 'until', 'of', 'around', 'as', 'on', 'down', 'since', 'of', 'for', 'under', 'that', 'about', 'at', 'at', 'that', 'like', 'if'],
'MD': ['can', 'will', 'can', 'can', 'will'],
'JJR': ['more']
Notice the word 'drift.' under NNP: shouldn't the period be removed? Do I need to remove this on my own, or am I missing something with the library?
NLTK's word tokenizer assumes that its input has already been separated into sentences. Therefore in order to get it to work, you need to call sent_tokenize on your input first. I think you can use the output of sent_tokenize as the input to word_tokenize, but typically you would want to iterate over your sentences.
for articles in articles[1]:
    articles_id, content = articles
    clean = nltk.clean_html(content).replace('’', "'")
    sents = nltk.sent_tokenize(clean)
    pos = {}
    for sent in sents:
        tokens = nltk.word_tokenize(sent)
        pos_document = nltk.pos_tag(tokens)
        for pos_word in pos_document:
            word, part = pos_word
            if pos.has_key(part):
                pos[part].append(word)
            else:
                pos[part] = [word]
I believe the reason this is necessary is to help distinguish periods that end sentences from periods used in abbreviations (e.g. you wouldn't want "Mr. Smith" to be broken into 'Mr', '.', 'Smith').
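The bookkeeping that collects words per tag can also be written with collections.defaultdict, which works on current Python versions where has_key no longer exists. In this sketch the tagged pairs are hard-coded, made-up stand-ins for what nltk.pos_tag would return, so that it runs without NLTK; the noun filter reflects the premise in the question that nouns matter most.

```python
from collections import defaultdict

# Stand-in for the output of nltk.pos_tag(tokens); these pairs are made up for illustration.
tagged = [
    ("Oil", "NN"), ("prices", "NNS"), ("fell", "VBD"),
    ("on", "IN"), ("Monday", "NNP"), (".", "."),
]

# Group words by tag without the has_key/else bookkeeping.
words_by_tag = defaultdict(list)
for word, tag in tagged:
    words_by_tag[tag].append(word)

# Penn Treebank noun tags all start with "NN" (NN, NNS, NNP, NNPS).
nouns = [word for word, tag in tagged if tag.startswith("NN")]
print(dict(words_by_tag))
print(nouns)
```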

Fuzzy Group By, Grouping Similar Words

This question has been asked here before:
What is a good strategy to group similar words?
but no clear answer was given on how to "group" items. The solution based on difflib is basically a search: for a given item, difflib can return the most similar word out of a list. But how can this be used for grouping?
I would like to reduce
['ape', 'appel', 'apple', 'peach', 'puppy']
to
['ape', 'appel', 'peach', 'puppy']
or
['ape', 'apple', 'peach', 'puppy']
One idea I tried was, for each item, iterate through the list, if get_close_matches returns more than one match, use it, if not keep the word as is. This partly worked, but it can suggest apple for appel, then appel for apple, these words would simply switch places and nothing would change.
I would appreciate any pointers, names of libraries, etc.
Note: also in terms of performance, we have a list of 300,000 items, and get_close_matches seems a bit slow. Does anyone know of a C/C++ based solution out there?
Note: Further investigation revealed kmedoid is the right algorithm (as well as hierarchical clustering), since kmedoid does not require "centers", it takes / uses data points themselves as centers (these points are called medoids, hence the name). In word grouping case, the medoid would be the representative element of that group / cluster.
You need to normalize the groups. In each group, pick one word or coding that represents the group. Then group the words by their representative.
Some possible ways:
Pick the first encountered word.
Pick the lexicographic first word.
Derive a pattern for all the words.
Pick an unique index.
Use the soundex as pattern.
Grouping the words could be difficult, though. If A is similar to B, and B is similar to C, A and C are not necessarily similar to each other. If B is the representative, both A and C could be included in the group. But if A or C is the representative, the other could not be included.
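The order dependence described above is easy to reproduce on three short words. The following Python 3 sketch is a simplified illustration, not the answer's own code: `levenshtein`, `group_by_first_seen` and the threshold `LIMIT = 1` are assumptions chosen so that "cat" and "dot" are each within one edit of "cot" but not of each other.

```python
def levenshtein(s1, s2):
    """Edit distance via the standard two-row dynamic programme."""
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

LIMIT = 1  # hypothetical similarity threshold

def group_by_first_seen(words, limit=LIMIT):
    """Assign each word to the first earlier representative within `limit` edits."""
    groups = {}
    for word in words:
        for seed in groups:
            if levenshtein(seed, word) <= limit:
                groups[seed].append(word)
                break
        else:
            groups[word] = [word]  # no representative was close enough
    return groups

print(group_by_first_seen(["cot", "cat", "dot"]))
print(group_by_first_seen(["cat", "cot", "dot"]))
```

With "cot" seen first, all three words land in one group; with "cat" seen first, "dot" ends up alone, which is exactly the A/B/C situation.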
Going by the first alternative (first encountered word):
LIMIT = 2  # assumed maximum edit distance for two words to share a seed; not given in the snippet
class Seeder:
    def __init__(self):
        self.seeds = set()
        self.cache = dict()
    def get_seed(self, word):
        seed = self.cache.get(word,None)
        if seed is not None:
            return seed
        for seed in self.seeds:
            if self.distance(seed, word) <= LIMIT:
                self.cache[word] = seed
                return seed
        # No existing seed is close enough: the word becomes a new seed
        self.seeds.add(word)
        self.cache[word] = word
        return word
    def distance(self, s1, s2):
        # Levenshtein (edit) distance between s1 and s2
        l1 = len(s1)
        l2 = len(s2)
        matrix = [range(zz,zz + l1 + 1) for zz in xrange(l2 + 1)]
        for zz in xrange(0,l2):
            for sz in xrange(0,l1):
                if s1[sz] == s2[zz]:
                    matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz])
                else:
                    matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz] + 1)
        return matrix[l2][l1]
import itertools
def group_similar(words):
    seeder = Seeder()
    words = sorted(words, key=seeder.get_seed)
    groups = itertools.groupby(words, key=seeder.get_seed)
    return [list(v) for k,v in groups]
import pprint
pprint.pprint(group_similar([
'the', 'be', 'to', 'of', 'and', 'a', 'in', 'that', 'have',
'I', 'it', 'for', 'not', 'on', 'with', 'he', 'as', 'you',
'do', 'at', 'this', 'but', 'his', 'by', 'from', 'they', 'we',
'say', 'her', 'she', 'or', 'an', 'will', 'my', 'one', 'all',
'would', 'there', 'their', 'what', 'so', 'up', 'out', 'if',
'about', 'who', 'get', 'which', 'go', 'me', 'when', 'make',
'can', 'like', 'time', 'no', 'just', 'him', 'know', 'take',
'people', 'into', 'year', 'your', 'good', 'some', 'could',
'them', 'see', 'other', 'than', 'then', 'now', 'look',
'only', 'come', 'its', 'over', 'think', 'also', 'back',
'after', 'use', 'two', 'how', 'our', 'work', 'first', 'well',
'way', 'even', 'new', 'want', 'because', 'any', 'these',
'give', 'day', 'most', 'us'
]), width=120)
[['and', 'a', 'in', 'on', 'as', 'at', 'an', 'one', 'all', 'can', 'no', 'want', 'any'],
['but', 'about', 'get', 'just'],
['good', 'look'],
['have', 'make', 'give'],
['his', 'her', 'if', 'him', 'its', 'how', 'us'],
['know', 'new'],
['like', 'time', 'take'],
['of', 'I', 'it', 'for', 'not', 'he', 'you', 'do', 'by', 'we', 'or', 'my', 'so', 'up', 'out', 'go', 'me', 'now'],
['over', 'our', 'even'],
['say', 'she', 'way', 'day'],
['some', 'see', 'come'],
['the', 'be', 'to', 'that', 'this', 'they', 'there', 'their', 'them', 'other', 'then', 'use', 'two', 'these'],
['what', 'who', 'when', 'than'],
['with', 'will', 'which'],
['would', 'could'],
['year', 'your']]
Among the close matches, you have to decide which word you want to keep. You could take the first element of the list that get_close_matches returns, or pick one element at random from that list.
There must be some sort of rule for it.
In [19]: import difflib
In [20]: a = ['ape', 'appel', 'apple', 'peach', 'puppy']
In [21]: a = ['appel', 'apple', 'peach', 'puppy']
In [22]: b = difflib.get_close_matches('ape',a)
In [23]: b
Out[23]: ['apple', 'appel']
In [24]: import random
In [25]: c = random.choice(b)
In [26]: c
Out[26]: 'apple'
Now remove c from the initial list, and that's it.
For c++, you can use Levenshtein_distance
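One way to turn get_close_matches into an actual grouping rule is to keep a list of representatives and let each new word either join its closest representative or become a new one. This is a rough Python 3 sketch of that idea; `group_with_difflib` and the cutoff of 0.8 are assumptions for illustration, and the result depends on both the cutoff and the input order.

```python
import difflib

def group_with_difflib(words, cutoff=0.8):
    """Greedy grouping: each word joins its closest existing representative, if any."""
    groups = {}  # representative -> members, in first-seen order
    for word in words:
        match = difflib.get_close_matches(word, list(groups), n=1, cutoff=cutoff)
        if match:
            groups[match[0]].append(word)
        else:
            groups[word] = [word]  # nothing close enough: start a new group
    return groups

groups = group_with_difflib(['ape', 'appel', 'apple', 'peach', 'puppy'])
print(list(groups))
```

The dictionary keys are the reduced word list, and the values record which words were folded into each representative. Because every word is compared only against the current representatives, this does far fewer comparisons than matching each word against the full list.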
Here is another version using Affinity Propagation algorithm.
import numpy as np
import scipy.linalg as lin
import Levenshtein as leven
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.cluster import AffinityPropagation
import itertools
words = np.array(
['the', 'be', 'to', 'of', 'and', 'a', 'in', 'that', 'have',
'I', 'it', 'for', 'not', 'on', 'with', 'he', 'as', 'you',
'do', 'at', 'this', 'but', 'his', 'by', 'from', 'they', 'we',
'say', 'her', 'she', 'or', 'an', 'will', 'my', 'one', 'all',
'would', 'there', 'their', 'what', 'so', 'up', 'out', 'if',
'about', 'who', 'get', 'which', 'go', 'me', 'when', 'make',
'can', 'like', 'time', 'no', 'just', 'him', 'know', 'take',
'people', 'into', 'year', 'your', 'good', 'some', 'could',
'them', 'see', 'other', 'than', 'then', 'now', 'look',
'only', 'come', 'its', 'over', 'think', 'also', 'back',
'after', 'use', 'two', 'how', 'our', 'work', 'first', 'well',
'way', 'even', 'new', 'want', 'because', 'any', 'these',
'give', 'day', 'most', 'us'])
print "calculating distances..."
(dim,) = words.shape
f = lambda (x,y): -leven.distance(x,y)
res=np.fromiter(itertools.imap(f, itertools.product(words, words)), dtype=np.uint8)
A = np.reshape(res,(dim,dim))
af = AffinityPropagation().fit(A)
cluster_centers_indices = af.cluster_centers_indices_
labels = af.labels_
unique_labels = set(labels)
for i in unique_labels:
    print words[labels==i]
Distances had to be converted to similarities; I did that by taking the negative of the distance. The output is
['to' 'you' 'do' 'by' 'so' 'who' 'go' 'into' 'also' 'two']
['it' 'with' 'at' 'if' 'get' 'its' 'first']
['of' 'for' 'from' 'or' 'your' 'look' 'after' 'work']
['the' 'be' 'have' 'I' 'he' 'we' 'her' 'she' 'me' 'give']
['this' 'his' 'which' 'him']
['and' 'a' 'in' 'an' 'my' 'all' 'can' 'any']
['on' 'one' 'good' 'some' 'see' 'only' 'come' 'over']
['would' 'could']
['but' 'out' 'about' 'our' 'most']
['make' 'like' 'time' 'take' 'back']
['that' 'they' 'there' 'their' 'when' 'them' 'other' 'than' 'then' 'think'
'even' 'these']
['not' 'no' 'know' 'now' 'how' 'new']
['will' 'people' 'year' 'well']
['say' 'what' 'way' 'want' 'day']
['as' 'up' 'just' 'use' 'us']
Another method could be matrix factorization using SVD. First we create a word distance matrix; for 100 words this would be a 100 x 100 matrix representing the distance from each word to all other words. Then SVD is run on this matrix, and the u in the resulting u,s,v can be seen as membership strength for each cluster.
import numpy as np
import scipy.linalg as lin
import Levenshtein as leven
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
import itertools
words = np.array(
['the', 'be', 'to', 'of', 'and', 'a', 'in', 'that', 'have',
'I', 'it', 'for', 'not', 'on', 'with', 'he', 'as', 'you',
'do', 'at', 'this', 'but', 'his', 'by', 'from', 'they', 'we',
'say', 'her', 'she', 'or', 'an', 'will', 'my', 'one', 'all',
'would', 'there', 'their', 'what', 'so', 'up', 'out', 'if',
'about', 'who', 'get', 'which', 'go', 'me', 'when', 'make',
'can', 'like', 'time', 'no', 'just', 'him', 'know', 'take',
'people', 'into', 'year', 'your', 'good', 'some', 'could',
'them', 'see', 'other', 'than', 'then', 'now', 'look',
'only', 'come', 'its', 'over', 'think', 'also', 'back',
'after', 'use', 'two', 'how', 'our', 'work', 'first', 'well',
'way', 'even', 'new', 'want', 'because', 'any', 'these',
'give', 'day', 'most', 'us'])
print "calculating distances..."
(dim,) = words.shape
f = lambda (x,y): leven.distance(x,y)
res=np.fromiter(itertools.imap(f, itertools.product(words, words)), dtype=np.uint8)
A = np.reshape(res,(dim,dim))
print "svd..."
u,s,v = lin.svd(A, full_matrices=False)
print u.shape
print s.shape
print s
print v.shape
data = u[:,0:10]
k=KMeans(init='k-means++', k=25, n_init=10)
k.fit(data)
centroids = k.cluster_centers_
labels = k.labels_
print labels
for i in range(np.max(labels)):
    print words[labels==i]
def dist(x,y):
    return np.sqrt(np.sum((x-y)**2, axis=1))
print "centroid points.."
for i,c in enumerate(centroids):
    idx = np.argmin(dist(c,data[labels==i]))
    print words[labels==i][idx]
    print words[labels==i]
plt.plot(u[:,0], u[:,1], '.')
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure()
ax = Axes3D(fig)
ax.plot(u[:,0], u[:,1], u[:,2],'.', zs=0,
zdir='z', label='zs=0, zdir=z')
The result
['and' 'an' 'can' 'any']
['to' 'you' 'do' 'so' 'go' 'no' 'two' 'how']
['who' 'when' 'well']
['be' 'I' 'by' 'we' 'my' 'up' 'me' 'use']
['for' 'or' 'out' 'about' 'your' 'our']
['it' 'his' 'if' 'him' 'its']
['would' 'people' 'could']
['this' 'think' 'these']
['the' 'he' 'she' 'see']
['all' 'back' 'want']
['of' 'on' 'one' 'only' 'even' 'new']
['but' 'just' 'first' 'most']
['some' 'come']
['that' 'than']
['say' 'what' 'way' 'day']
['like' 'time' 'give']
['in' 'into']
['her' 'get' 'year']
['with' 'will' 'which']
['other' 'over' 'after']
['a' 'as' 'at' 'also' 'us']
['they' 'there' 'their' 'them' 'then']
['not' 'from' 'know' 'good' 'now' 'look' 'work']
['have' 'make' 'take']
The selection of k for number of clusters is important, k=25 gives much better results than k=20 for instance.
The code also selects a representative word for each cluster by picking the word whose u[..] coordinate is closest to the cluster centroid.
Here is an approach based on medoids. First install MlPy. On Ubuntu
sudo apt-get install python-mlpy
import numpy as np
import mlpy
class distance:
    def compute(self, s1, s2):
        # Levenshtein (edit) distance between s1 and s2
        l1 = len(s1)
        l2 = len(s2)
        matrix = [range(zz,zz + l1 + 1) for zz in xrange(l2 + 1)]
        for zz in xrange(0,l2):
            for sz in xrange(0,l1):
                if s1[sz] == s2[zz]:
                    matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz])
                else:
                    matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz] + 1)
        return matrix[l2][l1]
x = np.array(['ape', 'appel', 'apple', 'peach', 'puppy'])
km = mlpy.Kmedoids(k=3, dist=distance())
medoids,clusters,a,b = km.compute(x)
print medoids
print clusters
print a
print x[medoids]
for i,c in enumerate(x[medoids]):
    print "medoid", c
    print x[clusters[a==i]]
The output is
[4 3 1]
[0 2]
[2 2]
['puppy' 'peach' 'appel']
medoid puppy
medoid peach
medoid appel
['ape' 'apple']
The bigger word list and using k=10
medoid he
['or' 'his' 'my' 'have' 'if' 'year' 'of' 'who' 'us' 'use' 'people' 'see'
'make' 'be' 'up' 'we' 'the' 'one' 'her' 'by' 'it' 'him' 'she' 'me' 'over'
'after' 'get' 'what' 'I']
medoid out
['just' 'only' 'your' 'you' 'could' 'our' 'most' 'first' 'would' 'but'
medoid to
['from' 'go' 'its' 'do' 'into' 'so' 'for' 'also' 'no' 'two']
medoid now
['new' 'how' 'know' 'not']
medoid time
['like' 'take' 'come' 'some' 'give']
medoid because
medoid an
['want' 'on' 'in' 'back' 'say' 'and' 'a' 'all' 'can' 'as' 'way' 'at' 'day'
medoid look
['work' 'good']
medoid will
['with' 'well' 'which']
medoid then
['think' 'that' 'these' 'even' 'their' 'when' 'other' 'this' 'they' 'there'
'than' 'them']
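As a reminder of what "medoid" means in the listings above: it is the cluster member whose total distance to the other members is smallest, so it is always one of the actual words. The Python 3 sketch below picks the medoid of a single, already formed cluster. It is not part of MlPy; `dissimilarity` and `medoid` are hypothetical helpers, and 1 minus the difflib similarity ratio is used as a stand-in for edit distance.

```python
import difflib

def dissimilarity(a, b):
    """1 - difflib similarity ratio; a stand-in for edit distance (an assumption)."""
    return 1.0 - difflib.SequenceMatcher(None, a, b).ratio()

def medoid(cluster):
    """Return the member with the smallest total dissimilarity to all other members."""
    return min(cluster, key=lambda w: sum(dissimilarity(w, other) for other in cluster))

print(medoid(["abcd", "abce", "abxe"]))
```

Here "abce" differs from each of the other two strings in a single position, whereas those two differ from each other in two positions, so "abce" is the natural representative.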
