I have the following data:
[['The',
'Fulton',
'County',
'Grand',
'Jury',
'said',
'Friday',
'an',
'investigation',
'of',
"Atlanta's",
'recent',
'primary',
'election',
'produced',
'``',
'no',
'evidence',
"''",
'that',
'any',
'irregularities',
'took',
'place',
'.'],
['The',
'jury',
'further',
'said',
'in',
'term-end',
'presentments',
'that',
'the',
'City',
'Executive',
'Committee',
',',
'which',
'had',
'over-all',
'charge',
'of',
'the',
'election',
',',
'``',
'deserves',
'the',
'praise',
'and',
'thanks',
'of',
'the',
'City',
'of',
'Atlanta',
"''",
'for',
'the',
'manner',
'in',
'which',
'the',
'election',
'was',
'conducted',
'.']]
So I have a list that consists of 2 other lists (in my real data there are 50000 lists in one big list).
I want to delete all punctuation and stopwords like "the", "a", "of", etc.
Here is what I have coded:
import string
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
punct = list(string.punctuation)
punct.append("``")
punct.append("''")
stops = set(stopwords.words("english"))
res = [[word.lower() for word in sentence if word not in punct or word.lower() not in stops] for sentence in dataset]
But it returns me the same list of lists that I initially had.
What is wrong with my code?
You should use and instead of or:
res = [[word.lower() for word in sentence if word not in punct and word.lower() not in stops] for sentence in dataset]
Otherwise you keep every element, because each one is missing from at least one of the stops or punct collections.
Since punct and stops do not overlap, every word is absent from at least one of them (possibly both); you want to keep only the words that are absent from both.
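To see why `or` lets everything through, here is a minimal sketch on a toy sentence. The small `stops` and `punct` sets are stand-ins (assumptions) for the NLTK stopword list and `string.punctuation`, so it runs without any download.

```python
# Toy stand-ins for the real stopword and punctuation collections (assumed values)
stops = {"the", "of", "an"}
punct = {".", ",", "``", "''"}

sentence = ["The", "jury", "said", "``", "no", "''", "."]

# With `or`, a token is dropped only if it is in punct AND in stops,
# which never happens because the two collections do not overlap.
kept_or = [w.lower() for w in sentence if w not in punct or w.lower() not in stops]

# With `and`, a token is kept only if it is in neither collection.
kept_and = [w.lower() for w in sentence if w not in punct and w.lower() not in stops]

print(kept_or)   # every token survives
print(kept_and)  # ['jury', 'said', 'no']
```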
Assuming it is OK to update stops, here is an alternative that merges punct into stops; the second variant also avoids the explicit 2-level comprehension:
import string
import nltk
from nltk.corpus import stopwords
dataset = [
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an',
'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election',
'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities',
'took', 'place', '.'],
['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments',
'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had',
'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves',
'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta',
"''", 'for', 'the', 'manner',
'in', 'which', 'the', 'election', 'was', 'conducted', '.']
]
nltk.download('stopwords')
punct = list(string.punctuation)
punct.append("``")
punct.append("''")
stops = set(stopwords.words("english"))
# Union of punct and stops
stops.update(punct)
res1 = [[word for word in sentence if word.lower() not in stops]
        for sentence in dataset]
# Alternative solution that avoids an explicit 2-level list comprehension
def filter_the(sentence, stops):
    return [word for word in sentence if word.lower() not in stops]
res2 = [filter_the(sentence, stops) for sentence in dataset]
print(res1 == res2)
I'm working on code that analyzes input text.
One function I'd like help with builds a list of the words used, in order of descending frequency.
By referring to similar topics on Stack Overflow, I was able to retain only alphanumeric characters (removing all quotation marks, punctuation, etc.) and put each word into a list.
Here is the list I have now. (variable called word_list)
['Hi', 'beautiful', 'creature', 'Said', 'by', 'Rothchild', 'the',
'biggest', 'enemy', 'of', 'Zun', 'Zun', 'started', 'get', 'afraid',
'of', 'him', 'As', 'her', 'best', 'friend', 'Lia', 'can', 'feel',
'her', 'fear', 'Why', 'the', 'the', 'hell', 'you', 'are', 'here']
(FYI, text file is just random fanfiction I found from the web)
However, I'm having trouble turning this list into one ordered by descending frequency. For example, there are 3 'the' in that list, so 'the' becomes the first element. The next element would be 'of', which occurs 2 times.
I tried several things similar to my case (Counter, sorted) but keep getting errors.
Can someone show me how to sort the list?
In addition, after sorting the list, how can I retain only 1 copy of each repeated word? (My current idea is to use a for loop and indexing: compare with the previous index and remove the element if it's the same.)
Thank you.
You can use a collections.Counter for your sorting in different ways:
from collections import Counter
lst = ['Hi', 'beautiful', 'creature', 'Said', 'by', 'Rothchild', 'the', 'biggest', 'enemy', 'of', 'Zun', 'Zun', 'started', 'get', 'afraid', 'of', 'him', 'As', 'her', 'best', 'friend', 'Lia', 'can', 'feel', 'her', 'fear', 'Why', 'the', 'the', 'hell', 'you', 'are', 'here']
c = Counter(lst) # mapping: {item: frequency}
# now you can use the counter directly via most_common (1.)
lst = [x for x, _ in c.most_common()]
# or as a sort key (2.)
lst = sorted(set(lst), key=c.get, reverse=True)
# ['the', 'Zun', 'of', 'her', 'Hi', 'hell', 'him', 'friend', 'Lia',
# 'get', 'afraid', 'Rothchild', 'started', 'by', 'can', 'Why', 'fear',
# 'you', 'are', 'biggest', 'enemy', 'Said', 'beautiful', 'here',
# 'best', 'creature', 'As', 'feel']
These approaches use either the Counter keys (1.) or set for the removal of duplicates.
However, if you want the sort to be stable with regard to the original list (keep order of occurrence for equal frequency items), you might have to do this, following the collections.OrderedDict based recipe for duplicate removal:
from collections import OrderedDict
lst = sorted(OrderedDict.fromkeys(lst), key=c.get, reverse=True)
# ['the', 'of', 'Zun', 'her', 'Hi', 'beautiful', 'creature', 'Said',
# 'by', 'Rothchild', 'biggest', 'enemy', 'started', 'get', 'afraid',
# 'him', 'As', 'best', 'friend', 'Lia', 'can', 'feel', 'fear', 'Why',
# 'hell', 'you', 'are', 'here']
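On Python 3.7 or later, where `Counter` remembers insertion order, `most_common()` already returns equal-count items in first-seen order, so the extra OrderedDict step may not be needed there. A small sketch on a made-up list (the `words` values are illustrative only):

```python
from collections import Counter

words = ['the', 'of', 'Zun', 'of', 'the', 'Zun', 'the', 'Hi']

# Counter keeps first-seen order, and most_common() preserves it for ties
ranked = [word for word, _ in Counter(words).most_common()]
print(ranked)  # ['the', 'of', 'Zun', 'Hi']
```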
This question has been asked here before:
What is a good strategy to group similar words?
but no clear answer was given on how to "group" items. The difflib-based solution is basically a search: for a given item, difflib can return the most similar word out of a list. But how can this be used for grouping?
I would like to reduce
['ape', 'appel', 'apple', 'peach', 'puppy']
to
['ape', 'appel', 'peach', 'puppy']
or
['ape', 'apple', 'peach', 'puppy']
One idea I tried was, for each item, to iterate through the list: if get_close_matches returns more than one match, use it; if not, keep the word as is. This partly worked, but it can suggest apple for appel, then appel for apple, so these words simply switch places and nothing changes.
I would appreciate any pointers, names of libraries, etc.
Note: also in terms of performance, we have a list of 300,000 items, and get_close_matches seems a bit slow. Does anyone know of a C/C++ based solution out there?
Thanks,
Note: Further investigation revealed that k-medoids is the right algorithm (as well as hierarchical clustering). K-medoids does not compute separate "centers"; it uses data points themselves as centers (these points are called medoids, hence the name). In the word grouping case, the medoid would be the representative element of its group / cluster.
You need to normalize the groups. In each group, pick one word or coding that represents the group. Then group the words by their representative.
Some possible ways:
Pick the first encountered word.
Pick the lexicographic first word.
Derive a pattern for all the words.
Pick a unique index.
Use the soundex as pattern.
Grouping the words could be difficult, though. If A is similar to B, and B is similar to C, A and C are not necessarily similar to each other. If B is the representative, both A and C could be included in the group. But if A or C is the representative, the other might not be included.
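The non-transitivity is easy to reproduce. Below is a minimal sketch with a plain edit-distance function; the three words and the `LIMIT` of 1 are made up for illustration and are not taken from the question.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]

LIMIT = 1  # hypothetical similarity threshold for this toy example
a, b, c = "cat", "cot", "cog"
print(edit_distance(a, b) <= LIMIT)  # True: A is similar to B
print(edit_distance(b, c) <= LIMIT)  # True: B is similar to C
print(edit_distance(a, c) <= LIMIT)  # False: A is not similar to C
```

With "cot" as the representative all three words land in one group; with "cat" as the representative, "cog" is left out.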
Going by the first alternative (first encountered word):
class Seeder:
    def __init__(self):
        self.seeds = set()
        self.cache = dict()
    def get_seed(self, word):
        LIMIT = 2
        seed = self.cache.get(word, None)
        if seed is not None:
            return seed
        # Reuse any existing seed within LIMIT edits of this word
        for seed in self.seeds:
            if self.distance(seed, word) <= LIMIT:
                self.cache[word] = seed
                return seed
        # No seed is close enough, so the word becomes a new seed
        self.seeds.add(word)
        self.cache[word] = word
        return word
    def distance(self, s1, s2):
        # Levenshtein (edit) distance via dynamic programming
        l1 = len(s1)
        l2 = len(s2)
        matrix = [list(range(zz, zz + l1 + 1)) for zz in range(l2 + 1)]
        for zz in range(0, l2):
            for sz in range(0, l1):
                if s1[sz] == s2[zz]:
                    matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz])
                else:
                    matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz] + 1)
        return matrix[l2][l1]
import itertools
def group_similar(words):
    seeder = Seeder()
    # groupby only merges adjacent items, so sort by seed first
    words = sorted(words, key=seeder.get_seed)
    groups = itertools.groupby(words, key=seeder.get_seed)
    return [list(v) for k, v in groups]
Example:
import pprint
pprint.pprint(group_similar([
'the', 'be', 'to', 'of', 'and', 'a', 'in', 'that', 'have',
'I', 'it', 'for', 'not', 'on', 'with', 'he', 'as', 'you',
'do', 'at', 'this', 'but', 'his', 'by', 'from', 'they', 'we',
'say', 'her', 'she', 'or', 'an', 'will', 'my', 'one', 'all',
'would', 'there', 'their', 'what', 'so', 'up', 'out', 'if',
'about', 'who', 'get', 'which', 'go', 'me', 'when', 'make',
'can', 'like', 'time', 'no', 'just', 'him', 'know', 'take',
'people', 'into', 'year', 'your', 'good', 'some', 'could',
'them', 'see', 'other', 'than', 'then', 'now', 'look',
'only', 'come', 'its', 'over', 'think', 'also', 'back',
'after', 'use', 'two', 'how', 'our', 'work', 'first', 'well',
'way', 'even', 'new', 'want', 'because', 'any', 'these',
'give', 'day', 'most', 'us'
]), width=120)
Output:
[['after'],
['also'],
['and', 'a', 'in', 'on', 'as', 'at', 'an', 'one', 'all', 'can', 'no', 'want', 'any'],
['back'],
['because'],
['but', 'about', 'get', 'just'],
['first'],
['from'],
['good', 'look'],
['have', 'make', 'give'],
['his', 'her', 'if', 'him', 'its', 'how', 'us'],
['into'],
['know', 'new'],
['like', 'time', 'take'],
['most'],
['of', 'I', 'it', 'for', 'not', 'he', 'you', 'do', 'by', 'we', 'or', 'my', 'so', 'up', 'out', 'go', 'me', 'now'],
['only'],
['over', 'our', 'even'],
['people'],
['say', 'she', 'way', 'day'],
['some', 'see', 'come'],
['the', 'be', 'to', 'that', 'this', 'they', 'there', 'their', 'them', 'other', 'then', 'use', 'two', 'these'],
['think'],
['well'],
['what', 'who', 'when', 'than'],
['with', 'will', 'which'],
['work'],
['would', 'could'],
['year', 'your']]
You have to decide which of the close matches you want to use. Maybe take the first element of the list that get_close_matches returns, or just use random.choice on that list to pick one of the close matches.
There must be some sort of rule for it.
In [19]: import difflib
In [20]: a = ['ape', 'appel', 'apple', 'peach', 'puppy']
In [21]: a = ['appel', 'apple', 'peach', 'puppy']
In [22]: b = difflib.get_close_matches('ape',a)
In [23]: b
Out[23]: ['apple', 'appel']
In [24]: import random
In [25]: c = random.choice(b)
In [26]: c
Out[26]: 'apple'
Now remove c from the initial list, that's it.
For C++, you can use the Levenshtein distance.
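A deterministic variant of the same idea, as a rough sketch: keep a word only if nothing already kept is a close match, so the first word seen in each group becomes its representative. The helper name `drop_near_duplicates` and the cutoff of 0.78 are assumptions picked for this toy list, not values from the question.

```python
import difflib

def drop_near_duplicates(words, cutoff=0.78):
    """Keep a word only if no already-kept word is a close match."""
    kept = []
    for word in words:
        # get_close_matches returns [] when nothing in `kept` reaches the cutoff
        if not difflib.get_close_matches(word, kept, n=1, cutoff=cutoff):
            kept.append(word)
    return kept

print(drop_near_duplicates(['ape', 'appel', 'apple', 'peach', 'puppy']))
# ['ape', 'appel', 'peach', 'puppy']
```

Because each word is compared only against the kept list, apple and appel cannot swap places. On 300,000 items this is still likely to be slow, since every word is compared against all representatives found so far.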
Here is another version using the Affinity Propagation algorithm.
import numpy as np
import scipy.linalg as lin
import Levenshtein as leven
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.cluster import AffinityPropagation
import itertools
words = np.array(
['the', 'be', 'to', 'of', 'and', 'a', 'in', 'that', 'have',
'I', 'it', 'for', 'not', 'on', 'with', 'he', 'as', 'you',
'do', 'at', 'this', 'but', 'his', 'by', 'from', 'they', 'we',
'say', 'her', 'she', 'or', 'an', 'will', 'my', 'one', 'all',
'would', 'there', 'their', 'what', 'so', 'up', 'out', 'if',
'about', 'who', 'get', 'which', 'go', 'me', 'when', 'make',
'can', 'like', 'time', 'no', 'just', 'him', 'know', 'take',
'people', 'into', 'year', 'your', 'good', 'some', 'could',
'them', 'see', 'other', 'than', 'then', 'now', 'look',
'only', 'come', 'its', 'over', 'think', 'also', 'back',
'after', 'use', 'two', 'how', 'our', 'work', 'first', 'well',
'way', 'even', 'new', 'want', 'because', 'any', 'these',
'give', 'day', 'most', 'us'])
print("calculating distances...")
(dim,) = words.shape
# Similarity is the negative edit distance, so a signed dtype is required
res = np.fromiter((-leven.distance(x, y) for x, y in itertools.product(words, words)), dtype=np.int32)
A = np.reshape(res, (dim, dim))
# A already holds pairwise similarities, so scikit-learn must not recompute them
af = AffinityPropagation(affinity='precomputed').fit(A)
cluster_centers_indices = af.cluster_centers_indices_
labels = af.labels_
unique_labels = set(labels)
for i in unique_labels:
    print(words[labels == i])
Distances had to be converted to similarities; I did that by taking the negative of the distance. The output is
['to' 'you' 'do' 'by' 'so' 'who' 'go' 'into' 'also' 'two']
['it' 'with' 'at' 'if' 'get' 'its' 'first']
['of' 'for' 'from' 'or' 'your' 'look' 'after' 'work']
['the' 'be' 'have' 'I' 'he' 'we' 'her' 'she' 'me' 'give']
['this' 'his' 'which' 'him']
['and' 'a' 'in' 'an' 'my' 'all' 'can' 'any']
['on' 'one' 'good' 'some' 'see' 'only' 'come' 'over']
['would' 'could']
['but' 'out' 'about' 'our' 'most']
['make' 'like' 'time' 'take' 'back']
['that' 'they' 'there' 'their' 'when' 'them' 'other' 'than' 'then' 'think'
'even' 'these']
['not' 'no' 'know' 'now' 'how' 'new']
['will' 'people' 'year' 'well']
['say' 'what' 'way' 'want' 'day']
['because']
['as' 'up' 'just' 'use' 'us']
Another method could be matrix factorization using SVD. First we create a word distance matrix; for 100 words this would be a 100 x 100 matrix representing the distance from each word to all other words. Then SVD is run on this matrix, and the u in the resulting u, s, v can be seen as the membership strength for each cluster.
Code
import numpy as np
import scipy.linalg as lin
import Levenshtein as leven
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
import itertools
words = np.array(
['the', 'be', 'to', 'of', 'and', 'a', 'in', 'that', 'have',
'I', 'it', 'for', 'not', 'on', 'with', 'he', 'as', 'you',
'do', 'at', 'this', 'but', 'his', 'by', 'from', 'they', 'we',
'say', 'her', 'she', 'or', 'an', 'will', 'my', 'one', 'all',
'would', 'there', 'their', 'what', 'so', 'up', 'out', 'if',
'about', 'who', 'get', 'which', 'go', 'me', 'when', 'make',
'can', 'like', 'time', 'no', 'just', 'him', 'know', 'take',
'people', 'into', 'year', 'your', 'good', 'some', 'could',
'them', 'see', 'other', 'than', 'then', 'now', 'look',
'only', 'come', 'its', 'over', 'think', 'also', 'back',
'after', 'use', 'two', 'how', 'our', 'work', 'first', 'well',
'way', 'even', 'new', 'want', 'because', 'any', 'these',
'give', 'day', 'most', 'us'])
print("calculating distances...")
(dim,) = words.shape
res = np.fromiter((leven.distance(x, y) for x, y in itertools.product(words, words)),
                  dtype=np.uint8)
A = np.reshape(res, (dim, dim))
print("svd...")
u, s, v = lin.svd(A, full_matrices=False)
print(u.shape)
print(s.shape)
print(s)
print(v.shape)
# Cluster the words on their first 10 SVD coordinates
data = u[:, 0:10]
k = KMeans(init='k-means++', n_clusters=25, n_init=10)
k.fit(data)
centroids = k.cluster_centers_
labels = k.labels_
print(labels)
for i in range(np.max(labels) + 1):
    print(words[labels == i])
def dist(x, y):
    return np.sqrt(np.sum((x - y) ** 2, axis=1))
print("centroid points..")
for i, c in enumerate(centroids):
    # Representative word: the cluster member closest to the centroid
    idx = np.argmin(dist(c, data[labels == i]))
    print(words[labels == i][idx])
    print(words[labels == i])
plt.plot(centroids[:, 0], centroids[:, 1], 'x')
plt.plot(u[:, 0], u[:, 1], '.')
plt.show()
# 3D view of the first three SVD coordinates
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.plot(u[:, 0], u[:, 1], u[:, 2], '.')
plt.show()
The result
any
['and' 'an' 'can' 'any']
do
['to' 'you' 'do' 'so' 'go' 'no' 'two' 'how']
when
['who' 'when' 'well']
my
['be' 'I' 'by' 'we' 'my' 'up' 'me' 'use']
your
['for' 'or' 'out' 'about' 'your' 'our']
its
['it' 'his' 'if' 'him' 'its']
could
['would' 'people' 'could']
this
['this' 'think' 'these']
she
['the' 'he' 'she' 'see']
back
['all' 'back' 'want']
one
['of' 'on' 'one' 'only' 'even' 'new']
just
['but' 'just' 'first' 'most']
come
['some' 'come']
that
['that' 'than']
way
['say' 'what' 'way' 'day']
like
['like' 'time' 'give']
in
['in' 'into']
get
['her' 'get' 'year']
because
['because']
will
['with' 'will' 'which']
over
['other' 'over' 'after']
as
['a' 'as' 'at' 'also' 'us']
them
['they' 'there' 'their' 'them' 'then']
good
['not' 'from' 'know' 'good' 'now' 'look' 'work']
have
['have' 'make' 'take']
The selection of k, the number of clusters, is important; k=25 gives much better results than k=20, for instance.
The code also selects a representative word for each cluster by picking the word whose u[..] coordinate is closest to the cluster centroid.
Here is an approach based on medoids. First install MlPy. On Ubuntu
sudo apt-get install python-mlpy
Then
import numpy as np
import mlpy
class distance:
    def compute(self, s1, s2):
        l1 = len(s1)
        l2 = len(s2)
        matrix = [range(zz,zz + l1 + 1) for zz in xrange(l2 + 1)]
        for zz in xrange(0,l2):
            for sz in xrange(0,l1):
                if s1[sz] == s2[zz]:
                    matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz])
                else:
                    matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz] + 1)
        return matrix[l2][l1]
x = np.array(['ape', 'appel', 'apple', 'peach', 'puppy'])
km = mlpy.Kmedoids(k=3, dist=distance())
medoids,clusters,a,b = km.compute(x)
print medoids
print clusters
print a
print x[medoids]
for i,c in enumerate(x[medoids]):
    print "medoid", c
    print x[clusters[a==i]]
The output is
[4 3 1]
[0 2]
[2 2]
['puppy' 'peach' 'appel']
medoid puppy
[]
medoid peach
[]
medoid appel
['ape' 'apple']
With the bigger word list and k=10:
medoid he
['or' 'his' 'my' 'have' 'if' 'year' 'of' 'who' 'us' 'use' 'people' 'see'
'make' 'be' 'up' 'we' 'the' 'one' 'her' 'by' 'it' 'him' 'she' 'me' 'over'
'after' 'get' 'what' 'I']
medoid out
['just' 'only' 'your' 'you' 'could' 'our' 'most' 'first' 'would' 'but'
'about']
medoid to
['from' 'go' 'its' 'do' 'into' 'so' 'for' 'also' 'no' 'two']
medoid now
['new' 'how' 'know' 'not']
medoid time
['like' 'take' 'come' 'some' 'give']
medoid because
[]
medoid an
['want' 'on' 'in' 'back' 'say' 'and' 'a' 'all' 'can' 'as' 'way' 'at' 'day'
'any']
medoid look
['work' 'good']
medoid will
['with' 'well' 'which']
medoid then
['think' 'that' 'these' 'even' 'their' 'when' 'other' 'this' 'they' 'there'
'than' 'them']
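If mlpy is not available, the medoid idea can be sketched in plain Python. This is not mlpy's implementation: it is a simple alternating scheme (assign each word to its nearest medoid, then re-pick each cluster's medoid), the function names are hypothetical, the first k words serve as a naive starting point, and the words are assumed to be distinct.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def k_medoids(words, k, max_iter=100):
    """Return {medoid word: [member words]} for k clusters (assumes distinct words)."""
    n = len(words)
    d = [[edit_distance(a, b) for b in words] for a in words]
    medoids = list(range(k))  # naive initialisation: the first k words (an assumption)
    for _ in range(max_iter):
        # Assignment step: each word joins its nearest medoid
        clusters = {m: [] for m in medoids}
        for i in range(n):
            nearest = min(medoids, key=lambda m: d[i][m])
            clusters[nearest].append(i)
        # Update step: the member with the smallest total distance to its cluster
        new_medoids = [min(members, key=lambda c: sum(d[c][j] for j in members))
                       for members in clusters.values()]
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return {words[m]: [words[i] for i in members] for m, members in clusters.items()}

groups = k_medoids(['ape', 'appel', 'apple', 'peach', 'puppy'], 3)
for medoid, members in groups.items():
    print("medoid", medoid, members)
```

Unlike the mlpy output above, each member list here includes the medoid itself. The full distance matrix is quadratic in the number of words, so this sketch is only practical for small lists.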