Python URL string match

My problem is the following. I have a long list of URLs such as:
www.foo.com/davidbobmike1joe
www.foo.com/mikejoe2bobkarl
www.foo.com/joemikebob
www.foo.com/bobjoe
I need to compare all the entries (URLs) in that list with each other, extract the keywords in the subdomains of those URLs (in this case: david, joe, bob, mike, karl) and order them by frequency. I've been reading about several libraries such as nltk. However, the problem here is that there are no spaces to tokenise each word independently. Any recommendations on how to get the job done?

Limitations
If you refuse to use a dictionary, your algorithm will require a lot of computation. On top of that, it is impossible to distinguish a keyword that occurs only once (e.g. "karl") from a junk sequence (e.g. "e2bo"). My solution is a best effort and will only work if your list of URLs contains keywords multiple times.
The basic idea
I assume a word is a frequently occurring sequence of at least 3 characters. This prevents the letter "o" from being the most popular word.
The basic idea is the following.
Count all n-letter sequences and select the ones that occur multiple times.
Cut all sequences that are part of a larger sequence.
Order them by popularity, and you have a solution that comes close to solving your problem (left as an exercise to the reader).
In code
# Python 2
import operator

sentences = ["davidbobmike1joe", "mikejoe2bobkarl", "joemikebob", "bobjoe", "bobbyisawesome", "david", "bobbyjoe"]
dict = {}  # note: this shadows the builtin dict

def countWords(n):
    """Count all possible character sequences/words of length n occurring in all given sentences"""
    for sentence in sentences:
        countWordsSentence(sentence, n)

def countWordsSentence(sentence, n):
    """Count all possible character sequences/words of length n occurring in a sentence"""
    for i in range(0, len(sentence) - n + 1):
        word = sentence[i:i + n]
        if word not in dict:
            dict[word] = 1
        else:
            dict[word] = dict[word] + 1

def cropDictionary():
    """Removes all words that occur only once."""
    for key in dict.keys():
        if dict[key] == 1:
            dict.pop(key)

def removePartials(word):
    """Removes all the partial occurrences of a given word from the dictionary."""
    for i in range(3, len(word)):
        for j in range(0, len(word) - i + 1):
            for key in dict.keys():
                if key == word[j:j + i] and dict[key] == dict[word]:
                    dict.pop(key)

def removeAllPartials():
    """Removes all partial words in the dictionary"""
    for word in dict.keys():
        removePartials(word)

for i in range(3, max(map(lambda x: len(x), sentences))):
    countWords(i)

cropDictionary()
removeAllPartials()
print dict
Output
>>> print dict;
{'mike': 3, 'bobby': 2, 'david': 2, 'joe': 5, 'bob': 6}
Some challenges to the reader
Sort the dictionary by value before printing it (see: Sort a Python dictionary by value; a sketch follows this list).
In this example "bob" occurs six times, two of them as a partial word of "bobby". Determine whether this is problematic and fix it if necessary.
Take capitalization into account.
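For the first challenge, here is a minimal sketch of sorting the result by value, assuming the dict built by the code above (the order of tied counts is arbitrary):
import operator
# sort the (word, count) pairs by count, most frequent first
ranked = sorted(dict.items(), key=operator.itemgetter(1), reverse=True)
print(ranked)
# [('bob', 6), ('joe', 5), ('mike', 3), ('bobby', 2), ('david', 2)]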

Overview
You could use this code to extract the names, passing in a list of [david, bob, etc.]:
Is there an easy way to generate a probable list of words from an unspaced sentence in python?
And then use collections.Counter to get frequencies.
The code
# Python 2, using Biopython's trie module
from Bio import trie
import string
from collections import Counter

def get_trie(words):
    tr = trie.trie()
    for word in words:
        tr[word] = len(word)
    return tr

def get_trie_word(tr, s):
    # Return the longest prefix of s found in the trie, plus the rest of s.
    for end in reversed(range(len(s))):
        word = s[:end + 1]
        if tr.has_key(word):
            return word, s[end + 1:]
    return None, s

def get_trie_words(s):
    names = ['david', 'bob', 'karl', 'joe', 'mike']
    tr = get_trie(names)
    while s:
        word, s = get_trie_word(tr, s)
        if word is None:
            break  # nothing matched; stop rather than loop forever
        yield word

def main(urls):
    d = Counter()
    for url in urls:
        url = "".join(a for a in url if a in string.lowercase)
        for word in get_trie_words(url):
            d[word] += 1
    return d

if __name__ == '__main__':
    urls = [
        "davidbobmike1joe",
        "mikejoe2bobkarl",
        "joemikebob",
        "bobjoe",
    ]
    print main(urls)
Results
Counter({'bob': 4, 'joe': 4, 'mike': 3, 'karl': 1, 'david': 1})

Related

How can I find the most repeated word and how many times it is repeated [duplicate]

I am using Python 3.3
I need to create two lists, one for the unique words and the other for the frequencies of the word.
I have to sort the unique word list based on the frequencies list so that the word with the highest frequency is first in the list.
I have the design in text but am uncertain how to implement it in Python.
The methods I have found so far use either Counter or dictionaries which we have not learned. I have already created the list from the file containing all the words but do not know how to find the frequency of each word in the list. I know I will need a loop to do this but cannot figure it out.
Here's the basic design:
original list = ["the", "car", ....]
newlst = []
frequency = []
for word in the original list
    if word not in newlst:
        newlst.append(word)
        set frequency = 1
    else
        increase the frequency
sort newlst based on frequency list
Use this:
from collections import Counter
list1=['apple','egg','apple','banana','egg','apple']
counts = Counter(list1)
print(counts)
# Counter({'apple': 3, 'egg': 2, 'banana': 1})
You can use
from collections import Counter
It is supported since Python 2.7; read more information in the documentation.
1.
>>> c = Counter('abracadabra')
>>> c.most_common(3)
[('a', 5), ('r', 2), ('b', 2)]
Use a dict:
>>> d = {1: 'one', 2: 'one', 3: 'two'}
>>> c = Counter(d.values())
>>> c.most_common()
[('one', 2), ('two', 1)]
But you have to read the file first and convert it to a dict.
2.
This is the example from the Python docs, using re and Counter:
# Find the ten most common words in Hamlet
>>> import re
>>> words = re.findall(r'\w+', open('hamlet.txt').read().lower())
>>> Counter(words).most_common(10)
[('the', 1143), ('and', 966), ('to', 762), ('of', 669), ('i', 631),
('you', 554), ('a', 546), ('my', 514), ('hamlet', 471), ('in', 451)]
words = file("test.txt", "r").read().split()  # read the words into a list (Python 2; use open() in Python 3)
uniqWords = sorted(set(words))  # remove duplicate words and sort
for word in uniqWords:
    print words.count(word), word
Pandas answer:
import pandas as pd
original_list = ["the", "car", "is", "red", "red", "red", "yes", "it", "is", "is", "is"]
pd.Series(original_list).value_counts()
If you wanted it in ascending order instead, it is as simple as:
pd.Series(original_list).value_counts().sort_values(ascending=True)
Yet another solution with another algorithm without using collections:
def countWords(A):
    dic = {}
    for x in A:
        if x not in dic:  # Python 2.7: if not dic.has_key(x):
            dic[x] = A.count(x)
    return dic

dic = countWords(['apple', 'egg', 'apple', 'banana', 'egg', 'apple'])
sorted_items = sorted(dic.items())  # if you want it sorted
One way would be to make a list of lists, with each sub-list in the new list containing a word and a count:
list1 = []  # this is your original list of words
list2 = []  # this is a new list of [word, count] pairs
for word in list1:
    for pair in list2:
        if pair[0] == word:
            pair[1] += 1
            break
    else:  # no existing pair matched, so add the word with a count of 1
        list2.append([word, 1])
Or, using try/except with a parallel lookup list instead of an inner loop:
words_seen = []  # parallel list of just the words, so index() can be used
for word in list1:
    try:
        list2[words_seen.index(word)][1] += 1
    except ValueError:  # word not seen yet
        words_seen.append(word)
        list2.append([word, 1])
This would be less efficient than using a dictionary, but it uses more basic concepts.
You can use reduce() - a functional way.
from functools import reduce  # needed in Python 3; reduce is a builtin in Python 2
words = "apple banana apple strawberry banana lemon"
reduce(lambda d, c: d.update([(c, d.get(c, 0) + 1)]) or d, words.split(), {})
returns:
{'strawberry': 1, 'lemon': 1, 'apple': 2, 'banana': 2}
Using Counter would be the best way, but if you don't want to do that, you can implement it yourself this way.
# The list you already have
word_list = ['words', ..., 'other', 'words']
# Get a set of unique words from the list
word_set = set(word_list)
# create your frequency dictionary
freq = {}
# iterate through them, once per unique word.
for word in word_set:
    freq[word] = word_list.count(word) / float(len(word_list))
freq will end up with the frequency of each word in the list you already have.
You need float in there to convert one of the integers to a float, so the resulting value will be a float.
Edit:
If you can't use a dict or set, here is another less efficient way:
# The list you already have
word_list = ['words', ..., 'other', 'words']

unique_words = []
for word in word_list:
    if word not in unique_words:
        unique_words += [word]

word_frequencies = []
for word in unique_words:
    word_frequencies += [float(word_list.count(word)) / len(word_list)]

for i in range(len(unique_words)):
    print(unique_words[i] + ": " + str(word_frequencies[i]))  # str() is needed to concatenate the float
The indices of unique_words and word_frequencies will match.
The ideal way is to use a dictionary that maps a word to its count. But if you can't use that, you might want to use two lists - one storing the words, and the other storing the counts. Note that the order of words and counts matters here. Implementing this would be hard and not very efficient.
Try this:
words = []
freqs = []
for line in sorted(original_list):  # takes all the lines in a text and sorts them
    line = line.rstrip()  # strips them of their spaces
    if line not in words:  # checks to see if line is already in words
        words.append(line)  # if not, it adds it to the end of words
        freqs.append(1)  # and appends 1 to freqs
    else:
        index = words.index(line)  # if it is, it finds where it is in words
        freqs[index] += 1  # and adds 1 at the matching index in freqs
Here is code to support your question: is_word() checks that a string contains only valid characters, and only those strings are counted. (A hashmap in Python is a dictionary.)
def is_word(word):
    cnt = 0
    for c in word:
        if 'a' <= c <= 'z' or 'A' <= c <= 'Z' or '0' <= c <= '9' or c == '$':
            cnt += 1
    if cnt == len(word):
        return True
    return False

def words_freq(s):
    d = {}
    for i in s.split():
        if is_word(i):
            if i in d:
                d[i] += 1
            else:
                d[i] = 1
    return d
print(words_freq('the the sky$ is blue not green'))
words_dict = {}
for word in original_list:
    words_dict[word] = words_dict.get(word, 0) + 1

sorted_dt = {key: value for key, value in sorted(words_dict.items(), key=lambda item: item[1], reverse=True)}
keys = list(sorted_dt.keys())
values = list(sorted_dt.values())
print(keys)
print(values)
Simple way:
d = {}
l = ['Hi', 'Hello', 'Hey', 'Hello']
for a in l:
    d[a] = l.count(a)
print(d)
Output: {'Hi': 1, 'Hello': 2, 'Hey': 1}
Word and frequency, if you need both:
def counter_(input_list_):
    lu = []
    for v in input_list_:
        ele = (v, input_list_.count(v) / len(input_list_))  # drop the /len(input_list_) if you want raw counts instead of proportions
        if ele not in lu:
            lu.append(ele)
    return lu
counter_(['a', 'n', 'f', 'a'])
output:
[('a', 0.5), ('n', 0.25), ('f', 0.25)]
The best thing to do is:
def wordListToFreqDict(wordlist):
    wordfreq = [wordlist.count(p) for p in wordlist]
    return dict(zip(wordlist, wordfreq))
Then call:
wordListToFreqDict(originallist)

Creating and rearranging a dictionary

I am new to Python! I have created code which successfully opens my text file and sorts my list of hundreds of words. I then put these in a list labelled stimuli_words, which consists of no duplicate words, all lower case, etc.
However, I now want to convert this list into a dictionary where the keys are all possible 3-letter endings in my list of words, and the values are the words that correspond to those endings.
For instance 'ing': going, hiring, ...; but I only want the endings that more than 40 words share. So far I have this code:
from collections import defaultdict

fq = defaultdict(int)
for w in stimuli_list:
    fq[w] += 1
print(fq)
However, it is just returning a dictionary with my words and how many times they occur, which is obviously once each, e.g. 'going': 1, 'hiring': 1, 'driving': 1.
Really would appreciate some help!! Thank You!!
You could do something like this:
dictionary = {}
words = ['going', 'hiring', 'driving', 'letter', 'better', ...]  # your list of words

# Creating words dictionary
for word in words:
    dictionary.setdefault(word[-3:], []).append(word)

# Removing lists that contain less than 40 words:
for key, value in dictionary.copy().items():
    if len(value) < 40:
        del dictionary[key]

print(dictionary)
Output:
{ # Only lists that are longer than 40 words
    'ing': ['going', 'hiring', 'driving', ...],
    'ter': ['letter', 'better', ...],
    ...
}
Since you're counting the words (because your key is the word), you only get 1 count per word.
You could create a key of the 3 last characters (and use Counter instead):
import collections
wordlist = ["driving","hunting","fishing","drive","a"]
endings = collections.Counter(x[-3:] for x in wordlist)
print(endings)
result:
Counter({'ing': 3, 'a': 1, 'ive': 1})
Create demo data:
import random

# seed the same for any run
random.seed(10)

# base lists for demo data
prae = ["help", "read", "muck", "truck", "sleep"]
post = ["ing", "biothign", "press"]

# lots of data
parts = [x + str(y) + z for x in prae for z in post for y in range(100, 1000, 100)]

# shuffle and take every 120th
random.shuffle(parts)
stimuli_list = parts[::120]
Creation of the dictionary from stimuli_list:
# create keys with empty lists
dic = dict((e[-3:], []) for e in stimuli_list)

# process data and, if fitting, fill the list
for d in dic:
    fitting = [x for x in parts if x.endswith(d)]  # adapt to only fit the 2 last chars
    if len(fitting) > 5:  # adapt this to have at least n in it
        dic[d] = fitting[:]

for d in [x for x in dic if not dic[x]]:  # remove keys with empty lists
    del dic[d]  # dict has no .remove(); use del instead

print()
print(dic)
Output:
{'ess': ['help400press', 'sleep100press', 'sleep600press', 'help100press', 'muck400press', 'muck900press', 'muck500press', 'help800press', 'muck100press', 'read300press', 'sleep400press', 'muck800press', 'read600press', 'help200press', 'truck600press', 'truck300press', 'read700press', 'help900press', 'truck400press', 'sleep200press', 'read500press', 'help600press', 'truck900press', 'truck800press', 'muck200press', 'truck100press', 'sleep700press', 'sleep500press', 'sleep900press', 'truck200press', 'help700press', 'muck300press', 'sleep800press', 'muck700press', 'sleep300press', 'help500press', 'truck700press', 'read400press', 'read100press', 'muck600press', 'read900press', 'read200press', 'help300press', 'truck500press', 'read800press']
, 'ign': ['truck200biothign', 'muck500biothign', 'help800biothign', 'muck700biothign', 'help600biothign', 'truck300biothign', 'read200biothign', 'help500biothign', 'read900biothign', 'read700biothign', 'truck400biothign', 'help300biothign', 'read400biothign', 'truck500biothign', 'read800biothign', 'help700biothign', 'help400biothign', 'sleep600biothign', 'sleep500biothign', 'muck300biothign', 'truck700biothign', 'help200biothign', 'sleep300biothign', 'muck100biothign', 'sleep800biothign', 'muck200biothign', 'sleep400biothign', 'truck100biothign', 'muck800biothign', 'read500biothign', 'truck900biothign', 'muck600biothign', 'truck800biothign', 'sleep100biothign', 'read300biothign', 'read100biothign', 'help900biothign', 'truck600biothign', 'help100biothign', 'read600biothign', 'muck400biothign', 'muck900biothign', 'sleep900biothign', 'sleep200biothign', 'sleep700biothign']
}

Find group of strings that are anagrams

This question refers to this problem on lintcode. I have a working solution, but it takes too long for the huge test case. I am wondering how it can be improved - maybe I can decrease the number of comparisons I make in the outer loop.
class Solution:
    # @param strs: A list of strings
    # @return: A list of strings
    def anagrams(self, strs):
        # write your code here
        ret = set()
        for i in range(0, len(strs)):
            for j in range(i + 1, len(strs)):
                if i in ret and j in ret:
                    continue
                if Solution.isanagram(strs[i], strs[j]):
                    ret.add(i)
                    ret.add(j)
        return [strs[i] for i in list(ret)]

    @staticmethod
    def isanagram(s, t):
        if len(s) != len(t):
            return False
        chars = {}
        for i in s:
            if i in chars:
                chars[i] += 1
            else:
                chars[i] = 1
        for i in t:
            if i not in chars:
                return False
            else:
                chars[i] -= 1
                if chars[i] < 0:
                    return False
        for i in chars:
            if chars[i] != 0:
                return False
        return True
Update: Just to add, I am not looking for built-in Pythonic solutions such as using Counter, which are already optimized. I have added Mike's suggestions, but am still exceeding the time limit.
Skip strings you already placed in the set. Don't test them again.
# @param strs: A list of strings
# @return: A list of strings
def anagrams(self, strs):
    # write your code here
    ret = set()
    for i in range(0, len(strs)):
        for j in range(i + 1, len(strs)):
            # If both anagrams exist in the set, there is no need to compare them.
            if i in ret and j in ret:
                continue
            if Solution.isanagram(strs[i], strs[j]):
                ret.add(i)
                ret.add(j)
    return [strs[i] for i in list(ret)]
You can also do a length comparison in your anagram test before iterating through the letters. Whenever the strings aren't the same length, they can't be anagrams anyway. Also, when a counter in chars drops below 0 while comparing values in t, just return False - don't iterate through chars again.
@staticmethod
def isanagram(s, t):
    # Test that the strings are the same length
    if len(s) != len(t):
        return False
    chars = {}
    for i in s:
        if i in chars:
            chars[i] += 1
        else:
            chars[i] = 1
    for i in t:
        if i not in chars:
            return False
        else:
            chars[i] -= 1
            # If this drops below 0, return False
            if chars[i] < 0:
                return False
    for i in chars:
        if chars[i] != 0:
            return False
    return True
Instead of comparing all pairs of strings, you can just create a dictionary (or collections.defaultdict) mapping each of the letter-counts to the words having those counts. For getting the letter-counts, you can use collections.Counter. Afterwards, you just have to get the values from that dict. If you want all words that are anagrams of any other words, just merge the lists that have more than one entry.
strings = ["cat", "act", "rat", "hut", "tar", "tact"]
anagrams = defaultdict(list)
for s in strings:
anagrams[frozenset(Counter(s).items())].append(s)
print([v for v in anagrams.values()])
# [['hut'], ['rat', 'tar'], ['cat', 'act'], ['tact']]
print([x for v in anagrams.values() if len(v) > 1 for x in v])
# ['cat', 'act', 'rat', 'tar']
Of course, if you prefer not to use built-in functionality, you can just as well use a regular dict instead of defaultdict and write your own Counter with a few more lines, similar to what you have in your isanagram method, just without the comparison part. A sketch follows.
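Here is a minimal sketch of that plain-dict variant, with no collections imports (the counts dict plays the role of your chars dict):
strings = ["cat", "act", "rat", "hut", "tar", "tact"]
anagrams = {}
for s in strings:
    # build a letter-count dict by hand, as in isanagram
    counts = {}
    for ch in s:
        counts[ch] = counts.get(ch, 0) + 1
    # a dict is not hashable, so freeze the counts into a sorted tuple to use as a key
    key = tuple(sorted(counts.items()))
    anagrams.setdefault(key, []).append(s)

print([x for v in anagrams.values() if len(v) > 1 for x in v])
# ['cat', 'act', 'rat', 'tar']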
Your solution is slow because you're not taking advantage of Python's data structures.
Here's a solution that collects results in a dict:
class Solution:
    def anagrams(self, strs):
        d = {}
        for word in strs:
            key = tuple(sorted(word))
            try:
                d[key].append(word)
            except KeyError:
                d[key] = [word]
        return [w for ws in d.values() for w in ws if len(ws) > 1]
As an addition to @Mike's great answer, here is a nice Pythonic way to do it:
import collections

class Solution:
    # @param strs: A list of strings
    # @return: A list of strings
    def anagrams(self, strs):
        patterns = Solution.find_anagram_words(strs)
        return [word for word in strs if ''.join(sorted(word)) in patterns]

    @staticmethod
    def find_anagram_words(strs):
        anagrams = collections.Counter(''.join(sorted(word)) for word in strs)
        return {word for word, times in anagrams.items() if times > 1}
Why not this?
str1 = "cafe"
str2 = "face"

def isanagram(s1, s2):
    return sorted(s1) == sorted(s2)

if isanagram(str1, str2):
    print("Woo")
The same can be done with a single line of code if you are using LINQ in C#:
string[] strs; // input string array
var result = strs.GroupBy(x => new string(x.ToCharArray().OrderBy(z => z).ToArray())).Select(g => g.ToList()).ToList();
Now, to group anagrams in Python, we have to: sort the letters of each word, then create a dictionary whose keys are the sorted forms and whose values are the indices of the matching anagrams in the original list.
def groupAnagrams(words):
    # sort the letters of each word in the list
    A = [''.join(sorted(word)) for word in words]
    dict = {}
    for indexofsamewords, names in enumerate(A):
        dict.setdefault(names, []).append(indexofsamewords)
    print(dict)
    # {'AOOPR': [0, 2, 5, 11, 13], 'ABTU': [1, 3, 4], 'Sorry': [6], 'adnopr': [7], 'Sadioptu': [8, 16], ' KPaaehiklry': [9], 'Taeggllnouy': [10], 'Leov': [12], 'Paiijorty': [14, 18], 'Paaaikpr': [15], 'Saaaabhmryz': [17], ' CNaachlortttu': [19], 'Saaaaborvz': [20]}
    for index in dict.values():
        print([words[i] for i in index])

if __name__ == '__main__':
    # list of words
    words = ["ROOPA", "TABU", "OOPAR", "BUTA", "BUAT", "PAROO", "Soudipta",
             "Kheyali Park", "Tollygaunge", "AROOP", "Love", "AOORP", "Protijayi", "Paikpara", "dipSouta", "Shyambazaar",
             "jayiProti", "North Calcutta", "Sovabazaar"]
    groupAnagrams(words)
The output:
['ROOPA', 'OOPAR', 'PAROO', 'AROOP', 'AOORP']
['TABU', 'BUTA', 'BUAT']
['Soudipta', 'dipSouta']
['Kheyali Park']
['Tollygaunge']
['Love']
['Protijayi', 'jayiProti']
['Paikpara']
['Shyambazaar']
['North Calcutta']
['Sovabazaar']


Loop through dictionary and get the 7 most common words. BUT only if the words aren't found in another list

I am learning some basic Python 3 and have been stuck on this problem for two days now, and I can't seem to get anywhere...
I've been reading the "Think Python" book and I'm working on chapter 13 and the case study it contains. The chapter is all about reading a file and doing some magic with it, like counting the total number of words and the most used words.
One part of the program is about "dictionary subtraction", where the program fetches all the words from one text file that are not found in another text file.
What I also need the program to do is count the most common words from the first file, excluding the words found in the "dictionary" text file. This functionality has had me stuck for two days and I don't really know how to solve this...
The Code to my program is as follow:
import string

def process_file(filename):
    hist = {}
    fp = open(filename)
    for line in fp:
        process_line(line, hist)
    return hist

def process_line(line, hist):
    line = line.replace('-', ' ')
    for word in line.split():
        word = word.strip(string.punctuation + string.whitespace)
        word = word.lower()
        hist[word] = hist.get(word, 0) + 1

def most_common(hist):
    t = []
    for key, value in hist.items():
        t.append((value, key))
    t.sort()
    t.reverse()
    return t

def subtract(d1, d2):
    res = {}
    for key in d1:
        if key not in d2:
            res[key] = None
    return res

hist = process_file('alice-ch1.txt')
words = process_file('common-words.txt')
diff = subtract(hist, words)

def total_words(hist):
    return sum(hist.values())

def different_words(hist):
    return len(hist)

if __name__ == '__main__':
    print('Total number of words:', total_words(hist))
    print('Number of different words:', different_words(hist))
    t = most_common(hist)
    print('The most common words are:')
    for freq, word in t[0:7]:
        print(word, '\t', freq)
    print("The words in the book that aren't in the word list are:")
    for word in diff.keys():
        print(word)
I then created a test dict containing a few words with made-up occurrence counts, and a test list, to try to solve my problem. The code for that is:
histfake = {'hello': 12, 'removeme': 2, 'hi': 3, 'fish': 250, 'chicken': 55, 'cow': 10, 'bye': 20, 'the': 93, 'she': 79, 'to': 75}
listfake = ['removeme', 'fish']

newdict = {}
for key, val in histfake.items():
    for commonword in listfake:
        if key != commonword:
            newdict[key] = val
        else:
            newdict[key] = 0

sortcommongone = []
for key, value in newdict.items():
    sortcommongone.append((value, key))
sortcommongone.sort()
sortcommongone.reverse()

for freq, word in sortcommongone:
    print(word, '\t', freq)
The problem is that this code only works for one word. Only one matched word between the dict and the list gets the value 0 (I thought I could give the duplicate words the value 0, since I only need the 7 most common words that are not found in the common-words text file).
How can I solve this? I created an account here just to try to get some help with this, since Stack Overflow has helped me before with other problems. But this time I needed to ask the question myself. Thanks!
You can filter out the items using a dict comprehension
>>> {key: value for key, value in histfake.items() if key not in listfake}
{'hi': 3, 'she': 79, 'to': 75, 'cow': 10, 'bye': 20, 'chicken': 55, 'the': 93, 'hello': 12}
Unless listfake is larger than histfake, the most efficient way will be to delete the keys listed in listfake from histfake:
for key in listfake:
    del histfake[key]  # assumes every word in listfake is a key of histfake
The complexity of both the dict comprehension and this solution is O(n), but the list is presumably much shorter than the dictionary.
EDIT:
Or - if you have more keys than actual words - it may be done the other way around:
for key in list(histfake):  # iterate over a copy, since deleting while iterating a dict is an error
    if key in listfake:
        del histfake[key]
You may want to test the run time of both.
Then, of course, you'll have to sort the dictionary into a list - and recreate it:
from operator import itemgetter
most_common_7 = dict(sorted(histfake.items(), key=itemgetter(1), reverse=True)[:7])  # reverse=True puts the highest counts first
BTW, you may use Counter from the collections module to count words. And maybe part of your problem is that you don't remove all non-letter characters from your text.
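As a minimal sketch of that Counter suggestion, reusing the histfake and listfake test data from the question (the filtering step is just one possible way to do the subtraction):
from collections import Counter

histfake = {'hello': 12, 'removeme': 2, 'hi': 3, 'fish': 250, 'chicken': 55, 'cow': 10, 'bye': 20, 'the': 93, 'she': 79, 'to': 75}
listfake = ['removeme', 'fish']

# keep only the words that are not in the common-word list, then rank them
counts = Counter({word: n for word, n in histfake.items() if word not in listfake})
for word, freq in counts.most_common(7):
    print(word, '\t', freq)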
