How to count frequencies/occurrences of all values within a string - python

I need a count of all the emails in a list. Some of the emails, however, are consolidated together with a | symbol. These need to be split and counted individually; otherwise the frequencies come out too low.
I have a list that is something like this:
test = ['abc#gmail.com', 'xyz#jad.com|abc#gmail.com', 'asd#ajf.com|abc#gmail.com', 'asdf#adh.com', 'xyz#jad.com']
I performed a set of operations to split the string. After splitting, double quotes appear where each pipe used to be, so I remove the double quotes to leave every email ID enclosed in single quotes.
# convert list to a string
test_str = str(test)
# apply string operation to split by separator '|'
test1 = test_str.split('|')
print(test1)
--> OUTPUT of above print statement: ["['abc#gmail.com', 'xyz#jad.com", "abc#gmail.com', 'asd#ajf.com", "abc#gmail.com', 'asdf#adh.com', 'xyz#jad.com']"]
test2 = str(test1)
test3 = test2.replace('"','')
print(test3)
--> OUTPUT of above print statement: [['abc#gmail.com', 'xyz#jad.com', 'abc#gmail.com', 'asd#ajf.com', 'abc#gmail.com', 'asdf#adh.com', 'xyz#jad.com']]
How can I now obtain a count of all the emails? This is essentially a string; if it were a list, I could use collections.Counter to easily obtain a count.
I'd like to get a list like the one below, with each email and its count in descending order of frequency:
['abc#gmail.com': 3, 'xyz#jad.com': 2, 'asd#ajf.com': 1, 'asdf#adh.com': 1]
Thanks for the help!

You can use collections.Counter with a generator expression that iterates over the input list of strings and then iterates over the sub-list of emails by splitting the strings. Use the most_common method to ensure a descending order of counts:
from collections import Counter
dict(Counter(e for s in test if s for e in s.split('|')).most_common())
This returns:
{'abc#gmail.com': 3, 'xyz#jad.com': 2, 'asd#ajf.com': 1, 'asdf#adh.com': 1}

What about iterating over the list and calling counter.update on every string? Like this:
from collections import Counter
test = ['abc#gmail.com', 'xyz#jad.com|abc#gmail.com', 'asd#ajf.com|abc#gmail.com', 'asdf#adh.com', 'xyz#jad.com']
c = Counter()
for email_str in test:
    if email_str:
        # each string may hold several emails separated by '|'
        c.update(email_str.split('|'))
res = c.most_common()
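If the input may contain stray whitespace around the separator or empty fragments (an assumption; the sample list has neither), the same idea can be wrapped in a small helper. `count_emails` is a hypothetical name used only for this sketch:

```python
from collections import Counter

def count_emails(entries, sep='|'):
    """Count individual emails in a list whose items may hold several emails joined by sep."""
    counter = Counter()
    for entry in entries:
        # split each entry, drop surrounding whitespace and skip empty fragments
        counter.update(part.strip() for part in entry.split(sep) if part.strip())
    # list of (email, count) pairs, highest count first
    return counter.most_common()

test = ['abc#gmail.com', 'xyz#jad.com|abc#gmail.com', 'asd#ajf.com|abc#gmail.com', 'asdf#adh.com', 'xyz#jad.com']
print(count_emails(test))
```

Wrapping the result in `dict(...)` gives the mapping form shown in the first answer.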

Related

How to use sep command with sorted

I want to separate sorted numbers with "<", but can't do it.
Here is the code:
numbers = [3, 7, 5]
print(sorted(numbers), sep="<")
The * operator, as mentioned by @MisterMiyagi, can be used to unpack the sorted list into separate arguments so that sep is applied between them.
Code:
print(*sorted(numbers), sep="<")
I don't know if this is the answer you want, but I have written Python code that separates the sorted numbers with "<" using join, after converting the numbers to strings.
As the items in the iterable must be strings, I first use a list comprehension to create a list containing each integer as a string, and pass this as input to str.join().
# Initial list of numbers
numbers = [5,1,2,3,4]
# Sorting the numbers
sorted_nums = sorted(numbers)
# Converting the numbers into strings (avoid naming the loop variable `int`, which shadows the builtin)
string_ints = [str(n) for n in sorted_nums]
# Joining the sorted strings with "<"
output = '<'.join(string_ints)
print(output)
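The two approaches above can be compared side by side. This is a short sketch using the `numbers` list from the question; `map(str, ...)` stands in for the list comprehension and is my substitution, not part of either answer:

```python
numbers = [3, 7, 5]

# Unpack the sorted list so each number becomes its own argument to print
print(*sorted(numbers), sep="<")

# Build the same text as a string, which helps when it must be stored or returned
joined = "<".join(map(str, sorted(numbers)))
print(joined)
```

Unpacking is convenient for printing only, while `join` yields a value that can be reused elsewhere.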

Find the letter with most occurrences in a string, and print just the letter, not the count

How can I find the letter with most appearances from a string, and only output the letter, not the count?
With collections.Counter, it always displays the count as well as the letter. Current output: ('l', 3) . Preferred output: l
import collections
s = "helloworld"
print(collections.Counter(s).most_common(1)[0])
Instead of
print(collections.Counter(s).most_common(1)[0])
You can write
print(collections.Counter(s).most_common(1)[0][0])
It will give you the first element of the tuple, so the output will be l.
You can also do something like:
txt = "aaaaaaaabbbbbcccdde"
print(max(set(txt), key=txt.count))
output:
a
You can use the max() function with another function as the key parameter like this:
s = "helloworld"
print(max(s, key = lambda c: s.count(c)))
The key parameter is a function to be used when comparing two items of the s iterable. In this case, we compare based on the occurrence of each item.
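Both `max` variants call `count` once per candidate, which rescans the string each time. When several letters tie, `max` returns the first maximal item it encounters. A small sketch of that behaviour (the string `"abba"` is made up for illustration):

```python
s = "helloworld"
# 'l' occurs three times, more than any other letter
most_frequent = max(s, key=s.count)
print(most_frequent)

tied = "abba"
# 'a' and 'b' both occur twice; max keeps the first one it meets
first_of_tie = max(tied, key=tied.count)
print(first_of_tie)
```

With `max(set(txt), key=txt.count)` the iteration order of the set is not defined, so the winner of a tie may differ from the one picked here.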

word count engine, output using square brackets and order incorrect

I came across this question online:
Word Count Engine
Implement a document scanning function wordCountEngine, which receives
a string document and returns a list of all unique words in it and
their number of occurrences, sorted by the number of occurrences in
a descending order. If two or more words have the same count, they
should be sorted according to their order in the original sentence.
Assume that all letters are in the English alphabet. Your function should
be case-insensitive, so for instance, the words “Perfect” and
“perfect” should be considered the same word.
The engine should strip out punctuation (even in the middle of a word) and use whitespaces to separate words.
Analyze the time and space complexities of your solution. Try to
optimize for time while keeping a polynomial space complexity.
So I attempted a solution and it works fine, apart from 2 issues:
when I sort the words in descending order, two words with the same count should be ordered by their first appearance, and I can't seem to do that part
the expected output is enclosed in [] while mine is enclosed in ()
My code is as follows:
from collections import defaultdict
import operator
def word_count_engine(document):
    #c=collections.Counter(document.split())
    myDict=defaultdict(str) #will use a dict
    document=document.lower()
    document+=" " #just to count the last word so I add a space at the end
    word=""
    for i in range(len(document)):
        if document[i].islower(): #as long as its a normal char append it to word string
            word+=document[i]
        elif document[i].isspace(): #if its a space it means its the end of word
            if word in myDict.keys(): #if its already in dict inc counter
                myDict[word]+=1
            else:
                myDict[word]=1 #if not in dict add it and make count =1
            word="" #clear array
    sorted_x = sorted(myDict.items(), key=operator.itemgetter(1),reverse=True)
    print('myDict is ', myDict)
    print('sorted ', sorted_x)
    return sorted_x
Input: "Practice makes perfect, you'll get perfecT by practice. just practice! just just just!!"
Expected: [["just","4"],["practice","3"],["perfect","2"],["makes","1"],["youll","1"],["get","1"],["by","1"]]
Actual: [('just', 4), ('practice', 3), ('perfect', 2), ('get', 1), ('makes', 1), ('youll', 1), ('by', 1)]
Any ideas on how I can fix these 2 issues: the order, and the () that should be []?
You are not tracking word order in the original document, information you need to be able to sort the output correctly. You are also just using the standard (key, value) tuples returned by dict.items(). You need to return lists, and apparently the counts need to be strings too.
In Python versions < 3.6, you'd have to record the order in which words appeared for the first time. Use a defaultdict() with an itertools.count() object to record a 'first appearance' order number for that:
from collections import defaultdict
from itertools import count
word_order = defaultdict(count().__next__)
For any word you try to look up in that dictionary, the __next__ method of a single count() instance is called only if the word was not yet accessed before, resulting in a clean ordering number for each:
>>> from collections import defaultdict
>>> from itertools import count
>>> word_order = defaultdict(count().__next__)
>>> word_order["foo"]
0
>>> word_order["bar"]
1
>>> word_order["foo"]
0
>>> word_order["spam"]
2
>>> word_order
defaultdict(<method-wrapper '__next__' of itertools.count object at 0x109664680>, {'foo': 0, 'bar': 1, 'spam': 2})
You can use this information to track word order, and then later on use this information when sorting.
I'll explain below why this probably should be used in Python 3.6 and newer too, as the strict reading of the Python documentation tells us using a Counter() object instead might not always work.
To extract just words from the input, you'd be much better off with a regular expression that removes everything that is not a word character or whitespace. The \w pattern matches word characters, which are letters, digits and underscores, and not punctuation. For most of these kinds of problems, that is plenty. \s matches anything that's whitespace (spaces, tabs, newlines). Because we want to keep words and spaces, you can use the inverse to remove everything else. You can get the inverse by combining the two classes in a new class with [...], and then adding in ^ at the start to match anything that's not part of those two groups. Lowercase the document, and remove the things we want to get rid of:
cleaned = re.sub(r"[^\w\s]+", "", document.lower())
All that is left to do to get the cleaned words, is to call cleaned.split(), producing a list of words without punctuation or other diacritical marks.
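As a quick check of this cleaning step, here is a sketch that applies it to the example sentence from the question, using nothing beyond `re.sub` and `str.split`:

```python
import re

document = "Practice makes perfect, you'll get perfecT by practice. just practice! just just just!!"

# drop every run of characters that is neither a word character nor whitespace
cleaned = re.sub(r"[^\w\s]+", "", document.lower())
words = cleaned.split()
print(words)
```

Note that the apostrophe in "you'll" is removed rather than replaced by a space, which is what turns it into the single word "youll" seen in the expected output.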
You can then use another defaultdict() to keep count of your words. You could also use a collections.Counter() object, but we are replacing just about everything it can do well with custom code anyway. I'd just integrate the word_count results into the key here:
counts = defaultdict(int)
cleaned = re.sub(r"[^\w\s]+", "", document.lower())
for word in cleaned.split():
    counts[(word, word_order[word])] += 1
The items of the counts dictionary give you (word, index), count, so you can sort on that information:
# each sort item is a ((word, index), count) tuple, sort by descending counts
# and then by ascending index.
ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0][1]))
The above sorts on a composite key, (-count, index). By negating the count you sort in descending order (-10 would sort before -3, so a word that appeared 10 times is sorted before words with a lower count), but the second value, the index is used when two words have the same frequency and is used in ascending order.
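To make the composite key concrete, here is a small sketch on a hand-written dictionary. The entries are hypothetical sample data in the same `(word, index): count` shape, deliberately listed out of order:

```python
# hypothetical sample: keys are (word, first_index), values are counts
counts = {("makes", 1): 1, ("just", 6): 4, ("get", 4): 1, ("practice", 0): 3}

# descending by count (negated), then ascending by first-appearance index
ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0][1]))
ordered_words = [word for (word, _), _ in ordered]
print(ordered_words)
```

The two words with a count of 1 end up in index order, "makes" (index 1) before "get" (index 4), regardless of where they sat in the dictionary.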
Now all that remains is the extraction of the words and the counts from this structure, turning counts into strings:
result = [[word, str(count)] for (word, _), count in ordered]
I used (word, _), count as the for loop target, so Python unpacks the nested tuple structure for me and we can ignore the index. Because we don't use the index value in the output, I used the variable name _. Most code linters recognize this name as marking a value that is deliberately unused.
So a complete implementation would be:
import re
from collections import defaultdict
from itertools import count
def word_count_engine(document):
    word_order = defaultdict(count().__next__)
    counts = defaultdict(int)
    cleaned = re.sub(r"[^\w\s]+", "", document.lower())
    for word in cleaned.split():
        counts[(word, word_order[word])] += 1
    # each sort item is a ((word, index), count) tuple, sort by descending counts
    # and then by ascending index.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0][1]))
    return [[word, str(count)] for (word, _), count in ordered]
Demo:
>>> example = "Practice makes perfect, you'll get perfecT by practice. just practice! just just just!!"
>>> word_count_engine(example)
[['just', '4'], ['practice', '3'], ['perfect', '2'], ['makes', '1'], ['youll', '1'], ['get', '1'], ['by', '1']]
In Python 3.6 the implementation of the dict type was updated to save memory, a change that also happened to record insertion order. This means that the order that keys appear in, say, a Counter() produced from your words, would give you the words in document order already! In Python 3.7 this property became part of the language specification.
That's not to say that you can count on something like Counter.most_common() to make use of this property! The documentation for that method is very clear on this:
Elements with equal counts are ordered arbitrarily
However, in practice Counter is a straight-up subclass of dict, and as long as you don't pass in a value for the n argument to Counter.most_common() (or pass in a value less than the Counter length) a straight up sorted() call is used to produce the output and so you can also get the correct output using Counter(). This is not guaranteed to continue to work in future Python versions:
import re
from collections import Counter
def word_count_engine(document):
    cleaned = re.sub(r"[^\w\s]+", "", document.lower())
    counts = Counter(cleaned.split())
    return [[word, str(count)] for word, count in counts.most_common()]
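To see this behaviour on a given interpreter, a small check can help. This sketch assumes CPython 3.7 or newer, where `Counter` keeps insertion order and `most_common()` without arguments sorts stably; the word list is a shortened version of the example:

```python
from collections import Counter

words = "practice makes perfect youll get perfect by practice".split()
counts = Counter(words)

# on CPython 3.7+, words with equal counts come out in first-seen order
tie_order = [word for word, _ in counts.most_common()]
print(tie_order)
```

If the result ever differs on another implementation, the explicit `word_order` approach above does not depend on this behaviour.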
Improved and shortened (relying on collections.Counter object functionality):
from collections import Counter
import re
def word_count_engine(doc):
    doc = re.sub(r'[^\w\s]+', '', doc) # remove all chars except word characters and whitespace
    word_stats = Counter(doc.lower().split())
    return [list(t) for t in word_stats.most_common()]
input_doc = "Practice makes perfect, you'll get perfecT by practice. just practice! just just just!!"
print(word_count_engine(input_doc))
The output:
[['just', 4], ['practice', 3], ['perfect', 2], ['makes', 1], ['youll', 1], ['get', 1], ['by', 1]]
To convert your tuples to lists and sort by the order in which the words appeared in the document if the counts are the same, you can sort by the count in reverse and then sort by the index of the word in a list created by splitting the input string after removing punctuation and lowercasing. Note that this approach will sort words with the same count in the order in which they first appear in the document (repetitions of the same word won't impact the sort).
from collections import Counter
from string import punctuation
s = "Practice makes perfect, you'll get perfecT by practice. just practice! just just just!!"
s = s.translate(s.maketrans('', '', punctuation))
s = s.lower()
words = s.split()
counts = sorted([list(t) for t in Counter(words).items()], key=lambda x: (-x[1], words.index(x[0])))
print(counts)
# [['just', 4], ['practice', 3], ['perfect', 2], ['makes', 1], ['youll', 1], ['get', 1], ['by', 1]]
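`words.index` rescans the list for every distinct word. If that becomes a concern on long documents, the first position of each word can be recorded in one pass instead. This is a sketch of that variation; the `first_seen` dictionary is my addition, not part of the answer above:

```python
from collections import Counter
from string import punctuation

s = "Practice makes perfect, you'll get perfecT by practice. just practice! just just just!!"
words = s.translate(str.maketrans('', '', punctuation)).lower().split()

# remember the index at which each word first appears
first_seen = {}
for i, word in enumerate(words):
    first_seen.setdefault(word, i)

# sort by descending count, then by first appearance
counts = sorted(
    ([word, n] for word, n in Counter(words).items()),
    key=lambda x: (-x[1], first_seen[x[0]]),
)
print(counts)
```

The output is the same as before; only the tie-break lookup changes from a list scan to a dictionary access.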

split by regex and add matches to dictionary

first time posting here.
I'd like to 1) parse the following text:"keyword: some keywords concept :some concepts"
and 2) store into the dictionary: ['keyword']=>'some keywords', ['concept']=>'some concepts'.
There may be 0 or 1 'space' before each 'colon'. The following is what I've tried so far.
import re
sample_text = "keyword: some keywords concept :some concepts"
p_res = re.compile(r"(\S+\s?):").split(sample_text) # Task 1
d_inc = dict([(k, v) for k,v in zip (p_res[::2], p_res[1::2])]) # Task 2
However, the resulting list p_res is wrong: it has an empty entry at index 0, which consequently produces the wrong dict. Is there something wrong with my regex?
Use re.findall to capture the groups of every match as a list of tuples, and then apply dict to convert that list into a dict.
>>> import re
>>> s = 'keyword: some keywords concept :some concepts'
>>> dict(re.findall(r'(\S+)\s*:\s*(.*?)\s*(?=\S+\s*:|$)', s))
{'concept': 'some concepts', 'keyword': 'some keywords'}
>>>
The above regex captures each key and its corresponding value in two separate groups.
I assume that the input string contains only key-value pairs and that keys won't contain any whitespace characters.
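For readability, the same pattern can be written with `re.VERBOSE` so that each part carries a comment. This is only a restatement of the regex above as a sketch; the variable names `pattern` and `pairs` are mine:

```python
import re

pattern = re.compile(r"""
    (\S+)               # key: a run of non-space characters
    \s*:\s*             # colon, optionally surrounded by whitespace
    (.*?)               # value: as little text as possible
    \s*                 # trailing whitespace is not part of the value
    (?=\S+\s*:|$)       # stop before the next key or at the end of the string
""", re.VERBOSE)

s = 'keyword: some keywords concept :some concepts'
pairs = dict(pattern.findall(s))
print(pairs)
```

The lookahead is what ends each value: it checks for the next key without consuming it, so that key is still available for the following match.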
Simply replace Task1 by this line:
p_res = re.compile(r"(\S+\s?):").split(sample_text)[1:] # Task 1
This will always ignore the (normally empty) element that is returned by re.split.
Background: Why does re.split return the empty first result?
What should the program do with this input:
sample_text = "Hello! keyword: some keywords concept :some concepts"
The text Hello! at the beginning of the input doesn't fit into the definition of your problem (which assumes that the input starts with a key).
Do you want to ignore it? Do you want to raise an exception if it appears? Do you want to add it to your dictionary with a special key?
re.split doesn't want to decide this for you: It returns whatever information appears and you make your decision. In our solution, we simply ignore whatever appears before the first key.
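A short sketch of what `re.split` returns for the sample input may make this concrete. Because the capturing group can include a trailing space and the values keep their surrounding whitespace, the sketch also strips both while pairing them up; that stripping step is my addition, not part of the answer:

```python
import re

sample_text = "keyword: some keywords concept :some concepts"

parts = re.split(r"(\S+\s?):", sample_text)
# parts[0] is whatever precedes the first key, here an empty string
leading, rest = parts[0], parts[1:]

# keys and values still carry stray spaces, so strip them while pairing up
d_inc = {k.strip(): v.strip() for k, v in zip(rest[::2], rest[1::2])}
print(repr(leading), d_inc)
```

Keeping `leading` in its own variable leaves room to validate or report unexpected text before the first key instead of silently dropping it.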

Optionally replacing a substring python

My list of replacement is in the following format.
lstrep = [('A',('aa','aA','Aa','AA')),('I',('ii','iI','Ii','II')),.....]
What I want to achieve is optionally change the occurrence of the letter by all the possible replacements. The input word should also be a member of the list.
e.g.
input - DArA
Expected output -
['DArA','DaarA','Daaraa','DAraa','DaArA','DAraA','DaAraA','DAarA','DAarAa', 'DArAa','DAArA','DAArAA','DArAA']
My try was
lstrep = [('A',('aa','aA','Aa','AA'))]
def alte(word,lstrep):
    output = [word]
    for (a,b) in lstrep:
        for bb in b:
            output.append(word.replace(a,bb))
    return output
print(alte('DArA',lstrep))
The output I received was ['DArA', 'Daaraa', 'DaAraA', 'DAarAa', 'DAArAA'] i.e. All occurrences of 'A' were replaced by 'aa','aA','Aa' and 'AA' respectively. What I want is that it should give all permutations of optional replacements.
itertools.product will give all of the permutations. You can build up a list of substitutions and then let it handle the permutations.
import itertools
lstrep = [('A',('aa','aA','Aa','AA')),('I',('ii','iI','Ii','II'))]
input_str = 'DArA'
# make substitution list a dict for easy lookup
lstrep_map = dict(lstrep)
# a substitution is an index plus a string to substitute. build
# list of subs [[(index1, sub1), (index1, sub2)], ...] for all
# characters in lstrep_map.
subs = []
for i, c in enumerate(input_str):
    if c in lstrep_map:
        subs.append([(i, sub) for sub in lstrep_map[c]])
# build output by applying each sub recorded
out = [input_str]
for sub in itertools.product(*subs):
    # make input a list for easy substitution
    input_list = list(input_str)
    for i, cc in sub:
        input_list[i] = cc
    out.append(''.join(input_list))
print(out)
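The expected output in the question keeps some occurrences unchanged and uses one replacement string per variant. Under that reading (an assumption drawn from the 13 listed results), a sketch could enumerate keep/replace choices per occurrence with `itertools.product`; this `alte` is a hypothetical rewrite of the asker's function, not the code above:

```python
from itertools import product

def alte(word, lstrep):
    output = [word]
    for letter, replacements in lstrep:
        # positions where the replaceable letter occurs
        positions = [i for i, c in enumerate(word) if c == letter]
        if not positions:
            continue
        for rep in replacements:
            # each occurrence is either kept (False) or replaced (True)
            for mask in product((False, True), repeat=len(positions)):
                if not any(mask):
                    continue  # the untouched word is already in the output
                chars = list(word)
                for pos, use in zip(positions, mask):
                    if use:
                        chars[pos] = rep
                output.append(''.join(chars))
    return output

lstrep = [('A', ('aa', 'aA', 'Aa', 'AA')), ('I', ('ii', 'iI', 'Ii', 'II'))]
variants = alte('DArA', lstrep)
print(variants)
```

For 'DArA' this yields the original word plus three variants for each of the four replacements, the same 13 strings as the expected output, though in a different order.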
Try constructing tuples of all possible permutations based on the replaceable characters that occur. This will have to be achieved using recursion.
The reason recursion is necessary is that you would need a variable number of loops to achieve this.
For your example "DArA" (2 replaceable characters, "A" and "A"):
replaceSet = set()
replacements = {'A':('aa','aA','Aa','AA'),'I':('ii','iI','Ii','II')}  # ..... (further entries elided)
for replacement1 in replacements["A"]:
    for replacement2 in replacements["A"]:
        replaceSet.add((replacement1, replacement2))
You see you need two loops for two replaceables, and n loops for n replaceables.
Think of a way you could use recursion to solve this problem. It will likely involve creating all permutations for a substring that contains n-1 replaceables (if you had n in your original string).
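One way to read that hint is sketched below, under the assumption that every replaceable character may either stay as it is or take any one of its replacements. `expand` is a hypothetical helper name:

```python
def expand(word, replacements):
    """Return every variant of word where each replaceable character is kept or substituted."""
    if not word:
        return ['']
    head, tail = word[0], word[1:]
    # solve the smaller problem first: all variants of the rest of the word
    tails = expand(tail, replacements)
    # the character itself plus any replacements registered for it
    heads = (head,) + tuple(replacements.get(head, ()))
    return [h + t for h in heads for t in tails]

replacements = {'A': ('aa', 'aA', 'Aa', 'AA'), 'I': ('ii', 'iI', 'Ii', 'II')}
variants = expand('DArA', replacements)
print(len(variants), variants[:3])
```

Each recursive call handles one character, so the number of nested loops adapts to the word. With two replaceable letters and five choices each, this produces 25 variants, which is more than the question's list because different replacements may be mixed within one word.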
