I have a string (or a list of words). I would like to create tuples of every possible word-pair combination so I can pass them to a Counter for dictionary creation and frequency calculation. The frequency is calculated as follows: if a pair occurs together in a string (regardless of order, or of any other words between them), its frequency is 1. Even if word1 occurs 7 times and word2 occurs 3 times, the frequency of the pair (word1, word2) is still 1.
I am using loops to create tuples of all pairs but got stuck:
tweetList = ('I went to work but got delayed at other work and got stuck in a traffic and I went to drink some coffee but got no money and asked for money from work', 'We went to get our car but the car was not ready. We tried to expedite our car but were told it is not ready')
words = set(tweetList.split())
n = 10
for tweet in tweetList:
    for word1 in words:
        for word2 in words:
            pairW = [(word1, word2)]
            c1 = Counter(pairW for pairW in tweet)
c1.most_common(n)
However, the output is very bizarre:
[('k', 1)]
It seems it is iterating over letters instead of words. How can this be addressed? By converting the string into a list of words using split()?
Another question: how can I avoid creating duplicate tuples such as (word1, word2) and (word2, word1)? With enumerate?
As output I expect a dictionary where each key is a word pair (see the duplicates comment above) and the value is the frequency of that pair in the list.
Thank you!
I wonder if that's what you want:
import itertools, collections

tweets = ['I went to work but got delayed at other work and got stuck in a traffic and I went to drink some coffee but got no money and asked for money from work',
          'We went to get our car but the car was not ready. We tried to expedite our car but were told it is not ready']

words = set(word.lower() for tweet in tweets for word in tweet.split())
_pairs = list(itertools.permutations(words, 2))

# We need to clean up similar pairs: sort words in each pair and then convert
# them to tuple so we can convert the whole list into a set.
pairs = set(map(tuple, map(sorted, _pairs)))

c = collections.Counter()
for tweet in tweets:
    for pair in pairs:
        # Note: `in` on a string is a substring test, so a short word like
        # 'a' also matches inside longer words such as 'car'; split the
        # tweet into a word set if exact word matches are required.
        if pair[0] in tweet and pair[1] in tweet:
            c.update({pair: 1})

print(c.most_common(10))
Result is: [(('a', 'went'), 2), (('a', 'the'), 2), (('but', 'i'), 2), (('i', 'the'), 2), (('but', 'the'), 2), (('a', 'i'), 2), (('a', 'we'), 2), (('but', 'we'), 2), (('no', 'went'), 2), (('but', 'went'), 2)]
tweet is a string, so Counter(pairW for pairW in tweet) will compute the frequency of the letters in tweet, which is probably not what you want.
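For what it's worth, here is a sketch of a whole-word version (the shortened tweets are my own examples, not the asker's): take combinations of each tweet's word set, so every unordered pair is generated exactly once per tweet and counts at most 1 per tweet.

```python
import itertools
import collections

tweets = ['I went to work but got stuck in traffic',
          'We went to get our car but the car was not ready']

c = collections.Counter()
for tweet in tweets:
    # Unique lowercased words of this tweet; sorting makes the pair
    # ordering deterministic, e.g. always ('but', 'went').
    tweet_words = sorted(set(tweet.lower().split()))
    # combinations() yields each unordered pair once, so there are no
    # (w1, w2) / (w2, w1) duplicates to clean up afterwards.
    c.update(itertools.combinations(tweet_words, 2))

print(c.most_common(3))
```

Because the pairs are built per tweet from exact word tokens, a pair's count equals the number of tweets in which both words occur, regardless of order or distance.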
Related
I have the following list of tuples:
words = [('This', 0), ('showed', 0), ('good', 0), ('Patency', 140),
('of', 1), ('the', 1), ('vein', 7), ('graft', 30), ('with', 0),
('absence', 2), ('.', 0), ('FINDINGS', 0), (':', 0), (2, 5)]
The variable words has the structure [(token, n_of_occurrence)].
I want to preprocess words by removing all tuples whose token is a digit, NaN, a stopword, or punctuation, and by removing all duplicates while keeping track of their occurrences. I expect the following output:
[('showed', 0), ('good', 0), ('patency', 140),
 ('vein', 7), ('graft', 30), ('absence', 2), ('findings', 0)]
I tried the following but ended up with an inappropriate output: an empty list.
tokens = [w[0].lower() for w in [str(w) for w in set(words)] if (w[0] != 'nan' and
          w[0].isdigit() != True and not
          w[0].replace('.', '', 1).isdigit())]
items = [t for t in tokens if (t not in stopwords.words('english') and
         t not in string.punctuation)]
here items will be my final preprocessed list of tuples.
Thanks! I tried what was proposed in the solutions and found the correct answer.
tokens = [(w.lower(), cnt) for w, cnt in [(str(w), cnt) for w, cnt in set(words)]
          if (w != 'nan' and
              not w.isdigit() and
              not w.replace('.', '', 1).isdigit())]
items = [(t, cnt) for t, cnt in tokens if (t not in stopwords.words('english') and
         t not in string.punctuation)]
This code will output exactly what was expected in the question.
For now, I'd put preprocessWords in a function and work on each case one by one, i.e. write the code to test for NaN, punctuation, etc. each in a different block of code, rather than trying to do it all at once, and then fault-find each area separately. Therefore:
from nltk.corpus import stopwords

def preprocessWords(words):
    goodwords = []  # This is returned as the answer
    temp = []       # Used to spot duplicates
    for word, q in words:  # word gets tested against multiple criteria; if none match, the final "else" adds it to goodwords
        word = str(word)   # tokens may be ints, e.g. (2, 5)
        if word.replace('.', '', 1).isdigit():
            pass  # Disregard purely numerical items, including strings such as "12.5".
        elif word in ";:,.!":  # Fill out full list of denied punctuation
            pass
        elif word.lower() in stopwords.words('english'):  # Denied, for some reason...
            pass
        elif word.lower() in temp:  # Duplicate. temp only has .lower() values, so convert to lower
            pass
        else:
            goodwords.append([word, q])  # Gets returned as the answer
            temp.append(word.lower())    # List used to spot duplicates
    return goodwords
Structure: You COULD merge all of those into one mega-line of "if not (TEST1) or (TEST2) or (TEST3)", but whilst in development I'd keep each test separate; it gives you the chance to add print(word, "failed due to having punctuation") and print(word, "failed due to duplicate") whilst debugging.
Duplicates: If there are duplicates, surely you want to add up the q values? You didn't say in your question - but if so, on the "if word in temp" line, you'll want to add logic to += the q value in goodwords for that word.
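If summing is indeed the desired behavior, here is one way to sketch that merge (the function name and the sample data are mine, for illustration only): accumulate q values in a dict keyed by the lowercased token.

```python
def merge_duplicates(words):
    # Merge counts for case-insensitive duplicate tokens. Assumes `words`
    # is a list of (token, count) pairs that have already been filtered
    # for digits, punctuation and stopwords.
    merged = {}
    for word, q in words:
        key = str(word).lower()
        merged[key] = merged.get(key, 0) + q
    return list(merged.items())

print(merge_duplicates([('Patency', 140), ('patency', 7), ('vein', 7)]))
```

Note that a plain dict preserves insertion order in Python 3.7+, so the first spelling encountered determines where the merged pair appears in the result.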
Punctuation-words: You'll want to define a full list of punctuation. An alternative would be to search for any word that is just one character long and then check whether that character is a-z or A-Z. But that is more complicated than you might think, given that á, é, í and lots of other possible letters from different languages would also have to be specified.
I want to count occurrences of list elements in text with Python. I know that I can use .count(), but I have read that this can affect performance. Also, an element in the list can have more than one word.
my_list = ["largest", "biggest", "greatest", "the best"]
my_text = "i have the biggest house and the biggest car. My friend is the best. Best way win this is to make the largest house and largest treehouse and then you will be the greatest"
I can do this:
num = 0
for i in my_list:
    num += my_text.lower().count(i.lower())
print(num)
This works, but what if my list has 500 elements and my string is 3000 words? In that case performance is very poor.
Is there a way to do this but with good / fast performance?
Since my_list contains strings with more than one word, you'll have to find the n-grams of my_text to find matches, since splitting on spaces won't do. Also note that your approach is not advisable, as for every single string in my_list, you'll be traversing the whole string my_text by using count. A better way would be to predefine the n-grams that you'll be looking for beforehand.
Here's one approach using nltk's ngram.
I've added another string in my_list to better illustrate the process:
from nltk import ngrams
from collections import Counter, defaultdict
my_list = ["largest", "biggest", "greatest", "the best", 'My friend is the best']
my_text = "i have the biggest house and the biggest car. My friend is the best. Best way win this is to make the largest house and largest treehouse and then you will be the greatest"
The first step is to define a dictionary containing the different lengths of the n-grams that we'll be looking up:
d = defaultdict(list)
for i in my_list:
    k = i.split()
    d[len(k)].append(tuple(k))

print(d)
defaultdict(list,
            {1: [('largest',), ('biggest',), ('greatest',)],
             2: [('the', 'best')],
             5: [('My', 'friend', 'is', 'the', 'best')]})
Then split my_text into a list, and for each key in d find the corresponding n-grams and build a Counter from the result. Then for each value in that specific key in d, update with the counts from the Counter:
my_text_split = my_text.replace('.', '').split()
match_counts = dict()
for n, v in d.items():
    c = Counter(ngrams(my_text_split, n))
    for k in v:
        if k in c:
            match_counts[k] = c[k]
Which will give:
print(match_counts)
{('largest',): 2,
('biggest',): 2,
('greatest',): 1,
('the', 'best'): 1,
('My', 'friend', 'is', 'the', 'best'): 1}
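If you'd rather avoid the nltk dependency, the same n-grams can be built with zip. A minimal sketch (assuming the same whitespace tokenization as above; the helper name is mine):

```python
from collections import Counter

def ngrams(tokens, n):
    # Slide a window of size n over the token list: zip() stops at the
    # shortest of the n shifted views, producing exactly the n-grams.
    return zip(*(tokens[i:] for i in range(n)))

tokens = "the biggest house and the biggest car".split()
bigram_counts = Counter(ngrams(tokens, 2))
print(bigram_counts[('the', 'biggest')])  # -> 2
```

This is the standard zip-of-shifted-slices idiom; it does the same job as nltk.ngrams for lists that fit in memory.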
I've been solving problems in checkio.com and one of the questions was: "Write a function to find the letter which occurs the maximum number of times in a given string"
The top solution was:
import string

def checkio(text):
    """
    We iterate through the latin alphabet and count each letter in the text.
    Then 'max' selects the most frequent letter.
    When several letters are equally frequent,
    'max' selects the first of them.
    """
    text = text.lower()
    return max(string.ascii_lowercase, key=text.count)
I didn't understand what text.count is when it is used as the key in the max function.
Edit: Sorry for not being more specific. I know what the program does as well as what str.count() does. I want to know what text.count itself is. If .count is a method, shouldn't it be followed by parentheses?
key=text.count is what counts the number of times each letter appears in the string; max then takes the letter with the highest of those counts, giving you the most frequent letter.
When the following code is run, the result is e, which, if you count, is the most frequent letter.
import string

def checkio(text):
    """
    We iterate through the latin alphabet and count each letter in the text.
    Then 'max' selects the most frequent letter.
    When several letters are equally frequent,
    'max' selects the first of them.
    """
    text = text.lower()
    return max(string.ascii_lowercase, key=text.count)

print(checkio('hello my name is heinst'))
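To address the edit directly: text.count without parentheses is a bound method object, not a call. Passing it as key hands max a function it can invoke once per candidate letter. A minimal sketch:

```python
text = "hello"
count_l = text.count   # a bound method object; no parentheses, so no call yet
print(count_l)         # e.g. <built-in method count of str object at 0x...>
print(count_l('l'))    # calling it now counts 'l' in "hello" -> 2
```

So key=text.count is equivalent to key=lambda ch: text.count(ch), just without the extra lambda wrapper.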
A key function in max() is called for each element to provide an alternative to determine the maximum by, which in this case isn't all that efficient.
Essentially, the line max(string.ascii_lowercase, key=text.count) can be translated to:
max_character, max_count = None, -1
for character in string.ascii_lowercase:
    count = text.count(character)
    if count > max_count:
        max_character, max_count = character, count
return max_character
where str.count() loops through the whole of text counting how often character occurs.
You should really use a multiset / bag here instead; in Python that's provided by the collections.Counter() type:
max_character = Counter(text.lower()).most_common(1)[0][0]
The Counter() takes O(N) time to count the characters in a string of length N, then to find the maximum, another O(K) to determine the highest count, where K is the number of unique characters. Asymptotically speaking, that makes the whole process take O(N) time.
The max() approach takes O(MN) time, where M is the length of string.ascii_lowercase.
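Both approaches give the same answer on letter-only comparisons; only the running time differs. One caveat with Counter(text.lower()) as written above: it also counts spaces and punctuation, which can win in short texts. A quick sanity check (a sketch restricting the Counter to letters):

```python
import string
from collections import Counter

text = "hello my name is heinst"

# The O(M*N) approach from the accepted answer:
via_max = max(string.ascii_lowercase, key=text.count)

# The O(N) Counter approach, filtered to letters so spaces can't win:
letter_counts = Counter(ch for ch in text if ch in string.ascii_lowercase)
via_counter = letter_counts.most_common(1)[0][0]

print(via_max, via_counter)  # both print 'e'
```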
Use the Counter class from the collections module.
>>> import collections
>>> word = "supercalafragalistic"
>>> c = collections.Counter(word)
>>> c.most_common()
[('a', 4), ('c', 2), ('i', 2), ('l', 2), ('s', 2), ('r', 2), ('e', 1), ('g', 1), ('f', 1), ('p', 1), ('u', 1), ('t', 1)]
>>> c.most_common()[0]
('a', 4)
I'm fairly new to Python and I have this program that I was tinkering with. It's supposed to get a string from input and display which character is the most frequent.
stringToData = input("Please enter your string: ")

# imports the collections module
import collections

# gets the data needed from the collection
letter, count = collections.Counter(stringToData).most_common(1)[0]

# prints the results
print("The most frequent character is %s, which occurred %d times." % (letter, count))
However, if the string has one of each character, it only displays one letter and says it's the most frequent character. I thought about changing the number in most_common(number), but I didn't want it to display the counts of all the other letters every time.
Thank you to all that help!
As I explained in the comment:
You can leave off the parameter to most_common to get a list of all characters, ordered from most common to least common. Then just loop through that result and collect the characters as long as the counter value is still the same. That way you get all characters that are most common.
Counter.most_common(n) returns the n most common elements from the counter. Or in case where n is not specified, it will return all elements from the counter, ordered by the count.
>>> collections.Counter('abcdab').most_common()
[('a', 2), ('b', 2), ('c', 1), ('d', 1)]
You can use this behavior to simply loop through all elements, ordered by their count. As long as the count is the same as that of the first element in the output, you know the element occurred in the same quantity in the string.
>>> c = collections.Counter('abcdefgabc')
>>> maxCount = c.most_common(1)[0][1]
>>> elements = []
>>> for element, count in c.most_common():
...     if count != maxCount:
...         break
...     elements.append(element)
...
>>> elements
['a', 'c', 'b']
>>> [e for e, c in c.most_common() if c == maxCount]
['a', 'c', 'b']
I am a beginner in Python and I am trying to solve some questions about lists. I got stuck on one problem and am not able to solve it:
Write a function countLetters(word) that takes in a word as argument
and returns a list that counts the number of times each letter
appears. The letters must be sorted in alphabetical order.
Ex:
>>> countLetters('google')
[('e', 1), ('g', 2), ('l', 1), ('o', 2)]
I am not able to count the occurrences of every character. For sorting I am using sorted(list), and I am also using a dictionary (its items() method) for this output format (a list of tuples). But I am not able to link all these things together.
Use sets !
m = "google"
u = set(m)
sorted([(l, m.count(l)) for l in u])
[('e', 1), ('g', 2), ('l', 1), ('o', 2)]
A hint: Note that you can loop through a string in the same way as a list or other iterable object in python:
def countLetters(word):
    for letter in word:
        print(letter)

countLetters("ABC")
The output will be:
A
B
C
So instead of printing, use the loop to look at what letter you've got (in your letter variable) and count it somehow.
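Following that hint, one possible completion (a sketch using a plain dict rather than any particular library):

```python
def countLetters(word):
    counts = {}
    for letter in word:
        # count each letter as we see it
        counts[letter] = counts.get(letter, 0) + 1
    # sorted() on dict items gives the alphabetical order the problem asks for
    return sorted(counts.items())

print(countLetters('google'))  # [('e', 1), ('g', 2), ('l', 1), ('o', 2)]
```

The same result can also be had in one line with collections.Counter: sorted(Counter(word).items()).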
Finally, I made it!
import collections

def countch(strng):
    d = collections.defaultdict(int)
    for letter in strng:
        d[letter] += 1
    print(sorted(d.items()))
This is my solution. Now I can ask for your solutions to this problem; I would love to see your code.