Related
I have this string
tw =('BenSasse, well I did teach her the bend-and-snap https://twitter.com/bethanyshondark/status/903301101855928322 QT #bethanyshondark Is Reese channeling #BenSasse https://acculturated.com/reese-witherspoons-daughter-something-many-celebrity-children-lack-work-ethic/ , Twitter for Android')
I need to create a list with all the words that have more than 3 vowels. Please help!
You can use re.findall with the following regex:
import re
re.findall(r'(?:[a-z-]*[aeiou]){3,}[a-z-]*', tw, flags=re.IGNORECASE)
This returns:
['BenSasse', 'bend-and-snap', 'bethanyshondark', 'bethanyshondark', 'Reese', 'channeling', 'BenSasse', 'acculturated', 'reese-witherspoons-daughter-something-many-celebrity-children-lack-work-ethic', 'Android']
I would suggest you start with creating a list of all vowels:
vowels = ['a','e','i','o','u']
Well, a list of letters (Char) is really the same as a string, so I would just do the following:
vowels = "aeiou"
After that I would attempt to split your string into words. Let's try tw.split() like Joran Beasley suggested. It returns:
['BenSasse,', 'well', 'I', 'did', 'teach', 'her', 'the', 'bend-and-snap', 'https://twitter.com/bethanyshondark/status/903301101855928322', 'QT', '#bethanyshondark', 'Is', 'Reese', 'channeling', '#BenSasse', 'https://acculturated.com/reese-witherspoons-daughter-something-many-celebrity-children-lack-work-ethic/', ',', 'Twitter', 'for', 'Android']
Are you fine with this being your "words"? Notice that each link is a "word". I'm going to assume that this is fine.
Ok, so if we access each word with a for-loop, we can then access each letter with an inner for-loop. But before we start, we need to keep track of all the accepted words with 3 or more vowels, so make a new list: final_list = list(). Now:
for word in tw.split():
counter=0 # Let's keep track of how many vowels we have in a word
for letter in word:
if letter in vowels:
counter = counter+1
if counter >= 3:
final_list.append(word) # Add the word if 3 or more vowels.
If you now do a print: print(final_list) you should get:
['BenSasse,', 'bend-and-snap', 'https://twitter.com/bethanyshondark/status/903301101855928322', '#bethanyshondark', 'Reese', 'channeling', '#BenSasse', 'https://acculturated.com/reese-witherspoons-daughter-something-many-celebrity-children-lack-work-ethic/']
I am trying to split the sentences in words.
words = content.lower().split()
this gives me the list of words like
'evening,', 'and', 'there', 'was', 'morning--the', 'first', 'day.'
and with this code:
def clean_up_list(word_list):
clean_word_list = []
for word in word_list:
symbols = "~!##$%^&*()_+`{}|\"?><`-=\][';/.,']"
for i in range(0, len(symbols)):
word = word.replace(symbols[i], "")
if len(word) > 0:
clean_word_list.append(word)
I get something like:
'evening', 'and', 'there', 'was', 'morningthe', 'first', 'day'
if you see the word "morningthe" in the list, it used to have "--" in between words. Now, is there any way I can split them in two words like "morning","the"??
I would suggest a regex-based solution:
import re
def to_words(text):
return re.findall(r'\w+', text)
This looks for all words - groups of alphabetic characters, ignoring symbols, seperators and whitespace.
>>> to_words("The morning-the evening")
['The', 'morning', 'the', 'evening']
Note that if you're looping over the words, using re.finditer which returns a generator object is probably better, as you don't have store the whole list of words at once.
Alternatively, you may also use itertools.groupby along with str.alpha() to extract alphabets-only words from the string as:
>>> from itertools import groupby
>>> sentence = 'evening, and there was morning--the first day.'
>>> [''.join(j) for i, j in groupby(sentence, str.isalpha) if i]
['evening', 'and', 'there', 'was', 'morning', 'the', 'first', 'day']
PS: Regex based solution is much cleaner. I have mentioned this as an possible alternative to achieve this.
Specific to OP: If all you want is to also split on -- in the resultant list, then you may firstly replace hyphens '-' with space ' ' before performing split. Hence, your code should be:
words = content.lower().replace('-', ' ').split()
where words will hold the value you desire.
Trying to do this with regexes will send you crazy e.g.
>>> re.findall(r'\w+', "Don't read O'Rourke's books!")
['Don', 't', 'read', 'O', 'Rourke', 's', 'books']
Definitely look at the nltk package.
Besides the solutions given already, you could also improve your clean_up_list function to do a better work.
def clean_up_list(word_list):
clean_word_list = []
# Move the list out of loop so that it doesn't
# have to be initiated every time.
symbols = "~!##$%^&*()_+`{}|\"?><`-=\][';/.,']"
for word in word_list:
current_word = ''
for index in range(len(word)):
if word[index] in symbols:
if current_word:
clean_word_list.append(current_word)
current_word = ''
else:
current_word += word[index]
if current_word:
# Append possible last current_word
clean_word_list.append(current_word)
return clean_word_list
Actually, you could apply the block in for word in word_list: to the whole sentence to get the same result.
You could also do this:
import re
def word_list(text):
return list(filter(None, re.split('\W+', text)))
print(word_list("Here we go round the mulberry-bush! And even---this and!!!this."))
Returns:
['Here', 'we', 'go', 'round', 'the', 'mulberry', 'bush', 'And', 'even', 'this', 'and', 'this']
Using the following code from https://stackoverflow.com/a/11899925, I am able to find if a word is unique or not (by comparing if it was used once or greater than once):
helloString = ['hello', 'world', 'world']
count = {}
for word in helloString :
if word in count :
count[word] += 1
else:
count[word] = 1
But, if I were to have a string with hundreds of words, how would I be able to count the number of unique words within that string?
For example, my code has:
uniqueWordCount = 0
helloString = ['hello', 'world', 'world', 'how', 'are', 'you', 'doing', 'today']
count = {}
for word in words :
if word in count :
count[word] += 1
else:
count[word] = 1
How would I be able to set uniqueWordCount to 6? Usually, I am really good at solving these types of algorithmic puzzles, but I have been unsuccessful with figuring this one out. I feel as if it is right beneath my nose.
The best way to solve this is to use the set collection type. A set is a collection in which all elements are unique. Therefore:
unique = set([ 'one', 'two', 'two'])
len(unique) # is 2
You can use a set from the outset, adding words to it as you go:
unique.add('three')
This will throw out any duplicates as they are added. Or, you can collect all the elements in a list and pass the list to the set() function, which will remove the duplicates at that time. The example I provided above shows this pattern:
unique = set([ 'one', 'two', 'two'])
unique.add('three')
# unique now contains {'one', 'two', 'three'}
Read more about sets in Python.
You have many options for this, I recommend a set, but you can also use a counter, which counts the amount a number shows up, or you can look at the number of keys for the dictionary you made.
Set
You can also convert the list to a set, where all elements have to be unique. Not unique elements are discarded:
helloString = ['hello', 'world', 'world', 'how', 'are', 'you', 'doing', 'today']
helloSet = set(helloString) #=> ['doing', 'how', 'are', 'world', 'you', 'hello', 'today']
uniqueWordCount = len(set(helloString)) #=> 7
Here's a link to further reading on sets
Counter
You can also use a counter, which can also tell you how often a word was used, if you still need that information.
from collections import Counter
helloString = ['hello', 'world', 'world', 'how', 'are', 'you', 'doing', 'today']
counter = Counter(helloString)
len(counter) #=> 7
counter["world"] #=> 2
Loop
At the end for your loop, you can check the len of count, also, you mistyped helloString as words:
uniqueWordCount = 0
helloString = ['hello', 'world', 'world', 'how', 'are', 'you', 'doing', 'today']
count = {}
for word in helloString:
if word in count :
count[word] += 1
else:
count[word] = 1
len(count) #=> 7
You can use collections.Counter
helloString = ['hello', 'world', 'world']
from collections import Counter
c = Counter(helloString)
print("There are {} unique words".format(len(c)))
print('They are')
for k, v in c.items():
print(k)
I know the question doesn't specifically ask for this, but to maintain order
helloString = ['hello', 'world', 'world', 'how', 'are', 'you', 'doing', 'today']
from collections import Counter, OrderedDict
class OrderedCounter(Counter, OrderedDict):
pass
c = OrderedCounter(helloString)
print("There are {} unique words".format(len(c)))
print('They are')
for k, v in c.items():
print(k)
In your current code you can either increment uniqueWordCount in the else case where you already set count[word], or just lookup the number of keys in the dictionary: len(count).
If you only want to know the number of unique elements, then get the elements in the set: len(set(helloString))
I would do this using a set.
def stuff(helloString):
hello_set = set(helloString)
return len(hello_set)
Counter is the efficient way to do it.
This code is similar to counter,
text = ['hello', 'world']
# create empty dictionary
freq_dict = {}
# loop through text and count words
for word in text:
# set the default value to 0
freq_dict.setdefault(word, 0)
# increment the value by 1
freq_dict[word] += 1
for key,value in freq_dict.items():
if value == 1:
print(f'Word "{key}" has single appearance in the list')
Word "hello" has single appearance in the list
Word "world" has single appearance in the list
[Program finished]
I may be misreading the question but I believe the goal is to find all elements which only occur one time in the list.
from collections import Counter
helloString = ['hello', 'world', 'world', 'how', 'are', 'you', 'doing', 'today']
counter = Counter(helloString)
uniques = [value for value, count in counter.items() if count == 1]
This will give us 6 items because "world" occurs twice in our list:
>>> uniques
['you', 'are', 'doing', 'how', 'today', 'hello']
I have a several lists of words and need a total count of each word.
Two lines:
['ashtonsos', 'i', 'heard', 'you', 'shouldnt', 'trust', 'claires', 'with', 'piercings', 'lol']
['liveaidstyles', 'thank', 'you', 'so', 'much', '\xf0\x9f\x92\x98']
I have imported the collections counter, using the line "from collections import Counter"
And this is my code:
for word in words:
if word not in unique_words:
unique_words.append(word)
#print unique_words
tweet_count = Counter(unique_words)
for word in unique_words:
tweet_count.update()
for word in tweet_count:
print word, tweet_count[word]
What that prints is each word followed by a 1, even if the word is repeated. So, basically, the counter isn't counting.
FYI...the '.update()' line...I've also used 'tweet_count += 1'... and it returns the same result.
What am I doing wrong??
Isn't it obvious? You're counting a list of unique_words. Unique, by definition, occurring once.
Try this:
counter = Counter()
for my_list in my_list_of_lists:
counter += Counter(set(my_list))
modified it to:
for word in words:
if word not in AFINN and word not in unique_words:
unique_words.append(word)
for word in unique_words:
tweet_count[word] = tweet_count.get(word,0) + 1
I wrote some code to find the most popular words in submission titles on reddit, using the reddit praw api.
import nltk
import praw
picksub = raw_input('\nWhich subreddit do you want to analyze? r/')
many = input('\nHow many of the top words would you like to see? \n\t> ')
print 'Getting the top %d most common words from r/%s:' % (many,picksub)
r = praw.Reddit(user_agent='get the most common words from chosen subreddit')
submissions = r.get_subreddit(picksub).get_top_from_all(limit=200)
hey = []
for x in submissions:
hey.extend(str(x).split(' '))
fdist = nltk.FreqDist(hey) # creates a frequency distribution for words in 'hey'
top_words = fdist.keys()
common_words = ['its','am', 'ago','took', 'got', 'will', 'been', 'get', 'such','your','don\'t', 'if', 'why', 'do', 'does', 'or', 'any', 'but', 'they', 'all', 'now','than','into','can', 'i\'m','not','so','just', 'out','about','have','when', 'would' ,'where', 'what', 'who' 'I\'m','says' 'not', '', 'over', '_', '-','after', 'an','for', 'who', 'by', 'from', 'it', 'how', 'you', 'about' 'for', 'on', 'as', 'be', 'has', 'that', 'was', 'there', 'with','what', 'we', '::', 'to', 'the', 'of', ':', '...', 'a', 'at', 'is', 'my', 'in' , 'i', 'this', 'and', 'are', 'he', 'she', 'is', 'his', 'hers']
already = []
counter = 0
number = 1
print '-----------------------'
for word in top_words:
if word.lower() not in common_words and word.lower() not in already:
print str(number) + ". '" + word + "'"
counter +=1
number +=1
already.append(word.lower())
if counter == many:
break
print '-----------------------\n'
so inputting subreddit 'python' and getting 10 posts returns:
'Python'
'PyPy'
'code'
'use'
'136'
'181'
'd...'
'IPython'
'133'
10. '158'
How can I make this script not return numbers, and error words like 'd...'? The first 4 results are acceptable, but I would like to replace this rest with words that make sense. Making a list common_words is unreasonable, and doesn't filter these errors. I'm relatively new to writing code, and I appreciate the help.
I disagree. Making a list of common words is correct, there is no easier way to filter out the, for, I, am, etc.. However, it is unreasonable to use the common_words list to filter out results that aren't words, because then you'd have to include every possible non-word you don't want. Non-words should be filtered out differently.
Some suggestions:
1) common_words should be a set(), since your list is long this should speed things up. The in operation for sets in O(1), while for lists it is O(n).
2) Getting rid of all number strings is trivial. One way you could do it is:
all([w.isdigit() for w in word])
Where if this returns True, then the word is just a series of numbers.
3) Getting rid of the d... is a little more tricky. It depends on how you define a non-word. This:
tf = [ c.isalpha() for c in word ]
Returns a list of True/False values (where it is False if the char was not a letter). You can then count the values like:
t = tf.count(True)
f = tf.count(False)
You could then define a non-word as one that has more non-letter chars in it than letters, as one that has any non-letter characters at all, etc. For example:
def check_wordiness(word):
# This returns true only if a word is all letters
return all([ c.isalpha() for c in word ])
4) In the for word in top_words: block, are you sure that you have not mixed up counter and number? Also, counter and number are pretty much redundant, you could rewrite the last bit as:
for word in top_words:
# Since you are calling .lower() so much,
# you probably want to define it up here
w = word.lower()
if w not in common_words and w not in already:
# String formatting is preferred over +'s
print "%i. '%s'" % (number, word)
number +=1
# This could go under the if statement. You only want to add
# words that could be added again. Why add words that are being
# filtered out anyways?
already.append(w)
# this wasn't indented correctly before
if number == many:
break
Hope that helps.