Python: Split list based on first character of word - python

Im kind of stuck on an issue and Ive gone round and round with it until ive confused myself.
What I am trying to do is take a list of words:
['About', 'Absolutely', 'After', 'Aint', 'Alabama', 'AlabamaBill', 'All', 'Also', 'Amos', 'And', 'Anyhow', 'Are', 'As', 'At', 'Aunt', 'Aw', 'Bedlam', 'Behind', 'Besides', 'Biblical', 'Bill', 'Billgone']
Then sort them under and alphabetical order:
A
About
Absolutely
After
B
Bedlam
Behind
etc...
Is there and easy way to do this?

Use itertools.groupby() to group your input by a specific key, such as the first letter:
from itertools import groupby
from operator import itemgetter
for letter, words in groupby(sorted(somelist), key=itemgetter(0)):
print letter
for word in words:
print word
print
If your list is already sorted, you can omit the sorted() call. The itemgetter(0) callable will return the first letter of each word (the character at index 0), and groupby() will then yield that key plus an iterable that consists only of those items for which the key remains the same. In this case that means looping over words gives you all items that start with the same character.
Demo:
>>> somelist = ['About', 'Absolutely', 'After', 'Aint', 'Alabama', 'AlabamaBill', 'All', 'Also', 'Amos', 'And', 'Anyhow', 'Are', 'As', 'At', 'Aunt', 'Aw', 'Bedlam', 'Behind', 'Besides', 'Biblical', 'Bill', 'Billgone']
>>> from itertools import groupby
>>> from operator import itemgetter
>>>
>>> for letter, words in groupby(sorted(somelist), key=itemgetter(0)):
... print letter
... for word in words:
... print word
... print
...
A
About
Absolutely
After
Aint
Alabama
AlabamaBill
All
Also
Amos
And
Anyhow
Are
As
At
Aunt
Aw
B
Bedlam
Behind
Besides
Biblical
Bill
Billgone

Instead of using any library imports, or anything fancy.
Here is the logic:
def splitLst(x):
dictionary = dict()
for word in x:
f = word[0]
if f in dictionary.keys():
dictionary[f].append(word)
else:
dictionary[f] = [word]
return dictionary
splitLst(['About', 'Absolutely', 'After', 'Aint', 'Alabama', 'AlabamaBill', 'All', 'Also', 'Amos', 'And', 'Anyhow', 'Are', 'As', 'At', 'Aunt', 'Aw', 'Bedlam', 'Behind', 'Besides', 'Biblical', 'Bill', 'Billgone'])

def split(n):
n2 = []
for i in n:
if i[0] not in n2:
n2.append(i[0])
n2.sort()
for j in n:
z = j[0]
z1 = n2.index(z)
n2.insert(z1+1, j)
return n2
word_list = ['be','have','do','say','get','make','go','know','take','see','come','think',
'look','want','give','use','find','tell','ask','work','seem','feel','leave','call']
print(split(word_list))

Related

How to extract specific words from a string?

I have to extract two things from a string: A list that contains stop-words, and another list that contains the rest of the string.
text = 'he is the best when people in our life'
stopwords = ['he', 'the', 'our']
contains_stopwords = []
normal_words = []
for i in text.split():
for j in stopwords:
if i in j:
contains_stopwords.append(i)
else:
normal_words.append(i)
if text.split() in stopwords:
contains_stopwords.append(text.split())
else:
normal_words.append(text.split())
print("contains_stopwords:", contains_stopwords)
print("normal_words:", normal_words)
Output:
contains_stopwords: ['he', 'he', 'the', 'our']
normal_words: ['he', 'is', 'is', 'is', 'the', 'the', 'best', 'best', 'best', 'when', 'when', 'when', 'people', 'people', 'people', 'in', 'in', 'in', 'our', 'our', 'life', 'life', 'life', ['he', 'is', 'the', 'best', 'when', 'people', 'in', 'our', 'life']]
Desired result:
contains_stopwords: ['he', 'the', 'our']
normal_words: ['is', 'best', 'when', 'people', 'in', 'life']
One answer could be:
text = 'he is the best when people in our life'
stopwords = ['he', 'the', 'our']
contains_stopwords = set() # The set data structure guarantees there won't be any duplicate
normal_words = []
for word in text.split():
if word in stopwords:
contains_stopwords.add(word)
else:
normal_words.append(word)
print("contains_stopwords:", contains_stopwords)
print("normal_words:", normal_words)
you seem to have chosen the most difficult path. The code under should do the trick.
for word in text.split():
if word in stopwords:
contains_stopwords.append(word)
else:
normal_words.append(word)
First, we separate the text into a list of words using split, then we iterate and check if that word is in the list of stopwords (yeah, python allows you to do this). If it is, we just append it to the list of stopwords, if not, we append it to the other list.
Use the list comprehention and eliminate the duplicates by creating a dictionary with keys as list values and converting it again to a list:
itext = 'he is the best when people in our life'
stopwords = ['he', 'the', 'our']
split_words = itext.split(' ')
contains_stopwords = list(dict.fromkeys([word for word in split_words if word in stopwords]))
normal_words = list(dict.fromkeys([word for word in split_words if word not in stopwords]))
print("contains_stopwords:", contains_stopwords)
print("normal_words:", normal_words)
Some list comprehension could work and then use set() to remove duplicates from the list. I reconverted the set datastructure to a list as per your question, but you can leave it as a set:
text = 'he is the best when people in our life he he he'
stopwords = ['he', 'the', 'our']
list1 = {item for item in text.split(" ") if item in stopwords}
list2 = [item for item in text.split(" ") if item not in list1]
Output:
list1 - ['he', 'the', 'our']
list2 - ['is', 'best', 'when', 'people', 'in', 'life']
text = 'he is the best when people in our life'
# I will suggest make `stopwords` a set
# cuz the membership operator(ie. in) will take O(1)
stopwords = set(['he', 'the', 'our'])
contains_stopwords = []
normal_words = []
for word in text.split():
if word in stopwords: # here checking membership
contains_stopwords.append(word)
else:
normal_words.append(word)
print("contains_stopwords:", contains_stopwords)
print("normal_words:", normal_words)

Counting the number of unique words in a list

Using the following code from https://stackoverflow.com/a/11899925, I am able to find if a word is unique or not (by comparing if it was used once or greater than once):
helloString = ['hello', 'world', 'world']
count = {}
for word in helloString :
if word in count :
count[word] += 1
else:
count[word] = 1
But, if I were to have a string with hundreds of words, how would I be able to count the number of unique words within that string?
For example, my code has:
uniqueWordCount = 0
helloString = ['hello', 'world', 'world', 'how', 'are', 'you', 'doing', 'today']
count = {}
for word in words :
if word in count :
count[word] += 1
else:
count[word] = 1
How would I be able to set uniqueWordCount to 6? Usually, I am really good at solving these types of algorithmic puzzles, but I have been unsuccessful with figuring this one out. I feel as if it is right beneath my nose.
The best way to solve this is to use the set collection type. A set is a collection in which all elements are unique. Therefore:
unique = set([ 'one', 'two', 'two'])
len(unique) # is 2
You can use a set from the outset, adding words to it as you go:
unique.add('three')
This will throw out any duplicates as they are added. Or, you can collect all the elements in a list and pass the list to the set() function, which will remove the duplicates at that time. The example I provided above shows this pattern:
unique = set([ 'one', 'two', 'two'])
unique.add('three')
# unique now contains {'one', 'two', 'three'}
Read more about sets in Python.
You have many options for this, I recommend a set, but you can also use a counter, which counts the amount a number shows up, or you can look at the number of keys for the dictionary you made.
Set
You can also convert the list to a set, where all elements have to be unique. Not unique elements are discarded:
helloString = ['hello', 'world', 'world', 'how', 'are', 'you', 'doing', 'today']
helloSet = set(helloString) #=> ['doing', 'how', 'are', 'world', 'you', 'hello', 'today']
uniqueWordCount = len(set(helloString)) #=> 7
Here's a link to further reading on sets
Counter
You can also use a counter, which can also tell you how often a word was used, if you still need that information.
from collections import Counter
helloString = ['hello', 'world', 'world', 'how', 'are', 'you', 'doing', 'today']
counter = Counter(helloString)
len(counter) #=> 7
counter["world"] #=> 2
Loop
At the end for your loop, you can check the len of count, also, you mistyped helloString as words:
uniqueWordCount = 0
helloString = ['hello', 'world', 'world', 'how', 'are', 'you', 'doing', 'today']
count = {}
for word in helloString:
if word in count :
count[word] += 1
else:
count[word] = 1
len(count) #=> 7
You can use collections.Counter
helloString = ['hello', 'world', 'world']
from collections import Counter
c = Counter(helloString)
print("There are {} unique words".format(len(c)))
print('They are')
for k, v in c.items():
print(k)
I know the question doesn't specifically ask for this, but to maintain order
helloString = ['hello', 'world', 'world', 'how', 'are', 'you', 'doing', 'today']
from collections import Counter, OrderedDict
class OrderedCounter(Counter, OrderedDict):
pass
c = OrderedCounter(helloString)
print("There are {} unique words".format(len(c)))
print('They are')
for k, v in c.items():
print(k)
In your current code you can either increment uniqueWordCount in the else case where you already set count[word], or just lookup the number of keys in the dictionary: len(count).
If you only want to know the number of unique elements, then get the elements in the set: len(set(helloString))
I would do this using a set.
def stuff(helloString):
hello_set = set(helloString)
return len(hello_set)
Counter is the efficient way to do it.
This code is similar to counter,
text = ['hello', 'world']
# create empty dictionary
freq_dict = {}
# loop through text and count words
for word in text:
# set the default value to 0
freq_dict.setdefault(word, 0)
# increment the value by 1
freq_dict[word] += 1
for key,value in freq_dict.items():
if value == 1:
print(f'Word "{key}" has single appearance in the list')
Word "hello" has single appearance in the list
Word "world" has single appearance in the list
[Program finished]
I may be misreading the question but I believe the goal is to find all elements which only occur one time in the list.
from collections import Counter
helloString = ['hello', 'world', 'world', 'how', 'are', 'you', 'doing', 'today']
counter = Counter(helloString)
uniques = [value for value, count in counter.items() if count == 1]
This will give us 6 items because "world" occurs twice in our list:
>>> uniques
['you', 'are', 'doing', 'how', 'today', 'hello']

understanding a selector from within a dict

For a raw_input sentence I must print out each word and it's type from a dict:
wordDict = {
"directions": ['north', 'south', 'east', 'west', 'down', 'up', 'left', 'right'],
"verbs": ['go', 'stop', 'eat', 'kill'],
"stop_words": ['the', 'in', 'of', 'from', 'at', 'it'],
"nouns": ['door', 'bear', 'princess', 'cabinet'],
"numbers": range(10)
}
stuff = raw_input("Write sentence here > ")
words = stuff.split()
for wds in words:
print (wordDict[wrd]), wrd
So if someone typed in "north go the bear 5" I'd receive output along the lines:
directions: north, verbs: go, stop_words: the, nouns: bear, numbers: 5
This is for a tutorial in Learn Python The Hard Way (exercise 48).
For each word how would I print out it's type and value?
Instead of using your wordDict, as your keys are the words which are the values in your search dictionary, you would be in an advantage, if you transpose your dictionary aforehand.
This will make your lookup code less complex and readable.
Also, its important to note that, your words would be unique, as a single word cannot fall into multiple categories, so, you can easily use your words are keys and the category as values.
>>> wordDict = {
"directions": ['north', 'south', 'east', 'west', 'down', 'up', 'left', 'right'],
"verbs": ['go', 'stop', 'eat', 'kill'],
"stop_words": ['the', 'in', 'of', 'from', 'at', 'it'],
"nouns": ['door', 'bear', 'princess', 'cabinet'],
"numbers": range(10)
}
>>> wordDict_transpose = {str(elem): key for key, value in wordDict.items()
for elem in value}
>>> for word in words.split():
print "{}: {}".format(wordDict_transpose.get(str(word), 'Unknown'), word)
directions: north
verbs: go
stop_words: the
nouns: bear
numbers: 5
You can get the type of the words by iterating through your dictionary:
for word in words:
for key,values in wordDict.items():
if word in values:
print key,word
For the numbers to work well, you need to convert these to strings:
"numbers": [str(n) for n in range(10)]
Following Raphaƫl's suggestion, an other way to get the type:
def get_type(word):
for key,values in wordDict.items():
if word in values:
return key
for word in words:
print word, get_type(word)
In this case it returns one type even if the same word exists in multiple lists. It handles the situation when the word is missing from all lists. Prints None in that case.

Split strings in a list of lists

I currently have a list of lists:
[['Hi my name is'],['What are you doing today'],['Would love some help']]
And I would like to split the strings in the lists, while remaining in their current location. For example
[['Hi','my','name','is']...]..
How can I do this?
Also, if I would like to use a specific of the lists after searching for it, say I search for "Doing", and then want to append something to that specific list.. how would I go about doing that?
You can use a list comprehension to create new list of lists with all the sentences split:
[lst[0].split() for lst in list_of_lists]
Now you can loop through this and find the list matching a condition:
for sublist in list_of_lists:
if 'doing' in sublist:
sublist.append('something')
or searching case insensitively, use any() and a generator expression; this will the minimum number of words to find a match:
for sublist in list_of_lists:
if any(w.lower() == 'doing' for w in sublist):
sublist.append('something')
list1 = [['Hi my name is'],['What are you doing today'],['Would love some help']]
use
[i[0].split() for i in list1]
then you will get the output like
[['Hi', 'my', 'name', 'is'], ['What', 'are', 'you', 'doing', 'today'], ['Would', 'love', 'some', 'help']]
l = [['Hi my name is'],['What are you doing today'],['Would love some help']]
for x in l:
l[l.index(x)] = x[0].split(' ')
print l
Or simply:
l = [x[0].split(' ') for x in l]
Output
[['Hi', 'my', 'name', 'is'], ['What', 'are', 'you', 'doing', 'today'], ['Would', 'love', 'some', 'help']]

Is there a better way to get just 'important words' from a list in python?

I wrote some code to find the most popular words in submission titles on reddit, using the reddit praw api.
import nltk
import praw
picksub = raw_input('\nWhich subreddit do you want to analyze? r/')
many = input('\nHow many of the top words would you like to see? \n\t> ')
print 'Getting the top %d most common words from r/%s:' % (many,picksub)
r = praw.Reddit(user_agent='get the most common words from chosen subreddit')
submissions = r.get_subreddit(picksub).get_top_from_all(limit=200)
hey = []
for x in submissions:
hey.extend(str(x).split(' '))
fdist = nltk.FreqDist(hey) # creates a frequency distribution for words in 'hey'
top_words = fdist.keys()
common_words = ['its','am', 'ago','took', 'got', 'will', 'been', 'get', 'such','your','don\'t', 'if', 'why', 'do', 'does', 'or', 'any', 'but', 'they', 'all', 'now','than','into','can', 'i\'m','not','so','just', 'out','about','have','when', 'would' ,'where', 'what', 'who' 'I\'m','says' 'not', '', 'over', '_', '-','after', 'an','for', 'who', 'by', 'from', 'it', 'how', 'you', 'about' 'for', 'on', 'as', 'be', 'has', 'that', 'was', 'there', 'with','what', 'we', '::', 'to', 'the', 'of', ':', '...', 'a', 'at', 'is', 'my', 'in' , 'i', 'this', 'and', 'are', 'he', 'she', 'is', 'his', 'hers']
already = []
counter = 0
number = 1
print '-----------------------'
for word in top_words:
if word.lower() not in common_words and word.lower() not in already:
print str(number) + ". '" + word + "'"
counter +=1
number +=1
already.append(word.lower())
if counter == many:
break
print '-----------------------\n'
so inputting subreddit 'python' and getting 10 posts returns:
'Python'
'PyPy'
'code'
'use'
'136'
'181'
'd...'
'IPython'
'133'
10. '158'
How can I make this script not return numbers, and error words like 'd...'? The first 4 results are acceptable, but I would like to replace this rest with words that make sense. Making a list common_words is unreasonable, and doesn't filter these errors. I'm relatively new to writing code, and I appreciate the help.
I disagree. Making a list of common words is correct, there is no easier way to filter out the, for, I, am, etc.. However, it is unreasonable to use the common_words list to filter out results that aren't words, because then you'd have to include every possible non-word you don't want. Non-words should be filtered out differently.
Some suggestions:
1) common_words should be a set(), since your list is long this should speed things up. The in operation for sets in O(1), while for lists it is O(n).
2) Getting rid of all number strings is trivial. One way you could do it is:
all([w.isdigit() for w in word])
Where if this returns True, then the word is just a series of numbers.
3) Getting rid of the d... is a little more tricky. It depends on how you define a non-word. This:
tf = [ c.isalpha() for c in word ]
Returns a list of True/False values (where it is False if the char was not a letter). You can then count the values like:
t = tf.count(True)
f = tf.count(False)
You could then define a non-word as one that has more non-letter chars in it than letters, as one that has any non-letter characters at all, etc. For example:
def check_wordiness(word):
# This returns true only if a word is all letters
return all([ c.isalpha() for c in word ])
4) In the for word in top_words: block, are you sure that you have not mixed up counter and number? Also, counter and number are pretty much redundant, you could rewrite the last bit as:
for word in top_words:
# Since you are calling .lower() so much,
# you probably want to define it up here
w = word.lower()
if w not in common_words and w not in already:
# String formatting is preferred over +'s
print "%i. '%s'" % (number, word)
number +=1
# This could go under the if statement. You only want to add
# words that could be added again. Why add words that are being
# filtered out anyways?
already.append(w)
# this wasn't indented correctly before
if number == many:
break
Hope that helps.

Categories