How to extract specific words from a string? - python

I have to extract two things from a string: A list that contains stop-words, and another list that contains the rest of the string.
text = 'he is the best when people in our life'
stopwords = ['he', 'the', 'our']
contains_stopwords = []
normal_words = []
for i in text.split():
    for j in stopwords:
        if i in j:
            contains_stopwords.append(i)
        else:
            normal_words.append(i)
if text.split() in stopwords:
    contains_stopwords.append(text.split())
else:
    normal_words.append(text.split())
print("contains_stopwords:", contains_stopwords)
print("normal_words:", normal_words)
Output:
contains_stopwords: ['he', 'he', 'the', 'our']
normal_words: ['he', 'is', 'is', 'is', 'the', 'the', 'best', 'best', 'best', 'when', 'when', 'when', 'people', 'people', 'people', 'in', 'in', 'in', 'our', 'our', 'life', 'life', 'life', ['he', 'is', 'the', 'best', 'when', 'people', 'in', 'our', 'life']]
Desired result:
contains_stopwords: ['he', 'the', 'our']
normal_words: ['is', 'best', 'when', 'people', 'in', 'life']

One answer could be:
text = 'he is the best when people in our life'
stopwords = ['he', 'the', 'our']
contains_stopwords = set()  # the set data structure guarantees there won't be any duplicates
normal_words = []
for word in text.split():
    if word in stopwords:
        contains_stopwords.add(word)
    else:
        normal_words.append(word)
print("contains_stopwords:", contains_stopwords)
print("normal_words:", normal_words)

You seem to have chosen the most difficult path. The code below should do the trick.
for word in text.split():
    if word in stopwords:
        contains_stopwords.append(word)
    else:
        normal_words.append(word)
First, we separate the text into a list of words using split, then we iterate and check whether each word is in the list of stopwords (yes, Python lets you do this). If it is, we append it to the stop-word list; if not, we append it to the other list.
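Put together with the setup from the question, a runnable version of this answer looks like the sketch below (note it keeps duplicates; dedupe with a set if you need to):

```python
text = 'he is the best when people in our life'
stopwords = ['he', 'the', 'our']
contains_stopwords = []
normal_words = []

# one pass over the words: exact membership test, no nested loop
for word in text.split():
    if word in stopwords:
        contains_stopwords.append(word)
    else:
        normal_words.append(word)

print(contains_stopwords)  # → ['he', 'the', 'our']
print(normal_words)        # → ['is', 'best', 'when', 'people', 'in', 'life']
```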

Use a list comprehension, then eliminate the duplicates by creating a dictionary whose keys are the list values and converting it back to a list:
itext = 'he is the best when people in our life'
stopwords = ['he', 'the', 'our']
split_words = itext.split(' ')
contains_stopwords = list(dict.fromkeys([word for word in split_words if word in stopwords]))
normal_words = list(dict.fromkeys([word for word in split_words if word not in stopwords]))
print("contains_stopwords:", contains_stopwords)
print("normal_words:", normal_words)

A set comprehension works here and removes duplicates as it goes. I converted the set back to a list as per your question, but you can leave it as a set:
text = 'he is the best when people in our life he he he'
stopwords = ['he', 'the', 'our']
list1 = list({item for item in text.split(" ") if item in stopwords})
list2 = [item for item in text.split(" ") if item not in list1]
Output (set order may vary):
list1 - ['he', 'the', 'our']
list2 - ['is', 'best', 'when', 'people', 'in', 'life']

text = 'he is the best when people in our life'
# I suggest making `stopwords` a set,
# because the membership operator (i.e. `in`) then takes O(1) on average
stopwords = set(['he', 'the', 'our'])
contains_stopwords = []
normal_words = []
for word in text.split():
    if word in stopwords:  # membership check
        contains_stopwords.append(word)
    else:
        normal_words.append(word)
print("contains_stopwords:", contains_stopwords)
print("normal_words:", normal_words)
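To illustrate why set membership pays off once the stopword list grows, here is a rough micro-benchmark sketch (the synthetic words and the repetition count are arbitrary; absolute timings will vary by machine, but the ratio should not):

```python
import timeit

words = [f"word{i}" for i in range(1000)]
stop_list = words[::2]        # 500 stopwords as a list
stop_set = set(stop_list)     # the same stopwords as a set

# A list membership test scans elements one by one (O(n));
# a set hashes the key and looks it up directly (O(1) average).
list_time = timeit.timeit(lambda: "word998" in stop_list, number=10_000)
set_time = timeit.timeit(lambda: "word998" in stop_set, number=10_000)
print(f"list: {list_time:.4f}s  set: {set_time:.4f}s")
```

"word998" sits near the end of the list, so this is close to the worst case for the list scan.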

Related

python: how to sort a string alphabetically and by len

shakespeare = 'All the world is a stage, and all the men and women merely players. They have their exits and their entrances, And one man in his time plays many parts.'
Create a function that returns a string with all the words of the sentence shakespeare ordered alphabetically. Eliminate punctuation marks.
(Tip: the first three words should be 'a all all'; this time duplicates are allowed, and remember that some words are in uppercase.)
def sort_string(shakespeare):
    return string_sorted
Here's a one-liner:
import re
shakespeare = "All the world is a stage, and all the men and women merely players. They have their exits and their entrances, And one man in his time plays many parts."
print (sorted(re.sub(r"[^\w\s]","",shakespeare.lower()).split(), key=lambda x: (x,-len(x))))
Output:
['a', 'all', 'all', 'and', 'and', 'and', 'and', 'entrances', 'exits', 'have', 'his', 'in', 'is', 'man', 'many', 'men', 'merely', 'one', 'parts', 'players', 'plays', 'stage', 'the', 'the', 'their', 'their', 'they', 'time', 'women', 'world']
The corresponding function:
def sort_string(shakespeare):
    return sorted(re.sub(r"[^\w\s]","",shakespeare.lower()).split(), key=lambda x: (x,-len(x)))
In case you want a string to be returned:
def sort_string(shakespeare):
    return " ".join(sorted(re.sub(r"[^\w\s]","",shakespeare.lower()).split(), key=lambda x: (x,-len(x))))
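A quick sanity check of the list-returning function against the tip in the question (the first three words should be 'a', 'all', 'all'):

```python
import re

def sort_string(shakespeare):
    # strip punctuation, lowercase, split, then sort alphabetically
    return sorted(re.sub(r"[^\w\s]", "", shakespeare.lower()).split(),
                  key=lambda x: (x, -len(x)))

shakespeare = ("All the world is a stage, and all the men and women merely players. "
               "They have their exits and their entrances, And one man in his time "
               "plays many parts.")
print(sort_string(shakespeare)[:3])  # → ['a', 'all', 'all']
```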

Replace text in a list with formatted text in another list

I am attempting to replace text in a list with text from another list. Below, lst_a has the string length I need for another script, but none of the formatting from lst_b. I want to give lst_a the correct spelling, capitalization, and punctuation from lst_b.
For example:
lst_a = ['it is an', 'example of', 'an english simple sentence']
lst_b = ['It\'s', 'an', 'example', 'of', 'an', 'English', 'simple', 'sentence.']
I'm not 100% sure the best way to approach this problem.
I have tried breaking lst_a into a smaller sub_lst_a and taking the difference from each list, but I'm not sure what to do when entire items exist in one list and not the other (e.g. 'it' and 'is' rather than 'it's').
Regardless, any help/direction would be greatly appreciated!
Solution attempt below:
I thought it might be worth breaking lst_a into a list of just words. Then I enumerated each item so I could more easily identify its counterpart in lst_b. From there I wanted to take the difference of the two lists and replace the values in lst_a_diff with lst_b_diff. I had to sort the lists because my diff script wasn't ordering the outputs consistently.
lst_a = ['it is an', 'example of', 'an english simple sentence']
lst_b = ['It\'s', 'an', 'example', 'of', 'an', 'English', 'simple', 'sentence.']
# splitting lst_a into a smaller sub_lst_a
def convert(lst_a):
    return [i for item in lst_a for i in item.split()]
sub_lst_a = convert(lst_a)
# getting the position values of sub_lst_a and lst_b
lst_a_pos = [f"{i}, {v}" for i, v in enumerate(sub_lst_a)]
lst_b_pos = [f"{i}, {v}" for i, v in enumerate(lst_b)]
# finding the difference between the two lists
def Diff(lst_a_pos, lst_b_pos):
    return list(set(lst_a_pos) - set(lst_b_pos))
lst_a_diff = Diff(lst_a_pos, lst_b_pos)
lst_b_diff = Diff(lst_b_pos, lst_a_pos)
# sorting lst_a_diff and lst_b_diff by the original position of each item
lst_a_diff_sorted = sorted(lst_a_diff, key = lambda x: int(x.split(', ')[0]))
lst_b_diff_sorted = sorted(lst_b_diff, key = lambda x: int(x.split(', ')[0]))
print(lst_a_diff_sorted)
print(lst_b_diff_sorted)
Desired Results:
final_lst_a = ['It\'s an', 'example of', 'an English simple sentence.']
Solution walkthrough
Assuming as you say that the two lists are essentially always in order, to properly align the indexes in both, words with apostrophe should really count for two.
One way to do that is for example to expand the words by adding an empty element:
# Fill in blanks for words that have apostrophe: they should count as 2
lst_c = []
for item in lst_b:
    lst_c.append(item)
    if item.find("'") != -1:
        lst_c.append('')
print(lst_c)
>> ["It's", '', 'an', 'example', 'of', 'an', 'English', 'simple', 'sentence.']
Now it is a matter of expanding lst_a on a word-by-word basis, and then group them back as in the original lists. Essentially, we align the lists like this:
['it', 'is', 'an', 'example', 'of', 'an', 'english', 'simple', 'sentence']
["It's", '', 'an', 'example', 'of', 'an', 'English', 'simple', 'sentence.']
then we create new_item slices like these:
["It's", "", "an"]
["example of"]
["an English simple sentence"]
The code looks like this:
# Makes a map of list index and length to extract
final = []
ptr = 0
for item in lst_a:
    # take each item in lst_a and count how many words it has
    count = len(item.split())
    # then use ptr and count to map the correct slice of lst_c
    new_item = lst_c[ptr:ptr+count]
    # get rid of empty strings now
    new_item = filter(len, new_item)
    # print('new[{}:{}]={}'.format(ptr, count, new_item))
    # join the words with single spaces and append to the final list
    final.append(' '.join(new_item))
    # advance the ptr
    ptr += count
>> ["It's an", 'example of', 'an English simple sentence.']
Complete code solution
This seems to handle other cases well enough. The complete code would be something like:
lst_a = ['it is an', 'example of', 'an english simple sentence']
lst_b = ['It\'s', 'an', 'example', 'of', 'an', 'English', 'simple', 'sentence.']
# This is another example that seems to work
# lst_a = ['tomorrow I will', 'go to the movies']
# lst_b = ['Tomorrow', 'I\'ll', 'go', 'to', 'the', 'movies.']
# Fill in blanks for words that have apostrophe: they should count as 2
lst_c = []
for item in lst_b:
    lst_c.append(item)
    if item.find("'") != -1:
        lst_c.append('')
print(lst_c)
# Makes a map of list index and length to extract
final = []
ptr = 0
for item in lst_a:
    count = len(item.split())
    # print(ptr, count, item)
    new_item = lst_c[ptr:ptr+count]
    # get rid of empty strings now
    new_item = filter(len, new_item)
    # print('new[{}:{}]={}'.format(ptr, count, new_item))
    ptr += count
    final.append(' '.join(new_item))
print(final)
You can try the following code:
lst_a = ['it is an', 'example of', 'an english simple sentence']
lst_b = ['It\'s', 'an', 'example', 'of', 'an', 'English', 'simple', 'sentence.']
lst_a_split = []
end_indices_in_lst_a_split = []
# Construct "lst_a_split" and "end_indices_in_lst_a_split".
# "lst_a_split" is supposed to be ['it', 'is', 'an', 'example', 'of', 'an', 'english', 'simple', 'sentence'].
# "end_indices_in_lst_a_split" is supposed to be [3, 5, 9].
end = 0
for s in lst_a:
    s_split = s.split()
    end += len(s_split)
    end_indices_in_lst_a_split.append(end)
    for word in s_split:
        lst_a_split.append(word)
# Construct "d" which contains
# index of every word in "lst_b" which does not include '\'' as value
# and the corresponding index of the word in "lst_a_split" as key.
# "d" is supposed to be {2: 1, 3: 2, 4: 3, 5: 4, 6: 5, 7: 6, 8: 7}.
d = {}
start = 0
for index_in_lst_b, word in enumerate(lst_b):
    if '\'' in word:
        continue
    word = word.lower().strip('.').strip(',').strip('"')  # you can add other strip()'s as you want
    index_in_lst_a_split = lst_a_split.index(word, start)
    start = index_in_lst_a_split + 1
    d[index_in_lst_a_split] = index_in_lst_b
# Construct "final_lst_a".
final_lst_a = []
start_index_in_lst_b = 0
for i, end in enumerate(end_indices_in_lst_a_split):
    if end - 1 in d:
        end_index_in_lst_b = d[end - 1] + 1
        final_lst_a.append(' '.join(lst_b[start_index_in_lst_b:end_index_in_lst_b]))
        start_index_in_lst_b = end_index_in_lst_b
    elif end in d:
        end_index_in_lst_b = d[end]
        final_lst_a.append(' '.join(lst_b[start_index_in_lst_b:end_index_in_lst_b]))
        start_index_in_lst_b = end_index_in_lst_b
    else:
        # It prints the following message if it fails to construct "final_lst_a".
        # That would happen if the words in "lst_b" on both sides of a boundary
        # contain '\'', which seems unlikely.
        print(f'Failed to find corresponding words in "lst_b" for the string "{lst_a[i]}".')
        break
print(final_lst_a)
which prints
["It's an", 'example of', 'an English simple sentence.']
lst_a = ['it is an', 'example of', 'an english simple sentence']
lst_b = ['It\'s', 'an', 'example', 'of', 'an', 'English', 'simple', 'sentence.']
for word in lst_b:
    # If a word is capitalized, look for it in lst_a and capitalize it there
    if word[0].upper() == word[0]:
        for idx, phrase in enumerate(lst_a):
            if word.lower() in phrase:
                lst_a[idx] = phrase.replace(word.lower(), word)
    if "'" in word:
        # If a word has an apostrophe, look for its expansion in lst_a and change it.
        # Note you can include other patterns here like " are",
        # or restrict it to "it is", etc.
        for idx, phrase in enumerate(lst_a):
            if " is" in phrase:
                lst_a[idx] = phrase.replace(" is", "'s")
                break
print(lst_a)
I know you already have a few responses to review. Here's something that should help you expand the implementation.
In addition to lst_a and lst_b, what if you could supply all the lookup items like "It's", "I'll", and "don't", along with what each one should expand to? Then the code below would take care of that lookup as well.
#original lst_a. This list does not have the punctuation marks
lst_a = ['it is an', 'example of', 'an english simple sentence', 'if time permits', 'I will learn','this weekend', 'but do not', 'count on me']
#desired output with correct spelling, capitalization, and punctuation
#but includes \' that need to be replaced
lst_b = ['It\'s', 'an', 'example', 'of', 'an', 'english', 'simple', 'sentence.', 'If', 'time', 'permits,','I\'ll', 'learn','this','weekend', 'but', 'don\'t','count', 'on', 'me']
#lookup list to replace the contractions
ch = {'It\'s':['It','is'],'I\'ll':['I','will'], 'don\'t':['do','not']}
#final list will be stored into lst_c
lst_c = []
#enumerate through lst_b to replace all words that are contractions
for i, v in enumerate(lst_b):
    # for this example, I am assuming all contractions expand to two words
    for j, k in ch.items():
        if v == j:  # here you are checking for a contraction
            lst_b[i] = k[0]  # for each contraction, replace it with its first part
            lst_b.insert(i+1, k[1])  # and insert the second part after it
# now stitch the words together based on the length of each item in lst_a
c = 0
for i in lst_a:
    j = i.count(' ')  # find out the number of words to stitch together
    # stitch together exactly as many words as this lst_a item contains
    lst_c.append(' '.join([lst_b[k] for k in range(c, c+j+1)]))
    c += j+1
#finally, I am printing lst_a, lst_b, and lst_c. The final result is in lst_c
print (lst_a, lst_b, lst_c, sep = '\n')
Output for this is as shown below:
lst_a = ['it is an', 'example of', 'an english simple sentence', 'if time permits', 'I will learn', 'this weekend', 'but do not', 'count on me']
lst_b = ['It', 'is', 'an', 'example', 'of', 'an', 'english', 'simple', 'sentence.', 'If', 'time', 'permits,', 'I', 'will', 'learn', 'this', 'weekend', 'but', 'do', 'not', 'count', 'on', 'me']
lst_c = ['It is an', 'example of', 'an english simple sentence.', 'If time permits,', 'I will learn', 'this weekend', 'but do not', 'count on me']

Get list of words from text file

There is a line in my code that is wrong, and I have no idea why; I've tried a gazillion different ways but they don't work. I want it to print out:
['Arise', 'But', 'It', 'Juliet', 'Who', 'already', 'and', 'breaks', 'east', 'envious', 'fair', 'grief', 'is', 'kill', 'light', 'moon', 'pale', 'sick', 'soft', 'sun', 'the', 'through', 'what', 'window', 'with', 'yonder']
romeo.txt is the text document name
this is what's inside:
"But soft what light through yonder window breaks It is the east and
Juliet is the sun Arise fair sun and kill the envious moon Who is
already sick and pale with grief "
Also the output is in alphabetic order.
fname = "romeo.txt"  # raw_input("Enter file name: ")
fh = open(fname)
lst = list()
for line in fh:
    lst.append(line)
    words = lst.split(line)
    # line = line.sort()
print lst
fname = "romeo.txt"
fh = open(fname)
lst = []
for line in fh:
    words = line.split()  # split the line into words first
    lst.extend(words)     # add all the words to the current list
lst = sorted(set(lst))  # removes duplicates and sorts lexicographically
print lst
Comments in code. Basically, split up your line and accumulate it in your list. Sorting should be done at the end, once.
A (slightly) more pythonic solution:
import re
lst = sorted(re.split(r'\s+', open("romeo.txt").read().strip()))
Regex will split your text into a list of words based on the regexp (delimiters as whitespaces). Everything else is basically multiple lines condensed into 1.
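For completeness, here is a Python 3 sketch of the same idea without regex; str.split() with no arguments already handles runs of whitespace and newlines. The snippet writes romeo.txt first so it is self-contained (in the original question the file already exists):

```python
text = ("But soft what light through yonder window breaks It is the east and "
        "Juliet is the sun Arise fair sun and kill the envious moon Who is "
        "already sick and pale with grief")

with open("romeo.txt", "w") as fh:
    fh.write(text)

with open("romeo.txt") as fh:
    # split() collapses any whitespace; set() removes duplicates;
    # sorted() puts capitalized words first (ASCII order)
    words = sorted(set(fh.read().split()))

print(words)  # matches the desired output in the question
```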

Is there a better way to get just 'important words' from a list in python?

I wrote some code to find the most popular words in submission titles on reddit, using the reddit praw api.
import nltk
import praw
picksub = raw_input('\nWhich subreddit do you want to analyze? r/')
many = input('\nHow many of the top words would you like to see? \n\t> ')
print 'Getting the top %d most common words from r/%s:' % (many,picksub)
r = praw.Reddit(user_agent='get the most common words from chosen subreddit')
submissions = r.get_subreddit(picksub).get_top_from_all(limit=200)
hey = []
for x in submissions:
    hey.extend(str(x).split(' '))
fdist = nltk.FreqDist(hey) # creates a frequency distribution for words in 'hey'
top_words = fdist.keys()
common_words = ['its','am', 'ago','took', 'got', 'will', 'been', 'get', 'such','your','don\'t', 'if', 'why', 'do', 'does', 'or', 'any', 'but', 'they', 'all', 'now','than','into','can', 'i\'m','not','so','just', 'out','about','have','when', 'would' ,'where', 'what', 'who' 'I\'m','says' 'not', '', 'over', '_', '-','after', 'an','for', 'who', 'by', 'from', 'it', 'how', 'you', 'about' 'for', 'on', 'as', 'be', 'has', 'that', 'was', 'there', 'with','what', 'we', '::', 'to', 'the', 'of', ':', '...', 'a', 'at', 'is', 'my', 'in' , 'i', 'this', 'and', 'are', 'he', 'she', 'is', 'his', 'hers']
already = []
counter = 0
number = 1
print '-----------------------'
for word in top_words:
    if word.lower() not in common_words and word.lower() not in already:
        print str(number) + ". '" + word + "'"
        counter += 1
        number += 1
        already.append(word.lower())
    if counter == many:
        break
print '-----------------------\n'
so inputting subreddit 'python' and getting 10 posts returns:
1. 'Python'
2. 'PyPy'
3. 'code'
4. 'use'
5. '136'
6. '181'
7. 'd...'
8. 'IPython'
9. '133'
10. '158'
How can I make this script not return numbers, and error words like 'd...'? The first 4 results are acceptable, but I would like to replace this rest with words that make sense. Making a list common_words is unreasonable, and doesn't filter these errors. I'm relatively new to writing code, and I appreciate the help.
I disagree: making a list of common words is correct; there is no easier way to filter out "the", "for", "I", "am", and so on. However, it is unreasonable to use the common_words list to filter out results that aren't words, because then you'd have to include every possible non-word you don't want. Non-words should be filtered out differently.
Some suggestions:
1) common_words should be a set(); since your list is long, this should speed things up. The in operation for sets is O(1), while for lists it is O(n).
2) Getting rid of all number strings is trivial. One way you could do it is:
all([c.isdigit() for c in word])
If this returns True, then the word is just a series of digits.
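Since str.isdigit() already checks every character of the string, the comprehension inside all() can be collapsed into a single method call. A small sketch using a few of the tokens from the question's output:

```python
# str.isdigit() is True only when the string is non-empty
# and every character is a digit
tokens = ["136", "181", "code", "d..."]
numeric = [t for t in tokens if t.isdigit()]
print(numeric)  # → ['136', '181']
```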
3) Getting rid of the d... is a little more tricky. It depends on how you define a non-word. This:
tf = [ c.isalpha() for c in word ]
Returns a list of True/False values (where it is False if the char was not a letter). You can then count the values like:
t = tf.count(True)
f = tf.count(False)
You could then define a non-word as one that has more non-letter chars in it than letters, as one that has any non-letter characters at all, etc. For example:
def check_wordiness(word):
    # This returns True only if a word is all letters
    return all([c.isalpha() for c in word])
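Applied to the sample results from the question, check_wordiness keeps only the actual words and drops the numbers and the 'd...' fragment:

```python
def check_wordiness(word):
    # True only if every character is a letter
    return all(c.isalpha() for c in word)

results = ['Python', 'PyPy', 'code', 'use', '136', '181',
           'd...', 'IPython', '133', '158']
filtered = [w for w in results if check_wordiness(w)]
print(filtered)  # → ['Python', 'PyPy', 'code', 'use', 'IPython']
```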
4) In the for word in top_words: block, are you sure that you have not mixed up counter and number? Also, counter and number are pretty much redundant, you could rewrite the last bit as:
for word in top_words:
    # Since you are calling .lower() so much,
    # you probably want to do it once up here
    w = word.lower()
    if w not in common_words and w not in already:
        # String formatting is preferred over +'s
        print "%i. '%s'" % (number, word)
        number += 1
        # This goes under the if statement: only track
        # words that passed the filter. Why add words that are
        # being filtered out anyway?
        already.append(w)
        # this wasn't indented correctly before
        if number == many:
            break
Hope that helps.

Python: Split list based on first character of word

I'm kind of stuck on an issue and I've gone round and round with it until I've confused myself.
What I am trying to do is take a list of words:
['About', 'Absolutely', 'After', 'Aint', 'Alabama', 'AlabamaBill', 'All', 'Also', 'Amos', 'And', 'Anyhow', 'Are', 'As', 'At', 'Aunt', 'Aw', 'Bedlam', 'Behind', 'Besides', 'Biblical', 'Bill', 'Billgone']
Then sort them under alphabetical headings:
A
About
Absolutely
After
B
Bedlam
Behind
etc...
Is there an easy way to do this?
Use itertools.groupby() to group your input by a specific key, such as the first letter:
from itertools import groupby
from operator import itemgetter
for letter, words in groupby(sorted(somelist), key=itemgetter(0)):
    print letter
    for word in words:
        print word
    print
If your list is already sorted, you can omit the sorted() call. The itemgetter(0) callable will return the first letter of each word (the character at index 0), and groupby() will then yield that key plus an iterable that consists only of those items for which the key remains the same. In this case that means looping over words gives you all items that start with the same character.
Demo:
>>> somelist = ['About', 'Absolutely', 'After', 'Aint', 'Alabama', 'AlabamaBill', 'All', 'Also', 'Amos', 'And', 'Anyhow', 'Are', 'As', 'At', 'Aunt', 'Aw', 'Bedlam', 'Behind', 'Besides', 'Biblical', 'Bill', 'Billgone']
>>> from itertools import groupby
>>> from operator import itemgetter
>>>
>>> for letter, words in groupby(sorted(somelist), key=itemgetter(0)):
... print letter
... for word in words:
... print word
... print
...
A
About
Absolutely
After
Aint
Alabama
AlabamaBill
All
Also
Amos
And
Anyhow
Are
As
At
Aunt
Aw
B
Bedlam
Behind
Besides
Biblical
Bill
Billgone
Without using any library imports or anything fancy, here is the logic:
def splitLst(x):
    dictionary = dict()
    for word in x:
        f = word[0]
        if f in dictionary:
            dictionary[f].append(word)
        else:
            dictionary[f] = [word]
    return dictionary
splitLst(['About', 'Absolutely', 'After', 'Aint', 'Alabama', 'AlabamaBill', 'All', 'Also', 'Amos', 'And', 'Anyhow', 'Are', 'As', 'At', 'Aunt', 'Aw', 'Bedlam', 'Behind', 'Besides', 'Biblical', 'Bill', 'Billgone'])
def split(n):
    n2 = []
    for i in n:
        if i[0] not in n2:
            n2.append(i[0])
    n2.sort()
    for j in n:
        z = j[0]
        z1 = n2.index(z)
        n2.insert(z1+1, j)
    return n2
word_list = ['be','have','do','say','get','make','go','know','take','see','come','think',
'look','want','give','use','find','tell','ask','work','seem','feel','leave','call']
print(split(word_list))
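For what it's worth, the same grouping can be written a bit more compactly with collections.defaultdict; this is a sketch of the same idea as splitLst above (the function name and the shortened word list are mine, chosen for illustration):

```python
from collections import defaultdict

def split_by_first_letter(words):
    # group sorted words under their first letter
    groups = defaultdict(list)
    for word in sorted(words):
        groups[word[0]].append(word)
    return dict(groups)

word_list = ['About', 'Absolutely', 'Bedlam', 'Behind', 'Besides']
print(split_by_first_letter(word_list))
# → {'A': ['About', 'Absolutely'], 'B': ['Bedlam', 'Behind', 'Besides']}
```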
