I have to extract two things from a string: A list that contains stop-words, and another list that contains the rest of the string.
text = 'he is the best when people in our life'
stopwords = ['he', 'the', 'our']
contains_stopwords = []
normal_words = []
for i in text.split():
    for j in stopwords:
        if i in j:
            contains_stopwords.append(i)
        else:
            normal_words.append(i)
if text.split() in stopwords:
    contains_stopwords.append(text.split())
else:
    normal_words.append(text.split())
print("contains_stopwords:", contains_stopwords)
print("normal_words:", normal_words)
Output:
contains_stopwords: ['he', 'he', 'the', 'our']
normal_words: ['he', 'is', 'is', 'is', 'the', 'the', 'best', 'best', 'best', 'when', 'when', 'when', 'people', 'people', 'people', 'in', 'in', 'in', 'our', 'our', 'life', 'life', 'life', ['he', 'is', 'the', 'best', 'when', 'people', 'in', 'our', 'life']]
Desired result:
contains_stopwords: ['he', 'the', 'our']
normal_words: ['is', 'best', 'when', 'people', 'in', 'life']
One answer could be:
text = 'he is the best when people in our life'
stopwords = ['he', 'the', 'our']
contains_stopwords = set()  # The set data structure guarantees there won't be any duplicates
normal_words = []
for word in text.split():
    if word in stopwords:
        contains_stopwords.add(word)
    else:
        normal_words.append(word)
print("contains_stopwords:", contains_stopwords)
print("normal_words:", normal_words)
You seem to have chosen the most difficult path. The code below should do the trick.
for word in text.split():
    if word in stopwords:
        contains_stopwords.append(word)
    else:
        normal_words.append(word)
First, we separate the text into a list of words using split. Then we iterate over the words and check whether each one is in the list of stopwords (yes, Python allows you to do this with the in operator). If it is, we append it to the list of stopwords; if not, we append it to the other list.
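The split-then-check steps described above can be wrapped in a small function. This is only a sketch: the name partition_words and the conversion of the stop-word list to a set are assumptions made for illustration, not part of the original answer.

```python
def partition_words(text, stopwords):
    """Return (stop-words found, remaining words), both in original order."""
    stopword_set = set(stopwords)  # assumption: a set is used for fast lookups
    found, rest = [], []
    for word in text.split():
        # `in` compares whole words here, so 'he' does not match inside 'the'
        if word in stopword_set:
            found.append(word)
        else:
            rest.append(word)
    return found, rest


found, rest = partition_words('he is the best when people in our life',
                              ['he', 'the', 'our'])
print("contains_stopwords:", found)
print("normal_words:", rest)
```

Unlike the nested loop in the question, each word is tested once against the whole collection, so no word is appended more than once per occurrence.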
Use a list comprehension and eliminate the duplicates by creating a dictionary whose keys are the list values, then converting it back to a list:
itext = 'he is the best when people in our life'
stopwords = ['he', 'the', 'our']
split_words = itext.split(' ')
contains_stopwords = list(dict.fromkeys([word for word in split_words if word in stopwords]))
normal_words = list(dict.fromkeys([word for word in split_words if word not in stopwords]))
print("contains_stopwords:", contains_stopwords)
print("normal_words:", normal_words)
A comprehension could work, using a set to remove duplicates. Here list1 is built as a set; wrap it in list() if you need a list as in your question, or leave it as a set:
text = 'he is the best when people in our life he he he'
stopwords = ['he', 'the', 'our']
list1 = {item for item in text.split(" ") if item in stopwords}
list2 = [item for item in text.split(" ") if item not in list1]
Output:
list1 - ['he', 'the', 'our']
list2 - ['is', 'best', 'when', 'people', 'in', 'life']
text = 'he is the best when people in our life'
# I suggest making `stopwords` a set,
# because the membership operator (i.e. `in`) then takes O(1) on average
stopwords = set(['he', 'the', 'our'])
contains_stopwords = []
normal_words = []
for word in text.split():
    if word in stopwords:  # here checking membership
        contains_stopwords.append(word)
    else:
        normal_words.append(word)
print("contains_stopwords:", contains_stopwords)
print("normal_words:", normal_words)
shakespeare = 'All the world is a stage, and all the men and women merely players. They have their exits and their entrances, And one man in his time plays many parts.'
Create a function that returns a string with all the words of the sentence shakespeare ordered alphabetically. Eliminate punctuation marks.
(Tip: the first three words should be 'a all all'. This time duplicates are allowed, and remember that some words are capitalized.)
def sort_string(shakespeare):
    return string_sorted
Here is a one-liner:
import re
shakespeare = "All the world is a stage, and all the men and women merely players. They have their exits and their entrances, And one man in his time plays many parts."
print (sorted(re.sub(r"[^\w\s]","",shakespeare.lower()).split(), key=lambda x: (x,-len(x))))
Output:
['a', 'all', 'all', 'and', 'and', 'and', 'and', 'entrances', 'exits', 'have', 'his', 'in', 'is', 'man', 'many', 'men', 'merely', 'one', 'parts', 'players', 'plays', 'stage', 'the', 'the', 'their', 'their', 'they', 'time', 'women', 'world']
The corresponding function:
def sort_string(shakespeare):
    return sorted(re.sub(r"[^\w\s]","",shakespeare.lower()).split(), key=lambda x: (x,-len(x)))
In case you want a string to be returned:
def sort_string(shakespeare):
    return " ".join(sorted(re.sub(r"[^\w\s]","",shakespeare.lower()).split(), key=lambda x: (x,-len(x))))
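For comparison, the same task can be sketched without a regular expression, using str.translate with string.punctuation from the standard library. The function name sort_string_translate is made up for this example, and it assumes that only ASCII punctuation needs to be removed.

```python
import string


def sort_string_translate(sentence):
    # Delete every ASCII punctuation character in a single pass.
    cleaned = sentence.translate(str.maketrans('', '', string.punctuation))
    # Lower-case before sorting so that 'All' and 'all' sort together.
    return ' '.join(sorted(cleaned.lower().split()))


shakespeare = ("All the world is a stage, and all the men and women merely "
               "players. They have their exits and their entrances, And one "
               "man in his time plays many parts.")
result = sort_string_translate(shakespeare)
print(result)
```

The plain sorted() call is enough here because all words are lower-cased first; no custom key is needed.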
I am attempting to replace text in a list with text from another list. Below, lst_a has the string length I need for another script, but none of the formatting from lst_b. I want to give lst_a the correct spelling, capitalization, and punctuation from lst_b.
For example:
lst_a = ['it is an', 'example of', 'an english simple sentence']
lst_b = ['It\'s', 'an', 'example', 'of', 'an', 'English', 'simple', 'sentence.']
I'm not 100% sure of the best way to approach this problem.
I have tried breaking lst_a into a smaller sub_lst_a and taking the difference from each list, but I'm not sure what to do when entire items exist in one list and not the other (e.g. 'it' and 'is' rather than 'it's').
Regardless, any help/direction would be greatly appreciated!
Solution attempt below:
I thought it might be worth trying to break lst_a into a list of single words. Then I thought to enumerate each item, so that I could more easily identify its counterpart in lst_b. From there I wanted to take the difference of the two lists, and replace the values in lst_a_diff with lst_b_diff. I had to sort the lists because my diff script wasn't consistently ordering the outputs.
lst_a = ['it is an', 'example of', 'an english simple sentence']
lst_b = ['It\'s', 'an', 'example', 'of', 'an', 'English', 'simple', 'sentence.']
# splitting lst_a into a smaller sub_lst_a
def convert(lst_a):
    return ([i for item in lst_a for i in item.split()])
sub_lst_a = convert(lst_a)
# getting the position values of sub_lst_a and lst_b
lst_a_pos = [f"{i}, {v}" for i, v in enumerate(sub_lst_a)]
lst_b_pos = [f"{i}, {v}" for i, v in enumerate(lst_b)]
# finding the difference between the two lists
def Diff(lst_a_pos, lst_b_pos):
    return list(set(lst_a_pos) - set(lst_b_pos))
lst_a_diff = Diff(lst_a_pos, lst_b_pos)
lst_b_diff = Diff(lst_b_pos, lst_a_pos)
# sorting lst_a_diff and lst_b_diff by the original position of each item
lst_a_diff_sorted = sorted(lst_a_diff, key = lambda x: int(x.split(', ')[0]))
lst_b_diff_sorted = sorted(lst_b_diff, key = lambda x: int(x.split(', ')[0]))
print(lst_a_diff_sorted)
print(lst_b_diff_sorted)
Desired Results:
final_lst_a = ['It\'s an', 'example of', 'an English simple sentence.']
Solution walkthrough
Assuming, as you say, that the two lists are essentially always in order, words with an apostrophe should really count for two in order to align the indexes in both lists.
One way to do that is for example to expand the words by adding an empty element:
# Fill in blanks for words that have apostrophe: they should count as 2
lst_c = []
for item in lst_b:
    lst_c.append(item)
    if item.find("'") != -1:
        lst_c.append('')
print(lst_c)
>> ["It's", '', 'an', 'example', 'of', 'an', 'English', 'simple', 'sentence.']
Now it is a matter of expanding lst_a word by word, and then grouping the words back as in the original list. Essentially, we align the lists like this:
['it', 'is', 'an', 'example', 'of', 'an', 'english', 'simple', 'sentence']
["It's", '', 'an', 'example', 'of', 'an', 'English', 'simple', 'sentence.']
then we take one new_item slice of the second list per item of lst_a:
["It's", '', 'an']
['example', 'of']
['an', 'English', 'simple', 'sentence.']
The code looks like this:
# Makes a map of list index and length to extract
final = []
ptr = 0
for item in lst_a:
    # take each item in lst_a and count how many words it has
    count = len(item.split())
    # then use ptr and count to correctly map a slice off lst_c
    new_item = lst_c[ptr:ptr+count]
    # get rid of empty strings now
    new_item = filter(len, new_item)
    # print('new[{}:{}]={}'.format(ptr,count,new_item))
    # join the words by single space and append to final list
    final.append(' '.join(new_item))
    # advance the ptr
    ptr += count
>> ["It's an", 'example of', 'an English simple sentence.']
Complete code solution
This seems to handle other cases well enough. The complete code would be something like:
lst_a = ['it is an', 'example of', 'an english simple sentence']
lst_b = ['It\'s', 'an', 'example', 'of', 'an', 'English', 'simple', 'sentence.']
# This is another example that seems to work
# lst_a = ['tomorrow I will', 'go to the movies']
# lst_b = ['Tomorrow', 'I\'ll', 'go', 'to', 'the', 'movies.']
# Fill in blanks for words that have apostrophe: they should count as 2
lst_c = []
for item in lst_b:
    lst_c.append(item)
    if item.find("'") != -1:
        lst_c.append('')
print(lst_c)
# Makes a map of list index and length to extract
final = []
ptr = 0
for item in lst_a:
    count = len(item.split())
    # print(ptr, count, item)
    new_item = lst_c[ptr:ptr+count]
    # get rid of empty strings now
    new_item = filter(len, new_item)
    # print('new[{}:{}]={}'.format(ptr,count,new_item))
    ptr += count
    final.append(' '.join(new_item))
print(final)
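To try the approach on several inputs, the two steps can be sketched as one function. The name restore_formatting is hypothetical, and the sketch keeps the answer's assumption that every word containing an apostrophe stands for exactly two words of lst_a (a possessive such as "John's" would break the alignment).

```python
def restore_formatting(lst_a, lst_b):
    # Step 1: pad after each apostrophe word so that it occupies two slots.
    padded = []
    for word in lst_b:
        padded.append(word)
        if "'" in word:
            padded.append('')
    # Step 2: cut one slice per item of lst_a, sized by its word count.
    result = []
    ptr = 0
    for item in lst_a:
        count = len(item.split())
        chunk = [w for w in padded[ptr:ptr + count] if w]  # drop the padding
        result.append(' '.join(chunk))
        ptr += count
    return result


first = restore_formatting(
    ['it is an', 'example of', 'an english simple sentence'],
    ["It's", 'an', 'example', 'of', 'an', 'English', 'simple', 'sentence.'])
second = restore_formatting(
    ['tomorrow I will', 'go to the movies'],
    ['Tomorrow', "I'll", 'go', 'to', 'the', 'movies.'])
print(first)
print(second)
```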
You can try the following code:
lst_a = ['it is an', 'example of', 'an english simple sentence']
lst_b = ['It\'s', 'an', 'example', 'of', 'an', 'English', 'simple', 'sentence.']
lst_a_split = []
end_indices_in_lst_a_split = []
# Construct "lst_a_split" and "end_indices_in_lst_a_split".
# "lst_a_split" is supposed to be ['it', 'is', 'an', 'example', 'of', 'an', 'english', 'simple', 'sentence'].
# "end_indices_in_lst_a_split" is supposed to be [3, 5, 9].
end = 0
for s in lst_a:
    s_split = s.split()
    end += len(s_split)
    end_indices_in_lst_a_split.append(end)
    for word in s_split:
        lst_a_split.append(word)
# Construct "d" which contains
# index of every word in "lst_b" which does not include '\'' as value
# and the corresponding index of the word in "lst_a_split" as key.
# "d" is supposed to be {2: 1, 3: 2, 4: 3, 5: 4, 6: 5, 7: 6, 8: 7}.
d = {}
start = 0
for index_in_lst_b, word in enumerate(lst_b):
    if '\'' in word:
        continue
    word = word.lower().strip('.').strip(',').strip('"') # you can add other strip()'s as you want
    index_in_lst_a_split = lst_a_split.index(word, start)
    start = index_in_lst_a_split + 1
    d[index_in_lst_a_split] = index_in_lst_b
# Construct "final_lst_a".
final_lst_a = []
start_index_in_lst_b = 0
for i, end in enumerate(end_indices_in_lst_a_split):
    if end - 1 in d:
        end_index_in_lst_b = d[end - 1] + 1
        final_lst_a.append(' '.join(lst_b[start_index_in_lst_b:end_index_in_lst_b]))
        start_index_in_lst_b = end_index_in_lst_b
    elif end in d:
        end_index_in_lst_b = d[end]
        final_lst_a.append(' '.join(lst_b[start_index_in_lst_b:end_index_in_lst_b]))
        start_index_in_lst_b = end_index_in_lst_b
    else:
        # It prints the following message if it fails to construct "final_lst_a" successfully.
        # It would happen if words in "lst_b" on both sides at a boundary contain '\'', which seems to be unlikely.
        print(f'Failed to find corresponding words in "lst_b" for the string "{lst_a[i]}".')
        break
print(final_lst_a)
which prints
["It's an", 'example of', 'an English simple sentence.']
lst_a = ['it is an', 'example of', 'an english simple sentence']
lst_b = ['It\'s', 'an', 'example', 'of', 'an', 'English', 'simple', 'sentence.']
for word in lst_b:
    # If a word is capitalized, look for it in lst_a and capitalize it
    if word[0].upper() == word[0]:
        for idx, phrase in enumerate(lst_a):
            if word.lower() in phrase:
                lst_a[idx] = phrase.replace(word.lower(), word)
    if "'" in word:
        # if a word has an apostrophe, look for it in lst_a and change it
        # Note here you can include other patterns like " are",
        # or maybe just restrict it to "it is", etc.
        for idx, phrase in enumerate(lst_a):
            if " is" in phrase:
                lst_a[idx] = phrase.replace(" is", "'s")
                break
print(lst_a)
I know you already have a few responses to review. Here's something that should help you expand the implementation.
In addition to lst_a and lst_b, suppose you supply a lookup of all the contractions, such as 'It's', 'I'll' and 'don't', together with the words each one stands for. The code below would then take care of that lookup as well.
#original lst_a. This list does not have the punctuation marks
lst_a = ['it is an', 'example of', 'an english simple sentence', 'if time permits', 'I will learn','this weekend', 'but do not', 'count on me']
#desired output with correct spelling, capitalization, and punctuation
#but includes \' that need to be replaced
lst_b = ['It\'s', 'an', 'example', 'of', 'an', 'english', 'simple', 'sentence.', 'If', 'time', 'permits,','I\'ll', 'learn','this','weekend', 'but', 'don\'t','count', 'on', 'me']
#lookup list to replace the contractions
ch = {'It\'s':['It','is'],'I\'ll':['I','will'], 'don\'t':['do','not']}
#final list will be stored into lst_c
lst_c = []
#enumerate through lst_b to replace all words that are contractions
for i,v in enumerate(lst_b):
    #for this example, I am considering that all contractions are 2 part words
    for j,k in ch.items():
        if v == j: #here you are checking for contractions
            lst_b[i] = k[0] #for each contraction, you are replacing the first part
            lst_b.insert(i+1,k[1]) #and inserting the second part
#now stitch the words together based on the number of words in each item of lst_a
c = 0
for i in lst_a:
    j = i.count(' ') #find out number of words to stitch together
    #stitch together only as many words as the item in lst_a has
    lst_c.append(' '.join([lst_b[k] for k in range (c, c+j+1)]))
    c += j+1
#finally, I am printing lst_a, lst_b, and lst_c. The final result is in lst_c
print (lst_a, lst_b, lst_c, sep = '\n')
Output for this is as shown below:
lst_a = ['it is an', 'example of', 'an english simple sentence', 'if time permits', 'I will learn', 'this weekend', 'but do not', 'count on me']
lst_b = ['It', 'is', 'an', 'example', 'of', 'an', 'english', 'simple', 'sentence.', 'If', 'time', 'permits,', 'I', 'will', 'learn', 'this', 'weekend', 'but', 'do', 'not', 'count', 'on', 'me']
lst_c = ['It is an', 'example of', 'an english simple sentence.', 'If time permits,', 'I will learn', 'this weekend', 'but do not', 'count on me']
I've dug through countless other questions but none of them seem to work for me. I've also tried a ton of different things but I don't understand what I need to do. I don't know what else to do.
list:
split_me = ['this', 'is', 'my', 'list', '--', 'and', 'thats', 'what', 'it', 'is!', '--', 'Please', 'split', 'me', 'up.']
I need to:
Split this into a new list every time it finds a "--"
name the list the first value after the "--"
not include the "--" in the new lists.
So it becomes this:
this=['this', 'is', 'my', 'list']
and=['and', 'thats', 'what', 'it', 'is!']
please=['Please', 'split', 'me', 'up.']
current attempt (Work in progress):
for value in split_me:
    if firstrun:
        newlist=list(value)
        firstrun=False
        continue
    if value == "--":
        #restart? set firstrun to false?
        firstrun=False
        continue
    else:
        newlist.append(value)
print(newlist)
This more or less works, although I had to change words to solve the reserved word problem. (Bad idea to call a variable 'and').
split_me = ['This', 'is', 'my', 'list', '--', 'And', 'thats', 'what', 'it', 'is!', '--', 'Please', 'split', 'me', 'up.']
retval = []
actlist = []
for e in split_me:
    if (e == '--'):
        retval.append(actlist)
        actlist = []
        continue
    actlist.append(e)
if len(actlist) != 0:
    retval.append(actlist)
for l in retval:
    name = l[0]
    cmd = name + " = " + str(l)
    exec( cmd )
print(This)
print(And)
print(Please)
Utilizing itertools.groupby():
from itertools import groupby
dash = "--"
phrases = [list(y) for x, y in groupby(split_me, lambda z: z == dash) if not x]
Initialize a dict and map each list to the first word in that list:
myDict = {}
for phrase in phrases:
    myDict[phrase[0].lower()] = phrase
Which will output:
{'this': ['this', 'is', 'my', 'list'],
 'and': ['and', 'thats', 'what', 'it', 'is!'],
 'please': ['Please', 'split', 'me', 'up.']}
This will actually create global variables named the way you want them to be named. Unfortunately, it will not work for Python keywords such as 'and', and for this reason I am replacing 'and' with 'And':
split_me = ['this', 'is', 'my', 'list', '--', 'And', 'thats', 'what', 'it',
'is!', '--', 'Please', 'split', 'me', 'up.']
new = True
while split_me:
    current = split_me.pop(0)
    if current == '--':
        new = True
        continue
    if new:
        globals()[current] = [current]
        newname = current
        new = False
        continue
    globals()[newname].append(current)
A more elegant approach, based on @Mangohero1's answer, would be:
from itertools import groupby
dash = '--'
phrases = [list(y) for x, y in groupby(split_me, lambda z: z == dash) if not x]
for l in phrases:
    if not l:
        continue
    globals()[l[0]] = l
I would try something like
" ".join(split_me).split(' -- ') # as a start
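Taken one step further, that starting point might look like the sketch below. It assumes that no item contains whitespace and that "--" never appears as the first or last item; the variable name phrases is chosen only for illustration.

```python
split_me = ['this', 'is', 'my', 'list', '--', 'and', 'thats', 'what', 'it',
            'is!', '--', 'Please', 'split', 'me', 'up.']

# Join into one string, cut at the ' -- ' separators,
# then split each chunk back into a list of words.
phrases = [chunk.split() for chunk in " ".join(split_me).split(" -- ")]
print(phrases)
```

Each inner list can then be stored under its first word in a dictionary, which avoids creating variables with reserved names such as and.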
I'm kind of stuck on an issue, and I've gone round and round with it until I've confused myself.
What I am trying to do is take a list of words:
['About', 'Absolutely', 'After', 'Aint', 'Alabama', 'AlabamaBill', 'All', 'Also', 'Amos', 'And', 'Anyhow', 'Are', 'As', 'At', 'Aunt', 'Aw', 'Bedlam', 'Behind', 'Besides', 'Biblical', 'Bill', 'Billgone']
Then sort them under alphabetical headings:
A
About
Absolutely
After
B
Bedlam
Behind
etc...
Is there an easy way to do this?
Use itertools.groupby() to group your input by a specific key, such as the first letter:
from itertools import groupby
from operator import itemgetter
for letter, words in groupby(sorted(somelist), key=itemgetter(0)):
    print(letter)
    for word in words:
        print(word)
    print()
If your list is already sorted, you can omit the sorted() call. The itemgetter(0) callable will return the first letter of each word (the character at index 0), and groupby() will then yield that key plus an iterable that consists only of those items for which the key remains the same. In this case that means looping over words gives you all items that start with the same character.
Demo:
>>> somelist = ['About', 'Absolutely', 'After', 'Aint', 'Alabama', 'AlabamaBill', 'All', 'Also', 'Amos', 'And', 'Anyhow', 'Are', 'As', 'At', 'Aunt', 'Aw', 'Bedlam', 'Behind', 'Besides', 'Biblical', 'Bill', 'Billgone']
>>> from itertools import groupby
>>> from operator import itemgetter
>>>
>>> for letter, words in groupby(sorted(somelist), key=itemgetter(0)):
...     print(letter)
...     for word in words:
...         print(word)
...     print()
...
A
About
Absolutely
After
Aint
Alabama
AlabamaBill
All
Also
Amos
And
Anyhow
Are
As
At
Aunt
Aw
B
Bedlam
Behind
Besides
Biblical
Bill
Billgone
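If the groups are needed later rather than printed immediately, the same groupby() call can feed a dictionary. This is a sketch written for Python 3, with a shortened, deliberately unsorted word list made up for the example; the name grouped is also an assumption.

```python
from itertools import groupby
from operator import itemgetter

somelist = ['Bill', 'About', 'After', 'Bedlam', 'Also']

# groupby() only merges adjacent items, so the list is sorted first.
# Each group iterator is turned into a list before moving to the next key.
grouped = {letter: list(words)
           for letter, words in groupby(sorted(somelist), key=itemgetter(0))}

for letter, words in grouped.items():
    print(letter)
    for word in words:
        print(word)
    print()
```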
You can also do this without any library imports or anything fancy. Here is the logic:
def splitLst(x):
    dictionary = dict()
    for word in x:
        f = word[0]  # first letter is the grouping key
        if f in dictionary:
            dictionary[f].append(word)
        else:
            dictionary[f] = [word]
    return dictionary
splitLst(['About', 'Absolutely', 'After', 'Aint', 'Alabama', 'AlabamaBill', 'All', 'Also', 'Amos', 'And', 'Anyhow', 'Are', 'As', 'At', 'Aunt', 'Aw', 'Bedlam', 'Behind', 'Besides', 'Biblical', 'Bill', 'Billgone'])
def split(n):
    n2 = []
    for i in n:
        if i[0] not in n2:
            n2.append(i[0])
    n2.sort()
    for j in n:
        z = j[0]
        z1 = n2.index(z)
        n2.insert(z1+1, j)
    return n2
word_list = ['be','have','do','say','get','make','go','know','take','see','come','think',
'look','want','give','use','find','tell','ask','work','seem','feel','leave','call']
print(split(word_list))