How do I merge the bigrams below to a single string?
_bigrams=['the school', 'school boy', 'boy is', 'is reading']
_split=(' '.join(_bigrams)).split()
_newstr=[]
_filter=[_newstr.append(x) for x in _split if x not in _newstr]
_newstr=' '.join(_newstr)
print(_newstr)
Output: 'the school boy is reading'. That is the desired output, but the approach is long-winded and not efficient enough given the large size of my data. Secondly, the approach would not support duplicate words in the final string, e.g. 'the school boy is reading, is he?'. Only one of the 'is' occurrences would be kept in the final string in that case.
Any suggestions on how to make this work better? Thanks.
# Multi-for generator expression allows us to create a flat iterable of words
all_words = (word for bigram in _bigrams for word in bigram.split())
def no_runs_of_words(words):
    """Takes an iterable of words and returns one with any runs condensed."""
    prev_word = None
    for word in words:
        if word != prev_word:
            yield word
        prev_word = word
final_string = ' '.join(no_runs_of_words(all_words))
This takes advantage of generators to evaluate lazily, so the full set of words is never held in memory at once before the final string is built.
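To illustrate that this approach also keeps legitimate repeats (the second concern in the question), here is a quick check with a hypothetical bigram list for 'the school boy is reading, is he?' (the variable names are just for the demo):

_bigrams2 = ['the school', 'school boy', 'boy is', 'is reading,',
             'reading, is', 'is he?']
all_words2 = (word for bigram in _bigrams2 for word in bigram.split())
print(' '.join(no_runs_of_words(all_words2)))
# -> the school boy is reading, is he?

Only consecutive duplicates are dropped, so the second 'is' survives.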
If you really wanted a one-liner, something like this could work:
' '.join(val.split()[0] for val in _bigrams) + ' ' + _bigrams[-1].split()[-1]
Would this do it? It simply takes the first word of each entry, except for the last entry, which is kept whole:
_bigrams=['the school', 'school boy', 'boy is', 'is reading']
clause = [a.split()[0] if a != _bigrams[-1] else a for a in _bigrams]
print(' '.join(clause))
Output
the school boy is reading
However, concerning performance, Amber's solution is probably the better option.
given a list of stop words and a string:
list_stop_words = ['for', 'the', 'with']
mystring = 'this is the car for the girl with the long nice red hair'
I would like to get the text starting from the end up to the first stop word of the list.
expected result 'the long nice red hair'
I tried with several for loops, but it is super cumbersome; there should be a more direct way, probably even a one-liner.
my super verbose solution:
list_stop_words = ['for', 'the', 'with']
mystring = 'this is the car for the girl with the long nice red hair'
reversed_sentence = mystring.split()[::-1]
for i, word in enumerate(reversed_sentence):
    if word in list_stop_words:
        position = i  # index of the first stop word, counting from the end
        words = reversed_sentence[0:position + 1]
        print(' '.join(words[::-1]))
        break
Any suggestion for a better approach?
EDIT AFTER THE ANSWER (SEE BELOW)
You can try something like this:
mystring[max([mystring.rfind(stop_word) for stop_word in list_stop_words]):]
Basically, you find the last occurrence of each stop word with rfind, take the largest of those positions with max, and then slice the string from there.
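One caveat: rfind matches substrings, so a stop word like 'for' would also be found inside 'forest'. If that matters for your data, here is a sketch of a word-boundary-safe variant using the re module (not part of the original answer):

import re

list_stop_words = ['for', 'the', 'with']
mystring = 'this is the car for the girl with the long nice red hair'

# \b makes each stop word match only as a whole word
pattern = r'\b(?:' + '|'.join(map(re.escape, list_stop_words)) + r')\b'
matches = list(re.finditer(pattern, mystring))
print(mystring[matches[-1].start():] if matches else mystring)
# -> the long nice red hair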
I have had a go at the CODEWARS dubstep challenge using python.
My code is below, it works and I pass the kata test. However, it took me a long time and I ended up using a brute force approach (newbie).
(basically replacing and striping the string until it worked)
Any ideas with comments on how my code could be improved please?
TASK SUMMARY:
Let's assume that a song consists of some number of words (that don't contain WUB). To make the dubstep remix of this song, Polycarpus inserts a certain number of words "WUB" before the first word of the song (the number may be zero), after the last word (the number may be zero), and between words (at least one between any pair of neighbouring words), and then the boy glues together all the words, including "WUB", in one string and plays the song at the club.
For example, a song with words "I AM X" can transform into a dubstep remix as "WUBWUBIWUBAMWUBWUBX" and cannot transform into "WUBWUBIAMWUBX".
song_decoder("WUBWEWUBAREWUBWUBTHEWUBCHAMPIONSWUBMYWUBFRIENDWUB")
# => WE ARE THE CHAMPIONS MY FRIEND
song_decoder("AWUBBWUBC"), "A B C","WUB should be replaced by 1 space"
song_decoder("AWUBWUBWUBBWUBWUBWUBC"), "A B C","multiples WUB should be replaced by only 1 space"
song_decoder("WUBAWUBBWUBCWUB"), "A B C","heading or trailing spaces should be removed"
Thanks in advance, (I am new to stackoverflow also)
MY CODE:
def song_decoder(song):
    new_song = song.replace("WUB", " ")
    new_song2 = new_song.strip()
    new_song3 = new_song2.replace("  ", " ")  # collapse double spaces
    new_song4 = new_song3.replace("  ", " ")  # and once more for triples
    return new_song4
I don't know if it counts as an improvement, but I would use split and join:
text = 'WUBWEWUBAREWUBWUBTHEWUBCHAMPIONSWUBMYWUBFRIENDWUB'
text = text.replace("WUB", " ")
print(text)
words = text.split()
print(words)
text = " ".join(words)
print(text)
Result
WE ARE THE CHAMPIONS MY FRIEND
['WE', 'ARE', 'THE', 'CHAMPIONS', 'MY', 'FRIEND']
WE ARE THE CHAMPIONS MY FRIEND
EDIT:
A slightly different version: I split on "WUB", but that creates empty elements between two adjacent "WUB"s, and those need to be removed.
text = 'WUBWEWUBAREWUBWUBTHEWUBCHAMPIONSWUBMYWUBFRIENDWUB'
words = text.split("WUB")
print(words)
words = [x for x in words if x] # remove empty elements
#words = list(filter(None, words)) # remove empty elements
print(words)
text = " ".join(words)
print(text)
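Combining the two steps, the whole kata fits in one line, because split() without arguments already discards the empty strings and the leading/trailing whitespace (a minimal sketch, not the original poster's code):

def song_decoder(song):
    # swap WUB for spaces, then let split()/join() normalise all whitespace
    return " ".join(song.replace("WUB", " ").split())

print(song_decoder("WUBWEWUBAREWUBWUBTHEWUBCHAMPIONSWUBMYWUBFRIENDWUB"))
# -> WE ARE THE CHAMPIONS MY FRIEND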
I have a list of sentences (~100k sentences total) and a list of "infrequent words" (length ~20k). I would like to run through each sentence and replace any word that matches an entry in "infrequent_words" with the tag "UNK".
So, as a small example, if
infrequent_words = ['dog','cat']
sentence = 'My dog likes to chase after cars'
Then after applying the transformation it should be
sentence = 'My unk likes to chase after cars'
I am having trouble finding an efficient way to do this. This function below (applied to each sentence) works, but it is very slow and I know there must be something better. Any suggestions?
def replace_infrequent_words(text, infrequent_words):
    for word in infrequent_words:
        text = text.replace(word, 'unk')
    return text
Thank you!
infrequent_words = {'dog','cat'}
sentence = 'My dog likes to chase after cars'
def replace_infrequent_words(text, infrequent_words):
    words = text.split()
    for i in range(len(words)):
        if words[i] in infrequent_words:
            words[i] = 'unk'
    return ' '.join(words)
print(replace_infrequent_words(sentence, infrequent_words))
Two things that should improve performance:
Use a set instead of a list for storing infrequent_words.
Use a list to store each word in text so you don't have to scan the entire text string with each replacement.
This doesn't account for grammar and punctuation, but it should be a clear performance improvement over what you posted.
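If you prefer it shorter, the same loop collapses into a single generator expression (a sketch equivalent to the function above):

infrequent_words = {'dog', 'cat'}  # a set gives O(1) membership tests
sentence = 'My dog likes to chase after cars'

def replace_infrequent_words(text, infrequent_words):
    # rebuild the sentence, swapping flagged words for 'unk'
    return ' '.join('unk' if w in infrequent_words else w
                    for w in text.split())

print(replace_infrequent_words(sentence, infrequent_words))
# -> My unk likes to chase after cars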
I have a list
['mPXSz0qd6j0 youtube ', 'lBz5XJRLHQM youtube ', 'search OpHQOO-DwlQ ',
'sachin 47427243 ', 'alex smith ', 'birthday JEaM8Lg9oK4 ',
'nebula 8x41n9thAU8 ', 'chuck norris ',
'searcher O6tUtqPcHDw ', 'graham wXqsg59z7m0 ', 'queries K70QnTfGjoM ']
Is there some way to identify the strings in each list item that can't be spelled (i.e. that aren't real words) and remove them?
You can use, e.g. PyEnchant for basic dictionary checking and NLTK to take minor spelling issues into account, like this:
import enchant
import nltk
spell_dict = enchant.Dict('en_US') # or whatever language supported
def get_distance_limit(w):
    '''
    The word is considered good
    if it's no further from a known word than this limit.
    '''
    return len(w) / 5 + 2  # just for example, allowing around 1 typo per 5 chars

def check_word(word):
    if spell_dict.check(word):
        return True  # a known dictionary word
    # try similar words
    max_dist = get_distance_limit(word)
    for suggestion in spell_dict.suggest(word):
        if nltk.edit_distance(suggestion, word) < max_dist:
            return True
    return False
Add a case normalisation and a filter for digits and you'll get a pretty good heuristic.
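For instance, here is a sketch of how check_word might be applied to the question's list with that normalisation and digit filter added (keep_spellable is a made-up helper name):

def keep_spellable(items):
    cleaned = []
    for item in items:
        # drop tokens containing digits, spell-check the rest in lower case
        good = [w for w in item.split()
                if not any(ch.isdigit() for ch in w) and check_word(w.lower())]
        if good:
            cleaned.append(' '.join(good))
    return cleaned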
It is entirely possible to compare your list members against a list of words that you believe to be valid for your input.
This can be done in many ways, partially depending on your definition of "properly spelled" and what you end up using for a comparison list. If you decide that numbers preclude an entry from being valid, or underscores, or mixed case, you could test for regex matching.
After the regex step, you would have to decide what a valid character to split on should be. Is it spaces (are you willing to break on 'ad hoc', where 'ad' is an abbreviation and 'hoc' is not a word)? Is it hyphens (this will break on hyphenated last names)?
With these above criteria decided, it's just a decision of what word, proper name, and common slang list to use and a list comprehension:
word_list[:] = [term for term in word_list if passes_my_membership_criteria(term)]
where passes_my_membership_criteria() is a function that contains the rules for staying in the list of words, returning False for things that you've decided are not valid.
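For concreteness, one hypothetical passes_my_membership_criteria might look like this (the rules and the valid_words list are placeholders; substitute your own word list and criteria):

import re

# Placeholder comparison list; in practice load a real dictionary or name list.
valid_words = {'youtube', 'search', 'birthday', 'alex', 'smith'}

def passes_my_membership_criteria(term):
    # Reject anything containing digits or underscores outright.
    if re.search(r'[\d_]', term):
        return False
    # Every space-separated token must be a known word (case-insensitive).
    return all(token.lower() in valid_words for token in term.split())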
I want to count unique words in a text, but I want to make sure that words followed by special characters aren't treated differently, and that the evaluation is case-insensitive.
Take this example
text = "There is one handsome boy. The boy has now grown up. He is no longer a boy now."
print(len(set(w.lower() for w in text.split())))
The result would be 16, but I expect it to return 14. The problem is that 'boy.' and 'boy' are evaluated differently, because of the punctuation.
import re
print(len(re.findall(r'\w+', text)))
Using a regular expression makes this very simple. All you need to keep in mind is to lowercase all the characters first, and then wrap the result in a set to remove duplicate items:
print(len(set(re.findall(r'\w+', text.lower()))))
you can use regex here:
In [65]: text = "There is one handsome boy. The boy has now grown up. He is no longer a boy now."
In [66]: import re
In [68]: set(m.group(0).lower() for m in re.finditer(r"\w+",text))
Out[68]:
set(['grown',
'boy',
'he',
'now',
'longer',
'no',
'is',
'there',
'up',
'one',
'a',
'the',
'has',
'handsome'])
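Taking len() of that set then gives the expected count:

print(len(set(m.group(0).lower() for m in re.finditer(r"\w+", text))))
# -> 14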
I think that you have the right idea of using the Python built-in set type.
I think that it can be done if you first remove the '.' by doing a replace:
text = "There is one handsome boy. The boy has now grown up. He is no longer a boy now."
punc_char= ",.?!'"
for letter in text:
if letter == '"' or letter in punc_char:
text= text.replace(letter, '')
text= set(text.split())
len(text)
That should work for you. If you need to handle any other signs or punctuation marks, you can easily add them to punc_char and they will be filtered out.
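As a quick check with the question's example (assuming the code above has run as-is), text is now a set, so its length is the unique-word count:

print(len(text))
# -> 14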
First, you need to get a list of words. You can use a regex as eandersson suggested:
import re
words = re.findall(r'\w+', text.lower())
Now, you want to get the number of unique entries. There are a couple of ways to do this. One way would be to iterate through the words list and use a dictionary to keep track of the number of times you have seen each word:
cwords = {}
for word in words:
    try:
        cwords[word] += 1
    except KeyError:
        cwords[word] = 1
Now, finally, you can get the number of unique words by
len(cwords)
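As an aside, collections.Counter from the standard library does the same bookkeeping in one step (an alternative sketch, not part of the original answer):

import re
from collections import Counter

text = "There is one handsome boy. The boy has now grown up. He is no longer a boy now."
cwords = Counter(re.findall(r'\w+', text.lower()))
print(len(cwords))
# -> 14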