nested for loop for splitting string based on multiple delimiters - python

I'm working on a Python assignment which requires a text to be delimited, sorted and printed as:
sentences are delimited by .
phrases by ,
then printed
What I've done so far:
text = "what time of the day is it. i'm heading out to the ball park, with the kids, on this nice evening. are you willing to join me, on the walk to the park, tonight."
for i, phrase in enumerate(text.split(',')):
    print('phrase #%d: %s' % (i+1, phrase))
phrase #1: what time of the day is it. i'm heading out to the ball park
phrase #2: with the kids
phrase #3: on this nice evening. are you willing to join me
phrase #4: on the walk to the park
phrase #5: tonight.
I know a nested for loop is needed and have tried with:
for s, sentence in enumerate(text.split('.')):
    for p, phrase in enumerate(text.split(',')):
        print('sentence #%d:','phrase #%d: %s' %(s+1,p+1,len(sentence),phrase))
TypeError: not all arguments converted during string formatting
A hint and/or a simple example would be welcomed.

You probably want:
'sentence #%d:\nphrase #%d: %d %s\n' %(s+1,p+1,len(sentence),phrase)
And in the inner loop, you certainly want to split sentence, not text again.
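Putting both fixes together, a minimal sketch of the corrected loops (the len(sentence) field is kept only because the suggested format string above includes it):

text = "what time of the day is it. i'm heading out to the ball park, with the kids, on this nice evening. are you willing to join me, on the walk to the park, tonight."
for s, sentence in enumerate(text.split('.')):
    # split the current sentence, not the whole text again
    for p, phrase in enumerate(sentence.split(',')):
        print('sentence #%d:\nphrase #%d: %d %s\n' % (s+1, p+1, len(sentence), phrase))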

TypeError: not all arguments converted during string formatting
Is a hint.
Your loops are fine.
'sentence #%d:','phrase #%d: %s' %(s+1,p+1,len(sentence),phrase)
is wrong.
Count the %d and %s conversion specifications. Count the values after the % operator.
The numbers aren't the same, are they? That's a TypeError.

There are a couple of issues with your code snippet:
for s, sentence in enumerate(text.split('.')):
    for p, phrase in enumerate(text.split(',')):
        print('sentence #%d:','phrase #%d: %s' %(s+1,p+1,len(sentence),phrase))
If I understand you correctly, you want to split the text into sentences, which are delimited by '.'. Then you want to split each of these sentences into phrases, which are delimited by ','. So the second line should actually split the output of the outer loop's enumeration, something like
for p, phrase in enumerate(sentence.split(',')):
The print statement. If you ever see an error like TypeError, you can be sure that you are trying to use a value of one type where another type is expected. But there is no assignment here? It is an indirect mismatch in the print formatting. What you committed to with the format string is that you would supply three parameters, of which the first two would be integers (%d) and the last a string (%s). But you ended up supplying four values (s+1, p+1, len(sentence), phrase), which is inconsistent with your format specifier. Either you drop the third parameter (len(sentence)), like
print('sentence #%d:, phrase #%d: %s' %(s+1,p+1,phrase))
or add one more format specifier to the print statement
print('sentence #%d:, phrase #%d:, length #%d, %s' %(s+1,p+1,len(sentence),phrase))
Assuming you want the former, that leaves us with
for s, sentence in enumerate(text.split('.')):
    for p, phrase in enumerate(sentence.split(',')):
        print('sentence #%d:, phrase #%d: %s' % (s+1, p+1, phrase))

>>> sen = [words[1] for words in enumerate(text.split(". "))]
>>> for each in sen: each.split(", ")
['what time of the day is it']
["i'm heading out to the ball park", 'with the kids', 'on this nice evening']
['are you willing to join me', 'on the walk to the park', 'tonight.']
It's up to you to transform this unassigned output to your liking.
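If you want to keep the result rather than just echo it in the interpreter, a minimal sketch collecting the same splits into a nested list and printing it in the numbered style from the question:

sentences = [s for s in text.split(". ") if s]
phrases_per_sentence = [sentence.split(", ") for sentence in sentences]
for s, phrases in enumerate(phrases_per_sentence):
    for p, phrase in enumerate(phrases):
        print('sentence #%d, phrase #%d: %s' % (s+1, p+1, phrase))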

Related

how to remove instances and possible multiple instances of a certain word in a string and return a string (CODEWARS dubstep)

I have had a go at the CODEWARS dubstep challenge using Python.
My code is below; it works and I pass the kata test. However, it took me a long time and I ended up using a brute-force approach (newbie), basically replacing and stripping the string until it worked.
Any ideas with comments on how my code could be improved please?
TASK SUMMARY:
Let's assume that a song consists of some number of words (that don't contain WUB). To make the dubstep remix of this song, Polycarpus inserts a certain number of words "WUB" before the first word of the song (the number may be zero), after the last word (the number may be zero), and between words (at least one between any pair of neighbouring words), and then the boy glues together all the words, including "WUB", in one string and plays the song at the club.
For example, a song with words "I AM X" can transform into a dubstep remix as "WUBWUBIWUBAMWUBWUBX" and cannot transform into "WUBWUBIAMWUBX".
song_decoder("WUBWEWUBAREWUBWUBTHEWUBCHAMPIONSWUBMYWUBFRIENDWUB")
# => WE ARE THE CHAMPIONS MY FRIEND
song_decoder("AWUBBWUBC"), "A B C","WUB should be replaced by 1 space"
song_decoder("AWUBWUBWUBBWUBWUBWUBC"), "A B C","multiples WUB should be replaced by only 1 space"
song_decoder("WUBAWUBBWUBCWUB"), "A B C","heading or trailing spaces should be removed"
Thanks in advance, (I am new to stackoverflow also)
MY CODE:
def song_decoder(song):
    new_song = song.replace("WUB", " ")
    new_song2 = new_song.strip()
    new_song3 = new_song2.replace("  ", " ")  # collapse double spaces left by consecutive WUBs
    new_song4 = new_song3.replace("  ", " ")
    return new_song4
I don't know if it improves things, but I would use split and join.
text = 'WUBWEWUBAREWUBWUBTHEWUBCHAMPIONSWUBMYWUBFRIENDWUB'
text = text.replace("WUB", " ")
print(text)
words = text.split()
print(words)
text = " ".join(words)
print(text)
Result
WE ARE THE CHAMPIONS MY FRIEND
['WE', 'ARE', 'THE', 'CHAMPIONS', 'MY', 'FRIEND']
WE ARE THE CHAMPIONS MY FRIEND
EDIT:
A slightly different version: I split using WUB, but then it creates empty elements between two consecutive WUBs, and those need to be removed.
text = 'WUBWEWUBAREWUBWUBTHEWUBCHAMPIONSWUBMYWUBFRIENDWUB'
words = text.split("WUB")
print(words)
words = [x for x in words if x] # remove empty elements
#words = list(filter(None, words)) # remove empty elements
print(words)
text = " ".join(words)
print(text)
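Both versions boil down to the same idea; a compact sketch of the whole decoder, just chaining the same replace/split/join steps:

def song_decoder(song):
    # replace WUB with spaces, then let split()/join() normalise the whitespace
    return " ".join(song.replace("WUB", " ").split())

print(song_decoder("WUBWEWUBAREWUBWUBTHEWUBCHAMPIONSWUBMYWUBFRIENDWUB"))
# WE ARE THE CHAMPIONS MY FRIEND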

How to create a vocabulary from a list of strings in a fast manner in python

I have a problem which I solved, but not in an efficient manner. I have a list of strings, which are captions for images. I need to take every word in this list of strings and create a dictionary containing the following information:
the word, if that word appears 5 times or more in the list
a simple id for that word
Therefore, my vocabulary in a Python dictionary would contain word:id entries.
First, I have an auxiliary function to divide a string into tokens, or words
def split_sentence(sentence):
    return list(filter(lambda x: len(x) > 0, re.split('\W+', sentence.lower())))
Then, I will generate the vocabulary like this, which works
def generate_vocabulary(train_captions):
    """
    Return {token: index} for all train tokens (words) that occur 5 times or more,
    `index` should be from 0 to N, where N is a number of unique tokens in the resulting dictionary.
    """
    # convert the list of whole captions to one string
    string = ' '.join([str(elem) for elem in train_captions])
    # divide the string into tokens (individual words), by calling the previous function
    individual_words = split_sentence(string)
    # create a list of words that appear 5 times or more in that string
    more_than_5 = list(set([x for x in individual_words if individual_words.count(x) >= 5]))
    # generate ids
    ids = [i for i in range(0, len(more_than_5))]
    # generate the vocabulary (dictionary)
    vocab = dict(zip(more_than_5, ids))
    return {token: index for index, token in enumerate(sorted(vocab))}
The code works like a charm for relatively small lists of captions. However, with lists of thousands of captions (e.g., 80000), it takes forever. I have been running this code for an hour now.
Is there any way to speed up my code? How can I calculate my more_than_5 variable faster?
EDIT: I forgot to mention that, in a few specific members of this list of strings, there are \n symbols at the beginning of the sentence. Is it possible to eliminate just this symbol from my list and then apply the algorithm again?
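For the EDIT, a minimal sketch: strip the newlines from each caption before building the vocabulary (strictly speaking, re.split('\W+') already treats \n as a non-word character, so the newlines are discarded during tokenisation anyway):

# remove leading/trailing newlines from each caption, mirroring the str(elem) call above
train_captions = [str(caption).strip('\n') for caption in train_captions]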
You can count each word's occurrences once, instead of recounting it on every step of the list comprehension, by using Counter from the collections package.
import re
from collections import Counter

def split_sentence(sentence):
    return list(filter(lambda x: len(x) > 0, re.split('\W+', sentence.lower())))

def generate_vocabulary(train_captions, min_threshold):
    """
    Return {token: index} for all train tokens (words) that occur min_threshold times or more,
    `index` should be from 0 to N, where N is a number of unique tokens in the resulting dictionary.
    """
    # convert the list of whole captions to one string
    concat_str = ' '.join([str(elem).strip('\n') for elem in train_captions])

    # divide the string into tokens (individual words), by calling the split_sentence function
    individual_words = split_sentence(concat_str)

    # create a list of words that appear min_threshold times or more in that string
    condition_keys = sorted([key for key, value in Counter(individual_words).items() if value >= min_threshold])

    # generate the vocabulary (dictionary)
    result = dict(zip(condition_keys, range(len(condition_keys))))
    return result
train_captions = ['Nory was a Catholic because her mother was a Catholic, and Nory’s mother was a Catholic because her father was a Catholic, and her father was a Catholic because his mother was a Catholic, or had been.',
'I felt happy because I saw the others were happy and because I knew I should feel happy, but I wasn’t really happy.',
'Almost nothing was more annoying than having our wasted time wasted on something not worth wasting it on.']
generate_vocabulary(train_captions, min_threshold=5)
# {'a': 0, 'because': 1, 'catholic': 2, 'i': 3, 'was': 4}
Like @Eduard Ilyasov said, the Counter class is the best when you need to count things.
Here's my solution:
import re
import collections
original_text = (
"I say to you today, my friends, though, even though ",
"we face the difficulties of today and tomorrow, I still have ",
"a dream. It is a dream deeply rooted in the American ",
"dream. I have a dream that one day this nation will rise ",
'up, live out the true meaning of its creed: "We hold these ',
'truths to be self-evident, that all men are created equal."',
"",
"I have a dream that one day on the red hills of Georgia ",
"sons of former slaves and the sons of former slave-owners ",
"will be able to sit down together at the table of brotherhood. ",
"I have a dream that one day even the state of ",
"Mississippi, a state sweltering with the heat of injustice, ",
"sweltering with the heat of oppression, will be transformed ",
"into an oasis of freedom and justice. ",
"",
"I have a dream that my four little chi1dren will one day ",
"live in a nation where they will not be judged by the color ",
"of their skin but by the content of their character. I have ",
"a dream… I have a dream that one day in Alabama, ",
"with its vicious racists, with its governor having his lips ",
"dripping with the words of interposition and nullification, ",
"one day right there in Alabama little black boys and black ",
"girls will he able to join hands with little white boys and ",
"white girls as sisters and brothers. "
)
def split_sentence(sentence):
    return (x.lower() for x in re.split('\W+', sentence.strip()) if x)

def generate_vocabulary(train_captions):
    word_count = collections.Counter()
    for current_sentence in train_captions:
        word_count.update(split_sentence(str(current_sentence)))
    return {key: value for (key, value) in word_count.items() if value >= 5}
print(generate_vocabulary(original_text))
I made some assumptions that you didn't specify:
I assumed that a word would not span two sentences.
I kept the assumption that your captions aren't always going to be strings. If you know they always will be, you can simplify the code by changing word_count.update(split_sentence(str(current_sentence))) to word_count.update(split_sentence(current_sentence)).
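Note that this version returns {word: count} rather than {word: id}; if you need the id mapping as in the question, a small follow-up sketch (reusing the question's own comprehension, and assuming the definitions above are in scope):

vocab_counts = generate_vocabulary(original_text)
# map the surviving tokens to ids 0..N-1, as in the question's original return statement
vocab_ids = {token: index for index, token in enumerate(sorted(vocab_counts))}
print(vocab_ids)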

Merging or reversing n-grams to a single string

How do I merge the bigrams below to a single string?
_bigrams=['the school', 'school boy', 'boy is', 'is reading']
_split=(' '.join(_bigrams)).split()
_newstr=[]
_filter=[_newstr.append(x) for x in _split if x not in _newstr]
_newstr=' '.join(_newstr)
print _newstr
Output: 'the school boy is reading'. It's the desired output, but the approach is too long and not very efficient given the large size of my data. Secondly, the approach would not support duplicate words in the final string, i.e. 'the school boy is reading, is he?'. Only one of the 'is' would be permitted in the final string in this case.
Any suggestions on how to make this work better? Thanks.
# Multi-for generator expression allows us to create a flat iterable of words
all_words = (word for bigram in _bigrams for word in bigram.split())

def no_runs_of_words(words):
    """Takes an iterable of words and returns one with any runs condensed."""
    prev_word = None
    for word in words:
        if word != prev_word:
            yield word
        prev_word = word

final_string = ' '.join(no_runs_of_words(all_words))
This takes advantage of generators to lazily evaluate and not keep the entire set of words in memory at the same time until generating the one final string.
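A quick sanity check on the question's input (assuming no_runs_of_words from above is in scope):

_bigrams = ['the school', 'school boy', 'boy is', 'is reading']
all_words = (word for bigram in _bigrams for word in bigram.split())
print(' '.join(no_runs_of_words(all_words)))
# the school boy is reading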
If you really wanted a oneliner, something like this could work:
' '.join(val.split()[0] for val in (_bigrams)) + ' ' + _bigrams[-1].split()[-1]
Would this do it? It simply takes the first word of each bigram, except for the last entry, which is kept whole.
_bigrams=['the school', 'school boy', 'boy is', 'is reading']
clause = [a.split()[0] if a != _bigrams[-1] else a for a in _bigrams]
print ' '.join(clause)
Output
the school boy is reading
However, concerning performance, Amber's solution is probably the better option.

Python Error"TypeError: coercing to Unicode: need string or buffer, list found"

The purpose of this code is to make a program that searches a persons name (on Wikipedia, specifically) and uses keywords to come up with reasons why that person is significant.
I'm having issues with this specific line "if fact_amount < 5 and (terms in sentence.lower()):" because I get this error ("TypeError: coercing to Unicode: need string or buffer, list found")
If you could offer some guidance it would be greatly appreciated, thank you.
import requests
import nltk
import re
#You will need to install requests and nltk
terms = ['pronounced',
         'was a significant',
         'major/considerable influence',
         'one of the (X) most important',
         'major figure',
         'earliest',
         'known as',
         'father of',
         'best known for',
         'was a major']
names = ["Nelson Mandela","Bill Gates","Steve Jobs","Lebron James"]
#List of people that you need to get info from
for name in names:
    print name
    print '==============='
    # Goes to the wikipedia page of the person
    r = requests.get('http://en.wikipedia.org/wiki/%s' % (name))
    # Parses the raw html into text
    raw = nltk.clean_html(r.text)
    # Tries to split each sentence.
    # sort of buggy though
    # For example St. Mary will split after St.
    sentences = re.split('[?!.][\s]*', raw)
    fact_amount = 0
    for sentence in sentences:
        # I noticed that important things came after 'he was' and 'she was'
        # Seems to work for my sample list
        # Also there may be buggy sentences, so I return 5 instead of 3
        if fact_amount < 5 and (terms in sentence.lower()):
            # remove the reference notation that wikipedia has
            # ex [ 33 ]
            sentence = re.sub('[ [0-9]+ ]', '', sentence)
            # removes newlines
            sentence = re.sub('\n', '', sentence)
            # removes trailing and leading whitespace
            sentence = sentence.strip()
            fact_amount += 1
            # sentence is formatted. Print it out
            print sentence + '.'
    print
You should be checking it the other way
sentence.lower() in terms
terms is a list and sentence.lower() is a string. You can check whether a particular string is in a list, but you cannot check whether a list is in a string.
You might mean if any(t in sentence.lower() for t in terms), to check whether any term from the terms list is in the sentence string.
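A minimal sketch of that check in context (the sentence value here is made up for illustration; terms is assumed to be a subset of the question's list):

terms = ['pronounced', 'was a significant', 'known as', 'father of']  # subset of the question's list
sentence = "He was a significant figure in the history of computing"
if any(t in sentence.lower() for t in terms):
    print(sentence)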

Counting the number of unique words [duplicate]

This question already has answers here:
Counting the number of unique words in a document with Python
(8 answers)
Closed 9 years ago.
I want to count unique words in a text, but I want to make sure that words followed by special characters aren't treated differently, and that the evaluation is case-insensitive.
Take this example
text = "There is one handsome boy. The boy has now grown up. He is no longer a boy now."
print len(set(w.lower() for w in text.split()))
The result would be 16, but I expect it to return 14. The problem is that 'boy.' and 'boy' are evaluated differently, because of the punctuation.
import re
print len(re.findall('\w+', text))
Using a regular expression makes this very simple. All you need to keep in mind is to make sure that all the characters are in lowercase, and finally combine the result using set to ensure that there are no duplicate items.
print len(set(re.findall('\w+', text.lower())))
you can use regex here:
In [65]: text = "There is one handsome boy. The boy has now grown up. He is no longer a boy now."
In [66]: import re
In [68]: set(m.group(0).lower() for m in re.finditer(r"\w+",text))
Out[68]:
set(['grown',
'boy',
'he',
'now',
'longer',
'no',
'is',
'there',
'up',
'one',
'a',
'the',
'has',
'handsome'])
I think you have the right idea of using the Python built-in set type.
I think it can be done if you first remove the '.' (and other punctuation) by doing a replace:
text = "There is one handsome boy. The boy has now grown up. He is no longer a boy now."
punc_char= ",.?!'"
for letter in text:
if letter == '"' or letter in punc_char:
text= text.replace(letter, '')
text= set(text.split())
len(text)
That should work for you. And if you need any other signs or punctuation marks filtered out, you can easily add them to punc_char.
Abraham J.
First, you need to get a list of words. You can use a regex as eandersson suggested:
import re
words = re.findall('\w+', text)
Now, you want to get the number of unique entries. There are a couple of ways to do this. One way would be to iterate through the words list and use a dictionary to keep track of the number of times you have seen a word:
cwords = {}
for word in words:
    try:
        cwords[word] += 1
    except KeyError:
        cwords[word] = 1
Now, finally, you can get the number of unique words by
len(cwords)
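The same tally can be had more compactly with the set approach from the earlier answers, or with collections.Counter (a small sketch, not part of the original answer):

import re
from collections import Counter

text = "There is one handsome boy. The boy has now grown up. He is no longer a boy now."
words = re.findall(r'\w+', text.lower())

print(len(set(words)))      # 14 unique words
print(len(Counter(words)))  # same count via Counter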
