Ellipsizing list joins, Pythonically


I learned about list comprehensions a few days ago, and now I think I’ve gone a little crazy with them, trying to make them solve all the problems. Perhaps I don’t truly understand them yet, or I just don’t know enough Python yet to make them both powerful and simple. This problem has occupied me for a while now, and I’d appreciate any input.
The Problem
In Python, join a list of strings words into a single string excerpt that satisfies these conditions:
a single space separates elements
the final length of excerpt does not exceed integer maximum_length
if not all elements of words fit in excerpt, append an ellipsis character … to excerpt
only whole elements of words appear in excerpt
The Ugly Solution
words = ('Your mother was a hamster and your ' +
         'father smelled of elderberries!').split()
maximum_length = 29
excerpt = ' '.join(words) if len(' '.join(words)) <= maximum_length else \
    ' '.join(words[:max([n for n in range(0, len(words)) if \
                         len(' '.join(words[:n]) + '\u2026') <= \
                         maximum_length])]) + '\u2026'
print(excerpt)  # Your mother was a hamster…
print(len(excerpt))  # 26
Yup, that works. Your mother was a hamster and fits in 29, but leaves no room for the ellipsis. But boy is it ugly. I can break it up a little:
words = ('Your mother was a hamster and your ' +
         'father smelled of elderberries!').split()
maximum_length = 29
excerpt = ' '.join(words)
if len(excerpt) > maximum_length:
    maximum_words = max([n for n in range(0, len(words)) if \
                         len(' '.join(words[:n]) + '\u2026') <= \
                         maximum_length])
    excerpt = ' '.join(words[:maximum_words]) + '\u2026'
print(excerpt)  # Your mother was a hamster…
But now I’ve made a variable I’m never going to use again. Seems like a waste. And it hasn’t really made anything prettier or easier to understand.
Is there a nicer way to do this that I just haven’t seen yet?

See my comment about why "simple is better than complex".
That said, here's a suggestion:
l = 'Your mother was a hamster and your father smelled of elderberries!'
maximum_length = 29
suffix = "..."
excerpt = l
if len(l) > maximum_length:
    # Cut at the last space that still leaves room for the suffix.
    last_space = l.rfind(' ', 0, maximum_length - len(suffix) + 1)
    excerpt = l[:last_space] + suffix
print(excerpt)
This is not 100% what you need, but it is rather easy to extend.

My humble opinion is that you are right: a list comprehension is not necessary for this task. I would first get all the words in a list with split, then use a while loop that removes words one at a time from the end of the list until len(' '.join(list)) <= maximum_length.
I would also shorten maximum_length by 3 (the length of the ellipsis "...") and, after the while loop ends, add the "..." as the last element of the list.
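For what it's worth, here is a minimal sketch of the loop described above. The names ellipsize and kept are made up for illustration, and it appends the ellipsis directly to the last kept word instead of adding it as a list element, so that no space ends up in front of it. The length of the ellipsis is taken from the ellipsis argument rather than hard-coded as 3.

```python
def ellipsize(words, maximum_length, ellipsis="\u2026"):
    """Join words with single spaces, dropping trailing words until it fits."""
    excerpt = " ".join(words)
    if len(excerpt) <= maximum_length:
        # Everything fits, so no ellipsis is needed.
        return excerpt
    kept = list(words)
    # Drop one word at a time from the end until text plus ellipsis fits.
    while kept and len(" ".join(kept)) + len(ellipsis) > maximum_length:
        kept.pop()
    return " ".join(kept) + ellipsis


words = ("Your mother was a hamster and your "
         "father smelled of elderberries!").split()
print(ellipsize(words, 29))
```

Note that if not even the first word fits, this sketch returns just the ellipsis; whether that is acceptable depends on your use case.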

You can trim the excerpt down to maximum_length. Then use rsplit to drop everything after the last space (a possibly partial word) and append the ellipsis:
def append_ellipsis(words, length=29):
    joined = ' '.join(words)
    if len(joined) <= length:
        # Everything fits, so no ellipsis is needed.
        return joined
    excerpt = joined[:length]
    # On Python 3.x you can pass `maxsplit=1` as a keyword to `rsplit` instead;
    # the positional form below also works on Python 2.x.
    return excerpt.rsplit(' ', 1)[0] + '\u2026'
words = ('Your mother was a hamster and your ' +
         'father smelled of elderberries!').split()
result = append_ellipsis(words)
print(result)  # Your mother was a hamster…
print(len(result))  # 26
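As an aside, the standard library has textwrap.shorten (Python 3.4+), which does something very close to this. A small sketch follows; note that it takes a single string rather than a list of words, collapses runs of whitespace, and that the default placeholder is ' [...]', so the ellipsis character has to be passed explicitly.

```python
import textwrap

text = "Your mother was a hamster and your father smelled of elderberries!"
# width is the maximum length of the result, placeholder included.
excerpt = textwrap.shorten(text, width=29, placeholder="\u2026")
print(excerpt)
print(len(excerpt))
```

If you start from a list of words, joining them with a single space first should give the same kind of result.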

Related

How to count the number of words in the list, provided there is more than one?

For example I have a text,
text = '''
Wales’ greatest moment. Lille is so close to the Belgian border, this was
essentially a home game for one of the tournament favourites. Their confident supporters mingled
with their new Welsh fans on the streets, buying into the carnival spirit - perhaps
more relaxed than some might have been before a quarter-final because they
thought this was their time.
In the driving rain, Wales produced the best performance in their history to carry
the nation into uncharted territory. Nobody could quite believe it.
'''
I need to count how many times a word occurs in this text; the word is entered with input().
The data type must be a list, dict, or set; this is a required condition.
It is also not clear to me how to ignore punctuation marks.
This is my solution, but perhaps there is a cleaner way.
text = list(text.split(' '))
word = input('Enter a word: ')
for i in text:
    if text.count(word) < 2:
        break
    if word in text:
        print(f'{word} - {text.count(word)}')
        break
Output:
this - 2
the - 7
The word 'moment' occurs only once in the text, so we do not print it.

You can think of this as two steps:
Clean the input
Find the count
A fast way to clean the input is to strip all the punctuation first using translate combined with string.punctuation:
import string
clean = text.translate(str.maketrans('', '', string.punctuation)).split()
Now you have all the text with no punctuation and can split it into words and count:
import string
clean = text.translate(str.maketrans('', '', string.punctuation)).split()
word = "this"
count = clean.count(word)
if count > 1:
    print(f'{word} - {count}')
    # prints: this - 2
Since you are using count you don't need to loop. Just be careful not to call count multiple times if you don't need to. Each time you do, it needs to look through the whole list. Notice above the code calls it once and saves it so we can use the count in multiple places.

You can use collections.Counter() to get a dictionary of the number of occurrences of each element in a list:
import collections
text = '''
Wales’ greatest moment. Lille is so close to the Belgian border, this was
essentially a home game for one of the tournament favourites. Their confident supporters mingled
with their new Welsh fans on the streets, buying into the carnival spirit - perhaps
more relaxed than some might have been before a quarter-final because they
thought this was their time.
In the driving rain, Wales produced the best performance in their history to carry
the nation into uncharted territory. Nobody could quite believe it.
'''
word = input('Enter a word: ')
# Remove punctuation from text, keeping letters, spaces, and newlines
# (dropping the newlines would glue the words at line ends together)
for char in text:
    if char.lower() not in "abcdefghijklmnopqrstuvwxyz \n":
        text = text.replace(char, "")
wordcount = collections.Counter(text.split())
print(f"{word} - {wordcount[word]}")
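If you want every word that occurs more than once, rather than looking up a single one, something along these lines might work. This is only a sketch: the sample sentence is made up for illustration, it lower-cases everything so that "This" and "this" count as the same word, and string.punctuation only covers ASCII punctuation.

```python
import collections
import string

# Hypothetical sample text, not the text from the question.
sample = "This was their time. They thought this was the time, the best time."

# Lower-case the text and strip ASCII punctuation before splitting into words.
table = str.maketrans("", "", string.punctuation)
counts = collections.Counter(sample.lower().translate(table).split())

# Keep only the words that occur more than once.
repeated = {word: n for word, n in counts.items() if n > 1}
print(repeated)
```

The result is a plain dict, so it also fits the list/dict/set requirement from the question.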

Why is the autograder giving me errors for CodeHS 8.4.9: Owls Part 2?

Here's the assignment:
This program is an extension of your previous ‘Owls’ program.
In addition to just reporting how many words contained the word owl, you should report the indices at which the words occurred!
Here’s what an example run of your program might look like:
Enter some text: Owls are so cool! I think snowy owls might be my favorite. Or maybe spotted owls.
There were 3 words that contained "owl".
They occurred at indices: [0, 7, 15]
As you can see from the output, you’ll have to use another list to store the indices where you found the words containing “owl”.
The enumerate function might also come in handy!
Here's my code now:
def owl_count(text):
    owl_lower = text.lower()
    owl_split = owl_lower.split()
    count = 0
    index = 0
    sec_in = []
    owl = "owl"
    for i in range(len(owl_split)):
        if owl in owl_split[index]:
            count = count + 1
            sec_in.append(index)
        index = index + 1
    print("There were " + str(count) + " words that contained owl.")
    return "They occurred at indices: " + str(sec_in)
text = "I really like owls. Did you know that an owl's eyes are more than twice as big as the eyes of other birds of comparable weight? And that when an owl partially closes its eyes during the day, it is just blocking out light? Sometimes I wish I could be an owl."
print(owl_count(text))
When I run the code it's perfectly fine. But when it has to go through the autograder it says I'm doing things wrong. And this input is what's getting me the least errors.
Here are some of the ones I used before if it helps:
Owlets are baby owls. Baby peafowls are called peachicks.
Owls are so cool! I think snowy owls might be my favorite. Or maybe spotted owls.
I think owls are pretty cool
Here's the link to the code I used to help me.

As mentioned in the comments, your code has a lot of redundant variables. Below is a tidied-up version of your code; apart from the fix described next, it should behave the same as yours.
I think the biggest issue is that your code prints the count line and then returns the indices line. If the autograder simply executes your function and ignores the return value, it will never see the indices. Note that the example code you cite prints both lines and doesn't return anything. This is corrected in the version below.
Note that this code uses enumerate. It is a very good function to get into the habit of using when you need the contents of a list and also need to track the index of each item.
def owl_count(text):
    owl_lower = text.lower()
    owl_split = owl_lower.split()
    sec_in = []
    owl = "owl"
    for index, word in enumerate(owl_split):
        if owl in word:
            sec_in.append(index)
    print("There were " + str(len(sec_in)) + " words that contained \"owl\".")
    print("They occurred at indices: " + str(sec_in))
text = "I really like owls. Did you know that an owl's eyes are more than twice as big as the eyes of other birds of comparable weight? And that when an owl partially closes its eyes during the day, it is just blocking out light? Sometimes I wish I could be an owl."
owl_count(text)
There is a more concise way to solve the problem, without a lot of variables that are only used to create others. You also have a for loop which could be a comprehension, so an even better version would be:
def owl_count(text):
    sec_in = [index for index, word in enumerate(text.lower().split())
              if 'owl' in word]
    print("There were " + str(len(sec_in)) + " words that contained \"owl\".")
    print("They occurred at indices: " + str(sec_in))
Update - 02/11/2020 - 14:12 - The expected result from the autograder expects quotes around the word 'owl' in the first output message. The above code has been updated to include those quotes.
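If you want to check the indices against the example run from the assignment before submitting, a small sketch like the one below may help. The helper name owl_indices is made up for illustration; it returns the list instead of printing it so the result is easy to compare, and I can't promise anything about what the autograder itself checks.

```python
def owl_indices(text):
    """Return the indices of the whitespace-separated words containing 'owl'."""
    return [index for index, word in enumerate(text.lower().split())
            if "owl" in word]


sample = ("Owls are so cool! I think snowy owls might be my favorite. "
          "Or maybe spotted owls.")
indices = owl_indices(sample)
print('There were ' + str(len(indices)) + ' words that contained "owl".')
print("They occurred at indices: " + str(indices))
```

For the sample sentence from the assignment this gives the indices 0, 7 and 15, which matches the expected output shown there.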

Templates with argument in string formatting

I'm looking for a package or any other approach (other than manual replacement) for the templates within string formatting.
I want to achieve something like this (this is just an example so you could get the idea, not the actual working code):
text = "I {what:like,love} {item:pizza,space,science}".format(what=2,item=3)
print(text)
So the output would be:
I love science
How can I achieve this? I have been searching but cannot find anything appropriate. I have probably been using the wrong search terms.
If there isn't a ready-to-use package around, I would love to read some tips on where to start coding this myself.

I think using a list is sufficient, since Python lists preserve the order of their elements:
what = ["like","love"]
items = ["pizza","space","science"]
text = "I {} {}".format(what[1],items[2])
print(text)
output:
I love science

Maybe use a list or a tuple for what and item, as both data types preserve insertion order.
what = ['like', 'love']
item = ['pizza', 'space', 'science']
text = "I {what} {item}".format(what=what[1],item=item[2])
print(text) # I love science
Or even this is possible:
text = "I {what[1]} {item[2]}".format(what=what, item=item)
print(text) # I love science
Hope this helps!

Why not use a dictionary?
options = {'what': ('like', 'love'), 'item': ('pizza', 'space', 'science')}
print("I " + options['what'][1] + ' ' + options['item'][2])
This returns: "I love science"
Or if you wanted a method to rid yourself of having to reformat to accommodate/remove spaces, then incorporate this into your dictionary structure, like so:
options = {'what': (' like', ' love'), 'item': (' pizza', ' space', ' science'), 'fullstop': '.'}
print("I" + options['what'][0] + options['item'][0] + options['fullstop'])
And this returns: "I like pizza."

Since no one has provided an answer that addresses my question directly, I decided to work on this myself.
I had to use double braces, because single ones are reserved for string formatting.
I ended up with the following class:
import re
class ArgTempl:
    def __init__(self, _str):
        self._str = _str
    def format(self, **args):
        # Work on a copy so repeated calls start from the original template.
        result = self._str
        for k in re.finditer(r"\{\{(\w+):([\w,]+?)\}\}", self._str):
            key, replacements = k.groups()
            if key not in args:
                continue
            result = result.replace(k.group(0), replacements.split(',')[args[key]])
        return result
This is primitive code written in five minutes, so it lacks checks and so on. It works as expected and can easily be improved.
Tested on Python 2.7 & 3.6~
Usage:
test = "I {{what:like,love}} {{item:pizza,space,science}}"
print(ArgTempl(test).format(what=1, item=2))
> I love science
Thanks for all of the replies.
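For anyone landing here later, the same idea can also be written with re.sub and a replacement function, which avoids the separate replace call. This is just a sketch under the same assumptions as the class above (double-brace placeholders, zero-based integer indices); the names PLACEHOLDER, fill_template and pick are made up for illustration.

```python
import re

# Matches placeholders such as {{what:like,love}}.
PLACEHOLDER = re.compile(r"\{\{(\w+):([\w,]+)\}\}")


def fill_template(template, **choices):
    """Replace each {{name:a,b,c}} with the option selected by choices[name]."""
    def pick(match):
        key = match.group(1)
        options = match.group(2).split(",")
        if key not in choices:
            # Leave placeholders without a matching argument untouched.
            return match.group(0)
        return options[choices[key]]
    return PLACEHOLDER.sub(pick, template)


template = "I {{what:like,love}} {{item:pizza,space,science}}"
print(fill_template(template, what=1, item=2))
```

As with the class, an index outside the list of options raises an IndexError, so some validation would be needed for untrusted input.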

Highlight certain words that appear in sequence

I'm trying to print a text while highlighting certain words and word bigrams. This would be fairly straightforward if I didn't also have to print the other tokens, such as punctuation.
I have a list of words to highlight and another list of word bigrams to highlight.
Highlighting individual words is fairly easy, like for example:
import re
import string
regex_pattern = re.compile("([%s \n])" % string.punctuation)
def highlighter(content, terms_to_highlight):
    tokens = regex_pattern.split(content)
    for token in tokens:
        if token.lower() in terms_to_highlight:
            print('\x1b[6;30;42m' + token + '\x1b[0m', end="")
        else:
            print(token, end="")
Only highlighting words that appear in sequence is more complex. I have been playing around with iterators but haven't been able to come up with anything that isn't overly complicated.

If I understand the question correctly, one solution is to look ahead to the next word token and check if the bigram is in the list.
import re
import string
regex_pattern = re.compile("([%s \n])" % string.punctuation)
def find_next_word(tokens, idx):
    nonword = string.punctuation + " \n"
    for i in range(idx+1, len(tokens)):
        if tokens[i] not in nonword:
            return (tokens[i], i)
    return (None, -1)
def highlighter(content, terms, bigrams):
    tokens = regex_pattern.split(content)
    idx = 0
    while idx < len(tokens):
        token = tokens[idx]
        (next_word, nw_idx) = find_next_word(tokens, idx)
        if token.lower() in terms:
            print('*' + token + '*', end="")
            idx += 1
        elif next_word and (token.lower(), next_word.lower()) in bigrams:
            concat = "".join(tokens[idx:nw_idx+1])
            print('-' + concat + '-', end="")
            idx = nw_idx + 1
        else:
            print(token, end="")
            idx += 1
terms = ['man', 'the']
bigrams = [('once', 'upon'), ('i','was')]
text = 'Once upon a time, as I was walking to the city, I met a man. As I was tired, I did not look once... upon this man.'
highlighter(text, terms, bigrams)
When called, this gives :
-Once upon- a time, as -I was- walking to *the* city, I met a *man*. As -I was- tired, I did not look -once... upon- this *man*.
Please note that:
this is a greedy algorithm: it matches the first bigram it finds. For instance, if you check for yellow banana and banana boat, yellow banana boat is always highlighted as -yellow banana- boat. If you want different behavior, you should update the test logic.
you probably also want to update the logic to handle the case where a word is both in terms and the first part of a bigram
I haven't tested all edge cases; some things may break or there may be fence-post errors
you can optimize performance if necessary by:
building a list of the first words of the bigrams and checking whether a word is in it before doing the look-ahead to the next word
and/or using the result of the look-ahead to handle in one step all the non-word tokens between two words (implementing this step should be enough to ensure linear performance)
Hope this helps.
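A rough sketch of the first optimization from the list above, in case it is useful: the look-ahead only happens when the current word is the first word of some bigram. The names highlight, SPLITTER, NONWORD and first_words are made up for illustration, and this version returns the highlighted string instead of printing it (an assumption made so the result is easy to check). It also uses sets and re.escape, which the code above does not.

```python
import re
import string

SPLITTER = re.compile("([%s \n])" % re.escape(string.punctuation))
NONWORD = set(string.punctuation + " \n")


def highlight(content, terms, bigrams):
    """Return content with terms wrapped in *...* and bigrams in -...-."""
    terms = set(terms)
    bigrams = set(bigrams)
    first_words = {first for first, _ in bigrams}  # cheap pre-check
    # Drop the empty strings that re.split emits between adjacent separators.
    tokens = [t for t in SPLITTER.split(content) if t]
    out = []
    idx = 0
    while idx < len(tokens):
        token = tokens[idx]
        lowered = token.lower()
        if lowered in terms:
            out.append("*" + token + "*")
            idx += 1
            continue
        if lowered in first_words:
            # Only now pay for the look-ahead to the next word token.
            nxt = idx + 1
            while nxt < len(tokens) and tokens[nxt] in NONWORD:
                nxt += 1
            if nxt < len(tokens) and (lowered, tokens[nxt].lower()) in bigrams:
                out.append("-" + "".join(tokens[idx:nxt + 1]) + "-")
                idx = nxt + 1
                continue
        out.append(token)
        idx += 1
    return "".join(out)


terms = ['man', 'the']
bigrams = [('once', 'upon'), ('i', 'was')]
text = ('Once upon a time, as I was walking to the city, I met a man. '
        'As I was tired, I did not look once... upon this man.')
print(highlight(text, terms, bigrams))
```

Like the original, it gives terms priority over bigrams and matches greedily, so the caveats listed above still apply.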

Merging or reversing n-grams to a single string

How do I merge the bigrams below to a single string?
_bigrams=['the school', 'school boy', 'boy is', 'is reading']
_split=(' '.join(_bigrams)).split()
_newstr=[]
_filter=[_newstr.append(x) for x in _split if x not in _newstr]
_newstr=' '.join(_newstr)
print(_newstr)
Output: 'the school boy is reading'. That is the desired output, but the approach is too long and not very efficient given the large size of my data. Secondly, the approach does not support duplicate words in the final string, e.g. 'the school boy is reading, is he?': only one 'is' would be kept in the final string in this case.
Any suggestions on how to make this work better? Thanks.

# The bigrams from the question
_bigrams = ['the school', 'school boy', 'boy is', 'is reading']
# Multi-for generator expression allows us to create a flat iterable of words
all_words = (word for bigram in _bigrams for word in bigram.split())
def no_runs_of_words(words):
    """Takes an iterable of words and returns one with any runs condensed."""
    prev_word = None
    for word in words:
        if word != prev_word:
            yield word
        prev_word = word
final_string = ' '.join(no_runs_of_words(all_words))
This takes advantage of generators to lazily evaluate and not keep the entire set of words in memory at the same time until generating the one final string.

If you really wanted a oneliner, something like this could work:
' '.join(val.split()[0] for val in (_bigrams)) + ' ' + _bigrams[-1].split()[-1]

Would this do it? It simply takes the first word of every entry except the last, which is kept whole:
_bigrams=['the school', 'school boy', 'boy is', 'is reading']
clause = [a.split()[0] if a != _bigrams[-1] else a for a in _bigrams]
print(' '.join(clause))
Output:
the school boy is reading
However, concerning performance probably Amber's solution is a good option
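Building on the "first word of each entry" idea, here is a hedged sketch that also covers the duplicate-word case from the question and n-grams longer than two words. It assumes the n-grams are consecutive and each one slides forward by exactly one word; the function name merge_ngrams is made up for illustration.

```python
def merge_ngrams(ngrams):
    """Rebuild a sentence from consecutive n-grams that overlap by n-1 words."""
    if not ngrams:
        return ""
    # Each n-gram contributes its first word; the last one contributes all words.
    words = [gram.split()[0] for gram in ngrams[:-1]] + ngrams[-1].split()
    return " ".join(words)


bigrams = ["the school", "school boy", "boy is",
           "is reading,", "reading, is", "is he?"]
print(merge_ngrams(bigrams))
```

Because it slices by position rather than comparing values, repeated words such as the second "is" are kept.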
