Okay, these two functions are related to each other. Fortunately the first one is solved, but the other is a big mess: it should give me 17.5, but it only gives me 3. Why doesn't it work?
def split_on_separators(original, separators):
    """ (str, str) -> list of str
    Return a list of non-empty, non-blank strings from the original string
    determined by splitting the string on any of the separators.
    separators is a string of single-character separators.
    >>> split_on_separators("Hooray! Finally, we're done.", "!,")
    ['Hooray', ' Finally', " we're done."]
    """
    result = []
    newstring = ''
    for index, char in enumerate(original):
        if char in separators or index == len(original) - 1:
            result.append(newstring)
            newstring = ''
            if '' in result:
                result.remove('')
        else:
            newstring += char
    return result
def average_sentence_length(text):
    """ (list of str) -> float
    Precondition: text contains at least one sentence. A sentence is defined
    as a non-empty string of non-terminating punctuation surrounded by
    terminating punctuation or beginning or end of file. Terminating
    punctuation is defined as !?.
    Return the average number of words per sentence in text.
    >>> text = ['The time has come, the Walrus said\n',
    'To talk of many things: of shoes - and ships - and sealing wax,\n',
    'Of cabbages; and kings.\n'
    'And why the sea is boiling hot;\n'
    'and whether pigs have wings.\n']
    >>> average_sentence_length(text)
    17.5
    """
    words = 0
    Sentences = 0
    for line in text:
        words += 1
        sentence = split_on_separators(text, '?!.')
        for sep in sentence:
            Sentences += 1
    ASL = words / Sentences
    return ASL
Words can be counted by splitting each sentence in the list on spaces and counting the length of that list. Any help would be appreciated.
You can eliminate the need for your first function by using regular expressions to split on separators; the regular expression function is re.split(). Here is a cleaned-up version that gets the right result:
import re

def average_sentence_length(text):
    # Join all the text into one string and remove all newline characters.
    # Joining all text into one string lets us find the sentences much more
    # easily, since multiple list items in 'text' could be one whole sentence
    text = "".join(text).replace('\n', '')
    # Use regex to split the sentences at delimiter characters !?.
    # Filter out any empty strings that result from this function,
    # otherwise they will count as words later on
    sentences = filter(None, re.split('[!?.]', text))
    # Running word total (a float, so the division below works in Python 2)
    wordsum = 0.0
    for s in sentences:
        # Split each sentence (s) into its separate words and add the
        # count to the wordsum variable
        words = s.split(' ')
        wordsum += len(words)
    return wordsum / len(sentences)
data = ['The time has come, the Walrus said\n',
        ' To talk of many things: of shoes - and ships - and sealing wax,\n',
        'Of cabbages; and kings.\n'
        'And why the sea is boiling hot;\n'
        'and whether pigs have wings.\n']

print average_sentence_length(data)
The one issue with this function is that with the text you provided, it returns 17.0 instead of 17.5. This is because there is no space in between "...the Walrus said" and "To talk of...". There is nothing that can be done there besides adding the space that should be there in the first place.
If the first function (split_on_separators) is required for the project, then you can replace the re.split() call with your function. Using regular expressions is a bit more reliable and a lot more lightweight than writing an entire function for it, however.
EDIT
I forgot to explain the filter() function. Basically, if you pass None as the first argument, it takes the second argument and removes every "falsy" item from it. Since an empty string is considered false in Python, it is removed. You can read more about filter() here.
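For example, in the interpreter (Python 2 here, to match the code above; in Python 3, filter() returns an iterator, so you would wrap it in list()):
>>> filter(None, ['Hooray', '', ' Finally', ''])
['Hooray', ' Finally']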
Basically, I'm writing Python code where a user inputs a sentence. However, I need my code to remove ALL whitespace (e.g. tabs and spaces) and print the result.
This is what I have so far:
def output_without_whitespace(text):
    newText = text.split("")
    print('String with no whitespaces: '.join(newText))
I'm clear that I'm doing a lot wrong here and I'm missing plenty, but, I haven't been able to thoroughly go over splitting and joining strings yet, so it'd be great if someone explained it to me.
This is the whole code that I have so far:
text = input(str('Enter a sentence: '))
print(f'You entered: {text}')

def get_num_of_characters(text):
    result = 0
    for char in text:
        result += 1
    return result

print('Number of characters: ', get_num_of_characters(text))

def output_without_whitespace(text):
    newtext = "".join(text.split())
    print(f'String without whitespaces: {newtext}')
I FIGURED OUT MY PROBLEM!
I realized that this line of code:
print(f'String without whitespaces: {newtext}')
is supposed to be:
print('String without whitespaces: ', output_without_whitespace(text))
The reason the sentence without whitespace was not printing back out to me was that I was never calling my function!
You have the right idea, but here's how to implement it with split and join:
def output_without_whitespace(text):
    return ''.join(text.split())
so that:
output_without_whitespace(' this\t is a\n test..\n ')
would return:
thisisatest..
A trivial solution is to just use split and rejoin (similar to what you are doing):
def output_without_whitespace(text):
    return ''.join(text.split())
First we split the initial string into a list of words, then we join them all back together.
So to think about it a bit:
text.split()
will give us a list of words (split by any whitespace). So for example:
'hello world'.split() -> ['hello', 'world']
And finally
''.join(<result of text.split()>)
joins all of the words in the given list to a single string. So:
''.join(['hello', 'world']) -> 'helloworld'
See Remove all whitespace in a string in Python for more ways to do it.
Get input, split, join
s = ''.join(input('Enter string: ').split())
print(s)
Enter string: vash the stampede
vashthestampede
There are a few different ways to do this, but this seems the most obvious one to me. It is simple and efficient.
>>> with_spaces = ' The quick brown fox '
>>> list_no_spaces = with_spaces.split()
>>> ''.join(list_no_spaces)
'Thequickbrownfox'
.split() with no parameter splits a string into a list wherever there's one or more whitespace characters, leaving out the whitespace... more details here.
''.join(list_no_spaces) joins the elements of the list into a string with nothing between them, which is what you want here: 'Thequickbrownfox'.
If you had used ','.join(list_no_spaces) you'd get 'The,quick,brown,fox'.
Experienced Python programmers tend to use regular expressions sparingly. Often it's better to use tools like .split() and .join() to do the work, and keep regular expressions for where there is no alternative.
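To make that concrete, here are both approaches on the same input; the regex version is shown only for comparison:
>>> import re
>>> ''.join(' The quick brown fox '.split())
'Thequickbrownfox'
>>> re.sub(r'\s+', '', ' The quick brown fox ')
'Thequickbrownfox'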
I need to print how many characters there are in a sentence the user specifies, how many words there are in that sentence, and, for each word, the number of letters and the first and last letter. Can this be done?
I want you to take your time to understand what is going on in the code below, and I suggest you read these resources:
http://docs.python.org/3/library/re.html
http://docs.python.org/3/library/functions.html#len
http://docs.python.org/3/library/functions.html
http://docs.python.org/3/library/stdtypes.html#str.split
import re

def count_letter(word):
    """(str) -> int
    Return the number of letters in a word.
    >>> count_letter('cat')
    3
    >>> count_letter('cat1')
    3
    """
    return len(re.findall('[a-zA-Z]', word))

if __name__ == '__main__':
    sentence = input('Please enter your sentence: ')
    words = re.sub(r"[^\w]", " ", sentence).split()
    # The number of characters in the sentence.
    print(len(sentence))
    # The number of words in the sentence.
    print(len(words))
    # Print all the words in the sentence, the number of letters, the first
    # and last letter.
    for i in words:
        print(i, count_letter(i), i[0], i[-1])
Please enter your sentence: hello user
10
2
hello 5 h o
user 4 u r
Please read Python's string documentation; it is self-explanatory. Here is a short explanation of the different parts, with some comments.
We know that a sentence is composed of words, each of which is composed of letters. What we have to do first is split the sentence into words. Each entry in the resulting list is a word, and each word is stored as a succession of characters, so we can get at each of them.
sentence = "This is my sentence"
# split the sentence
words = sentence.split()
# use len() to obtain the number of elements (words) in the list words
print('There are {} words in the given sentence'.format(len(words)))
# go through each word
for word in words:
# len() counts the number of elements again,
# but this time it's the chars in the string
print('There are {} characters in the word "{}"'.format(len(word), word))
# python is a 0-based language, in the sense that the first element is indexed at 0
# you can go backward in an array too using negative indices.
#
# However, notice that the last element is at -1 and second to last is -2,
# it can be a little bit confusing at the beginning when we know that the second
# element from the start is indexed at 1 and not 2.
print('The first being "{}" and the last "{}"'.format(word[0], word[-1]))
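If I run this, the output should be:
There are 4 words in the given sentence
There are 4 characters in the word "This"
The first being "T" and the last "s"
There are 2 characters in the word "is"
The first being "i" and the last "s"
There are 2 characters in the word "my"
The first being "m" and the last "y"
There are 8 characters in the word "sentence"
The first being "s" and the last "e"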
We don't do your homework for you on Stack Overflow... but I will get you started.
The most important function you will need is one of these two (depending on your version of Python):
Python 3.x - input([prompt]): If the prompt argument is present, it is written to standard output without a trailing newline. The function then reads a line from input, converts it to a string (stripping a trailing newline), and returns that. When EOF is read, EOFError is raised. http://docs.python.org/3/library/functions.html#input
Python 2.x - raw_input([prompt]): If the prompt argument is present, it is written to standard output without a trailing newline. The function then reads a line from input, converts it to a string (stripping a trailing newline), and returns that. When EOF is read, EOFError is raised. http://docs.python.org/2.7/library/functions.html#raw_input
You can use them like this:
>>> my_sentance = raw_input("Do you want us to do your homework?\n")
Do you want us to do your homework?
yes
>>> my_sentance
'yes'
As you can see, the text written was stored in the my_sentance variable.
To get the number of characters in a string, you need to understand that a string is really just a sequence of characters! So if you want to know the number of characters, you can use:
len(s): Return the length (the number of items) of an object. The argument may be a sequence (string, tuple or list) or a mapping (dictionary). http://docs.python.org/3/library/functions.html#len
I'll let you figure out how to use it.
Finally, you're going to need to use a built-in string method:
str.split([sep[, maxsplit]]): Return a list of the words in the string, using sep as the delimiter string. If maxsplit is given, at most maxsplit splits are done (thus, the list will have at most maxsplit+1 elements). If maxsplit is not specified or -1, then there is no limit on the number of splits (all possible splits are made). http://docs.python.org/2/library/stdtypes.html#str.split
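As a small nudge (not a full solution), here's how those two pieces behave on their own:
>>> len('hello user')
10
>>> 'hello user'.split()
['hello', 'user']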
I'm new to Python and I need to calculate the average number of characters per word in a list,
using these definitions and the helper function clean_up.
a token is a str that you get from calling the string method split on a line of a file.
a word is a non-empty token from the file that isn't completely made up of punctuation.
find the "words" in a file by using str.split to find the tokens and then removing the punctuation from the words using the helper function clean_up.
A sentence is a sequence of characters that is terminated by (but doesn't include) the characters !, ?, . or the end of the file, excludes whitespace on either end, and is not empty.
This is a homework question from my computer science class in college.
The clean_up function is:
def clean_up(s):
    punctuation = """!"',;:.-?)([]<>*#\n\\"""
    result = s.lower().strip(punctuation)
    return result
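For example (note that strip() only removes punctuation from the ends of the token, so an internal apostrophe survives):
>>> clean_up("'Can't!'")
"can't"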
My code is:
def average_word_length(text):
    """ (list of str) -> float
    Precondition: text is non-empty. Each str in text ends with \n and at
    least one str in text contains more than just \n.
    Return the average length of all words in text. Surrounding punctuation
    is not counted as part of the words.
    >>> text = ['James Fennimore Cooper\n', 'Peter, Paul and Mary\n']
    >>> average_word_length(text)
    5.142857142857143
    """
    for ch in text:
        word = ch.split()
        clean = clean_up(ch)
        average = len(clean) / len(word)
    return average
I get 5.0, but I am really confused; some help would be greatly appreciated :)
P.S. I'm using Python 3.
Let's clean up some of these functions with imports and generator expressions, shall we?
import string

def clean_up(s):
    # I'm assuming you REQUIRE this function as per your assignment;
    # otherwise, just substitute str.strip(string.punctuation) anywhere
    # you'd otherwise call clean_up(str)
    return s.strip(string.punctuation)

def average_word_length(text):
    total_length = sum(len(clean_up(word)) for sentence in text for word in sentence.split())
    num_words = sum(len(sentence.split()) for sentence in text)
    return total_length / num_words
You may notice this actually condenses to a lengthy and unreadable one-liner:
average = sum(len(word.strip(string.punctuation)) for sentence in text for word in sentence.split()) / sum(len(sentence.split()) for sentence in text)
It's gross and disgusting, which is why you shouldn't do it ;). Readability counts and all that.
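As a sanity check against the docstring example (in Python 3, where / is true division):
>>> average_word_length(['James Fennimore Cooper\n', 'Peter, Paul and Mary\n'])
5.142857142857143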
This is a short and sweet method to solve your problem that is still readable.
def clean_up(word, punctuation="!\"',;:.-?)([]<>*#\n\\"):
    return word.lower().strip(punctuation)  # you don't really need ".lower()"

def average_word_length(text):
    cleaned_words = [clean_up(w) for w in (w for l in text for w in l.split())]
    return sum(map(len, cleaned_words)) / len(cleaned_words)  # Python 2: use float
>>> average_word_length(['James Fennimore Cooper\n', 'Peter, Paul and Mary\n'])
5.142857142857143
The burden of satisfying all those preconditions falls to you.
I want to create a list of tags from a single user-supplied input box, separated by commas, and I'm looking for some expression(s) that can help automate this.
What I want is to supply the input field and:
remove all double+ whitespaces, tabs, and newlines (leaving just single spaces)
remove ALL (single and repeated) quotation marks, except for commas, of which there can be only one in a row
in between each comma, I want Something Like Title Case, but excluding the first word and not at all for single words, so that when the last spaces are removed, the tag comes out as 'somethingLikeTitleCase' or just 'something' or 'twoWords'
and finally, remove all remaining spaces
Here's what I have gathered around SO so far:
def no_whitespace(s):
    """Remove all whitespace & newlines."""
    return re.sub(r"(?m)\s+", "", s)

# remove spaces, newlines, all whitespace
# http://stackoverflow.com/a/42597/523051
tag_list = ''.join(no_whitespace(tags_input))

# split into a list at commas
tag_list = tag_list.split(',')

# remove any empty strings (since I currently don't know how to remove double commas)
# http://stackoverflow.com/questions/3845423/remove-empty-strings-from-a-list-of-strings
tag_list = filter(None, tag_list)
I'm lost, though, when it comes to modifying that regex to remove all the punctuation except commas, and I don't even know where to begin with the capitalization.
Any thoughts to get me going in the right direction?
As suggested, here are some sample inputs and desired outputs.
The form input 'tHiS iS a tAg, 'whitespace' !&#^ , secondcomment , no!punc$$, ifNOSPACESthenPRESERVEcaps' should come out as
['thisIsATag', 'secondcomment', 'noPunc', 'ifNOSPACESthenPRESERVEcaps']
Here's an approach to the problem (that doesn't use any regular expressions, although there's one place where it could). We split up the problem into two functions: one function which splits a string into comma-separated pieces and handles each piece (parseTags), and one function which takes a string and processes it into a valid tag (sanitizeTag). The annotated code is as follows:
# This function takes a string with commas separating raw user input, and
# returns a list of valid tags made by sanitizing the strings between the
# commas.
def parseTags(str):
    # First, we split the string on commas.
    rawTags = str.split(',')
    # Then, we sanitize each of the tags. If sanitizing gives us back None,
    # then the tag was invalid, so we leave those cases out of our final
    # list of tags. We can use None as the predicate because sanitizeTag
    # will never return '', which is the only falsy string.
    return filter(None, map(sanitizeTag, rawTags))

# This function takes a single proto-tag---the string in between the commas
# that will be turned into a valid tag---and sanitizes it. It either
# returns an alphanumeric string (if the argument can be made into a valid
# tag) or None (if the argument cannot be made into a valid tag; i.e., if
# the argument contains only whitespace and/or punctuation).
def sanitizeTag(str):
    # First, we turn non-alphanumeric characters into whitespace. You could
    # also use a regular expression here; see below.
    str = ''.join(c if c.isalnum() else ' ' for c in str)
    # Next, we split the string on spaces, ignoring leading and trailing
    # whitespace.
    words = str.split()
    # There are now three possibilities: there are no words, there was one
    # word, or there were multiple words.
    numWords = len(words)
    if numWords == 0:
        # If there were no words, the string contained only spaces (and/or
        # punctuation). This can't be made into a valid tag, so we return
        # None.
        return None
    elif numWords == 1:
        # If there was only one word, that word is the tag, no
        # post-processing required.
        return words[0]
    else:
        # Finally, if there were multiple words, we camel-case the string:
        # we lowercase the first word, capitalize the first letter of all
        # the other words and lowercase the rest, and finally stick all
        # these words together without spaces.
        return words[0].lower() + ''.join(w.capitalize() for w in words[1:])
And indeed, if we run this code, we get:
>>> parseTags("tHiS iS a tAg, \t\n!&#^ , secondcomment , no!punc$$, ifNOSPACESthenPRESERVEcaps")
['thisIsATag', 'secondcomment', 'noPunc', 'ifNOSPACESthenPRESERVEcaps']
There are two points in this code that it's worth clarifying. First is the use of str.split() in sanitizeTags. This will turn a b c into ['a','b','c'], whereas str.split(' ') would produce ['','a','b','c','']. This is almost certainly the behavior you want, but there's one corner case. Consider the string tAG$. The $ gets turned into a space, and is stripped out by the split; thus, this gets turned into tAG instead of tag. This is probably what you want, but if it isn't, you have to be careful. What I would do is change that line to words = re.split(r'\s+', str), which will split the string on whitespace but leave in the leading and trailing empty strings; however, I would also change parseTags to use rawTags = re.split(r'\s*,\s*', str). You must make both these changes; 'a , b , c'.split(',') becomes ['a ', ' b ', ' c'], which is not the behavior you want, whereas r'\s*,\s*' deletes the space around the commas too. If you ignore leading and trailing white space, the difference is immaterial; but if you don't, then you need to be careful.
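To see that corner case concretely (with the isalnum-based sanitizeTag above):
>>> sanitizeTag('tAG$')
'tAG'
With the re.split-based variant just described, the same input would come out as 'tag' instead.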
Finally, there's the non-use of regular expressions, and instead the use of str = ''.join(c if c.isalnum() else ' ' for c in str). You can, if you want, replace this with a regular expression. (Edit: I removed some inaccuracies about Unicode and regular expressions here.) Ignoring Unicode, you could replace this line with
str = re.sub(r'[^A-Za-z0-9]', ' ', str)
This uses [^...] to match everything but the listed characters: ASCII letters and numbers. However, it's better to support Unicode, and it's easy, too. The simplest such approach is
str = re.sub(r'\W', ' ', str, flags=re.UNICODE)
Here, \W matches non-word characters; a word character is a letter, a number, or the underscore. With flags=re.UNICODE specified (the flags argument is not available before Python 2.7; in earlier versions you can use r'(?u)\W' instead), letters and numbers are both any appropriate Unicode characters; without it, they're just ASCII. If you don't want the underscore, you can add |_ to the regex to match underscores as well, replacing them with spaces too:
str = re.sub(r'\W|_', ' ', str, flags=re.UNICODE)
This last one, I believe, matches the behavior of my non-regex-using code exactly.
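For instance, combined with the split() step used in sanitizeTag (my own toy input):
>>> re.sub(r'\W|_', ' ', 'no!punc_$$', flags=re.UNICODE).split()
['no', 'punc']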
Also, here's how I'd write the same code without those comments; this also allows me to eliminate some temporary variables. You might prefer the code with the variables present; it's just a matter of taste.
def parseTags(str):
    return filter(None, map(sanitizeTag, str.split(',')))

def sanitizeTag(str):
    words = ''.join(c if c.isalnum() else ' ' for c in str).split()
    numWords = len(words)
    if numWords == 0:
        return None
    elif numWords == 1:
        return words[0]
    else:
        return words[0].lower() + ''.join(w.capitalize() for w in words[1:])
To handle the newly-desired behavior, there are two things we have to do. First, we need a way to fix the capitalization of the first word: lowercase the whole thing if the first letter's lowercase, and lowercase everything but the first letter if the first letter's upper case. That's easy: we can just check directly. Secondly, we want to treat punctuation as completely invisible: it shouldn't uppercase the following words. Again, that's easy—I even discuss how to handle something similar above. We just filter out all the non-alphanumeric, non-whitespace characters rather than turning them into spaces. Incorporating those changes gives us
def parseTags(str):
    return filter(None, map(sanitizeTag, str.split(',')))

def sanitizeTag(str):
    words = filter(lambda c: c.isalnum() or c.isspace(), str).split()
    numWords = len(words)
    if numWords == 0:
        return None
    elif numWords == 1:
        return words[0]
    else:
        words0 = words[0].lower() if words[0][0].islower() else words[0].capitalize()
        return words0 + ''.join(w.capitalize() for w in words[1:])
Running this code gives us the following output
>>> parseTags("tHiS iS a tAg, AnD tHIs, \t\n!&#^ , se#%condcomment$ , No!pUnc$$, ifNOSPACESthenPRESERVEcaps")
['thisIsATag', 'AndThis', 'secondcomment', 'NopUnc', 'ifNOSPACESthenPRESERVEcaps']
You could use a whitelist of characters allowed to be in a word; everything else is ignored:
import re

def camelCase(tag_str):
    words = re.findall(r'\w+', tag_str)
    nwords = len(words)
    if nwords == 1:
        return words[0]  # leave unchanged
    elif nwords > 1:  # make it camelCaseTag
        return words[0].lower() + ''.join(map(str.title, words[1:]))
    return ''  # no word characters
This example uses \w to match word characters.
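For instance, on one of the sample tags:
>>> re.findall(r'\w+', 'no!punc$$')
['no', 'punc']
>>> camelCase('no!punc$$')
'noPunc'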
Example
tags_str = """ 'tHiS iS a tAg, 'whitespace' !&#^ , secondcomment , no!punc$$,
ifNOSPACESthenPRESERVEcaps' """
print("\n".join(filter(None, map(camelCase, tags_str.split(',')))))
Output
thisIsATag
whitespace
secondcomment
noPunc
ifNOSPACESthenPRESERVEcaps
I think this should work:
import re

def toCamelCase(s):
    # remove all punctuation
    # (modify to include other characters you may want to keep)
    s = re.sub(r"[^a-zA-Z0-9\s]", "", s)
    # remove leading spaces
    s = re.sub(r"^\s+", "", s)
    # camel case: uppercase each letter that follows a space
    s = re.sub(r"\s[a-z]", lambda m: m.group(0)[1].upper(), s)
    # remove all remaining punctuation and spaces
    s = re.sub(r"[^a-zA-Z0-9]", "", s)
    return s

tag_list = [s for s in (toCamelCase(s.lower()) for s in tag_list.split(',')) if s]
The key here is to make use of re.sub to make the replacements you want.
EDIT : Doesn't preserve caps, but does handle uppercase strings with spaces
EDIT : Moved "if s" after the toCamelCase call
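A quick check, using one of the sample tags from the question (after the .lower() call in the list comprehension above):
>>> toCamelCase('tHiS iS a tAg'.lower())
'thisIsATag'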
A beginner's Python question:
I have a string with x number of sentences. How do I extract the first 2 sentences (each may end with . or ? or !)?
Ignoring considerations such as when a . does not constitute the end of a sentence:
import re
' '.join(re.split(r'(?<=[.?!])\s+', phrase, 2)[:-1])
EDIT: Another approach that just occurred to me is this:
re.match(r'(.*?[.?!](?:\s+.*?[.?!]){0,1})', phrase).group(1)
Notes:
Whereas the first solution lets you replace the 2 with some other number to choose a different number of sentences, in the second solution, you change the 1 in {0,1} to one less than the number of sentences you want to extract.
The second solution isn't quite as robust in handling, e.g., empty strings, or strings with no punctuation. It could be made so, but the regex would be even more complex than it is already, and I would favour the slightly less efficient first solution over an unreadable mess.
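For example, with a made-up phrase (my own test string, not from the question):
>>> import re
>>> phrase = 'One sentence. Two sentences? Three sentences!'
>>> ' '.join(re.split(r'(?<=[.?!])\s+', phrase, 2)[:-1])
'One sentence. Two sentences?'
>>> re.match(r'(.*?[.?!](?:\s+.*?[.?!]){0,1})', phrase).group(1)
'One sentence. Two sentences?'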
I solved it like this: Separating sentences. A comment on that post also points to NLTK, though I don't know how to find the sentence segmenter on their site...
Here's how you could do it:
str = "Sentence one? Sentence two. Sentence three? Sentence four. Sentence five."
sentences = str.split(".")
allSentences = []
for sentence in sentences
allSentences.extend(sentence.split("?"))
print allSentences[0:3]
There are probably better ways; I look forward to seeing them.
Here is a step-by-step explanation of how to disassemble the string, choose the first two sentences, and reassemble it. As noted by others, this does not take into account that not all dot/question/exclamation characters are really sentence separators.
import re

testline = "Sentence 1. Sentence 2? Sentence 3! Sentence 4. Sentence 5."

# split off the first two sentences at the dot/question/exclamation
sentences = re.split('([.?!])', testline, 2)
print "result of split: ", sentences

# toss everything else (the last item in the list)
firstTwo = sentences[:-1]
print firstTwo

# put the first two sentences back together
finalLine = ''.join(firstTwo)
print finalLine
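If I've followed the code correctly, this should print:
result of split:  ['Sentence 1', '.', ' Sentence 2', '?', ' Sentence 3! Sentence 4. Sentence 5.']
['Sentence 1', '.', ' Sentence 2', '?']
Sentence 1. Sentence 2?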
A generator alternative, using my utility function that yields pieces of the string up to any item in the search sequence:
from itertools import islice

testline = "Sentence 1. Sentence 2? Sentence 3! Sentence 4. Sentence 5."

def multis(search_sequence, text, start=0):
    """ Multisearch for the given search-sequence values in text, starting
    from position start, yielding tuples of (text before found item,
    found sequence item). """
    x = ''
    for ch in text[start:]:
        if ch in search_sequence:
            if x:
                yield (x, ch)
            else:
                yield ch
            x = ''
        else:
            x += ch
    else:
        if x:
            yield x

# take the first two sentences, split by the dot/question/exclamation
two_sentences = list(islice(multis('.?!', testline), 2))  # must save the result of generation
print "result of split: ", two_sentences
print '\n'.join(sentence.strip() + sep for sentence, sep in two_sentences)
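For the test line above, this should print:
result of split:  [('Sentence 1', '.'), (' Sentence 2', '?')]
Sentence 1.
Sentence 2?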