average number of characters per word in a list - python

I'm new to Python and I need to calculate the average number of characters per word in a list,
using these definitions and the helper function clean_up.
A token is a str that you get from calling the string method split on a line of a file.
A word is a non-empty token from the file that isn't completely made up of punctuation.
Find the "words" in a file by using str.split to find the tokens and then removing the punctuation from the words using the helper function clean_up.
A sentence is a sequence of characters that is terminated by (but doesn't include) the characters !, ?, . or the end of the file, excludes whitespace on either end, and is not empty.
This is a homework question from my computer science class in college.
The clean_up function is:
def clean_up(s):
    punctuation = """!"',;:.-?)([]<>*#\n\""""
    result = s.lower().strip(punctuation)
    return result
My code is:
def average_word_length(text):
    """ (list of str) -> float
    Precondition: text is non-empty. Each str in text ends with \n and at
    least one str in text contains more than just \n.
    Return the average length of all words in text. Surrounding punctuation
    is not counted as part of the words.
    >>> text = ['James Fennimore Cooper\n', 'Peter, Paul and Mary\n']
    >>> average_word_length(text)
    5.142857142857143
    """
    for ch in text:
        word = ch.split()
        clean = clean_up(ch)
        average = len(clean) / len(word)
    return average
I get 5.0, but I am really confused; some help would be greatly appreciated :)
PS I'm using python 3

Let's clean up some of these functions with imports and generator expressions, shall we?
import string

def clean_up(s):
    # I'm assuming you REQUIRE this function as per your assignment;
    # otherwise, just substitute str.strip(string.punctuation) anywhere
    # you'd otherwise call clean_up(str)
    return s.strip(string.punctuation)

def average_word_length(text):
    total_length = sum(len(clean_up(word)) for sentence in text for word in sentence.split())
    num_words = sum(len(sentence.split()) for sentence in text)
    return total_length / num_words
You may notice this actually condenses to a lengthy and unreadable one-liner:
average = sum(len(word.strip(string.punctuation)) for sentence in text for word in sentence.split()) / sum(len(sentence.split()) for sentence in text)
It's gross and disgusting, which is why you shouldn't do it ;). Readability counts and all that.
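For a quick sanity check, running the condensed approach against the doctest input from the question reproduces the expected average (this just repeats the two functions from above in one runnable snippet):

```python
import string

def clean_up(s):
    # strip surrounding punctuation only; interior characters are kept
    return s.strip(string.punctuation)

def average_word_length(text):
    total_length = sum(len(clean_up(word))
                       for sentence in text for word in sentence.split())
    num_words = sum(len(sentence.split()) for sentence in text)
    return total_length / num_words

text = ['James Fennimore Cooper\n', 'Peter, Paul and Mary\n']
print(average_word_length(text))  # → 5.142857142857143
```

The seven cleaned words have 36 characters in total, and 36 / 7 gives the expected value.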

This is a short and sweet method to solve your problem that is still readable.
def clean_up(word, punctuation="!\"',;:.-?)([]<>*#\n\\"):
    return word.lower().strip(punctuation)  # you don't really need ".lower()"

def average_word_length(text):
    cleaned_words = [clean_up(w) for w in (w for l in text for w in l.split())]
    return sum(map(len, cleaned_words)) / len(cleaned_words)  # in Python 2, use float
>>> average_word_length(['James Fennimore Cooper\n', 'Peter, Paul and Mary\n'])
5.142857142857143
The burden of checking all those preconditions falls to you.

Return first word in sentence? [duplicate]

Here's the question I have to answer for school:
For the purposes of this question, we will define a word as ending a sentence if that word is immediately followed by a period. For example, in the text "This is a sentence. The last sentence had four words.", the ending words are 'sentence' and 'words'. In a similar fashion, we will define the starting word of a sentence as any word that is preceded by the end of a sentence. The starting word from the previous example text would be "The". You do not need to consider the first word of the text as a starting word. Write a program that has:
An endwords function that takes a single string argument. This function must return a list of all sentence-ending words that appear in the given string. There should be no duplicate entries in the returned list, and the periods should not be included in the ending words.
The code I have so far is:
def startwords(astring):
    mylist = astring.split()
    if mylist.endswith('.') == True:
        return my list
but I don't know if I'm using the right approach. I need some help.
Several issues with your code. The following would be a simple approach. Create a list of bigrams and pick the second token of each bigram where the first token ends with a period:
def startwords(astring):
    mylist = astring.split()  # a list! Has no 'endswith' method
    bigrams = zip(mylist, mylist[1:])
    return [b[1] for b in bigrams if b[0].endswith('.')]
zip and list comprehensions are two things worth reading up on.
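To make the bigram trick concrete, here is a tiny standalone example of what zip produces here:

```python
words = "This is a sentence. The last sentence had four words.".split()

# pair each word with the word that follows it
bigrams = list(zip(words, words[1:]))

# keep the second word of each pair whose first word ends with a period
starts = [b for a, b in bigrams if a.endswith('.')]
print(starts)  # → ['The']
```

The final word never appears as the first element of a pair, which is exactly what we want: nothing can follow it.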
mylist = astring.split()
if mylist.endswith('.')
That cannot work; one of the reasons is that mylist is a list, and lists don't have an endswith method.
Another answer fixed your approach so let me propose a regular expression solution:
import re
print(re.findall(r"\.\s*(\w+)","This is a sentence. The last sentence had four words."))
This matches all words following a dot and optional spaces.
Result: ['The']
def endwords(astring):
    mylist = astring.split('.')
    temp_words = [x.rpartition(" ")[-1] for x in mylist if len(x) > 1]
    return list(set(temp_words))
This creates a set so there are no duplicates. The one-liner below loops over a list of fragments (the text split by "."), splits each fragment into words, and then indexes a single word out of the resulting list:
print (set([ x.split()[:-1][0] for x in s.split(".") if len(x.split())>0]))
In theory the if is not needed, but I couldn't make it work without it (it guards against the empty fragment after the final period).
This works as well:
print (set([ x.split() [len(x.split())-1] for x in s.split(".") if len(x.split())>0]))
This is one way to do it ->
#!/usr/bin/env python
from sets import Set  # Python 2; in Python 3, use the built-in set type

sentence = 'This is a sentence. The last sentence had four words.'
uniq_end_words = Set()

for word in sentence.split():
    if '.' in word:
        # check if the period (.) is at the end
        if '.' == word[len(word) - 1]:
            uniq_end_words.add(word.rstrip('.'))

print list(uniq_end_words)
Output (list of all the end words in a given sentence) ->
['words', 'sentence']
If your input string has a period inside one of its words (let's say the last word), something like this ->
'I like the documentation of numpy.random.rand.'
The output would be - ['numpy.random.rand']
And for input string 'I like the documentation of numpy.random.rand a lot.'
The output would be - ['lot']
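For what it's worth, the sets module no longer exists in Python 3; a minimal sketch of the same approach with the built-in set (my adaptation, with the same caveat about interior periods):

```python
sentence = 'This is a sentence. The last sentence had four words.'

uniq_end_words = set()
for word in sentence.split():
    # a word ends a sentence if the period is its very last character
    if word.endswith('.'):
        uniq_end_words.add(word.rstrip('.'))

print(sorted(uniq_end_words))  # → ['sentence', 'words']
```

str.endswith replaces the two-step "is a period in the word, and is it at the end" check from the original.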

Python - Extract hashtags from text; end at punctuation

For my programming class, I have to create a function according to the following description:
The parameter is a tweet. This function should return a list containing all of the hashtags in the tweet, in the order they appear in the tweet. Each hashtag in the returned list should have the initial hash symbol removed, and hashtags should be unique. (If a tweet uses the same hashtag twice, it is included in the list only once. The order of the hashtags should match the order of the first occurrence of each tag in the tweet.)
I am unsure how to make it so the hashtag ends when punctuation is encountered (see second doctest example). My current code is not outputting anything:
def extract(start, tweet):
    """ (str, str) -> list of str
    Return a list of strings containing all words that start with a specified character.
    >>> extract('#', "Make America Great Again, vote #RealDonaldTrump")
    ['RealDonaldTrump']
    >>> extract('#', "Vote Hillary! #ImWithHer #TrumpsNotMyPresident")
    ['ImWithHer', 'TrumpsNotMyPresident']
    """
    words = tweet.split()
    return [word[1:] for word in words if word[0] == start]

def strip_punctuation(s):
    """ (str) -> str
    Return a string, stripped of its punctuation.
    >>> strip_punctuation("Trump's in the lead... damn!")
    'Trumps in the lead damn'
    """
    return ''.join(c for c in s if c not in '!"#$%&\'()*+,-./:;<=>?#[\\]^_`{|}~')
def extract_hashtags(tweet):
    """ (str) -> list of str
    Return a list of strings containing all unique hashtags in a tweet.
    Outputted in order of appearance.
    >>> extract_hashtags("I stand with Trump! #MakeAmericaGreatAgain #MAGA #TrumpTrain")
    ['MakeAmericaGreatAgain', 'MAGA', 'TrumpTrain']
    >>> extract_hashtags("NEVER TRUMP. I'm with HER. Does #this! work?")
    ['this']
    """
    hashtags = extract('#', tweet)
    no_duplicates = []
    for item in hashtags:
        if item not in no_duplicates and item.isalnum():
            no_duplicates.append(item)
    result = []
    for hash in no_duplicates:
        for char in hash:
            if char.isalnum() == False and char != '#':
                hash == hash[:char.index()]
                result.append()
    return result
I'm pretty lost at this point; any help would be appreciated. Thank you in advance.
Note: we are not allowed to use regular expressions or import any modules.
You do look a little bit lost. The key to solving these types of problems is to divide the problem into smaller parts, solve those, and then combine the results. You've got every piece you need:
def extract_hashtags(tweet):
    # strip the punctuation on the tags you've extracted (directly)
    hashtags = [strip_punctuation(tag) for tag in extract('#', tweet)]
    # hashtags is now a list of hash-tags without any punctuation,
    # but possibly with duplicates
    result = []
    for tag in hashtags:
        # check that we haven't seen the tag already (we know it
        # doesn't contain punctuation at this point)
        if tag not in result:
            result.append(tag)
    return result
PS: this is a problem that is well suited for a regex solution, but if you want a fast strip_punctuation you could use:
def strip_punctuation(s):
    # note: this two-argument form of str.translate is Python 2 only
    return s.translate(None, '!"#$%&\'()*+,-./:;<=>?#[\\]^_`{|}~')
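On Python 3 that two-argument translate call raises a TypeError; a sketch of the equivalent uses str.maketrans's three-argument form, where the third argument lists characters to delete:

```python
# characters to delete (same set as in the answer above)
punct = '!"#$%&\'()*+,-./:;<=>?#[\\]^_`{|}~'
table = str.maketrans('', '', punct)

def strip_punctuation(s):
    # every character in punct is mapped to None, i.e. removed
    return s.translate(table)

print(strip_punctuation("Trump's in the lead... damn!"))  # → Trumps in the lead damn
```

Building the table once at module level keeps the per-call cost low, which is the point of the translate approach.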

How can you use Python to count the unique words (without special characters/ cases interfering) in a text document

I am new to Python and need some help with trying to come up with a text content analyzer that will help me find 7 things within a text file:
Total word count
Total count of unique words (without case and special characters interfering)
The number of sentences
Average words in a sentence
Find commonly used phrases (a phrase of 3 or more words used over 3 times)
A list of words used, in order of descending frequency (without case and special characters interfering)
The ability to accept input from STDIN, or from a file specified on the command line
So far I have this Python program to print total word count:
with open('/Users/name/Desktop/20words.txt', 'r') as f:
    p = f.read()
    words = p.split()
    wordCount = len(words)
    print "The total word count is:", wordCount
So far I have this Python program to print unique words and their frequency (it's not in order, and it treats words such as dog, dog., "dog, and dog, as different words):
file = open("/Users/name/Desktop/20words.txt", "r+")
wordcount = {}
for word in file.read().split():
    if word not in wordcount:
        wordcount[word] = 1
    else:
        wordcount[word] += 1
for k, v in wordcount.items():
    print k, v
Thank you for any help you can give!
Certainly the most difficult part is identifying the sentences. You could use a regular expression for this, but there might still be some ambiguity, e.g. with names and titles, that have a dot followed by an upper case letter. For words, too, you can use a simple regex, instead of using split. The exact expression to use depends on what qualifies as a "word". Finally, you can use collections.Counter for counting all of those instead of doing this manually. Use str.lower to convert either the text as a whole or the individual words to lowercase.
This should help you get started:
import re, collections

text = """Sentences start with an upper-case letter. Do they always end
with a dot? No! Also, not each dot is the end of a sentence, e.g. these two,
but this is. Still, some ambiguity remains with names, like Mr. Miller here."""

sentence = re.compile(r"[A-Z].*?[.!?](?=\s+[A-Z]|$)", re.S)
sentences = collections.Counter(sentence.findall(text))
for n, s in sentences.most_common():
    print n, s

word = re.compile(r"\w+")
words = collections.Counter(word.findall(text.lower()))
for n, w in words.most_common():
    print n, w
For "more power", you could use some natural language toolkit, but this might be a bit much for this task.
If you know what characters you want to avoid, you can use str.strip to remove these characters from the extremities.
word = word.strip().strip("'").strip('"')...
This will remove the occurrence of these characters on the extremities of the word.
This probably isn't as efficient as using some NLP library, but it can get the job done.
str.strip Docs
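A quick illustration of that behaviour: strip only touches the extremities of the string, never the interior:

```python
word = ' "dog," '

# chained strips, as in the snippet above: whitespace, then quotes, then commas
cleaned = word.strip().strip('"').strip(',')
print(cleaned)             # → dog

# an apostrophe inside the word is untouched
print("don't".strip("'"))  # → don't
```

A single call with all the unwanted characters, like word.strip(' \'",.'), does the same job in one pass.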

Python function: Please help me in this one

Okay, these two functions are related to each other. Fortunately the first one is solved, but the other is a big mess: it should give me 17.5 but it only gives me 3. Why doesn't it work out?
def split_on_separators(original, separators):
    """ (str, str) -> list of str
    Return a list of non-empty, non-blank strings from the original string
    determined by splitting the string on any of the separators.
    separators is a string of single-character separators.
    >>> split_on_separators("Hooray! Finally, we're done.", "!,")
    ['Hooray', ' Finally', " we're done."]
    """
    result = []
    newstring = ''
    for index, char in enumerate(original):
        if char in separators or index == len(original) - 1:
            result.append(newstring)
            newstring = ''
            if '' in result:
                result.remove('')
        else:
            newstring += char
    return result
def average_sentence_length(text):
    """ (list of str) -> float
    Precondition: text contains at least one sentence. A sentence is defined
    as a non-empty string of non-terminating punctuation surrounded by
    terminating punctuation or beginning or end of file. Terminating
    punctuation is defined as !?.
    Return the average number of words per sentence in text.
    >>> text = ['The time has come, the Walrus said\n',
    'To talk of many things: of shoes - and ships - and sealing wax,\n',
    'Of cabbages; and kings.\n'
    'And why the sea is boiling hot;\n'
    'and whether pigs have wings.\n']
    >>> average_sentence_length(text)
    17.5
    """
    words = 0
    Sentences = 0
    for line in text:
        words += 1
        sentence = split_on_separators(text, '?!.')
        for sep in sentence:
            Sentences += 1
    ASL = words / Sentences
    return ASL
Words can be counted by splitting each sentence in the list on spaces and counting the length of that list; that would be helpful.
You can eliminate the need for your first function by using regular expressions to split on separators. The regular expression function is re.split(). Here is a cleaned up version that gets the right result:
import re

def average_sentence_length(text):
    # Join all the text into one string and remove all newline characters.
    # Joining all text into one string makes the sentences much easier to
    # find, since multiple list items in 'text' could be one whole sentence
    text = "".join(text).replace('\n', '')

    # Use regex to split the sentences at the delimiter characters !?.
    # Filter out any empty strings that result from this function,
    # otherwise they will count as words later on
    sentences = filter(None, re.split('[!?.]', text))

    # Set the word sum variable
    wordsum = 0.0
    for s in sentences:
        # Split each sentence (s) into its separate words and add them
        # to the wordsum variable
        words = s.split(' ')
        wordsum += len(words)

    return wordsum / len(sentences)

data = ['The time has come, the Walrus said\n',
        ' To talk of many things: of shoes - and ships - and sealing wax,\n',
        'Of cabbages; and kings.\n'
        'And why the sea is boiling hot;\n'
        'and whether pigs have wings.\n']

print average_sentence_length(data)
The one issue with this function is that with the text you provided, it returns 17.0 instead of 17.5. This is because there is no space between "...the Walrus said" and "To talk of...". There is nothing that can be done there besides adding the space that should be there in the first place.
If the first function (split_on_separators) is required for the project, then you can replace the re.split() call with your function. Using regular expressions is a bit more reliable and a lot more lightweight than writing an entire function for it, however.
EDIT
I forgot to explain the filter() function. If you pass None as the first argument, it takes the second argument and removes all "falsy" items from it. Since an empty string is considered false in Python, it is removed. You can read more about filter() here.
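A tiny demonstration of that behaviour (note that in Python 2 filter returns a list directly; the list() call makes the same snippet work on Python 3 too):

```python
import re

# splitting on terminators leaves an empty string after the final '?'
parts = re.split('[!?.]', 'One. Two! Three?')
print(parts)                      # → ['One', ' Two', ' Three', '']

# filter(None, ...) drops the falsy empty string
print(list(filter(None, parts)))  # → ['One', ' Two', ' Three']
```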

How do I calculate the number of times a word occurs in a sentence?

So I've been learning Python for some months now and was wondering how I would go about writing a function that will count the number of times a word occurs in a sentence. I would appreciate if someone could please give me a step-by-step method for doing this.
Quick answer:
def count_occurrences(word, sentence):
    return sentence.lower().split().count(word)
'some string'.split() will split the string on whitespace (spaces, tabs and linefeeds) into a list of word-ish things. Then ['some', 'string'].count(item) returns the number of times item occurs in the list.
That doesn't handle removing punctuation. You could do that using string.maketrans and str.translate.
# Make a collection of chars to keep (don't translate them)
import string  # note: string.lowercase and string.maketrans are Python 2
keep = string.lowercase + string.digits + string.whitespace
table = string.maketrans(keep, keep)
delete = ''.join(set(string.printable) - set(keep))

def count_occurrences(word, sentence):
    return sentence.lower().translate(table, delete).split().count(word)
The key here is that we've constructed the string delete so that it contains all the ascii characters except letters, numbers and spaces. Then str.translate in this case takes a translation table that doesn't change the string, but also a string of chars to strip out.
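On Python 3, string.lowercase is gone and translate no longer takes a deletechars argument; a rough equivalent (my adaptation, not from the original answer) builds the deletions into the table itself:

```python
import string

# keep letters, digits and whitespace; delete every other printable char
keep = string.ascii_lowercase + string.digits + string.whitespace
delete = ''.join(set(string.printable) - set(keep))
table = str.maketrans('', '', delete)

def count_occurrences(word, sentence):
    # lower() first, so uppercase letters survive as their lowercase forms
    return sentence.lower().translate(table).split().count(word)

print(count_occurrences('sentence', 'This sentence is a simple sentence.'))  # → 2
```

Stripping the trailing period this way makes 'sentence.' and 'sentence' count as the same word.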
wilberforce has the quick, correct answer, and I'll give the long-winded "how to get to that conclusion" answer.
First, here are some tools to get you started, and some questions you need to ask yourself.
You need to read the section on Sequence Types, in the python docs, because it is your best friend for solving this problem. Seriously, read it. Once you have read that, you should have some ideas. For example you can take a long string and break it up using the split() function. To be explicit:
mystring = "This sentence is a simple sentence."
result = mystring.split()
print result
print "The total number of words is: " + str(len(result))
print "The word 'sentence' occurs: " + str(result.count("sentence"))
This takes the input string, splits it on any whitespace, and will give you:
['This', 'sentence', 'is', 'a', 'simple', 'sentence.']
The total number of words is: 6
The word 'sentence' occurs: 1
Now note here that you do have the period still at the end of the second 'sentence'. This is a problem because 'sentence' is not the same as 'sentence.'. If you are going to go over your list and count words, you need to make sure that the strings are identical. You may need to find and remove some punctuation.
A naive approach to this might be:
no_period_string = mystring.replace(".", " ")
print no_period_string
To get me a period-less sentence:
"This sentence is a simple sentence"
You also need to decide if your input going to be just a single sentence, or maybe a paragraph of text. If you have many sentences in your input, you might want to find a way to break them up into individual sentences, and find the periods (or question marks, or exclamation marks, or other punctuation that ends a sentence). Once you find out where in the string the 'sentence terminator' is you could maybe split up the string at that point, or something like that.
You should give this a try yourself - hopefully I've peppered in enough hints to get you to look at some specific functions in the documentation.
Simplest way:
def count_occurrences(word, sentence):
    return sentence.count(word)
text = input("Enter your sentence:")
print("'the' appears", text.count("the"), "times")
This is the simplest way to do it.
The problem with using the count() method is that it does not always give the correct number of occurrences when there is overlapping, for example:
print('banana'.count('ana'))
output
1
but 'ana' occurs twice in 'banana'.
To solve this issue, I used:
def total_occurrence(string, word):
    count = 0
    tempstring = string
    while word in tempstring:
        count += 1
        tempstring = tempstring[tempstring.index(word) + 1:]
    return count
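An equivalent sketch using str.find avoids re-slicing the string on every iteration (renamed count_overlapping so it doesn't shadow the function above):

```python
def count_overlapping(string, word):
    count = 0
    i = string.find(word)
    while i != -1:
        count += 1
        # resume the search one character later so overlapping matches count
        i = string.find(word, i + 1)
    return count

print(count_overlapping('banana', 'ana'))  # → 2
```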
You can do it like this:
def countWord(word):
    numWord = 0
    for i in range(1, len(word) - 1):
        if word[i-1:i+3] == 'word':  # only matches the literal substring 'word'
            numWord += 1
    print 'Number of times "word" occurs is:', numWord
Then calling it on a string:
countWord('wordetcetcetcetcetcetcetcword')
will return: Number of times "word" occurs is: 2
def check_Search_WordCount(mySearchStr, mySentence):
    len_mySentence = len(mySentence)
    len_Sentence_without_Find_Word = len(mySentence.replace(mySearchStr, ""))
    len_Remaining_Sentence = len_mySentence - len_Sentence_without_Find_Word
    count = len_Remaining_Sentence / len(mySearchStr)
    return int(count)
I assume that you just know about Python strings and for loops.
def count_occurences(s, word):
    count = 0
    for i in range(len(s)):
        if s[i:i+len(word)] == word:
            count += 1
    return count

mystring = "This sentence is a simple sentence."
myword = "sentence"
print(count_occurences(mystring, myword))
Explanation:
s[i:i+len(word)]: slices the string s to extract a chunk having the same length as the word argument.
count += 1: increases the counter whenever the chunk matches.