Empty string need not be counted in python (repeat) - python

The purpose of the program is to count each word in a passage and note the frequency. Unfortunately, the program is also counting empty strings. My code is:
import string

def build_map(in_file, word_map):
    # Receives an input file and an empty dictionary.
    for line in in_file:
        # Split each line at whitespace and turn it into a list.
        word_list = line.split()
        for word in word_list:
            # Strip empty space and punctuation on both sides of each
            # word, then lowercase it to avoid counting 'THE' and 'the'
            # as two different words.
            word = word.strip().strip(string.punctuation).lower()  # program revised
            if word != '':
                add_word(word_map, word)
I would really appreciate it if someone could take a look at the code and explain why it is still counting empty strings. Other than that, everything else is working fine. Thanks. (I have since modified the code and it is working fine now.)

You're checking if the word is empty and then you're stripping the whitespace and punctuation. Reverse the order of these operations.
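To see why the order matters, here is a minimal sketch. The helper name `clean_word` and the sample sentence are assumptions made up for illustration; only the strip/lower chain comes from the question.

```python
import string

def clean_word(word):
    # Strip surrounding whitespace and punctuation, then lowercase.
    return word.strip().strip(string.punctuation).lower()

tokens = "Wait -- what?".split()
cleaned = [clean_word(t) for t in tokens]

# The token '--' is non-empty before stripping but empty afterwards,
# so the emptiness check has to come after the cleanup.
non_empty = [w for w in cleaned if w != '']

print(cleaned)
print(non_empty)
```

A token made up only of punctuation passes a check done before stripping, and that is how empty strings can end up in the word map.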

Related

Python 3 function for counting words in a txt file is giving wrong answer for one word...why?

I am supposed to be making a word/count dictionary for a given txt file. The count part is correct aside from one word in the txt file getting skipped over twice. Why is this? It is like it is skipping over the instances of "we" that are capitalized despite me trying to use .lower() on each word. This is the text it is supposed to count words in:
We are not what we should be
We are not what we need to be
But at least we are not what we used to be
-- Football Coach
However when I change the txt file so the uppercase "We"s are lowercase, it will count them just fine. All of the other words starting with uppercase letters are getting lowercased as they should, but not those "We"s. Why is .lower() not working on those "We"s, but it is working on everything else that starts with an uppercase letter? It is counting only 4 "we"s instead of 6. Everything else is correct and it also has the correct overall word count, so I don't understand what is wrong. Any ideas?
Here is my code:
def create_word_dict(filename):
    """Returns a word/count dict for the given file."""
    import collections
    frequency = {}
    with open(filename) as text:
        for word in text.read().split():
            if word in frequency:
                frequency[word.lower()] += 1
            else:
                frequency[word.lower()] = 1
    new_dict = collections.OrderedDict(sorted(frequency.items()))
    return (dict(new_dict))
You only put lower case words in your dictionary. So when word is "We", then if word in frequency is false, even if "we" is actually in the dictionary. That will throw off your counts.
Since you're already importing collections, why not use a Counter?
with open(filename) as text:
    frequency = collections.Counter(text.read().lower().split())
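If you would rather keep the explicit loop from the question, the smallest change is probably to lowercase each word once, before the membership test. The sketch below assumes a hypothetical `count_words` function that takes the text as a string instead of a filename, so that it runs on its own.

```python
def count_words(text):
    """Return a word/count dict, treating 'We' and 'we' as the same word."""
    frequency = {}
    for word in text.split():
        key = word.lower()  # normalise before the lookup, not after
        if key in frequency:
            frequency[key] += 1
        else:
            frequency[key] = 1
    return frequency

sample = (
    "We are not what we should be\n"
    "We are not what we need to be\n"
    "But at least we are not what we used to be"
)
counts = count_words(sample)
print(counts["we"])
```

Because the same lowercased key is used for both the test and the update, a capitalized "We" can no longer reset an existing count.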

I'm trying to display the total amount of lower case letters in a text file, but

I've got the text file for the Dracula novel and I want to count the number of lower case letters contained within it. The code I've got executes without a problem but prints out 4297. I'm not sure where I went wrong and hoped you guys could point out my issue here. Thank you!
Indentation isn't necessarily reflective of what I see on my text editor
def main():
    book_file = open('dracula.txt', 'r')
    lower_case = sum(map(str.islower, book_file))
    print (lower_case)
    book_file.close()
main()
expected: 621607
results: 4297
When you iterate over a file, you get a line as a value on each iteration. Your current code would be correct if it was running on characters, not lines. When you call islower on a longer string (like a line from a book), it only returns True if all the letters in the string are lowercase.
In your copy of Dracula, there are apparently 4297 lines that contain no capital letters, so that's the result you're getting. The much larger number is the count of characters.
You can fix your code by adding an extra step that reads the file into a single large string, then iterating over that.
def main():
    with open('dracula.txt', 'r') as book_file:
        text = book_file.read()
    lower_case = sum(map(str.islower, text))
    print(lower_case)
I also modified your code slightly by using a with statement to handle closing the file. This is nice because it will always close the file when it exits the indented block, even if something has gone wrong and an exception has been raised.
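The difference between counting lines and counting characters can be reproduced without the book. This is a small sketch using two made-up lines in place of dracula.txt:

```python
# Two made-up lines standing in for the lines of a file.
lines = ["all lowercase here\n", "Starts with a capital\n"]

# Iterating over lines: islower() is asked once per line, so this
# counts the lines that contain no uppercase letters.
per_line = sum(map(str.islower, lines))

# Iterating over one big string: islower() is asked once per character.
per_char = sum(map(str.islower, "".join(lines)))

print(per_line, per_char)
```

Only the first line is entirely lowercase, so the per-line count is much smaller than the per-character count, which is the same effect seen with 4297 versus 621607.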
You can use regex to count the lower-case and upper-case characters
import re
text = "sdfsdfdTTsdHSksdsklUHD"
lowercase = len(re.findall("[a-z]", text))
uppercase = len(re.findall("[A-Z]", text))
print(lowercase)
print(uppercase)
Outputs:
15
7
And you will need to change how you read the file to
text = open("dracula.txt").read()
with open('dracula.txt', 'r') as book_file:
    count = 0
    # For each line in the file, count the number of lowercase
    # letters and add it to the variable "count".
    for line in book_file:
        count += sum(map(str.islower, line))
    print("number of lower case letters = " + str(count))
Here is a version that uses a list comprehension rather than map().
It iterates over the characters in the text and creates a list of all lowercase characters. The length of this list is the number of lowercase letters in the text.
with open('dracula.txt') as f:
    text = f.read()
lowers = [char for char in text if char.islower()]
print(len(lowers))

Derive words from string based on key words

I have a string (text_string) from which I want to find words based on my so called key_words. I want to store the result in a list called expected_output.
The expected output is always the word after the keyword (the number of spaces between the keyword and the output word doesn't matter). The expected_output word is then all characters until the next space.
Please see the example below:
text_string = "happy yes_no!?. why coding without paus happy yes"
key_words = ["happy","coding"]
expected_output = ['yes_no!?.', 'without', 'yes']
expected_output explanation:
yes_no!?. (since it comes after happy. All signs are included until the next space.)
without (since it comes after coding. The number of spaces surrounding the word doesn't matter.)
yes (since it comes after happy)
You can solve it using a regex, for example:
import re
expected_output = re.findall(r'(?:{0})\s+?([^\s]+)'.format('|'.join(key_words)), text_string)
Explanation
(?:{0}) takes your key_words list and builds a non-capturing group containing all the words in the list.
\s+? matches the whitespace after any of those keywords, up to the next character that isn't whitespace (the quantifier is lazy).
([^\s]+) captures the text right after the keyword until the next whitespace character.
Note: if you run this many times, e.g. inside a loop, you should call re.compile on the regex string beforehand to improve performance.
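A minimal sketch of the precompiled variant follows. The second sample string is made up, and the use of `re.escape` is an addition not present in the answer above; it only matters if a keyword contains regex metacharacters.

```python
import re

key_words = ["happy", "coding"]

# Compile once, reuse many times. re.escape protects against keywords
# that contain regex metacharacters (an assumption added for safety).
pattern = re.compile(
    r'(?:{0})\s+(\S+)'.format('|'.join(map(re.escape, key_words)))
)

texts = [
    "happy yes_no!?. why coding without paus happy yes",
    "coding   daily makes me happy indeed",  # made-up second input
]
results = [pattern.findall(t) for t in texts]
print(results)
```

Note that this pattern, like the original, also matches a keyword that appears at the end of a longer word; adding `\b` before the group would be one way to tighten that if it matters.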
We will use Python's re module to split your string on whitespace.
Then, the idea is to go over each word, and look if that word is part of your keywords. If yes, we set take_it to True, so that next time the loop is processed, the word will be added to taken which stores all the words you're looking for.
import re
def find_next_words(text, keywords):
    take_it = False
    taken = []
    for word in re.split(r'\s+', text):
        if take_it:
            taken.append(word)
        take_it = word in keywords
    return taken
print(find_next_words("happy yes_no!?. why coding without paus happy yes", ["happy", "coding"]))
results in ['yes_no!?.', 'without', 'yes']

Python - Recursive word list

I'm trying to make an anagram algorithm, but I'm stuck once I get to the recursive part. Let me know if any more information is needed.
My code:
def ana_words(words, letter_count):
    """Return all the anagrams using the given letters and allowed words.
    - letter_count has 26 keys (one per lowercase letter),
      and each value is a non-negative integer.
    #type words: list[str]
    #type letter_count: dict[str, int]
    #rtype: list[str]
    """
    anagrams_list = []
    if not letter_count:
        return [""]
    for word in words:
        if not _within_letter_count(word, letter_count):
            continue
        new_letter_count = dict(letter_count)
        for char in word:
            new_letter_count[char] -= 1
        # recursive function
        var1 = ana_words(words[1:], new_letter_count)
        sorted_word = ''.join(word)
        for i in var1:
            sorted_word = ''.join([word, i])
        anagrams_list.append(sorted_word)
    return anagrams_list
Words is a list of words from a file, and letter count is a dictionary of characters (already in lower case). the list of words in words is also in lowercase already.
Input: print ana_words('dormitory')
Output I'm getting:
['dirtyroom', 'dotoi', 'doori', 'dormitory', 'drytoori', 'itorod', 'ortoidry', 'rodtoi', 'roomidry', 'rootidry', 'torodi']
Output I want:
['dirty room', 'dormitory', 'room dirty']
Link to word list: https://1drv.ms/t/s!AlfWKzBlwHQKbPj9P_pyKdmPwpg
Without knowing your words list it is hard to tell why it is including the 'wrong' entries. Trying with just
words = ['room','dirty','dormitory']
Returns the correct entries.
If you want spaces between the words, you need to change
sorted_word = ''.join([word, i])
to
sorted_word = ' '.join([word, i])
(Note the added space)
Incidentally, if you want to solve this problem more efficiently, using a 'trie' data structure to store the words can help (https://en.wikipedia.org/wiki/Trie)
Question errors:
You are saying:
Words is a list of words from a file, and letter count is a dictionary of characters (already in lower case). the list of words in words is also in lowercase already.
But you are actually calling the function in a different way:
print ana_words('dormitory')
This is not right.
Checking if a dictionary's values are all 0:
if not letter_count: doesn't do what you expect. It is only true when the dictionary is empty, and yours always has 26 keys. To check whether all the values are 0, use if not any(letter_count.values()): which obtains the values, checks whether any of them is different from 0, and then negates the answer.
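The difference between the two checks can be seen on a tiny dictionary. This sketch uses two made-up keys instead of the full 26:

```python
# A letter count in which every letter has been used up.
letter_count = {"a": 0, "b": 0}

# Truthiness of a dict only reflects whether it has any keys.
is_empty = not letter_count

# any() looks at the values: it is True if at least one is non-zero.
all_zero = not any(letter_count.values())

print(is_empty, all_zero)
```

Here the dictionary still has keys, so the first check never fires even though every count is 0; the second check is the one that detects the base case.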
Joining words:
The str.join(arg1) method is not for joining two words; it joins the elements of the iterable passed as arg1, using the string as the separator. In your case the iterable is a string (a sequence of characters) and the separator is empty, so the result is the same word.
''.join('Hello')
>>> 'Hello'
The second time you use it, the iterable is a list, and it joins word with each element of var1, which is a list of words, so that's fine apart from the missing space. The problem is that you are not doing anything with sorted_word inside the loop; you only use the last value it takes. The anagrams_list.append(sorted_word) should be inside the loop, and the sorted_word = ''.join(word) should be deleted.
Other errors:
Aside from all these errors, you are never checking whether the letter counts get to 0 to stop the recursion.
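Putting the points above together, one possible corrected version might look like the sketch below. It is not the asker's final code: `within_letter_count` is a hypothetical stand-in for the `_within_letter_count` helper that the question does not show, and passing the full word list to the recursive call (instead of `words[1:]`) is an assumption made so that every remaining word can still be tried.

```python
from collections import Counter

def within_letter_count(word, letter_count):
    """Hypothetical stand-in for the question's _within_letter_count:
    True if `word` can be spelled with the letters still available."""
    needed = Counter(word)
    return all(letter_count.get(ch, 0) >= n for ch, n in needed.items())

def ana_words(words, letter_count):
    """Return all phrases built from `words` that use every letter exactly."""
    # Base case: every letter has been used up.
    if not any(letter_count.values()):
        return [""]
    anagrams_list = []
    for word in words:
        # Skip empty words (they would recurse forever) and words that
        # need letters we no longer have.
        if not word or not within_letter_count(word, letter_count):
            continue
        new_letter_count = dict(letter_count)
        for char in word:
            new_letter_count[char] -= 1
        for rest in ana_words(words, new_letter_count):
            # Join with a space; strip() removes the trailing space
            # left by the empty string from the base case.
            anagrams_list.append(' '.join([word, rest]).strip())
    return anagrams_list

result = ana_words(['room', 'dirty', 'dormitory'], dict(Counter('dormitory')))
print(sorted(result))
```

With the three-word list from the earlier answer this yields the desired phrases; on a large word list the plain recursion may be slow, which is where the trie suggestion comes in.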

About getting rid of empty space in word count

I am dealing with a passage. I am required to sort the words in the passage alphabetically and then sort them by reverse frequency. When my word count function sorts the passage, it counts empty space too. I made some modifications and it still counts the empty string. I am wondering if there is any other way to do it. My code is:
def build_map( in_file, word_map ):
    for line in in_file:
        # Splits each line at blank space and turns it into
        # a list.
        word_list = line.split()
        for word in word_list:
            if word!='':
                # Within the word_list, we are stripping empty space
                # on both sides of each word and also stripping any
                # punctuation on both side of each word in the list.
                # Then, it turns each word to the lower case to avoid
                # counting 'THE' and 'the' as two different words.
                word = word.strip().strip(string.punctuation).lower()#program revised
                add_word( word_map, word )
This should get you going in the right direction. You'll need to process the words further, probably by stripping periods and colons, and you might want to make everything lowercase anyway.
passage = '''I am dealing with a passage. I am required to sort the words in the passage alphabetically and then sort them by reverse frequency. When my word count function sorts the passage, it counts empty space too. I did some modification and it still counts the empty spaces. I am wondering if there is any other way to do it. My codes are:'''
words = set(passage.split())
alpha_sort = sorted(words, key=str.lower)
frequency_sort = sorted(words, key=passage.count, reverse=True)
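One caveat with the snippet above: `str.count` counts substring occurrences, so a short word such as "the" is also counted inside "other". A sketch that counts whole words with `collections.Counter` instead could look like this; the sample passage is made up, and breaking frequency ties alphabetically is an assumption about the desired order.

```python
import string
from collections import Counter

passage = "The cat sat. The other cat ran; the end"

# Clean each token, then drop anything that became empty.
words = [w.strip(string.punctuation).lower() for w in passage.split()]
words = [w for w in words if w]

counts = Counter(words)
alpha_sort = sorted(counts)
# Most frequent first; ties broken alphabetically.
frequency_sort = sorted(counts, key=lambda w: (-counts[w], w))

print(alpha_sort)
print(frequency_sort)
```

Because the counting is done on cleaned, whole words, empty strings never enter the tally.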
Maybe you're looking for str.isspace()
Instead of:
if word!='':
you should use:
if word.strip()!='':
because the first one checks for zero-length strings, and you want to eliminate the spaces which are not zero length. Stripping an only-space string will make it zero-length.
To filter empty strings from a list of strings, I would use:
my_list = filter(None, my_list)
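In Python 3, `filter` returns a lazy iterator rather than a list, so wrap it in `list()` if you need to index it or iterate over it more than once. A short sketch with a made-up list:

```python
my_list = ["the", "", "cat", ""]

# filter(None, ...) keeps only truthy items, i.e. non-empty strings here.
filtered = list(filter(None, my_list))

# An equivalent list comprehension.
same = [w for w in my_list if w]

print(filtered)
```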
