I am working with a passage of text. I need to sort the words in the passage alphabetically and then sort them by reverse frequency. When my word count function processes the passage, it counts empty strings too. I made some modifications, but it still counts the empty string. Is there another way to do it? My code is:
def build_map( in_file, word_map ):
    for line in in_file:
        # Splits each line at blank space and turns it into
        # a list.
        word_list = line.split()
        for word in word_list:
            if word!='':
                # Within the word_list, we are stripping empty space
                # on both sides of each word and also stripping any
                # punctuation on both side of each word in the list.
                # Then, it turns each word to the lower case to avoid
                # counting 'THE' and 'the' as two different words.
                word = word.strip().strip(string.punctuation).lower()#program revised
                add_word( word_map, word )
This should get you going in the right direction. You'll still need to process the words, probably by stripping periods and colons, and you might want to make everything lowercase anyway.
passage = '''I am dealing with a passage. I am required to sort the words in the passage alphabetically and then sort them by reverse frequency. When my word count function sorts the passage, it counts empty space too. I did some modification and it still counts the empty spaces. I am wondering if there is any other way to do it. My codes are:'''
words = set(passage.split())
alpha_sort = sorted(words, key=str.lower)
frequency_sort = sorted(words, key=passage.count, reverse=True)
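One caveat with key=passage.count is that str.count counts substring occurrences, so a short word such as 'a' is also counted inside longer words. A minimal sketch of an alternative, assuming Python 3 and using collections.Counter to count whole words (the sample passage and the tie-breaking rule are illustrative choices, not part of the original question):

```python
import string
from collections import Counter

# Illustrative sample text; any passage would do.
passage = "The cat and the hat. The end, and the cat!"

# Strip punctuation from both ends of each word and lowercase it.
words = [w.strip(string.punctuation).lower() for w in passage.split()]
# Drop anything that became empty after stripping.
words = [w for w in words if w]

counts = Counter(words)

# Unique words in alphabetical order.
alpha_sort = sorted(counts)
# Most frequent first; ties are broken alphabetically.
frequency_sort = sorted(counts, key=lambda w: (-counts[w], w))

print(alpha_sort)
print(frequency_sort)
```

Sorting on the tuple (-count, word) keeps the output deterministic when two words occur equally often.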
Maybe you're looking for str.isspace()
Instead of:
if word!='':
you should use:
if word.strip()!='':
because the first one only checks for zero-length strings, and you want to eliminate strings made up of spaces, which are not zero length. Stripping a whitespace-only string makes it zero-length.
To filter empty strings from a list of strings, I would use:
my_list = filter(None, my_list)
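A small sketch of how this behaves, assuming Python 3, where filter returns an iterator and has to be wrapped in list() to get a list back. Note that filter(None, ...) only drops falsy items such as '', so whitespace-only strings survive unless they are stripped first. The sample list is made up for illustration:

```python
raw_words = ["the", "", "cat", "   ", "hat"]

# filter(None, ...) removes falsy values: the empty string goes,
# but the whitespace-only string stays.
non_empty = list(filter(None, raw_words))

# Testing the stripped word removes whitespace-only strings as well.
non_blank = [w for w in raw_words if w.strip()]

print(non_empty)
print(non_blank)
```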
So I was recently working on this function:
# counts owls
def owl_count(text):
    # sets all text to lowercase
    text = text.lower()
    # sets text to list
    text = text.split()
    # saves indices of owl in list
    indices = [i for i, x in enumerate(text) if x == "owl"]
    # counts occurrences of owl in text
    owl_count = len(indices)
    # returns owl count and indices
    return owl_count, indices
My goal was to count how many times "owl" occurs in the string and save the indices of it. The issue I kept running into was that it would not count "owls" or "owl." I tried splitting it into a list of individual characters but I couldn't find a way to search for three consecutive elements in the list. Do you guys have any ideas on what I could do here?
PS. I'm definitely a beginner programmer so this is probably a simple solution.
thanks!
If you don't want to use a large library like NLTK, you can match words that start with 'owl' instead of words that are equal to 'owl':
indices = [i for i, x in enumerate(text) if x.startswith("owl")]
In this case words like 'owlowlowl' will match too; for proper tokenization of real-world text, one should use NLTK.
Python has built-in functions for this. This kind of string matching falls under regular expressions, which you can study in more detail later.
import re

a_string = "your string"
substring = "substring that you want to check"
# finditer yields one match object per occurrence of the pattern
matches = re.finditer(substring, a_string)
matches_positions = [match.start() for match in matches]
print(matches_positions)
finditer() returns an iterator of match objects, and start() returns the starting index of each match.
Simply put, it returns the indices of all occurrences of the substring in the string.
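Applied to the owl question, this could look like the sketch below (Python 3). The function names find_owls and owl_word_indices are made up for illustration; the first returns character offsets via re.finditer, the second returns word indices using a plain substring test, which also catches "owls" and "owl.":

```python
import re

def find_owls(text):
    """Return the character offset of every 'owl' in text, ignoring case."""
    return [m.start() for m in re.finditer(r"owl", text.lower())]

def owl_word_indices(text):
    """Return the index of every whitespace-separated word containing 'owl'."""
    return [i for i, word in enumerate(text.lower().split()) if "owl" in word]

sample = "An owl. Two owls! OWL"
positions = find_owls(sample)
word_indices = owl_word_indices(sample)
print(len(positions), positions)
print(len(word_indices), word_indices)
```

Both versions also match words such as "bowl", so a stricter pattern is needed if that matters.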
I have a string (text_string) from which I want to find words based on my so called key_words. I want to store the result in a list called expected_output.
The expected output is always the word after the keyword (the number of spaces between the keyword and the output word doesn't matter). The expected_output word is then all characters until the next space.
Please see the example below:
text_string = "happy yes_no!?. why coding without paus happy yes"
key_words = ["happy","coding"]
expected_output = ['yes_no!?.', 'without', 'yes']
expected_output explanation:
yes_no!?. (since it comes after happy. All signs are included until the next space.)
without (since it comes after coding. The number of spaces surrounding the word doesn't matter)
yes (since it comes after happy)
You can solve it using a regex, for example:
import re
expected_output = re.findall(r'(?:{0})\s+?([^\s]+)'.format('|'.join(key_words)), text_string)
Explanation
(?:{0}) takes your key_words list and creates a non-capturing group containing all the words in the list.
\s+? uses a lazy quantifier to match the whitespace after any of the keywords, up to the next character that isn't a space.
([^\s]+) captures the text right after the keyword until the next whitespace character.
Note: if you run this many times, e.g. inside a loop, you should call re.compile on the regex string beforehand to improve performance.
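A sketch of the precompiled variant, assuming Python 3. The helper name make_finder is hypothetical, and re.escape is an addition not in the answer above: it guards against keywords that contain regex metacharacters.

```python
import re

def make_finder(key_words):
    """Compile the pattern once and return a function that can be reused."""
    # re.escape keeps characters such as '.' or '?' in a keyword literal.
    alternation = "|".join(re.escape(word) for word in key_words)
    pattern = re.compile(r"(?:{0})\s+(\S+)".format(alternation))
    return pattern.findall

find_after = make_finder(["happy", "coding"])

text_string = "happy yes_no!?. why coding without paus happy yes"
result = find_after(text_string)
print(result)
```

Like the original pattern, this has no word boundaries, so a keyword inside a longer word (for example "unhappy") would also match.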
We will use Python's re module to split your string on whitespace.
Then the idea is to go over each word and check whether it is one of your keywords. If it is, we set take_it to True, so that on the next iteration of the loop the word is added to taken, which stores all the words you're looking for.
import re
def find_next_words(text, keywords):
    take_it = False
    taken = []
    for word in re.split(r'\s+', text):
        if take_it:
            taken.append(word)
        take_it = word in keywords
    return taken
print(find_next_words("happy yes_no!?. why coding without paus happy yes", ["happy", "coding"]))
results in ['yes_no!?.', 'without', 'yes']
I'm trying to make an anagram algorithm, but I'm stuck on the recursive part. Let me know if any more information is needed.
My code:
def ana_words(words, letter_count):
    """Return all the anagrams using the given letters and allowed words.
    - letter_count has 26 keys (one per lowercase letter),
    and each value is a non-negative integer.
    #type words: list[str]
    #type letter_count: dict[str, int]
    #rtype: list[str]
    """
    anagrams_list = []
    if not letter_count:
        return [""]
    for word in words:
        if not _within_letter_count(word, letter_count):
            continue
        new_letter_count = dict(letter_count)
        for char in word:
            new_letter_count[char] -= 1
        # recursive function
        var1 = ana_words(words[1:], new_letter_count)
        sorted_word = ''.join(word)
        for i in var1:
            sorted_word = ''.join([word, i])
        anagrams_list.append(sorted_word)
    return anagrams_list
Words is a list of words from a file, and letter count is a dictionary of characters (already in lower case). the list of words in words is also in lowercase already.
Input: print ana_words('dormitory')
Output I'm getting:
['dirtyroom', 'dotoi', 'doori', 'dormitory', 'drytoori', 'itorod', 'ortoidry', 'rodtoi', 'roomidry', 'rootidry', 'torodi']
Output I want:
['dirty room', 'dormitory', 'room dirty']
Link to word list: https://1drv.ms/t/s!AlfWKzBlwHQKbPj9P_pyKdmPwpg
Without knowing your words list it is hard to tell why it is including the 'wrong' entries. Trying with just
words = ['room','dirty','dormitory']
returns the correct entries.
If you want spaces between the words, you need to change
sorted_word = ''.join([word, i])
to
sorted_word = ' '.join([word, i])
(Note the added space)
Incidentally, if you want to solve this problem more efficiently, storing the words in a 'trie' data structure can help (https://en.wikipedia.org/wiki/Trie).
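For reference, a minimal trie sketch in Python 3. The class and method names are illustrative and do not come from any library; the point is that has_prefix lets a search abandon a letter sequence as soon as no word starts with it.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # maps a character to the next TrieNode
        self.is_word = False  # True if a stored word ends at this node


class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def _walk(self, text):
        """Follow text from the root; return the final node or None."""
        node = self.root
        for ch in text:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def has_prefix(self, prefix):
        return self._walk(prefix) is not None

    def contains(self, word):
        node = self._walk(word)
        return node is not None and node.is_word


trie = Trie(["room", "dirty", "dormitory"])
print(trie.has_prefix("dor"), trie.contains("dor"), trie.contains("room"))
```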
Question errors:
You are saying:
Words is a list of words from a file, and letter count is a dictionary of characters (already in lower case). the list of words in words is also in lowercase already.
But you are actually calling the function in a different way:
print ana_words('dormitory')
This is not right.
Checking if a dictionary's values are all 0:
if not letter_count: doesn't do what you expect, because it is only true for an empty dictionary. To check whether a dictionary holds only 0s, use if not any(letter_count.values()): which obtains the values, checks whether any of them is different from 0, and then negates the answer.
Joining words:
The str.join(arg1) method is not for joining two words; it joins the items of the iterable passed as arg1, using the string as the separator. In your case the iterable is a string of characters and you are joining with the empty string, so the result is the same word.
''.join('Hello')
>>> 'Hello'
The second time you use it, the iterable is a list, and it joins word with each element of var1, which is a list of words, so that's fine apart from the missing space. The problem is that you do nothing with sorted_word inside the loop, so only its last value is used. The anagrams_list.append(sorted_word) call should be inside the loop, and the sorted_word = ''.join(word) line should be deleted.
Other errors:
Aside from all these errors, you never check whether the letter counts have reached 0 in order to stop the recursion.
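Putting these points together, one possible corrected version is sketched below (Python 3). It is an illustration under assumptions, not the asker's final code: within_letter_count is a hypothetical stand-in for the _within_letter_count helper that the question does not show, and the recursion passes the full word list instead of words[1:] so that both word orders are produced.

```python
from collections import Counter

def within_letter_count(word, letter_count):
    """Hypothetical helper: can word be spelled from the remaining letters?"""
    needed = Counter(word)
    return all(letter_count.get(ch, 0) >= n for ch, n in needed.items())

def ana_words(words, letter_count):
    # Base case: every letter has been used up.
    if not any(letter_count.values()):
        return [""]
    anagrams = []
    for word in words:
        # Skip empty words (they would recurse forever) and words that do not fit.
        if not word or not within_letter_count(word, letter_count):
            continue
        remaining = dict(letter_count)
        for ch in word:
            remaining[ch] -= 1
        # Append inside the loop, once per completion of the remaining letters.
        for rest in ana_words(words, remaining):
            anagrams.append((word + " " + rest).strip())
    return anagrams

letters = dict(Counter("dormitory"))
result = sorted(ana_words(["room", "dirty", "dormitory"], letters))
print(result)
```

Branches that leave letters no word can use return an empty list, so partial results such as 'dotoi' never reach the output.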
The purpose of the program is to count each word in a passage and note the frequency. Unfortunately, the program is also counting empty strings. My code is:
def build_map( in_file, word_map ):
    # Receives an input file and an empty dictionary
    for line in in_file:
        # Splits each line at blank space and turns it into
        # a list.
        word_list = line.split()
        for word in word_list:
            word= word.strip().strip(string.punctuation).lower()#program revised
            if word!='':
                # Within the word_list, we are stripping empty space
                # on both sides of each word and also stripping any
                # punctuation on both side of each word in the list.
                # Then, it turns each word to the lower case to avoid
                # counting 'THE' and 'the' as two different words.
                add_word( word_map, word)
I would really appreciate it if someone could take a look at the code and explain why it is still counting empty strings. Other than that, everything else is working fine. Thanks. (I modified the code and it is working fine now.)
You're checking if the word is empty and then you're stripping the whitespace and punctuation. Reverse the order of these operations.
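A self-contained sketch of the strip-then-check order, assuming Python 3. The question does not show add_word, so the version here is a hypothetical stand-in that increments a dictionary count, and a list of strings stands in for the input file:

```python
import string

def add_word(word_map, word):
    # Hypothetical stand-in for the asker's add_word helper.
    word_map[word] = word_map.get(word, 0) + 1

def build_map(lines, word_map):
    for line in lines:
        for word in line.split():
            # Strip first: a token such as '--' only becomes empty here.
            word = word.strip(string.punctuation).lower()
            # Then test, so punctuation-only tokens are never counted.
            if word:
                add_word(word_map, word)

word_map = {}
build_map(["The end -- the END!", "..."], word_map)
print(word_map)
```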
line = "english: while french: pendant que spanish: mientras german: whrend "
words = line.split('\t')
for each in words:
    each = each.rstrip()
print words
The string in 'line' is tab delimited but also has a single whitespace character after each translated word, so while split returns the list I'm after, each word annoyingly has a whitespace character at the end.
In the loop I'm trying to go through the list and remove any trailing whitespace from the strings, but it doesn't seem to work. Any suggestions?
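Just line.split() would give you a list of stripped words, although it splits on every run of whitespace, so an entry such as 'pendant que' would be split in two.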
Assigning to each inside the loop does not change the words list.
It should be done like this:
for i in range(len(words)):
    words[i]=words[i].rstrip()
Or
words=map(str.rstrip,words)
See the map docs for details on map.
Or one liner with list comprehension
words=[x.rstrip() for x in line.split("\t")]
Or with regex .findall
words=re.findall("[^\t]+",line)
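To make the first point concrete, here is a small sketch in Python 3 (the question itself uses Python 2 print syntax). The sample line is made up, with explicit \t escapes, and is not the asker's exact string. In Python 3, map returns an iterator, so it is wrapped in list():

```python
line = "while \tpendant que \tmientras "
words = line.split("\t")

for each in words:
    # This rebinds the local name 'each'; the list itself is untouched.
    each = each.rstrip()

unchanged = words == ["while ", "pendant que ", "mientras "]

# Building a new list keeps the stripped values.
stripped = [w.rstrip() for w in words]
mapped = list(map(str.rstrip, words))

print(unchanged, stripped, mapped)
```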
words = line.split('\t')
words = [ i.rstrip() for i in words ]
You can use a regular expression:
import re
words = re.split(r' *\t| +$', line)[:-1]
With this you define the possible sequence as the delimiter. It also allows more than one space because of the * operator (or no space at all).
EDIT: Fixed after Roger Pate pointed out an error.