I have the following code:
def splitParagraphIntoSentences(paragraph):
    import re
    sentenceEnders = re.compile('[.!?]')
    sentenceList = sentenceEnders.split(paragraph)
    return sentenceList

sentenceList = splitParagraphIntoSentences(u"""I have a bicycle. I want the car.
""")
print(len(sentenceList))
Python reports that the length of sentenceList is 3, but there are actually just two sentences. I know this happens because of the '.' at the end of the second sentence. What is the best way to make the program count sentences correctly without removing the '.' from the end of the second sentence?
Thank you
Instead of splitting, count the ends:
len(sentenceEnders.findall(paragraph))
Or subtract 1 to account for the empty line after the last sentence split:
len(splitParagraphIntoSentences(paragraph)) - 1
or return a filtered list, removing empty items:
return filter(None, sentenceList)
or, when using Python 3 (where filter() returns an iterator rather than a list):
return [s for s in sentenceList if s]
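As a rough sketch of the options above, the following puts the counting and filtering approaches side by side. The function names count_sentences and split_sentences are made up for illustration. Note one assumption that goes slightly beyond the answer: the pieces are stripped before filtering, because with the example input the last piece is a newline rather than an empty string, so a plain truthiness filter would keep it.

```python
import re

SENTENCE_ENDERS = re.compile(r"[.!?]")

def count_sentences(paragraph):
    """Count sentence terminators instead of splitting on them."""
    return len(SENTENCE_ENDERS.findall(paragraph))

def split_sentences(paragraph):
    """Split on terminators and drop empty or whitespace-only pieces."""
    return [s.strip() for s in SENTENCE_ENDERS.split(paragraph) if s.strip()]

text = "I have a bicycle. I want the car.\n"
print(count_sentences(text))   # 2
print(split_sentences(text))   # ['I have a bicycle', 'I want the car']
```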
Related
I want to display words with common substring in a string.
for example if given string is
str = "the games are lame"
and the words have to be grouped together on the basis of common substring of length 3, so output should be
the
games, lame
are
since common substring of length 3 is "ame".
I proceeded by converting the string to list say "lista" using split() and made another list say "listb" with all the possible substrings of length 3, like
the, gam, gme, ges, ame, aes, mes, are, lam, lme, ame
then I checked the "listb" for duplicate items('ame') and on basis of them compared with items in "lista" like so
for items in duplicate:
    for item in lista:
        if items in item and item not in listc:
            listc.append(item)
Now, I have a "listc" with items that have common substring of length 3 but I can't figure out how to group them as needed in output. Also if "str" contains more words with common substring "listc" will also have those common words.
I don't know if I should have proceeded in this way and can't seem to figure out how to group items from "listc" as needed in output.
Here is a solution
str_ = "the games are lame"
# First, get a list of all the words
words = str_.split()
# words >>> ['the', 'games', 'are', 'lame']
# This variable will contain the groups of words
groups = []
# For each word
for word in words:
    found = False
    # Get the first word of each group
    other_words = [x[0] for x in groups]
    # Loop through the word and get every substring of 3 characters
    for i in range(len(word)):
        substring = word[i:i+3]
        # Skip substrings that don't have the correct length
        if len(substring) != 3:
            continue
        try:
            # Try to find the substring in a group and get the index of that group
            index = [substring in other_word for other_word in other_words].index(True)
            found = True
            # Add the word to the group, then stop so it is not added twice
            groups[index].append(word)
            break
        except ValueError:
            continue
    # If we don't find a group for the word, we create a new group with that word in it
    if not found:
        groups.append([word])
# groups >>> [['the'], ['games', 'lame'], ['are']]
# Now print the groups
for group in groups:
    print(", ".join(group))
output :
the
games, lame
are
I think you're creating a lot of lists there and this can be quite confusing.
If you want to use a purely logical approach without using libraries designed for sequence matching, such as difflib, you can first define a function that compares two strings; then you separate your sentence into a words list and perform a double iteration (nested) through that list comparing all possible pairs.
If the strings match they will be printed on the same line separated by commas otherwise on a new line.
In the following function I've also added a parameter for the length of the substring that you want to match, set to 3 by default to stay in line with your question:
# This function compares two strings and returns them in a tuple if they contain the
# same substring of len_substring characters.
def string_matcher(string_a, string_b, len_substring=3):
    # + 1 so that the last substring of string_a is checked too
    for i in range(len(string_a) - len_substring + 1):
        if string_a[i:i+len_substring] in string_b:
            return string_a, string_b
    return None

string = "the games are lame"
words = string.split()
output = ""
# Making a double iteration over the words list and calling string_matcher for each pair.
for i in range(len(words)-1):
    output = output+words[i]
    for j in range(i+1, len(words)):
        try:
            word_a, word_b = string_matcher(words[i], words[j])
            output = output+", "+word_b
        except TypeError:
            pass
    output = output+"\n"
print(output)
The program prints out:
the
games, lame
are
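For comparison, here is a minimal sketch of the same grouping idea using sets of substrings, so that "do these two words share a substring?" becomes a set intersection. The helper names substrings and group_words are hypothetical, and the sketch assumes (like the first answer) that a word joins the first group whose first word shares a substring with it.

```python
def substrings(word, n=3):
    """Return the set of all contiguous substrings of length n."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def group_words(text, n=3):
    """Group words sharing a length-n substring with a group's first word."""
    groups = []  # each entry: (substrings of the group's first word, members)
    for word in text.split():
        subs = substrings(word, n)
        for key_subs, members in groups:
            if subs & key_subs:       # non-empty intersection: common substring
                members.append(word)
                break
        else:
            # no existing group matched, so start a new one
            groups.append((subs, [word]))
    return [members for _, members in groups]

for group in group_words("the games are lame"):
    print(", ".join(group))
```

Words shorter than n produce an empty substring set, so they always end up in a group of their own.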
I need to write a function that returns the first letter of each word, in uppercase, for any text, like:
shortened = shorten("Don't repeat yourself")
print(shortened)
Expected output:
DRY
and:
shortened = shorten("All terrain armoured transport")
print(shortened)
Expected output:
ATAT
Use list comprehension and join
shortened = "".join([x[0] for x in text.title().split(' ') if x])
Using a regex, you can match all characters except the first letter of each word, replace them with an empty string to remove them, then uppercase the resulting string:
import re
def shorten(sentence):
    return re.sub(r"\B[\S]+\s*", "", sentence).upper()
print(shorten("Don't repeat yourself"))
Output:
DRY
text = 'this is a test'
output = ''.join(char[0] for char in text.title().split(' '))
print(output)
TIAT
Let me explain how this works.
My first step is to capitalize the first letter of each word:
text.title()
Now I want to separate the words at the spaces between them; this gives a list:
text.title().split(' ')
With that I end up with ['This', 'Is', 'A', 'Test'], so now I only want the first character of each word in the list:
for word in text.title().split(' '):
    print(word[0]) # T I A T
Now I can lump all that into a list comprehension:
output = [char[0] for char in text.title().split(' ')]
# ['T','I','A','T']
I can use ''.join() to combine them. I don't need the [] brackets anymore, because join() accepts any iterable, not just a list:
output = ''.join(char[0] for char in text.title().split(' '))
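Putting the steps together, a shorten function along these lines might look like the sketch below. It is one possible variant, not the only one: it assumes words are separated by whitespace, calls split() without an argument so that repeated spaces do not produce empty words, and uppercases at the end instead of calling title() first.

```python
def shorten(text):
    """Return the upper-cased first letter of every whitespace-separated word."""
    return "".join(word[0] for word in text.split()).upper()

print(shorten("Don't repeat yourself"))           # DRY
print(shorten("All terrain armoured transport"))  # ATAT
```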
Here's the question I have to answer for school:
For the purposes of this question, we will define a word as ending a sentence if that word is immediately followed by a period. For example, in the text “This is a sentence. The last sentence had four words.”, the ending words are ‘sentence’ and ‘words’. In a similar fashion, we will define the starting word of a sentence as any word that is preceded by the end of a sentence. The starting words from the previous example text would be “The”. You do not need to consider the first word of the text as a starting word. Write a program that has:
An endwords function that takes a single string argument. This function must return a list of all sentence ending words that appear in the given string. There should be no duplicate entries in the returned list and the periods should not be included in the ending words.
The code I have so far is:
def startwords(astring):
    mylist = astring.split()
    if mylist.endswith('.') == True:
        return mylist
but I don't know if I'm using the right approach. I need some help
Several issues with your code. The following would be a simple approach. Create a list of bigrams and pick the second token of each bigram where the first token ends with a period:
def startwords(astring):
    mylist = astring.split() # a list! Has no 'endswith' method
    bigrams = zip(mylist, mylist[1:])
    return [b[1] for b in bigrams if b[0].endswith('.')]
zip and list comprehensions are two things worth reading up on.
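To make the bigram idea concrete, this small sketch shows what zip produces for the example text and how the starting words fall out of it. The variable names are illustrative only.

```python
words = "This is a sentence. The last sentence had four words.".split()

# Pair every token with the token that follows it.
bigrams = list(zip(words, words[1:]))
print(bigrams[:4])
# [('This', 'is'), ('is', 'a'), ('a', 'sentence.'), ('sentence.', 'The')]

# Keep the second token wherever the first one ends with a period.
starts = [second for first, second in bigrams if first.endswith(".")]
print(starts)  # ['The']
```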
mylist = astring.split()
if mylist.endswith('.')
That cannot work, one of the reasons being that mylist is a list and has no endswith method.
Another answer fixed your approach so let me propose a regular expression solution:
import re
print(re.findall(r"\.\s*(\w+)","This is a sentence. The last sentence had four words."))
This matches every word that follows a dot and optional whitespace.
result: ['The']
def endwords(astring):
    mylist = astring.split('.')
    temp_words = [x.rpartition(" ")[-1] for x in mylist if len(x) > 1]
    return list(set(temp_words))
This creates a set, so there are no duplicates. It loops over the sentences (split by "."), splits each sentence into words, uses [-1:] to take a list holding only the last word, and gets item [0] of that list.
print(set([x.split()[-1:][0] for x in s.split(".") if len(x.split()) > 0]))
The if is needed because the piece after the final period is empty, so x.split() returns an empty list and indexing it would raise an IndexError.
This works as well:
print (set([ x.split() [len(x.split())-1] for x in s.split(".") if len(x.split())>0]))
This is one way to do it ->
#!/usr/bin/env python
sentence = 'This is a sentence. The last sentence had four words.'
uniq_end_words = set()
for word in sentence.split():
    # check if the period (.) is at the end of the word
    if word.endswith('.'):
        uniq_end_words.add(word.rstrip('.'))
print(list(uniq_end_words))
Output (list of all the end words in the given sentence; the order may vary because sets are unordered) ->
['words', 'sentence']
If your input string has a period inside one of its words (let's say the last word), something like this ->
'I like the documentation of numpy.random.rand.'
The output would be - ['numpy.random.rand']
And for input string 'I like the documentation of numpy.random.rand a lot.'
The output would be - ['lot']
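If a stable order is wanted, one option is to collect the end words in a list and skip repeats instead of using a set. This is only a sketch under the same assumptions as the answers above: words are separated by whitespace and a sentence ends with a period directly after its last word.

```python
def endwords(text):
    """Return the unique words immediately followed by a period, in order of appearance."""
    seen = []
    for token in text.split():
        if token.endswith("."):
            word = token.rstrip(".")
            # skip bare periods and words that were already collected
            if word and word not in seen:
                seen.append(word)
    return seen

print(endwords("This is a sentence. The last sentence had four words."))
# ['sentence', 'words']
```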
I'm trying to make an anagram algorithm, but I'm stuck once I get to the recursive part. Let me know if any more information is needed.
My code:
def ana_words(words, letter_count):
    """Return all the anagrams using the given letters and allowed words.
    - letter_count has 26 keys (one per lowercase letter),
    and each value is a non-negative integer.
    #type words: list[str]
    #type letter_count: dict[str, int]
    #rtype: list[str]
    """
    anagrams_list = []
    if not letter_count:
        return [""]
    for word in words:
        if not _within_letter_count(word, letter_count):
            continue
        new_letter_count = dict(letter_count)
        for char in word:
            new_letter_count[char] -= 1
        # recursive function
        var1 = ana_words(words[1:], new_letter_count)
        sorted_word = ''.join(word)
        for i in var1:
            sorted_word = ''.join([word, i])
        anagrams_list.append(sorted_word)
    return anagrams_list
Words is a list of words from a file, and letter count is a dictionary of characters (already in lower case). the list of words in words is also in lowercase already.
Input: print ana_words('dormitory')
Output I'm getting:
['dirtyroom', 'dotoi', 'doori', 'dormitory', 'drytoori', 'itorod', 'ortoidry', 'rodtoi', 'roomidry', 'rootidry', 'torodi']
Output I want:
['dirty room', 'dormitory', 'room dirty']
Link to word list: https://1drv.ms/t/s!AlfWKzBlwHQKbPj9P_pyKdmPwpg
Without knowing your words list it is hard to tell why it is including the 'wrong' entries. Trying with just
words = ['room','dirty','dormitory']
Returns the correct entries.
If you want spaces between the words, you need to change
sorted_word = ''.join([word, i])
to
sorted_word = ' '.join([word, i])
(Note the added space)
Incidentally, if you are wanting to solve this problem more efficiently then using a 'trie' data structure to store words can help (https://en.wikipedia.org/wiki/Trie)
Question errors:
You are saying:
Words is a list of words from a file, and letter count is a dictionary of characters (already in lower case). the list of words in words is also in lowercase already.
But you are actually calling the function in a different way:
print ana_words('dormitory')
This is not right.
Checking if a dictionary's values are all 0:
if not letter_count: doesn't do what you expected; it is only true for an empty dictionary. To check whether a dictionary has all 0s you should use if not any(letter_count.values()):, which first obtains the values, checks whether any of them is different from 0, and then negates the answer.
Joining words:
The str.join(arg1) method is not for joining two words; it joins the items of the iterable passed as arg1, using the string as the separator. In your case the iterable is a string (a sequence of characters) and the separator is empty, so the result is the same word.
''.join('Hello')
>>> 'Hello'
The second time you use it, the iterable is a list, and it joins word with each element of var1, which is a list of words, so that's fine apart from the missing space. The problem is that you are not doing anything with sorted_word inside the loop: only its last value is used. The anagrams_list.append(sorted_word) should be inside the loop, and the sorted_word = ''.join(word) line should be deleted.
Other errors:
Aside from all these errors, you are never checking whether the letter counts reach 0 to stop the recursion.
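A minimal sketch that applies the corrections discussed above (the all-zero base case, the space in the join, and appending inside the loop) could look like the following. Several things here are assumptions rather than the asker's code: within_letter_count is a hypothetical stand-in for the undefined _within_letter_count helper, the full word list is passed to the recursive call so that both word orders are produced, and the word list is assumed to contain no empty strings.

```python
from collections import Counter

def within_letter_count(word, letter_count):
    """Return True if word can be spelled with the remaining letters (assumed helper)."""
    needed = Counter(word)
    return all(letter_count.get(ch, 0) >= n for ch, n in needed.items())

def ana_words(words, letter_count):
    """Return every space-separated phrase of allowed words that uses all the letters."""
    # Base case: every count is zero, so the phrase is complete.
    if not any(letter_count.values()):
        return [""]
    anagrams = []
    for word in words:
        if not within_letter_count(word, letter_count):
            continue
        remaining = dict(letter_count)
        for ch in word:
            remaining[ch] -= 1
        # Append one phrase per completion of the remaining letters.
        for rest in ana_words(words, remaining):
            anagrams.append(" ".join([word, rest]).strip())
    return anagrams

result = ana_words(["room", "dirty", "dormitory"], Counter("dormitory"))
print(sorted(result))  # ['dirty room', 'dormitory', 'room dirty']
```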
A beginner's Python question:
I have a string with x number of sentences. How do I extract the first 2 sentences (each may end with . or ? or !)?
Ignoring considerations such as when a . constitutes the end of sentence:
import re
' '.join(re.split(r'(?<=[.?!])\s+', phrase, 2)[:-1])
EDIT: Another approach that just occurred to me is this:
re.match(r'(.*?[.?!](?:\s+.*?[.?!]){0,1})', phrase).group(1)
Notes:
Whereas the first solution lets you replace the 2 with some other number to choose a different number of sentences, in the second solution, you change the 1 in {0,1} to one less than the number of sentences you want to extract.
The second solution isn't quite as robust in handling, e.g., empty strings, or strings with no punctuation. It could be made so, but the regex would be even more complex than it is already, and I would favour the slightly less efficient first solution over an unreadable mess.
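As a sketch of the first approach wrapped in a function, the version below takes the number of sentences as a parameter. The name first_sentences is made up, and it deliberately differs from the one-liner in one respect: it splits the whole phrase and slices the first n pieces instead of dropping the last piece, so a phrase with exactly n sentences is returned whole. It still ignores cases where a period does not end a sentence.

```python
import re

def first_sentences(phrase, n=2):
    """Return the first n sentences, splitting after '.', '?' or '!' followed by whitespace."""
    parts = re.split(r"(?<=[.?!])\s+", phrase.strip())
    return " ".join(parts[:n])

print(first_sentences("Sentence 1. Sentence 2? Sentence 3! Sentence 4."))
# Sentence 1. Sentence 2?
```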
I solved it like this: Separating sentences, though a comment on that post also points to NLTK, though I don't know how to find the sentence segmenter on their site...
Here's how you could do it:
text = "Sentence one? Sentence two. Sentence three? Sentence four. Sentence five."
sentences = text.split(".")
allSentences = []
for sentence in sentences:
    allSentences.extend(sentence.split("?"))
# the first two sentences
print(allSentences[0:2])
There are probably better ways, I look forward to seeing them.
Here is a step by step explanation of how to disassemble, choose the first two sentences, and reassemble it. As noted by others, this does not take into account that not all dot/question/exclamation characters are really sentence separators.
import re
testline = "Sentence 1. Sentence 2? Sentence 3! Sentence 4. Sentence 5."
# split the first two sentences by the dot/question/exclamation.
sentences = re.split('([.?!])', testline, 2)
print("result of split: ", sentences)
# toss everything else (the last item in the list)
firstTwo = sentences[:-1]
print(firstTwo)
# put the first two sentences back together
finalLine = ''.join(firstTwo)
print(finalLine)
A generator alternative, using my utility function that yields pieces of the string up to any item in the search sequence:
from itertools import islice
testline = "Sentence 1. Sentence 2? Sentence 3! Sentence 4. Sentence 5."
def multis(search_sequence, text, start=0):
    """ multisearch by given search sequence values from text, starting from position start
    yielding tuples of text before found item and found sequence item"""
    x = ''
    for ch in text[start:]:
        if ch in search_sequence:
            if x: yield (x, ch)
            else: yield ch
            x = ''
        else:
            x += ch
    else:
        if x: yield x
# split the first two sentences by the dot/question/exclamation.
two_sentences = list(islice(multis('.?!', testline), 2)) ## must save the result of generation
print("result of split: ", two_sentences)
print('\n'.join(sentence.strip()+sep for sentence, sep in two_sentences))