Limit the number of sentences in a string - python

A beginner's Python question:
I have a string with x number of sentences. How do I extract the first 2 sentences? (A sentence may end with . or ? or !)

Ignoring considerations such as when a . constitutes the end of a sentence:
import re
' '.join(re.split(r'(?<=[.?!])\s+', phrase, 2)[:-1])
EDIT: Another approach that just occurred to me is this:
re.match(r'(.*?[.?!](?:\s+.*?[.?!]){0,1})', phrase).group(1)
Notes:
Whereas the first solution lets you replace the 2 with some other number to choose a different number of sentences, in the second solution, you change the 1 in {0,1} to one less than the number of sentences you want to extract.
The second solution isn't quite as robust in handling, e.g., empty strings, or strings with no punctuation. It could be made so, but the regex would be even more complex than it is already, and I would favour the slightly less efficient first solution over an unreadable mess.
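If you want this as a reusable helper, here is a minimal sketch generalizing the first solution (the name first_n_sentences is mine; taking parts[:n] instead of [:-1] also survives inputs with fewer than n sentences):
import re
def first_n_sentences(phrase, n=2):
    # Split on whitespace that follows ., ? or !; the unsplit remainder
    # of the text, if any, ends up in the last piece and is discarded.
    parts = re.split(r'(?<=[.?!])\s+', phrase, maxsplit=n)
    return ' '.join(parts[:n])
print(first_n_sentences("One. Two? Three! Four."))  # One. Two?
print(first_n_sentences("Just one sentence."))      # Just one sentence.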

I solved it like this: Separating sentences. A comment on that post also points to NLTK, though I don't know how to find the sentence segmenter on their site...

Here's how you could do it:
str = "Sentence one? Sentence two. Sentence three? Sentence four. Sentence five."
sentences = str.split(".")
allSentences = []
for sentence in sentences
allSentences.extend(sentence.split("?"))
print allSentences[0:3]
There are probably better ways, I look forward to seeing them.

Here is a step-by-step explanation of how to disassemble the string, choose the first two sentences, and reassemble them. As noted by others, this does not take into account that not all dot/question/exclamation characters are really sentence separators.
import re
testline = "Sentence 1. Sentence 2? Sentence 3! Sentence 4. Sentence 5."
# split the first two sentences by the dot/question/exclamation.
sentences = re.split('([.?!])', testline, 2)
print "result of split: ", sentences
# toss everything else (the last item in the list)
firstTwo = sentences[:-1]
print firstTwo
# put the first two sentences back together
finalLine = ''.join(firstTwo)
print finalLine

A generator alternative, using my utility function that returns pieces of the string up to any item in a search sequence:
from itertools import islice
testline = "Sentence 1. Sentence 2? Sentence 3! Sentence 4. Sentence 5."
def multis(search_sequence, text, start=0):
    """ multisearch by given search sequence values from text, starting from
    position start, yielding tuples of text before found item and found sequence item """
    x = ''
    for ch in text[start:]:
        if ch in search_sequence:
            if x:
                yield (x, ch)
            else:
                yield ch
            x = ''
        else:
            x += ch
    else:
        if x:
            yield x
# split the first two sentences by the dot/question/exclamation.
two_sentences = list(islice(multis('.?!',testline),2)) ## must save the result of generation
print "result of split: ", two_sentences
print '\n'.join(sentence.strip()+sep for sentence,sep in two_sentences)

Related

Regular expression to extract all sentences that start and end with the same word

Given a string of sentences, I need to extract a list of all of the sentences which start and end with the same word.
e.g.
# sample text
text = "This is a sample sentence. well, I'll check that things are going well. another sentence starting with another. ..."
# required result
[
    "well, I'll check that things are going well",
    "another sentence starting with another"
]
How can I make the match using back references and also capture the full sentence?
I have tried the following regex but it's not working.
re.findall("^[a-zA-Z](.*[a-zA-Z])?$", text)
text = "This is a sample sentence. going to checking whether it is well going. another
sentence starting with another."
sentences = re.split('[.!?]+', text)
result = []
for s in sentences:
words = s.split()
if len(words) > 0 and words[0] == words[-1]:
result.append(s.strip())
print(result)
You could try using a backreference to reuse the match...
import re
# sample text
text = "This is a sample sentence. Well, I'll check that things are going well. another sentence starting with another. ..."
print([match[0] for match in re.findall(r"((\b\w+\b)[^.?!]+\2[.?!])", text, re.IGNORECASE)])
This prints...
["Well, I'll check that things are going well.", 'another sentence starting with another.']
Note: I changed the case of the first "well" to "Well" for testing purposes.
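One caveat worth adding (my observation, not from the answer): nothing anchors the start of the backreference, so \2 can match the tail of a longer word, as in 'Well ... swell.'. Putting \b before \2 requires the closing word to start at a word boundary:
import re
text = "Well, I thought things were going swell. another sentence starting with another."
# Without the boundary, the first sentence is wrongly matched via 'swell.':
print([m[0] for m in re.findall(r"((\b\w+\b)[^.?!]+\2[.?!])", text, re.IGNORECASE)])
# With \b before \2, only the genuine match remains:
print([m[0] for m in re.findall(r"((\b\w+\b)[^.?!]+\b\2[.?!])", text, re.IGNORECASE)])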

Return first word in sentence? [duplicate]

This question already has answers here:
How to extract the first and final words from a string?
(7 answers)
Closed 5 years ago.
Here's the question I have to answer for school:
For the purposes of this question, we will define a word as ending a sentence if that word is immediately followed by a period. For example, in the text “This is a sentence. The last sentence had four words.”, the ending words are ‘sentence’ and ‘words’. In a similar fashion, we will define the starting word of a sentence as any word that is preceded by the end of a sentence. The starting words from the previous example text would be “The”. You do not need to consider the first word of the text as a starting word. Write a program that has:
An endwords function that takes a single string argument. This function must return a list of all sentence-ending words that appear in the given string. There should be no duplicate entries in the returned list, and the periods should not be included in the ending words.
The code I have so far is:
def startwords(astring):
    mylist = astring.split()
    if mylist.endswith('.') == True:
        return my list
but I don't know if I'm using the right approach. I need some help
There are several issues with your code. The following would be a simple approach: create a list of bigrams and pick the second token of each bigram where the first token ends with a period:
def startwords(astring):
    mylist = astring.split()  # a list! Has no 'endswith' method
    bigrams = zip(mylist, mylist[1:])
    return [b[1] for b in bigrams if b[0].endswith('.')]
zip and list comprehension are two things worth reading up on.
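For example, calling it on the text from the question should print ['The'] (my quick check; works on both Python 2 and 3):
text = "This is a sentence. The last sentence had four words."
print(startwords(text))  # ['The']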
mylist = astring.split()
if mylist.endswith('.')
That cannot work, one of the reasons being that mylist is a list, and lists don't have endswith as a method.
Another answer fixed your approach so let me propose a regular expression solution:
import re
print(re.findall(r"\.\s*(\w+)","This is a sentence. The last sentence had four words."))
This matches all words following a dot and optional spaces.
Result: ['The']
def endwords(astring):
    mylist = astring.split('.')
    temp_words = [x.rpartition(" ")[-1] for x in mylist if len(x) > 1]
    return list(set(temp_words))
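A quick check of my own on the example text from the question:
print(endwords("This is a sentence. The last sentence had four words."))
# ['sentence', 'words'] (order may vary, since sets are unordered)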
This creates a set so there are no duplicates, then loops over the list of sentences (split by "."), splits each sentence into words, uses [-1:] to make a list holding only the last word, and takes item [0] of that list.
print (set([ x.split()[-1:][0] for x in s.split(".") if len(x.split())>0]))
The if is actually needed: splitting on "." leaves an empty string after the final period, and indexing its empty word list would raise an IndexError.
This works as well:
print (set([ x.split() [len(x.split())-1] for x in s.split(".") if len(x.split())>0]))
This is one way to do it ->
#!/usr/bin/env python
from sets import Set
sentence = 'This is a sentence. The last sentence had four words.'
uniq_end_words = Set()
for word in sentence.split():
    if '.' in word:
        # check if period (.) is at the end
        if '.' == word[len(word) - 1]:
            uniq_end_words.add(word.rstrip('.'))
print list(uniq_end_words)
Output (list of all the end words in a given sentence) ->
['words', 'sentence']
If your input string has a period inside one of its words (let's say the last word), something like this ->
'I like the documentation of numpy.random.rand.'
The output would be - ['numpy.random.rand']
And for input string 'I like the documentation of numpy.random.rand a lot.'
The output would be - ['lot']

How can you use Python to count the unique words (without special characters/cases interfering) in a text document

I am new to Python and need some help with trying to come up with a text content analyzer that will help me find 7 things within a text file:
Total word count
Total count of unique words (without case and special characters interfering)
The number of sentences
Average words in a sentence
Find commonly used phrases (a phrase of 3 or more words used over 3 times)
A list of words used, in order of descending frequency (without case and special characters interfering)
The ability to accept input from STDIN, or from a file specified on the command line
So far I have this Python program to print total word count:
with open('/Users/name/Desktop/20words.txt', 'r') as f:
    p = f.read()
    words = p.split()
    wordCount = len(words)
    print "The total word count is:", wordCount
So far I have this Python program to print unique words and their frequency (it's not in order, and it sees words such as dog, dog., "dog, and dog, as different words):
file = open("/Users/name/Desktop/20words.txt", "r+")
wordcount = {}
for word in file.read().split():
    if word not in wordcount:
        wordcount[word] = 1
    else:
        wordcount[word] += 1
for k, v in wordcount.items():
    print k, v
Thank you for any help you can give!
Certainly the most difficult part is identifying the sentences. You could use a regular expression for this, but there might still be some ambiguity, e.g. with names and titles that have a dot followed by an upper-case letter. For words, too, you can use a simple regex instead of using split. The exact expression to use depends on what qualifies as a "word". Finally, you can use collections.Counter for counting all of those instead of doing this manually. Use str.lower to convert either the text as a whole or the individual words to lowercase.
This should help you get started:
import re, collections
text = """Sentences start with an upper-case letter. Do they always end
with a dot? No! Also, not each dot is the end of a sentence, e.g. these two,
but this is. Still, some ambiguity remains with names, like Mr. Miller here."""
sentence = re.compile(r"[A-Z].*?[.!?](?=\s+[A-Z]|$)", re.S)
sentences = collections.Counter(sentence.findall(text))
for n, s in sentences.most_common():
    print n, s
word = re.compile(r"\w+")
words = collections.Counter(word.findall(text.lower()))
for n, w in words.most_common():
    print n, w
For "more power", you could use some natural language toolkit, but this might be a bit much for this task.
If you know what characters you want to avoid, you can use str.strip to remove these characters from the extremities.
word = word.strip().strip("'").strip('"')...
This will remove the occurrence of these characters on the extremities of the word.
This probably isn't as efficient as using some NLP library, but it can get the job done.
str.strip Docs
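Putting the pieces together for the unique-word requirement, here is a short sketch of my own (assuming string.punctuation covers the special characters you care about) that lowercases each token and strips punctuation from its ends before counting:
import collections
import string

def count_words(text):
    # Lowercase and strip punctuation from both ends of each token, so
    # that dog, dog., "dog, and dog, all count as the same word.
    words = (w.strip(string.punctuation).lower() for w in text.split())
    return collections.Counter(w for w in words if w)

print(count_words('The dog saw the dog. "Dog!" said the dog,'))
# Counter({'dog': 4, 'the': 3, 'saw': 1, 'said': 1})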

How do I calculate the number of times a word occurs in a sentence?

So I've been learning Python for some months now and was wondering how I would go about writing a function that will count the number of times a word occurs in a sentence. I would appreciate if someone could please give me a step-by-step method for doing this.
Quick answer:
def count_occurrences(word, sentence):
    return sentence.lower().split().count(word)
'some string'.split() will split the string on whitespace (spaces, tabs and linefeeds) into a list of word-ish things. Then ['some', 'string'].count(item) returns the number of times item occurs in the list.
That doesn't handle removing punctuation. You could do that using string.maketrans and str.translate.
# Make collection of chars to keep (don't translate them)
import string
keep = string.lowercase + string.digits + string.whitespace
table = string.maketrans(keep, keep)
delete = ''.join(set(string.printable) - set(keep))

def count_occurrences(word, sentence):
    return sentence.lower().translate(table, delete).split().count(word)
The key here is that we've constructed the string delete so that it contains all the ASCII characters except letters, numbers and spaces. Then str.translate in this case takes a translation table that doesn't change the string, plus a string of chars to strip out.
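Note that string.maketrans and the two-argument translate are Python 2 only. On Python 3 the same idea uses the three-argument str.maketrans, whose third argument lists the characters to delete (my port, not part of the original answer):
import string

keep = string.ascii_lowercase + string.digits + string.whitespace
delete = ''.join(set(string.printable) - set(keep))
table = str.maketrans('', '', delete)  # maps nothing, deletes the chars in 'delete'

def count_occurrences(word, sentence):
    return sentence.lower().translate(table).split().count(word)

print(count_occurrences('dog', 'The dog saw the dog.'))  # 2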
wilberforce has the quick, correct answer, and I'll give the long-winded 'how to get to that conclusion' answer.
First, here are some tools to get you started, and some questions you need to ask yourself.
You need to read the section on Sequence Types in the Python docs, because it is your best friend for solving this problem. Seriously, read it. Once you have read that, you should have some ideas. For example, you can take a long string and break it up using the split() method. To be explicit:
mystring = "This sentence is a simple sentence."
result = mystring.split()
print result
print "The total number of words is: " + str(len(result))
print "The word 'sentence' occurs: " + str(result.count("sentence"))
This takes the input string, splits it on any whitespace, and will give you:
["This", "sentence", "is", "a", "simple", "sentence."]
The total number of words is: 6
The word 'sentence' occurs: 1
Now note here that you do have the period still at the end of the second 'sentence'. This is a problem because 'sentence' is not the same as 'sentence.'. If you are going to go over your list and count words, you need to make sure that the strings are identical. You may need to find and remove some punctuation.
A naive approach to this might be:
no_period_string = mystring.replace(".", " ")
print no_period_string
To get me a period-less sentence:
"This sentence is a simple sentence"
You also need to decide if your input going to be just a single sentence, or maybe a paragraph of text. If you have many sentences in your input, you might want to find a way to break them up into individual sentences, and find the periods (or question marks, or exclamation marks, or other punctuation that ends a sentence). Once you find out where in the string the 'sentence terminator' is you could maybe split up the string at that point, or something like that.
You should give this a try yourself - hopefully I've peppered in enough hints to get you to look at some specific functions in the documentation.
Simplest way:
def count_occurrences(word, sentence):
    return sentence.count(word)
text=input("Enter your sentence:")
print("'the' appears", text.count("the"),"times")
simplest way to do it
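Note that str.count matches substrings, not whole words, so it also counts 'the' inside other words (my example):
text = "The others went there."
print(text.lower().count("the"))  # 3: 'The', plus 'the' inside 'others' and 'there'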
The problem with using the count() method is that it does not always give the correct number of occurrences when matches overlap, for example:
print('banana'.count('ana'))
output
1
but 'ana' occurs twice in 'banana'
To solve this issue, I used:
def total_occurrence(string, word):
    count = 0
    tempstring = string
    while word in tempstring:
        count += 1
        tempstring = tempstring[tempstring.index(word) + 1:]
    return count
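A quick check of my own that this handles the overlapping case:
print(total_occurrence('banana', 'ana'))  # 2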
You can do it like this:
def countWord(word):
    numWord = 0
    for i in range(1, len(word) - 1):
        if word[i-1:i+3] == 'word':
            numWord += 1
    print 'Number of times "word" occurs is:', numWord
Then calling it with the string:
countWord('wordetcetcetcetcetcetcetcword')
will return: Number of times "word" occurs is: 2
def check_Search_WordCount(mySearchStr, mySentence):
    len_mySentence = len(mySentence)
    len_Sentence_without_Find_Word = len(mySentence.replace(mySearchStr, ""))
    len_Remaining_Sentence = len_mySentence - len_Sentence_without_Find_Word
    count = len_Remaining_Sentence / len(mySearchStr)
    return int(count)
I assume that you just know about Python strings and for loops.
def count_occurences(s, word):
    count = 0
    for i in range(len(s)):
        if s[i:i+len(word)] == word:
            count += 1
    return count

mystring = "This sentence is a simple sentence."
myword = "sentence"
print(count_occurences(mystring, myword))
Explanation:
s[i:i+len(word)]: slices the string s to extract a chunk having the same length as the word argument
count += 1: increases the counter whenever there is a match.

How do I count the number of sentences, words and characters in a file?

I have written the following code to tokenize the input paragraph that comes from the file samp.txt. Can anybody help me out to find and print the number of sentences, words and characters in the file? I have used NLTK in python for this.
>>>import nltk.data
>>>import nltk.tokenize
>>>f=open('samp.txt')
>>>raw=f.read()
>>>tokenized_sentences=nltk.sent_tokenize(raw)
>>>for each_sentence in tokenized_sentences:
... words=nltk.tokenize.word_tokenize(each_sentence)
... print each_sentence #prints tokenized sentences from samp.txt
>>>tokenized_words=nltk.word_tokenize(raw)
>>>for each_word in tokenized_words:
... words=nltk.tokenize.word_tokenize(each_word)
... print each_word #prints tokenized words from samp.txt
Try it this way (this program assumes that you are working with one text file in the directory specified by dirpath):
import nltk
folder = nltk.data.find(dirpath)
corpusReader = nltk.corpus.PlaintextCorpusReader(folder, r'.*\.txt')
print "The number of sentences =", len(corpusReader.sents())
print "The number of patagraphs =", len(corpusReader.paras())
print "The number of words =", len([word for sentence in corpusReader.sents() for word in sentence])
print "The number of characters =", len([char for sentence in corpusReader.sents() for word in sentence for char in word])
Hope this helps
With nltk, you can also use FreqDist (see Chapter 3.1 of the O'Reilly NLTK book).
And in your case:
import nltk
raw = open('samp.txt').read()
raw = nltk.Text(nltk.word_tokenize(raw.decode('utf-8')))
fdist = nltk.FreqDist(raw)
print fdist.N()
For what it's worth, if someone comes along here: I think this addresses everything the OP asked. Counting sentences and characters is very easy with the textstat package; note that it relies on each sentence ending with punctuation.
import textstat
your_text = "This is a sentence! This is sentence two. And this is the final sentence?"
print("Num sentences:", textstat.sentence_count(your_text))
print("Num chars:", textstat.char_count(your_text, ignore_spaces=True))
print("Num words:", len(your_text.split()))
I believe this to be the right solution, because it properly counts things like "..." and "??" as ending a single sentence:
len(re.findall(r"[^?!.][?!.]", paragraph))
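For example (my own check), runs of terminators count as a single sentence end:
import re
paragraph = "Wait... what?? This is one sentence. And another!"
print(len(re.findall(r"[^?!.][?!.]", paragraph)))  # 4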
Characters are easy to count.
Paragraphs are usually easy to count too. Whenever you see two consecutive newlines you probably have a paragraph. You might say that an enumeration or an unordered list is a paragraph, even though their entries can be delimited by two newlines each. A heading or a title can also be followed by two newlines, even though they're clearly not paragraphs. Also consider the case of a single paragraph in a file, with one or no newlines following.
Sentences are tricky. You might settle for a period, exclamation mark or question mark followed by whitespace or end-of-file. It's tricky because sometimes a colon marks an end of sentence and sometimes it doesn't. Usually when it does, the next non-whitespace character would be capital, in the case of English. But sometimes not; for example, if it's a digit. And sometimes an open parenthesis marks end of sentence (but that is arguable, as in this case).
Words too are tricky. Usually words are delimited by whitespace or punctuation marks. Sometimes a dash delimits a word, sometimes not. That is the case with a hyphen, for example.
For words and sentences you will probably need to clearly state your definition of a sentence and a word and program for that.
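As a rough illustration of those heuristics, here is a sketch of my own, with all the caveats above still applying: a sentence ends with terminators followed by whitespace or end of text, a word is a run of word characters (internal apostrophes and hyphens allowed), and paragraphs are separated by blank lines.
import re

def count_units(text):
    # Heuristic sentence boundary: terminator(s) followed by whitespace or end.
    sentences = re.findall(r"[^.!?]+[.!?]+(?:\s|$)", text)
    # Heuristic word: word characters, allowing internal apostrophes/hyphens.
    words = re.findall(r"\w+(?:['-]\w+)*", text)
    # Heuristic paragraph: blocks separated by blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    return len(sentences), len(words), len(paragraphs), len(text)

text = "One sentence here. Another one!\n\nA new paragraph?"
print(count_units(text))  # (3, 8, 2, 49)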
Not 100% correct, but I just gave it a try. I have not taken all the points by #wilhelmtell into consideration; I'll try them once I have time...
if __name__ == "__main__":
    f = open("1.txt")
    c = w = 0
    s = 1
    prevIsSentence = False
    for x in f:
        x = x.strip()
        if x != "":
            words = x.split()
            w = w + len(words)
            c = c + sum([len(word) for word in words])
            prevIsSentence = True
        else:
            if prevIsSentence:
                s = s + 1
                prevIsSentence = False
    if not prevIsSentence:
        s = s - 1
    print "%d:%d:%d" % (c, w, s)
Here 1.txt is the file name.
The only way you can solve this is by creating an AI program that uses Natural Language Processing, which is not very easy to do.
Input:
"This is a paragraph about the Turing machine. Dr. Allan Turing invented the Turing Machine. It solved a problem that has a .1% change of being solved."
Checkout OpenNLP
https://sourceforge.net/projects/opennlp/
http://opennlp.apache.org/
There's already a program to count words and characters-- wc.
