Python - Split point is an array of words

I want to split a sentence based upon words currently stored in an array. The array stores the words that I want to act as split points. Can I use an array as a split point with a regex?
Example:
array=['and','also','but']
Text file:
I am new to Python and I need help. I am also asking a question.
Required output:
I need help
asking a question

You can use the re.split() function:
import re
array = ['and','also','but']
sentence = "I am new to Python and I need help. I am also asking a question."
result = re.split("|".join(array), sentence)
I'll also add a trim to strip the surrounding whitespace:
result = [x.strip() for x in result]
print(result)
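One caveat: joining the words directly means 'and' will also split inside longer words such as 'android'. A hedged variant that adds word boundaries (and escapes the words, in case they ever contain regex metacharacters):

import re

array = ['and', 'also', 'but']
sentence = "I am new to Python and I need help. I am also asking a question."
# wrap each word in \b so 'and' does not match inside e.g. 'android'
pattern = "|".join(r'\b' + re.escape(word) + r'\b' for word in array)
result = [x.strip() for x in re.split(pattern, sentence)]
print(result)
# -> ['I am new to Python', 'I need help. I am', 'asking a question.']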

Here's an adaptation of @hurturk's solution that will produce @blahhh's requested output.
Since my crystal ball broke last week, whether this algorithm is what @blahhh intended is anyone's guess.
from __future__ import print_function
import re

array = ['and', 'also', 'but']
separators = [r'\.', r'\;', r'\?', r'\!']
sentence = "I am new to Python and I need help. I am also asking a question."
sentences = re.split("|".join(separators), sentence)
for sentence in sentences:
    if not sentence.strip():
        continue  # skip the empty tail left by the final period
    result = re.split("|".join(array), sentence)
    result = [x.strip() for x in result]
    print(result[-1])
where the output is:
I need help
asking a question

Related

Return first word in sentence? [duplicate]

Here's the question I have to answer for school:
For the purposes of this question, we will define a word as ending a sentence if that word is immediately followed by a period. For example, in the text “This is a sentence. The last sentence had four words.”, the ending words are ‘sentence’ and ‘words’. In a similar fashion, we will define the starting word of a sentence as any word that is preceded by the end of a sentence. The starting word from the previous example text would be “The”. You do not need to consider the first word of the text as a starting word. Write a program that has:
An endwords function that takes a single string argument. This function must return a list of all sentence-ending words that appear in the given string. There should be no duplicate entries in the returned list and the periods should not be included in the ending words.
The code I have so far is:
def startwords(astring):
    mylist = astring.split()
    if mylist.endswith('.') == True:
        return my list
but I don't know if I'm using the right approach. I need some help
Several issues with your code. The following would be a simple approach. Create a list of bigrams and pick the second token of each bigram where the first token ends with a period:
def startwords(astring):
    mylist = astring.split()  # a list! Has no 'endswith' method
    bigrams = zip(mylist, mylist[1:])
    return [b[1] for b in bigrams if b[0].endswith('.')]
zip and list comprehensions are two things worth reading up on.
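For example, here is what the bigram pairing produces on the question's sample text:

mylist = "This is a sentence. The last sentence had four words.".split()
print(list(zip(mylist, mylist[1:])))
# [('This', 'is'), ('is', 'a'), ('a', 'sentence.'), ('sentence.', 'The'), ...]
# only the pair whose first token ends with '.' contributes a starting word: 'The'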
mylist = astring.split()
if mylist.endswith('.')
That cannot work, one of the reasons being that mylist is a list, and a list doesn't have endswith as a method.
Another answer fixed your approach so let me propose a regular expression solution:
import re
print(re.findall(r"\.\s*(\w+)","This is a sentence. The last sentence had four words."))
This matches every word following a dot and optional spaces.
Result: ['The']
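A matching one-liner for the ending words, in the same spirit (my addition, not part of the original answer):

import re
print(re.findall(r"(\w+)\.", "This is a sentence. The last sentence had four words."))
# result: ['sentence', 'words']; wrap it in set() to drop duplicates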
def endwords(astring):
    mylist = astring.split('.')
    temp_words = [x.rpartition(" ")[-1] for x in mylist if len(x) > 1]
    return list(set(temp_words))
This creates a set so there are no duplicates. It loops over the list of sentences (split by "."), splits each sentence into words, then uses [-1:] to make a list holding only the last word and takes item [0] of that list:
print (set([ x.split()[-1:][0] for x in s.split(".") if len(x.split())>0]))
The if is needed in practice: the trailing period leaves an empty segment whose split() is an empty list, so [0] would fail without the guard.
This works as well:
print (set([ x.split()[len(x.split())-1] for x in s.split(".") if len(x.split())>0]))
This is one way to do it:
#!/usr/bin/env python
from sets import Set

sentence = 'This is a sentence. The last sentence had four words.'
uniq_end_words = Set()
for word in sentence.split():
    if '.' in word:
        # check if the period (.) is at the end
        if '.' == word[len(word) - 1]:
            uniq_end_words.add(word.rstrip('.'))
print list(uniq_end_words)
Output (list of all the end words in a given sentence) ->
['words', 'sentence']
If your input string has a period inside one of its words (let's say the last word), such as:
'I like the documentation of numpy.random.rand.'
The output would be: ['numpy.random.rand']
And for the input string 'I like the documentation of numpy.random.rand a lot.'
The output would be: ['lot']

Split string into two parts at double consonants in Python

I'm currently working on some code to translate lines of text into a variation of pig latin, and one of the requirements is that at any occurrence of a double consonant (bb, cc, dd, etc.) the string needs to split between those two consonants and reform the word as "s" + part 2 + part 1 + "s".
The first part of my code is:
raw = input("Enter a line to be translated: ")
words = raw.split()
for word in words:
Any help would be greatly appreciated.
Example input/output: "hello, I am Sammy, nice to meet you" = "slohels, I am smysams, nice to meet you"
You might want to use a regular expression here...
>>> import re
>>> s = 'hello, I am Sammy, nice to meet you'
>>> re.sub(r'((\w*([bcdfghjklmnpqrstvwxyz]))(\3\w*))', r's\4\2s', s)
'slohels, I am smySams, nice to meet you'
Almost there... :)
I'll help you on the way a bit.
double_consonants = [2*c for c in 'bcdfghjklmnpqrstvwxz']
for word in words:
    for d_c in double_consonants:
        if d_c in word:
            pass  # You should be able to finish this bit yourself
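If you get stuck, here is one hedged way the loop could be finished (a sketch assuming the "s" + part 2 + part 1 + "s" rule from the question; it ignores punctuation attached to words and splits at whichever double consonant from the list is found first):

double_consonants = [2 * c for c in 'bcdfghjklmnpqrstvwxz']

def translate(word):
    # split between the two consonants of the first double consonant found
    for d_c in double_consonants:
        i = word.find(d_c)
        if i != -1:
            part1, part2 = word[:i + 1], word[i + 1:]
            return 's' + part2 + part1 + 's'
    return word  # no double consonant: leave the word unchanged

print(' '.join(translate(w) for w in "hello I am sammy".split()))
# -> slohels I am smysams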
You should have a look at the string methods. Tip: try:
>>> s = "I am going to apply for a job".split('pp')
What does that return? Another tip is to use something like the code above and place it inside a for loop. Maybe you should split the string into a list of words first?
Edit: The reason I don't give you the entire answer here is that I suspect this is a homework assignment.

Using sent_tokenize in a specific area of a file in Python using NLTK?

I have a file with thousands of sentences, and I want to find the sentence containing a specific character/word.
Originally, I was tokenizing the entire file (using sent_tokenize) and then iterating through the sentences to find the word. However, this is too slow. Since I can quickly find the indices of the words, can I use this to my advantage? Is there a way to just tokenize an area around a word (i.e. figure out which sentence contains a word)?
Thanks.
Edit: I'm in Python and using the NLTK library.
What platform are you using? On unix/linux/macOS/cygwin, you can do the following:
sed 's/[.?!]/\n/g' < myfile | grep 'myword'
which will display just the lines containing your word (the sed gives a very rough tokenisation into sentences by putting each one on its own line). If you want a solution in a particular language, you should say what you're using!
EDIT for Python:
The following will work: it only calls the tokenization if there's a regexp match on your word (a very fast operation). This means you only tokenize lines that contain the word you want:
import re
import os.path
from nltk.tokenize import sent_tokenize

myword = 'using'
fname = os.path.abspath('path/to/my/file')
f = None
try:
    f = open(fname)
    matching_lines = [l for l in f if re.search(r'\b' + myword + r'\b', l)]
    for match in matching_lines:
        # do something with the matching lines
        sents = sent_tokenize(match)
except IOError:
    print "Can't open file " + fname
finally:
    if f:
        f.close()
Here's an idea that might speed up the search. You create an additional list in which you store the running total of the word counts for each sentence in your big text. Using a generator function that I learned from Alex Martelli, try something like:
def running_sum(a):
    tot = 0
    for item in a:
        tot += item
        yield tot

from nltk.tokenize import sent_tokenize
sen_list = sent_tokenize(bigtext)
wc = [len(s.split()) for s in sen_list]
runningwc = list(running_sum(wc))  # word count for each sentence (running total for the whole text)

word_index =  # some number that you get from the word index
for index, w in enumerate(runningwc):
    if w > word_index:
        sentnumber = index - 1  # found the index of the sentence that contains the word
        break

print sen_list[sentnumber]
Hope the idea helps.
UPDATE: If sent_tokenize is what is slow, then you can try avoiding it altogether. Use the known index to find the word in your big text.
Now, move forward and backward, character by character, to detect sentence ends and sentence starts. Something like "[.!?] " (a period, exclamation mark or question mark, followed by a space) would signify a sentence end and the next sentence's start. You will only be searching in the vicinity of your target word, so it should be much faster than sent_tokenize.
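A minimal sketch of that idea (sentence_around is a hypothetical helper name, and it assumes '.', '!' and '?' always end a sentence, which is only roughly true):

def sentence_around(text, idx):
    # scan backward to the previous sentence ender, forward to the next one
    enders = '.!?'
    start = idx
    while start > 0 and text[start - 1] not in enders:
        start -= 1
    end = idx
    while end < len(text) and text[end] not in enders:
        end += 1
    return text[start:end + 1].strip()

text = "I have a file with thousands of sentences. I want the one using my word. It should be fast."
print(sentence_around(text, text.index('using')))
# -> I want the one using my word.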

How do I count the number of sentences, words and characters in a file?

I have written the following code to tokenize the input paragraph that comes from the file samp.txt. Can anybody help me find and print the number of sentences, words and characters in the file? I have used NLTK in Python for this.
>>> import nltk.data
>>> import nltk.tokenize
>>> f = open('samp.txt')
>>> raw = f.read()
>>> tokenized_sentences = nltk.sent_tokenize(raw)
>>> for each_sentence in tokenized_sentences:
...     words = nltk.tokenize.word_tokenize(each_sentence)
...     print each_sentence  # prints tokenized sentences from samp.txt
>>> tokenized_words = nltk.word_tokenize(raw)
>>> for each_word in tokenized_words:
...     print each_word  # prints tokenized words from samp.txt
Try it this way (this program assumes that you are working with one text file in the directory specified by dirpath):
import nltk
folder = nltk.data.find(dirpath)
corpusReader = nltk.corpus.PlaintextCorpusReader(folder, '.*\.txt')
print "The number of sentences =", len(corpusReader.sents())
print "The number of patagraphs =", len(corpusReader.paras())
print "The number of words =", len([word for sentence in corpusReader.sents() for word in sentence])
print "The number of characters =", len([char for sentence in corpusReader.sents() for word in sentence for char in word])
Hope this helps
With nltk, you can also use FreqDist (see the O'Reilly book, ch. 3.1)
And in your case:
import nltk
raw = open('samp.txt').read()
raw = nltk.Text(nltk.word_tokenize(raw.decode('utf-8')))
fdist = nltk.FreqDist(raw)
print fdist.N()
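fdist.N() gives the total token count. The sentence and character counts the question also asks for could be read off alongside it, e.g. (a sketch assuming the same samp.txt):

import nltk

raw = open('samp.txt').read()
print(len(nltk.sent_tokenize(raw)))  # number of sentences
print(len(raw))                      # number of characters, including whitespace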
For what it's worth, if someone comes along here: I think this addresses everything the OP asked. Counting sentences and characters is very easy with the textstat package, and it relies on the punctuation at the end of each sentence.
import textstat
your_text = "This is a sentence! This is sentence two. And this is the final sentence?"
print("Num sentences:", textstat.sentence_count(your_text))
print("Num chars:", textstat.char_count(your_text, ignore_spaces=True))
print("Num words:", len(your_text.split()))
I believe this to be the right solution, because it properly counts things like "..." and "??" as a single sentence:
len(re.findall(r"[^?!.][?!.]", paragraph))
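For example (paragraph here is a stand-in for your own text):

import re

paragraph = "Wait... what?? This is the third sentence."
# each run of terminal punctuation is preceded by exactly one non-punctuation
# character, so '...' and '??' each count only once
print(len(re.findall(r"[^?!.][?!.]", paragraph)))
# -> 3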
Characters are easy to count.
Paragraphs are usually easy to count too. Whenever you see two consecutive newlines you probably have a paragraph. You might say that an enumeration or an unordered list is a paragraph, even though their entries can be delimited by two newlines each. A heading or a title can also be followed by two newlines, even though they're clearly not paragraphs. Also consider the case of a single paragraph in a file, with one or no newlines following.
Sentences are tricky. You might settle for a period, exclamation mark or question mark followed by whitespace or end-of-file. It's tricky because sometimes a colon marks the end of a sentence and sometimes it doesn't. Usually, when it does, the next non-whitespace character is a capital letter, in the case of English. But sometimes not; for example, if it's a digit. And sometimes an open parenthesis marks the end of a sentence (but that is arguable, as in this case).
Words too are tricky. Usually words are delimited by whitespace or punctuation marks. Sometimes a dash delimits a word, sometimes not. That is the case with a hyphen, for example.
For words and sentences you will probably need to clearly state your definition of a sentence and a word and program for that.
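As a rough illustration of those caveats, a naive regex-based counter; it will get abbreviations like "Dr." and decimals like ".1" wrong, exactly as described above:

import re

text = "This is a sentence. The last sentence had four words."
# naive rules: a sentence ends at a run of ./!/? followed by space or end of
# text; a word is a run of letters, digits, apostrophes or hyphens
num_sentences = len(re.findall(r"[.!?]+(?:\s|$)", text))
num_words = len(re.findall(r"[\w'-]+", text))
num_chars = len(text)
print("%d %d %d" % (num_sentences, num_words, num_chars))
# -> 2 10 53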
Not 100% correct, but I just gave it a try. I have not taken all the points by @wilhelmtell into consideration; I'll try them once I have time...
if __name__ == "__main__":
    f = open("1.txt")
    c = w = 0
    s = 1
    prevIsSentence = False
    for x in f:
        x = x.strip()
        if x != "":
            words = x.split()
            w = w + len(words)
            c = c + sum([len(word) for word in words])
            prevIsSentence = True
        else:
            if prevIsSentence:
                s = s + 1
            prevIsSentence = False
    if not prevIsSentence:
        s = s - 1
    print "%d:%d:%d" % (c, w, s)
Here 1.txt is the file name.
The only way you can solve this fully is by creating an AI program that uses Natural Language Processing, which is not very easy to do.
Input:
"This is a paragraph about the Turing machine. Dr. Allan Turing invented the Turing Machine. It solved a problem that has a .1% change of being solved."
Check out OpenNLP:
https://sourceforge.net/projects/opennlp/
http://opennlp.apache.org/
There's already a program to count words and characters: wc.

Limit the number of sentences in a string

A beginner's Python question:
I have a string with x number of sentences. How do I extract the first 2 sentences (each may end with ., ? or !)?
Ignoring considerations such as when a . constitutes the end of a sentence:
import re
' '.join(re.split(r'(?<=[.?!])\s+', phrase, 2)[:-1])
EDIT: Another approach that just occurred to me is this:
re.match(r'(.*?[.?!](?:\s+.*?[.?!]){0,1})', phrase).group(1)
Notes:
Whereas the first solution lets you replace the 2 with some other number to choose a different number of sentences, in the second solution, you change the 1 in {0,1} to one less than the number of sentences you want to extract.
The second solution isn't quite as robust in handling, e.g., empty strings, or strings with no punctuation. It could be made so, but the regex would be even more complex than it is already, and I would favour the slightly less efficient first solution over an unreadable mess.
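For example, applying the first solution (phrase is a stand-in for your own string):

import re

phrase = "First one. Second two! Third three? Fourth four."
# keep the first 2 sentences; change the maxsplit argument (2) for another count
print(' '.join(re.split(r'(?<=[.?!])\s+', phrase, 2)[:-1]))
# -> First one. Second two!
# caveat: with 2 or fewer sentences the [:-1] drops the last one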
I solved it like this: Separating sentences. A comment on that post also points to NLTK, though I don't know how to find the sentence segmenter on their site...
Here's how you could do it:
s = "Sentence one? Sentence two. Sentence three? Sentence four. Sentence five."
sentences = s.split(".")
allSentences = []
for sentence in sentences:
    allSentences.extend(sentence.split("?"))
print allSentences[0:2]
There are probably better ways, I look forward to seeing them.
Here is a step by step explanation of how to disassemble, choose the first two sentences, and reassemble it. As noted by others, this does not take into account that not all dot/question/exclamation characters are really sentence separators.
import re
testline = "Sentence 1. Sentence 2? Sentence 3! Sentence 4. Sentence 5."
# split the first two sentences by the dot/question/exclamation.
sentences = re.split('([.?!])', testline, 2)
print "result of split: ", sentences
# toss everything else (the last item in the list)
firstTwo = sentences[:-1]
print firstTwo
# put the first two sentences back together
finalLine = ''.join(firstTwo)
print finalLine
A generator alternative, using my utility function that returns the pieces of a string up to any item in a search sequence:
from itertools import islice

testline = "Sentence 1. Sentence 2? Sentence 3! Sentence 4. Sentence 5."

def multis(search_sequence, text, start=0):
    """ multisearch by given search sequence values from text, starting from position start,
    yielding tuples of the text before a found item and the found sequence item """
    x = ''
    for ch in text[start:]:
        if ch in search_sequence:
            if x: yield (x, ch)
            else: yield ch
            x = ''
        else:
            x += ch
    else:
        if x: yield x

# split the first two sentences by the dot/question/exclamation.
two_sentences = list(islice(multis('.?!', testline), 2))  # must save the result of generation
print "result of split: ", two_sentences
print '\n'.join(sentence.strip() + sep for sentence, sep in two_sentences)
