Python: split a text into individual English sentences; retain the punctuation [closed] - python

I am trying to write a function that takes a string/text as an argument and returns a list of the sentences in the text. Sentence-ending punctuation (., ?, !) should be retained.
I don't want it to split on abbreviations (Dr., Kg., Mr., Mrs., e.g. "Dr. Jones").
Should I make a dictionary of all abbreviations?
Given input:
input = "I think Dr. Jones is busy now. Can you visit some other day? I was really surprised!"
Expected output:
output=['I think Dr. Jones is busy now.','Can you visit some other day?','I was really surprised!']
What I've tried:
# performing something like this:
output = input.split('.')
# will produce
'''
['I think Dr', ' Jones is busy now', ' Can you visit some other day? I was really surprised!']
'''
# whereas doing
output = input.split(' ')
# will produce
'''
['I', 'think', 'Dr.', 'Jones', 'is', 'busy', 'now.', 'Can', 'you', 'visit', 'some', 'other', 'day?', 'I', 'was', 'really', 'surprised!']
'''
The basic assumption is that the input text is not anomalously punctuated!

A clumsy way of achieving it is as follows:
abbr = {'Dr.', 'Mr.', 'Mrs.', 'Ms.'}
sentence_ender = ['.', '?', '!']
s = "I think Dr. Jones is busy now. Can you visit some other day? I was really surprised!"

def containsAny(wrd, charList):
    # True if the word contains at least one of the sentence-ending characters
    return any(c in wrd for c in charList)

def separate_sentences(string):
    sentences = []  # will hold all complete sentences
    temp = []       # words of the current sentence
    for wrd in string.split(' '):  # the input string is split on spaces
        temp.append(wrd)  # append the current word to temp
        # If the word is not an abbreviation yet contains any of the
        # sentence delimiters, join temp into a sentence and clear it.
        if wrd not in abbr and containsAny(wrd, sentence_ender):
            sentences.append(' '.join(temp))
            temp = []
    return sentences

print(separate_sentences(s))
Should produce:
['I think Dr. Jones is busy now.', 'Can you visit some other day?', 'I was really surprised!']
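An alternative sketch (not from the original answer; the abbreviation set and the `split_sentences` name are illustrative) splits after sentence-ending punctuation followed by whitespace, then re-joins fragments that end in a known abbreviation:

```python
import re

# Hand-maintained set, as in the question; real text needs many more entries.
ABBREVIATIONS = {"Dr.", "Mr.", "Mrs.", "Ms."}

def split_sentences(text):
    # Split after ., ? or ! that is followed by whitespace.
    parts = re.split(r'(?<=[.?!])\s+', text)
    sentences = []
    for part in parts:
        # If the previous fragment ended in a known abbreviation,
        # the split was spurious: glue this part back on.
        if sentences and sentences[-1].split()[-1] in ABBREVIATIONS:
            sentences[-1] += ' ' + part
        else:
            sentences.append(part)
    return sentences

print(split_sentences("I think Dr. Jones is busy now. "
                      "Can you visit some other day? I was really surprised!"))
```

For real-world text, a library tokenizer such as nltk's sent_tokenize covers far more abbreviation cases than a hand-maintained set.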

Related

python: find the last appearance of a word (of a list of words) in a text

given a list of stop words and a string:
list_stop_words = ['for', 'the', 'with']
mystring = 'this is the car for the girl with the long nice red hair'
I would like to get the text starting from the end up to the first stop word of the list.
Expected result: 'the long nice red hair'
I tried with several for loops but it is super cumbersome; there should be a more direct way, probably even a one-liner.
my super verbose solution:
list_stop_words = ['for', 'the', 'with']
mystring = 'this is the car for the girl with the long nice red hair'
reversed_sentence = mystring.split()[::-1]
for i, word in enumerate(reversed_sentence):
    if word in list_stop_words:
        position = i
        words = reversed_sentence[0:i+1]
        print(' '.join(words[::-1]))
        break
Any suggestion for a better approach?
EDIT AFTER THE ANSWER (SEE BELOW)
You can try something like this:
mystring[max([mystring.rfind(stop_word) for stop_word in list_stop_words]):]
Basically, you find the last occurrence of each word with rfind, then you take the rightmost of those positions with max, and then you slice the string from there.
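A caveat with rfind is that it matches substrings, not whole words (a 'the' inside another word would match too). A word-level sketch with the same inputs, splitting first and comparing whole tokens:

```python
list_stop_words = ['for', 'the', 'with']
mystring = 'this is the car for the girl with the long nice red hair'

words = mystring.split()
# index of the rightmost word that is a stop word
last = max(i for i, w in enumerate(words) if w in list_stop_words)
print(' '.join(words[last:]))  # the long nice red hair
```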

Removing stopwords from a string with ordered set and join retains a single stopword

I don't understand why I don't remove the stopword "a" in this loop. It seems so obvious that this should work...
Given a list of stop words, write a function that takes a string and returns a string stripped of the stop words. Expected output: stripped_paragraph = 'want figure out how can better data scientist'
Below I define 'stopwords'
I split all the words by a space, make a set of words while retaining the order
loop through the ordered and split substring set ('osss' var) and conditionally remove each word if it's a word in the list 'stopwords'
paragraph = 'I want to figure out how I can be a better data scientist'
def rm_stopwards(par):
    stopwords = ['I', 'as', 'to', 'you', 'your', 'but', 'be', 'a']
    osss = list(dict.fromkeys(par.split(' ')))  # ordered_split_shortened_set
    for word in osss:
        if word.strip() in stopwords:
            osss.remove(word)
        else:
            next
    return ' '.join(osss)

print("stripped_paragraph = " + "'" + rm_stopwards(paragraph) + "'")
My incorrect output is: 'want figure out how can a better data scientist'
Correct output: 'want figure out how can better data scientist'
edit: note that the .strip() in the condition check is unnecessary; I still get the same output without it. That was me checking to make sure there wasn't an extra space somehow.
edit2: this is an interview question, so I can't use any imports
What you're trying to do can be achieved with much less code.
The main problem is that your code changes the list while iterating over it.
This works and is much simpler: loop over the paragraph's words, keep only the ones that aren't in the stopwords list, then join them back together with a space.
paragraph = 'I want to figure out how I can be a better data scientist'
stopwords = ['I', 'as', 'to', 'you', 'your','but','be', 'a']
filtered = ' '.join([word for word in paragraph.split() if word not in stopwords])
print(filtered)
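To see why removing items while iterating skips elements: removing the current item shifts the rest of the list left, so the iterator's next step jumps over the element that moved into the current slot. A minimal demonstration:

```python
nums = ['a', 'a', 'b']
for x in nums:
    if x == 'a':
        nums.remove(x)  # shifts the remaining items left under the iterator
print(nums)  # ['a', 'b']; the second 'a' was never visited
```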
You may also consider using nltk, which has a predefined list of stopwords.
You should not change (delete/add) a collection (osss) while iterating over it.
del_list = []
for word in osss:
    if word.strip() in stopwords:
        del_list.append(word)
    else:
        next
osss = [e for e in osss if e not in del_list]
paragraph = 'I want to figure out how I can be a better data scientist'

def rm_stopwards(par):
    stopwords = ['I', 'as', 'to', 'you', 'your', 'but', 'be', 'a']
    osss = list(dict.fromkeys(par.split(' ')))  # ordered_split_shortened_set
    x = list(osss)
    for word in osss:
        if word.strip() in stopwords:
            x.remove(word)
        #else:
        #    next
    ret = ' '.join(x)
    return ret

print("stripped_paragraph = " + "'" + rm_stopwards(paragraph) + "'")

separate words in a sentence that has comma between them [duplicate]

I want to remove commas from a sentence, separate all the other words (a-z), and print them one by one.
a = input()
b = list(a)  # to remove punctuation
for item in list(b):  # iterate over a copy to prevent an "index out of range" error
    if item == ',':
        b.remove(item)
c = "".join(b)  # sentence without commas
c = c.split()
print(c)
My input is:
The university was founded as a standard academy,and developed to a university of technology by Habib Nafisi.
and when I remove the comma:
... founded as a standard academyand developed to a university...
and when I split the words:
The
university
.
.
.
academyand
.
.
.
what can I do to prevent this?
I already tried the replace method and it doesn't work.
You could replace , with a space, assuming there is no space between the , and the next word in your input¹, and then perform split:
s = 'The university was founded as a standard academy,and developed to a university of technology by Habib Nafisi.'
print(s.replace(',', ' ').split())
# ['The', 'university', 'was', 'founded', 'as', 'a', 'standard', 'academy', 'and', 'developed', 'to', 'a', 'university', 'of', 'technology', 'by', 'Habib', 'Nafisi.']
Alternatively, you could also try your hand at regex:
import re
s = 'The university was founded as a standard academy,and developed to a university of technology by Habib Nafisi.'
print(re.split(r' |,', s))
¹ Note: this works even if you had (multiple) spaces after the , because ultimately you split on whitespace.
Your issue seems to be that there is no space between the comma and the next word here: academy,and. You could solve this by ensuring that there is a space, so that splitting will actually separate each word into a different element of the list.
This is probably what you want. Note that str.replace returns a new string (strings are immutable), so the result must be assigned back:
stri = """The university was founded as a standard academy,and developed to a university of technology by Habib Nafisi."""
stri = stri.replace(",", " ")
print(stri.split())
Will give you the output in a list:
['The', 'university', 'was', 'founded', 'as', 'a', 'standard', 'academy', 'and', 'developed', 'to', 'a', 'university', 'of', 'technology', 'by', 'Habib', 'Nafisi.']
If you consider words to be series of characters separated by spaces, then replacing a , with nothing leaves no space between the adjoining words, and they will be treated as one word.
The easiest way to do this is to replace the comma with a space, and then split on spaces:
my_string = "The university was founded as a standard academy,and developed to a university of technology by Habib Nafisi."
list_of_words = my_string.replace(",", " ").split()
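Since the goal is to pull out the alphabetic words themselves, another option (a sketch, not from the answers above) is re.findall, which avoids the empty strings re.split can produce around adjacent delimiters. Note it also strips the final period from 'Nafisi.':

```python
import re

s = ('The university was founded as a standard academy,and developed '
     'to a university of technology by Habib Nafisi.')
# Pull out maximal runs of letters; commas, spaces, and periods all act
# as separators, so no post-processing is needed.
words = re.findall(r'[A-Za-z]+', s)
print(words)
```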

Storing words from a text file [closed]

I'm new to Python and am wondering whether there is a way to take one word from an external file of 10 words and store it individually.
I'm making a word-memory game where the user is shown a list of words; the list is removed after a certain amount of time and the words appear again, but one word is different and the user has to guess which word has been replaced.
The word will be randomly chosen from an external file of 10 words, 9 of which are displayed first and 1 of which is stored as the substitute word.
Does anyone have any ideas?
I have used the unix dictionary here; you can take whichever word list you want.
import random
from copy import copy

''' Word game '''
with open('/usr/share/dict/words', 'r') as w:
    words = w.read().splitlines()

numWords = 10
allWords = [words[i] for i in random.sample(range(len(words)), numWords)]
hiddenWord = allWords[0]
displayWords = allWords[1:]
print(displayWords)

choice = input('Ready? [y]es\n').strip()
if choice == 'y':
    # randrange excludes the upper bound; randint(0, len(displayWords))
    # could produce an out-of-range index
    indexToRemove = random.randrange(len(displayWords))
    displayWordsNew = copy(displayWords)
    random.shuffle(displayWordsNew)
    displayWordsNew[indexToRemove] = hiddenWord
    print(displayWordsNew)
    word = input('Which is the different word\n')
    if word == displayWordsNew[indexToRemove]:
        print("You got it right")
        print(displayWords)
        print(displayWordsNew)
    else:
        print("Oops, you got it wrong, but it's a difficult game! The correct word was")
        print(displayWordsNew[indexToRemove])
Results:
["Lena's", 'Galsworthy', 'filliped', 'cadenza', 'telecasts', 'scrutinize', "candidate's", "kayak's", 'workman']
Ready?
y
["Lena's", 'workman', 'scrutinize', 'filliped', 'Latino', 'telecasts', "candidate's", 'cadenza', 'Galsworthy']
Which is the different word
telecasts
Oops, you got it wrong, but it's a difficult game! The correct word was
Latino
If you have an input file with one word per line, just do this:
>>> open("C:/TEXT.txt").read()
'FISH\nMEAT\nWORD\nPLACE\nDOG\n'
Then split the string to the list:
>>> open("C:/Work/TEXT.txt").read().split('\n')
['FISH', 'MEAT', 'WORD', 'PLACE', 'DOG', '']
Oh... and strip the trailing newline at the end:
>>> open("C:/Work/TEXT.txt").read().strip().split('\n')
['FISH', 'MEAT', 'WORD', 'PLACE', 'DOG']
For replacing, use random.choice over the range of list indices:
>>> import random
>>> listOfWords = open("C:/Work/TEXT.txt").read().strip().split('\n')
>>> listOfWords
['FISH', 'MEAT', 'WORD', 'PLACE', 'DOG']
>>> random.choice(range(len(listOfWords)))
3
>>> listOfWords[random.choice(range(len(listOfWords)))] = 'NEW_WORD'
>>> listOfWords
['FISH', 'MEAT', 'NEW_WORD', 'PLACE', 'DOG']
And if you want to shuffle a new list:
>>> random.shuffle(listOfWords)
>>> listOfWords
['PLACE', 'NEW_WORD', 'FISH', 'DOG', 'MEAT']
I'm new to python and am wondering is there a way to take one word
from an external file of 10 words and store it individually.
There are a LOT of ways to store/reference variables in/from a file.
If you don't mind a little typing, just store the variables in a .py file (remember to use proper python syntax):
# myconfig.py:
var_a = 'Word1'
var_b = 'Word2'
var_c = 'Word3'
# etc...
Use the file itself as a module
from myconfig import *
(This will let you reference all the variables in the text file.)
If you only want to reference individual variables you just import the ones you want
from myconfig import var_a, var_b
(This will let you reference var_a and var_b, but nothing else)
You should try this:
foo = open("file.txt", mode="r")
If the words are on different lines (splitlines drops the trailing newlines, which readlines would keep):
words = foo.read().splitlines()
Or if the words are separated by spaces:
words = foo.read().split(" ")

Counting the number of unique words [duplicate]

I want to count unique words in a text, but I want to make sure that words followed by special characters aren't treated differently, and that the evaluation is case-insensitive.
Take this example
text = "There is one handsome boy. The boy has now grown up. He is no longer a boy now."
print len(set(w.lower() for w in text.split()))
The result would be 16, but I expect it to return 14. The problem is that 'boy.' and 'boy' are evaluated differently, because of the punctuation.
import re
print(len(re.findall(r'\w+', text)))
Using a regular expression makes this very simple. All you need to keep in mind is to lowercase the text first, and finally combine the result using set to ensure that there are no duplicate items.
print(len(set(re.findall(r'\w+', text.lower()))))
you can use regex here:
In [65]: text = "There is one handsome boy. The boy has now grown up. He is no longer a boy now."
In [66]: import re
In [68]: set(m.group(0).lower() for m in re.finditer(r"\w+",text))
Out[68]:
set(['grown',
'boy',
'he',
'now',
'longer',
'no',
'is',
'there',
'up',
'one',
'a',
'the',
'has',
'handsome'])
I think that you have the right idea of using the Python built-in set type.
I think that it can be done if you first remove the punctuation by doing a replace:
text = "There is one handsome boy. The boy has now grown up. He is no longer a boy now."
punc_char = ",.?!'"
for letter in text:
    if letter == '"' or letter in punc_char:
        text = text.replace(letter, '')
text = set(text.split())
len(text)
That should work for you. And if you need any other signs or punctuation marks you can easily add them to punc_char and they will be filtered out.
Abraham J.
First, you need to get a list of words. You can use a regex as eandersson suggested:
import re
words = re.findall(r'\w+', text)
Now, you want to get the number of unique entries. There are a couple of ways to do this. One way is to iterate through the words list and use a dictionary to keep track of the number of times you have seen each word:
cwords = {}
for word in words:
    try:
        cwords[word] += 1
    except KeyError:
        cwords[word] = 1
Now, finally, you can get the number of unique words with
len(cwords)
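As a shorter sketch using only the standard library, collections.Counter builds the same word-count dictionary in one step:

```python
import re
from collections import Counter

text = "There is one handsome boy. The boy has now grown up. He is no longer a boy now."
# Lowercase first so 'The' and 'the' collapse; \w+ strips the punctuation.
counts = Counter(re.findall(r'\w+', text.lower()))
print(len(counts))  # 14 unique words
```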
