How to split sentences in a list? - python

I am trying to create a function that counts the number of words and the mean word length in any given sentence or sentences. I can't seem to split the string into two sentences to be put into a list, assuming each sentence ends with a period.
Question marks and exclamation marks should be replaced by periods to be recognized as a new sentence in the list.
For example: "Haven't you eaten 8 oranges today? I don't know if you did." would be: ["Haven't you eaten 8 oranges today", "I don't know if you did"]
The mean length for this example would be 44/12 ≈ 3.67
import string

def word_length_list(text):
    text = text.replace('--', ' ')
    for p in string.punctuation + "‘’”“":
        text = text.replace(p, '')
    text = text.lower()
    words = text.split(".")
    word_length = []
    print(words)
    for i in words:
        count = 0
        for j in i:
            count = count + 1
        word_length.append(count)
    return word_length

testing1 = word_length_list("Haven't you eaten 8 oranges today? I don't know if you did.")
print(sum(testing1)/len(testing1))
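For comparison, here is a minimal sketch of the intended computation. The function name mean_word_length and the use of str.translate are my own choices for this sketch, not from the code above:

```python
import string

def mean_word_length(text):
    # Strip ASCII punctuation plus the curly quotes handled above,
    # then average the lengths of the remaining whitespace-separated words.
    cleaned = text.translate(str.maketrans('', '', string.punctuation + "‘’”“"))
    words = cleaned.lower().split()
    return sum(len(w) for w in words) / len(words)

print(round(mean_word_length("Haven't you eaten 8 oranges today? I don't know if you did."), 2))
# → 3.67  (44 characters over 12 words)
```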

One option might use re.split:
import re

inp = "Haven't you eaten 8 oranges today? I don't know if you did."
sentences = re.split(r'(?<=[?.!])\s+', inp)
print(sentences)
This prints:
["Haven't you eaten 8 oranges today?", "I don't know if you did."]
We could also use re.findall:
inp = "Haven't you eaten 8 oranges today? I don't know if you did."
sentences = re.findall(r'.*?[?!.]', inp)
print(sentences) # prints same as above
Note that in both cases we are assuming that a period . would only appear as a full stop, and not as part of an abbreviation. If a period can have multiple contexts, then it can be tricky to tease sentences apart. For example:
Jon L. Skeet earned more points than anyone. Gordon Linoff also earned a lot of points.
It is not clear here whether period means end of sentence or part of an abbreviation.
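To illustrate, the same split pattern applied to that example breaks the first sentence at the abbreviation:

```python
import re

# The same lookbehind split as above, applied to text containing an abbreviation.
text = "Jon L. Skeet earned more points than anyone. Gordon Linoff also earned a lot of points."
print(re.split(r'(?<=[?.!])\s+', text))
# The 'L.' in 'Jon L. Skeet' is treated as a sentence boundary too:
# → ['Jon L.', 'Skeet earned more points than anyone.', 'Gordon Linoff also earned a lot of points.']
```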

An example to split using regex:
import re
s = "Hello! How are you?"
print([x for x in re.split(r"[.?!]+", s.strip()) if x != ''])

Related

Using .replace effectively on text

I'm attempting to capitalize all words in a section of text that only appear once. I have the bit that finds which words only appear once down, but when I go to replace the original word with the .upper version, a bunch of other stuff gets capitalized too. It's a small program, so here's the code.
from collections import Counter
from string import punctuation

path = input("Path to file: ")
with open(path) as f:
    word_counts = Counter(word.strip(punctuation) for line in f
                          for word in line.replace(")", " ").replace("(", " ")
                                          .replace(":", " ").replace("", " ").split())
wordlist = open(path).read().replace("\n", " ").replace(")", " ").replace("(", " ").replace("", " ")
unique = [word for word, count in word_counts.items() if count == 1]
for word in unique:
    print(word)
    wordlist = wordlist.replace(word, str(word.upper()))
print(wordlist)
The output should be 'Genesis 37:1 Jacob lived in the land of his father's SOJOURNINGS, in the land of Canaan.', as sojournings is the first word that only appears once. Instead, it outputs 'GenesIs 37:1 Jacob lIved In the land of hIs FATher's SOJOURNINGS, In the land of Canaan.' Because some of the other letters appear in keywords, it tries to capitalize them as well.
Any ideas?
I rewrote the code pretty significantly since some of the chained replace calls might prove to be unreliable.
import string

# The sentence.
sentence = "Genesis 37:1 Jacob lived in the land of his father's SOJOURNINGS, in the land of Canaan."

rm_punc = sentence.translate(None, string.punctuation)  # remove punctuation
words = rm_punc.split(' ')  # split on spaces to get a list of words

# Find all unique word occurrences.
single_occurrences = []
for word in words:
    # if word only occurs 1 time, append it to the list
    if words.count(word) == 1:
        single_occurrences.append(word)

# For each unique word, find its index and capitalize the letter at that index
# in the initial string (the letter at that index is also the first letter of
# the word). Note that strings are immutable, so we are actually creating a new
# string on each iteration. Also, sometimes small words occur inside of other
# words, e.g. 'an' inside of 'land'. In order to make sure that our call to
# `index()` doesn't find these small words, we keep track of `start`, which
# makes sure we only ever search from the end of the previously found word.
start = 0
for word in single_occurrences:
    try:
        word_idx = start + sentence[start:].index(word)
    except ValueError:
        # Could not find word in sentence. Skip it.
        pass
    else:
        # Update counter.
        start = word_idx + len(word)
        # Rebuild sentence with capitalization.
        first_letter = sentence[word_idx].upper()
        sentence = sentence[:word_idx] + first_letter + sentence[word_idx+1:]

print(sentence)
Text replacement by patterns calls for regex.
Your text is a bit tricky; you have to
remove digits
remove punctuation
split into words
care about capitalisation: 'It's' vs 'it's'
only replace full matches: 'remote' vs 'mote' when replacing 'mote'
etc.
This should do it - see comments inside for explanations:
from collections import Counter, defaultdict
from string import punctuation, digits
import re

# bible.txt is from your link
with open(r"SO\AllThingsPython\P4\bible.txt") as f:
    s = f.read()

# get a set of unwanted characters and clean the text
ps = set(punctuation + digits)
s2 = ''.join(c for c in s if c not in ps)

# split into words
s3 = s2.split()

# create a set of all capitalizations of each word
repl = defaultdict(set)
for word in s3:
    repl[word.upper()].add(word)  # e.g. {..., 'IN': {'In', 'in'}, 'THE': {'The', 'the'}, ...}

# count all words _upper case_ and use those that only occur once
single_occurence_upper_words = [w for w, n in Counter(w.upper() for w in s3).most_common() if n == 1]

text = s
# now the replace part - for all upper single words
for upp in single_occurence_upper_words:
    # for all occurring capitalizations in the text
    for orig in repl[upp]:
        # use regex replace to find the original word from our repl dict with
        # space/punctuation before/after it, and replace it with the uppercase word
        text = re.sub(f"(?<=[{punctuation} ])({orig})(?=[{punctuation} ])", upp, text)

print(text)
Output (shortened):
Genesis 37:1 Jacob lived in the land of his father's SOJOURNINGS, in the land of Canaan.
2 These are the GENERATIONS of Jacob.
Joseph, being seventeen years old, was pasturing the flock with his brothers. He was a boy with the sons of Bilhah and Zilpah, his father's wives. And Joseph brought a BAD report of them to their father. 3 Now Israel loved Joseph more than any other of his sons, because he was the son of his old age. And he made him a robe of many colors. [a] 4 But when his brothers saw that their father loved him more than all his brothers, they hated him
and could not speak PEACEFULLY to him.
<snipp>
The regex uses lookahead '(?=...)' and lookbehind '(?<=...)' syntax to make sure we replace only full words; see regex syntax.
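As a minimal illustration of that lookaround trick (the sample text and the word 'mote' are made up for this sketch):

```python
import re
from string import punctuation

# Replace only the standalone word 'mote', not the 'mote' inside 'remote'.
# The lookbehind/lookahead require a space or punctuation character on each side.
text = "The remote had a mote of dust."
print(re.sub(f"(?<=[{punctuation} ])(mote)(?=[{punctuation} ])", "MOTE", text))
# → The remote had a MOTE of dust.
```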

How to extract a number before a certain word?

There is a sentence "i have 5 kg apples and 6 kg pears".
I just want to extract the weight of apples.
So I use
sentence = "I have 5 kg apples and 6 kg pears"
number = re.findall(r'(\d+) kg apples', sentence)
print (number)
However, it only works for integer numbers. So what should I do if the number I want to extract is 5.5?
You can try something like this:
import re
sentence = ["I have 5.5 kg apples and 6 kg pears",
            "I have 5 kg apples and 6 kg pears"]
for sen in sentence:
    print re.findall(r'(\d+(?:\.\d+)?) kg apples', sen)
Output:
['5.5']
['5']
? designates an optional segment of a regex.
re.findall(r'((\d+\.)?\d+)', sentence)
You can use number = re.findall(r'(\d+\.?\d*) kg apples', sentence)
You change your regex to match it:
(\d+(?:\.\d+)?)
\.\d+ matches a dot followed by at least one digit. I made that part optional so that plain integers still match.
re.findall(r'[-+]?[0-9]*\.?[0-9]+.', sentence)
Non-regex solution
sentence = "I have 5.5 kg apples and 6 kg pears"
words = sentence.split(" ")
[words[idx-1] for idx, word in enumerate(words) if word == "kg"]
# => ['5.5', '6']
You can then check whether these are valid floats using
try:
    float(element)
except ValueError:
    print "Not a float"
The regex you need should look like this (with the dot escaped so it matches a literal decimal point):
(\d+\.?\d*) kg apples
You can do as follows:
number = re.findall(r'(\d+\.?\d*) kg apples', sentence)
Here is an online example

getting words between m and n characters

I am trying to get all names that start with a capital letter and end with a full stop on the same line, where the number of characters is between 3 and 5.
My text is as follows:
King. Great happinesse
Rosse. That now Sweno, the Norwayes King,
Craues composition:
Nor would we deigne him buriall of his men,
Till he disbursed, at Saint Colmes ynch,
Ten thousand Dollars, to our generall vse
King. No more that Thane of Cawdor shall deceiue
Our Bosome interest: Goe pronounce his present death,
And with his former Title greet Macbeth
Rosse. Ile see it done
King. What he hath lost, Noble Macbeth hath wonne.
I am testing it out on this link. I am trying to get all words between 3 and 5 characters but haven't succeeded.
Does this produce your desired output?
import re
re.findall(r'[A-Z].{2,4}\.', text)
When text contains the text in your question it will produce this output:
['King.', 'Rosse.', 'King.', 'Rosse.', 'King.']
The regex pattern matches any sequence of characters following an initial capital letter. You can tighten that up if required, e.g. using [a-z] in the pattern [A-Z][a-z]{2,4}\. would match an upper case character followed by between 2 to 4 lowercase characters followed by a literal dot/period.
If you don't want duplicates you can use a set to get rid of them:
>>> set(re.findall(r'[A-Z].{2,4}\.', text))
set(['Rosse.', 'King.'])
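For instance, the tightened [A-Z][a-z]{2,4}\. pattern mentioned above behaves like this on the first two lines of the sample text (a small sketch, not part of the original answer):

```python
import re

text = "King. Great happinesse\nRosse. That now Sweno, the Norwayes King,"
# An upper-case letter, then 2 to 4 lower-case letters, then a literal dot.
print(re.findall(r'[A-Z][a-z]{2,4}\.', text))
# → ['King.', 'Rosse.']
```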
You may have your own reasons for wanting to use regexs here, but Python provides a rich set of string methods and (IMO) it's easier to understand the code using these:
matched_words = []
for line in open('text.txt'):
    words = line.split()
    for word in words:
        if word[0].isupper() and word[-1] == '.' and 3 <= len(word)-1 <= 5:
            matched_words.append(word)
print matched_words

Search in a string and obtain the 2 words before and after the match in Python

I'm using Python to search some words (also multi-token) in a description (string).
To do that I'm using a regex like this
result = re.search(word, description, re.IGNORECASE)
if result:
    print("Trovato: " + result.group())
But what I need is to obtain the first 2 words before and after the match. For example if I have something like this:
Parking here is horrible, this shop sucks.
"here is" is the phrase that I am looking for. So after I match it with my regex I need the 2 words (if they exist) before and after the match.
In the example:
Parking here is horrible, this
"Parking" and "horrible, this" are the words that I need.
ATTENTION
The description can be very long and the pattern "here is" can appear multiple times.
How about string operations?
line = 'Parking here is horrible, this shop sucks.'
before, term, after = line.partition('here is')
before = before.rsplit(maxsplit=2)[-2:]
after = after.split(maxsplit=2)[:2]
Result:
>>> before
['Parking']
>>> after
['horrible,', 'this']
Try this regex: ((?:[a-z,]+\s+){0,2})here is\s+((?:[a-z,]+\s*){0,2})
with re.findall and re.IGNORECASE set
I would do it like this (edit: added anchors to cover most cases):
(\S+\s+|^)(\S+\s+|)here is(\s+\S+|)(\s+\S+|$)
Like this you will always have 4 groups (might have to be trimmed) with the following behavior:
If group 1 is empty, there was no word before (group 2 is empty too)
If group 2 is empty, there was only one word before (group 1)
If group 1 and 2 are not empty, they are the words before in order
If group 3 is empty, there was no word after
If group 4 is empty, there was only one word after
If group 3 and 4 are not empty, they are the words after in order
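A small sketch of how the four-group pattern behaves on the example sentence (the test string handling is mine, assuming the pattern above):

```python
import re

pattern = r'(\S+\s+|^)(\S+\s+|)here is(\s+\S+|)(\s+\S+|$)'
m = re.search(pattern, "Parking here is horrible, this shop sucks.")
print(m.groups())
# → ('Parking ', '', ' horrible,', ' this')
# Group 2 is empty: there was only one word before the match, as described above.
```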
Based on your clarification, this becomes a bit more complicated. The solution below deals with scenarios where the searched pattern may in fact also be in the two preceding or two subsequent words.
import re

line = "Parking here is horrible, here is great here is mediocre here is here is "
print line
pattern = "here is"
r = re.search(pattern, line, re.IGNORECASE)
output = []
if r:
    while line:
        before, match, line = line.partition(pattern)
        if match:
            if not output:
                before = before.split()[-2:]
            else:
                before = ' '.join([pattern, before]).split()[-2:]
            after = line.split()[:2]
            output.append((before, after))
print output
Output from my example would be:
[(['Parking'], ['horrible,', 'here']), (['is', 'horrible,'], ['great', 'here']), (['is', 'great'], ['mediocre', 'here']), (['is', 'mediocre'], ['here', 'is']), (['here', 'is'], [])]

How can I randomize this text generator even further?

I'm working on a random text generator (without using Markov chains). Currently it works without too many problems: it generates a good number of random sentences by my criteria, but I want to make it even more accurate to prevent as many sentence repeats as possible. Firstly, here is my code flow:
1. Enter a sentence as input - this is called the trigger string and is assigned to a variable.
2. Get the longest word in the trigger string.
3. Search the whole Project Gutenberg database for sentences that contain this word - regardless of upper/lower case.
4. Return the longest sentence that has the word I spoke about in step 3.
5. Append the sentences from step 1 and step 4 together.
6. Assign the sentence from step 4 as the new 'trigger' sentence and repeat the process. Note that I have to get the longest word in the second sentence and continue like that, and so on.
And here is my code:
import nltk
from nltk.corpus import gutenberg
from random import choice
import smtplib  # will be for send e-mail option later

triggerSentence = raw_input("Please enter the trigger sentence: ")  # get input str

longestLength = 0
longestString = ""
longestLen2 = 0
longestStr2 = ""

listOfSents = gutenberg.sents()  # all sentences of gutenberg are assigned -list of list format-
listOfWords = gutenberg.words()  # all words in gutenberg books -list format-

while triggerSentence:  # run the loop so long as there is a trigger sentence
    sets = []
    sets2 = []
    split_str = triggerSentence.split()  # split the sentence into words

    # code to find the longest word in the trigger sentence input
    for piece in split_str:
        if len(piece) > longestLength:
            longestString = piece
            longestLength = len(piece)

    # code to get the sentences containing the longest word, then selecting
    # a random one of these sentences that are longer than 40 characters
    for sentence in listOfSents:
        if sentence.count(longestString):
            sents = " ".join(sentence)
            if len(sents) > 40:
                sets.append(" ".join(sentence))

    triggerSentence = choice(sets)
    print triggerSentence  # the first sentence that comes up after I enter input

    split_str = triggerSentence.split()
    for apiece in triggerSentence:  # find the longest word in this new sentence
        if len(apiece) > longestLen2:
            longestStr2 = piece
            longestLen2 = len(apiece)

    if longestStr2 == longestString:
        # this should return the second longest word in the sentence, in case its
        # longest word is the same as the longest word of the last sentence
        second_longest = sorted(split_str, key=len)[-2]
        #print second_longest  # now get second longest word if first is same
                               # as longest word in previous sentence
        for sentence in listOfSents:
            if sentence.count(second_longest):
                sents = " ".join(sentence)
                if len(sents) > 40:
                    sets2.append(" ".join(sentence))
        triggerSentence = choice(sets2)
    else:
        for sentence in listOfSents:
            if sentence.count(longestStr2):
                sents = " ".join(sentence)
                if len(sents) > 40:
                    sets.append(" ".join(sentence))
        triggerSentence = choice(sets)
    print triggerSentence
According to my code, once I enter a trigger sentence, I should get another one that contains the longest word of the trigger sentence I entered. Then this new sentence becomes the trigger sentence and its longest word is picked. This is where the problem sometimes occurs. I observed that despite the code lines I placed - starting from line 47 to the end - the algorithm can still pick the same longest word in the sentences that come along, instead of looking for the second longest word.
For example:
Trigger string = "Scotland is a nice place."
Sentence 1 = -This is a random sentence with the word Scotland in it-
Now, this is where the problem can occur in my code at times. It doesn't matter whether it comes up in sentence 2 or 942 or a zillion, but I give it in sentence 2 for example's sake:
Sentence 2 = Another sentence that has the word Scotland in it, but not the second longest word of sentence 1. According to my code, this sentence should have been one that contained the second longest word in sentence 1, not Scotland!
How can I solve this ? I'm trying to optimize the code as much as possible and any help is welcome.
There is nothing random about your algorithm at all. It should always be deterministic.
I'm not quite sure what you want to do here. If it is to generate random words, just use a dictionary and the random module. If you want to grab random sentences from the Gutenberg project, use the random module to pick a work and then a sentence out of that work.
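A minimal sketch of that suggestion, with a made-up in-memory corpus standing in for the Gutenberg works (the dict `works` is purely illustrative):

```python
import random

# Hypothetical stand-in corpus: work name -> list of its sentences.
works = {
    "work_a": ["First sentence of A.", "Second sentence of A."],
    "work_b": ["Only sentence of B."],
}

work = random.choice(list(works))      # pick a random work
sentence = random.choice(works[work])  # then a random sentence from that work
print(sentence)
```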
