Capitalize each first word of a sentence in a paragraph - python

I want to capitalize the first word after a dot in a whole paragraph (str) full of sentences. The problem is that all characters are lowercase.
I tried something like this:
text = "here a long. paragraph full of sentences. what in this case does not work. i am lost"
re.sub(r'(\b\. )([a-zA-z])', r'\1' (r'\2').upper(), text)
I expect something like this:
"Here a long. Paragraph full of sentences. What in this case does not work. I am lost."

You can use re.sub with a lambda:
import re
text = "here a long. paragraph full of sentences. what in this case does not work. i am lost"
result = re.sub(r'(?<=^)\w|(?<=\.\s)\w', lambda x: x.group().upper(), text)
Output:
'Here a long. Paragraph full of sentences. What in this case does not work. I am lost'
Regex Explanation:
(?<=^)\w: matches a word character (letter, digit, or underscore) at the start of the string.
(?<=\.\s)\w: matches a word character preceded by a period and a single whitespace character.
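The explanation above can be wrapped in a small helper. This is only a minimal sketch: the name capitalize_sentences is made up for illustration, and the pattern is the one from this answer with the lookbehind around ^ dropped, since ^ is already zero-width.

```python
import re

# Hypothetical helper; equivalent to the pattern in the answer above.
SENTENCE_START = re.compile(r'^\w|(?<=\.\s)\w')

def capitalize_sentences(text):
    """Uppercase the first word character of the string and of each sentence."""
    return SENTENCE_START.sub(lambda m: m.group().upper(), text)

print(capitalize_sentences("here a long. paragraph full of sentences. i am lost"))
```

Because the lookbehind allows exactly one whitespace character after the period, a sentence that follows two or more spaces is left unchanged.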

You can use the regex ((?:^|\.\s)\s*)([a-z]), which does not depend on lookarounds. Lookarounds are not available in every regex dialect (JavaScript, for example, only gained lookbehind in ECMAScript 2018, and support is not yet widespread), so this version is simpler and more widely supported. Group 1 captures either zero or more whitespace characters at the start of the string, or a literal dot followed by one or more whitespace characters. Group 2 captures the lowercase letter that follows, ([a-z]). The replacement is the group 1 text followed by the group 2 letter, converted to uppercase with a lambda expression. The Python code below uses the equivalent form (^\s*|\.\s+)([a-z]):
import re
arr = ['here a long. paragraph full of sentences. what in this case does not work. i am lost',
' this para contains more than one space after period and also has unneeded space at the start of string. here a long. paragraph full of sentences. what in this case does not work. i am lost']
for s in arr:
    print(re.sub(r'(^\s*|\.\s+)([a-z])', lambda m: m.group(1) + m.group(2).upper(), s))
Output,
Here a long. Paragraph full of sentences. What in this case does not work. I am lost
This para contains more than one space after period and also has unneeded space at the start of string. Here a long. Paragraph full of sentences. What in this case does not work. I am lost
If you also want to collapse extra whitespace to a single space, move \s* out of group 1 and use the regex ((?:^|\.\s))\s*([a-z]). The updated Python code:
import re
arr = ['here a long. paragraph full of sentences. what in this case does not work. i am lost',
' this para contains more than one space after period and also has unneeded space at the start of string. here a long. paragraph full of sentences. what in this case does not work. i am lost']
for s in arr:
    print(re.sub(r'((?:^|\.\s))\s*([a-z])', lambda m: m.group(1) + m.group(2).upper(), s))
You get the following, where extra whitespace is reduced to a single space, which is often what you want:
Here a long. Paragraph full of sentences. What in this case does not work. I am lost
This para contains more than one space after period and also has unneeded space at the start of string. Here a long. Paragraph full of sentences. What in this case does not work. I am lost
Also, with a PCRE-based regex engine you could use \U in the replacement string instead of a lambda function, and simply replace the match with \1\U\2
Regex Demo for PCRE based regex

Related

Regexp, find in a procedure last "end" to replace with another word

I am trying to fix some mistakes in all of my procedures. Now I need to find the last "end;" in a procedure and replace it with other text.
I wrote something like: (\s.*)(end|END)(.*(;).*)
But it does not work correctly: it also replaces some words in the middle of the text. I am using the re library from Python.
You can use
result = re.sub(r'(?si)(.*)\bend\b', r'\g<1>some other word', text)
The regex matches
(?si) - an inline re.DOTALL (s) and re.IGNORECASE (i) modifier
(.*) - Group 1: any zero or more chars as many as possible
\bend\b - a whole word end.
The \g<1>some other word replacement is the Group 1 value (I used \g<1> since it will be helpful if your some other word starts with a digit) plus your word.
NOTE: if your some other word can contain literal backslashes, do not forget to double them.
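One way to sidestep the backslash issue in the NOTE is to pass a function as the replacement, because re.sub inserts a function's return value literally. The sketch below assumes that approach; the name replace_last_end and the sample procedure text are invented for illustration.

```python
import re

# Group 1 greedily takes everything up to the last whole word "end".
LAST_END = re.compile(r'(?si)(.*)\bend\b')

def replace_last_end(text, new_word):
    """Replace only the last whole-word 'end' (any case) with new_word."""
    # The string returned by a replacement function is used as-is, so
    # backslashes and leading digits in new_word need no escaping.
    return LAST_END.sub(lambda m: m.group(1) + new_word, text, count=1)

source = "begin\n  if x then y; end if;\nEND;"
print(replace_last_end(source, r"finish\1"))
```

Only the final END is replaced; the earlier "end if" is part of group 1 and is copied through unchanged.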

how to write a regular expression which matches a pattern if the sentence ends by period '.'

I have a group of strings like the following:
a phrase containing spaces
A sentence contains spaces as well, but end by period.
I'd like to find a regular expression that matches the spaces (like [ \t\f]) in the 2nd line, which ends with '.'.
I've looked around and found no solution, so I came here for help.
I am using Python, but I do not mind knowing the PCRE solution even if it's not possible in Python.
I came up with a regex, but it could not exclude the first line.
my regex
Here is a regex pattern which, if applied repeatedly to every line, should be able to match spaces in that line, assuming the line ends with period:
\s+(?=.*\.$)
Demo
Here is my attempt at a Python script. I don't print the space when a match is found, because we can't see it. Instead, I print something visible:
import re
line = 'A sentence contains spaces as well, but end by period.'
spaces = re.findall(r'\s+(?=.*\.$)', line)
for space in spaces:
    print('found a space')
found a space (printed 9 times)
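To check that the pattern really skips lines without a final period, here is a small sketch that applies it to both sample lines from the question, one line at a time. The variable names are illustrative only.

```python
import re

# A whitespace run that is followed, later on the same line, by a final period.
SPACES_BEFORE_FINAL_PERIOD = re.compile(r'\s+(?=.*\.$)')

lines = [
    'a phrase containing spaces',
    'A sentence contains spaces as well, but end by period.',
]

# Count the matches per line; a line without a trailing period yields none.
counts = [len(SPACES_BEFORE_FINAL_PERIOD.findall(line)) for line in lines]
for line, count in zip(lines, counts):
    print(f'{count} match(es) in: {line!r}')
```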

Finding last word in tweepy tweet response python

I am receiving a stream of tweets with python and would like to extract the last word or know where to reference it.
for example in
NC don’t like working together www.linktowtweet.org
get back
together
I am not familiar with tweepy, so I am presuming you have the data in a python string, so maybe there is a better answer.
However, given a string in Python, it is simple to extract the last word.
Solution 1
Use str.rfind(' '). The idea here is to find the space, preceding the last word. Here is an example.
text = "NC don’t like working together"
text = text.rstrip() # Strip any whitespace at the end that would otherwise confuse the algorithm.
last_word = text[text.rfind(' ')+1:] # Output every character *after* the space.
print(last_word)
Note: If a string is given with no words, last_word will be a blank string.
Now this presumes that all of the words are separated by spaces. To handle newlines and tabs as well, use str.replace to turn them into spaces. Whitespace characters in Python are the space plus \t\n\x0b\x0c\r, but I presume only newlines and tabs will be found in Twitter messages.
Also see: string.whitespace
So a complete example (wrapped as a function) would be
def last_word(text):
    text = text.replace('\n', ' ') # Replace newlines with spaces.
    text = text.replace('\t', ' ') # Replace tabs with spaces.
    text = text.rstrip(' ') # Remove trailing spaces.
    return text[text.rfind(' ')+1:]
print(last_word("NC don’t like working together")) # Outputs "together".
This may still be the best solution for basic parsing, but there is something better for larger problems.
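As an alternative to replacing each whitespace character by hand, str.split() with no arguments already splits on runs of any whitespace. The sketch below uses that instead of rfind; the function name last_word_split is made up here to keep it apart from the versions in this answer.

```python
def last_word_split(text):
    """Return the last whitespace-separated word, or '' if there is none."""
    # With no arguments, str.split() splits on runs of any whitespace
    # (spaces, tabs, newlines, ...) and ignores leading/trailing whitespace.
    words = text.split()
    return words[-1] if words else ''

print(last_word_split("NC don’t like working\ttogether \n"))
```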
Solution 2
Regular Expressions
These are a much more flexible way to handle strings in Python. Regular expressions (REGEX, as they are often called) use their own language to specify a portion of text.
For example, .*\s(\S+) specifies the last word in a string.
Here it is again with a longer explanation.
.* # Match as many characters as possible.
\s # Until a whitespace ("\t\n\x0b\x0c\r ")
( # Remember the next section for the answer.
\S+ # Match as much of a ~word~ (non-whitespace) as possible.
) # End saved section.
So then, in python you would use this as follows.
import re # Import the REGEX library.
# Compile the code, (DOTALL makes . match \n).
LAST_WORD_PATTERN = re.compile(r".*\s(\S+)", re.DOTALL)
def last_word(text):
    m = LAST_WORD_PATTERN.match(text)
    if not m: # If there was not a last word to this text.
        return ''
    return m.group(1) # Otherwise return the last word.
print(last_word("NC don’t like working together")) # Outputs "together".
Now, even though this method is a lot less obvious, it has a couple of advantages. First off, it is a lot more customizable. If you wanted to match the final word, but not links, the regex r".*\s([^.:\s]+(?!\.\S|://))\b" would match the last word, but ignore a link if that was the last thing.
Example:
import re # Import the REGEX library.
# Compile the code, (DOTALL makes . match \n).
LAST_WORD_PATTERN = re.compile(r".*\s([^.:\s]+(?!\.\S|://))\b", re.DOTALL)
def last_word(text):
    m = LAST_WORD_PATTERN.match(text)
    if not m: # If there was not a last word to this text.
        return ''
    return m.group(1) # Otherwise return the last word.
print(last_word("NC don’t like working together www.linktowtweet.org")) # Outputs "together".
The second advantage to this method is speed.
As you can see in the Try it online! demo, the regex approach is almost as fast as the string manipulation, if not faster in some cases. (I actually found that the regex executes 0.2 usec faster on my machine than in the demo.)
Either way, the regex execution is extremely fast, even in the simple case, and there is no question that the regex is faster than any more complex string algorithm implemented in Python. So using the regex can also speed up the code.
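Timings like these vary by machine and Python version, so it is worth measuring locally. The following is a rough sketch using the standard timeit module; the function names last_word_regex and last_word_rfind are invented here to keep the two approaches apart, and no particular result is implied.

```python
import re
import timeit

LAST_WORD_PATTERN = re.compile(r".*\s(\S+)", re.DOTALL)

def last_word_regex(text):
    # Regex approach from Solution 2.
    m = LAST_WORD_PATTERN.match(text)
    return m.group(1) if m else ''

def last_word_rfind(text):
    # String-method approach from Solution 1.
    text = text.replace('\n', ' ').replace('\t', ' ').rstrip(' ')
    return text[text.rfind(' ') + 1:]

sample = "NC don’t like working together"
for func in (last_word_regex, last_word_rfind):
    seconds = timeit.timeit(lambda: func(sample), number=100_000)
    print(f"{func.__name__}: {seconds:.3f} s for 100,000 calls")
```

The two functions are not interchangeable on every input: for a single word with no whitespace, the regex version returns an empty string while the rfind version returns the word itself.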
EDIT
Changed the url avoiding regex from
re.compile(r".*\s([^.\s]+(?!\.\S))\b", re.DOTALL)
to
re.compile(r".*\s([^.:\s]+(?!\.\S|://))\b", re.DOTALL)
So that calling last_word("NC don’t like working together http://www.linktowtweet.org") returns together and not http://.
To see how this regex works, look at https://regex101.com/r/sdwpqB/2.
Simple. If your text is:
import re
text = "NC don’t like working together www.linktowtweet.org"
text = re.sub(r'(?:https?://|www\.)\S*', '', text) # remove any URL, with or without a scheme
words = text.split() # split the sentence into words on any whitespace
last_word = words[-1]
So there you go! Now you'll get the last word, "together".

How can I find all substrings that have this pattern: some_word.some_other_word with python?

I am trying to clean up some very noisy user-generated web data. Some people do not add a space after a period that ends the sentence. For example,
"Place order.Call us if you have any questions."
I want to extract each sentence, but when I try to parse a sentence using nltk, it fails to recognize that these are two separate sentences. I would like to use regular expressions to find all patterns that contain "some_word.some_other_word" and all patterns that contain "some_word:some_other_word" using python.
At the same time I want to avoid finding patterns like "U.S.A". so avoid just_a_character.just_another_character
Thanks very much for your help :)
The easiest solution:
>>> import re
>>> re.sub(r'([.:])([^\s])', r'\1 \2', 'This is a test. Yes, test.Hello:world.')
'This is a test. Yes, test. Hello: world.'
The first argument (the pattern) matches a period or a colon followed by a non-whitespace character. The second argument is the replacement: it puts back the first matched symbol, then a space, then the second matched symbol.
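Note that this substitution does not distinguish abbreviations, which the question asks to leave alone. A short sketch of that behaviour follows; the function name add_space_after_punctuation and the second sample string are made up for illustration.

```python
import re

def add_space_after_punctuation(text):
    # Same substitution as above: insert a space after any '.' or ':'
    # that is directly followed by a non-whitespace character.
    return re.sub(r'([.:])([^\s])', r'\1 \2', text)

print(add_space_after_punctuation('Place order.Call us.'))
# The abbreviation is split apart as well:
print(add_space_after_punctuation('Made in the U.S.A. today'))
```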
It seems that you are asking two different questions:
1) If you want to find all patterns like "some_word.some_other_word" or "some_word:some_other_word"
import re
re.findall(r'\w+[\.:\?\!]\w+', your_text)
This finds all patterns in the text your_text
2) If you want to extract all sentences, you could do
import re
re.split(r'[\.\!\?]', your_text)
This should return a list of sentences. For example,
text = 'Hey, this is a test. How are you?Fine, thanks.'
import re
re.findall(r'\w+[\.:\?\!]\w+', text) # returns ['you?Fine']
re.split(r'[\.\!\?]', text) # returns ['Hey, this is a test', ' How are you', 'Fine, thanks', '']
Here are some cases that might be in your text:
sample = """
Place order.Call us (period: split)
ever after.(The end) (period: split)
U.S.A.(abbreviation: don't split internally)
1.3 How to work with computers (dotted numeral: don't split)
ever after...The end (ellipsis: don't split internally)
(This is the end.) (period inside parens: don't split)
"""
So: Don't add a space to periods after digits, after a single capital letter, or before a closing paren or another period. Add a space otherwise. This will do all that:
sample = re.sub(r"(\w[A-Z]|[a-z.])\.([^.)\s])", r"\1. \2", sample)
Result:
Place order. Call us (period: split)
ever after. (The end) (period: split)
U.S.A.(abbreviation: don't split internally)
1.3 How to work with computers (dotted numeral: don't split)
ever after... The end (ellipsis: don't split internally)
(This is the end.) (period inside parens: don't split)
This fixed every problem in the sample except the last period after U.S.A., which should have a space added after it. I left that aside because combinations of conditions are tricky. The following regexp will handle everything, but I do not recommend it:
sample = re.sub(r"(\w[A-Z]|[a-z.]|\b[A-Z](?!\.[A-Z]))\.([^.)\s])", r"\1. \2", sample)
Complex regexps like this are a maintainability nightmare: just try adding another pattern, or restricting it to omit some more cases. Instead, I recommend using a separate regexp to catch just the missing case: a period after a single capital letter, but not followed by a single capital, paren, whitespace, or another period.
sample = re.sub(r"(\b[A-Z]\.)([^.)A-Z\s])", r"\1 \2", sample)
For a complex task like this, it makes sense to use a separate regexp for each type of replacement. I'd split the original into subcases, each of which adds spaces only for a very specific pattern. You can have as many as you want, and it won't get out of hand (at least, not too much...)
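The "one regexp per case" idea can be sketched as a list of rules applied in order. This is an illustration rather than a complete solution: the names RULES and add_missing_spaces are made up, and the second pattern also excludes whitespace (an assumption added here) so that an existing space after an abbreviation is not doubled.

```python
import re

# Each rule handles one narrow case; add more (pattern, replacement) pairs as needed.
RULES = [
    # General case from this answer: a period after a lowercase letter or
    # another period, not followed by a period, closing paren, or whitespace.
    (re.compile(r"(\w[A-Z]|[a-z.])\.([^.)\s])"), r"\1. \2"),
    # Period after a single capital letter (the end of an abbreviation).
    # Assumption: whitespace is excluded too, to avoid inserting a second space.
    (re.compile(r"(\b[A-Z]\.)([^.)A-Z\s])"), r"\1 \2"),
]

def add_missing_spaces(text):
    """Apply every rule in turn and return the corrected text."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text

print(add_missing_spaces("Place order.Call us"))
print(add_missing_spaces("U.S.A.(abbreviation)"))
```

Keeping each case in its own rule means a new case is a new list entry, not a change to an existing pattern.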
You could use something like
import re
test = "some_word.some_other_word"
r = re.compile(r'(\D+)\.(\D+)')
print(r.match(test).groups())
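To find every such pair in a longer text while skipping single-letter abbreviations like U.S.A, one option is to require at least two letters on each side of the separator. This is a sketch under that assumption (a "word" is two or more ASCII letters); the name JOINED_WORDS and the sample text are illustrative.

```python
import re

# Assumption: a "word" is two or more ASCII letters, which skips
# single-letter abbreviations such as "U.S.A".
JOINED_WORDS = re.compile(r'\b[A-Za-z]{2,}[.:][A-Za-z]{2,}\b')

text = "Place order.Call us if you have any questions. Note:made in the U.S.A today."
matches = JOINED_WORDS.findall(text)
print(matches)
```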

Sentence matching with regex

I have a text that is split into many lines, in no particular format. So I decided to line.strip('\n') for each line. Then I want to split the text into sentences using the sentence end marker . considering:
period . that is followed by a \s (whitespace), \S (like " ') and followed by [A-Z] will split
not to split [0-9]\.[A-Za-z], like 1.stackoverflow real time solution.
My program only solves half of 1 - period (.) that is followed by a \s and [A-Z]. Below is the code:
# -*- coding: utf-8 -*-
import re, sys
source = open(sys.argv[1], 'rb')
dest = open(sys.argv[2], 'wb')
sent = []
for line in source:
    line1 = line.strip('\n')
    k = re.sub(r'\.\s+([A-Z“])'.decode('utf8'), '.\n\g<1>', line1)
    sent.append(k)
for line in sent:
    dest.write(''.join(line))
Please, I'd also like to know the best way to master regex. It seems confusing.
To include the single quote in the character class, escape it with a \. The regex should be:
\.\s+[A-Z"\']
That's really all you need. You only need to tell a regex what to match, you don't need to specify what you don't want to match. Everything that doesn't fit the pattern won't match.
This regex will match any period followed by whitespace followed by a capital letter or a quote. Since a period immediately preceded by a number and immediately followed by a letter doesn't meet those criteria, it won't match.
This is assuming that the regex you had was working to split a period followed by whitespace followed by a capital, as you stated. Note, however, that this means that I am Sam. Sam I am. would split into I am Sam and am I am. Is that really what you want? If not, use zero-width assertions to exclude the parts you want to match but also keep. Here are your options, in order of what I think it's most likely you want.
1) Keep the period and the first letter or opening quote of the next sentence; lose the whitespace:
(?<=\.)\s+(?=[A-Z"\'])
This will split the example above into I am Sam. and Sam I am.
2) Keep the first letter of the next sentence; lose the period and whitespace:
\.\s+(?=[A-Z"\'])
This will split into I am Sam and Sam I am. This presumes that there are more sentences afterward, otherwise the period will stay with the second sentence, because it's not followed by whitespace and a capital letter or quote. If this option is the one you want - the sentences without the periods, then you might want to also match a period followed by the end of the string, with optional intervening whitespace, so that the final period and any trailing whitespace will be dropped:
\.(?:\s+(?=[A-Z"\'])|\s*$)
Note the ?:. You need non-capturing parentheses, because if you have capture groups in a split, anything captured by the group is added as an element in the results (e.g. re.split(r'(\+)', 'a+b+c') gives you the list a + b + c rather than just a b c).
3) Keep everything; whitespace goes with the preceding sentence:
(?<=\.\s+)(?=[A-Z"\'])
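This will give you I am Sam. (with its trailing whitespace) and Sam I am. Note that Python's built-in re module only accepts fixed-width lookbehind, so it rejects the \s+ inside (?<=...); this option needs a regex engine that supports variable-width lookbehind.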
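The options above can be tried in Python with re.split. This is a minimal sketch using the example sentence from this answer; the variable names are illustrative, and the empty string that re.split leaves after a match at the end of the text is filtered out.

```python
import re

text = 'I am Sam. Sam I am.'

# Option 1: keep the period and the next sentence's first character.
option1 = re.split(r'(?<=\.)\s+(?=[A-Z"\'])', text)
print(option1)

# Option 2 (extended form): drop the periods, including the final one.
# The match at the very end leaves an empty string, which is filtered out.
option2 = [s for s in re.split(r'\.(?:\s+(?=[A-Z"\'])|\s*$)', text) if s]
print(option2)

# A capture group in the pattern puts the separators into the result.
with_separators = re.split(r'(\+)', 'a+b+c')
print(with_separators)

# Option 3 does not compile with the built-in re module.
try:
    re.compile(r'(?<=\.\s+)(?=[A-Z"\'])')
except re.error as exc:
    print('option 3 rejected by re:', exc)
```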
Regarding the last part of your question, the best resource for regex syntax I've seen is http://www.regular-expressions.info. Start with this summary: http://www.regular-expressions.info/reference.html Then go to the Tutorial page for more advanced details: http://www.regular-expressions.info/tutorial.html
