Sentence splitting based on a regular expression [duplicate] - python

I am trying to split an article into sentences, using the following code (written by someone who has since left the organization). Please help me understand the code:
re.split(r' *[.,:-\@/_&?!;][\s ]+', x)

It looks for punctuation marks such as stops, commas and colons, optionally preceded by spaces and always followed by at least one whitespace character. In the commonest case that will be ". ". Then it splits the string x into pieces by removing the matched punctuation and returning whatever is left as a list.
>>> import re
>>> x = "First sentence. Second sentence? Third sentence."
>>> re.split(r' *[.,:-\@/_&?!;][\s ]+', x)
['First sentence', 'Second sentence', 'Third sentence.']
The regular expression is unnecessarily complex and doesn't do a very good job.
This bit: :-\@ has a redundant quoting backslash, and means the characters between ASCII 58 and 64, in other words : ; < = > ? @, but it would be better to list the 7 characters explicitly, because most people will not know what characters fall in that range. That includes me: I had to look it up. And it clearly also includes the code's author, since he redundantly specified ? and ; again at the end.
This bit [\s ]+ means one or more spaces or whitespace characters, but a space is a whitespace character, so that could be more simply expressed as \s+.
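If you would rather check those two claims than take them on trust, a few lines in the interpreter will do. This is only a small sketch; the names range_chars, sample and same_split are made up for the illustration.

```python
import re

# Expand the range from ASCII 58 to ASCII 64 to see which characters it covers.
range_chars = [chr(code) for code in range(58, 65)]
print(range_chars)  # [':', ';', '<', '=', '>', '?', '@']

# [\s ]+ and \s+ accept exactly the same characters, so both splits agree.
sample = "a. b.\tc.\n d"
same_split = re.split(r'\.[\s ]+', sample) == re.split(r'\.\s+', sample)
print(same_split)  # True
```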
Note the retained full stop in the 3rd element of the returned list. That is because when the full stop comes at the end of the line, it is not followed by a space, and the regular expression insists that it must be. Retaining the full stop is okay, but only if it is done consistently for all sentences, not just for the ones that end at a line break.
Throw away that bit of code and start from scratch. Or use nltk, which has power tools for splitting text into sentences and is likely to do a much more respectable job.
>>> from nltk.tokenize.punkt import PunktSentenceTokenizer
>>> sent_tokenizer = PunktSentenceTokenizer()
>>> sent_tokenizer.sentences_from_text(x)
['First sentence.', 'Second sentence?', 'Third sentence.']
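If you do go the "start from scratch" route with plain re, one possible sketch is to split on the whitespace that follows a sentence-ending mark, using a lookbehind so the punctuation is kept for every sentence. The function name split_sentences is invented for this example, and the approach is deliberately naive: it will also split after abbreviations such as "Dr." or "e.g.", which is exactly the kind of case nltk handles better.

```python
import re

def split_sentences(text):
    """Split on whitespace that follows '.', '?' or '!', keeping the punctuation."""
    # The lookbehind (?<=[.?!]) checks for the punctuation without consuming it.
    pieces = re.split(r'(?<=[.?!])\s+', text.strip())
    # Drop empty strings, which only appear for empty input.
    return [piece for piece in pieces if piece]

x = "First sentence. Second sentence? Third sentence."
sentences = split_sentences(x)
print(sentences)  # ['First sentence.', 'Second sentence?', 'Third sentence.']
```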


Capitalize each first word of a sentence in a paragraph

I want to capitalize the first word after a dot in a whole paragraph (str) full of sentences. The problem is that all chars are lowercase.
I tried something like this:
text = "here a long. paragraph full of sentences. what in this case does not work. i am lost"
re.sub(r'(\b\. )([a-zA-z])', r'\1' (r'\2').upper(), text)
I expect something like this:
"Here a long. Paragraph full of sentences. What in this case does not work. I am lost."
You can use re.sub with a lambda:
import re
text = "here a long. paragraph full of sentences. what in this case does not work. i am lost"
result = re.sub(r'(?<=^)\w|(?<=\.\s)\w', lambda x: x.group().upper(), text)
Output:
'Here a long. Paragraph full of sentences. What in this case does not work. I am lost'
Regex Explanation:
(?<=^)\w: matches an alphanumeric character preceded by the start of the line.
(?<=\.\s)\w: matches an alphanumeric character preceded by a period and a space.
You can use the regex ((?:^|\.\s)\s*)([a-z]), which doesn't depend on lookarounds and is therefore simpler and more widely supported. Lookarounds may not be available in the regex dialect you are using: JavaScript, for example, only gained lookbehind in ECMAScript 2018, and support was not yet widespread at the time of writing. Group 1 captures either zero or more whitespace characters at the start of the string, or a literal dot followed by one or more whitespace characters. Group 2 captures the lowercase letter that follows, using ([a-z]). The match is replaced with the group 1 text unchanged plus the group 2 letter uppercased, using a lambda expression. The code below writes the same regex in the equivalent form (^\s*|\.\s+)([a-z]). Check this Python code:
import re
arr = ['here a long. paragraph full of sentences. what in this case does not work. i am lost',
' this para contains more than one space after period and also has unneeded space at the start of string. here a long. paragraph full of sentences. what in this case does not work. i am lost']
for s in arr:
    print(re.sub(r'(^\s*|\.\s+)([a-z])', lambda m: m.group(1) + m.group(2).upper(), s))
Output,
Here a long. Paragraph full of sentences. What in this case does not work. I am lost
This para contains more than one space after period and also has unneeded space at the start of string. Here a long. Paragraph full of sentences. What in this case does not work. I am lost
And in case you want to get rid of the extra whitespace and reduce it to just one space, take that \s* out of group 1 and use the regex ((?:^|\.\s))\s*([a-z]), with this updated Python code:
import re
arr = ['here a long. paragraph full of sentences. what in this case does not work. i am lost',
' this para contains more than one space after period and also has unneeded space at the start of string. here a long. paragraph full of sentences. what in this case does not work. i am lost']
for s in arr:
    print(re.sub(r'((?:^|\.\s))\s*([a-z])', lambda m: m.group(1) + m.group(2).upper(), s))
You get the following, where extra whitespace is reduced to just one space, which may often be desired:
Here a long. Paragraph full of sentences. What in this case does not work. I am lost
This para contains more than one space after period and also has unneeded space at the start of string. Here a long. Paragraph full of sentences. What in this case does not work. I am lost
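The difference between the two patterns is easiest to see on a short string that really does have extra spaces. The sample string and the names keep and squeeze below are made up for this sketch; the two regexes are the ones used above.

```python
import re

def upper_second_group(m):
    # Keep group 1 as matched and uppercase the letter in group 2.
    return m.group(1) + m.group(2).upper()

s = " here a long.   paragraph full.  i am lost"

# Whitespace sits inside group 1, so it is written back unchanged.
keep = re.sub(r'(^\s*|\.\s+)([a-z])', upper_second_group, s)

# Extra whitespace sits outside the groups, so it is dropped.
squeeze = re.sub(r'((?:^|\.\s))\s*([a-z])', upper_second_group, s)

print(repr(keep))     # ' Here a long.   Paragraph full.  I am lost'
print(repr(squeeze))  # 'Here a long. Paragraph full. I am lost'
```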
Also, if this were done with a PCRE-based regex engine, you could use \U in the replacement itself instead of a lambda function, and simply replace the match with \1\U\2

how to write a regular expression which matches a pattern if the sentence ends by period '.'

I have a group of strings like the following:
a phrase containing spaces
A sentence contains spaces as well, but end by period.
I'd like to find a regular expression to match the spaces (like [ \t\f]) in the 2nd line, which ends with '.'.
I've looked around and found no solution, so I came here for help.
I am using Python, but would not mind knowing a PCRE solution even if it's not possible in Python.
I came up with a regex, but it could not exclude the first line.
Here is a regex pattern which, if applied repeatedly to every line, should be able to match spaces in that line, assuming the line ends with period:
\s+(?=.*\.$)
Here is my attempt at a Python script. I don't print the space when a match is found, because we can't see it. Instead, I print something visible:
import re
line = 'A sentence contains spaces as well, but end by period.'
spaces = re.findall(r'\s+(?=.*\.$)', line)
for space in spaces:
    print('found a space')
found a space (printed 9 times)
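To confirm that the first line from the question is excluded, the same pattern can be run over both lines. The sketch below also shows a whole-text variant, which is an assumption added here rather than part of the answer: with re.MULTILINE the $ anchors at each line end, and [ \t\f]+ replaces \s+ so that the newline between the lines is not itself counted as a match.

```python
import re

lines = [
    "a phrase containing spaces",
    "A sentence contains spaces as well, but end by period.",
]

# Per-line use of the pattern from the answer: count matches in each line.
per_line = re.compile(r'\s+(?=.*\.$)')
counts = [len(per_line.findall(line)) for line in lines]
print(counts)  # [0, 9]

# Whole-text variant (assumption): anchor $ at every line end and
# match only horizontal whitespace so the newline is never matched.
whole_text = "\n".join(lines)
multi = re.compile(r'[ \t\f]+(?=.*\.$)', re.MULTILINE)
multi_count = len(multi.findall(whole_text))
print(multi_count)  # 9
```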

Distinguish quotes ' and apostrophes while tokenizing with regex

Given some text, I want to tokenize it into words properly. The text may contain:
words with an apostrophe in the middle (Can't, I'll, the accountant‘s books)
words with an apostrophe at the end (the employers‘ association, I spent most o’ the day replacin’ the broken bit)
quotes placed directly after a word or between words, like: word'word
the text is split into sentences, but there can be many sentences inside a quote, and a word with an apostrophe can also appear inside a quote
different symbols for quotes: either ' ' for both opening and closing, or one is ' and the other is ` or ´, etc.
What would be your suggestion to solve it?
Is it solvable with regex (Python re, for example)?
I want words with apostrophes not to be split, and quotes to be split from word tokens.
Parsing common text, The Fellowship Of The Ring.txt for example, is a little tricky:
input : had hardly any 'government'.
output: ["had","hardly","any","'","government","'"] (recognized as quote)
A rather larger body, varying at need, was employed to 'beat the bounds'
is a quote, but it is tricky because of the ending s'
'It isn't natural, and trouble will come of it!' has an apostrophe inside a quote
'Elves and Dragons'_ I says to him. is a quote; however, s' again.
My suggestion would be to break down your cases. If you want to split by words (meaning that a word has spaces on both ends), a simple split would probably do the job.
>>> my_str = "words like that'"
>>> my_str.split(' ')
['words', 'like', "that'"]
>>>
If it's more complicated, regex seems to be a better idea. You can use (a|b), meaning match a or b. My suggestion would be to experiment more, the perfect place to experiment is here: regex101.com. To make things clearer select 'Python' in the left panel!
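As a starting point for that kind of experimenting, here is a rough sketch of the (a|b) idea. It is a heuristic, not a full solution: an apostrophe counts as part of a word only when it has letters on both sides, and any other punctuation mark becomes its own token. The name tokenize is made up for the example. Trailing apostrophes (employers', replacin') will still be split off as if they were quotes, and a quote squeezed between two words (word'word) will be kept inside one token, so those cases need extra rules.

```python
import re

# First alternative: a word, optionally with inner apostrophes (isn't, I'll).
# Second alternative: any single character that is neither word nor whitespace.
TOKEN_RE = re.compile(r"\w+(?:['‘’]\w+)*|[^\w\s]")

def tokenize(text):
    return TOKEN_RE.findall(text)

quote_tokens = tokenize("had hardly any 'government'.")
print(quote_tokens)
# ['had', 'hardly', 'any', "'", 'government', "'", '.']

inner_tokens = tokenize("'It isn't natural!'")
print(inner_tokens)
# ["'", 'It', "isn't", 'natural', '!', "'"]
```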

Replace single line-feed characters, keep multiples [duplicate]

I am parsing a text file and want to remove all in-paragraph line breaks, while actually keeping the double line feeds that form new paragraphs. e.g.
This is my first poem\nthat does not make sense\nhow far should it go\nnobody can know.\n\nHere is a seconds\nthat is not as long\ngoodbye\n\n
When printed out, this should look like this:
This is my first poem
that does not make sense
how far should it go
nobody can know.

Here is a seconds
that is not as long
goodbye
should become
This is my first poem that does not make sense how far should it go nobody can know.\n\nHere is a seconds that is not as long goodbye\n\n
Again, when printed, it should look like:
This is my first poem that does not make sense how far should it go nobody can know.

Here is a seconds that is not as long goodbye
The trick here is in removing single occurrences of '\n', while keeping the double line feed '\n\n', AND in preserving white space (i.e. "hello\nworld" becomes "hello world" and not "helloworld").
I can do this by first substituting the \n\n with a dummy string (like "$$$", or something equally ridiculous), then removing the \n followed by reconversion of "$$$" back to \n\n...but that seems overly circuitous. Can I make this conversion with a single regular expression call?
You may replace all newlines that are not enclosed with other newlines with a space:
re.sub(r"(?<!\n)\n(?!\n)", " ", s)
See the Python demo:
import re
s = "This is my first poem\nthat does not make sense\nhow far should it go\nnobody can know.\n\nHere is a seconds\nthat is not as long\ngoodbye\n\n"
res = re.sub(r"(?<!\n)\n(?!\n)", " ", s)
print(res)
Here, the (?<!\n) is a negative lookbehind that fails the match if a newline is preceded by another newline, and (?!\n) is a negative lookahead that fails the match if the newline is followed by another newline.
See more about Lookahead and Lookbehind Zero-Length Assertions here.
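If lookarounds feel opaque, the same result can be had in a single re.sub call by matching whole runs of newlines and deciding in a replacement function. This is an alternative sketch, not the method above; the function name join_single_newlines is invented here.

```python
import re

def join_single_newlines(text):
    # Match each run of newlines; a run of length 1 becomes a space,
    # longer runs (paragraph breaks) are returned untouched.
    return re.sub(r'\n+', lambda m: ' ' if len(m.group()) == 1 else m.group(), text)

s = ("This is my first poem\nthat does not make sense\nhow far should it go\n"
     "nobody can know.\n\nHere is a seconds\nthat is not as long\ngoodbye\n\n")

joined = join_single_newlines(s)
print(joined)
```

On this input it should give the same string as the lookaround version.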

How can I find all substrings that have this pattern: some_word.some_other_word with python?

I am trying to clean up some very noisy user-generated web data. Some people do not add a space after a period that ends the sentence. For example,
"Place order.Call us if you have any questions."
I want to extract each sentence, but when I try to parse a sentence using nltk, it fails to recognize that these are two separate sentences. I would like to use regular expressions to find all patterns that contain "some_word.some_other_word" and all patterns that contain "some_word:some_other_word" using python.
At the same time I want to avoid finding patterns like "U.S.A". so avoid just_a_character.just_another_character
Thanks very much for your help :)
The easiest solution:
>>> import re
>>> re.sub(r'([.:])([^\s])', r'\1 \2', 'This is a test. Yes, test.Hello:world.')
'This is a test. Yes, test. Hello: world.'
The first argument, the pattern, says that we want to match a period or a colon followed by a non-whitespace character. The second argument is the replacement: it puts back the first matched symbol, then a space, then the second matched symbol.
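Note that this pattern makes no attempt to spare abbreviations or numbers, which the question asked to avoid. A quick sketch of what happens on such input (the helper name add_space is made up for the example):

```python
import re

def add_space(text):
    # Same substitution as above, wrapped in a helper for repeated use.
    return re.sub(r'([.:])([^\s])', r'\1 \2', text)

fixed = add_space('Place order.Call us.')
abbrev = add_space('Made in the U.S.A')
number = add_space('See section 1.3')

print(fixed)   # Place order. Call us.
print(abbrev)  # Made in the U. S. A
print(number)  # See section 1. 3
```

So it is fine for text without abbreviations or decimals, and needs more conditions otherwise.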
It seems that you are asking two different questions:
1) If you want to find all patterns like "some_word.some_other_word" or "some_word:some_other_word"
import re
re.findall(r'\w+[.:?!]\w+', your_text)
This finds all patterns in the text your_text
2) If you want to extract all sentences, you could do
import re
re.split(r'[.!?]', your_text)
This should return a list of sentences. For example,
import re
text = 'Hey, this is a test. How are you?Fine, thanks.'
re.findall(r'\w+[.:?!]\w+', text) # returns ['you?Fine']
re.split(r'[.!?]', text) # returns ['Hey, this is a test', ' How are you', 'Fine, thanks', '']
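The findall pattern above will also pick up things like "U.S", which the question wanted to avoid. One possible tweak, offered only as a sketch, is to require at least two word characters on each side of the period or colon, with word boundaries so a match cannot start in the middle of a word. The sample text is made up for the illustration, and numbers such as 12.34 would still match.

```python
import re

text = "Place order.Call us at the U.S.A office, see note:details."

# At least two word characters on both sides, so single letters
# such as the ones in U.S.A are skipped.
joined_words = re.findall(r'\b\w{2,}[.:]\w{2,}\b', text)
print(joined_words)  # ['order.Call', 'note:details']
```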
Here are some cases that might be in your text:
sample = """
Place order.Call us (period: split)
ever after.(The end) (period: split)
U.S.A.(abbreviation: don't split internally)
1.3 How to work with computers (dotted numeral: don't split)
ever after...The end (ellipsis: don't split internally)
(This is the end.) (period inside parens: don't split)
"""
So: Don't add space to periods after digits, after a single capital letter, or before a paren or another period. Add space otherwise. This will do all that:
sample = re.sub(r"(\w[A-Z]|[a-z.])\.([^.)\s])", r"\1. \2", sample)
Result:
Place order. Call us (period: split)
ever after. (The end) (period: split)
U.S.A.(abbreviation: don't split internally)
1.3 How to work with computers (dotted numeral: don't split)
ever after... The end (ellipsis: don't split internally)
(This is the end.) (period inside parens: don't split)
This fixed every problem in the sample except the last period after U.S.A., which should have a space added after it. I left that aside because combinations of conditions are tricky. The following regexp will handle everything, but I do not recommend it:
sample = re.sub(r"(\w[A-Z]|[a-z.]|\b[A-Z](?!\.[A-Z]))\.([^.)\s])", r"\1. \2", sample)
Complex regexps like this are a maintainability nightmare: just try adding another pattern, or restricting it to omit some more cases. Instead, I recommend using a separate regexp to catch just the missing case: a period after a single capital letter, but not followed by a capital, a paren, another period, or whitespace (so that an existing space is not doubled).
sample = re.sub(r"(\b[A-Z]\.)([^.)A-Z\s])", r"\1 \2", sample)
For a complex task like this, it makes sense to use a separate regexp for each type of replacement. I'd split the original into subcases, each of which adds spaces only for a very specific pattern. You can have as many as you want, and it won't get out of hand (at least, not too much...)
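One way to organise that is a plain list of (pattern, replacement) pairs applied in order. The sketch below uses the two regexps from this answer, with whitespace also excluded in the second rule (an assumption added here so that existing spaces are not doubled). The names SPACING_RULES and add_missing_spaces are invented for the example.

```python
import re

# Each rule does one small job; rules run in the order listed.
SPACING_RULES = [
    # Period after a lowercase letter, another period, or word char + capital.
    (re.compile(r"(\w[A-Z]|[a-z.])\.([^.)\s])"), r"\1. \2"),
    # Period after a single capital letter, e.g. the end of an abbreviation.
    (re.compile(r"(\b[A-Z]\.)([^.)A-Z\s])"), r"\1 \2"),
]

def add_missing_spaces(text):
    for pattern, replacement in SPACING_RULES:
        text = pattern.sub(replacement, text)
    return text

print(add_missing_spaces("Place order.Call us"))   # Place order. Call us
print(add_missing_spaces("U.S.A.(abbreviation)"))  # U.S.A. (abbreviation)
print(add_missing_spaces("1.3 How to work"))       # 1.3 How to work
```

Adding a new case then means appending one more pair to the list rather than growing a single regexp.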
You could use something like
import re
test = "some_word.some_other_word"
r = re.compile(r'(\D+)\.(\D+)')
print(r.match(test).groups())
