regex findall sentence excluding \n - python

Short question:
Let's assume I have such a text:
sent one. sent two.
sent three
sent four
I want to get the sentences with a result like this:
['sent one.', 'sent two.', 'sent three', 'sent four']
Long Question:
I want to create in python a function that allows you to divide a text into sentences (but I don't want to use a tokenizer imported from another module).
As in the previous example, in addition to the newline there can be several separators that can cause a new sentence to start. Also, I may want to keep these separators in the sentence or not.
So because requirement can change dynamically, I would like to write a parameterized function that allows you to decide:
which are the separators that make the sentence break (for
example \n . ! ? : ;)
which of these must be removed (for example \n)
in the presence of which characters the separators have the effect (e.g. \s, in some cases the separator should not work)
I didn't want to ask such a complex question because actually I had already written a code that worked in part but some things don't work and I'm afraid it's because of the newlines.
I show you a simplified version of the code, the regex inside it is dynamically generated, I show you only the resulted generated regex with the default parameters omitting the code for generation which is now useless for the question:
def tokeniz_text(text, separator=['.', '!', '?'], to_remove=['\n'], bordering=[r'\s']):
    ...regex generation...
    re_divide = r"""(.+?{}{}){}{}""".format(pre, sep, nxt, rem)
...generated regex...
(.+?(?:(?<!\.)|(?<!!)|(?<!\?))(?:\.|!|\?))(?:\.|!|\?|\s)|(?:(?:\s*)(?:\n+)(?:\s*))
I chose findall because it seemed to be the only re method that would let me decide whether to keep the separators, unlike split (which, from what Tom says, I may have to reconsider). As far as I can tell, these are the only two methods that return a list of occurrences from the whole parsed string, so I did not consider the others.
I also wanted to use the DOTALL flag because, as you can see, I use the dot to capture characters, and I thought that if the dot does not match newlines by default, I could never choose to capture them.
Anyway, I hope that now the situation is not reversed and that the question is too complex!
Sorry if I explained it wrong, yesterday I was very sleepy and it's difficult to understand what to explain because it's not very clear to me either, I'll try again (even if Tim Biegeleisen's answer might be right).

Here's one attempt to get you started:
>>> s = '''\
sent one. sent two.
sent three
sent four'''
>>> import re
>>> re.split(r'[.\n]\s*', s)
['sent one', 'sent two', 'sent three', 'sent four']
This says: split on sentence delimiters, where a delimiter is a period or a newline, either of which can be followed by zero or more whitespace characters.

We can try using re.split on the following pattern:
(?<=\.)\s+|\r?\n
This splits on one or more whitespace characters that follow a dot, or on a line break (an optional CR followed by LF). Note that this approach retains the dots ending a sentence, since they only appear in the pattern inside a fixed-width lookbehind.
import re
inp = """sent one. sent two.
sent three
sent four"""
matches = re.split(r'(?<=\.)\s+|\r?\n', inp)
print(matches)
This prints:
['sent one.', 'sent two.', 'sent three', 'sent four']
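Since the question asks for a parameterized function, the split pattern above could be generated from separator lists. The following is only a sketch under assumptions: the name split_sentences and its keep and drop parameters are hypothetical, separators are assumed to be single characters, and the "bordering" option from the question is fixed to whitespace.

```python
import re

def split_sentences(text, keep=(".", "!", "?"), drop=("\n",)):
    """Split after any `keep` separator followed by whitespace (separator kept),
    and on any run of `drop` separators (separator removed)."""
    alternatives = []
    if keep:
        # A character class has width 1, so it is valid inside a lookbehind
        keep_class = "[" + "".join(re.escape(c) for c in keep) + "]"
        alternatives.append(r"(?<={})\s+".format(keep_class))
    if drop:
        drop_alt = "|".join(re.escape(c) for c in drop)
        alternatives.append(r"(?:{})+\s*".format(drop_alt))
    if not alternatives:
        return [text]
    pieces = re.split("|".join(alternatives), text)
    # Discard empty strings produced by separators at the start or end
    return [piece for piece in pieces if piece]

print(split_sentences("sent one. sent two.\nsent three\nsent four"))
```

Passing drop=() keeps newlines from acting as separators at all, while adding ":" or ";" to keep makes them sentence-ending separators that stay in the output.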

Related

Distinguish quotes ' and apostrophes while tokenizing with regex

Given some text, I want to tokenize it into words properly. The text may contain:
words with an apostrophe in the middle (Can't, I'll, the accountant‘s books)
words with an apostrophe at the end (the employers‘ association, I spent most o’ the day replacin’ the broken bit)
quotes that sit directly after a word or between words, like: word'word
text that is split into sentences, but there can be many sentences inside a quote, and a word with an apostrophe can also sit inside a quote
different symbols for quotes: either ' for both opening and closing, or ' for one and ` or ´ for the other, etc.
What would be your suggestion to solve it?
Is it solvable with regex (Python re, for example)?
I want words with an apostrophe not to be split, and quotes to be split off from word tokens.
Parsing ordinary text, The Fellowship Of The Ring.txt for example, is a little tricky:
input : had hardly any 'government'.
output: ["had","hardly","any","'","government","'"] (recognized as quote)
A rather larger body, varying at need, was employed to 'beat the bounds'
is a quote; however, it is tricky because of the ending s'
'It isn't natural, and trouble will come of it!' has an apostrophe inside a quote
'Elves and Dragons'_ I says to him. is a quote; however, there is an s' again.
My suggestion would be to try to break down your cases. If you want to split by words (meaning that a word has spaces on both ends) probably a simple split would do its job.
>>> my_str = "words like that'"
>>> my_str.split(' ')
['words', 'like', "that'"]
>>>
If it's more complicated, regex seems to be a better idea. You can use (a|b), meaning match a or b. My suggestion would be to experiment more, the perfect place to experiment is here: regex101.com. To make things clearer select 'Python' in the left panel!
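As a starting point for the regex route, here is a minimal sketch rather than a full solution. The TOKEN pattern and the tokenize helper are hypothetical names; the pattern keeps apostrophes that sit between word characters inside the word and turns every other punctuation character into its own token. It assumes that a trailing apostrophe (employers', replacin') may be treated as a quote, and it cannot tell word'word apart from a contraction.

```python
import re

# Hypothetical pattern: a word may contain internal apostrophes (straight or curly);
# any other character that is neither a word character nor whitespace is its own token.
TOKEN = re.compile(r"\w+(?:['’‘]\w+)*|[^\w\s]")

def tokenize(text):
    # findall returns whole matches because the pattern has no capturing groups
    return TOKEN.findall(text)

print(tokenize("had hardly any 'government'."))
print(tokenize("'It isn't natural, and trouble will come of it!'"))
```

Distinguishing a possessive s' from a closing quote needs context (for example, whether an opening quote is still pending), which a single pattern like this does not track.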

Using regex, extract quoted strings that may contain nested quotes

I have the following string:
'Well, I've tried to say "How Doth the Little Busy Bee," but it all came different!' Alice replied in a very melancholy voice. She continued, 'I'll try again.'
Now, I wish to extract the following quotes:
1. Well, I've tried to say "How Doth the Little Busy Bee," but it all came different!
2. How Doth the Little Busy Bee,
3. I'll try again.
I tried the following code but I'm not getting what I want. The [^\1]* is not working as expected. Or is the problem elsewhere?
import re
s = "'Well, I've tried to say \"How Doth the Little Busy Bee,\" but it all came different!' Alice replied in a very melancholy voice. She continued, 'I'll try again.'"
for i, m in enumerate(re.finditer(r'([\'"])(?!(?:ve|m|re|s|t|d|ll))(?=([^\1]*)\1)', s)):
    print("\nGroup {:d}: ".format(i+1))
    for g in m.groups():
        print(' '+g)
If you really need to return all the results from a single regular expression applied only once, it will be necessary to use lookahead ((?=findme)) so the finding position goes back to the start after each match - see this answer for a more detailed explanation.
To prevent false matches, some clauses are also needed regarding the quotes that add complexity, e.g. the apostrophe in I've shouldn't count as an opening or closing quote. There's no single clear-cut way of doing this but the rules I've gone for are:
An opening quote must not be immediately preceded by a word character (e.g. letter). So for example, A" would not count as an opening quote but ," would count.
A closing quote must not be immediately followed by a word character (e.g. letter). So for example, 'B would not count as a closing quote but '. would count.
Applying the above rules leads to the following regular expression:
(?=(?:(?<!\w)'(\w.*?)'(?!\w)|\"(\w.*?)\"(?!\w)))
A good quick sanity check test on any possible candidate regular expression is to reverse the quotes. This has been done in this regex101 demo.
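The lookahead expression above can be tried with re.finditer. This is a small sketch; the names QUOTES and find_quotes are made up for illustration. Because the whole pattern is a zero-width lookahead, every match is empty and the quoted content has to be read from group 1 (single quotes) or group 2 (double quotes).

```python
import re

# The lookahead pattern from the answer, written in a raw triple-quoted string
QUOTES = re.compile(r"""(?=(?:(?<!\w)'(\w.*?)'(?!\w)|"(\w.*?)"(?!\w)))""")

def find_quotes(text):
    # Only one of the two groups participates in each match
    return [m.group(1) or m.group(2) for m in QUOTES.finditer(text)]

text = """'Well, I've tried to say "How Doth the Little Busy Bee," but it all came different!' Alice replied in a very melancholy voice. She continued, 'I'll try again.'"""
for number, quoted in enumerate(find_quotes(text), start=1):
    print(number, quoted)
```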
EDIT
I modified my regex; it now matches even more complicated cases properly:
(?=(?<!\w|[!?.])('|\")(?!\s)(?P<content>(?:.(?!(?<=(?=\1).)(?!\w)))*)\1(?!\w))
It is now even more complicated. The main improvements are that it does not match directly after certain punctuation characters ([!?.]) and that it separates the quote cases better. It was verified on a variety of examples.
The sentence will be in the content capture group. Of course it has some restrictions, related to the use of whitespace, etc. But it should work with most properly formatted sentences, or at least it works with the examples.
(?=(?<!\w|[!?.])('|\")(?!\s) - matches a ' or " that is not preceded by a word or punctuation character ((?<!\w|[!?.])) and not followed by whitespace ((?!\s)); the ' or " is captured in group 1 for later use,
(?P<content>(?:.(?!(?<=(?=\1).)(?!\w)))*)\1(?!\w)) - matches the sentence, followed by the same character (' or ", captured in group 1) that it started with, ignoring other quotes.
It does not match the whole sentence directly, but through a capturing group nested in a lookaround construct, so with the global match modifier it will also match sentences inside sentences, because the direct match is only the position before the sentence starts.
About your regex:
I suppose that by [^\1]* you meant any character except the one captured in group 1, but a character class doesn't work this way: it treats \1 as a character in octal notation (the control character with code 1), not as a reference to the capturing group. Take a look at this example and read the explanation. Also compare the matching of THIS and THIS regex.
To achieve what you want, you should use a lookaround, something like this: (')((?:.(?!\1))*.) - capture the opening character, then match every character that is not followed by the captured opening character, then capture one more character, the one directly before the closing character. You then have the whole content between the delimiters.
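The difference can be checked directly in Python. The snippet below is an illustrative sketch with a made-up test string; it compares the character-class attempt with the lookahead form suggested above (using ['"] instead of ' for the opening character).

```python
import re

s = "'a' and 'b'"

# Inside a character class, \1 is the octal escape for U+0001, not a backreference
matches_soh = re.fullmatch(r"[\1]", "\x01") is not None
print(matches_soh)

# So [^\1]* also accepts quote characters and greedily runs to the last closing quote
greedy = re.search(r"(['\"])([^\1]*)\1", s)
print(greedy.group(2))

# The lookahead form stops just before the next occurrence of the captured quote
tempered = re.search(r"(['\"])((?:.(?!\1))*.)", s)
print(tempered.group(2))
```

The first search captures everything up to the final quote, while the second stops at the first closing quote.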
This is a great question for Python regex because sadly, in my opinion the re module is one of the most underpowered of mainstream regex engines. That's why for any serious regex work in Python, I turn to Matthew Barnett's stellar regex module, which incorporates some terrific features from Perl, PCRE and .NET.
The solution I'll show you can be adapted to work with re, but it is much more readable with regex because it is made modular. Also, consider it as a starting block for more complex nested matching, because regex lets you write recursive regular expressions similar to those found in Perl and PCRE.
Okay, enough talk, here's the code (a mere four lines apart from the import and definitions). Please don't let the long regex scare you: it is long because it is designed to be readable. Explanations follow.
The Code
import regex
quote = regex.compile(r'''(?x)
(?(DEFINE)
    (?<qmark>["']) # what we'll consider a quotation mark
    (?<not_qmark>[^'"]+) # chunk without quotes
    (?<a_quote>(?P<qopen>(?&qmark))(?&not_qmark)(?P=qopen)) # a non-nested quote
) # End DEFINE block
# Start Match block
(?&a_quote)
|
(?P<open>(?&qmark))
(?&not_qmark)?
(?P<quote>(?&a_quote))
(?&not_qmark)?
(?P=open)
''')
# Named text rather than str to avoid shadowing the built-in type
text = """'Well, I have tried to say "How Doth the Little Busy Bee," but it all came different!' Alice replied in a very melancholy voice. She continued, 'I will try again.'"""
for match in quote.finditer(text):
    print(match.group())
    if match.group('quote'):
        print(match.group('quote'))
The Output
'Well, I have tried to say "How Doth the Little Busy Bee," but it all came different!'
"How Doth the Little Busy Bee,"
'I will try again.'
How it Works
First, to simplify, note that I have taken the liberty of converting I'll to I will, reducing confusion with quotes. Addressing I'll would be no problem with a negative lookahead, but I wanted to make the regex readable.
In the (?(DEFINE)...) block, we define the three sub-expressions qmark, not_qmark and a_quote, much in the way that you define variables or subroutines to avoid repeating yourself.
After the definition block, we proceed to matching:
(?&a_quote) matches an entire quote,
| or...
(?P<open>(?&qmark)) matches a quotation mark and captures it to the open group,
(?&not_qmark)? matches optional text that is not quotes,
(?P<quote>(?&a_quote)) matches a full quote and captures it to the quote group,
(?&not_qmark)? matches optional text that is not quotes,
(?P=open) matches the same quotation mark that was captured at the opening of the quote.
The Python code then only needs to print the match and the quote capture group if present.
Can this be refined? You bet. Working with (?(DEFINE)...) in this way, you can build beautiful patterns that you can later re-read and understand.
Adding Recursion
If you want to handle more complex nesting using pure regex, you'll need to turn to recursion.
To add recursion, all you need to do is define a group and refer to it using the subroutine syntax. For instance, to execute the code within Group 1, use (?1). To execute the code within group something, use (?&something). Remember to leave an exit for the engine by either making the recursion optional (?) or one side of an alternation.
It seems difficult to achieve with just one regex pass, but it could be done with a relatively simple regex and a recursive function:
import re
REGEX = re.compile(r"(['\"])(.*?[!.,])\1", re.S)
S = """'Well, I've tried to say "How Doth the Little Busy Bee," but it all came different!' Alice replied in a very melancholy voice. She continued, 'I'll try again.' 'And we may now add "some more 'random test text'.":' "Yes it seems to be a good idea!" 'ok, let's go.'"""
def extract_quotes(string, quotes_list=None):
    # Start from any quotes already collected (avoids shadowing the built-in list)
    quotes = quotes_list or []
    quotes += [found[1] for found in REGEX.findall(string)]
    index = 0
    for quote in quotes[:]:
        index += 1
        # Recurse into each quote and insert its inner quotes right after it
        sub_list = extract_quotes(quote)
        quotes = quotes[:index] + sub_list + quotes[index:]
        index += len(sub_list)
    return quotes
print(extract_quotes(S))
This prints:
['Well, I\'ve tried to say "How Doth the Little Busy Bee," but it all came different!', 'How Doth the Little Busy Bee,', "I'll try again.", 'And we may now add "some more \'random test text\'.":\' "Yes it seems to be a good idea!" \'ok, let\'s go.', "some more 'random test text'.", 'Yes it seems to be a good idea!']
Note that the regex uses punctuation to determine whether a quoted text is a "real quote": in order to be extracted, a quote needs to end with a punctuation character before the closing quote. That is, 'random test text' is not considered an actual quote, while 'ok, let's go.' is.
The regex is pretty simple, so I think it does not need an explanation.
The extract_quotes function finds all quotes in the given string and stores them in a list. Then it calls itself for each quote found, looking for inner quotes...

How can I find all substrings that have this pattern: some_word.some_other_word with python?

I am trying to clean up some very noisy user-generated web data. Some people do not add a space after a period that ends the sentence. For example,
"Place order.Call us if you have any questions."
I want to extract each sentence, but when I try to parse a sentence using nltk, it fails to recognize that these are two separate sentences. I would like to use regular expressions to find all patterns that contain "some_word.some_other_word" and all patterns that contain "some_word:some_other_word" using python.
At the same time I want to avoid finding patterns like "U.S.A". so avoid just_a_character.just_another_character
Thanks very much for your help :)
The easiest solution:
>>> import re
>>> re.sub(r'([.:])([^\s])', r'\1 \2', 'This is a test. Yes, test.Hello:world.')
'This is a test. Yes, test. Hello: world.'
The first argument — the pattern — tells that we want to match a period or a colon followed by a non-whitespace character. The second argument is the replacement, it puts the first matched symbol, then a space, then the second matched symbol back.
It seems that you are asking two different questions:
1) If you want to find all patterns like "some_word.some_other_word" or "some_word:some_other_word"
import re
re.findall(r'\w+[.:?!]\w+', your_text)
This finds all such patterns in the text your_text
2) If you want to extract all sentences, you could do
import re
re.split(r'[.!?]', your_text)
This should return a list of sentences. For example,
text = 'Hey, this is a test. How are you?Fine, thanks.'
import re
re.findall(r'\w+[.:?!]\w+', text) # returns ['you?Fine']
re.split(r'[.!?]', text) # returns ['Hey, this is a test', ' How are you', 'Fine, thanks', '']
Here are some cases that might be in your text:
sample = """
Place order.Call us (period: split)
ever after.(The end) (period: split)
U.S.A.(abbreviation: don't split internally)
1.3 How to work with computers (dotted numeral: don't split)
ever after...The end (ellipsis: don't split internally)
(This is the end.) (period inside parens: don't split)
"""
So: Don't add space to periods after digits, after a single capital letter, or before a paren or another period. Add space otherwise. This will do all that:
sample = re.sub(r"(\w[A-Z]|[a-z.])\.([^.)\s])", r"\1. \2", sample)
Result:
Place order. Call us (period: split)
ever after. (The end) (period: split)
U.S.A.(abbreviation: don't split internally)
1.3 How to work with computers (dotted numeral: don't split)
ever after... The end (ellipsis: don't split internally)
(This is the end.) (period inside parens: don't split)
This fixed every problem in the sample except the last period after U.S.A., which should have a space added after it. I left that aside because combinations of conditions are tricky. The following regexp will handle everything, but I do not recommend it:
sample = re.sub(r"(\w[A-Z]|[a-z.]|\b[A-Z](?!\.[A-Z]))\.([^.)\s])", r"\1. \2", sample)
Complex regexps like this are a maintainability nightmare-- just try adding another pattern, or restricting it to omit some more cases. Instead, I recommend using a separate regexp to catch just the missing case: a period after a single capital letter, but not followed by a single capital, paren, or another period.
sample = re.sub(r"(\b[A-Z]\.)([^.)A-Z])", r"\1 \2", sample)
For a complex task like this, it makes sense to use a separate regexp for each type of replacement. I'd split the original into subcases, each of which adds spaces only for a very specific pattern. You can have as many as you want, and it won't get out of hand (at least, not too much...)
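The one-regexp-per-case idea could be organised as a list of rules applied in order. This is a sketch only: RULES and add_missing_spaces are hypothetical names, and the second rule adds \s to the excluded characters (an assumption on my part) so that a single capital letter already followed by a space is left alone.

```python
import re

# Each rule is (compiled pattern, replacement); rules are applied in order
RULES = [
    # Period after a lowercase letter, a capital preceded by a word character,
    # or another period, when not followed by a period, closing paren or whitespace
    (re.compile(r"(\w[A-Z]|[a-z.])\.([^.)\s])"), r"\1. \2"),
    # Period after a single capital letter, when not followed by a period,
    # closing paren, capital letter or whitespace
    (re.compile(r"(\b[A-Z]\.)([^.)A-Z\s])"), r"\1 \2"),
]

def add_missing_spaces(text):
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text

for line in ["Place order.Call us", "U.S.A.(abbreviation)", "1.3 How to work",
             "ever after...The end", "(This is the end.)"]:
    print(add_missing_spaces(line))
```

Adding another case then means appending one more (pattern, replacement) pair instead of growing a single expression.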
You could use something like
import re
test = "some_word.some_other_word"
r = re.compile(r'(\D+)\.(\D+)')
print(r.match(test).groups())

Add quotes around sentences with the word "said"

Ok regex masters, I have a very long text and I'm trying to add quotes in sentences that contain the words "he said" and similar variations.
For example:
s = 'This should have no quotes. This one should he said. But this one should not. Neither should this. But this one should she said.'
Should result in:
This should have no quotes. "This one should," he said. But this one should not. Neither should this. "But this one should," she said.
So far I can get pretty close, but not quite right:
>>> import re
>>> m = re.sub(r'\.\W(.*?) (he|she|it) said.', r'. "\1," \2 said.', s)
Results in:
>>> print(m)
This should have no quotes. "This one should," he said. But this one should not. "Neither should this. But this one should," she said.
As you can see, it puts quote properly around the first instance, but places it too early for the second. Any help appreciated!
There are some different valid situations that have been pointed out in the comments, but to address the concern you were facing:
It is quoting more than one sentence because the lazy group can still run across the period at the end of "one should not." What you really want is to quote back only to the last period. So make sure the capturing group cannot include periods, like so:
m = re.sub(r'\.\W([^\.]*?) (he|she|it) said.', r'. "\1," \2 said.', s)
This will fail for things with periods in the sentence like "Dr. Seuss likes to eat, she said" but that is another problem.
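For reference, here is the corrected substitution run against the sample text from the question. It is a minimal sketch; the names SAID and quote_said are made up, and the final period of the pattern is escaped so that it only matches a literal period.

```python
import re

# The group [^.]*? cannot cross a period, so it only reaches back to the previous sentence end
SAID = re.compile(r'\.\W([^.]*?) (he|she|it) said\.')

def quote_said(text):
    return SAID.sub(r'. "\1," \2 said.', text)

s = 'This should have no quotes. This one should he said. But this one should not. Neither should this. But this one should she said.'
print(quote_said(s))
```

Because the pattern starts with a period, a "he said" sentence at the very beginning of the text would not be quoted.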

How do I append a list of negative lookbehinds to a python regular expression?

I'm trying to split a paragraph into sentences using regex split and I'm trying to use the second answer posted here:
a Regex for extracting sentence from a paragraph in python
But I have a list of abbreviations that I don't want to end the sentence on even though there's a period. But I don't know how to append it to that regular expression properly. I'm reading in the abbreviations from a file that contains terms like Mr. Ms. Dr. St. (one on each line).
Short answer: You can't, unless all lookbehind assertions are of the same, fixed width (which they probably aren't in your case; your example contained only two-letter abbreviations, but Mrs. would break your regex).
This is a limitation of the current Python regex engine.
Longer answer:
You could write a regex like (?s)(?<!.Mr|Mrs|.Ms|.St)\., padding each alternating part of the lookbehind assertion with as many .s as needed to get all of them to the same width. However, that would fail in some circumstances, for example when a paragraph begins with Mr..
Anyway, you're not using the right tool here. Better use a tool designed for the job, for example the Natural Language Toolkit.
If you're stuck with regex (too bad!), then you could try and use a findall() approach instead of split():
(?:(?:\b(?:Mr|Ms|Dr|Mrs|St)\.)|[^.])+\.\s*
would match a sentence that ends in . (optionally followed by whitespace) and may contain no dots unless preceded by one of the allowed abbreviations.
>>> import re
>>> s = "My name is Mr. T. I pity the fool who's not on the A-Team."
>>> re.findall(r"(?:(?:\b(?:Mr|Ms|Dr|Mrs|St)\.)|[^.])+\.\s*", s)
['My name is Mr. T. ', "I pity the fool who's not on the A-Team."]
I don't directly answer your question, but this post should contain enough information for you to write a working regex for your problem.
You can append a list of negative look-behinds. Remember that look-behinds are zero-width, which means that you can put as many look-behinds as you want next to each other and they all look behind from the same position. As long as you don't need an unbounded quantifier (e.g. *, +, {n,}) inside a look-behind, everything should be fine.
So the regex can be constructed like this:
(?<!list )(?<!of )(?<!words )(?<!not )(?<!allowed )(?<!to )(?<!precede )pattern\w+
It is a bit verbose. Anyway, I wrote this post just to demonstrate that it is possible to look behind on a list of fixed strings.
Example run:
>>> s = 'something patterning of patterned crap patternon not patterner, not allowed patternes to patternsses, patternet'
>>> re.findall(r'(?<!list )(?<!of )(?<!words )(?<!not )(?<!allowed )(?<!to )(?<!precede )pattern\w+', s)
['patterning', 'patternon', 'patternet']
There is a catch in using look-behind, though. If there is a variable number of spaces between the blacklisted text and the text matching the pattern, the regex above will fail. I really doubt there exists a way to modify the regex so that it works for that case while keeping the look-behinds. (You can always collapse consecutive spaces into one, but that won't work for more general cases.)
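Applied to the abbreviation list from the question, the chained look-behinds can be generated from the list itself. This is a sketch under assumptions: build_sentence_splitter is a hypothetical helper, the abbreviations are passed in as a list (reading them from the file is left out), each entry includes its trailing period, and a sentence boundary is taken to be whitespace after a period.

```python
import re

def build_sentence_splitter(abbreviations):
    # One negative look-behind per abbreviation; each has its own fixed width,
    # so abbreviations of different lengths can be mixed freely
    guards = "".join("(?<!{})".format(re.escape(a)) for a in abbreviations)
    # Split on whitespace that follows a period, unless an abbreviation precedes it
    return re.compile(guards + r"(?<=\.)\s+")

splitter = build_sentence_splitter(["Mr.", "Mrs.", "Ms.", "Dr.", "St."])
print(splitter.split("Dr. Seuss met Mrs. Jones. They talked. Mr. T left."))
```

A sentence that genuinely ends in one of the abbreviations will not be split, which is an inherent limit of this kind of rule.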
