How to grab multiple paragraphs in the capture group? [duplicate]

I'm using this regex: (?i)(?<!\d)Item.*?1A.*?Risk.*?Factors.*?\n*(.+?)\n*Item.*?1B to grab the following text:
ITEM 1A. RISK FACTORS
In addition to other information in this Form 10-K, the following risk factors should be carefully considered in evaluating us and our business because these factors currently have a significant impact or
In addition to other information in this Form 10-K, the following risk factors should be carefully considered in evaluating us and our business because these factors currently have a significant impact or
ITEM 1B.
But it does not capture anything in the capturing group unless the text is a single paragraph, like this:
ITEM 1A. RISK FACTORS
In addition to other information in this Form 10-K, the following risk factors should be carefully considered in evaluating us and our business because these factors currently have a significant impact or
ITEM 1B.

Your regex matches any number of newlines, then any amount of text on one line, then any number of newlines. It only looks for a single "paragraph" between newlines, because . does not match newline characters by default.
Try replacing it with something like [\s\S], which matches any character, including newlines. Of special note is that this will capture any number of paragraphs, with any amount of whitespace between them. (In Python, compiling with the re.DOTALL flag makes . match newlines as well.)
(?i)(?<!\d)Item.*?1A.*?Risk.*?Factors\n*([\s\S]*?)\n*Item.*?1B
(?i)(?<!\d)Item.*?1A.*?Risk.*?Factors Match up to the end of risk factors.
\n* Match as many newlines as needed 'till we hit the next paragraph.
([\s\S]*?) Capture anything, across any number of lines (lazy).
\n* Match as many newlines as needed 'till we hit the next paragraph.
Item.*?1B Match the rest of the content. (This doesn't match the . at the very end, did you mean for it to? If so, add \. to the end).
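As a quick check of the explanation above, here is a minimal Python sketch that runs the original pattern and the [\s\S] version against a two-paragraph sample. The sample text is made up for illustration; only the two patterns come from the question and this answer.

```python
import re

# Made-up sample with two paragraphs between the two headings.
text = ("ITEM 1A. RISK FACTORS\n"
        "First paragraph.\n"
        "Second paragraph.\n"
        "ITEM 1B.")

# Pattern from the question: (.+?) cannot cross a newline.
original = re.compile(r"(?i)(?<!\d)Item.*?1A.*?Risk.*?Factors.*?\n*(.+?)\n*Item.*?1B")
# Pattern from this answer: [\s\S] matches any character, newlines included.
fixed = re.compile(r"(?i)(?<!\d)Item.*?1A.*?Risk.*?Factors\n*([\s\S]*?)\n*Item.*?1B")

original_match = original.search(text)
fixed_match = fixed.search(text)

print(original_match)        # no match for the two-paragraph text
print(fixed_match.group(1))  # both paragraphs, without the surrounding newlines
```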

Try
(?i)(?<!\d)Item.*?1A.*?Risk.*?Factors.*?\n*((.*\n*)+)\n*Item.*?1B
And for the sake of your future regex headaches, an incredible resource:
https://regex101.com
Cheers-

Related

Replacing a group of words from a long document [closed]

I was trying to solve this problem.
I have a long-form technical document (600 pages). I need to substitute groups of words with their abbreviations. Say, 'Long Form Document' is to be detected and replaced with LFD in the text.
I have a list of these groups of words and their abbreviations. The number of words is not fixed: anywhere from 2 to 6 words are to be replaced by one single abbreviation.
I have tried creating n-grams and substituting, but that distorts the document with unnecessary combinations, and the token count is important. I also tried using a regex with a window of 5 words and capital letters not preceded by a full stop. Please suggest a suitable solution.

Below is an example in Python 3. There is no tokenisation: all possible terms to be abbreviated are simply or'ed together into a pattern, and the match is replaced with the corresponding abbreviation.
The performance of this solution with a large dictionary of abbreviations is to be determined experimentally.
If 600 pages are too much to be loaded into memory, a file can be loaded line by line, assuming that no group goes across several lines. If it can happen, a window of two or three lines can be processed at a time, then advancing by one line.
import re

text = '''I was trying to solve this problem. I have a long form technical document (600 pages).
I need to substitute group of words with their abbreviations.
Say, 'Long Form Document' is to be detected and replaced with LFD in the text.
I have a list of these group of words and their abbreviation.
Also, the length of words is not fixed, it ranges from 2-6 words to be replaced by one single abbreviation.
I have tried creating n-grams and substituting but it distorts the document with unnecessary combinations and count of tokens is important.
I also tried using a regex with window of 5 words and capital alphabets not preceded by full stop. Please suggest a suitable solution.'''

abbr = {
    'Long Form Document': 'LFD',
    'Short Form Document': 'SFD',
}

def abbreviate(text: str, abbr: dict) -> str:
    # Lower-cased lookup: with re.I a match may differ in case from the dictionary key.
    lookup = {term.lower(): short for term, short in abbr.items()}
    # Escape each term and try longer terms first, so overlapping terms resolve predictably.
    terms = sorted(abbr, key=len, reverse=True)
    pattern = re.compile('|'.join(re.escape(t) for t in terms), re.I)
    return pattern.sub(lambda m: lookup[m.group().lower()], text)

# Test:
print(abbreviate(text, abbr))
Output:
I was trying to solve this problem. I have a long form technical document (600 pages).
I need to substitute group of words with their abbreviations.
Say, 'LFD' is to be detected and replaced with LFD in the text.
I have a list of these group of words and their abbreviation.
Also, the length of words is not fixed, it ranges from 2-6 words to be replaced by one single abbreviation.
I have tried creating n-grams and substituting but it distorts the document with unnecessary combinations and count of tokens is important.
I also tried using a regex with window of 5 words and capital alphabets not preceded by full stop. Please suggest a suitable solution.
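The line-by-line variant mentioned above could look roughly like the sketch below. It assumes no term spans a line break; the name abbreviate_lines and the sample lines are hypothetical, and the word boundaries (\b) are an extra assumption to avoid replacing a term in the middle of a longer word.

```python
import re
from typing import Dict, Iterable, Iterator


def abbreviate_lines(lines: Iterable[str], abbr: Dict[str, str]) -> Iterator[str]:
    """Yield each line with every known term replaced by its abbreviation."""
    # Lower-cased lookup so that case-insensitive matches still find their key.
    lookup = {term.lower(): short for term, short in abbr.items()}
    # Longer terms first, each one escaped, wrapped in word boundaries.
    terms = sorted(abbr, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    for line in lines:
        yield pattern.sub(lambda m: lookup[m.group().lower()], line)


# Made-up sample; any iterable of lines works, including an open file object.
sample = ["A Long Form Document is long.\n", "The long form document has 600 pages.\n"]
result = list(abbreviate_lines(sample, {"Long Form Document": "LFD"}))
print(result)
```

Because the function is a generator, only one line is held in memory at a time when it is fed from an open file.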

Python Regular expression search specific string beside number

I need help here.
I have a list and string.
Things I want to do is to find all the numbers from the string and also match the words from the list in the string that are beside numbers.
text = """Lily goes to school everyday at 9:00. Her House is near to her school.
Lily's address - Flat No. 203, 14th street lol lane, opp to yuta mall,
washington. Her school name is kids International."""
keywords = ['school', 'international', 'house', 'flat no']
I wrote a regex which can pull numbers:
x = re.findall(r'([0-9]+[\S]+[0-9]+|[0-9]+)', text, re.I | re.M)
Output I want:
Numbers - ['9:00', '203', '14th']
Flat No.203 (because flat no is beside 203)
14 is also beside a string, but I don't want it because that string is not in the list.
But how can I write the regex so that the second condition is satisfied, that is, so that the same regex checks
whether flat no is beside 203?

There you go:
(\d{1,2}:\d{1,2})|(?:No\. (\d+))|(\d+\w{2})
What does it do and how does it work?
I use two pipes (|) to combine the different number "types" you want:
First alternative (\d{1,2}:\d{1,2}) - captures the time: 1-2 digits followed by a colon and another 1-2 digits (you could probably require exactly 2 digits after the colon).
Second alternative (?:No\. (\d+)) - matches the literal prefix "No. " (note the space at the end) and captures the number that follows, however long (at least one digit).
Third and last alternative (\d+\w{2}) - captures one or more digits followed by two word characters. You could further improve this part of the regex to match only the st, nd, rd, and th suffixes, but I will leave this up to you.
Also to get rid of further unneeded matches you could use lookarounds, but again - I'll leave this up to you to implement.
General note - rather than using one regex to rule... erm - match them all, you should focus on creating many simple regexes. Not only will this improve legibility, but also maintainability of the regexes. This also allows you to search for timestamps, building numbers and positional numerals separately, easily allowing you to split this information to specific variables.
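To illustrate the "many simple regexes" advice, here is a minimal Python sketch. The three patterns and the names times, ordinals, and beside_keyword are my own assumptions for this example, not part of the answer's regex; the keyword pattern is built from the list in the question.

```python
import re

text = """Lily goes to school everyday at 9:00. Her House is near to her school.
Lily's address - Flat No. 203, 14th street lol lane, opp to yuta mall,
washington. Her school name is kids International."""
keywords = ['school', 'international', 'house', 'flat no']

# One small pattern per kind of number.
times = re.findall(r'\b\d{1,2}:\d{2}\b', text)
ordinals = re.findall(r'\b\d+(?:st|nd|rd|th)\b', text)

# A keyword from the list, an optional period, optional whitespace, then digits.
alternatives = '|'.join(re.escape(k) for k in keywords)
beside_keyword = re.findall(rf'\b({alternatives})\b\.?\s*(\d+)', text, re.IGNORECASE)

print(times)
print(ordinals)
print(beside_keyword)
```

Each result lands in its own variable, so timestamps, ordinals, and keyword-adjacent numbers can be handled separately.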

How to make regex that matches a number with commas for every three digits?

I am a beginner in Python and in regular expressions, and I am trying to work through an exercise that reads as follows:
How would you write a regex that matches a number with commas for
every three digits? It must match the following:
'42'
'1,234'
'6,368,745'
but not the following:
'12,34,567' (which has only two digits between the commas)
'1234' (which lacks commas)
I thought it would be easy, but I've already spent several hours and still don't have the right answer. Even the answer given in the book for this exercise doesn't work at all (the pattern in the book is ^\d{1,3}(,\d{3})*$).
Thank you in advance!

The answer in your book seems correct to me. It also works on the test cases you have given.
(^\d{1,3}(,\d{3})*$)
The '^' symbol anchors the match at the start of the line. \d{1,3} says there must be at least one digit but no more than three, so
1234,123
will not match.
(,\d{3})*$
This expression says that, up to the end of the line, there may be any number of groups consisting of one comma followed by three digits.
Maybe the answer you are looking for is this:
(^\d+(,\d{3})*$)
This matches a number with commas every three digits without limiting the leading group to three digits. Note that it therefore also accepts '1234', which the exercise says must not match.

You can go with this (which is a slightly improved version of what the book specifies):
^\d{1,3}(?:,\d{3})*$
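A small Python sketch that checks this pattern against the test cases from the exercise. The last two lines show one possible reason the book's version can look broken; that part is a guess, since the question does not say how the pattern was called.

```python
import re

pattern = re.compile(r'^\d{1,3}(?:,\d{3})*$')

cases = ['42', '1,234', '6,368,745', '12,34,567', '1234']
results = {s: bool(pattern.search(s)) for s in cases}
for s, ok in results.items():
    print(f'{s!r}: {ok}')

# With a capturing group, re.findall returns the group's text, not the whole match.
book_findall = re.findall(r'^\d{1,3}(,\d{3})*$', '6,368,745')
print(book_findall)
```

With the non-capturing group (?:...), findall returns the whole number instead.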

I got it to work by putting the stuff between the caret and the dollar in parentheses like so: re.compile(r'^(\d{1,3}(,\d{3})*)$')
but I find this regex pretty useless, because you can't use it to find these numbers in a document because the string has to begin and end with the exact phrase.

# This program validates the regular expression for this scenario.
# Any properly formatted number (with commas) will match.
# Parsing through a document for this regex is beyond my capability at this time.
import re

print('Type a number with commas')
sentence = input()
pattern = re.compile(r'\d{1,3}(,\d{3})*')
# fullmatch returns None unless the whole input fits the pattern.
matches = pattern.fullmatch(sentence)
if matches is None:
    # The input value does NOT match the pattern.
    print('Does Not Match the Regular Expression!')
else:
    # The values match, so state verification.
    print(matches.group(0) + ' matches the pattern.')

The simple answer is:
^\d{1,2}(,\d{3})*$
^\d{1,2} - must start with a number; matches 1 or 2 digits.
(,\d{3})*$ - after each ',' exactly 3 digits are required.
This works for all the scenarios listed in the book, but it rejects numbers with three leading digits, such as 123,456; \d{1,3} covers those as well.
Test your scenarios on https://pythex.org/

I also went down the rabbit hole trying to write a regex that is a solution to the question in the book. The question in the book does not assume that each line is such a number; that is, there might be multiple such numbers in the same line and there might be some kind of quotation marks around the number (similar to the question text). On the other hand, the solution provided in the book makes those assumptions: (^\d{1,3}(,\d{3})*$)
I tried to use the question text as input and ended up with the following pattern, which is way too complicated:
r'''(
(?:(?<=\s)|(?<=[\'"])|(?<=^))
\d{1,3}
(?:,\d{3})*
(?:(?=\s)|(?=[\'"])|(?=$))
)'''
(?:(?<=\s)|(?<=[\'"])|(?<=^)) is a non-capturing group that allows
the number to start after \s characters, ', ", or the start of the text.
(?:,\d{3})* is a non-capturing group to avoid capturing, for example, 123 in 12,123.
(?:(?=\s)|(?=[\'"])|(?=$)) is a non-capturing group that allows
the number to end before \s characters, ', ", or the end of the text (no newline case).
Obviously you could extend the list of allowed characters around the number.
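For reference, this is roughly how the pattern above can be run in Python. It has to be compiled with re.VERBOSE because it is spread over several lines; the inline comments and the sample text are additions for illustration.

```python
import re

number = re.compile(r'''(
    (?:(?<=\s)|(?<=[\'"])|(?<=^))   # preceded by whitespace, a quote, or the start
    \d{1,3}                         # one to three leading digits
    (?:,\d{3})*                     # zero or more groups of a comma and three digits
    (?:(?=\s)|(?=[\'"])|(?=$))      # followed by whitespace, a quote, or the end
)''', re.VERBOSE)

# Sample modelled on the exercise text, with the numbers in quotes.
sample = "match '42', '1,234' and '6,368,745' but not '12,34,567' or '1234'"
found = number.findall(sample)
print(found)
```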

regex- capturing text between matches

In the following text, I try to match a number followed by ")" and number followed by a period. I am trying to retrieve the text between the matches.
Example:
"1) there is a dsfsdfsd and 2) there is another one and 3) yet another
case"
so I am trying to output: ["there is a dsfsdfsd and", "there is another one and", "yet another case"]
I've used this regex: (?:\d)|\d.)
Adding a .* at the end matches the entire string, I only want it to match the words between
also in this string:
"we will give 4. there needs to be another option and 6.99 USD is a
bit amount"
I want to only match the 4. and not the 6.99
Any pointers will be appreciated. Thank you.

tldr
Regular expressions are tricky beasts and you should avoid them if at all possible.
If you can't avoid them, then make sure you have lots of test cases for all the edge cases that can occur.
Build up your regular expression slowly and systematically, testing your assumptions at every step.
If this code will go into production, then please write unit tests that explain the thinking process to the poor soul who has to maintain it one day.
The long version
Regular expressions are finicky. Your best approach may be to solve the problem a different way.
For example, your language might have a library function that allows you to split up strings using a regular expression to define what comes between the numbers. That will let you get away with writing a simpler regex to match the numbers and brackets/dots.
If you still decide to use regular expressions, then you need to be very structured about how you build up your regular expressions. It's extremely easy to miss edge cases.
So let's break this down piece by piece...
1. Set up a test environment for quickly experimenting with your regex.
There are lots of options here, depending on your programming language and OS. Ones I sometimes use are:
a Powershell window for testing .Net regexes (NB: the cli gives you a history of past attempts, so you can go back a few steps if you mess things up too badly)
a Python console for testing Python regexes (which are slightly different to .Net regexes in their syntax for named capture groups).
an html page with JavaScript to test the regex
an online or desktop regex tool (I still use the ancient Regular Expression Workbench from Eric Gunnerson, but I'm sure there are better alternatives these days)
Since you didn't specify a language or regex version, I'll assume .Net regular expressions
2. Create a single test string for testing a wider variety of options.
Your goal is to include as many edge cases as you can think of. Here's what I would use: "ab 1. there is a dsfsdfsd costing $6.99 and 2) there is another one and 3. yet another case 4)5) 6)10."
Note that I've added a few extra cases you didn't mention:
empty strings between two round bracket numbers: "4)" and "5)"
white space string between two round bracket numbers: "5)" and "6)"
empty strings between a round bracket number and a dotted number: "6)" and "10."
empty string after the dotted number "10." at the end of the string
random text and empty space, which should be ignored, before the first number
I'm going to make a few assumptions here, which you will need to vary based on your actual requirements:
You DO want to capture the white space after the dot or round bracket.
You DO want to capture the white space before the next dotted number or round bracket number.
You might have numbers that go beyond 9, so I've included "10" in the test cases.
You want to capture empty strings at the end e.g. after the "10."
NOTES:
Thinking through this test case forces you to be more rigorous about your requirements.
It will also help you be more efficient while you are manually testing your regular expression.
HOWEVER, this is assuming you aren't following a TDD approach. If you are, then you should probably do things a little differently... create unit tests for each scenario separately and get the regex working incrementally.
This test string doesn't cover all cases. For example, there are no newline or tab characters in the test string. Also it can't test for an empty string following a round bracket number at the very end.
3. First get a regex working that just captures the round brackets and dotted brackets.
Don't worry about the $6.99 edge case yet.
Drop the "(?:" non-capturing group syntax from your regex for now: "\d)|\d."
This doesn't even parse, because you have an unescaped round bracket.
The revised string is "\d\)|\d.", which parses, but which also matches "99" which you probably weren't expecting. That's because you forgot to escape the "."
The revised string is "\d\)|\d\.". This no longer matches "99", but it now matches "0." at the end instead of "10.". That's because it assumes that numbers will be single digit only.
The following string seems to work: "\d+\)|\d+\."
Time to deal with that pesky "$6.99" now...
4. Modify the regex so that it doesn't capture a floating point number.
You need to use a negative look ahead pattern to prevent a digit being after the decimal point.
Result: "\d+\)|\d+\.(?!\d)"
Count how many matches this produces. You're going to use this number for checking later results.
Hint: Save the regex pattern somewhere. You want to be able to go back to it any time you mess up your regex pattern beyond repair.
If you found a string splitting function, then you should use it now and avoid the complexity that follows. [I've included an example of this at the end.]
Simple is better, but I'm going to continue with the longer solution in the interests of showing an approach to staying in control of regex'es that start getting horribly complicated
5. Decide how to exclude that pattern
You used the non-capture group pattern in your question i.e. "(?:"
That approach can work. But it's a bit cumbersome, because you need to have a capturing group after it that you will look for instead.
It would be much nicer if your entire pattern matched what you are looking for.
So wrap the number pattern inside a zero-width positive look behind pattern (if your language supports it) i.e. "(?<=".
This checks for the pattern, but doesn't include it in what gets captured.
So now your regex looks like this: "(?<=\d+\)|\d+\.(?!\d))"
6. Test it!
It might seem silly to test this on its own - all the matches are empty strings.
Do it anyway. You want to sanity check every step of the way.
Make sure that it still produces the same number of matches as in step 4.
7. Decide how to match the text in between the numbers.
You rightly mention that ".*" will match the entire string, not just the parts in between.
There's a neat trick that allows you to reuse the pattern from step 5 to get the text in between.
Start by just matching the next character
The trick is that you want to match any character unless it's the start of the next number
That sounds like a negative look ahead pattern again: "(?!"
Let X be the pattern you saved in step 4. Matching a single character will look like this: "(?!X)."
You want to match lots of those characters. So put that pattern into a non-capturing group and repeat it: "(?:(?!X).)*"
This assumes you want to capture empty text.
If you're not, then change the "*" to a "+".
Hint: This is such a common pattern that you will want to reuse it in future, pasting in different patterns in place of X
I used a non-capturing group instead of a normal group so that you can also embed this pattern in regexes where you do care about the capturing groups
Resulting pattern: "(?:(?!\d+\)|\d+\.(?!\d)).)*"
I suggest testing this pattern on its own to see what it does
8. Now put parts 5 and 7 together: "(?<=\d+\)|\d+\.(?!\d))(?:(?!\d+\)|\d+\.(?!\d)).)*"
Test it!
9. Unit tests!
If this is going into production, then please write lots of unit tests that will explain each step of this thought process
Have pity on the poor soul who has to maintain your regex in future!
By rights that person should be you
I suggest putting a note in your calendar to return to this code in 6 months' time and make sure you can still understand it from the unit tests alone!
10. Refactor
In six months' time, if you can't understand the code any more, use your newfound insight (and incentive) to solve the problem without using regular expressions (or only very simple ones)
Addendum
As an example of using a string splitting function to get away with a simpler regex, here's a solution in Powershell:
$string = 'ab 1. there is a dsfsdfsd costing $6.99 and 2) there is another one and 3. yet another case 4)5) 6)10.'
$pattern = [regex] '\d+\)|\d+\.(?!\d)'
$string -split $pattern | select-object -skip 1
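The steps above can also be sketched in Python. This is an adaptation, not the .Net regex from the answer: Python's re module only accepts fixed-width lookbehind, so the sketch consumes the number with a non-capturing group and captures the text after it instead of using "(?<=". The variable names are my own.

```python
import re

test = ("ab 1. there is a dsfsdfsd costing $6.99 and 2) there is another one "
        "and 3. yet another case 4)5) 6)10.")

# Step 4: a number followed by ")" or by a "." that is not a decimal point.
delim = r"\d+\)|\d+\.(?!\d)"

# Step 7: any run of characters, none of which starts a new delimiter.
between = rf"(?:(?!{delim}).)*"

# Step 8, adapted: consume the delimiter, capture the text that follows it.
combined = re.compile(rf"(?:{delim})({between})")

delimiter_count = len(re.findall(delim, test))
pieces = combined.findall(test)
print(delimiter_count)
print(pieces)

# The simpler route from the addendum: split on the delimiter, drop the leading text.
via_split = re.split(delim, test)[1:]
print(via_split == pieces)
```

Both routes give one piece of text per delimiter, including the empty strings between adjacent numbers.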

Judging by the task you have, it might be easier to match the delimiters and use re.split (as also pointed out by bobblebubble in the comments).
I suggest simply
\d+[.)]\B\s*
It matches 1 or more digits, then a . or a ), then it makes sure there is no word letter (digit, letter or underscore) after it and then matches zero or more whitespace.
Python demo:
import re
rx = r'\d+[.)]\B\s*'
test_str = "1) there is a dsfsdfsd and 2) there is another one and 3) yet another case\n\"we will give 4. there needs to be another option and 6.99 USD is a bit amount"
print([x for x in re.split(rx,test_str) if x])

Try the following regex with the g modifier:
([A-Za-z\s\-_]+|\d(?!(\)|\.)\D)|\.\d)
Example: https://regex101.com/r/kB1xI0/3
[A-Za-z\s\-_]+ automatically matches all alphabetical characters + whitespace
\d(?!(\)|\.)\D) matches any digit that is not followed by a closing parenthesis or a period and then a non-digit (so the 6 in 6.99 still matches, but the 4 in "4. " does not)
\.\d match any period followed by numeric digit.

I used this pattern:
(?<=\d.\s)(.*?)(?=\d.\s)
This looks for the contents between any digit, any character, then a space.
Edit: Updated pattern to handle the currency issue and line ends better:
This is with flag 'g'
(?<=[0-9].\s)(.*?)(?=\s[0-9].\s|\n|\r)

import re
s = "1) there is a dsfsdfsd and 2) there is another one and 3) yet another case"
s1 = "we will give 4. there needs to be another option and 6.99 USD is a bit amount"
regex = re.compile(r"\d\)\s.*?|\s\d\.\D.*?")
print([x for x in regex.split(s) if x])
print(regex.split(s1))
Output:
['there is a dsfsdfsd and ', 'there is another one and ', 'yet another case']
['we will give', 'there needs to be another option and 6.99 USD is a bit amount']

Non greedy python regex

I'm trying to work my way through some regular expressions; I'm using python.
My task right now is to scrape newspaper articles and look for instances where people have died. Once I have a relevant article, I'm trying to snag the death count for some other things. I'm trying to come up with a few patterns, but I'm having difficulty with one in particular. Take this sample article section:
SANAA, Oct 21 (Reuters) - Three men thought to be al Qaeda militants
were killed in an apparent U.S. drone attack on a car in Yemen on
Sunday, tribal sources and local officials said.
The code that I'm using to snag the 'three' first does a replace on the entire document, so that the 'three' becomes a '3' before any patterns at all are applied. The pattern relevant to this example is this:
re.compile(r"(\d+)\s(:?men|women|children|people)?.*?(:?were|have been)? killed")
The idea is that this pattern will start with a number, be followed by an optional noun such as one of the ones listed, then have a minimum amount of clutter before finding 'dead' or 'died'. I want to leave room so that this pattern would catch:
3 people have been killed since Sunday
and still catch the instance in the example:
3 men thought to be al qaeda militants were killed
The problem is that the pattern I'm using is collecting the date from the first part of the article, and returning a count of 21. No amount of fiddling so far has enabled me to limit the scope to the digit right beside the word men, followed by the participial phrase, then the relevant 'were killed'.
Any help would be much appreciated. I'm definitely no guru when it comes to RE.

Don't make the men|women|children optional, i.e. take out the question mark after the closing parenthesis. The regex engine will match at the first possible place, regardless of whether repetition operators are greedy or stingy.
Alternatively, or additionally, make the "anything here" pattern only match non-numbers, i.e. replace .*? with \D*?
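A minimal Python sketch comparing the pattern from the question with the two suggestions above. The article text is the question's sample after the 'Three' to '3' replacement; the variable names are hypothetical.

```python
import re

article = ("SANAA, Oct 21 (Reuters) - 3 men thought to be al Qaeda militants "
           "were killed in an apparent U.S. drone attack on a car in Yemen on Sunday")

# Pattern from the question: the noun is optional, so matching starts at "21".
original = re.compile(r"(\d+)\s(:?men|women|children|people)?.*?(:?were|have been)? killed")
# Suggestion 1: the noun is required.
required_noun = re.compile(r"(\d+)\s(?:men|women|children|people).*?(?:were|have been)? killed")
# Suggestion 2: only non-digits are allowed between the number and "killed".
no_digits_between = re.compile(r"(\d+)\s(?:men|women|children|people)?\D*?(?:were|have been)? killed")

counts = {
    "original": original.search(article).group(1),
    "required_noun": required_noun.search(article).group(1),
    "no_digits_between": no_digits_between.search(article).group(1),
}
print(counts)
```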

This is because you have used the quantifier ?, which matches 0 or 1 occurrences of your (:?men|women|children|people) group after the digits. So 21 will match, since it is followed by 0 of them.
Try removing the quantifier so that exactly one of them must match:
re.compile(r"(\d+)\s(?:men|women|children|people).*?(?:were|have been)? killed")
UPDATE: To keep the ? quantifier and still get the required result, use a negative lookahead to make sure the digits are not followed by text containing a hyphen (-), as they are in your example.
re.compile(r"(\d+)(?!.*?-.*?)\s(?:men|women|children|people)?.*?(?:were|have been)? killed")

You are using the wrong syntax, (:?...). You probably wanted (?:...).
Use regex pattern
(\d+).*?\b(?:men|women|children|people|)\b.*?\b(?:were|have been|)\b.*?\bkilled\b
or if just spaces are allowed between those words, then
(\d+)\s+(?:men|women|children|people|)\s+(?:were|have been|)\s+killed\b
