Non greedy python regex - python

I'm trying to work my way through some regular expressions; I'm using python.
My task right now is to scrape newspaper articles and look for instances where people have died. Once I have a relevant article, I'm trying to snag the death count for some other things. I'm trying to come up with a few patterns, but I'm having difficulty with one in particular. Take this sample article section:
SANAA, Oct 21 (Reuters) - Three men thought to be al Qaeda militants
were killed in an apparent U.S. drone attack on a car in Yemen on
Sunday, tribal sources and local officials said.
The code that I'm using to snag the 'three' first does a replace on the entire document, so that the 'three' becomes a '3' before any patterns at all are applied. The pattern relevant to this example is this:
re.compile(r"(\d+)\s(:?men|women|children|people)?.*?(:?were|have been)? killed")
The idea is that this pattern will start with a number, be followed by an optional noun such as one of the ones listed, then have a minimum amount of clutter before finding 'killed'. I want to leave room so that this pattern would catch:
3 people have been killed since Sunday
and still catch the instance in the example:
3 men thought to be al qaeda militants were killed
The problem is that the pattern I'm using is collecting the date from the first part of the article, and returning a count of 21. No amount of fiddling so far has enabled me to limit the scope to the digit right beside the word men, followed by the participial phrase, then the relevant 'were killed'.
Any help would be much appreciated. I'm definitely no guru when it comes to RE.

Don't make the men|women|children optional, i.e. take out the question mark after the closing parenthesis. The regex engine will match at the first possible place, regardless of whether repetition operators are greedy or stingy.
Alternatively, or additionally, make the "anything here" pattern only match non-numbers, i.e. replace .*? with \D*?
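A minimal sketch of both suggestions, assuming the number words have already been replaced by digits as the question describes. The sample string and the variable names are illustrative only, and the groups are written with the non-capturing (?:...) spelling.

```python
import re

text = ("SANAA, Oct 21 (Reuters) - 3 men thought to be al Qaeda militants "
        "were killed in an apparent U.S. drone attack on a car in Yemen on Sunday")

# Optional noun: the engine succeeds at the first digits it meets, i.e. the date.
optional_noun = re.compile(
    r"(\d+)\s(?:men|women|children|people)?.*?(?:were|have been)? killed")

# Required noun: "21 (Reuters)" can no longer start a match.
required_noun = re.compile(
    r"(\d+)\s(?:men|women|children|people).*?(?:were|have been)? killed")

# Filler restricted to non-digits: a match cannot run across a later number.
no_digits_between = re.compile(
    r"(\d+)\s(?:men|women|children|people)?\D*?(?:were|have been)? killed")

print(optional_noun.search(text).group(1))      # the date
print(required_noun.search(text).group(1))      # the count
print(no_digits_between.search(text).group(1))  # the count
```

Both variants still match the shorter sentence "3 people have been killed since Sunday".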

This is because you used the ? quantifier, which matches 0 or 1 occurrences of (:?men|women|children|people) after the digits. So 21 matches, since it is followed by zero of them.
Try removing the quantifier so that exactly one of them must match:
re.compile(r"(\d+)\s(?:men|women|children|people).*?(?:were|have been)? killed")
UPDATE: To keep the ? quantifier and still get the required result, you can use a negative lookahead to make sure that the digits are not followed by a string containing a hyphen (-), as they are in your example.
re.compile(r"(\d+)(?!.*?-.*?)\s(?:men|women|children|people)?.*?(?:were|have been)? killed")

You are using the wrong syntax, (:?...). You probably meant (?:...).
Use regex pattern
(\d+).*?\b(?:men|women|children|people|)\b.*?\b(?:were|have been|)\b.*?\bkilled\b
or if just spaces are allowed between those words, then
(\d+)\s+(?:men|women|children|people|)\s+(?:were|have been|)\s+killed\b
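To illustrate the (:?...) versus (?:...) point above, here is a small sketch. The two-word alternation is a shortened, hypothetical stand-in for the full list in the question.

```python
import re

# "(:?" opens an ordinary capturing group whose first alternative
# starts with an optional literal colon.
typo = re.compile(r"(:?men|women)")

# "(?:" opens a non-capturing group, which is what was intended.
intended = re.compile(r"(?:men|women)")

print(typo.groups, intended.groups)            # number of capturing groups
print(typo.fullmatch(":men") is not None)      # the colon is accepted
print(intended.fullmatch(":men") is not None)  # the colon is rejected
```

With the typo the pattern still matches ordinary text, which is why the mistake is easy to overlook.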

Related

Regex: match address string if multiple words

Disclaimer: I know from this answer that regex isn't great for U.S. addresses since they're not regular. However, this is a fairly small project and I want to see if I can reduce the number of false positives.
My challenge is to distinguish (i.e. match) between addresses like "123 SOUTH ST" and "123 SOUTH MAIN ST". The best solution I can come up with is to check if more than 1 word comes after the directional word.
My python regex is of the form:
^(NORTH|SOUTH|EAST|WEST)(\s\S*\s\S*)+$
Explanation:
^(NORTH|SOUTH|EAST|WEST) matches direction at the start of the string
(\s\S*\s\S*)+$ attempts to match a space, a word of any length, another space, and another word of any length 1 or more times
But my expression doesn't seem to distinguish between the 2 types of term. Where's my error (besides using regex for U.S. addresses)?
Thanks for your help.
Your regex misses the number at the beginning of the address and treats the optional word (MAIN in this case) as mandatory. Try this:
^\d+ (NORTH|SOUTH|EAST|WEST)((\s\S*)?\s\S*)+$
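If the goal is to accept only addresses with more than one word after the directional word, one possible sketch is to require at least two whitespace-separated words there. This is an assumption about the intent of the question, and the name multi_word and the sample addresses are made up for illustration.

```python
import re

# House number, direction, then two or more further words.
multi_word = re.compile(r"^\d+\s+(?:NORTH|SOUTH|EAST|WEST)(?:\s+\S+){2,}$")

for address in ("123 SOUTH ST", "123 SOUTH MAIN ST", "45 WEST OLD MILL RD"):
    print(address, bool(multi_word.match(address)))
```

Using \S+ rather than \S* means every repetition has to consume a real word, so a single trailing word cannot satisfy the {2,} count.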

Regex - If not match then match this - Python

I apologise for the amount of text, but I cannot wrap my head around this and I would like to make my problem clear.
I am currently attempting to create a regex expression to find the end of a website/email link in order to then process the rest of the address. I have decided to look for the ending of the address (eg. '.com', '.org', '.net'); however, I am having difficulty in two areas when dealing with this. (I have chosen this method as it is the best fit for the current project)
Firstly I am trying to get around accidentally hindering users typing words with these keywords within them (eg. '"org"anisation', 'try this "or g"o to'). How I have tackled this is, as an example, the regex:
org(?!\w) - To skip the match if there are letters directly after the keyword.
The secondary problem is finding extra parts of an address (eg. 'www.website."org".uk') which would not be matched. To combat this, as an example, I have used the regex:
org((\W*|\.|dot)\w\w) - In an attempt to find the first two letters after the keyword, as most extensions are only two letters.
The Main Problem:
In order to prevent both of the above situations I have used the regex akin to:
org(.|dot)\w\w|(?!\w)
However, I am not as versed as I would like to be in Regex to find a solution and I understand that this would not create correct results. I know there is a form of 'If this then that' within Regex but I just can't seem to understand the online documentation I have found on the subject.
If possible would someone be able to explain how I may go about creating a system to say:
IF: NOT org(\w)
ELSE IF: org(.|dot)
THEN: MATCH org(.|dot)\w\w
ELSE: MATCH org
I would really appreciate any help on the matter, this has been on my mind for a while now. I would just like to see it through, but I just do not possess the required knowledge.
Edit:
Test cases the Regex would need to pass (Specifically for the 'org' regex for these examples):
(I have marked matches in square brackets '[ ]', and I have marked possible matches to be disregarded with '< >' )
"Hello, please come and check out my website: www.website.[org]"
"I have just uploaded a new game at games.[org.uk]"
"If you would like quote please email me at email#email.[org.ru]"
"I have just made a new <org>anisation website at website.[org], please get in contact at name.name#email.[org.us]"
"For more info check info.[org] <or g>o to info.[org.uk]"
I hope this allows for a better insight to what the Regex needs to do.
The following regex:
(?i)(?<=\.)org(?:\.[a-z]{2})?\b
should do the work for you.
demo:
https://regex101.com/r/8F9qbQ/2/
explanations:
(?i) to activate the case as insensitive (.ORG or .org)
(?<=\.) requires a . directly before org, to avoid matches when org is actually part of a word.
org to match ORG or org
(?:...)? non capturing group that can appear 0 to 1 time
\.[a-z]{2} dot followed by exactly 2 letters (upper or lower case, because of (?i))
\b word boundary constraint
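As a quick check, the pattern can be run against the test cases from the question. This is only a sketch: the name tld is made up, and the strings are copied from the question with the bracket markers removed.

```python
import re

tld = re.compile(r"(?i)(?<=\.)org(?:\.[a-z]{2})?\b")

cases = [
    "Hello, please come and check out my website: www.website.org",
    "I have just uploaded a new game at games.org.uk",
    "If you would like quote please email me at email#email.org.ru",
    "I have just made a new organisation website at website.org, "
    "please get in contact at name.name#email.org.us",
    "For more info check info.org or go to info.org.uk",
]

# The pattern has no capturing groups, so findall returns whole matches.
found = [tld.findall(case) for case in cases]
for case, hits in zip(cases, found):
    print(hits, "<-", case)
```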
There are some other, simpler ways to catch any website, but assuming that you need exactly the feature IF: NOT org(\w) ELSE IF: org(.|dot) THEN: MATCH org(.|dot)\w\w ELSE: MATCH org, then you can use:
org(?!\w)(\.\w\w)?
It will match:
"org.uk" of www.domain.org.uk
"org" of www.domain.org
But will not match www.domain.orgzz and orgzz
Explanation:
The org(?!\w) part matches org only when it is not followed by a word character. It matches the org in "org" and in "org." but does not match in "orgzz".
Then, once org has matched, (\.\w\w)? tries to match an additional dot and two word characters. The ? quantifier makes this part optional, so it picks up the .uk when it is there but does not require it.
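A short sketch of how this pattern behaves in Python. Because (\.\w\w) is a capturing group, the example reads the whole match with group(0) instead of relying on findall. The helper name first_match and the sample strings are illustrative.

```python
import re

pattern = re.compile(r"org(?!\w)(\.\w\w)?")

def first_match(text):
    """Return the whole first match, or None when nothing matches."""
    m = pattern.search(text)
    return m.group(0) if m else None

for sample in ("www.domain.org.uk", "www.domain.org", "www.domain.orgzz", "orgzz", "cyborg"):
    print(sample, "->", first_match(sample))
```

Unlike the lookbehind version, this pattern does not require a dot before org, so a word that merely ends in "org", such as "cyborg", also matches.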
I made a little regex that captures a website as long as it starts with 'www.' that is followed by some characters with a following '.'.
import re
matcher = re.compile(r'(www\.\S*\.\S*)')  # raw string; matches any website with layout www.whatever
string = 'they sky is very blue www.harvard.edu.co see nothing else triggers it, www, org'
match = re.search(matcher, string).group(1)
#output
#'www.harvard.edu.co'
Now you can tighten this up as needed to avoid false positives.

How can I get a regular expression to find the correct instance of a word?

I'm trying to write a regular expression in python to identify instances of the phrases "played for" and "plays for" in a text, with the potential for finding instances where words come between the two, for example, "played guitar for". I only want this to find the first instance of the word "for" after "plays" or "played", however, I cannot work out how to write the regular expression.
The code I have at the moment is like this:
import re
def play_finder(doc):
    playre = re.compile(r'\bplay[s|e][d]?\b.*\bfor\b\s\b')
    if playre.findall(doc):
        for inst in playre.findall(doc):
            playstr = inst
            print(playstr)
mytext = "He played for four hours last night. He plays guitar for the foo pythers. He won an award for his guitar playing."
play_finder(mytext)
I would like to be able to pull out two instances from mytext: "played for four" and "plays guitar for the".
Instead, what my code is finding is:
"He played for four hours last night. He plays guitar for the foo pythers. He won an award for".
So it's skipping the first and second for, and only finding the last.
How can I rewrite the regular expression to get it to stop skipping over the first and second instance of "for" in the sentence, and to identify both of them?
Edit: Another problem has become apparent to me after applying a solution I was offered. Given more than one sentence, such as:
"He played an eight hour set. It seemed like he went on for ever."
I don't want the regex to identify "He played an eight hour set. It seemed like he went on for" as matching the pattern. Is there a way to stop it looking for the "for" if it encounters a full stop?
You can try this,
\bplay(?:s|ed).*?for\b
There are some faults in the regex of your script.
playre = re.compile(r'\bplay[s|e][d]?\b.*\bfor\b\s\b')
[s|e]: this does not work as alternation, because [] is a character class and matches exactly one of the characters listed inside it.
.*: the greedy * matches the longest possible string, so the match runs all the way to the last "for".
Somebody answered that I needed the lazy .*? then deleted their answer. I'm not sure why, because that worked. Hence, the code I'm using now is:
(r'\bplay[s|e][d]?\b.*?\bfor\b\s\b')
@ThmLee I tried your suggestion:
\bplay(s|ed).*?for\b
I'm (clearly) no expert with Regex, but it seemed not to work as well. Instead of outputting the lines "played for" and " plays guitar for" it just outputs "s" and "ed".
You misunderstand the use of square brackets. They create a character class which matches a single character out of the set of characters enumerated between the brackets. So [s|e] matches s or | or e.
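A small sketch of the difference. The word lists are made up for illustration, and fullmatch is used so that each candidate has to match in its entirety.

```python
import re

char_class = re.compile(r"play[s|e]")     # exactly one of: s, |, e
alternation = re.compile(r"play(?:s|ed)")  # either "s" or "ed"

words = ("plays", "playe", "play|", "played")
by_class = [w for w in words if char_class.fullmatch(w)]
by_group = [w for w in words if alternation.fullmatch(w)]
print(by_class)
print(by_group)
```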
Also, the word boundary is simply an assertion. It matches if the previous character was a "word" character and the next one isn't, or vice versa; but it doesn't advance the position within the string. So, for example, \s\bfor\b\s is redundant; we already know that \s matches whitespace (which is non-word) and for consists of word characters. You mean simply \sfor\s because the dropped \b conditions don't change what is being matched.
Try
r'\bplay(?:s|ed)?\s+(?:\w+\s+)??for\s+\w+'
The (?:\w+\s+)?? allows for a single optional word before for. The second question mark makes the group non-greedy, i.e. it matches the shortest possible string which still allows the expression to match, instead of the longest. You will not want to allow unlimited repetitions (because then you'd match e.g. "played another game before he sat down for") but you might consider replacing the ?? with e.g. {0,3}? to allow for up to three words before "for".
We use (?:...) instead of (...) to make the grouping parentheses non-capturing; otherwise, findall will return a list of the captured submatches rather than the entire match.
The if findall: for findall is a minor inefficiency; you just need for match in findall which will simply iterate zero times if there are no matches.
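Putting these points together, a sketch of the question's function might look like the following. It returns the matches instead of printing them, which is a choice made here for illustration, and the strings are the ones from the question.

```python
import re

def play_finder(doc):
    """Return each 'play/plays/played [one word] for <word>' phrase in doc."""
    playre = re.compile(r"\bplay(?:s|ed)?\s+(?:\w+\s+)??for\s+\w+")
    return playre.findall(doc)

mytext = ("He played for four hours last night. He plays guitar for the "
          "foo pythers. He won an award for his guitar playing.")
other = "He played an eight hour set. It seemed like he went on for ever."

print(play_finder(mytext))
print(play_finder(other))

# With a capturing group, findall returns only the captured text.
print(re.findall(r"\bplay(s|ed).*?for\b", mytext))
```

Because at most one word is allowed between the verb and "for", the second string produces no match and the search never crosses the full stop.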
More generally, using regex for higher-level grammatical patterns is very often unsatisfactory. A grammatical parser (even some type of shallow parsing) is better at telling you when some words are constituents of an optional attribute or modifier for a noun phrase, or when "play" should be analyzed as a noun. Consider
He played - or rather, tapped his fingers and hummed - for three minutes.
I play another silly but not completely outrageous role for the third time in a year.
She plays what for many is considered offensive gameplay for the Hawks.
Brett plays the oboe although he thinks it's for wimps.
Some plays are for fools.

regex- capturing text between matches

In the following text, I try to match a number followed by ")" and number followed by a period. I am trying to retrieve the text between the matches.
Example:
"1) there is a dsfsdfsd and 2) there is another one and 3) yet another
case"
so I am trying to output: ["there is a dsfsdfsd and", "there is another one and", "yet another case"]
I've used this regex: (?:\d)|\d.)
Adding a .* at the end matches the entire string, I only want it to match the words between
also in this string:
"we will give 4. there needs to be another option and 6.99 USD is a
bit amount"
I want to only match the 4. and not the 6.99
Any pointers will be appreciated. Thank you.
tldr
Regular expressions are tricky beasts and you should avoid them if at all possible.
If you can't avoid them, then make sure you have lots of test cases for all the edge cases that can occur.
Build up your regular expression slowly and systematically, testing your assumptions at every step.
If this code will go into production, then please write unit tests that explain the thinking process to the poor soul who has to maintain it one day.
The long version
Regular expressions are finicky. Your best approach may be to solve the problem a different way.
For example, your language might have a library function that allows you to split up strings using a regular expression to define what comes between the numbers. That will let you get away with writing a simpler regex to match the numbers and brackets/dots.
If you still decide to use regular expressions, then you need to be very structured about how you build up your regular expressions. It's extremely easy to miss edge cases.
So let's break this down piece by piece...
1. Set up a test environment for quickly experimenting with your regex.
There are lots of options here, depending on your programming language and OS. Ones I sometimes use are:
a Powershell window for testing .Net regexes (NB: the cli gives you a history of past attempts, so you can go back a few steps if you mess things up too badly)
a Python console for testing Python regexes (which are slightly different to .Net regexes in their syntax for named capture groups).
an html page with JavaScript to test the regex
an online or desktop regex tool (I still use the ancient Regular Expression Workbench from Eric Gunnerson, but I'm sure there are better alternatives these days)
Since you didn't specify a language or regex version, I'll assume .Net regular expressions
2. Create a single test string for testing a wider variety of options.
Your goal is to include as many edge cases as you can think of. Here's what I would use: "ab 1. there is a dsfsdfsd costing $6.99 and 2) there is another one and 3. yet another case 4)5) 6)10."
Note that I've added a few extra cases you didn't mention:
empty strings between two round bracket numbers: "4)" and "5)"
white space string between two round bracket numbers: "5)" and "6)"
empty strings between a round bracket number and a dotted number: "6)" and "10."
empty string after the dotted number "10." at the end of the string
random text and empty space, which should be ignored, before the first number
I'm going to make a few assumptions here, which you will need to vary based on your actual requirements:
You DO want to capture the white space after the dot or round bracket.
You DO want to capture the white space before the next dotted number or round bracket number.
You might have numbers that go beyond 9, so I've included "10" in the test cases.
You want to capture empty strings at the end e.g. after the "10."
NOTES:
Thinking through this test case forces you to be more rigorous about your requirements.
It will also help you be more efficient while you are manually testing your regular expression.
HOWEVER, this is assuming you aren't following a TDD approach. If you are, then you should probably do things a little differently... create unit tests for each scenario separately and get the regex working incrementally.
This test string doesn't cover all cases. For example, there are no newline or tab characters in the test string. Also it can't test for an empty string following a round bracket number at the very end.
3. First get a regex working that just captures the round brackets and dotted brackets.
Don't worry about the $6.99 edge case yet.
Drop the "(?:" non-capturing group syntax from your regex for now: "\d)|\d."
This doesn't even parse, because you have an unescaped round bracket.
The revised string is "\d\)|\d.", which parses, but which also matches "99" which you probably weren't expecting. That's because you forgot to escape the "."
The revised string is "\d\)|\d\.". This no longer matches "99", but it now matches "0." at the end instead of "10.". That's because it assumes that numbers will be single digit only.
The following string seems to work: "\d+\)|\d+\."
Time to deal with that pesky "$6.99" now...
4. Modify the regex so that it doesn't capture a floating point number.
You need to use a negative look ahead pattern to prevent a digit being after the decimal point.
Result: "\d+\)|\d+\.(?!\d)"
Count how many matches this produces. You're going to use this number for checking later results.
Hint: Save the regex pattern somewhere. You want to be able to go back to it any time you mess up your regex pattern beyond repair.
If you found a string splitting function, then you should use it now and avoid the complexity that follows. [I've included an example of this at the end.]
Simple is better, but I'm going to continue with the longer solution in the interests of showing an approach to staying in control of regex'es that start getting horribly complicated
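The progression so far can be replayed in a Python console. This is a sketch that assumes Python's re module rather than .NET (these particular patterns behave the same way in both), and the names attempts and results are made up for illustration.

```python
import re

test = ("ab 1. there is a dsfsdfsd costing $6.99 and 2) there is another one "
        "and 3. yet another case 4)5) 6)10.")

attempts = [
    r"\d\)|\d.",           # unescaped dot: also picks up "99"
    r"\d\)|\d\.",          # single digit only: "0." instead of "10."
    r"\d+\)|\d+\.",        # multi-digit, but still matches the "6." in "$6.99"
    r"\d+\)|\d+\.(?!\d)",  # lookahead rejects a digit after the dot
]

results = {p: re.findall(p, test) for p in attempts}
for p, found in results.items():
    print(len(found), found, "<-", p)
```

The count printed for the last pattern is the number to compare against in the later steps.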
5. Decide how to exclude that pattern
You used the non-capture group pattern in your question i.e. "(?:"
That approach can work. But it's a bit cumbersome, because you need to have a capturing group after it that you will look for instead.
It would be much nicer if your entire pattern matched what you are looking for.
So wrap the number pattern inside a zero-width positive look behind pattern (if your language supports it) i.e. "(?<=".
This checks for the pattern, but doesn't include it in what gets captured.
So now your regex looks like this: "(?<=\d+\)|\d+\.(?!\d))"
6. Test it!
It might seem silly to test this on its own - all the matches are empty strings.
Do it anyway. You want to sanity check every step of the way.
Make sure that it still produces the same number of matches as in step 4.
7. Decide how to match the text in between the numbers.
You rightly mention that ".*" will match the entire string, not just the parts in between.
There's a neat trick that allows you to reuse the pattern from step 5 to get the text in between.
Start by just matching the next character
The trick is that you want to match any character unless it's the start of the next number
That sounds like a negative look ahead pattern again: "(?!"
Let X be the pattern you saved in step 4. Matching a single character will look like this: "(?!X)."
You want to match lots of those characters. So put that pattern into a non-capturing group and repeat it: "(?:(?!X).)*"
This assumes you want to capture empty text.
If you're not, then change the "*" to a "+".
Hint: This is such a common pattern that you will want to reuse it in future pasting in different patterns in place of X
I used a non-capturing group instead of a normal group so that you can also embed this pattern in regexes where you do care about the capturing groups
Resulting pattern: "(?:(?!\d+\)|\d+\.(?!\d)).)*"
I suggest testing this pattern on its own to see what it does
8. Now put parts 5 and 7 together: "(?<=\d+\)|\d+\.(?!\d))(?:(?!\d+\)|\d+\.(?!\d)).)*"
9. Test it!
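For readers testing in Python rather than .NET: Python's re module only accepts fixed-width lookbehind, so the "(?<=\d+\)|...)" part cannot be used there as written. A possible workaround, sketched below under that assumption, is to consume the number and capture the text that follows it in a group. X is the pattern saved in step 4; the other names are illustrative.

```python
import re

test = ("ab 1. there is a dsfsdfsd costing $6.99 and 2) there is another one "
        "and 3. yet another case 4)5) 6)10.")

X = r"\d+\)|\d+\.(?!\d)"      # the number pattern saved in step 4
between = rf"(?:(?!{X}).)*"   # characters, none of which starts a new number

# Consume the number, then capture everything up to the next number.
pattern = re.compile(rf"(?:{X})({between})")

pieces = pattern.findall(test)
print(len(pieces))
print(pieces)
```

The number of pieces should equal the number of matches counted in step 4, including the empty strings between adjacent numbers.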
10. Unit tests!
If this is going into production, then please write lots of unit tests that will explain each step of this thought process
Have pity on the poor soul who has to maintain your regex in future!
By rights that person should be you
I suggest putting a note in your calendar to return to this code in 6 months' time and make sure you can still understand it from the unit tests alone!
11. Refactor
In six months' time, if you can't understand the code any more, use your newfound insight (and incentive) to solve the problem without using regular expressions (or only very simple ones)
Addendum
As an example of using a string splitting function to get away with a simpler regex, here's a solution in Powershell:
$string = 'ab 1. there is a dsfsdfsd costing $6.99 and 2) there is another one and 3. yet another case 4)5) 6)10.'
$pattern = [regex] '\d+\)|\d+\.(?!\d)'
$string -split $pattern | select-object -skip 1
Judging by the task you have, it might be easier to match the delimiters and use re.split (as also pointed out by bobblebubble in the comments).
I suggest a mere
\d+[.)]\B\s*
See it in action (demo)
It matches 1 or more digits, then a . or a ), then makes sure that no word character (digit, letter or underscore) follows, and then matches zero or more whitespace characters.
Python demo:
import re
rx = r'\d+[.)]\B\s*'
test_str = "1) there is a dsfsdfsd and 2) there is another one and 3) yet another case\n\"we will give 4. there needs to be another option and 6.99 USD is a bit amount"
print([x for x in re.split(rx,test_str) if x])
Try the following regex with the g modifier:
([A-Za-z\s\-_]+|\d(?!(\)|\.)\D)|\.\d)
Example: https://regex101.com/r/kB1xI0/3
[A-Za-z\s\-_]+ automatically matches all alphabetical characters + whitespace
\d(?!(\)|\.)\D) match any numeric sequence of digits not followed by a closing parenthesis ) or decimal value (.99)
\.\d match any period followed by numeric digit.
I used this pattern:
(?<=\d.\s)(.*?)(?=\d.\s)
This looks for the contents between any digit, any character, then a space.
Edit: Updated pattern to handle the currency issue and line ends better:
This is with flag 'g'
(?<=[0-9].\s)(.*?)(?=\s[0-9].\s|\n|\r)
import re
s = "1) there is a dsfsdfsd and 2) there is another one and 3) yet another case"
s1 = "we will give 4. there needs to be another option and 6.99 USD is a bit amount"
regex = re.compile(r"\d\)\s.*?|\s\d\.\D.*?")  # raw string so the backslashes reach the regex engine
print([x for x in regex.split(s) if x])
print(regex.split(s1))
Output:
['there is a dsfsdfsd and ', 'there is another one and ', 'yet another case']
['we will give', 'there needs to be another option and 6.99 USD is a bit amount']

python re problem

I tested re in several Python web shells, and all of them hit the same issue.
If I use
a=re.findall(r"""<ul>[\s\S]*?<li><a href="(?P<link>[\s\S]*?)"[\s\S]*?<img src="(?P<img>[\s\S]*?)"[\s\S]*?<br/>[\s\S]*?</li>[\s\S]*?</li>[\s\S]*?</li>[\s\S]*?</ul>""",html)
print a
it works fine,
but if I use
a=re.findall(r"""<ul>[\s\S]*?<li><a href="(?P<link>[\s\S]*?)"[\s\S]*?<img src="(?P<img>[\s\S]*?)"[\s\S]*?<br/>[\s\S]*?</li>[\s\S]*?</li>[\s\S]*?</li>[\s\S]*?</ul>d""",html)
print a
it blocks the server, which then waits forever as if it were dead. I have also tested this in RegexBuddy.
The only difference between the two snippets is that I added the character 'd' at the end of the second regular expression.
Can anyone explain why this occurs?
The expression [\s\S]*? can match any amount of anything. This can potentially cause an enormous amount of backtracking in the case that the match fails. If you are more specific about what you can and can't match then it will allow the match to fail faster.
Also, I'd advise you to use an HTML parser instead of regular expressions for this. Beautiful Soup is an excellent library that is easy to use.
Your regex is suffering from catastrophic backtracking. If it can find a match it's fine, but if it can't, it has to try a virtually infinite number of possibilities before it gives up. Every one of those [\s\S]*? constructs ends up trying to match all the way to the end of the document, and the interaction between them creates a staggering amount of useless work.
Python's re module did not support atomic groups before Python 3.11, but here's a little trick you can use to imitate them:
a=re.findall(r"""(?=(<ul>[\s\S]*?<li><a href="(?P<link>[\s\S]*?)"[\s\S]*?<img src="(?P<img>[\s\S]*?)"[\s\S]*?<br/>[\s\S]*?</li>[\s\S]*?</li>[\s\S]*?</li>[\s\S]*?</ul>))\1d""",html)
print a
If the lookahead succeeds, the whole <UL> element is captured in group #1, the match position resets to the beginning of the element, then the \1 backreference consumes the element. But if the next character is not d, it does not go back and muck about with all those [\s\S]*? constructs again, like your regex does.
Instead, the regex engine goes straight back to the beginning of the <UL> element, then bumps ahead one position (so it's between the < and the u) and tries the lookahead again from the beginning. It keeps doing that until it finds another match for the lookahead, or it reaches the end of the document. In this way, it will fail (the expected result) in about the same time your first regex took to succeed.
Note that I'm not presenting this trick as a solution, just trying to answer your question as to why your regex seems to hang. If I were offering a solution, I would say to stop using [\s\S]*? (or [\s\S]*, or .*, or .*?); you're relying on that too much. Try to be as specific as you reasonably can--for example, instead of:
<a href="(?P<link>[\s\S]*?)"[\s\S]*?<img src="(?P<img>[\s\S]*?)"[\s\S]*?
...use:
<a href="(?P<link>[^"]*)"[^>]*><img src="(?P<img>[^"]*)"[^>]*>
But even that has serious problems. You should seriously consider using an HTML parser for this job. I love regexes too, but you're asking too much from them.
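As a rough sketch of the parser route, the example below uses the standard library's html.parser instead of Beautiful Soup, purely to stay self-contained. The class name, the sample markup and the assumption that each li holds one link and one image are all made up for illustration.

```python
from html.parser import HTMLParser

class ListItemLinks(HTMLParser):
    """Collect (href, img src) pairs, one per <li> element."""

    def __init__(self):
        super().__init__()
        self.current = None   # data for the <li> being read, if any
        self.items = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li":
            self.current = {"link": None, "img": None}
        elif self.current is not None:
            # Keep only the first link and first image inside each <li>.
            if tag == "a" and self.current["link"] is None:
                self.current["link"] = attrs.get("href")
            elif tag == "img" and self.current["img"] is None:
                self.current["img"] = attrs.get("src")

    def handle_endtag(self, tag):
        if tag == "li" and self.current is not None:
            self.items.append((self.current["link"], self.current["img"]))
            self.current = None

html = """<ul>
<li><a href="/a.html"><img src="/a.png" alt=""></a><br/>first</li>
<li><a href="/b.html"><img src="/b.png"></a><br/>second</li>
</ul>"""

parser = ListItemLinks()
parser.feed(html)
print(parser.items)
```

A parser reads the document once from start to end, so there is no backtracking to blow up when the expected structure is missing.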
