Python Match xx-xxxx Numbers - Inaccurate Results - python

I'm quite weak in regex.
I'm trying to match a string which could be anything like the following:
12-1234 *string*
12 1234 *string*
or
12 123 *string*
12-1234 *string*
As long as that pattern is found in a given string, then it should pass...
I figured this should be sufficient:
a = re.compile(r"^\d{0,2}[\- ]\d{0,4}$")
if a.match(dbfull_address):
    continue
Yet I'm still getting inaccurate results:
12 string
I guess I need help with my regex :D

^\d{0,2}[\- ]\d{0,4}$
allows zero digits around the space/dash, so you probably want to use \d{1,2}[- ]\d{1,4}.
Also, you should remove the $ anchor, unless you only want to match lines where nothing follows the second number.
The ^ anchor is unnecessary as well since Python's .match() method implicitly anchors the regex match to the start of the string.
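As a rough sketch of the suggestion above, the snippet below compiles the corrected pattern and tries it against the sample strings from the question. The names pattern, samples and results are only illustrative, and "12 string" is included as a case that should be rejected.

```python
import re

# At least one digit is required on each side of the dash/space.
# No ^ is needed because .match() already starts at the beginning of the string,
# and no $ is used so that trailing text after the number is allowed.
pattern = re.compile(r"\d{1,2}[- ]\d{1,4}")

samples = ["12-1234 string", "12 1234 string", "12 123 string", "12 string"]

# Map every sample to True/False depending on whether the pattern matches.
results = {s: bool(pattern.match(s)) for s in samples}

for sample, ok in results.items():
    print(f"{sample!r}: {ok}")
```

If the number can appear anywhere in the string rather than only at the start, pattern.search() would be the call to use instead of pattern.match().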

reobj = re.compile(r"^[\d]{0,2}[\s\-]+[\d]{0,4}.*?$", re.IGNORECASE | re.MULTILINE)
Options: case insensitive; ^ and $ match at line breaks
Assert position at the beginning of a line (at beginning of the string or after a line break character) «^»
Match a single digit 0..9 «[\d]{0,2}»
Between zero and 2 times, as many times as possible, giving back as needed (greedy) «{0,2}»
Match a single character that is a whitespace character or the character “-” «[\s\-]+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Match a single digit 0..9 «[\d]{0,4}»
Between zero and 4 times, as many times as possible, giving back as needed (greedy) «{0,4}»
Match any single character that is not a line break character «.*?»
Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Assert position at the end of a line (at the end of the string or before a line break character) «$»

Related

What is a regex expression that can prune down repeating identical characters down to a maximum of two repeats?

I feel I am having the most difficulty explaining this well enough for a search engine to pick up on what I'm looking for. The behavior is essentially this:
string = "aaaaaaaaare yooooooooou okkkkkk"
would become "aare yoou okk", so that the maximum number of repeats for any given character is two.
Matching the excess duplicates, and then re.sub -ing it seems to me the approach to take, but I can't figure out the regex statement I need.
The only attempt I feel is even worth posting is this - (\w)\1{3,0}
Which matched only the first instance of a character repeating more than three times - so only one match, and the whole block of repeated characters, not just the ones exceeding the max of 2. Any help is appreciated!
The regexp should be (\w)\1{2,} to match a character followed by at least 2 repetitions. That's 3 or more when you include the initial character.
The replacement is then \1\1 to replace with just two repetitions.
string = "aaaaaaaaare yooooooooou okkkkkk"
new_string = re.sub(r'(\w)\1{2,}', r'\1\1', string)
You could write
string = "aaaaaaaaare yooooooooou okkkkkk"
rgx = r'(\w)\1*(?=\1\1)'
re.sub(rgx, '', string)
#=> "aare yoou okk"
The regular expression can be broken down as follows.
(\w) # match one word character and save it to capture group 1
\1* # match the content of capture group 1 zero or more times
(?= # begin a positive lookahead
\1\1 # match the content of capture group 1 twice
) # end the positive lookahead
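To compare the two answers side by side, here is a small sketch that runs both substitutions on the string from the question. The variable names text, by_replacement and by_lookahead are made up for this illustration.

```python
import re

text = "aaaaaaaaare yooooooooou okkkkkk"

# Approach 1: match a run of three or more identical word characters
# and replace the whole run with two copies of that character.
by_replacement = re.sub(r"(\w)\1{2,}", r"\1\1", text)

# Approach 2: match only the excess characters, i.e. those that are still
# followed by two more copies of the same character, and delete them.
by_lookahead = re.sub(r"(\w)\1*(?=\1\1)", "", text)

print(by_replacement)
print(by_lookahead)
```

Both calls should give the same result. The first is probably easier to read, while the second avoids needing a replacement string.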

Extracting two strings from between two characters. Why doesn't my regex match and how can I improve it?

I'm learning about regular expressions and I want to extract a string from a text that has the following characteristics:
It always begins with the letter C, in either lowercase or
uppercase, which is then followed by a number of hexadecimal
characters (meaning it can contain the letters A to F and numbers
from 1 to 9, with no zeros included).
After those hexadecimal
characters comes a letter P, also either in lowercase or uppercase
And then some more hexadecimal characters (again, excluding 0).
Meaning I want to capture the strings that come in between the letters C and P as well as the string that comes after the letter P and concatenate them into a single string, while discarding the letters C and P
Examples of valid strings would be:
c45AFP2
CAPF
c56Bp26
CA6C22pAAA
For the above examples what I want would be to extract the following, in the same order:
45AF2 # Original string: c45AFP2
AF # Original string: CAPF
56B26 # Original string: c56Bp26
A6C22AAA # Original string: CA6C22pAAA
Examples of invalid strings would be:
BCA6C22pAAA # It doesn't begin with C
c56Bp # There aren't any characters after P
c45AF0P2 # Contains a zero
I'm using python and I want a regex to extract the two strings that come both in between the characters C and P as well as after P
So far I've come up with this:
(?<=\A[cC])[a-fA-F1-9]*(?<=[pP])[a-fA-F1-9]*
A breakdown would be:
(?<=\A[cC]) Positive lookbehind assertion. Asserts that what comes before the regex parser’s current position must match [cC] and that [cC] must be at the beginning of the string
[a-fA-F1-9]* Matches a single character in the list between zero and unlimited times
(?<=[pP]) Positive lookbehind assertion. Asserts that what comes before the regex parser’s current position must match [pP]
[a-fA-F1-9]* Matches a single character in the list between zero and unlimited times
But with the above regex I can't match any of the strings!
When I insert a | in between (?<=[cC])[a-fA-F1-9]* and (?<=[pP])[a-fA-F1-9]* it works.
Meaning the below regex works:
(?<=[cC])[a-fA-F1-9]*|(?<=[pP])[a-fA-F1-9]*
I know that | means that it should match at most one of the specified regex expressions. But it's non greedy and it returns the first match that it finds. The remaining expressions aren’t tested, right?
But using | means the string BCA6C22pAAA is a partial match to AAA since it comes after P, even though the first assertion isn't true, since it doesn't begin with a C.
That shouldn't be the case. I want it to only match if all conditions explained in the beginning are true.
Could someone explain to me why my first attempt doesn't produce the result I want? Also, how can I improve my regex?
I still need it to:
Not be a match if the string contains the number 0
Only be a match if ALL conditions are met
Thank you
To match both groups before and after P or p
(?<=^[Cc])[1-9a-fA-F]+(?=[Pp]([1-9a-fA-F]+$))
(?<=^[Cc]) - Positive Lookbehind. Must match a case insensitive C or c at the start of the line
[1-9a-fA-F]+ - Matches hexadecimal characters one or more times
(?=[Pp] - Positive Lookahead for case insensitive p or P
([1-9a-fA-F]+$) - Capture group for one or more hexadecimal characters following the p or P
Your main problem is you're using a look behind (?<=[pP]) for something ahead, which will never work: You need a look ahead (?=...).
Also, the final quantifier should be + not * because you require at least one trailing character after the p.
The final mistake is that you're not capturing anything, you're only matching, so put what you want to capture inside brackets, which also means you can remove all look arounds.
If you use the case insensitive flag, it makes the regex much smaller and easier to read.
A working regex that captures the 2 hex parts in groups 1 and 2 is:
(?i)^c([a-f1-9]*)p([a-f1-9]+)
Unless you specifically need \A, prefer ^: it is easier to read, and with the multiline flag it matches at the start of every line, which is what many situations and tools expect, whereas \A only ever matches at the very start of the input. I've used ^.
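A minimal sketch of the capture-group approach in Python follows. The helper name extract_hex is hypothetical, and using re.fullmatch is an assumption on my part: it requires the whole string to fit the pattern, so trailing invalid characters such as a 0 after the last hex part are rejected as well.

```python
import re

# c, optional hex part, p, mandatory hex part; 0 is deliberately excluded.
HEX_PARTS = re.compile(r"c([a-f1-9]*)p([a-f1-9]+)", re.IGNORECASE)

def extract_hex(s):
    """Return the two hex parts joined together, or None if s is invalid."""
    m = HEX_PARTS.fullmatch(s)
    if m is None:
        return None
    return m.group(1) + m.group(2)

valid = ["c45AFP2", "CAPF", "c56Bp26", "CA6C22pAAA"]
invalid = ["BCA6C22pAAA", "c56Bp", "c45AF0P2"]

for s in valid + invalid:
    print(s, "->", extract_hex(s))
```

Returning None for invalid input keeps "no match" distinct from an empty string, but that is a design choice and not something the question prescribes.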

What is the use of second limit in the quantifier {m,n} in the regular expression in python if it used in a non-greedy way?

The regular expression re.compile(r'\w{3,5}?') in Python matches non-overlapping runs of word characters (alphanumerics and underscore) in any string that contains at least three of them in a row. My question is: does the second limit have any use in this non-greedy quantifier {3,5}? Even if the five is replaced by any other number, the result seems to be the same, i.e. re.compile(r'\w{3,5}?')=re.compile(r'\w{3,6}?')=re.compile(r'\w{3,7}?')=re.compile(r'\w{3,}?')
Can someone give me an example where the second limit is of any use?
When a lazily quantified pattern appears at the end of the pattern, it matches the minimum number of chars it needs to match to return a value. A pattern like 123(\w*?) will always yield an empty Group 1, as *? matches zero or more chars, but as few as possible.
It means that \w{3,5}? regex will always match 3 word chars, and the second argument will be "ignored" as it is enough to match 3 occurrences of the word char.
If the lazy pattern is not at the end, the second argument is important.
See an example: (\w{3,5}?)-(\d+) captures a different number of chars in Group 1 depending on how many word chars there are in the string.
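The difference can be checked with a short experiment. This is only a sketch; the names bounded, unbounded and the test strings are invented for the illustration.

```python
import re

# At the end of the pattern the lazy quantifier stops at the minimum,
# so the upper bound makes no visible difference.
at_end_bounded = re.findall(r"\w{3,5}?", "abcdefgh")
at_end_unbounded = re.findall(r"\w{3,}?", "abcdefgh")
print(at_end_bounded, at_end_unbounded)

# When more pattern follows, the quantifier has to expand until "-" can match,
# and the upper bound limits how far it is allowed to expand.
bounded = re.compile(r"(\w{3,5}?)-(\d+)")
unbounded = re.compile(r"(\w{3,}?)-(\d+)")

for s in ["abc-1", "abcde-1", "abcdefg-1"]:
    print(s, bounded.search(s).group(1), unbounded.search(s).group(1))
```

For "abcdefg-1" the bounded version cannot stretch from the first character to the dash, so the match has to start later in the string, while the unbounded version captures all seven characters.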

regex for plain email and mailto links, but not http basic auth

I'm trying to build a regular expression to meet these conditions:
[DON'T MATCH]
dont:match#example.com
[MATCH]
mailto:match#example.com
match#example.com
<p>match#example.com</p>
I can match the last two, but the first example (DON'T MATCH) is also matched.
How do I make sure an email is only valid if it's plain or preceded by mailto:, but not just a :?
http://rubular.com/r/HvldBe4Ew9
Regex:
(?<=mailto:)?([a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)
You can use anchors ^ and $ for matching string start/end if the strings are passed as separate values:
(?<=>)(?:mailto:)?([a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+)(?=<)
Or, getting rid of capturing groups:
(?<=>)(?:mailto:)?[a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+(?=<)
Please note that you have an issue in [a-zA-Z0-9-.]: the hyphen symbol should not appear unescaped in the middle of the character class.
No need for a-zA-Z, just use A-Z and make the regex case insensitive with re.IGNORECASE.
Also make sure you use
^ Assert position at the beginning of a line
and
$ Assert position at the end of a line
Python Example:
import re
match = re.search(r"^(?:mailto:)?([A-Z0-9_.+-]+#[A-Z0-9-]+\.[A-Z0-9-.]+)$", email, re.IGNORECASE)
if match:
    result = match.group(1)
else:
    result = ""
Demo:
https://regex101.com/r/cI1eD6/1
Regex explanation:
^(mailto:)?([A-Z0-9_.+-]+#[A-Z0-9-]+\.[A-Z0-9-.]+)$
Options: Case insensitive
Assert position at the beginning of a line «^»
Match the regex below and capture its match into backreference number 1 «(mailto:)?»
Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
Match the character string “mailto:” literally «mailto:»
Match the regex below and capture its match into backreference number 2 «([A-Z0-9_.+-]+#[A-Z0-9-]+\.[A-Z0-9-.]+)»
Match a single character present in the list below «[A-Z0-9_.+-]+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
A character in the range between “A” and “Z” «A-Z»
A character in the range between “0” and “9” «0-9»
A single character from the list “_.+” «_.+»
The literal character “-” «-»
Match the character “#” literally «#»
Match a single character present in the list below «[A-Z0-9-]+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
A character in the range between “A” and “Z” «A-Z»
A character in the range between “0” and “9” «0-9»
The literal character “-” «-»
Match the character “.” literally «\.»
Match a single character present in the list below «[A-Z0-9-.]+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
A character in the range between “A” and “Z” «A-Z»
A character in the range between “0” and “9” «0-9»
A single character from the list “-.” «-.»
Assert position at the end of a line «$»
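To see how the anchored pattern behaves on the examples from the question, here is a small sketch. The helper extract_email is a hypothetical name, the addresses use # exactly as they appear in the question, and the hyphen is moved to the end of the last character class, which does not change what it matches.

```python
import re

EMAIL_RE = re.compile(
    r"^(?:mailto:)?([A-Z0-9_.+-]+#[A-Z0-9-]+\.[A-Z0-9.-]+)$",
    re.IGNORECASE,
)

def extract_email(value):
    """Return the address without an optional mailto: prefix, or "" if value does not qualify."""
    match = EMAIL_RE.search(value)
    return match.group(1) if match else ""

for value in [
    "mailto:match#example.com",
    "match#example.com",
    "dont:match#example.com",
    "<p>match#example.com</p>",
]:
    print(repr(value), "->", repr(extract_email(value)))
```

Because of the ^ and $ anchors, this only works when each candidate is passed as a separate value. The address wrapped in <p> tags is rejected here and would need the lookaround variant from the other answer.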

Match only the string that has strings after last underscore

I am trying to match strings with underscores. There are underscores throughout the string, but I want to match only the strings that have more text after the last underscore. Let me provide an example:
s = "hello_world"
s1 = "hello_world_foo"
s2 = "hello_world_foo_boo"
In my case I only want to capture s1 and s2.
I started with the following, but can't really figure out how to do the match so that it captures strings that have more text after hello_world's underscore.
rgx = re.compile(r'(?P<firstpart>\w+)[_]+(?P<secondpart>\w+)$', re.I | re.U)
Try this:
reobj = re.compile("^(?P<firstpart>[a-z]+)_(?P<secondpart>[a-z]+)_(?P<lastpart>.*?)$", re.IGNORECASE)
result = reobj.findall(subject)
Regex Explanation
^(?P<firstpart>[a-z]+)_(?P<secondpart>[a-z]+)_(?P<lastpart>.*?)$
Options: case insensitive
Assert position at the beginning of the string «^»
Match the regular expression below and capture its match into backreference with name “firstpart” «(?P<firstpart>[a-z]+)»
Match a single character in the range between “a” and “z” «[a-z]+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Match the character “_” literally «_»
Match the regular expression below and capture its match into backreference with name “secondpart” «(?P<secondpart>[a-z]+)»
Match a single character in the range between “a” and “z” «[a-z]+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Match the character “_” literally «_»
Match the regular expression below and capture its match into backreference with name “lastpart” «(?P<lastpart>.*?)»
Match any single character that is not a line break character «.*?»
Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Assert position at the end of the string (or before the line break at the end of the string, if any) «$»
If I understand what you are asking for (you want to match strings with more than one underscore and text following the last one):
rgx = re.compile(r'(?P<firstpart>\w+)[_]+(?P<secondpart>\w+)_[^_]+$', re.I | re.U)
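A quick check of this pattern against the three strings from the question might look like the sketch below. It assumes Python 3, where the ur prefix is not valid and a plain raw string is used instead; the names candidates and matched are only illustrative.

```python
import re

rgx = re.compile(
    r"(?P<firstpart>\w+)[_]+(?P<secondpart>\w+)_[^_]+$",
    re.I | re.U,
)

candidates = ["hello_world", "hello_world_foo", "hello_world_foo_boo"]

# Keep only the strings that have at least two underscores
# and some text after the last one.
matched = [s for s in candidates if rgx.search(s)]
print(matched)
```

Since \w also matches the underscore, the named groups may not split the string where you expect for longer inputs, so this is best treated as a yes/no test.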
