I am validating MAC addresses. Below is the code:
import re
input = """
abc
xyz
ff:ff:ff:ff:ff:ff
ff::ff::ff::ff::ff::ff
ff-ff-ff-ff-ff-ff
"""
regex = re.compile("([A-Fa-f0-9]{2}?-::){5}[A-Fa-f0-9]{2}")
mac_address = re.findall(regex, input)
print(mac_address)
Actual output:
[] # Empty list
Expected output:
["ff:ff:ff:ff:ff:ff", "ff::ff::ff::ff::ff::ff", "ff-ff-ff-ff-ff-ff"]
My explanation of the pattern:
1) A MAC address consists of hexadecimal characters, hence [A-Fa-f0-9]
2) Each group must contain two characters, hence {2}
3) The separator is an optional -:: (one colon (:), two colons (::), or a dash), hence ?-::
4) Items 1, 2, and 3 together must match 5 times, hence {5}
5) Finally, [A-Fa-f0-9]{2} must match the last two characters
Could anyone please tell me what I am doing wrong here?
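I've passed the re.M flag (for "multi-line") because you've provided a string with multiple lines, although it only changes how ^ and $ behave, so it makes no difference for an unanchored pattern like this one. The real problem is the separator part of your regex, which doesn't do what you describe. My similar regex (five occurrences of "two hex digits followed by an optional :, ::, or -", followed by two more hex digits) works:
>>> import re
>>> inp = """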
... abc
... xyz
... ff:ff:ff:ff:ff:ff
... ff::ff::ff::ff::ff::ff
... ff-ff-ff-ff-ff-ff
... """
>>> regex = re.compile(r'(?:[A-Fa-f0-9]{2}(?:\:|\:\:|-)?){5}[A-Fa-f0-9]{2}', flags=re.M)
>>> mac_address = re.findall(regex, inp)
>>> print(mac_address)
['ff:ff:ff:ff:ff:ff', 'ff::ff::ff::ff::ff::ff', 'ff-ff-ff-ff-ff-ff']
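As a quick sanity check (a sketch only; the shortened `text` sample and the variable names are my own), re.M changes where ^ and $ can match and nothing else, so for this unanchored pattern the result should be the same with or without the flag:

```python
import re

# Shortened sample input; the name and content are illustrative.
text = "abc\nff:ff:ff:ff:ff:ff\nff::ff::ff::ff::ff::ff\nff-ff-ff-ff-ff-ff\n"
pattern = r'(?:[A-Fa-f0-9]{2}(?:\:|\:\:|-)?){5}[A-Fa-f0-9]{2}'

with_flag = re.findall(pattern, text, flags=re.M)
without_flag = re.findall(pattern, text)

# The pattern has no ^ or $ anchors, so re.M has nothing to act on.
print(with_flag == without_flag)
```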
In your pattern you are matching, 5 times, 2 characters from [A-Fa-f0-9] followed by the literal text -::
In your example data, the separator can be ::, :, or -
To match that, you could use a non-capturing group that matches either :: or, via a character class, one of : or -, like (?:::|[:-])
Note that ?-:: is not the notation for making characters optional. In {2}?-:: the ? only makes the {2} quantifier lazy.
If you want the separator to be optional, you can make the group optional with (?:::|[:-])? (which would also match ffffffffffff, but that is not in the example data)
\b(?:[A-Fa-f0-9]{2}(?:::|[:-])){5}[A-Fa-f0-9]{2}\b
For example
import re
input = """
abc
xyz
ff:ff:ff:ff:ff:ff
ff::ff::ff::ff::ff::ff
ff-ff-ff-ff-ff-ff
"""
regex = re.compile(r"\b(?:[A-Fa-f0-9]{2}(?:::|[:-])){5}[A-Fa-f0-9]{2}\b")
mac_address = re.findall(regex, input)
print(mac_address)
Output
['ff:ff:ff:ff:ff:ff', 'ff::ff::ff::ff::ff::ff', 'ff-ff-ff-ff-ff-ff']
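To see why the pattern from the question found nothing, here is a small sketch. The `literal_sample` string is a made-up input, built only to show what that pattern matches when it is read literally:

```python
import re

# Pattern from the question: "{2}?" is a lazy "exactly two" and "-::" is literal text.
original = re.compile(r"([A-Fa-f0-9]{2}?-::){5}[A-Fa-f0-9]{2}")

# Hypothetical input that satisfies the literal reading of the pattern.
literal_sample = "ff-::ff-::ff-::ff-::ff-::ff"

matches_literal = original.fullmatch(literal_sample) is not None
matches_real_mac = original.search("ff:ff:ff:ff:ff:ff") is not None

print(matches_literal)   # True
print(matches_real_mac)  # False
```

In other words, the original pattern requires every pair of hex digits to be followed by the three characters -:: in that order, which none of the sample lines contain.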
I want to identify all variants of "e.g." (the short form of "for example") and replace each with a space.
The regex I tried is given below. It matches e.g but doesn't match the other variants. What am I doing wrong?
(?:^|\s)([e]\.[g](\.)?)(?=\s|$)
The data input is
e.g E.g E.G. e.g.
The regex should match all of these variants.
The regex can be tried at - https://regex101.com/r/oFQxYJ/5
Your regex is already working if you use re.IGNORECASE, for example:
import re
pat = r"(?:^|\s)([e]\.[g](\.)?)(?=\s|$)" # same pattern as the question, written as a raw string
data = "e.g E.g E.G. e.g."
regex = re.compile(pat, re.IGNORECASE) # note the IGNORECASE
print(regex.findall(data))
gives
[('e.g', ''), ('E.g', ''), ('E.G.', '.'), ('e.g.', '.')]
Or if you do not want to use re.IGNORECASE, then include the upper case variants in the character classes:
import re
pat = r"(?:^|\s)([eE]\.[gG](\.)?)(?=\s|$)" # note the [eE] and [gG] here
data = "e.g E.g E.G. e.g."
regex = re.compile(pat)
print(regex.findall(data))
(same output as above).
But by default [e] will be a case-sensitive match (and in this case the [ ... ] do not make any difference because it means match any of the characters inside the square brackets, but there is only one).
Then to replace with a space, use sub. This replaces every match in the line, so it covers the same matches that findall returns. For example:
import re
pat = r"(?:^|\s)([eE]\.[gG](\.)?)(?=\s|$)"
data2 = "test e.g test E.g test E.G. test e.g. test"
regex = re.compile(pat)
print(regex.sub(" ", data2)) # <== using sub
gives
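test  test  test  test  test
(two spaces between the words: each match includes the preceding space and is replaced by a single space, while the space that follows is left in place)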
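The tuples in the findall output earlier come from the two capturing groups in the pattern. As a sketch of one alternative (my own variation, not the pattern from the question), lookarounds with no capturing groups make findall return plain strings:

```python
import re

# No capturing groups: (?<!\S) and (?!\S) stand in for "start/whitespace before"
# and "whitespace/end after" without consuming any characters.
pat = re.compile(r"(?<!\S)e\.g\.?(?!\S)", re.IGNORECASE)

data = "e.g E.g E.G. e.g."
found = pat.findall(data)
print(found)  # ['e.g', 'E.g', 'E.G.', 'e.g.']
```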
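I do not know how to do the search and replace in one regex, but to search:
([eE]\.[gG]\.{0,1})
which is a bit imprecise (it also accepts variants such as e.G), or the spelled-out version, with the longer alternatives first so that the trailing dot is included:
((e\.g\.)|(E\.G\.)|(e\.g)|(E\.g))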
I need to extract the real issue number from my file names. There are 2 patterns:
If there is no leading number in the file name, then the first number we read is the issue number. For example:
asdasd 213.pdf ---> 213
abcd123efg456.pdf ---> 123
However, sometimes there is a leading number in the file name, which is just the index of the file, so I have to skip it first. For example:
123abcd 4567sdds.pdf ---> 4567, since 123 is ignored
890abcd 123efg456.pdf ---> 123, since 890 is ignored
I want to know whether it is possible to do this with a single regular expression. Currently, my solution involves 2 steps:
If there is a leading number, remove it.
Find the first number in the remaining string.
Or in Python code:
import re
reNumHeading = re.compile(r'^\d{1,}', re.IGNORECASE | re.VERBOSE) # to find leading number
reNum = re.compile(r'\d{1,}', re.IGNORECASE | re.VERBOSE) # to find number
lstTest = '''123abcd 4567sdds.pdf
asdasd 213.pdf
abcd 123efg456.pdf
890abcd 123efg456.pdf'''.split('\n')
for test in lstTest:
    if reNumHeading.match(test):
        # drop the leading index number
        span = reNumHeading.match(test).span()
        stripTest = test[span[1]:]
    else:
        stripTest = test
    result = reNum.findall(stripTest)
    if result:
        print(result[0])
thanks
You can use ? quantifier to define optional pattern
>>> import re
>>> s = '''asdasd 213.pdf
... abcd123efg456.pdf
... 123abcd 4567sdds.pdf
... 890abcd 123efg456.pdf'''
>>> for line in s.split('\n'):
...     print(re.search(r'(?:^\d+)?.*?(\d+)', line)[1])
...
213
123
4567
123
(?:^\d+)? here a non-capturing group and ? quantifier is used to optionally match digits at start of line
since + is greedy, all the starting digits will be matched
.*? match any number of characters minimally (because we need the first match of digits)
(\d+) the required digits to be captured
re.search returns a re.Match object from which you can get various details
[1] on the re.Match object will give you string captured by first capturing group
use .group(1) if you are on older version of Python that doesn't support [1] syntax
See also: Reference - What does this regex mean?
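The same pattern can be wrapped in a small helper that also copes with names containing no digits. This is only a sketch; `issue_number` and `ISSUE_RE` are hypothetical names of my own, not part of the answer:

```python
import re

# Pattern from the answer above, compiled once.
ISSUE_RE = re.compile(r'(?:^\d+)?.*?(\d+)')

def issue_number(filename):
    """Return the issue number as a string, or None if there are no digits."""
    m = ISSUE_RE.search(filename)
    return m.group(1) if m else None

for name in ['asdasd 213.pdf', '123abcd 4567sdds.pdf', 'nodigits.pdf']:
    print(name, '->', issue_number(name))
```

One thing to watch: if the only number in the name is the leading one (for example `123.pdf`), the regex backtracks and captures its last digit, so such names may need separate handling.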
Just match digits \d+ that follow a non-digit \D:
import re
lstTest = '''123abcd 4567sdds.pdf
asdasd 213.pdf
abcd 123efg456.pdf
890abcd 123efg456.pdf'''.split('\n')
for test in lstTest:
    res = re.search(r'\D(\d+)', test)
    print(res.group(1))
Output:
4567
213
123
123
I have a string. If the alphabetical part of a word is longer than 3 letters, I want to store it in a list. In the example below, I need to store "hour" and "lalal".
I wrote a regex pattern for alpha-digit and digit alpha sequences like below.
import re
regex = [r"([a-zA-Z])-([0-9])*", r"([0-9])*-([a-zA-Z])"]
tring = 'f-16 is 1-hour, lalal-54'
d = []
for r in regex:
    m = re.search(r, tring)
    d.append(m.group(0))
print(d)
But this obviously gives me all the alphanumeric patterns which are being stored too. So, I thought I could extend this to count the letters in each pattern and store it differently too. Is that possible?
Edit: Another example would be
tring = 'I will be there in 1-hour'
and the output for this should be ['hour']
So you want to capture alphabetic text only if it is either preceded or followed by a number and a hyphen. You can use this regex, which uses alternation to cover both cases:
([a-zA-Z]{4,})-\d+|\d+-([a-zA-Z]{4,})
Explanation:
([a-zA-Z]{4,}) - Captures alphabetic text of length four or more and stores it in group 1
-\d+ - Ensures it is followed by a hyphen and one or more digits
| - Alternation, as there are two cases
\d+- - Matches one or more digits and a hyphen
([a-zA-Z]{4,}) - Captures alphabetic text of length four or more and stores it in group 2
Check this Python code:
import re
s = 'f-16 is 1-hour, lalal-54 I will be there in 1-hours'
d = []
for m in re.finditer(r'([a-zA-Z]{4,})-\d+|\d+-([a-zA-Z]{4,})', s):
    # exactly one of the two groups is set for each match
    if m.group(1):
        d.append(m.group(1))
    elif m.group(2):
        d.append(m.group(2))
print(d)
s = 'f-16 is 1-hour, lalal-54'
arr = re.findall(r'[a-zA-Z]{4,}', s)  # simpler: any run of four or more letters
print(arr)
Prints,
['hour', 'lalal', 'hours']
['hour', 'lalal']
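If checking two groups per match feels clumsy, lookarounds are one possible alternative, because a pattern without capturing groups lets findall return the words directly. This is a sketch of my own variation, not the answer's regex, and it assumes the word only needs a single digit right next to the hyphen:

```python
import re

s = 'f-16 is 1-hour, lalal-54 I will be there in 1-hours'

# (?<=\d-) : the word is preceded by "digit-"
# (?=-\d)  : the word is followed by "-digit"
words = re.findall(r'(?<=\d-)[a-zA-Z]{4,}|[a-zA-Z]{4,}(?=-\d)', s)
print(words)  # ['hour', 'lalal', 'hours']
```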
I have tested the following code on http://regexpal.com/ and it correctly matches the string I want. I want to find 16 digit numbers which occur in blocks of 4 with a space in the middle, so I wrote the following regex:
\d{4}(\s\d{4}){3}
i.e. match 4 numbers, then match three repeating sets of a space followed by four numbers. On regexpal, this correctly matches:
test1234 message1234 5678 1234 5678
In Python, however, I run the following code:
>>> import re
>>> p = re.compile('\d{4}(\s\d{4}){3}')
>>> p.findall('test1234 message1234 5678 1234 5678')
[' 5678']
>>>
I don't understand why it is matching the second instance of '5678' and why it is not matching the block of numbers as I would expect.
A raw string is the recommended way to define a regex, but the main problem here is how the findall method works. You need to turn the capturing group in your regex into a non-capturing group, because when a pattern contains capturing groups, re.findall returns the captured text rather than the whole match. Your regex \d{4}(\s\d{4}){3} matches the 16-digit number, but the group captures only the last four digits plus the preceding space.
p = re.compile(r'\d{4}(?:\s\d{4}){3}')
Example:
>>> import re
>>> p = re.compile(r'\d{4}(\s\d{4}){3}')
>>> p.findall('test1234 message1234 5678 1234 5678')
[' 5678']
>>> p = re.compile(r'\d{4}(?:\s\d{4}){3}')
>>> p.findall('test1234 message1234 5678 1234 5678')
['1234 5678 1234 5678']
You should also either prefix your string with an r or escape your backslashes, although for \d and \s this alone does not change the findall result:
p = re.compile(r'\d{4}(\s\d{4}){3}')
or
p = re.compile('\\d{4}(\\s\\d{4}){3}')
I have the following identifiers:
id1 = '883316040119_FRIENDS_HD'
id2 = 'ZWEX01DE9463DB_DMD'
id3 = '35358fr1'
id4 = 'as3d99j_br001'
I need a regex to get me the following output:
id1 = '883316040119'
id2 = 'ZWEX01DE9463DB'
id3 = '35358'
id4 = 'as3d99j'
Here is what I have so far:
re.sub(r'_?([a-zA-Z]{2,4}?\d?(00\d)?)$','',vendor_id)
It doesn't work perfectly though, here is what it gives me:
BAD - 883316040119_FRIENDS
GOOD - ZWEX01DE9463DB
GOOD - 35358
GOOD - as3d99j
What would be the correct regular expression to get all of them? For the first one, I basically want to strip the ending if it is only underscores and letters, so 1928h9829_bundle_hd --> 1928h9829.
Please note that I have hundreds of thousands of identifiers here, and it is required that I use a regular expression. I'm not looking for a python split() way to do it, as it wouldn't work.
The way you present your input, I would suggest this simple regex:
^(?:[^_]+(?=_)|\d+)
This can be tweaked if you want to add details to the spec.
To show you a regex demo, just because of the way the site regex101 works, we have to add \n (it assumes we are working on the whole file, rather than one input at a time): DEMO
Explanation
The ^ anchor asserts that we are at the beginning of the string
The non-capture group (?: ... ) matches either
[^_]+(?=_) non-underscore characters (followed by an underscore, not matched)
| OR
\d+ digits
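Applied to the four identifiers from the question, the pattern can be tried like this (a sketch; the `ids` list and the `extracted` name are mine):

```python
import re

ids = ['883316040119_FRIENDS_HD', 'ZWEX01DE9463DB_DMD', '35358fr1', 'as3d99j_br001']
pattern = re.compile(r'^(?:[^_]+(?=_)|\d+)')

# Each identifier here matches, so .group() is safe for this sample data.
extracted = [pattern.match(i).group() for i in ids]
print(extracted)  # ['883316040119', 'ZWEX01DE9463DB', '35358', 'as3d99j']
```

An identifier that has no underscore and does not start with a digit would not match at all, so real data may need a check for None before calling .group().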
This works for the examples:
import re
ids = ['883316040119_FRIENDS_HD', 'ZWEX01DE9463DB_DMD', '35358fr1', 'as3d99j_br001']
for id in ids:
    print(id)
883316040119_FRIENDS_HD
ZWEX01DE9463DB_DMD
35358fr1
as3d99j_br001
for id in ids:
    hit = re.sub(r"(_[A-Za-z_]*|_?[A-Za-z]{2,4}?\d?(00\d)?)$", "", id)
    print(hit)
883316040119
ZWEX01DE9463DB
35358
as3d99j
When the tail contains only letters and underscores, the pattern is easygoing and strips off any number of underscores and letters; if the tail does not contain an underscore, or contains digits after the underscore, then it demands the pattern in the question: two to four letters, then an optional digit, then an optional zero-zero-digit.
You are checking for underscore only one possible time, as ? means {0,1}.
r'(_[a-zA-Z]{2,}\d?(00[0-9])?|[a-z]{2,}\d)+$'
The following reproduces your desired results from your input.
I would use the replace method with this regex:
_[^']+|(?!.*_)('[0-9]+)[^']+
and return capturing group 1
Perhaps:
result = re.sub("_[^']+|(?!.*_)('[0-9]+)[^']+", r"\1", subject)
The regex first looks for an underscore. If it finds one, it will match everything up to but not including the next single quote; and that will get removed.
If that doesn't match, the alternative will look for a string that does NOT have an underscore; match and return in capturing group 1 the sequence of digits; and then replace everything after the digits up to but not including the single quote.
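Because this regex keys on the single quotes, it expects the identifiers with their quotes, as they are written in the question. A sketch under that assumption (the `subjects` list is my own reconstruction of such input); it also relies on unmatched groups being substituted as empty strings, which is how re.sub has behaved since Python 3.5:

```python
import re

# Hypothetical input: the assignments as text, quotes included.
subjects = [
    "id1 = '883316040119_FRIENDS_HD'",
    "id2 = 'ZWEX01DE9463DB_DMD'",
    "id3 = '35358fr1'",
    "id4 = 'as3d99j_br001'",
]

pattern = re.compile(r"_[^']+|(?!.*_)('[0-9]+)[^']+")

# \1 is empty when the underscore branch matched, and the quoted digits otherwise.
results = [pattern.sub(r"\1", s) for s in subjects]
for line in results:
    print(line)
```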
This is not a subtraction approach; it just captures the matched string.
The regex is (^[0-9]+)|(^[a-zA-Z0-9]+(?=_)) (i.e. (^\d+)|(^[\d\w]+(?=_)))
import re
id1 = '883316040119_FRIENDS_HD'
id2 = 'ZWEX01DE9463DB_DMD'
id3 = '35358fr1'
id4 = 'as3d99j_br001'
ids = [id1, id2, id3, id4]
for i in ids:
    try:
        print(re.match(r"(^[0-9]+)|(^[a-zA-Z0-9]+(?=_))", i).group())
    except AttributeError:  # re.match returned None
        print("not matched")
output:
883316040119
ZWEX01DE9463DB
35358
as3d99j