Regex to find N characters between underscore and period - python

I have a filename having numerals like test_20200331_2020041612345678.csv.
So I just want to read only first 8 characters from the number between last underscore and .csv using a regex.
For e.g: From the file name test_20200331_2020041612345678.csv --> i want to read only 20200416 using regex.
Regex tried: (?<=_)(\d+)(?=\.)
But it is returning the full number between underscore and period i.e 2020041612345678
Also, when tried quantifier like (?<=_)(\d{8})(?=\.) its not matching with any string

The (?<=_)(\d{8})(?=\.) does not work because the (?=\.) positive lookahead requires the presence of a . char immediately to the right of the current location, i.e. right after the eigth digit, but there are more digits in between.
You may add \d* before \. to match any amount of digits after the required 8 digits, use
(?<=_)\d{8}(?=\d*\.)
Or, with a capturing group, you do not even need lookarounds (just make sure you access Group 1 when a match is obtained):
_(\d{8})\d*\.
See the regex demo
Python demo:
import re
s = "test_20200331_2020041612345678.csv"
m = re.search(r"(?<=_)\d{8}(?=\d*\.)", s)
# m = re.search(r"_(\d{8})\d*\.", s) # capturing group approach
if m:
print(m.group()) # => 20200416
# print(m.group(1)) # capturing group approach

Related

How to create regex to match a string that contains only hexadecimal numbers and arrows?

I am using a string that uses the following characters:
0-9
a-f
A-F
-
>
The mixture of the greater than and hyphen must be:
->
-->
Here is the regex that I have so far:
[0-9a-fA-F\-\>]+
I tried these others using exclusion with ^ but they didn't work:
[^g-zG-Z][0-9a-fA-F\-\>]+
^g-zG-Z[0-9a-fA-F\-\>]+
[0-9a-fA-F\-\>]^g-zG-Z+
[0-9a-fA-F\-\>]+^g-zG-Z
[0-9a-fA-F\-\>]+[^g-zG-Z]
Here are some samples:
"0912adbd->12d1829-->218990d"
"ab2c8d-->82a921->193acd7"
Firstly, you don't need to escape - and >
Here's the regex that worked for me:
^([0-9a-fA-F]*(->)*(-->)*)*$
Here's an alternative regex:
^([0-9a-fA-F]*(-+>)*)*$
What does the regex do?
^ matches the beginning of the string and $ matches the ending.
* matches 0 or more instances of the preceding token
Created a big () capturing group to match any token.
[0-9a-fA-F] matches any character that is in the range.
(->) and (-->) match only those given instances.
Putting it into a code:
import re
regex = "^([0-9a-fA-F]*(->)*(-->)*)*$"
re.match(re.compile(regex),"0912adbd->12d1829-->218990d")
re.match(re.compile(regex),"ab2c8d-->82a921->193acd7")
re.match(re.compile(regex),"this-failed->so-->bad")
You can also convert it into a boolean:
print(bool(re.match(re.compile(regex),"0912adbd->12d1829-->218990d")))
print(bool(re.match(re.compile(regex),"ab2c8d-->82a921->193acd7")))
print(bool(re.match(re.compile(regex),"this-failed->so-->bad")))
Output:
True
True
False
I recommend using regexr.com to check your regex.
If there must be an arrow present, and not at the start or end of the string using a case insensitive pattern:
^[a-f\d]+(?:-{1,2}>[a-f\d]+)+$
Explanation
^ Start of string
[a-f\d]+ Match 1+ chars a-f or digits
(?: Non capture group to repeat as a whole
-{1,2}>[a-f\d]+ Match - or -- and > followed by 1+ chars a-f or digits
)+ Close the non capture group and repeat 1+ times
$ End of string
See a regex demo and a Python demo.
import re
pattern = r"^[a-f\d]+(?:-{1,2}>[a-f\d]+)+$"
s = ("0912adbd->12d1829-->218990d\n"
"ab2c8d-->82a921->193acd7\n"
"test")
print(re.findall(pattern, s, re.I | re.M))
Output
[
'0912adbd->12d1829-->218990d',
'ab2c8d-->82a921->193acd7'
]
You can construct the regex by steps. If I understand your requirements, you want a sequence of hexadecimal numbers (like a01d or 11efeb23, separated by arrows with one or two hyphens (-> or -->).
The hex part's regex is [0-9a-fA-F]+ (assuming it cannot be empty).
The arrow's regex can be -{1,2}> or (->|-->).
The arrow is only needed before each hex number but the first, so you'll build the final regex in two parts: the first number, then the repetition of arrow and number.
So the general structure will be:
NUMBER(ARROW NUMBER)*
Which gives the following regex:
[0-9a-fA-F]+(-{1,2}>[0-9a-fA-F]+)*

Capturing repeated pattern in Python

I'm trying to implement some kind of markdown like behavior for a Python log formatter.
Let's take this string as example:
**This is a warning**: Virus manager __failed__
A few regexes later the string has lost the markdown like syntax and been turned into bash code:
\033[33m\033[1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0m\033[0m
But that should be compressed to
\033[33;1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0m
I tried these, beside many other non working solutions:
(\\033\[([\d]+)m){2,} => Capture: \033[33m\033[1m with g1 '\033[1m' and g2 '1' and \033[0m\033[0mwith g1 '\033[0m' and g2 '0'
(\\033\[([\d]+)m)+ many results, not ok
(?:(\\033\[([\d]+)m)+) many results, although this is the recommended way for repeated patterns if I understood correctly, not ok
and others..
My goal is to have as results:
Input
\033[33m\033[1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0m\033[0m
Output
Match 1
033[33m\033[1m
Group1: 33
Group2: 1
Match 2
033[0m\033[0m
Group1: 0
Group2: 0
In other words, capture the ones that are "duplicated" and not the ones alone, so I can fuse them with a regex sub.
You want to match consectuively repeating \033[\d+m chunks of text and join the numbers after [ with a semi-colon.
You may use
re.sub(r'(?:\\033\[\d+m){2,}', lambda m: r'\033['+";".join(set(re.findall(r"\[(\d+)", m.group())))+'m', text)
See the Python demo online
The (?:\\033\[\d+m){2,} pattern will match two or more sequences of \033[ + one or more digits + m chunks of texts and then, the match will be passed to the lambda expression, where the output will be: 1) \033[, 2) all the numbers after [ extracted with re.findall(r"\[(\d+)", m.group()) and deduplicated with the set, and then 3) m.
The patterns in the string to be modified have not been made clear from the question. For example, is 033 fixed or might it be 025 or even 25? I've made certain assumptions in using the regex
r" ^(\\0(\d+)\[\2)[a-z]\\0\2\[(\d[a-z].+)
to obtain two capture groups that are to be combined, separated by a semi-colon. I've attempted to make clear my assumptions below, in part to help the OP modify this regex to satisfy alternative requirements.
Demo
The regex performs the following operations:
^ # match beginning of line
( # begin cap grp 1
\\0 # match '\0'
(\d+) # match 1+ digits in cap grp 2
\[ # match '['
\2 # match contents of cap grp 2
) # end cap grp 1
[a-z] # match a lc letter
\\0 # match '\0'
\2 # match contents of cap grp 2
\[ # match '['
(\d[a-z].+) # match a digit, then lc letter then 1+ chars to the
# end of the line in cap grp 3
As you see, the portion of the string captured in group 1 is
\033[33
I've assumed that the part of this string that is now 033 must be two or more digits beginning with a zero, and the second appearance of a string of digits consists of the same digits after the zero. This is done by capturing the digits following '0' (33) in capture group 2 and then using a back-reference \2.
The next part of the string is to be replaced and therefore is not captured:
m\\033[
I've assumed that m must be one lower case letter (or should it be a literal m?), the backslash and zero and required and the following digits must again match the content of capture group 2.
The remainder of the string,
1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0m\033[0m
is captured in capture group 3. Here I've assumed it begins with one digit (perhaps it should be \d+) followed by one lower case letter that needn't be the same as the lower case letter matched earlier (though that could be enforced with another capture group). At that point I match the remainder of the line with .+, having given up matching patterns in that part of the string.
One may alternatively have just two capture groups, the capture group that is now #2, becoming #1, and #2 being the part of the string that is to be replaced with a semicolon.
This is pretty straightforward for the cases you desribe here; simply write out from left to right what you want to match and capture. Repeating capturing blocks won't help you here, because only the most recently captured values would be returned as a result.
\\033\[(\d+)m\\033\[(\d+)m

Regex - How do i find this specific slice of string inside a bigger whole string

following my previous question (How do i find multiple occurences of this specific string and split them into a list?), I'm now going to ask something more since the rule has been changed.
Here's the string, and the bold words are the ones that I want to extract.
text|p1_1_1120170AS074192161A0Z20|C M E -
Rectifier|#|text|p1_2_1120170AS074192161A0Z20|Huawei|#|text|p1_3_1120170AS074192161A0Z20|Rectifier
Module 3KW|#|text|p1_4_1120170AS074192161A0Z20|Shuangdeng
6-FMX-170|#|text|p1_5_1120170AS074192161A0Z20|24021665|#|text|p1_6_1120170AS074192161A0Z20|1120170AS074192161A0Z20|#|text|p1_7_1120170AS074192161A0Z20|OK|#|text|p1_8_1120170AS074192161A0Z20||#|text|p1_9_1120170AS074192161A0Z20|ACTIVE|#|text|p1_10_1120170AS074192161A0Z20|-OK|#|text|site_id|20MJK110|#|text|barcode_flag|auto|#|text|movement_flag||#|text|unit_of_measurement||#|text|flag_waste|no|#|text|req_qty_db|2|#|text|req_qty|2
Here's my current regex:
(?<=p1\_1\_.*)[^|]+(?=\|\#\|.*|$)
After trying it out in https://regexr.com/, I found the result instead :
text|p1_1_1120170AS074192161A0Z20|C M E -
Rectifier|#|text|p1_2_1120170AS074192161A0Z20|Huawei|#|text|p1_3_1120170AS074192161A0Z20|Rectifier
Module 3KW|#|text|p1_4_1120170AS074192161A0Z20|Shuangdeng
6-FMX-170|#|text|p1_5_1120170AS074192161A0Z20|24021665|#|text|p1_6_1120170AS074192161A0Z20|1120170AS074192161A0Z20|#|text|p1_7_1120170AS074192161A0Z20|OK|#|text|p1_8_1120170AS074192161A0Z20||#|text|p1_9_1120170AS074192161A0Z20|ACTIVE|#|text|p1_10_1120170AS074192161A0Z20|-OK|#|text|site_id|20MJK110|#|text|barcode_flag|auto|#|text|movement_flag||#|text|unit_of_measurement||#|text|flag_waste|no|#|text|req_qty_db|2|#|text|req_qty|2
The question remains: "Why don't just return the first matched occurrence ?".
Let's consider that if the value between the first "bar section" is empty, then it'll return the value of the next bar section.
Example :
text|p1_1_1120170AS074192161A0Z20||#|text|p1_2_1120170AS074192161A0Z20|Huawei|#|text . . .
And I don't want that. Let it be just return nothing instead (nothing match).
What's the correct regex to acquire such a match?
Thank you :).
This data looks more structured than you are giving it credit for. A regular expression is great for e.g. extracting email addresses from unstructured text, but this data seems delimited in a straightforward manner.
If there is structure it will be simpler, faster, and more reliable to just split on | and perhaps #:
text = 'text|p1_1_1120170AS074192161A0Z20|C M E - Rectifier|#|text|p1_2_1120170AS074192161A0Z20|Huawei|#|text|p1_3_1120170AS074192161A0Z20|Rectifier Module 3KW|#|text|p1_4_11201...'
lines = text.split('|#|')
words = [line.split('|')[-1] for line in lines]
doc='text|p1_1_1120170AS074192161A0Z20|C M E - Rectifier|#|text|p1_2_1120170AS074192161A0Z20|Huawei|#|text|...'
re.findall('[^|]+(?=\|\#\|)', doc)
In the re expression:
[^|]+finds chunks of text not containing the separator
(?=...) is a "lookahead assertion" (match the text but do not include in result)
About the pattern you tried
This part of the pattern [^|]+ states to match any char other than |
Then (?=\|\#\|.*|$) asserts using a positive lookahead what is on the right is |#|.* or the end of the string.
The positive lookbehind (?<=p1\_1\_.*) asserts what is on the left is p1_1_ followed by any char except a newline using a quantifier in the lookbehind.
As the pattern is not anchored, you will get all the matches for this logic because the p1_1_ assertion is true as it precedes all the|#| parts
Note that using the quantifier in the lookbehind will require the pypi regex module.
If you want the first match using a quantifier in the positive lookbehind you could for example use an anchor in combination with a negative lookahead to not cross the |#| or match || in case it is empty:
(?<=^.*?p1_1_(?:(?!\|#\|).|\|{2})*\|)[^|]+(?=\|\#\||$)
Python demo
You could use your original pattern using re.search getting the first match.
(?<=p1_1_.*)[^|]+(?=\|\#\||$)
Note that you don't have to escape the underscore in your original pattern and you can omit .* from the positive lookahead
Python demo
But to get the first match you don't have to use a positive lookbehind. You could also use an anchor, match and capturing group.
^.*?p1_1_(?:(?!\|#\|).|\|{2})*\|([^|]+)(?:\|#\||$)
^ Start of string
.*? Match any char except a newline
p1_1_ Match literally
(?: Non capturing group
(?!\|#\|).|\|{2} If what is on the right is not |#| match any char, or match 2 times ||
)* Close non capturing group and repeat 0+ times
\| Match |
( Capture group 1 (This will contain your value
[^|]+ Match 1+ times any char except |
) Close group
(?:\|#\||$) Match either |#|
Regex demo

Check in python if self designed pattern matches

I have a pattern which looks like:
abc*_def(##)
and i want to look if this matches for some strings.
E.x. it matches for:
abc1_def23
abc10_def99
but does not match for:
abc9_def9
So the * stands for a number which can have one or more digits.
The # stands for a number with one digit
I want the value in the parenthesis as result
What would be the easiest and simplest solution for this problem?
Replace the * and # through regex expression and then look if they match?
Like this:
pattern = pattern.replace('*', '[0-9]*')
pattern = pattern.replace('#', '[0-9]')
pattern = '^' + pattern + '$'
Or program it myself?
Based on your requirements, I would go for a regex for the simple reason it's already available and tested, so it's easiest as you were asking.
The only "complicated" thing in your requirements is avoiding after def the same digit you have after abc.
This can be done with a negative backreference. The regex you can use is:
\babc(\d+)_def((?!\1)\d{1,2})\b
\b captures word boundaries; if you enclose your regex between two \b
you will restrict your search to words, i.e. text delimited by space,
punctuations etc
abc captures the string abc
\d+ captures one or more digits; if there is an upper limit to the number of digits you want, it has to be \d{1,MAX} where MAX is your maximum number of digits; anyway \d stands for a digit and + indicates 1 or more repetitions
(\d+) is a group: the use of parenthesis defines \d+ as something you want to "remember" inside your regex; it's somehow similar to defining a variable; in this case, (\d+) is your first group since you defined no other groups before it (i.e. to its left)
_def captures the string _def
(?!\1) is the part where you say "I don't want to repeat the first group after _def. \1 represents the first group, while (?!whatever) is a check that results positive is what follows the current position is NOT (the negation is given by !) whatever you want to negate.
Live demo here.
I had the hardest time getting this to work. The trick was the $
#!python2
import re
yourlist = ['abc1_def23', 'abc10_def99', 'abc9_def9', 'abc955_def9', 'abc_def9', 'abc9_def9288', 'abc49_def9234']
for item in yourlist:
if re.search(r'abc[0-9]+_def[0-9][0-9]$', item):
print item, 'is a match'
You could match your pattern like:
abc\d+_def(\d{2})
abc Match literally
\d+ Match 1 or more digits
_ Match underscore
def - Match literally
( Capturing group (Your 2 digits will be in this group)
\d{2} Match 2 digits
) Close capturing group
Then you could for example use search to check for a match and use .group(1) to get the digits between parenthesis.
Demo Python
You could also add word boundaries:
\babc\d+_def(\d{2})\b

What does "?:" mean in a Python regular expression?

Below is the Python regular expression. What does the ?: mean in it? What does the expression do overall? How does it match a MAC address such as "00:07:32:12:ac:de:ef"?
re.compile(([\dA-Fa-f]{2}(?:[:-][\dA-Fa-f]{2}){5}), string)
It (?:...) means a set of non-capturing grouping parentheses.
Normally, when you write (...) in a regex, it 'captures' the matched material. When you use the non-capturing version, it doesn't capture.
You can get at the various parts matched by the regex using the methods in the re package after the regex matches against a particular string.
How does this regular expression match MAC address "00:07:32:12:ac:de:ef"?
That's a different question from what you initially asked. However, the regex part is:
([\dA-Fa-f]{2}(?:[:-][\dA-Fa-f]{2}){5})
The outer most pair of parentheses are capturing parentheses; what they surround will be available when you use the regex against a string successfully.
The [\dA-Fa-f]{2} part matches a digit (\d) or the hexadecimal digits A-Fa-f], in a pair {2}, followed by a non-capturing grouping where the matched material is a colon or dash (: or -), followed by another pair of hex digits, with the whole repeated exactly 5 times.
p = re.compile(([\dA-Fa-f]{2}(?:[:-][\dA-Fa-f]{2}){5}))
m = p.match("00:07:32:12:ac:de:ef")
if m:
m.group(1)
The last line should print the string "00:07:32:12:ac:de" because that is the first set of 6 pairs of hex digits (out of the seven pairs in total in the string). In fact, the outer grouping parentheses are redundant and if omitted, m.group(0) would work (it works even with them). If you need to match 7 pairs, then you change the 5 into a 6. If you need to reject them, then you'd put anchors into the regex:
p = re.compile(^([\dA-Fa-f]{2}(?:[:-][\dA-Fa-f]{2}){5})$)
The caret ^ matches the start of string; the dollar $ matches the end of string. With the 5, that would not match your sample string. With 6 in place of 5, it would match your string.
Using ?: as in (?:...) makes the group non-capturing during replace. During find it does'nt make any sense.
Your RegEx means
r"""
( # Match the regular expression below and capture its match into backreference number 1
[\dA-Fa-f] # Match a single character present in the list below
# A single digit 0..9
# A character in the range between “A” and “F”
# A character in the range between “a” and “f”
{2} # Exactly 2 times
(?: # Match the regular expression below
[:-] # Match a single character present in the list below
# The character “:”
# The character “-”
[\dA-Fa-f] # Match a single character present in the list below
# A single digit 0..9
# A character in the range between “A” and “F”
# A character in the range between “a” and “f”
{2} # Exactly 2 times
){5} # Exactly 5 times
)
"""
Hope this helps.
It does not change the search process. But it affects the retrieval of the group after the match has been found.
For example:
Text:
text = 'John Wick'
pattern to find:
regex = re.compile(r'John(?:\sWick)') # here we are looking for 'John' and also for a group (space + Wick). the ?: makes this group unretrievable.
When we print the match - nothing changes:
<re.Match object; span=(0, 9), match='John Wick'>
But if you try to manually address the group with (?:) syntax:
res = regex.finditer(text)
for i in res:
print(i)
print(i.group(1)) # here we are trying to retrieve (?:\sWick) group
it gives us an error:
IndexError: no such group
Also, look:
Python docs:
(?:...)
A non-capturing version of regular parentheses. Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.
the link to the re page in docs:
https://docs.python.org/3/library/re.html
(?:...) means a non cature group. The group will not be captured.

Categories