Referencing previous group possible within the same regex? - python

I am trying to perform a regex in Python. I want to match on a file path that does not have a domain extension and additionally, I only want to get those file paths that have 20 characters max after the last '\' of the file path. For example, given the data:
c:\users\docs\cmd.exe
c:\users\docs\files\ewyrkfdisadfasdfaffsfdasfsafsdf
c:\users\docs\files\target
I want to match on 'target', and not the other two lines. Note that in my current situation, calling the re module directly or using other Python operations is not an option: the regex is fed into a program (which uses re.match()), so I have to do this within a single regex string.
I have two regexes:
^([^.]+)$ will match the last 2 lines
([^\\]{,20}$) will match 'cmd.exe' and 'target'
How can I combine these two into one regex? I tried backreferencing (?P=, etc), but couldn't get it to work. Is this even possible?

How about \\([^\\.]{1,20})(?:$|\n)? It seems to work for me.
\\ is escaped literal backslash.
( start of capture group.
[^\\.] match anything except literal backslash or literal dot character
{1,20} match class 1-20 times, as many times as possible (greedy).
) end the capture group.
(?: starts a non-capturing group
$ match the end of the string.
| is the 'or' operator for this group
\n matches a line-feed or newline character (ASCII 10)
) end of non-capturing group
To create this, I used https://regex101.com/#python which is a very good resource in my opinion because it explains every part of the regex and neatly shows the captured groups in real time.
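A quick way to check the pattern against the sample paths is sketched below. Because the question says the host program calls re.match(), which only matches at the start of the string, the second pattern prepends .* so the match can begin at position 0. That variant and the helper name last_component are assumptions added for illustration, not part of the answer above.

```python
import re

paths = [
    r"c:\users\docs\cmd.exe",
    r"c:\users\docs\files\ewyrkfdisadfasdfaffsfdasfsafsdf",
    r"c:\users\docs\files\target",
]

# Pattern from the answer; it has no leading .*, so it needs re.search().
search_pattern = re.compile(r"\\([^\\.]{1,20})(?:$|\n)")

# Assumed variant for callers that can only use re.match().
match_pattern = re.compile(r".*\\([^\\.]{1,20})$")


def last_component(pattern, path, use_match=False):
    """Return capture group 1 if the pattern matches, else None."""
    m = pattern.match(path) if use_match else pattern.search(path)
    return m.group(1) if m else None


search_results = [last_component(search_pattern, p) for p in paths]
match_results = [last_component(match_pattern, p, use_match=True) for p in paths]
print(search_results)
print(match_results)
```

Only the third path yields a capture in both cases; the first is rejected because of the dot, the second because more than 20 characters follow the last backslash.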

>>> s = r"""c:\users\docs\cmd.exe
... c:\users\docs\files\ewyrkfdisadfasdfaffsfdasfsafsdf
... c:\users\docs\files\target""".split('\n')
>>> [re.match(r'.*\\([^.]{,20})$', x) for x in s]
[None, None, <_sre.SRE_Match object at 0x7f6ad9631558>]
also
>>> [re.findall(r'.*\\([^.]{,20})$', x) for x in s]
[[], [], ['target']]
This means:
.*\\ - grab everything up to and including the last \
([^.]{,20}) - make sure there are no . in the remaining up to 20 characters
$ - end of line
The () around the middle group indicate that it should be the group returned as the match

Related

Extracting two strings from between two characters. Why doesn't my regex match and how can I improve it?

I'm learning about regular expressions and I want to extract a string from a text that has the following characteristics:
It always begins with the letter C, in either lowercase or uppercase, which is then followed by a number of hexadecimal characters (meaning it can contain the letters A to F and numbers from 1 to 9, with no zeros included).
After those hexadecimal characters comes a letter P, also either in lowercase or uppercase.
And then some more hexadecimal characters (again, excluding 0).
Meaning I want to capture the strings that come in between the letters C and P as well as the string that comes after the letter P and concatenate them into a single string, while discarding the letters C and P
Examples of valid strings would be:
c45AFP2
CAPF
c56Bp26
CA6C22pAAA
For the above examples what I want would be to extract the following, in the same order:
45AF2 # Original string: c45AFP2
AF # Original string: CAPF
56B26 # Original string: c56Bp26
A6C22AAA # Original string: CA6C22pAAA
Examples of invalid strings would be:
BCA6C22pAAA # It doesn't begin with C
c56Bp # There aren't any characters after P
c45AF0P2 # Contains a zero
I'm using python and I want a regex to extract the two strings that come both in between the characters C and P as well as after P
So far I've come up with this:
(?<=\A[cC])[a-fA-F1-9]*(?<=[pP])[a-fA-F1-9]*
A breakdown would be:
(?<=\A[cC]) Positive lookbehind assertion. Asserts that what comes before the regex parser’s current position must match [cC] and that [cC] must be at the beginning of the string
[a-fA-F1-9]* Matches a single character in the list between zero and unlimited times
(?<=[pP]) Positive lookbehind assertion. Asserts that what comes before the regex parser’s current position must match [pP]
[a-fA-F1-9]* Matches a single character in the list between zero and unlimited times
But with the above regex I can't match any of the strings!
When I insert a | in between (?<=[cC])[a-fA-F1-9]* and (?<=[pP])[a-fA-F1-9]* it works.
Meaning the below regex works:
(?<=[cC])[a-fA-F1-9]*|(?<=[pP])[a-fA-F1-9]*
I know that | means that it should match at most one of the specified regex expressions. But it's non greedy and it returns the first match that it finds. The remaining expressions aren’t tested, right?
But using | means the string BCA6C22pAAA is a partial match to AAA since it comes after P, even though the first assertion isn't true, since it doesn't begin with a C.
That shouldn't be the case. I want it to only match if all conditions explained in the beginning are true.
Could someone explain to me why my first attempt doesn't produce the result I want? Also, how can I improve my regex?
I still need it to:
Not be a match if the string contains the number 0
Only be a match if ALL conditions are met
Thank you
To match both groups before and after P or p
(?<=^[Cc])[1-9a-fA-F]+(?=[Pp]([1-9a-fA-F]+$))
(?<=^[Cc]) - Positive Lookbehind. Must match a case insensitive C or c at the start of the line
[1-9a-fA-F]+ - Matches hexadecimal characters one or more times
(?=[Pp] - Positive Lookahead for case insensitive p or P
([1-9a-fA-F]+$) - Capture group for one or more hexadecimal characters following the pP
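As a minimal sketch of how this pattern could be used from Python: the overall match is the part between C and P, and capture group 1 (inside the lookahead) is the part after P, so joining the two gives the requested string. The helper name extract is hypothetical.

```python
import re

# The whole match is the text between C and P; group 1 is the text after P.
PATTERN = re.compile(r"(?<=^[Cc])[1-9a-fA-F]+(?=[Pp]([1-9a-fA-F]+$))")


def extract(text):
    """Return the two hex parts joined together, or None if text is invalid."""
    m = PATTERN.search(text)
    if m is None:
        return None
    return m.group(0) + m.group(1)


valid = ["c45AFP2", "CAPF", "c56Bp26", "CA6C22pAAA"]
invalid = ["BCA6C22pAAA", "c56Bp", "c45AF0P2"]

print([extract(s) for s in valid])
print([extract(s) for s in invalid])
```

re.search() is used here because the lookbehind means the match itself starts at index 1, not at index 0.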
Your main problem is you're using a look behind (?<=[pP]) for something ahead, which will never work: You need a look ahead (?=...).
Also, the final quantifier should be + not * because you require at least one trailing character after the p.
The final mistake is that you're not capturing anything, you're only matching, so put what you want to capture inside brackets, which also means you can remove all look arounds.
If you use the case insensitive flag, it makes the regex much smaller and easier to read.
A working regex that captures the 2 hex parts in groups 1 and 2 is:
(?i)^c([a-f1-9]*)p([a-f1-9]+)
Unless you need \A, prefer ^ over it: ^ is easier to read, and with the multiline flag it matches at the start of every line, which is what many situations and tools expect, whereas \A only ever matches at the very start of the input. I've used ^.
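Here is a small sketch of the capture-group approach applied to the sample strings from the question. Using fullmatch() so that trailing characters (such as a 0 after the last hex digit) are rejected is an assumption on my part, since the pattern as written has no end anchor. The helper name join_parts is hypothetical.

```python
import re

# Case-insensitive pattern from the answer: group 1 is between c and p,
# group 2 is after p.
HEX_PARTS = re.compile(r"(?i)^c([a-f1-9]*)p([a-f1-9]+)")


def join_parts(text):
    """Concatenate both captured hex parts, or return None when invalid."""
    m = HEX_PARTS.fullmatch(text)  # fullmatch: nothing may follow group 2
    return m.group(1) + m.group(2) if m else None


for s in ["c45AFP2", "CAPF", "c56Bp26", "CA6C22pAAA",
          "BCA6C22pAAA", "c56Bp", "c45AF0P2"]:
    print(s, "->", join_parts(s))
```

With match() instead of fullmatch(), a string such as c45AFP20 would still match its c45AFP2 prefix, which is why the stricter call is used in this sketch.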

Using Regex to move some letter of a string to a new location in the same string in a Series of strings in python

I have a list of 4000 strings. The naming convention needs to be changed for each string and I do not want to go through and edit each one individually.
The list looks like this:
data = list()
data = ['V2-FG2110-EMA-COMPRESSION',
'V2-FG2110-SA-COMPRESSION',
'V2-FG2110-UMA-COMPRESSION',
'V2-FG2120-EMA-DISTRIBUTION',
'V2-FG2120-SA-DISTRIBUTION',
'V2-FG2120-UMA-DISTRIBUTION',
'V2-FG2140-EMA-HEATING',
'V2-FG2140-SA-HEATING',
'V2-FG2140-UMA-HEATING',
'V2-FG2150-EMA-COOLING',
'V2-FG2150-SA-COOLING',
'V2-FG2150-UMA-COOLING',
'V2-FG2160-EMA-TEMPERATURE CONTROL']
I need each 'SA', 'UMA' and 'EMA' to be moved to before the -FG.
Desired output is:
V2-EMA-FG2110-Compression
V2-SA-FG2110-Compression
V2-UMA-FG2110-Compression
...
The V2-FG2 does not change throughout the list so I have started there and I tried re.sub and re.search but I am pretty new to python so I have gotten a mess of different results. Any help is appreciated.
You can rearrange the strings.
new_list = []
for word in data:
    arr = word.split('-')
    new_word = '%s-%s-%s-%s' % (arr[0], arr[2], arr[1], arr[3])
    new_list.append(new_word)
You can replace matches of the following regular expression with the contents of capture group 1:
(?<=^[A-Z]\d)(?=.*(-(?:EMA|SA|UMA))(?=-))|-(?:EMA|SA|UMA)(?=-)
The regular expression can be broken down as follows.
(?<=^[A-Z]\d)        # current string position must be preceded by a capital
                     # letter followed by a digit at the start of the string
(?=                  # begin a positive lookahead
  .*                 # match >= 0 chars other than a line terminator
  (-(?:EMA|SA|UMA))  # match a hyphen followed by one of the three strings
                     # and save to capture group 1
  (?=-)              # the next char must be a hyphen
)                    # end positive lookahead
|                    # or
-(?:EMA|SA|UMA)      # match a hyphen followed by one of the three strings
(?=-)                # the next character must be a hyphen
(?=-) is a positive lookahead.
Evidently this may not work for versions of Python prior to 3.5, because the match in the second part of the alternation does not assign a value to capture group 1: "Before Python 3.5, backreferences to failed capture groups in Python re.sub were not populated with an empty string." This quote is from @WiktorStribiżew's answer at the link. For what it's worth I confirmed that Ruby has the same behaviour ("V2-FG2110-EMA-COMPRESSION".gsub(rgx,'\1') #=> "V2-EMA-FG2110-COMPRESSION").
One could of course instead replace matches of (?<=^[A-Z]\d)(-[A-Z]{2}\d{4})(-(?:EMA|SA|UMA))(?=-) with $2 + $1 (written \2\1 in Python's re.sub). That's probably more sensible even if it's less interesting.
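The simpler two-group replacement can be sketched in Python as follows. The helper name move_tag is hypothetical, and the sketch only swaps the two segments; it does not change the letter case of the last segment, which the desired output in the question seems to do.

```python
import re

# Group 1 is the -FGnnnn segment, group 2 is the -EMA/-SA/-UMA segment.
PATTERN = re.compile(r"(?<=^[A-Z]\d)(-[A-Z]{2}\d{4})(-(?:EMA|SA|UMA))(?=-)")


def move_tag(name):
    """Swap the two captured segments; names that do not match are unchanged."""
    return PATTERN.sub(r"\2\1", name)


data = [
    "V2-FG2110-EMA-COMPRESSION",
    "V2-FG2120-SA-DISTRIBUTION",
    "V2-FG2140-UMA-HEATING",
    "V2-FG2160-EMA-TEMPERATURE CONTROL",
]

renamed = [move_tag(name) for name in data]
for name in renamed:
    print(name)
```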

How to group inside "or" matching in a regex?

I have two kinds of documents to parse:
1545994641 INFO: ...
and
'{"deliveryDate":"1545994641","error"..."}'
I want to extract the timestamp 1545994641 from each of them.
So, I decided to write a regex to match both cases:
(\d{10}\s|\"\d{10}\")
In the 1st kind of document, it matches the timestamp and groups it, using the first expression in the "or" above (\d{10}\s):
>>> regex = re.compile("(\d{10}\s|\"\d{10}\")")
>>> msg="1545994641 INFO: ..."
>>> regex.search(msg).group(0)
'1545994641 '
(So far so good.)
However, in the 2nd kind, using the second expression in the "or" (\"\d{10}\") it matches the timestamp and quotation marks, grouping them. But I just want the timestamp, not the "":
>>> regex = re.compile("(\d{10}\s|\"\d{10}\")")
>>> msg='{"deliveryDate":"1545994641","error"..."}'
>>> regex.search(msg).group(0)
'"1545994641"'
What I tried:
I decided to use a non-capturing group for the quotation marks:
(\d{10}\s|(?:\")\d{10}(?:\"))
but it doesn't work as the outer group catches them.
I also removed the outer group, but the result is the same.
Unwanted ways to solve:
I can work around this by creating a group for each expression in the or, but I just want it to output a single group (to abstract the code from the regex).
I could also use a 2nd step of regex to capture the timestamp from the group that has the quotation marks, but again that would break the code abstraction.
I could omit the "" in the regex, but that would also match a timestamp in the middle of the message, and I want to capture the timestamp only when it is the value of a key or at the beginning of the document, followed by a space.
Is there a way I can match both cases above but, in the case it matches the second case, return only the timestamp? Or is it impossible?
EDIT:
As noticed by @Amit Bhardwaj, the first case also returns a space after the timestamp. It's another problem (which I haven't figured out), probably with the same solution!
You may use lookarounds if your code can only access the whole match:
^\d{10}(?=\s)|(?<=")\d{10}(?=")
In Python, declare it as
rx = r'^\d{10}(?=\s)|(?<=")\d{10}(?=")'
Pattern details
^\d{10}(?=\s):
^ - string start
\d{10} - ten digits
(?=\s) - a positive lookahead that requires a whitespace char immediately to the right of the current location
| - or
(?<=")\d{10}(?="):
(?<=") - a positive lookbehind that requires a double quotation mark immediately to the left of the current location
\d{10} - ten digits
(?=") - a positive lookahead that requires a double quotation mark immediately to the right of the current location.
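A short sketch of the lookaround pattern applied to both document types from the question; since the quotes and the whitespace are only asserted, not consumed, the whole match (group 0) is just the timestamp. The helper name timestamp is hypothetical.

```python
import re

rx = re.compile(r'^\d{10}(?=\s)|(?<=")\d{10}(?=")')


def timestamp(msg):
    """Return the 10-digit timestamp, or None if neither alternative matches."""
    m = rx.search(msg)
    return m.group(0) if m else None


log_line = "1545994641 INFO: ..."
json_line = '{"deliveryDate":"1545994641","error"..."}'

print(timestamp(log_line))
print(timestamp(json_line))
print(timestamp("id 1545994641 in the middle"))
```

The third call returns None because the digits are neither at the start of the string nor wrapped in double quotes.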
You could use lookarounds, but I think this solution is simpler, if you can just get the group:
"?(\d{10})(?:\"|\s)
EDIT:
Considering that if there is an opening " there must also be a closing ", try this:
(^\d{10}\s|(?<=\")\d{10}(?=\"))
EDIT 2:
To also remove the trailing space in the end, use a lookahead too:
(^\d{10}(?=\s)|(?<=\")\d{10}(?=\"))

How to match numeric characters with no white space following

I need to match lines in text document where the line starts with numbers and the numbers are followed by nothing.... I want to include numbers that have '.' and ',' separating them.
Currently, I have:
p = re.compile('\$?\s?[0-9]+')
for i, line in enumerate(letter):
    m = p.match(line)
    if m is not None:
        print(m)
        print(line)
Which gives me this:
"15,704" and "416" -> this is good, I want this
but also this:
"$40 million...." -> I do not want to match this line or any line where the numbers are followed by words.
I've tried:
p = re.compile('\$?\s?[0-9]+[ \t\n\r\f\v]')
But it doesn't work. One reason is that it turns out there is no white space after the numbers I'm trying to match.
Appreciate any tips or tricks.
If you want to match the whole string with a regex, you have 2 choices:
Either call re.fullmatch(pattern, string) (note full in the function name). It tries to match just the whole string.
Or put the $ anchor at the end of your regex and call re.match(pattern, string). It tries to find a match from the start of the string.
Actually, you could also add ^ at the start of the regex (keeping $ at the end) and call re.search(pattern, string), but it would be a very strange combination.
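The three options can be compared side by side with a small sketch. The function names and the simplified number pattern are assumptions for illustration. One difference worth knowing: $ also matches just before a trailing newline, while fullmatch() does not tolerate one.

```python
import re

NUMBER = r"[\d,.]+"  # simplified pattern: digits, commas and dots only


def by_fullmatch(text):
    return re.fullmatch(NUMBER, text) is not None


def by_match_with_dollar(text):
    return re.match(NUMBER + r"$", text) is not None


def by_search_with_anchors(text):
    return re.search(r"^" + NUMBER + r"$", text) is not None


for sample in ["15,704", "40 million", "416\n"]:
    print(repr(sample),
          by_fullmatch(sample),
          by_match_with_dollar(sample),
          by_search_with_anchors(sample))
```

All three agree on the first two samples; only the last one, which ends in a newline, separates fullmatch() from the $-based variants.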
I also have a remark concerning how you specified your conditions, which may be incomplete: you gave e.g. the $40 million string and stated that the only reason to reject it is the space and letters after $40.
So actually you should have written that you want to match a string:
Possibly starting with $.
After the $ there can be a space (maybe, I'm not sure).
Then there can be a sequence of digits, dots or commas.
And nothing more.
And one more remark concerning Python literals: Apparently you have forgotten to prepend the pattern with r.
If you use r-string literal, you do not have to double backslashes inside.
So I think the most natural solution is to call a function devoted just to matching the whole string (i.e. fullmatch), without adding start / end anchors. The whole script can be:
import re
pat = re.compile(r'(?:\$\s?)?[\d,.]+')
lines = ["416", "15,704", "$40 million"]
for line in lines:
    if pat.fullmatch(line):
        print(line)
Details concerning the regex:
(?: - A non-capturing group.
\$ - Consisting of a $ char.
\s? - An optional space.
)? - End of the non-capturing group, with ? stating that the whole group is optional.
[\d,.]+ - A sequence of digits, commas and dots (note that between [ and ] the dot represents itself, so no backslash escaping is needed).
If you would like to reject strings like 2...5 or 3.,44 (no consecutive dots or commas allowed), change the last part of the above regex to:
[\d]+(?:[,.]?[\d]+)*
Details:
[\d]+ - A sequence of digits.
(?: - A non-capturing group.
[,.]? - An optional comma or dot (at most one).
[\d]+ - Another sequence of digits.
)* - End of the non-capturing group, which may occur zero or more times.
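To see the effect of the stricter tail, here is a sketch that combines it with the optional $ prefix from the earlier pattern and checks a few strings with fullmatch(). The sample strings beyond those mentioned above and the helper name is_number are my own assumptions.

```python
import re

# Optional "$" (plus optional space), then digit runs separated by at most
# one comma or dot at a time.
STRICT = re.compile(r"(?:\$\s?)?[\d]+(?:[,.]?[\d]+)*")


def is_number(text):
    """True when the whole string is a number in the stricter format."""
    return STRICT.fullmatch(text) is not None


accepted = ["416", "15,704", "1.234,56", "$40", "$ 40"]
rejected = ["2...5", "3.,44", "$40 million", "15,"]

print([is_number(s) for s in accepted])
print([is_number(s) for s in rejected])
```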
With a little modification to your code:
import re
letter = ["15,704", "$40 million"]
p = re.compile(r'^\d{1,3}([.,]\d{3})*$')  # Numbers separated by commas or points
for i, line in enumerate(letter):
    m = p.match(line)
    if m:
        print(line)
Output:
15,704
You could use the following regex:
import re
pattern = re.compile(r'^[0-9,.]+\s*$')
lines = ["416", "15,704", "$40 million...."]
for line in lines:
    if pattern.match(line):
        print(line)
Output
416
15,704
The pattern ^[0-9,.]+\s*$ matches any string made up only of digits, commas and dots, followed by zero or more whitespace characters. If you want to match only numbers with at most one , or . use the following pattern: '^\d+[,.]?\d+\s*$', code:
import re
pattern = re.compile(r'^\d+[,.]?\d+\s*$')
lines = ["416", "15,704", "$40 million...."]
for line in lines:
    if pattern.match(line):
        print(line)
Output
416
15,704
The pattern ^\d+[,.]?\d+\s*$ matches a string that consists of a group of digits (\d+), followed by an optional , or . ([,.]?), followed by another group of digits (\d+), with optional trailing whitespace (\s*).
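One property of this pattern that may be worth checking: because it contains two \d+ groups, it needs at least two digits, so a single-digit line is not matched. The sketch below shows this, together with a possible variant that makes the separator and the second digit group optional as a unit; that variant is my own assumption, not part of the answer.

```python
import re

original = re.compile(r'^\d+[,.]?\d+\s*$')

# Assumed variant: the separator plus the following digits are optional together.
variant = re.compile(r'^\d+(?:[,.]\d+)?\s*$')

samples = ["416", "15,704", "5", "1,2,3", "$40 million...."]

original_hits = [s for s in samples if original.match(s)]
variant_hits = [s for s in samples if variant.match(s)]

print(original_hits)
print(variant_hits)
```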

What does "?:" mean in a Python regular expression?

Below is the Python regular expression. What does the ?: mean in it? What does the expression do overall? How does it match a MAC address such as "00:07:32:12:ac:de:ef"?
re.compile(([\dA-Fa-f]{2}(?:[:-][\dA-Fa-f]{2}){5}), string)
(?:...) means a set of non-capturing grouping parentheses.
Normally, when you write (...) in a regex, it 'captures' the matched material. When you use the non-capturing version, it doesn't capture.
You can get at the various parts matched by the regex using the methods in the re package after the regex matches against a particular string.
How does this regular expression match MAC address "00:07:32:12:ac:de:ef"?
That's a different question from what you initially asked. However, the regex part is:
([\dA-Fa-f]{2}(?:[:-][\dA-Fa-f]{2}){5})
The outer most pair of parentheses are capturing parentheses; what they surround will be available when you use the regex against a string successfully.
The [\dA-Fa-f]{2} part matches a digit (\d) or one of the hexadecimal letters [A-Fa-f], in a pair {2}, followed by a non-capturing grouping where the matched material is a colon or dash (: or -), followed by another pair of hex digits, with the whole repeated exactly 5 times.
import re
p = re.compile(r'([\dA-Fa-f]{2}(?:[:-][\dA-Fa-f]{2}){5})')
m = p.match("00:07:32:12:ac:de:ef")
if m:
    print(m.group(1))
The last line prints the string "00:07:32:12:ac:de" because that is the first set of 6 pairs of hex digits (out of the seven pairs in total in the string). In fact, the outer grouping parentheses are redundant and if omitted, m.group(0) would work (it works even with them). If you need to match 7 pairs, then you change the 5 into a 6. If you need to reject strings that contain more than the expected number of pairs, then you'd put anchors into the regex:
p = re.compile(r'^([\dA-Fa-f]{2}(?:[:-][\dA-Fa-f]{2}){5})$')
The caret ^ matches the start of string; the dollar $ matches the end of string. With the 5, that would not match your sample string. With 6 in place of 5, it would match your string.
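The behaviour described above can be checked with a short sketch; the variable names are mine, and the sample string is the seven-pair one from the question.

```python
import re

sample = "00:07:32:12:ac:de:ef"  # seven pairs of hex digits

unanchored_5 = re.compile(r'([\dA-Fa-f]{2}(?:[:-][\dA-Fa-f]{2}){5})')
anchored_5 = re.compile(r'^([\dA-Fa-f]{2}(?:[:-][\dA-Fa-f]{2}){5})$')
anchored_6 = re.compile(r'^([\dA-Fa-f]{2}(?:[:-][\dA-Fa-f]{2}){6})$')

m = unanchored_5.match(sample)
first_six_pairs = m.group(1)    # only the outer (...) creates a group
same_as_group_0 = m.group(0)    # the whole match is identical here

print(first_six_pairs)
print(anchored_5.match(sample))           # no match: the string has 7 pairs
print(anchored_6.match(sample).group(1))  # the full 7-pair string
```

Note that unanchored_5.groups is 1 even though the pattern contains two sets of parentheses, because the inner (?:...) does not count as a group.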
Using ?: as in (?:...) makes the group non-capturing, which matters when you refer to groups afterwards, for example in a replacement. For a plain find it makes no difference to what is matched.
Your RegEx means
r"""
(                  # Match the regular expression below and capture its match into backreference number 1
   [\dA-Fa-f]      # Match a single character present in the list below
                   #    A single digit 0..9
                   #    A character in the range between “A” and “F”
                   #    A character in the range between “a” and “f”
      {2}          #    Exactly 2 times
   (?:             # Match the regular expression below
      [:-]         # Match a single character present in the list below
                   #    The character “:”
                   #    The character “-”
      [\dA-Fa-f]   # Match a single character present in the list below
                   #    A single digit 0..9
                   #    A character in the range between “A” and “F”
                   #    A character in the range between “a” and “f”
         {2}       #    Exactly 2 times
   ){5}            # Exactly 5 times
)
"""
Hope this helps.
It does not change the search process. But it affects the retrieval of the group after the match has been found.
For example:
Text:
text = 'John Wick'
pattern to find:
regex = re.compile(r'John(?:\sWick)') # here we are looking for 'John' and also for a group (space + Wick). the ?: makes this group unretrievable.
When we print the match - nothing changes:
<re.Match object; span=(0, 9), match='John Wick'>
But if you try to manually address the group with (?:) syntax:
res = regex.finditer(text)
for i in res:
    print(i)
    print(i.group(1)) # here we are trying to retrieve (?:\sWick) group
it gives us an error:
IndexError: no such group
Also, look:
Python docs:
(?:...)
A non-capturing version of regular parentheses. Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.
the link to the re page in docs:
https://docs.python.org/3/library/re.html
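For contrast, here is a sketch that puts a capturing and a non-capturing version of the same pattern side by side. The variable names are mine; the groups attribute of a compiled pattern reports how many capturing groups it contains.

```python
import re

text = 'John Wick'
capturing = re.compile(r'John(\sWick)')        # one capturing group
non_capturing = re.compile(r'John(?:\sWick)')  # no capturing groups

print(capturing.groups, non_capturing.groups)

captured = capturing.search(text).group(1)   # the text matched by (\sWick)
whole = non_capturing.search(text).group(0)  # the whole match is unaffected

try:
    non_capturing.search(text).group(1)
    group_one_available = True
except IndexError:
    # (?:...) did not create group 1, so asking for it raises IndexError
    group_one_available = False

print(repr(captured), repr(whole), group_one_available)
```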
(?:...) means a non-capturing group. The group will not be captured.
