Match all identifiers in a string

Problem:
I am looking for a way to match certain identifiers in a given line
that starts with certain words. The ID consists of
characters, possibly followed by digits, followed by a dash then some
more digits. An ID should only be matched on lines where the
starting word is one of the following: Closes, Fixes, Resolves. If a
line contains more than one ID, the IDs are separated by
the string " and ". Any number of IDs can be present on a
line.
Example Test String:
Closes PD-1 # Match: PD-1
Related to PD-2 # No match, line doesn't start with an allowed word
Closes
NPD-1 # No match, as the identifier is in a new line
Fixes PD-21 and PD-22 # Match: PD-21, PD-22
Closes PD-31, also PD-32 and PD-33 # Match: PD-31 - the rest is not captured because of ", also"
Resolves PD4-41 and PD4-42 and PD4-43 and PD4-44 # Match: PD4-41, PD4-42, PD4-43, PD4-44
Resolves something related to N-2 # No match, the identifier is not directly after 'Resolves'
What I tried:
Using regular expressions to get all the matches, I always come up short in some regard. For example, one of the regexes I tried is this:
^(?:Closes|Fixes|Resolves) (\w+-\d+)(?:(?: and )(\w+-\d+))*
I intended to have a non-capturing group where the line needs to
start with one of the allowed words, followed by a single space:
^(?:Closes|Fixes|Resolves)
Then at least one ID needs to follow the starting word,
which I intend to capture: (\w+-\d+)
Finally, zero or more IDs can follow the first one, separated by
the string " and ", but I only want to capture the
IDs here, not the separator: (?:(?: and )(\w+-\d+))*
Result of this regexp in python:
import re

test_string = """
Closes PD-1 # Match: PD-1
Related to PD-2 # No match, line doesn't start with an allowed word
Closes
NPD-1 # No match, as the identifier is in a new line
Fixes PD-21 and PD-22 # Match: PD-21, PD-22
Closes PD-31, also PD-32 and PD-33 # Match: PD-31 - the rest is not captured because of ", also"
Resolves PD4-41 and PD4-42 and PD4-43 and PD4-44 # Match: PD4-41, PD4-42, PD4-43, PD4-44
Resolves something related to N-2 # No match, the identifier is not directly after 'Resolves'
"""
ids = []
for match in re.findall(r"^(?:Closes|Fixes|Resolves) (\w+-\d+)(?:(?: and )(\w+-\d+))*", test_string, re.M):
    for group in match:
        if group:
            ids.append(group)
print(ids)
['PD-1', 'PD-21', 'PD-22', 'PD-31', 'PD4-41', 'PD4-44']
Also, here is the result on regex101.com. If more than one ID follows the initial one, unfortunately it only captures the last match, not all of them. I read that a repeated capturing group will only capture the last iteration, and I should put a capturing group around the repeated group to capture all iterations, but I couldn't make it work.
Summary:
Is there a solution for this with regular expressions, something similar to what I tried but which captures all the occurrences of the IDs? Or is there a better way to parse this string for the IDs, using Python?

You could use a single capturing group that matches the first ID and then repeats the same pattern zero or more times, each repetition preceded by " and " (the word and with a space on either side).
The values are in group 1.
To get the separate IDs, split group 1 on " and ":
^(?:Closes|Fixes|Resolves) (\w+-\d+(?: and \w+-\d+)*)
Regex demo
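As a rough sketch of how this could look in Python (the helper name extract_ids and the shortened sample text are illustrative, not part of the question): findall returns group 1 for every matching line, and each of those strings is then split on " and ".

```python
import re

# One capturing group holds the whole "ID and ID and ..." run.
pattern = re.compile(
    r"^(?:Closes|Fixes|Resolves) (\w+-\d+(?: and \w+-\d+)*)", re.M
)

def extract_ids(text):
    """Collect all IDs from lines that start with an allowed word."""
    ids = []
    for group in pattern.findall(text):
        # group looks like "PD-21 and PD-22"; split it into single IDs
        ids.extend(group.split(" and "))
    return ids

sample = """
Closes PD-1
Related to PD-2
Closes
NPD-1
Fixes PD-21 and PD-22
Closes PD-31, also PD-32 and PD-33
Resolves PD4-41 and PD4-42 and PD4-43 and PD4-44
Resolves something related to N-2
"""

print(extract_ids(sample))
# ['PD-1', 'PD-21', 'PD-22', 'PD-31', 'PD4-41', 'PD4-42', 'PD4-43', 'PD4-44']
```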

It might be easier with the two-stage approach, such as:
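import re

def get_matches(test):  # assume test is a list of strings, one per line
    # Stage 1: the line must start with an allowed word followed by an ID
    regex1 = re.compile(r'^(?:Closes|Fixes|Resolves) \w+-\d+')
    # Stage 2: collect every ID on a qualifying line
    regex2 = re.compile(r'\w+-\d+')
    results = []
    for line in test:
        if regex1.search(line):
            results.extend(regex2.findall(line))
    return results
Applied to the example lines without their trailing # comments, this gives the following; note that PD-32 and PD-33 are included, because the second stage collects every ID on a qualifying line: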
['PD-1','PD-21','PD-22','PD-31','PD-32',
'PD-33','PD4-41','PD4-42','PD4-43','PD4-44']

If you need to work with repeated capturing groups, you can install the PyPI regex module with pip install regex and use
import regex
test_string = "your string here"
ids = []
for match in regex.finditer(r"^(?:Closes|Fixes|Resolves) (?P<id>\w+-\d+)(?:(?: and )(?P<id>\w+-\d+))*", test_string, regex.M):
    ids.extend(match.captures("id"))
print(ids)
# => ['PD-1', 'PD-21', 'PD-22', 'PD-31', 'PD4-41', 'PD4-42', 'PD4-43', 'PD4-44']
See the Python demo
The capture stack for each group is accessible via match.captures(X).
The regex you have is fine to use as is, but it is more user-friendly with a named capturing group here.

Related

Using Regex to move some letter of a string to a new location in the same string in a Series of strings in python

I have a list of 4000 strings. The naming convention needs to be changed for each string and I do not want to go through and edit each one individually.
The list looks like this:
data = list()
data = ['V2-FG2110-EMA-COMPRESSION',
'V2-FG2110-SA-COMPRESSION',
'V2-FG2110-UMA-COMPRESSION',
'V2-FG2120-EMA-DISTRIBUTION',
'V2-FG2120-SA-DISTRIBUTION',
'V2-FG2120-UMA-DISTRIBUTION',
'V2-FG2140-EMA-HEATING',
'V2-FG2140-SA-HEATING',
'V2-FG2140-UMA-HEATING',
'V2-FG2150-EMA-COOLING',
'V2-FG2150-SA-COOLING',
'V2-FG2150-UMA-COOLING',
'V2-FG2160-EMA-TEMPERATURE CONTROL']
I need each 'SA', 'UMA' and 'EMA' to be moved to before the -FG.
Desired output is:
V2-EMA-FG2110-Compression
V2-SA-FG2110-Compression
V2-UMA-FG2110-Compression
...
The V2-FG2 does not change throughout the list so I have started there and I tried re.sub and re.search but I am pretty new to python so I have gotten a mess of different results. Any help is appreciated.

You can rearrange the strings.
new_list = []
for word in data:
    arr = word.split('-')
    new_word = '%s-%s-%s-%s' % (arr[0], arr[2], arr[1], arr[3])
    new_list.append(new_word)

You can replace matches of the following regular expression with the contents of capture group 1:
(?<=^[A-Z]\d)(?=.*(-(?:EMA|SA|UMA))(?=-))|-(?:EMA|SA|UMA)(?=-)
Demo
The regular expression can be broken down as follows.
(?<=^[A-Z]\d)          # current string position must be preceded by a capital
                       # letter followed by a digit at the start of the string
(?=                    # begin a positive lookahead
  .*                   # match >= 0 chars other than a line terminator
  (-(?:EMA|SA|UMA))    # match a hyphen followed by one of the three strings
                       # and save to capture group 1
  (?=-)                # the next char must be a hyphen
)                      # end positive lookahead
|                      # or
-(?:EMA|SA|UMA)        # match a hyphen followed by one of the three strings
(?=-)                  # the next character must be a hyphen
(?=-) is a positive lookahead.
Evidently this may not work for versions of Python prior to 3.5, because the match in the second part of the alternation does not assign a value to capture group 1: "Before Python 3.5, backreferences to failed capture groups in Python re.sub were not populated with an empty string." This quote is from
@WiktorStribiżew's answer at the link. For what it's worth, I confirmed that Ruby has the same behaviour ("V2-FG2110-EMA-COMPRESSION".gsub(rgx,'\1') #=> "V2-EMA-FG2110-COMPRESSION").
One could of course instead replace matches of (?<=^[A-Z]\d)(-[A-Z]{2}\d{4})(-(?:EMA|SA|UMA))(?=-) with $2 + $1 (\2\1 in Python's re.sub). That's probably more sensible even if it's less interesting.
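A minimal Python sketch of that swap-the-groups idea, assuming the names always follow the V2-FGnnnn-XXX-... layout from the question (the helper name move_code is made up for illustration):

```python
import re

# The swap-the-groups pattern, written with balanced parentheses.
# Group 1 is the "-FG2110" part, group 2 is the "-EMA" / "-SA" / "-UMA" part.
pattern = re.compile(r'(?<=^[A-Z]\d)(-[A-Z]{2}\d{4})(-(?:EMA|SA|UMA))(?=-)')

def move_code(name):
    """Return name with the EMA/SA/UMA part moved in front of the -FG part."""
    return pattern.sub(r'\2\1', name)

data = ['V2-FG2110-EMA-COMPRESSION',
        'V2-FG2120-SA-DISTRIBUTION',
        'V2-FG2160-EMA-TEMPERATURE CONTROL']

print([move_code(name) for name in data])
# ['V2-EMA-FG2110-COMPRESSION', 'V2-SA-FG2120-DISTRIBUTION',
#  'V2-EMA-FG2160-TEMPERATURE CONTROL']
```

Strings that do not fit the layout are returned unchanged, since sub only rewrites what the pattern matches.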

Regex extract group inside optional group

I have strings of the form "identifier STEP=10" where the "STEP=10" part is optional. The goal is to detect lines both with and without the STEP part and to extract the numerical value of STEP in cases where it is part of the line. Now matching both cases is easy enough,
import re
pattern = ".*(STEP=[0-9]+)?"
re.match(pattern, "identifier STEP=10")
re.match(pattern, "identifier")
This detects both cases without problem. But I fail to extract the numerical value in one go. I tried the following,
import re
pattern = ".*(STEP=([0-9]+))?"
group0 = re.search(pattern, "identifier STEP=10").groups()
group1 = re.search(pattern, "identifier").groups()
And while it still does detect the lines, I only get
group0 = (None, None)
group1 = (None, None)
While I hoped to get something like
group0 = (None, "10")
group1 = (None, None)
Is regex not suited to do this in one go, or am I simply using it wrong? I am curious whether there is a single regex call that returns what I want without doing a second pass after I have matched the line.

A possible solution will look like
import re
pattern = "^.*?(?:STEP=([0-9]+))?$"
group0 = re.search(pattern, "identifier STEP=10").groups()
group1 = re.search(pattern, "identifier").groups()
print(*group0)
print(*group1)
See the Python demo.
The ^.*?(?:STEP=([0-9]+))?$ regex matches
^ - start of string
.*? - zero or more chars other than line break chars as few as possible (i.e. the regex engine skips this pattern first and tries the subsequent patterns, and only comes back to use this when the subsequent patterns fail to match)
(?:STEP=([0-9]+))? - an optional non-capturing group: STEP= and then Group 1 capturing one or more ASCII digits
$ - end of string.
The .*(STEP=[0-9]+)? regex matches like this:
.* - grabs the whole line, from start to end
(STEP=[0-9]+)? - the group is quantified with ? (meaning one or zero occurrences of the quantified pattern), so the regex engine, with its index now at the end of the line, finds a match without it: the optional group matches zero times, the match is returned, and Group 1 is left without a value (None in Python).
To be able to resolve such issues you must understand backtracking in regex (for example, see this YT video of mine to learn more about it).
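To make the difference concrete, here is a small sketch that runs the greedy pattern from the question next to the lazy, anchored one. The helper step_value is a hypothetical convenience wrapper, not something from the question.

```python
import re

greedy = re.compile(r".*(STEP=([0-9]+))?")        # pattern from the question
lazy = re.compile(r"^.*?(?:STEP=([0-9]+))?$")     # lazy, anchored variant

def step_value(line):
    """Return STEP as an int, or None when the line has no STEP part."""
    m = lazy.search(line)
    return int(m.group(1)) if m and m.group(1) else None

for line in ("identifier STEP=10", "identifier"):
    # .* swallows the whole line, so the optional groups stay unset
    print(repr(line), greedy.search(line).groups(), step_value(line))
# 'identifier STEP=10' (None, None) 10
# 'identifier' (None, None) None
```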

Why does a joined list have to be inside a non-capturing group for the regex to be able to interpret it?

So I'm trying to build a regex that searches for an occurrence of digits followed by a whitespace character followed by one of many possible keywords (represented by test_cases in this case).
The first regex below does that successfully however I'm confused as to why it works. My understanding of capturing groups is that they allow you to put quantifiers on the group and also assist in specifying what data is returned. Why does this example need to be in the non-capturing group for it to be processed correctly?
import re

test_string = "251 to 300 Vitality"
test_cases = ["Damage", "Pods", "Chance", "Vitality"]
print(re.findall(r'\d+\s(?:{})$'.format('|'.join(test_cases)), test_string)) # works
print(re.findall(r'\d+\s({})$'.format('|'.join(test_cases)), test_string)) # doesn't work
print(re.findall(r'\d+\s{}$'.format('|'.join(test_cases)), test_string)) # doesn't work
Output:
['300 Vitality']
['Vitality']
['Vitality']
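One way to see what is going on is to print the pattern that format() actually builds. The sketch below relies on two documented behaviours of Python's re module: alternation (|) has the lowest precedence, so without a group it splits the entire pattern, and findall returns the group's text rather than the whole match when the pattern contains a capturing group. The variable names are illustrative.

```python
import re

test_string = "251 to 300 Vitality"
keywords = "|".join(["Damage", "Pods", "Chance", "Vitality"])

# No group: the alternatives are "\d+\sDamage", "Pods", "Chance", "Vitality$",
# so only the first keyword is tied to the digits.
ungrouped = r"\d+\s{}$".format(keywords)
print(ungrouped)                            # \d+\sDamage|Pods|Chance|Vitality$
print(re.findall(ungrouped, test_string))   # ['Vitality']

# Capturing group: the match is "300 Vitality", but findall reports group 1.
captured = re.findall(r"\d+\s({})$".format(keywords), test_string)
print(captured)                             # ['Vitality']

# Non-capturing group: same match, and findall reports the whole match.
whole = re.findall(r"\d+\s(?:{})$".format(keywords), test_string)
print(whole)                                # ['300 Vitality']
```

So the capturing version does match correctly; it is only the return value of findall that differs.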

How to group inside "or" matching in a regex?

I have two kinds of documents to parse:
1545994641 INFO: ...
and
'{"deliveryDate":"1545994641","error"..."}'
I want to extract the timestamp 1545994641 from each of them.
So, I decided to write a regex to match both cases:
(\d{10}\s|\"\d{10}\")
In the 1st kind of document, it matches the timestamp and groups it, using the first expression in the "or" above (\d{10}\s):
>>> regex = re.compile("(\d{10}\s|\"\d{10}\")")
>>> msg="1545994641 INFO: ..."
>>> regex.search(msg).group(0)
'1545994641 '
(So far so good.)
However, in the 2nd kind, using the second expression in the "or" (\"\d{10}\") it matches the timestamp and quotation marks, grouping them. But I just want the timestamp, not the "":
>>> regex = re.compile("(\d{10}\s|\"\d{10}\")")
>>> msg='{"deliveryDate":"1545994641","error"..."}'
>>> regex.search(msg).group(0)
'"1545994641"'
What I tried:
I decided to use a non-capturing group for the quotation marks:
(\d{10}\s|(?:\")\d{10}(?:\"))
but it doesn't work as the outer group catches them.
I also removed the outer group, but the result is the same.
Unwanted ways to solve:
I can work around this by creating a group for each expression in the or,
but I just want it to output a single group (to abstract the code
from the regex).
I could also use a 2nd step of regex to capture the timestamp from
the group that has the quotation marks, but again that would break
the code abstraction.
I could omit the "" in the regex, but that would also match a timestamp in the middle of the message, whereas I specifically want to capture the timestamp either as the value of a key or at the beginning of the document, followed by a space.
Is there a way I can match both cases above but, in the case it matches the second case, return only the timestamp? Or is it impossible?
EDIT:
As noticed by @Amit Bhardwaj, the first case also returns a space after the timestamp. It's another problem (one I haven't figured out), probably with the same solution!

You may use lookarounds if your code can only access the whole match:
^\d{10}(?=\s)|(?<=")\d{10}(?=")
See the regex demo.
In Python, declare it as
rx = r'^\d{10}(?=\s)|(?<=")\d{10}(?=")'
Pattern details
^\d{10}(?=\s):
^ - string start
\d{10} - ten digits
(?=\s) - a positive lookahead that requires a whitespace char immediately to the right of the current location
| - or
(?<=")\d{10}(?="):
(?<=") - a positive lookbehind that requires a double quotation mark immediately to the left of the current location
\d{10} - ten digits
(?=") - a positive lookahead that requires a double quotation mark immediately to the right of the current location.
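A short sketch of how the lookaround pattern behaves on both kinds of document; extract_timestamp is a hypothetical helper name, and the third input is an invented negative case:

```python
import re

rx = re.compile(r'^\d{10}(?=\s)|(?<=")\d{10}(?=")')

def extract_timestamp(doc):
    """Return the 10-digit timestamp, or None if neither form is present."""
    m = rx.search(doc)
    # The lookarounds consume nothing, so group(0) is the bare timestamp.
    return m.group(0) if m else None

print(extract_timestamp("1545994641 INFO: ..."))                       # 1545994641
print(extract_timestamp('{"deliveryDate":"1545994641","error"..."}'))  # 1545994641
print(extract_timestamp("INFO: 1545994641 in the middle"))             # None
```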

You could use lookarounds, but I think this solution is simpler, if you can just get the group:
"?(\d{10})(?:\"|\s)
EDIT:
Considering that if there is an opening " there must also be a closing ", try this:
(^\d{10}\s|(?<=\")\d{10}(?=\"))
EDIT 2:
To also remove the trailing space in the end, use a lookahead too:
(^\d{10}(?=\s)|(?<=\")\d{10}(?=\"))

Python regex to match after the text and the dot [duplicate]

I am using Python and would like to match all the words after test till a period (full-stop) or space is encountered.
text = "test : match this."
At the moment, I am using :
import re
re.match('(?<=test :).*',text)
The above code doesn't match anything. I need match this as my output.

Everything after test, including test
test.*
Everything after test, without test
(?<=test).*
Example here on regexr.com

You need to use re.search, since re.match only tries to match from the beginning of the string. To match until a space or period is encountered (the lookbehind includes the space after the colon, which the sample text requires):
re.search(r'(?<=test : )[^.\s]*',text)
To match all the chars until a period is encountered:
re.search(r'(?<=test : )[^.]*',text)
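A quick check against the sample text from the question, as a sketch. It assumes the lookbehind also covers the space after the colon ("test : "); without that space, the first character after the lookbehind is whitespace and the character class would match nothing.

```python
import re

text = "test : match this."

# re.match only tries position 0, where the lookbehind cannot succeed.
print(re.match(r'(?<=test : ).*', text))        # None

# re.search scans forward until the lookbehind is satisfied.
word = re.search(r'(?<=test : )[^.\s]*', text)   # stop at space or period
phrase = re.search(r'(?<=test : )[^.]*', text)   # stop at period only
print(word.group(0))     # match
print(phrase.group(0))   # match this
```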

In a general case, as the title mentions, you may capture with (.*) pattern any 0 or more chars other than newline after any pattern(s) you want:
import re
p = re.compile(r'test\s*:\s*(.*)')
s = "test : match this."
m = p.search(s)  # Run a regex search anywhere inside a string
if m:  # If there is a match
    print(m.group(1))  # Print Group 1 value
If you want . to match across multiple lines, compile the regex with re.DOTALL or re.S flag (or add (?s) before the pattern):
p = re.compile(r'test\s*:\s*(.*)', re.DOTALL)
p = re.compile(r'(?s)test\s*:\s*(.*)')
However, it will return match this. (trailing period included). See also a regex demo.
You can add \. pattern after (.*) to make the regex engine stop before the last . on that line:
test\s*:\s*(.*)\.
Watch out for re.match(), since it only looks for a match at the beginning of the string (Avinash already pointed that out, but it is a very important note!).
See the regex demo and a sample Python code snippet:
import re
p = re.compile(r'test\s*:\s*(.*)\.')
s = "test : match this."
m = p.search(s)  # Run a regex search anywhere inside a string
if m:  # If there is a match
    print(m.group(1))  # Print Group 1 value
If you want to make sure test is matched as a whole word, add \b before it (do not remove the r prefix from the string literal, or '\b' will match a BACKSPACE char!) - r'\btest\s*:\s*(.*)\.'.

I don't see why you want to use regex if you're just getting a subset from a string.
This works the same way:
if line.startswith('test:'):
    print(line[5:line.find('.')])
example:
>>> line = "test: match this."
>>> print(line[5:line.find('.')])
match this
Regex is slow, it is awkward to design, and difficult to debug. There are definitely occasions to use it, but if you just want to extract the text between test: and ., then I don't think this is one of those occasions.
See: https://softwareengineering.stackexchange.com/questions/113237/when-you-should-not-use-regular-expressions
For more flexibility (for example if you are looping through a list of strings you want to find at the beginning of a string and then index out) replace 5 (the length of 'test:') in the index with len(str_you_looked_for).
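The generalised version could look roughly like this. The function name text_after and its stop parameter are illustrative assumptions; unlike the one-liner above, this sketch also handles a missing period and strips surrounding whitespace.

```python
def text_after(line, prefix, stop="."):
    """Return the text between prefix and the first stop char, or None."""
    if not line.startswith(prefix):
        return None
    end = line.find(stop, len(prefix))
    if end == -1:
        # No stop character: take everything up to the end of the line
        end = len(line)
    return line[len(prefix):end].strip()

print(text_after("test: match this.", "test:"))   # match this
print(text_after("other: match this.", "test:"))  # None
```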
