regex matching the pattern - python

Is there a way to write a regular expression in python that matches the string of the following format:
feat(fix-validation): some texts (Aerogear-G1010)
or
feat$($fix-validation$)$:$some texts$($Aerogear-G1010$)
Here, $ means that zero or more whitespace characters can be present.
breakdown:
feat : string from a fixed subset of strings ['feat','fix','docs','breaking']
fix-validation : string of max n length
some texts : string of max m length
Aerogear-G1010 : Prefix should always be a string Aerogear- and after that some alphanumeric characters of max q length
Note : we can't escape the special characters like ( ) : - and should be in the exact same format as shown in the example below:
feat(feat-new): new feature for group creation (Aerogear-1234)
docs(new docs): add new document on the file repo (Aerogear-G1235)
fix(fix-warnings): fix user raised concerns (Aerogear-P1230)
I was only able to match the fixed subset of strings, using the pattern '^fix|docs|feat\s+\(' . I am not able to allow zero or more spaces after a matched string, followed by (some texts) :
Can this be achieved ? thank you in advance :)

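In your regex, '^fix|docs|feat\s+\(', the \s+ matches a whitespace character one or more times, so it requires at least one whitespace character. Also, because the alternation is not grouped, ^ applies only to fix and \s+\( applies only to feat.
Instead, group the alternatives and use * for the optional spaces: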
^(feat|docs|fix|breaking) *\( *(.*?) *\) *: *(.*?) *\( *(.*?) *\)$
This should do what you are looking for; you can see what each part is doing here: https://regex101.com/r/EsF8FF/4
I have used literal spaces ( ), but if you want to allow any whitespace character you should use \s, which for ASCII text is equivalent to [\r\n\t\f\v ].
I'd recommend reading the explanation there and the Python docs for the re module: https://docs.python.org/3/library/re.html
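As a rough sketch of how the pattern above could be tightened to cover the length limits and the Aerogear- prefix from the question: the constants MAX_SCOPE, MAX_SUBJECT and MAX_ID are hypothetical stand-ins for n, m and q, and the helper name parse_commit is made up for illustration.

```python
import re

# Hypothetical limits standing in for n, m and q from the question.
MAX_SCOPE, MAX_SUBJECT, MAX_ID = 30, 72, 10

COMMIT_RE = re.compile(
    r"^(feat|fix|docs|breaking)\s*"                        # type from the fixed subset
    rf"\(\s*([^()]{{1,{MAX_SCOPE}}}?)\s*\)\s*"             # (scope), spaces optional
    rf":\s*(.{{1,{MAX_SUBJECT}}}?)\s*"                     # : subject
    rf"\(\s*(Aerogear-[A-Za-z0-9]{{1,{MAX_ID}}})\s*\)$"    # (Aerogear-<id>)
)

def parse_commit(message):
    """Return (type, scope, subject, ticket) or None if the format is wrong."""
    match = COMMIT_RE.match(message)
    return match.groups() if match else None

print(parse_commit("feat(feat-new): new feature for group creation (Aerogear-1234)"))
print(parse_commit("feat ( fix-validation ) : some texts ( Aerogear-G1010 )"))
print(parse_commit("chore(x): y (Aerogear-1)"))
```

The doubled braces are only there because the pattern is built with f-strings; with fixed limits you could write the quantifiers directly, for example {1,30}.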

Related

Python\PySpark Reg Expression - replace a pattern IF it occurs x time in string

my strings look something like this :
ABS25C/18033C,25A/17972C
ABS300ABC
AS25C/18033C,25A/17972C,25B/18026C
desired output :
ABS25C25A
ABS300ABC
AS25C25A25B
I have tried many different combinations. It seems straightforward: (/.+,) would match the characters between the "/" and ",", and I would then be able to replace them with an empty string.
But it ignores the first "," and matches up to the last one, so the match includes text in the middle that I want to keep. That text would be removed if I went that route.
Example in image. I end up losing text I need. I figured there was a way to get what I need with a single regex, without having to split(",") the string and then apply a regex.
You can match using this regex:
/[^/,\n]+,?
And replace using empty string.
RegEx Details:
/: Match a /
[^/,\n]+: Match 1+ of any character that is not /, not , and not a line break
,?: Match optional ,
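A minimal sketch of that replacement in plain Python, using re.sub on the sample strings from the question (the helper name strip_slash_parts is made up for illustration):

```python
import re

# A "/" followed by everything up to the next "/", "," or newline,
# plus an optional trailing comma.
SLASH_PART = re.compile(r"/[^/,\n]+,?")

def strip_slash_parts(text):
    """Remove every '/...' chunk together with the comma that follows it."""
    return SLASH_PART.sub("", text)

samples = [
    "ABS25C/18033C,25A/17972C",
    "ABS300ABC",
    "AS25C/18033C,25A/17972C,25B/18026C",
]
for sample in samples:
    print(strip_slash_parts(sample))
```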

Extracting a word between two path separators that comes after a specific word

I have the following path stored as a python string 'C:\ABC\DEF\GHI\App\Module\feature\src' and I would like to extract the word Module that is located between words \App\ and \feature\ in the path name. Note that there are file separators '\' in between which ought not to be extracted, but only the string Module has to be extracted.
I had the few ideas on how to do it:
Write a RegEx that matches a string between \App\ and \feature\
Write a RegEx that matches a string after \App\ --> App\\[A-Za-z0-9]*\\, and then split that matched string in order to find the Module.
I think the 1st solution is better, but that unfortunately it goes over my RegEx knowledge and I am not sure how to do it.
I would much appreciate any help.
Thank you in advance!
The regex you want is:
(?<=\\App\\).*?(?=\\feature\\)
Explanation of the regex:
(?<=behind)rest matches all instances of rest if there is behind immediately before it. It's called a positive lookbehind
rest(?=ahead) matches all instances of rest where there is ahead immediately after it. This is a positive lookahead.
\ is a reserved character in regex patterns, so to use it as part of the pattern itself we have to escape it; hence, \\
.* matches any character (except a newline), zero or more times.
? makes the preceding .* non-greedy, so the match stops at the first \feature\ that follows \App\.
If there are further \ characters between \App\ and \feature\, they are included in the match, so the pattern returns a single name only when exactly one path component lies between the two.
The full code would be something like:
import re
path = 'C:\\ABC\\DEF\\GHI\\App\\Module\\feature\\src'
start = '\\App\\'
end = '\\feature\\'
# re.escape escapes the backslashes so they are matched literally
pattern = rf"(?<={re.escape(start)}).*?(?={re.escape(end)})"
print(pattern) # (?<=\\App\\).*?(?=\\feature\\)
print(re.search(pattern, path)[0]) # Module
A link on regex lookarounds that may be helpful: https://www.regular-expressions.info/lookaround.html
We can also do this with str.find, something like:
path = 'C:\\ABC\\DEF\\GHI\\App\\Module\\feature\\src'
start = '\\App\\'
end = '\\feature\\'
# slice from the end of the first start marker to the last end marker
print(path[path.find(start) + len(start):path.rfind(end)])
Output:
Module
You are looking for groups. With some small modifications you can extract only the part between App and feature.
(?:App\\\\)([A-Za-z0-9]*)(?:\\\\feature)
The brackets ( ) define a capturing group, which you can get with match.group(1). Using (?:foo) defines a non-capturing group, i.e. one that is not included in your result. Note that \\\\ matches a single backslash only when the pattern is written in a non-raw Python string literal; in a raw string, use \\. Try the expression here: https://regex101.com/r/24mkLO/1
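A short sketch of the capture-group approach in Python. It is written with raw strings, so each literal backslash in the pattern is escaped once; the variable names are only illustrative.

```python
import re

path = r"C:\ABC\DEF\GHI\App\Module\feature\src"

# Non-capturing groups for the surrounding parts, one capturing group
# for the name we actually want.
pattern = re.compile(r"(?:App\\)([A-Za-z0-9]*)(?:\\feature)")

match = pattern.search(path)
module = match.group(1) if match else None
print(module)
```

Here match.group(0) would still be the whole matched text (App\Module\feature), while group(1) holds only the captured name.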

How to find whether a string pattern exists in a python string?

Let's say I have an array of strings. I want to find all the strings that contain the following
substring: character digit digit digit character (CDDDC will be the pattern). For instance, the format would be as follows:
H554L
K007K
Is there any fast string expression matching to find such occurrences ?
Things like this are the field of "regex". Regex is made for pattern matching. It is a broad topic in itself, too much to explain here (check regexbuddy or another site).
Python has a built-in regex module, re (there is also a third-party regex module). A simple solution would hence be:
[word for word in somelist if re.search(r"[a-zA-Z]\d{3}[a-zA-Z]", word)]
This iterates over somelist and selects every word that contains a letter from one of the two ranges, followed by 3 digits, followed by another letter from those ranges.
As noted in the comments: re.search will match (find) any item in which some part of the item matches the pattern. So it will match a123b as well as abc b123cd. If you wish to make sure that the full word in the array matches the pattern, use re.fullmatch instead.
re.fullmatch will match a123b but not abc b123cd and not ab123cd
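To make the difference between re.search and re.fullmatch concrete, here is a small sketch; the sample list words is made up for illustration.

```python
import re

PATTERN = re.compile(r"[a-zA-Z]\d{3}[a-zA-Z]")

# Hypothetical sample data.
words = ["a123b", "abc b123cd", "ab123cd", "H554L", "juan"]

# search: the pattern may occur anywhere inside the string.
contains = [w for w in words if PATTERN.search(w)]

# fullmatch: the whole string must match the pattern.
exact = [w for w in words if PATTERN.fullmatch(w)]

print(contains)
print(exact)
```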
Try this example with this regex:
regex: (?i)[A-Z]\d\d\d[A-Z]
import re
xx = ['aeeea','5eeae','H554L','juan','K007K']
for i in xx:
    # findall returns every CDDDC occurrence in the string (case-insensitive)
    r1 = re.findall(r"(?i)[A-Z]\d\d\d[A-Z]", i)
    print(', '.join(r1))

Regex for parsing uid from URL

I am trying to parse UIDs from URLs. However regex is not something I am good at so seeking for some help.
Example Input:
https://example.com/d/iazs9fEil/somethingelse?foo=bar
Example Output:
iazs9fEil
What I've tried so far is
([/d/]+[\d\x])\w+
This somewhat works, but returns the match with the /d/ prefix, so the output is /d/iazs9fEil.
How to change the regex to not contain the /d/ prefix?
EDIT:
I've tried this regex ([^/d/]+[\d\x])\w+ which outputs the correct string which is iazs9fEil, but also returns the rest of the url, so here it is somethingelse?foo=bar
In short, you may use
match = re.search(r'/d/(\w+)', your_string) # Look for a match
if match: # Check if there is a match first
    print(match.group(1)) # Now, get Group 1 value
NOTE
/ is not a special metacharacter, so do not escape it in Python string patterns
([/d/]+[\d\x])\w+ matches and captures into Group 1 one or more slashes or d characters (see [/d/]+, a positive character class), followed by [\d\x], which was presumably meant to match a digit or x (Python actually raises an error here, sre_constants.error: incomplete escape \x), and then matches 1+ word chars. Because you put /d/ into a character class, it stopped matching a character sequence: [/d/]+ matches slashes and d characters in any order and amount, and places that string into Group 1.
Try (?<=/d/)[^/]+
Explanation:
(?<=/d/) - positive lookbehind, asserts that what precedes is /d/
[^/]+ - match one or more characters other than /, so it matches everything until /
You could use a capturing group:
https?://.*?/d/([^/\s]+)
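For comparison, the three suggestions above can be run side by side on the example URL from the question. This is only a sketch; the variable names are illustrative.

```python
import re

url = "https://example.com/d/iazs9fEil/somethingelse?foo=bar"

# 1. Capture group after the literal /d/ prefix.
by_group = re.search(r"/d/(\w+)", url)

# 2. Lookbehind: the prefix is checked but not part of the match.
by_lookbehind = re.search(r"(?<=/d/)[^/]+", url)

# 3. Capture group anchored on the scheme and host.
by_full_url = re.search(r"https?://.*?/d/([^/\s]+)", url)

print(by_group.group(1))
print(by_lookbehind.group(0))
print(by_full_url.group(1))
```

All three return the same UID for this input; they differ in what else they accept, for example \w+ stops at a hyphen while [^/]+ does not.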

Python regex number by looking behind

I am extracting numbers from strings in the following format:
AB1234
AC1234
AD1234
As you can see, A is always there and the second character is anything except ". I wrote the code below to extract the number.
re.search(r'(?<=A[^"])\d*',input)
But I encountered an error.
look-behind requires fixed-width pattern
So is there a convenient way to extract the numbers? For now I only know how to get them by searching twice. Thanks in advance.
Note: A stands for a pattern; in fact, A is a word in a long string.
The regex in your example works, so I'm guessing your actual pattern has variable-width matches (*, +, etc.). Unfortunately, look-behinds in Python's re module do not support those. What I can suggest as an alternative is to use a capture group and extract the matching string:
m = re.search(r'A\D+(\d+)', s)
if m:
    r = m.group(1)
Details
A # your word
\D+ # anything that is not a digit
( # capture group
\d+ # 1 or more digits
)
If you want to take care of double quotes, you can make a slight modification to the regular expression by including a character class -
r'A[^\d"]+(\d+)'
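A small sketch of the capture-group approach wrapped in a function, so that the leading word can have any length. The function name number_after_word and its parameters are hypothetical; re.escape is used so the word is matched literally.

```python
import re

def number_after_word(text, word="A"):
    """Return the first run of digits that follows `word`, or None.

    Between the word and the digits, at least one character that is
    neither a digit nor a double quote must be present.
    """
    match = re.search(re.escape(word) + r'[^\d"]+(\d+)', text)
    return match.group(1) if match else None

print(number_after_word("AB1234"))
print(number_after_word('A"1234'))
print(number_after_word("xx WORD: 42 yy", word="WORD"))
```

Because the prefix sits outside the capture group rather than inside a look-behind, it can be of variable width without triggering the fixed-width error.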
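Try matching the prefix and capturing the digits instead. A lookahead such as (?=A[^"]\d*) does not consume the prefix, so a following \d* would match an empty string at A. Use a capture group and read group(1):
re.search(r'A[^"](\d*)',input)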
