Regex to extract first 5 digit+character from last hyphen - python

I am trying to extract first 5 character+digit from last hyphen.
Here is the example
String -- X008-TGa19-ER751QF7
Output -- X008-TGa19-ER751
String -- X002-KF13-ER782cPU80
Output -- X002-KF13-ER782
My attempt -- I could manage to take element from the last -- (\w+)[^-.]*$
But now how to take first 5, then return my the entire value as the output as shown in the example.

You can optionally repeat a - and 1+ word chars from the start of the string. Then match the last - and match 5 word chars.
^\w+(?:-\w+)*-\w{5}
^ Start of string
\w+ Math 1+ word chars
(?:-\w+)* Optionally repeat - and 1+ word chars
-\w{5} Match - and 5 word chars
Regex demo
import re
regex = r"^\w+(?:-\w+)*-\w{5}"
s = ("X008-TGa19-ER751QF7\n"
"X002-KF13-ER782cPU80")
print(re.findall(regex, s, re.MULTILINE))
Output
['X008-TGa19-ER751', 'X002-KF13-ER782']
Note that \w can also match _.
If there can also be other character in the string, to get the first 5 digits or characters except _ after the last hyphen, you can match word characters without an underscore using a negated character class [^\W_]{5}
Repeat that 5 times while asserting no more underscore at the right.
^.*-[^\W_]{5}(?=[^-]*$)
Regex demo

(\w+-\w+-\w{5}) seems to capture what you're asking for.
Example:
https://regex101.com/r/PcPSim/1

If you are open for non-regex solution, you can use this which is based on splitting, slicing and joining the strings:
>>> my_str = "X008-TGa19-ER751QF7"
>>> '-'.join(s[:5] for s in my_str.split('-'))
'X008-TGa19-ER751'
Here I am splitting the string based on hyphen -, slicing the string to get at max five chars per sub-string, and joining it back using str.join() to get the string in your desired format.

^(.*-[^-]{5})[^-]*$
Capture group 1 is what you need
https://regex101.com/r/SYz9i5/1
Explanation
^(.*-[^-]{5})[^-]*$
^ Start of line
( Capture group 1 start
.* Any number of any character
- hyphen
[^-]{5} 5 non-hyphen character
) Capture group 1 end
[^-]* Any number of non-hyphen character
$ End of line
Another simpler one is
^(.*-.{5}).*$
This should be quite straight-forward.
This is making use of behaviour greedy match of first .*, which will try to match as much as possible, so the - will be the last one with at least 5 character following it.
https://regex101.com/r/CFqgeF/1/

Related

How to create regex to match a string that contains only hexadecimal numbers and arrows?

I am using a string that uses the following characters:
0-9
a-f
A-F
-
>
The mixture of the greater than and hyphen must be:
->
-->
Here is the regex that I have so far:
[0-9a-fA-F\-\>]+
I tried these others using exclusion with ^ but they didn't work:
[^g-zG-Z][0-9a-fA-F\-\>]+
^g-zG-Z[0-9a-fA-F\-\>]+
[0-9a-fA-F\-\>]^g-zG-Z+
[0-9a-fA-F\-\>]+^g-zG-Z
[0-9a-fA-F\-\>]+[^g-zG-Z]
Here are some samples:
"0912adbd->12d1829-->218990d"
"ab2c8d-->82a921->193acd7"
Firstly, you don't need to escape - and >
Here's the regex that worked for me:
^([0-9a-fA-F]*(->)*(-->)*)*$
Here's an alternative regex:
^([0-9a-fA-F]*(-+>)*)*$
What does the regex do?
^ matches the beginning of the string and $ matches the ending.
* matches 0 or more instances of the preceding token
Created a big () capturing group to match any token.
[0-9a-fA-F] matches any character that is in the range.
(->) and (-->) match only those given instances.
Putting it into a code:
import re
regex = "^([0-9a-fA-F]*(->)*(-->)*)*$"
re.match(re.compile(regex),"0912adbd->12d1829-->218990d")
re.match(re.compile(regex),"ab2c8d-->82a921->193acd7")
re.match(re.compile(regex),"this-failed->so-->bad")
You can also convert it into a boolean:
print(bool(re.match(re.compile(regex),"0912adbd->12d1829-->218990d")))
print(bool(re.match(re.compile(regex),"ab2c8d-->82a921->193acd7")))
print(bool(re.match(re.compile(regex),"this-failed->so-->bad")))
Output:
True
True
False
I recommend using regexr.com to check your regex.
If there must be an arrow present, and not at the start or end of the string using a case insensitive pattern:
^[a-f\d]+(?:-{1,2}>[a-f\d]+)+$
Explanation
^ Start of string
[a-f\d]+ Match 1+ chars a-f or digits
(?: Non capture group to repeat as a whole
-{1,2}>[a-f\d]+ Match - or -- and > followed by 1+ chars a-f or digits
)+ Close the non capture group and repeat 1+ times
$ End of string
See a regex demo and a Python demo.
import re
pattern = r"^[a-f\d]+(?:-{1,2}>[a-f\d]+)+$"
s = ("0912adbd->12d1829-->218990d\n"
"ab2c8d-->82a921->193acd7\n"
"test")
print(re.findall(pattern, s, re.I | re.M))
Output
[
'0912adbd->12d1829-->218990d',
'ab2c8d-->82a921->193acd7'
]
You can construct the regex by steps. If I understand your requirements, you want a sequence of hexadecimal numbers (like a01d or 11efeb23, separated by arrows with one or two hyphens (-> or -->).
The hex part's regex is [0-9a-fA-F]+ (assuming it cannot be empty).
The arrow's regex can be -{1,2}> or (->|-->).
The arrow is only needed before each hex number but the first, so you'll build the final regex in two parts: the first number, then the repetition of arrow and number.
So the general structure will be:
NUMBER(ARROW NUMBER)*
Which gives the following regex:
[0-9a-fA-F]+(-{1,2}>[0-9a-fA-F]+)*

Test for comma delimited string, ignoring any encountered periods, say from real numbers?

The following works for a simple comma delimited string, that has no periods, but if periods in real numbers found it breaks.
pattern = re.compile(r"^(\w+)(,\s*\w+)*$")
How can I modify or change the above to ignore periods? But still validate the given string is comma delimited?
A sample test string is "23,HIGH,1.0,LOW,1.0,HIGH,1.0,LOW,1.0".
\w matches "word" characters: letters, digits and _. It doesn't match a dot. If you want to match dots as well, use [\w.] instead of \w:
pattern = re.compile(r"^([\w.]+)(,\s*[\w.]+)*$")
You might also want to add -, if you expect negative numbers. To put - in a character class, you either have to backslash escape it or make sure it's either the first or last character in the class:
[-.\w]
[\w.-]
[\w\-.]
If the value can only be a number, and matching dots only would not be desired you can use and alternation to match either word characters or a number.
^(?:[+-]?\d*\.?\d+|\w+)(?:,(?:[+-]?\d*\.?\d+|\w+))*$
Explanation
^ Start of string
(?: Non capture group
[+-]?\d*\.?\d+ Match an optional + or -, then optional digits, optional dot and 1+ digits
| Or
\w+ Match 1+ word characters
) Close non capture group
(?: Non capture group
, Match the comma
(?:[+-]?\d*\.?\d+|\w+) The same pattern as in the first part
)* Close non capture group and optionally repeat to also match a single occurrence
$ End of string
Regex demo

Regex to find sentences of a minimum length

I am trying to create a regular expression that finds sentences with a minimum length.
Really my conditions are:
there must at least be 5 words in a sequence
words in sequence must be distinct
sequence must be followed by some punctuation character.
So far I have tried
^(\b\w*\b\s?){5,}\s?[.?!]$
If my sample text is:
This is a sentence I would like to parse.
This is too short.
Single word
Not not not distinct distinct words words.
Another sentence that I would be interested in.
I would like to match on strings 1 and 5.
I am using the python re library. I am using regex101 to test and it appears the regex I have above is doing quite a bit of work regards to backtracking so I imagine those knowledgable in regex may be a bit appalled (my apologies).
You can use the following regex to identify the strings that meet all three conditions:
^(?!.*\b(\w+)\b.+\b\1\b)(?:.*\b\w+\b){5}.*[.?!]\s*$
with the case-indifferent flag set.
Demo
Python's regex engine performs the following operations.
^ # match beginning of line
(?! # begin negative lookahead
.+ # match 1+ chars
\b(\w+)\b # match a word in cap grp 1
.+ # match 1+ chars
\b\1\b # match the contents of cap grp 1 with word breaks
) # end negative lookahead
(?: # begin non-cap grp
.+ # match 1+ chars
\b\w+\b # match a word
) # end non-cap grp
{5} # execute non-cap grp 5 times
.* # match 0+ chars
[.?!] # match a punctuation char
\s* # match 0+ whitespaces
$ # match end of line
Items 1. and 3. are easily done by regex, but
2. words in sequence must be distinct
I don't see how you could do it with a regex pattern. Remember that regex is a string-matching operation; it doesn't do heavy logic. This problem doesn't sound like a regex problem to me.
I recommend splitting the string in the character " " and checking word by word. Quickier, no sweat.
Edit
can be done with a lookahead as Cary said.

Remove duplicate words in a string using regex

I'm working on my regex skills and i find one of my strings having duplicate words at the starting. I would like to remove the duplicate and just have one word of it -
server_server_dev1_check_1233.zzz
server_server_qa1_run_1233.xyz
server_server_dev2_1233.qqa
server_dev1_1233.zzz
data_data_dev9_check_660.log
I used the below regex but i get both server_server in my output,
((.*?))_(?!\D)
How can i have my output just to one server_ if there are two or more and if its only one server_, then take as is?
The output doesn't have to contain the digits and also the part after . i.e. .zzz, .xyz etc
Expected output -
server_dev1_check
server_qa1_run
server_dev2
server_dev1
data_dev9_check
you could back reference the word in your search expression:
>>> s = "server_server_dev1_check_1233.zzz"
>>> re.sub(r"(.*_)\1",r"\1",s)
'server_dev1_check_1233.zzz'
and use the "many times" suffix so if there are more than 2 occurrences it still works:
'server_server_server_dev1_check_1233.zzz'
>>> re.sub(r"(.*_)\1{1,}",r"\1",s)
'server_dev1_check_1233.zzz'
getting rid of the suffix is not the hardest part, just capture the rest and discard the end:
>>> re.sub(r"(.*_)\1{1,}(.*)(_\d+\..*)",r"\1\2",s)
'server_dev1_check'
You may use a single re.sub call to match and remove what you do not need and match and capture what you need:
re.sub(r'^([^_]+)(?:_\1)*(.*)_\d+\.\w+$', r'\1\2', s)
See the regex demo
Details
^ - start of string
([^_]+) - Capturing group 1: any 1+ chars other than _
(?:_\1)* - zero or more repetitions of _ followed with the same substring as in Group 1 (thanks to the inline backreference \1 that retrieves the text from Group 1)
(.*) - Group 2: any 0+ chars, as many as possible
_ - an underscore
\d+ - 1+ digits
\. - a dot
\w+ - 1+ word chars ([^.]+ will also do, 1 or more chars other than .)
$ - end of string.
The replacement pattern is \1\2, i.e. the contents of Group 1 and 2 are concatenated and make up the resulting value.
Python demo:
import re
rx = r'^([^_]+)(?:_\1)*(.*)_\d+\.\w+$'
strs = ["server_server_dev1_check_1233.zzz", "server_server_qa1_run_1233.xyz", "server_server_dev2_1233.qqa", "server_dev1_1233.zzz", "data_data_dev9_check_660.log"]
for s in strs:
print(re.sub(rx, r'\1\2', s))
Output:
server_dev1_check
server_qa1_run
server_dev2
server_dev1
data_dev9_check

Match only the string that has strings after last underscore

I am trying to match string with underscores, throughout the string there are underscores but I want to match the strings that that has strings after the last underscore: Let me provide an example:
s = "hello_world"
s1 = "hello_world_foo"
s2 = "hello_world_foo_boo"
In my case I only want to capture s1 and s2.
I started with following, but can't really figure how I would do the match to capture strings that has strings after hello_world's underscore.
rgx = re.compile(ur'(?P<firstpart>\w+)[_]+(?P<secondpart>\w+)$', re.I | re.U)
Try this:
reobj = re.compile("^(?P<firstpart>[a-z]+)_(?P<secondpart>[a-z]+)_(?P<lastpart>.*?)$", re.IGNORECASE)
result = reobj.findall(subject)
Regex Explanation
^(?P<firstpart>[a-z]+)_(?P<secondpart>[a-z]+)_(?P<lastpart>.*?)$
Options: case insensitive
Assert position at the beginning of the string «^»
Match the regular expression below and capture its match into backreference with name “firstpart” «(?P<firstpart>[a-z]+)»
Match a single character in the range between “a” and “z” «[a-z]+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Match the character “_” literally «_»
Match the regular expression below and capture its match into backreference with name “secondpart” «(?P<secondpart>[a-z]+)»
Match a single character in the range between “a” and “z” «[a-z]+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Match the character “_” literally «_»
Match the regular expression below and capture its match into backreference with name “lastpart” «(?P<lastpart>.*?)»
Match any single character that is not a line break character «.*?»
Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Assert position at the end of the string (or before the line break at the end of the string, if any) «$»
If I understand what you are asking for (you want to match string with more than one underscore and following text)
rgx = re.compile(ur'(?P<firstpart>\w+)[_]+(?P<secondpart>\w+)_[^_]+$', re.I | re.U)

Categories