I have the following, which is part of a line in a log file:
-FDH-11 TIP: - 146/S Q: 48
I want to match it with a regex. Is there a way to get the value of Q from the above input? I am not sure the length between -FDH- and Q: is always the same, so ideally, if both -FDH- and Q: are found, I want to get the value of Q.
Code
-FDH-.*?\bQ:\s*(\S+)
Explanation
-FDH- Match this literally
.*? Match any character any number of times, but as few as possible
\b Assert position as a word boundary
Q: Match this literally
\s* Match any number of whitespace characters
(\S+) Capture one or more non-whitespace characters into capture group 1
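As a rough illustration of how the pattern above could be used from Python (the names line, pattern and q_value are my own; the sample is the line from the question), the value of Q ends up in capture group 1:

```python
import re

# Sample log fragment taken from the question.
line = "-FDH-11 TIP: - 146/S Q: 48"

# -FDH-, then as little as possible, then "Q:" and the value after it.
pattern = re.compile(r"-FDH-.*?\bQ:\s*(\S+)")

match = pattern.search(line)
q_value = match.group(1) if match else None
print(q_value)  # 48
```

Using search rather than match lets the pattern sit anywhere inside a longer log line, and the None fallback covers lines that lack either marker.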
You probably don't need to use regex for this. If there is only one instance of "Q:" in the string, then you can just split on that and get the value after with the following:
line = "-FDH-11 TIP: - 146/S Q: 48"  # avoid shadowing the built-in str
parts = line.split("Q: ")
q_value = parts[1]  # "48"
I tried to compose a pattern with regex and validate multiple strings against it. My pattern seems fine according to the regex documentation, but for some reason some invalid strings are not rejected. Can anyone point out my mistake?
test use case
this is test use case for one input string:
import re
usr_pat = r"^\$\w+_src_username_\w+$"
u_name='$ini_src_username_cdc_char4ec_pits'
m = re.match(usr_pat, u_name, re.M)
if m:
    print("Valid username:", m.group())
else:
    print("ERROR: Invalid user_name:\n", u_name)
I expect this to report an error, because the input string must start with a $ sign, then one string (\w+), then _, then src, then _, then username, then _, and then end with exactly one string (\w+). This is how I composed my pattern, and I tried to validate different input strings with it, but for some reason they are not parsed correctly. Did I miss something here?
desired output
this is valid and invalid input:
valid:
$ini_src_usrname_ajkc2e
$ini_src_password_ajkc2e
$ini_src_conn_url_ajkc2e
invalid:
$ini_src_usrname_ajkc2e_chan4
$ini_src_password_ajkc2e_tst1
$ini_smi_src_conn_url_ajkc2e_tst2
ini_smi_src_conn_url_ajkc2e_tst2
$ini_src_usrname_ajkc2e_chan4_jpn3
According to the regex documentation, r"^\$\w+_src_username_\w+$" should capture the logic that I want, but it does not work for all my test cases. What did I miss? Thanks.
The \w character class also matches underscores and numbers:
Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore. If the ASCII flag is used, only [a-zA-Z0-9_] is matched.
(https://docs.python.org/3/library/re.html#regular-expression-syntax).
So the final \w+ matches the entirety of cdc_char4ec_pits
I think you are looking for [a-zA-Z0-9] which will not match underscores.
usr_pat = r"^\$[a-zA-Z0-9]+_src_username_[a-zA-Z0-9]+$"
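To see where the original pattern goes wrong and how the stricter character class behaves, here is a small sketch. The capture groups and the names loose and strict are added purely for illustration, and the last input string is a made-up valid example rather than one from the question.

```python
import re

u_name = "$ini_src_username_cdc_char4ec_pits"

# Original pattern, with capture groups added to show what each \w+ consumed.
loose = re.compile(r"^\$(\w+)_src_username_(\w+)$")
# Same shape, but the character class excludes the underscore.
strict = re.compile(r"^\$([a-zA-Z0-9]+)_src_username_([a-zA-Z0-9]+)$")

m = loose.match(u_name)
print(m.groups())            # the second group swallows the underscores too
print(strict.match(u_name))  # None, because the tail has extra "_" parts
print(strict.match("$ini_src_username_ajkc2e") is not None)
```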
\w+
First: \w matches exactly one character, which can be:
1- one letter from a to z, or from A to Z
OR
2- one number from 0 to 9
OR
3- an underscore (_)
Second: The plus (+) sign after \w means that the previous token is matched between one and unlimited times.
So if my regex pattern is: r"^\$\w+$"
It would match the string: '$ini_src_username_cdc_char4ec_pits'
1- The ^\$ will match the dollar sign at the beginning of the string $
2- \w+ first matches the character i of the word ini, and because of the + sign it continues to match the character n and the second i. The underscore after ini is matched as well, because \w matches an underscore, not just a number or a letter. In the same way the word src, the underscore after it, and the word username are matched, and so on until the whole string has been matched.
You mentioned the word "string", if you mean letters and numbers such as : "bla123", "123455" or "BLAbla", then you can use something like [a-zA-Z0-9]+ instead of \w+.
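A quick way to convince yourself of this behaviour, as a minimal sketch (the sample word ini_src is just an illustrative fragment of the string above):

```python
import re

sample = "ini_src"

# \w+ accepts letters, digits and the underscore, so the whole sample matches.
print(re.fullmatch(r"\w+", sample) is not None)           # True
# [a-zA-Z0-9]+ cannot cross the underscore, so a full match is impossible.
print(re.fullmatch(r"[a-zA-Z0-9]+", sample) is not None)  # False
# The simplified pattern from above matches the entire username string.
whole = re.match(r"^\$\w+$", "$ini_src_username_cdc_char4ec_pits")
print(whole is not None)                                  # True
```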
I have specific patterns composed of strings, numbers and special characters in a specific order. I would like to check whether an input string matches one of the patterns I created and print an error for incorrect input. I tried using regex for this, but my code is not neat enough. I am wondering if someone can help me with this.
use case
I have the input att2_epic_app_clm1_sub_valid, which I split by _; below is the list of patterns I expect to check, printing an error if there is no match.
Rule:
The input should start with att and some number, like [att][0-6]*, or [ptt][0-6]; after that it should continue with either epic or semi; then it should continue with [app][0-6] or [app][0-6_][clm][0-9_]+[sub|sup]; then it should end with [valid|Invalid].
So I composed this pattern with re, but when I pass invalid input it is not detected, and I expect an error instead.
import re
acceptable_pattern=re.compile(r'([att]+[0-6_])(epic|semi_)([app]+[0-6_]+[clm]+[0-6_])([sub|sup_])([valid|invalid])')
input='att1_epic_app2_clm1_sub_valid' # this is valid string
wlist=input.split('_')
for each in wlist:
    if any(ext in each for ext in acceptable_pattern):
        print("valid")
    else:
        print("invalid")
This is not quite working, because I have to check the string from beginning to end: after splitting the string by _, each new substring must match one of the predefined rules, such as:
The input string should start with att|ptt ending in a digit between 1-6; then the next word is either epic or semi; then it should be app or app1~app6 or app{1_6}clm{1~6}{sub|sup_}; then the string ends with {valid|invalid}.
How should I specify those rules using re.compile to check the pattern in the input string and raise an error if the parts are not in sequence? How should we do this in Python? Is there a quick way of making this happen?
Instead of using split, you could consider writing a pattern that validates the whole string.
If I am reading the requirements correctly, you might use:
^[ap]tt[0-6]_(?:epic|semi)_app(?:[1-6]|[1-6_]clm[0-9]*_su[bp])?_valid$
^ Start of string
[ap]tt[0-6] match att or ptt and a digit 0-6
_(?:epic|semi) Match _epic or _semi
_app Match literally
(?: Non capture group for the alternation
[1-6] Match a digit 1-6
| Or
[1-6_]clm[0-9]*_su[bp] Match a digit 1-6 or _, then clm followed by optional digits 0-9 and then _sub or _sup
)? Close the non capture group and make it optional
_valid Match literally
$ End of string
If the string can also start with dev then you can use an alternation:
^(?:[ap]tt|dev)[0-6]_(?:epic|semi)_app(?:[1-6]|[1-6_]clm[0-9]*_su[bp])?_valid$
Then you can check if there was a match:
import re
pattern = r"^(?:[ap]tt|dev)[0-6]_(?:epic|semi)_app(?:[1-6]|[1-6_]clm[0-9]*_su[bp])?_valid$"
strings = [
"att2_epic_app_clm1_sub_valid",
"att12_epic_app_clm1_sub_valid",
"att2_epic_app_valid",
"att2_epic_app_clm1_sub_valid"
]
for s in strings:
    m = re.match(pattern, s, re.M)
    if m:
        print("Valid: " + m.group())
    else:
        print("Invalid: " + s)
Output
Valid: att2_epic_app_clm1_sub_valid
Invalid: att12_epic_app_clm1_sub_valid
Valid: att2_epic_app_valid
Valid: att2_epic_app_clm1_sub_valid
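The rule in the question also mentions strings ending in invalid, which the pattern above rejects because it only accepts _valid. If that suffix should pass too, one possible adjustment, offered only as a sketch of my reading of the rule, is to turn the suffix into an alternation. The helper name is_acceptable is hypothetical.

```python
import re

# Same pattern as above, except the suffix may be "valid" or "invalid".
pattern = re.compile(
    r"^(?:[ap]tt|dev)[0-6]_(?:epic|semi)_app"
    r"(?:[1-6]|[1-6_]clm[0-9]*_su[bp])?_(?:valid|invalid)$"
)

def is_acceptable(s):
    """Return True when the whole string follows the expected sequence."""
    return pattern.match(s) is not None

for s in ["att2_epic_app_clm1_sub_invalid", "att2_epic_app_maybe"]:
    print(s, "->", "ok" if is_acceptable(s) else "error")
```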
I am trying to extract the first 5 characters (letters or digits) after the last hyphen, together with everything before them.
Here is the example
String -- X008-TGa19-ER751QF7
Output -- X008-TGa19-ER751
String -- X002-KF13-ER782cPU80
Output -- X002-KF13-ER782
My attempt -- I managed to take the element after the last hyphen -- (\w+)[^-.]*$
But how do I take only its first 5 characters and then return the entire value as the output, as shown in the example?
You can optionally repeat a - and 1+ word chars from the start of the string. Then match the last - and match 5 word chars.
^\w+(?:-\w+)*-\w{5}
^ Start of string
\w+ Match 1+ word chars
(?:-\w+)* Optionally repeat - and 1+ word chars
-\w{5} Match - and 5 word chars
import re
regex = r"^\w+(?:-\w+)*-\w{5}"
s = ("X008-TGa19-ER751QF7\n"
"X002-KF13-ER782cPU80")
print(re.findall(regex, s, re.MULTILINE))
Output
['X008-TGa19-ER751', 'X002-KF13-ER782']
Note that \w can also match _.
If there can also be other characters in the string, to get the first 5 digits or characters except _ after the last hyphen, you can match word characters without an underscore using a negated character class [^\W_]{5}
Repeat that 5 times while asserting that no more hyphens follow on the right.
^.*-[^\W_]{5}(?=[^-]*$)
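A short sketch of this last pattern in Python, run against the two sample strings from the question (the variable names are mine):

```python
import re

# Match up to the last hyphen, then exactly 5 letters/digits (no underscore),
# while the lookahead checks that no further hyphen follows.
pattern = re.compile(r"^.*-[^\W_]{5}(?=[^-]*$)")

for s in ["X008-TGa19-ER751QF7", "X002-KF13-ER782cPU80"]:
    m = pattern.search(s)
    print(m.group() if m else None)
```

Because the lookahead forbids any later hyphen, a string whose last part contains an underscore within its first five characters produces no match at all instead of falling back to an earlier part.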
(\w+-\w+-\w{5}) seems to capture what you're asking for.
Example:
https://regex101.com/r/PcPSim/1
If you are open for non-regex solution, you can use this which is based on splitting, slicing and joining the strings:
>>> my_str = "X008-TGa19-ER751QF7"
>>> '-'.join(s[:5] for s in my_str.split('-'))
'X008-TGa19-ER751'
Here I am splitting the string based on hyphen -, slicing the string to get at max five chars per sub-string, and joining it back using str.join() to get the string in your desired format.
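Note that the join above trims every hyphen-separated part to five characters, which happens to be harmless for the sample strings. If only the last part should be shortened, a variant based on str.rpartition could look like this (the function name trim_last_part and the second test string are made up for illustration):

```python
def trim_last_part(s, width=5):
    """Keep everything up to the last hyphen plus `width` chars after it."""
    head, sep, tail = s.rpartition("-")
    # When there is no hyphen, rpartition returns ("", "", s); leave s untouched.
    if not sep:
        return s
    return head + sep + tail[:width]

print(trim_last_part("X008-TGa19-ER751QF7"))
print(trim_last_part("LONGPREFIX-KF13-ER782cPU80"))
```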
^(.*-[^-]{5})[^-]*$
Capture group 1 is what you need
https://regex101.com/r/SYz9i5/1
Explanation
^(.*-[^-]{5})[^-]*$
^ Start of line
( Capture group 1 start
.* Any number of any character
- hyphen
[^-]{5} 5 non-hyphen characters
) Capture group 1 end
[^-]* Any number of non-hyphen characters
$ End of line
Another simpler one is
^(.*-.{5}).*$
This should be quite straight-forward.
This makes use of the greedy behaviour of the first .*, which tries to match as much as possible, so the - will be the last one with at least 5 characters following it.
https://regex101.com/r/CFqgeF/1/
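The two patterns in this answer agree on the sample data but differ when the last hyphen has fewer than 5 characters after it. A small comparison, as a sketch (the names strict and relaxed and the string A-BCDEFG-HI are my own):

```python
import re

# Requires 5 non-hyphen chars right after the LAST hyphen.
strict = re.compile(r"^(.*-[^-]{5})[^-]*$")
# Uses the last hyphen that still has 5 chars after it.
relaxed = re.compile(r"^(.*-.{5}).*$")

for s in ["X008-TGa19-ER751QF7", "A-BCDEFG-HI"]:
    m1, m2 = strict.match(s), relaxed.match(s)
    print(s, "|", m1.group(1) if m1 else None, "|", m2.group(1) if m2 else None)
```

For the second string the first pattern finds nothing, while the simpler one falls back to the earlier hyphen.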
Following my previous question (How do i find multiple occurences of this specific string and split them into a list?), I'm now asking a follow-up, since the rule has changed.
Here's the string, and the bold words are the ones that I want to extract.
text|p1_1_1120170AS074192161A0Z20|C M E -
Rectifier|#|text|p1_2_1120170AS074192161A0Z20|Huawei|#|text|p1_3_1120170AS074192161A0Z20|Rectifier
Module 3KW|#|text|p1_4_1120170AS074192161A0Z20|Shuangdeng
6-FMX-170|#|text|p1_5_1120170AS074192161A0Z20|24021665|#|text|p1_6_1120170AS074192161A0Z20|1120170AS074192161A0Z20|#|text|p1_7_1120170AS074192161A0Z20|OK|#|text|p1_8_1120170AS074192161A0Z20||#|text|p1_9_1120170AS074192161A0Z20|ACTIVE|#|text|p1_10_1120170AS074192161A0Z20|-OK|#|text|site_id|20MJK110|#|text|barcode_flag|auto|#|text|movement_flag||#|text|unit_of_measurement||#|text|flag_waste|no|#|text|req_qty_db|2|#|text|req_qty|2
Here's my current regex:
(?<=p1\_1\_.*)[^|]+(?=\|\#\|.*|$)
After trying it out in https://regexr.com/, I found the result instead :
text|p1_1_1120170AS074192161A0Z20|C M E -
Rectifier|#|text|p1_2_1120170AS074192161A0Z20|Huawei|#|text|p1_3_1120170AS074192161A0Z20|Rectifier
Module 3KW|#|text|p1_4_1120170AS074192161A0Z20|Shuangdeng
6-FMX-170|#|text|p1_5_1120170AS074192161A0Z20|24021665|#|text|p1_6_1120170AS074192161A0Z20|1120170AS074192161A0Z20|#|text|p1_7_1120170AS074192161A0Z20|OK|#|text|p1_8_1120170AS074192161A0Z20||#|text|p1_9_1120170AS074192161A0Z20|ACTIVE|#|text|p1_10_1120170AS074192161A0Z20|-OK|#|text|site_id|20MJK110|#|text|barcode_flag|auto|#|text|movement_flag||#|text|unit_of_measurement||#|text|flag_waste|no|#|text|req_qty_db|2|#|text|req_qty|2
The question remains: "Why not just return the first matched occurrence?".
Consider that if the value in the first "bar section" is empty, it will return the value of the next bar section.
Example :
text|p1_1_1120170AS074192161A0Z20||#|text|p1_2_1120170AS074192161A0Z20|Huawei|#|text . . .
And I don't want that. It should just return nothing instead (no match).
What's the correct regex to acquire such a match?
Thank you :).
This data looks more structured than you are giving it credit for. A regular expression is great for e.g. extracting email addresses from unstructured text, but this data seems delimited in a straightforward manner.
If there is structure it will be simpler, faster, and more reliable to just split on | and perhaps #:
text = 'text|p1_1_1120170AS074192161A0Z20|C M E - Rectifier|#|text|p1_2_1120170AS074192161A0Z20|Huawei|#|text|p1_3_1120170AS074192161A0Z20|Rectifier Module 3KW|#|text|p1_4_11201...'
lines = text.split('|#|')
words = [line.split('|')[-1] for line in lines]
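Building on the same splitting idea, here is a hedged sketch of how the p1_1 value alone might be fetched, returning nothing when it is empty. It assumes every record has the shape text|key|value; the function name first_value and the shortened sample strings are hypothetical.

```python
def first_value(text, key_prefix="p1_1_"):
    """Return the value stored under the first key starting with key_prefix."""
    for record in text.split("|#|"):
        fields = record.split("|")
        # Assumed record layout: text|<key>|<value>
        if len(fields) >= 3 and fields[1].startswith(key_prefix):
            return fields[2] or None  # empty value -> None
    return None

filled = ("text|p1_1_1120170AS074192161A0Z20|C M E - Rectifier|#|"
          "text|p1_2_1120170AS074192161A0Z20|Huawei")
empty = ("text|p1_1_1120170AS074192161A0Z20||#|"
         "text|p1_2_1120170AS074192161A0Z20|Huawei")
print(first_value(filled))
print(first_value(empty))
```

The trailing underscore in the prefix keeps keys such as p1_10_ from being mistaken for p1_1_.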
import re
doc = 'text|p1_1_1120170AS074192161A0Z20|C M E - Rectifier|#|text|p1_2_1120170AS074192161A0Z20|Huawei|#|text|...'
re.findall(r'[^|]+(?=\|\#\|)', doc)
In the re expression:
[^|]+ finds chunks of text not containing the separator
(?=...) is a "lookahead assertion" (it checks that the text follows but does not include it in the result)
About the pattern you tried
This part of the pattern [^|]+ states to match any char other than |
Then (?=\|\#\|.*|$) asserts using a positive lookahead what is on the right is |#|.* or the end of the string.
The positive lookbehind (?<=p1\_1\_.*) asserts that what is on the left is p1_1_ followed by any char except a newline, zero or more times, using a quantifier in the lookbehind.
As the pattern is not anchored, you will get all the matches for this logic, because the p1_1_ assertion is true for each of them: p1_1_ precedes all the |#| parts.
Note that using the quantifier in the lookbehind will require the pypi regex module.
If you want the first match using a quantifier in the positive lookbehind you could for example use an anchor in combination with a negative lookahead to not cross the |#| or match || in case it is empty:
(?<=^.*?p1_1_(?:(?!\|#\|).|\|{2})*\|)[^|]+(?=\|\#\||$)
You could use your original pattern using re.search getting the first match.
(?<=p1_1_.*)[^|]+(?=\|\#\||$)
Note that you don't have to escape the underscore in your original pattern and you can omit .* from the positive lookahead
But to get the first match you don't have to use a positive lookbehind. You could also use an anchor, match and capturing group.
^.*?p1_1_(?:(?!\|#\|).|\|{2})*\|([^|]+)(?:\|#\||$)
^ Start of string
.*? Match any char except a newline, as few times as possible
p1_1_ Match literally
(?: Non capturing group
(?!\|#\|).|\|{2} If what is on the right is not |#| match any char, or match | two times
)* Close non capturing group and repeat 0+ times
\| Match |
( Capture group 1 (this will contain your value)
[^|]+ Match 1+ times any char except |
) Close group
(?:\|#\||$) Match either |#| or assert the end of the string
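As a sketch of how the capture-group version might be called with the standard re module, here it is on a shortened, single-line sample based on the question's data. It shows only the case where the p1_1 value is present, and the variable names are mine.

```python
import re

# Shortened one-line sample based on the data in the question.
text = ("text|p1_1_1120170AS074192161A0Z20|C M E - Rectifier|#|"
        "text|p1_2_1120170AS074192161A0Z20|Huawei|#|text|site_id|20MJK110")

pattern = re.compile(r"^.*?p1_1_(?:(?!\|#\|).|\|{2})*\|([^|]+)(?:\|#\||$)")

# search returns only the first match; the value sits in capture group 1.
m = pattern.search(text)
first_value = m.group(1) if m else None
print(first_value)
```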
I am trying to match strings with underscores. There are underscores throughout the strings, but I want to match only the strings that have more text after the last underscore. Let me provide an example:
s = "hello_world"
s1 = "hello_world_foo"
s2 = "hello_world_foo_boo"
In my case I only want to capture s1 and s2.
I started with the following, but can't really figure out how to do the match so that it captures only strings that have more text after hello_world's underscore.
rgx = re.compile(r'(?P<firstpart>\w+)[_]+(?P<secondpart>\w+)$', re.I | re.U)
Try this:
reobj = re.compile("^(?P<firstpart>[a-z]+)_(?P<secondpart>[a-z]+)_(?P<lastpart>.*?)$", re.IGNORECASE)
result = reobj.findall(subject)
Regex Explanation
^(?P<firstpart>[a-z]+)_(?P<secondpart>[a-z]+)_(?P<lastpart>.*?)$
Options: case insensitive
Assert position at the beginning of the string «^»
Match the regular expression below and capture its match into backreference with name “firstpart” «(?P<firstpart>[a-z]+)»
Match a single character in the range between “a” and “z” «[a-z]+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Match the character “_” literally «_»
Match the regular expression below and capture its match into backreference with name “secondpart” «(?P<secondpart>[a-z]+)»
Match a single character in the range between “a” and “z” «[a-z]+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Match the character “_” literally «_»
Match the regular expression below and capture its match into backreference with name “lastpart” «(?P<lastpart>.*?)»
Match any single character that is not a line break character «.*?»
Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Assert position at the end of the string (or before the line break at the end of the string, if any) «$»
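A runnable sketch of the snippet above, using the three sample strings from the question in place of subject (the loop and the printing are my additions):

```python
import re

reobj = re.compile(
    r"^(?P<firstpart>[a-z]+)_(?P<secondpart>[a-z]+)_(?P<lastpart>.*?)$",
    re.IGNORECASE,
)

for subject in ["hello_world", "hello_world_foo", "hello_world_foo_boo"]:
    # findall returns one tuple per match, holding the three named groups.
    print(subject, "->", reobj.findall(subject))
```

Since lastpart uses .*?, which may match nothing, a string such as hello_world_ with a trailing underscore would also be accepted by this pattern.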
If I understand what you are asking for (you want to match strings with more than one underscore and following text):
rgx = re.compile(r'(?P<firstpart>\w+)[_]+(?P<secondpart>\w+)_[^_]+$', re.I | re.U)
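A quick check of this pattern against the three strings from the question, as a sketch written for Python 3, where the u string prefix is dropped and Unicode matching is already the default for str patterns:

```python
import re

rgx = re.compile(r"(?P<firstpart>\w+)[_]+(?P<secondpart>\w+)_[^_]+$", re.I)

for s in ["hello_world", "hello_world_foo", "hello_world_foo_boo"]:
    # match anchors at the start; the trailing $ anchors the end.
    print(s, "->", bool(rgx.match(s)))
```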