I tried to compose patten with regex, and tried to validate multiple strings. However, seems my patterns fine according to regex documentation, but some reason, some invalid string is not validated correctly. Can anyone point me out what is my mistakes here?
test use case
this is test use case for one input string:
import re
usr_pat = r"^\$\w+_src_username_\w+$"
u_name='$ini_src_username_cdc_char4ec_pits'
m = re.match(usr_pat, u_name, re.M)
if m:
print("Valid username:", m.group())
else:
print("ERROR: Invalid user_name:\n", u_name)
I am expecting this return error because I am expecting input string must start with $ sign, then one string _\w+, then _, then src, then _, then user_name, then _, then end with only one string \w+. this is how I composed my pattern and tried to validate the different input strings, but some reason, it is not parsed correctly. Did I miss something here? can anyone point me out here?
desired output
this is valid and invalid input:
valid:
$ini_src_usrname_ajkc2e
$ini_src_password_ajkc2e
$ini_src_conn_url_ajkc2e
invalid:
$ini_src_usrname_ajkc2e_chan4
$ini_src_password_ajkc2e_tst1
$ini_smi_src_conn_url_ajkc2e_tst2
ini_smi_src_conn_url_ajkc2e_tst2
$ini_src_usrname_ajkc2e_chan4_jpn3
according to regex documentation, r"^\$\w+_src_username_\w+$" this should capture the logic that I want to parse, but it is not working all my test case. what did I miss here? thanks
The \w character class also matches underscores and numbers:
Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore. If the ASCII flag is used, only [a-zA-Z0-9_] is matched.
(https://docs.python.org/3/library/re.html#regular-expression-syntax).
So the final \w+ matches the entirety of cdc_char4ec_pits
I think you are looking for [a-zA-Z0-9] which will not match underscores.
usr_pat = r"^\$[a-zA-Z0-9]+_src_username_[a-zA-Z0-9]+$"
\w+
First: \w means that capture:
1- one letter from a to z, or from A to Z
OR
2- one number from 0 to 9
OR
3- an underscore(_)
Second: The plus(+) sign after \w means that matches the previous token between one and unlimited times.
So if my regex pattern is: r"^\$\w+$"
It would match the string: '$ini_src_username_cdc_char4ec_pits'
1- The ^\$ will match the dollar sign at the beginning of the string $
2- \w+ at first it will match the character i of the word ini and because of the + sign it will continue to match the character n and the second i. After that the underscore exists after the word ini will be matched as well, this is because \w matches an underscore not just a number or a letter, the word src will be matched too, the underscore after the word src will be matched, the username word will be matched too and the whole string will be matched.
You mentioned the word "string", if you mean letters and numbers such as : "bla123", "123455" or "BLAbla", then you can use something like [a-zA-Z0-9]+ instead of \w+.
Related
I have specific patterns which composed of string, numbers and special character in specific order. I would like to check input string is in the list of pattern that I created and print error if seeing incorrect input. To do so, I tried of using regex but my code is not neat enough. I am wondering if someone help me with this.
use case
I have input att2_epic_app_clm1_sub_valid, where I split them by _; here is list of pattern I am expecting to check and print error if not match.
Rule:
input should start with att and some number like [att][0-6]*, or [ptt][0-6]; after that it should be continued at either epic or semi, then it should be continued with [app][0-6] or [app][0-6_][clm][0-9_]+[sub|sup]; then it should end with [valid|Invalid]
so I composed this pattern with re but when I passed invalid input, it is not detected and I expect error instead.
import re
acceptable_pattern=re.compile(r'([att]+[0-6_])(epic|semi_)([app]+[0-6_]+[clm]+[0-6_])([sub|sup_])([valid|invalid]))'
input='att1_epic_app2_clm1_sub_valid' # this is valid string
wlist=input.split('_')
for each in wlist:
if any(ext in each for ext in acceptable_pattern):
print("valid")
else:
print("invalid")
this is not quite working because I have to check the string from beginning to end where split the string by _ where each new string much match of of the predefined rule such as:
input string should start with att|ptt which end with between 1-6; then next new word either epic or semi; then it should be app or app1~app6 or app{1_6}clm{1~6}{sub|sup_}; then string end with {valid|invalid};
how should I specify those rules by using re.compile to check pattern in input string and raise error if it is not sequentially? How should we do this in python? any quick way of making this happen?
Instead of using split, you could consider writing a pattern that validates the whole string.
If I am reading the requirements, you might use:
^[ap]tt[0-6]_(?:epic|semi)_app(?:[1-6]|[1-6_]clm[0-9]*_su[bp])?_valid$
^ Start of string
[ap]tt[0-6] match att or ptt and a digit 0-6
_(?:epic|semi) Match _epic or _semi
_app Match literally
(?: Non capture group for the alternation
[1-6] Match a digit 1-6
| Or
[1-6_]clm[0-9]*_su[bp] Match a digit 1-6 or _, then clm followed by optional digit 0-9 and then _sub or _sup
)? Close the non capture group and make it optional
_valid Match literally
$ End of string
See a regex demo.
If the string can also start with dev then you can use an alternation:
^(?:[ap]tt|dev)[0-6]_(?:epic|semi)_app(?:[1-6]|[1-6_]clm[0-9]*_su[bp])?_valid$
See another regex demo.
Then you can check if there was a match:
import re
pattern = r"^(?:[ap]tt|dev)[0-6]_(?:epic|semi)_app(?:[1-6]|[1-6_]clm[0-9]*_su[bp])?_valid$"
strings = [
"att2_epic_app_clm1_sub_valid",
"att12_epic_app_clm1_sub_valid",
"att2_epic_app_valid",
"att2_epic_app_clm1_sub_valid"
]
for s in strings:
m = re.match(pattern, s, re.M)
if m:
print("Valid: " + m.group())
else:
print("Invalid: " + s)
Output
Valid: att2_epic_app_clm1_sub_valid
Invalid: att12_epic_app_clm1_sub_valid
Valid: att2_epic_app_valid
Valid: att2_epic_app_clm1_sub_valid
I am basically trying to match string pattern(wildcard match)
Please carefully look at this -
*(star) - means exactly one word .
This is not a regex pattern...it is a convention.
So,if there patterns like -
*.key - '.key.' is preceded by exactly one word(word containing no dots)
*.key.* - '.key.' is preceded and succeeded by exactly one word having no dots
key.* - '.key' preceeds exactly one word .
So,
"door.key" matches "*.key"
"brown.door.key" doesn't match "*.key".
"brown.key.door" matches "*.key.*"
but "brown.iron.key.door" doesn't match "*.key.*"
So, when I encounter a '*' in pattern, I have replace it with a regex so that it means it is exactly one word.(a-zA-z0-9_).Can anyone please help me do this in python?
To convert your pattern to a regexp, you first need to make sure each character is interpreted literally and not as a special character. We can do that by inserting a \ in front of any re special character. Those characters can be obtained through sre_parse.SPECIAL_CHARS.
Since you have a special meaning for *, we do not want to escape that one but instead replace it by \w+.
Code
import sre_parse
def convert_to_regexp(pattern):
special_characters = set(sre_parse.SPECIAL_CHARS)
special_characters.remove('*')
safe_pattern = ''.join(['\\' + c if c in special_characters else c for c in pattern ])
return safe_pattern.replace('*', '\\w+')
Example
import re
pattern = '*.key'
r_pattern = convert_to_regexp(pattern) # '\\w+\\.key'
re.match(r_pattern, 'door.key') # Match
re.match(r_pattern, 'brown.door.key') # None
And here is an example with escaped special characters
pattern = '*.(key)'
r_pattern = convert_to_regexp(pattern) # '\\w+\\.\\(key\\)'
re.match(r_pattern, 'door.(key)') # Match
re.match(r_pattern, 'brown.door.(key)') # None
Sidenote
If you intend looking for the output pattern with re.search or re.findall, you might want to wrap the re pattern between \b boundary characters.
The conversion rules you are looking for go like this:
* is a word, thus: \w+
. is a literal dot: \.
key is and stays a literal string
plus, your samples indicate you are going to match whole strings, which in turn means your pattern should match from the ^ beginning to the $ end of the string.
Therefore, *.key becomes ^\w+\.key$, *.key.* becomes ^\w+\.key\.\w+$, and so forth..
Online Demo: play with it!
^ means a string that starts with the given set of characters in a regular expression.
$ means a string that ends with the given set of characters in a regular expression.
\s means a whitespace character.
\S means a non-whitespace character.
+ means 1 or more characters matching given condition.
Now, you want to match just a single word meaning a string of characters that start and end with non-spaced string. So, the required regular expression is:
^\S+$
You could do it with a combination of "any characters that aren't period" and the start/end anchors.
*.key would be ^[^.]*\.key, and *.key.* would be ^[^.]*\.key\.[^.]*$
EDIT: As tripleee said, [^.]*, which matches "any number of characters that aren't periods," would allow whitespace characters (which of course aren't periods), so using \w+, "any number of 'word characters'" like the other answers is better.
I have the following which is part of a line in a log file
-FDH-11 TIP: - 146/S Q: 48
which I want to match with regex . Is there a way to get the value of Q in the above input. I am not sure the length between -FDH- and Q: is always the same. So ideally if I found -FDH- and Q: then get the value of Q.
Code
See regex in use here
-FDH-.*?\bQ:\s*(\S+)
Explanation
-FDH- Match this literally
.*? Match any character any number of times, but as few as possible
\b Assert position as a word boundary
Q: Match this literally
\s* Match any number of whitespace characters
(\S+) Capture one or more non-whitespace characters into capture group 1
You probably don't need to use regex for this. If there is only one instance of "Q:" in the string, then you can just split on that and get the value after with the following:
str = "-FDH-11 TIP: - 146/S Q: 48"
parts = str.split("Q: ")
q_value = parts[1]
I have a string with a mix of uppercase and lowercase letters. I need want to find every lowercase letter that is surronded by 3 uppercase letters and extract it from the string.
For instance ZZZaZZZ I want to extract the a in the previous string.
I have written a script that is able to extract ZZZaZZZ but not the a alone. I know I need to use nested regex expressions to do this but I can not wrap my mind on how to implement this. The following is what I have:
import string, re
if __name__ == "__main__":
#open the file
eqfile = open("string.txt")
gibberish = eqfile.read()
eqfile.close()
r = re.compile("[A-Z]{3}[a-z][A-Z]{3}")
print r.findall(gibberish)
EDIT:
Thanks for the answers guys! I guess I should have been more specific. I need to find the lowercase letter that is surrounded by three uppercase letters that are exactly the same, such as in my example ZZZaZZZ.
You are so close! Read about the .group* methods of MatchObjects. For example, if your script ended with
r = re.compile("[A-Z]{3}([a-z])[A-Z]{3}")
print r.match(gibberish).group(1)
then you'd capture the desired character inside the first group.
To address the new constraint of matching repeated letters, you can use backreferences:
r = re.compile(r'([A-Z])\1{2}(?P<middle>[a-z])\1{3}')
m = r.match(gibberish)
if m is not None:
print m.group('middle')
That reads like:
Match a letter A-Z and remember it.
Match two occurrences of the first letter found.
Match your lowercase letter and store it in the group named middle.
Match three more consecutive instances of the first letter found.
If a match was found, print the value of the middle group.
r = re.compile("(?<=[A-Z]{3})[a-z](?=[A-Z]{3})")
(?<=...) indicates a positive lookbehind and (?=...) is a positive lookahead.
module re
(?=...)
Matches if ... matches next, but doesn’t consume any of the string. This is called a lookahead assertion. For example, Isaac (?=Asimov) will match 'Isaac ' only if it’s followed by 'Asimov'.
(?<=...)
Matches if the current position in the string is preceded by a match for ... that ends at the current position.
You need to capture the part of the string you are interested in with parentheses, and then access it with re.MatchObject#group:
r = re.compile("[A-Z]{3}([a-z])[A-Z]{3}")
m = r.match(gibberish)
if m:
print "Match! Middle letter was " + m.group(1)
else:
print "No match."
Example;
X=This
Y=That
not matching;
ThisWordShouldNotMatchThat
ThisWordShouldNotMatch
WordShouldNotMatch
matching;
AWordShouldMatchThat
I tried (?<!...) but seems not to be easy :)
^(?!This).*That$
As a free-spacing regex:
^ # Start of string
(?!This) # Assert that "This" can't be matched here
.* # Match the rest of the string
That # making sure we match "That"
$ # right at the end of the string
This will match a single word that fulfills your criteria, but only if this word is the only input to the regex. If you need to find words inside a string of many other words, then use
\b(?!This)\w*That\b
\b is the word boundary anchor, so it matches at the start and at the end of a word. \w means "alphanumeric character. If you also want to allow non-alphanumerics as part of your "word", then use \S instead - this will match anything that's not a space.
In Python, you could do words = re.findall(r"\b(?!This)\w*That\b", text).