regexp: match character group or end of line

How do you match ^ (beginning of line) and $ (end of line) inside [] (a character group)?
Simple example:
haystack string: zazty
rules:
match any "z" or "y"
if preceded by
an "a" or "b"; or
the beginning of the line.
pass:
match the first two "z"
A regexp that works is:
(?:^|[aAbB])([zZyY])
But I keep thinking it would be much cleaner with something that meant beginning/end of line inside the character group:
[^aAbB]([zZyY])
(This example assumes the ^ means beginning of line, and not what it really means there, which is negation of the character group.)
Note: I am using Python, but knowing how to do this in bash and vim would be good too.
Update: I read the manual again. It says that inside a set of characters everything loses its special meaning, except the character classes (e.g. \w).
Further down the list of character classes there is \A for the beginning of the string, but this does not work: [\AaAbB]([zZyY])
Any idea why?

You can't use ^ or $ as anchors within [] because inside a character class the only characters with special meaning are ^ (negation, when it comes first), - (range), ] and \ (plus the shorthand classes such as \w). \A and \Z are anchors, not character classes, so they don't count.
This holds for all the standard flavours of regex, so you're stuck with (^|[stuff]) and ($|[stuff]) (which aren't all that bad, really).
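As a minimal Python sketch of the alternation approach (the variable names are illustrative only), the following runs the question's pattern against the sample string and reports where the captured letters start:

```python
import re

# Pattern from the question: start of line OR one of a/A/b/B, then capture z/Z/y/Y.
pattern = re.compile(r"(?:^|[aAbB])([zZyY])")

haystack = "zazty"
# start(1) is the position of the captured letter, not of the preceding "a".
positions = [m.start(1) for m in pattern.finditer(haystack)]
print(positions)  # [0, 2]
```

The trailing "y" is skipped because it follows a "t", which is neither the start of the line nor one of the allowed letters.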

Concatenate the character 'a' to the beginning of the string. Then use [aAbB]([zZyY]).

Try this one:
(?<![^abAB])([yzYZ])
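The double negative works because a negative lookbehind also succeeds when there is nothing at all before the current position. A small Python sketch, assuming the same sample string as the question:

```python
import re

# "Not preceded by a character other than a/b/A/B": true after a/b/A/B
# and also at the very start of the string, where nothing precedes.
lookbehind = re.compile(r"(?<![^abAB])([yzYZ])")

hits = [(m.start(), m.group(1)) for m in lookbehind.finditer("zazty")]
print(hits)  # [(0, 'z'), (2, 'z')]
```

Note that this treats only the start of the string as a boundary. After a newline the preceding character is "\n", which the negated class does match, so the first letter of a later line would not be found this way.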

Why not try the escape character \? ([\^\$]) (Note that this matches the literal characters ^ and $, not the line anchors.)
UPDATE:
If you want to find all Zs and Ys preceded by "a" or by the beginning of the line, then you can use a positive lookbehind. There is probably no way to specify wildcards in character groups (because there the wildcards are treated as ordinary characters). (If there is, I would be pleased to know about it.)
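import java.util.regex.Matcher;
import java.util.regex.Pattern;
// LookbehindDemo is a placeholder class name so that the snippet compiles
public class LookbehindDemo {
    // Lookbehind: the letter must follow the start of the line or an "a"/"A"
    private static final Pattern PATTERN = Pattern.compile("(?<=(?:^|[aA]))([zZyY])");
    public static void main(String[] args) {
        Matcher matcher = PATTERN.matcher("zazty");
        while (matcher.find()) {
            System.out.println("matcher.group(0) = " + matcher.group(0));
            System.out.println("matcher.start() = " + matcher.start());
        }
    }
}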
Output:
matcher.group(0) = z
matcher.start() = 0
matcher.group(0) = z
matcher.start() = 2

Related

Expression in regular expression python

I would like to write a regular expression for formatting a text in which there can't be a { character unless it is preceded by a backslash \. The problem is that a backslash can escape itself, so I don't want to match \\{ for example, but I do want \\\{. So I want only an odd number of backslashes before a {. I can't just capture them in a group and check the number of backslashes afterwards, like this:
s = r"a wei\\\{rd thing\\\\\{"
matches = re.finditer(r"([^\{]|(\\+)\{)+", s)
for match in matches:
    if len(match.group(2)) % 2 == 0:  # skip an even number of backslashes
        continue
    do_some_things()
because group 2 can be matched more than once, so I can only access its last occurrence (in this case, \\\\\).
It would be really nice if we could just write something like "([^\{]|(\\+)(?if len(\2) / 2 == len(\2) // 2)\{)+" as the regular expression but, as far as I know, that is impossible.
How can I do this?

This matches an odd number of backslashes followed by a brace:
(?<!\\)(\\\\)*(\\\{)
Breakdown:
(?<!\\) - Not preceded by a backslash, to accommodate the next bit
This is called "negative lookbehind"
(\\\\)* - Zero or more pairs of backslashes
(\\\{) - A backslash then a brace
Matches:
\{
\\\{
\\\\\{
Non-matches:
\\{
\\\\{
\\\\\\{
This was partly inspired by Vadim Baratashvili's answer
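To check the pattern against the lists above, here is a short Python sketch. It uses raw strings so that each backslash in the source is one backslash in the text; the variable names are only illustrative:

```python
import re

# Odd number of backslashes before "{": no backslash behind, any number of
# backslash pairs, then one escaping backslash and the brace.
odd_escape = re.compile(r"(?<!\\)(\\\\)*(\\\{)")

should_match = [r"\{", r"\\\{", r"\\\\\{"]
should_not_match = [r"\\{", r"\\\\{", r"\\\\\\{"]

matched = [bool(odd_escape.search(s)) for s in should_match]
rejected = [odd_escape.search(s) is None for s in should_not_match]
print(matched, rejected)

# The sample string from the question contains two escaped braces.
sample = r"a wei\\\{rd thing\\\\\{"
found = [m.group(0) for m in odd_escape.finditer(sample)]
print(found)
```

Because finditer yields one match object per escaped brace, there is no need to recover repeated captures from a single group, which was the obstacle in the question.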

I think you can use this as a solution:
([^\\](\\\\){0,})(\{)
This checks that between the last character that is not a backslash and the { there are zero or more pairs of backslashes. If part of the text matches the pattern, we can replace it with the first group $1 (a character that is not a backslash plus zero or more pairs of backslashes), so we find and remove each unescaped {.
If we want to find an escaped {, we can use this expression:
([^\\](\\\\){0,})(\\\{) - the third group of the match is \{

Match a line if there is something before a group of characters, at the start of the line [duplicate]

The following should be matched:
AAA123
ABCDEFGH123
XXXX123
can I do: ".*123" ?

Yes, you can. That should work.
. = any char except newline
\. = the actual dot character
.? = .{0,1} = match any char except newline zero or one times
.* = .{0,} = match any char except newline zero or more times
.+ = .{1,} = match any char except newline one or more times

Yes that will work, though note that . will not match newlines unless you pass the DOTALL flag when compiling the expression:
Pattern pattern = Pattern.compile(".*123", Pattern.DOTALL);
Matcher matcher = pattern.matcher(inputStr);
boolean matchFound = matcher.matches();

Use the pattern . to match any character once, .* to match any character zero or more times, .+ to match any character one or more times.

The most common way I have seen to encode this is with a character class whose members form a partition of the set of all possible characters.
Usually people write that as [\s\S] (whitespace or non-whitespace), though [\w\W], [\d\D], etc. would all work.
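For comparison in Python, here is a minimal sketch of the three behaviours discussed so far: a plain dot, the DOTALL flag, and the [\s\S] partition class. The sample text is made up for illustration:

```python
import re

text = "AAA\n123"  # a newline sits between the prefix and "123"

plain = re.fullmatch(r".*123", text)               # "." stops at the newline
dotall = re.fullmatch(r".*123", text, re.DOTALL)   # "." now matches "\n" too
either = re.fullmatch(r"[\s\S]*123", text)         # class covers every character

print(plain is None, dotall is not None, either is not None)  # True True True
```

The character-class form has the advantage of not depending on a flag, so it behaves the same wherever the pattern is pasted.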

.* and .+ are for any chars except for new lines.
Double Escaping
In case you want to include newlines, the following expressions might also work for languages that require double escaping, such as Java or C++:
[\\s\\S]*
[\\d\\D]*
[\\w\\W]*
for zero or more times, or
[\\s\\S]+
[\\d\\D]+
[\\w\\W]+
for one or more times.
Single Escaping:
Double escaping is not required in some languages, such as C#, PHP, Ruby, Perl, Python and JavaScript:
[\s\S]*
[\d\D]*
[\w\W]*
[\s\S]+
[\d\D]+
[\w\W]+
Test
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class RegularExpression{
public static void main(String[] args){
final String regex_1 = "[\\s\\S]*";
final String regex_2 = "[\\d\\D]*";
final String regex_3 = "[\\w\\W]*";
final String string = "AAA123\n\t"
+ "ABCDEFGH123\n\t"
+ "XXXX123\n\t";
final Pattern pattern_1 = Pattern.compile(regex_1);
final Pattern pattern_2 = Pattern.compile(regex_2);
final Pattern pattern_3 = Pattern.compile(regex_3);
final Matcher matcher_1 = pattern_1.matcher(string);
final Matcher matcher_2 = pattern_2.matcher(string);
final Matcher matcher_3 = pattern_3.matcher(string);
if (matcher_1.find()) {
System.out.println("Full Match for Expression 1: " + matcher_1.group(0));
}
if (matcher_2.find()) {
System.out.println("Full Match for Expression 2: " + matcher_2.group(0));
}
if (matcher_3.find()) {
System.out.println("Full Match for Expression 3: " + matcher_3.group(0));
}
}
}
Output
Full Match for Expression 1: AAA123
ABCDEFGH123
XXXX123
Full Match for Expression 2: AAA123
ABCDEFGH123
XXXX123
Full Match for Expression 3: AAA123
ABCDEFGH123
XXXX123

There are lots of sophisticated regex testing and development tools, but if you just want a simple test harness in Java, here's one for you to play with:
String[] tests = {
"AAA123",
"ABCDEFGH123",
"XXXX123",
"XYZ123ABC",
"123123",
"X123",
"123",
};
for (String test : tests) {
System.out.println(test + " " +test.matches(".+123"));
}
Now you can easily add new testcases and try new patterns. Have fun exploring regex.
See also
regular-expressions.info/Tutorial

No, * will match zero-or-more characters. You should use +, which matches one-or-more instead.
This expression might work better for you: [A-Z]+123
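The difference between * and + shows up only when nothing precedes 123. A rough Python equivalent of the Java harness above, using re.fullmatch because it must match the whole string just as String.matches does:

```python
import re

tests = ["AAA123", "ABCDEFGH123", "XXXX123", "XYZ123ABC", "123123", "X123", "123"]

# fullmatch requires the whole string to match, like Java's String.matches.
star = {t: bool(re.fullmatch(r".*123", t)) for t in tests}
plus = {t: bool(re.fullmatch(r".+123", t)) for t in tests}

for t in tests:
    print(f"{t:12} .*123={star[t]!s:5} .+123={plus[t]}")
```

Only the bare "123" is treated differently: .*123 accepts it, while .+123 rejects it because at least one character must come first.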

Specific solution to the example problem:
[A-Z]*123$ will match 123, AAA123 and ASDFRRF123. If you need at least one character before 123, use [A-Z]+123$.
General solution to the question (how to match "any character" in a regular expression):
If you are looking for anything including whitespace, you can try [\w\W]{min_char_to_match,} (a | inside a character class is just a literal |, so it is not needed there).
If you are trying to match anything except whitespace, you can try \S{min_char_to_match,}.

Try the regex .{3,}. This will match a run of three or more characters, none of which is a newline.

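In JavaScript, [^] matches any character, including newline. [^CHARS] matches all characters except for those in CHARS, so if CHARS is empty, it matches all characters. This is not portable: other flavours, Python's re among them, reject an empty negated class as a syntax error.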
JavaScript example:
/a[^]*Z/.test("abcxyz \0\r\n\t012789ABCXYZ") // Returns ‘true’.

I like the following:
[!-~]
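This matches the printable ASCII characters from ! (0x21) to ~ (0x7E): punctuation and other special characters as well as the normal A-Z, a-z and 0-9. It does not match the space character or anything outside ASCII.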
https://www.w3schools.com/charsets/ref_html_ascii.asp
E.g. faker.internet.password(20, false, /[!-~]/)
Will generate a password like this: 0+>8*nZ\\*-mB7Ybbx,b>
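To see exactly which characters the range covers, here is a small Python sketch that tests every ASCII code point against the class:

```python
import re

printable = re.compile(r"[!-~]")

# Collect every ASCII character that the class accepts.
accepted = [chr(c) for c in range(128) if printable.fullmatch(chr(c))]

print(len(accepted))              # 94
print(accepted[0], accepted[-1])  # ! ~
print(printable.fullmatch(" "))   # None: the space is not in the range
```

The range is therefore a convenient shorthand for "visible ASCII", but not a way to match any character at all.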

This worked for me. A dot does not always mean any character: it matches line terminators only in single-line (DOTALL) mode. In Java, \p{all} should match any character:
String value = "|°¬<>!\"#$%&/()=?'\\¡¿/*-+_#[]^^{}";
String expression = "[a-zA-Z0-9\\p{all}]{0,50}";
if(value.matches(expression)){
System.out.println("true");
} else {
System.out.println("false");
}

Regex to check if it is exactly one single word

I am trying to match strings against a wildcard pattern.
Please look at this carefully:
* (star) means exactly one word.
This is not a regex pattern; it is a convention.
So, given patterns like these:
*.key - '.key' is preceded by exactly one word (a word containing no dots)
*.key.* - '.key.' is preceded and followed by exactly one word containing no dots
key.* - 'key.' precedes exactly one word
So,
"door.key" matches "*.key"
"brown.door.key" doesn't match "*.key"
"brown.key.door" matches "*.key.*"
but "brown.iron.key.door" doesn't match "*.key.*"
So, when I encounter a '*' in the pattern, I have to replace it with a regex that means exactly one word (a-zA-Z0-9_). Can anyone please help me do this in Python?

To convert your pattern to a regexp, you first need to make sure each character is interpreted literally and not as a special character. We can do that by inserting a \ in front of any re special character. Those characters can be obtained through sre_parse.SPECIAL_CHARS.
Since you have a special meaning for *, we do not want to escape that one but instead replace it by \w+.
Code
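import sre_parse  # note: this module is deprecated since Python 3.11
def convert_to_regexp(pattern):
    # Escape every re special character except '*', which has its own meaning here
    special_characters = set(sre_parse.SPECIAL_CHARS)
    special_characters.remove('*')
    safe_pattern = ''.join(['\\' + c if c in special_characters else c for c in pattern])
    # '*' stands for exactly one word
    return safe_pattern.replace('*', '\\w+')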
Example
import re
pattern = '*.key'
r_pattern = convert_to_regexp(pattern) # '\\w+\\.key'
re.match(r_pattern, 'door.key') # Match
re.match(r_pattern, 'brown.door.key') # None
And here is an example with escaped special characters
pattern = '*.(key)'
r_pattern = convert_to_regexp(pattern) # '\\w+\\.\\(key\\)'
re.match(r_pattern, 'door.(key)') # Match
re.match(r_pattern, 'brown.door.(key)') # None
Sidenote
If you intend looking for the output pattern with re.search or re.findall, you might want to wrap the re pattern between \b boundary characters.
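As a variation on the conversion above, here is a sketch that avoids sre_parse by splitting the pattern on '*' and escaping each literal piece with re.escape. The helper names wildcard_to_regex and wildcard_match are hypothetical, not part of any library:

```python
import re

def wildcard_to_regex(pattern):
    """Hypothetical helper: '*' becomes one word, everything else is literal."""
    # Escape each literal piece, then join the pieces with \w+ where '*' stood.
    return r"\w+".join(re.escape(piece) for piece in pattern.split("*"))

def wildcard_match(pattern, text):
    # fullmatch anchors at both ends, so no explicit ^ or $ is needed.
    return re.fullmatch(wildcard_to_regex(pattern), text) is not None

print(wildcard_match("*.key", "door.key"))                # True
print(wildcard_match("*.key", "brown.door.key"))          # False
print(wildcard_match("*.key.*", "brown.key.door"))        # True
print(wildcard_match("*.key.*", "brown.iron.key.door"))   # False
```

Using re.fullmatch instead of re.match also rejects trailing text such as "door.keys", which a start-only match would accept.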

The conversion rules you are looking for go like this:
* is a word, thus: \w+
. is a literal dot: \.
key is and stays a literal string
plus, your samples indicate you are going to match whole strings, which in turn means your pattern should match from the ^ beginning to the $ end of the string.
Therefore, *.key becomes ^\w+\.key$, *.key.* becomes ^\w+\.key\.\w+$, and so forth..

^ anchors the match at the start of the string.
$ anchors the match at the end of the string.
\s means a whitespace character.
\S means a non-whitespace character.
+ means one or more repetitions of the preceding item.
Now, you want to match just a single word, meaning a string made up entirely of non-whitespace characters. So the required regular expression is:
^\S+$

You could do it with a combination of "any characters that aren't period" and the start/end anchors.
*.key would be ^[^.]*\.key$, and *.key.* would be ^[^.]*\.key\.[^.]*$
EDIT: As tripleee said, [^.]*, which matches "any number of characters that aren't periods," would allow whitespace characters (which of course aren't periods), so using \w+, "any number of 'word characters'" like the other answers is better.

Python 3 regular expression for $ but not $$ in a string

I need to match one of the following anywhere in a string:
${aa:bb[99]}
${aa:bb}
${aa}
but not:
$${aa:bb[99]}
$${aa:bb}
$${aa}
my python 3 regex is:
pattern = r"[^\$|/^]\$\{(?P<section>[a-zA-Z]+?\:)?(?P<key>[a-zA-Z]+?)(?P<value>\[[0-9]+\])?\}"
What I'm looking for is the proper way to say "not $, or the beginning of the string". The block r"[^\$|/^]" properly detects all cases, but fails if the token starts at the first character of my string.
I tried, without success:
r"[^\$|\b]...
r"[^\$|\B]...
r"[^\$]...
r"[^\$|^]
Any suggestion?

Use a negative lookbehind:
(?<!\$)
and then follow it by the thing you actually want to match. This will ensure that the thing you actually want to match is not preceded by a $ (i.e. not preceded by a match for \$):
(?<!\$)\$\{(?P<section>[a-zA-Z]+?\:)?(?P<key>[a-zA-Z]+?)(?P<value>\[[0-9]+\])?\}
^      ^
|      |
|      +--- The dollar sign you actually want to match
|
+--- The possible second preceding dollar sign you want to exclude
(?<!...)
Matches if the current position in the string is not preceded
by a match for .... This is called a negative lookbehind assertion.
Similar to positive lookbehind assertions, the contained pattern must
only match strings of some fixed length and shouldn’t contain group
references. Patterns which start with negative lookbehind assertions
may match at the beginning of the string being searched.
https://docs.python.org/3/library/re.html
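A minimal Python sketch of the lookbehind in action, using the pattern from this answer on a made-up sample line:

```python
import re

token = re.compile(
    r"(?<!\$)\$\{(?P<section>[a-zA-Z]+?\:)?(?P<key>[a-zA-Z]+?)(?P<value>\[[0-9]+\])?\}"
)

# The first token sits at position 0, which the lookbehind allows.
text = "${aa} then ${aa:bb[99]} but not $${cc:dd}"
found = [m.group(0) for m in token.finditer(text)]
print(found)  # ['${aa}', '${aa:bb[99]}']

first = token.search("${aa:bb[99]}")
print(first.groupdict())
```

Since the lookbehind consumes no characters, m.start() already points at the dollar sign and no offset correction is needed afterwards.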

You can use a negative lookbehind (?<!\$) to say "not preceded by $":
(?<!\$)\${[^}]*}
I have simplified the part between the brackets a bit to focus on the "one and only one $ part".

Thank you Amber for the ideas. I followed the train of thought you suggested, using a negative lookbehind. I tried them all at https://regex101.com/r/G2n0cO/1/. The only one that succeeded almost perfectly is:
(?:^|[^\$])\${(?:(?P<section>[a-zA-Z0-9\-_]+?)\:)??(?P<key>[a-zA-Z0-9\-_]+?)(?:\[(?P<index>[0-9]+?)\])??\}
I still had to add a check to skip the leading non-dollar character; see the end of the sample below. For the record, I kept a few of the iterations I made since I posted this question:
# keep tokens ${[section:][key][\[index\]]} and skip false ones
# pattern = r"\$\{((?P<section>.+?)\:)?(?P<key>.+?)(\[(?P<index>\d+?)\])+?\}"
# pattern = r'\$\{((?P<section>\S+?)\:)??(?P<key>\S+?)(\[(?P<index>\d+?)\])?\}'
# pattern = r'\$\{((?P<section>[a-zA-Z0-9\-_]+?)\:)??(?P<key>[a-zA-Z0-9\-_]+?)(\[(?P<index>[0-9]+?)\])??\}'
pattern = r'(?:^|[^\$])\${(?:(?P<section>[a-zA-Z0-9\-_]+?)\:)??(?P<key>[a-zA-Z0-9\-_]+?)(?:\[(?P<index>[0-9]+?)\])??\}'
analyser = re.compile(pattern)
mo = analyser.search(value, 0)
log.debug(f'got match object: {mo}')
while mo is not None:
    log.debug(f'in while loop, level={level}')
    if level > MAX_LEVEL:
        raise RecursionError(f"too many recursive calls to _substiture_text() while processing '{value}'.")
    else:
        level += 1
    start = mo.start()
    end = mo.end()
    # re also captured the first non $ sign symbol
    if value[start] != '$':
        start += 1

Regular expression with only numbers after it

I'm fairly new to using regular expressions in general. And I'm having trouble coming up with one that will suit my purpose.
I've tried this
line1 = 'REQ-1234'
match = re.match(r'^REQ-\d', line1, re.I)
This will work as long as the string is not something like
'REQ-1234 and then there is more stuff'
Is there a way to specify that there must not be anything after 'REQ-' except numbers? The other requirement is that 'REQ-1234' must be the only thing in the string. I think the caret symbol takes care of that though.

You need to add a + quantifier after \d to match 1 or more digits, and then add $ anchor to require the end of string position after these digits:
match = re.match(r'REQ-\d+$', line1, re.I)
                         ^^
Note that ^ is redundant since you are using re.match that anchors the match at the string start.
To match a req- that may be followed with digits, replace + (1 or more repetitions) with * quantifier (0 or more repetitions).
Note that with Python 3, you may use re.fullmatch without explicit anchors, r'REQ-\d+' or r'REQ-\d*' will do.
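A short Python sketch of the re.fullmatch variant, using the strings from the question plus a few extra cases made up for illustration:

```python
import re

req = re.compile(r"REQ-\d+", re.I)

print(bool(req.fullmatch("REQ-1234")))                               # True
print(bool(req.fullmatch("req-42")))                                 # True
print(bool(req.fullmatch("REQ-1234 and then there is more stuff")))  # False
print(bool(req.fullmatch("REQ-")))                                   # False

# "$" also matches just before a trailing newline; fullmatch does not.
print(bool(re.match(r"REQ-\d+$", "REQ-1234\n", re.I)))  # True
print(bool(req.fullmatch("REQ-1234\n")))                # False
```

The last two lines show one practical difference: if the input may carry a trailing newline and that should count as "something after the digits", fullmatch is the stricter choice.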
