Regex doesn't match what it should

I'm trying to filter anything except alphanumeric characters, Russian letters, line breaks, spaces, commas, dots, question marks, exclamation marks, slashes, #, #, colons and parentheses.
My code so far:
re.sub(r"[^А-я\w\d"+"\n"+" ,.?!ё/##:()]", "", string)
However, it does not clear the following string: "𝕾𝖍𝖎𝖗𝖔𝖓".
Why not, and how can I make it do so?
Edit: I forgot to mention that it works as expected at https://regexr.com/

If you inspect the string, you will see that "𝕾𝖍𝖎𝖗𝖔𝖓" consists of characters belonging to the \p{L} (letter) category. Your regex starts with [^А-я\w\d, which means it matches any chars except those in the range А-я (Russian letters other than ё, which you add a bit later, and Ё) and except any Unicode letters and digits, because in Python 3 \w matches, by default, any Unicode alphanumeric chars and the underscore. That is why these letters are never removed.
It appears you only want to keep Russian and English letters, so use the corresponding ranges:
r"[^А-ЯЁа-яёA-Za-z0-9\n ,.?!/##:()]+"
It matches one or more chars other than
А-ЯЁа-яё - Russian letters
A-Za-z - ASCII letters
0-9 - ASCII digits
\n ,.?!/##:() - newline, space, comma, dot, question and exclamation marks, slash, hash, colon and round parentheses.
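As a quick check of the difference, here is a minimal sketch comparing the pattern from the question with the explicit ranges above. The sample text is made up for illustration.

```python
import re

# Sample text (made up for illustration): mathematical Fraktur letters plus normal text.
text = "𝕾𝖍𝖎𝖗𝖔𝖓 Привет, world!"

# Pattern from the question: \w is Unicode-aware in Python 3,
# so the Fraktur letters count as word characters and are kept.
loose = re.sub(r"[^А-я\w\d" + "\n" + " ,.?!ё/##:()]", "", text)

# Explicit ranges: only Russian letters, ASCII letters and ASCII digits are kept.
strict = re.sub(r"[^А-ЯЁа-яёA-Za-z0-9\n ,.?!/##:()]+", "", text)

print(repr(loose))   # the text is unchanged
print(repr(strict))  # the Fraktur letters are gone
```

Passing flags=re.ASCII to the original pattern would also stop \w from matching the Fraktur letters, since it restricts \w to [a-zA-Z0-9_].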

You can make the pattern match only the characters you need, instead of the characters you don't need.
This should work: [А-я\w\d\"+\"\n\"+\" ,.?!ё/##:()]

Related

Regular Expression in Python strings

I want to validate a string that satisfies the below three conditions using regular expression
The special characters allowed are (. , _ , - ).
Should contain only lower-case characters.
Should not start or end with special character.
To satisfy the above conditions, I have created a format as below
^[^\W_][a-z\.,_-]+
This pattern works fine up to the second character. However, it fails for the third and subsequent characters if those contain any special or upper-case characters.
Example:
The pattern works for the string S#yanthan but not for Sa#yanthan. I expect the pattern to work even if the third and subsequent characters contain special or upper-case characters. Can you suggest where this pattern goes wrong? Below is the code snippet.
import re

a = "Sayanthan"
# Raw string so that \W and \. reach the regex engine unchanged
exp = re.search(r"^[^\W_][a-z\.,_-]+", a)
if exp:
    print(True)
else:
    print(False)

Based on your initial rules I'd go with:
^[a-z](?:[.,_-]*[a-z])*$
See the online demo.
However, you mentioned in the comments:
"Also the third condition is "should not start with Special character" instead of "should not start or end with Special character""
In that case you could use:
^[a-z][-.,_a-z]*$
See the online demo
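To see how the two patterns differ, here is a small sketch that runs both against a few sample strings (the samples are made up for illustration):

```python
import re

# Both rules applied: no special character at the start or at the end.
strict = re.compile(r"^[a-z](?:[.,_-]*[a-z])*$")
# Relaxed rule from the comments: only the first character must be a letter.
relaxed = re.compile(r"^[a-z][-.,_a-z]*$")

samples = ["sayanthan", "say.an-than", "sayanthan_", "Sayanthan", "_sayanthan"]

# Map each sample to (passes strict, passes relaxed).
results = {s: (bool(strict.search(s)), bool(relaxed.search(s))) for s in samples}

for s, (ok_strict, ok_relaxed) in results.items():
    print(f"{s!r}: strict={ok_strict}, relaxed={ok_relaxed}")
```

Only "sayanthan_" is treated differently: the relaxed pattern accepts the trailing underscore, the strict one does not.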

The pattern that you tried ^[^\W_][a-z.,_-]+ starts with [^\W_] which will match any word char except an underscore, so it could also be an uppercase char.
Then [a-z.,_-]+ will match 1+ times any of the listed, which means the string can also end with a comma for example.
Looking at the conditions listed, you could use:
^[a-z](?:[a-z.,_-]*[a-z])?\Z
^ Start of string
[a-z] Match a lower case char a-z
(?: Non capture group
[a-z.,_-]*[a-z] Match 0+ occurrences of the listed ending with a-z
)? Close group and make it optional
\Z End of string
Regex demo
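One reason to prefer \Z over $ in Python: $ also matches just before a trailing newline, while \Z matches only at the very end of the string. A minimal sketch of the difference, using the pattern above:

```python
import re

with_dollar = re.compile(r"^[a-z](?:[a-z.,_-]*[a-z])?$")
with_z = re.compile(r"^[a-z](?:[a-z.,_-]*[a-z])?\Z")

# A string with a trailing newline: $ tolerates it, \Z does not.
dollar_newline = bool(with_dollar.search("abc\n"))
z_newline = bool(with_z.search("abc\n"))

# The optional group lets a single letter pass, but a trailing comma still fails.
z_single = bool(with_z.search("a"))
z_trailing_comma = bool(with_z.search("a,"))

print(dollar_newline, z_newline, z_single, z_trailing_comma)
```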

Python: How to Keep Alphanumeric English,Latin Characters in the regex?

I want my regex to keep all alphanumeric English and Latin characters.
re.sub('[^A-Za-z0-9-/().&\' ]+', '',"L'Oréal")
should leave L'Oréal unchanged.
Currently, it gives me L'Oral
Is there any Latin encoding that should be added?

You may use
re.sub(r"[^-/().&' \w]|_", "", s)
See the regex demo
The regex matches
[^-/().&' \w] - a negated character class matching any char but a word char, -, /, (, ), ., &, ' and space
| - or
_ - an underscore (it is part of \w, thus, it should be added as an alternative).
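A short sketch of this substitution in use. The helper name keep_word_chars and the sample strings are just for illustration; é is written as \u00e9 so that the precomposed character is used.

```python
import re

def keep_word_chars(s):
    """Remove everything except word chars, -, /, (, ), ., &, ' and space; also drop underscores."""
    return re.sub(r"[^-/().&' \w]|_", "", s)

brand = "L'Or\u00e9al"                            # L'Oréal with a precomposed é
cleaned = keep_word_chars(brand)                  # accented letter is kept
noisy = keep_word_chars("L'Or\u00e9al_\u00ae!")   # underscore, ® and ! are removed

print(cleaned, noisy)
```

Keep in mind that \w is Unicode-aware, so this keeps letters from every script (Cyrillic, Greek and so on), not only Latin ones.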

Why not add a Unicode range for all Latin characters to your regex?
r"[\u00C0-\u017F]"
This matches most accented letters of Latin-based alphabets (the letters of the Latin-1 Supplement and Latin Extended-A blocks; note that the range also contains the × and ÷ signs). From there, just add the rest of the characters you are looking for.
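For example, the range could be combined with the character class from the question roughly like this (a sketch, not the only way to write it; the sample strings are made up):

```python
import re

# Original class from the question plus the \u00C0-\u017F range.
# The hyphen is placed last so it is taken literally.
pattern = re.compile(r"[^A-Za-z0-9\u00C0-\u017F/().&' -]+")

kept = pattern.sub("", "L'Or\u00e9al")             # é (U+00E9) is inside the range
stripped = pattern.sub("", "L'Or\u00e9al\u2122")   # the trademark sign is removed
times_sign = pattern.sub("", "3\u00d74")           # caveat: × (U+00D7) is inside the range too

print(kept, stripped, times_sign)
```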

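I think this one will solve your problem (note that \p{M} and \P{M} are Unicode property classes, which Python's built-in re module rejects with a "bad escape" error; they require the third-party regex module):
re.sub('[(?>\P{M}\p{M}*)+]', '',"L'Oréal")
And the intended result is: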
L'Oréal

Python regex specific word with single quote at end

Searching a large syslog repo and need to get a specific word to match with a certain condition.
I'm using regex to compile a search for this word. I've read the Python docs on regex characters and I understand how to specify each criterion separately, but I'm somehow missing how to combine them all for my specific search. This is what I have so far, but it's not working...
p = re.compile(r"^'[A-Z]\w+'$")
match = p.search(syslogline)
The word is a username that is alphanumeric, always begins with an uppercase character (preceded by a blank space), can contain letters or digits, is 3-12 characters long and ends with a single quote.
An example would be: Epresley01' or J98473'

Brief
Based on your requirements (also stated below), your regex doesn't work because:
^' Asserts the position at the start of the line and ensures a ' is the first character of that line.
$ Asserts the position at the end of the line.
Having said that, you specify that the username is preceded by a space character (which isn't present in your pattern). Your pattern also checks for a ' at the start, which isn't the first character of the username. Given that you haven't given us a sample of your file, I can neither confirm nor deny that the line starts right before the username and ends right after it, but if that's not the case, the anchors ^ and $ are also not helping you here.
Requirements
The requirements below are simply copied from the OP's question (rewritten) to outline the username format. The username:
Is preceded by a space character.
Starts with an uppercase letter.
Contains chars or nums. I'm assuming here that chars actually means letters and that all letters in the username (including the uppercase starting character) are ASCII.
Is 3-12 characters in length (excluding the preceding space and the end character stated below).
Ends with an apostrophe character '.
Code
See regex in use here
(?<= )[A-Z][^\W_]{2,11}'
Explanation
(?<= ) Positive lookbehind ensuring what precedes is a space character
[A-Z] Match any uppercase ASCII letter
[^\W_]{2,11} Match between 2 and 11 word characters other than underscore _ (equivalent to a-zA-Z0-9 when \w is ASCII-only; in Python 3, \w is Unicode-aware by default unless the re.ASCII flag is set)
This appears a little confusing because it's actually a double negative. The set says match anything that's not in it. \W matches any non-word character, so negating it is like saying don't match non-word characters, which leaves the word characters. Adding _ to the negated set excludes the underscore as well.
' Match the apostrophe character ' literally
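Put together in Python, this could look like the sketch below. The log lines are hypothetical (no sample was given in the question), and re.ASCII is added so that [^\W_] really is limited to a-zA-Z0-9.

```python
import re

# re.ASCII restricts \w (and therefore [^\W_]) to ASCII letters and digits.
USERNAME = re.compile(r"(?<= )[A-Z][^\W_]{2,11}'", re.ASCII)

# Hypothetical syslog lines, made up for illustration.
line1 = "Jan  1 00:00:01 host app: login failed for user Epresley01' from 10.0.0.1"
line2 = "Jan  1 00:00:02 host app: account J98473' locked"
line3 = "Jan  1 00:00:03 host app: account Ab' locked"  # too short: only 2 characters

found1 = USERNAME.findall(line1)
found2 = USERNAME.findall(line2)
found3 = USERNAME.findall(line3)

print(found1, found2, found3)
```

The match includes the trailing apostrophe; wrap the username part in a capturing group if you want it without the quote.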

I think you can do it like this:
(Updated after the comment from @ctwheels)
See regex in use here
[A-Z][a-zA-Z0-9]{1,10}'
Explanation
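Match a whitespace (a literal space in front of the pattern)
Match an uppercase character [A-Z]
Match [a-zA-Z0-9]{1,10}, i.e. 1 to 10 more letters or digits (use {2,11} for a total length of 3-12)
Match an apostrophe '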
Demo

Python regular expression: how to exclude superstrings?

I want to find all appearances of "not", but not the ones that are part of the terms "not good" or "not bad".
For example, "not not good, not bad, not mine" will match the first and last "not".
How do I achieve that using the re package in python?

Use negative look-ahead assertion:
\bnot\b(?!\s+(?:good|bad))
This will match not, except when good or bad comes right after not in the string. I have added the word boundary \b to make sure we are matching the word not, rather than the not in nothing or knot.
\b is a word boundary. It checks that the character in front is a word character and the character after is not, or vice versa. A word character is normally a letter of the English alphabet (a-z, A-Z), a digit (0-9), or an underscore (_), but there can be more depending on the regex flavor.
(?!pattern) is syntax for zero-width negative look-ahead - it will check that from the current point, it cannot find the pattern specified ahead in the input string.
\s denotes whitespace character (space (ASCII 32), new line \n, tab \t, etc. - check the documentation for more information). If you don't want to match so arbitrarily, just replace \s with (space).
The + in \s+ matches one or more instances of the preceding token, in this case, it is whitespace character.
(?:pattern) is non-capturing group. There is no need to capture good and bad, so I specify so for performance.
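With the re package, the pattern can be tried on the example from the question like this (a minimal sketch):

```python
import re

NOT_ALONE = re.compile(r"\bnot\b(?!\s+(?:good|bad))")

text = "not not good, not bad, not mine"

# Start offsets of every "not" that is not followed by "good" or "bad".
starts = [m.start() for m in NOT_ALONE.finditer(text)]

# The word boundaries keep "nothing" and "knot" from matching.
words = NOT_ALONE.findall("nothing knot not")

print(starts, words)
```

Note that there is no \b after the alternation, so "not goodness" is excluded as well; add \b after the group if only the whole words good and bad should block a match.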

In regex, what does [\w*] mean?

What does this regex mean?
^[\w*]$

Quick answer: ^[\w*]$ will match a string consisting of a single character, where that character is alphanumeric (letters, numbers), an underscore (_) or an asterisk (*).
Details:
The "\w" means "any word character" which usually means alphanumeric (letters, numbers, regardless of case) plus underscore (_)
The "^" "anchors" to the beginning of a string, and the "$" "anchors" to the end of a string, which means that, in this case, the match must start at the beginning of a string and end at the end of the string.
The [] means a character class, which means "match any character contained in the character class".
It is also worth mentioning that the normal quoting and escaping rules for strings make it awkward to enter regular expressions (all the backslashes would need to be escaped with additional backslashes), so Python has a raw-string notation in which backslashes are left untouched, and that is what an "r" prefix in front of the string literal, as in r"^[\w*]$", is for.
Note: Normally an asterisk (*) means "0 or more of the previous thing" but in the example above, it does not have that meaning, since the asterisk is inside of the character class, so it loses its "special-ness".
For more information on regular expressions in Python, the two official references are the re module documentation and the Regular Expression HOWTO.
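A quick way to convince yourself is to try the pattern on a few strings (a minimal sketch; the sample strings are arbitrary):

```python
import re

pattern = re.compile(r"^[\w*]$")

samples = ["a", "7", "_", "*", "ab", "**", ""]

# True where the whole string is exactly one word character or one asterisk.
matches = {s: bool(pattern.search(s)) for s in samples}

for s, ok in matches.items():
    print(repr(s), ok)
```

"**" does not match, which shows that the asterisk inside the brackets is a literal character and not a quantifier.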

As exhuma said, \w is any word-class character (alphanumeric as Jonathan clarifies).
However because it is in square brackets it will match:
a single alphanumeric character OR
an asterisk (*)
So the whole regular expression matches:
the beginning of a line (^)
followed by either a single alphanumeric character or an asterisk
followed by the end of a line ($)
so the following would match:
blah
z <- matches this line
blah
or
blah
* <- matches this line
blah
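In Python specifically, ^ and $ only anchor at every line when the re.MULTILINE flag is set; without it they anchor at the start and end of the whole string. A small sketch with the example text above:

```python
import re

text = "blah\nz\nblah"

# Default mode: ^ and $ refer to the whole string, so nothing matches.
whole_string = re.findall(r"^[\w*]$", text)

# MULTILINE mode: ^ and $ match at every line, so the "z" line is found.
per_line = re.findall(r"^[\w*]$", text, re.MULTILINE)

# Same for the asterisk example.
star_line = re.findall(r"^[\w*]$", "blah\n*\nblah", re.MULTILINE)

print(whole_string, per_line, star_line)
```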

\w refers to a single alphanumeric character or the underscore, not to 0 or more of them. The * in your case is also inside the character class, so [\w*] would match any one of [a-zA-Z0-9_*] (the * is interpreted literally).
See http://www.regular-expressions.info/reference.html
To quote:
\d, \w and \s --- Shorthand character classes matching digits, word characters, and whitespace. Can be used inside and outside character classes.
Edit corrected in response to comment

From the beginning of the line, a single word character (letter, number, underscore) or a literal asterisk, until the end of the line. Because the * sits inside the square brackets, it is not the "any number of" quantifier.
Square brackets define a character class; round brackets ("(" and ")") are what you use if you want the matched text returned as a group.

\w is equivalent to [a-zA-Z0-9_]. I don't understand the * after it or the [] around it, because \w already is a class and * in class definitions makes no sense.

As said above, \w means any word character, so you could use it in a context like the one below
view.aspx?url=[\w]
which means a single word character is accepted as the value of the "url=" parameter (use [\w]+ to allow a whole word)
