Python: How to Keep Alphanumeric English, Latin Characters in the regex?

I want my regex to keep all alphanumeric English and Latin characters.
re.sub('[^A-Za-z0-9-/().&\' ]+', '',"L'Oréal")
should return L'Oréal unchanged.
Currently, it gives me L'Oral
Is there any Latin encoding that should be added?

You may use
re.sub(r"[^-/().&' \w]|_", "", s)
See the regex demo
The regex matches
[^-/().&' \w] - a negated character class matching any char but a word char, -, /, (, ), ., &, ' and space
| - or
_ - an underscore (it is part of \w, thus, it should be added as an alternative).
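As a quick check of the pattern above, here is a minimal sketch. The helper name clean_name and the second sample string are made up for illustration; only the pattern comes from the answer.

```python
import re

# Keep word characters (Unicode letters and digits in Python 3) plus
# - / ( ) . & ' and space. The "|_" alternative removes underscores,
# which \w would otherwise keep.
CLEANER = re.compile(r"[^-/().&' \w]|_")

def clean_name(s):
    """Remove every character that is not on the whitelist."""
    return CLEANER.sub("", s)

print(clean_name("L'Oréal"))            # L'Oréal
print(clean_name("L'Oréal_Paris #1!"))  # L'OréalParis 1
```

Because \w is Unicode-aware for str patterns in Python 3, the é survives without listing it explicitly.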

Why not add a Unicode range for all Latin characters to your regex?
r"[\u00C0-\u017F]"
This will match the accented letters used by Latin-based alphabets. From there, just add the rest of the characters you want to keep.
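One way to combine that range with the whitelist from the question might look like the sketch below. The helper name keep_latin and the extra sample strings are assumptions for illustration. Note that the range U+00C0 to U+017F also contains the × (U+00D7) and ÷ (U+00F7) signs, so those are kept as well.

```python
import re

# Whitelist from the question plus the Latin range suggested above.
# The hyphen is placed last in the class so it is taken literally.
LATIN_CLEANER = re.compile(r"[^A-Za-z0-9\u00C0-\u017F/().&' -]+")

def keep_latin(s):
    """Drop every run of characters that is not on the whitelist."""
    return LATIN_CLEANER.sub("", s)

print(keep_latin("L'Oréal"))     # L'Oréal
print(keep_latin("Škoda 100%"))  # Škoda 100
print(keep_latin("2×3"))         # 2×3 (the multiplication sign is inside the range)
```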

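Another idea is to use Unicode property escapes:
re.sub('[(?>\P{M}\p{M}*)+]', '',"L'Oréal")
Note, however, that Python's built-in re module does not support \p{...} and \P{...}, so this call raises re.error there instead of returning L'Oréal.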

Related

Regex doesn't match what it should

I'm trying to filter anything except alphanumeric characters, Russian letters, line breaks, spaces, commas, dots, question marks, exclamation marks, slashes, #, #, colons and parentheses.
My code so far:
re.sub(r"[^А-я\w\d"+"\n"+" ,.?!ё/##:()]", "", string)
However, it does not clear the following string: "𝕾𝖍𝖎𝖗𝖔𝖓".
Why not, and how can I make it do so?
Edit: I forgot to mention that it works as expected at https://regexr.com/
You may check the string at this link and you will see that the "𝕾𝖍𝖎𝖗𝖔𝖓" string consists of characters belonging to the \p{L} (letter) category. Your regex starts with [^А-я\w\d, which means it matches any chars except Russian letters (apart from ё, which you define a bit later, and Ё) and except any Unicode letters, because in Python 3, \w by default matches any Unicode alphanumeric chars and the underscore. Since \w covers these letters, the negated class never removes them.
It appears you only want to keep Russian and English letters, so use the corresponding ranges:
r"[^А-ЯЁа-яёA-Za-z0-9\n ,.?!/##:()]+"
It matches one or more chars other than
А-ЯЁа-яё - Russian letters
A-Za-z - ASCII letters
0-9 - ASCII digits
\n ,.?!/##:() - newline, space, comma, dot, question and exclamation marks, slash, ampersand, hash, colon and round parentheses.
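To see the difference between \w and the explicit ranges, here is a small sketch. The helper name strip_other and the sample sentence are hypothetical; the pattern is copied from the answer as shown.

```python
import re

fancy = "𝕾𝖍𝖎𝖗𝖔𝖓"

# \w is Unicode-aware in Python 3, so these mathematical letters count as
# word characters and a class built on \w leaves them untouched.
print(re.sub(r"[^\w]", "", fancy) == fancy)  # True

# Explicit ranges only whitelist Russian letters, ASCII letters and digits,
# and the listed punctuation.
KEEP = re.compile(r"[^А-ЯЁа-яёA-Za-z0-9\n ,.?!/##:()]+")

def strip_other(s):
    """Remove every run of characters outside the whitelist."""
    return KEEP.sub("", s)

print(strip_other("Привет, 𝕾𝖍𝖎𝖗𝖔𝖓! Hello"))  # Привет, ! Hello
```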
You can write the pattern so that it matches only the characters you need, instead of the ones you don't need.
This should work: [А-я\w\d\"+\"\n\"+\" ,.?!ё/##:()]

Python regex specific word with single quote at end

I'm searching a large syslog repo and need to match a specific word under a certain condition.
I'm using regex to compile a search for this word. I've read the Python docs on regex characters and I understand how to specify each criterion separately, but I'm somehow missing how to combine them all for my specific search. This is what I have so far, but it's not working...
p = re.compile("^'[A-Z]\w+'$")
match = re.search(p, syslogline, )
The word is a username that is alphanumeric, always begins with an uppercase character (preceded by a blank space), can contain letters or digits, is 3-12 characters long and ends with a single quote.
An example would be: Epresley01' or J98473'
Brief
Based on your requirements (also stated below), your regex doesn't work because:
^' Asserts the position at the start of the line and ensures a ' is the first character of that line.
$ Asserts the position at the end of the line.
Having said that, you specify that it's preceded by a space character (which isn't present in your pattern). Your pattern also checks for ' which isn't the first character of the username. Given that you haven't actually given us a sample of your file, I can't confirm nor deny that your string starts before the username and ends after it, but if that's not the case, the anchors ^$ are also not helping you here.
Requirements
The requirements below are simply copied from the OP's question (rewritten) to outline the username format. The username:
Is preceded by a space character.
Starts with an uppercase letter.
Contains chars or nums. I'm assuming here that chars actually means letters and that all letters in the username (including the uppercase starting character) are ASCII.
Is 3-12 characters in length (excluding the preceding space and the end character stated below).
Ends with an apostrophe character '.
Code
See regex in use here
(?<= )[A-Z][^\W_]{2,11}'
Explanation
(?<= ) Positive lookbehind ensuring what precedes is a space character
[A-Z] Match any uppercase ASCII letter
[^\W_]{2,11} Match any word character except underscore _ between 2 and 11 times (equivalent to a-zA-Z0-9 when \w is ASCII-only, e.g. with the re.ASCII flag; by default in Python 3, \w also covers non-ASCII letters and digits)
This appears a little confusing because it's actually a double negative. The set is negated, so it matches anything that's not in it. \W matches any non-word character, so negating it leaves only word characters. Adding _ to the negated set additionally excludes the underscore.
' Match the apostrophe character ' literally
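A minimal sketch of how this pattern could be applied to a log line. The sample syslog line and the helper name find_usernames are invented for illustration, and the re.ASCII flag is an assumption added here so that [^\W_] is limited to ASCII letters and digits, as the requirements suppose.

```python
import re

# re.ASCII restricts \w (and therefore [^\W_]) to a-z, A-Z and 0-9.
USERNAME = re.compile(r"(?<= )[A-Z][^\W_]{2,11}'", re.ASCII)

def find_usernames(line):
    """Return every username-like token (including the trailing quote)."""
    return USERNAME.findall(line)

# Hypothetical log line, not taken from the question.
line = "Mar  3 10:15:02 host1 sshd[411]: Failed login for user Epresley01' from 10.0.0.5"

print(find_usernames(line))                      # ["Epresley01'"]
print(find_usernames("user J98473' logged in"))  # ["J98473'"]
print(find_usernames("user Ab' logged in"))      # [] (only 2 characters before the quote)
```

Since the pattern has no ^ or $ anchors, it finds the username anywhere in the line, which is why re.search or findall works without knowing the rest of the log format.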
I think you can do it like this:
(Updated after the comment from @ctwheels)
See regex in use here
[A-Z][a-zA-Z0-9]{1,10}'
Explanation
Match a whitespace
Match an uppercase character [A-Z]
Match [a-zA-Z0-9]{1,10}, i.e. 1 to 10 ASCII letters or digits (a total length of 3-12 would need {2,11})
Match an apostrophe '

Python Not Extracting Expected Pattern

I'm new to RegEx and I am trying to perform a simple match to extract a list of items using re.findall. However, I am not getting the expected result. Can you please help explain why I am also getting the first piece of this string based on the below regex pattern and what I need to modify to get the desired output?
import re
string = '''aaaa_1y345_xyz_orange_bar_1
aaaa_123a5542_xyz_orange_bar_1
bbbb_1z34512_abc_purple_bar_1'''
print(re.findall('_\w+_\w+_bar_\d+', string))
Current Output:
['_1y345_xyz_orange_bar_1', '_123a5542_xyz_orange_bar_1', '_1z34512_abc_purple_bar_1']
Desired Output:
['_xyz_orange_bar_1', '_xyz_orange_bar_1', '_abc_purple_bar_1']
The \w pattern matches letters, digits and the _ symbol. Depending on the Python version and options used, the letters and digits it can match may be from the whole Unicode range or just ASCII. Because \w also matches _, the first \w+ can run across underscores, so each match starts at the first _ on the line.
So, the best way to fix the issue is by replacing \w with [^\W_]:
import re
string = '''aaaa_1y345_xyz_orange_bar_1
aaaa_123a5542_xyz_orange_bar_1
bbbb_1z34512_abc_purple_bar_1'''
print(re.findall(r'_[^\W_]+_[^\W_]+_bar_[0-9]+', string))
# => ['_xyz_orange_bar_1', '_xyz_orange_bar_1', '_abc_purple_bar_1']
See the Python demo.
Details:
_ - an underscore
[^\W_]+ - 1 or more chars that are either digits or letters (a [^ starts the negated character class, \W matches any non-word char, and _ is added to match any word chars other than _)
_[^\W_]+ - same as above
_bar_ - a literal substring _bar_
[0-9]+ - 1 or more ASCII digits.
See the regex demo.
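To make the difference visible, the sketch below adds capturing groups (not part of the original patterns) and runs both versions on the first sample line only.

```python
import re

line = "aaaa_1y345_xyz_orange_bar_1"

# With \w, the first group is free to cross an underscore.
loose = re.search(r"_(\w+)_(\w+)_bar_\d+", line)
print(loose.groups())   # ('1y345_xyz', 'orange')

# With [^\W_], each group must stop at the next underscore, so the
# attempt at the first "_" fails and the match starts at "_xyz".
strict = re.search(r"_([^\W_]+)_([^\W_]+)_bar_[0-9]+", line)
print(strict.groups())  # ('xyz', 'orange')
print(strict.group(0))  # _xyz_orange_bar_1
```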
_[a-z]+_\w+_bar_\d+ should work.
import re
string = '''aaaa_1y345_xyz_orange_bar_1
aaaa_123a5542_xyz_orange_bar_1
bbbb_1z34512_abc_purple_bar_1'''
print(re.findall(r'_[a-z]+_\w+_bar_\d+', string))
Output:
['_xyz_orange_bar_1', '_xyz_orange_bar_1', '_abc_purple_bar_1']
Your problem is that \w also matches the underscore, and the regular expression is greedy and tries to match as much as possible. Sometimes this can be fixed by adding a ? (question mark) after the + (plus) sign. However, in your current solution that is not doable (in any simple way, at least - it can likely be done with some lookahead). Instead, you can choose another pattern that explicitly forbids matching the _ (underscore) character:
import re
string = '''aaaa_1y345_xyz_orange_bar_1
aaaa_123a5542_xyz_orange_bar_1
bbbb_1z34512_abc_purple_bar_1'''
print(re.findall(r'_[^_\W]+_[^_\W]+_bar_\d+', string))
This will match what you hope for. The [^ ... ] construct means not, thus: not an underscore and not a non-word character (\W), which leaves only letters and digits.
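The sketch below illustrates why simply making the quantifiers lazy does not help here: the regex engine still reports the leftmost possible match, which begins at the first underscore of each line. The variable names are chosen for illustration.

```python
import re

string = '''aaaa_1y345_xyz_orange_bar_1
aaaa_123a5542_xyz_orange_bar_1
bbbb_1z34512_abc_purple_bar_1'''

# Lazy quantifiers: the match still starts at the first "_" on each line,
# and \w+? simply expands until the rest of the pattern fits.
lazy = re.findall(r'_\w+?_\w+?_bar_\d+', string)
print(lazy)
# ['_1y345_xyz_orange_bar_1', '_123a5542_xyz_orange_bar_1', '_1z34512_abc_purple_bar_1']

# Excluding "_" from the class forces each part to end at an underscore.
strict = re.findall(r'_[^_\W]+_[^_\W]+_bar_\d+', string)
print(strict)
# ['_xyz_orange_bar_1', '_xyz_orange_bar_1', '_abc_purple_bar_1']
```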
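The problem with your code is that the \w pattern also matches the underscore: in ASCII mode it is equivalent to the set [a-zA-Z0-9_] (in Python 3 it additionally matches non-ASCII letters and digits by default).
I guess you need to match the same set but without an underscore: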
import re
string = '''aaaa_1y345_xyz_orange_bar_1
aaaa_123a5542_xyz_orange_bar_1
bbbb_1z34512_abc_purple_bar_1'''
print(re.findall(r'_[a-zA-Z0-9]+_[a-zA-Z0-9]+_bar_\d+', string))
The output:
['_xyz_orange_bar_1', '_xyz_orange_bar_1', '_abc_purple_bar_1']
Your \w usage is too permissive. It will find not only letters, but numbers and underscores as well. From the (Python 2) docs:
When the LOCALE and UNICODE flags are not specified, matches any alphanumeric character and the underscore; this is equivalent to the set [a-zA-Z0-9_]. With LOCALE, it will match the set [0-9_] plus whatever characters are defined as alphanumeric for the current locale. If UNICODE is set, this will match the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database.
Instead, use explicit character classes to match:
_[a-z]+_[a-z]+_bar_[0-9]+
If you actually need the complete matching of \w without the underscore, you can change the character groupings to:
[a-zA-Z0-9]

Regex, not statement

Heyho,
I have the regex
([ ;(\{\}),\[\'\"]?)(_[a-zA-Z_\-0-9]*)([ =;\/*\-+\]\"\'\}\{,]?)
to match every occurrence of
_var
Problem is that it also matches strings like
test_var
I tried to add a new matching group negating any word character, but it didn't work properly.
Can someone figure out what I have to do to not match strings like var_var?
Thanks for help!
You can use the following "fix":
([[ ;(){},'"]?)(\b_[a-zA-Z_0-9-]*\b)([] =;/*+"'{},-]?)
(The two \b word boundaries around the second group are the additions.)
See regex demo
The word boundary \b is an anchor that asserts the position between a word character and a non-word character (or the start or end of the string). That means your _var will never match if preceded with a letter, a digit, or a _. Also, I removed overescaping inside the character classes in the optional capturing groups. Note the so-called "smart placement" of hyphens and square brackets that for a Python regex might be not that important, but is still a best practice in writing regexes. Also, in Python regex you don't need to escape / since there are no regex delimiters there.
And one more hint: without the u modifier (in Python 3, this requires the re.ASCII flag, since str patterns are Unicode-aware by default), \w matches [a-zA-Z0-9_], so you can write the regex as
([[ ;(){},'"]?)(\b_[\w-]*\b)([] =;/*+"'{},-]?)
See regex demo 2.
And an IDEONE demo (note the use of r'...'):
import re
p = re.compile(r'([[ ;(){},\'"]?)(\b_[\w-]*\b)([] =;/*+"\'{},-]?)')
test_str = "Some text _var and test_var"
print (re.findall(p, test_str))
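To isolate the effect of \b, here is a reduced sketch that keeps only the middle group of the pattern (the optional surrounding groups are dropped for brevity, which is a simplification and not the full regex from the answer).

```python
import re

test_str = "Some text _var and test_var"

# Without word boundaries, the "_var" inside "test_var" is found too.
without_boundary = re.findall(r"_[\w-]*", test_str)
print(without_boundary)  # ['_var', '_var']

# With \b, an underscore preceded by a word character cannot start a match.
with_boundary = re.findall(r"\b_[\w-]*\b", test_str)
print(with_boundary)     # ['_var']
```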

shorthand for [:alpha:] in python regex

What is the equivalent of [:alpha:] if I am making a Unicode regex that needs it?
For example, for [:word:] it is [\w].
It would be great if I could get some help.
For Unicode compliance, you need to use
regex = re.compile(r"[^\W\d_]", re.UNICODE)
Unicode character properties (like \p{L}) are not supported by Python's built-in re module (the third-party regex module does support them).
Explanation:
\w matches (if the Unicode flag is set) any letter, digit or underscore.
[^\W] matches the same thing, but with the negated character class, we can now subtract characters we don't want included:
[^\W\d_] matches whatever \w matches, but without digits (\d) or underscore (_).
>>> import re
>>> regex = re.compile(r"[^\W\d_]", re.UNICODE)
>>> regex.findall("aä12_")
['a', 'ä']
Any character in the range:
[A-Za-z]
That is the best shorthand in Python for it.
Or you could do [A-Z] with ignorecase: re.compile(r'[A-Z]', re.I)
Or inline: re.compile(r'(?i)[A-Z]')
For unicode: re.compile(r'[A-Z]', re.I|re.U) or re.compile(r'(?iu)[A-Z]'); note that [A-Z] still does not match accented letters such as ä.
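For comparison, a short sketch running both suggestions from this thread on the same sample string under Python 3 (the variable names are illustrative; ä is written as an escape only to make the code point unambiguous).

```python
import re

sample = "a\u00e412_"  # "aä12_"

# Case-insensitive ASCII range: the accented letter is not matched.
ascii_letters = re.findall(r"[A-Z]", sample, re.I)
print(ascii_letters)    # ['a']

# \w minus digits and underscore: matches letters from any script.
unicode_letters = re.findall(r"[^\W\d_]", sample)
print(unicode_letters)  # ['a', 'ä']
```

If the text may contain non-ASCII letters, the [^\W\d_] form is the closer match to what [:alpha:] means in a Unicode-aware engine.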
