Escaping regex unicode string in Python - python

I have a user-defined string.
I want to use it in a regex with one small improvement: wherever it contains an apostrophe, the regex should match any of three apostrophe characters instead of just one.
For example,
APOSTROPHES = re.escape('\'\u2019\u02bc')
word = re.escape("п'ять")
word = ''.join([s if s not in APOSTROPHES else '[%s]' % APOSTROPHES for s in word])
It works well for Latin text, but for Unicode text the list comprehension gives the following string:
"[\\'\\\\u2019\\\\u02bc]\xd0[\\'\\\\u2019\\\\u02bc]\xbf[\\'\\\\u2019\\\\u02bc][\\'\\\\u2019\\\\u02bc][\\'\\\\u2019\\\\u02bc]\xd1[\\'\\\\u2019\\\\u02bc]\x8f[\\'\\\\u2019\\\\u02bc]\xd1[\\'\\\\u2019\\\\u02bc]\x82[\\'\\\\u2019\\\\u02bc]\xd1[\\'\\\\u2019\\\\u02bc]\x8c"
It looks like it finds the backslashes in both strings and then substitutes APOSTROPHES for each of them.
Also, print(list(w for w in APOSTROPHES)) gives ['\\', "'", '\\', '\\', 'u', '2', '0', '1', '9', '\\', '\\', 'u', '0', '2', 'b', 'c'].
How can I avoid it? I want to get "\п[\'\u2019\u02bc]\я\т\ь"

As I understand it, you want to create a regular expression that matches a given word written with any of the apostrophe variants.
A regex that matches any apostrophe can be defined as a character class:
APOSTROPHES_REGEX = r'[\'\u2019\u02bc]'
For instance, you have this (Ukrainian?) word which contains a single quote:
word = "п'ять"
EDIT: If your word contains another kind of apostrophe, you can normalize it like this:
word = re.sub(APOSTROPHES_REGEX, "'", word, flags=re.UNICODE)
To create a regex, you escape this string (because it may contain characters that are special in a regex, such as punctuation). When escaped, the single quote "'" is replaced by an escaped single quote, like this: r"\'". (This is how re.escape behaves before Python 3.7; newer versions leave the apostrophe unescaped.)
You can replace this r"\'" by your apostrophe RegEx:
import re
word_regex = re.escape(word)
word_regex = word_regex.replace(r'\'', APOSTROPHES_REGEX)
The new RegEx can then be used to match the same word with any apostrophe:
assert re.match(word_regex, "п'ять") # '
assert re.match(word_regex, "п’ять") # \u2019
assert re.match(word_regex, "пʼять") # \u02bc
Note: on Python 2, don't forget the re.UNICODE flag; it makes character classes such as r"\w" Unicode-aware. On Python 3 this is already the default for str patterns.
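Here is a sketch of a variant that does not depend on how re.escape treats the apostrophe. The helper name any_apostrophe_pattern is made up for illustration. It escapes the word one character at a time and widens every apostrophe into the character class:

```python
import re

# The three apostrophe variants from the question.
APOSTROPHES = "'\u2019\u02bc"
APOSTROPHE_CLASS = "[" + re.escape(APOSTROPHES) + "]"

def any_apostrophe_pattern(word):
    """Hypothetical helper: escape each character, widen apostrophes to a class."""
    return "".join(
        APOSTROPHE_CLASS if ch in APOSTROPHES else re.escape(ch)
        for ch in word
    )

word_regex = re.compile(any_apostrophe_pattern("п'ять"))
candidates = ["п'ять", "п\u2019ять", "п\u02bcять", "пять"]
matches = [bool(word_regex.fullmatch(s)) for s in candidates]
print(matches)  # [True, True, True, False]
```

Because the input word is checked against all three variants, it does not need to be normalized first.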

Related

How to write in the correct way the following array with special characters?

I want to clean the name of a file but ONLY for the special characters not allowed:
char_not_supported_by_file_name = ['\', '/', ':', '*', '?', '"', '<', '>', '|']
tmp_file_name= file
for c in char_not_supported_by_file_name:
    if c in tmp_file_name:
        tmp_file_name = tmp_file_name.replace(c, '_')
I am trying to write this list, check whether the file name I want to clean up contains any of the 9 special characters I don't want, and replace each one with an underscore, but my IDE says the array is written incorrectly. How can I write it correctly?
If you precede a quote with a backslash, it will have been escaped. In other words, it will be a character in the string instead of marking the end of the string. You must escape the first backslash with another backslash:
char_not_supported_by_file_name = ['\\', '/', ':', '*', '?', '"', '<', '>', '|']
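As a quick check (a minimal sketch, nothing beyond the built-ins), the escaped form '\\' is two characters in the source code but a single backslash in the resulting string:

```python
backslash = '\\'  # written with two characters, stored as one
print(len(backslash))  # 1
print(backslash)       # \

char_not_supported_by_file_name = ['\\', '/', ':', '*', '?', '"', '<', '>', '|']
print(len(char_not_supported_by_file_name))  # 9
```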
Also, replace will do nothing if it can't find any instances of the character that needs to be replaced, so you can omit the if check:
for c in char_not_supported_by_file_name:
    tmp_file_name = tmp_file_name.replace(c, '_')
If you are willing to import modules, this could be done without the loop, using re.sub:
import re
file_name = "this/is:a*very?bad\\example>of<a|filename"
res = re.sub("[\\\\/:*?\"<>|]", "_", file_name)
print(res)
# this_is_a_very_bad_example_of_a_filename
Note that the backslashes need to be tripled or even quadrupled depending on the exact location. Four is always safe in a regular (non-raw) string literal; three only works when the character that follows does not form a valid string escape. Read this question and its duplicates for more information. The reason is that those backslashes are escaped twice: once by the interpreter and then again by re.
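To illustrate the two layers of escaping, here is a small sketch comparing a regular string literal with a raw one. Both variable names are made up for the example. The two literals produce the same pattern text, which is what re finally sees:

```python
import re

non_raw = "[\\\\/:*?\"<>|]"  # four backslashes in the source
raw = r'[\\/:*?"<>|]'        # two backslashes in the source
print(non_raw == raw)        # True
print(raw)                   # [\\/:*?"<>|]  (re reads \\ as one literal backslash)

file_name = "this/is:a*very?bad\\example>of<a|filename"
cleaned = re.sub(raw, "_", file_name)
print(cleaned)               # this_is_a_very_bad_example_of_a_filename
```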
Something that will make your code more concise, if you're comfortable with regex, would be using regular expressions instead of an array:
import re
tmp_file_name = file
tmp_file_name = re.sub(r'[\\/:*?\"<>|]', '_', tmp_file_name)
This solves your original problem as well, which is that the backslash in the first element of your array, '\', is escaping the end quote and turning it into a ' literal instead of closing the quotations around your backslash.

Replacing multiple chars in string with another character in Python

I have a list of strings. I want to check whether each string contains certain characters and, if it does, replace those characters with another character.
I have something like below:
invalid_chars = [' ', ',', ';', '{', '}', '(', ')', '\\n', '\\t', '=']
word = 'Ad{min > HR'
for c in list(word):
    if c in invalid_chars:
        word = word.replace(c, '_')
print(word)
>>> Ad_min_>_HR
I am trying to convert this into a function using a list comprehension, but I am getting strange characters...
def replace_chars(word, checklist, char_replace = '_'):
    return ''.join([word.replace(ch, char_replace) for ch in list(word) if ch in checklist])
print(replace_chars(word, invalid_chars))
>>> Ad_min > HRAd{min_>_HRAd{min_>_HR
Try this general pattern:
''.join([ch if ch not in invalid_chars else '_' for ch in word])
For the complete function:
def replace_chars(word, checklist, char_replace = '_'):
    return ''.join([ch if ch not in checklist else char_replace for ch in word])
Note: no need to wrap string word in a list(), it's already iterable.
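To see why the attempt in the question returned a tripled string, here is a short sketch that runs both versions side by side. The original comprehension produces one full copy of word for every invalid character it finds, while the corrected one decides character by character:

```python
invalid_chars = [' ', ',', ';', '{', '}', '(', ')', '\n', '\t', '=']
word = 'Ad{min > HR'

# The attempt from the question: one whole-word copy per invalid character found.
copies = [word.replace(ch, '_') for ch in word if ch in invalid_chars]
print(copies)  # ['Ad_min > HR', 'Ad{min_>_HR', 'Ad{min_>_HR']

def replace_chars(word, checklist, char_replace='_'):
    # One output character per input character, so the length is preserved.
    return ''.join(char_replace if ch in checklist else ch for ch in word)

fixed = replace_chars(word, invalid_chars)
print(fixed)   # Ad_min_>_HR
```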
This might be a good use for str.translate(). You can turn your invalid_chars into a translation table with str.maketrans() and apply wherever you need it:
invalid_chars = [' ', ',', ';', '{', '}', '(', ')', '\n', '\t', '=']
invalid_table = str.maketrans({k:'_' for k in invalid_chars})
word = 'Ad{min > HR'
word.translate(invalid_table)
Result:
'Ad_min_>_HR'
This is especially nice if you need to apply the translation to several strings, and it is more efficient: you don't loop through the entire invalid_chars list for every letter, which is what happens when you use if x in invalid_chars inside a loop.
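A minimal sketch of that reuse, with a made-up list of sample strings: the table is built once and then applied to each string.

```python
invalid_chars = [' ', ',', ';', '{', '}', '(', ')', '\n', '\t', '=']

# Build the translation table once...
invalid_table = str.maketrans({ch: '_' for ch in invalid_chars})

# ...and reuse it for every string (sample data, for illustration only).
words = ['Ad{min > HR', 'a=b;c', 'tab\tseparated']
cleaned = [w.translate(invalid_table) for w in words]
print(cleaned)  # ['Ad_min_>_HR', 'a_b_c', 'tab_separated']
```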
This is easier with regex. You can search for a whole group of characters with a single substitution call. It should perform better too.
>>> import re
>>> re.sub(f"[{re.escape(''.join(invalid_chars))}]", "_", word)
'Ad_min_>_HR'
The code in the f-string builds a regex pattern that looks like this
>>> pattern = f"[{re.escape(''.join(invalid_chars))}]"
>>> print(repr(pattern))
'[\\ ,;\\{\\}\\(\\)\\\n\\\t=]'
>>> print(pattern)
[\ ,;\{\}\(\)\
\ =]
That is, a regex character set containing each of your invalid chars. (The backslash escaping ensures that none of them are interpreted as a regex control character, regardless of which characters you put in invalid_chars.) If you had specified them as a string in the first place, the ''.join() would not be required.
You can also compile the pattern (using re.compile()) if you need to re-use it on multiple words.
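As a sketch of the compiled variant, assuming the invalid characters are kept in a single string (the name INVALID_RE and the sample words are made up for the example):

```python
import re

invalid_chars = ' ,;{}()\n\t='  # a plain string, so no ''.join() is needed
INVALID_RE = re.compile(f"[{re.escape(invalid_chars)}]")

words = ['Ad{min > HR', 'key=value', 'f(x, y)']
cleaned = [INVALID_RE.sub('_', w) for w in words]
print(cleaned)  # ['Ad_min_>_HR', 'key_value', 'f_x__y_']
```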

Regex parsing text and get relevant words / characters

I want to parse a file, that contains some programming language. I want to get a list of all symbols etc.
I tried a few patterns and decided that this is the most successful yet:
pattern = r"\b(\w+|\W+)\b"
Using this on my text, that is something like:
string = "the quick brown(fox).jumps(over + the) = lazy[dog];"
re.findall(pattern, string)
will result in my required output, but I have some chars that I don't want and some unwanted formatting:
['the', ' ', 'quick', ' ', 'brown', '(', 'fox', ').', 'jumps', 'over',
' + ', 'the', ') = ', 'lazy', '[', 'dog']
My list contains some whitespace that I would like to get rid of and some double symbols, like (., that I would like to have as single chars. Of course I have to modify the \W+ to get this done, but I need a little help.
The other problem is that my regex doesn't match the ending ];, which I also need.
Why use \W+ for one or more, if you want single non-word characters in output? Additionally exclude whitespace by use of a negated class. Also it seems like you could drop the word boundaries.
re.findall(r"\w+|[^\w\s]", string)
This matches
\w+ one or more word characters
|[^\w\s] or one character, that is neither a word character nor a whitespace
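Applied to the sample text from the question, the pattern can be tried out like this (a minimal sketch; the variable names are arbitrary):

```python
import re

text = "the quick brown(fox).jumps(over + the) = lazy[dog];"
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)
# ['the', 'quick', 'brown', '(', 'fox', ')', '.', 'jumps', '(', 'over',
#  '+', 'the', ')', '=', 'lazy', '[', 'dog', ']', ';']
```

Whitespace never appears in the result, every symbol is a separate token, and the trailing ]; is included.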

Python - parsing user input using a verbose regex

I am trying to design a regex that will parse user input in the form of full sentences. I am struggling to get my expression to fully work. I know it is not well coded, but I am trying hard to learn. I am currently trying to get it to parse a percentage as one string; see below the code.
My test "sentence" = How I'm 15.5% wholesome-looking U.S.A. we RADAR () [] {} you -- are, ... you?
text = input("please type somewhat coherently: ")
pattern = r'''(?x) # set flag to allow verbose regexps
(?:[A-Z]\.)+ # abbreviations, e.g. U.S.A.
|\w+(?:[-']\w+)* # permit word-internal hyphens and apostrophes
|[-.(]+ # double hyphen, ellipsis, and open parenthesis
|\S\w* # any sequence of word characters
# |[\d+(\.\d+)?%] # percentages, 82%
|[][\{\}.,;"'?():-_`] # these are separate tokens
'''
parsed = re.findall(pattern, text)
print(parsed)
My output = ['How', "I'm", '15', '.', '5', '%', 'wholesome-looking', 'U.S.A.', 'we', 'RADAR', '(', ')', '[', ']', '{', '}', 'you', '--', 'are', ',', '...', 'you', '?']
I am looking to have the '15', '.', '5', '%' parsed as '15.5%'. The line that is currently commented out is what should do it, but when commented in it does absolutely nothing. I searched for resources to help, but they have not.
Thank you for your time.
If you just want to have the percentage match as a whole entity, you really should be aware that regex engine analyzes the input string and the pattern from left to right. If you have an alternation, the leftmost alternative that matches the input string will be chosen, the rest won't be even tested.
Thus, you need to pull the alternative \d+(?:\.\d+)? up, and the capturing group should be turned into a non-capturing or findall will yield strange results:
(?x) # set flag to allow verbose regexps
(?:[A-Z]\.)+ # abbreviations, e.g. U.S.A.
|\d+(?:\.\d+)?% # percentages, 82% <-- PULLED UP OVER HERE
|\w+(?:[-']\w+)* # permit word-internal hyphens and apostrophes
|[-.(]+ # double hyphen, ellipsis, and open parenthesis
|\S\w* # any sequence of word characters
|[][{}.,;"'?():_`-] # these are separate tokens
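The effect of the order of alternatives can be seen on a reduced pattern. This is only a sketch with two simplified patterns, not the full tokenizer above; the names late and early are made up:

```python
import re

text = "15.5% done"

# Percentage alternative listed after \w+: \w+ wins at "15" and splits the number.
late = r"\w+|\d+(?:\.\d+)?%|\S"
print(re.findall(late, text))   # ['15', '.', '5', '%', 'done']

# Percentage alternative listed first: the whole "15.5%" is consumed at once.
early = r"\d+(?:\.\d+)?%|\w+|\S"
print(re.findall(early, text))  # ['15.5%', 'done']
```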
Also, please note I replaced [][\{\}.,;"'?():-_`] with [][{}.,;"'?():_`-]: braces do not have to be escaped, and - was forming an unnecessary range from a colon (decimal code 58) to an underscore (decimal 95), matching ;, <, =, >, ?, @, all the uppercase Latin letters, [, \, ] and ^.
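The accidental range can be checked directly. The sketch below uses two reduced character classes (not the full one from the pattern) and lists which printable ASCII characters each of them matches:

```python
import re
import string

accidental = re.compile(r"[:-_]")  # range from ':' (58) to '_' (95)
intended = re.compile(r"[:_-]")    # three literal characters

printable = [chr(c) for c in range(32, 127)]
accidental_hits = "".join(ch for ch in printable if accidental.fullmatch(ch))
intended_hits = "".join(ch for ch in printable if intended.fullmatch(ch))

print(accidental_hits)  # :;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_
print(intended_hits)    # -:_
```

Note that the accidental range does not even match the hyphen itself.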

Regex to ignore specific characters

I am parsing a text on non-alphanumeric characters and would like to exclude specific characters like apostrophes, dashes/hyphens and commas.
I would like to build a regex for the following cases:
non-alphanumeric character, excluding apostrophes and hyphens
non-alphanumeric character, excluding commas, apostrophes and hyphens
This is what I have tried:
def split_text(text):
    my_text = re.split('\W',text)
    # the following doesn't work.
    #my_text = re.split('([A-Z]\w*)',text)
    #my_text = re.split("^[a-zA-Z0-9]+(-[a-zA-Z0-9]+)*$",text)
    return my_text
Case 1:
Sample Input: What's up? It's good to see you my-friend. "Hello" to-the world!.
Sample Output: ['What's','up','It's','good','to','see','you','my-friend','Hello','to-the','world']
Case 2:
Sample Input: It means that, it's not good-to do such things.
Sample Output: ['It', 'means', 'that,', 'it's', 'not', 'good-to', 'do', 'such', 'things']
Any ideas?
Is this what you want?
non-alphanumeric character, excluding apostrophes and hyphens
my_text = re.split(r"[^\w'-]+",text)
non-alphanumeric character, excluding commas, apostrophes and hyphens
my_text = re.split(r"[^\w',-]+",text)
The [] syntax defines a character class; [^..] "complements" it, i.e. it negates it. The hyphen goes last so that it is taken literally rather than as a range operator.
See the documentation about that:
Characters that are not within a range can be matched by complementing the set. If the first character of the set is '^', all the characters that are not in the set will be matched. For example, [^5] will match any character except '5', and [^^] will match any character except '^'. ^ has no special meaning if it’s not the first character in the set.
You can use a negated character class for this:
my_text = re.split(r"[^\w'-]+",text)
or
my_text = re.split(r"[^\w,'-]+",text) # also excludes commas
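Here is a sketch that runs both patterns on the sample inputs from the question. One detail is an addition of mine: re.split returns an empty string when the text ends with a separator, so the example filters empty pieces out.

```python
import re

case1 = "What's up? It's good to see you my-friend. \"Hello\" to-the world!."
case2 = "It means that, it's not good-to do such things."

# Keep word characters, apostrophes and hyphens together; drop empty pieces.
words1 = [w for w in re.split(r"[^\w'-]+", case1) if w]
print(words1)
# ["What's", 'up', "It's", 'good', 'to', 'see', 'you', 'my-friend', 'Hello', 'to-the', 'world']

# Same, but commas are kept as well.
words2 = [w for w in re.split(r"[^\w,'-]+", case2) if w]
print(words2)
# ['It', 'means', 'that,', "it's", 'not', 'good-to', 'do', 'such', 'things']
```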
