Unicode Regex with regex not working in Python

I have the following Regex (see it in action in PCRE)
.*?\P{L}*?(\p{L}+-?(\p{L}+)?)\P{L}*$
However, Python doesn't support Unicode regex with the \p{} syntax. To solve this, I read that I could use the regex module (not the default re), but this doesn't seem to work either, not even with the u flag.
Example:
sentence = "valt nog zoveel zal kunnen zeggen, "
print(re.sub(".*?\P{L}*?(\p{L}+-?(\p{L}+)?)\P{L}*$","\1",sentence))
Output: < blank >
Expected output: zeggen
This doesn't work with Python 3.4.3.

As you can see, Unicode character classes like \p{L} are not available in the re module. However, that doesn't mean you can't do it with the re module, since \p{L} can be replaced with [^\W\d_] with the UNICODE flag (even though there are small differences between these two character classes, see the link in comments).
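As a small illustration of that replacement (a sketch, not part of the original answer; the first sample string is made up, the second is the sentence from the question), [^\W\d_] can be checked against accented letters, digits and underscores. In Python 3, str patterns are Unicode-aware by default, so the flag only matters on Python 2.

```python
import re

# [^\W\d_] = "word characters" minus digits and underscore, i.e. letters only.
letters = re.compile(r'[^\W\d_]+')

sample = "lélijk zo_42 Straße"   # made-up test string
found = letters.findall(sample)
print(found)                     # ['lélijk', 'zo', 'Straße']

sentence = "valt nog zoveel zal kunnen zeggen, "   # from the question
last_word = re.findall(r'[^\W\d_]+(?:-[^\W\d_]+)?', sentence)[-1]
print(last_word)                 # zeggen
```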
Second point: your approach is not a good one (if I understand correctly, you are trying to extract the last word of each line), because you have chosen to remove everything that is not the last word (except the newline) with a replacement. ~52000 steps to extract 10 words in 10 lines of text is not acceptable (and will crash with more characters). A more efficient way is to find all the last words directly; see this example:
import re
s = '''Ik heb nog nooit een kat gezien zo lélijk!
Het is een minder lelijk dan uw hond.'''
p = re.compile(r'^.*\b(?<!-)(\w+(?:-\w+)*)', re.M | re.U)
words = p.findall(s)
print('\n'.join(words))
Notes:
To obtain the same result with Python 2.7 you only need to add a u before the single quotes of the string: s = u'''...
If you absolutely want to limit results to letters, avoiding digits and underscores, replace \w with [^\W\d_] in the pattern.
If you use the regex module, the character class \p{IsLatin} may be more appropriate for your use, or, whichever module you choose, a more explicit class with only the needed characters, something like: [A-Za-záéóú...
You can achieve the same with the regex module with this pattern:
p = regex.compile(r'^.*\m(?<!-)(\pL+(?:-\pL+)*)', regex.M | regex.U)
Other ways:
By line with the re module:
p = re.compile(r'[^\w-]+', re.U)
for line in s.split('\n'):
    print(p.split(line+' ')[-2])
With the regex module you can take advantage of the reversed search:
p = regex.compile(r'(?r)\w+(?:-\w+)*\M', regex.U)
for line in s.split('\n'):
    print(p.search(line).group(0))
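One further detail that may explain the blank output in the question (this is an inference from the code shown there, not something the answer states): the replacement "\1" is written without the r prefix, so Python reads it as the control character U+0001 instead of a backreference to group 1. The sketch below uses re with [^\W\d_] in place of \p{L}; the pattern is an adaptation for illustration, not the asker's exact expression.

```python
import re

sentence = "valt nog zoveel zal kunnen zeggen, "   # from the question

# Adapted pattern: [^\W\d_] stands in for \p{L}, [\W\d_] for \P{L}.
pattern = r'.*?([^\W\d_]+(?:-[^\W\d_]+)?)[\W\d_]*$'

print(repr("\1"))    # '\x01'  -> an invisible control character
print(repr(r"\1"))   # '\\1'   -> backslash + "1", a real backreference

wrong = re.sub(pattern, "\1", sentence)    # inserts U+0001, looks blank
right = re.sub(pattern, r"\1", sentence)   # inserts group 1
print(repr(wrong))   # '\x01'
print(repr(right))   # 'zeggen'
```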

This post explains how to use unicode properties in python:
Python regex matching Unicode properties
Have you tried Ponyguruma, a Python binding to the Oniguruma
regular expression engine? In that engine you can simply say
\p{Armenian} to match Armenian characters. \p{Ll} or \p{Zs} work
too.

Related

Regular expression error: unbalanced parenthesis at position n

I have been meaning to extract the month name from the following string with regex and despite the fact that my regex works on a platform like regex101, I can't seem to be able to extract the word "August".
import re
s = "word\anyword\2021\August\202108_filename.csv"
re.findall("\d+\\([[:alpha:]]+)\\\d+", s)
Which results in the following error:
error: unbalanced parenthesis at position 17
I also tried using re.compile, re.escape as per suggestions of the previous posts dealing with the same error but none of them seems to work.
Any help and also a little explanation on why this isn't working is greatly appreciated.

You can use
import re
s = r"word\anyword\2021\August\202108_filename.csv"
m = re.search(r"\d+\\([a-zA-Z]+)\\\d+", s)
if m:
    print(m.group(1))
See the Python demo.
There are three main problems here:
The input string should be the same as used at regex101.com, i.e. you need to make sure you are using literal backslashes in the Python code, hence the use of raw string literals for both the input text and regex
The POSIX character classes are not supported by Python re, so [[:alpha:]]+ should be replaced with some equivalent pattern, say, [A-Za-z]+ or [^\W\d_]+
Since it seems like you only expect a single match (there is only one August (month) name in the string), you do not need re.findall, you can use re.search. Only use re.findall when you need to extract multiple matches from a string.
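To see where the error in the question comes from (a sketch; the variable names are illustrative): without the r prefix, Python turns \\( into \( before re ever sees the pattern, and \( is an escaped, literal parenthesis. The later ) then has no opening partner. The input string is altered in the same way, for example \a becomes a bell character. On recent Python versions compiling the first pattern may also print a FutureWarning about [[:alpha:]].

```python
import re

# The pattern as the regex engine receives it when the r prefix is missing:
# "\\(" in a normal literal has already collapsed to \( (a literal parenthesis).
seen_by_re = r"\d+\([[:alpha:]]+)\\d+"
try:
    re.compile(seen_by_re)
    error = None
except re.error as exc:
    error = str(exc)
print(error)

# A normal literal also rewrites the input: \a is the bell character.
print('\a' == '\x07')   # True

# Raw strings on both sides, and an explicit class instead of [[:alpha:]].
s = r"word\anyword\2021\August\202108_filename.csv"
m = re.search(r"\d+\\([A-Za-z]+)\\\d+", s)
month = m.group(1) if m else None
print(month)            # August
```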
Also, see these posts:
Python regex - r prefix
What does the "r" in pythons re.compile(r' pattern flags') mean?
What exactly do "u" and "r" string flags do, and what are raw string literals?

Regex For Special Character (S with line on top)

I was trying to write regex in Python to replace all non-ascii with an underscore, but if one of the characters is "S̄" (an 'S' with a line on the top), it adds an extra 'S'... Is there a way to account for this character as well? I believe it's a valid utf-8 character, but not ascii
Here's the code:
import re
line = "ra*ndom wordS̄"
print(re.sub('[\W]', '_', line))
I would expect it to output:
ra_ndom_word_
But instead I get:
ra_ndom_wordS__

The reason Python works this way is that you are actually looking at two distinct characters: there's an S, and it is followed by a combining macron, U+0304.
In the general case, if you want to replace a sequence of combining characters and the base character with an underscore, try e.g.
import unicodedata

def cleanup(line):
    cleaned = []
    strip = False
    for char in line:
        if unicodedata.combining(char):
            strip = True
            continue
        if strip:
            cleaned.pop()
            strip = False
        if unicodedata.category(char) not in ("Ll", "Lu"):
            char = "_"
        cleaned.append(char)
    return ''.join(cleaned)
By the by, \W does not need square brackets around it; it's already a regex character class.
Python's re module lacks support for important Unicode properties, though if you really want to use specifically a regex for this, the third-party regex library has proper support for Unicode categories.
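A variant of the cleanup loop above, offered as a sketch under assumptions rather than as the answer's own code: it keeps only ASCII letters and digits (the question does not say what should happen to digits), replaces everything else with an underscore, and turns a base character plus its combining marks into a single underscore even when the sequence sits at the very end of the string. The function name is hypothetical.

```python
import unicodedata

def replace_non_ascii_alnum(line):
    """Replace anything that is not an ASCII letter or digit with "_".

    A base character followed by combining marks becomes one "_".
    """
    out = []
    for char in line:
        if unicodedata.combining(char):
            # The mark modifies the previous character, so that whole
            # cluster is no longer plain ASCII: overwrite it with "_".
            if out:
                out[-1] = "_"
            else:
                out.append("_")
            continue
        out.append(char if char.isascii() and char.isalnum() else "_")
    return "".join(out)

result = replace_non_ascii_alnum("ra*ndom wordS\u0304")
print(result)   # ra_ndom_word_
```

str.isascii() requires Python 3.7 or later.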
"Ll" is lowercase alphabetics and "Lu" are uppercase. There are other Unicode L categories so maybe tweak this to suit your requirements (unicodedata.category(char).startswith("L") maybe?); see also https://www.fileformat.info/info/unicode/category/index.htm

You can use the following script to get the desired output:
import re
line="ra*ndom wordS̄"
print(re.sub('[^[-~]+]*','_',line))
Output
ra_ndom_word_
This approach works with other non-ASCII characters as well:
import re
line="ra*ndom ¡¢£Ä wordS̄. another non-ascii: Ä and Ï"
print(re.sub('[^[-~]+]*','_',line))
Output
ra_ndom_word_another_non_ascii_and_

How to import Regex from external file with its original format and without extra escape characters

Hello everyone,
I would like to request your support with the following question.
I am currently working on a Python script that looks for matches for about 15 sentences, using regular expressions, in thousands of files.
The sentences that we will be looking for could change over the days/weeks, and the script will be given to users with knowledge of regular expressions, but no programming skills.
So, in order to make this script more scalable, I was looking to save the regexes in a different file where those users can modify the sentences without needing to modify the Python script.
Example
This file would be modified continuously to match different sentences.
--- regexs.log ---
Th\w*\s+sen\w*
\d{0,3}
--- matches.py ---
import re
with open("regexs.log", "r") as regexs:
    regex = regexs.readlines()
text = "This sentence"
for reg in regex:
    match = re.search(reg, text)
However, this is not working: when the regexes are imported, Python seems to add extra escape characters to them. For instance, the two regexes above are imported as below:
"Th\\w*\\s+sen\\w*"
"\\d{0,3}"
The backslash is duplicated, so the regexes are no longer useful, since they no longer match the sentences.
Is there any way to import those regular expressions in their original state?
A similar thing happens if I store the regular expressions in a list:
>>> reg = ["\w+\n"]
>>> reg
['\\w+\n']
Regards.

regex = regex.readlines()
regex = regex.replace("\\", "\") # <= Add this
What this does is say "everywhere there is a \\, replace it with a \". But if you are doing some other things with the file before it is finalized, you'll want to move the replace to a more appropriate spot.

I tried replace as below:
regex = regex.replace("\\", "\")
but it returns:
SyntaxError: EOL while scanning string literal
It seems Python reads the end of the second argument of replace as an escaped double quote (the escape sequence \") rather than as a backslash followed by the closing quote, so the string literal is never terminated.
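A note on what is probably happening here (an editorial sketch, not taken from the posts above): Python does not add backslashes when reading a file. The doubled backslashes are only how repr() displays a single backslash, so no replace call is needed. What readlines() does keep is the trailing newline on every line, and that newline becomes part of the pattern. Stripping it before compiling is one way to handle this; the function name load_patterns and the in-memory list standing in for the file are assumptions for illustration.

```python
import re

def load_patterns(lines):
    """Compile one regex per non-empty line, dropping the line terminator."""
    patterns = []
    for line in lines:
        raw = line.rstrip("\r\n")
        if raw:
            patterns.append(re.compile(raw))
    return patterns

# Stand-in for regexs.readlines(): every entry still ends with "\n".
file_lines = ["Th\\w*\\s+sen\\w*\n", "\\d{0,3}\n"]

# Each "\\" above is ONE backslash in memory; repr() just shows it doubled.
print(len("\\"))                  # 1
print(file_lines[0].rstrip())     # Th\w*\s+sen\w*

text = "This sentence"
patterns = load_patterns(file_lines)
matches = [p.search(text) is not None for p in patterns]
print(matches)                    # [True, True]

# With the newline left in, the first pattern requires a "\n" in the text.
unstripped = re.search(file_lines[0], text)
print(unstripped)                 # None
```

With a real file, the call would be load_patterns(regexs) inside the with open(...) block, since a file object can be iterated line by line.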

Need to Escape the Character After Special Characters in Python's regex?

I have the following python code:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import re
line = 'div><div class="fieldRow jr_name"><div class="fieldLabel">name<'
regex0 = re.compile('(.+?)\v class="fieldLabel">name.+?', re.VERBOSE | re.UNICODE)
regex1 = re.compile('(.+?)v class="fieldLabel">name.+?', re.VERBOSE | re.UNICODE)
regex2 = re.compile('(.+?) class="fieldLabel">name.+?', re.VERBOSE | re.UNICODE)
m0 = regex0.match(line)
m1 = regex1.match(line)
m2 = regex2.match(line)
if m0:
    print 'regex0 is good'
else:
    print 'regex0 is no good'
if m1:
    print 'regex1 is good'
else:
    print 'regex1 is no good'
if m2:
    print 'regex2 is good'
else:
    print 'regex2 is no good'
The output is
regex0 is good
regex1 is no good
regex2 is good
I don't quite understand why I need to escape the character 'v' after "(.+?)" in regex0. If I don't escape it, which gives regex1, the matching fails. However, for the space right after "(.+?)" in regex2, I don't have to escape anything.
Any idea?
Thanks in advance.

So, there are some issues with your approach.
The ones that contribute to your specific complaint are:
You do not mark the regexp string as raw (r' prefix), which makes the Python compiler change some "\"-prefixed characters inside the string before they even reach the re.match call.
"\v" happens to be one such sequence: a vertical tab, which is replaced by "\x0b".
You use the "re.VERBOSE" flag, which tells the regexp engine to ignore any whitespace character in the pattern. "\v", being a vertical tab, is one character in this class and is ignored.
So, there is your match for regex0: the letter "v" is never seen as such.
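The two effects described above can be checked directly. This is a minimal sketch using the line from the question; the variable names are illustrative.

```python
import re

line = 'div><div class="fieldRow jr_name"><div class="fieldLabel">name<'

# In a normal (non-raw) literal, Python itself converts \v to a vertical tab.
print('\v' == '\x0b')          # True
print(len('\v'), len(r'\v'))   # 1 2

# Under re.VERBOSE that vertical tab is whitespace and is skipped, so the
# pattern containing "\x0b" behaves like the one where it is simply absent.
with_vtab = re.compile('(.+?)\x0b class="fieldLabel">name.+?', re.VERBOSE)
without = re.compile('(.+?) class="fieldLabel">name.+?', re.VERBOSE)
same = with_vtab.match(line).group(1) == without.match(line).group(1)
print(same)                    # True
```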
Now, for the possible fixes to your approach, in the order that you should be trying them:
1) Don't use regular expressions to parse HTML. Really. There are a lot of packages that can do a good job of parsing HTML, and failing those you can use the stdlib's own HTMLParser (html.parser in Python 3);
2) If possible, use Python 3 instead of Python 2: you will be bitten by the first non-ASCII character inside your HTML body if you go on with the naive approach of treating Python 2 strings as "real life" text. Python 3 handles text encoding automatically (and lets you set it explicitly when it cannot).
3) Since you are probably not going to change either of those anyway, try to use regex.findall instead of regex.match: it returns a list of matching strings and could retrieve the attributes you are looking for at once, without searching from the beginning of the file or depending on line breaks inside the HTML.

There is a special symbol in Python regex \v, about which you can read here:
https://docs.python.org/2/library/re.html
Python regexes are usually written as r'your regex', where "r" means raw string. (https://docs.python.org/3/reference/lexical_analysis.html)
In a normal string literal, Python interprets escape sequences such as \v itself, before the regex engine ever sees them. In a raw string the backslash is passed through unchanged, so the regex engine decides what it means: \s matches whitespace, while a plain s is just the letter "s".
The line below is the solution you need, I believe (note the escaped space: re.VERBOSE ignores unescaped whitespace in the pattern).
regex1 = re.compile(r'(.+?)v\ class="fieldLabel">name.+?', re.VERBOSE | re.UNICODE)
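One caveat when combining a raw string with re.VERBOSE, shown as a small sketch (the variable names are illustrative): in verbose mode an unescaped space inside the pattern is ignored, so the space between v and class has to be escaped, or the flag dropped, for the pattern to match the line from the question.

```python
import re

line = 'div><div class="fieldRow jr_name"><div class="fieldLabel">name<'
flags = re.VERBOSE | re.UNICODE

# Unescaped space: VERBOSE drops it, so the engine looks for "vclass=...".
unescaped = re.compile(r'(.+?)v class="fieldLabel">name.+?', flags)
# Escaped space: "\ " survives VERBOSE and matches a literal space.
escaped = re.compile(r'(.+?)v\ class="fieldLabel">name.+?', flags)
# Without VERBOSE, a plain space is matched literally.
plain = re.compile(r'(.+?)v class="fieldLabel">name.+?', re.UNICODE)

print(bool(unescaped.match(line)))   # False
print(bool(escaped.match(line)))     # True
print(bool(plain.match(line)))       # True
```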

how to replace markdown tags into html by python?

I want to replace some "markdown" tags into html tags.
for example:
#Title1#
##title2##
Welcome to **My Home Page**
will be turned into
<h1>Title1</h1>
<h2>title2</h2>
Welcome to <b>My Home Page</b>
I just don't know how to do that. For Title1, I tried this:
#!/usr/bin/env python3
import re
text = '''
#Title1#
##title2##
'''
p = re.compile('^#\w*#\n$')
print(p.sub('<h1>\w*</h1>',text))
but nothing happens..
#Title1#
##title2##
How can such BBCode/Markdown markup be converted into HTML tags?

Check this regex: demo
Here you can see how I substituted the #...# into <h1>...</h1>.
I believe you can get this to work with double # and so on to cover other markdown features, but you should still listen to the comments from @Thomas and @nhahtdh and use a markdown parser. Using regexes in such cases is unreliable, slow and unsafe.
As for inline text like **...** to <b>...</b>, you can try this regex with substitution: demo. Hope you can tweak this for other features like underlining and so on.

Your regular expression does not work because in the default mode, ^ and $ match (respectively) the beginning and the end of the whole string.
'^'
(Caret.) Matches the start of the string, and in MULTILINE mode also matches immediately after each newline (my emph.)
'$'
Matches the end of the string or just before the newline at the end of the string, and in MULTILINE mode also matches before a newline. foo matches both ‘foo’ and ‘foobar’, while the regular expression foo$ matches only ‘foo’. More interestingly, searching for foo.$ in 'foo1\nfoo2\n' matches ‘foo2’ normally, but ‘foo1’ in MULTILINE mode; searching for a single $ in 'foo\n' will find two (empty) matches: one just before the newline, and one at the end of the string.
(7.2.1. Regular Expression Syntax)
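Add the flag re.MULTILINE in your compile line, and drop the \n before $: in MULTILINE mode $ already matches just before each newline, so \n$ would only match a heading that is followed by a blank line or that ends the text. Use a raw string as well:
p = re.compile(r'^#(\w*)#$', re.MULTILINE)
and it should work, at least for single words, such as your example. A better check would be
p = re.compile(r'^#([^#]*)#$', re.MULTILINE)
which accepts any sequence that does not contain a #.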
In both expressions, you need to add parentheses around the part you want to copy so you can use that text in your replacement code. See the official documentation on Grouping for that.
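Putting these points together for the three constructs in the question gives the following minimal sketch. It is an illustration only, not a markdown parser, and the variable names are made up. The patterns end in $ with no \n before it, because in MULTILINE mode $ already matches just before the newline; the captured group is reused in the replacement as \1.

```python
import re

text = '''
#Title1#
##title2##
Welcome to **My Home Page**
'''

# Each heading rule is anchored to a whole line; [^#\n]+ keeps the capture
# from crossing a line break or swallowing an extra "#".
h2 = re.compile(r'^##([^#\n]+)##$', re.MULTILINE)
h1 = re.compile(r'^#([^#\n]+)#$', re.MULTILINE)
bold = re.compile(r'\*\*(.+?)\*\*')

html = h2.sub(r'<h2>\1</h2>', text)
html = h1.sub(r'<h1>\1</h1>', html)
html = bold.sub(r'<b>\1</b>', html)
print(html)
```

For anything beyond these simple cases, a dedicated markdown parser remains the more reliable choice, as the other answer points out.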
