I know from this question that the "nothing to repeat" error in a regular expression is a known Python bug.
But I need to compile this Unicode expression
re.compile(u'\U0000002A \U000020E3')
as a single character. This is an emoticon, and it is a single character. Python understands this string as u'* \\u20e3' and raises a 'nothing to repeat' error.
I have looked around but can't find a solution. Is there a workaround?
This has little to do with the question you linked. You're not running into a bug. Your regex simply has a special character (a *) that you haven't escaped.
Simply escape the string before compiling it into a regex:
re.compile(re.escape(u'\U0000002A \U000020E3'))
Now, I'm a little unsure why you're representing * as \U0000002A. Perhaps you could clarify what your intent is here?
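As a rough illustration (Python 3 syntax, with a made-up sample sentence), the unescaped string fails to compile because the leading * is read as a quantifier with nothing in front of it, while the escaped version matches the literal text:

```python
import re

keycap = "\U0000002A \U000020E3"  # the string from the question: '*', space, U+20E3

# Unescaped, the leading '*' is a quantifier with nothing before it.
try:
    re.compile(keycap)
    compiled_unescaped = True
except re.error:
    compiled_unescaped = False

# Escaped, every character is matched literally.
pattern = re.compile(re.escape(keycap))
match = pattern.search("cool " + keycap + " crazy")

print(compiled_unescaped)   # whether the raw string compiled
print(match is not None)    # whether the escaped pattern found the text
```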
You need to use re.escape (as shown in Thomas Orozco's answer),
but use it only on the dynamic part, for example:
print re.findall( u"cool\s*%s" % re.escape(u'\U0000002A \U000020E3'),
u"cool * \U000020E3 crazy")
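The snippet above uses the Python 2 print statement. A Python 3 version of the same idea might look like the sketch below; the variable names are only illustrative:

```python
import re

dynamic = "\U0000002A \U000020E3"            # text that must be matched literally
pattern = r"cool\s*%s" % re.escape(dynamic)  # only the dynamic part is escaped
results = re.findall(pattern, "cool * \U000020E3 crazy")
print(results)
```

The hand-written part of the pattern (cool\s*) keeps its regex meaning, and only the inserted text is neutralised.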
Related
When using Regex in Python, it's easy to use brackets to represent a range of characters a-z, but this doesn't seem to be working for other languages, like Arabic:
import re
pattern = '[ي-ا]'
p = re.compile(pattern)
This results in a long error report that ends with
raise error("bad character range")
sre_constants.error: bad character range
How can this be fixed?
Since Arabic characters are rendered from right to left, the correct string below, which reads "from ا to ي", is rendered backward (try selecting the string if you want to confirm):
'[ا-ي]'
Console output:
>>> re.compile('[ا-ي]')
<_sre.SRE_Pattern object at 0x6001f0a80>
>>> re.compile('[ا-ي]', re.DEBUG)
in
range (1575, 1610)
<_sre.SRE_Pattern object at 0x6001f0440>
So your pattern '[ي-ا]' is actually "from ي to ا", which is an invalid range, since the code point of ا is smaller than the code point of ي.
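One way to check the ordering without relying on how the editor renders the text is to compare code points directly. A minimal sketch, writing the two letters as escapes:

```python
import re

alef, yeh = "\u0627", "\u064a"   # the two letters from the question, as escapes
print(ord(alef), ord(yeh))       # code points of the range endpoints

re.compile("[%s-%s]" % (alef, yeh))       # low-to-high: compiles
try:
    re.compile("[%s-%s]" % (yeh, alef))   # high-to-low: rejected
    reversed_ok = True
except re.error:
    reversed_ok = False
print(reversed_ok)
```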
To prevent confusion, Ignacio Vazquez-Abrams's suggestion of using Unicode escape is a good alternative to the solution I provide above.
Use Unicode escapes instead.
>>> re.compile('[\u0627-\u064a]')
<_sre.SRE_Pattern object at 0x237f460>
The approved answer does work; however, the range [\u0627-\u064a] does not include variants of the letter 'ا' such as 'أ', 'آ' or 'إ', or 'ؤ', the variant of the letter 'و'. (I wanted to comment on or suggest an edit to the approved answer, but there's a queue.)
So in case someone is (re)visiting this question and needs those letter variants, a Unicode range that worked better for me was [\u0600-\u06FF], making the answer:
pattern = re.compile('[\u0600-\u06FF]')
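The difference between the two ranges can be checked with a quick sketch like the one below. Note that U+0600 to U+06FF is the whole Arabic block, so it also matches digits and punctuation such as the Arabic comma, which may or may not be what you want:

```python
import re

narrow = re.compile("[\u0627-\u064a]")
wide = re.compile("[\u0600-\u06FF]")

variants = [
    "\u0622",  # ALEF WITH MADDA ABOVE
    "\u0623",  # ALEF WITH HAMZA ABOVE
    "\u0624",  # WAW WITH HAMZA ABOVE
    "\u0625",  # ALEF WITH HAMZA BELOW
]
narrow_hits = [bool(narrow.match(ch)) for ch in variants]
wide_hits = [bool(wide.match(ch)) for ch in variants]
comma_hit = bool(wide.match("\u060c"))  # ARABIC COMMA is in the wide range too

print(narrow_hits, wide_hits, comma_hit)
```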
First of all, sorry: I'm quite certain this is a duplicate, but I didn't manage to find the right solution.
I simply want to replace all line breaks in my SQL code so I can log it on one line, but Python's f-strings don't support backslashes inside the expression part, so:
# Works fine (but is useless ;))
self.logger.debug(f"Executing: {sql.replace( 'C','XXX')}")
# Results in SyntaxError:
# f-string expression part cannot include a backslash
self.logger.debug(f"Executing: {sql.replace( '\n',' ')}")
Of course there are several ways to accomplish that before the f-string, but I'd really like to keep my "log the line" code on one line and without additional helper variables.
(Besides, I think it's quite stupid behavior: either you can execute code within the curly brackets or you can't... not "you can, but only without backslashes"...)
This one isn't a desired solution because of additional variables:
How to use newline '\n' in f-string to format output in Python 3.6?
General Update
The suggestion in mkrieger1's comment:
self.logger.debug("Executing %s", sql.replace('\n',' '))
works fine for me, but as it doesn't use f-strings at all (whether that is good or bad ;)), I think I can leave this question open.
I found a possible solution:
from os import linesep
print(f'{string_with_multiple_lines.replace(linesep, " ")}')
Best,
You can do this
newline = '\n'
self.logger.debug(f"Executing: {sql.replace( newline,' ')}")
Your options are:
don't use f-strings, especially for logging
assign the newline to a constant and use that, which you apparently don't want to do
use another way of expressing a newline, chr(10) for instance
(Besides, I think it's quite stupid behavior: either you can execute code within the curly brackets or you can't... not "you can, but only without backslashes"...)
Feel free to take a shot at fixing it; I'm pretty sure this restriction was not added because the PEP authors and feature developers wanted it to be a pain in the ass.
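The workarounds mentioned in this thread can be compared side by side. This is only a sketch with a made-up sql string; as far as I know the backslash restriction was lifted in Python 3.12 (PEP 701), so it only matters on older versions. The os.linesep variant is left out because os.linesep is '\r\n' on Windows and would then miss plain '\n' characters.

```python
import logging

logger = logging.getLogger(__name__)
sql = "SELECT *\nFROM t\nWHERE x = 1"   # hypothetical statement for illustration

# 1. Helper constant holding the newline.
newline = "\n"
via_constant = f"Executing: {sql.replace(newline, ' ')}"

# 2. chr(10) avoids writing a backslash inside the braces.
via_chr = f"Executing: {sql.replace(chr(10), ' ')}"

# 3. No f-string: let logging do the formatting lazily.
logger.debug("Executing: %s", sql.replace("\n", " "))

print(via_constant)
print(via_chr)
```

The third form has the side benefit that the message is only formatted if the debug level is actually enabled.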
Hi,
I'm trying to print some Unicode symbols, let's say from U+2660 to U+2667.
With one there's no problem; I just write:
print('\u2660')
but when I want to print a set of symbols in a loop (or a single one that depends on a variable), something like this doesn't work:
for i in range(2660, 2668):
    print('\u{}'.format(i))
I thought Python would execute the .format call first and replace {} with the number, and then look at what is inside the quotes and print it. It doesn't, and I don't understand why. :)
Please help,
TIA
wiktor
The parsing of the Unicode escape is done at compile-time, not runtime.
for i in range(0x2660, 0x2668):
    print(chr(i))
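Two details are easy to miss in the fix above: the range bounds are now hexadecimal (0x2660, not the decimal number 2660), and chr() builds the character at run time from an integer. A small sketch, including the case where the code point arrives as hex text:

```python
# Build the eight card-suit symbols U+2660..U+2667 at run time.
suits = [chr(code_point) for code_point in range(0x2660, 0x2668)]

# If the code point arrives as hex text, convert it with base 16 first.
from_text = chr(int("2660", 16))

print(" ".join(suits))
print(from_text == "\u2660")
```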
I'm looking for a simple Python function that takes a string and returns a similar one, but with all non-ASCII characters converted to their closest ASCII equivalent.
For example, diacritics and the like should be dropped.
I imagine there must be a fairly canonical way to do this, and there are plenty of related Stack Overflow questions, but I'm not finding a simple answer, so it seemed worth a separate question.
Example input/output:
"Étienne" -> "Etienne"
Reading this question made me go looking for something better.
https://pypi.python.org/pypi/Unidecode/0.04.1
Does exactly what you ask for.
In Python 3 and using the regex implementation at PyPI:
http://pypi.python.org/pypi/regex
Starting with the string:
>>> s = "Étienne"
Normalise to NFKD and then remove the diacritics:
>>> import unicodedata
>>> import regex
>>> regex.sub(r"\p{Mn}", "", unicodedata.normalize("NFKD", s))
'Etienne'
Doing a search for 'iconv TRANSLIT python' I found:
http://www.tablix.org/~avian/blog/archives/2009/01/unicode_transliteration_in_python/ which looks like it might be what you need. The comments have some other ideas which use the standard library instead.
There's also http://web.archive.org/web/20070807224749/http://techxplorer.com/2006/07/18/converting-unicode-to-ascii-using-python/ which uses NFKD to get the base characters where possible.
Read the answers to some of the duplicate questions. The NFKD gimmick works only as an accent stripper. It doesn't handle ligatures and lots of other Latin-based characters that can't be (or aren't) decomposed. For this a prepared translation table is necessary (and much faster).
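To make the limitation concrete, here is a standard-library sketch of the NFKD approach (using unicodedata.combining() as a rough stand-in for the \p{Mn} test) together with a tiny hand-made translation table. The table entries and function names are purely illustrative assumptions, not a complete mapping:

```python
import unicodedata

def strip_accents(text):
    """Decompose with NFKD and drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Illustrative table for letters that have no decomposition.
EXTRA = str.maketrans({"Ø": "O", "ø": "o", "ß": "ss", "Æ": "AE", "æ": "ae"})

def to_ascii(text):
    """Apply the hand-made table first, then strip remaining accents."""
    return strip_accents(text.translate(EXTRA))

print(strip_accents("Étienne"))
print(strip_accents("Ø"))      # unchanged: no decomposition exists
print(to_ascii("Straße"))
```

A real table would need many more entries, which is the gap a package like Unidecode fills.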
I am trying to convert the following Perl regex I found in the Video::Filename Perl module to a Python 2.5.4 regex to parse a filename
# Perl > v5.10
re => '^(?:(?<name>.*?)[\/\s._-]*)?(?<openb>\[)?(?<season>\d{1,2})[x\/](?<episode>\d{1,2})(?:-(?:\k<season>x)?(?<endep>\d{1,2}))?(?(<openb>)\])(?:[\s._-]*(?<epname>[^\/]+?))?$',
I would like to use named groups too, and I know in Python the regex extension for named groups is different, but I am not 100% sure on the syntax.
This is what I tried:
# Python (not working)
r = re.compile(r'^(?:(?P<name>.*?)[\/\s._-]*)?(?P<openb>\[)?(?P<season>\d{1,2})[x\/](?P<episode>\d{1,2})(?:-(?:\kP<season>x)?(?P<endep>\d{1,2}))?(?(P<openb>)\])(?:[\s._-]*(?P<epname>[^\/]+?))?$')
The error I get:
raise error, v # invalid expression
sre_constants.error: bad character in group name
For example, this one I managed to convert and it works. But the one above I can't seem to get right. I get a compilation error in Python.
# Perl:
re => '^(?:(?<name>.*?)[\/\s._-]+)?(?:s|se|season|series)[\s._-]?(?<season>\d{1,2})[x\/\s._-]*(?:e|ep|episode|[\/\s._-]+)[\s._-]?(?<episode>\d{1,2})(?:-?(?:(?:e|ep)[\s._]*)?(?<endep>\d{1,2}))?(?:[\s._]?(?:p|part)[\s._]?(?<part>\d+))?(?<subep>[a-z])?(?:[\/\s._-]*(?<epname>[^\/]+?))?$',
# Python (working):
r = re.compile(r'^(?:(?P<name>.*?)[\/\s._-]+)?(?:s|se|season|series)[\s._-]?(?P<season>\d{1,2})[x\/\s._-]*(?:e|ep|episode|[\/\s._-]+)[\s._-]?(?P<episode>\d{1,2})(?:-?(?:(?:e|ep)[\s._]*)?(?P<endep>\d{1,2}))?(?:[\s._]?(?:p|part)[\s._]?(?P<part>\d+))?(?P<subep>[a-z])?(?:[\/\s._-]*(?P<epname>[^\/]+?))?$')
I am not sure where to start looking.
There are two problems with your translation. First, the second mention of openb has extra parentheses around it, which makes it a conditional expression, not a named group; in Python the condition is written with the bare group name, (?(openb)\]).
Second, you didn't translate the \k<season> backreference; Python uses (?P=season) to match the same text. The following compiles for me:
r = re.compile(r'^(?:(?P<name>.*?)[\/\s._-]*)?(?P<openb>\[)?(?P<season>\d{1,2})[x\/](?P<episode>\d{1,2})(?:-(?:(?P=season)x)?(?P<endep>\d{1,2}))?(?(openb)\])(?:[\s._-]*(?P<epname>[^\/]+?))?$')
If this needs to remain maintainable, I'd use re.VERBOSE to split the expression over multiple lines and document it thoroughly, so you can still understand it in the future.
(edited after realising the second openb reference was a conditional expression, and to properly translate the backreference).
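A possible re.VERBOSE layout of the compiled expression above is sketched below, written for Python 3. The comments are my reading of each part, and the sample filename is made up:

```python
import re

EPISODE_RE = re.compile(r"""
    ^
    (?:(?P<name>.*?)[\/\s._-]*)?      # optional show name plus separators
    (?P<openb>\[)?                    # optional opening bracket
    (?P<season>\d{1,2})               # season number
    [x\/]                             # separator between season and episode
    (?P<episode>\d{1,2})              # episode number
    (?:-(?:(?P=season)x)?             # optional range, season may be repeated
       (?P<endep>\d{1,2}))?           # last episode of the range
    (?(openb)\])                      # closing bracket only if one was opened
    (?:[\s._-]*(?P<epname>[^\/]+?))?  # optional episode name
    $
""", re.VERBOSE)

m = EPISODE_RE.match("Show.Name.[3x07-3x08].Title")
print(m.groupdict())
```

Whitespace outside character classes is ignored in verbose mode, so the pattern itself is unchanged.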
I found the offending part but can't figure out what exactly is wrong without wrapping my mind around the whole thing.
r = re.compile(r'^(?:(?P<name>.*?)[\/\s._-]*)?(?P<openb>\[)?(?P<season>\d{1,2})[x\/](?P<episode>\d{1,2})(?:-(?:\kP<season>x)?(?P<endep>\d{1,2}))?
(?(P<openb>)\]) // this part here causes the error message
(?:[\s._-]*(?P<epname>[^\/]+?))?$')
The problem seems to be with the fact that group names in python must be valid python identifiers (check documentation). The parentheses seem to be the problem. Removing them gives
(?(P<openb>)\]) //with parentheses
(?P<openb>\]) //without parentheses
redefinition of group name 'openb' as group 6; was group 2
Those regexps are the product of a sick and twisted mind... :-)
Anyway, (?()) are conditions in both Python and Perl, and the Perl syntax above looks like it should behave the same as the Python syntax, i.e., it evaluates as true if the named group exists.
Where to start looking? The documentation for the modules is here:
http://docs.python.org/library/re.html
http://www.perl.com/doc/manual/html/pod/perlre.html
I may be wrong, but you tried to get the backreference using:
(?:\k<season>x)
Isn't the syntax \g<name> in Python?
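For what it's worth, both spellings exist in Python's re module but in different places: (?P=name) is the backreference inside a pattern, while \g<name> is used in the replacement string of re.sub. A small sketch with a made-up example:

```python
import re

# Inside the pattern: (?P=word) must match the same text as group 'word'.
doubled = re.compile(r"(?P<word>\w+) (?P=word)")

# Inside the replacement string: \g<word> inserts what the group captured.
fixed = doubled.sub(r"\g<word>", "the the cat")
print(fixed)
```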