I have a script that parses the filenames of TV episodes (show.name.s01e02.avi for example), grabs the episode name (from the www.thetvdb.com API) and automatically renames them into something nicer (Show Name - [01x02].avi)
The script works fine until you try to use it on files that have Unicode show names (something I never really thought about, since all the files I have are English, so they pretty much all fall within [a-zA-Z0-9'\-]).
How can I allow the regular expressions to match accented characters and the like? Currently the regex config section looks like this:
config['valid_filename_chars'] = """0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!#£$%^&*()_+=-[]{}"'.,<>`~? """
config['valid_filename_chars_regex'] = re.escape(config['valid_filename_chars'])
config['name_parse'] = [
    # foo_[s01]_[e01]
    re.compile('''^([%s]+?)[ \._\-]\[[Ss]([0-9]+?)\]_\[[Ee]([0-9]+?)\]?[^\\/]*$''' % (config['valid_filename_chars_regex'])),
    # foo.1x09*
    re.compile('''^([%s]+?)[ \._\-]\[?([0-9]+)x([0-9]+)[^\\/]*$''' % (config['valid_filename_chars_regex'])),
    # foo.s01.e01, foo.s01_e01
    re.compile('''^([%s]+?)[ \._\-][Ss]([0-9]+)[\.\- ]?[Ee]([0-9]+)[^\\/]*$''' % (config['valid_filename_chars_regex'])),
    # foo.103*
    re.compile('''^([%s]+)[ \._\-]([0-9]{1})([0-9]{2})[\._ -][^\\/]*$''' % (config['valid_filename_chars_regex'])),
    # foo.0103*
    re.compile('''^([%s]+)[ \._\-]([0-9]{2})([0-9]{2,3})[\._ -][^\\/]*$''' % (config['valid_filename_chars_regex'])),
]
Use a subrange of [\u0000-\uFFFF] for what you want.
You can also use the re.UNICODE compile flag. The docs say that if UNICODE is set, \w will match the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database.
See also http://coding.derkeiler.com/Archive/Python/comp.lang.python/2004-05/2560.html.
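For example, a rough sketch of how that could look for the filename case (the pattern and the filename below are illustrative, not the question's exact config):

import re

# \w covers Unicode letters when re.UNICODE is set (and by default in Python 3),
# so accented show names no longer fall outside the character class.
pattern = re.compile(r"^([\w'. -]+?)[ ._-][Ss](\d+)[. _-]?[Ee](\d+)", re.UNICODE)

m = pattern.match(u'Tod.und.Mädchen.s01e02.avi')
if m:
    print(m.group(1), m.group(2), m.group(3))  # Tod.und.Mädchen 01 02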
Python's re module doesn't support \p{Letter} or \X. However, the new regex implementation on PyPI does.
In Mastering Regular Expressions by Jeffrey Friedl (great book) it is mentioned that you could use \p{Letter}, which matches anything Unicode classifies as a letter.
\X is also available in some regex flavors; it matches a single grapheme cluster (one user-perceived character), regardless of how many code points it takes up. Might be useful.
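A rough sketch with the regex module, assuming it is installed (pip install regex); the filename is invented for illustration:

import regex

# \p{L} is any Unicode letter, \p{N} any Unicode digit
pattern = regex.compile(r"^([\p{L}\p{N}'. -]+?)[ ._-][Ss](\d+)[Ee](\d+)")

m = pattern.match(u'Show.Näme.s01e02.avi')
if m:
    print(m.groups())  # ('Show.Näme', '01', '02')

# \X matches one grapheme cluster, even when it spans several code points:
print(regex.findall(r'\X', u'e\u0301'))  # ['é'] (one grapheme, two code points)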
I started learning python for the first time in an accelerated course on data science a few weeks ago and we were introduced early on to f-strings.
The simple code:
name = 'Tim'
print(f'There are some who call me {name}...')
outputs the string "There are some who call me Tim..."
Through my browsing of various packages out of curiosity, I came upon pages like this one detailing a function you can call in matplotlib to render $\LaTeX$-like expressions within the generated images. In the example code they use something similar to f-strings but with an r instead of an f.
import matplotlib.pyplot as plt
plt.title(r'$\alpha > \beta$')
plt.show()
The resulting (otherwise empty) graph has a title formatted much as one would expect from MathJax or $\LaTeX$, with a Greek alpha and a Greek beta.
My questions are the following:
What precisely is an r-string and how does it compare to an f-string? Are r-strings specifically used for matplotlib's mathtext and usetex?
Apart from f-strings and r-strings, are there any other notable similar string variants or alternates that I should familiarize myself with or be made aware of?
An r-string is a raw string.
It does not process escape sequences: for example, "\n" is a string containing a newline character, while r"\n" is a string containing a backslash and the letter n.
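A quick way to see the difference for yourself:

s = "\n"     # one character: a newline
r = r"\n"    # two characters: a backslash and the letter n

print(len(s), len(r))    # 1 2
print(repr(s), repr(r))  # '\n' '\\n'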
If you wanted to compare it to an f-string, you could think of f-strings as being "batteries-included." They have tons of flexibility in the ability to escape characters and execute nearly arbitrary expressions. The r-string on the other hand is stripped down and minimalist, containing precisely the characters between its quotation marks.
As far as actually using the things, typically you would use an r-string if you're passing the string into something else that uses a bunch of weird characters or does its own escaping so that you don't have to think too hard about how many backslashes you really need to get everything to work correctly. In your example, they at least needed r-strings to get the \a bit working correctly without double escapes. Note that '$\\alpha > \\beta$' is identical to r'$\alpha > \beta$'.
Since you're using f-strings, I'll assume you have at least Python 3.6. Not all of these options are supported for older versions but any of the following prefixes are valid in Python 3.6+ in any combination of caps and lowers: r, u, f, rf, fr, b, rb, br
The b-strings are binary literals. In Python 2 they do nothing and only exist so that the source code is compatible with Python 3. In Python 3, they allow you to create a bytes object. Strings can be thought of as a view of the underlying bytes, often restricted as to which combinations are allowed. The distinction in types helps to prevent errors from blindly applying text techniques to raw data. In Python 3, note that 'A'==b'A' is False. These are not the same thing.
The u-strings are unicode literals. Strings are unicode by default in Python 3, but the u prefix is allowed for backward compatibility with Python 2. In Python 2, plain strings are byte strings, and the u prefix lets you write Unicode literals containing non-ASCII characters. For example, note the accented character in the French phrase u"Fichier non trouvé".
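To tie the prefixes together, a small sketch (Python 3.6+; the strings themselves are just examples):

name = 'Tim'
print(f'There are some who call me {name}...')  # f: expressions are formatted in
print(r'C:\new\folder')                         # r: backslashes stay literal
print(b'A' == 'A')                              # b: bytes, so this prints False
print(u'Fichier non trouvé')                    # u: allowed, but redundant in Python 3
print(rf'{name}\n')                             # prefixes combine: formatted, backslash kept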
In the kind of code I write, I rarely need anything beyond r, u, f, and b. Even b is a bit out there. Other people deal with those prefixes every day (presumably). They aren't necessarily anything you need to familiarize yourself with, but knowing they exist and being able to find their documentation is probably a good skill to have.
Just so that it's in an answer instead of buried in a comment, Peter Gibson linked the language specification, and that's the same place I pulled the prefix list from. With your math background, a formal language specification might be especially interesting — depending a little on how much you like algebra and mathematical logic.
Even if it's just for a semantically trivial language like Forth, I think many programmers would enjoy writing a short interpreter and gain valuable insight into how their language of choice works.
I want to replace all emoji with '' but my regex doesn't work. For example,
content= u'?\u86cb\u767d12\U0001f633\uff0c\u4f53\u6e29\u65e9\u6668\u6b63\u5e38\uff0c\u5348\u540e\u665a\u95f4\u53d1\u70ed\uff0c\u6211\u73b0\u5728\u8be5\u548b\U0001f633?'
and I want to replace all the forms like \U0001f633 with '', so I wrote this code:
print re.sub(ur'\\U[0-9a-fA-F]{8}','',content)
But it doesn't work.
Thanks a lot.
You won't be able to recognize properly decoded unicode codepoints that way (as strings containing \uXXXX, etc.). Properly decoded, by the time the regex parser gets to them, each is a single character.
Depending on whether your python was compiled with only 16-bit unicode code points or not, you'll want a pattern something like either:
# 16-bit codepoints
re_strip = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')
# 32-bit codepoints
re_strip = re.compile(u'[\U00010000-\U0010FFFF]')
And your code would look like:
import re
# Pick a pattern, adjust as necessary
#re_strip = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')
re_strip = re.compile(u'[\U00010000-\U0010FFFF]')
content= u'[\u86cb\u767d12\U0001f633\uff0c\u4f53\u6e29\u65e9\u6668\u6b63\u5e38\uff0c\u5348\u540e\u665a\u95f4\u53d1\u70ed\uff0c\u6211\u73b0\u5728\u8be5\u548b\U0001f633]'
print(content)
stripped = re_strip.sub('', content)
print(stripped)
Both expressions reduce the number of characters in the stripped string to 26.
These expressions strip out the emojis you were after, but may also strip out other things you do want. It may be worth reviewing a unicode codepoint range listing (e.g. here) and adjusting them.
You can determine whether your python install will only recognize 16-bit codepoints by doing something like:
import sys
print(sys.maxunicode.bit_length())
If this displays 16, you'll need the first regex expression. If it displays something greater than 16 (for me it says 21), the second one is what you want.
Neither expression will work when used on a python install with the wrong sys.maxunicode.
See also: this related question.
I have a config file in which the user can specify sections, and within those sections they can specify regular expressions. I have to parse this config file and separate the regexes into their sections.
Is there an easy way to distinguish a regex from a section header? I was thinking of just the standard
[section]
regex1
regex2
But I just realized that [section] is a valid regex. So I'm wondering if there's a way I can format a section header so that it can ONLY be understood as a section header and not a regex.
There are unlimited ways of making an invalid regex, but the first thing that comes to mind would be
*section*
You can't have a quantifier (*) at the start of the regexp.
(The other * is there just to satisfy my obsession for symmetry.)
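To see the distinction concretely, a quick sanity check with the standard re module:

import re

re.compile('[section]')         # compiles fine: it is a valid (if boring) regex
try:
    re.compile('*section*')
except re.error as exc:
    print('not a regex:', exc)  # nothing to repeat at position 0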
I don't know your problem domain, so I don't know what forms of regex you're expecting, but it seems to me you should keep your section formatting as it is. A regex that starts with [ and ends with ] and has no square brackets in between is quite unusual. It can only match a single character. So leave the section headers as they are. Strictly speaking, they are valid regexes, but they probably aren't interesting regexes.
Also, why not use ConfigParser from the standard library, and let it do the parsing for you?
There are easy ways, but they all require changing your format:
Use indentation, similar to how Python source is interpreted. Leading spaces would need special handling, e.g. "(?: )abc" instead of " abc".
Use an INI format, where each item in a section requires a name=value pair.
Use some sort of list syntax; ast.literal_eval will be helpful (see the sketch below).
section1 = [
"regex 1",
"2",
"3",
]
section2 = ["..."]
Primarily, don't invent your own format; if you must, make it as close to a known format as you can. The third option is a subset of Python syntax, for example, and you could even use raw string literals naturally.
JSON or YAML may be useful for you.
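If you do go the list-syntax route, here is a rough sketch of how the ast module could parse it; the section names and contents are invented for illustration:

import ast

text = '''
section1 = [
    r"regex 1",
    r"2",
    r"3",
]
section2 = ["..."]
'''

sections = {}
for node in ast.parse(text).body:
    # each top-level statement is expected to be: name = ["...", ...]
    if isinstance(node, ast.Assign) and isinstance(node.targets[0], ast.Name):
        sections[node.targets[0].id] = ast.literal_eval(node.value)

print(sections)
# {'section1': ['regex 1', '2', '3'], 'section2': ['...']}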
As others have said, please don't invent yet another config format. Use the Python Standard Library's ConfigParser, which will be able to parse the [section] notation exactly as you have shown it.
EDIT: The allow_no_value option allows you to have just a single entry on a line, rather than a key/value pair. And the default dict type is OrderedDict, so it will maintain order.
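A minimal sketch of that approach (Python 3's configparser; the section and patterns are invented):

from configparser import ConfigParser

text = r'''
[section]
regex1
.*\.avi$
'''

parser = ConfigParser(allow_no_value=True)
parser.optionxform = str   # keep the patterns case-sensitive
parser.read_string(text)

for section in parser.sections():
    # with allow_no_value=True, bare lines come back as keys mapped to None
    print(section, list(parser[section]))   # section ['regex1', '.*\\.avi$']

# caveat: a pattern containing '=' or ':' would be split into a key/value pair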
I use this regex on some input,
[^a-zA-Z0-9##]
However this ends up removing lots of html special characters within the input, such as
#227;, #1606;, #1588; (I had to remove the & prefix so that they wouldn't show up as the actual characters..)
Is there a way I can convert them to their values so that they will satisfy the regex? I also have no idea why the text decided to be so big.
Given that your text appears to have numeric-coded, not named, entities, you can first convert your byte string that includes xml entity defs (ampersand, hash, digits, semicolon) to unicode:
import re
xed_re = re.compile(r'&#(\d+);')
def usub(m): return unichr(int(m.group(1)))
s = '&#227;, &#1606;, &#1588;'
u = xed_re.sub(usub, s)
if your terminal emulator can display arbitrary unicode glyphs, a print u will then show
ã, ن, ش
In any case, you can now, if you wish, use your original RE and you won't accidentally "catch" the entities, only ascii letters, digits, and the couple of punctuation characters you listed. (I'm not sure that's what you really want -- why not accented letters but just ascii ones, for example? -- but, if it is what you want, it will work).
If you do have named entities in addition to the numeric-coded ones, you can also apply the htmlentitydefs standard library module recommended in another answer (it only deals with named entities which map to Latin-1 code points, however).
You can adapt the following script:
import htmlentitydefs
import re
def substitute_entity(match):
    name = match.group(1)
    if name in htmlentitydefs.name2codepoint:
        return unichr(htmlentitydefs.name2codepoint[name])
    elif name.startswith('#'):
        try:
            return unichr(int(name[1:]))
        except ValueError:
            pass
    return '?'

print re.sub('&(#?\\w+);', substitute_entity, 'x &#171; y &wat; z &#123;')
Produces the following answer here:
x « y ? z {
EDIT: I understood the question as "how to get rid of HTML entities before further processing", hope I haven't wasted time on answering a wrong question ;)
Without knowing what the expression is being used for I can't tell exactly what you need.
This will match special characters or strings of characters excluding letters, digits, &, and #:
[^a-zA-Z0-9&#]*|&#[0-9A-Za-z]+;
I'm using Python 2.x [not negotiable] to read XML documents [created by others] that allow the content of many elements to contain characters that are not valid XML characters by escaping them using the _xHHHH_ convention e.g. ASCII BEL aka U+0007 is represented by the 7-character sequence u"_x0007_". Neither the functionality that allows representation of any old character in the document nor the manner of escaping is negotiable. I'm parsing the documents using cElementTree or lxml [semi-negotiable].
Here is my best attempt at unescaping the parser output as efficiently as possible:
import re
def unescape(s,
             subber=re.compile(r'_x[0-9A-Fa-f]{4,4}_').sub,
             repl=lambda mobj: unichr(int(mobj.group(0)[2:6], 16)),
             ):
    if "_" in s:
        return subber(repl, s)
    return s
The above is biased by the observation that "_" occurs at a very low frequency in typical text, and by a better-than-doubling of speed from avoiding the regex apparatus where possible.
The question: Any better ideas out there?
You might as well check for '_x' rather than just '_'; it won't matter much, but the two-character sequence is surely even rarer than the single underscore. Apart from such details, you do seem to be making the best of a bad situation!
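In code, the suggested tweak would just be (keeping the question's Python 2 unichr):

import re

def unescape(s,
             subber=re.compile(r'_x[0-9A-Fa-f]{4,4}_').sub,
             repl=lambda mobj: unichr(int(mobj.group(0)[2:6], 16)),
             ):
    if "_x" in s:   # the two-character fast-path test; otherwise identical
        return subber(repl, s)
    return s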