Match letter in any language - python

How can I match a letter from any language using a regex in Python 3?
re.match(r'[a-zA-Z]', s) will match English characters, but I want all languages to be supported simultaneously.
I don't wish to match the ' in can't, underscores, or any other kind of formatting. I do wish my regex to match: c, a, n, t, Å, é, and 中.

For Unicode regex work in Python, I very strongly recommend the following:
Use Matthew Barnett’s regex library instead of standard re, which is not really suitable for Unicode regular expressions.
Use only Python 3, never Python 2. You want all your strings to be Unicode strings.
Use only string literals with logical/abstract Unicode codepoints, not encoded byte strings.
Set your encoding on your streams and forget about it. If you find yourself ever manually calling .encode and such, you’re almost certainly doing something wrong.
Use only a wide build where code points and code units are the same, never ever ever a narrow one — which you might do well to consider deprecated for Unicode robustness.
Normalize all incoming strings to NFD on the way in and then NFC on the way out. Otherwise you can’t get reliable behavior.
Once you do this, you can safely write patterns that include \w or \p{script=Latin} or \p{alpha} and \p{lower} etc. and know that these will all do what the Unicode Standard says they should. I explain all of this Python Unicode regex business in much more detail in this answer. The short story is to always use regex, not re.
For general Unicode advice, I also have several talks from last OSCON about Unicode regular expressions; apart from the third, they are not specifically about Python, but much of the material is adaptable.
Finally, there’s always this answer to put the fear of God (or at least, of Unicode) in your heart.
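Putting that advice into practice, here is a minimal sketch, assuming Python 3 and the third-party regex module (pip install regex):
import unicodedata
import regex

def letters(text):
    text = unicodedata.normalize('NFD', text)  # NFD on the way in
    words = regex.findall(r'\p{alpha}[\p{alpha}\p{M}]*', text)  # letters plus combining marks
    return [unicodedata.normalize('NFC', w) for w in words]  # NFC on the way out

print(letters("can't, Å, é, and 中ABC"))
# ['can', 't', 'Å', 'é', 'and', '中ABC']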

What's wrong with using the \w special sequence?
import re

test = "can't, Å, é, and 中ABC"
print(re.findall(r'\w+', test))
# ['can', 't', 'Å', 'é', 'and', '中ABC']

You can match on
\p{L}
which matches any Unicode code point that represents a letter of a script. That is, assuming you have a Unicode-capable regex engine: Python's built-in re module does not support \p{L}, but the third-party regex module does.
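For example, with the regex module:
import regex

text = "can't, Å, é, and 中ABC"
print(regex.findall(r'\p{L}+', text))
# ['can', 't', 'Å', 'é', 'and', '中ABC']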

Build a match class of all the characters you want to match. This might become very, very large. No, there is no regex shorthand for "all Kanji" ;)
Maybe it is easier to match what you do not want, but even then the class would become extremely large.

import re

text = "can't, Å, é, and 中ABC"
print(re.findall(r'\w+', text))
# ['can', 't', 'Å', 'é', 'and', '中ABC']
This works in Python 3, but it also matches underscores. The third-party regex module, however, seems to do exactly what I want:
import regex

text = "can't, Å, é, and 中ABC _ sh_t"
print(regex.findall(r'\p{alpha}+', text))
# ['can', 't', 'Å', 'é', 'and', '中ABC', 'sh', 't']

For Portuguese, try this one:
[a-zA-ZÀ-ú ]+
(Beware that the À-ú range also sweeps in the non-letters × and ÷.)
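A quick check, with the space dropped from the class so that findall returns individual words:
import re

print(re.findall(r'[a-zA-ZÀ-ú]+', 'São Paulo é ótima'))
# ['São', 'Paulo', 'é', 'ótima']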

As noted by others, it would be very difficult to keep an up-to-date database of all letters in all existing languages. But in most cases you don't actually need that, and it can be perfectly fine for your code to begin by supporting just a few chosen languages and adding others as needed.
The following simple code supports matching for Czech, German and Polish. The character sets can easily be obtained from Wikipedia.
import re

LANGS = [
    'ÁáČčĎďÉéĚěÍíŇňÓóŘřŠšŤťÚúŮůÝýŽž',  # Czech
    'ÄäÖöÜüẞß',  # German
    'ĄąĆćĘęŁłŃńÓóŚśŹźŻż',  # Polish
]
pattern = re.compile('[A-Za-z{langs}]'.format(langs=''.join(LANGS)))
result = pattern.findall('Žluťoučký kůň')
print(result)
# ['Ž', 'l', 'u', 'ť', 'o', 'u', 'č', 'k', 'ý', 'k', 'ů', 'ň']
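Continuing the example, adding a quantifier to the same class matches whole words rather than single letters:
word_pattern = re.compile('[A-Za-z{langs}]+'.format(langs=''.join(LANGS)))
print(word_pattern.findall('Žluťoučký kůň'))
# ['Žluťoučký', 'kůň']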

Related

Python regex to split Dutch address

I have a bit of an issue with regex in Python. I am familiar with this regex script in PHP: https://gist.github.com/benvds/350404, but in Python, using the re module, I keep getting no results:
re.findall(r"#^([\w[:punct:] ]+) ([0-9]{1,5})([\w[:punct:]\-/]*)$#", "Wilhelminakade 173")
Output is []
Any ideas?
PHP supports alternative characters as regex delimiters. Your sample Gist uses # for that purpose. They are not part of the regex in PHP, and they are not needed in Python at all. They prevent a match. Remove them.
re.findall(r"^([\w[:punct:] ]+) ([0-9]{1,5})([\w[:punct:]\-/]*)$", "Wilhelminakade 173")
This still gives no result, because Python's re does not know what [:punct:] is supposed to mean: there is no support for POSIX character classes in Python's re. Replace them with something else (e.g. the punctuation you expect, probably something like dots, apostrophes and dashes). This results in
re.findall(r"^([\w.'\- ]+) ([0-9]{1,5})([\w.'\-/]*)$", "Wilhelminakade 173")
which gives [('Wilhelminakade', '173', '')].
Long story short, there are different regex engines in different programming languages. You cannot just copy regex from PHP to Python without looking at it closely, and expect it to work.
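As a readability tweak, the same pattern can use named groups (the group names here are my own choice):
import re

m = re.match(r"^(?P<street>[\w.'\- ]+) (?P<number>[0-9]{1,5})(?P<suffix>[\w.'\-/]*)$",
             "Wilhelminakade 173")
print(m.group('street'), m.group('number'), repr(m.group('suffix')))
# Wilhelminakade 173 ''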

Python's support for hexadecimal escapes in the replacement text

I could not find a corresponding PEP or bug report for one problem in Python's re module.
Does anyone know if the following is planned to be fixed?
From regular-expressions.info:
Python does not support hexadecimal escapes in the replacement text
syntax, even though it supports \xFF and \uFFFF in string constants.
But it actually supports standard escapes like \n, \r, etc.
So, for example, one cannot replace the '<' character with the '>' character using hexadecimal escapes:
>>> import re
>>> re.sub(r'\x3c', r'\x3e', '\x3c')
'\\x3e'
Instead of '\\x3e' it should be '>'.
Using escaped \n works fine:
>>> re.sub(r'a', r'\n', 'a')
'\n'
Thanks in advance!
UPD: Not using a raw string is not an option. For example, if the pattern and replacement strings are stored in a config file, then writing \x3e there gives me '\\x3e' when read, instead of '>'.
The only workaround I know of is to not use a raw string for the replacement text and instead allow normal string evaluation to turn \x3e into >. This works because, as you noted, Python strings do support such escape sequences.
>>> import re
>>> re.sub(r'\x3c', '\x3e', '\x3c')
'>'
This means that in more complex replacement text you need more escapes, which could make it less readable, but at least it works.
I don't know if there is any plan to improve on this. I took a look at the existing documentation for the Python 3.4 re module (under development) and found no mention of plans to include this kind of support.
However, if you have a need for more complex logic on the replacement, you can pass a function instead of replacement text for the repl argument of re.sub.
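For the config-file case from the update, a sketch of one workaround: decode the escapes yourself after reading (assuming the config holds ASCII), then pass the result through a function so re.sub applies no further escape processing to it.
import re

raw_repl = r'\x3e'  # as it would come out of a config file
repl = raw_repl.encode('ascii').decode('unicode_escape')  # turns the 4 characters \x3e into '>'
print(re.sub(r'\x3c', lambda m: repl, 'a<b<c'))
# 'a>b>c'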

Regular expression to parse word structure

I'm trying to build my first non-trivial regular expression (for use in Python), but struggling.
Let us assume that a word in language X (NOT English) is a sequence of minimal 'structures'. Each 'structure' could be:
An independent vowel (basically one letter of the alphabet)
A consonant (one letter of the alphabet)
A consonant followed by a right-attaching vowel
A left-attaching vowel followed by a consonant
(Certain left-attaching vowels) followed by a consonant followed by (certain right-attaching vowels)
For example this word of 3 characters:
<a consonant><a left-attaching vowel><an independent vowel>
is not a valid word, and should not match the regex, because there is no consonant to the right of the left-attaching vowel.
I know all the Unicode ranges - the Unicode ranges for consonants, independent vowels, left-attaching vowels and so on.
Here is what I have so far:
WordPattern = (
    ur'('
    ur'[\u0985-\u0994]|'
    ur'[\u0995-\u09B9]|'
    ur'[\u0995-\u09B9(\u09BE|[\u09C0-\u09C4])]|'
    ur'[(\u09BF|\u09C7|\u09C8)\u0995-\u09B9]|'
    ur'[(\u09BF|\u09C7|\u09C8)\u0995-\u09B9(\u09BE|[\u09C0-\u09C4])]'
    ur')+'
)
It's not working. Apart from getting it to work, I have three specific problems:
I need to split the regular expression over multiple lines, or else the code is going to look terrible. How do I do this?
I would like to use string substitution / templates of some sort to 'name' the Unicode ranges, for code readability and to prevent typing Unicode ranges multiple times.
(This seems very difficult) The list of permissible minimal 'structures' will have to be extended later. Is there any way to set up a sort of 'loop' mechanism within a regex, so that it works for all permissible structures in a list?
Any help would be appreciated. This seems very complex to a beginner!
The appropriate tool for morphological analysis of languages with non-trivial morphology is a "finite state transducer". There are robust implementations that you can track down and use, one of them by Xerox PARC. There's one that has Python bindings (for use as an external library). Google it.
FSTs are based on finite-state automata, like (pure) regular expressions, but they are by no means a drop-in replacement. It's complex machinery, so if your goals are simple (e.g., syllabification for purposes of hyphenation) you may want to look for something simpler. There are machine-learning algorithms that will "learn" hyphenation, for example. If you are indeed interested in morphological analysis, you have to make the effort to look at FSTs.
Now for your algorithm, in case you really only need a trivial implementation: Since any vowel or consonant could be independent, your rules are ambiguous: They allow "ab" to be parsed as "a-b". Such ambiguities mean that a regexp approach will probably never work, but you may get better results if you put the longer regexps first, so they are used in preference to the short ones when both would apply. But really you need to build a parser (by hand or using a module) and try different things in steps. It's backwards from what you imagined: Set up a loop that uses different regexps, and "consumes" the string in steps.
However, it seems to me that what you are describing is essentially syllabification. And the near-universal rule of syllabification is this: A syllable consists of a core vowel, plus as many preceding ("onset") consonants as the rules of the language allow, plus any following consonants that cannot belong to the next syllable. The rule is called "maximize onset", and it has the consequence that it's easier to parse your syllables backwards (from the end of the word). Try it out.
PS. You probably know this, but if you put the following as the second line in your scripts you can embed Bengali in your regexps:
# -*- coding: utf-8 -*-
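A minimal sketch of the loop-and-consume idea, trying longer structures first; the structure regexps are placeholders built from the ranges given in the question:
import re

STRUCTURES = [
    '[\u09BF\u09C7\u09C8][\u0995-\u09B9][\u09BE\u09C0-\u09C4]',  # left vowel + consonant + right vowel
    '[\u09BF\u09C7\u09C8][\u0995-\u09B9]',  # left vowel + consonant
    '[\u0995-\u09B9][\u09BE\u09C0-\u09C4]',  # consonant + right vowel
    '[\u0995-\u09B9]',  # consonant
    '[\u0985-\u0994]',  # independent vowel
]

def parse(word):
    pieces = []
    while word:
        for s in STRUCTURES:
            m = re.match(s, word)
            if m:
                pieces.append(m.group())
                word = word[m.end():]
                break
        else:
            return None  # nothing matched: not a valid word
    return pieces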
I need to split the regular expression over multiple lines, or else the code is going to look terrible. How do I do this?
Use the re.VERBOSE flag when compiling the regex.
import re

pattern = re.compile(r"""(
    [\u0985-\u0994]    # independent vowels
  | [\u0995-\u09B9]    # consonants
    # etc.
)""", re.VERBOSE)
I would like to use string substitution / templates of some sort to 'name' the Unicode ranges
You can construct an RE from ordinary Python strings:
>>> subpatterns = {"vowel": "[aeiou]", "consonant": "[^aeiou]"}
>>> "{consonant}{vowel}+{consonant}*".format(**subpatterns)
'[^aeiou][aeiou]+[^aeiou]*'
The list of permissible minimal 'structures' will have to be extended later. Is there any way to set up a sort of 'loop' mechanism within a regex, so that it works for all permissible structures in a list?
I'm not sure if I get what you mean, but... suppose you have a list of (uncompiled) REs, say, patterns, then you can compute their union with
re.compile("(%s)" % "|".join(patterns))
Be careful with special characters when constructing REs this way and use re.escape where necessary.
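Putting the pieces together: named ranges formatted into a list of structures, then unioned. The names and the structure list are illustrative, not a complete Bengali inventory.
import re

parts = {
    'V': '[\u0985-\u0994]',  # independent vowels
    'C': '[\u0995-\u09B9]',  # consonants
    'R': '[\u09BE\u09C0-\u09C4]',  # right-attaching vowel signs
    'L': '[\u09BF\u09C7\u09C8]',  # left-attaching vowel signs
}
# Extend this list as more structures become permissible; longest first.
structures = ['{L}{C}{R}', '{L}{C}', '{C}{R}', '{C}', '{V}']
alternatives = [s.format(**parts) for s in structures]
word_pattern = re.compile('(?:%s)+' % '|'.join(alternatives))

print(bool(word_pattern.fullmatch('\u0995\u09BE')))  # consonant + right vowel
# True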

Python: splitting string by all space characters

To split strings by spaces in Python, one usually uses the split method of the string without parameters:
>>> 'a\tb c\nd'.split()
['a', 'b', 'c', 'd']
But yesterday I ran across a string that used ZERO WIDTH SPACE between words as well. Having turned my new knowledge into a short black-magic performance (among JavaScript folks), I would like to ask how to better split by all whitespace characters, since plain split is not enough:
>>> u'a\u200bc d'.split()
[u'a\u200bc', u'd']
UPD1
It seems the solution suggested by sth generally works, but depends on some OS settings or Python compilation options. It would be nice to know the reason for sure (and whether the setting can be switched on in Windows).
UPD2
cptphil found a great link that makes everything clear:
So I contacted the Unicode Technical Committee about the issue and promptly received a response. They pointed out that the ZWSP was once upon a time considered whitespace, but that was changed in Unicode 4.0.1
A quotation from unicode site:
Changing U+200B Zero Width Space from Zs to Cf (2003.10.27)
There have been persistent problems with usage of the U+200B Zero Width Space (ZWSP). The function of this character is to allow a line break at positions where it normally would not be allowed; the character is thus functionally a format character with a general category of Cf. This behavior is well documented in the Unicode Standard, and the character is not considered a Whitespace character in the Unicode Character Database. However, for historical reasons the general category is still Zs (Space Separator), which causes the character to be misused. ZWSP is also the only Zs character that is not Whitespace. The general category can cause misinterpretation of rule D13 Base character as allowing ZWSP as a base for combining marks.
The proposal is to change the general category of U+200B from Zs to Cf.
Resolution: Closed. The general category of U+200B will be changed from Zs to Cf in Unicode version 4.0.1.
The change was then reflected in Python. The result of u'\u200B'.isspace() in Python 2.5.4 and 2.6.5 is True, in Python 2.7.1 it is already False.
For other space characters regular split is enough:
>>> u'a\u200Ac'.split()
[u'a', u'c']
And if that is not enough for you, add characters one by one as Gabi Purcaru suggests below.
Edit
It turns out that \u200b is not technically defined as whitespace, and so Python does not recognize it as matching \s, even with the Unicode flag on. So it must be treated as a non-whitespace character.
http://en.wikipedia.org/wiki/Whitespace_character#Unicode
http://bugs.python.org/issue13391
import re
re.split(r"[\u200b\s]+", "some string", flags=re.UNICODE)
You can use a regular expression with enabled Unicode matching:
>>> re.split(r'(?u)\s', u'a\u200bc d')
[u'a', u'c', u'd']
You can use re.split, like this:
import re
re.split(r'\s|\u200b', your_string)
Can you use something like this?
re.split(r'\s+', your_string, flags=re.UNICODE)
(Note that flags must be passed by keyword here; the third positional argument of re.split is maxsplit.)
You can use the 're' module and pass a separator to 'split': http://docs.python.org/library/re.html#re.split
Just use split:
>>> u'\u200b'.isspace()
True
(Only on older Pythons; as noted in the update above, this returns False from Python 2.7 on.)

Matching case sensitive unicode strings with regular expressions in Python

Suppose I want to match a lowercase letter followed by an uppercase letter, I could do something like
re.compile(r"[a-z][A-Z]")
Now I want to do the same thing for Unicode strings, i.e. match something like 'aÅ' or 'yÜ'.
Tried
re.compile(r"[a-z][A-Z]", re.UNICODE)
but that does not work.
Any clues?
This is hard to do with Python regex because the current implementation doesn't support Unicode property shortcuts like \p{Lu} and \p{Ll}.
[A-Za-z] will of course only match ASCII letters, regardless of whether the Unicode option is set or not.
So until the re module is updated (or you install the regex package currently in development), you either need to do it programmatically (iterate through the string and do char.islower()/char.isupper() on the characters), or specify all the Unicode code points manually, which probably isn't worth the effort...
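With the regex package, for example, this becomes straightforward (a quick sketch):
import regex

pattern = regex.compile(r'\p{Ll}\p{Lu}')
print(pattern.findall('aÅ yÜ ab'))
# ['aÅ', 'yÜ']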
