Python: splitting string by all space characters

To split strings by spaces in Python, one usually uses the split method of the string without arguments:
>>> 'a\tb c\nd'.split()
['a', 'b', 'c', 'd']
But yesterday I ran across a string that used a ZERO WIDTH SPACE between words as well. Having turned my new knowledge into a short black magic performance (among JavaScript folks), I would like to ask how to better split by all whitespace characters, since plain split is not enough:
>>> u'a\u200bc d'.split()
[u'a\u200bc', u'd']
UPD1
it seems the solution suggested by sth generally works but depends on some OS settings or Python compilation options. It would be nice to know the reason for sure (and whether the setting can be switched on in Windows).
UPD2
cptphil found a great link that makes everything clear:
So I contacted the Unicode Technical Committee about the issue and promptly received a response back. They pointed out that the ZWSP was, once upon a time, considered whitespace, but that was changed in Unicode 4.0.1.
A quotation from unicode site:
Changing U+200B Zero Width Space from Zs to Cf (2003.10.27)
There have been persistent problems with usage of the U+200B Zero Width Space (ZWSP). The function of this character is to allow a line break at positions where it normally would not be allowed, and is thus functionally a format character with a general category of Cf. This behavior is well documented in the Unicode Standard, and the character is not considered a Whitespace character in the Unicode Character Database. However, for historical reasons the general category is still Zs (Space Separator), which causes the character to be misused. ZWSP is also the only Zs character that is not Whitespace. The general category can cause misinterpretation of rule D13 Base character as allowing ZWSP as a base for combining marks.
The proposal is to change the general category of U+200B from Zs to Cf.
Resolution: Closed. The general category of U+200B will be changed from Zs to Cf in Unicode version 4.0.1.
The change was then reflected in Python. The result of u'\u200B'.isspace() in Python 2.5.4 and 2.6.5 is True, in Python 2.7.1 it is already False.
For other space characters regular split is enough:
>>> u'a\u200Ac'.split()
[u'a', u'c']
And if that is not enough for you, add characters one by one as Gabi Purcaru suggests below.

Edit
It turns out that \u200b is not technically defined as whitespace, and so Python does not recognize it as matching \s even with the Unicode flag on. So it must be treated as a non-whitespace character.
http://en.wikipedia.org/wiki/Whitespace_character#Unicode
http://bugs.python.org/issue13391
import re
re.split(ur"[\u200b\s]+", u"some string", flags=re.UNICODE)
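On Python 3 the same idea works without the u/ur prefixes, since patterns support \uXXXX escapes directly; a minimal sketch:

```python
import re

# Python 3 sketch: split on runs of ZWSP and/or ordinary whitespace
print(re.split(r"[\u200b\s]+", "a\u200bc d"))  # ['a', 'c', 'd']
```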

You can use a regular expression with enabled Unicode matching:
>>> re.split(r'(?u)\s', u'a\u200bc d')
[u'a', u'c', u'd']

You can use re.split, like this:
import re
re.split(u'\s|\u200b', your_string)

Can you use something like this?
re.split(r'\s+', your_string, flags=re.UNICODE)

You can use the 're' module and pass a separator to 'split': http://docs.python.org/library/re.html#re.split

Just use split (on Pythons before 2.7.1, where ZWSP still counts as whitespace):
>>> u'\u200b'.isspace()
True

Related

Python's support for hexadecimal escapes in the replacement text

Could not find a corresponding PEP or a bug report for one problem in Python's re module.
Does anyone know if the following is planned to be fixed?
From regular-expressions.info:
Python does not support hexadecimal escapes in the replacement text
syntax, even though it supports \xFF and \uFFFF in string constants.
But it actually supports standard escapes like \n, \r, etc.
So, for example one cannot replace '<' character with '>' character using hexadecimal escapes:
>>> import re
>>> re.sub(r'\x3c', r'\x3e', '\x3c')
'\\x3e'
Instead of '\\x3e' it should be '>'.
Using escaped \n works fine:
>>> re.sub(r'a', r'\n', 'a')
'\n'
Thanks in advance!
UPD: Not using the raw string is not an option. For example if pattern and replacement strings are stored in a config file, so if I write \x3e in it, it will become '\\x3e' when read, instead of '>'.
The only workaround I know of is to not use a raw string for the replacement text and instead allow normal string evaluation to turn \x3e into >. This works because, as you noted, Python strings do support such sequences.
>>> import re
>>> re.sub(r'\x3c', '\x3e', '\x3c')
'>'
This means that in more complex replacement text you need more escapes, which could make it less readable, but at least it works.
I don't know if there is any plan to improve on this. I took a look at the existing documentation for the python 3.4 re module (under dev) and found no mention of including this kind of support.
However, if you have a need for more complex logic on the replacement, you can pass a function instead of replacement text for the repl argument of re.sub.
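For the config-file case from the update, one workaround (a sketch, assuming Python 3 and ASCII-only patterns; the unicode_escape codec treats non-ASCII input as Latin-1, so it needs care with other text) is to decode the escape sequences after reading, then pass the resulting plain string as the replacement:

```python
import codecs
import re

# The string as read from a config file: literally backslash, 'x', '3', 'e'
raw_repl = "\\x3e"
# Decode \xNN / \uNNNN escapes into the characters they denote
repl = codecs.decode(raw_repl, "unicode_escape")
print(re.sub(r"\x3c", repl, "a<b"))  # a>b
```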

What is the correct way to use unicode characters in a python regex

In the process of scraping some documents using Python 2.7, I've run into some annoying page separators, which I've decided to remove. The separators use some funky characters. I already asked one question here on how to make these characters reveal their utf-8 codes. There are two non-ASCII characters used: '\xc2\xad', and '\x0c'. Now, I just need to remove these characters, as well some spaces and the page numbers.
Elsewhere on SO, I've seen unicode characters used in tandem with regexps, but it's in a strange format that I do not have these characters in, e.g. '\u00ab'. In addition, none of them are using ASCII as well as non-ASCII characters. Finally, the python docs are very light on the subject of unicode in regexes... something about flags... I don't know. Can anyone help?
Here is my current usage, which does not do what I want:
re.sub('\\xc2\\xad\s\d+\s\\xc2\\xad\s\\x0c', '', my_str)
Rather than seek out specific unwanted chars, you could remove everything not wanted:
re.sub('[^\\s!-~]', '', my_str)
This throws away all characters not:
whitespace (spaces, tabs, newlines, etc)
printable "normal" ASCII characters (! is the first printable character and ~ is the last below decimal 128)
You could include more chars if needed - just adjust the character class.
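A quick demonstration of the character-class approach: the soft hyphen \xad is neither whitespace nor printable ASCII, so it gets stripped. Note that \x0c (form feed) is matched by \s and therefore kept, which may or may not be what you want here.

```python
import re

# Remove every character that is neither whitespace nor printable ASCII (! through ~)
print(re.sub(r"[^\s!-~]", "", "a\xadb c"))  # ab c
```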
I had the same problem. I know this is not an efficient way, but in my case it worked:
result = re.sub(r"\\", ",x,x", result)
result = re.sub(r",x,xu00ad", "", result)
result = re.sub(r",x,xu", "\\u", result)

Matching case sensitive unicode strings with regular expressions in Python

Suppose I want to match a lowercase letter followed by an uppercase letter, I could do something like
re.compile(r"[a-z][A-Z]")
Now I want to do the same thing for unicode strings, i.e. match something like 'aÅ' or 'yÜ'.
Tried
re.compile(r"[a-z][A-Z]", re.UNICODE)
but that does not work.
Any clues?
This is hard to do with Python regex because the current implementation doesn't support Unicode property shortcuts like \p{Lu} and \p{Ll}.
[A-Za-z] will of course only match ASCII letters, regardless of whether the Unicode option is set or not.
So until the re module is updated (or you install the regex package currently in development), you either need to do it programmatically (iterate through the string and do char.islower()/char.isupper() on the characters), or specify all the unicode code points manually which probably isn't worth the effort...
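The programmatic alternative mentioned above can be sketched like this, pairing each character with its successor and checking case with the str methods (Python 3, where these methods are Unicode-aware):

```python
def has_lower_then_upper(s):
    # True if any lowercase letter is immediately followed by an uppercase one
    return any(a.islower() and b.isupper() for a, b in zip(s, s[1:]))

print(has_lower_then_upper("aÅ"))  # True
print(has_lower_then_upper("ab"))  # False
```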

Match letter in any language

How can I match a letter from any language using a regex in python 3?
re.match('[a-zA-Z]', s) will match English-language characters, but I want all languages to be supported simultaneously.
I don't wish to match the ' in can't or underscores or any other type of formatting. I do wish my regex to match: c, a, n, t, Å, é, and 中.
For Unicode regex work in Python, I very strongly recommend the following:
Use Matthew Barnett’s regex library instead of standard re, which is not really suitable for Unicode regular expressions.
Use only Python 3, never Python 2. You want all your strings to be Unicode strings.
Use only string literals with logical/abstract Unicode codepoints, not encoded byte strings.
Set your encoding on your streams and forget about it. If you find yourself ever manually calling .encode and such, you’re almost certainly doing something wrong.
Use only a wide build where code points and code units are the same, never ever ever a narrow one — which you might do well to consider deprecated for Unicode robustness.
Normalize all incoming strings to NFD on the way in and then NFC on the way out. Otherwise you can’t get reliable behavior.
Once you do this, you can safely write patterns that include \w or \p{script=Latin} or \p{alpha} and \p{lower} etc and know that these will all do what the Unicode Standard says they should. I explain all of this Python Unicode regex business in much more detail in this answer. The short story is to always use regex, not re.
For general Unicode advice, I also have several talks from last OSCON about Unicode regular expressions; only the third is about Python, but much of the material is adaptable.
Finally, there’s always this answer to put the fear of God (or at least, of Unicode) in your heart.
What's wrong with using the \w special sequence?
# -*- coding: utf-8 -*-
import re
test = u"can't, Å, é, and 中ABC"
print re.findall('\w+', test, re.UNICODE)
You can match on
\p{L}
which matches any Unicode code point that represents a letter of a script. That is, assuming you actually have a Unicode-capable regex engine, which I really hope Python would have.
Build a match class of all the characters you want to match. This might become very, very large. No, there is no RegEx shorthand for "All Kanji" ;)
Maybe it is easier to match for what you do not want, but even then, this class would become extremely large.
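Rather than enumerating characters by hand, the standard library can classify them: unicodedata.category returns 'Lu', 'Ll', 'Lo', etc. for letters, so filtering on a leading 'L' approximates \p{L} without a third-party regex engine. A sketch:

```python
import unicodedata

def letters_only(s):
    # Keep characters whose Unicode general category starts with 'L' (letters)
    return [ch for ch in s if unicodedata.category(ch).startswith("L")]

print(letters_only("can't 中"))  # ['c', 'a', 'n', 't', '中']
```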
import re
text = "can't, Å, é, and 中ABC"
print(re.findall('\w+', text))
This works in Python 3, but it also matches underscores. The following, however, seems to do the job as I wish:
import regex
text = "can't, Å, é, and 中ABC _ sh_t"
print(regex.findall('\p{alpha}+', text))
For the Portuguese language, try this one:
[a-zA-ZÀ-ú ]+
As noted by others, it would be very difficult to keep an up-to-date database of all letters in all existing languages. But in most cases you don't actually need that, and it can be perfectly fine for your code to begin by supporting just several chosen languages and adding others as needed.
The following simple code supports matching for Czech, German and Polish language. The character sets can be easily obtained from Wikipedia.
import re
LANGS = [
'ÁáČčĎďÉéĚěÍíŇňÓóŘřŠšŤťÚúŮůÝýŽž', # Czech
'ÄäÖöÜüẞß', # German
'ĄąĆćĘęŁłŃńÓóŚśŹźŻż', # Polish
]
pattern = '[A-Za-z{langs}]'.format(langs=''.join(LANGS))
pattern = re.compile(pattern)
result = pattern.findall('Žluťoučký kůň')
print(result)
# ['Ž', 'l', 'u', 'ť', 'o', 'u', 'č', 'k', 'ý', 'k', 'ů', 'ň']

Python raw strings and trailing backslash [duplicate]

This question already has answers here:
Why can't Python's raw string literals end with a single backslash?
(14 answers)
Closed 11 months ago.
I ran across something once upon a time and wondered if it was a Python "bug" or at least a misfeature. I'm curious if anyone knows of any justifications for this behavior. I thought of it just now reading "Code Like a Pythonista," which has been enjoyable so far. I'm only familiar with the 2.x line of Python.
Raw strings are strings that are prefixed with an r. This is great because I can use backslashes in regular expressions and I don't need to double everything everywhere. It's also handy for writing throwaway scripts on Windows, so I can use backslashes there also. (I know I can also use forward slashes, but throwaway scripts often contain content cut&pasted from elsewhere in Windows.)
So great! Unless, of course, you really want your string to end with a backslash. There's no way to do that in a 'raw' string.
In [9]: r'\n'
Out[9]: '\\n'
In [10]: r'abc\n'
Out[10]: 'abc\\n'
In [11]: r'abc\'
------------------------------------------------
File "<ipython console>", line 1
r'abc\'
^
SyntaxError: EOL while scanning string literal
In [12]: r'abc\\'
Out[12]: 'abc\\\\'
So one backslash before the closing quote is an error, but two backslashes gives you two backslashes! Certainly I'm not the only one that is bothered by this?
Thoughts on why 'raw' strings are 'raw, except for backslash-quote'? I mean, if I wanted to embed a single quote in there I'd just use double quotes around the string, and vice versa. If I wanted both, I'd just triple quote. If I really wanted three quotes in a row in a raw string, well, I guess I'd have to deal, but is this considered "proper behavior"?
This is particularly problematic with folder names in Windows, where the backslash is the path delimiter.
It's a FAQ.
And in response to "you really want your string to end with a backslash. There's no way to do that in a 'raw' string.": the FAQ shows how to workaround it.
>>> r'ab\c' '\\' == 'ab\\c\\'
True
Raw strings are meant mostly for readably writing the patterns for regular expressions, which never need a trailing backslash; it's an accident that they may come in handy for Windows (where you could use forward slashes in most cases anyway -- the Microsoft C library which underlies Python accepts either form!). It's not considered acceptable to make it (nearly) impossible to write a regular expression pattern containing both single and double quotes, just to reinforce the accident in question.
("Nearly" because triple-quoting would almost always help... but it could be a little bit of a pain sometimes).
So, yes, raw strings were designed to behave that way (forbidding odd numbers of trailing backslashes), and it is considered perfectly "proper behavior" for them to respect the design decisions Guido made when he invented them;-).
Another way to workaround this is:
>>> print(r"Raw \with\ trailing backslash\ "[:-1])
Raw \with\ trailing backslash\
Updated for Python 3, and removed the unnecessary slash at the end, which implied an escape.
Note that personally I doubt I would use the above. I guess maybe if it was a huge string with more than just a path. For the above I'd prefer non-raw and double up the slashes.
Thoughts on why 'raw' strings are 'raw, except for backslash-quote'? I
mean, if I wanted to embed a single quote in there I'd just use double
quotes around the string, and vice versa.
But that would then raise the question as to why raw strings are 'raw, except for embedded quotes?'
You have to have some escape mechanism, otherwise you can never use the outer quote characters inside the string at all. And then you need an escape mechanism for the escape mechanism.
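Because of that escape mechanism, the standard workarounds all add the trailing backslash outside the raw literal; a quick sketch:

```python
# Concatenate a normal literal for the final backslash
path = r"C:\temp" + "\\"
print(path)  # C:\temp\

# Adjacent string literals concatenate at compile time (the FAQ's trick)
assert r"ab\c" "\\" == "ab\\c\\"
```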
