Which characters are considered whitespace by split()?

Which characters are considered whitespace by split()? - python

I am porting some Python 2 code that calls split() on strings, so I need to know its exact behavior. The documentation states that when you do not specify the sep argument, "runs of consecutive whitespace are regarded as a single separator".
Unfortunately, it does not specify which characters that would be. There are some obvious contenders (like space, tab, and newline), but Unicode contains plenty of other candidates.
Which characters are considered to be whitespace by split()?
Since the answer might be implementation-specific, I'm targeting CPython.
(Note: I researched the answer to this myself since I couldn't find it anywhere, so I'll be posting it here, hopefully for the benefit of others.)

Unfortunately, it depends on whether your string is an str or a unicode (at least, in CPython - I don't know whether this behavior is actually mandated by a specification anywhere).
If it is an str, the answer is straightforward:
0x09 Tab
0x0a Newline
0x0b Vertical Tab
0x0c Form Feed
0x0d Carriage Return
0x20 Space
Source: these are the characters with PY_CTF_SPACE in Python/pyctype.c, which are used by Py_ISSPACE, which is used by STRINGLIB_ISSPACE, which is used by split_whitespace.
If it is a unicode, there are 29 characters, which in addition to the above are:
U+001c through 0x001f: File/Group/Record/Unit Separator
U+0085: Next Line
U+00a0: Non-Breaking Space
U+1680: Ogham Space Mark
U+2000 through 0x200a: various fixed-size spaces (e.g. Em Space), but note that Zero-Width Space is not included
U+2028: Line Separator
U+2029: Paragraph Separator
U+202f: Narrow No-Break Space
U+205f: Medium Mathematical Space
U+3000: Ideographic Space
Note that the first four are also valid ASCII characters, which means that an ASCII-only string might split differently depending on whether it is an str or a unicode!
Source: these are the characters listed in _PyUnicode_IsWhitespace, which is used by Py_UNICODE_ISSPACE, which is used by STRINGLIB_ISSPACE (it looks like they use the same function implementations for both str and unicode, but compile it separately for each type, with certain macros implemented differently). The docstring describes this set of characters as follows:
Unicode characters having the bidirectional type 'WS', 'B' or 'S' or the category 'Zs'

The answer by Aasmund Eldhuset is what I was attempting to do but I was beaten to the punch. It shows a lot of research and should definitely be the accepted answer.
If you want confirmation of that answer (or just want to test it in a different implementation, such as a non-CPython one, or a later one which may use a different Unicode standard under the covers), the following short program will print out the actual characters that cause a split when using .split() with no arguments.
It does this by constructing a string with the a and b characters(a) separated by the character being tested, then detecting if split creates an array more than one element:
int_ch = 0
while True:
try:
test_str = "a" + chr(int_ch) + "b"
except Exception as e:
print(f'Stopping, {e}')
break
if len(test_str.split()) != 1:
print(f'0x{int_ch:06x} ({int_ch})')
int_ch += 1
The output (for my system) is as follows:
0x000009 (9)
0x00000a (10)
0x00000b (11)
0x00000c (12)
0x00000d (13)
0x00001c (28)
0x00001d (29)
0x00001e (30)
0x00001f (31)
0x000020 (32)
0x000085 (133)
0x0000a0 (160)
0x001680 (5760)
0x002000 (8192)
0x002001 (8193)
0x002002 (8194)
0x002003 (8195)
0x002004 (8196)
0x002005 (8197)
0x002006 (8198)
0x002007 (8199)
0x002008 (8200)
0x002009 (8201)
0x00200a (8202)
0x002028 (8232)
0x002029 (8233)
0x00202f (8239)
0x00205f (8287)
0x003000 (12288)
Stopping, chr() arg not in range(0x110000)
You can ignore the error at the end, that's just to confirm it doesn't fail until we've moved out of the valid Unicode area (code points 0x000000 - 0x10ffff making up the seventeen planes).
(a) I'm hoping that no future version of Python ever considers a or b to be whitespace, as that would totally break this (and a lot of other) code.
I think the chances of that are rather slim, so it should be fine :-)

Related

Detecting Arabic characters in regex

I have a dataset of Arabic sentences, and I want to remove non-Arabic characters or special characters. I used this regex in python:
text = re.sub(r'[^ء-ي0-9]',' ',text)
It works perfectly, but in some sentences (4 cases from the whole dataset) the regex also removes the Arabic words!
I read the dataset using Panda (python package) like:
train = pd.read_excel('d.xlsx', encoding='utf-8')
Just to show you in a picture, I tested on Pythex site:
What is the problem?
------------------ Edited:
The sentences in the example:
انا بحكي رجعو مبارك واعملو حفلة واحرقوها بالمعازيم ولما الاخوان يروحو
يعزو احرقو العزا -- احسنلكم والله #مصر
ﺷﻔﻴﻖ ﺃﺭﺩﻭﻏﺎﻥ ﻣﺼﺮ ..ﺃﺣﻨﺍ ﻧﺒﻘﻰ ﻣﻴﻦ ﻳﺎ ﺩﺍﺩﺍ؟ #ﻣﺴﺨﺮﺓ #ﻋﺒﺚ #EgyPresident #Egypt #ﻣﻘﺎﻃﻌﻮﻥ لا يا حبيبي ما حزرت: بشار غبي بوجود بعثة أنان حاب يفضح روحه انه مجرم من هيك نفذ المجزرة لترى البعثة اجرامه بحق السورين

Those incorrectly included characters are not in the common Unicode range for Arabic (U+0621..U+64A), but are "hardcoded" as their initial, medial, and final forms.
Comparable to capitalization in Latin-based languages, but more strict than that, Arabic writing indicates both the start and end of words with a special 'flourish' form. In addition it also allows an "isolated" form (to be used when the character is not part of a full word).
This is usually encoded in a file as 'an' Arabic character and the actual rendering in initial, medial, or final form is left to the text renderer, but since all forms also have Unicode codepoints of their own, it is also possible to "hardcode" the exact forms. That is what you encountered: a mix of these two systems.
Fortunately, the Unicode ranges for the hardcoded forms are also fixed values:
Arabic Presentation Forms-A is a Unicode block encoding contextual forms and ligatures of letter variants needed for Persian, Urdu, Sindhi and Central Asian languages. The presentation forms are present only for compatibility with older standards such as codepage 864 used in DOS, and are typically used in visual and not logical order.
(https://en.wikipedia.org/wiki/Arabic_Presentation_Forms-A)
and their ranges are U+FB50..U+FDFF (Presentation Forms A) and U+FE70..U+FEFC (Presentation Forms B). If you add these ranges to your exclusion set, the regex will no longer delete these texts:
[^ء-ي0-9ﭐ-﷿ﹰ-ﻼ]
Depending on your browser and/or editor, you may have problems with selecting this text to copy and paste it. It may be more clear to explicitly use a string specifying the exact characters:
[^0-9\u0621-\u064a\ufb50-\ufdff\ufe70-\ufefc]

I have made some try on Pythex and I Found this (With the help from Regular Expression Arabic characters and numbers only) : [\u0621-\u064A0-9] who catch almost all non-Arabic characters. For un Unknown reason, this dosen't catch 'y' so you have to add it yourself : [\u0621-\u064A0-9y]
This can catch all non-arabic character. For special character, i'm sorry but i found nothing except to add them inside : [\u0621-\u064A0-9y#\!\?\,]

Python wont open a file named PRN.anything [duplicate]

I know that / is illegal in Linux, and the following are illegal in Windows
(I think) * . " / \ [ ] : ; | ,
What else am I missing?
I need a comprehensive guide, however, and one that takes into account
double-byte characters. Linking to outside resources is fine with me.
I need to first create a directory on the filesystem using a name that may
contain forbidden characters, so I plan to replace those characters with
underscores. I then need to write this directory and its contents to a zip file
(using Java), so any additional advice concerning the names of zip directories
would be appreciated.

The forbidden printable ASCII characters are:
Linux/Unix:
/ (forward slash)
Windows:
< (less than)
> (greater than)
: (colon - sometimes works, but is actually NTFS Alternate Data Streams)
" (double quote)
/ (forward slash)
\ (backslash)
| (vertical bar or pipe)
? (question mark)
* (asterisk)
Non-printable characters
If your data comes from a source that would permit non-printable characters then there is more to check for.
Linux/Unix:
0 (NULL byte)
Windows:
0-31 (ASCII control characters)
Note: While it is legal under Linux/Unix file systems to create files with control characters in the filename, it might be a nightmare for the users to deal with such files.
Reserved file names
The following filenames are reserved:
Windows:
CON, PRN, AUX, NUL
COM1, COM2, COM3, COM4, COM5, COM6, COM7, COM8, COM9
LPT1, LPT2, LPT3, LPT4, LPT5, LPT6, LPT7, LPT8, LPT9
(both on their own and with arbitrary file extensions, e.g. LPT1.txt).
Other rules
Windows:
Filenames cannot end in a space or dot.
macOS:
You didn't ask for it, but just in case: Colon : and forward slash / depending on context are not permitted (e.g. Finder supports slashes, terminal supports colons). (More details)

A “comprehensive guide” of forbidden filename characters is not going to work on Windows because it reserves filenames as well as characters. Yes, characters like
* " ? and others are forbidden, but there are a infinite number of names composed only of valid characters that are forbidden. For example, spaces and dots are valid filename characters, but names composed only of those characters are forbidden.
Windows does not distinguish between upper-case and lower-case characters, so you cannot create a folder named A if one named a already exists. Worse, seemingly-allowed names like PRN and CON, and many others, are reserved and not allowed. Windows also has several length restrictions; a filename valid in one folder may become invalid if moved to another folder. The rules for
naming files and folders
are on the Microsoft docs.
You cannot, in general, use user-generated text to create Windows directory names. If you want to allow users to name anything they want, you have to create safe names like A, AB, A2 et al., store user-generated names and their path equivalents in an application data file, and perform path mapping in your application.
If you absolutely must allow user-generated folder names, the only way to tell if they are invalid is to catch exceptions and assume the name is invalid. Even that is fraught with peril, as the exceptions thrown for denied access, offline drives, and out of drive space overlap with those that can be thrown for invalid names. You are opening up one huge can of hurt.

Under Linux and other Unix-related systems, there were traditionally only two characters that could not appear in the name of a file or directory, and those are NUL '\0' and slash '/'. The slash, of course, can appear in a pathname, separating directory components.
Rumour1 has it that Steven Bourne (of 'shell' fame) had a directory containing 254 files, one for every single letter (character code) that can appear in a file name (excluding /, '\0'; the name . was the current directory, of course). It was used to test the Bourne shell and routinely wrought havoc on unwary programs such as backup programs.
Other people have covered the rules for Windows filenames, with links to Microsoft and Wikipedia on the topic.
Note that MacOS X has a case-insensitive file system. Current versions of it appear to allow colon : in file names, though historically that was not necessarily always the case:
$ echo a:b > a:b
$ ls -l a:b
-rw-r--r-- 1 jonathanleffler staff 4 Nov 12 07:38 a:b
$
However, at least with macOS Big Sur 11.7, the file system does not allow file names that are not valid UTF-8 strings. That means the file name cannot consist of the bytes that are always invalid in UTF-8 (0xC0, 0xC1, 0xF5-0xFF), and you can't use the continuation bytes 0x80..0xBF as the only byte in a file name. The error given is 92 Illegal byte sequence.
POSIX defines a Portable Filename Character Set consisting of:
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
a b c d e f g h i j k l m n o p q r s t u v w x y z
0 1 2 3 4 5 6 7 8 9 . _ -
Sticking with names formed solely from those characters avoids most of the problems, though Windows still adds some complications.
1 It was Kernighan & Pike in ['The Practice of Programming'](http://www.cs.princeton.edu/~bwk/tpop.webpage/) who said as much in Chapter 6, Testing, §6.5 Stress Tests:
When Steve Bourne was writing his Unix shell (which came to be known as the Bourne shell), he made a directory of 254 files with one-character names, one for each byte value except '\0' and slash, the two characters that cannot appear in Unix file names. He used that directory for all manner of tests of pattern-matching and tokenization. (The test directory was of course created by a program.) For years afterwards, that directory was the bane of file-tree-walking programs; it tested them to destruction.
Note that the directory must have contained entries . and .., so it was arguably 253 files (and 2 directories), or 255 name entries, rather than 254 files. This doesn't affect the effectiveness of the anecdote, or the careful testing it describes.
TPOP was previously at
http://plan9.bell-labs.com/cm/cs/tpop and
http://cm.bell-labs.com/cm/cs/tpop but both are now (2021-11-12) broken.
See also Wikipedia on TPOP.

Instead of creating a blacklist of characters, you could use a whitelist. All things considered, the range of characters that make sense in a file or directory name context is quite short, and unless you have some very specific naming requirements your users will not hold it against your application if they cannot use the whole ASCII table.
It does not solve the problem of reserved names in the target file system, but with a whitelist it is easier to mitigate the risks at the source.
In that spirit, this is a range of characters that can be considered safe:
Letters (a-z A-Z) - Unicode characters as well, if needed
Digits (0-9)
Underscore (_)
Hyphen (-)
Space
Dot (.)
And any additional safe characters you wish to allow. Beyond this, you just have to enforce some additional rules regarding spaces and dots. This is usually sufficient:
Name must contain at least one letter or number (to avoid only dots/spaces)
Name must start with a letter or number (to avoid leading dots/spaces)
Name may not end with a dot or space (simply trim those if present, like Explorer does)
This already allows quite complex and nonsensical names. For example, these names would be possible with these rules, and be valid file names in Windows/Linux:
A...........ext
B -.- .ext
In essence, even with so few whitelisted characters you should still decide what actually makes sense, and validate/adjust the name accordingly. In one of my applications, I used the same rules as above but stripped any duplicate dots and spaces.

The easy way to get Windows to tell you the answer is to attempt to rename a file via Explorer and type in a backslash, /, for the new name. Windows will popup a message box telling you the list of illegal characters.
A filename cannot contain any of the following characters:
\ / : * ? " < > |
Microsoft Docs - Naming Files, Paths, and Namespaces - Naming Conventions

Well, if only for research purposes, then your best bet is to look at this Wikipedia entry on Filenames.
If you want to write a portable function to validate user input and create filenames based on that, the short answer is don't. Take a look at a portable module like Perl's File::Spec to have a glimpse to all the hops needed to accomplish such a "simple" task.

Discussing different possible approaches
Difficulties with defining, what's legal and not were already adressed and whitelists were suggested. But not only Windows, but also many unixoid OSes support more-than-8-bit characters such as Unicode. You could here also talk about encodings such as UTF-8. You can consider Jonathan Leffler's comment, where he gives info about modern Linux and describes details for MacOS. Wikipedia states, that (for example) the
modifier letter colon [(See 7. below) is] sometimes used in Windows filenames as it is identical to the colon in the Segoe UI font used for filenames. The [inherited ASCII] colon itself is not permitted.
Therefore, I want to present a much more liberal approach using Unicode Homoglyph characters to replace the "illegal" ones. I found the result in my comparable use-case by far more readable and it's only limited by the used font, which is very broad, 3903 characters for Windows default. Plus you can even restore the original content from the replacements.
Possible choices and research notes
To keep things organized, I will always give the character, it's name and the hexadecimal number representation. The latter is is not case sensitive and leading zeroes can be added or ommitted freely, so for example U+002A and u+2a are equivalent. If available, I'll try to point to more info or alternatives - feel free to show me more or better ones.
Instead of * (U+2A * ASTERISK), you can use one of the many listed, for example U+2217 ∗ (ASTERISK OPERATOR) or the Full Width Asterisk U+FF0A ＊. u+20f0 ⃰ combining asterisk above from combining diacritical marks for symbols might also be a valid choice. You can read 4. for more info about the combining characters.
Instead of . (U+2E . full stop), one of these could be a good option, for example ⋅ U+22C5 dot operator.
Instead of " (U+22 " quotation mark), you can use “ U+201C english leftdoublequotemark, more alternatives see here. I also included some of the good suggestions of Wally Brockway's answer, in this case u+2036 ‶ reversed double prime and u+2033 ″ double prime - I will from now on denote ideas from that source by ¹³.
Instead of / (U+2F / SOLIDUS), you can use ∕ DIVISION SLASH U+2215 (others here), ̸ U+0338 COMBINING LONG SOLIDUS OVERLAY, ̷ COMBINING SHORT SOLIDUS OVERLAY U+0337 or u+2044 ⁄ fraction slash¹³. Be aware about spacing for some characters, including the combining or overlay ones, as they have no width and can produce something like -> ̸th̷is which is ̸th̷is. With added spaces you get -> ̸ th ̷ is, which is ̸ th ̷ is. The second one (COMBINING SHORT SOLIDUS OVERLAY) looks bad in the stackoverflow-font.
Instead of \ (U+5C Reverse solidus), you can use ⧵ U+29F5 Reverse solidus operator (more) or u+20E5 ⃥ combining reverse solidus overlay¹³.
To replace [ (U+5B [ Left square bracket) and ] (U+005D ] Right square bracket), you can use for example U+FF3B［ FULLWIDTH LEFT SQUARE BRACKET and U+FF3D ］FULLWIDTH RIGHT SQUARE BRACKET (from here, more possibilities here).
Instead of : (u+3a : colon), you can use U+2236 ∶ RATIO (for mathematical usage) or U+A789 ꞉ MODIFIER LETTER COLON, (see colon (letter), sometimes used in Windows filenames as it is identical to the colon in the Segoe UI font used for filenames. The colon itself is not permitted ... source and more replacements see here). Another alternative is this one: u+1361 ፡ ethiopic wordspace¹³.
Instead of ; (u+3b ; semicolon), you can use U+037E ; GREEK QUESTION MARK (see here).
For | (u+7c | vertical line), there are some good substitutes such as: U+2223 ∣ DIVIDES, U+0964 । DEVANAGARI DANDA, U+01C0 ǀ LATIN LETTER DENTAL CLICK (the last ones from Wikipedia) or U+2D4F ⵏ Tifinagh Letter Yan. Also the box drawing characters contain various other options.
Instead of , (, U+002C COMMA), you can use for example ‚ U+201A SINGLE LOW-9 QUOTATION MARK (see here).
For ? (U+003F ? QUESTION MARK), these are good candidates: U+FF1F ？ FULLWIDTH QUESTION MARK or U+FE56 ﹖ SMALL QUESTION MARK (from here and here). There are also two more from the Dingbats Block (search for "question") and the u+203d ‽ interrobang¹³.
While my machine seems to accept it unchanged, I still want to include > (u+3e greater-than sign) and < (u+3c less-than sign) for the sake of completeness. The best replacement here is probably also from the quotation block, such as u+203a › single right-pointing angle quotation mark and u+2039 ‹ single left-pointing angle quotation mark respectively. The tifinagh block only contains ⵦ (u+2D66)¹³ to replace <. The last notion is ⋖ less-than with dot u+22D6 and ⋗ greater-than with dot u+22D7.
For additional ideas, you can also look for example into this block. You still want more ideas? You can try to draw your desired character and look at the suggestions here.
How do you type these characters
Say you want to type ⵏ (Tifinagh Letter Yan). To get all of its information, you can always search for this character (ⵏ) on a suited platform such as this Unicode Lookup (please add 0x when you search for hex) or that Unicode Table (that only allows to search for the name, in this case "Tifinagh Letter Yan"). You should obtain its Unicode number U+2D4F and the HTML-code ⵏ (note that 2D4F is hexadecimal for 11599). With this knowledge, you have several options to produce these special characters including the use of
code points to unicode converter or again the Unicode Lookup to reversely convert the numerical representation into the unicode character (remember to set the code point base below to decimal or hexadecimal respectively)
a one-liner makro in Autohotkey: :?*:altpipe::{U+2D4F} to type ⵏ instead of the string altpipe - this is the way I input those special characters, my Autohotkey script can be shared if there is common interest
Alt Characters or alt-codes by pressing and holding alt, followed by the decimal number for the desired character (more info for example here, look at a table here or there). For the example, that would be Alt+11599. Be aware, that many programs do not fully support this windows feature for all of unicode (as of time writing). Microsoft Office is an exception where it usually works, some other OSes provide similar functionality. Typing these chars with Alt-combinations into MS Word is also the way Wally Brockway suggests in his answer¹³ that was already mentionted - if you don't want to transfer all the hexadecimal values to the decimal asc, you can find some of them there¹³.
in MS Office, you can also use ALT + X as described in this MS article to produce the chars
if you rarely need it, you can of course still just copy-paste the special character of your choice instead of typing it

For Windows you can check it using PowerShell
$PathInvalidChars = [System.IO.Path]::GetInvalidPathChars() #36 chars
To display UTF-8 codes you can convert
$enc = [system.Text.Encoding]::UTF8
$PathInvalidChars | foreach { $enc.GetBytes($_) }
$FileNameInvalidChars = [System.IO.Path]::GetInvalidFileNameChars() #41 chars
$FileOnlyInvalidChars = #(':', '*', '?', '\', '/') #5 chars - as a difference

For anyone looking for a regex:
const BLACKLIST = /[<>:"\/\\|?*]/g;

In Windows 10 (2019), the following characters are forbidden by an error when you try to type them:
A file name can't contain any of the following characters:
\ / : * ? " < > |

Here's a c# implementation for windows based on Christopher Oezbek's answer
It was made more complex by the containsFolder boolean, but hopefully covers everything
/// <summary>
/// This will replace invalid chars with underscores, there are also some reserved words that it adds underscore to
/// </summary>
/// <remarks>
/// https://stackoverflow.com/questions/1976007/what-characters-are-forbidden-in-windows-and-linux-directory-names
/// </remarks>
/// <param name="containsFolder">Pass in true if filename represents a folder\file (passing true will allow slash)</param>
public static string EscapeFilename_Windows(string filename, bool containsFolder = false)
{
StringBuilder builder = new StringBuilder(filename.Length + 12);
int index = 0;
// Allow colon if it's part of the drive letter
if (containsFolder)
{
Match match = Regex.Match(filename, #"^\s*[A-Z]:\\", RegexOptions.IgnoreCase);
if (match.Success)
{
builder.Append(match.Value);
index = match.Length;
}
}
// Character substitutions
for (int cntr = index; cntr < filename.Length; cntr++)
{
char c = filename[cntr];
switch (c)
{
case '\u0000':
case '\u0001':
case '\u0002':
case '\u0003':
case '\u0004':
case '\u0005':
case '\u0006':
case '\u0007':
case '\u0008':
case '\u0009':
case '\u000A':
case '\u000B':
case '\u000C':
case '\u000D':
case '\u000E':
case '\u000F':
case '\u0010':
case '\u0011':
case '\u0012':
case '\u0013':
case '\u0014':
case '\u0015':
case '\u0016':
case '\u0017':
case '\u0018':
case '\u0019':
case '\u001A':
case '\u001B':
case '\u001C':
case '\u001D':
case '\u001E':
case '\u001F':
case '<':
case '>':
case ':':
case '"':
case '/':
case '|':
case '?':
case '*':
builder.Append('_');
break;
case '\\':
builder.Append(containsFolder ? c : '_');
break;
default:
builder.Append(c);
break;
}
}
string built = builder.ToString();
if (built == "")
{
return "_";
}
if (built.EndsWith(" ") || built.EndsWith("."))
{
built = built.Substring(0, built.Length - 1) + "_";
}
// These are reserved names, in either the folder or file name, but they are fine if following a dot
// CON, PRN, AUX, NUL, COM0 .. COM9, LPT0 .. LPT9
builder = new StringBuilder(built.Length + 12);
index = 0;
foreach (Match match in Regex.Matches(built, #"(^|\\)\s*(?<bad>CON|PRN|AUX|NUL|COM\d|LPT\d)\s*(\.|\\|$)", RegexOptions.IgnoreCase))
{
Group group = match.Groups["bad"];
if (group.Index > index)
{
builder.Append(built.Substring(index, match.Index - index + 1));
}
builder.Append(group.Value);
builder.Append("_"); // putting an underscore after this keyword is enough to make it acceptable
index = group.Index + group.Length;
}
if (index == 0)
{
return built;
}
if (index < built.Length - 1)
{
builder.Append(built.Substring(index));
}
return builder.ToString();
}

Though the only illegal Unix chars might be / and NULL, although some consideration for command line interpretation should be included.
For example, while it might be legal to name a file 1>&2 or 2>&1 in Unix, file names such as this might be misinterpreted when used on a command line.
Similarly it might be possible to name a file $PATH, but when trying to access it from the command line, the shell will translate $PATH to its variable value.

The .NET Framework System.IO provides the following functions for invalid file system characters:
Path.GetInvalidFileNameChars
Path.GetInvalidPathChars
Those functions should return appropriate results depending on the platform the .NET runtime is running in. That said, the Remarks in the documentation pages for those functions say:
The array returned from this method is not guaranteed to contain the
complete set of characters that are invalid in file and directory
names. The full set of invalid characters can vary by file system.

I always assumed that banned characters in Windows filenames meant that all exotic characters would also be outlawed. The inability to use ?, / and : in particular irked me. One day I discovered that it was virtually only those chars which were banned. Other Unicode characters may be used. So the nearest Unicode characters to the banned ones I could find were identified and MS Word macros were made for them as Alt+?, Alt+: etc. Now I form the filename in Word, using the substitute chars, and copy it to the Windows filename. So far I have had no problems.
Here are the substitute chars (Alt + the decimal Unicode) :
⃰ ⇔ Alt8432
⁄ ⇔ Alt8260
⃥ ⇔ Alt8421
∣ ⇔ Alt8739
ⵦ ⇔ Alt11622
⮚ ⇔ Alt11162
‽ ⇔ Alt8253
፡ ⇔ Alt4961
‶ ⇔ Alt8246
″ ⇔ Alt8243
As a test I formed a filename using all of those chars and Windows accepted it.

This is good enough for me in Python:
def fix_filename(name, max_length=255):
"""
Replace invalid characters on Linux/Windows/MacOS with underscores.
List from https://stackoverflow.com/a/31976060/819417
Trailing spaces & periods are ignored on Windows.
>>> fix_filename(" COM1 ")
'_ COM1 _'
>>> fix_filename("COM10")
'COM10'
>>> fix_filename("COM1,")
'COM1,'
>>> fix_filename("COM1.txt")
'_.txt'
>>> all('_' == fix_filename(chr(i)) for i in list(range(32)))
True
"""
return re.sub(r'[/\\:|<>"?*\0-\x1f]|^(AUX|COM[1-9]|CON|LPT[1-9]|NUL|PRN)(?![^.])|^\s|[\s.]$', "_", name[:max_length], flags=re.IGNORECASE)
See also this outdated list for additional legacy stuff like = in FAT32.

As of 18/04/2017, no simple black or white list of characters and filenames is evident among the answers to this topic - and there are many replies.
The best suggestion I could come up with was to let the user name the file however he likes. Using an error handler when the application tries to save the file, catch any exceptions, assume the filename is to blame (obviously after making sure the save path was ok as well), and prompt the user for a new file name. For best results, place this checking procedure within a loop that continues until either the user gets it right or gives up. Worked best for me (at least in VBA).

In Unix shells, you can quote almost every character in single quotes '. Except the single quote itself, and you can't express control characters, because \ is not expanded. Accessing the single quote itself from within a quoted string is possible, because you can concatenate strings with single and double quotes, like 'I'"'"'m' which can be used to access a file called "I'm" (double quote also possible here).
So you should avoid all control characters, because they are too difficult to enter in the shell. The rest still is funny, especially files starting with a dash, because most commands read those as options unless you have two dashes -- before, or you specify them with ./, which also hides the starting -.
If you want to be nice, don't use any of the characters the shell and typical commands use as syntactical elements, sometimes position dependent, so e.g. you can still use -, but not as first character; same with ., you can use it as first character only when you mean it ("hidden file"). When you are mean, your file names are VT100 escape sequences ;-), so that an ls garbles the output.

When creating internet shortcuts in Windows, to create the file name, it skips illegal characters, except for forward slash, which is converted to minus.

I had the same need and was looking for recommendation or standard references and came across this thread. My current blacklist of characters that should be avoided in file and directory names are:
$CharactersInvalidForFileName = {
"pound" -> "#",
"left angle bracket" -> "<",
"dollar sign" -> "$",
"plus sign" -> "+",
"percent" -> "%",
"right angle bracket" -> ">",
"exclamation point" -> "!",
"backtick" -> "`",
"ampersand" -> "&",
"asterisk" -> "*",
"single quotes" -> "“",
"pipe" -> "|",
"left bracket" -> "{",
"question mark" -> "?",
"double quotes" -> "”",
"equal sign" -> "=",
"right bracket" -> "}",
"forward slash" -> "/",
"colon" -> ":",
"back slash" -> "\\",
"lank spaces" -> "b",
"at sign" -> "#"
};

Is there a Python constant for Unicode whitespace?

The string module contains a whitespace attribute, which is a string consisting of all the ASCII characters that are considered whitespace. Is there a corresponding constant that includes Unicode spaces too, such as the no-break space (U+00A0)? We can see from the question "strip() and strip(string.whitespace) give different results" that at least strip is aware of additional Unicode whitespace characters.
This question was identified as a duplicate of
In Python, how to list all characters matched by POSIX extended regex [:space:]?, but the answers to that question identify ways of searching for whitespace characters to generate your own list. This is a time-consuming process. My question was specifically about a constant.

Is there a Python constant for Unicode whitespace?
Short answer: No. I have personally grepped for these characters (specifically, the numeric code points) in the Python code base, and such a constant is not there.
The sections below explains why it is not necessary, and how it is implemented without this information being available as a constant. But having such a constant would also be a really bad idea.
If the Unicode Consortium added another character/code-point that is semantically whitespace, the maintainers of Python would have a poor choice between continuing to support semantically incorrect code or changing the constant and possibly breaking pre-existing code that might (inadvisably) make assumptions about the constant not changing.
How could it add these character code-points? There are 1,111,998 possible characters in Unicode. But only 120,672 are occupied as of version 8. Each new version of Unicode may add additional characters. One of these new characters might be a form of whitespace.
The information is stored in a dynamically generated C function
The code that determines what is whitespace in unicode is the following dynamically generated code.
# Generate code for _PyUnicode_IsWhitespace()
print("/* Returns 1 for Unicode characters having the bidirectional", file=fp)
print(" * type 'WS', 'B' or 'S' or the category 'Zs', 0 otherwise.", file=fp)
print(" */", file=fp)
print('int _PyUnicode_IsWhitespace(const Py_UCS4 ch)', file=fp)
print('{', file=fp)
print(' switch (ch) {', file=fp)
for codepoint in sorted(spaces):
print(' case 0x%04X:' % (codepoint,), file=fp)
print(' return 1;', file=fp)
print(' }', file=fp)
print(' return 0;', file=fp)
print('}', file=fp)
print(file=fp)
This is a switch statement, which is a constant code block, but this information is not available as a module "constant" like the string module has. It is instead buried in the function compiled from C and not directly accessible from Python.
This is likely because as more code points are added to Unicode, we would not be able to change constants for backwards compatibility reasons.
The Generated Code
Here's the generated code currently at the tip:
int _PyUnicode_IsWhitespace(const Py_UCS4 ch)
{
switch (ch) {
case 0x0009:
case 0x000A:
case 0x000B:
case 0x000C:
case 0x000D:
case 0x001C:
case 0x001D:
case 0x001E:
case 0x001F:
case 0x0020:
case 0x0085:
case 0x00A0:
case 0x1680:
case 0x2000:
case 0x2001:
case 0x2002:
case 0x2003:
case 0x2004:
case 0x2005:
case 0x2006:
case 0x2007:
case 0x2008:
case 0x2009:
case 0x200A:
case 0x2028:
case 0x2029:
case 0x202F:
case 0x205F:
case 0x3000:
return 1;
}
return 0;
}
Making your own constant:
The following code (from my answer here), in Python 3, generates a constant of all whitespace:
import re
import sys
s = ''.join(chr(c) for c in range(sys.maxunicode+1))
ws = ''.join(re.findall(r'\s', s))
As an optimization, you could store this in a code base, instead of auto-generating it every new process, but I would caution against assuming that it would never change.
>>> ws
'\t\n\x0b\x0c\r\x1c\x1d\x1e\x1f \x85\xa0\u1680\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u2028\u2029\u202f\u205f\u3000'
(Other answers to the question linked show how to get that for Python 2.)
Remember that at one point, some people probably thought 256 character encodings was all that we'd ever need.
>>> import string
>>> string.whitespace
' \t\n\r\x0b\x0c'
If you're insisting on keeping a constant in your code base, just generate the constant for your version of Python, and store it as a literal:
unicode_whitespace = u'\t\n\x0b\x0c\r\x1c\x1d\x1e\x1f \x85\xa0\u1680\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u2028\u2029\u202f\u205f\u3000'
The u prefix makes it unicode in Python 2 (2.7 happens to recognize the entire string above as whitespace too), and in Python 3 it is ignored as string literals are unicode by default.

Python’s `str.format()`, fill characters, and ANSI colors

In Python 2, I’m using str.format() to align a bunch of columns of text I’m printing to a terminal. Basically, it’s a table, but I’m not printing any borders or anything—it’s simply rows of text, aligned into columns.
With no color-fiddling, everything prints as expected.
If I wrap an entire row (i.e., one print statement) with ANSI color codes, everything prints as expected.
However: If I try to make each column a different color within a row, the alignment is thrown off. Technically, the alignment is preserved; it’s the fill characters (spaces) that aren’t printing as desired; in fact, the fill characters seem to be completely removed.
I’ve verified the same issue with both colorama and xtermcolor. The results were the same. Therefore, I’m certain the issue has to do with str.format() not playing well with ANSI escape sequences in the middle of a string.
But I don’t know what to do about it! :( I would really like to know if there’s any kind of workaround for this problem.
Color and alignment are powerful tools for improving readability, and readability is an important part of software usability. It would mean a lot to me if this could be accomplished without manually aligning each column of text.
Little help? ☺

This is a very late answer, left as bread crumbs for anyone who finds this page while struggling to format text with built-in ANSI color codes.
byoungb's comment about making padding decisions on the length of pre-colorized text is exactly right. But if you already have colored text, here's a work-around:
See my ansiwrap module on PyPI. Its primary purpose is providing textwrap for ANSI-colored text, but it also exports ansilen() which tells you "how long would this string be if it didn't contain ANSI control codes?" It's quite useful in making formatting, column-width, and wrapping decisions on pre-colored text. Add width - ansilen(s) spaces to the end or beginning of s to left (or respectively, right) justify s in a column of your desired width. E.g.:
def ansi_ljust(s, width):
needed = width - ansilen(s)
if needed > 0:
return s + ' ' * needed
else:
return s
Also, if you need to split, truncate, or combine colored text at some point, you will find that ANSI's stateful nature makes that a chore. You may find ansi_terminate_lines() helpful; it "patch up" a list of sub-strings so that each has independent, self-standing ANSI codes with equivalent effect as the original string.
The latest versions of ansicolors also contain an equivalent implementation of ansilen().

Python doesn't distinguish between 'normal' characters and ANSI colour codes, which are also characters that the terminal interprets.
In other words, printing '\x1b[92m' to a terminal may change the terminal text colour, Python doesn't see that as anything but a set of 5 characters. If you use print repr(line) instead, python will print the string literal form instead, including using escape codes for non-ASCII printable characters (so the ESC ASCII code, 27, is displayed as \x1b) to see how many have been added.
You'll need to adjust your column alignments manually to allow for those extra characters.
Without your actual code, that's hard for us to help you with though.

Also late to the party. Had this same issue dealing with color and alignment. Here is a function I wrote which adds padding to a string that has characters that are 'invisible' by default, such as escape sequences.
def ljustcolor(text: str, padding: int, char=" ") -> str:
import re
pattern = r'(?:\x1B[#-_]|[\x80-\x9F])[0-?]*[ -/]*[#-~]'
matches = re.findall(pattern, text)
offset = sum(len(match) for match in matches)
return text.ljust(padding + offset,char[0])
The pattern matches all ansi escape sequences, including color codes. We then get the total length of all matches which will serve as our offset when we add it to the padding value in ljust.

How to do a Python split() on languages (like Chinese) that don't use whitespace as word separator?

I want to split a sentence into a list of words.
For English and European languages this is easy, just use split()
>>> "This is a sentence.".split()
['This', 'is', 'a', 'sentence.']
But I also need to deal with sentences in languages such as Chinese that don't use whitespace as word separator.
>>> u"这是一个句子".split()
[u'\u8fd9\u662f\u4e00\u4e2a\u53e5\u5b50']
Obviously that doesn't work.
How do I split such a sentence into a list of words?
UPDATE:
So far the answers seem to suggest that this requires natural language processing techniques and that the word boundaries in Chinese are ambiguous. I'm not sure I understand why. The word boundaries in Chinese seem very definite to me. Each Chinese word/character has a corresponding unicode and is displayed on screen as an separate word/character.
So where does the ambiguity come from. As you can see in my Python console output Python has no problem telling that my example sentence is made up of 5 characters:
这 - u8fd9
是 - u662f
一 - u4e00
个 - u4e2a
句 - u53e5
子 - u5b50
So obviously Python has no problem telling the word/character boundaries. I just need those words/characters in a list.

You can do this but not with standard library functions. And regular expressions won't help you either.
The task you are describing is part of the field called Natural Language Processing (NLP). There has been quite a lot of work done already on splitting Chinese words at word boundaries. I'd suggest that you use one of these existing solutions rather than trying to roll your own.
Chinese NLP
chinese - The Stanford NLP (Natural Language Processing) Group
Where does the ambiguity come from?
What you have listed there is Chinese characters. These are roughly analagous to letters or syllables in English (but not quite the same as NullUserException points out in a comment). There is no ambiguity about where the character boundaries are - this is very well defined. But you asked not for character boundaries but for word boundaries. Chinese words can consist of more than one character.
If all you want is to find the characters then this is very simple and does not require an NLP library. Simply decode the message into a unicode string (if it is not already done) then convert the unicode string to a list using a call to the builtin function list. This will give you a list of the characters in the string. For your specific example:
>>> list(u"这是一个句子")

just a word of caution: using list( '...' ) (in Py3; that's u'...' for Py2) will not, in the general sense, give you the characters of a unicode string; rather, it will most likely result in a series of 16bit codepoints. this is true for all 'narrow' CPython builds, which accounts for the vast majority of python installations today.
when unicode was first proposed in the 1990s, it was suggested that 16 bits would be more than enough to cover all the needs of a universal text encoding, as it enabled a move from 128 codepoints (7 bits) and 256 codepoints (8 bits) to a whopping 65'536 codepoints. it soon became apparent, however, that that had been wishful thinking; today, around 100'000 codepoints are defined in unicode version 5.2, and thousands more are pending for inclusion. in order for that to become possible, unicode had to move from 16 to (conceptually) 32 bits (although it doesn't make full use of the 32bit address space).
in order to maintain compatibility with software built on the assumption that unicode was still 16 bits, so-called surrogate pairs were devised, where two 16 bit codepoints from specifically designated blocks are used to express codepoints beyond 65'536, that is, beyond what unicode calls the 'basic multilingual plane', or BMP, and which are jokingly referred to as the 'astral' planes of that encoding, for their relative elusiveness and constant headache they offer to people working in the field of text processing and encoding.
now while narrow CPython deals with surrogate pairs quite transparently in some cases, it will still fail to do the right thing in other cases, string splitting being one of those more troublesome cases. in a narrow python build, list( 'abc大𧰼def' ) (or list( 'abc\u5927\U00027C3Cdef' ) when written with escapes) will result in ['a', 'b', 'c', '大', '\ud85f', '\udc3c', 'd', 'e', 'f'], with '\ud85f', '\udc3c' being a surrogate pair. incidentally, '\ud85f\udc3c' is what the JSON standard expects you to write in order to represent U-27C3C. either of these codepoints is useless on its own; a well-formed unicode string can only ever have pairs of surrogates.
so what you want to split a string into characters is really:
from re import compile as _Re
_unicode_chr_splitter = _Re( '(?s)((?:[\ud800-\udbff][\udc00-\udfff])|.)' ).split
def split_unicode_chrs( text ):
return [ chr for chr in _unicode_chr_splitter( text ) if chr ]
which correctly returns ['a', 'b', 'c', '大', '𧰼', 'd', 'e', 'f'] (note: you can probably rewrite the regular expression so that filtering out empty strings becomes unnecessary).
if all you want to do is splitting a text into chinese characters, you'd be pretty much done at this point. not sure what the OP's concept of a 'word' is, but to me, 这是一个句子 may be equally split into 这 | 是 | 一 | 个 | 句子 as well as 这是 | 一个 | 句子, depending on your point of view. however, anything that goes beyond the concept of (possibly composed) characters and character classes (symbols vs whitespace vs letters and such) goes well beyond what is built into unicode and python; you'll need some natural language processing to do that. let me remark that while your example 'yes the United Nations can!'.split() does successfully demonstrate that the split method does something useful to a lot of data, it does not parse the english text into words correctly: it fails to recognize United Nations as one word, while it falsely assumes can! is a word, which it is clearly not. this method gives both false positives and false negatives. depending on your data and what you intend to accomplish, this may or may not be what you want.

Ok I figured it out.
What I need can be accomplished by simply using list():
>>> list(u"这是一个句子")
[u'\u8fd9', u'\u662f', u'\u4e00', u'\u4e2a', u'\u53e5', u'\u5b50']
Thanks for all your inputs.

Best tokenizer tool for Chinese is pynlpir.
import pynlpir
pynlpir.open()
mystring = "你汉语说的很好！"
tokenized_string = pynlpir.segment(mystring, pos_tagging=False)
>>> tokenized_string
['你', '汉语', '说', '的', '很', '好', '！']
Be aware of the fact that pynlpir has a notorious but easy fixable problem with licensing, on which you can find plenty of solutions on the internet.
You simply need to replace the NLPIR.user file in your NLPIR folder downloading a valide licence from this repository and restart your environment.

Languages like Chinese have a very fluid definition of a word. E.g. One meaning of ma is "horse". One meaning of shang is "above" or "on top of". A compound is "mashang" which means literally "on horseback" but is used figuratively to mean "immediately". You need a very good dictionary with compounds in it and looking up the dictionary needs a longest-match approach. Compounding is rife in German (famous example is something like "Danube steam navigation company director's wife" being expressed as one word), Turkic languages, Finnish, and Magyar -- these languages have very long words many of which won't be found in a dictionary and need breaking down to understand them.
Your problem is one of linguistics, nothing to do with Python.

It's partially possible with Japanese, where you usually have different character classes at the beginning and end of the word, but there are whole scientific papers on the subject for Chinese. I have a regular expression for splitting words in Japanese if you are interested: http://hg.hatta-wiki.org/hatta-dev/file/cd21122e2c63/hatta/search.py#l19

Try this: http://code.google.com/p/pymmseg-cpp/

The list() is the answer for Chinese only sentence. For those hybrid English/Chines in most of case. It answered at hybrid-split, just copy answer from Winter as below.
def spliteKeyWord(str):
regex = r"[\u4e00-\ufaff]|[0-9]+|[a-zA-Z]+\'*[a-z]*"
matches = re.findall(regex, str, re.UNICODE)
return matches

if str longer than 30 then take 27 chars and add '...' at the end
otherwise return str
str='中文2018-2020年一区6、8、10、12号楼_「工程建设文档102332号」'
result = len(list(str)) >= 30 and ''.join(list(str)[:27]) + '...' or str

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.