I have the following two-line Python (v. 3.10.7) program "stdin.py":
import sys
print(sys.stdin.read())
and the following one-line text file "ansi.txt" (CP1252 encoding) containing:
‘I am well’ he said.
Note that the open and close quotes are 0x91 and 0x92, respectively. In Windows-10 cmd mode the behavior of the Python code is as expected:
python stdin.py < ansi.txt # --> ‘I am well’ he said.
On the other hand in Windows Powershell:
cat .\ansi.txt | python .\stdin.py # --> ?I am well? he said.
Apparently the CP1252 characters are seen as non-printable characters in the combination Python/PowerShell. If I replace the standard input in "stdin.py" with file input, Python correctly prints the CP1252 quote characters to the screen. PowerShell by itself recognizes and correctly prints 0x91 and 0x92.
Questions: can somebody explain to me why cmd works differently than PowerShell in combination with Python? Why doesn't Python recognize the CP1252 quote characters 0x91 and 0x92 when they are piped into it by PowerShell?
tl;dr
Use the $OutputEncoding preference variable:
In Windows PowerShell:
# Using the system's legacy ANSI code page, as Python does by default.
# NOTE: The & { ... } enclosure isn't strictly necessary, but
# ensures that the $OutputEncoding change is only temporary,
# by limiting it to the child scope that the enclosure creates.
& {
$OutputEncoding = [System.Text.Encoding]::Default
"‘I am well’ he said." | python -c 'import sys; print(sys.stdin.read())'
}
# Using UTF-8 instead, which is generally preferable.
# Note the `-X utf8` option (Python 3.7+)
& {
$OutputEncoding = [System.Text.UTF8Encoding]::new()
"‘I am well’ he said." | python -X utf8 -c 'import sys; print(sys.stdin.read())'
}
In PowerShell (Core) 7+:
# Using the system's legacy ANSI code page, as Python does by default.
# Note: In PowerShell (Core) / .NET 5+,
# [System.Text.Encoding]::Default now reports UTF-8,
# not the active ANSI encoding.
& {
$OutputEncoding = [System.Text.Encoding]::GetEncoding([cultureinfo]::CurrentCulture.TextInfo.ANSICodePage)
"‘I am well’ he said." | python -c 'import sys; print(sys.stdin.read())'
}
# Using UTF-8 instead, which is generally preferable.
# Note the `-X utf8` option (Python 3.7+)
# NO need to set $OutputEncoding, as it now *defaults* to UTF-8
"‘I am well’ he said." | python -X utf8 -c 'import sys; print(sys.stdin.read())'
Note:
$OutputEncoding controls what encoding is used to send data TO external programs via the pipeline (to stdin). It defaults to ASCII(!) in Windows PowerShell, and UTF-8 in PowerShell (Core).
[Console]::OutputEncoding controls how data received FROM external programs (via stdout) is decoded. It defaults to the console's active code page, which in turn defaults to the system's legacy OEM code page (such as 437 on US-English systems).
That these two encodings are not aligned by default is unfortunate; while Windows PowerShell will see no more changes, there is hope for PowerShell (Core): it would make sense to have it default consistently to UTF-8:
GitHub issue #7233 suggests at least defaulting the shortcut files that launch PowerShell to UTF-8 (code page 65001); GitHub issue #14945 more generally discusses the problematic mismatch.
In Windows 10 and above, there is an option to switch to UTF-8 system-wide, which then makes both the OEM and ANSI code pages default to UTF-8 (65001); however, this has far-reaching consequences and is still labeled as being in beta as of Windows 11 - see this answer.
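To check which bytes actually reach Python's stdin under a given $OutputEncoding, it can help to look at the raw bytes rather than the decoded text. The following is only a minimal sketch and not part of the original answer: the helper name describe_stdin_bytes is hypothetical, the sample byte strings are assumptions that mimic the two cases discussed here, and bytes.hex with a separator needs Python 3.8+.

```python
def describe_stdin_bytes(raw: bytes) -> dict:
    """Summarize raw input bytes: a hex dump plus trial decodings."""
    report = {"hex": raw.hex(" ")}
    for name in ("ascii", "cp1252", "utf-8"):
        try:
            report[name] = raw.decode(name)
        except UnicodeDecodeError:
            report[name] = None  # the bytes are not valid in this encoding
    return report


# Assumed bytes for the ASCII default of Windows PowerShell (quotes already lost):
lossy = describe_stdin_bytes(b"?I am well? he said.")
# Assumed bytes with $OutputEncoding set to an ANSI code page of CP1252:
ansi = describe_stdin_bytes(b"\x91I am well\x92 he said.")

print(lossy["hex"])
print(ansi["hex"])
# ascii() keeps the demo printable on any console code page.
print(ascii(ansi["cp1252"]), ansi["utf-8"])
```

In a real script you would pass sys.stdin.buffer.read() to the helper instead of the hard-coded samples.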
Background information:
It is the $OutputEncoding preference variable that determines what character encoding PowerShell uses to send data (invariably text, as of PowerShell 7.3) to an external program via the pipeline.
Note that this even applies when data is read from a file: PowerShell, as of v7.3, never sends raw bytes through the pipeline: it reads the content into .NET strings first and then re-encodes them based on $OutputEncoding on sending them through the pipeline to an external program.
Therefore, what encoding your ansi.txt input file uses is ultimately irrelevant, as long as PowerShell decodes it correctly when reading it into .NET strings (which are internally composed of UTF-16 code units).
See this answer for more information.
Thus, the character encoding stored in $OutputEncoding must match the encoding that the target program expects.
By default the encoding in $OutputEncoding is unrelated to the encoding implied by the console's active code page (which itself defaults to the system's legacy OEM code page, such as 437 on US-English systems), which is what at least legacy console applications tend to use; however, Python does not, and uses the legacy ANSI code page; other modern CLIs, notably NodeJS' node.exe, always use UTF-8.
While $OutputEncoding's default in PowerShell (Core) 7+ is now UTF-8, Windows PowerShell's default is, regrettably, ASCII(!), which means that non-ASCII characters get "lossily" transliterated to verbatim ASCII ? characters, which is what you saw.
Therefore, you must (temporarily) set $OutputEncoding to the encoding that Python expects and/or ask it to use UTF-8 instead.
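The effect of the three encodings can be reproduced in Python alone. This is a minimal sketch that only simulates the encoding step; it assumes that Python's errors="replace" behaves like the ? substitution described above, and it does not involve PowerShell itself.

```python
text = "\u2018I am well\u2019 he said."  # the sample line with curly quotes

# ASCII with '?' substitution, as with Windows PowerShell's default $OutputEncoding:
as_ascii = text.encode("ascii", errors="replace")
print(as_ascii)    # b'?I am well? he said.'

# CP1252, i.e. the ANSI code page of the question's system:
as_cp1252 = text.encode("cp1252")
print(as_cp1252)   # b'\x91I am well\x92 he said.'

# UTF-8, which pairs with `python -X utf8` on the receiving side:
as_utf8 = text.encode("utf-8")
print(as_utf8)     # b'\xe2\x80\x98I am well\xe2\x80\x99 he said.'
```

Only the second and third byte sequences still contain the quote characters; once the ? substitution has happened, no setting on the Python side can recover them.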
While working on a buffer overflow exploit I found something really strange. I have successfully found that I need to provide 32 characters before the proper address I want to jump to and that the proper address is 0x08048a37. When I executed
python -c "print '-'*32+'\x37\x8a\x04\x08'" | ./MyExecutable
the exploit resulted in a success. But, when I tried:
python3 -c "print('-'*32+'\x37\x8a\x04\x08')" | ./MyExecutable
it didn't. The executable simply resulted in a Segmentation Fault without jumping to the desired address. In fact, executing
python -c "print '-'*32+'\x37\x8a\x04\x08'"
and
python3 -c "print('-'*32+'\x37\x8a\x04\x08')"
results in two different outputs on the console. The characters are, of course, not readable but they're visually different.
I wonder why is this happening?
The Python 2 code writes bytes, the Python 3 code writes text that is then encoded to bytes. The latter will thus not write the same output; it depends on the codec configured for your pipe.
In Python 3, write bytes to the sys.stdout.buffer object instead:
python3 -c "import sys; sys.stdout.buffer.write(b'-'*32+b'\x37\x8a\x04\x08')"
You may want to manually add the \n newline that print would add.
sys.stdout is an io.TextIOBase object that encodes data written to it with a given codec (usually based on your locale, but when using a pipe, often defaulting to ASCII) before passing it on to the underlying buffer object. The TextIOBase.buffer attribute gives you direct access to the underlying BufferedIOBase object.
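The difference is easy to see without running the exploit. The sketch below assumes UTF-8 as the pipe's codec, which is only one possibility; with an ASCII codec the encode step would raise UnicodeEncodeError instead.

```python
payload_text = "-" * 32 + "\x37\x8a\x04\x08"     # str: 36 code points
payload_bytes = b"-" * 32 + b"\x37\x8a\x04\x08"  # bytes: exactly 36 bytes

# Encoding the text as UTF-8 turns U+008A into two bytes, shifting the address.
utf8 = payload_text.encode("utf-8")
print(len(utf8), utf8[-5:])   # 37 b'7\xc2\x8a\x04\x08'

# Latin-1 maps code points 0-255 one-to-one onto bytes, so it round-trips.
print(payload_text.encode("latin-1") == payload_bytes)   # True
```

Writing the bytes object to sys.stdout.buffer avoids the question of which codec applies altogether.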
I am trying to create a file with the Unicode character U+662F in its name on Windows (via Perl or Python, either is fine for me). On Linux I am able to get the character 是, but on Windows I am getting this character 是, and somehow I am not able to get the file name as 是.
Python code -
import sys
name = unichr(0x662f)
print(name.encode('utf8').decode(sys.stdout.encoding))
perl code -
my $name .= chr(230).chr(152).chr(175); ##662f
print 'file name ::'. "$name"."txt";
File manipulation in Perl on Windows (Unicode characters in file name)
In Perl on Windows, I use Win32::Unicode, Win32::Unicode::File and Win32::Unicode::Dir. They work perfectly with Unicode characters in file names.
Just mind that Win32::Unicode::File::open() (and new()) have a reversed argument order compared to Perl's built-in open() - mode comes first.
You do not need to encode the characters manually - just insert them as they are (if your Perl script is in UTF-8), or using the \x{N} notation.
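For the Python side of the question, a minimal Python 3 sketch may help (an assumption, since the question's unichr is Python 2). It shows that chr(230), chr(152), chr(175) are simply the UTF-8 bytes of U+662F, and that Python 3 accepts the character directly in a file name. The temporary directory is only there to keep the demo self-contained.

```python
import os
import tempfile

name = chr(0x662F)                   # the character itself, no manual encoding
utf8_bytes = name.encode("utf-8")
print(list(utf8_bytes))              # [230, 152, 175]

# Reading those three bytes as CP1252 gives three unrelated characters.
mojibake = utf8_bytes.decode("cp1252")
print(ascii(mojibake))

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, name + ".txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("hello\n")
    created = os.listdir(folder)

print(ascii(created))                # ascii() keeps the output console-safe
```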
Printing out Unicode characters on Windows
Printing Unicode into console on Windows is another problem. You can't use cmd.exe. Instead use PowerShell ISE. The drawback of the ISE is that it's not a console - scripts can't take input from keyboard thru STDIN.
To get Unicode output, you need to set the output encoding to UTF-8 in every PowerShell ISE session that's started. I suggest doing so in the startup script.
Procedure to have PowerShell ISE default to Unicode output:
1) In order for any user PowerShell scripts to be allowed to run, you first need to do:
Set-ExecutionPolicy RemoteSigned
2) Edit or create your Documents\WindowsPowerShell\Microsoft.PowerShellISE_profile.ps1 to something like:
perl -w -e "print qq!Initializing the console with Perl...\n!;"
[System.Console]::OutputEncoding = [System.Text.Encoding]::UTF8;
The short Perl command is there as a trick to allow the System.Console property to be modified. Without it, you get an error when setting the OutputEncoding.
If I recall correctly, you also have to change the font to Consolas.
Even when the Unicode characters print out fine, you may have trouble including them in command line arguments. In these cases I've found the \x{N} notation works. The Windows Character Map utility is your friend here.
(Edited heavily after I rediscovered the regular PowerShell's inability to display most Unicode characters, with references to PowerShell (non-ISE) removed. Now I remember why I started using the ISE...)
I'm having problems reading text files into my python programs.
import sys
words = sys.stdin.readlines()
I'm reading the file in through stdin but when I try to execute the program I'm getting this error.
PS> python evil_61.py < evilwords.txt
At line:1 char:19
+ python evil_61.py < evilwords.txt
+ ~
The '<' operator is reserved for future use.
+ CategoryInfo : ParserError: (:) [], ParentContainsErrorRecordException
+ FullyQualifiedErrorId : RedirectionNotSupported
Could someone tell me how to run these kinds of programs as it is essential for my course and I'd rather use Windows than Linux.
Since < for input redirection is not supported in PowerShell, use Get-Content in a pipeline instead:
Get-Content evilwords.txt | python evil_61.py
Note: Adding the -Raw switch - which reads a file as a single, multi-line string - would speed things up in principle (at the expense of increased memory consumption), but PowerShell invariably appends a newline to data piped to external programs, as of PowerShell 7.2 (see this answer), so the target program will typically see an extra, empty line at the end. Get-Content's default behavior of line-by-line streaming avoids that.
Beware character-encoding issues:
Get-Content, in the absence of an -Encoding argument, assumes the following encoding:
Windows PowerShell (the built-into-Windows edition whose latest and final version is 5.1): the active ANSI code page, which is implied by the active legacy system locale (language for non-Unicode programs).
PowerShell (Core) 7+: (BOM-less) UTF-8
On passing the lines through the pipeline, they are (re-)encoded based on the encoding stored in the $OutputEncoding preference variable, which defaults to:
Windows PowerShell: ASCII(!)
PowerShell (Core) 7+: (BOM-less) UTF-8
As you can see, only PowerShell (Core) 7+ exhibits consistent behavior, though, unfortunately, as of PowerShell Core 7.2.0-preview.9, this doesn't yet extend to capturing output from external programs, because the encoding that controls the interpretation of received data, stored in [Console]::OutputEncoding, still defaults to the system's active OEM code page - see GitHub issue #7233.
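On the Python side, one way to stay independent of these defaults is to decode stdin with an explicit encoding. The following is a minimal sketch, assuming the sender uses UTF-8 (for example $OutputEncoding set as described above); the helper name read_lines is hypothetical, and the BytesIO object merely stands in for sys.stdin.buffer.

```python
import io


def read_lines(binary_stream, encoding="utf-8"):
    """Decode a binary stream line by line with an explicit encoding."""
    text_stream = io.TextIOWrapper(binary_stream, encoding=encoding, errors="replace")
    return [line.rstrip("\r\n") for line in text_stream]


# Simulated pipe contents; in a real script, pass sys.stdin.buffer instead.
sample = io.BytesIO("caf\u00e9\r\nna\u00efve\r\n".encode("utf-8"))
words = read_lines(sample)
print([ascii(word) for word in words])
```

In Python 3.7+, calling sys.stdin.reconfigure(encoding="utf-8") before reading is a shorter alternative with the same intent.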
Consider passing the text file as an argument at command line for Python script to use. All command line arguments (including file name) come stored in the sys.argv list:
Python Script (in evil_61.py)
import sys
txtfile = sys.argv[1]
with open(txtfile) as f:
    content = f.readlines()
PowerShell Command
PS> python evil_61.py evilwords.txt
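When the file is opened by Python itself, its encoding can be stated explicitly instead of relying on the platform default. This is a minimal sketch: the helper name read_words is hypothetical, and the CP1252 sample file is an assumption standing in for evilwords.txt.

```python
import os
import tempfile


def read_words(path, encoding="utf-8"):
    """Read the non-empty lines of a text file using an explicit encoding."""
    with open(path, encoding=encoding) as handle:
        return [line.strip() for line in handle if line.strip()]


# Demo with a throwaway CP1252 file standing in for evilwords.txt.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "evilwords.txt")
    with open(path, "w", encoding="cp1252") as handle:
        handle.write("na\u00efve\nevil\n")
    words = read_words(path, encoding="cp1252")

print(len(words), "words read")   # 2 words read
```

In the script above, the path would come from sys.argv[1] rather than from a temporary directory.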
Happy examples:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
czech = u'Leoš Janáček'.encode("utf-8")
print(czech)
pl = u'Zdzisław Beksiński'.encode("utf-8")
print(pl)
jp = u'リング 山村 貞子'.encode("utf-8")
print(jp)
chinese = u'五行'.encode("utf-8")
print(chinese)
MIR = u'Машина для Инженерных Расчётов'.encode("utf-8")
print(MIR)
pt = u'Minha Língua Portuguesa: çáà'.encode("utf-8")
print(pt)
Unhappy output:
b'Leo\xc5\xa1 Jan\xc3\xa1\xc4\x8dek'
b'Zdzis\xc5\x82aw Beksi\xc5\x84ski'
b'\xe3\x83\xaa\xe3\x83\xb3\xe3\x82\xb0 \xe5\xb1\xb1\xe6\x9d\x91 \xe8\xb2\x9e\xe5\xad\x90'
b'\xe4\xba\x94\xe8\xa1\x8c'
b'\xd0\x9c\xd0\xb0\xd1\x88\xd0\xb8\xd0\xbd\xd0\xb0 \xd0\xb4\xd0\xbb\xd1\x8f \xd0\x98\xd0\xbd\xd0\xb6\xd0\xb5\xd0\xbd\xd0\xb5\xd1\x80\xd0\xbd\xd1\x8b\xd1\x85 \xd0\xa0\xd0\xb0\xd1\x81\xd1\x87\xd1\x91\xd1\x82\xd0\xbe\xd0\xb2'
b'Minha L\xc3\xadngua Portuguesa: \xc3\xa7\xc3\xa1\xc3\xa0'
And if I print them like this:
jp = u'リング 山村 貞子'
print(jp)
I get:
Traceback (most recent call last):
File "x.py", line 5, in <module>
print(jp)
File "C:\Python34\lib\encodings\cp850.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position
0-2: character maps to <undefined>
I've also tried the following from this question (And other alternatives that involve sys.stdout.encoding):
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import print_function
import sys
def safeprint(s):
    try:
        print(s)
    except UnicodeEncodeError:
        if sys.version_info >= (3,):
            print(s.encode('utf8').decode(sys.stdout.encoding))
        else:
            print(s.encode('utf8'))
jp = u'リング 山村 貞子'
safeprint(jp)
And things get even more cryptic:
リング 山村 貞子
And the docs were not very helpful.
So, what's the deal with Python 3.4, Unicode, different languages and Windows? Almost all possible examples I could find, deal with Python 2.x.
Is there a general and cross-platform way of printing ANY Unicode character from any language in a decent and non-nasty way in Python 3.4?
EDIT:
I've tried typing at the terminal:
chcp 65001
To change the code page, as proposed here and in the comments, and it did not work (Including the attempt with sys.stdout.encoding)
Update: Since Python 3.6, the code example that prints Unicode strings directly should just work now (even without py -mrun).
Python can print text in multiple languages in Windows console whatever chcp says:
T:\> py -mpip install win-unicode-console
T:\> py -mrun your_script.py
where your_script.py prints Unicode directly e.g.:
#!/usr/bin/env python3
print('š áč') # cz
print('ł ń') # pl
print('リング') # jp
print('五行') # cn
print('ш я жх ё') # ru
print('í çáà') # pt
All you need is to configure the font in your Windows console that can display the desired characters.
You could also run your Python script via IDLE without installing non-stdlib modules:
T:\> py -midlelib -r your_script.py
To write to a file/pipe, use PYTHONIOENCODING=utf-8 as @Mark Tolonen suggested:
T:\> set PYTHONIOENCODING=utf-8
T:\> py your_script.py >output-utf8.txt
Only the last solution supports non-BMP characters such as 😒 (U+1F612 UNAMUSED FACE) -- py -mrun can write them but Windows console displays them as boxes even if the font supports corresponding Unicode characters (though you can copy-paste the boxes into another program, to get the characters).
The problem was (see the Python 3.6 update below) with the Windows console, which supports an ANSI character set appropriate for the region targeted by your version of Windows. Python throws an exception by default when unsupported characters are output.
Python can read an environment variable to output in other encodings, or to change the error handling default. Below, I've read the console default and change the default error handling to print a ? instead of throwing an error for characters that are unsupported in the console's current code page.
C:\>chcp
Active code page: 437 # Note, US Windows OEM code page.
C:\>set PYTHONIOENCODING=437:replace
C:\>example.py
Leo? Janá?ek
Zdzis?aw Beksi?ski
??? ?? ??
??
?????? ??? ?????????? ????????
Minha Língua Portuguesa: çáà
Note the US OEM code page is limited to ASCII and some Western European characters.
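The same substitution can be reproduced inside Python, which is handy for checking in advance what a given code page would show. This is a minimal sketch; the helper name console_safe is hypothetical and only mimics what the 437:replace setting does to the output.

```python
def console_safe(text, encoding="cp437", errors="replace"):
    """Return text as it would look after a round trip through a code page."""
    return text.encode(encoding, errors=errors).decode(encoding)


czech = console_safe("Leo\u0161 Jan\u00e1\u010dek")        # Leoš Janáček
polish = console_safe("Zdzis\u0142aw Beksi\u0144ski")      # Zdzisław Beksiński
chinese = console_safe("\u4e94\u884c", errors="backslashreplace")

# ascii() keeps the demo printable whatever the console code page is.
print(ascii(czech))
print(ascii(polish))
print(chinese)
```

With errors="backslashreplace" the unsupported characters stay recoverable as escape sequences instead of collapsing into ?.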
Below I've instructed Python to use UTF8, but since the Windows console doesn't support it, I redirect the output to a file and display it in Notepad:
C:\>set PYTHONIOENCODING=utf8
C:\>example >out.txt
C:\>notepad out.txt
On Windows, it's best to use a Python IDE that supports UTF-8 instead of the console when working with multiple languages. If only using one language, select it as the system locale in the Region and Language control panel and the console will support the characters of that language.
Update for Python 3.6
Python 3.6 now uses Windows Unicode APIs to write directly to the console, so the only limit is the console font's support of the characters. The following code works in a US Windows console. Since I have a Chinese language pack installed, it even displays the Chinese and Japanese text if the console font is changed. Even without the correct font, replacement characters are shown in the console. Cut-n-paste to an environment such as this web page will display the characters correctly.
#!python3.6
#coding: utf8
czech = 'Leoš Janáček'
print(czech)
pl = 'Zdzisław Beksiński'
print(pl)
jp = 'リング 山村 貞子'
print(jp)
chinese = '五行'
print(chinese)
MIR = 'Машина для Инженерных Расчётов'
print(MIR)
pt = 'Minha Língua Portuguesa: çáà'
print(pt)
Output:
Leoš Janáček
Zdzisław Beksiński
リング 山村 貞子
五行
Машина для Инженерных Расчётов
Minha Língua Portuguesa: çáà
I thought I knew everything about encodings and Python, but today I came across a weird problem: although the console is set to code page 850 - and Python reports it correctly - parameters I put on the command line seem to be encoded in code page 1252. If I try to decode them with sys.stdin.encoding, I get the wrong result. If I assume 'cp1252', ignoring what sys.stdout.encoding reports, it works.
Am I missing something, or is this a bug in Python ? Windows ? Note: I am running Python 2.6.6 on Windows 7 EN, locale set to French (Switzerland).
In the test program below, I check that literals are correctly interpreted and can be printed - this works. But all values I pass on the command line seem to be encoded wrongly:
#!/usr/bin/python
# -*- encoding: utf-8 -*-
import sys
literal_mb = 'utf-8 literal: üèéÃÂç€ÈÚ'
literal_u = u'unicode literal: üèéÃÂç€ÈÚ'
print "Testing literals"
print literal_mb.decode('utf-8').encode(sys.stdout.encoding,'replace')
print literal_u.encode(sys.stdout.encoding,'replace')
print "Testing arguments ( stdin/out encodings:",sys.stdin.encoding,"/",sys.stdout.encoding,")"
for i in range(1,len(sys.argv)):
    arg = sys.argv[i]
    print "arg",i,":",arg
    for ch in arg:
        print " ",ch,"->",ord(ch),
        if ord(ch)>=128 and sys.stdin.encoding == 'cp850':
            print "<-",ch.decode('cp1252').encode(sys.stdout.encoding,'replace'),"[assuming input was actually cp1252 ]"
        else:
            print ""
In a newly created console, when running
C:\dev>test-encoding.py abcé€
I get the following output
Testing literals
utf-8 literal: üèéÃÂç?ÈÚ
unicode literal: üèéÃÂç?ÈÚ
Testing arguments ( stdin/out encodings: cp850 / cp850 )
arg 1 : abcÚÇ
a -> 97
b -> 98
c -> 99
Ú -> 233 <- é [assuming input was actually cp1252 ]
Ç -> 128 <- ? [assuming input was actually cp1252 ]
while I would expect the 4th character to have an ordinal value of 130 instead of 233 (see the code pages 850 and 1252).
Notes: the value of 128 for the euro symbol is a mystery - since cp850 does not have it. Otherwise, the '?' are expected - cp850 cannot print the characters and I have used 'replace' in the conversions.
If I change the code page of the console to 1252 by issuing chcp 1252 and run the same command, I (correctly) obtain
Testing literals
utf-8 literal: üèéÃÂç€ÈÚ
unicode literal: üèéÃÂç€ÈÚ
Testing arguments ( stdin/out encodings: cp1252 / cp1252 )
arg 1 : abcé€
a -> 97
b -> 98
c -> 99
é -> 233
€ -> 128
Any ideas what I'm missing ?
Edit 1: I've just tested by reading sys.stdin. This works as expected: in cp850, typing 'é' results in an ordinal value of 130. So the problem is really for the command line only. So, is the command line treated differently than the standard input ?
Edit 2: It seems I had the wrong keywords. I found another very close topic on SO: Read Unicode characters from command-line arguments in Python 2.x on Windows. Still, if the command line is not encoded like sys.stdin, and since sys.getdefaultencoding() reports 'ascii', it seems there is no way to know its actual encoding. I find the answer using win32 extensions pretty hacky.
Replying to myself:
On Windows, the encoding used by the console (thus, that of sys.stdin/out) differs from the encoding of various OS-provided strings - obtained through e.g. os.getenv(), sys.argv, and certainly many more.
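The ordinals seen in the question follow directly from this mismatch. A minimal Python 3 sketch, assuming the command line arrived as CP1252 bytes and was then displayed through the console's CP850:

```python
observed = []
for ch in "\u00e9\u20ac":              # é and €
    ansi = ch.encode("cp1252")         # the byte delivered on the command line
    shown = ansi.decode("cp850")       # what the CP850 console displays for it
    observed.append((ansi[0], shown))
    print(ansi[0], ascii(shown))
# 233 '\xda'   (Ú)
# 128 '\xc7'   (Ç)

# The value expected if the argument had really been CP850-encoded:
print("\u00e9".encode("cp850")[0])     # 130
```

This also accounts for the "mysterious" 128: it is the CP1252 byte for the euro sign, which CP850 happens to display as Ç.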
The encoding provided by sys.getdefaultencoding() is really that - a default, chosen by Python developers to match the "most reasonable encoding" the interpreter uses in extreme cases. I get 'ascii' on my Python 2.6, and tried with portable Python 3.1, which yields 'utf-8'. Both are not what we are looking for - they are merely fallbacks for encoding conversion functions.
As this page seems to state, the encoding used by OS-provided strings is governed by the Active Code Page (ACP). Since Python does not have a native function to retrieve it, I had to use ctypes:
from ctypes import cdll
os_encoding = 'cp' + str(cdll.kernel32.GetACP())
Edit: But as Jacek suggests, there actually is a more robust and Pythonic way to do it (semantics would need validation, but until proven wrong, I'll use this)
import locale
os_encoding = locale.getpreferredencoding()
# This returns 'cp1252' on my system, yay!
and then
u_argv = [x.decode(os_encoding) for x in sys.argv]
u_env = os.getenv('myvar').decode(os_encoding)
On my system, os_encoding = 'cp1252', so it works. I am quite certain this would break on other platforms, so feel free to edit and make it more generic. We would certainly need some kind of translation table between the ACP reported by Windows and the Python encoding name - something better than just prepending 'cp'.
This is unfortunately a hack, although I find it a bit less intrusive than the one suggested by this ActiveState Code Recipe (linked to by the SO question mentioned in Edit 2 of my question). The advantage I see here is that this can be applied to os.getenv(), and not only to sys.argv.
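For readers on Python 3, where sys.argv and os.environ already hold text, the lookup is only needed for byte strings obtained elsewhere. The following is a minimal sketch combining the two approaches above; the function name os_string_encoding is hypothetical, and the non-Windows fallback to locale.getpreferredencoding() is an assumption.

```python
import codecs
import locale
import sys


def os_string_encoding():
    """Best-effort name of the encoding used for OS-provided byte strings."""
    if sys.platform == "win32":
        import ctypes  # imported lazily; windll exists only on Windows
        return "cp%d" % ctypes.windll.kernel32.GetACP()
    return locale.getpreferredencoding(False)


encoding = os_string_encoding()
# codecs.lookup() confirms that Python actually knows a codec by that name.
print(encoding, codecs.lookup(encoding).name)
```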
I tried the solutions. There may still be some encoding problems: we also need to use TrueType fonts.
Fix:
Run chcp 65001 in cmd to change the encoding to UTF-8.
Change the cmd font to a TrueType one, such as Lucida Console, that supports the code pages preceding 65001.
Here's my complete fix for the encoding error:
def fixCodePage():
    import os  # needed for os.system() below
    import sys
    import codecs
    import ctypes
    if sys.platform == 'win32':
        if sys.stdout.encoding != 'cp65001':
            os.system("echo off")
            os.system("chcp 65001") # Change active page code
            sys.stdout.write("\x1b[A") # Removes the output of chcp command
            sys.stdout.flush()
        LF_FACESIZE = 32
        STD_OUTPUT_HANDLE = -11
        class COORD(ctypes.Structure):
            _fields_ = [("X", ctypes.c_short), ("Y", ctypes.c_short)]
        class CONSOLE_FONT_INFOEX(ctypes.Structure):
            _fields_ = [("cbSize", ctypes.c_ulong),
                        ("nFont", ctypes.c_ulong),
                        ("dwFontSize", COORD),
                        ("FontFamily", ctypes.c_uint),
                        ("FontWeight", ctypes.c_uint),
                        ("FaceName", ctypes.c_wchar * LF_FACESIZE)]
        font = CONSOLE_FONT_INFOEX()
        font.cbSize = ctypes.sizeof(CONSOLE_FONT_INFOEX)
        font.nFont = 12
        font.dwFontSize.X = 7
        font.dwFontSize.Y = 12
        font.FontFamily = 54
        font.FontWeight = 400
        font.FaceName = "Lucida Console"
        handle = ctypes.windll.kernel32.GetStdHandle(STD_OUTPUT_HANDLE)
        ctypes.windll.kernel32.SetCurrentConsoleFontEx(handle, ctypes.c_long(False), ctypes.pointer(font))
Note: You can see a font change while executing the program.
Well, what worked for me was using the following code snippet:
# -*- coding: utf-8 -*-
import os
import sys
print (f"OS: {os.device_encoding(0)}, sys: {sys.stdout.encoding}")
Comparing both on some Windows systems with Python 3.8 showed that os.device_encoding(0) always reflected the code page setting in the terminal. (Tested with PowerShell and with the old cmd shell on Windows 10 and Windows 7.)
This was even true after changing the terminals code page with shell command:
chcp 850
or e.g.:
chcp 1252
Now using os.device_encoding(0) for tasks like decoding a subprocess stdout result from bytes to string worked out even with Non-ASCII chars like é, ö, ³, ↓, ...
So, as others already pointed out, on Windows the locale setting is really just system information about user preferences, not what the shell might actually be using at the moment.
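Put together, decoding a subprocess result this way might look like the sketch below. It is only an illustration under assumptions: the helper name terminal_encoding is hypothetical, the child process is a stand-in that prints plain ASCII, and the fallback to locale.getpreferredencoding() covers the case where the descriptor is not attached to a terminal and os.device_encoding() returns None.

```python
import locale
import os
import subprocess
import sys


def terminal_encoding(fd=0):
    """Encoding of the terminal behind fd, or the locale default if there is none."""
    return os.device_encoding(fd) or locale.getpreferredencoding(False)


# Stand-in child process; a real console tool would go here.
result = subprocess.run(
    [sys.executable, "-c", "print('ok')"],
    stdout=subprocess.PIPE,
    check=True,
)
text = result.stdout.decode(terminal_encoding(), errors="replace")
print(text.strip())   # ok
```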