I am trying to create a file with the Unicode character U+662F on Windows (via Perl or Python, anything is fine for me). On Linux I am able to get the character 是, but on Windows I am getting 是 instead, and somehow I am not able to get the file name as 是.
Python code -
import sys
name = unichr(0x662f)
print(name.encode('utf8').decode(sys.stdout.encoding))
perl code -
my $name = chr(230).chr(152).chr(175); ## the UTF-8 bytes of U+662F
print 'file name :: ' . $name . '.txt';
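For comparison, in Python 3 (where str is Unicode) Windows' native Unicode file APIs are used automatically, so a minimal sketch like the following should create the file with the correct name; the temporary directory and the file contents here are just placeholders:

```python
import os
import tempfile

# Python 3: str is Unicode, so the character can go straight into
# the file name; no manual byte-level encoding is needed.
name = chr(0x662F)                   # '是'
target_dir = tempfile.mkdtemp()      # stand-in for the real folder
filename = os.path.join(target_dir, name + '.txt')

with open(filename, 'w', encoding='utf-8') as f:
    f.write('hello')

print(os.path.basename(filename))  # 是.txt
```

The mojibake in the question (是) is what the three UTF-8 bytes of U+662F look like when interpreted as CP1252, which is why hand-assembling bytes fails on Windows.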
File manipulation in Perl on Windows (Unicode characters in file name)
In Perl on Windows, I use Win32::Unicode, Win32::Unicode::File and Win32::Unicode::Dir. They work perfectly with Unicode characters in file names.
Just mind that Win32::Unicode::File::open() (and new()) have a reversed argument order compared to Perl's built-in open(): the mode comes first.
You do not need to encode the characters manually - just insert them as they are (if your Perl script is in UTF-8), or using the \x{N} notation.
Printing out Unicode characters on Windows
Printing Unicode to the console on Windows is another problem. You can't use cmd.exe; instead use PowerShell ISE. The drawback of the ISE is that it's not a console: scripts can't take input from the keyboard through STDIN.
To get Unicode output, you need to set the output encoding to UTF-8 in every PowerShell ISE session that's started. I suggest doing so in the startup script.
Procedure to have PowerShell ISE default to Unicode output:
1) In order for any user PowerShell scripts to be allowed to run, you first need to run:
Set-ExecutionPolicy RemoteSigned
2) Edit or create your Documents\WindowsPowerShell\Microsoft.PowerShellISE_profile.ps1 to something like:
perl -w -e "print qq!Initializing the console with Perl...\n!;"
[System.Console]::OutputEncoding = [System.Text.Encoding]::UTF8;
The short Perl command is there as a trick to allow the System.Console property to be modified. Without it, you get an error when setting the OutputEncoding.
If I recall correctly, you also have to change the font to Consolas.
Even when the Unicode characters print out fine, you may have trouble including them in command line arguments. In these cases I've found the \x{N} notation works. The Windows Character Map utility is your friend here.
(Edited heavily after I rediscovered the regular PowerShell's inability to display most Unicode characters, with references to PowerShell (non-ISE) removed. Now I remember why I started using the ISE...)
Related
I have the following two-line Python (v. 3.10.7) program "stdin.py":
import sys
print(sys.stdin.read())
and the following one-line text file "ansi.txt" (CP1252 encoding) containing:
‘I am well’ he said.
Note that the open and close quotes are 0x91 and 0x92, respectively. In Windows-10 cmd mode the behavior of the Python code is as expected:
python stdin.py < ansi.txt # --> ‘I am well’ he said.
On the other hand in Windows Powershell:
cat .\ansi.txt | python .\stdin.py # --> ?I am well? he said.
Apparently the CP1252 characters are seen as non-printable characters in the Python/PowerShell combination. If I replace the standard input in "stdin.py" with file input, Python prints the CP1252 quote characters to the screen correctly. PowerShell by itself recognizes and prints 0x91 and 0x92 correctly.
Questions: can somebody explain to me why cmd works differently from PowerShell in combination with Python? Why doesn't Python recognize the CP1252 quote characters 0x91 and 0x92 when they are piped into it by PowerShell?
tl;dr
Use the $OutputEncoding preference variable:
In Windows PowerShell:
# Using the system's legacy ANSI code page, as Python does by default.
# NOTE: The & { ... } enclosure isn't strictly necessary, but
# ensures that the $OutputEncoding change is only temporary,
#       by limiting it to the child scope that the enclosure creates.
& {
$OutputEncoding = [System.Text.Encoding]::Default
"‘I am well’ he said." | python -c 'import sys; print(sys.stdin.read())'
}
# Using UTF-8 instead, which is generally preferable.
# Note the `-X utf8` option (Python 3.7+)
& {
$OutputEncoding = [System.Text.UTF8Encoding]::new()
"‘I am well’ he said." | python -X utf8 -c 'import sys; print(sys.stdin.read())'
}
In PowerShell (Core) 7+:
# Using the system's legacy ANSI code page, as Python does by default.
# Note: In PowerShell (Core) / .NET 5+,
#       [System.Text.Encoding]::Default now reports UTF-8,
# not the active ANSI encoding.
& {
$OutputEncoding = [System.Text.Encoding]::GetEncoding([cultureinfo]::CurrentCulture.TextInfo.ANSICodePage)
"‘I am well’ he said." | python -c 'import sys; print(sys.stdin.read())'
}
# Using UTF-8 instead, which is generally preferable.
# Note the `-X utf8` option (Python 3.7+)
# NO need to set $OutputEncoding, as it now *defaults* to UTF-8
"‘I am well’ he said." | python -X utf8 -c 'import sys; print(sys.stdin.read())'
Note:
$OutputEncoding controls what encoding is used to send data TO external programs via the pipeline (to stdin). It defaults to ASCII(!) in Windows PowerShell, and UTF-8 in PowerShell (Core).
[Console]::OutputEncoding controls how data received FROM external programs (via stdout) is decoded. It defaults to the console's active code page, which in turn defaults to the system's legacy OEM code page, such as 437 on US-English systems.
That these two encodings are not aligned by default is unfortunate; while Windows PowerShell will see no more changes, there is hope for PowerShell (Core): it would make sense to have it default consistently to UTF-8:
GitHub issue #7233 suggests at least defaulting the shortcut files that launch PowerShell to UTF-8 (code page 65001); GitHub issue #14945 more generally discusses the problematic mismatch.
In Windows 10 and above, there is an option to switch to UTF-8 system-wide, which then makes both the OEM and ANSI code pages default to UTF-8 (65001); however, this has far-reaching consequences and is still labeled as being in beta as of Windows 11 - see this answer.
Background information:
It is the $OutputEncoding preference variable that determines what character encoding PowerShell uses to send data (invariably text, as of PowerShell 7.3) to an external program via the pipeline.
Note that this even applies when data is read from a file: PowerShell, as of v7.3, never sends raw bytes through the pipeline: it reads the content into .NET strings first and then re-encodes them based on $OutputEncoding on sending them through the pipeline to an external program.
Therefore, what encoding your ansi.txt input file uses is ultimately irrelevant, as long as PowerShell decodes it correctly when reading it into .NET strings (which are internally composed of UTF-16 code units).
See this answer for more information.
Thus, the character encoding stored in $OutputEncoding must match the encoding that the target program expects.
By default, the encoding in $OutputEncoding is unrelated to the encoding implied by the console's active code page (which itself defaults to the system's legacy OEM code page, such as 437 on US-English systems). Legacy console applications tend to use the OEM code page; Python, however, uses the legacy ANSI code page, while other modern CLIs, notably NodeJS's node.exe, always use UTF-8.
While $OutputEncoding's default in PowerShell (Core) 7+ is now UTF-8, Windows PowerShell's default is, regrettably, ASCII(!), which means that non-ASCII characters get "lossily" transliterated to verbatim ASCII ? characters, which is what you saw.
Therefore, you must (temporarily) set $OutputEncoding to the encoding that Python expects, and/or ask it to use UTF-8 instead.
I'm learning Python and tried to make a hangman-style game (a literal translation of its Portuguese name would be "hanging game"; I don't know the real name of the game in English, sorry). For those who aren't familiar with this game, the player must discover a secret word by guessing one letter at a time.
In my code, I get a collection of secret words which is imported from a txt file using the following code:
words_bank = open('palavras.txt', 'r')
words = []
for line in words_bank:
    words.append(line.strip().lower())
words_bank.close()
print(words)
The output of print(words) is ['ma\xc3\xa7\xc3\xa3', 'a\xc3\xa7a\xc3\xad', 'tucum\xc3\xa3'] but if I try print('maçã, açaí, tucumã') in order to check the special characters, everything is printed correctly. Looks like the issue is in the encoding (or decoding... I'm still reading lots of articles about it to really understand) special characters from files.
The content of line 1 of my code is # coding: utf-8 because after some research I found out that I have to specify the Unicode format that is required for the text to be encoded/decoded. Before adding it, I was receiving the following message when running the code:
File "path/forca.py", line 12
SyntaxError: Non-ASCII character '\xc3' in file path/forca.py on line 12, but no encoding declared
Line 12 content: print('maçã, açaí, tucumã')
Things that I've already tried:
add encode='utf-8' as parameter in open('palavras.txt', 'r')
add decode='utf-8' as parameter in open('palavras.txt', 'r')
same as above but with latin1
substitute line 1 content for #coding: latin1
My OS is Ubuntu 20.04 LTS, my IDE is VS Code.
Nothing works!
I don't know what to search for or what to do anymore.
SOLUTION HERE
Thanks to the help given by the friends above, I was able to find out that the real problem was in the combo VS Code extension (Code Runner) + python alias version from Ubuntu 20.04 LTS.
Code Runner is set to run code in the Terminal in my setup, so apparently, when it called python, the aliased version was Python 2.7.x. To overcome this, I used this thread to set Python 3 as the default.
It's done! Whenever python is called, both in the terminal and in VS Code with Code Runner, all special characters work just fine.
Thanks everybody for your time and your help =)
This only happens when using Python 2.x.
The error is probably because you're printing the list itself, not the items in the list.
When you call print(words) (where words is a list), Python invokes a special function called repr on the list object. The list then creates a summary representation by calling repr on each element and assembling a neat string visualisation.
repr(string) returns an ASCII representation (with escapes) rather than a version suitable for your terminal.
Instead, try:
for x in words:
    print(x)
Note: the option for open is encoding, e.g.
open('myfile.txt', encoding='utf-8')
You should always, always pass the encoding option to open: Python 3 on Linux and macOS typically assumes UTF-8 (via the locale), while Python 3 on Windows uses a legacy 8-bit ANSI code page. Because the default varies by platform, relying on it is unreliable; Python 3.10+ can emit an EncodingWarning when the option is omitted (PEP 597), and only the opt-in UTF-8 mode (python -X utf8) guarantees UTF-8 everywhere.
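As a concrete sketch (the file name palavras.txt comes from the question, but the sample words written here are assumptions), reading the file with an explicit encoding looks like this:

```python
import os
import tempfile

# Write a small sample file, then read it back with an explicit
# encoding so the result is the same on every platform.
path = os.path.join(tempfile.mkdtemp(), 'palavras.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write('Maçã\nAçaí\nTucumã\n')

with open(path, 'r', encoding='utf-8') as words_bank:
    words = [line.strip().lower() for line in words_bank]

print(words)  # ['maçã', 'açaí', 'tucumã']
```

Because the file is decoded to str objects on read, printing the list under Python 3 shows the accented characters rather than escape sequences.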
See Python 2.x vs 3.x behaviour:
Py2
>>> print ['maçã', 'açaí', 'tucumã']
['ma\xc3\xa7\xc3\xa3', 'a\xc3\xa7a\xc3\xad', 'tucum\xc3\xa3']
>>> repr('maçã')
"'ma\\xc3\\xa7\\xc3\\xa3'"
>>> print 'maçã'
maçã
Py3
>>> print(['maçã', 'açaí', 'tucumã'])
['maçã', 'açaí', 'tucumã']
>>> repr('maçã')
"'maçã'"
I'm having problems reading text files into my python programs.
import sys
words = sys.stdin.readlines()
I'm reading the file in through stdin but when I try to execute the program I'm getting this error.
PS> python evil_61.py < evilwords.txt
At line:1 char:19
+ python evil_61.py < evilwords.txt
+ ~
The '<' operator is reserved for future use.
+ CategoryInfo : ParserError: (:) [], ParentContainsErrorRecordException
+ FullyQualifiedErrorId : RedirectionNotSupported
Could someone tell me how to run these kinds of programs as it is essential for my course and I'd rather use Windows than Linux.
Since < for input redirection is not supported in PowerShell, use Get-Content in a pipeline instead:
Get-Content evilwords.txt | python evil_61.py
Note: Adding the -Raw switch - which reads a file as a single, multi-line string - would speed things up in principle (at the expense of increased memory consumption), but PowerShell invariably appends a newline to data piped to external programs, as of PowerShell 7.2 (see this answer), so the target program will typically see an extra, empty line at the end. Get-Content's default behavior of line-by-line streaming avoids that.
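If you do use -Raw (or otherwise receive the input as one string), a workaround on the Python side is to strip the trailing newline before splitting. This is a sketch under the assumption that you do not care about genuine blank lines at the very end of the input, since rstrip('\n') removes those too:

```python
def read_piped_lines(text):
    # Strip the newline PowerShell appends to piped data, then split.
    # Caveat: this also drops genuine trailing blank lines.
    return text.rstrip('\n').splitlines()

# A two-line payload plus the extra newline appended by the pipeline:
print(read_piped_lines('evil\nwords\n\n'))  # ['evil', 'words']
```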
Beware character-encoding issues:
Get-Content, in the absence of an -Encoding argument, assumes the following encoding:
Windows PowerShell (the built-into-Windows edition whose latest and final version is 5.1): the active ANSI code page, which is implied by the active legacy system locale (language for non-Unicode programs).
PowerShell (Core) 7+: (BOM-less) UTF-8
On passing the lines through the pipeline, they are (re-)encoded based on the encoding stored in the $OutputEncoding preference variable, which defaults to:
Windows PowerShell: ASCII(!)
PowerShell (Core) 7+: (BOM-less) UTF-8
As you can see, only PowerShell (Core) 7+ exhibits consistent behavior, though, unfortunately, as of PowerShell Core 7.2.0-preview.9, this doesn't yet extend to capturing output from external programs, because the encoding that controls the interpretation of received data, stored in [Console]::OutputEncoding, still defaults to the system's active OEM code page - see GitHub issue #7233.
Consider passing the text file as a command-line argument for the Python script to use. All command-line arguments (including the file name) are stored in the sys.argv list:
Python Script (in evil_61.py)
import sys
txtfile = sys.argv[1]
with open(txtfile) as f:
    content = f.readlines()
PowerShell Command
PS> python evil_61.py evilwords.txt
This question already has answers here:
Unicode filenames on Windows with Python & subprocess.Popen()
(5 answers)
Closed 7 years ago.
Hi I'm trying to extract audio from a video file using ffmpeg with the following function (in Python 2):
def extractAudio(path):
    command = ''.join(('ffmpeg -i "', path, '" -ab 160k -ac 2 -ar 44100 -vn audio.wav'))
    print(command)
    subprocess.call(command, shell=True)
the above print statement successfully prints the following:
ffmpeg -i "C:/Users/pruthvi/Desktop/vidrec/temp\TAEYEON 태연_ I (feat. Verbal Jint)_Music Video.mp4" -ab 160k -ac 2 -ar 44100 -vn audio.wav
but in the next statement, it fails and throws the following error:
Traceback (most recent call last):
File "C:/Users/pruthvi/Desktop/vidrec/vidrec.py", line 53, in <module>
main()
File "C:/Users/pruthvi/Desktop/vidrec/vidrec.py", line 46, in main
extractAudio(os.path.join(di,each))
File "C:/Users/pruthvi/Desktop/vidrec/vidrec.py", line 28, in extractAudio
subprocess.call(command,shell=True)
File "C:\Python27\lib\subprocess.py", line 522, in call
return Popen(*popenargs, **kwargs).wait()
File "C:\Python27\lib\subprocess.py", line 710, in __init__
errread, errwrite)
File "C:\Python27\lib\subprocess.py", line 928, in _execute_child
args = '{} /c "{}"'.format (comspec, args)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 56-57: ordinal not in range(128)
I've tried all the possible solutions from previous questions, like encoding with the correct type, setting PYTHONIOENCODING, etc., but none seems to work. If I convert the command to ASCII, it no longer works, because the conversion removes the non-ASCII characters: the file is then not found and the audio is not extracted from the target file. Any help is appreciated, thanks :)
To experiment, you can use the following code:
# -*- coding: utf-8 -*-
import subprocess
def extractAudio():
    path = u'C:/Users/pruthvi/Desktop/vidrec/temp\TAEYEON 태연_ I (feat. Verbal Jint)_Music Video.mp4'
    command = ''.join(('ffmpeg -i "', path, '" -ab 160k -ac 2 -ar 44100 -vn audio.wav'))
    print(command)
    subprocess.call(command, shell=True)
extractAudio()
Because you are passing a Unicode string to subprocess.call, Python tries to encode it to an encoding it thinks the filesystem/OS will understand. For some reason it chooses ASCII, which is wrong.
You can try using the correct encoding via
subprocess.call(command.encode(sys.getfilesystemencoding()))
You have two problems:
Python 2
The subprocess module breaks when using Unicode in any of the arguments. This issue is fixed in Python 3, you can pass any Unicode file names and arguments to subprocess, and it will properly forward these to the child process.
ffmpeg
ffmpeg itself cannot open these files, something that you can easily verify by just trying to run it from the command line:
C:\temp>fancy αβγ.m4v
... lots of other output
fancy a�?.m4v: Invalid data found when processing input
(my code page is windows-1252, note how the Greek α got replaced with a Latin a)
You cannot fix this problem, but you can work around it, see bobince's answer.
Same as in your previous question: most cross-platform software on Windows can't handle non-ASCII characters in filenames.
Python's subprocess module uses interfaces based on byte strings. Under Windows, the command line is based on Unicode strings (technically UTF-16 code units), so the MS C runtime converts the byte strings into Unicode strings using an encoding (the ‘ANSI’ code page) that varies from machine to machine and which can never include all Unicode characters.
If your Windows installation were a Korean one, your ANSI code page would be 949 Korean and you would be able to write the command by saying one of:
subprocess.call(command.encode('cp949'))
subprocess.call(command.encode('mbcs'))
(where mbcs is short for ‘multi-byte character set’, which is a synonym for the ANSI code page on Windows.) If your installation isn't Korean you'll have a different ANSI code page and you will be unable to write that filename into a command as your command line encoding won't have any Hangul in it. The ANSI encoding is never anything sensible like UTF-8 so no-one can reliably use subprocess to execute commands with all Unicode characters in.
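You can see the effect of different ANSI code pages directly in Python; this sketch is runnable on any platform, since the cp949 codec ships with Python. It shows that Hangul round-trips through cp949 but cannot be encoded in cp1252 at all:

```python
text = u'태연'

# cp949 (the Korean ANSI code page) can represent Hangul...
encoded = text.encode('cp949')
assert encoded.decode('cp949') == text

# ...but cp1252 (the Western ANSI code page) cannot.
try:
    text.encode('cp1252')
    representable = True
except UnicodeEncodeError:
    representable = False

print(representable)  # False
```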
As discussed in the previous question, Python includes workarounds for Unicode filenames to use the native Win32 API instead of the C standard library. In Python 3 it also uses Win32 Unicode APIs for creating processes, but this is not the case back in Python 2. You could potentially hack something up yourself by calling the Win32 CreateProcessW command through ctypes, which gives you direct access to Windows APIs, but it's a bit of a pain.
...and it would be of no use anyway, because even if you did get non-ANSI characters into the command line, the ffmpeg command would itself fail. This is because ffmpeg is also a cross-platform application that uses the C standard libraries to read command lines and files. It would fail to read the Korean in the command line argument, and even if you got it through somehow it would fail to read the file of that name!
This is a source of ongoing frustration on the Windows platform: although it supports Unicode very well internally, most tools that run over the top of it can't. The answer should have been for Windows to support UTF-8 in all the byte-string interfaces it implements, instead of the sad old legacy ANSI code pages that no-one wants. Unfortunately Microsoft have repeatedly declined to take even the first steps towards making UTF-8 a first-class citizen on Windows (namely fixing some of the bugs that stop UTF-8 working in the console). Sorry.
Unrelated: this:
''.join(('ffmpeg -i "',path,'"...
is generally a bad idea. There are a number of special characters in filenames that would break that command line and possibly end up executing all kinds of other commands. If the input paths were controlled by someone untrusted that would be a severe security hole. In general when you put a command line together from variables you need to apply escaping to make the string safe for inclusion, and the escaping rules on Windows are complex and annoying.
You can avoid both the escaping problem and the Unicode problem by keeping everything inside Python. Instead of launching a command to invoke the ffmpeg code, you could use a module that brings the functionality of ffmpeg into Python, such as PyFFmpeg.
Or a cheap 'n' cheerful 'n' crappy workaround would be to copy/move the file to a known-safe name in Python, run the ffmpeg command using the static filename, and then rename/copy the file back...
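That last workaround can be sketched as follows. The `process` callback is a hypothetical stand-in for the actual ffmpeg invocation (running ffmpeg itself isn't shown), and the safe temporary name is an assumption:

```python
import os
import shutil
import tempfile

def run_with_safe_name(path, process, safe_name='input_tmp.m4v'):
    """Copy a possibly non-ASCII-named file to an ASCII-safe name,
    run `process` on it (e.g. the ffmpeg command), then clean up."""
    safe = os.path.join(os.path.dirname(path) or '.', safe_name)
    shutil.copy(path, safe)
    try:
        return process(safe)   # `process` only ever sees the safe name
    finally:
        os.remove(safe)

# Demo with a dummy `process` standing in for the ffmpeg call:
src = os.path.join(tempfile.mkdtemp(), u'태연.mp4')
with open(src, 'wb') as f:
    f.write(b'fake video data')

result = run_with_safe_name(src, lambda p: open(p, 'rb').read())
print(result)  # b'fake video data'
```

In the question's setting, `process` would build and run the subprocess.call command line using the safe path instead of the original one.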
After a few days of dwelling over stackoverflow and python 2.7 doc, I have come to no conclusion about this.
Basically I'm running a python script on a windows server that must have as input a block of text. This block of text (unfortunately) has to be passed by a pipe. Something like:
PS > [something_that_outputs_text] | python .\my_script.py
So the problem is:
The server uses cp1252 encoding and I really cannot change it due to administrative regulations and whatnot. And when I pipe the text to my python script, when I read it, it comes already with ? whereas characters like \xe1 should be.
What I have done so far:
Tested with UTF-8. Yep, chcp 65001 and $OutputEncoding = [Console]::OutputEncoding "solve it", as in python gets the text perfectly and then I can decode it to unicode etc. But apparently they don't let me do it on the server /sadface.
A little script to test what the hell is happening:
import codecs
import sys
def main(argv=None):
    if argv is None:
        argv = sys.argv
    if len(argv) > 1:
        for arg in argv[1:]:
            print arg.decode('cp1252')
    sys.stdin = codecs.getreader('cp1252')(sys.stdin)
    text = sys.stdin.read().strip()
    print text
    return 0

if __name__ == "__main__":
    sys.exit(main())
Tried it with both the codecs wrapping and without it.
My input & output:
PS > echo "Blá" | python .\testinput.py blé
blé
Bl?
--> So there's no problem with the argument (blé) but the piped text (Blá) is no good :(
I even converted the text string to hex and, yes, it gets flooded with 3f (AKA mr ?), so it's not a problem with the print.
[Also: it's my first question here... feel free to ask any more info about what I did]
EDIT
I don't know if this is relevant or not, but when I do sys.stdin.encoding it yields None
Update: So... I have no problems with cmd. Checked sys.stdin.encoding while running the program on cmd and everything went fine. I think my head just exploded.
How about saving the data to a file and piping it to Python in a CMD session? Invoke PowerShell and Python from CMD, like so:
c:\>powershell -command "c:\genrateDataForPython.ps1 -output c:\data.txt"
c:\>type c:\data.txt | python .\myscript.py
Edit
Another idea: convert the data to Base64 format in PowerShell and decode it in Python. Base64 is simple in PowerShell, and I guess in Python it isn't hard either. Like so:
# Convert some accent chars to base64
$s = [Text.Encoding]::UTF8.GetBytes("éêèë")
[System.Convert]::ToBase64String($s)
# Output:
w6nDqsOow6s=
# Decode:
$d = [System.Convert]::FromBase64String("w6nDqsOow6s=")
[Text.Encoding]::UTF8.GetString($d)
# Output
éêèë
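On the Python side, decoding that Base64 payload back into text is indeed straightforward; this sketch uses the exact string produced by the PowerShell snippet above:

```python
import base64

# The Base64 string produced by the PowerShell example above.
payload = 'w6nDqsOow6s='

decoded = base64.b64decode(payload).decode('utf-8')
print(decoded)  # éêèë
```

Since Base64 is pure ASCII, the payload survives any pipeline encoding intact, which is exactly why this workaround sidesteps the CP1252/ASCII mangling.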