Python3 print() vs Python2 print

While working on a buffer overflow exploit I found something really strange. I have successfully found that I need to provide 32 characters before the proper address I want to jump to and that the proper address is 0x08048a37. When I executed
python -c "print '-'*32+'\x37\x8a\x04\x08'" | ./MyExecutable
the exploit resulted in a success. But, when I tried:
python3 -c "print('-'*32+'\x37\x8a\x04\x08')" | ./MyExecutable
it didn't. The executable simply resulted in a Segmentation Fault without jumping to the desired address. In fact, executing
python -c "print '-'*32+'\x37\x8a\x04\x08'"
and
python3 -c "print('-'*32+'\x37\x8a\x04\x08')"
results in two different outputs on the console. The characters are, of course, not readable, but they're visually different.
I wonder why this is happening?

The Python 2 code writes bytes, the Python 3 code writes text that is then encoded to bytes. The latter will thus not write the same output; it depends on the codec configured for your pipe.
In Python 3, write bytes to the sys.stdout.buffer object instead:
python3 -c "import sys; sys.stdout.buffer.write(b'-'*32+b'\x37\x8a\x04\x08')"
You may want to manually add the \n newline that print would add.
sys.stdout is an io.TextIOBase object; it encodes data written to it with a given codec (usually based on your locale, but often defaulting to ASCII when the output is a pipe) before passing it on to the underlying buffer object. The TextIOBase.buffer attribute gives you direct access to the underlying BufferedIOBase object.
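To see the difference concretely, here is a minimal sketch (Python 3; illustrative only, and assuming a UTF-8 stdout codec) comparing what print() would emit with the raw bytes the exploit needs:
payload_text = '-' * 32 + '\x37\x8a\x04\x08'     # text: '\x8a' is the character U+008A
payload_bytes = b'-' * 32 + b'\x37\x8a\x04\x08'  # bytes: exactly what the stack expects
# print() encodes the text via the stdout codec; under UTF-8 the
# character U+008A becomes the two bytes 0xC2 0x8A, corrupting the address.
print(payload_text.encode('utf-8')[-5:])  # b'7\xc2\x8a\x04\x08'
print(payload_bytes[-4:])                 # b'7\x8a\x04\x08'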

Related

Python standard IO under Windows PowerShell and CMD

I have the following two-line Python (v. 3.10.7) program "stdin.py":
import sys
print(sys.stdin.read())
and the following one-line text file "ansi.txt" (CP1252 encoding) containing:
‘I am well’ he said.
Note that the open and close quotes are 0x91 and 0x92, respectively. In Windows 10 cmd the behavior of the Python code is as expected:
python stdin.py < ansi.txt # --> ‘I am well’ he said.
On the other hand in Windows Powershell:
cat .\ansi.txt | python .\stdin.py # --> ?I am well? he said.
Apparently the CP1252 characters are treated as non-printable characters in the Python/PowerShell combination. If I replace the standard input in "stdin.py" with file input, Python correctly prints the CP1252 quote characters to the screen. PowerShell by itself recognizes and prints 0x91 and 0x92 correctly.
Questions: can somebody explain to me why cmd works differently than PowerShell in combination with Python? Why doesn't Python recognize the CP1252 quote characters 0x91 and 0x92 when they are piped into it by PowerShell?
tl;dr
Use the $OutputEncoding preference variable:
In Windows PowerShell:
# Using the system's legacy ANSI code page, as Python does by default.
# NOTE: The & { ... } enclosure isn't strictly necessary, but
# ensures that the $OutputEncoding change is only temporary,
# by limiting it to the child scope that the enclosure creates.
& {
$OutputEncoding = [System.Text.Encoding]::Default
"‘I am well’ he said." | python -c 'import sys; print(sys.stdin.read())'
}
# Using UTF-8 instead, which is generally preferable.
# Note the `-X utf8` option (Python 3.7+)
& {
$OutputEncoding = [System.Text.UTF8Encoding]::new()
"‘I am well’ he said." | python -X utf8 -c 'import sys; print(sys.stdin.read())'
}
In PowerShell (Core) 7+:
# Using the system's legacy ANSI code page, as Python does by default.
# Note: In PowerShell (Core) / .NET 5+,
# [System.Text.Encoding]::Default now reports UTF-8,
# not the active ANSI encoding.
& {
$OutputEncoding = [System.Text.Encoding]::GetEncoding([cultureinfo]::CurrentCulture.TextInfo.ANSICodePage)
"‘I am well’ he said." | python -c 'import sys; print(sys.stdin.read())'
}
# Using UTF-8 instead, which is generally preferable.
# Note the `-X utf8` option (Python 3.7+)
# No need to set $OutputEncoding, as it now *defaults* to UTF-8
"‘I am well’ he said." | python -X utf8 -c 'import sys; print(sys.stdin.read())'
Note:
$OutputEncoding controls what encoding is used to send data TO external programs via the pipeline (to stdin). It defaults to ASCII(!) in Windows PowerShell, and UTF-8 in PowerShell (Core).
[Console]::OutputEncoding controls how data received FROM external programs (via stdout) is decoded. It defaults to the console's active code page, which in turn defaults to the system's legacy OEM code page, such as 437 on US-English systems.
That these two encodings are not aligned by default is unfortunate; while Windows PowerShell will see no more changes, there is hope for PowerShell (Core): it would make sense to have it default consistently to UTF-8:
GitHub issue #7233 suggests at least defaulting the shortcut files that launch PowerShell to UTF-8 (code page 65001); GitHub issue #14945 more generally discusses the problematic mismatch.
In Windows 10 and above, there is an option to switch to UTF-8 system-wide, which then makes both the OEM and ANSI code pages default to UTF-8 (65001); however, this has far-reaching consequences and is still labeled as being in beta as of Windows 11 - see this answer.
Background information:
It is the $OutputEncoding preference variable that determines what character encoding PowerShell uses to send data (invariably text, as of PowerShell 7.3) to an external program via the pipeline.
Note that this applies even when data is read from a file: PowerShell, as of v7.3, never sends raw bytes through the pipeline; it reads the content into .NET strings first and then re-encodes them based on $OutputEncoding when sending them through the pipeline to an external program.
Therefore, what encoding your ansi.txt input file uses is ultimately irrelevant, as long as PowerShell decodes it correctly when reading it into .NET strings (which are internally composed of UTF-16 code units).
See this answer for more information.
Thus, the character encoding stored in $OutputEncoding must match the encoding that the target program expects.
By default, $OutputEncoding is unrelated to the encoding implied by the console's active code page (which itself defaults to the system's legacy OEM code page, such as 437 on US-English systems). At least legacy console applications tend to use that code page; Python, however, uses the legacy ANSI code page instead, while other modern CLIs, notably NodeJS' node.exe, always use UTF-8.
While $OutputEncoding's default in PowerShell (Core) 7+ is now UTF-8, Windows PowerShell's default is, regrettably, ASCII(!), which means that non-ASCII characters get "lossily" transliterated to verbatim ASCII ? characters, which is what you saw.
Therefore, you must (temporarily) set $OutputEncoding to the encoding that Python expects, and/or ask Python to use UTF-8 instead.
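If in doubt about which side is mismatched, a quick diagnostic sketch (Python 3; for inspection only) shows which codec Python picked for the piped stdin, which is what $OutputEncoding has to match:
import sys
print(sys.stdin.encoding)  # e.g. 'cp1252' by default on Windows, 'utf-8' under -X utf8
print(sys.stdin.read())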

Printing a Unicode character in Windows

I am trying to create a file with the Unicode character 662f on Windows (via Perl or Python, anything is fine for me). On Linux I am able to get chr 是, but on Windows I am getting the characters 是 instead, and somehow I am not able to get that file name as 是.
Python code -
import sys
name = unichr(0x662f)
print(name.encode('utf8').decode(sys.stdout.encoding))
perl code -
my $name .= chr(230).chr(152).chr(175); ##662f
print 'file name ::'. "$name"."txt";
File manipulation in Perl on Windows (Unicode characters in file name)
In Perl on Windows, I use Win32::Unicode, Win32::Unicode::File and Win32::Unicode::Dir. They work perfectly with Unicode characters in file names.
Just mind that Win32::Unicode::File::open() (and new()) have a reversed argument order compared to Perl's built-in open(): the mode comes first.
You do not need to encode the characters manually - just insert them as they are (if your Perl script is in UTF-8), or use the \x{N} notation.
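For comparison, a minimal Python 3 sketch (my addition, not part of the Perl answer): Python 3 calls the Win32 Unicode file APIs directly, so the name needs no manual encoding either.
# Create 是.txt on Windows; Python 3 passes the name through the
# Unicode file APIs, so no byte-level encoding of the name is involved.
name = chr(0x662F)  # '是'
with open(name + '.txt', 'w', encoding='utf-8') as f:
    f.write('hello')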
Printing out Unicode characters on Windows
Printing Unicode to the console on Windows is another problem. You can't use cmd.exe; instead, use PowerShell ISE. The drawback of the ISE is that it's not a console: scripts can't take input from the keyboard through STDIN.
To get Unicode output, you need to set the output encoding to UTF-8 in every PowerShell ISE session that's started. I suggest doing so in the startup script.
Procedure to have PowerShell ISE default to Unicode output:
1) In order for any user PowerShell scripts to be allowed to run, you first need to do:
Set-ExecutionPolicy RemoteSigned
2) Edit or create your Documents\WindowsPowerShell\Microsoft.PowerShellISE_profile.ps1 to something like:
perl -w -e "print qq!Initializing the console with Perl...\n!;"
[System.Console]::OutputEncoding = [System.Text.Encoding]::UTF8;
The short Perl command is there as a trick to allow the System.Console property to be modified. Without it, you get an error when setting OutputEncoding.
If I recall correctly, you also have to change the font to Consolas.
Even when the Unicode characters print out fine, you may have trouble including them in command line arguments. In these cases I've found the \x{N} notation works. The Windows Character Map utility is your friend here.
(Edited heavily after I rediscovered the regular PowerShell's inability to display most Unicode characters, with references to PowerShell (non-ISE) removed. Now I remember why I started using the ISE...)

UnicodeEncodeError for UTF-8 file names returned from a tk dialog [duplicate]

This question already has answers here:
Unicode filenames on Windows with Python & subprocess.Popen()
Hi I'm trying to extract audio from a video file using ffmpeg with the following function (in Python 2):
import subprocess

def extractAudio(path):
    command = ''.join(('ffmpeg -i "', path, '" -ab 160k -ac 2 -ar 44100 -vn audio.wav'))
    print(command)
    subprocess.call(command, shell=True)
the above print statement successfully prints the following:
ffmpeg -i "C:/Users/pruthvi/Desktop/vidrec/temp\TAEYEON 태연_ I (feat. Verbal Jint)_Music Video.mp4" -ab 160k -ac 2 -ar 44100 -vn audio.wav
but in the next statement, it fails and throws the following error:
Traceback (most recent call last):
File "C:/Users/pruthvi/Desktop/vidrec/vidrec.py", line 53, in <module>
main()
File "C:/Users/pruthvi/Desktop/vidrec/vidrec.py", line 46, in main
extractAudio(os.path.join(di,each))
File "C:/Users/pruthvi/Desktop/vidrec/vidrec.py", line 28, in extractAudio
subprocess.call(command,shell=True)
File "C:\Python27\lib\subprocess.py", line 522, in call
return Popen(*popenargs, **kwargs).wait()
File "C:\Python27\lib\subprocess.py", line 710, in __init__
errread, errwrite)
File "C:\Python27\lib\subprocess.py", line 928, in _execute_child
args = '{} /c "{}"'.format (comspec, args)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 56-57: ordinal not in range(128)
I've tried all the possible solutions from previous questions, such as encoding with the correct type and setting PYTHONIOENCODING, but none seems to work. If I convert the command to ASCII it no longer functions, because that strips the non-ASCII characters: the file is not found and the audio is not extracted from the target file. Any help is appreciated, thanks :)
To experiment, you can use the following code:
# -*- coding: utf-8 -*-
import subprocess
def extractAudio():
    path = u'C:/Users/pruthvi/Desktop/vidrec/temp\TAEYEON 태연_ I (feat. Verbal Jint)_Music Video.mp4'
    command = ''.join(('ffmpeg -i "', path, '" -ab 160k -ac 2 -ar 44100 -vn audio.wav'))
    print(command)
    subprocess.call(command, shell=True)

extractAudio()
Because you are passing a unicode string to subprocess.call, Python tries to encode it to an encoding it thinks the filesystem/OS will understand. For some reason it chooses ASCII, which is wrong.
You can try using the correct encoding via
subprocess.call(command.encode(sys.getfilesystemencoding()))
You have two problems:
Python 2
The subprocess module breaks when using Unicode in any of the arguments. This issue is fixed in Python 3: you can pass any Unicode file names and arguments to subprocess, and it will properly forward them to the child process.
ffmpeg
ffmpeg itself cannot open these files, something that you can easily verify by just trying to run it from the command line:
C:\temp>fancy αβγ.m4v
... lots of other output
fancy a�?.m4v: Invalid data found when processing input
(my code page is windows-1252, note how the Greek α got replaced with a Latin a)
You cannot fix this problem, but you can work around it, see bobince's answer.
Same as in your previous question: most cross-platform software on Windows can't handle non-ASCII characters in filenames.
Python's subprocess module uses interfaces based on byte strings. Under Windows, the command line is based on Unicode strings (technically UTF-16 code units), so the MS C runtime converts the byte strings into Unicode strings using an encoding (the ‘ANSI’ code page) that varies from machine to machine and which can never include all Unicode characters.
If your Windows installation were a Korean one, your ANSI code page would be 949 Korean and you would be able to write the command by saying one of:
subprocess.call(command.encode('cp949'))
subprocess.call(command.encode('mbcs'))
(where mbcs is short for ‘multi-byte character set’, which is a synonym for the ANSI code page on Windows.) If your installation isn't Korean, you'll have a different ANSI code page, and you will be unable to write that filename into a command, as your command-line encoding won't have any Hangul in it. The ANSI encoding is never anything sensible like UTF-8, so no one can reliably use subprocess to execute commands containing arbitrary Unicode characters.
As discussed in the previous question, Python includes workarounds that use the native Win32 APIs instead of the C standard library for Unicode filenames. Python 3 also uses the Win32 Unicode APIs for creating processes, but this is not the case in Python 2. You could potentially hack something up yourself by calling the Win32 CreateProcessW function through ctypes, which gives you direct access to the Windows APIs, but it's a bit of a pain.
...and it would be of no use anyway, because even if you did get non-ANSI characters into the command line, the ffmpeg command would itself fail. This is because ffmpeg is also a cross-platform application that uses the C standard libraries to read command lines and files. It would fail to read the Korean in the command line argument, and even if you got it through somehow it would fail to read the file of that name!
This is a source of ongoing frustration on the Windows platform: although it supports Unicode very well internally, most tools that run over the top of it can't. The answer should have been for Windows to support UTF-8 in all the byte-string interfaces it implements, instead of the sad old legacy ANSI code pages that no-one wants. Unfortunately Microsoft have repeatedly declined to take even the first steps towards making UTF-8 a first-class citizen on Windows (namely fixing some of the bugs that stop UTF-8 working in the console). Sorry.
Unrelated: this:
''.join(('ffmpeg -i "',path,'"...
is generally a bad idea. There are a number of special characters in filenames that would break that command line and possibly end up executing all kinds of other commands. If the input paths were controlled by someone untrusted that would be a severe security hole. In general when you put a command line together from variables you need to apply escaping to make the string safe for inclusion, and the escaping rules on Windows are complex and annoying.
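For the escaping part alone, here is a sketch of the list-argument form (assuming Python 3, where subprocess also forwards Unicode arguments correctly); with a list, no shell parsing or quoting happens at all:
import subprocess

def extract_audio(path):
    # Each argument is passed to ffmpeg verbatim; no shell is involved,
    # so special characters in path cannot break or inject a command.
    subprocess.call([
        'ffmpeg', '-i', path,
        '-ab', '160k', '-ac', '2', '-ar', '44100',
        '-vn', 'audio.wav',
    ])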
You can avoid both the escaping problem and the Unicode problem by keeping everything inside Python. Instead of launching a command to invoke the ffmpeg code, you could use a module that brings the functionality of ffmpeg into Python, such as PyFFmpeg.
Or a cheap 'n' cheerful 'n' crappy workaround would be to copy/move the file to a known-safe name in Python, run the ffmpeg command using the static filename, and then rename/copy the file back...

Python pipe cp1252 string from PowerShell to a python (2.7) script

After a few days of trawling through Stack Overflow and the Python 2.7 docs, I have come to no conclusion about this.
Basically I'm running a python script on a windows server that must have as input a block of text. This block of text (unfortunately) has to be passed by a pipe. Something like:
PS > [something_that_outputs_text] | python .\my_script.py
So the problem is:
The server uses cp1252 encoding and I really cannot change it due to administrative regulations and whatnot. When I pipe the text to my Python script and read it, it arrives with ? characters where characters like \xe1 should be.
What I have done so far:
Tested with UTF-8. Yep, chcp 65001 and $OutputEncoding = [Console]::OutputEncoding "solve it", in that Python gets the text perfectly and I can then decode it to unicode, etc. But apparently they don't let me do that on the server /sadface.
A little script to test what the hell is happening:
import codecs
import sys

def main(argv=None):
    if argv is None:
        argv = sys.argv
    if len(argv) > 1:
        for arg in argv[1:]:
            print arg.decode('cp1252')
    sys.stdin = codecs.getreader('cp1252')(sys.stdin)
    text = sys.stdin.read().strip()
    print text
    return 0

if __name__ == "__main__":
    sys.exit(main())
I tried it both with the codecs wrapping and without it.
My input & output:
PS > echo "Blá" | python .\testinput.py blé
blé
Bl?
--> So there's no problem with the argument (blé) but the piped text (Blá) is no good :(
I even converted the text string to hex and, yes, it gets flooded with 3f (AKA mr ?), so it's not a problem with the print.
[Also: it's my first question here... feel free to ask any more info about what I did]
EDIT
I don't know if this is relevant or not, but when I do sys.stdin.encoding it yields None
Update: So... I have no problems with cmd. Checked sys.stdin.encoding while running the program on cmd and everything went fine. I think my head just exploded.
How about saving the data to a file and piping it to Python in a CMD session? Invoke PowerShell and Python from CMD, like so:
c:\>powershell -command "c:\genrateDataForPython.ps1 -output c:\data.txt"
c:\>type c:\data.txt | python .\myscript.py
Edit
Another idea: convert the data into base64 format in PowerShell and decode it in Python. Base64 is simple in PowerShell, and I guess it isn't hard in Python either. Like so,
# Convert some accent chars to base64
$s = [Text.Encoding]::UTF8.GetBytes("éêèë")
[System.Convert]::ToBase64String($s)
# Output:
w6nDqsOow6s=
# Decode:
$d = [System.Convert]::FromBase64String("w6nDqsOow6s=")
[Text.Encoding]::UTF8.GetString($d)
# Output
éêèë
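On the Python side, a minimal sketch of the receiving end (Python 2.7; assuming the base64 text arrives on stdin — base64 is pure ASCII, so it survives any $OutputEncoding):
import base64
import sys

raw = base64.b64decode(sys.stdin.read().strip())  # the original UTF-8 bytes
text = raw.decode('utf-8')                        # u'\xe9\xea\xe8\xeb' == u'éêèë'
print text.encode('cp1252')                       # re-encode for a cp1252 console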

Writing unicode strings via sys.stdout in Python

Assume for a moment that one cannot use print (and thus enjoy the benefit of automatic encoding detection). So that leaves us with sys.stdout. However, sys.stdout is so dumb as to not do any sensible encoding.
Now one reads the Python wiki page PrintFails and goes to try out the following code:
$ python -c 'import sys, codecs, locale; print str(sys.stdout.encoding); \
sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout);
However, this too does not work (at least on the Mac). To see why:
>>> import locale
>>> locale.getpreferredencoding()
'mac-roman'
>>> sys.stdout.encoding
'UTF-8'
(UTF-8 is what one's terminal understands).
So one changes the above code to:
$ python -c 'import sys, codecs, locale; print str(sys.stdout.encoding); \
sys.stdout = codecs.getwriter(sys.stdout.encoding)(sys.stdout);
And now unicode strings are properly sent to sys.stdout and hence printed properly on the terminal (sys.stdout is attached to the terminal).
Is this the correct way to write unicode strings in sys.stdout or should I be doing something else?
EDIT: at times (say, when piping the output to less) sys.stdout.encoding will be None; in this case, the above code will fail.
export PYTHONIOENCODING=utf-8
will do the job, but it can't be set from within Python itself ...
What we can do is check whether it is set, and tell the user to set it before calling the script:
import sys

if __name__ == '__main__':
    if sys.stdout.encoding is None:
        print >> sys.stderr, "please set PYTHONIOENCODING=UTF-8 (for example: export PYTHONIOENCODING=UTF-8) when writing to stdout."
        exit(1)
The best idea is to check whether you are directly connected to a terminal. If you are, use the terminal's encoding; otherwise, use the system's preferred encoding.
import locale
import sys

if sys.stdout.isatty():
    default_encoding = sys.stdout.encoding
else:
    default_encoding = locale.getpreferredencoding()
It's also very important to always allow the user to specify whichever encoding she wants. Usually I make it a command-line option (like -e ENCODING) and parse it with the optparse module.
Another good thing is to not overwrite sys.stdout with an automatic encoder. Create your encoder and use it, but leave sys.stdout alone. You could import 3rd party libraries that write encoded bytestrings directly to sys.stdout.
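For example, a sketch of that pattern (Python 2, reusing the default_encoding variable from the snippet above): create your own encoder around sys.stdout and write through it, leaving sys.stdout itself untouched.
import codecs
import sys

# codecs.getwriter returns a StreamWriter class for the codec;
# instantiating it wraps the byte stream with an encoding writer.
out = codecs.getwriter(default_encoding)(sys.stdout)
out.write(u'written via the explicit encoder, not via sys.stdout itself\n')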
There is an optional environment variable "PYTHONIOENCODING" which may be set to a desired default encoding. It would be one way of grabbing the user-desired encoding in a way consistent with all of Python. It is buried in the Python manual here.
This is what I am doing in my application:
sys.stdout.write(s.encode('utf-8'))
And this is the exact opposite, the fix for reading UTF-8 names from argv:
for file in sys.argv[1:]:
    file = file.decode('utf-8')
This is very ugly (IMHO) as it forces you to work with UTF-8, which is the norm on Linux/Mac but not on Windows... It works for me anyway :)
It's not clear to me why you wouldn't be able to use print; but assuming so, yes, the approach looks right to me.
