After a few days of poring over Stack Overflow and the Python 2.7 docs, I have come to no conclusion about this.
Basically I'm running a python script on a windows server that must have as input a block of text. This block of text (unfortunately) has to be passed by a pipe. Something like:
PS > [something_that_outputs_text] | python .\my_script.py
So the problem is:
The server uses cp1252 encoding and I really cannot change it due to administrative regulations and whatnot. And when I pipe the text to my Python script and read it, it already arrives with ? where characters like \xe1 should be.
What I have done so far:
Tested with UTF-8. Yep, chcp 65001 and $OutputEncoding = [Console]::OutputEncoding "solve it", as in python gets the text perfectly and then I can decode it to unicode etc. But apparently they don't let me do it on the server /sadface.
A little script to test what the hell is happening:
import codecs
import sys

def main(argv=None):
    if argv is None:
        argv = sys.argv
    if len(argv) > 1:
        for arg in argv[1:]:
            print arg.decode('cp1252')
    sys.stdin = codecs.getreader('cp1252')(sys.stdin)
    text = sys.stdin.read().strip()
    print text
    return 0

if __name__ == "__main__":
    sys.exit(main())
Tried it with both the codecs wrapping and without it.
My input & output:
PS > echo "Blá" | python .\testinput.py blé
blé
Bl?
--> So there's no problem with the argument (blé) but the piped text (Blá) is no good :(
I even converted the text string to hex and, yes, it gets flooded with 3f (AKA mr ?), so it's not a problem with the print.
[Also: it's my first question here... feel free to ask any more info about what I did]
EDIT
I don't know if this is relevant or not, but when I do sys.stdin.encoding it yields None
Update: So... I have no problems with cmd. Checked sys.stdin.encoding while running the program on cmd and everything went fine. I think my head just exploded.
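For illustration, the symptom can be reproduced without PowerShell. This is only a minimal sketch, assuming the pipe re-encodes the text as ASCII before Python sees it (Windows PowerShell's default $OutputEncoding is ASCII, which would be consistent with the 3f bytes mentioned above). The names simulate_pipe and to_hex are hypothetical helpers, not part of the original script.

```python
def simulate_pipe(text, pipe_encoding="ascii"):
    """Encode text the way a pipe using pipe_encoding would (assumption)."""
    # errors="replace" substitutes '?' (0x3f) for every unencodable character.
    return text.encode(pipe_encoding, errors="replace")


def to_hex(data):
    """Render bytes as space-separated hex pairs."""
    return " ".join("{:02x}".format(b) for b in bytearray(data))


sample = u"Bl\u00e1"  # the test string from the question
print(to_hex(simulate_pipe(sample, "ascii")))   # 42 6c 3f
print(to_hex(simulate_pipe(sample, "cp1252")))  # 42 6c e1
```

If the assumption holds, the character is already lost before the script runs, so no amount of decoding inside Python can bring it back.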
How about saving the data to a file and piping it to Python in a CMD session? Invoke PowerShell and Python from CMD, like so:
c:\>powershell -command "c:\genrateDataForPython.ps1 -output c:\data.txt"
c:\>type c:\data.txt | python .\myscript.py
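On the Python side, the piped bytes can be decoded explicitly instead of relying on sys.stdin.encoding (which the question reports as None). A minimal sketch, assuming the bytes really arrive as cp1252; read_piped_text and stdin_bytes are hypothetical names:

```python
import io
import sys


def read_piped_text(stream, encoding="cp1252"):
    """Read all bytes from a binary stream and decode them explicitly."""
    raw = stream.read()
    return raw.decode(encoding).strip()


def stdin_bytes():
    """Return binary stdin: sys.stdin.buffer on Python 3, sys.stdin on Python 2."""
    return getattr(sys.stdin, "buffer", sys.stdin)


# Stand-in for piped input so the sketch runs without a pipe.
demo_stream = io.BytesIO(b"Bl\xe1\r\n")
demo_text = read_piped_text(demo_stream)
```

In the real script the call would be read_piped_text(stdin_bytes()).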
Edit
Another idea: convert the data to Base64 in PowerShell and decode it in Python. Base64 is simple in PowerShell, and I guess it isn't hard in Python either. Like so:
# Convert some accent chars to base64
$s = [Text.Encoding]::UTF8.GetBytes("éêèë")
[System.Convert]::ToBase64String($s)
# Output:
w6nDqsOow6s=
# Decode:
$d = [System.Convert]::FromBase64String("w6nDqsOow6s=")
[Text.Encoding]::UTF8.GetString($d)
# Output
éêèë
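The Python half of that round trip might look like the sketch below. It assumes the PowerShell side encoded the text as UTF-8 before Base64, as in the snippet above; decode_base64_text is a hypothetical helper name.

```python
import base64


def decode_base64_text(encoded, encoding="utf-8"):
    """Decode a Base64 string into text using the given encoding."""
    return base64.b64decode(encoded).decode(encoding)


decoded = decode_base64_text("w6nDqsOow6s=")
# Compare against escaped code points so nothing non-ASCII is printed.
print(decoded == u"\u00e9\u00ea\u00e8\u00eb")  # True
```

Because Base64 output is plain ASCII, it survives a pipe that would otherwise replace accented characters.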
Related
I am trying to get the output of a Bash command and work with it in Python as bytes. After formatting, the output of the command looks something like this: \x48\x83\xec\x08\x48\x8b\x05\xdd
Python then receives it through sys.argv, but for it to be recognised as bytes I need to use .encode(). When I encode it I get b'\\x48\\x83\\xec\\x08\\x48\\x8b\\x05\\xdd'. From what I've read, \\ is how a single backslash is represented, but I need the bytes with a single backslash, not two.
I've tried different solutions such as encoding it and decoding it again with 'unicode_escape' as suggested here - Remove double back slashes to no avail.
Surely I am missing some knowledge here, any help would be really appreciated.
#!/bin/sh
echo Enter a executable name.
read varname
echo Enter a PID to search in memory.
read PID
byteString=$(objdump -d -j .text /bin/$varname | head -n100 | tail -n93 |
cut -c11-30 | sed 's/[a-z0-9]\{2\}/\\x&/g' | tr -d '[:space:]')
python3 /home/internship/Desktop/memory_analysis.py $PID $byteString
Above is the bash script with the command to get the bytes. And the following is how the bytes are received by Python.
#!/usr/bin/python3
import sys

if len(sys.argv) < 2:
    print("Please specify a PID")
    exit(1)
element = bytes(sys.argv[2].encode())
print(element)
output - b'\\xe8\\x5b\\xfd\\xff\\xff\\xe8\\x56\\xfd\\xff\\xff\\xe8\\x51\\xfd\\xff\\xff\\xe8\\x4c\\xfd\\xff\\xff\\xe8\\x47\\xfd\\xff\\xff'
When I hard-code it in a variable, it works just fine, like this:
element = b'\xe8\x5b\xfd\xff\xff\xe8\x56\xfd\xff\xff\xe8\x51\xfd\xff\xff\xe8\x4c\xfd\xff\xff'
However, I need some automation.
Thank you in advance!
If you can pass this data as binary to a Python script then you can deal with it like this:
import os
import sys

if __name__ == '__main__':
    bytes_arg = os.fsencode(sys.argv[1])
    print(bytes_arg)
~$ python script.py $'\x48\x83\xec\x08\x48\x8b\x05\xdd'
b'H\x83\xec\x08H\x8b\x05\xdd'
But if you pass it as a plain, unquoted string, the shell strips the backslashes and it ends up being x48x83xecx08x48x8bx05xdd.
import os
import sys

if __name__ == '__main__':
    cleaned = ''.join(sys.argv[1].split('x'))
    bytes_arg = bytes.fromhex(cleaned)
    print(bytes_arg)
~$ python script.py \x48\x83\xec\x08\x48\x8b\x05\xdd
b'H\x83\xec\x08H\x8b\x05\xdd'
Hope this is what you expected :
python -c "import sys;print(bytes.fromhex(sys.argv[1].replace(r'\x','')))" '\x48\x83\xec\x08\x48\x8b\x05\xdd'
# Output : b'H\x83\xec\x08H\x8b\x05\xdd'
Based on your update :
test.sh
#!/bin/sh
byteString='\xe8\x5b\xfd\xff\xff\xe8\x56\xfd\xff\xff\xe8\x51\xfd\xff\xff\xe8\x4c\xfd\xff\xff'
PID=999
python3 test.py $PID "$byteString"
test.py
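#!/usr/bin/python3
import re
import sys

# The script reads sys.argv[2], so it needs both arguments
if len(sys.argv) < 3:
    print("Please specify a PID and a byte string")
    sys.exit(1)
# Strip the literal "\x" prefixes, then parse the remaining hex digits
element = bytes.fromhex(sys.argv[2].replace(r'\x', ''))
print(element)
# output b'\xe8[\xfd\xff\xff\xe8V\xfd\xff\xff\xe8Q\xfd\xff\xff\xe8L\xfd\xff\xff'
# Re-render every byte as \xNN for display only
print("b'" + re.sub('(..)', r'\\x\1', element.hex()) + "'")
# output b'\xe8\x5b\xfd\xff\xff\xe8\x56\xfd\xff\xff\xe8\x51\xfd\xff\xff\xe8\x4c\xfd\xff\xff'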
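The replace-then-fromhex idea above can be wrapped in a small function that also rejects malformed input. This is a sketch for Python 3; parse_hex_escapes is a hypothetical name and the validation step is an addition, not part of the answer's script.

```python
import re

# Zero or more literal \xNN groups and nothing else.
_ESCAPED_BYTES = re.compile(r"(?:\\x[0-9a-fA-F]{2})*")


def parse_hex_escapes(text):
    r"""Convert a string such as r'\x48\x83' into the bytes b'H\x83'."""
    if not _ESCAPED_BYTES.fullmatch(text):
        raise ValueError("expected only \\xNN sequences, got %r" % text)
    # Drop the literal backslash-x prefixes and parse what is left as hex.
    return bytes.fromhex(text.replace("\\x", ""))


print(parse_hex_escapes(r"\x48\x83\xec\x08"))  # b'H\x83\xec\x08'
```

Validating first means a stray character produces a clear error instead of a confusing fromhex failure.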
You actually don't have to use the codecs module, I just used it in that original answer in an attempt to make things more visually accommodating. Your question is practically identical to the one you referenced. The codecs.encode() function and the str.encode() method can both use the raw_unicode_escape text encoding.
In fact, you can just do as follows:
sys.argv[2].encode('raw_unicode_escape')
Just remember that raw_unicode_escape neither escapes nor un-escapes backslashes when encoding or decoding.
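To make that caveat concrete, here is a small Python 3 sketch. The unicode_escape round trip shown second is a swapped-in technique for comparison, not something prescribed above; the variable names are illustrative.

```python
arg = r"\x48\x83\xec\x08"  # what sys.argv would hold: literal backslashes

# raw_unicode_escape leaves the backslashes alone, so this is still 16 bytes.
kept = arg.encode("raw_unicode_escape")

# Decoding with unicode_escape interprets each \xNN; latin-1 then maps
# code points 0-255 back to single bytes.
interpreted = arg.encode("ascii").decode("unicode_escape").encode("latin-1")

print(len(kept), len(interpreted))  # 16 4
print(interpreted)  # b'H\x83\xec\x08'
```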
All the current answers have given you what you wanted, but keep in mind that bytes objects are rendered differently when printed. Additionally, when you encode a string you don't need to call bytes(), since str.encode() already returns a bytes object.
>>> b'\x48\x83\xec\x08\x48\x8b\x05\xdd' == b'H\x83\xec\x08\x48\x8b\x05\xdd'
True
Currently, I'm scripting a small python application that executes a PowerShell script. My python script should process the returned string but unfortunately, I have some trouble with the encoding of special characters like 'ä', 'ö', 'Ü' and so on. How can I return a Unicode/UTF-8 string?
You can see a simple example below. The console output is b'\xc7\xcf\r\n'. I don't understand why it's not b'\xc3\xa4\r\n' because \xc3\xa4 should be the correct UTF8 Encoding for the character 'ä'.
try.py:
import subprocess
p = subprocess.check_output(["powershell.exe", r".\script.ps1"])
print(p)
script.ps1:
return 'ä'
I adapted my PowerShell script in a few ways but did not get the desired result.
Added "[Console]::OutputEncoding = [Text.UTF8Encoding]::UTF8". Result: b'\xc3\x83\xc2\xa4\r\n'
Returned return [System.Text.Encoding]::UTF8.GetBytes("ä"). Result: b'195\r\n131\r\n194\r\n164\r\n'
Who can help me get console output of 'ä' from the script above?
I used "pwsh" because I ran it on mac, you can use "powershell.exe" in your code
Try this:
import subprocess
p = subprocess.check_output(["pwsh", r".\sc.ps1"])
print(p.decode('utf-8'))
For more: You can read here.
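The byte strings reported in the question fit one possible reading, offered here as an assumption rather than a confirmed diagnosis: PowerShell read the UTF-8 script file as cp1252, so 'ä' had already become two characters before any output encoding was applied. The sketch below reverses that chain. The cp850 console code page and the function name undo_misread_utf8 are assumptions.

```python
def undo_misread_utf8(raw, output_encoding, misread_as="cp1252"):
    """Recover text whose UTF-8 source bytes were misread as misread_as."""
    # Step 1: undo the encoding the shell used when writing its output.
    seen_by_shell = raw.decode(output_encoding)
    # Step 2: undo the misreading of the original UTF-8 bytes.
    return seen_by_shell.encode(misread_as).decode("utf-8")


# First result in the question, assuming a cp850 console code page.
first = undo_misread_utf8(b"\xc7\xcf\r\n", "cp850")
# Second result, after the output encoding was switched to UTF-8.
second = undo_misread_utf8(b"\xc3\x83\xc2\xa4\r\n", "utf-8")
print(first == second == u"\u00e4\r\n")  # True
```

If this reading holds, the problem sits in how the script file is read on the PowerShell side, and decoding the output as UTF-8 in Python would not be enough on its own.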
I'm using the approach in this post to extract a double-quoted string from a string. If the input string comes from a terminal argument, it works fine. But if the input string comes from a txt file, as in the command below, it gives a NoneType error. I computed the hash of two strings with identical text content (one from the file and one from the terminal), and they turned out to be different. Does anyone know how to solve this? (In Python 3.x.)
That said, I have set the default encoding to "utf-8" in my code.
python filename.py < input.txt
If you run the command python, it may resolve to Python 2.x.
If you want Python 3.x, change the command to python3,
like this:
python3 filename.py < input.txt
Two things. First, if you want to ingest a txt file into a Python script, you need to specify it. Add these two lines:
import sys
text = str(sys.argv[1])
This means text will hold your file name, 'input.txt'.
Second, if your script only defines a function, it will not know what you want to do with it; you have to tell the script explicitly to execute the function through the main entry point:
import re
import sys

def doit(text):
    matches = re.findall(r'\"(.+?)\"', text)
    # matches is now ['String 1', 'String 2', 'String3']
    return ",".join(matches)

if __name__ == '__main__':
    text_file = str(sys.argv[1])
    text = open(text_file).read()
    print(doit(text))
Alternately, you can just execute line by line without wrapping the re in a function, since it is only one line.
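Since the command in the question redirects the file into stdin rather than passing its name, a script may want to accept both forms. A minimal sketch, assuming UTF-8 files; read_input is a hypothetical helper:

```python
import io
import sys


def read_input(argv, stdin):
    """Return text from the file named in argv[1], or from stdin if absent."""
    if len(argv) > 1:
        with io.open(argv[1], encoding="utf-8") as handle:
            return handle.read()
    return stdin.read()


# Stand-in for redirected input so the sketch runs on its own.
demo = read_input(["filename.py"], io.StringIO(u'say "hi"'))
print(demo)
```

In a real script the call would be read_input(sys.argv, sys.stdin), which covers both the redirected form and a file name given as an argument.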
I just figured it out: the bug doesn't come from my code. I had "smart quotes" enabled on my Mac, so every quote was turned into a special character instead of a plain one. Disabling this under the keyboard settings does the trick.
LOL what a "bug".
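If the input file cannot be fixed at the source, the curly quotes can be normalized before matching. This is a sketch only; the mapping covers just the two curly double quotes (U+201C and U+201D), and extract_quoted is a hypothetical name.

```python
import re

# Curly double quotes mapped to the plain ASCII quote.
SMART_QUOTES = {u"\u201c": u'"', u"\u201d": u'"'}


def extract_quoted(text):
    """Return the substrings enclosed in double quotes, straight or curly."""
    for curly, straight in SMART_QUOTES.items():
        text = text.replace(curly, straight)
    return re.findall(r'"(.+?)"', text)


sample = u'\u201cString 1\u201d, "String 2"'
print(re.findall(r'"(.+?)"', sample))  # ['String 2']
print(extract_quoted(sample))          # ['String 1', 'String 2']
```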
I am trying to create a file with the Unicode character U+662F in its name on Windows (via Perl or Python, either is fine for me). On Linux I am able to get the character 是, but on Windows I am getting this character 是, and somehow I am not able to get the file name as 是.
Python code -
import sys
name = unichr(0x662f)
print(name.encode('utf8').decode(sys.stdout.encoding))
perl code -
my $name .= chr(230).chr(152).chr(175); ##662f
print 'file name ::'. "$name"."txt";
File manipulation in Perl on Windows (Unicode characters in file name)
In Perl on Windows, I use Win32::Unicode, Win32::Unicode::File and Win32::Unicode::Dir. They work perfectly with Unicode characters in file names.
Just mind that Win32::Unicode::File::open() (and new()) have a reversed argument order compared to Perl's built-in open() - mode comes first.
You do not need to encode the characters manually - just insert them as they are (if your Perl script is in UTF-8), or using the \x{N} notation.
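Since the question allows Python as well, here is a minimal sketch of the Python side. It assumes Python 3, where str file names are handed to the operating system as Unicode. The function name and the temporary directory are illustrative only.

```python
import os
import tempfile


def create_unicode_file(directory, stem=u"\u662f"):
    """Create an empty file named stem plus '.txt' in directory; return its path."""
    path = os.path.join(directory, stem + u".txt")
    with open(path, "w"):
        pass
    return path


with tempfile.TemporaryDirectory() as workdir:
    create_unicode_file(workdir)
    # Compare against the escaped name so nothing non-ASCII is printed.
    print(u"\u662f.txt" in os.listdir(workdir))
```

As with the Perl modules, the character is written directly (here as an escape) and never encoded by hand.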
Printing out Unicode characters on Windows
Printing Unicode to the console on Windows is another problem. You can't use cmd.exe. Instead, use PowerShell ISE. The drawback of the ISE is that it's not a console: scripts can't take keyboard input through STDIN.
To get Unicode output, you need to set the output encoding to UTF-8 in every PowerShell ISE session that's started. I suggest doing so in the startup script.
Procedure to have PowerShell ISE default to Unicode output:
1) In order for any user PowerShell scripts to be allowed to run, you first need to do:
Set-ExecutionPolicy RemoteSigned
2) Edit or create your Documents\WindowsPowerShell\Microsoft.PowerShellISE_profile.ps1 to something like:
perl -w -e "print qq!Initializing the console with Perl...\n!;"
[System.Console]::OutputEncoding = [System.Text.Encoding]::UTF8;
The short Perl command is there as a trick to allow the System.Console property be modified. Without it, you get an error when setting the OutputEncoding.
If I recall correctly, you also have to change the font to Consolas.
Even when the Unicode characters print out fine, you may have trouble including them in command line arguments. In these cases I've found the \x{N} notation works. The Windows Character Map utility is your friend here.
(Edited heavily after I rediscovered the regular PowerShell's inability to display most Unicode characters, with references to PowerShell (non-ISE) removed. Now I remember why I started using the ISE...)
I just want to get UTF-8 working. I tried this:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
t = "одобрение за"
print t
But when I run this program from the command line, output looks like: одобрение за
I've searched up and down the net, tried the whole sys.setdefaultencoding thing, tried calling encode() and decode(), tried placing the little "u" in front, tried unicode(), etc.
I'm about ready to explode from frustration. Is there a definitive answer for what the heck you're supposed to do?
Your code works for me (tm)
In [1]: t = u"одобрение за"
In [2]: print t
одобрение за
Make sure your terminal supports UTF-8. One way is to check the LANG env-variable:
$ echo $LANG
en_US.UTF-8
also, try the locale command.
$LANG/locale just tells you what your system will use when writing to stdout/stderr.
Best way to test if terminal supports UTF-8 is probably to print something to it and see if it looks correct. Something like this:
echo -e '\xe2\x82\xac'
You should get a €-sign.
If not, try a different shell...
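The same kind of check can be made from inside Python by asking whether the encoding attached to stdout can represent the text at all. A minimal sketch; can_encode is a hypothetical helper, and a missing encoding is treated as ASCII by assumption.

```python
import sys


def can_encode(text, encoding):
    """Return True if text can be encoded with encoding without loss."""
    try:
        # Treat a missing encoding as ASCII (assumption).
        text.encode(encoding or "ascii")
    except (UnicodeEncodeError, LookupError):
        return False
    return True


stdout_encoding = getattr(sys.stdout, "encoding", None)
print("stdout encoding:", stdout_encoding)
print("can encode the euro sign:", can_encode(u"\u20ac", stdout_encoding))
```

Passing this check only shows that the bytes can be produced; whether the terminal font actually draws the character still has to be verified by eye.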
Since you are using Windows cmd.exe, you have to follow these steps:
Make sure your console is using the Lucida Console font family (other fonts cannot display UTF-8 properly).
Type chcp 65001 (that's change codepage) and hit enter.
Run your command.
For subsequent runs (once you close the cmd.exe window), you'll have to change the codepage again. The font should be permanent.