Removing non-ascii characters on utf-16 (Python)

I have some code I'm using to decrypt a string. The string is originally encrypted by .NET code, and I'm able to make it all work fine. However, the string coming into Python has some extra characters in it, and it has to be decoded as UTF-16.
Here is the code for the decryption portion. The original string that I encrypted was "test2", which is what the text variable in my code below contains.
import Crypto.Cipher.AES
import base64, sys
password = base64.b64decode('PSCIQGfoZidjEuWtJAdn1JGYzKDonk9YblI0uv96O8s=')
salt = base64.b64decode('ehjtnMiGhNhoxRuUzfBOXw==')
aes = Crypto.Cipher.AES.new(password, Crypto.Cipher.AES.MODE_CBC, salt)
text = base64.b64decode('TzQaUOYQYM/Nq9f/pY6yaw==')
print(aes.decrypt(text).decode('utf-16'))
text1 = aes.decrypt(text).decode('utf-16')
print(text1)
My issue is that when I decrypt and print the result, it is "test2ЄЄ" instead of the expected "test2".
If I save the same decrypted value into a variable, it gets decoded incorrectly as "틊첃陋ភ滑毾穬ヸ".
My goal is to find a way to:
strip the non-ASCII characters from the end of the test2 value
store the result in a variable holding the correct string/text value
Any help or suggestions appreciated. Thanks.

In Python 2, you can use str.decode, like this:
string.decode('ascii', 'ignore')
The codec is ascii, and ignore specifies that anything that cannot be converted is dropped.
In Python 3, str objects are already Unicode text and have no decode method, so you need to encode to bytes first and then decode:
string.encode('ascii', 'ignore').decode()
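A minimal Python 3 sketch of this approach on simulated data. The decrypted bytes below are an assumption: they imitate "test2" in UTF-16-LE followed by PKCS#7 padding bytes, which is a common .NET default but is not confirmed by the question. The second variant, removing the padding bytes before decoding, is an alternative that keeps legitimate non-ASCII text intact.

```python
# Simulated decrypted block: "test2" as UTF-16-LE plus six 0x06 padding bytes
# (assumption: PKCS#7 padding; the real decrypted bytes may differ).
decrypted = "test2".encode("utf-16-le") + b"\x06" * 6

text = decrypted.decode("utf-16-le")   # "test2" followed by three U+0606 characters

# Approach from the answer: drop everything that is not ASCII.
ascii_only = text.encode("ascii", "ignore").decode("ascii")

# Alternative: cut off the padding bytes first (no validation of the padding here).
pad_len = decrypted[-1]
unpadded = decrypted[:-pad_len].decode("utf-16-le")

print(ascii_only)
print(unpadded)
```

Note that the ascii/ignore variant would also delete any genuine non-ASCII characters from the message, so it only fits data known to be plain ASCII.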

Related

Read in special characters for pandas dataframe [duplicate]

I have a JSON file which contains strings encoded like this:
"sender_name": "Horn\u00c3\u00adkov\u00c3\u00a1",
I am trying to parse this file using the json module. However, I am not able to decode this string correctly.
What I get after decoding the JSON using the .load() method is 'HornÃ\xadkovÃ¡'. The string should be decoded as 'Horníková' instead.
I read the JSON specification and I understand that \u should be followed by 4 hexadecimal digits specifying the Unicode code point of the character. But it seems that in this JSON file UTF-8 encoded bytes are stored as \u-sequences.
What type of encoding is this and how do I correctly parse it in Python 3?
Is this type of JSON file even valid JSON according to the specification?

Your text is already UTF-8 encoded, and normally you would tell Python this by using a b prefix on the string literal. But since you're using json and the input has to be a string, you have to undo the encoding manually. Because your input is a str rather than bytes, you can use the 'raw_unicode_escape' codec to convert the string to bytes without re-encoding it, and pass the same codec to open so that it does not apply its own default encoding. Decoding those bytes (as UTF-8, the default) then gives the desired result.
Note that since the encoding and decoding have to be done on the file's content as a string, you should read the file yourself and use json.loads() instead of json.load().
In [168]: with open('test.json', encoding='raw_unicode_escape') as f:
     ...:     d = json.loads(f.read().encode('raw_unicode_escape').decode())
     ...:
In [169]: d
Out[169]: {'sender_name': 'Horníková'}

The JSON you are reading was written incorrectly and the Unicode strings decoded from it will have to be re-encoded with the wrong encoding used, then decoded with the correct encoding.
Here's an example:
#!python3
import json
# The bad JSON you have
bad_json = r'{"sender_name": "Horn\u00c3\u00adkov\u00c3\u00a1"}'
print('bad_json =',bad_json)
# The wanted result from json.loads()
wanted = {'sender_name':'Horníková'}
# What correctly written JSON should look like
good_json = json.dumps(wanted)
print('good_json =',good_json)
# What you get when loading the bad JSON.
got = json.loads(bad_json)
print('wanted =',wanted)
print('got =',got)
# How to correct the mojibake string
corrected_sender = got['sender_name'].encode('latin1').decode('utf8')
print('corrected_sender =',corrected_sender)
Output:
bad_json = {"sender_name": "Horn\u00c3\u00adkov\u00c3\u00a1"}
good_json = {"sender_name": "Horn\u00edkov\u00e1"}
wanted = {'sender_name': 'Horníková'}
got = {'sender_name': 'HornÃ\xadkovÃ¡'}
corrected_sender = Horníková
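The same latin-1/utf-8 round trip can be applied to every string in a parsed document rather than to one field at a time. A minimal Python 3 sketch follows; fix_mojibake is a hypothetical helper name, and it assumes every damaged string in the file was mis-written the same way. Strings that do not survive the round trip are left unchanged.

```python
import json

def fix_mojibake(value):
    """Recursively repair strings whose UTF-8 bytes were stored as code points."""
    if isinstance(value, str):
        try:
            return value.encode('latin-1').decode('utf-8')
        except (UnicodeEncodeError, UnicodeDecodeError):
            return value  # not mojibake of this kind; keep as-is
    if isinstance(value, list):
        return [fix_mojibake(item) for item in value]
    if isinstance(value, dict):
        return {fix_mojibake(k): fix_mojibake(v) for k, v in value.items()}
    return value

bad_json = r'{"sender_name": "Horn\u00c3\u00adkov\u00c3\u00a1", "tags": ["ok"]}'
fixed = fix_mojibake(json.loads(bad_json))
print(fixed)
```

A correctly written string can occasionally happen to round-trip as well, so this is best run only on files known to be damaged.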

I don't know enough about JSON to be able to say whether this is valid or not, but you can parse these strings using the raw_unicode_escape codec:
>>> "Horn\u00c3\u00adkov\u00c3\u00a1".encode('raw_unicode_escape').decode('utf8')
'Horníková'

Reencode to bytes, and then redecode to text.
>>> 'HornÃ\xadkovÃ¡'.encode('latin-1').decode('utf-8')
'Horníková'
Is this type of JSON file even valid JSON according to the specification?
No.
A string is a sequence of zero or more Unicode characters, wrapped in double quotes, using backslash escapes [emphasis added].
source
A string is a sequence of Unicode code points wrapped with quotation marks (U+0022). [...] Any code point may be represented as a hexadecimal escape sequence [...] represented as a six-character sequence: a reverse solidus, followed by the lowercase letter u, followed by four hexadecimal digits that encode the code point [emphasis added].
source
UTF-8 byte sequences are neither Unicode characters nor Unicode code points.

Python: Correct Way to refer to index of unicode string

Not sure if this is exactly the problem, but I'm trying to insert a tag around the first letter of a unicode string and it seems that this is not working. Could this be because unicode indices work differently than those of regular strings?
Right now my code is this:
for index, paragraph in enumerate(intro[2:-2]):
    intro[index] = bold_letters(paragraph, 1)
def bold_letters(string, index):
    return "<b>"+string[0]+"</b>"+string[index:]
And I'm getting output like this:
<b>?</b>?רך האחד וישתבח הבורא בחכמתו ורצונו כל צבא השמים ארץ וימים אלה ואלונים.
It seems the unicode gets messed up when I try to insert the HTML tag. I tried messing with the insert position but didn't make any progress.
Example desired output (hebrew goes right to left):
>>>first_letter_bold("הקדמה")
"הקדמ<\b>ה<b>"
BTW, this is for Python 2

You are right: indices count bytes when you are dealing with raw bytes, i.e. str in Python 2.x.
To work seamlessly with Unicode data, you first need to let Python 2.x know that you are dealing with Unicode, then do the string manipulation. Finally, you can convert the result back to raw bytes to keep the behavior abstracted, i.e. you get a str and you return a str.
Ideally you should convert all data from UTF-8 encoded bytes to unicode objects (I am assuming your source encoding is UTF-8 because that is what most applications use these days) at the very beginning of your code, and convert back to raw bytes at the very end, e.g. when saving to a DB or responding to a client. Some frameworks handle that for you so that you don't have to worry.
def bold_letters(string, index):
    string = string.decode('utf8')
    string = u"<b>"+string[0]+u"</b>"+string[index:]
    return string.encode('utf8')
This will also work for ASCII because UTF-8 is a superset of ASCII. To better understand how Unicode works, and how it works in Python specifically, read http://nedbatchelder.com/text/unipain.html
In Python 3.x, str is a Unicode string, so you don't have to do anything explicitly.
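For comparison, a minimal Python 3 sketch of the same idea. bold_first_letter is a hypothetical name, and the byte-level slice is only there to show why indexing a UTF-8 byte string cuts a character in half:

```python
def bold_first_letter(text):
    """Wrap the first character of a str in <b> tags."""
    if not text:
        return text
    return "<b>" + text[0] + "</b>" + text[1:]

word = "\u05d4\u05e7\u05d3\u05de\u05d4"   # the Hebrew word from the question
raw = word.encode("utf-8")

print(len(word), len(raw))   # 5 characters, 10 bytes
print(raw[:1])               # one byte: only half of the first letter
print(bold_first_letter(word))
```

Indexing a str this way works per code point, so a letter followed by combining marks would still be split from its marks.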

You should use Unicode strings. Byte strings in UTF-8 use a variable number of bytes per character. Unicode strings use one element per character (at least for characters in the BMP on Python 2, i.e. the first 65536 code points):
#coding:utf8
import io
s = u"הקדמה"
t = u'<b>'+s[0]+u'</b>'+s[1:]
print(t)
# io.open accepts encoding= on both Python 2 and Python 3
with io.open('out.htm','w',encoding='utf-8-sig') as f:
    f.write(t)
Output:
<b>ה</b>קדמה
But my Chrome browser displays out.htm as: (screenshot not preserved)

Python gpgme non-ascii text handling

I am trying to encrypt and decrypt text via GPG using pygpgme. It works for Western characters, but decryption fails on Russian text. I use GPG Suite on Mac to decrypt e-mail.
Here's the code I use to produce encrypted e-mail body, note that I tried to encode message in Unicode but it didn't make any difference. I use Python 2.7.
Please help, I must say I am new to Python.
import gpgme
from io import BytesIO
ctx = gpgme.Context()
ctx.armor = True
key = ctx.get_key('0B26AE38098')
payload = 'Просто тест'
#plain = BytesIO(payload.encode('utf-8'))
plain = BytesIO(payload)
cipher = BytesIO()
ctx.encrypt([key], gpgme.ENCRYPT_ALWAYS_TRUST, plain, cipher)

There are multiple problems here. You really should read the Unicode HOWTO, but I'll try to explain.
payload = 'Просто тест'
Python 2.x source code is, by default, ASCII; without an encoding declaration, Python 2.5 and later reject non-ASCII bytes in the source with a SyntaxError. More generally, text written in one encoding and read in another turns into nonsense. What happens if you write Просто тест in one program (like a text editor) as UTF-8, then read it in another program as Latin-1? You get ÐÑÐ¾ÑÑÐ¾ ÑÐµÑÑ. If you're using ISO-8859-5 rather than UTF-8, it'll be different nonsense, but still nonsense.
So, first and foremost, you need to find out what encoding you did use in your text editor. It's probably UTF-8, if you're on a Mac, but don't just guess; find out.
Second, you have to tell Python what encoding you used. You do that by using an encoding declaration. For example, if your text editor uses UTF-8, add this line to the top of your code:
# coding=utf-8
Once you fix that, payload will be a byte string, encoded in whatever encoding your text editor uses. But you can't encode already-encoded byte strings, only Unicode strings.
Python 2.x will let you call encode on them anyway, but it's not very useful—what it will do is first decode the string to Unicode using sys.getdefaultencoding, so it can then encode that. That's unlikely to be what you want.
The right way to fix this is to make payload a Unicode string in the first place, by using a Unicode literal. Like this:
payload = u'Просто тест'
Now, finally, you can actually encode the payload to UTF-8, which you did perfectly correctly in your first attempt:
plain = BytesIO(payload.encode('utf-8'))
Finally, you're encrypting UTF-8 plain text with GPG. When you decrypt it on the other side, make sure to decode it as UTF-8 there as well, or again you'll probably see nonsense.
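The encode/decode round trip described above can be checked without GPG. A minimal Python 3 sketch with the encryption step left out (an assumption: decryption returns exactly the bytes that were encrypted), showing that decoding with a different codec on the receiving side produces nonsense:

```python
from io import BytesIO

payload = u'Просто тест'

# Encode text to bytes before handing it to anything byte-oriented.
plain = BytesIO(payload.encode('utf-8'))

# ... encryption and decryption would happen here ...
received = plain.getvalue()

print(received.decode('utf-8') == payload)   # True: same codec on both sides
print(repr(received.decode('latin-1')))      # mojibake: wrong codec on the receiving side
```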

Special characters appearing as question marks

Using the Python programming language, I'm having trouble outputting characters such as å, ä and ö. The following code gives me a question mark (?) as output, not an å:
#coding: iso-8859-1
input = "å"
print input
The following code lets you input random text. The for-loop goes through each character of the input, adds them to the string variable a and then outputs the resulting string. This code works correctly; you can input å, ä and ö and the output will still be correct. For example, "år" outputs "år" as expected.
#coding: iso-8859-1
input = raw_input("Test: ")
a = ""
for i in range(0, len(input)):
    a = a + input[i]
print a
What's interesting is that if I change input = raw_input("Test: ") to input = "år", it will output a question mark (?) for the "å".
#coding: iso-8859-1
input = "år"
a = ""
for i in range(0, len(input)):
    a = a + input[i]
print a
For what it's worth, I'm using TextWrangler, and my document's character encoding is set to ISO Latin 1. What causes this? How can I solve the problem?

You're using Python 2, I assume running on a platform like Linux that encodes I/O in UTF-8.
Python 2's "" literals represent byte-strings. So when you specify "år" in your ISO 8859-1-encoded source file, the variable input has the value b'\xe5r'. When you print this, the raw bytes are output to the console, but show up as a question-mark because they are not valid UTF-8.
To demonstrate, try it with print repr(a) instead of print a.
When you use raw_input(), the user's input is already UTF-8-encoded, and so it is output correctly.
To fix this, do one of the following:
Decode your string from ISO 8859-1 and re-encode it as UTF-8 before printing it (calling encode directly on the byte-string would first try an implicit ASCII decode and fail):
print a.decode('iso-8859-1').encode('utf-8')
Use Unicode strings (u'text') instead of byte-strings. You will need to be careful with decoding the input, since on Python 2, raw_input() returns a byte-string rather than a text string. If you know the input is UTF-8, use raw_input().decode('utf-8').
Encode your source file in UTF-8 instead of iso-8859-1, and change the coding declaration to match. Then the byte-string literal will already be in UTF-8.
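The byte-level explanation above can be reproduced in Python 3, where bytes and text are separate types. A minimal sketch, assuming the source file was ISO 8859-1 and the terminal expects UTF-8:

```python
raw = b'\xe5r'   # "år" as stored in an ISO 8859-1 source file

# A UTF-8 terminal cannot interpret the lone 0xE5 byte.
as_seen_by_terminal = raw.decode('utf-8', errors='replace')

# Decode with the codec the bytes were written in, then encode for the terminal.
text = raw.decode('iso-8859-1')
utf8_bytes = text.encode('utf-8')

print(repr(as_seen_by_terminal))   # contains U+FFFD, the replacement character
print(utf8_bytes)                  # two bytes for the first letter, one for "r"
```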

Python - writing unicode strings to a file & beautiful soup

I'm using BeautifulSoup to parse some XML files. One of the fields in this file frequently uses Unicode characters. I've tried unsuccessfully to write the unicode to a file using encode.
The process so far is basically:
Get the name
gamename = items.find('name').string.strip()
Then incorporate the name into a list which is later converted into a string:
stringtoprint = userid, gamename.encode('utf-8')
newstring = "INSERT INTO collections VALUES " + str(stringtoprint) + ";" +"\n"
Then write that string to a file.
listofgamesowned.write(newstring.encode("UTF-8"))
It seems that I shouldn't have to call .encode quite so often. I tried encoding directly upon parsing out the name, e.g. gamename = items.find('name').string.strip().encode('utf-8'); however, that did not seem to work.
Currently, 'Uudet L\xc3\xb6yt\xc3\xb6retket' is being printed and saved rather than Uudet Löytöretket.
It seems if this were a string I was generating then I'd use something.write(u'Uudet L\xc3\xb6yt\xc3\xb6retket'); however, it's one element embedded in a string.

Unicode is an in-memory representation of a string. When you write out or read in you need to encode and decode.
Uudet L\xc3\xb6yt\xc3\xb6retket is the utf-8 encoded version of Uudet Löytöretket, so it is what you want to write out. When you want to read a string back from a file you need to decode it.
>>> print 'Uudet L\xc3\xb6yt\xc3\xb6retket'
Uudet LÃ¶ytÃ¶retket
>>> print 'Uudet L\xc3\xb6yt\xc3\xb6retket'.decode('utf-8')
Uudet Löytöretket
Just remember to encode immediately before you output and decode immediately after you read it back.
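A minimal Python 3 sketch of that rule, which also shows where the \xc3\xb6 escapes in the question come from: calling str() on a tuple uses the repr of each element, so already-encoded bytes are rendered as escape sequences. The file name and the userid value are made up for illustration.

```python
import os
import tempfile

userid = 42                               # hypothetical value
gamename = u'Uudet L\xf6yt\xf6retket'     # text as returned by the parser

# Encoding too early: str() of a tuple shows the repr of the bytes object.
too_early = str((userid, gamename.encode('utf-8')))
print(too_early)

# Keep text as text and encode only when writing.
# (String formatting mirrors the question; real SQL should use parameters.)
line = u"INSERT INTO collections VALUES ({}, '{}');\n".format(userid, gamename)
path = os.path.join(tempfile.mkdtemp(), 'games.sql')
with open(path, 'wb') as f:
    f.write(line.encode('utf-8'))

# Decode immediately after reading back.
with open(path, 'rb') as f:
    read_back = f.read().decode('utf-8')
```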
