Hex string to ASCII conversion with errors? - python

I am trying to write a python script to convert a hex string into ASCII and save the result into a file in .der cert format. I can do this in Notepad++ using the conversion plugin, but I would like to find a way to do this conversion in a python script from command line, either by invoking the notepad++ NppConverter plugin or using python modules.
I am part way there, but my conversion is not identical to the ASCII output seen in Notepad++; below is a snippet of the output in Notepad++
But my python conversion is displaying a slightly different output below
As you can see, my script causes missing characters in the output, and if I'm honest I don't know why certain blocks are outlined in black. But these missing blocks are needed in the same format as in the first picture.
Here's my basic code. I am working in Python 3, and I am using the backslashreplace error handler, as this is the only way I can get the problematic hex to appear in the output file:
result = bytearray.fromhex('380c2fd6172cd06d1f30').decode('ascii', 'backslashreplace')
text_file = open(r"C:\Output.der", "w")
text_file.write(result)
text_file.close()
Any guidance would be greatly appreciated.

MikG, I would say that Python did exactly what you requested.
You told it to decode the bytes as ASCII and to replace every byte outside the 7-bit ASCII range with a backslash escape sequence.
The bytes \x17 (ETB) and \x1F (US) are perfectly legal ASCII characters (though non-printable), so they are written out as their literal values.
The bytes \xd6 and \xd0 are illegal in ASCII because their high bit is set, so they are written as the 4-character escape sequences you asked for: a "\" (backslash) followed by "xd6" / "xd0".
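For illustration, this is roughly what that decode produces for the sample hex string (a small check, not part of the original question):
data = bytearray.fromhex('380c2fd6172cd06d1f30')
text = data.decode('ascii', 'backslashreplace')
print(repr(text))
# Prints: '8\x0c/\\xd6\x17,\\xd0m\x1f0'
# The ASCII bytes stay as-is; \xd6 and \xd0 become literal four-character
# backslash escape sequences, which is why the file no longer matches.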
I'm not familiar with DER, but I suppose you expect raw 8-bit byte sequences in the file. Here is how this could be accomplished:
result = bytearray.fromhex('380c2fd6172cd06d1f30')
with open("Output.der", "wb") as text_file:
    text_file.write(result)
Please note "wb" specifier to open -- it tells python to do binary IO.
I also used with statement to ensure that text_file is closed whatever happens with write.

Related

Unicode Emoji's in python from csv files

I have some csv data of some users tweet.
In excel it is displayed like this:
‰ÛÏIt felt like they were my friends and I was living the story with them‰Û #retired #IAN1
I had imported this csv file into python and in python the same tweet appears like this (I am using putty to connect to a server and I copied this from putty's screen)
▒▒▒It felt like they were my friends and I was living the story with them▒۝ #retired #IAN1
I am wondering how to display these emoji characters properly. I am trying to separate all the words in this tweet but I am not sure how I can separate those emoji unicode characters.
In fact, you certainly have a loss of data…
I don’t know how you got your CSV file from the users’ tweets (you might explain that). But generally, CSV files are encoded in "cp1252" (or "windows-1252"), sometimes in "iso-8859-1". Nowadays, we can also find CSV files encoded in "utf-8".
If your tweets are encoded in "cp1252" or any other single-byte character set, the emojis are lost (replaced by "?") or badly converted.
Then, if you open your CSV file in Excel, it will use its default encoding ("cp1252") and load the file with corrupted characters. You can try LibreOffice instead; it has a dialog box which lets you choose the encoding more easily.
Copy/pasting from PuTTY will also convert your characters depending on your console encoding… which is even worse!
If your CSV file uses "utf-8" encoding (or "utf-16", "utf-32") you have a better chance of preserving the emojis. But there is still a problem: most emojis have a code point greater than U+FFFF (65535 in decimal). For instance, Grinning Face "😀" has the code point U+1F600.
This kind of character is handled awkwardly in Python 2; try this:
# coding: utf8
from __future__ import unicode_literals
emoji = u"😀"
print(u"emoji: " + emoji)
print(u"repr: " + repr(emoji))
print(u"len: {}".format(len(emoji)))
You’ll get (if your console allows it):
emoji: 😀
repr: u'\U0001f600'
len: 2
The first line won’t print correctly if your console doesn’t allow Unicode,
The \U escape sequence is similar to the \u, but expects 8 hex digits, not 4.
Yes, this character has a length of 2!
EDIT: With Python 3, you get:
emoji: 😀
repr: '😀'
len: 1
No escape sequence for repr(),
the length is 1!
What you can do is post a fragment of your CSV file as an attachment; then someone could analyse it…
See also Unicode Literals in Python Source Code in the Python 2.7 documentation.
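If the file really is UTF-8, a minimal Python 3 sketch of reading it with an explicit encoding (the filename tweets.csv is just an assumption here):
import csv

# Passing the encoding explicitly keeps code points above U+FFFF intact.
with open("tweets.csv", encoding="utf-8", newline="") as f:
    for row in csv.reader(f):
        print(row)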
First of all, you shouldn't work with text copied from a console (let alone over a remote connection) because of formatting differences and how unreliable clipboards are. I'd suggest exporting your CSV and reading it directly.
I'm not quite sure what you are trying to do, but Twitter emojis cannot be displayed in a console because they are basically compressed images. Would you mind explaining your issue further?
I would personally treat the whole string as Unicode, separate each character into a list, then rebuild the words based on spaces.
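A rough sketch of that approach (Python 3; the tweet text here is just an inline example):
tweet = "It felt like they were my friends 😀 #retired #IAN1"

# Treat the text as a Unicode string and look at it character by character.
chars = list(tweet)

# Rebuild the words by joining the characters and splitting on spaces;
# the emoji simply ends up as (or inside) one of the "words".
words = "".join(chars).split(" ")
print(words)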

Python: Correct Way to refer to index of unicode string

Not sure if this is exactly the problem, but I'm trying to insert a tag on the first letter of a unicode string and it seems that this is not working. Could these be because unicode indices work differently than those of regular strings?
Right now my code is this:
for index, paragraph in enumerate(intro[2:-2]):
    intro[index] = bold_letters(paragraph, 1)

def bold_letters(string, index):
    return "<b>"+string[0]+"</b>"+string[index:]
And I'm getting output like this:
<b>?</b>?רך האחד וישתבח הבורא בחכמתו ורצונו כל צבא השמים ארץ וימים אלה ואלונים.
It seems the unicode gets messed up when I try to insert the HTML tag. I tried messing with the insert position but didn't make any progress.
Example desired output (hebrew goes right to left):
>>>first_letter_bold("הקדמה")
"הקדמ<\b>ה<b>"
BTW, this is for Python 2
You are right: indices work over each byte when you are dealing with raw bytes, i.e. str in Python 2.x.
To work seamlessly with Unicode data, you first need to let Python 2.x know that you are dealing with Unicode, then do the string manipulation. You can finally convert it back to raw bytes to keep the behaviour abstracted, i.e. you take a str and you return a str.
Ideally you should decode all the data from raw UTF-8 bytes to Unicode objects (I am assuming your source encoding is UTF-8, because that is the standard used by most applications these days) at the very beginning of your code, and encode back to raw bytes only at the very end, e.g. when saving to a database or responding to a client. Some frameworks handle that for you so you don't have to worry.
def bold_letters(string, index):
    string = string.decode('utf8')
    string = "<b>" + string[0] + "</b>" + string[index:]
    return string.encode('utf8')
This will also work for ASCII because UTF-8 is a superset of ASCII. You can get a better understanding of how Unicode works, and how Python handles it, by reading http://nedbatchelder.com/text/unipain.html
In Python 3.x, str is already a Unicode object, so you don't have to do anything explicitly.
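For example, calling the fixed helper on the sample word (Python 2, hypothetical usage, assuming the bold_letters definition above is in the same file):
# -*- coding: utf-8 -*-
print(bold_letters("הקדמה", 1))
# <b>ה</b>קדמה   -- a UTF-8 byte string, safe to store or send as-is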
You should use Unicode strings. UTF-8 byte strings use a variable number of bytes per character, whereas Unicode strings use one code unit per character (at least for characters in the BMP, i.e. the first 65536 code points, on Python 2):
# coding: utf8
import io

s = u"הקדמה"
t = u'<b>' + s[0] + u'</b>' + s[1:]
print(t)
with io.open('out.htm', 'w', encoding='utf-8-sig') as f:
    f.write(t)
Output:
<b>ה</b>קדמה
But my Chrome browser displays out.htm as:

Python: Converting special characters into operable integers?

I am currently working on a really simple encryption algorithm, as a project to show a basic understanding of how encryption works, and my algorithm basically just uses the ord() function for converting standard ASCII characters into integers that it can work on.
The problem I have run into is that I also need my program to be capable of encrypting, for example, the contents of a Windows executable (EXE) file. To do so, I need to convert all sorts of special (non-ASCII) characters into integers that I can operate on.
I don't know a whole lot about encoding, but from what I understand, ord() only works because there is an ASCII character map that has a corresponding number for each character. I couldn't figure out how to convert the special characters of an EXE file straight to integers, so I tried converting to bytes, which seems a little more universal to me (please correct me if I am wrong).
At this point, I am just looking for a solution that can read an EXE file and convert each character into a number specific to that character (for encryption/decryption purposes).
You are confusing the meaning assigned to bytes (like the ASCII standard) with the bytes themselves. ord() just gives you the numerical value for a given byte. That Python interprets those bytes and shows you ASCII codepoints is neither here nor there.
In other words, ord() doesn't have to consult an ASCII table and can handle any byte value. All it has to do is take the already known byte value and give you a Python int object for it.
Read your data as binary (open the file with b added to the file mode), and use ord(). In Python 2, that'll result in str objects, and each character in such an object is really a byte value in the range 0 - 255.
Note that if you are using Python 3, reading from a file in binary mode results in a bytes object that makes it clearer still that these are integer values in a range:
>>> b'abc'
b'abc'
>>> b'abc'[0]
97
Indexing an individual position in a bytes object produces the integer value directly; no call to ord() is required.
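As a concrete sketch, reading a file in binary mode and turning every byte into an integer (the filename program.exe is only an assumption):
# 'rb' reads raw bytes with no text decoding at all.
with open("program.exe", "rb") as f:
    data = f.read()

# Python 3: iterating a bytes object yields ints in the range 0-255.
values = list(data)

# Python 2 equivalent: data is a str, so use ord() on each character:
# values = [ord(c) for c in data]
print(values[:10])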

Python Unicode Bug

I'm making a virtual machine in RPython using PyPy. The problem is, when I tried to add unicode support I found an unusual problem. I'll use the letter "á" in my examples.
# The char in the example is á
print len(char)
OUTPUT:
2
I understand that the letter "á" takes two bytes, hence the length of 2. But when I use the example below, I am faced with a problem.
# In this example instr = "á" (including the quotes)
for char in instr:
    print hex(int(ord(char)))
OUTPUT:
0x22
0xc3
0xa1
0x22
As you can see, there are 4 numbers. The 0x22 values are for the quotes, but while there is only 1 letter in between the quotes, there are two numbers for it. My question is, some machines I tested this script on produced this output instead:
OUTPUT:
0x22
0xe1
0x22
Is there any way to make the output the same on both machines? The script is exactly the same on each.
The program is not being given the same input on the two machines:
In [154]: '\xe1'.decode('cp1252').encode('utf_8') == '\xc3\xa1'
Out[154]: True
When you type á in a console, you may see the glyph á, but the console is translating that into bytes. The particular bytes it translates that into depends on the encoding used by the console. On a Windows machine, that may be cp1252, while on a Unix machine it is likely to be utf-8.
So you may see the input as the same, but the console (and thus the program) receives different input.
If your program were to decode the bytes with the appropriate encoding, and then work with unicode, then both programs will operate the same after that point. If you are receiving the bytes from sys.stdin, then sys.stdin.encoding will be the encoding Python detects the console is using.
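A minimal sketch of that (Python 2, assuming the bytes arrive on sys.stdin):
import sys

# Decode the raw console bytes with the encoding Python detected for stdin,
# falling back to UTF-8; after that, both machines see the same code points.
encoding = sys.stdin.encoding or 'utf-8'

raw = sys.stdin.readline()              # a byte string in Python 2
text = raw.decode(encoding).rstrip(u'\n')

for char in text:
    print(hex(ord(char)))               # e.g. 0xe1 for á on both machines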
You have this question tagged "Python-3.x" -- is it possible that some machines are running Python 2.x, and others are running Python 3.x?
The character á is in fact U+00E1, so on a Python 3.x system, I would expect to see your second output. Since strings are Unicode in Python3 by default, len(char) will be 3 (including the quotes).
In Python 2.x, that same character in a string will be two bytes long, and (depending on your input method) will be represented in UTF-8 as \xc3\xa1. On that system, len(char) will be 4, and you would see your first output.
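A quick way to make that difference visible (the same source line run under both versions; this assumes the source file is saved as UTF-8):
# -*- coding: utf-8 -*-
char = '"á"'

# Python 2: len(char) == 4 -- quotes plus the two UTF-8 bytes \xc3 \xa1.
# Python 3: len(char) == 3 -- quotes plus the single code point U+00E1.
print(len(char))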
The issue is that you are using bytestrings to work with text data. You should use Unicode instead.
It implies that you need to know the character encoding of your input data -- There Ain't No Such Thing As Plain Text.
If you know the character encoding then it is easy to convert a bytestring to Unicode e.g.:
unicode_text = bytestring.decode(encoding)
It should resolve your initial issue.
There are also Unicode normalization forms e.g.:
import unicodedata
norm_text = unicodedata.normalize('NFC', unicode_text)
If I don't change the encoding in the program how can I output unicode characters for example?
You might mean that you have a sequence of bytes, e.g. '\xc3\xa1' (two bytes), that can be interpreted as text using some character encoding; in utf-8 it is the Unicode code point U+00E1. It may be something different in a different character encoding. Please read the link I've provided above, The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
Unless, by accident, your terminal uses the same character encoding as the data in your input file, you need to be able to convert from one character encoding to another. Otherwise the output will be corrupted, e.g., instead of á you might get ├б on the screen.
In ordinary Python, you could use bytes.decode, unicode.encode methods (or codecs module directly). I don't know whether it is possible in RPython.
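In CPython 2 that conversion step looks roughly like this (a sketch; the terminal-encoding handling is an assumption about your setup):
# -*- coding: utf-8 -*-
import sys

utf8_bytes = '\xc3\xa1'              # "á" as read from a UTF-8 file

text = utf8_bytes.decode('utf-8')    # a unicode object, u'\xe1'

# Re-encode for whatever encoding the terminal actually uses.
out_encoding = sys.stdout.encoding or 'utf-8'
print(text.encode(out_encoding, 'replace'))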

Erasing all unreadable characters in tab-delimited txt

I am running a Python program to process tab-delimited txt data.
But it causes trouble because the data often contains characters such as U+001A or the newline characters listed at http://en.wikipedia.org/wiki/Newline#Unicode
(Worse, these characters are not even visible unless the txt file is opened in Sublime Text; they don't show up even in Notepad++.)
If the python program is run on Linux then it automatically ignores such characters, but on Windows, it can't.
For example if there is U+001A in the txt, then the python program will automatically think that's the end of the file.
For another example, if there is U+0085 in the txt, then the python program will think that's the point where a new line starts.
So I just want a separate program that will erase EVERY Unicode character that is not shown in ordinary file viewers like Notepad++ (and that program should work on Windows).
I do want to keep things like あ and ä. I only want to delete things like U+001A and U+0085 which are not visible in Notepad++.
How can this be achieved?
There is no such thing as a "Unicode character". A character is a character, and how it is encoded is a different matter. The capital letter "A" can be encoded in a lot of ways, amongst them UTF-8, EBCDIC, ASCII, etc.
If you want to delete every character that cannot be represented in ASCII, then you can use the following (py3):
a = 'aあäbc'
a.encode('ascii', 'ignore')
This will yield b'abc'.
And if there are really U+001A, i.e. SUBSTITUTE, characters in your document, most probably something has gone haywire in a prior encoding step.
Using unicodedata looks to be the best way to do it, as suggested by #Hyperboreus (Stripping non printable characters from a string in python), but as a quick hack you could do the following (in Python 2.x):
Open the source file in binary mode. This prevents Windows from truncating reads when it finds the EOF control character (U+001A, Ctrl-Z).
my_file = open("filename.txt", "rb")
Decode the file (this assumes the encoding is UTF-8):
my_str = my_file.read().decode("UTF-8")
Replace known "bad" code points:
my_str.replace(u"\u001A", "")
You could skip step 2 and replace the UTF-8 encoded value of each "bad" code point in step 3, for example \x1A, but the method above allows for UTF-16/32 source if required.
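Putting the steps together, plus the unicodedata idea mentioned above, might look roughly like this (Python 2; the filenames and the choice of which categories to strip are assumptions):
# -*- coding: utf-8 -*-
import unicodedata

# Steps 1-2: binary read, then decode (assuming a UTF-8 source).
with open("filename.txt", "rb") as my_file:
    my_str = my_file.read().decode("UTF-8")

# Step 3, generalised: drop code points in the Unicode "Other" categories
# (Cc, Cf, ...) except tab/newline/carriage return, keeping あ, ä, etc.
cleaned = u"".join(
    ch for ch in my_str
    if ch in u"\t\n\r" or not unicodedata.category(ch).startswith("C")
)

with open("cleaned.txt", "wb") as out_file:
    out_file.write(cleaned.encode("UTF-8"))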
