Python - String changing after decoding and encoding again (zlib + base64)

Python - String changing after decoding and encoding again (zlib + base64) - python

I have this very simple Python code:
in_data = "eNrtmD1Lw0AY..."
print("Input: " + in_data)
out_data = in_data.decode('base64').decode('zlib').encode('zlib').encode('base64')
print("Output: " + out_data)
It outputs:
Input: eNrtmD1Lw0AY...
Output: eJztmE1LAkEY...
The string is also correctly decoded; if I display in_data.decode('base64').decode('zlib'), it gives the expected result.
Also, the formatting is different for both strings:
Why is the decoding/encoding not working properly? Are there some sort of parameters I should use?

Your data on input starts with the hex bytes 78 DA, your output starts with 78 9C:
>>> 'eNrt'.decode('base64').encode('hex')[:4]
'78da'
>>> 'eJzt'.decode('base64').encode('hex')[:4]
'789c'
DA is the highest compression level, 9C is the default. See What does a zlib header look like?
Rather than use .encode('zlib') use the zlib.compress() function, an set the level to 9:
import zlib
zlib.compress(decoded_data, 9).encode('base64')
The output of the base64 encoding inserts a newline every 76 characters to make it suitable for MIME encapsulation (emailing). You could use the base64.b64encode() function instead to encode without newlines.

Related

How to decode a string representation of a bytes object?

I have a string which includes encoded bytes inside it:
str1 = "b'Output file \xeb\xac\xb8\xed\x95\xad\xeb\xb6\x84\xec\x84\x9d.xlsx Created'"
I want to decode it, but I can't since it has become a string. Therefore I want to ask whether there is any way I can convert it into
str2 = b'Output file \xeb\xac\xb8\xed\x95\xad\xeb\xb6\x84\xec\x84\x9d.xlsx Created'
Here str2 is a bytes object which I can decode easily using
str2.decode('utf-8')
to get the final result:
'Output file 문항분석.xlsx Created'

You could use ast.literal_eval:
>>> print(str1)
b'Output file \xeb\xac\xb8\xed\x95\xad\xeb\xb6\x84\xec\x84\x9d.xlsx Created'
>>> type(str1)
<class 'str'>
>>> from ast import literal_eval
>>> literal_eval(str1).decode('utf-8')
'Output file 문항분석.xlsx Created'

Based on the SyntaxError mentioned in your comments, you may be having a testing issue when attempting to print due to the fact that stdout is set to ascii in your console (and you may also find that your console does not support some of the characters you may be trying to print). You can try something like the following to set sys.stdout to utf-8 and see what your console will print (just using string slice and encode below to get bytes rather than the ast.literal_eval approach that has already been suggested):
import codecs
import sys
sys.stdout = codecs.getwriter('utf-8')(sys.stdout.buffer)
s = "b'Output file \xeb\xac\xb8\xed\x95\xad\xeb\xb6\x84\xec\x84\x9d.xlsx Created'"
b = s[2:-1].encode().decode('utf-8')

A simple way is to assume that all the characters of the initial strings are in the [0,256) range and map to the same Unicode value, which means that it is a Latin1 encoded string.
The conversion is then trivial:
str1[2:-1].encode('Latin1').decode('utf8')

Finally I have found an answer where i use a function to cast a string to bytes without encoding.Given string
str1 = "b'Output file \xeb\xac\xb8\xed\x95\xad\xeb\xb6\x84\xec\x84\x9d.xlsx Created'"
now i take only actual encoded text inside of it
str1[2:-1]
and pass this to the function which convert the string to bytes without encoding its values
import struct
def rawbytes(s):
"""Convert a string to raw bytes without encoding"""
outlist = []
for cp in s:
num = ord(cp)
if num < 255:
outlist.append(struct.pack('B', num))
elif num < 65535:
outlist.append(struct.pack('>H', num))
else:
b = (num & 0xFF0000) >> 16
H = num & 0xFFFF
outlist.append(struct.pack('>bH', b, H))
return b''.join(outlist)
So, calling the function would convert it to bytes which then is decoded
rawbytes(str1[2:-1]).decode('utf-8')
will give the correct output
'Output file 문항분석.xlsx Created'

Python adds extra to crypt result

I'm trying to create an API with token to communicate between an Raspberry Pi and a Webserver. Right now i'm tring to generate an Token with Python.
from Crypto.Cipher import AES
import base64
import os
import time
import datetime
import requests
BLOCK_SIZE = 32
BLOCK_SZ = 14
#!/usr/bin/python
salt = "123456789123" # Zorg dat de salt altijd even lang is! (12 Chars)
iv = "1234567891234567" # Zorg dat de salt altijd even lang is! (16 Chars)
currentDate = time.strftime("%d%m%Y")
currentTime = time.strftime("%H%M")
PADDING = '{'
pad = lambda s: s + (BLOCK_SIZE - len(s) % BLOCK_SIZE) * PADDING
EncodeAES = lambda c, s: base64.b64encode(c.encrypt(pad(s)))
DecodeAES = lambda c, e: c.decrypt(base64.b64decode(e)).rstrip(PADDING)
secret = salt + currentTime
cipher=AES.new(key=secret,mode=AES.MODE_CBC,IV=iv)
encode = currentDate
encoded = EncodeAES(cipher, encode)
print (encoded)
The problem is that the output of the script an exta b' adds to every encoded string.. And on every end a '
C:\Python36-32>python.exe encrypt.py
b'Qge6lbC+SulFgTk/7TZ0TKHUP0SFS8G+nd5un4iv9iI='
C:\Python36-32>python.exe encrypt.py
b'DTcotcaU98QkRxCzRR01hh4yqqyC92u4oAuf0bSrQZQ='
Hopefully someone can explain what went wrong.
FIXED!
I was able to fix it to decode it to utf-8 format.
sendtoken = encoded.decode('utf-8')

You are running Python 3.6, which uses Unicode (UTF-8) for string literals. I expect that the EncodeAES() function returns an ASCII string, which Python is indicating is a bytestring rather than a Unicode string by prepending the b to the string literal it prints.
You could strip the b out of the output post-Python, or you could print(str(encoded)), which should give you the same characters, since ASCII is valid UTF-8.
EDIT:
What you need to do is decode the bytestring into UTF-8, as mentioned in the answer and in a comment above. I was wrong about str() doing the conversion for you, you need to call decode('UTF-8') on the bytestring you wish to print. That converts the string into the internal UTF-8 representation, which then prints correctly.

python: print hex data instead of dictionary

Trying to decode Zigbee protocol from a Lowe's Iris SmartSwitch. The API callback I am using receives a frame that is already parsed, and sent to the callback as (I believe) a dictionary object (?). My code:
def callback(data):
print data
prints the following:
{'profile': '\xc2\x16', 'source_addr': '\x93\x0c', 'dest_endpoint':
'\x02', 'rf_data': '\t\x00\x81\x00\x00', 'source_endpoint': '\x02',
'options': '\x01', 'source_addr_long': '\x00\ro\x00\x03\xbc\xdf\xab',
'cluster': '\x00\xef', 'id': 'rx_explicit'}
I think this is in byte order, and I would much prefer an output like:
C2 16 93 0C 02 09 00 81 00 00 02 01 00 ...
Any way to do this with Python built-ins, with the given 'data' argument? Also, I dont know how to interpret "\ro" as 8-bit hex. "\t", I assume, is 0x09.
What I would REALLY like is a raw dump of the dataframe, but I don't know if there is an API call for that.

Any way to do this with Python built-ins, with the given 'data' argument?
No. Dictionaries are arbitrarily ordered, so there's no way to know what order the values should be in.
Also, I dont know how to interpret "\ro" as 8-bit hex.
It's "\r" and "o", which is 0x0d 0x6f.
"\t", I assume, is 0x09.
>>> hex(ord('\t'))
'0x9'
What I would REALLY like is a raw dump of the dataframe, but I don't know if there is an API call for that.
We don't know, since you haven't shared which API you're using in the first place.

If you're ok with no spaces, try the builtin encode method on the strings.
So '\x0a\x0b\x0c'.encode('hex') will yield '0a0b0c'. There are a number of was you could iterate over dictionary values, but beware that it's an inherently un-ordered data structure.
Maybe something like this would do?
data = {'a':'\x0a\x0a\x0a', 'b':'\x0b\x0b\x0b'}
print '{' + ''.join( [ ('\'%s\'' %k) + ': \'%s\',' %data[k].encode('hex') for k in data ] ) + '}'

this works:
def printData(data)
str = "> "
for d in data:
for e in data[d]:
str = str + "{0:02X}".format(ord(e)) + " "
print (str)
I use this string builder so I can use logging.info(str) instead of print...

How to print out 0xfb in python

I'm falling the unicode hell.
My environment in on unix, python 2.7.3
LC_CTYPE=zh_TW.UTF-8
LANG=en_US.UTF-8
I'm trying to dump hex encoded data in human readable format, here is simplified code
#! /usr/bin/env python
# encoding:utf-8
import sys
s=u"readable\n" # previous result keep in unicode string
s2="fb is not \xfb" # data read from binary file
s += s2
print s # method 1
print s.encode('utf-8') # method 2
print s.encode('utf-8','ignore') # method 3
print s.decode('iso8859-1') # method 4
# method 1-4 display following error message
#UnicodeDecodeError: 'ascii' codec can't decode byte 0xfb
# in position 0: ordinal not in range(128)
f = open('out.txt','wb')
f.write(s)
I just want to print out the 0xfb.
I should describe more here. The key is 's += s2'.
Where s will keep my previous decoded string.
And the s2 is next string which should append into s.
If I modified as following, it occurs on write file.
s=u"readable\n"
s2="fb is not \xfb"
s += s2.decode('cp437')
print s
f=open('out.txt','wb')
f.write(s)
# UnicodeEncodeError: 'ascii' codec can't encode character
# u'\u221a' in position 1: ordinal not in range(128)
I wish the result of out.txt is
readable
fb is not \xfb
or
readable
fb is not 0xfb
[Solution]
#! /usr/bin/env python
# encoding:utf-8
import sys
import binascii
def fmtstr(s):
r = ''
for c in s:
if ord(c) > 128:
r = ''.join([r, "\\x"+binascii.hexlify(c)])
else:
r = ''.join([r, c])
return r
s=u"readable"
s2="fb is not \xfb"
s += fmtstr(s2)
print s
f=open('out.txt','wb')
f.write(s)

I strongly suspect that your code is actually erroring out on the previous line: the s += s2 one. s2 is just a series of bytes, which can't be arbitrarily tacked on to a unicode object (which is instead a series of code points).
If you had intended the '\xfb' to represent U+FB, LATIN SMALL LETTER U WITH CIRCUMFLEX, it would have been better to assign it like this instead:
s2 = u"\u00fb"
But you said that you just want to print out \xHH codes for control characters. If you just want it to be something humans can understand which still makes it apparent that special characters are in a string, then repr may be enough. First, don't have s be a unicode object, because you're treating your strings here as a series of bytes, not a series of code points.
s = s.encode('utf-8')
s += s2
print repr(s)
Finally, if you don't want the extra quotes on the outside that repr adds, for nice pretty printing or whatever, there's not a simple builtin way to do that in Python (that I know of). I've used something like this before:
import re
controlchars_re = re.compile(r'[\x00-\x31\x7f-\xff]')
def _show_control_chars(match):
txt = repr(match.group(0))
return txt[1:-1]
def escape_special_characters(s):
return controlchars_re.sub(_show_control_chars, s.replace('\\', '\\\\'))
You can pretty easily tweak the controlchars_re regex to define which characters you care about escaping.

Is this a sensible approach for an EBCDIC (CP500) to Latin-1 converter?

I have to convert a number of large files (up to 2GB) of EBCDIC 500 encoded files to Latin-1. Since I could only find EBCDIC to ASCII converters (dd, recode) and the files contain some additional proprietary character codes, I thought I'd write my own converter.
I have the character mapping so I'm interested in the technical aspects.
This is my approach so far:
# char mapping lookup table
EBCDIC_TO_LATIN1 = {
0xC1:'41', # A
0xC2:'42', # B
# and so on...
}
BUFFER_SIZE = 1024 * 64
ebd_file = file(sys.argv[1], 'rb')
latin1_file = file(sys.argv[2], 'wb')
buffer = ebd_file.read(BUFFER_SIZE)
while buffer:
latin1_file.write(ebd2latin1(buffer))
buffer = ebd_file.read(BUFFER_SIZE)
ebd_file.close()
latin1_file.close()
This is the function that does the converting:
def ebd2latin1(ebcdic):
result = []
for ch in ebcdic:
result.append(EBCDIC_TO_LATIN1[ord(ch)])
return ''.join(result).decode('hex')
The question is whether or not this is a sensible approach from an engineering standpoint. Does it have some serious design issues? Is the buffer size OK? And so on...
As for the "proprietary characters" that some don't believe in: Each file contains a year's worth of patent documents in SGML format. The patent office has been using EBCDIC until they switched to Unicode in 2005. So there are thousands of documents within each file. They are separated by some hex values that are not part of any IBM specification. They were added by the patent office. Also, at the beginning of each file there are a few digits in ASCII that tell you about the length of the file. I don't really need that information but if I want to process the file so I have to deal with them.
Also:
$ recode IBM500/CR-LF..Latin1 file.ebc
recode: file.ebc failed: Ambiguous output in step `CR-LF..data'
Thanks for the help so far.

EBCDIC 500, aka Code Page 500, is amongst Pythons encodings, although you link to cp1047, which doesn't. Which one are you using, really? Anyway this works for cp500 (or any other encoding that you have).
from __future__ import with_statement
import sys
from contextlib import nested
BUFFER_SIZE = 16384
with nested(open(sys.argv[1], 'rb'), open(sys.argv[2], 'wb')) as (infile, outfile):
while True:
buffer = infile.read(BUFFER_SIZE)
if not buffer:
break
outfile.write(buffer.decode('cp500').encode('latin1'))
This way you shouldn't need to keep track of the mappings yourself.

If you set up the table correctly, then you just need to do:
translated_chars = ebcdic.translate(EBCDIC_TO_LATIN1)
where ebcdic contains EBCDIC characters and EBCDIC_TO_LATIN1 is a 256-char string which maps each EBCDIC character to its Latin-1 equivalent. The characters in EBCDIC_TO_LATIN1 are the actual binary values rather than their hex representations. For example, if you are using code page 500, the first 16 bytes of EBCDIC_TO_LATIN1 would be
'\x00\x01\x02\x03\x37\x2D\x2E\x2F\x16\x05\x25\x0B\x0C\x0D\x0E\x0F'
using this reference.

While this might not help the original poster anymore, some time ago I released a package for Python 2.6+ and 3.2+ that adds most of the western 8 bit mainframe codecs including CP1047 (French) and CP1141 (German): https://pypi.python.org/pypi/ebcdic. Simply import ebcdic to add the codecs and then use open(..., encoding='cp1047') to read or write files.

Answer 1:
Yet another silly question: What gave you the impression that recode produced only ASCII as output? AFAICT it will transcode ANY of its repertoire of charsets to ANY of its repertoire, AND its repertoire includes IBM cp500 and cp1047, and OF COURSE latin1. Reading the comments, you will note that Lennaert and I have discovered that there aren't any "proprietary" codes in those two IBM character sets. So you may well be able to use recode after all, once you are certain what charset you've actually got.
Answer 2:
If you really need/want to transcode IBM cp1047 via Python, you might like to firstly get the mapping from an authoritative source, processing it via script with some checks:
URL = "http://source.icu-project.org/repos/icu/data/trunk/charset/data/ucm/glibc-IBM1047-2.1.2.ucm"
"""
Sample lines:
<U0000> \x00 |0
<U0001> \x01 |0
<U0002> \x02 |0
<U0003> \x03 |0
<U0004> \x37 |0
<U0005> \x2D |0
"""
import urllib, re
text = urllib.urlopen(URL).read()
regex = r"<U([0-9a-fA-F]{4,4})>\s+\\x([0-9a-fA-F]{2,2})\s"
results = re.findall(regex, text)
wlist = [None] * 256
for result in results:
unum, inum = [int(x, 16) for x in result]
assert wlist[inum] is None
assert 0 <= unum <= 255
wlist[inum] = chr(unum)
assert not any(x is None for x in wlist)
print repr(''.join(wlist))
Then carefully copy/paste the output into your transcoding script for use with Vinay's buffer.translate(the_mapping) idea, with a buffer size perhaps a bit larger than 16KB and certainly a bit smaller than 2GB :-)

No crystal ball, no info from OP, so had a bit of a rummage in the EPO website. Found freely downloadable weekly patent info files, still available in cp500/SGML even though website says this to be replaced by utf8/XML in 2006 :-). Got the 2009 week 27 file. Is a zip containing 2 files s350927[ab].bin. "bin" means "not XML". Got the spec! Looks possible that "proprietary codes" are actually BINARY fields. Each record has a fixed 252-byte header. First 5 bytes are record length in EBCDIC e.g. hex F0F2F2F0F8 -> 2208 bytes. Last 2 bytes of the fixed header are the BINARY length (redundant) of the following variable part. In the middle are several text fields, two 2-byte binary fields, and one 4-byte binary field. The binary fields are serial numbers within groups, but all I saw are 1. The variable part is SGML.
Example (last record from s350927b.bin):
Record number: 7266
pprint of header text and binary slices:
['EPB102055619 TXT00000001',
1,
' 20090701200927 08013627.8 EP20090528NN ',
1,
1,
' T *lots of spaces snipped*']
Edited version of the rather long SGML:
<PATDOC FILE="08013627.8" CY=EP DNUM=2055619 KIND=B1 DATE=20090701 STATUS=N>
*snip*
<B541>DE<B542>Windschutzeinheit für ein Motorrad
<B541>EN<B542>Windshield unit for saddle-ride type vehicle
<B541>FR<B542>Unité pare-brise pour motocyclette</B540>
*snip*
</PATDOC>
There are no header or trailer records, just this one record format.
So: if the OP's annual files are anything like this, we might be able to help him out.
Update: Above was the "2 a.m. in my timezone" version. Here's a bit more info:
OP said: "at the beginning of each file there are a few digits in ASCII that tell you about the length of the file." ... translate that to "at the beginning of each record there are five digits in EBCDIC that tell you exactly the length of the record" and we have a (very fuzzy) match!
Here is the URL of the documentation page: http://docs.epoline.org/ebd/info.htm
The FIRST file mentioned is the spec.
Here is the URL of the download-weekly-data page: http://ebd2.epoline.org/jsp/ebdst35.jsp
An observation: The data that I looked at is in the ST.35 series. There is also available for download ST.32 which appears to be a parallel version containing only the SGML content (in "reduced cp437/850", one tag per line). This indicates that the fields in the fixed-length header of the ST.35 records may not be very interesting, and can thus be skipped over, which would greatly simplify the transcoding task.
For what it's worth, here is my (investigatory, written after midnight) code:
[Update 2: tidied up the code a little; no functionality changes]
from pprint import pprint as pp
import sys
from struct import unpack
HDRSZ = 252
T = '>s' # text
H = '>H' # binary 2 bytes
I = '>I' # binary 4 bytes
hdr_defn = [
6, T,
38, H,
40, T,
94, I,
98, H,
100, T,
251, H, # length of following SGML text
HDRSZ + 1
]
# above positions as per spec, reduce to allow for counting from 1
for i in xrange(0, len(hdr_defn), 2):
hdr_defn[i] -= 1
def records(fname, output_encoding='latin1', debug=False):
xlator=''.join(chr(i).decode('cp500').encode(output_encoding, 'replace') for i in range(256))
# print repr(xlator)
def xlate(ebcdic):
return ebcdic.translate(xlator)
# return ebcdic.decode('cp500') # use this if unicode output desired
f = open(fname, 'rb')
recnum = -1
while True:
# get header
buff = f.read(HDRSZ)
if not buff:
return # EOF
recnum += 1
if debug: print "\nrecnum", recnum
assert len(buff) == HDRSZ
recsz = int(xlate(buff[:5]))
if debug: print "recsz", recsz
# split remainder of header into text and binary pieces
fields = []
for i in xrange(0, len(hdr_defn) - 2, 2):
ty = hdr_defn[i + 1]
piece = buff[hdr_defn[i]:hdr_defn[i+2]]
if ty == T:
fields.append(xlate(piece))
else:
fields.append(unpack(ty, piece)[0])
if debug: pp(fields)
sgmlsz = fields.pop()
if debug: print "sgmlsz: %d; expected: %d - %d = %d" % (sgmlsz, recsz, HDRSZ, recsz - HDRSZ)
assert sgmlsz == recsz - HDRSZ
# get sgml part
sgml = f.read(sgmlsz)
assert len(sgml) == sgmlsz
sgml = xlate(sgml)
if debug: print "sgml", sgml
yield recnum, fields, sgml
if __name__ == "__main__":
maxrecs = int(sys.argv[1]) # dumping out the last `maxrecs` records in the file
fname = sys.argv[2]
keep = [None] * maxrecs
for recnum, fields, sgml in records(fname):
# do something useful here
keep[recnum % maxrecs] = (recnum, fields, sgml)
keep.sort()
for k in keep:
if k:
recnum, fields, sgml = k
print
print recnum
pp(fields)
print sgml

Assuming cp500 contains all of your "additional propietary characters", a more concise version based on Lennart's answer using the codecs module:
import sys, codecs
BUFFER_SIZE = 64*1024
ebd_file = codecs.open(sys.argv[1], 'r', 'cp500')
latin1_file = codecs.open(sys.argv[2], 'w', 'latin1')
buffer = ebd_file.read(BUFFER_SIZE)
while buffer:
latin1_file.write(buffer)
buffer = ebd_file.read(BUFFER_SIZE)
ebd_file.close()
latin1_file.close()

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python - String changing after decoding and encoding again (zlib + base64) - python

Related

How to decode a string representation of a bytes object?

Python adds extra to crypt result

python: print hex data instead of dictionary

How to print out 0xfb in python

Is this a sensible approach for an EBCDIC (CP500) to Latin-1 converter?

Categories

Resources