Data encoding in Python

I download an image from newsgroups with the nntplib module in Python. I then want to save the data to a file. I use:
news.group('alt.binaries.misc')
data=''.join(news.body('<DhTgplpHcRsZMBTTw3i35#spot.net>')[-1])
f=open('image.png','wb')
f.write(data)
f.close()
However, the saved file isn't a proper image file.
data is a string of the form:
'\x89PNG=B=C\x1a=C=A=A=A=BIHDR=A=A\x02X=A=A\x01Q\x08\x06=A=A=A\xa8\x81\xd3\x89=A=A=A\tpHYs=A=A\x0b\x13=A=A\x0b\x13\x01=A\x9a\x9c\x18=A=A etc... '
It has a length of 309530.
I can tell from the first bytes that the file should be a PNG file, and the size also seems right to me, so I assume the data is correct.
Does anyone know what I'm doing wrong?
UPDATE:
I looked in the header of the article and it says: Content-Type: text/plain; charset=ISO-8859-1 and Content-Transfer-Encoding: 8bit. I don't think this is very helpful for decoding the text, but who knows...
I also compared my string with the regular header of PNG files, which is \x89PNG\r\n\x1a\n, or \x89PNG\x0d\x0a\x1a\x0a (as alexis also stated).
I concluded that =B stands for \x0d, =C for \x0a, and =A for \x00. I assumed that the other \x.. bytes are not encoded, but I'm not sure (I don't know much about encodings); UPDATE3 shows that they do differ.
What is an encoding that encodes this way?
UPDATE2:
the data: -see below- (repr(data))
UPDATE3:
I was able to save the image with another program and then open it in Python. This is what the data should be. -see below-.
The beginnings look kind of similar, but after that there is a big difference. What the hell is this encoding? It really frustrates me. (BTW, thanks for all the great help so far.)
All the files: http://dl.dropbox.com/u/1499291/python-encoding-question/index.html

NULL is replaced with "=A", CR with "=B", LF with "=C", and "=" with "=D", very similar to gZip 8bit (see http://www.imc.org/ietf-usefor/2003/Feb/0575.html).
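Based on that table, reversing the escapes is a short transformation. A minimal sketch (Python 2 era, matching the question; it assumes every "=" in the body starts one of these two-character escapes, so a single substitution pass is unambiguous):

import re

# Assumed escape scheme, from the mapping above:
# "=A" -> NUL, "=B" -> CR, "=C" -> LF, "=D" -> "="
ESCAPES = {'=A': '\x00', '=B': '\r', '=C': '\n', '=D': '='}

def unescape(body):
    # One regex pass; every "=" is assumed to begin an escape pair
    return re.sub('=[ABCD]', lambda m: ESCAPES[m.group(0)], body)

with open('image.png', 'wb') as f:
    f.write(unescape(data))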

It looks like the data is uuencoded, so you have to decode it before you write it to your image file.
You can verify this by running uudecode on image.png, the file you have written with your script above.
Python has support for this in its uu module.
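A sketch of that check with the standard library (assuming image.png is the file written by the script above; uu.decode raises ValueError if there is no valid 'begin' header, which would disprove the uuencode guess):

import uu

# Decode the (suspected) uuencoded body into a new file
uu.decode('image.png', 'decoded.png')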

I don't believe NNTP.body is supposed to automatically decode article content, or is it? Have you looked at the article source? It should specify the encoding.
Anyway, the PNG signature should start with \x89PNG, followed by 0d 0a (CR LF). This ain't it. Could it be base64 encoded, or some such? All those = signs look very familiar.
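One quick, hedged way to test the base64 guess (Python 2 era, with data as read in the question; a decode whose first bytes are the PNG signature would support it):

import base64

try:
    decoded = base64.b64decode(data)
    print repr(decoded[:8])   # '\x89PNG\r\n\x1a\n' would support the guess
except TypeError:             # raised on malformed padding in Python 2
    print 'not valid base64'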

Related

Problems decoding German "Umlaute" of QR-Codes with "pyzbar"

I wrote a Python (V3.9.9) program (Windows 10) to decode QR-Codes of type "EPC QR" -> see Wikipedia
Everything is working fine, except if there are German "Umlaute" (ÄÖÜäöü) within the text of the QR code. Here is a sample program to demonstrate/isolate the problem:
import cv2  # Read image / camera / video input
from pyzbar.pyzbar import decode

img = cv2.imread("GiroCodeUmlaute.PNG")
print(decode(img))
for code in decode(img):
    print(code.type)
    print(code.data.decode("UTF-8"))
And here is the QR-Code for testing:
GiroCodeUmlaute.PNG -> see QR-Code generator
The 6th line of the encoded QR-Code text contains "Ärzte ohne Grenzen".
But when it's decoded with "UTF-8" (which is the correct character set), then "テвzte ohne Grenzen" is displayed.
The decoded raw hex data also look a bit strange to me:
[Decoded(data=b'BCD\n002\n1\nSCT\nRLNWATWW\n\xef\xbe\x83\xd0\xb2zte ohne Grenzen...
Where are the 4(!) hex bytes coming from? \xef\xbe\x83\xd0\xb2zte
And where is the 'r' of the original text?
The same problem occurs if this test program runs on a Raspberry Pi.
If this sample QR code is scanned by an Android mobile app, the "Umlaut" is displayed correctly.
From my point of view it looks like a problem in the "pyzbar" module, but maybe I'm doing something wrong?
Every help and tip is appreciated!
Thanks
I faced the same problem with Swiss bill QR codes on Linux using python3-qrtools; I guess this module uses the zbar library as well.
A workaround would be to replace the wrong characters in the decoded string (the exact symbols depend on the output you get from print(code.data.decode("UTF-8")) for your Umlaute). I got the following, which I need to replace with ä, ö and ü respectively:
string = string.replace('瓣', 'ä')
string = string.replace('繹', 'ö')
string = string.replace('羹', 'ü')
It's not a very elegant solution because it does not deal with all the special characters that get wrongly decoded.
As I'm calling the Python program from a Pascal/Delphi/Lazarus program, I hope to find another tool somewhere that reads QR codes from the webcam and decodes them correctly...
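A more general repair attempt, assuming zbar mis-read the UTF-8 payload as Shift-JIS and re-encoded it as UTF-8 (which matches the bytes in the question: 0xC3 is the halfwidth katakana ﾃ in Shift-JIS, and the pair 0x84 0x72 swallows the 'r' to form в), is to reverse that round trip:

# Hedged sketch: undo an assumed UTF-8 -> Shift-JIS -> UTF-8 round trip.
# Works only if every mojibake character survives the reverse encoding.
raw = code.data.decode("UTF-8")                  # e.g. 'ﾃвzte ohne Grenzen'
fixed = raw.encode("shift_jis").decode("UTF-8")  # e.g. 'Ärzte ohne Grenzen'
print(fixed)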

Python read .bin data and convert to string

I have multiple bin files, and I want to extract the data from them, but the results I'm getting are pretty weird.
For example, for my first file I do the following:
path = r'D:\lut.bin'  # raw string avoids backslash-escape surprises
with open(path, 'rb') as file:  # b is important -> binary
    fileContent = file.read()
print(fileContent)
Output:
xc7\xfb\x99\x0c\x8e\xf9~7\xb9a\xb1*\x06\xd2o\xb8\xb2 \x81\x8bg\xd2\xc6bE\r\xb9KL7\xa0\xa52\xa5\xd2\x17\x86(\xe9\x02\xbf\xeb\x8fDvk\xe7\x8d\x03\x872\x9fop\xbck\xe1\x94\x02\xdc\xef\x85I\t\xc8\x8d\xdfl\x90\xcf*\xb1\x02(\x16~)\xc7\xa2\x1f\xf6o\xdc\x1en\x84H\xf6%\xfaW/\xee\xbc\xdd^/\x9b\x9a\xe5\x99\xa2\xd7\xe4\x93U\xd4\xef$\xa5\x8aW\xf6\xc9\xb0T\xe3<\x147\xcc\x08}\xc8\x15J3v\n\x9d\x16\xa3\x8d\r\xa2\xc4\x15\xf13!\xa2\x01\x14\xef\xaf\x06\x83p\xa7Ot\x8cr\xdf\xef\xbe\x93\xc2D`y\\\xdb\x8a\x1c\\H\x9cE\xabF\xd6\xe1B\xdd\xbc\x8a\xdb\x06|\x05{!\xf0K25K0\xb9\xfe\xa6n\xd7-\xd1\xcb\xefQ\xd9w\x08{4\x13\xba8\x06\x00}S\xe4\xd8*\xe2\x81f\x8d\xc4P\xde\x88/\xa6q\x7fG\x99\xbd\xa84v\xcfS+\xc6\xc5#\x0ey\xd8\xcd\xf2!\xf8`1\x03k5\xb9\xee\xb3V\xc3">\xdd\xf4\x94\x1b\x83\xf9\xdbe\xfcw\xf4+O\xf4\xf1\xfc\xa2 \xc5\xccq\xd1\xc8dH\x00\xf7K|7\x87\xa8$\xb8\x92^\x90.\xffK\xbf\xf6\xcaHv9l\xa6\x0e\xd5"\xd6`>}f\xfc\xd1\x15\xd0\xf0\x89\xb7\x12\xdf\xc9\xdfn\x97\xc7O\xf8\x05)Ua|\xd6\xd5\x03P\xf3\xcd\x08 \xc6\xc7\xe2"\xae\x1fz\xb9\xbd\x99\x100\x9a\x8d\xeb\x89\xa3T\xa0\xc7S\xcc\xe4h\xbe\xf3R\xe9\x9d\xf4Y\xe91\xa4%\x85>mn\xc3\x1e\x8a}\x04\xd9:\xb5\xde\x01h\x90y\xfe4&\xea\x1d\x9a\xbd\xac\x1a\x8e{\xb2Y\xcb\xc47\xd8\xe2\xf6\xd6\xdc\x91,]\x1d\xca\x90_sb\x86X\xad]\x8e\xe1A\x1a\xaa\xc6\xdf\x1ca#A\x1a\xa2\t!3\x06y\x92\x96\xebg\xdb3\xdd\x9f\xefh\x9d6\x17c0\x0e\xfe\x9a\x06\x06;\x16\xa7\x
I have no idea what this is; it does not look like readable text. Is there even a way to convert this?
My other file looks like this:
U\xff\xf3\xe8d\x00\x00\x00\x01\xa4\x00\x00\x00\x00\x00\x00\x03H\x00\x00\x00\x00LAME3.100UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU
Again with the same code above.
I have tried decoding it, but I keep getting decoding errors, and text encoding to utf-8 doesn't help either.
I want to get the text from this; these files came with a book I downloaded from the Play Store.
Bin files are just binary data, meaning each byte can be any value between 0 and 255 (00000000 and 11111111, hex 0x00-0xFF).
Printable characters are a subset of those codes.
This means, not every bin file can be converted to text.
Python already tries to visualise the byte stream by putting printable characters in place of their \xNN codes (where N is a hex digit). The rest of the characters are printed as their codes.
This means the
U\xff\xf3\xe8d\x00\x00\x00\x01\xa4\x00\x00\x00\x00\x00\x00\x03H\x00\x00\x00\x00LAME3.100UU
is in fact
\x55\xff\xf3\xe8\x64\x00\x00\x00\x01\xa4\x00\x00\x00\x00\x00\x00\x03\x48\x00\x00\x00\x00\x4c\x41\x4d\x45\x33\x2e\x31\x30\x30\x55\x55
[Copy this into your Python interpreter as a string (i.e. in quotes) and see how it visually converts itself when displaying/printing!]
The parts:
decoded U,
then not decoded \xff\xf3\xe8
decoded d
not decoded \x00\x00\x00\x01\xa4\x00\x00\x00\x00\x00\x00\x03
decoded H
not decoded \x00\x00\x00\x00
decoded LAME3.100UU
Can you extract some data from it? Depending on the type of the bin, you may well find some strings put directly in there - like the LAME3.100, which looks like some code/version... but I really doubt you will find anything useful. It can be literally anything, just dumped there: text, photo, memory dump...
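If you do want to fish for such embedded strings, here is a small sketch in the spirit of the Unix strings tool (the path is the one from the question):

import re

with open(r'D:\lut.bin', 'rb') as f:
    blob = f.read()

# Print every run of 4 or more printable ASCII bytes,
# which is how markers like LAME3.100 show up in binaries
for match in re.finditer(rb'[\x20-\x7e]{4,}', blob):
    print(match.group().decode('ascii'))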
This is very late, but LAME3.100 followed by a bunch of U characters is actually the start of a certain encoding of an .mp3 file, and since it may be incomplete, you could try to convert it with https://ffmpeg.org into a proper .mp3 container.
After you have ffmpeg in your path, a command such as ffmpeg -i "D:/lut.bin" "D:/lut.mp3" should hopefully decode and re-encode it.

Python: Convert PNG string obtained via toDataURL to binary PNG file

The toDataURL method (see e.g. https://developer.mozilla.org/de/docs/Web/API/HTMLCanvasElement/toDataURL) gives a string representation of a PNG of the following form:
"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAUAAAAFCAYAAACNby
blAAAADElEQVQImWNgoBMAAABpAAFEI8ARAAAAAElFTkSuQmCC"
How can I convert such a PNG string to a binary PNG file in python 3 ?
OK, so it was a simple (and maybe stupid) mistake that I made. The first part before the comma, i.e. data:image/png;base64, must be removed, like this:
import base64

with open('sample.png', 'wb') as f:
    # base64.decodestring was deprecated and removed in Python 3.9;
    # b64decode does the same job and accepts str directly
    f.write(base64.b64decode(string.split(',')[1]))
does the trick for me. So it was an obvious mistake: you need to remove the header. But I will still leave this as an answer in case it happens to others just as it happened to me. Also look at this thread, Python: Ignore 'Incorrect padding' error when base64 decoding, concerning padding.
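If you do hit that 'Incorrect padding' error, a common fix (a sketch, reusing the string variable above) is to re-pad the payload to a multiple of four characters before decoding:

import base64

payload = string.split(',')[1]
payload += '=' * (-len(payload) % 4)  # restore any stripped base64 padding
with open('sample.png', 'wb') as f:
    f.write(base64.b64decode(payload))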

Reading srt (subtitle) files with Python3

I wish to be able to read an srt file with Python 3.
These files can be found here:
http://www.opensubtitles.org/
With info here:
http://en.wikipedia.org/wiki/SubRip
Subrip supports any encoding: ascii or unicode, for example.
If I understand correctly, I need to specify which decoder to use when I call Python's read function. So am I right in saying that I need to know how the file is encoded in order to make this judgement? If so, how do I establish that for each file, if I have a hundred such files with different sources and language support?
Ultimately I would prefer if I could convert the files so that they are all in utf-8 encoding to start with. But some of these files might be some obscure encoding for all I know.
Please help,
Barry
You could use the charade package (formerly chardet) to detect the encoding.
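A sketch of that approach (movie.srt is a placeholder name; the modern package name is chardet, so adjust the import if you use charade), which also covers your goal of converting everything to UTF-8:

import chardet  # 'import charade' on installs of the old fork

with open('movie.srt', 'rb') as f:
    raw = f.read()

guess = chardet.detect(raw)  # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73, ...}
text = raw.decode(guess['encoding'])

with open('movie.utf8.srt', 'w', encoding='utf-8') as f:
    f.write(text)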
You can check for the byte order mark at the start of each .srt file to test for the encoding. However, this probably won't work for all files, as it is not a required attribute, and is only specified for UTF files anyway. A check can be performed by:
testStr = b'\xff\xfeOtherdata'
if testStr[0:2] == b'\xff\xfe':
    print('UTF-16 Little Endian')
elif testStr[0:2] == b'\xfe\xff':
    print('UTF-16 Big Endian')
# ...
What you probably want to do is simply open your file, then decode whatever you pull out of the file into unicode, deal with the unicode representation until you are ready to print, and then encode it back again. See this talk for some more information, and code samples that might be relevant.
There's also a decent library for handling SRT files:
https://pypi.python.org/pypi/pysrt
You can specify the encoding when opening and writing SRT files.
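A short sketch (the file names and the iso-8859-1 guess are placeholders; pass whatever encoding you detected):

import pysrt

subs = pysrt.open('movie.srt', encoding='iso-8859-1')
subs.save('movie.utf8.srt', encoding='utf-8')  # rewrite as UTF-8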

How to deal with Japanese words using Python xlrd

This is my code:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import xlrd, sys, re

data = xlrd.open_workbook('a.xls', encoding_override="utf-8")
a = data.sheets()[0]
s = ''
for i in range(a.nrows):
    if 9 < i < 20:
        # stage
        print a.row_values(i)[1].decode('shift_jis') + '\n'
but it shows:
????
????????
??????
????
????
????
????????
So what can I do?
Thanks
Background: In a "modern" (Excel 97-2003) XLS file, text is effectively stored as Unicode. In older files, text is stored as 8-bit strings, and a "codepage" record tells how it is encoded e.g. the integer 1252 corresponds to the encoding known as cp1252 or windows-1252. In either case, xlrd presents extracted text as unicode objects.
Please insert this line into your code:
print data.biff_version, data.codepage, data.encoding
If you have a new file, you should see
80 1200 utf_16_le
In any case, please edit your question to report the outcome.
Problem 1: encoding_override is required ONLY if the file is an old file AND you know/suspect that the codepage record is omitted or wrong. It is ignored if the file is a new file. Do you really know that the file is pre-Excel-97 and the text is encoded in UTF-8? If so, it can only have been created by some seriously deluded 3rd-party software, and Excel will blow up if you try to open it with Excel; visit the author with a baseball bat. Otherwise, don't use encoding_override.
Problem 2: You should have unicode objects. To display them, you need to encode (not decode) them from unicode to str using a suitable encoding. It is very surprising that print unicode_object.decode('shift-jis') doesn't raise an exception and prints question marks.
To help understand this, please change your code to be like this:
text = a.row_values(i)[1]
print i, repr(text)
print repr(text.decode('shift-jis'))
and report the outcome.
So that we can help you choose an appropriate encoding (if any), tell us what version of what operating system you are using, and what the following display:
print sys.stdout.encoding
import locale
print locale.getpreferredencoding()
Further reading:
(1) the xlrd documentation (section on Unicode, right up the front) ... included in the distribution, or get the latest commit here.
(2) the Python Unicode HOWTO.
Why isn't your encoding override on open shift-jis?
data = xlrd.open_workbook('a.xls',encoding_override="shift-jis")
If the file really is Shift-JIS, there are lots of code points (well, frankly, almost all of them) that don't overlap with valid UTF-8 code points. If you are getting illegal characters (?) and your file really is UTF-8 and you want to output Shift-JIS, might I suggest that your output shell (for print; a file would probably be fine) can't handle the encoding.
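Putting the first answer's advice into code, a sketch (it assumes a modern XLS, so xlrd already returns unicode, and a UTF-8-capable console; Python 2, matching the question):

# Encode the unicode that xlrd returns; do not decode it again
for i in range(a.nrows):
    if 9 < i < 20:
        print a.row_values(i)[1].encode('utf-8')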
