Problems decoding German "Umlaute" of QR-Codes with "pyzbar"

I wrote a Python (3.9.9) program on Windows 10 to decode QR codes of type "EPC QR" (see Wikipedia).
Everything works fine unless the text of the QR code contains German umlauts (ÄÖÜäöü). Here is a sample program that isolates the problem:
import cv2  # read image / camera / video input
from pyzbar.pyzbar import decode
img = cv2.imread("GiroCodeUmlaute.PNG")
print(decode(img))
for code in decode(img):
    print(code.type)
    print(code.data.decode("UTF-8"))
And here is the QR-Code for testing:
GiroCodeUmlaute.PNG -> see QR-Code generator
The 6th line of the encoded QR code text contains "Ärzte ohne Grenzen".
But when it is decoded as UTF-8 (which is the correct character set), "ﾃвzte ohne Grenzen" is displayed.
The raw decoded bytes also look a bit strange:
[Decoded(data=b'BCD\n002\n1\nSCT\nRLNWATWW\n\xef\xbe\x83\xd0\xb2zte ohne Grenzen...
Where do the five (!) bytes \xef\xbe\x83\xd0\xb2 in front of "zte" come from?
Where is the 'r' of the original text?
The same problem occurs when this test program runs on a Raspberry Pi.
If the sample QR code is scanned with an Android mobile app, the umlaut is displayed correctly.
It looks to me like a problem in the "pyzbar" module, but maybe I'm doing something wrong?
Any help or tip is appreciated!
Thanks

I faced the same problem with Swiss bill QR codes on Linux using python3-qrtools; I guess this module uses the zbar library as well.
A workaround is to replace the wrong characters in the decoded string. Check what print (code.data.decode ("UTF-8")) outputs for your umlauts. I got the following symbols, which I replace with ä, ö and ü respectively:
string = string.replace('瓣', 'ä')
string = string.replace('繹', 'ö')
string = string.replace('羹', 'ü')
It's not a very elegant solution because it does not handle all the special characters that get decoded wrongly.
As I'm calling the Python program from a Pascal/Delphi/Lazarus program, I hope to find another tool that reads QR codes from the webcam and decodes them correctly...
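The bytes shown in the question are consistent with the UTF-8 bytes of "Är" (C3 84 72) having been read as Shift-JIS and then re-encoded as UTF-8: \xef\xbe\x83 is ﾃ and \xd0\xb2 is в, which would also explain the missing 'r'. The symbols in the workaround above look like the same effect with a different wrong codec, possibly Big5. The following is a minimal sketch, assuming that this is what happened, which reverses the mix-up generically instead of replacing characters one by one. The helper name repair_mojibake and the codec to try are assumptions for illustration, not part of pyzbar or zbar.

```python
def repair_mojibake(text, wrong_codec):
    """Undo a wrong decode: re-encode with the codec that was wrongly
    applied, then decode the recovered bytes as UTF-8.
    Returns the text unchanged if the round trip is not possible."""
    try:
        return text.encode(wrong_codec).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# Bytes as reported in the question (already garbled by the scanner library).
raw = b"\xef\xbe\x83\xd0\xb2zte ohne Grenzen"
garbled = raw.decode("utf-8")
print(repair_mojibake(garbled, "shift_jis"))
```

Text that was decoded correctly in the first place usually cannot be encoded with the wrong codec (Shift-JIS has no "Ä", for example), in which case the helper returns it unchanged.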

Related

Python read .bin data and convert to string

I have multiple .bin files and want to extract the data from them, but the results I'm getting are pretty weird.
For example, my first file does the following:
path = r'D:\lut.bin'  # raw string so the backslash is taken literally
with open(path, 'rb') as file:  # b is important -> binary
    fileContent = file.read()
print(fileContent)
Output:
xc7\xfb\x99\x0c\x8e\xf9~7\xb9a\xb1*\x06\xd2o\xb8\xb2 \x81\x8bg\xd2\xc6bE\r\xb9KL7\xa0\xa52\xa5\xd2\x17\x86(\xe9\x02\xbf\xeb\x8fDvk\xe7\x8d\x03\x872\x9fop\xbck\xe1\x94\x02\xdc\xef\x85I\t\xc8\x8d\xdfl\x90\xcf*\xb1\x02(\x16~)\xc7\xa2\x1f\xf6o\xdc\x1en\x84H\xf6%\xfaW/\xee\xbc\xdd^/\x9b\x9a\xe5\x99\xa2\xd7\xe4\x93U\xd4\xef$\xa5\x8aW\xf6\xc9\xb0T\xe3<\x147\xcc\x08}\xc8\x15J3v\n\x9d\x16\xa3\x8d\r\xa2\xc4\x15\xf13!\xa2\x01\x14\xef\xaf\x06\x83p\xa7Ot\x8cr\xdf\xef\xbe\x93\xc2D`y\\\xdb\x8a\x1c\\H\x9cE\xabF\xd6\xe1B\xdd\xbc\x8a\xdb\x06|\x05{!\xf0K25K0\xb9\xfe\xa6n\xd7-\xd1\xcb\xefQ\xd9w\x08{4\x13\xba8\x06\x00}S\xe4\xd8*\xe2\x81f\x8d\xc4P\xde\x88/\xa6q\x7fG\x99\xbd\xa84v\xcfS+\xc6\xc5#\x0ey\xd8\xcd\xf2!\xf8`1\x03k5\xb9\xee\xb3V\xc3">\xdd\xf4\x94\x1b\x83\xf9\xdbe\xfcw\xf4+O\xf4\xf1\xfc\xa2 \xc5\xccq\xd1\xc8dH\x00\xf7K|7\x87\xa8$\xb8\x92^\x90.\xffK\xbf\xf6\xcaHv9l\xa6\x0e\xd5"\xd6`>}f\xfc\xd1\x15\xd0\xf0\x89\xb7\x12\xdf\xc9\xdfn\x97\xc7O\xf8\x05)Ua|\xd6\xd5\x03P\xf3\xcd\x08 \xc6\xc7\xe2"\xae\x1fz\xb9\xbd\x99\x100\x9a\x8d\xeb\x89\xa3T\xa0\xc7S\xcc\xe4h\xbe\xf3R\xe9\x9d\xf4Y\xe91\xa4%\x85>mn\xc3\x1e\x8a}\x04\xd9:\xb5\xde\x01h\x90y\xfe4&\xea\x1d\x9a\xbd\xac\x1a\x8e{\xb2Y\xcb\xc47\xd8\xe2\xf6\xd6\xdc\x91,]\x1d\xca\x90_sb\x86X\xad]\x8e\xe1A\x1a\xaa\xc6\xdf\x1ca#A\x1a\xa2\t!3\x06y\x92\x96\xebg\xdb3\xdd\x9f\xefh\x9d6\x17c0\x0e\xfe\x9a\x06\x06;\x16\xa7\x
I have no idea what this is, but it does not look like readable text. Is there a way to convert it at all?
My other file looks like this:
U\xff\xf3\xe8d\x00\x00\x00\x01\xa4\x00\x00\x00\x00\x00\x00\x03H\x00\x00\x00\x00LAME3.100UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU
Again with the same code as above.
I have tried decoding it, but I keep getting decoding errors, and encoding the text to UTF-8 doesn't help either.
I want to get the text out of these files; they came with a book I downloaded from the Play Store.

Bin files are just binary data, meaning each byte can be any value between 0 and 255 (00000000 and 11111111, hex 0x00-0xFF).
Printable characters are a subset of those codes.
This means that not every .bin file can be converted to text.
Python already tries to visualise the byte stream by showing printable characters in place of their \xNN codes (where N is a hex digit). The remaining bytes are printed as their codes.
This means the
U\xff\xf3\xe8d\x00\x00\x00\x01\xa4\x00\x00\x00\x00\x00\x00\x03H\x00\x00\x00\x00LAME3.100UU
is in fact
\x55\xff\xf3\xe8\x64\x00\x00\x00\x01\xa4\x00\x00\x00\x00\x00\x00\x03\x48\x00\x00\x00\x00\x4c\x41\x4d\x45\x33\x2e\x31\x30\x30\x55\x55
[Copy this into your Python interpreter as a string (i.e. in quotes) and see how it visually converts itself when displaying/printing!]
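A short sketch of that experiment, using only the first ten bytes; the variable names are just for illustration. In Python 3 the literal needs a b prefix to be a byte string.

```python
# The same byte string written two ways: all-hex escapes vs. the mixed form.
all_hex = b"\x55\xff\xf3\xe8\x64\x00\x00\x00\x01\xa4"
mixed = b"U\xff\xf3\xe8d\x00\x00\x00\x01\xa4"

print(all_hex == mixed)   # both literals denote the same bytes
print(repr(all_hex))      # printable bytes are shown as characters
print(list(all_hex[:5]))  # the underlying integer values
```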
The parts:
decoded U
not decoded \xff\xf3\xe8
decoded d
not decoded \x00\x00\x00\x01\xa4\x00\x00\x00\x00\x00\x00\x03
decoded H
not decoded \x00\x00\x00\x00
decoded LAME3.100UU
Can you extract some data from it? Depending on the type of the .bin file, you may find some strings stored directly in it, like the LAME3.100 that looks like some code or version string, but I really doubt that you will find anything useful. It can be literally anything, just dumped there: text, a photo, a memory dump...
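To look for such embedded strings, one option is to scan the bytes for runs of printable ASCII, similar in spirit to the Unix strings tool. This is a minimal sketch; the function name printable_runs and the minimum length of 4 are arbitrary choices for illustration.

```python
import re


def printable_runs(data, min_len=4):
    """Return runs of at least min_len printable ASCII bytes found in data."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [m.decode("ascii") for m in re.findall(pattern, data)]


# The beginning of the second file from the question.
sample = (b"U\xff\xf3\xe8d\x00\x00\x00\x01\xa4\x00\x00\x00\x00\x00\x00"
          b"\x03H\x00\x00\x00\x00LAME3.100UU")
print(printable_runs(sample))
```

Single printable bytes such as U, d and H are skipped because they are far more likely to be coincidences than text.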

This is very late, but LAME3.100 followed by a run of U characters is actually the start of an .mp3 stream written by the LAME encoder. Since the file may be incomplete, you could try converting it into a proper .mp3 container using ffmpeg (https://ffmpeg.org).
Once ffmpeg is on your PATH, a command such as ffmpeg -i "D:/lut.bin" "D:/lut.mp3" should hopefully decode and re-encode it.
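Before running ffmpeg, a quick check can hint at whether a file contains MPEG audio at all. MPEG audio frames start with a sync pattern of eleven set bits, which matches the \xff\xf3 seen in the output above. The sketch below only looks for that pattern; the function name find_mpeg_sync is made up for illustration, and a match alone does not prove that the file is a valid MP3.

```python
def find_mpeg_sync(data):
    """Return the index of the first MPEG audio frame sync (0xFF followed
    by a byte whose top three bits are set), or -1 if there is none."""
    for i in range(len(data) - 1):
        if data[i] == 0xFF and (data[i + 1] & 0xE0) == 0xE0:
            return i
    return -1


# First bytes of the second file from the question.
head = b"U\xff\xf3\xe8d\x00\x00\x00\x01\xa4"
print(find_mpeg_sync(head))
```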

Python: Convert PNG string obtained via toDataURL to binary PNG file

The toDataURL method (see e.g. https://developer.mozilla.org/de/docs/Web/API/HTMLCanvasElement/toDataURL) gives a string representation of a PNG of the following form:
"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAUAAAAFCAYAAACNby
blAAAADElEQVQImWNgoBMAAABpAAFEI8ARAAAAAElFTkSuQmCC"
How can I convert such a PNG string to a binary PNG file in Python 3?

OK, so it was a simple (and maybe stupid) mistake on my part. The part before the comma, i.e. data:image/png;base64, must be removed. The following does the trick for me (base64.decodestring was removed in Python 3.9, so b64decode is used here):
import base64
with open('sample.png', 'wb') as f:
    f.write(base64.b64decode(string.split(',')[1]))
In hindsight it is obvious that the header has to be removed, but I will leave this as an answer in case it happens to others as it did to me. Regarding padding, also look at the thread "Python: Ignore 'Incorrect padding' error when base64 decoding".
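A slightly more defensive variant is sketched below. It assumes the input is a base64 data URL of the form shown in the question; the function name data_url_to_bytes is hypothetical, and re-adding "=" padding is only a guess at what might be needed if padding was stripped along the way.

```python
import base64


def data_url_to_bytes(data_url):
    """Split a base64 data URL at the first comma and decode the payload."""
    header, sep, payload = data_url.partition(",")
    if not sep or not header.endswith(";base64"):
        raise ValueError("not a base64 data URL")
    # Restore '=' padding in case it was stripped along the way.
    payload += "=" * (-len(payload) % 4)
    return base64.b64decode(payload)


# Build a small data URL from known bytes (the 8-byte PNG signature).
png_signature = b"\x89PNG\r\n\x1a\n"
url = "data:image/png;base64," + base64.b64encode(png_signature).decode("ascii")
print(data_url_to_bytes(url) == png_signature)
```

The returned bytes can then be written to a file opened in 'wb' mode, as in the answer above.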

Custom Decode Script: EOL while scanning string literal

I made a script to decode a file that was created a few years ago, and I've run into an issue while doing my second decode test.
My code:
#!/usr/bin/python
# -*- coding: utf-8 -*-
from decoder.encodings import *
#Toontown Online Encoded Script Decoder
"""
##########################################
# Decoder was built to decompile #
# Team Pawz Multihack v2.0 #
##########################################
"""
input = "Text can be located here: http://pastebin.com/rdeAhyar ";
def decode():
    print input.decode('latin_1')
decode()
When I execute the code I get
SyntaxError: EOL while scanning string literal
SyntaxError: EOL while scanning string literal
Press any key to continue . . .
If it helps, I'm using the version of Python distributed with Panda3D.

The problem is embedding binary data in source code by simply pasting it. The error appears on Windows because, when reading text files, Windows treats the byte value 26 (hex 1A) as the end of the file and stops reading right before it. Linux is not affected by this, which is why I was unable to reproduce the problem.
Observe the difference between the file size and the number of bytes a "full" read() returns under Windows:
>>> import os
>>> os.path.getsize('test.py')
49297L
>>> len(open('test.py', 'r').read()) # text mode
1100
>>> len(open('test.py', 'rb').read()) # binary mode
49297
The solution is not to embed the binary data in the source code but to load it from a separate file. Make sure to open that file in binary mode instead of text mode.
Alternatively, encode the binary data so that it no longer contains "exotic" byte values. Base64 encoding is a good candidate for this.
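A minimal sketch of the Base64 approach in Python 3; the payload below is made up for illustration and merely contains the problematic 0x1A byte.

```python
import base64

# Hypothetical binary payload containing the problematic 0x1A byte.
binary_data = b"header\x1a\x00\xffpayload"

# Run once to produce plain ASCII text that is safe to paste into source code.
embedded = base64.b64encode(binary_data).decode("ascii")
print(embedded)

# In the script, keep only the ASCII text and decode it at run time.
restored = base64.b64decode(embedded)
print(restored == binary_data)
```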

Python encoding issue involving special characters

I am running Win7 x64 and I have Python 2.7.5 x64 installed. I am using Wing IDE 101 4.1.
For some reason, encoding is messed up.
string = "sauté"
print string
# saut├⌐
string
# 'saut\xc3\xa9'
I don't understand why it comes out weird when I print it. When I write it to a text file and open it in Notepad, it comes out right ("sauté"). The problem is that when I use BeautifulSoup on the string, the result contains that weird string "saut├⌐", and when I write it back out to a CSV file, I end up with an HTML chunk containing that weird bit. Help!

You need to declare the encoding of the source file so Python can properly decode your string literals.
You can do this with a special comment at the top of the file (first or second line).
# coding:<coding>
where <coding> is the encoding used when saving the file, for example utf-8.
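The garbled output in the question is consistent with UTF-8 bytes being displayed by a console that uses code page 437. That cause is an assumption, since the question does not state the console encoding. A minimal Python 3 sketch that reproduces the effect:

```python
utf8_bytes = "sauté".encode("utf-8")

print(utf8_bytes)                  # the raw bytes, as in the question
print(utf8_bytes.decode("cp437"))  # the same bytes interpreted as cp437
print(utf8_bytes.decode("utf-8"))  # the same bytes interpreted as UTF-8
```

The bytes themselves are fine; only the interpretation on output differs, which would also explain why the text looks right in Notepad.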

Unicode problems in Python

Huh, OK, so I have this massive problem with encodings and I just do not know how to deal with it. After two days of Google searches I think I have run out of options :)
What I want to do is the following.
1. Place text in a textbox on a website
2. Send the text to the backend (written in Python)
3. Use the text to create:
   a. An image in PIL.
   b. An entry in MySQL.
Now all of this works smoothly when we're talking about regular characters. But when I try to use Korean, Polish, Japanese characters I get very weird looking characters inserted in both the image and the database. In the examples below I'll use a three character string of Polish characters - "ąść".
Here's what I have done after Googling.
Inserted the following in .htaccess:
AddCharset UTF-8 .py .css .js .html
My python file now starts with:
#!/usr/bin/python
# -*- coding: utf-8 -*-
All of my MySQL databases are encoded in "utf8_unicode_ci".
Now, here's an example of what I'm trying to do... Whenever I parse "ąść" (three Polish characters) it gets saved in the database and generated on the image as:
Ä…Å›Ä‡
Now, a few debugging attempts. I go directly to Python and assign the following to the variable (value_text1) that usually holds the parsed text (so no text parsing, just a fixed text to generate the image with and to put into the database):
A) If I go with value_text1 = 'ąść' I get …Å›Ä‡ as a result.
B) If I go with value_text1 = u'ąść' I get the following error message:
UnicodeEncodeError: 'latin-1' codec can't encode characters in
position 0-1: ordinal not in range(256)
C) If I go with value_text1 = u'ąść'.encode('UTF-8') I get …Å›Ä‡ as a result.
D) If I go with value_text1 = u'\u0105\u015B\u0107'.encode('UTF-8'), where "\u0105\u015B\u0107" is the actual unicode for "ąść" I get …Å›Ä‡ as a result.
Really no clue what I'm doing wrong - server settings, python file settings, wrong command? Will appreciate any thoughts, huge thank you in advance.

If I try it in an interactive shell or from a .py file
#!/usr/bin/python
# -*- coding: utf-8 -*-
value_text1 = u'ąść'
print value_text1
it works perfectly well for me, so I guess it's something with your server configuration.
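BTW, make sure to set the character set explicitly when connecting to the database server (for example charset="utf8" with MySQLdb; MySQL's own charset names have no hyphen).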
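The result shown in the question is exactly what the UTF-8 bytes of "ąść" look like when they are read as Windows-1252, and the error in case B mentions latin-1, so somewhere between Python and the database the text is probably being treated as a single-byte Western encoding. Where that happens is an assumption; the Python 3 sketch below only reproduces the symptom and shows that it is reversible.

```python
text = "ąść"
utf8_bytes = text.encode("utf-8")

# Reading UTF-8 bytes as Windows-1252 produces the garbled form.
garbled = utf8_bytes.decode("cp1252")
print(garbled)

# Reversing the wrong step recovers the original text.
recovered = garbled.encode("cp1252").decode("utf-8")
print(recovered == text)

# These characters do not exist in latin-1, which matches the error in case B.
try:
    text.encode("latin-1")
except UnicodeEncodeError as exc:
    print(type(exc).__name__)
```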
