Custom Decode Script: EOL while scanning string literal - python

i made a script to decode a file made a few years ago, and I've run into an issue whilst doing my second decode test.
My code:
#!/usr/bin/python
# -*- coding: utf-8 -*-
from decoder.encodings import *
#Toontown Online Encoded Script Decoder
"""
##########################################
# Decoder was built to decompile #
# Team Pawz Multihack v2.0 #
##########################################
"""
input = "Text can be located here: http://pastebin.com/rdeAhyar ";
def decode():
print input.decode('latin_1')
decode()
When i execute the code i get
SyntaxError: EOL while scanning string literal
SyntaxError: EOL while scanning string literal
Press any key to continue . . .
If this helps i'm using the version of Python distributed within Panda3D.

The problem is embedding binary data in source code by simply pasting it. The error appears on Windows because Windows sees a byte value of 26 (hex 1A) as the end of text files and stops reading text files right before this byte value. Linux is not affected by this, that's the reason I was unable to reproduce the problem.
Observe the difference in file size and the amount of bytes a ”full” read() returns under Windows:
>>> os.path.getsize('test.py')
49297L
>>> len(open('test.py', 'r').read()) # text mode
1100
>>> len(open('test.py', 'rb').read()) # binary mode
49297
The solution is not to embed the binary data in the source code but load it from an extra file. Make sure to open it in binary mode instead of text mode.
Or you have to encode the binary data so it doesn't contain ”exotic” byte values any more. Base64 encoding is a good candidate for this.

Related

Python opening files with utf-8 file names

In my code I used something like file = open(path +'/'+filename, 'wb') to write the file
but in my attempt to support non-ascii filenames, I encode it as such
naming = path+'/'+filename
file = open(naming.encode('utf-8', 'surrogateescape'), 'wb')
write binary data...
so the file is named something like directory/path/\xd8\xb9\xd8\xb1\xd8\xa8\xd9.txt
and it works, but the issue arises when I try to get that file again by crawling into the same directory using:
for file in path:
data = open(file.as_posix(), 'rb)
...
I keep getting this error 'ascii' codec can't encode characters in position..
I tried converting the string to bytes like data = open(bytes(file.as_posix(), encoding='utf-8'), 'rb') but I get 'utf-8' codec can't encode characters in position...'
I also tried file.as_posix().encode('utf-8', 'surrogateescape'), I found that both encode and print just fine but with open() I still get the error 'utf-8' codec can't encode characters in position...'
How can I open a file with a utf-8 filename?
I'm using Python 3.9 on ubuntu linux
Any help is greatly appreciated.
EDIT
I figured out why the issue happens when crawling to the directory after writing.
So, when I write the file and give it the raw string directory/path/\xd8\xb9\xd8\xb1\xd8\xa8\xd9.txt and encode the string to utf, it writes fine.
But when finding the file again by crawling into the directory the str(filepath) or filepath.as_posix() returns the string as directory/path/????????.txt so it gives me an error when I try to encode it to any codec.
Currently I'm investigating if the issue's related to my linux locale, it was set to POSIX, I changed it to C.UTF-8 but still no luck atm.
More context: this is a file system where the file is uploaded through a site, so I receive the filename string in utf-8 format
I don't understand why you feel you need to recode filepaths.
Linux (unix) filenames are just sequences of bytes (with a couple of prohibited byte values). There's no need to break astral characters in surrogate pairs; the UTF-8 sequence for an astral character is perfectly acceptable in a filename. But creating surrogate pairs is likely to get you into trouble, because there's no UTF-8 encoding for a surrogate. So if you actually manage to create something that looks like the UTF-8 encoding for a surrogate codepoint, you're likely to encounter a decoding error when you attempt to turn it back into a Unicode codepoint.
Anyway, there's no need to go to all that trouble. Before running this session, I created a directory called ´ñ´ with two empty files, 𝔐 and mañana. The first one is an astral character, U+1D510. As you can see, everything works fine, with no need for manual decoding.
>>> [*Path('ñ').iterdir()]
[PosixPath('ñ/𝔐'), PosixPath('ñ/mañana')]
>>> Path.mkdir('ñ2')
>>> for path in Path('ñ').iterdir():
... open(Path('ñ2', path.name), 'w').close()
...
>>> [*Path('ñ2').iterdir()]
[PosixPath('ñ2/𝔐'), PosixPath('ñ2/mañana')]
>>> [open(path).read() for path in Path('ñ2').iterdir()]
['', '']
Note:
In a comment, OP says that they had previously tried:
file = open('/upload/\xd8\xb9\xd8\xb1\xd8\xa8\xd9\x8a.png', 'wb')
and received the error
UnicodeEncodeError: 'ascii' codec can't encode characters in position 8-11: ordinal not in range(128)
Without more details, it's hard to know how to respond to that. It's possible that open will raise that error for a filesystem which doesn't allow non-ascii characters, but that wouldn't be normal on Linux.
However, it's worth noting that the string literal
'/upload/\xd8\xb9\xd8\xb1\xd8\xa8\xd9\x8a.png'
is not the string you think it is. \x escapes in a Python string are Unicode codepoints (with a maximum value of 255), not individual UTF-8 byte values. The Python string literal, "\xd8\xb9" contains two characters, "O with stroke" (Ø) and "superscript 1" (¹); in other words, it is exactly the same as the string literal "\u00d8\u00b9".
To get the Arabic letter ain (ع), either just type it (if you have an Arabic keyboard setting and your source file encoding is UTF-8, which is the default), or use a Unicode escape for its codepoint U+0639: "\u0639".
If for some reason you insist on using explicit UTF-8 byte encoding, you can use a byte literal as the argument to open:
file = open(b'/upload/\xd8\xb9\xd8\xb1\xd8\xa8\xd9\x8a.png', 'wb')
But that's not recommended.
So after being in a rabbit hole for the past few days, I figured the issue isn't with python itself but with the locale that my web framework was using. Debugging this, I saw that
import sys
print(sys.getfilesystemencoding())
returned 'ASCII', which was weird considering I had set the linux locale to C.UTF-8 but discovered that since I was running WSGI on Apache2, I had to add locale to my WSGI as such WSGIDaemonProcess my_app locale='C.UTF-8' in the Apache configuration file thanks to this post.

Problems decoding German "Umlaute" of QR-Codes with "pyzbar"

I wrote a Python (V3.9.9) program (Windows 10) to decode QR-Codes of type "EPC QR" -> see Wikipedia
Everything is working fine, except, if there are German "Umlaute" (ÄÖÜäöü) within the text of the QR-Code. Here is a sample program, to demonstrate/isolate the problem:
import cv2 # Read image / camera/video input
from pyzbar.pyzbar import decode
img = cv2.imread ("GiroCodeUmlaute.PNG")
print (decode (img))
for code in decode (img):
print (code.type)
print (code.data.decode ("UTF-8"))
And here is the QR-Code for testing:
GiroCodeUmlaute.PNG -> see QR-Code generator
The 6th line of the encoded QR-Code text contains "Ärzte ohne Grenzen".
But when it's decoded with "UTF-8" (which is the correct character set), then "テвzte ohne Grenzen" is displayed.
I think, also the decoded read hex data are looking a bit strange:
[Decoded(data=b'BCD\n002\n1\nSCT\nRLNWATWW\n\xef\xbe\x83\xd0\xb2zte ohne Grenzen...
From where are the 4(!) hex bytes coming? \xef\xbe\x83\xd0\xb2zte
Where is the 'r' of the original text?
The same problem occurs, if this test-program is running under a Raspberry computer.
If this sample QR-Code is scanned by an android mobile app, the "Umlaut" is correct displayed.
From my point of view it looks like a problem of the "pyzbar" module. But maybe I'm doing something wrong?
Every help and tip is appreciated!
Thanks
I faced the same Problem with Swiss Bill QR-Codes on Linux using the python3-qrtools - guess this module is using the zbar-library as well.
A workaround would be to replace the wrong chars in the decoded string. Depending on the output you get out with print (code.data.decode ("UTF-8")) for your Umlaute, I got the following symbols which I need to replace with ä, ö resp. ü:
string = string.replace('瓣', 'ä')
string = string.replace('繹', 'ö')
string = string.replace('羹', 'ü')
It's not a very elegant solution because it does not deal with all special characters which get wrongly decoded.
As I'm calling the python program from a Pascal/Delphi/lazarus Program, I hope to find an other tool somewhere around which reads QR-codes from the webcam and correctly decode it...

Unicode with cp1251 and utf-8 on windows

I am playing around with unicode in python.
So there is a simple script:
# -*- coding: cp1251 -*-
print 'юникод'.decode('cp1251')
print unicode('юникод', 'cp1251')
print unicode('юникод', 'utf-8')
In cmd I've switched encoding to Active code page: 1251.
And there is the output:
СЋРЅРёРєРѕРґ
СЋРЅРёРєРѕРґ
юникод
I am a little bit confused.
Since I've specified encoding to cp1251 I expect that it would be decoded correctly.
But as result there is some trash code points were interpreted.
I am understand that 'юникод' is just a bytes like:
'\xd1\x8e\xd0\xbd\xd0\xb8\xd0\xba\xd0\xbe\xd0\xb4'.
But there is a way to get correct output in terminal with cp1251?
Should I build byte string manually?
Seems like I misunderstood something.
I think I can understand what happened to you. The last line gave me the hint, that your trash codepoints confirmed. You try to display cp1251 characters but your editor is configured to use utf8.
The # -*- coding: cp1251 -*- is only used by the Python interpretor to convert characters from source python files that are outside of the ASCII range. And anyway it it is only used for unicode litterals because bytes from original source give er... exactly same bytes in byte strings. Some text editors are kind enough to automagically use this line (IDLE editor is), but I'm little confident in that and allways switch manually to the proper encoding when I use gvim for example. Short story: # -*- coding: cp1251 -*- in unused in your code and can only mislead a reader since it it not the actual encoding.
If you want to be sure of what lies in your source, you'd better use explicit escapes. In code page 1251, this word юникод is composed by those characters: '\xfe\xed\xe8\xea\xee\xe4'
If you write this source:
txt = '\xfe\xed\xe8\xea\xee\xe4'
print txt
print txt.decode('cp1251')
print unicode(txt, 'cp1251')
print unicode(txt, 'utf-8')
and execute it in a console configured to use CP1251 charset, the first three lines will output юникод, and the last one will throw a UnicodeDecodeError exception because the input is no longer valid 'utf8'.
Alternatively, if you find comfortable with you current editor, you could write:
# -*- coding: utf8 -*-
txt = 'юникод'.decode('utf8').encode('cp1251') # or simply txt = u'юникод'.encode('cp1251')
print txt
print txt.decode('cp1251')
print unicode(txt, 'cp1251')
print unicode(txt, 'utf-8')
which should give same results - but now the declared source encoding should be the actual encoding of your python source.
BTW, a Python 3.5 IDLE that natively uses unicode confirmed that:
>>> 'СЋРЅРёРєРѕРґ'.encode('cp1251').decode('utf8')
'юникод'
Your issue is that the encoding declaration is wrong: your editor uses utf-8 character encoding to save the source code. Use # -*- coding: utf-8 -*- to fix it.
>>> u'юникод'
u'\u044e\u043d\u0438\u043a\u043e\u0434'
>>> u'юникод'.encode('utf-8')
'\xd1\x8e\xd0\xbd\xd0\xb8\xd0\xba\xd0\xbe\xd0\xb4'
>>> print _.decode('cp1251') # mojibake due to the wrong encoding
СЋРЅРёРєРѕРґ
>>> print u'юникод'
юникод
Do not use bytestrings ('' literals create bytes object on Python 2) to represent text; use Unicode strings (u'' literals -- unicode type) instead.
If your code uses Unicode strings then a code page that your Windows console uses doesn't matter as long as the chosen font can display the corresponding (non-BMP) characters. See Python, Unicode, and the Windows console
Here's complete code, for reference:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
print(u'юникод')
Note: no .decode(), unicode(). If you are using a literal to create a string; you should use Unicode literals if the string contains text. It is the only option on Python 3 where you can't put non-ascii characters inside a bytes literal and it is a good practice (to use Unicode for text instead of bytestrings) on Python 2 too.
If you are given a bytestring as an input (not literal) by some API then its encoding has nothing to do with the encoding declaration. What specific encoding to use depends on the source of the data.
Just use the following, but ensure you save the source code in the declared encoding. It can be any encoding that supports the characters you want to print. The terminal can be in a different encoding, as long as it also supports the characters you want to print:
#coding:utf8
print u'юникод'
The advantage is that you don't need to know the terminal's encoding. Python will normally1 detect the terminal encoding and encode the print output correctly.
1Unless your terminal is misconfigured.

How to print symbols like ● to files in Python

I'm trying to write the symbol ● to a text file in python. I think it has something to do with the encoding (utf-8). Here is the code:
# -*- coding: utf-8 -*-
outFile = open('./myFile.txt', 'wb')
outFile.write("●")
outFile.close()
Instead of the black "●" I get "â—". How can I fix this?
Open the file using the io package for this to work with both python2 and python3 with encoding set to utf8 for this to work. When printing, When writing, write as a unicode string.
import io
outFile = io.open('./myFile.txt', 'w', encoding='utf8')
outFile.write(u'●')
outFile.close()
Tested on Python 2.7.8 and Python 3.4.2
If you are using Python 2, use codecs.open instead of open and unicode instead of str:
# -*- coding: utf-8 -*-
import codecs
outFile = codecs.open('./myFile.txt', 'wb', 'utf-8')
outFile.write(u"●")
outFile.close()
In Python 3, pass the encoding keyword argument to open:
# -*- coding: utf-8 -*-
outFile = open('./myFile.txt', 'w', encoding='utf-8')
outFile.write("●")
outFile.close()
>>> ec = u'\u25cf' # unicode("●", "UTF-8")
>>> open("/tmp/file.txt", "w").write(ec.encode('UTF-8'))
This should do the trick
# -*- coding: utf-8 -*-
outFile = open('./myFile.txt', 'wb')
outFile.write(u"\u25CF".encode('utf-8'))
outFile.close()
have a look at this
What your program does is to produce an output file in the same encoding as your program editor (the coding at the top does not matter, unless your program editor uses it for saving the file). Thus, if you open myFile.txt with a program that uses the same encoding as your program editor, everything looks fine.
This does not mean that your program works for everybody.
For this, you must do two things. You must first indicate the encoding used for text files on your machine. This is a little hard to detect, but the following should often work:
# coding=utf-8 # Put your editor's encoding here
import codecs
import locale
import sys
# Selection of the first non-None, reasonable encoding:
out_encoding = (locale.getlocale()[1]
or locale.getpreferredencoding()
or sys.stdin.encoding or sys.stdout.encoding
# Default:
or "UTF8")
outFile = codecs.open('./myFile.txt', 'w', out_encoding)
Note that it is very important to specify the right coding on top of the file: this must be your program editor's encoding.
If you know the encoding you want for your output file, you can directly put it in open(). Otherwise, the more general and portable out_encoding expression above should work for most users on most computers (i.e., whatever their encoding of choice is, they should be able to read "●" in the resulting file—assuming their computer's encoding can represent it).
Then you must print a string, not bytes:
outFile.write(u"●")
(note the leading u, meaning "unicode string").
For a deeper understanding of the issues at hand, one of my previous answers should be very helpful: UnicodeDecodeError when redirecting to file.
I'm very sorry, but writing a symbol to a text file without saying what the encoding of the file should be is simply non-sense.
It may not be evident at first sight, but text files are indeed encoded and may be encoded in different ways. If you have only letters (upper and lower case, but not accented oned), digits and simple symbols (everything that has an ASCII code below 128), all should be fine, because ASCII 7 bits is now a standard and in fact those characters have same representation in major encodings.
But as soon as you get true symbols, or accented chars, their representation vary from one encoding to the other. For example, the symbol ● has a UTF-8 representation of (Python coding) : \xe2\x97\x8f. What is worse, it cannot be represented in latin1 (ISO-8859-1) encoding.
Another example is the french e accent aigu : é it is represented in UTF8 as \xc3\xa9 (note 2 bytes), but is represented in Latin1 as \x89 (one single byte)
So I tested your code in my Ubuntu box using a UTF8 encoding and the command
cat myFile.txt ... correctly showed the bullet !
sba#sba-ubuntu:~/stackoverflow$ cat myFile.txt
●sba#sba-ubuntu:~/stackoverflow$
(as you didn't add any newline after the bullet, the prompt immediately follows it)
In conclusion :
Your code correctly writes the bullet to the file in UTF8 encoding. If your system uses natively another encoding (ISO-8859-1 or its variant Windows-1252) you cannot natively convert it because this character simply does not exist in this encodings.
But you can always see it in a text editor that supports different encoding like the excellent vim that exists on all major systems.
Proof of above :
On a Windows 7 computer, I opened a vim window and instructed it to accept utf8 with :set encoding='utf8'. I then pasted original code from OP and saved it to a file foo.py.
I opened a cmd.exe window and executed python foo.py (using a Python 2.7) : it created a file myFile.txt containing the 3 bytes (hexa) : e2 97 8f that is the utf8 representation of the bullet ● (I could confirm it with vim Tools/Hexa convert).
I could even open myFile.txt in idle and actually saw the bullet. Even notepad.exe could show the bullet !
So even on a Windows 7 computer that does not natively accept utf-8, the code from OP correctly generates a text file that when opened with a text editor accepting UTF-8 contains the bullet ●.
Of course, if I try to open myFile.txt with vim in latin1 mode, I get : â—, on a cmd windows with codepage 850, type myFile.txt shows ÔùÅ, and with codepage 1252 (variant of latin1) : â—.
In conclusion original OP code creates a correct utf8 encoded file - it is up to the reading part to interpret correctly utf8.

Python 2.7: Setting I/O Encoding, ’?

Attempting to write a line to a text file in Python 2.7, and have the following code:
# -*- coding: utf-8 -*-
...
f = open(os.path.join(os.path.dirname(__file__), 'output.txt'), 'w')
f.write('Smith’s BaseBall Cap') // Note the strangely shaped apostrophe
However, in output.txt, I get Smith’s BaseBall Cap, instead. Not sure how to correct this encoding problem? Any protips with this sort of issue?
You have declared your file to be encoded with UTF-8, so your byte-string literal is in UTF-8. The curly apostrophe is U+2019. In UTF-8, this is encoded as three bytes, \xE2\x80\x99. Those three bytes are written to your output file. Then, when you examine the output file, it is interpreted as something other than UTF-8, and you see the three incorrect characters instead.
In Mac OS Roman, those three bytes display as ’.
Your file is a correct UTF-8 file, but you are viewing it incorrectly.
There are a couple possibilities, but the first one to check is that the output file actually contains what you think it does. Are you sure you're not viewing the file with the wrong encoding? Some editors have an option to choose what encoding you're viewing the file in. The editor needs to know the file's encoding, and if it interprets the file as being in some other encoding than UTF-8, it will display the wrong thing even though the contents of the file are correct.
When I run your code (on Python 2.6) I get the correct output in the file. Another thing to try: Use the codecs module to open the file for UTF-8 writing: f = codecs.open("file.txt", "w", "utf-8"). Then declare the string as a unicode string withu"'Smith’s BaseBall Cap'"`.

Categories