Problem with viewing string output in Google Colab - python

When I run print(abide.description) in Google Colab, I expect a readable multi-line description. Instead, the entire string is printed on a single line, which makes it quite difficult to read and interpret. How can I get the output to display with proper line breaks?
My code snippet:
print(abide.description)
The output is one long line: a bytes literal in which the line breaks appear as literal \n escapes.

The issue is that abide.description is returning bytes rather than a string. If you want it to print as a normal string, you can use the bytes.decode() method to convert the bytes to a unicode string.
For example:
content_bytes = b'this is a byte string\nand it will not be wrapped\nunless it is first decoded'
print(content_bytes)
# b'this is a byte string\nand it will not be wrapped\nunless it is first decoded'
print(content_bytes.decode())
# this is a byte string
# and it will not be wrapped
# unless it is first decoded
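Applied to the original question, assuming abide.description really is a bytes object (as described above), the fix is a one-liner:
print(abide.description.decode())  # bytes -> str; the \n escapes become real line breaks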

Related

Python printing bytes strangely

So, I have this very simple code:
from construct import VarInt
a = VarInt.build(12345)
print(a)
with open('hello.txt', 'wb') as file:
    file.write(a)
When I print a, only the first byte shows as a hex escape; the last byte, for some reason, does not display properly. It prints b'\xb9`' when it should print b'\xb9\x60', or at least that's what I would like it to print. When I look at the file I stored, the bytes are saved exactly as they should be, so there is no issue there. Does anybody know what's going on here? Also, some integers print properly, but this one, for example, does not.
It's not a single byte but
b'\xb9`'
Do you see the "`" after the 9? It's the character encoded as 0x60. If you want to display all bytes as hexadecimal, you can try this snippet:
print(" ".join(hex(n) for n in a))
Why do you think it does not print properly? Assuming an ASCII-derived code page (ASCII, UTF-8, or Latin), the code for the back quote (`) is 96, or 0x60.
Since \xb9 does not map to a character in your system code page, it is escaped; but since \x60 does map to the back quote, the representation simply uses that character.
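A quick demonstration of this (Python 3; note that the separator argument to bytes.hex() requires Python 3.8+):
a = b'\xb9\x60'
print(a)           # b'\xb9`'  -- 0x60 is printable, so the repr shows it as `
print(a.hex(' '))  # b9 60    -- every byte rendered as two hex digits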

The correct way to load and read a JSON file containing special characters in Python

I'm working with a JSON file that contains some strings in an unknown encoding, like the example below:
"L\u00c3\u00aa Nguy\u00e1\u00bb\u0085n Ph\u00c3\u00ba"
I have loaded this text using the json.load() function in a Python 3.7 environment and tried to encode/decode it with several methods I found around the Internet, but I still cannot get the proper string I expect. (In this case, it should be Lê Nguyễn Phú.)
My question is: which encoding did they use, and how can I parse this text properly in Python?
The JSON file comes from an external source that I don't control, so I cannot know or change how the text was encoded.
[Updated] More details:
The JSON file looks like this:
{
    "content": "L\u00c3\u00aa Nguy\u00e1\u00bb\u0085n Ph\u00c3\u00ba"
}
Firstly, I loaded the JSON file:
with open(json_path, 'r') as f:
    data = json.load(f)
But when I extract the content, it's not what I expected:
string = data.get('content', '')
print(string)
'Lê Nguyá»\x85n Phú'
Someone took "Lê Nguyễn Phú", encoded that as UTF-8, and then took the resulting series of bytes and lied to a JSON encoder by telling it that those bytes were the characters of a string. The JSON encoder then cooperatively produced garbage by encoding those characters. But it is reversible garbage. You can reverse this process using something like
json.loads(in_string).encode("latin_1").decode("utf_8")
This decodes the string from the JSON, extracts the bytes from it (the 256 symbols in Latin-1 are in one-to-one correspondence with the first 256 Unicode code points), and then re-decodes those bytes as UTF-8.
The big problem with this technique is that it only works if you are sure that all of your input is garbled in this fashion... there is no completely reliable way to look at an input and decide whether this broken decoding should be applied to it. If you try to apply it to a validly-encoded string containing code points above U+00FF, it will crash. But if you apply it to a validly-encoded string containing only code points up to U+00FF, it will turn your perfectly good string into a different kind of garbage.
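Putting it together for the file in the question, a minimal sketch (json_path and the 'content' key are taken from the question above):
import json

with open(json_path, 'r') as f:
    data = json.load(f)

fixed = data['content'].encode('latin_1').decode('utf_8')
print(fixed)  # Lê Nguyễn Phú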

Converting a string formatted in hex in Python to binary data correctly

I have a string formatted as '\x00\x00\x00\x00' and need it to be formatted such that, when printed, it appears in the console as b'\x00\x00\x00\x00'
How do I do this?
Edit: a different version of the code printed a value formatted as b'\xf4\x00\x00\x00' etc., but on my computer it prints '\xf4\x00\x00\x00'.
Just add the b prefix before the string literal; that way you'll be defining it as bytes:
# example
s = b"\x00\x00\x00\x00"
print(s)
If instead you're receiving the string from somewhere else and you're not writing it manually, you can just encode the string into bytes:
# another example
# let's pretend that we received the value from, say, a function
s = "\x00\x00\x00\x00".encode() # again, pretend that this value has just been retrieved from a function
print(s)
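One caveat: str.encode() defaults to UTF-8, so any character above U+007F encodes to more than one byte. If you need a strict one-byte-per-character mapping, encode as Latin-1 instead:
s = "\xf4\x00\x00\x00"
print(s.encode("latin_1"))  # b'\xf4\x00\x00\x00'
print(s.encode("utf_8"))    # b'\xc3\xb4\x00\x00\x00' -- probably not what was intended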

Python: Correct Way to refer to index of unicode string

Not sure if this is exactly the problem, but I'm trying to insert a tag on the first letter of a unicode string and it seems that this is not working. Could these be because unicode indices work differently than those of regular strings?
Right now my code is this:
for index, paragraph in enumerate(intro[2:-2]):
    intro[index] = bold_letters(paragraph, 1)

def bold_letters(string, index):
    return "<b>"+string[0]+"</b>"+string[index:]
And I'm getting output like this:
<b>?</b>?רך האחד וישתבח הבורא בחכמתו ורצונו כל צבא השמים ארץ וימים אלה ואלונים.
It seems the unicode gets messed up when I try to insert the HTML tag. I tried messing with the insert position but didn't make any progress.
Example desired output (hebrew goes right to left):
>>>first_letter_bold("הקדמה")
"הקדמ<\b>ה<b>"
BTW, this is for Python 2
You are right: indices address individual bytes when you are dealing with raw bytes, i.e. str in Python 2.x.
To work seamlessly with Unicode data, you need to first let Python 2.x know that you are dealing with Unicode, then do the string manipulation. You can finally convert it back to raw bytes to keep the behavior abstracted, i.e. you take a str and you return a str.
Ideally, you should convert all the data from raw UTF-8 bytes to unicode objects at the very beginning of your code (I am assuming your source encoding is UTF-8, because that is the standard used by most applications these days) and convert back to raw bytes only at the very end, e.g. when saving to a DB or responding to a client. Some frameworks might handle that for you so that you don't have to worry.
def bold_letters(string, index):
    string = string.decode('utf8')
    string = "<b>"+string[0]+"</b>"+string[index:]
    return string.encode('utf8')
This will also work for ASCII because UTF-8 is a superset of ASCII. You can get a better understanding of how Unicode works in general, and in Python specifically, by reading http://nedbatchelder.com/text/unipain.html
In Python 3.x, str is already a Unicode type, so you don't have to do any of this explicitly.
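As a quick sanity check (Python 2, assuming the source file declares a UTF-8 coding line):
# -*- coding: utf8 -*-
print(bold_letters("הקדמה", 1))  # <b>ה</b>קדמה -- UTF-8 bytes in, UTF-8 bytes out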
You should use Unicode strings. Byte strings in UTF-8 use a variable number of bytes per character; Unicode strings use one index position per character (at least for characters in the BMP on Python 2, i.e. the first 65536 characters):
# coding: utf8
import io  # io.open supports the encoding argument on Python 2
s = u"הקדמה"
t = u'<b>'+s[0]+u'</b>'+s[1:]
print(t)
with io.open('out.htm', 'w', encoding='utf-8-sig') as f:
    f.write(t)
Output:
<b>ה</b>קדמה
But my Chrome browser displays out.htm as: [screenshot of the rendered page: הקדמה with the first letter ה in bold]

Convert Unicode string to UTF-8, and then to JSON

I want to encode a string in UTF-8 and view the corresponding UTF-8 bytes individually. In the Python REPL the following seems to work fine:
>>> unicode('©', 'utf-8').encode('utf-8')
'\xc2\xa9'
Note that I’m using U+00A9 COPYRIGHT SIGN as an example here. The '\xC2\xA9' looks close to what I want — a string consisting of two separate code points: U+00C2 and U+00A9. (When UTF-8-decoded, it gives back the original string, '\xA9'.)
Then, I want the UTF-8-encoded string to be converted to a JSON-compatible string. However, the following doesn’t seem to do what I want:
>>> import json; json.dumps('\xc2\xa9')
'"\\u00a9"'
Note that it generates a string containing U+00A9 (the original symbol). Instead, I need the UTF-8-encoded string, which would look like "\u00C2\u00A9" in valid JSON.
TL;DR How can I turn '©' into "\u00C2\u00A9" in Python? I feel like I’m missing something obvious — is there no built-in way to do this?
If you really want "\u00c2\u00a9" as the output, give json a Unicode string as input.
>>> print json.dumps(u'\xc2\xa9')
"\u00c2\u00a9"
You can generate this Unicode string from the raw bytes:
s = unicode('©', 'utf-8').encode('utf-8')
s2 = u''.join(unichr(ord(c)) for c in s)
I think what you really want is "\xc2\xa9" as the output, but I'm not sure how to generate that yet.
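For reference, the same idea is shorter in Python 3, where decoding the UTF-8 bytes as Latin-1 maps each byte to one code point:
import json

s = '©'.encode('utf_8').decode('latin_1')  # '\xc2\xa9' -- two code points
print(json.dumps(s))  # "\u00c2\u00a9"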
