I am trying to do the first CryptoPals challenge where I have to implement the base64 algorithm. I am using Python.
It has the following advice:
Always operate on raw bytes, never on encoded strings. Only use hex and base64 for pretty-printing.
So say I have a string and I have to convert it into bits. The issue is that I have seen people specify UTF-8 or ASCII before converting the string to bytes/bits. (See here)
Should I always proceed with UTF-8? Or could there be issues down the road if my script parses something encoded in ASCII?
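For example, this is roughly what I have in mind (just a sketch, not my actual solution; to_base64 is a placeholder for the routine I still have to write):
# Sketch: get raw bytes from a str before doing any base64 work.
# For ASCII-only text, encoding with 'utf-8' and 'ascii' produces the
# same bytes, since ASCII is a subset of UTF-8.
s = "Hello, CryptoPals"
raw = s.encode("utf-8")            # b'Hello, CryptoPals'
assert raw == s.encode("ascii")    # identical for pure-ASCII input
# to_base64(raw) would then operate on the bytes and return an ASCII
# string only for pretty-printing.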
I'm getting decoding exceptions when reading export files from multiple applications. I've been running into this for a month as I learn far more about Unicode than I ever wanted to know, but some fundamentals are still missing. I understand UTF, I understand codepages, and I understand how they tend to be used in practice (e.g. a single codepage per document, though I can't imagine that's still true today; see the back page of a health statement with 15 languages).
Is it true that utf-8 can and does encode every possible unicode char? How then is it possible for one application to write a utf-8 file and another to not be able to read it?
When UTF is used, codepages are NOT used; is that correct? As I think it through, the codepage is the older style and is made obsolete by UTF. I'm sure there are some exceptions.
UTF could also be looked at as a data compression scheme, rather than just an encoding scheme.
But there I'm stuck, because in practice I have 6 different applications, made in different countries, which can create export files, 3 in utf-8 and 3 in cp1252, yet Python 3.7 cannot read them without errors:
'charmap' codec can't decode byte 0x9d in position 1555855: character maps to <undefined>
'charmap' codec can't decode byte 0x81 in position 4179683: character maps to <undefined>
I use Edit Pro to examine the files, and it reads them successfully. It points to a line that contains an extra pair of special double quotes:
"Metro Exodus review: “Not only the best Metro yet, it's one of the best shooters in years” | GamesRadar+"
Removing that ” allows Python to continue reading the file, until the next error.
Python reports it as char 0x9d, but a really old editor (Codewright, I believe) reports it as 0x94. Verified on the internet that it is an 0x93/0x94 pair, so it must be true. ;-)
It is very troublesome that I don't know for sure what the actual bytes are, as there are so many layers of translation, interpretation, formatting for display, etc.
So the Visual Studio debug report of 0x9d is a misdirect. What's going on with the Python library that it would report this?
How is this possible? I can find no info about how characters in one codepage can be invalid under UTF (if that's the problem). What would I even search under?
It should not be this hard. I have 30 years of experience programming in C++, SQL, you name it; learning new libraries and languages is just breakfast.
I also do not understand why the information to handle this is so hard to find. Surely numerous other programmers doing data conversions and imports/exports between applications have run into this over the decades.
The files I'm importing are CSV files from 6 apps and JSON files from another. The 6 apps export in utf-8 and cp1252 (as reported by Edit Pro), and the other app exports JSON in utf-8, though I could also choose CSV.
The 6 apps run on an iPhone and export files that I'm attempting to read on Windows 10. I'm running Python 3.7.8, though this problem has persisted since 3.6.3.
Thanks in advance
Dan
The error 'charmap' codec can't decode byte... shows that you are not using utf-8 to read the file. That's the source of your struggles on this one. Unless the file starts with a BOM (byte order mark), you kinda have to know how the file was encoded to decode it correctly.
utf-8 encodes all unicode characters, and Python should be able to read them all. Displaying them is another matter; you need fonts that cover those characters for that part. You were reading with "charmap" (the Windows default codepage), not "utf-8", and that's why you got the error.
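For example, being explicit about the encoding when opening the file avoids the Windows default (the file names here are just placeholders):
# open() without an encoding argument uses the locale codepage on Windows
# (cp1252, i.e. the "charmap" codec), which is what triggers the error.
with open("export_utf8.csv", encoding="utf-8") as f:
    utf8_data = f.read()

# For the apps that really do export cp1252, say so explicitly:
with open("export_cp1252.csv", encoding="cp1252") as f:
    cp1252_data = f.read()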
"when utf is used" ... there are several UTF encodings. utf-8, utf-16-be (big endian), utf-16-le (little endian), utf-16 (synonym for utf-16-le), utf-32 variants (I've never seen this in the wild) and variants that include the BOM (byte order mark) which is an optional set of characters at the start of the file describing utf encoding type.
But yes, UTF encodings are meant to replace the older codepage encodings.
No, it's not compression. The encoded stream can be larger than the bytes needed to hold the string in memory. This is especially true of utf-8, less so with utf-16 (that's why Microsoft went with utf-16). But utf-8, being a superset of ASCII without the byte order issues of utf-16, has many other advantages (that's why all the sane people chose it). I can't think of a case where a UTF encoding would ever be smaller than the count of its characters.
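A quick way to see that it is not compression is to compare character counts with encoded byte counts:
s = "héllo wörld"                    # 11 characters, two of them non-ASCII
print(len(s))                        # 11
print(len(s.encode("utf-8")))        # 13 bytes: the accented chars take 2 bytes each
print(len(s.encode("utf-16-le")))    # 22 bytes: 2 bytes per character for this text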
I am trying to decode a .csv file that has strange characters in it, but nothing seems to work. It is Chinese, somehow encoded, and it looks like *ST东海A,海南大东海旅ć. I tried a bunch of different encodings but they don't seem to work. I checked with chardet and it says the file is 'UTF-8-SIG', but when I read it into a Python dataframe it still shows strange letters. Not sure where to look for solutions anymore, or even what to google. Thanks!
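This is roughly what I have been trying, in case it helps (the file name and the list of encodings are just my guesses):
import pandas as pd

# chardet said 'UTF-8-SIG', but the text still looks wrong, so I also
# tried a few encodings that are common for Chinese text.
for enc in ("utf-8-sig", "gbk", "gb18030", "big5"):
    try:
        df = pd.read_csv("stocks.csv", encoding=enc)
        print(enc, df.iloc[0, 0])    # eyeball the first cell
    except Exception as e:
        print(enc, "failed:", e)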
I'm receiving frames from a websocket server and I'm not sure how to interpret the bytes objects, because actual words show up mixed in with the raw bytes.
I get something like this:
b'\x00\x17\x04\x00\x00\x00\xc0\x05FOCUS\x01\x00\xff\xfc\x00\x05;\xea\x01\x03\xe8\x81'
This one has 'FOCUS' and a ';' in it. I am expecting 'FOCUS' to be part of the payload, but I don't know why it's showing up as is, and not in hex form. Can someone explain what's going on and how I can unpack the rest of the data?
Also, it seems I'm getting the data in reverse order. I think \x81 is supposed to be the first byte of the frame.
I'm using Python 3.6 and the websocket-client lib. Thank you.
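For reference, this is how I have been poking at the frame so far (the slice offsets are just my guesses at the field boundaries):
frame = b'\x00\x17\x04\x00\x00\x00\xc0\x05FOCUS\x01\x00\xff\xfc\x00\x05;\xea\x01\x03\xe8\x81'

# The repr of a bytes object shows printable ASCII bytes as characters,
# so 'FOCUS' and ';' are just the bytes 0x46 0x4f 0x43 0x55 0x53 and 0x3b.
print(frame.hex())     # every byte as two hex digits
print(frame[8:13])     # b'FOCUS'
print(frame[8:13] == bytes([0x46, 0x4F, 0x43, 0x55, 0x53]))    # True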
Among all the encodings available here http://docs.python.org/library/codecs.html
which one should I use for decoding binary data into unicode without it becoming corrupted when I encode it back to a string?
I've used raw_unicode_data and it doesn't work.
Example: I upload a picture in a POST (but not as a file attachment). Django converts the POST data to unicode using utf-8. However, when converting back from unicode to a string (again using utf-8), the data becomes corrupted. I used raw_unicode_data and the same thing happened (though only a few bytes were corrupted this time). Which encoding should I use so that the decode and encode steps don't corrupt the data?
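A small reproduction of what I mean (made-up bytes, not my actual image data):
raw = b'\x89PNG\r\n\x1a\n\x00\xff\xfe'          # arbitrary binary, not valid UTF-8

text = raw.decode("utf-8", errors="replace")    # invalid bytes become U+FFFD
back = text.encode("utf-8")
print(back == raw)                              # False: the original bytes are lost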
If you want to post binary data, use the base64 encoding.
http://docs.python.org/library/base64.html
"Binary data" is not text, therefore converting it to a unicode is meaningless. If there is text embedded in the binary data then extract it first and decode using the encoding given in the specification for the data format.
As others have already stated, your question isn't particularly clear. If you are wanting to funnel binary data through a text channel (such as POST), then base64 is the right format to use with appropriate data transformation operations in the client and the server (binary data -> base64 text -> pass over text channel -> base64 text -> binary data).
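A minimal sketch of that pipeline (the variable names are just for illustration):
import base64

payload = b'\x89PNG\r\n\x1a\n\x00\xff\xfe'              # arbitrary binary payload

# client side: binary -> base64 text, safe to put in a POST field
wire_text = base64.b64encode(payload).decode("ascii")

# server side: base64 text -> the original binary, byte for byte
restored = base64.b64decode(wire_text)
assert restored == payload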
Alternatively, if you are wanting to tolerate improperly encoded text (e.g. as Python 3 tries to do for some interfaces such as file paths and environment variables), then Python 3.1 and later offer the surrogateescape error handler, which converts invalid values into a form that isn't valid readable text, but allows the original binary data to be faithfully recreated when encoding back to bytes.
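A sketch of the surrogateescape round trip (the byte values are just an example):
raw = b'abc\xff\xfe'                                      # not valid UTF-8

text = raw.decode("utf-8", errors="surrogateescape")      # invalid bytes -> lone surrogates
back = text.encode("utf-8", errors="surrogateescape")     # surrogates -> original bytes
assert back == raw                                        # faithful round trip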