Getting the encoding of a string that has \x values - python
I have the following string, which I am receiving via a Python server. I do not have access to that server.
\xa1\x823\xc2\xd5\x823\xc2\xff\x823\xc2\x12\x833\xc2\x1b\x833\xc2\x16\x833\xc2\x1e\x833\xc2 \x833\xc2\x0e\x833\xc2\x03\x833\xc2\x01\x833\xc2\x10\x833\xc2\'\x833\xc2\x17\x833\xc2\x00\x833\xc2\x11\x833\xc2$\x833\xc2$\x833\xc2\x1f\x833\xc2\x02\x833\xc2\xc0\x823\xc2\x94\x823\xc2\x91\x823\xc2\x7f\x823\xc2a\x823\xc2R\x823\xc2N\x823\xc2e\x823\xc2+\x823\xc2\xd3\x813\xc2\xee\x813\xc2\xe9\x813\xc2\xdf\x813\xc2\xfb\x813\xc2(\x823\xc25\x823\xc2\x17\x823\xc2\x1c\x823\xc2;\x823\xc2\xa2\x823\xc2\xe5\x823\xc2\xc2\x823\xc2\xbc\x823\xc2\x9b\x823\xc2\x13\x823\xc2\xbd\x813\xc2\xc0\x813\xc2\xc5\x813\xc2\xf2\x813\xc2(\x823\xc27\x823\xc2;\x823\xc2.\x823\xc2,\x823\xc20\x823\xc2\x11\x823\xc2\x0b\x823\xc2\xdf\x813\xc2\xb0\x813\xc2\xa2\x813\xc2\x7f\x813\xc2v\x813\xc2y\x813\xc2l\x813\xc2m\x813\xc2z\x813\xc2\x8c\x813\xc2\x89\x813\xc2w\x813\xc2Y\x813\xc2Y\x813\xc2c\x813\xc2e\x813\xc2Z\x813\xc2\x10\x813\xc2\xd2\x803\xc2\x8c\x803\xc2G\x803\xc2)\x803\xc2-\x803\xc2\x19\x803\xc2\xef\x7f3\xc2\xc9\x7f3\xc2\xc9\x7f3\xc2\xc8\x7f^C}3\xc2\xe7}3\xc2\xdd}3\xc2\xbc}3\xc2\xa9}3\xc2\xb7}3\xc2\xc1}3\xc2\xb0}3\xc2\x95}3\xc2\x9f}3\xc2\xd8}3\xc2\x05~3\xc2\x12~3\xc2\x15~3\xc2\r~3\xc2\x15~3\xc23~3\xc2/~3\xc2\x1d~3\xc2\x17~3\xc2\x15~3\xc2\x1d~3\xc2\x1e~3\xc2\x1a~3\xc2\x1f~3\xc2E~3\xc2W~3\xc2C~3\xc2o~3\xc2g~3\xc2p~3\xc2\xa3~3\xc2\x9b~3\xc2\x9e~3\xc2\x9e~3\xc2\xce~3\xc2\xe5~3\xc2\xe0~3\xc2\xd2~3\xc2\xc6~3\xc2\xc6~3\xc2\xc1~3\xc2\xca~3\xc2\xd6~3\xc2\xce~3\xc2\xa4~3\xc2\xad~3\xc2\xe1~3\xc2\xf8~3\xc2\xf8~3\xc2\x11\x7f3\xc2;\x7f3\xc2)\x7f3\xc2\xe6~3\xc2\xc4~3\xc2\xcc~3\xc2\xcd~3\xc2\xca~3\xc2\xc4~3\xc2\xbf~3\xc2\xcc~3\xc2\xc8~3\xc2\xc8~3\xc2\xd3~3\xc2\xd5~3\xc2\xa2~3\xc2L~3\xc2\x1c~3\xc2\x11~3\xc2\x14~3\xc2\x0e~3\xc2\x01~3\xc2\xf2}3\xc2\xf8}3\xc2\x05~3\xc2\xe3}3\xc2\xb0}3\xc2\x9c}3\xc2\x9e}3\xc2\x90}3\xc2\xcc}3\xc2\x1b~3\xc2\x05~3\xc2\xfa}3\xc2\x06~3\xc2\xf7}3\xc2\xf6}3\xc2\x15~3\xc2\x1f~3\xc2\x1b~3\xc2#~3\xc23~3\xc2H~3\xc2o~3\xc2\x89~3\xc2\x89~3\xc2\x94~3\xc2\x97~3\xc2\x84~3\xc2m~3\xc2\x8d~3\xc2\xdf~3\xc2\x0e\x7f3\xc2\x10\
x7f3\xc27\x7f3\xc2]\x7f3\xc2i\x7f3\xc2e\x7f3\xc2[\x7f3\xc2k\x7f3\xc2x\x7f3\xc2\x89\x7f3\xc2\x9b\x7f3\xc2\xae\x7f3\xc2\xbd\x7f3\xc2\xb2\x7f3\xc2\xa4\x7f3\xc2\xba\x7f3\xc2\xce\x7f3\xc2\xd1\x7f3\xc2\xd0\x7f3\xc2\xc7\x7f3\xc2\xaa\x7f3\xc2m\x7f3\xc25\x7f3\xc2\x1e\x7f3\xc2\x1f\x7f3\xc2\x1b\x7f3\xc2\x1e\x7f3\xc2\r\x7f3\xc2\xed~3\xc2\xe3~3\xc2\xdd~3\xc2\xe6~3\xc2\x15\x7f3\xc2:\x7f3\xc29\x7f3\xc2B\x7f3\xc2N\x7f3\xc21\x7f3\xc2\x11\x7f3\xc2\x13\x7f3\xc2:\x7f3\xc2k\x7f3\xc2v\x7f3\xc2u\x7f3\xc2\x89\x7f3\xc2\x9f\x7f3\xc2\xa7\x7f3\xc2\xbe\x7f3\xc2\xd1\x7f3\xc2\xec\x7f3\xc2\n\x803\xc2\t\x803\xc2\x1f\x803\xc2Y\x803\xc2{\x803\xc2t\x803\xc2p\x803\xc2i\x803\xc2
In reality, this should decode to a sequence of floating-point numbers. How can I decode it? How can I determine the encoding of the string? Preferably using Python! I tried chardet, decode('utf8'), and what not, with no luck. Any help is appreciated.
After trying this:

c=a.decode('utf-16-be', errors='ignore').encode('ascii')

I got this:

UnicodeEncodeError: 'ascii' codec can't encode characters in position
0-199: ordinal not in range(128)

And after trying this:

c=a.decode('utf-16-le').encode('ascii')

I got this:

  File "/usr/lib/python2.7/encodings/utf_16_le.py", line 16, in decode
    return codecs.utf_16_le_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x33 in position
470: truncated data
It looks like this data has been packed using Python's struct module.
I'm not sure what the first two characters in the string represent; they aren't floats, but they could be chars or short ints. The remainder of the string consists of floats.
Ignoring the first two characters for now, we get:
struct.unpack('!143f', s2[2:]) # s2 is the example string from the question
(9.037722037419371e-08, 9.038631532121144e-08, 9.036266845896535e-08, 9.071190731901879e-08, 9.06009489654025e-08, 9.058094008196349e-08, 9.058094008196349e-08, 9.058457806077058e-08, 9.063005279585923e-08, 9.067370854154433e-08, 9.071736428722943e-08, 9.071190731901879e-08, 9.064278572168405e-08, 9.063005279585923e-08, 9.062277683824504e-08, 9.059003502898122e-08, 9.058821603957767e-08, 9.057912109255994e-08, 9.057366412434931e-08, 9.054456029389257e-08, 9.051727545283939e-08, 9.05099994952252e-08, 9.04863526329791e-08, 9.046270577073301e-08, 9.04463348661011e-08, 9.04663437495401e-08, 9.050090454820747e-08, 9.054274130448903e-08, 9.050090454820747e-08, 9.052273242105002e-08, 9.062277683824504e-08, 9.061368189122732e-08, 9.067916550975497e-08, 9.07519250858968e-08, 9.073009721305425e-08, 9.071918327663298e-08, 9.06882604567727e-08, 9.066097561571951e-08, 9.071372630842234e-08, 9.071190731901879e-08, 9.069735540379043e-08, 9.071918327663298e-08, 9.071918327663298e-08, 9.071918327663298e-08, 9.075738205410744e-08, 9.076283902231808e-08, 9.072645923424716e-08, 9.074101114947553e-08, 9.070281237200106e-08, 9.069189843557979e-08, 9.069371742498333e-08, 9.066825157333369e-08, 9.062277683824504e-08, 9.060276795480604e-08, 9.063732875347341e-08, 9.06409667322805e-08, 9.068644146736915e-08, 9.074646811768616e-08, 9.072645923424716e-08, 9.073009721305425e-08, 9.074464912828262e-08, 9.06882604567727e-08, 9.068644146736915e-08, 9.074283013887907e-08, 9.075010609649325e-08, 9.070463136140461e-08, 9.069189843557979e-08, 9.075920104351098e-08, 9.079012386337126e-08, 9.078830487396772e-08, 9.034265957552634e-08, 9.036630643777244e-08, 9.038813431061499e-08, 9.04245140986859e-08, 9.051910154767029e-08, 9.056457628275894e-08, 9.056639527216248e-08, 9.061550798605822e-08, 9.071737139265679e-08, 9.07337422972887e-08, 9.077557905357025e-08, 9.042452120411326e-08, 9.040087434186717e-08, 9.037540849021752e-08, 9.03626755643927e-08, 9.039541737365653e-08, 9.046453897099127e-08, 
9.043907311934163e-08, 9.042634019351681e-08, 9.046453897099127e-08, 9.050455673786928e-08, 9.052456562130828e-08, 9.050637572727283e-08, 9.046090099218418e-08, 9.041542625709553e-08, 9.036449455379625e-08, 9.080286389462344e-08, 9.034448567035724e-08, 9.077376006416671e-08, 9.070827644563906e-08, 9.066825867876105e-08, 9.068826756220005e-08, 9.070100048802487e-08, 9.065916373174332e-08, 9.065552575293623e-08, 9.073919926549934e-08, 9.038268444783171e-08, 9.039359838425298e-08, 9.044453008755227e-08, 9.059004923983593e-08, 9.06518948795565e-08, 9.06682657841884e-08, 9.079195706362952e-08, 9.044089921417253e-08, 9.044089921417253e-08, 9.042089033073353e-08, 9.041361437311934e-08, 9.041725235192644e-08, 9.039360548968034e-08, 9.042816628834771e-08, 9.048091698105054e-08, 9.053146499127251e-08, -75.2074203491211, -71.7074203491211, -61.10371017456055, -55.85371017456055, -58.35371017456055, -87.2074203491211, -102.7074203491211, -103.7074203491211, -107.2074203491211, -118.2074203491211, -114.7074203491211, -111.2074203491211, -105.7074203491211, -74.7074203491211, -58.10371017456055, -55.35371017456055, -58.55054473876953, 1.054752845871448e+18, 1005890699264.0, 6.59220528669655e+16, -7.216911831845169e-31)
Treating the first two characters as chars:
struct.unpack('!2c', s2[:2])
('5', 'g')
As a short int:
struct.unpack('!h', s2[:2])
(13671,)
You can unpack the whole string at once by combining the formats:
>>> struct.unpack('!h143f', s2)
The format string consists of three parts:
! indicates that we are using network (big-endian) byte order.
h indicates that the first 2 bytes are a short (the size of a short is 2); if the first two bytes were chars (size 1) we would use 2c instead of h.
143f indicates that 143 floats follow (the size of a float is 4).
Added together, the sizes equal the length of the input string:
>>> 2 + (143 * 4) == len(s2) == 574
True
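To sanity-check a guessed struct format, a round trip with known values helps. A minimal sketch using the same short-plus-floats layout, scaled down to three floats (the '!h3f' format and the sample values are illustrative, not from the question):

```python
import struct

# Pack a 2-byte short followed by 3 big-endian floats, analogous to '!h143f'
packed = struct.pack('!h3f', 13671, 1.5, -2.25, 0.125)
assert len(packed) == 2 + 3 * 4  # struct.calcsize('!h3f') == 14

# Unpacking recovers the original values (these floats are exactly
# representable in 32 bits, so the round trip is exact)
values = struct.unpack('!h3f', packed)
print(values)  # (13671, 1.5, -2.25, 0.125)
```

The same length arithmetic used above (2 + 143 * 4 == 574) is what struct.calcsize computes for you.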
Related
pyodbc unpacking SQL_GUID for converting to lowercase string
sql-server and pyodbc return all SQL_GUID datatypes in uppercase, while queries are case-insensitive:

psql.py --sql="select controllerid from controllers where controllerid='F573A57D-9247-44CB-936A-D16DD4E8327F'"
[('F573A57D-9247-44CB-936A-D16DD4E8327F', )]
psql.py --sql="select controllerid from controllers where controllerid='f573a57d-9247-44cb-936a-d16dd4e8327f'"
[('F573A57D-9247-44CB-936A-D16DD4E8327F', )]

I want it to output in lowercase. I added an output converter via pyodbc.add_output_converter() which just calls lower() on SQL_GUID values, but the value is a packed structure:

def guid_to_lowercase(value):
    return value.lower()
pyodbc.add_output_converter(pyodbc.SQL_GUID, guid_to_lowercase)

[(b'}\xa5s\xf5g\x92\xcbd\x93j\xd1m\xd4\xe82\x7f', )]

It looks like bytes, but decoding it fails:

def guid_to_lowercase(value):
    #log.error("type:{}".format(type(value)))
    return value.decode("ascii").lower()

UnicodeDecodeError: 'ascii' codec can't decode byte 0xa5 in position 1: ordinal not in range(128)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa5 in position 1: invalid start byte

I assume I need to unpack it, but what is the format? Outside of pyodbc I can call lower() on the uuid.
When we print out the raw bytes returned by the server for the GUID 'F573A57D-9247-44CB-936A-D16DD4E8327F' we get

b'}\xa5s\xf5g\x92\xcbd\x93j\xd1m\xd4\xe82\x7f'

Converting the ASCII characters to their hex values, removing the \x prefixes, and adding some spaces we get

7da573f5 4792 cb44 936a d16dd4e8327f

revealing that SQL Server ODBC is returning the bytes of the GUID, but with the first three segments in little-endian order and the last two segments in big-endian order. So, if we change our output converter function to

def guid_to_lowercase(value):
    first_three_values = struct.unpack('<I2H', value[:8])
    fourth_value = struct.unpack('>H', value[8:10])[0]
    fifth_value = struct.unpack('>Q', b'\x00\x00' + value[10:16])[0]
    guid_string_parts = (
        '{:08x}'.format(first_three_values[0]),
        '{:04x}'.format(first_three_values[1]),
        '{:04x}'.format(first_three_values[2]),
        '{:04x}'.format(fourth_value),
        '{:012x}'.format(fifth_value),
    )
    return '-'.join(guid_string_parts)

it returns 'f573a57d-9247-44cb-936a-d16dd4e8327f'.
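As an aside, the standard library's uuid module already understands this mixed-endian layout: uuid.UUID accepts the raw 16 bytes via its bytes_le parameter, which treats the first three fields as little-endian, matching the layout described above. A sketch of an equivalent converter (untested against pyodbc itself):

```python
import uuid

def guid_to_lowercase(value):
    # bytes_le: first three fields little-endian, last two big-endian,
    # which matches the GUID byte layout returned by SQL Server ODBC.
    # str() of a UUID is the lowercase canonical form.
    return str(uuid.UUID(bytes_le=value))

# Round trip: take a known GUID, get its little-endian byte form,
# and convert it back to the lowercase string
raw = uuid.UUID('f573a57d-9247-44cb-936a-d16dd4e8327f').bytes_le
print(guid_to_lowercase(raw))  # f573a57d-9247-44cb-936a-d16dd4e8327f
```

This avoids hand-rolled struct formats entirely, at the cost of relying on bytes_le matching the driver's byte order.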
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 40: ordinal not in range(128)
I'm trying to save part of the contents of a dictionary to a file, but when I try to write it I get the following error:

Traceback (most recent call last):
  File "P4.py", line 83, in <module>
    outfile.write(u"{}\t{}\n".format(keyword, str(tagSugerido)).encode("utf-8"))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 40: ordinal not in range(128)

And here is the code:

from collections import Counter

with open("corpus.txt") as inf:
    wordtagcount = Counter(line.decode("latin_1").rstrip() for line in inf)
with open("lexic.txt", "w") as outf:
    outf.write('Palabra\tTag\tApariciones\n'.encode("utf-8"))
    for word,count in wordtagcount.iteritems():
        outf.write(u"{}\t{}\n".format(word, count).encode("utf-8"))

"""
2) TAGGING USING THE MODEL
Given the test files, assign each word the most probable tag according to the model.
Save the result in files where each line has this format:
Word    Prediction
"""
file=open("lexic.txt", "r") # we open the lexic file (our model)
data=file.readlines()
file.close()

diccionario = {}

"""
In this portion of code we iterate over the lines of the .txt document and build a dictionary with a word as the key and a list as the value
Key: word
Value: list ([tag, #occurrencesWithTheTag])
"""
for linea in data:
    aux = linea.decode('latin_1').encode('utf-8')
    sintagma = aux.split('\t') # Here we split the string into a list: [word, tag, occurrences]
    if (sintagma[0] != "Palabra" and sintagma[1] != "Tag"): # We are not interested in the first line of the file
        if (diccionario.has_key(sintagma[0])): # Here we check whether the word is already in the dictionary
            aux_list = diccionario.get(sintagma[0])
            aux_list.append([sintagma[1], sintagma[2]]) # We add the tag and the occurrences for this word to the list
            diccionario.update({sintagma[0]:aux_list}) # Update the value with the new list (new list = previous list + appended element)
        else: # If the key does not yet exist in the dictionary, we add the values directly (no need to append)
            aux_list_else = ([sintagma[1],sintagma[2]])
            diccionario.update({sintagma[0]:aux_list_else})

"""
Here we create a new dictionary based on the dictionary created before. In this new dictionary (diccionario2) we want to keep the following information:
Key: word
Value: list ([suggestedTag, #occurrencesOfTheWordInTheDocument, probability])
To retrieve the information from diccionario, keep in mind: in case we have more than one tag associated with a word (keyword), we access the first tag with keyword[0] and its occurrence count with keyword[1]; from the second entry onward, we access the information this way:
diccionario.get(keyword)[2][0] -> the second tag
diccionario.get(keyword)[2][1] -> the second occurrence count
diccionario.get(keyword)[3][0] -> the third tag
... etc.
"""
diccionario2 = dict.fromkeys(diccionario.keys()) # We create a dictionary with the keys from diccionario, all values set to None
with open("estimation.txt", "w") as outfile:
    for keyword in diccionario:
        tagSugerido = unicode(diccionario.get(keyword[0]).decode('utf-8')) # tagSugerido is the tag with the most occurrences for a given keyword
        maximo = float(diccionario.get(keyword)[1]) # maximo holds the maximum number of occurrences for a keyword
        if ((len(diccionario.get(keyword))) > 2): # in case we have more than 2 tags for a given word
            suma = float(diccionario.get(keyword)[1])
            for i in range(2, len(diccionario.get(keyword))):
                suma += float(diccionario.get(keyword)[i][1])
                if (diccionario.get(keyword)[i][1] > maximo):
                    tagSugerido = unicode(diccionario.get(keyword)[i][0]).decode('utf-8')
                    maximo = float(diccionario.get(keyword)[i][1])
            probabilidad = float(maximo/suma)
            diccionario2.update({keyword:([tagSugerido, suma, probabilidad])})
        else:
            diccionario2.update({keyword:([diccionario.get(keyword)[0],diccionario.get(keyword)[1], 1])})
        outfile.write(u"{}\t{}\n".format(keyword, tagSugerido).encode("utf-8"))

The desired output will look like this (keyword(String)  tagSugerido(String)):

Hello   NC
Friend  N
Run     V
...etc

The conflictive line is:

outfile.write(u"{}\t{}\n".format(keyword, str(tagSugerido)).encode("utf-8"))

Thank you.
Like zmo suggested:

outfile.write(u"{}\t{}\n".format(keyword, str(tagSugerido)).encode("utf-8"))

should be:

outfile.write(u"{}\t{}\n".format(keyword, tagSugerido.encode("utf-8")))

A note on unicode in Python 2

Your software should only work with Unicode strings internally, converting to a particular encoding on output. To prevent making the same error over and over again, you should make sure you understand the difference between the ascii and utf-8 encodings, and also between str and unicode objects in Python.

The difference between the ASCII and UTF-8 encodings: ASCII needs just one byte to represent every character in the ASCII charset/encoding. UTF-8 needs up to four bytes to represent the complete charset.

ascii (default)
1 If the code point is < 128, each byte is the same as the value of the code point.
2 If the code point is 128 or greater, the Unicode string can't be represented in this encoding. (Python raises a UnicodeEncodeError exception in this case.)

utf-8 (unicode transformation format)
1 If the code point is < 128, it's represented by the corresponding byte value.
2 If the code point is between 128 and 0x7ff, it's turned into two byte values between 128 and 255.
3 Code points > 0x7ff are turned into three- or four-byte sequences, where each byte of the sequence is between 128 and 255.

The difference between str and unicode objects: you can say that str is basically a byte string and unicode is a Unicode string. Both can have a different encoding like ascii or utf-8.

str vs. unicode
1 str = byte string (8-bit) - uses \x and two digits
2 unicode = unicode string - uses \u and four digits
3 basestring
     /\
    /  \
  str  unicode

If you follow some simple rules, you should be fine handling str/unicode objects in different encodings like ascii or utf-8 or whatever encoding you have to use:

Rules
1 encode(): gets you from Unicode -> bytes. encode([encoding], [errors='strict']) returns an 8-bit string version of the Unicode string.
2 decode(): gets you from bytes -> Unicode. decode([encoding], [errors]) interprets the 8-bit string using the given encoding.
3 codecs.open(encoding="utf-8"): read and write files directly to/from Unicode (you can use any encoding, not just utf-8, but utf-8 is the most common).
4 u"": makes your string literals into Unicode objects rather than byte sequences.
5 unicode(string[, encoding, errors])

Warning: Don't use encode() on bytes or decode() on Unicode objects.

And again: software should only work with Unicode strings internally, converting to a particular encoding on output.
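The same encode/decode rules carry over to Python 3, where the str/unicode split becomes bytes/str and mixing the two raises a TypeError immediately instead of triggering an implicit ASCII decode. A small sketch in Python 3 terms (the sample string is illustrative):

```python
# In Python 3, str is text (code points) and bytes is raw data
text = 'Palabra: año'                 # str, contains code point U+00F1
data = text.encode('utf-8')          # str -> bytes; the rules above apply
assert b'\xc3\xb1' in data           # U+00F1 becomes the two bytes c3 b1
assert data.decode('utf-8') == text  # bytes -> str round trip

# decode() with the wrong codec fails loudly instead of corrupting data
try:
    data.decode('ascii')
except UnicodeDecodeError as e:
    print('ascii cannot decode byte', hex(data[e.start]))
```

Seeing the failure at the decode call itself, rather than far away at a write(), is the main reason the answer below recommends switching to Python 3.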
As you're not giving a simple, concise piece of code to illustrate your question, I'll just give you general advice on what the error should be: if you're getting a decode error, it's that tagSugerido is read as ASCII and not as Unicode. To fix that, you should do:

tagSugerido = unicode(diccionario.get(keyword[0]).decode('utf-8'))

to store it as Unicode. Then you're likely to get an encode error at the write() stage, and you should fix your write the following way:

outfile.write(u"{}\t{}\n".format(keyword, str(tagSugerido)).encode("utf-8"))

should be:

outfile.write(u"{}\t{}\n".format(keyword, tagSugerido.encode("utf-8")))

I literally answered a very similar question moments ago. When working with Unicode strings, switch to Python 3; it'll make your life easier! If you cannot switch to Python 3 just yet, you can make your Python 2 behave almost like Python 3 using the python-future import statement:

from __future__ import absolute_import, division, print_function, unicode_literals

N.B.: instead of doing:

file=open("lexic.txt", "r") # we open the lexic file (our model)
data=file.readlines()
file.close()

which will fail to close the file descriptor properly upon a failure during readlines, you should do:

with open("lexic.txt", "r") as f:
    data=f.readlines()

which will take care of always closing the file, even upon failure.

N.B.2: avoid using file as a variable name, as it shadows a Python type; use f or lexic_file instead.
Python: Convert utf-8 string to byte string [duplicate]
This question already has answers here: Best way to convert string to bytes in Python 3? (5 answers) Closed 11 days ago.

I have the following function to parse a utf-8 string from a sequence of bytes. Note -- 'length_size' is the number of bytes it takes to represent the length of the utf-8 string.

def parse_utf8(self, bytes, length_size):
    length = bytes2int(bytes[0:length_size])
    value = ''.join(['%c' % b for b in bytes[length_size:length_size+length]])
    return value

def bytes2int(raw_bytes, signed=False):
    """
    Convert a string of bytes to an integer (assumes little-endian byte order)
    """
    if len(raw_bytes) == 0:
        return None
    fmt = {1:'B', 2:'H', 4:'I', 8:'Q'}[len(raw_bytes)]
    if signed:
        fmt = fmt.lower()
    return struct.unpack('<'+fmt, raw_bytes)[0]

I'd like to write the function in reverse -- i.e. a function that will take a utf-8 encoded string and return its representation as a byte string. So far, I have the following:

def create_utf8(self, utf8_string):
    return utf8_string.encode('utf-8')

I run into the following error when attempting to test it:

  File "writer.py", line 229, in create_utf8
    return utf8_string.encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0x98 in position 0: ordinal not in range(128)

If possible, I'd like to adopt a structure for the code similar to the parse_utf8 example. What am I doing wrong? Thank you for your help!
UPDATE: test driver, now correct

def random_utf8_seq(self, length):
    # from http://www.w3.org/2001/06/utf-8-test/postscript-utf-8.html
    test_charset = u" !\"#$%&'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ ¡¢£¤¥¦§¨©ª«¬ ®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĂ㥹ĆćČčĎďĐđĘęĚěĹ弾ŁłŃńŇňŐőŒœŔŕŘřŚśŞşŠšŢţŤťŮůŰűŸŹźŻżŽžƒˆˇ˘˙˛˜˝–—‘’‚“”„†‡•…‰‹›€™"
    utf8_seq = u""
    for i in range(length):
        utf8_seq += random.choice(test_charset)
    return utf8_seq

I get the following error:

    input_str = self.random_utf8_seq(200)
  File "writer.py", line 226, in random_utf8_seq
    print unicode(utf8_seq, "utf-8")
UnicodeDecodeError: 'utf8' codec can't decode byte 0xbb in position 0: invalid start byte
If utf-8 => bytestring conversion is what you want, then you may use str.encode, but first you need to properly mark the type of the source string in your example -- prefix it with u for unicode:

# coding: utf-8
import random

def random_utf8_seq(length):
    # from http://www.w3.org/2001/06/utf-8-test/postscript-utf-8.html
    test_charset = u" !\"#$%&'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ ¡¢£¤¥¦§¨©ª«¬ ®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĂ㥹ĆćČčĎďĐđĘęĚěĹ弾ŁłŃńŇňŐőŒœŔŕŘřŚśŞşŠšŢţŤťŮůŰűŸŹźŻżŽžƒˆˇ˘˙˛˜˝–—‘’‚“”„†‡•…‰‹›€™"
    utf8_seq = u''
    for i in range(length):
        utf8_seq += random.choice(test_charset)
    print utf8_seq.encode('utf-8')
    return utf8_seq.encode('utf-8')

print( type(random_utf8_seq(200)) )

-- output --

õ3×sÔP{Ć.s(Ë°˙ě÷xÓ#bűV—û´ő¢uZÓČn˜0|_"Ðyø`êš·ÏÝhunÍÅ=ä? óP{tlÇűpb¸7s´ňƒG—čøň\zčłŢXÂYqLĆúěă(ÿî ¥PyÐÔŇnל¦Ì˝+•ì› ŻÛ°Ñ^ÝC÷ŢŐIñJĹţÒył"MťÆ‹ČČ4þ!»šåŮ#Öhň- ÈLGĄ¢ß˛Đ¯.ªÆź˘Ř^ĽÛŹËaĂŕ¹#¢éüÜńlÊqš=VřU…‚–MŽÎÉèoÙŹŠ¨Ð
<type 'str'>
python reading 8 byte block bin file
I'm not familiar with reading binary files in Python. I want to read a bin file in 8-byte blocks, whose content would be:

byte 1: integer of 0-255
byte 2: integer of 0-255
bytes 3-4: date string with format dd.mm.yyyy
bytes 5-6: time string with format hh:mm:ss
byte 7: integer of 0-255
byte 8: crc

I tried the following:

with open("tes.bin", "rb") as f:
    byte = f.read(8)
    while byte != "":
        byte = f.read(8)

I'm not sure how to treat the byte variable to extract the correct data. After reading the doc again, it says that the date is in bytes 3 and 4: bits 0-4 represent the day (2^5 values), bits 5-8 represent the month (2^4 values), and bits 9-15 represent the year (2^7 values). How can I go from bytes to bits in Python? Could anyone please give me a hint?
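The bit layout described above (day in bits 0-4, month in bits 5-8, year in bits 9-15) can be unpacked with struct plus shifts and masks. A sketch, assuming the two date bytes form one little-endian 16-bit value -- the actual byte order depends on the file format's documentation:

```python
import struct

def decode_packed_date(two_bytes):
    # Interpret bytes 3-4 of the record as one 16-bit value
    # (little-endian assumed; use '>H' if the format is big-endian)
    value = struct.unpack('<H', two_bytes)[0]
    day = value & 0x1F            # bits 0-4  (5 bits)
    month = (value >> 5) & 0x0F   # bits 5-8  (4 bits)
    year = (value >> 9) & 0x7F    # bits 9-15 (7 bits, often an offset from a base year)
    return day, month, year

# Round trip: pack day=15, month=6, year=25 and decode it again
packed = struct.pack('<H', (25 << 9) | (6 << 5) | 15)
print(decode_packed_date(packed))  # (15, 6, 25)
```

Inside the 8-byte record loop, this function would be applied to block[2:4]; the time field in bytes 5-6 presumably uses an analogous bit packing.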
How to print out 0xfb in python
I'm falling into unicode hell. My environment is Unix, Python 2.7.3:

LC_CTYPE=zh_TW.UTF-8
LANG=en_US.UTF-8

I'm trying to dump hex-encoded data in a human-readable format. Here is simplified code:

#! /usr/bin/env python
# encoding:utf-8
import sys

s=u"readable\n"      # previous result kept in a unicode string
s2="fb is not \xfb"  # data read from a binary file
s += s2
print s                          # method 1
print s.encode('utf-8')          # method 2
print s.encode('utf-8','ignore') # method 3
print s.decode('iso8859-1')      # method 4
# methods 1-4 display the following error message
#UnicodeDecodeError: 'ascii' codec can't decode byte 0xfb
# in position 0: ordinal not in range(128)
f = open('out.txt','wb')
f.write(s)

I just want to print out the 0xfb. I should describe more here. The key is s += s2, where s keeps my previously decoded string and s2 is the next string that should be appended to s. If I modify it as follows, the error occurs on the file write instead:

s=u"readable\n"
s2="fb is not \xfb"
s += s2.decode('cp437')
print s
f=open('out.txt','wb')
f.write(s)
# UnicodeEncodeError: 'ascii' codec can't encode character
# u'\u221a' in position 1: ordinal not in range(128)

I wish the result in out.txt to be

readable
fb is not \xfb

or

readable
fb is not 0xfb

[Solution]

#! /usr/bin/env python
# encoding:utf-8
import sys
import binascii

def fmtstr(s):
    r = ''
    for c in s:
        if ord(c) > 128:
            r = ''.join([r, "\\x"+binascii.hexlify(c)])
        else:
            r = ''.join([r, c])
    return r

s=u"readable"
s2="fb is not \xfb"
s += fmtstr(s2)
print s
f=open('out.txt','wb')
f.write(s)
I strongly suspect that your code is actually erroring out on the previous line: the s += s2 one. s2 is just a series of bytes, which can't be arbitrarily tacked on to a unicode object (which is instead a series of code points).

If you had intended the '\xfb' to represent U+FB, LATIN SMALL LETTER U WITH CIRCUMFLEX, it would have been better to assign it like this instead:

s2 = u"\u00fb"

But you said that you just want to print out \xHH codes for control characters. If you just want it to be something humans can understand which still makes it apparent that special characters are in a string, then repr may be enough. First, don't have s be a unicode object, because you're treating your strings here as a series of bytes, not a series of code points.

s = s.encode('utf-8')
s += s2
print repr(s)

Finally, if you don't want the extra quotes on the outside that repr adds, for nice pretty printing or whatever, there's no simple built-in way to do that in Python (that I know of). I've used something like this before:

import re

controlchars_re = re.compile(r'[\x00-\x1f\x7f-\xff]')

def _show_control_chars(match):
    txt = repr(match.group(0))
    return txt[1:-1]

def escape_special_characters(s):
    return controlchars_re.sub(_show_control_chars, s.replace('\\', '\\\\'))

You can pretty easily tweak the controlchars_re regex to define which characters you care about escaping.
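The same effect can be written in Python 3 terms, where the data would be a bytes object: keep printable ASCII bytes literally and render everything else as a \xHH escape. A sketch (the function name mirrors the one above but this is a separate, hypothetical implementation):

```python
def escape_special_characters(data):
    # Keep printable ASCII (space through '~') as-is; escape everything
    # else, including control characters and high bytes, as \xHH
    return ''.join(
        chr(b) if 0x20 <= b <= 0x7e else '\\x{:02x}'.format(b)
        for b in data
    )

print(escape_special_characters(b'fb is not \xfb'))  # fb is not \xfb
```

Iterating over a bytes object yields ints in Python 3, which is why the comparison works directly on b without ord().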