I've got a chunk of code that reads binary data off a string buffer (StringIO object), and tries to convert it to a bytearray object, but it's throwing errors when the value is greater than 127, which the ascii encoding can't handle, even when I'm trying to override it:
file = open(filename, 'r+b')
file.seek(offset)
chunk = file.read(length)
chunk = zlib.decompress(chunk)
chunk = StringIO(chunk)
d = bytearray(chunk.read(10), encoding="iso8859-1", errors="replace")
Running that code gives me:
d = bytearray(chunk.read(10), encoding="iso8859-1", errors="replace")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf0 in position 3: ordinal not in range(128)
Obviously 240 (decimal of 0xf0) can't fit in the ascii encoding range, but that's why I'm explicitly setting the encoding. But it seems to be ignoring it.
When converting a string to another encoding, its original encoding is taken to be ASCII if it is a str or Unicode if it is a unicode object. When creating the bytearray, the encoding parameter is required only if the string is unicode. Just don't specify an encoding and you will get the results you want.
I am not quite sure what the problem is.
StringIO is for string IO, not for binary IO.
If you want to get a bytearray representing the whole content of the file, use:
with open ('filename', 'r') as file: bytes = bytearray (file.read () )
if you want to get a string with only ascii characters contained in that file, use:
with open ('filename', 'r') as file: asciis = file.read ().decode ('ascii', 'ignore')
(If you run it on windows, you will probably need the binary flag for opening the file.
Related
I have a file including some data like : \xe1\x8a\xa0\xe1\x88\x9b\xe1\x88\xad\xe1\x8a\x9b
How do i read this and write the string format(አማርኛ) in another file? And also vice versa?
[\xe1\x8a\xa0\xe1\x88\x9b\xe1\x88\xad\xe1\x8a\x9b == አማርኛ ]
That is a byte string, so you need to decode it to a utf-8 Unicode string.
b'\xe1\x8a\xa0\xe1\x88\x9b\xe1\x88\xad\xe1\x8a\x9b'.decode('utf8')
result: 'አማርኛ'
And to encode it back to byte string:
'አማርኛ'.encode()
result: b'\xe1\x8a\xa0\xe1\x88\x9b\xe1\x88\xad\xe1\x8a\x9b'
Basicly you have a byte string, you can do what you are talking about with the functions encode() and decode() respectively, in the example below, i will start by printing the byte string. And then i'm taking the byte string and decoding it to utf-8 (default value in all python versions above 2.7 if you don't specify a version yourself)
f = open("input.txt","rb")
x = f.read()
print(x) # b'\xe1\x8a\xa0\xe1\x88\x9b\xe1\x88\xad\xe1\x8a\x9b'
print(x.decode()) # አማርኛ
If you want to do the inverse operation, you can achieve this by just encoding back the decoded byte array! (Do note that the open function i'm using the arguments "rb" that stands for (following the wiki) "Opens a file for reading only in binary format."
I need to save a params file in python and this params file contains some parameters that I won't leave on plain text, so I codify the entire file to base64 (I know that this isn't the most secure encoding of the world but it works for the kind of data that I need to use).
With the encoding, everything works well. I encode the content of my file (a simply txt with a proper extension) and save the file. The problem comes with the decode. I print the text coded before save the file and the text coded from the file saved and there are exactly the same, but for a reason I don't know, the decode of the text of the file saved returns me this error UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8d in position 1: invalid start byte and the decode of the text before save the file works well.
Any idea to resolve this issue?
This is my code, I have tried converting all to bytes, to string, and everything...
params = open('params.bpr','r').read()
paramsencoded = base64.b64encode(bytes(params,'utf-8'))
print(paramsencoded)
paramsdecoded = str(base64.b64decode(str(paramsencoded,'utf-8')),'utf-8')
newparams = open('paramsencoded.bpr','w+',encoding='utf-8')
newparams.write(str(paramsencoded))
newparams.close()
params2 = open('paramsencoded.bpr',encoding='utf-8').read()
print(params2)
paramsdecoded = str(base64.b64decode(str(paramsencoded,'utf-8')),'utf-8')
paramsdecoded = base64.b64decode(str(params2))
print(str(paramsdecoded,'utf-8'))
Your error lies in your handling of the bytes object returned by base64.b64encode(), you called str() on the object:
newparams.write(str(paramsencoded))
That doesn't decode the bytes object:
>>> bytesvalue = b'abc='
>>> str(bytesvalue)
"b'abc='"
Note the b'...' notation. You produced the representation of the bytes object, which is a string containing Python syntax that can reproduce the value for debugging purposes (you can copy that string value and paste it into Python to re-create the same bytes value).
This may not be that easy to notice at first, as base64.b64encode() otherwise only produces output with printable ASCII bytes.
But your decoding problem originates from there, because when decoding the value read back from the file includes the b' characters at the start. Those first two characters are interpreted as Base64 data too; the b is a valid Base64 character, and the ' is ignored by the parser:
>>> bytesvalue = b'hello world'
>>> base64.b64encode(bytesvalue)
b'aGVsbG8gd29ybGQ='
>>> str(base64.b64encode(bytesvalue))
"b'aGVsbG8gd29ybGQ='"
>>> base64.b64decode(str(base64.b64encode(bytesvalue))) # with str()
b'm\xa1\x95\xb1\xb1\xbc\x81\xdd\xbd\xc9\xb1\x90'
>>> base64.b64decode(base64.b64encode(bytesvalue)) # without str()
b'hello world'
Note how the output is completely different, because the Base64 decoding is now starting from the wrong place, as b is the first 6 bits of the first byte (making the first decoded byte a 6C, 6D, 6E or 6F bytes, so m,n, o or p ASCII).
You could properly decode the value (using paramsencoded.decode('ascii') or str(paramsencoded, 'ascii')) but you should't treat any of this data as text.
Instead, open your files in binary mode. Reading and writing then operates with bytes objects, and the base64.b64encode() and base64.b64decode() functions also operate on bytes, making for a perfect match:
with open('params.bpr', 'rb') as params_source:
params = params_source.read() # bytes object
params_encoded = base64.b64encode(params)
print(params_encoded.decode('ascii')) # base64 data is always ASCII data
params_decoded = base64.b64decode(params_encoded)
with open('paramsencoded.bpr', 'wb') as new_params:
newparams.write(params_encoded) # write binary data
with open('paramsencoded.bpr', 'rb') as new_params:
params_written = new_params.read()
print(params_written.decode('ascii')) # still Base64 data, so decode as ASCII
params_decoded = base64.b64decode(params_written) # decode the bytes value
print(params_decoded.decode('utf8')) # assuming the original source was UTF-8
I explicitly use bytes.decode(codec) rather than str(..., codec) to avoid accidental str(...) calls.
I want to open a text file (.dat) in python and I get the following error:
'utf-8' codec can't decode byte 0x92 in position 4484: invalid start byte
but the file is encoded using utf-8, so maybe there some character that cannot be read. I am wondering, is there a way to handle the problem without calling each single weird characters? Cause I have a rather huge text file and it would take me hours to run find the non encoded Utf-8 encoded character.
Here is my code
import codecs
f = codecs.open('compounds.dat', encoding='utf-8')
for line in f:
if "InChI=1S/C11H8O3/c1-6-5-9(13)10-7(11(6)14)3-2-4-8(10)12/h2-5" in line:
print(line)
searchfile.close()
It shouldn't "take you hours" to find the bad byte. The error tells you exactly where it is; it's at index 4484 in your input with a value of 0x92; if you did:
with open('compounds.dat', 'rb') as f:
data = f.read()
the invalid byte would be at data[4484], and you can slice as you like to figure out what's around it.
In any event, if you just want to ignore or replace invalid bytes, that's what the errors parameter is for. Using io.open (because codecs.open is subtly broken in many ways, and io.open is both faster and more correct):
# If this is Py3, you don't even need the import, just use plain open which is
# an alias for io.open
import io
with io.open('compounds.dat', encoding='utf-8', errors='ignore') as f:
for line in f:
if u"InChI=1S/C11H8O3/c1-6-5-9(13)10-7(11(6)14)3-2-4-8(10)12/h2-5" in line:
print(line)
will just ignore the invalid bytes (dropping them as if they never existed). You can also pass errors='replace' to insert a replacement character for each garbage byte, so you're not silently dropping data.
if working with huge data , better to use encoding as default and if the error persists then use errors="ignore" as well
with open("filename" , 'r' , encoding="utf-8",errors="ignore") as f:
f.read()
This text file (30 bytes only, the content is '(Ne pas r\xe9pondre a ce message)') can be opened and inserted in a dict successfully :
import json
d = {}
with open('temp.txt', 'r') as f:
d['blah'] = f.read()
with open('test.txt', 'w') as f:
data = json.dumps(d)
f.write(data)
But it is impossible to dump the dict into a JSON file (see traceback below). Why?
I tried lots of solutions provided by various SO questions. The closest solution I could get was this answer. When using this, I can dump to file, but then the JSON file looks like this:
# test.txt
{"blah": "(Ne pas r\u00e9pondre a ce message)"}
instead of
# test.txt
{"blah": "(Ne pas répondre a ce message)"}
Traceback:
File "C:\Python27\lib\json\encoder.py", line 270, in iterencode
return _iterencode(o, 0)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 9: invalid continuation byte
[Finished in 0.1s with exit code 1]
Your file is not UTF-8 encoded. It uses a Latin codec, like ISO-8859-1 or Windows Codepage 1252. Reading the file gives you the encoded text.
JSON however requires Unicode text. Python 2 has a separate Unicode type, and byte strings (type str) need to be decoded using a suitable codec. The json.dumps() function uses UTF-8 by default; UTF-8 is a widely used codec for encoding Unicode text data that can handle all codepoints in the standard, and is also the default codec for JSON strings to use (JSON requires documents to be encoding in one of 3 UTF codecs).
You need to either decode the string manually or tell json.dumps() what codec to use for the byte string:
data = json.dumps(d, encoding='latin1') # applies to all bytestrings in d
or
d['blah'] = d['blah'].decode('latin1')
data = json.dumps(d)
or using io.open() to decode as you read:
import io
with io.open('test.txt', 'w', encoding='latin1') as f:
d['blah'] = f.read()
By default, the json library produces ASCII-safe JSON output by using the \uhhhh escape syntax the JSON standard allows for. This is entirely normal, the output is valid JSON and readable by any compliant JSON decoder.
If you must produce UTF-8 encoded output without the \uhhhh escape sequences, see Saving utf-8 texts in json.dumps as UTF8, not as \u escape sequence
I'm trying to remove all non-ascii characters from a text document. I found a package that should do just that, https://pypi.python.org/pypi/Unidecode
It should accept a string and convert all non-ascii characters to the closest ascii character available. I used this same module in perl easily enough by just calling while (<input>) { $_ = unidecode($_); } and this one is a direct port of the perl module, the documentation indicates that it should work the same.
I'm sure this is something simple, I just don't understand enough about character and file encoding to know what the problem is. My origfile is encoded in UTF-8 (converted from UCS-2LE). The problem may have more to do with my lack of encoding knowledge and handling strings wrong than the module, hopefully someone can explain why though. I've tried everything I know without just randomly inserting code and search the errors I'm getting with no luck so far.
Here's my python
from unidecode import unidecode
def toascii():
origfile = open(r'C:\log.convert', 'rb')
convertfile = open(r'C:\log.toascii', 'wb')
for line in origfile:
line = unidecode(line)
convertfile.write(line)
origfile.close()
convertfile.close()
toascii();
If I don't open the original file in byte mode (origfile = open('file.txt','r') then I get an error UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 1563: character maps to <undefined> from the for line in origfile: line.
If I do open it in byte mode 'rb' I get TypeError: ord() expected string length 1, but int found from the line = unidecode(line) line.
if I declare line as a string line = unidecode(str(line)) then it will write to the file, but... not correctly. \r\n'b'\xef\xbb\xbf[ 2013.10.05 16:18:01 ] User_Name > .\xe2\x95\x90\xe2\x95\x90\xe2\x95\x90\xe2\x95\x90\ It's writing out the \n, \r, etc and unicode characters instead of converting them to anything.
If I convert the line to string as above, and open the convertfile in byte mode 'wb' it gives the error TypeError: 'str' does not support the buffer interface
If I open it in byte mode without declaring it a string 'wb' and unidecode(line) then I get the TypeError: ord() expected string length 1, but int found error again.
The unidecode module accepts unicode string values and returns a unicode string in Python 3. You are giving it binary data instead. Decode to unicode or open the input text file in textmode, and encode the result to ASCII before writing it to a file, or open the output text file in text mode.
Quoting from the module documentation:
The module exports a single function that takes an Unicode object (Python 2.x) or string (Python 3.x) and returns a string (that can be encoded to ASCII bytes in Python 3.x)
Emphasis mine.
This should work:
def toascii():
with open(r'C:\log.convert', 'r', encoding='utf8') as origfile, open(r'C:\log.toascii', 'w', encoding='ascii') as convertfile:
for line in origfile:
line = unidecode(line)
convertfile.write(line)
This opens the inputfile in text modus (using UTF8 encoding, which judging by your sample line is correct) and writes in text modus (encoding to ASCII).
You do need to explicitly specify the encoding of the file you are opening; if you omit the encoding the current system locale is used (the result of a locale.getpreferredencoding(False) call), which usually won't be the correct codec if your code needs to be portable.