I have a variable in base64 format:
variable = b'gAN9cQAoGdS4='
To store it in a database I decode it:
st1 = variable.decode('utf-8')
st1
Out[35]: 'gAN9cQAoGdS4='
Now I have a very large variable, more than 4 GB, so I compress it using zlib:
import zlib
variable_comp = zlib.compress(variable)
Now, to store it in the database, I can't decode it:
st1 = variable_comp.decode('utf-8')
I get
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9c in position 1: invalid start byte
So I tried:
st2 = variable_comp.decode('utf-8', errors="ignore")
But when I decompress it, I get an error:
variable_decomp = zlib.decompress(st2)
TypeError: a bytes-like object is required, not 'str'
May I know the fix for this? Will gzip fix it?
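A common approach (not shown in the question) is to keep the compressed data as bytes and base64-encode it before storage: zlib output is arbitrary binary, so it will almost never be valid UTF-8, and gzip output has the same problem. A minimal sketch, assuming the database column stores text:
import base64
import zlib

variable = b'gAN9cQAoGdS4='

# Compress the raw bytes, then base64-encode so the result is plain ASCII text.
variable_comp = zlib.compress(variable)
st1 = base64.b64encode(variable_comp).decode('ascii')  # safe to store as a string

# To restore: base64-decode back to bytes, then decompress.
variable_decomp = zlib.decompress(base64.b64decode(st1))
assert variable_decomp == variable
Alternatively, storing variable_comp directly in a binary (BLOB) column avoids the decode step entirely.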
I tried importing some medical data and ran into this Unicode error; here is my code:
import glob
import os
import pandas as pd

output_path = r"C:/Users/muham/Desktop/AI projects/cancer doc classification"
my_file = glob.glob(os.path.join(output_path, '*.csv'))
for files in my_file:
    data = pd.read_csv(files)
    print(data)
My error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x93 in position 3314: invalid start byte
Try other encodings; the default is utf-8. For example:
import pandas
pandas.read_csv(path, encoding="cp1252")
or ascii, latin1, etc.
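If the right encoding is unknown, a small loop over likely candidates can narrow it down (a sketch, not part of the answer; the file path and candidate list are assumptions):
import pandas as pd

path = "some_file.csv"  # hypothetical path
for enc in ("utf-8", "cp1252", "latin1"):
    try:
        data = pd.read_csv(path, encoding=enc)
        print("read OK with", enc)
        break
    except UnicodeDecodeError:
        print("failed with", enc)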
I'm trying to read a GloVe file: glove.twitter.27B.200d.txt. I have the following function to read the file:
def glove_reader(glove_file):
    glove_dict = {}
    with open(glove_file, 'rt', encoding='utf-8') as glove_reader:
        for line in glove_reader:
            tokens = line.rstrip().split()
            vect = [float(token) for token in tokens[1:]]
            glove_dict[tokens[0]] = vect
    return glove_dict
The problem is that I get the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xea in position 0: invalid continuation byte
I tried with latin-1, but it didn't work; it throws the following error:
ValueError: could not convert string to float: 'Ù\x86'
I also tried changing 'rt' to 'r' and 'rb'. I think this is a macOS problem, because on Windows it didn't throw this error. Can someone please help me understand why I can't read this file?
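One defensive option (an assumption on my part, not something from the question) is to keep utf-8 but stop a few bad lines from aborting the whole load, by replacing undecodable bytes and skipping lines whose values do not parse:
def glove_reader(glove_file):
    glove_dict = {}
    # errors='replace' avoids UnicodeDecodeError; undecodable bytes become U+FFFD
    with open(glove_file, 'rt', encoding='utf-8', errors='replace') as reader:
        for line in reader:
            tokens = line.rstrip().split()
            if not tokens:
                continue
            try:
                vect = [float(token) for token in tokens[1:]]
            except ValueError:
                continue  # skip lines whose vector values cannot be parsed
            glove_dict[tokens[0]] = vect
    return glove_dict
If many lines get skipped, the file itself is probably corrupted or differs between the two machines, which this sketch does not fix.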
This is my first post here, so excuse me if I miss anything.
I have some data in my CSV file and am trying to import it into my production database, but I am getting a UnicodeDecodeError. The CSV file contains some French words.
Code:
open_csv = csv.DictReader(open('filename', 'rb'))
for i in open_csv:
    x = find(where={})  # mongodb query
    x.something = i.get(row_header)
    x.save()
I am getting UnicodeDecodeError: 'utf8' codec can't decode byte 0x8e in position 1 while saving the data.
I would suggest trying the following code:
import codecs
open_csv = csv.DictReader(codecs.open('filename', 'rb'))
for i in open_csv:
    x = find(where={})
    x.something = i.get(row_header)
    x.save()
I work in Python 3.x, but this should work in 2.x too if that is what you are using.
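codecs.open also accepts an explicit encoding argument, which is usually the part that matters. A variant assuming the French text is cp1252-encoded (the actual encoding is not stated in the question; byte 0x8e maps to 'Ž' in cp1252):
import codecs
import csv

# encoding='cp1252' is an assumption about the file, not a known fact
open_csv = csv.DictReader(codecs.open('filename', 'r', encoding='cp1252'))
for row in open_csv:
    print(row)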
I have a GIF file (or any image format) in unicode form:
>>> data
u'GIF89a,\x000\x00\ufffd\ufffd\x00\x00\x00\x00\ufffd\ufffd\ufff...
I want to write this to a file:
>>> f = open('file.gif', 'wb')
>>> f.write(data)
But I get an error:
UnicodeEncodeError at /image
'ascii' codec can't encode characters in position 10-11: ordinal not in range(128)
How do I do this?
Try this:
utf8data = data.encode('UTF-8')
open('file.gif', 'w').write(utf8data)
You must encode the unicode string explicitly:
f.write(data.encode('utf-8'))
When I open the URL and read it, I can't recognize the content, even though the content header says it is encoded as utf-8. So I tried to convert it to unicode with unicode(), and it complained: UnicodeDecodeError: 'ascii' codec can't decode byte 0x8b in position 1: ordinal not in range(128).
.encode("utf-8") produces
UnicodeDecodeError: 'ascii' codec can't decode byte 0x8b in position 1: ordinal not in range(128)
.decode("utf-8") produced
UnicodeDecodeError: 'utf8' codec can't decode byte 0x8b in position 1: invalid start byte.
I have tried everything I can come up with (I'm not that good at encodings).
I would be happy if I could get this to work. Thanks.
This is a common mistake. The server sends a gzipped stream.
You should unpack it first:
import gzip
import StringIO

response = opener.open(self.__url, data)
if response.info().get('Content-Encoding') == 'gzip':
    buf = StringIO.StringIO(response.read())
    gzip_f = gzip.GzipFile(fileobj=buf)
    content = gzip_f.read()
else:
    content = response.read()
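For reference, a Python 3 sketch of the same idea (the answer above targets Python 2; urllib.request and the url variable are assumptions here):
import gzip
import urllib.request

response = urllib.request.urlopen(url)  # 'url' is a placeholder
raw = response.read()
if response.headers.get('Content-Encoding') == 'gzip':
    raw = gzip.decompress(raw)
content = raw.decode('utf-8')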
The header is probably wrong. Check out chardet.
EDIT: Thinking more about it, my money is on the contents being gzipped. I believe some of Python's various URL-opening modules/classes/etc. will ungzip, while others won't.
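If you suspect the declared charset rather than gzip, chardet can guess the encoding from the raw bytes (a minimal sketch; chardet is a third-party package):
import chardet

raw = b'...'  # placeholder for the undecoded bytes read from the response
guess = chardet.detect(raw)  # returns a dict such as {'encoding': ..., 'confidence': ...}
text = raw.decode(guess['encoding'] or 'utf-8', errors='replace')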