Handling B64-encoded data in Python

In my Google App Engine based app, I am fetching data from a SOAP web service.
The problem is that one of the tags contains Base64-encoded data. I decode it using
decodedStr = base64.b64decode(str(content))
It seems that the decoding is not done correctly, as I get garbage data in decodedStr. I think the problem is that the content string is incorrectly parsed as a Unicode string instead of a plain byte string.
Can any Python guru tell me how to handle Base64-encoded data in Python?
For now I am using this workaround:
fileContent = str(fileContent)
fileContent = fileContent[3:-3]  # strip the three stray characters wrapping the payload
self.response.out.write(base64.b64decode(fileContent))

You could try using base64.decodestring, or, if you were passed a URL-safe variant, base64.urlsafe_b64decode.
Make sure that the data is not Base16- or Base32-encoded.

Strange. If the content were not Base64-encoded, the call to decode should raise a TypeError exception. I assume that's not happening?
Which leads me to wonder how you know the resulting decodedStr is not what you're after.
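If the SOAP library hands you a unicode object, a cleaner pattern than str() is to encode it to ASCII explicitly before decoding; Base64 text is pure ASCII, so nothing can be lost that way. A minimal sketch (the example value is made up):

import base64

content = u'SGVsbG8sIHdvcmxkIQ=='  # imagine this arrived from the SOAP response as unicode
decodedStr = base64.b64decode(content.encode('ascii'))
print repr(decodedStr)  # 'Hello, world!'

If the encode call raises UnicodeEncodeError, the tag contains characters that cannot be part of valid Base64, which would itself explain the garbage output.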

Related

How to properly encode a string in UTF-8 which is ISO-8859-1, from XML

I'm using the following code in Python 2.7 to retrieve an XML file containing some German umlauts like ä, ü, ö, ß:
.....
    def getXML(self, url):
        xmlFile = urllib2.urlopen(self.url)
        xmlResponse = xmlFile.read()
        xmlFile.close()
        return xmlResponse

    def makeDict(self, xmlFile):
        data = xmltodict.parse(xmlFile)
        return data

    def saveJSON(self, dictionary):
        currentDayData = dictionary['speiseplan']['tag'][1]
        file = open('data.json', 'w')
        # Write the current day as JSON
        file.write(json.dumps(currentDayData))
        file.close()
        return True

# Execute
url = "path/to/xml/"
App = GetMensaJSON(url)
xml = GetMensaJSON.getXML(App, url)
dictionary = GetMensaJSON.makeDict(App, xml)
GetMensaJSON.saveJSON(App, dictionary)
The problem is that the XML file claims in its <?xml?> declaration that it is UTF-8. However, it isn't; by experimenting I found out that it is ISO-8859-1.
So I wanted to re-convert from UTF-8 to ISO-8859-1 and back to UTF-8 to resolve the conflicts.
Is there an elegant way to resolve the missing umlauts? In my output, for example, I have \u00c3\u009f instead of ß and \u00c3\u00bc instead of ü.
I found this similar question but I can't get it to work: How to parse an XML file with encoding declaration in Python?
Also I should add that I can't influence the way I get the XML; the XML file link can be found in the code.
The output from repr(xmlResponse) is:
"<?xml version='1.0' encoding='utf-8'?>\n<speiseplan>\n<tag timestamp='1453676400'>\n<item language='de'>\n<category>Stammessen</category>\n<title>Gem\xc3\x83\xc2\xbcsebr\xc3\x83\xc2\xbche mit Backerbsen (25,28,31,33,34), paniertes H\xc3\x83\xc2\xa4hnchenbrustschnitzel (25) mit Paprikasauce (33,34), Pommes frites und Gem\xc3\x83\xc2\xbcse
You are trying to encode already encoded data. urllib2.urlopen() can only return you a bytestring, not unicode, so encoding makes little sense.
What happens instead is that Python is trying to be helpful here; if you insist on encoding bytes, then it'll decode those to unicode data first. And it'll use the default codec for that.
On top of that, XML documents are themselves responsible for declaring what codec should be used to decode them; the default is UTF-8. Don't manually re-code the data yourself, leave that to an XML parser.
If you have Mojibake data in your XML document, the best way to fix that is to do so after parsing. I recommend the ftfy package to do this for you.
You could manually 'fix' the encoding by first decoding as UTF-8, then encoding to Latin-1 again:
xmlResponse = xmlFile.read().decode('utf-8').encode('latin-1')
However, this makes the assumption that your data has been badly decoded as Latin-1 to begin with; this is not always a safe assumption. If it was decoded as Windows CP 1252, for example, then the best way to recover your data is to still use ftfy.
You could try using ftfy before parsing as XML, but this relies on the document not having used any non-ASCII elements outside of text and attribute content:
xmlResponse = ftfy.fix_text(
    xmlFile.read().decode('utf-8'),
    fix_entities=False, uncurl_quotes=False, fix_latin_ligatures=False).encode('utf-8')
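To see why the decode/encode round trip works, take the mojibake from the repr() output in the question: ü should be the UTF-8 bytes \xc3\xbc, but it arrives as \xc3\x83\xc2\xbc because those two bytes were mis-decoded as Latin-1 and re-encoded as UTF-8. A minimal sketch reversing that last step:

mojibake = 'Gem\xc3\x83\xc2\xbcse'  # 'Gemüse' double-encoded, as in the question
fixed = mojibake.decode('utf-8').encode('latin-1')  # undo the bogus extra round
print fixed.decode('utf-8')  # u'Gem\xfcse', i.e. Gemüse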

Saving uploaded binary to local file

I'm trying to upload files from a javascript webpage, to a python-based server, with websockets.
In the JS, this is how I'm transmitting the package of data over the websocket:
var json = JSON.stringify({
    'name': name,
    'iData': image
});
In the Python, I'm decoding it like this:
noJson = json.loads(message)
fName = noJson["name"]
fData = noJson["iData"]
I know fData is in unicode format, but when I try to save the file locally is when the problems begin. Say, I'm trying to upload/save a JPG file. Looking at that file after upload I see at the beginning:
ÿØÿà^#^PJFIF
The original bytes should be:
<FF><D8><FF><E0>^#^PJFIF
So how do I get it to save with the codes, instead of the interpreted unicode characters?
fd = codecs.open( fName, encoding='utf-8', mode='wb' ) ## On Unix, so the 'b' might be ignored
fd.write( fData)
fd.close()
(if I don't use the "encoding=" bit, it throws a UnicodeDecodeError exception)
Use 'latin-1' encoding to save the file.
The fData that you are getting has already been decoded to characters, i.e. you get the unicode string u'\xff\xd8\xff\xe0^#^PJFIF'. The latin-1 encoding will literally convert every codepoint between U+0000 and U+00FF to a single byte, and fail on any codepoint above U+FF.
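Since each codepoint maps one-to-one back onto the byte it came from, encoding with latin-1 restores the raw JPEG bytes. A sketch, assuming fName and fData are as in the question:

import codecs

# latin-1 maps U+0000..U+00FF one-to-one onto bytes 0x00..0xFF, so this
# writes the original binary data instead of its multi-byte UTF-8 expansion.
fd = codecs.open(fName, mode='wb', encoding='latin-1')
fd.write(fData)
fd.close()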

Can't properly encode csv file?

I have this exact problem: https://www.en.adwords-community.com/t5/Basics-for-New-Advertisers/Character-Encoding-used-by-the-editor/td-p/100244 (tl;dr: trying to upload a file to google, contains foreign characters, they look funny when opened in excel and google is rejecting them for not being properly encoded)
I have the following code. Note that I've tried adding a byte order mark to the beginning of the http response object, as well as tried to encode all strings as utf-8.
<some code where workbook is created and populated via xlwt>
output = StringIO.StringIO()
workbook.save(output)
wb = open_workbook(file_contents=output.getvalue())
sheet = wb.sheet_by_name(spreadsheet)
response = HttpResponse(content_type='text/csv')
response['Content-Disposition'] = 'attachment; filename=' + (account.name + '-' + spreadsheet).replace(',', '') + '.csv'
response.write('\xEF\xBB\xBF')
writer = csv.writer(response)
for rownum in xrange(sheet.nrows):
    newRow = []
    for s in sheet.row_values(rownum):
        if isinstance(s, unicode):
            newRow.append(s.encode("utf-8"))
        elif isinstance(s, float):
            newRow.append(int(s))
        else:
            newRow.append(s.decode('utf-8'))
    writer.writerow(newRow)
return response
But they still don't look right when opened in Excel! Why?
Whenever you write a Unicode string to a file or stream it must be encoded. You can do the encoding yourself, or you can let the various module and library functions attempt to do it for you. If you're not sure what encoding will be selected for you, and you know which encoding you want written, it's better to do the encoding yourself.
You've followed this advice already when you encounter a Unicode string in the input. However, when you encounter a string that's already encoded as UTF-8, you decode it back to Unicode! This results in the reverse conversion being done inside writerow, and evidently it's not picking UTF-8 as the default encoding. By leaving the string alone instead of decoding it, writerow will write it out exactly as you intended.
You want to write encoded data always, but for string values you are decoding to Unicode values:
else:
    newRow.append(s.decode('utf-8'))
Most likely your web framework is encoding that data to Latin-1 instead in that case.
Just append the value without decoding:
for s in sheet.row_values(rownum):
    if isinstance(s, unicode):
        s = s.encode("utf-8")
    elif isinstance(s, float):
        s = int(s)
    newRow.append(s)
Further tips:
It's a good idea to communicate the character set in the response headers too:
response = HttpResponse(content_type='text/csv; charset=utf-8')
Use codecs.BOM_UTF8 to write the BOM instead of hardcoding the value; it's much less error-prone.
response.write(codecs.BOM_UTF8)
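Putting these pieces together, the response-building part might look like the sketch below (HttpResponse, sheet, account and spreadsheet are the names from the question):

import codecs
import csv

response = HttpResponse(content_type='text/csv; charset=utf-8')
response['Content-Disposition'] = 'attachment; filename=' + (account.name + '-' + spreadsheet).replace(',', '') + '.csv'
response.write(codecs.BOM_UTF8)  # lets Excel detect UTF-8
writer = csv.writer(response)
for rownum in xrange(sheet.nrows):
    newRow = []
    for s in sheet.row_values(rownum):
        if isinstance(s, unicode):
            s = s.encode('utf-8')  # encode text explicitly, once
        elif isinstance(s, float):
            s = int(s)
        newRow.append(s)  # already-encoded byte strings pass through untouched
    writer.writerow(newRow)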

how to serialize arbitrary file types to json string in python

My server is going to be sending a JSON, serialized as a string, through a socket to another client machine. I'll take my final json and do this:
import json
python_dict_obj = { "id" : 1001, "name" : "something", "file" : <???> }
serialized_json_str = json.dumps(python_dict_obj)
I'd like to have one of the fields in my JSON have the value that is a file, encoded as a string.
Performance-wise (but also interoperability-wise) what is the best way to encode a file using python? Base64? Binary? Just the raw string text?
EDIT - For those suggesting base64, something like this?
# get file
import base64
import json

with open(filename, 'rb') as f:  # open in binary mode
    filecontents = f.read()
encoded = base64.b64encode(filecontents)
python_dict_obj['file'] = encoded
serialized_json_str = json.dumps(python_dict_obj)
# ... sent to client via socket

# decoding
json_again = json.loads(serialized)
filecontents_again = base64.b64decode(json_again['file'])
I'd use base64. JSON isn't designed to communicate binary data. So unless your file's content is vanilla text, it "should be" encoded to use vanilla text. Virtually everything can encode and decode base64. If you instead use (for example) Python's repr(file_content), that also produces "plain text", but the receiving end would need to know how to decode the string escapes Python's repr() uses.
JSON cannot handle binary data. You will need to encode the data as text before serializing, and the easiest encoding to use is Base64. You do not need the URL-safe form of the encoding unless there are requirements for it further down the processing chain.
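As a sanity check, the whole round trip can be exercised locally before the socket gets involved; a minimal sketch (Python 2, arbitrary example bytes):

import base64
import json

payload = '\x00\xff\xd8some binary\x00data'
blob = json.dumps({'id': 1001, 'file': base64.b64encode(payload)})
assert base64.b64decode(json.loads(blob)['file']) == payload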

How to open an ascii-encoded file as UTF8?

My files are in US-ASCII, and a command like a = file('main.html') followed by a.read() loads them as ASCII text. How do I get them to load as UTF-8?
The problem I am trying to solve is:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 38: ordinal not in range(128)
I was using the content of the files for templating, as in template_str.format(attrib=val). But the string to interpolate contains characters outside of ASCII.
Our team's version control and text editors do not care about the encoding. So how do I handle it in the code?
You are opening files without specifying an encoding, which means that Python uses the default value (ASCII).
You need to decode the byte-string explicitly, using the .decode() function:
template_str = template_str.decode('utf8')
The val variable you tried to interpolate into your template is itself a unicode value, and Python wants to automatically convert your byte-string template (read from the file) into a unicode value too, so that it can combine the two; it'll use the default encoding to do so.
Did I mention already you should read Joel Spolsky's article on Unicode and the Python Unicode HOWTO? They'll help you understand what happened here.
A solution working in Python2:
import codecs
fo = codecs.open('filename.txt', 'r', 'ascii')
content = fo.read() ## returns unicode
assert type(content) == unicode
fo.close()
utf8_content = content.encode('utf-8')
assert type(utf8_content) == str
I suppose that you are sure that your files are encoded in ASCII. Are you? :) As ASCII is included in UTF-8, you can decode this data using UTF-8 without expecting problems. However, when you are sure that the data is just ASCII, you should decode the data using just ASCII and not UTF-8.
"How do I get it to load as UTF8?"
I believe you mean "How do I get it to load as unicode?". Just decode the data using the ASCII codec and, in Python 2.x, the resulting data will be of type unicode. In Python 3, the resulting data will be of type str.
You will have to read about this topic in order to learn how to perform this kind of decoding in Python. Once understood, it is very simple.
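Concretely, reading the file as bytes, decoding with the ASCII codec, and then formatting with a unicode value keeps everything in unicode until the output boundary; a minimal sketch (Python 2):

template_str = open('main.html').read().decode('ascii')  # bytes -> unicode
result = template_str.format(attrib=u'\xae')  # u'\xae' is the character from the traceback
output = result.encode('utf-8')  # encode once, when writing out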
