I'm trying to upload files from a JavaScript webpage to a Python-based server over WebSockets.
In the JS, this is how I'm transmitting the package of data over the websocket:
var json = JSON.stringify({
    'name': name,
    'iData': image
});
In the Python, I'm decoding it like this:
noJson = json.loads(message)
fName = noJson["name"]
fData = noJson["iData"]
I know fData is in Unicode format, but the problems begin when I try to save the file locally. Say I'm trying to upload/save a JPG file. Looking at that file after upload, I see at the beginning:
ÿØÿà^#^PJFIF
The original byte codes should be:
<FF><D8><FF><E0>^#^PJFIF
So how do I get it to save the byte codes, instead of the interpreted Unicode characters?
fd = codecs.open(fName, encoding='utf-8', mode='wb')  ## On Unix, so the 'b' might be ignored
fd.write(fData)
fd.close()
(if I don't use the "encoding=" bit, it throws a UnicodeDecodeError exception)
Use 'latin-1' encoding to save the file.
The fData that you are getting is already a Unicode string whose characters carry the byte values, i.e. you get the string u'\xff\xd8\xff\xe0^#^PJFIF'. The latin-1 encoding literally converts every code point between U+0000 and U+00FF to a single byte, and fails to convert any code point above U+00FF.
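For example, a minimal sketch of the save step (assuming fName and fData are the values decoded from the JSON message above):

import codecs

# latin-1 maps each code point U+0000..U+00FF back to a single byte,
# restoring the original binary data
fd = codecs.open(fName, encoding='latin-1', mode='wb')
fd.write(fData)
fd.close()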
Related
I am trying to properly decode a Base64 string from Power Apps into an audio file. The point is: I can decode it and I can play it. But as soon as I try to convert it using ffmpeg or any website, all kinds of errors are thrown. I have also tried changing the format (aac, weba, m4a, wav, mp3, ogg, 3gp, caf), but none of them could be converted to another format.
PS: If I decode the string (which is too big to post here) directly using a website, then the audio file can finally be converted, indicating that the issue is in the code or even in the Python library.
=============== CODE ===============
import os
import base64

mainDir = os.path.dirname(__file__)
audioFileOGG = os.path.join(mainDir, "myAudio.ogg")
audioFile3GP = os.path.join(mainDir, "myAudio.3gp")
audioFileAAC = os.path.join(mainDir, "myAudio.aac")
binaryFileTXT = os.path.join(mainDir, 'binaryData.txt')

with open(binaryFileTXT, 'rb') as f:
    audioData = f.readlines()
audioData = audioData[0]

with open(audioFileAAC, "wb") as f:
    f.write(base64.b64decode(audioData))
Result: the audio file is playable, but it cannot be converted to any other format (I need *.wav).
What am I missing here?
I found the issue myself: in order to decode the Base64 string, one must remove the header first (e.g. "data:audio/webm;base64,"). Then it works!
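For example, a minimal sketch of that fix (assuming the same file variables as above, and that the header, when present, ends with "base64,"):

import base64

with open(binaryFileTXT, 'rb') as f:
    audioData = f.read()

# strip a data-URI header such as b"data:audio/webm;base64," if present
if audioData.startswith(b"data:"):
    audioData = audioData.split(b"base64,", 1)[1]

with open(audioFileAAC, "wb") as f:
    f.write(base64.b64decode(audioData))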
My Django application works with both .txt and .doc file types. The application opens a file, compares it with other files in the database, and prints out a report.
The problem is that when the file type is .txt, I get a 'utf-8' codec can't decode byte error (here I'm using encoding='utf-8'). When I switch encoding='utf-8' to encoding='ISO-8859-1', the error changes to 'latin-1' codec can't decode byte.
I want to find an encoding that works with every type of file. This is a small part of my function:
views.py:
@login_required(login_url='sign_in')
def result(request):
    last_uploaded = OriginalDocument.objects.latest('id')
    original = open(str(last_uploaded.document), 'r', encoding='utf-8')
    original_words = original.read().lower().split()
    words_count = len(original_words)
    open_original = open(str(last_uploaded.document), "r")
    read_original = open_original.read()
    report_fives = open("static/report_documents/" + str(last_uploaded.student_name) +
                        "-" + str(last_uploaded.document_title) + "-5.txt", 'w')
    # Path to the documents with which the original doc is compared
    path = 'static/other_documents/doc*.txt'
    files = glob.glob(path)
    rows, found_count, fives_count, rounded_percentage_five, percentage_for_chart_five, fives_for_report, founded_docs_for_report = search_by_five(last_uploaded, 5, original_words, report_fives, files)
    context = {
        ...
    }
    return render(request, 'result.html', context)
There is no general encoding that automatically knows how to decode a file already encoded in some specific encoding.
UTF-8 is a good option, compatible with many other encodings. You can, for example, simply ignore or replace characters which aren't decodable, like this:
from codecs import open
original = open(str(last_uploaded.document), encoding="utf-8", errors="ignore")
original_words = original.read().lower().split()
...
original.close()
Or use a context manager (with statement), which closes the file for you:
with open(str(last_uploaded.document), encoding="utf-8", errors="ignore") as fr:
    original_words = fr.read().lower().split()
    ...
(Note: You do not need to use the codecs library if you're using Python 3, but you have tagged your question with python-2.7.)
You can see the advantages and disadvantages of using different error handlers here and here. Be aware that not using an error handler defaults to errors="strict", which you probably do not want. The other options are nearly self-explanatory, e.g.:
using errors="replace" will replace an undecodable character with a suitable replacement marker
using errors="ignore" will simply skip the character and continue reading the file data.
Which one you should use depends on your needs and use case(s).
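For example, a small sketch of how the two handlers differ on the same undecodable input (0xE9 is 'é' in Latin-1 but is not valid UTF-8 on its own):

data = b'caf\xe9'  # Latin-1 bytes for 'café'; not valid UTF-8
print(data.decode('utf-8', errors='replace'))  # 'caf\ufffd' (replacement marker)
print(data.decode('utf-8', errors='ignore'))   # 'caf' (offending byte dropped)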
You say that you have encoding problems not only with plain text files, but also with proprietary .doc files:
The .doc format is not a plain text file which you can simply read with open() or codecs.open(), since much of its information is stored in binary format; see this site for more information. So you need a special reader for .doc files to get the text out of them. Which library to use depends on your Python version and maybe also on the operating system you are using. Maybe here is a good starting point for you.
Unfortunately, using a library does not completely protect you from encoding errors. (Maybe it does, but I'm not sure whether the encoding is saved in the file itself, as it is in a .docx file.) You may also have the chance to figure out the encoding of the file. How you handle encoding errors likely depends on the library itself.
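As one possible starting point, a sketch using the textract package (my assumption, not something from your question; it wraps external tools to extract text from legacy .doc files and returns the text as bytes):

import textract  # pip install textract

raw = textract.process('./test.doc')          # extracted text, as bytes
text = raw.decode('utf-8', errors='replace')  # decode explicitly, with an error handler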
So I guess that you are trying to open .doc files as simple text files. Then you will get decoding errors, because the content is not saved as human-readable text. And even if you get rid of the error, you will only see non-human-readable text (I've created a simple text file with LibreOffice in .doc format (Microsoft Word 1997-2003)):
In [1]: open("./test.doc", "r").read()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 0: invalid continuation byte
In [2]: open("./test.doc", "r", errors="replace").read() # or open("./test.doc", "rb").read()
'��\x11\u0871\x1a�\x00\x00\x00' ...
I am accessing a dataset that lives on an FTP server. After I download the data, I use pandas to read it as CSV, but I get an encoding error. The file has a .csv extension, but when I open it with MS Excel, the data is in Unicode Text format. I want to convert this dataset from Unicode Text format. How can I make this happen?
My attempt:
from ftplib import FTP
import os

def mydef():
    defaultIP = ''
    username = 'cat'
    password = 'cat'
    ftp = FTP(defaultIP, user=username, passwd=password)
    ftp.dir()
    filenames = ftp.nlst()
    for filename in filenames:
        local_filename = os.path.join('C:\\Users\\me', filename)
        file = open(local_filename, 'wb')
        ftp.retrbinary('RETR ' + filename, file.write)
        file.close()
    ftp.quit()
Then I tried this to get the correct encoding:
mydef.encode('utf-8').splitlines()
but this is not working for me. I used this solution.
Here is an output snippet of the above code:
b'\xff\xfeF\x00L\x00O\x00W\x00\t\x00C\x00T\x00Y\x00_\x00R\x00P\x00T\x00\t\x00R\x00E\x00P\x00O\x00R\x00T\x00E\x00R\x00\t\x00C\x00T\x00Y\x00_\x00P\x00T\x00N\x00\t\x00P\x00A\x00R\x00T\x00N\x00E\x00R\x00\t\x00C\x00O\x00M\x00M\x00O\x00D\x00I\x00T\x00Y\x00\t\x00D\x00E\x00S\x00C\x00R\x00I\x00P\x00T\x00I\x00O\x00N\x00\t'
Expected output:
The expected output should be normal CSV data, such as common trade data, but the encoding doesn't work for me. I tried different encodings to get a correct conversion of the CSV data, but none of them worked. How can I make this work?
EDIT: I had to change this - I now remove two bytes at the beginning (the BOM) and one byte at the end, because the data is incomplete (every char needs two bytes).
It seems it is not UTF-8 but UTF-16 with a BOM.
If I remove the first two bytes (the BOM - Byte Order Mark) and the last byte at the end (because it is incomplete; every char needs two bytes) and use decode('utf-16-le'):
b'F\x00L\x00O\x00W\x00\t\x00C\x00T\x00Y\x00_\x00R\x00P\x00T\x00\t\x00R\x00E\x00P\x00O\x00R\x00T\x00E\x00R\x00\t\x00C\x00T\x00Y\x00_\x00P\x00T\x00N\x00\t\x00P\x00A\x00R\x00T\x00N\x00E\x00R\x00\t\x00C\x00O\x00M\x00M\x00O\x00D\x00I\x00T\x00Y\x00\t\x00D\x00E\x00S\x00C\x00R\x00I\x00P\x00T\x00I\x00O\x00N\x00'.decode('utf-16-le')
then I get
'FLOW\tCTY_RPT\tREPORTER\tCTY_PTN\tPARTNER\tCOMMODITY\tDESCRIPTION'
EDIT: meanwhile I also found Python - Decode UTF-16 file with BOM.
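Putting it together, a minimal sketch for converting one downloaded file to UTF-8 CSV (assuming local_filename from the question's code; the 'utf-16' codec consumes the BOM itself, and the length check guards against the truncated final byte):

with open(local_filename, 'rb') as f:
    raw = f.read()
if len(raw) % 2:             # truncated final byte: every UTF-16 char needs two
    raw = raw[:-1]
text = raw.decode('utf-16')  # reads the BOM and picks the right byte order

with open('converted.csv', 'w', encoding='utf-8') as f:
    f.write(text)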
I have this exact problem: https://www.en.adwords-community.com/t5/Basics-for-New-Advertisers/Character-Encoding-used-by-the-editor/td-p/100244 (tl;dr: I'm trying to upload a file to Google; it contains foreign characters, which look funny when opened in Excel, and Google is rejecting the file for not being properly encoded).
I have the following code. Note that I've tried adding a byte order mark to the beginning of the HTTP response object, as well as encoding all strings as UTF-8.
<some code where workbook is created and populated via xlwt>

output = StringIO.StringIO()
workbook.save(output)
wb = open_workbook(file_contents=output.getvalue())
sheet = wb.sheet_by_name(spreadsheet)

response = HttpResponse(content_type='text/csv')
response['Content-Disposition'] = 'attachment; filename=' + (account.name + '-' + spreadsheet).replace(',', '') + '.csv'
response.write('\xEF\xBB\xBF')
writer = csv.writer(response)
for rownum in xrange(sheet.nrows):
    newRow = []
    for s in sheet.row_values(rownum):
        if isinstance(s, unicode):
            newRow.append(s.encode("utf-8"))
        elif isinstance(s, float):
            newRow.append(int(s))
        else:
            newRow.append(s.decode('utf-8'))
    writer.writerow(newRow)
return response
But they still don't look right when opened in Excel! Why?
Whenever you write a Unicode string to a file or stream it must be encoded. You can do the encoding yourself, or you can let the various module and library functions attempt to do it for you. If you're not sure what encoding will be selected for you, and you know which encoding you want written, it's better to do the encoding yourself.
You've followed this advice already when you encounter a Unicode string in the input. However, when you encounter a string that's already encoded as UTF-8, you decode it back to Unicode! This results in the reverse conversion being done in writerow, and evidently it's not picking UTF-8 as the default encoding. By leaving the string alone instead of decoding it, writerow will write it out exactly as you intended.
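A tiny Python 2 illustration of why the extra decode hurts (hypothetical value; the last line is roughly what happens implicitly downstream, with whichever default codec the stack picks):

s = u'café'.encode('utf-8')  # str 'caf\xc3\xa9' -- already the bytes we want on disk
u = s.decode('utf-8')        # back to unicode: now something must re-encode it later
u.encode('ascii')            # UnicodeEncodeError -- or Latin-1 mojibake, depending on the default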
You always want to write encoded data, but for string values you are decoding to Unicode values:
else:
    newRow.append(s.decode('utf-8'))
Most likely your web framework is then encoding that data to Latin-1 instead.
Just append the value without decoding:
for s in sheet.row_values(rownum):
    if isinstance(s, unicode):
        s = s.encode("utf-8")
    elif isinstance(s, float):
        s = int(s)
    newRow.append(s)
Further tips:
It's a good idea to communicate the character set in the response headers too:
response = HttpResponse(content_type='text/csv; charset=utf-8')
Use codecs.BOM_UTF8 to write the BOM instead of hardcoding the value. Much less error prone.
response.write(codecs.BOM_UTF8)
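Combined, a sketch of the adjusted response-building code (a sketch only; all names come from the question's context):

import codecs
import csv

response = HttpResponse(content_type='text/csv; charset=utf-8')
response['Content-Disposition'] = 'attachment; filename=' + (account.name + '-' + spreadsheet).replace(',', '') + '.csv'
response.write(codecs.BOM_UTF8)  # BOM helps Excel recognise UTF-8
writer = csv.writer(response)
for rownum in xrange(sheet.nrows):
    newRow = []
    for s in sheet.row_values(rownum):
        if isinstance(s, unicode):
            s = s.encode('utf-8')  # encode once; never decode back
        elif isinstance(s, float):
            s = int(s)
        newRow.append(s)
    writer.writerow(newRow)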
I'm writing a program to 'manually' arrange a CSV file into proper JSON syntax, using a short Python script. From the input file I use readlines() to format the file as a list of rows, which I manipulate and concatenate into a single string, which is then output into a separate .txt file. The output, however, contains gibberish instead of the Hebrew characters that were present in the input file, and the output is double-spaced horizontally (a whitespace character is added between each character). As far as I can understand, the problem has to do with the encoding, but I haven't been able to figure out what. When I detect the encoding of the input and output files (using the .encoding attribute), they both return None, which means they use the system default. Technical details: Python 2.7, Windows 7.
While there are a number of questions out there on this topic, I didn't find a direct answer to my problem.
Detecting the system defaults won't help me in this case, because I need the program to be portable.
Here's the code:
def txt_to_JSON(csv_list):
    ...some manipulation of the list...
    return JSON_string

file_name = "input_file.txt"
my_file = open(file_name)

# make each line of the input file a value in a list
lines = my_file.readlines()

# break up each line into a list such that each 'column' is a value in that list
for i in range(0, len(lines)):
    lines[i] = lines[i].split("\t")

J_string = txt_to_JSON(lines)
json_file = open("output_file.txt", "w+")
json_file.write(J_string)
json_file.close()
All data needs to be encoded to be stored on disk. If you don't know the encoding, the best you can do is guess. There's a library for that: https://pypi.python.org/pypi/chardet
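For example, a short sketch of guessing and applying an encoding with chardet (file name from the question; the detection is a heuristic, so check the confidence it reports):

import chardet  # pip install chardet

with open("input_file.txt", "rb") as f:
    raw = f.read()
guess = chardet.detect(raw)  # e.g. {'encoding': 'UTF-16', 'confidence': 0.99, ...}
text = raw.decode(guess['encoding'])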
I highly recommend Ned Batchelder's presentation http://nedbatchelder.com/text/unipain.html for details.
There's an explanation of the use of "unicode" as an encoding name on Windows: What's the difference between Unicode and UTF-8?
TL;DR: Microsoft uses UTF-16 as the encoding for Unicode strings, but decided to call it "unicode", as they also use it internally.
Even though Python 2 is a bit lenient about str/unicode conversions, you should get used to always decoding on input and encoding on output.
In your case:
filename = 'where your data lives'
with open(filename, 'rb') as f:
    encoded_data = f.read()
decoded_data = encoded_data.decode("UTF-16")

# do stuff, resulting in result (all on unicode strings)
result = text_to_json(decoded_data)
encoded_result = result.encode("UTF-16")  # really, just using UTF-8 for everything makes things a lot easier

outfile = 'where your data goes'
with open(outfile, 'wb') as f:
    f.write(encoded_result)
You need to tell Python to use the Unicode character encoding to decode the Hebrew characters.
Here's a link to how you can read Unicode characters in Python: Character reading from file in Python
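For instance, a minimal Python 2 sketch of the full round trip (assuming the input really is UTF-16, which would also explain the extra whitespace between characters; swap in whatever encoding chardet reports, and txt_to_JSON is the function from the question):

import codecs

with codecs.open("input_file.txt", "r", encoding="utf-16") as f:
    lines = f.readlines()  # unicode strings, Hebrew intact

J_string = txt_to_JSON([line.split("\t") for line in lines])

with codecs.open("output_file.txt", "w", encoding="utf-8") as f:
    f.write(J_string)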