Pycurl: uploading files with filenames in UTF-8

This question is related to this one.
Please read the problem that Chris describes there. I'll narrow it down: curl throws error 26 if a filename is UTF-8-encoded and contains characters outside the range supported by non-Unicode programs.
Let me explain myself:
local_filename = filename.encode("utf-8")
self.curl.setopt(self.curl.HTTPPOST, [(field, (self.curl.FORM_FILE, local_filename, self.curl.FORM_FILENAME, local_filename))])
I have Windows 7 with Russian set as the language for non-Unicode programs. If I don't encode the filename to UTF-8 (and pass filename, not local_filename, to pycurl), everything goes flawlessly as long as the filename contains only English or Russian characters. But if there is, say, an à, it throws error 26. If I pass local_filename (encoded to UTF-8), even Russian characters aren't allowed.
Could you help, please? Thanks!

This is easy to answer, harder to fix:
pycurl uses libcurl for formposting. libcurl uses plain fopen() to open files for posting. Therefore you need to tell libcurl the exact file name that it should open and read from your local file system.

Decompose this problem into two components:
1. tell pycurl which file to open to read the file data;
2. send the filename in the correct encoding to the server.
These may or may not be the same encodings.
For 1, use sys.getfilesystemencoding() to convert the unicode filename (which you correctly use throughout your Python code) to a byte string that pycurl/libcurl can open with fopen(). Use strace on Linux, or an equivalent tool on Windows or OS X, to verify that the correct file path is being opened by pycurl.
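For example, a minimal sketch of that conversion (the euro filename is borrowed from the snippet further down, purely for illustration):
import sys

filename = u"example-\N{EURO SIGN}.mp3"
# The byte string below is what libcurl's fopen() should receive so that it
# can actually find the file on the local file system.
local_filename = filename.encode(sys.getfilesystemencoding())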
If that totally fails, you can always feed the file data stream from Python via pycurl.READFUNCTION.
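Note that this abandons the multipart form post: the sketch below assumes the server also accepts a plain PUT-style upload (pycurl.UPLOAD), and the URL is a placeholder. Python opens the file itself, so libcurl's fopen() never sees the filename.
import os
import pycurl

filename = u"example-\N{EURO SIGN}.mp3"
f = open(filename, "rb")  # Python resolves the unicode path, not fopen()
c = pycurl.Curl()
c.setopt(pycurl.URL, "http://localhost:5050/upload")
c.setopt(pycurl.UPLOAD, 1)  # libcurl performs a PUT
c.setopt(pycurl.READFUNCTION, f.read)  # libcurl pulls chunks via f.read(size)
c.setopt(pycurl.INFILESIZE, os.fstat(f.fileno()).st_size)
c.perform()
c.close()
f.close()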
For 2, learn how the filename is transmitted during a file upload. I don't have a good link; all I know is that it's not trivial, e.g. when it comes to very long file names.
I hacked up your code snippet; I have this, and it works against nc -vl 5050 at least.
#!/usr/bin/python
import pycurl
c = pycurl.Curl()
filename = u"example-\N{EURO SIGN}.mp3"
with open(filename, "wb") as f:
    f.write("\0\xfffoobar\x07\xff" * 9)
local_filename = filename.encode("utf-8")
c.setopt(pycurl.HTTPPOST, [("xxx", (pycurl.FORM_FILE, local_filename, pycurl.FORM_FILENAME, local_filename))])
c.setopt(pycurl.URL, "http://localhost:5050/")
c.setopt(pycurl.HTTPHEADER, ["Expect:"])
c.perform()
My test doesn't cover the case where encoding is different between OS and HTTP.
Should be enough to get you started though, shouldn't it?

Related

python 2.7 cpickle.load adds \r to strings in dictionaries [duplicate]

I got a pickled object (a list with a few numpy arrays in it) that was created on Windows and apparently saved to a file opened in text mode, not in binary mode (i.e. with open(filename, 'w') instead of open(filename, 'wb')). The result is that now I can't unpickle it (not even on Windows) because it's infested with \r characters (and possibly more). The main complaint is
ImportError: No module named multiarray
supposedly because it's looking for numpy.core.multiarray\r, which of course doesn't exist. Simply removing the \r characters didn't do the trick (I tried both sed -e 's/\r//g' and, in Python, s = file.read().replace('\r', ''), but both break the file and yield a cPickle.UnpicklingError later on).
Problem is that I really need to get the data out of the objects. Any ideas how to fix the files?
Edit: On request, the first few hundred bytes of my file, shown as a Python repr:
\x80\x02]q\x01(}q\x02(U\r\ntotal_timeq\x03G?\x90\x15r\xc9(s\x00U\rreaction_timeq\x04NU\x0ejump_directionq\x05cnumpy.core.multiarray\r\nscalar\r\nq\x06cnumpy\r\ndtype\r\nq\x07U\x02f8K\x00K\x01\x87Rq\x08(K\x03U\x01<NNNJ\xff\xff\xff\xffJ\xff\xff\xff\xffK\x00tbU\x08\x025\x9d\x13\xfc#\xc8?\x86Rq\tU\x14normalised_directionq\r\nh\x06h\x08U\x08\xf0\xf9,\x0eA\x18\xf8?\x86Rq\x0bU\rjump_distanceq\x0ch\x06h\x08U\x08\x13\x14\xea&\xb0\x9b\x1a#\x86Rq\rU\x04jumpq\x0ecnumpy.core.multiarray\r\n_reconstruct\r\nq\x0fcnumpy\r\nndarray\r\nq\x10K\x00\x85U\x01b\x87Rq\x11(K\x01K\x02\x85h\x08\x89U\x10\x87\x16\xdaEG\xf4\xf3?\x06`OC\xe7"\x1a#tbU\x0emovement_speedq\x12h\x06h\x08U\x08\\p\xf5[2\xc2\xef?\x86Rq\x13U\x0ctrial_lengthq\x14G#\t\x98\x87\xf8\x1a\xb4\xbaU\tconditionq\x15U\x0bhigh_mentalq\x16U\x07subjectq\x17K\x02U\x12movement_directionq\x18h\x06h\x08U\x08\xde\x06\xcf\x1c50\xfd?\x86Rq\x19U\x08positionq\x1ah\x0fh\x10K\x00\x85U\x01b\x87Rq\x1b(K\x01K\x02\x85h\x08\x89U\x10K\xb7\xb4\x07q=\x1e\xc0\xf2\xc2YI\xb7U&\xc0tbU\x04typeq\x1ch\x0eU\x08movementq\x1dh\x0fh\x10K\x00\x85U\x01b\x87Rq\x1e(K\x01K\x02\x85h\x08\x89U\x10\xad8\x9c9\x10\xb5\xee\xbf\xffa\xa2hWR\xcf?tbu}q\x1f(h\x03G#\t\xba\xbc\xb8\xad\xc8\x14h\x04G?\xd9\x99%]\xadV\x00h\x05h\x06h\x08U\x08\xe3X\xa9=\xc1\xb1\xeb?\x86Rq h\r\nh\x06h\x08U\x08\x88\xf7\xb9\xc1\t\xd6\xff?\x86Rq!h\x0ch\x06h\x08U\x08v\x7f\xeb\x11\xea5\r#\x86Rq"h\x0eh\x0fh\x10K\x00\x85U\x01b\x87Rq#(K\x01K\x02\x85h\x08\x89U\x10\xcd\xd9\x92\x9a\x94=\x06#]C\xaf\xef\xeb\xef\x02#tbh\x12h\x06h\x08U\x08-\x9c&\x185\xfd\xef?\x86Rq$h\x14G#\r\xb8W\xb2`V\xach\x15h\x16h\x17K\x02h\x18h\x06h\x08U\x08\x8e\x87\xd1\xc2
You may also download the whole file (22k).
Presuming that the file was created with the default protocol=0 ASCII-compatible method, you should be able to load it anywhere by using open('pickled_file', 'rU') i.e. universal newlines.
If this doesn't work, show us the first few hundred bytes: print repr(open('pickled_file', 'rb').read(200)) and paste the results into an edit of your question.
Update after file contents were published:
Your file starts with '\x80\x02'; it was dumped with protocol 2, the latest/best. Protocols 1 and 2 are binary protocols. Your file was written in text mode on Windows. This has resulted in each '\n' being converted to '\r\n' by the C runtime. Files should be opened in binary mode like this:
with open('result.pickle', 'wb') as f:  # b for binary
    pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)
with open('result.pickle', 'rb') as f:  # b for binary
    obj = pickle.load(f)
Docs are here. This code will work portably on both Windows and non-Windows systems.
You can recover the original pickle image by reading the file in binary mode and then reversing the damage by replacing all occurrences of '\r\n' with '\n'. Note: this recovery procedure is necessary whether you are trying to read it on Windows or not.
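A sketch of that recovery, assuming the only damage is the text-mode newline translation (the file name is a placeholder):
import cPickle

with open('damaged.pickle', 'rb') as f:  # binary mode: see the raw bytes
    data = f.read()
# Text mode on Windows only ever inserts '\r' before '\n' on writing, so
# replacing each '\r\n' with '\n' removes exactly the inserted bytes.
obj = cPickle.loads(data.replace('\r\n', '\n'))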
Newlines on Windows aren't just '\r'; they're CRLF, i.e. '\r\n'.
Give file.read().replace('\r\n', '\n') a try. You were previously deleting carriage returns that may not have actually been part of newlines.
Can't you -- on Windows -- just open the file in text mode, the same way it was written, read it in and then write it out to another file opened properly in binary mode?
Have you tried unpickling in text mode? That is,
x = pickle.load(open(filename, 'r'))
(On Windows, of course.)

Is there any function like iconv in Python?

I have some CSV files that need to be converted from Shift-JIS to UTF-8.
Here is my code in PHP, which successfully transcodes to readable text.
$str = utf8_decode($str);
$str = iconv('shift-jis', 'utf-8'. '//TRANSLIT', $str);
echo $str;
My problem is how to do the same thing in Python.
I don't know PHP, but does this work?
mystring.decode('shift-jis').encode('utf-8')
Also, I assume the CSV content comes from a file. There are a few options for opening a file in Python:
with open(myfile, 'rb') as fin
would be the first, and you would get the data exactly as it is on disk;
with open(myfile, 'r') as fin
would be the default, text-mode way of opening a file.
Also, I tried this on my computer with a Shift-JIS text file, and the following code worked:
with open("shift.txt" , "rb") as fin :
text = fin.read()
text.decode('shift-jis').encode('utf-8')
The result was the following in UTF-8 (without any errors):
' \xe3\x81\xa6 \xe3\x81\xa7 \xe3\x81\xa8'
OK, I've validated my solution :)
The first char is indeed the correct character: "\xe3\x81\xa6" means "E3 81 A6".
It gives the correct result.
You can try yourself at this URL
For when Python's built-in encodings are insufficient, there's an iconv package on PyPI:
pip install iconv
Unfortunately the documentation is nonexistent.
There's also iconv_codecs
pip install iconv_codecs
eg:
>>> import iconv_codecs
>>> iconv_codecs.register('ansi_x3.110-1983')
>>> "foo".encode('ansi_x3.110-1983')
It would be helpful if you could post the string that you are trying to convert, since this error suggests a problem with the input data; older versions of PHP failed silently on broken input strings, which makes this hard to diagnose.
According to the documentation, this might also be due to differences in Shift-JIS dialects; try using 'shift_jisx0213' or 'shift_jis_2004' instead.
If using another dialect does not work, you might get away with asking Python to fail silently by using .decode('shift-jis', 'ignore') or .decode('shift-jis', 'replace').
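Putting the pieces together, a sketch of a whole-file conversion using codecs.open (the file names are placeholders; 'replace' mimics PHP's lenient behaviour):
import codecs

with codecs.open('input.csv', 'r', 'shift_jis', errors='replace') as fin:
    text = fin.read()  # decoded to unicode
with codecs.open('output.csv', 'w', 'utf-8') as fout:
    fout.write(text)  # re-encoded as UTF-8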

Postgres COPY FROM file throwing unicode error while referenced character apparently not in file

First of all, thank you to everyone on Stack Overflow for past, present, and future help. You've all saved me from disaster (both of my own design and otherwise) too many times to count.
The present issue is part of a decision at my firm to transition from a Microsoft SQL Server 2005 database to PostgreSQL 9.4. We have been following the notes on the Postgres wiki (https://wiki.postgresql.org/wiki/Microsoft_SQL_Server_to_PostgreSQL_Migration_by_Ian_Harding), and these are the steps we're following for the table in question:
Download table data [on Windows client]:
bcp "Carbon.consensus.observations" out "Carbon.consensus.observations" -k -S [servername] -T -w
Copy to Postgres server [running CentOS 7]
Run Python pre-processing script on Postgres server to change encoding and clean:
import sys
import os
import re
import codecs
import fileinput

base_path = '/tmp/tables/'
cleaned_path = '/tmp/tables_processed/'
files = os.listdir(base_path)
for filename in files:
    source_path = base_path + filename
    temp_path = '/tmp/' + filename
    target_path = cleaned_path + filename
    BLOCKSIZE = 1048576  # or some other, desired size in bytes
    with open(source_path, 'r') as source_file:
        with open(target_path, 'w') as target_file:
            start = True
            while True:
                contents = source_file.read(BLOCKSIZE).decode('utf-16le')
                if not contents:
                    break
                if start:
                    if contents.startswith(codecs.BOM_UTF8.decode('utf-8')):
                        contents = contents.replace(codecs.BOM_UTF8.decode('utf-8'), ur'')
                contents = contents.replace(ur'\x80', u'')
                contents = re.sub(ur'\000', ur'', contents)
                contents = re.sub(ur'\r\n', ur'\n', contents)
                contents = re.sub(ur'\r', ur'\\r', contents)
                target_file.write(contents.encode('utf-8'))
                start = False
    for line in fileinput.input(target_path, inplace=1):
        if '\x80' in line:
            line = line.replace(r'\x80', '')
        sys.stdout.write(line)
Execute SQL to load table:
COPY consensus.observations FROM '/tmp/tables_processed/Carbon.consensus.observations';
The issue is that the COPY command is failing with a unicode error:
[2015-02-24 19:52:24] [22021] ERROR: invalid byte sequence for encoding "UTF8": 0x80
Where: COPY observations, line 2622420: "..."
Given that this could very likely be because of bad data in the table (which also contains legitimate non-ASCII characters), I'm trying to find the actual byte sequence in context, and I can't find it anywhere (using sed to look at the line in question, regexes to replace the character as part of the preprocessing, etc.). For reference, this grep returns nothing:
cat /tmp/tables_processed/Carbon.consensus.observations | grep --color='auto' -P "[\x80]"
What am I doing wrong in tracking down where this byte sequence sits in context?
I would recommend loading the SQL file (which appears to be /tmp/tables_processed/Carbon.consensus.observations) into an editor that has a hex mode. This should allow you to see it (depending on the exact editor) in context.
gVim (or terminal-based Vim) is one option I would recommend.
For example, if I open in gVim an SQL copy file that has this content:
1 1.2
2 1.1
3 3.2
I can then convert it into hex mode via the command %!xxd (in gVim or terminal Vim) or the menu option Tools > Convert to HEX.
That yields this display:
0000000: 3109 312e 320a 3209 312e 310a 3309 332e 1.1.2.2.1.1.3.3.
0000010: 320a 2.
You can then run %!xxd -r to convert it back, or the Menu option Tools > Convert back.
Note: This actually modifies the file, so it would be advisable to do this to a copy of the original, just in case the changes somehow get written (you would have to explicitly save the buffer in Vim).
This way, you can see both the hex sequences on the left, and their ASCII equivalent on the right. If you search for 80, you should be able to see it in context. With gVim, the line numbering will be different for both modes, though, as is evidenced by this example.
It's likely the first 80 you find will be that line, though, since if there were earlier ones, it likely would've failed on those instead.
Another tool which might help that I've used in the past is the graphical hex editor GHex. Since that's a GNOME project, not quite sure it'll work with CentOS. wxHexEditor supposedly works with CentOS and looks promising from the website, although I've not yet used it. It's pitched as a "hex editor for massive files", so if your SQL file is large, that might be the way to go.
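If a hex editor isn't handy, a hedged Python 2 sketch can report the first spot where UTF-8 decoding chokes, with some surrounding context (this reads the whole file, so it suits files that fit in memory):
path = '/tmp/tables_processed/Carbon.consensus.observations'
with open(path, 'rb') as f:
    data = f.read()
try:
    data.decode('utf-8')
    print 'file decodes cleanly as UTF-8'
except UnicodeDecodeError as e:
    # e.start is the byte offset of the sequence Postgres complains about
    print 'bad byte %r at offset %d' % (data[e.start], e.start)
    print repr(data[max(0, e.start - 30):e.start + 30])  # context window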

WinZip cannot open an archive created by Python shutil.make_archive on Windows; on Ubuntu, Archive Manager opens it fine

I am trying to return a zip file in a Django HTTP response; the code goes something like this...
archive = shutil.make_archive('testfolder', 'zip', MEDIA_ROOT, 'testfolder')
response = HttpResponse(FileWrapper(open(archive)),
                        content_type=mimetypes.guess_type(archive)[0])
response['Content-Length'] = getsize(archive)
response['Content-Disposition'] = "attachment; filename=test %s.zip" % datetime.now()
return response
Now when this code is executed on Ubuntu, the resulting downloaded file opens without any issue, but when it's executed on Windows, the created file does not open in WinZip (it gives the error 'Unsupported Zip Format').
Is there something very obvious I am missing here? Isn't Python code supposed to be portable?
EDIT:
Thanks to J.F. Sebastian for his comment...
There was no problem in creating the archive; the problem was reading it back for the response. So the solution is to change the second line of my code from
response = HttpResponse(FileWrapper(open(archive)),
                        content_type=mimetypes.guess_type(archive)[0])
to
response = HttpResponse(FileWrapper(open(archive, 'rb')),  # notice the extra 'rb'
                        content_type=mimetypes.guess_type(archive)[0])
Check out my answer to this question for more details...
The code you have written should work correctly. I've just run the following line from your snippet to generate a zip file and was able to extract it on both Linux and Windows:
archive = shutil.make_archive('testfolder', 'zip', MEDIA_ROOT, 'testfolder')
There is something funny and specific going on. I recommend you check the following:
1. Generate the zip file outside of Django with a script that just has that one-liner, then try to extract it on a Windows machine. This will help you rule out anything relating to Django, the web server, or the browser.
2. If that works, look at exactly what is in the folder you compressed. Do the files have any funny characters in their names, are there strange file types, or super-long filenames?
3. Run an MD5 checksum on the zip file on both Windows and Linux, just to make absolutely sure the two files are byte-for-byte identical, ruling out any file corruption that might have occurred; see the sketch after this list.
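For step 3, a sketch that computes the checksum the same way on both systems (the archive name is assumed; the file is read whole, which is fine for small archives):
import hashlib

with open('testfolder.zip', 'rb') as f:  # binary mode matters here too
    print hashlib.md5(f.read()).hexdigest()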
Thanks to J.F. Sebastian for his comment...
I'll still write the solution here in detail...
There was no problem in creating the archive; the problem was reading it back for the response. So the solution is to change the second line of my code from
response = HttpResponse(FileWrapper(open(archive)),
                        content_type=mimetypes.guess_type(archive)[0])
to
response = HttpResponse(FileWrapper(open(archive, 'rb')),  # notice the extra 'rb'
                        content_type=mimetypes.guess_type(archive)[0])
because, apparently, hidden somewhere in the Python 2.3 documentation on open():
The most commonly-used values of mode are 'r' for reading, 'w' for
writing (truncating the file if it already exists), and 'a' for
appending (which on some Unix systems means that all writes append to
the end of the file regardless of the current seek position). If mode
is omitted, it defaults to 'r'. The default is to use text mode, which
may convert '\n' characters to a platform-specific representation on
writing and back on reading. Thus, when opening a binary file, you
should append 'b' to the mode value to open the file in binary mode,
which will improve portability. (Appending 'b' is useful even on
systems that don’t treat binary and text files differently, where it
serves as documentation.) See below for more possible values of mode.
So, in simple terms: when reading binary files, using open(file, 'rb') increases the portability of your code (it certainly did in this case).
Now it extracts without trouble on Windows...

Reading non-text files into Python

I want to read in a non-text file. It has the extension ".map" but it can be opened by Notepad. How should I open this file in Python?
file = open("path-to-file","r") doesn't work for me. It returns a No such file or directory error.
Here's what my file looks like:
111 + gi|89106884|ref|AC_000091.1| 725803 TCGAGATCGACCATGTTGCCCGCCT IIIIIIIIIIIIIIIIIIIIIIIII 0 14:A>G
457 + gi|89106884|ref|AC_000091.1| 32629 CCGTGTCCACCGACTACGACACCTC IIIIIIIIIIIIIIIIIIIIIIIII 0 4:C>G,22:T>C
779 + gi|89106884|ref|AC_000091.1| 483582 GATCACCCACGCAAAGATGGGGCGA IIIIIIIIIIIIIIIIIIIIIIIII 0 15:A>G,18:C>G
784 + gi|89106884|ref|AC_000091.1| 226200 ACCGATAGTGAACCAGTACCGTGAG IIIIIIIIIIIIIIIIIIIIIIIII 1
If I do the following:
file = open("D:\bowtie-0.12.7-win32\bowtie-0.12.7\output_635\results_NC_000117.fna.1.ebwt.map","rb")
It still gives me a No such file or directory: 'D:\x08owtie-0.12.7-win32\x08owtie-0.12.7\\output_635\results_NC_000117.fna.1.ebwt.map' error. Is this because the file isn't binary, or because I don't have some permission?
Would appreciate help with this!
Binary files should use a binary mode.
f = open("path-to-file","rb")
But that won't help if you don't have the appropriate permissions or don't know the format of the file itself.
EDIT:
Obviously you didn't read the error message closely, or you would have noticed that the filename it is using is not the one you expected: in "D:\bowtie...", Python interprets \b as a backspace escape (\x08), which is why the error shows 'D:\x08owtie...'. Either escape the backslashes or use a raw string:
f = open("D:\\bowtie-0.12.7-win32\\bowtie-0.12.7\\output_635\\results_NC_000117.fna.1.ebwt.map","rb")
f = open(r"D:\bowtie-0.12.7-win32\bowtie-0.12.7\output_635\results_NC_000117.fna.1.ebwt.map","rb")
You have hit upon a minor difference between Unix and Windows here.
Since you mentioned Notepad, you must be running this on Windows. In DOS/Windows land, opening a binary file requires specifying attribute 'b' for binary, as others have already indicated. Unix/Linux are a bit more relaxed about this. Omitting attribute 'b' will still open a binary file.
The same behavior is exhibited in the C library's fopen() call.
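A small sketch of the difference, assuming some.bin is an arbitrary binary file: on Windows, text mode translates '\r\n' to '\n' on reading and stops at a Ctrl-Z (0x1a) byte, so the two reads below can differ, while on Unix they are identical.
data_text = open('some.bin', 'r').read()   # text mode: newline translation on Windows
data_bin = open('some.bin', 'rb').read()   # binary mode: raw bytes everywhere
print len(data_text), len(data_bin)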
If it's a non-text file, you could try opening it in binary mode. Try this:
with open("path-to-file", "rb") as f:
byte = f.read(1)
while byte != "":
byte = f.read(1) # Do stuff with byte.
The with statement handles opening and closing the file, including if an exception is raised in the inner block.
Of course, since the format is binary, you need to know what you are going to do with what you read. Also, here I read 1 byte at a time; you can define bigger chunk sizes too.
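For instance, a sketch reading in larger chunks (the chunk size is arbitrary):
CHUNK = 4096
with open("path-to-file", "rb") as f:
    while True:
        chunk = f.read(CHUNK)
        if not chunk:  # empty string means end of file
            break
        # Do stuff with chunk.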
UPDATE: Maybe this is not a binary file. You might be having problems with the file encoding; the characters might not be ASCII, or they might belong to a Unicode charset. Try this:
import codecs
f = codecs.open(u'path-to-file','r','utf-8')
print f.read()
f.close()
If you print this out in the terminal, you might still get gibberish, since the terminal might not support this charset. I would advise: go ahead and process the text, assuming it's properly opened.
Source
