How to generate valid filename with unicode characters for download? [duplicate]

Web applications that want to force a resource to be downloaded rather than directly rendered in a Web browser issue a Content-Disposition header in the HTTP response of the form:
Content-Disposition: attachment; filename=FILENAME
The filename parameter can be used to suggest a name for the file into which the resource is downloaded by the browser. RFC 2183 (Content-Disposition), however, states in section 2.3 (The Filename Parameter) that the file name can only use US-ASCII characters:
Current [RFC 2045] grammar restricts
parameter values (and hence
Content-Disposition filenames) to
US-ASCII. We recognize the great
desirability of allowing arbitrary
character sets in filenames, but it is
beyond the scope of this document to
define the necessary mechanisms.
There is empirical evidence, nevertheless, that most popular Web browsers today seem to permit non-US-ASCII characters yet (for the lack of a standard) disagree on the encoding scheme and character set specification of the file name. Question is then, what are the various schemes and encodings employed by the popular browsers if the file name “naïvefile” (without quotes and where the third letter is U+00EF) needed to be encoded into the Content-Disposition header?
For the purpose of this question, popular browsers being:
Google Chrome
Safari
Internet Explorer or Edge
Firefox
Opera

I know this is an old post but it is still very relevant. I have found that modern browsers support RFC 5987, which allows UTF-8 encoding with percent-encoding (URL-encoding). Naïve file.txt then becomes:
Content-Disposition: attachment; filename*=UTF-8''Na%C3%AFve%20file.txt
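A minimal Python sketch of the RFC 5987 form shown above, assuming the standard library's urllib.parse.quote is an acceptable percent-encoder for this purpose. The helper name rfc5987_disposition is made up for illustration and is not part of any library.

```python
from urllib.parse import quote


def rfc5987_disposition(filename: str) -> str:
    """Build a Content-Disposition value that uses only the filename*= form.

    Hypothetical helper: the name and shape are assumptions for illustration.
    """
    # quote() encodes the text as UTF-8 and percent-encodes every byte that is
    # not a letter, a digit, or one of "_.-~"; safe="" also escapes "/".
    encoded = quote(filename, safe="", encoding="utf-8")
    return "attachment; filename*=UTF-8''" + encoded


print(rfc5987_disposition("Na\u00efve file.txt"))
# attachment; filename*=UTF-8''Na%C3%AFve%20file.txt
```

The sketch covers only the case of browsers that understand filename*=; the browser-specific fallbacks are a separate matter.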
Safari (5) does not support this. Instead, use the Safari convention of writing the file name directly in a UTF-8 encoded header:
Content-Disposition: attachment; filename=Naïve file.txt
IE8 and older don't support it either; for them, use the IE convention of UTF-8 encoding with percent-encoding:
Content-Disposition: attachment; filename=Na%C3%AFve%20file.txt
In ASP.Net I use the following code:
string contentDisposition;
if (Request.Browser.Browser == "IE" && (Request.Browser.Version == "7.0" || Request.Browser.Version == "8.0"))
    contentDisposition = "attachment; filename=" + Uri.EscapeDataString(fileName);
else if (Request.Browser.Browser == "Safari")
    contentDisposition = "attachment; filename=" + fileName;
else
    contentDisposition = "attachment; filename*=UTF-8''" + Uri.EscapeDataString(fileName);
Response.AddHeader("Content-Disposition", contentDisposition);
I tested the above using IE7, IE8, IE9, Chrome 13, Opera 11, FF5, Safari 5.
Update November 2013:
Here is the code I currently use. I still have to support IE8, so I cannot get rid of the first part. It turns out that browsers on Android use the built in Android download manager and it cannot reliably parse file names in the standard way.
string contentDisposition;
if (Request.Browser.Browser == "IE" && (Request.Browser.Version == "7.0" || Request.Browser.Version == "8.0"))
    contentDisposition = "attachment; filename=" + Uri.EscapeDataString(fileName);
else if (Request.UserAgent != null && Request.UserAgent.ToLowerInvariant().Contains("android")) // android built-in download manager (all browsers on android)
    contentDisposition = "attachment; filename=\"" + MakeAndroidSafeFileName(fileName) + "\"";
else
    contentDisposition = "attachment; filename=\"" + fileName + "\"; filename*=UTF-8''" + Uri.EscapeDataString(fileName);
Response.AddHeader("Content-Disposition", contentDisposition);
The above now tested in IE7-11, Chrome 32, Opera 12, FF25, Safari 6, using this filename for download: 你好abcABCæøåÆØÅäöüïëêîâéíáóúýñ½§!#¤%&()=`#£$€{[]}+´¨^~'-_,;.txt
On IE7 it works for some characters but not all. But who cares about IE7 nowadays?
This is the function I use to generate safe file names for Android. Note that I don't know which characters are supported on Android but that I have tested that these work for sure:
private static readonly Dictionary<char, char> AndroidAllowedChars = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ._-+,#£$€!½§~'=()[]{}0123456789".ToDictionary(c => c);
private string MakeAndroidSafeFileName(string fileName)
{
    char[] newFileName = fileName.ToCharArray();
    for (int i = 0; i < newFileName.Length; i++)
    {
        // Replace every character outside the tested whitelist with an underscore
        if (!AndroidAllowedChars.ContainsKey(newFileName[i]))
            newFileName[i] = '_';
    }
    return new string(newFileName);
}
@TomZ: I tested in IE7 and IE8 and it turned out that I did not need to escape apostrophe ('). Do you have an example where it fails?
@Dave Van den Eynde: Combining the two file names on one line according to RFC 6266 works except for Android and IE7+8, and I have updated the code to reflect this. Thank you for the suggestion.
@Thilo: No idea about GoodReader or any other non-browser. You might have some luck using the Android approach.
@Alex Zhukovskiy: I don't know why but as discussed on Connect it doesn't seem to work terribly well.

There is no interoperable way to encode non-ASCII names in Content-Disposition. Browser compatibility is a mess.
The theoretically correct syntax for use of UTF-8 in Content-Disposition is very weird: filename*=UTF-8''foo%c3%a4 (yes, that's an asterisk, and no quotes except an empty single quote in the middle)
This header is kinda-not-quite-standard (HTTP/1.1 spec acknowledges its existence, but doesn't require clients to support it).
There is a simple and very robust alternative: use a URL that contains the filename you want.
When the name after the last slash is the one you want, you don't need any extra headers!
This trick works:
/real_script.php/fake_filename.doc
And if your server supports URL rewriting (e.g. mod_rewrite in Apache) then you can fully hide the script part.
Characters in URLs should be in UTF-8, urlencoded byte-by-byte:
/mot%C3%B6rhead # motörhead
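A small Python sketch of the URL trick described above, assuming the download URL is built server-side. The function name download_url and the script path are placeholders taken from the example in the text, not a real API.

```python
from urllib.parse import quote, unquote


def download_url(script_path: str, filename: str) -> str:
    """Append the desired file name as the last path segment of the URL.

    Hypothetical helper: quote() encodes the name as UTF-8 and percent-encodes
    it byte by byte; safe="" makes sure a "/" in the name is escaped too.
    """
    return script_path + "/" + quote(filename, safe="")


url = download_url("/real_script.php", "mot\u00f6rhead.doc")
print(url)           # /real_script.php/mot%C3%B6rhead.doc
print(unquote(url))  # decodes back to the readable name
```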

There is discussion of this, including links to browser testing and backwards compatibility, in the proposed RFC 5987, "Character Set and Language Encoding for Hypertext Transfer Protocol (HTTP) Header Field Parameters."
RFC 2183 indicates that such headers should be encoded according to RFC 2184, which was obsoleted by RFC 2231, covered by the draft RFC above.

RFC 6266 describes the “Use of the Content-Disposition Header Field in the Hypertext Transfer Protocol (HTTP)”. Quoting from that:
6. Internationalization Considerations
The “filename*” parameter (Section 4.3), using the encoding defined
in [RFC5987], allows the server to transmit characters outside the
ISO-8859-1 character set, and also to optionally specify the language
in use.
And in their examples section:
This example is the same as the one above, but adding the "filename"
parameter for compatibility with user agents not implementing
RFC 5987:
Content-Disposition: attachment;
filename="EURO rates";
filename*=utf-8''%e2%82%ac%20rates
Note: Those user agents that do not support the RFC 5987 encoding
ignore “filename*” when it occurs after “filename”.
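The two-parameter form in the RFC 6266 example can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name is invented, and the ASCII fallback rule (strip accents where Unicode decomposition allows it, replace everything else with an underscore) is one arbitrary choice, not something the RFC prescribes.

```python
import unicodedata
from urllib.parse import quote


def disposition_with_fallback(filename: str) -> str:
    """Return 'filename="<ascii>"; filename*=UTF-8''<encoded>' (hypothetical helper)."""
    # Decompose accented letters (e.g. "ï" becomes "i" plus a combining mark),
    # drop the combining marks, and replace anything still non-ASCII,
    # plus quote, backslash and percent, with "_".
    decomposed = unicodedata.normalize("NFKD", filename)
    fallback = "".join(
        ch if (ch.isascii() and ch.isprintable() and ch not in '"\\%') else "_"
        for ch in decomposed
        if not unicodedata.combining(ch)
    )
    encoded = quote(filename, safe="")
    # filename comes first, so agents without RFC 5987 support use the fallback.
    return 'attachment; filename="' + fallback + "\"; filename*=UTF-8''" + encoded


print(disposition_with_fallback("Na\u00efve file.txt"))
print(disposition_with_fallback("\u20ac rates"))
```

Unlike the hand-written "EURO rates" in the RFC example, this automatic fallback cannot invent a meaningful replacement for the euro sign; it simply yields an underscore.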
In Appendix D there is also a long list of suggestions to increase interoperability. It also points at a site which compares implementations. Current all-pass tests suitable for common file names include:
attwithisofnplain: plain ISO-8859-1 file name with double quotes and without encoding. This requires a file name which is all ISO-8859-1 and does not contain percent signs, at least not in front of hex digits.
attfnboth: two parameters in the order described above. Should work for most file names on most browsers, although IE8 will use the “filename” parameter.
That RFC 5987 in turn references RFC 2231, which describes the actual format. 2231 is primarily for mail, and 5987 tells us what parts may be used for HTTP headers as well. Don't confuse this with MIME headers used inside a multipart/form-data HTTP body, which is governed by RFC 2388 (section 4.4 in particular) and the HTML 5 draft.

The following document, linked from the draft RFC mentioned by Jim in his answer, further addresses the question and is definitely worth a direct note here:
Test Cases for HTTP Content-Disposition header and RFC 2231/2047 Encoding

Put the file name in double quotes. Solved the problem for me. Like this:
Content-Disposition: attachment; filename="My Report.doc"
http://kb.mozillazine.org/Filenames_with_spaces_are_truncated_upon_download
I've tested multiple options. Browsers do not support the specs and act differently; I believe double quotes are the best option.

I use the following code snippets for encoding (assuming fileName contains the filename and extension of the file, i.e.: test.txt):
PHP:
// strpos() returns false when "MSIE" is absent (and could return 0 for a match at the start), so compare with !== false
if (strpos($_SERVER['HTTP_USER_AGENT'], "MSIE") !== false)
{
    header('Content-Disposition: attachment; filename="' . rawurlencode($fileName) . '"');
}
else
{
    header('Content-Disposition: attachment; filename*=UTF-8\'\'' . rawurlencode($fileName));
}
Java:
fileName = request.getHeader ( "user-agent" ).contains ( "MSIE" ) ? URLEncoder.encode ( fileName, "utf-8") : MimeUtility.encodeWord ( fileName );
response.setHeader ( "Content-disposition", "attachment; filename=\"" + fileName + "\"");

In ASP.NET MVC 2 I use something like this:
return File(
tempFile
, "application/octet-stream"
, HttpUtility.UrlPathEncode(fileName)
);
I guess if you don't use MVC 2 you could just encode the filename using
HttpUtility.UrlPathEncode(fileName)

In ASP.NET Web API, I url encode the filename:
public static class HttpRequestMessageExtensions
{
    public static HttpResponseMessage CreateFileResponse(this HttpRequestMessage request, byte[] data, string filename, string mediaType)
    {
        HttpResponseMessage response = new HttpResponseMessage(HttpStatusCode.OK);
        var stream = new MemoryStream(data);
        stream.Position = 0;
        response.Content = new StreamContent(stream);
        response.Content.Headers.ContentType =
            new MediaTypeHeaderValue(mediaType);
        // URL-Encode filename
        // Fixes behavior in IE, that filenames with non US-ASCII characters
        // stay correct (not "_utf-8_.......=_=").
        var encodedFilename = HttpUtility.UrlEncode(filename, Encoding.UTF8);
        response.Content.Headers.ContentDisposition =
            new ContentDispositionHeaderValue("attachment") { FileName = encodedFilename };
        return response;
    }
}

In PHP this did it for me (assuming the filename is UTF8 encoded):
header('Content-Disposition: attachment;'
. 'filename="' . addslashes(utf8_decode($filename)) . '";'
. 'filename*=utf-8\'\'' . rawurlencode($filename));
Tested against IE8-11, Firefox and Chrome.
If the browser can interpret filename*=utf-8 it will use the UTF8 version of the filename, else it will use the decoded filename. If your filename contains characters that can't be represented in ISO-8859-1 you might want to consider using iconv instead.

Just an update since I was trying all this stuff today in response to a customer issue
With the exception of Safari configured for Japanese, all browsers our customer tested worked best with filename=text.pdf - where text is a customer value serialized by ASP.Net/IIS in utf-8 without url encoding. For some reason, Safari configured for English would accept and properly save a file with utf-8 Japanese name but that same browser configured for Japanese would save the file with the utf-8 chars uninterpreted. All other browsers tested seemed to work best/fine (regardless of language configuration) with the filename utf-8 encoded without url encoding.
I could not find a single browser implementing RFC 5987/8187 at all. I tested with the latest Chrome and Firefox builds plus IE 11 and Edge. I tried setting the header with just filename*=utf-8''texturlencoded.pdf, and with both filename=text.pdf; filename*=utf-8''texturlencoded.pdf. Not one feature of RFC 5987/8187 appeared to be processed correctly in any of the above.

If you are using a nodejs backend you can use the following code I found here
var fileName = 'my file(2).txt';
var header = "Content-Disposition: attachment; filename*=UTF-8''"
             + encodeRFC5987ValueChars(fileName);
function encodeRFC5987ValueChars (str) {
    return encodeURIComponent(str).
        // Note that although RFC3986 reserves "!", RFC5987 does not,
        // so we do not need to escape it
        replace(/['()]/g, escape). // i.e., %27 %28 %29
        replace(/\*/g, '%2A').
        // The following are not required for percent-encoding per RFC5987,
        // so we can allow for a little better readability over the wire: |`^
        replace(/%(?:7C|60|5E)/g, unescape);
}

I tested the following code in all major browsers, including older Explorers (via the compatibility mode), and it works well everywhere:
$filename = $_GET['file']; //this string from $_GET is already decoded
if (strstr($_SERVER['HTTP_USER_AGENT'],"MSIE"))
$filename = rawurlencode($filename);
header('Content-Disposition: attachment; filename="'.$filename.'"');

I ended up with the following code in my "download.php" script (based on this blogpost and these test cases).
$il1_filename = utf8_decode($filename);
$to_underscore = "\"\\#*;:|<>/?";
$safe_filename = strtr($il1_filename, $to_underscore, str_repeat("_", strlen($to_underscore)));
header("Content-Disposition: attachment; filename=\"$safe_filename\""
.( $safe_filename === $filename ? "" : "; filename*=UTF-8''".rawurlencode($filename) ));
This uses the standard way of filename="..." as long as there are only iso-latin1 and "safe" characters used; if not, it adds the filename*=UTF-8'' url-encoded way. According to this specific test case, it should work from MSIE9 up, and on recent FF, Chrome, Safari; on lower MSIE version, it should offer filename containing the ISO8859-1 version of the filename, with underscores on characters not in this encoding.
Final note: the maximum size of each header field is 8190 bytes on Apache. UTF-8 can take up to four bytes per character; after rawurlencode each byte becomes three characters, so one character can cost up to 12 bytes. Pretty inefficient, but it should still be theoretically possible to have more than 600 "smiles" %F0%9F%98%81 in the filename.
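The arithmetic in the final note can be checked with a few lines of Python. The 8190-byte figure is taken from the text above; treating the whole header line (field name included) as counting against that limit is an assumption made here only to get a conservative estimate.

```python
from urllib.parse import quote

HEADER_FIELD_LIMIT = 8190  # bytes, the Apache figure quoted in the text

smile = "\U0001F601"               # a four-byte UTF-8 character
encoded = quote(smile, safe="")    # each byte becomes "%XX", i.e. three characters
print(encoded, len(encoded))

# Assumed worst case: the header prefix counts against the limit as well.
prefix = "Content-Disposition: attachment; filename*=UTF-8''"
budget = HEADER_FIELD_LIMIT - len(prefix)
print(budget // len(encoded), "such characters fit in the remaining budget")
```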

From .NET 4.5 (and Core 1.0) you can use ContentDispositionHeaderValue to do the formatting for you.
var fileName = "Naïve file.txt";
var h = new System.Net.Http.Headers.ContentDispositionHeaderValue("attachment");
h.FileNameStar = fileName;
h.FileName = "fallback-ascii-name.txt";
Response.Headers.Add("Content-Disposition", h.ToString());
h.ToString() will result in:
attachment; filename*=utf-8''Na%C3%AFve%20file.txt; filename=fallback-ascii-name.txt

PHP framework Symfony 4 has $filenameFallback in HeaderUtils::makeDisposition.
You can look into this function for details - it is similar to the answers above.
Usage example:
$filenameFallback = preg_replace('#^.*\.#', md5($filename) . '.', $filename);
$disposition = $response->headers->makeDisposition(ResponseHeaderBag::DISPOSITION_ATTACHMENT, $filename, $filenameFallback);
$response->headers->set('Content-Disposition', $disposition);

For those who need a JavaScript way of encoding the header (the snippet below uses a TypeScript type annotation), I found that this function works well:
function createContentDispositionHeader(filename:string) {
    const encoded = encodeURIComponent(filename);
    return `attachment; filename*=UTF-8''${encoded}; filename="${encoded}"`;
}
This is based on what Nextcloud seems to be doing when downloading a file. The filename appears first as UTF-8 encoded, and possibly for compatibility with some browsers, the filename also appears without the UTF-8 prefix.

Classic ASP Solution
Most modern browsers support passing the filename as UTF-8 now. However, a file upload solution I use, based on FreeASPUpload.Net (site no longer exists, link points to archive.org), would not work: its parsing of the binary data relied on reading single-byte ASCII-encoded strings, which worked fine with UTF-8 encoded data until it hit characters that ASCII doesn't support.
However I was able to find a solution to get the code to read and parse the binary as UTF-8.
Public Function BytesToString(bytes) 'UTF-8..
Dim bslen
Dim i, k , N
Dim b , count
Dim str
bslen = LenB(bytes)
str=""
i = 0
Do While i < bslen
b = AscB(MidB(bytes,i+1,1))
If (b And &HFC) = &HFC Then
count = 6
N = b And &H1
ElseIf (b And &HF8) = &HF8 Then
count = 5
N = b And &H3
ElseIf (b And &HF0) = &HF0 Then
count = 4
N = b And &H7
ElseIf (b And &HE0) = &HE0 Then
count = 3
N = b And &HF
ElseIf (b And &HC0) = &HC0 Then
count = 2
N = b And &H1F
Else
count = 1
str = str & Chr(b)
End If
If i + count - 1 > bslen Then
str = str&"?"
Exit Do
End If
If count>1 then
For k = 1 To count - 1
b = AscB(MidB(bytes,i+k+1,1))
N = N * &H40 + (b And &H3F)
Next
str = str & ChrW(N)
End If
i = i + count
Loop
BytesToString = str
End Function
Credit goes to Pure ASP File Upload: by implementing the BytesToString() function from include_aspuploader.asp in my own code, I was able to get UTF-8 filenames working.
Useful Links
Multipart/form-data and UTF-8 in a ASP Classic application
Unicode, UTF, ASCII, ANSI format differences

The method mimeHeaderEncode($string) from the library class Unicode does the job.
$file_name= Unicode::mimeHeaderEncode($file_name);
Example in drupal/php:
https://github.com/drupal/core-utility/blob/8.8.x/Unicode.php
/**
 * Encodes MIME/HTTP headers that contain incorrectly encoded characters.
 *
 * For example, Unicode::mimeHeaderEncode('tést.txt') returns
 * "=?UTF-8?B?dMOpc3QudHh0?=".
 *
 * See http://www.rfc-editor.org/rfc/rfc2047.txt for more information.
 *
 * Notes:
 * - Only encode strings that contain non-ASCII characters.
 * - We progressively cut-off a chunk with self::truncateBytes(). This ensures
 *   each chunk starts and ends on a character boundary.
 * - Using \n as the chunk separator may cause problems on some systems and
 *   may have to be changed to \r\n or \r.
 *
 * @param string $string
 *   The header to encode.
 * @param bool $shorten
 *   If TRUE, only return the first chunk of a multi-chunk encoded string.
 *
 * @return string
 *   The mime-encoded header.
 */
public static function mimeHeaderEncode($string, $shorten = FALSE) {
  if (preg_match('/[^\x20-\x7E]/', $string)) {
    // floor((75 - strlen("=?UTF-8?B??=")) * 0.75);
    $chunk_size = 47;
    $len = strlen($string);
    $output = '';
    while ($len > 0) {
      $chunk = static::truncateBytes($string, $chunk_size);
      $output .= ' =?UTF-8?B?' . base64_encode($chunk) . "?=\n";
      if ($shorten) {
        break;
      }
      $c = strlen($chunk);
      $string = substr($string, $c);
      $len -= $c;
    }
    return trim($output);
  }
  return $string;
}

We had a similar problem in a web application, and ended up by reading the filename from the HTML <input type="file">, and setting that in the url-encoded form in a new HTML <input type="hidden">. Of course we had to remove the path like "C:\fakepath\" that is returned by some browsers.
Of course this does not directly answer OPs question, but may be a solution for others.

I normally URL-encode (with %xx) the filenames, and it seems to work in all browsers. You might want to do some tests anyway.

Related

How To Remove (%0D) in python requests

import requests
nexmokey = 'mykey'
nexmosec = 'mysecretkey'
nexmoBal = 'https://rest.nexmo.com/account/get-balance?api_key={}&api_secret={}'.format(nexmokey,nexmosec)
rr = requests.get(nexmoBal)
print(rr.url)
I would like to send a request to post at
https://rest.nexmo.com/account/get-balance?api_key=mykey&api_secret=mysecretkey
but why does %0D appear?
https://rest.nexmo.com/account/get-balance?api_key=mykey%0D&api_secret=mysecretkey%0D

requests.get expects parameters like api_secret=my_secret to be provided through the params argument, which is URL-encoded for you, rather than being written into the URL by hand.
Use this:
nexmoBal = 'https://rest.nexmo.com/account/get-balance'
rr = requests.get(nexmoBal, params={'api_key': nexmokey, 'api_secret': nexmosec})
The fact that %0D ends up in there, indicates you have a character #13 (0D hexadecimal) in there, which is a carriage return (part of the end of line on Windows systems) - probably because you are reading the key and secret from some file and didn't include them in the example code.
Also, note that you mention you want to post, but you're calling .get().
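To illustrate where %0D comes from, here is a short standard-library sketch. It assumes, as guessed above, that the key was read from a file with Windows line endings; the value "mykey\r\n" is a stand-in for such a line.

```python
from urllib.parse import urlencode

raw_key = "mykey\r\n"  # assumed: a line read from a file with CRLF line endings

# Removing only "\n" leaves the carriage return behind, which encodes as %0D.
print(urlencode({"api_key": raw_key.rstrip("\n")}))

# strip() removes all surrounding whitespace, including "\r".
print(urlencode({"api_key": raw_key.strip()}))
```

Calling .strip() on the key and secret before passing them to requests should therefore make the stray %0D disappear.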

Encoding error: in MIME file data via AWS SES

I am trying to retrieve attachment data, such as the file format and file name, from a MIME message received via AWS SES. Unfortunately the file name encoding is sometimes changed: for example, the file name is "3_amrishmishra_Entry Level Resume - 02.pdf" but in the MIME message it appears as '=?UTF-8?Q?amrishmishra=5FEntry_Level_Resume_=E2=80=93_02=2Epdf?='. Is there any way to get the exact file name?
if email_message.is_multipart():
    message = ''
    if "apply" in receiver_email.split('@')[0].split('_')[0] and isinstance(int(receiver_email.split('@')[0].split('_')[1]), int):
        for part in email_message.walk():
            content_type = str(part.get_content_type()).lower()
            content_dispo = str(part.get('Content-Disposition')).lower()
            print(content_type, content_dispo)
            if 'text/plain' in content_type and "attachment" not in content_dispo:
                message = part.get_payload()
            if content_type in ['application/pdf', 'text/plain', 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'image/jpeg', 'image/jpg', 'image/png', 'image/gif'] and "attachment" in content_dispo:
                filename = part.get_filename()
                # open('/tmp/local' + filename, 'wb').write(part.get_payload(decode=True))
                # s3r.meta.client.upload_file('/tmp/local' + filename, bucket_to_upload, filename)
                data = {
                    'base64_resume': part.get_payload(),
                    'filename': filename,
                }
                data_list.append(data)
        try:
            api_data = {
                'email_data': email_data,
                'resumes_data': data_list
            }
            print(len(data_list))
            response = requests.post(url, data=json.dumps(api_data),
                                     headers={'content-type': 'application/json'})
            print(response.status_code, response.content)
        except Exception as e:
            print("error %s" % e)

This syntax '=?UTF-8?Q?...?=' is a MIME encoded word. It is used in MIME email when a header value includes non-ASCII characters (gory details in RFC 2047). Your attachment filename includes an "en dash" character, which is why it was sent with this encoding.
The best way to handle it depends on which Python version you're using...
Python 3
Python 3's updated email.parser package can correctly decode RFC 2047 headers for you:
# Python 3
from email import message_from_bytes, policy
raw_message_bytes = b"<< the MIME message you downloaded from SES >>"
message = message_from_bytes(raw_message_bytes, policy=policy.default)
for attachment in message.iter_attachments():
    # (EmailMessage.iter_attachments is new in Python 3)
    print(attachment.get_filename())
    # amrishmishra_Entry Level Resume – 02.pdf
You must specifically request policy.default. If you don't, the parser will use a compat32 policy that replicates Python 2.7's buggy behavior—including not decoding RFC 2047. (Also, early Python 3 releases were still shaking out bugs in the new email package, so make sure you're on Python 3.5 or later.)
Python 2
If you're on Python 2, the best option is upgrading to Python 3.5 or later, if at all possible. Python 2's email parser has many bugs and limitations that were fixed with a massive rewrite in Python 3. (And the rewrite added handy new features like iter_attachments() shown above.)
If you can't switch to Python 3, you can decode the RFC 2047 filename yourself using email.header.decode_header:
# Python 2 (also works in Python 3, but you shouldn't need it there)
from email.header import decode_header
filename = '=?UTF-8?Q?amrishmishra=5FEntry_Level_Resume_=E2=80=93_02=2Epdf?='
decode_header(filename)
# [('amrishmishra_Entry Level Resume \xe2\x80\x93 02.pdf', 'utf-8')]
(decoded_string, charset) = decode_header(filename)[0]
decoded_string.decode(charset)
# u'amrishmishra_Entry Level Resume – 02.pdf'
But again, if you're trying to parse real-world email in Python 2.7, be aware that this is probably just the first of several problems you'll encounter.
The django-anymail package I maintain includes a compatibility version of email.parser.BytesParser that tries to work around several (but not all) other bugs in Python 2.7 email parsing. You may be able to borrow that (internal) code for your purposes. (Or since you tagged your question Django, you might want to look into Anymail's normalized inbound email handling, which includes Amazon SES support.)

Unable to display Japanese (UTF-8) characters in email body with webbrowser

I am reading text from two different .txt files and concatenating them. I then add the result to the body of an email by using webbrowser.
One text file is English characters (ascii) and the other Japanese (UTF-8). The text will display fine if I write it to a text file. But if I use webbrowser to insert the text into an email body the Japanese text displays as question marks.
I have tried running the script on multiple machines that have different mail clients as their defaults. Initially I thought maybe that was the issue, but that does not appear to be. Thunderbird and Mail (MacOSX) display question marks.
Hello. Today is 2014-05-09
????????????????2014-05-09????
I have looked at similar issues around on SO but they have not solved the issue.
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in
position 20: ordinal not in
range(128)
Japanese in python function
Printing out Japanese (Chinese) characters
python utf-8 japanese
Is there a way to have the Japanese (UTF-8) display in the body of an email created with webbrowser in python? I could use the email functionality but the requirement is the script needs to open the default mail client and insert all the information.
The code and text files I am using are below. I have simplified it to focus on the issue.
email-template.txt
Hello. Today is {{date}}
email-template-jp.txt
こんにちは。今日は {{date}} です。
Python Script
#
# -*- coding: utf-8 -*-
#
import sys
import re
import os
import glob
import webbrowser
import codecs,sys
sys.stdout = codecs.getwriter('utf8')(sys.stdout)
# vars
date_range = sys.argv[1:][0]
email_template_en = "email-template.txt"
email_template_jp = "email-template-jp.txt"
email_to_send = "email-to-send.txt" # finished email is saved here
# Default values for the composed email that will be opened
mail_list = "test@test.com"
cc_list = "test1@test.com, test2@test.com"
subject = "Email Subject"
# Open email templates and insert the date from the parameters sent in
try:
    f_en = open(email_template_en, "r")
    f_jp = codecs.open(email_template_jp, "r", "UTF-8")
    try:
        email_content_en = f_en.read()
        email_content_jp = f_jp.read()
        email_en = re.sub(r'{{date}}', date_range, email_content_en)
        email_jp = re.sub(r'{{date}}', date_range, email_content_jp).encode("UTF-8")
        # this throws an error
        # UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 26: ordinal not in range(128)
        # email_en_jp = (email_en + email_jp).encode("UTF-8")
        email_en_jp = (email_en + email_jp)
    finally:
        f_en.close()
        f_jp.close()
        pass
except Exception, e:
    raise e
# Open the default mail client and fill in all the information
try:
    f = open(email_to_send, "w")
    try:
        f.write(email_en_jp)
        # Does not send Japanese text to the mail client. But will write to the .txt file fine. Unsure why.
        webbrowser.open("mailto:%s?subject=%s&cc=%s&body=%s" %(mail_list, subject, cc_list, email_en_jp), new=1) # open mail client with prefilled info
    finally:
        f.close()
        pass
except Exception, e:
    raise e
edit: Forgot to add I am using Python 2.7.1

EDIT 2: Found a workable solution after all.
Replace your webbrowser call with this.
import subprocess
[... other code ...]
arg = "mailto:%s?subject=%s&cc=%s&body=%s" % (mail_list, subject, cc_list, email_en_jp)
subprocess.call(["open", arg])
This will open your default email client on MacOS. For other OSes please replace "open" in the subprocess line with the proper executable.
EDIT: I looked into it a bit more and Mark's comment above made me read the RFC (2368) for mailto URL scheme.
The special hname "body" indicates that the associated hvalue is the
body of the message. The "body" hname should contain the content for
the first text/plain body part of the message. The mailto URL is
primarily intended for generation of short text messages that are
actually the content of automatic processing (such as "subscribe"
messages for mailing lists), not general MIME bodies.
And a bit further down:
8-bit characters in mailto URLs are forbidden. MIME encoded words (as
defined in [RFC2047]) are permitted in header values, but not for any
part of a "body" hname."
So it looks like this is not possible as per RFC, although that makes me question why the JavaScript solution in the JSFiddle provided by naota works at all.
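One way to stay within the "no 8-bit characters" rule quoted above is to percent-encode the UTF-8 bytes of each field before building the URL; RFC 6068, the later mailto specification, describes percent-encoded UTF-8 in this position. The following is a Python 3 sketch under that assumption, not a tested fix for the script in the question: build_mailto is a made-up name, and whether a particular mail client decodes the body correctly is not guaranteed.

```python
from urllib.parse import quote, unquote


def build_mailto(to: str, subject: str, cc: str, body: str) -> str:
    """Build an ASCII-only mailto URL (hypothetical helper).

    quote() encodes each value as UTF-8 and percent-encodes every byte that
    is not unreserved; "@" and "," are kept readable in the address lists.
    """
    return "mailto:{}?subject={}&cc={}&body={}".format(
        quote(to, safe="@,"),
        quote(subject, safe=""),
        quote(cc, safe="@,"),
        quote(body, safe=""),
    )


url = build_mailto("test@test.com", "Email Subject", "test1@test.com",
                   "\u3053\u3093\u306b\u3061\u306f")  # "こんにちは"
print(url)
```

The resulting string contains only ASCII characters, so it could be handed to webbrowser.open or to the subprocess call shown earlier.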
I leave my previous answer as is below, although it does not work.
I have run into same issues with Python 2.7.x quite a couple of times now and every time a different solution somehow worked.
So here are several suggestions that may or may not work, as I haven't tested them.
a) Force unicode strings:
webbrowser.open(u"mailto:%s?subject=%s&cc=%s&body=%s" % (mail_list, subject, cc_list, email_en_jp), new=1)
Notice the small u right after the opening ( and before the ".
b) Force the regex to use unicode:
email_jp = re.sub(ur'{{date}}', date_range, email_content_jp).encode("UTF-8")
# or maybe
email_jp = re.sub(ur'{{date}}', date_range, email_content_jp)
c) Another idea regarding the regex, try compiling it first with the re.UNICODE flag, before applying it.
pattern = re.compile(ur'{{date}}', re.UNICODE)
d) Not directly related, but I noticed you write the combined text via the normal open method. Try using the codecs.open here as well.
f = codecs.open(email_to_send, "w", "UTF-8")
Hope this helps.

Am I parsing this HTTP POST request properly?

Let me start off by saying, I'm using the twisted.web framework. Twisted.web's file uploading didn't work like I wanted it to (it only included the file data, and not any other information), cgi.parse_multipart doesn't work like I want it to (same thing, twisted.web uses this function), cgi.FieldStorage didn't work ('cause I'm getting the POST data through twisted, not a CGI interface -- so far as I can tell, FieldStorage tries to get the request via stdin), and twisted.web2 didn't work for me because the use of Deferred confused and infuriated me (too complicated for what I want).
That being said, I decided to try and just parse the HTTP request myself.
Using Chrome, the HTTP request is formed like this:
------WebKitFormBoundary7fouZ8mEjlCe92pq
Content-Disposition: form-data; name="upload_file_nonce"
11b03b61-9252-11df-a357-00266c608adb
------WebKitFormBoundary7fouZ8mEjlCe92pq
Content-Disposition: form-data; name="file"; filename="login.html"
Content-Type: text/html
<!DOCTYPE html>
<html>
<head>
...
------WebKitFormBoundary7fouZ8mEjlCe92pq
Content-Disposition: form-data; name="file"; filename=""
------WebKitFormBoundary7fouZ8mEjlCe92pq--
Is this always how it will be formed? I'm parsing it with regular expressions, like so (pardon the wall of code):
(note, I snipped out most of the code to show only what I thought was relevant (the regular expressions (yeah, nested parentheses), this is an __init__ method (the only method so far) in an Uploads class I built. The full code can be seen in the revision history (I hope I didn't mismatch any parentheses)
if line == "--{0}--".format(boundary):
    finished = True
if in_header == True and not line:
    in_header = False
    if 'type' not in current_file:
        ignore_current_file = True
if in_header == True:
    m = re.match(
        "Content-Disposition: form-data; name=\"(.*?)\"; filename=\"(.*?)\"$", line)
    if m:
        input_name, current_file['filename'] = m.group(1), m.group(2)
    m = re.match("Content-Type: (.*)$", line)
    if m:
        current_file['type'] = m.group(1)
else:
    if 'data' not in current_file:
        current_file['data'] = line
    else:
        current_file['data'] += line
You can see that I start a new "file" dict whenever a boundary is reached. I set in_header to True to say that I'm parsing headers. When I reach a blank line, I switch it to False -- but not before checking if a Content-Type was set for that form value -- if not, I set ignore_current_file since I'm only looking for file uploads.
I know I should be using a library, but I'm sick to death of reading documentation, trying to get different solutions to work in my project, and still having the code look reasonable. I just want to get past this part -- and if parsing an HTTP POST with file uploads is this simple, then I shall stick with that.
Note: this code works perfectly for now, I'm just wondering if it will choke on/spit out requests from certain browsers.

My solution to this problem was to parse the content with cgi.FieldStorage, like this:
import cgi
from twisted.web.resource import Resource

class Root(Resource):
    def render_POST(self, request):
        self.headers = request.getAllHeaders()
        # For the parsing part, see PyMOTW by Doug Hellmann
        img = cgi.FieldStorage(
            fp=request.content,
            headers=self.headers,
            environ={'REQUEST_METHOD': 'POST',
                     'CONTENT_TYPE': self.headers['content-type'],
                     }
        )
        # Form field name, client-supplied filename and MIME type of the upload
        print img["upl_file"].name, img["upl_file"].filename,
        print img["upl_file"].type
        # Write the uploaded bytes to a file named after the upload
        out = open(img["upl_file"].filename, 'wb')
        out.write(img["upl_file"].value)
        out.close()
        request.redirect('/tests')
        return ''
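A note for anyone reading this on a current Python: the cgi module was deprecated in Python 3.11 and removed in 3.13, so cgi.FieldStorage is not available there. Below is a minimal sketch of the same idea using the standard email package instead. It assumes Python 3 and a request body small enough to hold in memory; parse_form_data and its arguments are hypothetical names for illustration, not part of Twisted or the standard library.

```python
from email.parser import BytesParser


def parse_form_data(content_type, body):
    """Parse a multipart/form-data body (bytes) into a list of field dicts.

    content_type is the value of the request's Content-Type header,
    including the boundary parameter.
    """
    # The email parser expects headers before the body, so prepend the
    # Content-Type header that HTTP carries outside the body.
    prefix = b"Content-Type: " + content_type.encode("ascii") + b"\r\n\r\n"
    message = BytesParser().parsebytes(prefix + body)
    if not message.is_multipart():
        raise ValueError("not a multipart body")

    fields = []
    for part in message.get_payload():
        fields.append({
            # Parameter lookup does not depend on the order of the fields.
            "name": part.get_param("name", header="content-disposition"),
            # None for ordinary form fields, a string for file inputs.
            "filename": part.get_filename(),
            # Falls back to text/plain when the part has no Content-Type.
            "type": part.get_content_type(),
            "data": part.get_payload(decode=True),
        })
    return fields


if __name__ == "__main__":
    sample = (
        b"--XYZ\r\n"
        b'Content-Disposition: form-data; name="upload_file_nonce"\r\n'
        b"\r\n"
        b"11b03b61\r\n"
        b"--XYZ\r\n"
        b'Content-Disposition: form-data; name="file"; filename="login.html"\r\n'
        b"Content-Type: text/html\r\n"
        b"\r\n"
        b"<html></html>\r\n"
        b"--XYZ--\r\n"
    )
    for field in parse_form_data("multipart/form-data; boundary=XYZ", sample):
        print(field)
```

With Twisted, the inputs would presumably come from request.getAllHeaders()['content-type'] and request.content.read(), as in the answer above. I have only tried this kind of thing with small text uploads, so treat it as a starting point rather than a drop-in replacement.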

You're trying to avoid reading documentation, but I think the best advice is to actually read:
RFC 2388: Returning Values from Forms: multipart/form-data
RFC 1867: Form-based File Upload in HTML
to make sure you don't miss any cases. An easier route might be to use the poster library.

The content-disposition header has no defined order for fields, plus it may contain more fields than just the filename. So your match for filename may fail - there may not even be a filename!
See RFC 2183 (edit: that one is for mail; see RFC 1806, RFC 2616 and maybe more for HTTP).
Also, in this kind of regexp I would suggest replacing every space with \s* and not relying on character case.
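To illustrate the point about parameter order and case, here is a rough sketch (Python 3) that lets the standard library's email.message.Message split the header instead of a hand-written regexp. parse_content_disposition is a made-up helper name, and it only handles a single header line, not the whole multipart body.

```python
from email.message import Message


def parse_content_disposition(header_line):
    """Split one 'Content-Disposition: ...' line into (disposition, params)."""
    name, _, value = header_line.partition(":")
    msg = Message()
    msg[name.strip()] = value.strip()
    # Header names are looked up case-insensitively, and the disposition
    # type comes back lowercased (None if this is some other header).
    disposition = msg.get_content_disposition()
    # get_params() returns [(disposition, ''), (key, value), ...];
    # drop the first pair and keep the real parameters.
    pairs = msg.get_params(failobj=[], header="content-disposition")[1:]
    params = {key.lower(): val for key, val in pairs}
    return disposition, params


if __name__ == "__main__":
    print(parse_content_disposition(
        'content-disposition: FORM-DATA; filename="login.html"; name="file"'))
```

Because the result is a dict, a missing filename simply shows up as a missing key (params.get("filename") returns None) rather than as a failed match.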

Django: Unicode Filenames with ASCII headers?

I have a list of strangely encoded files: 02 - Charlie, Woody and You／Study #22.mp3 which I suppose isn't so bad but there are a few particular characters which Django OR nginx seem to be snagging on.
>>> test = u'02 - Charlie, Woody and You／Study #22.mp3'
>>> test
u'02 - Charlie, Woody and You\uff0fStudy #22.mp3'
I am using nginx as a reverse proxy to connect to django's built in webserver (still in development stages) and postgresql for my database. My database and tables are all en_US.UTF-8 and I am using pgadmin3 to view my tables outside of django. My issue goes a little beyond my title, firstly how should I be saving possibly whacky filenames in my database? My current method is
'path': smart_unicode(path.lstrip(MUSIC_PATH)),
'filename': smart_unicode(file)
and when I pprint out the values they do show u'whateverthecrap'
I am not sure if that is how I should be doing it but assuming it is now I have issues trying to spit out the download.
My download view looks something like this:
def song_download(request, song_id):
    song = get_object_or_404(Song, pk=song_id)
    url = u'/static_music/%s/%s' % (song.path, song.filename)
    print url
    response = HttpResponse()
    response['X-Accel-Redirect'] = url
    response['Content-Type'] = 'audio/mpeg'
    response['Content-Disposition'] = "attachment; filename=test.mp3"
    return response
and most files will download but when I get to 02 - Charlie, Woody and You／Study #22.mp3 I receive this from django: 'ascii' codec can't encode character u'\uff0f' in position 118: ordinal not in range(128), HTTP response headers must be in US-ASCII format.
How can I use an ASCII acceptable string if my filename is out of bounds? 02 - Charlie, Woody and You\uff0fStudy #22.mp3 doesn't seem to work...
EDIT 1
I am using Ubuntu for my OS.

Although ／ is an unusual and undesirable character, your script will break for any non-ASCII character.
response['X-Accel-Redirect'] = url
url is Unicode (and it isn't a URL, it's a filepath). Response headers are bytes. You'll need to encode it.
response['X-Accel-Redirect'] = url.encode('utf-8')
that's assuming you're running on a server with UTF-8 as the filesystem encoding.
(Now, how to encode the filename in the Content-Disposition header... that's an altogether trickier question!)
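For the Content-Disposition part, one possible approach is the RFC 5987 / RFC 6266 form, filename*=UTF-8''..., sent alongside a plain ASCII filename as a fallback. The sketch below is Python 3 (the code in the question is Python 2), and content_disposition is a hypothetical helper name, not a Django API. Older browsers may ignore filename* and only use the fallback.

```python
from urllib.parse import quote


def content_disposition(filename):
    """Build an ASCII-only attachment header value for any Unicode filename."""
    # Fallback for clients that only understand the plain parameter:
    # characters outside ASCII are replaced with '?'.
    fallback = filename.encode("ascii", "replace").decode("ascii")
    # Escape backslashes and double quotes for the quoted-string form.
    fallback = fallback.replace("\\", "\\\\").replace('"', '\\"')
    # Extended parameter: UTF-8 bytes, percent-encoded.
    encoded = quote(filename, safe="")
    return "attachment; filename=\"{0}\"; filename*=UTF-8''{1}".format(
        fallback, encoded)


if __name__ == "__main__":
    print(content_disposition(u"Na\u00efve file.txt"))
    print(content_disposition(u"02 - Charlie, Woody and You\uff0fStudy #22.mp3"))
```

The returned value contains only ASCII characters, so something like response['Content-Disposition'] = content_disposition(song.filename) should not trip the "HTTP response headers must be in US-ASCII format" check.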
