Get python getaddresses() to decode encoded-word encoding - python

msg = \
"""To: =?ISO-8859-1?Q?Caren_K=F8lter?= <ck#example.dk>, bob#example.com
Cc: "James =?ISO-8859-1?Q?K=F8lter?=" <jk#example.dk>
Subject: hello
message body blah blah blah
"""
import email.parser, email.utils
import itertools
parser = email.parser.Parser()
parsed_message = parser.parsestr(msg)
address_fields = ('to', 'cc')
addresses = itertools.chain(*(parsed_message.get_all(field) for field in address_fields if parsed_message.has_key(field)))
address_list = set(email.utils.getaddresses(addresses))
print address_list
It seems like email.utils.getaddresses() doesn't seem to automatically handle MIME RFC 2047 in address fields.
How can I get the expected result below?
actual result:
set([('', 'bob#example.com'), ('=?ISO-8859-1?Q?Caren_K=F8lter?=', 'ck#example.dk'), ('James =?ISO-8859-1?Q?K=F8lter?=', 'jk#example.dk')])
desired result:
set([('', 'bob#example.com'), (u'Caren_K\xf8lter', 'ck#example.dk'), (u'James \xf8lter', 'jk#example.dk')])

The function you want is email.header.decode_header, which returns a list of (decoded_string, charset) pairs. It's up to you to further decode them according to charset and join them back together again before passing them to email.utils.getaddresses or wherever.
You might think that this would be straightforward:
def decode_rfc2047_header(h):
return ' '.join(s.decode(charset or 'ascii')
for s, charset in email.header.decode_header(h))
But since message headers typically come from untrusted sources, you have to handle (1) badly encoded data; and (2) bogus character set names. So you might do something like this:
def decode_safely(s, charset='ascii'):
"""Return s decoded according to charset, but do so safely."""
try:
return s.decode(charset or 'ascii', 'replace')
except LookupError: # bogus charset
return s.decode('ascii', 'replace')
def decode_rfc2047_header(h):
return ' '.join(decode_safely(s, charset)
for s, charset in email.header.decode_header(h))

Yeah, the email package interface really isn't very helpful a lot of the time.
Here, you have to use email.header.decode_header manually on each address, and then, since that gives you a list of decoded tokens, you have to stitch them back together again manually:
for name, address in email.utils.getaddresses(addresses):
name= u' '.join(
unicode(b, e or 'ascii') for b, e in email.header.decode_header(name)
)
...

Thank you Gareth Rees.Your answer was helpful in solving a problem case:
Input: 'application/octet-stream;\r\n\tname="=?utf-8?B?KFVTTXMpX0FSTE8uanBn?="'
The absence of whitespace around the encoded-word caused email.Header.decode_header to overlook it. I'm too new to this to know if I've only made things worse, but this kludge, along with joining with a '' instead of ' ', fixed it:
if not ' =?' in h:
h = h.replace('=?', ' =?').replace('?=', '?= ')
Output: u'application/octet-stream; name="(USMs)_ARLO.jpg"'

Related

Translation of a PHP Script to Python3 (Django)

I am attempting to convert from scratch the following PHP script into Python for my Django Project:
Note that it is my understanding that this script should handle values sent from a form, sign the data with the Secret_Key, encrypt the data in SHA256 and encode it in Base64
<?php
define ('HMAC_SHA256', 'sha256');
define ('SECRET_KEY', '<REPLACE WITH SECRET KEY>');
function sign ($params) {
return signData(buildDataToSign($params), SECRET_KEY);
}
function signData($data, $secretKey) {
return base64_encode(hash_hmac('sha256', $data, $secretKey, true));
}
function buildDataToSign($params) {
$signedFieldNames = explode(",",$params["signed_field_names"]);
foreach ($signedFieldNames as $field) {
$dataToSign[] = $field . "=" . $params[$field];
}
return commaSeparate($dataToSign);
}
function commaSeparate ($dataToSign) {
return implode(",",$dataToSign);
}
?>
Here is what I have done so far :
def sawb_confirmation(request):
if request.method == "POST":
form = SecureAcceptance(request.POST)
if form.is_valid():
access_key = 'afc10315b6aaxxxxxcfc912xx812b94c'
profile_id = 'E25C4XXX-4622-47E9-9941-1003B7910B3B'
transaction_uuid = str(uuid.uuid4())
signed_field_names = 'access_key,profile_id,transaction_uuid,signed_field_names,unsigned_field_names,signed_date_time,locale,transaction_type,reference_number,amount,currency'
signed_date_time = datetime.datetime.now()
signed_date_time = str(signed_date_time.strftime("20%y-%m-%dT%H:%M:%SZ"))
locale = 'en'
transaction_type = str(form.cleaned_data["transaction_type"])
reference_number = str(form.cleaned_data["reference_number"])
amount = str(form.cleaned_data["amount"])
currency = str(form.cleaned_data["currency"])
# Transform the String into a List
signed_field_names = [x.strip() for x in signed_field_names.split(',')]
# Get Values for each of the fields in the form
values = [access_key, profile_id, transaction_uuid,signed_field_names,'',signed_date_time,locale,transaction_type,reference_number,amount,currency]
# Insert the signedfieldnames in their place in the list (MUST BE KEPT)
values[3] = 'access_key,profile_id,transaction_uuid,signed_field_names,unsigned_field_names,signed_date_time,locale,transaction_type,reference_number,amount,currency'
# Merge the two lists as one
DataToSign = list(map('='.join, zip(signed_field_names, values)))
# Hash Sha-256
API_SECRET = 'bb588d4f96ac491ebd43cceb18xx149b79291f874f1a41fcbf5bc078bb6c8793af2df5ad4b174f80bd5f24a4e4eec6fdabdxxxxxc6c1410db40252deea613e0b976748539294438694ba08xx4ba831d3d850349cacfa445f9706aa57be7f8e61aab0be2288054dbe88ec6200ccd7c72888bcc0aa373f42059ec248d3c86b0f45'
message = '{} {}'.format(DataToSign, API_SECRET)
signature = hmac.new(bytes(API_SECRET , 'latin-1'), msg = bytes(message , 'latin-1'), digestmod = hashlib.sha256).hexdigest().upper()
base64string = base64.b64encode( bytes(signature, "utf-8") )
When printing the variables as they come, I obtain the following :
VALUES : ['afc10315b6aa3b2a8cfc91253812b94c', 'E25C4FE4-4622-47E9-9941-1003B7910B3B', '0b59b0ae-bd25-4421-a231-bb83dcfc91fa', 'access_key,profile_id,transaction_uuid,signed_field_names,unsigned_field_names,signed_date_time,locale,transaction_type,reference_number,amount,currency', '', '2021-03-06T22:07:30Z', 'en', 'authorization', '1615068450109', '100', 'USD']
DATATOSIGN : ['access_key=afc10315b6aa3b2a8cfc91253812b94c', 'profile_id=E25C4FE4-4622-47E9-9941-1003B7910B3B', 'transaction_uuid=0b59b0ae-bd25-4421-a231-bb83dcfc91fa', 'signed_field_names=access_key,profile_id,transaction_uuid,signed_field_names,unsigned_field_names,signed_date_time,locale,transaction_type,reference_number,amount,currency', 'unsigned_field_names=', 'signed_date_time=2021-03-06T22:07:30Z', 'locale=en', 'transaction_type=authorization', 'reference_number=1615068450109', 'amount=100', 'currency=USD']
SIGNATURE : 953C786EB9884CEC13C24118B00125BDCFE23AFF8AB02E7BEF29A83156C55C16
BASE64STRING : b'OTUzQzc4NkVCOTg4NENFQzEzQzI0MTE4QjAwMTI1QkRDRkUyM0FGRjhBQjAyRTdCRUYyOUE4MzE1NkM1NUMxNg=='
I think I am getting pretty close from the final result I would like to achieve since I would then simply have to post the Base64String to a specific URL.
However, I am unsure of a couple of things which may seem a bit off :
Is my "translation" of the PHP code into Python correct? Am I meant to merge my lists with a result in "DATATOSIGN"? I am not proficient in PHP so I might have misunderstood how to present the data.
The signature in Base64 should be 44 chars AT ALL TIME like "WrXOhTzhBjYMZROwiCug2My3jiZHOqATimcz5EBA07M=" when using the PHP Sample Code but mine way exceeds this limitation.
If you need any additional information, please do not hesitate to ask.
Hope you can give me pointers !
To approach this problem, it might be good to get an idea of what your final PHP result would be for given parameters.
Here are the parameters I'm using for this with your given PHP code:
$params = [
'access_key' => 'afc10315b6aaxxxxxcfc912xx812b94c',
'profile_id' => 'E25C4XXX-4622-47E9-9941-1003B7910B3B',
'transaction_uuid' => '12345',
'signed_field_names' => 'access_key,profile_id,transaction_uuid,signed_field_names,unsigned_field_names,signed_date_time,locale,transaction_type,reference_number,amount,currency',
'unsigned_field_names' => '',
'signed_date_time' => '2021-03-06 16:14:00',
'locale' => 'en',
'transaction_type' => 'credit',
'reference_number' => '12345',
'amount' => '50',
'currency' => 'usd'
];
When I run the original PHP code with these parameters, and these lines for outputting the code:
<?php
echo "build data to sign:\n";
print_r(buildDataToSign($params));
echo "\n";
echo "sign data:\n";
echo signData(buildDataToSign($params), 'secret');
?>
I get the following output:
build data to sign:
access_key=afc10315b6aaxxxxxcfc912xx812b94c,profile_id=E25C4XXX-4622-47E9-9941-1003B7910B3B,transaction_uuid=12345,signed_field_names=access_key,profile_id,transaction_uuid,signed_field_names,unsigned_field_names,signed_date_time,locale,transaction_type,reference_number,amount,currency,unsigned_field_names=,signed_date_time=2021-03-06 16:14:00,locale=en,transaction_type=credit,reference_number=12345,amount=50,currency=usd
sign data:
6V0iIqu3smGmadPK4KvRuHm1nNkuIVLBPbLg7VkA7M8=
So with your new Python version of this PHP code, you'll probably want to have a similar sign data value of 6V0iIqu3smGmadPK4KvRuHm1nNkuIVLBPbLg7VkA7M8= at the end with these parameters!
Because your Python example does not seem to get the same result as-is, after adding a return base64string to the end of your Python def, I get the following output:
sign_data:
b'NThFMjU4QTQyRjU2MkVDRDgzM0RCOEIwM0VDODczQTExNjc3MUNDMEM2OURGMDFGMjdFQkU3MEMzMDAyNjA3RQ=='
In order to match the PHP version of your code, I wanted to try to find out what was going on between the PHP and Python approaches in regard to the hmac and base64 parts.
When I broke down your PHP code example into steps relating to the hmac value and then later the base64 value, here is what I found (using a data message of 'hello' and a key of 'secret' to keep it simple):
Example PHP Code:
<?php
$hash_value = hash_hmac('sha256', 'hello', 'secret', true);
$base64_value = base64_encode($hash_value);
echo "hash value:\n";
echo $hash_value;
echo "\n";
echo "base64 value:\n";
echo $base64_value;
echo "\n";
?>
Example PHP Code Output:
;▒▒▒▒▒C▒|
base64 value:
iKqz7ejTrflNJquQ07r9SiCDBww7zOnAFO4EpEOEfAs=
That looks like some crazy binary-type data! Then, I wanted to try to make sure that the base64 value could be reproduced in Python. To do that, I used a simple approach in Python using the same values as earlier.
Example Python Code:
import base64
import hashlib
import hmac
# Based on your Python code example
hash_value = hmac.new(bytes('secret' , 'latin-1'), msg = bytes('hello', 'latin-1'), digestmod = hashlib.sha256).hexdigest().upper()
base64_value = base64.b64encode(bytes(hash_value, 'utf-8'))
print("hash value:")
print(hash_value)
print("base64 value:")
print(base64_value)
Example Python Code Output:
hash value:
88AAB3EDE8D3ADF94D26AB90D3BAFD4A2083070C3BCCE9C014EE04A443847C0B
base64 value:
b'ODhBQUIzRURFOEQzQURGOTREMjZBQjkwRDNCQUZENEEyMDgzMDcwQzNCQ0NFOUMwMTRFRTA0QTQ0Mzg0N0MwQg=='
So, like you found out earlier, something is causing this base64 value result on the Python side to be longer than the PHP version.
After looking into things more (especially seeing the strange data result in the PHP test code above), I found out that the hash_hmac() function in PHP has the option to return a result in binary form (with the true value at the end of the hash_hmac() in your PHP code example). On the Python side, it looks like you decided to use hmac.hexdigest() which I think I've used before in the past when I wanted a string-like value. For this case, however, I think you might want to get the value back as a binary value. To do this, it looks like you'll want to use hmac.digest() instead.
Modified Example Python Code:
import base64
import hashlib
import hmac
# Based on your Python code example
hash_value = hmac.new(bytes('secret' , 'latin-1'), msg = bytes('hello', 'latin-1'), digestmod = hashlib.sha256).digest()
base64_value = base64.b64encode(bytes(hash_value))
print("hash value:")
print(hash_value)
print("base64 value:")
print(base64_value)
Modified Example Python Code Output:
hash value:
b'\x88\xaa\xb3\xed\xe8\xd3\xad\xf9M&\xab\x90\xd3\xba\xfdJ \x83\x07\x0c;\xcc\xe9\xc0\x14\xee\x04\xa4C\x84|\x0b'
base64 value:
b'iKqz7ejTrflNJquQ07r9SiCDBww7zOnAFO4EpEOEfAs='
Now, the final base64 results appear to match between the example PHP and Python code.
In order for me to better understand what was different between the PHP and Python code, I ended up writing a simple translation of your PHP code into Python (and partly based on your Python code as well).
Here is what the related Python code looks like on my side (with example params):
import base64
import hmac
params = {
'access_key': 'afc10315b6aaxxxxxcfc912xx812b94c',
'profile_id': 'E25C4XXX-4622-47E9-9941-1003B7910B3B',
'transaction_uuid': "12345",
'signed_field_names': 'access_key,profile_id,transaction_uuid,signed_field_names,unsigned_field_names,signed_date_time,locale,transaction_type,reference_number,amount,currency',
'unsigned_field_names': '',
'signed_date_time': "2021-03-06 16:14:00",
'locale': 'en',
'transaction_type': "credit",
'reference_number': "12345",
'amount': "50",
'currency': "usd"
}
SECRET_KEY = 'secret'
def sign(params):
return sign_data(build_data_to_sign(params), SECRET_KEY)
def sign_data(data, secret_key):
return base64.b64encode(bytes(hmac.new(bytes(secret_key, 'latin-1'), msg=bytes(data, 'latin-1'), digestmod='sha256').digest()))
def build_data_to_sign(params):
data_to_sign = []
signed_field_names = params['signed_field_names'].split(',')
for field in signed_field_names:
data_to_sign.append(field + "=" + params[field])
return comma_separate(data_to_sign)
def comma_separate(data_to_sign):
return ','.join(data_to_sign)
When I use my code translation to check your Python code, I checked the values for the variables signed_field_names and DataToSign in your code, and I got the following results:
signed_field_names:
['access_key', 'profile_id', 'transaction_uuid', 'signed_field_names', 'unsigned_field_names', 'signed_date_time', 'locale', 'transaction_type', 'reference_number', 'amount', 'currency']
DataToSign:
['access_key=afc10315b6aaxxxxxcfc912xx812b94c', 'profile_id=E25C4XXX-4622-47E9-9941-1003B7910B3B', 'transaction_uuid=12345', 'signed_field_names=access_key,profile_id,transaction_uuid,signed_field_names,unsigned_field_names,signed_date_time,locale,transaction_type,reference_number,amount,currency', 'unsigned_field_names=', 'signed_date_time=2021-03-06 16:14:00', 'locale=en', 'transaction_type=credit', 'reference_number=12345', 'amount=50', 'currency=usd']
When I check the values with my code translation attempt, I get these values:
signed_field_names:
['access_key', 'profile_id', 'transaction_uuid', 'signed_field_names', 'unsigned_field_names', 'signed_date_time', 'locale', 'transaction_type', 'reference_number', 'amount', 'currency']
DataToSign:
access_key=afc10315b6aaxxxxxcfc912xx812b94c,profile_id=E25C4XXX-4622-47E9-9941-1003B7910B3B,transaction_uuid=12345,signed_field_names=access_key,profile_id,transaction_uuid,signed_field_names,unsigned_field_names,signed_date_time,locale,transaction_type,reference_number,amount,currency,unsigned_field_names=,signed_date_time=2021-03-06 16:14:00,locale=en,transaction_type=credit,reference_number=12345,amount=50,currency=usd
So it looks like your DataToSign = list(map('='.join, zip(signed_field_names, values))) line is specifying a list whereas my code attempt is specifying a string based on your original PHP example.
Because of this, I think you'll want to turn the result back into a string like this (though the variable name could also be written differently if you so choose):
DataToSignString = ','.join(DataToSign)
To save time in this long post, I also found that your message variable was different than my translation of your PHP code. To work around this, I made the message variable in your Python code set to the previously mentioned DataToSignString:
# Commenting out previous message line for now
# message = '{} {}'.format(DataToSignString, API_SECRET)
message = DataToSignString
Also, the following changes seem to be needed for your Python example:
signature = hmac.new(bytes(API_SECRET , 'latin-1'), msg = bytes(message , 'latin-1'), digestmod = hashlib.sha256).digest()
base64string = base64.b64encode(bytes(signature))
This way, you have a binary version of the hmac object. Also, it looks like the utf-8 part might not be needed for now in the base64encode part.
Finally, I added a return to return the calculated base64string while also converting it to a string before base64string is returned:
return str(base64string, 'utf-8')
When put together, here is what the modified code from your Python example looks like:
import base64
import datetime
import hashlib
import hmac
import pprint
import uuid
def sign():
access_key = 'afc10315b6aaxxxxxcfc912xx812b94c'
profile_id = 'E25C4XXX-4622-47E9-9941-1003B7910B3B'
transaction_uuid = "12345"
signed_field_names = 'access_key,profile_id,transaction_uuid,signed_field_names,unsigned_field_names,signed_date_time,locale,transaction_type,reference_number,amount,currency'
signed_date_time = "2021-03-06 16:14:00"
locale = 'en'
transaction_type = "credit"
reference_number = "12345"
amount = "50"
currency = "usd"
# Transform the String into a List
signed_field_names = [x.strip() for x in signed_field_names.split(',')]
# Get Values for each of the fields in the form
values = [access_key, profile_id, transaction_uuid,signed_field_names,'',signed_date_time,locale,transaction_type,reference_number,amount,currency]
# Insert the signedfieldnames in their place in the list (MUST BE KEPT)
values[3] = 'access_key,profile_id,transaction_uuid,signed_field_names,unsigned_field_names,signed_date_time,locale,transaction_type,reference_number,amount,currency'
# Merge the two lists as one
DataToSign = list(map('='.join, zip(signed_field_names, values)))
DataToSignString = ','.join(DataToSign)
# Hash Sha-256
API_SECRET = 'secret'
message = DataToSignString
signature = hmac.new(bytes(API_SECRET , 'latin-1'), msg = bytes(message , 'latin-1'), digestmod = hashlib.sha256).digest()
base64string = base64.b64encode(bytes(signature))
return str(base64string, 'utf-8')
result = sign()
print("sign_data:")
print(result)
The output for this code (with the given parameters) is:
sign_data:
6V0iIqu3smGmadPK4KvRuHm1nNkuIVLBPbLg7VkA7M8=
The value part of this output should be the same as the PHP output from earlier in this post. The earlier value was 6V0iIqu3smGmadPK4KvRuHm1nNkuIVLBPbLg7VkA7M8= and the latest output showed a result of 6V0iIqu3smGmadPK4KvRuHm1nNkuIVLBPbLg7VkA7M8=.
#summea You are god sent ! Thanks a ton ! I cannot believe how much of an effort you made, I am baffled !
If anyone is attempted to Implement Secure Acceptance / CyberSource and see this message, note that you would not be able to pass value="{{ signed_field_names }} in your front end as it is since the data looks like ['access_key','profile_id'].
You would need to either tweak it before sending it (which somehow gives out a Signature Mismatched on CYBS end) or simply hardcode the input field in payment_confirmation like so : value="access_key,profile_id,transaction_uuid,signed_field_names,unsigned_field_names,signed_date_time,locale,transaction_type,reference_number,amount,currency"/>

Decoding base64 string return None

I tried to generate uid for a user confirmation email.
'uid':urlsafe_base64_encode(force_bytes(user.pk)),
so, it's works nice, it returns something like "Tm9uZQ"
Then, when I tried to decode it, using force_text(urlsafe_base64_decode(uidb64))
it return None.
The next string
urlsafe_base64_decode(uidb64)
also, return b'None'
I tried to google it, and see different implementations, but copy-paste code not works.
I write something like
b64_string = uidb64
b64_string += "=" * ((4 - len(b64_string) % 4) % 4)
print(b64_string)
print(force_text(base64.urlsafe_b64decode(b64_string)))
and the result still None:
Tm9uZQ==
None
I don't understand how the default decode doesn't work.
"Tm9uZQ==" is the base64 encoding of the string "None",
>>> from base64 import b64encode, b64decode
>>> s = b'None'
>>>
>>> b64encode(s)
b'Tm9uZQ=='
>>> b64decode(b64encode(s))
b'None'
>>>
It could be possible that some of your data is missing. E.g. user.pk is not set. I think that force_bytes is turning a None user.pk into the bytestring b'None', from the Django source,
def force_bytes(s, encoding='utf-8', strings_only=False, errors='strict'):
"""
Similar to smart_bytes, except that lazy instances are resolved to
strings, rather than kept as lazy objects.
If strings_only is True, don't convert (some) non-string-like objects.
"""
# Handle the common case first for performance reasons.
if isinstance(s, bytes):
if encoding == 'utf-8':
return s
else:
return s.decode('utf-8', errors).encode(encoding, errors)
if strings_only and is_protected_type(s):
return s
if isinstance(s, memoryview):
return bytes(s)
return str(s).encode(encoding, errors)
You might be able to prevent None being turned into b'None' by setting strings_only=True when calling force_bytes.

trying to automate translation on babelfish with python

I have modified a python babelizer to help me to translate english to chinese.
## {{{ http://code.activestate.com/recipes/64937/ (r4)
# babelizer.py - API for simple access to babelfish.altavista.com.
# Requires python 2.0 or better.
#
# See it in use at http://babel.MrFeinberg.com/
"""API for simple access to babelfish.altavista.com.
Summary:
import babelizer
print ' '.join(babelizer.available_languages)
print babelizer.translate( 'How much is that doggie in the window?',
'English', 'French' )
def babel_callback(phrase):
print phrase
sys.stdout.flush()
babelizer.babelize( 'I love a reigning knight.',
'English', 'German',
callback = babel_callback )
available_languages
A list of languages available for use with babelfish.
translate( phrase, from_lang, to_lang )
Uses babelfish to translate phrase from from_lang to to_lang.
babelize(phrase, from_lang, through_lang, limit = 12, callback = None)
Uses babelfish to translate back and forth between from_lang and
through_lang until either no more changes occur in translation or
limit iterations have been reached, whichever comes first. Takes
an optional callback function which should receive a single
parameter, being the next translation. Without the callback
returns a list of successive translations.
It's only guaranteed to work if 'english' is one of the two languages
given to either of the translation methods.
Both translation methods throw exceptions which are all subclasses of
BabelizerError. They include
LanguageNotAvailableError
Thrown on an attempt to use an unknown language.
BabelfishChangedError
Thrown when babelfish.altavista.com changes some detail of their
layout, and babelizer can no longer parse the results or submit
the correct form (a not infrequent occurance).
BabelizerIOError
Thrown for various networking and IO errors.
Version: $Id: babelizer.py,v 1.4 2001/06/04 21:25:09 Administrator Exp $
Author: Jonathan Feinberg <jdf#pobox.com>
"""
import re, string, urllib
import httplib, urllib
import sys
"""
Various patterns I have encountered in looking for the babelfish result.
We try each of them in turn, based on the relative number of times I've
seen each of these patterns. $1.00 to anyone who can provide a heuristic
for knowing which one to use. This includes AltaVista employees.
"""
__where = [ re.compile(r'name=\"q\">([^<]*)'),
re.compile(r'td bgcolor=white>([^<]*)'),
re.compile(r'<\/strong><br>([^<]*)')
]
# <div id="result"><div style="padding:0.6em;">??</div></div>
__where = [ re.compile(r'<div id=\"result\"><div style=\"padding\:0\.6em\;\">(.*)<\/div><\/div>', re.U) ]
__languages = { 'english' : 'en',
'french' : 'fr',
'spanish' : 'es',
'german' : 'de',
'italian' : 'it',
'portugese' : 'pt',
'chinese' : 'zh'
}
"""
All of the available language names.
"""
available_languages = [ x.title() for x in __languages.keys() ]
"""
Calling translate() or babelize() can raise a BabelizerError
"""
class BabelizerError(Exception):
pass
class LanguageNotAvailableError(BabelizerError):
pass
class BabelfishChangedError(BabelizerError):
pass
class BabelizerIOError(BabelizerError):
pass
def saveHTML(txt):
f = open('page.html', 'wb')
f.write(txt)
f.close()
def clean(text):
return ' '.join(string.replace(text.strip(), "\n", ' ').split())
def translate(phrase, from_lang, to_lang):
phrase = clean(phrase)
try:
from_code = __languages[from_lang.lower()]
to_code = __languages[to_lang.lower()]
except KeyError, lang:
raise LanguageNotAvailableError(lang)
html = ""
try:
params = urllib.urlencode({'ei':'UTF-8', 'doit':'done', 'fr':'bf-res', 'intl':'1' , 'tt':'urltext', 'trtext':phrase, 'lp' : from_code + '_' + to_code , 'btnTrTxt':'Translate'})
headers = {"Content-type": "application/x-www-form-urlencoded","Accept": "text/plain"}
conn = httplib.HTTPConnection("babelfish.yahoo.com")
conn.request("POST", "http://babelfish.yahoo.com/translate_txt", params, headers)
response = conn.getresponse()
html = response.read()
saveHTML(html)
conn.close()
#response = urllib.urlopen('http://babelfish.yahoo.com/translate_txt', params)
except IOError, what:
raise BabelizerIOError("Couldn't talk to server: %s" % what)
#print html
for regex in __where:
match = regex.search(html)
if match:
break
if not match:
raise BabelfishChangedError("Can't recognize translated string.")
return match.group(1)
#return clean(match.group(1))
def babelize(phrase, from_language, through_language, limit = 12, callback = None):
phrase = clean(phrase)
seen = { phrase: 1 }
if callback:
callback(phrase)
else:
results = [ phrase ]
flip = { from_language: through_language, through_language: from_language }
next = from_language
for i in range(limit):
phrase = translate(phrase, next, flip[next])
if seen.has_key(phrase): break
seen[phrase] = 1
if callback:
callback(phrase)
else:
results.append(phrase)
next = flip[next]
if not callback: return results
if __name__ == '__main__':
import sys
def printer(x):
print x
sys.stdout.flush();
babelize("I won't take that sort of treatment from you, or from your doggie!",
'english', 'french', callback = printer)
## end of http://code.activestate.com/recipes/64937/ }}}
and the test code is
import babelizer
print ' '.join(babelizer.available_languages)
result = babelizer.translate( 'How much is that dog in the window?', 'English', 'chinese' )
f = open('result.txt', 'wb')
f.write(result)
f.close()
print result
The result is to be expected inside a div block . I modded the script to save the html response . What I found is that all utf8 characters are turned to nul . Do I need take special care in treating the utf8 response ?
I think you need to use:
import codecs
codecs.open
instead of plain open, in your:
saveHTML
method, to handle utf-8 docs. See the Python Unicode Howto for a complete explanation.

Python % character problem while manipulation (Escape char problem)

I try to send a confirmation email in Django but there is a problem with excape characters.
I have a helper function for content of mail as
def getActivationMailBody():
email_body = "<table width='100%'>
email_body = email_body + '<p>' + '%(confirmLink)s' + '</p>'
return email_body
And the confirmation url is embedded like
email_body = getActivationMailBody()
email_body = email_body % {'confirmLink': '%s/kullanici/onay/%s/%s'%(WEB_URL,md5.new(form.cleaned_data['email']).hexdigest()[:30], activation_key)}
msg = EmailMessage(email_subject, email_body, DEFAULT_FROM_EMAIL, [email_to])
msg.content_subtype="html"
res = msg.send(fail_silently=False)
However, while the confirmLink embedding I get an error as
unsupported format character ''' (0x27) at index 18
I found that the problem is caused by % character but I couldn't figure out how can I correct that.
Could you give me any suggestion ? Thanks
In a format string, a % can be escaped by doubling:
email_body = "<table width='100%%'>"
It's a little odd how you've constructed this, since getActivationEmailBody isn't returning the body of the email, but instead a format string to create the body. You might want to rename the function.

Python email quoted-printable encoding problem

I am extracting emails from Gmail using the following:
def getMsgs():
try:
conn = imaplib.IMAP4_SSL("imap.gmail.com", 993)
except:
print 'Failed to connect'
print 'Is your internet connection working?'
sys.exit()
try:
conn.login(username, password)
except:
print 'Failed to login'
print 'Is the username and password correct?'
sys.exit()
conn.select('Inbox')
# typ, data = conn.search(None, '(UNSEEN SUBJECT "%s")' % subject)
typ, data = conn.search(None, '(SUBJECT "%s")' % subject)
for num in data[0].split():
typ, data = conn.fetch(num, '(RFC822)')
msg = email.message_from_string(data[0][1])
yield walkMsg(msg)
def walkMsg(msg):
for part in msg.walk():
if part.get_content_type() != "text/plain":
continue
return part.get_payload()
However, some emails I get are nigh impossible for me to extract dates (using regex) from as encoding-related chars such as '=', randomly land in the middle of various text fields. Here's an example where it occurs in a date range I want to extract:
Name: KIRSTI Email:
kirsti#blah.blah Phone #: + 999
99995192 Total in party: 4 total, 0
children Arrival/Departure: Oct 9=
,
2010 - Oct 13, 2010 - Oct 13, 2010
Is there a way to remove these encoding characters?
You could/should use the email.parser module to decode mail messages, for example (quick and dirty example!):
from email.parser import FeedParser
f = FeedParser()
f.feed("<insert mail message here, including all headers>")
rootMessage = f.close()
# Now you can access the message and its submessages (if it's multipart)
print rootMessage.is_multipart()
# Or check for errors
print rootMessage.defects
# If it's a multipart message, you can get the first submessage and then its payload
# (i.e. content) like so:
rootMessage.get_payload(0).get_payload(decode=True)
Using the "decode" parameter of Message.get_payload, the module automatically decodes the content, depending on its encoding (e.g. quoted printables as in your question).
If you are using Python3.6 or later, you can use the email.message.Message.get_content() method to decode the text automatically. This method supersedes get_payload(), though get_payload() is still available.
Say you have a string s containing this email message (based on the examples in the docs):
Subject: Ayons asperges pour le =?utf-8?q?d=C3=A9jeuner?=
From: =?utf-8?q?Pep=C3=A9?= Le Pew <pepe#example.com>
To: Penelope Pussycat <penelope#example.com>,
Fabrette Pussycat <fabrette#example.com>
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0
Salut!
Cela ressemble =C3=A0 un excellent recipie[1] d=C3=A9jeuner.
[1] http://www.yummly.com/recipe/Roasted-Asparagus-Epicurious-203718
--Pep=C3=A9
=20
Non-ascii characters in the string have been encoded with the quoted-printable encoding, as specified in the Content-Transfer-Encoding header.
Create an email object:
import email
from email import policy
msg = email.message_from_string(s, policy=policy.default)
Setting the policy is required here; otherwise policy.compat32 is used, which returns a legacy Message instance that doesn't have the get_content method. policy.default will eventually become the default policy, but as of Python3.7 it's still policy.compat32.
The get_content() method handles decoding automatically:
print(msg.get_content())
Salut!
Cela ressemble à un excellent recipie[1] déjeuner.
[1] http://www.yummly.com/recipe/Roasted-Asparagus-Epicurious-203718
--Pepé
If you have a multipart message, get_content() needs to be called on the individual parts, like this:
for part in message.iter_parts():
print(part.get_content())
That's known as quoted-printable encoding. You probably want to use something like quopri.decodestring - http://docs.python.org/library/quopri.html

Categories