I get emails with unique subjects, and I want to save them.
I tried this (the stage with credentials input is omitted):
import email
import imaplib

suka.select('Inbox')
key = 'FROM'
value = 'TBD'
_, data = suka.search(None, key, value)
mail_id_list = data[0].split()

msgs = []
for num in mail_id_list:
    typ, data = suka.fetch(num, '(RFC822)')
    msgs.append(data)

for msg in msgs[::-1]:
    for response_part in msg:
        if isinstance(response_part, tuple):
            my_msg = email.message_from_bytes(response_part[1])
            print("subj:", my_msg['subject'])
            for part in my_msg.walk():
                # print(part.get_content_type())
                if part.get_content_type() == 'text/plain':
                    print(part.get_payload())
I do get the subjects, but in the form "subj: =?utf-8?B?0LfQsNGP0LLQutCwIDIxXzE0MTIyMg==?=", so decoding is required. The question is: which variable needs to be decoded?
I also tried another way:
yek, do = suka.uid('fetch', govno, '(RFC822)')
where govno is the ID of the latest email in the inbox. The output is "can't concat int to bytes".
So, is there a way to decode the subjects so they appear as they do in an email client? Thank you.
There is a built-in decode_header() function in email.header. From its docstring:
Decode a message header value without converting the character set.
The header value is in header.
This function returns a list of (decoded_string, charset) pairs
containing each of the decoded parts of the header. charset is None
for non-encoded parts of the header, otherwise a lower case string
containing the name of the character set specified in the encoded
string.
>>> from email.header import decode_header
>>> decoded_headers = decode_header("=?utf-8?B?0LfQsNGP0LLQutCwIDIxXzE0MTIyMg==?=")
>>> decoded_headers
[(b'\xd0\xb7\xd0\xb0\xd1\x8f\xd0\xb2\xd0\xba\xd0\xb0 21_141222', 'utf-8')]
>>> first = decoded_headers[0]
>>> first[0].decode(first[1])
'заявка 21_141222'
You can decode the actual value returned by decode_header using the charset returned alongside it.
For the follow-up question, here's a helper that returns the value of a multi-part header and handles decoding errors:
from email.header import decode_header

def get_header(header_text, default='utf-8'):
    try:
        headers = decode_header(header_text)
    except Exception:
        print('Error while decoding header, using the header without decoding')
        return header_text

    header_sections = []
    for text, charset in headers:
        if isinstance(text, str):
            # decode_header returns str (not bytes) for non-encoded parts
            header_section = text
        else:
            try:
                # if charset is missing, try to decode with the default
                header_section = text.decode(charset or default)
            except (LookupError, UnicodeDecodeError):
                # the header is incorrectly encoded:
                # decode anyway and ignore the broken bytes
                header_section = text.decode(default, errors='ignore')
        if header_section:
            header_sections.append(header_section)
    return ' '.join(header_sections)

print(get_header("=?utf-8?B?0LfQsNGP0LLQutCwIDIxXzE0MTIyMg==?="))
print(get_header("=?utf-8?B?0LfQsNGP0LLQutCwIDIxXzE0MTIyMg==?="))
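As an aside, the standard library can collapse the parts for you: email.header.make_header() accepts the list returned by decode_header(), and str() of the result is the readable subject. A minimal sketch:
from email.header import decode_header, make_header

# make_header() consumes the (bytes, charset) pairs produced by decode_header()
subject = str(make_header(decode_header("=?utf-8?B?0LfQsNGP0LLQutCwIDIxXzE0MTIyMg==?=")))
print(subject)  # заявка 21_141222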
I am trying to do a P2MS script.
For my script, I am saving the keys into a text file in DER format instead of the usual PEM file. Both keys and signatures are saved in a text file and hexlified. Below is my code for the P2MS execution.
from Crypto.PublicKey import DSA
from Crypto.Hash import SHA256
from Crypto.Signature import DSS
from binascii import hexlify, unhexlify
import binascii

message = b"helpsmepls"

# Read scriptPubKey and scriptSig from files
with open('scriptPubKey.txt', 'r') as f:
    readscriptPubKey = f.read().strip()
with open('scriptSig.txt', 'r') as f:
    scriptSig = f.read().strip()

print(type(readscriptPubKey))
tempholder = readscriptPubKey.split()

# Removing the first and last elements
removeend = tempholder[1:-1]

scriptPubKey = []
# Removing front extra headings
for count, x in enumerate(removeend):
    w = bytes(removeend[count][1:-1], encoding='utf-8')
    # print(w)
    scriptPubKey.append(w)

# Splitting the signatures
signatures = scriptSig.split()

hash_obj = SHA256.new(message)

# Going through the pubkeys based on the number of signatures generated
for o, sig in enumerate(signatures):
    pub_key = DSA.import_key(bytes.fromhex(scriptPubKey[o].decode("utf-8")))
    hash_obj = SHA256.new(message)
    verifier = DSS.new(pub_key, 'fips-186-3')  # 'fips-183-3' is not a valid mode
    # Verifying whether the public key and signature match;
    # the loop breaks if False is encountered
    if verifier.verify(hash_obj, sig):
        d = True
    else:
        d = False
        break
    break

if d == True:
    print("The message is authentic.")
else:
    print("The message is not authentic.")
Unfortunately, before my code can reach the verification, it encounters an error.
Full traceback
It seems my DSA key format has an error, but I am not sure why it is giving me that error.
I have also tried unhexlifying my input from the public key text file, but that did not work either. I have tried hex-decoding the input to get its DER form, but the type is still just bytes. I am not sure how to properly import the key in the appropriate DSA key format from a .txt file. I am able to do that with a PEM file, but I would like to find out how to do it with a .txt file.
My expected outcome is that the DSA key is imported properly and I am able to verify the public key against the signatures.
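For reference, here is a minimal sketch of a hex-DER round trip with PyCryptodome (the file name and key size are illustrative, not taken from the question): import_key() wants the raw DER bytes, so the hex text has to be unhexlified first.
from Crypto.PublicKey import DSA
from binascii import hexlify, unhexlify

key = DSA.generate(2048)
der = key.publickey().export_key(format='DER')  # raw DER bytes

# store the public key as hex text
with open('pubkey.txt', 'w') as f:
    f.write(hexlify(der).decode('ascii'))

# read it back: unhexlify to recover the DER bytes, then import
with open('pubkey.txt') as f:
    recovered = DSA.import_key(unhexlify(f.read().strip()))
print(recovered.export_key(format='DER') == der)  # True if the round trip worked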
I am implementing SHA1-HMAC generation in Python (v3.7) to be able to create HMAC codes.
I have used an online generator to create SHA1-HMAC with the following data:
string: '123'
Secret Key: 'secret'
Digest algorithm: SHA1
I am getting this result:
b14e92eb17f6b78ec5a205ee0e1ab220fb7f86d7
However, when I try to do the same with Python, I get a different result, which is wrong.
import hashlib
import hmac
import base64

def make_digest(message, key):
    key = bytes(key, 'UTF-8')
    message = bytes(message, 'UTF-8')
    digester = hmac.new(key, message, hashlib.sha1)
    signature1 = digester.digest()
    signature2 = base64.urlsafe_b64encode(signature1)
    return str(signature2, 'UTF-8')

result = make_digest('123', 'secret')
print(result)
This code gives the following result:
sU6S6xf2t47FogXuDhqyIPt_htc=
What could be wrong with this code?
You should not use Base64 here. The site you link to gives you the hex values of the digest bytes. Use the HMAC.hexdigest() method to get the same value in hex in Python:
>>> import hashlib, hmac
>>> key = b'secret'
>>> message = b'123'
>>> digester = hmac.new(key, message, hashlib.sha1)
>>> digester.hexdigest()
'b14e92eb17f6b78ec5a205ee0e1ab220fb7f86d7'
Put differently, your code outputs the correct value, but as Base64-encoded data:
>>> digester.digest()
b'\xb1N\x92\xeb\x17\xf6\xb7\x8e\xc5\xa2\x05\xee\x0e\x1a\xb2 \xfb\x7f\x86\xd7'
>>> base64.urlsafe_b64encode(digester.digest())
b'sU6S6xf2t47FogXuDhqyIPt_htc='
and the value you generated online contains the exact same bytes as the hex digest, so we can generate the same base64 output for that:
>>> bytes.fromhex('b14e92eb17f6b78ec5a205ee0e1ab220fb7f86d7')
b'\xb1N\x92\xeb\x17\xf6\xb7\x8e\xc5\xa2\x05\xee\x0e\x1a\xb2 \xfb\x7f\x86\xd7'
>>> base64.urlsafe_b64encode(bytes.fromhex('b14e92eb17f6b78ec5a205ee0e1ab220fb7f86d7'))
b'sU6S6xf2t47FogXuDhqyIPt_htc='
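If you also need to verify a digest you received, compare in constant time with hmac.compare_digest() rather than ==; a small sketch with the same key and message as above:
>>> import hashlib, hmac
>>> expected = 'b14e92eb17f6b78ec5a205ee0e1ab220fb7f86d7'
>>> hmac.compare_digest(hmac.new(b'secret', b'123', hashlib.sha1).hexdigest(), expected)
True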
I've been trying to generate a temporary URL in Python. The URL will carry some data, and I need to make sure that data isn't changed, so I'll add a hash at the end; but I keep ending up with a byte string no matter what I try. Can anyone point out what I'm doing wrong?
Here's a sample of my code.
(I know that changing the algorithms/encoding might solve the problem, but I can't find a suitable one. And to any downvoter: please explain the downvote.)
import hashlib
import datetime
from Crypto.Cipher import AES

def checkTemp(tempLink):
    encrypter = AES.new(b'1234567890123456', AES.MODE_CBC, b'this is an iv456')
    decryption = encrypter.decrypt(tempLink)
    length = len(decryption)
    hash_code = decryption[length - 32:length]
    data = decryption[:length - 32].strip()
    hasher = hashlib.sha256()
    hasher.update(data)
    hashCode = hasher.digest()
    if hash_code == hashCode:
        array = data.decode().split(",", 5)
        print("expiry date is: " + str(array[5]))
        return array[0], array[1], array[2], array[3], array[4]
    else:
        return "", "", "", "", ""

def createTemp(inviter, email, role, pj_name, cmp_name):
    delim = ','
    data = inviter + delim + email + delim + role + delim + pj_name + delim + cmp_name + delim + str(datetime.datetime.now().time())
    data = data.encode(encoding='utf_8', errors='strict')
    hasher = hashlib.sha256()
    hasher.update(data)
    hashCode = hasher.digest()
    encrypter = AES.new(b'1234567890123456', AES.MODE_CBC, b'this is an iv456')
    # pad the data with spaces up to a multiple of 16, as AES-CBC requires
    # (the 32-byte hash appended below is already a multiple of 16)
    newData = data + b' ' * (-len(data) % 16)
    result = encrypter.encrypt(newData + hashCode)
    return result

# print(str(link).split(",", 5))
link = createTemp("name", "email@homail.com", "Designer", "Project Name", "My Company")
print(link)
inviter, email, role, project, company = checkTemp(link)
The problem was that the encrypted result contains bytes that cannot be printed as a normal string, so the solution is to let binascii hex-encode the byte string for us and decode it back later:
import binascii
Then we hex-encode the resulting link into a usable string:
hexedLink = binascii.hexlify(link).decode()
and we unhexlify it before passing it back to the check method:
inviter, email, role, project, company = checkTemp(binascii.unhexlify(hexedLink))
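Putting the two snippets together with the functions defined above, the full round trip looks like this:
import binascii

link = createTemp("name", "email@homail.com", "Designer", "Project Name", "My Company")
hexedLink = binascii.hexlify(link).decode()  # plain ASCII hex, safe to embed in a URL
inviter, email, role, project, company = checkTemp(binascii.unhexlify(hexedLink))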
I am getting an error that says "UnicodeDecodeError: 'shift_jis' codec can't decode bytes in position 2-3: illegal multibyte sequence" when I try to use my email parser to decode a shift_jis-encoded email and convert it to Unicode. The code and email can be found below:
import email.header
import base64
import sys
import email

def getrawemail():
    line = ' '
    raw_email = ''
    while line:
        line = sys.stdin.readline()
        raw_email += line
    return raw_email

def getheader(subject, charsets):
    for i in charsets:
        if isinstance(i, str):
            encoding = i
            break
    if subject[-2:] == "?=":
        encoded = subject[5 + len(encoding):len(subject) - 2]
    else:
        encoded = subject[5 + len(encoding):]
    return (encoding, encoded)

def decodeheader((encoding, encoded)):
    decoded = base64.b64decode(encoded)
    decoded = unicode(decoded, encoding)
    return decoded

raw_email = getrawemail()
msg = email.message_from_string(raw_email)
subject = decodeheader(getheader(msg["Subject"], msg.get_charsets()))
print subject
Email: http://pastebin.com/L4jAkm5R
I have read in another Stack Overflow question that this may be related to a difference between how Unicode and shift_jis are encoded (they referenced this Microsoft Knowledge Base article). If anyone knows what in my code could be causing this, or whether it is even reasonably fixable, I would very much appreciate finding out how.
Starting with this string:
In [124]: msg['Subject']
Out[124]: '=?ISO-2022-JP?B?GyRCNS5KfSRLJEgkRiRiQmdAWiRKJCpDTiRpJDskLCQiJGo'
=?ISO-2022-JP?B? means the string is ISO-2022-JP encoded, then base64 encoded.
In [125]: msg['Subject'].lstrip('=?ISO-2022-JP?B?')
Out[125]: 'GyRCNS5KfSRLJEgkRiRiQmdAWiRKJCpDTiRpJDskLCQiJGo'
Unfortunately, trying to reverse that process results in an error:
In [126]: base64.b64decode(msg['Subject'].lstrip('=?ISO-2022-JP?B?'))
TypeError: Incorrect padding
Reading this SO answer led me to try adding '?=' to the end of the string:
In [130]: print(base64.b64decode(msg['Subject'].lstrip('=?ISO-2022-JP?B?')+'?=').decode('ISO-2022-JP'))
貴方にとても大切なお知らせがあり
According to google translate, this may be translated as "You know there is a very important".
So it appears the subject line has been truncated.
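If you want to repair such headers programmatically, two details are worth noting: str.lstrip() strips a set of characters rather than a literal prefix (it merely happens to be harmless here), and base64 data must be a multiple of 4 characters long, so the missing padding can be computed instead of guessed. A minimal sketch:
import base64

encoded = 'GyRCNS5KfSRLJEgkRiRiQmdAWiRKJCpDTiRpJDskLCQiJGo'
# pad with '=' up to the next multiple of 4 before decoding
padded = encoded + '=' * (-len(encoded) % 4)
print(base64.b64decode(padded).decode('ISO-2022-JP'))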
I'm wondering what's the best way -- or if there's a simple way with the standard library -- to convert a URL with Unicode chars in the domain name and path to the equivalent ASCII URL, encoded with domain as IDNA and the path %-encoded, as per RFC 3986.
I get from the user a URL in UTF-8. So if they've typed in http://➡.ws/♥ I get 'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5' in Python. And what I want out is the ASCII version: 'http://xn--hgi.ws/%E2%99%A5'.
What I do at the moment is split the URL up into parts via a regex, and then manually IDNA-encode the domain, and separately encode the path and query string with different urllib.quote() calls.
# url is UTF-8 here, eg: url = u'http://➡.ws/㉌'.encode('utf-8')
match = re.match(r'([a-z]{3,5})://(.+\.[a-z0-9]{1,6})'
                 r'(:\d{1,5})?(/.*?)(\?.*)?$', url, flags=re.I)
if not match:
    raise BadURLException(url)
protocol, domain, port, path, query = match.groups()

try:
    domain = unicode(domain, 'utf-8')
except UnicodeDecodeError:
    return ''  # bad UTF-8 chars in domain
domain = domain.encode('idna')

if port is None:
    port = ''
path = urllib.quote(path)
if query is None:
    query = ''
else:
    query = urllib.quote(query, safe='=&?/')

url = protocol + '://' + domain + port + path + query
# url is ASCII here, eg: url = 'http://xn--hgi.ws/%E3%89%8C'
Is this correct? Any better suggestions? Is there a simple standard-library function to do this?
Code:
import urlparse, urllib

def fixurl(url):
    # turn string into unicode
    if not isinstance(url, unicode):
        url = url.decode('utf8')

    # parse it
    parsed = urlparse.urlsplit(url)

    # divide the netloc further
    userpass, at, hostport = parsed.netloc.rpartition('@')
    user, colon1, pass_ = userpass.partition(':')
    host, colon2, port = hostport.partition(':')

    # encode each component
    scheme = parsed.scheme.encode('utf8')
    user = urllib.quote(user.encode('utf8'))
    colon1 = colon1.encode('utf8')
    pass_ = urllib.quote(pass_.encode('utf8'))
    at = at.encode('utf8')
    host = host.encode('idna')
    colon2 = colon2.encode('utf8')
    port = port.encode('utf8')
    path = '/'.join(  # could be encoded slashes!
        urllib.quote(urllib.unquote(pce).encode('utf8'), '')
        for pce in parsed.path.split('/')
    )
    query = urllib.quote(urllib.unquote(parsed.query).encode('utf8'), '=&?/')
    fragment = urllib.quote(urllib.unquote(parsed.fragment).encode('utf8'))

    # put it back together
    netloc = ''.join((user, colon1, pass_, at, host, colon2, port))
    return urlparse.urlunsplit((scheme, netloc, path, query, fragment))

print fixurl('http://\xe2\x9e\xa1.ws/\xe2\x99\xa5')
print fixurl('http://\xe2\x9e\xa1.ws/\xe2\x99\xa5/%2F')
print fixurl(u'http://Åsa:abc123@➡.ws:81/admin')
print fixurl(u'http://➡.ws/admin')
Output:
http://xn--hgi.ws/%E2%99%A5
http://xn--hgi.ws/%E2%99%A5/%2F
http://%C3%85sa:abc123@xn--hgi.ws:81/admin
http://xn--hgi.ws/admin
Read more:
urllib.quote()
urlparse.urlparse()
urlparse.urlunparse()
urlparse.urlsplit()
urlparse.urlunsplit()
Edits:
Fixed the case of already quoted characters in the string.
Changed urlparse/urlunparse to urlsplit/urlunsplit.
Don't encode user and port information with the hostname. (Thanks Jehiah)
When "#" is missing, don't treat the host/port as user/pass! (Thanks hupf)
The code given by MizardX isn't 100% correct. This example won't work:
example.com/folder/?page=2
Check out django.utils.encoding.iri_to_uri() to convert a Unicode URL to an ASCII URL:
http://docs.djangoproject.com/en/dev/ref/unicode/
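Usage is a one-liner, assuming Django is installed; note that iri_to_uri() percent-encodes the path and query but leaves the hostname alone, so IDNA-encoding the domain is still up to you:
>>> from django.utils.encoding import iri_to_uri
>>> iri_to_uri(u'/folder/?page=2&q=♥')
'/folder/?page=2&q=%E2%99%A5'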
There's some RFC 3986 URL-parsing work underway (e.g. as part of the Summer of Code), but nothing in the standard library yet, AFAIK -- and nothing much on the URI-encoding side of things either, again AFAIK. So you might as well go with MizardX's elegant approach.
Okay, with these comments and some bug-fixing in my own code (it didn't handle fragments at all), I've come up with the following canonurl() function, which returns a canonical, ASCII form of the URL:
import re
import urllib
import urlparse

def canonurl(url):
    r"""Return the canonical, ASCII-encoded form of a UTF-8 encoded URL, or ''
    if the URL looks invalid.

    >>> canonurl(' ')
    ''
    >>> canonurl('www.google.com')
    'http://www.google.com/'
    >>> canonurl('bad-utf8.com/path\xff/file')
    ''
    >>> canonurl('svn://blah.com/path/file')
    'svn://blah.com/path/file'
    >>> canonurl('1234://badscheme.com')
    ''
    >>> canonurl('bad$scheme://google.com')
    ''
    >>> canonurl('site.badtopleveldomain')
    ''
    >>> canonurl('site.com:badport')
    ''
    >>> canonurl('http://123.24.8.240/blah')
    'http://123.24.8.240/blah'
    >>> canonurl('http://123.24.8.240:1234/blah?q#f')
    'http://123.24.8.240:1234/blah?q#f'
    >>> canonurl('\xe2\x9e\xa1.ws')  # tinyarro.ws
    'http://xn--hgi.ws/'
    >>> canonurl(' http://www.google.com:80/path/file;params?query#fragment ')
    'http://www.google.com:80/path/file;params?query#fragment'
    >>> canonurl('http://\xe2\x9e\xa1.ws/\xe2\x99\xa5')
    'http://xn--hgi.ws/%E2%99%A5'
    >>> canonurl('http://\xe2\x9e\xa1.ws/\xe2\x99\xa5/pa%2Fth')
    'http://xn--hgi.ws/%E2%99%A5/pa/th'
    >>> canonurl('http://\xe2\x9e\xa1.ws/\xe2\x99\xa5/pa%2Fth;par%2Fams?que%2Fry=a&b=c')
    'http://xn--hgi.ws/%E2%99%A5/pa/th;par/ams?que/ry=a&b=c'
    >>> canonurl('http://\xe2\x9e\xa1.ws/\xe2\x99\xa5?\xe2\x99\xa5#\xe2\x99\xa5')
    'http://xn--hgi.ws/%E2%99%A5?%E2%99%A5#%E2%99%A5'
    >>> canonurl('http://\xe2\x9e\xa1.ws/%e2%99%a5?%E2%99%A5#%E2%99%A5')
    'http://xn--hgi.ws/%E2%99%A5?%E2%99%A5#%E2%99%A5'
    >>> canonurl('http://badutf8pcokay.com/%FF?%FE#%FF')
    'http://badutf8pcokay.com/%FF?%FE#%FF'
    >>> len(canonurl('google.com/' + 'a' * 16384))
    4096
    """
    # strip spaces at the ends and ensure it's prefixed with 'scheme://'
    url = url.strip()
    if not url:
        return ''
    if not urlparse.urlsplit(url).scheme:
        url = 'http://' + url

    # turn it into Unicode
    try:
        url = unicode(url, 'utf-8')
    except UnicodeDecodeError:
        return ''  # bad UTF-8 chars in URL

    # parse the URL into its components
    parsed = urlparse.urlsplit(url)
    scheme, netloc, path, query, fragment = parsed

    # ensure scheme is a letter followed by letters, digits, and '+-.' chars
    if not re.match(r'[a-z][-+.a-z0-9]*$', scheme, flags=re.I):
        return ''
    scheme = str(scheme)

    # ensure domain and port are valid, eg: sub.domain.<1-to-6-TLD-chars>[:port]
    match = re.match(r'(.+\.[a-z0-9]{1,6})(:\d{1,5})?$', netloc, flags=re.I)
    if not match:
        return ''
    domain, port = match.groups()
    netloc = domain + (port if port else '')
    netloc = netloc.encode('idna')

    # ensure path is valid and convert Unicode chars to %-encoded
    if not path:
        path = '/'  # eg: 'http://google.com' -> 'http://google.com/'
    path = urllib.quote(urllib.unquote(path.encode('utf-8')), safe='/;')

    # ensure query is valid
    query = urllib.quote(urllib.unquote(query.encode('utf-8')), safe='=&?/')

    # ensure fragment is valid
    fragment = urllib.quote(urllib.unquote(fragment.encode('utf-8')))

    # piece it all back together, truncating it to a maximum of 4KB
    url = urlparse.urlunsplit((scheme, netloc, path, query, fragment))
    return url[:4096]

if __name__ == '__main__':
    import doctest
    doctest.testmod()
You might use urlparse.urlsplit instead, but otherwise you seem to have a very straightforward solution there.
protocol, domain, path, query, fragment = urlparse.urlsplit(url)
(You can access the domain and port separately by accessing the returned value's named properties, but as port syntax is always in ASCII it is unaffected by the IDNA encoding process.)
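For example, with the Python 2 urlparse module used throughout this thread:
>>> import urlparse
>>> parts = urlparse.urlsplit(u'http://user:pw@example.com:8080/path?q#frag')
>>> parts.hostname, parts.port
(u'example.com', 8080)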