Parse multipart/form-data file in UTF-8

I am parsing a multipart/form input with Python's cgi module:
body_file = StringIO.StringIO(self.request.body)
pdict = {'boundary': 'xYzZY'}
httpbody = cgi.parse_multipart(body_file, pdict)
text = self.trim(httpbody['text'])
and I want to print some elements of httpbody that are UTF-8 encoded.
I tried text.decode('utf-8') and unicode(text, encoding='utf-8'), but nothing seems to work. Am I missing something here?

Try the following:
text = self.trim(httpbody['text'])
text.encode('utf-8')
I'm assuming the text variable is a string; if you're not sure, wrap it in str() first. Otherwise, you'll get another error thrown at you.
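For what it's worth, here is a minimal sketch of going the other way, from bytes to text. It assumes (this is not stated in the question) that each field in the parsed form maps to a list of UTF-8 byte strings, which is how Python 2's cgi.parse_multipart returned its values. The httpbody dict below is a stand-in for the real parsed body and decode_field is a hypothetical helper, written in Python 3 syntax.

```python
# Stand-in for the dict returned by cgi.parse_multipart in the question.
# Assumption: each field name maps to a list of UTF-8 encoded byte strings.
httpbody = {'text': [u'こんにちは'.encode('utf-8')]}

def decode_field(form, name, encoding='utf-8'):
    """Return every value of one form field as decoded text (hypothetical helper)."""
    # Decode each list element; calling .decode() on the list itself would fail.
    return [value.decode(encoding) for value in form[name]]

texts = decode_field(httpbody, 'text')
print(texts[0] == u'こんにちは')
```

If the values really are lists, that could explain why calling decode directly on httpbody['text'] did not behave as expected, but that is a guess, since the question does not show what self.trim does.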

Related

Read JSON data from UTF-8 encoded byte string

I have a script that sends a UTF-8 encoded JSON byte string to a socket (a GitHub project: https://github.com/alios/raildriver). Now I'm writing the Python script that needs to read the incoming data. Right now I can receive the data and print it to the terminal with the following script: https://www.binarytides.com/code-telnet-client-sockets-python/
Output:
data = '{"Current": 117.42609405517578, "Accelerometer": -5.394751071929932, "SpeedometerKPH": 67.12493133544922, "Ammeter": 117.3575210571289, "Amp": 117.35590362548828, "Acceleration": -0.03285316377878189, "TractiveEffort": -5.394751071929932, "Effort": 48.72163772583008, "RawTargetDistance": 3993.927734375, "TargetDistanceBar": 0.9777777791023254, "TargetDistanceDigits100": -1.0, "TargetDistanceDigits1000": -1.0}'
The problem is that I can't find out how to read values from the JSON object. For example, read "Ammeter" and store its value, 117.3575210571289, in a new variable.
All the data is being received in the variable data
The code I have right now:
decodedjson = data.decode('utf-8')
dumpedjson = json.dumps(decodedjson)
loadedjson = json.loads(dumpedjson)
Can you please help me?

You are encoding to JSON and then decoding again. Simply don't encode; remove the second line:
decodedjson = data.decode('utf-8')
loadedjson = json.loads(decodedjson)
If you are using Python 3.6 or newer, you don't actually have to decode from UTF-8, as the json.loads() function knows how to deal with UTF-encoded JSON data directly. The same applies to Python 2:
loadedjson = json.loads(data)
Demo using Python 3.7:
>>> data = b'{"Current": 117.42609405517578, "Accelerometer": -5.394751071929932, "SpeedometerKPH": 67.12493133544922, "Ammeter": 117.3575210571289, "Amp": 117.35590362548828, "Acceleration": -0.03285316377878189, "TractiveEffort": -5.394751071929932, "Effort": 48.72163772583008, "RawTargetDistance": 3993.927734375, "TargetDistanceBar": 0.9777777791023254, "TargetDistanceDigits100": -1.0, "TargetDistanceDigits1000": -1.0}'
>>> loadedjson = json.loads(data)
>>> loadedjson['Ammeter']
117.3575210571289

Unable to display Japanese (UTF-8) characters in email body with webbrowser

I am reading text from two different .txt files and concatenating them together. Then I add the result to the body of an email by using webbrowser.
One text file is English characters (ascii) and the other Japanese (UTF-8). The text will display fine if I write it to a text file. But if I use webbrowser to insert the text into an email body the Japanese text displays as question marks.
I have tried running the script on multiple machines that have different mail clients as their defaults. Initially I thought maybe that was the issue, but that does not appear to be. Thunderbird and Mail (MacOSX) display question marks.
Hello. Today is 2014-05-09
????????????????2014-05-09????
I have looked at similar issues around on SO but they have not solved the issue.
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)
Japanese in python function
Printing out Japanese (Chinese) characters
python utf-8 japanese
Is there a way to have the Japanese (UTF-8) display in the body of an email created with webbrowser in python? I could use the email functionality but the requirement is the script needs to open the default mail client and insert all the information.
The code and text files I am using are below. I have simplified it to focus on the issue.
email-template.txt
Hello. Today is {{date}}
email-template-jp.txt
こんにちは。今日は {{date}} です。
Python Script
#
# -*- coding: utf-8 -*-
#
import sys
import re
import os
import glob
import webbrowser
import codecs,sys
sys.stdout = codecs.getwriter('utf8')(sys.stdout)
# vars
date_range = sys.argv[1:][0]
email_template_en = "email-template.txt"
email_template_jp = "email-template-jp.txt"
email_to_send = "email-to-send.txt" # finished email is saved here
# Default values for the composed email that will be opened
mail_list = "test#test.com"
cc_list = "test1#test.com, test2#test.com"
subject = "Email Subject"
# Open email templates and insert the date from the parameters sent in
try:
    f_en = open(email_template_en, "r")
    f_jp = codecs.open(email_template_jp, "r", "UTF-8")
    try:
        email_content_en = f_en.read()
        email_content_jp = f_jp.read()
        email_en = re.sub(r'{{date}}', date_range, email_content_en)
        email_jp = re.sub(r'{{date}}', date_range, email_content_jp).encode("UTF-8")
        # this throws an error
        # UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 26: ordinal not in range(128)
        # email_en_jp = (email_en + email_jp).encode("UTF-8")
        email_en_jp = (email_en + email_jp)
    finally:
        f_en.close()
        f_jp.close()
        pass
except Exception, e:
    raise e
# Open the default mail client and fill in all the information
try:
    f = open(email_to_send, "w")
    try:
        f.write(email_en_jp)
        # Does not send Japanese text to the mail client. But will write to the .txt file fine. Unsure why.
        webbrowser.open("mailto:%s?subject=%s&cc=%s&body=%s" %(mail_list, subject, cc_list, email_en_jp), new=1) # open mail client with prefilled info
    finally:
        f.close()
        pass
except Exception, e:
    raise e
edit: Forgot to add I am using Python 2.7.1

EDIT 2: Found a workable solution after all.
Replace your webbrowser call with this.
import subprocess
[... other code ...]
arg = "mailto:%s?subject=%s&cc=%s&body=%s" % (mail_list, subject, cc_list, email_en_jp)
subprocess.call(["open", arg])
This will open your default email client on MacOS. For other OSes please replace "open" in the subprocess line with the proper executable.
EDIT: I looked into it a bit more, and Mark's comment above made me read RFC 2368, which defines the mailto URL scheme.
The special hname "body" indicates that the associated hvalue is the
body of the message. The "body" hname should contain the content for
the first text/plain body part of the message. The mailto URL is
primarily intended for generation of short text messages that are
actually the content of automatic processing (such as "subscribe"
messages for mailing lists), not general MIME bodies.
And a bit further down:
8-bit characters in mailto URLs are forbidden. MIME encoded words (as
defined in [RFC2047]) are permitted in header values, but not for any
part of a "body" hname.
So it looks like this is not possible as per RFC, although that makes me question why the JavaScript solution in the JSFiddle provided by naota works at all.
I leave my previous answer as is below, although it does not work.
I have run into the same issues with Python 2.7.x quite a few times now, and every time a different solution somehow worked.
So here are several suggestions that may or may not work, as I haven't tested them.
a) Force unicode strings:
webbrowser.open(u"mailto:%s?subject=%s&cc=%s&body=%s" % (mail_list, subject, cc_list, email_en_jp), new=1)
Notice the small u right after the opening ( and before the ".
b) Force the regex to use unicode:
email_jp = re.sub(ur'{{date}}', date_range, email_content_jp).encode("UTF-8")
# or maybe
email_jp = re.sub(ur'{{date}}', date_range, email_content_jp)
c) Another idea regarding the regex, try compiling it first with the re.UNICODE flag, before applying it.
pattern = re.compile(ur'{{date}}', re.UNICODE)
d) Not directly related, but I noticed you write the combined text via the normal open method. Try using the codecs.open here as well.
f = codecs.open(email_to_send, "w", "UTF-8")
Hope this helps.
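Since the RFC passage quoted above forbids 8-bit characters in mailto URLs, one more thing that might be worth trying is percent-encoding every field so the URL is pure ASCII. The following is an untested-against-mail-clients sketch in Python 3 syntax (in Python 2.7 the equivalent function is urllib.quote, applied to UTF-8 encoded bytes). build_mailto and the example.com addresses are made up for illustration, and whether a given mail client decodes the body as UTF-8 is an assumption.

```python
from urllib.parse import quote

def build_mailto(to, subject, cc, body):
    """Build a mailto: URL whose fields are percent-encoded UTF-8 (hypothetical helper)."""
    # quote() encodes str as UTF-8 by default and escapes every byte that is
    # not an unreserved ASCII character, so the result contains no 8-bit data.
    return "mailto:%s?subject=%s&cc=%s&body=%s" % (
        quote(to, safe="@,"),
        quote(subject, safe=""),
        quote(cc, safe="@,"),
        quote(body, safe=""),
    )

url = build_mailto(
    "test@example.com",
    "Email Subject",
    "cc1@example.com,cc2@example.com",
    u"こんにちは。今日は 2014-05-09 です。",
)
print(url.startswith("mailto:test@example.com?subject=Email%20Subject"))
```

The resulting string could then be handed to webbrowser.open or to the subprocess call shown earlier in this answer.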

escaping double quotes in cgi.FieldStorage() with json.loads python

If I receive a JSON string with different values, I want to escape the double quotes inside those values.
This is not working: to loop over the values of the given fields I need to call json.loads(string) first, but that already fails because there is a stray double quote in one of the values.
If I loop over the raw string instead, it also escapes the double quotes that are set correctly, and it fails again.
How can I loop over just the values?
print("Test started...")
try:
    import json
    import cgi
    form3 = cgi.FieldStorage()
    print("cgi Fieldstorage loaded in form3...")
    form2 = form3["json"].value
    print("form 2 is now form3.value...")
    print("loop now starting...")
    for x in form2:
        print("in loop...")
        x = x.replace('"','\"')
        print("dumped an item in form1...")
    form1 = json.loads(form2)
    print("form1 is being prepared with json.loads... ")
    print("dumped form1 string looks like : " + json.dumps(form1))
# handling JSON exceptions thrown by corrupted parameters
except (ValueError, KeyError):
    import sys
    print("Encoding Error")
    sys.exit()
Example input from the address bar:
http://localhost/script.py?json={"field":"value","field2":"value","field3":"val"ue"}
Note that the value of field3 should be escaped as follows:
http://localhost/script.py?json={"field":"value","field2":"value","field3":"val\"ue"}
This is what happens if the string is escaped without first loading it as JSON via json.loads(string):
http://localhost/script.py?json={\"field\":\"value\",\"field2\":\"value\",\"field3\":\"val\"ue\"}
It's obvious why this happens, but this string can no longer be loaded via json.loads.
Neither can I call json.loads first and escape the values afterwards, because json.loads won't accept the malformed JSON string (the value of field3 is corrupted).
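Two observations on the code above, as a hedged sketch rather than a full answer. First, in Python the literal '\"' is the same one-character string as '"', so the replace call in the loop changes nothing. Second, if the sender of the request is under your control (an assumption; the question does not say), the quote can be escaped at the source by building the string with json.dumps and URL-encoding it. The payload dict below is illustrative, and the example uses Python 3's urllib.parse.

```python
import json
from urllib.parse import quote, unquote

# '\"' is just '"', so replace('"', '\"') is a no-op.
no_op = 'val"ue'.replace('"', '\"')

# Sender side: let json.dumps escape the inner quote, then URL-encode.
payload = {"field": "value", "field2": "value", "field3": 'val"ue'}
encoded = json.dumps(payload)          # field3 is serialized as "val\"ue"
query_value = quote(encoded, safe="")  # safe to place after ?json=

# Receiver side: undo the URL encoding, then parse as usual.
# (cgi.FieldStorage already performs the URL-decoding step itself.)
decoded = json.loads(unquote(query_value))
print(decoded["field3"] == 'val"ue')
```

If the sender cannot be changed, the unescaped input is ambiguous JSON, and any repair would have to guess where each value ends.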

reading web pages including various languages such Russian, Korean and etc

Hello everyone.
For my research projects, I have collected some web pages.
For example, http://git.gnome.org/browse/anjuta/commit/?id=d17caca8f81bb0f0ba4d341d6d6132ff51d186e3
As you can see on the above web page, the committer's name is not in English.
Other web pages also have committers' names written in various non-English languages.
The following code handles the committers' names.
import csv
import re
import urllib
def get_page (link):
    k = 1
    while k == 1:
        try:
            f = urllib.urlopen (link)
            htmlSource = f.read()
            return htmlSource
        except EnvironmentError:
            print ('Error occurred:', link)
        else:
            k = 2
    f.close()
def get_commit_info (commit_page):
    commit_page_string = str (commit_page)
    author_pattern = re.compile (r'<tr><th>author</th><td>(.*?)</td><td class=', re.DOTALL)
    t_author = author_pattern.findall (commit_page_string)
    t_author_string = str (t_author)
    author_point = re.search (" <", t_author_string)
    author = t_author_string[:author_point.start()]
    print author
git_url = "http://git.gnome.org/browse/anjuta/commit/?id=d17caca8f81bb0f0ba4d341d6d6132ff51d186e3"
commit_page = get_page (git_url)
get_commit_info (commit_page)
The result of 'print author' is as follows:
\xd0\x9c\xd0\xb8\xd1\x80\xd0\xbe\xd1\x81\xd0\xbb\xd0\xb0\xd0\xb2 \xd0\x9d\xd0\xb8\xd0\xba\xd0\xbe\xd0\xbb\xd0\xb8\xd1\x9b
How can I print the name exactly?

WELL... this will do what you want
author = 'Мирослав Николић'
print author.decode('utf8') # Мирослав Николић
But it also won't work if the encoding isn't UTF8...
Mostly things use utf8. Mostly.
Unicode is complicated stuff to get your head around. 'author' is a string object that contains bytes. There is no information in those bytes to tell you what they represent. Absolutely none. You have to tell Python that this string of bytes is text encoded as UTF8. Decoding then maps each sequence of one or more bytes to the Unicode code point it encodes.
You could detect the encoding for each page by looking at the meta tags. In html5 they would look like this:
<meta charset="utf-8">.
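A rough sketch of that idea in Python 3 syntax: look for a charset declaration in the raw bytes and fall back to UTF-8 when none is found. The regular expression is deliberately simple and decode_page is a hypothetical helper; a real page may declare its encoding in the HTTP Content-Type header instead, or name an encoding Python does not know (which would raise LookupError here).

```python
import re

# Matches both <meta charset="utf-8"> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=..."> form.
META_CHARSET = re.compile(br'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)

def decode_page(raw, default='utf-8'):
    """Decode raw HTML bytes using the charset named in a meta tag, if any."""
    match = META_CHARSET.search(raw)
    encoding = match.group(1).decode('ascii') if match else default
    # errors='replace' keeps going if a few bytes do not fit the encoding.
    return raw.decode(encoding, errors='replace')

page = (b'<html><head><meta charset="utf-8"></head><body>'
        + u'Мирослав Николић'.encode('utf-8')
        + b'</body></html>')
text = decode_page(page)
print(u'Мирослав Николић' in text)
```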

Python char encoding

I have the following code :
msgtxt = "é"
msg = MIMEText(msgtxt)
msg.set_charset('ISO-8859-1')
msg['Subject'] = "subject"
msg['From'] = "from#mail.com"
msg['To'] = "to#mail.com"
serv.sendmail("from#mail.com","to#mail.com", msg.as_string())
The e-mail arrives with Ã© as its body instead of the expected é
I have tried :
msgtxt = "é".encode("ISO-8859-1")
msgtxt = u"é"
msgtxt = unicode("é", "ISO-8859-1")
all yield the same result.
How to make this work?
Any help is appreciated.
Thanks in advance, J.

msgtxt = "é"
msg.set_charset('ISO-8859-1')
Well, what's the encoding of the source file containing this code? If it's UTF-8, which is a good default choice, just writing the é will have given you the two-byte string '\xc3\xa9', which, when viewed as ISO-8859-1, looks like Ã©.
If you want to use non-ASCII byte string literals in your source file without having to worry about what encoding the text editor is saving it as, use a string literal escape:
msgtxt = '\xE9'
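The mismatch described above can be reproduced in a few lines. This is only an illustration, written in Python 3 syntax where text and bytes are separate types:

```python
text = u"é"

# Saved as UTF-8, the character becomes two bytes.
utf8_bytes = text.encode("utf-8")           # b'\xc3\xa9'

# Reading those two bytes as ISO-8859-1 yields two characters.
misread = utf8_bytes.decode("iso-8859-1")   # 'Ã©'

# Encoded as ISO-8859-1, the same character is the single byte 0xE9.
latin1_bytes = text.encode("iso-8859-1")    # b'\xe9'

print(len(utf8_bytes), len(misread), len(latin1_bytes))
```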

# coding: utf-8 (or whatever you want to save your source file in)
msgtxt = u"é"
msg = MIMEText(msgtxt,_charset='ISO-8859-1')
Without the u the text will be in the source encoding. As a Unicode string, msgtxt will be encoded in the indicated character set.
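To check the effect without sending anything, the message can be inspected directly. A small sketch in Python 3 syntax, where every str is already Unicode so the u prefix is optional:

```python
from email.mime.text import MIMEText

# Pass text (not bytes) and let MIMEText encode it in the requested charset.
msg = MIMEText(u"é", _charset="ISO-8859-1")
msg["Subject"] = "subject"

charset = msg.get_content_charset()        # charset declared in Content-Type
body_bytes = msg.get_payload(decode=True)  # body after undoing the transfer encoding

print(charset)
```

Here body_bytes should hold the single ISO-8859-1 byte for é rather than the two UTF-8 bytes.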
