Encoding error using Python

I wrote code to connect to IMAP, parse the body information, and insert it into a database. But I am having some problems with accents.
From email header I got this information:
Content-Type: text/html; charset=ISO-8859-1
But I am not sure I can trust this information...
The email was written in Portuguese, so there are a lot of words with accents. For example, I extracted the following phrase from the email source code (using my browser):
"...instalação de eletrônicos..."
So I connected to IMAP and fetched some emails:
... typ, data = M.fetch(num, '(RFC822)') ...
When I print the content, I get the following word:
print data[0][1]
instala+º+úo de eletr+¦nicos
I tried to use .decode('utf-8'), but had no success. The text should read:
instalação de eletrônicos
How can I make it human-readable? My database is in UTF-8.

The header says it is using the ISO-8859-1 charset, so you need to decode the string with that encoding.
Try this:
data[0][1].decode('iso-8859-1')
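If your database expects UTF-8, a minimal sketch (reusing the names from the question) is to decode with the declared charset and re-encode:
raw = data[0][1]                   # byte string fetched over IMAP
text = raw.decode('iso-8859-1')    # now a unicode object
utf8_bytes = text.encode('utf-8')  # UTF-8 bytes, ready for the database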

Specifying the source code encoding worked for me; it's the comment at the top of my example below. It should be declared at the top of your Python file.
#!/usr/bin/python
# -*- coding: iso-8859-15 -*-
value = """...instalação de eletrônicos...""".decode("iso-8859-15")
print value
# prints: ...instalação de eletrônicos...
import unicodedata
value = unicodedata.normalize('NFKD', value).encode('ascii','ignore')
print value
# prints: ...instalacao de eletronicos...
And now you can do str(value) without an exception as well.
See: http://docs.python.org/2/library/unicodedata.html
This seems to keep all accents:
#!/usr/bin/python
# -*- coding: iso-8859-15 -*-
import unicodedata
value = """...instalação de eletrônicos...""".decode("iso-8859-15")
value = unicodedata.normalize('NFKC', value).encode('utf-8')
print value
print str(value)
# prints (without exceptions/errors):
# ...instalação de eletrônicos...
# ...instalação de eletrônicos...
EDIT:
Do note that with the last version, even though the output looks the same, comparing the two values does not evaluate as equal. For example:
#!/usr/bin/python
# -*- coding: iso-8859-15 -*-
import unicodedata
inValue = """...instalação de eletrônicos...""".decode("iso-8859-15")
normalizedValue = unicodedata.normalize('NFKC', inValue).encode('utf-8')
try:
    print inValue == normalizedValue
except UnicodeWarning:
    pass
# False
EDIT2:
This returns the same:
normalizedValue = unicode("""...instalação de eletrônicos...""".decode("iso-8859-15")).encode('utf-8')
print normalizedValue
print str(normalizedValue)
# prints (without exceptions/errors):
# ...instalação de eletrônicos...
# ...instalação de eletrônicos...
Though I'm not sure this will actually be valid for a utf-8 encoded database. Probably not?
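On the database side, one common pattern (a sketch; the question never names the driver, so MySQLdb here is an assumption) is to hand the driver unicode objects and let it do the encoding:
import MySQLdb
# charset='utf8' + use_unicode=True make the driver encode unicode parameters
conn = MySQLdb.connect(host='localhost', user='user', passwd='secret',
                       db='mail', charset='utf8', use_unicode=True)
cur = conn.cursor()
text = normalizedValue.decode('utf-8')  # pass unicode, not pre-encoded bytes
cur.execute("INSERT INTO messages (body) VALUES (%s)", (text,))
conn.commit()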

Thanks to Martijn Pieters. We figured out that the email had two parts with different encodings. I had to split the parts and treat each one individually.
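For anyone else who hits this: a minimal sketch of handling a multipart message whose parts declare different charsets, using the standard email module (the iso-8859-1 fallback for parts that declare no charset is an assumption):
import email
msg = email.message_from_string(data[0][1])
for part in msg.walk():
    if part.get_content_maintype() == 'text':
        charset = part.get_content_charset() or 'iso-8859-1'  # fallback is a guess
        text = part.get_payload(decode=True).decode(charset)
        # text is now unicode; encode it as UTF-8 just before the database insert
        print text.encode('utf-8')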

Related

Python json load with global language support

Hi, I tried to use international-language text in my script, but it kept returning the escaped representation.
Here is my code.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import json
string ='{\"NAME\":\"ทะเลทอง แลปกุ้ง\",\"DESC\":\"Shop Descriptionอาหารกุ้ง วิตามิน แร่ธาตุ\",\"ADDRESS_LINE_1\":\"29/4หมู่13 บางแก้วซอย1 ต.บางขวัญอ.เมือง\"}'
print json.loads(string)
It returned the escaped format below:
{u'ADDRESS_LINE_1': u'29/4\u0e2b\u0e21\u0e39\u0e4813 \u0e1a\u0e32\u0e07\u0e41\u0e01\u0e49\u0e27\u0e0b\u0e2d\u0e221 \u0e15.\u0e1a\u0e32\u0e07\u0e02\u0e27\u0e31\u0e0d\u0e2d.\u0e40\u0e21\u0e37\u0e2d\u0e07', u'NAME': u'\u0e17\u0e30\u0e40\u0e25\u0e17\u0e2d\u0e07 \u0e41\u0e25\u0e1b\u0e01\u0e38\u0e49\u0e07', u'DESC': u'Shop Description\u0e2d\u0e32\u0e2b\u0e32\u0e23\u0e01\u0e38\u0e49\u0e07 \u0e27\u0e34\u0e15\u0e32\u0e21\u0e34\u0e19 \u0e41\u0e23\u0e48\u0e18\u0e32\u0e15\u0e38'}
This script should support all kinds of languages, like Thai, Tamil, Chinese, etc.
Expected output:
data = json.loads(string)
print data['NAME']
This should print 'ทะเลทอง แลปกุ้ง'.
Your script works perfectly (as expected), provided you run it on a Unicode-capable terminal.
I use IDLE for Python 2.7.12 for win32 on a Windows 7 box and this code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import json
string ='{\"NAME\":\"ทะเลทอง แลปกุ้ง\",\"DESC\":\"Shop Descriptionอาหารกุ้ง วิตามิน แร่ธาตุ\",\"ADDRESS_LINE_1\":\"29/4หมู่13 บางแก้วซอย1 ต.บางขวัญอ.เมือง\"}'
data = json.loads(string)
print data
print data['NAME']
correctly displays:
{u'ADDRESS_LINE_1': u'29/4\u0e2b\u0e21\u0e39\u0e4813 \u0e1a\u0e32\u0e07\u0e41\u0e01\u0e49\u0e27\u0e0b\u0e2d\u0e221 \u0e15.\u0e1a\u0e32\u0e07\u0e02\u0e27\u0e31\u0e0d\u0e2d.\u0e40\u0e21\u0e37\u0e2d\u0e07', u'NAME': u'\u0e17\u0e30\u0e40\u0e25\u0e17\u0e2d\u0e07 \u0e41\u0e25\u0e1b\u0e01\u0e38\u0e49\u0e07', u'DESC': u'Shop Description\u0e2d\u0e32\u0e2b\u0e32\u0e23\u0e01\u0e38\u0e49\u0e07 \u0e27\u0e34\u0e15\u0e32\u0e21\u0e34\u0e19 \u0e41\u0e23\u0e48\u0e18\u0e32\u0e15\u0e38'}
ทะเลทอง แลปกุ้ง
Said differently, it is not a Python problem but only a terminal configuration one.
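Note also that printing the whole dict will always show the \uXXXX escapes, because Python 2's dict repr escapes non-ASCII; printing an individual value writes the actual characters:
print repr(data['NAME'])  # u'\u0e17\u0e30...' -- the repr form, escaped
print data['NAME']        # the real Thai characters, on a Unicode-capable terminal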
import json
string ='{\"NAME\":\"ทะเลทอง แลปกุ้ง\",\"DESC\":\"Shop Descriptionอาหารกุ้ง วิตามิน แร่ธาตุ\",\"ADDRESS_LINE_1\":\"29/4หมู่13 บางแก้วซอย1 ต.บางขวัญอ.เมือง\"}'
print (json.loads(string))
out:
{'DESC': 'Shop Descriptionอาหารกุ้ง วิตามิน แร่ธาตุ', 'ADDRESS_LINE_1': '29/4หมู่13 บางแก้วซอย1 ต.บางขวัญอ.เมือง', 'NAME': 'ทะเลทอง แลปกุ้ง'}
Just use Python 3, where strings are Unicode by default.

How can I solve this ASCII error in Python

def scrapeFacebookPageFeedStatus(page_id, access_token):
    # -*- coding: utf-8 -*-
    with open('%s_facebook_statuses.csv' % page_id, 'wb') as file:
        w = csv.writer(file)
        w.writerow(["status_id", "status_message", "link_name", "status_type", "status_link",
                    "status_published", "num_likes", "num_comments", "num_shares"])
        has_next_page = True
        num_processed = 0  # keep a count on how many we've processed
        scrape_starttime = datetime.datetime.now()
        print "Scraping %s Facebook Page: %s\n" % (page_id, scrape_starttime)
        statuses = getFacebookPageFeedData(page_id, access_token, 100)
        while has_next_page:
            for status in statuses['data']:
                w.writerow(processFacebookPageFeedStatus(status))
                # output progress occasionally to make sure code is not stalling
                num_processed += 1
                if num_processed % 1000 == 0:
                    print "%s Statuses Processed: %s" % (num_processed, datetime.datetime.now())
            # if there is no next page, we're done.
            if 'paging' in statuses.keys():
                statuses = json.loads(request_until_succeed(statuses['paging']['next']))
            else:
                has_next_page = False
    print "\nDone!\n%s Statuses Processed in %s" % (num_processed, datetime.datetime.now() - scrape_starttime)

scrapeFacebookPageFeedStatus(page_id, access_token)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 40-43: ordinal not in range(128)
I'm writing code to scrape Facebook pages and gather all the posts into a CSV file.
The code works properly when there is only English, but the error above appears when I try to scrape pages that post in Arabic.
I know the solution is to use UTF-8, but I don't know how to implement it in the code.
Your problem probably is not in this code; I suspect it is in your processFacebookPageFeedStatus function. When you format your fields, you'll want to make sure any that may contain Unicode characters are decoded (or encoded, as appropriate) as UTF-8.
# In Python 2, decode() turns UTF-8 bytes into a unicode object,
# and encode() turns a unicode object back into UTF-8 bytes.
field_a = "some UTF-8 encoded bytes"
unicode_text = field_a.decode('utf-8')     # bytes -> unicode
utf8_bytes = unicode_text.encode('utf-8')  # unicode -> bytes
Python 2's csv module doesn't support Unicode, so you need to encode each field of your source data to UTF-8 bytes before writing.
Debugging Unicode is a pain, but there are a lot of SO posts about different problems related to encoding/decoding Unicode.
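A minimal sketch of that (encode_row is a hypothetical helper, not part of the original scraper; it assumes each field is either a unicode object or a byte string):
def encode_row(row):
    # Python 2's csv writer wants byte strings, so encode any unicode fields
    return [f.encode('utf-8') if isinstance(f, unicode) else f for f in row]

w.writerow(encode_row(processFacebookPageFeedStatus(status)))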
import sys
reload(sys).setdefaultencoding("utf-8")
I added this piece of code and it works fine when I open the file in pandas.
There are no other errors whatsoever for now.

Unable to display Japanese (UTF-8) characters in email body with webbrowser

I am reading text from two different .txt files and concatenating them, then adding the result to the body of an email using webbrowser.
One text file contains English (ASCII) characters and the other Japanese (UTF-8). The text displays fine if I write it to a text file, but if I use webbrowser to insert it into an email body, the Japanese text displays as question marks.
I have tried running the script on multiple machines with different default mail clients. Initially I thought that might be the issue, but it does not appear to be: both Thunderbird and Mail (Mac OS X) display question marks.
Hello. Today is 2014-05-09
????????????????2014-05-09????
I have looked at similar issues on SO, but they have not solved the problem:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)
Japanese in python function
Printing out Japanese (Chinese) characters
python utf-8 japanese
Is there a way to have the Japanese (UTF-8) text display in the body of an email created with webbrowser in Python? I could use the email functionality instead, but the requirement is that the script open the default mail client with all the information pre-filled.
The code and text files I am using are below. I have simplified it to focus on the issue.
email-template.txt
Hello. Today is {{date}}
email-template-jp.txt
こんにちは。今日は {{date}} です。
Python Script
#
# -*- coding: utf-8 -*-
#
import sys
import re
import os
import glob
import webbrowser
import codecs,sys
sys.stdout = codecs.getwriter('utf8')(sys.stdout)
# vars
date_range = sys.argv[1:][0]
email_template_en = "email-template.txt"
email_template_jp = "email-template-jp.txt"
email_to_send = "email-to-send.txt" # finished email is saved here
# Default values for the composed email that will be opened
mail_list = "test#test.com"
cc_list = "test1#test.com, test2#test.com"
subject = "Email Subject"
# Open email templates and insert the date from the parameters sent in
try:
    f_en = open(email_template_en, "r")
    f_jp = codecs.open(email_template_jp, "r", "UTF-8")
    try:
        email_content_en = f_en.read()
        email_content_jp = f_jp.read()
        email_en = re.sub(r'{{date}}', date_range, email_content_en)
        email_jp = re.sub(r'{{date}}', date_range, email_content_jp).encode("UTF-8")
        # this throws an error
        # UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 26: ordinal not in range(128)
        # email_en_jp = (email_en + email_jp).encode("UTF-8")
        email_en_jp = (email_en + email_jp)
    finally:
        f_en.close()
        f_jp.close()
        pass
except Exception, e:
    raise e
# Open the default mail client and fill in all the information
try:
    f = open(email_to_send, "w")
    try:
        f.write(email_en_jp)
        # Does not send Japanese text to the mail client. But will write to the .txt file fine. Unsure why.
        webbrowser.open("mailto:%s?subject=%s&cc=%s&body=%s" % (mail_list, subject, cc_list, email_en_jp), new=1)  # open mail client with prefilled info
    finally:
        f.close()
        pass
except Exception, e:
    raise e
EDIT: Forgot to add that I am using Python 2.7.1.
EDIT 2: Found a workable solution after all.
Replace your webbrowser call with this.
import subprocess
[... other code ...]
arg = "mailto:%s?subject=%s&cc=%s&body=%s" % (mail_list, subject, cc_list, email_en_jp)
subprocess.call(["open", arg])
This will open your default email client on Mac OS X. For other OSes, replace "open" in the subprocess line with the appropriate executable.
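Depending on the client it may also help to percent-encode the mailto parameters so the URL itself stays pure ASCII. A sketch using the standard library's urllib.quote (whether the client then decodes the body as UTF-8 is client-dependent):
import urllib
# email_en_jp must already be UTF-8 encoded bytes at this point
arg = "mailto:%s?subject=%s&cc=%s&body=%s" % (
    mail_list, urllib.quote(subject), cc_list, urllib.quote(email_en_jp))
subprocess.call(["open", arg])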
EDIT: I looked into it a bit more and Mark's comment above made me read the RFC (2368) for mailto URL scheme.
The special hname "body" indicates that the associated hvalue is the
body of the message. The "body" hname should contain the content for
the first text/plain body part of the message. The mailto URL is
primarily intended for generation of short text messages that are
actually the content of automatic processing (such as "subscribe"
messages for mailing lists), not general MIME bodies.
And a bit further down:
8-bit characters in mailto URLs are forbidden. MIME encoded words (as
defined in [RFC2047]) are permitted in header values, but not for any
part of a "body" hname."
So it looks like this is not possible as per RFC, although that makes me question why the JavaScript solution in the JSFiddle provided by naota works at all.
I leave my previous answer as is below, although it does not work.
I have run into the same issues with Python 2.7.x quite a few times now, and every time a different solution somehow worked.
So here are several suggestions that may or may not work; I haven't tested them.
a) Force unicode strings:
webbrowser.open(u"mailto:%s?subject=%s&cc=%s&body=%s" % (mail_list, subject, cc_list, email_en_jp), new=1)
Notice the small u right after the opening ( and before the ".
b) Force the regex to use unicode:
email_jp = re.sub(ur'{{date}}', date_range, email_content_jp).encode("UTF-8")
# or maybe
email_jp = re.sub(ur'{{date}}', date_range, email_content_jp)
c) Another idea regarding the regex, try compiling it first with the re.UNICODE flag, before applying it.
pattern = re.compile(ur'{{date}}', re.UNICODE)
d) Not directly related, but I noticed you write the combined text via the normal open method. Try using the codecs.open here as well.
f = codecs.open(email_to_send, "w", "UTF-8")
Hope this helps.

Coding: utf-8 doesn't seem to work

UTF-8 doesn't work on my computer. I tried the exact same code on another computer and it worked, but on mine it doesn't. It's in Python.
My program starts like this:
# -*- coding: utf-8 -*- # Needed in Python 2 for åäö
from Tkinter import *
class Kryssruta(Button):
    """ A button that is checked/unchecked when you press it """
    def __init__(self, master, nr=0, rad=0, kolumn=0):
        # Constructor; note the reference to master
        Button.__init__(self, master)
        self.master = master
        self.rad = rad
        self.kolumn = kolumn
        self.markerad = False
        self.kryssad = False
        self.cirklad = False
        self["command"] = self.kryssa
    def kryssa(self):
        if self.markerad == False:
            self.master.klickat(self)
On one computer it works like a charm, but on my own computer I get this message:
SyntaxError: Non-ASCII character '\xc3' in file 'blah' but no encoding declared;
see http://www.python.org/peps/pep-0263.html for details
I am using a PC, running the script in PowerShell.
Does anyone know what the problem might be?
You have a (number of) blank line(s) above the coding: line. From the document listed in the error message:
To define a source code encoding, a magic comment must
be placed into the source files either as first or second
line in the file, such as:
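For reference, a minimal layout that satisfies this rule (nothing above the magic comment except an optional shebang):
#!/usr/bin/env python
# -*- coding: utf-8 -*-
print u"åäö"  # non-ASCII literals now parse fine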
You declare that the source file uses UTF-8 encoding, but actually it doesn't; it uses your system's default Windows code page.
Open the file in Notepad and save it out again with Save As, setting UTF-8 in the Encoding dropdown.
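If you would rather convert the file with Python than with Notepad, a small sketch (the cp1252 source encoding is an assumption for a Swedish Windows install, and the filename is hypothetical):
import codecs

with codecs.open('kryssruta.py', 'r', 'cp1252') as f:  # hypothetical filename
    source = f.read()
with codecs.open('kryssruta.py', 'w', 'utf-8') as f:
    f.write(source)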

Removing extra spaces in Chinese HTML files using lxml

I have a bunch of improperly formatted Chinese HTML files. They contain unnecessary spaces and line breaks, which are displayed as extra spaces in the browser. I've written a script using lxml to modify the HTML files. It works fine on simple tags, but I'm stuck on nested ones. For example:
<p>祝你<span>19</span>岁
生日快乐。</p>
will be displayed in the browser as:
祝你19岁 生日快乐。
Notice the extra space. This is what needs to be deleted. The result html should be like this:
<p>祝你<span>19</span>岁生日快乐。</p>
How do I do this?
Note that the nesting (like the span tag) could be arbitrary, but I don't need to consider the content of the nested elements; they should be preserved as they are. Only the text in the outer element needs to be formatted.
This is what I've got:
# -*- coding: utf-8 -*-
import lxml.html
import re
s1 = u"""<p>祝你19岁
生日快乐。</p>"""
p1 = lxml.html.fragment_fromstring(s1)
print p1.text # I get the whole line.
p1.text = re.sub("\s+", "", p1.text)
print lxml.html.tostring(p1) # spaces are removed.
s2 = u"""<p>祝你<span>19</span>岁
生日快乐。</p>"""
p2 = lxml.html.fragment_fromstring(s2)
print p2.text # I get "祝你"
print p2.tail # I get None
i = p2.itertext()
print i.next() # I get "祝你"
print i.next() # I get "19" from <span>
print i.next() # I get the tailed text, but how do I assemble them back?
print p2.text_content() # The whole text, but how do I put <span> back?
>>> from lxml import etree
>>> root = etree.fromstring('<p>祝你<span>19</span>岁\n生日快乐。</p>')
>>> etree.tostring(root)
b'<p>祝你<span>19</span>岁\n生日快乐。</p>'
>>> for e in root.xpath('/p/*'):
... if e.tail:
... e.tail = e.tail.replace('\n', '')
...
>>> etree.tostring(root)
b'<p>祝你<span>19</span>岁生日快乐。</p>'
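A slightly more general sketch of the same idea (strip_outer_whitespace is a hypothetical helper; note it removes all whitespace, which suits pure Chinese text but would eat spaces inside embedded Latin text):
import re

def strip_outer_whitespace(elem):
    # collapse whitespace in the element's own leading text...
    if elem.text:
        elem.text = re.sub(r'\s+', '', elem.text)
    # ...and in the text that trails each direct child (its "tail"),
    # leaving the children's own content untouched
    for child in elem:
        if child.tail:
            child.tail = re.sub(r'\s+', '', child.tail)

strip_outer_whitespace(root)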
Controversially, I wonder whether this is possible without using an HTML/XML parser, considering that the problem appears to be caused by line wrapping.
I built a regular expression to look for whitespace between Chinese text with the help of this solution here: https://stackoverflow.com/a/2718268/267781
I don't know whether the catch-all (any whitespace between characters) or the more specific [char]\n\s*[char] is better suited to your problem.
# -*- coding: utf-8 -*-
import re
# Whitespace in Chinese HTML
## Used this solution to create regexp: https://stackoverflow.com/a/2718268/267781
## \s+
## (The Han characters on either side are captured so the replacement can keep them;
## only the whitespace between them is dropped.)
fixwhitespace2 = re.compile(u'([\u2e80-\u2e99\u2e9b-\u2ef3\u2f00-\u2fd5\u3005\u3007\u3021-\u3029\u3038-\u303a\u303b\u3400-\u4db5\u4e00-\u9fc3\uf900-\ufa2d\ufa30-\ufa6a\ufa70-\ufad9\U00020000-\U0002a6d6\U0002f800-\U0002fa1d])(\s+)([\u2e80-\u2e99\u2e9b-\u2ef3\u2f00-\u2fd5\u3005\u3007\u3021-\u3029\u3038-\u303a\u303b\u3400-\u4db5\u4e00-\u9fc3\uf900-\ufa2d\ufa30-\ufa6a\ufa70-\ufad9\U00020000-\U0002a6d6\U0002f800-\U0002fa1d])',re.M)
## \n\s*
fixwhitespace = re.compile(u'([\u2e80-\u2e99\u2e9b-\u2ef3\u2f00-\u2fd5\u3005\u3007\u3021-\u3029\u3038-\u303a\u303b\u3400-\u4db5\u4e00-\u9fc3\uf900-\ufa2d\ufa30-\ufa6a\ufa70-\ufad9\U00020000-\U0002a6d6\U0002f800-\U0002fa1d])(\n\s*)([\u2e80-\u2e99\u2e9b-\u2ef3\u2f00-\u2fd5\u3005\u3007\u3021-\u3029\u3038-\u303a\u303b\u3400-\u4db5\u4e00-\u9fc3\uf900-\ufa2d\ufa30-\ufa6a\ufa70-\ufad9\U00020000-\U0002a6d6\U0002f800-\U0002fa1d])',re.M)
sample = u'<html><body><p>\u795d\u4f6019\u5c81\n \u751f\u65e5\u5feb\u4e50\u3002</p></body></html>'
fixwhitespace.sub(ur'\1\3', sample)
Yielding
<html><body><p>祝你19岁生日快乐。</p></body></html>
However, here's how you might do it using the parser and xpath to find linefeeds:
# -*- coding: utf-8 -*-
from lxml import etree
import re
# Again, capture the Han characters on both sides so the sub can keep them.
fixwhitespace = re.compile(u'([\u2e80-\u2e99\u2e9b-\u2ef3\u2f00-\u2fd5\u3005\u3007\u3021-\u3029\u3038-\u303a\u303b\u3400-\u4db5\u4e00-\u9fc3\uf900-\ufa2d\ufa30-\ufa6a\ufa70-\ufad9\U00020000-\U0002a6d6\U0002f800-\U0002fa1d])(\n\s*)([\u2e80-\u2e99\u2e9b-\u2ef3\u2f00-\u2fd5\u3005\u3007\u3021-\u3029\u3038-\u303a\u303b\u3400-\u4db5\u4e00-\u9fc3\uf900-\ufa2d\ufa30-\ufa6a\ufa70-\ufad9\U00020000-\U0002a6d6\U0002f800-\U0002fa1d])',re.M)
sample = u'<html><body><p>\u795d\u4f6019\u5c81\n \u751f\u65e5\u5feb\u4e50\u3002</p></body></html>'
doc = etree.HTML(sample)
for t in doc.xpath("//text()[contains(.,'\n')]"):
    if t.is_tail:
        t.getparent().tail = fixwhitespace.sub(ur'\1\3', t)
    elif t.is_text:
        t.getparent().text = fixwhitespace.sub(ur'\1\3', t)
print etree.tostring(doc)
Yields:
<html><body><p>祝你19岁生日快乐。</p></body></html>
I'm curious which version best matches your working data.
