# -*- coding: utf-8 -*-
import csv
import datetime
import json

# getFacebookPageFeedData, processFacebookPageFeedStatus and
# request_until_succeed are defined elsewhere in the script.

def scrapeFacebookPageFeedStatus(page_id, access_token):
    with open('%s_facebook_statuses.csv' % page_id, 'wb') as file:
        w = csv.writer(file)
        w.writerow(["status_id", "status_message", "link_name", "status_type", "status_link",
                    "status_published", "num_likes", "num_comments", "num_shares"])

        has_next_page = True
        num_processed = 0   # keep a count of how many we've processed
        scrape_starttime = datetime.datetime.now()

        print "Scraping %s Facebook Page: %s\n" % (page_id, scrape_starttime)

        statuses = getFacebookPageFeedData(page_id, access_token, 100)

        while has_next_page:
            for status in statuses['data']:
                w.writerow(processFacebookPageFeedStatus(status))

                # output progress occasionally to make sure code is not stalling
                num_processed += 1
                if num_processed % 1000 == 0:
                    print "%s Statuses Processed: %s" % (num_processed, datetime.datetime.now())

            # if there is no next page, we're done.
            if 'paging' in statuses.keys():
                statuses = json.loads(request_until_succeed(statuses['paging']['next']))
            else:
                has_next_page = False

        print "\nDone!\n%s Statuses Processed in %s" % (num_processed, datetime.datetime.now() - scrape_starttime)

# page_id and access_token are set earlier in the script.
scrapeFacebookPageFeedStatus(page_id, access_token)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 40-43: ordinal not in range(128)
I'm writing code to scrape Facebook pages and gather all the posts into a CSV file.
The code works properly when the posts are only in English, but
the error above appears when I try to scrape pages that post in Arabic.
I know the solution is to use UTF-8, but I don't know how to implement it in the code.
Your problem probably is not in this code; I suspect it is in your processFacebookPageFeedStatus function. When you are formatting your fields, you'll want to make sure any that may contain unicode characters are all decoded (or encoded, as appropriate) as UTF-8.
field_a = "some unicode text in here"            # in Python 2 this is a byte string (str)
field_a.decode('utf-8')                          # -----> u'\u1234\u....' (a unicode object)
field_a.decode('utf-8').encode('utf-8')          # -----> back to the original UTF-8 bytes
Python 2's csv module doesn't support unicode directly, so you need to encode each text field in your source data as UTF-8 bytes before writing it.
Debugging unicode is a pain, but there are a lot of SO posts about different problems related to encoding/decoding unicode
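For example, here's a rough sketch of that idea (the helper names are mine, not from your script): encode every unicode field right before it hits csv.writer.

def encode_if_unicode(value):
    # Python 2's csv module only writes byte strings, so turn any
    # unicode text (e.g. Arabic status messages) into UTF-8 bytes.
    if isinstance(value, unicode):
        return value.encode('utf-8')
    return value

# inside the scraping loop, instead of writing the row directly:
row = processFacebookPageFeedStatus(status)
w.writerow([encode_if_unicode(field) for field in row])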
import sys
reload(sys).setdefaultencoding("utf-8")
I added this piece of code and it works fine when I open the file in pandas.
There are no other errors whatsoever for now.
I am reading text from two different .txt files and concatenating them together. Then I add that to the body of an email opened via webbrowser.
One text file contains English characters (ASCII) and the other Japanese (UTF-8). The text displays fine if I write it to a text file, but if I use webbrowser to insert the text into an email body, the Japanese text displays as question marks.
I have tried running the script on multiple machines that have different mail clients as their defaults. Initially I thought maybe that was the issue, but that does not appear to be the case. Both Thunderbird and Mail (Mac OS X) display question marks.
Hello. Today is 2014-05-09
????????????????2014-05-09????
I have looked at similar issues on SO but they have not solved this one:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)
Japanese in python function
Printing out Japanese (Chinese) characters
python utf-8 japanese
Is there a way to have the Japanese (UTF-8) text display in the body of an email created with webbrowser in Python? I could use Python's email functionality instead, but the requirement is that the script opens the default mail client and inserts all the information.
The code and text files I am using are below. I have simplified it to focus on the issue.
email-template.txt
Hello. Today is {{date}}
email-template-jp.txt
こんにちは。今日は {{date}} です。
Python Script
#
# -*- coding: utf-8 -*-
#
import sys
import re
import os
import glob
import webbrowser
import codecs

sys.stdout = codecs.getwriter('utf8')(sys.stdout)

# vars
date_range = sys.argv[1:][0]

email_template_en = "email-template.txt"
email_template_jp = "email-template-jp.txt"
email_to_send = "email-to-send.txt"  # finished email is saved here

# Default values for the composed email that will be opened
mail_list = "test#test.com"
cc_list = "test1#test.com, test2#test.com"
subject = "Email Subject"

# Open email templates and insert the date from the parameters sent in
try:
    f_en = open(email_template_en, "r")
    f_jp = codecs.open(email_template_jp, "r", "UTF-8")
    try:
        email_content_en = f_en.read()
        email_content_jp = f_jp.read()
        email_en = re.sub(r'{{date}}', date_range, email_content_en)
        email_jp = re.sub(r'{{date}}', date_range, email_content_jp).encode("UTF-8")
        # this throws an error
        # UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 26: ordinal not in range(128)
        # email_en_jp = (email_en + email_jp).encode("UTF-8")
        email_en_jp = (email_en + email_jp)
    finally:
        f_en.close()
        f_jp.close()
        pass
except Exception, e:
    raise e

# Open the default mail client and fill in all the information
try:
    f = open(email_to_send, "w")
    try:
        f.write(email_en_jp)
        # Does not send Japanese text to the mail client. But will write to the .txt file fine. Unsure why.
        webbrowser.open("mailto:%s?subject=%s&cc=%s&body=%s" % (mail_list, subject, cc_list, email_en_jp), new=1)  # open mail client with prefilled info
    finally:
        f.close()
        pass
except Exception, e:
    raise e
Edit: Forgot to add that I am using Python 2.7.1.
EDIT 2: Found a workable solution after all.
Replace your webbrowser call with this.
import subprocess
[... other code ...]
arg = "mailto:%s?subject=%s&cc=%s&body=%s" % (mail_list, subject, cc_list, email_en_jp)
subprocess.call(["open", arg])
This will open your default email client on macOS. For other OSes, please replace "open" in the subprocess line with the proper executable.
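As a rough, untested sketch of what that could be on each platform (the commands here are my assumptions for each OS):

import os
import subprocess
import sys

def open_mailto(arg):
    # Hand the mailto: URL to the platform's default handler.
    if sys.platform == 'darwin':            # macOS
        subprocess.call(["open", arg])
    elif sys.platform.startswith('win'):    # Windows
        os.startfile(arg)
    else:                                   # most Linux desktops
        subprocess.call(["xdg-open", arg])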
EDIT: I looked into it a bit more, and Mark's comment above made me read the RFC (2368) for the mailto URL scheme.
The special hname "body" indicates that the associated hvalue is the
body of the message. The "body" hname should contain the content for
the first text/plain body part of the message. The mailto URL is
primarily intended for generation of short text messages that are
actually the content of automatic processing (such as "subscribe"
messages for mailing lists), not general MIME bodies.
And a bit further down:
8-bit characters in mailto URLs are forbidden. MIME encoded words (as
defined in [RFC2047]) are permitted in header values, but not for any
part of a "body" hname."
So it looks like this is not possible as per RFC, although that makes me question why the JavaScript solution in the JSFiddle provided by naota works at all.
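(If I had to guess, the JavaScript version percent-encodes the UTF-8 bytes, e.g. via encodeURIComponent. The rough Python 2 equivalent would be the snippet below; note the RFC still gives no guarantee that a particular mail client will decode the body this way.)

import urllib

# email_en_jp is assumed to already be UTF-8 encoded bytes, as in the script above.
body = urllib.quote(email_en_jp)
subject_q = urllib.quote(subject)
arg = "mailto:%s?subject=%s&cc=%s&body=%s" % (mail_list, subject_q, cc_list, body)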
I leave my previous answer as is below, although it does not work.
I have run into the same issues with Python 2.7.x quite a few times now, and every time a different solution somehow worked.
So here are several suggestions that may or may not work, as I haven't tested them.
a) Force unicode strings:
webbrowser.open(u"mailto:%s?subject=%s&cc=%s&body=%s" % (mail_list, subject, cc_list, email_en_jp), new=1)
Notice the small u right after the opening ( and before the ".
b) Force the regex to use unicode:
email_jp = re.sub(ur'{{date}}', date_range, email_content_jp).encode("UTF-8")
# or maybe
email_jp = re.sub(ur'{{date}}', date_range, email_content_jp)
c) Another idea regarding the regex, try compiling it first with the re.UNICODE flag, before applying it.
pattern = re.compile(ur'{{date}}', re.UNICODE)
d) Not directly related, but I noticed you write the combined text via the normal open method. Try using the codecs.open here as well.
f = codecs.open(email_to_send, "w", "UTF-8")
Hope this helps.
I'm scraping news articles from various sites, using GAE and Python.
The code where I scrape one article url at a time leads to the following error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 8858: ordinal not in range(128)
Here's my code in its simplest form:
from google.appengine.api import urlfetch

def fetch(url):
    headers = {'User-Agent': "Chrome/11.0.696.16"}
    result = urlfetch.fetch(url, headers)
    if result.status_code == 200:
        return result.content
Here is another variant I have tried, with the same result:
def fetch(url):
    headers = {'User-Agent': "Chrome/11.0.696.16"}
    result = urlfetch.fetch(url, headers)
    if result.status_code == 200:
        s = result.content
        s = s.decode('utf-8')
        s = s.encode('utf-8')
        s = unicode(s, 'utf-8')
        return s
Here's the ugly, brittle one, which also doesn't work:
def fetch(url):
    headers = {'User-Agent': "Chrome/11.0.696.16"}
    result = urlfetch.fetch(url, headers)
    if result.status_code == 200:
        s = result.content
        try:
            s = s.decode('iso-8859-1')
        except:
            pass
        try:
            s = s.decode('ascii')
        except:
            pass
        try:
            s = s.decode('GB2312')
        except:
            pass
        try:
            s = s.decode('Windows-1251')
        except:
            pass
        try:
            s = s.decode('Windows-1252')
        except:
            s = "did not work"
        s = s.encode('utf-8')
        s = unicode(s, 'utf-8')
        return s
The last variant returns s as the string "did not work" from the last except.
So, am I going to have to expand my clumsy try/except construction to encompass all possible encodings (will that even work?), or is there an easier way?
Why have I decided to store the entire HTML rather than just the BeautifulSoup output? Because I want to do the soupifying later, to avoid a DeadlineExceededError in GAE.
Have I read all the excellent articles about Unicode, and how it should be done? Yes. However, I have failed to find a solution that does not assume I know the incoming encoding, which I don't, since I'm scraping different sites every day.
I had the same problem some time ago and there is nothing 100% accurate. What I did was:
Get encoding from Content-Type
Get encoding from meta tags
Detect encoding with chardet Python module
Decode text from the most common encoding to Unicode
Process the text/html
It is better to simply read the Content-Type from the meta tags or headers. Please note that Chrome (unlike Opera) does not guess the encoding. If neither place says it is UTF-8 or anything else, it treats the site as using the Windows default encoding. So only really bad sites fail to define it.
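A condensed sketch of that fallback chain (the regexes are simplistic illustrations and chardet is a third-party module, so treat this as a starting point, not a finished solution):

import re
import chardet  # third-party: pip install chardet

def to_unicode(raw, content_type=None):
    # 1. charset from the Content-Type header, e.g. "text/html; charset=utf-8"
    # 2. charset from a <meta> tag near the top of the page
    # 3. chardet's statistical guess, falling back to a forgiving UTF-8 decode
    candidates = []
    if content_type:
        m = re.search(r'charset=([\w-]+)', content_type, re.I)
        if m:
            candidates.append(m.group(1))
    m = re.search(r'charset=["\']?([\w-]+)', raw[:2048], re.I)
    if m:
        candidates.append(m.group(1))
    guess = chardet.detect(raw).get('encoding')
    if guess:
        candidates.append(guess)
    for enc in candidates:
        try:
            return raw.decode(enc)
        except (LookupError, UnicodeDecodeError):
            continue
    return raw.decode('utf-8', 'replace')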
So I have a Python CGI script running on my Apache server. Basically, from a webpage, the user enters a word into a form, and that word is passed to the script. The word is then used to query the Twitter Search API and return all the tweets for that word.

The issue is, I'm running this query in a loop so I get three pages of results returned (approximately 300 tweets). But when I call the script (which prints out all the tweets into an HTML page), the page will sometimes display 5 tweets, sometimes 18, completely random numbers. Is this a timeout issue, or am I missing something basic in my code? The Python CGI script is posted below; thanks in advance.
#!/usr/bin/python

# Import modules for CGI handling
import cgi, cgitb
import urllib
import json

# Create instance of FieldStorage
form = cgi.FieldStorage()

# Get data from fields
topic = form.getvalue('topic')

results = []
for x in range(1,3):
    response = urllib.urlopen("http://search.twitter.com/search.json?q="+topic+"&rpp=100&include_entities=true&result_type=mixed&lang=en&page="+str(x))
    pyresponse = json.load(response)
    results = results + pyresponse["results"]

print "Content-type:text/html\r\n\r\n"
print "<!DOCTYPE html>"
print "<html>"
print "<html lang=\"en\">"
print "<head>"
print "<meta charset=\"utf-8\" />"
print "<meta name=\"description\" content=\"\"/>"
print "<meta name=\"viewport\" content=\"width=device-width, initial-scale=1.0\"/>"
print "<title>Data analysis for %s </title>" % (topic)
print "</head>"
print "<body>"
print "<label>"
for i in range(len(results)):
    print str(i)+": "+results[i]["text"]+ "<br></br>"
print "</label>"
print "</body>"
print "</html>"
First of all, I would point out that range(1,3) will not get you three pages like you are expecting; it only yields 1 and 2.
However, running your Python code in an interpreter, I encountered an exception at this point:

>>> for i in range(len(results)):
...     print str(i) + ": " + results[i]["text"]

<a few results print successfully>

UnicodeEncodeError: 'latin-1' codec can't encode character u'\U0001f611' in position 121: ordinal not in range(256)
Modifying the encoding then would print them all:
>>> for i in range(len(results)):
...     print str(i) + ": " + results[i]["text"].encode('utf-8')
<success>
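An alternative, if you'd rather not tack .encode('utf-8') onto every print, is to wrap stdout once near the top of the CGI script so every print of a unicode object gets encoded as UTF-8 automatically. Just a suggestion; I haven't tested it against your exact setup:

import sys
import codecs

# A CGI script's stdout is a pipe, so Python 2 falls back to ASCII when
# printing unicode objects; wrapping it makes every print emit UTF-8.
sys.stdout = codecs.getwriter('utf-8')(sys.stdout)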
Ok, got it.
It was actually a really stupid fix. Basically, since Python is parsing the JSON, it needs to encode all of the text into UTF-8 format so it can display correctly.
print str(i)+": "+results[i]["text"].encode('utf-8')+ "<br></br>"
Nothing to do with the script or server itself.
I have a Plone 3.3 site which retrieves information from a RESTful web service; the service returns UTF-8-encoded XML data. The requests are sent via a special browser (let's call it ##proxy for this question). Everything works fine as long as no non-ASCII data is returned.
The get method of the browser looks like this:
def get(self, thedict=None):
    """
    reduced version for stackoverflow
    """
    context = self.context
    request = context.request
    response = request.response
    if thedict is None:
        thedict = request.form
    theurl = '?'.join((basejoin(SERVICEBASE, thedict['path']),
                       urlencode(auth_data),
                       ))
    fo = urlopen(theurl)
    code = fo.code
    headers = fo.info().headers
    for line in headers:
        name, text = line.split(':', 1)
        response.setHeader(name, text.strip())  # trailing \r\n
    text = unicode(fo.read(), 'utf-8')
    # --- debugging only ... -- 8< ----- 8< ----- 8< ----- 8< ----- 8<
    CHARSHERE = set(list(text))
    funnychars = CHARSHERE.difference(XMLCHARS)
    if funnychars:
        funnychars = u''.join(tuple(funnychars))
        logger.info('funny chars: %r' % funnychars)
    elif 1:
        logger.info('funny chars: none')
    # --- ... debugging only -- >8 ----- >8 ----- >8 ----- >8 ----- >8
    response.setBody(text.encode('utf-8'))
First I didn't do anything about encoding (unicode, encode). However, this appears to be necessary; when non-ASCII data like umlauts etc. is present (e.g. INFO proxy##w3l funny chars: u'\xdf'), the browser doesn't like the result. When I try the same request directly (to theurl, not via my ##proxy browser), it works. The problem does not depend on the web browser, i.e. both Firefox and IE choke on it.
There is a Content-Type header of value "text/xml;charset=UTF-8"; inserting a space character behind the semicolon didn't change anything.
edit: This is what Seamonkey says:
XML-Verarbeitungsfehler: Kein Element gefunden
Adresse: http://.../##proxy/get
Zeile Nr. 1, Spalte 1:
^
(in English: xml processing error: no element found; row #1, column #1)
Since I'm not the author of the service, I'm not entirely sure the data is really UTF-8 (although the HTTP header and the <?xml ...> line say so).
What is my mistake? What can I do to find the problem? Am I missing something important about Zope's response objects and/or urllib2.urlopen?
I also tried something like this (condensed):
fo = urlopen(theurl)
raw = fo.read().strip()
try:
    text = unicode(raw, 'utf-8')
    logger.info('Text passes off as utf-8')
except Exception, e:
    logger.info('no valid utf-8:')
    logger.exception(e)
    text = unicode(raw, 'latin-1')
... or the other way round; I never got any decoding error.
I wrote the "raw" data to a file (open(filename, 'wb')) which I inspected with vim (set enc? fenc?), which yielded utf-8.
I'm on my wit's end.
Hello, everyone.
For my research projects, I have collected some web pages.
For example, http://git.gnome.org/browse/anjuta/commit/?id=d17caca8f81bb0f0ba4d341d6d6132ff51d186e3
As you can see on the above web page, the committer's name is not in English.
Other web pages also have committers' names written in various languages other than English.
The following code is for handling committers' names.
import csv
import re
import urllib

def get_page (link):
    k = 1
    while k == 1:
        try:
            f = urllib.urlopen (link)
            htmlSource = f.read()
            return htmlSource
        except EnvironmentError:
            print ('Error occured:', link)
        else:
            k = 2
            f.close()

def get_commit_info (commit_page):
    commit_page_string = str (commit_page)
    author_pattern = re.compile (r'<tr><th>author</th><td>(.*?)</td><td class=', re.DOTALL)
    t_author = author_pattern.findall (commit_page_string)
    t_author_string = str (t_author)
    author_point = re.search (" <", t_author_string)
    author = t_author_string[:author_point.start()]
    print author

git_url = "http://git.gnome.org/browse/anjuta/commit/?id=d17caca8f81bb0f0ba4d341d6d6132ff51d186e3"
commit_page = get_page (git_url)
get_commit_info (commit_page)
The result of 'print author' is as follows:
\xd0\x9c\xd0\xb8\xd1\x80\xd0\xbe\xd1\x81\xd0\xbb\xd0\xb0\xd0\xb2 \xd0\x9d\xd0\xb8\xd0\xba\xd0\xbe\xd0\xbb\xd0\xb8\xd1\x9b
How can I print the name exactly?
WELL... this will do what you want
author = 'Мирослав Николић'
print author.decode('utf8') # Мирослав Николић
But it also won't work if the encoding isn't UTF8...
Mostly things use utf8. Mostly.
Unicode is complicated stuff to get your head around. 'author' is a string object that contains bytes. There is no information in those bytes to tell you what they represent. Absolutely none. You have to tell Python that this string of bytes is text encoded as UTF-8; decode then maps each byte sequence, per the UTF-8 tables, to the unicode character it represents.
You could detect the encoding for each page by looking at the meta tags. In html5 they would look like this:
<meta charset="utf-8">.
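A rough sketch of that idea (a simple regex, not a real HTML parser, and it assumes UTF-8 when no charset is declared):

import re
import urllib

def get_decoded_page(link):
    # Fetch the page, look for a <meta ... charset=...> declaration and
    # decode with it; otherwise fall back to UTF-8.
    html = urllib.urlopen(link).read()
    m = re.search(r'<meta[^>]+charset=["\']?([\w-]+)', html, re.I)
    encoding = m.group(1) if m else 'utf-8'
    return html.decode(encoding, 'replace')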