Python Encoding NLTK - 'charmap' codec can't encode character

import pypyodbc
from pypyodbc import *
import nltk
from nltk import *
import csv
import sys
import codecs
import re
# connect to the database
conn = pypyodbc.connect('Driver={Microsoft Access Driver (*.Mdb)};\
DBQ=C:\\TextData.mdb')
# create a cursor to control the database with
cur = conn.cursor()
cur.execute('''SELECT Text FROM MessageCreationDate WHERE Tags LIKE 'GHS - %'; ''')
TextSet = cur.fetchall()
ghsWordList = []
TextWords = list(TextSet)
for row in TextWords:
    message = re.split(r'\W+', str(row))
    for eachword in message:
        if eachword.isalpha():
            ghsWordList.append(eachword.lower())
print(ghsWordList)
When I run this code, it's giving me an error:
'charmap' codec can't encode character '\u0161' in position 2742: character maps to <undefined>
I've looked at a number of other answers on here to similar questions and googled the hell out of it; however, I am not well versed enough in Python or character encoding to know where I need to use the codecs module to change the character set being used to present/append/create the list.
Could someone not only help me with the code but also point me in the direction of some good reading materials for understanding this sort of thing?

If you are using Python 2.x, add the following lines to your code:
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
Note: if you are using Python 3.x, reload is not a built-in; it is imp.reload(), so an import needs to be added for my solution to work. I don't develop in 3.x, so my suggestion is:
from imp import reload
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
Place this ahead of all your other imports.
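sys.setdefaultencoding does not exist at all in Python 3, so the snippet above is effectively Python 2 only. A minimal Python 3 sketch, assuming the error is raised while printing to a Windows console (which is what the 'charmap' codec suggests):
import sys
# Re-wrap stdout so output is encoded as UTF-8 instead of the console's
# 'charmap' codec; reconfigure() requires Python 3.7+.
sys.stdout.reconfigure(encoding='utf-8', errors='replace')
print('\u0161')  # no longer raises UnicodeEncodeError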

Related

UnicodeEncodeError in Python

I am getting an error and I don't know exactly what I should do.
The error message:
File "pandas_libs\writers.pyx", line 55, in pandas._libs.writers.write_csv_rows
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026' in position 147: ordinal not in range(128)
import numpy as np
import pandas as pd
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import subjectivity
from nltk.sentiment import SentimentAnalyzer
from nltk.sentiment.util import *
import matplotlib.pyplot as mlpt
import tweepy
import csv
import random
import re
consumer_key = ''
consumer_secret = ''
access_token = ''
access_token_secret = ''
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth,wait_on_rate_limit=True)
fetch_tweets=tweepy.Cursor(api.search, q="#unitedAIRLINES",count=100, lang ="en",since="2018-9-13", tweet_mode="extended").items()
data=pd.DataFrame(data=[[tweet_info.created_at.date(),tweet_info.full_text]for tweet_info in fetch_tweets],columns=['Date','Tweets'])
data.to_csv("Tweets.csv")
cdata=pd.DataFrame(columns=['Date','Tweets'])
total=100
index=0
for index, row in data.iterrows():
    stre = row["Tweets"]
    my_new_string = re.sub('[^ a-zA-Z0-9]', '', stre)
    cdata.sort_index()
    cdata.set_value(index, 'Date', row["Date"])
    cdata.set_value(index, 'Tweets', my_new_string)
    index = index + 1
#print(cdata.dtypes)
cdata
I found a solution that also works:
adding encoding='utf-8' to the line:
data.to_csv("Tweets.csv", encoding='utf-8')
pandas is tripping up on handling Unicode data, presumably in generating a CSV output file.
One approach, if you don't really need to process Unicode data, is to simply make conversions on your data to get everything ASCII.
Another approach is to make a pass on your data prior to generating the CSV output file to get the UTF-8 encoding of any non-ASCII characters. (You may need to do this at the cell level of your spreadsheet data.)
I'm assuming Python3 here...
>>> s = "one, two, three, \u2026"
>>> print(s)
one, two, three, …
>>> ascii = str(s.encode("utf-8"))[2:-1]
>>> ascii
'one, two, three, \\xe2\\x80\\xa6'
>>> print(ascii)
one, two, three, \xe2\x80\xa6
See also: help() on codecs module.
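A minimal sketch of that cell-level pass on a DataFrame, using the backslashreplace error handler rather than the repr trick above (df and the Tweets column here are stand-ins for your own data):
import pandas as pd
df = pd.DataFrame({"Tweets": ["one, two, three, \u2026"]})
# Rewrite every string cell so non-ASCII characters become backslash
# escapes; the resulting frame is pure ASCII and safe for any codec.
df = df.applymap(
    lambda v: v.encode("ascii", "backslashreplace").decode("ascii")
    if isinstance(v, str) else v
)
df.to_csv("Tweets.csv")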

UnicodeError with SqlAlchemy + Firebird + FDB

I'm trying to display results from a Firebird 3.x database, but get:
File "/...../Envs/pos/lib/python3.6/site-packages/fdb/fbcore.py", line 479, in b2u
    return st.decode(charset)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd1 in position 9: invalid continuation byte
Even though I set utf-8 everywhere:
# -*- coding: UTF-8 -*-
import os
os.environ["PYTHONIOENCODING"] = "utf8"
from sqlalchemy import *
SERVIDOR = "localhost"
BASEDATOS_1 = "db.fdb"
PARAMS = dict(
    user="SYSDBA",
    pwd="masterkey",
    host="localhost",
    port=3050,
    path=BASEDATOS_1,
    charset='utf-8'
)
firebird = create_engine("firebird+fdb://%(user)s:%(pwd)s@%(host)s:%(port)d/%(path)s?charset=%(charset)s" % PARAMS, encoding=PARAMS['charset'])
def select(eng, sql):
    with eng.connect() as con:
        return con.execute(sql)
for row in select(firebird, "SELECT * from clientes"):
    print(row)
I had the same problem.
In my situation the database was not in UTF-8.
After setting the correct charset in the connection string it worked: ?charset=ISO8859_1
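Applied to the snippet above, that means making the charset parameter match how the database was actually created (ISO8859_1 here is the answer's value, not a universal default):
PARAMS['charset'] = 'ISO8859_1'  # must match the database's own charset
firebird = create_engine(
    "firebird+fdb://%(user)s:%(pwd)s@%(host)s:%(port)d/%(path)s?charset=%(charset)s"
    % PARAMS
)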
I would try to use the module unidecode.
Your script is crashing when it tries to convert, so this module can help you. As the module documentation says:
The module exports a single function that takes an Unicode object
(Python 2.x) or string (Python 3.x) and returns a string (that can be
encoded to ASCII bytes in Python 3.x)
First install it using pip, then try this:
import unidecode
...
if type(line) is unicode:
    line = unidecode.unidecode(line)
I hope it solves your problem.
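A quick usage example (the sample word is the one from the unidecode docs):
import unidecode
print(unidecode.unidecode(u'kožušček'))  # prints: kozuscek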

How to use utf-8 characters in dryscrape in Python?

I need to use UTF-8 characters with dryscrape's set method, but after running it I get this error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-4: ordinal not in range(128)
My code (for example):
site = dryscrape.Session()
site.visit("https://www.website.com")
search = site.at_xpath('//*[@name="search"]')
search.set(u'فارسی')
search.form().submit()
I also changed u'فارسی' to search.set(unicode('فارسی', 'utf-8')), but it shows the same error.
It's very easy... This method works perfectly with Google. You can also try it with any other site if you know the URL params.
import dryscrape as d
d.start_xvfb()
br = d.Session()
import urllib.parse
query = urllib.parse.quote("فارسی")
print(query) #it prints : '%D9%81%D8%A7%D8%B1%D8%B3%DB%8C'
Url = "http://google.com/search?q="+query
br.visit(Url)
print(br.xpath('//title')[0].text())
#it prints : Google Search - فارسی
#You can also check it with br.render("url_screenshot.png")
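This works because urllib.parse.quote percent-encodes the UTF-8 bytes of the query, so the final URL is pure ASCII and nothing downstream has to encode the Persian text itself.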

How to find string and return it to stdout in Python

I am getting familiar with Python and am struggling to do the below with BeautifulSoup.
What is expected:
If the output of the script below contains the string 5378, it should email me the line in which the string appears.
#! /usr/bin/env python
from bs4 import BeautifulSoup
from lxml import html
import urllib2,re
import codecs
import sys
streamWriter = codecs.lookup('utf-8')[-1]
sys.stdout = streamWriter(sys.stdout)
BASE_URL = "http://outlet.us.dell.com/ARBOnlineSales/Online/InventorySearch.aspx?c=us&cs=22&l=en&s=dfh&brandid=2201&fid=111162"
webpage = urllib2.urlopen(BASE_URL)
soup = BeautifulSoup(webpage.read(), "lxml")
findcolumn = soup.find("div", {"id": "itemheader-FN"})
name = findcolumn.text.strip()
print name
I tried using findall(5378, name), but it returns empty brackets like this: [].
I am struggling with Unicode issues if I am trying to use it along with grep.
$ python dell.py | grep 5378
Traceback (most recent call last):
File "dell.py", line 18, in <module>
print name
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201d' in position 817: ordinal not in range(128)
Can someone tell me what am I doing wrong in both cases?
The function findall (from the re module) expects the first parameter to be a regular expression, which is a string, but you provided an integer. Try this instead:
re.findall("5378", name)
When printed this will output [u'5378'] when it found something or [] when it didn't.
I suspect you want to retrieve the product name from the number, which means you have to iterate through the elements in findcolumn. We can use re.search() here to check for a single match within each element's text.
for input_element in findcolumn.find_all("div"):
    name = unicode(input_element.text.strip())
    if re.search("5378", name) is not None:
        print unicode(name)
As for the unicode error, there are a bunch of solutions, depending on your operating system and configuration: Reconfigure your system locale on Ubuntu or Encode your script output with .encode()/unicode().
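A minimal sketch of the .encode() route for this script (Python 2; assumes UTF-8 is acceptable to whatever consumes the output). When stdout is a pipe, as with | grep, Python 2 falls back to ascii, so encode explicitly before printing:
print name.encode('utf-8')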

nltk NERTagger UnicodeDecodeError in python

I am writing a program in Python 2.7.6 that uses nltk with the Stanford named entity tagger on Windows 7 Professional to tag a text and print the result, as follows:
import re
from nltk.tag.stanford import NERTagger
WORD = re.compile(r'\w+')
st = NERTagger("./classifiers/english.all.3class.distsim.crf.ser.gz", "stanford-ner.jar")
text = "title Wienfilm 1896-1976 (1976)"
words = WORD.findall(text)
print words
answer = st.tag(words)
print answer
The last print statement in the program is supposed to print a list of five tuples, as:
[(u'title', u'O'), (u'Wienfilm', u'O'), (u'1896', u'O'), (u'1976', u'O'), (u'1976', u'O')]
However when I run the program, it gives me the following error message:
['title', 'Wienfilm', '1896', '1976', '1976']
Traceback (most recent call last):
File "E:\Google Drive\myPyPrgs\testNLP.py", line 27, in <module>
answer = st.tag(words )
File "C:\Python27\lib\site-packages\nltk\tag\stanford.py", line 59, in tag
return self.tag_sents([tokens])[0]
File "C:\Python27\lib\site-packages\nltk\tag\stanford.py", line 82, in tag_sents
stanpos_output = stanpos_output.decode(encoding)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 23: ordinal not in range(128)
Note that if I remove the number '-1976' from the text string, the program tags and prints the correct answer. But if the number '-1976' is within the text, I always get the above error.
In this forum, somebody suggested that I change the default encoding in nltk's stanford.py. When I changed the default encoding in stanford.py from ascii to UTF-16 and replaced the last print statement of the above code with the following loop:
for i, word_pos in enumerate(answer):
    word, pos = word_pos
    print i, word.encode('utf-16'), pos.encode('utf-16')
I got the following incorrect output:
0 ÿþ ÿþtitle/O Wienfilm/O 1896 1976 1976/O
Any clues on how to deal with this issue? Thanks in advance.
This worked for me: specify the encoding argument as UTF-8 when you create the NERTagger object:
st = NERTagger("./classifiers/english.all.3class.distsim.crf.ser.gz", "stanford-ner.jar", encoding='utf-8')
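Judging by the traceback in the question (stanpos_output.decode(encoding)), the encoding argument is what nltk uses to decode the tagger subprocess's output, so passing utf-8 replaces the failing ascii default.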
Open terminal(cmd), and write;
chcp
It should return something like;
active code page: 857
Then, write;
chcp 1254
After then, in your .py script, to the top of your script write;
# -*- coding: cp1254 -*-
This should solve your problem. If it doesn't, copy this code and paste it at the top of your script:
# -*-coding:utf-8-*-
import locale
locale.setlocale(locale.LC_ALL, '')
I had many problems with decoding before; these methods solved them.
ASCII can decode only 2^7 = 128 characters; that's why you are getting that error, as you can see in the message: ordinal not in range(128).
And please check this website. Use the arrow keys to switch pages :-) I believe it's going to solve your problem.
At the top of your app add:
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
I was dealing with the same problem, and I solved it by adding encoding options in nltk's internals.py.
You must open internals.py, saved in:
%YourPythonFolder%\Lib\site-packages\nltk\internals.py
Then go to the method java and add this line after # Construct the full command string (about line 147):
cmd = cmd + ['-inputEncoding', 'utf-8', '-outputEncoding', 'utf-8']
This section of code must look like:
# Construct the full command string.
cmd = list(cmd)
cmd = ['-cp', classpath] + cmd
cmd = [_java_bin] + _java_options + cmd
cmd = cmd + ['-inputEncoding', 'utf-8', '-outputEncoding', 'utf-8']
Hope it helps.
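Presumably this works because the flags make the Stanford subprocess read and write UTF-8 instead of the platform default, which is what tripped up the decode call in nltk's stanford.py shown in the traceback above.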
