UnicodeEncodeError in Python

I am getting the error below and don't know how to fix it.
The error message:
File "pandas\_libs\writers.pyx", line 55, in pandas._libs.writers.write_csv_rows
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026' in position 147: ordinal not in range(128)
My code:
import numpy as np
import pandas as pd
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import subjectivity
from nltk.sentiment import SentimentAnalyzer
from nltk.sentiment.util import *
import matplotlib.pyplot as mlpt
import tweepy
import csv
import pandas as pd
import random
import numpy as np
import pandas as pd
import re
consumer_key = ''
consumer_secret = ''
access_token = ''
access_token_secret = ''
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth,wait_on_rate_limit=True)
fetch_tweets=tweepy.Cursor(api.search, q="#unitedAIRLINES",count=100, lang ="en",since="2018-9-13", tweet_mode="extended").items()
data=pd.DataFrame(data=[[tweet_info.created_at.date(),tweet_info.full_text]for tweet_info in fetch_tweets],columns=['Date','Tweets'])
data.to_csv("Tweets.csv")
cdata=pd.DataFrame(columns=['Date','Tweets'])
total=100
index=0
for index,row in data.iterrows():
    stre=row["Tweets"]
    my_new_string = re.sub('[^ a-zA-Z0-9]', '', stre)
    cdata.sort_index()
    cdata.set_value(index,'Date',row["Date"])
    cdata.set_value(index,'Tweets',my_new_string)
    index=index+1
#print(cdata.dtypes)
cdata

I found a solution that also works: pass encoding='utf-8' to to_csv:
data.to_csv("Tweets.csv", encoding='utf-8')

pandas is tripping over non-ASCII (Unicode) data, presumably while generating the CSV output file.
One approach, if you don't really need to process Unicode data, is to convert your data so that everything is ASCII.
Another approach is to make a pass over your data before generating the CSV output file and replace any non-ASCII characters with their UTF-8 encoding. (You may need to do this at the cell level of your spreadsheet data.)
I'm assuming Python 3 here...
>>> s = "one, two, three, \u2026"
>>> print(s)
one, two, three, …
>>> ascii = str(s.encode("utf-8"))[2:-1]
>>> ascii
'one, two, three, \\xe2\\x80\\xa6'
>>> print(ascii)
one, two, three, \xe2\x80\xa6
See also: help() on codecs module.
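The two approaches above can be sketched without pandas. This is a minimal illustration, assuming Python 3; the helper name to_ascii is made up for the example and is not part of any library. It shows that the 'ascii' codec rejects the ellipsis character, that UTF-8 encodes it, and that a lossy ASCII conversion simply drops it.

```python
s = "one, two, three, \u2026"


def to_ascii(text):
    """Hypothetical helper: drop every non-ASCII character (lossy)."""
    return text.encode("ascii", errors="ignore").decode("ascii")


# The 'ascii' codec cannot represent U+2026, which is what the traceback reports.
try:
    s.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# UTF-8 can represent every Unicode character, so this always succeeds.
utf8_bytes = s.encode("utf-8")

# Approach 1 from the answer: force everything down to ASCII.
ascii_only = to_ascii(s)
```

Passing encoding='utf-8' when writing the file keeps the original characters, while the ASCII conversion throws information away, so the former is usually the better default.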

Related

I am trying to use a dataset with pandas in IBM Cloud; error: 'utf-8' codec can't decode bytes in position 135-136: invalid continuation byte

This is the code IBM Cloud generated automatically when I uploaded my dataset. I tried encoding='latin-1', but it still gives me an error.
import os, types
import pandas as pd
from botocore.client import Config
import ibm_boto3
def __iter__(self): return 0
if os.environ.get('RUNTIME_ENV_LOCATION_TYPE') == 'external':
    endpoint_3660ea30b8c954806ac4 = 'https://s3.us.cloud-object-storage.appdomain.cloud'
else:
    endpoint_3660ea30b8c954806ac4 = 'https://s3.private.us.cloud-object-storage.appdomain.cloud'
client_3660ea30b8c954806ac4 = ibm_boto3.client(service_name='s3',
    ibm_api_key_id='xjHcqdBlY9iaaD7qu17e6-njKJPFSdGWk4d',
    ibm_auth_endpoint="https://iam.cloud.ibm.com/oidc/token",
    config=Config(signature_version='oauth'),
    endpoint_url=endpoint_3660ea30b8c954806ac4)
body = client_3660ea30b8c954806ac4.get_object(Bucket='spamdetectionmodel-donotdelete-pr-mt98rs41prv05c',Key='spam.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )
df_data_1 = pd.read_csv(body)
df_data_1.head()
Error:
'utf-8' codec can't decode bytes in position 135-136: invalid continuation byte

Have you tried changing the encoding that pandas uses? Try the following:
df_data_1 = pd.read_csv(body, encoding='utf-8')
Or alternatively:
df_data_1 = pd.read_csv(body, encoding='ISO-8859-1')
It is worth reading up on the encoding setting. The question below helped me resolve this kind of error:
UnicodeDecodeError when reading CSV file in Pandas with Python
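To see why the choice of encoding matters, here is a small sketch that uses only the standard library instead of pandas and IBM Cloud storage. The sample data and the helper read_rows are made up for the illustration. Bytes written as Latin-1 fail to decode as UTF-8 with the same "invalid continuation byte" message, while decoding them as Latin-1 (ISO-8859-1) works.

```python
import csv
import io

# Made-up sample: a CSV saved as Latin-1; "\u00e9" is e-acute, "\u00e1" is a-acute.
raw = "name,city\nJos\u00e9,M\u00e1laga\n".encode("latin-1")


def read_rows(data, encoding):
    """Hypothetical helper: decode raw bytes with `encoding` and parse them as CSV."""
    text = io.TextIOWrapper(io.BytesIO(data), encoding=encoding, newline="")
    return list(csv.reader(text))


# Decoding the Latin-1 bytes as UTF-8 fails on the first accented character.
try:
    read_rows(raw, "utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Latin-1 maps every byte to a character, so this cannot raise a decode error.
rows = read_rows(raw, "latin-1")
```

Note that Latin-1 never raises a decode error, so it can also silently produce wrong characters if the file is really in some other encoding; it is worth checking a few rows by eye.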

Python: why does decrypt fail the second time?

Below is a test. When I try to decrypt and decode the same string a second time, I get an error message, and I have no clue why or where to look. Please advise.
Error:
Traceback (most recent call last):
  File "test.py", line 28, in <module>
    telnr_to_string = str(telnr_decrypt, 'utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9e in position 0: invalid start byte
Code:
from Crypto.Cipher import AES
from Crypto.Util.Padding import *
from urllib.request import urlopen
import os
import urllib
import subprocess
import re
import time
import base64
import random
file_key = open("key.txt")
key = file_key.read()
key = key.encode()
file_iv = open("iv.txt")
iv = file_iv.read()
iv = iv.encode()
obj2 = AES.new(key, AES.MODE_CBC, iv)
z=1
while z < 3:
    x=''
    telnr_decrypt=''
    telnr=''
    telnr_to_string=''
    telnr_cleaned=''
    x=(b'\x9e\xde\x98p:\xa3\x04\x92\xb5!2K[\x8e\\\xee')
    telnr_decrypt = obj2.decrypt(x)
    telnr_to_string = str(telnr_decrypt, 'utf-8')
    telnr_cleaned = re.sub("[^0-9]", "", telnr_to_string)
    print (telnr_cleaned)
    z=z+1

Just move obj2 = AES.new(key, AES.MODE_CBC, iv) into the while loop. Don't worry about the cost: cipher objects are lightweight. They do little more than store the key (or, for AES, probably the internal subkeys) and maintain a small buffer. They do carry state from one call to the next, however, so it is generally a bad idea to reuse or cache them.
If you call encrypt / decrypt multiple times in succession, the methods act as if you were encrypting or decrypting one big message. In other words, they let you encrypt / decrypt piecemeal, so that message-sized buffers can be avoided. This is not documented very explicitly, but you can see it in figure 2 in the documentation.
For CBC mode, this means the second call behaves as if the IV had been set to the last block of ciphertext from the previous call. If the IV is wrong, the first block of plaintext comes out randomized. Random bytes generally do not form valid UTF-8, which means that decoding to a string will (likely) fail.
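The chaining behaviour can be illustrated without any crypto library. The sketch below is a toy, not AES and not secure: the class ToyCBC and its XOR "block cipher" are invented for this example so that it runs on a plain Python install. Only the CBC bookkeeping (the chaining value that survives between calls) is the point.

```python
class ToyCBC:
    """Toy CBC decryptor. The 'block cipher' is a plain XOR with the key:
    NOT secure and NOT AES; it only illustrates the chaining state."""

    BLOCK = 16

    def __init__(self, key, iv):
        self.key = key
        self.prev = iv  # chaining value; starts out as the IV

    def decrypt(self, ciphertext):
        out = bytearray()
        for i in range(0, len(ciphertext), self.BLOCK):
            block = ciphertext[i:i + self.BLOCK]
            # "Decrypt" the block, then XOR with the previous ciphertext block.
            decrypted = bytes(c ^ k for c, k in zip(block, self.key))
            out += bytes(d ^ p for d, p in zip(decrypted, self.prev))
            self.prev = block  # the next call continues from this block
        return bytes(out)


key = bytes(range(16))
iv = bytes(range(16, 32))
plaintext = b"0612345678".ljust(16)  # one 16-byte block, padded with spaces

# One-block CBC encryption with the toy cipher: C = (P xor IV) xor key.
ciphertext = bytes(p ^ v ^ k for p, v, k in zip(plaintext, iv, key))

shared = ToyCBC(key, iv)
first = shared.decrypt(ciphertext)   # uses the real IV
second = shared.decrypt(ciphertext)  # uses `ciphertext` as the IV: garbage

fresh = ToyCBC(key, iv).decrypt(ciphertext)  # new object, real IV again
```

Here first and fresh recover the plaintext, while second does not, which mirrors the failure on the second loop iteration when one cipher object is reused.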

unicode utf8 error while moving working code to separate module

I am a newbie Python programmer. I am struggling with a strange error, which appears only when I move working code from the main script file into a separate module (file) as a function. The error is SyntaxError: (unicode error) 'utf-8' codec can't decode byte 0xbf in position 58: invalid start byte.
If the function is in the main file, there is no error and the code works properly.
The code does some web scraping using Selenium and XPath.
#main file:
import requests
import lxml.html as lh
import pandas as pd
import numpy
import csv
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import funkcje as f
spolka = "https://mojeinwestycje.interia.pl/gie/prof/spolki/notowania?wlid=213"
wynik = f.listaTransakcji(spolka)
#module file with function definition (funkcje.py):
import numpy
import lxml.html as lh
from selenium import webdriver

def listaTransakcji(spolka):
    driver = webdriver.Firefox()
    driver.implicitly_wait(30)
    driver.get(spolka)
    driver.find_element_by_xpath("//button[@class='rodo-popup-agree']").click()
    driver.find_element_by_xpath("//input[@type='radio' and @name='typ' and @value='wsz']").click()
    driver.find_element_by_xpath("//input[@type='submit' and @name='Submit' and @value='pokaż']").click()
    page = driver.page_source
    #end of selenium-----------------------------------------------------------------------------
    #Store the contents of the website under doc
    doc = lh.fromstring(page)
    #extract the transaction records - xpath-----------------------------------------------------
    tr_elements = doc.xpath('//table//tr[@bgcolor="#FFFFFF" or @bgcolor="#F7FAFF"]/td')
    rekord = numpy.array([])
    length = len(tr_elements)
    for i in range (0, length):
        if(tr_elements[i].text=='TRANSAKCJA') or (tr_elements[i].text=='WIDEŁKI STAT') or (tr_elements[i].text=='WIDEŁKI DYN'):
            new_rekord=[tr_elements[i-5].text, tr_elements[i-4].text, tr_elements[i-3].text, tr_elements[i-2].text, tr_elements[i-1].text, tr_elements[i].text]
            rekord=numpy.concatenate((rekord,new_rekord))
    ilosc = (len(rekord))//6
    tablica = numpy.array([])
    tablica = rekord.reshape(ilosc, 6)
    header = numpy.array(["godzina", "cena", "zmiana", "wolumen", "numer", "typ operacji"])
    header = header.reshape(1, 6)
    tablica = numpy.concatenate((header,tablica))
    return (tablica)
offending line 10:
import funkcje as f
offending line 34:
driver.find_element_by_xpath("//input[@type='submit' and @name='Submit' and @value='pokaż']").click()
expected result:
["11:17:40","0,4930","0,00",24300,76,"TRANSAKCJA"]
actual result:
Traceback (most recent call last):
  File "C:/Users/Vox/PycharmProjects/interia/scraper.py", line 10, in <module>
    import funkcje as f
  File "C:\Users\Vox\PycharmProjects\interia\funkcje.py", line 34
SyntaxError: (unicode error) 'utf-8' codec can't decode byte 0xbf in position 58: invalid start byte

Thanks, Marat! This suggestion solved the issue:
try putting # -*- coding: utf-8 -*- at the top of the new file (replace utf-8 with whatever encoding is used for pokaż)
I have no idea why it happened in the first place, though. Is a new file that is not the main file not UTF-8 by default?
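One possible explanation, offered as an assumption rather than something the thread confirms: Python 3 reads source files as UTF-8 by default, and the new module may have been saved by the editor in a Central European single-byte encoding such as cp1250 or ISO-8859-2. In both of those, the letter ż is the single byte 0xbf, which is exactly the byte named in the error. The short check below reproduces the message on the word from the XPath expression.

```python
word = "poka\u017c"  # "pokaż"; U+017C is z with dot above

# How the word is stored under two different file encodings.
cp1250_bytes = word.encode("cp1250")
utf8_bytes = word.encode("utf-8")

# Reading the cp1250 bytes as UTF-8 fails, as it does for the module file.
try:
    cp1250_bytes.decode("utf-8")
    decodes_as_utf8 = True
    bad_byte = None
except UnicodeDecodeError as exc:
    decodes_as_utf8 = False
    bad_byte = cp1250_bytes[exc.start]  # the byte the decoder rejected
```

If this is what happened, re-saving funkcje.py as UTF-8 in the editor should fix it as well, while a coding declaration only helps when it names the encoding the file is actually stored in.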

Attribute error when encoding with base64

I have two keys (a secret key and a public key) generated using Curve25519. I want to encode both keys using base64.safe_b64encode, but I keep getting an error. Is there any way I can encode them this way?
This is my code:
import libnacl.public
import libnacl.secret
import libnacl.utils
from tinydb import TinyDB
from hashlib import sha256
import json
import base64
pikeys = libnacl.public.SecretKey()
piprivkey = pikeys.sk
pipubkey = pikeys.pk
piprivkey = base64.safe_b64encode(piprivkey)
pipubkey = base64.safe_b64encode(pipubkey)
print("encoded priv", piprivkey)
print("encoded pub", pipubkey)
This is the error I got:
Traceback (most recent call last):
  File "/home/pi/Desktop/finalcode/pillar1.py", line 130, in <module>
  File "/home/pi/Desktop/finalcode/pillar1.py", line 50, in generatepillar1key
    piprivkey = base64.safe_b64encode(piprivkey)
AttributeError: 'module' object has no attribute 'safe_b64encode'

You get this error because the base64 library does not have a function named safe_b64encode. What do you mean by safe_b64encode, and why do you want to encode both of your keys with base64? There is a URL-safe encoding function, and there is the regular base64 encoding function.
encoded_data = base64.b64encode(data_to_encode)
or
encoded_data = base64.urlsafe_b64encode(data_to_encode)
The latter just uses a different alphabet, with - instead of + and _ instead of /, so the result is URL-safe. I'm not sure what you want to do, but refer to the base64 module docs.
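The difference between the two alphabets is easy to see on a short input. The three bytes below are chosen only because they happen to map to the characters that differ; they are not real key material.

```python
import base64

# Sample bytes picked so that the encoded form contains the differing characters.
data = b"\xfb\xff\xfe"

standard = base64.b64encode(data)         # may contain + and /
urlsafe = base64.urlsafe_b64encode(data)  # uses - and _ instead

# Either form decodes back to the original bytes with its matching decoder.
roundtrip = base64.urlsafe_b64decode(urlsafe)
```

Both functions take bytes and return bytes, so key material such as pikeys.sk can be passed in directly; call .decode('ascii') on the result if a str is needed.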

The error is telling you that the function safe_b64encode does not exist in the base64 module. Perhaps you meant to use base64.urlsafe_b64encode(s)?

Python Encoding NLTK - 'charmap' codec can't encode character

import pypyodbc
from pypyodbc import *
import nltk
from nltk import *
import csv
import sys
import codecs
import re
#connect to the database
conn = pypyodbc.connect('Driver={Microsoft Access Driver (*.Mdb)};\
DBQ=C:\\TextData.mdb')
#create a cursor to control the database with
cur = conn.cursor()
cur.execute('''SELECT Text FROM MessageCreationDate WHERE Tags LIKE 'GHS - %'; ''')
TextSet = cur.fetchall()
ghsWordList = []
TextWords = list(TextSet)
for row in TextWords :
    message = re.split('\W+',str(row))
    for eachword in message :
        if eachword.isalpha() :
            ghsWordList.append(eachword.lower())
print(ghsWordList)
When I run this code, it gives me an error:
'charmap' codec can't encode character '\u0161' in position 2742: character maps to <undefined>
I've looked at a number of other answers on here to similar questions and searched extensively; however, I am not well versed enough in Python or character encoding to know where I need to use the codecs module to change the character set being used to present/append/create the list.
Could someone not only help me with the code but also point me in the direction of some good reading material for understanding this sort of thing?

If you are using Python 2.x, add the following lines to your code:
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
Note: in Python 3.x, reload is not a built-in; it is imp.reload(), so an import has to be added. I don't develop in 3.x, so treat the following as an untested suggestion. Be aware that Python 3 no longer provides sys.setdefaultencoding, so the last line is expected to raise AttributeError there:
from imp import reload
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
Place this ahead of all your other imports.
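On Python 3, a 'charmap' error from print usually means the output stream uses a legacy code page that has no mapping for the character. The sketch below simulates such a stream in memory. The code page cp437 and the sample word are assumptions for the illustration, since the thread does not say which code page the asker's console used. With an error handler such as backslashreplace, the stream writes an escape sequence instead of raising.

```python
import io

text = "ko\u0161ile"  # sample word containing U+0161, the character from the error

# cp437 (an old console code page, assumed here) has no mapping for U+0161.
try:
    text.encode("cp437")
    encodable = True
except UnicodeEncodeError:
    encodable = False

# Simulated console: same code page, but unencodable characters are escaped.
buffer = io.BytesIO()
console = io.TextIOWrapper(buffer, encoding="cp437",
                           errors="backslashreplace", newline="\n")
print(text, file=console)
console.flush()
written = buffer.getvalue()
```

On a real console, the same effect is available through sys.stdout.reconfigure(errors="backslashreplace") on Python 3.7 and later, or by setting the PYTHONIOENCODING environment variable before starting the script.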
