Indian rupee symbol UnicodeEncodeError while uploading a file to S3 using pandas - Python

I have scraped some data from a website for my assignment. It contains the Indian rupee character, "₹". When I save the data to a CSV file on my local machine using pandas with UTF-8 encoding, it saves without any problem. But when I change the delimiters and try to save the same file to S3 using pandas, I get a "UnicodeEncodeError". I'm scraping the web page with the Scrapy framework.
Earlier I tried saving the file in Latin-1, i.e. "ISO-8859-1", and then switched to "utf-8", but the same error occurs. I'm using Python 3.7 for development.
Below code used for saving on the local machine which is working:
result_df.to_csv(filename+str2+'.csv',index=False)
Below code is used to save the file to S3:
search_df.to_csv('s3://my-bucket/folder_path/filename_str2.csv', encoding='utf-8', line_terminator='^', sep='~', index=False)
Below is the error while saving the file to S3:
2019-10-29 19:24:27 [scrapy.utils.signal] ERROR: Error caught on signal handler: <function Spider.close at 0x0000019CD3B1AA60>
Traceback (most recent call last):
File "c:\programdata\anaconda3\lib\site-packages\twisted\internet\defer.py", line 151, in maybeDeferred
result = f(*args, **kw)
File "c:\programdata\anaconda3\lib\site-packages\pydispatch\robustapply.py", line 55, in robustApply
return receiver(*arguments, **named)
File "c:\programdata\anaconda3\lib\site-packages\scrapy\spiders\__init__.py", line 94, in close
return closed(reason)
File "C:\local_path\spiders\Pduct_Scrape.py", line 430, in closed
search_df.to_csv('s3://my-bucket/folder_path/filename_str2.csv',encoding = 'utf-8',line_terminator='^',sep='~',index=False)
File "c:\programdata\anaconda3\lib\site-packages\pandas\core\generic.py", line 3020, in to_csv
formatter.save()
File "c:\programdata\anaconda3\lib\site-packages\pandas\io\formats\csvs.py", line 172, in save
self._save()
File "c:\programdata\anaconda3\lib\site-packages\pandas\io\formats\csvs.py", line 288, in _save
self._save_chunk(start_i, end_i)
File "c:\programdata\anaconda3\lib\site-packages\pandas\io\formats\csvs.py", line 315, in _save_chunk
self.cols, self.writer)
File "pandas/_libs/writers.pyx", line 75, in pandas._libs.writers.write_csv_rows
File "c:\programdata\anaconda3\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u20b9' in position 2661: character maps to <undefined>
I am very new to the Stack Overflow platform, so please let me know if more information should be provided.

The error gives evidence that the code tries to encode the filename_str2.csv file in cp1252. From your stack trace:
...File "C:\local_path\spiders\Pduct_Scrape.py", line 430, in closed
search_df.to_csv('s3://my-bucket/folder_path/filename_str2.csv',......
File "c:\programdata\anaconda3\lib\encodings\cp1252.py", line 19, in encode
I do not know the reason, because you explicitly ask for utf-8 encoding. But as the codecs page in the Python Standard Library reference gives utf_8 as the canonical name for UTF-8 (note the underscore instead of the hyphen) and does not list utf-8 among the allowed aliases, I would first try utf_8. If it still uses cp1252, then you will have to give the exact versions of Python and pandas that you are using.
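If it still picks cp1252 no matter what you pass, a workaround is to take pandas out of the encoding path entirely: render the CSV into an in-memory buffer and upload the UTF-8 bytes yourself. A sketch, assuming s3fs is installed and your AWS credentials are configured (pandas itself goes through s3fs for s3:// URLs in this version range):
import io
import s3fs

# Render the CSV to a text buffer so pandas never opens the S3 handle
# (and never touches the platform-default codec) itself.
buf = io.StringIO()
search_df.to_csv(buf, line_terminator='^', sep='~', index=False)

# Upload explicitly UTF-8-encoded bytes.
fs = s3fs.S3FileSystem()
with fs.open('my-bucket/folder_path/filename_str2.csv', 'wb') as f:
    f.write(buf.getvalue().encode('utf-8'))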

Related

How do I determine which text in a corpus contains an error generated by the NLTK suite in Python?

I am trying to do some rudimentary corpus analysis with Python. I am getting the following error message(s):
Traceback (most recent call last):
File "<pyshell#28>", line 2, in <module>
print(len(poems.words(f)), f)
File "C:\Python38-32\lib\site-packages\nltk\corpus\reader\util.py", line 240, in __len__
for tok in self.iterate_from(self._toknum[-1]):
File "C:\Python38-32\lib\site-packages\nltk\corpus\reader\util.py", line 306, in iterate_from
tokens = self.read_block(self._stream)
File "C:\Python38-32\lib\site-packages\nltk\corpus\reader\plaintext.py", line 134, in _read_word_block
words.extend(self._word_tokenizer.tokenize(stream.readline()))
File "C:\Python38-32\lib\site-packages\nltk\data.py", line 1220, in readline
new_chars = self._read(readsize)
File "C:\Python38-32\lib\site-packages\nltk\data.py", line 1458, in _read
chars, bytes_decoded = self._incr_decode(bytes)
File "C:\Python38-32\lib\site-packages\nltk\data.py", line 1489, in _incr_decode
return self.decode(bytes, 'strict')
File "C:\Python38-32\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x97 in position 12: invalid start byte
My assumption is that there is a UTF error in one of the 202 text files I am looking at.
Is there any way of telling, from the error messages, which file or files have the problem?
Assuming that you know the file ids (the paths of your corpus files), you can open all of them with encoding="utf-8".
If you don't know the paths, and assuming that you are using the NLTK corpus loader, you can get them with:
poems.fileids()
After that, for every file in your list of files (for example fileids) you can try:
for file_ in fileids:
    try:
        with open(file_, encoding="utf-8") as f_i:
            f_i.readlines()
    except UnicodeDecodeError:
        print("You got problems with the file:", file_)
Anyway, your loader also has a parameter named "encoding" that you can use to set the correct encoding of your corpus. By default it is set to "utf-8".
More details here: nltk corpus loader
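Once you know the real encoding, you can point the loader at it. A sketch with a hypothetical corpus path; 'cp1252' is only a guess, based on byte 0x97 being an em dash in Windows-1252:
from nltk.corpus.reader import PlaintextCorpusReader

# Hypothetical root path; swap in your own corpus directory.
poems = PlaintextCorpusReader('path/to/poems', r'.*\.txt', encoding='cp1252')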

UnicodeError when trying to send a file with a Greek filename

I have created a set() populated with Greek names, and I then pass this set of values to a view function.
When I try to print this set, the Greek names appear as gibberish. I believe this has something to do with Apache mod_wsgi or Bottle not starting with UTF-8 support.
How can I tell Apache/Bottle to use LANG=el_GR.utf-8 so I can display Unicode properly, since I believe that's the issue here?
I looked for AddDefaultCharset utf-8 in httpd.conf, but it is already enabled, so I have to ask why the Greek characters appear as gibberish.
This happens when I try to download a file with a Greek filename.
Error: 500 Internal Server Error
Sorry, the requested URL 'http://superhost.gr/downloads/file' caused an error:
Internal Server Error
Exception:
UnicodeEncodeError('ascii', '/static/files/Î\x92ιογÏ\x81αÏ\x86ικÏ\x8c - Î\x9dίκοÏ\x82.docx', 14, 34, 'ordinal not in range(128)')
Traceback:
Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/bottle.py", line 862, in _handle
return route.call(**args)
File "/usr/lib/python3.6/site-packages/bottle.py", line 1740, in wrapper
rv = callback(*a, **ka)
File "/usr/lib/python3.6/site-packages/bottle.py", line 2690, in wrapper
return func(*a, **ka)
File "/home/nikos/public_html/downloads.py", line 148, in file
return static_file(filename, root='/static/files', download=True)
File "/usr/lib/python3.6/site-packages/bottle.py", line 2471, in static_file
if not os.path.exists(filename) or not os.path.isfile(filename):
File "/usr/lib64/python3.6/genericpath.py", line 19, in exists
os.stat(path)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 14-33: ordinal not in range(128)
The code used to download the file is:
return static_file(filename, root='/static/files', download=True)
My system is set to UTF-8:
[root@superhost public_html]# echo $LANG
en_US.UTF-8
Perhaps it is something with Apache, or is it a problem with Python 3?
You can't use Bottle's static_file() with a Unicode filename and download=True. See the accepted answer to this question for two alternative solutions to this limitation.
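The linked answer isn't quoted here, but the traceback hints at what is going on: the path in the exception ('/static/files/Î\x92ιογÏ...') is UTF-8 text that was decoded as Latin-1, which is exactly how WSGI (PEP 3333) hands over URL path segments. A hedged sketch of one possible direction, not the accepted answer's exact fix: re-decode the name, then set the download header yourself with an RFC 5987 percent-encoded filename instead of download=True:
from urllib.parse import quote
from bottle import route, static_file

@route('/downloads/<filename>')
def download(filename):
    # WSGI (PEP 3333) delivers URL path segments as Latin-1 text;
    # re-decode to recover the original UTF-8 Greek name.
    filename = filename.encode('latin-1').decode('utf-8')
    # Serve without download=True (which needs an ASCII-safe name) and
    # attach an RFC 5987 percent-encoded filename ourselves.
    resp = static_file(filename, root='/static/files')
    resp.headers['Content-Disposition'] = (
        "attachment; filename*=UTF-8''" + quote(filename))
    return resp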

tf.gfile.Glob gives me a UnicodeDecodeError - any way to fix this?

I was trying to get the list of names of the .txt files (written in Korean) in the specified directory with the code below:
dir_list = tf.gfile.Glob(engine.TXT_DIR+"/*.txt")
However, This one gives me the following error:
Traceback (most recent call last):
File "D:/Prj_mayDay/Prj_FrankenShtine/shakespear_reborn/main.py", line 108, in <module>
dir_list = tf.gfile.Glob(engine.TXT_DIR+"/*.txt")
File "D:\KimKanna's Class\python35\lib\site-packages\tensorflow\python\lib\io\file_io.py", line 326, in get_matching_files
compat.as_bytes(filename), status)
File "D:\KimKanna's Class\python35\lib\site-packages\tensorflow\python\lib\io\file_io.py", line 325, in <listcomp>
for matching_filename in pywrap_tensorflow.GetMatchingFiles(
File "D:\KimKanna's Class\python35\lib\site-packages\tensorflow\python\util\compat.py", line 106, in as_str_any
return as_str(value)
File "D:\KimKanna's Class\python35\lib\site-packages\tensorflow\python\util\compat.py", line 84, in as_text
return bytes_or_text.decode(encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbb in position 19: invalid start byte
Now, through some research, I found out the reason:
The error is because there is some non-ascii character in the dictionary and it can't be encoded/decoded
However, I do not see any way to apply the solution to my code. Or is there?
If there is alternative code for this, it should be applicable to both a cloud storage bucket and my personal hard drive, as the code above is.
I'm using Python 3 and TensorFlow 1.2.0-rc2.
After a few hours of fiddling around with my code I finally found the solution.
It turned out that one of the files inside the directory I specified had a Korean name. After I took it out of the directory, the problem was gone.
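For what it's worth, if the files only ever live on a local drive, Python's own glob handles non-ASCII names natively. A local-only sketch (my addition, it does not cover the cloud-bucket case; TXT_DIR stands in for engine.TXT_DIR):
import glob
import os

# Pure-Python globbing is happy with Korean (or any Unicode) filenames.
dir_list = glob.glob(os.path.join(TXT_DIR, '*.txt'))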

Umlauts in JSON files lead to errors in Python code created by ANTLR4

I've created Python modules from the JSON grammar on github / antlr4 with
antlr4 -Dlanguage=Python3 JSON.g4
I've written a main program "JSON2.py" following this guide: https://github.com/antlr/antlr4/blob/master/doc/python-target.md
and downloaded the example1.json also from github.
python3 ./JSON2.py example1.json # works perfectly, but
python3 ./JSON2.py bookmarks-2017-05-24.json # the bookmarks contain German Umlauts like "ü"
...
File "/home/xyz/lib/python3.5/site-packages/antlr4/FileStream.py", line 27, in readDataFrom
return codecs.decode(bytes, encoding, errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 227: ordinal not in range(128)
The offending line in JSON2.py is
input = FileStream(argv[1])
I've searched stackoverflow and tried this instead of using the above FileStream:
fp = codecs.open(argv[1], 'rb', 'utf-8')
try:
    input = fp.read()
finally:
    fp.close()

lexer = JSONLexer(input)
stream = CommonTokenStream(lexer)
parser = JSONParser(stream)
tree = parser.json() # This is line 39, mentioned in the error message
Execution of this program ends with an error message, even if the input file doesn't contain Umlauts:
python3 ./JSON2.py example1.json
Traceback (most recent call last):
File "./JSON2.py", line 46, in <module>
main(sys.argv)
File "./JSON2.py", line 39, in main
tree = parser.json()
File "/home/x/Entwicklung/antlr/links/JSONParser.py", line 108, in json
self.enterRule(localctx, 0, self.RULE_json)
File "/home/xyz/lib/python3.5/site-packages/antlr4/Parser.py", line 358, in enterRule
self._ctx.start = self._input.LT(1)
File "/home/xyz/lib/python3.5/site-packages/antlr4/CommonTokenStream.py", line 61, in LT
self.lazyInit()
File "/home/xyz/lib/python3.5/site-packages/antlr4/BufferedTokenStream.py", line 186, in lazyInit
self.setup()
File "/home/xyz/lib/python3.5/site-packages/antlr4/BufferedTokenStream.py", line 189, in setup
self.sync(0)
File "/home/xyz/lib/python3.5/site-packages/antlr4/BufferedTokenStream.py", line 111, in sync
fetched = self.fetch(n)
File "/home/xyz/lib/python3.5/site-packages/antlr4/BufferedTokenStream.py", line 123, in fetch
t = self.tokenSource.nextToken()
File "/home/xyz/lib/python3.5/site-packages/antlr4/Lexer.py", line 111, in nextToken
tokenStartMarker = self._input.mark()
AttributeError: 'str' object has no attribute 'mark'
The same file parses correctly with the Java target:
javac *.java
grun JSON json -gui bookmarks-2017-05-24.json
So the grammar itself is not the problem.
So finally the question: how should I process the input file in Python so that the lexer and parser can digest it?
Thanks in advance.
Make sure your input file is actually encoded as UTF-8. Many problems with character recognition by the lexer are caused by using other encodings. I just took a testbed application, added ë to the list of available characters for an IDENTIFIER, and it works again. UTF-8 is the key, and make sure your grammar also allows these characters where you want to accept them.
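A quick way to verify that assumption (a two-line check, nothing ANTLR-specific):
# Raises UnicodeDecodeError if the file is not valid UTF-8.
with open('bookmarks-2017-05-24.json', 'rb') as f:
    f.read().decode('utf-8')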
I solved it by passing the encoding info:
input = FileStream(sys.argv[1], encoding='utf8')
Without the encoding info, I get the same issue as yours:
Traceback (most recent call last):
File "test.py", line 20, in <module>
main()
File "test.py", line 9, in main
input = FileStream(sys.argv[1])
File ".../lib/python3.5/site-packages/antlr4/FileStream.py", line 20, in __init__
super().__init__(self.readDataFrom(fileName, encoding, errors))
File ".../lib/python3.5/site-packages/antlr4/FileStream.py", line 27, in readDataFrom
return codecs.decode(bytes, encoding, errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 1: ordinal not in range(128)
Where my input data is
[今明]天(台南|高雄)的?天氣如何
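Putting it together, a minimal driver might look like this (a sketch; JSONLexer and JSONParser are the modules ANTLR generated from the grammar in the question):
import sys
from antlr4 import FileStream, CommonTokenStream
from JSONLexer import JSONLexer
from JSONParser import JSONParser

def main(argv):
    # FileStream decodes as ASCII by default in this runtime version, so
    # ask for UTF-8 explicitly. Passing a plain str to the lexer instead
    # fails with "'str' object has no attribute 'mark'", because the
    # lexer expects an ANTLR InputStream, not a Python string.
    input_stream = FileStream(argv[1], encoding='utf8')
    lexer = JSONLexer(input_stream)
    stream = CommonTokenStream(lexer)
    parser = JSONParser(stream)
    tree = parser.json()
    print(tree.toStringTree(recog=parser))

if __name__ == '__main__':
    main(sys.argv)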

Spyder crashes at start: UnicodeDecodeError

During a Spyder session my Linux froze. After startup, I could not start Spyder; I got the following error instead:
(trusty)dreamer@localhost:~$ spyder
Traceback (most recent call last):
File "/home/dreamer/anaconda2/bin/spyder", line 2, in <module>
from spyderlib import start_app
File "/home/dreamer/anaconda2/lib/python2.7/site-packages/spyderlib/start_app.py", line 13, in <module>
from spyderlib.config import CONF
File "/home/dreamer/anaconda2/lib/python2.7/site-packages/spyderlib/config.py", line 736, in <module>
subfolder=SUBFOLDER, backup=True, raw_mode=True)
File "/home/dreamer/anaconda2/lib/python2.7/site-packages/spyderlib/userconfig.py", line 215, in __init__
self.load_from_ini()
File "/home/dreamer/anaconda2/lib/python2.7/site-packages/spyderlib/userconfig.py", line 260, in load_from_ini
self.readfp(configfile)
File "/home/dreamer/anaconda2/lib/python2.7/ConfigParser.py", line 324, in readfp
self._read(fp, filename)
File "/home/dreamer/anaconda2/lib/python2.7/ConfigParser.py", line 479, in _read
line = fp.readline()
File "/home/dreamer/anaconda2/lib/python2.7/codecs.py", line 690, in readline
return self.reader.readline(size)
File "/home/dreamer/anaconda2/lib/python2.7/codecs.py", line 545, in readline
data = self.read(readsize, firstline=True)
File "/home/dreamer/anaconda2/lib/python2.7/codecs.py", line 492, in read
newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xfe in position 2: invalid start byte
(trusty)dreamer@localhost:~$
I have found this solution, which sounds very much like my problem, but am curious if there are others, and whether anyone knows why this occurred.
My guess is that your spyder configuration file somehow got corrupted. This is the file spyder.ini, which resides in a directory like ~/.spyder2 (the exact name of the directory depends on the version you have installed). Maybe the encoding of the configuration file changed or a Unicode byte order mark was somehow introduced.
Possible solutions: use an editor to convert the file back to UTF-8; delete the configuration file; delete the whole directory containing the configuration file. The last two obviously delete any changes you made to the configuration.
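If you'd rather try to rescue the file than delete it, a rough recovery sketch (run with Python 3, and assuming the default ~/.spyder2 location):
import os
import shutil

path = os.path.expanduser('~/.spyder2/spyder.ini')
shutil.copy(path, path + '.bak')  # keep the corrupted original around

with open(path, 'rb') as f:
    raw = f.read()

# Replace undecodable bytes and strip any stray byte order mark, then
# rewrite the file as clean UTF-8.
text = raw.decode('utf-8', errors='replace').lstrip('\ufeff')
with open(path, 'w', encoding='utf-8') as f:
    f.write(text)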
