Python encoding for Swedish characters

I'm using scrapy to extract data from a web site. I'm saving the data to a MySQL database using MySQLdb. The script works for English sites, but when I try it on a Swedish site I get:
self.db.query(insertion_query)
exceptions.UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 156: ordinal not in range(128)
I have put the following line at the top of each file involved in the scraping process to indicate the use of international characters:
# -*- coding: utf-8 -*-
But I still get the error. What else do I need for Python to accept non-English characters? Here's the full stack trace:
Traceback (most recent call last):
File "C:\Python27\lib\site-packages\scrapy-0.14.3-py2.7-win32.egg\scrapy\middleware.py", line 60, in _process_chain
return process_chain(self.methods[methodname], obj, *args)
File "C:\Python27\lib\site-packages\scrapy-0.14.3-py2.7-win32.egg\scrapy\utils\defer.py", line 65, in process_chain
d.callback(input)
File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 368, in callback
self._startRunCallbacks(result)
File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 464, in _startRunCallbacks
self._runCallbacks()
--- <exception caught here> ---
File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 551, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "C:\Python27\tco\tco\pipelines.py", line 64, in process_item
self.db.query(insertion_query)
exceptions.UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 156: ordinal not in range(128)

This unicode issue looks confusing at first, but it's actually pretty simple.
# -*- coding: utf-8 -*-
Putting this at the top of your source file means Python will treat the source code itself as UTF-8, but it says nothing about incoming or outgoing data.
You want to write data to your database, and this error happens when some module encodes your unicode string (Swedish text, in this case) to ASCII.
That means either MySQL is set to ASCII or your MySQL driver is set to ASCII, so go check your MySQL settings or your driver settings.
db = MySQLdb.connect(host=database_host, user=user, passwd=password, db=database_name, charset="utf8", use_unicode=True)
This makes the MySQL driver connect to the MySQL server using UTF-8.

This blog post contains a hint: When creating the connection (either using PooledDB or MySQLdb.connect), specify the options charset = "utf8", use_unicode = True
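Putting those two hints together, here is a minimal sketch of a Scrapy pipeline that connects with UTF-8 and lets the driver do the encoding via a parameterized query (the table and column names are hypothetical, not from the original question):

# -*- coding: utf-8 -*-
import MySQLdb

class MySQLStorePipeline(object):
    def open_spider(self, spider):
        # charset/use_unicode make the driver talk UTF-8 to the server.
        self.db = MySQLdb.connect(host='localhost', user='scrapy',
                                  passwd='secret', db='scraped',
                                  charset='utf8', use_unicode=True)

    def process_item(self, item, spider):
        cursor = self.db.cursor()
        # Parameterized query: the driver encodes the unicode values
        # itself, so no manual .encode() calls are needed.
        cursor.execute("INSERT INTO pages (title, body) VALUES (%s, %s)",
                       (item['title'], item['body']))
        self.db.commit()
        return item

    def close_spider(self, spider):
        self.db.close()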

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 1: invalid continuation byte in Django

This is very possibly a duplicate, since I have seen a number of similar questions, but I can't seem to find a solution.
My problem is as stated in the description. I am working on a Django project on Python 3.8.5. My professor wanted me to program a website and use PostgreSQL as the database. I did, but I always got the following error when I ran python manage.py runserver:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 1: invalid continuation byte
I uninstalled PostgreSQL and tried to start an older project of mine that uses sqlite3, but it did not work and threw the same error. The stack trace follows:
Exception in thread django-main-thread:
Traceback (most recent call last):
File "C:\Users\Startklar\AppData\Local\Programs\Python\Python38\lib\threading.py", line 932, in _bootstrap_inner
self.run()
File "C:\Users\Startklar\AppData\Local\Programs\Python\Python38\lib\threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "C:\Users\Startklar\PycharmProject\Modul133_Movie\venv\lib\site-packages\django\utils\autoreload.py", line 53, in wrapper
fn(*args, **kwargs)
File "C:\Users\Startklar\PycharmProject\Modul133_Movie\venv\lib\site-packages\django\core\management\commands\runserver.py", line 139, in inner_run
run(self.addr, int(self.port), handler,
File "C:\Users\Startklar\PycharmProject\Modul133_Movie\venv\lib\site-packages\django\core\servers\basehttp.py", line 206, in run
httpd = httpd_cls(server_address, WSGIRequestHandler, ipv6=ipv6)
File "C:\Users\Startklar\PycharmProject\Modul133_Movie\venv\lib\site-packages\django\core\servers\basehttp.py", line 67, in __init__
super().__init__(*args, **kwargs)
File "C:\Users\Startklar\AppData\Local\Programs\Python\Python38\lib\socketserver.py", line 452, in __init__
self.server_bind()
File "C:\Users\Startklar\AppData\Local\Programs\Python\Python38\lib\wsgiref\simple_server.py", line 50, in server_bind
HTTPServer.server_bind(self)
File "C:\Users\Startklar\AppData\Local\Programs\Python\Python38\lib\http\server.py", line 140, in server_bind
self.server_name = socket.getfqdn(host)
File "C:\Users\Startklar\AppData\Local\Programs\Python\Python38\lib\socket.py", line 756, in getfqdn
hostname, aliases, ipaddrs = gethostbyaddr(name)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 1: invalid continuation byte
I am running this entire thing locally on Windows 10.
I have no clue how to solve this issue. I am really new to Django, and even my professor can't seem to help me. I have looked through a variety of questions here on Stack Overflow and tried a number of things, but I can't find a solution at all.
Any help or advice is welcome.
Your database table or columns may be using a different encoding. Try running the following query (the syntax below is MySQL's):
ALTER TABLE your_table_name CONVERT TO CHARACTER SET utf8;
This will change the character set of your table to UTF-8.
I also found an issue similar to yours:
Unicodedecodeerror with runserver
It may have to do with non-ASCII characters in the hostname or computer name.
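Since the traceback dies inside socket.getfqdn(), a quick way to test that theory (a hedged sketch using only the standard library) is to check whether your machine name contains non-ASCII characters:

import socket

# socket.getfqdn() is what Django's runserver calls right before the crash.
name = socket.gethostname()
print(repr(name))
try:
    name.encode('ascii')
    print('hostname is plain ASCII - the problem is probably elsewhere')
except UnicodeEncodeError:
    print('hostname contains non-ASCII characters - try renaming the computer')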

Indian rupee symbol UnicodeEncodeError while uploading file to S3 using pandas

I have scraped some data from a website for my assignment. It contains the Indian rupee character "₹". When I save the data to a CSV file in UTF-8 on my local machine using pandas, it saves effortlessly. For the same file, I changed the delimiters and tried to save it to S3 using pandas, but it raised a "UnicodeEncodeError". I'm scraping the web page using the Scrapy framework.
Earlier I was trying to save the file in Latin-1, i.e. "ISO-8859-1", encoding, and I have since changed to "utf-8", but the same error occurs. I'm using Python 3.7 for the development.
The code below, used for saving on the local machine, works:
result_df.to_csv(filename+str2+'.csv',index=False)
The code below is used to save the file to S3:
search_df.to_csv('s3://my-bucket/folder_path/filename_str2.csv',encoding = 'utf-8',line_terminator='^',sep='~',index=False)
Below is the error while saving the file to S3:
2019-10-29 19:24:27 [scrapy.utils.signal] ERROR: Error caught on signal handler: <function Spider.close at 0x0000019CD3B1AA60>
Traceback (most recent call last):
File "c:\programdata\anaconda3\lib\site-packages\twisted\internet\defer.py", line 151, in maybeDeferred
result = f(*args, **kw)
File "c:\programdata\anaconda3\lib\site-packages\pydispatch\robustapply.py", line 55, in robustApply
return receiver(*arguments, **named)
File "c:\programdata\anaconda3\lib\site-packages\scrapy\spiders\__init__.py", line 94, in close
return closed(reason)
File "C:\local_path\spiders\Pduct_Scrape.py", line 430, in closed
search_df.to_csv('s3://my-bucket/folder_path/filename_str2.csv',encoding = 'utf-8',line_terminator='^',sep='~',index=False)
File "c:\programdata\anaconda3\lib\site-packages\pandas\core\generic.py", line 3020, in to_csv
formatter.save()
File "c:\programdata\anaconda3\lib\site-packages\pandas\io\formats\csvs.py", line 172, in save
self._save()
File "c:\programdata\anaconda3\lib\site-packages\pandas\io\formats\csvs.py", line 288, in _save
self._save_chunk(start_i, end_i)
File "c:\programdata\anaconda3\lib\site-packages\pandas\io\formats\csvs.py", line 315, in _save_chunk
self.cols, self.writer)
File "pandas/_libs/writers.pyx", line 75, in pandas._libs.writers.write_csv_rows
File "c:\programdata\anaconda3\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u20b9' in position 2661: character maps to <undefined>
I am very new to Stack Overflow, so please let me know if more information is needed.
The error gives evidence that the code tries to encode the filename_str2.csv file in cp1252. From your stack trace:
...File "C:\local_path\spiders\Pduct_Scrape.py", line 430, in closed
search_df.to_csv('s3://my-bucket/folder_path/filename_str2.csv',......
File "c:\programdata\anaconda3\lib\encodings\cp1252.py", line 19, in encode
I do not know the reason, because you explicitly ask for utf-8 encoding. But since the codecs page in the Python Standard Library reference says that the canonical name for UTF-8 is utf_8 (notice the underscore instead of the hyphen) and does not list utf-8 among the allowed aliases, I would first try utf_8. If it still uses cp1252, then please give the exact versions of Python and pandas you are using.
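If pandas keeps picking cp1252 for the S3 file handle no matter what, one workaround (a sketch only, assuming boto3 is installed; the bucket and key names are taken from the question) is to render the CSV into an in-memory buffer and upload the UTF-8 bytes yourself:

import io
import boto3

# Render the CSV entirely in memory so we control the encoding ourselves.
csv_buffer = io.StringIO()
search_df.to_csv(csv_buffer, line_terminator='^', sep='~', index=False)

# Upload UTF-8 encoded bytes directly; no implicit cp1252 handle involved.
s3 = boto3.client('s3')
s3.put_object(Bucket='my-bucket',
              Key='folder_path/filename_str2.csv',
              Body=csv_buffer.getvalue().encode('utf-8'))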

Why am I getting a UnicodeDecodeError in Python's JSON encoding?

I am using Solr 3.3 to index stuff from my database. I compose the JSON content in Python. I manage to upload 2126 records, which add up to 523246 chars (approx 511 KB). But when I try 2127 records, Python gives me the error:
Traceback (most recent call last):
File "<console>", line 1, in <module>
File "D:\Technovia\db_indexer\solr_update.py", line 69, in upload_service_details
request_string.append(param_list)
File "C:\Python27\lib\json\__init__.py", line 238, in dumps
**kw).encode(obj)
File "C:\Python27\lib\json\encoder.py", line 203, in encode
chunks = list(chunks)
File "C:\Python27\lib\json\encoder.py", line 425, in _iterencode
for chunk in _iterencode_list(o, _current_indent_level):
File "C:\Python27\lib\json\encoder.py", line 326, in _iterencode_list
for chunk in chunks:
File "C:\Python27\lib\json\encoder.py", line 384, in _iterencode_dict
yield _encoder(value)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x96 in position 68: invalid start byte
Ouch. Is 512 KB worth of bytes a fundamental limit? Is there any high-volume alternative to the existing JSON module?
Update: it's the fault of some of the data, as trying to encode biz_list[2126:] causes an immediate error. Here is the offending piece:
'2nd Floor, Gurumadhavendra Towers,\nKadavanthra Road, Kaloor,\nCochin \x96 682 017'
How can I configure it so that it is encodable into JSON?
Update 2: The answer worked as expected: the data came from a MySQL table encoded in "latin-1-swedish-ci". I saw significance in a random number. Sorry for spontaneously channeling the spirit of a headline writer when diagnosing the fault.
Simple: just don't use utf-8 encoding if your data is not in utf-8.
>>> json.loads('["\x96"]')
....
UnicodeDecodeError: 'utf8' codec can't decode byte 0x96 in position 0: invalid start byte
>>> json.loads('["\x96"]', encoding="latin-1")
[u'\x96']
json.loads
If s is a str instance and is encoded with an ASCII based
encoding other than utf-8 (e.g. latin-1) then an appropriate
encoding name must be specified. Encodings that are not ASCII
based (such as UCS-2) are not allowed and should be decoded to
unicode first.
Edit: To get the proper unicode value of "\x96", use "cp1252", as Eli Collins mentioned:
>>> json.loads('["\x96"]', encoding="cp1252")
[u'\u2013']
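Equivalently (a small Python 2 sketch), you can decode the byte string to unicode yourself before handing it to json, which is what the docs mean by "should be decoded to unicode first":

>>> import json
>>> json.loads('["\x96"]'.decode('cp1252'))  # bytes -> unicode first
[u'\u2013']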

UnicodeDecodeError is raised when getting a cookie in Google App Engine

I have a GAE project in Python where I am setting a cookie in one of my RequestHandlers with this code:
self.response.headers['Set-Cookie'] = 'app=ABCD; expires=Fri, 31-Dec-2020 23:59:59 GMT'
I checked in Chrome and I can see the cookie listed, so it appears to be working.
Then later in another RequestHandler, I get the cookie to check it:
appCookie = self.request.cookies['app']
This line gives the following error when executed:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1962: ordinal not in range(128)
It seems that it is trying to decode the incoming cookie info using an ASCII codec rather than UTF-8.
How do I force Python to use UTF-8 to decode this?
Are there any other Unicode-related gotchas that I need to be aware of as a newbie to Python and Google App Engine (but an experienced programmer in other languages)?
Here is the full Traceback:
Traceback (most recent call last):
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/dev_appserver.py", line 4144, in _HandleRequest
self._Dispatch(dispatcher, self.rfile, outfile, env_dict)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/dev_appserver.py", line 4049, in _Dispatch
base_env_dict=env_dict)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/dev_appserver.py", line 616, in Dispatch
base_env_dict=base_env_dict)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/dev_appserver.py", line 3120, in Dispatch
self._module_dict)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/dev_appserver.py", line 3024, in ExecuteCGI
reset_modules = exec_script(handler_path, cgi_path, hook)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/dev_appserver.py", line 2887, in ExecuteOrImportScript
exec module_code in script_module.__dict__
File "/Users/ken/hgdev/juicekit/main.py", line 402, in <module>
main()
File "/Users/ken/hgdev/juicekit/main.py", line 399, in main
run_wsgi_app(application)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/webapp/util.py", line 98, in run_wsgi_app
run_bare_wsgi_app(add_wsgi_middleware(application))
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/webapp/util.py", line 116, in run_bare_wsgi_app
result = application(env, _start_response)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/webapp/__init__.py", line 721, in __call__
response.wsgi_write(start_response)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/webapp/__init__.py", line 296, in wsgi_write
body = self.out.getvalue()
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/StringIO.py", line 270, in getvalue
self.buf += ''.join(self.buflist)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1962: ordinal not in range(128)
You're looking to use the decode function, somewhat like this (credit: agf):
self.request.cookies['app'].decode('utf-8')
From official python documentation (plus a couple added details):
Python's 8-bit strings have a .decode([encoding], [errors]) method that interprets the string using the given encoding. The following example shows the string as it goes to unicode and then back to an 8-bit string:
>>> u = unichr(40960) + u'abcd' + unichr(1972) # Assemble a string
>>> type(u), u # Examine
(<type 'unicode'>, u'\ua000abcd\u07b4')
>>> utf8_version = u.encode('utf-8') # Encode as UTF-8
>>> type(utf8_version), utf8_version # Examine
(<type 'str'>, '\xea\x80\x80abcd\xde\xb4')
>>> u2 = utf8_version.decode('utf-8') # Decode using UTF-8
>>> u == u2 # The two strings match
True
First, encode any unicode value you set in the cookies. You also need to quote it in case it could break the header:
import urllib
# This is the value we want to set.
initial_value = u'äëïöü'
# WebOb version that comes with SDK doesn't quote cookie values
# in the Response, neither webapp.Response. So we have to do it.
quoted_value = urllib.quote(initial_value.encode('utf-8'))
rsp = webapp.Response()
rsp.headers['Set-Cookie'] = 'app=%s; Path=/' % quoted_value
Now let's read the value. To test it, create a fake Request to test the cookie we have set. This code was extracted from a real unittest:
cookie = rsp.headers.get('Set-Cookie')
req = webapp.Request.blank('/', headers=[('Cookie', cookie)])
# The stored value is the same quoted value from before.
# Notice that here we use .str_cookies, not .cookies.
stored_value = req.str_cookies.get('app')
self.assertEqual(stored_value, quoted_value)
Our value is still encoded and quoted. We must do the reverse to get the initial one:
# And we can get the initial value unquoting and decoding.
final_value = urllib.unquote(stored_value).decode('utf-8')
self.assertEqual(final_value, initial_value)
If you can, consider using webapp2. webob.Response does all the hard work of quoting and setting cookies, and you can set unicode values directly. See a summary of these issues here.
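For illustration, a minimal webapp2 handler might look like this (a hedged sketch; set_cookie is webapp2's API, but the handler name and cookie value are made up):

import webapp2

class CookieHandler(webapp2.RequestHandler):
    def get(self):
        # webapp2 quotes and encodes the unicode cookie value for us.
        self.response.set_cookie('app', u'äëïöü', path='/')
        # Reading it back yields a decoded unicode value.
        value = self.request.cookies.get('app')
        self.response.write(value)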

Parsing Spanish text and saving it in a db

I'm parsing a web page written in Spanish with scrapy. The problem is that I can't save the text because of the wrong encoding.
This is the parse function:
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    # Ex: [u' Sustancia mineral, m\xe1s o menos dura y compacta, que no es terrosa ni de aspecto met\xe1lico.']
    text = hxs.select('//text()').extract()
    s = "".join(text)
    db = dbf.Dbf("test.dbf", new=True)
    db.addField(
        ("WORD", "C", 25),
        ("DATA", "M", 15000),  # Memo field
    )
    rec = db.newRecord()
    rec["WORD"] = "Stone"
    rec["DATA"] = s
    rec.store()
    db.close()
When I try to save it to a db (a DBF db) I get an 'ascii' codec error (ordinal not in range(128)). I tried decoding/encoding using 'utf-8' and 'latin1' but with no success.
Edit:
To save the db I'm using dbfpy. I added the dbf saving code in the parse function above.
This is the error message:
Traceback (most recent call last):
File "/usr/lib/python2.6/dist-packages/twisted/internet/base.py", line 1179, in mainLoop
self.runUntilCurrent()
File "/usr/lib/python2.6/dist-packages/twisted/internet/base.py", line 778, in runUntilCurrent
call.func(*call.args, **call.kw)
File "/usr/lib/python2.6/dist-packages/twisted/internet/defer.py", line 280, in callback
self._startRunCallbacks(result)
File "/usr/lib/python2.6/dist-packages/twisted/internet/defer.py", line 354, in _startRunCallbacks
self._runCallbacks()
--- <exception caught here> ---
File "/usr/lib/python2.6/dist-packages/twisted/internet/defer.py", line 371, in _runCallbacks
self.result = callback(self.result, *args, **kw)
File "/home/katy/Dropbox/proyectos/rae/rae/spiders/rae_spider.py", line 54, in parse
rec.store()
File "/home/katy/Dropbox/proyectos/rae/rae/spiders/record.py", line 211, in store
self.dbf.append(self)
File "/home/katy/Dropbox/proyectos/rae/rae/spiders/dbf.py", line 214, in append
record._write()
File "/home/katy/Dropbox/proyectos/rae/rae/spiders/record.py", line 173, in _write
self.dbf.stream.write(self.toString())
File "/home/katy/Dropbox/proyectos/rae/rae/spiders/record.py", line 223, in toString
for (_def, _dat) in izip(self.dbf.header.fields, self.fieldData)
File "/home/katy/Dropbox/proyectos/rae/rae/spiders/fields.py", line 215, in encodeValue
return str(value)[:self.length].ljust(self.length)
exceptions.UnicodeEncodeError: 'ascii' codec can't encode character u'\xf1' in position 18: ordinal not in range(128)
Please remember that DBF files don't support unicode at all, and I also suggest using Ethan Furman's dbf package (linked in another answer).
You can use just table = dbf.Table('filename') to let it guess the real table type.
An example of usage with a non-cp437 encoding:
#!/usr/bin/env python
# coding: koi8-r
import dbf

text = 'текст в koi8-r'  # Russian sample text: "text in koi8-r"
table = dbf.Table(':memory:', ['test M'], 128, False, False, True, False, 'dbf', 'koi8-r')
record = table.append()
record.test = text
Please note the following about version 0.87.14 and the 'dbf' table type:
With dbf package 0.87.14 you can hit the exception 'TypeError: ord() expected character...' at ".../site-packages/dbf/tables.py", line 686.
Only the 'dbf' table type is affected by this typo!
DISCLAIMER: I don't know the truly correct values to use here, so don't blame me for any incompatibility caused by this "fix".
You can replace the values '' with '\0' (at least) at lines 490 and 491 to make this test work.
Looks like http://sourceforge.net/projects/dbfpy is what you are talking about. Whatever gave you the idea that it could handle creating a VFP-compatible DBF file just by throwing Unicode at it? There are no docs worth the description AFAICT, the source simply doesn't contain .encode(, and there's no supported way of changing the default "signature" away from 0x03 (a very plain dBaseIII file).
If you encode your text fields in cp850 or cp437 before you throw them at the dbf it may work, but you'll need to check that you can open the resulting file using VFP and that all your accented Spanish characters are represented properly when you view the text fields on the screen.
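For example, a small sketch of that encode-before-store idea, applied to the parse function from the question (cp850 with 'replace' so unmappable characters don't crash the write; whether VFP renders them correctly still needs checking):

# Encode the unicode text to cp850 bytes before handing it to dbfpy;
# 'replace' substitutes '?' for anything cp850 cannot represent.
rec["DATA"] = s.encode('cp850', 'replace')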
If that doesn't work (and even if it does), you should have a look at Ethan Furman's dbf package ... it purports to know all about VFP and language driver IDs and codepages and suchlike.
Update: I see that you have a 15000-byte memo field defined. One of us is missing something ... the code that I'm reading says in fields.py, about line 330, Note: memos aren't currenly [sic] completely supported, followed a bit later by two occurrences of raise NotImplementedError ... and back up at line 3: TODO: - make memos work. When I tried the code that you say you used (with plain ASCII data), it raised NotImplementedError from rec.store(). Have you managed to get it to work at all?
