Parsing Spanish text and saving it in a DB - Python

I'm parsing a web page written in Spanish with Scrapy. The problem is that I can't save the text because of a wrong encoding.
This is the parse function:
from scrapy.selector import HtmlXPathSelector
from dbfpy import dbf  # imports assumed; the traceback below shows local copies of the dbfpy modules

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    # Ex: [u' Sustancia mineral, m\xe1s o menos dura y compacta, que no es terrosa ni de aspecto met\xe1lico.']
    text = hxs.select('//text()').extract()
    s = "".join(text)
    db = dbf.Dbf("test.dbf", new=True)
    db.addField(
        ("WORD", "C", 25),
        ("DATA", "M", 15000),  # Memo field
    )
    rec = db.newRecord()
    rec["WORD"] = "Stone"
    rec["DATA"] = s
    rec.store()
    db.close()
When I try to save it to the DB (a DBF file) I get an ASCII(128) error. I tried decoding/encoding with 'utf-8' and 'latin1', but with no success.
Edit:
To save the DB I'm using dbfpy. I added the DBF-saving code to the parse function above.
This is the error message:
Traceback (most recent call last):
File "/usr/lib/python2.6/dist-packages/twisted/internet/base.py", line 1179, in mainLoop
self.runUntilCurrent()
File "/usr/lib/python2.6/dist-packages/twisted/internet/base.py", line 778, in runUntilCurrent
call.func(*call.args, **call.kw)
File "/usr/lib/python2.6/dist-packages/twisted/internet/defer.py", line 280, in callback
self._startRunCallbacks(result)
File "/usr/lib/python2.6/dist-packages/twisted/internet/defer.py", line 354, in _startRunCallbacks
self._runCallbacks()
--- <exception caught here> ---
File "/usr/lib/python2.6/dist-packages/twisted/internet/defer.py", line 371, in _runCallbacks
self.result = callback(self.result, *args, **kw)
File "/home/katy/Dropbox/proyectos/rae/rae/spiders/rae_spider.py", line 54, in parse
rec.store()
File "/home/katy/Dropbox/proyectos/rae/rae/spiders/record.py", line 211, in store
self.dbf.append(self)
File "/home/katy/Dropbox/proyectos/rae/rae/spiders/dbf.py", line 214, in append
record._write()
File "/home/katy/Dropbox/proyectos/rae/rae/spiders/record.py", line 173, in _write
self.dbf.stream.write(self.toString())
File "/home/katy/Dropbox/proyectos/rae/rae/spiders/record.py", line 223, in toString
for (_def, _dat) in izip(self.dbf.header.fields, self.fieldData)
File "/home/katy/Dropbox/proyectos/rae/rae/spiders/fields.py", line 215, in encodeValue
return str(value)[:self.length].ljust(self.length)
exceptions.UnicodeEncodeError: 'ascii' codec can't encode character u'\xf1' in position 18: ordinal not in range(128)

Please don't forget that DBF files don't support Unicode at all, and I also suggest using Ethan Furman's dbf package (link in another answer).
You can use just table = dbf.Table('filename') to let it guess the real table type.
An example of usage with a non-cp437 encoding:
#!/usr/bin/env python
# coding: koi8-r
import dbf
text = 'текст в koi8-r'  # a koi8-r byte string ("text in koi8-r")
table = dbf.Table(':memory:', ['test M'], 128, False, False, True, False, 'dbf', 'koi8-r')
record = table.append()
record.test = text
Please note the following about version 0.87.14 and the 'dbf' table type:
With dbf package 0.87.14 you may get the exception 'TypeError: ord() expected a character...' at ".../site-packages/dbf/tables.py", line 686.
Only the 'dbf' table type is affected by this typo!
DISCLAIMER: I don't know the really correct values to use here, so don't blame me about incompatibility with this "fix".
You can replace the '' values with '\0' (at least) at lines 490 and 491 to make this example work.

Looks like http://sourceforge.net/projects/dbfpy is what you are talking about. Whatever gave you the idea that it could handle creating a VFP-compatible DBF file just by throwing Unicode at it? There are no docs worth the description AFAICT, the source simply doesn't contain .encode(, and there's no supported way of changing the default "signature" away from 0x03 (a very plain dBaseIII file).
If you encode your text fields in cp850 or cp437 before you throw them at the dbf it may work, but you'll need to check that you can open the resulting file using VFP and that all your accented Spanish characters are represented properly when you view the text fields on the screen.
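A sketch of that suggestion applied to the question's parse function (the 'replace' error handler is my addition, so characters with no cp850 equivalent don't blow up):
# encode the unicode text yourself before handing it to dbfpy,
# which otherwise falls back to str() and the ASCII codec
rec["WORD"] = "Stone"
rec["DATA"] = s.encode('cp850', 'replace')  # or 'cp437'
rec.store()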
If that doesn't work (and even if it does), you should have a look at Ethan Furman's dbf package ... it purports to know all about VFP and language driver IDs and codepages and suchlike.
Update: I see that you have a 15000-byte memo field defined. One of us is missing something ... the code that I'm reading says, in fields.py at about line 330, Note: memos aren't currenly [sic] completely supported, followed a bit later by two occurrences of raise NotImplementedError ... and back up at line 3: TODO: - make memos work. When I tried the code that you say you used (with plain ASCII data), it raised NotImplementedError from the rec.store(). Have you managed to get it to work at all?

Related

Indian rupee symbol UnicodeEncodeError while uploading file to S3 using pandas

I have scraped some data from a website for my assignment. It contains the Indian rupee character, "₹". When I save the data to a CSV file in UTF-8 on my local machine using pandas, it saves without a problem. For the same file I changed the delimiters and tried to save it to S3 using pandas, but it gave a "UnicodeEncodeError". I'm scraping the web page using the Scrapy framework.
Earlier I was trying to save the file in Latin-1, i.e. "ISO-8859-1", and then changed to "utf-8", but the same error occurs. I'm using Python 3.7 for development.
The code below, used for saving on the local machine, works:
result_df.to_csv(filename+str2+'.csv',index=False)
This code is used to save the file to S3:
search_df.to_csv('s3://my-bucket/folder_path/filename_str2.csv',encoding = 'utf-8',line_terminator='^',sep='~',index=False)
Below is the error while saving the file to S3:
2019-10-29 19:24:27 [scrapy.utils.signal] ERROR: Error caught on signal handler: <function Spider.close at 0x0000019CD3B1AA60>
Traceback (most recent call last):
File "c:\programdata\anaconda3\lib\site-packages\twisted\internet\defer.py", line 151, in maybeDeferred
result = f(*args, **kw)
File "c:\programdata\anaconda3\lib\site-packages\pydispatch\robustapply.py", line 55, in robustApply
return receiver(*arguments, **named)
File "c:\programdata\anaconda3\lib\site-packages\scrapy\spiders\__init__.py", line 94, in close
return closed(reason)
File "C:\local_path\spiders\Pduct_Scrape.py", line 430, in closed
search_df.to_csv('s3://my-bucket/folder_path/filename_str2.csv',encoding = 'utf-8',line_terminator='^',sep='~',index=False)
File "c:\programdata\anaconda3\lib\site-packages\pandas\core\generic.py", line 3020, in to_csv
formatter.save()
File "c:\programdata\anaconda3\lib\site-packages\pandas\io\formats\csvs.py", line 172, in save
self._save()
File "c:\programdata\anaconda3\lib\site-packages\pandas\io\formats\csvs.py", line 288, in _save
self._save_chunk(start_i, end_i)
File "c:\programdata\anaconda3\lib\site-packages\pandas\io\formats\csvs.py", line 315, in _save_chunk
self.cols, self.writer)
File "pandas/_libs/writers.pyx", line 75, in pandas._libs.writers.write_csv_rows
File "c:\programdata\anaconda3\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u20b9' in position 2661: character maps to <undefined>
I am very new to Stack Overflow, so please let me know if I should provide more information.
The error shows that the code tries to encode the filename_str2.csv file in cp1252. From your stack trace:
...File "C:\local_path\spiders\Pduct_Scrape.py", line 430, in closed
search_df.to_csv('s3://my-bucket/folder_path/ filename_str2.csv ',......
File "c:\programdata\anaconda3\lib\encodings\ cp1252.py ", line 19, in encode
I do not know the reason, because you explicitly ask for a utf-8 encoding. But since the codecs page in the Python Standard Library reference says that the canonical name for UTF-8 is utf_8 (notice the underscore instead of the minus sign) and does not list utf-8 among the allowed aliases, I would first try utf_8. If it still uses cp1252, then you will have to give the exact versions of Python and pandas that you are using.
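In code, that first attempt is the same call as above with only the codec name changed (the S3 path is the asker's placeholder):
search_df.to_csv('s3://my-bucket/folder_path/filename_str2.csv',
                 encoding='utf_8', line_terminator='^', sep='~', index=False)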

How to get rid of special characters while extracting data from the web?

I am extracting data from a website and it has an entry that contains a special character, i.e. Comfort Inn And Suites�? Blazing Stump. When I try to extract it, it throws an error:
Traceback (most recent call last):
File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 824, in runUntilCurrent
call.func(*call.args, **call.kw)
File "C:\Python27\lib\site-packages\twisted\internet\task.py", line 638, in _tick
taskObj._oneWorkUnit()
File "C:\Python27\lib\site-packages\twisted\internet\task.py", line 484, in _oneWorkUnit
result = next(self._iterator)
File "C:\Python27\lib\site-packages\scrapy\utils\defer.py", line 57, in <genexpr>
work = (callable(elem, *args, **named) for elem in iterable)
--- <exception caught here> ---
File "C:\Python27\lib\site-packages\scrapy\utils\defer.py", line 96, in iter_errback
yield it.next()
File "C:\Python27\lib\site-packages\scrapy\contrib\spidermiddleware\offsite.py", line 24, in process_spider_output
for x in result:
File "C:\Python27\lib\site-packages\scrapy\contrib\spidermiddleware\referer.py", line 14, in <genexpr>
return (_set_referer(r) for r in result or ())
File "C:\Python27\lib\site-packages\scrapy\contrib\spidermiddleware\urllength.py", line 32, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Python27\lib\site-packages\scrapy\contrib\spidermiddleware\depth.py", line 48, in <genexpr>
return (r for r in result or () if _filter(r))
File "E:\Scrapy projects\emedia\emedia\spiders\test_spider.py", line 46, in parse
print repr(business.select('a[@class="name"]/text()').extract()[0])
File "C:\Python27\lib\site-packages\scrapy\selector\lxmlsel.py", line 51, in select
result = self.xpathev(xpath)
File "xpath.pxi", line 318, in lxml.etree.XPathElementEvaluator.__call__ (src\lxml\lxml.etree.c:145954)
File "xpath.pxi", line 241, in lxml.etree._XPathEvaluatorBase._handle_result (src\lxml\lxml.etree.c:144987)
File "extensions.pxi", line 621, in lxml.etree._unwrapXPathObject (src\lxml\lxml.etree.c:139973)
File "extensions.pxi", line 655, in lxml.etree._createNodeSetResult (src\lxml\lxml.etree.c:140328)
File "extensions.pxi", line 676, in lxml.etree._unpackNodeSetEntry (src\lxml\lxml.etree.c:140524)
File "extensions.pxi", line 784, in lxml.etree._buildElementStringResult (src\lxml\lxml.etree.c:141695)
File "apihelpers.pxi", line 1373, in lxml.etree.funicode (src\lxml\lxml.etree.c:26255)
exceptions.UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 22: invalid continuation byte
I have tried a lot of different things after searching on the web, such as decode('utf-8') and unicodedata.normalize('NFC', business.select('a[@class="name"]/text()').extract()[0]), but the problem persists.
The source URL is "http://www.truelocal.com.au/find/hotels/97/" and on this page it is fourth entry which I am talking about.
You have a bad Mojibake in the original webpage, probably due to bad handling of Unicode in the data entry somewhere. The actual UTF-8 bytes in the source are C3 3F C2 A0 when expressed in hexadecimal.
I think it was once a U+00A0 NO-BREAK SPACE. Encoded to UTF-8 that becomes C2 A0; interpret that as Latin-1 instead and encode to UTF-8 again and it becomes C3 82 C2 A0; but 82 is a control character if interpreted as Latin-1 again, so it was substituted by a ? question mark, hex 3F when encoded.
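That chain of steps can be reproduced in a couple of lines of Python 2 (a demonstration only, not part of any fix):
original = u'\xa0'                                        # U+00A0 NO-BREAK SPACE
utf8_once = original.encode('utf-8')                      # '\xc2\xa0'
utf8_twice = utf8_once.decode('latin-1').encode('utf-8')  # '\xc3\x82\xc2\xa0'
# '\x82' is a control character in Latin-1, so a lossy round trip replaces it
# with '?', giving the observed bytes C3 3F C2 A0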
When you follow the link to the detail page for that venue you get a different Mojibake for the same name: Comfort Inn And SuitesÃ‚Â Blazing Stump, giving us the Unicode characters U+00C3, U+201A, U+00C2 plus an &nbsp; HTML entity, or the Unicode character U+00A0 again. Encode that as Windows Codepage 1252 (a superset of Latin-1) and you get C3 82 C2 A0 again.
You can only get rid of it by targeting these bytes directly in the source of the page:
pagesource.replace('\xc3?\xc2\xa0', '\xc2\xa0')
This 'repairs' the data by substituting the train wreck with the original intended UTF-8 bytes.
If you have a scrapy Response object, replace the body:
body = response.body.replace('\xc3?\xc2\xa0', '\xc2\xa0')
response = response.replace(body=body)
Don't use "replace" to fix Mojibake, fix the database and the code that caused the Mojibake.
But first you need to determine whether it is simply Mojibake or "double-encoding". With a SELECT col, HEX(col) ... determine whether a single character turned into 2-4 bytes (Mojibake) or 4-6 bytes (double encoding). Examples:
`é` (as utf8) should come back `C3A9`, but instead shows `C383C2A9`
The Emoji `👽` should come back `F09F91BD`, but comes back `C3B0C5B8E28098C2BD`
Review "Mojibake" and "double encoding" here
Then the database fixes are discussed here:
CHARACTER SET latin1, but have utf8 bytes in it; leave bytes alone while fixing charset:
First, let's assume you have this declaration for tbl.col:
col VARCHAR(111) CHARACTER SET latin1 NOT NULL
Then convert the column without changing the bytes via this 2-step ALTER:
ALTER TABLE tbl MODIFY COLUMN col VARBINARY(111) NOT NULL;
ALTER TABLE tbl MODIFY COLUMN col VARCHAR(111) CHARACTER SET utf8mb4 NOT NULL;
Note: If you start with TEXT, use BLOB as the intermediate definition. (This is the "2-step ALTER", as discussed elsewhere.) (Be sure to keep the other specifications the same - VARCHAR, NOT NULL, etc.)
CHARACTER SET utf8mb4 with double-encoding:
UPDATE tbl SET col = CONVERT(BINARY(CONVERT(col USING latin1)) USING utf8mb4);
CHARACTER SET latin1 with double-encoding: Do the 2-step ALTER, then fix the double-encoding.

UnicodeDecodeError with the sys.stdout inside traceback.print_exc()

I am getting a UnicodeDecodeError from traceback.print_exc(file=sys.stdout). I am using Python 3.4 and did not get the problem with Python 2.7.
Am I missing something here? How can I make sure that sys.stdout passes correctly encoded/decoded data to traceback.print_exc()?
My code looks something like this:
import sys
import traceback

try:
    ...  # do something which might throw an exception
except Exception as e:
    # do something
    traceback.print_exc(file=sys.stdout)  # Here I am getting the error
Error log:
traceback.print_exc(file=sys.stdout)
File "C:\Python34\lib\traceback.py", line 252, in print_exc
print_exception(*sys.exc_info(), limit=limit, file=file, chain=chain)
File "C:\Python34\lib\traceback.py", line 169, in print_exception
for line in _format_exception_iter(etype, value, tb, limit, chain):
File "C:\Python34\lib\traceback.py", line 153, in _format_exception_iter
yield from _format_list_iter(_extract_tb_iter(tb, limit=limit))
File "C:\Python34\lib\traceback.py", line 18, in _format_list_iter
for filename, lineno, name, line in extracted_list:
File "C:\Python34\lib\traceback.py", line 65, in _extract_tb_or_stack_iter
line = linecache.getline(filename, lineno, f.f_globals)
File "C:\Python34\lib\linecache.py", line 15, in getline
lines = getlines(filename, module_globals)
File "C:\Python34\lib\linecache.py", line 41, in getlines
return updatecache(filename, module_globals)
File "C:\Python34\lib\linecache.py", line 127, in updatecache
lines = fp.readlines()
File "C:\Python34\lib\codecs.py", line 313, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe7 in position 5213: invalid continuation byte
The traceback module wants to include source code lines with the traceback. Normally, a traceback consists only of pointers to source code, not the source code itself, as Python has been executing the compiled bytecode. In the bytecode are hints as to exactly what source code line the bytecode came from.
To then show the sourcecode, the actual source is read from disk, using the linecache module. This also means that Python has to determine the encoding for those files too. The default encoding for a Python 3 source file is UTF-8, but you can set a PEP 263 comment to let Python know if you are deviating from that.
Because source code is read after the code is already loaded and a traceback took place, it is possible that you changed the source code after starting the script, or that there was a bytecode cache file (in a __pycache__ subdirectory) that appeared to be fresh but no longer matched your source files.
Either way, when you started the script, Python was able to re-use a bytecode cache file or read the source code just fine and run your code. But when the traceback was being printed, at least one of the source code files was no longer decodable as UTF-8.
If you can reliably reproduce the traceback (so start the Python script again without encoding problems but printing the traceback fails), it is most likely a stale bytecode file somewhere, one that could even hold pointers to a filename that now contains nothing but binary data, not plain source.
If you know how to use the pdb module, add a pdb.set_trace() call before the traceback.print_exc() call and trace what filenames are being loaded from by the linecache module.
Otherwise edit your C:\Python34\lib\traceback.py file and insert a print('Loading {} from the linecache'.format(filename)) statement just before the linecache.checkcache(filename) line in the _extract_tb_or_stack_iter function.
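If you would rather not edit the standard library, a small diagnostic sketch along the same lines (run it inside the except block instead of traceback.print_exc(); the function name is made up) walks the traceback and asks linecache for each source file, so the file that is not valid UTF-8 identifies itself:
import linecache
import sys

def report_undecodable_sources():
    tb = sys.exc_info()[2]
    while tb is not None:
        filename = tb.tb_frame.f_code.co_filename
        try:
            linecache.checkcache(filename)
            linecache.getlines(filename, tb.tb_frame.f_globals)
        except UnicodeDecodeError as exc:
            print('Cannot decode {}: {}'.format(filename, exc))
        tb = tb.tb_next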

Python encoding for Swedish characters

I'm using Scrapy to extract data from a website. I'm saving the data to a MySQL database using MySQLdb. The script works for English sites, but when I try it on a Swedish site I get:
self.db.query(insertion_query)
exceptions.UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 156:
ordinal not in range(128)
I have put the following line at the top of each file involved in the scraping process to indicate the use of international characters:
# -*- coding: utf-8 -*-
But I still get an error. What else do I need for Python to accept non-English characters? Here's the full stack trace:
Traceback (most recent call last):
File "C:\Python27\lib\site-packages\scrapy-0.14.3-py2.7-win32.egg\scrapy\middleware.py",
line 60, in _process_
chain
return process_chain(self.methods[methodname], obj, *args)
File "C:\Python27\lib\site-packages\scrapy-0.14.3-py2.7-win32.egg\scrapy\utils\defer.py",
line 65, in process_
chain
d.callback(input)
File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 368, in callback
self._startRunCallbacks(result)
File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 464, in
_startRunCallbacks
self._runCallbacks()
--- <exception caught here> ---
File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 551, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "C:\Python27\tco\tco\pipelines.py", line 64, in process_item
self.db.query(insertion_query)
exceptions.UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 156:
ordinal not in range(128)
This Unicode issue looks confusing at first, but it's actually pretty easy.
# -*- coding: utf-8 -*-
If you write this at the top of your source code, it means Python is going to treat your code as UTF-8, but not incoming or outgoing data.
You obviously want to write some data to your database, and this error happens when one of your modules encodes your UTF-8 string (which I guess is Swedish) to ASCII.
That means either MySQL was set to ASCII or your MySQL DB driver is set to ASCII.
So I suggest you go check your MySQL settings or driver settings.
db = MySQLdb.connect(host=database_host, user=user, passwd=password, db=database_name, charset="utf8", use_unicode=True)
This will make your MySQL driver connect to the MySQL server using UTF-8.
This blog post contains a hint: When creating the connection (either using PooledDB or MySQLdb.connect), specify the options charset = "utf8", use_unicode = True
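For completeness, a minimal pipeline sketch using such a connection (the host, credentials, table and item field are made-up placeholders, not the asker's actual code):
import MySQLdb

class MySQLStorePipeline(object):
    def __init__(self):
        # charset/use_unicode are the important part, per the answer above
        self.db = MySQLdb.connect(host='localhost', user='user', passwd='secret',
                                  db='scraped', charset='utf8', use_unicode=True)

    def process_item(self, item, spider):
        cursor = self.db.cursor()
        # pass unicode values as query parameters; the driver encodes them as utf8
        cursor.execute("INSERT INTO pages (title) VALUES (%s)", (item['title'],))
        self.db.commit()
        return item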

Beautiful Soup raises UnicodeEncodeError "ordinal not in range(128)"

I am trying to parse arbitrary documents downloaded from the wild web, and yes, I have no control over their content.
Since Beautiful Soup won't choke if you give it bad markup... I wonder why it gives me these hiccups when part of the doc is malformed, and whether there is a way to make it resume at the next readable portion of the doc, regardless of this error.
The line where the error occurred is the 3rd one:
from BeautifulSoup import BeautifulSoup as doc_parser
reader = open(options.input_file, "rb")
doc = doc_parser(reader)
The full CLI output is:
Traceback (most recent call last):
File "./grablinks", line 101, in <module>
sys.exit(main())
File "./grablinks", line 88, in main
links = grab_links(options)
File "./grablinks", line 36, in grab_links
doc = doc_parser(reader)
File "/usr/local/lib/python2.7/dist-packages/BeautifulSoup.py", line 1519, in __init__
BeautifulStoneSoup.__init__(self, *args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/BeautifulSoup.py", line 1144, in __init__
self._feed(isHTML=isHTML)
File "/usr/local/lib/python2.7/dist-packages/BeautifulSoup.py", line 1186, in _feed
SGMLParser.feed(self, markup)
File "/usr/lib/python2.7/sgmllib.py", line 104, in feed
self.goahead(0)
File "/usr/lib/python2.7/sgmllib.py", line 143, in goahead
k = self.parse_endtag(i)
File "/usr/lib/python2.7/sgmllib.py", line 320, in parse_endtag
self.finish_endtag(tag)
File "/usr/lib/python2.7/sgmllib.py", line 358, in finish_endtag
method = getattr(self, 'end_' + tag)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 15-16: ordinal not in range(128)
Yeah, it will choke if you have elements with non-ASCII names (<café>). And that's not even ‘bad markup’, for XML...
It's a bug in sgmllib which BeautifulSoup is using: it tries to find custom methods with the same names as tags, but in Python 2 method names are byte strings so even looking for a method with a non-ASCII character in, which will never be present, fails.
You can hack a fix into sgmllib by changing lines 259 and 371 from except AttributeError: to except AttributeError, UnicodeError: but that's not really a good fix. It's not trivial to override the rest of the method either.
What is it you're trying to parse? BeautifulStoneSoup was always of questionable usefulness really—XML doesn't have the wealth of ghastly parser hacks that HTML does, so in general broken XML isn't XML. Consequently you should generally use a plain old XML parser (eg use a standard DOM or etree). For parsing general HTML, html5lib is your better option these days.
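As a sketch of that suggestion (assuming html5lib and lxml are installed), the question's three lines could become:
import html5lib

with open(options.input_file, "rb") as reader:
    doc = html5lib.parse(reader, treebuilder="lxml")  # tolerant of broken markup
# doc is an lxml tree; links can then be pulled out with XPath, e.g.
# doc.xpath('//h:a/@href', namespaces={'h': 'http://www.w3.org/1999/xhtml'})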
This happens if there are non-ASCII chars in the input in Python versions before 3.0.
If you try to use str(...) on a string containing chars with a value > 128 (ANSI & Unicode), this exception is raised.
Here, the error possibly occurs because getattr tries to use str on a unicode string - it "thinks" it can safely do this because in Python versions prior to 3.0 identifiers must not contain Unicode characters.
Check your HTML for Unicode characters. Try to replace/encode these, and if it still does not work, tell us.
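For what it's worth, the failure mode described above can be reproduced in a few lines of Python 2 (purely illustrative; the class and attribute name are made up):
class Dummy(object):
    pass

getattr(Dummy(), u'end_caf\xe9')
# UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' ...
# getattr converts the unicode attribute name to a byte string using ASCII first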
