How to get rid of special characters while extracting data from web?


I am extracting data from a website, and one entry contains a special character, i.e. Comfort Inn And Suites�? Blazing Stump. When I try to extract it, an error is thrown:
Traceback (most recent call last):
File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 824, in runUntilCurrent
call.func(*call.args, **call.kw)
File "C:\Python27\lib\site-packages\twisted\internet\task.py", line 638, in _tick
taskObj._oneWorkUnit()
File "C:\Python27\lib\site-packages\twisted\internet\task.py", line 484, in _oneWorkUnit
result = next(self._iterator)
File "C:\Python27\lib\site-packages\scrapy\utils\defer.py", line 57, in <genexpr>
work = (callable(elem, *args, **named) for elem in iterable)
--- <exception caught here> ---
File "C:\Python27\lib\site-packages\scrapy\utils\defer.py", line 96, in iter_errback
yield it.next()
File "C:\Python27\lib\site-packages\scrapy\contrib\spidermiddleware\offsite.py", line 24, in process_spider_output
for x in result:
File "C:\Python27\lib\site-packages\scrapy\contrib\spidermiddleware\referer.py", line 14, in <genexpr>
return (_set_referer(r) for r in result or ())
File "C:\Python27\lib\site-packages\scrapy\contrib\spidermiddleware\urllength.py", line 32, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Python27\lib\site-packages\scrapy\contrib\spidermiddleware\depth.py", line 48, in <genexpr>
return (r for r in result or () if _filter(r))
File "E:\Scrapy projects\emedia\emedia\spiders\test_spider.py", line 46, in parse
print repr(business.select('a[@class="name"]/text()').extract()[0])
File "C:\Python27\lib\site-packages\scrapy\selector\lxmlsel.py", line 51, in select
result = self.xpathev(xpath)
File "xpath.pxi", line 318, in lxml.etree.XPathElementEvaluator.__call__ (src\lxml\lxml.etree.c:145954)
File "xpath.pxi", line 241, in lxml.etree._XPathEvaluatorBase._handle_result (src\lxml\lxml.etree.c:144987)
File "extensions.pxi", line 621, in lxml.etree._unwrapXPathObject (src\lxml\lxml.etree.c:139973)
File "extensions.pxi", line 655, in lxml.etree._createNodeSetResult (src\lxml\lxml.etree.c:140328)
File "extensions.pxi", line 676, in lxml.etree._unpackNodeSetEntry (src\lxml\lxml.etree.c:140524)
File "extensions.pxi", line 784, in lxml.etree._buildElementStringResult (src\lxml\lxml.etree.c:141695)
File "apihelpers.pxi", line 1373, in lxml.etree.funicode (src\lxml\lxml.etree.c:26255)
exceptions.UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 22: invalid continuation byte
I have tried several things after searching the web, such as decode('utf-8') and unicodedata.normalize('NFC', business.select('a[@class="name"]/text()').extract()[0]), but the problem persists.
The source URL is "http://www.truelocal.com.au/find/hotels/97/"; the entry I am talking about is the fourth one on that page.

You have bad Mojibake in the original webpage, probably due to faulty handling of Unicode somewhere in the data entry. The actual bytes in the source are C3 3F C2 A0 when expressed in hexadecimal.
I think it was once a U+00A0 NO-BREAK SPACE. Encoded to UTF-8 that becomes C2 A0; interpret that as Latin-1 and encode to UTF-8 again, and it becomes C3 82 C2 A0. But 82 is a control character when interpreted as Latin-1 once more, so it was replaced by a ? question mark, hex 3F when encoded.
When you follow the link to the detail page for that venue, you get a different Mojibake for the same name: Comfort Inn And SuitesÃ‚Â Blazing Stump, giving us the Unicode characters U+00C3, U+201A, U+00C2 and an &nbsp; HTML entity, that is, Unicode character U+00A0 again. Encode that as Windows codepage 1252 (a superset of Latin-1) and you get C3 82 C2 A0 again.
You can only get rid of it by targeting this directly in the source of the page:
pagesource.replace('\xc3?\xc2\xa0', '\xc2\xa0')
This 'repairs' the data by substituting the train wreck with the original intended UTF-8 bytes.
If you have a scrapy Response object, replace the body:
body = response.body.replace('\xc3?\xc2\xa0', '\xc2\xa0')
response = response.replace(body=body)
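The byte sequence described in this answer can be reproduced step by step. The following is a minimal Python 3 sketch, not part of the original answer (which targets Python 2); the helper name repair_page is hypothetical.

```python
# Reproduce the suspected damage: NO-BREAK SPACE -> UTF-8 -> misread as
# Latin-1 -> encoded as UTF-8 a second time.
nbsp = "\u00a0"
utf8_once = nbsp.encode("utf-8")                           # C2 A0
utf8_twice = utf8_once.decode("latin-1").encode("utf-8")   # C3 82 C2 A0

# The detail page text yields the same four bytes when encoded as cp1252.
detail_page_text = "\u00c3\u201a\u00c2\u00a0"
assert detail_page_text.encode("cp1252") == utf8_twice

# On the listing page the 0x82 byte was replaced by "?" (0x3F).
damaged = utf8_twice.replace(b"\x82", b"?")


def repair_page(page_bytes):
    """Hypothetical helper: swap the damaged sequence for a UTF-8 NBSP."""
    return page_bytes.replace(b"\xc3?\xc2\xa0", b"\xc2\xa0")


name = b"Comfort Inn And Suites" + damaged + b"Blazing Stump"
print(repair_page(name).decode("utf-8"))
```

After the repair the bytes decode cleanly as UTF-8, with a no-break space between "Suites" and "Blazing".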

Don't use "replace" to fix Mojibake; fix the database and the code that caused the Mojibake.
But first you need to determine whether it is simply Mojibake or "double-encoding". With a SELECT col, HEX(col) ... you can determine whether a single character turned into 2-4 bytes (Mojibake) or 4-6 bytes (double encoding). Examples:
`é` (as utf8) should come back `C3A9`, but instead shows `C383C2A9`
The Emoji `👽` should come back `F09F91BD`, but comes back `C3B0C5B8E28098C2BD`
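The two hex examples can be checked outside the database. This is a small Python 3 sketch, assuming the misreading step used Windows-1252; the function names are hypothetical.

```python
def double_encode(text, misread_as="cp1252"):
    """Encode to UTF-8, misread the bytes as cp1252, encode to UTF-8 again."""
    return text.encode("utf-8").decode(misread_as).encode("utf-8")


def hex_upper(data):
    """Format bytes the way MySQL's HEX() prints them."""
    return data.hex().upper()


for ch in ("\u00e9", "\U0001F47D"):  # e-acute and the alien emoji
    print(hex_upper(ch.encode("utf-8")), hex_upper(double_encode(ch)))
```

The first value on each printed line is the correct UTF-8 encoding; the second is what a double-encoded column would hold.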
Review "Mojibake" and "double encoding" here
Then the database fixes are discussed here:
CHARACTER SET latin1, but have utf8 bytes in it; leave bytes alone while fixing charset:
First, let's assume you have this declaration for tbl.col:
col VARCHAR(111) CHARACTER SET latin1 NOT NULL
Then convert the column without changing the bytes via this 2-step ALTER:
ALTER TABLE tbl MODIFY COLUMN col VARBINARY(111) NOT NULL;
ALTER TABLE tbl MODIFY COLUMN col VARCHAR(111) CHARACTER SET utf8mb4 NOT NULL;
Note: If you start with TEXT, use BLOB as the intermediate definition. (This is the "2-step ALTER", as discussed elsewhere.) (Be sure to keep the other specifications the same - VARCHAR, NOT NULL, etc.)
CHARACTER SET utf8mb4 with double-encoding:
UPDATE tbl SET col = CONVERT(BINARY(CONVERT(col USING latin1)) USING utf8mb4);
CHARACTER SET latin1 with double-encoding: Do the 2-step ALTER, then fix the double-encoding.
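For text that has already been read into a program, the same reversal as the UPDATE statement can be sketched in Python 3. This assumes the intermediate charset behaves like Windows-1252, which is an approximation of MySQL's latin1; undo_double_encoding is a hypothetical name.

```python
def undo_double_encoding(text, codepage="cp1252"):
    """Reverse one round of double encoding; return text unchanged if it
    does not look double-encoded."""
    try:
        return text.encode(codepage).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# "\u00c3\u00a9" is how a double-encoded e-acute reads back as text.
print(undo_double_encoding("\u00c3\u00a9"))
# A correctly stored e-acute is left alone.
print(undo_double_encoding("\u00e9"))
```

As with the SQL version, this is only safe once you have confirmed from the hex dump that the data really is double-encoded.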

Related

indian rupee symbol UnicodeEncodeError while uploading file to s3 using pandas

I have scraped some data from a website for my assignment. It contains the Indian rupee character "₹". When I save the data to a CSV file in UTF-8 on my local machine using pandas, it saves without problems. When I change the delimiters and try to save the same file to S3 using pandas, I get a "UnicodeEncodeError". I'm scraping the web page using the Scrapy framework.
Earlier I was trying to save the file in Latin-1, i.e. "ISO-8859-1", and so I changed to "utf-8", but the same error occurs. I'm using Python 3.7 for development.
Below code used for saving on the local machine which is working:
result_df.to_csv(filename+str2+'.csv',index=False)
Below code is used to save the file to S3:
search_df.to_csv('s3://my-bucket/folder_path/filename_str2.csv',encoding = 'utf-8',line_terminator='^',sep='~',index=False)
Below is the error while saving the file to S3:
2019-10-29 19:24:27 [scrapy.utils.signal] ERROR: Error caught on signal handler: <function Spider.close at 0x0000019CD3B1AA60>
Traceback (most recent call last):
File "c:\programdata\anaconda3\lib\site-packages\twisted\internet\defer.py", line 151, in maybeDeferred
result = f(*args, **kw)
File "c:\programdata\anaconda3\lib\site-packages\pydispatch\robustapply.py", line 55, in robustApply
return receiver(*arguments, **named)
File "c:\programdata\anaconda3\lib\site-packages\scrapy\spiders\__init__.py", line 94, in close
return closed(reason)
File "C:\local_path\spiders\Pduct_Scrape.py", line 430, in closed
search_df.to_csv('s3://my-bucket/folder_path/filename_str2.csv',encoding = 'utf-8',line_terminator='^',sep='~',index=False)
File "c:\programdata\anaconda3\lib\site-packages\pandas\core\generic.py", line 3020, in to_csv
formatter.save()
File "c:\programdata\anaconda3\lib\site-packages\pandas\io\formats\csvs.py", line 172, in save
self._save()
File "c:\programdata\anaconda3\lib\site-packages\pandas\io\formats\csvs.py", line 288, in _save
self._save_chunk(start_i, end_i)
File "c:\programdata\anaconda3\lib\site-packages\pandas\io\formats\csvs.py", line 315, in _save_chunk
self.cols, self.writer)
File "pandas/_libs/writers.pyx", line 75, in pandas._libs.writers.write_csv_rows
File "c:\programdata\anaconda3\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u20b9' in position 2661: character maps to <undefined>
I am very new to this StackOverflow platform and please let me know if more information is to be presented.

The error is evidence that the code tries to encode the filename_str2.csv file in cp1252. From your stack trace:
...File "C:\local_path\spiders\Pduct_Scrape.py", line 430, in closed
search_df.to_csv('s3://my-bucket/folder_path/filename_str2.csv',......
File "c:\programdata\anaconda3\lib\encodings\cp1252.py", line 19, in encode
I do not know the reason, because you explicitly ask for a utf-8 encoding. The codecs page in the Python Standard Library reference gives utf_8 (with an underscore instead of a hyphen) as the canonical name for UTF-8 and does not list utf-8 among the aliases, so I would first try utf_8. If it still uses cp1252, then you will have to give the exact versions of Python and pandas that you are using.
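The pieces of this diagnosis can be checked with the standard library alone. The sketch below is an illustration, not the pandas code path: it shows that cp1252 cannot encode the rupee sign, that Python resolves both spellings of the UTF-8 codec name to the same codec, and one possible workaround of building the CSV text in memory and encoding it explicitly. The function rows_to_utf8_bytes is hypothetical, and uploading the resulting bytes to S3 is left out.

```python
import codecs
import csv
import io

rupee = "\u20b9"

# cp1252 has no mapping for the rupee sign, which matches the traceback.
try:
    rupee.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# Both spellings resolve to the same codec object name.
same_codec = codecs.lookup("utf-8").name == codecs.lookup("utf_8").name


def rows_to_utf8_bytes(rows, sep="~", line_terminator="^"):
    """Hypothetical helper: write rows as CSV text, then encode explicitly
    so that no platform default codec is involved."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=sep, lineterminator=line_terminator)
    writer.writerows(rows)
    return buffer.getvalue().encode("utf-8")


payload = rows_to_utf8_bytes([["price"], [rupee + "499"]])
print(cp1252_ok, same_codec, payload)
```

If both names resolve to the same codec on your installation, switching from utf-8 to utf_8 is unlikely to change the outcome, which points at the writing layer picking the platform default instead.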

Beautiful Soup raises UnicodeEncodeError "ordinal not in range(128)"

I am trying to parse arbitrary documents downloaded from the wild web, and yes, I have no control over their content.
Since Beautiful Soup won't choke if you give it bad markup, I wonder why it gives me these hiccups when part of the document is malformed, and whether there is a way to make it resume at the next readable portion of the document regardless of the error.
The line where the error occurred is the 3rd one:
from BeautifulSoup import BeautifulSoup as doc_parser
reader = open(options.input_file, "rb")
doc = doc_parser(reader)
CLI full output is:
Traceback (most recent call last):
File "./grablinks", line 101, in <module>
sys.exit(main())
File "./grablinks", line 88, in main
links = grab_links(options)
File "./grablinks", line 36, in grab_links
doc = doc_parser(reader)
File "/usr/local/lib/python2.7/dist-packages/BeautifulSoup.py", line 1519, in __init__
BeautifulStoneSoup.__init__(self, *args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/BeautifulSoup.py", line 1144, in __init__
self._feed(isHTML=isHTML)
File "/usr/local/lib/python2.7/dist-packages/BeautifulSoup.py", line 1186, in _feed
SGMLParser.feed(self, markup)
File "/usr/lib/python2.7/sgmllib.py", line 104, in feed
self.goahead(0)
File "/usr/lib/python2.7/sgmllib.py", line 143, in goahead
k = self.parse_endtag(i)
File "/usr/lib/python2.7/sgmllib.py", line 320, in parse_endtag
self.finish_endtag(tag)
File "/usr/lib/python2.7/sgmllib.py", line 358, in finish_endtag
method = getattr(self, 'end_' + tag)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 15-16: ordinal not in range(128)

Yeah, it will choke if you have elements with non-ASCII names (<café>). And that's not even "bad markup", for XML...
It's a bug in sgmllib, which BeautifulSoup uses: it tries to find custom methods with the same names as tags, but in Python 2 method names are byte strings, so even looking up a method with a non-ASCII character in its name, which will never be present, fails.
You can hack a fix into sgmllib by changing lines 259 and 371 from except AttributeError: to except (AttributeError, UnicodeError): but that's not really a good fix. It's not trivial to override the rest of the method either.
What is it you're trying to parse? BeautifulStoneSoup was always of questionable usefulness, really: XML doesn't have the wealth of ghastly parser hacks that HTML does, so in general broken XML isn't XML. Consequently you should generally use a plain old XML parser (e.g. a standard DOM or etree). For parsing general HTML, html5lib is your better option these days.
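The dispatch pattern behind the bug, and the guarded lookup suggested as a hack, can be sketched with a tiny stand-in class. This is not sgmllib itself; TagDispatcher and its methods are hypothetical. On Python 3 the lookup for a non-ASCII name simply raises AttributeError, while on Python 2 it raised UnicodeEncodeError, which is why both are caught.

```python
class TagDispatcher:
    """Hypothetical stand-in for sgmllib-style 'end_<tag>' dispatch."""

    def __init__(self):
        self.closed = []

    def end_p(self):
        self.closed.append("p")

    def unknown_endtag(self, tag):
        self.closed.append("unknown:" + tag)

    def finish_endtag(self, tag):
        try:
            method = getattr(self, "end_" + tag)
        except (AttributeError, UnicodeError):
            # No handler, or (on Python 2) a name that cannot be
            # ASCII-encoded: fall back instead of crashing.
            self.unknown_endtag(tag)
        else:
            method()


dispatcher = TagDispatcher()
dispatcher.finish_endtag("p")
dispatcher.finish_endtag("caf\u00e9")
print(dispatcher.closed)
```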

This happens if there are non-ASCII characters in the input in Python versions before 3.0.
If you use str(...) on a unicode string containing characters with a code point of 128 or above (i.e. outside ASCII), this exception is raised.
Here, the error possibly occurs because getattr tries to use str on a unicode string. It "thinks" it can safely do this because in Python versions prior to 3.0 identifiers must not contain non-ASCII characters.
Check your HTML for non-ASCII characters. Try to replace or encode them, and if it still does not work, tell us.

Why am I getting a UnicodeDecodeError in Python's JSON encoding?

I am using Solr 3.3 to index stuff from my database. I compose the JSON content in Python. I manage to upload 2126 records which add up to 523246 chars (approx 511kb). But when I try 2027 records, Python gives me the error:
Traceback (most recent call last):
File "<console>", line 1, in <module>
File "D:\Technovia\db_indexer\solr_update.py", line 69, in upload_service_details
request_string.append(param_list)
File "C:\Python27\lib\json\__init__.py", line 238, in dumps
**kw).encode(obj)
File "C:\Python27\lib\json\encoder.py", line 203, in encode
chunks = list(chunks)
File "C:\Python27\lib\json\encoder.py", line 425, in _iterencode
for chunk in _iterencode_list(o, _current_indent_level):
File "C:\Python27\lib\json\encoder.py", line 326, in _iterencode_list
for chunk in chunks:
File "C:\Python27\lib\json\encoder.py", line 384, in _iterencode_dict
yield _encoder(value)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x96 in position 68: invalid start byte
Ouch. Is 512kb worth of bytes a fundamental limit? Is there any high-volume alternative to the existing JSON module?
Update: it's a fault in some of the data, as trying to encode biz_list[2126:] causes an immediate error. Here is the offending piece:
'2nd Floor, Gurumadhavendra Towers,\nKadavanthra Road, Kaloor,\nCochin \x96 682 017'
How can I configure it so that it can be encodable into JSON?
Update 2: The answer worked as expected: the data came from a MySQL table encoded in "latin-1-swedish-ci". I saw significance in a random number. Sorry for spontaneously channeling the spirit of a headline writer when diagnosing the fault.

Simple, just don't use utf-8 encoding if your data is not in utf-8
>>> json.loads('["\x96"]')
....
UnicodeDecodeError: 'utf8' codec can't decode byte 0x96 in position 0: invalid start byte
>>> json.loads('["\x96"]', encoding="latin-1")
[u'\x96']
From the json.loads documentation:
If s is a str instance and is encoded with an ASCII based
encoding other than utf-8 (e.g. latin-1) then an appropriate
encoding name must be specified. Encodings that are not ASCII
based (such as UCS-2) are not allowed and should be decoded to
unicode first.
Edit: To get proper unicode value of "\x96" use "cp1252" as Eli Collins mentioned
>>> json.loads('["\x96"]', encoding="cp1252")
[u'\u2013']
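The session above is Python 2; the encoding argument of json.loads was removed in Python 3.9. A rough Python 3 equivalent, shown here as a sketch, decodes the bytes with the right codec first and only then hands the text to json. The sample value is the address fragment from the question.

```python
import json

raw = b"Cochin \x96 682 017"

as_cp1252 = raw.decode("cp1252")    # 0x96 -> U+2013 EN DASH
as_latin1 = raw.decode("latin-1")   # 0x96 -> U+0096 control character

# The same bytes are not valid UTF-8, which is what the traceback reports.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

encoded = json.dumps({"address": as_cp1252})
print(valid_utf8, encoded)
```

Decoding as cp1252 rather than latin-1 matters here because only cp1252 maps 0x96 to the en dash the data entry most likely intended.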

UnicodeDecodeError is raised when getting a cookie in Google App Engine

I have a GAE project in Python where I am setting a cookie in one of my RequestHandlers with this code:
self.response.headers['Set-Cookie'] = 'app=ABCD; expires=Fri, 31-Dec-2020 23:59:59 GMT'
I checked in Chrome and I can see the cookie listed, so it appears to be working.
Then later in another RequestHandler, I get the cookie to check it:
appCookie = self.request.cookies['app']
This line gives the following error when executed:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1962: ordinal not in range(128)
It seems that it is trying to decode the incoming cookie info using an ASCII codec rather than UTF-8.
How do I force Python to use UTF-8 to decode this?
Are there any other Unicode-related gotchas that I need to be aware of as a newbie to Python and Google App Engine (but an experienced programmer in other languages)?
Here is the full Traceback:
Traceback (most recent call last):
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/dev_appserver.py", line 4144, in _HandleRequest
self._Dispatch(dispatcher, self.rfile, outfile, env_dict)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/dev_appserver.py", line 4049, in _Dispatch
base_env_dict=env_dict)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/dev_appserver.py", line 616, in Dispatch
base_env_dict=base_env_dict)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/dev_appserver.py", line 3120, in Dispatch
self._module_dict)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/dev_appserver.py", line 3024, in ExecuteCGI
reset_modules = exec_script(handler_path, cgi_path, hook)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/dev_appserver.py", line 2887, in ExecuteOrImportScript
exec module_code in script_module.__dict__
File "/Users/ken/hgdev/juicekit/main.py", line 402, in <module>
main()
File "/Users/ken/hgdev/juicekit/main.py", line 399, in main
run_wsgi_app(application)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/webapp/util.py", line 98, in run_wsgi_app
run_bare_wsgi_app(add_wsgi_middleware(application))
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/webapp/util.py", line 116, in run_bare_wsgi_app
result = application(env, _start_response)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/webapp/__init__.py", line 721, in __call__
response.wsgi_write(start_response)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/webapp/__init__.py", line 296, in wsgi_write
body = self.out.getvalue()
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/StringIO.py", line 270, in getvalue
self.buf += ''.join(self.buflist)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1962: ordinal not in range(128)

You're looking to use the decode function somewhat like this (cred @agf):
self.request.cookies['app'].decode('utf-8')
From official python documentation (plus a couple added details):
Python’s 8-bit strings have a .decode([encoding], [errors]) method that interprets the string using the given encoding. The following example shows the string as it goes to unicode and then back to 8-bit string:
>>> u = unichr(40960) + u'abcd' + unichr(1972) # Assemble a string
>>> type(u), u # Examine
(<type 'unicode'>, u'\ua000abcd\u07b4')
>>> utf8_version = u.encode('utf-8') # Encode as UTF-8
>>> type(utf8_version), utf8_version # Examine
(<type 'str'>, '\xea\x80\x80abcd\xde\xb4')
>>> u2 = utf8_version.decode('utf-8') # Decode using UTF-8
>>> u == u2 # The two strings match
True

First, encode any unicode value you set in the cookies. You also need to quote them in case they can break the header:
import urllib
# This is the value we want to set.
initial_value = u'äëïöü'
# WebOb version that comes with SDK doesn't quote cookie values
# in the Response, neither webapp.Response. So we have to do it.
quoted_value = urllib.quote(initial_value.encode('utf-8'))
rsp = webapp.Response()
rsp.headers['Set-Cookie'] = 'app=%s; Path=/' % quoted_value
Now let's read the value. To test it, create a fake Request to test the cookie we have set. This code was extracted from a real unittest:
cookie = rsp.headers.get('Set-Cookie')
req = webapp.Request.blank('/', headers=[('Cookie', cookie)])
# The stored value is the same quoted value from before.
# Notice that here we use .str_cookies, not .cookies.
stored_value = req.str_cookies.get('app')
self.assertEqual(stored_value, quoted_value)
Our value is still encoded and quoted. We must do the reverse to get the initial one:
# And we can get the initial value unquoting and decoding.
final_value = urllib.unquote(stored_value).decode('utf-8')
self.assertEqual(final_value, initial_value)
If you can, consider using webapp2. webob.Response does all the hard work of quoting and setting cookies, and you can set unicode values directly. See a summary of these issues here.
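The encode-and-quote round trip in this answer is written for Python 2. A Python 3 sketch of the same idea, using only urllib.parse, looks like this; the two helper names are hypothetical and no webapp or WebOb objects are involved.

```python
from urllib.parse import quote, unquote


def encode_cookie_value(value):
    """Encode to UTF-8 and percent-quote so the header stays ASCII."""
    return quote(value.encode("utf-8"))


def decode_cookie_value(stored):
    """Reverse the quoting and decode the UTF-8 bytes."""
    return unquote(stored, encoding="utf-8")


initial_value = "\u00e4\u00eb\u00ef\u00f6\u00fc"
quoted_value = encode_cookie_value(initial_value)
header = "app=%s; Path=/" % quoted_value
print(header)
print(decode_cookie_value(quoted_value) == initial_value)
```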

Parsing spanish text and saving it in a db

I'm parsing a web page written in spanish with scrapy. The problem is that I can't save the text because of the wrong encoding.
This is the parse function:
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    text = hxs.select('//text()').extract() # Ex: [u' Sustancia mineral, m\xe1s o menos dura y compacta, que no es terrosa ni de aspecto met\xe1lico.']
    s = "".join(text)
    db = dbf.Dbf("test.dbf", new=True)
    db.addField(
        ("WORD", "C", 25),
        ("DATA", "M", 15000), # Memo field
    )
    rec = db.newRecord()
    rec["WORD"] = "Stone"
    rec["DATA"] = s
    rec.store()
    db.close()
When I try to save it to a db(a dbf db) I get an ASCII(128) error. I tried decoding/encoding using 'utf-8' and 'latin1' but with no success.
Edit:
To save the db I'm using dbfpy. I added the dbf saving code in the parse function above.
This is the error message:
Traceback (most recent call last):
File "/usr/lib/python2.6/dist-packages/twisted/internet/base.py", line 1179, in mainLoop
self.runUntilCurrent()
File "/usr/lib/python2.6/dist-packages/twisted/internet/base.py", line 778, in runUntilCurrent
call.func(*call.args, **call.kw)
File "/usr/lib/python2.6/dist-packages/twisted/internet/defer.py", line 280, in callback
self._startRunCallbacks(result)
File "/usr/lib/python2.6/dist-packages/twisted/internet/defer.py", line 354, in _startRunCallbacks
self._runCallbacks()
--- <exception caught here> ---
File "/usr/lib/python2.6/dist-packages/twisted/internet/defer.py", line 371, in _runCallbacks
self.result = callback(self.result, *args, **kw)
File "/home/katy/Dropbox/proyectos/rae/rae/spiders/rae_spider.py", line 54, in parse
rec.store()
File "/home/katy/Dropbox/proyectos/rae/rae/spiders/record.py", line 211, in store
self.dbf.append(self)
File "/home/katy/Dropbox/proyectos/rae/rae/spiders/dbf.py", line 214, in append
record._write()
File "/home/katy/Dropbox/proyectos/rae/rae/spiders/record.py", line 173, in _write
self.dbf.stream.write(self.toString())
File "/home/katy/Dropbox/proyectos/rae/rae/spiders/record.py", line 223, in toString
for (_def, _dat) in izip(self.dbf.header.fields, self.fieldData)
File "/home/katy/Dropbox/proyectos/rae/rae/spiders/fields.py", line 215, in encodeValue
return str(value)[:self.length].ljust(self.length)
exceptions.UnicodeEncodeError: 'ascii' codec can't encode character u'\xf1' in position 18: ordinal not in range(128)

Please remember that DBF files don't support Unicode at all.
I also suggest using Ethan Furman's dbf package (link in another answer).
You can call just table = dbf.Table('filename') to let the package guess the real table type.
Example of usage with non cp437 encoding is:
#!/usr/bin/env python
# coding: koi8-r
import dbf
text = 'текст в koi8-r'
table = dbf.Table(':memory:', ['test M'], 128, False, False, True, False, 'dbf', 'koi8-r')
record = table.append()
record.test = text
Please note the following about version 0.87.14 and the 'dbf' table type:
With dbf package 0.87.14 you may hit the exception 'TypeError: ord() expected a character...' at ".../site-packages/dbf/tables.py", line 686.
Only the 'dbf' table type is affected by this typo!
DISCLAIMER: I don't know the really correct values to use below, so don't blame me for any incompatibility caused by this "fix".
You can replace the values '' with '\0' (at least) at lines 490 and 491 to make this test work.

Looks like http://sourceforge.net/projects/dbfpy is what you are talking about. Whatever gave you the idea that it could handle creating a VFP-compatible DBF file just by throwing Unicode at it? There are no docs worth the description AFAICT, the source simply doesn't contain .encode(, and there's no supported way of changing the default "signature" away from 0x03 (very plain dBaseIII file).
If you encode your text fields in cp850 or cp437 before you throw them at the dbf it may work, but you'll need to check that you can open the resulting file using VFP and that all your accented Spanish characters are represented properly when you view the text fields on the screen.
If that doesn't work (and even if it does), you should have a look at Ethan Furman's dbf package ... it purports to know all about VFP and language driver IDs and codepages and suchlike.
Update: I see that you have 15000-byte memo field defined. One of us is missing something ... the code that I'm reading says in fields.py about line 330 Note: memos aren't currenly [sic] completely supported followed a bit later by two occurrences of raise NotImplementedError ... back up to line 3: TODO: - make memos work. When I tried the code that you say you used (with plain ASCII data), it raised NotImplementedError from the rec.store(). Have you managed to get it to work at all?
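The suggestion in this answer to encode text fields in cp850 or cp437 before writing can be sketched in Python 3 as follows. The function encode_char_field is hypothetical and only mimics the truncate-and-pad step seen in the traceback (fields.py, encodeValue); whether the resulting file opens correctly in VFP still has to be checked as described.

```python
def encode_char_field(value, length, codepage="cp850"):
    """Encode text for a fixed-width character field: unmappable characters
    become '?', then the bytes are truncated and space-padded."""
    raw = value.encode(codepage, errors="replace")
    return raw[:length].ljust(length)


field = encode_char_field("m\u00e1s o menos, ni\u00f1o", 25)
print(field)
print(field.rstrip().decode("cp850"))
```

Encoding before truncating keeps the field at exactly the declared byte width, which is what a fixed-width format expects.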
