Insert large csv to MySQL, ignore lines with unknown characters

Insert large csv to MySQL, ignore lines with unknown characters - python

I have a large .csv that I'm trying to import into a MySQL database for a Django project. I'm using the django.db library to write raw sql statements such as:
LOAD DATA LOCAL INFILE 'file.csv'...
However, I keep getting the following error:
django.db.utils.OperationalError: (1300, "Hey! Are you out tonight?")
After grepping the .csv for the line, I realised that the error is being caused by this character: 😜; though I'm sure there will be other characters throwing that error after I fix this.
Running:
$ file --mime file.csv
from a terminal, returns:
$ file.csv: text/html; charset=us-ascii
Since the rest of my db is in UTF-8, I tried writing a python script to re-encode it, using .encode('utf-8', 'ignore') hoping that the 'ignore' would remove any symbols that gave it trouble, but it threw:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 825410: invalid continuation byte
The thing is, I don't actually care about inserting 100% of the file into my db. I would rather just insert only the 'safe' lines that don't contain strange characters.
So ideally, I'm looking for a way to modify my LOAD DATA LOCAL INFILE sql statement so it just skips inserting any lines that give it trouble. This is optimal, since I don't want to spend time preprocessing the data.
If that isn't feasible, the next best thing is to remove any troublesome character/lines with a Python script that I could later run from my django app whenever I update my db.
If all else fails, information on how to grep out any characters that aren't UTF-8 friendly that I could write a shell script around would be useful.

For 😜, MySQL must use CHARACTER SET utf8mb4 on the column where you will be storing it, the LOAD DATA, and on the connection.
More Python notes: http://mysql.rjweb.org/doc.php/charcoll#python
E9 does not make sense. The hex for the UTF-8 encoding for 😜 is F09F989C.
The link on converting between character is irrelevant; only UTF-8 can be used for Emoji.

Not 100% sure if this will help but this is what I'd try:
Since open() is used to open a CSV file for reading, the file will by default be decoded into unicode using the system default encoding (see locale.getpreferredencoding()). To decode a file using a different encoding, use the encoding argument of open:
import csv
with open('some.csv', newline='', encoding='utf-8') as f:
reader = csv.reader(f)
for row in reader:
print(row)
That's an example gathered from official docs. Have in mind that you might need to replace utf-8 with the actual file encoding, as docs say. Then you can either continue using python to push your data into DB or write a new file with a new encoding.
Alternatively, this could could be another approach.

Related

I keep getting 'charmap' codec can't encode characters error when trying to save python script's output to clipboard or text file [duplicate]

I'm trying to scrape a website, but it gives me an error.
I'm using the following code:
import urllib.request
from bs4 import BeautifulSoup
get = urllib.request.urlopen("https://www.website.com/")
html = get.read()
soup = BeautifulSoup(html)
print(soup)
And I'm getting the following error:
File "C:\Python34\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 70924-70950: character maps to <undefined>
What can I do to fix this?

I was getting the same UnicodeEncodeError when saving scraped web content to a file. To fix it I replaced this code:
with open(fname, "w") as f:
f.write(html)
with this:
with open(fname, "w", encoding="utf-8") as f:
f.write(html)
If you need to support Python 2, then use this:
import io
with io.open(fname, "w", encoding="utf-8") as f:
f.write(html)
If you want to use a different encoding than UTF-8, specify whatever your actual encoding is for encoding.

I fixed it by adding .encode("utf-8") to soup.
That means that print(soup) becomes print(soup.encode("utf-8")).

In Python 3.7, and running Windows 10 this worked (I am not sure whether it will work on other platforms and/or other versions of Python)
Replacing this line:
with open('filename', 'w') as f:
With this:
with open('filename', 'w', encoding='utf-8') as f:
The reason why it is working is because the encoding is changed to UTF-8 when using the file, so characters in UTF-8 are able to be converted to text, instead of returning an error when it encounters a UTF-8 character that is not suppord by the current encoding.

set PYTHONIOENCODING=utf-8
set PYTHONLEGACYWINDOWSSTDIO=utf-8
You may or may not need to set that second environment variable PYTHONLEGACYWINDOWSSTDIO.
Alternatively, this can be done in code (although it seems that doing it through env vars is recommended):
sys.stdin.reconfigure(encoding='utf-8')
sys.stdout.reconfigure(encoding='utf-8')
Additionally: Reproducing this error was a bit of a pain, so leaving this here too in case you need to reproduce it on your machine:
set PYTHONIOENCODING=windows-1252
set PYTHONLEGACYWINDOWSSTDIO=windows-1252

While saving the response of get request, same error was thrown on Python 3.7 on window 10. The response received from the URL, encoding was UTF-8 so it is always recommended to check the encoding so same can be passed to avoid such trivial issue as it really kills lots of time in production
import requests
resp = requests.get('https://en.wikipedia.org/wiki/NIFTY_50')
print(resp.encoding)
with open ('NiftyList.txt', 'w') as f:
f.write(resp.text)
When I added encoding="utf-8" with the open command it saved the file with the correct response
with open ('NiftyList.txt', 'w', encoding="utf-8") as f:
f.write(resp.text)

Even I faced the same issue with the encoding that occurs when you try to print it, read/write it or open it. As others mentioned above adding .encoding="utf-8" will help if you are trying to print it.
soup.encode("utf-8")
If you are trying to open scraped data and maybe write it into a file, then open the file with (......,encoding="utf-8")
with open(filename_csv , 'w', newline='',encoding="utf-8") as csv_file:

For those still getting this error, adding encode("utf-8") to soup will also fix this.
soup = BeautifulSoup(html_doc, 'html.parser').encode("utf-8")
print(soup)

There are multiple aspects to this problem. The fundamental question is which character set you want to output into. You may also have to figure out the input character set.
Printing (with either print or write) into a file with an explicit encoding="..." will translate Python's internal Unicode representation into that encoding. If the output contains characters which are not supported by that encoding, you will get an UnicodeEncodeError. For example, you can't write Russian or Chinese or Indic or Hebrew or Arabic or emoji or ... anything except a restricted set of some 200+ Western characters to a file whose encoding is "cp1252" because this limited 8-bit character set has no way to represent these characters.
Basically the same problem will occur with any 8-bit character set, including nearly all the legacy Windows code pages (437, 850, 1250, 1251, etc etc), though some of them support some additional script in addition to or instead of English (1251 supports Cyrillic, for example, so you can write Russian, Ukrainian, Serbian, Bulgarian, etc). An 8-bit encoding has only a maximum of 256 character codes and no way to represent a character which isn't among them.
Perhaps now would be a good time to read Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
On platforms where the terminal is not capable of printing Unicode (only Windows these days really, though if you're into retrocomputing, this problem was also prevalent on other platforms in the previous millennium) attempting to print Unicode strings can also produce this error, or output mojibake. If you see something like HÃ©llÃ¶ instead of Héllö, this is your issue.
In short, then, you need to know:
What is the character set of the page you scraped, or the data you received? Was it correctly scraped? Did the originator correctly identify its encoding, or are you able to otherwise obtain this information (or guess it)? Some web sites incorrectly declare a different character set than the page actually contains, some sites have incorrectly configured the connection between the web server and a back-end database. See e.g. scrape with correct character encoding (python requests + beautifulsoup) for a more detailed example with some solutions.
What is the character set you want to write? If printing to the screen, is your terminal correctly configured, and is your Python interpreter configured identically?
Perhaps see also How to display utf-8 in windows console
If you are here, probably the answer to one of these questions is not "UTF-8". This is increasingly becoming the prevalent encoding for web pages, too, though the former standard was ISO-8859-1 (aka Latin-1) and more recently Windows code page 1252.
Going forward, you basically want all your textual data to be Unicode, outside of a few fringe use cases. Generally, that means UTF-8, though on Windows (or if you need Java compatibility), UTF-16 is also vaguely viable, albeit somewhat cumbersome. (There are several other Unicode serialization formats, which may be useful in specialized circumstances. UTF-32 is technically trivial, but takes up a lot more memory; UTF-7 is used in a few network protocols where 7-bit ASCII is required for transport.)
Perhaps see also https://utf8everywhere.org/
Naturally, if you are printing to a file, you also need to examine that file using a tool which can correctly display it. A common pilot error is to open the file using a tool which only displays the currently selected system encoding, or one which tries to guess the encoding, but guesses wrong. Again, a common symptom when viewing UTF-8 text using Windows code page 1252 would result, for example, in Héllö displaying as HÃ©llÃ¶.
If the encoding of character data is unknown, there is no simple way to automatically establish it. If you know what the text is supposed to represent, you can perhaps infer it, but this is typically a manual process with some guesswork involved. (Automatic tools like chardet and ftfy can help, but they get it wrong some of the time, too.)
To establish which encoding you are looking at, it can be helpful if you can identify the individual bytes in a character which isn't displayed correctly. For example, if you are looking at H\x8ell\x9a but expect it to represent Héllö, you can look up the bytes in a translation table. I have published one such table at https://tripleee.github.io/8bit where you can see that in this example, it's probably one of the legacy Mac 8-bit character sets; with more data points, perhaps you can narrow it down to just one of them (and if not, any one of them will do in practice, since all the code points you care about map to the same Unicode characters).
Python 3 on most platforms defaults to UTF-8 for all input and output, but on Windows, this is commonly not the case. It will then instead default to the system's default encoding (still misleadingly called "ANSI code page" in some Microsoft documentation), which depends on a number of factors. On Western systems, the default encoding out of the box is commonly Windows code page 1252.
(Earlier Python versions had somewhat different expectations, and in Python 2, the internal string representation was not Unicode.)
If you are on Windows and write UTF-8 to a text file, maybe specify encoding="utf-8-sig" which adds a BOM sequence at the beginning of the file. This is strictly speaking not necessary or correct, but some Windows tools need it to correctly identify the encoding.
Several of the earlier answers here suggest blindly applying some encoding, but hopefully this should help you understand how that's not generally the correct approach, and how to figure out - rather than guess - which encoding to use.

From Python 3.7 onwards,
Set the the environment variable PYTHONUTF8 to 1
The following script included other useful variables too which set System Environment Variables.
setx /m PYTHONUTF8 1
setx PATHEXT "%PATHEXT%;.PY" ; In CMD, Python file can be executed without extesnion.
setx /m PY_PYTHON 3.10 ; To set default python version for py
Source

I got the same error so I use (encoding="utf-8") and it solve the error.
This generally happens when we got some unidentified symbol or pattern in text data that our encoder does not understand.
with open("text.txt", "w", encoding='utf-8') as f:
f.write(data)
This will solve your problem.

if you are using windows try to pass encoding='latin1', encoding='iso-8859-1' or encoding='cp1252'
example:
csv_data = pd.read_csv(csvpath,encoding='iso-8859-1')
print(print(soup.encode('iso-8859-1')))

PSQL COPY statement error says character is encoded WIN1252 despite file seeming to be encoded UTF-8

I am trying to copy a CSV I created in Python to a PSQL database using a copy statement and am receiving the error PostgreSQL: character with byte sequence 0xc2 0x81 in encoding “UTF8” has no equivalent in encoding “WIN1252”.
The original file was UTF-8, and I'm fairly certain the file I created and am trying to copy is encoded with UTF-8. I believe I added the correct params in my Python code:
def process_csvs():
with open('movie_file.csv', mode='w', encoding='utf-8') as movie_file:
movie_writer = csv.writer(movie_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL, lineterminator='\n')
with open('movie_basics.tsv', encoding='utf-8') as csv_file:
csv_reader = csv.reader(csv_file, delimiter="\t")
for row in csv_reader:
...
When I open the output file in notepad, it says the encoding is UTF-8.
My COPY statement is as follows:
COPY movies(movie_id, title, year, runningtime) FROM 'C:\Users\Public\Documents\movie_file.csv' DELIMITER ',' CSV HEADER NULL AS 'Nul
l';
I believe the character that is currently giving me trouble is is the accented A in: tt0000676,Don Álvaro o la fuerza del sino,1908,Null
Anybody know how it could be encoded WIN1252 with my configuration and any way to fix it? Thanks!
EDIT:
I recreated the database with:
CREATE DATABASE "scratch"
WITH TEMPLATE template0
ENCODING 'UTF8'
LC_COLLATE = 'en_US.UTF-8'
LC_CTYPE = 'en_US.UTF-8';
As far as I can see, everything involved is encoded in UTF-8 outside of perhaps the command line, which I believe defaults to WIN1252. I am pretty new to all of this. And I did attempt to write the file with WIN1252 encoding in my Python script, but the original data is encoded UTF-8, and the script threw errors attempting to make the conversion.

Thanks to Adrian Klaver for sticking with me and ultimately pointing me to this post, which explained how to change the encoding of Postgres and the active code page of the terminal within CMD. The COPY statement worked fine after that.
The steps outlined in the post are:
open cmd
SET PGCLIENTENCODING=utf-8
chcp 65001
Log into Postgres
For any who are coming into this clueless about encoding, CP65001 (Code Page 65001) is Windows's way of specifying UTF-8 encoding. SET PGCLIENTENCODING=utf-8 changes the encoding of the client in Postgres to be UTF-8, and then chcp 65001 changes the active code page of the console to match it.

Python | UnicodeEncodeError: 'charmap' codec can't encode character '\u0119' in position 1: character maps to <undefined> [duplicate]

I'm trying to scrape a website, but it gives me an error.
I'm using the following code:
import urllib.request
from bs4 import BeautifulSoup
get = urllib.request.urlopen("https://www.website.com/")
html = get.read()
soup = BeautifulSoup(html)
print(soup)
And I'm getting the following error:
File "C:\Python34\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 70924-70950: character maps to <undefined>
What can I do to fix this?

I was getting the same UnicodeEncodeError when saving scraped web content to a file. To fix it I replaced this code:
with open(fname, "w") as f:
f.write(html)
with this:
with open(fname, "w", encoding="utf-8") as f:
f.write(html)
If you need to support Python 2, then use this:
import io
with io.open(fname, "w", encoding="utf-8") as f:
f.write(html)
If you want to use a different encoding than UTF-8, specify whatever your actual encoding is for encoding.

I fixed it by adding .encode("utf-8") to soup.
That means that print(soup) becomes print(soup.encode("utf-8")).

In Python 3.7, and running Windows 10 this worked (I am not sure whether it will work on other platforms and/or other versions of Python)
Replacing this line:
with open('filename', 'w') as f:
With this:
with open('filename', 'w', encoding='utf-8') as f:
The reason why it is working is because the encoding is changed to UTF-8 when using the file, so characters in UTF-8 are able to be converted to text, instead of returning an error when it encounters a UTF-8 character that is not suppord by the current encoding.

set PYTHONIOENCODING=utf-8
set PYTHONLEGACYWINDOWSSTDIO=utf-8
You may or may not need to set that second environment variable PYTHONLEGACYWINDOWSSTDIO.
Alternatively, this can be done in code (although it seems that doing it through env vars is recommended):
sys.stdin.reconfigure(encoding='utf-8')
sys.stdout.reconfigure(encoding='utf-8')
Additionally: Reproducing this error was a bit of a pain, so leaving this here too in case you need to reproduce it on your machine:
set PYTHONIOENCODING=windows-1252
set PYTHONLEGACYWINDOWSSTDIO=windows-1252

While saving the response of get request, same error was thrown on Python 3.7 on window 10. The response received from the URL, encoding was UTF-8 so it is always recommended to check the encoding so same can be passed to avoid such trivial issue as it really kills lots of time in production
import requests
resp = requests.get('https://en.wikipedia.org/wiki/NIFTY_50')
print(resp.encoding)
with open ('NiftyList.txt', 'w') as f:
f.write(resp.text)
When I added encoding="utf-8" with the open command it saved the file with the correct response
with open ('NiftyList.txt', 'w', encoding="utf-8") as f:
f.write(resp.text)

Even I faced the same issue with the encoding that occurs when you try to print it, read/write it or open it. As others mentioned above adding .encoding="utf-8" will help if you are trying to print it.
soup.encode("utf-8")
If you are trying to open scraped data and maybe write it into a file, then open the file with (......,encoding="utf-8")
with open(filename_csv , 'w', newline='',encoding="utf-8") as csv_file:

For those still getting this error, adding encode("utf-8") to soup will also fix this.
soup = BeautifulSoup(html_doc, 'html.parser').encode("utf-8")
print(soup)

From Python 3.7 onwards,
Set the the environment variable PYTHONUTF8 to 1
The following script included other useful variables too which set System Environment Variables.
setx /m PYTHONUTF8 1
setx PATHEXT "%PATHEXT%;.PY" ; In CMD, Python file can be executed without extesnion.
setx /m PY_PYTHON 3.10 ; To set default python version for py
Source

I got the same error so I use (encoding="utf-8") and it solve the error.
This generally happens when we got some unidentified symbol or pattern in text data that our encoder does not understand.
with open("text.txt", "w", encoding='utf-8') as f:
f.write(data)
This will solve your problem.

if you are using windows try to pass encoding='latin1', encoding='iso-8859-1' or encoding='cp1252'
example:
csv_data = pd.read_csv(csvpath,encoding='iso-8859-1')
print(print(soup.encode('iso-8859-1')))

Can't write Unicode text using cx_Oracle

I'm working on a Python script that reads in a CSV writes the contents out to an Oracle database using cx_Oracle. So far I've been getting the following error:
UnicodeEncodeError: 'ascii' codec can't encode character '\xa0' in position 1369: ordinal not in range(128)
Evidently, cx_Oracle is trying to convert a Unicode character to ASCII and it's not working.
A few clarifying points:
I'm using Python 3.4.3
The CSV file is encoded in UTF-8 and is being opened like open('all.csv', encoding='utf8')
I'm using NVARCHAR2 fields for text in the database and the NLS_NCHAR_CHARACTERSET is set to AL16UTF16. The NLS_CHARACTERSET is WE8MSWIN1252 but from what I understand that shouldn't be relevant since I'm using NVARCHAR2.
I've tried setting the NLS_LANG environment variable to things like .AL16UTF16, _.AL16UTF16 and AMERICAN_AMERICA.WE8MSWIN1252 per this post, but I still get the same error.
Given that I'm reading a UTF-8 file and trying to write to a Unicode-encoded table, can anyone think of why cx_Oracle would still be trying to convert my data to ASCII?
I'm able to produce the error with this code:
field_map = {
...
}
with open('all.csv', encoding='utf8') as f:
reader = csv.DictReader(f)
out_rows = []
for row in reader:
if i == 1000:
break
out_row = {}
for field, source_field in field_map.items():
out_val = row[source_field]
out_row[field] = out_val
out_rows.append(out_row)
i += 1
out_db = datum.connect('oracle-stgeom://user:pass#db')
out_table = out_db['service_requests']
out_table.write(out_rows, chunk_size=10000)
The datum module is a data abstraction library I'm working on. The function responsible for writing to Oracle table is found here.
The full traceback is:
File "C:\Projects\311\write.py", line 64, in <module>
out_table.write(out_rows, chunk_size=10000)
File "z:\datum\datum\table.py", line 89, in write
self._child.write(rows, from_srid=from_srid, chunk_size=chunk_size)
File "z:\datum\datum\oracle_stgeom\table.py", line 476, in write
self._c.executemany(None, val_rows)
UnicodeEncodeError: 'ascii' codec can't encode character '\xa0' in position 1361: ordinal not in range(128)

Check the value of the "encoding" and "nencoding" attributes on the connection. This value is set by calling OCI routines that check the environment variables NLS_LANG and NLS_NCHAR. It looks like this value is US-ASCII or equivalent. When writing to the database, cx_Oracle takes the text and gets a byte string by encoding in the encoding the Oracle client is expecting. Note that this is unrelated to the database encoding. In general, for best performance, it is a good idea to match the database and client encodings -- but if you don't, Oracle will quite happily convert between the two, provided all of the characters used can be represented in both character sets!
Note that if the value of NLS_LANG is invalid it is essentially ignored. AL16UTF16 is one such invalid entry! So set it to the value you would like (such as .AL32UTF8) and check the value of encoding and nencoding on the connection until you get what you want.
Note as well that unless you state otherwise, all strings bound via cx_Oracle to the database are assumed to be in the normal encoding, not the NCHAR encoding. You can override this by using cursor.setinputsizes() and specifying that the input type is NCHAR, FIXED_NCHAR or LONG_NCHAR.

UnicodeEncodeError while writing data to an xml file

My aim is to write an XML file with few tags whose values are in the regional language. I'm using Python to do this and using IDLE (Pythong GUI) for programming.
While I try to write the words in an xmls file it gives the following error:
UnicodeEncodeError: 'ascii' codec
can't encode characters in position
0-4: ordinal not in range(128)
For now, I'm not using any xml writer library; instead, I'm opening a file "test.xml" and writing the data into it. This error is encountered by the line:
f.write(data)
If I replace the above write statement with print statement then it prints the data properly on the Python shell.
I'm reading the data from an Excel file which is not in the UTF-8, 16, or 32 encoding formats. It's in some other format. cp1252 is reading the data properly.
Any help in getting this data written to an XML file would be highly appreciated.

You should .decode your incoming cp1252 to get Unicode strings, and .encode them in utf-8 (by far the preferred encoding for XML) at the time you write, i.e.
f.write(unicodedata.encode('utf-8'))
where unicodedata is obtained by .decode('cp1252') on the incoming bytestrings.
It's possible to put lipstick on it by using the codecs module of the standard Python library to open the input and output files each with their proper encodings in lieu of plain open, but what I show is the underlying mechanism (and it's often, though not invariably, clearer and more explicit to apply it directly, rather than indirectly via codecs -- a matter of style and taste).
What does matter is the general principle: translate your input strings to unicode as soon as you can right after you obtain them, use unicode throughout your processing, translate them back to byte strings at late as you can just before you output them. This gives you the simplest, most straightforward life!-)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.