How to read a SQL file with Python with proper character encoding?

Based on a few examples (here, here), I can use psycopg2 to read and hopefully run a SQL file from Python (the file is 200 lines, and though it runs quickly, it isn't something I want to maintain in a Python file).
Here is the script to read and run the file:
sql = open("sql/sm_bounds_current.sql", "r").read()
curDest.execute(sql)
However, when I run the script, the following error is thrown:
Error: syntax error at or near "drop"
LINE 1: drop table dpsdata.sm_boundaries_current_dev;
As you can see, the first line in the script is to drop a table, but I'm not sure why the extra characters are being read, and can't seem to find a solution that might set the encoding of the file when reading it.

Found this post dealing with encoding and byte order marks, which is where my problem was.
The solution was to import open from codecs, which allows for the encoding option on open:
import codecs
from codecs import open
sqlfile = "sql/sm_bounds_current.sql"
sql = open(sqlfile, mode='r', encoding='utf-8-sig').read()
curDest.execute(sql)
Seems to work great!
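For what it's worth, on Python 3 the built-in open() accepts the same encoding argument, so the codecs import isn't strictly required. A minimal sketch of the same idea:
# Minimal sketch: utf-8-sig strips a leading BOM if one is present (assumes Python 3)
with open("sql/sm_bounds_current.sql", mode="r", encoding="utf-8-sig") as f:
    sql = f.read()
curDest.execute(sql)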

Related

Airflow worker not understanding file encoding when parsing a file with pandas

With the following code:
import pandas as pd

filename = r"/path/to/my/file.csv"
with open(filename) as f:
    data_frame = pd.read_csv(f,
                             usecols=['col1', 'col2'],
                             parse_dates=['DateProd', 'DateStart', 'DateEnd'],
                             header=0,
                             delimiter=';',
                             encoding='latin-1')
print(data_frame)
When this is executed locally, it prints the expected dataframe with proper accentuation. When executed in an Airflow task on a remote worker, it fails with the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 2581: invalid continuation byte
When looking at the full debug stack trace (sorry, sensitive information, I can't fully provide it), I can see that encoding='latin-1' was definitely specified, and it still fails with the UnicodeDecodeError. Has anybody noticed a similar issue? I've been running in circles, trying as many encodings as possible, but nothing seems to work.
I forgot to mention that the file is a remote file on a Samba share. Whether I try to read it directly with smbclient.open() or copy it over locally and then open it, I get the same result: UnicodeDecodeError. When I try the same thing locally (either reading directly from the Samba share or copying it over), everything seems fine; I noticed I don't even need to specify the encoding, it is detected automatically, and accents are displayed properly.
Another update: it seems that whether the file is read from the Samba share or not does not make a difference. I managed to run the Docker image that is used on the remote worker, and I can reproduce this issue with everything hosted locally, whether I open the file first, read it entirely before giving it to pandas, or simply give the filename to read_csv.
The engine does not seem to make a difference either: specifying engine='python' or engine='c' yields the same results.
Another update: the same issue also happens with a fresh Ubuntu Docker image. I'm guessing there are some locales that need to be installed before it is able to parse these characters.
I've figured it out.
On a Windows machine, the default encoding seems to be different; I don't even have to specify an encoding for it to work. Not so in the container. Thus, I need to specify the encoding when opening files in the containers. The following should work:
import pandas as pd

filename = r"/path/to/my/file.csv"
with open(filename, encoding='latin-1') as f:
    data_frame = pd.read_csv(f,
                             usecols=['col1', 'col2'],
                             parse_dates=['DateProd', 'DateStart', 'DateEnd'],
                             header=0,
                             delimiter=';')
    # Notice the lack of encoding='latin-1' here.
print(data_frame)
But! SambaHook essentially returns pysmbclient's SambaClient. When you try to open a file with this SambaClient, there is no way to specify the encoding of the file. So, locally, on a Windows machine, everything seems to work fine; in a Linux container, it fails with the UnicodeDecodeError. Looking under the hood, I've found that it essentially copies the file over before simply calling open() on it.
For now, this is my solution: copy the file over with the SambaClient returned by SambaHook into a temp file, open it with the proper encoding, and ask pandas to parse it, roughly as sketched below. I will see what I can do about improving SambaHook and pysmbclient so that others can specify the encoding when opening a file.
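A rough sketch of that workaround, using only the standard library plus pandas. The read_remote_csv name and the download() call are illustrative placeholders, not the exact SambaHook/pysmbclient API:
import tempfile
import pandas as pd

def read_remote_csv(samba_client, remote_path):
    # 'samba_client' stands in for whatever client SambaHook returns;
    # download() below is a hypothetical "copy remote file to local path" step.
    with tempfile.NamedTemporaryFile(suffix=".csv") as tmp:
        samba_client.download(remote_path, tmp.name)  # illustrative copy step
        # The encoding is specified here, on the local temp copy.
        with open(tmp.name, encoding="latin-1") as f:
            return pd.read_csv(f,
                               usecols=['col1', 'col2'],
                               parse_dates=['DateProd', 'DateStart', 'DateEnd'],
                               header=0,
                               delimiter=';')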

How am I supposed to handle the BOM while text processing using sys.stdin in Python 3?

Note: The possible duplicate concerns an older version of Python and this question has already generated unique answers.
I have been working on a script to process Project Gutenberg texts into an internal file format for an application I am developing. In the script I process chapter headings with the re module. This works very well except in one case: the first line. My regex always fails on the first chapter marker on the first line if it includes the ^ caret to anchor the match at the beginning of the line, because the BOM is consumed as the first character. (Example regex: ^Chapter)
What I've discovered is that if I do not include the caret, it won't fail on the first line, and then <feff> is included in the heading after I've processed it. An example:
<h1><feff>Chapter I</h1>
The advice according to this SO question (from which I learned of the BOM) is to fix your script to not consume/corrupt the BOM. Other SO questions talk about decoding the file with a codec but discuss errors I never encounter and do not discuss the syntax for opening a file with the template decoder.
To be clear:
I generally use pipelines of the following format:
cat -s <filename> | <other scripts> | python <scriptname> [options] > <outfile>
And I am opening the file with the following syntax:
import sys
fin = sys.stdin
if '-i' in sys.argv:  # For command line option "-i <infile>"
    fin = open(sys.argv[sys.argv.index('-i') + 1], 'rt')
for line in fin:
    ...Processing here...
My question is: what is the proper way to handle this? Do I remove the BOM before processing the text? If so, how? Or do I use a decoder on the file before processing it (I am reading from stdin, so how would I accomplish this)?
The files are stored in UTF-8 encoding with DOS endings (\r\n). I convert them in vim to UNIX file format before processing using set ff=unix (I have to do several manual pre-processing tasks before running the script).
As a complement to the existing answer, it is possible to filter the UTF-8 BOM from stdin with the codecs module. You simply use sys.stdin.buffer to access the underlying byte stream and decode it with a StreamReader:
import sys
import codecs
# trick to process sys.stdin with a custom encoding
fin = codecs.getreader('utf_8_sig')(sys.stdin.buffer, errors='replace')
if '-i' in sys.argv:  # For command line option "-i <infile>"
    fin = open(sys.argv[sys.argv.index('-i') + 1], 'rt',
               encoding='utf_8_sig', errors='replace')
for line in fin:
    ...Processing here...
In Python 3, stdin should be auto-decoded properly, but if it's not working for you (and for Python 2), you need to set PYTHONIOENCODING before invoking your script, like so:
PYTHONIOENCODING="UTF-8-SIG" python <scriptname> [options] > <outfile>
Notice that this setting also makes stdout work with UTF-8-SIG, so your <outfile> will maintain the original encoding.
For your -i parameter, just do open(path, 'rt', encoding="UTF-8-SIG")
You really don't need to import codecs or anything else to deal with this. As lenz suggested in the comments, just check for the BOM and throw it out.
for line in input:
    if line[0] == "\ufeff":
        line = line[1:]  # trim the BOM away
    # the rest of your code goes here as usual
In Python 3.9 the default encoding for standard input seems to be utf-8, at least on Linux:
In [2]: import sys
In [3]: sys.stdin
Out[3]: <_io.TextIOWrapper name='<stdin>' mode='r' encoding='utf-8'>
sys.stdin has the method reconfigure():
sys.stdin.reconfigure(encoding="utf-8-sig")
which should be called before any attempt to read standard input (note that reconfigure() takes keyword-only arguments, so the encoding must be passed as encoding=). This decodes the BOM, which will no longer appear when reading sys.stdin.
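A minimal end-to-end sketch, assuming Python 3.7 or newer (where reconfigure() is available):
import sys

# Re-decode stdin as utf-8-sig so that a leading BOM is dropped transparently.
sys.stdin.reconfigure(encoding="utf-8-sig")

for line in sys.stdin:
    # 'line' never starts with '\ufeff', so anchored regexes like ^Chapter work.
    print(line, end="")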

Is there any function like iconv in Python?

I have some CSV files that need to be converted from Shift-JIS to UTF-8.
Here is my code in PHP, which successfully transcodes to readable text.
$str = utf8_decode($str);
$str = iconv('shift-jis', 'utf-8'. '//TRANSLIT', $str);
echo $str;
My problem is how to do the same thing in Python.
I don't know PHP, but does this work:
mystring.decode('shift-jis').encode('utf-8') ?
Also, I assume the CSV content is from a file. There are a few options for opening a file in Python.
with open(myfile, 'rb') as fin:
would be the first option, and you would get the data as it is (raw bytes);
with open(myfile, 'r') as fin:
would be the default way of opening the file.
I also tried on my computer with a Shift-JIS text, and the following code worked:
with open("shift.txt" , "rb") as fin :
text = fin.read()
text.decode('shift-jis').encode('utf-8')
The result was the following in UTF-8 (without any errors):
' \xe3\x81\xa6 \xe3\x81\xa7 \xe3\x81\xa8'
Ok I validate my solution :)
The first char is indeed the correct character: "\xe3\x81\xa6" means "E3 81 A6".
It gives the correct result.
You can try yourself at this URL
For when Python's built-in encodings are insufficient, there's an iconv package on PyPI:
pip install iconv
Unfortunately, the documentation is nonexistent.
There's also iconv_codecs
pip install iconv_codecs
e.g.:
>>> import iconv_codecs
>>> iconv_codecs.register('ansi_x3.110-1983')
>>> "foo".encode('ansi_x3.110-1983')
It would be helpful if you could post the string that you are trying to convert, since this error suggests some problem with the input data; older versions of PHP failed silently on broken input strings, which makes this hard to diagnose.
According to the documentation, this might also be due to differences in Shift-JIS dialects; try using 'shift_jisx0213' or 'shift_jis_2004' instead.
If using another dialect does not work, you might get away with asking Python to fail silently by using .decode('shift-jis', 'ignore') or .decode('shift-jis', 'replace').
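Putting the pieces together, a minimal sketch of an iconv-style conversion using only the built-in codecs (the file names are placeholders; errors='replace' keeps the conversion from crashing on unmappable bytes, though it substitutes characters rather than transliterating like PHP's //TRANSLIT):
# Minimal sketch: convert a CSV from Shift-JIS to UTF-8 (paths are placeholders)
with open("input_sjis.csv", "r", encoding="shift_jis", errors="replace") as src:
    text = src.read()

with open("output_utf8.csv", "w", encoding="utf-8") as dst:
    dst.write(text)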

Postgres COPY FROM file throwing unicode error while referenced character apparently not in file

First of all, thank you to everyone on Stack Overflow for past, present, and future help. You've all saved me from disaster (both of my own design and otherwise) too many times to count.
The present issue is part of a decision at my firm to transition from a Microsoft SQL Server 2005 database to PostgreSQL 9.4. We have been following the notes on the Postgres wiki (https://wiki.postgresql.org/wiki/Microsoft_SQL_Server_to_PostgreSQL_Migration_by_Ian_Harding), and these are the steps we're following for the table in question:
Download table data [on Windows client]:
bcp "Carbon.consensus.observations" out "Carbon.consensus.observations" -k -S [servername] -T -w
Copy to Postgres server [running CentOS 7]
Run Python pre-processing script on Postgres server to change encoding and clean:
import sys
import os
import re
import codecs
import fileinput

base_path = '/tmp/tables/'
cleaned_path = '/tmp/tables_processed/'
files = os.listdir(base_path)

for filename in files:
    source_path = base_path + filename
    temp_path = '/tmp/' + filename
    target_path = cleaned_path + filename
    BLOCKSIZE = 1048576  # or some other, desired size in bytes
    with open(source_path, 'r') as source_file:
        with open(target_path, 'w') as target_file:
            start = True
            while True:
                contents = source_file.read(BLOCKSIZE).decode('utf-16le')
                if not contents:
                    break
                if start:
                    if contents.startswith(codecs.BOM_UTF8.decode('utf-8')):
                        contents = contents.replace(codecs.BOM_UTF8.decode('utf-8'), ur'')
                contents = contents.replace(ur'\x80', u'')
                contents = re.sub(ur'\000', ur'', contents)
                contents = re.sub(ur'\r\n', ur'\n', contents)
                contents = re.sub(ur'\r', ur'\\r', contents)
                target_file.write(contents.encode('utf-8'))
                start = False
    for line in fileinput.input(target_path, inplace=1):
        if '\x80' in line:
            line = line.replace(r'\x80', '')
        sys.stdout.write(line)
Execute SQL to load table:
COPY consensus.observations FROM '/tmp/tables_processed/Carbon.consensus.observations';
The issue is that the COPY command is failing with a unicode error:
[2015-02-24 19:52:24] [22021] ERROR: invalid byte sequence for encoding "UTF8": 0x80
Where: COPY observations, line 2622420: "..."
Given that this could very likely be because of bad data in the table (which also contains legitimate non-ASCII characters), I'm trying to find the actual byte sequence in context, and I can't find it anywhere (using sed to look at the line in question, regexes to replace the character as part of the preprocessing, etc.). For reference, this grep returns nothing:
cat /tmp/tables_processed/Carbon.consensus.observations | grep --color='auto' -P "[\x80]"
What am I doing wrong in tracking down where this byte sequence sits in context?
I would recommend loading the SQL file (which appears to be /tmp/tables_processed/Carbon.consensus.observations) into an editor that has a hex mode. This should allow you to see it (depending on the exact editor) in context.
gVim (or terminal-based Vim) is one option I would recommend.
For example, if I open in gVim an SQL copy file that has this content:
1 1.2
2 1.1
3 3.2
I can then convert it into hex mode via the command %!xxd (in gVim or terminal Vim) or the menu option Tools > Convert to HEX.
That yields this display:
0000000: 3109 312e 320a 3209 312e 310a 3309 332e 1.1.2.2.1.1.3.3.
0000010: 320a 2.
You can then run %!xxd -r to convert it back, or use the menu option Tools > Convert back.
Note: This actually modifies the file, so it would be advisable to do this to a copy of the original, just in case the changes somehow get written (you would have to explicitly save the buffer in Vim).
This way, you can see both the hex sequences on the left, and their ASCII equivalent on the right. If you search for 80, you should be able to see it in context. With gVim, the line numbering will be different for both modes, though, as is evidenced by this example.
It's likely the first 80 you find will be that line, though, since if there were earlier ones, it likely would've failed on those instead.
Another tool which might help, and that I've used in the past, is the graphical hex editor GHex. Since that's a GNOME project, I'm not quite sure it'll work with CentOS. wxHexEditor supposedly works with CentOS and looks promising from the website, although I've not yet used it. It's pitched as a "hex editor for massive files", so if your SQL file is large, that might be the way to go.
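If a hex editor is inconvenient, a short standard-library sketch can also report where the raw 0x80 bytes sit, along with their line number and surrounding bytes (the path is taken from the question; the whole file is read into memory, which is fine for a one-off check):
# Minimal sketch: scan the dump as raw bytes and show every 0x80 in context.
path = '/tmp/tables_processed/Carbon.consensus.observations'

with open(path, 'rb') as f:
    data = f.read()

offset = data.find(b'\x80')
while offset != -1:
    line_no = data.count(b'\n', 0, offset) + 1   # 1-based line number
    context = data[max(0, offset - 20):offset + 20]
    print('line %d, byte offset %d: %r' % (line_no, offset, context))
    offset = data.find(b'\x80', offset + 1)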

Cannot read in files

I have a small problem with reading in my file. My code:
import csv as csv
import numpy

with open("train_data.csv", "rb") as training:
    csv_file_object = csv.reader(training)
    header = csv_file_object.next()
    data = []
    for row in csv_file_object:
        data.append(row)
    data = numpy.array(data)
I get the error no such file "train_data.csv", so I know the problem lies with the location. But whenever I specify the path like this: open("C:\Desktop...etc) it doesn't work either. What am I doing wrong?
If you give the full file path, your script should work. Since it is not working, it must be that you have escape characters in your path. To fix this, use a raw string to specify the file path:
# Put an 'r' at the start of the string to make it a raw-string.
with open(r"C:\path\to\file\train_data.csv","rb") as training:
Raw strings do not process escape characters.
Also, just as a technical fact: not giving the full file path causes Python to look for the file in the directory the script is launched from. If it is not there, an error is thrown.
When you use open() on Windows, you need to deal with the backslashes properly.
Option 1.) Use a raw string; this is a string prefixed with an r.
open(r'C:\Users\Me\Desktop\train_data.csv')
Option 2.) Escape the backslashes
open('C:\\Users\\Me\\Desktop\\train_data.csv')
Option 3.) Use forward slashes
open('C:/Users/Me/Desktop/train_data.csv')
As for finding the file you are using: if you just do open('train_data.csv'), it looks in the directory you are running the Python script from. So, if you are running it from C:\Users\Me\Desktop\, your train_data.csv needs to be on the Desktop as well.
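If you want the script to find the file no matter where it is launched from, one option (a minimal standard-library sketch; the filename is taken from the question) is to build the path relative to the script itself:
import os

# Resolve train_data.csv relative to this script's directory,
# not the current working directory.
script_dir = os.path.dirname(os.path.abspath(__file__))
csv_path = os.path.join(script_dir, "train_data.csv")

with open(csv_path, "rb") as training:
    ...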
