Python MySQL UTF-8 encoding differs depending on order of execution - python

I recently inherited a python project and I've got some behavior I'm struggling to account for.
The code has two sections, it can import a file into the database, or it can dump the database to an output file. The import looks something like this:
def importStuff(self):
    mysqlimport_args = ['mysqlimport', '--host='+self.host, '--user='+self.username,
                        '--password='+self.password, '--fields-terminated-by=|',
                        '--lines-terminated-by=\n', '--replace', '--local',
                        self.database, filename, '-v']
    output = check_output(mysqlimport_args)
The dump looks like this:
def getStuff(self):
    db = MySQLdb.connect(self.host, self.username, self.password, self.database)
    cursor = db.cursor()
    sql = 'SELECT somestuff'
    cursor.execute(sql)
    records = cursor.fetchall()
    cursor.close()
    db.close()
    return records

def toCsv(self, records, csvfile):
    f = open(csvfile, 'wb')
    writer = csv.writer(f, quoting=csv.QUOTE_ALL)
    writer.writerow(['StuffId'])
    count = 1
    for record in records:
        writer.writerow([record[0]])
    f.close()
Okay, not the prettiest Python you'll ever see (style comments welcome, as I'd love to learn more), but it seems reasonable.
But I got a complaint from a consumer that my output wasn't in UTF-8 (the MySQL table uses utf8 encoding, by the way). Here's where I get lost: if the program executes like this:
importStuff(...)
getStuff(...)
toCsv(...)
Then the output file doesn't appear to be valid UTF-8. But when I break the execution into two separate steps:
importStuff(...)
then in another file
getStuff(...)
toCsv(...)
Suddenly my output appears as valid UTF-8. Aside from the fact that I have a workaround, I can't explain this behavior. Can anyone shed some light on what I'm doing wrong here? Or is there more information I can provide that might clarify what's going on?
Thanks.
(python 2.7 in case that factors in)
EDIT: More code as requested. I've made some minor tweaks to protect the innocent such as my company, but it's more or less here:
def main():
    dbutil = DbUtil(config.DB_HOST, config.DB_DATABASE, config.DB_USERNAME, config.DB_PASSWORD)
    if(args.import):
        logger.info('Option: --import')
        try:
            dbutil.mysqlimport(AcConfig.DB_FUND_TABLE)
        except Exception, e:
            logger.warn("Error occurred at mysqlimport. Error is %s" % (e.message))
    if(args.db2csv):
        try:
            logger.info('Option: --db2csv')
            records = dbutil.getStuff()
            fileutil.toCsv(records, csvfile)
        except Exception, e:
            logger.warn("Error occurred at db2csv. Message: %s" % (e.message))

main()
And that's about it. It's really short which is making this much less obvious.
I'm not sure how to faithfully represent the output; it looks something like this:
"F0NR006F8F"
They all look more or less like ASCII characters to me, so I'm not sure what problem they could be creating. Maybe I'm approaching this from the wrong angle; I'm currently relying on my text editor's best guess at the file's encoding. I'm not sure how best to detect which character is causing it to stop reading my file as UTF-8.
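Rather than relying on an editor's guess, one way to pin down the first offending byte is to decode the raw bytes and read the offset off the exception (a small sketch; nothing here is specific to the original code):

```python
def find_utf8_error(data):
    """Return the byte offset of the first invalid UTF-8 sequence
    in the byte string `data`, or None if it decodes cleanly."""
    try:
        data.decode('utf-8')
        return None
    except UnicodeDecodeError as e:
        return e.start  # offset of the first undecodable byte
```

Feeding it open(path, 'rb').read() points straight at the byte the editor is choking on.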

Dumbest answer of all time. The input data wasn't in UTF-8. Someone had already solved this by writing another stored procedure that runs periodically to convert the non-UTF-8 characters to UTF-8. In the time it took me to break my code into two files and run them separately, that job ran. It just happened to work out that way the 4-5 times I tried it, leading me to a false conclusion. I'm now changing the read process to accommodate a non-UTF-8 input source so I don't have a weird race condition hiding in the system. Sorry to have led you all on this goose chase.

Related

File.write will not allow me to add to the textfile?

In my class, my instructor went over code to add names to a text file and to identify an IOError if one occurs. I copied the code he wrote exactly. It worked for him but not for me. The only difference I can think of is that he was using an older version of Python (it was an online video from 2017; I'm taking online classes). I am currently using Python 3.8. Here is the code:
try:
    file = open("namesList.txt", "a")
    file.write("EOF")
except IOError:
    print("IO Error")
    file.close()
else:
    print("EOF written successfully")
    file.close()
I have tried pulling the code out of the try block to see if that works, but no errors popped up. It will still print "EOF written successfully" while in the try block and outside of it, but in the text file "EOF" does not show up.
I hope I explained it well enough, let me know if I need to clarify anything else. Thank you!
JessDee, the code is working for me as it is.
In any case, I think you should consider using the with statement when working with files.
It's cleaner, it's more pythonic. That way, you don't need to worry about closing the file. Python will do it for you.
I don't know if it will fix your problem, but it's something to consider.
This would be your code using with statement:
try:
    with open("namesList.txt", "a+") as file:
        file.write("EOF")
        print("EOF written successfully")
except IOError:
    print("IO Error")
Notice I used a+ instead of a. This means the file is opened for both appending and reading.
Since we don't know the exact nature of your problem, I don't know if it will solve it, but it'll help you from now on. Good luck !

How to read a SQL file with python with proper character encoding?

Based on a few examples (here, here) I can use psycopg2 to read and hopefully run a SQL file from python (the file is 200 lines, and though runs quickly, isn't something I want to maintain in a python file).
Here is the script to read and run the file:
sql = open("sql/sm_bounds_current.sql", "r").read()
curDest.execute(sql)
However, when I run the script, the following error is thrown:
Error: syntax error at or near "drop"
LINE 1: drop table dpsdata.sm_boundaries_current_dev;
As you can see, the first line in the script is to drop a table, but I'm not sure why the extra characters are being read, and can't seem to find a solution that might set the encoding of the file when reading it.
Found this post dealing with encoding and byte order marks, which is where my problem was.
The solution was to import open from codecs, which supports an encoding option on open:
import codecs
from codecs import open
sqlfile = "sql/sm_bounds_current.sql"
sql = open(sqlfile, mode='r', encoding='utf-8-sig').read()
curDest.execute(sql)
Seems to work great!
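The difference is easy to demonstrate in isolation: 'utf-8-sig' strips a leading byte-order mark that plain 'utf-8' passes through as U+FEFF, which is exactly the invisible character that made the drop statement look malformed. (In Python 3 the built-in open() accepts encoding= directly, so the codecs import isn't needed there.)

```python
import codecs

# A file saved "UTF-8 with BOM" starts with these three bytes.
raw = codecs.BOM_UTF8 + b'drop table dpsdata.sm_boundaries_current_dev;'

with_bom = raw.decode('utf-8')         # keeps the BOM as '\ufeff'
without_bom = raw.decode('utf-8-sig')  # strips it
```
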

using the input file object to read a variable from a file in python

So I'm thinking this is one of those problems where I can't see the forest for the tree. Here is the assignment:
Using the file object input, write code that read an integer from a file called
rawdata into a variable datum (make sure you assign an integer value to datum).
Open the file at the beginning of your code, and close it at the end.
okay so first thing: I thought the input function was for assigning data to an object such as a variable, not for reading data from an object. Wouldn't that be read.file_name ?
But I gave it shot:
infile = open('rawdata','r')
datum = int(input.infile())
infile.close()
Now, first problem: MyProgrammingLab doesn't want to grade it. By that I mean I type in the code, click 'submit', and I get the "Checking" screen. And that's it. At the time of writing, my latest attempt to submit has been 'checking' for 11 minutes. It's not giving me an error; it's just stuck 'checking', I guess.
At the moment I can't use Python to try the program myself, because I'm on a school computer that is write-locked, so even if I have the code right (I doubt it), the program will fail to run: it can neither find the file rawdata nor create it.
So... what's the deal? Am I reading the instructions wrong, or is it telling me to use input in some other way than I'm trying to use it? Or am I supposed to be using a different method?
You are so close. You're just using the file object slightly incorrectly. Once it's open, you can just .read() it, and get the value.
It would probably look something like this
infile = open('rawdata','r')
datum = int(infile.read())
infile.close()
I feel like your confusion is based purely on the wording of the question: the term "file object input" can certainly be confusing if you haven't worked with Python I/O before. In this case, the "file object" is infile, and the "input" would be the rawdata file, I suppose.
Currently taking this class and figured this out. This is my contribution to all of us college peeps just trying to make it through, lol. MPL accepts this answer.
input = open('rawdata','r')
datum = int(input.readline())
input.close()
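One small detail about both versions: int() ignores surrounding whitespace, so the trailing newline that readline() leaves on the string is harmless:

```python
# int() strips leading/trailing whitespace before parsing, so the
# newline from readline() (or stray spaces) doesn't break the cast.
datum = int("42\n")
```

That said, naming the file object input shadows the built-in input() function for the rest of the scope, which is legal but worth avoiding outside of an autograder that demands it.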

Why do I get a SyntaxError <unicode error> on my import statement? Should this not be simple?

Here's my first simple test program in Python. Without importing the os library, the program runs fine, leading me to believe there's something wrong with my import statement; however, this is the only way I ever see them written. Why am I still getting a syntax error?
import os # <-- why does this line give me a syntax error?!?!?! <unicode error> -->
CalibrationData = r'C:/Users/user/Desktop/blah Redesign/Data/attempts at data gathering/CalibrationData.txt'
File = open(CalibrationData, 'w')
File.write('Test')
File.close()
My end goal is to write a simple program that will look through a directory and tabularize data from relevant .ini files within it.
Well, as MDurant pointed out... I pasted in some unprintable character, probably when I entered the URL.

python limit must be integer

I'm trying to run the following code but for some reason I get the following error: "TypeError: limit must be an integer".
Reading the CSV data file:
import sys
import csv

maxInt = sys.maxsize
decrement = True
while decrement:
    decrement = False
    try:
        csv.field_size_limit(maxInt)  # <-- TypeError raised here
    except OverflowError:
        maxInt = int(maxInt / 10)
        decrement = True

with open("Data.csv", 'rb') as textfile:
    text = csv.reader(textfile, delimiter=" ", quotechar='|')
    for line in text:
        print ' '.join(line)
The error occurs on the marked line. I have only added the extra bit above the csv.reader call because the file was too large to read normally. Alternatively, I could convert the file from CSV to a plain text file, but I'm not sure whether that would corrupt the data further. I can't actually see any of the data, since the file is >2GB and hence costly to open.
Any ideas? I'm fairly new to Python but I'd really like to learn a lot more.
I'm not sure whether this qualifies as an answer or not, but here are a few things:
First, the csv reader automatically buffers per line of the CSV, so the file size shouldn't matter too much, 2KB or 2GB, whatever.
What might matter is the number of columns or amount of data inside the fields themselves. If this CSV contains War and Peace in each column, then yeah, you're going to have an issue reading it.
Some ways to potentially debug are to run print sys.maxsize, and to just open up a python interpreter, import sys, csv and then run csv.field_size_limit(sys.maxsize). If you are getting some terribly small number or an exception, you may have a bad install of Python. Otherwise, try to take a simpler version of your file. Maybe the first line, or the first several lines and just 1 column. See if you can reproduce the smallest possible case and remove the variability of your system and the file size.
On Windows 7 64-bit with Python 2.6, maxInt = sys.maxsize returns 9223372036854775807L, which consequently results in a TypeError: limit must be an integer when calling csv.field_size_limit(maxInt). Interestingly, using maxInt = int(sys.maxsize) does not change this. A crude workaround is to simply use csv.field_size_limit(2147483647), which of course causes issues on other platforms. In my case this was adequate to identify the broken value in the CSV, fix the export options in the other application, and remove the need for csv.field_size_limit().
-- originally posted by user roskakori on this related question
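For reference, the back-off loop from the question can be wrapped in a small helper so the workaround lives in one place (a Python 3 sketch of the same technique, not new behavior):

```python
import csv
import sys

def set_max_field_size():
    """Raise csv's field size limit as high as the platform allows.

    csv.field_size_limit() rejects values that overflow a C long
    (e.g. sys.maxsize on some Windows builds), so back off by
    factors of ten until a value is accepted. Returns the limit set.
    """
    limit = sys.maxsize
    while True:
        try:
            csv.field_size_limit(limit)
            return limit
        except OverflowError:
            limit = int(limit / 10)
```
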
