I am fetching data from SQL databases (both Oracle and MS SQL) in Python code using the pyodbc and cx_Oracle packages. Python automatically converts all datetime fields from SQL to datetime.datetime. Is there any way I can capture the data exactly as it is in SQL into a file? The same thing happens to NULL and integer columns as well.
1) Date: value in DB (and expected): 12-AUG-19 12.00.01.000 -- Python output: 2019-08-12 00:00:01
2) NULL becomes NaN
3) Integer values 1 and 0 become True and False.
I tried to google the issue, and it seems to be a common issue across packages such as pyodbc, cx_Oracle, and pandas.read_sql.
I would like the data to appear exactly the same as in the database.
We are calling an Oracle/SQL Server stored procedure, NOT a SQL query, to get this result, and we can't change the stored procedure, so we cannot use CAST in the SQL query.
pyodbc's fetchall() output is the table as a list of rows; we lose the original formatting of the data as soon as it is captured in Python.
Could someone help with this issue?
I'm not sure about Oracle, but on the SQL Server side, you could change the command you use so that you capture the results of the stored proc in a temp table, and then you can CAST() the columns of the temp table.
So if you currently call a stored proc on SQL Server like this: EXEC {YourProcName}
Then you could change your command to something like this:
CREATE TABLE #temp
(
     col1 INT
    ,col2 DATETIME
    ,col3 VARCHAR(20)
);

INSERT INTO #temp
EXEC [sproc];

SELECT
     col1 = CAST(col1 AS VARCHAR(20))
    ,col2 = CAST(FORMAT(col2,'dd-MMM-yy ') AS VARCHAR) + REPLACE(CAST(CAST(col2 AS TIME(3)) AS VARCHAR),':','.')
    ,col3
FROM #temp;

DROP TABLE #temp;
You'll want to create your temp table using the same column names and datatypes that the proc outputs. Then you can CAST() numeric values to VARCHAR, and for dates/datetimes you can use FORMAT() to define your date string format. The example here should produce the format you want, 12-AUG-19 12.00.01.000. I couldn't find a single format string that gave the correct output, so I broke the date and time elements apart, formatted each in the expected way, and then concatenated the cast values.
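On the Python side, a rough pyodbc sketch of sending that batch might look like the following (connstr and the proc name come from your existing code and are placeholders here; SET NOCOUNT ON is added so the row-count messages from INSERT ... EXEC don't hide the final result set):

import pyodbc

conn = pyodbc.connect(connstr)   # connstr is your existing connection string
cursor = conn.cursor()

batch = """
SET NOCOUNT ON;
CREATE TABLE #temp (col1 INT, col2 DATETIME, col3 VARCHAR(20));
INSERT INTO #temp EXEC [sproc];
SELECT
     col1 = CAST(col1 AS VARCHAR(20))
    ,col2 = CAST(FORMAT(col2,'dd-MMM-yy ') AS VARCHAR) + REPLACE(CAST(CAST(col2 AS TIME(3)) AS VARCHAR),':','.')
    ,col3
FROM #temp;
DROP TABLE #temp;
"""
cursor.execute(batch)
rows = cursor.fetchall()   # every column now comes back as a string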
One of our old legacy SQL processes converts a numerical column using the HASHBYTES function with SHA2_256.
The entire process is moving to Python, as we are building some advanced usage on top of the legacy work. However, when we call the same SQL code through the Python connector, HASHBYTES('SHA2_256', column_name) returns values with a lot of garbage.
Running the code in SQL results in this:
Column Encoded_Column
101286297 0x7AC82B2779116F40A8CEA0D85BE4AA02AF7F813B5383BAC60D5E71B7BDB9F705
Running the same SQL query from Python results in:
Column Encoded_Column
101286297
b"z\xc8+'y\x11o#\xa8\xce\xa0\xd8[\xe4\xaa\x02\xaf\x7f\x81;S\x83\xba\xc6\r^q\xb7\xbd\xb9\xf7\x05"
Code is
Select Column,HASHBYTES('SHA2_256', CONVERT(VARBINARY(8),Column)) as Encoded_Column from table
I have tried the usual garbage removal, but it is not helping.
You are getting the right result, but it is displayed as raw bytes (this is why you have the b prefix in b"...").
Looking at the result from SQL, the data is displayed as hexadecimal.
So to transform the Python result you can do:
x = b"z\xc8+'y\x11o#\xa8\xce\xa0\xd8[\xe4\xaa\x02\xaf\x7f\x81;S\x83\xba\xc6\r^q\xb7\xbd\xb9\xf7\x05"
x.hex().upper()
And the result will be:
'7AC82B2779116F40A8CEA0D85BE4AA02AF7F813B5383BAC60D5E71B7BDB9F705'
Which is what you had in SQL.
You can read more about the 0x prefix at the start of the SQL result, which is not present in the Python output.
And finally, if you are working with pandas you can convert the whole column with:
df["Encoded_Column"] = df["Encoded_Column"].apply(lambda x: x.hex().upper())
# And if you want the '0x' at the start do:
df["Encoded_Column"] = "0x" + df["Encoded_Column"]
I am making a script that should create a schema for each customer. I'm fetching all the metadata from a database that defines how each customer's schema should look, and then I create it. Everything is well defined: the types, names of tables, etc. A customer has many tables (e.g., address, customers, contact, item), and each table has the same metadata.
My procedure now:
Get everything I need from the metadata database.
In a for loop, create a table, then ALTER TABLE to add each column from the metadata (this is done for each table).
Right now my script runs in about a minute for each customer, which I think is too slow. It has something to do with me having a loop, and in that loop, I’m altering each table.
I think that instead of me altering (which might be not so clever approach), I should do something like the following:
Note that this is just a stupid but valid example:
for table in tables:
    con.execute("CREATE TABLE IF NOT EXISTS tester.%s (%s, %s);", (table, "last_seen date", "valid_from timestamp"))
But it gives me this error (it seems like it reads the table name as a string in a string..):
psycopg2.errors.SyntaxError: syntax error at or near "'billing'"
LINE 1: CREATE TABLE IF NOT EXISTS tester.'billing' ('last_seen da...
Consider creating tables with a serial type (i.e., autonumber) ID field, and then use ALTER TABLE for all other fields, using a combination of sql.Identifier for identifiers (schema names, table names, column names, function names, etc.) and regular format for data types, which are not literals in the SQL statement.
from psycopg2 import sql

# CREATE TABLE
query = """CREATE TABLE IF NOT EXISTS {shm}.{tbl} (ID serial)"""
cur.execute(sql.SQL(query).format(shm=sql.Identifier("tester"),
                                  tbl=sql.Identifier("table")))

# ALTER TABLE
items = [("last_seen", "date"), ("valid_from", "timestamp")]
query = """ALTER TABLE {shm}.{tbl} ADD COLUMN {col} {typ}"""

for item in items:
    # KEEP IDENTIFIER PLACEHOLDERS
    final_query = query.format(shm="{shm}", tbl="{tbl}", col="{col}", typ=item[1])
    cur.execute(sql.SQL(final_query).format(shm=sql.Identifier("tester"),
                                            tbl=sql.Identifier("table"),
                                            col=sql.Identifier(item[0])))
Alternatively, use str.join with list comprehension for one CREATE TABLE:
query = """CREATE TABLE IF NOT EXISTS {shm}.{tbl} (
"id" serial,
{vals}
)"""
items = [("last_seen", "date"), ("valid_from", "timestamp")]
val = ",\n ".join(["{{}} {typ}".format(typ=i[1]) for i in items])
# KEEP IDENTIFIER PLACEHOLDERS
pre_query = query.format(shm="{shm}", tbl="{tbl}", vals=val)
final_query = sql.SQL(pre_query).format(*[sql.Identifier(i[0]) for i in items],
shm = sql.Identifier("tester"),
tbl = sql.Identifier("table"))
cur.execute(final_query)
SQL (sent to database)
CREATE TABLE IF NOT EXISTS "tester"."table" (
"id" serial,
"last_seen" date,
"valid_from" timestamp
)
However, the column-by-column ALTER TABLE approach becomes heavy, as there are too many server round trips.
How many tables with how many columns are you creating that this is slow? Could you ssh to a machine closer to your server and run the python there?
I don't get that error; rather, I get an SQL syntax error. A VALUES list is for conveying data, but ALTER TABLE is not about data, it is about metadata, so you can't use a VALUES list there. You need the names of the columns and types in double quotes (or no quotes) rather than single quotes. You can't have a comma between name and type, you can't have parentheses around each pair, and each pair needs to be introduced with "ADD"; you can't have it just once. You are using the wrong tool for the job. execute_batch is almost the right tool, except it will use single quotes rather than double quotes around the identifiers. Perhaps you could add a flag to tell it to use quote_ident.
Not only is execute_values the wrong tool for the job, but I think python in general might be as well. Why not just load from a .sql file?
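For what it's worth, here is a minimal sketch of the .sql-file idea (the file name schema.sql and the connection details are hypothetical):

import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")   # hypothetical connection details
with conn, conn.cursor() as cur:
    # schema.sql is a hypothetical file containing the DDL statements,
    # separated by semicolons; psycopg2 sends them as one batch
    with open("schema.sql") as f:
        cur.execute(f.read())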
Using Python and psycopg2 I am trying to build a dynamic SQL query to insert rows into tables.
The variables are:
1. Table name
2. Variable list of column names
3. Variable list of values, ideally entering multiple rows in one statement
The problems I have come across are the treatment of string literals from Python to SQL and psycopg2 trying to avoid you exposing your code to SQL injection attacks.
Using the sql module from psycopg2, I have resolved dynamically adding the table name and the list of columns. However, I am really struggling with adding the VALUES. Firstly, the values are put into the query as %(val)s and seem to be passed literally like this to the database, causing an error.
Secondly, I would then like to be able to add multiple rows at once.
Code below. All help much appreciated :)
import psycopg2 as pg2
from psycopg2 import sql
conn = pg2.connect(database='my_dbo',user='***',password='***')
cols = ['Col1','Col2','Col3']
vals = ['val1','val2','val3']
#Build query
q2 = sql.SQL("insert into my_table ({}) values ({})") \
    .format(sql.SQL(',').join(map(sql.Identifier, cols)),
            sql.SQL(',').join(map(sql.Placeholder, vals)))
When I print this string as print(q2.as_string(conn)) I get:
insert into my_table ("Col1","Col2","Col3") values %(val1)s,%(val2)s,%(val3)s
And then when I try to execute such a string I get the following error:
ProgrammingError: syntax error at or near "%"
LINE 1: ... ("Col1","Col2","Col3") values (%(val1)s...
^
OK, I solved this. First, use Literal rather than Placeholder; second, put your row values together as tuples within a tuple, loop through them adding each tuple to a list as Literals, and then drop that list in at the end when building the query. A rough sketch of that approach is below.
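Here is a minimal sketch of what that looks like (the column names and data rows are hypothetical, and cur is an open cursor):

from psycopg2 import sql

cols = ['Col1', 'Col2', 'Col3']
rows = (('a1', 'b1', 'c1'),   # hypothetical data rows
        ('a2', 'b2', 'c2'))

# One "(Literal, Literal, Literal)" snippet per row
row_snippets = [
    sql.SQL("({})").format(sql.SQL(',').join(map(sql.Literal, row)))
    for row in rows
]

q = sql.SQL("insert into my_table ({}) values {}").format(
    sql.SQL(',').join(map(sql.Identifier, cols)),
    sql.SQL(',').join(row_snippets),
)
cur.execute(q)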
This seems like a basic function, but I'm new to Python, so maybe I'm not googling this correctly.
In Microsoft SQL Server, when you have a statement like
SELECT top 100 * FROM dbo.Patient_eligibility
you get a result like
Patient_ID | Patient_Name | Patient_Eligibility
67456 | Smith, John | Eligible
...
etc.
I am running a connection to SQL through Python as such, and would like the output to look exactly the same as in SQL. Specifically - with column names and all the data rows specified in the SQL query. It doesn't have to appear in the console or the log, I just need a way to access it to see what's in it.
Here are my current code attempts:
import pyodbc
conn = pyodbc.connect(connstr)
cursor = conn.cursor()
sql = "SELECT top 100 * FROM [dbo].[PATIENT_ELIGIBILITY]"
cursor.execute(sql)
data = cursor.fetchall()
#Query1
for row in data:
    print(row[1])
#Query2
print (data)
#Query3
data
My understanding is that somehow the results of PATIENT_ELIGIBILITY are stored in the variable data. Query 1, 2, and 3 represent methods of accessing that data that I've googled for - again seems like basic stuff.
The results of #Query1 give me the list of the first column, without a column name in the console. In the variable explorer, 'data' appears as type List. When I open it up, it just says 'Row object of pyodbc module' 100 times, one for each row. Not what I'm looking for. Again, I'm looking for the same kind of view output I would get if I ran it in Microsoft SQL Server.
Running #Query2 gets me a little closer to this end. The results appear like a .csv file - unreadable, but it's all there, in the console.
Running #Query3, just the 'data' variable, gets me the closest result but with no column names. How can I bring in the column names?
More directly, how do I get 'data' to appear as a clean table with column names somewhere? Since this appears to be a basic SQL function, could you direct me to a SQL-friendly library to use instead?
Also note that neither of the Queries required me to know the column names or widths. My entire method here is attempting to eyeball the results of the Query and quickly check the data - I can't see that the Patient_IDs are loading properly if I don't know which column is patient_ids.
Thanks for your help!
This is more than one question, so I'll try to help you and give some advice.
I am running a connection to SQL through Python as such, and would like the output to look exactly the same as in SQL.
You are mixing up SQL as a language and the formatted output of some interactive SQL tool.
SQL itself does not say anything about the "look" of the data.
Also note that neither of the Queries required me to know the column names or widths. My entire method here is attempting to eyeball the results of the Query and quickly check the data - I can't see that the Patient_IDs are loading properly if I don't know which column is patient_ids.
Correct. cursor.fetchall returns only data.
Field information can be read from cursor.description.
Read more in PEP 249.
how do i get 'data' to appear as a clean table with column names somewhere?
It depends on how you define "appear".
Do you want text output, an HTML page, or maybe a GUI?
For text output: you can read the column names from cursor.description and print them before the data (see the sketch at the end of this answer).
If you want HTML/Excel/PDF/other, find a library/framework that suits your taste.
If you want an interactive experience similar to SQL tools, I recommend looking at jupyter-notebook + pandas.
Something like:
pandas.read_sql_query(sql, conn)
will give you a "clean table" that is no worse than what SQL Developer/SSMS/DBeaver/others give.
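For the plain-text option mentioned above, a minimal sketch (reusing the cursor from the question's code) could be:

# Column names come from cursor.description (PEP 249); rows come from fetchall()
columns = [col[0] for col in cursor.description]
rows = cursor.fetchall()

print(" | ".join(columns))
for row in rows:
    print(" | ".join(str(value) for value in row))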
We don't need any external libraries.
Refer to this for more details: Print results in MySQL format with Python.
However, the latest version of MySQL gives an error with that code, so I modified it.
Below is the query for the dataset
stri = "select * from table_name"
cursor.execute(stri)
data = cursor.fetchall()
mycon.commit()
The code below prints the dataset in tabular form:
def columnnm(name):
    # Length of the longest value in the column, used to size the column width
    v = "SELECT LENGTH("+name+") FROM table_name WHERE LENGTH("+name+") = (SELECT MAX(LENGTH("+name+")) FROM table_name) LIMIT 1;"
    cursor.execute(v)
    data = cursor.fetchall()
    mycon.commit()
    return data[0][0]

widths = []
columns = []
tavnit = '|'
separator = '+'

# Each column is as wide as its longest value or its header, whichever is larger
for cd in cursor.description:
    widths.append(max(columnnm(cd[0]), len(cd[0])))
    columns.append(cd[0])

# Build the row format string and the +---+ separator line
for w in widths:
    tavnit += " %-"+"%ss |" % (w,)
    separator += '-'*w + '--+'

print(separator)
print(tavnit % tuple(columns))
print(separator)
for row in data:
    print(tavnit % row)
print(separator)
I am running a sql notebook on databricks. I would like to analyze a table with half a billion records in it. I can run simple sql queries on the data. However, I need to change the date column type from str to date.
Unfortunately, UPDATE/ALTER statements do not seem to be supported by Spark SQL, so it seems I cannot modify the data in the table.
What would be the one-line of code that would allow me to convert the SQL table to a python data structure (in pyspark) in the next cell?
Then I could modify the file and return it to SQL.
dataFrame = sqlContext.sql('select * from myTable')
df = sqlContext.sql("select * from table")
To convert the dataframe back to a SQL view:
df.createOrReplaceTempView("myview")
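For the original goal of fixing the date column, a PySpark sketch might look like the following (the column name date_col and the date format are assumptions and should match your table):

from pyspark.sql import functions as F

df = sqlContext.sql("select * from myTable")

# Cast the string column to a proper date type; column name and format are hypothetical
df = df.withColumn("date_col", F.to_date(F.col("date_col"), "yyyy-MM-dd"))

# Expose the modified DataFrame to SQL cells again
df.createOrReplaceTempView("myTable_fixed")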