Dynamical loading data from Pandas into MSSQL Server using python - python

I'm trying to load database with number of tables with different number of columns from pandas into MS SQL Server, I've reached the last step:
I am dynamically creating the new tables in the database, base on the rows in pandas, fetching the data ... but I have issue with last step - adding the data into the columns, names of rows are being gathered dynamically, however I cannot pass them to the system so it would threat them as name of rows not as strings
with conn.cursor() as cursor:
cursor.execute("IF EXISTS(SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_NAME = '"+foldern[0]+"' AND TABLE_SCHEMA = 'dbo') DROP TABLE [dbo].["+foldern[0]+"];")
skuel = "CREATE TABLE " + foldern[0] + " ( "+all_columns+" ) AS NODE"
print(foldern[0])
cursor.execute(skuel)
conn.commit()
for index, row in df2.iterrows():
skuel2 = "INSERT INTO " + foldern[0] + " ( "+all_columns2+" ) values("+all_columns3+")"
cursor.execute(skuel2, *namesofrows)
Under +all_columns2+ - I do get name of columns, under values I got as many "?" as there is columns
Under "namesofrows" is a tupple that contains the names of rows from df2 (in format row.ROWNAME ), but those rows are not "resolved", what I mean by that, is that instead of having the values of rows coming from pandas, I'm having number of rows with same string which is the name of row (like row.id, row.types)
Any idea how to proceed with that?

That is good point, I was not aware of SQLAlchemy possibility which I will probably use - but - btw. I did solve the issue
resolvedvalues = []
for e in namesofrows:
resolvedvalues.append(eval(e))
cursor.execute(skuel2,(resolvedvalues))

Related

Insert data from pandas into sql db - keys doesn't fit columns

I have a database with around 10 columns. Sometimes I need to insert a row which has only 3 of the required columns, the rest are not in the dic.
The data to be inserted is a dictionary named row :
(this insert is to avoid duplicates)
row = {'keyword':'abc','name':'bds'.....}
df = pd.DataFrame([row]) # df looks good, I see columns and 1 row.
engine = getEngine()
connection = engine.connect()
df.to_sql('temp_insert_data_index', connection, if_exists ='replace',index=False)
result = connection.execute(('''
INSERT INTO {t} SELECT * FROM temp_insert_data_index
ON CONFLICT DO NOTHING''').format(t=table_name))
connection.close()
Problem : when I don't have all columns in the row(dic), it will insert dic fields by order (a 3 keys dic will be inserted to the first 3 columns) and not to the right columns. ( I expect the keys in dic to fit the db columns)
Why ?
Consider explicitly naming the columns to be inserted in INSERT INTO and SELECT clauses which is best practice for SQL append queries. Doing so, the dynamic query should work for all or subset of columns. Below uses F-string (available Python 3.6+) for all interpolation to larger SQL query:
# APPEND TO STAGING TEMP TABLE
df.to_sql('temp_insert_data_index', connection, if_exists='replace', index=False)
# STRING OF COMMA SEPARATED COLUMNS
cols = ", ".join(df.columns)
sql = (
f"INSERT INTO {table_name} ({cols}) "
f"SELECT {cols} FROM temp_insert_data_index "
"ON CONFLICT DO NOTHING"
)
result = connection.execute(sql)
connection.close()

Populating Table with Values from other table if ID not in DWH table

I am performing an ETL task where I am querying tables in a Data Warehouse to see if it contains IDs in a DataFrame (df) which was created by joining tables from the operational database.
The DataFrame only has ID columns from each joined table in the operational database. I have created a variable for each of these columns, e.g. 'billing_profiles_id' as below:
billing_profiles_dim_id = df['billing_profiles_dim_id']
I am attempting to iterated row by row to see if the ID here is in the 'billing_profiles_dim' table of the Data Warehouse. Where the ID is not present, I want to populate the DWH tables row by row using the matching ID rows in the ODB:
for key in billing_profiles_dim_id:
sql = "SELECT * FROM billing_profiles_dim WHERE id = '"+str(key)+"'"
dwh_cursor.execute(sql)
result = dwh_cursor.fetchone()
if result == None:
sqlQuery = "SELECT * from billing_profile where id = '"+str(key)+"'"
sqlInsert = "INSERT INTO billing_profile_dim VALUES ('"+str(key)+"','"+billing_profile.name"')
op_cursor = op_connector.execute(sqlInsert)
billing_profile = op_cursor.fetchone()
So far at least, I am receiving the following error:
SyntaxError: EOL while scanning string literal
This error message points at the close of barcket at
sqlInsert = "INSERT INTO billing_profile_dim VALUES ('"+str(key)+"','"+billing_profile.name"')
Which I am currently unable to solve. I'm also aware that this code may run into another problem or two. Could someone please see how I can solve the current issue and please ensure that I head down the correct path?
You are missing a double tick and a +
sqlInsert = "INSERT INTO billing_profile_dim VALUES ('"+str(key)+"','"+billing_profile.name+"')"
But you should really switch to prepared statements like
sql = "SELECT * FROM billing_profiles_dim WHERE id = '%s'"
dwh_cursor.execute(sql,(str(key),))
...
sqlInsert = ('INSERT INTO billing_profile_dim VALUES '
'(%s, %s )')
dwh_cursor.execute(sqlInsert , (str(key), billing_profile.name))

Is there a way to pass a nan value as null from a Pandas dataframe to a table in an Oracle database using SQLAlchemy?

I am attempting to merge values from a pandas dataframe into a table within an Oracle database using SQLAlchemy. The table has 137 columns, and some entries within the columns contain no value (represented by null in Oracle). Is there a way to enter no value (None/null) using sqlalchemy without the values being changed to CLOB within the database? I will be working with many large tables with different column specifications, so it would be good to find a generic way of doing this.
When I assign the numpy nan value, I get the following error:
sqlalchemy.exc.DatabaseError: (cx_Oracle.DatabaseError) ORA-00910: specified length too long for its datatype
The columns have been specified within the database with a variety of different precisions (some are only 1 character). Therefore entering 'nan' into some table cells will not work.
I have a dataframe, df, of dimensions n rows x 137 columns.
I have tried filling in the empty values with:
df.fillna(np.nan,inplace=True)
To specify the datatypes I use:
dtypes1 = {c:types.VARCHAR(df[c].str.len().max()) for c in df.columns[df.dtypes == 'object'].tolist()}
And then the following statement to transfer the table to a temporary table within Oracle (this works fine):
df.to_sql('table1', conn2, if_exists='replace', dtype=dtypes1, index=false)
I then attempt to merge the data from table1 to the original table, table2 by:
creating a connection to the Oracle db using an SQLAlchemy engine:
conn = create_engine('oracle+cx_oracle://.......')
Defining the merge statement:
SQL_statement = 'MERGE INTO ' + table2 + ' USING ' + table1 + ' ON (' + table1 + '.TIME = ' + table2 + '.TIME AND ' + table1 + '.ID = ' + table2 + '.ID) WHEN MATCHED THEN UPDATE SET ' + updateMsg + ' WHEN NOT MATCHED THEN INSERT (' + insertMsgB + ') VALUES (' + insertMsgA + ')'
executing the merge statement
conn.execute(SQL_statement)
I would like no value (null) to be entered within the table when the table changes are merged into the original Oracle table.
At the moment, I think my code is trying to write 'nan' into the cells, but this doesn't conform with the column specifications.
Any help with this would be much appreciated.

Why does multi-columns indexing in SQLite slow down the query's performance, unless indexing all columns?

I am trying to optimize the performance of a simple query to a SQLite database by using indexing. As an example, the table has 5M rows, 5 columns; the SELECT statement is to pick up all columns and the WHERE statement checks for only 2 columns. However, unless I have all columns in the multi-column index, the performance of the query is worse than without any index.
Did I index the column incorrectly, or when selecting all columns, am I supposed to include all of them in the index in order to improve performance?
Below each case # is the result I got when creating the SQLite database in hard-disk. However, for some reason using the ':memory:' mode made all the indexing cases faster than without index.
import sqlite3
import datetime
import pandas as pd
import numpy as np
import os
import time
# Simulate the data
size = 5000000
apps = [f'{i:010}' for i in range(size)]
dates = np.random.choice(pd.date_range('2016-01-01', '2019-01-01').to_pydatetime().tolist(), size)
prod_cd = np.random.choice([f'PROD_{i}' for i in range(30)], size)
models = np.random.choice([f'MODEL{i}' for i in range(15)], size)
categories = np.random.choice([f'GROUP{i}' for i in range(10)], size)
# create a db in memory
conn = sqlite3.connect(':memory:', detect_types=sqlite3.PARSE_DECLTYPES)
c = conn.cursor()
# Create table and insert data
c.execute("DROP TABLE IF EXISTS experiment")
c.execute("CREATE TABLE experiment (appId TEXT, dtenter TIMESTAMP, prod_cd TEXT, model TEXT, category TEXT)")
c.executemany("INSERT INTO experiment VALUES (?, ?, ?, ?, ?)", zip(apps, dates, prod_cd, models, categories))
# helper functions
def time_it(func):
def wrapper(*args, **kwargs):
start = time.time()
result = func(*args, **kwargs)
print("time for {} function is {}".format(func.__name__, time.time() - start))
return result
return wrapper
#time_it
def read_db(query):
df = pd.read_sql_query(query, conn)
return df
#time_it
def run_query(query):
output = c.execute(query).fetchall()
print(output)
# The main query
query = "SELECT * FROM experiment WHERE prod_cd IN ('PROD_1', 'PROD_5', 'PROD_10') AND dtenter >= '2018-01-01'"
# CASE #1: WITHOUT ANY INDEX
run_query("EXPLAIN QUERY PLAN " + query)
df = read_db(query)
>>> time for read_db function is 2.4783718585968018
# CASE #2: WITH INDEX FOR COLUMNS IN WHERE STATEMENT
run_query("DROP INDEX IF EXISTs idx")
run_query("CREATE INDEX idx ON experiment(prod_cd, dtenter)")
run_query("EXPLAIN QUERY PLAN " + query)
df = read_db(query)
>>> time for read_db function is 3.221407890319824
# CASE #3: WITH INDEX FOR MORE THEN WHAT IN WHERE STATEMENT, BUT NOT ALL COLUMNS
run_query("DROP INDEX IF EXISTs idx")
run_query("CREATE INDEX idx ON experiment(prod_cd, dtenter, appId, category)")
run_query("EXPLAIN QUERY PLAN " + query)
df = read_db(query)
>>>time for read_db function is 3.176532745361328
# CASE #4: WITH INDEX FOR ALL COLUMNS
run_query("DROP INDEX IF EXISTs idx")
run_query("CREATE INDEX idx ON experiment(prod_cd, dtenter, appId, category, model)")
run_query("EXPLAIN QUERY PLAN " + query)
df = read_db(query)
>>> time for read_db function is 0.8257918357849121
The SQLite Query Optimizer Overview says:
When doing an indexed lookup of a row, the usual procedure is to do a binary search on the index to find the index entry, then extract the rowid from the index and use that rowid to do a binary search on the original table. Thus a typical indexed lookup involves two binary searches.
Index entries are not in the same order as the table entries, so if a query returns data from most of the table's pages, all those random-access lookups are slower than just scanning all table rows.
Index lookups are more efficient than a table scan only if your WHERE condition filters out much more rows than are returned.
SQLite assumes that lookups on indexed columns have a high selectivity. You can get better estimates by running ANALYZE after filling the table.
But if all your queries are in a form where an index does not help, it wold be a better idea to not use an index at all.
When you create an index over all columns used in the query, the additional table accesses are no longer necessary:
If, however, all columns that were to be fetched from the table are already available in the index itself, SQLite will use the values contained in the index and will never look up the original table row. This saves one binary search for each row and can make many queries run twice as fast.
When an index contains all of the data needed for a query and when the original table never needs to be consulted, we call that index a "covering index".

MYSQL: how to insert statement without specifying col names or question marks?

I have a list of tuples of which i'm inserting into a Table.
Each tuple has 50 values. How do i insert without having to specify the column names and how many ? there is?
col1 is an auto increment column so my insert stmt starts in col2 and ends in col51.
current code:
l = [(1,2,3,.....),(2,4,6,.....),(4,6,7,.....)...]
for tup in l:
cur.execute(
"""insert into TABLENAME(col2,col3,col4.........col50,col51)) VALUES(?,?,?,.............)
""")
want:
insert into TABLENAME(col*) VALUES(*)
MySQL's syntax for INSERT is documented here: http://dev.mysql.com/doc/refman/5.7/en/insert.html
There is no wildcard syntax like you show. The closest thing is to omit the column names:
INSERT INTO MyTable VALUES (...);
But I don't recommend doing that. It works only if you are certain you're going to specify a value for every column in the table (even the auto-increment column), and your values are guaranteed to be in the same order as the columns of the table.
You should learn to use code to build the SQL query based on arrays of values in your application. Here's a Python example the way I do it. Suppose you have a dict of column: value pairs called data_values.
placeholders = ['%s'] * len(data_values)
sql_template = """
INSERT INTO MyTable ({columns}) VALUES ({placeholders})
"""
sql = sql_template.format(
columns=','.join(keys(data_values)),
placeholders=','.join(placeholders)
)
cur = db.cursor()
cur.execute(sql, data_values)
example code to put before your code:
cols = "("
for x in xrange(2, 52):
cols = cols + "col" + str(x) + ","
test = test[:-1]+")"
Inside your loop
for tup in l:
cur.execute(
"""insert into TABLENAME " + cols " VALUES {0}".format(tup)
""")
This is off the top of my head with no error checking

Categories