I am loading data from various sources (csv, xls, json etc...) into Pandas dataframes and I would like to generate statements to create and fill a SQL database with this data. Does anyone know of a way to do this?
I know pandas has a to_sql function, but that only works with a database connection; it cannot generate a string.
Example
What I would like is to take a dataframe like so:
import pandas as pd
import numpy as np
dates = pd.date_range('20130101',periods=6)
df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))
And a function that would generate this (this example is PostgreSQL but any would be fine):
CREATE TABLE data
(
index timestamp with time zone,
"A" double precision,
"B" double precision,
"C" double precision,
"D" double precision
)
If you only want the 'CREATE TABLE' sql code (and not the insert of the data), you can use the get_schema function of the pandas.io.sql module:
In [10]: print(pd.io.sql.get_schema(df.reset_index(), 'data'))
CREATE TABLE "data" (
"index" TIMESTAMP,
"A" REAL,
"B" REAL,
"C" REAL,
"D" REAL
)
Some notes:
I had to use reset_index because it otherwise didn't include the index
If you provide a sqlalchemy engine of a certain database flavor, the result will be adjusted to that flavor (e.g. the data type names).
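For example, a minimal sketch of that (the PostgreSQL connection string is a placeholder you would adapt):
from sqlalchemy import create_engine
engine = create_engine('postgresql://user:password@localhost:5432/mydb')  # placeholder credentials
# with an engine passed as con, get_schema emits that flavor's type names
print(pd.io.sql.get_schema(df.reset_index(), 'data', con=engine))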
GENERATE SQL CREATE STATEMENT FROM DATAFRAME
SOURCE = df        # source dataframe
TARGET = 'data'    # name of the table to be created in the database
def SQL_CREATE_STATEMENT_FROM_DATAFRAME(SOURCE, TARGET):
    # SQL_CREATE_STATEMENT_FROM_DATAFRAME(SOURCE, TARGET)
    # SOURCE: source dataframe
    # TARGET: target table to be created in database
    import pandas as pd
    sql_text = pd.io.sql.get_schema(SOURCE.reset_index(), TARGET)
    return sql_text
Check the SQL CREATE TABLE Statement String
sql_text = SQL_CREATE_STATEMENT_FROM_DATAFRAME(SOURCE, TARGET)
print(sql_text)
GENERATE SQL INSERT STATEMENT FROM DATAFRAME
def SQL_INSERT_STATEMENT_FROM_DATAFRAME(SOURCE, TARGET):
    sql_texts = []
    for index, row in SOURCE.iterrows():
        sql_texts.append('INSERT INTO ' + TARGET + ' (' + str(', '.join(SOURCE.columns)) + ') VALUES ' + str(tuple(row.values)))
    return sql_texts
Check the SQL INSERT INTO Statement String
sql_texts = SQL_INSERT_STATEMENT_FROM_DATAFRAME(SOURCE, TARGET)
print('\n\n'.join(sql_texts))
Insert Statement Solution
Not sure if this is the absolute best way to do it, but it is more efficient than using df.iterrows(), which is very slow. It also takes care of nan values with the help of regular expressions.
import re
def get_insert_query_from_df(df, dest_table):
    insert = """
    INSERT INTO `{dest_table}` (
        """.format(dest_table=dest_table)

    columns_string = str(list(df.columns))[1:-1]
    columns_string = re.sub(r' ', '\n        ', columns_string)
    columns_string = re.sub(r'\'', '', columns_string)

    values_string = ''

    for row in df.itertuples(index=False, name=None):
        values_string += re.sub(r'nan', 'null', str(row))
        values_string += ',\n'

    return insert + columns_string + ')\n     VALUES\n' + values_string[:-2] + ';'
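For instance, a quick usage sketch (reusing the df from the question; the index is simply left out of the insert here):
query = get_insert_query_from_df(df, 'data')
print(query)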
If you want to write the file yourself, you can also retrieve the column names and dtypes and build a dictionary to convert the pandas data types to SQL data types.
As an example:
import pandas as pd
import numpy as np
dates = pd.date_range('20130101',periods=6)
df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))
tableName = 'table'
columnNames = df.columns.values.tolist()
columnTypes = [x.name for x in df.dtypes.values]  # a list (not a map object), so it can be assigned with .loc below
# Storing column names and dtypes in a dataframe
tableDef = pd.DataFrame(index = range(len(df.columns) + 1), columns=['cols', 'dtypes'])
tableDef.iloc[0] = ['index', df.index.dtype.name]
tableDef.loc[1:, 'cols'] = columnNames
tableDef.loc[1:, 'dtypes'] = columnTypes
# Defining a dictionary to convert dtypes
conversion = {'datetime64[ns]':'timestamp with time zone', 'float64':'double precision'}
# Writing sql in a file
f = open('yourdir\%s.sql' % tableName, 'w')
f.write('CREATE TABLE %s\n' % tableName)
f.write('(\n')
for i, row in tableDef.iterrows():
    sep = ",\n" if i < tableDef.index[-1] else "\n"
    f.write('\t\"%s\" %s%s' % (row['cols'], conversion[row['dtypes']], sep))
f.write(')')
f.close()
You can proceed the same way to populate your table with INSERT INTO statements, as sketched below.
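A rough sketch of that idea, appending INSERT statements to the same file (the quoting is minimal and assumes only the timestamp index and numeric columns of the example):
with open('yourdir\%s.sql' % tableName, 'a') as f:
    for idx, row in df.iterrows():
        # quote the timestamp index, leave the double precision values bare
        values = ", ".join(["'%s'" % idx] + [str(v) for v in row.values])
        f.write('INSERT INTO %s VALUES (%s);\n' % (tableName, values))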
SINGLE INSERT QUERY SOLUTION
I didn't find the above answers to suit my needs. I wanted to create one single insert statement for a dataframe with each row as the values. This can be achieved by the below:
import re
import pandas as pd
table = 'your_table_name_here'
# You can read from CSV file here... just using read_sql_query as an example
df = pd.read_sql_query(f'select * from {table}', con=db_connection)
cols = ', '.join(df.columns.to_list())
vals = []
for index, r in df.iterrows():
    row = []
    for x in r:
        row.append(f"'{str(x)}'")
    row_str = ', '.join(row)
    vals.append(row_str)

f_values = []
for v in vals:
    f_values.append(f'({v})')
# Handle inputting NULL values
f_values = ', '.join(f_values)
f_values = re.sub(r"('None')", "NULL", f_values)
sql = f"insert into {table} ({cols}) values {f_values};"
print(sql)
db.dispose()
If you're just looking to generate a string with inserts based on a pandas.DataFrame, I'd suggest using the bulk SQL insert syntax, as suggested by @rup.
Here's an example of a function I wrote for that purpose:
import pandas as pd
import re
def df_to_sql_bulk_insert(df: pd.DataFrame, table: str, **kwargs) -> str:
    """Converts DataFrame to bulk INSERT sql query
    >>> data = [(1, "_suffixnan", 1), (2, "Noneprefix", 0), (3, "fooNULLbar", 1, 2.34)]
    >>> df = pd.DataFrame(data, columns=["id", "name", "is_deleted", "balance"])
    >>> df
       id        name  is_deleted  balance
    0   1  _suffixnan           1      NaN
    1   2  Noneprefix           0      NaN
    2   3  fooNULLbar           1     2.34
    >>> query = df_to_sql_bulk_insert(df, "users", status="APPROVED", address=None)
    >>> print(query)
    INSERT INTO users (id, name, is_deleted, balance, status, address)
    VALUES (1, '_suffixnan', 1, NULL, 'APPROVED', NULL),
           (2, 'Noneprefix', 0, NULL, 'APPROVED', NULL),
           (3, 'fooNULLbar', 1, 2.34, 'APPROVED', NULL);
    """
    df = df.copy().assign(**kwargs)
    columns = ", ".join(df.columns)
    tuples = map(str, df.itertuples(index=False, name=None))
    values = re.sub(r"(?<=\W)(nan|None)(?=\W)", "NULL", (",\n" + " " * 7).join(tuples))
    return f"INSERT INTO {table} ({columns})\nVALUES {values};"
By the way, it converts nan/None entries to NULL, and it is possible to pass constant column=value pairs as keyword arguments (see the status="APPROVED" and address=None arguments in the docstring example).
Generally, this works faster because a database does a lot of work for every single insert: checking constraints, building indices, flushing, writing to the log, etc. These complex operations can be optimized by the database when they happen in one multi-row statement rather than calling the engine row by row.
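A minimal execution sketch, assuming a SQLite connection plus the df and helper from the docstring above (the in-memory database and CREATE TABLE are placeholders):
import sqlite3
conn = sqlite3.connect(':memory:')  # placeholder database
conn.execute("CREATE TABLE users (id INTEGER, name TEXT, is_deleted INTEGER, balance REAL, status TEXT, address TEXT)")
query = df_to_sql_bulk_insert(df, "users", status="APPROVED", address=None)
with conn:  # commits the whole multi-row insert as one statement
    conn.execute(query)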
Taking @Jaris's post to get the CREATE statement, I extended it further to work for any CSV:
import sqlite3
import pandas as pd
db = './database.db'
csv = './data.csv'
table_name = 'data'
# create db and setup schema
df = pd.read_csv(csv)
create_table_sql = pd.io.sql.get_schema(df.reset_index(), table_name)
conn = sqlite3.connect(db)
c = conn.cursor()
c.execute(create_table_sql)
conn.commit()
# now we can insert data
def insert_data(row, c):
    values = str(row.name) + ',' + ','.join([str('"' + str(v) + '"') for v in row])
    sql_insert = f"INSERT INTO {table_name} VALUES ({values})"
    try:
        c.execute(sql_insert)
    except Exception as e:
        print(f"SQL:{sql_insert} \n failed with Error:{e}")
# use apply to loop over dataframe and call insert_data on each row
df.apply(lambda row: insert_data(row, c), axis=1)
# finally commit all those inserts into the database
conn.commit()
Hopefully this is simpler than the alternative answers and more pythonic!
If you can forego generating an intermediate representation of the SQL statement, you can also just execute the insert statement outright.
con.executemany("INSERT OR REPLACE INTO data (A, B, C, D) VALUES (?, ?, ?, ?)", list(df_.values))
This worked a little better as there is less messing around with string generation.
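A fuller sketch of that approach, assuming a SQLite database and the df from the question (table and column names are placeholders):
import sqlite3
con = sqlite3.connect(':memory:')  # placeholder connection
con.execute("CREATE TABLE data (A REAL, B REAL, C REAL, D REAL)")
# executemany binds one row of values per parameter tuple
con.executemany("INSERT OR REPLACE INTO data (A, B, C, D) VALUES (?, ?, ?, ?)", list(df.values))
con.commit()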
Related
I am not experienced with SQL or SQLite3.
I have a list of ids from another table. I want to use the list as a key in my query and get all records based on the list. I want the SQL query to feed directly into a DataFrame.
import pandas as pd
import sqlite3
cnx = sqlite3.connect('c:/path/to/data.sqlite')
# the below values are ones found in "s_id"
id_list = ['C20','C23','C25','C28', ... ,'C83']
# change list to sql string.
id_sql = ", ".join(str(x) for x in id_list)
df = pd.read_sql_query(f"SELECT * FROM table WHERE s_id in ({id_sql})", cnx)
I am getting a DatabaseError: Execution failed on sql 'SELECT * FROM ... : no such column: C20.
When I saw this error I thought the code just needs a simple switch. So I tried this
df = pd.read_sql_query(f"SELECT * FROM table WHERE ({id_sql}) in s_id", cnx)
it did not work.
So how can I get this to work?
The table is like:
id    s_id   date        assigned_to   date_complete   notes
0     C10    1/6/2020    Jack          1/8/2020        None
1     C20    1/10/2020   Jane          1/12/2020       Call back
2     C23    1/11/2020   Henry         1/12/2020       finished
n     C83    ...         (more rows of data)
n+1   D85    9/10/2021   Jeni          9/12/2021       Call back
Currently, you are missing the single quotes around your literal values, and consequently the SQLite engine assumes you are attempting to query columns. However, avoid concatenating values altogether and instead bind them to parameters, which pandas.read_sql supports with the params argument:
# the below values are ones found in "s_id"
id_list = ['C20','C23','C25','C28', ... ,'C83']
# build equal length string of ? place holders
prm_list = ", ".join("?" for _ in id_list)
# build prepared SQL statement
sql = f"SELECT * FROM table WHERE s_id IN ({prm_list})"
# run query, passing parameters and values separately
df = pd.read_sql(sql, con=cnx, params=id_list)
The problem is that the SQL string is missing single quote marks around the literal values, so each s_id value in the IN (...) clause needs a ' on each side.
import pandas as pd
import sqlite3
cnx = sqlite3.connect('c:/path/to/data.sqlite')
# the below values are ones found in "s_id"
id_list = ['C20','C23','C25','C28', ... ,'C83']
# change list to sql string.
id_sql = "', '".join(str(x) for x in id_list)
df = pd.read_sql_query(f"SELECT * FROM table WHERE s_id in ('{id_sql}')", cnx)
I have an access table called "Cell_list" with a key column called "Cell_#". I want to read the table into a dataframe, but only the rows that match indices which are specified in a python list "cell_numbers".
I tried several variations on:
import pyodbc
import pandas as pd
cell_numbers = [1,3,7]
cnn_str = r'Driver={Microsoft Access Driver (*.mdb,*.accdb)};DBQ=C:\folder\myfile.accdb;'
conn = pyodbc.connect(cnn_str)
query = ('SELECT * FROM Cell_list WHERE Cell_# in '+tuple(cell_numbers))
df = pd.read_sql(query, conn)
But no matter what I try I get a syntax error.
How do I do this?
Consider best practice of parameterization which is supported in pandas.read_sql:
# PREPARED STATEMENT, NO DATA
query = (
'SELECT * FROM Cell_list '
'WHERE [Cell_#] IN (?, ?, ?)'
)
# RUN SQL WITH BINDED PARAMS
df = pd.read_sql(query, conn, params=cell_numbers)
Consider even dynamic qmark placeholders dependent on length of cell_numbers:
qmarks = ', '.join('?' for _ in cell_numbers)
query = (
'SELECT * FROM Cell_list '
f'WHERE [Cell_#] IN ({qmarks})'
)
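As with the first query, the values are passed separately from the SQL (a sketch reusing the same conn and cell_numbers):
df = pd.read_sql(query, conn, params=cell_numbers)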
Convert (join) cell_numbers to text:
cell_text = '(1,3,7)'
and concatenate this.
The finished SQL should read (you may need brackets around the weird field name Cell_#):
SELECT * FROM Cell_list WHERE [Cell_#] IN (1,3,7)
I want to append a dataframe (pandas) to my table in Oracle.
But this code deletes all rows in the table :(
My dataframe and my result become this:
0, 0, 0, ML_TEST, 0, 5
0, 0, 0, ML_TEST, 0, 6
by the code block below:
import cx_Oracle
import pandas as pd
from sqlalchemy import types, create_engine
dataset = pd.read_csv("denemedf.txt", delimiter=",")
print(dataset)
from sqlalchemy import create_engine
engine = create_engine('oracle://***:***@***:***/***', echo=False)
dataset.to_sql(name='dev_log',con=engine ,if_exists = 'append', index=False)
How can I append this dataframe's rows to the end of the table without deleting the existing rows?
When I tried again it did append to the end, but on the first try it deleted all existing rows.
How can I do this reliably without causing any problems?
Actually, the problem occurs because of the schema of this table.
The table is owned by gnl, but I connected as prg, so to_sql could not find the table and created another one.
Is there any way to pass the owner or schema to this function?
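For the schema question specifically, DataFrame.to_sql accepts a schema argument, so a minimal sketch (assuming the prg account has the needed privileges on the gnl schema) would be:
dataset.to_sql('dev_log', con=engine, schema='gnl', if_exists='append', index=False)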
I think this may help :
import cx_Oracle
import pandas as pd

dataset = pd.read_csv("C:\\pathToFile\\denemedf.txt", delimiter=",")

con = cx_Oracle.connect('uname/pwd@serverName:port/instanceName')
cursor = con.cursor()

sql = 'INSERT INTO gnl.tbl_deneme VALUES(:1,:2,:3,:4,:5,:6)'

df_list = dataset.values.tolist()

n = 0
for i in dataset.iterrows():
    cursor.execute(sql, df_list[n])
    n += 1

con.commit()
cursor.close()
con.close()
provided that the insert privilege is already granted to the schema prg for your table tbl_deneme
(after connecting as gnl -> grant insert on tbl_deneme to prg),
and where your text file (denemedf.txt) is assumed to be:
col1,col2,col3,col4,col5,col6
0, 0, 0, ML_TEST, 0, 5
0, 0, 0, ML_TEST, 0, 6
Moreover, a more dynamic and more performant option using cursor.executemany can be provided: it creates the table if it does not exist, using the column names from the first line of the file, and inserts the values from the remaining lines without explicitly specifying the bind variables one by one:
import cx_Oracle
import pandas as pd

con = cx_Oracle.connect(user, password, host + ':' + port + '/' + dbname)
cursor = con.cursor()

tab_name = 'gnl.tbl_deneme'
cursor.execute('SELECT COUNT(*) FROM user_tables WHERE table_name = UPPER(:1) ', [tab_name])
exs = cursor.fetchone()[0]

df = pd.read_csv('C:\\pathToFile\\denemedf.txt', sep=',', dtype=str)
col = df.columns.tolist()

crt = ""
for k in col:
    crt += ''.join(k) + ' VARCHAR2(4000),'

if int(exs) == 0:
    crt = 'CREATE TABLE ' + tab_name + ' (' + crt.rstrip(",") + ')'
    cursor.execute(crt)

vrs = ""
for i in range(0, len(col)):
    vrs += ':' + str(i + 1) + ','

cols = []
sql = 'INSERT INTO ' + tab_name + ' VALUES(' + vrs.rstrip(",") + ')'
for i in range(0, len(df)):
    cols.append(tuple(df.fillna('').values[i]))

cursor.executemany(sql, cols)
con.commit()
cursor.close()
con.close()
Considering data_df to be the dataframe, it can be done with the 3 lines below:
rows = [tuple(x) for x in data_df.values]
cur.executemany("INSERT INTO table_name VALUES (:1,:2,:3,:4)",rows)
con_ora.commit()
dataset.to_sql('dev_log', engine, if_exists='append', index=False)
Use dev_log directly as the table name and engine directly as the connection, rather than name='dev_log' and con=engine.
The parameter if_exists='append' means: insert new values into the existing table.
So it should append the new rows to the existing table and will not delete any rows from it.
See pandas.DataFrame.to_sql.
New to Python data science.
I have a SQL Server extract: I connect via pyodbc.connect and read the data with pd.read_sql(.....SQL query) from SQL Server.
My intention is to use a list or vector (examples below) in the SQL query's WHERE condition. How do I do that? It helps us avoid fetching millions of rows into memory.
I would like to know how to pass both a number list and a string list (they have different use cases).
1st WHERE condition, strings:
raw_data2 = {'age1': ['ten','twenty']}
df2 = pd.DataFrame(raw_data2, columns = ['age1'])
2nd WHERE condition, numbers:
raw_data2 = {'age_num': [10,20,30]}
df3 = pd.DataFrame(raw_data2, columns = ['age_num'])
Thank you for your help; this will reduce our fetch time by 80%.
Consider using pandas' read_sql and pass parameters to avoid type handling. Additionally, save everything in a dictionary of dataframes with keys corresponding to the original raw_data keys, which avoids flooding the global environment with many separate dataframes:
raw_data = {'age1': ['ten','twenty'],
'age_num': [10, 20, 30]}
df_dict = {}
for k, v in raw_data.items():
    # BUILD PREPARED STATEMENT WITH PARAM PLACEHOLDERS
    where = '{col} IN ({prm})'.format(col=k, prm=", ".join(['?' for _ in v]))
    sql = 'SELECT * FROM mytable WHERE {}'.format(where)
    print(sql)

    # IMPORT INTO DATAFRAME
    df_dict[k] = pd.read_sql(sql, conn, params=v)

# OUTPUT TOP ROWS OF EACH DF ELEM
df_dict['age1'].head()
df_dict['age_num'].head()
For separate dataframe objects:
def build_query(my_dict):
    for k, v in my_dict.items():
        # BUILD PREPARED STATEMENT WITH PARAM PLACEHOLDERS IN WHERE CLAUSE
        where = '{col} IN ({prm})'.format(col=k, prm=", ".join(['?' for _ in v]))
        sql = 'SELECT * FROM mytable WHERE {}'.format(where)

    return sql
raw_data2 = {'age1': ['ten','twenty']}
# ASSIGNS QUERY
sql = build_query(raw_data2)
# IMPORT TO DATAFRAME PASSING PARAM VALUES
df2 = pd.read_sql(sql, conn, params = raw_data2['age1'])
raw_data3 = {'age_num': [10,20,30]}
# ASSIGNS QUERY
sql = build_query(raw_data3)
# IMPORT TO DATAFRAME PASSING PARAM VALUES
df3 = pd.read_sql(sql, conn, params = raw_data3['age_num'])
I have a list of tuples which I'm inserting into a table.
Each tuple has 50 values. How do I insert without having to specify the column names and count how many ? placeholders there are?
col1 is an auto increment column so my insert stmt starts in col2 and ends in col51.
current code:
l = [(1,2,3,.....),(2,4,6,.....),(4,6,7,.....)...]
for tup in l:
    cur.execute(
        """insert into TABLENAME(col2,col3,col4.........col50,col51) VALUES(?,?,?,.............)
        """)
want:
insert into TABLENAME(col*) VALUES(*)
MySQL's syntax for INSERT is documented here: http://dev.mysql.com/doc/refman/5.7/en/insert.html
There is no wildcard syntax like you show. The closest thing is to omit the column names:
INSERT INTO MyTable VALUES (...);
But I don't recommend doing that. It works only if you are certain you're going to specify a value for every column in the table (even the auto-increment column), and your values are guaranteed to be in the same order as the columns of the table.
You should learn to use code to build the SQL query based on arrays of values in your application. Here's a Python example of the way I do it. Suppose you have a dict of column: value pairs called data_values.
placeholders = ['%s'] * len(data_values)

sql_template = """
    INSERT INTO MyTable ({columns}) VALUES ({placeholders})
"""

sql = sql_template.format(
    columns=','.join(data_values.keys()),
    placeholders=','.join(placeholders)
)

cur = db.cursor()
cur.execute(sql, list(data_values.values()))
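For example, a quick sketch with a hypothetical row (the column names are made up; MyTable and db are the placeholders from above):
data_values = {'name': 'Alice', 'age': 30, 'city': 'Oslo'}   # hypothetical column: value pairs
# the template above then renders as: INSERT INTO MyTable (name,age,city) VALUES (%s,%s,%s)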
example code to put before your code:
cols = "("
for x in range(2, 52):              # col2 .. col51
    cols = cols + "col" + str(x) + ","
cols = cols[:-1] + ")"              # replace the trailing comma with the closing paren
Inside your loop:
for tup in l:
    cur.execute("insert into TABLENAME " + cols + " VALUES {0}".format(tup))
This is off the top of my head with no error checking
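As an alternative sketch (not from the original answer): since each tuple already matches col2..col51 in order, you could also generate ? placeholders once and let the driver bind the values, instead of formatting the tuples into the SQL string yourself:
cols = "(" + ",".join("col" + str(x) for x in range(2, 52)) + ")"
marks = "(" + ",".join("?" for _ in range(2, 52)) + ")"
# executemany binds one tuple of 50 values per row
cur.executemany("insert into TABLENAME " + cols + " VALUES " + marks, l)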