python pandas with to_sql(), SQLAlchemy and schema in exasol - python

I'm trying to upload a pandas data frame to an SQL table. It seemed to me that pandas' to_sql function is the best solution for larger data frames, but I can't get it to work. I can easily extract data, but I get an error message when trying to write it to a new table:
import pyodbc
import pandas as pd

# connect to Exasol DB
exaString = 'DSN=exa'
conDB = pyodbc.connect(exaString)
# get some data from somewhere, works without error
sqlString = "SELECT * FROM SOMETABLE"
data = pd.read_sql(sqlString, conDB)
# now upload this data to a new table
data.to_sql('MYTABLENAME', conDB, flavor='mysql')
conDB.close()
The error message I get is
pyodbc.ProgrammingError: ('42000', "[42000] [EXASOL][EXASolution driver]syntax error, unexpected identifier_chain2, expecting
assignment_operator or ':' [line 1, column 6] (-1)
(SQLExecDirectW)")
Unfortunately I have no idea what the query that caused this syntax error looks like, or what else is wrong. Can someone please point me in the right direction?
(Second) EDIT:
Following Humayun's and Joris's suggestions, I now use pandas version 0.14 and SQLAlchemy in combination with the Exasol dialect (?). Since I am connecting to a defined schema, I am using the metadata option, but the program crashes with "Bus error (core dumped)".
import pandas as pd
import sqlalchemy
from pandas.io import sql
from sqlalchemy import create_engine

engine = create_engine('exa+pyodbc://uid:passwd@exa/mySchemaName', echo=True)
# get some data
sqlString = "SELECT * FROM SOMETABLE" # SOMETABLE is a view in mySchemaName
df = pd.read_sql(sqlString, con=engine) # works
print(engine.has_table('MYTABLENAME')) # MYTABLENAME is a view in mySchemaName
# prints "True"
# upload it to a new table
meta = sqlalchemy.MetaData(engine, schema='mySchemaName')
meta.reflect(engine, schema='mySchemaName')
pdsql = sql.PandasSQLAlchemy(engine, meta=meta)
pdsql.to_sql(df, 'MYTABLENAME')
I am not sure about setting "mySchemaName" in create_engine(..), but the outcome is the same either way.

Pandas does not support the EXASOL syntax out of the box, so it needs to be tweaked a bit. Here is a working example of your code without SQLAlchemy:
import pyodbc
import pandas as pd
con = pyodbc.connect('DSN=EXA')
con.execute('OPEN SCHEMA TEST2')
# configure pandas to understand EXASOL as a mysql flavor:
# plain INT for integers, no backtick quoting, '?' as bind placeholder
pd.io.sql._SQL_TYPES['int']['mysql'] = 'INT'
pd.io.sql._SQL_SYMB['mysql']['br_l'] = ''
pd.io.sql._SQL_SYMB['mysql']['br_r'] = ''
pd.io.sql._SQL_SYMB['mysql']['wld'] = '?'
# replace has_table with a case-insensitive lookup in the EXASOL catalog
pd.io.sql.PandasSQLLegacy.has_table = \
    lambda self, name: name.upper() in [t[0].upper() for t in con.execute('SELECT table_name FROM cat').fetchall()]
data = pd.read_sql('SELECT * FROM services', con)
data.to_sql('SERVICES2', con, flavor = 'mysql', index = False)
If you use the EXASolution Python package, the code would look as follows:
import exasol
con = exasol.connect(dsn='EXA') # normal pyodbc connection with additional functions
con.execute('OPEN SCHEMA TEST2')
data = con.readData('SELECT * FROM services') # pandas data frame per default
con.writeData(data, table = 'services2')

The problem is that even in pandas 0.14 the read_sql and to_sql functions cannot deal with schemas, but using EXASOL without schemas makes no sense. This will be fixed in 0.15. If you want to use it now, look at this pull request: https://github.com/pydata/pandas/pull/7952
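Once that fix is released, a minimal sketch of the intended schema-aware usage could look like this (the schema keyword is the interface added by that pull request; the engine URI and object names are the placeholders from the question):
import pandas as pd
from sqlalchemy import create_engine

# placeholder credentials, as in the question
engine = create_engine('exa+pyodbc://uid:passwd@exa')

# read from and write to objects inside a named schema
df = pd.read_sql_table('SOMETABLE', engine, schema='mySchemaName')
df.to_sql('MYTABLENAME', engine, schema='mySchemaName', index=False)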

Related

Using Python variables for table name and column value in an SQL query

I have a Python function to read from an SQL table into a pandas DataFrame:
import pandas as pd
from sqlalchemy import create_engine

def project_cable_collector(dbase, table, project):
    engine = create_engine(dbase)
    df = pd.read_sql('SELECT * from table WHERE project_id = project', engine)
    return df
However it returns sqlalchemy.exc.ProgrammingError:
sqlalchemy.exc.ProgrammingError: (psycopg2.errors.SyntaxError) syntax error at or near "table"
LINE 1: SELECT * from table WHERE project_id = project
I tried editing quotation marks to see if that's a fix, but it fails.
Any ideas?
An exact fix to your current problem might be to use an f-string:
def project_cable_collector(dbase, table, project):
    engine = create_engine(dbase)
    sql = f"SELECT * FROM {table} WHERE project_id = {project}"
    df = pd.read_sql(sql, engine)
    return df
However, note that it is highly undesirable to build a SQL query string this way, by concatenation and substitution. The reason is that your function invites SQL injection: someone could pass a malicious SQL fragment into the function and get your Python script to execute it. Instead, you should read about using prepared statements (parameterized queries).
Further to Tim's answer, you'll need to use an f-string to insert the table name into the SQL text, but you should use a parameter to specify the column value:
from sqlalchemy import text
# …
def project_cable_collector(dbase, table, project):
    engine = create_engine(dbase)
    sql = f"SELECT * FROM {table} WHERE project_id = :project_id"
    df = pd.read_sql_query(text(sql), engine, params=dict(project_id=project))
    return df
Note also that read_sql_query() is preferable to read_sql(): read_sql() is just a convenience wrapper that dispatches to read_sql_query() or read_sql_table(), so calling read_sql_query() directly makes the intent explicit.

ImportError: Using URI string without sqlalchemy installed, while executing REGEXP function on pandas SQL API with SQLite

I am trying to execute the REGEXP function of SQLite using the pandas SQL API, but I am getting the error
"ImportError: Using URI string without sqlalchemy installed."
The Python code is as follows:
import pandas as pd
import csv, sqlite3
import json, re

conn = sqlite3.connect(":memory:")
print(sqlite3.version)
print(sqlite3.sqlite_version)

def regexp(y, x, search=re.search):
    return 1 if search(y, x) else 0

conn.create_function("regexp", 2, regexp)
df = pd.read_json("idxData1.json", lines=True)
df.to_sql("temp_log", conn, if_exists="append", index=False)
rsDf = pd.read_sql_query(
    conn, """SELECT * from temp_log WHERE user REGEXP 'ph'""", chunksize=20,
)
for gendf in rsDf:
    for item in gendf.to_dict(orient="records"):
        print(item)
The error it throws is
raise ImportError("Using URI string without sqlalchemy installed.")
ImportError: Using URI string without sqlalchemy installed.
Can anyone suggest what I am missing? Please note that I have a specific requirement to use the pandas SQL API.
You get this error because you specified the parameters to read_sql_query in the wrong order. The first parameter should be the query, and the connection comes second, like this:
rsDf = pd.read_sql_query(
    """SELECT * from temp_log WHERE user REGEXP 'ph'""", conn, chunksize=20,
)
You can simply run the following command to install SQLAlchemy:
pip3 install SQLAlchemy
As @Xbel said:
Note that the error may be raised because the order of the parameters is wrong, e.g., passing the connection first and then the SQL statement. It seems that was the case here. Note that installing SQLAlchemy does not help.

ORA-00942 when importing data from Oracle Database to Pandas

I am trying to import data from an Oracle database into a pandas dataframe.
Right now I am using:
import cx_Oracle
import pandas as pd
db_connection_string = '.../A1@server:port/servername'
con = cx_Oracle.connect(db_connection_string)
query = """SELECT*
FROM Salesdata"""
df = pd.read_sql(query, con=con)
and get the following error: DatabaseError: ORA-00942: Table or view doesn't exist
When I run a query to get the list of all tables:
cur = con.cursor()
cur.execute("SELECT table_name FROM dba_tables")
for row in cur:
    print(row)
The output looks like this:
('A$',)
('A$BD',)
('Salesdata',)
What am I doing wrong? I used this question to get started.
If I follow the comment and print(query), I get:
SELECT*
FROM Salesdata
Getting ORA-00942 when running SELECT can have 2 possible causes:
The table does not exist: here you should make sure the table name is prefixed by the table owner (schema name), as in select * from owner_name.table_name. This is generally needed if the current Oracle user connected is not the table owner.
You don't have SELECT privileges on the table: such a grant is generally needed if the current Oracle user connected is not the table owner.
You need to check both.
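For example, a sketch of the first check (schema_owner is a placeholder for the actual owner; note also that a mixed-case name like Salesdata in dba_tables means the table was created as a quoted identifier, so it must be double-quoted in the query):
import cx_Oracle
import pandas as pd

# db_connection_string as in the question; schema_owner is a placeholder
con = cx_Oracle.connect(db_connection_string)
query = 'SELECT * FROM schema_owner."Salesdata"'
df = pd.read_sql(query, con=con)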

Specifying pyODBC options (fast_executemany = True in particular) using SQLAlchemy

I would like to switch on the fast_executemany option for the pyODBC driver while using SQLAlchemy to insert rows into a table. By default it is off and the code runs really slowly. Could anyone suggest how to do this?
Edits:
I am using pyODBC 4.0.21 and SQLAlchemy 1.1.13, and a simplified sample of the code I am using is presented below.
import sqlalchemy as sa

def InsertIntoDB(self, tablename, colnames, data, create=False):
    """
    Inserts data into given db table

    Args:
        tablename - name of db table with dbname
        colnames - column names to insert to
        data - a list of tuples, a tuple per row
    """
    # reflect table into a sqlalchemy object
    meta = sa.MetaData(bind=self.engine)
    reflected_table = sa.Table(tablename, meta, autoload=True)
    # prepare an input object for sa.connection.execute
    execute_inp = []
    for i in data:
        execute_inp.append(dict(zip(colnames, i)))
    # Insert values
    self.connection.execute(reflected_table.insert(), execute_inp)
Try this for pyodbc:
crsr = cnxn.cursor()
crsr.fast_executemany = True
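If you want that flag while still going through SQLAlchemy (which hides the raw cursor), a common approach is an event listener that sets it on the DBAPI cursor before bulk executes. A minimal sketch, assuming a pyodbc-backed engine (connection_uri is a placeholder):
from sqlalchemy import create_engine, event

engine = create_engine(connection_uri)  # placeholder, e.g. an mssql+pyodbc URI

@event.listens_for(engine, "before_cursor_execute")
def _enable_fast_executemany(conn, cursor, statement, parameters, context, executemany):
    # flip the pyodbc flag only for executemany() calls (bulk inserts)
    if executemany:
        cursor.fast_executemany = True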
Starting with version 1.3, SQLAlchemy supports fast_executemany directly, e.g.:
engine = create_engine(connection_uri, fast_executemany=True)

PyMySQL with Python 3.5 - selecting into pandas dataframe with the LIKE clause fails due to escape characters?

I am using PyMySQL to fetch some data from the MySQL DB into a pandas dataframe. I need to run a select with a LIKE clause, but it seems PyMySQL does something weird with the select statement and doesn't like it when the query contains %:
#connection to MySQL
engine = create_engine('mysql+pymysql://user:password@localhost:1234/mydb', echo=False)
#get descriptions we want
decriptions = pd.read_sql(sql=r"select content from listings where content not like '%The Estimate%'", con = engine)
I get error:
ValueError: unsupported format character 'T' (0x54) at index 54
Any advice on how to get around this?
Try using %% to escape the percent signs:
decriptions = pd.read_sql(sql=r"select content from listings where content not like '%%The Estimate%%'", con=engine)
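Alternatively, a sketch that sidesteps the escaping entirely by binding the pattern as a query parameter (table and column names taken from the question):
from sqlalchemy import create_engine, text
import pandas as pd

engine = create_engine('mysql+pymysql://user:password@localhost:1234/mydb', echo=False)

# the driver never sees a literal %, so there is no format-character error
query = text("select content from listings where content not like :pat")
decriptions = pd.read_sql(sql=query, con=engine, params={"pat": "%The Estimate%"})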
