python-pandas and databases like mysql - python

The documentation for Pandas has numerous examples of best practices for working with data stored in various formats.
However, I am unable to find any good examples for working with databases like MySQL for example.
Can anyone point me to links or give some code snippets of how to convert query results using mysql-python to data frames in Pandas efficiently ?

As Wes says, io/sql's read_sql will do it, once you've gotten a database connection using a DBI compatible library. We can look at two short examples using the MySQLdb and cx_Oracle libraries to connect to Oracle and MySQL and query their data dictionaries. Here is the example for cx_Oracle:
import pandas as pd
import cx_Oracle
ora_conn = cx_Oracle.connect('your_connection_string')
df_ora = pd.read_sql('select * from user_objects', con=ora_conn)
print 'loaded dataframe from Oracle. # Records: ', len(df_ora)
ora_conn.close()
And here is the equivalent example for MySQLdb:
import MySQLdb
mysql_cn= MySQLdb.connect(host='myhost',
port=3306,user='myusername', passwd='mypassword',
db='information_schema')
df_mysql = pd.read_sql('select * from VIEWS;', con=mysql_cn)
print 'loaded dataframe from MySQL. records:', len(df_mysql)
mysql_cn.close()

For recent readers of this question: pandas have the following warning in their docs for version 14.0:
Warning: Some of the existing functions or function aliases have been
deprecated and will be removed in future versions. This includes:
tquery, uquery, read_frame, frame_query, write_frame.
And:
Warning: The support for the ‘mysql’ flavor when using DBAPI connection objects has
been deprecated. MySQL will be further supported with SQLAlchemy
engines (GH6900).
This makes many of the answers here outdated. You should use sqlalchemy:
from sqlalchemy import create_engine
import pandas as pd
engine = create_engine('dialect://user:pass#host:port/schema', echo=False)
f = pd.read_sql_query('SELECT * FROM mytable', engine, index_col = 'ID')

For the record, here is an example using a sqlite database:
import pandas as pd
import sqlite3
with sqlite3.connect("whatever.sqlite") as con:
sql = "SELECT * FROM table_name"
df = pd.read_sql_query(sql, con)
print df.shape

I prefer to create queries with SQLAlchemy, and then make a DataFrame from it. SQLAlchemy makes it easier to combine SQL conditions Pythonically if you intend to mix and match things over and over.
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Table
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from pandas import DataFrame
import datetime
# We are connecting to an existing service
engine = create_engine('dialect://user:pwd#host:port/db', echo=False)
Session = sessionmaker(bind=engine)
session = Session()
Base = declarative_base()
# And we want to query an existing table
tablename = Table('tablename',
Base.metadata,
autoload=True,
autoload_with=engine,
schema='ownername')
# These are the "Where" parameters, but I could as easily
# create joins and limit results
us = tablename.c.country_code.in_(['US','MX'])
dc = tablename.c.locn_name.like('%DC%')
dt = tablename.c.arr_date >= datetime.date.today() # Give me convenience or...
q = session.query(tablename).\
filter(us & dc & dt) # That's where the magic happens!!!
def querydb(query):
"""
Function to execute query and return DataFrame.
"""
df = DataFrame(query.all());
df.columns = [x['name'] for x in query.column_descriptions]
return df
querydb(q)

MySQL example:
import MySQLdb as db
from pandas import DataFrame
from pandas.io.sql import frame_query
database = db.connect('localhost','username','password','database')
data = frame_query("SELECT * FROM data", database)

The same syntax works for Ms SQL server using podbc also.
import pyodbc
import pandas.io.sql as psql
cnxn = pyodbc.connect('DRIVER={SQL Server};SERVER=servername;DATABASE=mydb;UID=username;PWD=password')
cursor = cnxn.cursor()
sql = ("""select * from mytable""")
df = psql.frame_query(sql, cnxn)
cnxn.close()

And this is how you connect to PostgreSQL using psycopg2 driver (install with "apt-get install python-psycopg2" if you're on Debian Linux derivative OS).
import pandas.io.sql as psql
import psycopg2
conn = psycopg2.connect("dbname='datawarehouse' user='user1' host='localhost' password='uberdba'")
q = """select month_idx, sum(payment) from bi_some_table"""
df3 = psql.frame_query(q, conn)

For Sybase the following works (with http://python-sybase.sourceforge.net)
import pandas.io.sql as psql
import Sybase
df = psql.frame_query("<Query>", con=Sybase.connect("<dsn>", "<user>", "<pwd>"))

pandas.io.sql.frame_query is deprecated. Use pandas.read_sql instead.

import the module
import pandas as pd
import oursql
connect
conn=oursql.connect(host="localhost",user="me",passwd="mypassword",db="classicmodels")
sql="Select customerName, city,country from customers order by customerName,country,city"
df_mysql = pd.read_sql(sql,conn)
print df_mysql
That works just fine and using pandas.io.sql frame_works (with the deprecation warning). Database used is the sample database from mysql tutorial.

This should work just fine.
import MySQLdb as mdb
import pandas as pd
con = mdb.connect(‘127.0.0.1’, ‘root’, ‘password’, ‘database_name’);
with con:
cur = con.cursor()
cur.execute(“select random_number_one, random_number_two, random_number_three from randomness.a_random_table”)
rows = cur.fetchall()
df = pd.DataFrame( [[ij for ij in i] for i in rows] )
df.rename(columns={0: ‘Random Number One’, 1: ‘Random Number Two’, 2: ‘Random Number Three’}, inplace=True);
print(df.head(20))

This helped for me for connecting to AWS MYSQL(RDS) from python 3.x based lambda function and loading into a pandas DataFrame
import json
import boto3
import pymysql
import pandas as pd
user = 'username'
password = 'XXXXXXX'
client = boto3.client('rds')
def lambda_handler(event, context):
conn = pymysql.connect(host='xxx.xxxxus-west-2.rds.amazonaws.com', port=3306, user=user, passwd=password, db='database name', connect_timeout=5)
df= pd.read_sql('select * from TableName limit 10',con=conn)
print(df)
# TODO implement
#return {
# 'statusCode': 200,
# 'df': df
#}

For Postgres users
import psycopg2
import pandas as pd
conn = psycopg2.connect("database='datawarehouse' user='user1' host='localhost' password='uberdba'")
customers = 'select * from customers'
customers_df = pd.read_sql(customers,conn)
customers_df

Related

Optimization of create/run multiple SQLalchemy Engines?

Assuming I have 30 databases in MySQL from db1 to db30. I have a python script that will create engine and connect to one db,
import pandas as pd
import MySQLdb
from sqlalchemy import create_engine
df = pd.read_csv('pricelist.csv')
new_df = df[['date','time','new_price']]
engine = create_engine('mysql+mysqldb://root:python#localhost:3306/db1', echo = False)
new_df.to_sql(name='temporary_table', con=engine, if_exists = 'append', index=False)
with engine.begin() as cnx:
sql_insert_query_new = 'REPLACE INTO newlist (SELECT * FROM temporary_table)'
cnx.execute(sql_insert_query_new)
cnx.execute("DROP TABLE temporary_table")
Now with the above script, I will need to have 30 python scripts to create engine and connect each db to conduct the query. And to call these 30 scripts, I will need to use a batch file on a task scheduler.
Is there an optimize way of connecting to multiple databases with a single script? I read up on sessions and don't think it is able to take in multiple databases. And if I have 30 python scripts doing this creation engine and connection, will there be any issue in terms of processing performance? Eventually, I will have like hundreds of db in MySQL.
Thanks!
Note: Each database has their own unique table names.
Using Python 3.7
I think may be you can do something like this:
import pandas as pd
import MySQLdb
from sqlalchemy import create_engine
df = pd.read_csv('pricelist.csv')
new_df = df[['date','time','new_price']]
db_names = [f'db{i}' for i in range(1, 31)]
table_names = ['temporary_table', 'table_name_2', 'table_name_3', ...]
for db, tb in zip(db_names, table_names):
engine = create_engine(f'mysql+mysqldb://root:python#localhost:3306/{db}', echo=False)
new_df.to_sql(name=tb, con=engine, if_exists='append', index=False)
with engine.begin() as cnx:
sql_insert_query_new = f'REPLACE INTO newlist (SELECT * FROM {tb})'
cnx.execute(sql_insert_query_new)
cnx.execute(f"DROP TABLE {tb}")

Apache Spark JDBC SQL Injection (pyspark)

I am trying to submit a sql query to jdbc while being protected from sql injection attacks. I have some code such as
from pyspark import SparkContext
from pyspark.sql import DataFrameReader, SQLContext
from pyspark.sql.functions import col
url = 'jdbc:mysql://.../....'
properties = {'user': '', 'driver': 'com.mysql.jdbc.Driver', 'password': ''}
sc = SparkContext("local[*]", "name")
sqlContext = SQLContext(sc)
from pyspark.sql.functions import desc
pushdown_query = """(
select * from my_table
where timestamp > {}
) AS tmp""".format(my_date)
df = sqlContext.read.jdbc(url=url, properties=properties, table=pushdown_query)
Can I use bind params somehow?
Any solution that prevents SQL injection here would work.
I also use SQLAlchemy if that helps.
If you use SQLAlchemy, you can try:
from sqlalchemy.dialects import mysql
from sqlalchemy import text
pushdown_query = str(
text("""(select * from my_table where timestamp > :my_date ) AS tmp""")
.bindparams(my_date=my_date)
.compile(dialect=mysql.dialect(), compile_kwargs={"literal_binds": True}))
df = sqlContext.read.jdbc(url=url, properties=properties, table=pushdown_query)
but in a simple case, like this one, there is no need for subqueries. You can:
df = (sqlContext.read
.jdbc(url=url, properties=properties, table=my_table)
.where(col("timestamp") > my_date)))
and if you worry about SQL injections, it is possible you have a bigger problem. If alone has (almost) no security mechanisms built-in and probably shouldn't be exposed in untrusted environment.

python MySQL db and Cursor cannot execute INSERT INTO

In Python version 2.7.6
Pandas version 0.18.1
MySQL 5.7
import MySQLdb as dbapi
import sys
import csv
import os
import sys, getopt
import pandas as pd
df = pd.read_csv('test.csv')
rows = df.apply(tuple, 1).unique().tolist()
db=dbapi.connect(host=dbServer,user=dbUser,passwd=dbPass)
cur=db.cursor()
for (CLIENT_ID,PROPERTY_ID,YEAR) in rows:
INSERT_QUERY=("INSERT INTO {DATABASE}.TEST SELECT * FROM {DATABASE}_{CLIENT_ID}.TEST WHERE PROPERTY_ID = {PROPERTY_ID} AND YEAR = {YEAR};".format(
CLIENT_ID=CLIENT_ID,
PROPERTY_ID=PROPERTY_ID,
YEAR=YEAR,
DATABASE=DATABASE
))
print INSERT_QUERY
cur.execute(INSERT_QUERY)
db.query(INSERT_QUERY)
This will print out the query I am looking for, however, without successfully returning the results of INSERT INTO when I checked the results in MySQL
INSERT INTO test.TEST SELECT * FROM test_1.TEST WHERE PROPERTY_ID = 1 AND YEAR = 2015;
However, if I just copy and paste this MySQL query into MySQL GUI, it will execute without any problem. Could any guru enlighten?
I also tried the following
cur.execute(INSERT_QUERY, multi=True)
Returns an error
TypeError: execute() got an unexpected keyword argument 'multi'
The answer here is we need to use "from mysql.connector" and a db.commit(). Here is a good example
http://www.mysqltutorial.org/python-mysql-insert/
import MySQLdb as dbapi
import mysql.connector
import sys
import csv
import os
import sys, getopt
import pandas as pd
df = pd.read_csv('test.csv')
rows = df.apply(tuple, 1).unique().tolist()
db=dbapi.connect(host=dbServer,user=dbUser,passwd=dbPass)
cur=db.cursor()
conn = mysql.connector.connect(host=dbServer,user=dbUser,port=dbPort,password=dbPass)
cursor=conn.cursor()
for (CLIENT_ID,PROPERTY_ID,YEAR) in rows:
INSERT_QUERY=("INSERT INTO {DATABASE}.TEST SELECT * FROM {DATABASE}_{CLIENT_ID}.TEST WHERE PROPERTY_ID = {PROPERTY_ID} AND YEAR = {YEAR};".format(
CLIENT_ID=CLIENT_ID,
PROPERTY_ID=PROPERTY_ID,
YEAR=YEAR,
DATABASE=DATABASE
))
print INSERT_QUERY
cursor.execute(INSERT_QUERY)
conn.commit()
Only by having the commit, the database/ table changes will be accepted
I was using mysql-connector pool, trying to insert a new row into a table, and got the same problem. The version info: mysql-8, python3.7.
The solution is to add connection.commit at last even you didn't start transaction.

Read Teradata query into Pandas

Has anyone found a way to read a Teradata query into a Pandas dataframe? It looks like SQLAlchemy does not have a Teradata dialect.
http://docs.sqlalchemy.org/en/latest/dialects/
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql.html
You can use slqalchemy but you will need to install sqlalchemy-teradata too. You can do that via PIP
pip install sqlachemy-teradata
The rest of the code remains the same :)
from sqlalchemy import create_engine
import pandas as pd
user, pasw, host = 'username','userpass', 'hostname'
# connect
td_engine = create_engine('teradata://{}:{}#{}:22/'.format(user,pasw,hostname))
# execute sql
query = 'select * from dbc.usersV'
result = td_engine.execute(query)
#To read your query to Pandas
df = pd.read_sql(query,td_engine)
I did it using read_sql . Below id the code snip :
def dqm() :
conn_rw = create_connection()
dataframes = []
srcfile = open('srcqueries.sql', 'rU').read()
querylist = srcfile.split(';')
querylist.pop()
for query in querylist :
dataframes.append(pd.read_sql(query, conn_rw))
close_connection(conn_rw)
return dataframes,querylist
You can create connection as below :
def create_connection():
conn = pyodbc.connect("DRIVER=Teradata;DBCNAME=tddb;UID=uid;PWD=pwd;QUIETMODE=YES", autocommit=True,unicode_results=True)
return conn
You can check complete code here : GitHub Link
Let me know if this answers your query .

How to insert pandas dataframe via mysqldb into database?

I can connect to my local mysql database from python, and I can create, select from, and insert individual rows.
My question is: can I directly instruct mysqldb to take an entire dataframe and insert it into an existing table, or do I need to iterate over the rows?
In either case, what would the python script look like for a very simple table with ID and two data columns, and a matching dataframe?
Update:
There is now a to_sql method, which is the preferred way to do this, rather than write_frame:
df.to_sql(con=con, name='table_name_for_df', if_exists='replace', flavor='mysql')
Also note: the syntax may change in pandas 0.14...
You can set up the connection with MySQLdb:
from pandas.io import sql
import MySQLdb
con = MySQLdb.connect() # may need to add some other options to connect
Setting the flavor of write_frame to 'mysql' means you can write to mysql:
sql.write_frame(df, con=con, name='table_name_for_df',
if_exists='replace', flavor='mysql')
The argument if_exists tells pandas how to deal if the table already exists:
if_exists: {'fail', 'replace', 'append'}, default 'fail'
fail: If table exists, do nothing.
replace: If table exists, drop it, recreate it, and insert data.
append: If table exists, insert data. Create if does not exist.
Although the write_frame docs currently suggest it only works on sqlite, mysql appears to be supported and in fact there is quite a bit of mysql testing in the codebase.
Andy Hayden mentioned the correct function (to_sql). In this answer, I'll give a complete example, which I tested with Python 3.5 but should also work for Python 2.7 (and Python 3.x):
First, let's create the dataframe:
# Create dataframe
import pandas as pd
import numpy as np
np.random.seed(0)
number_of_samples = 10
frame = pd.DataFrame({
'feature1': np.random.random(number_of_samples),
'feature2': np.random.random(number_of_samples),
'class': np.random.binomial(2, 0.1, size=number_of_samples),
},columns=['feature1','feature2','class'])
print(frame)
Which gives:
feature1 feature2 class
0 0.548814 0.791725 1
1 0.715189 0.528895 0
2 0.602763 0.568045 0
3 0.544883 0.925597 0
4 0.423655 0.071036 0
5 0.645894 0.087129 0
6 0.437587 0.020218 0
7 0.891773 0.832620 1
8 0.963663 0.778157 0
9 0.383442 0.870012 0
To import this dataframe into a MySQL table:
# Import dataframe into MySQL
import sqlalchemy
database_username = 'ENTER USERNAME'
database_password = 'ENTER USERNAME PASSWORD'
database_ip = 'ENTER DATABASE IP'
database_name = 'ENTER DATABASE NAME'
database_connection = sqlalchemy.create_engine('mysql+mysqlconnector://{0}:{1}#{2}/{3}'.
format(database_username, database_password,
database_ip, database_name))
frame.to_sql(con=database_connection, name='table_name_for_df', if_exists='replace')
One trick is that MySQLdb doesn't work with Python 3.x. So instead we use mysqlconnector, which may be installed as follows:
pip install mysql-connector==2.1.4 # version avoids Protobuf error
Output:
Note that to_sql creates the table as well as the columns if they do not already exist in the database.
You can do it by using pymysql:
For example, let's suppose you have a MySQL database with the next user, password, host and port and you want to write in the database 'data_2', if it is already there or not.
import pymysql
user = 'root'
passw = 'my-secret-pw-for-mysql-12ud'
host = '172.17.0.2'
port = 3306
database = 'data_2'
If you already have the database created:
conn = pymysql.connect(host=host,
port=port,
user=user,
passwd=passw,
db=database,
charset='utf8')
data.to_sql(name=database, con=conn, if_exists = 'replace', index=False, flavor = 'mysql')
If you do NOT have the database created, also valid when the database is already there:
conn = pymysql.connect(host=host, port=port, user=user, passwd=passw)
conn.cursor().execute("CREATE DATABASE IF NOT EXISTS {0} ".format(database))
conn = pymysql.connect(host=host,
port=port,
user=user,
passwd=passw,
db=database,
charset='utf8')
data.to_sql(name=database, con=conn, if_exists = 'replace', index=False, flavor = 'mysql')
Similar threads:
Writing to MySQL database with pandas using SQLAlchemy, to_sql
Writing a Pandas Dataframe to MySQL
The to_sql method works for me.
However, keep in mind that the it looks like it's going to be deprecated in favor of SQLAlchemy:
FutureWarning: The 'mysql' flavor with DBAPI connection is deprecated and will be removed in future versions. MySQL will be further supported with SQLAlchemy connectables. chunksize=chunksize, dtype=dtype)
Python 2 + 3
Prerequesites
Pandas
MySQL server
sqlalchemy
pymysql: pure python mysql client
Code
from pandas.io import sql
from sqlalchemy import create_engine
engine = create_engine("mysql+pymysql://{user}:{pw}#localhost/{db}"
.format(user="root",
pw="your_password",
db="pandas"))
df.to_sql(con=engine, name='table_name', if_exists='replace')
This should do the trick:
import pandas as pd
import pymysql
pymysql.install_as_MySQLdb()
from sqlalchemy import create_engine
# Create engine
engine = create_engine('mysql://USER_NAME_HERE:PASS_HERE#HOST_ADRESS_HERE/DB_NAME_HERE')
# Create the connection and close it(whether successed of failed)
with engine.begin() as connection:
df.to_sql(name='INSERT_TABLE_NAME_HERE/INSERT_NEW_TABLE_NAME', con=connection, if_exists='append', index=False)
You might output your DataFrame as a csv file and then use mysqlimport to import your csv into your mysql.
EDIT
Seems pandas's build-in sql util provide a write_frame function but only works in sqlite.
I found something useful, you might try this
This has worked for me. At first I've created only the database, no predefined table I created.
from platform import python_version
print(python_version())
3.7.3
path='glass.data'
df=pd.read_csv(path)
df.head()
!conda install sqlalchemy
!conda install pymysql
pd.__version__
'0.24.2'
sqlalchemy.__version__
'1.3.20'
restarted the Kernel after installation.
from sqlalchemy import create_engine
engine = create_engine('mysql+pymysql://USER:PASSWORD#HOST:PORT/DATABASE_NAME', echo=False)
try:
df.to_sql(name='glasstable',con=engine,index=False, if_exists='replace')
print('Sucessfully written to Database!!!')
except Exception as e:
print(e)
df.to_sql(name = "owner", con= db_connection, schema = 'aws', if_exists='replace', index = >True, index_label='id')

Categories