Using Pandas Dataframe within a SQL Join - python

I'm trying to perform a SQL join on the contents of a dataframe with an external table I have in a Postgres database.
This is what the Dataframe looks like:
>>> df
  name author  count
0    a      b     10
1    c      d      5
2    e      f      2
I need to join it with a Postgres table that looks like this:
TABLE: blog
title   author   url
a       b        w.com
b       b        x.com
e       g        y.com
This is what I'm attempting to do, but this doesn't appear to be the right syntax for the query:
>>> sql_join = r"""select b.*, frame.* from ({0}) frame
                   join blog b
                   on frame.name = b.title
                   where frame.owner = b.owner
                   order by frame.count desc
                   limit 30;""".format(df)
>>> res = pd.read_sql(sql_join, connection)
I'm not sure how I can use the values in the dataframes within the sql query.
Can someone point me in the right direction? Thanks!
Edit: As per my use case, I'm not able to convert the blog table into a dataframe given memory and performance constraints.

I managed to do this without converting the dataframe to a temp table and without reading the blog table into a dataframe.
For anyone else facing the same issue, this is achieved using a virtual table of sorts.
This is what my final SQL query looks like:
>>> inner_string = "VALUES ('a','b',10), ('c','d',5), ('e','f',2)"
>>> sql_join = r"""SELECT * FROM blog
                   JOIN ({0}) AS frame(title, owner, count)
                   ON blog.title = frame.title
                   WHERE blog.owner = frame.owner
                   ORDER BY frame.count DESC
                   LIMIT 30;""".format(inner_string)
>>> res = pd.read_sql(sql_join, connection)
You can use string manipulation to convert all rows in the dataframe into one large string similar to inner_string.
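For example, a minimal sketch of that string manipulation, assuming the same column names and simple values as in the example above:
# Hypothetical sketch: build the VALUES string from the dataframe rows,
# assuming `name` and `author` are plain strings and `count` is an integer.
rows = [
    "('{0}','{1}',{2})".format(row['name'], row['author'], row['count'])
    for _, row in df.iterrows()
]
inner_string = "VALUES " + ", ".join(rows)
Note that plain string formatting like this assumes the values are trusted and contain no quotes; for anything else, parameter binding is the safer option.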

You should create another dataframe from the Postgres table and then join both dataframes.
You can use read_sql to create a df from the table:
import psycopg2  ## Python connector library for Postgres
import pandas as pd

conn = psycopg2.connect(...)  ## Put your DB credentials here
## With a plain DBAPI connection, read_sql expects a query, not a table name
blog_df = pd.read_sql('SELECT * FROM blog', con=conn)
## This will bring the `blog` table's data into blog_df
It should look like this:
In [258]: blog_df
Out[258]:
  title author    url
0     a      b  w.com
1     b      b  x.com
2     e      g  y.com
Now, you can join df and blog_df using merge like below:
In [261]: pd.merge(df, blog_df, left_on='name', right_on='title')
Out[261]:
  name author_x  count title author_y    url
0    a        b     10     a        b  w.com
1    e        f      2     e        g  y.com
You will get a result like the above, which you can clean up further.
Let me know if this helps.

I've had similar problems. I found a workaround that lets me join data across two different servers where I only have read-only rights: use SQLAlchemy to insert the pandas dataframe into a temp table and then join against it.
import sqlalchemy as sa
import pandas as pd

## `engine` and `session` are assumed to be an existing SQLAlchemy
## engine and session bound to the target server.
x = 255  ## string column length; adjust to fit your data

metadata = sa.MetaData()
sql_of_df = sa.Table(
    "##df",  ## global temp table on SQL Server
    metadata,
    sa.Column("name", sa.String(x), primary_key=True),
    sa.Column("author", sa.String(x), nullable=False),
    sa.Column("count", sa.Integer),
)
metadata.create_all(engine)

## Insert the dataframe rows into the temp table via bound parameters
dataframe_dict = df.to_dict(orient='records')
insert_statement = sql_of_df.insert().values(
    {
        "name": sa.bindparam("name"),
        "author": sa.bindparam("author"),
        "count": sa.bindparam("count"),
    }
)
session.execute(insert_statement, dataframe_dict)

## Join the temp table against the existing blog table
statement = sa.text("SELECT * FROM blog INNER JOIN ##df ON blog.Title = ##df.name")
session.execute(statement)
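To pull the join result back into pandas, one option (a sketch, assuming the `session` above is bound to the same `engine`) is to hand the statement to read_sql:
## Read the joined rows straight into a dataframe; `engine` is the same
## SQLAlchemy engine the session is bound to (an assumption of this sketch).
result_df = pd.read_sql(statement, con=engine)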

Related

python SQLite3: how to get records that match a list of values in a column, then place them into a pandas df

I am not experienced with SQL or SQLite3.
I have a list of ids from another table. I want to use the list as a key in my query and get all records based on the list. I want the SQL query to feed directly into a DataFrame.
import pandas as pd
import sqlite3
cnx = sqlite3.connect('c:/path/to/data.sqlite')
# the below values are ones found in "s_id"
id_list = ['C20','C23','C25','C28', ... ,'C83']
# change list to sql string.
id_sql = ", ".join(str(x) for x in id_list)
df = pd.read_sql_query(f"SELECT * FROM table WHERE s_id in ({id_sql})", cnx)
I am getting a DatabaseError: Execution failed on sql 'SELECT * FROM ... : no such column: C20.
When I saw this error I thought the code just needs a simple switch. So I tried this
df = pd.read_sql_query(f"SELECT * FROM table WHERE ({id_sql}) in s_id", cnx)
it did not work.
So how can I get this to work?
The table looks like this:
id    s_id   date        assigned_to   date_complete   notes
0     C10    1/6/2020    Jack          1/8/2020        None
1     C20    1/10/2020   Jane          1/12/2020       Call back
2     C23    1/11/2020   Henry         1/12/2020       finished
n     C83    ... rows of more data ...
n+1   D85    9/10/2021   Jeni          9/12/2021       Call back
Currently, you are missing the single quotes around your literal values, so the SQLite engine assumes you are attempting to query columns. However, avoid concatenating values altogether and instead bind them to parameters, which pandas.read_sql supports with the params argument:
# the below values are ones found in "s_id"
id_list = ['C20','C23','C25','C28', ... ,'C83']
# build equal length string of ? place holders
prm_list = ", ".join("?" for _ in id_list)
# build prepared SQL statement
sql = f"SELECT * FROM table WHERE s_id IN ({prm_list})"
# run query, passing parameters and values separately
df = pd.read_sql(sql, con=cnx, params=id_list)
The problem is that the SQL string is missing single quote marks: each s_id value inside the IN clause needs a ' on each side.
import pandas as pd
import sqlite3
cnx = sqlite3.connect('c:/path/to/data.sqlite')
# the below values are ones found in "s_id"
id_list = ['C20','C23','C25','C28', ... ,'C83']
# change list to sql string.
id_sql = "', '".join(str(x) for x in id_list)
df = pd.read_sql_query(f"SELECT * FROM table WHERE s_id in ('{id_sql}')", cnx)

Is there a way to store both a pandas dataframe and separate string var in the same SQLite table?

Disclaimer: This is my first time posting a question here, so I apologize if I didn't post this properly.
I recently started learning how to use SQLite in python. As the title suggests, I have a python object with a string attribute and a pandas dataframe attribute, and I want to know if/how I can add both of these to the same SQLite table. Below is the code I have thus far. The mydb.db file gets created successfully, but on insert I get the following error message:
sqlite3.InterfaceError: Error binding parameter :df- probably unsupported type.
I know you can use df.to_sql('mydbs', conn) to store a pandas dataframe in an SQL table, but this wouldn't seem to allow for an additional string to be added to the same table and then retrieved separately from the dataframe. Any solutions or alternative suggestions are appreciated.
Python Code:
# Python 3.7
import sqlite3
import pandas as pd
import myclass
conn = sqlite3.connect("mydb.db")
c = conn.cursor()
c.execute("""CREATE TABLE mydbs (
name text,
df blob
)""")
conn.commit()
c.execute("INSERT INTO mydbs VALUES (:name, :df)", {'name': myclass.name, 'df': myclass.df})
conn.commit()
conn.close()
It looks like you are trying to store a dataframe in a single SQL table 'cell'. This is a bit odd, since SQL is used for storing tables of data, and a dataframe is something that arguably should be stored as a table on its own (hence the built-in pandas function). To accomplish what you want specifically, you could pickle the dataframe and store it:
import codecs
import pickle
import pandas as pd
import sqlite3
df = pd.DataFrame({"foo": range(5), "bar": range(5, 10)})
pickled = codecs.encode(pickle.dumps(df), "base64").decode()
df
   foo  bar
0    0    5
1    1    6
2    2    7
3    3    8
4    4    9
Store & Retrieve:
conn = sqlite3.connect("mydb.db")
c = conn.cursor()
c.execute("""CREATE TABLE mydbs (
name text,
df text
)""")
c.execute("INSERT INTO mydbs VALUES (:name, :df)", {'name': 'name', 'df': pickled})
conn.commit()
c.execute('SELECT * FROM mydbs')
result = c.fetchall()
unpickled = pickle.loads(codecs.decode(result[0][1].encode(), "base64"))
conn.close()
unpickled
   foo  bar
0    0    5
1    1    6
2    2    7
3    3    8
4    4    9
If you wanted to store the dataframe as an SQL table (which imo makes more sense and is simpler) and you needed to keep a name with it, you could just add a 'name' column to the df:
import pandas as pd
from sqlalchemy import create_engine
df = pd.DataFrame({"foo": range(5), "bar": range(5, 10)})
df
   foo  bar
0    0    5
1    1    6
2    2    7
3    3    8
4    4    9
Add name column, then save to db and retrieve:
df['name'] = 'the df name'
engine = create_engine('sqlite://', echo=False)
df.to_sql('users', con=engine)
r = engine.execute("SELECT * FROM users").fetchall()
r = pd.read_sql('users', con=engine)
r
   index  foo  bar         name
0      0    0    5  the df name
1      1    1    6  the df name
2      2    2    7  the df name
3      3    3    8  the df name
4      4    4    9  the df name
But even that method may not be ideal, since you are effectively adding an extra column of data for each df, and this could get costly if you are working on a large project where database size is a factor, and maybe even speed (although SQL is quite fast). In this case, it may be best to use relational tables. For this I refer you here since there is no point re-writing the code here. Using a relational model would be the most 'proper' solution imo, since it fully embodies the purpose of SQL.
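For illustration only, a rough sketch of that relational idea (the table and column names here are hypothetical, not the linked example): keep each dataframe in its own table and record the name string in a small metadata table that points to it.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite:///mydb.db', echo=False)
df = pd.DataFrame({"foo": range(5), "bar": range(5, 10)})

# Hypothetical metadata table: one row per stored dataframe,
# holding the name string and the table the data lives in.
meta = pd.DataFrame([{"name": "the df name", "data_table": "df_1"}])
meta.to_sql("df_metadata", con=engine, if_exists="append", index=False)

# The dataframe itself goes into its own table.
df.to_sql("df_1", con=engine, if_exists="replace", index=False)

# Retrieval: look up the name, then load the matching table.
lookup = pd.read_sql("SELECT * FROM df_metadata", con=engine)
table_name = lookup.loc[lookup["name"] == "the df name", "data_table"].iloc[0]
restored = pd.read_sql(table_name, con=engine)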

How to select data from SQL Server based on data available in pandas data frame?

I have a list of data in one of the columns of a pandas dataframe for which I want to query a SQL Server database. Is there any way I can query a SQL Server DB based on data I have in a pandas dataframe?
select * from table_name where customerid in pd.dataframe.customerid
In SAP, there is something called "For all entries in" where the SQL can query the DB based on the data available in the array, I was trying to find something similar.
Thanks.
If you are working with a tiny DataFrame, then the easiest way would be to generate a corresponding SQL:
In [8]: df
Out[8]:
   id  val
0   1   21
1   3  111
2   5   34
3  12   76
In [9]: q = 'select * from tab where id in ({})'.format(','.join(['?']*len(df['id'])))
In [10]: q
Out[10]: 'select * from tab where id in (?,?,?,?)'
now you can read data from SQL Server:
from sqlalchemy import create_engine
conn = create_engine(...)
new = pd.read_sql(q, conn, params=tuple(df['id']))
NOTE: this approach will not work for bigger DFs, as the generated query (and/or list of bind variables) might be too long either for the pandas read_sql() function or for SQL Server, or even for both.
For bigger DFs I would recommend writing your pandas DF to a SQL Server table and then using a SQL subquery to filter the needed data:
df[list_of_columns_to_save].to_sql('tmp_tab_name', conn, index=False, if_exists='replace')
q = "select * from tab where id in (select id from tmp_tab_name)"
new = pd.read_sql(q, conn)
This is a very familiar scenario, and one can use the code below to query SQL Server using a very large pandas dataframe. The parameter n needs to be tuned based on your SQL Server memory. For me n=25000 worked.
n = 25000  # chunk row size
## Big_frame dataframe divided into smaller chunks of n rows, stored in a list
list_df = [big_frame[i:i+n] for i in range(0, big_frame.shape[0], n)]
## Create another dataframe with column names as expected from SQL
big_frame_2 = pd.DataFrame(columns=[<Mention all column names from SQL>])
## Print total no. of iterations
print("Total Iterations:", len(list_df))
for i in range(0, len(list_df)):
    print("Iteration :", i)
    temp_frame = list_df[i]
    testList = temp_frame['customer_no']
    ## Pass smaller chunk of data to SQL (here I am passing a list of customers)
    temp_DF = SQL_Query(tuple(testList))
    print(temp_DF.shape[0])
    ## Append all the data retrieved from SQL to big_frame_2
    big_frame_2 = big_frame_2.append(temp_DF, ignore_index=True)
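SQL_Query in the snippet above is the answerer's own helper, not a pandas or pyodbc function; a minimal sketch of what it might look like, assuming an open pyodbc connection `conn` to SQL Server and a hypothetical customers table:
import pandas as pd

def SQL_Query(customer_ids):
    """Fetch rows whose customer_no is in the given chunk of ids."""
    # One '?' placeholder per id, so the values are bound as parameters
    # rather than interpolated into the SQL string.
    # `conn` is assumed to be an open pyodbc connection defined elsewhere.
    placeholders = ", ".join("?" for _ in customer_ids)
    sql = f"SELECT * FROM customers WHERE customer_no IN ({placeholders})"
    return pd.read_sql(sql, conn, params=list(customer_ids))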

How can I get column name and type from an existing table in SQLAlchemy?

Suppose I have the table users and I want to know what the column names are and what the types are for each column.
I connect like this;
connectstring = ('mssql+pyodbc:///?odbc_connect=DRIVER%3D%7BSQL'
'+Server%7D%3B+server%3D.....')
engine = sqlalchemy.create_engine(connectstring).connect()
md = sqlalchemy.MetaData()
table = sqlalchemy.Table('users', md, autoload=True, autoload_with=engine)
columns = table.c
If I call
for c in columns:
    print type(columns)
I get the output
<class 'sqlalchemy.sql.base.ImmutableColumnCollection'>
printed once for each column in the table.
Furthermore,
print columns
prints
['users.column_name_1', 'users.column_name_2', 'users.column_name_3'....]
Is it possible to get the column names without the table name being included?
columns have name and type attributes
for c in columns:
    print c.name, c.type
Better to use inspect to obtain only the column information; that way you do not reflect the whole table.
import sqlalchemy
from sqlalchemy import inspect

engine = sqlalchemy.create_engine(<url>)
insp = inspect(engine)
columns_table = insp.get_columns(<table_name>, <schema>)  # schema is optional
for c in columns_table:
    print(c['name'], c['type'])
You could call the column names from the Information Schema:
SELECT *
FROM information_schema.columns
WHERE TABLE_NAME = 'users'
Then sort the result as an array, and loop through them.
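From Python, a short sketch of that approach (assuming the same `engine` as above) is to read the information schema into pandas and iterate over the rows:
import pandas as pd

cols = pd.read_sql(
    "SELECT COLUMN_NAME, DATA_TYPE "
    "FROM information_schema.columns "
    "WHERE TABLE_NAME = 'users'",
    engine,
)
for _, row in cols.iterrows():
    print(row['COLUMN_NAME'], row['DATA_TYPE'])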

pandas read_sql drops dot in column names

Is that a bug, or am I doing something wrong?
I create a df and put it in a SQL table; both the df and the table have a column with a dot in its name.
Now when I read the df back from the SQL table, the column names aren't the same.
I wrote this little piece of code so that people can test it.
import sqlalchemy
import pandas as pd
import numpy as np
engine = sqlalchemy.create_engine('sqlite:///test.sqlite')
dfin = pd.DataFrame(np.random.randn(10,2), columns=['column with a . dot', 'without'])
print(dfin)
dfin.to_sql('testtable', engine, if_exists='fail')
tables = engine.table_names()
for table in tables:
    sql = 'SELECT t.* FROM "' + table + '" t'
    dfout = pd.read_sql(sql, engine)
    print(dfout.columns)
    print(dfout)
The solution is to pass sqlite_raw_colnames=True to your engine:
In [141]: engine = sqlalchemy.create_engine('sqlite:///', execution_options={'sqlite_raw_colnames':True})
In [142]: dfin.to_sql('testtable', engine, if_exists='fail')
In [143]: pd.read_sql("SELECT * FROM testtable", engine).head()
Out[143]:
   index  column with a . dot   without
0      0             0.213645  0.321328
1      1            -0.511033  0.496510
2      2            -1.114511 -0.030571
3      3            -1.370342  0.359123
4      4             0.101111 -1.010498
SQLAlchemy does this stripping of dots deliberately (in some cases SQLite may store col names as "tablename.colname"), see eg sqlalchemy+sqlite stripping column names with dots? and https://groups.google.com/forum/?hl=en&fromgroups#!topic/sqlalchemy/EqAuTFlMNZk
This seems to be a bug, but not necessarily in the pandas read_sql function, as it relies on the keys method of the SQLAlchemy ResultProxy object to determine the column names. And this seems to truncate the column names:
In [15]: result = engine.execute("SELECT * FROM testtable")
In [16]: result.keys()
Out[16]: [u'index', u' dot', u'without']
So the question is if this is a bug in SQLAlchemy, or that pandas should make a workaround (by eg using result.cursor.description which gives the correct names)
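For illustration, a small sketch of reading the names from the raw cursor instead (using the same engine and table as above):
result = engine.execute("SELECT * FROM testtable")
# cursor.description holds one tuple per column; the first element is the
# column name as reported by the sqlite3 driver itself, dots included.
print([col[0] for col in result.cursor.description])
# expected: ['index', 'column with a . dot', 'without']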
For now, you can also use the sqlite fallback mode by using a DBAPI connection instead of an SQLAlchemy engine; as this relies on cursor.description, the correct column names are used here:
In [20]: con = sqlite3.connect(':memory:')
In [21]: dfin.to_sql('testtable', con, if_exists='fail')
In [22]: pd.read_sql("SELECT * FROM testtable", con).head()
Out[22]:
   index  column with a . dot   without
0      0             0.213645  0.321328
1      1            -0.511033  0.496510
2      2            -1.114511 -0.030571
3      3            -1.370342  0.359123
4      4             0.101111 -1.010498
