How do I replicate a pandas function in MySQL? - python

I'm new to SQL and trying to redo in SQL what I already know in Python. I have a script where I connect through ODBC to SQL Server (SSMS) to work with the data in Python:
import pyodbc
import pandas as pd

# ODBC connection to SQL Server
conn = pyodbc.connect('Driver={SQL Server};'
                      'Server=PMZZ315\RION;'
                      'Database=Warehouse;'
                      'Trusted_Connection=yes;')
cursor = conn.cursor()

df = pd.read_sql_query("SELECT [LetId],[StreetAddressLine1],[CompanyName] FROM Dim.Let", conn)
df.head()
#print(df.columns)

# Select duplicate rows except first occurrence, based on company name and street address
duplicateRowsDF = df[df.duplicated(['CompanyName','StreetAddressLine1'])]
#print("Duplicate Rows except first occurrence are :")
print(duplicateRowsDF)
duplicateRowsDF.to_csv("duplicateRowsDFodbc.csv")
What function in SQL can substitute for df.duplicated? All I am trying to do is detect duplicate records, ignoring the first instance, where the company name and street address are repeated.
Reprex of output dataset:
LetId   StreetAddressLine1              CompanyName
32      1451 West Brimson View Court    Palmer
405     1808 North Lonion Ave           Ozark
465     4223 Monty Hwy                  Alabama

SQL tables represent unordered sets. Ordering is only provided by columns in the data. There is no "first" without an ordering. Let me assume that letid defines the ordering.
The canonical way in SQL uses row_number():
select t.*
from (select t.*,
             row_number() over (partition by CompanyName, StreetAddressLine1
                                order by LetId) as seqnum
      from Dim.Let t
     ) t
where seqnum > 1;
Rows with seqnum = 1 are the first occurrences of each CompanyName/StreetAddressLine1 pair; rows with seqnum > 1 are the repeats, which is exactly what df.duplicated() with the default keep='first' flags.
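To tie this back to the original pyodbc workflow, here is a minimal sketch (reusing the connection details from the question; not part of the original answer) that runs the window-function query and saves the duplicates to CSV, just like the pandas version:

import pyodbc
import pandas as pd

conn = pyodbc.connect('Driver={SQL Server};'
                      'Server=PMZZ315\RION;'
                      'Database=Warehouse;'
                      'Trusted_Connection=yes;')

# rows with seqnum > 1 repeat an earlier (lower LetId) CompanyName/StreetAddressLine1 pair
dup_sql = """
SELECT LetId, StreetAddressLine1, CompanyName
FROM (
    SELECT LetId, StreetAddressLine1, CompanyName,
           ROW_NUMBER() OVER (PARTITION BY CompanyName, StreetAddressLine1
                              ORDER BY LetId) AS seqnum
    FROM Dim.Let
) t
WHERE seqnum > 1
"""

duplicateRowsDF = pd.read_sql_query(dup_sql, conn)
duplicateRowsDF.to_csv("duplicateRowsDF_sql.csv")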

Related

Python SQLite3: how to get records that match a list of values in a column, then place them into a pandas df

I am not experienced with SQL or SQLite3.
I have a list of ids from another table. I want to use the list as a key in my query and get all records based on the list. I want the SQL query to feed directly into a DataFrame.
import pandas as pd
import sqlite3
cnx = sqlite3.connect('c:/path/to/data.sqlite')
# the below values are ones found in "s_id"
id_list = ['C20','C23','C25','C28', ... ,'C83']
# change list to sql string.
id_sql = ", ".join(str(x) for x in id_list)
df = pd.read_sql_query(f"SELECT * FROM table WHERE s_id in ({id_sql})", cnx)
I am getting a DatabaseError: Execution failed on sql 'SELECT * FROM ... : no such column: C20.
When I saw this error, I thought the code just needed a simple switch, so I tried this:
df = pd.read_sql_query(f"SELECT * FROM table WHERE ({id_sql}) in s_id", cnx)
It did not work either.
So how can I get this to work?
The table looks like this:
id    s_id   date        assigned_to   date_complete   notes
0     C10    1/6/2020    Jack          1/8/2020        None
1     C20    1/10/2020   Jane          1/12/2020       Call back
2     C23    1/11/2020   Henry         1/12/2020       finished
...   (more rows of data)
n     C83    ...         ...           ...             ...
n+1   D85    9/10/2021   Jeni          9/12/2021       Call back
Currently, you are missing the single quotes around your literal values, and consequently the SQLite engine assumes you are attempting to query columns. However, avoid concatenation of values altogether and instead bind them to parameters, which pandas.read_sql supports with the params argument:
# the below values are ones found in "s_id"
id_list = ['C20','C23','C25','C28', ... ,'C83']
# build equal length string of ? place holders
prm_list = ", ".join("?" for _ in id_list)
# build prepared SQL statement
sql = f"SELECT * FROM table WHERE s_id IN ({prm_list})"
# run query, passing parameters and values separately
df = pd.read_sql(sql, con=cnx, params=id_list)
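As a quick check that the placeholder approach works end to end, here is a self-contained sketch against a throwaway in-memory SQLite database (the jobs table and its contents below are made up for illustration):

import sqlite3
import pandas as pd

cnx = sqlite3.connect(":memory:")
cnx.executescript("""
    CREATE TABLE jobs (id INTEGER, s_id TEXT, assigned_to TEXT);
    INSERT INTO jobs VALUES (0, 'C10', 'Jack'), (1, 'C20', 'Jane'), (2, 'C23', 'Henry');
""")

id_list = ['C20', 'C23']
prm_list = ", ".join("?" for _ in id_list)        # "?, ?"
sql = f"SELECT * FROM jobs WHERE s_id IN ({prm_list})"
# values are bound safely by the driver, no manual quoting needed
df = pd.read_sql(sql, con=cnx, params=id_list)
print(df)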
So the problem was the missing single quote marks in the SQL string: the IN list needs ' on each side of every s_id value.
import pandas as pd
import sqlite3
cnx = sqlite3.connect('c:/path/to/data.sqlite')
# the below values are ones found in "s_id"
id_list = ['C20','C23','C25','C28', ... ,'C83']
# change list to sql string.
id_sql = "', '".join(str(x) for x in id_list)
df = pd.read_sql_query(f"SELECT * FROM table WHERE s_id in ('{id_sql}')", cnx)

Reading an SQL query into a Dask DataFrame

I'm trying to create a function that takes an SQL SELECT query as a parameter and uses dask to read its results into a Dask DataFrame with the dask.dataframe.read_sql_query function. I am new to dask and to SQLAlchemy.
I first tried this:
import dask.dataframe as dd
query = "SELECT name, age, date_of_birth from customer"
df = dd.read_sql_query(sql=query, con=con_string, index_col="name", npartitions=10)
As you probably already know, this won't work because the sql parameter has to be an SQLAlchemy selectable and more importantly, TextClause isn't supported.
I then wrapped the query behind a select like this:
import dask.dataframe as dd
from sqlalchemy import sql
query = "SELECT name, age, date_of_birth from customer"
sa_query = sql.select(sql.text(query))
df = dd.read_sql_query(sql=sa_query, con=con_string, index_col="name")
This fails too, with a very weird error that I have been trying to solve. The problem is that dask needs to infer the types of the columns, and it does so by reading the first head_rows rows in the table - 5 rows by default - and inferring the types from them. This line in the dask codebase adds a LIMIT ? to the query, which ends up being
SELECT name, age, date_of_birth from customer LIMIT param_1
The param_1 doesn't get substituted at all with the right value - 5 in this case. It then fails on the next line, https://github.com/dask/dask/blob/main/dask/dataframe/io/sql.py#L119, that evaluates the SQL expression.
sqlalchemy.exc.ProgrammingError: (mariadb.ProgrammingError) You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'SELECT name, age, date_of_birth from customer
LIMIT ?' at line 1
[SQL: SELECT SELECT name, age, date_of_birth from customer
LIMIT ?]
[parameters: (5,)]
(Background on this error at: https://sqlalche.me/e/14/f405)
I can't understand why param_1 wasn't substituted with the value of head_rows. One can see from the error message that it detects there's a parameter that needs to be used for the substitution but for some reason it doesn't actually substitute it.
Perhaps, I didn't correctly create the SQLAlchemy selectable?
I can simply use pandas.read_sql and create a dask dataframe from the resulting pandas dataframe but that defeats the purpose of using dask in the first place.
I have the following constraints:
- I cannot change the function to accept a ready-made sqlalchemy selectable. This feature will be added to a private library used at my company, and various projects using this library do not use sqlalchemy.
- Passing meta to the custom function is not an option because it would require the caller to create it. However, passing a meta attribute to read_sql_query and setting head_rows=0 is completely ok as long as there's an efficient way to retrieve/create it.
- While dask-sql might work for this case, using it is not an option, unfortunately.
How can I go about correctly reading an SQL query into a Dask DataFrame?
The crux of the problem is this line:
sa_query = sql.select(sql.text(query))
What is happening is that we are constructing a nested SELECT query, which can cause a problem downstream.
Let's first create a test database:
# create a test database (using https://stackoverflow.com/a/64898284/10693596)
from sqlite3 import connect
from dask.datasets import timeseries

con = "delete_me_test.sqlite"
db = connect(con)

# create a pandas df and store it (the timestamp is dropped to make sure
# that the index is numeric)
df = (
    timeseries(start="2000-01-01", end="2000-01-02", freq="1h", seed=0)
    .compute()
    .reset_index()
)
df.to_sql("ticks", db, if_exists="replace")
Next, let's try to get things working with pandas without sqlalchemy:
from pandas import read_sql_query
con = "sqlite:///test.sql"
query = "SELECT * FROM ticks LIMIT 3"
meta = read_sql_query(sql=query, con=con).set_index("index")
print(meta)
# id name x y
# index
# 0 998 Ingrid 0.760997 -0.381459
# 1 1056 Ingrid 0.506099 0.816477
# 2 1056 Laura 0.316556 0.046963
Now, let's add sqlalchemy functions:
from pandas import read_sql_query
from sqlalchemy.sql import text, select
con = "sqlite:///test.sql"
query = "SELECT * FROM ticks LIMIT 3"
sa_query = select(text(query))
meta = read_sql_query(sql=sa_query, con=con).set_index("index")
# OperationalError: (sqlite3.OperationalError) near "SELECT": syntax error
# [SQL: SELECT SELECT * FROM ticks LIMIT 3]
# (Background on this error at: https://sqlalche.me/e/14/e3q8)
Note the SELECT SELECT due to running sqlalchemy.select on an existing query. This can cause problems. How to fix this? In general, I don't think there's a safe and robust way of transforming arbitrary SQL queries into their sqlalchemy equivalent, but if this is for an application where you know that users will only run SELECT statements, you can manually sanitize the query before passing it to sqlalchemy.select:
from dask.dataframe import read_sql_query
from sqlalchemy.sql import select, text

con = "sqlite:///delete_me_test.sqlite"
query = "SELECT * FROM ticks"

def _remove_leading_select_from_query(query):
    # drop the leading SELECT keyword, since sqlalchemy.select() adds its own
    if query.startswith("SELECT "):
        return query.replace("SELECT ", "", 1)
    else:
        return query

sa_query = select(text(_remove_leading_select_from_query(query)))
ddf = read_sql_query(sql=sa_query, con=con, index_col="index")
print(ddf)
print(ddf.head(3))
# Dask DataFrame Structure:
# id name x y
# npartitions=1
# 0 int64 object float64 float64
# 23 ... ... ... ...
# Dask Name: from-delayed, 2 tasks
# id name x y
# index
# 0 998 Ingrid 0.760997 -0.381459
# 1 1056 Ingrid 0.506099 0.816477
# 2 1056 Laura 0.316556 0.046963
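Separately, since the constraints mention that passing meta together with head_rows=0 is acceptable, here is a sketch of that route (my own suggestion, not from the answer above): build meta with a cheap LIMIT query through pandas and hand it to dask, which then skips its own head-sampling; with head_rows=0 you will typically also need to supply npartitions (or divisions) yourself:

from dask.dataframe import read_sql_query
from pandas import read_sql_query as pd_read_sql_query
from sqlalchemy.sql import select, text

con = "sqlite:///delete_me_test.sqlite"
query = "SELECT * FROM ticks"

# infer column names and dtypes ourselves from a tiny sample, then keep only the empty frame
meta = pd_read_sql_query(sql=query + " LIMIT 3", con=con).set_index("index").iloc[:0]

# strip the leading SELECT as above, since select() adds its own
sa_query = select(text(query.replace("SELECT ", "", 1)))
ddf = read_sql_query(sql=sa_query, con=con, index_col="index",
                     meta=meta, head_rows=0, npartitions=2)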

Using pandas to write df to sqlite

I'm trying to create a sqlite db from a csv file. After some searching it seems like this is possible using a pandas df. I've tried following some tutorials and the documentation but I can't figure this error out. Here's my code:
# Import libraries
import pandas, csv, sqlite3
# Create sqlite database and cursor
conn = sqlite3.connect('test.db')
c = conn.cursor()
# Create the table of pitches
c.execute("""CREATE TABLE IF NOT EXISTS pitches (
pitch_type text,
game_date text,
release_speed real
)""")
conn.commit()
df = pandas.read_csv('test2.csv')
df.to_sql('pitches', conn, if_exists='append', index=False)
conn.close()
When I run this code, I get the following error:
sqlite3.OperationalError: table pitches has no column named SL
SL is the first value in the first row in my csv file. I can't figure out why it's looking at the csv value as a column name, unless it thinks the first row of the csv should be the headers and is trying to match that to column names in the table? I don't think that was it either though because I tried changing the first value to an actual column name and got the same error.
EDIT:
When I have the headers in the csv, the dataframe looks like this:
pitch_type game_date release_speed
0 SL 8/31/2017 81.9
1 SL 8/31/2017 84.1
2 SL 8/31/2017 81.9
... ... ... ...
2919 SL 8/1/2017 82.3
2920 CU 8/1/2017 78.7
[2921 rows x 3 columns]
and I get the following error:
sqlite3.OperationalError: table pitches has no column named game_date
When I take the headers out of the csv file:
SL 8/31/2017 81.9
0 SL 8/31/2017 84.1
1 SL 8/31/2017 81.9
2 SL 8/31/2017 84.1
... .. ... ...
2918 SL 8/1/2017 82.3
2919 CU 8/1/2017 78.7
[2920 rows x 3 columns]
and I get the following error:
sqlite3.OperationalError: table pitches has no column named SL
EDIT #2:
I tried taking the table creation out of the code entirely, per this answer, with the following code:
# Import libraries
import pandas, csv, sqlite3
# Create sqlite database and cursor
conn = sqlite3.connect('test.db')
c = conn.cursor()
df = pandas.read_csv('test2.csv')
df.to_sql('pitches', conn, if_exists='append', index=False)
conn.close()
and still get the
sqlite3.OperationalError: table pitches has no column named SL
error
EDIT #3:
I changed the table creation code to the following:
# Create the table of pitches
dropTable = 'DROP TABLE pitches'
c.execute(dropTable)
createTable = "CREATE TABLE IF NOT EXISTS pitches(pitch_type text, game_date text, release_speed real)"
c.execute(createTable)
and it works now. Not sure what exactly changed, as it looks basically the same to me, but it works.
If you are trying to create a table from a csv file you can just run sqlite3 and do:
sqlite> .mode csv
sqlite> .import c:/path/to/file/myfile.csv myTableName
Check your column names. I am able to replicate your code successfully with no errors. The names variable gets all the column names from the sqlite table, and you can compare them with the dataframe headers from df.columns.
# Import libraries
import pandas as pd, csv, sqlite3
# Create sqlite database and cursor
conn = sqlite3.connect('test.db')
c = conn.cursor()
# Create the table of pitches
c.execute("""CREATE TABLE IF NOT EXISTS pitches (
pitch_type text,
game_date text,
release_speed real
)""")
conn.commit()
test = conn.execute('SELECT * from pitches')
names = [description[0] for description in test.description]
print(names)
df = pd.DataFrame([['SL','8/31/2017','81.9']],columns = ['pitch_type','game_date','release_speed'])
df.to_sql('pitches', conn, if_exists='append', index=False)
conn.execute('SELECT * from pitches').fetchall()
>> [('SL', '8/31/2017', 81.9), ('SL', '8/31/2017', 81.9)]
I am guessing there might be some whitespaces in your column headers.
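If stray whitespace in the headers is indeed the culprit, a possible fix (my own sketch, not part of the original answer) is to normalize the headers before calling to_sql:

import pandas, sqlite3

conn = sqlite3.connect('test.db')
df = pandas.read_csv('test2.csv')

# strip stray whitespace so the headers match the sqlite column names exactly
df.columns = df.columns.str.strip()
df.to_sql('pitches', conn, if_exists='append', index=False)
conn.close()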
As you can see from pandas read_csv docs:
header : int or list of ints, default 'infer'
Row number(s) to use as the column names, and the start of the
data. Default behavior is to infer the column names: if no names
are passed the behavior is identical to ``header=0`` and column
names are inferred from the first line of the file, if column
names are passed explicitly then the behavior is identical to
``header=None``. Explicitly pass ``header=0`` to be able to
replace existing names. The header can be a list of integers that
specify row locations for a multi-index on the columns
e.g. [0,1,3]. Intervening rows that are not specified will be
skipped (e.g. 2 in this example is skipped). Note that this
parameter ignores commented lines and empty lines if
``skip_blank_lines=True``, so header=0 denotes the first line of
data rather than the first line of the file.
That means read_csv is using your first row as header names.
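And if the csv file has no header row at all, one option (again a sketch; the column names are taken from the CREATE TABLE statement in the question) is to tell read_csv explicitly that there is no header and supply the names yourself:

import pandas, sqlite3

conn = sqlite3.connect('test.db')

# header=None stops pandas from treating the first data row ("SL", ...) as column names
df = pandas.read_csv('test2.csv', header=None,
                     names=['pitch_type', 'game_date', 'release_speed'])
df.to_sql('pitches', conn, if_exists='append', index=False)
conn.close()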

How to select data from SQL Server based on data available in pandas data frame?

I have a list of data in one of my pandas dataframe columns for which I want to query a SQL Server database. Is there any way I can query a SQL Server DB based on the data I have in a pandas dataframe?
select * from table_name where customerid in pd.dataframe.customerid
In SAP, there is something called "For all entries in", where the SQL can query the DB based on the data available in the array; I was trying to find something similar.
Thanks.
If you are working with a tiny DataFrame, then the easiest way would be to generate a corresponding SQL query:
In [8]: df
Out[8]:
id val
0 1 21
1 3 111
2 5 34
3 12 76
In [9]: q = 'select * from tab where id in ({})'.format(','.join(['?']*len(df['id'])))
In [10]: q
Out[10]: 'select * from tab where id in (?,?,?,?)'
now you can read data from SQL Server:
from sqlalchemy import create_engine
conn = create_engine(...)
new = pd.read_sql(q, conn, params=tuple(df['id']))
NOTE: this approach will not work for bigger DFs, as the generated query (and/or list of bind variables) might be too long either for pandas' read_sql() function or for SQL Server, or even for both.
For bigger DFs I would recommend writing your pandas DF to a SQL Server table and then using an SQL subquery to filter the needed data:
df[list_of_columns_to_save].to_sql('tmp_tab_name', conn, index=False, if_exists='replace')
q = "select * from tab where id in (select id from tmp_tab_name)"
new = pd.read_sql(q, conn)
This is a very familiar scenario, and one can use the code below to query SQL Server with a very large pandas dataframe. The parameter n needs to be tuned based on your SQL Server memory; for me n=25000 worked.
n = 25000  # chunk row size
# big_frame dataframe divided into smaller chunks of n rows each
list_df = [big_frame[i:i+n] for i in range(0, big_frame.shape[0], n)]
# Create another dataframe with column names as expected from SQL
big_frame_2 = pd.DataFrame(columns=[<Mention all column names from SQL>])
# Print total no. of iterations
print("Total Iterations:", len(list_df))
for i in range(0, len(list_df)):
    print("Iteration :", i)
    temp_frame = list_df[i]
    testList = temp_frame['customer_no']
    # Pass a smaller chunk of data to SQL (here a list of customer ids)
    temp_DF = SQL_Query(tuple(testList))
    print(temp_DF.shape[0])
    # Append all the data retrieved from SQL to big_frame_2
    big_frame_2 = big_frame_2.append(temp_DF, ignore_index=True)
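SQL_Query above is the answerer's own helper and is not defined in the post; a minimal sketch of what such a helper might look like (the server, database, and customers table names here are hypothetical) is:

import pandas as pd
import pyodbc

conn = pyodbc.connect('Driver={SQL Server};Server=myserver;Database=mydb;Trusted_Connection=yes;')

def SQL_Query(customer_ids):
    # one ? placeholder per customer id, bound as parameters rather than concatenated
    placeholders = ", ".join("?" for _ in customer_ids)
    sql = f"SELECT * FROM customers WHERE customer_no IN ({placeholders})"
    return pd.read_sql(sql, conn, params=list(customer_ids))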

Insert geometry point into MySQL from a pandas DataFrame

I am using pandas and a DataFrame to deal with some data. I want to load the data into a MySQL database where one of the fields is a Point.
In the file I am parsing with python I have the lat and lon of the points.
I have created a dataframe (df) with the point information (id and coords):
id coords
A GeomFromText( ' POINT(40.87 3.80) ' )
I have saved in coords the command required in MySQL to create a Point from the text. However, when executing:
from sqlalchemy import create_engine
engine = create_engine(dbconnection)
df.to_sql("point_test",engine, index=False, if_exists="append")
I got the following error:
DataError: (mysql.connector.errors.DataError) 1416 (22003): Cannot get
geometry object from data you send to the GEOMETRY field
This is triggered because df.to_sql sends GeomFromText( ' POINT(40.87 3.80) ' ) as the plain string "GeomFromText( ' POINT(40.87 3.80) ' )" when it should instead be executed as the MySQL function GeomFromText.
Does anyone have a suggestion for how to insert geometry field information, originally in text form, into MySQL using a pandas dataframe?
A workaround is to create a temporary table holding the string form of the geometrical information, and then update the point_test table with a call to ST_GeomFromText from the temporary table.
Assuming a database with a table point_test with id (VARCHAR(5)) and coords (POINT):
a. Create a dataframe df as an example with points "A" and "B":
import numpy as np
import pandas as pd

dfd = np.array([['id', 'geomText'],
                ["A", "POINT( 50.2 5.6 )"],
                ["B", "POINT( 50.2 50.4 )"]])
df = pd.DataFrame(data=dfd[1:, :], columns=dfd[0, :])
b. Add points "A" and "B" into point_test (only the id column) and write the geomText strings into the helper table temp_point_test:
df[['id']].to_sql("point_test",engine, index=False, if_exists="append")
df[['id', 'geomText']].to_sql("temp_point_test",engine, index=False, if_exists="append")
c. Update table point_test with the points from temp_point_test, applying ST_GeomFromText() in the select. Finally, drop temp_point_test:
conn = engine.connect()
conn.execute("update point_test pt set pt.coords=(select ST_GeomFromText(geomText) from temp_point_test tpt "+
"where pt.id=tpt.id)")
conn.execute("drop table temp_point_test")
conn.close()
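An alternative that avoids the helper table (a sketch only, assuming SQLAlchemy 1.4+ and the same point_test(id, coords) table) is to skip to_sql for the geometry column and issue a parameterized INSERT that applies ST_GeomFromText on the MySQL side:

from sqlalchemy import create_engine, text

engine = create_engine(dbconnection)  # same connection string as in the question
rows = [{"id": "A", "wkt": "POINT(50.2 5.6)"},
        {"id": "B", "wkt": "POINT(50.2 50.4)"}]

with engine.begin() as conn:
    # the WKT strings are bound as parameters; MySQL converts them to POINTs
    conn.execute(
        text("INSERT INTO point_test (id, coords) VALUES (:id, ST_GeomFromText(:wkt))"),
        rows,
    )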
