I have a .sql file that contains a database dump. I would prefer to get this file into a pandas dataframe so that I can view and manipulate the data. I'm willing to take any solution, but I need explicit instructions; I've never worked with a .sql file before.
The file's structure is as follows:
-- MySQL dump 10.13 Distrib 8.0.11, for Win64 (x86_64)
--
-- Host: localhost Database: somedatabase
-- ------------------------------------------------------
-- Server version 8.0.11
DROP TABLE IF EXISTS `selected`;
CREATE TABLE `selected` (
`date` date DEFAULT NULL,
`weekday` int(1) DEFAULT NULL,
`monthday` int(4) DEFAULT NULL,
... [more variables]) ENGINE=somengine DEFAULT CHARSET=something COLLATE=something;
LOCK TABLES `selected` WRITE;
INSERT INTO `selected` VALUES (dateval, weekdayval, monthdayval), (dateval, weekdayval, monthdayval), ... (dateval, weekdayval, monthdayval);
INSERT INTO `selected` VALUES (...), (...), ..., (...);
... (more insert statements) ...
-- Dump completed on timestamp
You should use the sqlalchemy library for this:
https://docs.sqlalchemy.org/en/13/dialects/mysql.html
Or alternatively you could use this:
https://pynative.com/python-mysql-database-connection/
The second option may be easier for loading your data into MySQL, since you can read your sql file's text and pass it as the query to the connection.
Something like this:
import mysql.connector

connection = mysql.connector.connect(host='localhost',
                                     database='database',
                                     user='user',
                                     password='pw')
cursor = connection.cursor()

# Read the dump and execute it; multi=True (mysql-connector-python 8.x) runs a script of several statements
with open('yourSQLfile.sql') as f:
    query = f.read()
for result in cursor.execute(query, multi=True):
    pass
connection.commit()
Once you've loaded your table, you create the engine with sqlalchemy to connect pandas to your database and simply use the pandas read_sql() command to load your table into a dataframe object.
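For example, a minimal sketch (the connection string, credentials and table name are placeholders to replace with your own):
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('mysql+mysqlconnector://user:pw@localhost/database')
df = pd.read_sql('SELECT * FROM selected', engine)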
Another note is that if you just want to manipulate the data, you could take the VALUES statement from the sql file and use it to populate a dataframe manually. Just turn the VALUES (....),(....),(....) list into a Python list of tuples and pass it to the DataFrame constructor. Or you could dump the VALUES statement into Excel, delete the parentheses, do text-to-columns, give it headers, save it, and then load it into a dataframe from Excel. Or just manipulate it in Excel (you could even use a CONCAT formula to recreate the sql VALUES syntax and replace the data in the sql file). It really depends on exactly what your end goal here is.
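A rough sketch of that first option, assuming the dump's values are plain numbers and single-quoted strings (no NULLs, escaped quotes, parentheses or semicolons inside the values), with the file and column names as placeholders:
import ast
import re
import pandas as pd

with open('yourSQLfile.sql') as f:
    dump = f.read()

# Pull each "INSERT INTO `selected` VALUES (...),(...);" statement and evaluate its tuples
rows = []
for values in re.findall(r"INSERT INTO `selected` VALUES (.+?);", dump, flags=re.S):
    rows.extend(ast.literal_eval('[' + values + ']'))

df = pd.DataFrame(rows, columns=['date', 'weekday', 'monthday'])  # extend with your remaining columns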
Sorry you did not receive a timely answer here.
Related
I'm writing an app in Python and part of it includes an api that needs to interact with a MySQL database. Coming from sqlite3 to sqlalchemy, there are parts of the workflow that seem a bit too verbose for my taste, and I wasn't sure if there was a way to simplify the process.
Sqlite3 Workflow
If I wanted to take a list from Python and insert it into a table I would use the approach below
import sqlite3

# connect to db
con = sqlite3.connect(":memory:")
cur = con.cursor()
# create table
cur.execute("create table lang (name, first_appeared)")
# prep data to be inserted
lang_list = [
    ("Fortran", 1957),
    ("Python", 1991),
    ("Go", 2009)
]
# add data to table
cur.executemany("insert into lang values (?, ?)", lang_list)
con.commit()
SqlAlchemy Workflow
In sqlalchemy, I would have to import the Table, Column, String, Integer, MetaData, etc. objects and do something like this
# connect to db
engine = create_engine("mysql+pymysql://....")
# (re)create table, seems like this needs to be
# done every time I want to insert anything into it?
metadata = MetaData()
metadata.reflect(engine, only=['lang'])
table = Table('lang', metadata,
              Column('name', String),
              Column('first_appeared', Integer),
              autoload=True, autoload_with=engine)
# prep data to be inserted
lang_list = [
    {'first_appeared': 1957, 'name': 'Fortran'},
    {'first_appeared': 1991, 'name': 'Python'},
    {'first_appeared': 2009, 'name': 'Go'}
]
# add data to table
engine.execute(table.insert(), lang_list)
Question
Is there a way to add data to a table in sqlalchemy without having to use Metadata, Table and Column objects? Specifically just using the connection, a statement and the list so all that needs to be run is execute?
I want to do as little sql work in Python as possible and this seems too verbose for my taste.
Possible different route
I could use a list comprehension to transform the list into one long INSERT statement so the final query looks like this
statement = """
INSERT INTO lang
VALUES ("Fortran", 1957),
("Python", 1991),
("Go", 2009);"""
con.execute(statement)
but I wasn't sure whether sqlalchemy had a simple equivalent to sqlite3's executemany for an insert statement plus a list, without having to pull in all these objects every time in order to do so.
If a list comprehension -> big statement -> execute is the simplest way to go in that regard, then that's fine; I am just new to sqlalchemy and had been using sqlite3 up until this point.
For clarification, in my actual code the connection is already using the appropriate database and the tables themselves exist - the code snippets above have nothing to do with the actual data/tables I'm working with and are just there for reproducibility/testing's sake. It's the workflow for adding to the tables that felt verbose, since I had to reconstruct them with imported objects just to add to them.
I didn't know SQLite allowed weakly typed columns as you demonstrated in your example. As far as I know, most other databases, such as MySQL and PostgreSQL, require strongly typed columns. Usually the table metadata is either reflected or pre-defined and then used, sort of like type definitions in a statically typed language. SQLAlchemy will use these types to determine how to properly format the SQL statements, i.e. wrapping strings with quotes and NOT wrapping integers with quotes.
In your mysql example you should be able to use the table straight off the metadata with metadata.tables["lang"], they call this reflecting-all-tables-at-once in the docs. This assumes the table is already defined in the mysql database. You only need to define the table columns if you need to override the reflected table's definition, as they do in the overriding-reflected-columns docs.
The docs state that this should utilize executemany and should work if you reflected the table from a database that already had it defined:
engine = create_engine("mysql+pymysql://....")
metadata = MetaData()
# Pull in table definitions from database, only lang table.
metadata.reflect(engine, only=['lang'])
# prep data to be inserted
lang_list = [
    {'first_appeared': 1957, 'name': 'Fortran'},
    {'first_appeared': 1991, 'name': 'Python'},
    {'first_appeared': 2009, 'name': 'Go'}
]
# add data to table
engine.execute(metadata.tables["lang"].insert(), lang_list)
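Note that engine.execute() was removed in SQLAlchemy 2.0; on newer versions the same bulk insert would be written against a connection (a sketch under the same assumptions as above):
with engine.begin() as conn:
    conn.execute(metadata.tables["lang"].insert(), lang_list)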
I'm new to Python and Pandas - please be gentle!
I'm using SqlAlchemy with pymssql to execute a SQL query against a SQL Server database and then convert the result set into a dataframe. I'm then attempting to write this dataframe as a Parquet file:
import pandas as pd
import sqlalchemy as sal

engine = sal.create_engine(connectionString)
conn = engine.connect()
df = pd.read_sql(query, con=conn)
df.to_parquet(outputFile)
The data I'm retrieving in the SQL query includes a uniqueidentifier column (i.e. a UUID) named rowguid. Because of this, I'm getting the following error on the last line above:
pyarrow.lib.ArrowInvalid: ("Could not convert UUID('92c4279f-1207-48a3-8448-4636514eb7e2') with type UUID: did not recognize Python value type when inferring an Arrow data type", 'Conversion failed for column rowguid with type object')
Is there any way I can force all UUIDs to strings at any point in the above chain of events?
A few extra notes:
The goal for this portion of code was to receive the SQL query text as a parameter and act as a generic SQL-to-Parquet function.
I realise I can do something like df['rowguid'] = df['rowguid'].astype(str), but it relies on me knowing which columns have uniqueidentifier types. By the time it's a dataframe, everything is an object and each query will be different.
I also know I can convert it to a char(36) in the SQL query itself, however, I was hoping to do something more "automatic" so the person writing the query doesn't trip over this problem accidentally all the time / doesn't have to remember to always convert the datatype.
Any ideas?
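Picking up on the astype(str) note above, one generic workaround, sketched under the assumption that the offending values are Python uuid.UUID objects sitting in object-typed columns, would be to stringify them before writing:
import uuid

# Convert every uuid.UUID value to str, whichever column it happens to be in
for col in df.select_dtypes(include='object').columns:
    df[col] = df[col].map(lambda v: str(v) if isinstance(v, uuid.UUID) else v)

df.to_parquet(outputFile)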
Try DuckDB
import duckdb
import pandas as pd
import sqlalchemy as sal

engine = sal.create_engine(connectionString)
conn = engine.connect()
df = pd.read_sql(query, con=conn)
# df.to_parquet(outputFile)  # the call that fails on the UUID column; use DuckDB's COPY below instead

# Close the database connection
conn.close()

# Create DuckDB connection
duck_conn = duckdb.connect(':memory:')

# Write DataFrame content to a snappy compressed parquet file
# (DuckDB can query the in-scope pandas DataFrame `df` by name)
duck_conn.execute("COPY (SELECT * FROM df) TO 'df-snappy.parquet' (FORMAT 'parquet')")
Ref:
https://duckdb.org/docs/guides/python/sql_on_pandas
https://duckdb.org/docs/sql/data_types/overview
https://duckdb.org/docs/data/parquet
I want to import the data from the file "save.csv" into my Actian PSQL database table "new_table", but I get this error:
ProgrammingError: ('42000', "[42000] [PSQL][ODBC Client Interface][LNA][PSQL][SQL Engine]Syntax Error: INSERT INTO 'new_table'<< ??? >> ('name','address','city') VALUES (%s,%s,%s) (0) (SQLPrepare)")
Below is my code:
import pyodbc
import pandas as pd

connection = 'Driver={Pervasive ODBC Interface};server=localhost;DBQ=DEMODATA'
db = pyodbc.connect(connection)
c = db.cursor()

#create table i.e new_table
csv = pd.read_csv(r"C:\Users\user\Desktop\save.csv")
for row in csv.iterrows():
    insert_command = """INSERT INTO new_table(name,address,city) VALUES (row['name'],row['address'],row['city'])"""
    c.execute(insert_command)
c.commit()
Pandas has a built-in function, to_sql(), that writes a pandas dataframe into a sql database. This might be what you are looking for. Using it you don't have to manually insert one row at a time; you can insert the entire dataframe at once.
If you want to keep using your method, the issue might be that the table "new_table" hasn't been created yet in the database, and thus you first need something like this:
CREATE TABLE new_table
(
    Name [nvarchar](100) NULL,
    Address [nvarchar](100) NULL,
    City [nvarchar](100) NULL
)
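If you do keep the row-by-row approach, the values need to be passed as query parameters rather than embedded in the SQL string. A sketch using pyodbc's qmark placeholders and the variable names from the question:
for index, row in csv.iterrows():
    c.execute(
        "INSERT INTO new_table (name, address, city) VALUES (?, ?, ?)",
        (row['name'], row['address'], row['city']),
    )
db.commit()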
EDIT:
You can use to_sql() like this on tables that already exist in the database (con must be an SQLAlchemy engine or connection, not the raw pyodbc cursor):
df.to_sql(
    "new_table",
    schema="name_of_the_schema",
    con=engine,  # an SQLAlchemy engine/connection for the target database
    if_exists="append",  # <--- This will append to an already existing table
    chunksize=10000,
    index=False,
)
I have tried the same; in my case the table is already created, and I just want to insert each row from the pandas dataframe into the Actian PSQL database.
I have created a database using sqlite3 in python that has thousands of tables. Each of these tables contains thousands of rows and ten columns. One of the columns is the date and time of an event: it is a string that is formatted as YYYY-mm-dd HH:MM:SS, which I have defined to be the primary key for each table. Every so often, I collect some new data (hundreds of rows) for each of these tables. Each new dataset is pulled from a server and loaded in directly as a pandas data frame or is stored as a CSV file. The new data contains the same ten columns as my original data. I need to update the tables in my database using this new data in the following way:
Given a table in my database, for each row in the new dataset, if the date and time of the row matches the date and time of an existing row in my database, update the remaining columns of that row using the values in the new dataset.
If the date and time does not yet exist, create a new row and insert it to my database.
Below are my questions:
I've done some searching on Google and it looks like I should be using the UPSERT (merge) functionality of sqlite but I can't seem to find any examples showing how to use it. Is there an actual UPSERT command, and if so, could someone please provide an example (preferably with sqlite3 in Python) or point me to a helpful resource?
Also, is there a way to do this in bulk so that I can UPSERT each new dataset into my database without having to go row by row? (I found this link, which suggests that it is possible, but I'm new to using databases and am not sure how to actually run the UPSERT command.)
Can UPSERT also be performed directly using pandas.DataFrame.to_sql?
My backup solution is loading in the table to be UPSERTed using pd.read_sql_query("SELECT * from table", con), performing pandas.DataFrame.merge, deleting the said table from the database, and then adding in the updated table to the database using pd.DataFrame.to_sql (but this would be inefficient).
Instead of going through the upsert command, why don't you create your own algorithm that finds a row and replaces its values if the date & time is found, and otherwise inserts a new row? Check out the code I wrote for you below, and let me know if you are still confused. You can even do this for hundreds of tables by replacing the table name in the algorithm with a variable and changing it for each name in your list of tables.
import sqlite3
import pandas as pd

csv_data = pd.read_csv("my_CSV_file.csv")  # Your CSV data path

def manual_upsert():
    con = sqlite3.connect(connection_str)
    cur = con.cursor()

    cur.execute("SELECT * FROM my_CSV_data")  # Read the existing rows from the table
    data = cur.fetchall()

    old_data_list = []  # Collection of all dates already in the database table.
    for line in data:
        old_data_list.append(line[0])  # I suppose your date column is at index 0.

    for new_data in csv_data.itertuples(index=False):  # iterate over rows, not column labels
        if new_data[0] in old_data_list:
            # Update the remaining columns when the date already exists.
            cur.execute("UPDATE my_CSV_data SET column1=?, column2=?, column3=? WHERE my_date_column=?",
                        (new_data[1], new_data[2], new_data[3], new_data[0]))
        else:
            # Insert a new row when the date is not found.
            cur.execute("INSERT INTO my_CSV_data VALUES(?,?,?,?)",
                        (new_data[0], new_data[1], new_data[2], new_data[3]))

    con.commit()
    con.close()

manual_upsert()
First, even though the questions are related, ask them separately in the future.
There is documentation on UPSERT handling in SQLite that explains how to use it, but it is a bit abstract. You can check examples and discussion here: SQLite - UPSERT *not* INSERT or REPLACE
Use a transaction and the statements are going to be executed in bulk.
As the presence of this library suggests, to_sql does not create UPSERT commands (only INSERT).
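For reference, a minimal sketch of SQLite's UPSERT syntax (INSERT ... ON CONFLICT ... DO UPDATE, available from SQLite 3.24) combined with executemany; the table, column names and rows below are made up, with the date/time column assumed to be the primary key:
import sqlite3

con = sqlite3.connect("my_database.db")  # hypothetical database file
rows = [
    ("2021-01-01 00:00:00", 1.0, 2.0),  # illustrative rows matching the table's columns
    ("2021-01-01 01:00:00", 3.0, 4.0),
]

with con:  # runs the statements in a single transaction
    con.executemany(
        """
        INSERT INTO my_table (event_time, col_a, col_b)
        VALUES (?, ?, ?)
        ON CONFLICT(event_time) DO UPDATE SET
            col_a = excluded.col_a,
            col_b = excluded.col_b
        """,
        rows,
    )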
I currently have a Python dataframe that is 23 columns and 20,000 rows.
Using Python code, I want to write my data frame into a MSSQL server that I have the credentials for.
As a test I am able to successfully write some values into the table using the code below:
import pypyodbc

connection = pypyodbc.connect('Driver={SQL Server};'
                              'Server=XXX;'
                              'Database=XXX;'
                              'uid=XXX;'
                              'pwd=XXX')
cursor = connection.cursor()

for index, row in df_EVENT5_15.iterrows():
    cursor.execute("INSERT INTO MODREPORT(rowid, location) VALUES (?,?)", (5, 'test'))
connection.commit()
But how do I write all the rows in my data frame table to the MSSQL server? In order to do so, I need to code up the following steps in my Python environment:
Delete all the rows in the MSSQL server table
Write my dataframe to the server
When you say Python data frame, I'm assuming you're using a Pandas dataframe. If that's the case, then you could use the to_sql function.
df.to_sql("MODREPORT", connection, if_exists="replace")
The if_exists argument set to "replace" drops the existing table and recreates it before writing the records.
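Note that pandas documents to_sql's con argument as an SQLAlchemy connectable (or a sqlite3 connection), so in practice the call is usually wired through an engine rather than the raw pypyodbc connection. A sketch with a placeholder connection string (pymssql or pyodbc could serve as the driver):
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("mssql+pymssql://user:password@server/database")  # placeholder credentials
df_EVENT5_15.to_sql("MODREPORT", engine, if_exists="replace", index=False)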
I realise it's been a while since you asked but the easiest way to delete ALL the rows in the SQL server table (point 1 of the question) would be to send the command
TRUNCATE TABLE Tablename
This will drop all the data in the table but leave the table and indexes empty so you or the DBA would not need to recreate it. It also uses less of the transaction log when it runs.
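A sketch of how that could combine with the dataframe write, reusing the cursor and connection from the question and an SQLAlchemy engine like the one sketched above:
# Empty the table, then append the dataframe's rows to the now-empty table
cursor.execute("TRUNCATE TABLE MODREPORT")
connection.commit()
df_EVENT5_15.to_sql("MODREPORT", engine, if_exists="append", index=False)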