I know this has been touched on several times, but I cannot seem to get it working. I am writing a Python program that takes in an sqlite3 database dump file, analyses it and recreates it using a database migration tool (called yoyo-migrations).
I am running into an issue with blob data in sqlite3 and how to correctly format it.
Here is a basic outline of what my program does:
- read in dump file, separate into CREATE statements, INSERT statements and other
- generate migration files for CREATEs
- generate a migration file for each table's inserts
- run the migration to rebuild the database ( except now it is built off of migrations)
Basically I was given a database, and need to get it under control using migrations. This is just the first step (getting the thing rebuilt using the migration tool)
Here is the table creation of the blob table:
CREATE TABLE blob_table(
blockid INTEGER PRIMARY KEY,
block blob
)
I then create the migration file:
#
# file: migrations/0001.create_table.py
# Migration to build tables (autogenerated by parse_dump.py)
#
from yoyo import step
step('CREATE TABLE blob_table( blockid INTEGER PRIMARY KEY, block blob);')
Note that I just write that to a file, and then at the end run the migrations. Next I need to write a "seed" migration that inserts the data. This is where I run into trouble!
# here is an example insert line from the dump
INSERT INTO blob_table VALUES(765,X'00063030F180800FE1C');
So the X'' stuff is the blob data, and I need to write a Python file which INSERTs this data back into the table. I have a large amount of data, so I am using the executemany syntax. Here is what the seed migration file looks like (an example):
#
# file: migrations/0011.seed_blob_table.py
# Insert seed data for blob table
#
from yoyo import step
import sqlite3
def do_step(conn):
    rows = [
        (765,sqlite3.Binary('00063030303031340494100')),
        (766,sqlite3.Binary('00063030303331341FC5150')),
        (767,sqlite3.Binary('00063030303838381FC0210'))
    ]
    cursor = conn.cursor()
    cursor.executemany('INSERT INTO blob_table VALUES (?,?)', rows)

# run the insert
step(do_step)
I have tried using sqlite3.Binary(), the python built-in buffer(), both combinations of the two as well as int('string', base=16), hex() and many others. No matter what I do it will not match up with the database from the dump. What I mean is:
If I open up the new and old database side by side and execute this query:
# in the new database, it comes out as a string
SELECT * FROM blob_table WHERE blockid=765;
> 765|00063030303031340494100
# in the old database, it displays nothing
SELECT * FROM blob_table WHERE blockid=765;
> 765|
# if I do this in the old one, I get the x'' from the dump
SELECT blockid, quote(block) FROM blob_table WHERE blockid=765;
765|X'00063030303031340494100'
# if I use quote() in the new database I get something different
SELECT blockid, quote(block) FROM blob_table WHERE blockid=765;
765|X'303030363330333033303330... (truncated; this is longer than the original and only has digits 0-9)
My end goal is to rebuild the database and have it be identical to the starting one (from which the dump was made). Any tips on getting the blob stuff to work are much appreciated!
The buffer class is capable of handling binary data. However, it takes care to preserve the data you give to it, and '00063030303031340494100' is not binary data; it is a string that contains the digits zero, zero, zero, six, etc.
To construct a string containing binary data, use decode:
import codecs
blob = buffer(codecs.decode(b'00063030303031340494100', 'hex_codec'))
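On Python 3 there is no buffer built-in, so a rough equivalent of the idea above is sketched here using bytes.fromhex. The helper name hex_literal_to_bytes and the sample hex value are made up for illustration; note that the hex strings quoted in the question have an odd number of digits, so they look truncated and would be rejected by a hex decoder as written.

```python
import sqlite3


def hex_literal_to_bytes(literal):
    """Convert a dump literal such as X'0006' into the raw bytes it encodes."""
    hex_digits = literal[2:-1]  # strip the leading X' and the trailing '
    return bytes.fromhex(hex_digits)


# In-memory database used only for this demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE blob_table(blockid INTEGER PRIMARY KEY, block blob)")

# Hypothetical sample value with an even number of hex digits.
rows = [(1, hex_literal_to_bytes("X'000630303030'"))]
conn.executemany("INSERT INTO blob_table VALUES (?,?)", rows)

# quote() shows the stored blob in the same X'..' form the dump uses.
quoted = conn.execute(
    "SELECT quote(block) FROM blob_table WHERE blockid=1"
).fetchone()[0]
print(quoted)
```

If quote() returns the same X'..' literal that appeared in the dump, the blob was stored as binary data rather than as a string of digits.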
I'm creating a Snowflake procedure using the Snowpark (Python) package that executes a query into a Snowflake dataframe, and I would like to export that into Excel. How can I accomplish that? Is there a better approach? The end goal is to export the query results into Excel. It needs to be in a Snowflake procedure since we already have other "parent" procedures. Thanks!
CREATE OR REPLACE PROCEDURE EXPORT_SP()
RETURNS string not null
LANGUAGE PYTHON
RUNTIME_VERSION = '3.8'
PACKAGES = ('snowflake-snowpark-python', 'pandas')
HANDLER = 'run'
AS
$$
import pandas

def run(snowpark_session):
    ## Execute the query into a Snowflake dataframe
    results_df = snowpark_session.sql('''
        SELECT * FROM
        MY TABLES
        ;
        ''').collect()
    return results_df
$$
;
In general, you can do this by:
1. "Unloading" the data from the table using the COPY INTO <location> command.
2. Using the GET command to copy the data to your local filesystem.
3. Opening the file with Excel! If you used the CSV format and the appropriate format options in step 1, you should be able to easily open the resulting data with Excel.
Snowpark directly supports step 1 in the DataFrameWriter.copy_into_location method. An instance of DataFrameWriter is available through the DataFrame.write attribute, as described here.
Snowpark also directly supports step 2 in the FileOperation.get method. As per the example in that documentation page, you can access this method using the .file attribute of your Snowpark session object.
Putting this all together, you should be able to do something like this in Snowpark to save a single exported file into the current working directory:
source_table = "my_table"
unload_location = "@my_stage/export.csv"

def run(session):
    df = session.table(source_table)
    df.write.copy_into_location(
        unload_location,
        file_format_type="csv",
        format_type_options=dict(
            compression="none",
            field_delimiter="\t",
        ),
        single=True,
        header=True,
    )
    session.file.get(unload_location, ".")
You can of course use session.sql() instead of session.table() as needed. You might also want to consider unloading data to the stage associated with the source data, instead of creating a separate stage, i.e. if the data is from table my_table then you would unload to the stage @%my_table.
For more details, refer to the documentation pages I linked, which contain important reference information as well as several examples.
Note that I am not sure if session.file is accessible from inside a stored procedure; you will have to experiment to see what works in your specific situation.
As always, remember that this is untested code written by an unpaid volunteer. Always triple-check and test any code that is provided here. Please do ask questions in the comments if anything is still unclear.
I'm writing Python code that creates a SQLite database and does some calculations on massive tables. The reason I'm doing it in SQLite through Python is memory: my data is so large that it causes a memory error if run in, say, pandas, and if chunked it takes ages, mainly because pandas is slow with merges, group-bys, etc.
So my issue now is that at some point I want to calculate the exponential of one column in a table (sample code below), but it seems that SQLite doesn't have an EXP function.
I can write the data to a dataframe and then use numpy to calculate the EXP, but that defeats the whole point that pushed me towards databases: avoiding the additional time of reading and writing back and forth between the DB and Python.
So my question is this: is there a way to calculate the exponential within the database? I've read that I can create the function within sqlite3 in Python, but I have no idea how. If you know how, or can direct me to where I can find relevant info, I would be thankful.
Sample of my code where I'm trying to do the calculation. Note that here I'm just providing a sample where the table comes directly from a CSV, but in my process it's actually created within the DB after lots of merges and group-bys:
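import sqlite3
import pandas as pd
#set path and files names
folderPath = 'C:\\SCP\\'
inputDemandFile = 'demandFile.csv'
dataBaseName = 'demand.db'  # placeholder: the database file name is not given in the original post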
#set connection to database
conn = sqlite3.connect(folderPath + dataBaseName)
cur = conn.cursor()
#read demand file into db
inputDemand = pd.read_csv(folderPath + inputDemandFile)
inputDemand.to_sql('inputDemand', conn, if_exists='replace', index=False)
#create new table and calculate EXP
cur.execute('CREATE TABLE demand_exp AS SELECT from_zone_id, to_zone_id, EXP(demand) AS EXP_Demand FROM inputDemand;')
i've read that i can create the function within sqlite3 in python, but i have no idea how.
That's conn.create_function()
https://docs.python.org/3/library/sqlite3.html#sqlite3.Connection.create_function
>>> import math
>>> conn.create_function('EXP', 1, math.exp)
>>> cur.execute('select EXP(1)')
>>> cur.fetchone()
(2.718281828459045,)
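Applied to the CREATE TABLE ... AS SELECT statement from the question, this might look like the sketch below. The in-memory database, the sample rows and the safe_exp wrapper are assumptions added for illustration; the wrapper is there because math.exp raises a TypeError when SQLite passes a NULL as None.

```python
import math
import sqlite3


def safe_exp(value):
    """Return exp(value), passing SQL NULL (None) through unchanged."""
    return None if value is None else math.exp(value)


conn = sqlite3.connect(":memory:")
# Register the Python function under the SQL name EXP, taking one argument.
conn.create_function("EXP", 1, safe_exp)

# Small stand-in for the inputDemand table from the question.
conn.execute(
    "CREATE TABLE inputDemand(from_zone_id INTEGER, to_zone_id INTEGER, demand REAL)"
)
conn.executemany(
    "INSERT INTO inputDemand VALUES (?,?,?)",
    [(1, 2, 0.0), (1, 3, 1.0), (1, 4, None)],
)

conn.execute(
    "CREATE TABLE demand_exp AS "
    "SELECT from_zone_id, to_zone_id, EXP(demand) AS EXP_Demand FROM inputDemand"
)
result = conn.execute(
    "SELECT EXP_Demand FROM demand_exp ORDER BY to_zone_id"
).fetchall()
print(result)
```

The function is registered per connection, so it has to be created again on every new connection that needs it. The calculation still calls into Python once per row, but the data never leaves the database as a dataframe.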
I'm using AWS Glue to move multiple files to an RDS instance from S3. Each day I get a new file in S3 which may contain new data, but can also contain a record I have already saved, with some updated values. If I run the job multiple times I will of course get duplicate records in the database. Instead of multiple records being inserted, I want Glue to try to update a record if it notices a field has changed; each record has a unique id. Is this possible?
I followed an approach similar to the one suggested as the 2nd option by Yuriy: get the existing data as well as the new data, do some processing to merge the two, and write with overwrite mode. The following code should give you an idea of how to solve this problem.
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext()
glueContext = GlueContext(sc)
# get your source data
src_data = glueContext.create_dynamic_frame.from_catalog(database = src_db, table_name = src_tbl)
src_df = src_data.toDF()
# get your destination data
dst_data = glueContext.create_dynamic_frame.from_catalog(database = dst_db, table_name = dst_tbl)
dst_df = dst_data.toDF()
# merge the two data frames; note that union() alone keeps duplicate rows,
# so deduplicate on your unique id afterwards if needed
merged_df = dst_df.union(src_df)
# finally save data to destination with OVERWRITE mode
merged_df.write.format('jdbc').options(url = dest_jdbc_url,
                                       user = dest_user_name,
                                       password = dest_password,
                                       dbtable = dest_tbl).mode("overwrite").save()
Unfortunately there is no elegant way to do it with Glue. If you would write to Redshift you could use postactions to implement Redshift merge operation. However, it's not possible for other jdbc sinks (afaik).
Alternatively in your ETL script you can load existing data from a database to filter out existing records before saving. However if your DB table is big then the job may take a while to process it.
Another approach is to write into a staging table with mode 'overwrite' first (replace existing staging data) and then make a call to a DB via API to copy new records only into a final table.
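The staging-table idea can be sketched as follows. This is only an illustration of the pattern, using Python's built-in sqlite3 as a stand-in for the real RDS engine, and it also overwrites rows whose id already exists, since that is what the question asks for. The table names, columns and sample rows are made up, and the INSERT OR REPLACE statement is SQLite syntax; other engines use their own upsert statement.

```python
import sqlite3

# In-memory SQLite database standing in for the target database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE final_table(id INTEGER PRIMARY KEY, value TEXT)")
conn.execute("CREATE TABLE staging_table(id INTEGER PRIMARY KEY, value TEXT)")

# Rows already present in the final table.
conn.executemany(
    "INSERT INTO final_table VALUES (?,?)",
    [(1, "old"), (2, "unchanged")],
)

# Rows the ETL job wrote to the staging table: one update and one new record.
conn.executemany(
    "INSERT INTO staging_table VALUES (?,?)",
    [(1, "updated"), (3, "new")],
)

# Copy staging rows into the final table, replacing rows with a matching id.
conn.execute(
    "INSERT OR REPLACE INTO final_table (id, value) "
    "SELECT id, value FROM staging_table"
)
conn.commit()

merged = conn.execute("SELECT id, value FROM final_table ORDER BY id").fetchall()
print(merged)
```

Because the merge runs as a single statement inside the database, running the job twice with the same staging data leaves the final table unchanged instead of duplicating rows.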
I have used INSERT INTO table ... ON DUPLICATE KEY ... for UPSERTs into an Aurora RDS instance running the MySQL engine. Maybe this would be a reference for your use case. We cannot do this through the JDBC writer, since only the APPEND, OVERWRITE and ERROR modes are currently supported.
I am not sure which RDS database engine you are using; the following is an example for MySQL UPSERTs.
Please see this reference, where I have posted a solution using INSERT INTO TABLE ... ON DUPLICATE KEY for MySQL:
Error while using INSERT INTO table ON DUPLICATE KEY, using a for loop array
I have a script that stores results in PDF format in a particular folder. I want to create a MySQL database (which is successful with the code below) and populate it with the PDF results. What would be the best way: storing the file itself, or a reference to its location? The file size would be around 2 MB. Could someone help by explaining this with some working examples? I am new to both Python and MySQL. Thanks in advance.
To clarify further: I tried using LOAD DATA INFILE and the BLOB type for the result file column, but it doesn't seem to work. I am using the pymysql module to connect to the database. The code below connects to the database and is successful.
import pymysql

conn = pymysql.connect(host='hostname', port=3306, user='root', passwd='abcdef', db='mydb')
cur = conn.cursor()
cur.execute("SELECT * FROM userlogin")
for r in cur.fetchall():
    print(r)
cur.close()
conn.close()
Since you seem to be close to getting mysql to store strings for you (user names), your best bet is to just stick with what you did there and store the file path just as you stored the strings in your userlogin table (but in a different table with a foreign key to userlogin). It will probably be the most efficient approach in the long run anyway, especially if you store important metadata along with the file path (like keywords or even complete n-gram sets)... now you're talking about a file indexing system like Google Desktop or Xapian... just so you know what you're up against if you want to do this the "best" way.
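A minimal sketch of the "store the path, not the file" approach is shown below. The table result_files, its columns and the function save_result_path are hypothetical names, and the demo runs against an in-memory sqlite3 database only so that it is self-contained; with pymysql you would pass your existing connection and keep the default %s placeholder.

```python
import os
import sqlite3


def save_result_path(conn, user_id, pdf_path, placeholder="%s"):
    """Insert the absolute path of a result file for the given user.

    `placeholder` is the parameter marker of the driver in use:
    pymysql uses %s, sqlite3 uses ?.
    """
    sql = "INSERT INTO result_files (user_id, file_path) VALUES ({0}, {0})".format(
        placeholder
    )
    cur = conn.cursor()
    cur.execute(sql, (user_id, os.path.abspath(pdf_path)))
    conn.commit()
    cur.close()


# Demo with sqlite3 standing in for MySQL (hypothetical schema).
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE userlogin(id INTEGER PRIMARY KEY, name TEXT)")
demo_conn.execute(
    "CREATE TABLE result_files("
    "id INTEGER PRIMARY KEY, "
    "user_id INTEGER REFERENCES userlogin(id), "
    "file_path TEXT)"
)
demo_conn.execute("INSERT INTO userlogin VALUES (1, 'alice')")

save_result_path(demo_conn, 1, "results/report.pdf", placeholder="?")
stored = demo_conn.execute("SELECT user_id, file_path FROM result_files").fetchall()
print(stored)
```

The value is passed as a query parameter rather than formatted into the SQL string, so file names containing quotes or other special characters are handled by the driver.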
I would like to get some understanding of a question that I was pretty sure was clear to me. Is there any way to create a table, using psycopg2 or any other Python Postgres database adapter, with a name corresponding to the .csv file and (probably most important) with the columns that are specified in the .csv file?
I'll leave you to look at the psycopg2 library properly - this is off the top of my head (not had to use it for a while, but IIRC the documentation is ample).
The steps are:
Read the column names from the CSV file
Build a "CREATE TABLE whatever ( ... )" statement
Maybe INSERT data
import csv
import os.path

my_csv_file = '/home/somewhere/file.csv'
table_name = os.path.splitext(os.path.split(my_csv_file)[1])[0]
# the first row of the CSV holds the column names
cols = next(csv.reader(open(my_csv_file)))
You can go from there...
Create a SQL query (possibly using a templating engine for the fields), and then issue the insert if need be.
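One way to go from there is sketched below: it builds the CREATE TABLE statement as a string from the CSV header. The function names are made up, every column is declared as TEXT because a CSV header carries no type information, and the demo file is created in a temporary directory purely for illustration.

```python
import csv
import os.path
import tempfile


def quote_ident(name):
    """Quote an SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def create_table_sql(csv_path):
    """Build a CREATE TABLE statement named after the CSV file."""
    table_name = os.path.splitext(os.path.basename(csv_path))[0]
    with open(csv_path, newline="") as f:
        cols = next(csv.reader(f))  # first row holds the column names
    col_defs = ", ".join(quote_ident(c) + " TEXT" for c in cols)
    return "CREATE TABLE {} ({})".format(quote_ident(table_name), col_defs)


# Demo with a small throwaway CSV file.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "people.csv")
with open(demo_path, "w", newline="") as f:
    csv.writer(f).writerows([["name", "age"], ["Ann", "31"]])

statement = create_table_sql(demo_path)
print(statement)
```

The resulting string can be passed to cursor.execute() of a psycopg2 connection. psycopg2 also ships a psycopg2.sql module (sql.SQL, sql.Identifier) that composes identifiers safely and could replace the hand-written quote_ident helper.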