Pandas df to MySQL gives the same timestamp - python

I am trying to upload a CSV to MySQL by first reading the CSV using pandas's .read_csv function and then using the .to_sql function to upload it to the db table.
I have a modification_time column defined in the table schema as follows:
CREATE TABLE test_table (
id BIGINT(20) UNSIGNED NOT NULL,
PRIMARY KEY (id),
UNIQUE KEY unique_id (id),
modification_time TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
insertion_time TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);
and the code to read and upload the data is as follows:
from sqlalchemy import create_engine
import pymysql
import pandas as pd
import mysql.connector
from urllib.parse import quote
conn = mysql.connector.connect(
host='xx.x.x.xx', user='username', password='password', database='dbname')
sql = 'TRUNCATE ' + 'test_table' + ";"
cur = conn.cursor()
conn.commit()
conn.close()
engine = create_engine('mysql+pymysql://username:%s#xx.x.x.xx/dbname' % quote('password'),echo=True)
df = pd.read_csv("inputdata.csv")
df.to_sql('test_table', con = engine, if_exists='append', index=False)
The data is being uploaded fine for all the columns except the modification_time and insertion_time columns. The same timestamp is repeated for all the records in the table.
I want the different insertion timestamps for each row as they are getting uploaded one after another (I am passing None in the method parameter in the .to_sql function)
The method parameter is referred from here
Any suggestions are very much appreciated, Thanks.

The default MySQL TIMESTAMP records times with a granularity of seconds, so it's quite likely that multiple records will be inserted in the same second.
MariaDB [test]> create table tstest (
-> name varchar(4),
-> ts timestamp default current_timestamp(6)
-> );
Query OK, 0 rows affected (0.274 sec)
MariaDB [test]> insert into tstest (name) values ('a');
Query OK, 1 row affected (0.057 sec)
MariaDB [test]> select * from tstest;
+------+---------------------+
| name | ts |
+------+---------------------+
| a | 2021-12-28 09:42:26 |
+------+---------------------+
You can specify a fractional seconds value in the column description to increase the granularity of the timestamps recorded (6 is the highest value accepted in MySQL 8.0):
MariaDB [test]> create table tstest ( name varchar(4), ts timestamp(6) default current_timestamp() );
Query OK, 0 rows affected (0.247 sec)
MariaDB [test]> insert into tstest (name) values ('a');
Query OK, 1 row affected (0.040 sec)
MariaDB [test]> select * from tstest;
+------+----------------------------+
| name | ts |
+------+----------------------------+
| a | 2021-12-28 09:47:10.708227 |
+------+----------------------------+
This doesn't guarantee that you won't have collisions (and may be subject to the precision of the system clock), but it should make them more unlikely.
While this approach may help, if you want timestamps based on the order of records in your dataframe then I would consider setting them on arrival in the dataframe; relying on the SQL engine processing the records in a specific order may not be safe or portable.

Related

automatically insert only one row in table after calculating the sum of column

i have 3 table in my database
CREATE TABLE IF NOT EXISTS depances (
id SERIAL PRIMARY KEY UNIQUE NOT NULL,
type VARCHAR NOT NULL,
nom VARCHAR,
montant DECIMAL(100,2) NOT NULL,
date DATE,
temp TIME)
CREATE TABLE IF NOT EXISTS transactions (
id SERIAL PRIMARY KEY UNIQUE NOT NULL,
montant DECIMAL(100,2),
medecin VARCHAR,
patient VARCHAR,
acte VARCHAR,
date_d DATE,
time_d TIME,
users_id INTEGER)
CREATE TABLE IF NOT EXISTS total_jr (
id SERIAL PRIMARY KEY UNIQUE NOT NULL,
total_revenu DECIMAL(100,2),
total_depance DECIMAL(100,2),
total_différence DECIMAL(100,2),
date DATE)
my idea is to insert defrent value in table depances and transaction using a GUI interface.
and after that adding the SUM of montant.depances in total_depance.total_jr
and the SUM of montant.transactions in total_revenu.total_jr where all rows have the same time
that's the easy part using this code
self.cur.execute( '''SELECT SUM(montant) AS totalsum FROM depances WHERE date = %s''',(date,))
result = self.cur.fetchall()
for i in result:
o = i[0]
self.cur_t = self.connection.cursor()
self.cur_t.execute( '''INSERT INTO total_jr(total_depance)
VALUES (%s)'''
, (o,))
self.connection.commit()
self.cur.execute( '''UPDATE total_jr SET total_depance = %s WHERE date = %s''',(o, date))
self.connection.commit()
But every time it adds a new row to the table of total_jr
How can i add thos value of SUM(montant) to the table where the date is the same every time its only put the value of sum in one row not every time it add a new row
The result should will be like this
id|total_revenu|total_depance|total_différence|date
--+------------+-------------+----------------+----
1 sum(montant1) value value 08/07/2020
2 sum(montant2) value value 08/09/2020
3 sum(montant3) value value 08/10/2020
but it only give me this result
id|total_revenu|total_depance|total_différence|date
--+------------+-------------+----------------+----
1 1 value value 08/07/2020
2 2 value value 08/07/2020
3 3 value value 08/7/2020
if there is any idea or any hit that will be hulpefull
You didn't mention which DBMS or SQL module you're using so I'm guessing MySQL.
In your process, run the update first and check how many rows were changed. If zero row changed, then insert a new row for that date.
self.cur.execute( '''SELECT SUM(montant) AS totalsum FROM depances WHERE date = %s''',(date,))
result = self.cur.fetchall()
for i in result:
o = i[0]
self.cur.execute( '''UPDATE total_jr SET total_depance = %s WHERE date = %s''',(o, date))
rowcnt = self.cur.rowcount # number of rows updated - psycopg2
self.connection.commit()
if rowcnt == 0: # no rows updated, need to insert new row
self.cur_t = self.connection.cursor()
self.cur_t.execute( '''INSERT INTO total_jr(total_depance, date)
VALUES (%s, %s)'''
, (o, date))
self.connection.commit()
I find a solution for anyone who need it in future first of all we need to update the table
create_table_total_jr = ''' CREATE TABLE IF NOT EXISTS total_jr (
id SERIAL PRIMARY KEY UNIQUE NOT NULL,
total_revenu DECIMAL(100,2),
total_depance DECIMAL(100,2),
total_différence DECIMAL(100,2),
date DATE UNIQUE)''' #add unique to the date
and after that we use the UPSERT and ON CONFLICT
self.cur_t.execute( ''' INSERT INTO total_jr(date) VALUES (%s)
ON CONFLICT (date) DO NOTHING''', (date,))
self.connection.commit()
with this code when there is an insert value with the same date it will do nothing
after that we update the value of the SUM
self.cur.execute( '''UPDATE total_jr SET total_depance = %s WHERE date = %s''',(o, date))
self.connection.commit()
Special thanks to Mike67 for his help
You do not need 2 database calls for this. As #Mike67 suggested UPSERT functionality is what you want. However, you need to send both date and total_depance. In SQL that becomes:
insert into total_jr(date,total_depance)
values (date_value, total_value
on conflict (date)
do update
set total_depance = excluded.total_depance;
or depending on input total_depance just the transaction value while on the table total_depance is an accumulation:
insert into total_jr(date,total_depance)
values (date_value, total_value
on conflict (date)
do update
set total_depance = total_depance + excluded.total_depance;
I believe your code then becomes something like (assuming the 1st insert is correct)
self.cur_t.execute( ''' INSERT INTO total_jr(date,total_depance) VALUES (%s1,$s2)
ON CONFLICT (date) DO UPDATE set total_depance = excluded.$s2''',(date,total_depance))
self.connection.commit()
But that could off, you will need to verify.
Tip of the day: You should change the column name date to something else. Date is a reserved word in both Postgres and the SQL Standard. It has predefined meanings based on its context. While you may get away with using it as a data name Postgres still has the right to change that at any time without notice, unlikely but still true. If so, then your code (and most code using that/those table(s)) fails, and tracking down why becomes extremely difficult. Basic rule do not use reserved words as data names; using reserved words as data or db object names is a bug just waiting to bite.

Update multiple rows of SQL table from Python script

I have a massive table (over 100B records), that I added an empty column to. I parse strings from another field (string) if the required string is available, extract an integer from that field, and want to update it in the new column for all rows that have that string.
At the moment, after data has been parsed and saved locally in a dataframe, I iterate on it to update the Redshift table with clean data. This takes approx 1sec/iteration, which is way too long.
My current code example:
conn = psycopg2.connect(connection_details)
cur = conn.cursor()
clean_df = raw_data.apply(clean_field_to_parse)
for ind, row in clean_df.iterrows():
update_query = build_update_query(row.id, row.clean_integer1, row.clean_integer2)
cur.execute(update_query)
where update_query is a function to generate the update query:
def update_query(id, int1, int2):
query = """
update tab_tab
set
clean_int_1 = {}::int,
clean_int_2 = {}::int,
updated_date = GETDATE()
where id = {}
;
"""
return query.format(int1, int2, id)
and where clean_df is structured like:
id . field_to_parse . clean_int_1 . clean_int_2
1 . {'int_1':'2+1'}. 3 . np.nan
2 . {'int_2':'7-0'}. np.nan . 7
Is there a way to update specific table fields in bulk, so that there is no need to execute one query at a time?
I'm parsing the strings and running the update statement from Python. The database is stored on Redshift.
As mentioned, consider pure SQL and avoid iterating through billions of rows by pushing the Pandas data frame to Postgres as a staging table and then run one single UPDATE across both tables. With SQLAlchemy you can use DataFrame.to_sql to create a table replica of data frame. Even add an index of the join field, id, and drop the very large staging table at end.
from sqlalchemy import create_engine
engine = create_engine("postgresql+psycopg2://myuser:mypwd!#myhost/mydatabase")
# PUSH TO POSTGRES (SAME NAME AS DF)
clean_df.to_sql(name="clean_df", con=engine, if_exists="replace", index=False)
# SQL UPDATE (USING TRANSACTION)
with engine.begin() as conn:
sql = "CREATE INDEX idx_clean_df_id ON clean_df(id)"
conn.execute(sql)
sql = """UPDATE tab_tab t
SET t.clean_int_1 = c.int1,
t.clean_int_2 = c.int2,
t.updated_date = GETDATE()
FROM clean_df c
WHERE c.id = t.id
"""
conn.execute(sql)
sql = "DROP TABLE IF EXISTS clean_df"
conn.execute(sql)
engine.dispose()

How to integrate python modules with mysql Alchemy using ORM library

I am trying to update mysql database table
So I started by creating ORM Object helping me to reduce the volume of an update query by using UPDATE, WHERE Conditions
First of all, I created an ORM variable as this ORM Object is a filtered data from dataframe by using a condition in another pd.data_frame CSV
this is my simple rule as to be easy to create conditions like this
myOutlook_inBox = pd.read_csv (r'' + mydir + 'test.CSV', usecols=
['Subject','Body', 'From: (Name)', 'To: (Name)' ], encoding='latin-1')
this is simple ORM extracted data from pd.read_csv
replaced_sbj_value = myOutlook_inBox['Subject']
.str.extract(pat='(L(?:DEL|CAI|SIN).\d{5})').dropna()
and this ORM is extracting csv.column from myOutlook_inBox['Subject']
replaced_sbj_value = myOutlook_inBox['Subject']
.str.extract(pat='(L(?:DEL|CAI|SIN).\d{5})').dropna()
myOutlook_inBox["Subject"] = replaced_sbj_value
and this is a condition that I am using to filter a specific data
frm_mwfy_to_te = myOutlook_inBox.loc[myOutlook_inBox['From:
(Name)'].str.contains("mowafy", na=False)
& myOutlook_inBox['To:(Name)'].str.contains("te",
na=False)].drop_duplicates(keep=False)
frm_mwfy_to_te.Subject
and this variable is filtered rows in mysql database in a column called Subject
filtered_data = all_data
.loc[all_data.site_code.str.contains('|'.join(frm_mwfy_to_te.Subject))]
and this is my sql query, all I need now I need to create a query that's updates column called "pending" filters in a column called "site_code" and update rows which value contains filtered_data as to update or replace values in column pending with a value TE
update_db_query = engine.execute("UPDATE govtracker SET pending = 'TE'
WHERE site_code = " + filtered_data)
I am thinking that I am on the wrong scenario any Ideas to solve this
Note: I don't need to mention the old value in my query I just want to update the value in the same row according to the the filtered data frame by the new value I mentioned in the query
For example
according to frm_mwfy_to_te.Subject as Subject is a columns name called in csv file
Let's say the output of this ORM frm_mwfy_to_te.Subject
Subject
LCAIN20804
LDELE30434
LSINI20260
and this is my whole code
from sqlalchemy import create_engine
import pandas as pd
import os
import csv
import MySQLdb
from sqlalchemy import types, create_engine
# MySQL Connection
MYSQL_USER = 'root'
MYSQL_PASSWORD = 'Mharooney'
MYSQL_HOST_IP = '127.0.0.1'
MYSQL_PORT = 3306
MYSQL_DATABASE = 'mydb'
engine = create_engine('mysql+mysqlconnector://'+MYSQL_USER+'
:'+MYSQL_PASSWORD+'#'+MYSQL_HOST_IP+':'+str(MYSQL_PORT)+'/'+MYSQL_DATABASE,
echo=False)
#engine = create_engine('mysql+mysqldb://root:#localhost:123456/myDB?
charset=utf8mb4&binary_prefix=true', echo=False)
mydir = (os.getcwd()).replace('\\', '/') + '/'
all_data = pd.read_sql('SELECT * FROM govtracker', engine)
# .drop(['#'], axis=1)
myOutlook_inBox = pd.read_csv(r'' + mydir + 'test.CSV', usecols=['Subject',
'Body', 'From: (Name)', 'To: (Name)'],
encoding='latin-1')
myOutlook_inBox.columns = myOutlook_inBox.columns.str.replace(' ', '')
#this object extract 5 chars and 5 numbers from specific column in csv
replaced_sbj_value = myOutlook_inBox['Subject'].str.extract(pat='(L(?:DEL|CAI|SIN).\d{5})').dropna()
#this columns I want to filter in database
myOutlook_inBox["Subject"] = replaced_sbj_value
# this conditions filters and get and dublicate repeated data from outlook
exported file
# Condition 1 any mail from mowafy to te
frm_mwfy_to_te = myOutlook_inBox.loc[myOutlook_inBox['From:
(Name)'].str.contains("mowafy", na=False)
& myOutlook_inBox['To:
(Name)'].str.contains("te", na=False)].drop_duplicates(
keep=False)
frm_mwfy_to_te.Subject
filtered_data = all_data.loc[all_data.site_code.str.contains
('|'.join(frm_mwfy_to_te.Subject))]
print(myOutlook_inBox)
all_data.replace('\n', '', regex=True)
df = all_data.where((pd.notnull(all_data)), None)
print(df)
print("Success")
print(frm_mwfy_to_te.Subject)
print(filtered_data)
# rows = engine.execute("SELECT * FROM govtracker")#.fetchall()
# print(rows)
update_db_query = engine.execute("UPDATE govtracker SET pending = 'TE'
WHERE site_code = " + filtered_data)
"""engine = create_engine('postgresql+psycopg2://user:pswd#mydb')
df.to_sql('temp_table', engine, if_exists='replace')"""
# select_db_query = pd.read_sql("SELECT * FROM govtracker", con = engine)
#print(update_db_query)
Now let's say this is the output of my ORM then I will use this ORM as to filter and get the row of these three values from mysql database as to update every row contains these values and I want to update columns called Pending and pending status in my sql
and this is my database query
CREATE TABLE `mydb`.`govtracker` (
`id` INT,
`site_name` VARCHAR(255),
`region` VARCHAR(255),
`site_type` VARCHAR(255),
`site_code` VARCHAR(255),
`tac_name` VARCHAR(255),
`dt_readiness` DATE,
`rfs` VARCHAR(255),
`rfs_date` DATE,
`huawei_1st_submission_date` DATE,
`te_1st_submission_date` DATE,
`huawei_2nd_submission_date` DATE,
`te_2nd_submission_date` DATE,
`huawei_3rd_submission_date` DATE,
`te_3rd_submission_date` DATE,
`acceptance_date_opt` DATE,
`acceptance_date_plan` DATE,
`signed_sites` VARCHAR(255),
`as_built_date` DATE,
`as_built_status` VARCHAR(255),
`date_dt` DATE,
`dt_status` VARCHAR(255),
`shr_status` VARCHAR(255),
`dt_planned` INT(255),
`integeration_status` VARCHAR(255),
`comments_snags` LONGTEXT,
`cluster_name` LONGTEXT,
`type_standalone_colocated` VARCHAR(255),
`installed_type_standalone_colocated` VARCHAR(255),
`status` VARCHAR(255),
`pending` VARCHAR(255),
`pending_status` LONGTEXT,
`problematic_details` LONGTEXT,
`ets_tac` INT(255),
`region_r` VARCHAR(255),
`sf6_signed_date` DATE,
`sf6_signed_comment` LONGTEXT,
`comment_history` LONGTEXT,
`on_air_owner` VARCHAR(255),
`pp_owner` VARCHAR(255),
`report_comment` LONGTEXT,
`hu_opt_area_owner` VARCHAR(255),
`planning_owner` VARCHAR(255),
`po_number` VARCHAR(255),
`trigger_date` DATE,
`as_built_status_tr` VARCHAR(255)
) ENGINE = InnoDB;
Another Important note:
In excel while I using filter in some column it shows the all values in the column I selected lets to say Pending is the column I've selected which have values Accepted & PAC in progress Planning TE PP DT FM Rollout Integration Opt Team
So now all the rest columns have values like this
So should I have to create a table something like columns_values and fill this table with all these values I have, as these values are static values
It is easy to solve my case
Last Note: This database is according to an existing xlsm file but I push the data from xlsm to mysql and now mysql Is my main database, not the excel formats but I am updating mysql database through csv file not in my database the orm object frm_mwfy_to_te.Subject is an extracted data from the data frame in the csv file
Any Ideas Here?
I hope everything is clear enough
Is this material could help me or not?
https://auth0.com/blog/sqlalchemy-orm-tutorial-for-python-developers/#SQLAlchemy-ORM
It's called TL;DR
Important Note: the value of fitered data is actually as pandas Dataframe but for one column only from CSV file because I want to filter with this dataframe column values like I posted before to update some columns in my database I just started with updating one column called pending one as to see the result after that I'll update the other columns by the way the script the I want to create that to search in my database with this values in filtered data for an example I have a value called LCAIN20804 I want to take this value and to filter in database table then go the column called Huawei 1st submission date if it wasn't filled then fill with current data if it was filled go to pending column and replace the old value with TE then go to pending_status and replace the old value with waiting TE acceptance and so on that's a small part of my script I want to create
I hope this is clear enough
If you want to turn a pandas DataFrame into a SQL update statement, it may be nice to first transform it into a list of tuples, where the tuples are the new column values, and then use engine.executemany (https://stackoverflow.com/a/27743541/5015356)
values = [tuple(x) for x in filtered_data.values]
query = """
UPDATE govtracker
SET pending = 'TE'
WHERE site_code = '%s')
"""
connection = engine.connect()
update_db_query = connection.execute(query, values)
For each tuple (<sitecode>), this will execute the update statement. If you want to update more columns or expand the where clause, just add the additional columns to filtered_data, and add a new %s where you want the other value to appear.
Just make sure you keep the columns in the correct order!

Invalid Date when inserting to Teradata using Python

I'm working on a python piece that will insert a dataframe into a teradata table using pyodbc. The error I can't get past is...
File "file.py", line 33, in <module>
cursor.execute("INSERT INTO DB.TABLE (MASDIV,TRXTYPE,STATION,TUNING_EVNT_START_DT,DOW,MOY,TRANSACTIONS)VALUESrow['MASDIV'],'trx_chtr',row['STATION'],row['TUNING_EVNT_START_DT'],row['DOW'],row['MOY'],row['TRANSACTIONS'])
pyodbc.DataError: ('22008', '[22008] [Teradata][ODBC Teradata Driver][TeradataDatabase] Invalid date supplied for Table.TUNING_EVNT_START_DT. (-2666) (SQLExecDirectW)')
To fill you in... I've got a Teradata table that I want to take a dataframe and insert it into. That table is made as.
CREATE SET TABLE DB.TABLE, FALLBACK
(PK decimal(10,0) NOT NULL GENERATED ALWAYS AS IDENTITY
(START WITH 1
INCREMENT BY 1
MINVALUE 1
--MAXVALUE 2147483647
NO CYCLE),
TRXTYPE VARCHAR(10),
MASDIV VARCHAR(30),
STATION VARCHAR(50),
TUNING_EVNT_START_DT DATE format 'MM/DD/YYYY',
DOW VARCHAR(3),
MOY VARCHAR(10),
TRANSACTIONS INT,
ANOMALY_FLAG INT NOT NULL DEFAULT 1)
PRIMARY INDEX (PK);
The primary key and anomaly_flag will be automatically filled in. Below is the script that I am using and running into the error. It is reading in a csv and creating a dataframe. The first two lines (including a header) of the csv look like...
MASDIV | STATION | TUNING_EVNT_START_DT | DOW | MOY | TRANSACTIONS
Staten Island | WFUTDT4 | 9/12/18 | Wed | September | 538
San Fernando Valley | American Heroes Channel HD | 6/28/2018 | Thu | June | 12382
Here is the script that I am using...
'''
Written by Bobby October 1st, 2018
REFERENCE
https://tomaztsql.wordpkress.com/2018/07/15/using-python-pandas-dataframe-to-read-and-insert-data-to-microsoft-sql-server/
'''
import pandas as pd
import pyodbc
from datetime import datetime
#READ IN CSV TEST DATA
df = pd.read_csv('Data\\test_set.csv')
print('CSV LOADED')
#ADJUST DATE FORMAT
df['TUNING_EVNT_START_DT'] = pd.to_datetime(df.TUNING_EVNT_START_DT)
#df['TUNING_EVNT_START_DT'] =
df['TUNING_EVNT_START_DT'].dt.strftime('%m/%d/%Y')
df['TUNING_EVNT_START_DT'] = df['TUNING_EVNT_START_DT'].dt.strftime('%Y-%m-%d')
print('DATE FORMAT CHANGED')
print(df)
#PUSH TO DATABASE
conn = pyodbc.connect('dsn=ConnectR')
cursor = conn.cursor()
# Database table has columns...
# PK | TRXYPE | MASDIV | STATION | TUNING_EVNT_START_DT | DOW | MOY |
TRANSACTIONS | ANOMALY_FLAG
# PK is autoincrementing, TRXTYPE needs to be specified on insert command,
and ANOMALY_FLAG defaults to 1 for yes
for index, row in df.iterrows():
cursor.execute("INSERT INTO DLABBUAnalytics_Lab.Anomaly_Detection_SuperSet(MASDIV,TRXTYPE,STATION,TUNING_EVNT_START_DT,DOW,MOY,TRANSACTIONS)VALUES(?,?,?,?,?,?,?)", row['MASDIV'],'trx_chtr',row['STATION'],row['TUNING_EVNT_START_DT'],row['DOW'],row['MOY'],row['TRANSACTIONS'])
conn.commit()
print('RECORD ENTERED')
print('DF SUCCESSFULLY WRITTEN TO DB')
#PULL FROM DATABASE
sql_conn = pyodbc.connect('dsn=ConnectR')
query = 'SELECT * FROM DLABBUAnalytics_Lab.Anomaly_Detection_SuperSet;'
df = pd.read_sql(query, sql_conn)
print(df)
So in this I am converting the date format and trying to insert row by row into the Teradata table. The first record reads in and is in the database. The second record throws the error that is at the top. The date is 6/28/18 and I've changed it to 6/11/18 just to see if there was a mix up with day and month, but that still had the same problem. Are the columns getting off somewhere and it is trying to insert a different column's value into the date column.
Any ideas or help is much appreciated!
So the issue was in the format of the table. Initially it was built to have the MM/DD/YYYY format from the CSV, but changing it to the YYYY-MM-DD format made the script run perfectly.
Thanks!

Creating tables in MySQL based on the names of the columns in another table

I have a table with ~133M rows and 16 columns. I want to create 14 tables on another database on the same server for each of columns 3-16 (columns 1 and 2 are `id` and `timestamp` which will be in the final 14 tables as well but won't have their own table), where each table will have the name of the original column. Is this possible to do exclusively with an SQL script? It seems logical to me that this would be the preferred, and fastest way to do it.
Currently, I have a Python script that "works" by parsing the CSV dump of the original table (testing with 50 rows), creating new tables, and adding the associated values, but it is very slow (I estimated almost 1 year to transfer all 133M rows, which is obviously not acceptable). This is my first time using SQL in any capacity, and I'm certain that my code can be sped up, but I'm not sure how because of my unfamiliarity with SQL. The big SQL string command in the middle was copied from some other code in our codebase. I've tried using transactions as seen below, but it didn't seem to have any significant effect on the speed.
import re
import mysql.connector
import time
# option flags
debug = False # prints out information during runtime
timing = True # times the execution time of the program
# save start time for timing. won't be used later if timing is false
start_time = time.time()
# open file for reading
path = 'test_vaisala_sql.csv'
file = open(path, 'r')
# read in column values
column_str = file.readline().strip()
columns = re.split(',vaisala_|,', column_str) # parse columns with regex to remove commas and vasiala_
if debug:
print(columns)
# open connection to MySQL server
cnx = mysql.connector.connect(user='root', password='<redacted>',
host='127.0.0.1',
database='measurements')
cursor = cnx.cursor()
# create the table in the MySQL database if it doesn't already exist
for i in range(2, len(columns)):
table_name = 'vaisala2_' + columns[i]
sql_command = "CREATE TABLE IF NOT EXISTS " + \
table_name + "(`id` BIGINT(20) NOT NULL AUTO_INCREMENT, " \
"`timestamp` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP, " \
"`milliseconds` BIGINT(20) NOT NULL DEFAULT '0', " \
"`value` varchar(255) DEFAULT NULL, " \
"PRIMARY KEY (`id`), " \
"UNIQUE KEY `milliseconds` (`milliseconds`)" \
"COMMENT 'Eliminates duplicate millisecond values', " \
"KEY `timestamp` (`timestamp`)) " \
"ENGINE=InnoDB DEFAULT CHARSET=utf8;"
if debug:
print("Creating table", table_name, "in database")
cursor.execute(sql_command)
# read in rest of lines in CSV file
for line in file.readlines():
cursor.execute("START TRANSACTION;")
line = line.strip()
values = re.split(',"|",|,', line) # regex split along commas, or commas and quotes
if debug:
print(values)
# iterate of each data column. Starts at 2 to eliminate `id` and `timestamp`
for i in range(2, len(columns)):
table_name = "vaisala2_" + columns[i]
timestamp = values[1]
# translate timestamp back to epoch time
try:
pattern = '%Y-%m-%d %H:%M:%S'
epoch = int(time.mktime(time.strptime(timestamp, pattern)))
milliseconds = epoch * 1000 # convert seconds to ms
except ValueError: # errors default to 0
milliseconds = 0
value = values[i]
# generate SQL command to insert data into destination table
sql_command = "INSERT IGNORE INTO {} VALUES (NULL,'{}',{},'{}');".format(table_name, timestamp,
milliseconds, value)
if debug:
print(sql_command)
cursor.execute(sql_command)
cnx.commit() # commits changes in destination MySQL server
# print total execution time
if timing:
print("Completed in %s seconds" % (time.time() - start_time))
This doesn't need to be incredibly optimized; it's perfectly acceptable if the machine has to run for a few days in order to do it. But 1 year is far too long.
You can create a table from a SELECT like:
CREATE TABLE <other database name>.<column name>
AS
SELECT <column name>
FROM <original database name>.<table name>;
(Replace the <...> with your actual object names or extend it with other columns or a WHERE clause or ...)
That will also insert the data from the query into the new table. And it's probably the fastest way.
You could use dynamic SQL and information from the catalog (namely information_schema.columns) to create the CREATE statements or create them manually, which is annoying but acceptable for 14 columns I guess.
When using scripts to talk to databases you want to minimise the number of messages that are sent as each message creates a further delay on your execution time. Currently, it looks as if you are sending (by your approximation) 133 million messages, and thus, slowing down your script 133 million times. A simple optimisation would be to parse your spreadsheet and split the data into the tables (either in memory or saving them to disk) and only then send the data to the new DB.
As you hinted, it's much quicker to write an SQL script to redistribute the data.

Categories