What's the best / fastest solution for the following task:
Used technology: MySQL database + Python
I'm downloading a data.sql file. Its format:
INSERT INTO `temp_table` VALUES (group_id,city_id,zip_code,post_code,earnings,'group_name',votes,'city_name',person_id,'person_name',networth);
INSERT INTO `temp_table` VALUES (group_id,city_id,zip_code,post_code,earnings,'group_name',votes,'city_name',person_id,'person_name',networth);
.
.
Values in each row differ.
Tables structures: http://sqlfiddle.com/#!9/8f10d6
A person can have multiple cities
A person can be in only one group, or in no group at all.
A group can have multiple persons
And I know which country this .sql data comes from.
I need to split this data into 3 tables. Rows that already exist in the tables should be updated; otherwise a new row should be created.
So I came up with 2 solutions:
Split the values from the file in Python and then, for each line, perform 3 SELECTs plus 3 UPDATE/INSERTs inside a transaction.
Bulk insert the data into a temporary table and then manipulate the data inside the database: for each row in the temporary table, perform 3 SELECT queries (one against each target table), and if a matching row is found run an UPDATE, otherwise an INSERT.
I will be running this function multiple times per day with over 10K lines in the .sql file and it will be updating / creating over 30K rows in the database.
//EDIT
My inserting / updating code now:
autocommit = "SET autocommit=0"
with connection.cursor() as cursor:
    cursor.execute(autocommit)

data = data.sql  # contents of the downloaded data.sql file
lines = data.splitlines()
for line in lines:
    with connection.cursor() as cursor:
        cursor.execute(line)

temp_data = "SELECT * FROM temp_table"
with connection.cursor() as cursor:
    cursor.execute(temp_data)
    temp_data = cursor.fetchall()
for temp_row in temp_data:
    group_id = temp_row[0]
    city_id = temp_row[1]
    zip_code = temp_row[2]
    post_code = temp_row[3]
    earnings = temp_row[4]
    group_name = temp_row[5]
    votes = temp_row[6]
    city_name = temp_row[7]
    person_id = temp_row[8]
    person_name = temp_row[9]
    networth = temp_row[10]
    group_select = "SELECT * FROM perm_group WHERE group_id = %s AND countryid_fk = %s"
    group_values = (group_id, countryid)
    with connection.cursor() as cursor:
        row = cursor.execute(group_select, group_values)
    if row == 0 and group_id != 0:  # If the person doesn't have a group, do not create one
        group_insert = "INSERT INTO perm_group (group_id, group_name, countryid_fk) VALUES (%s, %s, %s)"
        group_insert_values = (group_id, group_name, countryid)
        with connection.cursor() as cursor:
            cursor.execute(group_insert, group_insert_values)
            groupid = cursor.lastrowid
    elif row == 1 and group_id != 0:
        group_update = "UPDATE perm_group SET group_name = %s WHERE group_id = %s AND countryid_fk = %s"
        group_update_values = (group_name, group_id, countryid)
        with connection.cursor() as cursor:
            cursor.execute(group_update, group_update_values)
    # Select the group id for the current row to assign the correct group to the person
    group_certain_select = "SELECT id FROM perm_group WHERE group_id = %s AND countryid_fk = %s"
    group_certain_select_values = (group_id, countryid)
    with connection.cursor() as cursor:
        cursor.execute(group_certain_select, group_certain_select_values)
        groupid = cursor.fetchone()
    # .
    # .
    # .
    # Repeating the same piece of code for person and city
Measured time: 206 seconds - which is not acceptable.
So I replaced the per-row SELECT + INSERT/UPDATE with a single INSERT ... ON DUPLICATE KEY UPDATE:
group_insert = "INSERT INTO perm_group (group_id, group_name, countryid_fk) VALUES (%s, %s, %s) ON DUPLICATE KEY UPDATE group_id = %s, group_name = %s"
group_insert_values = (group_id, group_name, countryid, group_id, group_name)
with connection.cursor() as cursor:
    cursor.execute(group_insert, group_insert_values)

# Select the group id for the current row to assign the correct group to the person
group_certain_select = "SELECT id FROM perm_group WHERE group_id = %s AND countryid_fk = %s"
group_certain_select_values = (group_id, countryid)
with connection.cursor() as cursor:
    cursor.execute(group_certain_select, group_certain_select_values)
    groupid = cursor.fetchone()
Measured time: from 30 to 50 seconds. (Still quite long, but it's getting better)
Are there any other better (faster) options on how to do it?
Thanks in advance, popcorn
I would recommend that you load the data into a staging table and do the processing in SQL.
Basically, your ultimate result is a set of SQL tables, so SQL is necessarily going to be part of the solution. You might as well put as much logic into the database as you can, to keep the number of tools needed to a minimum.
Loading 10,000 rows should not take much time. However, if you have a choice of data formats, I would recommend a CSV file over INSERT statements; INSERT statements incur extra overhead, if only because they are larger.
Once the data is in the database, I would not worry much about the processing time for storing the data in three tables.
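A minimal sketch of that approach, assuming the file is converted to CSV, the staging table is the temp_table from the question, and perm_group has a unique key on (group_id, countryid_fk); the file name and CSV options are placeholders:
# Sketch only: bulk-load the file, then upsert each target table in one statement.
# Assumes `connection` is an open MySQL connection with local_infile enabled,
# that temp_table is emptied before each run, and that perm_group has a unique
# key on (group_id, countryid_fk) so ON DUPLICATE KEY UPDATE can find the row.
load_csv = "LOAD DATA LOCAL INFILE 'data.csv' INTO TABLE temp_table FIELDS TERMINATED BY ','"

upsert_groups = """
    INSERT INTO perm_group (group_id, group_name, countryid_fk)
    SELECT DISTINCT group_id, group_name, %s
    FROM temp_table
    WHERE group_id != 0
    ON DUPLICATE KEY UPDATE group_name = VALUES(group_name)
"""

with connection.cursor() as cursor:
    cursor.execute(load_csv)
    cursor.execute(upsert_groups, (countryid,))
    # ...analogous INSERT ... SELECT ... ON DUPLICATE KEY UPDATE for city and person...
connection.commit()
This replaces the per-row round-trips between Python and MySQL with one bulk load plus a handful of set-based statements.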
Related
Beginner's question here. I wish to populate a table with many rows of data straight from a query I'm running in the same session, using executemany(). Currently, I insert each row as a tuple, as shown in the script below.
Select Query to get the needed data:
This query returns data with 4 columns Parking_ID, Snapshot_Date, Snapshot_Time, Parking_Stat
park_set_stat_query = "SET @row_number = 0;"
park_set_stat_query2 = "SET @row_number2 = 0;"
# one time load to catch only the changes done in the input table
park_change_stat_query = """select in1.Parking_ID,
in1.Snapshot_Date as Snapshot_Date,
in1.Snapshot_Time as Snapshot_Time,
in1.Parking_Stat
from (SELECT
Parking_ID,
Snapshot_Date,
Snapshot_Time,
Parking_Stat,
(@row_number:=@row_number + 1) AS num1
from Fact_Parking_Stat_Input
WHERE Parking_Stat<>0) as in1
left join (SELECT
Parking_ID,
Snapshot_Date,
Snapshot_Time,
Parking_Stat,
(@row_number2:=@row_number2 + 1)+1 AS num2
from Fact_Parking_Stat_Input
WHERE Parking_Stat<>0) as in2
on in1.Parking_ID=in2.Parking_ID and in1.num1=in2.num2
WHERE (CASE WHEN in1.Parking_Stat<>in2.Parking_Stat THEN 1 ELSE 0 END=1) OR num1=1"""
Here is the insert part of the script. As you can see below, I insert each row into the destination table Fact_Parking_Stat_Input_Alter:
mycursor = connection.cursor()
mycursor2 = connection.cursor()
mycursor.execute(park_set_stat_query)
mycursor.execute(park_set_stat_query2)
mycursor.execute(park_change_stat_query)
# # keep only changes in a staging table named Fact_Parking_Stat_Input_Alter
qSQLresults = mycursor.fetchall()
for row in qSQLresults:
    Parking_ID = row[0]
    Snapshot_Date = row[1]
    Snapshot_Time = row[2]
    Parking_Stat = row[3]
    # SQL query to INSERT a record into the table Fact_Parking_Stat_Input_Alter.
    mycursor2.execute('''INSERT into Fact_Parking_Stat_Input_Alter (Parking_ID, Snapshot_Date, Snapshot_Time, Parking_Stat)
                         values (%s, %s, %s, %s)''',
                      (Parking_ID, Snapshot_Date, Snapshot_Time, Parking_Stat))

# Commit your changes in the database
connection.commit()
mycursor.close()
mycursor2.close()
connection.close()
How can I improve the code so it will insert the data in one INSERT command?
Thanks
Amir
MySQL has an INSERT INTO ... SELECT command that is probably far more efficient than querying the data in Python, pulling it down and re-inserting it:
https://www.mysqltutorial.org/mysql-insert-into-select/
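Applied to the code above, the fetch-and-reinsert loop could collapse into a single server-side statement; a sketch reusing the same query strings and connection (untested):
# The SET statements still initialise the user variables on this connection,
# then the change-detection SELECT feeds the INSERT directly, so no rows
# travel through Python at all.
mycursor = connection.cursor()
mycursor.execute(park_set_stat_query)
mycursor.execute(park_set_stat_query2)
mycursor.execute(
    "INSERT INTO Fact_Parking_Stat_Input_Alter "
    "(Parking_ID, Snapshot_Date, Snapshot_Time, Parking_Stat) "
    + park_change_stat_query)
connection.commit()
mycursor.close()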
I am trying to create a training app in python to work with a database of movies, adding movie details via a text menu prompting user input for all fields (movie name, actors, company, etc.). I am using PostgreSQL as the database and import psycopg2 in Python.
From user input, I am collecting data which I then want to store in my database tables 'movies' and 'actors'. For one movie, there are several actors. I have this code:
def insert_movie(name, actors, company, year):
    connection = psycopg2.connect(user='postgres', password='postgres', database='movie')
    cursor = connection.cursor()

    query1 = "INSERT INTO movies (name, company, year) VALUES (%s, %s, %s);"
    cursor.execute(query1, (name, company, year))
    movie_id = cursor.fetchone()[0]
    print(movie_id)

    query2 = 'INSERT INTO actors (last_name, first_name, actor_ordinal) VALUES (%s, %s, %s);'
    for actor in actors:
        cursor.execute(query2, (tuple(actor)))
    rows = cursor.fetchall()
    actor_id1 = [row[0] for row in rows]
    actor_id2 = [row[1] for row in rows]
    print(actor_id1)
    print(actor_id2)

    connection.commit()
    connection.close()
This works great for printing movie_id after query1. However for printing actor_id2, I get IndexError: list index out of range.
If I leave only actor_id1 after query2, like this:
query2 = 'INSERT INTO actors (last_name, first_name, actor_ordinal) VALUES (%s, %s, %s);'
for actor in actors:
    cursor.execute(query2, (tuple(actor)))
rows = cursor.fetchall()
actor_id1 = [row[0] for row in rows]
print(actor_id1)
I get the following result:
movie_id --> 112
actor2_id --> 155
The problem is that I cannot retrieve actor_id1 (154) with this code.
Can anyone help with using fetchall correctly here?
OK, I have found the answer. The fetch should be done inside the loop, executing a fetch for every inserted row rather than once after the loop for all rows together:
query2 = 'INSERT INTO actors (last_name, first_name, actor_ordinal) VALUES (%s, %s, %s);'
actor_ids = []
for actor in actors:
    cursor.execute(query2, (tuple(actor)))
    actor_id = cursor.fetchone()[0]
    actor_ids.append(actor_id)
print(actor_ids)
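One assumption worth spelling out: in psycopg2, fetchone() only has a row to return if the INSERT ends with a RETURNING clause, so the statements above presumably look something like this (the movie_id / actor_id column names are guesses based on the output shown):
# Assumed shape of the INSERTs; without RETURNING, fetchone() raises
# "no results to fetch" instead of giving back the generated id.
query1 = "INSERT INTO movies (name, company, year) VALUES (%s, %s, %s) RETURNING movie_id;"
query2 = "INSERT INTO actors (last_name, first_name, actor_ordinal) VALUES (%s, %s, %s) RETURNING actor_id;"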
I have a situation where I created a method that inserts rows into a database. I provide that method with columns, values and the table name.
COLUMNS = [['NAME','SURNAME','AGE'],['SURNAME','NAME','AGE']]
VALUES = [['John','Doe',56],['Doe','John',56]]
TABLE = 'people'
This is how I would like to call it, but it doesn't work:
db = DB_CONN.MSSQL() #method for connecting to MS SQL or ORACLE etc.
cursor = db.cursor()
sql = "insert into %s (?) VALUES(?)" % TABLE
cursor.executemany([sql,[COLUMNS[0],VALUES[0]],[COLUMNS[1],VALUES[1]]])
db.commit()
This is how the query does work, but the problem is that I must have predefined column names, and that's not good: what if the next list has a different column order? Then the name will end up in surname and the surname in name.
db = DB_CONN.MSSQL() #method for connecting to MS SQL or ORACLE etc.
cursor = db.cursor()
sql = 'insert into %s (NAME,SURNAME,AGE) VALUES (?,?,?)'
cursor.executemany(sql,[['John','Doe',56],['Doe','John',56]])
db.commit()
I hope I explained it clearly enough.
P.S. COLUMNS and VALUES are extracted from a JSON dictionary:
[{'NAME':'John','SURNAME':'Doe','AGE':56...},{'SURNAME':'Doe','NAME':'John','AGE':77...}]
if that helps.
SOLUTION:
class INSERT(object):
    def __init__(self):
        self.BASE_COL = ''

    def call(self):
        GATHER_DATA = [{'NAME':'John','SURNAME':'Doe','AGE':56},{'SURNAME':'Doe','NAME':'John','AGE':77}]
        self.BASE_COL = ''
        TABLE = 'person'
        # check dictionary keys
        for DATA_EVAL in GATHER_DATA:
            if self.BASE_COL == '': self.BASE_COL = DATA_EVAL.keys()
            else:
                if self.BASE_COL != DATA_EVAL.keys():
                    print ("columns in DATA_EVAL.keys() have different columns")
                    # send mail or insert to log or remove dict from list
                    exit(403)
        # if everything goes well make an insert
        columns = ','.join(self.BASE_COL)
        sql = 'insert into %s (%s) VALUES (?,?,?)' % (TABLE, columns)
        db = DB_CONN.MSSQL()
        cursor = db.cursor()
        cursor.executemany(sql, [DATA_EVAL.values() for DATA_EVAL in GATHER_DATA])
        db.commit()

if __name__ == "__main__":
    ins = INSERT()
    ins.call()
You could take advantage of the predictable ordering of key-value pairs in Python dictionaries (insertion order is guaranteed from Python 3.7 onwards).
You should still check that all items in the JSON array of records have the same fields, otherwise you'll run into an exception in your query.
columns = ','.join(records[0].keys())
sql = 'insert into %s (%s) VALUES (?,?,?)' % (TABLE, columns)
cursor.executemany(sql,[record.values() for record in records])
References:
https://stackoverflow.com/a/835430/5189811
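A slightly more general sketch of the same idea, building the placeholder list from the actual number of columns instead of hard-coding three question marks, and reading every record's values in the same key order (records and TABLE as in the question; untested):
def insert_records(cursor, table, records):
    # All records are assumed to share the same set of keys.
    columns = list(records[0].keys())
    placeholders = ','.join('?' * len(columns))
    sql = 'insert into %s (%s) VALUES (%s)' % (table, ','.join(columns), placeholders)
    # Pull the values out by column name, so the order never depends on how
    # each individual dictionary happens to list its keys.
    cursor.executemany(sql, [[record[col] for col in columns] for record in records])
Called as insert_records(cursor, TABLE, GATHER_DATA) after opening the connection as before.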
I HAVE ADDED MY OWN ANSWER THAT WORKS BUT OPEN TO IMPROVEMENTS
After seeing a project at DataNitro, I took on getting a connection to MySQL (they use SQLite) and was able to import a small test table into Excel from MySQL.
Inserting new or updated data from the Excel sheet was the next task, and so far I can get one row to work like so...
import MySQLdb
db = MySQLdb.connect("xxx","xxx","xxx","xxx")
c = db.cursor()
c.execute("""INSERT INTO users (id, username, password, userid, fname, lname)
              VALUES (%s, %s, %s, %s, %s, %s);""",
          (Cell(5,1).value,Cell(5,2).value,Cell(5,3).value,Cell(5,4).value,Cell(5,5).value,Cell(5,6).value,))
db.commit()
db.close()
...but attempts at multiple rows fail. I suspect issues while traversing the rows in Excel. Here is what I have so far...
import MySQLdb
db = MySQLdb.connect(host="xxx.com", user="xxx", passwd="xxx", db="xxx")
c = db.cursor()
c.execute("select * from users")
usersss = c.fetchall()
updates = []
row = 2 # starting row
while True:
    data = tuple(CellRange((row,1),(row,6)).value)
    if data[0]:
        if data not in usersss:  # new record
            updates.append(data)
        row += 1
    else:  # end of table
        break
c.executemany("""INSERT INTO users (id, username, password, userid, fname, lname) VALUES (%s, %s, %s, %s, %s, %s)""", updates)
db.commit()
db.close()
...as of now, I don't get any errors, but my new line is not added (id 3). This is what my table looks like in Excel...
The database holds the same structure, minus id 3. There has to be a simpler way to traverse the rows and pull the unique content for INSERT, but after 6 hours trying different things (and 2 new Python books) I am going to ask for help.
If I run either...
print '[%s]' % ', '.join(map(str, updates))
or
print updates
my result is
[]
So this is likely not passing any data to MySQL in the first place.
LATEST UPDATE AND WORKING SCRIPT
Not exactly what I want, but this has worked for me...
c = db.cursor()
row = 2
while Cell(row,1).value != None:
    c.execute("""INSERT IGNORE INTO users (id, username, password, userid, fname, lname)
                 VALUES (%s, %s, %s, %s, %s, %s);""",
              (CellRange((row,1),(row,6)).value))
    row = row + 1
Here is your problem:
while True:
    if data[0]:
        ...
    else:
        break
Your first id is 0, so in the first iteration of the loop data[0] will be falsy and your loop will exit without ever adding any data. What you probably meant is:
while True:
    if data[0] is not None:
        ...
    else:
        break
I ended up finding a solution that gives me an INSERT for new rows and an UPDATE for those that have changed. Not exactly a Python-side selection based on a single query, but it will do.
import MySQLdb
db = MySQLdb.connect("xxx","xxx","xxx","xxx")
c = db.cursor()
row = 2
while Cell(row,1).value is not None:
c.execute("INSERT INTO users (id, username, password, \
userid, fname, lname) \
VALUES (%s, %s, %s, %s, %s, %s) \
ON DUPLICATE KEY UPDATE \
id=VALUES(id), username=VALUES(username), password=VALUES(password), \
userid=VALUES(userid), fname=VALUES(fname), lname=VALUES(lname);",
(CellRange((row,1),(row,6)).value))
row = row + 1
db.commit()
db.close()
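If the per-row execute ever becomes the bottleneck, the same upsert can be batched: collect the cell ranges first and hand them to executemany in one call (a sketch in the same DataNitro style, untested):
# Gather all rows from the sheet, then send them as one batch.
rows = []
row = 2
while Cell(row, 1).value is not None:
    rows.append(tuple(CellRange((row, 1), (row, 6)).value))
    row += 1

c.executemany("""INSERT INTO users (id, username, password, userid, fname, lname)
                 VALUES (%s, %s, %s, %s, %s, %s)
                 ON DUPLICATE KEY UPDATE
                     username=VALUES(username), password=VALUES(password),
                     userid=VALUES(userid), fname=VALUES(fname), lname=VALUES(lname)""", rows)
db.commit()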
I would like to remove duplicate data only if three columns (name, price and new price) contain the same data, but from another Python script.
So the data can be inserted into the database, but with another Python script I want to delete this duplicate data via a cron job.
So in this case:
cur.execute("INSERT INTO cars VALUES(8,'Hummer',41400, 49747)")
cur.execute("INSERT INTO cars VALUES(9,'Volkswagen',21600, 36456)")
are duplicates (row 8 repeats row 7, and row 9 is repeated by row 10). Example script with the inserted data:
import psycopg2
import sys
con = None
try:
    con = psycopg2.connect(database='testdb', user='janbodnar')
    cur = con.cursor()
    cur.execute("CREATE TABLE cars(id INT PRIMARY KEY, name VARCHAR(20), price INT, new price INT)")
    cur.execute("INSERT INTO cars VALUES(1,'Audi',52642, 98484)")
    cur.execute("INSERT INTO cars VALUES(2,'Mercedes',57127, 874897)")
    cur.execute("INSERT INTO cars VALUES(3,'Skoda',9000, 439788)")
    cur.execute("INSERT INTO cars VALUES(4,'Volvo',29000, 743878)")
    cur.execute("INSERT INTO cars VALUES(5,'Bentley',350000, 434684)")
    cur.execute("INSERT INTO cars VALUES(6,'Citroen',21000, 43874)")
    cur.execute("INSERT INTO cars VALUES(7,'Hummer',41400, 49747)")
    cur.execute("INSERT INTO cars VALUES(8,'Hummer',41400, 49747)")
    cur.execute("INSERT INTO cars VALUES(9,'Volkswagen',21600, 36456)")
    cur.execute("INSERT INTO cars VALUES(10,'Volkswagen',21600, 36456)")
    con.commit()

except psycopg2.DatabaseError, e:
    if con:
        con.rollback()
    print 'Error %s' % e
    sys.exit(1)

finally:
    if con:
        con.close()
You can do this in one statement without additional round-trips to the server.
DELETE FROM cars
USING (
SELECT id, row_number() OVER (PARTITION BY name, price, new_price
ORDER BY id) AS rn
FROM cars
) x
WHERE cars.id = x.id
AND x.rn > 1;
Requires PostgreSQL 8.4 or later for the window function row_number().
Out of a set of dupes the smallest id survives.
Note that I changed "new price" to new_price.
Or use the EXISTS semi-join that @wildplasser posted as a comment, to the same effect.
Or, by special request of CTE-devotee @wildplasser, with a CTE instead of the subquery ... :)
WITH x AS (
SELECT id, row_number() OVER (PARTITION BY name, price, new_price
ORDER BY id) AS rn
FROM cars
)
DELETE FROM cars
USING x
WHERE cars.id = x.id
AND x.rn > 1;
A data-modifying CTE requires Postgres 9.1 or later.
This form will perform about the same as the one with the subquery.
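Since the duplicates are supposed to be cleaned up from a separate Python script run by cron, the statement can be wrapped like this (a sketch reusing the connection parameters from the question, and the new_price column name as renamed above):
import psycopg2

# Stand-alone cleanup script, suitable for a cron job.
DEDUPE_SQL = """
    DELETE FROM cars
    USING (
        SELECT id, row_number() OVER (PARTITION BY name, price, new_price
                                      ORDER BY id) AS rn
        FROM cars
    ) x
    WHERE cars.id = x.id
      AND x.rn > 1
"""

con = psycopg2.connect(database='testdb', user='janbodnar')
try:
    cur = con.cursor()
    cur.execute(DEDUPE_SQL)
    con.commit()
finally:
    con.close()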
Use a GROUP BY SQL statement to identify the rows, together with the initial primary key:
duplicate_query = '''\
SELECT MIN(id), "name", price, "new price"
FROM cars
GROUP BY "name", price, "new price"
HAVING COUNT(ID) > 1
'''
The above query selects the lowest primary key id for each group of (name, price, "new price") rows where there is more than one primary key id. For your sample data, this will return:
7, 'Hummer', 41400, 49747
9, 'Volkswagen', 21600, 36456
You can then use the returned data to delete the duplicates:
delete_dupes = '''
DELETE
FROM cars
WHERE
"name"=%(name)s AND price=%(price)s AND "new price"=%(newprice)s AND
id > %(id)s
'''
cur.execute(duplicate_query)
dupes = cur.fetchall()
cur.executemany(delete_dupes, [
dict(name=r[1], price=r[2], newprice=r[3], id=r[0])
for r in dupes])
Note that we delete any row where the primary key id is larger than the first id with the same 3 columns. For the first dupe, only the row with id 8 will match, for the second dupe the row with id 10 matches.
This does do a separate delete for each dupe found. You can combine this into one statement with a WHERE EXISTS sub-select query:
delete_dupes = '''\
DELETE FROM cars cdel
WHERE EXISTS (
SELECT *
FROM cars cex
WHERE
cex."name" = cdel."name" AND
cex.price = cdel.price AND
cex."new price" = cdel."new price" AND
cex.id > cdel.id
)
'''
cur.execute(delete_dupes)
This instructs PostgreSQL to delete any row for which there are other rows with the same name, price and new price but with a primary key that is higher than the current row.