How do you avoid duplicate entry into database? - python

I would like to filter database inserts to avoid duplicates, so that only one product is inserted per productId. How do I do this?
This is my insert:
add_data = ("INSERT INTO productdetails "
            "(productId, productUrl, discount, evaluateScore, volume, packageType, lotNum, validTime, storeName, storeUrl, allImageUrls, description) "
            "VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)")
This is how it's supposed to look in PyMySQL; how do I do the same with mysql.connector?
INSERT INTO producttable (productId, productTitle, salePrice, originalPrice )
SELECT * FROM (SELECT %(productId)s, %(productTitle)s, %(salePrice)s, %(originalPrice)s) AS tmp
WHERE NOT EXISTS (
SELECT productId FROM producttable WHERE productId = %(productId)s
)
LIMIT 1;

The proper approach to do this is at the database end. You need to add a unique constraint:
ALTER TABLE productdetails
ADD UNIQUE (productId);
You can then simply do the INSERT, without any WHERE or IF.
Why?
If you keep a set, as yayati suggests, the set and the processing surrounding it become a bottleneck.
If you add the constraint, the database does fast uniqueness checks even with millions of rows. Then you just check whether the INSERT returns an error when the value is not unique.
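With the constraint in place, the Python side just attempts the INSERT and treats MySQL's duplicate-key error (errno 1062) as "already there". A minimal sketch, assuming mysql.connector as the driver and a trimmed-down column list:

```python
# MySQL reports duplicate-key violations with error code 1062
# ("Duplicate entry ... for key ...").
DUP_ENTRY = 1062

PLAIN_INSERT_SQL = (
    "INSERT INTO productdetails (productId, productUrl, discount) "
    "VALUES (%s, %s, %s)"
)

def insert_product(conn, row):
    """Try the plain INSERT; return True if the row was new."""
    import mysql.connector  # imported lazily so the sketch stays optional
    cur = conn.cursor()
    try:
        cur.execute(PLAIN_INSERT_SQL, row)
        conn.commit()
        return True           # inserted
    except mysql.connector.IntegrityError as err:
        if err.errno == DUP_ENTRY:
            return False      # duplicate productId, silently skipped
        raise                 # some other integrity problem
```

The database stays the single authority on uniqueness; the application only decides what to do when a duplicate is reported.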

Set the column to UNIQUE, then use an INSERT IGNORE statement: if there is a duplicate entry, the row is silently skipped. You can read more about INSERT IGNORE in the MySQL documentation.
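A sketch of the INSERT IGNORE variant (column list trimmed for brevity; assumes productId is already UNIQUE and a mysql.connector-style cursor):

```python
# With INSERT IGNORE, duplicate rows are dropped silently;
# cursor.rowcount afterwards counts only the rows that actually landed.
INSERT_IGNORE_SQL = (
    "INSERT IGNORE INTO productdetails (productId, productUrl, discount) "
    "VALUES (%s, %s, %s)"
)

def insert_ignoring_duplicates(cursor, rows):
    """Batch-insert rows, skipping any whose productId already exists."""
    cursor.executemany(INSERT_IGNORE_SQL, rows)
    return cursor.rowcount
```

Note that INSERT IGNORE also swallows some other errors (e.g. bad NULLs become defaults), so the constraint-plus-error-check approach is stricter.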

What you could do is create the INSERT statements via string interpolation and keep adding them to a set. A set only keeps unique strings, so you can then bulk-load the set of unique SQL INSERT statements into your RDBMS. (Be aware that interpolating values into SQL strings is vulnerable to injection; parameterized queries are safer.)
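If you do deduplicate application-side, keying a dict on productId (rather than hashing whole SQL strings) dedupes per product and keeps the query parameterized. A sketch, with made-up rows:

```python
def dedupe_by_product_id(rows):
    """Keep the first row seen for each productId.

    rows: iterable of tuples whose first element is productId.
    """
    seen = {}
    for row in rows:
        seen.setdefault(row[0], row)   # first occurrence wins
    return list(seen.values())

rows = [(1, "a"), (2, "b"), (1, "c")]
unique_rows = dedupe_by_product_id(rows)   # [(1, "a"), (2, "b")]

# then bulk-load with a parameterized statement, e.g.:
# cursor.executemany(
#     "INSERT INTO productdetails (productId, productUrl) VALUES (%s, %s)",
#     unique_rows)
```

This only guards against duplicates within one batch; the UNIQUE constraint is still what protects the table across runs.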

Related

How to specify a column's default value as a placeholder?

I've got a table with multiple columns, several of which are optional. I'm reading records from an external source, in which each record may specify values for the optional columns or not. For each record, I'd like to insert a row into the database with the given values plus the column defaults for any column that's not specified.
If all the columns are specified, I obviously just use a basic INSERT statement:
db_cursor.execute("insert into table (col1, col2, col3, col4, col5) "
                  "values (%s, %s, %s, %s, %s)",
                  (value_1, value_2, value_3, value_4, value_5))
However, if some values are unspecified, there doesn't seem to be an easy way to use the defaults for only those values. You can use the DEFAULT keyword in SQL (or, equivalently, leave those columns out of the insert statement entirely), e.g.
db_cursor.execute("insert into table (col1, col2, col3, col4, col5) "
                  "values (%s, %s, %s, DEFAULT, %s)",
                  (value_1, value_2, value_3, value_5))
But you can't pass 'DEFAULT' as a placeholder value; it'll just become that string.
So far I can only think of three approaches to this problem:
Construct the SQL query string itself at run-time based on the input data, rather than using parameterization. This is a very strong anti-pattern due to the usual SQL injection reasons. (This application isn't actually security-critical, but I don't want such anti-patterns in my code.)
Write a different query string for each possible combination of specified and unspecified parameters. Here, with four optional columns, that's 2^4 = 16 different query strings for what is logically the same insert. This is obviously unworkable.
Make the application aware of the default values and have it send them explicitly when a column is unspecified. This breaks SPOT (Single Point of Truth) for the defaults, with all the attendant maintenance and interoperability headaches (multiple applications read the database). Of the approaches I can think of, this is probably the least bad, but I'd still prefer not to do it.
Is there an easier way to manage dynamically sending defaults?
The way I usually deal with this is to have a placeholder in place of the column list and string format() the list of columns. This is safe, as the list of columns is controlled by the dev, and isn't untrusted user input.
stmt_without_col_names = 'INSERT INTO table ({}) VALUES ({})'
input_values = [1, None, 1, None, None]
columns = ('col1', 'col2', 'col3', 'col4', 'col5')
columns_to_keep = {k: v for k, v in zip(columns, input_values) if v is not None}
# note: relies on dict key ordering staying the same between creation
# and statement execution - guaranteed in Python 3.7+; use an
# OrderedDict on older versions if you're worried
placeholders = ','.join(['%s'] * len(columns_to_keep))
stmt = stmt_without_col_names.format(','.join(columns_to_keep.keys()), placeholders)
# stmt looks like "INSERT INTO table (col1,col3) VALUES (%s,%s)"
cursor.execute(stmt, list(columns_to_keep.values()))

How can I check if data exist in table and update a value, if not insert data in table using MariaDB?

I have 2 tables. First table contains some products and the second table is used for temporary data storage. Both tables have the same column names.
Table `products` contains these columns:
- id (unique, autoincrement)
- name
- quantity
- price
- group
Table `temp_stor` contains these columns:
- id (unique, autoincrement)
- name
- quantity
- price
- group
I want to get one row (name, quantity, price, group) from the first table and insert it into the second table if the data does not exist. If the same data exists in temp_stor, I want to update only one column (quantity).
For example:
I take from products the following line ('cola','1','2.5','soda') and want to check temp_stor to see if the line exists. The temp_stor table looks like this:
('milk 1L','1','1.5','milks')
('cola','1','2.5','soda')
('bread','1','0.9','pastry')
('7up','1','2.8','soda')
We see the second line exists, and I want to update its quantity. The table will then look like this:
('milk 1L','1','1.5','milks')
('cola','2','2.5','soda')
('bread','1','0.9','pastry')
('7up','1','2.8','soda')
If the table looks like this:
('milk 1L','1','1.5','milks')
('bread','1','0.9','pastry')
('7up','1','2.8','soda')
I want to insert the line into the table. So it would look like this:
('milk 1L','1','1.5','milks')
('bread','1','0.9','pastry')
('7up','1','2.8','soda')
('cola','1','2.5','soda')
Is it possible to do this through an SQL query? I need to implement this in Python code. The Python part I can handle, but I'm not that good at SQL.
Thank you
UPDATE 1:
I forgot to specify maybe the most important thing, and this is my fault. I want to check for the existence of a product name inside the temp_stor table. Only the name should be unique. If the product exists I want to update its quantity value only; if the product doesn't exist I want to insert it into the temp_stor table.
Assuming that "if the data does not exist" means "if the combo of name+quantity+price+group is not already there"?
Add
UNIQUE(name, quantity, price, `group`)  -- `group` is a reserved word, hence the backticks
Then
INSERT INTO products
    (name, quantity, price, `group`)
SELECT name, quantity, price, `group` FROM temp_stor;
Minor(?) drawback: This will (I think) 'burn' ids. In your example, it will allocate 4 new values of id, but use only one of them.
after update to Question
Do not create the index above; instead, have this. (I assume "product" is the column spelled "name"?)
UNIQUE(name)
Then...
INSERT INTO products
    (name, quantity, price, `group`)
SELECT name, quantity, price, `group` FROM temp_stor
ON DUPLICATE KEY  -- this will notice UNIQUE(name)
UPDATE
    quantity = VALUES(quantity);
It is unusual to use this construct without UPDATEing all the extra columns (quantity, price, group). What if temp_stor has a different value for price or group?
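Running this kind of upsert from Python is a one-liner once UNIQUE(name) is in place. A sketch, assuming mysql.connector (or any driver with `%s` paramstyle) and backquoting `group` because it is a reserved word; the increment mirrors the cola example, where an existing quantity of 1 becomes 2:

```python
# Upsert sketch: insert the product, or bump its quantity when the
# name already exists (relies on UNIQUE(name) on temp_stor).
UPSERT_SQL = (
    "INSERT INTO temp_stor (name, quantity, price, `group`) "
    "VALUES (%s, %s, %s, %s) "
    "ON DUPLICATE KEY UPDATE quantity = quantity + VALUES(quantity)"
)

def upsert_product(cursor, name, quantity, price, group):
    # the driver fills the placeholders; no string interpolation needed
    cursor.execute(UPSERT_SQL, (name, quantity, price, group))
```

Use `quantity = VALUES(quantity)` instead of the addition if you want to overwrite rather than accumulate.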
Take a look at How to connect Python programs to MariaDB to see how to connect to your DB.
After that you can select from temp_stor the row with the same id as the row you have obtained from products. Let row be the tuple of values you obtained.
cursor = mariadb_connection.cursor()
cursor.execute("SELECT * FROM temp_stor WHERE id=%s", (some_id,))
If the result of this query contains nothing, indicating that there is no such row, you can proceed to insert it, otherwise, update the row.
if cursor.fetchone() is None:
    try:
        cursor.execute("INSERT INTO temp_stor VALUES (%s,%s,%s,%s,%s)", row)
    except mariadb.Error as error:
        print("Error: {}".format(error))
else:
    cursor.execute("""
        UPDATE temp_stor
        SET name=%s, quantity=%s, price=%s, `group`=%s
        WHERE id=%s
        """, (row[1], row[2], row[3], row[4], row[0]))
Update:
To perform something similar with just one query:
INSERT INTO temp_stor (id, name, quantity, price, `group`) VALUES (1, 'cola', '1', '2.5', 'soda') ON DUPLICATE KEY UPDATE quantity='2'
Here I am assuming that 1 is the id and "cola" is the name. If you want name to be unique, you should make that the key, because this query currently only compares keys, which in this case seems to be id.

How to On Duplicate Key Update

I have this query that is executed in my Python script, but when it's inserting into the database and finds a duplicate in my unique column, it errors and stops. I know I need to use ON DUPLICATE KEY UPDATE, but I'm not sure how to add it properly.
My unique column is 2.
cur.execute("""INSERT INTO logs (1,2,3) VALUES (%s,%s,%s) """,(line[0], line[1], line[2]))
If there is a duplicate, I want it to update that row/entry.
If I understand you correctly, what you are looking for is this:
cur.execute("""INSERT INTO logs (1, 2, 3) VALUES (%s, %s, %s)
               ON DUPLICATE KEY UPDATE 1=%s, 3=%s""",
            (line[0], line[1], line[2], line[0], line[2]))
See also the MySQL documentation on INSERT ... ON DUPLICATE KEY UPDATE.

Bulk insert with returning IDs performance

I've got a table into which I want to insert around 1000 items per query, and I need their PKs after creation, for later use as FKs in other tables.
I've tried inserting them using the RETURNING syntax in PostgreSQL,
but it takes around 10 s to insert:
INSERT INTO table_name (col1, col2, col3) VALUES (a1,a2,a3)....(a(n-2),a(n-1),a(n)) returning id;
By removing RETURNING I get much better performance (~50 ms).
I think that if I could atomically reserve the ids and insert the rows at the same time, I could keep the better performance without RETURNING,
but I don't understand whether that is possible.
Generate the ids yourself using nextval (see the CREATE SEQUENCE documentation):
http://www.postgresql.org/docs/9.1/static/sql-createsequence.html
CREATE TEMP TABLE temp_val AS (
    VALUES (nextval('table_name_id_seq'), a1, a2, a3),
           (nextval('table_name_id_seq'), a1, a2, a3)
);

INSERT INTO table_name (id, col1, col2, col3)
SELECT column1, column2, column3, column4
FROM temp_val;

SELECT column1 FROM temp_val;
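If you are using psycopg2 (an assumption; the question doesn't name a driver), its `execute_values` helper batches the rows into a single multi-row INSERT and can still collect the RETURNING ids in one round trip, which is usually much faster than per-row inserts:

```python
# Batched INSERT ... RETURNING sketch for psycopg2 (>= 2.7).
# table_name and the column names are taken from the question.
INSERT_RETURNING_SQL = (
    "INSERT INTO table_name (col1, col2, col3) VALUES %s RETURNING id"
)

def bulk_insert_returning(conn, rows):
    """Insert all rows in one batch and return their generated ids."""
    from psycopg2.extras import execute_values  # optional dependency
    with conn.cursor() as cur:
        # fetch=True makes execute_values gather the RETURNING rows
        returned = execute_values(cur, INSERT_RETURNING_SQL, rows, fetch=True)
    conn.commit()
    return [r[0] for r in returned]
```

Note the `VALUES %s` placeholder is expanded by `execute_values` itself, not by the usual parameter substitution.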

Can I split INSERT statement into several ones without repeat inserting rows?

I have such an INSERT statement:
mtemp = "station, calendar, type, name, date, time"
query = "INSERT INTO table (%s) VALUES ('%s', '%s', '%s', %s, '%s', '%s');"
query = query % (mtemp, mstation, mcalendar, mtype, mname, mdate, mtime)
curs.execute(query)
conn.commit()
The problem is that I cannot hard-code the variables mcalendar, mdate and mtime into this statement: they are not constant values, and I have to access each of them within a for loop. The values of mstation, mtype and mname are fixed, however. I tried to split the INSERT statement into several ones: one for each of the three variables in a for loop, and one for the three fixed values in a single for loop. The for loop basically decides when to insert rows: I have a list rows1 and a list rows2; rows1 is a full list of records while rows2 lacks some of them. I check whether each rows2 record exists in rows1. If it does, I execute the INSERT; if not, I do nothing.
I ran the codes and found two problems:
It's inserting far more rows than it is supposed to. It should insert no more than 240 rows, since there are only 240 time occurrences per day per sensor. (I wonder if I wrote too many for loops, so it keeps inserting rows.) It's now getting more than 400 new rows.
The new rows only have values in the fixed-value columns. The three columns I insert in the separate for loop have no values at all.
Hope someone can give me a tip here. Thanks in advance! I can post more code if needed; I'm not even sure I'm on the right track.
I'm not sure I understand exactly your scenario, but is this the sort of thing you need?
Pseudo code
mstation = "foo"
mtype = "bar"
mname = "baz"
mtemp = "station, calendar, type, name, date, time"
queryTemplate = "INSERT INTO table (%s) VALUES ('%s', '%s', '%s', %s, '%s', '%s');"
foreach (mcalendar in calendars)
    foreach (mdate in dates)
        foreach (mtime in times)
            query = queryTemplate % (mtemp, mstation, mcalendar, mtype, mname, mdate, mtime)
            curs.execute(query)
One INSERT statement always corresponds to one new row in a table. (Unless of course there is an error during the insert.) You can INSERT a row, and then UPDATE it later to add/change information but there is no such thing as splitting up an INSERT.
If you have a query which needs to be executed multiple times with changing data, the best option is a prepared statement. A prepared statement "compiles" an SQL query but leaves placeholders that can be set each time it is executed. This improves performance because the statement doesn't need to be parsed each time. You didn't specify which library you're using to connect to Postgres, so I don't know the exact syntax, but it's something to look into.
If you can't/don't want to use prepared statements, you'll have to just create the query string once for each insert. Don't substitute the values in before the loop, wait until you know them all before creating the query.
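The per-row loop can also be collapsed into one parameterized statement plus `executemany`. A sketch (the names rows1/rows2 come from the question; the assumption that rows2 holds (calendar, date, time) triples is mine):

```python
# One parameterized INSERT reused for every row; the driver fills the
# placeholders, so no quoting or string interpolation is needed.
PARAM_INSERT_SQL = (
    "INSERT INTO table (station, calendar, type, name, date, time) "
    "VALUES (%s, %s, %s, %s, %s, %s)"
)

def build_params(mstation, mtype, mname, rows1, rows2):
    """Pair the fixed values with each rows2 record that also appears
    in rows1 (records assumed to be (calendar, date, time) triples)."""
    allowed = set(rows1)
    return [(mstation, c, mtype, mname, d, t)
            for (c, d, t) in rows2
            if (c, d, t) in allowed]

# usage (assuming an open cursor `curs` and connection `conn`):
# params = build_params(mstation, mtype, mname, rows1, rows2)
# curs.executemany(PARAM_INSERT_SQL, params)
# conn.commit()
```

Building all the parameter tuples first, then executing once, also makes it easy to count exactly how many rows will be inserted before touching the database.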
The following multi-row VALUES syntax works in SQL Server 2008 but not in SQL Server 2005:
CREATE TABLE Temp (id int, name varchar(10));
INSERT INTO Temp (id, name) VALUES (1, 'Anil'), (2, 'Ankur'), (3, 'Arjun');
SELECT * FROM Temp;
id | name
------------
1 | Anil
2 | Ankur
3 | Arjun
