SQL delete duplicate rows [duplicate]

Here is my table structure:
"Author" (varchar) | "Points" (integer) | "Body" (text)
For duplicate rows, the Author and the Body are always the same. The same author will appear multiple times throughout the database with different bodies, so I cannot delete according to the author alone. However, the Points column isn't always the same. I want to keep the row with the largest Points value.
I am using SQLite3 and Python.
Thanks
EDIT:
I have tried this, but it just deletes all the rows.
for row in cur.fetchall():
    rows = cur.execute('SELECT * FROM Posts WHERE Author=? AND Body=? AND Nested=? AND Found=?', (row['Author'], row['Body'], row['Nested'], row['Found'],))
    for row2 in rows:
        delrow = row
        if (row['Upvotes'] < row2['Upvotes'] or row['Downvotes'] < row2['Downvotes']):
            delrow = row2
        cur.execute("DELETE FROM Posts WHERE Author=? AND Body=? AND Upvotes=? AND Downvotes=? AND Nested=? AND Found=?", (delrow['Author'], delrow['Body'], delrow['Upvotes'], delrow['Downvotes'], delrow['Nested'], delrow['Found'],))
        dn += 1
        print "Deleted row ", dn
I have also tried this, but it didn't work.
cur.execute("DELETE FROM Posts WHERE Upvotes NOT IN (SELECT MAX(Upvotes) FROM Posts GROUP BY Body);")
I am also committing all the changes so it is not that. The SQLite3 module is installed correctly and I can write on the db.

Unfortunately, older SQLite3 versions (before 3.25) don't have window functions such as ROW_NUMBER() OVER (PARTITION BY ...), so one option is to do it procedurally or iteratively.
For performance reasons, I would recommend extracting your full list of deletion candidates and then deleting them en masse, e.g.
# in your sql query
SELECT ROWID, AUTHOR, BODY
FROM TABLE_NAME
ORDER BY AUTHOR, BODY, POINTS DESC
Then in your Python application, iterate through your result set, and store all non-first ROWIDs for the Author/Body combo (think CTRL-BREAK style programming), and once you're done building your set delete the row IDs.
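The control-break idea described above can be sketched with the standard sqlite3 module. This is a minimal sketch, not the answerer's code: the table name Posts is borrowed from the question's snippet, the columns Author, Points and Body from its table description, and the helper name delete_lower_scoring_duplicates is made up for illustration.

```python
import sqlite3


def delete_lower_scoring_duplicates(conn):
    """Keep the highest-Points row per (Author, Body) and delete the rest."""
    cur = conn.execute(
        "SELECT rowid, Author, Body FROM Posts "
        "ORDER BY Author, Body, Points DESC"
    )
    to_delete = []
    previous_key = None
    for rowid, author, body in cur:
        key = (author, body)
        if key == previous_key:
            # Not the first row of its group, so it is not the best-scoring one.
            to_delete.append((rowid,))
        previous_key = key
    # Delete everything that was collected in a single batch.
    conn.executemany("DELETE FROM Posts WHERE rowid = ?", to_delete)
    conn.commit()
    return len(to_delete)


# Small in-memory demonstration with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Posts (Author TEXT, Points INTEGER, Body TEXT)")
conn.executemany(
    "INSERT INTO Posts VALUES (?, ?, ?)",
    [("ann", 3, "hello"), ("ann", 7, "hello"), ("ann", 1, "other"), ("bob", 2, "hello")],
)
deleted = delete_lower_scoring_duplicates(conn)
remaining = conn.execute(
    "SELECT Author, Points, Body FROM Posts ORDER BY Author, Body"
).fetchall()
```

Because the rows arrive sorted by Points descending within each group, ties on the maximum are also reduced to a single row.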

Since you want to delete all but the highest points value, the following will do it just fine:
delete from test
where exists (select * from test t2
where test.author = t2.author
and test.body = t2.body
and test.points < t2.points);
It's a basic join to itself, and then deleting out all values that have the same author & body, but have a lower point value.
SqlFiddle here: http://sqlfiddle.com/#!7/64d62/3
Note: The one caveat is that if several rows for the same author/body pair share the maximum points value, all of those rows will be kept.
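If the tie case matters, one possible variation (an assumption on my part, not part of the answer above) is to break ties on SQLite's implicit rowid, so that only the earliest of the top-scoring rows survives. A rough sketch from Python, using the same test table as the fiddle and made-up rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (author TEXT, points INTEGER, body TEXT)")
conn.executemany(
    "INSERT INTO test VALUES (?, ?, ?)",
    [("ann", 5, "hi"), ("ann", 5, "hi"), ("ann", 2, "hi"), ("bob", 1, "hi")],
)

# Delete a row when a "better" row exists for the same author/body:
# either one with more points, or one with equal points and a lower rowid.
conn.execute(
    """
    DELETE FROM test
    WHERE EXISTS (
        SELECT 1 FROM test AS t2
        WHERE t2.author = test.author
          AND t2.body = test.body
          AND (t2.points > test.points
               OR (t2.points = test.points AND t2.rowid < test.rowid))
    )
    """
)
conn.commit()

rows = conn.execute(
    "SELECT author, points, body FROM test ORDER BY author"
).fetchall()
```

This relies on the table being an ordinary rowid table; a table declared WITHOUT ROWID would need a different tie-breaker.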

I haven't tested it, but this may work:
DELETE FROM TableName
WHERE (author, body, points) NOT IN (SELECT author, body, MAX(points) AS points
FROM TableName
GROUP BY author, body)
Run it as a SELECT query first to see if it will keep what you want. Note that the row-value comparison needs SQLite 3.15 or later.

Related

SQL database with a column being a list or a set

With a SQL database (in my case Sqlite, using Python), what is a standard way to have a column which is a set of elements?
id name items_set
1 Foo apples,oranges,tomatoes,ananas
2 Bar tomatoes,bananas
...
A simple implementation is using
CREATE TABLE data(id int, name text, items_set text);
but there are a few drawbacks:
to query all rows that have ananas, we have to use items_set LIKE '%ananas%' and some tricks with separators to avoid querying "ananas" to also return rows with "bananas", etc.
when we insert a new item in one row, we have to load the whole items_set, and see if the item is already in the list or not, before concatenating ,newitem at the end.
etc.
There is surely a better way. What is the standard SQL solution for a column that is a list or a set?
Note: I don't know in advance all the possible values for the set/list.
I can see a solution with a few additional tables, but in my tests, it multiplies the size on disk by a factor x2 or x3, which is a problem with many gigabytes of data.
Is there a better solution?

To have a well-structured SQL database, you should extract the items into their own table and use a join table between the main table and the items table.
I'm not familiar with the SQLite syntax, but you should be able to create the tables with:
CREATE TABLE entities(id INTEGER PRIMARY KEY, name text);
CREATE TABLE entity_items(entity_id int, item_id int);
CREATE TABLE items(id INTEGER PRIMARY KEY, name text);
Add data (the INTEGER PRIMARY KEY columns are filled in automatically):
INSERT INTO entities (name) VALUES ('Foo'), ('Bar');
INSERT INTO items (name) VALUES ('tomatoes'), ('ananas'), ('bananas');
INSERT INTO entity_items (entity_id, item_id) VALUES (
(SELECT id from entities WHERE name='Foo'),
(SELECT id from items WHERE name='bananas')
);
Query data:
SELECT * FROM entities
LEFT JOIN entity_items
ON entities.id = entity_items.entity_id
LEFT JOIN items
ON items.id = entity_items.item_id
WHERE items.name = 'bananas';
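Since the question uses SQLite from Python, here is a rough sketch of the same three-table layout driven through the sqlite3 module. The helper names add_item and entities_with are hypothetical, and the UNIQUE and composite PRIMARY KEY constraints are my own additions so that adding the same item twice is a no-op.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE entities (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE entity_items (
        entity_id INTEGER REFERENCES entities(id),
        item_id INTEGER REFERENCES items(id),
        PRIMARY KEY (entity_id, item_id)
    ) WITHOUT ROWID;
    """
)


def add_item(conn, entity, item):
    """Attach an item to an entity; repeated calls with the same pair do nothing."""
    conn.execute("INSERT OR IGNORE INTO entities (name) VALUES (?)", (entity,))
    conn.execute("INSERT OR IGNORE INTO items (name) VALUES (?)", (item,))
    conn.execute(
        "INSERT OR IGNORE INTO entity_items (entity_id, item_id) "
        "SELECT e.id, i.id FROM entities e, items i "
        "WHERE e.name = ? AND i.name = ?",
        (entity, item),
    )


def entities_with(conn, item):
    """Return the names of all entities whose set contains the given item."""
    rows = conn.execute(
        "SELECT e.name FROM entities e "
        "JOIN entity_items ei ON ei.entity_id = e.id "
        "JOIN items i ON i.id = ei.item_id "
        "WHERE i.name = ? ORDER BY e.name",
        (item,),
    )
    return [name for (name,) in rows]


for entity, item in [
    ("Foo", "apples"), ("Foo", "ananas"),
    ("Bar", "bananas"), ("Bar", "tomatoes"),
    ("Foo", "ananas"),  # duplicate on purpose: ignored by the primary key
]:
    add_item(conn, entity, item)
conn.commit()

link_count = conn.execute("SELECT COUNT(*) FROM entity_items").fetchone()[0]
```

Searching for "ananas" no longer matches "bananas", and the join table stores only two integers per pair, which may help with the disk-size concern raised in the question.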

You probably have two options. The standard, more conventional approach is a many-to-many relationship. For example, you have three tables: Employees, Projects, and ProjectEmployees. The last one describes the many-to-many relationship (each employee can work on multiple projects, and each project has a team).
Storing a set in a single value denormalizes the table and will complicate things either way. But if you must, use the JSON format and the JSON functionality provided by SQLite. If your SQLite version is not recent, it may not have the JSON extension built in. You would need to either update SQLite (the best option) or load the JSON extension dynamically. I'm not sure whether you can do that with the SQLite copy supplied with Python.
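As a rough illustration of the JSON route, the sketch below stores each set as a JSON array and uses SQLite's json_each table-valued function to search it. It assumes the SQLite library bundled with your Python was built with the JSON functions; if it was not, the query raises sqlite3.OperationalError. The data table follows the question, and the helper name names_containing is made up.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (id INTEGER PRIMARY KEY, name TEXT, items_set TEXT)")
conn.executemany(
    "INSERT INTO data (name, items_set) VALUES (?, ?)",
    [
        ("Foo", json.dumps(["apples", "oranges", "tomatoes", "ananas"])),
        ("Bar", json.dumps(["tomatoes", "bananas"])),
    ],
)
conn.commit()


def names_containing(conn, item):
    """Return names of rows whose JSON array contains exactly this item."""
    rows = conn.execute(
        "SELECT DISTINCT data.name FROM data, json_each(data.items_set) "
        "WHERE json_each.value = ? ORDER BY data.name",
        (item,),
    )
    return [name for (name,) in rows]
```

Unlike LIKE '%ananas%', the comparison is on whole array elements, so "ananas" does not match "bananas". The search still has to expand every row's array, so it will not use an ordinary index.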

To elaborate on what @ussu said, ideally your table would have one row per thing and item pair, using IDs instead of names:
id  thing_id  item_id
1   1         1
2   1         2
3   1         3
4   1         4
5   2         3
6   2         4
Then look-up tables for the thing names:
id  name
1   Foo
2   Bar
and for the item names:
id  name
1   apples
2   oranges
3   tomatoes
4   bananas

In MySQL, you have the SET type.
Creation:
CREATE TABLE myset (col SET('a', 'b', 'c', 'd'));
Select:
mysql> SELECT * FROM tbl_name WHERE FIND_IN_SET('value',set_col)>0;
mysql> SELECT * FROM tbl_name WHERE set_col LIKE '%value%';
Insertion:
INSERT INTO myset (col) VALUES ('a,d'), ('d,a'), ('a,d,a'), ('a,d,d'), ('d,a,d');

Efficiently delete multiple records

I am trying to execute a delete statement that checks if the table has any SKU that exists in the SKU column of the dataframe. And if it does, it deletes the row. As I am using a for statement to iterate through the rows and check, it takes a long time to run the program for 6000 rows of data.
I used executemany() as it was faster than using a for loop for the delete statement, but I am finding it hard to find an alternative for checking values in the dataframe.
sname = input("Enter name: ")
cursor = mydb.cursor(prepared=True)
column = df["SKU"]
data=list([(sname, x) for x in column])
query="""DELETE FROM price_calculations1 WHERE Name=%s AND SKU=%s"""
cursor.executemany(query,data)
mydb.commit()
cursor.close()
Is there a more efficient code for achieving the same?

You could first use SELECT id FROM price_calculations1 WHERE Name=%s AND SKU=%s
and then use a MySQL WHILE loop to delete these ids without needing a cursor, which seems to perform better.
See: https://www.mssqltips.com/sqlservertip/6148/sql-server-loop-through-table-rows-without-cursor/
A WHILE loop without the previous get, might also work.
See: https://dev.mysql.com/doc/refman/8.0/en/while.html

Rather than looping, try to do all the work in a single call to the database (this guideline is often applicable when working with databases).
Given a list of name / sku pairs:
pairs = [(name1, sku1), (name2, sku2), ...]
create a query that identifies all the matching records and deletes them in one statement:
base_query = """DELETE FROM price_calculations1
WHERE {}
"""
# Build the WHERE clause criteria: one (Name, SKU) test per pair
criteria = " OR ".join(["(Name = %s AND SKU = %s)"] * len(pairs))
# Create the query
query = base_query.format(criteria)
# "Flatten" the value pairs
values = [i for j in pairs for i in j]
cursor.execute(query, values)
# commit() belongs to the connection, not the cursor
mydb.commit()
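Since the name is the same for every SKU in the question, the criteria can also be written as one IN list per batch. The sketch below is only an illustration: it uses sqlite3 as a stand-in so it runs anywhere, which means the placeholders are ? rather than the %s that MySQL connectors expect. The helper name delete_skus and the default chunk_size are assumptions chosen to stay under typical bind-parameter limits.

```python
import sqlite3


def delete_skus(conn, name, skus, chunk_size=500):
    """Delete rows for one name whose SKU is in skus, one statement per chunk."""
    skus = list(skus)
    deleted = 0
    for start in range(0, len(skus), chunk_size):
        chunk = skus[start:start + chunk_size]
        # One "?" per SKU in this chunk, e.g. "?, ?, ?".
        placeholders = ", ".join("?" * len(chunk))
        cur = conn.execute(
            f"DELETE FROM price_calculations1 WHERE Name = ? AND SKU IN ({placeholders})",
            [name, *chunk],
        )
        deleted += cur.rowcount
    conn.commit()
    return deleted


# Small in-memory demonstration with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE price_calculations1 (Name TEXT, SKU TEXT)")
conn.executemany(
    "INSERT INTO price_calculations1 VALUES (?, ?)",
    [("acme", "A1"), ("acme", "A2"), ("other", "A1"), ("acme", "B9")],
)
deleted = delete_skus(conn, "acme", ["A1", "A2", "ZZ"], chunk_size=2)
remaining = conn.execute(
    "SELECT Name, SKU FROM price_calculations1 ORDER BY Name, SKU"
).fetchall()
```

With the question's dataframe, the SKU list would be something like df["SKU"].tolist(); 6000 SKUs then become a dozen statements instead of 6000.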

SQL insert query through python loop

I'm trying to insert multiple rows into a table using a for-loop in python using the following code:
ID = 0
values = ['a', 'b', 'c']
for x in values:
    database.execute("INSERT INTO table (ID, value) VALUES (:ID, :value)",
                     ID = ID, value = x)
    ID += 1
What I'd expected to happen was that this piece of code would insert three rows into my table. The only problem is that it only executes the query once. So I'd only get the row " 0, 'a' ".
There aren't any error messages popping up, it just doesn't update the table with the other two values. Weirdly enough however, I can circumvent this problem by using multiple queries, like so:
ID = 0
values = ['a', 'b', 'c']
for x in values:
    database.execute("INSERT INTO table (ID) VALUES (:ID)", ID = ID)
    database.execute("INSERT INTO table (value) VALUES (:value)", value = x)
    ID += 1
While this updates my table, this method becomes more tedious as I add columns to my table further down the line. Does anyone know why the first snippet of code doesn't work and the second one does?

The execute method takes the parameters as its second argument: a sequence for question-mark placeholders, or a dict for named placeholders.
execute(sql[, parameters])
Executes an SQL statement. The SQL statement may be parameterized (i. e. placeholders instead of SQL literals). The sqlite3 module
supports two kinds of placeholders: question marks (qmark style) and
named placeholders (named style).
This should work:
database.execute("INSERT INTO table (ID, value) VALUES (:ID, :value)", {"ID": ID, "value": x})
You might want to investigate executemany while you're in the docs.
From the same doc:
commit()
This method commits the current transaction. If you don’t call this method, anything you did since the last call to commit() is not
visible from other database connections. If you wonder why you don’t
see the data you’ve written to the database, please check you didn’t
forget to call this method.
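Putting the two documentation excerpts together, a minimal sketch with the standard sqlite3 module might look like the following. It assumes database is a plain sqlite3 connection, which the question does not confirm, and it uses a table named entries instead of table, because TABLE is a reserved word in SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entries (ID INTEGER, value TEXT)")

values = ["a", "b", "c"]
# One dict per row; the keys match the :ID and :value placeholders.
params = [{"ID": i, "value": v} for i, v in enumerate(values)]

# executemany runs the same statement once per parameter set.
conn.executemany("INSERT INTO entries (ID, value) VALUES (:ID, :value)", params)
conn.commit()  # make the rows visible to other connections

rows = conn.execute("SELECT ID, value FROM entries ORDER BY ID").fetchall()
```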

How to delete large quantity of records from Oracle Table that has no primary key

The situation: I'm loading an entire SQL table into my program. For convenience I'm using pandas to maintain the row data. I am then creating a dataframe of rows I would like to have removed from the SQL table. Unfortunately (and I can't change this) the table does not have any primary key other than the built-in Oracle ROWID (which isn't a real table column; it's a pseudocolumn), but I can make ROWID part of my dataframe if I need to.
The table has hundreds of thousands of rows, and I'll probably be deleting a few thousand records with each run of the program.
Question:
Using cx_Oracle, what is the best method of deleting multiple rows/records that don't have a primary key? I don't think creating a loop to submit thousands of delete statements is very efficient or Pythonic. However, I am concerned about building a single SQL delete statement keyed off ROWID that contains a clause with thousands of items:
Where ROWID IN ('eg1','eg2',........, 'eg2345')
Is this concern valid? Any Suggestions?

Using ROWID
Since you can use ROWID, that would be the ideal way to do it. And depending on the Oracle version, the query length limit may be large enough for a query with that many elements in the IN clause. The issue is the number of elements in the IN expression list - limited to 1000.
So you'll either have to break up the list of RowIDs into sets of 1000 at a time or delete just a single row at a time; with or without executemany().
>>> len(delrows) # rowids to delete
5000
>>> q = 'DELETE FROM sometable WHERE ROWID IN (' + ', '.join(f"'{row}'" for row in delrows) + ')'
>>> len(q) # length of the query
55037
>>> # let's try with just the first 1000 id's and no extra spaces
... q = 'DELETE FROM sometable WHERE ROWID IN (' + ','.join(f"'{row}'" for row in delrows[:1000]) + ')'
>>> len(q)
10038
You're probably within query-length limits, and can even save some chars with a minimal ',' item separator.
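The batching of ROWIDs into groups of at most 1000 can be sketched as a small generator. This version builds numbered bind placeholders (:1, :2, ...) instead of quoting the values into the SQL text, which is a swapped-in choice rather than what the session above does. The function name rowid_delete_batches is hypothetical, and the table name is interpolated directly, so it must come from trusted code.

```python
def rowid_delete_batches(table, rowids, batch_size=1000):
    """Yield (sql, params) pairs, each deleting at most batch_size ROWIDs."""
    rowids = list(rowids)
    for start in range(0, len(rowids), batch_size):
        batch = rowids[start:start + batch_size]
        # Numbered positional binds: ":1, :2, ..., :n" for this batch.
        binds = ", ".join(f":{n}" for n in range(1, len(batch) + 1))
        yield f"DELETE FROM {table} WHERE ROWID IN ({binds})", batch


# Example with made-up ROWID strings; no database connection is needed here.
batches = list(rowid_delete_batches("sometable", [f"id{i}" for i in range(2500)]))
small = list(rowid_delete_batches("t", ["a", "b"], batch_size=2))
```

With an open cx_Oracle cursor, each pair would then be passed along as cursor.execute(sql, params), followed by a single commit on the connection once all batches have run.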
Without ROWID
Without the primary key or ROWID, the only way to identify each row is to specify all the columns in the WHERE clause. To delete many rows at a time, the per-row conditions need to be OR'd together:
DELETE FROM sometable
WHERE ( col1 = 'val1'
AND col2 = 'val2'
AND col3 = 'val3' ) -- row 1
OR ( col1 = 'other2'
AND col2 = 'value2'
AND col3 = 'val3' ) -- row 2
OR ( ... ) -- etc
As you can see, it's not the nicest query to construct, but it lets you do it without ROWIDs.
In both cases, you probably don't need parameterised queries, since the IN list in the first approach and the OR grouping in the second are of variable length. (Yes, you could parameterise it after constructing the whole extended SQL with thousands of parameters. Not sure what the limit is on that.) The executemany() way is definitely easier to write, but for speed, the single large queries (either of the above two) will probably outperform executemany with thousands of items.

You can use cursor.executemany() to delete multiple rows at once. Something like the following should work:
dataToDelete = [['eg1'], ['eg2'], ...., ['eg2345']]
cursor.executemany("delete from sometable where rowid = :1", dataToDelete)

Compare in SQLite 3 Python

I have an SQLite 3 database which has the columns ID, name and time.
I fetch the last row and place it in a variable LAST_PERSON using Python.
your_rank = "SELECT usr_name,time FROM rank WHERE ID = (SELECT MAX(ID) FROM rank)"
I also have a variable ROW which loops through each row, sorted by time.
sql = "SELECT usr_name,time FROM rank ORDER BY time "
for row in cur.execute(sql):
I want to compare:
your_rank with each row of the time-sorted result and get that last person's rank.
I tried
for row in cur.execute(sql):
    sql_list.append(row)
    if(row is your_rank):
        this_is_your_rank = rank_number
    rank_number += 1
But I cannot use the if statement with SQLite 3, and I have not been able to find any solution to compare these. Can anyone give me a clue?
If you cannot, thanks for taking the time to read this.

You want to select count(ID) from rank where time < your_time or similar.
Looping over SQL results to find out what you want is clunky when you can just ask the database to give you the answer you want.
Edit:
Your first query, where you use a subquery to get the user with the highest ID, can be:
SELECT MAX(ID),usr_name,time FROM rank
And you can combine them both together into "get the most recent user name, time, and their position" with:
SELECT
MAX(ID),usr_name,time, (SELECT COUNT(ID)+1 FROM rank WHERE time < r.time) [Pos]
FROM
rank r
Edit again: It's not clear what you mean by "I cannot use the if statement", but here's some speculation:
If you actually typed in row is your_rank, assuming you actually executed the your_rank SQL query and saved the result over the top in the same variable name, then it fails because is is a Python keyword for testing whether two things are the same thing (that is, one thing with two names). It does not test whether two separate things have the same value. == is the equality test.
It also might fail because the result of a SQL query is effectively a list of tuples. Each row is a tuple and, depending on what you did to put the result in your_rank, they won't ever match when compared.
This might work, if you want to keep the same approach:
last_user = cursor.execute('select max(id),usr_name,time from rank').fetchone()
last_user_rank = 1
for row in cursor.execute('select id,usr_name,time from rank order by time asc'):
    if last_user[2] > row[2]:
        last_user_rank += 1
    else:
        break
print(last_user, last_user_rank)
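For comparison, the "ask the database" version suggested at the top of this answer could be sketched as follows. The rank table and its ID, usr_name and time columns come from the question; the function name rank_of_latest and the sample rows are made up for illustration.

```python
import sqlite3


def rank_of_latest(conn):
    """Return (usr_name, time, position) for the row with the highest ID."""
    row = conn.execute(
        "SELECT usr_name, time FROM rank WHERE ID = (SELECT MAX(ID) FROM rank)"
    ).fetchone()
    if row is None:
        return None  # empty table
    usr_name, time = row
    # Position = number of strictly faster entries, plus one.
    (faster,) = conn.execute(
        "SELECT COUNT(*) FROM rank WHERE time < ?", (time,)
    ).fetchone()
    return usr_name, time, faster + 1


# Small in-memory demonstration with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rank (ID INTEGER PRIMARY KEY, usr_name TEXT, time REAL)")
conn.executemany(
    "INSERT INTO rank (usr_name, time) VALUES (?, ?)",
    [("ann", 12.5), ("bob", 9.0), ("cy", 15.0), ("dee", 11.0)],
)
conn.commit()
result = rank_of_latest(conn)
```

Entries with identical times share the same position under this definition.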
