Cleaning a Postgres table of bad rows - Python

I have inherited a Postgres database and am currently in the process of cleaning it. I have created an algorithm to find the rows where the data is bad; it is encoded in the function checkProblem(). Using this, I am able to count the bad rows in each table, as shown below ...
import psycopg2
import pandas as pd
from tqdm import tqdm

schema = findTables(dbName)  # findTables() and checkProblem() are my own helpers
conn = psycopg2.connect("dbname='%s' user='postgres' host='localhost'" % dbName)
cur = conn.cursor()
results = []
for t in tqdm(sorted(schema.keys())):
    n = 0
    cur.execute('select * from %s' % t)
    for i, cs in enumerate(tqdm(cur)):
        if checkProblem(cs):
            n += 1
    results.append({
        'tableName': t,
        'totalRows': i + 1,
        'badRows':   n,
    })
cur.close()
conn.close()
print(pd.DataFrame(results)[['tableName', 'badRows', 'totalRows']])
Now I need to delete the rows that are bad. I can see two different ways of doing it. First, I could write the clean rows to a temporary table and then rename that table over the original; I suspect this option is too memory-intensive. It would be much better if I were able to just delete the specific record at the cursor. Is that even an option?
Otherwise, what is the best way of deleting a record under such circumstances? I am guessing that this should be a relatively common thing that database administrators do ...
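For reference, a minimal sketch of the temporary-table option described above, assuming psycopg2's execute_values for the batch insert (the table and helper names are the question's own):

from psycopg2.extras import execute_values

# Copy the clean rows into a new table, then swap it in. Note this still
# materialises every clean row in memory, which is exactly the cost feared above.
cur.execute('CREATE TABLE %s_clean (LIKE %s INCLUDING ALL)' % (t, t))
cur.execute('SELECT * FROM %s' % t)
clean_rows = [cs for cs in cur if not checkProblem(cs)]
execute_values(cur, 'INSERT INTO %s_clean VALUES %%s' % t, clean_rows)
cur.execute('DROP TABLE %s' % t)
cur.execute('ALTER TABLE %s_clean RENAME TO %s' % (t, t))
conn.commit()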

Of course, deleting the specific record at the cursor is better. You can do something like:
del_cur = conn.cursor()  # separate cursor: executing on `cur` would discard its result set
for i, cs in enumerate(tqdm(cur)):
    if checkProblem(cs):
        # cs is a tuple with cs[0] being the record id
        del_cur.execute('DELETE FROM %s WHERE id = %%s' % t, (cs[0],))
Or you can store the ids of the bad records and then do something like
DELETE FROM table WHERE id IN (id1,id2,id3,id4)
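With psycopg2, the driver can build that list for you: it adapts a Python list to a SQL array, so = ANY(%s) is equivalent to the IN clause without any string-building. A minimal sketch, reusing t, cur, and checkProblem from the question:

# Collect the ids of the bad records, then delete them in one statement.
bad_ids = [cs[0] for cs in cur if checkProblem(cs)]
if bad_ids:
    cur.execute('DELETE FROM %s WHERE id = ANY(%%s)' % t, (bad_ids,))
    conn.commit()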

Related

How to escape a #/# (for example 6/8) in the name of a table from a database

I am currently trying to get a list of values from a table inside a SQL database. The problem is appending the values, due to the column's name, which I can't change. The name is something like Value123/123.
I tried making a variable with the name, like
x = 'Value123/123'
then doing
row.append(x)
but that just prints 'Value123/123' and not the values from the database:
cursor = conn.cursor()
cursor.execute("select Test, Value123/123 from db")
Test = []
Value = []
Compiled_Dict = {}
for row in cursor:
    Test.append(row.Test)
    Value.append(row.Value123/123)
Compiled_Dict = {'Date&Time': Test}
Compiled_Dict['Value'] = Value
conn.close()
df = pd.DataFrame(Compiled_Dict)
The problem occurs in this line
Value.append(row.Value123/123)
When I run it, I get an error saying the database doesn't have a column named 'Value123'; I think it's trying to divide 123 by 123. Unfortunately the column in the database is named like this and I cannot change it, so how do I pull the values from it?
Edit:
cursor.execute("select Test, Value123/123 as newValue from db")
I tried this and it worked, thanks for the solutions. (Suggested by Yu Jiaao.)
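As an aside, most engines also let you quote an awkward identifier outright, which sidesteps the division ambiguity without an alias. A hedged sketch; the quoting style depends on the database (brackets for SQL Server, double quotes for standard SQL, backticks for MySQL):

# Hypothetical: quote the identifier instead of aliasing it.
cursor.execute("SELECT Test, [Value123/123] FROM db")       # SQL Server style
# cursor.execute('SELECT Test, "Value123/123" FROM db')     # standard SQL
# cursor.execute("SELECT Test, `Value123/123` FROM db")     # MySQL style
for row in cursor:
    print(row[0], row[1])  # use index access; the column name is not a valid attribute name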

Iterating over table names and updating queries

I'm using PyMySQL to update data by iterating through table names, but the problem is that I am able to update the data in the first table only; the loop does not continue past the first table.
I've tried using fetchall() to get the table names and loop over them, but it didn't work.
def update():
    global table_out
    global i
    cursor.execute("USE company;")
    cursor.execute("SHOW TABLES;")
    lst = []
    for table_name in cursor:
        lst.append(table_name)
    emp_list = lst[0][0]
    print(emp_list)
    i = 0
    while i <= len(lst) - 1:
        state = """SELECT `employee_name` from `%s` WHERE attended=0 """ % (employees)
        out = cursor.execute(state)
        result = cursor.fetchall()
        i += 1
        for records in result:
            table_out = ''.join(records)
            print(table_out)
            db.commit()
            try:
                sql = """UPDATE `%s` SET `attended` = True WHERE `employee_name` = '%s' ;""" % (emp_list, table_out)
                cursor.execute(sql)
I expect to iterate over all the tables in that database when this function is called
I'm not sure that your approach is quite optimal.
[In your middle block, employees is undefined - should that be emp_list?]
Your select statement appears to reduce to
SELECT employee_name FROM $TABLE WHERE attended=0;
which you then use, table by table, to change the value. Have you considered using the UPDATE verb instead? https://dev.mysql.com/doc/refman/8.0/en/update.html
UPDATE $table SET attended=True WHERE attended=0;
If that works for your desired outcome, then that will save you quite a few table scans and double handling.
So perhaps you could refactor your code along these lines:
def update_to_True():
    # assume that cursor is a global PyMySQL cursor and db is the connection
    cursor.execute("USE company;")
    cursor.execute("SHOW TABLES;")
    tables = cursor.fetchall()  # execute() returns a row count, so fetch the rows separately
    for (table_name,) in tables:
        stmt = "UPDATE `{0}` SET attended=True WHERE attended=0;".format(table_name)
        res = cursor.execute(stmt)
        # for debugging....
        print(res)
    db.commit()  # PyMySQL does not autocommit by default
that's it!

Python foreach not looping properly

I'm writing a script that formats a bunch of csv files into one csv file.
To do this, I'm using a couple of cursors over sqlite tables in Python.
Here is my code; currently I'm just trying to print every row in gsap that is associated with a code that is in gsap_locs:
data = c.execute("SELECT * from gsap_locs")
for row in data:
    print(row[0])
    d2 = c.execute("select date, cardtype, volume, transactions from gsap where gsaploc=?", (row[0],))
    for r2 in d2:
        print(r2)
However, my code is only returning one row. I know that the problem isn't in the first for, because when I take out everything after print(row[0]) it prints all of the values from the first select.
Why does it drop out of my first for after my second for runs, without finishing the first for?
You are missing the fetchall or fetchone instructions.
It's a common mistake: we think that execute has done the job of getting the data, but you should use fetch.
To retrieve data after executing a SELECT statement, you can either treat the cursor as an iterator, call the cursor’s fetchone() method to retrieve a single matching row, or call fetchall() to get a list of the matching rows.
import sqlite3

conn = sqlite3.connect('gasp.sqlite')
c = conn.cursor()
c.execute("SELECT * FROM gsap_locs")
rows = c.fetchall()
for row in rows:
    print(row[0])
    c.execute("select * from gsap where gsaploc=?", (row[0],))
    d2 = c.fetchall()
    for r2 in d2:
        print(r2)
conn.close()
It looks like a cursor can only track one operation and its result set at a time. You might want to keep the results of the first operation in memory by calling tuple on it:
data = tuple(c.execute("SELECT * from gsap_locs"))
for row in data:
    ...
Be sure to have enough memory to hold all the results from the first query.
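Alternatively, a minimal two-cursor sketch (the filename is assumed; the schema is the question's): give each loop its own cursor, so neither execute disturbs the other's result set.

import sqlite3

conn = sqlite3.connect('gsap.sqlite')  # filename assumed
outer = conn.cursor()
inner = conn.cursor()

for row in outer.execute("SELECT * FROM gsap_locs"):
    print(row[0])
    inner.execute("select date, cardtype, volume, transactions from gsap where gsaploc=?", (row[0],))
    for r2 in inner:
        print(r2)

conn.close()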

Summing database column in python

I have recently encountered the problem of adding up the elements of a database column. Here is the code:
import sqlite3
con = sqlite3.connect("values.db")
cur = con.cursor()
cur.execute('SELECT objects FROM data WHERE firm = "sony"')
As you can see, I connect to the database (SQLite) and tell Python to select the column "objects".
The problem is that I do not know the appropriate command for summing the selected objects.
Any ideas/advice are highly appreciated.
Thank you in advance!
If you can, have the database do the sum, as that reduces data transfer and lets the database do what it's good at.
cur.execute("SELECT sum(objects) FROM data WHERE firm = 'sony'")
or, if you're really just looking for the total count of objects:
cur.execute("SELECT count(objects) FROM data WHERE firm = 'sony'")
either way, your result is simply:
count = cur.fetchall()[0][0]
Try the following line:
print(sum(row[0] for row in cur.fetchall()))
If you want the items instead of adding them together:
print([row[0] for row in cur.fetchall()])
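For completeness, a minimal end-to-end sketch of the database-side sum, with the firm passed as a parameter (schema as in the question):

import sqlite3

con = sqlite3.connect("values.db")
cur = con.cursor()
cur.execute("SELECT sum(objects) FROM data WHERE firm = ?", ("sony",))
total = cur.fetchone()[0]  # one row, one column; None if no rows match
print(total)
con.close()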

SQLite insert or ignore and return original _rowid_

I've spent some time reading the SQLite docs, various questions and answers here on Stack Overflow, and this thing, but have not come to a full answer.
I know that there is no way to do something like INSERT OR IGNORE INTO foo VALUES(...) with SQLite and get back the rowid of the original row, and that the closest to it would be INSERT OR REPLACE, but that deletes the entire row and inserts a new one, and thus gets a new rowid.
Example table:
CREATE TABLE foo(
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    data TEXT
);
Right now I can do:
import sqlite3

sql = sqlite3.connect(":memory:")
# create database
sql.execute("INSERT OR IGNORE INTO foo(data) VALUES(?);", ("Some text.",))
the_id_of_the_row = None
for row in sql.execute("SELECT id FROM foo WHERE data = ?", ("Some text.",)):
    the_id_of_the_row = row[0]
But something ideal would look like:
the_id_of_the_row = sql.execute("INSERT OR IGNORE INTO foo(data) VALUES(?)", ("Some text",)).lastrowid
What is the best (read: most efficient) way to insert a row into a table and return the rowid, or to ignore the row if it already exists and just get the rowid? Efficiency is important because this will be happening quite often.
Is there a way to INSERT OR IGNORE and return the rowid of the row that the ignored row was compared to? This would be great, as it would be just as efficient as an insert.
The way that worked best for me was to insert or ignore the values, and then select the rowid, in two separate steps. I used a unique constraint on the data column both to speed up selects and to avoid duplicates.
sql.execute("INSERT OR IGNORE INTO foo(data) VALUES(?);" ("Some text.", ))
last_row_id = sql.execute("SELECT id FROM foo WHERE data = ?;" ("Some text. ", ))
The select statement isn't as slow as I thought it would be. This, it seems, is due to SQLite automatically creating an index for the unique columns.
INSERT OR IGNORE is for situations where you do not care about the identity of the record; where the goal is only to have some record with that specific value.
If you want to know whether a new record is inserted or not, you have to check by hand:
the_id_of_the_row = None
for row in sql.execute("SELECT id FROM foo WHERE data = ?", ...):
    the_id_of_the_row = row[0]
if the_id_of_the_row is None:
    c = sql.cursor()
    c.execute("INSERT INTO foo(data) VALUES(?)", ...)
    the_id_of_the_row = c.lastrowid
As for efficiency: when SQLite checks the data column for duplicates, it has to do exactly the same query that you're doing with the SELECT, and once you've done that, the access path is in the cache, so performance should not be a problem. In any case, it is necessary to execute two separate INSERT/SELECT queries (in either order; both your code and mine work, but yours is simpler).
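As a version-dependent aside: newer SQLite releases can collapse this into a single statement, combining upsert syntax (3.24+) with RETURNING (3.35+), assuming a UNIQUE constraint on foo.data. A DO NOTHING clause would return no row on conflict, hence the no-op DO UPDATE:

# Single-statement sketch; requires SQLite >= 3.35 and UNIQUE(data).
row = sql.execute(
    "INSERT INTO foo(data) VALUES(?) "
    "ON CONFLICT(data) DO UPDATE SET data = excluded.data "
    "RETURNING id;",
    ("Some text.",),
).fetchone()
the_id_of_the_row = row[0]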
