When adding a record to the database, how can I write a loop that checks whether the record already exists, moves on to the next record if it does, and inserts it if it does not?
I have a table in the database that stores links, and I want to do this check on the link column, because the same link can be scraped several times. Each link contains an id, so I thought the id could be compared, or the complete link itself. I would be glad if you could help.
I can add to this part, I am sharing the codes for an idea.
if link == "":
    control = "false"
else:
    control = "true"

if control == "true":
    # note: the column list, the %s placeholders, and the values in `val` must all have the same length
    mySql_insert_query = "INSERT INTO ad_l (id, count, clist_id, brand_model, ad, created_at, updated_at, status) VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s)"
    val = (id, count, clist_id, brand_model, ad_link, link, created_at, updated_at, status)
    # cursor = scrap_db.cursor()
    cursor.execute(mySql_insert_query, val)  # or cursor.executemany(mySql_insert_query, tuple_of_tuples)
    scrap_db.commit()
    print(cursor.rowcount, "Record inserted successfully into *ad_l* table")
I added something like this but it didn't work.
if exists("SELECT ad FROM ad_l WHERE AD = ad_link"):
    print('There is the same link, it is not added')
    pass
else:
    print('New record being added')
IMO this would be better solved at the DB level with a unique column or a unique constraint (unique over multiple columns).
Databases are optimized to do this check for you, which results in better performance.
In some cases you might use a DB trigger to automatically insert or update depending on whether the record already exists.
In Python you can use SQLAlchemy, but to be honest it is not very beginner-friendly.
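For illustration, here is a minimal sketch of the constraint-based approach. It assumes mysql-connector-python, that the link is stored in the ad column, and placeholder connection details; once the unique constraint exists, you simply insert and catch the duplicate-key error:
import mysql.connector

scrap_db = mysql.connector.connect(host="localhost", user="user",
                                   password="password", database="mydb")
cursor = scrap_db.cursor()

# one-time setup: let the database reject duplicate links
cursor.execute("ALTER TABLE ad_l ADD CONSTRAINT unq_ad_l_ad UNIQUE (ad)")

link = "https://example.com/ad/123"  # hypothetical link value
try:
    cursor.execute("INSERT INTO ad_l (ad) VALUES (%s)", (link,))
    scrap_db.commit()
    print("New record added")
except mysql.connector.IntegrityError:
    scrap_db.rollback()
    print("Link already exists, not added")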
If you need to approach this from the other direction and check the DB for the record before you start processing a link (scraping), you were heading in the right direction with your check for an existing record. The problem is that ad_link in your query was meant to be a variable, but you passed it as part of the query string.
Change it to, for example:
exists (f"SELECT ad FROM ad_l WHERE AD = {ad_link}")
Add quotes around {ad_link} if needed (I don't remember offhand whether they are required).
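That said, formatting values straight into the SQL string is vulnerable to SQL injection. A safer sketch of the same existence check uses a parameterized query (assuming the cursor from the question and that the link is stored in the ad column):
cursor.execute("SELECT 1 FROM ad_l WHERE ad = %s LIMIT 1", (ad_link,))
if cursor.fetchone():
    print('There is the same link, it is not added')
else:
    print('New record being added')
    # ... run the INSERT here ...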
I'm doing an exercise where I need to update some values in a table of an SQL DB via Python, and I can't figure out what SELECT returns if "user" AND another "X-condition" are NOT found in the database.
I read in another thread that SELECT should return an empty set, but I still have a problem with it!
When I run:
example = db.execute("SELECT value FROM table1 WHERE user=:user AND X=:Y", user=user, X=Y)
and I try with a condition like
if example == {}:
    db.execute("INSERT [...]")
I never go inside this condition to do the INSERT stuff when the set is empty.
I found another route to solve this (written below), but is it valid at all?
if not example:
    do the job
EDIT: I'm using sqlite3!
Assuming you're using sqlite3, the execute method always returns a Cursor which you can use to fetch the result rows, if there were any. It doesn't matter what kind of query you were doing.
If the first result you fetch is None right away, there weren't any rows returned:
if example.fetchone() is None:
    db.execute("INSERT [...]")
Alternatively, you could fetch all rows as a list, and compare that against the empty list:
if example.fetchall() == []:
    db.execute("INSERT [...]")
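Put together, a minimal self-contained sketch of the check-then-insert flow with the plain sqlite3 module (the table and sample values are made up for illustration):
import sqlite3

db = sqlite3.connect("example.db")
db.execute("CREATE TABLE IF NOT EXISTS table1 (user TEXT, x TEXT, value TEXT)")

user, x, value = "alice", "some-condition", "42"  # hypothetical sample data

# execute() always returns a Cursor; rows are fetched from it afterwards
example = db.execute("SELECT value FROM table1 WHERE user = ? AND x = ?", (user, x))
if example.fetchone() is None:
    # no matching row, so it is safe to insert
    db.execute("INSERT INTO table1 (user, x, value) VALUES (?, ?, ?)", (user, x, value))
    db.commit()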
I want to check if an entity already exists in a table. I tried searching Google for this and found this, but it didn't help me.
I want to return False if the entity already exists, but it always inserts the user.
from typing import Union

def insert_admin(admin_name) -> Union[bool, None]:
    cursor.execute(f"SELECT name FROM admin WHERE name='{admin_name}'")
    print(cursor.fetchall())  # always returns an empty list []
    if cursor.fetchone():
        return False
    cursor.execute(f"INSERT INTO admin VALUES('{admin_name}')")  # insert the name

def current_admins() -> list:
    print(cursor.execute('SELECT * FROM admin').fetchall())  # [('myname',)]
When I run the program again, I can still see that print(cursor.fetchall()) returns an empty list. Why is this happening if I already inserted a name into the table, and how can I check whether the name already exists?
If you want to avoid duplicate names in the table, then let the database do the work -- define a unique constraint or index:
ALTER TABLE admin ADD CONSTRAINT unq_admin_name UNIQUE (name);
You can attempt to insert the same name multiple times, but it will only work once, returning an error on subsequent attempts.
Note that this is also much, much better than attempting to do this at the application level. In particular, two different threads could still insert the same name at (roughly) the same time -- because each runs the first query, sees the name is not there, and then inserts the same row.
When the database validates the data integrity, you don't have to worry about such race conditions.
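If this table lives in sqlite3 (which the chained cursor.execute(...).fetchall() in the question suggests), note that SQLite does not support ALTER TABLE ... ADD CONSTRAINT; a unique index gives the same protection. A minimal sketch with a hypothetical connection, committing so the insert survives a restart:
import sqlite3

conn = sqlite3.connect("admins.db")
cursor = conn.cursor()
cursor.execute("CREATE TABLE IF NOT EXISTS admin (name TEXT)")
cursor.execute("CREATE UNIQUE INDEX IF NOT EXISTS unq_admin_name ON admin (name)")

def insert_admin(admin_name) -> bool:
    """Return True if the name was inserted, False if it already existed."""
    try:
        cursor.execute("INSERT INTO admin VALUES (?)", (admin_name,))
        conn.commit()  # without a commit, the insert is lost when the program exits
        return True
    except sqlite3.IntegrityError:
        return False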
I have a program inserting a bunch of data into an SQL database. The data consists of Reports, each having a number of Tags.
A Tag has a field report_id, which is a reference to the primary key of the relevant Report.
Now, each time I insert the data, there can be 200 Reports or even more, each maybe having 400 Tags. So in pseudo-code I'm now doing this:
for report in reports:
    cursor_report = sql('INSERT report...')
    cursor_report.commit()
    report_id = sql('SELECT @@IDENTITY')
    for tag in report:
        cursor_tag += sql('INSERT tag, report_id=report_id')
    cursor_tag.commit()
I don't like this for a couple of reasons. Mostly I don't like the SELECT @@IDENTITY statement.
Wouldn't this mean that if another process were inserting data at the right moment then the statement would return the wrong primary key?
I would rather have the INSERT report... return the inserted primary key; is that possible?
Since I currently have to commit between reports, the program "pauses" during these moments. If I could commit everything at the end, it would greatly reduce the time spent. I have been considering adding a separate field to Report used for identification, so that I could do report_id = (SELECT id FROM reports WHERE separate_field=?) or something in the Tags, but that doesn't seem very elegant.
Wouldn't this mean that if another process were inserting data at the right moment then the ["SELECT @@IDENTITY"] statement would return the wrong primary key?
No. The database engine keeps track of the last identity value inserted for each connection and returns the appropriate value for the connection on which the SELECT @@IDENTITY statement is executed.
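For example, here is a minimal sketch of fetching the identity and committing only once at the end; pyodbc, the table and column names, and the report/tag attributes are assumptions, since the question doesn't name the driver or schema:
import pyodbc  # assumed driver

conn = pyodbc.connect(CONNECTION_STRING)  # hypothetical connection string
cursor = conn.cursor()

for report in reports:
    cursor.execute("INSERT INTO reports (name) VALUES (?)", report.name)
    # @@IDENTITY is tracked per connection, so no commit is needed before reading it
    report_id = cursor.execute("SELECT @@IDENTITY").fetchone()[0]
    for tag in report.tags:
        cursor.execute("INSERT INTO tags (report_id, value) VALUES (?, ?)", report_id, tag)

conn.commit()  # a single commit at the end instead of one per report
On SQL Server, SCOPE_IDENTITY() is often preferred over @@IDENTITY because it ignores identity values generated by triggers on the same connection.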
I have created a REST service for syncing data from iPhones to our GAE.
In a few situations we get double entries for the same day. I believe I have made a mistake in the design of the Record class and would like to double-check whether my assumption and possible solution are correct before I attempt any data migration.
First I go through all incoming json_records; if the query finds count == 1, that means there is an existing entry that needs to be updated (this is where it sometimes goes wrong!!!). Then it checks the timestamp and only updates the record if the incoming timestamp is greater, otherwise it ignores it.
for json_record in json_records:
    recordsdb = Record.query(Record.user == user.key, Record.record_date == date_parser.parse(json_record['record_date']))
    if recordsdb.count() == 1:
        rec = recordsdb.fetch(1)[0]
        if rec.timestamp < json_record['timestamp']:
            ....
            rec.put()
    elif recordsdb.count() == 0:
        new_record = Record(user=user.key,
                            record_date=date_parser.parse(json_record['record_date']),
                            notes=json_record['notes'],
                            timestamp=json_record['timestamp'])
        new_record.put()
If I am not wrong, this way of querying an object provides no guarantee that it is the latest version.
recordsdb = Record.query(Record.user == user.key, Record.record_date == date_parser.parse(json_record['record_date']))
I believe the only way the GAE High Replication Datastore can make sure that you have the latest data in front of you is if you retrieve the entity by its key.
Hence, if this assumption is correct, I should have saved my records with a date string as the key in the first place.
jsondate = date_parser.parse(json_record['record_date'])
new_record = Record(id=jsondate.strftime("%Y-%m-%d"),
                    user=user.key,
                    record_date=jsondate,
                    notes=json_record['notes'],
                    timestamp=json_record['timestamp'])
new_record.put()
and when I have to query to see if the record already exists, I would get it by its key like this:
jsondate = date_parser.parse(json_record['record_date'])
record = ndb.Key('Record', jsondate.strftime("%Y-%m-%d")).get()
Now if record is None, I have to create a new record; if record is not None, I have to update it.
Are my assumption and solution correct?
How can I migrate the existing data to use a date string as the key?
UPDATE
I just realised another mistake I made: I can't use the date string alone as the record's key, because each user can have a record for the same day, which would cause key collisions between users.
I believe the only way to solve that is through an ancestor/parent, which I am still trying to get my head around.
UPDATE 2:
Trying to see if I understand Patrick's solution here. If it doesn't make sense, or there is a better way, please correct me.
I would add a is_fixed flag to the existing model:
class Record(ndb.Model):
    user = ndb.KeyProperty(kind=User)
    is_fixed = ndb.BooleanProperty()
    ...
Then I would query for the existing records via a cursor and delete them afterwards:
q = Record.query()
q_forward = q.order(Record.key)
cursor = None
while True:
    records, cursor, more = q_forward.fetch_page(100, start_cursor=cursor)
    if not records:
        break
    for record in records:
        new_record = Record(parent=user.key, ...)
        new_record.is_fixed = True
        new_record.put()

# now delete the old ones, I wonder if this would be an issue:
for old in Record.query():
    if not old.is_fixed:
        old.key.delete()
Since your query is always per user, I would recommend making the User the ancestor of each Record.
As you mentioned, the issue that you are hitting is a result of eventual consistency -- your query is not guaranteed to have the most up to date results. With an ancestor query, the results will be strongly consistent.
One important piece to watch out for is that within an entity group (a single ancestor), you are limited to 1 update per second. Since you only have one record per user, this seems like it shouldn't be a problem.
Your code is actually already all set up to use ancestors:
new_record = Record(parent=user.key,  # here we say that the ancestor of the record is the user
                    record_date=date_parser.parse(json_record['record_date']),
                    notes=json_record['notes'],
                    timestamp=json_record['timestamp'])
And now you can actually use a strongly consistent query:
Record.query(Record.record_date == date_parser.parse(json_record['record_date']), ancestor=user.key)
However, you are going to have the same problem with changing the id of existing Records. Adding an ancestor to an entity effectively changes its key to have the ancestor as a prefix. In order to do this, you'll have to go through all your records and create new ones with their user as an ancestor. You can probably do this using a query to grab results in batches (using cursors to step forward), or if you have a lot of data it may be worthwhile to explore the MapReduce library.
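For reference, a rough sketch of that batched migration, assuming each old Record should be recreated with its user as parent; everything beyond the properties shown in the question is hypothetical:
cursor = None
more = True
while more:
    old_records, cursor, more = Record.query().fetch_page(100, start_cursor=cursor)
    for old in old_records:
        if old.key.parent() is not None:
            continue  # already migrated; the new copies show up in the same query
        new_record = Record(parent=old.user,  # the user becomes the ancestor
                            user=old.user,
                            record_date=old.record_date,
                            notes=old.notes,
                            timestamp=old.timestamp)
        new_record.put()
        old.key.delete()  # remove the old, parentless entity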
Assume that I have a table user_count defined as follows:
id primary key, auto increment
user_id unique
count default 0
What I want to do is increment count by one when a record for the user already exists, or else insert a new record.
Currently, I do it this way (in Python):
try:
    cursor.execute("INSERT INTO user_count (user_id) VALUES (%s)", (user.id,))
except IntegrityError:
    cursor.execute("UPDATE user_count SET count = count + 1 WHERE user_id = %s", (user.id,))
And it can also be implemented this way:
cursor.execute("INSERT INTO user_count (user_id) VALUES (%s) ON DUPLICATE KEY UPDATE count = count + 1", (user.id,))
What's the difference between these two ways, and which one is better?
The second one is a single SQL command that uses a feature the database offers for solving exactly the problem you have here.
I'd use that, as it should be faster and more reliable.
The first one is a fallback if that feature is not available (older database version?).
The first one uses an exception to direct the flow of the program, which you shouldn't do unless you have no other option (e.g. getting exclusive access to a file). Also, it takes the work away from the database, which knows better how to handle the case.
The second one handles all the work in the database, which in turn can optimize the query plan very efficiently.
I would use the second solution, as the database usually knows better than you how to handle this.
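For completeness, a minimal sketch of the single-statement approach, assuming a DB-API MySQL driver such as pymysql and the unique user_id column from the question. Note that with ON DUPLICATE KEY UPDATE, MySQL reports an affected-row count of 1 for a fresh insert and 2 when an existing row was updated:
import pymysql  # assumed driver; any DB-API MySQL driver works the same way

conn = pymysql.connect(host="localhost", user="user", password="password", db="mydb")
cursor = conn.cursor()

user_id = 42  # hypothetical user id
cursor.execute(
    "INSERT INTO user_count (user_id) VALUES (%s) "
    "ON DUPLICATE KEY UPDATE count = count + 1",
    (user_id,),
)
conn.commit()

print(cursor.rowcount)  # 1 if a new row was inserted, 2 if an existing row was updated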