delete previous line in postgresql database - python

I'm working on a face recognition project whose output is a PostgreSQL database: every time a person is identified, that person's name is stored in the database. I use Python and the psycopg2 module to produce this output. What I need is that when a new person is detected, the previously stored row in the database is deleted. Thank you in advance for your support.
This is my database: I have a table that stores the image paths with a limited set of numeric ids. I want to show the appropriate image when my system detects, for example, a happy man. For this purpose I want to join the two tables (the images table and the face classification table), but my classification table assigns its ids serially, so I cannot join them.

Assuming that the order of the ids correlates with the order of insertion, you can fetch the ids of the persons lower than the one just added in descending order and take the first one:
SELECT id FROM person WHERE id < $1 ORDER BY id DESC LIMIT 1
where $1 stands for the id of the row just inserted. With the fetched id you can then delete that row.
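A minimal psycopg2 sketch of that select-then-delete approach (the connection string, table, and column names are assumptions, not the asker's actual schema):

import psycopg2

# Sketch only: assumes a "person" table with a serial id and a name column.
conn = psycopg2.connect("dbname=faces user=postgres")
cur = conn.cursor()

# Insert the newly detected person and get the id of that row back.
cur.execute("INSERT INTO person (name) VALUES (%s) RETURNING id", ("John Doe",))
new_id = cur.fetchone()[0]

# Find the previously inserted row (highest id below the new one) and delete it.
cur.execute("SELECT id FROM person WHERE id < %s ORDER BY id DESC LIMIT 1", (new_id,))
row = cur.fetchone()
if row is not None:
    cur.execute("DELETE FROM person WHERE id = %s", (row[0],))
conn.commit()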
A better way would be to store the times as timestamps instead of text.

Related

DWH primary key conflict between staging tables and DWH tables

I am building a DWH based on data I am collecting from an ERP API.
Currently, I am fetching the data from the API using an incremental mechanism I built in Python: the script fetches all invoices whose last modified date falls within the last 24 hours and inserts the data into a "staging table" (no changes are required during this step).
The next step is to insert all data from the staging area into the "final tables". The final tables include primary keys according to the ERP (for example invoice number).
There are no primary keys defined at the staging tables.
For now, I am putting aside the data manipulation and transformation.
In some cases a specific invoice is already in the "final tables", but the user then updates the invoice in the ERP system, which causes the Python script to fetch the data again from the API into the staging tables. In that case, when I try to insert the invoice into the "final table", I get a conflict due to the primary key constraint on the "final tables".
Any idea of how to solve this?
I am thinking of adding a field that records the timestamp at which the record landed in the staging table ("insert date") and then upserting the records if
insert date at the staging table > insert date at the final tables
Is this best practice?
Any other suggestions? Maybe a specific tool/data solution?
I prefer using python scripts since it is part of a wider project.
Thank you!
Instead of a straight INSERT, use an UPSERT pattern: either the MERGE statement if your database supports it, or UPDATE the existing rows followed by INSERTing the new ones.
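If the warehouse happens to be PostgreSQL, a hedged sketch of that upsert driven from Python might look like this (the table and column names are assumptions, not the asker's actual schema):

import psycopg2

# Sketch only: assumes a final "invoices" table keyed on invoice_number and a
# "staging_invoices" table with the same columns plus an insert_date timestamp.
conn = psycopg2.connect("dbname=dwh user=etl")
cur = conn.cursor()

cur.execute("""
    INSERT INTO invoices (invoice_number, amount, insert_date)
    SELECT invoice_number, amount, insert_date
    FROM staging_invoices
    ON CONFLICT (invoice_number) DO UPDATE
    SET amount      = EXCLUDED.amount,
        insert_date = EXCLUDED.insert_date
    WHERE invoices.insert_date < EXCLUDED.insert_date
""")
conn.commit()

The WHERE clause on the DO UPDATE branch implements the "only upsert if the staging insert date is newer" rule described in the question.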

Pymongo : keep creating new ids

In a pymongo project I'm working on, I have a collection into which I keep inserting the name and age of people as they are entered, and I have to identify each record with a unique id.
What I was planning to do is start the first id at 1. When inserting data, I would first read the whole collection, find the number of records, and then save my record with the next id (e.g. if I read and find that there are 10 records, then the id of my new record will be 11).
But is there any better way to do it?
MongoDB already assigns a unique id to each document, and these ids can be sorted in ascending order. If you don't want to use that, you can create a separate collection containing just one document with a totalRecordsCount, increment it every time you add a new record, and read the latest number before adding the record. This is not the best way, but you will be able to avoid reading the whole collection.
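A minimal pymongo sketch of that counter pattern (the database, collection, and field names are assumptions):

from pymongo import MongoClient, ReturnDocument

client = MongoClient()   # assumes a local MongoDB instance
db = client["mydb"]      # assumed database name

def next_id():
    # Atomically increment the counter document and return the new value.
    counter = db.counters.find_one_and_update(
        {"_id": "people_id"},
        {"$inc": {"seq": 1}},
        upsert=True,
        return_document=ReturnDocument.AFTER,
    )
    return counter["seq"]

db.people.insert_one({"_id": next_id(), "name": "Alice", "age": 30})

Because find_one_and_update with $inc is atomic on the server, two writers cannot end up with the same id, which the read-count-then-insert approach cannot guarantee.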

How to check if data has already been previously used

I have a Python script that retrieves the newest 5 records from a MySQL database and sends an email notification containing this information to a user.
I would like the user to receive only new records and not old ones.
I can retrieve data from mysql without problems...
I've tried to store it in text files and compare the files but, of course, the text files containing freshly retrieved data will always have 5 records more than the old one.
So I have a logic problem here that, being a newbie, I can't tackle easily.
Using lists is also an idea but I am stuck in the same kind of problem.
The infamous 5 records can stay the same for one week and then we can have a new record or maybe 3 new records a day.
It's quite unpredictable but more or less that should be the behaviour.
Thank you so much for your time and patience.
Are you assigning a unique incrementing ID to each record? If you are, you can create a separate table that holds just the ID of the last record fetched; that way you retrieve only records with IDs greater than this ID. Each time you fetch, you update this table with the new latest ID.
Let me know if I misunderstood your issue, but saving the last fetched ID in the database could be a solution.
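A hedged sketch of that idea using mysql-connector-python (connection details, table, and column names are assumptions):

import mysql.connector

# Sketch only: assumes a "records" table with an auto-increment id column and a
# "fetch_state" table with a single row holding last_sent_id.
conn = mysql.connector.connect(user="app", password="secret", database="mydb")
cur = conn.cursor()

cur.execute("SELECT last_sent_id FROM fetch_state")
last_sent_id = cur.fetchone()[0]

cur.execute("SELECT id, payload FROM records WHERE id > %s ORDER BY id", (last_sent_id,))
new_rows = cur.fetchall()

if new_rows:
    # ... build and send the notification email from new_rows here ...
    cur.execute("UPDATE fetch_state SET last_sent_id = %s", (new_rows[-1][0],))
    conn.commit()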

What are some ways to maintain data consistency at the application layer of NoSQL?

My Python web application uses DynamoDB as its datastore, but this is probably applicable to other NoSQL tables where index consistency is handled at the application layer. I'm de-normalizing data and creating indices in several tables to facilitate lookups.
For example, for my users table:
Table 1: (user_id) email, employee_id, first name, last name, etc ...
Table 2: (email) user_id
Table 3: (employee_id) user_id
Table 1 is my "primary table" where user info is stored. If the user_id is known, all info about a user can be retrieved in a single GET query.
Table 2 and 3 enable lookups by email or employee_id, requiring a query to those tables first to get the user_id, then a second query to Table 1 to retrieve the rest of the information.
My concern is with the de-normalized data: what is the best way to handle deletions from Table 1 so that the matching data gets deleted from Tables 2 and 3? And how do I ensure the same for inserts?
Right now my chain of events is something like:
1. Insert row in table 1
2. Insert row in table 2
3. Insert row in table 3
Does it make sense to add "checks" at the end? Something like:
4. Check that all 3 rows have been inserted.
5. If a row is missing, remove rows from all tables and raise an error.
Any other techniques?
The short answer is: there is no way to ensure consistency. This is the price you agreed to pay when moving to NoSQL, in exchange for performance and scalability.
dynamodb-mapper has a "transaction engine". Transaction objects are plain DynamoDB items and may be persisted. This way, if a logical group of actions (a transaction) has succeeded, we can be sure of it by looking at the persisted status. But we have no means to be sure that it has not...
To do a bit of advertisement :), the dynamodb-mapper transaction engine supports:
single/multiple targets
sub transactions
transaction creating objects (not released yet)
If you are rolling your own mapper (which is an enjoyable task), feel free to have a look at our source code: https://bitbucket.org/Ludia/dynamodb-mapper/src/52c75c5df921/dynamodb_mapper/transactions.py
Disclaimer: I am one of the main developers of the dynamodb-mapper project. Feel free to contribute :)
Disclaimer: I haven't actually used DynamoDB, just looked through the data model and API, so take this for what it's worth.
The use case you're giving is one primary table for the data, with other tables for hand-rolled indices. This really sounds like work for an RDBMS (maybe with some sharding for growth). But if that won't cut it, here are a couple of ideas which may or may not work for you.
A. Leave it as it is. If you'll never serve data from your index tables, then maybe you can afford to have lazy deletion and insertion as long as you handle the primary table first. Say this happens:
1) Delete JDoe from Main table
xxxxxxxxxx Process running code crashes xxxxxxx
2) Delete from email index // Never gets here
3) Delete from employee_id index // Never gets here
Well, if an "email" query comes in, you'll resolve the corresponding user_id from the index (now stale), but it won't show up on the main table. You know that something is wrong, so you can return a failure/error and clean up the indexes. In other words, you just live with some stale data and save yourself the trouble, cleaning it up as necessary. You'll have to figure out how much stale data to expect, and maybe write a script that does some housekeeping daily.
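A rough boto3 sketch of that lazy-cleanup idea (table and attribute names are assumptions, not part of the question's actual schema):

import boto3

dynamodb = boto3.resource("dynamodb")
users = dynamodb.Table("users")                  # assumed primary table
email_index = dynamodb.Table("users_by_email")   # assumed index table

def get_user_by_email(email):
    # Look up the user_id in the hand-rolled index first.
    entry = email_index.get_item(Key={"email": email}).get("Item")
    if entry is None:
        return None
    user = users.get_item(Key={"user_id": entry["user_id"]}).get("Item")
    if user is None:
        # Stale index entry: the primary row is gone, so clean it up lazily.
        email_index.delete_item(Key={"email": email})
        return None
    return user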
B. If you really want to simulate locks and transactions, you could consider using something like Apache Zookeeper, which is a distributed system for managing shared resources like locks. It'd be more work and overhead, but you could probably set it up to do what you want.

Select records from mysql table one by one - dynamic table with more records continuously being written

I want to select one record at a time from a MySQL table. I found a similar post on SO here -
How to select records one by one without repeating
However, in my case the table size is not fixed; data is continuously being added to the table, and I want to select one record at a time from it. Also, I'm using Python to connect to the MySQL database and do processing on each record. Any pointers?
P.S.: The table is very large, so I cannot compute the number of records in it every time.
This functionality isn't built into SQL.
If you have a dense, incrementing, indexed column on the table, then you just do this:
i = 0
while True:
    # your usual select here, but with "AND MyIndexColumn = %s" added
    # and (i,) passed as the parameter tuple
    i += 1
With some databases, there's such a column built in, whether you want it or not, usually called either "RowID" or "Row ID"; if not, you can always add such a column explicitly.
If the values can be skipped (usually they can, e.g., if for no other reason than because someone can delete a row from the middle), you have to be able to handle a select that returns 0 rows, of course.
The only issue is that there's no way to tell when you're done. But that's inherent in your initial problem statement: data is being continuously added, and you don't know the size of it in advance, so how could you? If you need to know when the producer is done, it has to somehow (possibly through another table in the database) give you the highest rowid it created.
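A fuller sketch of that loop using mysql-connector-python (connection details, table and column names are assumptions); it handles skipped ids and the "no way to tell when you're done" point by checking the current maximum id before deciding whether to skip ahead or wait:

import time
import mysql.connector

# Sketch only: assumes MyIndexColumn is a unique, incrementing, indexed column.
conn = mysql.connector.connect(user="app", password="secret", database="mydb")
cur = conn.cursor(buffered=True)

def process(row):
    print(row)   # stand-in for the real per-record processing

i = 0
while True:
    cur.execute("SELECT * FROM mytable WHERE MyIndexColumn = %s", (i,))
    row = cur.fetchone()
    if row is not None:
        process(row)
        i += 1
        continue
    # No row: either this id was skipped or we've caught up with the producer.
    cur.execute("SELECT MAX(MyIndexColumn) FROM mytable")
    max_id = cur.fetchone()[0]
    if max_id is not None and max_id > i:
        i += 1              # the id was skipped; move on
    else:
        time.sleep(1)       # caught up; wait for new rows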
