I have a Python script that retrieves the newest 5 records from a MySQL database and sends an email notification to a user containing this information.
I would like the user to receive only new records, not old ones.
I can retrieve data from MySQL without problems...
I've tried storing the results in text files and comparing the files but, of course, the text file containing freshly retrieved data will always have 5 more records than the old one.
So I have a logic problem here that, being a newbie, I can't tackle easily.
Using lists is also an idea, but I'm stuck on the same kind of problem.
The infamous 5 records can stay the same for a week, and then we might get one new record, or maybe 3 new records in a day.
It's quite unpredictable, but that's more or less the behaviour.
Thank you so much for your time and patience.
Are you assigning a unique incrementing ID to each record? If you are, you can create a separate table that holds just the ID of the last record fetched; that way you retrieve only records with IDs greater than that one. Each time you fetch, you update this table with the new latest ID.
Let me know if I misunderstood your issue, but saving the last fetched ID in the database could be a solution.
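As a minimal sketch with mysql.connector, where records, notification_state, and send_email() are placeholders for your actual table names and email code:

import mysql.connector

conn = mysql.connector.connect(host="localhost", user="user", password="secret", database="mydb")
cur = conn.cursor()

# Read the ID of the last record we already emailed
# (assumes notification_state was seeded with a single row, last_id = 0).
cur.execute("SELECT last_id FROM notification_state")
row = cur.fetchone()
last_id = row[0] if row else 0

# Fetch only records newer than that ID.
cur.execute("SELECT id, message FROM records WHERE id > %s ORDER BY id", (last_id,))
new_records = cur.fetchall()

if new_records:
    send_email(new_records)  # your existing notification code
    # Remember the newest ID we just sent.
    cur.execute("UPDATE notification_state SET last_id = %s", (new_records[-1][0],))
    conn.commit()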
I'm currently trying to make a Twitter bot that will print out new entries from my database. I'm using Python to do this, and I can successfully post messages to Twitter. However, whenever a new entry comes into the database, the bot doesn't pick it up.
How would I go about implementing something like this, and what would I use? I'm not too experienced with this topic area. Any help or guidance would be appreciated.
Use a trigger to propagate the newly inserted rows from the original table to a record table that Python watches, and have Python post the new records (possibly removing the already-posted ones from the record table):
DELIMITER //
drop trigger if exists record_after_insert //
create trigger record_after_insert after insert on original_table for each row
begin
    -- copy the freshly inserted row into the table that Python watches
    insert into record_table (new_record) values (new.new_message);
end //
DELIMITER ;
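On the Python side, a minimal polling sketch, assuming mysql.connector, an auto-increment id column on record_table, and a post_to_twitter() helper standing in for your existing tweeting code:

import time
import mysql.connector

def post_new_records():
    conn = mysql.connector.connect(host="localhost", user="user", password="secret", database="mydb")
    cur = conn.cursor()
    # Read whatever the trigger has copied into the queue table.
    cur.execute("SELECT id, new_record FROM record_table ORDER BY id")
    for row_id, message in cur.fetchall():
        post_to_twitter(message)  # your existing Twitter-posting code
        # Remove the row once it has been posted.
        cur.execute("DELETE FROM record_table WHERE id = %s", (row_id,))
    conn.commit()
    conn.close()

while True:
    post_new_records()
    time.sleep(60)  # poll once a minute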
Using PynamoDB, is anyone else having issues trying to count all items in a table? I have a model that I created a table from.
I'm trying to use ModelName.count(), but I keep getting 0 even though there are some items in the table.
When passing a specific key to ModelName.count(key) I get the correct results, but I want to count all items.
I tried to query all items and count them, but it seems that I must set the primary key to query, so this workaround isn't relevant.
I'll be glad for help here if someone has dealt with this before.
Thanks!
To count all items in the table without scanning the entire table at a huge cost, PynamoDB would have to use the DescribeTable operation, which returns an ItemCount field. However, the DynamoDB documentation explains that:
DynamoDB updates this value approximately every six hours. Recent changes might not be reflected in this value.
So it is possible that this is your problem. Please try to wait six hours and see if the count gets updated.
Regarding querying all the items without a key - you can do that, but it's not called a Query, it's called a Scan. And it will be very expensive (even if you just want to count, you'll basically be paying for reading the entire contents of the database).
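For illustration, this is what reading that approximate count looks like with boto3 (the table name is a placeholder):

import boto3

client = boto3.client("dynamodb")
# ItemCount comes from DescribeTable and is only refreshed roughly every six hours.
table_info = client.describe_table(TableName="my-table")
print(table_info["Table"]["ItemCount"])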
I have a large SQL database that contains all call records from a call center for the last 15-ish years. I am working with a subset of the records (3-5 million records). There is a field stored as text where we keep all notes from the call, emails, etc. I would like to provide a list of keywords and have the program output a label in a new column for each record, essentially classifying each record with the likely problem.
For example, my text record contains "Hi John, thank you for contacting us for support with your truck. Has the transmission always made this noise"
The query would then be something like
If the text record contains "Truck" and "Transmission" then the new column value is "error123".
I'm not sure if doing this in SQL would be feasible, as there are almost 170 different errors that need to be matched. I was also thinking it could maybe be done in Python? I'm not sure what would be the best fit for this type of tagging.
Currently, I am using Power Query in Power BI to load the SQL table, and then 170 switch statements to create a calculated column. This handles about 500k records before timing out. While I can chunk my records, I know this isn't the best way, but I'm not sure what tool would be most suited to it.
EDIT
Per the answer below, I am going to run an UPDATE command for each error on a new column. I only have read-only access to the database, so I am using the code below to pull the data and add a new column called "Error". My problem is that I want the UPDATE to fill in the new "Error" column in my query result rather than in the database itself. Is this possible? I know the UPDATE needs a table; what would the returned query table be called? Is it possible to do it this way?
SELECT *, 'null' AS Error FROM [TicketActivity]
UPDATE
SET Error = 'desktop'
WHERE ActivityNote LIKE '%desktop%'
AND ActivityNote LIKE '%setup%'
If you just need to check for keywords, I would not take the detour through Python, since you would need to transfer all the information from the DB into Python memory and back.
I would fire 170 different versions of this with UPDATE instead of SELECT, and have columns available where you can enter True or False (or copy probable records into another table using the same approach).
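If you do have write access and prefer to generate those 170 statements from a script instead of typing them by hand, a rough sketch with pyodbc could look like this (the error-to-keyword mapping and connection string are made up); the matching itself still runs inside SQL Server:

import pyodbc

# Hypothetical mapping: error code -> keywords that must all appear in the note.
ERROR_KEYWORDS = {
    "error123": ["truck", "transmission"],
    "desktop": ["desktop", "setup"],
}

conn = pyodbc.connect("DSN=callcenter")  # adjust for your server
cur = conn.cursor()
for error, keywords in ERROR_KEYWORDS.items():
    where = " AND ".join("ActivityNote LIKE ?" for _ in keywords)
    params = [error] + [f"%{kw}%" for kw in keywords]
    cur.execute(f"UPDATE TicketActivity SET Error = ? WHERE {where}", params)
conn.commit()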
So, I figured this out through some more Googling after being pointed in the right direction here.
SELECT *,
CASE
WHEN column1 LIKE '%keyword%'
AND column1 LIKE '%keyword%' THEN 'Error 123'
WHEN column1 LIKE '%keyword%'
AND column1 LIKE '%keyword%' THEN 'Error 321'
ELSE 'No Code'
END AS ErrorMessage
FROM [TicketActivity]
I repeat the WHEN clauses as many times as needed, and use a WHERE clause to select my time range.
In a pymongo project I'm working on, in a particular collection, I have to keep uploading name and age of people who will be entering them. I have to identify them with unique ids.
What I was planning to do is start the first id from 1. When inserting data, I first read the whole collection, find the number of records, and then save my record with the next id (e.g. if I read and find there are 10 records, the id of my new record will be 11).
But is there any better way to do it?
MongoDB already assigns a unique _id to each document, and these can be sorted in ascending order. If you don't want to use that, you can create a separate collection containing just one document with a totalRecordsCount field. Increment it every time you add a new record, and read the latest number before inserting. This is not the best way, but it lets you avoid reading the whole collection.
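A common variant of that counter idea is to let MongoDB do the increment atomically with find_one_and_update, so two writers can't grab the same number (collection names here are just examples):

from pymongo import MongoClient, ReturnDocument

db = MongoClient()["mydb"]

def next_user_id():
    # Atomically increment the counter document and return the new value.
    counter = db.counters.find_one_and_update(
        {"_id": "user_id"},
        {"$inc": {"seq": 1}},
        upsert=True,
        return_document=ReturnDocument.AFTER,
    )
    return counter["seq"]

db.people.insert_one({"_id": next_user_id(), "name": "Alice", "age": 30})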
My Python web application uses DynamoDB as its datastore, but this is probably applicable to other NoSQL tables where index consistency is handled at the application layer. I'm de-normalizing data and creating indices in several tables to facilitate lookups.
For example, for my users table:
* Table 1: (user_id) email, employee_id, first name, last name, etc ...
* Table 2: (email) user_id
* Table 3: (employee_id) user_id
Table 1 is my "primary table" where user info is stored. If the user_id is known, all info about a user can be retrieved in a single GET query.
Table 2 and 3 enable lookups by email or employee_id, requiring a query to those tables first to get the user_id, then a second query to Table 1 to retrieve the rest of the information.
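For example, looking a user up by email currently means two GETs, roughly like this (boto3 syntax as an illustration, table names simplified):

import boto3

dynamodb = boto3.resource("dynamodb")
users = dynamodb.Table("users")                    # Table 1
users_by_email = dynamodb.Table("users_by_email")  # Table 2

def get_user_by_email(email):
    # First hop: the index table gives us the user_id.
    index_item = users_by_email.get_item(Key={"email": email}).get("Item")
    if index_item is None:
        return None
    # Second hop: the primary table holds the full record.
    return users.get_item(Key={"user_id": index_item["user_id"]}).get("Item")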
My concern is with the de-normalized data -- what is the best way to handle deletions from Table 1 to ensure the matching data gets deleted from Tables 2 and 3? And how do I ensure inserts are applied consistently as well?
Right now my chain of events is something like:
1. Insert row in table 1
2. Insert row in table 2
3. Insert row in table 3
Does it make sense to add "checks" at the end? Something like:
4. Check that all 3 rows have been inserted.
5. If a row is missing, remove rows from all tables and raise an error.
Any other techniques?
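Concretely, steps 1-5 would be something like this (same placeholder table names as in the lookup sketch above):

import boto3

dynamodb = boto3.resource("dynamodb")
users = dynamodb.Table("users")                                # Table 1
users_by_email = dynamodb.Table("users_by_email")              # Table 2
users_by_employee_id = dynamodb.Table("users_by_employee_id")  # Table 3

def create_user(user_id, email, employee_id, **attrs):
    # Steps 1-3: write the primary row and both index rows.
    users.put_item(Item={"user_id": user_id, "email": email, "employee_id": employee_id, **attrs})
    users_by_email.put_item(Item={"email": email, "user_id": user_id})
    users_by_employee_id.put_item(Item={"employee_id": employee_id, "user_id": user_id})

    # Steps 4-5: check that all three rows landed; if not, remove them and raise.
    written = [
        users.get_item(Key={"user_id": user_id}).get("Item"),
        users_by_email.get_item(Key={"email": email}).get("Item"),
        users_by_employee_id.get_item(Key={"employee_id": employee_id}).get("Item"),
    ]
    if not all(written):
        users.delete_item(Key={"user_id": user_id})
        users_by_email.delete_item(Key={"email": email})
        users_by_employee_id.delete_item(Key={"employee_id": employee_id})
        raise RuntimeError("user insert was only partially applied")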
Short answer: there is no way to ensure consistency. This is the price you agreed to pay when moving to NoSQL, in exchange for performance and scalability.
DynamoDB-mapper has a "transaction engine". Transaction objects are plain DynamoDB items and may be persisted. This way, if a logical group of actions (a transaction) has succeeded, we can be sure of it by looking at the persisted status. But we have no means to be sure it has not...
To do a bit of advertisement :), the dynamodb-mapper transaction engine supports:
single/multiple targets
sub transactions
transaction creating objects (not released yet)
If you are rolling your own mapper (which is an enjoyable task), feel free to have a look at our source code: https://bitbucket.org/Ludia/dynamodb-mapper/src/52c75c5df921/dynamodb_mapper/transactions.py
Disclaimer: I am one of the main contributors to the dynamodb-mapper project. Feel free to contribute :)
Disclaimer: I haven't actually used DynamoDB, just looked through the data model and API, so take this for what it's worth.
The use case you're giving is one primary table for the data, with other tables for hand-rolled indices. This really sounds like work for an RDBMS (maybe with some sharding for growth). But if that won't cut it, here are a couple of ideas which may or may not work for you.
A. Leave it as it is. If you'll never serve data from your index tables, then maybe you can afford to have lazy deletion and insertion as long as you handle the primary table first. Say this happens:
1) Delete JDoe from Main table
xxxxxxxxxx Process running code crashes xxxxxxx
2) Delete from email index // Never gets here
3) Delete from employee_id index // Never gets here
Well, if an "email" query comes in, you'll resolve the corresponding user_id from the index (now stale), but it won't show up on the main table. You know that something is wrong, so you can return a failure/error and clean up the indexes. In other words, you just live with some stale data and save yourself the trouble, cleaning it up as necessary. You'll have to figure out how much stale data to expect, and maybe write a script that does some housekeeping daily.
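In code, that lazy-repair lookup would be roughly (boto3-style sketch, table names are illustrative):

import boto3

dynamodb = boto3.resource("dynamodb")
users = dynamodb.Table("users")
users_by_email = dynamodb.Table("users_by_email")

def get_user_by_email(email):
    index_item = users_by_email.get_item(Key={"email": email}).get("Item")
    if index_item is None:
        return None
    user = users.get_item(Key={"user_id": index_item["user_id"]}).get("Item")
    if user is None:
        # Stale index entry: the main-table delete ran but this one never did.
        users_by_email.delete_item(Key={"email": email})
        return None
    return user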
B. If you really want to simulate locks and transactions, you could consider using something like Apache Zookeeper, which is a distributed system for managing shared resources like locks. It'd be more work and overhead, but you could probably set it up to do what you want.