PynamoDB - Query or count all items on table - python

Using PynamoDB, is anyone else having issues trying to count all the items in a table? I have a model that I created a table from.
I'm trying to use ModelName.count(), but I keep getting 0 even though the table has some items.
When I pass a specific key to ModelName.count(key) I get the correct result, but I want to count all items.
I tried to query all items and count them, but it seems I must supply the primary key to query, so that workaround isn't relevant.
I'd be glad for help here if someone has dealt with this before.
Thanks!

To count all items in the table without scanning the entire table at a huge cost, PynamoDB would have to use the DescribeTable operation, which returns an ItemCount in its response. However, the DynamoDB documentation explains that:
DynamoDB updates this value approximately every six hours. Recent changes might not be reflected in this value.
So it is possible that this is your problem. Try waiting six hours and see whether the count gets updated.
Regarding querying all the items without a key - you can do that, but it's not called a Query, it's called a Scan. And it will be very expensive (even if you just want to count, you'll essentially be paying to read the entire contents of the table).
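For reference, a minimal sketch of the two options, assuming a hypothetical Thread model and region (neither is from the question): the approximate count comes from DescribeTable via boto3, the exact count from a full Scan through PynamoDB.
import boto3
from pynamodb.models import Model
from pynamodb.attributes import UnicodeAttribute

class Thread(Model):  # hypothetical model, for illustration only
    class Meta:
        table_name = 'Thread'
        region = 'us-east-1'
    forum_name = UnicodeAttribute(hash_key=True)

# Approximate item count, refreshed by DynamoDB roughly every six hours
client = boto3.client('dynamodb', region_name='us-east-1')
approx = client.describe_table(TableName='Thread')['Table']['ItemCount']

# Exact count via a full table Scan -- reads (and bills for) every item
exact = sum(1 for _ in Thread.scan())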

Related

How to get the nth record of a model with Python - Django?

I'm currently working on a command that deletes old records for some models that grow a lot in the database. The idea is to receive a parameter that indicates from which record number (not id) we have to delete backwards. To do this I came up with this solution:
reference_record = WarmUpCache.objects.filter(company_id=company).values_list('id', flat=True).order_by('-id')[_number_of_records]
records_to_delete = WarmUpCache.objects.filter(company_id=company, id__lt=reference_record)
if records_to_delete:
    records_to_delete.delete()
For example, for a given company_id=118 I get the ids of the records associated with that Company. Having this, I try to get the nth record and then calculate how many records have an id lower than that one. After that, I delete all of them.
This solution currently works, but I'd like to improve it somehow. I checked Stack Overflow for answers but only found old ones explaining almost the same solution I came up with:
django query get last n records
django queryset runtime - get nth entry in constant time
So, the question itself is: is there any way to improve this query by obtaining just the nth record of a model?
Thanks a lot.
You can use a Python slice to take everything from the Nth record onwards. Django doesn't allow calling .delete() directly on a sliced queryset, so filter on the sliced ids first:
ids_to_delete = WarmUpCache.objects.order_by('id').values_list('id', flat=True)[from_which_to_delete:]
WarmUpCache.objects.filter(id__in=list(ids_to_delete)).delete()
Also, this condition
if records_to_delete:
    records_to_delete.delete()
is not required, because if records_to_delete is empty, .delete() will simply do nothing.
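A hedged sketch of the same idea applied to the original goal (keep only the newest _number_of_records rows per company), assuming the WarmUpCache model and the company variable from the question:
# Ids of the newest rows we want to keep for this company.
keep_ids = (WarmUpCache.objects
            .filter(company_id=company)
            .order_by('-id')
            .values_list('id', flat=True)[:_number_of_records])

# Sliced querysets can't be deleted directly, so exclude the kept ids instead.
WarmUpCache.objects.filter(company_id=company).exclude(id__in=list(keep_ids)).delete()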

Best way to search text records in SQL database based on keywords and create a calculated column

I have a large SQL database that contains all call records from a call center for the last 15-ish years. I am working with a subset of the records (3-5 million records). There is a field stored as text where we keep all notes from the call, emails, etc. I would like to provide a list of keywords and have the program output a label in a new column for each record, essentially classifying each record with the likely problem.
For example, my text record contains "Hi John, thank you for contacting us for support with your truck. Has the transmission always made this noise"
The query would then be something like:
If the text record contains "Truck" and "Transmission", then the new column value is "error123".
I'm not sure whether doing this in SQL would be feasible, as there are almost 170 different errors that need to be matched. I was also thinking it could maybe be done in Python? I'm not sure what would be the best fit for this type of tagging.
Currently, I am using PowerQuery in PowerBI to load the SQL table and then 170 switch statements to create a calculated column. This handles about 500k records before timing out. While I could chunk my records, I know this isn't the best way, but I'm not sure which tool is most suited to the job.
EDIT
Per the answer below, I am going to run an update command for each error on a new column. I only have read-only access to the database, so I am using the code below to pull the data and add a new column called "Error". My problem is that I want the update command to update the new "Error" column in the query result rather than the DB itself. Is this possible? I know the update needs a table; what would the returned query table be called? Is it possible to do it this way?
SELECT *, 'null' AS Error FROM [TicketActivity]
UPDATE
SET Error = 'desktop'
WHERE ActivityNote LIKE '%desktop%'
AND ActivityNote LIKE '%setup%'
If you just need to check for keywords, I would not take the detour through Python, since you would have to transfer all the information from the DB into Python memory and back.
I would fire 170 different versions of this with UPDATE instead of SELECT and have columns available where you can enter True or False (or copy the probable records into another table using the same approach).
So, I figured this out through some more Googling after being pointed in the right direction here.
SELECT *,
    CASE
        WHEN column1 LIKE '%keyword%'
         AND column1 LIKE '%keyword%' THEN 'Error 123'
        WHEN column1 LIKE '%keyword%'
         AND column1 LIKE '%keyword%' THEN 'Error 321'
        ELSE 'No Code'
    END AS ErrorMessage
FROM [TicketActivity]
Repeating the WHEN clauses for as many errors as needed, and using a WHERE clause to select my time range.
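With roughly 170 rules, hand-writing that CASE expression gets unwieldy. Here is a hedged sketch that generates it in Python from a mapping; the example rules and the ActivityNote column name are illustrative assumptions, not from the original data.
# Map each error code to the keywords that must all appear in the note.
# These rules are made-up placeholders.
rules = {
    'Error 123': ['truck', 'transmission'],
    'Error 321': ['desktop', 'setup'],
}

def build_case_expression(column, rules):
    whens = []
    for error, keywords in rules.items():
        condition = ' AND '.join(f"{column} LIKE '%{kw}%'" for kw in keywords)
        whens.append(f"WHEN {condition} THEN '{error}'")
    return 'CASE ' + ' '.join(whens) + " ELSE 'No Code' END AS ErrorMessage"

query = f"SELECT *, {build_case_expression('ActivityNote', rules)} FROM [TicketActivity]"
print(query)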

Python/Django/MySQL optimisation

Just a logic question really... I have a script that takes rows of data from a CSV, parses the cell values to make the data uniform, and checks the database that a key/primary value does not already exist, so as to prevent duplicates. At the moment, the first 10-15k entries commit to the DB fairly quickly, but then it really slows down as there are more entries in the DB to check against for duplicates... by the time there are 100k rows in the DB the commit speed is about 1/sec, argh...
So my question is: is it (Pythonically) more efficient to extract and parse the data separately from the DB commit procedure (maybe in a class-based script, or could I add multiprocessing to the CSV parsing or the DB commit)? And is there a quicker method to check the database for duplicates if I am only cross-referencing one table and one value?
Much appreciated
Kuda
If the first 10-15k entries worked fine, probably the issue is with the database query. Do you have a suitable index, and is that index used by the database? You can use an EXPLAIN statement to see what the database is doing, whether it's actually using the index for the particular query used by Django.
If the table starts empty, it might also help to run ANALYZE TABLE after the first few thousand rows; the query optimiser might have stale statistics from when the table was empty. To test this hypothesis, you can connect to the database while the script is running, when it starts to slow down, and run ANALYZE TABLE manually. If it immediately speeds up, the problem was indeed stale statistics.
As for optimisation of database commits themselves, it probably isn't an issue in your case (since the first 10k rows perform fine), but one aspect is the round-trips; for every query, it has to go to the database and get the results back. This is especially noticeable if the database is across a network. If you need to speed that up, Django has a bulk_create() method to insert many rows at once. However, if you do that, you'll only get an error for the whole batch of rows if you try to insert duplicates forbidden by the database indexes; you'll then have to find the particular row causing the error using other code.
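A hedged sketch of that batching idea, assuming a hypothetical Record model with a unique key field and rows already parsed from the CSV into parsed_csv_rows (both names are assumptions):
# Hypothetical model: Record has a unique CharField called "key".
# Load the existing keys once, then build only the rows that are new.
existing_keys = set(Record.objects.values_list('key', flat=True))
new_rows = [Record(key=row['key'], value=row['value'])
            for row in parsed_csv_rows
            if row['key'] not in existing_keys]

# bulk_create inserts many rows per round-trip; ignore_conflicts (Django 2.2+)
# makes the database skip duplicates instead of raising an error.
Record.objects.bulk_create(new_rows, batch_size=1000, ignore_conflicts=True)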

Is it bad practice to store a table in a database with no primary key?

I am currently working on a list implementation in Python that stores a persistent list as a database:
https://github.com/DarkShroom/sqlitelist
I am tackling a design consideration: it seems that SQLite allows me to store the data without a primary key?
self.c.execute('SELECT * FROM unnamed LIMIT 1 OFFSET {}'.format(key))
This line of code can retrieve a row by absolute row position.
Is this bad practice? Will I lose the data order at any point? Perhaps it's okay with SQLite, but my design will not translate to other database engines? Any thoughts from people more familiar with databases would be helpful. I am writing this so I don't have to deal with databases!
The documentation says:
If a SELECT statement that returns more than one row does not have an ORDER BY clause, the order in which the rows are returned is undefined.
So you cannot simply use OFFSET to identify rows.
A PRIMARY KEY constraint just tells the database that it must enforce UNIQUE and NOT NULL constraints on the PK columns. If you do not declare a PRIMARY KEY, these constraints are not automatically enforced, but that does not change the fact that you have to identify your rows somehow when you want to access them.
The easiest way to store list entries is to have the position in the list as a separate column. (If your program takes up most of its time inserting or deleting list entries, it might be a better idea to store the list not as an array but as a linked list, i.e., the database does not store the position but a pointer to the next entry.)
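A hedged sketch of the position-column idea using Python's built-in sqlite3 module (the table and column names are illustrative, not taken from the project):
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE items (position INTEGER PRIMARY KEY, value TEXT)')
conn.executemany('INSERT INTO items (position, value) VALUES (?, ?)',
                 [(0, 'first'), (1, 'second'), (2, 'third')])

# Look up by list index through the explicit position column, not OFFSET,
# so the result does not depend on an undefined row order.
row = conn.execute('SELECT value FROM items WHERE position = ?', (1,)).fetchone()
print(row[0])  # prints "second"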

How to check if data has already been previously used

I have a Python script that retrieves the newest 5 records from a MySQL database and sends an email notification to a user containing this information.
I would like the user to receive only new records, not old ones.
I can retrieve data from MySQL without problems...
I've tried storing the data in text files and comparing the files but, of course, the file with the freshly retrieved data always holds the latest 5 records, whether they are new or not.
So I have a logic problem here that, being a newbie, I can't tackle easily.
Using lists is also an idea, but I am stuck on the same kind of problem.
The infamous 5 records can stay the same for a week, and then we might get one new record, or maybe 3 new records in a day.
It's quite unpredictable, but more or less that should be the behaviour.
Thank you so much for your time and patience.
Are you assigning a unique, incrementing ID to each record? If you are, you can create a separate table that holds just the ID of the last record fetched; that way you retrieve only records with IDs greater than that one. Each time you fetch, you update this table with the new latest ID.
Let me know if I misunderstood your issue, but saving the last fetched ID in the database could be a solution.
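A hedged sketch of that last-seen-ID idea using mysql-connector-python; the table and column names (notification_state, records, last_id) are assumptions for illustration:
import mysql.connector

conn = mysql.connector.connect(host='localhost', user='user',
                               password='secret', database='mydb')
cur = conn.cursor()

# Read the ID of the last record we already emailed about.
cur.execute('SELECT last_id FROM notification_state WHERE id = 1')
last_id = cur.fetchone()[0]

# Only records newer than that one count as "new".
cur.execute('SELECT id, message FROM records WHERE id > %s ORDER BY id', (last_id,))
new_records = cur.fetchall()

if new_records:
    # ... build and send the email with new_records here ...
    cur.execute('UPDATE notification_state SET last_id = %s WHERE id = 1',
                (new_records[-1][0],))
    conn.commit()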
