How to get only specified columns from DynamoDB using Python?

I have the function below to pull the required columns from DynamoDB, and it is working fine.
The problem is that it pulls only a few rows from the table.
For example: the table has 26,000+ rows, but I'm able to get only about 3,000 rows here.
Did I miss anything?
def get_columns_dynamodb():
    try:
        response = table.query(
            ProjectionExpression="id, name, date",
            KeyConditionExpression=Key('opco_type').eq('cwc') and Key('opco_type').eq('cwp')
        )
        return response['Items']
    except Exception as error:
        logger.error(error)

In DynamoDB, there is no real "select only these columns". Or rather, there sort of is, but the projection is applied only after the data has been fetched from storage. The entire item is always read, and the entire item counts towards the various DynamoDB limits, such as the 1 MB maximum per response.
One way to solve this is to write your data in a way that is optimized for this query. Generally speaking, in DynamoDB you optimize "queries" (in quotes, since they are more of a key/value read than a dynamic query with joins, selects, etc.) by writing optimized data.
So, when you write data to your table, you can either use a transaction to write companion items to the same or a separate table, or you can use DynamoDB Streams to write the same data in a similar fashion, except asynchronously (i.e. eventually consistent).
Say you go with two tables: one table, my_things, contains the full items, and another table, my_things_for_query_x, holds only the exact attributes you need for this query. That lets you read more items in each chunk, since the items in storage contain only the data you actually need in your situation.
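That said, if the immediate symptom is that only ~3,000 of the 26,000 rows come back, remember that each Query response is capped at 1 MB and has to be paginated. A minimal sketch of a paginated version, assuming a boto3 Table resource like the one in the question (the table name is a placeholder, and the attributes are aliased through ExpressionAttributeNames since name and date are on DynamoDB's reserved-word list):

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('my_table')  # placeholder: the real table name isn't shown in the question

def get_columns_dynamodb(opco_type_value):
    items = []
    kwargs = {
        # alias the projected attributes; name and date are reserved words
        'ProjectionExpression': '#i, #n, #d',
        'ExpressionAttributeNames': {'#i': 'id', '#n': 'name', '#d': 'date'},
        'KeyConditionExpression': Key('opco_type').eq(opco_type_value),
    }
    while True:
        response = table.query(**kwargs)
        items.extend(response['Items'])
        if 'LastEvaluatedKey' not in response:  # no more pages
            return items
        # each response is capped at roughly 1 MB; continue where the last page stopped
        kwargs['ExclusiveStartKey'] = response['LastEvaluatedKey']

A Query targets a single partition key value, so you would call this once per opco_type value ('cwc', 'cwp') and combine the results.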


How does DynamoDB Query act when all items are in a single item collection?

I have DynamoDB, and I use Lambda to query the tables using Python.
My columns are:
product_id
product_name
create_at
I'd like to be able to sort by every column in descending or ascending order. From what I have read, I came to the conclusion that I need to make the first column the partition key and give it the same value in every record, let's say "dummy". I also need to make create_at the sort key, and for the other columns I need to create a local secondary index for each of them. Then, when I sort, I can do something like this:
response = table.query(
    KeyConditionExpression=Key('dummy_col').eq('dummy'),
    IndexName=product_name_index,
    ScanIndexForward=True,
)
What I don't understand is: will my query go through all the records, like a scan, because of my dummy value in every record?
If you require all these access patterns, that's one way to design your table. It has some limitations, though: you won't be able to have more than 10 GB of data in total, because using Local Secondary Indexes limits the size of an item collection (all items with the same partition key) to 10 GB.
Each Query reads and returns up to 1 MB of data (docs); afterwards you get a token (LastEvaluatedKey) that you can pass in a new query (ExclusiveStartKey) to request the next 1 MB. You can also use the Limit parameter to cap how many items are read and returned per Query. If you page through the whole table that way, you'll effectively have scanned it.
By filtering on the sort key you can also control where it starts reading, so you don't have to read everything every time.
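For illustration, a sketch of what paging with a Limit could look like in boto3 (the table object, the index name, and the attribute names are assumptions based on the question):

from boto3.dynamodb.conditions import Key

# first page: descending order through the product_name index
kwargs = dict(
    IndexName='product_name_index',                      # assumed LSI name
    KeyConditionExpression=Key('dummy_col').eq('dummy'),
    ScanIndexForward=False,                              # False = descending, True = ascending
    Limit=50,                                            # read/return at most 50 items per call
)
response = table.query(**kwargs)
items = response['Items']

# following pages: hand LastEvaluatedKey back as ExclusiveStartKey
while 'LastEvaluatedKey' in response:
    response = table.query(ExclusiveStartKey=response['LastEvaluatedKey'], **kwargs)
    items.extend(response['Items'])

With a Limit like this, each call reads only as much as it returns, so it is not a full scan unless you keep paging through the entire item collection.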

Database: faster to dump into a list or query every time?

I have a program that handles sales and inventory for businesses (of all sizes), and I've made it so that every time it interacts with the data stored in the database, a function is called for that specific query. I believe this is inefficient, since accessing a database is often slow, and as the software works right now these small I/O queries against the db are made very often. So I've been thinking of a way to improve this.
I thought about dumping the data of the different tables into different lists at the start of the program and using those lists while it runs. Then, at close, apply the corresponding changes to the database according to those made in the lists. This seems a better solution, since all the data would be in memory and the slow handling of large amounts of data (which db engines are optimized for) would happen only at startup and exit.
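Roughly, the idea would be something like this (just a sketch, with made-up table names and assuming the sqlite3 connection the program already has):

import atexit

cache = {}  # table name -> list of rows, loaded once at startup

def load_cache(tables=('products', 'sales', 'inventory')):  # made-up table names
    for table in tables:
        cache[table] = conn.execute(f'SELECT * FROM {table}').fetchall()

def flush_cache():
    # at close, write back whatever changed; left as a placeholder here
    with conn:
        pass

atexit.register(flush_cache)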
As an example of the current approach, this is the function I call to retrieve an entire table or a single column of a table:
def sqGenericSelect(table, fetchCol=None):
    '''
    table: table retrieved. If this is the only argument provided, the entire table gets retrieved.
    fetchCol: fetched column. If provided, the function returns only the column passed here.
    '''
    try:
        with conn:
            if not fetchCol: fetchCol = '*'
            c.execute(f'SELECT {fetchCol} FROM {table}')
            return c.fetchall()
    except Exception as exc:
        autolog(f'Problem in table: {table}', exc)  # This function is for logs
(And it gets called 8 times!)
Is this the right approach? If not, how should I improve this?

Best way to search text records in SQL database based on keywords and create a calculated column

I have a large SQL database that contains all call records from a call center for the last 15-ish years. I am working with a subset of the records (3-5 million records). There is a text field where we store all notes from the call, emails, etc. I would like to provide a list of keywords and have the program output a label in a new column for each record, essentially classifying each record with the likely problem.
For example, my text record contains "Hi John, thank you for contacting us for support with your truck. Has the transmission always made this noise"
The query would then be something like
If the text record contains "Truck" and "Transmission" then the new column value is "error123".
I'm not sure if doing this in SQL would be feasible, as there are almost 170 different errors that need to be matched. I was also thinking it could maybe be done in Python? I'm not sure what would be the best fit for this type of tagging.
Currently, I am using Power Query in Power BI to load the SQL table, and then 170 switch statements to create a calculated column. This handles about 500k records before timing out. While I can chunk my records, I know this isn't the best way, but I'm not sure which tool would be best suited to it.
EDIT
Per the answer below, I am going to run an update command for each error on a new column. I only have read-only access to the database, so I am using the code below to pull the data and add a new column called "Error". My problem is that I want the update command to update the new "Error" column instead of the DB. Is this possible? I know the update needs a table; what would the returned query table be called? Is it possible to do it this way?
SELECT *, 'null' AS Error FROM [TicketActivity]
UPDATE
SET Error = 'desktop'
WHERE ActivityNote LIKE '%desktop%'
AND ActivityNote LIKE '%setup%'
If you just need to check for keywords, I would not take the detour through Python, since you would have to transfer all the information from the db into Python memory and back.
I would fire 170 different versions of this query with UPDATE instead of SELECT, and have columns available where you can enter True or False (or copy the probable records into another table using the same approach).
So, I figured this out through some more Googling after being pointed in the right direction here.
SELECT *,
       CASE
           WHEN column1 LIKE '%keyword%'
                AND column1 LIKE '%keyword%' THEN 'Error 123'
           WHEN column1 LIKE '%keyword%'
                AND column1 LIKE '%keyword%' THEN 'Error 321'
           ELSE 'No Code'
       END AS ErrorMessage
FROM [TicketActivity]
I repeat the WHEN clauses as many times as needed, and use a WHERE clause to select my time range.
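Since writing 170 WHEN branches by hand is tedious, the CASE expression can also be generated, for example with a small Python helper (a sketch: the keyword/error pairs are made up, and the keywords are assumed to be trusted, static strings rather than user input):

# each error code -> the keywords that must all appear in the note (made-up examples)
error_rules = {
    'Error 123': ['truck', 'transmission'],
    'Error 321': ['desktop', 'setup'],
    # ... roughly 170 entries in practice
}

def build_case_expression(column='ActivityNote'):
    whens = []
    for error, keywords in error_rules.items():
        condition = ' AND '.join(f"{column} LIKE '%{kw}%'" for kw in keywords)
        whens.append(f"WHEN {condition} THEN '{error}'")
    return 'CASE ' + ' '.join(whens) + " ELSE 'No Code' END AS ErrorMessage"

query = f"SELECT *, {build_case_expression()} FROM [TicketActivity]"
print(query)  # paste into the SQL tool, or run it through whatever driver is in use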

How to tell if CQLEngine made an insert or update through the Model class Save

I am using Python 3.4 and CQLEngine. In my code, I am saving an object in an overloaded save method as follows:
class Foo(Model):
    id = columns.Integer(primary_key=True)
    bar = columns.Text()
    ...

    def save(self):
        super(Foo, self).save()
and I would like to know, from the return of the save function, whether save() made an insert or an update.
INSERT and UPDATE are synonyms in Cassandra, with very few exceptions. Here is a description of INSERT that briefly touches on one difference:
An INSERT writes one or more columns to a record in a Cassandra table atomically and in isolation. No results are returned. You do not have to define all columns, except those that make up the key. Missing columns occupy no space on disk. If the column exists, it is updated. You can qualify table names by keyspace. INSERT does not support counters, but UPDATE does.
Internally, the insert and update operation are identical.
You don't know whether it will be an insert or an update; you can think of it as a data save request, and the coordinator determines what it is.
This answers your original question: you can't know, based on the return of the save function, whether it was an insert or an update.
As for your comment below, which explained why you wanted that output: you can't reliably get this information out of Cassandra, but you can use lightweight transactions to a certain extent and run two statements sequentially with the same rows of data:
INSERT ... IF NOT EXISTS followed by UPDATE ... IF EXISTS
In the target table you will need a column where each of these statements writes a value unique to each call. Then you can select data based on the primary keys of your dataset and see how many rows have each value. This will roughly tell you how many updates and how many inserts there were. However, if there were any concurrent processes, they may have overwritten your data with their tokens, so this method is not very accurate and will work (like any other method with databases like Cassandra) only where there are no concurrent processes.
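If you control the writes, a rough sketch of that idea using cqlengine's lightweight-transaction helpers (this assumes the if_not_exists() support in recent versions of the DataStax Python driver, pays the LWT cost, and has the same caveat about concurrent writers):

from cassandra.cqlengine.query import LWTException

def save_foo(item_id, bar_value):
    """Return 'insert' if the row did not exist yet, 'update' otherwise (sketch only)."""
    try:
        # INSERT ... IF NOT EXISTS: only applied when no row with this key exists yet
        Foo.if_not_exists().create(id=item_id, bar=bar_value)
        return 'insert'
    except LWTException:
        # the row already existed, so write the new values as an update instead
        Foo.objects(id=item_id).update(bar=bar_value)
        return 'update'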

AWS DynamoDB retrieve entire table

Folks,
I am retrieving all items from a DynamoDB table, and I would like to replace the scan operation with a query.
Currently I am pulling in all the table's data via the following (Python):
drivertable = Table(url['dbname'])
all_drivers = []
all_drivers_query = drivertable.scan()
for x in all_drivers_query:
    all_drivers.append(x['number'])
How would I change this to use the query API?
Thanks!
There is no way to use Query to get the entire contents of the table. As of right now, you have a few options if you want to get all of your data out of a DynamoDB table, and all of them involve actually reading the data out of DynamoDB:
Scan the table. It can be done faster, at the expense of using much more read capacity, by using a parallel scan (see the sketch after this list).
Export your data using AWS Data Pipelines. You can configure the export job for where and how it should store your data.
Use one of the AWS event platforms to capture new data and denormalize it. For all new data, you can get a time-ordered stream of all updates to the table from DynamoDB Streams, or process the events using AWS Lambda.
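For the first option, a paginated scan with today's boto3 resource API could look like this (a sketch; the question's code uses the older boto Table class, and the table/attribute names are carried over from it):

import boto3

dynamodb = boto3.resource('dynamodb')
drivertable = dynamodb.Table(url['dbname'])  # same table-name lookup as in the question

all_drivers = []
kwargs = {}
while True:
    page = drivertable.scan(**kwargs)
    all_drivers.extend(item['number'] for item in page['Items'])
    if 'LastEvaluatedKey' not in page:  # no more pages left to read
        break
    kwargs['ExclusiveStartKey'] = page['LastEvaluatedKey']

A parallel scan splits this work across workers with the Segment and TotalSegments parameters, trading extra read capacity for speed.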
You can't query an entire table. Query is used to retrieve a set of items by supplying a hash key (part of the composite hash-range primary key of the table).
You cannot use Query without knowing the hash keys.
EDIT: a bounty was added to this old question, asking:
How do I get a list of hashes from DynamoDB?
Well, as of Dec 2014 you still can't ask a single API call for all the hash keys of a table.
Even if you add a GSI, you still can't get a DISTINCT hash count.
The way I would solve this is with denormalization: keep another table with no range key and write every hash key there alongside the main table. This adds housekeeping overhead at the application level (mainly when removing items), but it solves the problem you asked about.
