There is an external API that has some data that I need to keep a copy of in a DynamoDB table.
The external API gives me a user_id and some user_data. There are many user_ids that need to be updated on a regular basis (every few hours).
I can easily invoke the function to get the data initially into the table and record the first batch (as each user_id is added manually).
But how can I schedule the lambda function, to run again, with the user ID as a parameter, to update the record automatically, 3 hours after it's been added to the table?
I was thinking about a 'scheduler_table' in which I save, for each user_id, a timestamp of when the next update is due, plus a trigger that invokes a lambda function every 10 minutes to scan the table for 'expired' timestamps. This seems sub-optimal, so I was wondering whether there are any AWS resources, design decisions or methods that would let me do this in a more straightforward and easy-to-manage way.
If the DynamoDB scheduler_table solution is reasonable, what would be the best way to organize the primary/sort keys, and how would I go about querying them? (I've found out that scanning wouldn't be a great solution.)
PS: Solutions that are not too hard on my pockets are preferred.
Thanks!
I am trying to use AWS Timestream to store timestamped data (in Python, using boto3).
The data I need to store corresponds to the prices of different tokens over time. Each record has 3 fields: token_address, timestamp, price. I have around 100 M records (with timestamps from 2019 to now).
I have all the data in a CSV file and I would like to populate the DB with it. But I can't find a way to do this in the documentation, as I am limited to 100 records per write request according to the quotas. The only optimization proposed in the documentation is "Writing batches of records with common attributes", but in my case the records don't share the same values (they all have the same structure but not the same values, so I cannot define common_attributes as they do in the example).
So is there a way to populate a Timestream DB without writing records in batches of 100?
I asked AWS support, here is the answer:
Unfortunately, "Records per WriteRecords API request" is a non-configurable limit. This limitation is already noted by the development team.
However, to get any additional insights to help with your load, I have reached out to my internal team. I will get back to you as soon as I have an update from the team.
EDIT:
I had a new answer from AWS support:
The team suggested that a new feature called batch load is tentatively being released at the end of February (2023). This feature will allow the customer to ingest data from CSV files directly into Timestream in bulk.
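Until a bulk path is available, the 100-record limit means the CSV has to be sent in chunks. The following is only a minimal sketch of that chunking, not a confirmed recommendation from AWS support. It assumes the CSV has a header row with the columns token_address, timestamp and price, that timestamp is already in epoch seconds, and that `client` is a boto3 "timestream-write" client; the function names are made up for illustration.

```python
import csv
from itertools import islice

MAX_RECORDS_PER_REQUEST = 100  # the WriteRecords limit quoted in the question


def row_to_record(row):
    """Map one CSV row to a Timestream record (column names are assumptions)."""
    return {
        "Dimensions": [{"Name": "token_address", "Value": row["token_address"]}],
        "MeasureName": "price",
        "MeasureValue": str(row["price"]),
        "MeasureValueType": "DOUBLE",
        "Time": str(row["timestamp"]),  # assumed to be epoch seconds
        "TimeUnit": "SECONDS",
    }


def batches(iterable, size=MAX_RECORDS_PER_REQUEST):
    """Yield lists of at most `size` items without loading everything in memory."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def load_csv(client, csv_lines, database, table):
    """Send every CSV row to Timestream, one WriteRecords call per chunk.

    `csv_lines` is any iterable of text lines, for example an open file.
    Returns the number of records sent.
    """
    reader = csv.DictReader(csv_lines)
    sent = 0
    for chunk in batches(map(row_to_record, reader)):
        client.write_records(DatabaseName=database, TableName=table, Records=chunk)
        sent += len(chunk)
    return sent
```

With a real client this would be called as load_csv(boto3.client("timestream-write"), open("prices.csv", newline=""), "my_db", "my_table"), where the file, database and table names are placeholders. The requests are independent, so they could also be issued from several threads or processes, but throttling and rejected-record handling are left out of this sketch.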
I am trying to build a big aggregated table with Google's tools, but I am a bit lost on how to do it.
Here is what I would like to create: I have a big table in BigQuery. It's updated daily with about 1.2M events for every user of the application. I would like to have an auto-updating aggregate table (updated once every day) built upon that, with all user data broken down by userID. But how do I continuously update the data inside of it?
I read a bit about Firebase and BigQuery, but since they are very new to me I can't figure out if this is possible to do serverlessly.
I know how to do it with a jenkins process that queries the big events table for the last day, gets all userIDs, joins with the data from the existing aggregate values for the userIDs, takes whatever is changed and deletes from the aggregate in order to insert the updated data. (In python)
The issue is I want to do this entirely within the structure of google. Is firebase able to do that? Is bigquery able? How? What tools? Could this be solved using the serverless functions available?
I am more familiar with Redshift.
You can use a pretty new BigQuery feature to schedule queries.
I use it to create rollup tables. If you need more custom stuff you can use cloud scheduler to call any Google product that can be triggered by an HTTP request such as cloud function, cloud run, or app engine.
I want to reset a DynamoDB table's read and write throughput, using an AWS Lambda function, after my build creates the table. I need to provision 200 RCU and 600 WCU during an initial run to write my data to the table. Once written, my table does not require more than 50 WCU and 20 RCU. I currently reset the values in the DynamoDB console once the table is created.
Developers use CircleCI, which uses environment variables to provision RCU/WCU and triggers the build that creates the Lambda functions and DynamoDB tables. As an admin, I don't have access to the code repo, but basically it creates the required tables via CircleCI build triggers using the source code in a GitHub repo. I was asked to automate the problem described above.
We would like to write a new Lambda function that is triggered once the DynamoDB table has been created successfully during the initial run. This new function should reset the table throughput to 50 WCU and 20 RCU without relying on DynamoDB auto scaling. I researched many places and went over the AWS documentation, but could not find details or functions that would make sense.
Did you check the UpdateTable API?
You can include a ProvisionedThroughput object in your request to update the WCU/RCU.
Does this answer make sense to you?
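As a rough illustration of the UpdateTable call, here is a minimal sketch using boto3. It assumes `dynamodb_client` is a boto3 "dynamodb" client and that waiting for the table to become ACTIVE first is acceptable; the function name and the default values (taken from the question) are only for illustration.

```python
def reset_throughput(dynamodb_client, table_name, read_units=20, write_units=50):
    """Lower a table's provisioned throughput once it has been created.

    `dynamodb_client` is assumed to be a boto3 DynamoDB client. UpdateTable
    is only accepted while the table is ACTIVE, so wait for that first.
    """
    waiter = dynamodb_client.get_waiter("table_exists")
    waiter.wait(TableName=table_name)

    return dynamodb_client.update_table(
        TableName=table_name,
        ProvisionedThroughput={
            "ReadCapacityUnits": read_units,
            "WriteCapacityUnits": write_units,
        },
    )
```

Inside a Lambda handler this could be called as reset_throughput(boto3.client("dynamodb"), "my-table"), with the table name coming from the triggering event or an environment variable. How the function gets triggered after the initial load is not covered here, and the Lambda role would need permission for dynamodb:DescribeTable and dynamodb:UpdateTable.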
I use the python sdk to create a new bigquery table:
tableInfo = {
    'tableReference': {
        'datasetId': datasetId,
        'projectId': projectId,
        'tableId': targetTableId
    },
    'schema': schema
}
result = bigquery_service.tables().insert(projectId=projectId,
                                          datasetId=datasetId,
                                          body=tableInfo).execute()
The result variable contains the created table information with etag,id,kind,schema,selfLink,tableReference,type - therefore I assume the table is created correctly.
Afterwards I even get the table, when I call bigquery_service.tables().list(...)
The problem is:
When inserting right after that, I still (often) get an error: Not found: MY_TABLE_NAME
My insert function call looks like this:
response = bigquery_service.tabledata().insertAll(
    projectId=projectId,
    datasetId=datasetId,
    tableId=targetTableId,
    body=body).execute()
I even retried the insert multiple times with 3 seconds of sleep between retries. Any ideas?
My projectId is stylight-bi-testing
There were a lot of failures between 10:00 and 12:00 (time given in UTC).
Per your answers to my question regarding using NOT_FOUND as an indicator to create the table, this is intended (though admittedly somewhat frustrating) behavior.
The streaming insertion path caches information about tables (and the authorization of a user to insert into the table). This is because of the intended high-QPS nature of the API. We also cache certain negative responses in order to protect against buggy or abusive clients. One of those cached negative responses is the non-existence of a destination table. We've always done this on a per-machine basis, but recently added an additional centralized cache, such that all machines will see the negative cache result almost immediately after the first NOT_FOUND response is returned.
In general, we recommend that table creation not occur inline with insert requests, because in a system that is issuing thousands of QPS of inserts, a table miss could result in thousands of table creation operations which can be taxing on our system. Instead, if you know the possible set of tables beforehand, we recommend some periodic process that performs table creations in advance of their usage as a streaming destination. If your destination tables are more dynamic in nature, you may need to implement a delay after table creation has been performed.
Apologies for the difficulty. We do hope to address this issue, but we don't have any timeframe yet for doing so.
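For dynamically created tables, the "delay after table creation" idea could look roughly like the sketch below. The answer above does not say how long the delay needs to be, so the 10-second default and the doubling back-off are pure assumptions, as are all the function and parameter names. The caller supplies `create_table` and `insert_rows` as zero-argument callables (for example wrappers around the tables().insert and tabledata().insertAll calls from the question) and `is_not_found` to recognize the NOT_FOUND error.

```python
import time


def insert_after_create(create_table, insert_rows, is_not_found,
                        settle_seconds=10.0, max_attempts=5, sleep=time.sleep):
    """Create a table, wait, then insert, backing off on NOT_FOUND errors.

    The first insert is deliberately delayed so that it is less likely to
    trigger (and get cached as) a NOT_FOUND response. All timings are guesses.
    """
    create_table()
    sleep(settle_seconds)  # give the new table time to become visible

    delay = settle_seconds
    for attempt in range(1, max_attempts + 1):
        try:
            return insert_rows()
        except Exception as exc:
            # Give up on unrelated errors or once the attempts are used up.
            if attempt == max_attempts or not is_not_found(exc):
                raise
            delay *= 2  # wait longer each time so a cached miss can expire
            sleep(delay)
```

Where the set of tables is known in advance, creating them in a separate periodic job, as recommended above, avoids the need for this kind of waiting altogether.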
I have several CouchDB databases. The largest is about 600k documents, and I am finding that queries are prohibitively long (several hours or more). The DB is updated infrequently (once a month or so), and only involves adding new documents, never updating existing documents.
Queries are of the type: Find all documents where key1='a' or multiple keys: key1='a', key2='b'...
I don't see that permanent views are practical here, so have been using the CouchDB-Python 'query' method.
I have tried several approaches, and I am unsure what is most efficient, or why.
Method 1:
map function is:
map_fun = '''function(doc){
    if(doc.key1=='a'){
        emit(doc.A, [doc.B, doc.C,doc.D,doc.E]);
    }
}'''
The Python query is:
results = ui.db.query(map_fun, key2=user)
Then some operation with results.rows. This takes up the most time.
It takes about an hour for 'results.rows' to come back. If I change key2 to something else, it comes back in about 5 seconds. If I repeat the original user, it's also fast.
But sometimes I need to query on more keys, so I try:
map_fun = '''function(doc){
    if(doc.key1=='a' && doc.key2==user && doc.key3=='something else' && etc.){
        emit(doc.A, [doc.B, doc.C,doc.D,doc.E]);
    }
}'''
and use the python query:
results = ui.db.query(map_fun)
Then some operation with results.rows
It takes a long time for the first query. When I change key2, it takes a long time again. If I change key2 back to the original data, it takes the same amount of time. (That is, nothing seems to be getting cached, B-tree'ed or whatever.)
So my question is: What's the most efficient way to do queries in couchdb-python, where the queries are ad hoc and involve multiple keys for search criteria?
The UI is QT-based, using PyQt underneath.
There are two caveats for the couchdb-python db.query() method:
It executes a temporary view. This means that the code flow is blocked until all documents have been processed by this view, and this happens again for each call. Try to save the view and use the db.view() method instead, to get results on demand and have incremental index updates.
It reads the whole result no matter how big it is. Neither db.query() nor db.view() is lazy, so if the view result is a 100 MB JSON object, you have to fetch all of this data before you can use it. To query data in a more memory-efficient way, try applying the patch that adds a db.iterview() method; it allows you to fetch data in a paginated style.
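To illustrate the first point, here is a minimal sketch of saving a view once and then querying it by key with db.view(). It assumes `db` is a couchdb-python Database object and that lookups are on key1 and key2; the design document name "lookup" and the view name "by_key1_key2" are made up for this example.

```python
def build_design_doc():
    """Design document holding one permanent view keyed on [key1, key2]."""
    return {
        "_id": "_design/lookup",
        "language": "javascript",
        "views": {
            "by_key1_key2": {
                # Emit a composite key so both fields can be matched at once.
                "map": (
                    "function(doc) {"
                    "  if (doc.key1 && doc.key2) {"
                    "    emit([doc.key1, doc.key2],"
                    "         [doc.A, doc.B, doc.C, doc.D, doc.E]);"
                    "  }"
                    "}"
                )
            }
        },
    }


def ensure_view(db):
    """Save the design document once; later calls leave it untouched."""
    doc = build_design_doc()
    if doc["_id"] not in db:
        db.save(doc)


def find(db, key1, key2):
    """Look up rows whose emitted key equals [key1, key2] in the saved view."""
    rows = db.view("lookup/by_key1_key2", key=[key1, key2])
    return [row.value for row in rows]
```

The first query after the view is saved still has to index every document, but later queries only read the index, and monthly additions are indexed incrementally. Each distinct combination of search fields would need its own view (or a view with a longer composite key), which is the trade-off against ad hoc temporary views.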
I think that the fix to your problem is to create an index for the keys you are searching. It is what you called permanent view.
Note the difference between map/reduce and SQL queries in a B-tree based table:
simple SQL query searching for a key (if you have an index for it) traverses single path in the B+-tree from root to leaf,
a map function reads all the elements, even if it emits a small result.
What you are doing for each query is:
1. reading every document (most of the cost), and
2. searching for a key in the emitted result (a quick search in the B-tree),
and I think your solution has to be slow by design.
If you redesign database structure to make permanent views practical, (1.) will be executed once and only (2.) will be executed for each query. Each document will be read by a view after addition to DB and a query will search in B-tree storing emitted result. If emitted set is smaller than the total documents number, then the query searches smaller structure and you have the benefit over SQL databases.
Temporary views are far less efficient than permanent ones and are meant to be used only for development. CouchDB was designed to work with permanent views. To make map/reduce efficient, one has to implement caching or make the view permanent. I am not familiar with the details of the CouchDB implementation; perhaps the second query with a different key is faster because of some caching. If for some reason you have to use temporary views, then perhaps CouchDB is a mistake and you should consider a DBMS created and optimized for online queries, like MongoDB.