I am trying to use AWS Timestream to store data with timestamps (in Python, using boto3).
The data I need to store corresponds to prices over time of different tokens. Each record has 3 fields: token_address, timestamp, price. I have around 100M records (with timestamps from 2019 to now).
I have all the data in a CSV and I would like to populate the DB with it, but I can't find a way to do this in the documentation, as I am limited to 100 writes per request according to the quotas. The only optimization proposed in the documentation is writing batches of records with common attributes, but in my case the records don't share the same values (they all have the same structure but not the same values, so I cannot define common_attributes as they do in the example).
So is there a way to populate a Timestream DB without writing records in batches of 100?
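For context, here is a minimal sketch of the kind of batched loading I am limited to today, assuming the CSV has a header row of token_address,timestamp,price, millisecond epoch timestamps, and placeholder database/table names:

# Chunk the CSV into groups of 100 records (the WriteRecords quota) and write each chunk.
import csv
import boto3

client = boto3.client("timestream-write")

def to_record(row):
    # Map one CSV row to a Timestream record.
    return {
        "Dimensions": [{"Name": "token_address", "Value": row["token_address"]}],
        "MeasureName": "price",
        "MeasureValue": row["price"],
        "MeasureValueType": "DOUBLE",
        "Time": row["timestamp"],   # epoch time as a string
        "TimeUnit": "MILLISECONDS",
    }

def write_csv(path, database="token_prices", table="prices"):
    with open(path, newline="") as f:
        batch = []
        for row in csv.DictReader(f):
            batch.append(to_record(row))
            if len(batch) == 100:   # quota: 100 records per WriteRecords call
                client.write_records(DatabaseName=database, TableName=table, Records=batch)
                batch = []
        if batch:
            client.write_records(DatabaseName=database, TableName=table, Records=batch)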
I asked AWS support, here is the answer:
Unfortunately, "Records per WriteRecords API request" is a non-configurable limit. This limitation is already noted by the development team.
However, to get any additional insights to help with your load, I have reached out to my internal team. I will get back to you as soon as I have an update from the team.
EDIT:
I had a new answer from AWS support:
The team suggested that a new feature called batch load is being released tentatively at the end of February (2023). This feature will allow customers to ingest data from CSV files directly into Timestream in bulk.
This is more of a design and architecture question, so please bear with me.
Requirement - Let's consider we have 2 types of flat files (CSV, XLS, or TXT) for the two DB tables below.
Doctor
name
degree
...
Patient
name
doctorId
age
...
Each file contains the data of its respective table (volume: 2-3 million records per file).
We have to load the data from these two files into the Doctor and Patient tables of the warehouse, after some validations like null values, foreign keys, duplicates in Doctor, etc.
If any invalid data is identified, I will need to attach the reasons (like null value, duplicate value) so that I can evaluate the invalid data; a rough sketch of what I mean is shown below.
Note that the expectation is to load 1 million records in a span of ~1-2 minutes.
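To make the validation requirement concrete, here is a rough sketch in pandas of attaching reject reasons (the file names, and a doctorId column on Doctor, are assumptions for illustration only):

# Hypothetical validation step: flag nulls, duplicates, and FK violations with reasons.
import pandas as pd

doctors = pd.read_csv("doctor.csv")     # assumed columns: doctorId, name, degree, ...
patients = pd.read_csv("patient.csv")   # assumed columns: name, doctorId, age, ...

def validate_doctors(df):
    reasons = pd.Series("", index=df.index)
    reasons[df["name"].isna()] += "null name;"
    reasons[df.duplicated(subset=["doctorId"], keep="first")] += "duplicate doctorId;"
    df = df.assign(reject_reason=reasons)
    return df[df["reject_reason"] == ""], df[df["reject_reason"] != ""]

def validate_patients(df, valid_doctor_ids):
    reasons = pd.Series("", index=df.index)
    reasons[df["name"].isna()] += "null name;"
    reasons[~df["doctorId"].isin(valid_doctor_ids)] += "unknown doctorId (FK);"
    df = df.assign(reject_reason=reasons)
    return df[df["reject_reason"] == ""], df[df["reject_reason"] != ""]

good_doctors, bad_doctors = validate_doctors(doctors)
good_patients, bad_patients = validate_patients(patients, set(good_doctors["doctorId"]))
# good_* frames go to the warehouse load; bad_* frames carry a reject_reason column.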
My Designed workflow (so far)
After reading several articles and blogs, I have decided to go with AWS Glue & DataBrew for my source-to-target ETL, along with custom validations.
Please find my design and architecture below. Suggest and guide me on it.
Is there any scope for parallel or partition-based processing to speed up validating and loading the data? Your help is going to really help me and others who run into this type of case.
Thanks a ton.
I'm not sure if you're asking the same thing, but your architecture should follow these guidelines:
1. Files land in the S3 raw bucket.
2. Your Lambda triggers once a file is put in the S3 bucket.
3. The Lambda trigger invokes a Step Function which contains the following steps:
3.1) Data governance (AWS Deequ) checks all validations.
3.2) Perform the transformations.
4. Move the processed data to your processed bucket, where you keep data for reconciliation and other processes.
5. Finally, your data moves to your production bucket, where you keep only the required processed data, not all of it.
Note:
Partitioning helps to achieve parallel processing in Glue.
Your Lambda and ETL logic should be stateless so that a rerun will not corrupt your data.
All synchronous calls should use proper retries with exponential backoff and jitter (a minimal sketch follows below).
Log every step into a DynamoDB table so you can analyse the logs and it helps with reconciliation.
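On the backoff point, here is a minimal sketch of a retry helper with exponential backoff and full jitter (the helper name and limits are illustrative, not part of the original answer):

import random
import time

def call_with_backoff(fn, max_attempts=5, base_delay=0.5, max_delay=30.0):
    # Retry fn() with exponentially growing, jittered sleeps between attempts.
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))  # full jitter

# Usage (hypothetical call): call_with_backoff(lambda: s3_client.put_object(Bucket="...", Key="...", Body=b""))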
I have many server instances in Google Cloud that are running all day collecting data from various edge devices. I have around 100 servers, each collecting from around 50 devices.
The data for each source is updated at different frequencies: some could be per second, some per minute, 5-minute, half-hourly, hourly, 4-hourly, half-daily, up to daily.
The data is usually 2 columns: date + time in one column and the data point in the other. It could be temperature data, soil moisture data, wind direction, that kind of thing.
Right now, each server collects the data into Python pandas data frames and updates them live, then at the end of the day they're saved into CSV files (2-column CSV files). Each device that collects data has its own CSV file. I don't combine them into one big dataframe because there would be a lot of empty cells in it due to the difference in update frequencies.
It could look like this:
DateTime Device-19-Location-27-Temperature
01-June-2020 1:00p.m. 21.4
01-June-2020 1:01p.m. 21.5
....
When I need to access the data, I have to SSH into the servers one by one and download the files into my computer and work with the data after that.
My knowledge of databases is close to none, so I have been doing it this way.
My first question is: is storing them in separate CSV files the best way? I chose to do this because of the different update frequencies.
And my second question is: is there a centralised location or database on Google Cloud or elsewhere where I can store all these files, and access and update them using some kind of Python API, so that I only have to access a single location to get all my data?
Based on your description, I think a good option for you could be a NoSQL database; you could check these options: Cloud Firestore or Firebase Realtime Database.
With regard to your other question about storing files, accessing them, and updating them using some kind of Python API: on GCP an option could be Cloud Storage buckets. You can store your files there, download them to work with them, and upload the new version when you finish working with them using the Client Libraries. This option is not for working with the files directly in the buckets (on the fly).
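For example, a rough sketch with the google-cloud-storage client library (the bucket, object, and file names are placeholders):

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-sensor-data")

# Upload today's per-device CSV from a collector server.
blob = bucket.blob("device-19/2020-06-01.csv")
blob.upload_from_filename("device-19-temperature.csv")

# Later, from your workstation, pull the same file down to work on it locally.
blob.download_to_filename("local-copy.csv")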
I am trying to build a big aggregated table with Google's tools, but I am a bit lost on how to do it.
Here is what I would like to create: I have a big table in BigQuery. It's updated daily with about 1.2M events for every user of the application. I would like to have an auto-updating aggregate table (updated once every day) built upon that, with all user data broken down by userID. But how do I continuously update the data inside of it?
I read a bit about Firebase and BigQuery, but since they are very new to me I can't figure out whether this is possible to do serverlessly.
I know how to do it with a Jenkins process that queries the big events table for the last day, gets all userIDs, joins with the existing aggregate values for those userIDs, takes whatever has changed, and deletes from the aggregate in order to insert the updated data (in Python).
The issue is I want to do this entirely within the Google ecosystem. Is Firebase able to do that? Is BigQuery? How? With what tools? Could this be solved using the serverless functions available?
I am more familiar with Redshift.
You can use a pretty new BigQuery feature called scheduled queries.
I use it to create rollup tables. If you need more custom stuff, you can use Cloud Scheduler to call any Google product that can be triggered by an HTTP request, such as Cloud Functions, Cloud Run, or App Engine.
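For example, here is a rough sketch of the kind of rollup MERGE you could paste into a scheduled query, run here through the BigQuery Python client; the dataset, table, and column names are assumptions:

from google.cloud import bigquery

client = bigquery.Client()

# Fold yesterday's events into a per-user aggregate table.
rollup_sql = """
MERGE `my_project.analytics.user_daily_aggregate` AS agg
USING (
  SELECT userID, COUNT(*) AS events, MAX(event_time) AS last_seen
  FROM `my_project.analytics.events`
  WHERE DATE(event_time) = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
  GROUP BY userID
) AS day
ON agg.userID = day.userID
WHEN MATCHED THEN
  UPDATE SET events = agg.events + day.events, last_seen = day.last_seen
WHEN NOT MATCHED THEN
  INSERT (userID, events, last_seen) VALUES (day.userID, day.events, day.last_seen)
"""

client.query(rollup_sql).result()  # waits for the job to finish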
I am facing a couple of issues in figuring out what is what; in spite of the humongous documentation, I am unable to figure out these issues:
1. Which report type should be used to get the campaign-level totals? I am trying to get the data with the following headers:
campaign_id | campaign_name | Clicks | Impressions | Cost | Conversions
2. I have tried to use "CAMPAIGN_PERFORMANCE_REPORT", but I get information broken up at a keyword level, whereas I am trying to pull the data at a campaign level.
3. I also need to push the data to a database. In the API documentation, I get samples that will either print the results on my screen or create a file on my machine. Is there a way I can get the data as JSON to push it to the database?
4. I have 7 accounts under my MCC account as of now, and the number will increase in the coming days. I don't want to manually hard-code the client customer IDs into my code, as new accounts will be created. Is there a way I can get the list of client customer IDs which are under my MCC account?
I am trying to get this data using Python as my code base and AdWords API v201710.
To retrieve campaign performance data you need to run a campaign_performance_report. Follow this link to view all available columns for Campaign performance report.
The campaign performance report does not include stats aggregated at a keyword level. Are you using AWQL to pull your report?
Can you paste your code here? I find it odd that you are getting keyword-level data.
Run this python example code to get campaign data (you should definitely not be getting keyword level data with this example code).
Firstly, the Google AdWords API only returns report data in the following file formats: CSVFOREXCEL, CSV, TSV, XML, GZIPPED_CSV, GZIPPED_XML. Unfortunately, JSON is not supported for your use case. I would recommend GZIPPED_CSV and setting the following properties to true:
skipReportHeader
skipColumnHeader
skipReportSummary
This will simply skip all headers, report titles, and totals from the report, making it very simple to upsert the data into a table. A rough example of downloading such a report is shown below.
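Here is a hedged sketch with the googleads Python client and AWQL; the API version is taken from your question and the file path is a placeholder:

from googleads import adwords

client = adwords.AdWordsClient.LoadFromStorage("googleads.yaml")
report_downloader = client.GetReportDownloader(version="v201710")

report_query = (
    "SELECT CampaignId, CampaignName, Clicks, Impressions, Cost, Conversions "
    "FROM CAMPAIGN_PERFORMANCE_REPORT DURING LAST_30_DAYS")

with open("campaign_report.csv.gz", "wb") as output_file:
    report_downloader.DownloadReportWithAwql(
        report_query, "GZIPPED_CSV", output_file,
        skip_report_header=True,     # drop the report title line
        skip_column_header=True,     # drop the column names
        skip_report_summary=True)    # drop the totals row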
It is not possible to enter an MCC ID and expect the API to fetch a report for all client accounts. Each API report request contains the client ID, so you are required to create an array of all client IDs and then iterate through each ID. If you are using the client library (recommended) then you can simply set the clientID within the session, i.e. session.setClientCustomerId("xxx");
To automate this, use the ManagedCustomerService to retrieve all clientIDs and then iterate through them, so you would not need to hard-code each ClientID. Google has created a handy Python file which returns the account hierarchy including the child account IDs (click here). A rough sketch follows.
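Something along these lines, again with the googleads Python client (the version is an assumption, and field access may differ slightly between client library versions):

from googleads import adwords

client = adwords.AdWordsClient.LoadFromStorage("googleads.yaml")
managed_customer_service = client.GetService("ManagedCustomerService", version="v201710")

# List the child accounts under the MCC.
page = managed_customer_service.get({"fields": ["CustomerId", "Name"]})
child_ids = [entry["customerId"] for entry in page["entries"]]

for customer_id in child_ids:
    client.SetClientCustomerId(customer_id)
    # ...request the campaign performance report for this account here...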
Lastly, based on your question I assume you are attempting to run an ETL process. Google has an open-source AdWords extractor which I highly recommend.
Folks,
Retrieving all items from a DynamoDB table, I would like to replace the scan operation with a query.
Currently I am pulling in all the table's data via the following (python):
from boto.dynamodb2.table import Table  # assuming the boto (v2) dynamodb2 Table API

drivertable = Table(url['dbname'])
all_drivers = []
all_drivers_query = drivertable.scan()
for x in all_drivers_query:
    all_drivers.append(x['number'])
How would I change this to use the query API?
Thanks!
There is no way to query and get the entire contents of the table. As of right now, you have a few options if you want to get all of your data out of DynamoDB, and all of them involve actually reading the data out of DynamoDB:
Scan the table. It can be done faster, at the expense of using much more read capacity, by using a parallel scan (see the sketch after this list).
Export your data using AWS Data Pipeline. You can configure the export job for where and how it should store your data.
Use one of the AWS event platforms for new data and denormalize it. For all new data you can get a time-ordered stream of all updates to the table from DynamoDB Streams, or process the events using AWS Lambda.
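Here is a rough sketch of a parallel scan with boto3 (the table name and segment count are placeholders):

import boto3
from concurrent.futures import ThreadPoolExecutor

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("drivers")
TOTAL_SEGMENTS = 4

def scan_segment(segment):
    # Scan one segment of the table, following pagination to the end.
    items, start_key = [], None
    while True:
        kwargs = {"Segment": segment, "TotalSegments": TOTAL_SEGMENTS}
        if start_key:
            kwargs["ExclusiveStartKey"] = start_key
        page = table.scan(**kwargs)
        items.extend(page["Items"])
        start_key = page.get("LastEvaluatedKey")
        if not start_key:
            return items

with ThreadPoolExecutor(max_workers=TOTAL_SEGMENTS) as pool:
    all_items = [item for seg in pool.map(scan_segment, range(TOTAL_SEGMENTS)) for item in seg]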
You can't query an entire table. Query is used to retrieve a set of items by supplying a hash key (part of the composite hash-range primary key of the table).
One cannot use Query without knowing the hash keys.
EDIT: a bounty was added to this old question, which asks:
How do I get a list of hashes from DynamoDB?
Well, as of Dec 2014 you still can't ask, via a single API call, for all the hash keys of a table.
Even if you add a GSI, you still can't get a DISTINCT hash key count.
The way I would solve this is with de-normalization: keep another table with no range key and put every hash key there alongside the writes to the main table. This adds housekeeping overhead at the application level (mainly when deleting), but solves the problem you asked about. An illustrative sketch follows.
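Something like this with boto3 (the table and attribute names are made up for illustration):

import boto3

dynamodb = boto3.resource("dynamodb")
main_table = dynamodb.Table("drivers")
keys_table = dynamodb.Table("driver-hash-keys")   # hash key only, no range key

def put_driver(driver_id, item):
    # Write the item to the main table and record its hash key in the side table.
    main_table.put_item(Item=dict(item, driver_id=driver_id))
    keys_table.put_item(Item={"driver_id": driver_id})

def all_driver_ids():
    # Scanning the small side table is much cheaper than scanning the main table.
    ids, start_key = [], None
    while True:
        kwargs = {"ExclusiveStartKey": start_key} if start_key else {}
        page = keys_table.scan(**kwargs)
        ids.extend(i["driver_id"] for i in page["Items"])
        start_key = page.get("LastEvaluatedKey")
        if not start_key:
            return ids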