I'm build a solution Match Service where receive data from a third party provider from MQTT server. This data is a realtime data. We save this data in RDS Cluster.
Our users can create in another service a filter called Strateg, we send a cron every 5 minutes to this service and all records in database are send to Kafka topic to be processed in Match Service.
My design is based on events, so each new Strategy record in topic, Match Service performs a query in database for check if have any Match that active the Strategy threshold. If the threshold is passed, it sends out an new message to broker.
The API processes about 10k Strategy in each job, it's taking timing (about 250s for each job).
So my question is if there is a better way to design this system? I was thinking of adding a redis-layer, to avoid database transactions.
All suggestions welcome!
Think long and hard about your relational data store. If you really need it to be relational, then it may absolutely make sense, but if not, a relational database is often a horrible place to dump things like time-series and IoT output. It's a great place to put normalized, structured data for reporting, but a lousy dump/load location and real-time matching.
Look more at something like AWS RedShift, ElasticSearch, or some other no-sql solution that can ingest and match things at orders of magnitude higher scale.
Related
I joined as a junior data engineer at a startup and I'm working on setting up a data warehouse for BI/visualization. I wanted to get an idea of approaches for the extraction/loading part as the company is also new to data engineering.
The company is thinking of going with Google BigQuery for warehousing. The main data source is currently a single OLTP PostgreSQL database hosted on AWS RDS. The database is about 50 GB for now with nearly a hundred tables.
I was initially thinking of using Stitch to integrate directly with BigQuery but since the team is shifting the RDS instance to a private subnet, it would not be possible to access using third party tools which would require a publicly accessible URL (would it?).
How would I go about it? I am still pretty new to data engineering so wanted some advice. I was thinking about using:
RDS -> Lambda/VM with Python extraction/load script -> BigQuery upload using API
But how would I account for changing row values e.g. a customer's status changes in a table. Would BigQuery automatically handle such changes? Plus, I would want to set up regular daily data transfers. For this, I think a cron job can be set up with the Python script to transfer data but would this be a costly approach considering that there are a bunch of large tables (extraction, conversion to dataframe/CSV then uploading to BQ)? As the data size increases, I would need to upsert data instead of overwriting tables. Can BigQuery or other warehouse solutions like Redshift handle this? My main factors to consider for a solution are mostly cost, time to set up and data loading durations.
This question is more on architecture and libs, than on implementation.
I am currently working at project, which requires a local long-term cache storage (updated once a day) at client kept in sync with a remote db at server. For client side sqlite has been chosen as a lightweight approach and postgresql as feature rich db at server. Native replication mechanisms of postgres are no-opt cause I need to keep client really lightweight and free of relying on external components like db servers.
The implementation language would be Python. Now I'm looking at ORMs like SQLAlchemy, but haven't worked with any before.
Does SQLAlchemy have any tools to keep sqlite and postgres dbs in sync?
If not, are there any other Python libraries which have such tools?
Any ideas about how should the architecture look like, if the task must be solved "by hand"?
Added:
It's like telemetry, cause client would have internet connection only for approximately 20 minutes a day
So, the main question is about architecure of such a system
It doesn't usually fall within the tasks of an ORM to sync data between databases, so you will likely have to implement it yourself. I don't know of any solution that will handle syncing for you given your choice of databases.
There are a couple important design choices to consider:
how do you figure out what data changed ( i.e. inserted, updated or deleted )
what is the most efficient way to package the change-log
will you have to deal with conflicts ? and how will you do that.
The most efficient way to figure out what changed is to have the database tell you that directly. Bottled water can offer some inspiration in this regard. The idea is to tap into the event log postgres would use for replication. You will need something like Kafka to keep track of what each of your clients already knows. This will allow you to optimize your server for writes, as you won't have clients querying trying to figure out what changed since they were last online.
The same can be achieved on the sqlight end with event callbacks, you'll just have to trade some storage space on the client to retain the changes to be sent to the server. If that sounds like too much infrastructure for your needs, it's something that you can easily implement with SQL and pooling as well, but I would still think of it as an event log, and consider how it's implemented a detail - possibly allowing for a more efficient implementation lather on.
The best way to structure and package your change log will depend on your applications requirements, available band-with, etc. You could use standard formats such as json, compress and encrypt if needed.
It will be much simpler to design your application as such to avoid conflicts, and possibly flow data in a single direction, or partition your data so that it always flows in a single direction for a specific partition.
One final taught is that with such an architecture you would be getting incremental updates, some of which might be missed for unplanned reasons ( system failure, bugs, dropped messages, etc ). You could have some built in heuristic to check that your data matches, like at least checking the number of records on each side, with some way to recover such a fault, at a minimal a way to manually re-fetch the data from the authoritative source, i.e. if the server is authoritative, the client should be able to discard it's data and re-fetch it. You might need such a mechanism anyway for cases wen the client is reinstalled, etc.
I am in the process of building my first SNMP application using Django, MySQL, Python, and Apache. It will monitor a few thousand devices that will have anyhere from 5-30 OIDS pulled from each device every 1-5 minutes.
I am wondering what is the best way to store data of this type?
It would need to be something robost.
Open to SQL or NoSQL
No duplicate information (This could be easily accomplished by just storing the data every poll for every device but the constrains are it needs to be kept lean. So only storge of unique data)
Schema of data should either be dynamic or somehow expandable.
I have truly run into the problem of scaling versus web development. Never thought this day would come!
I think the best option to store data like so is rrdtool http://oss.oetiker.ch/rrdtool/
You can create a separate rrd file for each OID per device.
Introduction
I am working on a GPS Listener, this is a service build on twisted python, this app receive at least 100 connections from gps devices, and it is working without issues, each GPS send data each 5 seconds, containing positions. ( the next week must be at least 200 gps devices connected )
Database
I am using a unique postgresql connection, this connection is shared between all gps devices connected for save and store information, postgresql is using pgbouncer as pooler
Server
I am using a small pc as server, and I need to find a way to have a high availability application with out loosing data
Problem
According with my high traffic on my app, I am having issues with memory data after 30 minutes start to appear as no saved, however queries are being executed on postgres ( I have checked that on last activity )
Fake Solution
I have amke a script that restart my app, postgres ang pgbouncer, however this is a wrong solution, because each time that I restart my app, gps get disconnected, and must to reconnected again
Posible Solution
I am thinking on a high availability solution based on a Data Layer, where each time when database have to be restarted or something happened, a txt file store data from gps devices.
For get it, I am thing on a no unique connection, I am thinking on a simple connection each time one data must be saved, and then test database, like a pooler, and then if database connection is wrong, the txt file store it, until database is ok again, and the other process read txt file and send info to database
Question
Since I am thinking on a app data pooler and a single connection each time when this data must be saved for try to no lost data, I want to know
Is ok making single connection each time that data is saved for this
kind of app, knowing that connections will be done more than 100 times
each 5 seconds?
As I said, my question is too simple, which one is the right way on working with db connections on a high traffic app? single connections per query or shared unique connection for all app.
The reason on looking this single question, is looking for the right way on working with db connections considering memory resources.
I am not looking for solve postgresql issues or performance, just to know the right way on working with this kind of applications. And that is the reason on give as much of possible about my application
Note
One more thing,I have seen one vote to close this question, that is related to no clear question, when the question is titled with the word "question" and was marked on italic, now I have marked as gray for notice people that dont read the word "question"
Thanks a lot
Databases do not just lose data willy-nilly. Not losing data is pretty much number one in their job description. If it seems to be losing data, you must be misusing transactions in your application. Figure out what you are doing wrong and fix it.
Making and breaking a connection between your app and pgbouncer for each transaction is not good for performance, but is not terrible either; and if that is what helps you fix your transaction boundaries then do that.
I've been using PostgreSQL for the longest time. All of my data lives inside Postgres. I've recently looked into redis and it has a lot of powerful features that would otherwise take a couple of lines in Django (python) to do. Redis data is persistent as long the machine it's running on doesn't go down and you can configure it to write out the data it's storing to disk every 1000 keys or every 5 minutes or so depending on your choice.
Redis would make a great cache and it would certainly replace a lot of functions I have written in python (up voting a user's post, viewing their friends list etc...). But my concern is, all of this data would some how need to be translated over to postgres. I don't trust storing this data in redis. I see redis as a temporary storage solution for quick retrieval of information. It's extremely fast and this far outweighs doing repetitive queries against postgres.
I'm assuming the only way I could technically write the redis data to the database is to save() whatever I get from the 'get' query from redis to the postgres database through Django.
That's the only solution I could think of. Do you know of any other solutions to this problem?
Redis is increasingly used as a caching layer, much like a more sophisticated memcached, and is very useful in this role. You usually use Redis as a write-through cache for data you want to be durable, and write-back for data you might want to accumulate then batch write (where you can afford to lose recent data).
PostgreSQL's LISTEN and NOTIFY system is very useful for doing selective cache invalidation, letting you purge records from Redis when they're updated in PostgreSQL.
For combining it with PostgreSQL, you will find the Redis foreign data wrapper provider that Andrew Dunstain and Dave Page are working on very interesting.
I'm not aware of any tool that makes Redis into a transparent write-back cache for PostgreSQL. Their data models are probably too different for this to work well. Usually you write changes to PostgreSQL and invalidate their Redis cache entries using listen/notify to a cache manager worker, or you queue changes in Redis then have your app read them out and write them into Pg in chunks.
Redis is persistent if configured to be so, both through snapshots and a kind of WAL called AOF. Loads of people use it as a primary datastore.
https://redis.io/topics/persistence
If one is referring to the greater world of Redis compatible (resp protocol) datastores, many are not limited to in-memory storage:
https://keydb.dev/
http://ssdb.io/
and many more...