I am in the process of building my first SNMP application using Django, MySQL, Python, and Apache. It will monitor a few thousand devices that will have anywhere from 5-30 OIDs pulled from each device every 1-5 minutes.
I am wondering what is the best way to store data of this type?
It would need to be something robust.
Open to SQL or NoSQL
No duplicate information (this could easily be accomplished by just storing the data on every poll for every device, but the constraint is that it needs to be kept lean, so only unique data should be stored)
The data schema should be either dynamic or somehow expandable.
I have truly run into the problem of scaling versus web development. Never thought this day would come!
I think the best option for storing data like this is RRDtool: http://oss.oetiker.ch/rrdtool/
You can create a separate RRD file for each OID per device.
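A minimal sketch of that layout with the rrdtool Python bindings (the step, heartbeat, and retention values below are only illustrative, not recommendations):

```python
# One RRD file per device/OID pair, assuming the python-rrdtool bindings are installed.
import os
import rrdtool

def rrd_path(device, oid, base_dir="/var/lib/snmp-rrd"):
    # e.g. /var/lib/snmp-rrd/device123/1.3.6.1.2.1.1.3.0.rrd
    return os.path.join(base_dir, device, f"{oid}.rrd")

def ensure_rrd(path, step=60):
    if not os.path.exists(path):
        os.makedirs(os.path.dirname(path), exist_ok=True)
        rrdtool.create(
            path,
            "--step", str(step),               # expected polling interval in seconds
            f"DS:value:GAUGE:{step * 2}:U:U",  # one data source, heartbeat = 2 * step
            "RRA:AVERAGE:0.5:1:10080",         # per-step averages, about a week at 60 s
            "RRA:AVERAGE:0.5:60:8760",         # hourly averages, about a year
        )

def store_sample(device, oid, value):
    path = rrd_path(device, oid)
    ensure_rrd(path)
    rrdtool.update(path, f"N:{value}")         # "N" = use the current timestamp
```

Because each RRD file is fixed-size and consolidates old samples automatically, storage stays lean no matter how long you keep polling, which also covers the "keep it lean" requirement.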
Related
I joined as a junior data engineer at a startup and I'm working on setting up a data warehouse for BI/visualization. I wanted to get an idea of approaches for the extraction/loading part as the company is also new to data engineering.
The company is thinking of going with Google BigQuery for warehousing. The main data source is currently a single OLTP PostgreSQL database hosted on AWS RDS. The database is about 50 GB for now with nearly a hundred tables.
I was initially thinking of using Stitch to integrate directly with BigQuery, but since the team is moving the RDS instance to a private subnet, it would no longer be accessible to third-party tools, which require a publicly accessible URL (would it?).
How would I go about it? I am still pretty new to data engineering so wanted some advice. I was thinking about using:
RDS -> Lambda/VM with Python extraction/load script -> BigQuery upload using API
But how would I account for changing row values, e.g. a customer's status changing in a table? Would BigQuery automatically handle such changes? Plus, I would want to set up regular daily data transfers. For this, I think a cron job can be set up with the Python script to transfer the data, but would this be a costly approach considering that there are a bunch of large tables (extraction, conversion to a dataframe/CSV, then uploading to BQ)? As the data size increases, I would need to upsert data instead of overwriting tables. Can BigQuery or other warehouse solutions like Redshift handle this? My main factors for choosing a solution are cost, setup time, and data loading duration.
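For reference, this is roughly the incremental load I have in mind, assuming each table has an id primary key and an updated_at timestamp column (both assumptions on my part), using psycopg2, pandas, and the google-cloud-bigquery client; the MERGE at the end is what would take care of changed rows such as a customer's status update:

```python
# Sketch of an incremental extract/load with an upsert into BigQuery.
# Requires: psycopg2-binary, pandas, google-cloud-bigquery (with pyarrow).
import pandas as pd
import psycopg2
from google.cloud import bigquery

def sync_table(table, last_run_ts, pg_dsn, bq_dataset="analytics"):
    client = bigquery.Client()
    staging = f"{bq_dataset}.stg_{table}"
    target = f"{bq_dataset}.{table}"

    # 1. Extract only the rows that changed since the last run.
    with psycopg2.connect(pg_dsn) as conn:
        df = pd.read_sql(f"SELECT * FROM {table} WHERE updated_at > %s",
                         conn, params=(last_run_ts,))
    if df.empty:
        return

    # 2. Load the changed rows into a staging table (overwritten on every run).
    job_config = bigquery.LoadJobConfig(write_disposition="WRITE_TRUNCATE")
    client.load_table_from_dataframe(df, staging, job_config=job_config).result()

    # 3. MERGE staging into the target: update existing ids, insert new ones.
    set_clause = ", ".join(f"t.{c} = s.{c}" for c in df.columns if c != "id")
    client.query(f"""
        MERGE `{target}` t
        USING `{staging}` s
        ON t.id = s.id
        WHEN MATCHED THEN UPDATE SET {set_clause}
        WHEN NOT MATCHED THEN INSERT ROW
    """).result()
```

The idea would be to run this daily per table from the Lambda/VM in the pipeline above.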
I'm building a Match Service that receives real-time data from a third-party provider over an MQTT server. We save this data in an RDS cluster.
In another service, our users can create a filter called a Strategy. Every 5 minutes a cron triggers that service, and all Strategy records in the database are sent to a Kafka topic to be processed by the Match Service.
My design is event-based, so for each new Strategy record on the topic, the Match Service runs a query against the database to check whether any Match crosses the Strategy's threshold. If the threshold is passed, it sends a new message to the broker.
The service processes about 10k Strategies in each job, which is slow (about 250 s per job).
So my question is: is there a better way to design this system? I was thinking of adding a Redis layer to avoid database transactions.
All suggestions welcome!
Think long and hard about your relational data store. If you really need it to be relational, then it may absolutely make sense, but if not, a relational database is often a horrible place to dump things like time-series and IoT output. It's a great place to put normalized, structured data for reporting, but a lousy place for dump/load work and real-time matching.
Look more at something like AWS Redshift, Elasticsearch, or some other NoSQL solution that can ingest and match things at orders of magnitude higher scale.
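If you do try the Redis layer you mentioned, one common shape is to keep only the latest reading per device/metric in Redis so that evaluating a Strategy becomes an in-memory comparison instead of a SQL query. A rough sketch (the key layout and threshold semantics are assumptions, not your real schema):

```python
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def ingest_match(device_id, metric, value):
    # Called from the MQTT consumer: overwrite the latest reading for this metric.
    r.hset(f"match:{device_id}", metric, value)

def evaluate_strategy(strategy):
    # strategy = {"device_id": "...", "metric": "...", "threshold": 42.0}  (illustrative)
    raw = r.hget(f"match:{strategy['device_id']}", strategy["metric"])
    return raw is not None and float(raw) > strategy["threshold"]
```

Ten thousand HGETs (or one pipelined batch of them) per job should be far cheaper than ten thousand SQL queries, and the same idea applies if you index the latest values in Elasticsearch instead.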
I am creating a machine learning app which should save a number to a database locally and frequently.
These values are connected, which basically means that I want to frequently update a time series by appending a number to the list.
An ideal case would be being able to save key-value pairs where the key is the name of the array (for example train_loss) and the value is the corresponding time series.
My first idea was to leverage Redis, but as far as I know Redis data is only kept in RAM? What I want to achieve is saving to disk after every log, or perhaps after every couple of logs.
I need the data saved locally since it will be consumed by another app (in JavaScript), so some JSON-like format would be nice.
Using JSON files (and Python's json package) is an option, but I believe it would result in an I/O bottleneck because of the frequent updates.
I am basically trying to create a clone of a web app like Tensorboard.
A technique we use in the backend of a hosted application for frequently hit read/post APIs is to write to Redis and the DB at the same time; during the read operation we check whether the key is available in Redis, and if it isn't, we read it from the DB, update Redis, and then serve it.
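A rough sketch of that pattern for this use case, with a JSON-lines file playing the role of the DB: every value is appended to Redis and to the file at the same time, and a read falls back to the file (and repopulates Redis) if the key is missing. The file layout and key names are assumptions:

```python
import json
import os
import redis

r = redis.Redis(decode_responses=True)
LOG_DIR = "metric_logs"                      # one .jsonl file per series (assumption)

def log_value(series, value):
    r.rpush(f"metrics:{series}", value)      # write to Redis...
    os.makedirs(LOG_DIR, exist_ok=True)
    with open(os.path.join(LOG_DIR, f"{series}.jsonl"), "a") as f:
        f.write(json.dumps(value) + "\n")    # ...and to disk at the same time

def read_series(series):
    key = f"metrics:{series}"
    values = [float(v) for v in r.lrange(key, 0, -1)]
    if values:
        return values
    path = os.path.join(LOG_DIR, f"{series}.jsonl")
    if not os.path.exists(path):
        return []
    with open(path) as f:                    # cache miss: reload from the file,
        values = [json.loads(line) for line in f]
    if values:
        r.rpush(key, *values)                # repopulate Redis,
    return values                            # then serve it
```

Appending one line per logged value keeps the disk I/O small, and the JavaScript app can read either Redis or the .jsonl files directly. Turning on Redis AOF persistence (appendonly yes) is another way to get the on-disk copy.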
I'm working on a personal project to segment some data from the Sendinblue API (CRM service). Basically, what I'm trying to achieve is to generate a new score attribute for each user based on their emailing behavior. For that purpose, the process I've planned is as follows:
Get data from the API
Store in database
Analyze and segment the data with Python
Create and update score attribute in Sendin every 24 hours
The API has a rate limit of 400 requests per minute, and we are talking about 100k records right now, which means I have to spend about 3 hours getting all the initial data (currently I'm using concurrent.futures for multiprocessing). After that, I plan to store and update only the records that have changed. I'm wondering if this is the best way to do it and which combination of tools is better for this job.
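To make the rate limit concrete, a throttled fetch loop would look roughly like this (shown sequentially for clarity, since at 400 requests per minute the rate limit rather than concurrency is the real bottleneck; the endpoint, header, and pagination parameters are from memory, so treat them as assumptions and check them against the Sendinblue docs):

```python
import time
import requests

API_KEY = "your-api-key"             # placeholder
BASE = "https://api.sendinblue.com/v3"
MAX_PER_MINUTE = 380                 # stay safely under the 400 req/min limit

def fetch_all_contacts(page_size=500):
    contacts, offset = [], 0
    window_start, calls = time.monotonic(), 0
    while True:
        if calls >= MAX_PER_MINUTE:  # simple fixed-window throttle
            time.sleep(max(0, 60 - (time.monotonic() - window_start)))
            window_start, calls = time.monotonic(), 0
        resp = requests.get(f"{BASE}/contacts",
                            headers={"api-key": API_KEY},
                            params={"limit": page_size, "offset": offset},
                            timeout=30)
        resp.raise_for_status()
        calls += 1
        batch = resp.json().get("contacts", [])
        if not batch:
            return contacts
        contacts.extend(batch)
        offset += page_size
```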
Right now I have all my scripts in Jupyter notebooks, and I recently finished my first Django project, so I don't know whether I need a Django app for this one or whether I can just connect the notebook to a database (PostgreSQL?), and if the latter is possible, which library I need to learn to run my script every 24 hours. (I'm a beginner.) Thanks!
I don't think you need Django unless you want a web UI to view your data. Even then, you can write a web application to view your statistics with any framework/language. So I think a simpler approach is:
1. Create your Python project; the entry-point main function executes the logic to fetch data from the API. Once that's done, run the logic that analyzes the data, computes the statistics, and saves the results in the database.
2. If you can view your final results with SQL queries, you don't need to build a web application. Otherwise, you might want to build a small web application that pulls data from the database to show the statistics in charts or export them in whatever format you prefer.
3. Set up a Linux cron job to execute the Python code from #1 every 24 hours at the particular time you want. Link: https://phoenixnap.com/kb/set-up-cron-job-linux
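Putting the three steps together, the entry point can be a single main.py; the function bodies, table name, and DSN below are placeholders for your own logic, and the commented crontab line is one way to schedule it daily (step 3):

```python
# Example crontab entry:
#   0 2 * * *  /usr/bin/python3 /opt/scoring/main.py >> /var/log/scoring.log 2>&1
import psycopg2

def fetch_from_api():
    """Step 1: pull contacts/events from the Sendinblue API (placeholder)."""
    return []

def compute_scores(records):
    """Step 1 (continued): your segmentation / scoring logic (placeholder)."""
    return {}

def save_scores(scores, dsn="dbname=scoring user=scoring"):   # placeholder DSN
    """Upsert the computed scores into a hypothetical contact_scores table."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        for contact_id, score in scores.items():
            cur.execute(
                """INSERT INTO contact_scores (contact_id, score, updated_at)
                   VALUES (%s, %s, now())
                   ON CONFLICT (contact_id) DO UPDATE
                   SET score = EXCLUDED.score, updated_at = now()""",
                (contact_id, score),
            )

def main():
    records = fetch_from_api()
    scores = compute_scores(records)
    save_scores(scores)

if __name__ == "__main__":
    main()
```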
Introduction
I am working on a GPS Listener. It is a service built on Twisted (Python) that currently receives at least 100 connections from GPS devices, and it is working without issues. Each GPS device sends data every 5 seconds, containing positions. (Next week there will be at least 200 GPS devices connected.)
Database
I am using a single PostgreSQL connection, shared by all connected GPS devices, to save and store the information. PostgreSQL sits behind PgBouncer as a connection pooler.
Server
I am using a small PC as the server, and I need to find a way to make the application highly available without losing data.
Problem
Given the high traffic on my app, I am having issues: after about 30 minutes, data starts to appear as not saved, even though queries are being executed on Postgres (I have checked the last activity).
Fake Solution
I have made a script that restarts my app, Postgres, and PgBouncer. However, this is the wrong solution, because each time I restart my app the GPS devices get disconnected and have to reconnect.
Possible Solution
I am thinking of a high-availability solution based on a data layer: whenever the database has to be restarted or something goes wrong, a text file stores the data coming from the GPS devices.
To get there, instead of a single shared connection I am thinking of opening a simple connection each time a piece of data must be saved, and testing the database first (like a pooler). If the database connection fails, the text file stores the data until the database is OK again, and another process reads the text file and sends the information to the database.
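Roughly what I have in mind for that data layer is below (table, column, and file names are made up for illustration; in Twisted the blocking calls would be pushed to a thread, e.g. with deferToThread):

```python
import csv
import os
import psycopg2

SPOOL = "/var/spool/gps/positions.csv"       # fallback file when the DB is down
DSN = "dbname=gps user=gps"                  # placeholder DSN

def save_position(device_id, ts, lat, lon):
    try:
        with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
            cur.execute(
                "INSERT INTO positions (device_id, ts, lat, lon) VALUES (%s, %s, %s, %s)",
                (device_id, ts, lat, lon),
            )
    except psycopg2.OperationalError:
        os.makedirs(os.path.dirname(SPOOL), exist_ok=True)
        with open(SPOOL, "a", newline="") as f:
            csv.writer(f).writerow([device_id, ts, lat, lon])

def replay_spool():
    # Run periodically: push spooled rows back into the database, then clear the file.
    if not os.path.exists(SPOOL):
        return
    with open(SPOOL, newline="") as f, psycopg2.connect(DSN) as conn, conn.cursor() as cur:
        for device_id, ts, lat, lon in csv.reader(f):
            cur.execute(
                "INSERT INTO positions (device_id, ts, lat, lon) VALUES (%s, %s, %s, %s)",
                (device_id, ts, lat, lon),
            )
    os.remove(SPOOL)
```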
Question
Since I am thinking of an app-level data pooler and a single connection each time data must be saved (to try not to lose data), I want to know:
Is it OK to make a single connection each time data is saved for this kind of app, knowing that connections will be opened more than 100 times every 5 seconds?
As I said, my question is simple: what is the right way to work with DB connections in a high-traffic app? A single connection per query, or one shared connection for the whole app?
The reason I am focusing on this single question is that I am looking for the right way to work with DB connections while keeping memory usage in mind.
I am not looking to solve PostgreSQL issues or performance problems, just to know the right way to work with this kind of application. That is why I gave as many details as possible about my application.
Note
One more thing: I have seen one vote to close this question as unclear, even though the question sits under the word "question" and was marked in italics; I have now marked it in gray so that people who don't read the word "question" will notice it.
Thanks a lot
Databases do not just lose data willy-nilly. Not losing data is pretty much number one in their job description. If it seems to be losing data, you must be misusing transactions in your application. Figure out what you are doing wrong and fix it.
Making and breaking a connection between your app and pgbouncer for each transaction is not good for performance, but is not terrible either; and if that is what helps you fix your transaction boundaries then do that.
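If you want a middle ground between one shared connection and a brand-new connection per save, a small client-side pool in front of PgBouncer works well; a rough sketch with psycopg2 (host, port, and credentials are placeholders, 6432 being PgBouncer's default port):

```python
from contextlib import contextmanager
from psycopg2.pool import ThreadedConnectionPool

pool = ThreadedConnectionPool(1, 5, dsn="host=127.0.0.1 port=6432 dbname=gps user=gps")

@contextmanager
def transaction():
    conn = pool.getconn()
    try:
        yield conn.cursor()
        conn.commit()                 # explicit transaction boundary per save
    except Exception:
        conn.rollback()
        raise
    finally:
        pool.putconn(conn)

# Each incoming GPS packet gets its own short transaction:
# with transaction() as cur:
#     cur.execute("INSERT INTO positions (device_id, ts, lat, lon) VALUES (%s, %s, %s, %s)",
#                 (device_id, ts, lat, lon))
```

Either way, make sure every insert is actually committed (or the connection runs in autocommit mode); an open transaction that never commits is the usual cause of data that looks unsaved.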