I am creating a machine learning app that needs to save numbers to a local database frequently.
These values are related: what I really want is to frequently update a time series by appending a number to a list.
Ideally I could save key-value pairs where the key is the name of the series (for example train_loss) and the value is the corresponding time series.
My first idea was to use Redis, but as far as I know Redis only keeps its data in RAM? What I want is to save to disk after every log, or perhaps after every couple of logs.
I need the data saved locally because it will be consumed by another app (in JavaScript), so some JSON-like format would be nice.
Using JSON files (and the Python json package) is an option, but I believe frequent updates would cause an I/O bottleneck.
I am basically trying to create a clone of a web app like TensorBoard.
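One way around rewriting a whole JSON file on every update is to make the file append-only: write one small JSON object per log (JSON Lines) and only fsync every few logs, so each write is a cheap append and the JavaScript consumer can rebuild the series by reading the file line by line. This is just a minimal sketch of that idea; the class name, file path, and flush interval are made up for illustration.

```python
import json
import os

class MetricLogger:
    """Append-only JSON Lines logger: one small JSON object per log call."""

    def __init__(self, path="metrics.jsonl", flush_every=10):
        self.path = path
        self.flush_every = flush_every          # flush to disk every N logs
        self._count = 0
        self._fh = open(self.path, "a", encoding="utf-8")

    def log(self, key, value):
        # e.g. log("train_loss", 0.42) appends {"key": "train_loss", "value": 0.42}
        self._fh.write(json.dumps({"key": key, "value": value}) + "\n")
        self._count += 1
        if self._count % self.flush_every == 0:
            self._fh.flush()
            os.fsync(self._fh.fileno())         # force the OS to write to disk

    def close(self):
        self._fh.flush()
        self._fh.close()
```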
A technique we use in the backend of a hosted application for frequently hit read/post APIs is to write to Redis and to the database at the same time. On a read we check whether the key is available in Redis; if it isn't, we read it from the database, write it back to Redis, and then serve it.
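A rough sketch of that pattern using redis-py; the db_read/db_write callables stand in for whatever database layer you have, and the one-hour TTL is an arbitrary choice:

```python
import json
import redis

r = redis.Redis()  # assumes a local Redis instance

def write_record(key, value, db_write):
    """Write-through: persist to the database and update the cache together."""
    db_write(key, value)                        # db_write is your own DB layer (placeholder)
    r.set(key, json.dumps(value), ex=3600)      # keep the cached copy for an hour

def read_record(key, db_read):
    """Read-through: serve from Redis if present, otherwise load and backfill."""
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)
    value = db_read(key)                        # fall back to the database
    if value is not None:
        r.set(key, json.dumps(value), ex=3600)  # backfill the cache for next time
    return value
```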
Related
I'm new to Python and am trying to create a Python bot. I want an optimized way to access and modify my bot's configs per server, and I had two ideas on how/when to fetch configs from the database.
The first is what you would normally do: just fetch variables from the database (one at a time) for each command. This keeps the bot simple and minimizes unused resources.
In the second, whenever a user uses a command for the first time, the bot fetches the entire config table and stores it in a dict, from which the config is then read. Config updates also go to the dict, and every 30 minutes to an hour the values are written back to the table and the dict is emptied. The benefit is fewer SQL calls, but potentially less scalability because of unused objects in the dict.
Can someone help me decide which one is better? I don't know how Discord bots are normally built or what the convention is.
Your second approach is called caching. You're basically building a cache inside your application (the dictionary) and keeping frequently needed data there so it can be accessed quickly. It's what almost every major service (Steam, for example) does to minimize calls to the main database.
I think this is the better practice; however, it has its drawbacks.
First, from time to time you have to reconcile the cached data with what's in the original database: your bot will not have a single user, and while one user is working from the cache, another user might alter the data in the database.
Second, it is harder to implement than the first approach. You need to decide which data to cache and which data must be updated promptly, and you also need some kind of notification mechanism so caches are refreshed whenever the underlying data changes in the database.
If I were you and just wanted to mess around with bots, I would go with fetching the data from the database each time. It's easier, and it's good enough for most applications.
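If you do go with the second (cached) approach, a minimal sketch might look like the following. The load_row/save_row callables, the guild_id keying, and the 30-minute flush interval are placeholders for whatever your bot and schema actually use:

```python
import time

class ConfigCache:
    """In-memory config cache (the second approach): load a server's config row
    on first use, serve reads and writes from a dict, and periodically write
    dirty rows back to the table and empty the dict."""

    def __init__(self, load_row, save_row, flush_interval=1800):
        self.load_row = load_row              # your DB read function (placeholder)
        self.save_row = save_row              # your DB write function (placeholder)
        self.flush_interval = flush_interval  # seconds between write-backs (~30 min)
        self._cache = {}
        self._dirty = set()
        self._last_flush = time.monotonic()

    def _row(self, guild_id):
        if guild_id not in self._cache:
            self._cache[guild_id] = self.load_row(guild_id)  # one SQL call per server
        return self._cache[guild_id]

    def get(self, guild_id, key):
        return self._row(guild_id).get(key)

    def set(self, guild_id, key, value):
        self._row(guild_id)[key] = value
        self._dirty.add(guild_id)
        self._maybe_flush()

    def _maybe_flush(self):
        if time.monotonic() - self._last_flush < self.flush_interval:
            return
        for guild_id in self._dirty:
            self.save_row(guild_id, self._cache[guild_id])
        self._dirty.clear()
        self._cache.clear()                   # empty the dict, as in the question
        self._last_flush = time.monotonic()
```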
I have created a Memcached cluster with the ElastiCache service on AWS.
On every call my program sets keys with some data in the cache, and on each call to the server it updates that data. However, while testing against the cluster I found that it seems to change the node where a key is located, or to erase the key, and the moment the node changes or the key is erased I lose my previous information. Since I'm only calling one endpoint for the whole cluster, shouldn't it keep the key consistent across the cluster rather than deleting its contents or recreating the key somewhere else?
Is there any configuration parameter for a Memcached cluster that forces it not to change the node a key is stored on?
Right now I'm using the default AWS parameter group default.memcached1.4. I looked at the configuration parameters at http://docs.aws.amazon.com/AmazonElastiCache/latest/UserGuide/ParameterGroups.Memcached.html and I don't find anything that points to a way to solve this issue.
(P.S. When I point my program directly at a specific node, everything works fine.)
That is the way it's supposed to be.
The following diagram illustrates a typical Memcached and a typical Redis cluster. Memcached clusters contain from 1 to 20 nodes across which you can horizontally partition your data. […]
From http://docs.aws.amazon.com/AmazonElastiCache/latest/UserGuide/Clusters.html
The django documentation says something similar.
One excellent feature of Memcached is its ability to share a cache over multiple servers. This means you can run Memcached daemons on multiple machines, and the program will treat the group of machines as a single cache, without the need to duplicate cache values on each machine.
In other words, you cannot request data directly from any given node in the cluster. You have to let Django's cache API figure out how to retrieve the data for you.
With Redis the behaviour is the opposite: once you write to the cluster, you can query any node for the data because it is replicated to all of them, whereas in Memcached the data is sharded.
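To make the sharding concrete, here is a sketch using pymemcache's HashClient as one example of a hashing Memcached client; the node endpoints are made up. With ElastiCache you would list the node endpoints (or use its auto-discovery configuration endpoint), and every app server must use the same node list so they all hash keys the same way:

```python
from pymemcache.client.hash import HashClient

# A hashing client maps each key to exactly one node; keys are not replicated
# across the cluster, so the full node list must be identical on every client.
client = HashClient([
    ("node1.example.cache.amazonaws.com", 11211),   # hypothetical node endpoints
    ("node2.example.cache.amazonaws.com", 11211),
])

client.set("session:42", "some data")
print(client.get("session:42"))   # answered only by the node "session:42" hashed to
```

Pointing the program at a single node "works" only because every key then lives on that one node; against the cluster you have to let the client's consistent hashing pick the node for each key.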
I am in the process of building my first SNMP application using Django, MySQL, Python, and Apache. It will monitor a few thousand devices, each of which will have anywhere from 5-30 OIDs polled every 1-5 minutes.
I am wondering what the best way to store data of this type is.
It would need to be something robust.
Open to SQL or NoSQL.
No duplicate information (this could easily be achieved by just storing the data from every poll for every device, but the constraint is that it needs to be kept lean, so only unique data should be stored).
The schema of the data should either be dynamic or somehow expandable.
I have truly run into the problem of scaling versus web development. Never thought this day would come!
I think the best option for storing data like this is rrdtool: http://oss.oetiker.ch/rrdtool/
You can create a separate RRD file for each OID per device.
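For example, with the Python rrdtool bindings, creating and updating one such file might look roughly like this; the file name, step, and retention windows are only illustrative:

```python
import rrdtool

# One RRD file per device/OID pair, polled every 300 seconds (hypothetical name).
rrdtool.create(
    "device42_ifInOctets.rrd",
    "--step", "300",                  # expected polling interval in seconds
    "DS:value:GAUGE:600:U:U",         # one data source, 600s heartbeat, no bounds
    "RRA:AVERAGE:0.5:1:2016",         # one week of raw 5-minute samples
    "RRA:AVERAGE:0.5:12:1488",        # ~2 months of hourly averages
)

# At each poll, push the new sample; the file size stays fixed over time.
rrdtool.update("device42_ifInOctets.rrd", "N:12345")
```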
I have a Google App Engine app that has to do a lot of data collection; I gather millions of records per day. As I see it, there are two simple approaches to dealing with this so that the data can be analyzed:
1. Use the logging API to generate App Engine logs, then try to load these into BigQuery (or, more simply, export to CSV and do the analysis in Excel).
2. Save the data in the App Engine datastore (ndb), then download that data later / try to load it into BigQuery.
Is there any preferable method of doing this?
Thanks!
BigQuery has a new Streaming API, which they claim was designed for high-volume real-time data collection.
Advice from practice: we are currently logging 20M+ multi-event records a day via method 1 as described above. It works pretty well, except when the batch uploader is not called (normally every 5 minutes); then we need to detect this and re-run the importer.
Also, we are currently in the process of migrating to the new Streaming API, but it is not yet in production, so I can't say how reliable it is.
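For reference, with the current google-cloud-bigquery Python client a streaming insert is roughly the following sketch; the project/dataset/table id and the row fields are made up, and insert_rows_json wraps the streaming tabledata.insertAll call:

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.telemetry.events"   # hypothetical project.dataset.table

rows = [
    {"device_id": "abc123", "metric": "cpu", "value": 0.73},
    {"device_id": "abc123", "metric": "mem", "value": 0.41},
]

# insert_rows_json streams the rows into the table and returns a list of
# per-row errors; an empty list means every row was accepted.
errors = client.insert_rows_json(table_id, rows)
if errors:
    print("streaming insert failed:", errors)
```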
I've been using PostgreSQL for the longest time; all of my data lives inside Postgres. I've recently looked into Redis, and it has a lot of powerful features that would otherwise take a couple of lines of Django (Python) code to do. Redis data is persistent as long as the machine it's running on doesn't go down, and you can configure it to write its data out to disk every 1000 keys, or every 5 minutes or so, depending on your choice.
Redis would make a great cache, and it would certainly replace a lot of functions I have written in Python (upvoting a user's post, viewing their friends list, etc.). But my concern is that all of this data would somehow need to be carried over to Postgres. I don't trust storing this data in Redis; I see Redis as a temporary storage solution for quick retrieval of information. It's extremely fast, and that far outweighs running repetitive queries against Postgres.
I'm assuming the only way I could technically write the Redis data to the database is to save() whatever I get back from a Redis 'get' into the Postgres database through Django.
That's the only solution I could think of. Do you know of any other solutions to this problem?
Redis is increasingly used as a caching layer, much like a more sophisticated memcached, and is very useful in this role. You usually use Redis as a write-through cache for data you want to be durable, and write-back for data you might want to accumulate then batch write (where you can afford to lose recent data).
PostgreSQL's LISTEN and NOTIFY system is very useful for doing selective cache invalidation, letting you purge records from Redis when they're updated in PostgreSQL.
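A minimal sketch of that invalidation loop, assuming a PostgreSQL trigger that does NOTIFY cache_invalidation, '<cache key>' whenever a cached row changes; the connection settings and channel name are placeholders:

```python
import select

import psycopg2
import redis

r = redis.Redis()
conn = psycopg2.connect("dbname=app user=app")   # hypothetical connection settings
conn.autocommit = True

cur = conn.cursor()
cur.execute("LISTEN cache_invalidation;")        # a Pg trigger NOTIFYs this channel

# Cache-manager loop: whenever PostgreSQL notifies us of a change,
# drop the corresponding (now stale) entry from Redis.
while True:
    if select.select([conn], [], [], 5) == ([], [], []):
        continue                                 # timed out, nothing happened
    conn.poll()
    while conn.notifies:
        notify = conn.notifies.pop(0)
        r.delete(notify.payload)                 # payload carries the cache key
```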
For combining it with PostgreSQL, you will find the Redis foreign data wrapper that Andrew Dunstan and Dave Page are working on very interesting.
I'm not aware of any tool that makes Redis into a transparent write-back cache for PostgreSQL. Their data models are probably too different for this to work well. Usually you write changes to PostgreSQL and invalidate their Redis cache entries using listen/notify to a cache manager worker, or you queue changes in Redis then have your app read them out and write them into Pg in chunks.
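The second option (queue changes in Redis, batch-write them to Pg) can be sketched like this; the Redis list name, table, and columns are hypothetical, and note that rows already popped from Redis are lost if the Postgres write fails, which is exactly the "can afford to lose recent data" trade-off mentioned above:

```python
import json

import psycopg2
import redis

r = redis.Redis()
pg = psycopg2.connect("dbname=app user=app")      # hypothetical connection settings

def record_vote(post_id, user_id):
    """App side: push the change onto a Redis list instead of writing to Pg now."""
    r.rpush("pending_votes", json.dumps({"post_id": post_id, "user_id": user_id}))

def flush_votes(batch_size=500):
    """Worker side: drain up to batch_size queued changes and write them in one go."""
    batch = []
    for _ in range(batch_size):
        raw = r.lpop("pending_votes")
        if raw is None:
            break
        batch.append(json.loads(raw))
    if not batch:
        return
    with pg, pg.cursor() as cur:                  # commits (or rolls back) the batch
        cur.executemany(
            "INSERT INTO votes (post_id, user_id) VALUES (%s, %s)",
            [(v["post_id"], v["user_id"]) for v in batch],
        )
```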
Redis is persistent if configured to be so, both through snapshots and a kind of WAL called AOF. Loads of people use it as a primary datastore.
https://redis.io/topics/persistence
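For illustration, AOF can be enabled at runtime with CONFIG SET (or permanently in redis.conf); "appendfsync everysec" is the commonly used middle ground between durability and speed:

```python
import redis

r = redis.Redis()

# Turn on the append-only file and fsync it once per second; RDB snapshots
# can be tuned separately (e.g. "save 300 10" in redis.conf).
r.config_set("appendonly", "yes")
r.config_set("appendfsync", "everysec")
```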
If one is referring to the greater world of Redis-compatible (RESP protocol) datastores, many are not limited to in-memory storage:
https://keydb.dev/
http://ssdb.io/
and many more...