Architecture for a lot of data logging, DB or file? - python

I'm working on a Python app I want to be scalable to accommodate about 150 writes per second. That's spread out among about 50 different sources.
Is MongoDB a good candidate for this? I'm split between writing to a database and just making a log file for each source and parsing them separately.
Any other suggestions for logging a lot of data?

I would say that MongoDB is a very good fit for a logs collection, because:
MongoDB has very fast writes.
Logs are not that important, so it's okay to lose some of them in case of a server failure. So you can run MongoDB without the journaling option to avoid write overhead.
In addition, you can use sharding to increase write speed, and at the same time you can move the oldest logs into a separate collection or into the file system.
You can easily export data from the database to JSON/CSV.
Once you have everything in a database, you will be able to query the data to find the log entries you need.
So, my opinion is that MongoDB is a perfect fit for things like logs. You don't need to manage a lot of log files in the file system; MongoDB does this for you.
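If you do go with MongoDB, a minimal sketch of a logging helper with pymongo might look like the one below. The database, collection and field names are made up, and the unacknowledged write concern (w=0) is just one way to trade durability for write speed, in line with the "it's okay to lose a few log lines" point above.

# Minimal pymongo logging sketch; names and the write concern are assumptions.
from datetime import datetime, timezone

from pymongo import MongoClient
from pymongo.write_concern import WriteConcern

client = MongoClient("mongodb://localhost:27017")
db = client["logging"]

# w=0 ("fire and forget") skips write acknowledgements for extra speed,
# accepting that a few entries could be lost on failure.
logs = db.get_collection("logs", write_concern=WriteConcern(w=0))

def log_event(source_id, message, level="INFO"):
    logs.insert_one({
        "source": source_id,
        "level": level,
        "message": message,
        "ts": datetime.now(timezone.utc),
    })

# ~150 writes/second across ~50 sources is a light load for a single mongod.
log_event("source-17", "example message")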

Related

Cleaning ~/.prefect/pg_data/ when using Prefect

I'm using Prefect to automate my flows (Python scripts). Once they are running, some data gets persisted to a PostgreSQL database. The problem is that the size of pg_data rapidly gets out of hand (~20 GB), and I was wondering if there is a way to reduce the amount of data stored in pg_data when running an agent, or a way to automatically clean the directory.
Thanks in advance for your help,
best,
Christian
I assume you are running Prefect Server and you want to clean up the underlying database instance to save space? If so, there are a couple of ways you can clean up the Postgres database:
you can manually delete old records, especially logs from the flow run table using DELETE FROM in SQL,
you can do the same in an automated fashion, e.g. some users have an actual flow that runs on a schedule and purges old data from the database (a rough sketch of this is shown after this list),
alternatively, you can use the open-source pg_cron job scheduler for Postgres to schedule such DB administration tasks,
you can also do the same using GraphQL: you would need to query for flow run IDs of "old" flow runs using the flow_run query, and then execute delete_flow_run mutation,
lastly, to be more proactive, you can reduce the number of logs you generate by generally logging less (only logging what's needed) and setting the log level to a less verbose category, e.g. instead of using DEBUG logs on your agent, switching to INFO should significantly reduce the amount of space consumed by logs in the database.
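If you go the manual or scheduled DELETE route, a rough sketch is below. It assumes direct access to the Prefect Server Postgres instance via psycopg2; the table and column names ("log", "timestamp") and the retention window are assumptions, so check them against your actual schema before deleting anything.

# Hypothetical cleanup script for the Prefect Server Postgres database.
# Table/column names and the retention window are assumptions.
import psycopg2

RETENTION = "30 days"  # keep roughly a month of logs

conn = psycopg2.connect(host="localhost", dbname="prefect_server",
                        user="prefect", password="...")
try:
    with conn, conn.cursor() as cur:
        # Delete log rows older than the retention window.
        cur.execute("DELETE FROM log WHERE timestamp < now() - %s::interval",
                    (RETENTION,))
        print("deleted {} old log rows".format(cur.rowcount))
finally:
    conn.close()

Keep in mind that a plain DELETE does not immediately shrink pg_data on disk; VACUUM (or autovacuum) has to reclaim the space afterwards.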

Importing data into Google App Engine

Recently I had to import 48,000 records into Google App Engine. The stored 'tables' are 'ndb.model' types. Each of these records is checked against a couple of other 'tables' in the 'database' for integrity purposes and then written (.put()).
To do this, I uploaded a .csv file into Google Cloud Storage and processed it from there in a task queue. This processed about 10 .csv rows per second and errored after 41,000 records with an out of memory error. Splitting the .csv file into 2 sets of 24,000 records each fixed this problem.
So, my questions are:
a) is this the best way to do this?
b) is there a faster way (the next upload might be around 400,000 records)? and
c) how do I get over (or stop) the out of memory error?
Many thanks,
David
1) Have you thought about (even temporarily) upgrading your server instances?
https://cloud.google.com/appengine/docs/standard/#instance_classes
2) I don't think a 41000 row csv is enough to run out of memory, so you probably need to change your processing:
a) Break up the processing using multiple tasks, rolling your own cursor to process a couple thousand at a time, then spinning up a new task.
b) Experiment with ndb.put_multi() (a sketch of this follows below).
Sharing some code of your loop and puts might also help.
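A minimal sketch of suggestion (b), batching the writes with ndb.put_multi(), is below; the Record model and its fields are invented for the example, and the integrity checks from the question are left out.

# Sketch of batching datastore writes; "Record" and its fields are made up.
import csv

from google.appengine.ext import ndb

BATCH_SIZE = 500  # entities per put_multi() call; tune as needed

class Record(ndb.Model):
    name = ndb.StringProperty()
    value = ndb.IntegerProperty()

def import_csv(file_obj):
    batch = []
    for row in csv.reader(file_obj):
        batch.append(Record(name=row[0], value=int(row[1])))
        if len(batch) >= BATCH_SIZE:
            ndb.put_multi(batch)  # one RPC per batch instead of one per row
            batch = []
    if batch:
        ndb.put_multi(batch)

To combine this with (a), each task could be handed a row offset (or a datastore cursor) so that no single task ever holds the whole file in memory.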
The ndb in-context cache could be contributing to the memory errors. From the docs:
When executing long-running queries in background tasks, it's possible for the in-context cache to consume large amounts of memory. This is because the cache keeps a copy of every entity that is retrieved or stored in the current context. To avoid memory exceptions in long-running tasks, you can disable the cache or set a policy that excludes whichever entities are consuming the most memory.
You can prevent caching on a case by case basis by setting a context option in your ndb calls, for example
foo.put(use_cache=False)
Completely disabling caching might degrade performance if you are often using the same objects for your comparisons. If that's the case, you could flush the cache periodically to stop it getting too big.
if some_condition:
    context = ndb.get_context()
    context.clear_cache()

MySQL: How to pull large amount of Data from MySQL without choking it?

My colleague runs a script that pulls data from the db periodically. He is using the query:
'SELECT url, data FROM table LIMIT {} OFFSET {}'.format(OFFSET, PAGE * OFFSET)
We use Amazon Aurora and he has his own slave server, but every time the script runs, utilisation touches 98%+.
The table has millions of records.
Would it be better to go for a SQL dump instead of SQL queries for fetching the data?
The options that come to my mind are:
SQL dump of selective tables (not sure of the benchmark)
Federated tables based on certain references (date, ID, etc.)
Thanks
I'm making some fairly big assumptions here, but from
without choking it
I'm guessing you mean that when your colleague runs the SELECT to grab the large amount of data, the database performance drops for all other operations - presumably your primary application - while the data is being prepared for export.
You mentioned SQL dump, so I'm also assuming that this colleague will be satisfied with data that is roughly correct, i.e. it doesn't have to be up-to-the-instant, transactionally correct data. Just good enough for something like analytics work.
If those assumptions are close, your colleague and your database might benefit from
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED
This line of code should be used carefully and almost never in a line of business application but it can help people querying the live database with big queries, as long as you fully understand the implications.
To use it, simply start a transaction and put this line before any queries you run.
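From Python, that might look roughly like the sketch below (using PyMySQL; the host, credentials and paging values are placeholders, and the table/column names come from the question's query).

# Sketch of an export query under READ UNCOMMITTED with PyMySQL.
# Connection details and paging values are placeholders.
import pymysql

PAGE, LIMIT = 0, 10000

conn = pymysql.connect(host="aurora-replica-host", user="report",
                       password="...", database="mydb")
try:
    with conn.cursor() as cur:
        # SESSION scope applies the isolation level to every transaction
        # this connection opens from here on.
        cur.execute("SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED")
        cur.execute("SELECT url, data FROM `table` LIMIT %s OFFSET %s",
                    (LIMIT, PAGE * LIMIT))
        for url, data in cur:
            pass  # stream each row out to the export file
finally:
    conn.close()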
The 'choking'
What you are seeing when your colleague runs a large query is record locking. Your database engine is - quite correctly - set up to provide an accurate view of your data, at any point. So, when a large query comes along the database engine first waits for all write locks (transactions) to clear, runs the large query and holds all future write locks until the query has run.
This actually happens for all transactions, but you only really notice it for the big ones.
What READ UNCOMMITTED does
By setting the transaction isolation level to READ UNCOMMITTED, you are telling the database engine that this transaction doesn't care about write locks and to go ahead and read anyway.
This is known as a 'dirty read', in that the long-running query could well read a table with a write lock on it and will ignore the lock. The data actually read could be the data before the write transaction has completed, or a different transaction could start and modify records before this query gets to it.
The data returned from anything with READ UNCOMMITTED is not guaranteed to be correct in the ACID sense of a database engine, but for some use cases it is good enough.
What the effect is
Your large queries magically run faster and don't lock the database while they are running.
Use with caution and understand what it does before you use it though.
MySQL Manual on transaction isolation levels

Large memory Python background jobs

I am running a Flask server which loads data into a MongoDB database. Since there is a large amount of data, and this takes a long time, I want to do this via a background job.
I am using Redis as the message broker and Python-rq to implement the job queues. All the code runs on Heroku.
As I understand, python-rq uses pickle to serialise the function to be executed, including the parameters, and adds this along with other values to a Redis hash value.
Since the parameters contain the information to be saved to the database, they are quite large (~50 MB), and when this is serialised and saved to Redis, not only does it take a noticeable amount of time but it also consumes a large amount of memory. Redis plans on Heroku cost $30 p/m for only 100 MB. In fact, I very often get OOM errors like:
OOM command not allowed when used memory > 'maxmemory'.
I have two questions:
Is python-rq well suited to this task or would Celery's JSON serialisation be more appropriate?
Is there a way to not serialise the parameter but rather pass a reference to it?
Your thoughts on the best solution are much appreciated!
Since you mentioned in your comment that your task input is a large list of key value pairs, I'm going to recommend the following:
Load up your list of key/value pairs in a file.
Upload the file to Amazon S3.
Get the resulting file URL, and pass that into your RQ task.
In your worker task, download the file.
Parse the file line-by-line, inserting the documents into Mongo.
Using the method above, you'll be able to:
Quickly break up your tasks into manageable chunks.
Upload these small, compressed files to S3 quickly (use gzip).
Greatly reduce your Redis usage by requiring much less data to be passed over the wire.
Configure S3 to automatically delete your files after a certain amount of time (there are S3 settings for this: you can have it delete automatically after 1 day, for instance).
Greatly reduce memory consumption on your worker by processing the file one line at-a-time.
For use cases like what you're doing, this will be MUCH faster and require much less overhead than sending these items through your queueing system (a rough sketch of the pattern follows below).
Hope this helps!
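A rough sketch of this pattern with boto3 and python-rq is below; the bucket name, queue name and worker function path are invented for the example.

# Producer side: upload the data to S3 and enqueue only a reference to it.
import gzip
import json
import uuid

import boto3
from redis import Redis
from rq import Queue

s3 = boto3.client("s3")
q = Queue("imports", connection=Redis())

def enqueue_import(pairs):
    """Upload the key/value pairs to S3 and enqueue just the object key."""
    key = "imports/{}.json.gz".format(uuid.uuid4())
    body = gzip.compress(
        "\n".join(json.dumps(p) for p in pairs).encode("utf-8"))
    s3.put_object(Bucket="my-import-bucket", Key=key, Body=body)
    # The Redis payload is now a short string rather than ~50 MB of data.
    q.enqueue("worker.import_from_s3", key)

# Worker side: stream the file back and insert one document per line.
def import_from_s3(key):
    obj = s3.get_object(Bucket="my-import-bucket", Key=key)
    with gzip.GzipFile(fileobj=obj["Body"]) as fh:
        for line in fh:
            doc = json.loads(line)
            # collection.insert_one(doc)  # the Mongo write goes here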
It turns out that the solution that worked for me was to save the data to Amazon S3 storage and then pass the URI to the function in the background task.

Google App Engine, Datastore and Task Queues, Performance Bottlenecks?

We're designing a system that will take thousands of rows at a time and send them via JSON to a REST API built on Google App Engine. Typically 3-300KB of data but let's say in extreme cases a few MB.
The REST API app will then adapt this data to models on the server and save them to the Datastore. Are we likely to (eventually if not immediately) encounter any performance bottlenecks here with Google App Engine, whether it's working with that many models or saving so many rows of data at a time to the datastore?
The client does a GET to get thousands of records, then a PUT with thousands of records. Is there any reason for this to take more than a few seconds and necessitate the Task Queue API?
The only bottleneck in App Engine (apart from the single entity group limitation) is how many entities you can process in a single thread on a single instance. This number depends on your use case and the quality of your code. Once you reach a limit, you can (a) use a more powerful instance, (b) use multi-threading and/or (c) add more instances to scale up your processing capacity to any level you desire.
The Task Queue API is a very useful tool for large data loads. It allows you to split your job into a large number of smaller tasks, set the desired processing rate, and let App Engine automatically adjust the number of instances to meet that rate. Another option is the MapReduce API.
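As a rough sketch of that fan-out idea on the legacy Python standard environment (webapp2 plus the taskqueue API), something like the following could work; the handler paths, queue name and Row model are all assumptions.

# Sketch: accept the PUT quickly, push the actual datastore writes to tasks.
import json

import webapp2
from google.appengine.api import taskqueue
from google.appengine.ext import ndb

BATCH_SIZE = 500  # rows handled per task

class Row(ndb.Model):  # invented model, stands in for your own
    payload = ndb.JsonProperty()

class IngestHandler(webapp2.RequestHandler):
    def put(self):
        rows = json.loads(self.request.body)
        # Fan the rows out into small tasks so the client gets a fast
        # response and each task stays well inside the request limits.
        for i in range(0, len(rows), BATCH_SIZE):
            taskqueue.add(url="/tasks/save-rows",
                          queue_name="ingest",
                          payload=json.dumps(rows[i:i + BATCH_SIZE]))
        self.response.set_status(202)

class SaveRowsHandler(webapp2.RequestHandler):
    def post(self):
        rows = json.loads(self.request.body)
        ndb.put_multi([Row(payload=r) for r in rows])  # one batched write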
This is a really good question, one that I've been asked in interviews and have seen pop up in a lot of different situations as well. Your system essentially consists of two things:
Saving (or writing) models to the datastore.
Reading from the datastore.
From my experience of this problem, when you view these two things separately you're able to come up with solid solutions to both. I typically use a cache, such as memcached, to keep data easily accessible for reading. At the same time, for writing, I try to have a main db and a few slave instances as well. All the writes go to the slave instances (thereby not locking up the main db for reads that sync to the cache), and the writes to the slave dbs can be distributed in a round-robin approach, thereby ensuring that your insert statements are not skewed by any of the model's attributes having a high occurrence.
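A minimal sketch of the read-side caching idea with App Engine's memcache is below; the Item model, key scheme and 10-minute TTL are arbitrary choices for the example.

# Read-through cache sketch; model, keys and TTL are made up.
from google.appengine.api import memcache
from google.appengine.ext import ndb

class Item(ndb.Model):  # invented model, stands in for your own
    owner_id = ndb.StringProperty()
    payload = ndb.JsonProperty()

def get_items_cached(owner_id):
    key = "items:%s" % owner_id
    items = memcache.get(key)
    if items is None:
        # Cache miss: read from the datastore, then populate the cache.
        items = Item.query(Item.owner_id == owner_id).fetch()
        memcache.set(key, items, time=600)  # 10-minute TTL
    return items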
