I'm using Prefect to automate my flows (Python scripts). Once they're running, some data gets persisted to a PostgreSQL database. The problem is that the size of pg_data rapidly gets out of hand (~20 GB), and I was wondering whether there's a way to reduce the amount of data stored in pg_data when running an agent, or a way to automatically clean the directory.
Thanks in advance for your help,
best,
Christian
I assume you are running Prefect Server and you want to clean up the underlying database instance to save space? If so, there are a couple of ways you can clean up the Postgres database:
you can manually delete old records, especially logs from the flow run table using DELETE FROM in SQL,
you can do the same in an automated fashion; e.g. some users have an actual flow that runs on a schedule and purges old data from the database (see the first sketch after this list),
alternatively, you can use the open-source pg_cron job scheduler for Postgres to schedule such DB administration tasks,
you can also do the same using GraphQL: you would need to query for the flow run IDs of "old" flow runs using the flow_run query, and then execute the delete_flow_run mutation for each (see the second sketch after this list),
lastly, to be more proactive, you can reduce the number of logs you generate by logging less in general (only logging what's needed) and by using a less verbose log level; e.g. switching your agent from DEBUG to INFO should significantly reduce the amount of space consumed by logs in the database.
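For example, here's a minimal sketch of the scheduled-purge approach from the second bullet, assuming Prefect 1.x with psycopg2, and assuming old log rows live in a `log` table with a `created` timestamp (the connection string, table, and column names are assumptions; adjust them to your schema):

```python
from datetime import timedelta

import psycopg2
from prefect import task, Flow
from prefect.schedules import IntervalSchedule

@task
def purge_old_logs():
    # Point this at the Postgres instance backing Prefect Server;
    # credentials here are placeholders.
    conn = psycopg2.connect("dbname=prefect_server user=prefect host=localhost")
    with conn, conn.cursor() as cur:
        # Delete log records older than 14 days (pick a retention that suits you)
        cur.execute("DELETE FROM log WHERE created < now() - interval '14 days'")
    conn.close()

# Run the purge once a day
schedule = IntervalSchedule(interval=timedelta(days=1))

with Flow("db-maintenance", schedule=schedule) as flow:
    purge_old_logs()
```

You'd register and run this flow like any other; the same body also works as a plain cron script if you'd rather keep the maintenance job outside Prefect.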
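And a sketch of the GraphQL route from the fourth bullet, assuming Prefect 1.x's Python Client and an illustrative 14-day cutoff (the exact schema may differ between Server versions):

```python
from datetime import datetime, timedelta

from prefect import Client

client = Client()
cutoff = (datetime.utcnow() - timedelta(days=14)).isoformat()

# Query for the IDs of "old" flow runs (here: ended more than 14 days ago)
result = client.graphql(f"""
query {{
  flow_run(where: {{end_time: {{_lt: "{cutoff}"}}}}) {{
    id
  }}
}}
""")

# Delete them one by one via the delete_flow_run mutation
for run in result["data"]["flow_run"]:
    client.graphql(f"""
mutation {{
  delete_flow_run(input: {{flow_run_id: "{run['id']}"}}) {{
    success
  }}
}}
""")
```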
I want to create a Python3 program that takes in MySQL data and holds it temporarily, and can then pass this data onto a cloud MySQL database.
The idea is that it acts as a buffer for entries in the event that my local network goes down; the buffer would then pass those entries on at a later date, theoretically providing fault tolerance.
I have done some research into replication and GTIDs and I'm currently in the process of learning these concepts. However, I would like to write my own solution, or at least have it be a smaller program rather than a full server-side implementation of replication.
I already have a program that generates some MySQL data to fill my DB; the key part I need help with is the buffer aspect/implementation (the code I have isn't important, as I can rework it later).
I would greatly appreciate any good resources or help, thank you!
I would implement what you describe using a message queue.
Example: https://hevodata.com/learn/python-message-queue/
The idea is to run a message queue service on your local computer. Your Python application pushes items into the MQ instead of committing directly to the database.
Then you need another background task, called a worker, which you can also write in Python (or another language); it consumes items from the MQ and writes them to the cloud database when it's available. If the cloud database is not available, the background worker pauses.
The data in the MQ can grow while the background worker is paused. If this goes on too long, you may run out of space. But hopefully the rate of growth is slow enough and the cloud database is available regularly, so the risk of this happening is low.
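As a concrete illustration, here's a minimal sketch of the worker side, using a Redis list as the local MQ and mysql-connector for the cloud DB (both choices, plus all names and credentials, are assumptions; any MQ such as RabbitMQ would follow the same shape):

```python
import json
import time

import redis                  # pip install redis
import mysql.connector        # pip install mysql-connector-python
from mysql.connector import Error

r = redis.Redis(host="localhost", port=6379)

def write_to_cloud(row):
    conn = mysql.connector.connect(
        host="cloud-db.example.com", user="app", password="secret",
        database="appdb",
    )
    try:
        cur = conn.cursor()
        cur.execute(
            "INSERT INTO entries (source, payload) VALUES (%s, %s)",
            (row["source"], row["payload"]),
        )
        conn.commit()
    finally:
        conn.close()

while True:
    # Block until the application pushes an item into the buffer
    _, raw = r.blpop("entry_buffer")
    row = json.loads(raw)
    try:
        write_to_cloud(row)
    except Error:
        # Cloud DB unreachable: put the item back and pause before retrying
        r.lpush("entry_buffer", raw)
        time.sleep(30)
```

On the application side, instead of committing to the database you'd just do r.rpush("entry_buffer", json.dumps(row)).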
Re your comment about performance.
This is a different application architecture, so there are pros and cons.
On the one hand, if your application is "writing" to a local MQ instead of the remote database, it's likely to appear to the app as if writes have lower latency.
On the other hand, posting to the MQ does not write to the database immediately. There still needs to be a step of the worker pulling an item and initiating its own write to the database. So from the application's point of view, there is a brief delay before the data appears in the database, even when the database seems available.
So the app can't depend on the data being ready to be queried immediately after the app pushes it to the MQ. That is, it might be pretty prompt, under 1 second, but that's not the same as writing to the database directly, which ensures that the data is ready to be queried immediately after the write.
The performance of the worker writing the item to the database should be identical to that of the app writing that same item to the same database. From the database perspective, nothing has changed.
I've got a lot of scripts running: scrapers, checkers, cleaners, etc. They have some things in common:
they run forever;
they have no time constraint to finish their job;
they all access the same MySQL DB, writing and reading.
Together they're starting to slow down the website, which runs on the same system but depends on these scripts.
I can use queues with Kombu to serialize all the writes.
But do you know a way to do the same with reads?
E.g., if one script needs to read from the DB, its request is sent to a blocking queue, and it resumes when it gets the answer? That way everybody makes requests to one process, and that process is the only one talking to the DB, making one request at a time.
I have no idea how to do this.
Of course, in the end I may have to add more servers to the mix, but before that, is there something I can do at the software level ?
You could use a connection pooler and make the scripts' connections go through it. It would limit the number of real connections hitting your DB while being transparent to your scripts (their connections would be held in a "wait" state until a real connection is freed).
I don't know what DB you use, but for Postgres I'm using PgBouncer for similar reasons; see http://pgfoundry.org/projects/pgbouncer/
You say that your dataset is <1 GB and the problem is CPU-bound.
Now start analyzing what is eating CPU cycles:
Which queries are really slow and executed often? MySQL can log those queries (see the sketch after this list).
What about the slow queries? Can they be accelerated by using an index?
Are there unused indices? Drop them!
Nothing helps? Can you solve it by denormalizing/precomputing stuff?
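A hedged sketch of the first two checks from Python, assuming mysql-connector, admin privileges, and an illustrative entries table:

```python
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="root",
                               password="secret", database="appdb")
cur = conn.cursor()

# Log every statement slower than one second (server-wide settings)
cur.execute("SET GLOBAL slow_query_log = 'ON'")
cur.execute("SET GLOBAL long_query_time = 1")

# EXPLAIN shows whether a suspect query can use an index (the 'key'
# column) or falls back to a full table scan ('type' = ALL)
cur.execute("EXPLAIN SELECT * FROM entries WHERE source = %s", ("sensor-1",))
for row in cur.fetchall():
    print(row)

conn.close()
```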
You could create a function that each process must call in order to talk to the DB, and rewrite the scripts so that they call that function rather than talking to the DB directly. Within that function, you could have a scope-based lock so that only one process talks to the DB at a time.
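A minimal sketch of that gateway function, assuming mysql-connector and POSIX. Since the scripts are independent processes, a plain threading.Lock won't reach across them, so this uses a file lock instead; the path, credentials, and names are illustrative:

```python
import fcntl                   # POSIX-only file locking

import mysql.connector

LOCK_PATH = "/tmp/db_gateway.lock"

def run_query(sql, params=()):
    with open(LOCK_PATH, "w") as lockfile:
        # Blocks until no other process holds the lock
        fcntl.flock(lockfile, fcntl.LOCK_EX)
        try:
            conn = mysql.connector.connect(host="localhost", user="app",
                                           password="secret", database="appdb")
            try:
                cur = conn.cursor()
                cur.execute(sql, params)
                if cur.with_rows:      # SELECT-style statement
                    return cur.fetchall()
                conn.commit()
            finally:
                conn.close()
        finally:
            fcntl.flock(lockfile, fcntl.LOCK_UN)

# Every script then calls e.g.:
# rows = run_query("SELECT * FROM entries WHERE source = %s", ("s1",))
```

Note this serializes everything, reads included, which is exactly what the question asks for, at the cost of throughput.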
I have a Django project with various apps, which are completely independent. I'd like to make each one run in its own process, as some of them spawn background threads to periodically precalculate some data, and now they're competing for the CPU (the machine has loads of cores, but you know, the GIL and such...).
So, is there an easy way to automatically split the project into different ones, or at least to make each app live in its own process?
You can always have different settings files, but that would be like having multiple projects and even multiple endpoints. With some effort you could configure a reverse proxy to forward to the right Django server, based on the request's path and so on, but I don't think that's what you want and it would be an ugly solution to your problem.
The solution to this is to move the heavy processing to a jobs queue. A lot of people and projects prefer Celery for this.
If that seems like overkill for some reason, you can always implement your own based on simple cron jobs. You can take a look at my small project that does this.
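For scale, a minimal Celery sketch of moving the periodic precalculation out of the web process into a worker; the broker URL, task body, and interval are illustrative assumptions (this would live in e.g. a tasks.py module):

```python
from celery import Celery

app = Celery("myproject", broker="redis://localhost:6379/0")

@app.task
def precalculate():
    # the expensive periodic work formerly done in a background thread
    ...

# Have Celery beat trigger it every 10 minutes
app.conf.beat_schedule = {
    "precalc-every-10-min": {
        "task": precalculate.name,   # resolves to the registered task name
        "schedule": 600.0,
    },
}
```

You'd then run one or more workers (celery -A tasks worker) plus the scheduler (celery -A tasks beat), each as its own OS process, so the GIL stops being a constraint.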
The simplest of the simple is probably to write a custom management command that watches a given model (database table) for new entries and processes them. The model is written to by e.g. a Django view, and the management command is launched periodically from cron (e.g. every 5 minutes).
Example: a user registers on the site, but account creation is an expensive operation (allocating some space, pinging remote services, etc.). So you just write a new record to an AccountRequest table (AccountRequest.objects.create(...)). Then cron periodically launches your management script (./manage.py account_creator), which checks for new AccountRequests (AccountRequest.objects.filter(unprocessed=True)), does its job, and marks those requests as processed.
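A minimal sketch of that command, following the example above (the file location, model, and field names are assumptions, e.g. myapp/management/commands/account_creator.py):

```python
from django.core.management.base import BaseCommand

from myapp.models import AccountRequest   # the model from the example


class Command(BaseCommand):
    help = "Process pending account requests"

    def handle(self, *args, **options):
        for req in AccountRequest.objects.filter(unprocessed=True):
            # ... allocate space, ping remote services, etc. ...
            req.unprocessed = False
            req.save()
```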
I'm working on a Python app I want to be scalable to accommodate about 150 writes per second. That's spread out among about 50 different sources.
Is MongoDB a good candidate for this? I'm torn between writing to a database and just making a log file for each source and parsing them separately.
Any other suggestions for logging a lot of data?
I would say that MongoDB is a very good fit for log collection, because:
MongoDB has amazingly fast writes.
Logs are not so important, so it's okay to lose some of them in case of a server failure. That means you can run MongoDB without journaling to avoid write overhead.
In addition, you can use sharding to increase write speed, and at the same time just move the oldest logs to a separate collection or into the file system.
You can easily export data from the database to JSON/CSV.
Once you have everything in the database, you can query it to find the logs you need.
So, my opinion is that MongoDB is a perfect fit for things like logs. You don't need to manage lots of log files in the file system; MongoDB does that for you.
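For what it's worth, the write path is a couple of lines with pymongo (the database, collection, and field names here are illustrative):

```python
from datetime import datetime, timezone

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
logs = client.appdb.logs

# One document per log entry; ~150/s of these is well within reach
logs.insert_one({
    "source": "source-17",
    "level": "INFO",
    "message": "heartbeat",
    "ts": datetime.now(timezone.utc),
})

# Later, query for the logs you need
for doc in logs.find({"source": "source-17"}).sort("ts", -1).limit(10):
    print(doc)
```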
I'm using Python with Celery and RabbitMQ to make a web spider to count the number of links on a page.
Can a database such as MySQL be written to asynchronously? Is it OK to commit the changes after every row added, or should I batch them (multi-add) and commit after a certain number of rows or a certain duration?
I'd prefer to use SQLAlchemy and MySQL, unless there is a more recommended combination for Celery/RabbitMQ. I also see NoSQL (CouchDB?) recommended.
For write-intensive operations like counters and logs, NoSQL solutions are often the best choice. Personally, I use MongoDB for this kind of task.
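That said, if you do stay with SQLAlchemy/MySQL, the batching the question asks about is straightforward: accumulate rows and commit once per batch rather than per row. A sketch, assuming SQLAlchemy 1.4+ and illustrative names throughout:

```python
from celery import Celery
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

app = Celery("spider", broker="amqp://guest@localhost//")
engine = create_engine("mysql+pymysql://app:secret@localhost/spider")
Base = declarative_base()

class PageCount(Base):
    __tablename__ = "page_counts"
    id = Column(Integer, primary_key=True)
    url = Column(String(2048))
    links = Column(Integer)

Base.metadata.create_all(engine)

@app.task
def save_counts(rows):
    # rows: list of {"url": ..., "links": ...} dicts from the spider.
    # One commit for the whole batch is much cheaper than one per row.
    with Session(engine) as session:
        session.add_all([PageCount(**r) for r in rows])
        session.commit()
```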