I have a Python script which hits dozens of API endpoints every 10s to write climate data to a database. Let's say on average I insert 1,500 rows every 10 seconds, from 10 different threads.
I am thinking of making a batch system whereby the insert queries aren't written to the db as they come in, but are added to a waiting list; when the list reaches a certain size it is inserted as a batch and then emptied.
Is this justified due to the overhead with frequently writing small numbers of rows to the db?
If so, would a plain list be wise? I am worried about losing data if my program terminates unexpectedly; perhaps some form of serialized data would be better?
150 inserts per second can be a load on a database and can affect performance. There are pros and cons to changing the approach that you have. Here are some things to consider:
Databases implement ACID, so inserts are secure. This is harder to achieve with buffering schemes.
How important is up-to-date information for queries?
What is the query load?
insert is pretty simple. Alternative mechanisms may require re-inventing the wheel.
Do you have other requirements on the inserts, such as ensuring they are in particular order?
No doubt, there are other considerations.
Here are some possible alternative approaches:
If recent data is not a concern, snapshot the database for querying purposes -- say once per day or once per hour.
Batch inserts in the application threads. A single insert can insert multiple rows (see the sketch after this list).
Invest in larger hardware. An insert load that slows down a single processor may have little effect on a larger machine.
Invest in better hardware. More memory and faster disk (particularly solid state) can have a big impact.
No doubt, there are other approaches as well.
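To illustrate the multi-row insert point above, here is a minimal sketch that buffers rows and flushes a full batch with a single executemany call. It uses the standard-library sqlite3 module and a hypothetical readings table purely for illustration; the question doesn't name the actual database or schema, so adapt the connection and SQL accordingly.

    import sqlite3
    from threading import Lock

    BATCH_SIZE = 1000          # assumed flush threshold
    _buffer = []
    _lock = Lock()             # the question mentions 10 writer threads

    def add_row(row):
        """Buffer a row; flush a full batch as one multi-row insert."""
        batch = None
        with _lock:
            _buffer.append(row)
            if len(_buffer) >= BATCH_SIZE:
                batch = _buffer[:]
                _buffer.clear()
        if batch:
            _write_batch(batch)

    def _write_batch(batch):
        """Write the whole batch in a single transaction."""
        conn = sqlite3.connect("climate.db")   # placeholder database
        with conn:                             # commits once per batch
            conn.executemany(
                "INSERT INTO readings (station, ts, value) VALUES (?, ?, ?)",
                batch,
            )
        conn.close()

Note that a crash can still lose whatever is sitting in the buffer, which is exactly the trade-off raised in the question.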
I have a Flask application that allows users to query a ~small database (2.4M rows) using SQL. It's similar to a HackerRank but more limited in scope. It's deployed on Heroku.
I've noticed during testing that I can predictably hit an R14 error (memory quota exceeded) or R15 (memory quota greatly exceeded) by running large queries. The queries that typically cause this are outside what a normal user might do, such as SELECT * FROM some_huge_table. That said, I am concerned that these errors will become a regular occurrence for even small queries when 5, 10, 100 users are querying at the same time.
I'm looking for some advice on how to manage memory quotas for this type of interactive site. Here's what I've explored so far:
Changing the # of gunicorn workers. This has had some effect but I still hit R14 and R15 errors consistently.
Forced limits on user queries, based on either text or the EXPLAIN output. This does work to reduce memory usage, but I'm afraid it won't scale to even a very modest # of users.
Moving to a higher Heroku tier. The plan I use currently provides ~512MB RAM. The largest plan is around 14GB. Again, this would help but won't even moderately scale, to say nothing of the associated costs.
Reducing the size of the database significantly. I would like to avoid this if possible. Doing the napkin math, taking a table from 1.9M rows down to 10k or 50k would greatly reduce the application's memory needs and let it scale better, but there would still be some moderate maximum usage limit.
As you can see, I'm a novice at best when it comes to memory management. I'm looking for some strategies/ideas on how to solve this general problem, and if it's the case that I need to either drastically cut the data size or throw tons of $ at this, that's OK too.
Thanks
Coming from my personal experience, I see two approaches:
1. plan for it
In your case, this means you calculate the maximum memory that a request could use, multiply it by the number of gunicorn workers, and use dynos big enough to hold that.
For a different workload this could be a valid approach; I don't think it is for yours.
2. reduce memory usage, solution 1
The fact that it is application memory that runs out makes me think your code is likely loading the whole result set into memory (probably even multiple times, in multiple formats) before returning it to the client.
In the end, your application is only getting the data from the database and converting it to some output format (JSON/CSV?).
What you are probably searching for is streaming responses.
Your Flask view then works on a record-by-record basis: it reads a single record, converts it to your output format, and yields it, record after record.
Both your database client library and Flask support this (on the database side it is usually done with server-side cursors / iterators).
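As a rough sketch of such a streaming view, assuming PostgreSQL with psycopg2 and CSV output; the connection string, route, and query are placeholders:

    import csv
    import io

    import psycopg2
    from flask import Flask, Response

    app = Flask(__name__)

    @app.route("/query")
    def run_query():
        def generate():
            conn = psycopg2.connect("dbname=mydb")          # assumed DSN
            try:
                # A named cursor is a server-side cursor: rows are pulled in
                # chunks instead of loading the whole result set into memory.
                with conn.cursor(name="streaming_cursor") as cur:
                    cur.itersize = 1000                     # rows per round trip
                    cur.execute("SELECT * FROM some_huge_table")
                    for row in cur:
                        buf = io.StringIO()
                        csv.writer(buf).writerow(row)
                        yield buf.getvalue()
            finally:
                conn.close()

        return Response(generate(), mimetype="text/csv")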
2. reduce memory usage, solution 2
Other services often go for simple pagination or limiting result sets to manage server-side memory.
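For illustration, a minimal sketch of capping an arbitrary user SELECT by wrapping it in a subquery; the page size and the %s parameter style (psycopg2) are assumptions:

    PAGE_SIZE = 500   # assumed page size

    def paginate(user_sql, page):
        """Wrap a user-supplied SELECT so only one page of rows is returned."""
        wrapped = ("SELECT * FROM ({}) AS user_query LIMIT %s OFFSET %s"
                   .format(user_sql.rstrip("; ")))
        return wrapped, (PAGE_SIZE, page * PAGE_SIZE)

    # Example with a hypothetical user query; execute with your DB-API cursor:
    sql, params = paginate("SELECT * FROM some_huge_table", 0)
    # cursor.execute(sql, params)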
security sidenote
It sounds like users can define the SQL statement in their API requests. This is a security and application risk: apart from issuing INSERT, UPDATE, or DELETE statements, a user could craft a SQL statement that will not only blow up your application memory but also break your database.
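One way to contain that risk, sketched here under assumptions only (PostgreSQL, psycopg2, a hypothetical read-only role, and an arbitrary 5-second limit), is to run user SQL in a read-only session with a statement timeout and a capped fetch:

    import psycopg2

    user_sql = "SELECT * FROM some_huge_table"    # placeholder user query

    conn = psycopg2.connect("dbname=mydb user=readonly_user")  # assumed read-only role
    conn.set_session(readonly=True)               # writes are rejected by the session
    with conn.cursor() as cur:
        cur.execute("SET statement_timeout = '5s'")   # assumed per-query time limit
        cur.execute(user_sql)
        rows = cur.fetchmany(500)                 # cap what is pulled into memory
    conn.close()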
Using Postgres and sqlalchemy.
I have a job that scans a large table and, for each row, does some calculation and updates some related tables. I am told that I should issue periodic commits inside the loop so as not to keep a large amount of data in memory. I wonder whether such commits have a performance penalty, e.g. restarting a transaction, perhaps taking a db snapshot, etc.
Would using a flush() be better in this case?
An open transaction won't keep a lot of data in memory.
The advice you got was probably from somebody who is used to Oracle, where large transactions cause problems with UNDO.
The question is how you scan the large table:
If you snarf the large table to the client and then update the related tables, it won't matter much if you commit in between or not.
If you use a cursor to scan the large table (which is normally better), you'd have to create a WITH HOLD cursor if you want the cursor to work across transactions. Such a cursor is materialized on the database server side and so will use more resources on the database.
The alternative would be to use queries for the large table that fetch only part of the table and chunk the operation that way.
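A minimal sketch of that chunked approach with SQLAlchemy, committing once per chunk; the table names, key column, and per-row work are placeholders:

    from sqlalchemy import create_engine, text

    engine = create_engine("postgresql+psycopg2://user:pass@localhost/mydb")  # assumed DSN
    CHUNK = 1000
    last_id = 0

    while True:
        with engine.begin() as conn:        # one transaction per chunk
            rows = conn.execute(
                text("SELECT id, payload FROM big_table "
                     "WHERE id > :last ORDER BY id LIMIT :n"),
                {"last": last_id, "n": CHUNK},
            ).fetchall()
            if not rows:
                break
            for row in rows:
                # placeholder for the per-row calculation and related-table update
                conn.execute(
                    text("UPDATE related SET value = :v WHERE big_id = :id"),
                    {"v": len(row.payload), "id": row.id},
                )
            last_id = rows[-1].id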
That said, there are reasons why one big transaction might be better or worse than many smaller ones:
Reasons speaking for a big transaction:
You can use a normal cursor to scan the big table and don't have to bother with WITH HOLD cursors or the alternative as indicated above.
You'd have transactional guarantees for the whole operation. For example, after an error you can simply roll back and restart the operation.
Reasons speaking for operation in batches:
Shorter transactions reduce the risk of deadlocks.
Shorter transactions allow autovacuum to clean up the effects of previous batches while later batches are being processed. This is a notable advantage if there is a lot of data churn due to the updates, as it will help keep table bloat small.
The best choice depends on the actual situation.
I'm using a web API to call and receive data to build out an SQL database for historical energy prices. For context, energy prices are set at what are called "nodes", and each node has 20 years of historical data.
I can receive the data in JSON or XML format. I need to do one operation with the received data before I put it into the SQL database. Namely, I need to convert each hour given in Eastern Daylight Time back to its Eastern Standard Time equivalent.
Being brand new to Python (learned in last two weeks), I initially went down a path more intuitive to me:
HTTP Request (XML format) -> Parse to XML object in Python -> Convert Datetime -> Place in SQL database
The total size of the data I'm attempting to get is roughly 150GB. Because of this, I wanted to get the data in an asynchronous manner and format/put it into SQL as it came in from hundreds of API calls (there's a 50000 row limit to what I can get at a time). I was using a ThreadPool to do this. Once the data was received, I attempted to use a ProcessPool to convert this data into the format I needed to place into my SQL database, but was unsuccessful.
Looking at the process from a high level, I think this process can be a lot more efficient. I think I can do the following:
HTTP Request (JSON Format) -> Parse to JSON object in Python -> Perform operation to convert datetime (map value using dictionary?) -> Place into SQL database
I just discovered the OPENJSON library in Python. Is this all I need to do this?
Another issue I need to look into are the limitations of SQLite3. Each node will have its own table in my database, so ideally I'd like to have as many instances of my program as possible getting, parsing, and putting data into my SQLite3 database.
Any help would be much appreciated!
There is no definite answer to your question given so many unknowns, but I can outline a way to get to a solution.
Factors That Influence Performance
The processing is done in stages as you described (I'll abstract away the actual format for now for the reasons I'll describe a bit later):
Fetch data from the remote service
Parse data
Convert data
Store into local DB
For every stage there are limiting factors that prevent you from increasing the processing speed.
For fetching data some of them are:
network bandwidth.
the parallelism the remote server supports: the remote server may throttle connections and/or total speed for a single user, or the terms of usage may require you to limit this on the client side.
the data format used when downloading. Different formats add their own amount of unneeded/boilerplate formatting and/or data that is sent over the network. It depends on the service and its API, but it may be that the returned XML is smaller than the JSON, so even though XML is usually more verbose, in your particular case XML may be better.
the amount of RAM (and swap speed) may be a limit on your system in the (very unlikely) case that factors #1 and #2 do not limit you. In that case the downloaded data may not fit into RAM, will be swapped to disk, and this will slow down the download process.
For parsing the data:
RAM amount (for the same reasons as above)
Data format used.
The parser used. Different parser implementations (for JSON, for example) have different speeds.
CPU power: speed and number of processing units.
For data conversion:
RAM amount
CPU power
For data storing:
disk speed
parallelism level the DB efficiently supports
These are not all the factors that limit processing speed, just some of the most obvious ones. There are also other, unknown limitations.
There may also be some overhead when passing data between stages, depending on the design. In some designs (for example, a single process that reads the data from the remote server, processes it in memory, and stores it to the database) the overhead may be zero, but in others (multiple processes read data and store it to files, another set of processes opens these files and processes them, and so on) the overhead may be quite big.
The final processing speed is defined by the speed of the slowest stage or by the speed of passing data between stages.
Not all of these factors can be predicted when you design a solution or choose between several designs. Given that there are unknown factors this is even more complicated.
Approach
To be systematic I would use the following approach:
create a simple solution (e.g. a single process that reads data, processes it, and stores it to the database)
find the processing speed of every phase using that solution
when you have the processing speed of every phase, look at the slowest one (it only makes sense to look at the slowest, as it defines the overall speed)
then find out:
why is it slow?
what limits the speed, and can that be improved?
what is the theoretical limit of that stage? (for example, with a 1 Gbit/s network and one processing box you can't read data faster than about 120 MB/s; in practice it will be even less)
Improve. The improvement is usually one of the following:
optimize the processing done by a single worker (e.g. choose a better format or parsing library, remove operations that can be avoided, etc.). If you have hit (or are close to) the theoretical limit of the processing speed, you can't use this option.
add more parallelism
In general, when you try to optimize something, you need to have numbers and compare them as you run experiments.
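A minimal sketch of collecting those numbers, with dummy stand-ins for the real fetch/parse/convert/store stages (the stage bodies are placeholders):

    import json
    import time

    def timed(label, func, *args):
        """Run one pipeline stage, print its duration, and return its result."""
        start = time.perf_counter()
        result = func(*args)
        print("{}: {:.2f}s".format(label, time.perf_counter() - start))
        return result

    # Dummy stand-ins for the real stages.
    def fetch():
        return '[{"ts": "2019-07-01T01:00:00", "price": 31.5}]'

    def parse(raw):
        return json.loads(raw)

    def convert(records):
        return records        # the EDT -> EST conversion would go here

    def store(records):
        return len(records)   # the database insert would go here

    raw = timed("fetch", fetch)
    records = timed("parse", parse, raw)
    records = timed("convert", convert, records)
    timed("store", store, records)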
Parallelism
Python
You should be careful when choosing between threads and processes; for example, threads are not a good fit for CPU-intensive tasks. For more information on this, see Multiprocessing vs Threading Python.
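For illustration, a minimal sketch of that split, with a hypothetical fetch_page (I/O-bound), a hypothetical convert (CPU-bound), and placeholder URLs: threads for the downloads, processes for the conversion.

    import concurrent.futures as cf
    import urllib.request

    URLS = ["https://example.com/a", "https://example.com/b"]   # placeholder URLs

    def fetch_page(url):
        """I/O-bound: download one chunk (placeholder for the real API call)."""
        return urllib.request.urlopen(url).read()

    def convert(raw):
        """CPU-bound: parse/transform one chunk (placeholder for the real work)."""
        return raw.upper()

    if __name__ == "__main__":
        # Threads suit I/O-bound work: the GIL is released while waiting on the network.
        with cf.ThreadPoolExecutor(max_workers=10) as pool:
            pages = list(pool.map(fetch_page, URLS))

        # Processes sidestep the GIL for CPU-bound work.
        with cf.ProcessPoolExecutor() as pool:
            results = list(pool.map(convert, pages))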
SQLite
SQLite may have some limitations when multiple processes work with a single database. You need to check whether it is the limiting factor for your speed. Maybe you need to use another database that is better suited to parallelism and then, as a final step, dump the data from it into SQLite in a single shot (that only requires reading the data sequentially and storing it in SQLite, which may be much more efficient than parallel writes to a single SQLite DB).
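A minimal sketch of that final single-writer step, assuming the parallel workers have already produced the rows and a hypothetical prices table:

    import sqlite3

    def dump_to_sqlite(rows, db_path="prices.db"):
        """Write all rows from one process in a single big transaction."""
        conn = sqlite3.connect(db_path)
        conn.execute(
            "CREATE TABLE IF NOT EXISTS prices (node TEXT, ts TEXT, price REAL)"
        )
        with conn:   # one commit for everything: much faster than per-row commits
            conn.executemany("INSERT INTO prices VALUES (?, ?, ?)", rows)
        conn.close()

    # rows collected from the parallel fetch/parse stages (placeholder data)
    dump_to_sqlite([("NODE_A", "2019-07-01T01:00:00-05:00", 31.5)])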
I have a function that operates on a single collection, recursively doing two aggregations and then updating documents. All my indexes are correct.
At the moment I cannot refactor this code, and when the function runs it slams Mongo for about 10 minutes to process the data. The runtime seems to grow with collection size, averaging about 60 seconds for every additional 3k documents. This collection can grow to hundreds of thousands of documents. (The documents are small - about 10 keys with very small values each.)
There is no need for the result of this function to be real-time - it is scheduled, so it's perfectly fine for me to throttle it back.
The question is, is there any way to tell mongo to limit the CPU it grants to an operation? Or should I address the throttling in Python using sleep or some other method?
recursively doing two aggregations then updating documents
It looks like you need to consider re-modelling your schema. The flexibility of MongoDB's document schema is something you can use to optimise your process. See MongoDB: Data Modeling for more information, examples, and patterns.
The question is, is there any way to tell mongo to limit the CPU it grants to an operation?
MongoDB does not have a feature to limit CPU usage per operation. Such a feature may not make sense in a distributed deployment; for example, a limit of 1 CPU for an operation that spans multiple shards is no longer as simple or as desirable.
Alternatively, depending on your use case, if the function does not have to be real-time, you could utilise secondary read-preference.
Essentially, this directs your scheduled reporting to a secondary member, allowing your primary to process other data.
Make sure you read about the pros and cons of secondary reads beforehand, though.
See also MongoDB Replication
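A minimal sketch of directing the scheduled aggregation at a secondary with PyMongo; the connection string, database, collection, and pipeline are placeholders:

    from pymongo import MongoClient, ReadPreference

    client = MongoClient("mongodb://host1,host2,host3/?replicaSet=rs0")  # assumed replica set

    coll = client.get_database(
        "mydb", read_preference=ReadPreference.SECONDARY_PREFERRED
    ).get_collection("mycollection")

    # Aggregations issued through this handle are routed to a secondary when one
    # is available, keeping the scheduled load off the primary.
    pipeline = [{"$group": {"_id": "$key", "total": {"$sum": "$value"}}}]  # placeholder
    for doc in coll.aggregate(pipeline):
        print(doc)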
I have about 29 million individual unicode elements I want to serialise into a database. My current solution uses SQLite, and incrementally sends a list of length 100,000 at a time:
cursor.executemany("insert into array values (?)", [list of length 100,000])
I thought this would be efficient enough, but after going to sleep and waking up (say 6 hours), and seeing it still hadn't finished, I thought otherwise.
Would a MongoDB or memcached solution be more efficient?
(feel free to suggest any more efficient method)
If you decide to go with memcached, please take note of the following.
Memcached has a default maximum item size of 1 MB.
Storing anything larger than that requires you to start memcached with the maximum item size explicitly increased.
I really like using Redis for storing data structures. As your requirements are not very clear, I would suggest Redis, since you can store your data in several different ways and work with it flexibly.
Please read the documentation on the different data structures that Redis can store.
http://redis.io/documentation
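For illustration, a minimal sketch of bulk-loading the elements into a Redis list with redis-py, batching commands through a pipeline to cut network round trips; the key name and connection settings are assumptions:

    import redis

    r = redis.Redis(host="localhost", port=6379)   # assumed connection settings

    elements = ["alpha", "beta", "gamma"]          # stand-in for the 29M elements

    pipe = r.pipeline(transaction=False)
    for i, value in enumerate(elements, 1):
        pipe.rpush("elements", value)              # append to a single Redis list
        if i % 100000 == 0:
            pipe.execute()                         # flush every 100k commands
    pipe.execute()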
Try spawning gevent greenlets (e.g. 10), monitor how many records are processed in a given time, and then try increasing the number. You should also look into multiprocessing.
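A minimal sketch of the gevent suggestion, with a hypothetical insert_chunk function standing in for the real work; the pool size of 10 is the starting point mentioned above:

    from gevent import monkey
    monkey.patch_all()        # make blocking I/O cooperative; do this first

    from gevent.pool import Pool

    def insert_chunk(chunk):
        """Placeholder: insert one chunk of records into the datastore."""
        print("inserted {} records".format(len(chunk)))

    chunks = [list(range(100000)) for _ in range(10)]   # stand-in for the real data

    pool = Pool(10)                      # start with 10 greenlets, then tune
    pool.map(insert_chunk, chunks)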