I have a function that operates on a single collection, recursively doing two aggregations then updating documents. All my indexes are correct.
At the moment I cannot refactor this code, and when the function runs it slams Mongo for about 10 minutes to process the data. Seems to grow based on collection size, averaging about 60 seconds for every additional 3k documents. This collection can grow to hundreds of thousands of documents. (The documents are small - about 10 keys with very small values each.)
There is no need for the result of this function to be real-time - it is scheduled, so it's perfectly fine for me to throttle it back.
The question is, is there any way to tell mongo to limit the CPU it grants to an operation? Or should I address the throttling in Python using sleep or some other method?
recursively doing two aggregations then updating documents
It looks like you should consider remodelling your schema. The flexibility of MongoDB's document schema is something you can use to optimise your process. See MongoDB: Data Modeling for more information, examples and patterns.
The question is, is there any way to tell mongo to limit the CPU it grants to an operation?
MongoDB does not have a feature to limit CPU usage per operation. Such a feature may not make sense in a distributed setting. For example, a limit of 1 CPU for an operation that spans multiple shards is no longer as simple, or as desirable.
Alternatively, depending on your use case, if the function does not have to be real-time, you could use a secondary read preference.
Essentially, you direct your scheduled reporting reads to a secondary member, leaving your primary free to process other data. Note that only the reads (the aggregations) can go to a secondary; the updates still go to the primary.
Make sure you read up on the pros and cons of secondary reads beforehand, though.
See also MongoDB Replication
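As for throttling in Python with sleep: since MongoDB will not cap the CPU for you, pausing between batches on the client side is a reasonable way to spread the load. Below is a minimal sketch using only the standard library. The names `throttled_batches`, `run_throttled` and `process_batch` are hypothetical, and the batch size and pause length are assumptions you would tune against your own collection.

```python
import time
from itertools import islice


def throttled_batches(items, batch_size, pause_seconds, sleep=time.sleep):
    """Yield lists of up to batch_size items, sleeping between batches."""
    iterator = iter(items)
    first = True
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        if not first:
            sleep(pause_seconds)  # give the server room for other work
        first = False
        yield batch


def run_throttled(doc_ids, process_batch, batch_size=500, pause_seconds=1.0,
                  sleep=time.sleep):
    """Call process_batch on each batch; return how many items were handled."""
    handled = 0
    for batch in throttled_batches(doc_ids, batch_size, pause_seconds, sleep):
        # Hypothetical callback: run the aggregations and updates for this batch.
        process_batch(batch)
        handled += len(batch)
    return handled
```

The `sleep` argument is only there so the pause can be replaced in tests. This does not reduce the total work, it only stretches it over a longer period so other operations get a turn.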
Related
I'm using a web API to call and receive data to build out an SQL database for historical energy prices. For context, energy prices are set at what are called "nodes", and each node has 20 years of historical data.
I can receive the data in JSON or XML format. I need to do one operation with the received data before I put it into the SQL database. Namely, I need to convert each hour given in Eastern Daylight Time back to its Eastern Standard Time equivalent.
Being brand new to Python (learned in last two weeks), I initially went down a path more intuitive to me:
HTTP Request (XML format) -> Parse to XML object in Python -> Convert Datetime -> Place in SQL database
The total size of the data I'm attempting to get is roughly 150GB. Because of this, I wanted to get the data in an asynchronous manner and format/put it into SQL as it came in from hundreds of API calls (there's a 50000 row limit to what I can get at a time). I was using a ThreadPool to do this. Once the data was received, I attempted to use a ProcessPool to convert this data into the format I needed to place into my SQL database, but was unsuccessful.
Looking at the process from a high level, I think this process can be a lot more efficient. I think I can do the following:
HTTP Request (JSON Format) -> Parse to JSON object in Python -> Perform operation to convert datetime (map value using dictionary?) -> Place into SQL database
I just discovered the OPENJSON library in Python. Is this all I need to do this?
Another issue I need to look into are the limitations of SQLite3. Each node will have its own table in my database, so ideally I'd like to have as many instances of my program as possible getting, parsing, and putting data into my SQLite3 database.
Any help would be much appreciated!
There is no definite answer to your question given so many unknowns, but I can outline how to get to a solution.
Factors That Influence Performance
The processing is done in stages as you described (I'll abstract away the actual format for now for the reasons I'll describe a bit later):
Fetch data from the remote service
Parse data
Convert data
Store into local DB
For every stage there are some limiting factors that prevent you from increasing the processing speed.
For fetching data some of them are:
network bandwidth.
parallelism that remote server supports: remote server may throttle connections and/or total speed for single user or it may be required by terms of usage to limit this on client side.
data format used when downloading. Different formats add their own amount of unneeded/boilerplate formatting and/or data that is sent over the network. It depends on the service and its API, but the returned XML may turn out to be smaller than the JSON, so even though XML is usually more verbose, for your particular case XML could be better.
RAM amount (and swap speed) may be a limit on your system in the (very unlikely) case that factors #1 and #2 do not limit you. In this case the downloaded data may not fit into RAM and will be swapped to disk, which will slow down the download process.
For parsing the data:
RAM amount (for the same reasons as above)
Data format used.
The parser used. Different parser implementations, for JSON for example, have different speeds.
CPU power: speed and number of processing units.
For data conversion:
RAM amount
CPU power
For data storing:
disk speed
parallelism level the DB efficiently supports
These are not all the factors that limit processing speed, just some of the most obvious ones. There are also other, unknown limitations.
Also there may be some overhead when passing data between stages. It depends on the design. In some designs (for example single process that reads the data from remote server, processes it in memory and stores to database) the overhead may be zero, but in some designs (multiple processes read data and stores it to files, another set of processes open these files and processes them and so on) the overhead may be quite big.
The final speed of processing is defined by speed of the slowest stage or speed of data passing between stages.
Not all of these factors can be predicted when you design a solution or choose between several designs. Given that there are unknown factors this is even more complicated.
Approach
To be systematic I would use the following approach:
create a simple solution (like a single process that reads the data, processes it and stores it to the database)
find the processing speed of every phase using that solution
when you have the processing speed of every phase, look at the slowest phase (note that it only makes sense to look at the slowest one, as it defines the overall speed)
then find
why it is slow?
what limits the speed and if that can be improved?
what is the theoretical limit of that stage? (for example, if you have a 1 Gb/s network and one processing box, you can't read data faster than about 120 MB/s, and in practice it will be even less).
Improve. The improvement is usually
optimize the processing of a single processor (like choosing a better format or library for parsing, removing operations that can be avoided, etc). If you have hit (or are close to) the theoretical limit of the processing speed, you can't use this option.
add more parallelism
In general when you try to optimize something you need to have numbers and compare them when you are doing experiments.
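To get those numbers, it can be enough to accumulate wall-clock time per stage and see which one dominates. Here is a rough sketch of that idea. `StageTimer` and the stage names are made up for illustration, and the functions you pass in would be your own fetch/parse/convert/store code.

```python
import time
from collections import defaultdict


class StageTimer:
    """Accumulate elapsed time per named stage."""

    def __init__(self, clock=time.perf_counter):
        self.clock = clock
        self.totals = defaultdict(float)

    def run(self, stage, func, *args):
        """Run func(*args), add its duration to the stage total, return its result."""
        start = self.clock()
        try:
            return func(*args)
        finally:
            self.totals[stage] += self.clock() - start

    def slowest(self):
        """Return the name of the stage with the largest accumulated time."""
        return max(self.totals, key=self.totals.get)
```

You would wrap each stage call, for example `timer.run("parse", parse_payload, raw)` with your own `parse_payload`, run a representative sample of the data through it, and then look at `timer.totals` before deciding what to optimize.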
Parallelism
Python
You should be careful when choosing between threads and processes. For example, threads are not good for CPU-intensive tasks. For more information, see Multiprocessing vs Threading Python
SQLite
SQLite may have some limitations when multiple processes work with a single database. You need to check whether it is the limiting factor for your speed. Maybe you need to use another database that is better suited to parallelism and then, as an additional final step, dump the data from it to SQLite in a single shot (that would only require reading the data sequentially and storing it in SQLite, which may be much more efficient than parallel writes to a single SQLite DB).
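For the "single shot" load into SQLite, one writer inserting many rows inside one transaction is usually the shape to aim for. The sketch below uses the standard `sqlite3` module. The `prices` table and its columns are assumptions invented for the example, not something taken from your schema.

```python
import sqlite3


def bulk_load(conn, rows):
    """Insert all rows in one transaction and return the table's row count."""
    # Hypothetical table layout, for illustration only.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS prices (node TEXT, hour TEXT, price REAL)"
    )
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany(
            "INSERT INTO prices (node, hour, price) VALUES (?, ?, ?)", rows
        )
    return conn.execute("SELECT COUNT(*) FROM prices").fetchone()[0]
```

`rows` can be any iterable of 3-tuples, so the final dump can stream from the intermediate store without holding everything in memory.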
I have a python script which hits dozens of API endpoints every 10s to write climate data to a database. Let's say on average I insert 1,500 rows every 10 seconds, from 10 different threads.
I am thinking of making a batch system whereby the insert queries aren't written to the db as they come in, but added to a waiting list and this list is inserted in batch when it reaches a certain size, and the list of course emptied.
Is this justified due to the overhead with frequently writing small numbers of rows to the db?
If so, would a list be wise? I am worried about what happens if my program terminates unexpectedly; perhaps a form of serialized data would be better?
150 inserts per second can be a load on a database and can affect performance. There are pros and cons to changing the approach that you have. Here are some things to consider:
Databases implement ACID, so inserts are secure. This is harder to achieve with buffering schemes.
How important is up-to-date information for queries?
What is the query load?
insert is pretty simple. Alternative mechanisms may require re-inventing the wheel.
Do you have other requirements on the inserts, such as ensuring they are in particular order?
No doubt, there are other considerations.
Here are some possible alternative approaches:
If recent data is not a concern, snapshot the database for querying purposes -- say once per day or once per hour.
Batch inserts in the application threads. A single insert can insert multiple rows.
Invest in larger hardware. An insert load that slows down a single processor may have little effect on a larger machine.
Invest in better hardware. More memory and faster disk (particularly solid state) can have a big impact.
No doubt, there are other approaches as well.
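If you do go with batching in the application, the buffer needs a lock because 10 threads will be appending to it. The following is a minimal sketch, not a drop-in solution. `InsertBuffer` and the `flush` callback are hypothetical names; `flush` is assumed to perform one multi-row insert for the batch it is given.

```python
import threading


class InsertBuffer:
    """Collect rows from several threads and hand them over in batches."""

    def __init__(self, flush, max_rows=500):
        self._flush = flush          # callable taking a list of rows
        self._max_rows = max_rows
        self._rows = []
        self._lock = threading.Lock()

    def add(self, row):
        """Queue a row; flush a full batch once max_rows is reached."""
        with self._lock:
            self._rows.append(row)
            if len(self._rows) < self._max_rows:
                return
            batch, self._rows = self._rows, []
        # Flush outside the lock so other threads can keep adding rows.
        self._flush(batch)

    def close(self):
        """Flush whatever is left, for example at shutdown."""
        with self._lock:
            batch, self._rows = self._rows, []
        if batch:
            self._flush(batch)
```

Keep in mind the point about ACID above: rows still sitting in the buffer are lost if the process dies, so `max_rows` is also a bound on how much data you are willing to lose.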
This is a rather specific question to advanced users of celery. Let me explain the use case I have:
Usecase
I have to run ~1k-100k tasks that will run a simulation (movie) and return the data of the simulation as a rather large list of smaller objects (frames), say 10k-100k per frame and 1k frames. So the total amount of data produced will be very large, but assume that I have a database that can handle this. Speed is not a key factor here. Later I need to compute features from each frame which can be done completely independent.
The frames look like a dict that point to some numpy arrays and other simple data like strings and numbers and have a unique identifier UUID.
What is important is that the final objects of interest are arbitrary joins and splits of these generated lists. As a metaphor, consider the resulting movies being chopped up and recombined into new movies. These final lists (movies) are then basically a list of references to the frames using their UUIDs.
Now, I consider using celery to get these first movies and since these will end up in the backend DB anyway I might just keep these results indefinitely, at least the ones I specify to keep.
My question
Can I configure a backend, preferably a NonSQL DB, in a way to keep the results and access these later independent from Celery using the objects UUID. And if so does that make sense because of overhead and performance, etc.
Another possibility would be to not return anything and let the worker store the result in a DB. Is that preferred? It seems unnecessary to have a second channel of communication to another DB when Celery can do this already.
I am also interested in comments on using Celery in general for highly independent tasks that run long (>1h) and return large result objects. A failure is not problematic; the task can just be restarted. The resulting movies are stochastic, so functional approaches can be problematic. Even storing the random seed might not guarantee reproducible results, although I do not have side-effects. I just might have lots of workers available that are widely distributed. Imagine lots of desktop machines in a closed environment where every worker helps even if it is slow. Network speed and security are not an issue here. I know that this is not the original use case, but it seemed very easy to use it for these cases. The best analogy I found are projects like Folding@Home.
Can I configure a backend, preferably a NonSQL DB, in a way to keep the results and access these later independent from Celery using the objects UUID.
Yes, you can configure celery to store its results in a NoSQL database such as redis for access by UUID later. The two settings that will control the behavior of interest for you are result_expires and result_backend.
result_backend will specify which NoSQL database you want to store your results in (e.g., elasticsearch or redis) while result_expires will specify how long after a task completes that the task's result will be available for access.
After the task completes, you can access the results in python like this:
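from celery.result import AsyncResult

# task_name is a placeholder for one of your own Celery tasks
result = task_name.delay()
print(result.id)

# later, possibly in another process, look the result up by its UUID
uuid = result.id
checked_result = AsyncResult(uuid)
# and you can access the result output here however you'd like,
# for example with checked_result.ready() or checked_result.get()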
And if so does that make sense because of overhead and performance, etc.
I think this strategy makes perfect sense. I have used this a number of times when generating long-running reports for web users. The initial POST will return the UUID from the celery task. The web client can poll the app server via javascript using the UUID to see if the task is ready/complete. Once the report is ready, the page can redirect the user to the route that will allow the user to download or view the report by passing in the UUID.
We're designing a system that will take thousands of rows at a time and send them via JSON to a REST API built on Google App Engine. Typically 3-300KB of data but let's say in extreme cases a few MB.
The REST API app will then adapt this data to models on the server and save them to the Datastore. Are we likely to (eventually if not immediately) encounter any performance bottlenecks here with Google App Engine, whether it's working with that many models or saving so many rows of data at a time to the datastore?
The client does a GET to get thousands of records, then a PUT with thousands of records. Is there any reason for this to take more than a few seconds, and necessitate the need for a Task queues API?
The only bottleneck in App Engine (apart from the single entity group limitation) is how many entities you can process in a single thread on a single instance. This number depends on your use case and the quality of your code. Once you reach a limit, you can (a) use a more powerful instance, (b) use multi-threading and/or (c) add more instances to scale up your processing capacity to any level you desire.
Task API is a very useful tool for large data loads. It allows you to split your job into a large number of smaller tasks, set the desired processing rate, and let App Engine automatically adjust the number of instances to meet that rate. Another option is a MapReduce API.
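Splitting the job is the part you control on your side. As a small illustration, here is how a large list of rows could be cut into task-sized payloads. `split_into_tasks` is a hypothetical helper, and actually enqueueing each payload is left to whichever task queue call you use.

```python
def split_into_tasks(rows, rows_per_task):
    """Return consecutive slices of rows, each at most rows_per_task long."""
    if rows_per_task < 1:
        raise ValueError("rows_per_task must be at least 1")
    return [
        rows[start:start + rows_per_task]
        for start in range(0, len(rows), rows_per_task)
    ]
```

Each slice then becomes the payload of one task, and the queue's processing rate decides how quickly they are worked through.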
This is a really good question, one that I've been asked in interviews and have seen pop up in a lot of different situations as well. Your system essentially consists of two things:
Saving (or writing) models to the data store
Reading from the data store.
In my experience of this problem, when you treat these two things separately you're able to come up with solid solutions to both. I typically use a cache, such as memcached, in order to keep data easily accessible for reading. At the same time, for writing, I try to have a main db and a few slave instances as well. All the writes will go to the slave instances (thereby not locking up the main db for reads that sync to the cache), and the writes to the slave dbs can be distributed in a round-robin approach, thereby ensuring that your insert statements are not skewed by any of the model's attributes having a high occurrence.
I'm working on a closed system running in the cloud.
What I need is a search function that uses user-typed-in regexp to filter the rows in a dataset.
import re

phrase = re.compile(request.get("query"))
data = Entry.all().fetch(50000)  # this takes around 10s when there are 6000 records
result = [x for x in data if phrase.search(x.title)]
Now, the database itself won't change too much, and there will be no more than 200-300 searches a day.
Is there a way to somehow cache all the Entries (I expect that there will be no more than 50.000 of them, each no bigger than 500 bytes), so retrieving them won't take up >10 seconds? Or perhaps to parallelize it? I don't mind 10 cpu seconds, but I do mind the 10 seconds that the user has to wait.
To address any answers like "index it and use .filter()" - the query is a regexp, and I don't know of any indexing mechanism that would allow the use of a regexp.
You can also use cachepy or performance engine (shameless plug) to store the data on app engine's local instances, so you can have faster access to all entities without getting limited by memcache boundaries or datastore latency.
Hint: A local instance gets killed if it surpasses about 185 MB of memory, so you can store actually quite a lot of data in it if you know what you're doing.
Since there is a bounded number of entries, you can memcache all entries and then do the filtering in memory like you've outlined. However, note that each memcache entry cannot exceed 1 MB. But you can fetch up to 32 MB of memcache entries in parallel.
So split the entries into sub sets, memcache the subsets and then read them in parallel by precomputing the memcache key.
More here:
http://code.google.com/appengine/docs/python/memcache/functions.html
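To make the splitting concrete, here is a rough sketch of packing entries into chunks with predictable keys. A plain dict stands in for memcache, the key format `entries:N` is made up, and the 1000-entries-per-chunk default is an assumption based on your estimate of at most 500 bytes per entry.

```python
import pickle

MAX_VALUE_BYTES = 1000000  # stay under the 1 MB per-entry limit


def chunk_key(prefix, index):
    """Build the key of the index-th chunk so readers can precompute it."""
    return "%s:%d" % (prefix, index)


def pack_entries(entries, prefix="entries", per_chunk=1000,
                 max_bytes=MAX_VALUE_BYTES):
    """Return {key: pickled chunk}, refusing chunks above max_bytes."""
    cached = {}
    for index, start in enumerate(range(0, len(entries), per_chunk)):
        blob = pickle.dumps(entries[start:start + per_chunk])
        if len(blob) > max_bytes:
            raise ValueError(
                "chunk %d is %d bytes; lower per_chunk" % (index, len(blob))
            )
        cached[chunk_key(prefix, index)] = blob
    return cached


def unpack_entries(cached, prefix, chunk_count):
    """Rebuild the full entry list from chunk_count cached chunks."""
    entries = []
    for index in range(chunk_count):
        entries.extend(pickle.loads(cached[chunk_key(prefix, index)]))
    return entries
```

The mapping returned by `pack_entries` is in the shape a multi-key set call expects, and the list of keys for the multi-key get can be rebuilt from the chunk count alone. You would also need to store that count somewhere, for instance under its own key.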
Since your data is on the order of 20MB, you may be able to load it entirely into local instance memory, which will be as fast as you can get. Alternately, you could store it as a data file alongside your app, reading which will be faster than accessing the datastore.
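A sketch of the instance-memory approach, assuming a module-level cache that is refreshed after a maximum age. `get_entries`, `search_titles` and the `load_all` callback are hypothetical names, with `load_all` standing in for your datastore fetch. The 10 minute default is arbitrary, and each instance keeps its own copy.

```python
import re
import time

# Module-level state survives between requests served by the same instance.
_cache = {"loaded_at": None, "entries": None}


def get_entries(load_all, max_age_seconds=600, now=time.time):
    """Return cached entries, reloading them when missing or too old."""
    stamp = now()
    expired = (
        _cache["entries"] is None
        or stamp - _cache["loaded_at"] > max_age_seconds
    )
    if expired:
        _cache["entries"] = list(load_all())
        _cache["loaded_at"] = stamp
    return _cache["entries"]


def search_titles(pattern, entries):
    """Filter entries whose title matches the user-supplied regexp."""
    phrase = re.compile(pattern)
    return [entry for entry in entries if phrase.search(entry.title)]
```

Only the first request after a reload pays the fetch cost; the regexp filtering itself runs over objects already in memory.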