I am trying to use SimpleDB in the following way.
I want to keep 48 hours' worth of data in SimpleDB at any time and query it for different purposes.
Each domain holds 1 hour's worth of data, so at any time there are 48 domains present in SimpleDB.
As new data is constantly uploaded, I delete the oldest domain and create a new domain for each new hour.
Each domain is about 50 MB in size; the total size of all the domains is around 2.2 GB.
Each item in a domain has the following attributes:
identifier - around 50 characters long -- 1 per item
timestamp - timestamp value -- 1 per item
serial_n_data - 500-1000 bytes data -- 200 per item
I'm using the Python boto library to upload and query the data.
I send 1 item/sec, each with around 200 attributes, into the domain.
For one application of this data, I need to get all the data from all 48 domains.
The query looks like "SELECT * FROM domain", run against each domain.
I use 8 threads to query data with each thread taking responsibility of few domains.
e.g domain 1-6 thread 1
domain 7-12 thread 2 and so on
It takes close to 13 minutes to get the entire data set. I am using boto's select method for this. I need much faster performance than this. Any suggestions on speeding up the querying process? Is there another language I could use that would speed things up?
Use more threads
I would suggest inverting your threads/domain ratio from 1/6 to something closer to 30/1. Most of the time taken to pull down large chunks of data from SimpleDB is going to be spent waiting. In this situation upping the thread count will vastly improve your throughput.
One of the limits of SimpleDB is the 1 MB cap on query response size. This means pulling down the 50 MB in a single domain will take a minimum of 50 Selects (the original plus 49 additional pages). These must occur sequentially, because the NextToken from the current response is needed for the next request. If each Select takes 2+ seconds (not uncommon with large responses and high request volume), you spend 2 minutes on each domain. If every thread has to iterate through each of 6 domains in turn, that's about 12 minutes right there. One thread per domain should cut that down to about 2 minutes easily.
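That sequential paging can be sketched as a loop. Here `run_select` is a hypothetical stand-in for whatever function issues one Select request (boto's result sets handle NextToken internally, so this is illustrative rather than boto API):

```python
def select_all(run_select, query):
    """Drain a paginated Select: each page's NextToken is required
    to request the following page, so the calls must be sequential."""
    items, token = [], None
    while True:
        page = run_select(query, next_token=token)
        items.extend(page["items"])
        token = page.get("next_token")
        if token is None:
            return items
```

The key point is that nothing inside this loop can be parallelized; concurrency has to come from running many such loops at once.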
But you should be able to do much better than that. SimpleDB is optimized for concurrency. I would try 30 threads per domain, giving each thread a portion of the hour to query on, since it is log data after all. For example:
SELECT * FROM domain WHERE timestamp between '12:00' and '12:02'
(Obviously, you'd use real timestamp values.) All 30 queries can be kicked off without waiting for any responses. In this way you still need to make at least 50 queries per domain, but instead of making them all sequentially you get a lot more concurrency. You will have to test for yourself how many threads give you the best throughput. I would encourage you to try up to 60 per domain, breaking the Select conditions down into one-minute increments. If that works for you, then you will have fully parallel queries and most likely have eliminated all follow-up pages. If you get 503 ServiceUnavailable errors, scale back the threads.
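A sketch of that fan-out, assuming a `run_select` callable that performs one Select plus any follow-up pages (the domain name and 60-slot split are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timedelta

def minute_queries(domain, hour_start, slots=60):
    """Split one hour of log data into per-slot SELECT statements
    that can all be issued concurrently."""
    step = timedelta(minutes=60 // slots)
    queries = []
    for i in range(slots):
        lo = hour_start + i * step
        hi = lo + step
        queries.append(
            "SELECT * FROM `%s` WHERE timestamp >= '%s' AND timestamp < '%s'"
            % (domain, lo.isoformat(), hi.isoformat()))
    return queries

def fetch_hour(run_select, domain, hour_start, threads=60):
    """Issue all slot queries at once; run_select stands in for one
    boto Select call plus its follow-up pages."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        results = pool.map(run_select, minute_queries(domain, hour_start))
    return [item for batch in results for item in batch]
```

Running one `fetch_hour` per domain (48 in total) gives the fully parallel shape described above.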
The domain is the basic unit of scalability for SimpleDB, so it is good that you have a convenient way to partition your data. You just need to take advantage of the concurrency. Rather than 13 minutes, I wouldn't be surprised if you were able to get the data in 13 seconds for an app running on EC2 in the same region. But the actual time it takes will depend on a number of other factors.
Cost Concerns
As a side note, I should mention the costs of what you are doing, even though you haven't raised the issue. CreateDomain and DeleteDomain are heavyweight operations. Normally I wouldn't advise using them so often. You are charged about 25 seconds of box usage each time, so creating and deleting one each hour adds up to about $70 per month just for domain management. You can store orders of magnitude more data in a domain than the 50 MB you mention, so you might want to let the data accumulate longer before you delete. If your queries include the timestamp (or could be made to include it), query performance may not be hurt at all by having an extra GB of old data in the domain. In any case, GetAttributes and PutAttributes never suffer a performance hit from a large domain size; it is only queries that don't make good use of a selective index that do. You'd have to test your queries to see. That is just a suggestion; I realize the create/delete approach is cleaner conceptually.
Also, writing 200 attributes at a time is expensive, due to a quirk in the box usage formula: the box usage for writes is proportional to the number of attributes raised to the power of 3! The formula, in hours, is:
0.0000219907 + 0.0000000002 N^3
for the base charge plus the per-attribute charge, where N is the number of attributes. In your situation, if you write all 200 attributes in a single request, the box usage charges will be about $250 per million items ($470 per million if you write 256 attributes). If you break each request into 4 requests with 50 attributes each, you will quadruple your PutAttributes volume but reduce the box usage charges by an order of magnitude, to about $28 per million items. If you are able to break the requests down, it may be worth doing. If you cannot (due to request volume, or just the nature of your app), SimpleDB can end up being extremely unappealing from a cost standpoint.
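To see how the cubic term plays out, here is a small calculation. The $0.14 per box-usage hour rate is an assumption on my part, so treat the dollar figures as approximate:

```python
def put_box_usage_hours(n_attrs):
    """Box-usage hours charged for one PutAttributes call with N attributes,
    per the formula above: base charge plus a cubic per-attribute term."""
    return 0.0000219907 + 0.0000000002 * n_attrs ** 3

HOURLY_RATE = 0.14  # assumed $/box-usage-hour; check current pricing

def cost_per_million_items(attrs_per_request, requests_per_item):
    hours = requests_per_item * put_box_usage_hours(attrs_per_request)
    return 1_000_000 * hours * HOURLY_RATE

single = cost_per_million_items(200, 1)  # one 200-attribute request per item
split = cost_per_million_items(50, 4)    # four 50-attribute requests per item
```

Because of the cube, the split version is roughly an order of magnitude cheaper even though it makes four times as many requests.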
I have had the same issue as you, Charlie. After profiling the code, I narrowed the performance problem down to SSL. It seems that is where it spends most of its time, and hence CPU cycles.
I have read of a problem in the httplib library (which boto uses for SSL) where performance doesn't increase unless the packets are over a certain size, though that was for Python 2.5 and may have already been fixed.
SDB Explorer uses multithreaded BatchPutAttributes to achieve high write throughput while uploading bulk data to Amazon SimpleDB. It allows multiple parallel uploads: if you have the bandwidth, you can take full advantage of it by running a number of BatchPutAttributes processes at once in a parallel queue, which reduces the time spent processing.
http://www.sdbexplorer.com/
Related
I'm using a web API to call and receive data to build out an SQL database for historical energy prices. For context, energy prices are set at what are called "nodes", and each node has 20 years of historical data.
I can receive the data in JSON or XML format. I need to do one operation with the received data before I put it into the SQL database. Namely, I need to convert each hour given in Eastern Daylight Time back to its Eastern Standard Time equivalent.
Being brand new to Python (learned in last two weeks), I initially went down a path more intuitive to me:
HTTP Request (XML format) -> Parse to XML object in Python -> Convert Datetime -> Place in SQL database
The total size of the data I'm attempting to get is roughly 150 GB. Because of this, I wanted to get the data in an asynchronous manner and format/insert it into SQL as it came in from hundreds of API calls (there's a 50,000-row limit on what I can get at a time). I was using a ThreadPool to do this. Once the data was received, I attempted to use a ProcessPool to convert it into the format I needed for my SQL database, but was unsuccessful.
Looking at the process from a high level, I think this process can be a lot more efficient. I think I can do the following:
HTTP Request (JSON Format) -> Parse to JSON object in Python -> Perform operation to convert datetime (map value using dictionary?) -> Place into SQL database
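A minimal sketch of the parse-and-convert step, assuming hypothetical `hour` and `price` fields in the API's JSON. EDT is one hour ahead of EST, so converting an EDT wall-clock time to its EST equivalent is a fixed one-hour shift:

```python
import json
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M"

def edt_to_est(stamp):
    """EDT is one hour ahead of EST, so shift the wall-clock time back an hour."""
    return (datetime.strptime(stamp, FMT) - timedelta(hours=1)).strftime(FMT)

def convert_rows(payload):
    """Parse the JSON response and normalize each row's timestamp;
    the output tuples are ready for an executemany() INSERT."""
    return [(edt_to_est(row["hour"]), row["price"])
            for row in json.loads(payload)]
```

A dictionary mapping EDT hour strings to EST ones, as you suggest, would also work and avoids re-parsing the same 24 hour labels repeatedly.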
I just discovered the OPENJSON library in Python. Is this all I need to do this?
Another issue I need to look into is the limitations of SQLite3. Each node will have its own table in my database, so ideally I'd like as many instances of my program as possible getting, parsing, and putting data into my SQLite3 database.
Any help would be much appreciated!
There is no definite answer to your question given so many unknowns, but I can outline a way to get to the solution.
Factors That Influence Performance
The processing is done in stages as you described (I'll abstract away the actual format for now for the reasons I'll describe a bit later):
Fetch data from the remote service
Parse data
Convert data
Store into local DB
For every stage there are limiting factors that prevent you from increasing processing speed.
For fetching data some of them are:
network bandwidth.
the parallelism the remote server supports: the remote server may throttle connections and/or total speed for a single user, or the terms of usage may require you to limit this on the client side.
the data format used when downloading. Different formats add their own amounts of unneeded/boilerplate formatting and/or data sent over the network. It depends on the service and its API, but it may be that the returned XML is smaller than the JSON, so even though XML is usually more verbose, for your particular case XML may be better.
the amount of RAM (and swap speed) may be a limit on your system in the (very unlikely) case that factors #1 and #2 do not limit you. In this case the downloaded data may not fit into RAM, will be swapped to disk, and this will slow down the download process.
For parsing the data:
RAM amount (for the same reasons as above)
Data format used.
The parser used. Different parser implementations (for JSON, for example) have different speeds.
CPU power: speed and number of processing units.
For data conversion:
RAM amount
CPU power
For data storing:
disk speed
the level of parallelism the DB supports efficiently
These are not all the factors that limit processing speed, just some of the most obvious. There are also other, unknown limitations.
Also, there may be some overhead when passing data between stages. It depends on the design. In some designs (for example, a single process that reads the data from the remote server, processes it in memory, and stores it to the database) the overhead may be zero; in others (multiple processes read data and store it to files, another set of processes opens those files and processes them, and so on) the overhead may be quite big.
The final speed of processing is determined by the speed of the slowest stage or by the speed of passing data between stages.
Not all of these factors can be predicted when you design a solution or choose between several designs. Given that there are unknown factors, this is even more complicated.
Approach
To be systematic I would use the following approach:
create a simple solution (e.g., a single process that reads data, processes it, and stores it to the database)
find the processing speed of every stage using that solution
when you have the processing speed of every stage, look at the slowest one (note that it only makes sense to look at the slowest, as it defines the overall speed)
then find out:
why is it slow?
what limits the speed, and can that be improved?
what is the theoretical limit of that stage? (for example, if you have a 1 Gb network and one processing box, you can't read data at more than ~120 MB/s; in practice it will be even less)
Improve. The improvement is usually one of:
optimize the processing done by a single processor (e.g., choose a better format or library for parsing, remove operations that can be avoided, etc.). If you have hit (or are close to) the theoretical limit of the processing speed, you can't use this option.
add more parallelism
In general when you try to optimize something you need to have numbers and compare them when you are doing experiments.
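The measure-first approach above can be sketched as a tiny harness that times each stage of a single-process pipeline; the stage functions here are placeholders for your real fetch/parse/convert/store code:

```python
import time

def timed(stage, fn, *args):
    """Run one pipeline stage and record how long it took, so the
    slowest stage can be identified before optimizing anything."""
    start = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - start
    print("%-8s %.3fs" % (stage, elapsed))
    return result, elapsed

def run_pipeline(fetch, parse, convert, store):
    """Chain the four stages, returning per-stage timings to compare."""
    raw, t1 = timed("fetch", fetch)
    parsed, t2 = timed("parse", parse, raw)
    converted, t3 = timed("convert", convert, parsed)
    _, t4 = timed("store", store, converted)
    return {"fetch": t1, "parse": t2, "convert": t3, "store": t4}
```

Whichever key in the returned dict dominates is the stage worth investigating first.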
Parallelism
Python
You should be careful when choosing between threads and processes; for example, threads are not good for CPU-intensive tasks. For more information on this, see Multiprocessing vs Threading Python.
SQLite
SQLite may have some limitations when multiple processes work with a single database. You need to check whether it is the limiting factor for your speed. Maybe you need to use another database that is a better fit for parallelism, and then, as an additional final step, dump the data from it into SQLite in a single shot (that would only require reading data sequentially and storing it in SQLite, which may be much more efficient than parallel writes to a single SQLite DB).
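One common pattern that sidesteps parallel writes to a single SQLite file entirely is a single writer thread fed by a queue; any number of fetch/parse workers can produce rows without contending on the database. A sketch, with a hypothetical `prices` table:

```python
import queue
import sqlite3
import threading

def writer(db_path, q):
    """The only thread that touches SQLite: it drains the queue and
    inserts rows, so workers never contend on the database file."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS prices (node TEXT, hour TEXT, price REAL)")
    while True:
        row = q.get()
        if row is None:  # sentinel: all workers are done
            break
        conn.execute("INSERT INTO prices VALUES (?, ?, ?)", row)
    conn.commit()
    conn.close()

def run(db_path, rows):
    """In the real pipeline, parallel workers would q.put() rows here."""
    q = queue.Queue()
    t = threading.Thread(target=writer, args=(db_path, q))
    t.start()
    for row in rows:
        q.put(row)
    q.put(None)
    t.join()
```

The queue becomes the hand-off point between the parallel stages and the inherently serial store stage.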
So I decided to benchmark the REST API I developed using Django REST Framework. The request I send is a GET request that retrieves the latest 50 posts from the database and returns them in JSON format.
Using Apache Benchmark, the stats were:
Server Software: nginx/1.4.6
Concurrency Level: 100
Time taken for tests: 18.394 seconds
Complete requests: 1000
Failed requests: 0
Non-2xx responses: 1000
Total transferred: 5628000 bytes
HTML transferred: 5447000 bytes
Requests per second: 54.36 [#/sec] (mean)
Time per request: 1839.442 [ms] (mean)
Time per request: 18.394 [ms] (mean, across all concurrent requests)
Transfer rate: 298.79 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 17 1137 1899.3 31 12366
Processing: 25 189 314.2 31 1418
Waiting: 24 184 309.4 29 1415
Total: 44 1326 1846.3 888 12407
Percentage of the requests served within a certain time (ms)
50% 888
66% 1178
75% 1775
80% 2286
90% 3434
95% 4576
98% 7859
99% 7922
100% 12407 (longest request)
This is obviously incredibly slow... but I'm not sure how I can improve it.
PS: I am very new to developing a server and want to learn from this. In the above GET request, I am not doing any sort of threading on the server side. All it is doing is:
user_id = str(request.QUERY_PARAMS.get("user_id", None))
cur = connection.cursor()
cur.execute("SELECT * FROM get_posts(%s)", [user_id]) # This is a Function in the SQL database
return Response(convertToDict(cur))
I want to improve the speed of that GET request, so what can I possibly do to make it faster?
Well, I am a little surprised to see a raw SQL query (but that's another story); you can do all sorts of things.
TL;DR
Upfront
Doing performance testing is great, and regularly benchmarking and recording those outcomes over time is a good practice, but benchmarking can be tricky to do properly: you have to take software and hardware into account, since the outcomes of your tests will heavily depend on the interactions of those two things. Do your best to replicate your production environment, and try out different configurations (you are 12factor, right?) to determine a good fit.
Side note: I'm not terribly familiar with ab, but according to the output all 1000 responses were non-2xx and you are returning HTML, which doesn't seem to be the intended behavior.
Fixing the problem
The first thing to do is evaluate what you have done in a thoughtful way.
Inspect the queries
Use things like django-debug-toolbar to see if you have some query bottlenecks - many queries that are chained together, long running queries, etc. If you need to get more granular, your database probably has logging facilities to record the long queries.
Assuming your data is pretty normalized (in the sense of normal forms), this may be a place to introduce denormalization so you don't have to traverse as many relationships.
You could also introduce raw SQL (but you seem to already be doing that).
Inspect your business logic
You should be diligent in making sure your business logic is placed in the correct part of the request/response cycle. Many times you put things in places just to get them working; maybe your initial decision is reaching its limits.
It seems like you are doing something very simple: get the last 50 entries in a table. If you are computing whether or not a post is included, you should probably leave that to the database - it should be handling all of the logic when it comes to what data to retrieve.
Inspect supporting code
While you're at it, try doing some more performance testing and see which areas of your code are lagging behind. Maybe there are things you can do that will improve your code (while keeping it readable and understandable by others) and give you a performance bump. List comprehensions, generators, taking advantage of prefetch_related and select_related, taking care to evaluate queries lazily: all of these are worth implementing because their functionality is well documented and understood. That said, be sure to document these decisions carefully for your future self and possibly others.
I'm not really familiar with your implementation of the view code as it relates to the Django REST Framework; I would probably stick to the JSON serializers that come with it.
Workarounds
Another useful trick is to implement a pagination strategy (most likely with the REST Framework's built-in support) so the data only gets transferred to the client in small pieces. This will be heavily dependent on the use case.
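The idea can be sketched independently of the REST Framework's pagination classes (which you would likely use in practice rather than rolling your own):

```python
def paginate(items, page, per_page=50):
    """Return one page of results plus enough metadata for the client
    to request the next page, instead of shipping everything at once."""
    start = (page - 1) * per_page
    chunk = items[start:start + per_page]
    return {
        "page": page,
        "results": chunk,
        "has_next": start + per_page < len(items),
    }
```

In a real view, the slicing would happen in the database query (LIMIT/OFFSET or keyset pagination) rather than on a Python list.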
This serves as a nice introduction to:
Throwing Software at the Problem
You can use a cache to save data in the RAM of your server so it is quickly accessed by Django.
Generally, what cache will work best will depend on the data itself. It may be the case that using a search engine to store documents that you query frequently will be most useful. But, a good start is Redis. You can read all about implementing the cache from a variety of sources but a good place to search around with Django is on Django Packages.
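The read-through pattern looks the same regardless of backend; this toy in-process cache just stands in for Redis, and the `latest_posts` helper and its `fetch` callback are illustrative, not Django or DRF APIs:

```python
import time

class TTLCache:
    """Toy in-process cache with expiry; in production Redis (e.g. via
    a Django cache backend) plays this role, but the pattern is the same."""
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.store = {}

    def get(self, key):
        entry = self.store.get(key)
        if entry and time.monotonic() - entry[1] < self.ttl:
            return entry[0]
        return None  # missing or expired

    def set(self, key, value):
        self.store[key] = (value, time.monotonic())

def latest_posts(cache, user_id, fetch):
    """Serve from cache when possible; only hit the database on a miss."""
    posts = cache.get(user_id)
    if posts is None:
        posts = fetch(user_id)  # the expensive SELECT from the view above
        cache.set(user_id, posts)
    return posts
```

The TTL is the knob to tune: too long and users see stale posts, too short and the cache stops absorbing load.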
Throwing Hardware at the Problem
Speed can also be about hardware. You should think about the requirements of your software and its dependencies. Do some testing, search around, and experiment with what's right for you. Keep in mind that throwing more hardware at a problem has severe diminishing marginal returns.
Can you post your get_posts(user_id) method?
Steps to improve your performance
Make improvements to the get_posts() method. You need to make sure there is a minimum number of queries to the database. Try to get the results using one .filter, and make use of select_related and prefetch_related to reduce the database calls: https://docs.djangoproject.com/en/1.8/ref/models/querysets/#prefetch-related
Add .extra to the .filter if required; through it you can add additional attributes to the model instances that cannot be fetched with a single query: https://docs.djangoproject.com/en/1.8/ref/models/querysets/#extra
Make these changes to the get_posts() and see how your GET request responds. If it still lags you can opt for caching.
Most of the time consumed will be in the database calls. If you optimize get_posts(), you may well be satisfied with the performance.
I am looking around for an answer to what the maximum number of results is that I can get from a GQL query on NDB on Google App Engine. I am using an implementation with cursors, but it would be much faster to retrieve them all at once.
This depends on lots of things, like the size of the entities and the number of values that need to be looked up in the index, so it's best to benchmark it for your specific application. Also beware that if you find that on a sunny day it takes e.g. 10 seconds to load all your items, that probably means some small fraction of your queries will run into a timeout due to natural variations in datastore performance, and occasionally your app will hit the timeout all the time when the datastore is having a bad day (it happens).
Basically, you don't have the old limit of 1000 entities per query anymore, but consider using a reasonable limit, because you can hit the timeout error; it's better to get them in batches so users won't wait during load time.
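Batched retrieval with cursors can be sketched generically; `fetch_page` below is a stand-in that mimics the `(results, cursor, more)` contract of NDB's `fetch_page()`:

```python
def fetch_in_batches(fetch_page, batch_size=100):
    """Drain a cursor-paginated query in fixed-size batches.
    fetch_page(batch_size, cursor) must return (results, cursor, more),
    mirroring NDB's fetch_page() contract."""
    cursor, out, more = None, [], True
    while more:
        results, cursor, more = fetch_page(batch_size, cursor)
        out.extend(results)
    return out
```

With real NDB this would be `query.fetch_page(batch_size, start_cursor=cursor)`; the batch size trades fewer round trips against per-request timeout risk.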
I'm testing Google App Engine and Django-nonrel with the free quota. It seems to me that the database operations to the Datastore are hideously slow.
Take for example this simplified function processing a request, which takes in a multipart/form-data of XML blobs, parses them and inserts them to the database:
def post(request):
    fields = cgi.FieldStorage(request)
    with transaction.commit_on_success():
        for xmlblob in fields.getlist('xmlblob'):
            blob_object = parse_xml(xmlblob)
            blob_object.save()
blob_object has five fields, all of them of type CharField.
For just ca. 30 blobs (with about 1 kB of XML altogether), that function takes 5 seconds to return and uses over 30000 api_cpu_ms. CPU time should be equivalent to the amount of work a 1.2 GHz Intel x86 processor could do in that time, but I am pretty sure it would not take 30 seconds to insert 30 rows into a database on any x86 processor available.
Without saving objects to database (that is, just parsing the XML and throwing away the result) the request takes merely milliseconds.
So is Google App Engine really so slow that I can't save even a few dozen entities to the Datastore in a normal request, or am I missing something here? And of course, even if I did the inserts in some backend or by using a task queue, it would still cost hundreds of times more than what would seem acceptable.
Edit: I found out that, by default, GAE does two index writes per property for each entity. Most of those properties should not be indexed, so the question is: how can I set properties as unindexed in Django-nonrel?
I still do feel though, that even with index writes, the database operation is taking ridiculous amount of time.
In the absence of batch operations, there's not much you can do to reduce wallclock times. Batch operations are pretty essential to reducing wallclock time on App Engine (or any distributed platform with RPCs, really).
Under the current billing model, CPU milliseconds reported by the datastore reflect the cost of the operation rather than the actual time it took, and are a way of billing for resources. Under the new billing model, these will be billed explicitly as datastore operations, instead.
I have not found a real answer yet, but I made some calculations for the cost. Currently every indexed property field costs around $0.20 to $0.30 per 10k inserts. With the upcoming billing model (Pricing FAQ) the cost will be exactly $0.1 per 100k operations, or $0.2 per indexed field per 100k inserts with 2 index write operations per insert.
So as the price seems to go down by a factor of ten, the observed slowness is indeed unexpected behaviour. As the free quota is plenty for my test runs, and with the new pricing model coming, I won't let it bother me at this time.
We've got a reasonably-sized database on Google App Engine - just over 50,000 entities - that we want to clear out stale data from. The plan was to write a deferred task to iterate over the entities we no longer wanted, and delete them in batches.
One complication is that our entities also have child entities that we also want to purge -- no problem, we thought; we'd just query the datastore for those entities, and drop them at the same time as the parent:
query = ParentKind.all()
query.filter('bar =', 'foo')
to_delete = []
for entity in query.fetch(100):
    to_delete.append(entity)
    to_delete.extend(ChildKindA.all().ancestor(entity).fetch(100))
    to_delete.extend(ChildKindB.all().ancestor(entity).fetch(100))
db.delete(to_delete)
We limited ourselves to deleting 100 ParentKind entities at a time; each ParentKind had around 40 child ChildKindA and ChildKindB entities total - perhaps 4000 entities.
This seemed reasonable at the time, but we ran one batch as a test, and the resulting query took 9 seconds to run -- and spent 1933 seconds in billable CPU time accessing the datastore.
This seems pretty harsh -- 0.5 billable seconds per entity! -- but we're not entirely sure what we're doing wrong. Is it simply the size of the batch? Are ancestor queries particularly slow? Or, are deletes (and indeed, all datastore accesses) simply slow as molasses?
Update
We changed our queries to be keys_only, and while that reduced the time to run one batch to 4.5 real seconds, it still cost ~1900 seconds in CPU time.
Next, we installed Appstats to our app (thanks, kevpie) and ran a smaller sized batch -- 10 parent entities, which would amount to ~450 entities total. Here's the updated code:
query = ParentKind.all(keys_only=True)
query.filter('bar =', 'foo')
to_delete = []
for key in query.fetch(10):
    to_delete.append(key)
    to_delete.extend(ChildKindA.all(keys_only=True).ancestor(key).fetch(100))
    to_delete.extend(ChildKindB.all(keys_only=True).ancestor(key).fetch(100))
db.delete(to_delete)
The results from Appstats:
service.call #RPCs real time api time
datastore_v3.RunQuery 22 352ms 555ms
datastore_v3.Delete 1 366ms 132825ms
taskqueue.BulkAdd 1 7ms 0ms
The Delete call is the single most expensive part of the operation!
Is there a way around this? Nick Johnson mentioned that using the bulk delete handler is the fastest way to delete at present, but ideally we don't want to delete all entities of a kind, just the ones that match, and are children of, our initial bar = foo query.
We recently added a bulk-delete handler, documented here. It takes the most efficient possible approach to bulk deletion, though it still consumes CPU quota.
If you want to spread out the CPU burn, you could create a map reduce job. It will still iterate over every entity (this is a current limitation of the mapper API). However, you can check if each entity meets the condition and delete or not at that time.
To slow down the CPU usage, assign the mapper to a task queue that you've configured to run slower than normal. You can spread the run time out over several days and not eat up all your CPU quota.