Importing data into Google App Engine - python

Recently I had to import 48,000 records into Google App Engine. The stored 'tables' are ndb.Model types. Each of these records is checked against a couple of other 'tables' in the 'database' for integrity purposes and then written (.put()).
To do this, I uploaded a .csv file into Google Cloud Storage and processed it from there in a task queue. This processed about 10 .csv rows per second and errored after 41,000 records with an out of memory error. Splitting the .csv file into 2 sets of 24,000 records each fixed this problem.
So, my questions are:
a) is this the best way to do this?
b) is there a faster way (the next upload might be around 400,000 records)? and
c) how do I get over (or stop) the out of memory error?
Many thanks,
David

1) Have you thought about (even temporarily) upgrading your server instances?
https://cloud.google.com/appengine/docs/standard/#instance_classes
2) I don't think a 41,000-row CSV is enough to run out of memory, so you probably need to change your processing:
a) Break up the processing using multiple tasks, rolling your own cursor to process a couple thousand at a time, then spinning up a new task.
b) Experiment with ndb.put_multi()
Sharing some code of your loop and puts might help
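In the meantime, here's a rough sketch of what (a) and (b) could look like combined, assuming the cloudstorage client library; Record and validate_row() are hypothetical stand-ins for your own model and integrity checks:
import csv
import itertools
import cloudstorage as gcs
from google.appengine.ext import ndb, deferred

BATCH_SIZE = 500

def import_batch(gcs_path, start_row=0):
    # Read only this task's slice of the CSV, never the whole file.
    with gcs.open(gcs_path) as f:
        rows = list(itertools.islice(csv.reader(f), start_row, start_row + BATCH_SIZE))

    entities = []
    for row in rows:
        if validate_row(row):  # your integrity checks against the other 'tables'
            entities.append(Record(name=row[0], value=row[1]))

    # One batched RPC instead of one .put() per row; skip the in-context cache.
    ndb.put_multi(entities, use_cache=False)

    if len(rows) == BATCH_SIZE:
        # More rows remain: hand the next slice to a fresh task with fresh memory.
        deferred.defer(import_batch, gcs_path, start_row + BATCH_SIZE)
Each task touches at most BATCH_SIZE rows, so memory stays bounded no matter how big the CSV gets.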

The ndb in-context cache could be contributing to the memory errors. From the docs:
When executing long-running queries in background tasks, it's possible for the in-context cache to consume large amounts of memory. This is because the cache keeps a copy of every entity that is retrieved or stored in the current context. To avoid memory exceptions in long-running tasks, you can disable the cache or set a policy that excludes whichever entities are consuming the most memory.
You can prevent caching on a case by case basis by setting a context option in your ndb calls, for example
foo.put(use_cache=False)
Completely disabling caching might degrade performance if you are often using the same objects for your comparisons. If that's the case, you could flush the cache periodically to stop it getting too big.
if some_condition:
    context = ndb.get_context()
    context.clear_cache()
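If you'd rather keep the cache for repeated lookups, a minimal sketch of the periodic flush (the loop and the build_entity helper are just illustrative):
from google.appengine.ext import ndb

FLUSH_EVERY = 2000

def process_rows(rows):
    ctx = ndb.get_context()
    for i, row in enumerate(rows, 1):
        entity = build_entity(row)  # hypothetical helper that builds your model
        entity.put()
        if i % FLUSH_EVERY == 0:
            # Drop the cached copies accumulated so far before they pile up.
            ctx.clear_cache()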

Related

Managing Heroku RAM for Unique Application

I have a Flask application that allows users to query a fairly small database (2.4M rows) using SQL. It's similar to HackerRank but more limited in scope. It's deployed on Heroku.
I've noticed during testing that I can predictably hit an R14 error (memory quota exceeded) or R15 (memory quota greatly exceeded) by running large queries. The queries that typically cause this are outside what a normal user might do, such as SELECT * FROM some_huge_table. That said, I am concerned that these errors will become a regular occurrence for even small queries when 5, 10, 100 users are querying at the same time.
I'm looking for some advice on how to manage memory quotas for this type of interactive site. Here's what I've explored so far:
Changing the # of gunicorn workers. This has had some effect but I still hit R14 and R15 errors consistently.
Forced limits on user queries, based on either text or the EXPLAIN output. This does work to reduce memory usage, but I'm afraid it won't scale to even a very modest # of users.
Moving to a higher Heroku tier. The plan I use currently provides ~512MB RAM. The largest plan is around 14GB. Again, this would help but won't even moderately scale, to say nothing of the associated costs.
Reducing the size of the database significantly. I would like to avoid this if possible. Doing the napkin math, shrinking a table from 1.9M rows to 10k or 50k would greatly reduce the application's memory needs and scale better, but there would still be some moderate maximum usage limit.
As you can see, I'm a novice at best when it comes to memory management. I'm looking for some strategies/ideas on how to solve this general problem, and if it's the case that I need to either drastically cut the data size or throw tons of $ at this, that's OK too.
Thanks
Coming from my personal experience, I see two approaches:
1. plan for it
For your example, this means calculating the maximum memory a request could use, multiplying it by the number of gunicorn workers, and using dynos big enough to cover that.
This can be a valid approach for other applications, but I don't think it is for yours.
2. reduce memory usage, solution 1
The fact that so much application memory is used makes me think your code is probably loading the whole result set into memory (possibly even multiple times, in multiple formats) before returning it to the client.
In the end, your application is only getting the data from the database and converting it to some output format (JSON/CSV?).
What you are probably searching for is streaming responses.
Your Flask view will work on a record-by-record basis: it reads a single record, converts it to your output format, and sends it to the client before reading the next one.
Both your database client library and Flask support this (on the database side it is usually exposed as server-side cursors / iterators).
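A minimal sketch of a streaming CSV response, assuming Heroku Postgres via psycopg2 (adapt to whatever driver you use); DATABASE_URL, the route, and the query parameter name are placeholders:
import csv
import io
import os
import psycopg2
from flask import Flask, Response, request

app = Flask(__name__)

@app.route("/query")
def run_query():
    sql = request.args["q"]  # see the security sidenote below

    def generate():
        conn = psycopg2.connect(os.environ["DATABASE_URL"])
        try:
            cur = conn.cursor(name="streaming_cursor")  # server-side cursor
            cur.itersize = 1000  # rows fetched per round trip
            cur.execute(sql)
            for row in cur:
                buf = io.StringIO()
                csv.writer(buf).writerow(row)
                yield buf.getvalue()
        finally:
            conn.close()

    return Response(generate(), mimetype="text/csv")
The worker now holds at most itersize rows at a time instead of the full result set.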
2. reduce memory usage, solution 2
Other services often go for simple pagination, or cap result sets outright, to manage server-side memory.
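A sketch of the capping idea: wrap whatever SQL the user submits in an outer query with a hard row limit. MAX_ROWS and the subquery wrapping are illustrative and assume the user input is a single SELECT:
MAX_ROWS = 1000

def capped(user_sql, page=0):
    # Even "SELECT * FROM some_huge_table" now returns at most MAX_ROWS rows.
    return "SELECT * FROM ({}) AS user_query LIMIT {} OFFSET {}".format(
        user_sql, MAX_ROWS, page * MAX_ROWS)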
security sidenote
It sounds like users can actually define the SQL statement in their API requests. This is a security and application risk. Apart from issuing INSERT, UPDATE, or DELETE statements, a user could craft a SQL statement that will not only blow your application's memory, but also break your database.

Large memory Python background jobs

I am running a Flask server which loads data into a MongoDB database. Since there is a large amount of data, and this takes a long time, I want to do this via a background job.
I am using Redis as the message broker and Python-rq to implement the job queues. All the code runs on Heroku.
As I understand, python-rq uses pickle to serialise the function to be executed, including the parameters, and adds this along with other values to a Redis hash value.
Since the parameters contain the information to be saved to the database, they are quite large (~50 MB), and when this is serialised and saved to Redis, it not only takes a noticeable amount of time but also consumes a large amount of memory. Redis plans on Heroku cost $30 p/m for only 100 MB. In fact I very often get OOM errors like:
OOM command not allowed when used memory > 'maxmemory'.
I have two questions:
Is python-rq well suited to this task or would Celery's JSON serialisation be more appropriate?
Is there a way to serialise not the parameter itself but rather a reference to it?
Your thoughts on the best solution are much appreciated!
Since you mentioned in your comment that your task input is a large list of key value pairs, I'm going to recommend the following:
Load up your list of key/value pairs in a file.
Upload the file to Amazon S3.
Get the resulting file URL, and pass that into your RQ task.
In your worker task, download the file.
Parse the file line-by-line, inserting the documents into Mongo.
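A rough sketch of steps 2-5, assuming boto3, pymongo and one JSON document per line in the uploaded file; the bucket, database and collection names are placeholders:
import json
import boto3
from pymongo import MongoClient
from redis import Redis
from rq import Queue

BUCKET = "my-import-bucket"

def enqueue_import(local_path, key):
    boto3.client("s3").upload_file(local_path, BUCKET, key)  # step 2
    Queue(connection=Redis()).enqueue(import_from_s3, key)   # step 3: only the key crosses Redis

def import_from_s3(key):
    body = boto3.client("s3").get_object(Bucket=BUCKET, Key=key)["Body"]  # step 4
    collection = MongoClient()["mydb"]["mycollection"]
    for line in body.iter_lines():  # step 5: one document at a time
        collection.insert_one(json.loads(line))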
Using the method above, you'll be able to:
Quickly break up your tasks into manageable chunks.
Upload these small, compressed files to S3 quickly (use gzip).
Greatly reduce your Redis usage by requiring much less data to be passed over the wire.
Configure S3 to automatically delete your files after a certain amount of time (there are S3 settings for this: you can have it delete automatically after 1 day, for instance).
Greatly reduce memory consumption on your worker by processing the file one line at-a-time.
For use cases like what you're doing, this will be MUCH faster and require much less overhead than sending these items through your queueing system.
Hope this helps!
It turns out that the solution that worked for me was to save the data to Amazon S3 storage and then pass the URI to the function in the background task.

How to autoload a small amount of config data that is read frequently but almost never updated?

We have a small amount of data that is read frequently but almost never updated (site config and some selection items like state and county information). I think that if I can move it into application memory instead of any database, our I/O performance would improve significantly.
But we have a lot of web servers, and I cannot figure out a good way to notify all of them to reload this data.
You are likely looking for a cache pattern: Is there a Python caching library? You just need to ask how stale you can afford to be. If you are looking this up on every request, even a short-lived cache can massively improve performance. It's likely though that this information can live for minutes or hours without too much risk of being "stale".
If you can't live with a stale cache, I've implemented solutions that use a single database call which keeps track of the last-updated date for any of the cached data. This at least reduces the freshness check to a single database call.
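A minimal sketch of that idea, with get_config_version() and load_config_from_db() standing in for your own single-row check and full reload:
import threading

_lock = threading.Lock()
_cached = {"version": None, "data": None}

def get_config():
    current = get_config_version()  # one cheap DB call, e.g. SELECT max(updated_at)
    with _lock:
        if _cached["version"] != current:
            _cached["data"] = load_config_from_db()  # full reload only when stale
            _cached["version"] = current
        return _cached["data"]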
Be aware though, as soon as you are sharing updateable information, you have to deal with multi-threaded updates of shared state. Make sure you understand the implications of this. Hopefully your caching library handles this gracefully.

Google App Engine, Datastore and Task Queues, Performance Bottlenecks?

We're designing a system that will take thousands of rows at a time and send them via JSON to a REST API built on Google App Engine. Typically 3-300KB of data but let's say in extreme cases a few MB.
The REST API app will then adapt this data to models on the server and save them to the Datastore. Are we likely to (eventually if not immediately) encounter any performance bottlenecks here with Google App Engine, whether it's working with that many models or saving so many rows of data at a time to the datastore?
The client does a GET to get thousands of records, then a PUT with thousands of records. Is there any reason for this to take more than a few seconds and to necessitate the Task Queue API?
The only bottleneck in App Engine (apart from the single entity group limitation) is how many entities you can process in a single thread on a single instance. This number depends on your use case and the quality of your code. Once you reach a limit, you can (a) use a more powerful instance, (b) use multi-threading and/or (c) add more instances to scale up your processing capacity to any level you desire.
The Task Queue API is a very useful tool for large data loads. It allows you to split your job into a large number of smaller tasks, set the desired processing rate, and let App Engine automatically adjust the number of instances to meet that rate. Another option is the MapReduce API.
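Purely as an illustration, splitting an incoming batch across tasks with the Task Queue API might look like the sketch below; the queue name, URL and chunk size are example values, and the /tasks/import handler would build the models and put_multi() them:
import json
from google.appengine.api import taskqueue

CHUNK = 500

def enqueue_import(records):
    # Each task gets a bounded slice, so no single request does all the work.
    for i in range(0, len(records), CHUNK):
        taskqueue.add(
            url="/tasks/import",
            payload=json.dumps(records[i:i + CHUNK]),
            queue_name="import")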
This is a really good question, one that I've been asked in interviews, seen pop up in a lot of different situations as well. Your system essentially consists of two things:
Saving (or writing) models to the data store
Reading from the data store.
From my experience of this problem, when you view these two things separately you're able to come up with solid solutions to both. I typically use a cache, such as memcached, in order to keep data easily accessible for reading. At the same time, for writing, I try to have a main db and a few slave instances as well. All the writes will go to the slave instances (thereby not locking up the main db for reads that sync to the cache), and the writes to the slave dbs can be distributed in a round-robin approach, thereby ensuring that your insert statements are not skewed by any of the model's attributes having a high occurrence.

Is there a way to cache the fetch output?

I'm working on a closed system running in the cloud.
What I need is a search function that uses user-typed-in regexp to filter the rows in a dataset.
phrase = re.compile(request.get("query"))
data = Entry.all().fetch(50000)  # this takes around 10s when there are 6000 records
result = [x for x in data if phrase.search(x.title)]
Now, the database itself won't change too much, and there will be no more than 200-300 searches a day.
Is there a way to somehow cache all the Entries (I expect that there will be no more than 50,000 of them, each no bigger than 500 bytes), so retrieving them won't take >10 seconds? Or perhaps to parallelize it? I don't mind 10 CPU seconds, but I do mind the 10 seconds that the user has to wait.
To address any answers like "index it and use .filter()" - the query is a regexp, and I don't know of any indexing mechanism that would allow using a regexp.
You can also use cachepy or performance engine (shameless plug) to store the data on app engine's local instances, so you can have faster access to all entities without getting limited by memcache boundaries or datastore latency.
Hint: A local instance gets killed if it surpasses about 185 MB of memory, so you can actually store quite a lot of data in it if you know what you're doing.
Since there is a bounded number of entries, you can memcache all of them and then do the filtering in memory as you've outlined. However, note that each memcache entry cannot exceed 1 MB. But you can fetch up to 32 MB of memcache entries in parallel.
So split the entries into subsets, memcache the subsets, and then read them back in parallel by precomputing the memcache keys.
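A sketch of that sharding, assuming ~1,000 entities per shard keeps each value comfortably under the 1 MB limit; the key prefix and shard size are arbitrary examples:
from google.appengine.api import memcache

SHARD_SIZE = 1000
PREFIX = "entries:"

def cache_entries(entries):
    shards = {PREFIX + str(i): entries[i:i + SHARD_SIZE]
              for i in range(0, len(entries), SHARD_SIZE)}
    memcache.set_multi(shards)
    memcache.set(PREFIX + "count", len(shards))

def load_entries():
    count = memcache.get(PREFIX + "count") or 0
    keys = [PREFIX + str(i * SHARD_SIZE) for i in range(count)]
    shards = memcache.get_multi(keys)  # one parallel fetch for all shards
    return [entry for key in keys if key in shards for entry in shards[key]]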
More here:
http://code.google.com/appengine/docs/python/memcache/functions.html
Since your data is on the order of 20 MB, you may be able to load it entirely into local instance memory, which will be as fast as you can get. Alternatively, you could store it as a data file alongside your app; reading that will be faster than accessing the datastore.
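A sketch of the instance-memory approach, reusing the Entry.all() query from the question; the fetch is paid once per instance start-up rather than on every request:
_entries = None

def get_entries():
    global _entries
    if _entries is None:
        _entries = Entry.all().fetch(50000)  # loaded once, reused by every request on this instance
    return _entries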
