I am fetching a gzipped XML file and parsing it with lxml, trying to write Product entries to a database model. Previously I was having local memory issues, which were resolved with help on SO (question). Now I have everything working and deployed it; however, on the server I get the following error:
Exceeded soft private memory limit with 158.164 MB after servicing 0 requests total
I have tried everything I know to reduce the memory usage and am currently using the code below. The gzipped file is about 7 MB, whereas unzipped it is 80 MB. Locally the code works fine. I tried running it as an HTTP request as well as a cron job, but it didn't make a difference. Now I am wondering if there is any way to make it more efficient.
Some similar questions on SO referred to frontend and backend instance configuration, which I am not familiar with. I am running the free version of GAE, and this task would only have to run once a week. Any suggestions on the best way to move forward would be very much appreciated.
from google.appengine.api.urlfetch import fetch
import gzip, base64, StringIO, datetime, webapp2
from lxml import etree
from google.appengine.ext import db


class GetProductCatalog(webapp2.RequestHandler):
    def get(self):
        user = XXX
        password = YYY
        url = 'URL'

        # fetch the gzipped file
        catalogResponse = fetch(url, headers={
            "Authorization": "Basic %s" % base64.b64encode(user + ':' + password)
        }, deadline=10000000)
        # the response content is in catalogResponse.content

        # un-gzip the file
        f = StringIO.StringIO(catalogResponse.content)
        c = gzip.GzipFile(fileobj=f)
        content = c.read()

        # create something readable by lxml
        xml = StringIO.StringIO(content)

        # delete unnecessary variables
        del f
        del c
        del content

        # parse the file
        tree = etree.iterparse(xml, tag='product')

        for event, element in tree:
            if element.findtext('manufacturer') == 'New York':
                if Product.get_by_key_name(element.findtext('sku')):
                    coupon = Product.get_by_key_name(element.findtext('sku'))
                    if coupon.last_update_prov != datetime.datetime.strptime(element.findtext('lastupdated'), "%d/%m/%Y"):
                        coupon.restaurant_name = element.findtext('name')
                        coupon.restaurant_id = ''
                        coupon.address_street = element.findtext('keywords').split(',')[0]
                        coupon.address_city = element.findtext('manufacturer')
                        coupon.address_state = element.findtext('publisher')
                        coupon.address_zip = element.findtext('manufacturerid')
                        coupon.value = '$' + element.findtext('price') + ' for $' + element.findtext('retailprice')
                        coupon.restrictions = element.findtext('warranty')
                        coupon.url = element.findtext('buyurl')
                        if element.findtext('instock') == 'YES':
                            coupon.active = True
                        else:
                            coupon.active = False
                        coupon.last_update_prov = datetime.datetime.strptime(element.findtext('lastupdated'), "%d/%m/%Y")
                        coupon.put()
                    else:
                        pass
                else:
                    coupon = Product(key_name=element.findtext('sku'))
                    coupon.restaurant_name = element.findtext('name')
                    coupon.restaurant_id = ''
                    coupon.address_street = element.findtext('keywords').split(',')[0]
                    coupon.address_city = element.findtext('manufacturer')
                    coupon.address_state = element.findtext('publisher')
                    coupon.address_zip = element.findtext('manufacturerid')
                    coupon.value = '$' + element.findtext('price') + ' for $' + element.findtext('retailprice')
                    coupon.restrictions = element.findtext('warranty')
                    coupon.url = element.findtext('buyurl')
                    if element.findtext('instock') == 'YES':
                        coupon.active = True
                    else:
                        coupon.active = False
                    coupon.last_update_prov = datetime.datetime.strptime(element.findtext('lastupdated'), "%d/%m/%Y")
                    coupon.put()
            else:
                pass

            element.clear()
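For reference, the usual lxml idiom for keeping iterparse memory flat goes one step further than element.clear(): it also drops the already-processed siblings that the root element keeps referencing. A rough sketch based on the loop above:

for event, element in etree.iterparse(xml, tag='product'):
    # ... handle the <product> element as above ...
    element.clear()
    # the root element still references processed siblings; drop them too
    while element.getprevious() is not None:
        del element.getparent()[0]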
UPDATE
Following Paul's suggestion I implemented the backend. After some trouble it worked like a charm - find the code I used below.
My backends.yaml looks as follows:
backends:
- name: mybackend
  instances: 10
  start: mybackend.app
  options: dynamic
And my app.yaml as follows:
handlers:
- url: /update/mybackend
  script: mybackend.app
  login: admin
Backends are like front end instances but they don't scale and you have to stop and start them as you need them (or set them to be dynamic, probably your best bet here).
You can have up to 1024MB of memory in the backend so it will probably work fine for your task.
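For instance, per the backends documentation linked below, B8 is the 1024MB / 4.8GHz class, so requesting it is just one extra line in backends.yaml:

backends:
- name: mybackend
  class: B8
  options: dynamic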
https://developers.google.com/appengine/docs/python/backends/overview
App Engine Backends are instances of your application that are exempt from request deadlines and have access to more memory (up to 1GB) and CPU (up to 4.8GHz) than normal instances. They are designed for applications that need faster performance, large amounts of addressable memory, and continuous or long-running background processes. Backends come in several sizes and configurations, and are billed for uptime rather than CPU usage.
A backend may be configured as either resident or dynamic. Resident
backends run continuously, allowing you to rely on the state of their
memory over time and perform complex initialization. Dynamic backends
come into existence when they receive a request, and are turned down
when idle; they are ideal for work that is intermittent or driven by
user activity. For more information about the differences between
resident and dynamic backends, see Types of Backends and also the
discussion of Startup and Shutdown.
It sounds like just what you need. The free usage level will also be OK for your task.
Regarding the backend: looking at the example you have provided, it seems like your request is simply being handled by a frontend instance.
To have it handled by the backend, try calling the task at: http://mybackend.my_app_app_id.appspot.com/update/mybackend
Also, I think you can remove: start: mybackend.app from your backends.yaml
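Since the task only needs to run weekly, a cron.yaml entry along these lines could trigger it (the schedule here is arbitrary, and the target field routing the request to the backend is an assumption worth checking against the docs):

cron:
- description: weekly product catalog import
  url: /update/mybackend
  schedule: every monday 09:00
  target: mybackend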
Generic question
My Python Cloud Function raises about 0.05 memory errors per second - it is invoked about 150 times per second. I get the feeling my function leaves memory residue behind, which causes its instances to crash once they have dealt with many requests. What are the things you should do, or not do, so that your function instances don't eat "a bit more of their allocated memory" each time they're called? I've been pointed to the docs, which say I should delete all temporary files since writing them consumes memory, but I don't think I've written any.
More context
The code of my function can be summed up as the following.
Global context: Grab a file on Google Cloud Storage containing a list of known User-Agents of bots. Instantiate an Error Reporting client.
If User-Agent identifies a bot, return a 200 code. Else parse the arguments of the request, rename them, format them, timestamp the reception of the request.
Send the resulting message to Pub/Sub in a JSON string.
Return a 200 code
I believe my instances are gradually consuming all the memory available, based on a graph I made in Stackdriver (not reproduced here):
It is a heat map of memory usage across my Cloud Function's instances, with red and yellow indicating that most of my function instances consume that range of memory. Because of the cycle that seems to appear, I interpret it as a gradual fill-up of my instances' memory until they crash and new instances are spawned. The cycle remains if I raise the memory allocated to the function; it just raises the upper bound the cycle follows.
Edit: Code excerpt and more context
The requests contain parameters that help implement tracking on an e-commerce website. Now that I copy it here, I notice there might be an anti-pattern where I modify form['products'] while iterating over it, but I don't think that would have anything to do with memory waste?
from json import dumps
from datetime import datetime

from pytz import timezone
from google.cloud import storage
from google.cloud import pubsub
from google.cloud import error_reporting
from unidecode import unidecode

# this is done in global context because I only want to load the BOTS_LIST at
# cold start
PROJECT_ID = '...'
TOPIC_NAME = '...'
BUCKET_NAME = '...'
BOTS_PATH = '.../bots.txt'

gcs_client = storage.Client()
cf_bucket = gcs_client.bucket(BUCKET_NAME)
bots_blob = cf_bucket.blob(BOTS_PATH)
BOTS_LIST = bots_blob.download_as_string().decode('utf-8').split('\r\n')
del cf_bucket
del gcs_client
del bots_blob

err_client = error_reporting.Client()


def detect_nb_products(parameters):
    '''
    Detects number of products in the fields of the request.
    '''
    # ...


def remove_accents(d):
    '''
    Takes a dictionary and recursively transforms its strings into ASCII
    encodable ones
    '''
    # ...


def safe_float_int(x):
    '''
    Custom converter to float / int
    '''
    # ...


def build_hit_id(d):
    '''concatenate specific parameters from a dictionary'''
    # ...


def cloud_function(request):
    """Actual Cloud Function"""
    try:
        time_received = datetime.now().timestamp()
        # filtering bots
        user_agent = request.headers.get('User-Agent')
        if all([bot not in user_agent for bot in BOTS_LIST]):
            form = request.form.to_dict()

            # setting the products field
            nb_prods = detect_nb_products(form.keys())
            if nb_prods:
                form['products'] = [{'product_name': form['product_name%d' % i],
                                     'product_price': form['product_price%d' % i],
                                     'product_id': form['product_id%d' % i],
                                     'product_quantity': form['product_quantity%d' % i]}
                                    for i in range(1, nb_prods + 1)]

            useful_fields = []  # list of keys I'll keep from the form
            unwanted = set(form.keys()) - set(useful_fields)
            for key in unwanted:
                del form[key]

            # float conversion
            if nb_prods:
                for prod in form['products']:
                    prod['product_price'] = safe_float_int(
                        prod['product_price'])

            # adding timestamp/hour/minute, user agent and date to the hit
            form['time'] = int(time_received)
            form['user_agent'] = user_agent
            dt = datetime.fromtimestamp(time_received)
            form['date'] = dt.strftime('%Y-%m-%d')

            remove_accents(form)

            friendly_names = {}  # dict to translate the keys I originally
                                 # receive to human friendly ones
            new_form = {}
            for key in form.keys():
                if key in friendly_names.keys():
                    new_form[friendly_names[key]] = form[key]
                else:
                    new_form[key] = form[key]
            form = new_form
            del new_form

            # logging
            print(form)

            # setting up Pub/Sub
            publisher = pubsub.PublisherClient()
            topic_path = publisher.topic_path(PROJECT_ID, TOPIC_NAME)

            # sending
            hit_id = build_hit_id(form)
            message_future = publisher.publish(topic_path,
                                               dumps(form).encode('utf-8'),
                                               time=str(int(time_received * 1000)),
                                               hit_id=hit_id)
            print(message_future.result())
            return ('OK',
                    200,
                    {'Access-Control-Allow-Origin': '*'})
        else:
            # do nothing for bots
            return ('OK',
                    200,
                    {'Access-Control-Allow-Origin': '*'})
    except KeyError:
        err_client.report_exception()
        return ('err',
                200,
                {'Access-Control-Allow-Origin': '*'})
There are a few things you could try (a theoretical answer - I haven't played with CFs yet):
Explicitly delete the temporary variables that you allocate on the bot-processing path, which may be referencing each other and thus preventing the garbage collector from freeing them (see https://stackoverflow.com/a/33091796/4495081): nb_prods, unwanted, form, new_form, friendly_names, for example.
If unwanted is always the same, make it a global instead.
Delete form before re-binding the name to new_form (otherwise the old form object lingers); deleting new_form afterwards won't actually save much, since that object remains referenced by form. I.e. change:
form = new_form
del new_form
into
del form
form = new_form
Explicitly invoke the garbage collector after you publish your topic and before returning. I'm unsure whether that's applicable to CFs, or whether the invocation is immediately effective (for example, in GAE it's not; see When will memory get freed after completing the request on App Engine Backend Instances?). This may also be overkill and potentially hurt your CF's performance, so see if/how it works for you.
import gc

gc.collect()
I have a python script that creates a few text files, which are then uploaded to my current web host. This is done every 5 minutes. The text files are used in a software program which fetches the latest version every 5 min. Right now I have it running on my web host, but I'd like to move to GAE to improve reliability. (Also because my current web host does not allow for just plain file hosting, per their TOS.)
Is google app engine right for me? I have some experience with python, but none related to web technologies. I went through the basic hello world tutorial and it seems pretty straightforward for a website, but I don't know how I would implement my project. I also worry about any caching which could cause the latest files not to propagate fast enough across google's servers.
Yes and no.
Appengine is great in terms of reliability, server speed, features, etc. However, it has two main drawbacks: You are in a sandboxed environment (no filesystem access, must use datastore), and you are paying by instance hour. Normally, if you're just hosting a small server accessed once in a while, you can get free hosting; if you are running a cron job all day every day, you must use at least one instance at all times, thus costing you money.
Your concerns about speed and propagation on Google's servers are moot; they have a global time server pulsating through their datacenters ensuring your operations are atomic. If you request data with consistency=STRONG, then as long as your get begins after the put, you will see the updated data.
If your text files are always going to be under one meg, and you are not planning to scale to a large number of users, it would be very easy to set up a system that posts your text files into an entity as a TextProperty. If you are a complete newbie to GAE it is probably < 1 hour to get this running. I do this a lot to speed up testing of my HTML work (beats deploying your static files by a mile). Here are some very simple code extracts as an example. (Apologies if I screwed anything up when modifying it to simplify/anonymize.) HTH -stevep
# client side python...
import time
import urllib
import httplib


def processUpdate(filename):
    f = open(filename, 'rb')
    parts = filename.split('/')
    name = parts[len(parts)-1]
    print name
    html = f.read()
    f.close()
    htmlToLoad = urllib.quote(html)
    params = urllib.urlencode({'key': 'your_arbitrary_password_here(or use admin account)',
                               'name': name,
                               'htmlToLoad': htmlToLoad,
                               })
    headers = {'Content-type': 'application/x-www-form-urlencoded', 'Accept': 'text/plain'}
    #conn = httplib.HTTPConnection('your_localhost_here')
    conn = httplib.HTTPConnection('your_app_id_here')
    conn.request('POST', '/your_on-line_handler_url_stub_here', params, headers)
    response = conn.getresponse()
    print '%s, %s, %s' % (filename, response.status, response.reason)


def main():
    startTime = time.time()
    print '----------Start Process----------\n'
    processUpdate('your_full_file01_path_here')
    processUpdate('your_full_file02_path_here')
    processUpdate('your_full_file03_path_here')
    print '\n----------End Process----------', time.time() - startTime


if __name__ == '__main__':
    main()
# GAE side - imports needed by these extracts
import logging
import urllib

import webapp2
from google.appengine.ext import db


# GAE Kind
class Html_Source(db.Model):
    html = db.TextProperty(required=True, indexed=False)
    dateM = db.DateTimeProperty(required=True, indexed=False, auto_now=True)
    v = db.IntegerProperty(required=False, indexed=False, default=1)


# GAE handler classes
EVENTUAL = db.create_config(read_policy=db.EVENTUAL_CONSISTENCY)


class load_test(webapp2.RequestHandler):
    def post(self):
        self.response.clear()
        if (self.request.get('key') != 'your_arbitrary_password_here(or use admin account)'):
            logging.info("----------------------------------bad key")
            return
        name = self.request.get('name')
        rec = Html_Source(
            key_name=name,
            html=urllib.unquote(self.request.get('htmlToLoad')),
        )
        rec.put()
        self.response.out.write('OK=' + name)


class get_test(webapp2.RequestHandler):
    def get(self):
        urlList = self.request.url.split('/')
        name = urlList[len(urlList) - 1]
        extension = name.split('.')
        type = '' if len(extension) < 2 else extension[1]
        typeM = None
        if type == 'js': typeM = 'application/javascript'
        if type == 'css': typeM = 'text/css'
        if type == 'html': typeM = 'text/html'
        self.response.out.clear()
        if typeM: self.response.headers["Content-Type"] = typeM
        logging.info('%s-----name, %s-----typeM' % (name, typeM))
        htmlRec = Html_Source.get_by_key_name(name, config=EVENTUAL)
        if htmlRec is None:
            self.response.out.write('<p>invalid:%s</p>' % (name))
            return
        self.response.out.write(htmlRec.html)
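For completeness, the handlers above still need to be wired into a WSGI application; something along these lines, where the URL patterns are guesses matching the placeholders in the code:

app = webapp2.WSGIApplication([
    ('/your_on-line_handler_url_stub_here', load_test),
    ('/files/.*', get_test),
], debug=True)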
I initialize the cache object as follows:
pageCache = Cache()
cacheDir = os.path.join(path.dirname(path.dirname(__file__)),'pageCache')
pageCache.init_app(flaskApp,config={'CACHE_TYPE': 'filesystem','CACHE_THRESHOLD':1>>10>>10,'CACHE_DIR': cacheDir })
I use pageCache as follows:
class CodeList:
    """
    show code list
    """
    @pageCache.cached(timeout=60)
    def GET(self):
        i = web.input()
        sort = i.get('sort', 'newest')
        pageNo = int(i.get('page', '1'))
        if i.get('pageSize'):
            pageSize = int(i.get('pageSize'))
        else:
            pageSize = DEFAULT_LIST_PAGE_SIZE
        if pageSize > 50:
            pageSize = 50
        items = csModel.getCodeList(sort=sort, pageNo=pageNo, pageSize=pageSize)
        totalCount = csModel.getCodeCount()
        pageInfo = (pageNo, pageSize, totalCount)
        return render.code.list(items, pageInfo)
When I request this page, I get an exception:
type 'exceptions.RuntimeError' at /code-snippet/
working outside of request context
Python C:\Python27\lib\site-packages\flask-0.9-py2.7.egg\flask\globals.py in _lookup_object, line 18
Flask-Cache is - as the name suggests - a Flask extension. So you cannot properly use it if you do not use Flask.
You can use werkzeug.cache instead - Flask-Cache is using it, too. However, depending on your needs it might be a better idea to use e.g. memcached directly - when using a wrapper such as werkzeug.cache you lose all advanced features of your caching engine because it's wrapped with a rather simple/minimalistic API.
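For example, here is a rough sketch of using werkzeug's filesystem cache directly inside the handler from the question (csModel, render, cacheDir and DEFAULT_LIST_PAGE_SIZE are taken from the question; the cache key scheme is made up):

from werkzeug.contrib.cache import FileSystemCache

pageCache = FileSystemCache(cacheDir, threshold=1000, default_timeout=60)

class CodeList:
    def GET(self):
        i = web.input()
        sort = i.get('sort', 'newest')
        pageNo = int(i.get('page', '1'))
        pageSize = min(int(i.get('pageSize', DEFAULT_LIST_PAGE_SIZE)), 50)
        cache_key = 'codelist:%s:%s:%s' % (sort, pageNo, pageSize)
        page = pageCache.get(cache_key)
        if page is None:
            items = csModel.getCodeList(sort=sort, pageNo=pageNo, pageSize=pageSize)
            totalCount = csModel.getCodeCount()
            page = str(render.code.list(items, (pageNo, pageSize, totalCount)))
            pageCache.set(cache_key, page)
        return page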
I'm trying to use couchdb.py to create and update databases. I'd like to implement notification changes, preferably in continuous mode. Running the test code posted below, I don't see how the changes scheme works within python.
import time

import couchdb
# Document and field classes live in couchdb.mapping in current couchdb-python
# (couchdb.schema in older releases)
from couchdb.mapping import Document, IntegerField, TextField


class SomeDocument(Document):
    #############################################################################
    # def __init__ (self):

    intField = IntegerField()  # for now - this should be an integer
    textField = TextField()


couch = couchdb.Server('http://127.0.0.1:5984')
databasename = 'testnotifications'

if databasename in couch:
    print 'Deleting then creating database ' + databasename + ' from server'
    del couch[databasename]
    db = couch.create(databasename)
else:
    print 'Creating database ' + databasename + ' on server'
    db = couch.create(databasename)

for iii in range(5):
    doc = SomeDocument(intField=iii, textField='somestring' + str(iii))
    doc.store(db)
    print doc.id + '\t' + doc.rev

something = db.changes(feed='continuous', since=4, heartbeat=1000)

for iii in range(5, 10):
    doc = SomeDocument(intField=iii, textField='somestring' + str(iii))
    doc.store(db)
    time.sleep(1)
    print something
    print db.changes(since=iii-1)
The value
db.changes(since=iii-1)
returns information that is of interest, but in a format from which I haven't worked out how to extract the sequence or revision numbers, or the document information:
{u'last_seq': 6, u'results': [{u'changes': [{u'rev': u'1-9c1e4df5ceacada059512a8180ead70e'}], u'id': u'7d0cb1ccbfd9675b4b6c1076f40049a8', u'seq': 5}, {u'changes': [{u'rev': u'1-bbe2953a5ef9835a0f8d548fa4c33b42'}], u'id': u'7d0cb1ccbfd9675b4b6c1076f400560d', u'seq': 6}]}
Meanwhile, the code I'm really interested in using:
db.changes(feed='continuous',since=4,heartbeat=1000)
Returns a generator object and doesn't appear to provide notifications as they come in, as the CouchDB guide suggests ....
Has anyone used changes in couchdb-python successfully?
I use long polling rather than continuous, and that works OK for me. In long-polling mode, db.changes blocks until at least one change has happened, and then returns all the changes in a generator object.
Here is the code I use to handle changes. settings.db is my CouchDB Database object.
since = 1
while True:
    changes = settings.db.changes(since=since)
    since = changes["last_seq"]
    for changeset in changes["results"]:
        try:
            doc = settings.db[changeset["id"]]
        except couchdb.http.ResourceNotFound:
            continue
        else:
            # process doc
            pass
As you can see it's an infinite loop where we call changes on each iteration. The call to changes returns a dictionary with two elements, the sequence number of the most recent update and the objects that were modified. I then loop through each result loading the appropriate object and processing it.
For a continuous feed, instead of the while True: line use for changes in settings.db.changes(feed="continuous", since=since).
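A rough sketch of that continuous variant, assuming settings.db as above (the heartbeat value and the seq bookkeeping are illustrative):

for change in settings.db.changes(feed="continuous", since=since, heartbeat=1000):
    if "id" not in change:
        continue  # e.g. the trailing last_seq line carries no document id
    try:
        doc = settings.db[change["id"]]
    except couchdb.http.ResourceNotFound:
        continue
    # process doc
    since = change.get("seq", since)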
I set up a mail spooler using something similar to this. You'll also need to load couchdb.Session(). I also use a filter so the spooler's changes feed only receives unsent emails.
from couchdb import Server

s = Server('http://localhost:5984/')
db = s['testnotifications']

# the since parameter defaults to 'last_seq' when using a continuous feed
ch = db.changes(feed='continuous', heartbeat='1000', include_docs=True)
for line in ch:
    doc = line['doc']
    # process doc here
    doc['priority'] = 'high'
    doc['recipient'] = 'Joe User'
    # doc['state'] + 'sent'
    db.save(doc)
This will allow you access your doc directly from the changes feed, manipulate your data as you see fit, and finally update you document. I use a try/except block on the actual 'db.save(doc)' so I can catch when a document has been updated while I was editing and reload the doc before saving.
I want to generate temporary download links for my users.
Is it OK to use Django to generate the links using URL patterns?
Is that the correct way to do it? I may not fully understand how some of the underlying processes work, and I'm worried it could blow up my memory or cause other problems. Any examples or tools would be appreciated - some nginx or Apache modules, perhaps?
What I want to achieve is a URL pattern that depends on the user and the time; the view then decrypts it and returns a file.
A simple scheme might be to use a hash digest of username and timestamp:
from datetime import datetime
from hashlib import sha1
user = 'bob'
time = datetime.now().isoformat()
plain = user + '\0' + time
token = sha1(plain)
print token.hexdigest()
"1e2c5078bd0de12a79d1a49255a9bff9737aa4a4"
Next you store that token in a memcache with an expiration time. This way any of your webservers can reach it and the token will auto-expire. Finally add a Django url handler for '^download/.+' where the controller just looks up that token in the memcache to determine if the token is valid. You can even store the filename to be downloaded as the token's value in memcache.
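A rough sketch of both halves, assuming Django's cache framework is backed by memcached; create_download_token, the 'download:' key prefix and the one-hour TTL are illustrative choices, not an established API:

from datetime import datetime
from hashlib import sha1

from django.core.cache import cache
from django.http import HttpResponse, Http404

TOKEN_TTL = 60 * 60  # one hour; pick whatever expiry you need


def create_download_token(user, filename):
    # token generation as above; the filename is stored as the token's value
    token = sha1(user + '\0' + datetime.now().isoformat()).hexdigest()
    cache.set('download:' + token, filename, TOKEN_TTL)  # auto-expires
    return token


def download(request, token):
    filename = cache.get('download:' + token)
    if filename is None:
        raise Http404('link expired or invalid')
    response = HttpResponse(open(filename, 'rb').read())
    response['Content-Disposition'] = 'attachment; filename="%s"' % filename
    return response

The matching URL pattern would be something like (r'^download/(?P<token>[0-9a-f]{40})/$', download).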
Yes, it would be OK to let Django generate the URLs - that is separate from handling the URLs in urls.py. Typically, though, you don't want Django to handle serving the files themselves (see the static file docs [1]), so get the notion of serving them through URL patterns out of your head.
What you might want to do is generate a random key using a hash such as md5/sha1. Store the file path, the key, and the datetime it was added in the database, and create the download directory under a root that is served directly by your web server (Apache or nginx; I'd suggest nginx). Since the link is temporary, you'll want a cron job that checks whether the time since the URL was generated has expired, cleans up the file, and removes the DB entry. This should be a Django command for manage.py.
Please note this is example code written just for this answer and not tested! It may not match exactly how you were planning to achieve this, but the approach works. If you want the download to be password-protected as well, look into HTTP basic auth. You can generate and remove entries on the fly in an httpd.auth file using htpasswd and the subprocess module when you create the link, or at registration time.
import hashlib, random, datetime, os, shutil

# model to hold link info. has these fields: key (charfield), filepath (filepathfield),
# datetime (datetimefield), url (charfield), orgpath (filepathfield of the original path,
# or a foreignkey to the files model).
from models import MyDlLink
# settings.py for the app
from myapp import settings as myapp_settings


# full path and name of file to dl.
def genUrl(filepath):
    # create a one-time salt for randomness
    salt = ''.join(['{0}'.format(random.randrange(10)) for i in range(10)])
    key = hashlib.sha1('{0}{1}'.format(salt, filepath)).hexdigest()
    newpath = os.path.join(myapp_settings.DL_ROOT, key)
    shutil.copy2(filepath, newpath)
    newlink = MyDlLink()
    newlink.key = key
    newlink.date = datetime.datetime.now()
    newlink.orgpath = filepath
    newlink.newpath = newpath
    newlink.url = "{0}/{1}/{2}".format(myapp_settings.DL_URL, key, os.path.basename(filepath))
    newlink.save()
    return newlink


# in commands
def check_url_expired():
    maxage = datetime.timedelta(days=7)
    now = datetime.datetime.now()
    for link in MyDlLink.objects.all():
        if (now - link.date) > maxage:
            os.remove(link.newpath)
            link.delete()
[1] http://docs.djangoproject.com/en/1.2/howto/static-files/
It sounds like you are suggesting using some kind of dynamic url conf.
Why not forget your concerns by simplifying and setting up a single url that captures a large encoded string that depends on user/time?
(r'^download/(?P<encrypted_id>.*)/$', 'download_file'),  # use your own regexp

def download_file(request, encrypted_id):
    decrypted = decrypt(encrypted_id)
    _file = get_file(decrypted)
    return _file
A lot of sites just use a get param too.
www.example.com/download_file/?09248903483o8a908423028a0df8032
If you are concerned about performance, look at the answers in this post: Having Django serve downloadable files
Where the use of the apache x-sendfile module is highlighted.
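Roughly, the X-Sendfile variant of such a view might look like this (a sketch assuming mod_xsendfile is enabled in Apache; decrypt and get_file_path are hypothetical helpers):

import os
from django.http import HttpResponse

def download_file(request, encrypted_id):
    path = get_file_path(decrypt(encrypted_id))  # hypothetical helpers
    response = HttpResponse()
    response['X-Sendfile'] = path  # Apache streams the file; Django never reads it
    response['Content-Disposition'] = 'attachment; filename="%s"' % os.path.basename(path)
    return response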
Another alternative is to simply redirect to the static file served by whatever means from django.
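As the simplest sketch of that last option (the static path and the decrypt helper are assumptions):

from django.shortcuts import redirect

def download_file(request, encrypted_id):
    decrypted = decrypt(encrypted_id)
    return redirect('/static/downloads/%s' % decrypted)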