Flask, uWSGI and threads produce unreproducible result

Flask, uWSGI and threads produce unreproducible result - python

I have an API based on flask + uwsgi (running in a docker container).
Its input is a text, and its output is a simple json with data about this text (something like {"score": 1}).
I request this API in 5 threads. In my uwsgi.ini I have processes = 8 and threads = 2.
Recently I noticed that some results are not reproducible, though the source code of API didn't change and there is no code provoking random answers inside it.
So I took the same set of queries and fed it to my API, first in straight order, then in the reversed one. About 1% of responses differed!
When I did the same locally (without threads and docker& in just one thread), the results became just the same though. So my hypothesis is that flask somewhat confuses responses to different threads from time to time.
Has anyone dealt with it? I found only https://github.com/getsentry/raven-python/issues/923, but if it's the case then the problem still remains unsolved as fas as I understand...
So here are relevant code pieces:
uwsgi.ini
[uwsgi]
socket = :8000
processes = 8
threads = 2
master = true
module = web:app
requests
import json
from multiprocessing import Pool
import requests
def fetch_score(query):
r = requests.post("http://url/api/score", data = query)
score = json.loads(r.text)
query["score"] = score["score"]
return query
def calculateParallel(arr_of_data):
pool = Pool(processes=5)
results = pool.map(fetch_score, arr_of_data)
pool.close()
pool.join()
return results
results = calculateParallel(final_data)

Related

How to prevent Saxon/C Python bindings from trying to start a new java VM when another is active? (JNI_CreateJavaVM() failed with result: -5)

I am building a flask api that allows users to pass an xml and a transformation that returns the xml on which the transformation is performed using Saxon/C's python API (https://www.saxonica.com/saxon-c/doc/html/saxonc.html).
The incoming endpoint looks like this (removed logging and irrelevant info):
#app.route("/v1/transform/", methods=["POST"])
def transform():
xml = request.data
transformation = request.args.get("transformation")
result = transform_xml(xml, transformation)
return result
The transform function looks like this:
def transform_xml(xml: bytes, transformation: str) -> str:
with saxonc.PySaxonProcessor(license=False) as proc:
base_dir = os.getcwd()
xslt_path = os.path.join(base_dir, "resources", transformation, "main.xslt")
xslt_proc = proc.new_xslt30_processor()
node = proc.parse_xml(xml_text=xml.decode("utf-8"))
result = xslt_proc.transform_to_string(stylesheet_file=xslt_path, xdm_node=node)
return result
The xslt's are locally available and a user should choose one of the available ones by passing the corresponding transformation name.
Now the problem is, this works (fast) for the first incoming call, but the second one crashes:
JNI_CreateJavaVM() failed with result: -5
DAMN ! worker 1 (pid: 517095) died :( trying respawn ...
What does work is changing the transform_xml function like this:
proc = saxonc.PySaxonProcessor(license=False)
xslt_path = self.__get_path_to_xslt(transformation)
xslt_proc = proc.new_xslt30_processor()
node = proc.parse_xml(xml_text=xml.decode("utf-8"))
result = xslt_proc.transform_to_string(stylesheet_file=xslt_path, xdm_node=node)
return result
But this leads to the resources never getting released and over time (1k+ requests) this starts to fill up the memory.
It seems like Saxon is trying to create a new VM while the old one is going down.
I found this thread from 2016: https://saxonica.plan.io/boards/4/topics/6399 but this didn't clear it up for me. I looked at the github for the pysaxon repo, but I have found no answer to this problem.
Also made a ticket at Saxon: https://saxonica.plan.io/issues/4942

Why do I get duplicates into my multithreaded processed list, using pandas and requests in Python?

EDIT2: Solved! Thanks to Michael Butscher's comment, I made a shallow copy of req_params by renaming the argument req_params to req_params_arg and then adding req_params = req_params_arg.copy() in get_assets_api_for_range().
The variable was shared between the threads, so copying before use solved the problem.
EDIT: It seems that "python requests" doesn't like being called simultaneously by several threads, I activated debug mode and I can see that the requests sent to the API are sometimes equal (same range asked) which leads to the duplicates. Weird behavior... Do you think I need to use aiohttp or asyncio??
I'm struggling with the concurrent.futures module in order to fetch a large volume of data from an API.
The API I'm using is limiting results to 100 results per page, then I'm calling it multiple times in order to get all the data.
To accelerate the process I tried to use multithreading (ThreadPoolExecutor) with a maximum of 10 threads.
It works fine and it's very quick but the results are different each time... Sometimes I get duplicates, sometimes not.
It seems not to be thread-safe somewhere but I cannot figure out where... Maybe it is hidden into the functions that uses pandas behind?
I tried to echo the no of pages it's getting and it's pretty correct (not in order but normal):
Fetching 300-400
Fetching 700-800
Fetching 800-900
Fetching 500-600
Fetching 400-500
Fetching 0-100
Fetching 200-300
Fetching 100-200
Fetching 900-1000
Fetching 600-700
Fetching 1100-1159
Fetching 1000-1100
Another weird behavior is here: when I put the line req_sess = authenticate() after the line print("Fetching {}".format(req_params['range'])) in get_assets_api_for_range function, only one page (the last I believe) is fetched multiple times.
Thanks for your help!!
Here is my code (I removed some parts, I should be enough I think), the main function called is get_assets_from_api_in_df():
from functools import partial
import pandas as pd
import requests
import concurrent.futures as cfu
def get_assets_api_for_range(range_to_fetch, req_params):
req_sess = authenticate()
req_params.update({'range': range_to_fetch})
print("Fetching {}".format(req_params['range']))
r = req_sess.get(url=auth_config['endpoint_url'] + '/assets',
params=req_params)
if r.status_code != 200:
raise ConnectionError("API Get assets error: {}".format(r))
json_response = r.json()
# This function in return, processes the list into a dataframe
return process_get_assets_from_api_in_df(json_response["asset_list"])
def get_assets_from_api_in_df() -> pd.DataFrame:
GET_NB_MAX = 100
# First, fetch 1 value to get nb to fetch
req_sess = authenticate()
r = req_sess.get(url=auth_config['endpoint_url'] + '/assets',
params={'range': '0-1'})
if r.status_code != 200:
raise ConnectionError("API Get assets error: {}".format(r))
json_response = r.json()
nb_to_fetch_total = json_response['total']
print("Nb to fetch: {}".format(nb_to_fetch_total))
# Building a queue of ranges to fetch
ranges_to_fetch_queue = []
for nb in range(0, nb_to_fetch_total, GET_NB_MAX):
if nb + GET_NB_MAX < nb_to_fetch_total:
range_str = str(nb) + '-' + str(nb + GET_NB_MAX)
else:
range_str = str(nb) + '-' + str(nb_to_fetch_total)
ranges_to_fetch_queue.append(range_str)
params = {
}
func_to_call = partial(get_assets_api_for_range,
req_params=params)
with cfu.ThreadPoolExecutor(max_workers=10) as executor:
result = list(executor.map(func_to_call, ranges_to_fetch_queue))
print("Fetch finished, merging data...")
return pd.concat(result, ignore_index=True)

Celery-Redis worker showing weird results on Heroku

I have built a Flask application which I have hosted on Heroku, Celery as the worker with Redis as the Broker and for saving the backend on Redis itself, has the following code:
def create_csv_group(orgi,mx):
# Write a csv file with filename 'group'
cols = []
maxx = int(mx)+1
cols.append(['SID','First','Last','Email'])
for i in range(0,int(mx)):
cols.append(['SID'+str(i),'First'+str(i),'Last'+str(i),'Email'+ str(i)])
with open(os.path.join('uploads/','groupfile_t.csv'), 'wb') as f:
writer = csv.writer(f)
for i in range(len(max(cols, key=len))):
writer.writerow([(c[i] if i<len(c) else '') for c in cols])
#app.route('/mark',methods=['POST'])
def mark():
task = create_csv_group.apply_async(args=[orig,mx])
tsk_id = task.id
If I try to access the variable tsk_id, sometimes it gives the error:
variable used before being initialized.
I thought the reason it was not sending the task to the queue before I was accessing the tsk_id. So I moved the function after two form filling pages.
But now, it is not updating/saving the file correctly, it shows weird output in the file(Seems to be the old file data, which should get updated on filling the new form). When I run the same code locally, it runs perfectly fine. I logged the worker, it goes in the task function, runs properly too.
Why is this weird output is being displayed? How can I fix both of the issues, so that it writes properly to the file and check on the task id?

How to speed up web scraping in python

I'm working on a project for school and I am trying to get data about movies. I've managed to write a script to get the data I need from IMDbPY and Open Movie DB API (omdbapi.com). The challenge I'm experiencing is that I'm trying to get data for 22,305 movies and each request takes about 0.7 seconds. Essentially my current script will take about 8 hours to complete. Looking for any way to maybe use multiple requests at the same time or any other suggestions to significantly speed up the process of getting this data.
import urllib2
import json
import pandas as pd
import time
import imdb
start_time = time.time() #record time at beginning of script
#used to make imdb.com think we are getting this data from a browser
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = { 'User-Agent' : user_agent }
#Open Movie Database Query url for IMDb IDs
url = 'http://www.omdbapi.com/?tomatoes=true&i='
#read the ids from the imdb_id csv file
imdb_ids = pd.read_csv('ids.csv')
cols = [u'Plot', u'Rated', u'tomatoImage', u'Title', u'DVD', u'tomatoMeter',
u'Writer', u'tomatoUserRating', u'Production', u'Actors', u'tomatoFresh',
u'Type', u'imdbVotes', u'Website', u'tomatoConsensus', u'Poster', u'tomatoRotten',
u'Director', u'Released', u'tomatoUserReviews', u'Awards', u'Genre', u'tomatoUserMeter',
u'imdbRating', u'Language', u'Country', u'imdbpy_budget', u'BoxOffice', u'Runtime',
u'tomatoReviews', u'imdbID', u'Metascore', u'Response', u'tomatoRating', u'Year',
u'imdbpy_gross']
#create movies dataframe
movies = pd.DataFrame(columns=cols)
i=0
for i in range(len(imdb_ids)-1):
start = time.time()
req = urllib2.Request(url + str(imdb_ids.ix[i,0]), None, headers) #request page
response = urllib2.urlopen(req) #actually call the html request
the_page = response.read() #read the json from the omdbapi query
movie_json = json.loads(the_page) #convert the json to a dict
#get the gross revenue and budget from IMDbPy
data = imdb.IMDb()
movie_id = imdb_ids.ix[i,['imdb_id']]
movie_id = movie_id.to_string()
movie_id = int(movie_id[-7:])
data = data.get_movie_business(movie_id)
data = data['data']
data = data['business']
#get the budget $ amount out of the budget IMDbPy string
try:
budget = data['budget']
budget = budget[0]
budget = budget.replace('$', '')
budget = budget.replace(',', '')
budget = budget.split(' ')
budget = str(budget[0])
except:
None
#get the gross $ amount out of the gross IMDbPy string
try:
budget = data['budget']
budget = budget[0]
budget = budget.replace('$', '')
budget = budget.replace(',', '')
budget = budget.split(' ')
budget = str(budget[0])
#get the gross $ amount out of the gross IMDbPy string
gross = data['gross']
gross = gross[0]
gross = gross.replace('$', '')
gross = gross.replace(',', '')
gross = gross.split(' ')
gross = str(gross[0])
except:
None
#add gross to the movies dict
try:
movie_json[u'imdbpy_gross'] = gross
except:
movie_json[u'imdbpy_gross'] = 0
#add gross to the movies dict
try:
movie_json[u'imdbpy_budget'] = budget
except:
movie_json[u'imdbpy_budget'] = 0
#create new dataframe that can be merged to movies DF
tempDF = pd.DataFrame.from_dict(movie_json, orient='index')
tempDF = tempDF.T
#add the new movie to the movies dataframe
movies = movies.append(tempDF, ignore_index=True)
end = time.time()
time_took = round(end-start, 2)
percentage = round(((i+1) / float(len(imdb_ids))) * 100,1)
print i+1,"of",len(imdb_ids),"(" + str(percentage)+'%)','completed',time_took,'sec'
#increment counter
i+=1
#save the dataframe to a csv file
movies.to_csv('movie_data.csv', index=False)
end_time = time.time()
print round((end_time-start_time)/60,1), "min"

Use Eventlet library to fetch concurently
As advised in comments, you shall fetch your feeds concurrently. This can be done by using treading, multiprocessing, or using eventlet.
Install eventlet
$ pip install eventlet
Try web crawler sample from eventlet
See: http://eventlet.net/doc/examples.html#web-crawler
Understanding concurrency with eventlet
With threading system takes care of switching between your threads. This brings big problem in case you have to access some common data structures, as you never know, which other thread is currently accessing your data. You then start playing with synchronized blocks, locks, semaphores - just to synchronize access to your shared data structures.
With eventlet it goes much simpler - you always run only one thread and jump between them only at I/O instructions or at other eventlet calls. The rest of your code runs uninterrupted and without a risk, another thread would mess up with our data.
You only have to take care of following:
all I/O operations must be non-blocking (this is mostly easy, eventlet provides non-blocking versions for most of the I/O you need).
your remaining code must not be CPU expensive as it would block switching between "green" threads for longer time and the power of "green" multithreading would be gone.
Great advantage with eventlet is, that it allows to write code in straightforward way without spoiling it (too) much with Locks, Semaphores etc.
Apply eventlet to your code
If I understand it correctly, list of urls to fetch is known in advance and order of their processing in your analysis is not important. This shall allow almost direct copy of example from eventlet. I see, that an index i has some significance, so you might consider mixing url and the index as a tuple and processing them as independent jobs.
There are definitely other methods, but personally I have found eventlet really easy to use comparing it to other techniques while getting really good results (especially with fetching feeds). You just have to grasp main concepts and be a bit careful to follow eventlet requirements (keep being non-blocking).
Fetching urls using requests and eventlet - erequests
There are various packages for asynchronous processing with requests, one of them using eventlet and being namederequests see https://github.com/saghul/erequests
Simple sample fetching set of urls
import erequests
# have list of urls to fetch
urls = [
'http://www.heroku.com',
'http://python-tablib.org',
'http://httpbin.org',
'http://python-requests.org',
'http://kennethreitz.com'
]
# erequests.async.get(url) creates asynchronous request
async_reqs = [erequests.async.get(url) for url in urls]
# each async request is ready to go, but not yet performed
# erequests.map will call each async request to the action
# what returns processed request `req`
for req in erequests.map(async_reqs):
if req.ok:
content = req.content
# process it here
print "processing data from:", req.url
Problems for processing this specific question
We are able to fetch and somehow process all urls we need. But in this question, processing is bound to particular record in source data, so we will need to match processed request with index of record we need for getting further details for final processing.
As we will see later, asynchronous processing does not honour order of requests, some are processed sooner and some later and map yields whatever is completed.
One option is to attach index of given url to the requests and use it later when processing returned data.
Complex sample of fetching and processing urls with preserving url indices
Note: following sample is rather complex, if you can live with solution provided above, skip this. But make sure you are not running into problems detected and resolved below (urls being modified, requests following redirects).
import erequests
from itertools import count, izip
from functools import partial
urls = [
'http://www.heroku.com',
'http://python-tablib.org',
'http://httpbin.org',
'http://python-requests.org',
'http://kennethreitz.com'
]
def print_url_index(index, req, *args, **kwargs):
content_length = req.headers.get("content-length", None)
todo = "PROCESS" if req.status_code == 200 else "WAIT, NOT YET READY"
print "{todo}: index: {index}: status: {req.status_code}: length: {content_length}, {req.url}".format(**locals())
async_reqs = (erequests.async.get(url, hooks={"response": partial(print_url_index, i)}) for i, url in izip(count(), urls))
for req in erequests.map(async_reqs):
pass
Attaching hooks to request
requests (and erequests too) allows defining hooks to event called response. Each time, the request gets a response, this hook function is called and can do something or even modify the response.
Following line defines some hook to response:
erequests.async.get(url, hooks={"response": partial(print_url_index, i)})
Passing url index to the hook function
Signature of any hook shall be func(req, *args, *kwargs)
But we need to pass into the hook function also the index of url we are processing.
For this purpose we use functools.partial which allows creation of simplified functions by fixing some of parameters to specific value. This is exactly what we need, if you see print_url_index signature, we need just to fix value of index, the rest will fit requirements for hook function.
In our call we use partial with name of simplified function print_url_index and providing for each url unique index of it.
Index could be provided in the loop by enumerate, in case of larger number of parameters we may work more memory efficient way and use count, which generates each time incremented number starting by default from 0.
Let us run it:
$ python ereq.py
WAIT, NOT YET READY: index: 3: status: 301: length: 66, http://python-requests.org/
WAIT, NOT YET READY: index: 4: status: 301: length: 58, http://kennethreitz.com/
WAIT, NOT YET READY: index: 0: status: 301: length: None, http://www.heroku.com/
PROCESS: index: 2: status: 200: length: 7700, http://httpbin.org/
WAIT, NOT YET READY: index: 1: status: 301: length: 64, http://python-tablib.org/
WAIT, NOT YET READY: index: 4: status: 301: length: None, http://kennethreitz.org
WAIT, NOT YET READY: index: 3: status: 302: length: 0, http://docs.python-requests.org
WAIT, NOT YET READY: index: 1: status: 302: length: 0, http://docs.python-tablib.org
PROCESS: index: 3: status: 200: length: None, http://docs.python-requests.org/en/latest/
PROCESS: index: 1: status: 200: length: None, http://docs.python-tablib.org/en/latest/
PROCESS: index: 0: status: 200: length: 12064, https://www.heroku.com/
PROCESS: index: 4: status: 200: length: 10478, http://www.kennethreitz.org/
This shows, that:
requests are not processed in the order they were generated
some requests follow redirection, so hook function is called multiple times
carefully inspecting url values we can see, that no url from original list urls is reported by response, even for index 2 we got extra / appended. That is why simple lookup of response url in original list of urls would not help us.

When web-scraping we generally have two types of bottlenecks:
IO blocks - whenever we make a request, we need to wait for the server to respond, which can block our entire program.
CPU blocks - when parsing web scraped content, our code might be limited by CPU processing power.
CPU Speed
CPU blocks are an easy fix - we can spawn more processes. Generally, 1 CPU core can efficiently handle 1 process. So if our scraper is running on a machine that has 12 CPU cores we can spawn 12 processes for 12x speed boost:
from concurrent.futures import ProcessPoolExecutor
def parse(html):
... # CPU intensive parsing
htmls = [...]
with ProcessPoolExecutor() as executor:
for result in executor.map(parse, htmls):
print(result)
Python's ProcessPooolExecutor spawns optimal amount of threads (equal to CPU cores) and distributes task through them.
IO Speed
For IO-blocking we have more options as our goal is to get rid of useless waiting which can be done through threads, processes and asyncio loops.
If we're making thousands of requests we can't spawn hundreds of processes. Threads will be less expensive but still, there's a better option - asyncio loops.
Asyncio loops can execute tasks in no specific order. In other words, while task A is being blocked task B can take over the program. This is perfect for web scraping as there's very little overhead computing going on. We can scale to thousands requests in a single program.
Unfortunately, for asycio to work, we need to use python packages that support asyncio. For example, by using httpx and asyncio we can speed up our scraping significantly:
# comparing synchronous `requests`:
import requests
from time import time
_start = time()
for i in range(50):
request.get("http://httpbin.org/delay/1")
print(f"finished in: {time() - _start:.2f} seconds")
# finished in: 52.21 seconds
# versus asynchronous `httpx`
import httpx
import asyncio
from time import time
_start = time()
async def main():
async with httpx.AsyncClient() as client:
tasks = [client.get("http://httpbin.org/delay/1") for i in range(50)]
for response_future in asyncio.as_completed(tasks):
response = await response_future
print(f"finished in: {time() - _start:.2f} seconds")
asyncio.run(main())
# finished in: 3.55 seconds
Combining Both
With async code we can avoid IO-blocks and with processes we can scale up CPU intensive parsing - a perfect combo to optimize web scraping:
import asyncio
import multiprocessing
from concurrent.futures import ProcessPoolExecutor
from time import sleep, time
import httpx
async def scrape(urls):
"""this is our async scraper that scrapes"""
results = []
async with httpx.AsyncClient(timeout=httpx.Timeout(30.0)) as client:
scrape_tasks = [client.get(url) for url in urls]
for response_f in asyncio.as_completed(scrape_tasks):
response = await response_f
# emulate data parsing/calculation
sleep(0.5)
...
results.append("done")
return results
def scrape_wrapper(args):
i, urls = args
print(f"subprocess {i} started")
result = asyncio.run(scrape(urls))
print(f"subprocess {i} ended")
return result
def multi_process(urls):
_start = time()
batches = []
batch_size = multiprocessing.cpu_count() - 1 # let's keep 1 core for ourselves
print(f"scraping {len(urls)} urls through {batch_size} processes")
for i in range(0, len(urls), batch_size):
batches.append(urls[i : i + batch_size])
with ProcessPoolExecutor() as executor:
for result in executor.map(scrape_wrapper, enumerate(batches)):
print(result)
print("done")
print(f"multi-process finished in {time() - _start:.2f}")
def single_process(urls):
_start = time()
results = asyncio.run(scrape(urls))
print(f"single-process finished in {time() - _start:.2f}")
if __name__ == "__main__":
urls = ["http://httpbin.org/delay/1" for i in range(100)]
multi_process(urls)
# multi-process finished in 7.22
single_process(urls)
# single-process finished in 51.28
These foundation concepts sound complex, but once you narrow it down to the roots of the issue, the fixes are very straight and already present in Python!
For more details on this subject see my blog Web Scraping Speed: Processes, Threads and Async

GAE Python LXML - Exceeded soft private memory limit

I am fetching a GZipped LXML file and trying to write Product entries to a Databse Model. Previously I was having local memory issues, which were resolved by help on SO (question). Now I got everything working and deployed it, however on the server I get the following error:
Exceeded soft private memory limit with 158.164 MB after servicing 0 requests total
Now I tried all I know to reduce the memory usage and am currently using the code below. The GZipped file is about 7 MB whereas unzipped it is 80 MB. Locally the code is working fine. I tried running it as HTTP request as well as Cron Job but it didn't make a difference. Now I am wondering if there is any way to make it more efficient.
Some similar questions on SO referred to Frontend and Backend specification, which I am not familiar with. I am running the free version of GAE and this task would have to run once a week. Any suggestions on best way to move forward would be very much appreciated.
from google.appengine.api.urlfetch import fetch
import gzip, base64, StringIO, datetime, webapp2
from lxml import etree
from google.appengine.ext import db
class GetProductCatalog(webapp2.RequestHandler):
def get(self):
user = XXX
password = YYY
url = 'URL'
# fetch gziped file
catalogResponse = fetch(url, headers={
"Authorization": "Basic %s" % base64.b64encode(user + ':' + password)
}, deadline=10000000)
# the response content is in catalogResponse.content
# un gzip the file
f = StringIO.StringIO(catalogResponse.content)
c = gzip.GzipFile(fileobj=f)
content = c.read()
# create something readable by lxml
xml = StringIO.StringIO(content)
# delete unnecesary variables
del f
del c
del content
# parse the file
tree = etree.iterparse(xml, tag='product')
for event, element in tree:
if element.findtext('manufacturer') == 'New York':
if Product.get_by_key_name(element.findtext('sku')):
coupon = Product.get_by_key_name(element.findtext('sku'))
if coupon.last_update_prov != datetime.datetime.strptime(element.findtext('lastupdated'), "%d/%m/%Y"):
coupon.restaurant_name = element.findtext('name')
coupon.restaurant_id = ''
coupon.address_street = element.findtext('keywords').split(',')[0]
coupon.address_city = element.findtext('manufacturer')
coupon.address_state = element.findtext('publisher')
coupon.address_zip = element.findtext('manufacturerid')
coupon.value = '$' + element.findtext('price') + ' for $' + element.findtext('retailprice')
coupon.restrictions = element.findtext('warranty')
coupon.url = element.findtext('buyurl')
if element.findtext('instock') == 'YES':
coupon.active = True
else:
coupon.active = False
coupon.last_update_prov = datetime.datetime.strptime(element.findtext('lastupdated'), "%d/%m/%Y")
coupon.put()
else:
pass
else:
coupon = Product(key_name = element.findtext('sku'))
coupon.restaurant_name = element.findtext('name')
coupon.restaurant_id = ''
coupon.address_street = element.findtext('keywords').split(',')[0]
coupon.address_city = element.findtext('manufacturer')
coupon.address_state = element.findtext('publisher')
coupon.address_zip = element.findtext('manufacturerid')
coupon.value = '$' + element.findtext('price') + ' for $' + element.findtext('retailprice')
coupon.restrictions = element.findtext('warranty')
coupon.url = element.findtext('buyurl')
if element.findtext('instock') == 'YES':
coupon.active = True
else:
coupon.active = False
coupon.last_update_prov = datetime.datetime.strptime(element.findtext('lastupdated'), "%d/%m/%Y")
coupon.put()
else:
pass
element.clear()
UDPATE
According to Paul's suggestion I implemented the backend. After some troubles it worked like a charm - find the code I used below.
My backends.yaml looks as follows:
backends:
- name: mybackend
instances: 10
start: mybackend.app
options: dynamic
And my app.yaml as follows:
handlers:
- url: /update/mybackend
script: mybackend.app
login: admin

Backends are like front end instances but they don't scale and you have to stop and start them as you need them (or set them to be dynamic, probably your best bet here).
You can have up to 1024MB of memory in the backend so it will probably work fine for your task.
https://developers.google.com/appengine/docs/python/backends/overview
App Engine Backends are instances of your application that are exempt from request deadlines and have access to more memory (up to 1GB) and CPU (up to 4.8GHz) than normal instances. They are designed for applications that need faster performance, large amounts of addressable memory, and continuous or long-running background processes. Backends come in several sizes and configurations, and are billed for uptime rather than CPU usage.
A backend may be configured as either resident or dynamic. Resident
backends run continuously, allowing you to rely on the state of their
memory over time and perform complex initialization. Dynamic backends
come into existence when they receive a request, and are turned down
when idle; they are ideal for work that is intermittent or driven by
user activity. For more information about the differences between
resident and dynamic backends, see Types of Backends and also the
discussion of Startup and Shutdown.
It sounds like just what you need. The free usage level will also be OK for your task.

Regarding the backend: looking at the example you have provided - seems like your request is simply handled by frontend instance.
To make it be handled by the backend, try instead calling the task like: http://mybackend.my_app_app_id.appspot.com/update/mybackend
Also, I think you can remove: start: mybackend.app from your backends.yaml

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Flask, uWSGI and threads produce unreproducible result - python

Related

How to prevent Saxon/C Python bindings from trying to start a new java VM when another is active? (JNI_CreateJavaVM() failed with result: -5)

Why do I get duplicates into my multithreaded processed list, using pandas and requests in Python?

Celery-Redis worker showing weird results on Heroku

How to speed up web scraping in python

GAE Python LXML - Exceeded soft private memory limit

Categories

Resources