Memory leak in my Google App Engine code - python

I have the following code that is trying to loop over a large table (~100k rows; ~30GB)
def updateEmailsInLoop(cursor=None, stats={}):
    BATCH_SIZE = 10
    try:
        rawEmails, next_cursor, more = RawEmailModel.query().fetch_page(BATCH_SIZE, start_cursor=cursor)
        for index, rawEmail in enumerate(rawEmails):
            stats = process_stats(rawEmail, stats)
        i = 0
        while more and next_cursor:
            rawEmails, next_cursor, more = RawEmailModel.query().fetch_page(BATCH_SIZE, start_cursor=next_cursor)
            for index, rawEmail in enumerate(rawEmails):
                stats = process_stats(rawEmail, stats)
            i = (i + 1) % 100
            if i == 99:
                logging.info("foobar: Finished 100 more %s", str(stats))
                write_stats(stats)
    except DeadlineExceededError:
        logging.info("foobar: Deadline exceeded")
        for index, rawEmail in enumerate(rawEmails[index:], start=index):
            stats = process_stats(rawEmail, stats)
        if more and next_cursor:
            deferred.defer(updateEmailsInLoop, cursor=next_cursor, stats=stats, _queue="adminStats")
However, I keep getting the following error:
While handling this request, the process that handled this request was found to be using too much memory and was terminated. This is likely to cause a new process to be used for the next request to your application. If you see this message frequently, you may have a memory leak in your application.
...and sometimes....
Exceeded soft private memory limit of 128 MB with 154 MB after servicing 9 requests total
I changed my code so that I only ever pull in 10 entries at a time, so I don't get why I'm still running out of memory.

There are 3 ways to do this kind of job (iterating over a large set of rows in the datastore):
Process 1 batch of x entities and create a task (push queue) using the cursor.
Process 1 batch of x entities and respond to the browser with a bit of javascript that shows the progress and changes window.location to a link that contains the cursor and the current progress. (this is my preferred approach)
Use mapreduce (it's harder to code, but it can be applied to 10M-1B rows).
For most of my apps that needed this, x is usually between 100 and 500.
Here is the code I use for iterating over 1.5M-2M rows to generate some reports or update stuff in my db. For reports I save an entity that contains the information I need in CSV format, and at the end I read all the entities, merge them, and delete them. (I did this to generate 1.5M rows of Excel data.)
(it's Java, but should be easily translated to Python):
resp.getWriter().println("<html><head>");
resp.getWriter().println(
"<script type='text/javascript'>function f(){window.location.href='/do/convert/" + this.getClass().getSimpleName() + "?cursor=" + cursorString + "&count="
+ count + "';}</script>");
resp.getWriter().println("</head><body onload='f()'>");
resp.getWriter().println(
"<a href='/do/convert/" + this.getClass().getSimpleName() + "?cursor=" + cursorString + "&count=" + count + "'>Next page -->" + cursorString + " </a>");
resp.getWriter().println("</body></html>");
If your "progress" is big and messy, save it in entities (one or more, depending on what you are doing)
If you are doing the task version, i recommend to either use task names or to make your tasks idempotent (especially if your counting stuff).
If your counting stuff, i recommend saving entities that contain the keys of the entities that you are counting, and at the end, count those.
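For the task version (option 1), a minimal Python sketch of the cursor-chaining pattern might look like this (RawEmailModel, process_stats and write_stats are the names from the question above; the batch size and queue name are illustrative assumptions, not a drop-in fix):

from google.appengine.ext import deferred
import logging

BATCH_SIZE = 200  # somewhere in the 100-500 range usually works well

def process_batch(cursor=None, stats=None):
    stats = stats or {}
    emails, next_cursor, more = RawEmailModel.query().fetch_page(
        BATCH_SIZE, start_cursor=cursor)
    for email in emails:
        stats = process_stats(email, stats)   # per-entity work, as in the question
    if more and next_cursor:
        # Chain the next batch as a fresh task, so each request stays small
        # and never holds more than one batch of entities in memory.
        deferred.defer(process_batch, cursor=next_cursor, stats=stats, _queue="adminStats")
    else:
        write_stats(stats)
        logging.info("done iterating")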

Related

Neo4j: dependence of execution speed on batch size of input parameters

I'm using Neo4J to identify the connections between different node labels.
Neo4J 4.4.4 Community Edition
DB rolled out in docker container with k8s orchestrating.
MATCH (source_node: Person) WHERE source_node.name in $inputs
MATCH (source_node)-[r]->(child_id:InternalId)
WHERE r.valid_from <= datetime($actualdate) < r.valid_to
WITH [type(r), toString(date(r.valid_from)), child_id.id] as child_path, child_id, false as filtered
CALL apoc.do.when(filtered,
'RETURN child_path as full_path, NULL as issuer_id',
'OPTIONAL MATCH p_path = (child_id)-[:HAS_PARENT_ID*0..50]->(parent_id:InternalId)
WHERE all(a in relationships(p_path) WHERE a.valid_from <= datetime($actualdate) < a.valid_to) AND
NOT EXISTS{ MATCH (parent_id)-[q:HAS_PARENT_ID]->() WHERE q.valid_from <= datetime($actualdate) < q.valid_to}
WITH DISTINCT last(nodes(p_path)) as i_source,
reduce(st = [], q IN relationships(p_path) | st + [type(q), toString(date(q.valid_from)), endNode(q).id])
as parent_path, CASE WHEN length(p_path) = 0 THEN NULL ELSE parent_id END as parent_id, child_path
OPTIONAL MATCH (i_source)-[r:HAS_ISSUER_ID]->(issuer_id:IssuerId)
WHERE r.valid_from <= datetime($actualdate) < r.valid_to
RETURN DISTINCT CASE issuer_id WHEN NULL THEN child_path + parent_path + [type(r), NULL, "NOT FOUND IN RELATION"]
ELSE child_path + parent_path + [type(r), toString(date(r.valid_from)), toInteger(issuer_id.id)]
END as full_path, issuer_id, CASE issuer_id WHEN NULL THEN true ELSE false END as filtered',
{filtered: filtered, child_path: child_path, child_id: child_id, actualdate: $actualdate}
)
YIELD value
RETURN value.full_path as full_path, value.issuer_id as issuer_id, value.filtered as filtered
When the query is executed for a large number of incoming names (Person), it is processed quickly: for example, 100,000 inputs take ~2.5 seconds. However, if the 100,000 names are divided into small batches and the query is executed sequentially for each batch, the overall processing time increases dramatically:
batches of 100 names: ~2 min
batches of 1000 names: ~10 sec
Could you please give me a clue why exactly this is happening? And how could I get the same execution time as for the entire dataset, regardless of the batch size?
Is there any possibility to divide transactions into multiple processes? I tried Python multiprocessing with the Neo4j driver. It works faster, but still cannot achieve the target execution time of 2.5 sec for some reason.
Is there any possibility to keep the entire graph in memory during the whole container lifecycle? Could that help resolve the issue with execution speed on multiple batches instead of the entire dataset?
Essentially, the goal is to use batches as small as possible while processing the entire dataset.
Thank you.
PS: Any suggestions to improve the query are very welcome.
You pass in a list, so it can use an index to efficiently filter down the results by handing the whole list to the index, and then you do additional aggressive filtering on properties.
So if you run the query with PROFILE you will see how much data is loaded / touched at each step.
A single execution makes more efficient use of resources like heap and page cache.
For individual batched executions it has to go through the whole machinery (driver, query parsing, planning, runtime) each time, and depending on whether you execute your queries in parallel (do you?) or sequentially, the next query has to wait until the previous one has finished.
Multiple executions also contend for resources like memory, IO, and network.
Python is also not the fastest driver, especially if you send/receive larger volumes of data; try one of the other languages if that serves you better.
Why don't you just always execute one large batch then?
With Neo4j EE (e.g. on Aura) or CE 5 you will also get better runtimes and execution.
Yes, if you configure your page cache large enough to hold the store, it will keep the graph in memory during execution.
If you run PROFILE with your query you should also see page-cache faults when it needs to fetch data from disk.
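For what it's worth, a minimal Python sketch (assumed URI and credentials, with CYPHER holding the query from the question) of keeping one long-lived driver and sending the whole name list in a single parameterized call, instead of paying the driver/parse/plan overhead once per batch:

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def run_lookup(names, actualdate):
    # One session, one execution: the index is hit once with the full list,
    # and query parsing/planning happens only once.
    with driver.session() as session:
        result = session.run(CYPHER, inputs=names, actualdate=actualdate)
        return [record.data() for record in result]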

Redis as a Queue - Bulk Retrieval

Our Python application serves around 2 million API requests per day. We got a new requirement from our business to generate a report that should contain the count of unique requests and responses every day.
We would like to use Redis for queuing all the requests & responses.
Another worker instance will retrieve the above data from Redis queue and process it.
The processed results will be persisted to the database.
The simplest option is to use LPUSH and RPOP. But RPOP returns one value at a time, which will hurt performance. Is there any way to do a bulk pop from Redis?
Other suggestions for the scenario would be highly appreciated.
A simple solution would be to use Redis pipelining.
In a single request you are allowed to perform multiple RPOP instructions.
Most Redis drivers support it. In Python with redis-py it looks like this:
import redis

r = redis.Redis()  # connection to the Redis instance holding the queue
pipe = r.pipeline()
# The following RPOP commands are buffered client-side
pipe.rpop('requests')
pipe.rpop('requests')
pipe.rpop('requests')
pipe.rpop('requests')
# the EXECUTE call sends all buffered commands to the server, returning
# a list of responses, one for each command.
pipe.execute()
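A small usage sketch building on the same idea (queue name taken from the question, batch size assumed): buffer a fixed number of RPOPs, then drop the None results you get once the list runs empty:

BATCH = 100
pipe = r.pipeline()
for _ in range(BATCH):
    pipe.rpop('requests')
# One round trip; missing items come back as None when the queue is drained.
items = [item for item in pipe.execute() if item is not None]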
You can approach this from a different angle. Your requirement is:
requirement ... to generate the report which should contain the count of unique request and response every day.
Rather than storing requests in lists and then post-processing the results, why not use Redis features to solve the actual requirement and avoid the problem of bulk LPUSH/RPOP?
If all we want is to record the unique counts, then you may want to consider using sorted sets.
This may go like this:
Collect the request statistics
# Collect the request statistics in a sorted set.
# The key includes the date so we can do "by date" stats.
import datetime
key = 'requests:' + datetime.date.today().isoformat()
r.zincrby(key, request, 1)
Report request statistics
You can use ZSCAN to iterate over all members in batches, but this is unordered.
You can use ZRANGE to get all members in one go (or in whatever range you want), ordered.
Python code:
# ZSCAN: iterate over all members of the set in batches of about 10.
# The result is an unordered list.
# zscan_iter returns tuples (member, score)
batchSize = 10
for memberTuple in r.zscan_iter(key, match=None, count=batchSize):
    member = memberTuple[0]
    score = memberTuple[1]
    print str(member) + ' --> ' + str(score)

# ZRANGE: get all members of the set, ordered by score.
# Here maxRank=-1 means "no max".
minRank = 0
maxRank = -1
for memberTuple in r.zrange(key, minRank, maxRank, desc=False, withscores=True):
    member = memberTuple[0]
    score = memberTuple[1]
    print str(member) + ' --> ' + str(score)
Benefits of this approach
Solves the actual requirement - reports on the count of unique requests by day.
No need to post-process anything.
Can do additional queries like "top requests" out of the box :)
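For example, a hedged redis-py one-liner for a "top requests" report, using the same key as above:

# Ten highest-scored members, i.e. the ten most frequent requests for that day.
top_ten = r.zrevrange(key, 0, 9, withscores=True)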
Another approach would be to use the Hyperloglog data structure.
It was especially designed for this kind of use case.
It allows counting unique items with a low error margin (0.81%) and with a very low memory usage.
Using HLL is really simple:
PFADD myHll "<request1>"
PFADD myHll "<request2>"
PFADD myHll "<request3>"
PFADD myHll "<request4>"
Then to get the count:
PFCOUNT myHll
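The redis-py equivalent is just as small (the key name here is an assumption):

# Add observed requests to the HyperLogLog, then read the approximate unique count.
r.pfadd('requests:hll', request)
unique_requests = r.pfcount('requests:hll')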
The actual question was about a Redis list. You can use LRANGE to get all values in a single call; below is a solution:
import redis
r_server = redis.Redis("localhost")
r_server.rpush("requests", "Adam")
r_server.rpush("requests", "Bob")
r_server.rpush("requests", "Carol")
print r_server.lrange("requests", 0, -1)
print r_server.llen("requests")
print r_server.lindex("requests", 1)

Speeding up a large process run over some data obtained from a database

So I am working on a project in which I have to read a large database (large for me) of 10 million records. I cannot really filter them, because I have to treat them all individually. For each record I must apply a formula and then write the result into multiple files, depending on certain conditions of the record.
I have implemented a few algorithms, and finishing the whole processing takes around 2-3 days. This is a problem because I am trying to optimise a process that already takes this long. 1 day would be acceptable.
So far I have tried indexes on the database and threading (of the processing of each record, not of the I/O operations). I cannot get a shorter time.
I am using Django, and I fail to measure how long it really takes to start treating the data due to its lazy behaviour. I would also like to know if I can start treating the data as soon as I receive it, rather than having to wait for all the data to be loaded into memory before I can actually process it. It could also be my understanding of write operations in Python. Lastly, it could be that I need a better machine (I doubt it: I have 4 cores and 4GB of RAM, it should be able to give better speeds).
Any ideas? I really appreciate the feedback. :)
Edit: Code
Explanation:
The records I talked about are IDs of customers (passports), and the conditions are whether there are agreements between the different terminals of the company (countries). The process is a hashing.
The first strategy tries to treat the whole database... We begin with some preparation for the condition part of the algorithm (agreements between countries), then a large verification by set membership.
Since I've been trying to improve it on my own, for the second strategy I tried to cut the problem into parts, treating the query by parts (obtaining the records that belong to a country and writing to the files of the countries that have an agreement with it).
The threaded strategy is not shown because it was designed for a single country, and I got awful results compared with the non-threaded version. I honestly have the intuition it has to be a matter of memory and SQL.
def create_all_files(strategy=0):
    if strategy == 0:
        set_countries_agreements = set()
        file_countries = open(os.path.join(PROJECT_ROOT, 'list_countries'))
        set_countries_temp = set(line.strip() for line in file_countries)
        file_countries.close()
        set_countries = sorted_nicely(set_countries_temp)
        for each_country in set_countries:
            set_agreements = frozenset(get_agreements(each_country))
            set_countries_agreements.add(set_agreements)
        print("All agreements obtained")
        set_passports = Passport.objects.all()
        print("All passports obtained")
        for each_passport in set_passports:
            for each_agreement in set_countries_agreements:
                for each_country in each_agreement:
                    if each_passport.nationality == each_country:
                        with open(os.path.join(PROJECT_ROOT, 'generated_indexes/%s' % iter(each_agreement).next()), "a") as f:
                            f.write(generate_hash(each_passport.nationality + "<" + each_passport.id_passport, each_country) + "\n")
                        print(".")
                    print("_")
                print("-")
            print("~")

    if strategy == 1:
        file_countries = open(os.path.join(PROJECT_ROOT, 'list_countries'))
        set_countries_temp = set(line.strip() for line in file_countries)
        file_countries.close()
        set_countries = sorted_nicely(set_countries_temp)
        while len(set_countries) != 0:
            country = set_countries.pop()
            list_countries = get_agreements(country)
            list_passports = Passport.objects.filter(nationality=country)
            for each_passport in list_passports:
                for each_country in list_countries:
                    with open(os.path.join(PROJECT_ROOT, 'generated_indexes/%s' % each_country), "a") as f:
                        f.write(generate_hash(each_passport.nationality + "<" + each_passport.id_passport, each_country) + "\n")
                    print("r")
                print("c")
            print("p")
        print("P")
In your question, you are describing an ETL process. I suggest you use an ETL tool.
To mention a Python ETL tool, I can talk about pygrametl, written by Christian Thomsen; in my opinion it runs nicely and its performance is impressive. Test it and come back with results.
I can't post this answer without mentioning MapReduce. This programming model can match your requirements if you are planning to distribute the task across nodes.
It looks like you have a file for each country that you append hashes to; instead of opening and closing handles to these files 10 million+ times, you should open each one once and close them all at the end.
countries = {}  # country -> file handle
with open(os.path.join(PROJECT_ROOT, 'list_countries')) as country_file:
    for line in country_file:
        country = line.strip()
        countries[country] = open(os.path.join(PROJECT_ROOT, 'generated_indexes/%s' % country), "a")

for country in countries:
    agreements = get_agreements(country)
    for passport in Passport.objects.filter(nationality=country):
        for agreement in agreements:
            countries[agreement].write(generate_hash(passport.nationality + "<" + passport.id_passport, agreement) + "\n")

for country, f in countries.items():
    f.close()
I don't know how big a list of Passport objects Passport.objects.filter(nationality=country) will return; if it is massive and memory is an issue, you will have to start thinking about chunking/paginating the query using limits, as sketched below.
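A rough sketch of such chunking, using queryset slicing (which translates to LIMIT/OFFSET) so that only one batch of Passport rows is held in memory at a time; the chunk size is an assumption:

CHUNK = 10000

def passports_in_chunks(country):
    # Yields Passport rows one chunk at a time instead of loading them all.
    offset = 0
    while True:
        batch = list(Passport.objects.filter(nationality=country).order_by('pk')[offset:offset + CHUNK])
        if not batch:
            break
        for passport in batch:
            yield passport
        offset += CHUNK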
You are using sets for your list of countries and their agreements; if that is because your file containing the list of countries is not guaranteed to be unique, the dictionary solution may error when you attempt to open another handle to the same file. This can be avoided by adding a simple check to see if the country is already a member of countries.

What data is cached during a "select" in sqlite3/Python, and can this be done manually from the start?

Suppose you have a sqlite database with several thousand rows -- each of which either contains or references a sizable, unique blob -- and you want to sparsely sample this collection, pulling rows based on rowid or some equivalent primary key. I find that the first time I attempt to fetch several (500) datapoints after connecting (out of 20k rows), the call takes over 10 seconds to return; and, with every successive iteration, the calls get shorter and shorter, until converging to around 100 milliseconds after 50-100 such queries.
Clearly, either sqlite or its python wrapper must be caching... something. If I clear out inactive memory (I'm in OS X, but I think Linux has a comparable if-not-identical "purge" command?), the behavior can be replicated exactly. The question is, what is it caching that an index doesn't address? And furthermore, is it possible to automatically pull whatever information is accelerating these queries into memory from the start? Or is there something else I've missed entirely?
A few notes in case someone doesn't immediately know the answer...
Each blob is around 40kB, and they are a large (ha) source of the problem. I've included some code below for anyone who wants to play along at home, but I've had better luck keeping separate tables for sortable information and data. This introduces an inner join, but it's generally been better than keeping it all together (although if anyone feels this is wrong, I'm keen to hear it). Without the inner join / data fetch, things start at 4 seconds and drop to 3 ms in a hurry.
I feel like this might be a PRAGMA thing, but I fiddled with some settings suggested by others in the wilderness of the web and didn't really see any benefit.
In-memory databases are not an option. For one, I'm trying to share across threads (which might not actually be a problem for in-mems...? not sure), but more importantly the database files are typically on the order of 17 GB. So, that's out.
That being said, there's no problem caching a reasonable amount of information. After a few dozen calls, inactive memory gets somewhat bloated anyways, but I'd rather do it (1) right and (2) efficiently.
Okay, now some code for anyone who wants to try to replicate things. You should be able to copy and paste it into a stand-alone script (that's basically what I did, save for formatting).
import sqlite3
import numpy as np
import time

ref_uid_index = """CREATE INDEX ref_uid_idx
                   ON data(ref_uid)"""

def populate_db_split(db_file, num_classes=10, num_points=20000, VERBOSE=False):
    def_schema_split0 = """
        CREATE TABLE main (
            uid INTEGER PRIMARY KEY,
            name TEXT,
            label INTEGER,
            ignore INTEGER default 0,
            fold INTEGER default 0)"""
    def_schema_split1 = """
        CREATE TABLE data (
            uid INTEGER PRIMARY KEY,
            ref_uid INTEGER REFERENCES main(uid),
            data BLOB)"""
    def_insert_split0 = """
        INSERT INTO main (name, label, fold)
        VALUES (?,?,?)"""
    def_insert_split1 = """
        INSERT INTO data (ref_uid, data)
        VALUES (?,?)"""
    blob_size = 5000
    k_folds = 5
    some_names = ['apple', 'banana', 'cherry', 'date']
    dbconn = sqlite3.connect(db_file)
    dbconn.execute(def_schema_split0)
    dbconn.execute(def_schema_split1)
    rng = np.random.RandomState()
    for n in range(num_points):
        if n % 1000 == 0 and VERBOSE:
            print n
        # Make up some data
        data = buffer(rng.rand(blob_size).astype(float))
        fold = rng.randint(k_folds)
        label = rng.randint(num_classes)
        rng.shuffle(some_names)
        # And add it
        dbconn.execute(def_insert_split0, [some_names[0], label, fold])
        ref_uid = dbconn.execute("SELECT uid FROM main WHERE rowid=last_insert_rowid()").fetchone()[0]
        dbconn.execute(def_insert_split1, [ref_uid, data])
    dbconn.execute(ref_uid_index)
    dbconn.commit()
    return dbconn

def timeit_join(dbconn, n_times=10, num_rows=500):
    qmarks = "?," * (num_rows - 1) + "?"
    q_join = """SELECT data.data, main.uid, main.label
                FROM data INNER JOIN main ON main.uid=data.ref_uid
                WHERE main.uid IN (%s)""" % qmarks
    row_max = dbconn.execute("SELECT MAX(rowid) from main").fetchone()[0]
    tstamps = []
    for n in range(n_times):
        now = time.time()
        uids = np.random.randint(low=1, high=row_max, size=num_rows).tolist()
        res = dbconn.execute(q_join, uids).fetchall()
        tstamps += [time.time() - now]
        print tstamps[-1]
Now, if you want to replicate things, do the following. On my machine, this creates an 800MB database and produces something like below.
>>> db = populate_db_split('/some/file/path.db')
>>> timeit_join(db)
12.0593519211
5.56209111214
3.51154184341
2.20699000359
1.73895692825
1.18351387978
1.27329611778
0.934082984924
0.780968904495
0.834318161011
So... what say you, knowledgable sages?
Database files of GB size are never loaded into memory entirely. They are split into a tree of so-called pages. These pages are cached in memory; the default is 2000 pages.
You can use the following statement to, for example, double the number of cached pages of 1kB size:
conn.execute("""PRAGMA cache_size = 4000""")
The connection also has a cache for the last 100 statements, as you can see in the function signature:
sqlite3.connect(database[, timeout, detect_types, isolation_level, check_same_thread, factory, cached_statements])
cached_statements expects an integer and defaults to 100.
Apart from setting the cache size, you are not likely to benefit from actively caching statements or pages at application start.
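Applied to the snippet from the question, a hedged tweak would be to bump the page cache on the connection before running the sampling benchmark and compare the timings:

db = populate_db_split('/some/file/path.db')
db.execute("PRAGMA cache_size = 20000")   # roughly 20 MB worth of 1 kB pages
timeit_join(db)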

What's the best performing xml parsing for GAE (Python Version)?

I think we all know this page, but the benchmarks provided date from more than two years ago. So, I would like to know if you could point out the best XML parser around. As I need just an XML parser, the most important thing to me is speed over everything else.
My objective is to process some XML feeds (about 25k) that are 4kB in size (this will be a daily task). As you probably know, I'm restricted by the 30-second request timeout. So, what's the best parser (Python only) that I can use?
Thanks for your answers.
Edit 01:
@Peter Recore
I will. I'm writing some code now and plan to run some profiling in the near future. Regarding your question, the answer is no. Processing takes just a little time compared with downloading the actual XML feed. But I can't increase Google's bandwidth, so I can only focus on this for now.
My only problem is that I need to do this as fast as possible, because my objective is to get a snapshot of a website's status. And, as the internet is live and people keep adding and changing its data, I need the fastest method, because any data insertion during the "downloading and processing" time span will mess with my statistical analysis.
I used to do it from my own computer and the process took 24 minutes back then, but now the website has 12 times more information.
I know this doesn't answer my question directly, but it does what I needed.
I remembered that XML is not the only file type I could use, so instead of using an XML parser I chose to use JSON, which is about 2.5 times smaller in size and therefore faster to download. I used simplejson as my JSON library.
I used from google.appengine.api import urlfetch to get the json feeds in parallel:
class GetEntityJSON(webapp.RequestHandler):
    def post(self):
        url = 'http://url.that.generates.the.feeds/'
        if self.request.get('idList'):
            idList = self.request.get('idList').split(',')
            try:
                asyncRequests = self._asyncFetch([url + id + '.json' for id in idList])
            except urlfetch.DownloadError:
                # Dealt with time-out errors (#5) as these were very frequent
                asyncRequests = []
            for result in asyncRequests:
                if result.status_code == 200:
                    entityJSON = simplejson.loads(result.content)
                    # Filled a database entity with some json info. It goes like this:
                    # entity = Entity(
                    #     name = entityJSON['name'],
                    #     dateOfBirth = entityJSON['date_of_birth']
                    # ).put()
        self.redirect('/')

    def _asyncFetch(self, urlList):
        rpcs = []
        for url in urlList:
            rpc = urlfetch.create_rpc(deadline=10)
            urlfetch.make_fetch_call(rpc, url)
            rpcs.append(rpc)
        return [rpc.get_result() for rpc in rpcs]
I tried getting 10 feeds at a time, but most of the time an individual feed raised DownloadError #5 (timed out). Then I increased the deadline to 10 seconds and started getting 5 feeds at a time.
But still, 25k feeds fetched 5 at a time results in 5k calls. In a queue that can spawn 5 tasks a second, the total task time should be about 17 minutes in the end.
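For reference, a hedged sketch of how those 5k calls might be enqueued (the handler URL and queue name are made up), batching the 25k feed ids five at a time and posting them to the handler above via its idList parameter:

from google.appengine.api import taskqueue

def enqueue_feed_batches(all_ids, batch_size=5):
    # One task per batch of ids; the queue's rate limit then paces the fetches.
    for start in range(0, len(all_ids), batch_size):
        batch = all_ids[start:start + batch_size]
        taskqueue.add(url='/tasks/get_entity_json',
                      params={'idList': ','.join(batch)},
                      queue_name='feeds')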
