Spark MLlib - trainImplicit warning - python

I keep seeing these warnings when using trainImplicit:
WARN TaskSetManager: Stage 246 contains a task of very large size (208 KB).
The maximum recommended task size is 100 KB.
And then the task size starts to increase. I tried to call repartition on the input RDD but the warnings are the same.
All these warnings come from the ALS iterations, from both flatMap and aggregate. For instance, here is the origin of the stage where the flatMap shows these warnings (with Spark 1.3.0, but they also appear in Spark 1.3.1):
org.apache.spark.rdd.RDD.flatMap(RDD.scala:296)
org.apache.spark.ml.recommendation.ALS$.org$apache$spark$ml$recommendation$ALS$$computeFactors(ALS.scala:1065)
org.apache.spark.ml.recommendation.ALS$$anonfun$train$3.apply(ALS.scala:530)
org.apache.spark.ml.recommendation.ALS$$anonfun$train$3.apply(ALS.scala:527)
scala.collection.immutable.Range.foreach(Range.scala:141)
org.apache.spark.ml.recommendation.ALS$.train(ALS.scala:527)
org.apache.spark.mllib.recommendation.ALS.run(ALS.scala:203)
and from aggregate:
org.apache.spark.rdd.RDD.aggregate(RDD.scala:968)
org.apache.spark.ml.recommendation.ALS$.computeYtY(ALS.scala:1112)
org.apache.spark.ml.recommendation.ALS$.org$apache$spark$ml$recommendation$ALS$$computeFactors(ALS.scala:1064)
org.apache.spark.ml.recommendation.ALS$$anonfun$train$3.apply(ALS.scala:538)
org.apache.spark.ml.recommendation.ALS$$anonfun$train$3.apply(ALS.scala:527)
scala.collection.immutable.Range.foreach(Range.scala:141)
org.apache.spark.ml.recommendation.ALS$.train(ALS.scala:527)
org.apache.spark.mllib.recommendation.ALS.run(ALS.scala:203)

A similar problem was described on the Apache Spark user mailing list: http://apache-spark-user-list.1001560.n3.nabble.com/Large-Task-Size-td9539.html
I think you can try playing with the number of partitions (using the repartition() method), depending on how many hosts, how much RAM, and how many CPUs you have.
Also try to investigate all the steps via the Web UI, where you can see the number of stages, the memory usage of each stage, and data locality.
Or just ignore these warnings as long as everything works correctly and fast.
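For reference, here is a minimal PySpark sketch of the repartition suggestion (the input path, record format, and partition count of 200 are assumptions; tune the count to your cluster's cores and memory):
from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS, Rating

sc = SparkContext(appName="implicit-als-sketch")

# Hypothetical input: lines of "user,item,clicks"
raw = sc.textFile("hdfs:///path/to/implicit_feedback.csv")
ratings = raw.map(lambda line: line.split(",")) \
             .map(lambda f: Rating(int(f[0]), int(f[1]), float(f[2])))

# Spread the data over more (smaller) partitions before training,
# so each serialized task carries less data.
ratings = ratings.repartition(200)

model = ALS.trainImplicit(ratings, rank=10, iterations=10, alpha=0.01)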
This notification is hard-coded in Spark (scheduler/TaskSetManager.scala):
if (serializedTask.limit > TaskSetManager.TASK_SIZE_TO_WARN_KB * 1024 &&
    !emittedTaskSizeWarning) {
  emittedTaskSizeWarning = true
  logWarning(s"Stage ${task.stageId} contains a task of very large size " +
    s"(${serializedTask.limit / 1024} KB). The maximum recommended task size is " +
    s"${TaskSetManager.TASK_SIZE_TO_WARN_KB} KB.")
}
and:
private[spark] object TaskSetManager {
  // The user will be warned if any stages contain a task that has a serialized size greater than
  // this.
  val TASK_SIZE_TO_WARN_KB = 100
}

Related

Neo4j: dependence of execution speed on batch size of input parameters

I'm using Neo4J to identify the connections between different node labels.
Neo4J 4.4.4 Community Edition
The DB is deployed in a Docker container, orchestrated with k8s.
MATCH (source_node: Person) WHERE source_node.name in $inputs
MATCH (source_node)-[r]->(child_id:InternalId)
WHERE r.valid_from <= datetime($actualdate) < r.valid_to
WITH [type(r), toString(date(r.valid_from)), child_id.id] as child_path, child_id, false as filtered
CALL apoc.do.when(filtered,
'RETURN child_path as full_path, NULL as issuer_id',
'OPTIONAL MATCH p_path = (child_id)-[:HAS_PARENT_ID*0..50]->(parent_id:InternalId)
WHERE all(a in relationships(p_path) WHERE a.valid_from <= datetime($actualdate) < a.valid_to) AND
NOT EXISTS{ MATCH (parent_id)-[q:HAS_PARENT_ID]->() WHERE q.valid_from <= datetime($actualdate) < q.valid_to}
WITH DISTINCT last(nodes(p_path)) as i_source,
reduce(st = [], q IN relationships(p_path) | st + [type(q), toString(date(q.valid_from)), endNode(q).id])
as parent_path, CASE WHEN length(p_path) = 0 THEN NULL ELSE parent_id END as parent_id, child_path
OPTIONAL MATCH (i_source)-[r:HAS_ISSUER_ID]->(issuer_id:IssuerId)
WHERE r.valid_from <= datetime($actualdate) < r.valid_to
RETURN DISTINCT CASE issuer_id WHEN NULL THEN child_path + parent_path + [type(r), NULL, "NOT FOUND IN RELATION"]
ELSE child_path + parent_path + [type(r), toString(date(r.valid_from)), toInteger(issuer_id.id)]
END as full_path, issuer_id, CASE issuer_id WHEN NULL THEN true ELSE false END as filtered',
{filtered: filtered, child_path: child_path, child_id: child_id, actualdate: $actualdate}
)
YIELD value
RETURN value.full_path as full_path, value.issuer_id as issuer_id, value.filtered as filtered
When the query is executed on a large number of incoming names (Person), it is processed quickly; for example, 100,000 inputs take ~2.5 seconds. However, if the 100,000 names are divided into small batches and the query is executed sequentially for each batch, the overall processing time increases dramatically:
batches of 100 names: ~2 min
batches of 1000 names: ~10 sec
Could you please give me a clue why exactly this is happening, and how I could get the same execution time as for the entire dataset regardless of the batch size?
Is there any possibility of dividing the transactions into multiple processes? I tried Python multiprocessing with the Neo4j driver. It works faster, but for some reason it still cannot achieve the target execution time of 2.5 sec.
Is there any way to keep the entire graph in memory for the whole container lifecycle? Could that help resolve the issue with execution speed on multiple batches instead of the entire dataset?
Essentially, the goal is to process the entire dataset using batches that are as small as possible.
Thank you.
PS: Any suggestions to improve the query are very welcome.
You pass in a list, so it will use an index to efficiently filter down the results by passing the list to the index, and then you do additional aggressive filtering on properties.
So if you run the query with PROFILE you will see how much data is loaded / touched at each step.
A single execution makes more efficient use of resources like heap and page-cache.
For individual batched executions it has to go through the whole machinery (driver, query parsing, planning, runtime) every time, and depending on whether you execute your queries in parallel (do you?) or sequentially, the next query may need to wait until the previous one has finished.
Multiple executions also contend for resources like memory, IO, and network.
Python is also not the fastest driver, especially if you send/receive larger volumes of data; try one of the other languages if that serves you better.
Why don't you just always execute one large batch then?
With Neo4j EE (e.g. on Aura) or CE 5 you will also get better runtimes and execution.
Yes, if you configure your page cache large enough to hold the store, it will keep the graph in memory during execution.
If you run PROFILE with your query you should also see page-cache faults, when it needs to fetch data from disk.
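Since you mentioned trying multiprocessing from Python, here is a minimal sketch of running the batches concurrently from a single driver instead of strictly sequentially (the URI, credentials, worker count, and the shortened BATCH_QUERY placeholder are all assumptions; substitute your full query and parameters):
from concurrent.futures import ThreadPoolExecutor
from neo4j import GraphDatabase

URI = "bolt://localhost:7687"        # assumption
AUTH = ("neo4j", "password")         # assumption

# Placeholder for the full query from the question.
BATCH_QUERY = "MATCH (p:Person) WHERE p.name IN $inputs RETURN p.name"

def run_batch(driver, names, actualdate):
    # Each thread gets its own session; the driver object itself is thread-safe.
    with driver.session() as session:
        result = session.run(BATCH_QUERY, inputs=names, actualdate=actualdate)
        return [record.data() for record in result]

def run_all(batches, actualdate):
    driver = GraphDatabase.driver(URI, auth=AUTH)
    try:
        with ThreadPoolExecutor(max_workers=8) as pool:
            futures = [pool.submit(run_batch, driver, batch, actualdate) for batch in batches]
            return [f.result() for f in futures]
    finally:
        driver.close()
This only amortizes the per-query round-trip and planning overhead; it will not beat the single large query, since the index lookup is most efficient when the whole list is passed at once.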

Celery Redis instance filling up despite queue looking empty

We have a Django app that needs to fetch lots of data using Celery. There are 20 or so celery workers running every few minutes. We're running on Google Kubernetes Engine with a Redis queue using Cloud memorystore.
The Redis instance we're using for celery is filling up, even when the queue is empty according to Flower. This results in the Redis DB eventually being full and Celery throwing errors.
In Flower I see tasks coming in and out, and I have increased workers to the point where the queue is always empty now.
If I run redis-cli --bigkeys I see:
# Scanning the entire keyspace to find biggest keys as well as
# average sizes per key type. You can use -i 0.1 to sleep 0.1 sec
# per 100 SCAN commands (not usually needed).
[00.00%] Biggest set found so far '_kombu.binding.my-queue-name-queue' with 1 members
[00.00%] Biggest list found so far 'default' with 611 items
[00.00%] Biggest list found so far 'my-other-queue-name-queue' with 44705 items
[00.00%] Biggest set found so far '_kombu.binding.celery.pidbox' with 19 members
[00.00%] Biggest list found so far 'my-queue-name-queue' with 727179 items
[00.00%] Biggest set found so far '_kombu.binding.celeryev' with 22 members
-------- summary -------
Sampled 12 keys in the keyspace!
Total key length in bytes is 271 (avg len 22.58)
Biggest list found 'my-queue-name-queue' has 727179 items
Biggest set found '_kombu.binding.celeryev' has 22 members
4 lists with 816144 items (33.33% of keys, avg size 204036.00)
0 hashs with 0 fields (00.00% of keys, avg size 0.00)
0 strings with 0 bytes (00.00% of keys, avg size 0.00)
0 streams with 0 entries (00.00% of keys, avg size 0.00)
8 sets with 47 members (66.67% of keys, avg size 5.88)
0 zsets with 0 members (00.00% of keys, avg size 0.00)
If I inspect the queue using LRANGE I see lots of objects like this:
"{\"body\": \"W1syNDQ0NF0sIHsicmVmZXJlbmNlX3RpbWUiOiBudWxsLCAibGF0ZXN0X3RpbWUiOiBudWxsLCAicm9sbGluZyI6IGZhbHNlLCAidGltZWZyYW1lIjogIjFkIiwgIl9udW1fcmV0cmllcyI6IDF9LCB7ImNhbGxiYWNrcyI6IG51bGwsICJlcnJiYWNrcyI6IG51bGwsICJjaGFpbiI6IG51bGwsICJjaG9yZCI6IG51bGx9XQ==\", \"content-encoding\": \"utf-8\", \"content-type\": \"application/json\", \"headers\": {\"lang\": \"py\", \"task\": \"MyDataCollectorClass\", \"id\": \"646910fc-f9db-48c3-b5a9-13febbc00bde\", \"shadow\": null, \"eta\": \"2019-08-20T02:31:05.113875+00:00\", \"expires\": null, \"group\": null, \"retries\": 0, \"timelimit\": [null, null], \"root_id\": \"beeff557-66be-451d-9c0c-dc622ca94493\", \"parent_id\": \"374d8e3e-92b5-423e-be58-e043999a1722\", \"argsrepr\": \"(24444,)\", \"kwargsrepr\": \"{'reference_time': None, 'latest_time': None, 'rolling': False, 'timeframe': '1d', '_num_retries': 1}\", \"origin\": \"gen1#celery-my-queue-name-worker-6595bd8fd8-8vgzq\"}, \"properties\": {\"correlation_id\": \"646910fc-f9db-48c3-b5a9-13febbc00bde\", \"reply_to\": \"e55a31ed-cbba-3d79-9ffc-c19a29e77aac\", \"delivery_mode\": 2, \"delivery_info\": {\"exchange\": \"\", \"routing_key\": \"my-queue-name-queue\"}, \"priority\": 0, \"body_encoding\": \"base64\", \"delivery_tag\": \"a83074a5-8787-49e3-bb7d-a0e69ba7f599\"}}"
We're using django-celery-results to store results, so these shouldn't be going in there, and we're using a separate Redis instance for Django's cache.
If I clear Redis with a FLUSHALL it slowly fills up again.
I'm kind of stumped at where to go next. I don't know Redis well - maybe I can do something to inspect the data to see what's filling this? Maybe it's Flower not reporting properly? Maybe Celery keeps completed tasks for a bit despite us using the Django DB for results?
Thanks loads for any help.
It sounds like Redis is not set up to delete completed items or to report and delete failed items; i.e. it may be putting the tasks on the list, but it's not taking them off.
Check out pypi packages: rq, django-rq, django-rq-scheduler
You can read here a little bit about how this should work: https://python-rq.org/docs/
This seems to be a known (or intentional) issue with Celery, with various solutions/workarounds proposed:
https://github.com/celery/celery/issues/436
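If you want to see for yourself which key keeps growing between Flower's "empty" readings, here is a small inspection sketch with redis-py (the connection URL is an assumption):
import redis

r = redis.Redis.from_url("redis://localhost:6379/0")

# Print the length of every list key so you can watch which queue grows over time.
for key in r.scan_iter():
    if r.type(key) == b"list":
        print(key.decode(), r.llen(key))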

Inadequate RAM usage by Redis

I'm developing an API using Go and Redis. The problem is that RAM usage is inadequate and I can't find the root of the problem.
TL;DR version
There are hundreds/thousands of hash objects. Each ~1 KB object (key+value) takes ~0.5 MB of RAM. However, there is no memory fragmentation (INFO shows none).
Also, dump.rdb is ~70x smaller than the in-memory data set (360 KB dump.rdb vs 25 MB of RAM for 50 objects, and 35.5 MB vs 2.47 GB for 5000 objects).
Long version
Redis instance is filled mostly with task:123 hashes of the following kind:
"task_id" : int
"client_id" : int
"worker_id" : int
"text" : string (0..255 chars)
"is_processed" : boolean
"timestamp" : int
"image" : byte array (1 kbyte)
Also, there are a couple of integer counters, one list and one sorted set (both consist of task_id's).
RAM usage has a linear dependency on the number of task objects.
INFO output for 50 tasks:
# Memory
used_memory:27405872
used_memory_human:26.14M
used_memory_rss:45215744
used_memory_peak:31541400
used_memory_peak_human:30.08M
used_memory_lua:35840
mem_fragmentation_ratio:1.65
mem_allocator:jemalloc-3.6.0
and 5000 tasks:
# Memory
used_memory:2647515776
used_memory_human:2.47G
used_memory_rss:3379187712
used_memory_peak:2651672840
used_memory_peak_human:2.47G
used_memory_lua:35840
mem_fragmentation_ratio:1.28
mem_allocator:jemalloc-3.6.0
Size of dump.rdb for 50 tasks is 360kB and for 5000 tasks it's 35553kB.
Every task object has serializedlength of ~7KB:
127.0.0.1:6379> DEBUG OBJECT task:2000
Value at:0x7fcb403f5880 refcount:1 encoding:hashtable serializedlength:7096 lru:6497592 lru_seconds_idle:180
I've written a Python script trying to reproduce the problem:
import redis
import time
import os
from random import randint

img_size = 1024 * 1  # 1 kb
r = redis.StrictRedis(host='localhost', port=6379, db=0)

for i in range(0, 5000):
    values = {
        "task_id": randint(0, 65536),
        "client_id": randint(0, 65536),
        "worker_id": randint(0, 65536),
        "text": "",
        "is_processed": False,
        "timestamp": int(time.time()),
        "image": bytearray(os.urandom(img_size)),
    }
    key = "task:" + str(i)
    r.hmset(key, values)
    if i % 500 == 0:
        print(i)
And it consumes just 80MB of RAM!
I would appreciate any ideas on how to figure out what's going on.
You have lots and lots of small HASH objects, and that's fine. But each of them has a lot of overhead in Redis memory, since each one has a separate dictionary. There is a small optimization for this that usually improves things significantly: keeping hashes in a memory-optimized but slightly slower data structure, which at these object sizes should not matter much. From the config:
# Hashes are encoded using a memory efficient data structure when they have a
# small number of entries, and the biggest entry does not exceed a given
# threshold. These thresholds can be configured using the following directives.
hash-max-ziplist-entries 512
hash-max-ziplist-value 64
Now, you have large values, which causes this optimization not to kick in.
I'd set hash-max-ziplist-value to a few KB (depending on the size of your largest object), and it should improve this (you should not see any performance degradation at this HASH size).
Also, keep in mind that redis compresses its RDB files relative to what you have in memory, so a ~50% reduction over memory is to be expected anyway.
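If you want to try the ziplist suggestion without editing redis.conf, here is a hedged sketch using redis-py's CONFIG SET (the 2048 threshold is an assumption sized to the ~1 KB image field; in Redis 7+ the directive is named hash-max-listpack-value):
import redis

r = redis.StrictRedis(host="localhost", port=6379, db=0)

# Raise the threshold so ~1 KB hash values keep the compact encoding.
r.config_set("hash-max-ziplist-value", 2048)
r.config_set("hash-max-ziplist-entries", 512)

# Note: hashes already encoded as hashtable are not converted back; the new
# thresholds apply to hashes created or loaded afterwards. Check a sample key:
print(r.object("encoding", "task:2000"))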
[EDIT] After re-reading your question and seeing it's a Go-only problem, and considering the fact that the compressed RDB is small, something tells me you're writing a bigger value than you'd expect for the image. Any chance you're writing it from a []byte slice? If so, perhaps you did not trim it and you're writing a much bigger buffer, or something similar? I've worked like this with redigo tons of times and never seen what you're describing.

Neo4J / py2neo -- cursor-based query?

If I do something like this:
from py2neo import Graph
graph = Graph()
stuff = graph.cypher.execute("""
match (a:Article)-[p]-n return a, n, p.weight
""")
on a database with lots of articles and links, the query takes a long time and uses all my system's memory, presumably because it's copying the entire result set into memory in one go. Is there some kind of cursor-based version where I could iterate through the results one at a time without having to have them all in memory at once?
EDIT
I found the stream function:
stuff = graph.cypher.stream("""
match (a:Article)-[p]-n return a, n, p.weight
""")
which seems to be what I want according to the documentation but now I get a timeout error (py2neo.packages.httpstream.http.SocketError: timed out), followed by the server becoming unresponsive until I kill it using kill -9.
Have you tried implementing a paging mechanism? Perhaps with the skip keyword: http://neo4j.com/docs/stable/query-skip.html
Similar to using limit / offset in a postgres / mysql query.
EDIT: I previously said that the entire result set was stored in memory, but it appears this is not the case when using API streaming, per Nigel's (a Neo4j engineer) comment below.
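A hedged sketch of the paging approach with the py2neo 2.x API used in the question (the page size of 10,000 and the {skip}/{limit} parameter syntax are assumptions; for stable pages you would normally add an ORDER BY as well):
from py2neo import Graph

graph = Graph()
page_size = 10000
skip = 0

while True:
    page = graph.cypher.execute(
        "match (a:Article)-[p]-n "
        "return a, n, p.weight "
        "skip {skip} limit {limit}",
        {"skip": skip, "limit": page_size})
    if len(page) == 0:
        break
    for record in page:
        a, n, weight = record[0], record[1], record[2]
        # process one record at a time here
    skip += page_size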

Memory leak in my Google App Engine code

I have the following code that is trying to loop over a large table (~100k rows; ~30GB)
def updateEmailsInLoop(cursor=None, stats={}):
    BATCH_SIZE = 10
    try:
        rawEmails, next_cursor, more = RawEmailModel.query().fetch_page(BATCH_SIZE, start_cursor=cursor)
        for index, rawEmail in enumerate(rawEmails):
            stats = process_stats(rawEmail, stats)
        i = 0
        while more and next_cursor:
            rawEmails, next_cursor, more = RawEmailModel.query().fetch_page(BATCH_SIZE, start_cursor=next_cursor)
            for index, rawEmail in enumerate(rawEmails):
                stats = process_stats(rawEmail, stats)
            i = (i + 1) % 100
            if i == 99:
                logging.info("foobar: Finished 100 more %s", str(stats))
                write_stats(stats)
    except DeadlineExceededError:
        logging.info("foobar: Deadline exceeded")
        for index, rawEmail in enumerate(rawEmails[index:], start=index):
            stats = process_stats(rawEmail, stats)
        if more and next_cursor:
            deferred.defer(updateEmailsInLoop, cursor=next_cursor, stats=stats, _queue="adminStats")
However, I keep getting the following error:
While handling this request, the process that handled this request was found to be using too much memory and was terminated. This is likely to cause a new process to be used for the next request to your application. If you see this message frequently, you may have a memory leak in your application.
...and sometimes....
Exceeded soft private memory limit of 128 MB with 154 MB after servicing 9 requests total
I had changed my code so I was only ever pulling in 10 entries at any given time, so I don't understand why I'm still running out of memory.
There are 3 ways to do this kind of job (iteration on a large set of rows in datastore):
Process 1 batch of x entities and create a task (push queue) using the cursor.
Process 1 batch of x entities and respond to the browser with a bit of javascript that shows the progress and changes window.location to a link that contains the cursor and the current progress. (this is my preferred approach)
Use MapReduce (it's harder to code, but it can be applied to 10M-1B rows).
For most of my apps that needed this, x is usually between 100 and 500.
Here is the code I use to iterate over 1.5M-2M rows to generate reports or update stuff in my DB. For reports I save an entity containing the information I need in CSV format, and at the end I read all the entities, merge them, and delete them. (I've done this to generate 1.5M rows of Excel data.)
(It's Java, but should be easily translated to Python):
resp.getWriter().println("<html><head>");
resp.getWriter().println(
"<script type='text/javascript'>function f(){window.location.href='/do/convert/" + this.getClass().getSimpleName() + "?cursor=" + cursorString + "&count="
+ count + "';}</script>");
resp.getWriter().println("</head><body onload='f()'>");
resp.getWriter().println(
"<a href='/do/convert/" + this.getClass().getSimpleName() + "?cursor=" + cursorString + "&count=" + count + "'>Next page -->" + cursorString + " </a>");
resp.getWriter().println("</body></html>");
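And a rough Python/webapp2 translation of the same idea (the handler route, the batch size of 500, and the reuse of RawEmailModel from the question are assumptions):
import webapp2
from google.appengine.datastore.datastore_query import Cursor

class ConvertHandler(webapp2.RequestHandler):
    def get(self):
        # Resume from the cursor passed in the URL (empty string on the first call).
        cursor = Cursor(urlsafe=self.request.get("cursor"))
        count = int(self.request.get("count") or 0)

        batch, next_cursor, more = RawEmailModel.query().fetch_page(
            500, start_cursor=cursor)
        for raw_email in batch:
            count += 1  # do the per-entity work here

        if more and next_cursor:
            next_url = "/do/convert?cursor=%s&count=%d" % (next_cursor.urlsafe(), count)
            # Let the browser drive the next page, as in the Java version.
            self.response.write(
                "<html><head><script>function f(){window.location.href='%s';}"
                "</script></head><body onload='f()'>"
                "<a href='%s'>Next page</a></body></html>" % (next_url, next_url))
        else:
            self.response.write("Done after %d entities" % count)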
If your "progress" is big and messy, save it in entities (one or more, depending on what you are doing)
If you are doing the task version, i recommend to either use task names or to make your tasks idempotent (especially if your counting stuff).
If your counting stuff, i recommend saving entities that contain the keys of the entities that you are counting, and at the end, count those.
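For the named-task suggestion, a hedged sketch (the name scheme and helper function are assumptions; deferred passes _name through to the task queue, and a duplicate name raises an error you can swallow):
import hashlib

from google.appengine.api import taskqueue
from google.appengine.ext import deferred

def enqueue_next_batch(next_cursor, stats):
    # Derive a deterministic task name from the cursor so re-enqueueing the
    # same batch collides instead of double-processing it.
    task_name = "update-emails-" + hashlib.md5(next_cursor.urlsafe()).hexdigest()
    try:
        deferred.defer(updateEmailsInLoop,
                       cursor=next_cursor, stats=stats,
                       _queue="adminStats", _name=task_name)
    except (taskqueue.TaskAlreadyExistsError, taskqueue.TombstonedTaskError):
        pass  # this batch was already enqueued once; skip it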
