I am using this Python script to migrate data from one ElastiCache Redis instance to another. It uses Redis pipelining to migrate the data in chunks.
https://gist.github.com/thomasst/afeda8fe80534a832607
But I am getting this strange error:
Traceback (most recent call last):
File "migrate-redis.py", line 95, in <module>
migrate()
File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 664, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 644, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 837, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 464, in invoke
return callback(*args, **kwargs)
File "migrate-redis.py", line 74, in migrate
results = pipeline.execute(False)
File "/usr/local/lib/python2.7/dist-packages/redis/client.py", line 2593, in execute
return execute(conn, stack, raise_on_error)
File "/usr/local/lib/python2.7/dist-packages/redis/client.py", line 2446, in _execute_transaction
all_cmds = connection.pack_commands([args for args, _ in cmds])
File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 637, in pack_commands
output.append(SYM_EMPTY.join(pieces))
MemoryError
There are no issues with RAM, as the node has 6 GB of RAM.
The memory profile of the source Redis is as follows:
used_memory:1483900120
used_memory_human:1.38G
used_memory_rss:1945829376
used_memory_peak:2431795528
used_memory_peak_human:2.26G
used_memory_lua:86016
mem_fragmentation_ratio:1.31
mem_allocator:jemalloc-3.6.0
What could be the possible cause of this?
From your error log, this has no relation to your Redis server. The error happens in your Redis client when it packs all the commands into a memory buffer.
Try decreasing the SCAN count option in your migrate-redis.py to check whether the batch is simply too large to pack.
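For illustration, here is a minimal sketch (not the original gist; the host names and chunk size are placeholders) of migrating in smaller pipeline batches, so that far fewer commands are packed per execute():

import redis

src = redis.StrictRedis(host='source-host', port=6379)
dst = redis.StrictRedis(host='dest-host', port=6379)

CHUNK = 500  # keys per pipeline round-trip; lower this if the MemoryError persists

cursor = 0
while True:
    cursor, keys = src.scan(cursor, count=CHUNK)

    if keys:
        # DUMP each value and fetch its TTL from the source in one small pipeline
        pipe = src.pipeline(transaction=False)
        for key in keys:
            pipe.pttl(key)
            pipe.dump(key)
        results = pipe.execute()

        # RESTORE the dumped values onto the destination in another small pipeline
        ttls, dumps = results[::2], results[1::2]
        pipe = dst.pipeline(transaction=False)
        for key, ttl, data in zip(keys, ttls, dumps):
            if data is not None:
                pipe.restore(key, ttl if ttl > 0 else 0, data)
        pipe.execute(raise_on_error=False)

    if cursor == 0:
        break

If the error persists, keep lowering CHUNK until each round-trip fits comfortably in the client's memory.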
Related
I have a program which performs a kind of attack on a PyTorch model, and the model is moved onto a CUDA device.
The main of my program looks something like this:
...
from joblib import Parallel, delayed, cpu_count
...
def main(params):
    ...
    # move the black_box onto the GPU
    ...
    # Batch size
    batch_size: int = len(x_train) // cpu_count()
    echo(f"Starting Parallel with {n_jobs=} and {batch_size=}")
    with Parallel(n_jobs=n_jobs, prefer="processes", batch_size=batch_size) as parallel:
        # For each row of the matrix perform the MIA
        parallel(
            delayed(perform_attack_pipeline)(
                id,
                x_row,
                y_row,
                labels,
                black_box,
                path,
                explainer,
                explainer_sampling,
                neighborhood_sampling,
                attack_full,
                num_samples,
                num_shadow_models,
                test_size,
                random_state,
            )
            for id, x_row, y_row in zip(indices, x_train, y_train)
        )
    echo("Experiments are concluded, kudos from MLEM!")

if __name__ == '__main__':
    run(main)  # this program uses the typer lib to handle the cmdline args
As can be seen in main, this program is parallelized using joblib.
One of the program's parameters is the n_jobs value to pass to joblib's Parallel construct.
When I call the program's main with the option n_jobs = 1:
./pytorch_main.py pretrained/linear_geo30.tar --n-jobs=1
everything works, but when I try to call it using more workers, e.g.
./pytorch_main.py pretrained/linear_geo30.tar --n-jobs=4
the program crashes with the following traceback, which is hard to interpret since I cannot see any function I wrote in it, only functions from the libraries' runtimes.
(In addition, the function passed to delayed, perform_attack_pipeline, prints something as its first statement, but the print never appears, so the program crashes before it even executes.)
joblib.externals.loky.process_executor._RemoteTraceback:
"""
Traceback (most recent call last):
File "/home/gerardozinno/Desktop/Tesi/Code/mlem/venv/lib/python3.9/site-packages/joblib/externals/loky/process_executor.py", line 407, in _process_worker
call_item = call_queue.get(block=True, timeout=timeout)
File "/usr/lib/python3.9/multiprocessing/queues.py", line 122, in get
return _ForkingPickler.loads(res)
File "/home/gerardozinno/Desktop/Tesi/Code/mlem/venv/lib/python3.9/site-packages/torch/storage.py", line 161, in _load_from_bytes
return torch.load(io.BytesIO(b))
File "/home/gerardozinno/Desktop/Tesi/Code/mlem/venv/lib/python3.9/site-packages/torch/serialization.py", line 608, in load
return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
File "/home/gerardozinno/Desktop/Tesi/Code/mlem/venv/lib/python3.9/site-packages/torch/serialization.py", line 787, in _legacy_load
result = unpickler.load()
File "/home/gerardozinno/Desktop/Tesi/Code/mlem/venv/lib/python3.9/site-packages/torch/serialization.py", line 743, in persistent_load
deserialized_objects[root_key] = restore_location(obj, location)
File "/home/gerardozinno/Desktop/Tesi/Code/mlem/venv/lib/python3.9/site-packages/torch/serialization.py", line 175, in default_restore_location
result = fn(storage, location)
File "/home/gerardozinno/Desktop/Tesi/Code/mlem/venv/lib/python3.9/site-packages/torch/serialization.py", line 155, in _cuda_deserialize
return storage_type(obj.size())
File "/home/gerardozinno/Desktop/Tesi/Code/mlem/venv/lib/python3.9/site-packages/torch/cuda/__init__.py", line 606, in _lazy_new
return super(_CudaBase, cls).__new__(cls, *args, **kwargs)
RuntimeError: CUDA error: unknown error
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/gerardozinno/Desktop/Tesi/Code/mlem/./pytorch_main.py", line 180, in <module>
run(main)
File "/home/gerardozinno/Desktop/Tesi/Code/mlem/venv/lib/python3.9/site-packages/typer/main.py", line 864, in run
app()
File "/home/gerardozinno/Desktop/Tesi/Code/mlem/venv/lib/python3.9/site-packages/typer/main.py", line 214, in __call__
return get_command(self)(*args, **kwargs)
File "/home/gerardozinno/Desktop/Tesi/Code/mlem/venv/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
return self.main(*args, **kwargs)
File "/home/gerardozinno/Desktop/Tesi/Code/mlem/venv/lib/python3.9/site-packages/click/core.py", line 1053, in main
rv = self.invoke(ctx)
File "/home/gerardozinno/Desktop/Tesi/Code/mlem/venv/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/gerardozinno/Desktop/Tesi/Code/mlem/venv/lib/python3.9/site-packages/click/core.py", line 754, in invoke
return __callback(*args, **kwargs)
File "/home/gerardozinno/Desktop/Tesi/Code/mlem/venv/lib/python3.9/site-packages/typer/main.py", line 500, in wrapper
return callback(**use_params) # type: ignore
File "/home/gerardozinno/Desktop/Tesi/Code/mlem/./pytorch_main.py", line 154, in main
parallel(
File "/home/gerardozinno/Desktop/Tesi/Code/mlem/venv/lib/python3.9/site-packages/joblib/parallel.py", line 1056, in __call__
self.retrieve()
File "/home/gerardozinno/Desktop/Tesi/Code/mlem/venv/lib/python3.9/site-packages/joblib/parallel.py", line 935, in retrieve
self._output.extend(job.get(timeout=self.timeout))
File "/home/gerardozinno/Desktop/Tesi/Code/mlem/venv/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 542, in wrap_future_result
return future.result(timeout=timeout)
File "/usr/lib/python3.9/concurrent/futures/_base.py", line 445, in result
return self.__get_result()
File "/usr/lib/python3.9/concurrent/futures/_base.py", line 390, in __get_result
raise self._exception
joblib.externals.loky.process_executor.BrokenProcessPool: A task has failed to un-serialize. Please ensure that the arguments of the function are all picklable.
Searching Google for a solution, I found a page in the PyTorch documentation saying that "The CUDA runtime does not support the fork start method". Could this be the problem?
Has anyone had the same issue?
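For what it's worth, a hypothetical sketch of one way to test that hypothesis: keep the model on the CPU in the parent process and move it to the GPU only inside each worker, so that no CUDA storage has to be pickled across process boundaries (worker and run_parallel are made-up names, not part of my program):

import torch
from joblib import Parallel, delayed

def worker(model_cpu, batch):
    model = model_cpu.to("cuda")   # each worker process initializes CUDA itself
    with torch.no_grad():
        return model(batch.to("cuda")).cpu()

def run_parallel(model, batches, n_jobs=4):
    model_cpu = model.cpu()        # strip CUDA storage before pickling to the workers
    return Parallel(n_jobs=n_jobs, prefer="processes")(
        delayed(worker)(model_cpu, b) for b in batches
    )

Whether several processes can share a single GPU efficiently is a separate question, but at least the unpickling of CUDA storage in the children is avoided.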
I have 3 machines with Celery workers and RabbitMQ as a broker. One worker is running with the beat flag, all of this is managed by supervisor, and sometimes Celery dies with the error below.
The error appears only on the beat worker, but when it appears, the workers on all machines die.
(celery==3.1.12, kombu==3.0.20)
[2014-07-05 08:37:04,297: INFO/MainProcess] Connected to amqp://user:**@192.168.15.106:5672//
[2014-07-05 08:37:04,311: ERROR/Beat] Process Beat
Traceback (most recent call last):
File "/var/projects/env/local/lib/python2.7/site-packages/billiard/process.py", line 292, in _bootstrap
self.run()
File "/var/projects/env/local/lib/python2.7/site-packages/celery/beat.py", line 527, in run
self.service.start(embedded_process=True)
File "/var/projects/env/local/lib/python2.7/site-packages/celery/beat.py", line 453, in start
humanize_seconds(self.scheduler.max_interval))
File "/var/projects/env/local/lib/python2.7/site-packages/kombu/utils/__init__.py", line 322, in __get__
value = obj.__dict__[self.__name__] = self.__get(obj)
File "/var/projects/env/local/lib/python2.7/site-packages/celery/beat.py", line 491, in scheduler
return self.get_scheduler()
File "/var/projects/env/local/lib/python2.7/site-packages/celery/beat.py", line 486, in get_scheduler
lazy=lazy)
File "/var/projects/env/local/lib/python2.7/site-packages/celery/utils/imports.py", line 53, in instantiate
return symbol_by_name(name)(*args, **kwargs)
File "/var/projects/env/local/lib/python2.7/site-packages/celery/beat.py", line 357, in __init__
Scheduler.__init__(self, *args, **kwargs)
File "/var/projects/env/local/lib/python2.7/site-packages/celery/beat.py", line 184, in __init__
self.setup_schedule()
File "/var/projects/env/local/lib/python2.7/site-packages/celery/beat.py", line 376, in setup_schedule
self._store['entries']
File "/usr/lib/python2.7/shelve.py", line 121, in __getitem__
f = StringIO(self.dict[key])
File "/usr/lib/python2.7/bsddb/__init__.py", line 270, in __getitem__
return _DeadlockWrap(lambda: self.db[key]) # self.db[key]
File "/usr/lib/python2.7/bsddb/dbutils.py", line 68, in DeadlockWrap
return function(*_args, **_kwargs)
File "/usr/lib/python2.7/bsddb/__init__.py", line 270, in <lambda>
return _DeadlockWrap(lambda: self.db[key]) # self.db[key]
DBPageNotFoundError: (-30985, 'DB_PAGE_NOTFOUND: Requested page not found')
I've run into this issue, and the cause was a corrupted db file (usually named "celerybeat-schedule").
The solution is to delete the existing db file and restart the process.
Relevant: bsddb.db.DBPageNotFoundError
https://mail.python.org/pipermail/python-list/2009-October/554552.html
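One thing that may make that cleanup easier (the path below is just an example): pin the schedule file to an explicit location, either with beat's -s/--schedule option or in the Celery 3.x configuration, so you always know which file to delete:

# Hypothetical Celery 3.x config snippet; the path is only an example.
# Roughly equivalent to starting beat with: celery beat -s /var/run/celery/celerybeat-schedule
CELERYBEAT_SCHEDULE_FILENAME = '/var/run/celery/celerybeat-schedule'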
I had to remove some temp files in the /tmp directory, one named celeryd-<NAME_OF_WORKER>-state and another celeryd-<NAME_OF_WORKER>-state-renamed. After removing those I was able to restart my affected worker.
We're trying to use MapReduce heavily in our project. Now we have this problem: there are a lot of 'InternalError: internal error.' errors in the log...
One example of it:
"POST /mapreduce/worker_callback HTTP/1.1" 500 0 "http://appname/mapreduce/worker_callback" "AppEngine-Google;
(+http://code.google.com/appengine)" "appname.appspot.com" ms=18856 cpu_ms=15980
queue_name=default task_name=appengine-mrshard-15828822618486744D69C-11-195
instance=00c61b117c47e0cba49bc5e5c7f9d328693e95ce
W 2012-10-24 06:51:27.140
suspended generator _put_tasklet(context.py:274) raised InternalError(internal error.)
W 2012-10-24 06:51:27.153
suspended generator put(context.py:703) raised InternalError(internal error.)
E 2012-10-24 06:51:27.207
internal error.
Traceback (most recent call last):
File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 1511, in __call__
rv = self.handle_exception(request, response, e)
File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 1505, in __call__
rv = self.router.dispatch(request, response)
File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 1253, in default_dispatcher
return route.handler_adapter(request, response)
File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 1077, in __call__
return handler.dispatch()
File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 547, in dispatch
return self.handle_exception(e, self.app.debug)
File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 545, in dispatch
return method(*args, **kwargs)
File "/base/data/home/apps/s~appname/1.362664407983567993/mapreduce/base_handler.py", line 65, in post
self.handle()
File "/base/data/home/apps/s~appname/1.362664407983567993/mapreduce/handlers.py", line 208, in handle
ctx.flush()
File "/base/data/home/apps/s~appname/1.362664407983567993/mapreduce/context.py", line 333, in flush
pool.flush()
File "/base/data/home/apps/s~appname/1.362664407983567993/mapreduce/context.py", line 221, in flush
self.__flush_ndb_puts()
File "/base/data/home/apps/s~appname/1.362664407983567993/mapreduce/context.py", line 239, in __flush_ndb_puts
ndb.put_multi(self.ndb_puts.items, config=self.__create_config())
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/ext/ndb/model.py", line 3650, in put_multi
for future in put_multi_async(entities, **ctx_options)]
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/ext/ndb/tasklets.py", line 325, in get_result
self.check_success()
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/ext/ndb/tasklets.py", line 368, in _help_tasklet_along
value = gen.throw(exc.__class__, exc, tb)
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/ext/ndb/context.py", line 703, in put
key = yield self._put_batcher.add(entity, options)
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/ext/ndb/tasklets.py", line 368, in _help_tasklet_along
value = gen.throw(exc.__class__, exc, tb)
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/ext/ndb/context.py", line 274, in _put_tasklet
keys = yield self._conn.async_put(options, datastore_entities)
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/ext/ndb/tasklets.py", line 454, in _on_rpc_completion
result = rpc.get_result()
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/datastore/datastore_rpc.py", line 834, in get_result
result = rpc.get_result()
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/api/apiproxy_stub_map.py", line 604, in get_result
return self.__get_result_hook(self)
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/datastore/datastore_rpc.py", line 1569, in __put_hook
self.check_rpc_success(rpc)
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/datastore/datastore_rpc.py", line 1224, in check_rpc_success
raise _ToDatastoreError(err)
InternalError: internal error.
queue.yaml:
queue:
- name: default
  rate: 500/s
  bucket_size: 100
  max_concurrent_requests: 400
  retry_parameters:
    min_backoff_seconds: 5
    max_backoff_seconds: 120
    max_doublings: 2
MapReduce mapper params:
'shard_count': 16,
'processing_rate': 200,
'batch_size': 20
We would like to increase these numbers, since we need more processing speed, but as soon as we increase them, the error rate goes up...
Blobstore files count: several (some of them contain millions of lines)
Frontend Instance Class: F4
Processing flow:
We use only the mapper for this particular processing.
We use BlobstoreLineInputReader (each blob contains a text file).
Each line represents a new entry we need to create if it does not already exist (some of them we update).
My questions are:
How can we avoid these errors?
Are there any tips/hints on how we can choose/balance the mapper params (shard_count, processing_rate, batch_size)?
What happens with the job? Does it get retried (if so, how can we control it?) or not?
BTW, we tried some of the suggestions provided here (controlling batch_size), but we still see this.
This looks like a timeout error - check your logs to see how long that process runs before it happens.
If it is, you should try reducing the number of items you're calling put_multi() on (i.e. reduce your batch size) and adding a timer check so that when your average time per put_multi() call gets close to the process time limit, you quit and let another task start.
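As a rough sketch of that idea (assuming you control the code that flushes the entities; CHUNK, TIME_BUDGET and defer_remaining are placeholders, not existing APIs), chunking the puts and bailing out near the deadline might look like:

import time
from google.appengine.ext import ndb

CHUNK = 20          # entities per put_multi() call
TIME_BUDGET = 540   # seconds; stay well under the 10-minute task deadline

def put_in_chunks(entities, defer_remaining):
    start = time.time()
    for i in range(0, len(entities), CHUNK):
        ndb.put_multi(entities[i:i + CHUNK])
        if time.time() - start > TIME_BUDGET:
            # close to the request deadline: stop and hand the rest to a new task
            defer_remaining(entities[i + CHUNK:])
            return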
I have created a backend for my Google App Engine app that looks like this:
backends:
- name: dbops
  options: dynamic
and I've created an admin handler for it:
- url: /backend/.*
  script: backend.app
  login: admin
Now I understand that admin jobs should be able to run forever, and I'm launching this job with a TaskQueue, but for some reason mine does not. My job simply creates a summary table in the datastore from a much larger table. That table holds about 12,000 records, and it takes several minutes to process the job on the development server, but it works fine. When I push the code out to appspot and try to run the same job, I get what looks like datastore timeouts.
Traceback (most recent call last):
File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.5.1/webapp2.py", line 1536, in __call__
rv = self.handle_exception(request, response, e)
File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.5.1/webapp2.py", line 1530, in __call__
rv = self.router.dispatch(request, response)
File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.5.1/webapp2.py", line 1278, in default_dispatcher
return route.handler_adapter(request, response)
File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.5.1/webapp2.py", line 1102, in __call__
return handler.dispatch()
File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.5.1/webapp2.py", line 572, in dispatch
return self.handle_exception(e, self.app.debug)
File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.5.1/webapp2.py", line 570, in dispatch
return method(*args, **kwargs)
File "/base/data/home/apps/s~myzencoder/dbops.362541511260492787/backend.py", line 626, in get
for asset in assets:
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/ext/db/__init__.py", line 2314, in next
return self.__model_class.from_entity(self.__iterator.next())
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/datastore/datastore_query.py", line 2816, in next
next_batch = self.__batcher.next()
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/datastore/datastore_query.py", line 2678, in next
return self.next_batch(self.AT_LEAST_ONE)
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/datastore/datastore_query.py", line 2715, in next_batch
batch = self.__next_batch.get_result()
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/api/apiproxy_stub_map.py", line 604, in get_result
return self.__get_result_hook(self)
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/datastore/datastore_query.py", line 2452, in __query_result_hook
self._batch_shared.conn.check_rpc_success(rpc)
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/datastore/datastore_rpc.py", line 1224, in check_rpc_success
raise _ToDatastoreError(err)
Timeout: The datastore operation timed out, or the data was temporarily unavailable.
Anyone got any suggestions on how to make this work?
While the backend request can run for a long time, a single query can only run for 60 seconds. You'll have to loop over your query results with cursors.
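A rough sketch of that cursor loop (Asset is a hypothetical db.Model standing in for whatever your query iterates over, and the batch size is arbitrary):

from google.appengine.ext import db

def process_all_assets(process):
    query = Asset.all()
    cursor = None
    while True:
        if cursor:
            query.with_cursor(cursor)     # resume where the previous batch ended
        batch = query.fetch(500)
        if not batch:
            break
        for asset in batch:
            process(asset)
        cursor = query.cursor()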
Mapreduce will get you a result quicker by doing the queries in parallel.
In production you use the HR datastore, where you can run into contention problems. See this article:
https://developers.google.com/appengine/articles/scaling/contention?hl=nl
And have a look at mapreduce for creating the report; it may be a better solution.
We're trying to use MapReduce heavily in our project.
Now we have this problem: there are a lot of 'DeadlineExceededError' errors in the log...
One example (the traceback differs a bit each time):
Traceback (most recent call last):
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/runtime/wsgi.py", line 207, in Handle
result = handler(dict(self._environ), self._StartResponse)
File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 1505, in __call__
rv = self.router.dispatch(request, response)
File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 1253, in default_dispatcher
return route.handler_adapter(request, response)
File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 1077, in __call__
return handler.dispatch()
File "/base/python27_runtime/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 545, in dispatch
return method(*args, **kwargs)
File "/base/data/home/apps/s~sba/1.362471299468574812/mapreduce/base_handler.py", line 65, in post
self.handle()
File "/base/data/home/apps/s~sba/1.362471299468574812/mapreduce/handlers.py", line 208, in handle
ctx.flush()
File "/base/data/home/apps/s~sba/1.362471299468574812/mapreduce/context.py", line 333, in flush
pool.flush()
File "/base/data/home/apps/s~sba/1.362471299468574812/mapreduce/context.py", line 221, in flush
self.__flush_ndb_puts()
File "/base/data/home/apps/s~sba/1.362471299468574812/mapreduce/context.py", line 239, in __flush_ndb_puts
ndb.put_multi(self.ndb_puts.items, config=self.__create_config())
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/ext/ndb/model.py", line 3625, in put_multi
for future in put_multi_async(entities, **ctx_options)]
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/ext/ndb/tasklets.py", line 323, in get_result
self.check_success()
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/ext/ndb/tasklets.py", line 318, in check_success
self.wait()
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/ext/ndb/tasklets.py", line 302, in wait
if not ev.run1():
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/ext/ndb/eventloop.py", line 219, in run1
delay = self.run0()
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/ext/ndb/eventloop.py", line 181, in run0
callback(*args, **kwds)
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/ext/ndb/tasklets.py", line 365, in _help_tasklet_along
value = gen.send(val)
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/ext/ndb/context.py", line 274, in _put_tasklet
keys = yield self._conn.async_put(options, datastore_entities)
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/datastore/datastore_rpc.py", line 1560, in async_put
for pbs, indexes in pbsgen:
File "/base/python27_runtime/python27_lib/versions/1/google/appengine/datastore/datastore_rpc.py", line 1350, in __generate_pb_lists
incr_size = pb.lengthString(pb.ByteSize()) + 1
DeadlineExceededError
My questions are:
How can we avoid this error?
What happens with the job? Does it get retried (if so, how can we control it?) or not?
Does it cause data inconsistency in the end?
Apparently you are doing more puts than it is possible to insert in one datastore call. You have multiple options here:
If this is a relatively rare event, ignore it. Mapreduce will retry the slice and lower the put pool size. Make sure that your map is idempotent.
Take a look at http://code.google.com/p/appengine-mapreduce/source/browse/trunk/python/src/mapreduce/context.py - in your main.py you can lower DATASTORE_DEADLINE, MAX_ENTITY_COUNT or MAX_POOL_SIZE to lower the size of the pool for the whole mapreduce (a sketch of such an override follows after this list).
If you're using an InputReader, you might be able to adjust the default batch_size to reduce the number of entities processed by each task.
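As a rough sketch of the override mentioned in the second option (assuming these constants are module-level in your bundled copy of mapreduce/context.py; the values shown are arbitrary examples):

# Hypothetical override in main.py, before any mapreduce handler runs.
from mapreduce import context

context.DATASTORE_DEADLINE = 15       # seconds allowed per datastore RPC
context.MAX_ENTITY_COUNT = 50         # entities buffered before a put pool flush
context.MAX_POOL_SIZE = 100 * 1024    # bytes buffered before a put pool flush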
I believe the task queue will retry tasks, but you probably don't want it to, since it'll likely hit the same DeadlineExceededError.
Data inconsistencies are possible.
See this question as well.
App Engine - Task Queue Retry Count with Mapper API