I have written a simple MapReduce flow to read in lines from a CSV from a file on Google Cloud Storage and subsequently make an Entity. However, I can't seem to get it to run on more than one shard.
The code makes use of mapreduce.control.start_map and looks something like this.
class LoadEntitiesPipeline(webapp2.RequestHandler):
    id = control.start_map(map_name,
        handler_spec="backend.line_processor",
        reader_spec="mapreduce.input_readers.FileInputReader",
        queue_name=get_queue_name("q-1"),
        shard_count=shard_count,
        mapper_parameters={
            'shard_count': shard_count,
            'batch_size': 50,
            'processing_rate': 1000000,
            'files': [gsfile],
            'format': 'lines'})
I have shard_count in both places, because I'm not sure what methods actually need it. Setting shard_count anywhere from 8 to 32, doesn't change anything as the status page always says 1/1 shards running. To separate things, I've made everything run on a backend queue with a large number of instances. I've tried adjusting the queue parameters per this wiki. In the end, it seems to just run serially.
Any ideas? Thanks!
Update (Still no success):
In trying to isolate things, I tried making the call using direct calls to pipeline like so:
class ImportHandler(webapp2.RequestHandler):
    def get(self, gsfile):
        pipeline = LoadEntitiesPipeline2(gsfile)
        pipeline.start(queue_name=get_queue_name("q-1"))
        self.redirect(pipeline.base_path + "/status?root=" + pipeline.pipeline_id)


class LoadEntitiesPipeline2(base_handler.PipelineBase):
    def run(self, gsfile):
        yield mapreduce_pipeline.MapperPipeline(
            'loadentities2_' + gsfile,
            'backend.line_processor',
            'mapreduce.input_readers.FileInputReader',
            params={'files': [gsfile], 'format': 'lines'},
            shards=32
        )
With this new code, it still only runs on one shard. I'm starting to wonder if mapreduce.input_readers.FileInputReader is capable of parallelizing input by line.
It looks like FileInputReader can only shard by file. The format parameter only changes how the mapper function gets called. If you pass more than one file to the mapper, it will start to run on more than one shard; otherwise it will use only one shard to process the data.
EDIT #1:
After digging deeper into the mapreduce library: MapReduce decides whether to split a file into pieces based on what the can_split method returns for each file format it defines. Currently, the only format that implements the split method is ZipFormat. So if your file format is not zip, the file won't be split to run on more than one shard.
@classmethod
def can_split(cls):
    """Indicates whether this format support splitting within a file boundary.

    Returns:
      True if a FileFormat allows its inputs to be splitted into
      different shards.
    """
https://code.google.com/p/appengine-mapreduce/source/browse/trunk/python/src/mapreduce/file_formats.py
But it looks like it is possible to write your own split method for a file format. You can try hacking a split method onto _TextFormat first and see whether more than one shard runs.
@classmethod
def split(cls, desired_size, start_index, opened_file, cache):
    pass
EDIT #2:
An easy workaround would be to let the FileInputReader run serially but move the time-consuming work to the parallel reduce stage.
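import random

def line_processor(line):
    # Map stage: runs serially because the input is a single file.
    yield (random.randrange(1000), line)

def reducer(key, values):
    # Reduce stage: runs in parallel across the random keys.
    entities = []
    for v in values:
        entities.append(CREATE_ENTITY_FROM_VALUE(v))
    db.put(entities)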
EDIT #3:
If you try to modify the FileFormat, here is an example (it hasn't been tested yet):
from file_formats import _TextFormat, FORMATS


class _LinesSplitFormat(_TextFormat):
    """Read file line by line."""

    NAME = 'split_lines'

    def get_next(self):
        """Inherited."""
        index = self.get_index()
        cache = self.get_cache()
        # Byte offset of the current line: total length of all earlier lines.
        offset = sum(cache['infolist'][:index])

        self.get_current_file().seek(offset)
        result = self.get_current_file().readline()
        if not result:
            raise EOFError()
        if 'encoding' in self._kwargs:
            result = result.encode(self._kwargs['encoding'])
        return result

    @classmethod
    def can_split(cls):
        """Inherited."""
        return True

    @classmethod
    def split(cls, desired_size, start_index, opened_file, cache):
        """Inherited."""
        if 'infolist' in cache:
            infolist = cache['infolist']
        else:
            # Cache the length of every line on the first call.
            infolist = []
            for i in opened_file:
                infolist.append(len(i))
            cache['infolist'] = infolist

        index = start_index
        while desired_size > 0 and index < len(infolist):
            desired_size -= infolist[index]
            index += 1
        return desired_size, index


FORMATS['split_lines'] = _LinesSplitFormat
The new file format can then be used by changing 'format' in mapper_parameters from lines to split_lines.
class LoadEntitiesPipeline(webapp2.RequestHandler):
    id = control.start_map(map_name,
        handler_spec="backend.line_processor",
        reader_spec="mapreduce.input_readers.FileInputReader",
        queue_name=get_queue_name("q-1"),
        shard_count=shard_count,
        mapper_parameters={
            'shard_count': shard_count,
            'batch_size': 50,
            'processing_rate': 1000000,
            'files': [gsfile],
            'format': 'split_lines'})
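The split method above boils down to walking a list of line lengths until a byte budget is used up. As a rough illustration of that bookkeeping, here is a standalone sketch that does not touch the mapreduce library at all; the function name plan_line_shards and its return shape are my own assumptions, not part of the library.

```python
def plan_line_shards(line_lengths, shard_count):
    """Group consecutive lines into at most shard_count ranges of similar byte size.

    Hypothetical helper for illustration only.
    Returns a list of (start_index, end_index) pairs, end exclusive.
    """
    total = sum(line_lengths)
    if total == 0 or shard_count < 1:
        return []
    # Ceiling division so the per-shard budgets cover the whole file.
    budget = -(-total // shard_count)
    shards = []
    start = 0
    used = 0
    for index, length in enumerate(line_lengths):
        used += length
        if used >= budget:
            # Budget reached: close the current shard after this line.
            shards.append((start, index + 1))
            start = index + 1
            used = 0
    if start < len(line_lengths):
        # Whatever is left over forms the final, smaller shard.
        shards.append((start, len(line_lengths)))
    return shards
```

Shards always break on line boundaries, so a very long line can make one shard noticeably larger than the others.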
It looks to me like FileInputReader should be capable of sharding based on a quick reading of:
https://code.google.com/p/appengine-mapreduce/source/browse/trunk/python/src/mapreduce/input_readers.py
It looks like 'format': 'lines' should split using: self.get_current_file().readline()
Does it seem to be interpreting the lines correctly when it is working serially? Maybe the line breaks are the wrong encoding or something.
From experience FileInputReader will do a max of one shard per file.
Solution: Split your big files. I use split_file in https://github.com/johnwlockwood/karl_data to shard files before uploading them to Cloud Storage.
If the big files are already up there, you can use a Compute Engine instance to pull them down and do the sharding because the transfer speed will be fastest.
FYI: karld is in the cheeseshop so you can pip install karld
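If you would rather not pull in a dependency, the same idea can be sketched with the standard library. This is not karld's split_file, just a minimal stand-in; the function name split_file_by_lines and the ".partNNNN" naming scheme are assumptions made for the example.

```python
import os


def split_file_by_lines(src_path, dest_dir, lines_per_file):
    """Copy src_path into numbered part files of at most lines_per_file lines.

    Hypothetical helper for illustration only.
    Returns the list of part file paths, in order.
    """
    if lines_per_file < 1:
        raise ValueError("lines_per_file must be at least 1")
    os.makedirs(dest_dir, exist_ok=True)
    base = os.path.basename(src_path)
    part_paths = []
    out = None
    try:
        # newline="" keeps the original line endings untouched.
        with open(src_path, "r", newline="") as src:
            for line_number, line in enumerate(src):
                if line_number % lines_per_file == 0:
                    # Start a new part file every lines_per_file lines.
                    if out is not None:
                        out.close()
                    part_path = os.path.join(
                        dest_dir, "%s.part%04d" % (base, len(part_paths)))
                    part_paths.append(part_path)
                    out = open(part_path, "w", newline="")
                out.write(line)
    finally:
        if out is not None:
            out.close()
    return part_paths
```

Each resulting part can then be uploaded and listed in 'files', so the reader gets one shard per part.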
So let's say I have two solids. The first does some computations and writes a file to disk. The second solid takes that file and does other things with it, but it needs its filesystem path in order to open it. I can do this with two yields (one for the AssetMaterialization and the other for the str Output) and explicitly putting the Output in the second solid call:
from dagster import (AssetKey, AssetMaterialization, EventMetadataEntry,
                     Output, execute_pipeline, pipeline, solid)


@solid
def yield_asset(context):
    yield AssetMaterialization(
        asset_key=AssetKey('my_dataset'),
        description='Persisted result to storage',
        metadata_entries=[
            EventMetadataEntry.text('Text-based metadata for this event',
                                    label='text_metadata'),
            EventMetadataEntry.fspath('/path/to/data/on/filesystem'),
            EventMetadataEntry.url('http://mycoolsite.com/url_for_my_data',
                                   label='dashboard_url'),
        ],
    )
    yield Output('/path/to/data/on/filesystem')


@solid
def print_asset_path(context, asset_path: str):
    # do stuff with `asset_path`
    context.log.info(asset_path)


@pipeline
def some_pipeline():
    asset_path = yield_asset()
    print_asset_path(asset_path)


if __name__ == "__main__":
    result = execute_pipeline(some_pipeline)
This works fine, and you should get the info message in the logs (2021-03-16 13:23:29 - dagster - INFO - system - 366248ec-6a83-462f-b62f-9fb2514f6f80 - print_asset_path - /path/to/data/on/filesystem) and the AssetMaterialization in dagit.
However, this is kind of inconvenient, since I need to explicitly yield an Output with the filesystem path that I need. Is it possible, and how, to reference the AssetMaterialization in the second solid, and use its properties directly?
Something like (won't work):
@solid
def print_asset_path(context):
    asset_path = context.assets.get_asset_by_key('my_key').fspath
    # do stuff with `asset_path`
    context.log.info(asset_path)
The code you've provided is currently the best way to accomplish this in Dagster.
If the fspath is known before the solid itself executes, then the directions outlined in these two issues (not yet implemented) might offer a more elegant solution:
https://github.com/dagster-io/dagster/issues/3894
https://github.com/dagster-io/dagster/issues/3895
I am working on a Django-based web app that takes a Python file as input. The file contains a function; in the backend I have some lists that are passed as parameters to the user's function, which generates a single output value. That result is then used for further computation.
Here is how the function inside the user's file look like :
def somefunctionname(list):
    ''' some computation performed on list'''
    return float value
At present I take the user's file as a normal file upload. Then in my views.py I import the file as a module and call the function with eval. A snippet is given below.
Here modulename is the name of the Python file taken from the user, imported as a module:
exec("import "+modulename)
result = eval(f"{modulename}.{somefunctionname}(arguments)")
This works fine, but I know it is not a secure approach.
My question: is there another way to run the user's file securely? I know no solution can be foolproof, but what are the other options (for example, if it can be solved with Docker, what would the approach be, or are there external tools I can use via an API)?
Or, if possible, can somebody tell me how I can simply sandbox this, or point me to a tutorial that would help?
Any reference or resource will be helpful.
It is an important question. In Python, sandboxing is not trivial.
It is one of the few cases where it matters which Python interpreter you are using. For example, Jython generates Java bytecode, and the JVM has its own mechanism for running code securely.
For CPython, the default interpreter, there were originally some attempts at a restricted execution mode, but they were abandoned a long time ago.
Currently there is an unofficial project, RestrictedPython, that might give you what you need. It is not a full sandbox, i.e. it will not give you restricted filesystem access or the like, but for your needs it may be just enough.
Basically, its authors rewrote Python compilation in a more restricted way.
It lets you compile a piece of code and then execute it, all in a restricted mode. For example:
from RestrictedPython import safe_builtins, compile_restricted

source_code = """
print('Hello world, but secure')
"""

byte_code = compile_restricted(
    source_code,
    filename='<string>',
    mode='exec'
)
# The globals must be a dict literal mapping "__builtins__" to the safe set.
exec(byte_code, {"__builtins__": safe_builtins})
>>> Hello world, but secure
Running with builtins = safe_builtins disables the dangerous functions like open file, import or whatever. There are also other variations of builtins and other options, take some time to read the docs, they are pretty good.
EDIT:
Here is an example for your use case:
from RestrictedPython import safe_builtins, compile_restricted
from RestrictedPython.Eval import default_guarded_getitem


def execute_user_code(user_code, user_func, *args, **kwargs):
    """ Executed user code in restricted env
    Args:
        user_code(str) - String containing the unsafe code
        user_func(str) - Function inside user_code to execute and return value
        *args, **kwargs - arguments passed to the user function
    Return:
        Return value of the user_func
    """

    def _apply(f, *a, **kw):
        return f(*a, **kw)

    try:
        # These are the variables we allow user code to see. result will contain the return value.
        restricted_locals = {
            "result": None,
            "args": args,
            "kwargs": kwargs,
        }

        # If you want the user to be able to use some of your functions inside his code,
        # you should add this function to this dictionary.
        # By default many standard actions are disabled. Here I add _apply_ to be able to access
        # args and kwargs and _getitem_ to be able to use arrays. Just think before you add
        # something else. I am not saying you shouldn't do it. You should understand what you
        # are doing, that's all.
        restricted_globals = {
            "__builtins__": safe_builtins,
            "_getitem_": default_guarded_getitem,
            "_apply_": _apply,
        }

        # Add another line to user code that executes user_func
        user_code += "\nresult = {0}(*args, **kwargs)".format(user_func)

        # Compile the user code
        byte_code = compile_restricted(user_code, filename="<user_code>", mode="exec")

        # Run it
        exec(byte_code, restricted_globals, restricted_locals)

        # User code has modified result inside restricted_locals. Return it.
        return restricted_locals["result"]

    except SyntaxError as e:
        # Do whatever you want if the user has code that does not compile
        raise
    except Exception as e:
        # The code did something that is not allowed. Add some nasty punishment to the user here.
        raise
Now you have a function execute_user_code, that receives some unsafe code as a string, a name of a function from this code, arguments, and returns the return value of the function with the given arguments.
Here is a very stupid example of some user code:
example = """
def test(x, name="Johny"):
    return name + " likes " + str(x*x)
"""
# Lets see how this works
print(execute_user_code(example, "test", 5))
# Result: Johny likes 25
But here is what happens when the user code tries to do something unsafe:
malicious_example = """
import sys
print("Now I have the access to your system, muhahahaha")
"""
# Lets see how this works
print(execute_user_code(malicious_example, "test", 5))
# Result - evil plan failed:
# Traceback (most recent call last):
# File "restr.py", line 69, in <module>
# print(execute_user_code(malitious_example, "test", 5))
# File "restr.py", line 45, in execute_user_code
# exec(byte_code, restricted_globals, restricted_locals)
# File "<user_code>", line 2, in <module>
#ImportError: __import__ not found
Possible extension:
Pay attention that the user code is compiled on each call to the function. However, it is possible that you would like to compile the user code once, then execute it with different parameters. So all you have to do is to save the byte_code somewhere, then to call exec with a different set of restricted_locals each time.
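A minimal sketch of that compile-once idea follows. To keep it runnable without extra packages it defaults to the built-in compile(), which applies no restrictions at all; the assumption is that you would pass compile_restricted as compile_fn and the restricted_globals dictionary from above as base_globals. The names make_runner, compile_fn and base_globals are hypothetical.

```python
def make_runner(user_code, user_func, compile_fn=compile, base_globals=None):
    """Compile user_code once and return a callable that runs user_func.

    Hypothetical helper. The default compile_fn is the built-in compile(),
    which is NOT a sandbox; it is only a stand-in for compile_restricted.
    """
    source = user_code + "\nresult = {0}(*args, **kwargs)".format(user_func)
    # The expensive step happens exactly once, here.
    byte_code = compile_fn(source, "<user_code>", "exec")

    def run(*args, **kwargs):
        # Build a fresh namespace per call so no state leaks between runs.
        scope = dict(base_globals or {})
        scope.update({"args": args, "kwargs": kwargs, "result": None})
        exec(byte_code, scope)
        return scope["result"]

    return run
```

Unlike execute_user_code, this sketch uses a single namespace per call instead of separate globals and locals, which keeps the example short; whether that fits your policy is a judgment call.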
EDIT2:
If you want to use import, you can write your own import function that allows to use only modules that you consider safe. Example:
def _import(name, globals=None, locals=None, fromlist=(), level=0):
    safe_modules = ["math"]
    if name in safe_modules:
        globals[name] = __import__(name, globals, locals, fromlist, level)
    else:
        raise Exception("Don't you even think about it {0}".format(name))

safe_builtins['__import__'] = _import  # Must be a part of builtins
restricted_globals = {
    "__builtins__": safe_builtins,
    "_getitem_": default_guarded_getitem,
    "_apply_": _apply,
}

....
i_example = """
import math
def myceil(x):
    return math.ceil(x)
"""
print(execute_user_code(i_example, "myceil", 1.5))
Note that this sample import function is VERY primitive, it will not work with stuff like from x import y. You can look here for a more complex implementation.
EDIT3
Note, that lots of python built in functionality is not available out of the box in RestrictedPython, it does not mean it is not available at all. You may need to implement some function for it to become available.
Even some obvious things like sum or += operator are not obvious in the restricted environment.
For example, the for loop uses _getiter_ function that you must implement and provide yourself (in globals). Since you want to avoid infinite loops, you may want to put some limits on the number of iterations allowed. Here is a sample implementation that limits number of iterations to 100:
MAX_ITER_LEN = 100

class MaxCountIter:
    def __init__(self, dataset, max_count):
        self.i = iter(dataset)
        self.left = max_count

    def __iter__(self):
        return self

    def __next__(self):
        if self.left > 0:
            self.left -= 1
            return next(self.i)
        else:
            raise StopIteration()

def _getiter(ob):
    return MaxCountIter(ob, MAX_ITER_LEN)

....
restricted_globals = {
    "_getiter_": _getiter,
....
for_ex = """
def sum(x):
    y = 0
    for i in range(x):
        y = y + i
    return y
"""
print(execute_user_code(for_ex, "sum", 6))
If you don't want to limit the loop count, just use the identity function as _getiter_:
restricted_globals = {
    "_getiter_": lambda x: x,
Note that simply limiting the loop count does not guarantee security. First, loops can be nested. Second, you cannot limit the execution count of a while loop. To make it secure, you have to execute unsafe code under some timeout.
Please take a moment to read the docs.
Note that not everything is documented (although many things are). You have to learn to read the project's source code for more advanced things. Best way to learn is to try and run some code, and to see what kind function is missing, then to see the source code of the project to understand how to implement it.
EDIT4
There is still another problem - restricted code may have infinite loops. To avoid it, some kind of timeout is required on the code.
Unfortunately, since you are using Django, which is multithreaded unless you explicitly specify otherwise, the simple trick of implementing timeouts with signals will not work here; you have to use multiprocessing.
Easiest way in my opinion - use this library. Simply add a decorator to execute_user_code so it will look like this:
@timeout_decorator.timeout(5, use_signals=False)
def execute_user_code(user_code, user_func, *args, **kwargs):
And you are done. The code will never run more than 5 seconds.
Pay attention to use_signals=False, without this it may have some unexpected behavior in django.
Also note that this is relatively heavy on resources (and I don't really see a way to overcome this). I mean not really crazy heavy, but it is an extra process spawn. You should hold that in mind in your web server configuration - the api which allows to execute arbitrary user code is more vulnerable to ddos.
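If you prefer to avoid the extra dependency, a process-based timeout can also be sketched with the standard library alone. This is not how timeout_decorator works internally, just an alternative under the assumption that the user code can be handed to a child interpreter as a source string; the function name run_python_with_timeout is hypothetical. Note that the child runs the source unrestricted, so the restricted execution would still have to happen inside it.

```python
import subprocess
import sys


def run_python_with_timeout(source, timeout_seconds):
    """Run source in a separate Python process and return its stdout.

    Hypothetical helper. Raises subprocess.TimeoutExpired if the child
    runs longer than timeout_seconds (subprocess.run kills the child
    before re-raising) and subprocess.CalledProcessError if it exits
    with a non-zero status.
    """
    completed = subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True,  # collect stdout/stderr instead of inheriting them
        text=True,            # decode output as str
        timeout=timeout_seconds,
        check=True,
    )
    return completed.stdout
```

As with the decorator, every call costs one process spawn, so the same caveat about server load applies.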
With Docker you can certainly sandbox the execution if you are careful. You can restrict CPU cycles and maximum memory, close all network ports, run as a user with read-only access to the file system, and so on.
Still, I think this would be extremely complex to get right. In my view you should not allow a client to execute arbitrary code like that.
I would check whether an existing product or solution already does this and use that. Some sites do allow you to submit code (Python, Java, whatever) that is executed on the server.
I am trying to find a target pattern or cache config to differentiate between tasks with the same name in a flow.
As highlighted from the diagram above only one of the tasks gets cached and the other get overwritten. I tried using task-slug but to no avail.
@task(
    name="process_resource-{task_slug}",
    log_stdout=True,
    target=task_target
)
Thanks in advance
It looks like you are attempting to format the task name instead of the target. (task names are not template-able strings).
The following snippet is probably what you want:
@task(name="process_resource", log_stdout=True, target="{task_name}-{task_slug}")
After further research it looks like the documentation directly addresses changing task configuration on the fly - Without breaking target location templates.
@task
def number_task():
    return 42

with Flow("example-v3") as f:
    result = number_task(task_args={"name": "new-name"})

print(f.tasks)  # {<Task: new-name>}
I raised a feature request on the CDK GitHub account recently and was pointed in the direction of core.Token as being pretty much the exact functionality I was looking for. I'm now having some issues implementing it and getting similar errors. Here's the feature request I raised previously: https://github.com/aws/aws-cdk/issues/3800
So my current code looks something like this:
fargate_service = ecs_patterns.LoadBalancedFargateService(
    self, "Fargate",
    cluster = cluster,
    memory_limit_mib = core.Token.as_number(ssm.StringParameter.value_from_lookup(self, parameter_name='template-service-memory_limit')),
    execution_role=fargate_iam_role,
    container_port=core.Token.as_number(ssm.StringParameter.value_from_lookup(self, parameter_name='port')),
    cpu = core.Token.as_number(ssm.StringParameter.value_from_lookup(self, parameter_name='template-service-container_cpu')),
    image=ecs.ContainerImage.from_registry(ecrRepo)
)
When I try to synthesise this code I get the following error:
jsii.errors.JavaScriptError:
Error: Resolution error: Supplied properties not correct for "CfnSecurityGroupEgressProps"
fromPort: "dummy-value-for-template-service-container_port" should be a number
toPort: "dummy-value-for-template-service-container_port" should be a number.
Object creation stack:
To me it seems to be getting past the validation in FargateService that requires a number, but when it then tries to create the resources after that ("CfnSecurityGroupEgressProps") it can't resolve the dummy string as a number. I'd appreciate any help on solving this, or alternative suggestions for passing in values from AWS system parameters instead (I thought it might be possible to parse the values in via a file pulled from S3 during the build pipeline or something along those lines, but that seems hacky).
With some help I think we've cracked this!
The problem was that I was passing "ssm.StringParameter.value_from_lookup". The solution is to provide the token with "ssm.StringParameter.value_for_string_parameter": when this is synthesised it stores the token, and upon deployment the value stored in the Systems Manager Parameter Store is substituted.
(We also came up with another approach for achieving similar which we're probably going to use over SSM approach, I've detailed below the code snippet if you're interested)
See the complete code below:
from aws_cdk import (
    aws_ec2 as ec2,
    aws_ssm as ssm,
    aws_iam as iam,
    aws_ecs as ecs,
    aws_ecs_patterns as ecs_patterns,
    core,
)


class GenericFargateService(core.Stack):

    def __init__(self, scope: core.Construct, id: str, **kwargs) -> None:
        super().__init__(scope, id, **kwargs)

        containerPort = core.Token.as_number(ssm.StringParameter.value_for_string_parameter(
            self, 'template-service-container_port'))

        vpc = ec2.Vpc(
            self, "cdk-test-vpc",
            max_azs=2
        )

        cluster = ecs.Cluster(
            self, 'cluster',
            vpc=vpc
        )

        fargate_iam_role = iam.Role(self,"execution_role",
            assumed_by = iam.ServicePrincipal("ecs-tasks"),
            managed_policies=[iam.ManagedPolicy.from_aws_managed_policy_name("AmazonEC2ContainerRegistryFullAccess")]
        )

        fargate_service = ecs_patterns.LoadBalancedFargateService(
            self, "Fargate",
            cluster = cluster,
            memory_limit_mib = 1024,
            execution_role=fargate_iam_role,
            container_port=containerPort,
            cpu = 512,
            image=ecs.ContainerImage.from_registry("000000000000.dkr.ecr.eu-west-1.amazonaws.com/template-service-ecr")
        )

        fargate_service.target_group.configure_health_check(path=self.node.try_get_context("health_check_path"), port="9000")


app = core.App()
GenericFargateService(app, "generic-fargate-service", env={'account':'000000000000', 'region': 'eu-west-1'})
app.synth()
Solutions to problems are like buses, apparently you spend ages waiting for one and then two arrive together. And I think this new bus is the option we're probably going to run with.
The plan is to have developers provide an override for the cdk.json file within their code repos, which can then be passed into the CDK pipeline where the generic code will be synthesised. This file will contain some "context", which will then be used within the CDK to set our variables for the LoadBalancedFargateService.
I've included some code snippets for setting cdk.json file and then using its values within code below.
Example CDK.json:
{
    "app": "python3 app.py",
    "context": {
        "container_name":"template-service",
        "memory_limit":1024,
        "container_cpu":512,
        "health_check_path": "/gb/template/v1/status",
        "ecr_repo": "000000000000.dkr.ecr.eu-west-1.amazonaws.com/template-service-ecr"
    }
}
Python example for assigning context to variables:
memoryLimitMib = self.node.try_get_context("memory_limit")
I believe we could also fall back to some default values (for example with a try/except block or a None check) when a key is not provided by the developer in their cdk.json file.
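One way the default handling might look is sketched below. As far as I know, try_get_context returns None for a missing key rather than raising, so a None check should be enough; the helper name context_or_default is hypothetical, and it takes the lookup function as an argument so the sketch runs without the CDK installed.

```python
def context_or_default(get_context, key, default):
    """Return the context value for key, or default when the lookup yields None.

    Hypothetical helper. get_context is any callable that maps a key to a
    value or None, e.g. self.node.try_get_context inside a stack.
    """
    value = get_context(key)
    return default if value is None else value


# Inside a stack this would be used roughly as follows (assumed usage):
#   memoryLimitMib = context_or_default(self.node.try_get_context, "memory_limit", 1024)
```

Checking for None explicitly, rather than using "or", keeps legitimate falsy values such as 0 or an empty string from being replaced by the default.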
I hope this post has provided some useful information to those looking for ways to create a generic template for deploying CDK code! I don't know if we're doing the right thing here, but this tool is so new that it feels like some common patterns don't exist yet.
Edit:
Firstly, thank you @martineau and @jonrsharpe for your prompt reply.
I was initially hesitant to write a verbose description, but I now realize that I am sacrificing clarity for brevity. (thanks @jonrsharpe for the link).
So here's my attempt to describe what I am up to as succinctly as possible:
I have implemented the Lempel-Ziv-Welch text file compression algorithm in form of a python package. Here's the link to the repository.
Basically, I have a compress class in the lzw.Compress module, which takes in as input the file name(and a bunch of other jargon parameters) and generates the compressed file which is then decompressed by the decompress class within the lzw.Decompress module generating the original file.
Now what I want to do is to compress and decompress a bunch of files of various sizes stored in a directory and save and visualize graphically the time taken for compression/decompression along with the compression ratio and other metrics. For this, I am iterating over the list of the file names and passing them as parameters to instantiate the compress class and begin compression by calling the encode() method on it as follows:
import os

os.chdir('/path/to/files/to/be/compressed/')
results = dict()
results['compress_time'] = []
results['other_metrics'] = []
file_path = '/path/to/files/to/be/compressed/'
comp_path = '/path/to/store/compressed/files/'
decomp_path = '/path/to/store/decompressed/file'
files = [_ for _ in os.listdir()]
for f in files:
    from lzw.Compress import compress as comp
    from lzw.Decompress import decompress as decomp
    c = comp(file_path+f,comp_path)  # passing the input file and the output path for storing compressed file.
    c.encode()
    # Then measure time required for compression using time.monotonic()
    del c
    del comp
    d = decomp('/path/to/compressed/file',decomp_path)  # Decompressing
    d.decode()
    # Then measure time required for decompression using
    # time.monotonic()
    # append metrics to lists in the results dict for this particular
    # file
    if decompressed_file_size != original_file_size:
        print("error")
        break
    del d
    del decomp
I have run this code independently for each file without the for loop and have achieved compression and decompression successfully. So there are no problems in the files I wish to compress.
What happens is that whenever I run this loop, the first file (the first iteration) runs successfully, and on the next iteration, after the entire process happens for the 2nd file, "error" is printed and the loop exits. I have tried reordering the list or even reversing it (in case a particular file is the problem), but to no avail.
For the second file/iteration, the decompressed file contents are dubious(not matching the original file). Typically, the decompressed file size is nearly double that of the original.
I strongly suspect that there is something to do with the variables of the class/package retaining their state somehow among different iterations of the loop. (To counter this I am deleting both the instance and the class at the end of the loop as shown in the above snippet, but no success.)
I have also tried to import the classes outside the loop, but no success.
P.S.: I am a python newbie and don't have much of an expertise, so forgive me for not being "pythonic" in my exposition and raising a rather naive issue.
Update:
Thanks to @martineau, one of the problems turned out to be the importing of global variables from another submodule.
But there was another issue which crept in owing to my superficial knowledge about the 'del' operator in python3.
I have this trie data structure in my program which is basically just similar to a binary tree.
I had a self_destruct method to delete the tree as follows:
class trie():
    def __init__(self):
        self.next = {}
        self.value = None
        self.addr = None

    def insert(self, word=str(), addr=int()):
        node = self
        for index, letter in enumerate(word):
            if letter in node.next.keys():
                node = node.next[letter]
            else:
                node.next[letter] = trie()
                node = node.next[letter]
            if index == len(word) - 1:
                node.value = word
                node.addr = addr

    def self_destruct(self):
        node = self
        if node.next == {}:
            return
        for i in node.next.keys():
            node.next[i].self_destruct()
        del node
It turns out that this C-like recursive deletion of objects makes no sense in Python: del only removes the name's binding in the namespace, while the real work is done by the garbage collector.
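A small sketch of that behaviour, using a simplified Node class rather than my actual trie: del inside a function removes only the local name, and the objects go away once the last reference to them is dropped. The immediate cleanup observed through the weak reference is how CPython behaves; other interpreters may free the object later.

```python
import gc
import weakref


class Node:
    """Simplified stand-in for the trie node."""

    def __init__(self):
        self.children = {}


def drop_local_name(node):
    # Removes only the local name "node"; the caller's reference is untouched.
    del node


root = Node()
root.children["a"] = Node()
# A weak reference lets us observe the child without keeping it alive.
child_ref = weakref.ref(root.children["a"])

drop_local_name(root)
alive_after_local_del = child_ref() is not None

# Dropping the last real reference makes the whole tree unreachable.
del root
gc.collect()
freed_after_last_reference = child_ref() is None
```

So a plain del root after use is enough; no recursive walk over the children is needed.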
Still, it's kind of weird that Python retains the state/association of variables even when a new object is created (as shown in my loop snippet in the edit).
So two things solved the problem. First, I removed the global variables and made them local to the module where I need them (so no need to import). Second, I deleted the self_destruct method of the trie and simply did del root, where root = trie(), after use.
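For anyone hitting the same thing, here is a hypothetical illustration of the difference (these are not my actual lzw classes). A module-level variable is created once, when the module is first imported, and every instance sees the same object; deleting the instance or the imported name does not reset it, because the module itself stays cached in sys.modules.

```python
# Module-level state: created once, shared by every instance.
_shared_table = {}


class LeakyEncoder:
    """Keeps its dictionary in a module-level variable (hypothetical example)."""

    def encode(self, text):
        for ch in text:
            _shared_table.setdefault(ch, len(_shared_table))
        return len(_shared_table)


class CleanEncoder:
    """Keeps its dictionary on the instance, so each object starts empty."""

    def __init__(self):
        self.table = {}

    def encode(self, text):
        for ch in text:
            self.table.setdefault(ch, len(self.table))
        return len(self.table)
```

Two successive LeakyEncoder objects encoding "ab" and then "cd" report table sizes of 2 and 4, because the second one inherits the first one's entries, whereas two CleanEncoder objects report 2 and 2.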
Thanks @martineau & @jonrsharpe.