I have an Apache Beam pipeline that reads data from Google Cloud Datastore. The pipeline is run in Google Cloud Dataflow in batch mode and it is written in Python.
The problem is with a templated argument which I'm trying to use to create a Datastore query with a dynamic timestamp filter.
The pipeline is defined as follows:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.io.gcp.datastore.v1new.datastoreio import ReadFromDatastore
from apache_beam.io.gcp.datastore.v1new.types import Query

class UserOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_value_provider_argument('--filter', type=int)

pipeline_options = PipelineOptions()
with beam.Pipeline(options=pipeline_options) as p:
    user_options = pipeline_options.view_as(UserOptions)
    data = (p
            | 'Read' >> ReadFromDatastore(build_query(user_options.filter.get()))
            | ...
And the build_query function as follows:
def build_query(filter):
    return Query(
        kind='Kind',
        filters=[('timestamp', '>', filter)],
        project='Project'
    )
Running this leads to the error RuntimeValueProvider(...).get() not called from a runtime context.
I have also tried ReadFromDatastore(build_query(user_options.filter)), but then the error is ValueError: (u"Unknown protobuf attr type [while running 'Read/Read']", <class 'apache_beam.options.value_provider.RuntimeValueProvider'>).
Everything works just fine if the templated argument is removed from the equation, e.g. like this: ReadFromDatastore(build_query(1563276063)). So the problem is with using a templated argument while building the Datastore query.
My guess is that build_query should be defined some other way, but after spending some time with the documentation and googling I still have no idea how.
Any suggestions how I could solve this are highly appreciated!
EDIT 1
Actually, in this case the filter is always relative to the current timestamp, so passing it as an argument is probably not even necessary if there is some other way to use dynamic values. I tried ReadFromDatastore(build_query(int(time())-90000)), but two consecutive runs contained exactly the same filter.
Value providers need to be supported by the source you're using. Only there can they be unpacked at the right moment.
When creating your own source you obviously have full control over this (a rough sketch of what that can look like follows the list below). When using a pre-existing source I only see two options:
Provide the value at template creation, meaning don't use a template argument for it
Create a PR for the pre-existing source to support template arguments
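To illustrate the "create your own source" route, here is a rough, hedged sketch: resolve the ValueProvider inside a DoFn at runtime and query Datastore with the google.cloud.datastore client directly. This is not the ReadFromDatastore transform (and skips its query splitting), and the class name is just a placeholder, so treat it as an illustration rather than a drop-in replacement:

import apache_beam as beam
from google.cloud import datastore

class QueryDatastoreAtRuntime(beam.DoFn):
    def __init__(self, filter_vp):
        self.filter_vp = filter_vp  # a ValueProvider, resolved in process()

    def process(self, _):
        client = datastore.Client(project='Project')
        query = client.query(kind='Kind')
        # .get() is allowed here because process() runs in a runtime context
        query.add_filter('timestamp', '>', self.filter_vp.get())
        for entity in query.fetch():
            yield entity

# Usage inside the pipeline, seeded with a single dummy element:
# data = (p
#         | 'Seed' >> beam.Create([None])
#         | 'Read' >> beam.ParDo(QueryDatastoreAtRuntime(user_options.filter)))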
Related
I have a simple Lambda function, and because I am new to Python, I am not able to write test cases for that function. The task is very simple: I am uploading an XML document to S3 and returning the URL. Below is the code in the main Python file:
Edit 2: the purpose of the code below is to build an XML document out of a JSON payload, which is passed in as the argument root, and upload it to S3. S3 already has a bucket s3-encora-task, and I have to upload output.xml. I have to write a unit test for that.
import io
import boto3
import xml.etree.ElementTree as ET
from xml.dom import minidom

s3 = boto3.resource('s3')  # created at module level, outside of any method

def uploaddata(root):
    # bucket_name is defined elsewhere in the file
    xmlstr = minidom.parseString(ET.tostring(root)).toprettyxml(indent=" ")
    string_out = io.StringIO()
    string_out.write(xmlstr)
    s3.Object('s3-encora-task', 'output.xml').put(Body=string_out.getvalue())
    location = boto3.client('s3').get_bucket_location(Bucket=bucket_name)['LocationConstraint']
    url = "https://s3-%s.amazonaws.com/%s/%s" % (location, bucket_name, 'output.xml')
    return url
I get an error on this line:
s3.Object('s3-encora-task', 'output.xml').put(Body=string_out.getvalue())
Below is the error:
raise NoCredentialsError()
botocore.exceptions.NoCredentialsError: Unable to locate credentials
I have experience writing JUnit test classes, but I am having trouble mapping JUnit concepts onto unittest, so I am not able to articulate my question well. I have the following questions:
In my Python file, I don't have a class. I am instantiating s3 outside of the methods like this: s3 = boto3.resource('s3'). So how can I mock this s3 object and pass it to the method, so that the mock object is used?
In my test class, I am importing the main Python file as: import jsontoxmlconverter as converter
and then using converter.<> for unit testing, but the method doesn't take s3 as an argument, so again, how can I pass the s3 mock object?
I have read a few things about using moto to mock S3, but I can't figure out how the object gets passed, so any more info on that would help.
Thanks in Advance!!
Edit 1:
Below is the current code in my test class:
import jsontoxmlconverter as converter

def test_jsontoxml_happyflow(self):
    with open('jsonData.txt') as json_file:
        data = json.load(json_file)
    mock = Mock()
    mock.patch('converter.s3')
    result = converter.jsontoxml(data, context={})
Use mock.patch:
import jsontoxmlconverter as converter
from mock import patch

def test_jsontoxml_happyflow(self):
    with patch('jsontoxmlconverter.s3') as mock_s3:
        # make whatever adjustments to mock_s3 as needed
        result = converter.uploaddata("foo")
Note that you need to tell patch which module you want to patch s3 in. When you use patch as a context manager, the patch only applies within the context, which makes it easy to isolate changes like this and guarantee that they don't affect other tests. The mock_s3 object available within the context is a mock.Mock object that takes the place of s3; you can freely reassign its attributes before calling the function that will depend on them.
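For the code in the question, a hedged sketch of a full test could look like this; it assumes s3, boto3 and bucket_name all live at module level in jsontoxmlconverter and patches them so nothing ever touches AWS:

import xml.etree.ElementTree as ET
from unittest.mock import patch

import jsontoxmlconverter as converter

def test_uploaddata_uploads_xml():
    root = ET.Element('payload')  # minimal element so ET.tostring() has something to serialize
    with patch('jsontoxmlconverter.s3') as mock_s3, \
         patch('jsontoxmlconverter.boto3') as mock_boto3, \
         patch('jsontoxmlconverter.bucket_name', 's3-encora-task', create=True):
        mock_boto3.client.return_value.get_bucket_location.return_value = {
            'LocationConstraint': 'eu-west-1'
        }
        url = converter.uploaddata(root)
    # The put() call went to the mock, not to AWS
    mock_s3.Object.assert_called_once_with('s3-encora-task', 'output.xml')
    assert url.endswith('/output.xml')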
I am trying to call an aggregation pipeline with parameters as part of my server-side code within Eve.
The documentation and the code on GitHub (https://github.com/pyeve/eve/blob/master/eve/methods/get.py#L122) suggest that I should be able to call the pipeline using get_internal and that it should run with the parameters passed to it.
I have been trying
get_internal("pipeline", **{'_id': id, 'time': time})
but it appears that the _id and time parameters are not getting passed to the aggregate query.
I have verified that the pipeline is working by visiting the pipeline URL
<baseurl>?aggregate={"_id":"5fa904807d3037e78023a5192","time":1604827480260}
but I would prefer to call it from within the server-side code rather than making a request, if possible.
Is there something obvious that I am doing wrong here?
Thanks
Unfortunately, you cannot use parameters with get_internal and aggregation. _perform_aggregation only uses the immutable request object, while _perform_find merges the where from the request object with your lookup parameters using $and.
You could make an HTTP request to the URL as you've shown works, or you could use app.data.driver and perform the aggregation query manually by importing the pipeline definition and modifying it yourself:
from flask import current_app as app
from domain.yourcollection import my_eve_aggregation  # import the aggregation definition

# Get the aggregation pipeline
pipeline = my_eve_aggregation['datasource']['aggregation']['pipeline']
# Replace parameters - you need to manually replace the correct stages
pipeline[0]['$match'] = {'_id': '5fa904807d3037e78023a5192', 'time': 1604827480260}
# Get the mongo collection name
datasource = my_eve_aggregation['datasource']['source']
# Set the db collection
col = app.data.driver.db[datasource]
result = list(col.aggregate(pipeline))
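If you need different parameters per call, the same idea can be wrapped in a small helper; this is just a sketch, and it still assumes the $match stage is the first stage of your pipeline:

import copy

from flask import current_app as app
from domain.yourcollection import my_eve_aggregation

def run_aggregation(_id, time):
    # deep-copy so repeated calls don't mutate the shared pipeline definition
    pipeline = copy.deepcopy(my_eve_aggregation['datasource']['aggregation']['pipeline'])
    pipeline[0]['$match'] = {'_id': _id, 'time': time}
    col = app.data.driver.db[my_eve_aggregation['datasource']['source']]
    return list(col.aggregate(pipeline))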
And please do create an issue at https://github.com/pyeve/eve/issues for this missing feature.
I am able to successfully implement and test on_success_callback and on_failure_callback in Apache Airflow, including passing parameters to them using the context object. However, I am not able to successfully implement sla_miss_callback. Going through different online sources, I found that the arguments passed to this function are
dag, task_list, blocking_task_list, slas, blocking_tis
However, unlike the success/failure callbacks, sla_miss_callback doesn't get the context object in its argument list, and if I try to run several kinds of operators, such as the Python and Bash operators, they fail and the scheduler complains that context was not passed to the execute function.
I looked at other online sources, and in just one of them (https://www.rea-group.com/blog/watching-the-watcher/) I found that we can extract the context object by using the self object. So I appended self to the five arguments described above, but it didn't work for me. I want to know how it is possible to retrieve or pass the context object to the sla_miss_callback function, not only for running different operators but also for retrieving other details about the DAG which has missed its SLA.
It seems it is not possible to pass the context dictionary to the SLA callback (see the source code for sla_miss_callback), but I've found a reasonable workaround to access some other information about the dag run, such as dag_id, task_id, and execution_date. You can also use any of the built-in macros/parameters, which should work fine. While I am using the SlackWebhookOperator for my other callbacks, I am using SlackWebhookHook for the sla_miss_callback. For example:
from airflow.hooks.base import BaseHook  # needed for BaseHook.get_connection below
from airflow.providers.slack.hooks.slack_webhook import SlackWebhookHook

def sla_miss_callback(dag, task_list, blocking_task_list, slas, blocking_tis, *args, **kwargs):
    dag_id = slas[0].dag_id
    task_id = slas[0].task_id
    execution_date = slas[0].execution_date.isoformat()
    hook = SlackWebhookHook(
        http_conn_id='slack_callbacks',
        webhook_token=BaseHook.get_connection('slack_callbacks').password,
        message=f"""
        :sos: *SLA has been missed*
        *Task:* {task_id}
        *DAG:* {dag_id}
        *Execution Date:* {execution_date}
        """
    )
    hook.execute()
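For completeness, a minimal hedged sketch of wiring the callback into a DAG; the dag_id, schedule and sla values below are placeholders rather than anything from the original setup, and note that sla_miss_callback is set on the DAG, not on individual tasks:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id='example_sla_dag',              # placeholder name
    start_date=datetime(2021, 1, 1),
    schedule_interval='@hourly',
    sla_miss_callback=sla_miss_callback,   # the function defined above
    catchup=False,
) as dag:
    slow_task = BashOperator(
        task_id='slow_task',
        bash_command='sleep 60',
        sla=timedelta(seconds=30),         # a run slower than this triggers the callback
    )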
I'm using Airflow 1.10.2 and I'm trying to define a custom module which would contain general functionality that can be used in multiple DAGs as well as operators.
A specific example is an enum. I want to use it within a custom operator (to modify its behaviour), but I also want to use it within a DAG definition, where it can be passed as a parameter.
This is my current hierarchy
airflow_home
| - dags/
|   - __init__.py
|   - my_dag.py
| - plugins/
|   - operators/
|     - __init__.py
|     - my_operator.py
|   - common/
|     - __init__.py
|     - my_enum.py
Let's say I want to define an enum (in the my_enum.py module):
from enum import Enum

class MyEnum(Enum):
    OPTION_1 = 1
    OPTION_2 = 2
It is imported to the operator (in my_operator.py) as:
from common.my_enum import MyEnum
And in to the dag (in my_dag.py) the same way:
from common.my_enum import MyEnum
Strangely(?), this works for me. However, I'm very uncertain whether this is the correct way of doing such a thing. I was told by a colleague that he tried to do this in the past (possibly on an older version of Airflow) and it was not working ("broken dag" when Airflow started). Therefore, I'm afraid it might not work (or might stop working) in the future or under specific conditions, as it is neither an operator, nor a sensor, etc.
I didn't find any guidelines on how to separate shared behaviour. I find the Airflow import system quite complicated and not very straightforward. My ideal solution would be to move the common module to the same level as dags and operators.
Also, I'm not sure how to interpret this sentence from the docs: The python modules in the plugins folder get imported, and hooks, operators, sensors, macros, executors and web views get integrated to Airflow's main collections and become available for use. Does it mean that my approach is correct, because any Python module in plugins/ gets imported?
Is this a good way to achieve my goal, or is there a better solution?
Thank you for your advice
It's a bit of a hackish way of doing it.
The proper way would be to first create a hook and then an operator which uses this hook. For a simple case like the one below you don't even need to call the hook in the operator.
#1. Placing
<PROJECT NAME>/<PLUGINS_FOLDER>/<PLUGIN NAME>/__init__.py
<PROJECT NAME>/<PLUGINS_FOLDER>/<PLUGIN NAME>/<some_new>_hook.py
<PROJECT NAME>/<PLUGINS_FOLDER>/<PLUGIN NAME>/<some_new>_operator.py
For real case scenario that would look like this:
CRMProject/crm_plugin/__init__.py
CRMProject/crm_plugin/crm_hook.py
CRMProject/crm_plugin/customer_operator.py
#2. Code
Sample code of CRMProject/crm_plugin/__init__.py:
# CRMProject/crm_plugin/__init__.py
from airflow.plugins_manager import AirflowPlugin

from crm_plugin.crm_hook import CrmHook
from crm_plugin.customer_operator import CreateCustomerOperator, DeleteCustomerOperator, UpdateCustomerOperator

class AirflowCrmPlugin(AirflowPlugin):
    name = "crm_plugin"  # does not need to match the package name
    operators = [CreateCustomerOperator, DeleteCustomerOperator, UpdateCustomerOperator]
    sensors = []
    hooks = [CrmHook]
    executors = []
    macros = []
    admin_views = []
    flask_blueprints = []
    menu_links = []
    appbuilder_views = []
    appbuilder_menu_items = []
    global_operator_extra_links = []
    operator_extra_links = []
Sample code for the hook class - CRMProject/crm_plugin/crm_hook.py. Don't ever call it directly from the system/API; use an operator for that (see below).
from airflow.hooks.base_hook import BaseHook
from airflow.exceptions import AirflowException
from crm_sdk import crm_api  # import external libraries to interact with the target system

class CrmHook(BaseHook):
    """
    Hook to interact with the ACME CRM System.
    """
    def __init__(self, ...):
        # your code goes here

    def insert_object(self, ...):
        """
        Insert an object into the CRM system
        """
        # your code goes here

    def update_object(self, ...):
        """
        Update an object in the CRM system
        """
        # your code goes here

    def delete_object(self, ...):
        """
        Delete an object from the CRM system
        """
        # your code goes here

    def extract_object(self, ...):
        """
        Extract an object from the CRM system
        """
        # your code goes here
Sample code for the operator (CRMProject/crm_plugin/customer_operator.py) which you will use in DAGs. Operators require that you implement an execute method. This is the entry point into your operator for Airflow, and it is called when the task in your DAG executes.
The apply_defaults decorator wraps the __init__ method of the class, applying the DAG defaults set in your DAG script to the task instance of your operator at run time.
There are also two important class attributes that we can set. These are template_fields and template_ext. These two attributes are iterables that should contain the string values for the fields and/or file extensions that allow templating with the Jinja templating support in Airflow.
from airflow.exceptions import AirflowException
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults

from crm_plugin.crm_hook import CrmHook

class CreateCustomerOperator(BaseOperator):
    """
    This operator creates a new customer in the ACME CRM System.
    """
    template_fields = ['first_contact_date', 'bulk_file']
    template_ext = ['.csv']

    @apply_defaults
    def __init__(self, first_contact_date, bulk_file, ...):
        # your code goes here

    def _customer_exist(self, ...):
        """
        Helper method to check if a customer exists. Raises an exception if it does.
        """
        # your code goes here

    def execute(self, context):
        """
        Create a new customer in the CRM system.
        """
        # your code goes here
You can create as many methods in your class as you need in order to simplify your execute method. The same principles of good class design still apply here.
#3. Deploying and using your plugin
Once you have completed work on your plugin all that is left for you to do is to copy your <PLUGIN NAME> package folder to the Airflow plugins folder. Airflow will pick the plugin up and it will become available to your DAGs.
If we copied the simple CRM plugin to our plugins_folder the folder structure would look like this.
<plugins_folder>/crm_plugin/__init__.py
<plugins_folder>/crm_plugin/crm_hook.py
<plugins_folder>/crm_plugin/customer_operator.py
In order to use your new plugin you will simply import your Operators and Hooks using the following statements.
from airflow.hooks.crm_plugin import CrmHook
from airflow.operators.crm_plugin import CreateCustomerOperator, DeleteCustomerOperator, UpdateCustomerOperator
Source
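For illustration, a hedged sketch of what using the plugin's operator in a DAG might look like; the constructor arguments are assumptions based on the __init__ signature above:

from datetime import datetime

from airflow import DAG
from airflow.operators.crm_plugin import CreateCustomerOperator

with DAG(dag_id='crm_example', start_date=datetime(2019, 1, 1), schedule_interval='@daily') as dag:
    create_customer = CreateCustomerOperator(
        task_id='create_customer',
        first_contact_date='{{ ds }}',    # templated because it is listed in template_fields
        bulk_file='new_customers.csv',    # .csv is in template_ext, so the file contents can be templated too
    )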
I need to mock Elasticsearch calls, but I am not sure how to mock them in my Python unit tests. I saw this framework called ElasticMock. I tried using it the way indicated in the documentation and it gave me plenty of errors.
It is here:
https://github.com/vrcmarcos/elasticmock
My question is: is there any other way to mock Elasticsearch calls?
This doesn't seem to have an answer either: Mock elastic search data.
And this just indicates to actually do integration tests rather than unit tests, which is not what I want:
Unit testing elastic search inside Django app.
Can anyone point me in the right direction? I have never mocked things with Elasticsearch.
You have to mock the attr or method you need, for example:
import mock

with mock.patch("elasticsearch.Elasticsearch.search") as mocked_search, \
     mock.patch("elasticsearch.client.IndicesClient.create") as mocked_index_create:
    mocked_search.return_value = "pipopapu"
    mocked_index_create.return_value = {"acknowledged": True}
In order to know the path you need to mock, just explore the lib with your IDE. When you already know one you can easily find the others.
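For example, a hedged sketch of using such a patch inside a test; my_module and search_titles are hypothetical names standing in for your own code that calls Elasticsearch.search() internally:

import mock
import my_module  # hypothetical module under test

def test_search_titles_returns_mocked_hits():
    with mock.patch("elasticsearch.Elasticsearch.search") as mocked_search:
        mocked_search.return_value = {
            "hits": {"hits": [{"_source": {"title": "pipopapu"}}]}
        }
        titles = my_module.search_titles("pipo")  # hypothetical function that calls search()
    assert titles == ["pipopapu"]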
After looking at the decorator source code, the trick for me was to reference Elasticsearch with the module:
import elasticsearch
...
elasticsearch.Elasticsearch(...
instead of
from elasticsearch import Elasticsearch
...
Elasticsearch(...
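In other words, production code that references the class through the module picks up a patch applied to elasticsearch.Elasticsearch (which is what ElasticMock's decorator does). A hedged sketch, with get_client and the host values as placeholders:

import elasticsearch

def get_client():
    # Because the attribute is looked up on the module at call time,
    # a patch on elasticsearch.Elasticsearch is picked up here.
    return elasticsearch.Elasticsearch(hosts=[{'host': 'localhost', 'port': 9200}])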
I'm going to give a very abstract answer because this applies to more than ES.
class ProductionCodeIWantToTest:
    def __init__(self):
        pass

    def do_something(self, data):
        es = ES()  # or some database or whatever
        es.post(data)  # or the right syntax
Now I can't test this.
With one small change, injecting a dependency:
class ProductionCodeIWantToTest:
    def __init__(self, database):
        self.database = database

    def do_something(self, data):
        self.database.save(data)  # or the right syntax
Now you can use the real db:
es = ES()  # or some database or whatever
thing = ProductionCodeIWantToTest(es)
or test it:
mock = ...  # up to you - just needs a save method so far
thing = ProductionCodeIWantToTest(mock)
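And a hedged sketch of what that test could look like with unittest.mock (the class is the toy example from above):

from unittest.mock import Mock

def test_do_something_saves_data():
    fake_db = Mock()  # stands in for ES or any database; only needs a save method so far
    thing = ProductionCodeIWantToTest(fake_db)
    thing.do_something({"field": "value"})
    fake_db.save.assert_called_once_with({"field": "value"})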