I'm using Google Cloud Composer as my Airflow environment. When I try to use Jinja in my HQL code, the templates are not rendered.
I know that the HiveOperator has a Jinja translator, as I'm used to it, but the DataProcHiveOperator doesn't.
I've tried using hiveconf variables directly in my HQL files, but when I set those values on my partition (i.e. INSERT INTO TABLE abc PARTITION (ds = ${hiveconf:ds})), it doesn't work.
I have also added the following to my HQL file:
SET ds=to_date(current_timestamp());
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
But that didn't work either, as Hive treats the expression above as a plain string.
So my idea was to combine both operators to get the Jinja translator working, but when I do that, I get the following error: ERROR - submit() takes from 3 to 4 positional arguments but 5 were given.
I'm not very familiar with Python, so any help would be great. See below for the operator I'm trying to build:
Header of the Python file (please note that the file contains other operators not mentioned in this question):
import ntpath
import os
import re
import time
import uuid
from datetime import timedelta
from airflow.contrib.hooks.gcp_dataproc_hook import DataProcHook
from airflow.contrib.hooks.gcs_hook import GoogleCloudStorageHook
from airflow.exceptions import AirflowException
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults
from airflow.version import version
from googleapiclient.errors import HttpError
from airflow.utils import timezone
from airflow.utils.operator_helpers import context_to_airflow_vars
Modified DataProcHiveOperator:
class DataProcHiveOperator(BaseOperator):
    template_fields = ['query', 'variables', 'job_name', 'cluster_name', 'dataproc_jars']
    template_ext = ('.q',)
    ui_color = '#0273d4'

    @apply_defaults
    def __init__(
            self,
            query=None,
            query_uri=None,
            hiveconfs=None,
            hiveconf_jinja_translate=False,
            variables=None,
            job_name='{{task.task_id}}_{{ds_nodash}}',
            cluster_name='cluster-1',
            dataproc_hive_properties=None,
            dataproc_hive_jars=None,
            gcp_conn_id='google_cloud_default',
            delegate_to=None,
            region='global',
            job_error_states=['ERROR'],
            *args,
            **kwargs):
        super(DataProcHiveOperator, self).__init__(*args, **kwargs)
        self.gcp_conn_id = gcp_conn_id
        self.delegate_to = delegate_to
        self.query = query
        self.query_uri = query_uri
        self.hiveconfs = hiveconfs or {}
        self.hiveconf_jinja_translate = hiveconf_jinja_translate
        self.variables = variables
        self.job_name = job_name
        self.cluster_name = cluster_name
        self.dataproc_properties = dataproc_hive_properties
        self.dataproc_jars = dataproc_hive_jars
        self.region = region
        self.job_error_states = job_error_states

    def prepare_template(self):
        # Translate ${hiveconf:var} placeholders into Jinja {{ var }} syntax
        if self.hiveconf_jinja_translate:
            self.query_uri = re.sub(
                r"(\$\{(hiveconf:)?([ a-zA-Z0-9_]*)\})", r"{{ \g<3> }}", self.query_uri)

    def execute(self, context):
        hook = DataProcHook(gcp_conn_id=self.gcp_conn_id,
                            delegate_to=self.delegate_to)
        job = hook.create_job_template(self.task_id, self.cluster_name, "hiveJob",
                                       self.dataproc_properties)
        if self.query is None:
            job.add_query_uri(self.query_uri)
        else:
            job.add_query(self.query)

        if self.hiveconf_jinja_translate:
            self.hiveconfs = context_to_airflow_vars(context)
        else:
            self.hiveconfs.update(context_to_airflow_vars(context))

        job.add_variables(self.variables)
        job.add_jar_file_uris(self.dataproc_jars)
        job.set_job_name(self.job_name)
        job_to_submit = job.build()
        self.dataproc_job_id = job_to_submit["job"]["reference"]["jobId"]

        hook.submit(hook.project_id, job_to_submit, self.region, self.job_error_states)
I would like to be able to use Jinja templating inside my HQL code to allow partition automation in my data pipeline.
P.S.: I'll use Jinja templating mostly for the partition date stamp.
Does anyone know what the error message I'm getting means, and can you help me solve it?
ERROR - submit() takes from 3 to 4 positional arguments but 5 were given
Thank you!
It is because of the 5th argument, job_error_states, which is only on master and not in the current stable release (1.10.1).
Source Code for 1.10.1 -> https://github.com/apache/incubator-airflow/blob/76a5fc4d2eb3c214ca25406f03b4a0c5d7250f71/airflow/contrib/hooks/gcp_dataproc_hook.py#L219
So remove that parameter and it should work.
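For example, on 1.10.1 the last line of execute() would simply drop that argument (a sketch based on the operator above):

# DataProcHook.submit() on 1.10.1 only accepts project_id, job and region
hook.submit(hook.project_id, job_to_submit, self.region)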
Related
So I have this code which works great for reading messages out of predefined topics and printing them to the screen. The rosbags come with a rosbag_name.db3 (SQLite) database and a metadata.yaml file.
from rosbags.rosbag2 import Reader as ROS2Reader
import sqlite3
from rosbags.serde import deserialize_cdr
import matplotlib.pyplot as plt
import os
import collections
import argparse

parser = argparse.ArgumentParser(description='Extract images from rosbag.')
# input will be the folder containing the .db3 and metadata.yml file
parser.add_argument('--input', '-i', type=str, help='rosbag input location')
# run with python filename.py -i rosbag_dir/
args = parser.parse_args()

rosbag_dir = args.input

topic = "/topic/name"
frame_counter = 0

with ROS2Reader(rosbag_dir) as ros2_reader:
    ros2_conns = [x for x in ros2_reader.connections]
    # This prints a list of all topic names for sanity
    print([x.topic for x in ros2_conns])

    ros2_messages = ros2_reader.messages(connections=ros2_conns)

    for m, msg in enumerate(ros2_messages):
        (connection, timestamp, rawdata) = msg

        if (connection.topic == topic):
            print(connection.topic)        # shows topic
            print(connection.msgtype)      # shows message type
            print(type(connection.msgtype))  # shows it's of type string
            # TODO
            # this is where things crash when it's a custom message type
            data = deserialize_cdr(rawdata, connection.msgtype)
            print(data)
The issue is that I can't seem to figure out how to read in custom message types. deserialize_cdr takes a string for the message type field, but it's not clear to me how to replace this with a path or how to otherwise pass in a custom message.
Thanks
One approach is to declare the message definition as a string and register it with the type system:
from rosbags.typesys import get_types_from_msg, register_types
MY_CUSTOM_MSG = """
std_msgs/Header header
string foo
"""
register_types(get_types_from_msg(
    MY_CUSTOM_MSG, 'my_custom_msgs/msg/MyCustomMsg'))
from rosbags.typesys.types import my_custom_msgs__msg__MyCustomMsg as MyCustomMsg
Next, using:
msg_type = MyCustomMsg.__msgtype__
you can get the message type that you can pass to deserialize_cdr.
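Putting that together with the reader loop from the question, the call would look roughly like this (a sketch, assuming the hypothetical MyCustomMsg type registered above):

# once the type is registered, deserialize_cdr can resolve it by name
if connection.topic == topic:
    data = deserialize_cdr(rawdata, MyCustomMsg.__msgtype__)
    print(data)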
Also, see here for a quick example.
Another approach is to load the type directly from the message definition. Essentially, you would need to read the .msg file
from pathlib import Path
custom_msg_path = Path('/path/to/my_custom_msgs/msg/MyCustomMsg.msg')
msg_def = custom_msg_path.read_text(encoding='utf-8')
and then follow the same steps as above starting with get_types_from_msg().
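In other words (a sketch, reusing the hypothetical type name from the first approach):

# register the type parsed from the .msg file, then deserialize as before
register_types(get_types_from_msg(
    msg_def, 'my_custom_msgs/msg/MyCustomMsg'))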
A more detailed example of this approach is given here.
I want to move from Ansible to Nornir. In Ansible I use a dynamic inventory, where I use this Python script to reference the host_vars folder:
import json
import yaml
import glob

groups = {}

for hostfilename in glob.glob('./host_vars/*.yml'):
    with open(hostfilename, 'r') as hostfile:
        host = yaml.load(hostfile, Loader=yaml.FullLoader)
        for hostgroup in host['host_groups']:
            if hostgroup not in groups.keys():
                groups[hostgroup] = {'hosts': []}
            groups[hostgroup]['hosts'].append(host['hostname'])

print(json.dumps(groups))
Question:
How can I use my existing Ansible inventory in Nornir?
As far as I understand, nornir.plugins.inventory.ansible.AnsibleInventory can only be used with a single host.yaml file, not with many.
Edit: The goal is to create new inventory files on every run. The workflow would be to generate the inventory YAML files in host_vars and then use them during the play.
Can somebody please help me?
Thanks
F.
If I understood you correctly, you want each yaml file in the host_vars folder to be interpreted as one host and its data. This feature is not part of base Nornir, but can be implemented via a custom inventory plugin.
The custom inventory plugin should implement a load() method that returns an Inventory-type object that Nornir can then use normally (see here for an example of the SimpleInventory implementation). I came up with this snippet adapted from the code that was given:
import os
import yaml
import glob
import pathlib
from nornir.core.inventory import (
    Inventory,
    Hosts,
    Host,
    Groups,
    Group)


def map_host_data(host_dict):
    return {
        'hostname': host_dict['hostname'],
        'port': host_dict.get('port', 22),
        'username': host_dict['username'],
        'password': host_dict['password'],
        'platform': host_dict['platform'],
        'data': host_dict.get('data', None)
    }


class DynamicInventory:
    def __init__(self, inventory_dir: str = "host_vars/") -> None:
        self.inventory_dir = pathlib.Path(inventory_dir).expanduser()

    def load(self):
        hosts = Hosts()
        groups = Groups()
        for hostfilename in glob.glob(f"{self.inventory_dir}/*.yaml"):
            with open(hostfilename, 'r') as hostfile:
                host_name = os.path.basename(hostfilename).replace('.yaml', '')
                host = yaml.load(hostfile, Loader=yaml.FullLoader)
                for hostgroup in host['host_groups']:
                    if hostgroup not in groups.keys():
                        group = Group(name=hostgroup)
                        groups[hostgroup] = group
                hosts[host_name] = Host(name=host_name, **map_host_data(host))
        return Inventory(hosts=hosts, groups=groups, defaults={})
I'm assuming you're using Nornir >= 3 (which you really should), so don't forget to register your plugin if using it on your configuration. Assuming you put the above code under plugins/inventory.py:
from nornir import InitNornir
from plugins.inventory import DynamicInventory
from nornir.core.plugins.inventory import InventoryPluginRegister

InventoryPluginRegister.register("DynamicInventoryPlugin", DynamicInventory)

nr = InitNornir(inventory={'plugin': 'DynamicInventoryPlugin'},
                runner={'plugin': 'threaded', 'options': {'num_workers': 20}})
This of course ignores some features (such as setting defaults), but can be modified to add more features that better match your current setup.
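As a quick sanity check, you can inspect what the plugin loaded after InitNornir (a short sketch; it only prints attributes that map_host_data() fills in):

# print the hosts and groups built by the custom plugin
for name, host in nr.inventory.hosts.items():
    print(name, host.hostname, host.platform)
print(list(nr.inventory.groups.keys()))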
I have a table in a SQL database into which I inserted Python code as data.
I want to map one entity to another (the first one is mine and the second one comes from a web service), so I use Python for the mapping. I use this Python code in two places, and in both of them it just maps the entities: the first place is used in a job service and runs in parallel, and the second place maps the entity and shows it in XML format.
The problem is that when the job is mapping entities, Python says "No module named linq"!
But when I want to see the mapped entity, it maps fine and the XML shows the mapped entity without any problems. What's wrong? The code is the same, but only in one place do I get the error.
Here is part of my Python script. Any opinion?
import System
import clr
import sys

clr.AddReference('System.Core')
clr.AddReference('Tools.Cmn')
clr.AddReference('Tools.ThinCmn')
clr.AddReference('Common.Cmn')
clr.AddReference('Common.SA')
clr.AddReference('Life.SA')
clr.AddReference('Life.Cmn')
clr.AddReference('Life.BL')

from Tools.Cmn import *
from Common.Cmn import *
from Common.SA import *
from Life.Cmn import *
from Life.BL import *
from Life.SA.CiiRegLifePlcyServices import *
from Life.SA import *
from System import Array
from System import DateTime
from System import Func
from System.Linq import Enumerable
from System.Collections.Generic import *

StrTools = StringToolsExposeToScript()
CIHelper = LifeCIWebServiceHelper()

########################################
def Map(bnVer, lifPlcyReq, extraInfo):
    isEdm = 0 if bnVer.ElhNo == 0 else 1
    isPrps = extraInfo["IsPrps"]
    try:
        calcList = bnVer.LifePolicyCalcGeneralList.LoadAll()
        for calc in calcList:
            clc = System.Activator.CreateInstance(LifPlcyYrReq)
            clc.YrNo = calc.Year
            clc.PayCnt = calc.InstallmentCnt
            clc.Svng = calc.PrmReserved
            clc.GrntPrftRte = calc.PolicyYearInterest
            clc.CfcChngPrm = calc.PolicyYearPrmChangeZarib
            LifPlcyYrLst.Add(clc)
        lifPlcyReq.LifPlcyYrLst = LifPlcyYrLst.ToArray()
    except TypeError, err:
        raise Exception(str(err) + " line : " + str(sys.exc_info()[-1].tb_lineno))
    except:
        raise
    return
I am trying to create a custom management command which will fetch data from an API. I wrote this code:
from django.core.management.base import BaseCommand, CommandError
from data.models import Country
import requests
import json


def extracting():
    country_req = requests.get("https://api-football-v1.p.rapidapi.com/countries", headers={"X-RapidAPI-Key": "my_token"})
    parsed_string = json.loads(country_req.text)


class Command(BaseCommand):
    def handle(self, **options):
        print(extracting())
But when I execute it with python manage.py extract, in my console I see None, while when I run the same code in the console without the custom management command, I see the data I'm trying to fetch.
Any ideas?
You do not return anything from the extracting() function. Depending on your interactive console, you might still see the values of variables. But what you probably want to do is:
def extracting():
    country_req = requests.get("https://api-football-v1.p.rapidapi.com/countries", headers={"X-RapidAPI-Key": "my_token"})
    return json.loads(country_req.text)
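With that change, handle() receives the parsed data and can print it, for example (a minimal sketch, keeping the command structure from the question):

class Command(BaseCommand):
    def handle(self, **options):
        data = extracting()
        # self.stdout.write is the idiomatic way to emit output from a management command
        self.stdout.write(str(data))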
I'm working on a web application with Python and Google App Engine.
I tried to set the default URLFetch deadline globally as suggested in a previous thread:
https://stackoverflow.com/a/14698687/2653179
urlfetch.set_default_fetch_deadline(45)
However, it doesn't work: when I print its value in one of the functions, urlfetch.get_default_fetch_deadline() is None.
Here is main.py:
from google.appengine.api import users
import webapp2
import jinja2
import random
import string
import hashlib
import CQutils
import time
import os
import httpRequests
import logging
from google.appengine.api import urlfetch
urlfetch.set_default_fetch_deadline(45)
...
class Del(webapp2.RequestHandler):
    def get(self):
        id = self.request.get('id')
        ext = self.request.get('ext')
        user_id = httpRequests.advance(id, ext)
        d2 = urlfetch.get_default_fetch_deadline()
        logging.debug("value of deadline = %s", d2)
Prints in the Log console:
DEBUG 2013-09-05 07:38:21,654 main.py:427] value of deadline = None
The function which is being called in httpRequests.py:
def advance(id, ext=None):
    url = "http://localhost:8080/api/" + id + "/advance"
    if ext is None:
        ext = ""
    params = urllib.urlencode({'ext': ext})
    result = urlfetch.fetch(url=url,
                            payload=params,
                            method=urlfetch.POST,
                            headers={'Content-Type': 'application/x-www-form-urlencoded'})
    if (result.status_code == 200):
        return result.content
I know this is an old question, but I recently ran into the issue.
The setting is stored in a thread-local, meaning that if your application is set to thread-safe and a request is handled in a different thread than the one in which you set the default deadline, the value can be lost. For me, the solution was to set the deadline before every request as part of the middleware chain.
This is not documented, and required looking through the source to figure it out.
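For example, a small WSGI middleware that resets the deadline for every request could look like this (a sketch; the 45-second value and the way the webapp2 app is wrapped are assumptions based on the question):

from google.appengine.api import urlfetch


class UrlFetchDeadlineMiddleware(object):
    """Sets the default URLFetch deadline in the thread that handles each request."""

    def __init__(self, app, deadline=45):
        self.app = app
        self.deadline = deadline

    def __call__(self, environ, start_response):
        # set_default_fetch_deadline stores the value in a thread-local,
        # so it must be set in the same thread that serves the request
        urlfetch.set_default_fetch_deadline(self.deadline)
        return self.app(environ, start_response)


# e.g. wrap the webapp2 application from main.py:
# app = UrlFetchDeadlineMiddleware(webapp2.WSGIApplication(routes), deadline=45)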