I want to move from Ansible to Nornir. In Ansible I use a dynamic inventory, built with this Python script that reads the host_vars folder:
import json
import glob

import yaml

groups = {}
for hostfilename in glob.glob('./host_vars/*.yml'):
    with open(hostfilename, 'r') as hostfile:
        host = yaml.load(hostfile, Loader=yaml.FullLoader)
        for hostgroup in host['host_groups']:
            if hostgroup not in groups:
                groups[hostgroup] = {'hosts': []}
            groups[hostgroup]['hosts'].append(host['hostname'])
print(json.dumps(groups))
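For reference, given a hypothetical host_vars/sw1.yml such as

hostname: sw1.example.com
host_groups:
  - core
  - lab

the script prints the Ansible-style group mapping as JSON (host and group names here are made up for illustration):

{"core": {"hosts": ["sw1.example.com"]}, "lab": {"hosts": ["sw1.example.com"]}}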
Question:
How can I use my existing Ansible inventory in Nornir?
nornir.plugins.inventory.ansible.AnsibleInventory can only be used with a single host.yaml file, not with many - at least that is my understanding.
Edit: The goal is to create fresh inventory files on every run. The workflow would be to generate the inventory YAML files in host_vars and then use them during the play.
Can somebody please help me?
Thanks
F.
If I understood you correctly, you want each YAML file in the host_vars folder to be interpreted as one host and its data. This feature is not part of base Nornir, but it can be implemented via a custom inventory plugin.
The custom inventory plugin should implement a load() method that returns an Inventory-type object that Nornir can then use normally (see here for an example of the SimpleInventory implementation). I came up with this snippet, adapted from the code you provided:
import os
import glob
import pathlib

import yaml

from nornir.core.inventory import (
    Inventory,
    Hosts,
    Host,
    Groups,
    Group,
    Defaults,
    ParentGroups,
)


def map_host_data(host_dict):
    return {
        'hostname': host_dict['hostname'],
        'port': host_dict.get('port', 22),
        'username': host_dict['username'],
        'password': host_dict['password'],
        'platform': host_dict['platform'],
        'data': host_dict.get('data', None),
    }


class DynamicInventory:
    def __init__(self, inventory_dir: str = "host_vars/") -> None:
        self.inventory_dir = pathlib.Path(inventory_dir).expanduser()

    def load(self) -> Inventory:
        hosts = Hosts()
        groups = Groups()
        for hostfilename in glob.glob(f"{self.inventory_dir}/*.yaml"):
            with open(hostfilename, 'r') as hostfile:
                host_name = os.path.basename(hostfilename).replace('.yaml', '')
                host = yaml.load(hostfile, Loader=yaml.FullLoader)
                for hostgroup in host['host_groups']:
                    if hostgroup not in groups:
                        groups[hostgroup] = Group(name=hostgroup)
                # attach the host to its groups so group-based filtering works
                parents = ParentGroups(groups[g] for g in host['host_groups'])
                hosts[host_name] = Host(name=host_name,
                                        groups=parents,
                                        **map_host_data(host))
        return Inventory(hosts=hosts, groups=groups, defaults=Defaults())
I'm assuming you're using Nornir >= 3 (which you really should be), so don't forget to register your plugin before using it in your configuration. Assuming you put the above code under plugins/inventory.py:
from nornir import InitNornir
from nornir.core.plugins.inventory import InventoryPluginRegister

from plugins.inventory import DynamicInventory

InventoryPluginRegister.register("DynamicInventoryPlugin", DynamicInventory)

nr = InitNornir(inventory={'plugin': 'DynamicInventoryPlugin'},
                runner={'plugin': 'threaded', 'options': {'num_workers': 20}})
This of course ignores some features (such as setting defaults), but can be modified to add more features that better match your current setup.
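As a quick usage sketch (the group name 'core' here is hypothetical - use whatever appears in your host_groups lists), you could then filter on the groups the plugin created:

# 'core' is an assumed group name; adjust to whatever your YAML files define.
core = nr.filter(filter_func=lambda host: "core" in [g.name for g in host.groups])
print(core.inventory.hosts)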
I have different config files initialized under a Config class and need them updated dynamically at run time.
config.py
# Package import
import reusable.common as common
import reusable.JSON_utils as JSON

class Config:
    def __init__(self):
        # run_config
        run_file = common.file_path('run_config.JSON')
        self.run_dict = JSON.read(run_file)
        self.env_list = list(self.run_dict.keys())
        # connection_config
        self.conn_file = common.file_path('connection_config.JSON')
        self.conn_dict = JSON.read(self.conn_file)
        # snapshot_config
        self.snap_file = common.file_path('snapshot_config.JSON')
        self.snap_dict = JSON.read(self.snap_file)
For example, I have to iterate through different environments (DEV, STAGE, QA, PROD) and want to update conn_dict['env'] from 'DEV' to 'QA' after the DEV tasks are completed. Currently I have the dict/JSON update code in main(), but I want to have this as a method inside the Config class.
# Package import
import reusable.config as config
import reusable.JSON_utils as JSON

config_obj = config.Config()
for env in config_obj.env_list:
    config_obj.conn_dict['env'] = env
    JSON.write(config_obj.conn_dict, config_obj.conn_file)
    src_id_list = config_obj.run_dict.get(env).get('snapshot')
    # do stuff in the current env
    for src_id in src_id_list:
        config_obj.snap_dict['source_id'] = str(src_id)
        JSON.write(config_obj.snap_dict, config_obj.snap_file)
        # do stuff for the current data source
Q1: Which is the optimal and conventional way to do this - a class, instance, or static method? Kindly explain, as I'm not completely clear on the differences.
Q2: Can we have a unified method inside the class that takes the dict variable and file as parameters to update the dict and the JSON file? If yes, how can it be achieved?
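For Q2, here is a minimal sketch of what such a unified method might look like (the method name and the string attribute names are illustrative, not from the original code; it reuses the JSON_utils import already present in config.py). An instance method fits here, since the operation reads and writes per-instance state:

import reusable.JSON_utils as JSON

class Config:
    # ... __init__ as shown in config.py above ...

    def update_and_write(self, dict_attr, file_attr, key, value):
        # Update one key in the named config dict, then persist it to its JSON file.
        d = getattr(self, dict_attr)             # e.g. self.conn_dict
        d[key] = value
        JSON.write(d, getattr(self, file_attr))  # e.g. self.conn_file

# usage in main():
# config_obj.update_and_write('conn_dict', 'conn_file', 'env', env)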
As a Python programmer, I recently received a small project to edit and extend (the project is Python/Django).
While working on it, I noticed something unusual: the presence of some Python libraries (hashlib and others) which, I worried, could take my data (Gmail accounts, passwords, Chrome bookmarks, ...) from my computer.
This script is an example from the code:
import hashlib
import logging
import re

import pandas as pd

from file import reader

logger = logging.getLogger(__name__)


def load_data(user_id, project_id, columns: dict):
    group_files = {}
    df = pd.DataFrame(None)
    for id_, column in columns.items():
        group = column['group']
        if group not in group_files:
            df_file = reader.load_file(user_id, project_id, group)
            group_files[group] = df_file
        if group_files[group] is not None:
            if column['content'] in group_files[group].columns:
                df[column['content']] = group_files[group][column['content']]
    return df


def get_hash(string: str):
    return hashlib.md5(string.encode()).hexdigest()[:5]
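(Note that the get_hash function above only computes a short, local MD5 digest of whatever string it is given; a standalone sketch to confirm this:)

import hashlib

def get_hash(string: str):
    return hashlib.md5(string.encode()).hexdigest()[:5]

# Prints the first 5 hex characters of the digest; hashlib itself does no
# file or network access - it only transforms the bytes you pass in.
print(get_hash("example"))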
My question is: how can I know whether they are taking my data from the computer or not?
Thanks in Advance.
Suppose in "./data_writers/excel_data_writer.py", I have:
from generic_data_writer import GenericDataWriter

class ExcelDataWriter(GenericDataWriter):
    def __init__(self, config):
        super().__init__(config)
        self.sheet_name = config.get('sheetname')

    def write_data(self, pandas_dataframe):
        pandas_dataframe.to_excel(
            self.get_output_file_path_and_name(),  # implemented in GenericDataWriter
            sheet_name=self.sheet_name,
            index=self.index)
In "./data_writers/csv_data_writer.py", I have:
from generic_data_writer import GenericDataWriter

class CSVDataWriter(GenericDataWriter):
    def __init__(self, config):
        super().__init__(config)
        self.delimiter = config.get('delimiter')
        self.encoding = config.get('encoding')

    def write_data(self, pandas_dataframe):
        pandas_dataframe.to_csv(
            self.get_output_file_path_and_name(),  # implemented in GenericDataWriter
            sep=self.delimiter,
            encoding=self.encoding,
            index=self.index)
In "./datawriters/generic_data_writer.py", I have:
import os

class GenericDataWriter:
    def __init__(self, config):
        self.output_folder = config.get('output_folder')
        self.output_file_name = config.get('output_file')
        self.output_file_path_and_name = os.path.join(self.output_folder, self.output_file_name)
        self.index = config.get('include_index')  # whether to include the index column from the pandas dataframe in the output file
Suppose I have a JSON config file that has a key-value pair like this:
{
    "__comment__": "Here, user can provide the path and python file name of the custom data writer module she wants to use.",
    "custom_data_writer_module": "./data_writers/excel_data_writer.py",
    "there_are_more_key_value_pairs_in_this_JSON_config_file": "for other input parameters"
}
In "main.py", I want to import the data writer module based on the custom_data_writer_module provided in the JSON config file above. So I wrote this:
import os
import importlib

def main():
    # Do other things to read and process data
    data_writer_class_file = config.get('custom_data_writer_module')
    data_writer_module = importlib.import_module(
        os.path.splitext(os.path.split(data_writer_class_file)[1])[0])
    dw = data_writer_module.what_should_this_be?  # <=== Here, what should I do to instantiate the right specific data writer (Excel or CSV) class instance?
    for df in dataframes_to_write_to_output_file:
        dw.write_data(df)

if __name__ == "__main__":
    main()
As I asked in the code above, I want to know whether there's a way to retrieve and instantiate the class defined in a Python module, assuming there is ONLY ONE class defined in the module. Or, if there is a better way to refactor my code (using some sort of pattern) without changing the structure of the JSON config file described above, I'd like to learn from the Python experts on StackOverflow. Thank you in advance for your suggestions!
You can do this easily with vars:
cls1, = [v for k, v in vars(data_writer_module).items()
         if isinstance(v, type)]
dw = cls1(config)
The comma enforces that exactly one class is found. If the module is allowed to do anything like from collections import deque (or even foo=str), you might need to filter based on v.__module__.
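For instance, a filtered variant might look like this (a sketch building on the suggestion above; it matters here because excel_data_writer.py also imports GenericDataWriter, which would otherwise make the one-class unpacking fail):

# Keep only classes actually defined in the imported module itself,
# ignoring imported names such as GenericDataWriter or deque.
cls1, = [v for v in vars(data_writer_module).values()
         if isinstance(v, type) and v.__module__ == data_writer_module.__name__]
dw = cls1(config)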
I'm using Google Cloud Composer as my Airflow environment. When I try to use Jinja in my HQL code, it is not translated correctly.
I know that the HiveOperator has a Jinja translator, as I'm used to it, but the DataProcHiveOperator doesn't.
I've tried to use HiveConf variables directly in my HQL files, but when I set those values for my partition (i.e. INSERT INTO TABLE abc PARTITION (ds = ${hiveconf:ds})), it doesn't work.
I have also added the following to my HQL file:
SET ds=to_date(current_timestamp());
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
But that didn't work either, as Hive treats the value above as a string.
So my idea was to combine both operators to get the Jinja translator working, but when I do that, I get the following error: ERROR - submit() takes from 3 to 4 positional arguments but 5 were given.
I'm not very familiar with Python coding, so any help would be great; see below for the operator I'm trying to build.
Header of the Python File (please note that the file contains other Operators not mentioned in this question):
import ntpath
import os
import re
import time
import uuid
from datetime import timedelta
from airflow.contrib.hooks.gcp_dataproc_hook import DataProcHook
from airflow.contrib.hooks.gcs_hook import GoogleCloudStorageHook
from airflow.exceptions import AirflowException
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults
from airflow.version import version
from googleapiclient.errors import HttpError
from airflow.utils import timezone
from airflow.utils.operator_helpers import context_to_airflow_vars
Modified DataProcHiveOperator:
class DataProcHiveOperator(BaseOperator):
    template_fields = ['query', 'variables', 'job_name', 'cluster_name', 'dataproc_jars']
    template_ext = ('.q',)
    ui_color = '#0273d4'

    @apply_defaults
    def __init__(
            self,
            query=None,
            query_uri=None,
            hiveconfs=None,
            hiveconf_jinja_translate=False,
            variables=None,
            job_name='{{task.task_id}}_{{ds_nodash}}',
            cluster_name='cluster-1',
            dataproc_hive_properties=None,
            dataproc_hive_jars=None,
            gcp_conn_id='google_cloud_default',
            delegate_to=None,
            region='global',
            job_error_states=['ERROR'],
            *args,
            **kwargs):
        super(DataProcHiveOperator, self).__init__(*args, **kwargs)
        self.gcp_conn_id = gcp_conn_id
        self.delegate_to = delegate_to
        self.query = query
        self.query_uri = query_uri
        self.hiveconfs = hiveconfs or {}
        self.hiveconf_jinja_translate = hiveconf_jinja_translate
        self.variables = variables
        self.job_name = job_name
        self.cluster_name = cluster_name
        self.dataproc_properties = dataproc_hive_properties
        self.dataproc_jars = dataproc_hive_jars
        self.region = region
        self.job_error_states = job_error_states

    def prepare_template(self):
        if self.hiveconf_jinja_translate:
            self.query_uri = re.sub(
                r"(\$\{(hiveconf:)?([ a-zA-Z0-9_]*)\})", r"{{ \g<3> }}", self.query_uri)

    def execute(self, context):
        hook = DataProcHook(gcp_conn_id=self.gcp_conn_id,
                            delegate_to=self.delegate_to)
        job = hook.create_job_template(self.task_id, self.cluster_name, "hiveJob",
                                       self.dataproc_properties)
        if self.query is None:
            job.add_query_uri(self.query_uri)
        else:
            job.add_query(self.query)

        if self.hiveconf_jinja_translate:
            self.hiveconfs = context_to_airflow_vars(context)
        else:
            self.hiveconfs.update(context_to_airflow_vars(context))

        job.add_variables(self.variables)
        job.add_jar_file_uris(self.dataproc_jars)
        job.set_job_name(self.job_name)

        job_to_submit = job.build()
        self.dataproc_job_id = job_to_submit["job"]["reference"]["jobId"]

        hook.submit(hook.project_id, job_to_submit, self.region, self.job_error_states)
I would like to be able to use Jinja templating inside my HQL code to allow partition automation on my data pipeline.
P.S: I'll use the Jinja templating mostly for Partition DateStamp
Does anyone know what this error message means and how I can solve it?
ERROR - submit() takes from 3 to 4 positional arguments but 5 were given
Thank you!
It is because of the 5th argument, job_error_states, which exists only in master and not in the current stable release (1.10.1).
Source Code for 1.10.1 -> https://github.com/apache/incubator-airflow/blob/76a5fc4d2eb3c214ca25406f03b4a0c5d7250f71/airflow/contrib/hooks/gcp_dataproc_hook.py#L219
So remove that parameter and it should work.
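Concretely (a sketch, based on the 1.10.1 signature of DataProcHook.submit linked above), the last line of execute() would become:

# job_error_states removed: in Airflow 1.10.1, DataProcHook.submit()
# accepts only project_id, job, and region.
hook.submit(hook.project_id, job_to_submit, self.region)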
I want to change env.hosts dynamically, because sometimes I want to deploy to one machine first, check that everything is OK, and then deploy to many machines.
Currently I need to set env.hosts globally at script start; how could I set env.hosts in a method instead?
Yes you can set env.hosts dynamically. One common pattern we use is:
from fabric.api import env

def staging():
    env.hosts = ['XXX.XXX.XXX.XXX', ]

def production():
    env.hosts = ['YYY.YYY.YYY.YYY', 'ZZZ.ZZZ.ZZZ.ZZZ', ]

def deploy():
    # Do something...
    pass
You would use this to chain the tasks such as fab staging deploy or fab production deploy.
Kind of late to the party, but I achieved this with EC2 like so. (Note that in EC2 you generally do not know in advance what the IP/hostname of an instance will be, so you almost have to resolve hosts dynamically to account for how the environment/systems come up; another option would be DynDNS, but this approach would still be useful.)
from fabric.api import *
import datetime
import time
import urllib2
import ConfigParser
from platform_util import *

config = ConfigParser.RawConfigParser()

@task
def load_config(configfile=None):
    '''
    ***REQUIRED*** Pass in the configuration to use - usage load_config:</path/to/config.cfg>
    '''
    if configfile != None:
        # Load up our config file
        config.read(configfile)
        # Key/secret needed for aws interaction with boto
        # (anyone help figure out a better way to do this with sub modules, please don't say classes :-) )
        global aws_key
        global aws_sec
        aws_key = config.get("main", "aws_key")
        aws_sec = config.get("main", "aws_sec")
        # Stuff for fabric
        env.user = config.get("main", "fabric_ssh_user")
        env.key_filename = config.get("main", "fabric_ssh_key_filename")
        env.parallel = config.get("main", "fabric_default_parallel")
        # Load our role definitions for fabric
        for i in config.sections():
            if i != "main":
                hostlist = []
                if config.get(i, "use-regex") == 'yes':
                    for x in get_running_instances_by_regex(aws_key, aws_sec, config.get(i, "security-group"), config.get(i, "pattern")):
                        hostlist.append(x.private_ip_address)
                    env.roledefs[i] = hostlist
                else:
                    for x in get_running_instances(aws_key, aws_sec, config.get(i, "security-group")):
                        hostlist.append(x.private_ip_address)
                    env.roledefs[i] = hostlist
                if config.has_option(i, "base-group"):
                    if config.get(i, "base-group") == 'yes':
                        print "%s is a base group" % i
                        print env.roledefs[i]
                        # env["basegroups"][i] = True
where get_running_instances and get_running_instances_by_regex are utility functions that make use of boto (http://code.google.com/p/boto/), for example:
import logging
import re

from boto.ec2.connection import EC2Connection
from boto.ec2.securitygroup import SecurityGroup
from boto.ec2.instance import Instance
from boto.s3.key import Key

########################################
# B-O-F get_instances
########################################
def get_instances(access_key=None, secret_key=None, security_group=None):
    '''
    Get all instances. Only within a security group if specified; doesn't matter their state (running/stopped/etc.)
    '''
    logging.debug('get_instances()')
    conn = EC2Connection(aws_access_key_id=access_key, aws_secret_access_key=secret_key)
    if security_group:
        sg = SecurityGroup(connection=conn, name=security_group)
        instances = sg.instances()
        return instances
    else:
        instances = conn.get_all_instances()
        return instances
Here is a sample of what my config looked like:
# Config file for fabric toolset
#
# This specific configuration is for <whatever> related hosts
#
#
[main]
aws_key = <key>
aws_sec = <secret>
fabric_ssh_user = <your_user>
fabric_ssh_key_filename = /path/to/your/.ssh/<whatever>.pem
fabric_default_parallel = 1
#
# Groupings - Fabric knows them as roledefs (check env dict)
#
# Production groupings
[app-prod]
security-group = app-prod
use-regex = no
pattern =
[db-prod]
security-group = db-prod
use-regex = no
pattern =
[db-prod-masters]
security-group = db-prod
use-regex = yes
pattern = mysql-[d-s]01
Yet another new answer to an old question. :) But I recently found myself attempting to set hosts dynamically, and I really have to disagree with the main answer. My idea of dynamic, or at least what I was attempting to do, was to take an instance DNS name that was just created by boto and access that instance with a fab command. I couldn't do fab staging deploy, because the instance doesn't exist at fabfile-editing time.
Fortunately, Fabric does support truly dynamic host assignment with execute(). (It's possible this didn't exist when the question was first asked, of course, but now it does.) execute() allows you to define both the function to be called and the env.hosts it should use for that command. For example:
def create_EC2_box(data=fab_base_data):
    conn = boto.ec2.connect_to_region(region)
    reservations = conn.run_instances(image_id=image_id, ...)
    ...
    return instance.public_dns_name

def _ping_box():
    run('uname -a')
    run('tail /var/log/cloud-init-output.log')

def build_box():
    box_name = create_EC2_box(fab_base_data)
    new_hosts = [box_name]
    # new_hosts = ['ec2-54-152-152-123.compute-1.amazonaws.com'] # testing
    execute(_ping_box, hosts=new_hosts)
Now I can do fab build_box, and it will fire one boto call that creates an instance and another Fabric call that runs on the new instance - without having to define the instance name at edit time.