Azure ML Pipeline: prevent file upload - python

When creating a Pipeline with the Python SDK v2 for Azure ML, all contents of my current working directory are uploaded. Can I blacklist some files from being uploaded? E.g. I use load_env(".env") in order to read some credentials, but I don't want it to be uploaded.
Directory content:
./src
    utilities.py  # contains a helper function to get Azure credentials
.env              # contains credentials
conda.yaml
script.py
A minimal pipeline example:
import mldesigner
import mlflow
from azure.ai.ml import MLClient
from azure.ai.ml.dsl import pipeline

from src.utilities import get_credential

credential = get_credential()  # calls `load_env(".env")` locally

ml_client = MLClient(
    credential=credential,
    subscription_id="foo",
    resource_group_name="bar",
    workspace_name="foofoo",
)


@mldesigner.command_component(
    name="testcomponent",
    display_name="Test Component",
    description="Test Component description.",
    environment=dict(
        conda_file="./conda.yaml",
        image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04",
    ),
)
def test_component():
    mlflow.log_metric("metric", 0)


cluster_name = "foobar"


@pipeline(default_compute=cluster_name)
def pipe():
    test_component()


pipeline_job = pipe()
pipeline_job = ml_client.jobs.create_or_update(
    pipeline_job, experiment_name="pipeline_samples"
)
After running python script.py the pipeline job is created and runs in Azure ML. If I look at the pipeline in the Azure ML UI and inspect Test Component's Code tab, I find all source files, including .env.
How can I prevent this file from being uploaded when creating a pipeline job with the SDK?

You can use a .gitignore or .amlignore file in your working directory to specify files and directories to ignore. By default, these files will not be included when you run the pipeline.
Here is the documentation on how to prevent unnecessary files from being uploaded.
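For example, a minimal .amlignore placed in the working directory (it follows the same syntax as .gitignore) only needs to list the credentials file:
# .amlignore
.env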
or
import os

# Get all files in the current working directory
all_files = os.listdir()
# Remove the ".env" file from the list of files
all_files.remove(".env")


@pipeline(default_compute=cluster_name, files=all_files)
def pipe():
    test_component()


pipeline_job = pipe()
pipeline_job = ml_client.jobs.create_or_update(
    pipeline_job, experiment_name="pipeline_samples"
)

Related

How to get the path of .py files under Azure function in Azure portal

I am working on a Python Azure Function. Below is part of the code.
df1 = pd.DataFrame(df)
df2 = df1.loc[0, 'version']
ipversion = f"testversion{df2}.py"
start_path = r'E:\Azure\Azure_FUNC'
path_to_file = os.path.join(start_path, ipversion)
logging.info(f"path_to_file: {path_to_file}")
path = Path(path_to_file)
version = f"testversion{df2}"
if ip:
    if path.is_file():
        module = 'Azure_FUNC.' + version
        my_module = importlib.import_module(module)
        return func.HttpResponse(f"{my_module.add(ip)}")
    else:
        return func.HttpResponse(f"This HTTP triggered function executed successfully. Flex calculation = {default.mult(ip)}")
else:
    return func.HttpResponse(
        "This HTTP triggered function executed successfully.",
        status_code=200
    )
Azure_FUNC is my function name.
testversion1, testversion2 and default are 3 .py files under this function.
In the above code, based on the input version provided in the API call, the code checks whether that version's .py file is available, imports the module for that particular version and executes it. If the given version's .py file is not available, it executes the default .py file.
This works fine locally. But when I deploy this function to Azure, I am unable to find the path for the testversion1 and testversion2 files in the Azure portal under Azure Functions.
Please let me know how to get the path of these files and how to check for these files based on the input version provided in the API call.
Thank you.
If you deploy the Azure Python Function project to a Linux Function App, you can see the location of your trigger files (i.e., the .py files) by navigating:
Open the Kudu site of your Function App > Click on SSH >
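Once you know where the code lives on the deployed host, a minimal sketch of the version check (assuming the testversion*.py files are deployed inside the Azure_FUNC folder alongside __init__.py; df2 is a placeholder for the value taken from the request) is to resolve the path relative to the function's own directory instead of a hard-coded local drive:
import importlib
from pathlib import Path

df2 = "1"  # placeholder: in the question this comes from the request payload
version = f"testversion{df2}"

# Resolve the version file relative to this function's own folder so the same
# check works both locally and after deployment.
func_dir = Path(__file__).parent
path_to_file = func_dir / f"{version}.py"

if path_to_file.is_file():
    my_module = importlib.import_module(f"Azure_FUNC.{version}")
else:
    my_module = importlib.import_module("Azure_FUNC.default")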

GCP apache airflow, how to install Python dependency from private repository

For my data extraction project I have gone for Apache Airflow, with GCP Composer and bucket storage.
I have several modules in a package in my GitHub repo that my DAG file needs to access.
For now I'm using a BashOperator to check that it works:
# dag.py
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id='my_example_DAG',
    start_date=datetime(2019, 10, 17, 8, 25),
    schedule_interval=timedelta(minutes=15),
    default_args=default_args,
)

t1 = BashOperator(
    task_id='example_task',
    bash_command='python /home/airflow/gcs/data/my_example_maindir/main.py ',
    dag=dag)
t1
# main.py
def run_main(path_name):
    # reads the YML file
    extractor_pool(yml_info)

def extractor_pool(yml_info):
    # do work
    ...

if __name__ == "__main__":
    test_path = "Example/path/for/test.yml"
    run_main(test_path)
And it works: it starts main.py with the test_path. But I want to use the function run_main to pass the correct path with the correct YML file for the task.
I have tried to sys.path.insert the directory inside my storage bucket where my modules are, but I get an import error.
dir:
dir for my dags file (cloned from my git repo) = Buckets/europe-west1-eep-envxxxxxxx-bucket/dags
dir for my scripts/packages = Buckets/europe-west1-eep-envxxxxxxx-bucket/data
# dag.py
import sys
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

sys.path.insert(0, "/home/airflow/gcs/data/Example/")
from Example import main

dag = DAG(
    dag_id='task_1_dag',
    start_date=datetime(2019, 10, 13),
    schedule_interval=timedelta(minutes=10),
    default_args=default_args,
)

t1 = PythonOperator(
    task_id='task_1',
    provide_context=True,
    python_callable=main.run_main,
    op_args={'path_name': "project_output_0184_Storgaten_33"},
    dag=dag
)
t1
This results in a ''module not found'' error and does not work.
I have done some reading on GCP and found this:
Installing a Python dependency from a private repository:
https://cloud.google.com/composer/docs/how-to/using/installing-python-dependencies
It says I need to place it in the directory path /config/pip/,
for example: gs://us-central1-b1-6efannnn-bucket/config/pip/pip.conf
But in my GCP storage bucket I have no directory named config.
I have tried to retrace my steps from when I created the bucket and environment, but I can't figure out what I have done wrong.
GCS has no true notion of folders or directories; what you actually have is a series of blobs whose names may contain slashes, which gives the appearance of a directory.
The instructions are a bit unclear in asking you to put it in a directory; what you actually want to do is create a file and give it the prefix config/pip/pip.conf.
With gsutil you'd do something like:
gsutil cp my-local-pip.conf gs://[DESTINATION_BUCKET_NAME]/config/pip/pip.conf
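The same upload can also be done from Python with the google-cloud-storage client; this is only a sketch, and the bucket and local file names are placeholders:
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("DESTINATION_BUCKET_NAME")
# The "config/pip/" directory is just a name prefix on the blob.
blob = bucket.blob("config/pip/pip.conf")
blob.upload_from_filename("my-local-pip.conf")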

Attaching data file (.csv, .json) as part of a setup package to be used on Dataflow

I am trying to use Dataflow to complete a task that requires the use of .csv and .json files. From what I understand, I should be able to create a setup.py file that will include these files and distribute them to multiple workers.
This is how my files are laid out:
pipline.py
setup.py
utils /
--> __init__.py
--> CSV.csv
--> JSON.json
This is my setup.py file:
import setuptools

setuptools.setup(name='utils',
                 version='0.0.1',
                 description='utils',
                 packages=setuptools.find_packages(),
                 package_data={'utils': ['CSV.csv', 'JSON.json']},
                 include_package_data=True)
This is my beam.DoFn function:
class DoWork(beam.DoFn):
    def process(self, element):
        import pandas as pd
        df_csv = pd.read_csv('CSV.csv')
        df_json = pd.read_json('JSON.json')
        # Do other stuff with the dataframes
        yield [stuff]
My pipeline is set up like so:
dataflow_options = ['--job_name=pipline',
                    '--project=pipeline',
                    '--temp_location=gs://pipeline/temp',
                    '--staging_location=gs://pipeline/stage',
                    '--setup_file=./setup.py']
options = PipelineOptions(dataflow_options)
gcloud_options = options.view_as(GoogleCloudOptions)
options.view_as(StandardOptions).runner = 'DataflowRunner'

with beam.Pipeline(options=options) as p:
    update = p | beam.Create(files) | beam.ParDo(DoWork())
Basically I keep getting an:
IOError: File CSV.csv does not exist
It doesn't think the .json file exists either, but it errors out before it reaches that step. The files are possibly not making it to Dataflow, or I am referencing them incorrectly within the DoFn. Should I actually be putting the files into the data_files parameter of the setup function instead of package_data?
You need to upload the input files to GCS and give a gs:// location rather than a local CSV path. I think you ran the code locally with the csv file in the same directory as the code, but running it with the DataflowRunner needs the files to be in GCS.
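As a rough sketch of that suggestion (the gs:// paths below are placeholders), the DoFn can open the files directly from GCS via Beam's filesystem API instead of relying on local paths:
import apache_beam as beam
from apache_beam.io.filesystems import FileSystems

class DoWork(beam.DoFn):
    def process(self, element):
        import pandas as pd
        # Read the side files straight from GCS so every worker can see them.
        with FileSystems.open('gs://pipeline/data/CSV.csv') as f:
            df_csv = pd.read_csv(f)
        with FileSystems.open('gs://pipeline/data/JSON.json') as f:
            df_json = pd.read_json(f)
        # ... work with the dataframes ...
        yield element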

PyEZ - Cron Job to connect to 8 routers and save the running config to 8 local files at a specific time stamp

I am new to PyEZ. Can I write a cron job in PyEZ that will connect to 8 routers, fetch the running config on each device and save it to 8 different files at a particular timestamp? Could you help me achieve this?
I have already written PyEZ code that writes the base config to a local file.
Loading the config to a local file:
from jnpr.junos import Device
from lxml import etree

dev = Device(host='hostname', port='22', user='root', password='sitlab123!')
dev.open()


class Create_Config():
    def __init__(self):
        cnf = dev.rpc.get_config()  # get the config (XML)
        with open('myfile.txt', "wb") as text_file:
            text_file.write(etree.tostring(cnf))

    # return the configuration
    def get_conf(self):
        return dev.cli("show configuration")
You can use the python-crontab module along with the PyEZ module.
Python-crontab
Creating a new cron job looks like this:
from crontab import CronTab

# init cron
cron = CronTab()
# add a new cron job
job = cron.new(command='/usr/bin/echo')
# job settings
job.hour.every(4)
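For the fetching part itself, a minimal sketch along these lines loops over the routers and writes one timestamped file per device (host names and credentials are placeholders; note that with python-crontab the new job only takes effect after cron.write(), or you can simply schedule the script with an ordinary crontab entry):
# fetch_configs.py - run this from cron at the desired time
from datetime import datetime

from jnpr.junos import Device
from lxml import etree

HOSTS = ['router1', 'router2']  # extend to all 8 routers

for host in HOSTS:
    # Device supports the context-manager protocol, so the session is closed for us.
    with Device(host=host, port='22', user='root', password='password') as dev:
        cnf = dev.rpc.get_config()
        timestamp = datetime.now().strftime('%Y%m%d-%H%M%S')
        with open(f'{host}-{timestamp}.txt', 'wb') as f:
            f.write(etree.tostring(cnf))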

How do you define the host for Fabric to push to GitHub?

Original Question
I've got some Python scripts which have been using Amazon S3 to upload screenshots taken following Selenium tests within the script.
Now we're moving from S3 to GitHub, so I've found GitPython but can't see how you use it to actually commit to the local repo and push to the server.
My script builds a directory structure similar to \images\228M\View_Use_Case\1.png in the workspace, and when uploading to S3 it was a simple process:
for root, dirs, files in os.walk(imagesPath):
    for name in files:
        filename = os.path.join(root, name)
        k = bucket.new_key('{0}/{1}/{2}'.format(revisionNumber, images_process, name))  # returns a new key object
        k.set_contents_from_filename(filename, policy='public-read')  # opens local file buffers to key on S3
        k.set_metadata('Content-Type', 'image/png')
Is there something similar for this or is there something as simple as a bash type git add images command in GitPython that I've completely missed?
Updated with Fabric
So I've installed Fabric on kracekumar's recommendation, but I can't find docs on how to define the (GitHub) hosts.
My script is pretty simple, just trying to get the upload to work:
from __future__ import with_statement
from fabric.api import *
from fabric.contrib.console import confirm
import os


def git_server():
    env.hosts = ['github.com']
    env.user = 'git'
    env.password = 'password'


def test():
    process = 'View Employee'
    os.chdir('\Work\BPTRTI\main\employer_toolkit')
    with cd('\Work\BPTRTI\main\employer_toolkit'):
        result = local('ant viewEmployee_git')
        if result.failed and not confirm("Tests failed. Continue anyway?"):
            abort("Aborting at user request.")


def deploy():
    process = "View Employee"
    os.chdir('\Documents and Settings\markw\GitTest')
    with cd('\Documents and Settings\markw\GitTest'):
        local('git add images')
        local('git commit -m "Latest Selenium screenshots for %s"' % (process))
        local('git push -u origin master')


def viewEmployee():
    #test()
    deploy()
It Works \o/ Hurrah.
You should look into Fabric: http://docs.fabfile.org/en/1.4.1/index.html. It's an automated server deployment tool. I have been using it for quite some time and it works pretty well.
Here is one of my applications which uses it: https://github.com/kracekumar/sachintweets/blob/master/fabfile.py
It looks like you can do this:
from git import Repo

repo = Repo('/path/to/your/workspace')  # the working copy that contains the images directory
index = repo.index
index.add(['images'])
new_commit = index.commit("my commit message")
and then, assuming you have origin as the default remote:
origin = repo.remotes.origin
origin.push()