Python aws cli from subprocess - python

import subprocess
import datetime
StartTime=datetime.datetime.utcnow() - datetime.timedelta(hours=1)
EndTime=datetime.datetime.utcnow()
instances = ['i-xxx1', 'i-xxx2']
list_files = subprocess.run(["aws", "cloudwatch", "get-metric-statistics", "--metric-name", "CPUUtilization", "--start-time", StartTime, "--end-time", EndTime, "--period", "300", "--namespace", "AWS/EC2", "--statistics", "Maximum", "--dimensions", "Name=InstanceId,#call the instances#"])
print("The exit code was: %d" % list_files.returncode)
Quick and dirty code. How do I loop subprocess.run over the instances list and print the results in the loop? I'm also having trouble passing StartTime and EndTime in the right datetime format.
Thank you

It is recommended to use the boto3 library to call AWS from Python. It is fairly easy to translate CLI commands to boto3 commands.
list_files = subprocess.run(["aws", "cloudwatch", "get-metric-statistics", "--metric-name", "CPUUtilization", "--start-time", StartTime, "--end-time", EndTime, "--period", "300", "--namespace", "AWS/EC2", "--statistics", "Maximum", "--dimensions", "Name=InstanceId,#call the instances#"])
Instead of the above, you can run the following:
import boto3

client = boto3.client('cloudwatch')
list_files = client.get_metric_statistics(
    MetricName='CPUUtilization',
    StartTime=StartTime,  # These should be datetime objects
    EndTime=EndTime,      # These should be datetime objects
    Period=300,
    Namespace='AWS/EC2',
    Statistics=['Maximum'],
    Dimensions=[
        {
            'Name': 'InstanceId',
            'Value': '#call the instances#'
        }
    ]
)
You can run help(client.get_metric_statistics) to get detailed information about the function. The boto3 library is pretty well documented; the response structure and syntax are also documented there.
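To answer the looping part of the question, here is a minimal sketch; it reuses the instances, StartTime and EndTime variables from the question, and the loop itself is an illustration rather than the only way to do it. boto3 accepts the datetime objects directly, so no string formatting is needed:
import datetime
import boto3

client = boto3.client('cloudwatch')
StartTime = datetime.datetime.utcnow() - datetime.timedelta(hours=1)
EndTime = datetime.datetime.utcnow()
instances = ['i-xxx1', 'i-xxx2']

for instance_id in instances:
    # One get_metric_statistics call per instance, reusing the
    # parameters from the answer above.
    response = client.get_metric_statistics(
        MetricName='CPUUtilization',
        StartTime=StartTime,
        EndTime=EndTime,
        Period=300,
        Namespace='AWS/EC2',
        Statistics=['Maximum'],
        Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
    )
    # Print each returned datapoint for this instance.
    for datapoint in response['Datapoints']:
        print(instance_id, datapoint['Timestamp'], datapoint['Maximum'])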


AzureML not able to Schedule Pipeline endpoint

I made a minimal Pipeline with a single step in AML. I've published this pipeline and I have an id and a REST endpoint for it.
When I try to create a schedule on this pipeline, I get no error, but it never launches.
from azureml.core.runconfig import RunConfiguration
from azureml.pipeline.steps import PythonScriptStep
from azureml.pipeline.core import Pipeline

datastore = ws.get_default_datastore()
minimal_run_config = RunConfiguration()
minimal_run_config.environment = myenv  # Custom Env with Dockerfile from mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest + openSDK 11 + pip/conda packages
step_name = experiment_name
script_step_1 = PythonScriptStep(
    name=step_name,
    script_name="main.py",
    arguments=args,
    compute_target=cpu_cluster,
    source_directory=str(source_path),
    runconfig=minimal_run_config,
)
pipeline = Pipeline(
    workspace=ws,
    steps=[
        script_step_1,
    ],
)
pipeline.validate()
pipeline.publish(name=experiment_name + "_pipeline")
I can trigger this pipeline over REST from Python:
from azureml.core.authentication import InteractiveLoginAuthentication
from azureml.pipeline.core import PublishedPipeline
import requests

auth = InteractiveLoginAuthentication()
aad_token = auth.get_authentication_header()
pipelines = PublishedPipeline.list(ws)
rest_endpoint1 = [p for p in pipelines if p.name == experiment_name + "_pipeline"][0]
response = requests.post(rest_endpoint1.endpoint,
                         headers=aad_token,
                         json={"ExperimentName": experiment_name,
                               "RunSource": "SDK",
                               "ParameterAssignments": {"KEY": "value"}})
But when I use a Schedule with start_time from ScheduleRecurrence, I get no warning and no error, and nothing is ever triggered. If I don't use start_time, the pipeline is triggered and launches immediately, which is not what I want. For example, I'm running the schedule setter today, but I want its first trigger to run only on the second of each month at 4pm.
from azureml.pipeline.core.schedule import ScheduleRecurrence, Schedule
import datetime

first_run = datetime.datetime(2022, 10, 2, 16, 00)
schedule_name = f"Recocpc monthly run PP {first_run.day:02} {first_run.hour:02}:{first_run.minute:02}"
recurrence = ScheduleRecurrence(
    frequency="Month",
    interval=1,
    start_time=first_run,
)
recurrence.validate()
recurring_schedule = Schedule.create_for_pipeline_endpoint(
    ws,
    name=schedule_name,
    description="Recocpc monthly run PP",
    pipeline_endpoint_id=pipeline_endpoint.id,
    experiment_name=experiment_name,
    recurrence=recurrence,
    pipeline_parameters={"KEY": "value"},
)
If I comment out start_time, it works, but then the first run happens now, not when I want it.
So I was not aware of how start_time works. It uses DAG logic, like in Airflow.
Here is an example:
Today is 10-01-2022 (dd-mm-yyyy).
You want your pipeline to run once a month, on the 10th of each month at 14:00.
Then your start_time is not 2022-01-10T14:00:00, but should be 2021-12-10T14:00:00.
The scheduler will only trigger once it has made a full revolution of the interval you are asking for (here, one month).
Maybe the official documentation should be more explicit about this mechanism for newbies like me who have never used DAGs in their lives.
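In other words, set start_time one full interval before the first run you actually want. A minimal sketch based on the dates in the example above (the dates are illustrative):
from azureml.pipeline.core.schedule import ScheduleRecurrence
import datetime

# Desired first run: 10-01-2022 at 14:00 (dd-mm-yyyy).
# With a monthly frequency, start_time must be one interval earlier,
# i.e. 10-12-2021 at 14:00, so the schedule has completed one full
# "revolution" by the time the desired first run comes around.
recurrence = ScheduleRecurrence(
    frequency="Month",
    interval=1,
    start_time=datetime.datetime(2021, 12, 10, 14, 0),
)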

retrieving s3 path from payload inside AWS glue pythonshell job

I have a pythonshell job inside AWS Glue that needs to download a file from an S3 path. This S3 path location is a variable, so it will come to the Glue job as a payload in the start_job_run call, like below:
import boto3

payload = {'s3_target_file': s3_TARGET_FILE_PATH,
           's3_test_file': s3_TEST_FILE_PATH}
job_def = dict(
    JobName=MY_GLUE_PYTHONSHELL_JOB,
    Arguments=payload,
    WorkerType='Standard',
    NumberOfWorkers=2,
)
response = glue.start_job_run(**job_def)
My question is: how do I retrieve those S3 paths from the payload inside the AWS Glue pythonshell job, given that it is passed through boto3? Is there any sort of handler we need to write, similar to AWS Lambda?
Please suggest.
Check the documentation. All you need is there.
You can use the getResolvedOptions as follows:
import sys
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv,
                          ['JOB_NAME',
                           'day_partition_key',
                           'hour_partition_key',
                           'day_partition_value',
                           'hour_partition_value'])
print("The day partition key is:", args['day_partition_key'])
print("and the day partition value is:", args['day_partition_value'])

Boto3 Cloudwatch API ELBv2 returning empty datapoints

I am writing a script to pull back metrics for ELBv2 (Network LB) using Boto3, but it just keeps returning empty datapoints. I have read through the AWS and Boto docs, and scoured here for answers, but nothing seems to be correct. I am aware CW likes everything to be exact, so I have played with different dimensions, different time windows, datapoint periods, different metrics, with and without specifying units etc., to no avail.
My script is here:
#!/usr/bin/env python
import boto3
from pprint import pprint
from datetime import datetime
from datetime import timedelta


def initialize_client():
    client = boto3.client(
        'cloudwatch',
        region_name='eu-west-1'
    )
    return client


def request_metric(client):
    response = client.get_metric_statistics(
        Namespace='AWS/NetworkELB',
        Period=300,
        StartTime=datetime.utcnow() - timedelta(days=5),
        EndTime=datetime.utcnow() - timedelta(days=1),
        MetricName='NewFlowCount',
        Statistics=['Sum'],
        Dimensions=[
            {
                'Name': 'LoadBalancer',
                'Value': 'net/nlb-name/1111111111'
            },
            {
                'Name': 'AvailabilityZone',
                'Value': 'eu-west-1a'
            }
        ],
    )
    return response


def main():
    client = initialize_client()
    response = request_metric(client)
    pprint(response['Datapoints'])
    return 0


main()
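As a sanity check on the dimension values, one option is to enumerate the metric/dimension combinations CloudWatch actually has for this namespace; a minimal diagnostic sketch (the namespace and metric name mirror the script above):
import boto3

client = boto3.client('cloudwatch', region_name='eu-west-1')

# List the dimension combinations CloudWatch is storing for this metric;
# the LoadBalancer/AvailabilityZone values in the script must match one
# of these exactly, or get_metric_statistics returns no datapoints.
paginator = client.get_paginator('list_metrics')
for page in paginator.paginate(Namespace='AWS/NetworkELB', MetricName='NewFlowCount'):
    for metric in page['Metrics']:
        print(metric['Dimensions'])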

How to upload a local CSV to google big query using python

I'm trying to upload a local CSV to Google BigQuery using Python:
def uploadCsvToGbq(self, table_name):
    load_config = {
        'destinationTable': {
            'projectId': self.project_id,
            'datasetId': self.dataset_id,
            'tableId': table_name
        }
    }
    load_config['schema'] = {
        'fields': [
            {'name': 'full_name', 'type': 'STRING'},
            {'name': 'age', 'type': 'INTEGER'},
        ]
    }
    load_config['sourceFormat'] = 'CSV'
    upload = MediaFileUpload('sample.csv',
                             mimetype='application/octet-stream',
                             # This enables resumable uploads.
                             resumable=True)
    start = time.time()
    job_id = 'job_%d' % start
    # Create the job.
    result = bigquery.jobs.insert(
        projectId=self.project_id,
        body={
            'jobReference': {
                'jobId': job_id
            },
            'configuration': {
                'load': load_config
            }
        },
        media_body=upload).execute()
    return result
When I run this, it throws an error like:
"NameError: global name 'MediaFileUpload' is not defined"
Is there a module I need to import? Please help.
One of the easiest ways to upload a CSV file to GBQ is through pandas. Just read the CSV file into pandas (pd.read_csv()), then push from pandas to GBQ (df.to_gbq(full_table_id, project_id=project_id)).
import pandas as pd

df = pd.read_csv('/..localpath/filename.csv')
df.to_gbq(full_table_id, project_id=project_id)
Or you can use the client API:
from google.cloud import bigquery
import pandas as pd

df = pd.read_csv('/..localpath/filename.csv')
client = bigquery.Client()
dataset_ref = client.dataset('my_dataset')
table_ref = dataset_ref.table('new_table')
client.load_table_from_dataframe(df, table_ref).result()
pip install --upgrade google-api-python-client
Then at the top of your Python file write:
from googleapiclient.http import MediaFileUpload
But take care: you are missing some parentheses. Better write:
result = bigquery.jobs().insert(projectId=PROJECT_ID, body={'jobReference': {'jobId': job_id},'configuration': {'load': load_config}}, media_body=upload).execute(num_retries=5)
And by the way, you are going to upload all your CSV rows, including the top one that defines columns.
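If you don't want that header row loaded as data, the BigQuery load configuration has a skipLeadingRows option; a minimal sketch, assuming the same load_config dict as in the question:
# Skip the first row of the CSV (the column names) when loading.
load_config['skipLeadingRows'] = 1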
The class MediaFileUpload is in http.py. See https://google-api-python-client.googlecode.com/hg/docs/epy/apiclient.http.MediaFileUpload-class.html

mongodump and then remove: Not exact same number of records

I am using a Fabric script to dump data from a remote MongoDB server to my local machine, and then I wish to remove that data from the remote machine. I am doing it in two steps now, and although I understand more graceful methods may exist, for a few more days I want to continue like this.
Here is the snippet of the Python function that I run as a fab task:
from __future__ import with_statement
from fabric.api import *
from fabric.contrib.console import confirm
import datetime
import dateutil.relativedelta


def dump_mydb():
    print("********************************")
    print("Starting the dump process")
    print("********************************")
    d = datetime.datetime.today()
    d2 = d - dateutil.relativedelta.relativedelta(months=1)
    end_date = datetime.datetime(d2.year, d2.month, d2.day)
    print(end_date)
    before_time = int(end_date.strftime("%s")) * 1000
    temp = datetime.datetime.today()
    temp2 = datetime.datetime(temp.year, temp.month, temp.day)
    local_folder = str(temp2).split(" ")[0]
    local("mongodump --host x.x.x.x --port 27017 --collection my_collection --db my_db -q '{fetched_date :{$lte: Date(" + str(before_time) + ")}}'")
    local("mkdir ../dump_files/store/" + local_folder)
    local("cp -r dump ../dump_files/store/" + local_folder)
    local("rm -rf dump")
    print("********************************")
    print("Data before one month from today is dumped at - ../dump_files/store/" + local_folder)
    print("********************************")
If this script is executed today (14th Feb, 2014, IST), it searches for all the documents which have "fetched_date" (a normal ISODate object with both date and time present) less than or equal to 2014-01-14 00:00:00. And this script executes fine.
The problem
When this script is executed, we can see that it dumps X objects (documents) onto my local machine. But when we run this query in the remote mongo shell:
{"fetched_date":{"$lte": ISODate("2014-01-14T00:00:00.00Z")}}
This gives us a different number of records, which is more than X. That means we cannot delete all the records matching this query, because some of them did not get dumped onto my local machine. I do not understand how that is possible, as I am converting the same date to milliseconds and then running the query with mongodump.
Can anybody help me out please?
Please let me know if you need any more information.
Thanks.
I believe you've hit the same issue I did, where db.collection.find({...}).count() can over-count. According to the details in the reference documentation for count(), if you're on a sharded cluster, records being migrated are double-counted. (Thanks GothAlice on the IRC channel for pointing this out to me!)
If this is your issue, you can use the aggregation framework to get an accurate count, which should match the count you saw from mongodump:
db.collection.aggregate([
    { $match: { "fetched_date": { "$lte": ISODate("2014-01-14T00:00:00.00Z") } } },
    { $group: { _id: null, count: { $sum: 1 } } }
])
