I have an Apache Airflow working on Kubernetes (via Google Composer). I want to retrieve one variable store in Secret:
I need to consume the variables stored in this Secret from a DAG in Airflow.(python).
The variables are stored as "Environment Vars", so in Python is quite easy:
import os
os.environ['DB_USER'])
Related
So I am trying to orchestrate a workflow in Airflow. One task is to read GCP Cloud Storage, which needs me to specify the Google Application Credentials.
I decided to create a new folder in the dag folder and put the JSON key. Then I specified this in the dag.py file;
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "dags\support\keys\key.json"
Unfortunately, I am getting this error below;
google.auth.exceptions.DefaultCredentialsError: File dags\support\keys\dummy-surveillance-project-6915f229d012.json was not found
Can anyone help with how I should go about declaring the service account key?
Thank you.
You can create a connection to Google Cloud from Airflow webserver admin menu. In this menu you can pass the Service Account key file path.
In this picture, the keyfile Path is /usr/local/airflow/dags/gcp.json.
Beforehand you need to mount your key file as a volume in your Docker container with the previous path.
You can also directly copy the key json content in the Airflow connection, in the keyfile Json field :
You can check from these following links :
Airflow-connections
Airflow-with-google-cloud
Airflow-composer-managing-connections
If you trying to download data from Google Cloud Storage using Airflow, you should use the GCSToLocalFilesystemOperator operator described here. It is already provided as part of the standard Airflow library (if you installed it) so you don't have to write the code yourself using the Python operator.
Also, if you use this operator you can enter the GCP credentials into the connections screen (where it should be). This is a better approach to putting your credentials in a folder with your DAGs as this could lead to your credentials being committed into your version control system which could lead to security issues.
I have an Azure Devops Pipeline setup. It gets some secrets via the yaml
variables
- group: GROUP_WITH_SECRET
Then in the later part of the pipeline I run a python script that gets that particular secret via
my_pat = os.environ["my_secret"]
That is then used in a library provided by Microsoft (msrest) as so:
BasicAuthentication("", my_pat)
If the variable in question, in the ADO Library is set to plain, the script works correctly. If I change it to a secret, connection fails. If I set it back to plain text, it again works.
Question is, how can I make it work with a secret? I've tried printing the value out but since it's a secret it doesn't show me the actual value other than the
The user 'aaaaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaa' is not authorized to access this resource
To use the secret variable in Azure Pipeline, you need to explicitly map secret variables in Agent Job.
Based on my test, the Python script task has no environment field to map the secret variables.
So you can add environment variable in PowerShell task to map secret variables. And you can set it as pipeline variable for the next tasks.
Here is an example:
- powershell: |
echo "##vso[task.setvariable variable=myPass]$env:myPass"
displayName: 'PowerShell Script'
env:
myPass: $(myPass)
Then you can use the variable in the next tasks.
For more detailed info, you can refer to this doc: Secret Variable
I am doing AWS tutorial Python and DynamoDB. I downloaded and installed DynamoDB Local. I got the access key and secret access key. I installed boto3 for python. The only step I have left is setting up authentication credentials. I do not have AWS CLI downloaded, so where should I include access key and secret key and also the region?
Do I include it in my python code?
Do I make a file in my directory where I put this info? Then should I write anything in my python code so it can find it?
You can try passing the accesskey and secretkey in your code like this:
import boto3
session = boto3.Session(
aws_access_key_id=ACCESS_KEY,
aws_secret_access_key=SECRET_KEY,
)
client = session.client('dynamodb')
OR
dynamodb = session.resource('dynamodb')
From the AWS documentation:
Before you can access DynamoDB programmatically or through the AWS
Command Line Interface (AWS CLI), you must configure your credentials
to enable authorization for your applications. Downloadable DynamoDB
requires any credentials to work, as shown in the following example.
AWS Access Key ID: "fakeMyKeyId"
AWS Secret Access Key:"fakeSecretAccessKey"
You can use the aws configure command of the AWS
CLI to set up credentials. For more information, see Using the AWS
CLI.
So, you need to create an .aws folder in yr home directory.
There create the credentials and config files.
Here's how to do this:
https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html
If you want to write portable code and keep in the spirit of developing 12-factor apps, consider using environment variables.
The advantage is that locally, both the CLI and the boto3 python library in your code (and pretty much all the other offical AWS SDK languages, PHP, Go, etc.) are designed to look for these values.
An example using the official Docker image to quickly start DynamoDB local:
# Start a local DynamoDB instance on port 8000
docker run -p 8000:8000 amazon/dynamodb-local
Then in a terminal, set some defaults that the CLI and SDKs like boto3 are looking for.
Note that these will be available until you close your terminal session.
# Region doesn't matter, CLI will complain if not provided
export AWS_DEFAULT_REGION=us-east-1
# Set some dummy credentials, dynamodb local doesn't care what these are
export AWS_ACCESS_KEY_ID=abc
export AWS_SECRET_ACCESS_KEY=abc
You should then be able to run the following (in the same terminal session) if you have the CLI installed. Note the --endpoint-url flag.
# Create a new table in DynamoDB Local
aws dynamodb create-table \
--endpoint-url http://127.0.0.1:8000 \
--table-name tmp \
--attribute-definitions AttributeName=id,AttributeType=S \
--key-schema AttributeName=id,KeyType=HASH \
--billing-mode PAY_PER_REQUEST
You should then able to list out the tables with:
aws dynamodb list-tables --endpoint-url http://127.0.0.1:8000
And get a result like:
{
"TableNames": [
"tmp"
]
}
So how do we get the endpoint-url that we've been specifying in the CLI to work in Python? Unfortunately, there isn't a default environment variable for the endpoint url in the boto3 codebase, so we'll need to pass it in when the code runs. The docs for .NET and Java are comprehensive but for Python, they are a bit more elusive. From the boto3 github repo and also see this great answer, we need to create a client or resource with the endpoint_url keyword. In the below, we're looking for a custom environment variable called AWS_DYNAMODB_ENDPOINT_URL. The point being that if specified, it will be used, otherwise will fall back to whatever the platform default is, making your code portable.
# Run in the same shell as before
export AWS_DYNAMODB_ENDPOINT_URL=http://127.0.0.1:8000
# file test.py
import os
import boto3
# Get environment variable if it's defined
# Make sure to set the environment variable before running
endpoint_url = os.environ.get('AWS_DYNAMODB_ENDPOINT_URL', None)
# Using (high level) resource, same keyword for boto3.client
resource = boto3.resource('dynamodb', endpoint_url=endpoint_url)
tables = resource.tables.all()
for table in tables:
print(table)
Finally, run this snippet with
# Run in the same shell as before
python3 test.py
# Should produce the following output:
# dynamodb.Table(name='tmp')
We want to have a production airflow environment but do not know how to deal properly with secrets, in particular google bigquery client JSON files
We tried setting up the kubernetes secrets on the automatically created kubernetes cluster (automatically by creationg a google cloud composer (airflow) environment). We currently just put the files on the bucket, but would like a better way.
def get_bq_client ():
""" returns bq client """
return bq.Client.from_service_account_json(
join("volumes", "bigquery.json")
)
We would like some form of proper management of the required secrets. Sadly, using Airflow Variables won't work because we can't create the client object using the json file as text
One solution that would work, is to encrypt the JSON files and put that on the bucket. As long as the decryption key exists on the bucket and no where else you'll be able to just check the code in with secrets to some source control and in the bucket checkout and decrypt.
We are using apache beam through airflow. Default GCS account is set with environmental variable - GOOGLE_APPLICATION_CREDENTIALS. We don't want to change environmental variable as it might affect other processes running at that time. I couldn't find a way to change Google Cloud Dataflow Service Account programmatically.
We are creating pipeline in following way
p = beam.Pipeline(argv=self.conf)
Is there any option through argv or options, where in I can mention the location of gcs credential file?
Searched through documentation, but didn't find much information.
You can specify a service account when you launch the job with a basic flag:
--serviceAccount=my-service-account-name#my-project.iam.gserviceaccount.com
That account will need the Dataflow Worker role attached plus whatever else you would like(GCS/BQ/Etc). Details here. You don't need the SA to be stored in GCS, or keys locally to use it.