Provide BigQuery credentials in Apache-Beam pipeline coded in python - python

I'm trying to read data from bigquery in my beam pipeline using cloud dataflow runner.
I want to provide a credentials to access the project.
I've seen examples in Java but none in Python.
The only possibility I found is to use the : --service_account_email argument
But what if I want to give the .json key information in the code itself in all the options like :
google_cloud_options.service_account = '/path/to/credential.json'
options = PipelineOptions(flags=argv)
google_cloud_options = options.view_as(GoogleCloudOptions)
google_cloud_options.project = 'project_name'
google_cloud_options.job_name = 'job_name'
google_cloud_options.staging_location = 'gs://bucket'
google_cloud_options.temp_location = 'gs://bucket'
options.view_as(StandardOptions).runner = 'DataflowRunner'
with beam.Pipeline(options=options) as pipeline:
query = open('query.sql', 'r')
bq_source = beam.io.BigQuerySource(query=query.read(), use_standard_sql=True)
main_table = \
pipeline \
| 'ReadAccountViewAll' >> beam.io.Read(bq_source) \
Java has a method getGcpCredential but cant find one in Python...
Any ideas ?

The --service_account_email is the recommended approach as mentioned here . Downloading the key and storing it locally or on GCE is not recommended.

For the cases where it is required to use a different path for the json file within the code, you can try the following python Authentication workarounds:
client = Client.from_service_account_json('/path/to/keyfile.json')
or
client = Client(credentials=credentials)
Here is an example for creating custom credentials from a file:
credentials = service_account.Credentials.from_service_account_file(
key_path,
scopes=["https://www.googleapis.com/auth/cloud-platform"],
)

Related

Permission Denied using Google AiPlatform ModelServiceClient

I am following a guide to get a Vertex AI pipeline working:
https://codelabs.developers.google.com/vertex-pipelines-intro#5
I have implemented the following custom component:
from google.cloud import aiplatform as aip
from google.oauth2 import service_account
project = "project-id"
region = "us-central1"
display_name = "lookalike_model_pipeline_1646929843"
model_name = f"projects/{project}/locations/{region}/models/{display_name}"
api_endpoint = "us-central1-aiplatform.googleapis.com" #europe-west2
model_resource_path = model_name
client_options = {"api_endpoint": api_endpoint}
# Initialize client that will be used to create and send requests.
client = aip.gapic.ModelServiceClient(credentials=service_account.Credentials.from_service_account_file('..\\service_accounts\\aiplatform_sa.json'),
client_options=client_options)
#get model evaluation
response = client.list_model_evaluations(parent=model_name)
And I get following error:
(<class 'google.api_core.exceptions.PermissionDenied'>, PermissionDenied("Permission 'aiplatform.modelEvaluations.list' denied on resource '//aiplatform.googleapis.com/projects/project-id/locations/us-central1/models/lookalike_model_pipeline_1646929843' (or it may not exist)."), <traceback object at 0x000002414D06B9C0>)
The model definitely exists and has finished training. I have given myself admin rights in the aiplatform service account. In the guide, they do not use a service account, but uses only client_options instead. The client_option has the wrong type since it is a dict(str, str) when it should be: Optional['ClientOptions']. But this doesn't cause an error.
My main question is: how do I get around this permission issue?
My subquestions are:
How can I use my model_name variable in a URL to get to the model?
How can I create an Optional['ClientOptions'] object to pass as client_option
Is there another way I can list_model_evaluations from a model that is in VertexAI, trained using automl?
Thanks
With the caveats in my comment that, while familiar with GCP, I'm less familiar with the AI|ML stuff. The following should work. I don't have a model to deploy to test it.
BILLING=[[YOUR-BILLING]]
export PROJECT=[[YOUR-PROJECT]]
export LOCATION="us-central1"
export MODEL=[[YOUR-MODEL]]
ACCOUNT="tester"
gcloud projects create ${PROJECT}
gcloud beta billing projects link ${PROJECT} \
--billing-account=${BILLING}
# Unsure whether ML is needed
for SERVICE in "aiplatform" "ml"
do
gcloud services enable ${SERVICE}.googleapis.com \
--project=${PROJECT}
done
gcloud iam service-accounts create ${ACCOUNT} \
--project=${PROJECT}
EMAIL=${ACCOUNT}#${PROJECT}.iam.gserviceaccount.com
gcloud projects add-iam-policy-binding ${PROJECT} \
--role=roles/aiplatform.admin \
--member=serviceAccount:${EMAIL}
gcloud iam service-accounts keys create ${PWD}/${ACCOUNT}.json \
--iam-account=${EMAIL} \
--project=${PROJECT}
export GOOGLE_APPLICATION_CREDENTIALS=${PWD}/${ACCOUNT}.json
python3 -m venv venv
source venv/bin/activate
python3 -m pip install google-cloud-aiplatform
python3 main.py
main.py:
import os
from google.cloud import aiplatform
project = os.getenv("PROJECT")
location = os.getenv("LOCATION")
model = os.getenv("MODEL")
aiplatform.init(
project=project,
location=location,
experiment="test",
)
parent = f"projects/{project}/locations/{location}/models/{model}"
model = aiplatform.Model(parent)
I tried using your code and it did not also work for me and got a different error. As #DazWilkin mentioned it is recommended to use the Cloud Client.
I used aiplatform_v1 and it worked fine. One thing I noticed is that you should always define a value for client_options so it will point to the correct endpoint. Checking the code for ModelServiceClient, if I'm not mistaken the endpoint defaults to "aiplatform.googleapis.com" which don't have a location prepended. AFAIK the endpoint should prepend a location.
See code below. I used AutoML models and it returns their model evaluations.
from google.cloud import aiplatform_v1 as aiplatform
from typing import Optional
def get_model_eval(
project_id: str,
model_id: str,
client_options: dict,
location: str = 'us-central1',
):
client_model = aiplatform.services.model_service.ModelServiceClient(client_options=client_options)
model_name = f'projects/{project_id}/locations/{location}/models/{model_id}'
list_eval_request = aiplatform.types.ListModelEvaluationsRequest(parent=model_name)
list_eval = client_model.list_model_evaluations(request=list_eval_request)
print(list_eval)
api_endpoint = 'us-central1-aiplatform.googleapis.com'
client_options = {"api_endpoint": api_endpoint} # api_endpoint is required for client_options
project_id = 'project-id'
location = 'us-central1'
model_id = '99999999999' # aiplatform_v1 uses the model_id
get_model_eval(
client_options = client_options,
project_id = project_id,
location = location,
model_id = model_id,
)
This is an output snippet from my AutoML Text Classification:

Dataflow batch job not scaling

My Dataflow job (Job ID: 2020-08-18_07_55_15-14428306650890914471) is not scaling past 1 worker, despite Dataflow setting the target workers to 1000.
The job is configured to query the Google Patents BigQuery dataset, tokenize the text using a ParDo custom function and the transformers (huggingface) library, serialize the result, and write everything to a giant parquet file.
I had assumed (after running the job yesterday, which mapped a function instead of using a beam.DoFn class) that the issue was some non-parallelizing object eliminating scaling; hence, refactoring the tokenization process as a class.
Here's the script, which is run from the command line with the following command:
python bq_to_parquet_pipeline_w_class.py --extra_package transformers-3.0.2.tar.gz
The script:
import os
import re
import argparse
import google.auth
import apache_beam as beam
from apache_beam.options import pipeline_options
from apache_beam.options.pipeline_options import GoogleCloudOptions
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions
from apache_beam.runners import DataflowRunner
from apache_beam.io.gcp.internal.clients import bigquery
import pyarrow as pa
import pickle
from transformers import AutoTokenizer
print('Defining TokDoFn')
class TokDoFn(beam.DoFn):
def __init__(self, tok_version, block_size=200):
self.tok = AutoTokenizer.from_pretrained(tok_version)
self.block_size = block_size
def process(self, x):
txt = x['abs_text'] + ' ' + x['desc_text'] + ' ' + x['claims_text']
enc = self.tok.encode(txt)
for idx, token in enumerate(enc):
chunk = enc[idx:idx + self.block_size]
serialized = pickle.dumps(chunk)
yield serialized
def run(argv=None, save_main_session=True):
query_big = '''
with data as (
SELECT
(select text from unnest(abstract_localized) limit 1) abs_text,
(select text from unnest(description_localized) limit 1) desc_text,
(select text from unnest(claims_localized) limit 1) claims_text,
publication_date,
filing_date,
grant_date,
application_kind,
ipc
FROM `patents-public-data.patents.publications`
)
select *
FROM data
WHERE
abs_text is not null
AND desc_text is not null
AND claims_text is not null
AND ipc is not null
'''
query_sample = '''
SELECT *
FROM `client_name.patent_data.patent_samples`
LIMIT 2;
'''
print('Start Run()')
parser = argparse.ArgumentParser()
known_args, pipeline_args = parser.parse_known_args(argv)
'''
Configure Options
'''
# Setting up the Apache Beam pipeline options.
# We use the save_main_session option because one or more DoFn's in this
# workflow rely on global context (e.g., a module imported at module level).
options = PipelineOptions(pipeline_args)
options.view_as(SetupOptions).save_main_session = save_main_session
# Sets the project to the default project in your current Google Cloud environment.
_, options.view_as(GoogleCloudOptions).project = google.auth.default()
# Sets the Google Cloud Region in which Cloud Dataflow runs.
options.view_as(GoogleCloudOptions).region = 'us-central1'
# IMPORTANT! Adjust the following to choose a Cloud Storage location.
dataflow_gcs_location = 'gs://client_name/dataset_cleaned_pq_classTok'
# Dataflow Staging Location. This location is used to stage the Dataflow Pipeline and SDK binary.
options.view_as(GoogleCloudOptions).staging_location = f'{dataflow_gcs_location}/staging'
# Dataflow Temp Location. This location is used to store temporary files or intermediate results before finally outputting to the sink.
options.view_as(GoogleCloudOptions).temp_location = f'{dataflow_gcs_location}/temp'
# The directory to store the output files of the job.
output_gcs_location = f'{dataflow_gcs_location}/output'
print('Options configured per GCP Notebook Examples')
print('Configuring BQ Table Schema for Beam')
#Write Schema (to PQ):
schema = pa.schema([
('block', pa.binary())
])
print('Starting pipeline...')
with beam.Pipeline(runner=DataflowRunner(), options=options) as p:
res = (p
| 'QueryTable' >> beam.io.Read(beam.io.BigQuerySource(query=query_big, use_standard_sql=True))
| beam.ParDo(TokDoFn(tok_version='gpt2', block_size=200))
| beam.Map(lambda x: {'block': x})
| beam.io.WriteToParquet(os.path.join(output_gcs_location, f'pq_out'),
schema,
record_batch_size=1000)
)
print('Pipeline built. Running...')
if __name__ == '__main__':
import logging
logging.getLogger().setLevel(logging.INFO)
logging.getLogger("transformers.tokenization_utils_base").setLevel(logging.ERROR)
run()
The solution is twofold:
The following quotas were being exceeded when I ran my job, all under 'Compute Engine API' (view your quotas here: https://console.cloud.google.com/iam-admin/quotas):
CPUs (I requested an increase to 50)
Persistent Disk Standard (GB) (I requested an increase to 12,500)
In_Use_IP_Address (I requested an increase to 50)
Note: If you read the console output while your job is running, any exceeded quotas should print out as an INFO line.
Following Peter Kim's advice above, I passed the flag --max_num_workers as part of my command:
python bq_to_parquet_pipeline_w_class.py --extra_package transformers-3.0.2.tar.gz --max_num_workers 22
And I started scaling!
All in all, it would be nice if there was a way to prompt users via the Dataflow console when a quota is being reached, and provide an easy means to request an increase to that (and recommended complementary) quotas, along with suggestions for what the increased amount to be requested should be.

Python 3 and Azure table storage tablestorageaccount not working

I'm trying to use the sample provided by Microsoft to connect to an Azure storage table using Python. The code below fail because of tablestorageaccount not found. What I'm missing I installed the azure package but still complaining that it's not found.
import azure.common
from azure.storage import CloudStorageAccount
from tablestorageaccount import TableStorageAccount
print('Azure Table Storage samples for Python')
# Create the storage account object and specify its credentials
# to either point to the local Emulator or your Azure subscription
if IS_EMULATED:
account = TableStorageAccount(is_emulated=True)
else:
account_connection_string = STORAGE_CONNECTION_STRING
# Split into key=value pairs removing empties, then split the pairs into a dict
config = dict(s.split('=', 1) for s in account_connection_string.split(';') if s)
# Authentication
account_name = config.get('AccountName')
account_key = config.get('AccountKey')
# Basic URL Configuration
endpoint_suffix = config.get('EndpointSuffix')
if endpoint_suffix == None:
table_endpoint = config.get('TableEndpoint')
table_prefix = '.table.'
start_index = table_endpoint.find(table_prefix)
end_index = table_endpoint.endswith(':') and len(table_endpoint) or table_endpoint.rfind(':')
endpoint_suffix = table_endpoint[start_index+len(table_prefix):end_index]
account = TableStorageAccount(account_name = account_name, connection_string = account_connection_string, endpoint_suffix=endpoint_suffix)
I find the source sample code, and in the sample code there is still a custom module tablestorageaccount.py, it's just used to return TableService. If you already have the storage connection string and want to have a test, you could connect to table directly.
Sample:
from azure.storage.table import TableService, Entity
account_connection_string = 'DefaultEndpointsProtocol=https;AccountName=account name;AccountKey=account key;EndpointSuffix=core.windows.net'
tableservice=TableService(connection_string=account_connection_string)
Also you could refer to the new sdk to connect table. Here is the official tutorial about Get started with Azure Table storage.

How to execute a stored procedure in Azure Databricks PySpark?

I am able to execute a simple SQL statement using PySpark in Azure Databricks but I want to execute a stored procedure instead. Below is the PySpark code I tried.
#initialize pyspark
import findspark
findspark.init('C:\Spark\spark-2.4.5-bin-hadoop2.7')
#import required modules
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import *
import pandas as pd
#Create spark configuration object
conf = SparkConf()
conf.setMaster("local").setAppName("My app")
#Create spark context and sparksession
sc = SparkContext.getOrCreate(conf=conf)
spark = SparkSession(sc)
table = "dbo.test"
#read table data into a spark dataframe
jdbcDF = spark.read.format("jdbc") \
.option("url", f"jdbc:sqlserver://localhost:1433;databaseName=Demo;integratedSecurity=true;") \
.option("dbtable", table) \
.option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
.load()
#show the data loaded into dataframe
#jdbcDF.show()
sqlQueries="execute testJoin"
resultDF=spark.sql(sqlQueries)
resultDF.show(resultDF.count(),False)
This doesn't work — how do I do it?
In case someone is still looking for a method on how to do this, it's possible to use the built-in jdbc-connector of you spark session. Following code sample will do the trick:
import msal
# Set url & credentials
jdbc_url = ...
tenant_id = ...
sp_client_id = ...
sp_client_secret = ...
# Write your SQL statement as a string
name = "Some passed value"
statement = f"""
EXEC Staging.SPR_InsertDummy
#Name = '{name}'
"""
# Generate an OAuth2 access token for service principal
authority = f"https://login.windows.net/{tenant_id}"
app = msal.ConfidentialClientApplication(sp_client_id, sp_client_secret, authority)
token = app.acquire_token_for_client(scopes="https://database.windows.net/.default")["access_token"]
# Create a spark properties object and pass the access token
properties = spark._sc._gateway.jvm.java.util.Properties()
properties.setProperty("accessToken", token)
# Fetch the driver manager from your spark context
driver_manager = spark._sc._gateway.jvm.java.sql.DriverManager
# Create a connection object and pass the properties object
con = driver_manager.getConnection(jdbc_url, properties)
# Create callable statement and execute it
exec_statement = con.prepareCall(statement)
exec_statement.execute()
# Close connections
exec_statement.close()
con.close()
For more information and a similar method using SQL-user credentials to connect over JDBC, or on how to take return parameters, I'd suggest you take a look at this blogpost:
https://medium.com/delaware-pro/executing-ddl-statements-stored-procedures-on-sql-server-using-pyspark-in-databricks-2b31d9276811
Running a stored procedure through a JDBC connection from azure databricks is not supported as of now. But your options are:
Use a pyodbc library to connect and execute your procedure. But by using this library, it means that you will be running your code on the driver node while all your workers are idle. See this article for details.
https://datathirst.net/blog/2018/10/12/executing-sql-server-stored-procedures-on-databricks-pyspark
Use a SQL table function rather than procedures. In a sense, you can use anything that you can use in the FORM clause of a SQL query.
Since you are in an azure environment, then using a combination of azure data factory (to execute your procedure) and azure databricks can help you to build pretty powerful pipelines.

Amazon AWS - S3 to ElasticSearch (Python Lambda)

I'd like to copy data from an S3 directory to the Amazon ElasticSearch service. I've tried following the guide, but unfortunately the part I'm looking for is missing. I don't know how the lambda function itself should look like (and all the info about this in the guide is: "Place your application source code in the eslambda folder."). I'd like ES to autoindex the files.
Currently I'm trying
for record in event['Records']:
bucket = record['s3']['bucket']['name']
key = urllib.unquote_plus(record['s3']['object']['key'])
index_name = event.get('index_name', key.split('/')[0])
object = s3_client.Object(bucket, key)
data = object.get()['Body'].read()
helpers.bulk(es, data, chunk_size=100)
But I get like a massive error stating
elasticsearch.exceptions.RequestError: TransportError(400, u'action_request_validation_exception', u'Validation Failed: 1: index is missing;2: type is missing;3: index is missing;4: type is missing;5: index is missing;6: type is missing;7: ...
Could anyone explain to me, how can I set things up so that my data gets moved from S3 to ES where it gets auto-mapped and auto-indexed? Apparently it's possible, as mentioned in the reference here and here.
While mapping can automatically be assigned in Elasticsearch, the indexes are not automatically generated. You have to specify the index name and type in the POST request. If that index does not exist, then Elasticsearch will create the index automatically.
Based on your error, it looks like you're not passing through an index and type.
For example, here's how a simple POST request to add a record to the index MyIndex and type MyType which would first create the index and type if it did not already exist.
curl -XPOST 'example.com:9200/MyIndex/MyType/' \
-d '{"name":"john", "tags" : ["red", "blue"]}'
I wrote a script to download a csv file from S3 and then transfer the data to ES.
Made an S3 client using boto3 and downloaded the file from S3
Made an ES client to connect to Elasticsearch.
Opened the csv file and used the helpers module from elasticsearch to insert csv file contents into elastic search.
main.py
import boto3
from elasticsearch import helpers, Elasticsearch
import csv
import os
from config import *
#S3
Downloaded_Filename=os.path.basename(Prefix)
s3 = boto3.client('s3', aws_access_key_id=awsaccesskey,aws_secret_access_key=awssecretkey,region_name=awsregion)
s3.download_file(Bucket,Prefix,Downloaded_Filename)
#ES
ES_index = Downloaded_Filename.split(".")[0]
ES_client = Elasticsearch([ES_host],http_auth=(ES_user, ES_password),port=ES_port)
#S3 to ES
with open(Downloaded_Filename) as f:
reader = csv.DictReader(f)
helpers.bulk(ES_client, reader, index=ES_index, doc_type='my-type')
config.py
awsaccesskey = ""
awssecretkey = ""
awsregion = "us-east-1"
Bucket=""
Prefix=''
ES_host = "localhost"
ES_port = "9200"
ES_user = "elastic"
ES_password = "changeme"

Categories