I would like to register a dataset from ADLS Gen2 in my Azure Machine Learning workspace (azureml-core==1.12.0). Since the Python SDK documentation for .register_azure_data_lake_gen2() does not list service principal information as required, I successfully used the following code to register ADLS Gen2 as a datastore:
import os
from azureml.core import Datastore

adlsgen2_datastore_name = os.environ['adlsgen2_datastore_name']
account_name = os.environ['account_name']  # ADLS Gen2 account name
file_system = os.environ['filesystem']

adlsgen2_datastore = Datastore.register_azure_data_lake_gen2(
    workspace=ws,
    datastore_name=adlsgen2_datastore_name,
    account_name=account_name,
    filesystem=file_system
)
However, when I try to register a dataset, using
from azureml.core import Dataset
adls_ds = Datastore.get(ws, datastore_name=adlsgen2_datastore_name)
data = Dataset.Tabular.from_delimited_files((adls_ds, 'folder/data.csv'))
I get an error
Cannot load any data from the specified path. Make sure the path is accessible and contains data.
ScriptExecutionException was caused by StreamAccessException.
StreamAccessException was caused by AuthenticationException.
'AdlsGen2-ReadHeaders' for '[REDACTED]' on storage failed with status code 'Forbidden' (This request is not authorized to perform this operation using this permission.), client request ID <CLIENT_REQUEST_ID>, request ID <REQUEST_ID>. Error message: [REDACTED]
| session_id=<SESSION_ID>
Do I need to enable the service principal to get this to work? In the ML Studio UI, it appears that the service principal is required even to register the datastore.
Another issue I noticed is that AMLS is trying to access the dataset here:
https://adls_gen2_account_name.**dfs**.core.windows.net/container/folder/data.csv whereas the actual URI in ADLS Gen2 is: https://adls_gen2_account_name.**blob**.core.windows.net/container/folder/data.csv
According to this documentation, you need to enable the service principal.
1. You need to register your application and grant the service principal the Storage Blob Data Reader role on the storage account.
2. Try this code:
adlsgen2_datastore = Datastore.register_azure_data_lake_gen2(
    workspace=ws,
    datastore_name=adlsgen2_datastore_name,
    account_name=account_name,
    filesystem=file_system,
    tenant_id=tenant_id,           # directory (tenant) ID of the service principal
    client_id=client_id,           # application (client) ID of the service principal
    client_secret=client_secret    # client secret of the service principal
)

adls_ds = Datastore.get(ws, datastore_name=adlsgen2_datastore_name)
dataset = Dataset.Tabular.from_delimited_files((adls_ds, 'sample.csv'))
print(dataset.to_pandas_dataframe())
Result: the dataset loads and the dataframe prints as expected.
I am trying to write a file from an Azure Synapse Notebook to ADLS Gen2 while authenticating with the account key.
When I use Python and the DataLakeServiceClient, I can authenticate with the key and write a file without a problem. But if I try to authenticate with the same key in Spark, I get java.nio.file.AccessDeniedException: Operation failed: "This request is not authorized to perform this operation using this permission.", 403, PUT,.
With PySpark, authorizing with the account key [NOT WORKING]:
myaccountname = ""
account_key = ""
spark.conf.set(f"fs.azure.account.key.{myaccountname}.dfs.core.windows.net", account_key)

dest_container = "container_name"
dest_storage_name = "storage_name"
destination_storage = f"abfss://{dest_container}@{dest_storage_name}.dfs.core.windows.net"

df.write.mode("append").parquet(destination_storage + "/raw/myfile.parquet")
But I can write a file with Python and the DataLakeServiceClient, also authorizing with the account key [WORKING]:
from azure.storage.filedatalake import DataLakeServiceClient

# DAP ADLS configurations
storage_name = ""
account_key = ""
container_name = ""
directory_name = ""
file_name = ""
file_content = b""

service_client = DataLakeServiceClient(
    account_url=f"https://{storage_name}.dfs.core.windows.net",
    credential=account_key)

file_system_client = service_client.get_file_system_client(container_name)
dir_client = file_system_client.get_directory_client(directory_name)
dir_client.create_directory()
file_client = dir_client.get_file_client(file_name)
file_client.create_file()
file_client.append_data(file_content, offset=0, length=len(file_content))
file_client.flush_data(len(file_content))
What am I doing wrong? I was under the impression that setting the key with spark.conf.set would be enough.
I finally solved it by using a LinkedService. In the LinkedService I used the AccountKey (retrieved from a KeyVault).
For some reason, direct authentication with the account key in the code did not work in the Synapse Notebook, despite the user having all required permissions.
UPDATE: According to Microsoft's third-level tech support, authentication with an account key from within Synapse is not possible (!!!). You HAVE to use their LinkedServices.
If anyone else needs to authenticate:
linkedServiceName_var = "my_linked_service_name"
spark.conf.set("fs.azure.account.auth.type", "SAS")
spark.conf.set("fs.azure.sas.token.provider.type", "com.microsoft.azure.synapse.tokenlibrary.LinkedServiceBasedSASProvider")
spark.conf.set("spark.storage.synapse.linkedServiceName", linkedServiceName_var)

raw_container_name = "my_container"
raw_storageaccount_name = "my_storage_account"
filepath = "raw/myfile.parquet"  # path within the container
CONNECTION_STR = f"abfs://{raw_container_name}@{raw_storageaccount_name}.dfs.core.windows.net"

my_df = spark.read.parquet(CONNECTION_STR + "/" + filepath)
--Update
Can you double-check that you, or the user running this, has ADLS Gen2 access and the right permissions (Contributor role on the subscription, Storage Blob Data Owner at the storage account level, or the Storage Blob Data Contributor role granted to the service principal in the scope of the ADLS Gen2 storage account), depending on your setup?
Also make sure you have a valid account key copied from the Azure portal.
Just in case....
To enable other users to use the storage account after you create your workspace, you will have to perform the following tasks:
Assign other users to the Contributor role on workspace
Assign other users to a Workspace, SQL, or Spark admin role using Synapse Studio
Assign yourself and other users to the Storage Blob Data Contributor role on the storage account
Also, if you are using MSI for the Synapse workspace, make sure that you as a user have the same permission level in the notebook.
Going through the official MS docs on connecting Azure Synapse to an Azure storage account:
In case you have set up an account key and secret for the storage account, you can set forwardSparkAzureStorageCredentials to true, in which case the Azure Synapse connector automatically discovers the account access key set in the notebook session configuration or the global Hadoop configuration and forwards the storage account access key to the connected Azure Synapse instance by creating a temporary Azure database scoped credential.
Just add this option to the df.write call:
.option("forwardSparkAzureStorageCredentials", "true")
I want to extract data from a shared BigQuery dataset. The dataset was shared with my Gmail account, but I created credentials for my GCP project. I am now getting an error because these credentials are not associated with my Gmail account. I need credentials for my Gmail account so I can query the shared dataset with the Python BigQuery client.
I created the client for my Project (but the shared dataset is in another Project):
client = bigquery.Client(credentials=credentials, project=project_id)
And then run a query getting this error:
"Forbidden: 403 Access Denied: BigQuery BigQuery: Permission denied for table:..."
I'm working in Datalab, but when I try to query a table in BigQuery I get the following error:
Exception: invalid: Error while reading table: .... error message: Failed to read the spreadsheet. Errors: No OAuth token with Google Drive scope was found.
This only happens with tables that are linked to a Google Drive sheet.
I have now enabled the Google Drive app in GCP and run:
from google.cloud import bigquery
client = bigquery.Client()
sql = """
SELECT * FROM `proyect-xxxx.set_xxx.table_x` LIMIT 1000
"""
df = client.query(sql).to_dataframe()
project_id = 'proyect-xxxx'
df = client.query(sql, project=project_id).to_dataframe()
df.head(3)
But I still get the same "No OAuth token with Google Drive scope was found" error.
As stated by the error, you are trying to access Google Drive, which backs your BigQuery external table, without the Google Drive scope on your OAuth token.
You will need to go to the Google Console and enable this access to solve your problem.
You can use this link, which provides a how-to explanation on this subject:
Visit the Google API Console to obtain OAuth 2.0 credentials such as a client ID and client secret that are known to both Google and your application. The set of values varies based on what type of application you are building. For example, a JavaScript application does not require a secret, but a web server application does.
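If the client is built in code, a minimal sketch (an assumption, not part of the linked guide) of requesting the Drive scope alongside BigQuery via application-default credentials:
import google.auth
from google.cloud import bigquery

# Request both the BigQuery and Drive scopes so Sheets-backed external tables can be read.
credentials, project = google.auth.default(
    scopes=[
        "https://www.googleapis.com/auth/bigquery",
        "https://www.googleapis.com/auth/drive",
    ]
)
client = bigquery.Client(credentials=credentials, project=project)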
Is there any way of granting readonly access to a specific BigQuery Dataset to a given Client ID ?
I've tried using a service account, but this gives full access to all datasets.
Also tried creating a service account from a different application, and added the email address generated together with the certificate to the BigQuery > Some Dataset > Share Dataset > Can view, but this always results in a 403 "Access not Configured" error.
I'm using the server-to-server flow described in the documentation:
import httplib2
from apiclient.discovery import build
from oauth2client.client import SignedJwtAssertionCredentials
# REPLACE WITH YOUR Project ID
PROJECT_NUMBER = 'XXXXXXXXXXX'
# REPLACE WITH THE SERVICE ACCOUNT EMAIL FROM GOOGLE DEV CONSOLE
SERVICE_ACCOUNT_EMAIL = 'XXXXX@developer.gserviceaccount.com'
f = file('key.p12', 'rb')
key = f.read()
f.close()
credentials = SignedJwtAssertionCredentials(
SERVICE_ACCOUNT_EMAIL,
key,
scope='https://www.googleapis.com/auth/bigquery.readonly')
http = httplib2.Http()
http = credentials.authorize(http)
service = build('bigquery', 'v2')
tables = service.tables()
response = tables.list(projectId=PROJECT_NUMBER, datasetId='SomeDataset').execute(http)
print(response)
I'm basically trying to provide readonly access to an external server based application to a single dataset.
As pointed out by Fh, it is required to activate the BigQuery API in the Google Account where the service account is created, regardless of the fact that it will be querying a BigQuery endpoint bound to a different application ID.
Late to the game on migrating to the /v1 Fusion Tables API, but I can't hold off any longer.
I'm using Python on App Engine and trying to connect to Google Fusion Tables with Google Service Accounts (the more complicated cousin of OAuth2 for server-side apps, which uses JSON Web Tokens).
I found another question that pointed me to some documentation for using Service Accounts with Google Prediction API.
Fusion Table and Google Service Accounts
So far I've got:
import httplib2
from google.appengine.api import memcache
from oauth2client.appengine import AppAssertionCredentials
from apiclient.discovery import build

credentials = AppAssertionCredentials(scope='https://www.googleapis.com/auth/fusiontables')
http = credentials.authorize(httplib2.Http(memcache))  # App Engine memcache as the HTTP cache
service = build("fusiontables", "v1", http=http)

# list the tables
tables = service.table().list().execute()  # <-- ERROR 401 invalid credentials here
Does anyone have an example of connecting to Fusion Tables on AppEngine using Service Accounts they might be able to share? Or something nice online?
Thanks
This actually does work. The important part is that you have to give the App Engine service account access to your Fusion Table; if you are writing, the account needs write access. For help see: https://developers.google.com/api-client-library/python/start/installation (look for Getting started: Quickstart).
Your App Engine service account will be something like your-app-id@appspot.gserviceaccount.com
You must also make the app engine service account a team member in the api console and give it "can edit" privilege.
import logging
import httplib2
from time import strftime, gmtime
from google.appengine.api import app_identity
from oauth2client.appengine import AppAssertionCredentials
from apiclient.discovery import build

SCOPE = 'https://www.googleapis.com/auth/fusiontables'
PROJECT_NUMBER = 'XXXXXXXX'  # REPLACE WITH YOUR Project ID

# Create a new API service for interacting with Fusion Tables
credentials = AppAssertionCredentials(scope=SCOPE)
http = credentials.authorize(httplib2.Http())
logging.info('QQQ: accountname: %s' % app_identity.get_service_account_name())
service = build('fusiontables', 'v1', http=http, developerKey='YOUR KEY HERE FROM API CONSOLE')

def log(value1, value2=None):
    tableid = 'YOUR TABLE ID FROM FUSION TABLES'
    now = strftime("%Y-%m-%d %H:%M:%S", gmtime())
    service.query().sql(sql="INSERT INTO %s (Temperature,Date) values(%s,'%s')" % (tableid, value1, now)).execute()
To clarify Ralph Yozzo's answer: you need to add the value of 'client_email' from the JSON file you downloaded when you created your service account credentials (the same file you load when using ServiceAccountCredentials.from_json_keyfile_name('service_acct.json') with the newer oauth2client library) to your table's sharing dialog screen, by entering that email address in the sharing field.
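For example, a quick way to print that address from the same (placeholder) key file:
import json

# Print the service account email that must be added to the table's sharing dialog.
with open('service_acct.json') as f:
    print(json.load(f)['client_email'])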
Since Fusion Tables' tables are owned by individual Gmail accounts rather than the service account associated with an API console project, the AppAssertionCredentials probably won't work. It would make for an interesting feature request, though:
http://code.google.com/p/fusion-tables/issues/list
The best online resource I have found for help connecting Python App Engine to the Fusion Tables API with OAuth2 is:
Google APIs Client Library for Python
The slide presentation is helpful for understanding the online samples and why decorators are used.
Also useful for understanding whether to use the app's Service Account or User Accounts to authenticate is:
Using OAuth 2.0 to Access Google APIs
Consider installing the Google APIs Client Library for Python
Apart from the scope, the OAuth2 flow is more or less common to all Google APIs, not just Fusion Tables.
Once oauth2 is working, see the Google Fusion Tables API
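As an illustration of the decorator-based user-account flow those samples use, here is a minimal sketch on App Engine (the client ID and secret are placeholders, not values from the original answers):
import webapp2
from apiclient.discovery import build
from oauth2client.appengine import OAuth2Decorator

# Placeholder OAuth2 client credentials from the API console.
decorator = OAuth2Decorator(
    client_id='YOUR_CLIENT_ID',
    client_secret='YOUR_CLIENT_SECRET',
    scope='https://www.googleapis.com/auth/fusiontables')

class TablesHandler(webapp2.RequestHandler):
    @decorator.oauth_required
    def get(self):
        # The decorator supplies an http object authorized as the signed-in user.
        service = build('fusiontables', 'v1', http=decorator.http())
        tables = service.table().list().execute()
        self.response.write(tables)

app = webapp2.WSGIApplication(
    [('/', TablesHandler), (decorator.callback_path, decorator.callback_handler())],
    debug=True)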
In case you want it to work from a host other than Google App Engine or Google Compute Engine (e.g. from localhost for testing), you should use ServiceAccountCredentials created from a JSON key file that you can generate and download from your service account page.
from httplib2 import Http
from apiclient.discovery import build
from oauth2client.service_account import ServiceAccountCredentials

scopes = ['https://www.googleapis.com/auth/fusiontables']
keyfile = 'PATH TO YOUR SERVICE ACCOUNT KEY FILE'
FTID = 'FUSION TABLE ID'

credentials = ServiceAccountCredentials.from_json_keyfile_name(keyfile, scopes)
http_auth = credentials.authorize(Http())  # no App Engine memcache available outside GAE
service = build('fusiontables', 'v2', http=http_auth)

def insert(title, description):
    sqlInsert = "INSERT INTO {0} (Title,Description) values('{1}','{2}')".format(FTID, title, description)
    service.query().sql(sql=sqlInsert).execute()
Refer to Google's page on service accounts for explanations.