I am trying to write a file from an Azure Synapse Notebook to ADLS Gen2 while authenticating with the account key.
When I use Python and the DataLakeServiceClient, I can authenticate via the key and write a file without a problem. If I try to authenticate with the same key for Spark, I get java.nio.file.AccessDeniedException: Operation failed: "This request is not authorized to perform this operation using this permission.", 403, PUT.
With PySpark and authorization with the account key [NOT WORKING]:
myaccountname = ""
account_key = ""
spark.conf.set(f"fs.azure.account.key.{myaccountname}.dfs.core.windows.net", account_key)
dest_container = "container_name"
dest_storage_name = "storage_name"
destination_storage = f"abfss://{dest_container}@{dest_storage_name}.dfs.core.windows.net"
df.write.mode("append").parquet(destination_storage + "/raw/myfile.parquet")
But I can write a file with Python and the DataLakeServiceClient and also authorization with the account key [WORKING]:
from azure.storage.filedatalake import DataLakeServiceClient
# DAP ADLS configurations
storage_name = ""
account_key = ""
container_name = ""
directory_name = ""
file_name = ""
file_content = b""
service_client = DataLakeServiceClient(account_url=f"https://{storage_name}.dfs.core.windows.net", credential=account_key)
file_system_client = service_client.get_file_system_client(container_name)
dir_client = file_system_client.get_directory_client(directory_name)
dir_client.create_directory()
file_client = dir_client.get_file_client(file_name)
file_client.create_file()
file_client.append_data(file_content, offset=0, length=len(file_content))
file_client.flush_data(len(file_content))
What am I doing wrong? I was under the impression that setting the account key for the storage URL via spark.conf.set would be enough.
I finally solved it by using a LinkedService. In the LinkedService I used the AccountKey (retrieved from a KeyVault).
For some reason, direct authentication with the account key in the code did not work in the Synapse Notebook, despite the user having all required permissions.
UPDATE: According to Microsoft's third-level tech support, authentication with an account key from within Synapse is not possible (!!!). You HAVE to use their LinkedServices.
If anyone else needs to authenticate:
linkedServiceName_var = "my_linked_service_name"
spark.conf.set("fs.azure.account.auth.type", "SAS")
spark.conf.set("fs.azure.sas.token.provider.type", "com.microsoft.azure.synapse.tokenlibrary.LinkedServiceBasedSASProvider")
spark.conf.set("spark.storage.synapse.linkedServiceName", linkedServiceName_var)
raw_container_name = "my_container"
raw_storageaccount_name = "my_storage_account"
CONNECTION_STR = f"abfs://{raw_container_name}@{raw_storageaccount_name}.dfs.core.windows.net"
my_df = spark.read.parquet(CONNECTION_STR + "/" + filepath)
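One detail that is easy to get wrong in these snippets is the ABFS(S) URL layout itself: the container and account are joined with @. A tiny stdlib helper (purely illustrative, not part of any Azure API) makes the structure explicit:

```python
def abfss_url(container: str, account: str, path: str = "") -> str:
    # abfss URLs take the form abfss://<container>@<account>.dfs.core.windows.net/<path>
    base = f"abfss://{container}@{account}.dfs.core.windows.net"
    return f"{base}/{path.lstrip('/')}" if path else base

print(abfss_url("my_container", "my_storage_account", "raw/myfile.parquet"))
# abfss://my_container@my_storage_account.dfs.core.windows.net/raw/myfile.parquet
```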
--Update
Can you double-check that you, or the user running this, have ADLS Gen2 access and the right permissions (Contributor role on the subscription, Storage Blob Data Owner at the storage account level, or Storage Blob Data Contributor for the service principal in the scope of the Data Lake Storage Gen2 storage account), depending on your setup?
Make sure you have the valid account key copied from the Azure portal.
Just in case....
To enable other users to use the storage account after you create your workspace, you will have to perform the tasks below:
Assign other users to the Contributor role on the workspace
Assign other users to a Workspace, SQL, or Spark admin role using Synapse Studio
Assign yourself and other users to the Storage Blob Data Contributor role on the storage account
Also, if you are using MSI for the Synapse workspace, make sure that you as a user have the same permission level in the notebook.
Going through the official MS docs on Azure Synapse connecting to an Azure storage account:
In case you have set up an account key and secret for the storage account, you can set forwardSparkAzureStorageCredentials to true, in which case the Azure Synapse connector automatically discovers the account access key set in the notebook session configuration or the global Hadoop configuration and forwards the storage account access key to the connected Azure Synapse instance by creating a temporary Azure database scoped credential.
Just add this option to your df.write call:
.option("forwardSparkAzureStorageCredentials", "true")
Related
I am trying to use the documentation at https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/UsingWithRDS.IAMDBAuth.Connecting.Python.html. Right now I am stuck at session = boto3.session(profile_name='RDSCreds'). What is profile_name, and how do I find it for my RDS?
import sys
import os
import boto3
ENDPOINT="mysqldb.123456789012.us-east-1.rds.amazonaws.com"
PORT="3306"
USR="jane_doe"
REGION="us-east-1"
os.environ['LIBMYSQL_ENABLE_CLEARTEXT_PLUGIN'] = '1'
#gets the credentials from .aws/credentials
session = boto3.Session(profile_name='RDSCreds')
client = session.client('rds')
token = client.generate_db_auth_token(DBHostname=ENDPOINT, Port=PORT, DBUsername=USR, Region=REGION)
session = boto3.Session(profile_name='RDSCreds')
profile_name here means the name of the profile you have configured for your AWS CLI.
Usually, when you run aws configure, it creates a default profile. But sometimes users want to manage the AWS CLI with another account's credentials, or direct requests to another region, so they configure a separate profile. See the docs for creating and configuring multiple profiles.
aws configure --profile RDSCreds #enter your access keys for this profile
In case you think you have already created the RDSCreds profile, you can check with: less ~/.aws/config
The documentation you mentioned for RDS using boto3 also says: "The code examples use profiles for shared credentials. For information about the specifying credentials, see Credentials in the AWS SDK for Python (Boto3) documentation."
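If you would rather check for the profile from Python than with less, the CLI's files are plain INI and can be read with the standard library. An illustrative check (note that ~/.aws/credentials uses bare [name] sections, while ~/.aws/config uses [profile name]):

```python
import configparser
from pathlib import Path

def profile_exists(name: str, path: str = "~/.aws/credentials") -> bool:
    # aws configure --profile RDSCreds writes an INI section named [RDSCreds]
    cfg = configparser.ConfigParser()
    cfg.read(Path(path).expanduser())  # yields no sections if the file is missing
    return name in cfg.sections()

print(profile_exists("RDSCreds"))
```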
I would like to register a dataset from ADLS Gen2 in my Azure Machine Learning workspace (azureml-core==1.12.0). Given that service principal information is not required in the Python SDK documentation for .register_azure_data_lake_gen2(), I successfully used the following code to register ADLS gen2 as a datastore:
from azureml.core import Datastore
adlsgen2_datastore_name = os.environ['adlsgen2_datastore_name']
account_name=os.environ['account_name'] # ADLS Gen2 account name
file_system=os.environ['filesystem']
adlsgen2_datastore = Datastore.register_azure_data_lake_gen2(
    workspace=ws,
    datastore_name=adlsgen2_datastore_name,
    account_name=account_name,
    filesystem=file_system
)
However, when I try to register a dataset, using
from azureml.core import Dataset
adls_ds = Datastore.get(ws, datastore_name=adlsgen2_datastore_name)
data = Dataset.Tabular.from_delimited_files((adls_ds, 'folder/data.csv'))
I get an error
Cannot load any data from the specified path. Make sure the path is accessible and contains data.
ScriptExecutionException was caused by StreamAccessException.
StreamAccessException was caused by AuthenticationException.
'AdlsGen2-ReadHeaders' for '[REDACTED]' on storage failed with status code 'Forbidden' (This request is not authorized to perform this operation using this permission.), client request ID <CLIENT_REQUEST_ID>, request ID <REQUEST_ID>. Error message: [REDACTED]
| session_id=<SESSION_ID>
Do I need to enable the service principal to get this to work? Using the ML Studio UI, it appears that the service principal is required even to register the datastore.
Another issue I noticed is that AMLS is trying to access the dataset here:
https://adls_gen2_account_name.**dfs**.core.windows.net/container/folder/data.csv whereas the actual URI in ADLS Gen2 is: https://adls_gen2_account_name.**blob**.core.windows.net/container/folder/data.csv
According to this documentation, you need to enable the service principal.
1. You need to register your application and grant the service principal Storage Blob Data Reader access.
2. Try this code:
adlsgen2_datastore = Datastore.register_azure_data_lake_gen2(
    workspace=ws,
    datastore_name=adlsgen2_datastore_name,
    account_name=account_name,
    filesystem=file_system,
    tenant_id=tenant_id,
    client_id=client_id,
    client_secret=client_secret
)
adls_ds = Datastore.get(ws, datastore_name=adlsgen2_datastore_name)
dataset = Dataset.Tabular.from_delimited_files((adls_ds,'sample.csv'))
print(dataset.to_pandas_dataframe())
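As for the dfs vs blob observation in the question: both hostnames address the same ADLS Gen2 account. The .dfs. host is the Data Lake endpoint that the datastore APIs use, while .blob. is the Blob endpoint shown in some portal views; only the host segment differs. A throwaway stdlib helper (illustrative only, not part of the SDK) shows the mapping:

```python
def to_dfs_endpoint(url: str) -> str:
    # An ADLS Gen2 account is reachable via both <account>.blob.core.windows.net
    # and <account>.dfs.core.windows.net; only the host differs.
    return url.replace(".blob.core.windows.net", ".dfs.core.windows.net", 1)

print(to_dfs_endpoint("https://myaccount.blob.core.windows.net/container/folder/data.csv"))
```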
Here is my problem: I am trying to create a linked service using the Python SDK, and I was successful when I provided the storage account name and key. But I would like to create the linked service with a Key Vault reference. The code below runs fine and creates the linked service; however, when I go to Data Factory and test the connection, it fails. Please help!
store = LinkedServiceReference(reference_name ='LS_keyVault_Dev')
storage_string = AzureKeyVaultSecretReference( store=store, secret_name = 'access_key')
ls_azure_storage = AzureStorageLinkedService(connection_string=storage_string)
ls = adf_client.linked_services.create_or_update(rg_name, df_name, ls_name, ls_azure_storage)
Error Message
Invalid storage connection string provided to 'AzureTableConnection'. Check the storage connection string in configuration. No valid combination of account information found.
I tested your code; it created the linked service successfully, and when I navigate to the portal to test the connection, it also works. You can follow the steps below.
1. Navigate to the Azure Key Vault in the portal -> Secrets -> Create a secret. I'm not sure how you could use access_key as the name of the secret; per my test, it is invalid (underscores are not allowed in secret names). So in my sample I use accesskey as the name of the secret, then store the connection string of the storage account in it.
2. Navigate to the Access policies of the Key Vault and add the MSI of your data factory with the correct secret permission. If you did not enable the MSI of the data factory, follow this link to generate it; it is used by the Azure Key Vault linked service to access your Key Vault secret.
3. Navigate to the Azure Key Vault linked service of your data factory and make sure the connection is successful.
4. Use the code below to create the storage linked service.
Version of the libraries:
azure-common==1.1.23
azure-mgmt-datafactory==0.9.0
from azure.common.credentials import ServicePrincipalCredentials
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import *
subscription_id = '<subscription-id>'
credentials = ServicePrincipalCredentials(client_id='<client-id>', secret='<client-secret>', tenant='<tenant-id>')
adf_client = DataFactoryManagementClient(credentials, subscription_id)
rg_name = '<resource-group-name>'
df_name = 'joyfactory'
ls_name = 'storageLinkedService'
store = LinkedServiceReference(reference_name ='AzureKeyVault1') # AzureKeyVault1 is the name of the Azure Key Vault linked service
storage_string = AzureKeyVaultSecretReference( store=store, secret_name = 'accesskey')
ls_azure_storage = AzureStorageLinkedService(connection_string=storage_string)
ls = adf_client.linked_services.create_or_update(rg_name, df_name, ls_name, ls_azure_storage)
print(ls)
5. Go back to the linked service page, refresh, and test the connection; it works fine.
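Incidentally, the naming problem in step 1 can be checked up front: Key Vault object names may contain only alphanumerics and hyphens, which is why access_key (with an underscore) is rejected. A small illustrative validator, not part of any Azure SDK:

```python
import re

def valid_secret_name(name: str) -> bool:
    # Azure Key Vault secret names: 1-127 characters, alphanumerics and hyphens only
    return re.fullmatch(r"[0-9a-zA-Z-]{1,127}", name) is not None

print(valid_secret_name("access_key"))  # False: underscores are not allowed
print(valid_secret_name("accesskey"))   # True
```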
My login to the AWS console uses MFA, and for that I am using Google Authenticator.
I have an S3 DEV bucket, and to access that DEV bucket I have to switch roles; after switching I can access the DEV bucket.
I need help achieving the same in Python with boto3.
There are many CSV files that I need to open in a dataframe, and without resolving that access I cannot proceed.
I tried configuring AWS credentials and config and using them in my Python code, but it didn't help.
The AWS documentation is not clear about how to switch roles from Python.
import boto3
import s3fs
import pandas as pd

access_key = 'XXXXXXXXXXX'
secret_key = 'XXXXXXXXXXXXXXXXX'
# bucketName = 'XXXXXXXXXXXXXXXXX'
s3 = boto3.resource('s3')
for bucket in s3.buckets.all():
    print(bucket.name)
The expected result is to be able to access that bucket after switching roles in the Python code, along with MFA.
In general, it is bad for security to put credentials in your program code. It is better to store them in a configuration file. You can do this by using the AWS Command-Line Interface (CLI) aws configure command.
Once the credentials are stored this way, any AWS SDK (eg boto3) will automatically retrieve the credentials without having to reference them in code.
See: Configuring the AWS CLI - AWS Command Line Interface
There is an additional capability of the configuration file that allows you to store a role that you wish to assume. This can be done by specifying a profile with the Role ARN:
# In ~/.aws/credentials:
[development]
aws_access_key_id=foo
aws_secret_access_key=bar
# In ~/.aws/config
[profile crossaccount]
role_arn=arn:aws:iam:...
source_profile=development
The source_profile points to the profile that contains credentials that will be used to make the AssumeRole() call, and role_arn specifies the Role to assume.
See: Assume Role Provider
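Since the question involves MFA, note that the same profile mechanism covers it: adding an mfa_serial entry to the assuming profile makes boto3 prompt for the one-time code (e.g. from Google Authenticator) before it calls AssumeRole. The device ARN below is a placeholder:

```ini
# In ~/.aws/config
[profile crossaccount]
role_arn=arn:aws:iam:...
source_profile=development
mfa_serial=arn:aws:iam::123456789012:mfa/my-user
```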
Finally, you can tell boto3 to use that particular profile for credentials:
session = boto3.Session(profile_name='crossaccount')
# Any clients created from this session will use credentials
# from the [crossaccount] section of ~/.aws/credentials.
dev_s3_client = session.client('s3')
An alternative to all the above (which boto3 does for you) is to call assume_role() in your code, then use the temporary credentials that are returned to define a new session that you can use to connect to a service. However, the above method using profiles is a lot easier.
I'm trying to use Apache Libcloud, and reading the documentation on how to use it with Amazon EC2, I'm stuck on a step near the beginning.
On this step:
from libcloud.compute.types import Provider
from libcloud.compute.providers import get_driver
cls = get_driver(Provider.EC2)
driver = cls('temporary access key', 'temporary secret key',
token='temporary session token', region="us-west-1")
You need to pass the temporary access data, and it tells you to read the Amazon documentation, but even though I've read that documentation, it isn't clear to me what I have to do to get my temporary credentials.
The docs say you can interact with the AWS STS API to connect to the endpoint, but I don't understand how you get the credentials. Moreover, in the example on the Libcloud site they use the personal credentials:
ACCESS_ID = 'your access id'
SECRET_KEY = 'your secret key'
So I'm a bit lost. How can I get my temporary credentials to use in my code?
Thanks and regards.
If this code does not run on an EC2 instance I suggest you go with static credentials:
ACCESS_ID = 'your access id'
SECRET_KEY = 'your secret key'
cls = get_driver(Provider.EC2)
driver = cls(ACCESS_ID, SECRET_KEY, region="us-west-1")
To create access credentials:
Sign in to the Identity and Access Management (IAM) console at https://console.aws.amazon.com/iam/.
In the navigation pane, choose Users.
Choose the name of the desired user, and then choose the Security Credentials tab.
If needed, expand the Access Keys section and do any of the following:
Choose Create Access Key and then choose Download Credentials to save the access key ID and secret access key to a CSV file on your computer. Store the file in a secure location. You will not have access to the secret access key again after this dialog box closes. After you have downloaded the CSV file, choose Close.
If you want to run your code from an EC2 machine, you can get temporary credentials by assuming an IAM role using the AWS SDK for Python (https://boto3.readthedocs.io/en/latest/guide/quickstart.html), calling assume_role() on the STS service (https://boto3.readthedocs.io/en/latest/reference/services/sts.html).
@Aker666, from what I have found on the web, you're still expected to use the regular AWS API to obtain this information.
The basic snippet that works for me is:
import boto3
from libcloud.compute.types import Provider
from libcloud.compute.providers import get_driver
boto3.setup_default_session(aws_access_key_id='somekey',aws_secret_access_key='somesecret',region_name="eu-west-1")
sts_client = boto3.client('sts')
assumed_role_object = sts_client.assume_role(
    RoleArn="arn:aws:iam::701********:role/iTerm_RO_from_TGT",
    RoleSessionName='update-cloud-hosts.aviadraviv#Aviads-MacBook-Pro.local'
)
cls = get_driver(Provider.EC2)
driver = cls(assumed_role_object['Credentials']['AccessKeyId'], assumed_role_object['Credentials']['SecretAccessKey'],
token=assumed_role_object['Credentials']['SessionToken'], region="eu-west-1")
nodes = driver.list_nodes()
print(nodes)
Hope this helps anyone.