Error trying to use PyAthena to access an Athena database - python

I'm currently trying to build a data pipeline from an AWS Athena database so my team can query information using Python. However, I'm running into an issue with insufficient permissions.
We are able to query the data in Tableau, but we wanted to integrate it into an app we are developing.
Here is the code we followed from PyAthena's documentation.
from pyathena import connect
import pandas as pd

conn = connect(aws_access_key_id='YOUR_ACCESS_KEY_ID',
               aws_secret_access_key='YOUR_SECRET_ACCESS_KEY',
               s3_staging_dir='s3://YOUR_S3_BUCKET/path/to/',
               region_name='us-west-2')
df = pd.read_sql("SELECT * FROM many_rows", conn)
print(df.head())
Here is the resulting error.
OperationalError: Insufficient permissions to execute the query. User: arn:aws:iam::OUR_ADDRESS:user/USER is not authorized to perform: glue:GetTable on resource: arn:aws:glue:us-west-2:OUR_ADDRESS:table/default/OUR_DATABASE
I'm guessing that this is an issue with IAM permissions on the server side with respect to AWS Glue, but I'm not sure how to resolve it.
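Since the error names the missing action explicitly (glue:GetTable on the Glue Data Catalog that Athena reads table metadata from), the fix is on the IAM side rather than in the Python code. As a rough, hedged sketch only: an account administrator could attach an inline policy with the Glue read actions to the user, for example via boto3. The policy name, user name, and the broad "Resource": "*" below are illustrative assumptions; scope them to your own catalog, database, and tables.
import json
import boto3

# hedged sketch: grant the Glue read actions that Athena needs for its Data Catalog lookups
iam = boto3.client('iam')
glue_read_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": [
            "glue:GetDatabase",
            "glue:GetDatabases",
            "glue:GetTable",
            "glue:GetTables",
            "glue:GetPartition",
            "glue:GetPartitions"
        ],
        "Resource": "*"  # assumption: tighten to your catalog/database/table ARNs
    }]
}
iam.put_user_policy(UserName='USER',                  # the IAM user from the error message
                    PolicyName='athena-glue-read',    # hypothetical policy name
                    PolicyDocument=json.dumps(glue_read_policy))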


Error connecting to Cosmos DB from Databricks: AnalysisException: Catalog 'cosmoscatalog' not found

I'm trying to connect to a Cosmos DB so I can write data to it.
I've followed the documentation from start to end.
My Databricks cluster runtime version is:
11.3 LTS
I have installed the Cosmos DB Spark connector:
com.azure.cosmos.spark:azure-cosmos-spark_3-3_2-12:4.15.0 per the documentation.
I have the following code:
cosmosEndpoint = "MyEndpoint"
cosmosMasterKey = "MyMasterKey"
cosmosDatabaseName = "SampleDB"
cosmosContainerName = "testContainer"
#Configure Catalog Api to be used
spark.conf.set("spark.sql.catalog.cosmosCatalog", "com.azure.cosmos.spark.CosmosCatalog") spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.accountEndpoint", cosmosEndpoint) spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.accountKey", cosmosMasterKey)
#create an Azure Cosmos DB database using catalog api
spark.sql("CREATE DATABASE IF NOT EXISTS cosmosCatalog.{};".format(cosmosDatabaseName))
#create an Azure Cosmos DB container using catalog api
spark.sql("CREATE TABLE IF NOT EXISTS cosmosCatalog.{}.{} using cosmos.oltp TBLPROPERTIES(partitionKeyPath = '/id', manualThroughput = '1100')".format(cosmosDatabaseName, cosmosContainerName)) `
I get the error AnalysisException: Catalog 'cosmoscatalog' not found. Does anyone know why this error occurs?
When connecting the Cosmos DB SQL API to Databricks, make sure the connector library is installed on the cluster correctly; most of the time this error occurs because the library is not installed properly.
For the connection, provide the appropriate credentials from your Cosmos DB account.
Code that I tried for connecting Cosmos Db SQL API to Databricks:
cosmosEndpoint1 = "Endpoints"
cosmosMasterKey1 = "masterkey"
cosmosDatabaseName1 = "demo3DB"
cosmosContainerName1 = "demo3Container"
#Configure Catalog Api to be used
spark.conf.set("spark.sql.catalog.cosmosCatalog", "com.azure.cosmos.spark.CosmosCatalog")
spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.accountEndpoint", cosmosEndpoint1)
spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.accountKey", cosmosMasterKey1)
#create an Azure Cosmos DB database using catalog api
spark.sql("CREATE DATABASE IF NOT EXISTS cosmosCatalog.{};".format(cosmosDatabaseName1))
#create an Azure Cosmos DB container using catalog api
spark.sql("CREATE TABLE IF NOT EXISTS cosmosCatalog.{}.{} using cosmos.oltp TBLPROPERTIES(partitionKeyPath = '/id', manualThroughput = '1100')".format(cosmosDatabaseName1, cosmosContainerName1))

How to connect to Cloud SQL Server via an external Python program?

So I am trying to communicate with a Google Cloud SQL Server instance that I have created, from an external Python program written in VS Code, but I don't know where to begin. Any help would be useful.
I'd recommend using the Cloud SQL Python Connector to manage your connections to Cloud SQL. It supports the pytds driver and should help resolve your trouble connecting to a SQL Server instance from a Python application.
from google.cloud.sql.connector import connector
import sqlalchemy

# configure Cloud SQL Python Connector properties
def getconn():
    conn = connector.connect(
        "PROJECT:REGION:INSTANCE",
        "pytds",
        user="YOUR_USER",
        password="YOUR_PASSWORD",
        db="YOUR_DB"
    )
    return conn

# create connection pool to re-use connections
pool = sqlalchemy.create_engine(
    "mssql+pytds://localhost",
    creator=getconn,
)

# query or insert into Cloud SQL database
with pool.connect() as db_conn:
    # query database
    result = db_conn.execute("SELECT * from my_table").fetchall()
    # Do something with the results
    for row in result:
        print(row)
For more detailed examples and additional params refer to the README of the repository.
I think you can also take inspiration from this Python (Django) sample: "Run the app on your local computer".

How to read from an Azure Data Warehouse into Python using Databricks?

I tested some sample code that I found online. This is it.
import pandas as pd
import pypyodbc
conn = "DRIVER={SQL Server Native Client 11.0};SERVER=name.database.windows.net;DATABASE=db_name;UID=my_id;PWD=my_pwd"
df = pd.read_sql_query('select * from dbo.main_table', conn)
So, I would expect this to import the dataset that sits in SQL Server into the dataframe, but I'm getting this error.
ArgumentError: Could not parse rfc1738 URL from string 'DRIVER={SQL Server Native Client 11.0};SERVER=etc., etc., etc.
I'm working in a Databricks environment. Thanks for the look.
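A hedged guess, since no answer survived here: pd.read_sql_query is handing the bare ODBC string to SQLAlchemy as if it were a URL, which is what the rfc1738 complaint means. Passing an actual pypyodbc connection object instead of the raw string is one likely fix; the server, database, and credential values below are the same placeholders as in the question, and the named ODBC driver must of course be present on the Databricks cluster.
import pandas as pd
import pypyodbc

# open a real DBAPI connection rather than passing the raw ODBC string to pandas
conn = pypyodbc.connect(
    "DRIVER={SQL Server Native Client 11.0};"
    "SERVER=name.database.windows.net;"
    "DATABASE=db_name;UID=my_id;PWD=my_pwd"
)
df = pd.read_sql_query('select * from dbo.main_table', conn)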

Error connecting to Azure SQL Database from Azure Machine Learning Service using Python

I am trying to connect to an Azure SQL Database from the Azure Machine Learning service, but I get the error below.
Here is the error:
('IM002', '[IM002] [unixODBC][Driver Manager]Data source name not found and no default driver specified (0) (SQLDriverConnect)')
Here is the code I have used for the database connection:
import sys
import pyodbc

class DbConnect:
    # This class is used for Azure database connection using pyodbc
    def __init__(self):
        try:
            self.sql_db = pyodbc.connect('SERVER=<servername>;PORT=1433;DATABASE=<databasename>;UID=<username>;PWD=<password>')
            get_name_query = "select name from contacts"
            names = self.sql_db.execute(get_name_query)
            for name in names:
                print(name)
        except Exception as e:
            print("Error in azure sql server database connection : ", e)
            sys.exit()

if __name__ == "__main__":
    class_obj = DbConnect()
Is there any way to solve the above error? Please let me know if there is any way.
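For what it's worth, and only as a hedged sketch rather than part of the answers below: the IM002 message usually means the connection string never names an ODBC driver, so the unixODBC Driver Manager has nothing to load. If you stay with pyodbc, a driver-qualified string along these lines may be needed, assuming the Microsoft ODBC driver is installed on the compute (the driver version and the placeholders are assumptions).
import pyodbc

# name the driver explicitly so the Driver Manager knows which library to load
sql_db = pyodbc.connect(
    'DRIVER={ODBC Driver 17 for SQL Server};'
    'SERVER=<servername>,1433;'
    'DATABASE=<databasename>;'
    'UID=<username>;PWD=<password>'
)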
I'd consider using azureml.dataprep over pyodbc for this task (the API may change, but this worked last time I tried):
import azureml.dataprep as dprep

ds = dprep.MSSQLDataSource(server_name=<server-name,port>,
                           database_name=<database-name>,
                           user_name=<username>,
                           password=<password>)
You should then be able to collect the result of an SQL query in pandas e.g. via
dataflow = dprep.read_sql(ds, "SELECT top 100 * FROM [dbo].[MYTABLE]")
dataflow.to_pandas_dataframe()
Alternatively you can create SQL datastore and create a dataset from the SQL datastore.
Learn how:
https://learn.microsoft.com/en-us/azure/machine-learning/service/how-to-create-register-datasets#create-tabulardatasets
Sample code:
from azureml.core import Dataset, Datastore
# create tabular dataset from a SQL database in datastore
sql_datastore = Datastore.get(workspace, 'mssql')
sql_ds = Dataset.Tabular.from_sql_query((sql_datastore, 'SELECT * FROM my_table'))
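A small addition that is not in the original answer: the resulting TabularDataset can then be pulled into pandas in the usual way.
df = sql_ds.to_pandas_dataframe()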
@AkshayGodase Any particular reason that you want to use pyodbc?

Pandas_GBQ query receiving 504 error from BQ

I am reading a table [200,000 rows; 40 columns] from BigQuery in a Flask application (using the Pandas GBQ library); the goal is to display some summary data as HTML tables in the UI.
An upstream server is returning a 504 (gateway timeout) error response when I attempt to read this table from my Flask function; however, the same query succeeds when I run it from the cloud console.
I have tried reading in chunks (chunksize), but this has not solved the issue.
import pandas as pd
import pandas_gbq as pq  # assumption: pq refers to the pandas-gbq library named in the title

# inside the Flask view, where `request` is available
project_id = "<projectid>"
bq_table = request.form.get('bq_table')
query = "SELECT * FROM `<dataset>.{bq_table}`;".format(bq_table=bq_table)
df = pd.DataFrame(pq.read_gbq(query, project_id, dialect='standard'))
I am expecting the query to run synchronously (not ideal, I know, but this is for an ad-hoc DQ tool).
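One hedged way around the 504, sketched here rather than taken from the thread: start the BigQuery job in one request and fetch the results in a later one, so the Flask handler never blocks for the full query. The routes, the google-cloud-bigquery client usage, and the HTML rendering below are illustrative assumptions; the project and dataset placeholders are the same ones as above.
from flask import Flask, jsonify
from google.cloud import bigquery

app = Flask(__name__)
client = bigquery.Client(project="<projectid>")

@app.route("/start/<bq_table>")
def start_query(bq_table):
    # kick off the query asynchronously and hand the job id back to the UI
    job = client.query("SELECT * FROM `<dataset>.{}`".format(bq_table))
    return jsonify({"job_id": job.job_id})

@app.route("/result/<job_id>")
def fetch_result(job_id):
    # poll the job; only pull the rows once BigQuery reports it finished
    job = client.get_job(job_id)
    if job.state != "DONE":
        return jsonify({"state": job.state}), 202
    df = job.to_dataframe()
    return df.head().to_html()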
