Need to run Spark Job using Airflow which is connected to Azure - python

I have four files, main.py, jobs.zip, libs.zip and params.yaml, which I have stored in an Azure Storage Account container.
Now I have this code, which builds a payload and tries to run a Spark job using it. The payload contains the location links of these 4 files.
hook = AzureSynapseHook(
    azure_synapse_conn_id=self.azure_synapse_conn_id, spark_pool=self.spark_pool
)
payload = SparkBatchJobOptions(
    name=f"{self.job_name}_{self.app_id}",
    file=f"abfss://{Variable.get('ARTIFACT_BUCKET')}@{Variable.get('ARTIFACT_ACCOUNT')}.dfs.core.windows.net/{self.env}/{SPARK_DIR}/main.py",
    arguments=self.job_args,
    python_files=[
        f"abfss://{Variable.get('ARTIFACT_BUCKET')}@{Variable.get('ARTIFACT_ACCOUNT')}.dfs.core.windows.net/{self.env}/{SPARK_DIR}/jobs.zip",
        f"abfss://{Variable.get('ARTIFACT_BUCKET')}@{Variable.get('ARTIFACT_ACCOUNT')}.dfs.core.windows.net/{self.env}/{SPARK_DIR}/libs.zip",
    ],
    files=[
        f"abfss://{Variable.get('ARTIFACT_BUCKET')}@{Variable.get('ARTIFACT_ACCOUNT')}.dfs.core.windows.net/{self.env}/{SPARK_DIR}/params.yaml"
    ],
)
self.log.info("Executing the Synapse spark job.")
response = hook.run_spark_job(payload=payload)
I have checked that the location links are correct, but when I run this on Airflow it throws an error related to the payload, which seems to be saying that it cannot find some resource that is not present on my Azure account.
A body is sent with the request
[2023-02-02, 12:02:50 UTC] {_universal.py:554} INFO - Response status: 404
Response headers:
'Date': 'Thu, 02 Feb 2023 12:02:00 GMT'
'Content-Length': '0'
'Content-Security-Policy': 'REDACTED'
'Strict-Transport-Security': 'REDACTED'
'Arr-Disable-Session-Affinity': 'REDACTED'
'X-Content-Type-Options': 'REDACTED'
[2023-02-02, 12:02:50 UTC] {taskinstance.py:1768} ERROR - Task failed with exception
Traceback (most recent call last):
File "/usr/local/airflow/dags/operators/spark/__init__.py", line 36, in execute
return self.executor.execute()
File "/usr/local/airflow/dags/operators/spark/azure.py", line 49, in execute
response = hook.run_spark_job(payload=payload)
File "/usr/local/lib/python3.9/site-packages/airflow/providers/microsoft/azure/hooks/synapse.py", line 144, in run_spark_job
job = self.get_conn().spark_batch.create_spark_batch_job(payload)
File "/usr/local/lib/python3.9/site-packages/azure/synapse/spark/operations/_spark_batch_operations.py", line 168, in create_spark_batch_job
map_error(status_code=response.status_code, response=response, error_map=error_map)
File "/usr/local/lib/python3.9/site-packages/azure/core/exceptions.py", line 110, in map_error
raise error
azure.core.exceptions.ResourceNotFoundError: Operation returned an invalid status 'Not Found'
I need to know what exactly this resource is.
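For reference, the ABFSS URI format the payload is targeting is abfss://&lt;container&gt;@&lt;account&gt;.dfs.core.windows.net/&lt;path&gt;. A small sanity-check helper can catch malformed URIs before the request is sent (build_abfss_uri is purely illustrative, not part of the Airflow provider; the names below are placeholders). Note also that, as far as I can tell, a 404 from create_spark_batch_job can equally mean the workspace dev endpoint or Spark pool name in the Airflow connection is wrong, since that batch API is what returned Not Found here.

```python
# Sanity-check sketch (build_abfss_uri is an illustrative helper, not part of
# the Azure provider; container/account names below are placeholders).
import re

def build_abfss_uri(container: str, account: str, *path_parts: str) -> str:
    """Assemble an ABFSS URI: abfss://<container>@<account>.dfs.core.windows.net/<path>."""
    path = "/".join(part.strip("/") for part in path_parts)
    uri = f"abfss://{container}@{account}.dfs.core.windows.net/{path}"
    # Reject empty paths and characters that are invalid in container/account names.
    if not re.fullmatch(
        r"abfss://[a-z0-9](?:[a-z0-9-]*[a-z0-9])?@[a-z0-9]+\.dfs\.core\.windows\.net/.+",
        uri,
    ):
        raise ValueError(f"Malformed ABFSS URI: {uri}")
    return uri

print(build_abfss_uri("artifacts", "mystorageacct", "dev", "spark", "main.py"))
# abfss://artifacts@mystorageacct.dfs.core.windows.net/dev/spark/main.py
```

Running each of the four file URIs through a check like this, and confirming the Synapse connection's host and spark_pool separately, narrows down which resource the 404 refers to.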

Related

Sendgrid HTTP: 400 error using Cloud Composer

I'm trying to set up an Airflow DAG that is able to send emails through the EmailOperator in Composer 2, Airflow 2.3.4. I've followed this guide. I tried running the example DAG that is provided in the guide, but I get an HTTP 400 error.
The log looks like this:
[2023-01-20, 10:46:45 UTC] {taskinstance.py:1904} ERROR - Task failed with exception
Traceback (most recent call last):
File "/opt/python3.8/lib/python3.8/site-packages/airflow/operators/email.py", line 75, in execute
send_email(
File "/opt/python3.8/lib/python3.8/site-packages/airflow/utils/email.py", line 58, in send_email
return backend(
File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/sendgrid/utils/emailer.py", line 123, in send_email
_post_sendgrid_mail(mail.get(), conn_id)
File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/sendgrid/utils/emailer.py", line 142, in _post_sendgrid_mail
response = sendgrid_client.client.mail.send.post(request_body=mail_data)
File "/opt/python3.8/lib/python3.8/site-packages/python_http_client/client.py", line 277, in http_request
self._make_request(opener, request, timeout=timeout)
File "/opt/python3.8/lib/python3.8/site-packages/python_http_client/client.py", line 184, in _make_request
raise exc
python_http_client.exceptions.BadRequestsError: HTTP Error 400: Bad Request
I've looked at similar threads on Stackoverflow but none of those suggestions worked for me.
I have set up and verified the from email address in Sendgrid, and it uses a whole email address including the domain.
I also set this email address up in Secret Manager (as well as the API key).
I haven't changed the test DAG from the guide, except for the 'to' address.
In another DAG I've tried enabling 'email_on_retry' and that also didn't trigger any mail.
I'm at a loss here, can someone provide me with suggestions on things to try?
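One thing worth checking: SendGrid returns the concrete reason for a 400 in the response body, which the Airflow log above doesn't surface. To rule out a malformed payload, the minimal v3 mail body can be sanity-checked like this (a sketch with placeholder addresses; build_mail_payload is just an illustrative helper):

```python
# Illustrative sketch (addresses are placeholders): the minimal SendGrid v3
# /mail/send payload, plus a check for the fields that most often cause a 400
# (a "from" address without a full domain, missing recipients).
def build_mail_payload(from_email: str, to_email: str, subject: str, body: str) -> dict:
    for addr in (from_email, to_email):
        if "@" not in addr or "." not in addr.split("@")[-1]:
            raise ValueError(f"Not a full email address with a domain: {addr!r}")
    return {
        "personalizations": [{"to": [{"email": to_email}]}],
        "from": {"email": from_email},
        "subject": subject,
        "content": [{"type": "text/plain", "value": body}],
    }

payload = build_mail_payload("sender@mydomain.com", "me@example.com", "Test", "Hello")
```

POSTing a payload like this directly to the SendGrid API (outside Airflow) with the API key from Secret Manager lets you read the 400 response body and see which field SendGrid is rejecting.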

I'm trying to use the Facebook API to get lead information on an ad, but I keep getting an error message

So, I'm trying to use Python and the Facebook API to get the lead info for an ad. I'm following the instructions under Bulk Read on this page: https://developers.facebook.com/docs/marketing-api/guides/lead-ads/retrieving. And my code looks like this:
from facebook_business.adobjects.ad import Ad
from facebook_business.adobjects.lead import Lead
from facebook_business.api import FacebookAdsApi
import os

access_token = os.getenv('FB_ACCESS_TOKEN')
app_secret = os.getenv('FB_APP_SECRET')
app_id = os.getenv('FB_APP_ID')
id = '23850504679460681'
FacebookAdsApi.init(access_token=access_token)
fields = [
]
params = {
}
print(Ad(id).get_leads(
    fields=fields,
    params=params,
))
When I run this, I get this error message:
C:\Users\Joseph\Desktop\python_code\Facebook API>python leadget.py
Traceback (most recent call last):
File "C:\Users\Joseph\Desktop\python_code\Facebook API\leadget.py", line 16, in <module>
print(Ad(id).get_leads(
File "C:\Users\Joseph\AppData\Local\Programs\Python\Python310\lib\site-packages\facebook_business\adobjects\ad.py", line 622, in get_leads
return request.execute()
File "C:\Users\Joseph\AppData\Local\Programs\Python\Python310\lib\site-packages\facebook_business\api.py", line 677, in execute
cursor.load_next_page()
File "C:\Users\Joseph\AppData\Local\Programs\Python\Python310\lib\site-packages\facebook_business\api.py", line 841, in load_next_page
response_obj = self._api.call(
File "C:\Users\Joseph\AppData\Local\Programs\Python\Python310\lib\site-packages\facebook_business\api.py", line 350, in call
raise fb_response.error()
facebook_business.exceptions.FacebookRequestError:
Message: Call was not successful
Method: GET
Path: https://graph.facebook.com/v13.0/23850504679460681/leads
Params: {'summary': 'true'}
Status: 400
Response:
{
"error": {
"message": "(#100) Requires pages_manage_ads or leads_retrieval permission to manage the object",
"type": "OAuthException",
"code": 100,
"fbtrace_id": "ALhUN1djYRl_0HTb4RHZuHF"
}
}
I've checked, and both I and the app have those permissions. What am I doing wrong?
Hope it's not too late for you, but it seems you need more permissions to get lead information.
You can grant these in your Facebook developer application.
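To illustrate: the error names the exact scopes it wants, and it's the access token actually used for the call (not just the app or your user) that must carry them. You can compare the scopes granted to the token (for example from a /me/permissions Graph API call) against the required ones; the sample response below is made up for the sketch:

```python
# Sketch of the check: compare the scopes the access token actually carries
# (e.g. from a Graph API "/me/permissions" response) against the ones the
# error demands. The sample response below is illustrative.
REQUIRED = {"pages_manage_ads", "leads_retrieval"}

def missing_scopes(permissions_response: dict) -> set:
    granted = {
        entry["permission"]
        for entry in permissions_response.get("data", [])
        if entry.get("status") == "granted"
    }
    return REQUIRED - granted

sample = {"data": [{"permission": "pages_manage_ads", "status": "granted"},
                   {"permission": "leads_retrieval", "status": "declined"}]}
print(missing_scopes(sample))  # {'leads_retrieval'}
```

If a required scope shows up as declined or absent, regenerate the token with that scope selected in the Graph API Explorer or your app's login flow.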

DeserializationError: Cannot deserialize content-type: text/plain

I am new to Azure Functions and I have created a Timer Trigger Azure Function to retrieve the list of repositories from Azure Container Registry, using the Azure Python SDK on Python 3.6.
This code executes properly in my local VS Code setup, but when I deploy it to Azure and run it, I get a DeserializationError: Cannot deserialize content-type: text/plain error.
Here is a summary of what I tried till now:
import logging

import azure.functions as func
from azure.containerregistry import ContainerRegistryClient
from azure.identity import DefaultAzureCredential

account_url = "https://myregistry.azurecr.io"  # the ACR login server URL

container_registry_client = ContainerRegistryClient(
    account_url,
    DefaultAzureCredential()
    # DefaultAzureCredential(logging_enable=True)
)

def main(mytimer: func.TimerRequest) -> None:
    logging.info("============================Start=================================")
    # Here I am able to print the container registry client, which is not null,
    # so I assume my Azure Function is able to connect to my ACR
    logging.info(dir(container_registry_client))
    # Nothing is printed from here onwards, so I assume the issue is here or in the for loop
    repository_names = container_registry_client.list_repository_names()
    for repository_name in repository_names:
        logging.info("ACR Repository Names: " + str(repository_name))
    logging.info("============================End=================================")
Full Error stack:
Ran into a deserialization error. Ignoring since this is failsafe deserialization
Traceback (most recent call last):
File "/home/site/wwwroot/.python_packages/lib/python3.6/site-packages/msrest/serialization.py", line 1501, in failsafe_deserialize
return self(target_obj, data, content_type=content_type)
File "/home/site/wwwroot/.python_packages/lib/python3.6/site-packages/msrest/serialization.py", line 1367, in __call__
data = self._unpack_content(response_data, content_type)
File "/home/site/wwwroot/.python_packages/lib/python3.6/site-packages/msrest/serialization.py", line 1541, in _unpack_content
raw_data.headers
File "/home/site/wwwroot/.python_packages/lib/python3.6/site-packages/msrest/pipeline/universal.py", line 226, in deserialize_from_http_generics
return cls.deserialize_from_text(body_bytes, content_type)
File "/home/site/wwwroot/.python_packages/lib/python3.6/site-packages/msrest/pipeline/universal.py", line 203, in deserialize_from_text
raise DeserializationError("Cannot deserialize content-type: {}".format(content_type))
msrest.exceptions.DeserializationError: Cannot deserialize content-type: text/plain
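Note that "Ignoring since this is failsafe deserialization" suggests this is a warning the SDK logs while trying to parse an error response; the real failure is whatever HTTP error the registry returned (in a Function this is often the managed identity lacking pull rights on the registry). For debugging, I could wrap the iteration so the underlying exception gets logged instead of the invocation silently ending; failing_pager below just simulates list_repository_names() raising:

```python
# Diagnostic sketch: surface the real exception behind the "failsafe
# deserialization" warning by logging it explicitly. failing_pager is a
# stand-in for container_registry_client.list_repository_names().
import logging

def log_repositories(pager) -> list:
    names = []
    try:
        for name in pager:
            logging.info("ACR Repository Name: %s", name)
            names.append(name)
    except Exception as exc:  # e.g. an auth error from the registry
        logging.error("Listing repositories failed: %r", exc)
        raise
    return names

def failing_pager():
    yield "repo-a"
    raise RuntimeError("simulated 401 from the registry")

try:
    log_repositories(failing_pager())
except RuntimeError as exc:
    print(exc)  # simulated 401 from the registry
```

If the logged exception does turn out to be an auth failure, the Function's system-assigned identity typically needs the AcrPull role on the registry.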

Firestore client in python (as user) using firebase_admin or google.cloud.firestore

I am building a python client-side application that uses Firestore. I have successfully used Google Identity Platform to sign up and sign in to the Firebase project, and created a working Firestore client using google.cloud.firestore.Client which is authenticated as a user:
import json
import requests
from requests.exceptions import HTTPError
import google.oauth2.credentials
from google.cloud import firestore

request_url = f"https://identitytoolkit.googleapis.com/v1/accounts:signInWithPassword?key={self.__api_key}"
headers = {"Content-Type": "application/json; charset=UTF-8"}
data = json.dumps({"email": self.__email, "password": self.__password, "returnSecureToken": True})
response = requests.post(request_url, headers=headers, data=data)
try:
    response.raise_for_status()
except (HTTPError, Exception):
    content = response.json()
    error = f"error: {content['error']['message']}"
    raise AuthError(error)
json_response = response.json()
self.__token = json_response["idToken"]
self.__refresh_token = json_response["refreshToken"]
credentials = google.oauth2.credentials.Credentials(
    self.__token,
    self.__refresh_token,
    client_id="",
    client_secret="",
    token_uri=f"https://securetoken.googleapis.com/v1/token?key={self.__api_key}",
)
self.__db = firestore.Client(self.__project_id, credentials)
I have the problem, however, that when the token has expired, I get the following error:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/google/api_core/grpc_helpers.py", line 57, in error_remapped_callable
return callable_(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/grpc/_channel.py", line 826, in __call__
return _end_unary_response_blocking(state, call, False, None)
File "/usr/local/lib/python3.7/dist-packages/grpc/_channel.py", line 729, in _end_unary_response_blocking
raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNAUTHENTICATED
details = "Missing or invalid authentication."
debug_error_string = "{"created":"#1613043524.699081937","description":"Error received from peer ipv4:172.217.16.74:443","file":"src/core/lib/surface/call.cc","file_line":1055,"grpc_message":"Missing or invalid authentication.","grpc_status":16}"
>
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/lib/python3.7/threading.py", line 917, in _bootstrap_inner
self.run()
File "/home/my_app/src/controllers/im_alive.py", line 20, in run
self.__device_api.set_last_updated(utils.device_id())
File "/home/my_app/src/api/firestore/firestore_device_api.py", line 21, in set_last_updated
"lastUpdatedTime": self.__firestore.SERVER_TIMESTAMP
File "/home/my_app/src/api/firestore/firestore.py", line 100, in update
ref.update(data)
File "/usr/local/lib/python3.7/dist-packages/google/cloud/firestore_v1/document.py", line 382, in update
write_results = batch.commit()
File "/usr/local/lib/python3.7/dist-packages/google/cloud/firestore_v1/batch.py", line 147, in commit
metadata=self._client._rpc_metadata,
File "/usr/local/lib/python3.7/dist-packages/google/cloud/firestore_v1/gapic/firestore_client.py", line 1121, in commit
request, retry=retry, timeout=timeout, metadata=metadata
File "/usr/local/lib/python3.7/dist-packages/google/api_core/gapic_v1/method.py", line 145, in __call__
return wrapped_func(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/google/api_core/retry.py", line 286, in retry_wrapped_func
on_error=on_error,
File "/usr/local/lib/python3.7/dist-packages/google/api_core/retry.py", line 184, in retry_target
return target()
File "/usr/local/lib/python3.7/dist-packages/google/api_core/timeout.py", line 214, in func_with_timeout
return func(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/google/api_core/grpc_helpers.py", line 59, in error_remapped_callable
six.raise_from(exceptions.from_grpc_error(exc), exc)
File "<string>", line 3, in raise_from
google.api_core.exceptions.Unauthenticated: 401 Missing or invalid authentication.
I have tried omitting the token and only specifying the refresh token, and then calling credentials.refresh(), but the expires_in in the response from the https://securetoken.googleapis.com/v1/token endpoint is a string instead of a number (docs here), which makes _parse_expiry(response_data) in google.oauth2._client.py:257 raise an exception.
Is there any way to use the firestore.Client from either google.cloud or firebase_admin and have it automatically handle refreshing tokens, or do I need to switch to manually calling the Firestore RPC API and refreshing tokens at the correct time?
Note: There are no users interacting with the python app, so the solution must not require user interaction.
Can't you just cast the string to an integer, _parse_expiry(int(float(response_data)))?
If that doesn't work, you could make the call and refresh the token after getting a 401 error; see my answer for the general idea of how to handle tokens.
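The refresh-on-expiry idea can be sketched like this. parse_refresh_response is a hypothetical helper, and the sample response mirrors the securetoken.googleapis.com /v1/token docs mentioned in the question, where expires_in comes back as a string rather than a number:

```python
# Sketch of the manual refresh. parse_refresh_response is a hypothetical
# helper; the sample dict mirrors the securetoken.googleapis.com /v1/token
# response, where "expires_in" is a *string* per the docs.
from datetime import datetime, timedelta

def parse_refresh_response(response_data: dict) -> tuple:
    token = response_data["id_token"]
    refresh_token = response_data["refresh_token"]
    # Coerce the string "3600" to an int before computing expiry,
    # sidestepping the _parse_expiry issue described above.
    expires_in = int(float(response_data["expires_in"]))
    expiry = datetime.utcnow() + timedelta(seconds=expires_in)
    return token, refresh_token, expiry

sample = {"id_token": "eyJ...", "refresh_token": "AEu...", "expires_in": "3600"}
token, refresh, expiry = parse_refresh_response(sample)
```

Refreshing shortly before the computed expiry (or on the first 401) and rebuilding the credentials object keeps the client usable without user interaction.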
As mentioned by @Marco, it is recommended that you use a service account if it's going to be used in an environment without a user. When you use a service account, you can just set the GOOGLE_APPLICATION_CREDENTIALS environment variable to the location of the service account json file and instantiate the Firestore client without any credentials (the credentials will be picked up automatically):
from google.cloud import firestore

client = firestore.Client()
and run it as (assuming Linux):
$ export GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json
$ python file.py
Still, if you really want to use user credentials for the script, you can install the Google Cloud SDK, then:
$ gcloud auth application-default login
This will open a browser for you to select an account and log in. After logging in, it creates a "virtual" service account file corresponding to your user account (which will also be loaded automatically by clients). Here too, you don't need to pass any parameters to your client.
See also: Difference between “gcloud auth application-default login” and “gcloud auth login”

ML-engine fails from composer -Unknown name "python-version"

I'm trying to launch an ml-engine jobs submit training job from Cloud Composer, using this guide for instructions: recommendation-system-tensorflow-deploy.
I'm using a plugin which Google created (see the implementation here).
I'm trying to make it work with Python version 3.5 by changing line 206 from:
training_request = {
    'jobId': job_id,
    'trainingInput': {
        'scaleTier': self._scale_tier,
        'packageUris': self._package_uris,
        'pythonModule': self._training_python_module,
        'region': self._region,
        'args': self._training_args,
        'masterType': self._master_type
    }
}
To:
training_request = {
    'jobId': job_id,
    'trainingInput': {
        'scaleTier': self._scale_tier,
        'packageUris': self._package_uris,
        'pythonModule': self._training_python_module,
        'region': self._region,
        'args': self._training_args,
        'masterType': self._master_type,
        'python-version': '3.5'  # self._python_version
    }
}
I also tried adding the runtime version (runtime-version='1.12'), but I keep getting the following error:
[2019-01-20 11:58:36,331] {models.py:1594} ERROR - <HttpError 400 when requesting https://ml.googleapis.com/v1/projects/hallowed-forge-577/jobs?alt=json returned "Invalid JSON payload received. Unknown name "python-version" at 'job.training_input': Cannot find field.">
Traceback (most recent call last):
File "/usr/local/lib/airflow/airflow/models.py", line 1492, in _run_raw_task
result = task_copy.execute(context=context)
File "/home/airflow/gcs/plugins/ml_engine_plugin.py", line 241, in execute
self._project_id, training_request, check_existing_job)
File "/home/airflow/gcs/plugins/ml_engine_plugin.py", line 79, in create_job
request.execute()
File "/usr/local/lib/python3.6/site-packages/oauth2client/util.py", line 135, in positional_wrapper
return wrapped(*args, **kwargs)
File "/usr/local/lib/python3.6/site-packages/googleapiclient/http.py", line 838, in execute
raise HttpError(resp, content, uri=self.uri)
googleapiclient.errors.HttpError: <HttpError 400 when requesting https://ml.googleapis.com/v1/projects/hallowed-forge-577/jobs?alt=json returned "Invalid JSON payload received. Unknown name "python-version" at 'job.training_input': Cannot find field.">
[2019-01-20 11:58:36,334] {models.py:1623} INFO - Marking task as FAILED.
[2019-01-20 11:58:36,513] {models.py:1627} ERROR - Failed to send email to: ['airflow@example.com']
[2019-01-20 11:58:36,516] {models.py:1628} ERROR - HTTP Error 401: Unauthorized
Traceback (most recent call last):
File "/usr/local/lib/airflow/airflow/models.py", line 1625, in handle_failure
self.email_alert(error, is_retry=False)
File "/usr/local/lib/airflow/airflow/models.py", line 1778, in email_alert
send_email(task.email, title, body)
File "/usr/local/lib/airflow/airflow/utils/email.py", line 44, in send_email
return backend(to, subject, html_content, files=files, dryrun=dryrun, cc=cc, bcc=bcc, mime_subtype=mime_subtype)
File "/usr/local/lib/airflow/airflow/contrib/utils/sendgrid.py", line 116, in send_email
_post_sendgrid_mail(mail.get())
File "/usr/local/lib/airflow/airflow/contrib/utils/sendgrid.py", line 122, in _post_sendgrid_mail
response = sg.client.mail.send.post(request_body=mail_data)
File "/usr/local/lib/python3.6/site-packages/python_http_client/client.py", line 252, in http_request
return Response(self._make_request(opener, request, timeout=timeout))
File "/usr/local/lib/python3.6/site-packages/python_http_client/client.py", line 176, in _make_request
raise ex
python_http_client.exceptions.UnauthorizedError: HTTP Error 401: Unauthorized
Notice that the Python version actually changes (to 3.6 from the original 2.7), so changing the Python version does something, but then it gets stuck.
Any help on what I'm missing here would be awesome!
It seems the example uses an old version of the Airflow MLEngineTrainingOperator.
The latest version implements the runtime_version/python_version training params.
Use the current version:
mlengine_operator.py
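For reference, the ML Engine REST API expects camelCase field names inside trainingInput, which is why "python-version" is rejected as an unknown name. If you do keep building the request dict by hand, it would look roughly like this (all values below are placeholders):

```python
# Hedged sketch: the ML Engine (AI Platform) REST API uses camelCase names in
# trainingInput, so a hand-built request would look roughly like this.
# All values are placeholders; masterType requires the CUSTOM scale tier.
training_request = {
    "jobId": "recommendations_training_001",
    "trainingInput": {
        "scaleTier": "CUSTOM",
        "masterType": "standard",
        "packageUris": ["gs://my-bucket/trainer-0.1.tar.gz"],
        "pythonModule": "trainer.task",
        "region": "us-central1",
        "args": ["--train-files", "gs://my-bucket/data.csv"],
        "runtimeVersion": "1.12",  # not runtime-version
        "pythonVersion": "3.5",    # not python-version
    },
}
```

The current operator accepts these as the runtime_version and python_version constructor arguments and maps them to the camelCase JSON fields for you.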
