I'm trying to get an S3 hook in Apache Airflow using the Connection object.
It looks like this:
class S3ConnectionHandler:
    def __init__(self):
        # values are read from a configuration class, which loads from env. variables
        self._s3 = Connection(
            conn_type="s3",
            conn_id=config.AWS_CONN_ID,
            login=config.AWS_ACCESS_KEY_ID,
            password=config.AWS_SECRET_ACCESS_KEY,
            extra=json.dumps({"region_name": config.AWS_DEFAULT_REGION}),
        )

    @property
    def s3(self) -> Connection:
        return get_live_connection(self.logger, self._s3)

    @property
    def s3_hook(self) -> S3Hook:
        return self.s3.get_hook()
I get an error:
Broken DAG: [...] Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/connection.py", line 282, in get_hook
return hook_class(**{conn_id_param: self.conn_id})
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/providers/amazon/aws/hooks/base_aws.py", line 354, in __init__
raise AirflowException('Either client_type or resource_type must be provided.')
airflow.exceptions.AirflowException: Either client_type or resource_type must be provided.
Why does this happen? From what I understand, S3Hook calls the constructor of its parent class, AwsHook, and passes client_type as the string "s3". How can I fix this?
I took this configuration for the hook from here.
EDIT: I even get the same error when directly creating the S3 hook:
@property
def s3_hook(self) -> S3Hook:
    # return self.s3.get_hook()
    return S3Hook(
        aws_conn_id=config.AWS_CONN_ID,
        region_name=self.config.AWS_DEFAULT_REGION,
        client_type="s3",
        config={"aws_access_key_id": self.config.AWS_ACCESS_KEY_ID, "aws_secret_access_key": self.config.AWS_SECRET_ACCESS_KEY},
    )
If you're using Airflow 2,
please refer to the new documentation; it can be kind of tricky, as most Google searches redirect you to the old docs.
In my case I was using AwsHook and had to switch to AwsBaseHook, as it seems to be the only and correct one for version 2. I had to switch the import path as well: the AWS stuff isn't under contrib anymore, it's under providers.
And as you can see in the new documentation, you can pass either client_type or resource_type as an AwsBaseHook parameter, depending on which one you want to use. Once you do that, your problem should be solved.
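For example, a minimal sketch assuming Airflow 2 with the Amazon provider installed and an existing connection called aws_default (the connection id here is just a placeholder):

from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook

# Ask the base hook for a boto3 client of the service you need
hook = AwsBaseHook(aws_conn_id="aws_default", client_type="s3")
s3_client = hook.get_conn()  # a boto3 S3 client, because client_type="s3"

# Or, if you prefer the boto3 resource API instead:
# hook = AwsBaseHook(aws_conn_id="aws_default", resource_type="s3")
# s3_resource = hook.get_conn()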
No other answers worked and I couldn't get around this, so I ended up using the boto3 library directly, which also gave me more low-level flexibility than Airflow's hooks offered.
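For reference, here is a minimal sketch of the direct boto3 route, assuming config is the same configuration object from the question and the bucket/key names are placeholders:

import boto3

# Build the client straight from the credentials the config object already holds
s3 = boto3.client(
    "s3",
    aws_access_key_id=config.AWS_ACCESS_KEY_ID,
    aws_secret_access_key=config.AWS_SECRET_ACCESS_KEY,
    region_name=config.AWS_DEFAULT_REGION,
)

# e.g. upload a local file
s3.upload_file("/tmp/report.csv", "my-bucket", "reports/report.csv")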
First of all, I suggest that you create an S3 connection; for this you must go to Admin >> Connections.
After that, and assuming that you want to load a file into an S3 bucket, you can write:
def load_csv_S3():
    # Send to S3
    hook = S3Hook(aws_conn_id="s3_conn")
    hook.load_file(
        filename='/write_your_path_file/filename.csv',
        key='filename.csv',
        bucket_name="BUCKET_NAME",
        replace=True,
    )
Finally, you can check all the functions of S3Hook 👉 HERE
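For instance, reading the same file back with the same hook would look roughly like this (a sketch; s3_conn, the bucket and the key are just the placeholder names used above):

hook = S3Hook(aws_conn_id="s3_conn")

# Check whether the key exists, then read it back as a string
if hook.check_for_key("filename.csv", bucket_name="BUCKET_NAME"):
    content = hook.read_key(key="filename.csv", bucket_name="BUCKET_NAME")
    print(content)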
What has worked for me, in case it helps someone, is in my answer to a similar post: https://stackoverflow.com/a/73652781/4187360
I am getting an error very similar to the below, but I am not in EU:
Document AI: google.api_core.exceptions.InvalidArgument: 400 Request contains an invalid argument
When I use the raw_document and process a local pdf file, it works fine. However, when I specify a pdf file on a GCS location, it fails.
Error message:
the processor name: projects/xxxxxxxxx/locations/us/processors/f7502cad4bccdd97
the form process request: name: "projects/xxxxxxxxx/locations/us/processors/f7502cad4bccdd97"
inline_document {
uri: "gs://xxxx/temp/test1.pdf"
}
Traceback (most recent call last):
File "C:\Python39\lib\site-packages\google\api_core\grpc_helpers.py", line 66, in error_remapped_callable
return callable_(*args, **kwargs)
File "C:\Python39\lib\site-packages\grpc\_channel.py", line 946, in __call__
return _end_unary_response_blocking(state, call, False, None)
File "C:\Python39\lib\site-packages\grpc\_channel.py", line 849, in _end_unary_response_blocking
raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.INVALID_ARGUMENT
details = "Request contains an invalid argument."
debug_error_string = "{"created":"#1647296055.582000000","description":"Error received from peer ipv4:142.250.80.74:443","file":"src/core/lib/surface/call.cc","file_line":1070,"grpc_message":"Request contains an invalid argument.","grpc_status":3}"
>
Code:
client = documentai.DocumentProcessorServiceClient(client_options=opts)
# The full resource name of the processor, e.g.:
# projects/project-id/locations/location/processor/processor-id
# You must create new processors in the Cloud Console first
name = f"projects/{project_id}/locations/{location}/processors/{processor_id}"
print(f'the processor name: {name}')
# document = {"uri": gcs_path, "mime_type": "application/pdf"}
document = {"uri": gcs_path}
inline_document = documentai.Document()
inline_document.uri = gcs_path
# inline_document.mime_type = "application/pdf"
# Configure the process request
# request = {"name": name, "inline_document": document}
request = documentai.ProcessRequest(
    inline_document=inline_document,
    name=name
)
print(f'the form process request: {request}')
result = client.process_document(request=request)
I do not believe I have permission issues on the bucket since the same set up works fine for a document classification process on the same bucket.
This is a known issue for Document AI, and it is already reported in this issue tracker. Unfortunately the only workaround for now is to either:
Download your file, read the file as bytes and use process_document(). See Document AI local processing for the sample code (a minimal sketch is also shown after this list).
Use batch_process_documents(), since by default it only accepts files from GCS. This is if you don't want to do the extra step of downloading the file.
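Sketch of the first option, assuming google-cloud-storage is installed and reusing the client and name variables from the question (bucket and object names are placeholders):

from google.cloud import documentai, storage

# Download the PDF from GCS as bytes
blob = storage.Client().bucket("your-bucket").blob("temp/test1.pdf")
image_content = blob.download_as_bytes()

# Process the downloaded bytes via raw_document instead of a GCS URI
raw_document = documentai.RawDocument(content=image_content, mime_type="application/pdf")
request = documentai.ProcessRequest(name=name, raw_document=raw_document)
result = client.process_document(request=request)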
This is still an issue 5 months later, and something not mentioned in the accepted answer is (and I could be wrong, but it seems to me) that batch processes can only output their results to GCS. So you'll still incur the extra step of downloading something from a bucket, be it the input document under Option 1 or the result under Option 2. On top of that, you'll end up having to do cleanup in the bucket if you don't want the results there, so under many circumstances Option 2 won't present much of an advantage, other than the fact that the result download will probably be smaller than the input file download.
I'm using the client library in a Python Cloud Function and I'm affected by this issue. I'm implementing Option 1 for the reason that it seems simplest and I'm holding out for the fix. I also considered using the Workflow client library to fire a Workflow that runs a Document AI process, or calling the Document AI REST API, but it's all very suboptimal.
I have a deployment package in the following structure:
my-project.zip
--- my-project.py
------ lambda_handler()
Then I define the handler path in the configuration file:
my-project.lambda_handler
I get the error:
'handler' missing on module
I can't understand why.
There are a few issues that can cause this error.
Issue#1:
The very first issue you're going to run into: if you name the file incorrectly, you get this error:
Unable to import module 'lambda_function': No module named lambda_function
If you name the function incorrectly you get this error:
Handler 'handler' missing on module 'lambda_function_file': 'module'
object has no attribute 'handler'
On the dashboard, make sure the handler field is entered as function_filename.actual_function_name and make sure they match up in your deployment package.
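As a concrete sketch (the file and function names here are only examples): if the deployment package contains lambda_function.py with

# lambda_function.py
def lambda_handler(event, context):
    # echo a simple response back to the caller
    return {"statusCode": 200, "body": "ok"}

then the Handler field must read lambda_function.lambda_handler, i.e. the filename without .py, a dot, then the function name.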
If only the messages were a bit more instructive, that would have been a simpler step.
Resource Link:
No lambda_function?
Issue#2:
adrian_praja solved the issue on the AWS forum. He answered the following:
I believe your index.js should contain:
exports.createThumbnailHandler = function(event, context) {}
Issue#3:
Solution: correctly specify the method to call.
This happens when the method that Node.js should call is specified incorrectly in the Lambda settings.
Please review the specification of the method to call.
In the case of the above error message, I attempted to call the handler method of index.js, but the corresponding method could not be found.
The method to call is set with "Handler" on the configuration tab.
For example, to call the handler method of index.js, the Handler field would be set to index.handler.
Resource Link:
http://qiita.com/kazuqqfp/items/ac8d93918d0030b31aad
AWS Lambda Function is returning Handler 'handler' missing on module 'index'
I had this issue and had to make sure I had a function called handler in my file, e.g.:
# this just takes whatever is sent to the API gateway and sends it back
def handler(event, context):
    try:
        return response(event, 200)
    except Exception as e:
        return response('Error: ' + str(e), 400)

def response(message, status_code):
    return message
I have functions in a local_code.py file that I would like to pass to workers through dask. I've seen answers to questions on here saying that this can be done using the upload_file() function, but I can't seem to get it working because I'm still getting a ModuleNotFoundError.
The relevant part of the code is as follows.
from dask.distributed import Client
from dask_jobqueue import SLURMCluster
from local_code import *

helper_file = '/absolute/path/to/local_code.py'

def main():
    with SLURMCluster(**slurm_params) as cluster:
        cluster.scale(n_workers)
        with Client(cluster) as client:
            client.upload_file(helper_file)
            mapping = client.map(myfunc, data)
            client.gather(mapping)

if __name__ == '__main__':
    main()
Note that myfunc is imported from local_code, and there's no error importing it or passing it to map. The function myfunc also depends on other functions that are defined in local_code.
With this code, I'm still getting this error:
distributed.protocol.pickle - INFO - Failed to deserialize b'\x80\x04\x95+\x00\x00\x00\x00\x00\x00\x00\x8c\x11local_code\x94\x8c\x$
Traceback (most recent call last):
File "/home/gallagher.r/.local/lib/python3.7/site-packages/distributed/protocol/pickle.py", line 61, in loads
return pickle.loads(x)
ModuleNotFoundError: No module named 'local_code'
Using upload_file() seems so straightforward that I'm not sure what I'm doing wrong. I must have it in the wrong place or not be understanding correctly what is passed to it.
I'd appreciate any help with this. Please let me know if you need any other information or if there's anything else I can supply from the error file.
The upload_file method only uploads the file to the currently available workers. If a worker arrives after you call upload_file, then that worker won't have the provided file.
In your situation, the easiest thing to do is probably to wait until all of the workers arrive before you call upload_file:
cluster.scale(n)
with Client(cluster) as client:
    client.wait_for_workers(n)
    client.upload_file(...)
Another option when you have workers going in and out is to use Client.register_worker_callbacks to hook into whenever a new worker is registered/added. The one caveat is that you will need to serialize your file(s) into the callback partial:
import functools

fname = ...
with open(fname, 'rb') as f:
    data = f.read()

# The callback runs on each new worker and re-creates the file there
def _worker_upload(dask_worker, *, data, fname):
    dask_worker.loop.add_callback(
        callback=dask_worker.upload_file,
        comm=None,  # not used
        filename=fname,
        data=data,
        load=True,
    )

client.register_worker_callbacks(
    setup=functools.partial(
        _worker_upload, data=data, fname=fname,
    )
)
This will also upload the file the first time the callback is registered so you can avoid calling client.upload_file entirely.
Trying to work with the echosign SOAP API.
The wsdl is here: https://secure.echosign.com/services/EchoSignDocumentService14?wsdl
When I try to create certain objects, it doesn't appear to be able to find the type, even though it is listed in the print client output:
import suds.client

url = "https://secure.echosign.com/services/EchoSignDocumentService14?wsdl"
client = suds.client.Client(url)
print client
Service ( EchoSignDocumentService14 ) tns="http://api.echosign"
   Prefixes (10)
      ns0 = "http://api.echosign"
      ns1 = "http://dto.api.echosign"
      ns2 = "http://dto10.api.echosign"
      ns3 = "http://dto11.api.echosign"
      ns4 = "http://dto12.api.echosign"
      ns5 = "http://dto13.api.echosign"
      ns15 = "http://dto14.api.echosign"
      ns16 = "http://dto7.api.echosign"
      ns17 = "http://dto8.api.echosign"
      ns18 = "http://dto9.api.echosign"
   Ports (1):
      (EchoSignDocumentService14HttpPort)
         Methods (45):
            ...
         Types (146):
            ns1:CallbackInfo
            ns17:WidgetCreationInfo
Trimmed for brevity, but showing the namespaces and the 2 types I'm concerned with right now.
Trying to run WCI = client.factory.create("ns17:WidgetCreationInfo") generates this error:
client.factory.create("ns17:WidgetCreationInfo")
Traceback (most recent call last):
File "", line 1, in
File "build/bdist.macosx-10.7-intel/egg/suds/client.py", line 244, in create
suds.BuildError:
An error occured while building a instance of (ns17:WidgetCreationInfo). As a result
the object you requested could not be constructed. It is recommended
that you construct the type manually using a Suds object.
Please open a ticket with a description of this error.
Reason: Type not found: '(CallbackInfo, http://dto.api.echosign, )'
So it doesn't appear to be able to find the CallbackInfo type. Maybe it's because the ns is missing there?
Again, I figured it out 15 minutes after posting here.
suds has an option to cross-pollinate all the namespaces so they all import each other's schemas. autoblend can be set in the constructor or using the set_options method.
suds.client.Client(url, autoblend=True)
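Roughly, either form should work (a sketch; url is the WSDL URL from the question):

import suds.client

# Enable autoblend when constructing the client...
client = suds.client.Client(url, autoblend=True)

# ...or turn it on for an existing client
client.set_options(autoblend=True)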
Take a look at the WSDL; it seems there are lots of definitions under http://*.api.echosign that suds cannot fetch.
Either update your /etc/hosts so that these not-well-formed domains can be reached, or save the WSDL locally, modify it, and then use Client('file://...', ...) to create your suds client.
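A rough sketch of the second approach (Python 2 to match the question; the local path is arbitrary):

import urllib2
import suds.client

wsdl_url = "https://secure.echosign.com/services/EchoSignDocumentService14?wsdl"
local_path = "/tmp/EchoSignDocumentService14.wsdl"

# Save the WSDL locally so its schema imports can be edited before suds loads it
with open(local_path, "w") as f:
    f.write(urllib2.urlopen(wsdl_url).read())

client = suds.client.Client("file://" + local_path)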
I set up the required environment for Google Cloud Storage according to the manual.
I have installed "gsutil" and set up all paths.
My gsutil works perfectly, however, when I try to run the code below,
#!/usr/bin/python

import StringIO
import os
import shutil
import tempfile
import time

from oauth2_plugin import oauth2_plugin
import boto

# URI scheme for Google Cloud Storage.
GOOGLE_STORAGE = 'gs'
# URI scheme for accessing local files.
LOCAL_FILE = 'file'

uri = boto.storage_uri('sangin2', GOOGLE_STORAGE)
try:
    uri.create_bucket()
    print "done!"
except boto.exception.StorageCreateError, e:
    print "failed"
It gives "403 Access denied" error.
Traceback (most recent call last):
File "/Volumes/WingIDE-101-4.0.0/WingIDE.app/Contents/MacOS/src/debug/tserver/_sandbox.py", line 23, in <module>
File "/Users/lsangin/gsutil/boto/boto/storage_uri.py", line 349, in create_bucket
return conn.create_bucket(self.bucket_name, headers, location, policy)
File "/Users/lsangin/gsutil/boto/boto/gs/connection.py", line 91, in create_bucket
response.status, response.reason, body)
boto.exception.GSResponseError: GSResponseError: 403 Forbidden
<?xml version='1.0' encoding='UTF-8'?><Error><Code>AccessDenied</Code><Message>Access denied.</Message></Error>
Since I am new to this, it is kinda hard for me to figure out why.
Can someone help me?
Thank you.
The boto library should automatically find and use your $HOME/.boto file. One thing to check: make sure the project you're using is set as your default project for legacy access (at the API console, click on "Storage Access" and verify that it says "This is your default project for legacy access"). When I have that set incorrectly and I follow the create bucket example you referenced, I also get a 403 error. However, it doesn't make sense that this would work for you in gsutil but not with direct use of boto.
Try adding "debug=2" when you instantiate the storage_uri object, like this:
uri = boto.storage_uri(name, GOOGLE_STORAGE, debug=2)
That will generate some additional debugging information on stdout, which you can then compare with the debug output from an analogous, working gsutil example (via gsutil -D mb).