While downloading Azure blobs to the local file system an exception occurs - python

While downloading Azure blobs to the local file system, I'm getting the following exception:
Client-Request-ID=99bdb0e4-2d1c-11e8-8fe6-00155dbf7128 Retry policy did not allow for a retry: Server-Timestamp=Wed, 21 Mar 2018 15:29:09 GMT, Server-Request-ID=1e7ab8f5-101e-0076-5329-c16a24000000, HTTP status code=404, Exception=The specified blob does not exist.
ErrorCode: BlobNotFound<?xml version="1.0" encoding="utf-8"?><Error><Code>BlobNotFound</Code><Message>The specified blob does not exist.
RequestId:1e7ab8f5-101e-0076-5329-c16a24000000Time:2018-03-21T15:29:09.6565984Z</Message></Error>.
Traceback (most recent call last):
File "C:\Program Files\Commvault\ContentStore\Automation\CloudApps\CloudAppsUtils\cahelper.py", line 483, in download_contents_azure
session.get_blob_to_path(container_name,fl,fl)
File "C:\Program Files\Python36\lib\site-packages\azure\storage\blob\baseblobservice.py", line 1817, in get_blob_to_path
timeout)
File "C:\Program Files\Python36\lib\site-packages\azure\storage\blob\baseblobservice.py", line 2003, in get_blob_to_stream
raise ex
File "C:\Program Files\Python36\lib\site-packages\azure\storage\blob\baseblobservice.py", line 1971, in get_blob_to_stream
_context=operation_context)
File "C:\Program Files\Python36\lib\site-packages\azure\storage\blob\baseblobservice.py", line 1695, in _get_blob
operation_context=_context)
File "C:\Program Files\Python36\lib\site-packages\azure\storage\common\storageclient.py", line 354, in _perform_request
raise ex
File "C:\Program Files\Python36\lib\site-packages\azure\storage\common\storageclient.py", line 289, in _perform_request
raise ex
File "C:\Program Files\Python36\lib\site-packages\azure\storage\common\storageclient.py", line 275, in _perform_request
HTTPError(response.status, response.message, response.headers, response.body))
File "C:\Program Files\Python36\lib\site-packages\azure\storage\common\_error.py", line 111, in _http_error_handler
raise AzureHttpError(message, http_error.status)
azure.common.AzureMissingResourceHttpError: The specified blob does not exist.ErrorCode: BlobNotFound
<?xml version="1.0" encoding="utf-8"?><Error><Code>BlobNotFound</Code><Message>The specified blob does not exist.
RequestId:1e7ab8f5-101e-0076-5329-c16a24000000
Time:2018-03-21T15:29:09.6565984Z</Message></Error>
I'm using the following code to download blobs:
def download_contents_azure(self, account_name, account_key, content):
    session=self.create_session_azure(account_name,account_key)
    os.mkdir('in place')
    os.chdir('in place')
    for item in content:
        # s = os.path.basename(item)
        path_to_file = ("/".join(item.strip("/").split('/')[1:]))
        container_name = Path(item).parts[1]
        gen = session.list_blobs(container_name)
        li = []
        for i in gen:
            li.append(i.name)
        if path_to_file in li:
            fl = os.path.basename(path_to_file)
            print(fl)
            c = self.splitall(item)
            for i in range(1,len(c)-1):
                if path.exists(c[i]) is False:
                    os.mkdir(c[i])
                os.chdir(c[i])
            session.get_blob_to_path(container_name,fl,fl)
            for i in range(1,len(c)-1):
                os.chdir("..")
        else:
            c = self.splitall(item)
            for i in range(1,len(c)):
                os.mkdir(c[i])
                os.chdir(c[i])
            generator = session.list_blobs(container_name,path_to_file+'/',delimiter='/')
            for blob in generator:
                bl = os.path.basename(blob.name)
                session.get_blob_to_path(container_name,bl,bl)
I have a path in Azure (/container/folder/subfolder).
I'm trying to download the structure and all the files under the subfolder. The first file under the subfolder gets downloaded, and then I get the above exception. Because of this, I'm unable to loop through and print the next items.
Thoughts?

Double check this line:
session.get_blob_to_path(container_name,fl,fl)
You are passing fl as the blob_name.
The documentation indicates that the second argument is blob_name, while the third is the path on the local file system to download to.
I think your blob name is wrong, hence the "The specified blob does not exist" error.
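As a minimal sketch of the fix (assuming path_to_file, built earlier in your loop, holds the blob's full name within the container), pass the full blob path as the second argument and keep the basename only for the local file name:
# Hypothetical fix: path_to_file is the blob's full name inside the container
# (e.g. "folder/subfolder/file.txt"); fl is only used as the local file name.
fl = os.path.basename(path_to_file)
session.get_blob_to_path(container_name, path_to_file, fl)  # blob_name, then local file path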
NOTE: consider installing Fiddler so you can look at the network traces. It lets you see the raw requests.

Related

Can only upload 1 file to S3 bucket but not other files using python

I am trying to upload the files inside a folder to an S3 bucket, but I cannot seem to upload all of them. Here is my code:
try:
    for folder in os.listdir('raw_data/'):
        for files in os.listdir(f'raw_data/{folder}'):
            if folder == 'malay':
                upload_file_bucket = 'book-reviews-analysis'
                upload_file_key = 'malay/' + str(files)
                client.upload_file(files, upload_file_bucket, upload_file_key)
                logger.info('--DONE UPLOADING FILE TO BUCKET--')
            elif folder == 'english':
                upload_file_bucket = 'book-reviews-analysis'
                upload_file_key = 'english/' + str(files)
                client.upload_file(files, upload_file_bucket, upload_file_key)
                logger.info('--DONE UPLOADING FILE TO BUCKET--')
except ClientError as e:
    print(e)
    logger.error(e)
The weird thing is that it does upload a file from the english folder but not from the 'malay' folder. I get the following error, and I am very certain that the file I want to upload is in that folder.
Traceback (most recent call last):
File "D:/personal project/Book review analysis/book_reviews_analysis/pipeline.py", line 64, in <module>
main()
File "D:/personal project/Book review analysis/book_reviews_analysis/pipeline.py", line 61, in main
upload_to_s3()
File "D:/personal project/Book review analysis/book_reviews_analysis/pipeline.py", line 42, in upload_to_s3
client.upload_file(files, upload_file_bucket, upload_file_key)
File "D:\personal project\Book review analysis\book_reviews_analysis\env\lib\site-packages\boto3\s3\inject.py", line 148, in upload_file
callback=Callback,
File "D:\personal project\Book review analysis\book_reviews_analysis\env\lib\site-packages\boto3\s3\transfer.py", line 288, in upload_file
future.result()
File "D:\personal project\Book review analysis\book_reviews_analysis\env\lib\site-packages\s3transfer\futures.py", line 103, in result
return self._coordinator.result()
File "D:\personal project\Book review analysis\book_reviews_analysis\env\lib\site-packages\s3transfer\futures.py", line 266, in result
raise self._exception
File "D:\personal project\Book review analysis\book_reviews_analysis\env\lib\site-packages\s3transfer\tasks.py", line 269, in _main
self._submit(transfer_future=transfer_future, **kwargs)
File "D:\personal project\Book review analysis\book_reviews_analysis\env\lib\site-packages\s3transfer\upload.py", line 585, in _submit
upload_input_manager.provide_transfer_size(transfer_future)
File "D:\personal project\Book review analysis\book_reviews_analysis\env\lib\site-packages\s3transfer\upload.py", line 244, in provide_transfer_size
self._osutil.get_file_size(transfer_future.meta.call_args.fileobj)
File "D:\personal project\Book review analysis\book_reviews_analysis\env\lib\site-packages\s3transfer\utils.py", line 247, in get_file_size
return os.path.getsize(filename)
File "C:\Users\aliff\AppData\Local\Programs\Python\Python37\lib\genericpath.py", line 50, in getsize
return os.stat(filename).st_size
FileNotFoundError: [WinError 2] The system cannot find the file specified: 'politik_untuk_pemula.CSV'
Process finished with exit code 1
As #ewokx mentioned in their comment:
files doesn't have the path.
With that in mind, try using pathlib.Path, like so:
from pathlib import Path

src_dir = Path("raw_data")
files_coll = src_dir.glob("*/*")
for one_file in files_coll:
    folder = one_file.parent.name
    if folder == "malay":
        upload_file_bucket = 'book-reviews-analysis'
        upload_file_key = 'malay/' + str(one_file.name)
        # Check the boto3 documentation; if upload_file() accepts a Path-like object, the str() is not needed
        client.upload_file(str(one_file), upload_file_bucket, upload_file_key)
        logger.info('--DONE UPLOADING FILE TO BUCKET--')
    elif folder == "english":
        ...
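Since the S3 key prefix is just the parent folder name, the two branches can also be collapsed. A sketch under that assumption, reusing the bucket name, client and logger from the snippet above:
from pathlib import Path

upload_file_bucket = 'book-reviews-analysis'  # bucket name taken from the question
for one_file in Path("raw_data").glob("*/*"):
    if not one_file.is_file():
        continue  # skip stray subdirectories
    # Key becomes "<folder>/<filename>", e.g. "malay/politik_untuk_pemula.CSV"
    upload_file_key = f"{one_file.parent.name}/{one_file.name}"
    client.upload_file(str(one_file), upload_file_bucket, upload_file_key)
    logger.info('--DONE UPLOADING FILE TO BUCKET--')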

Microsoft Azure Timer Function not uploading files to blob storage

I am trying to set up a timer trigger so that every 5 minutes the function uploads a file into my blob storage.
When I run my code locally it works, but it fails when it is deployed on Azure. Any help will be appreciated.
Main Method
device_client = IoTHubDeviceClient.create_from_connection_string(CONNECTION_STRING)
PATH_TO_FILE = wget.download("link-of-something", out=os.getcwd())
device_client.connect()
blob_name = os.path.basename(PATH_TO_FILE)
storage_info = device_client.get_storage_info_for_blob(blob_name)
store_blob(storage_info, PATH_TO_FILE)
device_client.shutdown()
Helper method
def store_blob(blob_info, file_name):
    try:
        sas_url = "https://{}/{}/{}{}".format(
            blob_info["hostName"],
            blob_info["containerName"],
            blob_info["blobName"],
            blob_info["sasToken"]
        )
        print("\nUploading file: {} to Azure Storage as blob: {} in container {}\n".format(file_name, blob_info["blobName"], blob_info["containerName"]))
        # Upload the specified file
        with BlobClient.from_blob_url(sas_url) as blob_client:
            with open(file_name, "rb") as f:
                result = blob_client.upload_blob(f, overwrite=True)
                return (True, result)
    except FileNotFoundError as ex:
        # catch file not found and add an HTTP status code to return in notification to IoT Hub
        ex.status_code = 404
        return (False, ex)
    except AzureError as ex:
        # catch Azure errors that might result from the upload operation
        return (False, ex)
This is the error log (Edited)
Result: Failure
Exception: OSError: [Errno 30] Read-only file system: './nasa_graphics_manual_nhb_1430-2_jan_1976.pdfjo243l48.tmp'
Stack:
File "/azure-functions-host/workers/python/3.9/LINUX/X64/azure_functions_worker/dispatcher.py", line 407, in _handle__invocation_request
call_result = await self._loop.run_in_executor(
File "/usr/local/lib/python3.9/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/azure-functions-host/workers/python/3.9/LINUX/X64/azure_functions_worker/dispatcher.py", line 649, in _run_sync_func
return ExtensionManager.get_sync_invocation_wrapper(context,
File "/azure-functions-host/workers/python/3.9/LINUX/X64/azure_functions_worker/extension.py", line 215, in _raw_invocation_wrapper
result = function(**args)
File "/home/site/wwwroot/azure-function-timer/__init__.py", line 134, in main
PATH_TO_FILE = wget.download("https://www.nasa.gov/sites/default/files/atoms/files/nasa_graphics_manual_nhb_1430-2_jan_1976.pdf", out=os.getcwd()) # wget to get the filename and path
File "/home/site/wwwroot/.python_packages/lib/site-packages/wget.py", line 506, in download
(fd, tmpfile) = tempfile.mkstemp(".tmp", prefix=prefix, dir=".")
File "/usr/local/lib/python3.9/tempfile.py", line 336, in mkstemp
return _mkstemp_inner(dir, prefix, suffix, flags, output_type)
File "/usr/local/lib/python3.9/tempfile.py", line 255, in _mkstemp_inner
fd = _os.open(file, flags, 0o600)
What you can do is containerize the function using Docker and place the file inside the container so that you can read it later.
If you don't containerize the function, only the function's code gets deployed, not the file.
Refer to this documentation for an in-depth explanation.
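Once the file is packaged with the function (or baked into the container image), a minimal sketch of reading it from the deployment directory instead of downloading it into the current working directory could look like this; the file name comes from the error log above, and device_client/store_blob are the objects from your own snippets:
import os

# Hypothetical: the PDF ships alongside __init__.py instead of being fetched with
# wget into os.getcwd(), which is read-only on a deployed function app.
PATH_TO_FILE = os.path.join(os.path.dirname(__file__), "nasa_graphics_manual_nhb_1430-2_jan_1976.pdf")

blob_name = os.path.basename(PATH_TO_FILE)
storage_info = device_client.get_storage_info_for_blob(blob_name)
store_blob(storage_info, PATH_TO_FILE)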

Python : upload my own files into my drive using Pydrive library

I want to upload a file into my Drive. However, in the PyDrive documentation I only found the Upload() function, which uploads a file created by drive.CreateFile() and updates it, not a file on my hard drive (my own file).
file1 = drive.CreateFile({'title': 'Hello.txt'})  # Create GoogleDriveFile instance with title 'Hello.txt'.
file1.SetContentString('Hello World!')  # Set content of the file from given string.
file1.Upload()
I've tried the answers to my question here on Stack Overflow, but an error occurred. Here is my code:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive

# 1st authentication
gauth = GoogleAuth()
gauth.LocalWebserverAuth()  # Creates local webserver and auto handles authentication.
drive = GoogleDrive(gauth)

file1 = drive.CreateFile(metadata={"title": "big.txt"})
file1.SetContentFile('big.txt')
file1.Upload()
The file "big.txt" is in the same folder of my code file.
When I run it, I got this traceback:
Traceback (most recent call last):
File "C:\Users\**\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pydrive\files.py", line 369, in _FilesInsert
http=self.http)
File "C:\Users\**\AppData\Local\Programs\Python\Python36-32\lib\site-packages\oauth2client\_helpers.py", line 133, in positional_wrapper
return wrapped(*args, **kwargs)
File "C:\Users\**\AppData\Local\Programs\Python\Python36-32\lib\site-packages\googleapiclient\http.py", line 813, in execute
_, body = self.next_chunk(http=http, num_retries=num_retries)
File "C:\Users\**\AppData\Local\Programs\Python\Python36-32\lib\site-packages\oauth2client\_helpers.py", line 133, in positional_wrapper
return wrapped(*args, **kwargs)
File "C:\Users\**\AppData\Local\Programs\Python\Python36-32\lib\site-packages\googleapiclient\http.py", line 981, in next_chunk
return self._process_response(resp, content)
File "C:\Users\**\AppData\Local\Programs\Python\Python36-32\lib\site-packages\googleapiclient\http.py", line 1012, in _process_response
raise HttpError(resp, content, uri=self.uri)
googleapiclient.errors.HttpError: <HttpError 400 when requesting https://www.googleapis.com/upload/drive/v2/files?alt=json&uploadType=resumable returned "Bad Request">
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:/Users/**/AppData/Local/Programs/Python/Python36-32/quickstart.py", line 13, in <module>
file1.Upload()
File "C:\Users\**\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pydrive\files.py", line 285, in Upload
self._FilesInsert(param=param)
File "C:\Users\**\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pydrive\auth.py", line 75, in _decorated
return decoratee(self, *args, **kwargs)
File "C:\Users\**\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pydrive\files.py", line 371, in _FilesInsert
raise ApiRequestError(error)
pydrive.files.ApiRequestError: <HttpError 400 when requesting https://www.googleapis.com/upload/drive/v2/files?alt=json&uploadType=resumable returned "Bad Request">
You have to set the content with SetContentFile() instead of SetContentString():
file1 = drive.CreateFile({'title': 'Hello.txt'})
file1.SetContentFile(path_to_your_file)
file1.Upload()
As the documentation states, if you haven't set the title and mimeType they will be set automatically from the name and type of the file you give. Therefore, if you want to upload the file with the same name it already has on your computer, you can do:
file1 = drive.CreateFile()
file1.SetContentFile(path_to_your_file)
file1.Upload()
Regarding your second point, as far as I'm aware Google Drive cannot convert a file to a different format.
Based on the PyDrive documentation, I would say you need to do the following:
file_path = "path/to/your/file.txt"
file1 = drive.CreateFile()
file1.SetContentFile(file_path)
file1.Upload()
Title and content type metadata are set automatically based on the provided file path. If you want to provide a different filename, pass it to CreateFile() like this:
file1 = drive.CreateFile(metadata={"title": "CustomFileName.txt"})

s3cmd nodename nor servname provided, or not known

I'm trying to access objects in my S3 bucket with s3cmd using path-style URLs. This is no problem with the Java SDK, e.g.:
s3Client.setS3ClientOptions(S3ClientOptions.builder()
.setPathStyleAccess(true).build());
I want to do the same with s3cmd. I have set this up in my s3cmd config file:
host_base = s3.eu-central-1.amazonaws.com
host_bucket = s3.eu-central-1.amazonaws.com/%(bucket)s
This works for bucket listing with:
$ s3cmd ls
2016-08-24 12:36 s3://test
When trying to list all objects of a bucket I get the following error:
Traceback (most recent call last):
File "/usr/local/bin/s3cmd", line 2919, in <module>
rc = main()
File "/usr/local/bin/s3cmd", line 2841, in main
rc = cmd_func(args)
File "/usr/local/bin/s3cmd", line 120, in cmd_ls
subcmd_bucket_list(s3, uri)
File "/usr/local/bin/s3cmd", line 153, in subcmd_bucket_list
response = s3.bucket_list(bucket, prefix = prefix)
File "/usr/local/lib/python2.7/site-packages/S3/S3.py", line 297, in bucket_list
for dirs, objects in self.bucket_list_streaming(bucket, prefix, recursive, uri_params):
File "/usr/local/lib/python2.7/site-packages/S3/S3.py", line 324, in bucket_list_streaming
response = self.bucket_list_noparse(bucket, prefix, recursive, uri_params)
File "/usr/local/lib/python2.7/site-packages/S3/S3.py", line 343, in bucket_list_noparse
response = self.send_request(request)
File "/usr/local/lib/python2.7/site-packages/S3/S3.py", line 1081, in send_request
conn = ConnMan.get(self.get_hostname(resource['bucket']))
File "/usr/local/lib/python2.7/site-packages/S3/ConnMan.py", line 192, in get
conn.c.connect()
File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 836, in connect
self.timeout, self.source_address)
File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", line 557, in create_connection
for res in getaddrinfo(host, port, 0, SOCK_STREAM):
gaierror: [Errno 8] nodename nor servname provided, or not known
Assuming that there is no other issue with your configuration, the value that you used for host_bucket is wrong.
It should be:
host_bucket = %(bucket)s.s3.eu-central-1.amazonaws.com
or
host_bucket = s3.eu-central-1.amazonaws.com
The second one forces "path style" to be used. But if you are using Amazon S3 with the first host_bucket value that I propose, s3cmd will automatically use DNS-based or path-based buckets depending on what characters you use in your bucket name.
Is there a particular reason why you want to use only path-based style?

Invalid timestamp error when trying to get CSV file from an S3 bucket using boto3 module in python2.7

I am trying to get a .csv file stored in an S3 bucket. The CSV is uploaded to the S3 bucket from a Mac, and my code (Python 2.7) is running in a Unix environment. The CSV looks like this (I have included the carriage return characters):
Order,Item,Date,Quantity\r
1,34975,8/4/15,10\r
2,921644,3/10/15,2\r
3,N18DAJ,1/7/15,10\r
4,20816,12/12/15,9\r
My code to get the file from the s3 bucket:
import boto3

def readcsvFromS3(bucket_name, key):
    s3 = boto3.resource('s3')
    obj = s3.Object(bucket_name=bucket_name, key=key)
    response = obj.get()
    data = response['Body'].read()
The error happens on the response = obj.get() line, and the error I'm getting is:
Traceback (most recent call last):
File "slot.py", line 163, in <module>
columnNames, rowArray = neo.readcsvFromS3(bucket_name=config.s3bucket, key=config.orde
File "/home/jcgarciaram/WMSight/wmsight-api/api/utilities/pythonScripts/slotting/neo4jUt
response = obj.get()
File "/usr/local/lib/python2.7/dist-packages/boto3/resources/factory.py", line 481, in d
response = action(self, *args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/boto3/resources/action.py", line 83, in __c
response = getattr(parent.meta.client, operation_name)(**params)
File "/usr/local/lib/python2.7/dist-packages/botocore/client.py", line 228, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/usr/local/lib/python2.7/dist-packages/botocore/client.py", line 481, in _make_api
operation_model, request_dict)
File "/usr/local/lib/python2.7/dist-packages/botocore/endpoint.py", line 117, in make_re
return self._send_request(request_dict, operation_model)
File "/usr/local/lib/python2.7/dist-packages/botocore/endpoint.py", line 144, in _send_r
request, operation_model, attempts)
File "/usr/local/lib/python2.7/dist-packages/botocore/endpoint.py", line 203, in _get_re
parser.parse(response_dict, operation_model.output_shape)),
File "/usr/local/lib/python2.7/dist-packages/botocore/parsers.py", line 208, in parse
parsed = self._do_parse(response, shape)
File "/usr/local/lib/python2.7/dist-packages/botocore/parsers.py", line 570, in _do_pars
member_shapes, final_parsed)
File "/usr/local/lib/python2.7/dist-packages/botocore/parsers.py", line 626, in _parse_n
member_shape, headers[header_name])
File "/usr/local/lib/python2.7/dist-packages/botocore/parsers.py", line 226, in _parse_s
return handler(shape, node)
File "/usr/local/lib/python2.7/dist-packages/botocore/parsers.py", line 149, in _get_tex
return func(self, shape, text)
File "/usr/local/lib/python2.7/dist-packages/botocore/parsers.py", line 380, in _handle_
return self._timestamp_parser(text)
File "/usr/local/lib/python2.7/dist-packages/botocore/utils.py", line 344, in parse_time
raise ValueError('Invalid timestamp "%s": %s' % (value, e))
ValueError: Invalid timestamp "Wed, 16 Jan 48199 20:37:02 GMT": year is out of range
I have been researching all over but can't seem to figure out the issue. Any ideas?
After days of searching and debugging, we finally determined the cause of the issue. We tried uploading the files in JSON format rather than CSV format, and imagine our surprise when we saw the same error when trying to download the file using boto3 in Python.
We then started looking at the properties of the files themselves in S3 (right-click on the file and click Properties) rather than at their content.
We found a section called Metadata with the following entry:
Key: Expires / Value: Tue, 15 Jan 48199 02:16:52 GMT
After changing the year in the value to something like 2200, everything worked fine! We are now looking into our upload process in Node.js to see how we can make sure this value is set correctly.
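For illustration only (the actual upload pipeline in this case is Node.js): a minimal boto3 sketch of setting a sane Expires value at upload time, so the object's Expires metadata stays within a range that botocore can parse on download. The bucket and key names are placeholders:
import datetime
import boto3

s3 = boto3.client('s3')

# Placeholder bucket/key; the point is the Expires argument, which must be a
# normal datetime rather than a far-future year such as 48199.
with open('orders.csv', 'rb') as f:
    s3.put_object(
        Bucket='my-bucket',
        Key='orders.csv',
        Body=f,
        Expires=datetime.datetime(2200, 1, 1),
    )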
