Unable to append new dataframe to existing dataframe from blob storage - Python

After an HTTP trigger, I want to read a .csv file from blob storage, append new data to it, and save the result back to blob storage in .csv format. Please help me out.
scriptpath = os.path.abspath(__file__)
scriptdir = os.path.dirname(scriptpath)
train = os.path.join(scriptdir, 'train.csv')
train_df = pd.read_csv(train)
train_df = train_df.append(test_df)
train_df.to_csv(scriptdir + 'tt.csv')
block_blob_service.create_blob_from_path(
    'files',
    'mytest.csv',
    scriptdir + 'tt.csv',
    content_settings=ContentSettings(content_type='application/CSV')
)
My problem is that after appending the data, I have to save it back to blob storage as a csv file, but the above error comes up. The HTTP-triggered function doesn't give me permission to save the csv file. The error shows:
Exception: PermissionError: [Errno 13] Permission denied:
'C:\\Users\\Shiva\\Desktop\\project\\loumus\\Imagetrigger'
Stack: File "C:\Program Files\Microsoft\Azure Functions Core Tools\workers\python\3.7/WINDOWS/X64\azure_functions_worker\dispatcher.py", line 357, in _handle__invocation_request
self.__run_sync_func, invocation_id, fi.func, args)
File "C:\Users\Shiva\AppData\Local\Programs\Python\Python37\lib\concurrent\futures\thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
File "C:\Program Files\Microsoft\Azure Functions Core Tools\workers\python\3.7/WINDOWS/X64\azure_functions_worker\dispatcher.py", line 542, in __run_sync_func
return func(**params)
File "C:\Users\Shiva\Desktop\project\loumus\Imagetrigger\__init__.py", line 276, in main
mm.to_csv(scriptdir,'tt.csv')
File "C:\Users\Shiva\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\generic.py", line 3403, in to_csv
storage_options=storage_options,
File "C:\Users\Shiva\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\formats\format.py", line 1083, in to_csv
csv_formatter.save()
File "C:\Users\Shiva\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\formats\csvs.py", line 234, in save
storage_options=self.storage_options,
File "C:\Users\Shiva\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\common.py", line 647, in get_handle
newline="",

There are several issues.
Azure Functions run inside a managed runtime environment; you do not have the same level of access to local storage/disk that you would have when running on a laptop. That's not to say you have no local disk at all. See the docs and related discussions:
https://learn.microsoft.com/en-us/azure/azure-functions/functions-scale#service-limits (see storage)
https://github.com/Azure/Azure-Functions/issues/179
how much local disk is available in an Azure Function execution context
https://github.com/Azure/azure-functions-host/wiki/Retrieving-information-about-the-currently-running-function
Azure Functions Temp storage
Use Python's tempfile module to create a temp directory; it will be created in the area the underlying OS designates as temp storage.
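For illustration, a minimal sketch of writing the merged csv into the OS temp area instead of the script directory, reusing train_df and block_blob_service from the question's code (my own sketch, not a verified fix):
import os
import tempfile

# The temp area designated by the OS is writable inside the Functions sandbox,
# unlike the function's script directory.
tmp_path = os.path.join(tempfile.gettempdir(), 'tt.csv')
train_df.to_csv(tmp_path, index=False)
block_blob_service.create_blob_from_path('files', 'mytest.csv', tmp_path)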
There is no specific reason to write to local storage and then upload to ADLS. You could:
Write the csv to memory (e.g. into a StringIO), then use the SDK to write that string to storage (see the sketch after this list).
Install the appropriate drivers (I'm not sure whether you're using pyspark, pandas, or something else) and write the DataFrame directly to ADLS. E.g. see one example.
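A minimal sketch of the first option, assuming the same legacy BlockBlobService client (azure-storage-blob v2.x) used in the question; with the newer v12 SDK you would call BlobClient.upload_blob instead:
from io import StringIO
from azure.storage.blob import ContentSettings

# Serialize the dataframe to a string in memory and upload it directly,
# skipping the local file system entirely.
buffer = StringIO()
train_df.to_csv(buffer, index=False)
block_blob_service.create_blob_from_text(
    'files',                      # container name from the question
    'mytest.csv',
    buffer.getvalue(),
    content_settings=ContentSettings(content_type='text/csv'),
)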

Related

MLflow load_model fails in Python

I am trying to build an API using an MLflow model.
The funny thing is that it works from one location on my PC and not from another. (The reason for moving it was that I wanted to restructure my repo, etc.)
So the simple code
from mlflow.pyfunc import load_model
MODEL_ARTIFACT_PATH = "./model/model_name/"
MODEL = load_model(MODEL_ARTIFACT_PATH)
now fails with
ERROR: Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/starlette/routing.py", line 540, in lifespan
async for item in self.lifespan_context(app):
File "/usr/local/lib/python3.8/dist-packages/starlette/routing.py", line 481, in default_lifespan
await self.startup()
File "/usr/local/lib/python3.8/dist-packages/starlette/routing.py", line 516, in startup
await handler()
File "/code/./app/main.py", line 32, in startup_load_model
MODEL = load_model(MODEL_ARTIFACT_PATH)
File "/usr/local/lib/python3.8/dist-packages/mlflow/pyfunc/__init__.py", line 733, in load_model
model_impl = importlib.import_module(conf[MAIN])._load_pyfunc(data_path)
File "/usr/local/lib/python3.8/dist-packages/mlflow/spark.py", line 737, in _load_pyfunc
return _PyFuncModelWrapper(spark, _load_model(model_uri=path))
File "/usr/local/lib/python3.8/dist-packages/mlflow/spark.py", line 656, in _load_model
return PipelineModel.load(model_uri)
File "/usr/local/lib/python3.8/dist-packages/pyspark/ml/util.py", line 332, in load
return cls.read().load(path)
File "/usr/local/lib/python3.8/dist-packages/pyspark/ml/pipeline.py", line 258, in load
return JavaMLReader(self.cls).load(path)
File "/usr/local/lib/python3.8/dist-packages/pyspark/ml/util.py", line 282, in load
java_obj = self._jread.load(path)
File "/usr/local/lib/python3.8/dist-packages/py4j/java_gateway.py", line 1321, in __call__
return_value = get_return_value(
File "/usr/local/lib/python3.8/dist-packages/pyspark/sql/utils.py", line 117, in deco
raise converted from None
pyspark.sql.utils.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.
The model artifacts are already downloaded to the /model folder, which has the structure shown in the screenshot.
The load_model call is in the main.py file.
As I mentioned, it works from another directory, and there is no reference to any absolute paths. Also, I have made sure that my package references are identical, i.e. I have pinned them all down:
# Model
mlflow==1.25.1
protobuf==3.20.1
pyspark==3.2.1
scipy==1.6.2
six==1.15.0
Also, the same Dockerfile is used in both places, which, among other things, makes sure that the final folder structure is the same:
# ...other steps
COPY ./app /code/app
COPY ./model /code/model
What can explain it throwing this exception in one location, while in the other location (on my PC) it works with the same model artifacts?
Since it uses the load_model function, it should be able to read the parquet files, right?
Ask me anything and I can explain further.
EDIT1: I have debugged this a little more inside the docker container, and it seems the parquet files in the itemFactors folder (listed in my screenshot above) are not getting copied over to my image, even though I have a COPY command for all files under the model folder. It is copying the _started, _committed and _SUCCESS files, just not the parquet files. Does anyone know why that would be? I do NOT have a .dockerignore file. Why are those files ignored while copying?
I found the problem. As I wrote in EDIT1 of my post, on further observation the parquet files were missing in the docker container. That was strange, because I was copying the entire folder in my Dockerfile.
I then realized that I was hitting the problem mentioned here: file paths exceeding 260 characters silently fail and do not get copied over to the docker image. This was really frustrating because nothing failed during the build, and then at run time I got that cryptic error "Unable to infer schema for Parquet", essentially because the parquet files were never copied in during docker build.
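In case it helps others, here is a small sketch (my own addition, not part of the original post) that walks the model folder before building and flags any path longer than 260 characters, so silent COPY drops can be spotted up front; the MODEL_DIR value is hypothetical:
import os

MODEL_DIR = 'model'  # hypothetical; point this at the folder you COPY into the image
for root, _dirs, files in os.walk(MODEL_DIR):
    for name in files:
        full_path = os.path.abspath(os.path.join(root, name))
        if len(full_path) > 260:
            # These are the files likely to be dropped silently on Windows.
            print(len(full_path), full_path)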

Unable to read_excel using pandas on CentOS Stream 9 VM: zipfile.BadZipFile: Bad magic number for file header

I've been running a script for several months now where I read and concat several excel exports using the following code:
excel_list = []  # collect one dataframe per export
files = os.listdir(os.path.abspath('exports/'))
for file in files:
    if file.startswith('ap_statistics_') and file.endswith('.xlsx'):
        excel_list.append(pd.read_excel('exports/' + file, sheet_name='Access Points'))
df = pd.concat(excel_list, axis=0, ignore_index=True)
This worked just fine until this Saturday, when I uploaded new exports to the CentOS Stream 9 VM where a cronjob runs the script every hour.
Now I always get this error:
Traceback (most recent call last):
File "/root/projects/beacon_check_v8/main.py", line 310, in <module>
ap_check()
File "/root/projects/beacon_check_v8/main.py", line 260, in ap_check
siteaps_result = getaps()
File "/root/projects/beacon_check_v8/main.py", line 30, in getaps
excel_list.append(pd.read_excel('exports/' + file, sheet_name='Access Points'))
File "/root/projects/beacon_check_v8/venv/lib64/python3.9/site-packages/pandas/util/_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "/root/projects/beacon_check_v8/venv/lib64/python3.9/site-packages/pandas/io/excel/_base.py", line 457, in read_excel
io = ExcelFile(io, storage_options=storage_options, engine=engine)
File "/root/projects/beacon_check_v8/venv/lib64/python3.9/site-packages/pandas/io/excel/_base.py", line 1419, in __init__
self._reader = self._engines[engine](self._io, storage_options=storage_options)
File "/root/projects/beacon_check_v8/venv/lib64/python3.9/site-packages/pandas/io/excel/_openpyxl.py", line 525, in __init__
super().__init__(filepath_or_buffer, storage_options=storage_options)
File "/root/projects/beacon_check_v8/venv/lib64/python3.9/site-packages/pandas/io/excel/_base.py", line 518, in __init__
self.book = self.load_workbook(self.handles.handle)
File "/root/projects/beacon_check_v8/venv/lib64/python3.9/site-packages/pandas/io/excel/_openpyxl.py", line 536, in load_workbook
return load_workbook(
File "/root/projects/beacon_check_v8/venv/lib64/python3.9/site-packages/openpyxl/reader/excel.py", line 317, in load_workbook
reader.read()
File "/root/projects/beacon_check_v8/venv/lib64/python3.9/site-packages/openpyxl/reader/excel.py", line 277, in read
self.read_strings()
File "/root/projects/beacon_check_v8/venv/lib64/python3.9/site-packages/openpyxl/reader/excel.py", line 143, in read_strings
with self.archive.open(strings_path,) as src:
File "/usr/lib64/python3.9/zipfile.py", line 1523, in open
raise BadZipFile("Bad magic number for file header")
zipfile.BadZipFile: Bad magic number for file header
I develop on my Windows 10 notebook using PyCharm with a Python 3.9 venv, the same as on the VM, and there the script continues to work just fine.
When researching online, all I found was that sometimes .pyc files can cause issues, so I created a completely new venv on the VM, installed all the libraries (netmiko, pandas, openpyxl, etc.) and tried running the script again, both before and after deleting all .pyc files in the directory, but no luck.
I have extracted the Excel file header using the following code:
with open('exports/' + file, 'rb') as myexcel:
    print(myexcel.read(4))
Unfortunately, it comes back with the same value in both my Windows venv and the CentOS venv:
b'PK\x03\x04'
I don't know whether this header value is correct, but I can read the files on my Windows notebook just fine using pandas or Excel.
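As an aside, b'PK\x03\x04' is simply the standard ZIP local-file-header signature, so that check only proves the file starts like a ZIP archive. A deeper check along these lines (my own sketch, reusing the file variable from the loop above) reads every member and verifies its CRC, which should flag a partially corrupted .xlsx even when the first four bytes look fine:
import zipfile

try:
    with zipfile.ZipFile('exports/' + file) as zf:
        bad_member = zf.testzip()  # returns the first corrupt member name, or None
    print(bad_member or 'archive looks intact')
except zipfile.BadZipFile as exc:
    print('not a valid zip/xlsx:', exc)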
Any help would be greatly appreciated.
The issue was actually the program I used to transfer the files between my notebook and the VM: WinSCP. I don't know why or how this caused the error, but I was able to fix it by transferring the files directly with pscp instead.

Why is boto3.client.download_file appending a string to the end of the file name?

I need to download files from S3, and I wrote this code:
# This function downloads the files in a bucket
def download_dir(s3: boto3.client, bucket: str, directory: str = None) -> None:
    # Verify whether the bucket directory exists
    if not os.path.exists(bucket):
        # Create the bucket directory
        os.makedirs(bucket)
    # Iterate over the list of objects inside the bucket
    for s3_key in s3.list_objects(Bucket=bucket)['Contents']:
        file_name = os.path.join(bucket, s3_key['Key'])
        # If the object is not a directory
        if not s3_key['Key'].endswith("/"):
            # Verify whether the file was already downloaded
            if not os.path.exists(file_name):
                print(s3_key['Key'])
                real_file_name = s3_key['Key']
                print(real_file_name)
                try:
                    s3.download_file(Bucket=bucket, Key=s3_key['Key'], Filename=file_name)
                except:
                    print(type(real_file_name))
                    s3.download_file(Bucket=bucket, Filename=file_name, Key=real_file_name)
        # If the object is a directory
        else:
            # If the directory doesn't exist
            if not os.path.exists(file_name):
                # Create the directory
                os.makedirs(file_name)

s3 = boto3.client('s3',
                  verify=False,
                  aws_access_key_id=aws_dict['aws_access_key_id'],
                  aws_secret_access_key=aws_dict['aws_secret_access_key'],
                  aws_session_token=aws_dict['aws_session_token'],
                  region_name=aws_dict['region_name'],
                  config=config
                  )

download_dir(s3, 'MY-BUCKET')
But for one specific file, a string is magically being appended to the end of the file name, which gives me this exception:
Traceback (most recent call last):
  File "aws_transfer.py", line 58, in <module>
    download_dir(s3, 'bucket')
  File "aws_transfer.py", line 29, in download_dir
    s3.download_file(Bucket=bucket, Filename=file_name, Key=real_file_name)
  File "/home/gbarbo3/.local/lib/python3.8/site-packages/boto3/s3/inject.py", line 171, in download_file
    return transfer.download_file(
  File "/home/gbarbo3/.local/lib/python3.8/site-packages/boto3/s3/transfer.py", line 315, in download_file
    future.result()
  File "/home/gbarbo3/.local/lib/python3.8/site-packages/s3transfer/futures.py", line 106, in result
    return self._coordinator.result()
  File "/home/gbarbo3/.local/lib/python3.8/site-packages/s3transfer/futures.py", line 265, in result
    raise self._exception
  File "/home/gbarbo3/.local/lib/python3.8/site-packages/s3transfer/tasks.py", line 126, in __call__
    return self._execute_main(kwargs)
  File "/home/gbarbo3/.local/lib/python3.8/site-packages/s3transfer/tasks.py", line 150, in _execute_main
    return_value = self._main(**kwargs)
  File "/home/gbarbo3/.local/lib/python3.8/site-packages/s3transfer/download.py", line 571, in _main
    fileobj.seek(offset)
  File "/home/gbarbo3/.local/lib/python3.8/site-packages/s3transfer/utils.py", line 367, in seek
    self._open_if_needed()
  File "/home/gbarbo3/.local/lib/python3.8/site-packages/s3transfer/utils.py", line 350, in _open_if_needed
    self._fileobj = self._open_function(self._filename, self._mode)
  File "/home/gbarbo3/.local/lib/python3.8/site-packages/s3transfer/utils.py", line 261, in open
    return open(filename, mode)
FileNotFoundError: [Errno 2] No such file or directory: 'bucket/folder/model.tar.gz.c40fF924'
The real file name should be 'bucket/folder/model.tar.gz'.
Can anyone help me with that?
Your program is having problems with sub-directories.
First, an explanation...
Amazon S3 does not use directories. For example, you could run this command to upload a file:
aws s3 cp foo.txt s3://bucketname/folder1/foo.txt
The object would be created in the Amazon S3 bucket with a Key of folder1/foo.txt. If you view the bucket in the S3 management console, the folder1 directory would 'appear', but it doesn't actually exist. If you were to delete that object, the folder1 directory would 'disappear' because it never actually existed.
However, there is also a button in the S3 management console called Create folder. It will create a zero-length object with the name of the 'folder' (eg folder1/). This will 'force' the (pretend) directory to appear, but it still doesn't actually exist.
Your code is specifically checking whether such an object exists in this line:
if not s3_key['Key'].endswith("/"):
The assumption is that there will always be an object with the name of the directory. However, that is not necessarily true (as shown with my example above). Therefore, the program never creates the directory and it then fails when attempting to download an object to a directory that does not exist on your computer.
Your program would need to test the existence of the target directory on your local computer before downloading each object. It cannot rely on there always being an object with a Key that ends with a / for every directory in the bucket.
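As far as I can tell, the random-looking suffix in the error (model.tar.gz.c40fF924) is just the temporary file name boto3's transfer manager writes to before renaming it; the download fails because the local parent directory doesn't exist, not because of the suffix. A minimal sketch of the fix described above, reusing the variable names from the question's loop (an illustration, not a drop-in rewrite):
import os

# Make sure the local parent directory exists before every download,
# instead of relying on a trailing-"/" object being present in the bucket.
local_dir = os.path.dirname(file_name)
if local_dir and not os.path.exists(local_dir):
    os.makedirs(local_dir, exist_ok=True)
s3.download_file(Bucket=bucket, Key=s3_key['Key'], Filename=file_name)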

Paramiko put method throws "[Errno 2] File not found" if SFTP server has trigger to automatically move file upon upload

I'm calling the Paramiko sftp_client.put(localpath, remotepath) method.
This is throwing the [Errno 2] File not found error below.
01/07/2020 01:12:03 PM - ERROR - [Errno 2] File not found
Traceback (most recent call last):
File "file_transfer\TransferFiles.py", line 123, in main
File "paramiko\sftp_client.py", line 727, in put
File "paramiko\sftp_client.py", line 689, in putfo
File "paramiko\sftp_client.py", line 460, in stat
File "paramiko\sftp_client.py", line 780, in _request
File "paramiko\sftp_client.py", line 832, in _read_response
File "paramiko\sftp_client.py", line 861, in _convert_status
Having tried many of the other recommended fixes, I found that the error is due to the server having an automatic trigger that immediately moves the file to another location upon upload.
I've not seen another post relating to this issue and wanted to know if anyone else has fixed it, as the SFTP server is owned by a third party and I don't want to change the trigger attributes.
The file actually uploads correctly, so I could catch the exception and ignore the error. But I'd prefer to handle it properly, if possible.
Paramiko by default verifies the size of the uploaded file after the upload.
If the file is moved away immediately after upload, the check fails.
To avoid the check, set the confirm parameter of SFTPClient.put to False:
sftp_client.put(localpath, remotepath, confirm=False)
I believe the check is redundant anyway; see:
How to perform checksums during a SFTP file transfer for data integrity?
For a similar question about pysftp (which is a wrapper around Paramiko), see:
Python pysftp.put raises "No such file" exception although file is uploaded
I also had this issue of the file automatically getting moved before Paramiko could stat the uploaded file and compare the local and uploaded file sizes.
@Martin_Prikryl's solution works fine for removing the error, by passing confirm=False when using sftp.put or sftp.putfo.
If you want this check to still run, as you mention in the post, to confirm the file has been uploaded fully, you can run something along these lines. For this to work you will need to know where the file is moved to and have the ability to read it.
import os
sftp.putfo(source_file_object, destination_file, confirm=False)
upload_size = sftp.stat(moved_path).st_size
local_size = os.stat(source_file_object).st_size
if upload_size != local_size:
raise IOError(
"size mismatch in put! {} != {}".format(upload_size, local_size)
)
Both checks use os.stat
