config.yml not found on Databricks - python

I have a Python project that queries a SQL Server database and does some transformation within SQL Server. The project uses a config.yml that holds all the DB-related properties.
Now I'm trying to host this on Databricks so that I can run it as a notebook. I have imported all the Python files into the Databricks workspace, but while executing the main .py file I get the following error:
FileNotFoundError: [Errno 2] No such file or directory: 'config.yml'
Databricks does not allow me to import a .yml file into the workspace. What can I do to run this Python project so that it reads the .yml file and creates the DB connection properly?
Thanks!

You can put your .yml file onto DBFS and point to it. You can do this in several ways:
Using dbutils.fs.put (see doc)
Using the Databricks CLI's databricks fs cp command from your local machine - you will need to install the databricks-cli Python package on it, and configure it to use personal access tokens if they are enabled in your workspace (see doc)
Uploading the file via the file browser, or directly from the notebook's menu (see doc)
Because your code works with "local" files, you will need to specify the path to the file as /dbfs/<file-path-on-dbfs> - in that case the file will be read by the "normal" Python file API.
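For example, a minimal sketch of this pattern, assuming PyYAML is available on the cluster and using a placeholder DBFS path and placeholder config contents:

import yaml  # PyYAML, assumed to be installed on the cluster

# One-time upload from a notebook cell (dbutils is available in Databricks notebooks).
# The DBFS path and the config contents are placeholders.
dbutils.fs.put("/FileStore/configs/config.yml", """
server: my-sql-server.example.com
database: my_db
""", True)  # True = overwrite

# Read it back through the local-file view of DBFS with the normal Python file API.
with open("/dbfs/FileStore/configs/config.yml") as f:
    config = yaml.safe_load(f)

print(config["server"])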

Related

Python workaround for functions that don't recognise a URL as a file path?

I am using the ecommercetools package to access the Google Search Console API as described here - https://practicaldatascience.co.uk/data-science/how-to-access-the-google-search-console-api-using-python .
This is working fine on my local machine. However, I want to run it in a Runbook on Azure, so I have stored the key JSON file in Blob Storage.
The issue seems to be that seo.query_google_search_console() does not recognise the Blob SAS URL of the file as a file path.
Is there some way to work around this? Is it possible to hold a virtual file in memory and refer to it instead? Would appreciate any advice!
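One possible workaround, sketched here as an assumption rather than a confirmed answer: download the key JSON from the Blob SAS URL to a temporary local file and pass that path wherever a local file path is expected. The SAS URL below is a placeholder.

import tempfile
import requests

# Placeholder SAS URL; in practice this would come from configuration or a secret store.
sas_url = "https://<account>.blob.core.windows.net/<container>/key.json?<sas-token>"

response = requests.get(sas_url)
response.raise_for_status()

# Write the key to a temporary file so that path-only APIs can read it.
with tempfile.NamedTemporaryFile(mode="w", suffix=".json", delete=False) as tmp:
    tmp.write(response.text)
    key_path = tmp.name

# key_path can now be passed to any function that expects a local file path.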

Can't read directly from pandas on GCP Databricks

Usually on Databricks on Azure/AWS, to read files stored on Azure Blob/S3, I would mount the bucket or blob storage and then do the following:
If using Spark
df = spark.read.format('csv').load('/mnt/my_bucket/my_file.csv', header="true")
If using pandas directly, adding /dbfs to the path:
df = pd.read_csv('/dbfs/mnt/my_bucket/my_file.csv')
I am trying to do the exact same thing on the GCP-hosted version of Databricks. Although I successfully manage to mount my bucket and read it with Spark, I am not able to do it with pandas directly: adding /dbfs does not work and I get a No such file or directory: ... error
Has anyone encountered a similar issue? Am I missing something?
Also when I do
%sh
ls /dbfs
It returns nothing, even though in the UI I can see the DBFS browser with my mounted buckets and files.
Thanks for the help
It's documented in the list of features not released yet:
DBFS access to local file system (FUSE mount).
For DBFS access, the Databricks dbutils commands, Hadoop Filesystem APIs such as the %fs command, and Spark read and write APIs are available. Contact your Databricks representative for any questions.
So you'll need to copy the file to the local disk before reading it with pandas:
dbutils.fs.cp("/mnt/my_bucket/my_file.csv", "file:/tmp/my_file.csv")
df = pd.read_csv('/tmp/my_file.csv')

How do you automatically update a csv file in a GitLab repository?

I need some explanations or examples of how to perform updates on a csv file in a GitLab repository (with Python).
I have a Python script that sources data from a SQL database; the script performs some functions and outputs the result as a csv file (result.csv), in the example format shown below:
ID,1,1.0,0.0,0.3,0.01,0.04,...0.0
I have tested this out in Jupyter on my local machine and it works fine. The result.csv is written to the location specified in my Python script, i.e. data.to_csv(r"C:\Users\Ola\Documents\result.csv"), and is updated every time I run the script.
How do I perform the same process with the Python script now located in a GitLab repo, so that the csv file is written (output) to the same repo location as the Python script (my GitLab repo project)?
PS: I will eventually schedule the pipeline to execute once a week.
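One way this is commonly done, sketched here under assumptions not stated in the post: run the script in a scheduled GitLab CI pipeline and have it commit the regenerated result.csv back to the repository. This assumes the job checks out the repo and has push access (for example via a project access token in the remote URL); the branch name and bot identity below are placeholders.

import subprocess

def commit_result(csv_path="result.csv", branch="main"):
    # Placeholder identity for the automated commit.
    subprocess.run(["git", "config", "user.email", "ci-bot@example.com"], check=True)
    subprocess.run(["git", "config", "user.name", "CI Bot"], check=True)

    # Stage the regenerated csv.
    subprocess.run(["git", "add", csv_path], check=True)

    # Only commit and push if the file actually changed.
    diff = subprocess.run(["git", "diff", "--cached", "--quiet"])
    if diff.returncode != 0:
        subprocess.run(["git", "commit", "-m", "Update result.csv"], check=True)
        subprocess.run(["git", "push", "origin", f"HEAD:{branch}"], check=True)

commit_result()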

Python AWS Glue log says "Considering file without prefix as a python extra file" for uploaded python zip packages

In AWS Glue I have a small piece of code for a simple pandas job that reads XLSX data and writes it out as CSV. As per the Python Glue instructions, I have zipped the required libraries and provided them as packages to the Glue job at execution time.
Question: What do the following logs convey?
Considering file without prefix as a python extra file s3://raw-data/sampath/scripts/s3fs/fsspec.zip
Considering file without prefix as a python extra file s3://raw-data/sampath/scripts/s3fs/jmespath.zip
Considering file without prefix as a python extra file s3://raw-data/sampath/scripts/s3fs/s3fs.zip
....
Please elaborate with an example.
In Python shell jobs you should add external libraries as an .egg file, not a .zip file; the zip file is for Spark jobs.
I also wrote a small shell script to deploy a Python shell job without the manual steps of creating the egg file, uploading it to S3, and deploying via CloudFormation. The script does all of this automatically. You can find the code at https://github.com/fatangare/aws-python-shell-deploy. The script takes a csv file and converts it into an Excel file using the pandas and xlsxwriter libraries.
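For context, a minimal sketch of how such an .egg might be built with setuptools; the package name and version are placeholders, and the exact Glue job wiring is an assumption:

# setup.py - placeholder packaging script; build with: python setup.py bdist_egg
from setuptools import setup, find_packages

setup(
    name="my_glue_deps",  # placeholder package name
    version="0.1",
    packages=find_packages(),
)

The resulting .egg under dist/ would then be uploaded to S3 and referenced from the job's Python library path setting.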

Deployed Django instance not recognising a file as a file

I have a deployed Django web app that has been working fine. It scrapes sites for newly published .csv files and then uses them for data analysis, which is then returned to the user.
The most recent .csv is not being recognised as a file on the deployed version, but it is on the test version on my local machine. The structure is as follows:
-indicator-analyser
    -Analyser
        -AnalysisScripts
        -uploads
            -data
                -2017
                    -Monthly_File_Jun_Final.csv
                    -Monthly_File_Sep_Final.csv
When a user attempts to run the script on Monthly_File_Jun_Final.csv, the webapp performs as expected. When they run the same script on Monthly_File_Sep_Final.csv, Django throws an error because no file is found. I have taken the file path that is passed in and used it to open the file in Explorer, and I have used the same file path to load the .csv as a dataframe in pandas from the console with no problems.
The path that is passed to the loading script is:
C:\\webapp\\indicator-analyser\\Analyser/uploads/data/2017/Monthly_File_Sep_Final.csv
When this is evaluated using os.path.isfile(filepath), it returns False. However, when the other file is selected, its path is recognised as a file:
C:\\webapp\\indicator-analyser\\Analyser/uploads/data/2017/Monthly_File_Jun_Final.csv
Just for reference, this is running on an IIS server. I have restarted the machine and the server to no avail.
To conclude, I can access this file:
through the console
on my local instance
through my file explorer
But it is not recognised as a file in the live Django instance.
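One way to narrow this kind of mismatch down is to compare what Python receives with what the OS actually reports; a hedged sketch using only the path above and the standard library:

import os

filepath = r"C:\webapp\indicator-analyser\Analyser/uploads/data/2017/Monthly_File_Sep_Final.csv"

print(repr(filepath))  # reveals hidden or invisible characters in the path
print(os.path.exists(filepath), os.path.isfile(filepath))

# Compare the literal directory entries with the expected file name.
folder = os.path.dirname(filepath)
for name in os.listdir(folder):
    print(repr(name))

# Surface permission errors explicitly (e.g. the IIS worker account lacking read access).
try:
    os.stat(filepath)
except OSError as exc:
    print(exc)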
This was fixed by deleting the scraped file and redownloading it, placing it in the exact same place. I do not know why this has fixed it, but it is now working.
