Usually on Databricks on Azure/AWS, to read files stored on Azure Blob/S3, I would mount the bucket or blob storage and then do the following:
If using Spark:
df = spark.read.format('csv').load('/mnt/my_bucket/my_file.csv', header="true")
If using pandas directly, adding /dbfs to the path:
df = pd.read_csv('/dbfs/mnt/my_bucket/my_file.csv')
I am trying to do the exact same thing on the hosted version of Databricks on GCP. Although I successfully manage to mount my bucket and read it with Spark, I am not able to do it with pandas directly: adding /dbfs to the path does not work and I get a No such file or directory: ... error.
Has anyone encountered a similar issue? Am I missing something?
Also, when I run
%sh
ls /dbfs
it returns nothing, even though in the UI I can see the DBFS browser with my mounted buckets and files.
Thanks for the help
It's documented in the list of features not released yet:
DBFS access to local file system (FUSE mount).
For DBFS access, the Databricks dbutils commands, Hadoop Filesystem APIs such as the %fs command, and Spark read and write APIs are available. Contact your Databricks representative for any questions.
So you'll need to copy the file to the local disk before reading it with pandas:
dbutils.fs.cp("/mnt/my_bucket/my_file.csv", "file:/tmp/my_file.csv")
df = pd.read_csv('/tmp/my_file.csv')
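If you also need to write results back to the bucket, the same copy works in the other direction (the output file name here is just an example):

df.to_csv('/tmp/my_output.csv', index=False)
dbutils.fs.cp("file:/tmp/my_output.csv", "/mnt/my_bucket/my_output.csv")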
How can I mock the existing Azure Databricks PySpark code of a project (written by others) and run it locally on a Windows machine/Anaconda to test and practice?
Is it possible to mock the code, or do I need to create a new cluster on Databricks for my own testing purposes?
How can I connect to the storage account, use the Databricks Utilities, etc.? I only have experience with Python & GCP; I just joined a Databricks project and need to run the cells one by one to see the results and modify them if required.
Thanks
You can test/run PySpark code from your IDE by installing PySpark on your local computer.
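As a rough sketch of what that looks like (assuming Java is installed, and using a placeholder CSV file name), you can pip install pyspark and create a local session:

# pip install pyspark
from pyspark.sql import SparkSession

# A local SparkSession for testing code outside of Databricks
spark = (SparkSession.builder
         .master("local[*]")
         .appName("local-test")
         .getOrCreate())

df = spark.read.format('csv').load('my_file.csv', header="true")  # placeholder local file
df.show()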
Now, to use the Databricks Utilities, you would in fact need a Databricks instance; they are not available locally. You can try Databricks Community Edition for free, but with some limitations.
To access a cloud storage account, it can be done locally from your computer or from your own Databricks instance. In both cases you will have to set up the endpoint of this storage account using its secrets.
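For example, with an Azure Blob Storage account on a Databricks cluster, a hedged sketch of that setup (the account, container, and key below are placeholders; running locally instead would additionally require the hadoop-azure connector on the classpath):

storage_account = "mystorageaccount"   # placeholder account name
container = "mycontainer"              # placeholder container name

spark.conf.set(
    f"fs.azure.account.key.{storage_account}.blob.core.windows.net",
    "<storage-account-access-key>",    # keep the key in a secret store, not in code
)

df = spark.read.format("csv").load(
    f"wasbs://{container}@{storage_account}.blob.core.windows.net/my_file.csv",
    header="true",
)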
I have a bunch of files in a Google Cloud Storage bucket, including some Python scripts and text files. I want to run the Python scripts on the text files. What would be the best way to go about doing this (App Engine, Compute Engine, Jupyter)? Thanks!
I recommend using Google Cloud Functions, which can be triggered automatically each time you upload a new file to Cloud Storage, so it can process the file. You can see the workflow for this in the Cloud Functions Storage tutorial.
You will need to at least download the Python scripts onto an environment first (be it GCE or GAE). To access the GCS text files, you can use the https://pypi.org/project/google-cloud-storage/ library. I don't think you can execute Python scripts from the object bucket itself.
If it is troublesome to change the Python code to read the text files from GCS, you will have to download everything into your environment (e.g. using gsutil).
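As a minimal sketch of the client library (the bucket and object names below are hypothetical):

# pip install google-cloud-storage
from google.cloud import storage

client = storage.Client()                 # uses your application default credentials
bucket = client.bucket("my-bucket")       # hypothetical bucket name
blob = bucket.blob("data/input.txt")      # hypothetical object name

text = blob.download_as_text()            # fetch the file contents as a string
print(text[:100])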
In AWS, similar functionality exists using the awscli, as explained here. Does similar functionality exist in Azure using the Python SDK or CLI? Thanks.
There are two services in Azure Storage, Blob Storage and File Storage, but I'm not sure which of them you want to synchronise with a folder, or which OS you are using.
As @Gaurav Mantri said, Azure File Sync is a good idea if you want to synchronise a folder with an Azure File Share on your on-premises Windows Server.
However, if you want to synchronise Azure Blobs, or if you are using a Unix-like OS such as Linux/macOS, I think you can try Azure Storage Fuse (blobfuse) for Blob Storage, or a Samba client for File Storage, together with the rsync command, to achieve your needs.
First of all, the key point of the workaround is to mount the File/Blob service of Azure Storage as a local filesystem; then you can operate on it from Python (or any other way) just as you would locally, as below.
To mount a blob container as a filesystem, follow the installation instructions to install blobfuse, then configure and run the necessary file/script to mount a blob container of your Azure Storage account, as described on the wiki page.
To mount a file share with a Samba client, please refer to the official document Use Azure Files with Linux.
Then you can directly operate on all data in the blobfuse-mounted or Samba-mounted filesystem, do the folder synchronisation with the rsync and inotify commands, or do other operations as you wish.
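As a minimal sketch (the mount point and folder names below are hypothetical), once the mount is in place you can use ordinary file I/O from Python, or shell out to rsync for the folder synchronisation:

import subprocess
from pathlib import Path

mount_point = Path("/mnt/blobfuse")          # hypothetical blobfuse mount point
local_folder = Path("/home/me/my_folder")    # hypothetical local folder to sync

# Ordinary file I/O works against the mounted container
(mount_point / "hello.txt").write_text("hello from python\n")

# One-way sync of the local folder into the mounted container
subprocess.run(
    ["rsync", "-av", f"{local_folder}/", f"{mount_point}/my_folder/"],
    check=True,
)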
Hope it helps. If you have any concerns, please feel free to let me know.
Has anyone figured out a way to keep files persisted across sessions in Google's newly open sourced Colaboratory?
Using the sample notebooks, I'm successfully authenticating and transferring csv files from my Google Drive instance and have stashed them in /tmp, my ~, and ~/datalab. Pandas can read them just fine off of disk too. But once the session times out, it looks like the whole filesystem is wiped and a new VM is spun up, without the downloaded files.
I guess this isn't surprising given Google's Colaboratory FAQ:
Q: Where is my code executed? What happens to my execution state if I close the browser window?
A: Code is executed in a virtual machine dedicated to your account. Virtual machines are recycled when idle for a while, and have a maximum lifetime enforced by the system.
Given that, maybe this is a feature (i.e. "go use Google Cloud Storage, which works fine in Colaboratory")? When I first used the tool, I was hoping that any .csv files in the My File/Colab Notebooks Google Drive folder would also be loaded onto the VM instance that the notebook was running on :/
Put this before your code, so it will always download your file before your code runs:
!wget -q http://www.yoursite.com/file.csv
Your interpretation is correct. VMs are ephemeral and recycled after periods of inactivity. There's no mechanism for persistent data on the VM itself right now.
In order for data to persist, you'll need to store it somewhere outside of the VM, e.g., Drive, GCS, or any other cloud hosting provider.
Some recipes for loading and saving data from external sources are available in the I/O example notebook.
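One small, interactive option for moving files in and out of the VM (a sketch, not taken from that notebook) is the google.colab.files helper:

from google.colab import files

# Upload files from your local machine into the VM's ephemeral filesystem
uploaded = files.upload()            # opens a file picker in the browser
for name, data in uploaded.items():
    print(name, len(data), "bytes")

# Download a file from the VM back to your machine before the session ends
files.download("results.csv")        # hypothetical file name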
Not sure whether this is the best solution, but you can sync your data between Colab and Drive with automated authentication like this: https://gist.github.com/rdinse/159f5d77f13d03e0183cb8f7154b170a
Include this for files in your Google Drive:
from google.colab import drive
drive.mount('/content/drive')
After it runs, you will see it mounted in your Files tab, and you can access your files with a path like:
'/content/drive/MyDrive/<your folder inside drive>/file.ext'
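For example (the folder and file names here are hypothetical), reading a CSV with pandas once the mount is in place:

import pandas as pd

# Path under the mounted Drive; adjust the folder and file name to your own
df = pd.read_csv('/content/drive/MyDrive/my_data/my_file.csv')
df.head()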
Clouderizer may provide some data persistence, at the cost of a long setup (because you use Google Colab only as a host) and little space to work with.
But in my opinion that's better than having your file(s) "recycled" when you forget to save your progress.
As you pointed out, Google Colaboratory's file system is ephemeral. There are workarounds, though there's a network latency penalty and code overhead: e.g. you can use boilerplate code in your notebooks to mount external file systems like GDrive (see their example notebook).
Alternatively, while this is not supported in Colaboratory, other Jupyter hosting services – like Jupyo – provision dedicated VMs with persistent file systems so the data and the notebooks persist across sessions.
If anyone's interested in saving and restoring the whole session, here's a snippet I'm using that you might find useful:
import os
import dill
from google.colab import drive
backup_dir = 'drive/My Drive/colab_sessions'
backup_file = 'notebook_env.db'
backup_path = backup_dir + '/' + backup_file
def init_drive():
    # mount Google Drive and create the backup directory if it does not exist
    drive.mount('drive')
    if not os.path.exists(backup_dir):
        os.makedirs(backup_dir)

def restart_kernel():
    os._exit(00)

def save_session():
    init_drive()
    dill.dump_session(backup_path)

def load_session():
    init_drive()
    dill.load_session(backup_path)
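Hypothetical usage (not part of the original snippet): call save_session() before the VM might be recycled, and load_session() in a fresh session to restore your variables:

x = 42                 # ...work in the notebook...
save_session()         # pickles the interpreter state to Drive

# later, in a new VM, after re-running the snippet above:
load_session()
print(x)               # 42, restored from the backup on Drive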
Edit: This works fine as long as your session size is not too big. You need to check whether it works for you.
I was interested in importing a module in a separate .py file.
What I ended up doing was copying the .py file's contents into the first cell of my notebook and adding the following text as the first line:
%%writefile mymodule.py
This creates a separate file named mymodule.py in the working directory so your notebook can use it with an import line.
I know that running all of the code from the module in the notebook would make its variables and functions available there, but my code required importing a module, so this was good enough for me.
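A minimal sketch of the pattern (the module and function names are hypothetical) — first cell:

%%writefile mymodule.py
# The whole cell is written to mymodule.py in the working directory
def greet(name):
    return f"Hello, {name}!"

Then, in a later cell:

import mymodule
print(mymodule.greet("Colab"))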
Is it possible to create a new excel spreadsheet file and save it to an Amazon S3 bucket without first saving to a local filesystem?
For example, I have a Ruby on Rails web application which now generates Excel spreadsheets using the write_xlsx gem and saves them to the server's local file system. Internally, it looks like the gem uses Ruby's IO.copy_stream when it saves the spreadsheet. I'm not sure this will work when moving to Heroku and S3.
Has anyone done this before using Ruby or even Python?
I found this earlier question, Heroku + ephemeral filesystem + AWS S3. So it would seem this is not possible using Heroku. Theoretically, it would be possible using a service which allows attaching an Amazon EBS volume.
There is a dedicated Ruby gem to help you move files to Amazon S3:
https://rubygems.org/gems/aws-s3
If you want more details about the implementation, here is the git repository. The documentation on the page is very complete and explains how to move files to S3. Hope it helps.
Once your xls file is created, the library helps you create an S3Object and store it into a bucket (which you can also create with the library).
S3Object.store('keyOfYourData', open('nameOfExcelFile.xls'), 'bucketName')
If you want more options, Amazon also provides an official gem for this purpose: https://rubygems.org/gems/aws-sdk
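On the Python side of the question, one way to avoid the local filesystem entirely (a sketch using openpyxl and boto3, with hypothetical bucket and key names, not what the original Rails app uses) is to build the workbook in an in-memory buffer and upload that buffer to S3:

import io
import boto3
from openpyxl import Workbook

# Build the spreadsheet entirely in memory
wb = Workbook()
ws = wb.active
ws.append(["name", "value"])
ws.append(["example", 42])

buffer = io.BytesIO()
wb.save(buffer)          # openpyxl can write to a file-like object
buffer.seek(0)

# Upload the buffer straight to S3 without touching the local disk
s3 = boto3.client("s3")
s3.upload_fileobj(buffer, "my-bucket", "reports/report.xlsx")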