s3fs with pandas, can we cache files automatically with native implementation? - python

I recently migrated my workflow from AWS to a local computer. The files I need are still stored in private S3 buckets. I've been able to set up my environment variables correctly, so all I need to do is import s3fs and then I can read files very conveniently in pandas like this:
pd.read_csv('s3://my-bucket/some-file.csv')
And it works perfectly. This is nice because I don't need to change any code and reading/writing files works well.
However, reading files from S3 is incredibly slow, and even more so now that I'm working locally. From googling around I've found that s3fs appears to support caching files locally: after the first read from S3, s3fs can store the file on disk, and the next time we read it, it comes from the local copy and is much faster. This is perfect for my workflow, where I will be iterating on the same data many times.
However, I can't find anything about how to set this up with the pandas native s3fs implementation. This post describes how to cache files with s3fs; however, the wiki linked in the answer is for something called fuse-s3fs. I don't see a way to specify a use_cache option in the native Python s3fs.
In pandas, all the s3fs setup is done behind the scenes, and it seems that by default, when I read a file from S3 and then read the same file again, it takes just as long, so I don't believe any caching is taking place.
Does anyone know how to set up pandas with s3fs so that it caches all files that it has read?
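For reference, the closest thing I've found is fsspec's chained "filecache" protocol, which recent pandas (1.2+) can supposedly reach through storage_options. I'm not sure this counts as the native s3fs implementation, and the cache directory below is just a placeholder, but this is roughly what I mean:

import pandas as pd

# Prefix the URL with fsspec's "filecache" protocol so the object is downloaded
# once into a local directory and re-read from there on subsequent calls.
df = pd.read_csv(
    'filecache::s3://my-bucket/some-file.csv',
    storage_options={
        'filecache': {'cache_storage': '/tmp/s3_cache'},  # placeholder local cache dir
        's3': {'anon': False},  # credentials still come from my environment variables
    },
)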
Thanks!

Related

azure functions onedrivesdk python

I need to constantly poll a OneDrive, and when a file is dropped, I need to perform some operations on it and then re-upload it to a different folder on the same OneDrive. I thought of using Azure Functions to download the file to a blob and then re-upload it from there. However, the onedrivesdk (https://github.com/OneDrive/onedrive-sdk-python) for Python is not maintained anymore. Someone in the GitHub issues suggested installing onedrivesdk_fork, but that doesn't work either. How should I move forward? Are there any alternatives?
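One alternative sometimes suggested is calling the Microsoft Graph REST API directly with plain requests instead of an SDK. The sketch below is only illustrative and untested: it assumes you already have an OAuth access token (e.g. via MSAL), and the folder names are placeholders.

import requests

GRAPH = 'https://graph.microsoft.com/v1.0'

def poll_and_move(token, src_folder='Incoming', dst_folder='Processed'):
    headers = {'Authorization': 'Bearer ' + token}

    # List the files currently sitting in the source folder.
    listing = requests.get(
        GRAPH + '/me/drive/root:/{}:/children'.format(src_folder),
        headers=headers).json()

    for item in listing.get('value', []):
        if 'file' not in item:
            continue  # skip subfolders

        # Download the file content.
        content = requests.get(
            GRAPH + '/me/drive/items/{}/content'.format(item['id']),
            headers=headers).content

        processed = content  # ...perform your operations here...

        # Re-upload the result into the destination folder (simple upload, small files only).
        requests.put(
            GRAPH + '/me/drive/root:/{}/{}:/content'.format(dst_folder, item['name']),
            headers=headers, data=processed)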

Persisting data in Google Colaboratory

Has anyone figured out a way to keep files persisted across sessions in Google's newly open sourced Colaboratory?
Using the sample notebooks, I'm successfully authenticating and transferring csv files from my Google Drive instance and have stashed them in /tmp, my ~, and ~/datalab. Pandas can read them just fine off of disk too. But once the session times out, it looks like the whole filesystem is wiped and a new VM is spun up, without the downloaded files.
I guess this isn't surprising given Google's Colaboratory FAQ:
Q: Where is my code executed? What happens to my execution state if I close the browser window?
A: Code is executed in a virtual machine dedicated to your account. Virtual machines are recycled when idle for a while, and have a maximum lifetime enforced by the system.
Given that, maybe this is a feature (i.e. "go use Google Cloud Storage, which works fine in Colaboratory")? When I first used the tool, I was hoping that any .csv files in the My File/Colab Notebooks Google Drive folder would also be loaded onto the VM instance that the notebook was running on :/
Put this before your code, so it will always download your file before your code runs:
!wget -q http://www.yoursite.com/file.csv
Your interpretation is correct. VMs are ephemeral and recycled after periods of inactivity. There's no mechanism for persistent data on the VM itself right now.
In order for data to persist, you'll need to store it somewhere outside of the VM, e.g., Drive, GCS, or any other cloud hosting provider.
Some recipes for loading and saving data from external sources are available in the I/O example notebook.
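For instance, a minimal sketch of pulling a file from GCS onto the Colab VM (the bucket and object names are placeholders, and this assumes you already have a GCS bucket set up):

from google.colab import auth
auth.authenticate_user()  # grants this VM access to your Google Cloud resources

# Copy the object from GCS onto the (ephemeral) VM disk; placeholder bucket/object names.
!gsutil cp gs://my-bucket/my-data.csv /content/my-data.csv

import pandas as pd
df = pd.read_csv('/content/my-data.csv')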
Not sure whether this is the best solution, but you can sync your data between Colab and Drive with automated authentication like this: https://gist.github.com/rdinse/159f5d77f13d03e0183cb8f7154b170a
Include this for files in your Google Drive:
from google.colab import drive
drive.mount('/content/drive')
After it runs you will see it mounted in your files tab and you can access your files with the path:
'/content/drive/MyDrive/<your folder inside drive>/file.ext'
Clouderizer may provide some data persistence, at the cost of a long setup (because you use Google Colab only as a host) and little space to work with.
But in my opinion, that's better than having your file(s) "recycled" when you forget to save your progress.
As you pointed out, Google Colaboratory's file system is ephemeral. There are workarounds, though there's a network latency penalty and code overhead: e.g. you can use boilerplate code in your notebooks to mount external file systems like GDrive (see their example notebook).
Alternatively, while this is not supported in Colaboratory, other Jupyter hosting services – like Jupyo – provision dedicated VMs with persistent file systems so the data and the notebooks persist across sessions.
If anyone's interested in saving and restoring the whole session, here's a snippet I'm using that you might find useful:
import os
import dill
from google.colab import drive

backup_dir = 'drive/My Drive/colab_sessions'
backup_file = 'notebook_env.db'
backup_path = os.path.join(backup_dir, backup_file)

def init_drive():
    # Mount Google Drive and create the backup directory if it doesn't exist.
    drive.mount('drive')
    if not os.path.exists(backup_dir):
        os.makedirs(backup_dir)

def restart_kernel():
    os._exit(0)

def save_session():
    init_drive()
    dill.dump_session(backup_path)

def load_session():
    init_drive()
    dill.load_session(backup_path)
Edit: This works fine as long as your session size is not too big. You need to check whether it works for you.
I was interested in importing a module in a separate .py file.
What I ended up doing is copying the .py file contents to the first cell in my notebook, adding the following text as the first line:
%%writefile mymodule.py
This creates a separate file named mymodule.py in the working directory so your notebook can use it with an import line.
I know that running all of the code in the module would make its variables and functions available in the notebook, but my code required importing it as a module, so that was good enough for me.
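For instance, a cell along these lines (the module name and function are just examples) writes the file:

%%writefile mymodule.py
# contents of the original .py file go here
def greet(name):
    return 'Hello, ' + name

and a later cell can then use it as a normal import:

import mymodule
mymodule.greet('Colab')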

Deleting a csv file which is created using numpy.savetxt in pyspark

I am new to pyspark and python.
After saving a file on the local system using numpy.savetxt("test.csv", file, delimiter=',')
I am using os to delete that file: os.remove("test.csv"). I am getting an error: java.io.FileNotFoundException: File file:/someDir/test.csv does not exist. numpy.savetxt() seems to create the file with only read permission. How can I save it with read and write permission?
I am using Spark version 2.1.
It looks like your Spark workers are not able to access the file. You are probably running the master and workers on different servers. When working with files on a cluster of machines, make sure all the workers can access them. You could keep an identical copy of the file on every worker in exactly the same location, but it is generally advisable to use a distributed file system such as HDFS, e.g. "hdfs://path/file". When you do, the workers can access those files.
More details on:
Spark: how to use SparkContext.textFile for local file system
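A rough sketch of that idea, assuming an HDFS cluster and that the hdfs command-line client is on the path (the HDFS path is a placeholder):

import os
import subprocess
import numpy as np

data = np.arange(12).reshape(3, 4)           # example array
np.savetxt('test.csv', data, delimiter=',')  # save locally on the driver, as before

# Push the file to HDFS so every worker can reach it, then let Spark read it
# from there instead of from the driver's local disk.
subprocess.check_call(['hdfs', 'dfs', '-put', '-f', 'test.csv', '/user/me/test.csv'])
df = spark.read.csv('hdfs:///user/me/test.csv')  # `spark` is the SparkSession from the pyspark shell

# The local copy on the driver can now be removed without affecting the Spark job.
os.remove('test.csv')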

Install mysql-client inside a zip

What I am trying to do is use aws-lambda to import zipped sql files into aws-rds. In my case, zipped sql files are inserted into s3 constantly by some crawlers. What I want is that when any sql file is uploaded to an s3 bucket, aws-lambda uses a mysql client to import it into aws-rds.
The way I have thought of doing this is by packaging a mysql client inside the zip for the aws-lambda handler. But I can't really figure out how to package mysql inside a zip. Is this possible? If yes, a list of steps to achieve this would be really helpful!
PS: I am using python-2.7 for writing the aws-lambda handler. I am not interested in using any python-mysql library to achieve this task. The reason is that I don't want to unzip the files, load them into memory and then execute them. These files can be very large, so I don't want to load them into memory.
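To make the streaming idea concrete, here is only a rough, untested sketch of what such a handler could look like, assuming the deployment zip ships a statically linked mysql binary under bin/ and that the uploaded dumps are gzip-compressed SQL; every name, path and environment variable here is a placeholder.

import os
import subprocess
import zlib
import boto3

s3 = boto3.client('s3')

def handler(event, context):
    # The S3 event notification tells us which object was uploaded.
    record = event['Records'][0]['s3']
    bucket = record['bucket']['name']
    key = record['object']['key']

    # Launch the mysql client bundled inside the deployment zip (placeholder path).
    mysql = subprocess.Popen(
        ['./bin/mysql',
         '-h', os.environ['RDS_HOST'],
         '-u', os.environ['RDS_USER'],
         '-p' + os.environ['RDS_PASSWORD'],
         os.environ['RDS_DATABASE']],
        stdin=subprocess.PIPE)

    # Stream the object from S3, decompress it chunk by chunk (assuming gzip),
    # and pipe it straight into mysql so the dump never sits fully in memory.
    body = s3.get_object(Bucket=bucket, Key=key)['Body']
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)  # gzip wrapper
    for chunk in iter(lambda: body.read(1024 * 1024), b''):
        mysql.stdin.write(decompressor.decompress(chunk))
    mysql.stdin.write(decompressor.flush())
    mysql.stdin.close()
    return mysql.wait()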

How to save excel file to amazon s3 from python or ruby

Is it possible to create a new excel spreadsheet file and save it to an Amazon S3 bucket without first saving to a local filesystem?
For example, I have a Ruby on Rails web application which currently generates Excel spreadsheets using the write_xlsx gem and saves them to the server's local file system. Internally, it looks like the gem uses Ruby's IO.copy_stream when it saves the spreadsheet. I'm not sure this will still work if we move to Heroku and S3.
Has anyone done this before using Ruby or even Python?
I found this earlier question, Heroku + ephemeral filesystem + AWS S3. So, it would seem this is not possible using Heroku. Theoretically, it would be possible using a service which allows adding an Amazon EBS.
There is a dedicated Ruby gem to help you move files to Amazon S3:
https://rubygems.org/gems/aws-s3
If you want more details about the implementation, here is the git repository. The documentation on that page is very complete and explains how to move files to S3. Hope it helps.
Once your xls file is created, the library helps you create an S3Object and store it in a bucket (which you can also create with the library).
S3Object.store('keyOfYourData', open('nameOfExcelFile.xls'), 'bucketName')
If you want more options, Amazon also provides an official gem for this purpose: https://rubygems.org/gems/aws-sdk
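Since the question also mentions Python, here is a rough equivalent sketch in that language, writing the workbook to an in-memory buffer and uploading it with boto3 so nothing touches the local filesystem (bucket and key names are placeholders; assumes pandas with an xlsx engine such as openpyxl installed):

import io
import boto3
import pandas as pd

df = pd.DataFrame({'name': ['alice', 'bob'], 'score': [90, 85]})  # example data

# Write the workbook into an in-memory buffer instead of a file on disk.
buffer = io.BytesIO()
with pd.ExcelWriter(buffer, engine='openpyxl') as writer:
    df.to_excel(writer, index=False, sheet_name='Report')

# Upload the buffer's contents straight to S3 (placeholder bucket/key).
s3 = boto3.client('s3')
s3.put_object(Bucket='my-bucket', Key='reports/report.xlsx', Body=buffer.getvalue())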
