Reading SAS (sas7bdat) files from AWS S3 with Glue Job - python

I have been trying to read SAS (sas7bdat) files from AWS S3 with a Glue Job. For this, I found a library to read the files (https://github.com/openpharma/sas7bdat). However, when I try to read these files, my job doesn't find the directory, even though the directory exists and the file is inside it.
When I check the logs, the error looks related to Java/JAR. I am a beginner with both AWS and SAS files. How can I read SAS files with Glue? Is there an easier way?
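One possible approach, as a minimal sketch rather than a definitive answer: if the job can run as a Glue Python shell job with pandas available, you can copy the object from S3 to local disk with boto3 and read it with pandas.read_sas. The bucket, key, and local path below are placeholders.

import boto3
import pandas as pd

bucket = "my-bucket"                      # placeholder
key = "path/to/my_file.sas7bdat"          # placeholder
local_path = "/tmp/my_file.sas7bdat"

# Copy the object from S3 to the job's local disk
s3 = boto3.client("s3")
s3.download_file(bucket, key, local_path)

# Read the SAS file into a pandas DataFrame
df = pd.read_sas(local_path, format="sas7bdat")
print(df.head())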

Related

Can't read directly from pandas on GCP Databricks

Usually on Databricks on Azure/AWS, to read files stored on Azure Blob/S3, I would mount the bucket or blob storage and then do the following:
If using Spark
df = spark.read.format('csv').load('/mnt/my_bucket/my_file.csv', header="true")
If using pandas directly, adding /dbfs to the path:
df = pd.read_csv('/dbfs/mnt/my_bucket/my_file.csv')
I am trying to do the exact same thing on the hosted version of Databricks with GCP. Though I successfully manage to mount my bucket and read it with Spark, I am not able to do it with pandas directly; adding the /dbfs prefix does not work, and I get a No such file or directory: ... error.
Has anyone encountered a similar issue? Am I missing something?
Also when I do
%sh
ls /dbfs
It returns nothing, though in the UI I can see the DBFS browser with my mounted buckets and files.
Thanks for the help
It's documented in the list of features not released yet:
DBFS access to local file system (FUSE mount).
For DBFS access, the Databricks dbutils commands, Hadoop Filesystem APIs such as the %fs command, and Spark read and write APIs are available. Contact your Databricks representative for any questions.
So you'll need to copy the file to the local disk before reading it with pandas:
dbutils.fs.cp("/mnt/my_bucket/my_file.csv", "file:/tmp/my_file.csv")
df = pd.read_csv('/tmp/my_file.csv')

upload folders inside a folder to google cloud using python

Previously I was working in AWS and I am new to Google Cloud. In AWS there was a way to upload directories/folders to a bucket, but after a bit of research I couldn't find an equivalent for a Google Cloud bucket. Can someone help me? I would like to upload some folders (not files) inside a folder to Google Cloud using Python. How can I do that?
To achieve this, you need to upload the content of each directory file by file, replicating the local path structure in your GCS bucket, as in the sketch below.
Note: directories don't really exist in GCS; a "directory" is simply a set of objects sharing the same file path prefix, presented as a directory in the UI.
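A minimal sketch of that approach with the google-cloud-storage client; the bucket name, local folder, and prefix below are placeholders:

import os
from google.cloud import storage

def upload_directory(local_dir, bucket_name, prefix=""):
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    # Walk the local tree and upload each file, keeping the relative path as the object name
    for root, _, files in os.walk(local_dir):
        for filename in files:
            local_path = os.path.join(root, filename)
            relative_path = os.path.relpath(local_path, local_dir)
            blob_name = os.path.join(prefix, relative_path).replace(os.sep, "/")
            bucket.blob(blob_name).upload_from_filename(local_path)

upload_directory("local_dir", "my-bucket", prefix="backup")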

Zipping the files in S3

I have some text files in an S3 location and am trying to compress and zip each of them. I was able to zip and compress a file in a Jupyter notebook by selecting it from my local machine, but when I try the same code against S3, it throws an error saying the file is missing. Could someone please help?
Amazon S3 does not have a zip/compress function.
You will need to download the files, zip them on an Amazon EC2 instance or your own computer, then upload the result.
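A minimal sketch of that download/zip/upload round trip with boto3 and the standard zipfile module; the bucket and key names are placeholders:

import zipfile
import boto3

s3 = boto3.client("s3")
bucket = "my-bucket"            # placeholder
key = "folder/my_file.txt"      # placeholder

# Download the text file to local disk
s3.download_file(bucket, key, "/tmp/my_file.txt")

# Compress it into a zip archive locally
with zipfile.ZipFile("/tmp/my_file.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    zf.write("/tmp/my_file.txt", arcname="my_file.txt")

# Upload the archive back to S3
s3.upload_file("/tmp/my_file.zip", bucket, "folder/my_file.zip")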

How can I update a CSV stored on AWS S3 with a Python script?

I have a CSV which is stored in an AWS S3 bucket and is used to store information which gets loaded into an HTML document via some jQuery.
I also have a Python script, currently sitting on my local machine, ready to be used. This script scrapes another website and saves the information to the CSV file, which I then upload to my AWS S3 bucket.
I am trying to figure out a way to have the Python script run nightly and overwrite the CSV stored in the S3 bucket. I cannot seem to find a similar solution to my problem online, and I am vastly out of my depth when it comes to AWS.
Does anyone have any solutions to this problem?
Cheapest way: Modify your Python script to work as an AWS Lambda function, then schedule it to run nightly.
Easiest way: Spin up an EC2 instance, copy the script to the instance, and schedule it to run nightly via cron.
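As a rough illustration of the Lambda route (not a complete implementation): the handler below assumes your scraping logic already produces the CSV content as a string and that the function's role has write access to the bucket. Bucket and key names are placeholders; the nightly schedule itself would be configured separately, e.g. with an EventBridge rule.

import boto3

BUCKET = "my-bucket"          # placeholder
KEY = "data/my_file.csv"      # placeholder

def scrape_site():
    # Placeholder for your existing scraping logic; should return CSV text
    return "col1,col2\nvalue1,value2\n"

def lambda_handler(event, context):
    csv_body = scrape_site()
    # Overwrite the existing object in S3 with the fresh CSV
    boto3.client("s3").put_object(Bucket=BUCKET, Key=KEY, Body=csv_body.encode("utf-8"))
    return {"status": "ok"}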

How to save excel file to amazon s3 from python or ruby

Is it possible to create a new excel spreadsheet file and save it to an Amazon S3 bucket without first saving to a local filesystem?
For example, I have a Ruby on Rails web application which currently generates Excel spreadsheets using the write_xlsx gem and saves them to the server's local file system. Internally, it looks like the gem uses Ruby's IO.copy_stream when it saves the spreadsheet. I'm not sure this will work if moving to Heroku and S3.
Has anyone done this before using Ruby or even Python?
I found this earlier question, Heroku + ephemeral filesystem + AWS S3. So it would seem this is not possible using Heroku. Theoretically, it would be possible using a service which allows attaching an Amazon EBS volume.
There is a dedicated Ruby gem to help you move files to Amazon S3:
https://rubygems.org/gems/aws-s3
If you want more details about the implementation, see its git repository. The documentation on that page is very complete and explains how to move files to S3. Hope it helps.
Once your xls file is created, the library helps you create an S3Object and store it in a Bucket (which you can also create with the library).
S3Object.store('keyOfYourData', open('nameOfExcelFile.xls'), 'bucketName')
If you want more choice, Amazon also provides an official gem for this purpose: https://rubygems.org/gems/aws-sdk
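On the Python side, one way to avoid the local filesystem entirely is to build the workbook in memory and upload the buffer directly. A minimal sketch with openpyxl and boto3, assuming placeholder bucket and key names:

from io import BytesIO
import boto3
from openpyxl import Workbook

# Build the spreadsheet entirely in memory
wb = Workbook()
ws = wb.active
ws.append(["name", "value"])
ws.append(["example", 42])

buffer = BytesIO()
wb.save(buffer)      # openpyxl can write to a file-like object
buffer.seek(0)

# Upload the in-memory bytes straight to S3 (bucket/key are placeholders)
boto3.client("s3").put_object(
    Bucket="my-bucket",
    Key="reports/report.xlsx",
    Body=buffer.getvalue(),
)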
