How to read a parquet file into a PCollection from s3? - python

My problem is simple: I want to read a parquet file from S3 into a PCollection in Apache Beam using the Python SDK.
I know of the apache_beam.io.parquetio module, but it does not seem to be able to read from S3 directly (or does it?).
I know of the apache_beam.io.aws.s3io module, but it seems to return an S3 file object, or at any rate something that is not a PCollection (or does it?).
So what’s the best way to do this?

If you install Beam with the aws extra,
pip install 'apache-beam[aws]'
you can just pass an S3 path to read from it:
filename = "s3://bucket-name/..."
beam.io.ReadFromParquet(filename)
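For context, here is a minimal pipeline sketch built on that snippet. The bucket path is a placeholder, and I'm assuming boto3 can pick up your AWS credentials from the environment (or from Beam's S3 pipeline options).

import apache_beam as beam

with beam.Pipeline() as p:
    (
        p
        # ReadFromParquet yields one dict per Parquet record
        | "ReadParquet" >> beam.io.ReadFromParquet("s3://bucket-name/path/to/data.parquet")
        | "Print" >> beam.Map(print)
    )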

Related

s3fs with pandas, can we cache files automatically with native implementation?

I recently migrated my workflow from AWS to a local computer. The files I need are still stored in private S3 buckets. I've been able to set up my environment variables correctly, so all I need to do is import s3fs and then I can read files very conveniently in pandas like this:
pd.read_csv('S3://my-bucket/some-file.csv')
And it works perfectly. This is nice because I don't need to change any code and reading/writing files works well.
However, reading files from S3 is incredibly slow, even more so now that I'm working locally. From googling around, it appears that s3fs supports caching files locally: after the first read from S3, s3fs can store the file locally, and the next read uses the local copy and is much faster. This is perfect for my workflow, where I iterate on the same data many times.
However, I can't find anything about how to set this up with the native s3fs implementation pandas uses. This post describes how to cache files with s3fs, but the wiki linked in the answer is for something called fuse-s3fs. I don't see a way to specify a use_cache option in native s3fs.
In pandas all the s3fs setup is done behind the scenes, and right now it seems that, by default, when I read a file from S3 and then read the same file again, it takes just as long, so I don't believe any caching is taking place.
Does anyone know how to set up pandas with s3fs so that it caches all files it has read?
Thanks!
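A rough sketch of one way this can be done through fsspec's caching layer rather than anything pandas-specific, assuming a reasonably recent pandas (>= 1.2) with fsspec and s3fs installed; the cache directory and bucket path below are placeholders:

import pandas as pd

# The "simplecache::" prefix wraps the S3 filesystem in a local cache:
# the first read downloads the file, later reads use the cached copy.
df = pd.read_csv(
    "simplecache::s3://my-bucket/some-file.csv",
    storage_options={"simplecache": {"cache_storage": "/tmp/s3-cache"}},
)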

Reading SAS (sas7bdat) files from AWS S3 with Glue Job

I have been trying to read SAS (sas7bdat) files from AWS S3 with a Glue job. For this, I found a lib to read the files (https://github.com/openpharma/sas7bdat). However, when I try to read these files, my job doesn't find the directory, even though the directory exists and the file is inside it.
When I check the logs, the error looks related to Java/JARs. I am a beginner with both AWS and SAS files. How can I read SAS files with Glue? Is there an easier way?
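As an alternative to the JAR-based reader, a Glue Python shell job can read sas7bdat with pandas' built-in SAS reader. A rough sketch, assuming boto3 access to the bucket and a file small enough to buffer in memory; the bucket and key are placeholders:

import io

import boto3
import pandas as pd

s3 = boto3.client("s3")
obj = s3.get_object(Bucket="my-bucket", Key="path/to/file.sas7bdat")

# pandas ships a native sas7bdat parser; format must be given for a buffer
df = pd.read_sas(io.BytesIO(obj["Body"].read()), format="sas7bdat")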

Python AWS Glue log says "Considering file without prefix as a python extra file" for uploaded python zip packages

In AWS Glue I have a small pandas job that reads data from XLSX and writes it to CSV. As per the Glue Python instructions, I zipped the required libraries and provided them as packages to the Glue job at execution time.
Question: What do the following logs convey?
Considering file without prefix as a python extra file s3://raw-data/sampath/scripts/s3fs/fsspec.zip
Considering file without prefix as a python extra file s3://raw-data/sampath/scripts/s3fs/jmespath.zip
Considering file without prefix as a python extra file s3://raw-data/sampath/scripts/s3fs/s3fs.zip
....
Please elaborate with an example?
In Python shell jobs, you should add external libraries as an egg file, not a zip file. Zip files are for Spark jobs.
I also wrote a small shell script to deploy a Python shell job without the manual steps of creating the egg file, uploading it to S3 and deploying via CloudFormation; the script does all of this automatically. You can find the code at https://github.com/fatangare/aws-python-shell-deploy. The example job takes a CSV file and converts it into an Excel file using the pandas and xlsxwriter libraries.
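For reference, an egg can be built from a plain setup.py and then referenced in the job's Python library path (--extra-py-files). A minimal sketch; the package name and S3 destination are placeholders, not taken from the script above:

# setup.py for the extra libraries (placeholder package name)
from setuptools import setup, find_packages

setup(
    name="my_glue_deps",
    version="0.1",
    packages=find_packages(),
)

# Then, roughly:
#   python setup.py bdist_egg
#   aws s3 cp dist/my_glue_deps-0.1-*.egg s3://my-bucket/glue/libs/
# and point the job's Python library path at the uploaded egg.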

Install mysql-client inside a zip

What I am trying to do is use aws-lambda to import zipped SQL files into aws-rds. In my case, zipped SQL files are constantly dropped into S3 by some crawlers. What I want is: whenever a SQL file is uploaded to the S3 bucket, aws-lambda should use a mysql client to import it into aws-rds.
The way I have thought of doing this is by packaging a mysql client inside the zip for the aws-lambda handler, but I can't figure out how to package mysql inside a zip. Is this possible? If so, a list of steps to achieve it would be really helpful!
PS: I am using python-2.7 for the aws-lambda handler. I am not interested in using any python-mysql library for this task, because I don't want to unzip the files, load them into memory and then execute them. These files can be very large, so I don't want to load them into memory.
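For what it's worth, a rough sketch of what the handler side could look like, assuming you do manage to bundle a statically linked mysql binary at bin/mysql inside the deployment zip; the RDS endpoint, credentials and database name are placeholders:

import os
import shutil
import subprocess
import zipfile

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # Triggered by the S3 upload event for a zipped .sql file
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]

    local_zip = "/tmp/dump.zip"
    s3.download_file(bucket, key, local_zip)

    mysql_bin = os.path.join(os.path.dirname(os.path.abspath(__file__)), "bin", "mysql")

    with zipfile.ZipFile(local_zip) as zf:
        sql_name = zf.namelist()[0]
        with zf.open(sql_name) as sql_stream:
            proc = subprocess.Popen(
                [mysql_bin, "-h", "my-rds-endpoint", "-u", "admin",
                 "-pPLACEHOLDER", "mydatabase"],
                stdin=subprocess.PIPE,
            )
            # Stream the decompressed SQL into the client in chunks,
            # so the dump is never held in memory in full
            shutil.copyfileobj(sql_stream, proc.stdin)
            proc.stdin.close()
            if proc.wait() != 0:
                raise RuntimeError("mysql import failed for %s" % key)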

How to save excel file to amazon s3 from python or ruby

Is it possible to create a new excel spreadsheet file and save it to an Amazon S3 bucket without first saving to a local filesystem?
For example, I have a Ruby on Rails web application which currently generates Excel spreadsheets using the write_xlsx gem and saves them to the server's local file system. Internally, it looks like the gem uses Ruby's IO.copy_stream when it saves a spreadsheet. I'm not sure this will keep working after moving to Heroku and S3.
Has anyone done this before using Ruby or even Python?
I found this earlier question, Heroku + ephemeral filesystem + AWS S3. So, it would seem this is not possible using Heroku. Theoretically, it would be possible using a service which allows adding an Amazon EBS.
There is a dedicated Ruby gem to help you move files to Amazon S3:
https://rubygems.org/gems/aws-s3
If you want more details about the implementation, see the gem's git repository. The documentation there is very complete and explains how to move files to S3. Hope it helps.
Once your xls file is created, the library helps you create an S3Object and store it in a bucket (which you can also create with the library).
S3Object.store('keyOfYourData', open('nameOfExcelFile.xls'), 'bucketName')
If you want more options, Amazon also provides an official gem for this purpose: https://rubygems.org/gems/aws-sdk
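On the Python side, the same no-local-file idea works by writing the workbook to an in-memory buffer and uploading that with boto3. A minimal sketch; the bucket and key are placeholders:

import io

import boto3
import xlsxwriter

buffer = io.BytesIO()
# in_memory keeps xlsxwriter from writing temporary files to disk
workbook = xlsxwriter.Workbook(buffer, {"in_memory": True})
worksheet = workbook.add_worksheet()
worksheet.write("A1", "hello from S3")
workbook.close()

buffer.seek(0)
boto3.client("s3").put_object(
    Bucket="my-bucket", Key="reports/report.xlsx", Body=buffer.getvalue()
)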
