S3 Bucket .txt.gz Copy Via PySpark - python

I am using Python 2 (Jupyter notebook running PySpark on EMR). I am trying to load some data as a dataframe in order to map/reduce it and output it to my own S3 bucket.
I typically use this command:
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('file:///home/path/datafolder/data2014/*.csv')
This is failing when the file is in S3 rather than in my own bucket (I am not sure how to format the .load command for an S3 path), which is most of my use cases now. My files are also a mix of .csv and .txt.gz, both of which I want in csv format (unzipped) when copied over.
I had a look on Google and tried the following commands in Python 2 (Jupyter notebook):
import os
import findspark
findspark.init('/usr/lib/spark/')
from pyspark import SparkContext, SQLContext
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
import sys
if sys.version_info[0] >= 3:
    from urllib.request import urlretrieve
else:
    from urllib import urlretrieve
# Get file from URL like this:
urlretrieve("https://s3.amazonaws.com/bucketname/path/path2/path3/path4/path3/results.txt.gz")
This simply outputs ('/tmp/tmpmDB1EC.gz', <httplib.HTTPMessage instance at 0x7f54db894758>), so I'm unsure what to do now.
I have read through the documentation, and searched this website and Google for simple methods on forming the df but am stuck. I also read up about using my AWS key / secret key (which I have) but I could not find an example to follow.
Can someone kindly help me out?

You need to load it using the Spark context. Note that urlretrieve returns a (local_filename, headers) tuple, so pass the local path to textFile():
local_path, _ = urlretrieve("https://s3.amazonaws.com/bucketname/path/path2/path3/path4/path3/results.txt.gz")
raw_data = sc.textFile('file://' + local_path)
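For reference (not part of the original answer): Spark can usually read the S3 objects in place, and .gz files are decompressed transparently, so there is no need to download them first. A minimal sketch using the s3a:// scheme (on EMR the plain s3:// scheme also works through EMRFS with the instance role); the key/secret lines are only needed if the role does not already grant access, and the bucket names and paths are placeholders:
# Optional: only needed if the EMR instance role does not already grant access.
# sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", MY_ACCESS_KEY)
# sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", MY_SECRET_KEY)

# CSV files straight from S3, using the same reader as for local files:
df = sqlContext.read.format('com.databricks.spark.csv') \
    .options(header='true', inferschema='true') \
    .load('s3a://bucketname/path/datafolder/*.csv')

# .txt.gz files: textFile() handles the gzip decompression transparently:
raw = sc.textFile('s3a://bucketname/path/path2/path3/path4/path3/results.txt.gz')

# After your map/reduce steps, write the result to your own bucket as CSV:
df.write.format('com.databricks.spark.csv').option('header', 'true') \
    .save('s3a://my-own-bucket/output-folder/')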

Related

Fill Pandas Dataframe using Databricks

I have some code which I'm writing to solve a problem for our DevOps team; the first part is to help them locate files on blob storage, as they only have Azure Explorer.
I will create a dataset which lists all the files in a certain directory; this can be parameterised.
This will hopefully be a triggered pipeline where the DevOps team can earmark which files they want to interact with and move/copy the file to a different location, all of which will be tracked and audited. However, I'm stuck at the first hurdle.
Below is some code with which I'm trying to populate a dataframe, but I only end up with 1 row. I've tried following a few posts on here but I seem to be missing something.
Can anyone spot where I went wrong?
Note: walk_dirz is a function which uses
dbutils.fs.ls(dir_path)
to loop through directories looking for JSON and TXT files only.
Here is the part where I'm stuck:
import os
import datetime
import pathlib
#from datetime import datetime as dt
import time
import pandas as pd
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
os_file_list = walk_dirz('dbfs:/mnt/landing/raw/NewData/Deltas')
df1 = pd.DataFrame()
#create dataframe
df = pd.DataFrame()
for i in os_file_list:
    print(i)
    pn = i
    new_row = {'path': pn}
    df1 = df.append(new_row, ignore_index=True)
This looked like it worked:
for i in os_file_list:
    df = pd.DataFrame(os_file_list, columns=['Path'])
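For reference (not part of the original post): the single row comes from appending to the always-empty df inside the loop instead of accumulating into df1. A minimal sketch of a fix, assuming walk_dirz returns a plain list of path strings as in the post:
import pandas as pd

# walk_dirz is the poster's helper built on dbutils.fs.ls(); assumed to return a list of paths.
os_file_list = walk_dirz('dbfs:/mnt/landing/raw/NewData/Deltas')

# Build all the rows first, then construct the dataframe once
# (this also avoids the per-row DataFrame.append pattern, which is deprecated in newer pandas).
rows = [{'path': p} for p in os_file_list]
df1 = pd.DataFrame(rows, columns=['path'])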

How do I read a gzipped parquet file from S3 into Python using Boto3?

I have a file called data.parquet.gzip in my S3 bucket. I can't figure out how to read it. Normally I've worked with StringIO, but I don't know how to handle this case. I want to import it from S3 into my Python Jupyter notebook session using pandas and boto3.
The solution is actually quite straightforward.
import boto3              # For read+push to S3 bucket
import pandas as pd       # Reading parquets
from io import BytesIO    # Converting bytes to a bytes input file
import pyarrow            # Fast reading of parquets

# Set up your S3 client
# Ideally your Access Key and Secret Access Key are stored in a file already,
# so you don't have to specify these parameters explicitly.
s3 = boto3.client('s3',
                  aws_access_key_id=ACCESS_KEY_HERE,
                  aws_secret_access_key=SECRET_ACCESS_KEY_HERE)

# Get the object from S3
s3_response_object = s3.get_object(Bucket=BUCKET_NAME_HERE, Key=KEY_TO_GZIPPED_PARQUET_HERE)

# Read the body, i.e. convert it from a stream to bytes using .read()
raw_bytes = s3_response_object['Body'].read()

# Parse the bytes with pandas via BytesIO
df = pd.read_parquet(BytesIO(raw_bytes))
If you are using an IDE on your laptop/PC to connect to AWS S3, you may refer to Corey's first solution:
import boto3
import pandas as pd
import io

s3 = boto3.resource(service_name='s3', region_name='XXXX',
                    aws_access_key_id='YYYY', aws_secret_access_key='ZZZZ')

buffer = io.BytesIO()
object = s3.Object(bucket_name='bucket_name', key='path/to/your/file.parquet')
object.download_fileobj(buffer)
df = pd.read_parquet(buffer)
If you are using a Glue job, you may refer to Corey's second solution in the Glue script:
df = pd.read_parquet(path='s3://bucket_name/path/to/your/file.parquet')
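As a side note (my assumption, not part of the original answer): outside Glue, the direct s3:// path in pd.read_parquet relies on the optional s3fs package, which pandas uses to resolve s3:// URLs. A minimal sketch:
# Assumes: pip install pandas pyarrow s3fs, and AWS credentials already configured
# (e.g. via ~/.aws/credentials or environment variables).
import pandas as pd

df = pd.read_parquet('s3://bucket_name/path/to/your/file.parquet')
print(df.head())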
In case you want to read a .json file (using an IDE on your laptop/PC):
object = s3.Object(bucket_name='bucket_name',
                   key='path/to/your/file.json').get()['Body'].read().decode('utf-8')
df = pd.read_json(object, lines=True)

How do I use pandas.read_csv on Google Cloud ML?

I'm trying to deploy a training script on Google Cloud ML. Of course, I've uploaded my datasets (CSV files) in a bucket on GCS.
I used to import my data with read_csv from pandas, but it doesn't seem to work with a GCS path.
How should I proceed (I would like to keep using pandas)?
import pandas as pd
data = pd.read_csv("gs://bucket/folder/file.csv")
Output:
ERROR 2018-02-01 18:43:34 +0100 master-replica-0 IOError: File gs://bucket/folder/file.csv does not exist
You will need to use file_io from tensorflow.python.lib.io to do that, as demonstrated below:
from tensorflow.python.lib.io import file_io
from pandas.compat import StringIO
from pandas import read_csv
# read csv file from google cloud storage
def read_data(gcs_path):
    file_stream = file_io.FileIO(gcs_path, mode='r')
    csv_data = read_csv(StringIO(file_stream.read()))
    return csv_data
Now call the above function:
gcs_path = 'gs://bucket/folder/file.csv'  # change the path according to your bucket, folder and file
df = read_data(gcs_path)
# print(df.head())  # displays the top 5 rows, including headers, by default
Pandas does not have native GCS support. There are two alternatives:
1. Copy the file to the VM using the gsutil CLI (see the sketch after this list).
2. Use the TensorFlow file_io library to open the file, and pass the file object to pd.read_csv(). Please refer to the detailed answer here.
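A minimal sketch of option 1, assuming the gsutil CLI is installed and authenticated on the VM (the bucket path is a placeholder):
import subprocess
import pandas as pd

# Copy the file from GCS to local disk with gsutil, then read it as a normal local file.
subprocess.check_call(['gsutil', 'cp', 'gs://bucket/folder/file.csv', '/tmp/file.csv'])
data = pd.read_csv('/tmp/file.csv')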
You could also use Dask to extract and then load the data into, let's say, a Jupyter Notebook running on GCP.
Make sure you have Dask installed:
conda install dask            # conda
pip install dask[complete]    # pip
import dask.dataframe as dd                          # import
dataframe = dd.read_csv('gs://bucket/datafile.csv')  # read a single CSV file
dataframe2 = dd.read_csv('gs://bucket/path/*.csv')   # read multiple CSV files
This is all you need to load the data.
You can filter and manipulate data with Pandas syntax now.
dataframe['z'] = dataframe.x + dataframe.y
dataframe_pd = dataframe.compute()

How to read csv to dataframe in Google Colab

I am trying to read a csv file which I stored locally on my machine. (Just for additional reference, it is the Titanic data from Kaggle, which is here.)
From this question and answers I learnt that you can import data using this code, which works well for me.
from google.colab import files
uploaded = files.upload()
Where I am lost is how to convert it to a dataframe from here. The sample Google notebook page listed in the answer above does not talk about it.
I am trying to convert the uploaded dictionary to a dataframe using the from_dict command but am not able to make it work. There is some discussion on converting a dict to a dataframe here, but the solutions are not applicable to me (I think).
So, summarizing, my question is:
How do I convert a csv file stored locally on my machine to a pandas dataframe on Google Colaboratory?
Step 1: Mount your Google Drive in Colaboratory
from google.colab import drive
drive.mount('/content/gdrive')
Step 2: Now you will see your Google Drive files in the left pane (file explorer). Right-click on the file that you need to import and select Copy path. Then import as usual in pandas, using this copied path.
import pandas as pd
df=pd.read_csv('gdrive/My Drive/data.csv')
Done!
Pandas read_csv should do the trick. You'll want to wrap your uploaded bytes in an io.StringIO since read_csv expects a file-like object.
Here's a full example:
https://colab.research.google.com/notebook#fileId=1JmwtF5OmSghC-y3-BkvxLan0zYXqCJJf
The key snippet is:
import pandas as pd
import io
df = pd.read_csv(io.StringIO(uploaded['train.csv'].decode('utf-8')))
df
Google Colab: uploading a csv from your PC
I had the same problem with an Excel file (*.xlsx). I solved it as follows, and I think you could do the same with csv files:
- If you have a file on your PC drive called (file.xlsx), then:
1- Upload it from your hard drive by using this simple code:
from google.colab import files
uploaded = files.upload()
Press (Choose Files) and upload the file; it will be stored in your Colab session.
2- Then:
import io
data = io.BytesIO(uploaded['file.xlsx'])
3- Finally, read your file:
import pandas as pd
df = pd.read_excel(data, sheet_name='1min', header=0, skiprows=2)
# df.sheet_names
df.head()
4- Please change the parameter values to read your own file. I think this could be generalized to read other types of files!
Enjoy it!
This worked for me:
from google.colab import auth
auth.authenticate_user()
from pydrive.drive import GoogleDrive
from pydrive.auth import GoogleAuth
from oauth2client.client import GoogleCredentials
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
myfile = drive.CreateFile({'id': '!!!YOUR FILE ID!!!'})
myfile.GetContentFile('file.csv')
Replace !!!YOUR FILE ID!!! with the id of the file in google drive (this is the long alphanumeric string that appears when you click on "obtain link to share"). Then you can access file.csv with pandas' read_csv:
import pandas as pd
frm = pd.read_csv('file.csv', header=None)
So, if you were not working in Google Colab, you would simply have written something like this:
df = pd.read_csv('path_of_the_csv_file')
In Google Colab, the only thing you have to know is the path of the csv file.
If you follow the steps I have written below, your problem will be solved:
First of all, upload the CSV file to your Google Drive.
Then, open your Google Colab notebook and click on the 'Files' icon on the left side of the page.
Then, click on the 'Google Drive Folder' icon to mount your Google Drive.
Then, look for the csv file that you uploaded to your Google Drive (step 1), and copy its path.
Once you have the path, treat it as an ordinary path and use it in your code.
It should look something like this:
df = pd.read_csv('/content/drive/MyDrive/File.csv')
This worked for me:
import pandas as pd
import io
df=pd.read_csv(io.StringIO(uploaded['Filename.CSV'].decode('ISO-8859-1')))
df
Alternatively, you can also use GitHub to import files.
You can take this as an example: https://drive.google.com/file/d/1D6ViUx8_ledfBqcxHCrFPcqBvNZitwCs/view?usp=sharing
Also, Google Colab does not persist the file for long, so you may have to run the GitHub snippets again from time to time.
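For the GitHub route, a minimal sketch (the repository URL below is a made-up placeholder; use the raw link to your own file):
import pandas as pd

# pandas can read a CSV directly over HTTP(S), e.g. from a raw GitHub URL.
url = 'https://raw.githubusercontent.com/your-user/your-repo/main/train.csv'
df = pd.read_csv(url)
df.head()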

Why does dumping Dataframe to Avro file fail to convert bytearray in Python?

I face the following difficulty:
I am using Spark 1.4.1, Python 2.7.8, and spark-avro_2.10-1.0.0
I am trying to store Python byte-arrays in an avro file using spark-avro. My purpose is to store chains of bytes corresponding to chunks of images that have been encoded using a specific image encoder.
It fails with a conversion exception:
org.apache.avro.file.DataFileWriter$AppendWriteException: org.apache.avro.UnresolvedUnionException: Not in union ["bytes","null"]:
Here is a sample I have made to reproduce the problem:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, Row
import os
import tempfile
# Just setting name of the Spark app
conf = SparkConf().setAppName("pyspark test")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
# build Data frame containing bytearrays (stupid)
data = map(lambda x: bytearray(str(x)), range(5))
rdd = sc.parallelize(data)
# convert data to SQL Row
rdd_row = rdd.map(lambda b: Row(val=b))
# create a DataFrame
df = sqlContext.createDataFrame(rdd_row)
df.registerTempTable('test')
# try to dump it
outputFile = os.path.join(tempfile.gettempdir(), 'test.avro')
df.write.format("com.databricks.spark.avro").save(outputFile)
This is launched using:
spark-submit --master local[1] --jars "spark-avro_2.10-1.0.0.jar" testBytearray.py
And it fails during the conversion!
I was using a bad version of spark-avro. After building the latest, everything works fine.
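For reference (not part of the original answer): since Spark 2.4, Avro support ships with Spark itself, so a roughly equivalent setup today could look like the sketch below; the package version is an assumption and must match your Spark/Scala build.
# Launch with the built-in Avro package, for example:
#   spark-submit --master local[1] --packages org.apache.spark:spark-avro_2.12:3.3.0 testBytearray.py
# and write using the built-in "avro" format instead of the Databricks one:
df.write.format("avro").save(outputFile)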
