How to sort an S3 CSV file using AWS Glue - Python

I'm relatively new to AWS Glue and Spark. I'd like to sort a CSV file in S3 by user ID. I'm trying the script below, but it's not sorting the file. Can someone please help me with this?
import sys
import math
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import current_date
import pyspark.sql.functions as f
from pyspark.sql.functions import asc
args = getResolvedOptions(sys.argv, ['JOB_NAME','DESTINATION_PATH', 'SOURCE_PATH'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
dyf = glueContext.create_dynamic_frame.from_options(
    "s3",
    connection_options={"paths": [args['SOURCE_PATH']]},
    format="csv",
    format_options={"withHeader": True})
print("records read from s3 store")
print(dyf.count())
file_size = 10000
n_partitions = int(math.ceil(dyf.count() / float(file_size)))
print("splitting file into partitions")
print(n_partitions)
sort_dataframe = dyf.toDF().orderBy("user_id")
sort_dataframe.show()
df_dataframe = sort_dataframe.repartition(n_partitions)
ddf_dataframe = DynamicFrame.fromDF(df_dataframe, glueContext, "ddf_dataframe")
datasink4 = glueContext.write_dynamic_frame.from_options(
    frame=ddf_dataframe,
    connection_type="s3",
    format="csv",
    connection_options={"path": args['DESTINATION_PATH']},
    format_options={"withHeader": True},
    transformation_ctx="datasink4")
print("records processing complete")
job.commit()

You are sorting it and then immediately shuffling everything randomly across partitions with the repartition. Use dyf.toDF().repartition(n_partitions).sortWithinPartitions("user_id") instead. Each file will still contain the full range of user IDs, but within each file every row is sorted by user ID.
If you are querying with Athena, that is actually a good layout: all files can be scanned in parallel, and a query can quickly zoom in on just the portion of each file holding the user IDs you filter by (at least if you are using Parquet).
If that is not suitable, try dyf.toDF().repartitionByRange(n_partitions, "user_id"). That samples user_id and makes an educated guess at how to distribute the IDs between files, so the files may not be perfectly evenly sized, but each file gets its own set of user IDs and no two files have overlapping ranges.
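A minimal sketch of both options, reusing the names from the job above (pick one of the two DataFrames and write it out exactly as in the question):
# Option 1: evenly sized files, each file sorted by user_id internally
sorted_within = dyf.toDF().repartition(n_partitions).sortWithinPartitions("user_id")
# Option 2: no overlapping user_id ranges between files (sizes may be uneven,
# because Spark samples user_id to choose the range boundaries)
range_split = dyf.toDF().repartitionByRange(n_partitions, "user_id")
# convert back to a DynamicFrame and write out as before
out_dyf = DynamicFrame.fromDF(sorted_within, glueContext, "out_dyf")
glueContext.write_dynamic_frame.from_options(
    frame=out_dyf,
    connection_type="s3",
    format="csv",
    connection_options={"path": args['DESTINATION_PATH']},
    format_options={"withHeader": True})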

Related

How to convert a JSON file from S3 to CSV and save it in the same S3 bucket using a Glue job

Please help me with the coding part.
I googled for the code, but everything I found uses a Lambda handler; my project requires a Glue job.
Here is how you can convert JSON to CSV in a Glue (Scala) job:
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.JsonOptions
import org.apache.spark.SparkContext
val glueContext = new GlueContext(SparkContext.getOrCreate())
val jsonDyf = glueContext.getSourceWithFormat(
  connectionType = "s3",
  options = JsonOptions("""{"paths": ["s3://sourcePath/data.json"]}"""),
  format = "json",
  transformationContext = "jsonDyf"
).getDynamicFrame()
val dataDf = jsonDyf.toDF()
// one partition so there is a single part file; note saveAsTextFile writes a folder, not a single object
val csvRDD = dataDf.repartition(1).rdd.map(_.mkString(","))
csvRDD.saveAsTextFile("s3://sourcePath/data.csv")
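The same conversion in a PySpark Glue job is a short sketch along these lines; the bucket and paths are placeholders, and the CSV ends up as part files under the output prefix rather than as a single data.csv object:
from pyspark.context import SparkContext
from awsglue.context import GlueContext
glueContext = GlueContext(SparkContext.getOrCreate())
json_dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://source-bucket/path/data.json"]},
    format="json")
glueContext.write_dynamic_frame.from_options(
    frame=json_dyf.repartition(1),  # one partition -> one output part file
    connection_type="s3",
    connection_options={"path": "s3://source-bucket/path/csv-output/"},
    format="csv",
    format_options={"withHeader": True})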

How can I read multiple S3 buckets using Glue?

When using Spark, I can read data from multiple buckets using the * in the prefix. For example, my folder structure is as follows:
s3://bucket/folder/computation_date=2020-11-01/
s3://bucket/folder/computation_date=2020-11-02/
s3://bucket/folder/computation_date=2020-11-03/
etc.
Using PySpark, if I want to read all data for month 11, I can do:
input_bucket = [MY-BUCKET]
input_prefix = [MY-FOLDER/computation_date=2020-11-*]
df_spark = spark.read.parquet("s3://{}/{}/".format(input_bucket, input_prefix))
How can I achieve the same functionality with Glue? The below does not seem to work:
input_bucket = [MY-BUCKET]
input_prefix = [MY-FOLDER/computation_date=2020-11-*]
df_glue = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://{}/{}/".format(input_bucket, input_prefix)]
    },
    format="parquet",
    transformation_ctx="df_spark")
I ended up reading the files using Spark instead of Glue:
glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session
input_bucket = [MY-BUCKET]
input_prefix = [MY-FOLDER/computation_date=2020-11-*]
df_spark = spark.read.parquet("s3://{}/{}/".format(input_bucket, input_prefix))
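If the rest of the job still needs a DynamicFrame (for example to use write_dynamic_frame), the wildcard-read Spark DataFrame can be converted back; a minimal sketch reusing the names above:
from awsglue.dynamicframe import DynamicFrame
df_spark = spark.read.parquet("s3://{}/{}/".format(input_bucket, input_prefix))
dyf = DynamicFrame.fromDF(df_spark, glueContext, "dyf")
print(dyf.count())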

How to set Primary Key while creating a PySpark Dataframe

I created a Glue dynamic frame from the table I read, "raw_tb", then converted it into a Spark DataFrame using the .toDF() method. Now I'm trying to create two separate DataFrames from raw_df.
# Spark Context Object
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
#sqlContext = SQLContext(sc)
# Boto Client Objects
client = boto3.client('glue', region_name=REGION_NAME)
s3client = boto3.client('s3', region_name=REGION_NAME)
RAW_TABLE = "raw_tb"
table_read_df = glueContext.create_dynamic_frame.from_catalog(RAW_DATABASE, RAW_TABLE)
raw_df = table_read_df.toDF()
policy_tbl = raw_df['policynumber','status','startdate','expirationdate']
location_tbl = raw_df['locationid','city','county','state','zip']
Here, I would like to set the "policynumber" column in policy_tbl and "locationid" column in location_tbl as primary keys. I'm not sure how that's possible. Please help!
(screenshot: https://i.stack.imgur.com/OeI0N.png)

PySpark/HIVE: append to an existing table

Really basic PySpark/Hive question:
How do I append to an existing table? My attempt is below:
from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext
conf_init = SparkConf().setAppName('pyspark2')
sc = SparkContext(conf = conf_init)
hive_cxt = HiveContext(sc)
import pandas as pd
df = pd.DataFrame({'a':[0,0], 'b':[0,0]})
sdf = hive_cxt.createDataFrame(df)
sdf.write.mode('overwrite').saveAsTable('database.table') #this line works
df = pd.DataFrame({'a':[1,1,1], 'b':[2,2,2]})
sdf = hive_cxt.createDataFrame(df)
sdf.write.mode('append').saveAsTable('database.table') #this line does not work
#sdf.write.insertInto('database.table',overwrite = False) #this line does not work
Thanks!
Sam
It seems using mode('overwrite') was causing the problem: it drops the table and then recreates a new one. If I do the following, everything works fine:
from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext
conf_init = SparkConf().setAppName('pyspark2')
sc = SparkContext(conf = conf_init)
print(sc.version)
hive_cxt = HiveContext(sc)
hive_cxt.sql('USE database')
query = """
CREATE TABLE IF NOT EXISTS table (a int, b int)
STORED AS parquet
"""
hive_cxt.sql(query)
import pandas as pd
df = pd.DataFrame({'a':[0,0], 'b':[0,0]})
sdf = hive_cxt.createDataFrame(df)
sdf.write.mode('append').format('hive').saveAsTable('table')
query = """
SELECT *
FROM table
"""
df = hive_cxt.sql(query)
df = df.toPandas()
print(df) # successfully pull the data in table
df = pd.DataFrame({'a':[1,1,1], 'b':[2,2,2]})
sdf = hive_cxt.createDataFrame(df)
sdf.write.mode('append').format('hive').saveAsTable('table')
I think you previously forgot to use the format option, which is what caused the issue when you tried to append rather than overwrite, as you mentioned above.
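For completeness, a sketch of the insertInto variant from the question: once the table exists (as created above), insertInto also appends, but it matches columns by position rather than by name.
df = pd.DataFrame({'a': [3, 3], 'b': [4, 4]})
sdf = hive_cxt.createDataFrame(df)
# table must already exist; columns are matched by position, not name
sdf.write.insertInto('table', overwrite=False)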

How to use unbase64 function in pyspark SQL query?

I cannot seem to figure out why the unbase64 function won't work in my Spark SQL query.
Here is an example. I'm trying to decode "VGhpcyBpcyBhIHRlc3Qh" by calling the unbase64 function within Spark SQL. Any thoughts on why the output doesn't get decoded? Thanks.
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import unbase64
sc = SparkContext("local", "Simple App")
sqlContext = SQLContext(sc)
log = [{"eventTime":"2015-12-14 15:27:00","id":"9ab0135f-b8a3-4312-9065-9f8874fd790c","fullLog":"VGhpcyBpcyBhIHRlc3Qh"}]
df = sqlContext.createDataFrame(log)
df.registerTempTable('data')
query = sqlContext.sql('SELECT unbase64(fullLog) as test FROM data')
query.write.save("output", format="json")
The output is : {"test":"VGhpcyBpcyBhIHRlc3Qh"} when I want it to be: {"test":"This is a test!"}
It seems to work for me; unbase64 returns a binary column, so you just need to cast the result to a string to see the decoded text:
from pyspark import SparkContext
from pyspark.sql import SQLContext
sc = SparkContext("local", "Simple App")
sqlContext = SQLContext(sc)
log = [("2015-12-14 15:27:00", "9ab0135f-b8a3-4312-9065-9f8874fd790c", "VGhpcyBpcyBhIHRlc3Qh")]
rdd_log = sc.parallelize(log)
df = sqlContext.createDataFrame(rdd_log, ["eventTime", "id", "fullLog"])
df.registerTempTable("data")
query = sqlContext.sql('SELECT unbase64(fullLog) as test FROM data')
query = query.select(query.test.cast("string").alias('test'))
print(query.collect())
>> [Row(test=u'This is a test!')]
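For reference, the same cast works directly with the unbase64 function the question already imports, without a temp table; a minimal sketch against the df defined above:
from pyspark.sql.functions import unbase64
decoded = df.select(unbase64(df.fullLog).cast("string").alias("test"))
print(decoded.collect())  # [Row(test=u'This is a test!')]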
