I basically created a Glue DynamicFrame from the table I read, "raw_tb". Then I converted the DynamicFrame into a Spark DataFrame using the .toDF() method. Now I'm trying to create two separate DataFrames from raw_df.
# Spark Context Object
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
#sqlContext = SQLContext(sc)
# Boto Client Objects
client = boto3.client('glue', region_name=REGION_NAME)
s3client = boto3.client('s3', region_name=REGION_NAME)
RAW_TABLE = "raw_tb"
table_read_df = glueContext.create_dynamic_frame.from_catalog(database=RAW_DATABASE, table_name=RAW_TABLE)
raw_df = table_read_df.toDF()
policy_tbl = raw_df.select('policynumber', 'status', 'startdate', 'expirationdate')
location_tbl = raw_df.select('locationid', 'city', 'county', 'state', 'zip')
Here, I would like to set the "policynumber" column in policy_tbl and "locationid" column in location_tbl as primary keys. I'm not sure how that's possible. Please help!
https://i.stack.imgur.com/OeI0N.png
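For what it's worth, Spark DataFrames have no primary-key constraint to set; the closest workaround is to deduplicate on the intended key column and verify uniqueness yourself. A minimal sketch, reusing the column names from the question:
# Spark has no primary-key constraint, so uniqueness has to be enforced/checked manually
policy_tbl = policy_tbl.dropDuplicates(['policynumber'])
location_tbl = location_tbl.dropDuplicates(['locationid'])
# Sanity check: the key column is now unique in each table
assert policy_tbl.count() == policy_tbl.select('policynumber').distinct().count()
assert location_tbl.count() == location_tbl.select('locationid').distinct().count()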
Please help me with the coding part. I googled for the code, but I only found examples that use a Lambda handler. My project requires using a Glue job.
Here you can find the answer for converting JSON to CSV with a Scala Glue job:
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.JsonOptions
import org.apache.spark.SparkContext

val glueContext = new GlueContext(SparkContext.getOrCreate())

// Read the JSON file from S3 as a DynamicFrame
val jsonDf = glueContext.getSourceWithFormat(
  connectionType = "s3",
  options = JsonOptions("""{"paths": ["s3://sourcePath/data.json"]}"""),
  transformationContext = "jsonDf",
  format = "json"
).getDynamicFrame()

val dataDf = jsonDf.toDF()

// Collapse to a single partition and write each row as a comma-separated line
val csvRDD = dataDf.repartition(1).rdd.map(_.mkString(","))
csvRDD.saveAsTextFile("s3://sourcePath/data.csv")
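If the job is written in PySpark rather than Scala, a rough equivalent (just a sketch; the S3 paths are placeholders carried over from above) is to let Spark's CSV writer handle headers and quoting instead of building lines with mkString:
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session

# Read the JSON input and rewrite it as a single CSV file with a header row
json_df = spark.read.json("s3://sourcePath/data.json")
(json_df
    .coalesce(1)                     # one output file
    .write
    .option("header", True)
    .mode("overwrite")
    .csv("s3://sourcePath/data_csv/"))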
I'm relatively new to AWS Glue and Spark. I'd like to sort a CSV file in S3 by user ID. I'm trying out the script below, but it's not sorting the file. Can someone please help me with this?
import sys
import math
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import current_date
import pyspark.sql.functions as f
from pyspark.sql.functions import asc
args = getResolvedOptions(sys.argv, ['JOB_NAME','DESTINATION_PATH', 'SOURCE_PATH'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
dyf = glueContext.create_dynamic_frame.from_options("s3", connection_options = {"paths": [args['SOURCE_PATH']]}, format="csv", format_options = {"withHeader": True})
print("records read from s3 store")
print(dyf.count())
file_size = 10000
n_partitions = int(math.ceil(dyf.count() / float(file_size)))
print("splitting file into partitions")
print(n_partitions)
sort_dataframe = dyf.toDF().orderBy("user_id")
sort_dataframe.show()
df_dataframe = sort_dataframe.repartition(n_partitions)
ddf_dataframe = DynamicFrame.fromDF(sort_dataframe, glueContext, "ddf_dataframe")
datasink4 = glueContext.write_dynamic_frame.from_options(frame = ddf_dataframe, connection_type = "s3", format = "csv", connection_options = {"path": args['DESTINATION_PATH']}, transformation_ctx = "datasink4",format_options = {"withHeader": True})
print("records processing complete")
job.commit()
You are sorting the data and then immediately shuffling it randomly across other partitions by repartitioning. Do a dyf.toDF().repartition(n_partitions).sortWithinPartitions("user_id") instead. Each output file will still contain the full range of user IDs, but within each file every row is sorted by user ID.
If you are querying with Athena, that layout is actually good: all files can be scanned in parallel, and the query can quickly zoom in on just the portion of each file containing the user IDs you are filtering by (at least if you are using Parquet).
If that is not suitable, try dyf.toDF().repartitionByRange(n_partitions, "user_id"). That requires Spark to sample user_id and make an educated guess at how to distribute the values between files, so the files may not be perfectly evenly sized, but each file will hold a distinct set of user IDs and no two files will have overlapping ranges.
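A minimal sketch of both options against the dyf from the question (user_id and n_partitions come from the post):
from awsglue.dynamicframe import DynamicFrame

# Option 1: evenly sized files, rows sorted by user_id within each file
sorted_df = dyf.toDF().repartition(n_partitions).sortWithinPartitions("user_id")

# Option 2: each file covers a distinct, non-overlapping range of user_id values
# (file sizes may vary because Spark samples the column to pick the boundaries)
ranged_df = dyf.toDF().repartitionByRange(n_partitions, "user_id")

# Wrap whichever result you choose back into a DynamicFrame before writing
ddf_dataframe = DynamicFrame.fromDF(sorted_df, glueContext, "ddf_dataframe")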
When using Spark, I can read data from multiple buckets using the * in the prefix. For example, my folder structure is as follows:
s3://bucket/folder/computation_date=2020-11-01/
s3://bucket/folder/computation_date=2020-11-02/
s3://bucket/folder/computation_date=2020-11-03/
etc.
Using PySpark, if I want to read all data for month 11, I can do:
input_bucket = [MY-BUCKET]
input_prefix = [MY-FOLDER/computation_date=2020-11-*]
df_spark = spark.read.parquet("s3://{}/{}/".format(input_bucket, input_prefix))
How do I achieve the same functionality with Glue? The code below does not seem to work:
input_bucket = [MY-BUCKET]
input_prefix = [MY-FOLDER/computation_date=2020-11-*]
df_glue = glueContext.create_dynamic_frame_from_options(
connection_type="s3",
connection_options = {
"paths": ["s3://{}/{}/".format(input_bucket, input_prefix)]
},
format="parquet",
transformation_ctx="df_spark")
I read the files using Spark instead of Glue:
glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session
input_bucket = [MY-BUCKET]
input_prefix = [MY-FOLDER/computation_date=2020-11-*]
df_spark = spark.read.parquet("s3://{}/{}/".format(input_bucket, input_prefix))
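If the rest of the job still needs a DynamicFrame (for Glue transforms or write_dynamic_frame), the wildcard-read DataFrame can be wrapped back into one; a small sketch, assuming the glueContext from above:
from awsglue.dynamicframe import DynamicFrame

# Convert the Spark DataFrame back into a DynamicFrame so the rest of the
# Glue job (transforms, write_dynamic_frame) can stay unchanged
dyf_glue = DynamicFrame.fromDF(df_spark, glueContext, "dyf_glue")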
I am trying to convert an RDD to a DataFrame using PySpark. Below is my code.
from pyspark import SparkConf, SparkContext
from pyspark.sql.functions import *
from pyspark.sql import SparkSession
conf = SparkConf().setMaster("local").setAppName("Dataframe_examples")
sc = SparkContext(conf=conf)
def parsedLine(line):
fields = line.split(',')
movieId = fields[0]
movieName = fields[1]
genres = fields[2]
return movieId, movieName, genres
movies = sc.textFile("file:///home/ajit/ml-25m/movies.csv")
parsedLines = movies.map(parsedLine)
print(parsedLines.count())
dataFrame = parsedLines.toDF(["movieId"])
dataFrame.printSchema()
I am running this code using PyCharm IDE.
And I get the error:
File "/home/ajit/PycharmProjects/pythonProject/Dataframe_examples.py", line 19, in <module>
dataFrame = parsedLines.toDF(["movieId"])
AttributeError: 'PipelinedRDD' object has no attribute 'toDF'
As I am new to this, let me know what I am missing.
Initialize a SparkSession by passing it the SparkContext. The toDF() method is only attached to RDDs once a SparkSession (or SQLContext) has been created, which is why the PipelinedRDD has no toDF attribute here.
Example:
from pyspark import SparkConf, SparkContext
from pyspark.sql.functions import *
from pyspark.sql import SparkSession
conf = SparkConf().setMaster("local").setAppName("Dataframe_examples")
sc = SparkContext(conf=conf)
spark = SparkSession(sc)
def parsedLine(line):
fields = line.split(',')
movieId = fields[0]
movieName = fields[1]
genres = fields[2]
return movieId, movieName, genres
movies = sc.textFile("file:///home/ajit/ml-25m/movies.csv")
#or using spark.sparkContext
movies = spark.sparkContext.textFile("file:///home/ajit/ml-25m/movies.csv")
parsedLines = movies.map(parsedLine)
print(parsedLines.count())
dataFrame = parsedLines.toDF(["movieId", "movieName", "genres"])  # one column name per field in the tuple
dataFrame.printSchema()
Use the SparkSession to build the DataFrame from the RDD as follows:
movies = sc.textFile("file:///home/ajit/ml-25m/movies.csv")
parsedLines = movies.map(parsedLine)
print(parsedLines.count())
spark = SparkSession.builder.getOrCreate()
dataFrame = spark.createDataFrame(parsedLines).toDF("movieId", "movieName", "genres")
dataFrame.printSchema()
Or create the SparkSession first and take the SparkContext from it:
spark = SparkSession.builder.master("local").appName("Dataframe_examples").getOrCreate()
sc = spark.sparkContext
I'm basically trying to update/add rows from one DF to another. Here is my code:
# S3
import boto3
# SOURCE
source_table = "someDynamoDbtable"
source_s3 = "s3://mybucket/folder/"
# DESTINATION
destination_bucket = "s3://destination-bucket"
#Select which attributes to update/add
params = ['attributeD', 'attributeF', 'AttributeG']
#spark wrapper
glueContext = GlueContext(SparkContext.getOrCreate())
newData = glueContext.create_dynamic_frame.from_options(connection_type = "dynamodb", connection_options = {"tableName": source_table})
newValues = newData.select_fields(params)
newDF = newValues.toDF()
oldData = glueContext.create_dynamic_frame.from_options(connection_type="s3", connection_options={"paths": [source_s3]}, format="orc", format_options={}, transformation_ctx="dynamic_frame")
oldDataValues = oldData.drop_fields(params)
oldDF = oldDataValues.toDF()
#makes a union of the dataframes
rebuildData = oldDF.union(newData)
#error happens here
readyData = DynamicFrame.fromDF(rebuildData, glueContext, "readyData")
#writes new data to s3 destination, into orc files, while partitioning
glueContext.write_dynamic_frame.from_options(frame = readyData, connection_type = "s3", connection_options = {"path": destination_bucket}, format = "orc", partitionBy=['partition_year', 'partition_month', 'partition_day'])
The error I get is:
SyntaxError: invalid syntax on line readyData = ...
So far I've got no idea what's wrong.
You are performing the union operation between a DataFrame and a DynamicFrame.
This creates a DynamicFrame named newData and a DataFrame named newDF:
newData = glueContext.create_dynamic_frame.from_options(connection_type = "dynamodb", connection_options = {"tableName": source_table})
newValues = newData.select_fields(params)
newDF = newValues.toDF()
This creates a DynamicFrame named oldData and a DataFrame named oldDF:
oldData = glueContext.create_dynamic_frame.from_options(connection_type="s3", connection_options={"paths": [source_s3]}, format="orc", format_options={}, transformation_ctx="dynamic_frame")
oldDataValues = oldData.drop_fields(params)
oldDF = oldDataValues.toDF()
And you are performing the union operation on these two objects:
rebuildData = oldDF.union(newData)
which should be:
rebuildData = oldDF.union(newDF)
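One more thing worth noting: DataFrame.union matches columns purely by position, so both DataFrames need the same number of columns in the same order. Here oldDF has dropped the params columns while newDF contains only them, so even the corrected union would not line up the data the way you want, which is presumably why the follow-up below switches to a join. For the case where two DataFrames share columns but in a different order, a hedged alternative is unionByName (allowMissingColumns requires Spark 3.1+):
# Hypothetical example: match columns by name instead of position;
# with allowMissingColumns=True (Spark 3.1+), absent columns are filled with nulls
merged_by_name = oldDF.unionByName(newDF, allowMissingColumns=True)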
Yeah, so I figured that for what I need to do it would be better to use an OUTER JOIN.
Let me explain:
I load two DataFrames, where one drops the fields that we want to update.
The second one selects just those fields, so the two would not have duplicate rows/columns.
Instead of a union, which would just add rows, we use an outer (or full) join. This adds all the data from my DataFrames without duplicates.
Now my logic may be flawed, but so far it is working okay for me. If anyone is looking for a similar solution, you are welcome to it.
My changed code:
rebuildData = oldDF.join(newDF, 'id', 'outer')
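For completeness, a sketch of wrapping the joined result back into a DynamicFrame and writing it out, with the partition columns passed as partitionKeys inside connection_options. Note that for the join on 'id' to work, 'id' has to be present in both frames, i.e. kept alongside the params fields when selecting:
from awsglue.dynamicframe import DynamicFrame

# Wrap the joined DataFrame back into a DynamicFrame for the Glue writer
readyData = DynamicFrame.fromDF(rebuildData, glueContext, "readyData")

# Glue's S3 writer takes the partition columns via "partitionKeys"
glueContext.write_dynamic_frame.from_options(
    frame=readyData,
    connection_type="s3",
    connection_options={
        "path": destination_bucket,
        "partitionKeys": ["partition_year", "partition_month", "partition_day"],
    },
    format="orc",
)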