How can I do bucketing with S3 using AWS Glue? - python

I tried partitioning and bucketing with AWS Glue on S3, but the bucketing did not work; only the partitioning did. How can I do bucketing with AWS Glue?
datasink4 = glueContext.write_dynamic_frame.from_options(
    frame = dropnullfields3,
    connection_type = "s3",
    connection_options = {"path": s3_output_full,
                          "partitionKeys": ["PARTITIONKEY"],
                          "bucketColumns": ["ROW_ID"],
                          "numberOfBuckets": 12},
    format = "parquet",
    transformation_ctx = "datasink4")
job.commit()

I think they're not supported yet.
My script uses Spark's bucketBy function instead; note that it would replace existing data in the defined path.
df_name, job_df = (str(transform_name), df)
datasink_path = "s3://sink-bucket/job-data/"

writing = job_df.write.format('parquet').mode("append") \
    .partitionBy('event_day') \
    .bucketBy(3, 'bucketed_field') \
    .saveAsTable(df_name, path=datasink_path)
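
If you are starting from a DynamicFrame as in the question, one option is to convert it to a Spark DataFrame first and bucket from there. A minimal sketch, reusing the names from the question (dropnullfields3, s3_output_full, PARTITIONKEY, ROW_ID); the table name is a hypothetical placeholder, and bucketBy requires saveAsTable:

# Sketch only: convert the Glue DynamicFrame to a Spark DataFrame, then bucket with Spark's writer
spark_df = dropnullfields3.toDF()

spark_df.write.format("parquet").mode("append") \
    .partitionBy("PARTITIONKEY") \
    .bucketBy(12, "ROW_ID") \
    .saveAsTable("bucketed_output", path=s3_output_full)  # "bucketed_output" is a placeholder table name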

How to convert JSON to CSV file from s3 and save it in same s3 bucket using Glue job

Please help me with the coding part.
I googled for the code, but the examples I found only use a Lambda handler; my project requires a Glue job.
Here you can find an answer for converting JSON to CSV with a Scala Glue job.
val glueContext = new GlueContext(SparkContext.getOrCreate())
val jsonSource = glueContext.getSource(
  connectionType = "s3",
  connectionOptions = JsonOptions(Map("paths" -> "s3://sourcePath/data.json")),
  format = "json",
  transformationContext = "jsonSource"
)
// getSource returns a DataSource; materialize it as a DynamicFrame, then as a DataFrame
val dataDf = jsonSource.getDynamicFrame().toDF()
val csvRDD = dataDf.repartition(1).rdd.map(_.mkString(","))
// saveAsTextFile writes a folder of part files under this prefix
csvRDD.saveAsTextFile("s3://sourcePath/data.csv")
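
Since the question is tagged python and asks for a Glue job rather than a Lambda, a rough PySpark equivalent might look like the sketch below. The bucket and prefix names are placeholders, and the CSV output is written as part files under the output prefix rather than a single file.

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read the JSON input from S3 as a DynamicFrame (path is a placeholder)
json_dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://source-bucket/input-json/"]},
    format="json",
    transformation_ctx="json_dyf")

# Write it back to the same bucket as CSV (prefix is a placeholder)
glueContext.write_dynamic_frame.from_options(
    frame=json_dyf,
    connection_type="s3",
    connection_options={"path": "s3://source-bucket/output-csv/"},
    format="csv",
    transformation_ctx="csv_sink")

job.commit()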

How can I read multiple S3 buckets using Glue?

When using Spark, I can read data from multiple paths using a * wildcard in the prefix. For example, my folder structure is as follows:
s3://bucket/folder/computation_date=2020-11-01/
s3://bucket/folder/computation_date=2020-11-02/
s3://bucket/folder/computation_date=2020-11-03/
etc.
Using PySpark, if I want to read all data for month 11, I can do:
input_bucket = [MY-BUCKET]
input_prefix = [MY-FOLDER/computation_date=2020-11-*]
df_spark = spark.read.parquet("s3://{}/{}/".format(input_bucket, input_prefix))
How do I achieve the same functionality with Glue? The below does not seem to work:
input_bucket = [MY-BUCKET]
input_prefix = [MY-FOLDER/computation_date=2020-11-*]
df_glue = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://{}/{}/".format(input_bucket, input_prefix)]
    },
    format="parquet",
    transformation_ctx="df_spark")
I read the files using Spark instead of Glue:
glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session
input_bucket = [MY-BUCKET]
input_prefix = [MY-FOLDER/computation_date=2020-11-*]
df_spark = spark.read.parquet("s3://{}/{}/".format(input_bucket, input_prefix))
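
If later steps still need a DynamicFrame, the result of the Spark read can be wrapped back into one. A minimal sketch reusing the names above:

from awsglue.dynamicframe import DynamicFrame

# Wrap the Spark DataFrame (read above, with the * wildcard expanded by Spark) as a DynamicFrame
dyf_glue = DynamicFrame.fromDF(df_spark, glueContext, "dyf_glue")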

How to set Primary Key while creating a PySpark Dataframe

I basically created a Glue DynamicFrame from the table I read, "raw_tb". Then I converted the DynamicFrame into a Spark DataFrame using the .toDF() method. Now I'm trying to create two separate dataframes from raw_df.
# Spark and Glue context objects
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
#sqlContext = SQLContext(sc)

# Boto client objects
client = boto3.client('glue', region_name=REGION_NAME)
s3client = boto3.client('s3', region_name=REGION_NAME)

RAW_TABLE = "raw_tb"

table_read_df = glueContext.create_dynamic_frame.from_catalog(database=RAW_DATABASE, table_name=RAW_TABLE)
raw_df = table_read_df.toDF()

policy_tbl = raw_df.select('policynumber', 'status', 'startdate', 'expirationdate')
location_tbl = raw_df.select('locationid', 'city', 'county', 'state', 'zip')
Here, I would like to set the "policynumber" column in policy_tbl and "locationid" column in location_tbl as primary keys. I'm not sure how that's possible. Please help!
https://i.stack.imgur.com/OeI0N.png
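
Spark has no primary key constraint that can be set on a DataFrame; if the goal is simply to guarantee that the key columns are unique, one possible workaround (not from the original post) is to deduplicate on them:

# Sketch only: enforce uniqueness of the intended key columns manually,
# since Spark DataFrames have no primary-key constraint to declare
policy_tbl = policy_tbl.dropDuplicates(['policynumber'])
location_tbl = location_tbl.dropDuplicates(['locationid'])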

Glue Job to union dataframes using pyspark

I'm basically trying to update/add rows from one DF to another. Here is my code:
# Imports
import boto3
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
# SOURCE
source_table = "someDynamoDbtable"
source_s3 = "s3://mybucket/folder/"
# DESTINATION
destination_bucket = "s3://destination-bucket"
#Select which attributes to update/add
params = ['attributeD', 'attributeF', 'AttributeG']
#spark wrapper
glueContext = GlueContext(SparkContext.getOrCreate())
newData = glueContext.create_dynamic_frame.from_options(connection_type = "dynamodb", connection_options = {"tableName": source_table})
newValues = newData.select_fields(params)
newDF = newValues.toDF()
oldData = glueContext.create_dynamic_frame.from_options(connection_type="s3", connection_options={"paths": [source_s3]}, format="orc", format_options={}, transformation_ctx="dynamic_frame")
oldDataValues = oldData.drop_fields(params)
oldDF = oldDataValues.toDF()
#makes a union of the dataframes
rebuildData = oldDF.union(newData)
#error happens here
readyData = DynamicFrame.fromDF(rebuildData, glueContext, "readyData")
#writes new data to s3 destination, into orc files, while partitioning
glueContext.write_dynamic_frame.from_options(frame = readyData, connection_type = "s3", connection_options = {"path": destination_bucket}, format = "orc", partitionBy=['partition_year', 'partition_month', 'partition_day'])
The error I get is:
SyntaxError: invalid syntax on line readyData = ...
So far I've got no idea what's wrong.
You are performing the union operation between a dataframe and a dynamicframe.
This creates a dynamicframe named newData and a dataframe named newDF:
newData = glueContext.create_dynamic_frame.from_options(connection_type = "dynamodb", connection_options = {"tableName": source_table})
newValues = newData.select_fields(params)
newDF = newValues.toDF()
This creates a dynamicframe named oldData and a dataframe named oldDF:
oldData = glueContext.create_dynamic_frame.from_options(connection_type="s3", connection_options={"paths": [source_s3]}, format="orc", format_options={}, transformation_ctx="dynamic_frame")
oldDataValues = oldData.drop_fields(params)
oldDF = oldDataValues.toDF()
And you are performing the union operation on the above two entities, as below:
rebuildData = oldDF.union(newData)
which should be:
rebuildData = oldDF.union(newDF)
Yeah, so I figured that for what I need to do, it would be better to use an OUTER JOIN.
Let me explain:
I load two dataframes, where one drops the fields that we want to update.
The second one selects just those fields, so the two would not have duplicate rows/columns.
Instead of a union, which would just add rows, we use an outer (full) join. This adds all the data from my dataframes without duplicates.
Now my logic may be flawed, but so far it is working okay for me. If anyone is looking for a similar solution, you are welcome to it.
My changed code:
rebuildData = oldDF.join(newDF, 'id', 'outer')
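
One more note on the write step from the question: the Glue S3 sink takes its partition columns inside connection_options as "partitionKeys" (as in the first question above), not as a partitionBy keyword argument. A sketch of the adjusted call, reusing the names from the question:

# Partition columns are passed via connection_options["partitionKeys"]
glueContext.write_dynamic_frame.from_options(
    frame=readyData,
    connection_type="s3",
    connection_options={
        "path": destination_bucket,
        "partitionKeys": ["partition_year", "partition_month", "partition_day"]
    },
    format="orc")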

Write results to permanent table in bigquery

I am using named parameters in BigQuery SQL and want to write the results to a permanent table. I have two functions: one for using named query parameters and one for writing query results to a table. How do I combine the two so that the results of the parameterized query are written to a table?
This is the function using parameterized queries:
from google.cloud import bigquery

def sync_query_named_params(column_name, min_word_count, value):
    query = """with lsq_results as
    (select "%s" = @min_word_count)
    replace (%s AS %s)
    from lsq.lsq_results
    """ % (min_word_count, value, column_name)
    client = bigquery.Client()
    query_results = client.run_sync_query(
        query,
        query_parameters=(
            bigquery.ScalarQueryParameter('column_name', 'STRING', column_name),
            bigquery.ScalarQueryParameter('min_word_count', 'STRING', min_word_count),
            bigquery.ScalarQueryParameter('value', 'INT64', value)))
    query_results.use_legacy_sql = False
    query_results.run()
And this is the function to write to a permanent table:
class BigQueryClient(object):

    def __init__(self, bq_service, project_id, swallow_results=True):
        self.bigquery = bq_service
        self.project_id = project_id
        self.swallow_results = swallow_results
        self.cache = {}

    def write_to_table(
            self,
            query,
            dataset=None,
            table=None,
            external_udf_uris=None,
            allow_large_results=None,
            use_query_cache=None,
            priority=None,
            create_disposition=None,
            write_disposition=None,
            use_legacy_sql=None,
            maximum_billing_tier=None,
            flatten=None):

        configuration = {
            "query": query,
        }

        if dataset and table:
            configuration['destinationTable'] = {
                "projectId": self.project_id,
                "tableId": table,
                "datasetId": dataset
            }

        if allow_large_results is not None:
            configuration['allowLargeResults'] = allow_large_results

        if flatten is not None:
            configuration['flattenResults'] = flatten

        if maximum_billing_tier is not None:
            configuration['maximumBillingTier'] = maximum_billing_tier

        if use_query_cache is not None:
            configuration['useQueryCache'] = use_query_cache

        if use_legacy_sql is not None:
            configuration['useLegacySql'] = use_legacy_sql

        if priority:
            configuration['priority'] = priority

        if create_disposition:
            configuration['createDisposition'] = create_disposition

        if write_disposition:
            configuration['writeDisposition'] = write_disposition

        if external_udf_uris:
            configuration['userDefinedFunctionResources'] = \
                [{'resourceUri': u} for u in external_udf_uris]

        body = {
            "configuration": {
                'query': configuration
            }
        }

        logger.info("Creating write to table job %s" % body)
        job_resource = self._insert_job(body)
        self._raise_insert_exception_if_error(job_resource)
        return job_resource
How do I combine the two functions so that a parameterized query has its results written to a permanent table? Or is there another, simpler way? Please suggest.
You appear to be using two different client libraries.
Your first code sample uses a beta version of the BigQuery client library, but for the time being I would recommend against using it, since it needs substantial revision before it is considered generally available. (And if you do use it, I would recommend using run_async_query() to create a job using all available parameters, and then call results() to get the QueryResults object.)
Your second code sample is creating a job resource directly, which is a lower-level interface. When using this approach, you can specify the configuration.query.queryParameters field on your query configuration directly. This is the approach I'd recommend right now.
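
To illustrate the second approach, a minimal sketch of a jobs.insert query configuration that carries both the named parameters and a destination table; the project, dataset and table IDs and the query text itself are placeholders, only the parameter name comes from the question:

# Sketch of a jobs.insert "query" configuration with named parameters and a destination table
configuration = {
    "query": "SELECT * FROM lsq.lsq_results WHERE word_count >= @min_word_count",  # illustrative query
    "useLegacySql": False,
    "parameterMode": "NAMED",
    "queryParameters": [
        {
            "name": "min_word_count",
            "parameterType": {"type": "INT64"},
            "parameterValue": {"value": "250"}
        }
    ],
    "destinationTable": {
        "projectId": "my-project",   # placeholder
        "datasetId": "my_dataset",   # placeholder
        "tableId": "results_table"   # placeholder
    },
    "writeDisposition": "WRITE_TRUNCATE"
}

body = {"configuration": {"query": configuration}}
# body can then be inserted as a job, e.g. through the _insert_job call used in write_to_table above.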
