Azure Synapse Serverless SQL Pools - how to optimize notebook - python
Is there a better way to optimize the following notebook? Currently it takes 2 minutes and 20 seconds to run. How can I improve performance? Any suggestions would be appreciated. Thanks.
Environment:
medium sized spark pool (8 vCores/64 GB) with 3-30 nodes and 10 executors
ADLSG2 premium (solid state drives)
Set the environment variables
environment = "mydatalake"
fileFormat = "parquet"
Function - set the path of where to load the source parquet files from
tblName = ""
fldrName = ""
dbName = ""
filePrefix = ""
# Create the function
def fn_PathSource(fldrName,dbName,tblName,fileFormat,filePrefix):
    str_path0 = "spark.read.load("
    str_path1 = "'abfss://"
    str_path2 = ".dfs.core.windows.net/sources"
    str_path3 = ", format="
    # abfss URIs separate the container and the storage account with "@"
    return f"{str_path0}{str_path1}{fldrName}@{environment}{str_path2}/{dbName}/{tblName}/{dbName}{filePrefix}{tblName}.{fileFormat}'{str_path3}'{fileFormat}')"
Function - set the path where the table data will be stored in the datalake
# Create the variables used by the function
tblName = ""
fldrName = ""
dbName = ""
# Create the function
def fn_Path(fldrName,dbName,tblName):
    str_path1 = "abfss://"
    str_path2 = ".dfs.core.windows.net"
    # abfss URIs separate the container and the storage account with "@"
    return f"{str_path1}{fldrName}@{environment}{str_path2}/{dbName}/{tblName}/"
Function - get the latest version of the records
import hashlib
from pyspark.sql.functions import md5, concat_ws,col
# Create the variables used by the function
uniqueId = ""
versionId = ""
tblName = ""
# Create the function
def fn_ReadLatestVrsn(uniqueId,versionId,tblName):
    df_Max = spark.sql(f"SELECT {uniqueId},MAX({versionId}) AS {versionId}Max FROM {tblName} GROUP BY {uniqueId}")
    df_Max.createOrReplaceTempView(f"{tblName}Max")
    df_Latest = spark.sql(f"SELECT {uniqueId},{versionId}Max FROM {tblName}Max")
    df_Latest = df_Latest.withColumn("HashKey",md5(concat_ws("",col(f"{uniqueId}").cast("string"),col(f"{versionId}Max").cast("string"))))
    df_Latest.createOrReplaceTempView(f"{tblName}Latest")
    df_Hash = spark.sql(f"SELECT * FROM {tblName} t1")
    df_Hash = df_Hash.withColumn("HashKey",md5(concat_ws("",col(f"{uniqueId}").cast("string"),col(f"{versionId}").cast("string"))))
    df_Hash.createOrReplaceTempView(f"{tblName}Hash")
    df_Final = spark.sql(f"SELECT DISTINCT t1.* FROM {tblName}Hash t1 INNER JOIN {tblName}Latest t2 ON t1.HashKey = t2.HashKey")
    df_Final.createOrReplaceTempView(f"{tblName}")
    return spark.sql(f"SELECT * FROM {tblName}")
Load data frames with source table data
DF_tblBitSize = eval(fn_PathSource("silver","MineDB","tblBitSize","parquet","_dbo_"))
DF_tblDailyReport = eval(fn_PathSource("silver","MineDB","tblDailyReport","parquet","_dbo_"))
DF_tblDailyReportHole = eval(fn_PathSource("silver","MineDB","tblDailyReportHole","parquet","_dbo_"))
DF_tblDailyReportHoleActivity = eval(fn_PathSource("silver","MineDB","tblDailyReportHoleActivity","parquet","_dbo_"))
DF_tblDailyReportHoleActivityHours = eval(fn_PathSource("silver","MineDB","tblDailyReportHoleActivityHours","parquet","_dbo_"))
DF_tblDailyReportShift = eval(fn_PathSource("silver","MineDB","tblDailyReportShift","parquet","_dbo_"))
DF_tblDrill = eval(fn_PathSource("silver","MineDB","tblDrill","parquet","_dbo_"))
DF_tblDrillType = eval(fn_PathSource("silver","MineDB","tblDrillType","parquet","_dbo_"))
DF_tblEmployee = eval(fn_PathSource("silver","MineDB","tblEmployee","parquet","_dbo_"))
DF_tblHole = eval(fn_PathSource("silver","MineDB","tblHole","parquet","_dbo_"))
DF_tblMineProject = eval(fn_PathSource("silver","MineDB","tblMineProject","parquet","_dbo_"))
DF_tblShift = eval(fn_PathSource("silver","MineDB","tblShift","parquet","_dbo_"))
DF_tblUnit = eval(fn_PathSource("silver","MineDB","tblUnit","parquet","_dbo_"))
DF_tblUnitType = eval(fn_PathSource("silver","MineDB","tblUnitType","parquet","_dbo_"))
DF_tblWorkSubCategory = eval(fn_PathSource("silver","MineDB","tblWorkSubCategory","parquet","_dbo_"))
DF_tblWorkSubCategoryType = eval(fn_PathSource("silver","MineDB","tblWorkSubCategoryType","parquet","_dbo_"))
DF_v_Dashboards_CompanyContracts= eval(fn_PathSource("silver","MineDB","v_Dashboards_CompanyContracts","parquet","_"))
DF_v_DailyReportShiftDrillers = eval(fn_PathSource("silver","MineDB","v_DailyReportShiftDrillers","parquet","_"))
DF_v_ActivityCharges = eval(fn_PathSource("silver","MineDB","v_ActivityCharges","parquet","_"))
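The eval calls above are needed only because fn_PathSource returns Python source text rather than a path. A minimal sketch of an alternative, assuming the same folder layout: build just the path string and hand it to spark.read.load directly. The names build_source_path, SOURCE_TABLES and load_source_views are hypothetical, only three of the notebook's tables are listed, and the "@" between container and storage account is the standard abfss URI form. This is a tidiness change and is not claimed to shorten the run time by itself.

```python
# Hypothetical helper: returns only the abfss path, not Python source text.
ENVIRONMENT = "mydatalake"  # storage account name, as in the notebook


def build_source_path(fldr_name, db_name, tbl_name, file_format, file_prefix,
                      environment=ENVIRONMENT):
    """Build the source file path using the same layout as fn_PathSource."""
    return (
        f"abfss://{fldr_name}@{environment}.dfs.core.windows.net/sources/"
        f"{db_name}/{tbl_name}/{db_name}{file_prefix}{tbl_name}.{file_format}"
    )


# (table name, file prefix) pairs; illustrative subset of the notebook's tables.
SOURCE_TABLES = [
    ("tblBitSize", "_dbo_"),
    ("tblDailyReport", "_dbo_"),
    ("v_ActivityCharges", "_"),
]


def load_source_views(spark, tables=SOURCE_TABLES):
    """Read each source file and register it as a temp view of the same name."""
    frames = {}
    for tbl_name, prefix in tables:
        path = build_source_path("silver", "MineDB", tbl_name, "parquet", prefix)
        frames[tbl_name] = spark.read.load(path, format="parquet")
        frames[tbl_name].createOrReplaceTempView(tbl_name)
    return frames


print(build_source_path("silver", "MineDB", "tblBitSize", "parquet", "_dbo_"))
```

In the notebook this would be called as load_source_views(spark), which also covers the createOrReplaceTempView step for the listed tables.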
Convert dataframes to temporary views that can be used in SQL
DF_tblBitSize.createOrReplaceTempView("tblBitSize")
DF_tblDailyReport.createOrReplaceTempView("tblDailyReport")
DF_tblDailyReportHole.createOrReplaceTempView("tblDailyReportHole")
DF_tblDailyReportHoleActivity.createOrReplaceTempView("tblDailyReportHoleActivity")
DF_tblDailyReportHoleActivityHours.createOrReplaceTempView("tblDailyReportHoleActivityHours")
DF_tblDailyReportShift.createOrReplaceTempView("tblDailyReportShift")
DF_tblDrill.createOrReplaceTempView("tblDrill")
DF_tblDrillType.createOrReplaceTempView("tblDrillType")
DF_tblEmployee.createOrReplaceTempView("tblEmployee")
DF_tblHole.createOrReplaceTempView("tblHole")
DF_tblMineProject.createOrReplaceTempView("tblMineProject")
DF_tblShift.createOrReplaceTempView("tblShift")
DF_tblUnit.createOrReplaceTempView("tblUnit")
DF_tblUnitType.createOrReplaceTempView("tblUnitType")
DF_tblWorkSubCategory.createOrReplaceTempView("tblWorkSubCategory")
DF_tblWorkSubCategoryType.createOrReplaceTempView("tblWorkSubCategoryType")
DF_v_Dashboards_CompanyContracts.createOrReplaceTempView("v_Dashboards_CompanyContracts")
DF_v_DailyReportShiftDrillers.createOrReplaceTempView("v_DailyReportShiftDrillers")
DF_v_ActivityCharges.createOrReplaceTempView("v_ActivityCharges")
Load latest data into views
When an existing record is updated (or soft deleted) in the source system table, Azure Data Factory captures that change by creating an incremental parquet file. The same occurs when a new record is created. During the merge process, all of the incremental files are merged into one parquet file. For an existing record that was updated (or soft deleted), the merge keeps both versions of that record, appending the latest one. If you were to query the merged parquet file, you would see a duplicate record.
Therefore, to see only the latest version of that record, we need to remove the prior version. This function ensures that we are looking at the most up-to-date version of every record.
** Special note: this logic is not necessary for tables with records that do not get soft deleted (e.g. tables without a LastModDateTime or ActiveInd column), therefore, we do not apply this function to those tables
DF_tblBitSize = fn_ReadLatestVrsn("BitSizeID","LastModDateTime","tblBitSize")
DF_tblDailyReport = fn_ReadLatestVrsn("DailyReportID","LastModDateTime","tblDailyReport")
DF_tblDailyReportHole = fn_ReadLatestVrsn("DailyReportHoleID","LastModDateTime","tblDailyReportHole")
DF_tblDailyReportHoleActivity = fn_ReadLatestVrsn("DailyReportHoleActivityID","LastModDateTime","tblDailyReportHoleActivity")
DF_tblDailyReportHoleActivityHours = fn_ReadLatestVrsn("DailyReportHoleActivityHoursID","LastModDateTime","tblDailyReportHoleActivityHours")
DF_tblDailyReportShift = fn_ReadLatestVrsn("DailyReportShiftID","LastModDateTime","tblDailyReportShift")
DF_tblDrill = fn_ReadLatestVrsn("DrillID","LastModDateTime","tblDrill")
DF_tblEmployee = fn_ReadLatestVrsn("EmployeeID","LastModDateTime","tblEmployee")
DF_tblHole = fn_ReadLatestVrsn("HoleID","LastModDateTime","tblHole")
DF_tblMineProject = fn_ReadLatestVrsn("MineProjectID","LastModDateTime","tblMineProject")
DF_tblShift = fn_ReadLatestVrsn("ShiftID","LastModDateTime","tblShift")
DF_tblWorkSubCategoryType = fn_ReadLatestVrsn("WorkSubCategoryTypeID","LastModDateTime","tblWorkSubCategoryType")
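The "latest version" logic described above can also be written as a single window function instead of the MAX / hash / join sequence. The following is a sketch under stated assumptions, not the notebook's method: latest_version_sql and read_latest_version are hypothetical helpers, and _rn is an illustrative column name. One behavioral difference to be aware of: if two rows share the same key and the same maximum version, this keeps exactly one of them, whereas the join-based function keeps every distinct row.

```python
def latest_version_sql(tbl_name, unique_id, version_id):
    """Build a Spark SQL query that keeps, per key, the row with the highest version."""
    return (
        f"SELECT * FROM ("
        f"SELECT t.*, ROW_NUMBER() OVER ("
        f"PARTITION BY {unique_id} ORDER BY {version_id} DESC) AS _rn "
        f"FROM {tbl_name} t) ranked WHERE _rn = 1"
    )


def read_latest_version(spark, unique_id, version_id, tbl_name):
    """Re-register the temp view with only the latest rows and return the DataFrame."""
    df_latest = spark.sql(latest_version_sql(tbl_name, unique_id, version_id)).drop("_rn")
    df_latest.createOrReplaceTempView(tbl_name)
    return df_latest


print(latest_version_sql("tblBitSize", "BitSizeID", "LastModDateTime"))
```

Whether this is faster than the original on this data would need to be measured; it does avoid computing two MD5 hashes per row and the extra join.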
CTE_UnitConversion
%%sql
CREATE OR REPLACE TEMP VIEW CTE_UnitConversion AS
(
SELECT
u.UnitID
,ut.UnitType
,u.UnitName
,u.UnitAbbr
,COALESCE(CAST(u.Conversion AS FLOAT),1) AS Conversion
FROM
tblUnit u
INNER JOIN tblUnitType ut
ON u.UnitTypeID = ut.UnitTypeID
AND ut.UnitType IN ('Distance','Depth')
UNION
SELECT
-1 AS UnitID
,'Unknown' AS UnitType
,'Unknown' AS UnitName
,'Unknown' AS UnitAbbr
,1 AS Conversion
)
CTE_Dashboards_BaseData
%%sql
CREATE OR REPLACE TEMP VIEW CTE_Dashboards_BaseData AS
(
SELECT
CC.ContractID,
CC.ProjectID,
CAST(DR.ReportDate AS DATE) AS ReportDate,
D.DrillID,
CAST(D.DrillName AS STRING) AS DrillName,
DT.DrillTypeID,
CAST(DT.DrillType AS STRING) AS DrillType,
CAST(NULL AS STRING) AS HoleName,
CAST(S.ShiftName AS STRING) AS ShiftName,
STRING(CONCAT(E.LastName,' ',E.FirstName)) AS Supervisor,
CAST(DRSD.Drillers AS STRING) AS Driller,
CAST(NULL AS FLOAT) AS TotalMeterage,
CAST(NULL AS FLOAT) AS Depth,
CAST(NULL AS STRING) AS DepthUnit,
CAST(NULL AS FLOAT) AS ManHours,
CAST(NULL AS FLOAT) AS Payrollhours,
CAST(NULL AS FLOAT) AS ActivityHours,
CAST(NULL AS FLOAT) AS EquipmentHours,
CAST(NULL AS FLOAT) AS Quantity,
CAST(NULL AS STRING) AS Category,
CAST(NULL AS STRING) AS SubCategory,
CAST(NULL AS STRING) AS HoursType,
CAST(NULL AS STRING) AS BitSize,
CAST(DRS.DailyReportShiftID AS BIGINT) AS DailyReportShiftID,
CAST(DRS.ShiftID AS INT) AS ShiftID,
CAST(NULL AS TIMESTAMP) AS CompleteDateTime,
CAST(NULL AS STRING) AS HoleCompletionStatus,
CAST(NULL AS STRING) AS Notes,
CAST(NULL AS INT) AS HoleID,
CAST(NULL AS FLOAT) AS DistanceFrom,
CAST(NULL AS FLOAT) AS DistanceTo,
CAST(NULL AS STRING) AS DistanceFromToUnit,
CAST(NULL AS FLOAT) AS Distance,
CAST(NULL AS STRING) AS DistanceUnit,
CAST(NULL AS STRING) AS FluidUnit,
CAST(NULL AS FLOAT) AS FluidVolume,
CAST(NULL AS STRING) AS UID,
CAST(NULL AS FLOAT) AS MaxDepth,
CAST(NULL AS FLOAT) AS Penetration,
CAST(NULL AS FLOAT) AS Charges,
CAST(DR.Status AS STRING) AS Status,
CAST(DRS.LastModDateTime AS TIMESTAMP) AS LastModDateTime
FROM
v_Dashboards_CompanyContracts CC
LEFT JOIN tblDailyReport DR ON CC.ContractID = DR.ContractID AND CC.ProjectID = DR.ProjectID
LEFT JOIN tblDailyReportShift DRS ON DR.DailyReportID = DRS.DailyReportID
LEFT JOIN tblShift S ON DRS.ShiftID = S.ShiftID
LEFT JOIN tblDrill D ON DR.DrillID = D.DrillID
LEFT JOIN tblDrillType DT ON D.DrillTypeID = DT.DrillTypeID
LEFT JOIN tblEmployee E ON DRS.SupervisorID = E.EmployeeID
LEFT JOIN v_DailyReportShiftDrillers DRSD ON DRS.DailyReportShiftID = DRSD.DailyReportShiftID
WHERE
DR.Status <> 'Deleted'
)
CTE_DailyReportHoleActivityManHours
%%sql
CREATE OR REPLACE TEMP VIEW CTE_DailyReportHoleActivityManHours AS
(
SELECT
DailyReportHoleActivityID
,SUM(HoursAsFloat) AS ManHours
FROM
tblDailyReportHoleActivityHours
WHERE
ActiveInd = 'Y'
GROUP BY
DailyReportHoleActivityID
)
Activity charges
%%sql
CREATE OR REPLACE TEMP VIEW SECTION_1 AS
(
SELECT
BD.ContractID
,BD.ProjectID
,CAST(ReportDate AS DATE) AS ReportDate
,DrillID
,DRHA.Depth
,DPU.UnitAbbr AS DepthUnit
,DPU.UnitID AS DepthUnitID
,DRHAMH.ManHours
,DRHA.ActivityHoursAsFloat AS ActivityHours
,WSC.WorkSubCategoryName AS Category
,WSCT.TypeName AS SubCategory
,CASE
WHEN (COALESCE(AC.Charges,0) = 0 AND COALESCE(AC.BillableCount, 0) = 0) OR DRHA.Billable='N' THEN 'Non-Billable'
WHEN AC.DefinedRateName IS NOT NULL AND DRHA.Billable <> 'N' THEN AC.DefinedRateName
ELSE WSC.WorkSubCategoryName
END AS HoursType
,BS.BitSizeID AS BitSizeID
,BS.BitSize
,DRHA.BitID AS BitID
,BD.DailyReportShiftID
,DRHA.Notes
,H.HoleID
,DRHA.DistanceFrom
,DRHA.DistanceTo
,DFU.UnitAbbr AS DistanceFromToUnit
,DFU.UnitID AS DistanceFromToUnitID
,DRHA.Distance
,DU.UnitID AS DistanceUnitID
,CASE
WHEN WSC.WorkCategoryId = 1 THEN MAX(COALESCE(DRHA.DistanceTo, 0)) OVER ( PARTITION BY H.HoleID, WSC.WorkSubCategoryName ORDER BY H.HoleID, ReportDate, BD.ShiftID, DRHA.SequenceNumber, DRHA.CreateDateTime, DRHA.DistanceTo)
ELSE NULL
END AS MaxDepth
,CASE
WHEN WSC.WorkCategoryId = 1 THEN DRHA.Penetration
ELSE 0
END AS Penetration
,COALESCE(AC.Charges,0) AS Charges
,BD.Status
,H.MineProjectID
,CAST(DRHA.LastModDateTime AS TIMESTAMP) AS LastModDateTime
FROM
CTE_Dashboards_BaseData BD
INNER JOIN tblDailyReportHole DRH ON BD.DailyReportShiftID = DRH.DailyReportShiftID
INNER JOIN tblDailyReportHoleActivity DRHA ON DRH.DailyReportHoleID = DRHA.DailyReportHoleID
INNER JOIN tblWorkSubCategory WSC ON DRHA.WorkSubCategoryID = WSC.WorkSubCategoryID
LEFT JOIN tblHole H ON DRH.HoleID = H.HoleID
LEFT JOIN tblBitSize BS ON DRHA.BitSizeID = BS.BitSizeID
LEFT JOIN tblUnit DPU ON DRHA.DepthUnitID = DPU.UnitID
LEFT JOIN tblUnit DFU ON DRHA.DistanceFromToUnitID = DFU.UnitID
LEFT JOIN tblUnit DU ON DRHA.DistanceUnitID = DU.UnitID
LEFT JOIN tblWorkSubCategoryType WSCT ON DRHA.TypeID = WSCT.WorkSubCategoryTypeID
LEFT JOIN v_ActivityCharges AC ON DRHA.DailyReportHoleActivityID = AC.DailyReportHoleActivityID
LEFT JOIN CTE_DailyReportHoleActivityManHours DRHAMH ON DRHA.DailyReportHoleActivityID = DRHAMH.DailyReportHoleActivityID
WHERE
DRH.ActiveInd = 'Y'
AND DRHA.ActiveInd = 'Y'
)
Create FACT_Activity table
df = spark.sql("""
SELECT
ReportDate
,DrillingCompanyID
,MiningCompanyID
,DrillID
,ProjectID
,ContractID
,LocationID
,HoleID
,DailyReportShiftId
,MineProjectID
,BitID
,TRIM(UPPER(BitSize)) AS BitSize
,-1 AS TimesheetId
,CurrencyID
,TRIM(UPPER(Category)) AS Category
,TRIM(UPPER(SubCategory)) AS SubCategory
,TRIM(UPPER(HoursType)) AS HoursType
,TRIM(UPPER(Notes)) AS Notes
,ApprovalStatus
,Depth AS Depth
,(Depth/COALESCE(Depth.Conversion,1)) AS DepthMeters
,Manhours
,ActivityHours
,DistanceFrom
,DistanceTo
,Distance
,Penetration
,(DistanceFrom/Distance.Conversion) AS DistanceFromMeters
,(DistanceTo/Distance.Conversion) AS DistanceToMeters
,(Distance/Distance.Conversion) AS DistanceMeters
,(Penetration/Distance.Conversion) AS PenetrationMeters
,DepthUnitID
,DistanceFromToUnitID
,Charges
,LastModDateTime
,ReportApprovalRequired
FROM
(
SELECT
COALESCE(CAST(ReportDate AS DATE),'01/01/1900') AS ReportDate
,COALESCE(DrillingCompanyID,-1) AS DrillingCompanyID
,COALESCE(MiningCompanyID,-1) AS MiningCompanyID
,COALESCE(DrillID,-1) AS DrillID
,COALESCE(C.ProjectID, -1) AS ProjectID
,COALESCE(C.ContractID,-1) AS ContractID
,COALESCE(C.LocationID,-1) AS LocationID
,COALESCE(HoleID,-1) AS HoleID
,COALESCE(DailyReportShiftID,-1) AS DailyReportShiftId
,COALESCE(MP.MineProjectID,-1) AS MineProjectID
,COALESCE(BitID,-1) AS BitID
,COALESCE(BitSize,'UNKNOWN') AS BitSize
,COALESCE(DepthUnitID,-1) AS DepthUnitID
,COALESCE(DistanceFromToUnitID,-1) AS DistanceFromToUnitID
,COALESCE(DistanceUnitID,-1) AS DistanceUnitID
,COALESCE(C.CurrencyID,-1) AS CurrencyID
,COALESCE(Category,'Unknown') AS Category
,COALESCE(SubCategory,'UNKNOWN') AS SubCategory
,COALESCE(HoursType,'UNKNOWN') AS HoursType
,SUBSTRING(Notes,0,250) AS Notes
,COALESCE(U.Status,'Unknown') AS ApprovalStatus
,COALESCE(Depth,0) AS Depth
,COALESCE(Manhours,0) AS Manhours
,COALESCE(ActivityHours,0) AS ActivityHours
,COALESCE(DistanceFrom,0) AS DistanceFrom
,COALESCE(DistanceTo,0) AS DistanceTo
,COALESCE(Distance,0) AS Distance
,COALESCE(Penetration,0) AS Penetration
,COALESCE(Charges,0) AS Charges
,COALESCE(CAST(U.LastModDateTime AS TIMESTAMP),'1900/01/01 00:00:00') AS LastModDateTime
,C.ReportApprovalRequired
FROM
SECTION_1 U
LEFT JOIN v_Dashboards_CompanyContracts C ON U.ContractID = C.ContractID AND COALESCE(U.ProjectID,-1) = C.ProjectID
LEFT JOIN tblMineProject MP ON U.MineProjectID = MP.MineProjectID AND MP.ActiveInd = 'Y'
) TBL1
INNER JOIN CTE_UnitConversion Distance ON tbl1.DistanceFromToUnitID = Distance.UnitID
INNER JOIN CTE_UnitConversion Depth ON tbl1.DepthUnitID = Depth.UnitID
""")
Create the table and write to the datalake
tblName = "fact_activity"
fldrName = "myfolder"
dbName = "mydatabase"
path = fn_Path(fldrName,dbName,tblName)
path
# Reduce the number of parquet files written using coalesce and write the dataframe to the datalake
df.coalesce(1).write.format("parquet").mode("overwrite").save(path)
# Drop the table (only dropping the metadata) if it exists in the lakehouse database
spark.sql(f"DROP TABLE IF EXISTS {dbName}.{tblName}")
# Now create the table (metadata only) and point it at the data in the datalake
spark.sql(f"CREATE TABLE {dbName}.{tblName} USING PARQUET LOCATION '{path}'")
Release SQL views from memory
%%sql
DROP VIEW SECTION_1;
DROP VIEW CTE_DailyReportHoleActivityManHours;
DROP VIEW CTE_Dashboards_BaseData;
DROP VIEW CTE_UnitConversion;
DROP VIEW tblBitSize;
DROP VIEW tblDailyReport;
DROP VIEW tblDailyReportHole;
DROP VIEW tblDailyReportHoleActivity;
DROP VIEW tblDailyReportHoleActivityHours;
DROP VIEW tblDailyReportShift;
DROP VIEW tblDrill;
DROP VIEW tblEmployee;
DROP VIEW tblHole;
DROP VIEW tblMineProject;
DROP VIEW tblShift;
Release data frames from memory
del DF_tblBitSize
del DF_tblDailyReport
del DF_tblDailyReportHole
del DF_tblDailyReportHoleActivity
del DF_tblDailyReportHoleActivityHours
del DF_tblDailyReportShift
del DF_tblDrill
del DF_tblDrillType
del DF_tblEmployee
del DF_tblHole
del DF_tblMineProject
del DF_tblShift
del DF_tblUnit
del DF_tblUnitType
del DF_tblWorkSubCategory
del DF_v_Dashboards_CompanyContracts
del DF_v_DailyReportShiftDrillers
del DF_v_ActivityCharges
Apart from the "Release SQL views from memory" and "Release data frames from memory" steps, everything looks fine.
If your application needs to query the data frequently and requires views, you can create an EXTERNAL TABLE in a dedicated SQL pool and save the views over those tables using Synapse SQL. This would be more efficient, and there would be no need to drop views and release dataframes every time you need the data.
You can also create and use native external tables with SQL pools in Azure Synapse Analytics. Native external tables perform better than external tables with TYPE=HADOOP in their external data source definition, because they use native code to access external data.
You can also refer to "Best practices for serverless SQL pool in Azure Synapse Analytics" for more details on performance optimization.
Related
Databricks DLT reading a table from one schema(bronze), process CDC data and store to another schema (processed)
I am developing an ETL pipeline using Databricks DLT pipelines for CDC data that I receive from Kafka. I have created 2 pipelines successfully, for the landing and raw zones. The raw one has an operation flag and a sequence column, and I would like to process the CDC and store the clean data in the processed layer (SCD type 1). I am having difficulty reading a table from one schema, applying the CDC changes, and loading to the target db schema tables. I have 100 plus tables, so I am planning to loop through the tables in the RAW layer, apply CDC, and move them to the processed layer. Following is the code that I have tried (I have left the commented code just for your reference).

import dlt
from pyspark.sql.functions import *
from pyspark.sql.types import *

raw_db_name = "raw_db"
processed_db_name = "processed_db_name"

def generate_curated_table(src_table_name, tgt_table_name, df):
    # @dlt.view(
    #     name= src_table_name,
    #     spark_conf={
    #         "pipelines.incompatibleViewCheck.enabled": "false"
    #     },
    #     comment="Processed data for " + str(src_table_name)
    # )
    # # def create_target_table():
    # #     return (df)

    # dlt.create_target_table(name=tgt_table_name,
    #     comment= f"Clean, merged {tgt_table_name}",
    #     #partition_cols=["topic"],
    #     table_properties={
    #         "quality": "silver"
    #     }
    # )

    # @dlt.view
    # def users():
    #     return spark.readStream.format("delta").table(src_table_name)

    @dlt.view
    def raw_tbl_data():
        return df

    dlt.create_target_table(name=tgt_table_name,
        comment="Clean, merged customers",
        table_properties={
            "quality": "silver"
        })

    dlt.apply_changes(
        target = tgt_table_name,
        source = f"{raw_db_name}.raw_tbl_data,
        keys = ["id"],
        sequence_by = col("timestamp_ms"),
        apply_as_deletes = expr("op = 'DELETE'"),
        apply_as_truncates = expr("op = 'TRUNCATE'"),
        except_column_list = ["id", "timestamp_ms"],
        stored_as_scd_type = 1
    )
    return

tbl_name = 'raw_po_details'
df = spark.sql(f'select * from {raw_dbname}.{tbl_name}')
processed_tbl_name = tbl_name.replace("raw", "processed") //processed_po_details
generate_curated_table(tbl_name, processed_tbl_name, df)

I have tried dlt.view(), dlt.table(), dlt.create_streaming_live_table(), and dlt.create_target_table(), but I end up with one of the following errors:

AttributeError: 'function' object has no attribute '_get_object_id'
pyspark.sql.utils.AnalysisException: Failed to read dataset '<raw_db_name.mytable>'. Dataset is not defined in the pipeline.

Expected result: read the dataframe that is passed as a parameter (RAW_DB) and create new tables in PROCESSED_DB, which is configured in the DLT pipeline settings.

https://www.databricks.com/blog/2022/04/27/how-uplift-built-cdc-and-multiplexing-data-pipelines-with-databricks-delta-live-tables.html
https://cprosenjit.medium.com/databricks-delta-live-tables-job-workflows-orchestration-patterns-bc7643935299

I would appreciate any help. Thanks in advance.
I found the solution myself and got it working, thanks to all. I am adding my solution so it can serve as a reference for others.

import dlt
from pyspark.sql.functions import *
from pyspark.sql.types import *

def generate_silver_tables(target_table, source_table):

    @dlt.table
    def customers_filteredB():
        return spark.table("my_raw_db.myraw_table_name")

    ### Create the target table definition
    dlt.create_target_table(name=target_table,
        comment= f"Clean, merged {target_table}",
        #partition_cols=["topic"],
        table_properties={
            "quality": "silver",
            "pipelines.autoOptimize.managed": "true"
        }
    )

    ## Do the merge
    dlt.apply_changes(
        target = target_table,
        source = "customers_filteredB",
        keys = ["id"],
        apply_as_deletes = expr("operation = 'DELETE'"),
        sequence_by = col("timestamp_ms"),  # primary key, auto-incrementing ID of any kind that can be used to identify the order of events, or a timestamp
        ignore_null_updates = False,
        except_column_list = ["operation", "timestamp_ms"],
        stored_as_scd_type = "1"
    )
    return

raw_dbname = "raw_db"
raw_tbl_name = 'raw_table_name'
processed_tbl_name = raw_tbl_name.replace("raw", "processed")
generate_silver_tables(processed_tbl_name, raw_tbl_name)
Can't make apache beam write outputs to bigquery when using DataflowRunner
I'm trying to understand why this pipeline writes no output to BigQuery. What I'm trying to achieve is to calculate the USD index for the last 10 years, starting from observations of different currency pairs.
All the data is in BigQuery and I need to organize it and sort it chronologically (if there is a better way to achieve this, I'm glad to read it, because I think this might not be the optimal way to do it).
The idea behind the class Currencies() is to group (and keep) the last observation of each currency pair (e.g. EURUSD), update all currency pair values as they "arrive", sort them chronologically, and finally get the open, high, low and close value of the USD index for that day.
This code works in my Jupyter notebook and in Cloud Shell using DirectRunner, but when I use DataflowRunner it does not write any output. To see if I could figure it out, I tried just creating the data with beam.Create() and writing it to BigQuery (which worked), and also just reading something from BQ and writing it to another table (which also worked), so my best guess is that the problem is in the beam.CombineGlobally part, but I don't know what it is.
The code is as follows:

import logging
import collections
import apache_beam as beam
from datetime import datetime

SYMBOLS = ['usdjpy', 'usdcad', 'usdchf', 'eurusd', 'audusd', 'nzdusd', 'gbpusd']
TABLE_SCHEMA = "date:DATETIME,index:STRING,open:FLOAT,high:FLOAT,low:FLOAT,close:FLOAT"

class Currencies(beam.CombineFn):
    def create_accumulator(self):
        return {}

    def add_input(self,accumulator,inputs):
        logging.info(inputs)
        date,currency,bid = inputs.values()
        if '.' not in date:
            date = date+'.0'
        date = datetime.strptime(date,'%Y-%m-%dT%H:%M:%S.%f')
        data = currency+':'+str(bid)
        accumulator[date] = [data]
        return accumulator

    def merge_accumulators(self,accumulators):
        merged = {}
        for accum in accumulators:
            ordered_data = collections.OrderedDict(sorted(accum.items()))
            prev_date = None
            for date,date_data in ordered_data.items():
                if date not in merged:
                    merged[date] = {}
                if prev_date is None:
                    prev_date = date
                else:
                    prev_data = merged[prev_date]
                    merged[date].update(prev_data)
                    prev_date = date
                for data in date_data:
                    currency,bid = data.split(':')
                    bid = float(bid)
                    currency = currency.lower()
                    merged[date].update({
                        currency:bid
                    })
        return merged

    def calculate_index_value(self,data):
        return data['usdjpy']*data['usdcad']*data['usdchf']/(data['eurusd']*data['audusd']*data['nzdusd']*data['gbpusd'])

    def extract_output(self,accumulator):
        ordered = collections.OrderedDict(sorted(accumulator.items()))
        index = {}
        for dt,currencies in ordered.items():
            if not all([symbol in currencies.keys() for symbol in SYMBOLS]):
                continue
            date = str(dt.date())
            index_value = self.calculate_index_value(currencies)
            if date not in index:
                index[date] = {
                    'date':date,
                    'index':'usd',
                    'open':index_value,
                    'high':index_value,
                    'low':index_value,
                    'close':index_value
                }
            else:
                max_value = max(index_value,index[date]['high'])
                min_value = min(index_value,index[date]['low'])
                close_value = index_value
                index[date].update({
                    'high':max_value,
                    'low':min_value,
                    'close':close_value
                })
        return index

def main():
    query = """
    select date,currency,bid from data_table
    where date(date) between '2022-01-13' and '2022-01-16'
    and currency like ('%USD%')
    """

    options = beam.options.pipeline_options.PipelineOptions(
        temp_location = 'gs://PROJECT/temp',
        project = 'PROJECT',
        runner = 'DataflowRunner',
        region = 'REGION',
        num_workers = 1,
        max_num_workers = 1,
        machine_type = 'n1-standard-1',
        save_main_session = True,
        staging_location = 'gs://PROJECT/stag'
    )

    with beam.Pipeline(options = options) as pipeline:
        inputs = (pipeline
            | 'Read From BQ' >> beam.io.ReadFromBigQuery(query=query,use_standard_sql=True)
            | 'Accumulate' >> beam.CombineGlobally(Currencies())
            | 'Flat' >> beam.ParDo(lambda x: x.values())
            | beam.io.Write(beam.io.WriteToBigQuery(
                table = 'TABLE',
                dataset = 'DATASET',
                project = 'PROJECT',
                schema = TABLE_SCHEMA))
        )

if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    main()

The way I execute this is from the shell, using python3 -m first_script (is this the way I should run these batch jobs?). What am I missing or doing wrong? This is my first attempt to use Dataflow, so I'm probably making several of the mistakes in the book.
For whom it may help: I faced a similar problem. I had already used the same code for a different flow that had Pub/Sub as input, where it worked flawlessly, but with a file-based input it simply did not. After a lot of experimenting, I changed the flag in the options from options = PipelineOptions(streaming=True, .. to options = PipelineOptions(streaming=False, because a file is of course not a streaming source; it is a bounded source, a batch. After I set this flag to False, I found my rows in the BigQuery table. Once it had finished, it even stopped the pipeline, as expected for a batch operation. Hope this helps.
Airflow Pipeline to read CSVs and load into PostgreSQL
I am trying to write an Airflow DAG to 1) read a few different CSVs from my local desk, 2) create different PostgreSQL tables, and 3) load the files into their respective tables. When I run the DAG, the second step seems to fail.
Below is the code for the DAG's operators:

AIRFLOW_HOME = os.getenv('AIRFLOW_HOME')

def get_listings_data ():
    listings = pd.read_csv(AIRFLOW_HOME + '/dags/data/listings.csv')
    return listings

def get_g01_data ():
    demographics= pd.read_csv(AIRFLOW_HOME + '/dags/data/demographics.csv')
    return demographics

def insert_listing_data_func(**kwargs):
    ps_pg_hook = PostgresHook(postgres_conn_id="postgres")
    conn_ps = ps_pg_hook.get_conn()
    ti = kwargs['ti']
    insert_df = pd.DataFrame.listings
    if len(insert_df) > 0:
        col_names = ['host_id', 'host_name', 'host_neighbourhood', 'host_total_listings_count', 'neighbourhood_cleansed', 'property_type', 'price', 'has_availability', 'availability_30']
        values = insert_df[col_names].to_dict('split')
        values = values['data']
        logging.info(values)
        insert_sql = """
            INSERT INTO assignment_2.listings (host_name, host_neighbourhood, host_total_listings_count, neighbourhood_cleansed, property_type, price, has_availability, availability_30)
            VALUES %s
            """
        result = execute_values(conn_ps.cursor(), insert_sql, values, page_size=len(insert_df))
        conn_ps.commit()
    else:
        None
    return None

def insert_demographics_data_func(**kwargs):
    ps_pg_hook = PostgresHook(postgres_conn_id="postgres")
    conn_ps = ps_pg_hook.get_conn()
    ti = kwargs['ti']
    insert_df = pd.DataFrame.demographics
    if len(insert_df) > 0:
        col_names = ['LGA', 'Median_age_persons', 'Median_mortgage_repay_monthly', 'Median_tot_prsnl_inc_weekly', 'Median_rent_weekly', 'Median_tot_fam_inc_weekly', 'Average_num_psns_per_bedroom', 'Median_tot_hhd_inc_weekly', 'Average_household_size']
        values = insert_df[col_names].to_dict('split')
        values = values['data']
        logging.info(values)
        insert_sql = """
            INSERT INTO assignment_2.demographics (LGA,Median_age_persons,Median_mortgage_repay_monthly,Median_tot_prsnl_inc_weekly,Median_rent_weekly,Median_tot_fam_inc_weekly,Average_num_psns_per_bedroom,Median_tot_hhd_inc_weekly,Average_household_size)
            VALUES %s
            """
        result = execute_values(conn_ps.cursor(), insert_sql, values, page_size=len(insert_df))
        conn_ps.commit()
    else:
        None
    return None

And my PostgreSQL hook for the demographics table (just an example) is below:

create_psql_table_demographics= PostgresOperator(
    task_id="create_psql_table_demographics",
    postgres_conn_id="postgres",
    sql="""
        CREATE TABLE IF NOT EXISTS postgres.demographics (
            LGA VARCHAR,
            Median_age_persons INT,
            Median_mortgage_repay_monthly INT,
            Median_tot_prsnl_inc_weekly INT,
            Median_rent_weekly INT,
            Median_tot_fam_inc_weekly INT,
            Average_num_psns_per_bedroom DECIMAL(10,1),
            Median_tot_hhd_inc_weekly INT,
            Average_household_size DECIMAL(10,2)
        );
    """,
    dag=dag)

Am I missing something in my code that stops create_psql_table_demographics from completing successfully on Airflow?
If your PostgreSQL database has access to the CSV files, you can simply use the copy_expert method of the PostgresHook class (see the documentation). PostgreSQL is pretty efficient at loading flat files: you'll save a lot of CPU cycles by not involving Python (and Pandas!), not to mention the potential encoding issues that you would otherwise have to address.
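A minimal sketch of that suggestion, assuming the CSV has a header row and that its columns line up with the target table. build_copy_sql and load_csv_with_copy are hypothetical helper names; copy_expert(sql, filename) is the PostgresHook method mentioned above. The hook is passed in as a parameter so the sketch stays free of Airflow imports; inside a DAG task it would be PostgresHook(postgres_conn_id="postgres").

```python
def build_copy_sql(table, columns):
    """Build a COPY ... FROM STDIN statement for a CSV file with a header row."""
    column_list = ", ".join(columns)
    return f"COPY {table} ({column_list}) FROM STDIN WITH (FORMAT csv, HEADER true)"


def load_csv_with_copy(hook, table, columns, csv_path):
    """Stream csv_path into table through the hook's copy_expert method."""
    sql = build_copy_sql(table, columns)
    hook.copy_expert(sql, csv_path)
    return sql


print(build_copy_sql("assignment_2.demographics", ["LGA", "Median_age_persons"]))
```

The column list is spelled out so that a CSV whose columns are a subset of the table's still loads into the intended columns.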
Creating a Table(array) of Records
If I wanted to store records from two files into a table (an array of records), could I use a format similar to the code below, and just put both file names in the def, like def readTable(log1,log2):, and then use the same code for both log1 and log2, allowing it to make a table1 and a table2?

def readTable(fileName):
    s = Scanner(fileName)
    table = []
    record = readRecord(s)
    while (record != ""):
        table.append(record)
        record = readRecord(s)
    s.close()
    return table
Just use *args, and get a list of tables?

def readTable(*args):
    tables = []
    for filename in args:
        s = Scanner(filename)
        table = []
        record = readRecord(s)
        while (record != ""):
            table.append(record)
            record = readRecord(s)
        s.close()
        tables.append(table)
    return tables

This way, you can pass log1, log2, log3 (any number of logs you like) and get back a list containing one table per log.
Since readTable returns a list, if you want to concatenate the records from 2 logs, use the + operator:

readTable(log1) + readTable(log2)
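Since Scanner and readRecord are not shown in the question, here is a self-contained sketch of both suggestions that uses plain file reading instead. It assumes one record per line, which may not match the real record format; read_table and read_tables are hypothetical names.

```python
import os
import tempfile


def read_table(file_name):
    """Return the non-empty lines of one file as a list of records."""
    with open(file_name) as handle:
        return [line.rstrip("\n") for line in handle if line.strip()]


def read_tables(*file_names):
    """Return one table (list of records) per file name, in argument order."""
    return [read_table(name) for name in file_names]


# Small demonstration with two throwaway log files.
with tempfile.TemporaryDirectory() as folder:
    log1 = os.path.join(folder, "log1.txt")
    log2 = os.path.join(folder, "log2.txt")
    with open(log1, "w") as handle:
        handle.write("a\nb\n")
    with open(log2, "w") as handle:
        handle.write("c\n")
    table1, table2 = read_tables(log1, log2)          # two separate tables
    combined = read_table(log1) + read_table(log2)    # one concatenated table
    print(table1, table2, combined)
```

Use read_tables when the tables should stay separate, and the + form when a single combined table is wanted.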
SQLAlchemy session query with INSERT IGNORE
I'm trying to do a bulk insert/update with SQLAlchemy. Here's a snippet:

for od in clist:
    where = and_(Offer.network_id==od['network_id'], Offer.external_id==od['external_id'])
    o = session.query(Offer).filter(where).first()
    if not o:
        o = Offer()
    o.network_id = od['network_id']
    o.external_id = od['external_id']
    o.title = od['title']
    o.updated = datetime.datetime.now()
    payout = od['payout']
    countrylist = od['countries']
    session.add(o)
    session.flush()
    for country in countrylist:
        c = session.query(Country).filter(Country.name==country).first()
        where = and_(OfferPayout.offer_id==o.id, OfferPayout.country_name==country)
        opayout = session.query(OfferPayout).filter(where).first()
        if not opayout:
            opayout = OfferPayout()
        opayout.offer_id = o.id
        opayout.payout = od['payout']
        if c:
            opayout.country_id = c.id
            opayout.country_name = country
        else:
            opayout.country_id = 0
            opayout.country_name = country
        session.add(opayout)
    session.flush()

It looks like my issue was touched on here, http://www.mail-archive.com/sqlalchemy#googlegroups.com/msg05983.html, but I don't know how to use "textual clauses" with session query objects and couldn't find much (though admittedly I haven't had as much time as I'd like to search).
I'm new to SQLAlchemy and I'd imagine there are some issues in the code besides the fact that it throws an exception on a duplicate key. For example, doing a flush after every iteration of clist (but I don't know how else to get the o.id value that is used in the subsequent OfferPayout inserts).
Guidance on any of these issues is very much appreciated.
The way you should be doing these things is with session.merge().
You should also be using your objects' relationship properties. So the o above should have o.offerpayout, which is a list (of objects), and your offerpayout should have an offerpayout.country property, which is the related Country object.
So the above would look something like:

for od in clist:
    o = Offer()
    o.network_id = od['network_id']
    o.external_id = od['external_id']
    o.title = od['title']
    o.updated = datetime.datetime.now()
    payout = od['payout']
    countrylist = od['countries']
    for country in countrylist:
        opayout = OfferPayout()
        opayout.payout = od['payout']
        country_obj = Country()
        country_obj.name = country
        opayout.country = country_obj
        o.offerpayout.append(opayout)
    session.merge(o)
session.flush()

This should work as long as all the primary keys are correct (i.e. the country table has a primary key of name). Merge essentially checks the primary keys and, if they are present, merges your object with the one in the database (it will also cascade down the joins).