I have below list of dictionaries
results =
[
{
"type:"check_datatype",
"kwargs":{
"table":"cars","column_name":"vin","d_type":"string"
}
},
{
"type":"check_emptystring",
"kwargs":{
"table":"cars","column_name":"vin"
}
},
{
"type:"check_null",
"kwargs":{
"table":"cars","columns":["vin","index"]
}
}
]
I want to create two different pyspark dataframe with below schema -
args_id column in results table will be same when we have unique pair of (type,kwargs). This JSON has to be run on a daily basis and hence if it find out same pair of (type,kwargs) again, it should give the same args_id value.
Till now, i have written this code -
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import Window
check_type_results = [[elt['type']] for elt in results]
checkColumns = ['type']
spark = SparkSession.builder.getOrCreate()
checkResultsDF = spark.createDataFrame(data=check_type_results, schema=checkColumns)
checkResultsDF = checkResultsDF.withColumn("time", F.current_timestamp())
checkResultsDF = checkResultsDF.withColumn("args_id", F.row_number().over(Window.orderBy(F.monotonically_increasing_id())))
checkResultsDF.printSchema()
Now, with my code , i am always getting args_id in increasing order which is correct for the first run but if i again run the json on next day or may be on same day and in the json file some pair of (type,kwargs) comes which has already come before so i should be using the same args_id for that pair.
If some pair (type,kwargs) has no entry in Arguments table, then only i will insert into arguments table but if the pair (type,kwargs) already exists in arguments table, then no insert should happen there.
Once these two dataframes are filled properly, then i want to load them into separate delta tables.
Hashcode column in arguments table is unique identifier for each "kwargs".
Issues
Your schema is a bit incomplete. A more detailed schema will allow you to take advantage of more spark features. See Solution below using spark-sql and pyspark. Instead of window functions that require ordered partitions, you may take advantage of a few of the table generating array functions such as explode and posexplode available in spark-sql. As it pertains to writing to your delta table, you may see examples here
Solution 1 : Using Spark SQL
Setup
from pyspark.sql.types import ArrayType,StructType, StructField, StringType, MapType
from pyspark.sql import Row, SparkSession
sparkSession = SparkSession.builder.appName("Demo").getOrCreate()
Schema Definition
Your sample record is an Array of Structs/Objects where the kwargs is a Maptype with optional keys. NB. The True indicates optional and should assist when there are missing keys or entries with different formats
schema = StructType([
StructField("entry",ArrayType(
StructType([
StructField("type",StringType(),True),
StructField("kwargs",MapType(StringType(),StringType()),True)
])
),True)
])
Reproducible Example
result_entry =[
{
"type":"check_datatype",
"kwargs":{
"table":"cars","column_name":"vin","d_type":"string"
}
},
{
"type":"check_emptystring",
"kwargs":{
"table":"cars","column_name":"vin"
}
},
{
"type":"check_null",
"kwargs":{
"table":"cars","columns":["vin","index"]
}
}
]
df_results = sparkSession.createDataFrame([Row(entry=result_entry)],schema=schema)
df_results.createOrReplaceTempView("df_results")
df_results.show()
Results
+--------------------+
| entry|
+--------------------+
|[{check_datatype,...|
+--------------------+
Results Table Generation
I've used current_date to capture the current date however you may change this based on your pipeline.
results_table = sparkSession.sql("""
WITH raw_results as (
SELECT
posexplode(entry),
current_date as time
FROM
df_results
)
SELECT
col.type as Type,
time,
pos as arg_id
FROM
raw_results
""")
results_table.show()
Results
+-----------------+----------+------+
| Type| time|arg_id|
+-----------------+----------+------+
| check_datatype|2021-03-31| 0|
|check_emptystring|2021-03-31| 1|
| check_null|2021-03-31| 2|
+-----------------+----------+------+
Arguments Table Generation
args_table = sparkSession.sql("""
WITH raw_results as (
SELECT
posexplode(entry)
FROM
df_results
),
raw_arguments AS (
SELECT
explode(col.kwargs),
pos as args_id
FROM
raw_results
),
raw_arguments_before_array_check AS (
SELECT
args_id,
key as bac_key,
value as bac_value
FROM
raw_arguments
),
raw_arguments_after_array_check AS (
SELECT
args_id,
bac_key,
bac_value,
posexplode(split(regexp_replace(bac_value,"[\\\[\\\]]",""),","))
FROM
raw_arguments_before_array_check
)
SELECT
args_id,
bac_key as key,
col as value,
CASE
WHEN bac_value LIKE '[%' THEN pos
ELSE NULL
END as list_index,
abs(hash(args_id, bac_key,col,pos)) as hashcode
FROM
raw_arguments_after_array_check
""")
args_table.show()
Results
+-------+-----------+------+----------+----------+
|args_id| key| value|list_index| hashcode|
+-------+-----------+------+----------+----------+
| 0| d_type|string| null| 216841494|
| 0|column_name| vin| null| 502458545|
| 0| table| cars| null|1469121505|
| 1|column_name| vin| null| 604007568|
| 1| table| cars| null| 784654488|
| 2| columns| vin| 0|1503105124|
| 2| columns| index| 1| 454389776|
| 2| table| cars| null| 858757332|
+-------+-----------+------+----------+----------+
Solution 2: Using UDF
You may also define user-defined-functions with your already implemented python logic and apply this with spark
Setup
We will define our functions to create our results and arguments table here. I have chosen to create generator type functions but this is optional.
result_entry =[
{
"type":"check_datatype",
"kwargs":{
"table":"cars","column_name":"vin","d_type":"string"
}
},
{
"type":"check_emptystring",
"kwargs":{
"table":"cars","column_name":"vin"
}
},
{
"type":"check_null",
"kwargs":{
"table":"cars","columns":["vin","index"]
}
}
]
import json
result_entry_str = json.dumps(result_entry)
result_entry_str
def extract_results_table(entry,current_date=None):
if current_date is None:
from datetime import date
current_date = str(date.today())
if type(entry)==str:
import json
entry = json.loads(entry)
for arg_id,arg in enumerate(entry):
yield {
"Type":arg["type"],
"time":current_date,
"args_id":arg_id
}
def extract_arguments_table(entry):
if type(entry)==str:
import json
entry = json.loads(entry)
for arg_id,arg in enumerate(entry):
if "kwargs" in arg:
for arg_entry in arg["kwargs"]:
orig_key,orig_value = arg_entry, arg["kwargs"][arg_entry]
if type(orig_value)==list:
for list_index,value in enumerate(orig_value):
yield {
"args_id":arg_id,
"key":orig_key,
"value":value,
"list_index":list_index,
"hash_code": hash((arg_id,orig_key,value,list_index))
}
else:
yield {
"args_id":arg_id,
"key":orig_key,
"value":orig_value,
"list_index":None,
"hash_code": hash((arg_id,orig_key,orig_value,"null"))
}
Pyspark Setup
from pyspark.sql.functions import udf,col,explode
from pyspark.sql.types import StructType,StructField,IntegerType,StringType, ArrayType
results_table_schema = ArrayType(StructType([
StructField("Type",StringType(),True),
StructField("time",StringType(),True),
StructField("args_id",IntegerType(),True)
]),True)
arguments_table_schema = ArrayType(StructType([
StructField("args_id",IntegerType(),True),
StructField("key",StringType(),True),
StructField("value",StringType(),True),
StructField("list_index",IntegerType(),True),
StructField("hash",StringType(),True)
]),True)
extract_results_table_udf = udf(lambda entry,current_date=None : [*extract_results_table(entry,current_date)],results_table_schema)
extract_arguments_table_udf = udf(lambda entry: [*extract_arguments_table(entry)],arguments_table_schema)
# this is useful if you intend to use your functions in spark-sql
sparkSession.udf.register('extract_results_table',extract_results_table_udf)
sparkSession.udf.register('extract_arguments_table',extract_arguments_table_udf)
Spark Data Frame
df_results_1 = sparkSession.createDataFrame([Row(entry=result_entry_str)],schema="entry string")
df_results_1.createOrReplaceTempView("df_results_1")
df_results_1.show()
Extracting Results Table
# Using Spark SQL
sparkSession.sql("""
WITH results_table AS (
select explode(extract_results_table(entry)) as entry FROM df_results_1
)
SELECT entry.* from results_table
""").show()
# Just python
df_results_1.select(
explode(extract_results_table_udf(df_results_1.entry)).alias("entry")
).selectExpr("entry.*").show()
Output
+-----------------+----------+-------+
| Type| time|args_id|
+-----------------+----------+-------+
| check_datatype|2021-03-31| 0|
|check_emptystring|2021-03-31| 1|
| check_null|2021-03-31| 2|
+-----------------+----------+-------+
+-----------------+----------+-------+
| Type| time|args_id|
+-----------------+----------+-------+
| check_datatype|2021-03-31| 0|
|check_emptystring|2021-03-31| 1|
| check_null|2021-03-31| 2|
+-----------------+----------+-------+
Extracting Results Table
# Using spark sql
sparkSession.sql("""
WITH arguments_table AS (
select explode(extract_arguments_table(entry)) as entry FROM df_results_1
)
SELECT entry.* from arguments_table
""").show()
# Just python
df_results_1.select(
explode(extract_arguments_table_udf(df_results_1.entry)).alias("entry")
).selectExpr("entry.*").show()
Output
+-------+-----------+------+----------+----+
|args_id| key| value|list_index|hash|
+-------+-----------+------+----------+----+
| 0| table| cars| null|null|
| 0|column_name| vin| null|null|
| 0| d_type|string| null|null|
| 1| table| cars| null|null|
| 1|column_name| vin| null|null|
| 2| table| cars| null|null|
| 2| columns| vin| 0|null|
| 2| columns| index| 1|null|
+-------+-----------+------+----------+----+
+-------+-----------+------+----------+----+
|args_id| key| value|list_index|hash|
+-------+-----------+------+----------+----+
| 0| table| cars| null|null|
| 0|column_name| vin| null|null|
| 0| d_type|string| null|null|
| 1| table| cars| null|null|
| 1|column_name| vin| null|null|
| 2| table| cars| null|null|
| 2| columns| vin| 0|null|
| 2| columns| index| 1|null|
+-------+-----------+------+----------+----+
Reference
Spark SQL Functions
Delta Batch Writes
I have one csv file.
D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot,Address
2,66M,J,Rock,F,1995,201211.0,J
3,David,HM,Lee,M,1991,201211.0,J
6,66M,,Rock,F,1990,201211.0,J
0,David,H M,Lee,M,1990,201211.0,B
3,Marc,H,Robert,M,2000,201211.0,C
6,Marc,M,Robert,M,1988,201211.0,C
6,Marc,MS,Robert,M,2000,201211.0,D
I want to assign persons with same last name living in the same address a same ID or index. It's better that ID is made up of only numbers.
If persons have different last name in the same place, then ID should be different.
Such ID should be unique. Namely, people who are different in either address or last name, ID must be different.
My expected output is
D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot,Address,ID
2,66M,J,Rock,F,1995,201211.0,J,11
3,David,HM,Lee,M,1991,201211.0,J,12
6,66M,,Rock,F,1990,201211.0,J,11
0,David,H M,Lee,M,1990,201211.0,B,13
3,Marc,H,Robert,M,2000,201211.0,C,14
6,Marc,M,Robert,M,1988,201211.0,C,14
6,Marc,MS,Robert,M,2000,201211.0,D,15
My datafile size is around 30 GB. I am thinking of using groupBy function in spark based on the key consisting of LNAME and address to group those observations together. Then assign it a ID by key. But I don't know how to do this. After that, maybe I can use flatMap to split the line and return those observations with a ID. But I am not sure about it. In addition, can I also make it in Linux environment? Thank you.
As discussed in the comments, the basic idea is to partition the data properly so that records with the same LNAME+Address stay in the same partition, run Python code to generate separate idx on each partition and then merge them into the final id.
Note: I added some new rows in your sample records, see the result of df_new.show() shown below.
from pyspark.sql import Window, Row
from pyspark.sql.functions import coalesce, sum as fsum, col, max as fmax, lit, broadcast
# ...skip code to initialize the dataframe
# tweak the number of repartitioning N based on actual data size
N = 5
# Python function to iterate through the sorted list of elements in the same
# partition and assign an in-partition idx based on Address and LNAME.
def func(partition_id, it):
idx, lname, address = (1, None, None)
for row in sorted(it, key=lambda x: (x.LNAME, x.Address)):
if lname and (row.LNAME != lname or row.Address != address): idx += 1
yield Row(partition_id=partition_id, idx=idx, **row.asDict())
lname = row.LNAME
address = row.Address
# Repartition based on 'LNAME' and 'Address' and then run mapPartitionsWithIndex()
# function to create in-partition idx. Adjust N so that records in each partition
# should be small enough to be loaded into the executor memory:
df1 = df.repartition(N, 'LNAME', 'Address') \
.rdd.mapPartitionsWithIndex(func) \
.toDF()
Get number of unique rows cnt (based on Address+LNAME) which is max_idx and then grab the running SUM of this rcnt.
# idx: calculated in-partition id
# cnt: number of unique ids in the same partition: fmax('idx')
# rcnt: starting_id for a partition(something like a running count): coalesce(fsum('cnt').over(w1),lit(0))
# w1: WindowSpec to calculate the above rcnt
w1 = Window.partitionBy().orderBy('partition_id').rowsBetween(Window.unboundedPreceding,-1)
df2 = df1.groupby('partition_id') \
.agg(fmax('idx').alias('cnt')) \
.withColumn('rcnt', coalesce(fsum('cnt').over(w1),lit(0)))
df2.show()
+------------+---+----+
|partition_id|cnt|rcnt|
+------------+---+----+
| 0| 3| 0|
| 1| 1| 3|
| 2| 1| 4|
| 4| 1| 5|
+------------+---+----+
Join df1 with df2 and create the final id which is idx + rcnt
df_new = df1.join(broadcast(df2), on=['partition_id']).withColumn('id', col('idx')+col('rcnt'))
df_new.show()
#+------------+-------+---+----+-----+------+------+-----+---+--------+---+----+---+
#|partition_id|Address| D| DOB|FNAME|GENDER| LNAME|MNAME|idx|snapshot|cnt|rcnt| id|
#+------------+-------+---+----+-----+------+------+-----+---+--------+---+----+---+
#| 0| B| 0|1990|David| M| Lee| H M| 1|201211.0| 3| 0| 1|
#| 0| J| 3|1991|David| M| Lee| HM| 2|201211.0| 3| 0| 2|
#| 0| D| 6|2000| Marc| M|Robert| MS| 3|201211.0| 3| 0| 3|
#| 1| C| 3|2000| Marc| M|Robert| H| 1|201211.0| 1| 3| 4|
#| 1| C| 6|1988| Marc| M|Robert| M| 1|201211.0| 1| 3| 4|
#| 2| J| 6|1991| 66M| F| Rek| null| 1|201211.0| 1| 4| 5|
#| 2| J| 6|1992| 66M| F| Rek| null| 1|201211.0| 1| 4| 5|
#| 4| J| 2|1995| 66M| F| Rock| J| 1|201211.0| 1| 5| 6|
#| 4| J| 6|1990| 66M| F| Rock| null| 1|201211.0| 1| 5| 6|
#| 4| J| 6|1990| 66M| F| Rock| null| 1|201211.0| 1| 5| 6|
#+------------+-------+---+----+-----+------+------+-----+---+--------+---+----+---+
df_new = df_new.drop('partition_id', 'idx', 'rcnt', 'cnt')
Some notes:
Practically, you will need to clean-out/normalize the column LNAME and Address before using them as uniqueness check. For example, use a separate column uniq_key which combine LNAME and Address as the unique key of the dataframe. see below for an example with some basic data cleansing procedures:
from pyspark.sql.functions import coalesce, lit, concat_ws, upper, regexp_replace, trim
#(1) convert NULL to '': coalesce(col, '')
#(2) concatenate LNAME and Address using NULL char '\x00' or '\0'
#(3) convert to uppercase: upper(text)
#(4) remove all non-[word/whitespace/NULL_char]: regexp_replace(text, r'[^\x00\w\s]', '')
#(5) convert consecutive whitespaces to a SPACE: regexp_replace(text, r'\s+', ' ')
#(6) trim leading/trailing spaces: trim(text)
df = (df.withColumn('uniq_key',
trim(
regexp_replace(
regexp_replace(
upper(
concat_ws('\0', coalesce('LNAME', lit('')), coalesce('Address', lit('')))
),
r'[^\x00\s\w]+',
''
),
r'\s+',
' '
)
)
))
Then in the code, replace 'LNAME' and 'Address' with uniq_key to find the idx
As mentioned by cronoik in the comment, you can also try one of the Window rank functions to calculate the in-partition idx. for example:
from pyspark.sql.functions import spark_partition_id, dense_rank
# use dense_rank to calculate the in-partition idx
w2 = Window.partitionBy('partition_id').orderBy('LNAME', 'Address')
df1 = df.repartition(N, 'LNAME', 'Address') \
.withColumn('partition_id', spark_partition_id()) \
.withColumn('idx', dense_rank().over(w2))
After you have df1, use the same methods as above to calculate df2 and df_new. This should be faster than using mapPartitionsWithIndex() which is basically an RDD-based method.
For your real data, adjust N to fit your actual data size. this N only influences the initial partitions, after dataframe join, the partition will be reset to default(200). you can adjust this using spark.sql.shuffle.partitions for example when you initialize the spark session:
spark = SparkSession.builder \
....
.config("spark.sql.shuffle.partitions", 500) \
.getOrCreate()
Since you have 30GB of input data, you probably don't want something that'll attempt to hold it all in in-memory data structures. Let's use disk space instead.
Here's one approach that loads all your data into a sqlite database, and generates an id for each unique last name and address pair, and then joins everything back up together:
#!/bin/sh
csv="$1"
# Use an on-disk database instead of in-memory because source data is 30gb.
# This will take a while to run.
db=$(mktemp -p .)
sqlite3 -batch -csv -header "${db}" <<EOF
.import "${csv}" people
CREATE TABLE ids(id INTEGER PRIMARY KEY, lname, address, UNIQUE(lname, address));
INSERT OR IGNORE INTO ids(lname, address) SELECT lname, address FROM people;
SELECT p.*, i.id AS ID
FROM people AS p
JOIN ids AS i ON (p.lname, p.address) = (i.lname, i.address)
ORDER BY p.rowid;
EOF
rm -f "${db}"
Example:
$./makeids.sh data.csv
D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot,Address,ID
2,66M,J,Rock,F,1995,201211.0,J,1
3,David,HM,Lee,M,1991,201211.0,J,2
6,66M,"",Rock,F,1990,201211.0,J,1
0,David,"H M",Lee,M,1990,201211.0,B,3
3,Marc,H,Robert,M,2000,201211.0,C,4
6,Marc,M,Robert,M,1988,201211.0,C,4
6,Marc,MS,Robert,M,2000,201211.0,D,5
It's better that ID is made up of only numbers.
If that restriction can be relaxed, you can do it in a single pass by using a cryptographic hash of the last name and address as the ID:
$ perl -MDigest::SHA=sha1_hex -F, -lane '
BEGIN { $" = $, = "," }
if ($. == 1) { print #F, "ID" }
else { print #F, sha1_hex("#F[3,7]") }' data.csv
D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot,Address,ID
2,66M,J,Rock,F,1995,201211.0,J,5c99211a841bd2b4c9cdcf72d7e95e46b2ae08b5
3,David,HM,Lee,M,1991,201211.0,J,c263f9d1feb4dc789de17a8aab8f2808aea2876a
6,66M,,Rock,F,1990,201211.0,J,5c99211a841bd2b4c9cdcf72d7e95e46b2ae08b5
0,David,H M,Lee,M,1990,201211.0,B,e86e81ab2715a8202e41b92ad979ca3a67743421
3,Marc,H,Robert,M,2000,201211.0,C,363ed8175fdf441ed59ac19cea3c37b6ce9df152
6,Marc,M,Robert,M,1988,201211.0,C,363ed8175fdf441ed59ac19cea3c37b6ce9df152
6,Marc,MS,Robert,M,2000,201211.0,D,cf5135dc402efe16cd170191b03b690d58ea5189
Or if the number of unique lname, address pairs is small enough that they can reasonably be stored in a hash table on your system:
#!/usr/bin/gawk -f
BEGIN {
FS = OFS = ","
}
NR == 1 {
print $0, "ID"
next
}
! ($4, $8) in ids {
ids[$4, $8] = ++counter
}
{
print $0, ids[$4, $8]
}
$ sort -t, -k8,8 -k4,4 <<EOD | awk -F, ' $8","$4 != last { ++id; last = $8","$4 }
{ NR!=1 && $9=id; print }' id=9 OFS=,
D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot,Address
2,66M,J,Rock,F,1995,201211.0,J
3,David,HM,Lee,M,1991,201211.0,J
6,66M,,Rock,F,1990,201211.0,J
0,David,H M,Lee,M,1990,201211.0,B
3,Marc,H,Robert,M,2000,201211.0,C
6,Marc,M,Robert,M,1988,201211.0,C
6,Marc,MS,Robert,M,2000,201211.0,D
> EOD
D,FNAME,MNAME,LNAME,GENDER,DOB,snapshot,Address
0,David,H M,Lee,M,1990,201211.0,B,11
3,Marc,H,Robert,M,2000,201211.0,C,12
6,Marc,M,Robert,M,1988,201211.0,C,12
6,Marc,MS,Robert,M,2000,201211.0,D,13
3,David,HM,Lee,M,1991,201211.0,J,14
2,66M,J,Rock,F,1995,201211.0,J,15
6,66M,,Rock,F,1990,201211.0,J,15
$