I am new to Scala and used to work with Python. I want to convert a program from Python to Scala and am having difficulties with the following lines (creating a SQL DataFrame).
Python code:
fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
schema = StructType(fields)
data = dataset.map(lambda (filepath, text): (filepath.split("/")[-1],text, filepath.split("/")[-2]))
df = sqlContext.createDataFrame(data, schema)
I have made this:
Scala code:
val category = dataset.map { case (filepath, text) => filepath.split("/")(6) }
val id = dataset.map { case (filepath, text) => filepath.split("/")(7) }
val text = dataset.map { case (filepath, text) => text }
val schema = StructType(Seq(
StructField(id.toString(), StringType, true),
StructField(category.toString(), StringType, true),
StructField(text.toString(), StringType, true)
))
and now I am blocked there!
For what it is worth, I have converted your code literally, and the following compiles using Spark 2.3.2 on my machine:
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import spark.implicits._
// Introduced to make code clearer
case class FileRecord(name: String, text: String)
// Whatever data set you have (a single record dataset is hard coded, replace with your data)
val dataSet = Seq(FileRecord("/a/b/c/d/e/f/g/h/i", "example contents")).toDS()
// Whatever you need with path length 6 and 7 hardcoded (you might want to change this)
// you may be able to do the following three map operations more efficiently
val category = dataSet.map { case FileRecord(filepath, text) => filepath.split("/")(6) }
val id = dataSet.map { case FileRecord(filepath, text) => filepath.split("/")(7) }
val text = dataSet.map { case FileRecord(filepath, text) => text }
val schema = StructType(Seq(
StructField(id.toString(), StringType, true),
StructField(category.toString(), StringType, true),
StructField(text.toString(), StringType, true)
))
I am developing an ETL pipeline using Databricks DLT pipelines for CDC data that I receive from Kafka. I have successfully created two pipelines, for the landing and raw zones. The raw zone will have an operation flag and a sequence column, and I would like to process the CDC and store the clean data in the processed layer (SCD type 1). I am having difficulties reading a table from one schema, applying the CDC changes, and loading it into the target DB schema tables.
I have 100-plus tables, so I am planning to loop through the tables in the RAW layer, apply CDC, and move them to the processed layer. Following is the code that I have tried (I have left the commented code in just for your reference).
import dlt
from pyspark.sql.functions import *
from pyspark.sql.types import *
raw_db_name = "raw_db"
processed_db_name = "processed_db_name"
def generate_curated_table(src_table_name, tgt_table_name, df):
# #dlt.view(
# name= src_table_name,
# spark_conf={
# "pipelines.incompatibleViewCheck.enabled": "false"
# },
# comment="Processed data for " + str(src_table_name)
# )
# # def create_target_table():
# # return (df)
# dlt.create_target_table(name=tgt_table_name,
# comment= f"Clean, merged {tgt_table_name}",
# #partition_cols=["topic"],
# table_properties={
# "quality": "silver"
# }
# )
# #dlt.view
# def users():
# return spark.readStream.format("delta").table(src_table_name)
@dlt.view
def raw_tbl_data():
return df
dlt.create_target_table(name=tgt_table_name,
comment="Clean, merged customers",
table_properties={
"quality": "silver"
})
dlt.apply_changes(
target = tgt_table_name,
source = f"{raw_db_name}.raw_tbl_data,
keys = ["id"],
sequence_by = col("timestamp_ms"),
apply_as_deletes = expr("op = 'DELETE'"),
apply_as_truncates = expr("op = 'TRUNCATE'"),
except_column_list = ["id", "timestamp_ms"],
stored_as_scd_type = 1
)
return
tbl_name = 'raw_po_details'
df = spark.sql(f'select * from {raw_db_name}.{tbl_name}')
processed_tbl_name = tbl_name.replace("raw", "processed")  # processed_po_details
generate_curated_table(tbl_name, processed_tbl_name, df)
I have tried with dlt.view(), dlt.table(), dlt.create_streaming_live_table(), and dlt.create_target_table(), but I end up with one of the following errors:
AttributeError: 'function' object has no attribute '_get_object_id'
pyspark.sql.utils.AnalysisException: Failed to read dataset '<raw_db_name.mytable>'. Dataset is not defined in the pipeline.
Expected result:
Read the DataFrame which is passed as a parameter (from RAW_DB), and
Create new tables in PROCESSED_DB, which is configured in the DLT pipeline settings.
https://www.databricks.com/blog/2022/04/27/how-uplift-built-cdc-and-multiplexing-data-pipelines-with-databricks-delta-live-tables.html
https://cprosenjit.medium.com/databricks-delta-live-tables-job-workflows-orchestration-patterns-bc7643935299
I would appreciate any help. Thanks in advance.
I got the solution myself and got it working, thanks to all. I am adding my solution so it can be a reference for others.
import dlt
from pyspark.sql.functions import *
from pyspark.sql.types import *
def generate_silver_tables(target_table, source_table):
@dlt.table
def customers_filteredB():
return spark.table("my_raw_db.myraw_table_name")
### Create the target table definition
dlt.create_target_table(name=target_table,
comment= f"Clean, merged {target_table}",
#partition_cols=["topic"],
table_properties={
"quality": "silver",
"pipelines.autoOptimize.managed": "true"
}
)
## Do the merge
dlt.apply_changes(
target = target_table,
source = "customers_filteredB",
keys = ["id"],
apply_as_deletes = expr("operation = 'DELETE'"),
sequence_by = col("timestamp_ms"),  # primary key, auto-incrementing ID of any kind that can be used to identify the order of events, or a timestamp
ignore_null_updates = False,
except_column_list = ["operation", "timestamp_ms"],
stored_as_scd_type = "1"
)
return
raw_dbname = "raw_db"
raw_tbl_name = 'raw_table_name'
processed_tbl_name = raw_tbl_name.replace("raw", "processed")
generate_silver_tables(processed_tbl_name, raw_tbl_name)
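Since the question mentions looping over 100-plus raw tables, here is a minimal sketch of such a driver loop. It assumes generate_silver_tables is adjusted to actually read from its source_table argument and to give each intermediate DLT view a unique name (the version above hardcodes my_raw_db.myraw_table_name and customers_filteredB); the raw_ naming convention and the catalog call are likewise assumptions, not part of the original pipeline.
raw_dbname = "raw_db"

# List candidate raw tables from the metastore (assumed naming convention).
raw_tables = [t.name for t in spark.catalog.listTables(raw_dbname)
              if t.name.startswith("raw_")]

# Derive one processed (silver) table per raw table and apply the same CDC logic.
for raw_tbl_name in raw_tables:
    processed_tbl_name = raw_tbl_name.replace("raw", "processed")
    generate_silver_tables(processed_tbl_name, raw_tbl_name)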
I have the following C struct:
#define UUID4_LEN 37
...
typedef struct can_record {
char id[UUID4_LEN];
char *can_data;
} CAN_RECORD;
I am saving that record in Berkeley DB via the below function:
int insert_record(DB **dbpp, CAN_RECORD * record) {
DB *db;
DBT key, data;
int ret;
db = *dbpp;
memset(&key, 0, sizeof(DBT));
memset(&data, 0, sizeof(DBT));
uuid4_generate(record->id);
key.data = record->id;
key.size = (u_int32_t)strlen(record->id) + 1;
data.data = &record;
data.size = sizeof(CAN_RECORD);
ret = db->put(db, 0, &key, &data, 0);
if(ret != 0) {
fprintf(stderr, "Unable to insert record %s, err: %s\n", record->id,
db_strerror(ret));
return ret;
}
printf("Record inserted %s %s\n", record->id, record->can_data);
return ret;
}
NOTE: record->can_data has already been pre-populated previously; it is of variable length, but it is a stringified JSON structure, e.g.:
asprintf(&record.can_data, "{\"t\": \"%s\", \"b\": \"%s\", \"e\": \"%u\"\"}", U_UID, name, (unsigned)time(NULL));
I have a Python process that reads the Berkeley DB (here is a small excerpt):
from berkeleydb import db
...
...
cursor = self._db.cursor()
record = cursor.first()
while record:
(id, data) = record
self.log(f'RECORD: {id} {data}')
id = struct.unpack("", id)
data = struct.unpack("", data)
self.log(f'DECODED: {id} {data}')
record = cursor.next()
...
The record data looks like this:
b'46c54a16-366a-4397-aa68-357ab5538590\x00'
and
b'P\x99\x12\x00x\xbb\xfd~(\xbb\xfd~\x16\x00\x00\x00\x04\x00\x00\x00x\xbb\xfd~\x00\x00\x00\x00\x00\x00\x00\x00\xa4\x9f\x02A\x00\x00\x00\x00\x83.\xf0v#\x03\x00\x00\x08\x00\x00\x00\xf0h\x9e\x9fpo;\xcc\x1d\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00can0\x00\x00\x00\x00x\xbb\xfd~\x00\x00\x00\x00\x03\x00\x00\x00\xb0\xfa\x00A\x00\x00\x00\x00\x00\x00'
I am unable to figure out how I can use Python's struct.unpack to decode the byte string. I have tried a variety of different formats, but have been unsuccessful.
How would you go about unpacking the struct, such that I have the original form?
Unfortunately, the Berkeley DB reader has to be in Python.
Also note:
data.data = &record;
data.size = sizeof(CAN_RECORD);
the data is the entire struct, which includes the id[UUID4_LEN], and the *can_data.
What would I need to do here:
(id, data) = record
id = struct.unpack("", id)
data = struct.unpack("", data)
to achieve original form?
Ok, for the time being, I did a workaround. Rather than:
data.data = &record;
data.size = sizeof(CAN_RECORD);
I did:
data.data = record->can_data;
data.size = (u_int32_t)strlen(record->can_data) + 1;
So I only saved the string rather than the entire struct. Then, in my Python code, I simply did:
(id, data) = record
id = str(id, 'utf-8')
data = str(data, 'utf-8')
self.log(f'ID: {id}')
self.log(f'DATA: {data}')
and that decoded the byte string to a plain string perfectly:
ID: dacaf94f-ecf5-4252-89d8-e2c9deff8f8d
DATA: {"t": "", "b": "abc123", "e": "1653636766"}
Although this has helped me progress without using Python's struct.unpack, I am still keen on understanding how to unpack structs in Python, as I will likely have a future requirement with slightly more complicated struct definitions.
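A note for anyone looking for the struct.unpack route: it can only describe fixed-size, self-contained layouts. The original put stored sizeof(CAN_RECORD) bytes starting at &record, i.e. the bytes of the local pointer variable plus whatever memory follows it, and even data.data = record would only persist the char *can_data pointer value, which is meaningless to another process. Below is a hedged sketch of the Python side, assuming the C struct were changed to hold a fixed-size buffer instead of a pointer; CAN_DATA_LEN is an assumed constant, not from the original code.
import struct

UUID4_LEN = 37
CAN_DATA_LEN = 256  # assumption: must match a fixed-size char array in the C struct

# "=" disables alignment padding; "37s"/"256s" read the two char arrays as raw bytes.
RECORD_FMT = f"={UUID4_LEN}s{CAN_DATA_LEN}s"

def decode_record(raw_bytes):
    id_bytes, data_bytes = struct.unpack(RECORD_FMT, raw_bytes)
    # Both fields are NUL-terminated C strings; keep only the part before the first NUL.
    rec_id = id_bytes.split(b"\x00", 1)[0].decode("utf-8")
    can_data = data_bytes.split(b"\x00", 1)[0].decode("utf-8")
    return rec_id, can_data
A truly variable-length can_data cannot be expressed as a single struct format string; a length prefix or a serialized value (such as the JSON string already being stored) is the usual way around that.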
What I get from the API:
"name":"reports"
"col_type":"array<struct<imageUrl:string,reportedBy:string>>"
So in the Hive schema I got:
reports array<struct<imageUrl:string,reportedBy:string>>
Note: I got the Hive array schema as a string from the API.
My target:
bigquery.SchemaField("reports", "RECORD", mode="NULLABLE",
fields=(
bigquery.SchemaField('imageUrl', 'STRING'),
bigquery.SchemaField('reportedBy', 'STRING')
)
)
Note: I would like to create universal code that can handle any number of struct fields inside the array.
Any tips are welcome.
I tried creating a script that parses your input, which is reports array<struct<imageUrl:string,reportedBy:string>>. It converts your input to a dictionary that can be used as the schema when creating a table. The main idea of the approach is that instead of using SchemaField(), you can create a dictionary, which is much easier than constructing SchemaField() objects with parameters from your example input.
NOTE: The script is only tested on your input; it can parse additional fields if they are added inside struct<...>.
import re
from google.cloud import bigquery
def is_even(number):
if (number % 2) == 0:
return True
else:
return False
def clean_string(str_value):
return re.sub(r'[\W_]+', '', str_value)
def convert_to_bqdict(api_string):
"""
This only works for a struct with multiple fields
This could give you an idea on constructing a schema dict for BigQuery
"""
num_even = True
main_dict = {}
struct_dict = {}
field_arr = []
schema_arr = []
# Assumes the field name and its type are separated by a single space
init_struct = api_string.split(' ')
main_dict["name"] = init_struct[0]
main_dict["type"] = "RECORD"
main_dict["mode"] = "NULLABLE"
cont_struct = init_struct[1].split('<')
num_elem = len(cont_struct)
# parse fields inside of struct<
for i in range(0,num_elem):
num_even = is_even(i)
# fields are seen on even indices
if num_even and i != 0:
temp = list(filter(None,cont_struct[i].split(','))) # remove blank elements
for elem in temp:
fields = list(filter(None,elem.split(':')))
struct_dict["name"] = clean_string(fields[0])
# "type" works for STRING as of the moment refer to
# https://cloud.google.com/bigquery/docs/schemas#standard_sql_data_types
# for the accepted data types
struct_dict["type"] = clean_string(fields[1]).upper()
struct_dict["mode"] = "NULLABLE"
field_arr.append(struct_dict)
struct_dict = {}
main_dict["fields"] = field_arr # assign dict to array of fields
schema_arr.append(main_dict)
return schema_arr
sample = "reports array<struct<imageUrl:string,reportedBy:string,newfield:bool>>"
bq_dict = convert_to_bqdict(sample)
client = bigquery.Client()
project = client.project
dataset_ref = bigquery.DatasetReference(project, '20211228')
table_ref = dataset_ref.table("20220203")
table = bigquery.Table(table_ref, schema=bq_dict)
table = client.create_table(table)
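For reference, tracing convert_to_bqdict with the sample string above, the value returned (and passed as schema=bq_dict) is a list containing a single schema dictionary:
[
    {
        "name": "reports",
        "type": "RECORD",
        "mode": "NULLABLE",
        "fields": [
            {"name": "imageUrl", "type": "STRING", "mode": "NULLABLE"},
            {"name": "reportedBy", "type": "STRING", "mode": "NULLABLE"},
            {"name": "newfield", "type": "BOOL", "mode": "NULLABLE"},
        ],
    }
]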
I have JSON files, and I'm trying to hash one field of them with SHA-256. These files are on AWS S3. I am currently using Spark with Python on Apache Zeppelin.
Here is my JSON schema; I am trying to hash the 'mac' field:
|-- Document: struct (nullable = true)
| |-- data: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- mac: string (nullable = true)
I've tried a couple of things:
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import StringType
import hashlib
hcData = sqlc.read.option("inferSchema","true").json(inputPath)
hcData.registerTempTable("hcData")
name = 'Document'
udf = UserDefinedFunction(lambda x: hashlib.sha256(str(x).encode('utf-8')).hexdigest(), StringType())
new_df = hcData.select(*[udf(column).alias(name) if column == name else column for column in hcData.columns])
This code works fine. But when I try to hash the mac field by changing the name variable, nothing happens:
name = 'Document.data[0].mac'
name = 'mac'
I guess it is because it couldn't find a column with the given name.
I've tried to change the code a bit:
def valueToCategory(value):
return hashlib.sha256(str(value).encode('utf-8')).hexdigest()
udfValueToCategory = udf(valueToCategory, StringType())
df = hcData.withColumn("Document.data[0].mac",udfValueToCategory("Document.data.mac"))
This code hashes "Document.data.mac" and creates a new column with the hashed MAC addresses. I want to update the existing column. For variables that are not nested it can update them with no problem, but for nested variables I couldn't find a way to do the update.
So basically, I want to hash a field in a nested JSON file with Spark and Python. Does anyone know how to update a Spark DataFrame while keeping its schema?
Here is the Python solution to my question:
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import StringType
import hashlib
import re
def find(s, r):
l = re.findall(r, s)
if(len(l)!=0):
return l
else:
lis = ["null"]
return lis
def hash(s):
return hashlib.sha256(str(s).encode('utf-8')).hexdigest()
def hashAll(s, r):
st = s
macs = re.findall(r, s)
for mac in macs:
st = st.replace(mac, hash(mac))
return st
rdd = sc.textFile(inputPath)
regex = "([0-9A-Z]{1,2}[:-]){5}([0-9A-Z]{1,2})"
hashed_rdd = rdd.map(lambda line: hashAll(line, regex))
hashed_rdd.saveAsTextFile(outputPath)
Well, I've found a solution to my question with Scala. There may be some redundant code, but it worked anyway.
import scala.util.matching.Regex
import java.security.MessageDigest
val inputPath = ""
val outputPath = ""
//finds mac addresses with given regex
def find(s: String, r: Regex): List[String] = {
val l = r.findAllIn(s).toList
if(!l.isEmpty){
return l
} else {
val lis: List[String] = List("null")
return lis
}
}
//hashes given string with sha256
def hash(s: String): String = {
return MessageDigest.getInstance("SHA-256").digest(s.getBytes).map(0xFF & _).map { "%02x".format(_) }.foldLeft(""){_ + _}
}
//hashes given line
def hashAll(s: String, r:Regex): String = {
var st = s
val macs = find(s, r)
for (mac <- macs){
st = st.replaceAll(mac, hash(mac))
}
return st
}
//read data
val rdd = sc.textFile(inputPath)
//mac address regular expression
val regex = "(([0-9A-Z]{1,2}[:-]){5}([0-9A-Z]{1,2}))".r
//hash data
val hashed_rdd = rdd.map(line => hashAll(line, regex))
//write hashed data
hashed_rdd.saveAsTextFile(outputPath)
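For readers on a much newer Spark (3.1+) than the sqlContext-era API in the question, the nested mac field can also be rewritten in place at the DataFrame level, keeping the original schema, instead of regex-rewriting the raw text. This is a hedged sketch, not the original approach: the column names come from the schema in the question, the built-in sha2 replaces the UDF, and everything else is assumed.
from pyspark.sql import functions as F

df = spark.read.json(inputPath)

# Replace each element's mac inside Document.data with its SHA-256 hash,
# leaving the rest of the struct untouched.
hashed = df.withColumn(
    "Document",
    F.col("Document").withField(
        "data",
        F.transform(
            F.col("Document.data"),
            lambda e: e.withField("mac", F.sha2(e["mac"], 256)),
        ),
    ),
)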
I have a Flask application which receives a request from DataTables Editor. Upon receipt at the server, request.form looks like this (e.g.):
ImmutableMultiDict([('data[59282][gender]', u'M'), ('data[59282][hometown]', u''),
('data[59282][disposition]', u''), ('data[59282][id]', u'59282'),
('data[59282][resultname]', u'Joe Doe'), ('data[59282][confirm]', 'true'),
('data[59282][age]', u'27'), ('data[59282][place]', u'3'), ('action', u'remove'),
('data[59282][runnerid]', u''), ('data[59282][time]', u'29:49'),
('data[59282][club]', u'')])
I am thinking of using something similar to this really ugly code to decode it. Is there a better way?
from collections import defaultdict
# request.form comes in multidict [('data[id][field]',value), ...]
# so we need to exec this string to turn into python data structure
data = defaultdict(lambda: {}) # default is empty dict
# need to define text for each field to be received in data[id][field]
age = 'age'
club = 'club'
confirm = 'confirm'
disposition = 'disposition'
gender = 'gender'
hometown = 'hometown'
id = 'id'
place = 'place'
resultname = 'resultname'
runnerid = 'runnerid'
time = 'time'
# fill in data[id][field] = value
for formkey in request.form.keys():
exec '{} = {}'.format(d,repr(request.form[formkey]))
This question has an accepted answer and is a bit old, but since the DataTables module still seems pretty popular in the jQuery community, I believe this approach may be useful for someone else. I've just written a simple parsing function based on a regular expression and the dpath module, though dpath does not appear to be an entirely reliable module. The snippet may not be very straightforward due to a fragment that relies on exceptions, but it was the only way I found to prevent dpath from trying to resolve strings as integer indices.
import re, dpath.util
rxsKey = r'(?P<key>[^\W\[\]]+)'
rxsEntry = r'(?P<primaryKey>[^\W]+)(?P<secondaryKeys>(\[' \
+ rxsKey \
+ r'\])*)\W*'
rxKey = re.compile(rxsKey)
rxEntry = re.compile(rxsEntry)
def form2dict( frmDct ):
res = {}
for k, v in frmDct.iteritems():
m = rxEntry.match( k )
if not m: continue
mdct = m.groupdict()
if not 'secondaryKeys' in mdct.keys():
res[mdct['primaryKey']] = v
else:
fullPath = [mdct['primaryKey']]
for sk in re.finditer( rxKey, mdct['secondaryKeys'] ):
k = sk.groupdict()['key']
try:
dpath.util.get(res, fullPath)
except KeyError:
dpath.util.new(res, fullPath, [] if k.isdigit() else {})
fullPath.append(int(k) if k.isdigit() else k)
dpath.util.new(res, fullPath, v)
return res
The practical usage is based on Flask's native request.form.to_dict() method:
# ... somewhere in a view code
pars = form2dict(request.form.to_dict())
The output structure includes both dictionaries and lists, as one could expect. E.g.:
# A little test:
rs = form2dict( {
'columns[2][search][regex]' : False,
'columns[2][search][value]' : None,
} )
generates:
{
"columns": [
null,
null,
{
"search": {
"regex": false,
"value": null
}
}
]
}
Update: to handle lists as dictionaries (in a more efficient way), one may simplify this snippet with the following block in the else part of the if clause:
# ...
else:
fullPathStr = mdct['primaryKey']
for sk in re.finditer( rxKey, mdct['secondaryKeys'] ):
fullPathStr += '/' + sk.groupdict()['key']
dpath.util.new(res, fullPathStr, v)
I decided on a way that is more secure than using exec:
from collections import defaultdict
def get_request_data(form):
'''
return dict list with data from request.form
:param form: MultiDict from `request.form`
:rtype: {id1: {field1:val1, ...}, ...} [fieldn and valn are strings]
'''
# request.form comes in multidict [('data[id][field]',value), ...]
# fill in id field automatically
data = defaultdict(lambda: {})
# fill in data[id][field] = value
for formkey in form.keys():
if formkey == 'action': continue
datapart,idpart,fieldpart = formkey.split('[')
if datapart != 'data': raise ParameterError, "invalid input in request: {}".format(formkey)
idvalue = int(idpart[0:-1])
fieldname = fieldpart[0:-1]
data[idvalue][fieldname] = form[formkey]
# return decoded result
return data
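A brief usage sketch against the request shown in the question; the values in the comment are copied from that ImmutableMultiDict, and the surrounding view code is assumed.
# Inside the Flask view that handles the DataTables Editor request:
action = request.form['action']          # e.g. 'remove'; handled separately
data = get_request_data(request.form)
# For the example request this yields:
# {59282: {'gender': 'M', 'hometown': '', 'disposition': '', 'id': '59282',
#          'resultname': 'Joe Doe', 'confirm': 'true', 'age': '27', 'place': '3',
#          'runnerid': '', 'time': '29:49', 'club': ''}}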