Updating column in spark dataframe with json schema - python

I have JSON files, and I'm trying to hash one field of them with SHA-256. These files are on AWS S3. I am currently using Spark with Python on Apache Zeppelin.
Here is my JSON schema; I am trying to hash the 'mac' field:
|-- Document: struct (nullable = true)
| |-- data: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- mac: string (nullable = true)
I've tried a couple of things:
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import StringType
import hashlib
hcData = sqlc.read.option("inferSchema","true").json(inputPath)
hcData.registerTempTable("hcData")
name = 'Document'
udf = UserDefinedFunction(lambda x: hashlib.sha256(str(x).encode('utf-8')).hexdigest(), StringType())
new_df = hcData.select(*[udf(column).alias(name) if column == name else column for column in hcData.columns])
This code works fine. But when I try to hash the mac field by changing the name variable, nothing happens:
name = 'Document.data[0].mac'
name = 'mac'
I guess that is because it couldn't find a column with the given name.
I've tried changing the code a bit:
def valueToCategory(value):
    return hashlib.sha256(str(value).encode('utf-8')).hexdigest()

udfValueToCategory = udf(valueToCategory, StringType())
df = hcData.withColumn("Document.data[0].mac", udfValueToCategory("Document.data.mac"))
This code hashes "Document.data.mac" and creates a new column with the hashed mac addresses. I want to update the existing column. For fields that are not nested it updates the column without a problem, but for nested fields I couldn't find a way to update them.
So basically, I want to hash a field in a nested JSON file with Spark and Python. Does anyone know how to update a Spark dataframe while preserving its schema?

Here is the Python solution to my question:
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import StringType
import hashlib
import re

# finds mac addresses with the given regex (not used below)
def find(s, r):
    l = re.findall(r, s)
    if len(l) != 0:
        return l
    else:
        lis = ["null"]
        return lis

# hashes the given string with SHA-256
def hash(s):
    return hashlib.sha256(str(s).encode('utf-8')).hexdigest()

# hashes every mac address found in the given line
def hashAll(s, r):
    st = s
    # use finditer so the full match is taken even though the pattern contains groups
    for m in re.finditer(r, s):
        mac = m.group(0)
        st = st.replace(mac, hash(mac))
    return st

rdd = sc.textFile(inputPath)
regex = "([0-9A-Z]{1,2}[:-]){5}([0-9A-Z]{1,2})"
hashed_rdd = rdd.map(lambda line: hashAll(line, regex))
hashed_rdd.saveAsTextFile(outputPath)

Well, I've found a solution to my question with Scala. There may be redundant code, but it worked anyway.
import scala.util.matching.Regex
import java.security.MessageDigest
val inputPath = ""
val outputPath = ""
//finds mac addresses with given regex
def find(s: String, r: Regex): List[String] = {
  val l = r.findAllIn(s).toList
  if (!l.isEmpty) {
    return l
  } else {
    val lis: List[String] = List("null")
    return lis
  }
}

//hashes given string with sha256
def hash(s: String): String = {
  return MessageDigest.getInstance("SHA-256").digest(s.getBytes).map(0xFF & _).map { "%02x".format(_) }.foldLeft("") { _ + _ }
}

//hashes given line
def hashAll(s: String, r: Regex): String = {
  var st = s
  val macs = find(s, r)
  for (mac <- macs) {
    st = st.replaceAll(mac, hash(mac))
  }
  return st
}
//read data
val rdd = sc.textFile(inputPath)
//mac address regular expression
val regex = "(([0-9A-Z]{1,2}[:-]){5}([0-9A-Z]{1,2}))".r
//hash data
val hashed_rdd = rdd.map(line => hashAll(line, regex))
//write hashed data
hashed_rdd.saveAsTextFile(outputPath)
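For reference, the hashing can also be done without leaving the DataFrame API by rebuilding the nested struct in place. This is only a minimal sketch, assuming Spark 2.4+ (for the transform higher-order function), the hcData DataFrame from the question, and that the Document.data elements contain only the mac field shown in the schema; any additional fields would have to be listed in named_struct as well:

from pyspark.sql import functions as F

# Rebuild the Document struct, replacing each element's mac with its SHA-256
# hash via the SQL transform() higher-order function (Spark 2.4+).
hashed_df = hcData.withColumn(
    "Document",
    F.struct(
        F.expr(
            "transform(Document.data, x -> named_struct('mac', sha2(x.mac, 256)))"
        ).alias("data")
    )
)

On older Spark versions the same rewrite can be done by exploding the array, hashing, and collecting it back, which is essentially what the regex solutions above sidestep by working on the raw text.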

Related

How to convert Hive schema to Bigquery schema using Python?

What I get from the API:
"name":"reports"
"col_type":"array<struct<imageUrl:string,reportedBy:string>>"
So in the Hive schema I got:
reports array<struct<imageUrl:string,reportedBy:string>>
Note: I got the Hive array schema as a string from the API.
My target:
bigquery.SchemaField("reports", "RECORD", mode="NULLABLE",
fields=(
bigquery.SchemaField('imageUrl', 'STRING'),
bigquery.SchemaField('reportedBy', 'STRING')
)
)
Note: I would like to create universal code that can handle any number of struct fields inside the array.
Any tips are welcome.
I tried creating a script that parses your input, which is reports array<struct<imageUrl:string,reportedBy:string>>. It converts your input into a dictionary that can be used as the schema when creating a table. The main idea of the approach is that, instead of using SchemaField(), you can build a plain dictionary, which is much easier than constructing SchemaField() objects with parameters for your example input.
NOTE: The script is only tested against your input, and it can parse additional fields if they are added inside struct<.
import re
from google.cloud import bigquery

def is_even(number):
    if (number % 2) == 0:
        return True
    else:
        return False

def clean_string(str_value):
    return re.sub(r'[\W_]+', '', str_value)

def convert_to_bqdict(api_string):
    """
    This only works for a struct with multiple fields
    This could give you an idea on constructing a schema dict for BigQuery
    """
    num_even = True
    main_dict = {}
    struct_dict = {}
    field_arr = []
    schema_arr = []

    # Hard coded this since not sure what the string will look like if there are more inputs
    init_struct = api_string.split(' ')
    main_dict["name"] = init_struct[0]
    main_dict["type"] = "RECORD"
    main_dict["mode"] = "NULLABLE"

    cont_struct = init_struct[1].split('<')
    num_elem = len(cont_struct)

    # parse fields inside of struct<
    for i in range(0, num_elem):
        num_even = is_even(i)
        # fields are seen on even indices
        if num_even and i != 0:
            temp = list(filter(None, cont_struct[i].split(',')))  # remove blank elements
            for elem in temp:
                fields = list(filter(None, elem.split(':')))
                struct_dict["name"] = clean_string(fields[0])
                # "type" works for STRING as of the moment; refer to
                # https://cloud.google.com/bigquery/docs/schemas#standard_sql_data_types
                # for the accepted data types
                struct_dict["type"] = clean_string(fields[1]).upper()
                struct_dict["mode"] = "NULLABLE"
                field_arr.append(struct_dict)
                struct_dict = {}

    main_dict["fields"] = field_arr  # assign dict to array of fields
    schema_arr.append(main_dict)
    return schema_arr

sample = "reports array<struct<imageUrl:string,reportedBy:string,newfield:bool>>"
bq_dict = convert_to_bqdict(sample)

client = bigquery.Client()
project = client.project
dataset_ref = bigquery.DatasetReference(project, '20211228')
table_ref = dataset_ref.table("20220203")
table = bigquery.Table(table_ref, schema=bq_dict)
table = client.create_table(table)
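For reference, tracing convert_to_bqdict on the sample string yields a schema list of roughly this shape (whether a type name such as BOOL is accepted as-is depends on BigQuery's list of standard SQL type names):

[{
    "name": "reports",
    "type": "RECORD",
    "mode": "NULLABLE",
    "fields": [
        {"name": "imageUrl", "type": "STRING", "mode": "NULLABLE"},
        {"name": "reportedBy", "type": "STRING", "mode": "NULLABLE"},
        {"name": "newfield", "type": "BOOL", "mode": "NULLABLE"},
    ],
}]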

Pyspark function works properly by itself but does not perform the task when wrapped in a UDF

I have this function that takes in a code and checks whether the code has been used (i.e., is in the used_codes dict). If it has not been used, it returns that same code; if it has been used, it generates a new code. Then I create a new df with a new column "code_id" of all unique codes.
My function works properly by itself, but when it goes through the UDF it does not do the task. My used_codes dict is empty even though I have a ton of repeated codes that should have been added to it and then replaced.
I'm not sure why it works before it is wrapped in a UDF but not when run as a UDF.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
import pyspark.sql.functions as F
import pyspark.sql.types as T
import random

data = [("James", "36636"),
        ("Michael", "36636"),
        ("Robert", "42114"),
        ("Maria", "39192"),
        ("Jen", "39192")
        ]

schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("id", StringType(), True),
])

df = spark.createDataFrame(data=data, schema=schema)

used_codes = {}

def generate_random_code():
    random_number = random.randint(10000, 90000)
    return random_number

def get_valid_code(code):
    global used_codes
    if (code != "" and code not in used_codes.keys()):
        used_codes[code] = 1
        return code
    new_code = generate_random_code()
    while (new_code in used_codes.keys()):
        new_code = generate_random_code()
    used_codes[new_code] = 2
    return new_code

get_valid_code_udf = F.udf(lambda code: get_valid_code(code), T.StringType())

df = spark.createDataFrame(data=data, schema=schema)
new_df = df.withColumn("code_id", get_valid_code_udf('id'))
df.show()
+---------+-----+
|firstname| id|
+---------+-----+
| James|36636|
| Michael|36636|
| Robert|42114|
| Maria|39192|
| Jen|39192|
+---------+-----+
>>> new_df.show()
+---------+-----+-------+
|firstname| id|code_id|
+---------+-----+-------+
| James|36636| 36636|
| Michael|36636| 63312|
| Robert|42114| 42114|
| Maria|39192| 39192|
| Jen|39192| 76399|
+---------+-----+-------+
You're using the global variable used_codes in your function. When the function runs as a UDF it executes on the worker processes, where that global is not the single shared dict you created on the driver: each worker gets its own copy, and updates made there never flow back to the driver. That's probably why your function does not work as expected as a UDF, even though the function itself still runs fine within each worker.
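One way around this, sketched below, is to keep the state in Spark itself rather than in a Python global: number the duplicates with a window function and derive a replacement code deterministically. This is only a rough sketch assuming the df above; the 5-digit range mirrors generate_random_code, but unlike the original it does not check the generated codes for collisions against other rows' ids.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# The first occurrence of each id keeps the id as its code; later duplicates get
# a code derived from a hash of (id, occurrence number), so no shared driver-side
# state is needed and the result is deterministic across workers.
w = Window.partitionBy("id").orderBy("firstname")
new_df = (
    df.withColumn("occurrence", F.row_number().over(w))
      .withColumn(
          "code_id",
          F.when(F.col("occurrence") == 1, F.col("id"))
           .otherwise((F.abs(F.hash("id", "occurrence")) % 90000 + 10000).cast("string"))
      )
      .drop("occurrence")
)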

How to get details of a JSON Server response parsed into list/dictionary in Python

I am new to Python. I have been trying to parse the response sent as a parameter to a function.
I have been trying to convert a function from Perl to Python.
The Perl block looks something like this:
sub fetchId_byusername
{
    my ($self, $resString, $name) = @_;
    my $my_id;
    my @arr = @{$json->allow_nonref->utf8->decode($resString)};
    foreach (@arr)
    {
        my %hash = %{$_};
        foreach my $keys (keys %hash)
        {
            $my_id = $hash{id} if ($hash{name} eq $name);
        }
    }
    print "Fetched Id is : $my_id\n";
    return $my_id;
}
The part where the JSON data is being parsed is troubling me. How do I write this in Python 3?
I tried something like:
def fetchID_byUsername(self, resString, name):
    arr = []
    user_id = 0
    arr = resString.content.decode('utf-8', errors="replace")
    for item in arr:
        temp_hash = {}
        temp_hash = item
        for index in temp_hash.keys():
            if temp_hash[name] == name:
                user_id = temp_hash[id]
    print("Fetched ID is: {}".format(user_id))
    return user_id
Now I am not sure if this is the right way to do it.
The JSON inputs are something like:
[{"id":12345,"name":"11","email":"11#test.com","groups":[{"id":6967,"name":"Test1"},{"id":123456,"name":"E1"}],"department":{"id":3863,"name":"Department1"},"comments":"111","adminUser":false},{"id":123457,"name":"1234567","email":"1234567#test.com","groups":[{"id":1657,"name":"mytest"},{"id":58881,"name":"Service Admin"}],"department":{"id":182,"name":"Service Admin"},"comments":"12345000","adminUser":true}]
Thanks in advance.
Your JSON input should be valid Python, so I changed false to False and true to True. If it is a JSON-formatted string, you can do:
import json
data = json.loads(json_formatted_string_here)  # data will be a python dictionary here
And I tried it like this; it just iterates and returns the id when a match is found:
data=[{"id":12345,"name":"11","email":"11#test.com","groups":[{"id":6967,"name":"Test1"},{"id":123456,"name":"E1"}],"department":{"id":3863,"name":"Department1"},"comments":"111","adminUser":False},{"id":123457,"name":"1234567","email":"1234567#test.com","groups":[{"id":1657,"name":"mytest"},{"id":58881,"name":"Service Admin"}],"department":{"id":182,"name":"Service Admin"},"comments":"12345000","adminUser":True}]
def fetch_id_by_name(list_records,name):
for record in list_records:
if record["name"] == name:
return record["id"]
print(fetch_id_by_name(data,"11"))
First of all, import the json library and use json.loads() like:
import json
x = json.loads(json_feed)  # this converts the JSON feed to a python dictionary
print(x["key"])  # value for "key"

__getnewargs__ error while using udf in Pyspark

There is a dataframe with 2 columns (db and tb): db stands for the database and tb stands for the tableName in that database.
+--------------------+--------------------+
| database| tableName|
+--------------------+--------------------+
|aaaaaaaaaaaaaaaaa...| tttttttttttttttt|
|bbbbbbbbbbbbbbbbb...| rrrrrrrrrrrrrrrr|
|aaaaaaaaaaaaaaaaa...| ssssssssssssssssss|
I have the following method in python:
def _get_tb_db(db, tb):
    df = spark.sql("select * from {}.{}".format(db, tb))
    return df.dtypes
and this udf:
test = udf(lambda db, tb: _get_tb_db(db, tb), StringType())
while running this:
df = df.withColumn("dtype", test(col("db"), col("tb")))
there is the following error:
pickle.PicklingError: Could not serialize object: Py4JError: An
error occurred while calling o58.__getnewargs__. Trace:
py4j.Py4JException: Method __getnewargs__([]) does not exist
I found some discussion on stackoverflow: Spark __getnewargs__ error
but I am still not sure how to resolve this issue.
Is the error because I am creating another dataframe inside the UDF?
Similar to the solution in the link, I tried this:
cols = copy.deepcopy(df.columns)
df = df.withColumn("dtype", scanning(cols[0], cols[1]))
but I still get the error.
Any solution?
The error means that you cannot use a Spark dataframe inside a UDF. But since your dataframe containing the names of databases and tables is most likely small, it's enough to just use a Python for loop. Below are some methods which might help get your data:
from pyspark.sql import Row
# assume dfs is the df containing database names and table names
dfs.printSchema()
root
|-- database: string (nullable = true)
|-- tableName: string (nullable = true)
Method-1: use df.dtypes
Run the SQL select * from database.tableName limit 1 to generate a df and return its dtypes, converted into a string (StringType()).
data = []
DRow = Row('database', 'tableName', 'dtypes')
for row in dfs.collect():
    try:
        dtypes = spark.sql('select * from `{}`.`{}` limit 1'.format(row.database, row.tableName)).dtypes
        data.append(DRow(row.database, row.tableName, str(dtypes)))
    except Exception as e:
        print("ERROR from {}.{}: [{}]".format(row.database, row.tableName, e))
        pass

df_dtypes = spark.createDataFrame(data)
# DataFrame[database: string, tableName: string, dtypes: string]
Note:
Using dtypes instead of str(dtypes) will give the following schema, where _1 and _2 are col_name and col_dtype respectively:
root
|-- database: string (nullable = true)
|-- tableName: string (nullable = true)
|-- dtypes: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _1: string (nullable = true)
| | |-- _2: string (nullable = true)
Using this method, each table will have only one row. For the next two methods, each col_type of a table will have its own row.
Method-2: use describe
You can also retrieve this information by running spark.sql("describe tableName"), which gives you a dataframe directly, and then use a reduce function to union the results from all tables.
from functools import reduce

def get_df_dtypes(db, tb):
    try:
        return spark.sql('desc `{}`.`{}`'.format(db, tb)) \
            .selectExpr(
                '"{}" as `database`'.format(db)
              , '"{}" as `tableName`'.format(tb)
              , 'col_name'
              , 'data_type')
    except Exception as e:
        print("ERROR from {}.{}: [{}]".format(db, tb, e))
        pass
# an example table:
get_df_dtypes('default', 'tbl_df1').show()
+--------+---------+--------+--------------------+
|database|tableName|col_name| data_type|
+--------+---------+--------+--------------------+
| default| tbl_df1| array_b|array<struct<a:st...|
| default| tbl_df1| array_d| array<string>|
| default| tbl_df1|struct_c|struct<a:double,b...|
+--------+---------+--------+--------------------+
# use reduce function to union all tables into one df
df_dtypes = reduce(lambda d1, d2: d1.union(d2), [ get_df_dtypes(row.database, row.tableName) for row in dfs.collect() ])
Method-3: use spark.catalog.listColumns()
Use spark.catalog.listColumns(), which creates a list of Column metadata objects; retrieve name and dataType and merge the data. The resulting dataframe is normalized, with col_name and col_dtype in their own columns (same as with Method-2).
data = []
DRow = Row('database', 'tableName', 'col_name', 'col_dtype')
for row in dfs.select('database', 'tableName').collect():
    try:
        for col in spark.catalog.listColumns(row.tableName, row.database):
            data.append(DRow(row.database, row.tableName, col.name, col.dataType))
    except Exception as e:
        print("ERROR from {}.{}: [{}]".format(row.database, row.tableName, e))
        pass

df_dtypes = spark.createDataFrame(data)
# DataFrame[database: string, tableName: string, col_name: string, col_dtype: string]
A note: different Spark distributions/versions might return different results from describe tbl_name and other metadata commands, so make sure the correct column names are used in the queries.

Convert Python to Scala

I am new to Scala and I used to work with Python.
I want to convert a program from Python to Scala and am having difficulties with the following 2 lines (creating a SQL dataframe).
Python code:
fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
schema = StructType(fields)
data = dataset.map(lambda (filepath, text): (filepath.split("/")[-1],text, filepath.split("/")[-2]))
df = sqlContext.createDataFrame(data, schema)
I have made this:
Scala code:
val category = dataset.map { case (filepath, text) => filepath.split("/")(6) }
val id = dataset.map { case (filepath, text) => filepath.split("/")(7) }
val text = dataset.map { case (filepath, text) => text }
val schema = StructType(Seq(
  StructField(id.toString(), StringType, true),
  StructField(category.toString(), StringType, true),
  StructField(text.toString(), StringType, true)
))
and now I am blocked there!
For what it is worth, I have converted your code literally, and the following compiles using Spark 2.3.2 on my machine:
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import spark.implicits._
// Introduced to make code clearer
case class FileRecord(name: String, text: String)
// Whatever data set you have (a single record dataset is hard coded, replace with your data)
val dataSet = Seq(FileRecord("/a/b/c/d/e/f/g/h/i", "example contents")).toDS()
// Whatever you need with path length 6 and 7 hardcoded (you might want to change this)
// you may be able to do the following three map operations more efficiently
val category = dataSet.map { case FileRecord(filepath, text) => filepath.split("/")(6) }
val id = dataSet.map { case FileRecord(filepath, text) => filepath.split("/")(7) }
val text = dataSet.map { case FileRecord(filepath, text) => text }
val schema = StructType(Seq(
  StructField(id.toString(), StringType, true),
  StructField(category.toString(), StringType, true),
  StructField(text.toString(), StringType, true)
))
