Infer schema from json string - python

I have this dataframe:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

cSchema = StructType([StructField("id1", StringType()), StructField("id2", StringType()),
                      StructField("params", StringType()), StructField("Col2", IntegerType())])
test_list = [[1, 2, '{"param1": "val1", "param2": "val2"}', 1], [1, 3, '{"param1": "val4", "param2": "val5"}', 3]]
df = spark.createDataFrame(test_list, schema=cSchema)
+---+---+--------------------+----+
|id1|id2| params|Col2|
+---+---+--------------------+----+
| 1| 2|{"param1": "val1"...| 1|
| 1| 3|{"param1": "val4"...| 3|
+---+---+--------------------+----+
I want to explode params into columns:
+---+---+----+------+------+
|id1|id2|Col2|param1|param2|
+---+---+----+------+------+
| 1| 2| 1| val1| val2|
| 1| 3| 3| val4| val5|
+---+---+----+------+------+
So I coded this:
from pyspark.sql.functions import col, from_json

schema2 = StructType([StructField("param1", StringType()), StructField("param2", StringType())])
df.withColumn(
    "params", from_json("params", schema2)
).select(
    col('id1'), col('id2'), col('Col2'), col('params.*')
).show()
The problem is that the params schema is dynamic (variable schema2); it may change from one execution to another, so I need to infer the schema dynamically (it's OK to have all columns as StringType)... and I can't figure out a way to do this.
Can anyone help me with that, please?

In PySpark the syntax should be:
import pyspark.sql.functions as F

schema = F.schema_of_json(df.select('params').head()[0])
df2 = df.withColumn(
    "params", F.from_json("params", schema)
).select(
    'id1', 'id2', 'Col2', 'params.*'
)
df2.show()
+---+---+----+------+------+
|id1|id2|Col2|param1|param2|
+---+---+----+------+------+
| 1| 2| 1| val1| val2|
| 1| 3| 3| val4| val5|
+---+---+----+------+------+

Here is how you can do it in Scala; hopefully you can translate it to Python.
Get the schema dynamically with schema_of_json from the value and use from_json to read it.
val schema = schema_of_json(df.first().getAs[String]("params"))
df.withColumn("params", from_json($"params", schema))
.select("id1", "id2", "Col2", "params.*")
.show(false)

If you want a larger sample of data to compare, you can read the params field into a list, convert that to an RDD, and then read it with spark.read.json():
params_list = df.select("params").rdd.flatMap(lambda x: x).collect()
params_rdd = sc.parallelize(params_list)
spark.read.json(params_rdd).schema
The caveat here is that you probably don't want to load too much data, as it is all being stuffed into local variables. Try taking the top 1000 rows, or whatever an appropriate sample size may be, as in the sketch below.
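For example, here is a minimal sketch of that sampling idea, limiting to 1000 rows without collecting them to the driver and then applying the inferred schema (the limit is just an illustrative choice):
from pyspark.sql.functions import from_json

# Infer the schema from a sample of the params column, then parse the full column.
sample_rdd = df.select("params").limit(1000).rdd.map(lambda row: row[0])
params_schema = spark.read.json(sample_rdd).schema
df2 = df.withColumn("params", from_json("params", params_schema)) \
        .select("id1", "id2", "Col2", "params.*")
df2.show()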

Related

I want to use when on a pyspark dataframe but I have multiple columns in df.withColumn

Dataframe Schema:
root
 |-- LAST_UPDATE_DATE
 |-- ADDR_1
 |-- ADDR_2
 |-- ERROR
If the "ERROR" col is null i want to change df like :
df = df.withColumn("LAST_UPDATE_DATE", current_timestamp()) \
.withColumn("ADDR_1", lit("ADDR_1")) \
.withColumn("ADDR_2", lit("ADDR_2"))
else :
df = df.withColumn("ADDR_1", lit("0"))
i have checked the "when-otherwise" but only one column can be changed in that scenario
Desired output :
//+----------------+------+------+-----+
//|LAST_UPDATE_DATE|ADDR_1|ADDR_2|ERROR|
//+----------------+------+------+-----+
//|2022-06-17 07:54|ADDR_1|ADDR_2| null|
//| null| null| null| 1|
//+----------------+------+------+-----+
Why not use when-otherwise for each withColumn? The condition can be factored out for convenience.
Example:
import pyspark.sql.functions as F

error_event = F.col('ERROR').isNull()
df = (
    df
    .withColumn('LAST_UPDATE_DATE', F.when(error_event, F.current_timestamp()))
    .withColumn('ADDR_1', F.when(error_event, F.lit('ADDR_1'))
                           .otherwise(1))
)
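A fuller sketch extending the same pattern to all three columns (following the else branch for ADDR_1 from the question; columns without an otherwise() fall back to null, as in the desired output):
import pyspark.sql.functions as F

error_event = F.col('ERROR').isNull()
df = (
    df
    .withColumn('LAST_UPDATE_DATE', F.when(error_event, F.current_timestamp()))
    .withColumn('ADDR_1', F.when(error_event, F.lit('ADDR_1')).otherwise(F.lit('0')))
    .withColumn('ADDR_2', F.when(error_event, F.lit('ADDR_2')))
)
df.show(truncate=False)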

Create spark Dataframe based on another dataframe with json column

I have a Spark DataFrame (json_df) and I need to create another DataFrame based on the nested JSON:
This is my current Dataframe:
I know I could do that manually, like final_df = json_df.select(col("Body.EquipmentId"), .....), but I want to do it in a generic way.
Note: for this specific DF, the JSON records have the same structure.
Any idea?
Thanks!
Programmatically, you can do it like this:
from pyspark.sql import *
from pyspark.sql.functions import *
from pyspark import SparkContext, SparkConf
from pyspark.sql import functions as F
conf = SparkConf()
sc = SparkContext(conf=conf)
spark = SparkSession(sc)
df = sc.parallelize([({"A":1, "B":2},), ({"A":3,"B":4},), ({"A":5,"B":6},)]).toDF(['Body'])
keys_df = df.select(F.explode(F.map_keys(F.col('Body')))).distinct()
keys = list(map(lambda row: row[0], keys_df.collect()))
key_cols = list(map(lambda f: F.col("Body").getItem(f).alias(str(f)), keys))
final_cols = df.select(key_cols)
final_cols.show()
Which produces
+---+---+
| B| A|
+---+---+
| 2| 1|
| 4| 3|
| 6| 5|
+---+---+
If you have the entire list of keys already, you can skip the part where it gets the keys and just set the keys manually:
keys = ['A', 'B']
Source: https://mungingdata.com/pyspark/dict-map-to-multiple-columns/
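With the keys known up front, the same idea collapses to a one-liner (a sketch reusing the Body map column from the example above):
from pyspark.sql import functions as F

# Build one column per known map key, skipping the key-discovery step entirely.
final_cols = df.select([F.col("Body").getItem(k).alias(k) for k in ["A", "B"]])
final_cols.show()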

How can I convert unicode to string of a dataframe column?

I have a Spark dataframe which has a column 'X'. The column contains elements which are in the form:
u'[23,4,77,890,455,................]'
How can I convert this unicode to a list? That is, my output should be
[23,4,77,890,455...................]
I have to apply this to each element in the 'X' column.
I have tried df.withColumn("X_new", ast.literal_eval(x)) and got the error "Malformed String".
I also tried df.withColumn("X_new", json.loads(x)) and got the error "Expected String or Buffer",
and df.withColumn("X_new", json.dumps(x)), which says JSON not serialisable,
and also df_2 = df.rdd.map(lambda x: x.encode('utf-8')), which says rdd has no attribute encode.
I don't want to use collect and toPandas() because they are memory-consuming (but if that's the only way, please do tell). I am using PySpark.
Update: cph_sto gave an answer using a UDF. Though it worked well, I find that it is slow. Can somebody suggest any other method?
import ast
from pyspark.sql.functions import udf, col
values = [(u'[23,4,77,890.455]',10),(u'[11,2,50,1.11]',20),(u'[10.05,1,22.04]',30)]
df = sqlContext.createDataFrame(values,['list','A'])
df.show()
+-----------------+---+
| list| A|
+-----------------+---+
|[23,4,77,890.455]| 10|
| [11,2,50,1.11]| 20|
| [10.05,1,22.04]| 30|
+-----------------+---+
# Creating a UDF to convert the string list to proper list
string_list_to_list = udf(lambda row: ast.literal_eval(row))
df = df.withColumn('list',string_list_to_list(col('list')))
df.show()
+--------------------+---+
| list| A|
+--------------------+---+
|[23, 4, 77, 890.455]| 10|
| [11, 2, 50, 1.11]| 20|
| [10.05, 1, 22.04]| 30|
+--------------------+---+
Extension of the Q, as asked by OP -
# Creating a UDF to find length of resulting list.
length_list = udf(lambda row: len(row))
df = df.withColumn('length_list',length_list(col('list')))
df.show()
+--------------------+---+-----------+
| list| A|length_list|
+--------------------+---+-----------+
|[23, 4, 77, 890.455]| 10| 4|
| [11, 2, 50, 1.11]| 20| 4|
| [10.05, 1, 22.04]| 30| 3|
+--------------------+---+-----------+
Since it's a string, you could remove the first and last characters:
From '[23,4,77,890,455]' to '23,4,77,890,455'
Then apply the split() function to generate an array, taking , as the delimiter.
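A minimal sketch of that idea using only built-in functions (assuming the column is named 'list', as in the example above):
from pyspark.sql import functions as F

# Strip the surrounding brackets, then split on "," to get an array of strings.
df = df.withColumn('list', F.split(F.regexp_replace(F.col('list'), r'[\[\]]', ''), ','))
# Optionally cast the elements to numbers:
# df = df.withColumn('list', F.col('list').cast('array<double>'))
df.show()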
Please use the below code to ignore unicode:
df.rdd.map(lambda row: row['X'].encode("ascii", "ignore"))

How to convert list of dictionaries into Pyspark DataFrame

I want to convert my list of dictionaries into DataFrame. This is the list:
mylist =
[
{"type_activity_id":1,"type_activity_name":"xxx"},
{"type_activity_id":2,"type_activity_name":"yyy"},
{"type_activity_id":3,"type_activity_name":"zzz"}
]
This is my code:
from pyspark.sql.types import StringType
df = spark.createDataFrame(mylist, StringType())
df.show(2,False)
+-----------------------------------------+
| value|
+-----------------------------------------+
|{type_activity_id=1,type_activity_id=xxx}|
|{type_activity_id=2,type_activity_id=yyy}|
|{type_activity_id=3,type_activity_id=zzz}|
+-----------------------------------------+
I assume that I should provide some mapping and types for each column, but I don't know how to do it.
Update:
I also tried this:
schema = ArrayType(
StructType([StructField("type_activity_id", IntegerType()),
StructField("type_activity_name", StringType())
]))
df = spark.createDataFrame(mylist, StringType())
df = df.withColumn("value", from_json(df.value, schema))
But then I get null values:
+-----+
|value|
+-----+
| null|
| null|
+-----+
In the past, you were able to simply pass a dictionary to spark.createDataFrame(), but this is now deprecated:
mylist = [
{"type_activity_id":1,"type_activity_name":"xxx"},
{"type_activity_id":2,"type_activity_name":"yyy"},
{"type_activity_id":3,"type_activity_name":"zzz"}
]
df = spark.createDataFrame(mylist)
#UserWarning: inferring schema from dict is deprecated,please use pyspark.sql.Row instead
# warnings.warn("inferring schema from dict is deprecated,"
As this warning message says, you should use pyspark.sql.Row instead.
from pyspark.sql import Row
spark.createDataFrame(Row(**x) for x in mylist).show(truncate=False)
#+----------------+------------------+
#|type_activity_id|type_activity_name|
#+----------------+------------------+
#|1 |xxx |
#|2 |yyy |
#|3 |zzz |
#+----------------+------------------+
Here I used ** (keyword argument unpacking) to pass the dictionaries to the Row constructor.
You can do it like this. You will get a dataframe with 2 columns.
mylist = [
{"type_activity_id":1,"type_activity_name":"xxx"},
{"type_activity_id":2,"type_activity_name":"yyy"},
{"type_activity_id":3,"type_activity_name":"zzz"}
]
myJson = sc.parallelize(mylist)
myDf = sqlContext.read.json(myJson)
Output :
+----------------+------------------+
|type_activity_id|type_activity_name|
+----------------+------------------+
| 1| xxx|
| 2| yyy|
| 3| zzz|
+----------------+------------------+
In Spark version 2.4 it is possible to do this directly with
df=spark.createDataFrame(mylist)
>>> mylist = [
... {"type_activity_id":1,"type_activity_name":"xxx"},
... {"type_activity_id":2,"type_activity_name":"yyy"},
... {"type_activity_id":3,"type_activity_name":"zzz"}
... ]
>>> df1=spark.createDataFrame(mylist)
>>> df1.show()
+----------------+------------------+
|type_activity_id|type_activity_name|
+----------------+------------------+
| 1| xxx|
| 2| yyy|
| 3| zzz|
+----------------+------------------+
I was also facing the same issue when creating a dataframe from a list of dictionaries.
I resolved it using namedtuple.
Below is my code using the data provided.
from collections import namedtuple
final_list = []
mylist = [{"type_activity_id":1,"type_activity_name":"xxx"},
{"type_activity_id":2,"type_activity_name":"yyy"},
{"type_activity_id":3,"type_activity_name":"zzz"}
]
ExampleTuple = namedtuple('ExampleTuple', ['type_activity_id', 'type_activity_name'])
for my_dict in mylist:
    namedtupleobj = ExampleTuple(**my_dict)
    final_list.append(namedtupleobj)
sqlContext.createDataFrame(final_list).show(truncate=False)
output
+----------------+------------------+
|type_activity_id|type_activity_name|
+----------------+------------------+
|1 |xxx |
|2 |yyy |
|3 |zzz |
+----------------+------------------+
My version information is as follows:
spark: 2.4.0
python: 3.6
It is not necessary to have the mylist variable; since it was available, I used it to create the namedtuple objects, but the namedtuple objects could also be created directly, as in the sketch below.
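For instance, here is a minimal sketch of creating the namedtuples directly (reusing ExampleTuple and the sample values from above):
# Construct the namedtuples directly instead of unpacking dictionaries.
rows = [ExampleTuple(1, 'xxx'), ExampleTuple(2, 'yyy'), ExampleTuple(3, 'zzz')]
sqlContext.createDataFrame(rows).show(truncate=False)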

Pyspark: explode json in column to multiple columns

The data looks like this -
+-----------+-----------+-----------------------------+
|         id|      point|                         data|
+-----------+-----------+-----------------------------+
|        abc|          6|{"key1":"124", "key2": "345"}|
|        dfl|          7|{"key1":"777", "key2": "888"}|
|        4bd|          6|{"key1":"111", "key2": "788"}|
+-----------+-----------+-----------------------------+
I am trying to break it into the following format.
+-----------+-----------+-----------+-----------+
|         id|      point|       key1|       key2|
+-----------+-----------+-----------+-----------+
|        abc|          6|        124|        345|
|        dfl|          7|        777|        888|
|        4bd|          6|        111|        788|
+-----------+-----------+-----------+-----------+
The explode function explodes the dataframe into multiple rows, but that is not the desired solution.
Note: this solution does not answer my question:
PySpark "explode" dict in column
As long as you are using Spark version 2.1 or higher, pyspark.sql.functions.from_json should get you your desired result, but you would need to first define the required schema:
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType
schema = StructType(
    [
        StructField('key1', StringType(), True),
        StructField('key2', StringType(), True)
    ]
)
df.withColumn("data", from_json("data", schema))\
.select(col('id'), col('point'), col('data.*'))\
.show()
which should give you
+---+-----+----+----+
| id|point|key1|key2|
+---+-----+----+----+
|abc| 6| 124| 345|
|df1| 7| 777| 888|
|4bd| 6| 111| 788|
+---+-----+----+----+
As suggested by #pault, the data field is a string field. Since the keys are the same (i.e. 'key1', 'key2') in the JSON string across rows, you might also use json_tuple() (this function is new in version 1.6, based on the documentation):
from pyspark.sql import functions as F
df.select('id', 'point', F.json_tuple('data', 'key1', 'key2').alias('key1', 'key2')).show()
Below is my original post, which is most likely WRONG if the original table is from df.show(truncate=False) and the data field is therefore NOT a Python data structure.
Since you had exploded the data into rows, I assumed the column data was a Python data structure instead of a string:
from pyspark.sql import functions as F
df.select('id', 'point', F.col('data').getItem('key1').alias('key1'), F.col('data')['key2'].alias('key2')).show()
As mentioned by #jxc, json_tuple should work fine if you are not able to define the schema beforehand and you only need to deal with a single level of JSON string. I think it's more straightforward and easier to use. Strangely, I didn't find anyone else mention this function before.
In my use case, the original dataframe schema is StructType(List(StructField(a,StringType,true))), with the JSON string column shown as:
+---------------------------------------+
|a |
+---------------------------------------+
|{"k1": "v1", "k2": "2", "k3": {"m": 1}}|
|{"k1": "v11", "k3": "v33"} |
|{"k1": "v13", "k2": "23"} |
+---------------------------------------+
Expand json fields into new columns with json_tuple:
from pyspark.sql import functions as F
df = df.select(F.col('a'),
               F.json_tuple(F.col('a'), 'k1', 'k2', 'k3') \
                .alias('k1', 'k2', 'k3'))
df.schema
df.show(truncate=False)
The documentation doesn't say much about it, but at least in my use case, the new columns extracted by json_tuple are StringType, and it only extracts a single level of the JSON string.
StructType(List(StructField(k1,StringType,true),StructField(k2,StringType,true),StructField(k3,StringType,true)))
+---------------------------------------+---+----+-------+
|a |k1 |k2 |k3 |
+---------------------------------------+---+----+-------+
|{"k1": "v1", "k2": "2", "k3": {"m": 1}}|v1 |2 |{"m":1}|
|{"k1": "v11", "k3": "v33"} |v11|null|v33 |
|{"k1": "v13", "k2": "23"} |v13|23 |null |
+---------------------------------------+---+----+-------+
This works for my use case:
from pyspark.sql.functions import from_json

data1 = spark.read.parquet(path)
# Infer the schema of the JSON column by reading it as a JSON dataset
json_schema = spark.read.json(data1.rdd.map(lambda row: row.json_col)).schema
# Parse the JSON string column into a struct column named "data"
data2 = data1.withColumn("data", from_json("json_col", json_schema))
# Build the final column list: every original column plus the expanded JSON fields
col1 = data2.columns
col1.remove("data")
col2 = data2.select("data.*").columns
append_str = "data."
col3 = [append_str + val for val in col2]
col_list = col1 + col3
data3 = data2.select(*col_list).drop("json_col")
All credits to Shrikant Prabhu
You can simply use SQL:
SELECT id, point, data.*
FROM original_table
This way the schema of the new table will adapt if the data changes, and you won't have to do anything in your pipeline.
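A minimal sketch of running that SQL from PySpark (assuming the data column has already been parsed into a struct, and using original_table purely as an illustrative view name):
# Register the dataframe as a temporary view, then let SQL expand the struct.
df.createOrReplaceTempView("original_table")
df_flat = spark.sql("SELECT id, point, data.* FROM original_table")
df_flat.show()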
In this approach you just need to set the name of the column with the JSON content.
There is no need to set up the schema; it works everything out automatically.
json_col_name = 'data'
keys = df.head()[json_col_name].keys()
jsonFields= [f"{json_col_name}.{key} {key}" for key in keys]
main_fields = [key for key in df.columns if key != json_col_name]
df_new = df.selectExpr(main_fields + jsonFields)
