PySpark: How to transform JSON strings in a DataFrame column

The following is more or less straight Python code that functionally extracts exactly what I want. The column I'm filtering on within the DataFrame holds JSON strings.
However, I had to greatly bump up the memory requirement, and I'm only running on a single node. Using collect() is probably bad, and building all of this on a single node really isn't taking advantage of the distributed nature of Spark.
I'd like a more Spark-centric solution. Can anyone help me massage the logic below to better take advantage of Spark? Also, as a learning point, please provide an explanation of why/how the updates make it better.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import json
from pyspark.sql.types import StructType, StructField, StringType

input_schema = StructType([
    StructField('scrubbed_col_name', StringType(), nullable=True)
])

output_schema = StructType([
    StructField('val01_field_name', StringType(), nullable=True),
    StructField('val02_field_name', StringType(), nullable=True)
])
example_input = [
    '''[{"val01_field_name": "val01_a", "val02_field_name": "val02_a"},
        {"val01_field_name": "val01_a", "val02_field_name": "val02_b"},
        {"val01_field_name": "val01_b", "val02_field_name": "val02_c"}]''',
    '''[{"val01_field_name": "val01_c", "val02_field_name": "val02_a"}]''',
    '''[{"val01_field_name": "val01_a", "val02_field_name": "val02_d"}]''',
]

desired_output = {
    'val01_a': ['val02_a', 'val02_b', 'val02_d'],
    'val01_b': ['val02_c'],
    'val01_c': ['val02_a'],
}
def capture(dataframe):
    # Capture the column from the data frame if it's not empty
    data = dataframe.filter('scrubbed_col_name is not null')\
        .select('scrubbed_col_name')\
        .rdd\
        .collect()

    # Create a mapping of val1: list(val2)
    mapping = {}
    # For every row in the rdd
    for row in data:
        # For each json_string within the row
        for json_string in row:
            # For each item within the json string
            for val in json.loads(json_string):
                # Extract the data properly
                val01 = val.get('val01_field_name')
                val02 = val.get('val02_field_name')
                if val02 not in mapping.get(val01, []):
                    mapping.setdefault(val01, []).append(val02)
    return mapping

One possible solution:
(df
    .rdd                   # Convert to RDD
    .flatMap(lambda x: x)  # Flatten rows
    # Parse JSON. In practice you should add proper exception handling
    .flatMap(lambda x: json.loads(x))
    # Get values
    .map(lambda x: (x.get('val01_field_name'), x.get('val02_field_name')))
    # Convert to final shape
    .groupByKey())
Given the output specification, this operation is not exactly efficient (do you really need the grouped values?), but it is still much better than collect().
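For comparison, here is a DataFrame-only sketch of the same logic, assuming a Spark version where from_json accepts an array schema (2.2+); the column and field names are taken from the question:
from pyspark.sql.functions import from_json, explode, col, collect_set
from pyspark.sql.types import ArrayType, StructType, StructField, StringType

# Each cell holds a JSON array of objects, so parse it as such.
json_schema = ArrayType(StructType([
    StructField('val01_field_name', StringType(), True),
    StructField('val02_field_name', StringType(), True),
]))

result = (df
    .where(col('scrubbed_col_name').isNotNull())
    # Parse the JSON string into an array of structs.
    .select(from_json(col('scrubbed_col_name'), json_schema).alias('parsed'))
    # One output row per element of the array.
    .select(explode('parsed').alias('item'))
    # Deduplicate val02 values per val01 key on the executors.
    .groupBy('item.val01_field_name')
    .agg(collect_set('item.val02_field_name').alias('val02_values')))
All of the JSON parsing and grouping stays on the executors, so nothing is collected to the driver.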

Related

How to compare two schemas in a Databricks notebook in Python

I'm going to ingest data using a Databricks notebook. I want to validate the schema of the ingested data against the schema I expect it to have.
So basically I have:
validation_schema = StructType([
    StructField("a", StringType(), True),
    StructField("b", IntegerType(), False),
    StructField("c", StringType(), False),
    StructField("d", StringType(), False)
])

data_ingested_good = [("foo", 1, "blabla", "36636"),
                      ("foo", 2, "booboo", "40288"),
                      ("bar", 3, "fafa", "42114"),
                      ("bar", 4, "jojo", "39192"),
                      ("baz", 5, "jiji", "32432")
]

data_ingested_bad = [("foo", "1", "blabla", "36636"),
                     ("foo", "2", "booboo", "40288"),
                     ("bar", "3", "fafa", "42114"),
                     ("bar", "4", "jojo", "39192"),
                     ("baz", "5", "jiji", "32432")
]
data_ingested_good.printSchema()
data_ingested_bad.printSchema()
validation_schema.printSchema()
I've seen similar questions, but the answers are always in Scala.
It really depends on your exact requirements and the complexity of the schemas you want to compare: for example, ignoring the nullability flag vs. taking it into account, the order of columns, support for maps/structs/arrays, etc. Also, do you want to see the differences, or just a flag indicating whether the schemas match?
In the simplest case it can be as easy as comparing the string representations of the schemas:
def compare_schemas(df1, df2):
    return df1.schema.simpleString() == df2.schema.simpleString()
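As a hedged sketch of one of the variations mentioned above (the function name is mine): compare only field names and types, ignoring nullability and column order.
def compare_schemas_ignore_nullability(df1, df2):
    # Compare (name, type) pairs, sorted so that column order does not matter.
    fields1 = sorted((f.name, f.dataType.simpleString()) for f in df1.schema.fields)
    fields2 = sorted((f.name, f.dataType.simpleString()) for f in df2.schema.fields)
    return fields1 == fields2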
Personally, I would recommend using an existing library like chispa, which has more advanced schema comparison functions: you can tune the checks, it will show the differences, etc. After installation (you can just do %pip install chispa), the following will throw an exception if the schemas are different:
from chispa.schema_comparer import assert_schema_equality
assert_schema_equality(df1.schema, df2.schema)
Another method: you can find the difference with a simple Python list comparison.
dept = [("Finance",10),
("Marketing",20),
("Sales",30),
("IT",40)
]
deptColumns = ["dept_name","dept_id"]
dept1 = [("Finance",10,'999'),
("Marketing",20,'999'),
("Sales",30,'999'),
("IT",40,'999')
]
deptColumns1 = ["dept_name","dept_id","extracol"]
deptDF = spark.createDataFrame(data=dept, schema = deptColumns)
dept1DF = spark.createDataFrame(data=dept1, schema = deptColumns1)
deptDF_columns=deptDF.schema.names
dept1DF_columns=dept1DF.schema.names
list_difference = []
for item in dept1DF_columns:
if item not in deptDF_columns:
list_difference.append(item)
print(list_difference)
Output: ['extracol']
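The same check can be written more compactly as a set difference; a minimal sketch (like the loop above, it compares column names only, not data types):
# Columns present in dept1DF but missing from deptDF.
missing_in_deptDF = set(dept1DF.columns) - set(deptDF.columns)
print(missing_in_deptDF)  # {'extracol'}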

Filter on the PySpark DataFrame schema to get a new DataFrame with columns of a specific type

I want to create a generic function in PySpark that takes a DataFrame and a data type as parameters and filters out the columns that do not satisfy the criteria. I am not very good at Python, and I am stuck at the point where I cannot figure out how to do that.
I have a Scala version of the code that does the same thing.
// sample data
val df = Seq(("587","mumbai",Some(5000),5.05),("786","chennai",Some(40000),7.055),("432","Gujarat",Some(20000),6.75),("2","Delhi",None,10.0))
  .toDF("Id","City","Salary","Increase")
  .withColumn("RefID", $"Id")

import org.apache.spark.sql.functions.col
def selectByType(colType: DataType, df: DataFrame) = {
  val cols = df.schema.toList
    .filter(x => x.dataType == colType)
    .map(c => col(c.name))
  df.select(cols:_*)
}
val res = selectByType(IntegerType, df)
res is the DataFrame that has only the integer columns, in this case the salary column; all the other columns with different types have been dropped dynamically.
I want the same behaviour in PySpark, but I am not able to accomplish it.
This is what I have tried:
# sample data
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("middlename", StringType(), True),
    StructField("lastname", StringType(), True),
    StructField("id", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("salary", IntegerType(), True),
    StructField("raise", DoubleType(), True)
])

data2 = [("James", "", "Smith", "36636", "M", 3000, 2.5),
         ("Michael", "Rose", "", "40288", "M", 4000, 4.7),
         ("Robert", "", "Williams", "42114", "M", 4000, 8.9),
         ("Maria", "Anne", "Jones", "39192", "F", 4000, 0.0),
         ("Jen", "Mary", "Brown", "", "F", -1, -1.2)
]
df = spark.createDataFrame(data=data2, schema=schema)

# getting the column list from the schema of the dataframe
pschema = df.schema.fields
datatypes = [IntegerType, DoubleType]  # column data types that I want
out = filter(lambda x: x.dataType.isin(datatypes), pschema)  # gives invalid syntax error
Can someone help me understand what I am doing wrong? The Scala code only passes a single data type, but for my use case I want to handle the scenario where we can pass multiple data types and get back a DataFrame with only the columns of those specified types.
Initially, if someone can give me an idea of how to make it work for a single data type, I can try to do the same for multiple data types.
Note: the sample data for Scala and PySpark is different, as I copied the PySpark sample data from elsewhere just to speed things up; I am only concerned about the final output requirement.
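A hedged sketch of one way to do this in PySpark (the function name select_by_types is mine): dataType on each schema field is an instance of a type class, so you can filter the schema with isinstance and select the matching columns.
from pyspark.sql.functions import col
from pyspark.sql.types import IntegerType, DoubleType

def select_by_types(df, *datatypes):
    # Keep columns whose dataType is an instance of any of the given type classes.
    cols = [col(f.name) for f in df.schema.fields
            if isinstance(f.dataType, tuple(datatypes))]
    return df.select(cols)

res = select_by_types(df, IntegerType, DoubleType)
res.printSchema()  # only 'salary' and 'raise' remain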

Remove rows where the value is a string in a PySpark DataFrame

I am trying to use KMeans on geospatial data stored in a MongoDB database using Apache Spark. The data has the following format:
DataFrame[decimalLatitude: double, decimalLongitude: double, features: vector]
The code is as follows, where inputdf is the DataFrame:
vecAssembler = VectorAssembler(
    inputCols=["decimalLatitude", "decimalLongitude"],
    outputCol="features")
inputdf = vecAssembler.transform(inputdf)

kmeans = KMeans(k=10, seed=123)
model = kmeans.fit(inputdf.select("features"))
There seem to be some empty strings in the dataset, as I get the following error:
com.mongodb.spark.exceptions.MongoTypeConversionException: Cannot cast STRING into a IntegerType (value: BsonString{value=''})
I tried to find such rows using:
issuedf = inputdf.where(inputdf.decimalLatitude == '')
issuedf.show()
But I get the same type conversion error as above. I also tried df.replace, but I got the same error. How do I remove all rows where such a value is present?
This issue can be solved by providing data types when loading the data, as follows:
inputdf = my_spark.read.format("mongo").load(schema=StructType(
    [StructField("decimalLatitude", DoubleType(), True),
     StructField("decimalLongitude", DoubleType(), True)]))
This ensures that all values are of DoubleType. Now empty values can be removed using inputdf.dropna().
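For example, restricted to the two coordinate columns (a sketch; the subset argument is optional):
# Drop rows where either coordinate failed to load as a double.
clean_df = inputdf.dropna(subset=["decimalLatitude", "decimalLongitude"])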

Spark: equivalent of zipWithIndex in dataframe

Assume I have the following DataFrame:
dummy_data = [('a',1),('b',25),('c',3),('d',8),('e',1)]
df = sc.parallelize(dummy_data).toDF(['letter','number'])
And I want to create the following DataFrame:
[('a',0),('b',2),('c',1),('d',3),('e',0)]
What I do is convert it to an RDD, use the zipWithIndex function, and then join the results:
convertDF = (df.select('number')
               .distinct()
               .rdd
               .zipWithIndex()
               .map(lambda x: (x[0].number, x[1]))
               .toDF(['old', 'new']))

finalDF = (df
           .join(convertDF, df.number == convertDF.old)
           .select(df.letter, convertDF.new))
Is there a similar function to zipWithIndex for DataFrames? Is there another, more efficient way to do this task?
Please check https://issues.apache.org/jira/browse/SPARK-23074 for direct functionality parity in DataFrames; upvote that JIRA if you're interested in seeing this in Spark at some point.
Here's a workaround in PySpark, though:
from pyspark.sql.types import StructType, StructField, LongType

def dfZipWithIndex(df, offset=1, colName="rowId"):
    '''
    Enumerates dataframe rows in native order, like rdd.zipWithIndex(),
    but on a dataframe and preserves the schema

    :param df: source dataframe
    :param offset: adjustment to zipWithIndex()'s index
    :param colName: name of the index column
    '''
    new_schema = StructType(
        [StructField(colName, LongType(), True)]  # new field added in front
        + df.schema.fields                        # previous schema
    )

    zipped_rdd = df.rdd.zipWithIndex()
    new_rdd = zipped_rdd.map(lambda args: ([args[1] + offset] + list(args[0])))

    return spark.createDataFrame(new_rdd, new_schema)
That's also available in the abalon package.
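A quick usage sketch against the DataFrame from the question (rowId is the function's default column name):
# Assign a 0-based row id to every row of df.
indexed = dfZipWithIndex(df, offset=0)
indexed.show()  # columns: rowId, letter, number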

How to map structured data to schemaRDD in Spark?

I've asked this question differently before, but there are some changes, so I thought I'd ask it again as a new question.
I have structured data of which only part is in JSON format, but I need to map the entire record to a SchemaRDD. The data looks like this:
03052015 04:13:20
{"recordType":"NEW","data":{"keycol":"val1","col2":"val2","col3":"val3"}
Each line starts with a date, followed by a time and JSON-formatted text.
I need to map not only the JSON text but also the date and time into the same structure.
I tried it in Python but obviously it doesn't work, because Row does not take an RDD (a jsonRDD in this case).
from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)

orderFile = sc.textFile(myfile)
orderLine = orderFile.map(lambda line: line.split(" ", 2))
anotherOrderLine = orderLine.map(lambda p: Row(date=p[0], time=p[1], content=sqlContext.jsonRDD(p[2])))

schemaOrder = sqlContext.inferSchema(anotherOrderLine)
schemaOrder.printSchema()
for x in schemaOrder.collect():
    print x
The goal is to be able to run a query like this against the schemaRDD:
select date, time, data.keycol, data.val1, data.val2, data.val3 from myOrder
How can I map the entire line to a schemaRDD?
Any help is appreciated.
The simplest option would be to add the timestamp as a field to the JSON and use jsonRDD.
My data:
03052015 04:13:20 {"recordType":"NEW","data":{"keycol":"val1","col1":"val5","col2":"val3"}}
03062015 04:13:20 {"recordType":"NEW1","data":{"keycol":"val2","col1":"val6","col2":"val3"}}
03072015 04:13:20 {"recordType":"NEW2","data":{"keycol":"val3","col1":"val7","col2":"val3"}}
03082015 04:13:20 {"recordType":"NEW3","data":{"keycol":"val4","col1":"val8","col2":"val3"}}
Code:
import json

def transform(data):
    ts = data[:18].strip()
    jss = data[18:].strip()
    jsj = json.loads(jss)
    jsj['ts'] = ts
    return json.dumps(jsj)

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

rdd = sc.textFile('/sparkdemo/sample.data')
tbl = sqlContext.jsonRDD(rdd.map(transform))
tbl.registerTempTable("myOrder")

sqlContext.sql("select ts, recordType, data.keycol, data.col1, data.col2 data from myOrder").collect()
Result:
[Row(ts=u'03052015 04:13:20', recordType=u'NEW', keycol=u'val1', col1=u'val5', data=u'val3'), Row(ts=u'03062015 04:13:20', recordType=u'NEW1', keycol=u'val2', col1=u'val6', data=u'val3'), Row(ts=u'03072015 04:13:20', recordType=u'NEW2', keycol=u'val3', col1=u'val7', data=u'val3'), Row(ts=u'03082015 04:13:20', recordType=u'NEW3', keycol=u'val4', col1=u'val8', data=u'val3')]
There is a problem in your code: you are calling jsonRDD for each of the rows, which is not correct; it accepts an RDD and returns a SchemaRDD.
sqlContext.jsonRDD creates a SchemaRDD from an RDD of strings, where each string contains a JSON representation. This code sample is from the Spark SQL documentation (https://spark.apache.org/docs/1.2.0/sql-programming-guide.html):
val anotherPeopleRDD = sc.parallelize("""{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}""" :: Nil)
val anotherPeople = sqlContext.jsonRDD(anotherPeopleRDD)
One of the cool things about jsonRDD is that you can provide an additional parameter stating the JSON's schema, which should improve performance. This can be done by creating a SchemaRDD (just load a sample document) and then calling the schemaRDD.schema method to get the schema.
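A minimal sketch of that idea, reusing the transform function from the answer above and assuming the Spark 1.2-era PySpark API where SchemaRDD.schema() is a method (on later DataFrames it is the .schema property); the small-sample file path is hypothetical:
# Infer the schema once from a small sample, then reuse it for the full load.
sample_rdd = sc.textFile('/sparkdemo/sample_small.data').map(transform)
inferred_schema = sqlContext.jsonRDD(sample_rdd).schema()

# Passing the schema up front skips inference over the full dataset.
full_rdd = sc.textFile('/sparkdemo/sample.data').map(transform)
tbl = sqlContext.jsonRDD(full_rdd, inferred_schema)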
