Process single data set with different JSON schema rows using Pyspark - python

I am using PySpark and I need to process log files that are appended into a single data frame. Most of the columns look normal, but one column contains a JSON string wrapped in {}. Each row is an individual event, and for each event type I can apply an individual schema to the JSON string. But I don't know the best way to process the data here.
Example:
This table will later help me aggregate the events the way I need.
I tried to use withColumn with from_json, and it worked successfully for a single case:
from pyspark.sql.types import *
import pyspark.sql.functions as F

df = (df
    .withColumn("nested_json",
        F.when(F.col("event_name") == "EventStart",
               F.from_json("json_string", "Name String, Version Int, Id Int")))
)
It did what I wanted for my first row when I queried nested_json. But the schema is applied to the whole column, and I would like to process each row depending on its event_name.
I naively tried this:
from pyspark.sql.types import *
import pyspark.sql.functions as F

df = (df
    .withColumn("nested_json",
        F.when(F.col("event_name") == "EventStart",
               F.from_json("json_string", "Name String, Version Int, Id Int"))
        F.when(F.col("event_name") == "Action1",
               F.from_json("json_string", "Name String, Version Int, UserName String, PosX Int, PosY Int"))
    )
)
This failed to run with: when() can only be applied on a Column previously generated by when() function.
I assumed my first withColumn applied the schema to the whole column.
What other options do I have to apply a JSON schema based on the event_name value and flatten the values?

What if you chain your when statements?
For example,
df.withColumn("nested_json", F.when(F.col("event_name") == "EventStart", F.from_json(...)).when(F.col("event_name") == "Action1", F.from_json(...)))


Pandas: faster string operations in dataframes

I am working on a Python script that reads data from a database and saves the data into a .csv file.
In order to save it correctly I need to escape different characters, such as \r\n or \n.
Here is how I am currently doing it:
First, I use the pandas read_sql function to read the data from the database.
import pandas as pd

df = pd.read_sql(
    sql='SELECT * FROM exampleTable',
    con=SQLAlchemyConnection
)
The table I get has different types of values.
Then, the script updates the dataframe, changing every string value to a raw string.
In order to achieve that, I use two nested for loops to operate on every single value.
def update_df(df):
    for rowIndex, row in df.iterrows():
        for colIndex, values in row.items():
            if isinstance(df.at[rowIndex, colIndex], str):
                df.at[rowIndex, colIndex] = repr(df.at[rowIndex, colIndex])
    return df
However, the amount of data I need to process is large (more than 1 million rows with more than 100 columns) and it takes hours.
What I need is a way to create the csv file faster.
Thank you in advance.
It should be faster to use applymap if you really have mixed types:
df = df.applymap(lambda x: repr(x) if isinstance(x, str) else x)
However, if you can identify the string columns, then you can slice them (maybe in combination with re.escape?):
import re
str_cols = ['col1', 'col2']
df[str_cols] = df[str_cols].applymap(re.escape)
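If the string columns are not known up front, here is a minimal sketch of one way to detect them automatically (this assumes the strings live in object-dtype columns, which may not hold for every dataframe, and the output filename is illustrative):

import re
import pandas as pd

# Treat object-dtype columns as the candidate string columns.
str_cols = df.select_dtypes(include="object").columns

# Escape only string cells; leave any non-string objects untouched.
df[str_cols] = df[str_cols].applymap(
    lambda x: re.escape(x) if isinstance(x, str) else x
)

df.to_csv("output.csv", index=False)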

How to validate dataframe in pandera using multiple columns

I have the following dataframe. I need to validate the dataframe to check whether there are rows where the columns Name and Tag are both NULL at the same time.
I tried the following, but the indexes where it fails are 0 and 2.
import pandas as pd
import pandera as pa

data = [['Alex', 10, 't1'], ['Bob', 12, None], ['Clarke', 13, 't3'],
        [None, 14, 't3'], [None, 15, None]]
df = pd.DataFrame(data, columns=['Name', 'Age', 'Tag'])

schema = pa.DataFrameSchema(
    checks=pa.Check(lambda df: ~(pd.notnull(df["Name"]) & pd.notnull(df["Tag"])))
)

try:
    schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as err:
    print("Schema errors and failure cases:")
    print(err.failure_cases)
I want the above code to return index 4. How should I create the check for the pandera schema?
As per the docs on Handling null values:
By default, pandera drops null values before passing the objects to validate into the check function. For Series objects null elements are dropped (this also applies to columns), and for DataFrame objects, rows with any null value are dropped. If you want to check the properties of a pandas data structure while preserving null values, specify Check(..., ignore_na=False) when defining a check.
So, make sure to add ignore_na=False:
schema = pa.DataFrameSchema(
    checks=pa.Check(
        lambda df: ~(df['Name'].isnull() & df['Tag'].isnull()),
        ignore_na=False,
    )
)
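A quick usage sketch putting it together, assuming pandera's lazy validation so that all failure cases are collected (the exact failure_cases layout depends on your pandera version):

import pandas as pd
import pandera as pa

data = [['Alex', 10, 't1'], ['Bob', 12, None], ['Clarke', 13, 't3'],
        [None, 14, 't3'], [None, 15, None]]
df = pd.DataFrame(data, columns=['Name', 'Age', 'Tag'])

schema = pa.DataFrameSchema(
    checks=pa.Check(
        lambda df: ~(df['Name'].isnull() & df['Tag'].isnull()),
        ignore_na=False,
    )
)

try:
    schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as err:
    # Only row 4 (both Name and Tag null) should appear as a failure case.
    print(err.failure_cases)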

Explode a string column with dictionary structure in PySpark

I have a pyspark dataframe which looks like this:

RowNumber | value
--------- | ------------------------------------------------------------
1         | [{mail=abc#xyz.com, Name=abc}, {mail=mnc#xyz.com, Name=mnc}]
2         | [{mail=klo#xyz.com, Name=klo}, {mail=mmm#xyz.com, Name=mmm}]
The column "value" is of string type.
root
 |-- value: string (nullable = false)
 |-- rowNumber: integer (nullable = false)
Step 1: I need to explode the dictionaries inside the list on each row under the column "value", like this-
Step 2: And then further explode the column so that the resulting table looks like:
However, when I try to get to Step 1 using:
df.select(explode(col('value')).alias('value'))
it shows me the error:
AnalysisException: cannot resolve 'explode("value")' due to data type mismatch: input to function explode should be array or map type, not string
How do I convert this string under the column 'value' to a compatible data type so that I can proceed with exploding the dictionary elements as a valid array/json (step 1) and then into separate columns (step 2)?
please help
EDIT: There may be a simpler way to do this with the from_json function and StructTypes, for that you can check out this link.
The ast way
To parse a string as a dictionary or list in Python, you can use the ast library's literal_eval function. If we convert this function to a PySpark UDF, then the following code will suffice:
from pyspark.sql.functions import udf, col, explode
from pyspark.sql.types import ArrayType, MapType, StringType
from ast import literal_eval

literal_eval_udf = udf(literal_eval, ArrayType(MapType(StringType(), StringType())))

table = table.withColumn("value", literal_eval_udf(col("value")))  # parse strings into an ArrayType of MapTypes
table = table.withColumn("value", explode(col("value")))           # explode so each row contains a single MapType
After applying these functions to the table, what should remain is what you originally referred to as the start of "step 2." From here, we want to split each "value" column key into a column with entries from the corresponding value. This is accomplished with another function application which gives us the dict values:
table = table.withColumn("value", map_values(col("value")))
Now the values column contains an ArrayType of the values contained in each dictionary. To make a separate column for each of these, we simply add them in a loop:
keys = ['mail', 'Name']
for k in range(len(keys)):
    table = table.withColumn(keys[k], table.value[k])
Then you can drop the original value column, since you won't need it anymore: you'll now have the columns mail and Name with the information from the corresponding maps.
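Regarding the from_json route mentioned in the edit above, here is a minimal sketch, assuming the strings are actually valid JSON (e.g. [{"mail": "...", "Name": "..."}]); if they are not, the ast approach is the safer bet:

from pyspark.sql.functions import from_json, explode, col
from pyspark.sql.types import ArrayType, StructType, StructField, StringType

# Assumed element schema: each list item carries a mail and a Name field.
schema = ArrayType(StructType([
    StructField("mail", StringType()),
    StructField("Name", StringType()),
]))

parsed = (table
    .withColumn("value", from_json(col("value"), schema))  # string -> array<struct>
    .withColumn("value", explode(col("value"))))           # one struct per row

result = parsed.select("rowNumber", "value.mail", "value.Name")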

PySpark : Optimize read/load from Delta using selected columns or partitions

I am trying to load data from Delta into a pyspark dataframe.
path_to_data = 's3://mybucket/daily_data/'
df = spark.read.format("delta").load(path_to_data)
Now the underlying data is partitioned by date as
s3://mybucket/daily_data/
dt=2020-06-12
dt=2020-06-13
...
dt=2020-06-22
Is there a way to optimize the read as a Dataframe, given that:
Only a certain date range is needed
Only a subset of columns is needed
The current way I tried is:
df.registerTempTable("my_table")
new_df = spark.sql("select col1,col2 from my_table where dt_col > '2020-06-20' ")
# dt_col is column in dataframe of timestamp dtype.
In the above approach, does Spark need to load the whole data set, filter the data based on the date range, and then filter the columns needed? Is there any optimization that can be done in the pyspark read to load the data, since it is already partitioned?
Something along the lines of:
df = spark.read.format("delta").load(path_to_data,cols_to_read=['col1','col2'])
or
df = spark.read.format("delta").load(path_to_data,partitions=[...])
In your case, there is no extra step needed. The optimizations are taken care of by Spark. Since you already partitioned the dataset on the column dt, when you query the dataset with the partition column dt in the filter condition, Spark loads only the subset of the data from the source dataset that matches the filter condition, in your case dt > '2020-06-20'.
Spark does this internally through partition pruning.
To do this without SQL..
from pyspark.sql import functions as F
df = spark.read.format("delta").load(path_to_data).filter(F.col("dt_col") > F.lit('2020-06-20'))
Though for this example you may have some work to do with comparing dates.
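To also prune columns at read time, a small sketch combining a select with a filter on the partition column (this assumes dt, the partition column, is what you actually want to filter on; Spark/Delta then skips both the unused columns and the unmatched partitions at scan time):

from pyspark.sql import functions as F

df = (spark.read.format("delta")
      .load(path_to_data)
      .select("col1", "col2", "dt")                # column pruning
      .filter(F.col("dt") > F.lit("2020-06-20")))  # partition pruning on dt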

Applying Mapping Function on DataFrame

I have just started using Databricks/PySpark. I'm using Python/Spark 2.1. I have uploaded data to a table. This table is a single column full of strings. I wish to apply a mapping function to each element in the column. I load the table into a dataframe:
df = spark.table("mynewtable")
The only way I could see, from what others were saying, was to convert it to an RDD, apply the mapping function, and then convert back to a dataframe to show the data. But this throws a job aborted due to stage failure error:
df2 = df.select("_c0").rdd.flatMap(lambda x: x.append("anything")).toDF()
All I want to do is apply some sort of map function to my data in the table.
For example, append something to each string in the column, or perform a split on a character, and then put that back into a dataframe so I can .show() or display it.
You cannot:
- use flatMap, because it will flatten the Row
- use append, because:
  - tuple or Row have no append method
  - append (if present on a collection) is executed for side effects and returns None
I would use withColumn:
df.withColumn("foo", lit("anything"))
but map should work as well:
df.select("_c0").rdd.flatMap(lambda x: x + ("anything", )).toDF()
Edit (given the comment):
You probably want a udf:
from pyspark.sql.functions import udf

def iplookup(s):
    return ...  # Some lookup logic

iplookup_udf = udf(iplookup)

df.withColumn("foo", iplookup_udf("_c0"))
The default return type is StringType, so if you want something else you should adjust it.
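For instance, a small sketch of passing an explicit return type, using the split-on-a-character case from the question (the column name _c0 and the comma delimiter are just illustrative):

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

# Split each string on a comma; returnType tells Spark the UDF yields array<string>.
split_udf = udf(lambda s: s.split(",") if s is not None else None,
                ArrayType(StringType()))

df.withColumn("parts", split_udf("_c0")).show()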
