I have a Spark dataframe that has JSON in one of its columns. My task is to turn this dataframe into a columnar-type dataframe. The problem is that the JSON is dynamic and its structure always changes. What I would like to do is try to take values from it and, in case a value is missing, return a default instead. Is there an option for this on the dataframe? This is how I am taking values out of the JSON; the problem is that if one of the levels changes name or structure, it will fail.
columnar_df = df.select(
    col('json')['level1'].alias('json_level1'),
    col('json')['level1']['level2a'].alias('json_level1_level2a'),
    col('json')['level1']['level2b'].alias('json_levelb'),
)
You can do something like that with json_tuple:
https://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.functions.json_tuple
df.select(json_tuple(col("json"), << all_the_fields , _you_want >> ))
I am currently trying to filter my dataframe inside an if statement and get the value of a field back into a variable.
Here is my code:
if df_table.filter(col(field).contains("val")):
    id_2 = df_table.select(another_field)
    print(id_2)
    # Recursive call with new variable
The problem is: the if filtering seems to work, but id_2 gives me the column name and type, whereas I want the actual value of that field.
The output for this code is:
DataFrame[ID_1: bigint]
DataFrame[ID_2: bigint]
...
If I try collect like this : id_2 = df_table.select(another_field).collect()
I get this: [Row(ID_1=3013848), Row(ID_1=319481), Row(ID_1=391948)...], which just looks like a list of every ID in the dataframe.
I thought of doing : id_2 = df_table.select(another_field).filter(col(field).contains("val"))
but I still get the same result as first attempt.
I would like id_2, on each iteration of my loop, to take the value of the field I am filtering on, like:
3013848
319481
...
But not a list of every matching value from my dataframe.
Any idea on how I could get that into my variable ?
Thank you for helping.
In fact, dataFrame.select(colName) is supposed to return a column (a dataframe with only one column), not the value of that column for a particular row. I see in your comment that you want to do a recursive lookup in a Spark dataframe. The thing is, firstly, Spark AFAIK doesn't support recursive operations. If you have a deep recursive operation to do, you'd better collect your dataframe and do it on the driver without Spark, using whatever library you want, but you lose the advantage of processing the data in a distributed way.
Secondly, Spark isn't designed to iterate over each record one by one. Try to achieve it with joins of dataframes, but that brings me back to my first point: if a later join depends on the result of a previous one, recursively, just forget Spark.
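That said, if what you need is the scalar value itself rather than a one-column DataFrame, a minimal sketch (reusing the question's df_table, field and another_field names) is:
from pyspark.sql.functions import col

# first() returns a single Row (or None if nothing matched); index into it for the value
row = df_table.filter(col(field).contains("val")).select(another_field).first()
if row is not None:
    id_2 = row[0]   # e.g. 3013848
    print(id_2)

# or, for every matching value as a plain Python list on the driver:
ids = [r[0] for r in df_table.filter(col(field).contains("val")).select(another_field).collect()]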
I have a Spark table that I want to read in Python (I'm using Python 3 in Databricks). In effect, the structure is below. The log data is stored in a single string column, but it is a dictionary.
How do I break out the dictionary items to read them.
dfstates = spark.createDataFrame(
    [['{"EVENT_ID":"123829:0","EVENT_TS":"2020-06-22T10:16:01.000+0000","RECORD_INDEX":0}', 'texas', '24', '01/04/2019'],
     ['{"EVENT_ID":"123829:1","EVENT_TS":"2020-06-22T10:16:01.000+0000","RECORD_INDEX":1}', 'colorado', '13', '01/07/2019'],
     ['{"EVENT_ID":"123828:0","EVENT_TS":"2020-06-20T21:17:39.000+0000","RECORD_INDEX":0}', 'maine', '14', '']]
).toDF('LogData', 'State', 'Orders', 'OrdDate')
What I want to do is read the Spark table into a dataframe, find the max event timestamp, find the rows with that timestamp, count them, and read just those rows into a new dataframe with the data columns plus, from the log data, columns for the event ID (without the record index), the event date and the record index.
Downstream I'll be validating the data, converting from StringType to appropriate data type and filling in missing or invalid values as appropriate. All along I'll be asserting that row counts = original row counts.
The only thing I'm stuck on is how to read this log data column and turn it into something I can work with. Is there something in Spark like pandas.Series?
You can split a single struct-type column into multiple columns using dfstates.select('LogData.*'). Refer to this answer: How to split a list to multiple columns in Pyspark?
Once you have separate columns, you can do standard PySpark operations like filtering.
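If LogData is stored as a JSON string (as in the question) rather than an already-parsed struct, it needs to be parsed with from_json before the .* trick works. A minimal sketch, with an assumed schema and illustrative column names:
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

log_schema = StructType([
    StructField('EVENT_ID', StringType()),
    StructField('EVENT_TS', StringType()),
    StructField('RECORD_INDEX', IntegerType()),
])

parsed = dfstates.withColumn('LogData', F.from_json('LogData', log_schema))
flat = parsed.select(
    'State', 'Orders', 'OrdDate',
    F.split(F.col('LogData.EVENT_ID'), ':').getItem(0).alias('event_id'),   # drop the record-index suffix
    F.to_timestamp(F.col('LogData.EVENT_TS'), "yyyy-MM-dd'T'HH:mm:ss.SSSZ").alias('event_ts'),  # adjust the format if needed
    F.col('LogData.RECORD_INDEX').alias('record_index'),
)

# rows with the latest event timestamp, plus their count
max_ts = flat.agg(F.max('event_ts')).collect()[0][0]
latest = flat.filter(F.col('event_ts') == max_ts)
print(latest.count())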
In a Spark DataFrame you can address a column's values in the schema by using its name, like df['personId'], but that approach does not work with Glue's DynamicFrame. Is there a similar way, without converting the DynamicFrame to a DataFrame, to directly access a column's values by name?
You can use select_fields, see
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-transforms-SelectFields.html.
In your case it would be df.select_fields("personId"). Depending on what you want to do, you can save it as a new dynamic frame or just look at the data.
new_frame = df.select_fields("personId")
new_frame.show()
I have a number of tests where the Pandas dataframe output needs to be compared with a static baseline file. My preferred option for the baseline file format is the csv format for its readability and easy maintenance within Git. But if I were to load the csv file into a dataframe, and use
A.equals(B)
where A is the output dataframe and B is the dataframe loaded from the CSV file, there will inevitably be mismatches, since the CSV file does not record datatypes and the like. So my rather contrived solution is to write dataframe A out to a CSV file, load it back the same way as B, and then ask whether they are equal.
Does anyone have a better solution that they have been using for some time without any issues?
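For reference, the round-trip workaround described above can be kept entirely in memory; a minimal sketch (the baseline file name and frames are illustrative):
import io
import pandas as pd

# push A through the same CSV machinery the baseline went through, then reload it
buf = io.StringIO()
A.to_csv(buf, index=False)
buf.seek(0)
A_roundtripped = pd.read_csv(buf)

B = pd.read_csv('baseline.csv')
print(A_roundtripped.equals(B))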
If you are worried about the datatypes of the CSV file, you can load it as a dataframe with specific datatypes as follows:
import pandas as pd
B = pd.read_csv('path_to_csv.csv', dtype={"col1": "int", "col2": "float64", "col3": "object"})
This will ensure that each column of the csv is read as a particular data type
After that you can just compare the dataframes easily by using
A.equals(B)
EDIT:
If you need to compare a lot of pairs, another way to do it would be to compare the hash values of the dataframes instead of comparing each row and column of individual data frames
hashA = hash(A.values.tobytes())
hashB = hash(B.values.tobytes())
Now compare these two hash values, which are just integers, to check whether the original dataframes were the same.
Be careful though: I am not sure whether the data types of the original dataframes would matter or not. Be sure to check that.
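They do matter, at least for the byte representation: the same numbers stored as int64 and as float64 serialize to different bytes, so the hashes differ. A quick illustrative check:
import pandas as pd

df_int = pd.DataFrame({'a': [1, 2, 3]})        # int64 column
df_float = df_int.astype('float64')            # same numbers, different dtype

# different byte representations, hence (almost certainly) different hashes
print(hash(df_int.values.tobytes()) == hash(df_float.values.tobytes()))   # False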
I came across a solution that does work for my case by making use of Pandas testing utilities.
from pandas.testing import assert_frame_equal
Then call it from within a try except block where check_dtype is set to False.
try:
    assert_frame_equal(A, B, check_dtype=False)
    print("The dataframes are the same.")
except AssertionError:
    print("Please verify data integrity.")
(A != B).any(axis=1) returns a Series of Boolean values that tells you which rows differ and which are equal.
Boolean values are internally represented as 1s and 0s, so you can take a sum() to count how many rows were not equal:
sum((A != B).any(axis=1))
If you get an output of 0, that would mean all rows were equal.
I'm using Spark 1.3.1.
I am trying to view the values of a Spark dataframe column in Python. With a Spark dataframe, I can do df.collect() to view the contents of the dataframe, but there is no such method for a Spark dataframe column as best as I can see.
For example, the dataframe df contains a column named 'zip_code'. I can do df['zip_code'] and it returns a pyspark.sql.dataframe.Column, but I can't find a way to view the values in df['zip_code'].
You can access the underlying RDD and map over it:
df.rdd.map(lambda r: r.zip_code).collect()
You can also use select if you don't mind results wrapped using Row objects:
df.select('zip_code').collect()
Finally, if you simply want to inspect content then show method should be enough:
df.select('zip_code').show()
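If you want the plain values rather than Row objects, a small variation (still collecting to the driver) is:
zip_codes = [row.zip_code for row in df.select('zip_code').collect()]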
You can simply write:
df.select("your column's name").show()
In your case here, it will be:
df.select('zip_code').show()
To view the complete content:
for row in df.select('zip_code').take(1):
    print(row)
(show will only give you an overview.)