I'm working with PySpark and have the following code, which creates a nested JSON file from a DataFrame with some fields (product, quantity, from, to) nested under "requirements". Below are the code that creates the JSON and one row as an example.
final2 = final.groupby('identifier', 'plant', 'family', 'familyDescription', 'type', 'name', 'description', 'batchSize', 'phantom', 'makeOrBuy', 'safetyStock', 'unit', 'unitPrice', 'version').agg(F.collect_list(F.struct(F.col("product"), F.col("quantity"), F.col("from"), F.col("to"))).alias('requirements'))
{"identifier":"xxx","plant":"xxxx","family":"xxxx","familyDescription":"xxxx","type":"assembled","name":"xxxx","description":"xxxx","batchSize":20.0,"phantom":"False","makeOrBuy":"make","safetyStock":0.0,"unit":"PZ","unitPrice":xxxx,"version":"0001","requirements":[{"product":"yyyy","quantity":1.0,"from":"2000-01-01T00:00:00.000Z","to":"9999-12-31T00:00:00.000Z"},{"product":"zzzz","quantity":1.0,"from":"2000-01-01T00:00:00.000Z","to":"9999-12-31T00:00:00.000Z"},{"product":"kkkk","quantity":1.0,"from":"2000-01-01T00:00:00.000Z","to":"9999-12-31T00:00:00.000Z"},{"product":"wwww","quantity":1.0,"from":"2000-01-01T00:00:00.000Z","to":"9999-12-31T00:00:00.000Z"},{"product":"bbbb","quantity":1.0,"from":"2000-01-01T00:00:00.000Z","to":"9999-12-31T00:00:00.000Z"}]}
The schema of the final2 dataframe is the following:
|-- identifier: string (nullable = true)
|-- plant: string (nullable = true)
|-- family: string (nullable = true)
|-- familyDescription: string (nullable = true)
|-- type: string (nullable = false)
|-- name: string (nullable = true)
|-- description: string (nullable = true)
|-- batchSize: double (nullable = true)
|-- phantom: string (nullable = false)
|-- makeOrBuy: string (nullable = false)
|-- safetyStock: double (nullable = true)
|-- unit: string (nullable = true)
|-- unitPrice: double (nullable = true)
|-- version: string (nullable = true)
|-- requirements: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- product: string (nullable = true)
| | |-- quantity: double (nullable = true)
| | |-- from: timestamp (nullable = true)
| | |-- to: timestamp (nullable = true)
I'm facing a problem because I have to add to my final DataFrame some rows where product, quantity, from and to are null: with the code above I get "requirements":[{}], but the database I write the file to (MongoDB) raises an error on the empty JSON object, because it expects "requirements":[] for null values.
I've tried with
import pyspark.sql.functions as F
df = final_prova2.withColumn("requirements",
F.when(final_prova2.requirements.isNull(),
F.array()).otherwise(final_prova2.requirements))
but it doesn't work.
Any suggestion on how to modify the code? I'm struggling to find a solution (I don't even know if a solution is possible given the structure required).
Thanks
You need to check if all 4 fields of requirements are NULL, not the column itself. One way you can fix this is to adjust the collect_list aggregate function when creating final2:
import pyspark.sql.functions as F
final2 = final.groupby('identifier', 'plant', 'family', 'familyDescription', 'type', 'name', 'description', 'batchSize', 'phantom', 'makeOrBuy', 'safetyStock', 'unit', 'unitPrice', 'version') \
.agg(F.expr("""
collect_list(
IF(coalesce(quantity, product, from, to) is NULL
, NULL
, struct(product, quantity, from, to)
)
)
""").alias('requirements'))
Where:
we use the SQL expression IF(condition, true_value, false_value) to build the argument for collect_list
the condition coalesce(quantity, product, from, to) is NULL tests whether all four listed columns are NULL: if it is true, return NULL, otherwise return struct(product, quantity, from, to). Since collect_list ignores NULL entries, rows where all four fields are NULL contribute nothing, and you get "requirements":[] instead of [{}].
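If you prefer to stay in the DataFrame API, the same idea can be expressed with when/otherwise inside collect_list. A minimal sketch, assuming the column names from the question:

import pyspark.sql.functions as F

# All four requirement fields NULL -> emit NULL instead of a struct;
# collect_list skips NULLs, so such rows yield "requirements": [].
all_null = (
    F.col("product").isNull()
    & F.col("quantity").isNull()
    & F.col("from").isNull()
    & F.col("to").isNull()
)

final2 = final.groupby(
    'identifier', 'plant', 'family', 'familyDescription', 'type', 'name',
    'description', 'batchSize', 'phantom', 'makeOrBuy', 'safetyStock',
    'unit', 'unitPrice', 'version'
).agg(
    F.collect_list(
        F.when(all_null, F.lit(None))
         .otherwise(F.struct(F.col("product"), F.col("quantity"),
                             F.col("from"), F.col("to")))
    ).alias('requirements')
)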
Related
I have a JSON file with various levels of nested struct/array columns in one DataFrame, df_1. I have a smaller DataFrame, df_2, with fewer columns, but the column names match some column names in df_1, and none of the nested structure.
I want to apply the schema from df_1 to df_2 in a way that the two share the same schema, taking the existing columns in df_2 where possible, and creating the columns/nested structure that exist in df_1 but not df_2.
df_1
root
|-- association_info: struct (nullable = true)
| |-- ancestry: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- doi: string (nullable = true)
| |-- gwas_catalog_id: string (nullable = true)
| |-- neg_log_pval: double (nullable = true)
| |-- study_id: string (nullable = true)
| |-- pubmed_id: string (nullable = true)
| |-- url: string (nullable = true)
|-- gold_standard_info: struct (nullable = true)
| |-- evidence: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- class: string (nullable = true)
| | | |-- confidence: string (nullable = true)
| | | |-- curated_by: string (nullable = true)
| | | |-- description: string (nullable = true)
| | | |-- pubmed_id: string (nullable = true)
| | | |-- source: string (nullable = true)
| |-- gene_id: string (nullable = true)
| |-- highest_confidence: string (nullable = true)
df_2
root
|-- study_id: string (nullable = true)
|-- description: string (nullable = true)
|-- gene_id: string (nullable = true)
The expected output would be to have the same schema as df_1, and for any columns that don't exist in df_2 to just fill with null.
I have tried completely flattening the structure of df_1 to join the two DataFrames, but then I'm unsure how to change it back into the original schema. All solutions I've attempted so far have been in PySpark. It would be preferable to use PySpark for performance reasons, but if a solution requires converting to a Pandas DataFrame, that's also feasible.
df_1.select('association_info.study_id',
            'gold_standard_info.evidence.description',
            'gold_standard_info.gene_id')
The above code will reach into df_1 and give you the requisite fields for df_2 (note that evidence.description comes back as an array, since evidence is an array of structs). The schema of those fields remains the same.
Could you try that?
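The above selects out of df_1. If what you actually need is df_2 reshaped so that it carries df_1's schema, with nulls everywhere df_2 has no data, one possible approach is sketched below. The struct and field names are taken from the schemas above; df_2.description is left out because in df_1 it only exists inside the evidence array, so evidence stays NULL here, and nullability flags may differ slightly from df_1's.

import pyspark.sql.functions as F

# Hedged sketch: rebuild each top-level struct of df_1 from df_2's flat
# columns where the names match, and typed NULLs everywhere else.
def conform(struct_type, flat_df):
    # One column expression per field of the target struct.
    return [
        (F.col(f.name) if f.name in flat_df.columns
         else F.lit(None).cast(f.dataType)).alias(f.name)
        for f in struct_type.fields
    ]

assoc_type = df_1.schema["association_info"].dataType
gold_type = df_1.schema["gold_standard_info"].dataType

df_2_conformed = df_2.select(
    F.struct(*conform(assoc_type, df_2)).alias("association_info"),
    F.struct(*conform(gold_type, df_2)).alias("gold_standard_info"),
)
df_2_conformed.printSchema()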
We have some PySpark code that joins a table, table_a, twice to another table, table_b, using the code below. After joining the table twice, we drop the key_hash column from the output DataFrame.
This code was working fine in Spark 3.0.1. Since upgrading to Spark 3.2.2, the behaviour has changed: during the first transform operation the key_hash field gets dropped from the output DataFrame, but when the second transform operation is executed the key_hash field still remains in output_df.
Can someone please guide what has changed in Spark behaviour that is causing this issue?
def tr_join_sac_user(self, df_a):
    def inner(df_b):
        return (
            df_b.join(df_a, on=df_b["sac_key_hash"] == df_a["key_hash"], how="left")
            .drop(df_a.key_hash)
            .drop(df_b.sac_key_hash)
        )

    return inner

def tr_join_sec_user(self, df_a):
    def inner(df_b):
        return (
            df_b.join(df_a, on=df_b["sec_key_hash"] == df_a["key_hash"], how="left")
            .drop(df_a.key_hash)
            .drop(df_b.sec_key_hash)
        )

    return inner
table_a_df = spark.read.format("delta").load("/path/to/table_a")
table_b_df = spark.read.format("delta").load("/path/to/table_b")
output_df = table_b_df.transform(tr_join_sac_user(table_a_df))
output_df = output_df.transform(tr_join_sec_user(table_a_df))
If we use .drop('key_hash') instead of .drop(df_a.key_hash) that seems to work and the column does get dropped in 2nd transform as well. I would like to understand what has changed in Spark behaviour between these versions (or if it’s a bug) as this might have an impact in other places in our codebase as well.
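For reference, the name-based variant mentioned above would look like this for one of the two transforms. This is a sketch; it sidesteps how Spark resolves df_a.key_hash after the join rather than explaining the version difference:

# Dropping by column name instead of by Column reference does not depend on
# how df_a.key_hash is resolved against the joined plan, so it behaves the
# same across Spark versions.
def tr_join_sec_user(self, df_a):
    def inner(df_b):
        return (
            df_b.join(df_a, on=df_b["sec_key_hash"] == df_a["key_hash"], how="left")
            .drop("key_hash")
            .drop("sec_key_hash")
        )

    return inner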
Hi, I also have an issue with this one. I don't know if it's a bug or not, but it doesn't seem to happen all the time.
utilization_raw = time_lab.crossJoin(approved_listing)

utilization_raw = (
    utilization_raw
    .join(availability_series,
          (utilization_raw.date_series == availability_series.availability_date) &
          (utilization_raw.listing_id == availability_series.listing_id), "left")
    .drop(availability_series.listing_id).dropDuplicates()   # --> WORKING
    .join(request_series,
          (utilization_raw.date_series == request_series.request_date_series) &
          (utilization_raw.listing_id == request_series.listing_id), "left")
    .drop(request_series.listing_id)                          # --> WORKING
    .join(listing_pricing,
          (utilization_raw.date_series == listing_pricing.price_created_date) &
          (utilization_raw.listing_id == listing_pricing.listing_id), "left")
    .drop(listing_pricing.listing_id)                         # --> NOT WORKING
)
Here's the result of printSchema()
root
|-- date_series: date (nullable = false)
|-- week_series: date (nullable = true)
|-- month_series: date (nullable = true)
|-- woy_num: integer (nullable = false)
|-- doy_num: integer (nullable = false)
|-- dow_num: integer (nullable = false)
|-- listing_id: integer (nullable = true)
|-- is_driverless: integer (nullable = false)
|-- listing_deleted_at: date (nullable = true)
|-- daily_gmv: decimal(38,23) (nullable = true)
|-- daily_nmv: decimal(38,23) (nullable = true)
|-- daily_calendar_gmv: decimal(31,13) (nullable = true)
|-- daily_calendar_nmv: decimal(31,13) (nullable = true)
|-- active_booking: long (nullable = true)
|-- is_available: integer (nullable = false)
|-- is_requested: integer (nullable = false)
|-- listing_id: integer (nullable = true) --> duplicated
|-- base_price: decimal(10,2) (nullable = true)
Update: what we did was update the Databricks runtime version from 9.1 to 11.3.
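Independent of the runtime version, renaming the right-hand key before the join avoids ever having a duplicate listing_id column to drop. A sketch for the third (NOT WORKING) join, using the names from the snippet above:

# Rename the right-hand key so no duplicate `listing_id` column is produced,
# then the drop can be done by name.
listing_pricing_r = listing_pricing.withColumnRenamed("listing_id", "lp_listing_id")

utilization_raw = (
    utilization_raw
    .join(
        listing_pricing_r,
        (utilization_raw.date_series == listing_pricing_r.price_created_date)
        & (utilization_raw.listing_id == listing_pricing_r.lp_listing_id),
        "left",
    )
    .drop("lp_listing_id")
)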
I have a pyspark df.
df.printSchema()
root
|-- bio: string (nullable = true)
|-- city: string (nullable = true)
|-- company: string (nullable = true)
|-- custom_fields: struct (nullable = true)
| |-- nested_field1: string (nullable = true)
|-- email: string (nullable = true)
|-- first_conversion: struct (nullable = true)
| |-- nested_field2: struct (nullable = true)
| | |-- number: string (nullable = true)
| | |-- state: string (nullable = true)
I would like to iterate over the columns and nested fields in order to get their names (just their names). I should be able to print them and get the following result:
bio
city
company
custom_fields
nested_field1
email
first_conversion
nested_field2
number
state
I can easily print the first level with:
for st in df.schema:
    print(st.name)
But how do I traverse the deeper levels recursively at runtime?
dtypes will give you more details of the schema, though you will have to parse the resulting type strings yourself:
df.printSchema()
root
|-- id: integer (nullable = true)
|-- rec: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a: integer (nullable = true)
| | |-- b: float (nullable = true)
df.dtypes
# [('id', 'int'), ('rec', 'array<struct<a:int,b:float>>')]
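If you want the names at every depth without parsing type strings, a small recursive walk over df.schema does it. A sketch, using the field names from the question:

from pyspark.sql.types import StructType, ArrayType

def print_field_names(schema):
    # Recursively print field names, descending into structs and into the
    # element type of arrays of structs.
    for field in schema.fields:
        print(field.name)
        dtype = field.dataType
        while isinstance(dtype, ArrayType):
            dtype = dtype.elementType
        if isinstance(dtype, StructType):
            print_field_names(dtype)

print_field_names(df.schema)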
I am reading multiple complex json files and trying to extract geo coordinates.
I cannot attach the file itself right now, but I can print the tree here.
The file has several hundred options and some objects repeat.
Please see the structure of the file in .txt format.
When I read the JSON with Spark in Python, the coordinates do show up: they are stored in the coordinates column.
I am trying to reduce the number of columns and select only some of them.
The last two columns in the select below are my geo coordinates. I tried both coordinates and geo, and also coordinates.coordinates with geo.coordinates; neither option works.
df_tweets = tweets.select(['text',
'user.name',
'user.screen_name',
'user.id',
'user.location',
'place.country',
'place.full_name',
'place.name',
'user.followers_count',
'retweet_count',
'retweeted',
'user.friends_count',
'entities.hashtags.text',
'created_at',
'timestamp_ms',
'lang',
'coordinates.coordinates', # or just `coordinates`
'geo.coordinates' # or just `geo`
])
In the first case with coordinates and geo I get the following, printing the schema:
df_tweets.printSchema()
root
|-- text: string (nullable = true)
|-- name: string (nullable = true)
|-- screen_name: string (nullable = true)
|-- id: long (nullable = true)
|-- location: string (nullable = true)
|-- country: string (nullable = true)
|-- full_name: string (nullable = true)
|-- name: string (nullable = true)
|-- followers_count: long (nullable = true)
|-- retweet_count: long (nullable = true)
|-- retweeted: boolean (nullable = true)
|-- friends_count: long (nullable = true)
|-- text: array (nullable = true)
| |-- element: string (containsNull = true)
|-- created_at: string (nullable = true)
|-- timestamp_ms: string (nullable = true)
|-- lang: string (nullable = true)
|-- coordinates: struct (nullable = true)
| |-- coordinates: array (nullable = true)
| | |-- element: double (containsNull = true)
| |-- type: string (nullable = true)
|-- geo: struct (nullable = true)
| |-- coordinates: array (nullable = true)
| | |-- element: double (containsNull = true)
| |-- type: string (nullable = true)
When I do coordinates.coordinates and geo.coordinates, I get
root
|-- text: string (nullable = true)
|-- name: string (nullable = true)
|-- screen_name: string (nullable = true)
|-- id: long (nullable = true)
|-- location: string (nullable = true)
|-- country: string (nullable = true)
|-- full_name: string (nullable = true)
|-- name: string (nullable = true)
|-- followers_count: long (nullable = true)
|-- retweet_count: long (nullable = true)
|-- retweeted: boolean (nullable = true)
|-- friends_count: long (nullable = true)
|-- text: array (nullable = true)
| |-- element: string (containsNull = true)
|-- created_at: string (nullable = true)
|-- timestamp_ms: string (nullable = true)
|-- lang: string (nullable = true)
|-- coordinates: array (nullable = true)
| |-- element: double (containsNull = true)
|-- coordinates: array (nullable = true)
| |-- element: double (containsNull = true)
When I print both DataFrames in Pandas, neither of them gives me the coordinates; I still get None.
How to extract geo coordinates properly?
If I look at my dataframe with tweet data, I see it like this
In [44]: df[df.coordinates.notnull()]['coordinates']
Out[44]:
98 {'type': 'Point', 'coordinates': [-122.32111, ...
99 {'type': 'Point', 'coordinates': [-122.32111, ...
Name: coordinates, dtype: object
So it's a dictionary that has to be parsed
tweets_coords = df[df.coordinates.notnull()]['coordinates'].tolist()
for coords in tweets_coords:
    print(coords)
    print(coords['coordinates'])
    print(coords['coordinates'][0])
    print(coords['coordinates'][1])
Output:
{'type': 'Point', 'coordinates': [-122.32111, 47.62366]}
[-122.32111, 47.62366]
-122.32111
47.62366
{'type': 'Point', 'coordinates': [-122.32111, 47.62362]}
[-122.32111, 47.62362]
-122.32111
47.62362
You can set up a lambda function in apply() to parse these out row by row; otherwise you can use a list comprehension, using what I've provided as the basis for your analysis.
All that said, maybe check this first...
Where you are using coordinates.coordinates and geo.coordinates, try coordinates['coordinates'] and geo['coordinates']
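If you'd rather stay in PySpark instead of parsing dictionaries in Pandas, you can index into the nested arrays directly. A sketch using the field names from the schema above; the two Twitter fields may order longitude/latitude differently, so check a few non-null rows before naming the outputs:

from pyspark.sql import functions as F

# coordinates.coordinates and geo.coordinates are array<double> columns;
# index into them to get scalar columns. Rows without coordinates come out NULL.
df_coords = tweets.select(
    "text",
    F.col("coordinates.coordinates")[0].alias("coord_0"),
    F.col("coordinates.coordinates")[1].alias("coord_1"),
    F.col("geo.coordinates")[0].alias("geo_0"),
    F.col("geo.coordinates")[1].alias("geo_1"),
)

df_coords.filter(F.col("coord_0").isNotNull()).show(5, truncate=False)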
I am trying to parse nested JSON using some sample JSON. Below is the printed schema:
|-- batters: struct (nullable = true)
| |-- batter: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- id: string (nullable = true)
| | | |-- type: string (nullable = true)
|-- id: string (nullable = true)
|-- name: string (nullable = true)
|-- ppu: double (nullable = true)
|-- topping: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: string (nullable = true)
| | |-- type: string (nullable = true)
|-- type: string (nullable = true)
I am trying to explode batters and topping separately and then combine them.
from pyspark.sql.functions import explode

df_batter = df_json.select("batters.*")
df_explode1 = df_batter.withColumn("batter", explode("batter")).select("batter.*")
df_explode2 = df_json.withColumn("topping", explode("topping")).select(
    "id", "type", "name", "ppu", "topping.*")
I am unable to combine the two DataFrames. I tried using a single query:
exploded1 = df_json.withColumn("batter", df_batter.withColumn("batter",
explode("batter"))).withColumn("topping", explode("topping")).select("id",
"type","name","ppu","topping.*","batter.*")
But I am getting an error. Kindly help me to solve it. Thanks
You basically have to explode the arrays together using arrays_zip, which returns a merged array of structs. Try this; I haven't tested it, but it should work.
from pyspark.sql import functions as F
df_json.select("id","type","name","ppu","topping","batters.*")\
.withColumn("zipped", F.explode(F.arrays_zip("batter","topping")))\
.select("id","type","name","ppu","zipped.*").show()
You could also do it one by one:
from pyspark.sql import functions as F
df1=df_json.select("id","type","name","ppu","topping","batters.*")\
.withColumn("batter", F.explode("batter"))\
.select("id","type","name","ppu","topping","batter")
df1.withColumn("topping", F.explode("topping")).select("id","type","name","ppu","topping.*","batter.*")