I have a CSV file with columns as below
The required result, as nested JSON objects, is below:
{
  "Type": "A",
  "Value": "1"
}
{
  "Type": "B",
  "Value": "1"
}
I tried the code below; any help would be deeply appreciated.
from pyspark.sql.functions import *

list1 = ['A', 'B', 'C']
df2 = spark.sql("select * from test limit 10")
df3 = df2.select(list1)
for i in range(0, len(list1)):
    df4 = df3.withColumn("Type", lit(list1[i]))
This can be done with an unpivoting trick.
Suppose you have a dataset like the one below. Let's call it a patient's test results: columns A, B, C, ... stand for test type A, test type B, and so on, and the values in those columns are the numeric test results.
+-------------+---+---+---+---+---+---+---+
|PatientNumber| A| B| C| D| E| F| G|
+-------------+---+---+---+---+---+---+---+
| 101| 1| 2| 3| 4| 5| 6| 7|
| 102| 11| 12| 13| 14| 15| 16| 17|
+-------------+---+---+---+---+---+---+---+
I've added a PatientNumber column just to make the data look more sensible; you can remove it in your code.
I saved this dataset to a CSV:
val testDF = spark.read.format("csv").option("header", "true").load("""C:\TestData\CSVtoJSon.csv""")
Let's create two arrays, one for the id columns and another for all the test types:
val idCols = Array("PatientNumber")
val valCols = testDF.columns.diff(idCols)
Then here's the code to unpivot:
val valcolNames = valCols.map(x => List("'" + x + "'", x)) // pair each column name as a string literal with the column itself: 'A', A
val unPivotedDF = testDF.select($"PatientNumber", expr(s"""stack(${valCols.size},${valcolNames.flatMap(x => x).mkString(",")} ) as (Type,Value)"""))
Here's what the unpivoted data looks like:
+-------------+----+-----+
|PatientNumber|Type|Value|
+-------------+----+-----+
| 101| A| 1|
| 101| B| 2|
| 101| C| 3|
| 101| D| 4|
| 101| E| 5|
| 101| F| 6|
| 101| G| 7|
| 102| A| 11|
| 102| B| 12|
| 102| C| 13|
| 102| D| 14|
| 102| E| 15|
| 102| F| 16|
| 102| G| 17|
+-------------+----+-----+
Finally, write this unpivoted DF as JSON:
unPivotedDF.coalesce(1).write.format("json").mode("Overwrite").save("""C:\TestData\output""")
The content of the JSON file looks the same as your desired result:
{"PatientNumber":"101","Type":"A","Value":"1"}
{"PatientNumber":"101","Type":"B","Value":"2"}
{"PatientNumber":"101","Type":"C","Value":"3"}
{"PatientNumber":"101","Type":"D","Value":"4"}
{"PatientNumber":"101","Type":"E","Value":"5"}
{"PatientNumber":"101","Type":"F","Value":"6"}
{"PatientNumber":"101","Type":"G","Value":"7"}
{"PatientNumber":"102","Type":"A","Value":"11"}
{"PatientNumber":"102","Type":"B","Value":"12"}
{"PatientNumber":"102","Type":"C","Value":"13"}
{"PatientNumber":"102","Type":"D","Value":"14"}
{"PatientNumber":"102","Type":"E","Value":"15"}
{"PatientNumber":"102","Type":"F","Value":"16"}
{"PatientNumber":"102","Type":"G","Value":"17"}
My question:
There are two dataframes; the info one is still being built. What I want to do is filter the reference dataframe on a condition: when key is b, take its value (2) and apply it to the info table as a whole column. The output dataframe below is the final result I want.
Dataframe (info)
+-----+-----+
| key|value|
+-----+-----+
| a| 10|
| b| 20|
| c| 50|
| d| 40|
+-----+-----+
Dataframe (Reference)
+-----+-----+
| key|value|
+-----+-----+
| a| 42|
| b| 2|
| c| 9|
| d| 100|
+-----+-----+
Below is the output I want:
Dataframe (Output)
+-----+-----+-----+
| key|value|const|
+-----+-----+-----+
| a| 10| 2|
| b| 20| 2|
| c| 50| 2|
| d| 40| 2|
+-----+-----+-----+
I have tried several methods, and below is the latest one, but the system warns me that PySpark does not have a loc function.
df_cal = (
    info
    .join(reference)
    .withColumn('const', reference.loc[reference['key'] == 'b', 'value'].iloc[0])
    .select('key', 'result', 'const')
)
df_cal.show()
And below is the warning the system raised:
AttributeError: 'DataFrame' object has no attribute 'loc'
This solves it:
from pyspark.sql.functions import lit

# df1 is the info dataframe, df2 is the reference dataframe
target = 'b'
const = [i['value'] for i in df2.collect() if i['key'] == target]
df_cal = df1.withColumn('const', lit(const[0]))
df_cal.show()
+---+-----+-----+
|key|value|const|
+---+-----+-----+
| a| 10| 2|
| b| 20| 2|
|  c|   50|    2|
| d| 40| 2|
+---+-----+-----+
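As a side note, the same value can be pulled without collecting the whole reference dataframe; a small sketch, still assuming df1 is info and df2 is reference:
from pyspark.sql.functions import lit

# first() returns a Row, or None if the key is missing
row = df2.filter(df2['key'] == 'b').select('value').first()
df_cal = df1.withColumn('const', lit(row['value'] if row else None))
df_cal.show()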
I have a df on the left like this one:
+----+-----+
| id|value|
+----+-----+
| 2| xx|
| 4| xx|
| 11| xx|
| 14| xx|
| 27| xx|
| 28| xx|
| 56| xx|
| 55| xx|
+----+-----+
And another one on the right like this one:
+-----+---+----+
|start|end| ov |
+-----+---+----+
| 0| 9| A|
| 10| 19| B|
| 20| 29| C|
| 30| 39| D|
| 40| 49| F|
+-----+---+----+
And I need to join the rows when the id of the first table falls within the start/end range of the second table. The output should look like this:
+----+-----+----+
| id|value| ov |
+----+-----+----+
| 2| xx| A|
| 4| xx| A|
| 11| xx| B|
| 14| xx| B|
| 27| xx| C|
| 28| xx| C|
| 56| xx| |
| 55| xx| |
+----+-----+----+
How can I achieve this result with PySpark?
Use the between operator with a left join.
Example:
#using dataframes api
df.join(df1,(df['id'] >= df1['start']) & (df['id'] <= df1['end']),'left').select(df["*"],df1['ov']).show(10,False)
#using spark sql api
df.createOrReplaceTempView("t1")
df1.createOrReplaceTempView("t2")
spark.sql("select t1.*,t2.ov from t1 left join t2 on t1.id between t2.start and t2.end").show()
#this is just sample data
#+---+-----+----+
#| id|value| ov|
#+---+-----+----+
#| 2| xx| A|
#| 4| xx| A|
#| 55| zz|null|
#+---+-----+----+
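The DataFrame API also exposes a between method on Column, which reads closer to the SQL version; a minimal sketch with the same df/df1 names:
df.join(df1, df['id'].between(df1['start'], df1['end']), 'left') \
  .select(df['*'], df1['ov']) \
  .show()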
I'm using the function below to explode a deeply nested JSON (it has nested structs and arrays).
from pyspark.sql import functions as F

# Flatten nested df
def flatten_df(nested_df):
    # Explode any array columns (re-checked on each pass of the outer loop)
    for col in nested_df.columns:
        array_cols = [c[0] for c in nested_df.dtypes if c[1][:5] == 'array']
        for col in array_cols:
            nested_df = nested_df.withColumn(col, F.explode_outer(nested_df[col]))

    # Stop recursing once no struct columns remain
    nested_cols = [c[0] for c in nested_df.dtypes if c[1][:6] == 'struct']
    if len(nested_cols) == 0:
        return nested_df

    # Promote struct fields to top-level columns, then recurse
    flat_cols = [c[0] for c in nested_df.dtypes if c[1][:6] != 'struct']
    flat_df = nested_df.select(flat_cols +
                               [F.col(nc + '.' + c).alias(nc + '_' + c)
                                for nc in nested_cols
                                for c in nested_df.select(nc + '.*').columns])
    return flatten_df(flat_df)
I'm successfully able to explode it. But I also want to add the order, or index, of the elements in the exploded dataframe, so in the code above I replaced the explode_outer function with posexplode_outer. But I get the error below:
An error was encountered:
'The number of aliases supplied in the AS clause does not match the number of columns output by the UDTF expected 2 aliases'
I tried changing nested_df.withColumn to nested_df.select but wasn't successful. Can anyone help me explode the nested JSON while at the same time keeping the order of the array elements as a column in the exploded dataframe?
Read the JSON data as a dataframe and create a view or table. In Spark SQL you can use any number of LATERAL VIEW explode clauses with alias references. If the JSON structure is a struct type, you can use dot notation to reference nested fields, e.g. level1.level2.
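For illustration, a minimal sketch of that Spark SQL route, where df is the dataframe read from the JSON and json_view and the array column arr are hypothetical names:
df.createOrReplaceTempView("json_view")
spark.sql("""
    SELECT t.*, p.pos, p.val
    FROM json_view t
    LATERAL VIEW OUTER POSEXPLODE(arr) p AS pos, val
""").show()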
Replace nested_df = nested_df.withColumn(col, F.explode_outer(nested_df[col])) with nested_df = nested_df.selectExpr("*", f"posexplode({col}) as (position,col)").drop(col)
You might need to write some logic to rename the columns back to the originals, but it should be simple; a possible sketch follows.
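For example, the rename back could look like this (a hypothetical snippet, assuming the variable col still holds the original array column's name):
nested_df = (nested_df
             .withColumnRenamed("col", col)                  # exploded values back under the original name
             .withColumnRenamed("position", col + "_pos"))   # keep the element index alongside it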
The error is because posexplode_outer returns two columns, pos and col, so you cannot use it with withColumn(). It can be used in a select, as shown in the code below:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
tst= sqlContext.createDataFrame([(1,7,80),(1,8,40),(1,5,100),(5,8,90),(7,6,50),(0,3,60)],schema=['col1','col2','col3'])
tst_new = tst.withColumn("arr",F.array(tst.columns))
expr = tst.columns
expr.append(F.posexplode_outer('arr'))
#%%
tst_explode = tst_new.select(*expr)
results:
tst_explode.show()
+----+----+----+---+---+
|col1|col2|col3|pos|col|
+----+----+----+---+---+
| 1| 7| 80| 0| 1|
| 1| 7| 80| 1| 7|
| 1| 7| 80| 2| 80|
| 1| 8| 40| 0| 1|
| 1| 8| 40| 1| 8|
| 1| 8| 40| 2| 40|
| 1| 5| 100| 0| 1|
| 1| 5| 100| 1| 5|
| 1| 5| 100| 2|100|
| 5| 8| 90| 0| 5|
| 5| 8| 90| 1| 8|
| 5| 8| 90| 2| 90|
| 7| 6| 50| 0| 7|
| 7| 6| 50| 1| 6|
| 7| 6| 50| 2| 50|
| 0| 3| 60| 0| 0|
| 0| 3| 60| 1| 3|
| 0| 3| 60| 2| 60|
+----+----+----+---+---+
If you need to rename the columns, you can use the .withColumnRenamed() function
df_final=(tst_explode.withColumnRenamed('pos','position')).withColumnRenamed('col','column')
You can try select with list-comprehension to posexplode the ArrayType columns in your existing code:
for col in array_cols:
    nested_df = nested_df.select([F.posexplode_outer(col).alias(col + '_pos', col) if c == col else c for c in nested_df.columns])
Example:
from pyspark.sql import functions as F
df = spark.createDataFrame([(1,"n1", ["a", "b", "c"]),(2,"n2", ["foo", "bar"])],["id", "name", "vals"])
#+---+----+----------+
#| id|name| vals|
#+---+----+----------+
#| 1| n1| [a, b, c]|
#| 2| n2|[foo, bar]|
#+---+----+----------+
col = "vals"
df.select([F.posexplode_outer(col).alias(col+'_pos', col) if c == col else c for c in df.columns]).show()
#+---+----+--------+----+
#| id|name|vals_pos|vals|
#+---+----+--------+----+
#| 1| n1| 0| a|
#| 1| n1| 1| b|
#| 1| n1| 2| c|
#| 2| n2| 0| foo|
#| 2| n2| 1| bar|
#+---+----+--------+----+
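If it helps, here is a minimal sketch of how that list comprehension could replace the withColumn line inside the original flatten_df; everything else is kept as in the question's function:
from pyspark.sql import functions as F

def flatten_df_with_pos(nested_df):
    # Explode every array column, keeping the element index as <name>_pos
    array_cols = [c[0] for c in nested_df.dtypes if c[1][:5] == 'array']
    for col in array_cols:
        nested_df = nested_df.select(
            [F.posexplode_outer(col).alias(col + '_pos', col) if c == col else c
             for c in nested_df.columns])

    nested_cols = [c[0] for c in nested_df.dtypes if c[1][:6] == 'struct']
    if len(nested_cols) == 0:
        return nested_df

    flat_cols = [c[0] for c in nested_df.dtypes if c[1][:6] != 'struct']
    flat_df = nested_df.select(
        flat_cols +
        [F.col(nc + '.' + c).alias(nc + '_' + c)
         for nc in nested_cols
         for c in nested_df.select(nc + '.*').columns])
    return flatten_df_with_pos(flat_df)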
I have a spark data frame like below
+---+----+----+----+----+----+----+
| id| 1| 2| 3|sf_1|sf_2|sf_3|
+---+----+----+----+----+----+----+
| 2|null|null|null| 102| 202| 302|
| 4|null|null|null| 104| 204| 304|
| 1|null|null|null| 101| 201| 301|
| 3|null|null|null| 103| 203| 303|
| 1| 11| 21| 31|null|null|null|
| 2| 12| 22| 32|null|null|null|
| 4| 14| 24| 34|null|null|null|
| 3| 13| 23| 33|null|null|null|
+---+----+----+----+----+----+----+
I want to transform the data frame into the one below by merging the null rows:
+---+----+----+----+----+----+----+
| id| 1| 2| 3|sf_1|sf_2|sf_3|
+---+----+----+----+----+----+----+
| 1| 11| 21| 31| 101| 201| 301|
| 2| 12| 22| 32| 102| 202| 302|
| 4| 14| 24| 34| 104| 204| 304|
| 3| 13| 23| 33| 103| 203| 303|
+---+----+----+----+----+----+----+
Preferably in Scala.
You can group on id and aggregate using first with ignorenulls for other columns:
import pyspark.sql.functions as F
(df.groupBy('id').agg(*[F.first(x,ignorenulls=True) for x in df.columns if x!='id'])
.show())
+---+----+----+----+-----+-----+-----+
| id| 1| 2| 3| sf_1| sf_2| sf_3|
+---+----+----+----+-----+-----+-----+
| 1|11.0|21.0|31.0|101.0|201.0|301.0|
| 3|13.0|23.0|33.0|103.0|203.0|303.0|
| 2|12.0|22.0|32.0|102.0|202.0|302.0|
| 4|14.0|24.0|34.0|104.0|204.0|304.0|
+---+----+----+----+-----+-----+-----+
The Scala way of doing it:
val inputColumns = inputLoadDF.columns.toList.drop(1) // drop the first column, id
val exprs = inputColumns.map(x => first(x, true))
inputLoadDF.groupBy("id").agg(exprs.head, exprs.tail: _*).show()
I have this input :
timestamp,user
1,A
2,B
5,C
9,E
12,F
The result wanted is :
timestampRange,userList
1 to 2,[A,B]
3 to 4,[] Or null
5 to 6,[C]
7 to 8,[] Or null
9 to 10,[E]
11 to 12,[F]
I tried using Window, but the problem is that it doesn't include the empty timestamp ranges.
Any hints would be helpful.
I don't know whether a windowing function will cover the gaps between ranges, but you can take the following approach:
Define a dataframe, df_ranges:
val ranges = List((1,2), (3,4), (5,6), (7,8), (9,10))
val df_ranges = sc.parallelize(ranges).toDF("start", "end")
+-----+---+
|start|end|
+-----+---+
| 1| 2|
| 3| 4|
| 5| 6|
| 7| 8|
| 9| 10|
+-----+---+
Data with the timestamp column, df_data :
val data = List((1,"A"), (2,"B"), (5,"C"), (9,"E"))
val df_data = sc.parallelize(data).toDF("timestamp", "user")
+---------+----+
|timestamp|user|
+---------+----+
| 1| A|
| 2| B|
| 5| C|
| 9| E|
+---------+----+
Join the two dataframes on the start, end, and timestamp columns:
df_ranges.join(df_data, df_ranges.col("start").equalTo(df_data.col("timestamp")).or(df_ranges.col("end").equalTo(df_data.col("timestamp"))), "left")
+-----+---+---------+----+
|start|end|timestamp|user|
+-----+---+---------+----+
| 1| 2| 1| A|
| 1| 2| 2| B|
| 5| 6| 5| C|
| 9| 10| 9| E|
| 3| 4| null|null|
| 7| 8| null|null|
+-----+---+---------+----+
Now do a simple aggregation with the collect_list function (res4 is the joined dataframe from the previous step):
res4.groupBy("start", "end").agg(collect_list("user")).orderBy("start").show()
+-----+---+------------------+
|start|end|collect_list(user)|
+-----+---+------------------+
| 1| 2| [A, B]|
| 3| 4| []|
| 5| 6| [C]|
| 7| 8| []|
| 9| 10| [E]|
+-----+---+------------------+
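If you also need the ranges generated programmatically and the exact timestampRange/userList format from the question, here is a minimal PySpark sketch along the same lines (the step of 2 and the column names are taken from the question; the rest is illustrative):
from pyspark.sql import functions as F

df_data = spark.createDataFrame(
    [(1, "A"), (2, "B"), (5, "C"), (9, "E"), (12, "F")], ["timestamp", "user"])

# Build 2-wide ranges covering the data: (1,2), (3,4), ..., (11,12)
max_ts = df_data.agg(F.max("timestamp")).first()[0]
df_ranges = spark.createDataFrame(
    [(i, i + 1) for i in range(1, max_ts + 1, 2)], ["start", "end"])

result = (df_ranges
          .join(df_data, df_data["timestamp"].between(df_ranges["start"], df_ranges["end"]), "left")
          .groupBy("start", "end")
          .agg(F.collect_list("user").alias("userList"))
          .withColumn("timestampRange",
                      F.concat_ws(" to ", F.col("start").cast("string"), F.col("end").cast("string")))
          .orderBy("start")
          .select("timestampRange", "userList"))
result.show()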