How to get the column by its index instead of a name? - python

I have the following initial PySpark DataFrame:
+----------+--------------------------------+
|product_PK| products|
+----------+--------------------------------+
| 686 | [[686,520.70],[645,2]]|
| 685 |[[685,45.556],[678,23],[655,21]]|
| 693 | []|
df = sqlCtx.createDataFrame(
[(686, [[686,520.70], [645,2]]), (685, [[685,45.556], [678,23],[655,21]]), (693, [])],
["product_PK", "products"]
)
The column products contains nested data. I need to extract the second value in each pair of values. I am running this code:
temp_dataframe = dataframe.withColumn("exploded" , explode(col("products"))).withColumn("score", col("exploded").getItem("_2"))
It works well with particular DataFrame. However, I want to put this code into a function and run it on different DataFrames. All of my DataFrames have the same structure. The only difference is that the sub-column "_2" might be named differently in some DataFrames, e.g. "col1" or "col2".
For example:
DataFrame content
root
|-- product_PK: long (nullable = true)
|-- products: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _1: long (nullable = true)
| | |-- _2: double (nullable = true)
|-- exploded: struct (nullable = true)
| |-- _1: long (nullable = true)
| |-- _2: double (nullable = true)
DataFrame content
root
|-- product_PK: long (nullable = true)
|-- products: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- product_PK: long (nullable = true)
| | |-- col2: integer (nullable = true)
|-- exploded: struct (nullable = true)
| |-- product_PK: long (nullable = true)
| |-- col2: integer (nullable = true)
I tried to use index like getItem(1), but it says that the name of a column must be provided.
Is there any way to avoid specifying the column name or somehow generalize this part of a code?
My goal is that exploded contains the second value of each pair in the nested data, i.e. _2 or col1 or col2.

It sounds like you were on the right track. I think the way to accomplish this is to read the schema to determine the name of the field you want to explode on. Instead of schema.names though, you need to use schema.fields to find the struct field, and then use it's properties to figure out the fields in the struct. Here is an example:
from pyspark.sql.functions import *
from pyspark.sql.types import *
# Setup the test dataframe
data = [
(686, [(686, 520.70), (645, 2.)]),
(685, [(685, 45.556), (678, 23.), (655, 21.)]),
(693, [])
]
schema = StructType([
StructField("product_PK", StringType()),
StructField("products",
ArrayType(StructType([
StructField("_1", IntegerType()),
StructField("col2", FloatType())
]))
)
])
df = sqlCtx.createDataFrame(data, schema)
# Find the products field in the schema, then find the name of the 2nd field
productsField = next(f for f in df.schema.fields if f.name == 'products')
target_field = productsField.dataType.elementType.names[1]
# Do your explode using the field name
temp_dataframe = df.withColumn("exploded" , explode(col("products"))).withColumn("score", col("exploded").getItem(target_field))
Now, if you examine the result you get this:
>>> temp_dataframe.printSchema()
root
|-- product_PK: string (nullable = true)
|-- products: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _1: integer (nullable = true)
| | |-- col2: float (nullable = true)
|-- exploded: struct (nullable = true)
| |-- _1: integer (nullable = true)
| |-- col2: float (nullable = true)
|-- score: float (nullable = true)

Is that what you want?
>>> df.show(10, False)
+----------+-----------------------------------------------------------------------+
|product_PK|products |
+----------+-----------------------------------------------------------------------+
|686 |[WrappedArray(686, null), WrappedArray(645, 2)] |
|685 |[WrappedArray(685, null), WrappedArray(678, 23), WrappedArray(655, 21)]|
|693 |[] |
+----------+-----------------------------------------------------------------------+
>>> import pyspark.sql.functions as F
>>> df.withColumn("exploded", F.explode("products")) \
... .withColumn("exploded", F.col("exploded").getItem(1)) \
... .show(10,False)
+----------+-----------------------------------------------------------------------+--------+
|product_PK|products |exploded|
+----------+-----------------------------------------------------------------------+--------+
|686 |[WrappedArray(686, null), WrappedArray(645, 2)] |null |
|686 |[WrappedArray(686, null), WrappedArray(645, 2)] |2 |
|685 |[WrappedArray(685, null), WrappedArray(678, 23), WrappedArray(655, 21)]|null |
|685 |[WrappedArray(685, null), WrappedArray(678, 23), WrappedArray(655, 21)]|23 |
|685 |[WrappedArray(685, null), WrappedArray(678, 23), WrappedArray(655, 21)]|21 |
+----------+-----------------------------------------------------------------------+--------+

Given that your exploded column is a struct as
|-- exploded: struct (nullable = true)
| |-- _1: integer (nullable = true)
| |-- col2: float (nullable = true)
You can use following logic to get the second element without knowing the name
from pyspark.sql import functions as F
temp_dataframe = df.withColumn("exploded" , F.explode(F.col("products")))
temp_dataframe.withColumn("score", F.col("exploded."+temp_dataframe.select(F.col("exploded.*")).columns[1]))
you should have output as
+----------+--------------------------------------+------------+------+
|product_PK|products |exploded |score |
+----------+--------------------------------------+------------+------+
|686 |[[686,520.7], [645,2.0]] |[686,520.7] |520.7 |
|686 |[[686,520.7], [645,2.0]] |[645,2.0] |2.0 |
|685 |[[685,45.556], [678,23.0], [655,21.0]]|[685,45.556]|45.556|
|685 |[[685,45.556], [678,23.0], [655,21.0]]|[678,23.0] |23.0 |
|685 |[[685,45.556], [678,23.0], [655,21.0]]|[655,21.0] |21.0 |
+----------+--------------------------------------+------------+------+

Related

How to apply a schema from an existing DataFrame to another DataFrame with missing columns in PySpark

I have a JSON file with various levels of nested struct/array columns in one DataFrame, df_1. I have a smaller DataFrame, df_2, with less columns, but the column names match with some column names in df_1, and none of the nested structure.
I want to apply the schema from df_1 to df_2 in a way that the two share the same schema, taking the existing columns in df_2 where possible, and creating the columns/nested structure that exist in df_1 but not df_2.
df_1
root
|-- association_info: struct (nullable = true)
| |-- ancestry: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- doi: string (nullable = true)
| |-- gwas_catalog_id: string (nullable = true)
| |-- neg_log_pval: double (nullable = true)
| |-- study_id: string (nullable = true)
| |-- pubmed_id: string (nullable = true)
| |-- url: string (nullable = true)
|-- gold_standard_info: struct (nullable = true)
| |-- evidence: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- class: string (nullable = true)
| | | |-- confidence: string (nullable = true)
| | | |-- curated_by: string (nullable = true)
| | | |-- description: string (nullable = true)
| | | |-- pubmed_id: string (nullable = true)
| | | |-- source: string (nullable = true)
| |-- gene_id: string (nullable = true)
| |-- highest_confidence: string (nullable = true)
df_2
root
|-- study_id: string (nullable = true)
|-- description: string (nullable = true)
|-- gene_id: string (nullable = true)
The expected output would be to have the same schema as df_1, and for any columns that don't exist in df_2 to just fill with null.
I have tried completely flattening the structure of df_1 to join the two DataFrames, but then I'm unsure how to change it back into the original schema. All solutions I've attempted so far have been in PySpark. It would be preferable to use PySpark for performance considerations, but if a solution requires converted to a Pandas DataFrame that's also feasible.
df1.select('association_info.study_id',
'gold_standard_info.evidence.element.description',
'gold_standard_info.gene_id')
The above code will reach into the df1 and provide you requisite fields in df2. The schema will remain same.
Could you try the same.

Convert array of elements to multiple columns

How I can convert a array (in a column) with a set of elements in a JSON dataset to multiple columns with python, spark or pandas?
The data is structured in this form:
root
|-- items: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: string (nullable = true)
| | |-- idAccount: long (nullable = true)
| | |-- infractionType: string (nullable = true)
| | |-- responseTime: string (nullable = true)
| | |-- status: string (nullable = true)
| | |-- transactionCode: string (nullable = true)
I'm expecting some kind of this:
id
idAccount
value
value
value
value
an array of structs can be exploded into columns using the inline sql function.
here's an example of how it works.
data_sdf = spark.createDataFrame([([(1234, 2345, 3456), (4321, 5432, 6543)],)],
'items array<struct<id: int, id_acc: int, foo: int>>'
)
# +----------------------------------------+
# |items |
# +----------------------------------------+
# |[{1234, 2345, 3456}, {4321, 5432, 6543}]|
# +----------------------------------------+
# root
# |-- items: array (nullable = true)
# | |-- element: struct (containsNull = true)
# | | |-- id: integer (nullable = true)
# | | |-- id_acc: integer (nullable = true)
# | | |-- foo: integer (nullable = true)
# explode and create new columns using struct fields - using `inline`
data_sdf. \
selectExpr('inline(items)'). \
show()
# +----+------+----+
# | id|id_acc| foo|
# +----+------+----+
# |1234| 2345|3456|
# |4321| 5432|6543|
# +----+------+----+
you can further just select() the required fields after the explosion.
In Spark SQL, you can access the item in ArrayType or MapType column by using getItem. For example, you want to get the value of the id of first item, you can use df.select(func.getItem(0).getItem('id'))

Spark : How to reuse the same array schema that has all fields defined across the data-frame

I have hundreds of columns a,b,c ... . I would like to modify dataframe schema, where each array will have the same shape date, num and val field.
There are thousands of id so I would like to modify ONLY schema not dataframe. Modified schema will be used in the next step to load data to dataframe efficiently . I would like to avoid using UDF to modify whole dataframe.
Input schema:
df.printSchema()
root
|-- a: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- num: long (nullable = true) !!! NOTE : `num` !!!
| | |-- val: long (nullable = true)
|-- b: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- val: long (nullable = true)
|-- c: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- val: long (nullable = true)
|-- d: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- val: long (nullable = true)
|-- id: long (nullable = true)
Required Output schema:
root
|-- a: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- num: long (nullable = true)
| | |-- val: long (nullable = true)
|-- b: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- num: long (nullable = true)
| | |-- val: long (nullable = true)
|-- c: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- num: long (nullable = true)
| | |-- val: long (nullable = true)
|-- d: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: long (nullable = true)
| | |-- num: long (nullable = true)
| | |-- val: long (nullable = true)
|-- id: long (nullable = true)
To reproduce input Schema:
df = spark.read.json(sc.parallelize([
"""{"id":1,"a":[{"date":2001,"num":1},{"date":2002,},{"date":2003,}],"b":[{"date":2001,"val":4},{"date":2002,"val":5},{"date":2003,"val":6}],"d":[{"date":2001,"val":21},{"date":2002,"val":22},{"date":2003,"val":23}],"c":[{"date":2000,"val":30},{"date":2001,"val":31},{"date":2002,"val":32},{"date":2003,"val":33}]}""",
"""{"id":2,"a":[{"date":2001,"num":2},{"date":2002},{"date":2003}],"b":[{"date":2001,"val":4},{"date":2002,"val":5},{"date":2003,"val":6}],"d":[{"date":2001,"val":21},{"date":2002,"val":22},{"date":2003,"val":23}],"c":[{"date":1990,"val":39},{"date":2000,"val":30},{"date":2001,"val":31},{"date":2002,"val":32},{"date":2003,"val":33},{"date":2004,"val":34}]}}"""
]))
for field in df.schema:
print(field)
Print output:
StructField(a,ArrayType(StructType(List(StructField(date,LongType,true),StructField(num,LongType,true),StructField(val,LongType,true))),true),true)
StructField(b,ArrayType(StructType(List(StructField(date,LongType,true),StructField(val,LongType,true))),true),true)
StructField(c,ArrayType(StructType(List(StructField(date,LongType,true),StructField(val,LongType,true))),true),true)
StructField(d,ArrayType(StructType(List(StructField(date,LongType,true),StructField(val,LongType,true))),true),true)
StructField(id,LongType,true)
Solution (see OneCricketeer answer below for details) :
from pyspark.sql.types import StructField, StructType, LongType, ArrayType
jsonstr=[
"""{"id":1,"a":[{"date":2001,"val":1,"num":1},{"date":2002,"val":2},{"date":2003,"val":3}],"b":[{"date":2001,"val":4},{"date":2002,"val":5},{"date":2003,"val":6}],"d":[{"date":2001,"val":21},{"date":2002,"val":22},{"date":2003,"val":23}],"c":[{"date":2000,"val":30},{"date":2001,"val":31},{"date":2002,"val":32},{"date":2003,"val":33}]}""",
"""{"id":2,"a":[{"date":2001,"val":1},{"date":2002,"val":2},{"date":2003,"val":3}],"b":[{"date":2001,"val":4},{"date":2002,"val":5},{"date":2003,"val":6}],"d":[{"date":2001,"val":21},{"date":2002,"val":22},{"date":2003,"val":23}],"c":[{"date":1990,"val":39},{"date":2000,"val":30},{"date":2001,"val":31},{"date":2002,"val":32},{"date":2003,"val":33},{"date":2004,"val":34}]}}"""
]
array_schema = ArrayType(StructType([
StructField('date' ,LongType(),True),
StructField('num' ,LongType(),True),
StructField('val' ,LongType(),True)]),
True)
keys = ['a', 'b', 'c', 'd']
fields = [StructField(k, array_schema, True) for k in keys]
fields.append(StructField('id',LongType(),True))
df_schema = StructType(fields)
dff = spark.read.json(sc.parallelize(jsonstr),df_schema)
I think the true solution is to have consistent names, or at least something more descriptive if the fields are truly different. "num" and "val" are basically synonymous
If I understand the question, you want to reuse the same array schema that has all fields defined
array_schema = ArrayType(StructType([StructField('date' ,LongType(),False),StructField('num' ,LongType(),True),StructField('val' ,LongType(),True))),True)
df_schema = StructType([
StructField('a',array_schema,True)
StructField('b',array_schema,True)
...
StructField('id',LongType(),True)
])
Or you can do this in a loop, which is safe because it's applied in the Spark driver
keys = ['a', 'b']
fields = [StructField(k, array_schema, True) for k in keys]
fields.append(StructField('id',LongType(),True))
df_schema = StructType(fields)
(change each boolean to a False if there will be no nulls)
Then you need to provide this schema to your read function
spark.read.schema(df_schema).json(...
If there will still be more fields that cannot be consistently applied to all "keys", then use ArrayType(MapType(StringType(), LongType()), False)

Extract schema labels from pyspark dataframe

From a pyspark dataframe I want to create a python list with the schema labels for a specific schema "level".
The schema is:
root
|-- DISPLAY: struct (nullable = true)
| |-- 1WO: struct (nullable = true)
| | |-- JPY: struct (nullable = true)
| | | |-- CHANGE24HOUR: string (nullable = true)
| | | |-- CHANGEDAY: string (nullable = true)
| |-- AAVE: struct (nullable = true)
| | |-- JPY: struct (nullable = true)
| | | |-- CHANGE24HOUR: string (nullable = true)
| | | |-- CHANGEDAY: string (nullable = true)
The expected output is:
list = 1WO, AAVE
The following code print everything in the schema:
df.schema.jsonValue()
Is there an easy way to extract those labels pls?
Select the first layer using the asterisk notation, and the n list the columns:
df.select('DISPLAY.*').columns

Spark - Creating Nested DataFrame

I'm starting with PySpark and I'm having troubles with creating DataFrames with nested objects.
This is my example.
I have users.
$ cat user.json
{"id":1,"name":"UserA"}
{"id":2,"name":"UserB"}
Users have orders.
$ cat order.json
{"id":1,"price":202.30,"userid":1}
{"id":2,"price":343.99,"userid":1}
{"id":3,"price":399.99,"userid":2}
And I like to join it to get such a struct where orders are array nested in users.
$ cat join.json
{"id":1, "name":"UserA", "orders":[{"id":1,"price":202.30,"userid":1},{"id":2,"price":343.99,"userid":1}]}
{"id":2,"name":"UserB","orders":[{"id":3,"price":399.99,"userid":2}]}
How can I do that ?
Is there any kind of nested join or something similar ?
>>> user = sqlContext.read.json("user.json")
>>> user.printSchema();
root
|-- id: long (nullable = true)
|-- name: string (nullable = true)
>>> order = sqlContext.read.json("order.json")
>>> order.printSchema();
root
|-- id: long (nullable = true)
|-- price: double (nullable = true)
|-- userid: long (nullable = true)
>>> joined = sqlContext.read.json("join.json")
>>> joined.printSchema();
root
|-- id: long (nullable = true)
|-- name: string (nullable = true)
|-- orders: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: long (nullable = true)
| | |-- price: double (nullable = true)
| | |-- userid: long (nullable = true)
EDIT:
I know there is possibility to do this using join and foldByKey, but is there any simpler way ?
EDIT2:
I'm using solution by #zero323
def joinTable(tableLeft, tableRight, columnLeft, columnRight, columnNested, joinType = "left_outer"):
tmpTable = sqlCtx.createDataFrame(tableRight.rdd.groupBy(lambda r: r.asDict()[columnRight]))
tmpTable = tmpTable.select(tmpTable._1.alias("joinColumn"), tmpTable._2.data.alias(columnNested))
return tableLeft.join(tmpTable, tableLeft[columnLeft] == tmpTable["joinColumn"], joinType).drop("joinColumn")
I add second nested structure 'lines'
>>> lines = sqlContext.read.json(path + "lines.json")
>>> lines.printSchema();
root
|-- id: long (nullable = true)
|-- orderid: long (nullable = true)
|-- product: string (nullable = true)
orders = joinTable(order, lines, "id", "orderid", "lines")
joined = joinTable(user, orders, "id", "userid", "orders")
joined.printSchema()
root
|-- id: long (nullable = true)
|-- name: string (nullable = true)
|-- orders: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: long (nullable = true)
| | |-- price: double (nullable = true)
| | |-- userid: long (nullable = true)
| | |-- lines: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- _1: long (nullable = true)
| | | | |-- _2: long (nullable = true)
| | | | |-- _3: string (nullable = true)
After this column names from lines are lost.
Any ideas ?
EDIT 3:
I tried to manual specify schema.
from pyspark.sql.types import *
fields = []
fields.append(StructField("_1", LongType(), True))
inner = ArrayType(lines.schema)
fields.append(StructField("_2", inner))
new_schema = StructType(fields)
print new_schema
grouped = lines.rdd.groupBy(lambda r: r.orderid)
grouped = grouped.map(lambda x: (x[0], list(x[1])))
g = sqlCtx.createDataFrame(grouped, new_schema)
Error:
TypeError: StructType(List(StructField(id,LongType,true),StructField(orderid,LongType,true),StructField(product,StringType,true))) can not accept object in type <class 'pyspark.sql.types.Row'>
This will work only in Spark 2.0 or later
First we'll need a couple of imports:
from pyspark.sql.functions import struct, collect_list
The rest is a simple aggregation and join:
orders = spark.read.json("/path/to/order.json")
users = spark.read.json("/path/to/user.json")
combined = users.join(
orders
.groupBy("userId")
.agg(collect_list(struct(*orders.columns)).alias("orders"))
.withColumnRenamed("userId", "id"), ["id"])
For the example data the result is:
combined.show(2, False)
+---+-----+---------------------------+
|id |name |orders |
+---+-----+---------------------------+
|1 |UserA|[[1,202.3,1], [2,343.99,1]]|
|2 |UserB|[[3,399.99,2]] |
+---+-----+---------------------------+
with schema:
combined.printSchema()
root
|-- id: long (nullable = true)
|-- name: string (nullable = true)
|-- orders: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: long (nullable = true)
| | |-- price: double (nullable = true)
| | |-- userid: long (nullable = true)
and JSON representation:
for x in combined.toJSON().collect():
print(x)
{"id":1,"name":"UserA","orders":[{"id":1,"price":202.3,"userid":1},{"id":2,"price":343.99,"userid":1}]}
{"id":2,"name":"UserB","orders":[{"id":3,"price":399.99,"userid":2}]}
First, you need to use the userid as the join key for the second DataFrame:
user.join(order, user.id == order.userid)
Then you can use a map step to transform the resulting records to your desired format.
For flatining your data frame from nested to normal use
dff= df.select("column with multiple columns.*")

Categories