In Spark, literal columns added with lit are not nullable:
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,)], ['c1'])
df = df.withColumn('c2', F.lit('a'))
df.printSchema()
# root
# |-- c1: long (nullable = true)
# |-- c2: string (nullable = false)
How to create a nullable column?
The shortest method I've found is using when (the otherwise clause doesn't seem to be needed):
df = df.withColumn('c2', F.when(F.lit(True), F.lit('a')))
In Scala: .withColumn("c2", when(lit(true), lit("a")))
Full test result:
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,)], ['c1'])
df = df.withColumn('c2', F.when(F.lit(True), F.lit('a')))
df.show()
# +---+---+
# | c1| c2|
# +---+---+
# | 1| a|
# +---+---+
df.printSchema()
# root
# |-- c1: long (nullable = true)
# |-- c2: string (nullable = true)
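Another route that appears to produce a nullable column (a sketch I'm adding, not part of the answer above): wrap the literal in a trivial Python UDF, since Python UDF results are treated as nullable by default. It is heavier than when, but reads more explicitly:
from pyspark.sql import functions as F
# hypothetical helper: a zero-argument UDF that just returns the literal value
nullable_lit = F.udf(lambda: 'a', 'string')
df = df.withColumn('c3', nullable_lit())
df.printSchema()
# expected: c3 shows up as string (nullable = true)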
Related
This function returns an array of int:
from pyspark.sql import functions as F
import pandas as pd
@F.pandas_udf('array<int>')
def pudf(x: pd.Series, y: pd.Series) -> pd.Series:
return pd.Series([[x, y]])
df = spark.createDataFrame([(5, 2), (6, 7)])
df = df.withColumn('out', pudf('_1', '_2'))
df.show()
# +---+---+------+
# | _1| _2| out|
# +---+---+------+
# | 5| 2|[5, 2]|
# | 6| 7|[6, 7]|
# +---+---+------+
df.printSchema()
# root
# |-- _1: long (nullable = true)
# |-- _2: long (nullable = true)
# |-- out: array (nullable = true)
# | |-- element: integer (containsNull = true)
Question: how to return an array of strings?
If I change int to string and the df elements to strings, it fails to return the expected array of strings.
from pyspark.sql import functions as F
import pandas as pd
@F.pandas_udf('array<string>')
def pudf(x: pd.Series, y: pd.Series) -> pd.Series:
return pd.Series([[x, y]])
df = spark.createDataFrame([('5', '2'), ('6', '7')])
df = df.withColumn('out', pudf('_1', '_2'))
df.show()
PythonException:
An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
File "pyarrow/array.pxi", line 913, in pyarrow.lib.Array.from_pandas
File "pyarrow/array.pxi", line 311, in pyarrow.lib.array
File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
File "pyarrow/error.pxi", line 122, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Expected bytes, got a 'Series' object
from pyspark.sql import functions as F
import pandas as pd
@F.pandas_udf('array<string>')
def pudf(x: pd.Series, y: pd.Series) -> pd.Series:
return pd.Series([[x[0],y[0]]])
df = spark.createDataFrame([('5', '2'), ('6', '7')])
df = df.withColumn('out', pudf('_1', '_2'))
df.show(truncate=False)
df.printSchema()
# +---+---+------+
# |_1 |_2 |out |
# +---+---+------+
# |5 |2 |[5, 2]|
# |6 |7 |[6, 7]|
# +---+---+------+
# root
# |-- _1: string (nullable = true)
# |-- _2: string (nullable = true)
# |-- out: array (nullable = true)
# | |-- element: string (containsNull = true)
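For robustness, here is a variant I'd sketch (an addition, not part of the original answer): build one output list per input row by zipping the two Series, so the returned Series always matches the batch length however Arrow batches the rows.
from pyspark.sql import functions as F
import pandas as pd
@F.pandas_udf('array<string>')
def pudf(x: pd.Series, y: pd.Series) -> pd.Series:
    # one [x, y] pair per input row; works for any batch size
    return pd.Series([[a, b] for a, b in zip(x, y)])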
I have data from a csv file, and I use it in a Jupyter notebook with PySpark. I have many columns and all of them have the string data type. I know how to change the data types manually, but is there any way to do it automatically?
You can use the inferSchema option when you load your csv file to let Spark try to infer the schema. With the following example csv file, you get two different schemas depending on whether you set inferSchema to true or not:
seq,date
1,13/10/1942
2,12/02/2013
3,01/02/1959
4,06/04/1939
5,23/10/2053
6,13/03/2059
7,10/12/1983
8,28/10/1952
9,07/04/2033
10,29/11/2035
Example code:
df = (spark.read
.format("csv")
.option("header", "true")
.option("inferSchema", "false") # default option
.load(path))
df.printSchema()
df2 = (spark.read
.format("csv")
.option("header", "true")
.option("inferSchema", "true")
.load(path))
df2.printSchema()
Output:
root
|-- seq: string (nullable = true)
|-- date: string (nullable = true)
root
|-- seq: integer (nullable = true)
|-- date: string (nullable = true)
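Note that the date column stays a string even with inferSchema. If you also want it as a proper date, a sketch (assuming the dd/MM/yyyy format shown in the sample) is to cast it explicitly after reading:
from pyspark.sql import functions as F
df3 = df2.withColumn("date", F.to_date("date", "dd/MM/yyyy"))
df3.printSchema()
# root
#  |-- seq: integer (nullable = true)
#  |-- date: date (nullable = true)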
Alternatively, you can define the schema yourself and pass it explicitly; it is shown here with createDataFrame, and the same schema can be passed when reading the file (see the sketch after the output):
from pyspark.sql import functions as F
from pyspark.sql.types import *
data2 = [("James","","Smith","36636","M",3000),
("Michael","Rose","","40288","M",4000),
("Robert","","Williams","42114","M",4000),
("Maria","Anne","Jones","39192","F",4000),
("Jen","Mary","Brown","","F",-1)
]
schema = StructType([ \
StructField("firstname",StringType(),True), \
StructField("middlename",StringType(),True), \
StructField("lastname",StringType(),True), \
StructField("id", StringType(), True), \
StructField("gender", StringType(), True), \
StructField("salary", IntegerType(), True) \
])
df = spark.createDataFrame(data=data2,schema=schema)
df.show()
df.printSchema()
+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname| id|gender|salary|
+---------+----------+--------+-----+------+------+
| James| | Smith|36636| M| 3000|
| Michael| Rose| |40288| M| 4000|
| Robert| |Williams|42114| M| 4000|
| Maria| Anne| Jones|39192| F| 4000|
| Jen| Mary| Brown| | F| -1|
+---------+----------+--------+-----+------+------+
root
|-- firstname: string (nullable = true)
|-- middlename: string (nullable = true)
|-- lastname: string (nullable = true)
|-- id: string (nullable = true)
|-- gender: string (nullable = true)
|-- salary: integer (nullable = true)
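The same StructType can be passed to the CSV reader so that nothing needs to be inferred (a sketch; the path and option values are assumptions):
df = (spark.read
      .format("csv")
      .option("header", "true")
      .schema(schema)  # the StructType defined above
      .load(path))
df.printSchema()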
I have a JSON file whose structure is shown below. The structure changes every time, so how can I flatten any kind of JSON file in PySpark? Can you help me with this?
root
|-- student: struct (nullable = true)
|-- children: struct (nullable = true)
|-- parent: struct (nullable = true)
| |-- id: string (nullable = true)
| |-- type: string (nullable = true)
| |-- date: string (nullable = true)
|-- multipliers: array (nullable = true)
| |-- element: double (containsNull = true)
|-- spawn_time: string (nullable = true)
|-- type: array (nullable = true)
| |-- element: string (containsNull = true)
This approach uses a recursive function to determine the columns to select, by building a flat list of fully-named prefixes in the prefix accumulator parameter.
Note that it will work on any format that supports nesting, not just JSON (Parquet, Avro, etc.).
Furthermore, the input can have any schema, but this example uses:
{"c1": {"c3": 4, "c4": 12}, "c2": "w1"}
{"c1": {"c3": 5, "c4": 34}, "c2": "w2"}
The original df shows as:
+-------+---+
| c1| c2|
+-------+---+
|[4, 12]| w1|
|[5, 34]| w2|
+-------+---+
The code:
from pyspark.sql.types import StructType
from pyspark.sql.functions import col
# return a list of all (possibly nested) fields to select, within a given schema
def flatten(schema, prefix: str = ""):
# return a list of sub-items to select, within a given field
def field_items(field):
name = f'{prefix}.{field.name}' if prefix else field.name
if type(field.dataType) == StructType:
return flatten(field.dataType, name)
else:
return [col(name)]
return [item for field in schema.fields for item in field_items(field)]
df = spark.read.json(path)
print('===== df =====')
df.printSchema()
flattened = flatten(df.schema)
print('flattened =', flatten(df.schema))
print('===== df2 =====')
df2 = df.select(*flattened)
df2.printSchema()
df2.show()
As you will see in the output, the flatten function returns a flat list of columns, each one fully named (using the "parent_col.child_col" naming format).
Output:
===== df =====
root
|-- c1: struct (nullable = true)
| |-- c3: long (nullable = true)
| |-- c4: long (nullable = true)
|-- c2: string (nullable = true)
flattened = [Column<b'c1.c3'>, Column<b'c1.c4'>, Column<b'c2'>]
===== df2 =====
root
|-- c3: long (nullable = true)
|-- c4: long (nullable = true)
|-- c2: string (nullable = true)
+---+---+---+
| c3| c4| c2|
+---+---+---+
| 4| 12| w1|
| 5| 34| w2|
+---+---+---+
I have the following initial PySpark DataFrame:
+----------+--------------------------------+
|product_PK|                        products|
+----------+--------------------------------+
|       686|          [[686,520.70],[645,2]]|
|       685|[[685,45.556],[678,23],[655,21]]|
|       693|                              []|
+----------+--------------------------------+
df = sqlCtx.createDataFrame(
[(686, [[686,520.70], [645,2]]), (685, [[685,45.556], [678,23],[655,21]]), (693, [])],
["product_PK", "products"]
)
The column products contains nested data. I need to extract the second value in each pair of values. I am running this code:
temp_dataframe = df.withColumn("exploded", explode(col("products"))).withColumn("score", col("exploded").getItem("_2"))
It works well with this particular DataFrame. However, I want to put this code into a function and run it on different DataFrames. All of my DataFrames have the same structure. The only difference is that the sub-column "_2" might be named differently in some DataFrames, e.g. "col1" or "col2".
For example:
DataFrame content
root
|-- product_PK: long (nullable = true)
|-- products: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _1: long (nullable = true)
| | |-- _2: double (nullable = true)
|-- exploded: struct (nullable = true)
| |-- _1: long (nullable = true)
| |-- _2: double (nullable = true)
DataFrame content
root
|-- product_PK: long (nullable = true)
|-- products: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- product_PK: long (nullable = true)
| | |-- col2: integer (nullable = true)
|-- exploded: struct (nullable = true)
| |-- product_PK: long (nullable = true)
| |-- col2: integer (nullable = true)
I tried to use index like getItem(1), but it says that the name of a column must be provided.
Is there any way to avoid specifying the column name or somehow generalize this part of a code?
My goal is that exploded contains the second value of each pair in the nested data, i.e. _2 or col1 or col2.
It sounds like you were on the right track. I think the way to accomplish this is to read the schema to determine the name of the field you want to explode on. Instead of schema.names, though, you need to use schema.fields to find the struct field, and then use its properties to figure out the fields in the struct. Here is an example:
from pyspark.sql.functions import *
from pyspark.sql.types import *
# Setup the test dataframe
data = [
(686, [(686, 520.70), (645, 2.)]),
(685, [(685, 45.556), (678, 23.), (655, 21.)]),
(693, [])
]
schema = StructType([
StructField("product_PK", StringType()),
StructField("products",
ArrayType(StructType([
StructField("_1", IntegerType()),
StructField("col2", FloatType())
]))
)
])
df = sqlCtx.createDataFrame(data, schema)
# Find the products field in the schema, then find the name of the 2nd field
productsField = next(f for f in df.schema.fields if f.name == 'products')
target_field = productsField.dataType.elementType.names[1]
# Do your explode using the field name
temp_dataframe = df.withColumn("exploded" , explode(col("products"))).withColumn("score", col("exploded").getItem(target_field))
Now, if you examine the result you get this:
>>> temp_dataframe.printSchema()
root
|-- product_PK: string (nullable = true)
|-- products: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _1: integer (nullable = true)
| | |-- col2: float (nullable = true)
|-- exploded: struct (nullable = true)
| |-- _1: integer (nullable = true)
| |-- col2: float (nullable = true)
|-- score: float (nullable = true)
Is that what you want?
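To address the "put this into a function" part of the question, here is a sketch of a reusable helper (the name and the assumption that the array elements are structs with at least two fields are mine):
from pyspark.sql import functions as F
def explode_second_field(df, array_col="products"):
    # look up the array column in the schema, then take the name of the
    # second field of its struct elements, whatever it happens to be called
    array_field = next(f for f in df.schema.fields if f.name == array_col)
    second_name = array_field.dataType.elementType.names[1]
    return (df.withColumn("exploded", F.explode(F.col(array_col)))
              .withColumn("score", F.col("exploded").getItem(second_name)))
temp_dataframe = explode_second_field(df)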
>>> df.show(10, False)
+----------+-----------------------------------------------------------------------+
|product_PK|products |
+----------+-----------------------------------------------------------------------+
|686 |[WrappedArray(686, null), WrappedArray(645, 2)] |
|685 |[WrappedArray(685, null), WrappedArray(678, 23), WrappedArray(655, 21)]|
|693 |[] |
+----------+-----------------------------------------------------------------------+
>>> import pyspark.sql.functions as F
>>> df.withColumn("exploded", F.explode("products")) \
... .withColumn("exploded", F.col("exploded").getItem(1)) \
... .show(10,False)
+----------+-----------------------------------------------------------------------+--------+
|product_PK|products |exploded|
+----------+-----------------------------------------------------------------------+--------+
|686 |[WrappedArray(686, null), WrappedArray(645, 2)] |null |
|686 |[WrappedArray(686, null), WrappedArray(645, 2)] |2 |
|685 |[WrappedArray(685, null), WrappedArray(678, 23), WrappedArray(655, 21)]|null |
|685 |[WrappedArray(685, null), WrappedArray(678, 23), WrappedArray(655, 21)]|23 |
|685 |[WrappedArray(685, null), WrappedArray(678, 23), WrappedArray(655, 21)]|21 |
+----------+-----------------------------------------------------------------------+--------+
Given that your exploded column is a struct such as
|-- exploded: struct (nullable = true)
| |-- _1: integer (nullable = true)
| |-- col2: float (nullable = true)
You can use the following logic to get the second element without knowing its name:
from pyspark.sql import functions as F
temp_dataframe = df.withColumn("exploded" , F.explode(F.col("products")))
temp_dataframe.withColumn("score", F.col("exploded."+temp_dataframe.select(F.col("exploded.*")).columns[1]))
You should get the following output:
+----------+--------------------------------------+------------+------+
|product_PK|products |exploded |score |
+----------+--------------------------------------+------------+------+
|686 |[[686,520.7], [645,2.0]] |[686,520.7] |520.7 |
|686 |[[686,520.7], [645,2.0]] |[645,2.0] |2.0 |
|685 |[[685,45.556], [678,23.0], [655,21.0]]|[685,45.556]|45.556|
|685 |[[685,45.556], [678,23.0], [655,21.0]]|[678,23.0] |23.0 |
|685 |[[685,45.556], [678,23.0], [655,21.0]]|[655,21.0] |21.0 |
+----------+--------------------------------------+------------+------+
I have a JSON object that has an unfortunate combination of nesting and arrays, so it's not totally obvious how to query it with Spark SQL.
Here is a sample object:
{
stuff: [
{a:1,b:2,c:3}
]
}
So, in JavaScript, to get the value for c, I'd write myData.stuff[0].c.
And in my Spark SQL query, if that array wasn't there, I'd be able to use dot notation:
SELECT stuff.c FROM blah
but I can't, because the innermost object is wrapped in an array.
I've tried:
SELECT stuff.0.c FROM blah // FAIL
SELECT stuff.[0].c FROM blah // FAIL
So, what is the magical way to select that data? Or is that even supported yet?
It is not clear what you mean by JSON object, so let's consider two different cases:
An array of structs
import tempfile
path = tempfile.mktemp()
with open(path, "w") as fw:
fw.write('''{"stuff": [{"a": 1, "b": 2, "c": 3}]}''')
df = sqlContext.read.json(path)
df.registerTempTable("df")
df.printSchema()
## root
## |-- stuff: array (nullable = true)
## | |-- element: struct (containsNull = true)
## | | |-- a: long (nullable = true)
## | | |-- b: long (nullable = true)
## | | |-- c: long (nullable = true)
sqlContext.sql("SELECT stuff[0].a FROM df").show()
## +---+
## |_c0|
## +---+
## | 1|
## +---+
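A DataFrame-API equivalent of that SQL query (a sketch I'm adding, assuming the same df with the array-of-structs schema):
from pyspark.sql import functions as F
# stuff[0].c via getItem (0-based array index) and getField
df.select(F.col("stuff").getItem(0).getField("c").alias("c")).show()
## +---+
## |  c|
## +---+
## |  3|
## +---+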
An array of maps
# Note: schema inference from dictionaries has been deprecated
# don't use this in practice
df = sc.parallelize([{"stuff": [{"a": 1, "b": 2, "c": 3}]}]).toDF()
df.registerTempTable("df")
df.printSchema()
## root
## |-- stuff: array (nullable = true)
## | |-- element: map (containsNull = true)
## | | |-- key: string
## | | |-- value: long (valueContainsNull = true)
sqlContext.sql("SELECT stuff[0]['a'] FROM df").show()
## +---+
## |_c0|
## +---+
## | 1|
## +---+
See also Querying Spark SQL DataFrame with complex types