Programmatically select columns from a dataframe with a UDF - Python

I am new to pyspark.
I am trying to extract columns of a dataframe using a config file, where one of the entries is a UDF call.
If I define the select columns as a list in the client code it works, but if I import the list from a config file every entry in the column list is a string.
Is there an alternative way?
Opening the Spark shell using pyspark:
*******************************************************************
version 2.2.0
Using Python version 2.7.16 (default, Mar 18 2019 18:38:44)
SparkSession available as 'spark'
*******************************************************************
jsonDF = spark.read.json("/tmp/people.json")
jsonDF.show()
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
jsonDF.printSchema()
root
|-- age: long (nullable = true)
|-- name: string (nullable = true)
jsonCurDF = jsonDF.filter(jsonDF.age.isNotNull()).cache()
# Define the UDF
from pyspark.sql.functions import udf
#udf("long")
def squared_udf(s):
return s * s
# Selecting the columns from a list.
colSelList = ['age', 'name', squared_udf('age')]
jsonCurDF.select(colSelList).show()
+---+------+----------------+
|age| name|squared_udf(age)|
+---+------+----------------+
| 30| Andy| 900|
| 19|Justin| 361|
+---+------+----------------+
# If I use an external config file
colSelListStr = ["age", "name" , "squared_udf('age')"]
jsonCurDF.select(colSelListStr).show()
The above command fails with "cannot resolve '`squared_udf('age')`'".
I tried registering the function, and also tried selectExpr and the column function.
In colSelList the UDF call is translated to a Column type.
print colSelList[2]
Column<squared_udf(age)>
print colSelListStr[2]
squared_udf('age')
print column(colSelListStr[2])
Column<squared_udf('age')>
What am I doing wrong here? or is there an alternate solution?

It's because the squared_udf('age') entry is treated as a string, not a function call, when it comes from the config file.
There is a roundabout way to do this, and you don't need the UDF for it.
Assume this is the list of columns you need to select:
col_list = ['age', 'name', 'squared_age']
Directly passing this list to select will result in an error, because squared_age is not a column of the data frame.
So first take all the columns of the existing df into a list:
existing_cols = df.columns
Then take the intersection of the existing columns and the columns you need, which gives you the columns the two lists have in common:
intersection = list(set(existing_cols) & set(col_list))
Now build the result like this:
newDF = df.select(intersection).rdd.map(lambda x: (x["age"], x["name"], x["age"] * x["age"])).toDF(col_list)
This gives you a dataframe with the age, name and squared_age columns.
Hope this helps.
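If registering the UDF is still an option, here is a minimal sketch of that route (assuming the config file can hold plain SQL expression strings; note the column is referenced as age without quotes, since 'age' inside the expression would be the literal string rather than the column):
from pyspark.sql.functions import expr
from pyspark.sql.types import LongType
# Register a plain Python function under a SQL-visible name.
spark.udf.register("squared_udf", lambda s: s * s, LongType())
# Each config entry is now just a SQL expression string; expr() parses it into a Column.
colSelListStr = ["age", "name", "squared_udf(age)"]
jsonCurDF.select([expr(c) for c in colSelListStr]).show()
# Equivalent shortcut: selectExpr accepts the expression strings directly.
jsonCurDF.selectExpr(*colSelListStr).show()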

Related

pyspark extracting a string using python

I have a Spark dataframe with a column emailID, e.g. ram.shyam.78uy@testing.com. I would like to extract the string between "." and "@", i.e. 78uy, and store it in a new column.
I tried:
split_for_alias = split(rs_csv['emailID'], '[.]')
rs_csv_alias = rs_csv.withColumn('alias', split_for_alias.getItem(size(split_for_alias) - 2))
It adds 78uy@testing as the alias. Another column could be added and the extra values chopped off, but is it possible to do this in a single statement?
Extract the alphanumeric token immediately preceded by the special character "." and immediately followed by the special character "@".
DataFrame
data = [
    (1, "am.shyam.78uy@testing.com"),
    (2, "j.k.kilo@jom.com")
]
df = spark.createDataFrame(data, ("id", 'emailID'))
df.show()
+---+--------------------+
| id| emailID|
+---+--------------------+
| 1|am.shyam.78uy@tes...|
| 2| j.k.kilo@jom.com|
+---+--------------------+
Code
from pyspark.sql.functions import regexp_extract
df.withColumn('name', regexp_extract('emailID', r'(?<=\.)(\w+)(?=@)', 1)).show()
Outcome
+---+--------------------+----+
| id| emailID|name|
+---+--------------------+----+
| 1|am.shyam.78uy@tes...|78uy|
| 2| j.k.kilo@jom.com|kilo|
+---+--------------------+----+
We made the Fugue project to port native Python or Pandas code to Spark or Dask. This lets you keep the logic very readable by expressing it in native Python. Fugue can then port it to Spark for you with one function call.
First we setup a Pandas DataFrame to test:
import pandas as pd
df = pd.DataFrame({"id":[1,2],"email": ["am.shyam.78uy@testing.com", "j.k.kilo@jom.com"]})
Next, we make a native Python function. The logic is clear this way.
from typing import List, Dict, Any
def extract(df:List[Dict[str,Any]]) -> List[Dict[str,Any]]:
    for row in df:
        email = row["email"].split("@")[0].split(".")[-1]
        row["new_col"] = email
    return df
Then we can test on the Pandas engine:
from fugue import transform
transform(df, extract, schema="*, new_col:str")
Because it works, we can bring it to Spark by supplying an engine:
import fugue_spark
transform(df, extract, schema="*, new_col:str", engine="spark").show()
+---+--------------------+-------+
| id| email|new_col|
+---+--------------------+-------+
| 1|am.shyam.78uy@tes...| 78uy|
| 2| j.k.kilo@jom.com| kilo|
+---+--------------------+-------+
Note .show() is needed because Spark evaluates lazily. This transform can take in both Pandas and Spark DataFrames and will output a Spark DataFrame if using the Spark engine.
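For reference, the same extraction can also be written directly in PySpark. This is a small sketch, assuming a Spark DataFrame with the emailID column from the question; element_at needs Spark 2.4+:
from pyspark.sql import functions as F
# Keep everything before '@', split the remainder on '.', and take the last token.
local_part = F.split(F.col("emailID"), "@").getItem(0)
df = df.withColumn("alias", F.element_at(F.split(local_part, r"\."), -1))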

Get last / delimited value from Dataframe column in PySpark

I am trying to get the last string after '/'.
The column can look like this: "lala/mae.da/rg1/zzzzz" (not necessarily only 3 slashes), and I'd like to return: zzzzz
In SQL and Python it's very easy, but I would like to know if there is a way to do it in PySpark.
Solving it in Python:
original_string = "lala/mae.da/rg1/zzzzz"
last_char_index = original_string.rfind("/")
new_string = original_string[last_char_index+1:]
or directly:
new_string = original_string.rsplit('/', 1)[1]
And in SQL:
RIGHT(MyColumn, CHARINDEX('/', REVERSE(MyColumn))-1)
For PySpark I was thinking something like this:
df = df.select(col("MyColumn").rsplit('/', 1)[1])
but I get the following error: TypeError: 'Column' object is not callable and I am not even sure Spark allows me to do rsplit at all.
Do you have any suggestion on how can I solve this?
Adding another solution even though @Pav3k's answer is great: element_at, which gets the item at a specific position out of an array:
from pyspark.sql import functions as F
df = df.withColumn('my_col_split', F.split(df['MyColumn'], '/'))\
.select('MyColumn',F.element_at(F.col('my_col_split'), -1).alias('rsplit')
)
>>> df.show(truncate=False)
+---------------------+------+
|MyColumn |rsplit|
+---------------------+------+
|lala/mae.da/rg1/zzzzz|zzzzz |
|fefe |fefe |
|fe/fe/frs/fs/fe32/4 |4 |
+---------------------+------+
Pav3k's DF used.
import pandas as pd
from pyspark.sql import functions as F
df = pd.DataFrame({"MyColumn": ["lala/mae.da/rg1/zzzzz", "fefe", "fe/fe/frs/fs/fe32/4"]})
df = spark.createDataFrame(df)
df.show(truncate=False)
# output
+---------------------+
|MyColumn |
+---------------------+
|lala/mae.da/rg1/zzzzz|
|fefe |
|fe/fe/frs/fs/fe32/4 |
+---------------------+
(
    df
    .withColumn("NewCol", F.split("MyColumn", "/"))
    .withColumn("NewCol", F.col("NewCol")[F.size("NewCol") - 1])
    .show()
)
# output
+--------------------+------+
| MyColumn|NewCol|
+--------------------+------+
|lala/mae.da/rg1/z...| zzzzz|
| fefe| fefe|
| fe/fe/frs/fs/fe32/4| 4|
+--------------------+------+
Since Spark 2.4, you can use the split built-in function to split your string, then the element_at built-in function to get the last element of the resulting array, as follows:
from pyspark.sql import functions as F
df = df.select(F.element_at(F.split(F.col("MyColumn"), '/'), -1))
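If you would rather not build the intermediate array at all, substring_index does the same thing in one call. This is a small sketch; substring_index has been in pyspark.sql.functions since Spark 1.5, and a negative count keeps everything to the right of the last delimiter:
from pyspark.sql import functions as F
# Everything after the last '/'; rows without a '/' are returned unchanged.
df = df.select(F.substring_index(F.col("MyColumn"), "/", -1).alias("last_part"))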

Use parquet file with special characters in column names in PySpark

MAIN GOAL
Show or select columns from the Spark dataframe read from the parquet file.
None of the solutions mentioned in the forum have been successful in our case.
PROBLEM
The issue happens when the parquet file is read and queried with Spark, and it is due to the presence of the special characters ,;{}()\n\t= within the column names. The problem was reproduced with a simple parquet file with two columns and five rows. The names of the columns are:
SpeedReference_Final_01 (RifVel_G0)
SpeedReference_Final_02 (RifVel_G1)
The error raised is:
Attribute name "SpeedReference_Final_01 (RifVel_G0)" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.
We are using PySpark in Python, and the solutions we have tried can be categorized as follows:
Solutions based on column rename - [spark.read.parquet + rename of the obtained dataframe]
Several approaches have been tried:
withColumnRenamed (Issue N.2 within the script)
toDF (Issue N.3)
alias (Issue N.5)
None of them works in our case.
Read the parquet file into a Pandas dataframe and then create a new Spark dataframe from it - [pd.read_parquet + spark.createDataFrame]
This solution works with a small parquet file (Issue N.0, i.e. the WORKAROUND within the script): the created Spark dataframe can be successfully queried even if it has column names containing special characters. Unfortunately it is impracticable with our big parquet files (600000 rows x 1000 columns for each parquet), since creating the Spark dataframe takes far too long.
An attempt to read the parquet file into a Spark dataframe and create a new Spark dataframe from its rdd and a renamed schema is not practicable either, since extracting the rdd from the Spark dataframe raises the same error (Issue N.4).
Read the parquet file with a predefined schema (that avoids the special characters) - [spark.read.schema(...).parquet]
This does not work either: the data in the critical columns becomes null/None, as expected, since the renamed columns are not present in the original file.
The solutions mentioned above are summarized in the Python code below and have been tried against the example parquet file.
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import col
import pandas as pd
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
# Select file
filename = 'D:/Simple.parquet'
issue_num = 0 # Workaround to issues (Equivalent to no issue)
#issue_num = 1 # Issue 1 - Unable to show dataframe or select column with name containing invalid character(s)
#issue_num = 2 # Issue 2 - Unable to show dataframe or select column after rename (using withColumnRenamed)
#issue_num = 3 # Issue 3 - Unable to show dataframe or select column after rename (using toDF)
#issue_num = 4 # Issue 4 - Unable to extract rdd from renamed dataframe
#issue_num = 5 # Issue 5 - Unable to select column with alias
if issue_num == 0:
    ################################################################################################
    # WORKAROUND - Create Spark data frame from Pandas dataframe
    df_pd = pd.read_parquet(filename)
    DF = spark.createDataFrame(df_pd)
    print('WORKAROUND')
    DF.show()
    # +-----------------------------------+-----------------------------------+
    # |SpeedReference_Final_01 (RifVel_G0)|SpeedReference_Final_02 (RifVel_G1)|
    # +-----------------------------------+-----------------------------------+
    # |                  553.5228271484375|                     720.3720703125|
    # |                  553.5228271484375|                     720.3720703125|
    # |                  553.5228271484375|                     720.3720703125|
    # |                  553.5228271484375|                     720.3720703125|
    # |                  553.5228271484375|                     720.3720703125|
    # +-----------------------------------+-----------------------------------+

    ################################################################################################
    # Correct management of columns with invalid characters when using spark.createDataFrame
    # spark.createDataFrame: Create a dataframe with two columns with invalid characters - OK
    # DFCREATED
    schema = StructType(
        [
            StructField("SpeedReference_Final_01 (RifVel_G0)", FloatType(), nullable=True),
            StructField("SpeedReference_Final_02 (RifVel_G1)", FloatType(), nullable=True)
        ]
    )
    row_in = [(553.523, 720.372), (553.523, 720.372), (553.523, 720.372), (553.523, 720.372), (553.523, 720.372)]
    rdd = spark.sparkContext.parallelize(row_in)
    DFCREATED = spark.createDataFrame(rdd, schema)
    DFCREATED.show()
    # +-----------------------------------+-----------------------------------+
    # |SpeedReference_Final_01 (RifVel_G0)|SpeedReference_Final_02 (RifVel_G1)|
    # +-----------------------------------+-----------------------------------+
    # |                            553.523|                            720.372|
    # |                            553.523|                            720.372|
    # |                            553.523|                            720.372|
    # |                            553.523|                            720.372|
    # |                            553.523|                            720.372|
    # +-----------------------------------+-----------------------------------+

    DF_SEL_VAR_CREATED = DFCREATED.select(DFCREATED.columns[0]).take(2)
    for el in DF_SEL_VAR_CREATED:
        print(el)
    # Row(SpeedReference_Final_01 (RifVel_G0)=553.5230102539062)
    # Row(SpeedReference_Final_01 (RifVel_G0)=553.5230102539062)
else:
    # spark.read: read file into dataframe - OK
    DF = spark.read.parquet(filename)
    print('ORIGINAL SCHEMA')
    DF.printSchema()
    # root
    #  |-- SpeedReference_Final_01 (RifVel_G0): float (nullable = true)
    #  |-- SpeedReference_Final_02 (RifVel_G1): float (nullable = true)

    if issue_num == 1:
        ###############################################################################################
        # Issue 1 - Unable to show dataframe or select column with name containing invalid character(s)
        DF.show()
        # DF.select(DF.columns[0]).show()
        # DF_SEL_VAR = DF.select(DF.columns[0]).take(3)
        # ECC: Attribute name "SpeedReference_Final_01 (RifVel_G0)" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.
        # on all 3 previous statements
    elif issue_num == 2:
        ###############################################################################################
        # Issue 2 - Unable to show dataframe or select column after rename (using withColumnRenamed)
        DFRENAMED = DF.withColumnRenamed('SpeedReference_Final_01 (RifVel_G0)', 'RifVelG0').withColumnRenamed('SpeedReference_Final_02 (RifVel_G1)', 'RifVelG1')
        print('RENAMED SCHEMA')
        DFRENAMED.printSchema()
        # root
        #  |-- RifVelG0: float (nullable = true)
        #  |-- RifVelG1: float (nullable = true)
        DFRENAMED.show()
        # DF_SEL_VAR_RENAMED = DFRENAMED.select(DFRENAMED.RifVelG0).take(2)
        # ECC: Attribute name "SpeedReference_Final_01 (RifVel_G0)" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.
        # on all 2 previous statements
    elif issue_num == 3:
        ###############################################################################################
        # Issue 3 - Unable to show dataframe or select column after rename (using toDF)
        DFRENAMED = DF.toDF('RifVelG0', 'RifVelG1')
        print('RENAMED SCHEMA')
        DFRENAMED.printSchema()
        # root
        #  |-- RifVelG0: float (nullable = true)
        #  |-- RifVelG1: float (nullable = true)
        DFRENAMED.show()
        # DF_SEL_VAR_RENAMED = DFRENAMED.select(DFRENAMED.RifVelG0).take(2)
        # ECC: Attribute name "SpeedReference_Final_01 (RifVel_G0)" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.
        # on all 2 previous statements
    elif issue_num == 4:
        ###############################################################################################
        # Issue 4 - Unable to extract rdd from renamed dataframe
        DFRENAMED = DF.withColumnRenamed('SpeedReference_Final_01 (RifVel_G0)', 'RifVelG0').withColumnRenamed('SpeedReference_Final_02 (RifVel_G1)', 'RifVelG1')
        DFRENAMED_rdd = DFRENAMED.rdd
        # ECC: Attribute name "SpeedReference_Final_01 (RifVel_G0)" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.
    elif issue_num == 5:
        ###############################################################################################
        # Issue 5 - Unable to select column with alias
        DF_SEL_VAR = DF.select(col(DF.columns[0]).alias('RifVelG0')).take(3)
        # ECC: Attribute name "SpeedReference_Final_01 (RifVel_G0)" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.
Do you have any idea how we can solve this problem?
Any suggestion is really appreciated.
Try something like this:
import re
import pyspark.sql.functions as f

def remove_special_characters(string: str):
    return re.sub("[^a-zA-Z0-9 ]", "", string)

DFCREATED = DFCREATED.select(
    [
        f.col(column).alias(remove_special_characters(column))
        for column in DFCREATED.columns
    ]
)
I also think you can adapt this function to remove other things, like spaces.
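If going through Pandas is too slow for the big files, another workaround worth trying is to rename the columns at the Arrow level and write a sanitized copy that Spark can then read directly. This is only a sketch: it assumes pyarrow is installed, the output path is made up for illustration, and the whole file still has to fit in memory once (but it skips the slow Arrow-to-Pandas-to-Spark conversion):
import re
import pyarrow.parquet as pq

def sanitize(name):
    # Strip the characters Spark rejects in parquet attribute names.
    return re.sub(r"[ ,;{}()\n\t=]", "", name)

table = pq.read_table(filename)
table = table.rename_columns([sanitize(c) for c in table.column_names])
pq.write_table(table, 'D:/Simple_sanitized.parquet')  # hypothetical output path
DF = spark.read.parquet('D:/Simple_sanitized.parquet')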

md5 is not working on complex data types in pyspark

I am trying to calculate a hash over the entire row using the md5 function in pyspark. The dataframe has several columns with complex data types, for example:
col: array (nullable = true)
 |    |-- element: struct (containsNull = true)
col: array (nullable = true)
 |    |-- element: array (containsNull = true)
When I try to calculate md5 on the entire row, it throws an error with the message below:
'`col`' is of array<array<string>> type. argument 28 requires (array<string> or string) type, however, '`col`' is of array<array<string>> type
Code to calculate md5:
from pyspark.sql.functions import concat_ws, md5

def prepare_data_md5(data):
    """ Prepare the data with md5 column.
    :param data: input DataFrame object
    :return: output DataFrame object
    """
    return data.withColumn("hash", md5(concat_ws(*data.columns)))
1. Is there some other hash function I could use that works for complex data types too?
2. Is there a library available in pyspark or python for flattening complex data types, so that I could calculate md5 over the flattened data frame?
I don't think there is a built-in function that calculates a hash for complex types.
If you have array and string columns, use concat_ws and array_join to convert the complex types to strings, then apply md5.
Example:
df.show()
#+---+------+
#| id| arr|
#+---+------+
#| a|[1, 2]|
#| b|[3, 4]|
#+---+------+
from pyspark.sql.functions import *
df.withColumn("tmp",concat_ws(",",col("arr"))).\
withColumn("new",md5(concat_ws(",",col("id"),array_join(col("arr"),",")))).\
drop("tmp").\
show(10,False)
#+---+------+--------------------------------+
#|id |arr |new |
#+---+------+--------------------------------+
#|a |[1, 2]|9f357697a277b1e5a8315035e7d95984|
#|b |[3, 4]|578bec981ad992ddb641a45969babab1|
#+---+------+--------------------------------+
#dynamic way
df1=df.withColumn("arr",array_join(col("arr"),","))
df1.withColumn("md5",md5(concat_ws(",",*[col(x) for x in df1.columns]))).show(10,False)
#+---+---+--------------------------------+
#|id |arr|md5 |
#+---+---+--------------------------------+
#|a |1,2|9f357697a277b1e5a8315035e7d95984|
#|b |3,4|578bec981ad992ddb641a45969babab1|
#+---+---+--------------------------------+
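If you do not want to handle each complex column by hand, a more generic variant worth trying (a sketch, not limited to arrays of strings) is to serialize the whole row to JSON first and hash that, since to_json handles nested structs and arrays:
from pyspark.sql.functions import col, md5, struct, to_json
# to_json flattens the full row (including nested arrays/structs) into one string,
# so md5 only ever sees a plain string column.
df_hashed = df.withColumn("hash", md5(to_json(struct(*[col(c) for c in df.columns]))))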

How to change dataframe column names in PySpark?

I come from pandas background and am used to reading data from CSV files into a dataframe and then simply changing the column names to something useful using the simple command:
df.columns = new_column_name_list
However, the same doesn't work in PySpark dataframes created using sqlContext.
The only solution I could figure out to do this easily is the following:
df = sqlContext.read.format("com.databricks.spark.csv").options(header='false', inferschema='true', delimiter='\t').load("data.txt")
oldSchema = df.schema
for i, k in enumerate(oldSchema.fields):
    k.name = new_column_name_list[i]
df = sqlContext.read.format("com.databricks.spark.csv").options(header='false', delimiter='\t').load("data.txt", schema=oldSchema)
This is basically defining the variable twice and inferring the schema first then renaming the column names and then loading the dataframe again with the updated schema.
Is there a better and more efficient way to do this like we do in pandas?
My Spark version is 1.5.0
There are many ways to do that:
Option 1. Using selectExpr.
data = sqlContext.createDataFrame([("Alberto", 2), ("Dakota", 2)],
["Name", "askdaosdka"])
data.show()
data.printSchema()
# Output
#+-------+----------+
#| Name|askdaosdka|
#+-------+----------+
#|Alberto| 2|
#| Dakota| 2|
#+-------+----------+
#root
# |-- Name: string (nullable = true)
# |-- askdaosdka: long (nullable = true)
df = data.selectExpr("Name as name", "askdaosdka as age")
df.show()
df.printSchema()
# Output
#+-------+---+
#| name|age|
#+-------+---+
#|Alberto| 2|
#| Dakota| 2|
#+-------+---+
#root
# |-- name: string (nullable = true)
# |-- age: long (nullable = true)
Option 2. Using withColumnRenamed, notice that this method allows you to "overwrite" the same column. For Python3, replace xrange with range.
from functools import reduce
oldColumns = data.schema.names
newColumns = ["name", "age"]
df = reduce(lambda data, idx: data.withColumnRenamed(oldColumns[idx], newColumns[idx]), xrange(len(oldColumns)), data)
df.printSchema()
df.show()
Option 3. Using alias; in Scala you can also use as.
from pyspark.sql.functions import col
data = data.select(col("Name").alias("name"), col("askdaosdka").alias("age"))
data.show()
# Output
#+-------+---+
#| name|age|
#+-------+---+
#|Alberto| 2|
#| Dakota| 2|
#+-------+---+
Option 4. Using sqlContext.sql, which lets you use SQL queries on DataFrames registered as tables.
sqlContext.registerDataFrameAsTable(data, "myTable")
df2 = sqlContext.sql("SELECT Name AS name, askdaosdka as age from myTable")
df2.show()
# Output
#+-------+---+
#| name|age|
#+-------+---+
#|Alberto| 2|
#| Dakota| 2|
#+-------+---+
df = df.withColumnRenamed("colName", "newColName")\
.withColumnRenamed("colName2", "newColName2")
Advantage of this approach: with a long list of columns you may want to change only a few column names. This can be very convenient in those scenarios, and it is very useful when joining tables with duplicate column names.
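For example, a small sketch of the duplicate-name case (the customers/orders DataFrames and their columns are made up for illustration):
# Both frames have an 'id' column; renaming one side before the join keeps the result unambiguous.
orders = orders.withColumnRenamed("id", "order_id")
joined = customers.join(orders, customers["id"] == orders["order_id"], "inner")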
If you want to change all columns names, try df.toDF(*cols)
In case you would like to apply a simple transformation to all column names, this code does the trick (here I am replacing all spaces with underscores):
new_column_name_list= list(map(lambda x: x.replace(" ", "_"), df.columns))
df = df.toDF(*new_column_name_list)
Thanks to @user8117731 for the toDF trick.
df.withColumnRenamed('age', 'age2')
If you want to rename a single column and keep the rest as it is:
from pyspark.sql.functions import col
new_df = old_df.select(*[col(s).alias(new_name) if s == column_to_change else s for s in old_df.columns])
this is the approach that I used:
create pyspark session:
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('changeColNames').getOrCreate()
create dataframe:
df = spark.createDataFrame(data = [('Bob', 5.62,'juice'), ('Sue',0.85,'milk')], schema = ["Name", "Amount","Item"])
view df with column names:
df.show()
+----+------+-----+
|Name|Amount| Item|
+----+------+-----+
| Bob| 5.62|juice|
| Sue| 0.85| milk|
+----+------+-----+
create a list with new column names:
newcolnames = ['NameNew','AmountNew','ItemNew']
change the column names of the df:
for c, n in zip(df.columns, newcolnames):
    df = df.withColumnRenamed(c, n)
view df with new column names:
df.show()
+-------+---------+-------+
|NameNew|AmountNew|ItemNew|
+-------+---------+-------+
| Bob| 5.62| juice|
| Sue| 0.85| milk|
+-------+---------+-------+
I made an easy to use function to rename multiple columns for a pyspark dataframe,
in case anyone wants to use it:
def renameCols(df, old_columns, new_columns):
    for old_col, new_col in zip(old_columns, new_columns):
        df = df.withColumnRenamed(old_col, new_col)
    return df
old_columns = ['old_name1','old_name2']
new_columns = ['new_name1', 'new_name2']
df_renamed = renameCols(df, old_columns, new_columns)
Be careful, both lists must be the same length.
Another way to rename just one column (using import pyspark.sql.functions as F):
df = df.select( '*', F.col('count').alias('new_count') ).drop('count')
Method 1:
df = df.withColumnRenamed("old_column_name", "new_column_name")
Method 2:
If you want to do some computation and store the result under a new name:
df = df.withColumn("new_column_name", F.when(F.col("old_column_name") > 1, F.lit(1)).otherwise(F.col("old_column_name")))
df = df.drop("old_column_name")
You can use the following function to rename all the columns of your dataframe.
def df_col_rename(X, to_rename, replace_with):
    """
    :param X: spark dataframe
    :param to_rename: list of original names
    :param replace_with: list of new names
    :return: dataframe with updated names
    """
    import pyspark.sql.functions as F
    mapping = dict(zip(to_rename, replace_with))
    X = X.select([F.col(c).alias(mapping.get(c, c)) for c in to_rename])
    return X
In case you need to update only a few columns' names, you can use the same column name in the replace_with list
To rename all columns
df_col_rename(X,['a', 'b', 'c'], ['x', 'y', 'z'])
To rename only some of the columns
df_col_rename(X,['a', 'b', 'c'], ['a', 'y', 'z'])
we can use col.alias for renaming the column:
from pyspark.sql.functions import col
df.select(['vin',col('timeStamp').alias('Date')]).show()
We can use various approaches to rename the column name.
First, let's create a simple DataFrame.
df = spark.createDataFrame([("x", 1), ("y", 2)],
["col_1", "col_2"])
Now let's try to rename col_1 to col_3. Below are a few approaches to do the same.
# Approach - 1 : using withColumnRenamed function.
df.withColumnRenamed("col_1", "col_3").show()
# Approach - 2 : using alias function.
df.select(df["col_1"].alias("col3"), "col_2").show()
# Approach - 3 : using selectExpr function.
df.selectExpr("col_1 as col_3", "col_2").show()
# Rename all columns
# Approach - 4 : using toDF function. Here you need to pass the list of all columns present in DataFrame.
df.toDF("col_3", "col_2").show()
Here is the output.
+-----+-----+
|col_3|col_2|
+-----+-----+
| x| 1|
| y| 2|
+-----+-----+
I hope this helps.
A way that you can use 'alias' to change the column name:
col('my_column').alias('new_name')
Another way that you can use 'alias' (possibly not mentioned):
df.my_column.alias('new_name')
You can use a for loop with zip to pair each old column name with its new name:
new_name = ["id", "sepal_length_cm", "sepal_width_cm", "petal_length_cm", "petal_width_cm", "species"]
new_df = df
for old, new in zip(df.columns, new_name):
    new_df = new_df.withColumnRenamed(old, new)
I like to use a dict to rename the df.
rename = {'old1': 'new1', 'old2': 'new2'}
for col in df.schema.names:
    df = df.withColumnRenamed(col, rename.get(col, col))  # keep the name if it is not in the dict
For a single column rename, you can still use toDF(). For example,
df1.selectExpr("SALARY*2").toDF("REVISED_SALARY").show()
There are multiple approaches you can use:
df1 = df.withColumn("new_column", col("old_column")).drop("old_column")
df1 = df.withColumn("new_column", col("old_column"))
df1 = df.select(col("old_column").alias("new_column"))
from pyspark.sql.types import StructType,StructField, StringType, IntegerType
CreatingDataFrame = [("James","Sales","NY",90000,34,10000),
("Michael","Sales","NY",86000,56,20000),
("Robert","Sales","CA",81000,30,23000),
("Maria","Finance","CA",90000,24,23000),
("Raman","Finance","CA",99000,40,24000),
("Scott","Finance","NY",83000,36,19000),
("Jen","Finance","NY",79000,53,15000),
("Jeff","Marketing","CA",80000,25,18000),
("Kumar","Marketing","NY",91000,50,21000)
]
schema = StructType([ \
StructField("employee_name",StringType(),True), \
StructField("department",StringType(),True), \
StructField("state",StringType(),True), \
StructField("salary", IntegerType(), True), \
StructField("age", StringType(), True), \
StructField("bonus", IntegerType(), True) \
])
OurData = spark.createDataFrame(data=CreatingDataFrame,schema=schema)
OurData.show()
# COMMAND ----------
GrouppedBonusData=OurData.groupBy("department").sum("bonus")
# COMMAND ----------
GrouppedBonusData.show()
# COMMAND ----------
GrouppedBonusData.printSchema()
# COMMAND ----------
from pyspark.sql.functions import col
BonusColumnRenamed = GrouppedBonusData.select(col("department").alias("department"), col("sum(bonus)").alias("Total_Bonus"))
BonusColumnRenamed.show()
# COMMAND ----------
GrouppedBonusData.groupBy("department").count().show()
# COMMAND ----------
GrouppedSalaryData=OurData.groupBy("department").sum("salary")
# COMMAND ----------
GrouppedSalaryData.show()
# COMMAND ----------
from pyspark.sql.functions import col
SalaryColumnRenamed = GrouppedSalaryData.select(col("department").alias("Department"), col("sum(salary)").alias("Total_Salary"))
SalaryColumnRenamed.show()
Try the following method; it allows you to rename the columns of multiple files.
Reference: https://www.linkedin.com/pulse/pyspark-methods-rename-columns-kyle-gibson/
df_initial = spark.read.load('com.databricks.spark.csv')
rename_dict = {
    'Alberto': 'Name',
    'Dakota': 'askdaosdka'
}
df_renamed = df_initial \
    .select([col(c).alias(rename_dict.get(c, c)) for c in df_initial.columns])

def renameColumns(df):
    rename_dict = {
        'FName': 'FirstName',
        'LName': 'LastName',
        'DOB': 'BirthDate'
    }
    return df.select([col(c).alias(rename_dict.get(c, c)) for c in df.columns])

df_renamed = spark.read.load('/mnt/datalake/bronze/testData') \
    .transform(renameColumns)
The simplest solution is using withColumnRenamed:
renamed_df = df.withColumnRenamed('name_1', 'New_name_1').withColumnRenamed('name_2', 'New_name_2')
renamed_df.show()
And if you would like to do this like we do with Pandas, you can use toDF:
Create an ordered list of the new column names and pass it to toDF:
df_list = ["newName_1", "newName_2", "newName_3", "newName_4"]
renamed_df = df.toDF(*df_list)
renamed_df.show()
This is an easy way to rename multiple columns with a loop:
cols_to_rename = ["col1","col2","col3"]
for col in cols_to_rename:
    df = df.withColumnRenamed(col, "new_{}".format(col))
List comprehension + f-string:
df = df.toDF(*[f'n_{c}' for c in df.columns])
Simple list comprehension:
df = df.toDF(*[c.lower() for c in df.columns])
The closest statement to df.columns = new_column_name_list is:
import pyspark.sql.functions as F
df = df.select(*[F.col(name_old).alias(name_new)
                 for (name_old, name_new)
                 in zip(df.columns, new_column_name_list)])
This doesn't require any rarely-used functions, and emphasizes some patterns that are very helpful in Spark. You could also break up the steps if you find this one-liner to be doing too many things:
import pyspark.sql.functions as F
column_mapping = [F.col(name_old).alias(name_new)
                  for (name_old, name_new)
                  in zip(df.columns, new_column_name_list)]
df = df.select(*column_mapping)
