Looking at the new Spark DataFrame API, it is unclear whether it is possible to modify DataFrame columns.
How would I go about changing a value in row x column y of a dataframe?
In pandas this would be:
df.ix[x,y] = new_value
Edit: Consolidating what was said below, you can't modify the existing dataframe as it is immutable, but you can return a new dataframe with the desired modifications.
If you just want to replace a value in a column based on a condition, like np.where:
from pyspark.sql import functions as F
update_func = (F.when(F.col('update_col') == replace_val, new_value)
               .otherwise(F.col('update_col')))
df = df.withColumn('new_column_name', update_func)
If you want to perform some operation on a column and create a new column that is added to the dataframe:
import pyspark.sql.functions as F
import pyspark.sql.types as T
def my_func(col):
    # do stuff to the column value here, then
    return transformed_value
# if we assume that my_func returns a string
my_udf = F.UserDefinedFunction(my_func, T.StringType())
df = df.withColumn('new_column_name', my_udf('update_col'))
If you want the new column to have the same name as the old column, you could add the additional step:
df = df.drop('update_col').withColumnRenamed('new_column_name', 'update_col')
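Putting the pieces above together, here is a minimal runnable sketch, assuming a hypothetical string column update_col that we simply want to upper-case:

import pyspark.sql.functions as F
import pyspark.sql.types as T

# hypothetical example: upper-case the string values of 'update_col'
def my_func(value):
    return value.upper() if value is not None else None

my_udf = F.udf(my_func, T.StringType())
df = (df.withColumn('new_column_name', my_udf('update_col'))
        .drop('update_col')
        .withColumnRenamed('new_column_name', 'update_col'))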
While you cannot modify a column as such, you may operate on a column and return a new DataFrame reflecting that change. For that you'd first create a UserDefinedFunction implementing the operation to apply and then selectively apply that function to the targeted column only. In Python:
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import StringType
name = 'target_column'
udf = UserDefinedFunction(lambda x: 'new_value', StringType())
new_df = old_df.select(*[udf(column).alias(name) if column == name else column
                         for column in old_df.columns])
new_df now has the same schema as old_df (assuming that old_df.target_column was of type StringType as well) but all values in column target_column will be new_value.
Commonly when updating a column, we want to map an old value to a new value. Here's a way to do that in PySpark without UDFs:
# update df[update_col], mapping old_value --> new_value
from pyspark.sql import functions as F
df = df.withColumn(update_col,
                   F.when(df[update_col] == old_value, new_value)
                    .otherwise(df[update_col]))
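As a side note (an alternative not mentioned above, assuming old_value and new_value are plain literals of a matching type), DataFrame.replace can express the same literal substitution without when/otherwise:

# replace old_value with new_value only in update_col
df = df.replace(old_value, new_value, subset=[update_col])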
DataFrames are based on RDDs. RDDs are immutable structures and do not allow updating elements on-site. To change values, you will need to create a new DataFrame by transforming the original one either using the SQL-like DSL or RDD operations like map.
A highly recommended slide deck: Introducing DataFrames in Spark for Large Scale Data Science.
Just as maasg says, you can create a new DataFrame from the result of a map applied to the old DataFrame. An example for a given DataFrame df with two columns:
val newDf = sqlContext.createDataFrame(df.map(row =>
  Row(row.getInt(0) + SOMETHING, applySomeDef(row.getAs[Double]("y")))), df.schema)
Note that if the types of the columns change, you need to give it a correct schema instead of df.schema. Check out the API of org.apache.spark.sql.Row for available methods: https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Row.html
[Update] Or using UDFs in Scala:
import org.apache.spark.sql.functions._
val toLong = udf[Long, String] (_.toLong)
val modifiedDf = df.withColumn("modifiedColumnName", toLong(df("columnName"))).drop("columnName")
and if the column name needs to stay the same you can rename it back:
modifiedDf.withColumnRenamed("modifiedColumnName", "columnName")
Importing col and when from pyspark.sql.functions and updating the fifth column to an integer (0, 1, 2) based on its string value ("string a", "string b", "string c"), producing a new DataFrame:
from pyspark.sql.functions import col, when

data_frame_temp = data_frame.withColumn(
    "col_5",
    when(col("col_5") == "string a", 0)
    .when(col("col_5") == "string b", 1)
    .otherwise(2)
)
I have VCF-format data in Databricks. I wish to rename the subjects based on a dictionary.
I have a dictionary mapping keys to the new names, and a function that looks up the new values; that part works so far:
import pyspark.sql.functions as F
from pyspark.sql.types import StringType

keys = {'old_name': 'new_name'}
mapping_func = lambda x: keys.get(x)
df = df.withColumn('foo', F.udf(mapping_func, StringType())('geno.sampleId'))
This produces a new column foo. What I need is to assign the values inside the nested structure (the last row of the schema below):
StructField(contigName,StringType,true)
StructField(start,LongType,true)
StructField(end,LongType,true)
StructField(names,ArrayType(StringType,true),true)
StructField(referenceAllele,StringType,true)
StructField(alternateAlleles,ArrayType(StringType,true),true)
StructField(qual,DoubleType,true)
StructField(filters,ArrayType(StringType,true),true)
StructField(splitFromMultiAllelic,BooleanType,true)
StructField(geno,StructType(List(StructField(sampleId,StringType,true),StructField(CN,IntegerType,true),StructField(phased,BooleanType,true),StructField(calls,ArrayType(IntegerType,true),true))),true)
Something like this:
df = df.withColumn(F.col('geno').sampleId, F.udf(mapping_func, StringType())('geno.sampleId'))
But this says
Column is not iterable
How would I go about assigning the values to the proper place?
Scala 2.12 and Spark 3.0.1
From my understanding, you don't need to use a UDF here. You can simply use a map column expression instead:
from itertools import chain
import pyspark.sql.functions as F
keys_map = F.create_map(*[F.lit(x) for x in chain(*keys.items())])
Now, to update a nested field in a struct, you need to recreate the whole struct column (on Spark 3.1+ you could use the withField method instead; see the sketch after the example below):
df = df.withColumn(
    "geno",
    F.struct(
        keys_map[F.col("geno.sampleId")].alias("sampleId"),  # replaces sampleId value according to your keys mapping
        F.col("geno.CN").alias("CN"),
        F.col("geno.phased").alias("phased"),
        F.col("geno.calls").alias("calls")
    )
)
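For reference, on Spark 3.1+ the same update can be written without rebuilding the struct by hand; a minimal sketch using withField with the same keys_map:

# Spark 3.1+ only: rewrite a single field inside the 'geno' struct
df = df.withColumn(
    "geno",
    F.col("geno").withField("sampleId", keys_map[F.col("geno.sampleId")])
)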
Here's what I usually do in pandas:
cdr = datamonthly.pivot(index="msisdn", columns="last_x_month", values="arpu_sum").add_prefix('arpu_sum_l').reset_index()
But here is what I did in PySpark:
cdr = datamonthly.groupBy("msisdn").pivot("last_x_month").sum("arpu_sum")
I can't find an alternative for add_prefix('arpu_sum_l').reset_index().
There is nothing similar to pandas' add_prefix in Spark when doing a pivot. But you can try a workaround, like creating a column from the concatenation of the custom prefix string and the value of the column to be pivoted.
import pyspark.sql.functions as F
cdr = (datamonthly
       .withColumn("p", F.expr("concat('arpu_sum_l_', last_x_month)"))
       .groupBy("msisdn")
       .pivot("p")
       .sum("arpu_sum"))
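Alternatively (a sketch, assuming the pivoted column names come out as the raw last_x_month values), you could pivot first and add the prefix by renaming the generated columns afterwards:

import pyspark.sql.functions as F

pivoted = datamonthly.groupBy("msisdn").pivot("last_x_month").sum("arpu_sum")
# keep the grouping key as-is and prefix every pivoted column
cdr = pivoted.select(
    "msisdn",
    *[F.col(c).alias("arpu_sum_l_" + c) for c in pivoted.columns if c != "msisdn"]
)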
I am just studying PySpark. I want to change the column types like this:
df1=df.select(df.Date.cast('double'),df.Time.cast('double'),
df.NetValue.cast('double'),df.Units.cast('double'))
You can see that df is a data frame and I select 4 columns and change all of them to double. Because of using select, all other columns are ignored.
But if df has hundreds of columns and I just need to change those 4, I still need to keep all the other columns. So how do I do it?
Try this:
from pyspark.sql.functions import col
df = df.select([col(column).cast('double') for column in df.columns])
for c in df.columns:
    # only cast the columns that need it (e.g. the four from the question)
    if c in ['Date', 'Time', 'NetValue', 'Units']:
        df = df.withColumn(c, df[c].cast('double'))
Another way using selectExpr():
df1 = df.selectExpr("cast(Date as double) Date",
                    "cast(NetValue as string) NetValue")
df1.printSchema()
Using withColumn():
from pyspark.sql.types import DoubleType, StringType
df1 = df.withColumn("Date", df["Date"].cast(DoubleType())) \
.withColumn("NetValueas ", df["NetValueas"].cast(StringType()))
df1.printSchema()
Check the types documentation.
I understand that you would like to have a non-for-loop answer that preserves the original set of columns whilst only updating a subset. The following should be the answer you were looking for:
from pyspark.sql.functions import col
df = df.select(*(col(c).cast("double").alias(c) for c in subset),
               *[x for x in df.columns if x not in subset])
where subset is a list of the column names you would like to update.
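For example, with the four columns from the question:
subset = ['Date', 'Time', 'NetValue', 'Units']
Note that with this select, the cast columns will come first in the resulting column order, followed by the untouched ones.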
I have a DataFrame in PySpark with more than 100 columns. What I want to do is, for all the column names, add backticks (`) at the start and the end of each column name.
For example:
column name is testing user. I want `testing user`
Is there a method to do this in PySpark/Python? When we apply the code, it should return a DataFrame.
Use a list comprehension in Python.
from pyspark.sql import functions as F
df = ...
df_new = df.select([F.col(c).alias("`"+c+"`") for c in df.columns])
This method also gives you the option to add custom Python logic within the alias() function, like: "prefix_"+c+"_suffix" if c in list_of_cols_to_change else c
To add prefix or suffix:
Refer to df.columns for the list of columns ([col_1, col_2, ...]) of the DataFrame whose columns we want to prefix/suffix.
df.columns
Iterate through the above list and create another list of columns with aliases that can be used inside a select expression.
from pyspark.sql.functions import col
select_list = [col(col_name).alias("prefix_" + col_name) for col_name in df.columns]
When using it inside select, do not forget to unpack the list with an asterisk (*). We can assign the result back to the same or a different DataFrame for further use.
df.select(*select_list).show()
df = df.select(*select_list)
df.columns will now return the list of new (aliased) column names.
If you would like to add a prefix or suffix to multiple columns in a pyspark dataframe, you could use a for loop and .withColumnRenamed().
As an example, you might like:
def add_prefix(sdf, prefix):
    for c in sdf.columns:
        sdf = sdf.withColumnRenamed(c, '{}{}'.format(prefix, c))
    return sdf
You can amend sdf.columns as you see fit.
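Usage would then look something like this (the prefix string here is just an illustrative placeholder):

prefixed_df = add_prefix(df, 'my_prefix_')
# prefixed_df.columns is now ['my_prefix_<original name>', ...] for every column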
You can use the withColumnRenamed method of the DataFrame to create a new DataFrame:
df.withColumnRenamed('testing user', '`testing user`')
Edit: suppose you have a list of columns; you can do something like:
old = "First Last Age"
new = ["`"+field+"`" for field in old.split()]
df.rdd.toDF(new)
Output:
DataFrame[`First`: string, `Last`: string, `Age`: string]
Here is how one can solve similar problems:
from pyspark.sql.functions import col

df.select([col(col_name).alias('prefix' + col_name + 'suffix') for col_name in df.columns])
I had a dataframe that I duplicated twice and then joined together. Since both had the same column names, I used:
from functools import reduce

df = reduce(lambda df, idx: df.withColumnRenamed(list(df.schema.names)[idx],
                                                 list(df.schema.names)[idx] + '_prec'),
            range(len(list(df.schema.names))),
            df)
Every column in my dataframe then had the '_prec' suffix, which allowed me to do sweet stuff.
I'm using Pandas with Python 3. I have a dataframe with a bunch of columns, but I only want to change the data type of all the values in one of the columns and leave the others alone. The only way I could find to accomplish this is to edit the column, remove the original column and then merge the edited one back. I would like to edit the column without having to remove and merge, leaving the rest of the dataframe unaffected. Is this possible?
Here is my solution now:
import numpy as np
import pandas as pd
from pandas import Series,DataFrame
def make_float(var):
    var = float(var)
    return var
#create a new dataframe with the value types I want
df2 = df1['column'].apply(make_float)
#remove the original column
df3 = df1.drop('column',1)
#merge the dataframes
df1 = pd.concat([df3,df2],axis=1)
It also doesn't work to apply the function to the dataframe directly. For example:
df1['column'].apply(make_float)
print(type(df1.iloc[1]['column']))
yields:
<class 'str'>
df1['column'] = df1['column'].astype(float)
It will raise an error if conversion fails for some row.
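If some rows might fail to convert and you would rather not raise, a common alternative (a sketch, not part of the answer above) is pd.to_numeric with errors='coerce', which turns unparseable values into NaN:

import pandas as pd

# unparseable strings become NaN instead of raising an error
df1['column'] = pd.to_numeric(df1['column'], errors='coerce')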
Apply does not work in place, but rather returns a Series that you discard in this line:
df1['column'].apply(make_float)
Apart from Yakym's solution, you can also do this:
df['column'] += 0.0