Looking at the new Spark DataFrame API, it is unclear whether it is possible to modify DataFrame columns.
How would I go about changing a value in row x column y of a dataframe?
In pandas this would be:
df.ix[x,y] = new_value
Edit: Consolidating what was said below, you can't modify the existing dataframe as it is immutable, but you can return a new dataframe with the desired modifications.
If you just want to replace a value in a column based on a condition, like np.where:
from pyspark.sql import functions as F
update_func = (F.when(F.col('update_col') == replace_val, new_value)
               .otherwise(F.col('update_col')))
df = df.withColumn('new_column_name', update_func)
If you want to perform some operation on a column and create a new column that is added to the dataframe:
import pyspark.sql.functions as F
import pyspark.sql.types as T
def my_func(col):
    # do stuff to the column value here, then
    return transformed_value
# if we assume that my_func returns a string
my_udf = F.UserDefinedFunction(my_func, T.StringType())
df = df.withColumn('new_column_name', my_udf('update_col'))
If you want the new column to have the same name as the old column, you could add the additional step:
df = df.drop('update_col').withColumnRenamed('new_column_name', 'update_col')
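Putting the pieces above together, here is a minimal runnable sketch, assuming a hypothetical string column update_col that we simply want to upper-case:

import pyspark.sql.functions as F
import pyspark.sql.types as T

# hypothetical example: upper-case the string values of 'update_col'
def my_func(value):
    return value.upper() if value is not None else None

my_udf = F.udf(my_func, T.StringType())
df = (df.withColumn('new_column_name', my_udf('update_col'))
        .drop('update_col')
        .withColumnRenamed('new_column_name', 'update_col'))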
While you cannot modify a column as such, you may operate on a column and return a new DataFrame reflecting that change. For that you'd first create a UserDefinedFunction implementing the operation to apply and then selectively apply that function to the targeted column only. In Python:
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import StringType
name = 'target_column'
udf = UserDefinedFunction(lambda x: 'new_value', StringType())
new_df = old_df.select(*[udf(column).alias(name) if column == name else column
                         for column in old_df.columns])
new_df now has the same schema as old_df (assuming that old_df.target_column was of type StringType as well) but all values in column target_column will be new_value.
Commonly when updating a column, we want to map an old value to a new value. Here's a way to do that in PySpark without UDFs:
# update df[update_col], mapping old_value --> new_value
from pyspark.sql import functions as F
df = df.withColumn(update_col,
                   F.when(df[update_col] == old_value, new_value)
                    .otherwise(df[update_col]))
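As a side note (an alternative not mentioned above, assuming old_value and new_value are plain literals of a matching type), DataFrame.replace can express the same literal substitution without when/otherwise:

# replace old_value with new_value only in update_col
df = df.replace(old_value, new_value, subset=[update_col])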
DataFrames are based on RDDs. RDDs are immutable structures and do not allow updating elements on-site. To change values, you will need to create a new DataFrame by transforming the original one either using the SQL-like DSL or RDD operations like map.
A highly recommended slide deck: Introducing DataFrames in Spark for Large Scale Data Science.
Just as maasg says, you can create a new DataFrame from the result of a map applied to the old DataFrame. An example for a given DataFrame df with two columns:
val newDf = sqlContext.createDataFrame(df.map(row =>
  Row(row.getInt(0) + SOMETHING, applySomeDef(row.getAs[Double]("y")))), df.schema)
Note that if the types of the columns change, you need to give it a correct schema instead of df.schema. Check out the API of org.apache.spark.sql.Row for available methods: https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Row.html
[Update] Or using UDFs in Scala:
import org.apache.spark.sql.functions._
val toLong = udf[Long, String] (_.toLong)
val modifiedDf = df.withColumn("modifiedColumnName", toLong(df("columnName"))).drop("columnName")
and if the column name needs to stay the same you can rename it back:
modifiedDf.withColumnRenamed("modifiedColumnName", "columnName")
Importing col and when from pyspark.sql.functions and updating the fifth column to an integer (0, 1, 2) based on its string value ("string a", "string b", "string c"), producing a new DataFrame:
from pyspark.sql.functions import col, when

data_frame_temp = data_frame.withColumn(
    "col_5",
    when(col("col_5") == "string a", 0)
    .when(col("col_5") == "string b", 1)
    .otherwise(2)
)
I have VCF-format data in Databricks. I wish to rename the subjects based on a dictionary.
I have a dictionary mapping keys to the new names, and a function that looks up the new values; that part works so far:
import pyspark.sql.functions as F
from pyspark.sql.types import StringType

keys = {'old_name': 'new_name'}
mapping_func = lambda x: keys.get(x)
df = df.withColumn('foo', F.udf(mapping_func, StringType())('geno.sampleId'))
This produces a new column foo. What I need is to assign the values inside the nested structure (the last row of the schema below):
StructField(contigName,StringType,true)
StructField(start,LongType,true)
StructField(end,LongType,true)
StructField(names,ArrayType(StringType,true),true)
StructField(referenceAllele,StringType,true)
StructField(alternateAlleles,ArrayType(StringType,true),true)
StructField(qual,DoubleType,true)
StructField(filters,ArrayType(StringType,true),true)
StructField(splitFromMultiAllelic,BooleanType,true)
StructField(geno,StructType(List(StructField(sampleId,StringType,true),StructField(CN,IntegerType,true),StructField(phased,BooleanType,true),StructField(calls,ArrayType(IntegerType,true),true))),true)
Something like this:
df = df.withColumn(F.col('geno').sampleId, F.udf(mapping_func, StringType())('geno.sampleId'))
But this says
Column is not iterable
How would I go about assigning the values to the proper place?
Scala 2.12 and Spark 3.0.1
From my understanding, you don't need to use a UDF here. You can simply use a map column expression instead:
from itertools import chain
import pyspark.sql.functions as F
keys_map = F.create_map(*[F.lit(x) for x in chain(*keys.items())])
Now, to update a nested field in a struct, you need to recreate the whole struct column (on Spark 3.1+ you could use the withField method instead; see the sketch after the example below):
df = df.withColumn(
    "geno",
    F.struct(
        keys_map[F.col("geno.sampleId")].alias("sampleId"),  # replaces sampleId value according to your keys mapping
        F.col("geno.CN").alias("CN"),
        F.col("geno.phased").alias("phased"),
        F.col("geno.calls").alias("calls")
    )
)
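For reference, on Spark 3.1+ the same update can be written without rebuilding the struct by hand; a minimal sketch using withField with the same keys_map:

# Spark 3.1+ only: rewrite a single field inside the 'geno' struct
df = df.withColumn(
    "geno",
    F.col("geno").withField("sampleId", keys_map[F.col("geno.sampleId")])
)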
Here's what I usually do in pandas:
cdr = datamonthly.pivot(index="msisdn", columns="last_x_month", values="arpu_sum").add_prefix('arpu_sum_l').reset_index()
But here is what I did in PySpark:
cdr = datamonthly.groupBy("msisdn").pivot("last_x_month").sum("arpu_sum")
I can't find an alternative for add_prefix('arpu_sum_l').reset_index().
There is nothing similar to pandas' add_prefix in Spark when doing a pivot. But you can try a workaround, like creating a column from the concatenation of the custom prefix string and the value of the column to be pivoted.
import pyspark.sql.functions as F
cdr = (datamonthly
       .withColumn("p", F.expr("concat('arpu_sum_l_', last_x_month)"))
       .groupBy("msisdn")
       .pivot("p")
       .sum("arpu_sum"))
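Alternatively (a sketch, assuming the pivoted column names come out as the raw last_x_month values), you could pivot first and add the prefix by renaming the generated columns afterwards:

import pyspark.sql.functions as F

pivoted = datamonthly.groupBy("msisdn").pivot("last_x_month").sum("arpu_sum")
# keep the grouping key as-is and prefix every pivoted column
cdr = pivoted.select(
    "msisdn",
    *[F.col(c).alias("arpu_sum_l_" + c) for c in pivoted.columns if c != "msisdn"]
)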
I am just studying PySpark. I want to change the column types like this:
df1=df.select(df.Date.cast('double'),df.Time.cast('double'),
df.NetValue.cast('double'),df.Units.cast('double'))
You can see that df is a data frame and I select 4 columns and change all of them to double. Because of using select, all other columns are ignored.
But if df has hundreds of columns and I just need to change those 4, I still need to keep all the other columns. So how do I do it?
Try this:
from pyspark.sql.functions import col
df = df.select([col(column).cast('double') for column in df.columns])
for c in df.columns:
    # only cast the columns that need it (e.g. the four from the question)
    if c in ['Date', 'Time', 'NetValue', 'Units']:
        df = df.withColumn(c, df[c].cast('double'))
Another way using selectExpr():
df1 = df.selectExpr("cast(Date as double) Date",
                    "cast(NetValue as string) NetValue")
df1.printSchema()
Using withColumn():
from pyspark.sql.types import DoubleType, StringType
df1 = df.withColumn("Date", df["Date"].cast(DoubleType())) \
.withColumn("NetValueas ", df["NetValueas"].cast(StringType()))
df1.printSchema()
Check the types documentation.
I understand that you would like to have a non-for-loop answer that preserves the original set of columns whilst only updating a subset. The following should be the answer you were looking for:
from pyspark.sql.functions import col
df = df.select(*(col(c).cast("double").alias(c) for c in subset),
               *[x for x in df.columns if x not in subset])
where subset is a list of the column names you would like to update.
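For example, with the four columns from the question:
subset = ['Date', 'Time', 'NetValue', 'Units']
Note that with this select, the cast columns will come first in the resulting column order, followed by the untouched ones.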
I have a DataFrame in PySpark with more than 100 columns. What I want to do is, for all the column names, add backticks (`) at the start and the end of each column name.
For example:
column name is testing user. I want `testing user`
Is there a method to do this in PySpark/Python? When we apply the code, it should return a DataFrame.
Use a list comprehension in Python.
from pyspark.sql import functions as F
df = ...
df_new = df.select([F.col(c).alias("`"+c+"`") for c in df.columns])
This method also gives you the option to add custom Python logic within the alias() function, like: "prefix_"+c+"_suffix" if c in list_of_cols_to_change else c
To add prefix or suffix:
Refer to df.columns for the list of columns ([col_1, col_2, ...]) of the DataFrame whose columns we want to prefix/suffix.
df.columns
Iterate through the above list and create another list of columns with aliases that can be used inside a select expression.
from pyspark.sql.functions import col
select_list = [col(col_name).alias("prefix_" + col_name) for col_name in df.columns]
When using it inside select, do not forget to unpack the list with an asterisk (*). We can assign the result back to the same or a different DataFrame for further use.
df.select(*select_list).show()
df = df.select(*select_list)
df.columns will now return the list of new (aliased) column names.
If you would like to add a prefix or suffix to multiple columns in a pyspark dataframe, you could use a for loop and .withColumnRenamed().
As an example, you might like:
def add_prefix(sdf, prefix):
    for c in sdf.columns:
        sdf = sdf.withColumnRenamed(c, '{}{}'.format(prefix, c))
    return sdf
You can amend sdf.columns as you see fit.
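Usage would then look something like this (the prefix string here is just an illustrative placeholder):

prefixed_df = add_prefix(df, 'my_prefix_')
# prefixed_df.columns is now ['my_prefix_<original name>', ...] for every column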
You can use the withColumnRenamed method of the DataFrame to create a new DataFrame:
df.withColumnRenamed('testing user', '`testing user`')
Edit: suppose you have a list of columns; you can do something like:
old = "First Last Age"
new = ["`"+field+"`" for field in old.split()]
df.rdd.toDF(new)
Output:
DataFrame[`First`: string, `Last`: string, `Age`: string]
Here is how one can solve similar problems:
from pyspark.sql.functions import col

df.select([col(col_name).alias('prefix' + col_name + 'suffix') for col_name in df.columns])
I had a dataframe that I duplicated twice and then joined together. Since both had the same column names, I used:
from functools import reduce

df = reduce(lambda df, idx: df.withColumnRenamed(list(df.schema.names)[idx],
                                                 list(df.schema.names)[idx] + '_prec'),
            range(len(list(df.schema.names))),
            df)
Every column in my dataframe then had the '_prec' suffix, which allowed me to do sweet stuff.
I'm using Pandas with Python 3. I have a dataframe with a bunch of columns, but I only want to change the data type of all the values in one of the columns and leave the others alone. The only way I could find to accomplish this is to edit the column, remove the original column and then merge the edited one back. I would like to edit the column without having to remove and merge, leaving the rest of the dataframe unaffected. Is this possible?
Here is my solution now:
import numpy as np
import pandas as pd
from pandas import Series,DataFrame
def make_float(var):
    var = float(var)
    return var
#create a new dataframe with the value types I want
df2 = df1['column'].apply(make_float)
#remove the original column
df3 = df1.drop('column',1)
#merge the dataframes
df1 = pd.concat([df3,df2],axis=1)
It also doesn't work to apply the function to the dataframe directly. For example:
df1['column'].apply(make_float)
print(type(df1.iloc[1]['column']))
yields:
<class 'str'>
df1['column'] = df1['column'].astype(float)
It will raise an error if conversion fails for some row.
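If some rows might fail to convert and you would rather not raise, a common alternative (a sketch, not part of the answer above) is pd.to_numeric with errors='coerce', which turns unparseable values into NaN:

import pandas as pd

# unparseable strings become NaN instead of raising an error
df1['column'] = pd.to_numeric(df1['column'], errors='coerce')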
Apply does not work in place, but rather returns a Series that you discard in this line:
df1['column'].apply(make_float)
Apart from Yakym's solution, you can also do this:
df['column'] += 0.0