Replacing column values in nested structure spark dataframe - python

I have VCF-format data in Databricks. I wish to rename the subjects based on a dictionary.
I have a dictionary that maps the old names to the new names, and a function that looks up the new values; the returned values work so far:
import pyspark.sql.functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

keys = {'old_name': 'new_name'}
mapping_func = lambda x: keys.get(x)
df.withColumn('foo', udf(mapping_func, StringType())('geno.sampleId'))
This produces a new column foo. What I need is to assign the values inside the nested structure (last row of the schema below):
StructField(contigName,StringType,true)
StructField(start,LongType,true)
StructField(end,LongType,true)
StructField(names,ArrayType(StringType,true),true)
StructField(referenceAllele,StringType,true)
StructField(alternateAlleles,ArrayType(StringType,true),true)
StructField(qual,DoubleType,true)
StructField(filters,ArrayType(StringType,true),true)
StructField(splitFromMultiAllelic,BooleanType,true)
StructField(geno,StructType(List(StructField(sampleId,StringType,true),StructField(CN,IntegerType,true),StructField(phased,BooleanType,true),StructField(calls,ArrayType(IntegerType,true),true))),true)
Something like this:
df = df.withColumn(F.col('geno').sampleId, udf(mapping_func, StringType())('geno.sampleId'))
But this fails with:
Column is not iterable
How would I go about assigning the values to proper place?
Scala 2.12 and Spark 3.0.1

From my understanding, you don't need a UDF here. You can simply use a map column expression instead:
from itertools import chain
import pyspark.sql.functions as F
keys_map = F.create_map(*[F.lit(x) for x in chain(*keys.items())])
Now, to update a nested field in a struct you need to recreate the whole struct column (for Spark 3.1+, you could use the withField method instead; see the sketch after this example):
df = df.withColumn(
    "geno",
    F.struct(
        keys_map[F.col("geno.sampleId")].alias("sampleId"),  # replaces the sampleId value according to your keys mapping
        F.col("geno.CN").alias("CN"),
        F.col("geno.phased").alias("phased"),
        F.col("geno.calls").alias("calls")
    )
)
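For Spark 3.1 or newer, here is a minimal sketch of the withField variant mentioned above; it rewrites only the nested field instead of rebuilding the whole struct:
# Spark 3.1+ only: replace geno.sampleId in place with Column.withField
df = df.withColumn(
    "geno",
    F.col("geno").withField("sampleId", keys_map[F.col("geno.sampleId")])
)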

Related

Add prefix and reset index in pyspark dataframe

Here's what I usually do in pandas:
cdr = datamonthly.pivot(index="msisdn", columns="last_x_month", values="arpu_sum").add_prefix('arpu_sum_l').reset_index()
But here is what I did in PySpark:
cdr = datamonthly.groupBy("msisdn").pivot("last_x_month").sum("arpu_sum")
I can't find an alternative for add_prefix('arpu_sum_l').reset_index()
There is nothing similar to pandas' add_prefix in Spark when doing a pivot. But you can try a workaround: create a column that concatenates the custom prefix string with the value of the column to be pivoted, and pivot on that column instead.
import pyspark.sql.functions as F
cdr = datamonthly.withColumn("p", F.expr("concat('arpu_sum_l_', last_x_month)")).groupBy("msisdn").pivot("p").sum("arpu_sum")

efficiently mapping values in pandas from a 2nd dataframe

I'm looking to understand how best to use a second file/dataframe to efficiently map values, when those values are provided encoded and there is a label I want to map to them. Think of this second file as a data dictionary that translates the values in the first dataframe.
For example
import pandas as pd
dataset = pd.read_csv('https://gist.githubusercontent.com/seankross/a412dfbd88b3db70b74b/raw/5f23f993cd87c283ce766e7ac6b329ee7cc2e1d1/mtcars.csv')
data_dictionary = pd.DataFrame({'columnname' : ['vs','vs', 'am','am'], 'code' : [0,1,0,1], 'label':['vs_is_0','vs_is_1','am_is_0','am_is_1'] })
Now, I want to be able to replace the values in the columns listed in 'columnname' in the first dataset, mapping each 'code' to the corresponding 'label'. If a value is found in one and not the other, nothing happens.
Currently my approach is as follows, but I feel it is very inefficient and suboptimal. Keep in mind I could have 30-40 columns, each with 2-200 values I'd want replaced with this VLOOKUP-like replacement:
for each_colname in dataset.columns.tolist():
    lookup_values = data_dictionary.query("columnname == '{}'".format(each_colname))
    # and then doing a merge...
Any help is much appreciated!
First you can create a mapper dict and then apply this to your dataset.
mapper = (
    data_dictionary.groupby('columnname')
    .apply(lambda x: dict(x[['code', 'label']].values.tolist()))
    .to_dict()
)

for e in mapper.keys():
    dataset[e] = dataset[e].map(mapper[e]).combine_first(dataset[e])
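For the example data_dictionary above, the resulting mapper looks like this:
{'am': {0: 'am_is_0', 1: 'am_is_1'}, 'vs': {0: 'vs_is_0', 1: 'vs_is_1'}}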
Update to handle mismatched datatypes:
mapper = (
    data_dictionary.groupby('columnname')
    .apply(lambda x: dict(x[['code', 'label']].astype(str).values.tolist()))
    .to_dict()
)

for e in mapper.keys():
    dataset[e] = dataset[e].astype(str).map(mapper[e]).combine_first(dataset[e])

How to add suffix and prefix to all columns in python/pyspark dataframe

I have a dataframe in PySpark with more than 100 columns. For every column name I would like to add backticks (`) at the start and at the end of the name.
For example:
the column name is testing user; I want `testing user`
Is there a method to do this in PySpark/Python? When we apply the code it should return a dataframe.
Use a list comprehension in Python.
from pyspark.sql import functions as F
df = ...
df_new = df.select([F.col(c).alias("`"+c+"`") for c in df.columns])
This method also gives you the option to add custom Python logic within the alias() call, like: "prefix_"+c+"_suffix" if c in list_of_cols_to_change else c
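A minimal sketch of that conditional variant; list_of_cols_to_change is a hypothetical list of the columns you actually want to rename:
from pyspark.sql import functions as F

list_of_cols_to_change = ["testing user"]  # hypothetical: put your target columns here
df_new = df.select([
    F.col(c).alias("prefix_" + c + "_suffix" if c in list_of_cols_to_change else c)
    for c in df.columns
])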
To add prefix or suffix:
Refer to df.columns for the list of columns ([col_1, col_2, ...]) of the dataframe whose columns we want to prefix/suffix.
df.columns
Iterate through the above list and create another list of aliased columns that can be used inside a select expression.
from pyspark.sql.functions import col
select_list = [col(col_name).alias("prefix_" + col_name) for col_name in df.columns]
When using it inside select, do not forget to unpack the list with an asterisk (*). We can assign the result back to the same or a different dataframe for later use.
df.select(*select_list).show()
df = df.select(*select_list)
df.columns will now return the list of new (aliased) columns.
If you would like to add a prefix or suffix to multiple columns in a pyspark dataframe, you could use a for loop and .withColumnRenamed().
As an example, you might like:
def add_prefix(sdf, prefix):
    for c in sdf.columns:
        sdf = sdf.withColumnRenamed(c, '{}{}'.format(prefix, c))
    return sdf
You can amend sdf.columns as you see fit.
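A quick usage sketch (the prefix string here is just an example):
df_prefixed = add_prefix(df, 'my_')
df_prefixed.columns  # every column name now starts with 'my_'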
You can use the withColumnRenamed method of the dataframe to create a new dataframe:
df.withColumnRenamed('testing user', '`testing user`')
Edit: suppose you have a list of columns, you can do something like:
old = "First Last Age"
new = ["`"+field+"`" for field in old.split()]
df.rdd.toDF(new)
Output:
DataFrame[`First`: string, `Last`: string, `Age`: string]
Here is how one can solve similar problems:
df.select([col(col_name).alias('prefix' + col_name + 'suffix') for col_name in df.columns])
I had a dataframe that I duplicated twice and then joined together. Since both had the same column names, I used:
from functools import reduce

df = reduce(lambda df, idx: df.withColumnRenamed(list(df.schema.names)[idx],
                                                 list(df.schema.names)[idx] + '_prec'),
            range(len(list(df.schema.names))),
            df)
Every column in my dataframe then had the '_prec' suffix, which allowed me to do sweet stuff.

Pyspark filter dataframe by a comparison between date and string datatype

I have a dataframe in pyspark with the following construction:
DataFrame[Urlaubdate: string, Vacationdate: date, Datensatz: string, Jobname: string]
Now, I would like to filter the dataframe by comparing Vacationdate with Urlaubdate; unfortunately they have different datatypes. I would like to filter the rows where Vacationdate is greater than Urlaubdate.
Do you have an idea how to do that?
I think in this case you would have to use a user-defined function, as follows:
from datetime import datetime

from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

def compare(urlaubdate, vacationdate):
    # do your comparison here (cast types if necessary);
    # the format below assumes Urlaubdate strings look like 'yyyy-MM-dd' -- adjust it to your data
    return vacationdate > datetime.strptime(urlaubdate, '%Y-%m-%d').date()

# define a udf out of your function
compare_udf = udf(compare, BooleanType())

# filter your dataframe based on it
df_filtered = df.filter(compare_udf(df.Urlaubdate, df.Vacationdate))
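As an alternative sketch that avoids a UDF, assuming the Urlaubdate strings are in 'yyyy-MM-dd' format (adjust the pattern to your data), you can cast the string column with to_date and compare the columns directly:
from pyspark.sql import functions as F

# assumed format 'yyyy-MM-dd' for the Urlaubdate strings
df_filtered = df.filter(F.col('Vacationdate') > F.to_date(F.col('Urlaubdate'), 'yyyy-MM-dd'))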

Updating a dataframe column in spark

Looking at the new spark DataFrame API, it is unclear whether it is possible to modify dataframe columns.
How would I go about changing a value in row x column y of a dataframe?
In pandas this would be:
df.ix[x,y] = new_value
Edit: Consolidating what was said below, you can't modify the existing dataframe as it is immutable, but you can return a new dataframe with the desired modifications.
If you just want to replace a value in a column based on a condition, like np.where:
from pyspark.sql import functions as F
update_func = (F.when(F.col('update_col') == replace_val, new_value)
               .otherwise(F.col('update_col')))
df = df.withColumn('new_column_name', update_func)
If you want to perform some operation on a column and create a new column that is added to the dataframe:
import pyspark.sql.functions as F
import pyspark.sql.types as T
def my_func(col):
    # do stuff to the column value here
    transformed_value = col  # placeholder: replace with your own transformation
    return transformed_value
# if we assume that my_func returns a string
my_udf = F.UserDefinedFunction(my_func, T.StringType())
df = df.withColumn('new_column_name', my_udf('update_col'))
If you want the new column to have the same name as the old column, you could add the additional step:
df = df.drop('update_col').withColumnRenamed('new_column_name', 'update_col')
While you cannot modify a column as such, you may operate on a column and return a new DataFrame reflecting that change. For that you'd first create a UserDefinedFunction implementing the operation to apply and then selectively apply that function to the targeted column only. In Python:
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import StringType
name = 'target_column'
udf = UserDefinedFunction(lambda x: 'new_value', StringType())
new_df = old_df.select(*[udf(column).alias(name) if column == name else column for column in old_df.columns])
new_df now has the same schema as old_df (assuming that old_df.target_column was of type StringType as well) but all values in column target_column will be new_value.
Commonly when updating a column, we want to map an old value to a new value. Here's a way to do that in PySpark without UDFs:
# update df[update_col], mapping old_value --> new_value
from pyspark.sql import functions as F

df = df.withColumn(update_col,
                   F.when(df[update_col] == old_value, new_value)
                    .otherwise(df[update_col]))
DataFrames are based on RDDs. RDDs are immutable structures and do not allow updating elements in place. To change values, you will need to create a new DataFrame by transforming the original one, either using the SQL-like DSL or RDD operations like map.
A highly recommended slide deck: Introducing DataFrames in Spark for Large Scale Data Science.
Just as maasg says, you can create a new DataFrame from the result of a map applied to the old DataFrame. An example for a given DataFrame df with two columns:
val newDf = sqlContext.createDataFrame(df.map(row =>
  Row(row.getInt(0) + SOMETHING, applySomeDef(row.getAs[Double]("y")))), df.schema)
Note that if the types of the columns change, you need to give it a correct schema instead of df.schema. Check out the api of org.apache.spark.sql.Row for available methods: https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Row.html
[Update] Or using UDFs in Scala:
import org.apache.spark.sql.functions._
val toLong = udf[Long, String] (_.toLong)
val modifiedDf = df.withColumn("modifiedColumnName", toLong(df("columnName"))).drop("columnName")
and if the column name needs to stay the same you can rename it back:
modifiedDf.withColumnRenamed("modifiedColumnName", "columnName")
Importing col and when from pyspark.sql.functions, and updating the fifth column to an integer (0, 1, 2) based on its string value ("string a", "string b", "string c") in a new DataFrame:
from pyspark.sql.functions import col, when
data_frame_temp = data_frame.withColumn("col_5",
                                        when(col("col_5") == "string a", 0)
                                        .when(col("col_5") == "string b", 1)
                                        .otherwise(2))
