Ambiguous columns error in pyspark while iteratively joining dataframes - python
I am currently writing code to left-join two dataframes multiple times, iteratively, based on a set of columns corresponding to the two dataframes on each iteration. For one iteration it works fine, but on the second iteration I get an ambiguous-columns error.
These are the sample dataframes I am working with:
sample_data = [("Amit","","Gupta","36678","M",4000),
("Anita","Mathews","","40299","F",5000),
("Ram","","Aggarwal","42124","M",5000),
("Pooja","Anne","Goel","39298","F",5000),
("Geeta","Banuwala","Brown","12345","F",-2)
]
sample_schema = StructType([
StructField("firstname",StringType(),True),
StructField("middlename",StringType(),True),
StructField("lastname",StringType(),True),
StructField("id", StringType(), True),
StructField("gender", StringType(), True),
StructField("salary", IntegerType(), True)
])
df1 = spark.createDataFrame(data = sample_data, schema = sample_schema)
sample_data = [("Amit", "ABC","MTS","36678",10),
("Ani", "DEF","CS","40299",200),
("Ram", "ABC","MTS","421",40),
("Pooja", "DEF","CS","39298",50),
("Geeta", "ABC","MTS","12345",-20)
]
sample_schema = StructType([
StructField("firstname",StringType(),True),
StructField("Company",StringType(),True),
StructField("position",StringType(),True),
StructField("id", StringType(), True),
StructField("points", IntegerType(), True)
])
df2 = spark.createDataFrame(data = sample_data, schema = sample_schema)
The code I used for this is
from pyspark.sql.functions import lit, col

def joint_left_custom(df1, df2, cols_to_join, cols_df1_to_keep, cols_df2_to_keep):
    resultant_df = None
    df1_cols = df1.columns
    df2 = df2.withColumn("flag", lit(True))
    for i in range(len(cols_to_join)):
        joined_df = df1.join(df2, [(df1[col_1] == df2[col_2]) for col_1, col_2 in cols_to_join[i].items()], 'left')
        joined_df = joined_df.select(*[df1[column] if column in cols_df1_to_keep else df2[column] for column in cols_df1_to_keep + cols_df2_to_keep])
        df1 = (joined_df
               .filter("flag is NULL")
               .select(df1_cols)
               )
        resultant_df = (joined_df.filter(col("flag") == True) if i == 0
                        else resultant_df.filter(col("flag") == True).union(resultant_df)
                        )
    return resultant_df
cols_to_join = [{"id": "id"}, {"firstname":"firstname"}]
cols_df1_to_keep = ["firstname", "middlename", "lastname", "id", "gender", "salary"]
cols_df2_to_keep = ["company", "position", "points"]
x = joint_left_custom(df1, df2, cols_to_join, cols_df1_to_keep, cols_df2_to_keep)
It works fine for a single run, but on the second iteration, when the rows that were not joined on column "id" in the first iteration are joined again on column "firstname", it throws the following error:
Column position#29518, company#29517, points#29520 are ambiguous. It's probably because you joined several Datasets together, and some of these Datasets are the same. This column points to one of the Datasets but Spark is unable to figure out which one. Please alias the Datasets with different names via Dataset.as before joining them, and specify the column using qualified name, e.g. df.as("a").join(df.as("b"), $"a.id" > $"b.id"). You can also set spark.sql.analyzer.failAmbiguousSelfJoin to false to disable this check.
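For reference, the aliasing workaround that the error message itself suggests looks roughly like this in PySpark (a minimal sketch reusing the sample columns above, not the full iterative logic):
from pyspark.sql import functions as F

# Alias both sides so every column reference is fully qualified and unambiguous
aliased = (df1.alias("a")
              .join(df2.alias("b"), F.col("a.id") == F.col("b.id"), "left")
              .select(F.col("a.firstname"), F.col("a.id"), F.col("b.points")))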
This is an example of how you can do an OR-conditional join.
df1.join(df2, on=(df1.id == df2.id) | (df1.firstname == df2.firstname), how='left')
To make the condition dynamic, you can use reduce to chain the conditions.
from functools import reduce
from pyspark.sql import functions as F

def chain_join_cond(prev, value):
    (lcol, rcol) = list(value.items())[0]
    return prev | (df1[lcol] == df2[rcol])

# If your condition is OR, use False as the initial condition.
# If your condition is AND, use True as the initial condition (and use & to concatenate the conditions).
cond = reduce(chain_join_cond, cols_to_join, F.lit(False))

# Use cond for the `on` option in join:
# df1.join(df2, on=cond, how='left')
Then, to get a specific set of columns from df1 or df2, use list comprehensions to generate the select statement.
df = (df1.join(df2, on=cond, how='left')
.select(*[df1[c] for c in cols_df1_to_keep], *[df2[c] for c in cols_df2_to_keep]))
If you have cols_to_join as tuples instead of dicts, you can simplify the code slightly.
cols_to_join = [("id", "id"), ("firstname", "firstname")]
cond = reduce(lambda p, v: p | (df1[v[0]] == df2[v[1]]), cols_to_join, F.lit(False))
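Putting the pieces together, a minimal end-to-end sketch (assuming the sample df1/df2 and the column lists defined in the question) might look like this:
from functools import reduce
from pyspark.sql import functions as F

cols_to_join = [("id", "id"), ("firstname", "firstname")]
cols_df1_to_keep = ["firstname", "middlename", "lastname", "id", "gender", "salary"]
cols_df2_to_keep = ["Company", "position", "points"]

# One OR-chained condition replaces the iterative join
cond = reduce(lambda p, v: p | (df1[v[0]] == df2[v[1]]), cols_to_join, F.lit(False))

result = (df1.join(df2, on=cond, how="left")
             .select(*[df1[c] for c in cols_df1_to_keep],
                     *[df2[c] for c in cols_df2_to_keep]))
result.show()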
Related
Can I use method chaining to delete rows with an if-else condition?
Right now my df looks like this (I shortened it because there are 20 rows).
import pandas as pd

df = pd.DataFrame({'Country': ["Uruguay", "Italy", "France"], 'Winner': ["Uruguay", "Italy", "France"]})

def f(row):
    if row['Country'] in row['Winner']:
        val = False
    else:
        val = True
    return val

df["AwayTeam"] = df.apply(f, axis=1)
df
I want to delete all the rows where AwayTeam=False. Everything was fine, until I was told that I need to build a method chain.
# Not chained
df.drop(df[df['AwayTeam'] == False].index, inplace=True)
df = df.drop("AwayTeam", axis=1)
df  # Done for now
This is what I tried:
df = (
    df.drop(df[df['AwayTeam'] == False].index, inplace=True)
      .drop("AwayTeam", axis=1)
)
df
You need to remove the inplace argument.
df = df.drop(df[df["AwayTeam"] == False].index).drop("AwayTeam", axis=1)
If it's set to True, the drop method does the operation in place and returns None, which means your method chain will evaluate like this:
df = None.drop("AwayTeam", axis=1)
When it's False (which is the default value), the operation always returns a copy of the dataframe, so you can apply other methods to it (method chaining).
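For reference, a minimal runnable sketch with the sample frame from the question (the lambda stands in for the original f helper):
import pandas as pd

df = pd.DataFrame({'Country': ["Uruguay", "Italy", "France"],
                   'Winner': ["Uruguay", "Italy", "France"]})
df["AwayTeam"] = df.apply(lambda row: row['Country'] not in row['Winner'], axis=1)

# Both drops return a new DataFrame, so they chain cleanly
df = (df.drop(df[df["AwayTeam"] == False].index)
        .drop("AwayTeam", axis=1))
print(df)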
Check if column values exist in a different dataframe
I have a pandas DataFrame 'df' with x rows, and another pandas DataFrame 'df2' with y rows (x < y). I want to return the indexes where the values of df['Farm'] equal the values of df2['Fields'], in order to add the respective 'Manager' to df. The code I have is as follows:
data2 = [['field1', 'Paul G'], ['field2', 'Mark R'], ['field3', 'Roy Jr']]
data = [['field1'], ['field2']]
columns = ['Field']
columns2 = ['Field', 'Manager']
df = pd.DataFrame(data, columns=columns)
df2 = pd.DataFrame(data2, columns=columns2)
farmNames = df['Farm']
exists = farmNames.reset_index(drop=True) == df1['Field'].reset_index(drop=True)
This returns the error message:
ValueError: Can only compare identically-labeled Series objects
Does anyone know how to fix this?
As @NickODell mentioned, you could use a merge, basically a left join. See the code below.
df_new = pd.merge(df, df2, on='Field', how='left')
print(df_new)
Output:
    Field Manager
0  field1  Paul G
1  field2  Mark R
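If only the matching indexes are needed rather than the merged frame, an isin-based sketch on the same sample frames could be an alternative; this is an assumption about the intent, not part of the original answer:
# Indexes of rows in df whose 'Field' value also appears in df2['Field']
matching_idx = df.index[df['Field'].isin(df2['Field'])]
print(matching_idx)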
PySpark join on multiple columns
I have 2 dataframes, and I would like to know whether it is possible to join across multiple columns in a more generic and compact way. For example, this is a very explicit way that is hard to generalize in a function:
df = df1.join(df2, on=[
    (df1.event_date < df2.risk_date) &
    (df1.client_id == df2.client_id_risk) &
    (df1.col_thr_param_1 == df2.col_thr_param_1) &
    (df1.col_thr_param_2 == df2.col_thr_param_2) &
    (df1.col_thr_param_3 == df2.col_thr_param_3) &
    (df1.col_thr_param_4 == df2.col_thr_param_4)
    ], how="left"
)
If I have a list with the names of the threshold columns that I want to join on:
thr = ["col_thr_param_1", "col_thr_param_2", "col_thr_param_3", "col_thr_param_4"]
Is it possible to pass it into a function and generalize the join? Or do I always need to refer to df1 and df2 explicitly? Something like this:
def join_dfs(df1: DataFrame, df2: DataFrame, thr_cols: List[str]):
    df = df1.join(df2, on=[
        (df1.event_date < df2.risk_date) &
        (df1.client_id == df2.client_id_risk) &
        **df1.thr_cols == **df2.thr_cols
        ], how="left"
    )
Ideally you can use aliases together with col() to build the join condition as a list. You can try something like the following:
from pyspark.sql import functions as F

def join_dfs(df1, df2, thr_cols):
    df = df1.alias("df1").join(
        df2.alias("df2"),
        on=[
            F.col("df1.event_date") < F.col("df2.risk_date"),
            F.col("df1.client_id") == F.col("df2.client_id_risk"),
        ] + [F.col(f"df1.{c}") == F.col(f"df2.{c}") for c in thr_cols],
        how="left"
    )
    return df
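A hypothetical call, using the thr list from the question (df1 and df2 are assumed to already contain the event_date, risk_date, client_id, and client_id_risk columns):
thr = ["col_thr_param_1", "col_thr_param_2", "col_thr_param_3", "col_thr_param_4"]
joined = join_dfs(df1, df2, thr)
joined.show()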
How to union Spark SQL Dataframes in Python
Here are several ways of creating a union of dataframes. Which (if any) is best or recommended when we are talking about big dataframes? Should I create an empty dataframe first, or continuously union to the first dataframe created?
Empty dataframe creation:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([
    StructField("A", StringType(), False),
    StructField("B", StringType(), False),
    StructField("C", StringType(), False)
])
pred_union_df = spark_context.parallelize([]).toDF(schema)
Method 1 - Union as you go:
for ind in indications:
    fitted_model = get_fitted_model(pipeline, train_balanced_df, ind)
    pred = get_predictions(fitted_model, pred_output_df, ind)
    pred_union_df = pred_union_df.union(pred[['A', 'B', 'C']])
Method 2 - Union at the end:
all_pred = []
for ind in indications:
    fitted_model = get_fitted_model(pipeline, train_balanced_df, ind)
    pred = get_predictions(fitted_model, pred_output_df, ind)
    all_pred.append(pred)
pred_union_df = pred_union_df.union(all_pred)
Or do I have it all wrong?
Edit: Method 2 was not possible as I thought it would be, based on this answer. I had to loop through the list and union each dataframe.
Method 2 is always preferred since it avoids the long lineage issue. Although DataFrame.union only takes one DataFrame as an argument, SparkContext.union does take a list of RDDs. Given your sample code, you could try to union them before calling toDF. If your data is on disk, you could also try to load it all at once to achieve the union, e.g.:
dataframe = spark.read.csv([path1, path2, path3])
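A small sketch of the union-at-the-end idea using functools.reduce over the collected DataFrames (get_fitted_model, get_predictions and the loop variables are carried over from the question, so their signatures are assumptions):
from functools import reduce

all_pred = []
for ind in indications:
    fitted_model = get_fitted_model(pipeline, train_balanced_df, ind)
    pred = get_predictions(fitted_model, pred_output_df, ind)
    all_pred.append(pred[['A', 'B', 'C']])

# Fold the list of DataFrames into one DataFrame with pairwise unions
pred_union_df = reduce(lambda a, b: a.union(b), all_pred)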
Add RDD to DataFrame Column PySpark
I want to create a DataFrame from the columns of two RDDs. The first is an RDD that I get from a CSV, and the second is another RDD with a cluster prediction for each row. My schema is:
customSchema = StructType([
    StructField("Area", FloatType(), True),
    StructField("Perimeter", FloatType(), True),
    StructField("Compactness", FloatType(), True),
    StructField("Lenght", FloatType(), True),
    StructField("Width", FloatType(), True),
    StructField("Asymmetry", FloatType(), True),
    StructField("KernelGroove", FloatType(), True)])
I map my RDD and create the DataFrame:
FN2 = rdd.map(lambda x: (float(x[0]), float(x[1]), float(x[2]), float(x[3]), float(x[4]), float(x[5]), float(x[6])))
df = sqlContext.createDataFrame(FN2, customSchema)
And my cluster prediction:
result = Kmodel.predict(rdd)
So, to conclude, I want to have in my DataFrame the rows of my CSV and their cluster prediction at the end. I tried to add a new column with .withColumn() but got nothing. Thanks.
If you have a common field in both dataframes, join on that key; otherwise create a unique id and join both dataframes to get the rows of the CSV and their cluster prediction in a single dataframe.
Here is Scala code to generate a unique id for each row; try to convert it for PySpark. You need to generate an increasing row id and join the two sides on that row id.
import org.apache.spark.sql.types.{StructType, StructField, LongType}

val df = sc.parallelize(Seq(("abc", 2), ("def", 1), ("hij", 3))).toDF("word", "count")
val wcschema = df.schema
val inputRows = df.rdd.zipWithUniqueId.map{ case (r: Row, id: Long) => Row.fromSeq(id +: r.toSeq)}
val wcID = sqlContext.createDataFrame(inputRows, StructType(StructField("id", LongType, false) +: wcschema.fields))
Or use a SQL query:
val tmpTable1 = sqlContext.sql("select row_number() over (order by count) as rnk, word, count from wordcount")
tmpTable1.show()
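A rough PySpark translation of the same idea (a sketch under the question's setup: `df` is the DataFrame built from the CSV and `result` is the prediction RDD from Kmodel.predict):
from pyspark.sql.types import StructType, StructField, LongType, IntegerType

# Attach a unique, increasing id to every row of the CSV DataFrame
rows_with_id = df.rdd.zipWithUniqueId().map(lambda x: (x[1],) + tuple(x[0]))
schema_with_id = StructType([StructField("id", LongType(), False)] + df.schema.fields)
df_with_id = sqlContext.createDataFrame(rows_with_id, schema_with_id)

# Do the same for the prediction RDD, then join the two on the id
# (note: this relies on both RDDs having the same partitioning and order)
pred_schema = StructType([StructField("id", LongType(), False),
                          StructField("prediction", IntegerType(), False)])
pred_with_id = sqlContext.createDataFrame(
    result.zipWithUniqueId().map(lambda x: (x[1], int(x[0]))), pred_schema)

final_df = df_with_id.join(pred_with_id, on="id", how="inner")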