Hi, I need to subtract all column values between two PySpark DataFrames, like this:
df1:
col1 col2 ... col100
1 2 ... 100
df2:
col1 col2 ... col100
5 4 ... 20
I want the final DataFrame to be df1 - df2:
new df:
col1 col2 ... col100
-4 -2 ... 80
The solution I found subtracts one column at a time, like:
new_df = df1.withColumn('col1', df1['col1'] - df2['col1'])
But I have 101 columns. How can I simply iterate over all of them and avoid writing 101 nearly identical lines?
Any answers are much appreciated!
In short: with 101 columns, how can I traverse all columns and subtract their values?
You can create a for loop to iterate over the columns and create new columns in the dataframe with the subtracted values. Here's one way to do it in PySpark:
columns = df1.columns
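for col in columns:
    df1 = df1.withColumn(col, df1[col] - df2[col])
This overwrites each column of df1 with the subtracted values. Note that Spark can only resolve df2[col] inside df1.withColumn if the two DataFrames have already been joined (for example on a shared key); otherwise the reference to df2 fails with an analysis error.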
Edit (to address @Kay's comments):
The error you're encountering is due to duplicate column names in the output DataFrame. You can resolve it by giving the new columns different names, for example by appending a suffix in withColumn:
columns = df1.columns
for col in columns:
    df1 = df1.withColumn(col + "_diff", df1[col] - df2[col])
That way a "_diff" suffix is added to each new column in the output DataFrame, which avoids the duplicate column name issue.
Within a single select with a Python list comprehension:
columns = df1.columns
df1 = df1.select(*[(df1[col] - df2[col]).alias(col) for col in columns])
Related
I have a large dataset. I want to apply something on all the columns except for 2.
I dropped the 2 columns and created a separate dataframe, then tried merging the dataframes after the operation is applied.
I tried appending, merging, joining the two dataframes but they all created duplicate rows. Appending doubled the row count, and changed the dropped columns.
I just want to add back the 2 columns to the initial dataframe unchanged. Any help?
df= col1 col2 col3... col100
1 2 3 100
df2=df.loc[:,['col2', 'col3']]
df.drop(columns=['col2', 'col3'], inplace=True)
Then do what I needed to do to df.
Now I want to merge df and df2.
Like this:
cols = ['col2', 'col3']
df2 = df[cols]
df.drop(columns=cols, inplace=True)
# do something
df = pd.concat([df, df2], axis=1)
This will work as long as you didn't remove rows from either DataFrame or change their index, because pd.concat with axis=1 aligns rows by index label.
I am trying to make 2 new dataframes by using 2 given dataframe objects:
DF1 = id feature_text length
1 "example text" 12
2 "example text2" 13
....
....
DF2 = id case_num
3 0
....
....
As you can see, both df1 and df2 have a column called "id". However, df1 has all id values, whereas df2 has only some of them. That is, df1 has 3200 rows, each with a unique id value (1~3200), while df2 has only a subset (e.g. id=[3,7,20,...]).
What I want to do is 1) get a merged DataFrame containing all rows whose id values appear in both df1 and df2, and 2) get a DataFrame containing the rows of df1 whose id values do not appear in df2.
I was able to find a solution for 1), but I have no idea how to do 2).
Thanks.
For the first case, you could use inner merge:
out = df1.merge(df2, on='id')
For the second case, you could use isin with the negation operator ~, which filters out the rows in df1 whose ids also exist in df2:
out = df1[~df1['id'].isin(df2['id'])]
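As a small worked example of both cases, here is a sketch on made-up data (the values below are hypothetical stand-ins for DF1 and DF2). It also shows an alternative for the second case using merge with indicator=True, which is a different technique from the isin approach above:

```python
import pandas as pd

# Hypothetical sample data standing in for the question's DF1 and DF2.
df1 = pd.DataFrame({"id": [1, 2, 3, 4], "feature_text": ["t1", "t2", "t3", "t4"]})
df2 = pd.DataFrame({"id": [3, 4], "case_num": [0, 1]})

# 1) Rows whose id appears in both DataFrames (inner merge is the default).
both = df1.merge(df2, on="id")

# 2) Rows of df1 whose id does not appear in df2.
only_df1 = df1[~df1["id"].isin(df2["id"])]

# Alternative for 2): a left merge with an indicator column, keeping "left_only" rows.
merged = df1.merge(df2[["id"]], on="id", how="left", indicator=True)
only_df1_alt = merged[merged["_merge"] == "left_only"].drop(columns="_merge")

print(both)
print(only_df1)
```

The isin version is shorter; the indicator version may be handier when the match depends on more than one key column.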
I have an output that counts the number of NA values in my DataFrame using this logic:
df.isna().sum()
col1 8
col2 0
I would like the same thing for duplicates, but I don't see a whole-DataFrame approach, only column by column.
How can I use something like
df.duplicated().any().sum()
without specifying each column, as in df['col1'].duplicated().any().sum()?
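One possible approach, offered as a sketch rather than a definitive answer: df.duplicated() flags whole rows that repeat, so a per-column count needs Series.duplicated applied to each column, which DataFrame.apply can do. The sample data below is made up for illustration:

```python
import pandas as pd

# Hypothetical sample data: col1 has repeated values, col2 does not.
df = pd.DataFrame({"col1": [1, 1, 2, 2, 2], "col2": [1, 2, 3, 4, 5]})

# For each column, count the values that repeat an earlier value in that column.
dup_counts = df.apply(lambda s: s.duplicated().sum())

# For each column, just report whether it contains any duplicate at all.
has_dups = df.apply(lambda s: s.duplicated().any())

print(dup_counts)
print(has_dups)
```

Note that duplicated() does not flag the first occurrence by default, so a value appearing three times contributes two to the count.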
I have the same dataset but over different weeks (so later weeks contain new rows). I want to append the new rows to the original dataframe to create one big dataframe with all unique rows and no duplicates. I can't just take the last week because some get deleted over the weeks.
I tried to use the following code but somehow my final_info dataframe still contains some non-unique values
final_info = data[list(data.keys())[-1]]['all_info']
for week in reversed(data.keys()):
    df_diff = pd.concat([data[week]['all_info'], final_info]).drop_duplicates(subset='project_slug',
                                                                              keep=False)
    final_info = final_info.append(df_diff).reset_index(drop=True)
Does somebody see where it goes wrong?
If I understand your question, you are just trying to add the unique rows from one DataFrame to another. I don't think there is any need to iterate through the keys like you are doing. There is an example in a related question that I think can help you, and I think it is conceptually easier to follow. I'll try to walk through an example to make it clearer.
So if you have a dataframe A:
col1 col2
1 2
2 3
3 4
and a dataframe B:
col1 col2
1 2
2 3
6 4
These two DataFrames share the same first two rows but differ in the last row. To get all the unique rows into one DataFrame, you could first find the rows that appear in only one of them. In this example, that means the row unique to DataFrame B; let's call the result df_diff. The code to do this would be:
df_diff = B[~B.col1.isin(A.col1)]
output: col1 col2
6 4
The line above builds what's called a boolean mask and negates it with ~, so you get all rows of DataFrame B whose col1 value is not in DataFrame A.
You could then merge this dataframe, df_diff, with the first dataframe A. We can call this df_full. This step is done with:
df_full = pd.concat([A, df_diff], ignore_index=True)
The ignore_index=True just resets the index of the resulting dataframe. This will give you:
col1 col2
1 2
2 3
3 4
6 4
The resulting DataFrame now has the new row from DataFrame B plus the original rows from DataFrame A.
I think this would work for your situation and may take fewer lines of code.
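Applied to the weekly data in the question, the same idea might look like the sketch below. In the original loop, drop_duplicates with keep=False removes every row that has a match, so df_diff also retains rows that exist only in final_info, and appending those back is a likely source of the repeats. The snapshot contents here are made up, and the sketch assumes the dictionary keys are already in chronological order:

```python
import pandas as pd

# Hypothetical weekly snapshots, mirroring the question's data[week]['all_info'] layout.
data = {
    "week1": {"all_info": pd.DataFrame({"project_slug": ["a", "b"], "value": [1, 2]})},
    "week2": {"all_info": pd.DataFrame({"project_slug": ["b", "c"], "value": [20, 3]})},
}

# Stack all weeks in order (assumed chronological), oldest first.
all_weeks = pd.concat([data[week]["all_info"] for week in data], ignore_index=True)

# Keep one row per project_slug; keep="last" prefers the most recent week's version.
final_info = all_weeks.drop_duplicates(subset="project_slug", keep="last").reset_index(drop=True)

print(final_info)
```

Rows that were deleted in later weeks (like "a" here) survive because they are still present in an earlier snapshot.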
I have two DataFrames, df1 and df2, and I want only the unmatched column in the result. I tried to do this using SQL, but SQL returns all columns, not just one.
df1
col1|col2|col3
a b c
1 2 3
df2
col1|col2|col3
a b e
1 2 3
What I want is for it to return:
df3
col3
Is it possible to do this in PySpark, or do I have to select each column from both DataFrames and compare them one by one?
If all you need to do is compare the names of columns between two dataframes, I would suggest the following.
# Collect the names in df2 that differ, position by position, from those in df1
diff_cols = [name_2 for name_1, name_2 in zip(df1.schema.names, df2.schema.names) if name_1 != name_2]
# Keep only those columns of df2
df3 = df2.select(*diff_cols)
You didn't really specify which DataFrame's columns you want to show. The solution below (in Scala) shows where the two DataFrames differ at the same row position. It assumes, as in your example, that there are no nulls in the data.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._  // enables the $"col" syntax; assumes a SparkSession named spark
val df11 = df1.withColumn("id", row_number().over(Window.orderBy("col1")))
val df22 = df2.withColumn("id", row_number().over(Window.orderBy("col1")))
val df_join = df11.join(df22.selectExpr("col1 as col11", "col2 as col22", "col3 as col33", "id"), Seq("id"), "inner")
df_join.select(when($"col1" === $"col11", null).otherwise(col("col1")), when($"col2" === $"col22", null).otherwise(col("col2")), when($"col3" === $"col33", null).otherwise(col("col3"))).show