I have two data frames, df1 and df2, and I want only the unmatched column in the result. I tried to do it using SQL, but SQL returns all columns, not one.
df1
col1 | col2 | col3
a    | b    | c
1    | 2    | 3
df2
col1 | col2 | col3
a    | b    | e
1    | 2    | 3
What I want is for it to return:
df3
col3
Is it possible to do this in PySpark, or do I have to select each column from both data frames and compare them one by one?
If all you need to do is compare the names of columns between two dataframes, I would suggest the following.
# collect the column names that differ position-wise between df1 and df2
diff_names = [name_2 for name_1, name_2 in zip(df1.schema.names, df2.schema.names) if name_1 != name_2]
# build df3 from the mismatched columns of df2
df3 = df2.select(diff_names)
You didn't specify from which dataframe you want to show the columns. The solution below (written in Scala Spark syntax) shows where you have differences at the same row level between the two dataframes, assuming, as in your dfs, that there are no nulls to begin with.
val df11 = df1.withColumn("id", row_number().over(Window.orderBy("col1")))
val df22 = df2.withColumn("id", row_number().over(Window.orderBy("col1")))
val df_join = df11.join(df22.selectExpr("col1 as col11", "col2 as col22", "col3 as col33", "id"), Seq("id"), "inner")
df_join.select(
  when($"col1" === $"col11", null).otherwise(col("col1")),
  when($"col2" === $"col22", null).otherwise(col("col2")),
  when($"col3" === $"col33", null).otherwise(col("col3"))
).show
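Since the question asks for PySpark, here is a rough PySpark equivalent of the Scala snippet above. It is only a sketch: it assumes ordering by col1 gives a stable row_number, just like the original.
from pyspark.sql import Window
from pyspark.sql.functions import row_number, when, col

# add a positional id to both dataframes so rows can be compared pairwise
w = Window.orderBy("col1")
df11 = df1.withColumn("id", row_number().over(w))
df22 = df2.withColumn("id", row_number().over(w))

# rename df2's columns so they can sit next to df1's after the join
df_join = df11.join(
    df22.selectExpr("col1 as col11", "col2 as col22", "col3 as col33", "id"),
    on="id", how="inner",
)

# keep a value only where the two dataframes disagree, otherwise null
df_join.select(
    when(col("col1") == col("col11"), None).otherwise(col("col1")).alias("col1"),
    when(col("col2") == col("col22"), None).otherwise(col("col2")).alias("col2"),
    when(col("col3") == col("col33"), None).otherwise(col("col3")).alias("col3"),
).show()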
Hi, I have a case where I need to subtract all column values between two PySpark dataframes, like this:
df1:
col1 col2 ... col100
1 2 ... 100
df2:
col1 col2 ... col100
5 4 ... 20
And I want to get the final dataframe with df1 - df2 :
new df:
col1 col2 ... col100
-4 -2 ... 80
I found that a possible solution is to subtract two columns like:
new_df = df1.withColumn('col1', df1['col1'] - df2['col1'])
But I have 101 columns. How can I simply traverse all of them and avoid writing 101 similar expressions?
Any answers are super appreciated!
For 101 columns, how can I simply traverse all of them and subtract their values?
You can create a for loop to iterate over the columns and create new columns in the dataframe with the subtracted values. Here's one way to do it in PySpark:
columns = df1.columns
for col in columns:
    df1 = df1.withColumn(col, df1[col] - df2[col])
This will create a new dataframe with the subtracted values for each column.
Edit: (to address #Kay's comments)
The error you're encountering is due to a duplicate column name in the output dataframe. You can resolve this by using a different name for the new columns in the output dataframe, for example by adding a suffix in the withColumn call:
columns = df1.columns
for col in columns:
    df1 = df1.withColumn(col + "_diff", df1[col] - df2[col])
That way you will add a suffix "_diff" to the new columns in the output dataframe to avoid the duplicate column name issue.
Within a single select, using a Python list comprehension:
columns = df1.columns
df1 = df1.select(*[(df1[col] - df2[col]).alias(col) for col in columns])
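Note that Spark normally won't let you reference columns from two unrelated DataFrames in one expression, so in practice the frames need to be joined first. Below is a minimal, self-contained sketch of the list-comprehension approach; the row_number-based join key and the ordering column are assumptions of this sketch, not part of the original question.
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import row_number

spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame([(1, 2, 100)], ["col1", "col2", "col100"])
df2 = spark.createDataFrame([(5, 4, 20)], ["col1", "col2", "col100"])

# give each row a positional id so the two frames can be joined row by row
w = Window.orderBy("col1")
a = df1.withColumn("_id", row_number().over(w))
b = df2.withColumn("_id", row_number().over(w))
# rename df2's columns so the joined frame has no duplicate names
b = b.select("_id", *[b[c].alias(c + "_r") for c in df2.columns])

joined = a.join(b, on="_id")
result = joined.select(*[(joined[c] - joined[c + "_r"]).alias(c) for c in df1.columns])
result.show()   # col1=-4, col2=-2, col100=80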
I am trying to make 2 new dataframes by using 2 given dataframe objects:
DF1:
id  feature_text      length
1   "example text"    12
2   "example text2"   13
....
DF2:
id  case_num
3   0
....
As you can see, both df1 and df2 have a column called "id". However, df1 has all id values, whereas df2 only has some of them: df1 has 3200 rows, each with a unique id value (1~3200), while df2 contains only a subset of those ids (e.g. id=[3,7,20,...]).
What I want to do is 1) get a merged dataframe which contains all rows whose id values appear in both df1 and df2, and 2) get a dataframe which contains the rows of df1 whose id values do not appear in df2.
I was able to find a solution for 1), however, have no idea how to do 2).
Thanks.
For the first case, you could use inner merge:
out = df1.merge(df2, on='id')
For the second case, you could use isin with the negation operator ~, so that we filter out the rows in df1 whose ids also exist in df2:
out = df1[~df1['id'].isin(df2['id'])]
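A minimal, self-contained pandas example of both operations; the data here is made up to mirror the shape described above:
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2, 3, 4],
                    "feature_text": ["example text", "example text2", "t3", "t4"],
                    "length": [12, 13, 2, 2]})
df2 = pd.DataFrame({"id": [3, 4], "case_num": [0, 1]})

# 1) rows whose id appears in both frames (inner merge)
both = df1.merge(df2, on='id')              # ids 3 and 4
# 2) rows of df1 whose id does not appear in df2
only_df1 = df1[~df1['id'].isin(df2['id'])]  # ids 1 and 2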
I want to take the values from df2 when the date is on or after 01/01/2015 and from df3 when it is before 01/01/2015. I'm unsure how to do this as the columns are of different lengths.
I have a main DF df1, and two dataframes containing data called df2 and df3.
df2 looks something like this:
day Value
01/01/2015 5
02/01/2015 6
...
going up to today,
I also have df3, which is the same kind of data but goes from 2000 to today, like:
day Value
01/01/2000 10
02/01/2000 15
...
I want to append a Value column to df1 that takes the values of df3 when the date is before 01/01/2015, and the values of df2 when the date is on or after 01/01/2015. I'm unsure how to do this with a condition. I think there is a way with np.where, but I'm not sure how.
To add more context, I want to codify this statement:
df1[Value] = df2[Value] when the date is on or after 1/1/2015, and df1[Value] = df3[Value] when the date is before 1/1/2015.
If you want to join two dataframes, you can use df1.merge(df2, on="col_name"). If you want to apply a condition to one dataframe and then join it with the other, do it like this:
import pandas as pd

# generate 2 example dataframes
date1 = pd.date_range(start='1/01/2015', end='2/02/2015')
val1 = [2, 3, 5] * 11                 # length equal to the date range
dict1 = {"Date": date1, "Value": val1}
df1 = pd.DataFrame(dict1)
df1  # view df1

date2 = pd.date_range(start='1/01/2000', end='2/02/2020')
val2 = [21, 15] * 3669                # length equal to the date range
dict2 = {"Date": date2, "Value": val2}
df2 = pd.DataFrame(dict2)
df2  # view df2

# apply whatever condition you want and store the result in a different dataframe
df3 = df1[df1["Date"] < "2015-02-01"]

# join the filtered dataframe with the other one on the date
merged = df2.merge(df3, on="Date")
merged  # joined dataframe
This is how you can join two dataframes on a date with a condition: apply the condition, store the result in another dataframe, and join it with the first dataframe via df.merge(df3, on="Date").
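Since the question mentions np.where, here is an alternative sketch of that idea: merge both value series onto df1 by date, then pick one value per row with a date cutoff. This assumes df1 has a datetime "day" column and that df2/df3 both have "day" and "Value" columns as in the question; the "_df2"/"_df3" suffixes are just names chosen for this sketch.
import numpy as np
import pandas as pd

# bring both Value columns alongside df1's rows, keyed on the date
tmp = (df1.merge(df2, on="day", how="left")
          .merge(df3, on="day", how="left", suffixes=("_df2", "_df3")))

# take df2's value on/after the cutoff, df3's value before it
cutoff = pd.Timestamp("2015-01-01")
df1["Value"] = np.where(tmp["day"] >= cutoff, tmp["Value_df2"], tmp["Value_df3"])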
I'm merging two pretty large data frames: the shape of RD_4ML is (97058, 24), while the shape of NewDF is (104047, 3). They share a common column called 'personUID'; below is the merge code I used.
Final_DF = RD_4ML.merge(NewDF, how='left', on='personUID')
Final_DF.fillna('none', inplace=True)
Final_DF.sample()
DF sample output:
personUID | code  | Death | diagnosis_code_type | lr
abc123    | ICD10 | 1     | none                | none
Essentially, the columns from RD_4ML populate correctly, while the two columns from NewDF return "none" values. Does anyone know how to solve an issue like this?
I think the 'personUID' column does not match between the two dataframes.
Ensure that it has the same data type in both.
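For example, a quick way to check and align the key's dtype before merging; this is only a sketch using the frame names from the question, and the cast to string is an assumption:
# compare the dtype of the join key in both frames
print(RD_4ML['personUID'].dtype, NewDF['personUID'].dtype)

# if they differ (e.g. int64 vs. object), cast both to a common type before merging
RD_4ML['personUID'] = RD_4ML['personUID'].astype(str).str.strip()
NewDF['personUID'] = NewDF['personUID'].astype(str).str.strip()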
Merging with how='left' takes every entry from the left dataframe and tries to find a corresponding matching id in the right dataframe. For all non-matching ones, it fills in NaNs for the columns coming from the right frame. In SQL this is called a left join. As an example, have a look at this:
import pandas as pd

df1 = pd.DataFrame({"uid": range(4), "values": range(4)})
df2 = pd.DataFrame({"uid": range(5, 9), "values2": range(4)})
df1.merge(df2, how="left", on="uid")
# OUTPUT
uid values values2
0 0 0 NaN
1 1 1 NaN
2 2 2 NaN
3 3 3 NaN
Here you see that all uids from the left dataframe end up in the merged dataframe, and since no matching entry was found, the column from the right dataframe is set to NaN.
If your goal is to end up with only the rows that have a match, you can change "left" to "inner". For more information, have a look at the pandas docs.
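For instance, with the same toy frames as above (the result is empty here only because no uid overlaps; it is just to illustrate the difference):
df1.merge(df2, how="inner", on="uid")
# Empty DataFrame
# Columns: [uid, values, values2]
# Index: []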
I have two dataframes and I would like to concatenate them based on time ranges.
For example:
dataframe A
user timestamp product
A 2015/3/13 1
B 2015/3/15 2
dataframe B
user time behavior
A 2015/3/1 2
A 2015/3/8 3
A 2015/3/13 1
B 2015/3/1 2
I would like to concatenate the two dataframes as below (frame B left-joined to frame A).
Column "time1" must fall within the 7 days before column "timestamp":
for example, when timestamp is 3/13, then 3/6-3/13 is in the range;
otherwise, don't concatenate.
user timestamp product time1 behavior
A 2015/3/13 1 2015/3/8 3
A 2015/3/13 1 2015/3/13 1
B 2015/3/15 2 NaN NaN
The SQL code would look like:
select * from
B left join A
on B.user = A.user
where B.time >= A.timestamp - interval 7 day
  and B.time <= A.timestamp;
-- i.e. WHERE B.time BETWEEN DATE_SUB(A.timestamp, INTERVAL 7 DAY) AND A.timestamp
How can we do this in Python? I can only think of the following and don't know how to handle the time condition:
new = pd.merge(A, B, on='user', how='left')
thanks and sorry..
A few steps are required to solve this.
from datetime import timedelta
First, convert your timestamps to pandas datetime (df1 refers to dataframe A and df2 refers to dataframe B):
df1['timestamp'] = pd.to_datetime(df1['timestamp'])
df2['time'] = pd.to_datetime(df2['time'])
Merge as follows (based on your final dataset, I think your left join is more of a right join):
df3 = pd.merge(df1, df2, on='user', how='left')
Get your final df:
df4 = df3[(df3.time>=df3.timestamp-timedelta(days=7)) & (df3.time<=df3.timestamp)]
The row containing NaN is missing, and this is because of how conditional joins have to be done in pandas.
Conditional joins are not a pandas feature yet; a way to get past that is to filter after a join.
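If you also want to keep users from dataframe A that end up with no match in the window (the NaN row in the expected output), one option, sketched here on top of the df4 from above, is to merge the matched rows back onto df1:
# keep only the matched time/behavior pairs, then re-attach them to every row of df1
matches = df4[['user', 'time', 'behavior']]
final = df1.merge(matches, on='user', how='left')   # user B gets NaT/NaN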
Here's one solution that relies on two merges - first, to narrow down dataframe B (df2), and then to produce the desired output:
We can read in the example dataframes with read_clipboard():
import pandas as pd
# copy dataframe A data, then:
df1 = pd.read_clipboard(parse_dates=['timestamp'])
# copy dataframe B data, then:
df2 = pd.read_clipboard(parse_dates=['time'])
Merge and filter:
# left merge df2, df1
tmp = df2.merge(df1, on='user', how='left')
# then drop rows which disqualify based on timestamp
mask = tmp.time < (tmp.timestamp - pd.Timedelta(days=7))
tmp.loc[mask, ['time', 'behavior']] = None
tmp = tmp.dropna(subset=['time']).drop(['timestamp','product'], axis=1)
# now perform final merge
merged = df1.merge(tmp, on='user', how='left')
Output:
user timestamp product time behavior
0 A 2015-03-13 1 2015-03-08 3.0
1 A 2015-03-13 1 2015-03-13 1.0
2 B 2015-03-15 2 NaT NaN