I have two PySpark DataFrames with different numbers of rows. I am trying to compare the values in all columns by joining the two DataFrames on multiple keys, so that I can find the records whose values differ and the records whose values match.
#df1:
+---+----+---+-----+
| id| age|sex|value|
+---+----+---+-----+
|  1|  23|  M|  8.4|
|  2|   4|  M|    2|
|  3|  16|  F|  4.1|
|  4|  60|  M|    4|
|  5|null|  F|    5|
+---+----+---+-----+
#df2:
+---+---+----+-----+
| id|age| sex|value|
+---+---+----+-----+
|  1| 23|   M|  8.4|
|  2|  4|null|    2|
|  4| 13|   M|  3.1|
|  5| 34|   F|  6.2|
+---+---+----+-----+
#joining df1 and df2 on multiple keys
same=df1.join(df2, on=['id','age','sex','value'], how='inner')
Please note that the DataFrames above are just samples. My real data has around 25 columns and 100k+ rows, and when I tried the join, the Spark job ran for a long time and never finished.
Does anyone have good advice on how to compare two DataFrames and find the records whose column values differ, either with a join or with another method?
Use hashing.
from pyspark.sql.functions import hash
df1 = spark.createDataFrame([('312312','151132'),('004312','12232'),('','151132'),('013vjhrr134','111232'),(None,'151132'),('0fsgdhfjgk','151132')],
("Fruits", "Meat"))
df1 = df1.withColumn('hash_value', hash("Fruits", "Meat"))
df = spark.createDataFrame([('312312','151132'),('000312','151132'),('','151132'),('013vjh134134','151132'),(None,'151132'),('0fsgdhfjgk','151132')],
("Fruits", "Meat"))
df = df.withColumn('hash_value', hash("Fruits", "Meat"))
df.show()
+------------+------+-----------+
| Fruits| Meat| hash_value|
+------------+------+-----------+
| 312312|151132| -344340697|
| 000312|151132| -548650515|
| |151132|-2105905448|
|013vjh134134|151132| 2052362224|
| null|151132| 598159392|
| 0fsgdhfjgk|151132| 951458223|
+------------+------+-----------+
df1.show()
+-----------+------+-----------+
| Fruits| Meat| hash_value|
+-----------+------+-----------+
| 312312|151132| -344340697|
| 004312| 12232| 76821046|
| |151132|-2105905448|
|013vjhrr134|111232| 1289730088|
| null|151132| 598159392|
| 0fsgdhfjgk|151132| 951458223|
+-----------+------+-----------+
Alternatively, you can use SHA-2 for the same purpose:
from pyspark.sql.functions import sha2, concat_ws
df1.withColumn("row_sha2", sha2(concat_ws("||", *df.columns), 256)).show(truncate=False)
+-----------+------+----------------------------------------------------------------+
|Fruits |Meat |row_sha2 |
+-----------+------+----------------------------------------------------------------+
|312312 |151132|7be3824bcaa5fa29ad58df2587d392a1cc9ca5511ef01005be6f97c9558d1eed|
|004312 |12232 |c7fcf8031a17e5f3168297579f6dc8a6f17d7a4a71939d6b989ca783f30e21ac|
| |151132|68ea989b7d33da275a16ff897b0ab5a88bc0f4545ec22d90cee63244c1f00fb0|
|013vjhrr134|111232|9c9df63553d841463a803c64e3f4a8aed53bcdf78bf4a089a88af9e91406a226|
|null |151132|83de2d466a881cb4bb16b83665b687c01752044296079b2cae5bab8af93db14f|
|0fsgdhfjgk |151132|394631bbd1ccee841d3ba200806f8d0a51c66119b13575cf547f8cc91066c90d|
+-----------+------+----------------------------------------------------------------+
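This creates a fingerprint for each row. Rows whose values are identical get the same hash value, so you can join the two DataFrames on the hash column and compare. Keep in mind that hash returns a 32-bit integer, so unrelated rows can occasionally collide; SHA-2 makes that far less likely.
For example, using joins on hash_value: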
df1.join(df, "hash_value", "inner").show()
+-----------+----------+------+----------+------+
| hash_value| Fruits| Meat| Fruits| Meat|
+-----------+----------+------+----------+------+
|-2105905448| |151132| |151132|
| -344340697| 312312|151132| 312312|151132|
| 598159392| null|151132| null|151132|
| 951458223|0fsgdhfjgk|151132|0fsgdhfjgk|151132|
+-----------+----------+------+----------+------+
df1.join(df, "hash_value", "outer").show()
+-----------+-----------+------+------------+------+
| hash_value| Fruits| Meat| Fruits| Meat|
+-----------+-----------+------+------------+------+
|-2105905448| |151132| |151132|
| -548650515| null| null| 000312|151132|
| -344340697| 312312|151132| 312312|151132|
| 76821046| 004312| 12232| null| null|
| 598159392| null|151132| null|151132|
| 951458223| 0fsgdhfjgk|151132| 0fsgdhfjgk|151132|
| 1289730088|013vjhrr134|111232| null| null|
| 2052362224| null| null|013vjh134134|151132|
+-----------+-----------+------+------------+------+
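One caveat when building the string that gets hashed: concat_ws skips null inputs, so two rows whose nulls sit in different columns can end up with the same concatenated string and therefore the same SHA-2 value. The plain-Python sketch below (no Spark involved; the function names and the "<NULL>" placeholder are hypothetical, chosen only for illustration) shows the collision and one possible way around it, namely substituting an explicit marker for nulls before hashing. In Spark, something similar could probably be done by wrapping each column in coalesce before concat_ws, assuming the marker never occurs in the real data.

```python
import hashlib

SEP = "||"
NULL_MARKER = "<NULL>"  # hypothetical placeholder; must not occur in real data

def fingerprint_skipping_nulls(row):
    # Mimics a separator-join that silently drops None values before hashing.
    joined = SEP.join(v for v in row if v is not None)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

def row_fingerprint(row):
    # Replaces None with an explicit marker so column positions are preserved.
    joined = SEP.join(NULL_MARKER if v is None else v for v in row)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

a = (None, "151132")
b = ("151132", None)

# Different rows, same fingerprint when nulls are skipped.
collide = fingerprint_skipping_nulls(a) == fingerprint_skipping_nulls(b)
# Different rows, different fingerprints when nulls are marked.
distinct = row_fingerprint(a) != row_fingerprint(b)
print(collide, distinct)
```

The separator has the same kind of weakness: if a value can contain "||" itself, pick a separator that cannot appear in the data.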
I have a DataFrame (df1) that lists some time frames:
| start | end | event_name |
|-------|-----|------------|
| 1 | 3 | name_1 |
| 3 | 5 | name_2 |
| 2 | 6 | name_3 |
For each of these time frames, I would like to extract some data from another DataFrame (df2). For example, I want to extend df1 with the average measurement from df2 inside the specified time range.
| timestamp | measurement |
|-----------|-------------|
| 1 | 5 |
| 2 | 7 |
| 3 | 5 |
| 4 | 9 |
| 5 | 2 |
| 6 | 7 |
| 7 | 8 |
I was thinking about a UDF that filters df2 by timestamp and computes the average. But inside a UDF I cannot reference a second DataFrame:
def get_avg(start, end):
    return df2.filter((df2.timestamp > start) & (df2.timestamp < end)).agg({"measurement": "avg"})
udf_1 = f.udf(get_avg)
df1.select(udf_1('start', 'end')).show()
This throws TypeError: cannot pickle '_thread.RLock' object.
How can I solve this efficiently?
In this case there is no need for a UDF. You can simply join the two DataFrames on the range interval defined by start and end:
import pyspark.sql.functions as F
df1.join(df2, on=[(df2.timestamp > df1.start) & (df2.timestamp < df1.end)]) \
    .groupby('start', 'end', 'event_name') \
    .agg(F.mean('measurement').alias('avg')) \
    .show()
+-----+---+----------+-----------------+
|start|end|event_name| avg|
+-----+---+----------+-----------------+
| 1| 3| name_1| 7.0|
| 3| 5| name_2| 9.0|
| 2| 6| name_3|5.333333333333333|
+-----+---+----------+-----------------+
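To make the semantics of the range join concrete, here is a plain-Python sketch of the same computation on the sample data (no Spark; the function name range_join_avg is hypothetical). It assumes the strict inequalities used above, so measurements that fall exactly on start or end are excluded, and it drops events with no matching measurement, as an inner join would.

```python
events = [(1, 3, "name_1"), (3, 5, "name_2"), (2, 6, "name_3")]
measurements = [(1, 5), (2, 7), (3, 5), (4, 9), (5, 2), (6, 7), (7, 8)]

def range_join_avg(events, measurements):
    result = []
    for start, end, name in events:
        # Keep measurements strictly inside the open interval (start, end).
        matched = [m for ts, m in measurements if start < ts < end]
        if matched:  # inner-join behaviour: skip events without matches
            result.append((start, end, name, sum(matched) / len(matched)))
    return result

averages = range_join_avg(events, measurements)
for row in averages:
    print(row)
```

If the boundaries should be included, the comparison would become start <= ts <= end, which changes the averages.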
I have a DF in PySpark where I'm trying to explode two columns of arrays. Here's my DF:
+--------+-----+--------------------+--------------------+
| id| zip_| values| time|
+--------+-----+--------------------+--------------------+
|56434459|02138|[1.033990484, 1.0...|[1.624322475139E9...|
|56434508|02138|[1.04760919, 1.07...|[1.624322475491E9...|
|56434484|02138|[1.047177758, 1.0...|[1.62432247655E9,...|
|56434495|02138|[0.989590562, 1.0...|[1.624322476937E9...|
|56434465|02138|[1.051481754, 1.1...|[1.624322477275E9...|
|56434469|02138|[1.026476497, 1.1...|[1.624322477605E9...|
|56434463|02138|[1.10024864, 1.31...|[1.624322478085E9...|
|56434458|02138|[1.011091305, 1.0...|[1.624322478462E9...|
|56434464|02138|[1.038230333, 1.0...|[1.62432247882E9,...|
|56434474|02138|[1.041924752, 1.1...|[1.624322479386E9...|
|56434452|02138|[1.044482358, 1.1...|[1.624322479919E9...|
|56434445|02138|[1.050144598, 1.1...|[1.624322480344E9...|
|56434499|02138|[1.047851812, 1.0...|[1.624322480785E9...|
|56434449|02138|[1.044700917, 1.1...|[1.6243224811E9, ...|
|56434461|02138|[1.03341455, 1.07...|[1.624322481443E9...|
|56434526|02138|[1.04779412, 1.07...|[1.624322481861E9...|
|56434433|02138|[1.0498406, 1.139...|[1.624322482181E9...|
|56434507|02138|[1.0013894403, 1....|[1.624322482419E9...|
|56434488|02138|[1.047270063, 1.0...|[1.624322482716E9...|
|56434451|02138|[1.043182727, 1.1...|[1.624322483061E9...|
+--------+-----+--------------------+--------------------+
only showing top 20 rows
My current solution is to do a posexplode on each column, combined with a concat_ws for a unique ID, creating two DFs.
First DF:
+-----------+-----+-----------+
| new_id| zip_| values_new|
+-----------+-----+-----------+
| 56434459_0|02138|1.033990484|
| 56434459_1|02138| 1.07805057|
| 56434459_2|02138| 1.09000133|
| 56434459_3|02138| 1.07009546|
| 56434459_4|02138|1.102403015|
| 56434459_5|02138| 1.1291009|
| 56434459_6|02138|1.088399924|
| 56434459_7|02138|1.047513142|
| 56434459_8|02138|1.010418795|
| 56434459_9|02138| 1.0|
|56434459_10|02138| 1.0|
|56434459_11|02138| 1.0|
|56434459_12|02138| 0.99048968|
|56434459_13|02138|0.984854524|
|56434459_14|02138| 1.0|
| 56434508_0|02138| 1.04760919|
| 56434508_1|02138| 1.07858897|
| 56434508_2|02138| 1.09084267|
| 56434508_3|02138| 1.07627785|
| 56434508_4|02138| 1.13778706|
+-----------+-----+-----------+
only showing top 20 rows
Second DF:
+-----------+-----+----------------+
| new_id| zip_| values_new|
+-----------+-----+----------------+
| 56434459_0|02138|1.624322475139E9|
| 56434459_1|02138|1.592786475139E9|
| 56434459_2|02138|1.561164075139E9|
| 56434459_3|02138|1.529628075139E9|
| 56434459_4|02138|1.498092075139E9|
| 56434459_5|02138|1.466556075139E9|
| 56434459_6|02138|1.434933675139E9|
| 56434459_7|02138|1.403397675139E9|
| 56434459_8|02138|1.371861675139E9|
| 56434459_9|02138|1.340325675139E9|
|56434459_10|02138|1.308703275139E9|
|56434459_11|02138|1.277167275139E9|
|56434459_12|02138|1.245631275139E9|
|56434459_13|02138|1.214095275139E9|
|56434459_14|02138|1.182472875139E9|
| 56434508_0|02138|1.624322475491E9|
| 56434508_1|02138|1.592786475491E9|
| 56434508_2|02138|1.561164075491E9|
| 56434508_3|02138|1.529628075491E9|
| 56434508_4|02138|1.498092075491E9|
+-----------+-----+----------------+
only showing top 20 rows
I then join the DFs on new_id, resulting in:
+------------+-----+----------------+-----+------------------+
| new_id| zip_| values_new| zip_| values_new|
+------------+-----+----------------+-----+------------------+
| 123957783_3|02138|1.527644029268E9|02138| 1.0|
| 125820702_3|02138|1.527643636531E9|02138| 1.013462378|
|165689784_12|02138|1.243647038288E9|02138|0.9283950599999999|
|165689784_14|02138|1.180488638288E9|02138| 1.011595547|
| 56424973_12|02138|1.245630256025E9|02138|0.9566622300000001|
| 56424989_14|02138|1.182471866886E9|02138| 1.0|
| 56425304_7|02138|1.403398444955E9|02138| 1.028527131|
| 56425386_6|02138|1.432949752808E9|02138| 1.08516484|
| 56430694_17|02138|1.087866094991E9|02138| 1.120045416|
| 56430700_20|02138| 9.61635686239E8|02138| 1.099920854|
| 56430856_13|02138|1.214097787512E9|02138| 0.989263804|
| 56430866_12|02138|1.245633801277E9|02138| 0.990684134|
| 56430875_10|02138|1.308705777269E9|02138| 1.0|
| 56430883_3|02138|1.529630585921E9|02138| 1.06920212|
| 56430987_13|02138|1.214100806414E9|02138| 0.978794644|
| 56431009_1|02138|1.592792025664E9|02138| 1.07923349|
| 56431013_9|02138|1.340331235566E9|02138| 1.0|
| 56431025_8|02138|1.371860189767E9|02138| 0.9477155|
| 56432373_13|02138|1.214092187852E9|02138| 0.994825498|
| 56432421_2|02138|1.561161037707E9|02138| 1.11343257|
+------------+-----+----------------+-----+------------------+
only showing top 20 rows
My question: Is there a more effective way to get the resultant DF? I tried doing two posexplodes in parallel but PySpark allows only one.
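Chaining two explode calls would pair every value with every timestamp, so zip the two arrays first and explode once. A sketch, assuming Spark 2.4+ (for arrays_zip) and that the struct fields take the names of the source columns:
from pyspark.sql.functions import arrays_zip, col, explode, monotonically_increasing_id
df = (df.withColumn("tmp", explode(arrays_zip(col("values"), col("time"))))
        .select("id", "zip_", col("tmp.values").alias("values_new"), col("tmp.time").alias("times_new"))
        .withColumn("id_new", monotonically_increasing_id()))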
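Pairing by position matters here because exploding two array columns one after the other yields every combination of their elements rather than matched pairs. A plain-Python sketch of the difference, using short made-up arrays in place of the real values and time columns:

```python
from itertools import product

# Made-up sample arrays standing in for one row's "values" and "time" columns.
values = [1.03, 1.08, 1.09]
times = [1624322475, 1592786475, 1561164075]

# Exploding one array and then the other behaves like a cross product.
chained = list(product(values, times))

# Pairing by position keeps each value with its own timestamp,
# and the index plays the role of the position from posexplode.
paired = [(i, v, t) for i, (v, t) in enumerate(zip(values, times))]

print(len(chained))  # 9 combinations
print(len(paired))   # 3 matched pairs
```

Note that zip in Python stops at the shorter list, so arrays of unequal length lose their trailing elements in this sketch.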
How do I transpose a DataFrame in PySpark? I want the columns to become rows and the rows to become columns.
Here is the input:
+-----+---+----+----+
|idx  |vin|cur |mean|
+-----+---+----+----+
|Type1|D  |5.0 |6.0 |
|Type2|C  |null|7.0 |
+-----+---+----+----+
Expected Outcome:
+----+-----+-----+
|idx |Type1|Type2|
+----+-----+-----+
|vin |D    |C    |
|cur |5.0  |null |
|mean|6.0  |7.0  |
+----+-----+-----+
You can use the stack function to unpivot the vin, cur and mean columns, then pivot on the idx column:
from pyspark.sql import functions as F
df1 = df.selectExpr("idx", "stack(3, 'vin',vin, 'cur',cur, 'mean',mean)") \
.select("idx", "col0", "col1") \
.groupBy("col0") \
.pivot("idx").agg(F.first("col1")) \
.withColumnRenamed("col0", "idx")
df1.show(truncate=False)
#+----+-----+-----+
#|idx |Type1|Type2|
#+----+-----+-----+
#|vin |D |C |
#|mean|6.0 |7.0 |
#|cur |5.0 |null |
#+----+-----+-----+
You can apply the transformations one at a time to see what each part does.
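The two steps (unpivot, then pivot) can also be traced in plain Python on the sample data. This is only an illustration of the logic, not Spark code, and the variable names are hypothetical:

```python
rows = [
    {"idx": "Type1", "vin": "D", "cur": 5.0, "mean": 6.0},
    {"idx": "Type2", "vin": "C", "cur": None, "mean": 7.0},
]
value_cols = ["vin", "cur", "mean"]

# Step 1 (unpivot): one (idx, column name, value) triple per cell.
long_form = [(r["idx"], c, r[c]) for r in rows for c in value_cols]

# Step 2 (pivot): group by column name and spread the idx values into columns.
transposed = {}
for idx, name, value in long_form:
    transposed.setdefault(name, {"idx": name})[idx] = value

for row in transposed.values():
    print(row)
```

The intermediate long_form list corresponds to the (idx, col0, col1) rows that the stack call produces before the pivot.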
I'm working with PySpark and I have a DataFrame like this:
+---+-----+
| id|value|
+---+-----+
| 1| 65|
| 2| 66|
| 3| 65|
| 4| 68|
| 5| 71|
+---+-----+
I want to generate a DataFrame like this with PySpark:
+---+-----+-------------+
| id|value| prev_value |
+---+-----+-------------+
| 1 | 65 | null |
| 2 | 66 | 65 |
| 3 | 65 | 66,65 |
| 4 | 68 | 65,66,65 |
| 5 | 71 | 68,65,66,65 |
+---+-----+-------------+
Here is one way:
import pyspark.sql.functions as f
from pyspark.sql.window import Window
from pyspark.sql.types import StringType
# define the window and collect a running list of the lagged values
win = Window.partitionBy().orderBy(f.col('id'))
df = df.withColumn('prev_value', f.collect_list(f.lag('value').over(win)).over(win))
# now define a udf that reverses each list and joins it into a string
concat = f.udf(lambda x: 'null' if len(x)==0 else ','.join([str(elt) for elt in x[::-1]]), StringType())
df = df.withColumn('prev_value', concat('prev_value'))
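As a check on what the window plus UDF should produce, here is a plain-Python sketch of the same running history on the sample values (no Spark; previous_values is a hypothetical helper name). It assumes the rows are already ordered by id.

```python
values = [65, 66, 65, 68, 71]  # sample "value" column, already ordered by id

def previous_values(values):
    history = []  # values seen so far, oldest first
    result = []
    for v in values:
        # Most recent value first; 'null' when there is no earlier row.
        result.append(",".join(str(x) for x in reversed(history)) or "null")
        history.append(v)
    return result

prev = previous_values(values)
print(prev)
```

Each entry lists the earlier values in reverse order, which is what the x[::-1] slice in the UDF does to the collected list.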
I have a spark dataframe like this:
id | Operation | Value
-----------------------------------------------------------
1 | [Date_Min, Date_Max, Device] | [148590, 148590, iphone]
2 | [Date_Min, Date_Max, Review] | [148590, 148590, Good]
3 | [Date_Min, Date_Max, Review, Device] | [148590, 148590, Bad,samsung]
The result that I expect:
id | Operation | Value |
--------------------------
1 | Date_Min | 148590 |
1 | Date_Max | 148590 |
1 | Device | iphone |
2 | Date_Min | 148590 |
2 | Date_Max | 148590 |
2 | Review | Good |
3 | Date_Min | 148590 |
3 | Date_Max | 148590 |
3 | Review | Bad |
3 | Device | samsung|
I'm using Spark 2.1.0 with pyspark. I tried this solution but it worked only for one column.
Thanks
Here is the example DataFrame from the question. I adapted this solution to solve it.
df = spark.createDataFrame(
[[1, ['Date_Min', 'Date_Max', 'Device'], ['148590', '148590', 'iphone']],
[2, ['Date_Min', 'Date_Max', 'Review'], ['148590', '148590', 'Good']],
[3, ['Date_Min', 'Date_Max', 'Review', 'Device'], ['148590', '148590', 'Bad', 'samsung']]],
schema=['id', 'l1', 'l2'])
First, define a UDF that zips the two lists together for each row.
from pyspark.sql.types import *
from pyspark.sql.functions import col, udf, explode
zip_list = udf(
lambda x, y: list(zip(x, y)),
ArrayType(StructType([
StructField("first", StringType()),
StructField("second", StringType())
]))
)
Finally, zip the two columns together and then explode the zipped column.
df_out = df.withColumn("tmp", zip_list('l1', 'l2')).\
withColumn("tmp", explode("tmp")).\
select('id', col('tmp.first').alias('Operation'), col('tmp.second').alias('Value'))
df_out.show()
Output
+---+---------+-------+
| id|Operation| Value|
+---+---------+-------+
| 1| Date_Min| 148590|
| 1| Date_Max| 148590|
| 1| Device| iphone|
| 2| Date_Min| 148590|
| 2| Date_Max| 148590|
| 2| Review| Good|
| 3| Date_Min| 148590|
| 3| Date_Max| 148590|
| 3| Review| Bad|
| 3| Device|samsung|
+---+---------+-------+
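The zip-then-explode idea can be sketched in plain Python as follows. This is an illustration of the logic rather than Spark code, the helper name zip_and_explode is hypothetical, and only two of the sample rows are used:

```python
rows = [
    (1, ["Date_Min", "Date_Max", "Device"], ["148590", "148590", "iphone"]),
    (3, ["Date_Min", "Date_Max", "Review", "Device"],
        ["148590", "148590", "Bad", "samsung"]),
]

def zip_and_explode(rows):
    out = []
    for row_id, ops, vals in rows:
        # zip pairs the i-th operation with the i-th value;
        # the inner loop then emits one output row per pair.
        for op, val in zip(ops, vals):
            out.append((row_id, op, val))
    return out

exploded = zip_and_explode(rows)
for row in exploded:
    print(row)
```

Because zip stops at the shorter list, a row whose two arrays differ in length would silently lose its unmatched trailing elements.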
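If you are working with a DataFrame, note that Spark allows only one generator such as explode per select clause. One option is to posexplode one column and index the other by position:
import pyspark.sql.functions as F
your_df.select("id", "Value", F.posexplode("Operation").alias("pos", "Operation")) \
    .select("id", "Operation", F.expr("Value[pos]").alias("Value")).show()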