How do I transpose columns in PySpark? I want to make the columns become rows, and the rows become columns.
Here is the input:
+-----+-----+----+----+
|idx  |vin  |cur |mean|
+-----+-----+----+----+
|Type1|D    |5.0 |6.0 |
|Type2|C    |null|7.0 |
+-----+-----+----+----+
Expected Outcome:
+----+-----+-----+
|idx |Type1|Type2|
+----+-----+-----+
|vin |D    |C    |
|cur |5.0  |null |
|mean|6.0  |7.0  |
+----+-----+-----+
You can use the stack function to unpivot the vin, cur and mean columns, then pivot the idx column:
from pyspark.sql import functions as F
df1 = df.selectExpr("idx", "stack(3, 'vin', vin, 'cur', cur, 'mean', mean)") \
    .select("idx", "col0", "col1") \
    .groupBy("col0") \
    .pivot("idx").agg(F.first("col1")) \
    .withColumnRenamed("col0", "idx")
df1.show(truncate=False)
#+----+-----+-----+
#|idx |Type1|Type2|
#+----+-----+-----+
#|vin |D |C |
#|mean|6.0 |7.0 |
#|cur |5.0 |null |
#+----+-----+-----+
You can apply the transformations one by one to see how it works and what each part does.
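For example, a minimal sketch of the intermediate step (col0 and col1 are the default column names that stack produces):

# Unpivot only, to inspect the intermediate shape
unpivoted = df.selectExpr("idx", "stack(3, 'vin', vin, 'cur', cur, 'mean', mean)")
unpivoted.show(truncate=False)
# The frame now has idx, col0 (the former column name) and col1 (its value);
# grouping by col0 and pivoting on idx then yields the transposed result.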
I have a PySpark dataframe consisting of two columns, named input and target. It is a crossJoin of two single-column dataframes. Below is an example of how such a dataframe looks.
+-----+------+
|input|target|
+-----+------+
|A    |Voigt.|
|A    |Leica |
|A    |Zeiss |
|B    |Voigt.|
|B    |Leica |
|B    |Zeiss |
|C    |Voigt.|
|C    |Leica |
|C    |Zeiss |
+-----+------+
Then I have another dataframe which provides a number describing the relation between the input and target columns. However, it is not guaranteed that every input-target pair has this numerical value. For example, A - Voigt. may have 2 as its relational value, but A - Leica may not have this value at all. Below is an example:
+-----+------+---+
|input|target|val|
+-----+------+---+
|A    |Voigt.|2  |
|A    |Zeiss |1  |
|B    |Leica |3  |
|C    |Zeiss |5  |
|C    |Leica |2  |
+-----+------+---+
Now I want a dataframe that combines these two and looks like this:
+-----+------+----+
|input|target|val |
+-----+------+----+
|A    |Voigt.|2   |
|A    |Leica |null|
|A    |Zeiss |1   |
|B    |Voigt.|null|
|B    |Leica |3   |
|B    |Zeiss |null|
|C    |Voigt.|null|
|C    |Leica |2   |
|C    |Zeiss |5   |
+-----+------+----+
I tried a left join of these two dataframes and then tried to filter, but I have had problems completing it in this form.
result = input_target.join(
    input_target_w_val,
    (input_target.input == input_target_w_val.input) &
    (input_target.target == input_target_w_val.target),
    'left')
How should I put a filter from this point, or is there another way I can achieve this?
Try it as below:
Input DataFrames
df1 = spark.createDataFrame(
    data=[("A", "Voigt."), ("A", "Leica"), ("A", "Zeiss"),
          ("B", "Voigt."), ("B", "Leica"), ("B", "Zeiss"),
          ("C", "Voigt."), ("C", "Leica"), ("C", "Zeiss")],
    schema=["input", "target"])
df1.show()
+-----+------+
|input|target|
+-----+------+
| A|Voigt.|
| A| Leica|
| A| Zeiss|
| B|Voigt.|
| B| Leica|
| B| Zeiss|
| C|Voigt.|
| C| Leica|
| C| Zeiss|
+-----+------+
df2 = spark.createDataFrame(
    data=[("A", "Voigt.", 2), ("A", "Zeiss", 1), ("B", "Leica", 3),
          ("C", "Zeiss", 5), ("C", "Leica", 2)],
    schema=["input", "target", "val"])
df2.show()
+-----+------+---+
|input|target|val|
+-----+------+---+
| A|Voigt.| 2|
| A| Zeiss| 1|
| B| Leica| 3|
| C| Zeiss| 5|
| C| Leica| 2|
+-----+------+---+
Required Output
df1.join(df2, on=["input", "target"], how="left_outer") \
    .select(df1["input"], df1["target"], df2["val"]) \
    .show(truncate=False)
+-----+------+----+
|input|target|val |
+-----+------+----+
|A |Leica |null|
|A |Voigt.|2 |
|A |Zeiss |1 |
|B |Leica |3 |
|B |Voigt.|null|
|B |Zeiss |null|
|C |Leica |2 |
|C |Voigt.|null|
|C |Zeiss |5 |
+-----+------+----+
You can simply specify a list of join column names.
df = df1.join(df2, ['input', 'target'], 'left')
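If you then want the filter the question mentions - for example, to keep only the input-target pairs that have no value - a minimal sketch on top of the joined result df from above:

# Pairs that received no relational value from the left join
missing = df.filter(df['val'].isNull())
missing.show()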
I have two PySpark dataframes with different numbers of rows. I am trying to compare the values in all the columns by joining the two dataframes on multiple keys, so I can find the records that have different values and the records that have the same values in these columns.
#df1:
+---+----+----+-----+
|id |age |sex |value|
+---+----+----+-----+
|1  |23  |M   |8.4  |
|2  |4   |M   |2    |
|3  |16  |F   |4.1  |
|4  |60  |M   |4    |
|5  |null|F   |5    |
+---+----+----+-----+
#df2:
+---+----+----+-----+
|id |age |sex |value|
+---+----+----+-----+
|1  |23  |M   |8.4  |
|2  |4   |null|2    |
|4  |13  |M   |3.1  |
|5  |34  |F   |6.2  |
+---+----+----+-----+
#joining df1 and df2 on multiple keys
same=df1.join(df2, on=['id','age','sex','value'], how='inner')
Please note that the dataframes above are just samples. My real data has around 25 columns and 100k+ rows, so when I tried to do the join, the Spark job took a long time and did not finish.
Does anyone have good advice on comparing two dataframes and finding the records that have different values in these columns, either with joins or other methods?
Use hashing.
from pyspark.sql.functions import hash
df1 = spark.createDataFrame(
    [('312312', '151132'), ('004312', '12232'), ('', '151132'),
     ('013vjhrr134', '111232'), (None, '151132'), ('0fsgdhfjgk', '151132')],
    ("Fruits", "Meat"))
df1 = df1.withColumn('hash_value', hash("Fruits", "Meat"))

df = spark.createDataFrame(
    [('312312', '151132'), ('000312', '151132'), ('', '151132'),
     ('013vjh134134', '151132'), (None, '151132'), ('0fsgdhfjgk', '151132')],
    ("Fruits", "Meat"))
df = df.withColumn('hash_value', hash("Fruits", "Meat"))
df.show()
+------------+------+-----------+
| Fruits| Meat| hash_value|
+------------+------+-----------+
| 312312|151132| -344340697|
| 000312|151132| -548650515|
| |151132|-2105905448|
|013vjh134134|151132| 2052362224|
| null|151132| 598159392|
| 0fsgdhfjgk|151132| 951458223|
+------------+------+-----------+
df1.show()
+-----------+------+-----------+
| Fruits| Meat| hash_value|
+-----------+------+-----------+
| 312312|151132| -344340697|
| 004312| 12232| 76821046|
| |151132|-2105905448|
|013vjhrr134|111232| 1289730088|
| null|151132| 598159392|
| 0fsgdhfjgk|151132| 951458223|
+-----------+------+-----------+
Or you can use SHA2 for the same purpose:
from pyspark.sql.functions import sha2, concat_ws

# Concatenate every column of df1 and hash the result into one digest per row
df1.withColumn("row_sha2", sha2(concat_ws("||", *df1.columns), 256)).show(truncate=False)
+-----------+------+----------------------------------------------------------------+
|Fruits |Meat |row_sha2 |
+-----------+------+----------------------------------------------------------------+
|312312 |151132|7be3824bcaa5fa29ad58df2587d392a1cc9ca5511ef01005be6f97c9558d1eed|
|004312 |12232 |c7fcf8031a17e5f3168297579f6dc8a6f17d7a4a71939d6b989ca783f30e21ac|
| |151132|68ea989b7d33da275a16ff897b0ab5a88bc0f4545ec22d90cee63244c1f00fb0|
|013vjhrr134|111232|9c9df63553d841463a803c64e3f4a8aed53bcdf78bf4a089a88af9e91406a226|
|null |151132|83de2d466a881cb4bb16b83665b687c01752044296079b2cae5bab8af93db14f|
|0fsgdhfjgk |151132|394631bbd1ccee841d3ba200806f8d0a51c66119b13575cf547f8cc91066c90d|
+-----------+------+----------------------------------------------------------------+
This creates a unique code for each row; if the values in two rows are the same, they will have the same hash value as well. Now join the two dataframes and compare the hash values:
df1.join(df, "hash_value", "inner").show()
+-----------+----------+------+----------+------+
| hash_value| Fruits| Meat| Fruits| Meat|
+-----------+----------+------+----------+------+
|-2105905448| |151132| |151132|
| -344340697| 312312|151132| 312312|151132|
| 598159392| null|151132| null|151132|
| 951458223|0fsgdhfjgk|151132|0fsgdhfjgk|151132|
+-----------+----------+------+----------+------+
df1.join(df, "hash_value", "outer").show()
+-----------+-----------+------+------------+------+
| hash_value| Fruits| Meat| Fruits| Meat|
+-----------+-----------+------+------------+------+
|-2105905448| |151132| |151132|
| -548650515| null| null| 000312|151132|
| -344340697| 312312|151132| 312312|151132|
| 76821046| 004312| 12232| null| null|
| 598159392| null|151132| null|151132|
| 951458223| 0fsgdhfjgk|151132| 0fsgdhfjgk|151132|
| 1289730088|013vjhrr134|111232| null| null|
| 2052362224| null| null|013vjh134134|151132|
+-----------+-----------+------+------------+------+
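If you only want the records that differ between the two dataframes, a minimal sketch using an anti join on the hash column:

# Rows of df1 that have no identical counterpart in df
diff = df1.join(df, "hash_value", "left_anti")
diff.show()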
**Dataframe 1**
+----+--------+------------+
|key |dc_count|dc_day_count|
+----+--------+------------+
|123 |13      |66          |
|124 |13      |12          |
+----+--------+------------+
**Rule Dataframe**
+----+-------------+--------------+--------+
|key |rule_dc_count|rule_day_count|rule_out|
+----+-------------+--------------+--------+
|123 |2            |30            |139     |
|123 |null         |null          |64      |
|124 |2            |30            |139     |
|124 |null         |null          |64      |
+----+-------------+--------------+--------+
if dc_count > rule_dc_count and dc_day_count > rule_day_count:
    populate the corresponding rule_out
else:
    use the other rule_out
Expected output:
+----+--------+
|key |rule_out|
+----+--------+
|123 |139     |
|124 |64      |
+----+--------+
PySpark Version
The challenge here is to get the second row's value for a key in the same column. The LEAD() analytical function can be used to resolve this.
Create the DataFrames here:
from pyspark.sql import functions as F
df = spark.createDataFrame([(123,13,66),(124,13,12)],[ "key","dc_count","dc_day_count"])
df1 = spark.createDataFrame([(123,2,30,139),(123,0,0,64),(124,2,30,139),(124,0,0,64)],
["key","rule_dc_count","rule_day_count","rule_out"])
Logic to get the Desired Result
from pyspark.sql import Window as W

# Order so the row carrying the rule thresholds comes first and LEAD() picks up
# the fallback rule_out from the next row (ordering by the partition key itself
# would make the lead value non-deterministic).
_w = W.partitionBy('key').orderBy(F.col('rule_dc_count').desc())
df1 = df1.withColumn('rn', F.lead('rule_out').over(_w))
df1 = df1.join(df, 'key', 'left')
df1 = df1.withColumn('condition_col',
                     F.when((F.col('dc_count') > F.col('rule_dc_count')) &
                            (F.col('dc_day_count') > F.col('rule_day_count')),
                            F.col('rule_out'))
                      .otherwise(F.col('rn')))
# Keep only the rows that had a following row, i.e. one row per key
df1 = df1.filter(F.col('rn').isNotNull())
Output
df1.show()
+---+-------------+--------------+--------+---+--------+------------+-------------+
|key|rule_dc_count|rule_day_count|rule_out| rn|dc_count|dc_day_count|condition_col|
+---+-------------+--------------+--------+---+--------+------------+-------------+
|124| 2| 30| 139| 64| 13| 12| 64|
|123| 2| 30| 139| 64| 13| 66| 139|
+---+-------------+--------------+--------+---+--------+------------+-------------+
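To reduce this to the expected key/rule_out shape, a small final select (condition_col already holds the chosen value):

result = df1.select('key', F.col('condition_col').alias('rule_out'))
result.show()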
Assuming the expected output is:
+---+--------+
|key|rule_out|
+---+--------+
|123|139 |
+---+--------+
The query below should work (register the dataframes from above as temp views first):
df.createOrReplaceTempView("table1")
df1.createOrReplaceTempView("table2")

spark.sql("""
    SELECT t1.key, t2.rule_out
    FROM table1 t1
    JOIN table2 t2
      ON t1.key = t2.key
     AND t1.dc_count > t2.rule_dc_count
     AND t1.dc_day_count > t2.rule_day_count
""").show(truncate=False)
df1:
+---+------+
| id| code|
+---+------+
| 1|[A, F]|
| 2| [G]|
| 3| [A]|
+---+------+
df2:
+--------+----+
| col1|col2|
+--------+----+
| Apple| A|
| Google| G|
|Facebook| F|
+--------+----+
I want df3 to look like this, built by using the df1 and df2 columns:
+---+------+-----------------+
| id| code| changed|
+---+------+-----------------+
| 1|[A, F]|[Apple, Facebook]|
| 2| [G]| [Google]|
| 3| [A]| [Apple]|
+---+------+-----------------+
I know this can be achieved if the code column is NOT an ARRAY. I don't know how to iterate over the code array for this purpose.
Try:
import pyspark.sql.functions as f

res = (df1
       # explode the code array into one row per code
       .select(f.col("id"), f.explode(f.col("code")).alias("code"))
       # map each code to its name via df2
       .join(df2, f.col("code") == df2.col2)
       # re-aggregate back into arrays per id
       .groupBy("id")
       .agg(f.collect_list(f.col("code")).alias("code"),
            f.collect_list(f.col("col1")).alias("changed"))
      )
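Note that collect_list does not guarantee that the two lists keep the same relative order. If pairwise alignment between code and changed matters, a hedged alternative sketch is to collect (code, name) structs together and split them afterwards:

# Collect the pairs as structs so they cannot get out of sync, then split
# the struct array back into two aligned arrays.
paired = (df1
          .select("id", f.explode("code").alias("code"))
          .join(df2, f.col("code") == df2.col2)
          .groupBy("id")
          .agg(f.collect_list(f.struct("code", "col1")).alias("pairs"))
          .select("id", f.col("pairs.code").alias("code"),
                  f.col("pairs.col1").alias("changed")))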
I have a spark dataframe like this:
id | Operation                            | Value
---|--------------------------------------|------------------------------
1  | [Date_Min, Date_Max, Device]         | [148590, 148590, iphone]
2  | [Date_Min, Date_Max, Review]         | [148590, 148590, Good]
3  | [Date_Min, Date_Max, Review, Device] | [148590, 148590, Bad, samsung]
The result that I expect:
id | Operation | Value
---|-----------|--------
1  | Date_Min  | 148590
1  | Date_Max  | 148590
1  | Device    | iphone
2  | Date_Min  | 148590
2  | Date_Max  | 148590
2  | Review    | Good
3  | Date_Min  | 148590
3  | Date_Max  | 148590
3  | Review    | Bad
3  | Device    | samsung
I'm using Spark 2.1.0 with PySpark. I tried this solution but it worked only for one column.
Thanks
Here is an example dataframe built from the data above, which I use to solve your question.
df = spark.createDataFrame(
[[1, ['Date_Min', 'Date_Max', 'Device'], ['148590', '148590', 'iphone']],
[2, ['Date_Min', 'Date_Max', 'Review'], ['148590', '148590', 'Good']],
[3, ['Date_Min', 'Date_Max', 'Review', 'Device'], ['148590', '148590', 'Bad', 'samsung']]],
schema=['id', 'l1', 'l2'])
Here, you can first define a udf to zip the two lists together for each row.
from pyspark.sql.types import ArrayType, StructType, StructField, StringType
from pyspark.sql.functions import col, udf, explode
zip_list = udf(
lambda x, y: list(zip(x, y)),
ArrayType(StructType([
StructField("first", StringType()),
StructField("second", StringType())
]))
)
Finally, you can zip the two columns together and then explode that column.
df_out = df.withColumn("tmp", zip_list('l1', 'l2')).\
withColumn("tmp", explode("tmp")).\
select('id', col('tmp.first').alias('Operation'), col('tmp.second').alias('Value'))
df_out.show()
Output
+---+---------+-------+
| id|Operation| Value|
+---+---------+-------+
| 1| Date_Min| 148590|
| 1| Date_Max| 148590|
| 1| Device| iphone|
| 2| Date_Min| 148590|
| 2| Date_Max| 148590|
| 2| Review| Good|
| 3| Date_Min| 148590|
| 3| Date_Max| 148590|
| 3| Review| Bad|
| 3| Device|samsung|
+---+---------+-------+
If you prefer the DataFrame API, note that Spark allows only one generator such as explode per select clause, so calling explode twice in the same select does not work. On Spark 2.4+ you can zip the arrays and explode once (on Spark 2.1.0, use the udf approach above instead):
import pyspark.sql.functions as F
your_df.select("id", F.explode(F.arrays_zip("Operation", "Value")).alias("tmp")) \
    .select("id", "tmp.Operation", "tmp.Value").show()