I have multiple dataframes that look like this.
df1:
+---------+---------+---------+
|sum(col1)|sum(col2)|sum(col3)|
+---------+---------+---------+
| 10| 1| 0|
+---------+---------+---------+
df2:
+---------+---------+
|sum(col1)|sum(col2)|
+---------+---------+
| 20| 6|
+---------+---------+
df3:
+---------+---------+---------+---------+
|sum(col1)|sum(col2)|sum(col3)|sum(col4)|
+---------+---------+---------+---------+
| 1| 5| 3| 4|
+---------+---------+---------+---------+
For the above example, the output should look like this.
+--------+------+------+------+
|col_name|value1|value2|value3|
+--------+------+------+------+
| col1| 10| 20| 1|
| col2| 1| 6| 5|
| col3| 0| null| 3|
| col4| null| null| 4|
+--------+------+------+------+
I am using Spark 1.6.3 to do this. In the above example I have different sum calculations for each particular table, but I have multiple tables to calculate sums for, and the output should be consolidated in the above format.
Any ideas on how to accomplish this?
This is probably easiest to do outside of pyspark, and if the data you are working with is small enough, that is probably what you should do, because doing this in pyspark will not be especially efficient.
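For example, since each input here is a single summary row, one simple option is to collect the rows and stitch the result together with pandas. The following is only a sketch, assuming the three summary dataframes are available as df1, df2 and df3.
import pandas as pd

# collect each one-row summary locally and transpose it so the index is the column name
pdfs = [d.toPandas().T for d in (df1, df2, df3)]
out = pd.concat(pdfs, axis=1)        # aligns on column name; missing entries become NaN
out.columns = ['value1', 'value2', 'value3']
out.index.name = 'col_name'          # index values come through as-is, e.g. 'sum(col1)'
print(out.reset_index())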
If for some reason you need to do this in pyspark, you can do it with several dataframe transformations. The first thing we need to do is convert all of the individual dataframes to the same schema, which will allow us to iteratively select from each and union into a final result. The following is one way to achieve this.
from pyspark.sql.functions import lit,col
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
a = [[10,1,0]]
b = [[20,6]]
c = [[1,5,3,4]]
dfa = spark.createDataFrame(a,['col1','col2','col3'])
dfb = spark.createDataFrame(b,['col1','col2'])
dfc = spark.createDataFrame(c,['col1','col2','col3','col4'])
dfdict = {'dfa':dfa,'dfb':dfb,'dfc':dfc}
columns = set(c for dfname in dfdict for c in dfdict[dfname].columns)
for dfname in dfdict:
    for colname in columns - set(dfdict[dfname].columns):
        # add the missing columns as nulls so every dataframe ends up with the same set of columns
        dfdict[dfname] = dfdict[dfname].withColumn(colname, lit(None).cast(IntegerType()))
schema = StructType([StructField("col_name", StringType(), True)] +
                    [StructField("value_" + dfname, IntegerType(), True) for dfname in dfdict])
resultdf=spark.createDataFrame([],schema = schema)
for colname in columns:
    resultdf = resultdf\
        .union(dfdict['dfa'].select(lit(colname).alias('col_name'),
                                    col(colname).alias('value_dfa'))\
        .crossJoin(dfdict['dfb'].select(col(colname).alias('value_dfb')))\
        .crossJoin(dfdict['dfc'].select(col(colname).alias('value_dfc'))))
resultdf.orderBy('col_name').show()
>>>
+--------+---------+---------+---------+
|col_name|value_dfa|value_dfb|value_dfc|
+--------+---------+---------+---------+
| col1| 10| 20| 1|
| col2| 1| 6| 5|
| col3| 0| null| 3|
| col4| null| null| 4|
+--------+---------+---------+---------+
There may be ways to improve efficiency of this by removing the cross joins and replacing them with something more clever.
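For example, one way to avoid the cross joins (a sketch, reusing the dfa, dfb and dfc defined above) is to reshape each dataframe into a long (col_name, value) form and combine the pieces with full outer joins:
from functools import reduce
from pyspark.sql.functions import lit, col

def to_long(df, value_name):
    # one row per column: (col_name, value)
    parts = [df.select(lit(c).alias('col_name'), col(c).alias(value_name)) for c in df.columns]
    return reduce(lambda a, b: a.union(b), parts)

resultdf2 = to_long(dfa, 'value_dfa')\
    .join(to_long(dfb, 'value_dfb'), 'col_name', 'full')\
    .join(to_long(dfc, 'value_dfc'), 'col_name', 'full')
resultdf2.orderBy('col_name').show()
This is essentially what the stack-based answer further down does in SQL form.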
If you need to work with starting dataframes that have multiple rows, you would need to aggregate the rows together (or change the requirements of the expected output). For instance, you may want to sum everything, as in the following example.
from pyspark.sql.functions import sum
d = [[1,2,3],[4,5,6]]
dfd = spark.createDataFrame(d, ['col1','col2','col3'])
dfdagg = dfd.groupby().agg(*[sum(colname) for colname in dfd.columns])
dfdagg can now be used in the same way that the other dataframes were used above.
Alternatively, you can use the stack function to transpose the dataframes and then merge them with full outer joins:
>>> df1x = df1.selectExpr("stack(3, 'col1', col1, 'col2', col2, 'col3', col3) as (col_name, value1)")
>>> df1x.show()
+--------+------+
|col_name|value1|
+--------+------+
| col1| 10|
| col2| 1|
| col3| 0|
+--------+------+
>>> df2x = df2.selectExpr("stack(2, 'col1', col1, 'col2', col2) as (col_name, value2)")
>>> df2x.show()
+--------+------+
|col_name|value2|
+--------+------+
| col1| 20|
| col2| 6|
+--------+------+
>>> df3x = df3.selectExpr("stack(4, 'col1', col1, 'col2', col2, 'col3', col3, 'col4', col4) as (col_name, value3)")
>>> df3x.show()
+--------+------+
|col_name|value3|
+--------+------+
| col1| 1|
| col2| 5|
| col3| 3|
| col4| 4|
+--------+------+
>>> df1x.join(df2x, "col_name", "full").join(df3x, "col_name", "full").sort("col_name").show()
+--------+------+------+------+
|col_name|value1|value2|value3|
+--------+------+------+------+
| col1| 10| 20| 1|
| col2| 1| 6| 5|
| col3| 0| null| 3|
| col4| null| null| 4|
+--------+------+------+------+
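If you would rather not write the stack expression by hand for every dataframe, it can be generated from df.columns. This is just a sketch; the backticks keep column names such as sum(col1) valid inside the expression:
def stack_expr(df, value_name):
    # builds "stack(n, 'c1', `c1`, 'c2', `c2`, ...) as (col_name, <value_name>)"
    pairs = ", ".join("'{0}', `{0}`".format(c) for c in df.columns)
    return "stack({}, {}) as (col_name, {})".format(len(df.columns), pairs, value_name)

df1x = df1.selectExpr(stack_expr(df1, "value1"))
df2x = df2.selectExpr(stack_expr(df2, "value2"))
df3x = df3.selectExpr(stack_expr(df3, "value3"))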
I'm attempting to do something very similar to this post here, but I need to use pyspark dataframes and I'm looking to create two columns based on different IDs.
Essentially I am attempting to append my original pyspark dataframe with two new columns, each containing the mean value for their paired IDs.
An example initial df and the output df can be found below:
Example input and output
To achieve this, you need to create two individual dataframes containing the aggregations and join them back to the original dataframe.
Data Preparation
import pandas as pd
from pyspark.sql import functions as F   # used for the aggregations below

d = {
    'id1': [1]*2 + [2]*3,
    'id2': [2]*2 + [1]*3,
    'value': [i for i in range(1, 100, 20)]
}
df = pd.DataFrame(d)
sparkDF = sql.createDataFrame(df)   # sql is the SparkSession / SQLContext
sparkDF.show()
+---+---+-----+
|id1|id2|value|
+---+---+-----+
| 1| 2| 1|
| 1| 2| 21|
| 2| 1| 41|
| 2| 1| 61|
| 2| 1| 81|
+---+---+-----+
Aggregation & Join
sparkDF_agg_id1 = sparkDF.groupBy('id1').agg(F.mean(F.col('value')).alias('value_mean_id1'))
sparkDF_agg_id2 = sparkDF.groupBy('id2').agg(F.mean(F.col('value')).alias('value_mean_id2'))
finalDF = sparkDF.join(sparkDF_agg_id1
                ,sparkDF['id1'] == sparkDF_agg_id1['id1']
                ,'inner'
          ).select(sparkDF['*']
                ,sparkDF_agg_id1['value_mean_id1']
          )

finalDF = finalDF.join(sparkDF_agg_id2
                ,finalDF['id2'] == sparkDF_agg_id2['id2']
                ,'inner'
          ).select(finalDF['*']
                ,sparkDF_agg_id2['value_mean_id2']
          )
finalDF.show()
+---+---+-----+--------------+--------------+
|id1|id2|value|value_mean_id1|value_mean_id2|
+---+---+-----+--------------+--------------+
| 2| 1| 41| 61.0| 61.0|
| 2| 1| 61| 61.0| 61.0|
| 2| 1| 81| 61.0| 61.0|
| 1| 2| 1| 11.0| 11.0|
| 1| 2| 21| 11.0| 11.0|
+---+---+-----+--------------+--------------+
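The same result can also be obtained with window functions, which avoids the two explicit joins. This is a sketch of my own, not part of the answer above:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

finalDF = sparkDF\
    .withColumn('value_mean_id1', F.mean('value').over(Window.partitionBy('id1')))\
    .withColumn('value_mean_id2', F.mean('value').over(Window.partitionBy('id2')))
finalDF.show()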
I have two dataframes: one is the main dataframe and the other is a lookup dataframe. I need to produce the third, customized form below using pyspark. I need to check the values in the column list_IDs against the lookup dataframe and mark the corresponding count in the final dataframe. I have tried array intersect and array lookup but it is not working.
Main dataframe:
df = spark.createDataFrame([(123, [75319, 75317]), (212, [136438, 25274]), (215, [136438, 75317])], ("ID", "list_IDs"))
df.show()
+---+---------------+
| ID| list_IDs|
+---+---------------+
|123| [75319, 75317]|
|212|[136438, 25274]|
|215|[136438, 75317]|
+---+---------------+
Lookup Dataframe:
df_2 = spark.createDataFrame([(75319, "Wheat", 20), (75317, "Rice", 10), (136438, "Jowar", 30), (25274, "Rajma", 40)], ("ID", "Material", "Count"))
df_2.show()
+------+--------+-----+
|    ID|Material|Count|
+------+--------+-----+
| 75319|   Wheat|   20|
| 75317|    Rice|   10|
|136438|   Jowar|   30|
| 25274|   Rajma|   40|
+------+--------+-----+
Need the resultant dataframe as:
+---+---------------+-----+----+-----+-----+
| ID|       list_IDs|Wheat|Rice|Jowar|Rajma|
+---+---------------+-----+----+-----+-----+
|123| [75319, 75317]|   20|  10|    0|    0|
|212|[136438, 25274]|    0|   0|   30|   40|
|215|[136438, 75317]|    0|  10|   30|    0|
+---+---------------+-----+----+-----+-----+
You can join the two dataframes and then pivot:
import pyspark.sql.functions as F
result = df.join(
    df_2,
    F.array_contains(df.list_IDs, df_2.ID)
).groupBy(df.ID, 'list_IDs').pivot('Material').agg(F.first('Count')).fillna(0)
result.show()
+---+---------------+-----+-----+----+-----+
| ID| list_IDs|Jowar|Rajma|Rice|Wheat|
+---+---------------+-----+-----+----+-----+
|212|[136438, 25274]| 30| 40| 0| 0|
|215|[136438, 75317]| 30| 0| 10| 0|
|123| [75319, 75317]| 0| 0| 10| 20|
+---+---------------+-----+-----+----+-----+
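If you want the Material columns in a fixed order, or only a subset of materials, you can pass the pivot values explicitly. This small variation (my own, not from the answer above) also saves Spark the extra pass needed to determine the distinct pivot values:
result = df.join(
    df_2,
    F.array_contains(df.list_IDs, df_2.ID)
).groupBy(df.ID, 'list_IDs').pivot('Material', ['Wheat', 'Rice', 'Jowar', 'Rajma']).agg(F.first('Count')).fillna(0)
result.show()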
Now I have data like this:
+----+----+
|col1| d|
+----+----+
| A| 4|
| A| 10|
| A| 3|
| B| 3|
| B| 6|
| B| 4|
| B| 5.5|
| B| 13|
+----+----+
col1 is StringType and d is TimestampType; here I use DoubleType instead for simplicity.
I want to select rows based on a list of condition tuples.
Given the list of tuples [(A,3.5),(A,8),(B,3.5),(B,10)]
I want to have the result like
+----+---+
|col1| d|
+----+---+
| A| 4|
| A| 10|
| B| 4|
| B| 13|
+----+---+
That is, for each tuple we select from the pyspark dataframe the first row whose d is larger than the tuple's number and whose col1 equals the tuple's string.
What I've already written is:
df_res = spark_empty_dataframe
for (x, y) in tuples:
    dft = df.filter(df.col1 == x).filter(df.d > y).limit(1)
    df_res = df_res.union(dft)
But I think this might have an efficiency problem; I am not sure whether that is the case.
A possible approach that avoids the loop is to create a dataframe from the tuples you have as input:
t = [('A',3.5),('A',8),('B',3.5),('B',10)]
ref=spark.createDataFrame([(i[0],float(i[1])) for i in t],("col1_y","d_y"))
Then we can join it to the input dataframe (df) on that condition, group by the tuple's key and value columns (which repeat for every matching row), take the first value in each group, and finally drop the extra columns:
from pyspark.sql import functions as F

(df.join(ref, (df.col1 == ref.col1_y) & (df.d > ref.d_y), how='inner').orderBy("col1", "d")
   .groupBy("col1_y", "d_y").agg(F.first("col1").alias("col1"), F.first("d").alias("d"))
   .drop("col1_y", "d_y")).show()
+----+----+
|col1| d|
+----+----+
| A|10.0|
| A| 4.0|
| B| 4.0|
| B|13.0|
+----+----+
Note: if the order of the dataframe is important, you can assign an index column with monotonically_increasing_id, include it in the aggregation, and then orderBy the index column.
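A minimal sketch of that idea, assuming the df and ref dataframes defined above:
from pyspark.sql import functions as F

df_idx = df.withColumn("idx", F.monotonically_increasing_id())
(df_idx.join(ref, (df_idx.col1 == ref.col1_y) & (df_idx.d > ref.d_y), how='inner')
    .groupBy("col1_y", "d_y")
    .agg(F.min("idx").alias("idx"),   # smallest index per tuple, used only to preserve output order
         F.min("col1").alias("col1"),
         F.min("d").alias("d"))
    .orderBy("idx")
    .drop("col1_y", "d_y", "idx")).show()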
EDIT: another way, instead of ordering and taking the first row, is to take the minimum directly with min (within each group col1 is constant, and the smallest d greater than d_y is exactly the first matching row):
(df.join(ref,(df.col1==ref.col1_y)&(df.d>ref.d_y),how='inner')
.groupBy("col1_y","d_y").agg(F.min("col1").alias("col1"),F.min("d").alias("d"))
.drop("col1_y","d_y")).show()
+----+----+
|col1| d|
+----+----+
| B| 4.0|
| B|13.0|
| A| 4.0|
| A|10.0|
+----+----+
I have 2 DataFrames like this:
+--+-----------+
|id|some_string|
+--+-----------+
| a| foo|
| b| bar|
| c| egg|
| d| fog|
+--+-----------+
and this:
+--+-----------+
|id|some_string|
+--+-----------+
| a| hoi|
| b| hei|
| c| hai|
| e| hui|
+--+-----------+
I want to join them to be like this:
+--+-----------+
|id|some_string|
+--+-----------+
| a| foohoi|
| b| barhei|
| c| egghai|
| d| fog|
| e| hui|
+--+-----------+
so the column some_string from the first dataframe is concatenated with the column some_string from the second dataframe. If I use
df_join = df1.join(df2,on='id',how='outer')
it would return
+--+-----------+-----------+
|id|some_string|some_string|
+--+-----------+-----------+
| a| foo| hoi|
| b| bar| hei|
| c| egg| hai|
| d| fog| null|
| e| null| hui|
+--+-----------+-----------+
Is there any way to do it?
You need to use when in order to achieve the proper concatenation. Other than that, the way you were using the outer join was almost correct.
You need to check whether each of the two columns is null and then do the concatenation accordingly.
from pyspark.sql.functions import col, when, concat
df1 = sqlContext.createDataFrame([('a','foo'),('b','bar'),('c','egg'),('d','fog')],['id','some_string'])
df2 = sqlContext.createDataFrame([('a','hoi'),('b','hei'),('c','hai'),('e','hui')],['id','some_string'])
df_outer_join=df1.join(df2.withColumnRenamed('some_string','some_string_x'), ['id'], how='outer')
df_outer_join.show()
+---+-----------+-------------+
| id|some_string|some_string_x|
+---+-----------+-------------+
| e| null| hui|
| d| fog| null|
| c| egg| hai|
| b| bar| hei|
| a| foo| hoi|
+---+-----------+-------------+
df_outer_join = df_outer_join.withColumn('some_string_concat',
        when(col('some_string').isNotNull() & col('some_string_x').isNotNull(), concat(col('some_string'), col('some_string_x')))
        .when(col('some_string').isNull() & col('some_string_x').isNotNull(), col('some_string_x'))
        .when(col('some_string').isNotNull() & col('some_string_x').isNull(), col('some_string')))\
    .drop('some_string', 'some_string_x')
df_outer_join.show()
+---+------------------+
| id|some_string_concat|
+---+------------------+
| e| hui|
| d| fog|
| c| egghai|
| b| barhei|
| a| foohoi|
+---+------------------+
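A more compact variant of the same null handling (my own sketch, not part of the answer above) replaces the chain of when conditions with coalesce:
from pyspark.sql.functions import coalesce, concat, col, lit

df_outer_join = df1.join(df2.withColumnRenamed('some_string', 'some_string_x'), ['id'], how='outer')\
    .withColumn('some_string_concat',
                concat(coalesce(col('some_string'), lit('')),
                       coalesce(col('some_string_x'), lit(''))))\
    .drop('some_string', 'some_string_x')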
Considering you want to perform an outer join, you can try the following:
from pyspark.sql.functions import concat, col, lit, when, isnull

df1r = df1.withColumnRenamed('some_string', 'some_string1')
df2r = df2.withColumnRenamed('some_string', 'some_string2')
df_join = df1r.join(df2r, on='id', how='outer')\
    .withColumn('new_column',
                concat(when(isnull(col('some_string1')), lit('')).otherwise(col('some_string1')),
                       when(isnull(col('some_string2')), lit('')).otherwise(col('some_string2'))))\
    .select('id', 'new_column')
(Note that some_string1 and some_string2 refer to the some_string columns coming from df1 and df2; renaming them up front avoids having two columns with the same ambiguous name after the join.)
I am trying to pivot a simple dataframe in pyspark and I must be missing something. I have a dataframe df in the form of:
+----+----+
|Item| Key|
+----+----+
| 1| A|
+----+----+
| 2| A|
+----+----+
I attempt to pivot it on Item as follows:
df.groupBy("Item").\
pivot("Item", ["1","2"]).\
agg(first("Key"))
and I receive:
+----+----+----+
|Item| 1| 2|
+----+----+----+
| 1| A|null|
+----+----+----+
| 2|null| A|
+----+----+----+
But what I want is:
+----+----+
| 1| 2|
+----+----+
| A| A|
+----+----+
How do I keep the Item column from appearing in my output pivot table, which I assume is what messes up my result? I am running Spark 2.3.2 and Python 3.7.0.
Try it without specifying any grouping column (an empty groupBy), so the Item column is not kept in the output:
>>> df.show()
+----+---+
|Item|Key|
+----+---+
| 1| A|
| 2| A|
+----+---+
>>> df.groupBy().pivot("Item", ["1","2"]).agg(first("Key")).show()
+---+---+
| 1| 2|
+---+---+
| A| A|
+---+---+
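For reference, a self-contained version of the same idea (a sketch that assumes a SparkSession named spark, and stores Item as strings so the pivot values match directly):
from pyspark.sql.functions import first

df = spark.createDataFrame([("1", "A"), ("2", "A")], ["Item", "Key"])
df.groupBy().pivot("Item", ["1", "2"]).agg(first("Key")).show()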