Sum Product in PySpark - python

I have a pyspark dataframe like this
data = [(("ID1", 10, 30)), (("ID2", 20, 60))]
df1 = spark.createDataFrame(data, ["ID", "colA", "colB"])
df1.show()
df1:
+---+----+----+
| ID|colA|colB|
+---+----+----+
|ID1|  10|  30|
|ID2|  20|  60|
+---+----+----+
I have Another dataframe like this
data = [(("colA", 2)), (("colB", 5))]
df2 = spark.createDataFrame(data, ["Column", "Value"])
df2.show()
df2:
+------+-----+
|Column|Value|
+------+-----+
|  colA|    2|
|  colB|    5|
+------+-----+
I want to divide every column in df1 by the respective value in df2. Hence df3 will look like
df3:
+---+-----------+-----------+
| ID|       colA|       colB|
+---+-----------+-----------+
|ID1|  10/2 = 5 |  30/5 = 6 |
|ID2| 20/2 = 10 | 60/5 = 12 |
+---+-----------+-----------+
Ultimately, I want to add colA and colB to get the final df4 per ID
df4:
+---+-------------+
| ID|     finalSum|
+---+-------------+
|ID1|   5 + 6 = 11|
|ID2| 10 + 12 = 22|
+---+-------------+

The idea is to join both DataFrames together and then apply the division. Since df2 contains the column names and their respective values, we need to pivot() it first and then join it with the main table df1. (Pivoting is an expensive operation, but it should be fine as long as the DataFrame is small.)
# Loading the requisite packages
from pyspark.sql.functions import col
from functools import reduce
from operator import add
# Creating the DataFrames
df1 = spark.createDataFrame([('ID1', 10, 30), ('ID2', 20, 60)], ('ID', 'ColA', 'ColB'))
df2 = spark.createDataFrame([('ColA', 2), ('ColB', 5)], ('Column', 'Value'))
The code is fairly generic, so we don't need to specify the column names ourselves. First, find the column names we need to operate on: everything except ID.
# This contains the list of columns where we apply mathematical operations
columns_to_be_operated = df1.columns
columns_to_be_operated.remove('ID')
print(columns_to_be_operated)
['ColA', 'ColB']
Next, pivot df2 so that it can be joined to df1.
# Pivoting the df2 to get the rows in column form
df2 = df2.groupBy().pivot('Column').sum('Value')
df2.show()
+----+----+
|ColA|ColB|
+----+----+
|   2|   5|
+----+----+
We rename the columns so that the join does not produce duplicate column names. We do so by adding the suffix _x to all of df2's column names.
# Dynamically changing the name of the columns in df2
df2 = df2.select([col(c).alias(c+'_x') for c in df2.columns])
df2.show()
+------+------+
|ColA_x|ColB_x|
+------+------+
|     2|     5|
+------+------+
Next we join the tables with a Cartesian join. (Note that you may run into memory issues if df2 is large.)
df = df1.crossJoin(df2)
df.show()
+---+----+----+------+------+
| ID|ColA|ColB|ColA_x|ColB_x|
+---+----+----+------+------+
|ID1|  10|  30|     2|     5|
|ID2|  20|  60|     2|     5|
+---+----+----+------+------+
Finally, divide each column by its corresponding value and add up the results. reduce() applies the two-argument function add() cumulatively to the items of the sequence.
df = df.withColumn(
    'finalSum',
    reduce(add, [col(c) / col(c + '_x') for c in columns_to_be_operated])
).select('ID', 'finalSum')
df.show()
+---+--------+
| ID|finalSum|
+---+--------+
|ID1|    11.0|
|ID2|    22.0|
+---+--------+
Note: the OP has to be careful with division by 0. The snippet above can be altered to take this condition into account, for example as sketched below.
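A minimal sketch of such a guard, assuming a zero (or null) divisor should simply contribute 0 to the sum rather than null out the whole result:
from pyspark.sql.functions import when, lit
# Replace a term with 0 whenever its divisor is 0 or null (assumed to be the desired behaviour)
safe_terms = [
    when(col(c + '_x') != 0, col(c) / col(c + '_x')).otherwise(lit(0))
    for c in columns_to_be_operated
]
df = df.withColumn('finalSum', reduce(add, safe_terms)).select('ID', 'finalSum')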

Related

PySpark - Filter dataframe columns based on list

I have a dataframe with some column names and I want to filter out some columns based on a list.
I have a list of columns I would like to have in my final dataframe:
final_columns = ['A','C','E']
My dataframe is this:
data1 = [("James", "Lee", "Smith","36636"),
("Michael","Rose","Boots","40288")]
schema1 = StructType([StructField("A",StringType(),True),
StructField("B",StringType(),True),
StructField("C",StringType(),True),
StructField("D",StringType(),True)])
df1 = spark.createDataFrame(data=data1,schema=schema1)
I would like to transform df1 in order to have the columns of this final_columns list.
So, basically, I expect the resulting dataframe to look like this
+--------+------+------+
|   A    |  C   |  E   |
+--------+------+------+
| James  |Smith |      |
|Michael |Boots |      |
+--------+------+------+
Is there any smart way to do this?
Thank you in advance
You can do so with select and a list comprehension. The idea is to loop through final_columns; if a column is in df1.columns, add it, and if it's not, use lit to add it with the proper alias.
You can write similar logic with a for loop if you find list comprehensions less readable (a sketch follows after the output below).
from pyspark.sql.functions import lit
df1.select([c if c in df1.columns else lit(None).alias(c) for c in final_columns]).show()
+-------+-----+----+
|      A|    C|   E|
+-------+-----+----+
|  James|Smith|null|
|Michael|Boots|null|
+-------+-----+----+
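The same logic with an explicit for loop (a sketch, equivalent to the comprehension above):
from pyspark.sql.functions import lit

cols = []
for c in final_columns:
    if c in df1.columns:
        cols.append(c)
    else:
        cols.append(lit(None).alias(c))
df1.select(cols).show()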
Here is one way: use the DataFrame drop() method with a list which represents the symmetric difference between the DataFrame's current columns and your list of final columns.
df = spark.createDataFrame([(1, 1, "1", 0.1),(1, 2, "1", 0.2),(3, 3, "3", 0.3)],('a','b','c','d'))
df.show()
+---+---+---+---+
|  a|  b|  c|  d|
+---+---+---+---+
|  1|  1|  1|0.1|
|  1|  2|  1|0.2|
|  3|  3|  3|0.3|
+---+---+---+---+
# list of desired final columns
final_cols = ['a', 'c', 'd']
df2 = df.drop( *set(final_cols).symmetric_difference(df.columns) )
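For reference, with final_cols = ['a', 'c', 'd'] and df.columns = ['a', 'b', 'c', 'd'], the set expression evaluates to the single column being dropped:
set(final_cols).symmetric_difference(df.columns)  # -> {'b'}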
Note an alternate syntax for the symmetric difference operation:
df2 = df.drop( *(set(final_cols) ^ set(df.columns)) )
This gives me:
+---+---+---+
|  a|  c|  d|
+---+---+---+
|  1|  1|0.1|
|  1|  1|0.2|
|  3|  3|0.3|
+---+---+---+
Which I believe is what you want.
Based on your requirement, I have written dynamic code. It selects columns based on the list provided and also adds an empty column when a column from the list is not present in the source/original dataframe.
data1 = [("James", "Lee", "Smith","36636"),
("Michael","Rose","Boots","40288")]
schema1 = StructType([StructField("A",StringType(),True),
StructField("B",StringType(),True),
StructField("C",StringType(),True),
StructField("D",StringType(),True)])
df1 = spark.createDataFrame(data=data1,schema=schema1)
actual_columns = df1.schema.names
final_columns = ['A','C','E']
def Diff(li1, li2):
diff = list(set(li2) - set(li1))
return diff
def Same(li1, li2):
same = list(sorted(set(li1).intersection(li2)))
return same
df1 = df1.select(*Same(actual_columns,final_columns))
for i in Diff(actual_columns,final_columns):
df1 = df1.withColumn(""+i+"",lit(''))
display(df1)

Joining PySpark dataframes with conditional result column

I have these tables:
df1:
+---+------------+
| id|   many_cols|
+---+------------+
|  1|lots_of_data|
|  2|lots_of_data|
|  3|lots_of_data|
+---+------------+

df2:
+---+---------+
| id|criterion|
+---+---------+
|  1|    false|
|  1|     true|
|  1|     true|
|  3|    false|
+---+---------+
I intend to create an additional column in df1:
+---+------------+------+
| id|   many_cols|result|
+---+------------+------+
|  1|lots_of_data|     1|
|  2|lots_of_data|  null|
|  3|lots_of_data|     0|
+---+------------+------+
result should be 1 if there is a corresponding true in df2
result should be 0 if there's no corresponding true in df2
result should be null if there is no corresponding id in df2
I cannot think of an efficient way to do it. I am stuck with only the 3rd condition working after a join:
df = df1.join(df2, 'id', 'full')
df.show()
# +---+------------+---------+
# | id|   many_cols|criterion|
# +---+------------+---------+
# |  1|lots_of_data|    false|
# |  1|lots_of_data|     true|
# |  1|lots_of_data|     true|
# |  3|lots_of_data|    false|
# |  2|lots_of_data|     null|
# +---+------------+---------+
PySpark dataframes are created like this:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder.getOrCreate()
df1cols = ['id', 'many_cols']
df1data = [(1, 'lots_of_data'),
           (2, 'lots_of_data'),
           (3, 'lots_of_data')]
df2cols = ['id', 'criterion']
df2data = [(1, False),
           (1, True),
           (1, True),
           (3, None)]
df1 = spark.createDataFrame(df1data, df1cols)
df2 = spark.createDataFrame(df2data, df2cols)
A simple way would be to group df2 by id to get the max criterion per id, then join with df1; this way you reduce the number of rows to join. The max of a boolean column is true if there is at least one true value:
from pyspark.sql import functions as F
df2_group = df2.groupBy("id").agg(F.max("criterion").alias("criterion"))
result = df1.join(df2_group, ["id"], "left").withColumn(
    "result",
    F.col("criterion").cast("int")
).drop("criterion")
result.show()
#+---+------------+------+
#| id|   many_cols|result|
#+---+------------+------+
#|  1|lots_of_data|     1|
#|  3|lots_of_data|     0|
#|  2|lots_of_data|  null|
#+---+------------+------+
You can try a correlated subquery to get the maximum Boolean from df2, and cast that to an integer.
df1.createOrReplaceTempView('df1')
df2.createOrReplaceTempView('df2')
df = spark.sql("""
    select
        df1.*,
        (select int(max(criterion)) from df2 where df1.id = df2.id) as result
    from df1
""")
df.show()
+---+------------+------+
| id|   many_cols|result|
+---+------------+------+
|  1|lots_of_data|     1|
|  3|lots_of_data|     0|
|  2|lots_of_data|  null|
+---+------------+------+
Check out this solution. After joining, you can apply multiple condition checks based on your requirement and assign values accordingly using a when clause, then take the max value of result while grouping by id and the other columns. You can also use a window function to calculate the max of result if you are partitioning by id alone.
from pyspark.sql import functions as F
from pyspark.sql.window import Window
df1cols = ['id', 'many_cols']
df1data = [(1, 'lots_of_data'),
           (2, 'lots_of_data'),
           (3, 'lots_of_data')]
df2cols = ['id', 'criterion']
df2data = [(1, False),
           (1, True),
           (1, True),
           (3, False)]
df1 = spark.createDataFrame(df1data, df1cols)
df2 = spark.createDataFrame(df2data, df2cols)

df2_mod = df2.withColumnRenamed("id", "id_2")
df3 = df1.join(df2_mod, on=df1.id == df2_mod.id_2, how='left')

cond1 = (F.col("id") == F.col("id_2")) & (F.col("criterion") == 1)
cond2 = (F.col("id") == F.col("id_2")) & (F.col("criterion") == 0)
cond3 = (F.col("id_2").isNull())

df3.select("id", "many_cols", F.when(cond1, 1).when(cond2, 0).when(cond3, F.lit(None)).alias("result"))\
    .groupBy("id", "many_cols").agg(F.max(F.col("result")).alias("result")).orderBy("id").show()
Result:
------
+---+------------+------+
| id|   many_cols|result|
+---+------------+------+
|  1|lots_of_data|     1|
|  2|lots_of_data|  null|
|  3|lots_of_data|     0|
+---+------------+------+
Using a window function:
w = Window().partitionBy("id")
df3.select("id", "many_cols", F.when(cond1, 1).when(cond2, 0).when(cond3, F.lit(None)).alias("result"))\
    .select("id", "many_cols", F.max("result").over(w).alias("result")).drop_duplicates().show()
I had to merge the ideas from the proposed answers to get the solution that suited me best.
# The `cond` variable is very useful, here it represents several complex conditions
cond = F.col('criterion') == True
df2_grp = df2.select(
    'id',
    F.when(cond, 1).otherwise(0).alias('c')
).groupBy('id').agg(F.max(F.col('c')).alias('result'))
df = df1.join(df2_grp, 'id', 'left')
df.show()
#+---+------------+------+
#| id|   many_cols|result|
#+---+------------+------+
#|  1|lots_of_data|     1|
#|  3|lots_of_data|     0|
#|  2|lots_of_data|  null|
#+---+------------+------+
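As the comment above hints, cond can bundle several checks into one expression; a purely hypothetical example (some_other_col is illustrative only and not part of the original data):
cond = (F.col('criterion') == True) & (F.col('some_other_col') > 0)  # some_other_col is hypothetical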

PySpark: Filter out all lines which have more columns than the header line

df = spark.read.csv('input.csv', header=True, inferSchema=True)
Let's say that 'input.csv' file contains following data:
id, name, age
1, John, 20
2, Mike, 33
3, Phil, 19, 180, 78
4, Sean, 40
I would like to filter out rows, which contain more columns than the header and save it to different output somewhat like this (illustratively):
df2 = df.filter(condition1) #condition1 = rows which have more columns than header
df = df.filter(condition2) #condition2 = rows which have same amount or less columns than header
df.show()
df2.show()
So I would get output as follows:
+---+------+----+
| id|  name| age|
+---+------+----+
|  1|  John|  20|
|  2|  Mike|  33|
|  4|  Sean|  40|
+---+------+----+
+---+------+----+
| id|  name| age|
+---+------+----+
|  3|  Phil|  19|
+---+------+----+
So far I've found nothing. Currently it just shrinks the row to fit the header with no way of obtaining it. What can I do?
Thanks
EDIT:
The schema does not necessarily need to be "id", "name", "age". It should literally take anything, so the filtering cannot depend on a specific column. Moreover, the solution cannot be exclusive to a specific way of reading the data; the data will be received based on what a user chooses, and the only things that can be modified are parameters and options.
You can read the file as a text file and find the number of columns in each row. Obtain a df with the ids of the rows having more columns than the header (more than 3 here). Then do a semi or anti join to the dataframe (obtained with read.csv) using the id. A schema-independent variant is sketched after the output below.
text = spark.read.text('input.csv')
textdf = text.selectExpr(
    "value",
    "size(split(value, ',')) len",
    "split(value, ',')[0] id"
).filter('len > 3')
textdf.show()
+----------------+---+---+
|           value|len| id|
+----------------+---+---+
|3,Phil,19,180,78|  5|  3|
+----------------+---+---+
df = spark.read.csv('input.csv', header=True)
df1 = df.join(textdf, 'id', 'anti')
df2 = df.join(textdf, 'id', 'semi')
df1.show()
+---+----+---+
| id|name|age|
+---+----+---+
|  1|John| 20|
|  2|Mike| 33|
|  4|Sean| 40|
+---+----+---+
df2.show()
+---+----+---+
| id|name|age|
+---+----+---+
|  3|Phil| 19|
+---+----+---+
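Since the EDIT asks for a schema-independent check, one possible variant derives the threshold from the header line instead of hard-coding 3 (a sketch, assuming the first row of the text read is the header and the first field can serve as a join key):
from pyspark.sql import functions as F

text = spark.read.text('input.csv')
n_cols = len(text.first()['value'].split(','))  # assumed: the first row is the header

textdf = text.selectExpr(
    "value",
    "size(split(value, ',')) len",
    "split(value, ',')[0] id"
).filter(F.col('len') > n_cols)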

Taking max number from one dataframe based on the 'timestamp' and 'id' of another dataframe

I want to see the last amount given by each customer, and the time of each customer's last sale.
I have two dataframes:
DF1:
+----------+-----------+-----------+
|    ref_ID|     Amount|  Sale time|
+----------+-----------+-----------+
|  11111111|        100| 2014-04-21|
|  22222222|         60| 2013-07-04|
|  33333333|         12| 2017-08-02|
|  22222222|         90| 2014-05-02|
|  22222222|         80| 2017-08-02|
|  11111111|         30| 2014-05-02|
+----------+-----------+-----------+
DF2:
+----------+---------+
|        ID| num_sale|
+----------+---------+
|  11111111|        2|
|  33333333|        1|
|  22222222|        3|
+----------+---------+
I need this output:
+----------+---------+---------------+----------------+
|        ID| num_sale| last_sale_time|last_sale_amount|
+----------+---------+---------------+----------------+
|  11111111|        2|     2014-05-02|              30|
|  33333333|        1|     2017-08-02|              12|
|  22222222|        3|     2017-08-02|              80|
+----------+---------+---------------+----------------+
What I am trying to do is:
last_sale_amount = []
for index, row in df.iterrows():
    try:
        last_sale_amount = max(df2.loc[df['id'] == row['f_id'], 'last_sale_time'])
        print(str(last_sale_amount))
        num_attempt.append(last_sale_amount)
    except KeyError:
        last_sale_amount.append(0)
ad['last_sale_amount'] = last_sale_amount
You can use groupby to get the maximum sale time for each ref_ID, then merge back the info from df1 and df2:
df_maxsale = df1.groupby('ref_ID')['Sale time'].max().to_frame().reset_index() \
    .merge(df1, how='left', on=['ref_ID', 'Sale time']) \
    .merge(df2, how='left', left_on='ref_ID', right_on='ID')
note: .max() returns a series with ref_ID as the index, so you need to call to_frame().reset_index() so that ref_ID is a column and you can merge on it and Sale time
We can group by ref_ID after sorting on Sale time and take the last row of each group.
df1 = df1.sort_values('Sale time').groupby('ref_ID').last().reset_index()
And then merge it with dataframe 2 (df2).
df2 = df2.merge(df1, left_on="ID", right_on="ref_ID", how="left")
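To match the requested output exactly, a possible final tidy-up (the new column names are taken from the desired output above):
df2 = df2.rename(columns={'Sale time': 'last_sale_time', 'Amount': 'last_sale_amount'})
df2 = df2[['ID', 'num_sale', 'last_sale_time', 'last_sale_amount']]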

Select a range in Pyspark

I have a spark dataframe in python. And, it was sorted based on a column. How can I select a specific range of data (for example 50% of data in the middle)? For example, if I have 1M data, I want to take data from 250K to 750K index. How can I do that without using collect in pyspark?
To be more precise, I want something like take function to get results between a range. For example, something like take(250000, 750000).
Here is one way to select a range in a pyspark DF:
Create DF
df = spark.createDataFrame(
    data=[(10, "2018-01-01"), (22, "2017-01-01"), (13, "2014-01-01"), (4, "2015-01-01"),
          (35, "2013-01-01"), (26, "2016-01-01"), (7, "2012-01-01"), (18, "2011-01-01")],
    schema=["amount", "date"]
)
df.show()
+------+----------+
|amount|      date|
+------+----------+
|    10|2018-01-01|
|    22|2017-01-01|
|    13|2014-01-01|
|     4|2015-01-01|
|    35|2013-01-01|
|    26|2016-01-01|
|     7|2012-01-01|
|    18|2011-01-01|
+------+----------+
Sort (on date) and insert index (based on row number)
from pyspark.sql.window import Window
from pyspark.sql import functions as F
w = Window.orderBy("date")
df = df.withColumn("index", F.row_number().over(w))
df.show()
+------+----------+-----+
|amount|      date|index|
+------+----------+-----+
|    18|2011-01-01|    1|
|     7|2012-01-01|    2|
|    35|2013-01-01|    3|
|    13|2014-01-01|    4|
|     4|2015-01-01|    5|
|    26|2016-01-01|    6|
|    22|2017-01-01|    7|
|    10|2018-01-01|    8|
+------+----------+-----+
Get The Required Range (assume we want everything between rows 3 and 6)
df1 = df.filter(df.index.between(3, 6))
df1.show()
+------+----------+-----+
|amount|      date|index|
+------+----------+-----+
|    35|2013-01-01|    3|
|    13|2014-01-01|    4|
|     4|2015-01-01|    5|
|    26|2016-01-01|    6|
+------+----------+-----+
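If the goal is specifically the middle 50% and the total row count is not known up front, percent_rank over the same window is a possible alternative (a sketch; like row_number with this window, it pulls all rows into a single partition):
df_mid = df.withColumn("pct", F.percent_rank().over(w)) \
    .filter(F.col("pct").between(0.25, 0.75)) \
    .drop("pct")
df_mid.show()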
This is very simple using between; for example, assuming your sorted column is named index:
df_sample = df.select(df.somecolumn, df.index).filter(df.index.between(250000, 750000))
Once you create the new dataframe df_sample, you can perform any operation (including take or collect) as per your need.
