How to combine two dataframes replacing null values - python

I have two dataframes. The set of columns in them is slightly different.
df1:
+---+----+----+----+
| id|col1|col2|col3|
+---+----+----+----+
| 1| 15| 20| 8|
| 2| 0|null| 5|
+---+----+----+----+
df2:
+---+----+----+----+
| id|col1|col2|col4|
+---+----+----+----+
| 1| 10| 10| 40|
| 2| 10| 30| 50|
+---+----+----+----+
How can I make a left join in pyspark that keeps df1 but at the same time replaces its null values with values from df2, and also adds the columns that exist only in df2?
result_df:
id col1 col2 col3 col4
1 15 20 8 40
2 0 30 5 50
I need to combine the two dataframes on id to get the extra column col4, and for col1, col2 and col3 take the values from df1, unless a value is null, in which case it should be replaced with the value from df2.

Use the coalesce function after the left join.
from pyspark.sql.functions import *
df1.show()
#+---+----+----+----+
#| id|col1|col2|col3|
#+---+----+----+----+
#| 1| 15| 20| 8|
#| 2| 0|null| 5|
#+---+----+----+----+
df2.show()
#+---+----+----+----+----+
#| id|col1|col2|col3|col4|
#+---+----+----+----+----+
#| 1| 15| 20| 8| 40|
#| 2| 0| 30| 5| 50|
#+---+----+----+----+----+
df1.join(df2, ["id"], "left").\
    select("id",
           coalesce(df2.col1, df1.col1).alias("col1"),
           coalesce(df2.col2, df1.col2).alias("col2"),
           coalesce(df2.col3, df1.col3).alias("col3"),
           df2.col4).\
    show()
+---+----+----+----+----+
| id|col1|col2|col3|col4|
+---+----+----+----+----+
| 1| 15| 20| 8| 40|
| 2| 0| 30| 5| 50|
+---+----+----+----+----+
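Note that the df2 shown in the answer already contains col3 and values that match the desired result. With the df2 from the question (which has col4 instead of col3), a minimal sketch that keeps df1's values and only falls back to df2 on nulls could look like this; the aliases a and b are just there to keep the two sides unambiguous:
from pyspark.sql.functions import coalesce, col

result_df = (
    df1.alias("a")
    .join(df2.alias("b"), ["id"], "left")
    .select(
        "id",
        coalesce(col("a.col1"), col("b.col1")).alias("col1"),  # keep df1, fall back to df2 on null
        coalesce(col("a.col2"), col("b.col2")).alias("col2"),
        col("a.col3"),  # only exists in df1
        col("b.col4"),  # only exists in df2
    )
)
result_df.show()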

Related

Apply value to a dataframe after filtering another dataframe based on conditions

My question:
There are two dataframes, and the info one is still being built.
What I want to do is filter the reference dataframe on a condition: when key is 'b', apply its value (2) to the info table as a whole new column.
The output dataframe below is the final result I want.
Dataframe (info)
+-----+-----+
| key|value|
+-----+-----+
| a| 10|
| b| 20|
| c| 50|
| d| 40|
+-----+-----+
Dataframe (Reference)
+-----+-----+
| key|value|
+-----+-----+
| a| 42|
| b| 2|
| c| 9|
| d| 100|
+-----+-----+
Below is the output I want:
Dataframe (Output)
+-----+-----+-----+
| key|value|const|
+-----+-----+-----+
| a| 10| 2|
| b| 20| 2|
| c| 50| 2|
| d| 40| 2|
+-----+-----+-----+
I have tried several methods; below is the latest one I tried, but the system warned me that pyspark does not have a loc function.
df_cal = (
    info
    .join(reference)
    .withColumn('const', reference.loc[reference['key']=='b', 'value'].iloc[0])
    .select('key', 'result', 'const')
)
df_cal.show()
And below is the warning raised by the system:
AttributeError: 'DataFrame' object has no attribute 'loc'
This solves it:
from pyspark.sql.functions import lit

target = 'b'
const = [row['value'] for row in reference.collect() if row['key'] == target]
df_cal = info.withColumn('const', lit(const[0]))
df_cal.show()
+---+-----+-----+
|key|value|const|
+---+-----+-----+
| a| 10| 2|
| b| 20| 2|
| c| 50| 2|
| d| 40| 2|
+---+-----+-----+
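An alternative that stays on the Spark side instead of collecting the reference dataframe to the driver; a minimal sketch, assuming the info and reference names from the question:
from pyspark.sql import functions as F

# One-row dataframe holding the constant value for key 'b'.
const_row = (
    reference
    .filter(F.col("key") == "b")
    .select(F.col("value").alias("const"))
    .limit(1)
)

# The cross join is cheap here because const_row has exactly one row.
df_cal = info.crossJoin(const_row)
df_cal.show()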

Apply a formula in pyspark

I am new to Spark and I have some doubts about working with dataframes.
My problem is that I need to apply a formula to a pyspark dataframe column using values from other columns.
I have following df
+-------+-------+-------+-------+-------+-------+
| count1| count2|val__00|val__01|val__02|val__03|
+-------+-------+-------+-------+-------+-------+
| 3| 1| 1.0| 0.0| 8.0| 0.0|
| 4| 2| 0.0| 1.379| 1.49| 1.373|
| 2| 5| 0.7| 0.0| 70.4| 0.0|
| 1| 8| 0.0| 4.0| 0.0| 0.0|
+-------+-------+-------+-------+-------+-------+
I need to apply the following formula to the val__xx columns for each row:
val__xx = val__xx + (count1*count2)
so final df will be
+-------+-------+-------+-------+-------+-------+
| count1| count2|val__00|val__01|val__02|val__03|
+-------+-------+-------+-------+-------+-------+
| 3| 1| 4.0| 3.0| 11.0| 3.0|
| 4| 2| 8.0| 9.379| 9.49| 9.373|
| 2| 5| 10.7| 10.0| 80.4| 10.0|
| 1| 8| 8.0| 12.0| 8.0| 8.0|
+-------+-------+-------+-------+-------+-------+
I think I should apply a udf, but I don't know how to pass more than one column. Is it possible to write a function that takes more than one column?
I have implemented the code below, but I don't know how to pass the val__xx columns:
def calculate(c, count1, count2):
    return c + (count1 * count2)

calculateUDF = udf(lambda x: calculate(x, count1, count2))
df_final = df.apply(calculateUDF(col(val__xx????), col(count1), col(count2))
You can do it using withColumn in a for loop; no need for a udf.
from pyspark.sql import functions as f
for i in range(4):
    df = df.withColumn(f'val__0{i}', f.col('count1') * f.col('count2') + f.col(f'val__0{i}'))
df.show()
+------+------+-------+-------+-------+-------+
|count1|count2|val__00|val__01|val__02|val__03|
+------+------+-------+-------+-------+-------+
| 3| 1| 4.0| 3.0| 11.0| 3.0|
| 4| 2| 8.0| 9.379| 9.49| 9.373|
| 2| 5| 10.7| 10.0| 80.4| 10.0|
| 1| 8| 8.0| 12.0| 8.0| 8.0|
+------+------+-------+-------+-------+-------+
If the number of val__ columns reaches double digits, you'll need to left-pad i with zeros.
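If you'd rather not hard-code range(4), a minimal sketch that derives the target columns from the schema and builds them in a single select (assuming every target column name starts with val__):
from pyspark.sql import functions as f

# Collect all formula targets from the schema instead of hard-coding the range.
val_cols = [c for c in df.columns if c.startswith("val__")]

df = df.select(
    "count1",
    "count2",
    *[(f.col("count1") * f.col("count2") + f.col(c)).alias(c) for c in val_cols],
)
df.show()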

Pyspark divide column by its subtotals grouped by another column

My problem is similar to this and this. Both posts show how to divide a column value by the total sum of the same column. In my case I want to divide the values of a column by the sum of subtotals, where a subtotal is calculated by grouping the column values on another column. I am slightly modifying the example mentioned in the links shared above.
Here is my dataframe
df = [[1,'CAT1',10], [2, 'CAT1', 11], [3, 'CAT2', 20], [4, 'CAT2', 22], [5, 'CAT3', 30]]
df = spark.createDataFrame(df, ['id', 'category', 'consumption'])
df.show()
+---+--------+-----------+
| id|category|consumption|
+---+--------+-----------+
| 1| CAT1| 10|
| 2| CAT1| 11|
| 3| CAT2| 20|
| 4| CAT2| 22|
| 5| CAT3| 30|
+---+--------+-----------+
I want to divide the "consumption" value by the subtotal of its "category" group and put the result in a column "normalized", as below.
The subtotals don't need to be in the output (the numbers 21, 42 and 30 in the sum_ column).
What I've achieved so far:
df.crossJoin(
    df.groupby('category').agg(F.sum('consumption').alias('sum_'))
).withColumn("normalized", F.col("consumption") / F.col("sum_"))\
 .show()
+---+--------+-----------+--------+----+-------------------+
| id|category|consumption|category|sum_| normalized|
+---+--------+-----------+--------+----+-------------------+
| 1| CAT1| 10| CAT2| 42|0.23809523809523808|
| 2| CAT1| 11| CAT2| 42| 0.2619047619047619|
| 1| CAT1| 10| CAT1| 21|0.47619047619047616|
| 2| CAT1| 11| CAT1| 21| 0.5238095238095238|
| 1| CAT1| 10| CAT3| 30| 0.3333333333333333|
| 2| CAT1| 11| CAT3| 30|0.36666666666666664|
| 3| CAT2| 20| CAT2| 42|0.47619047619047616|
| 4| CAT2| 22| CAT2| 42| 0.5238095238095238|
| 5| CAT3| 30| CAT2| 42| 0.7142857142857143|
| 3| CAT2| 20| CAT1| 21| 0.9523809523809523|
| 4| CAT2| 22| CAT1| 21| 1.0476190476190477|
| 5| CAT3| 30| CAT1| 21| 1.4285714285714286|
| 3| CAT2| 20| CAT3| 30| 0.6666666666666666|
| 4| CAT2| 22| CAT3| 30| 0.7333333333333333|
| 5| CAT3| 30| CAT3| 30| 1.0|
+---+--------+-----------+--------+----+-------------------+
You can do basically the same as in the links you have already mentioned. The only difference is that you have to calculate the subtotals first with groupby and sum:
import pyspark.sql.functions as F
df = df.join(df.groupby('category').sum('consumption'), 'category')
df = df.select('id', 'category', F.round(F.col('consumption')/F.col('sum(consumption)'), 2).alias('normalized'))
df.show()
Output:
+---+--------+----------+
| id|category|normalized|
+---+--------+----------+
| 3| CAT2| 0.48|
| 4| CAT2| 0.52|
| 1| CAT1| 0.48|
| 2| CAT1| 0.52|
| 5| CAT3| 1.0|
+---+--------+----------+
This is another way of solving the problem proposed by the OP, but without using joins.
Joins are in general costly operations and should be avoided whenever possible.
# We first register our DataFrame as temporary SQL view
df.registerTempTable('table_view')
df = sqlContext.sql("""select id, category,
consumption/sum(consumption) over (partition by category) as normalize
from table_view""")
df.show()
+---+--------+-------------------+
| id|category| normalize|
+---+--------+-------------------+
| 3| CAT2|0.47619047619047616|
| 4| CAT2| 0.5238095238095238|
| 1| CAT1|0.47619047619047616|
| 2| CAT1| 0.5238095238095238|
| 5| CAT3| 1.0|
+---+--------+-------------------+
Note: """ has been used to allow a multiline statement for the sake of visibility and neatness. A plain 'select id ....' string wouldn't work if you tried to spread the statement over multiple lines. Needless to say, the final result is the same.
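The same per-category window sum can also be written with the DataFrame API instead of SQL; a minimal sketch:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# The subtotal is a sum over a window partitioned by category.
w = Window.partitionBy("category")
df_norm = (
    df.withColumn("normalized", F.col("consumption") / F.sum("consumption").over(w))
      .select("id", "category", "normalized")
)
df_norm.show()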

cumulative sum function in pyspark grouping on multiple columns based on condition

I need to create an event_id, basically a counter, grouping on multiple columns (v_id, d_id, ip, l_id) and incrementing it when delta > 40, to get
output like this:
v_id d_id ip l_id delta event_id last_event_flag
1 20 30 40 1 1 N
1 20 30 40 2 1 N
1 20 30 40 3 1 N
1 20 30 40 4 1 Y
1 20 20 40 1 1 Y
1 30 30 40 2 1 N
1 30 30 40 3 1 N
1 30 30 40 4 1 N
1 30 30 40 5 1 Y
I was able to achieve this using a pandas dataframe:
df['event_id'] = (df.delta >= 40.0).groupby([df.l_id, df.v_id, df.d_id, df.ip]).cumsum() + 1
df.append(df['event_id'], ignore_index=True)
but I am seeing a memory error when executing it on larger data.
How can I do a similar thing in pyspark?
In pyspark you can do it using a window function:
First let's create the dataframe. Note that you can also directly load it as a dataframe from a csv:
df = spark.createDataFrame(
    sc.parallelize(
        [[1, 20, 30, 40, 1, 1],
         [1, 20, 30, 40, 2, 1],
         [1, 20, 30, 40, 3, 1],
         [1, 20, 30, 40, 4, 1],
         [1, 20, 30, 40, 45, 2],
         [1, 20, 30, 40, 1, 2],
         [1, 30, 30, 40, 2, 1],
         [1, 30, 30, 40, 3, 1],
         [1, 30, 30, 40, 4, 1],
         [1, 30, 30, 40, 5, 1]]
    ),
    ["v_id", "d_id", "ip", "l_id", "delta", "event_id"]
)
Your table has an implicit ordering, so we need to create a monotonically increasing id to avoid losing that order when the data gets shuffled:
import pyspark.sql.functions as psf
df = df.withColumn(
    "rn",
    psf.monotonically_increasing_id()
)
+----+----+---+----+-----+--------+----------+
|v_id|d_id| ip|l_id|delta|event_id| rn|
+----+----+---+----+-----+--------+----------+
| 1| 20| 30| 40| 1| 1| 0|
| 1| 20| 30| 40| 2| 1| 1|
| 1| 20| 30| 40| 3| 1| 2|
| 1| 20| 30| 40| 4| 1| 3|
| 1| 20| 30| 40| 45| 2| 4|
| 1| 20| 30| 40| 1| 2|8589934592|
| 1| 30| 30| 40| 2| 1|8589934593|
| 1| 30| 30| 40| 3| 1|8589934594|
| 1| 30| 30| 40| 4| 1|8589934595|
| 1| 30| 30| 40| 5| 1|8589934596|
+----+----+---+----+-----+--------+----------+
Now to compute event_id and last_event_flag:
from pyspark.sql import Window
w1 = Window.partitionBy("v_id", "d_id", "l_id", "ip").orderBy("rn")
w2 = Window.partitionBy("v_id", "d_id", "l_id", "ip").orderBy(psf.desc("rn"))
df.withColumn(
    "event_id",
    psf.sum((df.delta >= 40).cast("int")).over(w1) + 1
).withColumn(
    "last_event_flag",
    psf.row_number().over(w2) == 1
).drop("rn").show()
+----+----+---+----+-----+--------+---------------+
|v_id|d_id| ip|l_id|delta|event_id|last_event_flag|
+----+----+---+----+-----+--------+---------------+
| 1| 20| 30| 40| 1| 1| false|
| 1| 20| 30| 40| 2| 1| false|
| 1| 20| 30| 40| 3| 1| false|
| 1| 20| 30| 40| 4| 1| false|
| 1| 20| 30| 40| 45| 2| false|
| 1| 20| 30| 40| 1| 2| true|
| 1| 30| 30| 40| 2| 1| false|
| 1| 30| 30| 40| 3| 1| false|
| 1| 30| 30| 40| 4| 1| false|
| 1| 30| 30| 40| 5| 1| true|
+----+----+---+----+-----+--------+---------------+
Perhaps you should filter with df = df[df.delta >= 40] before running the groupby; I'm not sure if that matters.
Also, you can look into chunksize to perform the calculations on chunks of the csv for memory efficiency. For example, you could break the data into chunks of 10000 lines and run the calculation on each chunk to avoid the memory error; a sketch follows the links below.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
How to read a 6 GB csv file with pandas
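A minimal sketch of the chunked-pandas suggestion, assuming the data sits in a CSV file (the path and the 10000-row chunk size are placeholders):
import pandas as pd

chunks = []
# Read 10000 rows at a time so the full dataset never sits in memory at once.
for chunk in pd.read_csv("events.csv", chunksize=10000):  # hypothetical path
    chunk = chunk[chunk.delta >= 40]  # keep only the rows that start a new event
    chunks.append(chunk)

df_filtered = pd.concat(chunks, ignore_index=True)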

Window timeseries with step in Spark/Scala

I have this input:
timestamp,user
1,A
2,B
5,C
9,E
12,F
The wanted result is:
timestampRange,userList
1 to 2,[A,B]
3 to 4,[] Or null
5 to 6,[C]
7 to 8,[] Or null
9 to 10,[E]
11 to 12,[F]
I tried using Window, but the problem is that it doesn't include the empty timestamp ranges.
Any hints would be helpful.
I don't know if a windowing function will cover the gaps between ranges, but you can take the following approach:
Define a dataframe, df_ranges:
val ranges = List((1,2), (3,4), (5,6), (7,8), (9,10))
val df_ranges = sc.parallelize(ranges).toDF("start", "end")
+-----+---+
|start|end|
+-----+---+
| 1| 2|
| 3| 4|
| 5| 6|
| 7| 8|
| 9| 10|
+-----+---+
Data with the timestamp column, df_data :
val data = List((1,"A"), (2,"B"), (5,"C"), (9,"E"))
val df_data = sc.parallelize(data).toDF("timestamp", "user")
+---------+----+
|timestamp|user|
+---------+----+
| 1| A|
| 2| B|
| 5| C|
| 9| E|
+---------+----+
Join the two dataframes on the start, end and timestamp columns, keeping the result:
val joined = df_ranges.join(df_data,
  df_ranges.col("start").equalTo(df_data.col("timestamp"))
    .or(df_ranges.col("end").equalTo(df_data.col("timestamp"))), "left")
joined.show()
+-----+---+---------+----+
|start|end|timestamp|user|
+-----+---+---------+----+
| 1| 2| 1| A|
| 1| 2| 2| B|
| 5| 6| 5| C|
| 9| 10| 9| E|
| 3| 4| null|null|
| 7| 8| null|null|
+-----+---+---------+----+
Now do a simple aggregation with the collect_list function:
joined.groupBy("start", "end").agg(collect_list("user")).orderBy("start").show()
+-----+---+------------------+
|start|end|collect_list(user)|
+-----+---+------------------+
| 1| 2| [A, B]|
| 3| 4| []|
| 5| 6| [C]|
| 7| 8| []|
| 9| 10| [E]|
+-----+---+------------------+
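The range list above is hard-coded and stops at 9 to 10, so the 11-to-12 bucket from the question is missing. The buckets can also be derived from the data itself; a minimal sketch in PySpark (the question is in Scala, but sequence, explode, min and max exist in the Scala API as well, Spark 2.4+), assuming a fixed step of 2:
from pyspark.sql import functions as F

# Build every 2-wide bucket between the smallest and largest timestamp,
# so that empty buckets exist before the left join.
bounds = df_data.agg(F.min("timestamp").alias("lo"), F.max("timestamp").alias("hi"))
df_ranges = (
    bounds
    .select(F.explode(F.sequence("lo", "hi", F.lit(2))).alias("start"))
    .withColumn("end", F.col("start") + 1)
)
df_ranges.show()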
