How to get the same output in PySpark as in pandas - python

I am trying some basic functions in PySpark like min, max, etc.
When using pandas df.min() I get each separate column and its minimum value, like in the image I have attached.
I need the same output using PySpark code,
but I don't know how to do that.
Please help me with this.

You can try the code below:
# sample data
data = [(1,10,"a"), (2,10,"c"), (0, 100, "t")]
cols = ["col1", "col2", "col3"]
df = spark.createDataFrame(data, cols)
df.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1|  10|   a|
|   2|  10|   c|
|   0| 100|   t|
+----+----+----+
df.selectExpr([f"min({x}) as {x}" for x in cols]).show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   0|  10|   a|
+----+----+----+
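If you want something closer to pandas' df.min() for every column without listing them by hand, you can build the aggregation from df.columns. A minimal sketch, assuming every column supports min:

from pyspark.sql import functions as F

# Aggregate the minimum of every column, keeping the original column names
df.agg(*[F.min(c).alias(c) for c in df.columns]).show()

# The same pattern works for max or any other aggregate function
df.agg(*[F.max(c).alias(c) for c in df.columns]).show()

For quick inspection, df.summary("min", "max").show() (Spark 2.3+) also reports per-column statistics, though in a transposed, string-typed layout.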

Related

How to fill gaps between two rows having the difference expressed in days

I have the following dataframe, where diff_days is the difference in days between each row and the previous row:
+----------+--------+---------+
|   fx_date|   col_1|diff_days|
+----------+--------+---------+
|2020-01-05|       A|     null|
|2020-01-09|       B|        4|
|2020-01-11|       C|        2|
+----------+--------+---------+
I want to get a dataframe that adds rows for the missing dates, with the value of col_1 replicated from the preceding existing row.
It should be:
+----------+--------+
|   fx_date|   col_1|
+----------+--------+
|2020-01-05|       A|
|2020-01-06|       A|
|2020-01-07|       A|
|2020-01-08|       A|
|2020-01-09|       B|
|2020-01-10|       B|
|2020-01-11|       C|
+----------+--------+
You can use the lag and sequence functions to generate the dates between the previous and current row dates, then explode the list like this:
from pyspark.sql import functions as F, Window

df1 = df.withColumn(
    "previous_dt",
    F.date_add(F.lag("fx_date", 1).over(Window.orderBy("fx_date")), 1)
).withColumn(
    "fx_date",
    F.expr("sequence(coalesce(previous_dt, fx_date), fx_date, interval 1 day)")
).withColumn(
    "fx_date",
    F.explode("fx_date")
).drop("previous_dt", "diff_days")
df1.show()
#+----------+-----+
#|   fx_date|col_1|
#+----------+-----+
#|2020-01-05|    A|
#|2020-01-06|    B|
#|2020-01-07|    B|
#|2020-01-08|    B|
#|2020-01-09|    B|
#|2020-01-10|    C|
#|2020-01-11|    C|
#+----------+-----+
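Note that this backfills each gap with the value of the later row (B appears on 2020-01-06 to 2020-01-08). If you instead want the expected output shown above, where col_1 is carried forward from the earlier row, a lead-based variant of the same idea should work; this is an untested sketch along the same lines:

from pyspark.sql import functions as F, Window

df1 = df.withColumn(
    "next_dt",
    F.date_sub(F.lead("fx_date", 1).over(Window.orderBy("fx_date")), 1)
).withColumn(
    "fx_date",
    # generate the dates from the current row up to the day before the next row
    F.explode(F.expr("sequence(fx_date, coalesce(next_dt, fx_date), interval 1 day)"))
).drop("next_dt", "diff_days")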

How to create a dataframe based on a lookup dataframe, create multiple columns dynamically, and map values into the specific columns

I have two dataframes: one is the main dataframe and the other is a lookup dataframe. I need to produce the third one below in a customized form using PySpark. I need to check the values in the column list_IDs, find the matches in the lookup dataframe, and put the corresponding count in the final dataframe. I have tried array_intersect and an array lookup but it is not working.
Main dataframe:
df = spark.createDataFrame([(123, [75319, 75317]), (212, [136438, 25274]), (215, [136438, 75317])], ("ID", "list_IDs"))
df.show()
+---+---------------+
| ID|       list_IDs|
+---+---------------+
|123| [75319, 75317]|
|212|[136438, 25274]|
|215|[136438, 75317]|
+---+---------------+
Lookup Dataframe:
df_2 = spark.createDataFrame([(75319, "Wheat", 20), (75317, "Rice", 10), (136438, "Jowar", 30), (25274, "Rajma", 40)], ("ID", "Material", "Count"))
df_2.show()
+------+--------+-----+
|    ID|Material|Count|
+------+--------+-----+
| 75319|   Wheat|   20|
| 75317|    Rice|   10|
|136438|   Jowar|   30|
| 25274|   Rajma|   40|
+------+--------+-----+
The resultant dataframe needs to be:
+---+---------------+-----+----+-----+-----+
| ID|       list_IDs|Wheat|Rice|Jowar|Rajma|
+---+---------------+-----+----+-----+-----+
|123| [75319, 75317]|   20|  10|    0|    0|
|212|[136438, 25274]|    0|   0|   30|   40|
|215|[136438, 75317]|    0|  10|   30|    0|
+---+---------------+-----+----+-----+-----+
You can join the two dataframes and then pivot:
import pyspark.sql.functions as F

result = df.join(
    df_2,
    F.array_contains(df.list_IDs, df_2.ID)
).groupBy(df.ID, 'list_IDs').pivot('Material').agg(F.first('Count')).fillna(0)
result.show()
+---+---------------+-----+-----+----+-----+
| ID|       list_IDs|Jowar|Rajma|Rice|Wheat|
+---+---------------+-----+-----+----+-----+
|212|[136438, 25274]|   30|   40|   0|    0|
|215|[136438, 75317]|   30|    0|  10|    0|
|123| [75319, 75317]|    0|    0|  10|   20|
+---+---------------+-----+-----+----+-----+
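If you need the columns in a fixed order, you can pass the material names to pivot explicitly (which also saves Spark a pass to discover them) and re-select at the end. A sketch, assuming the four materials shown above are the full set:

import pyspark.sql.functions as F

result = (
    df.join(df_2, F.array_contains(df.list_IDs, df_2.ID))
    .groupBy(df.ID, "list_IDs")
    .pivot("Material", ["Wheat", "Rice", "Jowar", "Rajma"])  # explicit pivot values
    .agg(F.first("Count"))
    .fillna(0)
)
result.select("ID", "list_IDs", "Wheat", "Rice", "Jowar", "Rajma").show()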

Pyspark - Using two time indices for window function

I have a dataframe where each row has two date columns. I would like to create a window function with a range between clause that counts the number of rows in a particular range, where BOTH date columns are within the range. In the case below, both timestamps of a row must be before the timestamps of the current row for it to be included in the count.
Example df including the count column:
+---+-----------+-----------+-----+
| ID|Timestamp_1|Timestamp_2|Count|
+---+-----------+-----------+-----+
|  a|          0|          3|    0|
|  b|          2|          5|    0|
|  d|          5|          5|    3|
|  c|          5|          9|    3|
|  e|          8|         10|    4|
+---+-----------+-----------+-----+
I tried creating two windows and creating the new column over both of these:
w_1 = Window.partitionBy().orderBy('Timestamp_1').rangeBetween(Window.unboundedPreceding, 0)
w_2 = Window.partitionBy().orderBy('Timestamp_2').rangeBetween(Window.unboundedPreceding, 0)
df = df.withColumn('count', F.count('ID').over(w_1).over(w_2))
However, this is not allowed in Pyspark and therefore gives an error.
Any ideas? Solutions in SQL are also fine!
Would a self-join work?
from pyspark.sql import functions as F

df_count = (
    df.alias('a')
    .join(
        df.alias('b'),
        (F.col('b.Timestamp_1') <= F.col('a.Timestamp_1')) &
        (F.col('b.Timestamp_2') <= F.col('a.Timestamp_2')),
        'left'
    )
    .groupBy('a.ID')
    .agg(F.count('b.ID').alias('count'))
)
df = df.join(df_count, 'ID')
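Since SQL solutions are also fine, the same self-join can be written with spark.sql; this is a sketch assuming the dataframe is registered as a temporary view first (the view name df_view is just illustrative):

# Expose the dataframe to SQL and count rows whose two timestamps are both <= the current row's
df.createOrReplaceTempView("df_view")

df_count = spark.sql("""
    SELECT a.ID, COUNT(b.ID) AS count
    FROM df_view a
    LEFT JOIN df_view b
      ON b.Timestamp_1 <= a.Timestamp_1
     AND b.Timestamp_2 <= a.Timestamp_2
    GROUP BY a.ID
""")
df = df.join(df_count, 'ID')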

Pyspark - Transpose multiple dataframes

I have multiple dataframes that look like this.
df1:
+---------+---------+---------+
|sum(col1)|sum(col2)|sum(col3)|
+---------+---------+---------+
|       10|        1|        0|
+---------+---------+---------+
df2:
+---------+---------+
|sum(col1)|sum(col2)|
+---------+---------+
|       20|        6|
+---------+---------+
df3:
+---------+---------+---------+---------+
|sum(col1)|sum(col2)|sum(col3)|sum(col4)|
+---------+---------+---------+---------+
|        1|        5|        3|        4|
+---------+---------+---------+---------+
For the above example, the output should look like this:
+--------+------+------+------+
|col_name|value1|value2|value3|
+--------+------+------+------+
|    col1|    10|    20|     1|
|    col2|     1|     6|     5|
|    col3|     0|  null|     3|
|    col4|  null|  null|     4|
+--------+------+------+------+
I am using Spark 1.6.3 for this. In the above example I have a different sum calculation for each particular table, but I have multiple tables to compute these sums for, and the output should be consolidated into the format above.
Any ideas on how to accomplish this?
This is probably easiest to do outside of PySpark, and if the data you are working with is small enough, that is probably what you should do, because doing this in PySpark will not be especially efficient.
If for some reason you need to do this in PySpark, you can do it with several dataframe transformations. The first thing we need to do is convert all of the individual dataframes to the same schema, which will allow us to iteratively select from each and union into a final result. The following is one way to achieve this.
from pyspark.sql.functions import lit, col
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

a = [[10, 1, 0]]
b = [[20, 6]]
c = [[1, 5, 3, 4]]
dfa = spark.createDataFrame(a, ['col1', 'col2', 'col3'])
dfb = spark.createDataFrame(b, ['col1', 'col2'])
dfc = spark.createDataFrame(c, ['col1', 'col2', 'col3', 'col4'])
dfdict = {'dfa': dfa, 'dfb': dfb, 'dfc': dfc}

# union of all column names across the dataframes
columns = set(colname for dfname in dfdict for colname in dfdict[dfname].columns)
for dfname in dfdict:
    for colname in columns - set(dfdict[dfname].columns):
        dfdict[dfname] = dfdict[dfname].withColumn(colname, lit(None).cast(StringType()))

schema = StructType([StructField("col_name", StringType(), True)] +
                    [StructField("value_" + dfname, IntegerType(), True) for dfname in dfdict])

resultdf = spark.createDataFrame([], schema=schema)
for colname in columns:
    resultdf = resultdf.union(
        dfdict['dfa'].select(lit(colname).alias('col_name'),
                             col(colname).alias('value_dfa'))
        .crossJoin(dfdict['dfb'].select(col(colname).alias('value_dfb')))
        .crossJoin(dfdict['dfc'].select(col(colname).alias('value_dfc')))
    )
resultdf.orderBy('col_name').show()
+--------+---------+---------+---------+
|col_name|value_dfa|value_dfb|value_dfc|
+--------+---------+---------+---------+
|    col1|       10|       20|        1|
|    col2|        1|        6|        5|
|    col3|        0|     null|        3|
|    col4|     null|     null|        4|
+--------+---------+---------+---------+
There may be ways to improve efficiency of this by removing the cross joins and replacing them with something more clever.
If you need to work with starting dataframes that have multiple rows, you would need to aggregate the rows together first (or change the requirements of the expected output). For instance, you may want to sum everything, as in the following example.
from pyspark.sql.functions import sum

d = [[1, 2, 3], [4, 5, 6]]
dfd = spark.createDataFrame(d, ['col1', 'col2', 'col3'])
dfdagg = dfd.groupby().agg(*[sum(colname) for colname in dfd.columns])
Where dfdagg can now be used in the same way that the other dataframes have been used above.
Alternatively, you can use the stack function to transpose the dataframes and then merge them:
>>> df1x = df1.selectExpr("stack(3, 'col1', col1, 'col2', col2, 'col3', col3) as (col_name, value1)")
>>> df1x.show()
+--------+------+
|col_name|value1|
+--------+------+
|    col1|    10|
|    col2|     1|
|    col3|     0|
+--------+------+
>>> df2x = df2.selectExpr("stack(2, 'col1', col1, 'col2', col2) as (col_name, value2)")
>>> df2x.show()
+--------+------+
|col_name|value2|
+--------+------+
|    col1|    20|
|    col2|     6|
+--------+------+
>>> df3x = df3.selectExpr("stack(4, 'col1', col1, 'col2', col2, 'col3', col3, 'col4', col4) as (col_name, value3)")
>>> df3x.show()
+--------+------+
|col_name|value3|
+--------+------+
|    col1|     1|
|    col2|     5|
|    col3|     3|
|    col4|     4|
+--------+------+
>>> df1x.join(df2x, "col_name", "full").join(df3x, "col_name", "full").sort("col_name").show()
+--------+------+------+------+
|col_name|value1|value2|value3|
+--------+------+------+------+
|    col1|    10|    20|     1|
|    col2|     1|     6|     5|
|    col3|     0|  null|     3|
|    col4|  null|  null|     4|
+--------+------+------+------+
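If your real dataframes carry the aggregated column names such as sum(col1), hard-coding each stack expression gets tedious. One way to build it from df.columns is sketched below; transpose_single_row is just an illustrative helper name, and this assumes Spark 2.x+ with Python 3.6+ for the f-strings:

def transpose_single_row(df, value_alias):
    # Build "stack(n, 'c1', `c1`, 'c2', `c2`, ...)"; the backticks keep
    # column names like sum(col1) parseable inside the SQL expression.
    pairs = ", ".join(f"'{c}', `{c}`" for c in df.columns)
    return df.selectExpr(f"stack({len(df.columns)}, {pairs}) as (col_name, {value_alias})")

df1x = transpose_single_row(df1, "value1")
df2x = transpose_single_row(df2, "value2")
df3x = transpose_single_row(df3, "value3")
df1x.join(df2x, "col_name", "full").join(df3x, "col_name", "full").sort("col_name").show()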

Remove all rows that are duplicates with respect to some columns

I've seen a couple questions like this but not a satisfactory answer for my situation. Here is a sample DataFrame:
+------+-----+----+
|    id|value|type|
+------+-----+----+
|283924|  1.5|   0|
|283924|  1.5|   1|
|982384|  3.0|   0|
|982384|  3.0|   1|
|892383|  2.0|   0|
|892383|  2.5|   1|
+------+-----+----+
I want to identify duplicates by just the "id" and "value" columns, and then remove all instances.
In this case:
Rows 1 and 2 are duplicates (again we are ignoring the "type" column)
Rows 3 and 4 are duplicates, and therefore only rows 5 & 6 should remain:
The output would be:
+------+-----+----+
|    id|value|type|
+------+-----+----+
|892383|  2.5|   1|
|892383|  2.0|   0|
+------+-----+----+
I've tried
df.dropDuplicates(subset = ['id', 'value'], keep = False)
But the "keep" option isn't available in PySpark (as it is in pandas.DataFrame.drop_duplicates).
How else could I do this?
You can do that using window functions:
from pyspark.sql import Window, functions as F

df.withColumn(
    'fg',
    F.count("id").over(Window.partitionBy("id", "value"))
).where("fg = 1").drop("fg").show()
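The same idea expressed through Spark SQL, as a sketch assuming you register the dataframe as a temporary view first (the view name df_view is just illustrative):

df.createOrReplaceTempView("df_view")

# Keep only (id, value) groups that occur exactly once
spark.sql("""
    SELECT id, value, type
    FROM (
        SELECT *, COUNT(*) OVER (PARTITION BY id, value) AS fg
        FROM df_view
    ) t
    WHERE fg = 1
""").show()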
You can groupBy the id and value columns to get the count. Then use a join to filter out the rows in your DataFrame where the count is not 1:
df.join(
    df.groupBy('id', 'value').count().where('count = 1').drop('count'),
    on=['id', 'value']
).show()
#+------+-----+----+
#|    id|value|type|
#+------+-----+----+
#|892383|  2.5|   1|
#|892383|  2.0|   0|
#+------+-----+----+
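If you are on Spark 3.2 or later, the pandas API on Spark may give you the pandas-style call directly; this is a sketch and worth checking against your version:

# Convert to a pandas-on-Spark frame, use the pandas-style keep=False, convert back
psdf = df.pandas_api()
result = psdf.drop_duplicates(subset=['id', 'value'], keep=False).to_spark()
result.show()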
