I want to fill missing values with the last row's value in PySpark

My df has multiple columns
Query I tried:
df = df.withColumn('Column_required', F.when(df.Column_present > 1, df.Column_present).otherwise(lag(df.Column_present)))
I am not able to get the otherwise part to work.
The column I want to operate on:
Column_present  Column_required
40000           40000
Null            40000
Null            40000
500             500
Null            500
Null            500

I think the solution is to use last instead of lag:
from pyspark.sql import Window
from pyspark.sql.functions import when, last

df_new = spark.createDataFrame([
    (1, 40000), (2, None), (3, None), (4, None),
    (5, 500), (6, None), (7, None)
], ("id", "Col_present"))

df_new.withColumn(
    'Column_required',
    when(df_new.Col_present > 1, df_new.Col_present)
    .otherwise(last(df_new.Col_present, ignorenulls=True).over(Window.orderBy("id")))
).show()
This will produce your desired output:
+---+-----------+---------------+
| id|Col_present|Column_required|
+---+-----------+---------------+
| 1| 40000| 40000|
| 2| null| 40000|
| 3| null| 40000|
| 4| null| 40000|
| 5| 500| 500|
| 6| null| 500|
| 7| null| 500|
+---+-----------+---------------+
But be aware that the window function requires a column to perform the sorting; that is why I used the id column in the example. If your dataframe does not contain a sortable column, you can create an id column yourself with monotonically_increasing_id().
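For completeness, here is a hedged sketch of that suggestion applied to the original dataframe (column names taken from the question; note that the generated ids only reflect the dataframe's current row order, which monotonically_increasing_id() does not by itself guarantee across repartitioning):
from pyspark.sql import functions as F, Window

df = df.withColumn("id", F.monotonically_increasing_id())
w = Window.orderBy("id")
df = df.withColumn(
    "Column_required",
    # coalesce is used here instead of the when(...) condition, since the goal is
    # simply to keep existing values and forward-fill the nulls
    F.coalesce(F.col("Column_present"), F.last("Column_present", ignorenulls=True).over(w)),
)
df.show()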

Related

Pyspark replace NA by searching another column for the same value

A value in column_1 cannot map to multiple values in column_2, so the same id always has the same value.
column_1  column_2
52        A
78        B
52
Expected:
column_1  column_2
52        A
78        B
52        A
In other words, for a row with a missing column_2, search for another row with the same column_1 value and take its column_2. I have a working solution in R, but I couldn't find a similar approach in PySpark.
Since the same id will always have the same value, as you have stated, one way to achieve this is to use the inherent order present within your data and use the lagged value to populate the missing values. You can use lag to generate the previous value associated with your col_1 and coalesce to get the first non-null value of the two.
Data Preparation
import pandas as pd
from pyspark.sql import functions as F, Window

df = pd.DataFrame({
    'col_1': [52, 78, 52, 52, 78, 78],
    'col_2': ['A', 'B', None, 'A', 'B', None]
})

sparkDF = spark.createDataFrame(df)  # `spark` is the active SparkSession
sparkDF.show()
+-----+-----+
|col_1|col_2|
+-----+-----+
| 52| A|
| 78| B|
| 52| null|
| 52| A|
| 78| B|
| 78| null|
+-----+-----+
Lag
# Order each col_1 partition so non-null values come first (desc places nulls last),
# then take the previous row's col_2
window = Window.partitionBy('col_1').orderBy(F.col('col_2').desc())
sparkDF = sparkDF.withColumn('col_2_lag', F.lag('col_2').over(window))
sparkDF.show()
+-----+-----+---------+
|col_1|col_2|col_2_lag|
+-----+-----+---------+
| 52| A| null|
| 52| A| A|
| 52| null| A|
| 78| B| null|
| 78| B| B|
| 78| null| B|
+-----+-----+---------+
Coalesce
sparkDF = sparkDF.withColumn('col_2',F.coalesce(F.col('col_2'),F.col('col_2_lag'))).drop('col_2_lag')
sparkDF.show()
+-----+-----+
|col_1|col_2|
+-----+-----+
| 52| A|
| 52| A|
| 52| A|
| 78| B|
| 78| B|
| 78| B|
+-----+-----+
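One caveat worth hedging: lag only looks one row back, so if a col_1 group ever contained two consecutive nulls, the second one would remain null. A sketch that is robust to that case replaces the lag/coalesce pair with last(..., ignorenulls=True) over the same window:
sparkDF = sparkDF.withColumn(
    'col_2',
    # last(..., ignorenulls=True) returns the most recent non-null col_2 within the
    # window ordering defined above, so runs of nulls are all filled
    F.coalesce(F.col('col_2'), F.last('col_2', ignorenulls=True).over(window))
)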
I would do something like this, using max:
from pyspark.sql import functions as F, Window
df.withColumn(
    "column_2",
    F.coalesce(
        F.col("column_2"), F.max("column_2").over(Window.partitionBy("column_1"))
    ),
).show()
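This works because max ignores nulls, so within each column_1 partition it returns the single non-null column_2 value, which coalesce then uses to fill the gaps. A minimal, self-contained sketch on the question's sample data (assuming an active SparkSession named spark):
from pyspark.sql import functions as F, Window

sdf = spark.createDataFrame(
    [(52, "A"), (78, "B"), (52, None)], ("column_1", "column_2")
)
sdf.withColumn(
    "column_2",
    F.coalesce(F.col("column_2"), F.max("column_2").over(Window.partitionBy("column_1"))),
).show()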

How to create a dataframe based on a lookup dataframe, dynamically create multiple columns, and map values into those columns

I have two dataframes: a main dataframe and a lookup dataframe. I need to produce a third, customized dataframe using PySpark. I need to check the values in the list_IDs column, match them against the lookup dataframe, and record the count in the final dataframe. I have tried array intersect and array lookup, but it is not working.
Main dataframe:
df = spark.createDataFrame([(123, [75319, 75317]), (212, [136438, 25274]), (215, [136438, 75317])], ("ID", "list_IDs"))
df.show()
+---+---------------+
| ID| list_IDs|
+---+---------------+
|123| [75319, 75317]|
|212|[136438, 25274]|
|215|[136438, 75317]|
+---+---------------+
Lookup Dataframe:
df_2 = spark.createDataFrame([(75319, "Wheat", 20), (75317, "Rice", 10), (136438, "Jowar", 30), (25274, "Rajma", 40)], ("ID", "Material", "Count"))
df_2.show()
+------+--------+-----+
|    ID|Material|Count|
+------+--------+-----+
| 75319|   Wheat|   20|
| 75317|    Rice|   10|
|136438|   Jowar|   30|
| 25274|   Rajma|   40|
+------+--------+-----+
I need the resultant dataframe as:
+---+---------------+-----+-----+-----+-----+
| ID|       list_IDs|Wheat| Rice|Jowar|Rajma|
+---+---------------+-----+-----+-----+-----+
|123| [75319, 75317]|   20|   10|    0|    0|
|212|[136438, 25274]|    0|    0|   30|   40|
|215|[136438, 75317]|    0|   10|   30|    0|
+---+---------------+-----+-----+-----+-----+
You can join the two dataframes and then pivot:
import pyspark.sql.functions as F
result = df.join(
    df_2,
    F.array_contains(df.list_IDs, df_2.ID)
).groupBy(df.ID, 'list_IDs').pivot('Material').agg(F.first('Count')).fillna(0)

result.show()
+---+---------------+-----+-----+----+-----+
| ID| list_IDs|Jowar|Rajma|Rice|Wheat|
+---+---------------+-----+-----+----+-----+
|212|[136438, 25274]| 30| 40| 0| 0|
|215|[136438, 75317]| 30| 0| 10| 0|
|123| [75319, 75317]| 0| 0| 10| 20|
+---+---------------+-----+-----+----+-----+
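A hedged refinement on the same idea: pivot also accepts an explicit list of values, which fixes the column order and avoids having Spark discover the distinct values from the joined data (here they are collected from the small lookup dataframe):
materials = [r["Material"] for r in df_2.select("Material").distinct().collect()]

result = (
    df.join(df_2, F.array_contains(df.list_IDs, df_2.ID))
      .groupBy(df.ID, "list_IDs")
      .pivot("Material", materials)   # explicit pivot values fix column order and skip discovery
      .agg(F.first("Count"))
      .fillna(0)
)
result.show()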

PySpark group by to return constant if all values are negative but average if only some are

I have a dataframe that looks like this:
df =
+----+-----+
|Year|Value|
+----+-----+
| 1| 50|
| 1| 30|
| 1| 20|
| 2| -14|
| 2| -34|
| 3| 10|
| 3| 20|
| 3| -34|
+----+-----+
I want to group by Year and show the average of Value. I want to ignore negative values, unless all the values for a particular year are negative (as with year 2), in which case I just want to show avg(Value) as -1.
I am doing:
df.filter(df.Value > 0).groupBy('Year').agg(avg('Value').alias('Average')).show()
which gives me this
+----+------------------+
|Year| Average|
+----+------------------+
| 1|33.333333333333336|
| 3| 15.0|
+----+------------------+
The result I want is
+----+------------------+
|Year| Average|
+----+------------------+
| 1|33.333333333333336|
| 2| -1|
| 3| 15.0|
+----+------------------+
Does anyone have any idea how to achieve the above result?
Replace the negative values in the Value column with null, and then use group by and compute average.
avg ignores nulls, which is what you need. In the end, I'm replacing null averages with -1.
from pyspark.sql.functions import avg, when, col

df.withColumn("Value", when(col("Value") < 0, None).otherwise(col("Value"))) \
    .groupBy('Year').agg(avg('Value').alias('Average')) \
    .fillna(-1, subset=['Average']).show()
+----+------------------+
|Year| Average|
+----+------------------+
| 1|33.333333333333336|
| 2| -1.0|
| 3| 15.0|
+----+------------------+
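A hedged alternative sketch that produces the same result in a single aggregation, using a when() without an otherwise() so that negative values become null before the average:
from pyspark.sql import functions as F

df.groupBy("Year").agg(
    F.coalesce(
        F.avg(F.when(F.col("Value") >= 0, F.col("Value"))),  # avg skips the nulls left by when()
        F.lit(-1.0),                                          # years with only negatives fall back to -1
    ).alias("Average")
).show()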

How to select the first n rows based on multiple conditions in PySpark

Now I have data like this:
+----+----+
|col1| d|
+----+----+
| A| 4|
| A| 10|
| A| 3|
| B| 3|
| B| 6|
| B| 4|
| B| 5.5|
| B| 13|
+----+----+
col1 is StringType and d is TimestampType; here I use DoubleType instead.
I want to generate data based on condition tuples.
Given a list of tuples [(A,3.5),(A,8),(B,3.5),(B,10)]
I want to have the result like
+----+---+
|col1| d|
+----+---+
| A| 4|
| A| 10|
| B| 4|
| B| 13|
+----+---+
That is, for each element in the list of tuples, we select from the PySpark dataframe the first row where d is larger than the tuple's number and col1 equals the tuple's string.
What I've already written is:
df_res = spark_empty_dataframe
for (x, y) in tuples:
    dft = df.filter(df.col1 == x).filter(df.d > y).limit(1)
    df_res = df_res.union(dft)
But I think this might have efficiency problems; I am not sure whether that concern is justified.
A possible approach that avoids the loop is to create a reference dataframe from the tuples you have as input:
from pyspark.sql import functions as F

t = [('A', 3.5), ('A', 8), ('B', 3.5), ('B', 10)]
ref = spark.createDataFrame([(i[0], float(i[1])) for i in t], ("col1_y", "d_y"))
Then we can join the input dataframe (df) to it on that condition, group on the tuple keys and values (which repeat for each matching row) to get the first value in each group, and then drop the extra columns:
(df.join(ref, (df.col1 == ref.col1_y) & (df.d > ref.d_y), how='inner')
   .orderBy("col1", "d")
   .groupBy("col1_y", "d_y")
   .agg(F.first("col1").alias("col1"), F.first("d").alias("d"))
   .drop("col1_y", "d_y")).show()
+----+----+
|col1| d|
+----+----+
| A|10.0|
| A| 4.0|
| B| 4.0|
| B|13.0|
+----+----+
Note: if the order of the dataframe is important, you can assign an index column with monotonically_increasing_id, include it in the aggregation, and then orderBy that index column.
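A minimal sketch of that note, reusing df and ref from above (the _idx name is just illustrative): tag each row with an index before the join, carry the smallest index through the aggregation, and sort the final result by it.
df_idx = df.withColumn("_idx", F.monotonically_increasing_id())

(df_idx.join(ref, (df_idx.col1 == ref.col1_y) & (df_idx.d > ref.d_y), how='inner')
       .orderBy("col1", "d")
       .groupBy("col1_y", "d_y")
       .agg(F.min("_idx").alias("_idx"),
            F.first("col1").alias("col1"),
            F.first("d").alias("d"))
       .orderBy("_idx")
       .drop("col1_y", "d_y", "_idx")).show()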
EDIT: another way, instead of ordering and taking the first value, is to aggregate directly with min:
(df.join(ref, (df.col1 == ref.col1_y) & (df.d > ref.d_y), how='inner')
   .groupBy("col1_y", "d_y")
   .agg(F.min("col1").alias("col1"), F.min("d").alias("d"))
   .drop("col1_y", "d_y")).show()
+----+----+
|col1| d|
+----+----+
| B| 4.0|
| B|13.0|
| A| 4.0|
| A|10.0|
+----+----+

Looking to convert String Column to Integer Column in PySpark. What happens to strings that can't be converted?

I'm trying to convert a column in a dataframe to IntegerType. Here is an example of the dataframe:
+----+-------+
|From| To|
+----+-------+
| 1|1664968|
| 2| 3|
| 2| 747213|
| 2|1664968|
| 2|1691047|
| 2|4095634|
+----+-------+
I'm using the following code:
exploded_df = exploded_df.withColumn('From', exploded_df['To'].cast(IntegerType()))
However, I wanted to know what happens to strings that are not digits, for example, what happens if I have a string with several spaces? The reason is that I want to filter the dataframe in order to get the values of the column 'From' that don't have numbers in column 'To'.
Is there a simpler way to filter by this condition without converting the columns to IntegerType?
Thank you!
Values which cannot be cast are set to null, and the column will be considered a nullable column of that type. Here's a simple example:
from pyspark import SQLContext
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType
spark = SparkSession.builder.getOrCreate()
sql_context = SQLContext(spark.sparkContext)
df = sql_context.createDataFrame([("1",),
                                  ("2",),
                                  ("3",),
                                  ("4",),
                                  ("hello world",)], schema=['id'])
df.show()
df = df.withColumn("id", F.col("id").astype(IntegerType()))
df.show()
Output:
+-----------+
| id|
+-----------+
| 1|
| 2|
| 3|
| 4|
|hello world|
+-----------+
+----+
| id|
+----+
| 1|
| 2|
| 3|
| 4|
|null|
+----+
And to verify the schema is correct:
df.printSchema()
Output:
root
 |-- id: integer (nullable = true)
Hope this helps!
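On the follow-up question about filtering without permanently converting the column, here is a hedged sketch that relies on the same cast-to-null behaviour (column names taken from the question):
import pyspark.sql.functions as F

# rows whose 'To' value is not a valid integer: the cast yields null, while the
# isNotNull() check excludes values that were already null before the cast
non_numeric = exploded_df.filter(
    F.col("To").cast("int").isNull() & F.col("To").isNotNull()
)
non_numeric.select("From", "To").show()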
We can use a regex to check whether the To column contains letters or spaces, using the .rlike function in Spark to filter out the matching rows.
Example:
df=spark.createDataFrame([("1","1664968"),("2","3"),("2","742a7"),("2"," "),("2","a")],["From","To"])
df.show()
#+----+-------+
#|From| To|
#+----+-------+
#| 1|1664968|
#| 2| 3|
#| 2| 742a7|
#| 2| |
#| 2| a|
#+----+-------+
#get the rows which contain a letter or space
df.filter(col("To").rlike('([a-z]|\\s+)')).show(truncate=False)
#+----+-----+
#|From|To |
#+----+-----+
#|2 |742a7|
#|2 | |
#|2 |a |
#+----+-----+
#get the rows which don't contain any letter or space
df.filter(~col("To").rlike('([a-z]|\\s+)')).show(truncate=False)
#+----+-------+
#|From|To |
#+----+-------+
#|1 |1664968|
#|2 |3 |
#+----+-------+
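A hedged refinement: the character class above only covers lowercase letters and whitespace, so values containing uppercase letters or other symbols would slip through. Anchoring the pattern on "entirely digits" is a more general split that catches every non-numeric value:
from pyspark.sql.functions import col

df.filter(~col("To").rlike('^[0-9]+$')).show(truncate=False)  # not purely numeric
df.filter(col("To").rlike('^[0-9]+$')).show(truncate=False)   # purely numeric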
