I have some SQL code that I am trying to convert into PySpark.
The SQL query looks like this: I need to concatenate '0' at the start of 'ADDRESS_HOME' if the conditions in the query below are satisfied.
UPDATE STUDENT_DATA
SET STUDENT_DATA.ADDRESS_HOME = "0" & [STUDENT_DATA].ADDRESS_HOME
WHERE (((STUDENT_DATA.STATE_ABB)="TURIN" Or
(STUDENT_DATA.STATE_ABB)="RUSH" Or
(STUDENT_DATA.STATE_ABB)="MEXIC" Or
(STUDENT_DATA.STATE_ABB)="VINTA")
AND ((Len([ADDRESS_HOME])) < "5"));
Thank you in advance for your responses.
# +---+------------+---------+
# | ID|ADDRESS_HOME|STATE_ABB|
# +---+------------+---------+
# |  1|        7645|     RUSH|
# |  2|       98364|    MEXIC|
# |  3|        2980|    TURIN|
# |  4|        6728|    VINTA|
# |  5|         128|    VINTA|
# +---+------------+---------+
EXPECTED OUTPUT
# +---+------------+---------+
# | ID|ADDRESS_HOME|STATE_ABB|
# +---+------------+---------+
# |  1|       07645|     RUSH|
# |  2|       98364|    MEXIC|
# |  3|       02980|    TURIN|
# |  4|       06728|    VINTA|
# |  5|        0128|    VINTA|
# +---+------------+---------+
If you want ADDRESS_HOME to always be 5 digits, left-padded with 0, you can use lpad.
from pyspark.sql import functions as F

df = df.withColumn('ADDRESS_HOME', F.lpad('ADDRESS_HOME', 5, '0'))
If you only want to pad with a single character (0) when ADDRESS_HOME has fewer than 5 characters:
df = (df.withColumn('ADDRESS_HOME',
                    F.when(F.length('ADDRESS_HOME') < 5,
                           F.concat(F.lit('0'), F.col('ADDRESS_HOME')))
                     .otherwise(F.col('ADDRESS_HOME'))))
UPDATE:
You can convert all the OR criteria to an IN clause (isin), then use a logical AND with the other criterion.
states = ['RUSH', 'MEXIC', 'TURIN', 'VINTA']
df = (df.withColumn('ADDRESS_HOME',
                    F.when(F.col('STATE_ABB').isin(states) & (F.length('ADDRESS_HOME') < 5),
                           F.concat(F.lit('0'), F.col('ADDRESS_HOME')))
                     .otherwise(F.col('ADDRESS_HOME'))))
First, you filter your DF, searching for the values you want to update.
Then you update the column (first withColumn).
After updating, you join your updated DF with your original DataFrame (do this to get all the values in one DataFrame again) and do a coalesce to get the FINAL ADDRESS.
Finally, you select the values from the original DF (Id and State) and the updated value (Final_Address). Since you did a coalesce, the values that were not updated will not be null: they will be the updated value where the filter condition matched and the original value where it did not.
This answer should solve your problem, BUT @Emma's answer is more efficient.
from pyspark.sql import functions as f

df = df.filter(
    f.col("STATE_ABB").isin("TURIN", "RUSH", "MEXIC", "VINTA") &
    (f.length("ADDRESS_HOME") < 5)
).withColumn(
    "ADDRESS_HOME_CONCAT",
    f.concat(f.lit("0"), f.col("ADDRESS_HOME"))
).alias("df_filtered").join(
    df.alias("original_df"),
    on=f.col("original_df.Id") == f.col("df_filtered.Id"),
    how='right'  # keep every row of the original DF
).withColumn(
    "FINAL_ADDRESS",
    f.coalesce(f.col("df_filtered.ADDRESS_HOME_CONCAT"), f.col("original_df.ADDRESS_HOME"))
).select(
    f.col("original_df.Id").alias("Id"),
    f.col("FINAL_ADDRESS").alias("ADDRESS_HOME"),
    f.col("original_df.STATE_ABB").alias("STATE_ABB")
)
Sorry for any typos, I posted this from my cellphone!
Related
I have a dataframe with some column names and I want to filter out some columns based on a list.
I have a list of columns I would like to have in my final dataframe:
final_columns = ['A','C','E']
My dataframe is this:
from pyspark.sql.types import StructType, StructField, StringType

data1 = [("James", "Lee", "Smith","36636"),
         ("Michael","Rose","Boots","40288")]

schema1 = StructType([StructField("A",StringType(),True),
                      StructField("B",StringType(),True),
                      StructField("C",StringType(),True),
                      StructField("D",StringType(),True)])

df1 = spark.createDataFrame(data=data1,schema=schema1)
I would like to transform df1 in order to have the columns of this final_columns list.
So, basically, I expect the resulting dataframe to look like this
+--------+------+------+
| A | C | E |
+--------+------+------+
| James |Smith | |
|Michael |Boots | |
+--------+------+------+
Is there any smart way to do this?
Thank you in advance
You can do so with select and a list comprehension. The idea is to loop through final_columns: if a column is in df1.columns, add it; if it is not, use lit to add it with the proper alias.
You can write similar logic with a for loop if you find list comprehensions less readable (see the sketch after the output below).
from pyspark.sql.functions import lit
df1.select([c if c in df1.columns else lit(None).alias(c) for c in final_columns]).show()
+-------+-----+----+
| A| C| E|
+-------+-----+----+
| James|Smith|null|
|Michael|Boots|null|
+-------+-----+----+
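For reference, here is a for-loop version of the same idea (a minimal sketch, assuming df1 and final_columns as defined in the question):
from pyspark.sql.functions import lit

cols = []
for c in final_columns:
    if c in df1.columns:
        cols.append(c)                    # column already exists, keep it
    else:
        cols.append(lit(None).alias(c))   # column is missing, add it as a null literal
df1.select(*cols).show()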
Here is one way: use the DataFrame drop() method with a list which represents the symmetric difference between the DataFrame's current columns and your list of final columns.
df = spark.createDataFrame([(1, 1, "1", 0.1),(1, 2, "1", 0.2),(3, 3, "3", 0.3)],('a','b','c','d'))
df.show()
+---+---+---+---+
| a| b| c| d|
+---+---+---+---+
| 1| 1| 1|0.1|
| 1| 2| 1|0.2|
| 3| 3| 3|0.3|
+---+---+---+---+
# list of desired final columns
final_cols = ['a', 'c', 'd']
df2 = df.drop( *set(final_cols).symmetric_difference(df.columns) )
Note an alternate syntax for the symmetric difference operation:
df2 = df.drop( *(set(final_cols) ^ set(df.columns)) )
This gives me:
+---+---+---+
| a| c| d|
+---+---+---+
| 1| 1|0.1|
| 1| 1|0.2|
| 3| 3|0.3|
+---+---+---+
Which I believe is what you want.
Based on your requirement, I have written dynamic code. It selects columns based on the list provided and also creates a column with empty values if that column is not present in the source/original DataFrame.
from pyspark.sql.types import StructType, StructField, StringType
from pyspark.sql.functions import lit

data1 = [("James", "Lee", "Smith","36636"),
         ("Michael","Rose","Boots","40288")]

schema1 = StructType([StructField("A",StringType(),True),
                      StructField("B",StringType(),True),
                      StructField("C",StringType(),True),
                      StructField("D",StringType(),True)])

df1 = spark.createDataFrame(data=data1,schema=schema1)
actual_columns = df1.schema.names
final_columns = ['A','C','E']
def Diff(li1, li2):
    diff = list(set(li2) - set(li1))
    return diff

def Same(li1, li2):
    same = list(sorted(set(li1).intersection(li2)))
    return same

df1 = df1.select(*Same(actual_columns, final_columns))

for i in Diff(actual_columns, final_columns):
    df1 = df1.withColumn(i, lit(''))
display(df1)
Problem Statement
I have two corresponding DataFrames: one is an employee table, the other is a job catalog table, and in each of them one column holds an array. I want to find the intersection of the two arrays in the skill_set columns of the two DataFrames (I've been using np.intersect1d) and return the value to the employee DataFrame for each id in the employee DataFrame.
So each id in the employee DataFrame is looped over to find the intersection with every job_name in the job catalog DataFrame that has the same job_rank as the current employee's job_rank. The final output is meant to be the 5 jobs with the highest amount of intersection (using len, since np.intersect1d returns a list) from the job DataFrame.
employee_data
+----+--------+----------+----------+
| id|emp_name| job_rank| skill_set|
+----+--------+----------+----------+
| 2| c | 1|[a1,a2,a3]|
| 2| a | 2|[a1,a2,a3]|
| 1| c | 3|[a1,a2,a3]|
| 1| j | 4|[a1,a2,a3]|
| 3| k | 5|[a1,a2,a3]|
| 1| l | 6|[a1,a2,a3]|
+----+--------+----------+----------+
job_data
+----+--------+----------+----------+
| id|job_name| job_rank| skill_set|
+----+--------+----------+----------+
| 2| c | 1|[a1,a2,a3]|
| 2| a | 2|[a1,a2,a3]|
| 1| c | 1|[a1,a2,a3]|
| 1| b | 4|[a1,a2,a3]|
| 3| r | 3|[a1,a2,a3]|
| 1| a | 6|[a1,a2,a3]|
| 1| m | 2|[a1,a2,a3]|
| 1| g | 4|[a1,a2,a3]|
+----+--------+----------+----------+
I can give you an idea of how you can solve this, assuming the employee data and job data are not too big.
Do a full join (or an inner join, as you need) on employee_data and job_data. Your new joined data will have up to len(employee_data) * len(job_data) rows and will carry the skills from both tables along with the employee details:
| emp_details | emp_skills | job_details | job_skills |
Operate on this table to find which of the emp_skills match the job_skills, e.g. with (lambda) functions; with functions it is easy to operate on array columns.
Then select the emp details from the row. A minimal sketch of this idea is shown below.
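Here is a rough PySpark sketch of that idea (Spark 2.4+), joining on job_rank and using array_intersect / size in place of np.intersect1d. The DataFrame names employee_df and job_df are assumptions based on the tables above:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Join every employee to the jobs that share the same job_rank,
# then measure the overlap between the two skill_set arrays.
joined = (employee_df.alias("e")
          .join(job_df.alias("j"), on="job_rank", how="inner")
          .select(F.col("e.id").alias("emp_id"),
                  F.col("e.emp_name").alias("emp_name"),
                  F.col("j.job_name").alias("job_name"),
                  F.array_intersect(F.col("e.skill_set"), F.col("j.skill_set")).alias("matched_skills"))
          .withColumn("match_count", F.size("matched_skills")))

# Keep the 5 jobs with the largest skill overlap for each employee.
w = Window.partitionBy("emp_id").orderBy(F.col("match_count").desc())
top5 = (joined.withColumn("rn", F.row_number().over(w))
              .filter(F.col("rn") <= 5)
              .drop("rn"))
top5.show()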
I have a simple Spark DataFrame with column ID with integer values 1, 2, etc.:
+---+-------+
| ID| Tags |
+---+-------+
| 1| apple |
| 2| kiwi |
| 3| pear |
+---+-------+
I want to check if a value like 2 is in the column ID in any row; the filter method seems only useful for string columns. Any ideas?
UPDATE:
I was trying with:
df.filter(df.ID).contains(2)
At the end I need boolean True or False output.
No. Filter can filter other data types also.
dataDictionary = [
(1,"APPLE"),
(2,"KIWI"),
(3,"PEAR")
]
df = spark.createDataFrame(data=dataDictionary, schema = ["ID","Tags"])
df.printSchema()
df.show(truncate=False)
df.filter("ID==2").rdd.isEmpty() #Will return Boolean.
I am trying to take a column in Spark (using pyspark) that has string values like 'A1', 'C2', and 'B9' and create new columns with each element in the string. How can I extract values from strings to create a new column?
How do I turn this:
| id | col_s |
|----|-------|
| 1 | 'A1' |
| 2 | 'C2' |
into this:
| id | col_s | col_1 | col_2 |
|----|-------|-------|-------|
| 1 | 'A1' | 'A' | '1' |
| 2 | 'C2' | 'C' | '2' |
I have been looking through the docs unsuccessfully.
You can use expr and substr to extract the substrings you want. In the substr() function, the first argument is the column, the second argument is the index from which you want to start extracting, and the third argument is the length of the string you want to extract. Note: it is 1-based indexing, as opposed to 0-based.
from pyspark.sql.functions import substring, length, expr
df = df.withColumn('col_1',expr('substring(col_s, 1, 1)'))
df = df.withColumn('col_2',expr('substring(col_s, 2, 1)'))
df.show()
+---+-----+-----+-----+
| id|col_s|col_1|col_2|
+---+-----+-----+-----+
| 1| A1| A| 1|
| 2| C1| C| 1|
| 3| G8| G| 8|
| 4| Z6| Z| 6|
+---+-----+-----+-----+
I was able to answer my own question 5 minutes after posting it here...
import pyspark.sql.functions

split_col = pyspark.sql.functions.split(df['COL_NAME'], "")   # split the string into individual characters
df = df.withColumn('COL_NAME_CHAR', split_col.getItem(0))     # first character, e.g. 'A'
df = df.withColumn('COL_NAME_NUM', split_col.getItem(1))      # second character, e.g. '1'
I have a Spark DataFrame in Python, and it is sorted based on a column. How can I select a specific range of the data (for example, the 50% of the data in the middle)? For example, if I have 1M rows, I want to take the data from index 250K to 750K. How can I do that without using collect in PySpark?
To be more precise, I want something like the take function to get results within a range, for example something like take(250000, 750000).
Here is one way to select a range in a pyspark DF:
Create DF
df = spark.createDataFrame(
    data=[(10, "2018-01-01"), (22, "2017-01-01"), (13, "2014-01-01"), (4, "2015-01-01"),
          (35, "2013-01-01"), (26, "2016-01-01"), (7, "2012-01-01"), (18, "2011-01-01")],
    schema=["amount", "date"]
)
df.show()
+------+----------+
|amount| date|
+------+----------+
| 10|2018-01-01|
| 22|2017-01-01|
| 13|2014-01-01|
| 4|2015-01-01|
| 35|2013-01-01|
| 26|2016-01-01|
| 7|2012-01-01|
| 18|2011-01-01|
+------+----------+
Sort (on date) and insert index (based on row number)
from pyspark.sql.window import Window
from pyspark.sql import functions as F
w = Window.orderBy("date")
df = df.withColumn("index", F.row_number().over(w))
df.show()
+------+----------+-----+
|amount| date|index|
+------+----------+-----+
| 18|2011-01-01| 1|
| 7|2012-01-01| 2|
| 35|2013-01-01| 3|
| 13|2014-01-01| 4|
| 4|2015-01-01| 5|
| 26|2016-01-01| 6|
| 22|2017-01-01| 7|
| 10|2018-01-01| 8|
+------+----------+-----+
Get The Required Range (assume want everything between rows 3 and 6)
df1=df.filter(df.index.between(3, 6))
df1.show()
+------+----------+-----+
|amount| date|index|
+------+----------+-----+
| 35|2013-01-01| 3|
| 13|2014-01-01| 4|
| 4|2015-01-01| 5|
| 26|2016-01-01| 6|
+------+----------+-----+
This is very simple using between. For example, assuming your sorted column name is index:
df_sample = df.filter(df.index.between(250000, 750000))
Once you create the new DataFrame df_sample, you can perform any operation on it (including take or collect) as per your need.
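For instance, a short usage sketch:
# Fetch a handful of rows from the selected range back to the driver
rows = df_sample.take(10)

# Or materialize the whole range on the driver (only if it comfortably fits in memory)
all_rows = df_sample.collect()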