I have a simple Spark DataFrame with column ID with integer values 1, 2, etc.:
+---+-----+
| ID| Tags|
+---+-----+
|  1|apple|
|  2| kiwi|
|  3| pear|
+---+-----+
I want to check whether a value like 2 appears in the ID column in any row. As far as I can tell, the filter method is only useful for string columns. Any ideas?
UPDATE:
I was trying with:
df.filter(df.ID).contains(2)
In the end I need a boolean True or False as output.
No, filter works on other data types as well.
dataDictionary = [
    (1, "APPLE"),
    (2, "KIWI"),
    (3, "PEAR")
]
df = spark.createDataFrame(data=dataDictionary, schema = ["ID","Tags"])
df.printSchema()
df.show(truncate=False)
df.filter("ID==2").rdd.isEmpty()  # returns a boolean: True when no row matches, False when ID 2 exists
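If you need True when the value is present, you can simply negate that result, or use a count; a minimal sketch on the same df:

# True if any row has ID == 2 (negating isEmpty, which is True when nothing matched)
value_exists = not df.filter(df.ID == 2).rdd.isEmpty()

# equivalent check without dropping down to the RDD API
value_exists = df.filter(df.ID == 2).count() > 0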
I have a SQL query which I am trying to convert into PySpark. I need to prepend '0' to ADDRESS_HOME when the conditions below are satisfied. The SQL query looks like this:
UPDATE STUDENT_DATA
SET STUDENT_DATA.ADDRESS_HOME = "0" & [STUDENT_DATA].ADDRESS_HOME
WHERE (((STUDENT_DATA.STATE_ABB)="TURIN" Or
(STUDENT_DATA.STATE_ABB)="RUSH" Or
(STUDENT_DATA.STATE_ABB)="MEXIC" Or
(STUDENT_DATA.STATE_ABB)="VINTA")
AND ((Len([ADDRESS_HOME])) < "5"));
Thank you in advance for your responses.
# +---+------------+---------+
# | ID|ADDRESS_HOME|STATE_ABB|
# +---+------------+---------+
# |  1|        7645|     RUSH|
# |  2|       98364|    MEXIC|
# |  3|        2980|    TURIN|
# |  4|        6728|    VINTA|
# |  5|         128|    VINTA|
# +---+------------+---------+
EXPECTED OUTPUT
# +---+------------+---------+
# | ID|ADDRESS_HOME|STATE_ABB|
# +---+------------+---------+
# |  1|       07645|     RUSH|
# |  2|       98364|    MEXIC|
# |  3|       02980|    TURIN|
# |  4|       06728|    VINTA|
# |  5|        0128|    VINTA|
# +---+------------+---------+
If you want to left-pad ADDRESS_HOME to 5 digits with zeros, you can use lpad.
from pyspark.sql import functions as F

df = df.withColumn('ADDRESS_HOME', F.lpad('ADDRESS_HOME', 5, '0'))
If you only want to prepend a single '0' when ADDRESS_HOME has fewer than 5 characters:
df = df.withColumn('ADDRESS_HOME',
                   F.when(F.length('ADDRESS_HOME') < 5,
                          F.concat(F.lit('0'), F.col('ADDRESS_HOME')))
                    .otherwise(F.col('ADDRESS_HOME')))
UPDATE:
You can convert all the OR criteria to an IN clause (isin) and then combine it with the length criterion using a logical AND.
states = ['RUSH', 'MEXIC', 'TURIN', 'VINTA']
df = df.withColumn('ADDRESS_HOME',
                   F.when(F.col('STATE_ABB').isin(states) & (F.length('ADDRESS_HOME') < 5),
                          F.concat(F.lit('0'), F.col('ADDRESS_HOME')))
                    .otherwise(F.col('ADDRESS_HOME')))
First you filter your DF, searching for the values you want to update.
Then you update the column (first withColumn).
After updating, you join your updated DF with your original dataframe (do this to get all values in one dataframe again), and do a coalesce into the final address.
Finally, you select the values from the original DF (Id and State) and the updated value (FINAL_ADDRESS). Since you did a coalesce, the rows that were not updated will not be null: they get the updated value where the filter condition matched, and the original value where it did not.
This answer should solve your problem, BUT @Emma's answer is more efficient.
from pyspark.sql import functions as f

df = df.filter(
    f.col("STATE_ABB").isin("TURIN", "RUSH", "MEXIC", "VINTA") &
    (f.length("ADDRESS_HOME") < 5)
).withColumn(
    "ADDRESS_HOME_CONCAT",
    f.concat(f.lit("0"), f.col("ADDRESS_HOME"))
).alias("df_filtered").join(
    df.alias("original_df"),
    on=f.col("original_df.Id") == f.col("df_filtered.Id"),
    how='right'  # keep every row of the original DF, not just the filtered ones
).withColumn(
    "FINAL_ADDRESS",
    f.coalesce(f.col("df_filtered.ADDRESS_HOME_CONCAT"), f.col("original_df.ADDRESS_HOME"))
).select(
    f.col("original_df.Id").alias("Id"),
    f.col("FINAL_ADDRESS").alias("ADDRESS_HOME"),
    f.col("original_df.STATE_ABB").alias("STATE_ABB")
)
Sorry for any typos, I've posted this from my cellphone!
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
spark = SparkSession.builder \
.appName('SparkByExamples.com') \
.getOrCreate()
data = [('James','Smith','M',3000), ('Anna','Rose','F',4100),
('Robert','Williams','M',6200)
]
columns = ["firstname","lastname","gender","salary"]
df = spark.createDataFrame(data=data, schema = columns)
df2 = df.select(lit("D").alias("S"), "*")
df2.show()
Output:
+---+---------+--------+------+------+
| S|firstname|lastname|gender|salary|
+---+---------+--------+------+------+
| D| James| Smith| M| 3000|
| D| Anna| Rose| F| 4100|
| D| Robert|Williams| M| 6200|
+---+---------+--------+------+------+
Required Output:
I need to add an extra trailer row "T" with the row count in the "firstname" column, like below. Column "firstname" can be of any type.
+---+---------+--------+------+------+
| S|firstname|lastname|gender|salary|
+---+---------+--------+------+------+
| D| James| Smith| M| 3000|
| D| Anna| Rose| F| 4100|
| D| Robert|Williams| M| 6200|
| T| 3| | | |
+---+---------+--------+------+------+
I tried creating a new dataframe with the trailer values and applying union, as suggested in most Stack Overflow solutions, but both dataframes must have the same number of columns.
Is there any better way to have the count in the trailer, irrespective of the column type of the "firstname" column?
Since you want to create a new row irrespective of column type, you can write a function that takes the column name as an input, and returns a dictionary containing all of the necessary information for the new row including the number of entries in that column.
To create an output pyspark dataframe like the one you've shown, every column will have to end up as a string type, because the new row has to contain an empty string '' for the columns lastname, gender, and salary. You cannot have mixed types within a pyspark column, so when you create a union between df2 and total_row_df, any column that is a string type in total_row_df will be coerced to a string type in the resulting dataframe.
from pyspark.sql.functions import count

def create_total_row(col_name):
    total_row = {}
    for col in df2.columns:
        if col == 'S':
            total_row[col] = 'T'
        elif col == col_name:
            total_row[col] = df2.select(count(df2[col_name])).collect()[0][0]
        else:
            total_row[col] = ''
    return total_row

total_row = create_total_row('firstname')
total_row_df = spark.createDataFrame([total_row])
df2.union(total_row_df).show()
Result:
+---+---------+--------+------+------+
| S|firstname|lastname|gender|salary|
+---+---------+--------+------+------+
| D| James| Smith| M| 3000|
| D| Anna| Rose| F| 4100|
| D| Robert|Williams| M| 6200|
| T| 3| | | |
+---+---------+--------+------+------+
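As a side note, if you would rather not rely on positional column resolution and implicit casting during the union (creating a dataframe from a dict can order its columns differently), one variant is to cast both sides to string explicitly and use unionByName. A minimal sketch, reusing df2 and total_row_df from above:

from pyspark.sql.functions import col

# cast every column to string so the trailer row's '' and the count line up cleanly
df2_str = df2.select([col(c).cast('string') for c in df2.columns])
total_str = total_row_df.select([col(c).cast('string') for c in df2.columns])

# unionByName matches columns by name instead of position
df2_str.unionByName(total_str).show()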
I have a dataframe with some column names and I want to filter out some columns based on a list.
I have a list of columns I would like to have in my final dataframe:
final_columns = ['A','C','E']
My dataframe is this:
from pyspark.sql.types import StructType, StructField, StringType

data1 = [("James", "Lee", "Smith", "36636"),
         ("Michael", "Rose", "Boots", "40288")]

schema1 = StructType([StructField("A", StringType(), True),
                      StructField("B", StringType(), True),
                      StructField("C", StringType(), True),
                      StructField("D", StringType(), True)])

df1 = spark.createDataFrame(data=data1, schema=schema1)
I would like to transform df1 in order to have the columns of this final_columns list.
So, basically, I expect the resulting dataframe to look like this
+--------+------+------+
| A | C | E |
+--------+------+------+
| James |Smith | |
|Michael |Boots | |
+--------+------+------+
Is there any smart way to do this?
Thank you in advance
You can do so with select and a list comprehension. The idea is to loop through final_columns: if a column is in df1.columns, add it; if it is not, use lit(None) to add it with the proper alias.
You can write similar logic with a plain for loop if you find list comprehensions less readable (see the sketch after the output below).
from pyspark.sql.functions import lit
df1.select([c if c in df1.columns else lit(None).alias(c) for c in final_columns]).show()
+-------+-----+----+
| A| C| E|
+-------+-----+----+
| James|Smith|null|
|Michael|Boots|null|
+-------+-----+----+
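For reference, the for-loop variant mentioned above could look like this; it builds the same column list step by step:

from pyspark.sql.functions import lit

cols = []
for c in final_columns:
    if c in df1.columns:
        cols.append(c)                    # column exists in df1, keep it
    else:
        cols.append(lit(None).alias(c))   # missing column, fill with nulls
df1.select(cols).show()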
Here is one way: use the DataFrame drop() method with a list representing the symmetric difference between the DataFrame's current columns and your list of final columns. (Note that this only removes extra columns; unlike the select approach above, it will not add a missing column such as E.)
df = spark.createDataFrame([(1, 1, "1", 0.1),(1, 2, "1", 0.2),(3, 3, "3", 0.3)],('a','b','c','d'))
df.show()
+---+---+---+---+
| a| b| c| d|
+---+---+---+---+
| 1| 1| 1|0.1|
| 1| 2| 1|0.2|
| 3| 3| 3|0.3|
+---+---+---+---+
# list of desired final columns
final_cols = ['a', 'c', 'd']
df2 = df.drop( *set(final_cols).symmetric_difference(df.columns) )
Note an alternate syntax for the symmetric difference operation:
df2 = df.drop( *(set(final_cols) ^ set(df.columns)) )
This gives me:
+---+---+---+
| a| c| d|
+---+---+---+
| 1| 1|0.1|
| 1| 1|0.2|
| 3| 3|0.3|
+---+---+---+
Which I believe is what you want.
Based on your requirement, I have written dynamic code. It will select columns based on the list provided and also create a column with null values if that column is not present in the source/original dataframe.
from pyspark.sql.functions import lit
from pyspark.sql.types import StructType, StructField, StringType

data1 = [("James", "Lee", "Smith", "36636"),
         ("Michael", "Rose", "Boots", "40288")]
schema1 = StructType([StructField("A",StringType(),True),
StructField("B",StringType(),True),
StructField("C",StringType(),True),
StructField("D",StringType(),True)])
df1 = spark.createDataFrame(data=data1,schema=schema1)
actual_columns = df1.schema.names
final_columns = ['A','C','E']
def Diff(li1, li2):
    diff = list(set(li2) - set(li1))
    return diff

def Same(li1, li2):
    same = list(sorted(set(li1).intersection(li2)))
    return same

df1 = df1.select(*Same(actual_columns, final_columns))

for i in Diff(actual_columns, final_columns):
    df1 = df1.withColumn(i, lit(''))
display(df1)
I have the following Pyspark Dataframe called 'df':
A = ["OTH/CON", "Freight Collect", "OTH/CON", "DBG"]
B = [2, 3, 4, 5]
df = sqlContext.createDataFrame(zip(A, B), schema=['A', 'B'])
In column 'A', I need to replace the values "OTH/CON" and "Freight Collect" with the string "Collect", and replace "DBG" with "Dispose", placing the result in a new column 'aa'. I do the following:
from pyspark.sql import functions as F
df = df.withColumn("aa", F.when(F.col("A").isin(["OTH/CON"]), F.lit("Collect")).otherwise(F.col("A")))
df = df.withColumn("aa", F.when(F.col("A").isin(["Freight Collect"]), F.lit("Collect")).otherwise(F.col("A")))
df = df.withColumn("aa", F.when(F.col("A").isin(["DBG"]), F.lit("Dispose")).otherwise(F.col("A")))
But I end up with only the "Freight Collect" value changed to "Collect"; "OTH/CON" remains as it is.
I'm not able to figure out why!
My expected output is as follows:
+---------------+---+-------+
| A| B| aa|
+---------------+---+-------+
| OTH/CON| 2|Collect|
|Freight Collect| 3|Collect|
| OTH/CON| 4|Collect|
| DBG| 5|Dispose|
+---------------+---+-------+
Can anyone please help?
You can merge multiple isin conditions into a single when chain. The issue with your code is that each withColumn("aa", ...) overwrites the previous result, and every otherwise(F.col("A")) resets the rows not matched by that particular condition back to the original value of A, so earlier replacements are lost.
(df
.withColumn('aa', F
.when(F.col('A').isin(['OTH/CON', 'Freight Collect']), F.lit('Collect'))
.when(F.col('A').isin(['DBG']), F.lit('Dispose'))
.otherwise(F.col('A'))
)
.show()
)
+---------------+---+-------+
| A| B| aa|
+---------------+---+-------+
| OTH/CON| 2|Collect|
|Freight Collect| 3|Collect|
| OTH/CON| 4|Collect|
| DBG| 5|Dispose|
+---------------+---+-------+
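If you prefer to keep the mapping in one place rather than spelling out the when conditions, an alternative sketch (assuming the same df and F import as above) is to copy A into aa and then use DataFrame.replace with a dict on that column:

mapping = {'OTH/CON': 'Collect', 'Freight Collect': 'Collect', 'DBG': 'Dispose'}

# copy A into aa, then replace only the mapped values within aa
df.withColumn('aa', F.col('A')).replace(mapping, subset=['aa']).show()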
I am trying to take a column in Spark (using pyspark) that has string values like 'A1', 'C2', and 'B9' and create new columns with each element in the string. How can I extract values from strings to create a new column?
How do I turn this:
| id | col_s |
|----|-------|
| 1 | 'A1' |
| 2 | 'C2' |
into this:
| id | col_s | col_1 | col_2 |
|----|-------|-------|-------|
| 1 | 'A1' | 'A' | '1' |
| 2 | 'C2' | 'C' | '2' |
I have been looking through the docs unsuccessfully.
You can use expr together with the SQL substring function to extract the substrings you want. In substring(), the first argument is the column, the second argument is the index from which you want to start extracting, and the third argument is the length of the string you want to extract. Note: it uses 1-based indexing, as opposed to 0-based.
from pyspark.sql.functions import substring, length, expr
df = df.withColumn('col_1',expr('substring(col_s, 1, 1)'))
df = df.withColumn('col_2',expr('substring(col_s, 2, 1)'))
df.show()
+---+-----+-----+-----+
| id|col_s|col_1|col_2|
+---+-----+-----+-----+
| 1| A1| A| 1|
| 2| C1| C| 1|
| 3| G8| G| 8|
| 4| Z6| Z| 6|
+---+-----+-----+-----+
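The same result can be obtained with the Column.substr method directly, without a SQL expression string; a minimal sketch on the same df:

from pyspark.sql.functions import col

# substr(startPos, length) is also 1-based, matching the SQL substring above
df = df.withColumn('col_1', col('col_s').substr(1, 1))
df = df.withColumn('col_2', col('col_s').substr(2, 1))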
I was able to answer my own question 5 minutes after posting it here...
import pyspark.sql.functions

# splitting on the empty pattern yields one array element per character
split_col = pyspark.sql.functions.split(df['COL_NAME'], "")
df = df.withColumn('COL_NAME_CHAR', split_col.getItem(0))  # first character, e.g. 'A'
df = df.withColumn('COL_NAME_NUM', split_col.getItem(1))   # second character, e.g. '1'