Best practice for doubling every column of the same DataFrame

I want to take a DF and double each column (with new column name).
I want to make "Stress Tests" on my ML Model (implemented using PySpark & Spark Pipeline) and see how well it performs if I double/triple the number of features in my input dataset.
For example, take this DF:
+-------+-------+-----+------+
|    _c0|    _c1|  _c2|   _c3|
+-------+-------+-----+------+
|   1   |Testing|     | true |
+-------+-------+-----+------+
and make it like this:
+-------+-------+-----+------+-------+-------+-----+------+
|    _c0|    _c1|  _c2|   _c3|    _c4|    _c5|  _c6|   _c7|
+-------+-------+-----+------+-------+-------+-----+------+
|   1   |Testing|     | true |   1   |Testing|     | true |
+-------+-------+-----+------+-------+-------+-----+------+
The easiest way I can do it is like this:
doubledDF = df
for col in df.columns:
    doubledDF = doubledDF.withColumn(col + "1dup", df[col])
However, it takes way too much time.
I would appreciate any solution, and even more the explanation why this solution approach is better.
Thank you very much!

You can do this with selectExpr(). The asterisk * unpacks a list into separate positional arguments.
For example, *['_c0', '_c1', '_c2', '_c3'] is passed as '_c0', '_c1', '_c2', '_c3'.
Combined with list comprehensions, this code generalizes to any number of columns.
df = sqlContext.createDataFrame([(1,'Testing','',True)],('_c0','_c1','_c2','_c3'))
df.show()
+---+-------+---+----+
|_c0|    _c1|_c2| _c3|
+---+-------+---+----+
|  1|Testing|   |true|
+---+-------+---+----+
col_names = df.columns
print(col_names)
['_c0', '_c1', '_c2', '_c3']
df = df.selectExpr(*[i for i in col_names],*[i+' as '+i+'_dup' for i in col_names])
df.show()
+---+-------+---+----+-------+-------+-------+-------+
|_c0|    _c1|_c2| _c3|_c0_dup|_c1_dup|_c2_dup|_c3_dup|
+---+-------+---+----+-------+-------+-------+-------+
|  1|Testing|   |true|      1|Testing|       |   true|
+---+-------+---+----+-------+-------+-------+-------+
Note: The following code will work too.
df = df.selectExpr('*',*[i+' as '+i+'_dup' for i in col_names])
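As for why this approach tends to be faster: each withColumn call adds another projection to the query plan, so calling it in a loop builds a large plan, whereas a single selectExpr adds all the columns in one projection. A minimal sketch of a helper that builds the expression list for any number of copies follows. The name duplicate_exprs and its copies and suffix parameters are hypothetical, not part of PySpark, and only the string building is executed here.

```python
def duplicate_exprs(col_names, copies=1, suffix="_dup"):
    """Build selectExpr strings: each original column plus `copies` renamed duplicates."""
    # Backticks quote the identifiers so names with spaces or dots still parse.
    exprs = [f"`{c}`" for c in col_names]
    for n in range(1, copies + 1):
        exprs += [f"`{c}` as `{c}{suffix}{n}`" for c in col_names]
    return exprs


exprs = duplicate_exprs(["_c0", "_c1"], copies=2)
print(exprs)

# With a real DataFrame (assumed usage, not executed here):
#   tripled = df.selectExpr(*duplicate_exprs(df.columns, copies=2))
```

With copies=2 the result has three times the original number of columns, which covers the "double/triple the features" stress test from the question.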

Trim String Characters in Pyspark dataframe

Suppose I have a DataFrame with a column containing values like:
ABC00909083888
ABC93890380380
XYZ7394949
XYZ3898302
PQR3799_ABZ
MGE8983_ABZ
I want to trim these values: remove the first 3 characters and, if the value ends with ABZ, remove that suffix (including the underscore), giving:
00909083888
93890380380
7394949
3898302
3799
8983
I tried some methods, but they did not work.
from pyspark.sql import functions as f
new_df = df.withColumn("new_column", f.when((condition on some column),
f.substring('Existing_COL', 4, f.length(f.col("Existing_COL"))), ))
Can anyone please tell me which function I can use in PySpark?
trim only removes whitespace characters such as spaces or tabs.
Based on your input and expected output, the logic below should work:
from pyspark.sql.functions import col, expr, regexp_replace, when
df = spark.createDataFrame(data=[("ABC00909083888",), ("ABC93890380380",), ("XYZ7394949",), ("XYZ3898302",), ("PQR3799_ABZ",), ("MGE8983_ABZ",)], schema=["values"])
(df
 # strip a trailing "_ABZ" when present
 .withColumn("new_vals", when(col("values").rlike("_ABZ$"), regexp_replace(col("values"), r"_ABZ$", "")).otherwise(col("values")))
 # drop the first three characters (substring positions are 1-based)
 .withColumn("final_vals", expr("substring(new_vals, 4, length(new_vals))"))
).show()
Output
+--------------+--------------+-----------+
|        values|      new_vals| final_vals|
+--------------+--------------+-----------+
|ABC00909083888|ABC00909083888|00909083888|
|ABC93890380380|ABC93890380380|93890380380|
|    XYZ7394949|    XYZ7394949|    7394949|
|    XYZ3898302|    XYZ3898302|    3898302|
|   PQR3799_ABZ|       PQR3799|       3799|
|   MGE8983_ABZ|       MGE8983|       8983|
+--------------+--------------+-----------+
If I understand you correctly, and if you don't insist on using the PySpark substring or trim functions, you can define a plain Python function that does what you want and apply it as a UDF:
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf
def mysub(word):
    # Spark passes null cells to the UDF as None
    if word is None:
        return None
    # drop the "_ABZ" suffix (4 characters) when present
    if word.endswith('_ABZ'):
        word = word[:-4]
    # drop the first three characters
    return word[3:]
udf1 = udf(mysub, StringType())
df.withColumn('new_label', udf1('label')).show()
The output will be like:
+---+--------------+-----------+
| id|         label|  new_label|
+---+--------------+-----------+
|  1|ABC00909083888|00909083888|
|  2|ABC93890380380|93890380380|
|  3|    XYZ7394949|    7394949|
|  4|    XYZ3898302|    3898302|
|  5|   PQR3799_ABZ|       3799|
|  6|   MGE8983_ABZ|       8983|
+---+--------------+-----------+
Please let me know if I got you wrong in some cases.
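Both trimming steps can probably be folded into one regular expression, which avoids the UDF and the intermediate column. The sketch below checks the pattern with Python's re module on the sample values. It assumes that Spark's regexp_replace, which uses Java regular expressions, treats this simple pattern the same way, and that every value is longer than the prefix and suffix combined. The name trim_value is hypothetical.

```python
import re

# Matches either the first three characters or a trailing "_ABZ".
PATTERN = r"^.{3}|_ABZ$"


def trim_value(value):
    """Remove the 3-character prefix and, if present, the "_ABZ" suffix."""
    return re.sub(PATTERN, "", value)


samples = ["ABC00909083888", "XYZ7394949", "PQR3799_ABZ", "MGE8983_ABZ"]
for s in samples:
    print(s, "->", trim_value(s))

# Assumed Spark equivalent (not executed here):
#   from pyspark.sql.functions import regexp_replace
#   df.withColumn("final_vals", regexp_replace("values", r"^.{3}|_ABZ$", ""))
```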

pyspark select first element over window on some condition

Problem
Hello is there a way in pyspark/spark to select first element over some window on some condition?
Examples
Let's have an example input dataframe
+---------+----------+----+----+----------------+
|       id| timestamp|  f1|  f2|        computed|
+---------+----------+----+----+----------------+
|        1|2020-01-02|null|c1f2|            [f2]|
|        1|2020-01-01|c1f1|null|            [f1]|
|        2|2020-01-01|c2f1|null|            [f1]|
+---------+----------+----+----+----------------+
For each id, I want to select the latest value of each column (f1, f2, ...) that was computed.
So the "code" would look like this
cols = ["f1", "f2"]
w = Window().partitionBy("id").orderBy(f.desc("timestamp")).rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
output_df = (
    input_df.select(
        "id",
        *[f.first(col, condition=f.array_contains(f.col("computed"), col)).over(w).alias(col) for col in cols]
    )
    .groupBy("id")
    .agg(*[f.first(col).alias(col) for col in cols])
)
And output should be
+---------+----+----+
|       id|  f1|  f2|
+---------+----+----+
|        1|c1f1|c1f2|
|        2|c2f1|null|
+---------+----+----+
If the input looks like this
+---------+----------+----+----+----------------+
|       id| timestamp|  f1|  f2|        computed|
+---------+----------+----+----+----------------+
|        1|2020-01-02|null|c1f2|        [f1, f2]|
|        1|2020-01-01|c1f1|null|            [f1]|
|        2|2020-01-01|c2f1|null|            [f1]|
+---------+----------+----+----+----------------+
Then the output should be
+---------+----+----+
|       id|  f1|  f2|
+---------+----+----+
|        1|null|c1f2|
|        2|c2f1|null|
+---------+----+----+
As you can see, it's not enough to use f.first(ignorenulls=True), because in this case we don't want to skip the null: it counts as a computed value.
Current solution
Step 1
Save original data types
cols = ["f1", "f2"]
orig_dtypes = [field.dataType for field in input_df.schema if field.name in cols]
Step 2
For each column, create a new column holding its value if the column is computed, and replace an original null with our "synthetic" <NULL> string
output_df = input_df.select(
    "id", "timestamp", "computed",
    *[
        f.when(f.array_contains(f.col("computed"), col) & f.col(col).isNotNull(), f.col(col))
        .when(f.array_contains(f.col("computed"), col) & f.col(col).isNull(), "<NULL>")
        .alias(col)
        for col in cols
    ]
)
Step 3
Select first non null value over window because now we know that <NULL> won't be skipped
output_df = (
    output_df.select(
        "id",
        *[f.first(col, ignorenulls=True).over(w).alias(col) for col in cols],
    )
    .groupBy("id")
    .agg(*[f.first(col).alias(col) for col in cols])
)
Step 4
Replace our "synthetic" <NULL> with the original nulls.
output_df = output_df.replace("<NULL>", None)
Step 5
Cast the columns back to their original types, because they might have been retyped to string in step 2
output_df = output_df.select("id", *[f.col(col).cast(type_) for col, type_ in zip(cols, orig_dtypes)])
This solution works but it does not seem to be the right way to do it. Besides it's pretty heavy and it's taking too long to get computed.
Is there any other more "sparkish" way to do it?
Here's one way, using the struct ordering trick.
Group by id and, for each column in the cols list, collect a list of structs of the form struct<col_exists_in_computed, timestamp, col_value>. Structs are compared field by field, so applying array_max to the resulting array gives you the latest value you want:
from pyspark.sql import functions as F
output_df = input_df.groupBy("id").agg(
    *[F.array_max(
        F.collect_list(
            F.struct(F.array_contains("computed", c), F.col("timestamp"), F.col(c))
        )
    )[c].alias(c) for c in cols]
)
# applied to your second example DataFrame, it gives:
output_df.show()
#+---+----+----+
#| id|  f1|  f2|
#+---+----+----+
#|  1|null|c1f2|
#|  2|c2f1|null|
#+---+----+----+
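To see why the struct ordering picks the right row, the same logic can be imitated in plain Python with tuples. This is only an illustrative sketch, not Spark code: the function name latest_computed and the dict-based rows are assumptions made for the example. Rows where the column is computed sort above rows where it is not, and among those the later timestamp wins.

```python
from collections import defaultdict

rows = [
    {"id": 1, "timestamp": "2020-01-02", "f1": None, "f2": "c1f2", "computed": ["f1", "f2"]},
    {"id": 1, "timestamp": "2020-01-01", "f1": "c1f1", "f2": None, "computed": ["f1"]},
    {"id": 2, "timestamp": "2020-01-01", "f1": "c2f1", "f2": None, "computed": ["f1"]},
]
cols = ["f1", "f2"]


def latest_computed(rows, cols):
    """For each id and column, take the value from the max (is_computed, timestamp) row."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["id"]].append(row)
    result = {}
    for id_, group in groups.items():
        result[id_] = {}
        for c in cols:
            # Same field order as the struct: (is_computed, timestamp, value).
            structs = [(c in r["computed"], r["timestamp"], r[c]) for r in group]
            # Compare on the first two fields only, so None values are never compared.
            best = max(structs, key=lambda s: s[:2])
            result[id_][c] = best[2]
    return result


print(latest_computed(rows, cols))
```

For id 1 and f1, the 2020-01-02 row is marked as computed, so its null is kept rather than skipped, which is the behaviour the question asks for.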

How to filter pyspark dataframes

I have seen many questions related to filtering pyspark dataframes but despite my best efforts I haven't been able to get any of the non-SQL solutions to work.
+----------+-------------+-------+--------------------+--------------+---+
|purch_date| purch_class|tot_amt| serv-provider|purch_location| id|
+----------+-------------+-------+--------------------+--------------+---+
|03/11/2017|Uncategorized| -17.53| HOVER | | 0|
|02/11/2017| Groceries| -70.05|1774 MAC'S CONVEN...| BRAMPTON | 1|
|31/10/2017|Gasoline/Fuel| -20| ESSO | | 2|
|31/10/2017| Travel| -9|TORONTO PARKING A...| TORONTO | 3|
|30/10/2017| Groceries| -1.84| LONGO'S # 2| | 4|
This did not work:
df1 = spark.read.csv("/some/path/to/file", sep=',')\
.filter((col('purch_location')=='BRAMPTON')
And this did not work
df1 = spark.read.csv("/some/path/to/file", sep=',')\
.filter(purch_location == 'BRAMPTON')
This (SQL expression) works but takes a VERY long time; I imagine there's a faster non-SQL approach:
df1 = spark.read.csv("/some/path/to/file", sep=',')\
.filter("purch_location == 'BRAMPTON'")
UPDATE I should mention I am able to use methods like (which run faster than the SQL expression):
df1 = spark.read.csv("/some/path/to/file", sep=',')
df2 = df1.filter(df1.purch_location == "BRAMPTON")
But I want to understand why the "pipe" / connection syntax is incorrect.
You can use df["purch_location"]:
df = spark.read.csv("/some/path/to/file", sep=',')
df = df.filter(df["purch_location"] == "BRAMPTON")
If you insist on using the backslash, you can do:
from pyspark.sql.functions import col
df = spark.read.csv('/some/path/to/file', sep=',') \
.filter(col('purch_location') == 'BRAMPTON')
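Your first attempt failed because the parentheses are not balanced. Your second attempt failed because purch_location on its own is an undefined Python name; the column has to be referenced as col('purch_location') or df['purch_location'].
Also, it seems there are trailing spaces after the string BRAMPTON, so you might want to trim the column first: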
from pyspark.sql.functions import col, trim
df = spark.read.csv('/some/path/to/file', sep=',') \
.filter(trim(col('purch_location')) == 'BRAMPTON')

How to use to_json and from_json to eliminate nested structfields in pyspark dataframe?

This solution, in theory, works perfectly for what I need: creating a new copy of a DataFrame while excluding certain nested struct fields. Here is a minimal reproducible example of my issue:
>>> df.printSchema()
root
 |-- big: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- keep: string (nullable = true)
 |    |    |-- delete: string (nullable = true)
which you can instantiate like such:
schema = StructType([
    StructField("big", ArrayType(StructType([
        StructField("keep", StringType()),
        StructField("delete", StringType()),
    ])))
])
df = spark.createDataFrame(spark.sparkContext.emptyRDD(), schema)
My goal is to convert the dataframe (along with the values in the columns I want to keep) to one that excludes certain nested structs, like delete for example.
root
 |-- big: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- keep: string (nullable = true)
According to the solution I linked that tries to leverage pyspark.sql's to_json and from_json functions, it should be accomplishable with something like this:
new_schema = StructType([StructField("big", ArrayType(StructType([
StructField("keep", StringType())
])))])
test_df = df.withColumn("big", to_json(col("big"))).withColumn("big", from_json(col("big"), new_schema))
>>> test_df.printSchema()
root
 |-- big: struct (nullable = true)
 |    |-- big: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- keep: string (nullable = true)
>>> test_df.show()
+----+
| big|
+----+
|null|
+----+
So either I'm not following his directions right, or it doesn't work. How do you do this without a udf?
Pyspark to_json documentation
Pyspark from_json documentation
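It should work; you just need to change new_schema so that it describes the type of the column 'big' only, not the schema of the whole DataFrame:
from pyspark.sql.functions import from_json, to_json
from pyspark.sql.types import ArrayType, StringType, StructField, StructType
new_schema = ArrayType(StructType([StructField("keep", StringType())]))
test_df = df.withColumn("big", from_json(to_json("big"), new_schema))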
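To make the round trip concrete, here is a plain-Python imitation of what happens to a single cell: the array is serialized to JSON text, then parsed back while keeping only the fields the target schema declares. This is a sketch using the standard json module rather than Spark, and the name project_cell is hypothetical. It also shows why the schema in the question did not fit: the serialized cell is a JSON array, not an object with a "big" key.

```python
import json


def project_cell(cell_json, keep_fields):
    """Parse one serialized array-of-structs cell and keep only the listed fields."""
    if cell_json is None:
        return None
    return [{k: item.get(k) for k in keep_fields} for item in json.loads(cell_json)]


# One cell of the "big" column after serialization: a JSON array of objects.
cell = json.dumps([{"keep": "a", "delete": "b"}, {"keep": "c", "delete": "d"}])
print(cell)
print(project_cell(cell, ["keep"]))
```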

Filtering Spark Dataframe

I've created a dataframe as:
ratings = imdb_data.sort('imdbRating').select('imdbRating').filter('imdbRating is NOT NULL')
Upon running ratings.show(), as shown below, I can see that the imdbRating field contains mixed data: random strings, movie titles, movie URLs, and actual ratings. The dirty data looks like this:
+--------------------+
| imdbRating|
+--------------------+
|Mary (TV Episode...|
| Paranormal Activ...|
| Sons (TV Episode...|
| Spion (2011)|
| Winter... und Fr...|
| and Gays (TV Epi...|
| grAs - Die Serie...|
| hat die Wahl (2000)|
| 1.0|
| 1.3|
| 1.4|
| 1.5|
| 1.5|
| 1.5|
| 1.6|
| 1.6|
| 1.7|
| 1.9|
| 1.9|
| 1.9|
+--------------------+
only showing top 20 rows
Is there any way I can filter out the unwanted strings and get just the ratings? I tried using a UDF:
ratings_udf = udf(lambda imdbRating: imdbRating if isinstance(imdbRating, float) else None)
and tried calling it as:
ratings = imdb_data.sort('imdbRating').select('imdbRating')
filtered = ratings.withColumn('imdbRating', ratings_udf(ratings.imdbRating))
The problem with above is, since it tried calling the udf on each row, each row of the dataframe mapped to a Row type and hence returning None on all the values.
Is there any straightforward way to filter out those data ?
Any help will be much appreciated. Thank you
Finally, I was able to resolve it. The problem was that some corrupt rows did not have all fields present. First, I tried pandas, reading the CSV file as:
pd_frame = pd.read_csv('imdb.csv', error_bad_lines=False)
This skipped the corrupt rows that had fewer columns than expected. I then tried to load the pandas DataFrame, pd_frame, into Spark using:
imdb_data= spark.createDataFrame(pd_frame)
but got an error because of a mismatch while inferring the schema. It turns out the Spark CSV reader has a similar option that drops the corrupt rows:
imdb_data = spark.read.csv('imdb.csv', header='true', mode='DROPMALFORMED')
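If some non-numeric values remain after dropping malformed rows, a parsing function may be more reliable than the isinstance check in the question. One likely reason that UDF returned None everywhere, assuming the CSV was read without a schema so that every column is a string, is that isinstance(value, float) is never true for text. The sketch below parses the text instead. The name parse_rating and the 0 to 10 range check are assumptions made for illustration.

```python
def parse_rating(value):
    """Return the rating as a float, or None when the cell is not a plausible rating."""
    if value is None:
        return None
    try:
        rating = float(value)
    except (TypeError, ValueError):
        return None
    # Assumed valid range; this also rejects NaN, which fails both comparisons.
    return rating if 0.0 <= rating <= 10.0 else None


for cell in ["1.5", " Spion (2011)", None, "9.1"]:
    print(repr(cell), "->", parse_rating(cell))

# Assumed Spark usage (not executed here):
#   from pyspark.sql.functions import udf
#   from pyspark.sql.types import DoubleType
#   rating_udf = udf(parse_rating, DoubleType())
#   clean = imdb_data.withColumn("imdbRating", rating_udf("imdbRating")).dropna(subset=["imdbRating"])
```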