String to datetime conversion is failing - python

The conversion of the string to datetime is failing.
The data in the dataframe has the following format: "2020-08-05T12:34:10.800046".
I used the pattern yyyy-MM-dd'T'HH:mm:ss.SSSSSS:
config_df.withColumn(
    "modifiedDate",
    F.to_timestamp(config_df["modifiedDate"], "yyyy-MM-dd'T'HH:mm:ss.SSSSSS"),
).show()
+------------+
|modifiedDate|
+------------+
| null|
+------------+
The execution works without errors, but all values in the updated column are NULL. Which format should I use?

According to this post, SSS is for milliseconds, so it only matches the first 3 digits (800) of your 800046, no matter how many S characters you add.
I couldn't find any pattern that matches your date, so you first need to update your string to keep only 3 digits at the end, with a regex for example:
from pyspark.sql import functions as F

a = [("2020-08-05T12:34:10.800123",)]
b = ["modifiedDate"]
df = spark.createDataFrame(a, b)

df.withColumn(
    "modifiedDate",
    F.to_timestamp(
        F.regexp_extract(
            "modifiedDate", r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}", 0
        ),
        "yyyy-MM-dd'T'HH:mm:ss.SSS",
    ),
).show()
+-------------------+
| modifiedDate|
+-------------------+
|2020-08-05 12:34:10|
+-------------------+
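Depending on your Spark version, a plain cast may also be worth trying: Spark's string-to-timestamp cast accepts ISO-8601 values with a fractional part, so the microseconds are kept. A minimal sketch (not from the original answer):
from pyspark.sql import functions as F

df = spark.createDataFrame([("2020-08-05T12:34:10.800046",)], ["modifiedDate"])
# Casting the ISO-8601 string directly avoids specifying a format pattern at all.
df.withColumn("modifiedDate", F.col("modifiedDate").cast("timestamp")).show(truncate=False)
# modifiedDate -> 2020-08-05 12:34:10.800046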

Related

Trim String Characters in Pyspark dataframe

Suppose I have a dataframe with values in a column like:
ABC00909083888
ABC93890380380
XYZ7394949
XYZ3898302
PQR3799_ABZ
MGE8983_ABZ
I want to trim these values: remove the first 3 characters, and also remove the trailing _ABZ if the value ends with it.
00909083888
93890380380
7394949
3898302
3799
8983
I tried some methods but they did not work.
from pyspark.sql import functions as f
new_df = df.withColumn("new_column", f.when((condition on some column),
f.substring('Existing_COL', 4, f.length(f.col("Existing_COL"))), ))
Can anyone please tell me which function I can use in pyspark?
trim only removes whitespace characters.
Based upon your input and expected output, see the logic below:
from pyspark.sql.functions import *

data = [("ABC00909083888",), ("ABC93890380380",), ("XYZ7394949",),
        ("XYZ3898302",), ("PQR3799_ABZ",), ("MGE8983_ABZ",)]
df = spark.createDataFrame(data=data, schema=["values"])
(df.withColumn("new_vals",
               when(col('values').rlike("(_ABZ$)"),
                    regexp_replace(col('values'), r'(_ABZ$)', '')).otherwise(col('values')))
   .withColumn("final_vals", expr("substring(new_vals, 4, length(new_vals))"))
).show()
Output
+--------------+--------------+-----------+
| values| new_vals| final_vals|
+--------------+--------------+-----------+
|ABC00909083888|ABC00909083888|00909083888|
|ABC93890380380|ABC93890380380|93890380380|
| XYZ7394949| XYZ7394949| 7394949|
| XYZ3898302| XYZ3898302| 3898302|
| PQR3799_ABZ| PQR3799| 3799|
| MGE8983_ABZ| MGE8983| 8983|
+--------------+--------------+-----------+
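As a compact variation (just a sketch, reusing the df defined above), both rules can be folded into a single regexp_replace, since the regex alternation strips the first three characters and a trailing _ABZ in one pass:
df.withColumn("final_vals", regexp_replace(col("values"), r"^.{3}|_ABZ$", "")).show()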
If I understand you correctly, and if you don't insist on using the pyspark substring or trim functions, you can easily define a function to do what you want and then use it as a udf in spark:
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf
def mysub(word):
    if word.endswith('_ABZ'):
        word = word[:-4]
    return word[3:]

udf1 = udf(lambda x: mysub(x), StringType())
df.withColumn('new_label', udf1('label')).show()
The output will be like:
+---+--------------+-----------+
| id| label| new_label|
+---+--------------+-----------+
| 1|ABC00909083888|00909083888|
| 2|ABC93890380380|93890380380|
| 3| XYZ7394949| 7394949|
| 4| XYZ3898302| 3898302|
| 5| PQR3799_ABZ| 3799|
| 6| MGE8983_ABZ| 8983|
+---+--------------+-----------+
Please let me know if I got you wrong in some cases.
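One caveat: the udf above is not null-safe. If the column can contain nulls, word.endswith will raise on None, so a small guard helps; a hedged sketch:
def mysub(word):
    if word is None:          # pass nulls through instead of raising
        return None
    if word.endswith('_ABZ'):
        word = word[:-4]
    return word[3:]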

pyspark select first element over window on some condition

Problem
Hello, is there a way in pyspark/spark to select the first element over some window based on some condition?
Examples
Let's have an example input dataframe
+---------+----------+----+----+----------------+
| id| timestamp| f1| f2| computed|
+---------+----------+----+----+----------------+
| 1|2020-01-02|null|c1f2| [f2]|
| 1|2020-01-01|c1f1|null| [f1]|
| 2|2020-01-01|c2f1|null| [f1]|
+---------+----------+----+----+----------------+
For each id, I want to select the latest value of each column (f1, f2, ...) that was actually computed.
So the "code" would look like this
cols = ["f1", "f2"]
w = Window().partitionBy("id").orderBy(f.desc("timestamp")).rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
output_df = (
    input_df.select(
        "id",
        *[f.first(col, condition=f.array_contains(f.col("computed"), col)).over(w).alias(col) for col in cols]
    )
    .groupBy("id")
    .agg(*[f.first(col).alias(col) for col in cols])
)
And output should be
+---------+----+----+
| id| f1| f2|
+---------+----+----+
| 1|c1f1|c1f2|
| 2|c2f1|null|
+---------+----+----+
If the input looks like this
+---------+----------+----+----+----------------+
| id| timestamp| f1| f2| computed|
+---------+----------+----+----+----------------+
| 1|2020-01-02|null|c1f2| [f1, f2]|
| 1|2020-01-01|c1f1|null| [f1]|
| 2|2020-01-01|c2f1|null| [f1]|
+---------+----------+----+----+----------------+
Then the output should be
+---------+----+----+
| id| f1| f2|
+---------+----+----+
| 1|null|c1f2|
| 2|c2f1|null|
+---------+----+----+
As you can see, it's not easy to just use f.first(ignorenulls=True), because in this case we don't want to skip the null, as it is itself a computed value.
Current solution
Step 1
Save original data types
cols = ["f1", "f2"]
orig_dtypes = [field.dataType for field in input_df.schema if field.name in cols]
Step 2
For each column, create a new column with its value if the column is computed, and also replace original nulls with our "synthetic" <NULL> string:
output_df = input_df.select(
    "id", "timestamp", "computed",
    *[
        f.when(f.array_contains(f.col("computed"), col) & f.col(col).isNotNull(), f.col(col))
        .when(f.array_contains(f.col("computed"), col) & f.col(col).isNull(), "<NULL>")
        .alias(col)
        for col in cols
    ]
)
Step 3
Select the first non-null value over the window, because now we know that <NULL> won't be skipped:
output_df = (
    output_df.select(
        "id",
        *[f.first(col, ignorenulls=True).over(w).alias(col) for col in cols],
    )
    .groupBy("id")
    .agg(*[f.first(col).alias(col) for col in cols])
)
Step 4
Replace our "synthetic" <NULL> with real nulls:
output_df = output_df.replace("<NULL>", None)
Step 5
Cast the columns back to their original types, because they might have been retyped to string in step 2:
output_df = output_df.select("id", *[f.col(col).cast(type_) for col, type_ in zip(cols, orig_dtypes)])
This solution works, but it does not seem to be the right way to do it. Besides, it's pretty heavy and takes too long to compute.
Is there any other more "sparkish" way to do it?
Here's one way, using the struct-ordering trick.
Group by id and collect a list of structs like struct<col_exists_in_computed, timestamp, col_value> for each column in the cols list; then, applying the array_max function to the resulting array, you get the latest value you want:
from pyspark.sql import functions as F

output_df = input_df.groupBy("id").agg(
    *[
        F.array_max(
            F.collect_list(
                F.struct(F.array_contains("computed", c), F.col("timestamp"), F.col(c))
            )
        )[c].alias(c)
        for c in cols
    ]
)

# applied to your second dataframe example, it gives
output_df.show()
#+---+----+----+
#| id| f1| f2|
#+---+----+----+
#| 1|null|c1f2|
#| 2|c2f1|null|
#+---+----+----+

how to convert a bytearray in one row of a pyspark dataframe to a column of bytes?

My data currently looks something like this
df = pd.DataFrame({'content': [bytearray(b'\x01%\xeb\x8cH\x89')]})
spark.createDataFrame(df).show()
+-------------------+
| content|
+-------------------+
|[01 25 EB 8C 48 89]|
+-------------------+
How do I get a column that has a row for each value in the array?
+-------+
|content|
+-------+
| 1|
| 37|
| 235|
| 140|
| 72|
| 137|
+-------+
I've tried explode but this will not work on a bytearray.
Edit: for additional context, the df is the result of reading in a binary file with spark.read.format('binaryfile').load(...).
I applied a chain of transformations here with comments. It's a bit "hacky" though.
from pyspark.sql import functions as F
(df
 .withColumn('content', F.hex('content'))                                      # convert bytes to hex: 0125EB8C4889
 .withColumn('content', F.regexp_replace('content', r'(\w{2})', '$1,'))        # split hex into chunks: 01,25,EB,8C,48,89,
 .withColumn('content', F.expr('substring(content, 0, length(content) - 1)'))  # remove the trailing comma: 01,25,EB,8C,48,89
 .withColumn('content', F.split('content', ','))                               # split hex values by comma: [01, 25, EB, 8C, 48, 89]
 .withColumn('content', F.explode('content'))                                  # explode hex values to multiple rows
 .withColumn('content', F.conv('content', 16, 10))                             # convert hex to dec
 .show(10, False)
)
# Output
# +-------+
# |content|
# +-------+
# |1 |
# |37 |
# |235 |
# |140 |
# |72 |
# |137 |
# +-------+
You can also use flatMap on the underlying RDD: you pass in a function to parse each data element, the function should emit a sequence, and each element in the sequence becomes a new row.
A longer explanation with more examples is here:
https://koalatea.io/python-pyspark-flatmap/
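A minimal sketch of that idea (the column name content and the use of df.rdd are assumptions based on the question):
import pandas as pd

pdf = pd.DataFrame({'content': [bytearray(b'\x01%\xeb\x8cH\x89')]})
sdf = spark.createDataFrame(pdf)

# flatMap emits one (int,) tuple per byte, so each byte becomes its own row.
rows = sdf.rdd.flatMap(lambda row: [(int(b),) for b in row['content']])
spark.createDataFrame(rows, ['content']).show()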
Converting the bytearray to an array using a UDF might help:
import pandas as pd
import pyspark.sql.functions as f
from pyspark.sql.types import IntegerType, ArrayType

byte_to_int = lambda x: [int(y) for y in x]
byte_to_int_udf = f.udf(lambda z: byte_to_int(z), ArrayType(IntegerType()))

df = pd.DataFrame({'content': [bytearray(b'\x01%\xeb\x8cH\x89')]})
df1 = spark.createDataFrame(df)
df1.withColumn("content_array", byte_to_int_udf(f.col('content'))) \
   .select(f.explode(f.col('content_array'))).show()

pyspark `substr` without length

Is there a way, in pyspark, to perform the substr function on a DataFrame column, without specifying the length? Namely, something like df["my-col"].substr(begin).
I am not sure why this function is not exposed as an API in the pyspark.sql.functions module.
Spark SQL supports the substring function without the len argument being defined: substring(str, pos, len).
You can use it with the expr api of the functions module like below to achieve the same:
import pyspark.sql.functions as f

df.withColumn('substr_name', f.expr("substring(name, 2)")).show()
+----------+---+-----------+
| name| id|substr_name|
+----------+---+-----------+
|Alex Shtof| 1| lex Shtof|
| SMaZ| 2| MaZ|
+----------+---+-----------+
How Spark does it internally:
If you look at the physical plan of the above statement, you will notice that when we don't pass len, Spark automatically adds 2147483647.
As @pault said in a comment, 2147483647 is the maximum positive value of a 32-bit signed integer (2^31 - 1).
df.withColumn('substr_name', f.expr("substring(name, 2)")).explain()
== Physical Plan ==
*Project [name#140, id#141L, substring(name#140, 2, 2147483647) AS substr_name#169]
+- Scan ExistingRDD[name#140,id#141L] --> 2147483647 is automatically added
The substring api in the functions module expects us to pass the length explicitly. If you want, you can pass any number in len that is large enough to cover the maximum length of your column:
df.withColumn('substr_name', f.substring('name', 2, 100)).show()
+----------+---+-----------+
| name| id|substr_name|
+----------+---+-----------+
|Alex Shtof| 1| lex Shtof|
| SMaZ| 2| MaZ|
+----------+---+-----------+
>>> df.withColumn('substr_name', f.substring('name', 2, 100)).explain()
== Physical Plan ==
*Project [name#140, id#141L, substring(name#140, 2, 100) AS substr_name#189]
+- Scan ExistingRDD[name#140,id#141L] --> 100 is what we passed
If the objective is to make a substring from a position given by a parameter begin to the end of the string, then you can do it as follows:
import pyspark.sql.functions as f
l = [(1, 'Prague'), (2, 'New York')]
df = spark.createDataFrame(l, ['id', 'city'])
begin = 2
l = (f.length('city') - f.lit(begin) + 1)
(
    df
    .withColumn('substr', f.col('city').substr(f.lit(begin), l))
).show()
+---+--------+-------+
| id| city| substr|
+---+--------+-------+
| 1| Prague| rague|
| 2|New York|ew York|
+---+--------+-------+
I'd create a udf.
>>> import pyspark.sql.functions as F
>>> from pyspark.sql.types import StringType
>>> df = spark.createDataFrame([('Alice', 23), ('Brian', 25)], schema=["name", "age"])
>>> df.show()
+-----+---+
| name|age|
+-----+---+
|Alice| 23|
|Brian| 25|
+-----+---+
>>> @F.udf(returnType=StringType())
... def substr_udf(col):
... return str(col)[2:]
>>> df = df.withColumn('substr', substr_udf('name'))
>>> df.show()
+-----+---+------+
| name|age|substr|
+-----+---+------+
|Alice| 23| ice|
|Brian| 25| ian|
+-----+---+------+
No, we need to specify both parameters, pos and len.
But make sure that both are of the same type, otherwise it will give an error:
Error: Column is not iterable.
You can do it this way (from position 5 to the end of the string, the remaining length is the total length minus 4):
df = df.withColumn("new", F.col("previous").substr(F.lit(5), F.length("previous") - 4))
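If this comes up often, a small helper (purely a sketch; the name substr_from is hypothetical) can wrap the two-argument SQL form shown earlier with expr:
from pyspark.sql import functions as F

def substr_from(col_name, begin):
    # Spark SQL's substring(str, pos) without len runs to the end of the string.
    return F.expr(f"substring({col_name}, {begin})")

# usage, assuming a dataframe with a string column called name:
df.withColumn('substr_name', substr_from('name', 2)).show()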

Pyspark - Convert String to TimeStamp - Getting Nulls

I have the following column as a string on a dataframe df:
+----------------+
|            date|
+----------------+
| 4/23/2019 23:59|
|05/06/2019 23:59|
| 4/16/2019 19:00|
+----------------+
I am trying to convert this to Timestamp but I am only getting NULL values.
My statement is:
from pyspark.sql.functions import col, unix_timestamp
df.withColumn('date',unix_timestamp(df['date'], "MM/dd/yyyy hh:mm").cast("timestamp"))
Why am I getting only null values? Is it because of the month format (since I have an additional 0 on 05)?
Thanks!
The pattern for the 24-hour format is HH; hh is for am/pm.
https://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html
import pyspark.sql.functions as psf

df \
    .withColumn('converted_date', psf.to_timestamp('date', format='MM/dd/yyyy HH:mm')) \
    .show()
+----------------+-------------------+
| date| converted_date|
+----------------+-------------------+
| 4/23/2019 23:59|2019-04-23 23:59:00|
|05/06/2019 23:59|2019-05-06 23:59:00|
| 4/16/2019 19:00|2019-04-16 19:00:00|
+----------------+-------------------+
Whether or not there is a leading 0 does not matter.
