Replace datetime column in DF with hour from int column - python

I have a pyspark df with an hour column (int) like this:
hour
0
0
1
...
14
And I have an execution_datetime variable that looks like 2022-01-02 17:23:11
Now I want to calculate a new column for my DF that holds my execution_datetime, but with the hour replaced by the values from the hour column. The output should look like:
hour    exec_dttm_with_hour
0       2022-01-02 00:23:11
0       2022-01-02 00:23:11
1       2022-01-02 01:23:11
...     ...
14      2022-01-02 14:23:11
I know there are ways using e.g. .collect(), then editing the list and inserting it as a new column. But I need to make use of Spark's parallel execution since the data volume could be very high. Also, converting it to pandas and then editing it is not suitable for my use case.
Thanks in advance for any suggestions!

You can use the make_timestamp function if you're on Spark 3+; otherwise you can use a UDF.
from pyspark.sql import functions as func

execution_datetime = '2022-01-02 17:23:11'  # the value from the question

spark.range(15). \
    withColumnRenamed('id', 'hour'). \
    withColumn('static_dttm', func.lit(execution_datetime).cast('timestamp')). \
    withColumn('dttm',
               func.expr('''make_timestamp(year(static_dttm),
                                           month(static_dttm),
                                           day(static_dttm),
                                           hour,
                                           minute(static_dttm),
                                           second(static_dttm)
                                           )
                         ''')
               ). \
    drop('static_dttm'). \
    show()
# +----+-------------------+
# |hour| dttm|
# +----+-------------------+
# | 0|2022-01-02 00:23:11|
# | 1|2022-01-02 01:23:11|
# | 2|2022-01-02 02:23:11|
# | 3|2022-01-02 03:23:11|
# | 4|2022-01-02 04:23:11|
# | 5|2022-01-02 05:23:11|
# | 6|2022-01-02 06:23:11|
# | 7|2022-01-02 07:23:11|
# | 8|2022-01-02 08:23:11|
# | 9|2022-01-02 09:23:11|
# | 10|2022-01-02 10:23:11|
# | 11|2022-01-02 11:23:11|
# | 12|2022-01-02 12:23:11|
# | 13|2022-01-02 13:23:11|
# | 14|2022-01-02 14:23:11|
# +----+-------------------+
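As a sketch of how this applies to your own DataFrame instead of the spark.range demo (assuming it is called df, has the integer hour column, and execution_datetime is the string from the question):
# same make_timestamp approach, applied to an existing df with an int `hour` column
df_with_ts = df. \
    withColumn('static_dttm', func.lit(execution_datetime).cast('timestamp')). \
    withColumn('exec_dttm_with_hour',
               func.expr('make_timestamp(year(static_dttm), month(static_dttm), day(static_dttm), '
                         'hour, minute(static_dttm), second(static_dttm))')). \
    drop('static_dttm')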
Using a UDF:
from pyspark.sql import functions as func
from pyspark.sql.types import TimestampType

def update_ts(string_ts, hour_col):
    import datetime
    dttm = datetime.datetime.strptime(string_ts, '%Y-%m-%d %H:%M:%S')
    return datetime.datetime(dttm.year, dttm.month, dttm.day, hour_col, dttm.minute, dttm.second)

update_ts_udf = func.udf(update_ts, TimestampType())

spark.range(15). \
    withColumnRenamed('id', 'hour'). \
    withColumn('dttm', update_ts_udf(func.lit(execution_datetime), func.col('hour'))). \
    show()
# +----+-------------------+
# |hour| dttm|
# +----+-------------------+
# | 0|2022-01-02 00:23:11|
# | 1|2022-01-02 01:23:11|
# | 2|2022-01-02 02:23:11|
# | 3|2022-01-02 03:23:11|
# | 4|2022-01-02 04:23:11|
# | 5|2022-01-02 05:23:11|
# | 6|2022-01-02 06:23:11|
# | 7|2022-01-02 07:23:11|
# | 8|2022-01-02 08:23:11|
# | 9|2022-01-02 09:23:11|
# | 10|2022-01-02 10:23:11|
# | 11|2022-01-02 11:23:11|
# | 12|2022-01-02 12:23:11|
# | 13|2022-01-02 13:23:11|
# | 14|2022-01-02 14:23:11|
# +----+-------------------+

You can use concat to join multiple strings within a row and lit to add a constant value to each row.
In the following code, a new column timestamp is built by concatenating the first 11 characters of execution_datetime (the date plus a space), the zero-padded hour column, and the characters after the hour (the minutes and seconds). lpad makes sure the hour gets a leading zero.
import pyspark.sql.functions as f
df = df.withColumn('timestamp', f.concat(f.lit(execution_datetime[0:11]),
                                         f.lpad(f.col('hour').cast('string'), 2, '0'),
                                         f.lit(execution_datetime[13:])))
Remark: This might be faster than using the timestamp functions suggested in samkart's answer, but it is also less safe when it comes to catching bad inputs.
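A closely related variant (my sketch, not from the original answer) builds the same string with Spark's date_format instead of Python string slicing; it assumes execution_datetime is the Python string from the question:
import pyspark.sql.functions as f

ts = f.lit(execution_datetime).cast('timestamp')
df = df.withColumn(
    'timestamp',
    f.concat(
        f.date_format(ts, 'yyyy-MM-dd '),              # date part plus the separating space
        f.lpad(f.col('hour').cast('string'), 2, '0'),  # zero-padded hour from the int column
        f.date_format(ts, ':mm:ss')                    # minutes and seconds
    )
)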

Related

Object to dictionary to use get() - python pandas

I'm having some issues with a column in my CSV whose type is 'object', but it should be a series of dicts (a dict for each row).
The point is to turn each row into a dict so I can call get('id') on it and return the id value for each row in the 'Conta' column.
This is how it looks as an 'object' column:
| Conta |
| ---------------------------------------------|
| {'name':'joe','id':'4347176000574713087'} |
| {'name':'mary','id':'4347176000115055151'} |
| {'name':'fred','id':'4347176000574610147'} |
| {'name':'Marcos','id':'4347176000555566806'} |
| {'name':'marcos','id':'4347176000536834310'} |
This is how it should look in the end:
| Conta |
| ------------------- |
| 4347176000574713087 |
| 4347176000115055151 |
| 4347176000574610147 |
| 4347176000555566806 |
| 4347176000536834310 |
I tried to use:
import pandas as pd
df = pd.read_csv('csv/Modulo_CS.csv')
df['Conta'] = df['Conta'].to_dict()
df['Conta'] = [x.get('id', 0) for x in df['Conta']]
#return: AttributeError: 'str' object has no attribute 'get'
I also tried to use ast.literal_eval(), but it doesn't work either:
import ast
import pandas as pd
df = pd.read_csv('csv/Modulo_CS.csv')
df = df[['Conta','ID_CS']]
df['Conta'] = df['Conta'].apply(ast.literal_eval)
#return: ValueError: malformed node or string: nan
Can someone help me?
Consider replacing the following line:
df['Conta'] = df['Conta'].apply(ast.literal_eval)
If it's being correctly detected as a dictionary, then:
df['Conta'] = df['Conta'].map(lambda x: x['id'])
If each row is a string:
df['Conta'] = df['Conta'].map(lambda x: ast.literal_eval(x)['id'])
However, if you are getting a malformed node or string error, consider first applying str and then ast.literal_eval():
df['Conta'] = df['Conta'].map(lambda x: ast.literal_eval(str(x))['id'])
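Since the traceback mentions nan, some rows of Conta are probably empty. A small sketch that simply skips those rows (assuming missing values should become None) could look like:
import ast
import pandas as pd

df = pd.read_csv('csv/Modulo_CS.csv')
# parse only the rows that actually contain a string, leave missing values as None
df['Conta'] = df['Conta'].map(lambda x: ast.literal_eval(x)['id'] if isinstance(x, str) else None)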

PySpark - How to loop through the dataframe and match against another common value in another dataframe

This is a recommender system: I have a DataFrame which contains about 10 recommended items for each user (recommendation_df), and another DataFrame which consists of the recent purchases of each user (recent_df).
I am trying to code this task but I can't seem to get the syntax and the manipulation right.
I am implementing a hit/miss ratio: basically, for every new_party_id in recent_df, if any of its merch_store_code values matches a merch_store_code for the same party_id in recommendation_df, then count += 1 (hit).
Then I calculate the hit/miss ratio as count / total user count.
(In recent_df each user might have multiple recent purchases, but if any of those purchases appears in the recommendation list for the same user, it counts as a single hit, i.e. count += 1.)
recommendation_df
+--------------+----------------+-----------+----------+
|party_id_index|merch_store_code| rating| party_id|
+--------------+----------------+-----------+----------+
| 148| 900000166| 0.4021678|G18B00332C|
| 148| 168339566| 0.27687865|G18B00332C|
| 148| 168993309| 0.15999989|G18B00332C|
| 148| 168350313| 0.1431974|G18B00332C|
| 148| 168329726| 0.13634883|G18B00332C|
| 148| 168351967|0.120235085|G18B00332C|
| 148| 168993312| 0.11800903|G18B00332C|
| 148| 168337234|0.116267696|G18B00332C|
| 148| 168993256| 0.10836013|G18B00332C|
| 148| 168339482| 0.10341005|G18B00332C|
| 463| 168350313| 0.93455887|K18M926299|
| 463| 900000072| 0.8275664|K18M926299|
| 463| 700012303| 0.70220494|K18M926299|
| 463| 700012180| 0.23209469|K18M926299|
| 463| 900000157| 0.1727839|K18M926299|
| 463| 700013689| 0.13854747|K18M926299|
| 463| 900000166| 0.12866624|K18M926299|
| 463| 168993284|0.107065596|K18M926299|
| 463| 168993269| 0.10272527|K18M926299|
| 463| 168339566| 0.10256036|K18M926299|
+--------------+----------------+-----------+----------+
recent_df
+------------+---------------+----------------+
|new_party_id|recent_purchase|merch_store_code|
+------------+---------------+----------------+
| A11275842R| 2022-05-21| 168289403|
| A131584211| 2022-06-01| 168993311|
| A131584211| 2022-06-01| 168349493|
| A131584211| 2022-06-01| 168350192|
| A182P3539K| 2022-03-26| 168341707|
| A182V2883F| 2022-05-26| 168350824|
| A183B5482P| 2022-05-10| 168993464|
| A183C6900K| 2022-05-14| 168338795|
| A183D56093| 2022-05-20| 700012303|
| A183J5388G| 2022-03-18| 700013650|
| A183U8880P| 2022-04-01| 900000072|
| A183U8880P| 2022-04-01| 168991904|
| A18409762L| 2022-05-10| 168319352|
| A18431276J| 2022-05-14| 168163905|
| A18433684M| 2022-03-21| 168993324|
| A18433978F| 2022-05-20| 168341876|
| A184410389| 2022-05-04| 900000166|
| A184716280| 2022-04-06| 700013653|
| A18473797O| 2022-05-24| 168330339|
| A18473797O| 2022-05-24| 168350592|
+------------+---------------+----------------+
Here is my current coding logic:
count = 0
def hitratio(recommendation_df, recent_df):
    for i in recent_df['new_party_id']:
        for j in recommendation_df['party_id']:
            if (i = j) & i.merch_store_code == j.merch_store_code:
                count += 1
    return (count / recent_df.count())
In Spark, refrain from looping over rows. Spark does not work like that; you need to think in terms of whole columns, not row-by-row operations.
You need to join both tables and select the matching users without duplicates (distinct):
from pyspark.sql import functions as F

df_distinct_matches = (
    recent_df
    .join(
        recommendation_df,
        (F.col('new_party_id') == F.col('party_id'))
        # a hit also requires the purchased store code to appear in that user's recommendations
        & (recent_df['merch_store_code'] == recommendation_df['merch_store_code'])
    )
    .select('party_id').distinct()
)
hit = df_distinct_matches.count()
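To get the ratio the question describes, one option (my addition, assuming every distinct user in recent_df belongs in the denominator) is:
total_users = recent_df.select('new_party_id').distinct().count()
hit_ratio = hit / total_users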
Assumption: I am taking the total row count of recent_df as the denominator for calculating the hit/miss ratio; you can change the formula.
from pyspark.sql import functions as F

matching_cond = ((recent_df["merch_store_code"] == recommendation_df["merch_store_code"])
                 & (recommendation_df["party_id"].isNotNull()))

df_recent_fnl = recent_df.join(recommendation_df, recent_df["new_party_id"] == recommendation_df["party_id"], "left")\
    .select(recent_df["*"], recommendation_df["merch_store_code"], recommendation_df["party_id"])\
    .withColumn("hit", F.when(matching_cond, F.lit(True)).otherwise(F.lit(False)))

df_recent_fnl = df_recent_fnl.withColumn("hit/miss",
                                         F.lit(df_recent_fnl.filter(F.col("hit")).count() / recent_df.count()))
Do let me know if you have any questions around this. If you like my solution, you can upvote.

PySpark : How do you use the values in multiple columns to perform some sort of aggregation?

What I have:
#+-------+----------+----------+
#|dotId |codePp |status |
#+-------+----------+----------+
#|dot0001 |Pp3523 |start |
#|dot0001 |Pp3524 |stop |
#|dot0020 |Pp3522 |start |
#|dot0020 |Pp3556 |stop |
#|dot9999 |Pp3545 |stop |
#|dot9999 |Pp3523 |start |
#|dot9999 |Pp3587 |stop |
#|dot9999 |Pp3567 |start |
#------------------------------|
What I want:
Instruction: if status is 'stop', output codePp with '(stop)' appended; otherwise output codePp as it is.
#+-------+----------------------------------------------+
#|dotId |codePp |
#+-------+----------------------------------------------+
#|dot0001 |Pp3523, Pp3524(stop) |
#|dot0020 |Pp3522, Pp3556(stop) |
#|dot9999 |Pp3545(stop), Pp3523, Pp3587(stop), Pp3567 |
#-------------------------------------------------------|
But how do I write this in PySpark?
You may try the following, using a case expression (when) to determine whether to append the status. This is done inside a groupBy/aggregation that uses collect_list to gather all codePp values and concat_ws to convert them into a comma-separated string.
from pyspark.sql import functions as F

output_df = (
    df.groupBy("dotId")
    .agg(
        F.concat_ws(
            ', ',
            F.collect_list(
                F.concat(
                    F.col("codePp"),
                    F.when(F.col("status") == "stop", "(stop)").otherwise("")
                )
            )
        ).alias("codePp")
    )
)
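To inspect the result (note that collect_list does not guarantee the order of elements within each list):
output_df.show(truncate=False)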
Let me know if this works for you.

Convert a value using a value from a different row with petl?

I have the following table:
+---------+------------+----------------+
| IRR | Price List | Cambrdige Data |
+=========+============+================+
| '1.56%' | '0' | '6/30/1989' |
+---------+------------+----------------+
| '5.17%' | '100' | '9/30/1989' |
+---------+------------+----------------+
| '4.44%' | '0' | '12/31/1990' |
+---------+------------+----------------+
I'm trying to write a calculator that updates the Price List field by making a simple calculation. The logic is basically this:
previous price * ( 1 + IRR%)
So for the last row, the calculation would be: 100 * (1 + 4.44%) = 104.44
Since I'm using petl, I'm trying to figure out how to update a field using the value from the row above together with a value from the same row, and then populate this across the whole Price List column. I can't find a petl utility that does this. Should I just write a method manually? What do you think?
Try this:
import petl as etl

# with pass_row=True the conversion function receives the value and the whole row,
# so it can use other fields from the same row (here IRR, stripped of its '%' sign)
table = etl.convert(table, 'Price List',
                    lambda v, row: 100 * (1 + float(str(row.IRR).strip("'% ")) / 100),
                    pass_row=True)
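The formula in the question also needs the previous row's price, which convert cannot see. A rough sketch of one way to do that (my addition, assuming the rows are already in date order and IRR values look like '4.44%') is to materialise the rows and carry the previous price forward:
import petl as etl

rows = list(etl.dicts(table))  # materialise the rows in their current order
prev_price = None
for r in rows:
    irr = float(str(r['IRR']).strip("'% ")) / 100
    original_price = float(str(r['Price List']).strip("'"))
    if prev_price is not None:
        # previous row's original price, matching the 100 * (1 + 4.44%) = 104.44 example
        r['Price List'] = round(prev_price * (1 + irr), 2)
    prev_price = original_price
table = etl.fromdicts(rows)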

Applying a udf function in a distributed fashion in PySpark

Say I have a very basic Spark DataFrame that consists of a couple of columns, one of which contains a value that I want to modify.
|| value || lang ||
| 3 | en |
| 4 | ua |
Say I want to have a new column per specific class where I add a float number to the given value (this is not very relevant to the final question; in reality I do a prediction with sklearn there, but for simplicity let's assume we are adding stuff, the idea being that I am modifying the value in some way). So, given a dict classes={'1': 2.0, '2': 3.0}, I would like to have a column for each class where I add the class value to the value from the DF, and then save each result to a CSV:
class_1.csv
|| value || lang || my_class | modified ||
| 3 | en | 1 | 5.0 | # this is 3+2.0
| 4 | ua | 1 | 6.0 | # this is 4+2.0
class_2.csv
|| value || lang || my_class | modified ||
| 3 | en | 2 | 6.0 | # this is 3+3.0
| 4 | ua | 2 | 7.0 | # this is 4+3.0
So far I have the following code, which works and modifies the value for each defined class, but it is done with a for loop and I am looking for a better way to optimize it:
import pyspark
from pyspark import SparkConf, SparkContext
from pyspark.sql import functions as F
from pyspark.sql.types import FloatType
from pyspark.sql.functions import udf
from pyspark.sql.functions import lit

# create session and context
spark = pyspark.sql.SparkSession.builder.master("yarn").appName("SomeApp").getOrCreate()
conf = SparkConf().setAppName('Some_App').setMaster("local[*]")
sc = SparkContext.getOrCreate(conf)

my_df = spark.read.csv("some_file.csv")

# modify the value here
def do_stuff_to_column(value, separate_class):
    # do stuff to the column; let's pretend we just add a specific value per class that is read from a dictionary
    class_dict = {'1': 2.0, '2': 3.0}  # would be loaded from somewhere
    return float(value + class_dict[separate_class])

# iterate over each given class later
class_dict = {'1': 2.0, '2': 3.0}  # in reality there are more than 10 classes

# create a udf function
udf_modify = udf(do_stuff_to_column, FloatType())

# loop over each class
for my_class in class_dict:
    # create the column first with lit
    my_df2 = my_df.withColumn("my_class", lit(my_class))
    # modify using the udf function
    my_df2 = my_df2.withColumn("modified", udf_modify("value", "my_class"))
    # write to csv now
    my_df2.write.format("csv").save("class_" + my_class + ".csv")
So the question is: is there a better/faster way of doing this than with a for loop?
I would use some form of join, in this case crossJoin. Here's a MWE:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(3, 'en'), (4, 'ua')], ['value', 'lang'])
classes = spark.createDataFrame([(1, 2.), (2, 3.)], ['class_key', 'class_value'])
res = df.crossJoin(classes).withColumn('modified', F.col('value') + F.col('class_value'))
res.show()
For saving as separate CSVs, I think there is no better way than to use a loop.
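If you do keep the loop just for the writes, a minimal sketch against the MWE above (using its class_key and modified columns, with hypothetical output paths) could be:
# one output directory per class, each containing that class's rows
class_keys = [row['class_key'] for row in classes.select('class_key').distinct().collect()]
for k in class_keys:
    (res.filter(F.col('class_key') == k)
        .write.mode('overwrite')
        .option('header', True)
        .csv(f'class_{k}.csv'))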
