PySpark - converting hour and minute data to seconds - python

I have a given time of XXh:YYm (e.g. 1h:23m) that I'm trying to convert to seconds. The tricky part is that if it is less than an hour, the time is given as just YYm (e.g. 52m).
I am currently using
%pyspark
from pyspark.sql.functions import col, regexp_replace, unix_timestamp

newColumn = unix_timestamp(col("time"), "H:mm")
dataF.withColumn('time', regexp_replace('time', 'h|m', '')).withColumn("time", newColumn).show()
This works great for removing the h and m letters and then converting to seconds, but it returns null when the time is less than an hour, as explained above, since the value isn't actually in the H:mm format. What's a good approach to this? I keep trying different things that seem to overcomplicate it, and I still haven't found a solution.
I am leaning toward some sort of conditional like
if value contains 'h:' then newColumn = unix_timestamp(col("time"), "H:mm")
else newColumn = unix_timestamp(col("time"), "mm")
but I am fairly new to pyspark and not sure how to do this to get the final output. I am basically looking for an approach that will convert a time to seconds and can handle formats of '1h:23m' as well as '53m'.

This should do the trick, assuming the time column is StringType. It uses when/otherwise to separate the two formats (by checking whether the value contains 'h') and substring to pull out the minutes.
from pyspark.sql import functions as F
df.withColumn("seconds", F.when(F.col("time").contains("h"), F.unix_timestamp(F.regexp_replace("time", "h|m", ''),"H:mm"))\
.otherwise(F.unix_timestamp(F.substring("time",1,2),"mm")))\
.show()
+------+-------+
|  time|seconds|
+------+-------+
|1h:23m|   4980|
|   23m|   1380|
+------+-------+

You can use "unix_timestamp" function to convert DateTime to unix timestamp in seconds.
You can refer to one of my blog on the Spark DateTime function and go to "unix_timestamp" section.
https://medium.com/expedia-group-tech/deep-dive-into-apache-spark-datetime-functions-b66de737950a
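For instance, a minimal sketch (assuming an active spark session; the data and column names here are illustrative):
from pyspark.sql import functions as F

# hypothetical one-column DataFrame holding a full timestamp string
df = spark.createDataFrame([("2020-09-11 17:42:33",)], ["ts"])

# unix_timestamp parses the string with the given pattern and
# returns the epoch time in seconds
df.withColumn("epoch_seconds", F.unix_timestamp("ts", "yyyy-MM-dd HH:mm:ss")).show()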
Regards,
Neeraj

Related

Get date format code from a string/datetime using python

Is there a way in Python to find out the date format code of a string?
My input would be, for example:
2020-09-11T17:42:33.040Z
What I am looking for, in this example, is to get this:
'%Y-%m-%dT%H:%M:%S.%fZ'
The point is that I have different time formats for different files, so I don't know in advance what my datetime format code will look like.
For processing my data I need Unix time format, but to calculate that I need a solution to this problem.
data["time_unix"] = data.time.apply(lambda row: (datetime.datetime.strptime(row, '%Y-%m-%dT%H:%M:%S.%fZ').timestamp()*100))
Thank you for the support!

How to specify the format of timestamp in python

I have a dataframe with dates in string format. I convert those dates to timestamps so that I can use the date column in a later part of the code. Everything is fine with calculations/comparisons etc., but I would like the timestamp to appear in %d.%m.%Y format, as opposed to the default %Y-%m-%d. Let me illustrate:
dt=pd.DataFrame({'date':['09.12.1998','07.04.2014']},index=[1,2])
dt
Out[4]:
         date
1  09.12.1998
2  07.04.2014
dt['date_1']=pd.to_datetime(dt['date'],format='%d.%m.%Y')
dt
Out[7]:
         date     date_1
1  09.12.1998 1998-12-09
2  07.04.2014 2014-04-07
I would like dt['date_1'] to be displayed in the same format as dt['date']. I don't wish to use the .strftime() function because it will convert the datatype from timestamp to string.
In a nutshell: how can I make Python display the timestamp in the format of my choice (months could be like APR, MAY, etc.) rather than the default format (like 1998-12-09), while keeping the data type a timestamp rather than a string?
It seems pandas hasn't implemented this option yet:
https://github.com/pandas-dev/pandas/issues/11501
Having a look at https://pandas.pydata.org/pandas-docs/stable/options.html, it looks like you can set display options to achieve some of this, although not all:
display.date_dayfirst: when True, prints and parses dates with the day first, e.g. 20/01/2005
display.date_yearfirst: when True, prints and parses dates with the year first, e.g. 2005/01/20
So you can have day first, but they haven't included names for months.
On a more fundamental level, whenever you're displaying something it is a string, right? I'm not sure why you wouldn't be able to convert it when you're displaying it without having to change the original dataframe.
Your code would be:
pd.set_option("display.date_dayfirst", True)
Except actually this doesn't work:
https://github.com/pandas-dev/pandas/issues/11501
The options have been implemented for parsing, but not for displaying.
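A display-only workaround (a sketch reusing the dt frame from the question) is to format a copy at print time, which leaves the underlying dtype untouched:
# format only for display; dt['date_1'] itself stays datetime64
print(dt.assign(date_1=dt['date_1'].dt.strftime('%d.%m.%Y')))
# '%d.%b.%Y' would give abbreviated month names such as Apr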
Hello Stael/Cezar/Droravr, thank you all for providing your inputs. I value your time and appreciate your help a lot. Thanks for sharing this link https://github.com/pandas-dev/pandas/issues/11501 as well. I went through the link and understood that this problem ultimately boils down to a display problem, as also expounded by jreback. The issue of having dates displayed in a desired format has been marked as an enhancement, so it will probably be added in a future version.
All I wanted was to have the dates exported as dd-mm-yyyy, and by just formatting the string while exporting, we could solve this problem.
So, I sorted this issue by exporting the file as:
dt.to_csv(filename, date_format='%d-%m-%Y', index=False)
date        date_1
09.12.1998  09-12-1998
07.04.2014  07-04-2014
Thus, this issue stands SOLVED.
Once again, thank you all for your kind help and the precious hours you spent with this issue. Deeply appreciated.

Apply a function to each row python

I am trying to convert from UTC time to locale time in my dataframe. I have a dictionary where I store the number of hours I need to shift for each country code. So, for example, if I have df['CountryCode'][0]='AU' and df['UTCTime'][0]=2016-08-12 08:01:00, I want to get df['LocaleTime'][0]=2016-08-12 19:01:00, which is
df['UTCTime'][0]+datetime.timedelta(hours=dateDic[df['CountryCode'][0]])
I have tried to do it with a for loop, but since I have more than 1 million rows it's not efficient. I have looked into the apply function, but I can't seem to get it to take inputs from two different columns.
Can anyone help me?
Without a more concrete example it's difficult, but try this:
pd.to_timedelta(df.CountryCode.map(dateDic), 'h') + df.UTCTime
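A self-contained sketch of the same idea (the data and the offset dictionary here are made up for illustration):
import pandas as pd

# hypothetical offset dictionary: country code -> hours to shift from UTC
dateDic = {'AU': 11, 'GB': 0}

df = pd.DataFrame({
    'CountryCode': ['AU', 'GB'],
    'UTCTime': pd.to_datetime(['2016-08-12 08:01:00', '2016-08-12 08:01:00']),
})

# map() turns each country code into its hour offset, to_timedelta makes
# that a timedelta column, and the addition is fully vectorised (no loop)
df['LocaleTime'] = df['UTCTime'] + pd.to_timedelta(df['CountryCode'].map(dateDic), unit='h')
print(df)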

look ahead time analysis in R (data mining algorithm)

I have a file (dozens of columns and millions of rows) that essentially looks like this:
customerID VARCHAR(11)
accountID VARCHAR(11)
snapshotDate Date
isOpen Boolean
...
One record in the file might look like this:
1,100,200901,1,...
1,100,200902,1,...
1,100,200903,1,...
1,100,200904,1,...
1,100,200905,1,...
1,100,200906,1,...
...
1,100,201504,1,...
1,100,201505,1,...
1,100,201506,1,...
When an account is closed, two things can happen. Typically, no further snapshots for that record will exist in the data. Occasionally, further records will continue to be added but the isOpen flag will be set to 0.
I want to add an additional Boolean column, called "closedInYr", that has a 0 value UNLESS THE ACCOUNT CLOSES WITHIN ONE YEAR AFTER THE SNAPSHOT DATE.
My solution is slow and gross. It takes each record, counts forward in time 12 months, and if it finds a record with the same customerID and accountID with isOpen set to 1, it populates the record with a 0 in the "closedInYr" field; otherwise it populates the field with a 1. It works, but the performance is not acceptable, and we have a number of these kinds of files to process.
Any ideas on how to implement this? I use R, but am willing to code in Perl, Python, or practically anything except COBOL or VB.
Thanks
I suggest using the Linux "date" command to convert the dates to Unix timestamps.
A Unix timestamp is the number of seconds elapsed since 1 January 1970. So basically a year is 60s*60m*24h*365d = 31,536,000 seconds. If the difference between the timestamps is more than this number, then it is longer than a year.
It will be something like this:
>date --date='201106' "+%s"
1604642400
So if you use Perl, which is a pretty cool file-handling language, you can parse your whole file in a few lines and use eval/backticks to run your date command.
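Since Python was listed as acceptable, here is a pandas sketch of the same one-year comparison (assuming a header row, YYYYMM snapshot dates, and that an account's closing date is its last snapshot with isOpen == 1; file and column names beyond those in the question are illustrative):
import pandas as pd

df = pd.read_csv('snapshots.csv')  # customerID, accountID, snapshotDate, isOpen, ...
df['snapshotDate'] = pd.to_datetime(df['snapshotDate'], format='%Y%m')

# last snapshot in which each account was still open
last_open = (df[df['isOpen'] == 1]
             .groupby(['customerID', 'accountID'])['snapshotDate']
             .max()
             .rename('lastOpenDate')
             .reset_index())
df = df.merge(last_open, on=['customerID', 'accountID'], how='left')

# flag snapshots whose account stops being open within ~one year (365 days)
df['closedInYr'] = ((df['lastOpenDate'] - df['snapshotDate'])
                    < pd.Timedelta(days=365)).astype(int)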
If all the snapshots for a given record appear in one row, and records that were open for the same period of time have the same length (i.e., snapshots were taken at regular intervals), then one possibility might be filtering based on row lengths. If the longest open row is length N and one year's records are M, then you know an N-M row was open, at longest, one year less than the longest... That approach doesn't handle the case where snapshots keep getting added, albeit with open flags set to 0, but it might cut down the number of searches that need to be made per row.
At least, that's an idea. More generally, searching from the end to find the last year where isOpen == 1 might cut the search down a little...
Of course, this all assumes each record is in one row. If not, maybe a melt is in order first?

How to remove day from datetime index in pandas?

The idea behind this question is that when I'm working with full datetime tags and data from different days, I sometimes want to compare the hourly behavior across days.
But because the days are different, I cannot directly plot two 1-hour data sets on top of each other.
My naive idea would be that I need to remove the day from the datetime index on both sets and then plot them on top of each other. What's the best way to do that?
Or, alternatively, what's the better approach to my problem?
This may not be exactly it, but it should help you along, assuming ts is your time series:
hourly = ts.resample('H').mean()  # newer pandas needs an explicit aggregation after resample
hourly.index = pd.MultiIndex.from_arrays([hourly.index.hour, hourly.index.normalize()])
hourly.unstack().plot()
If you don't care about the day AT ALL, just hourly.index = hourly.index.hour should work.
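A runnable toy version of the same trick (synthetic data, illustrative names):
import numpy as np
import pandas as pd

# two days of synthetic minute-level data
idx = pd.date_range('2021-01-01', periods=2 * 24 * 60, freq='min')
ts = pd.Series(np.random.randn(len(idx)).cumsum(), index=idx)

hourly = ts.resample('h').mean()  # newer pandas needs an explicit aggregation
hourly.index = pd.MultiIndex.from_arrays(
    [hourly.index.hour, hourly.index.normalize()])
hourly.unstack().plot()  # one line per day, hour of day on the x-axis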
