Pandas: Add and Preserve time component when input file has only date in dataframe - python

Scenario:
The input file I read into Pandas has a column with sparsely populated dates in String/Object format.
I need to add the time component. For example:
2021-08-27 is my input as a string, and 2021-08-27 00:00:00 should be the output in datetime64[ns] format.
My Trials:
df = pd.read_parquet("sample.parquet")
df.head()
   a  b  c           dttime_col
0  1  1  2  2021-07-12 00:00:00
1  0  1  0                  NaN
2  1  2  0                  NaN
3  2  1  1  2021-02-04 00:00:00
4  3  5  2                  NaN
df["dttime_col"] = pd.to_datetime(df["dttime_col"])
df["dttime_col"]
Out[16]:
0 2021-07-12
1 NaT
2 NaT
3 2021-02-04
4 NaT
5 2021-05-22
6 NaT
7 2021-10-06
8 2021-01-31
9 NaT
Name: dttime_col, dtype: datetime64[ns]
But as you can see here, there is no time component. I tried passing format="%Y-%m-%d %H:%M:%S", but the output is still the same. Furthermore, I tried adding a default time component as a string:
df["dttime_col"] = df["dttime_col"].dt.strftime("%Y-%m-%d 00:00:00").replace('NaT', np.nan)
Out[17]:
0 2021-07-12 00:00:00
1 NaN
2 NaN
3 2021-02-04 00:00:00
4 NaN
5 2021-05-22 00:00:00
6 NaN
7 2021-10-06 00:00:00
8 2021-01-31 00:00:00
9 NaN
Name: dttime_col, dtype: object
Now this gives me the time next to the date, but in String/Object format. The moment I convert it back to datetime format, all the HH:MM:SS parts are removed.
df["dttime_col"].apply(lambda x: datetime.datetime.strptime(x, "%Y-%m-%d %H:%M:%S") if not isinstance(x, float) else np.nan)
Out[24]:
0 2021-07-12
1 NaT
2 NaT
3 2021-02-04
4 NaT
5 2021-05-22
6 NaT
7 2021-10-06
8 2021-01-31
9 NaT
Name: dttime_col, dtype: datetime64[ns]
It feels like I am going in circles.
Output I expect:
0 2021-07-12 00:00:00
1 NaN
2 NaN
3 2021-02-04 00:00:00
4 NaN
5 2021-05-22 00:00:00
6 NaN
7 2021-10-06 00:00:00
8 2021-01-31 00:00:00
9 NaN
Name: dttime_col, dtype: datetime64[ns]
EDIT 1:
Providing output as asked by @mozway
df["dttime_col"].dt.second
Out[27]:
0 0.0
1 NaN
2 NaN
3 0.0
4 NaN
5 0.0
6 NaN
7 0.0
8 0.0
9 NaN
Name: dttime_col, dtype: float64
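Note: the expected output above is already what the first conversion produces. datetime64[ns] always stores a full timestamp, and pandas merely hides the 00:00:00 in the Series repr when every non-null value falls on midnight (which is why .dt.second returns 0.0, not NaN, for the valid rows in EDIT 1). A minimal sketch with made-up values illustrating this:

```python
import numpy as np
import pandas as pd

s = pd.to_datetime(pd.Series(["2021-07-12", np.nan, "2021-02-04"]))
print(s.dtype)    # datetime64[ns]
print(s.iloc[0])  # 2021-07-12 00:00:00 -- the midnight time is stored

# Once any value has a non-midnight time, the repr shows times for all rows
s2 = pd.concat([s, pd.Series([pd.Timestamp("2021-01-01 12:30:00")])],
               ignore_index=True)
print(s2)
```

So no further conversion should be needed; strftime is only required if the actual goal is a string with an explicit 00:00:00.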

Related

Get last N rows from a particular place in pandas data frame based on a value

I have a dataset like
Sno change date
0 NaN 2017-01-01
1 NaN 2017-02-01
2 NaN 2017-03-01
3 NaN 2017-04-01
4 NaN 2017-05-01
5 NaN 2017-06-01
6 NaN 2017-07-01
7 NaN 2017-08-01
8 0.0 2017-09-01
9 NaN 2017-10-01
10 NaN 2017-11-01
11 1 2017-12-01
12 NaN 2018-01-01
13 NaN 2018-02-01
I want to get the last 5 rows of "date" column in the data frame when the value in column "change" changes from NaN to anything else. So for this example, it will be divided into two sets:
Sno date
3 2017-04-01
4 2017-05-01
5 2017-06-01
6 2017-07-01
7 2017-08-01
8 2017-09-01
and
Sno date
6 2017-07-01
7 2017-08-01
8 2017-09-01
9 2017-10-01
10 2017-11-01
11 2017-12-01
Can anyone help me to get this? Thank you
You can try something like this, with loc and isna:
#df=df.set_index('Sno')
idxs=df.index[~df.change.isna()]
sets=[df.loc[i-5:i,['date']] for i in idxs]
Output:
sets
[ date
Sno
3 2017-04-01
4 2017-05-01
5 2017-06-01
6 2017-07-01
7 2017-08-01
8 2017-09-01,
date
Sno
6 2017-07-01
7 2017-08-01
8 2017-09-01
9 2017-10-01
10 2017-11-01
11 2017-12-01]
You can use isna() to check for NaN values, then np.where to extract the locations of the change rows, and finally np.r_ for creating slices:
s = df.change.isna()
valids = np.where(s.shift() & (~s))[0]
[df.iloc[np.r_[x-5:x]] for x in valids]
[ Sno change date
3 3 NaN 2017-04-01
4 4 NaN 2017-05-01
5 5 NaN 2017-06-01
6 6 NaN 2017-07-01
7 7 NaN 2017-08-01,
Sno change date
6 6 NaN 2017-07-01
7 7 NaN 2017-08-01
8 8 0.0 2017-09-01
9 9 NaN 2017-10-01
10 10 NaN 2017-11-01]

Transform multiple-format duration data to a common '%H:%M:%S' format. The %M part of the format (minutes) is inconsistent

I have Duration data stored as an object with multiple formats, particularly in the minutes part between the colons. Any idea how I can transform this data? I tried everything imaginable with regex (except for the correct answer :) ), and that is the main part I was struggling with. For example, below is my attempt to zero-pad the minutes column.
df['temp'] = df['temp'].replace(':?:', ':0?:', regex=True)
Input:
Duration
0 00:0:00
1 00:00:00
2 00:8:00
3 00:08:00
4 00:588:00
5 09:14:00
Expected Output Option #1 (Time format):
Duration
0 00:00:00
1 00:00:00
2 00:08:00
3 00:08:00
4 09:48:00
5 09:14:00
My end goal is to get the minutes, so another acceptable format would be:
Expected Output Option #2 (Minutes - integer or float):
Minutes
0 0
1 0
2 8
3 8
4 588
5 554
We can just do pd.to_timedelta:
pd.to_timedelta(df.Duration)
Output:
0 00:00:00
1 00:00:00
2 00:08:00
3 00:08:00
4 09:48:00
5 09:14:00
Name: Duration, dtype: timedelta64[ns]
Or Option 2 - Minutes:
pd.to_timedelta(df.Duration).dt.total_seconds()/60
Output:
0 0.0
1 0.0
2 8.0
3 8.0
4 588.0
5 554.0
Name: Duration, dtype: float64
We can split on ':' and weight the parts with mul:
df.Duration.str.split(':',expand=True).astype(int).mul([60,1,1/60]).sum(1)
0 0.0
1 0.0
2 8.0
3 8.0
4 588.0
5 554.0
dtype: float64
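The two answers can be cross-checked against each other on the question's data; a small sketch:

```python
import pandas as pd

df = pd.DataFrame({"Duration": ["00:0:00", "00:00:00", "00:8:00",
                                "00:08:00", "00:588:00", "09:14:00"]})

# Option 1: pd.to_timedelta copes with the unpadded/overflowing minutes field
td = pd.to_timedelta(df["Duration"])

# Option 2: split on ':' and weight hours/minutes/seconds into minutes
mins = (df["Duration"].str.split(":", expand=True)
        .astype(int).mul([60, 1, 1 / 60]).sum(1))

print(mins.tolist())  # [0.0, 0.0, 8.0, 8.0, 588.0, 554.0]
```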

Convert column of integers to time in HH:MM:SS format efficiently

I am trying to develop a more efficient loop to complete a problem. At the moment, the code below applies a string if it aligns with a specific value. However, the values are in identical order so a loop could make this process more efficient.
Using the df below as an example, using integers to represent time periods, each integer increase equates to a 15 min period. So 1 == 8:00:00 and 2 == 8:15:00 etc. At the moment I would repeat this process until the last time period. If this gets up to 80 it could become very inefficient. Could a loop be incorporated here?
import pandas as pd

d = {'Time': [1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,6,6,6]}
df = pd.DataFrame(data=d)

def time_period(row):
    if row['Time'] == 1:
        return '8:00:00'
    if row['Time'] == 2:
        return '8:15:00'
    if row['Time'] == 3:
        return '8:30:00'
    if row['Time'] == 4:
        return '8:45:00'
    if row['Time'] == 5:
        return '9:00:00'
    if row['Time'] == 6:
        return '9:15:00'
    .....
    if row['Time'] == 80:
        return '4:00:00'

df['24Hr Time'] = df.apply(lambda row: time_period(row), axis=1)
print(df)
Out:
Time 24Hr Time
0 1 8:00:00
1 1 8:00:00
2 1 8:00:00
3 2 8:15:00
4 2 8:15:00
5 2 8:15:00
6 3 8:30:00
7 3 8:30:00
8 3 8:30:00
9 4 8:45:00
10 4 8:45:00
11 4 8:45:00
12 5 9:00:00
13 5 9:00:00
14 5 9:00:00
15 6 9:15:00
16 6 9:15:00
17 6 9:15:00
This is possible with some simple timedelta arithmetic:
df['24Hr Time'] = (
    pd.to_timedelta((df['Time'] - 1) * 15, unit='m') + pd.Timedelta(hours=8))
df.head()
Time 24Hr Time
0 1 08:00:00
1 1 08:00:00
2 1 08:00:00
3 2 08:15:00
4 2 08:15:00
df.dtypes
Time int64
24Hr Time timedelta64[ns]
dtype: object
If you need a string, use pd.to_datetime with unit and origin:
df['24Hr Time'] = (
    pd.to_datetime((df['Time'] - 1) * 15, unit='m', origin='8:00:00')
    .dt.strftime('%H:%M:%S'))
df.head()
Time 24Hr Time
0 1 08:00:00
1 1 08:00:00
2 1 08:00:00
3 2 08:15:00
4 2 08:15:00
df.dtypes
Time int64
24Hr Time object
dtype: object
In general, you want to build a dictionary and map it:
my_dict = {'old_val1': 'new_val1', ...}
df['24Hr Time'] = df['Time'].map(my_dict)
But, in this case, you can do with time delta:
df['24Hr Time'] = pd.to_timedelta(df['Time']*15, unit='T') + pd.to_timedelta('7:45:00')
Output (note that the new column is of type timedelta, not string)
Time 24Hr Time
0 1 08:00:00
1 1 08:00:00
2 1 08:00:00
3 2 08:15:00
4 2 08:15:00
5 2 08:15:00
6 3 08:30:00
7 3 08:30:00
8 3 08:30:00
9 4 08:45:00
10 4 08:45:00
11 4 08:45:00
12 5 09:00:00
13 5 09:00:00
14 5 09:00:00
15 6 09:15:00
16 6 09:15:00
17 6 09:15:00
I ended up using this:
pd.to_datetime((df.Time-1)*15*60+8*60*60,unit='s').dt.time
0 08:00:00
1 08:00:00
2 08:00:00
3 08:15:00
4 08:15:00
5 08:15:00
6 08:30:00
7 08:30:00
8 08:30:00
9 08:45:00
10 08:45:00
11 08:45:00
12 09:00:00
13 09:00:00
14 09:00:00
15 09:15:00
16 09:15:00
17 09:15:00
Name: Time, dtype: object
A fun way is using pd.timedelta_range and index.repeat
n = df.Time.nunique()
c = df.groupby('Time').size()
df['24_hr'] = pd.timedelta_range(start='8 hours', periods=n, freq='15T').repeat(c)
Out[380]:
Time 24_hr
0 1 08:00:00
1 1 08:00:00
2 1 08:00:00
3 2 08:15:00
4 2 08:15:00
5 2 08:15:00
6 3 08:30:00
7 3 08:30:00
8 3 08:30:00
9 4 08:45:00
10 4 08:45:00
11 4 08:45:00
12 5 09:00:00
13 5 09:00:00
14 5 09:00:00
15 6 09:15:00
16 6 09:15:00
17 6 09:15:00
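For completeness, a self-contained version of the timedelta_range/repeat idea. Note it relies on df being sorted by Time, since groupby sizes come back in sorted key order; '15min' is used here as the non-deprecated spelling of '15T' in recent pandas.

```python
import pandas as pd

df = pd.DataFrame({"Time": [1, 1, 1, 2, 2, 2, 3, 3, 3,
                            4, 4, 4, 5, 5, 5, 6, 6, 6]})

n = df["Time"].nunique()       # number of distinct 15-minute slots
c = df.groupby("Time").size()  # rows per slot, in sorted Time order
df["24_hr"] = pd.timedelta_range(start="8 hours", periods=n,
                                 freq="15min").repeat(c)
print(df["24_hr"].iloc[3])  # 0 days 08:15:00
```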

Grouping by preserving hours in pandas

I have the following dataframe. I would like to group by the mean every hour but still preserve the hourly datetime info.
                 date  A   I   r    z
0 2017-08-01 00:00:00  3  56   4  6.0
1 2017-08-01 00:00:01  3  57   1  6.0
2 2017-08-01 00:00:03  3  58   9  6.0
3 2017-08-01 00:00:05  3  52  10  2.0
4 2017-08-01 00:00:06  3  50   1  1.0
df.groupby(df['date'].dt.hour).mean()
      A   I   r    z
date
0     3  56   4  6.0
1     3  57   1  6.0
2     3  58   9  6.0
3     3  52  10  2.0
4     3  50   1  1.0
I would like to keep the same datetime index as before, such as 2017-08-01 00:00:00 (datetime64[ns]).
How can I achieve this output in Python?
Output desired:
                 date  A   I   r    z
0 2017-08-01 00:00:00  3  56   4  6.0
1 2017-08-01 01:00:00  3  57   1  6.0
2 2017-08-01 02:00:00  3  58   9  6.0
3 2017-08-01 03:00:00  3  52  10  2.0
4 2017-08-01 04:00:00  3  50   1  1.0
Using resample
df.set_index('date').resample('H').mean()
Out[179]:
A I r z
date
2017-08-01 00:00:00 3.0 55.75 6.0 5.0
2017-08-01 01:00:00 NaN NaN NaN NaN
2017-08-01 02:00:00 NaN NaN NaN NaN
2017-08-01 03:00:00 3.0 50.00 1.0 1.0
Data input
                 date  A   I   r    z
0 2017-08-01 00:00:00  3  56   4  6.0
1 2017-08-01 00:00:01  3  57   1  6.0
2 2017-08-01 00:00:03  3  58   9  6.0
3 2017-08-01 00:00:05  3  52  10  2.0
4 2017-08-01 03:00:06  3  50   1  1.0  # different hour here
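A self-contained run of the resample answer, rebuilding the answer's input (last row in hour 3):

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2017-08-01 00:00:00", "2017-08-01 00:00:01",
                            "2017-08-01 00:00:03", "2017-08-01 00:00:05",
                            "2017-08-01 03:00:06"]),
    "A": [3, 3, 3, 3, 3],
    "I": [56, 57, 58, 52, 50],
    "r": [4, 1, 9, 10, 1],
    "z": [6.0, 6.0, 6.0, 2.0, 1.0],
})

# resample keeps full hourly timestamps as the index (empty hours become NaN)
out = df.set_index("date").resample("h").mean()
print(out.loc["2017-08-01 00:00:00", "I"])  # 55.75
```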

How to make all non-date values null in Pandas

I have an excel doc where the users put dates and strings in the same column. I want to make every string object null and leave all the dates. How do I do this in pandas? Thanks.
An easy way to convert dates in a DataFrame is with pandas.DataFrame.convert_objects, as mentioned by @Jeff, and it also handles numbers and timedeltas. Here is an example of using it:
# contents of Sheet1 of test.xlsx
x y date1 z date2 date3
1 fum 6/1/2016 7 9/1/2015 string3
2 fo 6/2/2016 alpha string0 10/1/2016
3 fi 6/3/2016 9 9/3/2015 10/2/2016
4 fee 6/4/2016 10 string1 string4
5 dumbledum 6/5/2016 beta string2 10/3/2015
6 dumbledee 6/6/2016 12 9/4/2015 string5
import pandas as pd
xl = pd.ExcelFile('test.xlsx')
df = xl.parse("Sheet1")
df1 = df.convert_objects(convert_dates='coerce')
# 'coerce' required for conversion to NaT on error
df1
Out[7]:
x y date1 z date2 date3
0 1 fum 2016-06-01 7 2015-09-01 NaT
1 2 fo 2016-06-02 alpha NaT 2016-10-01
2 3 fi 2016-06-03 9 2015-09-03 2016-10-02
3 4 fee 2016-06-04 10 NaT NaT
4 5 dumbledum 2016-06-05 beta NaT 2015-10-03
5 6 dumbledee 2016-06-06 12 2015-09-04 NaT
Individual columns in a DataFrame can be converted using pandas.to_datetime, as pointed out by @Jeff, and with pandas.Series.map; however, neither is done in place. For example, with pandas.to_datetime:
import pandas as pd
xl2 = pd.ExcelFile('test.xlsx')
df2 = xl2.parse("Sheet1")
for col in ['date1', 'date2', 'date3']:
    df2[col] = pd.to_datetime(df2[col], coerce=True, infer_datetime_format=True)
df2
Out[8]:
x y date1 z date2 date3
0 1 fum 2016-06-01 7 2015-09-01 NaT
1 2 fo 2016-06-02 alpha NaT 2016-10-01
2 3 fi 2016-06-03 9 2015-09-03 2016-10-02
3 4 fee 2016-06-04 10 NaT NaT
4 5 dumbledum 2016-06-05 beta NaT 2015-10-03
5 6 dumbledee 2016-06-06 12 2015-09-04 NaT
And using pandas.Series.map:
import pandas as pd
import datetime
xl3 = pd.ExcelFile('test.xlsx')
df3 = xl3.parse("Sheet1")
for col in ['date1', 'date2', 'date3']:
    df3[col] = df3[col].map(lambda x: x if isinstance(x, datetime.datetime) else None)
df3
Out[9]:
x y date1 z date2 date3
0 1 fum 2016-06-01 7 2015-09-01 NaT
1 2 fo 2016-06-02 alpha NaT 2016-10-01
2 3 fi 2016-06-03 9 2015-09-03 2016-10-02
3 4 fee 2016-06-04 10 NaT NaT
4 5 dumbledum 2016-06-05 beta NaT 2015-10-03
5 6 dumbledee 2016-06-06 12 2015-09-04 NaT
An upfront way to convert dates in an excel doc is while parsing its sheets. This can be done with the converters option of pandas.ExcelFile.parse, putting a function derived from pandas.to_datetime in the converters dict and passing coerce=True to force errors to NaT. For example:
def converter(x):
    return pd.to_datetime(x, coerce=True, infer_datetime_format=True)
    # the following also works for this example
    # return pd.to_datetime(x, format='%d/%m/%Y', coerce=True)
converters={'date1': converter,'date2': converter, 'date3': converter}
xl4 = pd.ExcelFile('test.xlsx')
df4 = xl4.parse("Sheet1",converters=converters)
df4
Out[10]:
x y date1 z date2 date3
0 1 fum 2016-06-01 7 2015-09-01 NaT
1 2 fo 2016-06-02 alpha NaT 2016-10-01
2 3 fi 2016-06-03 9 2015-09-03 2016-10-02
3 4 fee 2016-06-04 10 NaT NaT
4 5 dumbledum 2016-06-05 beta NaT 2015-10-03
5 6 dumbledee 2016-06-06 12 2015-09-04 NaT
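A caveat for newer pandas: DataFrame.convert_objects and the coerce=True / infer_datetime_format keywords used above were deprecated and later removed. A hedged modern equivalent of the per-column loop, with column names and sample values borrowed from the example above:

```python
import pandas as pd

df = pd.DataFrame({
    "date1": ["6/1/2016", "6/2/2016"],
    "z": [7, "alpha"],
    "date2": ["9/1/2015", "string0"],
})

# errors='coerce' is the modern replacement for coerce=True: non-dates -> NaT
for col in ["date1", "date2"]:
    df[col] = pd.to_datetime(df[col], format="%m/%d/%Y", errors="coerce")

print(df["date2"].tolist())  # [Timestamp('2015-09-01 00:00:00'), NaT]
```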
