Cleaning up pandas column of datetime strings

Cleaning up pandas column of datetime strings - python

I currently have some data in the form of datestrings that I would like to standardize into a zero-padded %H:%M:%S string. In its original form, the data deviates from the standard format in the following ways:
The time is not zero padded (e.g. '2:05:00')
There can be trailing whitespaces (e.g., ' 2:05:00')
There can be times over 24H displayed (e.g., '25:00:00')
Currently, this is what I have:
df['arrival_time'] = pd.to_datetime(df['arrival_time'].map(lambda x: x.strip()), format='%H:%M:%S').dt.strftime('%H:%M:%S')
But I get an error on the times that are over 24H. Is there a good way to transform this dataframe column into the proper format?

I believe you need:
df = pd.DataFrame({'arrival_time':['2:05:00','2:05:00','25:00:00'],})
df['arrival_time'] = df['arrival_time'].str.strip().str.zfill(8)
print (df)
arrival_time
0 02:05:00
1 02:05:00
2 25:00:00
Or:
df['arrival_time'] = pd.to_datetime(df['arrival_time'].str.strip(), errors='coerce')
.dt.strftime('%H:%M:%S')
print (df)
arrival_time
0 02:05:00
1 02:05:00
2 NaT
Or:
df['arrival_time'] = (pd.to_timedelta(df['arrival_time'].str.strip())
.astype(str)
.str.extract('\s.*\s(.*)\.', expand=False))
print (df)
arrival_time
0 02:05:00
1 02:05:00
2 01:00:00

Related

Pandas Dataframe Time Duration Expand to Minute Data

I am receiving data which consists of a 'StartTime' and a 'Duration' of time active. This is hard to work with when I need to do calculations on a specified time range over multiple days. I would like to break this data down to minutely data to make future calculations easier. Please see the example to get a better understanding.
Data which I currently have:
data = {'StartTime':['2018-12-30 12:45:00+11:00','2018-12-31 16:48:00+11:00','2019-01-01 04:36:00+11:00','2019-01-01 19:27:00+11:00','2019-01-02 05:13:00+11:00'],
'Duration':[1,1,3,1,2],
'Site':['1','2','3','4','5']
}
df = pd.DataFrame(data)
df['StartTime'] = pd.to_datetime(df['StartTime']).dt.tz_localize('utc').dt.tz_convert('Australia/Melbourne')
What I would like to have:
data_expected = {'Time':['2018-12-30 12:45:00+11:00','2018-12-31 16:48:00+11:00','2019-01-01 04:36:00+11:00','2019-01-01 04:37:00+11:00','2019-01-01 19:27:00+11:00','2019-01-02 05:13:00+11:00','2019-01-02 05:14:00+11:00'],
'Duration':[1,1,1,1,1,1,1],
'Site':['1','2','3','3','4','5','5']
}
df_expected = pd.DataFrame(data_expected)
df_expected['Time'] = pd.to_datetime(df_expected['Time']).dt.tz_localize('utc').dt.tz_convert('Australia/Melbourne')
I would like to see if anyone has a good solution for this problem. Effectively, I would need data rows with Duration >1 to be duplicated with time +1minute for each minute above 1 minute duration. Is there a way to do this without creating a whole new dataframe?
******** EDIT ********
In response to #DavidErickson 's answer. Putting this here because I can't put images in comments. I ran into a bit of trouble. df1 is a subset of the original dataframe. df2 is df1 after applying the code provided. You can see that the time that is added on to index 635 is incorrect.

I think you might want to address use case where Duration > 2 as well.
For the modified given input:
data = {'StartTime':['2018-12-30 12:45:00+11:00','2018-12-31 16:48:00+11:00','2019-01-01 04:36:00+11:00','2019-01-01 19:27:00+11:00','2019-01-02 05:13:00+11:00'],
'Duration':[1,1,3,1,2],
'Site':['1','2','3','4','5']
}
df = pd.DataFrame(data)
df['StartTime'] = pd.to_datetime(df['StartTime'])
This code should do the trick:
df['offset'] = df['Duration'].apply(lambda x: list(range(x)))
df = df.explode('offset')
df['offset'] = df['offset'].apply(lambda x: pd.Timedelta(x, unit='T'))
df['StartTime'] += df['offset']
df["Duration"] = 1
Basically, it works as follow:
create a list of integer based on Duration value;
replicate row (explode) with consecutive integer offset;
transform integer offset into timedelta offset;
perform datetime arithmetics and reset Duration field.
The result is about:
StartTime Duration Site offset
0 2018-12-30 12:45:00+11:00 1 1 00:00:00
1 2018-12-31 16:48:00+11:00 1 2 00:00:00
2 2019-01-01 04:36:00+11:00 1 3 00:00:00
2 2019-01-01 04:37:00+11:00 1 3 00:01:00
2 2019-01-01 04:38:00+11:00 1 3 00:02:00
3 2019-01-01 19:27:00+11:00 1 4 00:00:00
4 2019-01-02 05:13:00+11:00 1 5 00:00:00
4 2019-01-02 05:14:00+11:00 1 5 00:01:00

Use df.index.repeat according to the Duration column to add the relevant number of rows. Then create a mask with .groupby and cumcount that adds the appropriate number of minutes on top of the base time.
input:
data = {'StartTime':['2018-12-30 12:45:00+11:00','2018-12-31 16:48:00+11:00','2019-01-01 04:36:00+11:00','2019-01-01 19:27:00+11:00','2019-01-02 05:13:00+11:00'],
'Duration':[1,1,2,1,2],
'Site':['1','2','3','4','5']
}
df = pd.DataFrame(data)
df['StartTime'] = pd.to_datetime(df['StartTime'])
code:
df = df.loc[df.index.repeat(df['Duration'])]
mask = df.groupby('Site').cumcount()
df['StartTime'] = df['StartTime'] + pd.to_timedelta(mask, unit='m')
df = df.append(df).sort_values('StartTime').assign(Duration=1).drop_duplicates()
df
output:
StartTime Duration Site
0 2018-12-30 12:45:00+11:00 1 1
1 2018-12-31 16:48:00+11:00 1 2
2 2019-01-01 04:36:00+11:00 1 3
2 2019-01-01 04:37:00+11:00 1 3
2 2019-01-01 04:38:00+11:00 1 3
3 2019-01-01 19:27:00+11:00 1 4
4 2019-01-02 05:13:00+11:00 1 5
4 2019-01-02 05:14:00+11:00 1 5
If you are running into memory issues, then you can also try with dask. I have included #jlandercy's pandas answer and changed to dask syntax as I'm not sure if the pandas operation index.repeat would work with dask. Here is documentation on the funcitons/operations. I would research the ones in the code https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.read_sql_table:
import dask.dataframe as dd
#read as a dask dataframe from csv or SQL or other
df = dd.read_csv(files) #df = dd.read_sql_table(table, uri, index_col='StartTime')
df['offset'] = df['Duration'].apply(lambda x: list(range(x)))
df = dd.explode('offset')
df['offset'] = df['offset'].apply(lambda x: dd.Timedelta(x, unit='T'))
df['StartTime'] += df['offset']
df["Duration"] = 1

Pandas Converting date string (only month and year) to datetime

I am trying to convert a datetime object to datetime. In the original dataframe the data type is a string and the dataset has shape = (28000000, 26). Importantly, the format of the date is MMYYYY only. Here's a data sample:
DATE
Out[3] 0 081972
1 051967
2 101964
3 041975
4 071976
I tried:
df['DATE'].apply(pd.to_datetime(format='%m%Y'))
and
pd.to_datetime(df['DATE'],format='%m%Y')
I got Runtime Error both times
Then
df['DATE'].apply(pd.to_datetime)
it worked for the other not shown columns(with DDMMYYYY format), but generated future dates with df['DATE'] because it reads the dates as MMDDYY instead of MMYYYY.
DATE
0 1972-08-19
1 2067-05-19
2 2064-10-19
3 1975-04-19
4 1976-07-19
Expect output:
DATE
0 1972-08
1 1967-05
2 1964-10
3 1975-04
4 1976-07
If this question is a duplicate please direct me to the original one, I wasn't able to find any suitable answer.
Thank you all in advance for your help

First if error is raised obviously some datetimes not match, you can test it by errors='coerce' parameter and Series.isna, because for not matched values are returned missing values:
print (df)
DATE
0 81972
1 51967
2 101964
3 41975
4 171976 <-changed data
print (pd.to_datetime(df['DATE'],format='%m%Y', errors='coerce'))
0 1972-08-01
1 1967-05-01
2 1964-10-01
3 1975-04-01
4 NaT
Name: DATE, dtype: datetime64[ns]
print (df[pd.to_datetime(df['DATE'],format='%m%Y', errors='coerce').isna()])
DATE
4 171976
Solution with output from changed data with converting to datetimes and the to months periods by Series.dt.to_period:
df['DATE'] = pd.to_datetime(df['DATE'],format='%m%Y', errors='coerce').dt.to_period('m')
print (df)
DATE
0 1972-08
1 1967-05
2 1964-10
3 1975-04
4 NaT
Solution with original data:
df['DATE'] = pd.to_datetime(df['DATE'],format='%m%Y', errors='coerce').dt.to_period('m')
print (df)
0 1972-08
1 1967-05
2 1964-10
3 1975-04
4 1976-07

I would have done:
df['date_formatted'] = pd.to_datetime(
dict(
year=df['DATE'].str[2:],
month=df['DATE'].str[:2],
day=1
)
)
Maybe this helps. Works for your sample data.

Converting date formats in pandas dataframe

I have a dataframe and the Date column has two different types of date formats going on.
eg. 1983-11-10 00:00:00 and 10/11/1983
I want them all to be the same type, how can I iterate through the Date column of my dataframe and convert the dates to one format?

I believe you need parameter dayfirst=True in to_datetime:
df = pd.DataFrame({'Date': {0: '1983-11-10 00:00:00', 1: '10/11/1983'}})
print (df)
Date
0 1983-11-10 00:00:00
1 10/11/1983
df['Date'] = pd.to_datetime(df.Date, dayfirst=True)
print (df)
Date
0 1983-11-10
1 1983-11-10
because:
df['Date'] = pd.to_datetime(df.Date)
print (df)
Date
0 1983-11-10
1 1983-10-11
Or you can specify both formats and then use combine_first:
d1 = pd.to_datetime(df.Date, format='%Y-%m-%d %H:%M:%S', errors='coerce')
d2 = pd.to_datetime(df.Date, format='%d/%m/%Y', errors='coerce')
df['Date'] = d1.combine_first(d2)
print (df)
Date
0 1983-11-10
1 1983-11-10
General solution for multiple formats:
from functools import reduce
def convert_formats_to_datetimes(col, formats):
out = [pd.to_datetime(col, format=x, errors='coerce') for x in formats]
return reduce(lambda l,r: pd.Series.combine_first(l,r), out)
formats = ['%Y-%m-%d %H:%M:%S', '%d/%m/%Y']
df['Date'] = df['Date'].pipe(convert_formats_to_datetimes, formats)
print (df)
Date
0 1983-11-10
1 1983-11-10

I want them all to be the same type, how can I iterate through the
Date column of my dataframe and convert the dates to one format?
Your input data is ambiguous: is 10 / 11 10th November or 11th October? You need to specify logic to determine which is appropriate. A function is useful if you with to try multiple date formats sequentially:
def date_apply_formats(s, form_lst):
s = pd.to_datetime(s, format=form_lst[0], errors='coerce')
for form in form_lst[1:]:
s = s.fillna(pd.to_datetime(s, format=form, errors='coerce'))
return s
df['Date'] = date_apply_formats(df['Date'], ['%Y-%m-%d %H:%M:%S', '%d/%m/%Y'])
Priority is given to the first item in form_lst. The solution is extendible to an arbitrary number of provided formats.

Input date is
NSECODE Date Close
1 NSE500 20000103 1291.5500
2 NSE500 20000104 1335.4500
3 NSE500 20000105 1303.8000
history_nseindex_df["Date"] = pd.to_datetime(history_nseindex_df["Date"])
history_nseindex_df["Date"] = history_nseindex_df["Date"].dt.strftime("%Y-%m-%d")
ouput is now
NSECode Date Close
1 NSE500 2000-01-03 1291.5500
2 NSE500 2000-01-04 1335.4500
3 NSE500 2000-01-05 1303.8000

Comparing today date with date in dataframe

Comparing today date with date in dataframe
Sample Data
id date
1 1/2/2018
2 1/5/2019
3 5/3/2018
4 23/11/2018
Desired output
id date
2 1/5/2019
4 23/11/2018
My current code
dfdateList = pd.DataFrame()
dfDate= self.df[["id", "date"]]
today = datetime.datetime.now()
today = today.strftime("%d/%m/%Y").lstrip("0").replace(" 0", "")
expList = []
for dates in dfDate["date"]:
if dates <= today:
expList.append(dates)
dfdateList = pd.DataFrame(expList)
Currently my code is printing every single line despite the conditions, can anyone guide me? thanks

Pandas has native support for a large class of operations on datetimes, so one solution here would be to use pd.to_datetime to convert your dates from strings to pandas' representation of datetimes, pd.Timestamp, then just create a mask based on the current date:
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
df[df['date'] > pd.Timestamp.now()]
For example:
In [34]: df['date'] = pd.to_datetime(df['date'], dayfirst=True)
In [36]: df
Out[36]:
id date
0 1 2018-02-01
1 2 2019-05-01
2 3 2018-03-05
3 4 2018-11-23
In [37]: df[df['date'] > pd.Timestamp.now()]
Out[37]:
id date
1 2 2019-05-01
3 4 2018-11-23

Convert Dataframe column to time format in python

I have a dataframe column which looks like this :
It reads M:S.MS. How can I convert it into a M:S:MS timeformat so I can plot it as a time series graph?
If I plot it as it is, python throws an Invalid literal for float() error.
Note
: This dataframe contains one hour worth of data. Values between
0:0.0 - 59:59.9

df = pd.DataFrame({'date':['00:02.0','00:05:0','00:08.1']})
print (df)
date
0 00:02.0
1 00:05:0
2 00:08.1
It is possible convert to datetime:
df['date'] = pd.to_datetime(df['date'], format='%M:%S.%f')
print (df)
date
0 1900-01-01 00:00:02.000
1 1900-01-01 00:00:05.000
2 1900-01-01 00:00:08.100
Or to timedeltas:
df['date'] = pd.to_timedelta(df['date'].radd('00:'))
print (df)
date
0 00:00:02
1 00:00:05
2 00:00:08.100000
EDIT:
For custom date use:
date = '2015-01-04'
td = pd.to_datetime(date) - pd.to_datetime('1900-01-01')
df['date'] = pd.to_datetime(df['date'], format='%M:%S.%f') + td
print (df)
date
0 2015-01-04 00:00:02.000
1 2015-01-04 00:00:05.000
2 2015-01-04 00:00:08.100

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Cleaning up pandas column of datetime strings - python

Related

Pandas Dataframe Time Duration Expand to Minute Data

Pandas Converting date string (only month and year) to datetime

Converting date formats in pandas dataframe

Comparing today date with date in dataframe

Convert Dataframe column to time format in python

Categories

Resources