Below is a sample of my df
date value
0006-03-01 00:00:00 1
0006-03-15 00:00:00 2
0006-05-15 00:00:00 1
0006-07-01 00:00:00 3
0006-11-01 00:00:00 1
2009-05-20 00:00:00 2
2009-05-25 00:00:00 8
2020-06-24 00:00:00 1
2020-06-30 00:00:00 2
2020-07-01 00:00:00 13
2020-07-15 00:00:00 2
2020-08-01 00:00:00 4
2020-10-01 00:00:00 2
2020-11-01 00:00:00 4
2023-04-01 00:00:00 1
2218-11-12 10:00:27 1
4000-01-01 00:00:00 6
5492-04-15 00:00:00 1
5496-03-15 00:00:00 1
5589-12-01 00:00:00 1
7199-05-15 00:00:00 1
9186-12-30 00:00:00 1
As you can see, the data contains some invalid, mistyped dates.
Questions:
How can we convert this column to the format dd.mm.yyyy?
How can we replace dates whose year is greater than 2022 with 01.01.2100?
How can we remove all rows whose year is less than 2005?
The final output should look like this.
date value
20.05.2009 2
25.05.2009 8
24.06.2020 1
30.06.2020 2
01.07.2020 13
15.07.2020 2
01.08.2020 4
01.10.2020 2
01.11.2020 4
01.01.2100 1
01.01.2100 1
01.01.2100 6
01.01.2100 1
01.01.2100 1
01.01.2100 1
01.01.2100 1
01.01.2100 1
I tried to convert the column using to_datetime, but it failed:
df[col] = pd.to_datetime(df[col], infer_datetime_format=True)
Out of bounds nanosecond timestamp: 5-03-01 00:00:00
Thanks to anyone helping!
You could check the first element of your datetime strings after a split on '-' and clean up / replace based on its integer value. For the small years like '0006', calling pd.to_datetime with errors='coerce' will do the trick: it leaves NaT for the out-of-bounds dates, which you can then drop with dropna(). Example:
import pandas as pd
df = pd.DataFrame({'date': ['0006-03-01 00:00:00',
'0006-03-15 00:00:00',
'0006-05-15 00:00:00',
'0006-07-01 00:00:00',
'0006-11-01 00:00:00',
'nan',
'2009-05-25 00:00:00',
'2020-06-24 00:00:00',
'2020-06-30 00:00:00',
'2020-07-01 00:00:00',
'2020-07-15 00:00:00',
'2020-08-01 00:00:00',
'2020-10-01 00:00:00',
'2020-11-01 00:00:00',
'2023-04-01 00:00:00',
'2218-11-12 10:00:27',
'4000-01-01 00:00:00',
'NaN',
'5496-03-15 00:00:00',
'5589-12-01 00:00:00',
'7199-05-15 00:00:00',
'9186-12-30 00:00:00']})
# first, drop rows where 'date' contains 'nan' (case-insensitive):
df = df.loc[~df['date'].str.contains('nan', case=False)]
# now replace strings where the year is above a threshold:
df.loc[df['date'].str.split('-').str[0].astype(int) > 2022, 'date'] = '2100-01-01 00:00:00'
# convert to datetime, if year is too low, will result in NaT:
df['date'] = pd.to_datetime(df['date'], errors='coerce')
# df['date']
# 0 NaT
# 1 NaT
# 2 NaT
# 3 NaT
# 4 NaT
# 6 2009-05-25
# 7 2020-06-24
# ...
df = df.dropna()
# df
# date
# 6 2009-05-25
# 7 2020-06-24
# 8 2020-06-30
# 9 2020-07-01
# 10 2020-07-15
# 11 2020-08-01
# 12 2020-10-01
# 13 2020-11-01
# 14 2100-01-01
# 15 2100-01-01
# ...
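The question also asks for dd.mm.yyyy display strings; if those are wanted as the final output, a last strftime step should do it (a small sketch; note the column becomes plain strings rather than datetimes afterwards):
# render as dd.mm.yyyy strings, e.g. 2009-05-25 -> '25.05.2009'
df['date'] = df['date'].dt.strftime('%d.%m.%Y')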
The out-of-bounds error is thrown because pandas Timestamps are limited to the nanosecond-resolution range (https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html). This code removes values that would cause the error before creating the DataFrame.
import datetime as dt
import pandas as pd
data = [[dt.datetime(year=2022, month=3, day=1), 1],
[dt.datetime(year=2009, month=5, day=20), 2],
[dt.datetime(year=2001, month=5, day=20), 2],
[dt.datetime(year=2023, month=12, day=30), 3],
[dt.datetime(year=6, month=12, day=30), 3]]
dataCleaned = [elements for elements in data if pd.Timestamp.max > elements[0] > pd.Timestamp.min]
df = pd.DataFrame(dataCleaned, columns=['date', 'Value'])
print(df)
# OUTPUT
date Value
0 2022-03-01 1
1 2009-05-20 2
2 2001-05-20 2
3 2023-12-30 3
df.loc[df.date.dt.year > 2022, 'date'] = dt.datetime(year=2100, month=1, day=1)
df.drop(df.loc[df.date.dt.year < 2005, 'date'].index, inplace=True)
print(df)
# OUTPUT
date Value
0 2022-03-01 1
1 2009-05-20 2
3 2100-01-01 3
If you still want to include the dates that throw the out of bounds error, check out How to work around Python Pandas DataFrame's "Out of bounds nanosecond timestamp" error?
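As a rough sketch of that kind of workaround (assuming every string shares the '%Y-%m-%d %H:%M:%S' layout): parse each value with datetime.strptime and keep the column as plain Python datetime objects, which are not limited to the nanosecond Timestamp range:
import datetime as dt
import pandas as pd
raw = ['0006-03-01 00:00:00', '2020-07-01 00:00:00', '9186-12-30 00:00:00']
# explicit object dtype preserves datetime.datetime values outside the Timestamp range
s = pd.Series([dt.datetime.strptime(v, '%Y-%m-%d %H:%M:%S') for v in raw], dtype=object)
# year-based filtering still works via plain attribute access
print(s[s.apply(lambda d: d.year >= 2005)])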
I suggest the following:
import numpy as np
import pandas as pd
df = pd.DataFrame.from_dict({'date': ['0003-03-01 00:00:00',
'7199-05-15 00:00:00',
'2020-10-21 00:00:00'],
'value': [1, 2, 3]})
df['date'] = [d[8:10] + '.' + d[5:7] + '.' + d[:4] if '2004' < d[:4] < '2023'
              else '01.01.2100' if d[:4] > '2022' else np.nan for d in df['date']]
df.dropna(inplace = True)
Because the years are zero-padded to four digits, the string comparisons order the same way as the integers would. This yields the desired output:
date value
01.01.2100 2
21.10.2020 3
I want to find the longest duration of the constant values in a dataframe. For example, given a dataframe below, the longest duration should be 30 minutes (when value = 2).
import pandas as pd
d = {'date_time': ['2016-01-01 12:00:00', '2016-01-01 12:15:00',
'2016-01-01 12:30:00', '2016-01-01 12:45:00',
'2016-01-01 13:00:00', '2016-01-01 13:15:00',
'2016-01-01 13:30:00', '2016-01-01 13:45:00'],
'value': [1,2,2,2,4,5,5,7]}
df = pd.DataFrame(data=d)
df['date_time'] = pd.to_datetime(df['date_time'])
print(df)
date_time value
0 2016-01-01 12:00:00 1
1 2016-01-01 12:15:00 2
2 2016-01-01 12:30:00 2
3 2016-01-01 12:45:00 2
4 2016-01-01 13:00:00 4
5 2016-01-01 13:15:00 5
6 2016-01-01 13:30:00 5
7 2016-01-01 13:45:00 7
(Note: the date_time interval is not always consistent.)
I managed to find it by locating the indexes where df.value.diff().abs() == 0, then building a fairly complex function that iterates through that list and computes the ranges.
Since the actual dataframe is much larger than this example, is there a shortcut function or a faster way to get this without multiple iterations?
Thank you.
EDIT:
In my case, the same value can appear in other streaks. A more appropriate example would be
d = {'date_time': ['2016-01-01 12:00:00', '2016-01-01 12:15:00',
'2016-01-01 12:30:00', '2016-01-01 12:45:00',
'2016-01-01 13:00:00', '2016-01-01 13:15:00',
'2016-01-01 13:30:00', '2016-01-01 13:45:00',
'2016-01-01 14:00:00', '2016-01-01 14:05:00'],
'value': [1,2,2,2,4,5,5,7,5,5]}
df = pd.DataFrame(data=d)
df['date_time'] = pd.to_datetime(df['date_time'])
print(df)
date_time value
0 2016-01-01 12:00:00 1
1 2016-01-01 12:15:00 2
2 2016-01-01 12:30:00 2
3 2016-01-01 12:45:00 2
4 2016-01-01 13:00:00 4
5 2016-01-01 13:15:00 5
6 2016-01-01 13:30:00 5
7 2016-01-01 13:45:00 7
8 2016-01-01 14:00:00 5
9 2016-01-01 14:05:00 5
The longest duration, in this case, remains 30 minutes when value = 2.
groupby + nlargest
Create a grouping series that tracks changes.
groupr = df.value.ne(df.value.shift()).cumsum()
Create a mapping dictionary that can translate from the groupr key to the actual value in the df.value column.
mapper = dict(zip(groupr, df.value))
Now we group and use ptp and nlargest. Finally, we use rename with mapper to translate the index, which is the groupr key, back to the actual value it represents (phew, that's a tad confusing).
df.groupby(groupr).date_time.apply(np.ptp).nlargest(1).rename(mapper)
value
2 0 days 00:30:00
Name: date_time, dtype: timedelta64[ns]
The 2 in the index is the value with the longest duration. The 0 days 00:30:00 is the longest duration.
References
np.ptp
nlargest
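For reference, here is a self-contained version of this recipe on the edited example (a sketch; max() - min() is used in place of np.ptp, which computes exactly that, so no numpy import is needed):
import pandas as pd
df = pd.DataFrame({
    'date_time': pd.to_datetime(['2016-01-01 12:00:00', '2016-01-01 12:15:00',
                                 '2016-01-01 12:30:00', '2016-01-01 12:45:00',
                                 '2016-01-01 13:00:00', '2016-01-01 13:15:00',
                                 '2016-01-01 13:30:00', '2016-01-01 13:45:00',
                                 '2016-01-01 14:00:00', '2016-01-01 14:05:00']),
    'value': [1, 2, 2, 2, 4, 5, 5, 7, 5, 5]})
# new group id whenever the value changes, so repeated values get separate streaks
groupr = df.value.ne(df.value.shift()).cumsum()
mapper = dict(zip(groupr, df.value))
# duration of each streak, keep the longest, relabel the index with the value
print(df.groupby(groupr).date_time.apply(lambda s: s.max() - s.min())
        .nlargest(1).rename(mapper))
# 2   0 days 00:30:00
# Name: date_time, dtype: timedelta64[ns]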
You can groupby the value column and use .size() to get the size/length of each group.
>>> groups = df.groupby('value')
>>> groups.size()
value
1 1
2 3
4 1
5 2
7 1
dtype: int64
.idxmax() will give you the index of the largest group, which you can pass to .get_group()
>>> groups.get_group(groups.size().idxmax())
date_time value
1 2016-01-01 12:15:00 2
2 2016-01-01 12:30:00 2
3 2016-01-01 12:45:00 2
Then you can diff the last and first dates (assuming they are sorted - if not you can sort them)
>>> max_streak = groups.get_group(groups.size().idxmax())
>>> max_streak.iloc[-1].date_time - max_streak.iloc[0].date_time
Timedelta('0 days 00:30:00')
If value can repeat in other streaks you can groupby using:
groups = df.groupby((df.value != df.value.shift()).cumsum())
Update: Maximum duration of any streak
>>> groups = df.groupby((df.value != df.value.shift()).cumsum())
>>> last = groups.last()
>>> max_duration = (last.date_time - groups.first().date_time).nlargest(1)
>>> max_duration.iat[0]
Timedelta('0 days 00:30:00')
>>> last.loc[max_duration.index].value.iat[0]
2
You could use pd.pivot_table to get the minimum and maximum datetime value for each value, then calculate the duration between them and extract the longest.
import pandas as pd
d = {'date_time': ['2016-01-01 12:00:00', '2016-01-01 12:15:00',
'2016-01-01 12:30:00', '2016-01-01 12:45:00',
'2016-01-01 13:00:00', '2016-01-01 13:15:00',
'2016-01-01 13:30:00', '2016-01-01 13:45:00'],
'value': [1,2,2,2,4,5,5,7]}
df = pd.DataFrame(data=d)
df['date_time'] = pd.to_datetime(df['date_time'])
df_pivot = pd.pivot_table(df, index='value', values='date_time', aggfunc=['min', 'max'])
df_pivot['duration'] = df_pivot.iloc[:, 1] - df_pivot.iloc[:, 0]
print(df_pivot[df_pivot['duration'] == max(df_pivot['duration'])])
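Note that this pivots on value itself, so it assumes each value forms a single contiguous streak. For the edited example, where 5 reappears later, you could pivot on the change-id used in the other answers instead (a sketch, reusing the df built above):
# one id per streak rather than per value
streak = (df['value'] != df['value'].shift()).cumsum()
df_pivot = pd.pivot_table(df.assign(streak=streak), index='streak',
                          values='date_time', aggfunc=['min', 'max'])
df_pivot['duration'] = df_pivot.iloc[:, 1] - df_pivot.iloc[:, 0]
print(df_pivot[df_pivot['duration'] == df_pivot['duration'].max()])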
I have a Date column in my dataframe containing dates in two different formats (YYYY-DD-MM 00:00:00 and YYYY-DD-MM):
Date
0 2023-01-10 00:00:00
1 2024-27-06
2 2022-07-04 00:00:00
3 NaN
4 2020-30-06
(you can use pd.read_clipboard(sep='\s\s+') after copying the previous dataframe to get it in your notebook)
I would like to have only a YYYY-MM-DD type. Consequently, I would like to have :
Date
0 2023-10-01
1 2024-06-27
2 2022-04-07
3 NaN
4 2020-06-30
How could I do this, please?
Use Series.str.replace with to_datetime and the format parameter:
df['Date'] = pd.to_datetime(df['Date'].str.replace(' 00:00:00',''), format='%Y-%d-%m')
print (df)
Date
0 2023-10-01
1 2024-06-27
2 2022-04-07
3 NaT
4 2020-06-30
Another idea is to match both formats separately and combine the results:
d1 = pd.to_datetime(df['Date'], format='%Y-%d-%m', errors='coerce')
d2 = pd.to_datetime(df['Date'], format='%Y-%d-%m 00:00:00', errors='coerce')
df['Date'] = d1.fillna(d2)
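A quick check on the sample column (a sketch; the NaN row stays NaT because both parses coerce it):
import numpy as np
import pandas as pd
df = pd.DataFrame({'Date': ['2023-01-10 00:00:00', '2024-27-06',
                            '2022-07-04 00:00:00', np.nan, '2020-30-06']})
d1 = pd.to_datetime(df['Date'], format='%Y-%d-%m', errors='coerce')
d2 = pd.to_datetime(df['Date'], format='%Y-%d-%m 00:00:00', errors='coerce')
df['Date'] = d1.fillna(d2)
print(df)
#         Date
# 0 2023-10-01
# 1 2024-06-27
# 2 2022-04-07
# 3        NaT
# 4 2020-06-30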
df = pd.DataFrame({
'subject_id':[1,1,2,2],
'time_1':['2173/04/11 12:35:00','2173/04/12 12:50:00','2173/04/11 12:59:00','2173/04/12 13:14:00'],
'time_2':['2173/04/12 16:35:00','2173/04/13 18:50:00','2173/04/13 22:59:00','2173/04/21 17:14:00'],
'val' :[5,5,40,40],
'iid' :[12,12,12,12]
})
df['time_1'] = pd.to_datetime(df['time_1'])
df['time_2'] = pd.to_datetime(df['time_2'])
df['day'] = df['time_1'].dt.day
After running this, both time columns contain full timestamps.
I would like to replace the timestamp in time_1 column to 00:00:00 and time_2 column to 23:59:00
This is what I tried, but it doesn't work:
df.groupby(df['subject_id'])['time_1'].apply(lambda x: pd.datetime.strftime(x, "%H:%M:%S") == "00:00:00") #approach 1
df.groupby(df['subject_id'])['time_1'].apply(lambda x: pd.pd.Timestamp(hour = '00', second = '00')) #approach 2
I expect the output dataframe to have those fixed times instead.
Note that pandas does not display the time component of a column if all of its datetimes have 00:00:00 times.
Use Series.dt.floor or Series.dt.normalize to remove the times; for the second column, add a DateOffset:
df['time_1'] = pd.to_datetime(df['time_1']).dt.floor('d')
#alternative
#df['time_1'] = pd.to_datetime(df['time_1']).dt.normalize()
df['time_2']=pd.to_datetime(df['time_2']).dt.floor('d') + pd.DateOffset(hours=23, minutes=59)
df['day'] = df['time_1'].dt.day
print (df)
subject_id time_1 time_2 val iid day
0 1 2173-04-11 2173-04-12 23:59:00 5 12 11
1 1 2173-04-12 2173-04-13 23:59:00 5 12 12
2 2 2173-04-11 2173-04-13 23:59:00 40 12 11
3 2 2173-04-12 2173-04-21 23:59:00 40 12 12
My dataset has dates in the European format, and I'm struggling to convert them into the correct format before passing them through pd.to_datetime, so for all days < 12, my month and day get switched.
Is there an easy solution to this?
import pandas as pd
import datetime as dt
df = pd.read_csv(loc,dayfirst=True)
df['Date']=pd.to_datetime(df['Date'])
Is there a way to force datetime to acknowledge that the input is formatted at dd/mm/yy?
Thanks for the help!
Edit, a sample from my dates:
renewal["Date"].head()
Out[235]:
0 31/03/2018
2 30/04/2018
3 28/02/2018
4 30/04/2018
5 31/03/2018
Name: Earliest renewal date, dtype: object
After running the following:
renewal['Date']=pd.to_datetime(renewal['Date'],dayfirst=True)
I get:
Out[241]:
0 2018-03-31 #Correct
2 2018-04-01 #<-- this number is wrong and should be 01-04 instead
3 2018-02-28 #Correct
Add the format parameter:
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
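For example, with the sample dates above (a quick check):
import pandas as pd
renewal = pd.DataFrame({'Date': ['31/03/2018', '30/04/2018', '28/02/2018']})
renewal['Date'] = pd.to_datetime(renewal['Date'], format='%d/%m/%Y')
print(renewal)
#         Date
# 0 2018-03-31
# 1 2018-04-30
# 2 2018-02-28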
You can control the date construction directly if you define separate columns for 'year', 'month' and 'day', like this:
import pandas as pd
df = pd.DataFrame(
{'Date': ['01/03/2018', '06/08/2018', '31/03/2018', '30/04/2018']}
)
date_parts = df['Date'].apply(lambda d: pd.Series(int(n) for n in d.split('/')))
date_parts.columns = ['day', 'month', 'year']
df['Date'] = pd.to_datetime(date_parts)
date_parts
# day month year
# 0 1 3 2018
# 1 6 8 2018
# 2 31 3 2018
# 3 30 4 2018
df
# Date
# 0 2018-03-01
# 1 2018-08-06
# 2 2018-03-31
# 3 2018-04-30
I have a pandas data frame like
x = pd.DataFrame(['05/06/2015 00:00', '22/06/2015 00:00', None], columns=['myDate'])
I want to find out the number of days between the dates in the myDate column and the current date. How can I do this? I tried the below without much success
pd.to_datetime(x['myDate']) - pd.datetime.now().date()
The following works for me:
In [9]:
df = pd.DataFrame(['05/06/2015 00:00', '22/06/2015 00:00', None], columns=['myDate'])
df['myDate']= pd.to_datetime(df['myDate'], errors='coerce')
df
Out[9]:
myDate
0 2015-05-06
1 2015-06-22
2 NaT
In [10]:
df['diff'] = df['myDate'] - pd.Timestamp.now().normalize()
df
Out[10]:
myDate diff
0 2015-05-06 9 days
1 2015-06-22 56 days
2 NaT NaT
And your version works the same way once normalized to midnight (pd.datetime has been removed from modern pandas, so pd.Timestamp replaces it):
In [13]:
df['diff'] = pd.to_datetime(df['myDate']) - pd.Timestamp.now().normalize()
df
Out[13]:
myDate diff
0 2015-05-06 9 days
1 2015-06-22 56 days
2 NaT NaT
A more compact version:
In [15]:
df = pd.DataFrame(['05/06/2015 00:00', '22/06/2015 00:00', None], columns=['myDate'])
df['diff']= pd.to_datetime(df['myDate'], errors='coerce') - pd.Timestamp.now().normalize()
df
Out[15]:
myDate diff
0 05/06/2015 00:00 9 days
1 22/06/2015 00:00 56 days
2 None NaT
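If you need the difference as a plain number rather than a Timedelta, the .dt.days accessor works on the result (a small follow-up sketch):
# integer day counts; the NaT row becomes NaN, so the column is float
df['days'] = df['diff'].dt.days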