Change NaT to blank in pandas dataframe - python

I have a dataframe (df) that looks like:
DATES
0 NaT
1 01/08/2003
2 NaT
3 NaT
4 04/08/2003
5 NaT
6 30/06/2003
7 01/03/2004
8 18/05/2003
9 NaT
10 NaT
11 31/10/2003
12 NaT
13 NaT
I am struggling to find out how I transform the data-frame to remove the NaT values so the final output looks like
DATES
0
1 01/08/2003
2
3
4 04/08/2003
5
6 30/06/2003
7 01/03/2004
8 18/05/2003
9
10
11 31/10/2003
12
13
I have tried :
df["DATES"].fillna("", inplace = True)
but with no success.
For information, the column is in a datetime format set with
df["DATES"] = pd.to_datetime(df["DATES"],errors='coerce').dt.strftime('%d/%m/%Y')
What can I do to resolve this?

The problem is that the NaT values are strings, so you need:
df["DATES"] = df["DATES"].replace('NaT', '')

df.fillna() works on missing values such as numpy.nan. Your "NaT" values are probably strings, so you can do the following
if you want to use fillna():
df["DATES"] = df["DATES"].replace("NaT", np.nan)
df = df.fillna("")
Otherwise, you can just replace them with your desired string:
df["DATES"] = df["DATES"].replace("NaT", "")
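As a minimal sketch of the two answers above combined: the sample values below are made up, and which representation a missing date takes after strftime (a real NaN or the string "NaT") may depend on your pandas version, so this handles both.

```python
import pandas as pd

# Hypothetical sample data: two unparseable entries around one real date.
df = pd.DataFrame({"DATES": ["n/a", "01/08/2003", None]})

# Same conversion as in the question, with an explicit day-first format.
parsed = pd.to_datetime(df["DATES"], format="%d/%m/%Y", errors="coerce")
formatted = parsed.dt.strftime("%d/%m/%Y")

# Cover both representations of a missing date: real NaN and the string "NaT".
df["DATES"] = formatted.fillna("").replace("NaT", "")

print(df["DATES"].tolist())
```

Note that after this step the column holds plain strings, so it can no longer be used for date arithmetic.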

Convert the column to object and then use Series.where:
df['DATES'] = df['DATES'].astype(object).where(df['DATES'].notnull(), np.nan)
Replace np.nan with whatever value you want in place of the missing dates.

Your conversion to datetime did not handle the NaT values properly.
You can check this before calling fillna by printing df['DATES'][0]: you will see that you get 'NaT' (a string) and not NaT (the missing-value marker you want).
Instead, use (for example): df['DATES'] = df['DATES'].apply(pd.Timestamp)
This example worked for me as is, but note that the result is a pd.Timestamp rather than a datetime (it's another time type, but an easy one to use). You do not need to specify your time format with this; pd.Timestamp understands your current format, although ambiguous strings such as 01/08/2003 are read month-first.
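The check described above can be sketched for a whole column at once. The sample column is hypothetical and mixes the string 'NaT' with a real pd.NaT so that both cases show up.

```python
import pandas as pd

# Hypothetical column: a "NaT" string, a date string, and a genuine missing value.
col = pd.Series(["NaT", "01/08/2003", pd.NaT], dtype=object)

# True where the entry is the literal text "NaT" (fillna will not touch these).
is_nat_string = col.map(lambda v: isinstance(v, str) and v == "NaT")

# True where the entry is a real missing value (fillna will replace these).
is_missing = col.isna()

print(is_nat_string.tolist())
print(is_missing.tolist())
```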

Related

Filter time in a dataframe using python

I have a column that contains only times. After reading the CSV file, I converted that column to a datetime datatype, as it was object when I read it in a Jupyter notebook. When I try to filter, I get the error below:
TypeError: Index must be DatetimeIndex
code
newdata = newdata['APPOINTMENT_TIME'].between_time('14:30:00', '20:00:00')
sample_data
APPOINTMENT_TIME Id
13:30:00 1
15:10:00 2
18:50:00 3
14:10:00 4
14:00:00 5
Here I am trying to display the rows whose APPOINTMENT_TIME is between 14:30:00 and 20:00:00.
datatype info
Could anyone help? Thanks in advance.
between_time is a special method that works only when the index is a DatetimeIndex, which is not your case. It would be useful if you had data like 2021-12-21 13:30:00 in the index.
In your case, you can just use the between method on strings, relying on the fact that times in your HH:MM:SS format sort naturally:
filtered_data = newdata[newdata['APPOINTMENT_TIME'].between('14:30:00', '20:00:00')]
Output:
APPOINTMENT_TIME Id
1 15:10:00 2
2 18:50:00 3
NB. With this approach you can't use a range that starts before midnight and ends after it.
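If you would rather use between_time itself, here is a rough sketch that first builds the DatetimeIndex it requires. The frame is rebuilt from the sample data in the question; the intermediate names times and indexed are only for illustration.

```python
import pandas as pd

newdata = pd.DataFrame({
    "APPOINTMENT_TIME": ["13:30:00", "15:10:00", "18:50:00", "14:10:00", "14:00:00"],
    "Id": [1, 2, 3, 4, 5],
})

# Parse the times; pandas attaches a placeholder date, which between_time ignores.
times = pd.to_datetime(newdata["APPOINTMENT_TIME"], format="%H:%M:%S")

# between_time needs a DatetimeIndex, so index the frame by the parsed times.
indexed = newdata.set_index(pd.DatetimeIndex(times))
filtered = indexed.between_time("14:30:00", "20:00:00").reset_index(drop=True)

print(filtered)
```

According to the pandas documentation, between_time also accepts a start time later than the end time, which selects the times outside the interval; that may help with ranges that cross midnight.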

Comparison between datetime64[ns] and date

I have a DataFrame with values that look like this:
Date Value
1 2020-04-12 A
2 2020-05-12 B
3 2020-07-12 C
4 2020-10-12 D
5 2020-11-12 E
and I need to create a new DataFrame with only the dates from today (7.12) onward (in this example only rows 3, 4 and 5).
I use this code:
df1= df[df["Date"] >= date.today()]
but it gives me TypeError: Invalid comparison between dtype=datetime64[ns] and date
What am I doing wrong? Thank you!
Use the .dt.date accessor on the df['Date'] column. Then you are comparing dates with dates. So:
df1 = df.loc[df['Date'].dt.date >= date.today()]
This will give you:
Date Value
3 2020-12-07 C
4 2020-12-10 D
5 2020-12-11 E
Also make sure that your date format is actually correct, for example by printing df['Date'].dt.month and checking that it gives 12 for every row. If not, your date strings were not converted correctly. In that case, use df['Date'] = pd.to_datetime(df['Date'], format="%Y-%d-%m") to convert the Date column to the correct datetime format after creating the DataFrame.
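An alternative sketch that keeps the column as datetime64 and compares it with a pd.Timestamp instead of a date. The frame is rebuilt from the question's data, and today is pinned to a fixed value here (an assumption, so that the result is reproducible); in real code it would be pd.Timestamp.today().normalize().

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2020-04-12", "2020-05-12", "2020-07-12", "2020-10-12", "2020-11-12"],
    "Value": list("ABCDE"),
})

# The strings are year-day-month, so say so explicitly when parsing.
df["Date"] = pd.to_datetime(df["Date"], format="%Y-%d-%m")

# Fixed stand-in for pd.Timestamp.today().normalize().
today = pd.Timestamp("2020-12-07")

# datetime64 compared with a Timestamp: no TypeError.
df1 = df[df["Date"] >= today]
print(df1)
```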
Could you please try the following? It assumes that your dates are in YYYY-DD-MM format; if they are in another format, change the format string in strftime accordingly.
import pandas as pd
# pd.datetime is no longer available in recent pandas versions; use pd.Timestamp instead
today = pd.Timestamp.today().strftime("%Y-%d-%m")
df.loc[df['Date'] >= today]
Sample run of solution above: Let's say we have following test DataFrame.
Date Value
1 2020-04-12 A
2 2020-05-12 B
3 2020-07-12 C
4 2020-11-12 D
5 2020-12-12 E
Now when we run the solution above we will get following output:
Date Value
3 2020-07-12 C
4 2020-11-12 D
5 2020-12-12 E

ValueError: time data 'May 11, 2015' does not match format '%m%y%d' (match) [duplicate]

I'm getting a value error saying my data does not match the format when it does. Not sure if this is a bug or I'm missing something here. I'm referring to this documentation for the string format. The weird part is if I write the 'data' Dataframe to a csv and read it in then call the function below it will convert the date so I'm not sure why it doesn't work without writing to a csv.
Any ideas?
data['Date'] = pd.to_datetime(data['Date'], format='%d-%b-%Y')
I'm getting two errors
TypeError: Unrecognized value type: <class 'str'>
ValueError: time data '27‑Aug‑2018' does not match format '%d-%b-%Y' (match)
Example dates -
2‑Jul‑2018
27‑Aug‑2018
28‑May‑2018
19‑Jun‑2017
5‑Mar‑2018
15‑Jan‑2018
11‑Nov‑2013
23‑Nov‑2015
23‑Jun‑2014
18‑Jun‑2018
30‑Apr‑2018
14‑May‑2018
16‑Apr‑2018
26‑Feb‑2018
19‑Mar‑2018
29‑Jun‑2015
Is it because they all aren't double digit days? What is the string format value for single digit days? Looks like this could be the cause but I'm not sure why it would error on the '27' though.
End solution (the dates contained a Unicode hyphen rather than the ASCII one) -
import unidecode
data['Date'] = data['Date'].apply(unidecode.unidecode)
data['Date'] = data['Date'].apply(lambda x: x.replace("-", "/"))
data['Date'] = pd.to_datetime(data['Date'], format="%d/%b/%Y")
There seems to be an issue with your date strings. I replicated your issue with your sample data, and if I remove the hyphens and retype them manually (for the first three dates), the code works:
pd.to_datetime(df1['Date'] ,errors ='coerce')
output:
0 2018-07-02
1 2018-08-27
2 2018-05-28
3 NaT
4 NaT
5 NaT
6 NaT
7 NaT
8 NaT
9 NaT
10 NaT
11 NaT
12 NaT
13 NaT
14 NaT
15 NaT
Bottom line: your hyphens look like regular ones but are actually a different character. Clean your source data and you're good to go.
There is a special character here; it is not '-':
df.iloc[0,0][2]
Out[287]: '‑'
Replace it with '-'
pd.to_datetime(df.iloc[:,0].str.replace('‑','-'),format='%d-%b-%Y')
Out[288]:
0 2018-08-27
1 2018-05-28
2 2017-06-19
3 2018-03-05
4 2018-01-15
5 2013-11-11
6 2015-11-23
7 2014-06-23
8 2018-06-18
9 2018-04-30
10 2018-05-14
11 2018-04-16
12 2018-02-26
13 2018-03-19
14 2015-06-29
Name: 2‑Jul‑2018, dtype: datetime64[ns]
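To find out which character you are actually dealing with, something like the sketch below may help. It assumes the look-alike is U+2011 (NON-BREAKING HYPHEN) and uses two made-up rows written with explicit escapes; the names raw, suspects and cleaned are only for illustration.

```python
import unicodedata
import pandas as pd

# Assumption: the look-alike character is U+2011 (NON-BREAKING HYPHEN).
raw = pd.Series(["2\u2011Jul\u20112018", "27\u2011Aug\u20112018"])

# Collect every non-ASCII character together with its Unicode name.
suspects = {
    ch: unicodedata.name(ch, "UNKNOWN")
    for text in raw
    for ch in text
    if ord(ch) > 127
}
print(suspects)

# Swap the look-alike for the ASCII hyphen, then parse with the intended format.
cleaned = raw.str.replace("\u2011", "-", regex=False)
dates = pd.to_datetime(cleaned, format="%d-%b-%Y")
print(dates)
```

If your data turns out to contain a different dash, replace the escape in the str.replace call with whatever the dictionary reports.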

parse multiple date format pandas

I've got stuck with the following format:
0 2001-12-25
1 2002-9-27
2 2001-2-24
3 2001-5-3
4 200510
5 20078
What I need is the date in a format %Y-%m
What I tried was
def parse(date):
    if len(date)<=5:
        return "{}-{}".format(date[:4], date[4:5], date[5:])
    else:
        pass
df['Date']= parse(df['Date'])
However, I only succeeded in parsing 20078 to 2007-8; values like 2001-12-25 came out as None.
So, how can I do it? Thank you!
We can use pd.to_datetime with errors='coerce' to parse the dates in steps.
Assuming your column is called date:
s = pd.to_datetime(df['date'],errors='coerce',format='%Y-%m-%d')
s = s.fillna(pd.to_datetime(df['date'],format='%Y%m',errors='coerce'))
df['date_fixed'] = s
print(df)
date date_fixed
0 2001-12-25 2001-12-25
1 2002-9-27 2002-09-27
2 2001-2-24 2001-02-24
3 2001-5-3 2001-05-03
4 200510 2005-10-01
5 20078 2007-08-01
In steps:
First we cast the regular datetimes to a new series called s
s = pd.to_datetime(df['date'],errors='coerce',format='%Y-%m-%d')
print(s)
0 2001-12-25
1 2002-09-27
2 2001-02-24
3 2001-05-03
4 NaT
5 NaT
Name: date, dtype: datetime64[ns]
As you can see, we have two NaT values (null datetimes) in our series; these correspond to your dates that are missing a day.
We then reapply the same to_datetime call with the other format, and use its results to fill the missing values of s
s = s.fillna(pd.to_datetime(df['date'],format='%Y%m',errors='coerce'))
print(s)
0 2001-12-25
1 2002-09-27
2 2001-02-24
3 2001-05-03
4 2005-10-01
5 2007-08-01
then we re-assign to your dataframe.
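The steps above can be wrapped in a small helper. This is only a sketch: parse_with_formats is a made-up name, not a pandas function, and the last line adds the %Y-%m string the question asked for.

```python
import pandas as pd

def parse_with_formats(values, formats):
    """Try each format in turn; later formats only fill entries still missing."""
    result = pd.to_datetime(values, format=formats[0], errors="coerce")
    for fmt in formats[1:]:
        result = result.fillna(pd.to_datetime(values, format=fmt, errors="coerce"))
    return result

df = pd.DataFrame(
    {"date": ["2001-12-25", "2002-9-27", "2001-2-24", "2001-5-3", "200510", "20078"]}
)

df["date_fixed"] = parse_with_formats(df["date"], ["%Y-%m-%d", "%Y%m"])

# Render the parsed dates as year-month strings.
df["year_month"] = df["date_fixed"].dt.strftime("%Y-%m")
print(df)
```

The order of the formats matters: each entry keeps the result of the first format that parses it.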
You could use a regex to pull out the year and month, and convert to datetime:
df = pd.read_clipboard(sep=r"\s{2,}", header=None, names=["Dates"])
pattern = r"(?P<Year>\d{4})[-]*(?P<Month>\d{1,2})"
df['Dates'] = pd.to_datetime([f"{year}-{month}" for year, month in df.Dates.str.extract(pattern).to_numpy()])
print(df)
Dates
0 2001-12-01
1 2002-09-01
2 2001-02-01
3 2001-05-01
4 2005-10-01
5 2007-08-01
Note that pandas automatically sets the day to 1, since only the year and month were supplied.

Error converting data type float to datetime format

I would like to convert the data type float below to datetime format:
df
Date
0 NaN
1 NaN
2 201708.0
4 201709.0
5 201700.0
6 201600.0
Name: Cred_Act_LstPostDt_U324123, dtype: float64
pd.to_datetime(df['Date'],format='%Y%m.0')
ValueError: time data 201700.0 does not match format '%Y%m.0' (match)
How could I transform the rows without month information, using yyyy01 as the default?
You can use the string replace method to clean up your month data:
s = [x.replace('00.0', '01.0') for x in df['Date'].astype(str)]
df['Date'] = pd.to_datetime(s, format='%Y%m.0', errors='coerce')
print(df)
Date
0 NaT
1 NaT
2 2017-08-01
4 2017-09-01
5 2017-01-01
6 2016-01-01
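A variation on the same idea that avoids string matching on '00.0' and splits the number arithmetically instead. This is a sketch under the assumption that a month of 00 should be read as January; the values are taken from the question, with a default index.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan, 201708.0, 201709.0, 201700.0, 201600.0])

# Work only on the rows that have a value, as integers like 201708.
valid = s.dropna().astype("int64")

year = valid // 100
month = (valid % 100).replace(0, 1)  # assumption: month 00 means January

# Build "YYYY-MM" strings and parse them; reindex restores the NaN rows as NaT.
text = year.astype(str) + "-" + month.astype(str).str.zfill(2)
dates = pd.to_datetime(text, format="%Y-%m").reindex(s.index)

print(dates)
```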
Convert the floats to strings using .astype(str), then split each string after the fourth character and use str.cat to insert a hyphen. Then you can parse with format='%Y-%m'.
However, this still cannot parse invalid month numbers such as month 00; with errors='coerce' those rows (and the NaN rows) become NaT.
string = df['Date'].astype(str)
date = string.str[:4].str.cat(string.str[4:6], sep='-')
pd.to_datetime(date, format='%Y-%m', errors='coerce')
