Year as Float to Datetime64 - python

I have a series of years in the following Series:
>>> temp
0    1994
1    1995
2    1996
3   -9999
4    1997
5    2001
dtype: float64
I have tried a number of different approaches to convert these values to dates. The following is the only one I've found that converts these floats to valid datetime values.
>>> temp.replace(-9999, np.nan).dropna().astype(int).astype(str).apply(np.datetime64)
0   1994-01-01
1   1995-01-01
2   1996-01-01
4   1997-01-01
5   2001-01-01
dtype: datetime64[ns]
Is there a more effective way to go about this? I doubt that converting everything to an integer and then a string is actually necessary or appropriate in this circumstance.

You can try to_datetime:
print(temp)
0    1994
1    1995
2    1996
3   -9999
4    1997
5    2001
dtype: int64
print(pd.to_datetime(temp, format='%Y', errors='coerce'))
0   1994-01-01
1   1995-01-01
2   1996-01-01
3          NaT
4   1997-01-01
5   2001-01-01
dtype: datetime64[ns]
And if you need to remove the NaT values, add dropna:
print(pd.to_datetime(temp, format='%Y', errors='coerce').dropna())
0   1994-01-01
1   1995-01-01
2   1996-01-01
4   1997-01-01
5   2001-01-01
dtype: datetime64[ns]
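If the Series really is float64 as in the question, a minimal sketch (assuming the same data) casts to int first, so that '%Y' sees '1994' rather than '1994.0'; the -9999 sentinel then fails to parse and is coerced to NaT:
import pandas as pd

temp = pd.Series([1994, 1995, 1996, -9999, 1997, 2001], dtype='float64')

# cast to int, then to str, so the '%Y' format matches;
# '-9999' does not parse as a year and becomes NaT
print(pd.to_datetime(temp.astype(int).astype(str), format='%Y', errors='coerce'))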

Using standard datetime library:
datetime.datetime.strptime(str(int(temp[1])), '%Y')
but this needs to iterate over the Series and handle errors, since it will raise on -9999.
Something like this will do the trick:
import datetime

for i in temp.index:
    try:
        temp[i] = datetime.datetime.strptime(str(int(temp[i])), '%Y')
    except ValueError:
        temp[i] = None
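The same idea reads more cleanly as a function passed to apply (a sketch, assuming temp holds the float years from the question):
import datetime

def parse_year(value):
    # '-9999' raises ValueError in strptime, so it maps to None
    try:
        return datetime.datetime.strptime(str(int(value)), '%Y')
    except ValueError:
        return None

temp.apply(parse_year)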

Related

Create custom sized bins of datetime Series in Pandas

I have multiple Pandas Series of datetime64 values that I want to bin into groups using arbitrary bin sizes.
I've found the Series.to_period() function, which does almost exactly what I want, except that I need more control over the bin size. to_period lets me bin by full years, months, days, etc., but I also want to bin by 5 years, 6 hours or 15 minutes. A syntax like 5Y, 6H or 15min works in other corners of Pandas but apparently not here.
s = pd.Series(["2020-02-01", "2020-02-02", "2020-02-03", "2020-02-04"], dtype="datetime64[ns]")
# Output as expected
s.dt.to_period("M").value_counts()
2020-02 4
Freq: M, dtype: int64
# Output as expected
s.dt.to_period("W").value_counts()
2020-01-27/2020-02-02 2
2020-02-03/2020-02-09 2
Freq: W-SUN, dtype: int64
# Output as expected
s.dt.to_period("D").value_counts()
2020-02-01 1
2020-02-02 1
2020-02-03 1
2020-02-04 1
Freq: D, dtype: int64
# Output unexpected (and wrong?)
s.dt.to_period("2D").value_counts()
2020-02-01 1
2020-02-02 1
2020-02-03 1
2020-02-04 1
Freq: 2D, dtype: int64
I believe that pd.Grouper is what you're looking for.
https://pandas.pydata.org/docs/reference/api/pandas.Grouper.html
It provides the flexibility of using frequency multiples in addition to the standard offset aliases: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases
From the documentation:
>>> start, end = '2000-10-01 23:30:00', '2000-10-02 00:30:00'
>>> rng = pd.date_range(start, end, freq='7min')
>>> ts = pd.Series(np.arange(len(rng)) * 3, index=rng)
>>> ts
2000-10-01 23:30:00 0
2000-10-01 23:37:00 3
2000-10-01 23:44:00 6
2000-10-01 23:51:00 9
2000-10-01 23:58:00 12
2000-10-02 00:05:00 15
2000-10-02 00:12:00 18
2000-10-02 00:19:00 21
2000-10-02 00:26:00 24
Freq: 7T, dtype: int64
>>> ts.groupby(pd.Grouper(freq='17min')).sum()
2000-10-01 23:14:00 0
2000-10-01 23:31:00 9
2000-10-01 23:48:00 21
2000-10-02 00:05:00 54
2000-10-02 00:22:00 24
Freq: 17T, dtype: int64
NOTE:
If you'd like to .groupby a certain column then use the following syntax:
df.groupby(pd.Grouper(key="my_col", freq="3M"))
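Applied to the Series s from the question, a minimal sketch (lifting the datetimes into a column so that Grouper has a key to bin on) looks like:
import pandas as pd

s = pd.Series(["2020-02-01", "2020-02-02", "2020-02-03", "2020-02-04"],
              dtype="datetime64[ns]")

# count rows per 2-day bin
s.to_frame("date").groupby(pd.Grouper(key="date", freq="2D")).size()
# 2020-02-01    2
# 2020-02-03    2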

Pandas: Add and Preserve time component when input file has only date in dataframe

Scenario:
The input file I read with Pandas has a column with sparsely populated dates in String/Object format.
I need to add a time component. For example:
2021-08-27 is my input in String format, and 2021-08-27 00:00:00 should be the output in datetime64[ns] format.
My Trials:
df = pd.read_parquet("sample.parquet")
df.head()
   a  b  c           dttime_col
0  1  1  2  2021-07-12 00:00:00
1  0  1  0                  NaN
2  1  2  0                  NaN
3  2  1  1  2021-02-04 00:00:00
4  3  5  2                  NaN
df["dttime_col"] = pd.to_datetime(df["dttime_col"])
df["dttime_col"]
Out[16]:
0 2021-07-12
1 NaT
2 NaT
3 2021-02-04
4 NaT
5 2021-05-22
6 NaT
7 2021-10-06
8 2021-01-31
9 NaT
Name: dttime_col, dtype: datetime64[ns]
But as you can see, there is no time component. I tried passing format='%Y-%m-%d %H:%M:%S', but the output is still the same. Furthermore, I tried adding a default time component as a string:
df["dttime_col"] = df["dttime_col"].dt.strftime("%Y-%m-%d 00:00:00").replace('NaT', np.nan)
Out[17]:
0 2021-07-12 00:00:00
1 NaN
2 NaN
3 2021-02-04 00:00:00
4 NaN
5 2021-05-22 00:00:00
6 NaN
7 2021-10-06 00:00:00
8 2021-01-31 00:00:00
9 NaN
Name: dttime_col, dtype: object
Now this gives me the time next to the date, but as String/Object dtype. The moment I convert it back to datetime, the HH:MM:SS part disappears again.
df["dttime_col"].apply(lambda x: datetime.datetime.strptime(x, "%Y-%m-%d %H:%M:%S") if not isinstance(x, float) else np.nan)
Out[24]:
0 2021-07-12
1 NaT
2 NaT
3 2021-02-04
4 NaT
5 2021-05-22
6 NaT
7 2021-10-06
8 2021-01-31
9 NaT
Name: dttime_col, dtype: datetime64[ns]
It feels like going in circles.
Output I expect:
0 2021-07-12 00:00:00
1 NaN
2 NaN
3 2021-02-04 00:00:00
4 NaN
5 2021-05-22 00:00:00
6 NaN
7 2021-10-06 00:00:00
8 2021-01-31 00:00:00
9 NaN
Name: dttime_col, dtype: datetime64[ns]
EDIT 1:
Providing the output asked for by @mozway:
df["dttime_col"].dt.second
Out[27]:
0 0.0
1 NaN
2 NaN
3 0.0
4 NaN
5 0.0
6 NaN
7 0.0
8 0.0
9 NaN
Name: dttime_col, dtype: float64
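Note that the dt.second output above shows the seconds are in fact stored (as 0): datetime64[ns] always holds a full timestamp, and pandas simply omits "00:00:00" from the Series repr when every non-null value is midnight. A quick check, reusing df from above:
df["dttime_col"] = pd.to_datetime(df["dttime_col"])
print(df["dttime_col"].iloc[0])
# 2021-07-12 00:00:00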

How to handle missing value datetime64[ns] dtype column for Python Pandas DataFrame?

Suppose I have data like this, with a missing value in the 'Date4' column, which is of datetime64[ns] dtype.
How should missing values be handled in this kind of situation?
What if I want to fill them with the most frequent date? How can that be done for dates?
I have searched for a solution on several websites but couldn't find a proper answer.
No Name Date1 Date2 Date3 Date4
0 1 Per1 2015-05-25 2016-03-20 2016-03-22 2017-01-01
1 2 Per2 2015-06-26 2016-05-22 2016-06-22 2017-02-02
2 3 Per3 2015-09-28 2016-07-24 2016-07-26 2017-05-22
3 4 Per4 2015-11-21 2016-09-02 2016-05-09 2017-05-22
4 5 Per5 2015-12-25 2016-11-11 2016-11-14 NaT
In [135]: df
Out[135]:
Date4
0 2017-01-01
1 2017-02-02
2 2017-05-22
3 2017-05-22
4 NaT
In [136]: df["Date4"].replace(np.nan, df["Date4"].mode().iloc[0])
Out[136]:
0 2017-01-01
1 2017-02-02
2 2017-05-22
3 2017-05-22
4 2017-05-22
Name: Date4, dtype: datetime64[ns]
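fillna is the more idiomatic spelling of the same operation:
df["Date4"] = df["Date4"].fillna(df["Date4"].mode().iloc[0])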
What you just described is called imputation. Sklearn's SimpleImputer() does the job well, and you can specify how the missing values should be filled:
from sklearn.impute import SimpleImputer

imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
df = pd.DataFrame(imp.fit_transform(df), columns=df.columns)
Note that fit_transform returns a NumPy array, so the column names have to be restored (and dtypes re-applied with astype if required).

Transform duration data with multiple formats to a common '%H:%M:%S' format; the %M part (minutes) is inconsistent

I have Duration data stored as an object with multiple formats, particularly in the minutes part between the colons. Any idea how I can transform this data? I tried everything imaginable with regex (except for the correct answer :) ), which is the part I was mainly struggling with. For example, below is my attempt to zero-pad the minutes column.
df['temp'] = df['temp'].replace(':?:', ':0?:', regex=True)
Input:
Duration
0 00:0:00
1 00:00:00
2 00:8:00
3 00:08:00
4 00:588:00
5 09:14:00
Expected Output Option #1 (Time format):
Duration
0 00:00:00
1 00:00:00
2 00:08:00
3 00:08:00
4 09:48:00
5 09:14:00
My end goal is to get the minutes, so another acceptable format would be:
Expected Output Option #2 (Minutes - integer or float):
Minutes
0 0
1 0
2 8
3 8
4 588
5 554
We can just do pd.to_timedelta:
pd.to_timedelta(df.Duration)
Output:
0 00:00:00
1 00:00:00
2 00:08:00
3 00:08:00
4 09:48:00
5 09:14:00
Name: Duration, dtype: timedelta64[ns]
Or Option 2 - Minutes:
pd.to_timedelta(df.Duration).dt.total_seconds()/60
Output:
0 0.0
1 0.0
2 8.0
3 8.0
4 588.0
5 554.0
Name: Duration, dtype: float64
Or split on ':' and take a weighted sum (hours*60 + minutes + seconds/60) to get minutes:
df.Duration.str.split(':', expand=True).astype(int).mul([60, 1, 1/60]).sum(1)
0 0.0
1 0.0
2 8.0
3 8.0
4 588.0
5 554.0
dtype: float64
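If the zero-padded string format itself is needed (Output Option #1), one sketch rebuilds it from the total seconds of the parsed timedeltas:
# 588 minutes -> 35280 s -> '09:48:00'
secs = pd.to_timedelta(df.Duration).dt.total_seconds().astype(int)
df['Duration'] = secs.apply(lambda s: f'{s // 3600:02d}:{s % 3600 // 60:02d}:{s % 60:02d}')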

Match datetime YYYY-MM-DD object in pandas dataframe

I have a pandas DataFrame of the form:
id amount birth
0 4 78.0 1980-02-02 00:00:00
1 5 24.0 1989-03-03 00:00:00
2 6 49.5 2014-01-01 00:00:00
3 7 34.0 2014-01-01 00:00:00
4 8 49.5 2014-01-01 00:00:00
I am only interested in the year, month and day in the birth column. I tried converting the column to datetime with pandas, but it resulted in an error:
OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 1054-02-07 00:00:00
The birth column is an object dtype.
My guess is that there is an invalid date. I would rather not pass errors="coerce" to the to_datetime method, because every item is important and I only need the YYYY-MM-DD part.
I also tried a regex with pandas' string methods:
df["birth"].str.find("(\d{4})-(\d{2})-(\d{2})")
But this returns NaNs. How can I resolve this?
Thanks
Because the values cannot all be converted to datetimes, you can split on whitespace and select the first part:
df['birth'] = df['birth'].str.split().str[0]
And then, if necessary, convert to periods, which can represent out-of-bounds spans.
print (df)
id amount birth
0 4 78.0 1980-02-02 00:00:00
1 5 24.0 1989-03-03 00:00:00
2 6 49.5 2014-01-01 00:00:00
3 7 34.0 2014-01-01 00:00:00
4 8 49.5 0-01-01 00:00:00
def to_per(x):
    splitted = x.split('-')
    return pd.Period(year=int(splitted[0]),
                     month=int(splitted[1]),
                     day=int(splitted[2]), freq='D')
df['birth'] = df['birth'].str.split().str[0].apply(to_per)
print (df)
id amount birth
0 4 78.0 1980-02-02
1 5 24.0 1989-03-03
2 6 49.5 2014-01-01
3 7 34.0 2014-01-01
4 8 49.5 0000-01-01
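Once stored as Period objects, the column supports comparisons even for dates outside the Timestamp range, e.g. filtering with a hypothetical cutoff:
# rows whose birth date precedes the nanosecond-Timestamp range
print(df[df['birth'] < pd.Period('1677-09-21', freq='D')])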
