Create custom sized bins of datetime Series in Pandas - python

I have multiple Pandas Series of datetime64 values that I want to bin into groups using arbitrary bin sizes.
I've found the Series.to_period() function which does exactly what I want except that I need more control over the chosen bin size. to_period allows me to bin by full years, months, days, etc. but I also want to bin by 5 years, 6 hours or 15 minutes. Using a syntax like 5Y, 6H or 15min works in other corners of Pandas but apparently not here.
s = pd.Series(["2020-02-01", "2020-02-02", "2020-02-03", "2020-02-04"], dtype="datetime64[ns]")
# Output as expected
s.dt.to_period("M").value_counts()
2020-02 4
Freq: M, dtype: int64
# Output as expected
s.dt.to_period("W").value_counts()
2020-01-27/2020-02-02 2
2020-02-03/2020-02-09 2
Freq: W-SUN, dtype: int64
# Output as expected
s.dt.to_period("D").value_counts()
2020-02-01 1
2020-02-02 1
2020-02-03 1
2020-02-04 1
Freq: D, dtype: int64
# Output unexpected (and wrong?)
s.dt.to_period("2D").value_counts()
2020-02-01 1
2020-02-02 1
2020-02-03 1
2020-02-04 1
Freq: 2D, dtype: int64

I believe that pd.Grouper is what you're looking for.
https://pandas.pydata.org/docs/reference/api/pandas.Grouper.html
It accepts multiples of the standard offset aliases (for example 17min or 3M), not just the standard frequencies themselves: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases
From the documentation:
>>> start, end = '2000-10-01 23:30:00', '2000-10-02 00:30:00'
>>> rng = pd.date_range(start, end, freq='7min')
>>> ts = pd.Series(np.arange(len(rng)) * 3, index=rng)
>>> ts
2000-10-01 23:30:00 0
2000-10-01 23:37:00 3
2000-10-01 23:44:00 6
2000-10-01 23:51:00 9
2000-10-01 23:58:00 12
2000-10-02 00:05:00 15
2000-10-02 00:12:00 18
2000-10-02 00:19:00 21
2000-10-02 00:26:00 24
Freq: 7T, dtype: int64
>>> ts.groupby(pd.Grouper(freq='17min')).sum()
2000-10-01 23:14:00 0
2000-10-01 23:31:00 9
2000-10-01 23:48:00 21
2000-10-02 00:05:00 54
2000-10-02 00:22:00 24
Freq: 17T, dtype: int64
NOTE:
If you'd like to .groupby a certain column then use the following syntax:
df.groupby(pd.Grouper(key="my_col", freq="3M"))
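For the Series from the question, where the timestamps are the values rather than the index, here is a minimal sketch of two ways to get fixed-width bins. The sample timestamps and the column name "ts" are made up for illustration. Note that dt.floor only accepts fixed frequencies (minutes, hours, days), so it would not cover something like 5 years.

```python
import pandas as pd

# Hypothetical sample data: four timestamps within one hour.
s = pd.Series(pd.to_datetime([
    "2020-02-01 00:05",
    "2020-02-01 00:10",
    "2020-02-01 00:20",
    "2020-02-01 00:50",
]))

# Option 1: floor each timestamp to the start of its 15-minute bin,
# then count how many values fall into each bin.
counts = s.dt.floor("15min").value_counts().sort_index()

# Option 2: move the values into a column and group on it with pd.Grouper.
# "ts" is an illustrative column name, not something from the question.
grouped = s.to_frame("ts").groupby(pd.Grouper(key="ts", freq="15min")).size()

print(counts)
print(grouped)
```

The first option only lists bins that contain data, while the Grouper variant works on a regular grid between the first and last bin.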


Pandas: Add and Preserve time component when input file has only date in dataframe

Scenario:
The input file that I read into Pandas has a column with sparsely populated dates in String/Object format.
I need to add a time component. For example,
2021-08-27 is my input in String format, and 2021-08-27 00:00:00 should be the output in datetime64[ns] format.
My Trials:
df = pd.read_parquet("sample.parquet")
df.head()
a b c dttime_col
1 1 2 2021-07-12 00:00:00
0 1 0 NaN
1 2 0 NaN
2 1 1 2021-02-04 00:00:00
3 5 2 NaN
df["dttime_col"] = pd.to_datetime(df["dttime_col"])
df["dttime_col"]
Out[16]:
0 2021-07-12
1 NaT
2 NaT
3 2021-02-04
4 NaT
5 2021-05-22
6 NaT
7 2021-10-06
8 2021-01-31
9 NaT
Name: dttime_col, dtype: datetime64[ns]
But as you can see here, there is no time component. I tried adding the format %Y-%m-%d %H:%M:%S, but the output is still the same. Furthermore, I tried adding a default time component as a String.
df["dttime_col"] = df["dttime_col"].dt.strftime("%Y-%m-%d 00:00:00").replace('NaT', np.nan)
Out[17]:
0 2021-07-12 00:00:00
1 NaN
2 NaN
3 2021-02-04 00:00:00
4 NaN
5 2021-05-22 00:00:00
6 NaN
7 2021-10-06 00:00:00
8 2021-01-31 00:00:00
9 NaN
Name: dttime_col, dtype: object
Now this gives me time next to date, but in String/Object format. The moment I convert it back to datetime format, all the HH:MM:SS are removed.
df["dttime_col"].apply(lambda x: datetime.datetime.strptime(x, "%Y-%m-%d %H:%M:%S") if not isinstance(x, float) else np.nan)
Out[24]:
0 2021-07-12
1 NaT
2 NaT
3 2021-02-04
4 NaT
5 2021-05-22
6 NaT
7 2021-10-06
8 2021-01-31
9 NaT
Name: dttime_col, dtype: datetime64[ns]
It feels like I am going in circles.
Output I expect:
0 2021-07-12 00:00:00
1 NaN
2 NaN
3 2021-02-04 00:00:00
4 NaN
5 2021-05-22 00:00:00
6 NaN
7 2021-10-06 00:00:00
8 2021-01-31 00:00:00
9 NaN
Name: dttime_col, dtype: datetime64[ns]
EDIT 1:
Providing output as asked by @mozway
df["dttime_col"].dt.second
Out[27]:
0 0.0
1 NaN
2 NaN
3 0.0
4 NaN
5 0.0
6 NaN
7 0.0
8 0.0
9 NaN
Name: dttime_col, dtype: float64
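The dt.second output above suggests that the time component is already stored. When every value in a datetime64 column falls on midnight, pandas leaves the 00:00:00 part out of the printed output, but the values still carry it. Below is a minimal sketch with a small made-up Series standing in for the parquet column. The string form is only needed if the time has to be visible in the output.

```python
import pandas as pd

# Hypothetical stand-in for df["dttime_col"]: date-only strings with gaps.
col = pd.to_datetime(pd.Series(["2021-07-12", None, "2021-02-04"]))

# The values are full timestamps; midnight is stored even if it is not printed.
first = col[0]
print(first.hour, first.minute, first.second)

# For display or export only: render the time explicitly as strings.
# Missing values stay missing (NaN) in the resulting object column.
shown = col.dt.strftime("%Y-%m-%d %H:%M:%S")
print(shown)
```

Keeping the column as datetime64 and formatting only at output time avoids the round trip through strings described in the question.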

Transform multiple format Duration Data to common formatted '%H%M%S' . The %M part of the format (minutes) is inconsistent

I have Duration data stored as an object column with multiple formats, particularly in the minutes part between the colons. Any idea how I can transform this data? I tried everything imaginable with regex (except for the correct answer :) ), which was the main part I was struggling with. For example, below is my attempt to zero-pad the minutes part.
df['temp'] = df['temp'].replace(':?:', ':0?:', regex=True)
Input:
Duration
0 00:0:00
1 00:00:00
2 00:8:00
3 00:08:00
4 00:588:00
5 09:14:00
Expected Output Option #1 (Time format):
Duration
0 00:00:00
1 00:00:00
2 00:08:00
3 00:08:00
4 09:48:00
5 09:14:00
My end goal is to get the minutes, so another acceptable format would be:
Expected Output Option #2 (Minutes - integer or float):
Minutes
0 0
1 0
2 8
3 8
4 588
5 554
We can just do pd.to_timedelta:
pd.to_timedelta(df.Duration)
Output:
0 00:00:00
1 00:00:00
2 00:08:00
3 00:08:00
4 09:48:00
5 09:14:00
Name: Duration, dtype: timedelta64[ns]
Or Option 2 - Minutes:
pd.to_timedelta(df.Duration).dt.total_seconds()/60
Output:
0 0.0
1 0.0
2 8.0
3 8.0
4 588.0
5 554.0
Name: Duration, dtype: float64
We can do split with mul
df.Duration.str.split(':',expand=True).astype(int).mul([60,1,1/60]).sum(1)
0 0.0
1 0.0
2 8.0
3 8.0
4 588.0
5 554.0
dtype: float64
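The split approach can be sketched end to end, producing both requested outputs (total minutes and a normalized HH:MM:SS string). The column names "h", "m" and "s" are made up for illustration, and this assumes every value has exactly three colon-separated integer parts.

```python
import pandas as pd

df = pd.DataFrame({"Duration": [
    "00:0:00", "00:00:00", "00:8:00", "00:08:00", "00:588:00", "09:14:00",
]})

# Split on ":" into three integer columns (illustrative names).
parts = df["Duration"].str.split(":", expand=True).astype(int)
parts.columns = ["h", "m", "s"]

# Option 2 from the question: total minutes.
minutes = parts["h"] * 60 + parts["m"] + parts["s"] / 60

# Option 1 from the question: carry overflowing minutes into hours
# and re-render as zero-padded HH:MM:SS.
total_seconds = parts["h"] * 3600 + parts["m"] * 60 + parts["s"]
hh = total_seconds // 3600
mm = total_seconds % 3600 // 60
ss = total_seconds % 60
pad = "{:02d}".format
normalized = hh.map(pad) + ":" + mm.map(pad) + ":" + ss.map(pad)

print(minutes.tolist())
print(normalized.tolist())
```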

Year as Float to Datetime64

I have a series of years in the following Series:
>>> temp
0 1994
1 1995
2 1996
3 -9999
4 1997
5 2001
dtype: float64
I have tried a number of different solutions to get these values to years. I only seem to be able to get the following to convert these floats to valid datetime values.
>>> temp.replace(-9999, np.nan).dropna().astype(int).astype(str).apply(np.datetime64)
0 1994-01-01
1 1995-01-01
2 1996-01-01
4 1997-01-01
5 2001-01-01
dtype: datetime64[ns]
Is there a more effective way to go about this? I doubt that converting everything to an integer and then a string is actually necessary or appropriate in this circumstance.
You can try to_datetime:
print(temp)
0 1994
1 1995
2 1996
3 -9999
4 1997
5 2001
dtype: int64
print(pd.to_datetime(temp, format='%Y', errors='coerce'))
0 1994-01-01
1 1995-01-01
2 1996-01-01
3 NaT
4 1997-01-01
5 2001-01-01
dtype: datetime64[ns]
And if you need to remove the NaT values, add dropna:
print(pd.to_datetime(temp, format='%Y', errors='coerce').dropna())
0 1994-01-01
1 1995-01-01
2 1996-01-01
4 1997-01-01
5 2001-01-01
dtype: datetime64[ns]
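Note that the example above starts from an int64 Series, while the question has float64. A float such as 1994.0 is not a plain four-digit year once it is turned into text, so it is likely safer to cast to integers first. A minimal sketch under that assumption, treating -9999 as the only sentinel value:

```python
import pandas as pd

# The float Series from the question.
temp = pd.Series([1994.0, 1995.0, 1996.0, -9999.0, 1997.0, 2001.0])

# Cast to integers so the text form is "1994" rather than "1994.0".
years = temp.astype("int64")

# Blank out the sentinel explicitly instead of relying on a parse failure.
year_str = years.astype(str).where(years != -9999)

# Parse the four-digit years; missing entries become NaT.
dates = pd.to_datetime(year_str, format="%Y", errors="coerce")
print(dates)
```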
Using standard datetime library:
datetime.datetime.strptime(str(temp[1]),'%Y')
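However, this needs to iterate over the Series and handle errors, because strptime raises on -9999.
Something like this will do the trick:
import datetime
converted = []
for value in temp:
    try:
        converted.append(datetime.datetime.strptime(str(int(value)), '%Y'))
    except ValueError:  # raised for values such as -9999
        converted.append(None)
temp = pd.Series(converted, index=temp.index)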

python pandas series loc value from multi index

I have a series that looks like this
2014 7 2014-07-01 -0.045417
8 2014-08-01 -0.035876
9 2014-09-02 -0.030971
10 2014-10-01 -0.027471
11 2014-11-03 -0.032968
12 2014-12-01 -0.031110
2015 1 2015-01-02 -0.028906
2 2015-02-02 -0.035563
3 2015-03-02 -0.040338
4 2015-04-01 -0.032770
5 2015-05-01 -0.025762
6 2015-06-01 -0.019746
7 2015-07-01 -0.018541
8 2015-08-03 -0.028101
9 2015-09-01 -0.043237
10 2015-10-01 -0.053565
11 2015-11-02 -0.062630
12 2015-12-01 -0.064618
2016 1 2016-01-04 -0.064852
I want to be able to get the value from a date. Something like:
myseries.loc('2015-10-01') and it returns -0.053565
The index entries are tuples of the form (2016, 1, 2016-01-04).
You can do it like this:
In [32]:
df.loc(axis=0)[:,:,'2015-10-01']
Out[32]:
value
year month date
2015 10 2015-10-01 -0.053565
You can also pass slice for each level:
In [39]:
df.loc[(slice(None),slice(None),'2015-10-01'),]
Out[39]:
value
year month date
2015 10 2015-10-01 -0.053565
Or just pass the first 2 index levels:
In [40]:
df.loc[2015,10]
Out[40]:
value
date
2015-10-01 -0.053565
Try xs:
print(s.xs('2015-10-01',level=2,axis=0))
#year datetime
#2015 10 -0.053565
#Name: series, dtype: float64
print(s.xs(7,level=1,axis=0))
#year datetime
#2014 2014-07-01 -0.045417
#2015 2015-07-01 -0.018541
#Name: series, dtype: float64
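Here is a self-contained sketch of the lookups above, applied to a Series rather than a DataFrame. The level names "year", "month" and "date" are assumptions, and only three rows from the question are rebuilt.

```python
import pandas as pd

# Rebuild a small slice of the Series from the question.
idx = pd.MultiIndex.from_tuples(
    [
        (2015, 9, pd.Timestamp("2015-09-01")),
        (2015, 10, pd.Timestamp("2015-10-01")),
        (2015, 11, pd.Timestamp("2015-11-02")),
    ],
    names=["year", "month", "date"],  # assumed level names
)
s = pd.Series([-0.043237, -0.053565, -0.062630], index=idx)

target = pd.Timestamp("2015-10-01")

# Cross-section on the date level; the result is indexed by (year, month).
by_date = s.xs(target, level="date")
value = by_date.iloc[0]

# Equivalent selection with .loc, keeping all three index levels.
via_loc = s.loc[pd.IndexSlice[:, :, target]]

print(value)
print(via_loc)
```

Both lookups return a Series, so .iloc[0] (or .item() when exactly one row matches) is needed to get the bare number.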

How to rearrange a date in python

I have a column in a pandas data frame looking like:
test1.Received
Out[9]:
0 01/01/2015 17:25
1 02/01/2015 11:43
2 04/01/2015 18:21
3 07/01/2015 16:17
4 12/01/2015 20:12
5 14/01/2015 11:09
6 15/01/2015 16:05
7 16/01/2015 21:02
8 26/01/2015 03:00
9 27/01/2015 08:32
10 30/01/2015 11:52
This represents a time stamp as Day Month Year Hour Minute. I would like to rearrange the date as Year Month Day Hour Minute. So that it would look like:
test1.Received
Out[9]:
0 2015/01/01 17:25
1 2015/01/02 11:43
...
Just use pd.to_datetime:
In [33]:
import pandas as pd
pd.to_datetime(df['date'])
Out[33]:
index
0 2015-01-01 17:25:00
1 2015-02-01 11:43:00
2 2015-04-01 18:21:00
3 2015-07-01 16:17:00
4 2015-12-01 20:12:00
5 2015-01-14 11:09:00
6 2015-01-15 16:05:00
7 2015-01-16 21:02:00
8 2015-01-26 03:00:00
9 2015-01-27 08:32:00
10 2015-01-30 11:52:00
Name: date, dtype: datetime64[ns]
In your case:
pd.to_datetime(test1['Received'])
should just work
If you want to change the display format then you need to parse as a datetime and then apply datetime.strftime:
In [35]:
import datetime as dt
pd.to_datetime(df['date']).apply(lambda x: dt.datetime.strftime(x, '%m/%d/%y %H:%M:%S'))
Out[35]:
index
0 01/01/15 17:25:00
1 02/01/15 11:43:00
2 04/01/15 18:21:00
3 07/01/15 16:17:00
4 12/01/15 20:12:00
5 01/14/15 11:09:00
6 01/15/15 16:05:00
7 01/16/15 21:02:00
8 01/26/15 03:00:00
9 01/27/15 08:32:00
10 01/30/15 11:52:00
Name: date, dtype: object
So the above is now showing month/day/year. In your case, with a four-digit year and no seconds, the following should work:
pd.to_datetime(test1['Received']).apply(lambda x: dt.datetime.strftime(x, '%Y/%m/%d %H:%M'))
EDIT
It looks like you need to pass dayfirst=True to to_datetime (the explicit format matching the input is optional but makes the intent clear):
In [45]:
pd.to_datetime(df['date'], format='%d/%m/%Y %H:%M', dayfirst=True).apply(lambda x: dt.datetime.strftime(x, '%m/%d/%y %H:%M:%S'))
Out[45]:
index
0 01/01/15 17:25:00
1 01/02/15 11:43:00
2 01/04/15 18:21:00
3 01/07/15 16:17:00
4 01/12/15 20:12:00
5 01/14/15 11:09:00
6 01/15/15 16:05:00
7 01/16/15 21:02:00
8 01/26/15 03:00:00
9 01/27/15 08:32:00
10 01/30/15 11:52:00
Name: date, dtype: object
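Putting this together for the layout asked for in the question (Year/Month/Day Hour:Minute), here is a minimal sketch that uses a few of the sample values and an explicit day-first format:

```python
import pandas as pd

# A few values from the question, in Day/Month/Year Hour:Minute order.
received = pd.Series(["01/01/2015 17:25", "02/01/2015 11:43", "14/01/2015 11:09"])

# Parse with an explicit format so 02/01 is read as 2 January, not 1 February.
parsed = pd.to_datetime(received, format="%d/%m/%Y %H:%M")

# Render as strings in the requested order.
rearranged = parsed.dt.strftime("%Y/%m/%d %H:%M")
print(rearranged.tolist())
```

The result of strftime is a column of strings. If the data is used for further date arithmetic, it is probably better to keep the parsed datetime column and format only for display.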
Pandas has this built in; you can specify your datetime format:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html
or use infer_datetime_format (deprecated since pandas 2.0, where format inference is the default).
>>> import pandas as pd
>>> i = pd.date_range('20000101',periods=100)
>>> df = pd.DataFrame(dict(year = i.year, month = i.month, day = i.day))
>>> pd.to_datetime(df.year*10000 + df.month*100 + df.day, format='%Y%m%d')
0 2000-01-01
1 2000-01-02
...
98 2000-04-08
99 2000-04-09
Length: 100, dtype: datetime64[ns]
You can use the datetime functions to convert to and from strings.
from datetime import datetime
# converts the string to a datetime object
date_obj = datetime.strptime(date_string, '%d/%m/%Y %H:%M')
and
# converts the datetime object to your requested string format
datetime.strftime(date_obj, '%Y/%m/%d %H:%M:%S')
