I have a pandas DataFrame with a MultiIndex that looks like this:
# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
# multi-indexed dataframe
df = pd.DataFrame(np.random.randn(8760 * 3, 3))
df['concept'] = "some_value"
df['datetime'] = pd.date_range(start='2016', periods=len(df), freq='60Min')
df.set_index(['concept', 'datetime'], inplace=True)
df.sort_index(inplace=True)
Console output:
df.head()
Out[23]:
                 0         1         2
datetime
2016      0.458802  0.413004  0.091056
2016     -0.051840 -1.780310 -0.304122
2016     -1.119973  0.954591  0.279049
2016     -0.691850 -0.489335  0.554272
2016     -1.278834 -1.292012 -0.637931

df.tail()
Out[24]:
                 0         1         2
datetime
2018     -1.872155  0.434520 -0.526520
2018      0.345213  0.989475 -0.892028
2018     -0.162491  0.908121 -0.993499
2018     -1.094727  0.307312  0.515041
2018     -0.880608 -1.065203 -1.438645
Now I want to create annual sums along the level 'datetime'.
My first try was the following, but it doesn't work:
# sum along years
years = df.index.get_level_values('datetime').year.tolist()
df.index.set_levels([years], level=['datetime'], inplace=True)
df = df.groupby(level=['datetime']).sum()
It also seems quite heavy-handed to me, as this task should be pretty easy to accomplish.
So here's my question: How can I get annual sums for the level 'datetime'? Is there a simple way to realize this by applying a function to the DateTime level values?
You can group by the year of the second MultiIndex level:
# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
# multi-indexed dataframe
df = pd.DataFrame(np.random.randn(8760 * 3, 3))
df['concept'] = "some_value"
df['datetime'] = pd.date_range(start='2016', periods=len(df), freq='60Min')
df.set_index(['concept', 'datetime'], inplace=True)
df.sort_index(inplace=True)
print(df.head())
0 1 2
concept datetime
some_value 2016-01-01 00:00:00 1.973437 0.101535 -0.693360
2016-01-01 01:00:00 1.221657 -1.983806 -0.075609
2016-01-01 02:00:00 -0.208122 -2.203801 1.254084
2016-01-01 03:00:00 0.694332 -0.235864 0.538468
2016-01-01 04:00:00 -0.928815 -1.417445 1.534218
# sum along years
#years = df.index.get_level_values('datetime').year.tolist()
#df.index.set_levels([years], level=['datetime'], inplace=True)
print(df.index.levels[1].year)
[2016 2016 2016 ..., 2018 2018 2018]
df = df.groupby(df.index.levels[1].year).sum()
print(df.head())
0 1 2
2016 -93.901914 -32.205514 -22.460965
2017 205.681817 67.701669 -33.960801
2018 67.438355 150.954614 -21.381809
Or you can use get_level_values with .year:
df = df.groupby(df.index.get_level_values('datetime').year).sum()
print(df.head())
0 1 2
2016 -93.901914 -32.205514 -22.460965
2017 205.681817 67.701669 -33.960801
2018 67.438355 150.954614 -21.381809
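Of the two, get_level_values is the more robust choice, since df.index.levels[1] only lines up row-for-row here because every timestamp is unique and the index is sorted. If you also want to keep the 'concept' level in the result, a minimal sketch (same df as above) is to group by both keys:
# Group by the 'concept' level plus the year extracted from 'datetime';
# the year array simply acts as an extra grouping key
annual = df.groupby(['concept',
                     df.index.get_level_values('datetime').year]).sum()
print(annual.head())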
Starting with your sample data:
df = pd.DataFrame(np.random.randn(8760 * 3, 3))
df['concept'] = "some_value"
df['datetime'] = pd.date_range(start='2016', periods=len(df), freq='60Min')
df.set_index(['concept', 'datetime'], inplace=True)
you can apply groupby to a level of your MultiIndex:
df.groupby(pd.TimeGrouper(level='datetime', freq='A')).sum()
to get:
0 1 2
datetime
2016-12-31 100.346135 -71.673222 42.816675
2017-12-31 -132.880909 -66.017010 -73.449358
2018-12-31 -71.449710 -15.774929 97.634349
pd.TimeGrouper is now (as of 0.23) deprecated; please use pd.Grouper(freq=...) instead.
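For reference, a minimal sketch of the same aggregation with the non-deprecated API (note that on pandas >= 2.2 the annual alias is 'YE' rather than 'A'):
# Annual sums over the 'datetime' level with pd.Grouper
df.groupby(pd.Grouper(level='datetime', freq='A')).sum()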
Related
Hi, I am looking for a more elegant solution than my code. I have a given df which looks like this:
import numpy as np
import pandas as pd
from datetime import date, timedelta
from pandas.tseries.offsets import DateOffset
sdate = date(2021,1,31)
edate = date(2021,8,30)
date_range = pd.date_range(sdate,edate-timedelta(days=1),freq='m')
df_test = pd.DataFrame({ 'Datum': date_range})
I take this df and have to insert a new first row with the minimum date:
data_perf_indexed_vv = df_test.copy()
minimum_date = df_test['Datum'].min()
data_perf_indexed_vv = data_perf_indexed_vv.reset_index()
df1 = pd.DataFrame([[np.nan] * len(data_perf_indexed_vv.columns)],
columns=data_perf_indexed_vv.columns)
data_perf_indexed_vv = pd.concat([df1, data_perf_indexed_vv], ignore_index=True)
data_perf_indexed_vv['Datum'].iloc[0] = minimum_date - DateOffset(months=1)
data_perf_indexed_vv.drop(['index'], axis=1)
Does somebody have a shorter or more elegant solution? Thanks.
Instead of writing such a big second block of code, just make use of:
df_test.loc[len(df_test)+1, 'Datum'] = df_test['Datum'].min() - DateOffset(months=1)
Finally, make use of the sort_values() method:
df_test = df_test.sort_values(by='Datum', ignore_index=True)
Now if you print df_test you will get the desired output:
#output
Datum
0 2020-12-31
1 2021-01-31
2 2021-02-28
3 2021-03-31
4 2021-04-30
5 2021-05-31
6 2021-06-30
7 2021-07-31
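If you would rather avoid enlargement via .loc, a roughly equivalent sketch builds the new first row explicitly and concatenates it (same df_test and DateOffset as above):
# Prepend a one-row frame holding min(Datum) minus one month
new_row = pd.DataFrame({'Datum': [df_test['Datum'].min() - DateOffset(months=1)]})
df_test = pd.concat([new_row, df_test], ignore_index=True)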
I have the following pandas DataFrame df that includes one string column mydate:
import pandas as pd
df = {'mydate': ['01JAN2009','20FEB2013','13MAR2010','01APR2012', '20MAY2013', '18JUN2018', '10JUL2002', '30AUG2000', '15SEP2001', '30OCT1999',
'04NOV2020', '23DEC1995']}
df = pd.DataFrame(df, columns = ['mydate'])
I need to convert mydate into date type and store it in a new column mydate2.
You could try this:
import pandas as pd
df = {'mydate': ['01JAN2009','20FEB2013','13MAR2010','01APR2012', '20MAY2013', '18JUN2018', '10JUL2002', '30AUG2000', '15SEP2001', '30OCT1999',
'04NOV2020', '23DEC1995']}
df = pd.DataFrame(df, columns = ['mydate'])
df['mydate2'] = pd.to_datetime(df['mydate'])
print(df)
Output:
mydate mydate2
0 01JAN2009 2009-01-01
1 20FEB2013 2013-02-20
2 13MAR2010 2010-03-13
3 01APR2012 2012-04-01
4 20MAY2013 2013-05-20
5 18JUN2018 2018-06-18
6 10JUL2002 2002-07-10
7 30AUG2000 2000-08-30
8 15SEP2001 2001-09-15
9 30OCT1999 1999-10-30
10 04NOV2020 2020-11-04
11 23DEC1995 1995-12-23
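Since every value follows the same DDMONYYYY pattern, you can also pass the format explicitly; this skips format inference and is typically faster on large columns:
# '%d%b%Y' matches strings like '01JAN2009' (month names parse case-insensitively)
df['mydate2'] = pd.to_datetime(df['mydate'], format='%d%b%Y')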
I have the following pandas DataFrame. The dates include a time component:
from pandas.tseries.holiday import USFederalHolidayCalendar
import pandas as pd
df = pd.DataFrame([[6,0,"2016-01-02 01:00:00",0.0],
[7,0,"2016-07-04 02:00:00",0.0]])
cal = USFederalHolidayCalendar()
holidays = cal.holidays(start='2014-01-01', end='2018-12-31')
I want to add a new boolean column with True/False indicating whether the date is a holiday or not.
I tried df["hd"] = df[2].isin(holidays), but it doesn't work because of the time components.
Use Series.dt.floor or Series.dt.normalize to remove the times:
df[2] = pd.to_datetime(df[2])
df["hd"] = df[2].dt.floor('d').isin(holidays)
#alternative
df["hd"] = df[2].dt.normalize().isin(holidays)
print(df)
0 1 2 3 hd
0 6 0 2016-01-02 01:00:00 0.0 False
1 7 0 2016-07-04 02:00:00 0.0 True
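An alternative with the same effect compares plain calendar dates; holidays.date exposes the datetime.date values behind the DatetimeIndex:
# Compare calendar dates directly, ignoring the time of day
df["hd"] = df[2].dt.date.isin(holidays.date)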
My dataset has dates in the European format, and I'm struggling to convert them into the correct format before I pass them through pd.to_datetime; for all days < 12, the month and day are switched.
Is there an easy solution to this?
import pandas as pd
import datetime as dt
df = pd.read_csv(loc,dayfirst=True)
df['Date']=pd.to_datetime(df['Date'])
Is there a way to force datetime to acknowledge that the input is formatted at dd/mm/yy?
Thanks for the help!
Edit, a sample from my dates:
renewal["Date"].head()
Out[235]:
0 31/03/2018
2 30/04/2018
3 28/02/2018
4 30/04/2018
5 31/03/2018
Name: Earliest renewal date, dtype: object
After running the following:
renewal['Date']=pd.to_datetime(renewal['Date'],dayfirst=True)
I get:
Out[241]:
0 2018-03-31 #Correct
2 2018-04-01 #<-- this number is wrong and should be 01-04 instead
3 2018-02-28 #Correct
Add a format:
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
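As a usage note, the same fix can be applied at read time: on pandas >= 2.0, read_csv accepts date_format alongside parse_dates (loc being the file path from the question):
# Parse 'Date' while reading, with an unambiguous day-first format
df = pd.read_csv(loc, parse_dates=['Date'], date_format='%d/%m/%Y')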
You can control the date construction directly if you define separate columns for 'year', 'month' and 'day', like this:
import pandas as pd
df = pd.DataFrame(
{'Date': ['01/03/2018', '06/08/2018', '31/03/2018', '30/04/2018']}
)
date_parts = df['Date'].apply(lambda d: pd.Series(int(n) for n in d.split('/')))
date_parts.columns = ['day', 'month', 'year']
df['Date'] = pd.to_datetime(date_parts)
date_parts
# day month year
# 0 1 3 2018
# 1 6 8 2018
# 2 31 3 2018
# 3 30 4 2018
df
# Date
# 0 2018-03-01
# 1 2018-08-06
# 2 2018-03-31
# 3 2018-04-30
I have a pandas DataFrame that is indexed by date. I would like to select all consecutive gaps and all runs of consecutive days, grouped by period. How can I do this?
Example of a DataFrame with no columns but a date index:
In [29]: import pandas as pd
In [30]: dates = pd.to_datetime(['2016-09-19 10:23:03', '2016-08-03 10:53:39', '2016-09-05 11:11:30', '2016-09-05 11:10:46', '2016-09-06 10:53:39'])
In [31]: ts = pd.DataFrame(index=dates)
As you can see, there is a gap between 2016-08-03 and 2016-09-19. How do I detect these so I can create descriptive statistics, i.e. 40 gaps with a median gap duration of "x", etc.? Also, I can see that 2016-09-05 to 2016-09-06 is a two-day range. How can I detect these and also print descriptive stats?
Ideally the result would be returned as another DataFrame in each case, since I want to use other columns in the DataFrame to groupby.
Pandas has a built-in method Series.diff() which you can use to accomplish this. One benefit is that you can then use pandas Series functions like mean() to quickly compute summary statistics on the resulting gaps Series.
from datetime import datetime, timedelta
import pandas as pd
# Construct dummy dataframe
dates = pd.to_datetime([
'2016-08-03',
'2016-08-04',
'2016-08-05',
'2016-08-17',
'2016-09-05',
'2016-09-06',
'2016-09-07',
'2016-09-19'])
df = pd.DataFrame(dates, columns=['date'])
# Take the diff of the first column (drop 1st row since it's undefined)
deltas = df['date'].diff()[1:]
# Filter diffs (here days > 1, but could be seconds, hours, etc)
gaps = deltas[deltas > timedelta(days=1)]
# Print results
print(f'{len(gaps)} gaps with average gap duration: {gaps.mean()}')
for i, g in gaps.items():
gap_start = df['date'][i - 1]
print(f'Start: {datetime.strftime(gap_start, "%Y-%m-%d")} | '
f'Duration: {str(g.to_pytimedelta())}')
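The same diff trick also covers the second half of the question, runs of consecutive days: cumulatively summing a "new run starts here" flag assigns a run id to every date, which can then be aggregated. A minimal sketch on the df built above:
# A new run starts wherever the day-to-day difference exceeds one day
df['run'] = (df['date'].diff() > timedelta(days=1)).cumsum()
# One row per run: start date, end date and number of days
runs = df.groupby('run')['date'].agg(start='min', end='max', days='count')
print(runs)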
Here's something to get started:
df = pd.DataFrame(np.ones(5),columns = ['ones'])
df.index = pd.DatetimeIndex(['2016-09-19 10:23:03', '2016-08-03 10:53:39', '2016-09-05 11:11:30', '2016-09-05 11:10:46', '2016-09-06 10:53:39'])
daily_rng = pd.date_range('2016-08-03 00:00:00', periods=48, freq='D')
daily_rng = daily_rng.append(df.index)
daily_rng = sorted(daily_rng)
df = df.reindex(daily_rng).fillna(0)
df = df.astype(int)
df['ones'] = df['ones'].cumsum()
The cumsum() creates a grouping variable on 'ones', partitioning your data at the points you provided. If you print df to, say, a spreadsheet it will make sense:
print(df.head())
ones
2016-08-03 00:00:00 0
2016-08-03 10:53:39 1
2016-08-04 00:00:00 1
2016-08-05 00:00:00 1
2016-08-06 00:00:00 1
print(df.tail())
ones
2016-09-16 00:00:00 4
2016-09-17 00:00:00 4
2016-09-18 00:00:00 4
2016-09-19 00:00:00 4
2016-09-19 10:23:03 5
Now to complete:
df = df.reset_index()
df = df.groupby('ones').agg(first_time=('index', 'min'),
                            gaps=('index', 'count'))
which gives:
first_time gaps
ones
0 2016-08-03 00:00:00 1
1 2016-08-03 10:53:39 34
2 2016-09-05 11:10:46 1
3 2016-09-05 11:11:30 2
4 2016-09-06 10:53:39 14
5 2016-09-19 10:23:03 1
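Descriptive statistics then follow directly from this summary frame, e.g. by treating the differences between consecutive first_time values as the spacing between groups (a short sketch on the df just produced):
# Time elapsed between the start of one group and the next
spacing = df['first_time'].diff().dropna()
print(spacing.median())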