date index from multiindex for pandas dataframe - python

I have a dataframe with a MultiIndex that I want to convert to a date index.
Here is an example emulation of the type of dataframes I have:
import numpy as np
import pandas as pd
i = pd.date_range('01-01-2016', '01-01-2020')
x = pd.DataFrame(index = i, data=np.random.randint(0, 10, len(i)))
x = x.groupby(by = [x.index.year, x.index.month]).sum()
print(x)
I tried to convert it to a date index like this:
def to_date(ind):
    return pd.to_datetime(str(ind[0]) + '/' + str(ind[1]), format="%Y/%m").date()
# flattening the multiindex to tuples to later reset the index
x.set_axis(x.index.to_flat_index(), axis=0, inplace = True)
x = x.rename(index = to_date)
x.set_axis(pd.DatetimeIndex(x.index), axis=0, inplace=True)
But it is very slow. I think the problem is in the pd.to_datetime(str(ind[0]) + '/' + str(ind[1]), format="%Y/%m").date() line. Would greatly appreciate any ideas to make this faster.

You can just use:
x.index=pd.to_datetime([f"{a}-{b}" for a,b in x.index],format='%Y-%m')
print(x)
0
2016-01-01 162
2016-02-01 119
2016-03-01 148
2016-04-01 125
2016-05-01 132
2016-06-01 144
2016-07-01 157
2016-08-01 141
2016-09-01 138
2016-10-01 168
2016-11-01 140
2016-12-01 137
2017-01-01 113
2017-02-01 113
2017-03-01 155
..........
..........
......
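An alternative that avoids building strings is to assemble the dates from the index levels directly. This is a minimal sketch, assuming the first level holds years and the second holds months, as in the question; the names `years`, `months` and `parts` are illustrative only.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question.
i = pd.date_range("2016-01-01", "2020-01-01")
x = pd.DataFrame(index=i, data=np.random.randint(0, 10, len(i)))
x = x.groupby(by=[x.index.year, x.index.month]).sum()

# Pull the two index levels out as flat arrays (assumed: level 0 = year, level 1 = month).
years = x.index.get_level_values(0)
months = x.index.get_level_values(1)

# pd.to_datetime can assemble dates from a frame with year/month/day columns.
parts = pd.DataFrame({"year": years, "month": months, "day": 1})
x.index = pd.DatetimeIndex(pd.to_datetime(parts))

print(x.head())
```

Every date lands on the first day of its month, which matches the output shown above.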

Related

Subsetting a dataframe greater than a specific date Pandas and append to another dataframe

I have a dataframe (df1) with a date column in dd/mm/yyyy format.
I have a second dataframe (df2) with the same structure that shares some data with df1. I want to add the rows from df2 whose dates are after the most recent date in df1.
My approach was to find the max date in df1, then subset df2 to the later dates and append that subset to df1:
maxdate = df.loc[pd.to_datetime(df['DATE'],dayfirst=True).idxmax(), 'DATE']
# in this instance it is 11/09/2022 (dd/mm/yyyy)
df3 = df2.loc[pd.to_datetime(df2['DATE']) > maxdate] # this is to be my subset to append to df1
some of the df1 below
DATE TIME X Y Z
3692 23/08/2022 16:55:00 734154.2872 9551189.353 2.845237e+03
3693 23/08/2022 16:55:00 734199.2516 9551070.666 2.842993e+03
3694 23/08/2022 05:02:00 734669.6130 9551361.865 2.845012e+03
3695 24/08/2022 17:25:00 734215.9910 9551068.295 2.842111e+03
3696 24/08/2022 17:25:00 734684.8444 9551383.618 2.846049e+03
3697 27/08/2022 17:20:00 734214.1851 9551061.242 2.841501e+03
3698 28/08/2022 17:00:00 734669.6130 9551361.865 2.845012e+03
3699 30/08/2022 05:25:00 734176.3412 9551168.550 2.844325e+03
3700 01/09/2022 17:18:00 734686.1061 9551385.420 2.846083e+03
3701 01/09/2022 17:18:00 734667.0922 9551358.264 2.844812e+03
3702 01/09/2022 17:18:00 734164.7047 9551178.039 2.844962e+03
3703 02/09/2022 17:16:00 734151.9079 9551185.951 2.845472e+03
3704 03/09/2022 17:15:00 734141.2542 9551197.062 2.844747e+03
3705 04/09/2022 17:08:00 734687.3678 9551387.222 2.846116e+03
3706 04/09/2022 17:08:00 734665.8319 9551356.464 2.844713e+03
3707 05/09/2022 05:08:00 734704.3326 9551376.581 2.842331e+03
3708 07/09/2022 16:58:00 734687.3678 9551387.222 2.846116e+03
3709 08/09/2022 16:55:00 734663.3109 9551352.864 2.844512e+03
3710 10/09/2022 17:03:00 734689.8913 9551390.826 2.846184e+03
3711 11/09/2022 17:13:00 734691.1530 9551392.628 9.551393e+06
some of df2 below
DATE TIME X Y Z
134 23/08/22 16:55:00 734154.2872 9551189.3534 2845.237
135 23/08/22 16:55:00 734199.2516 9551070.6664 2842.9929
136 23/08/22 5:02:00 734669.613 9551361.8645 2845.0122
138 24/08/22 17:25:00 734215.991 9551068.2954 2842.1106
139 24/08/22 17:25:00 734684.8444 9551383.618 2846.0492
147 27/08/22 17:20:00 734214.1851 9551061.2423 2841.501
149 28/08/22 17:00:00 734669.613 9551361.8645 2845.0122
151 29/08/22 17:30:00 - - -
153 30/08/22 5:25:00 734176.3412 9551168.5498 2844.325
180 11/09/22 17:13:00 734691.153 9551392.6276 9551392.6276
However, the resulting df3 includes dates before "maxdate".
I suspect it is related to the date format that I have.
Any help appreciated.
You need to convert the values to pandas datetimes; otherwise the comparison is based on string values and not on dates. It is also not clear whether 11 or 09 is the day in the sample max date 11/09/2022. If 11 is the day, you also need to pass dayfirst=True to pd.to_datetime:
>>> maxdate=pd.to_datetime('11/09/2022')
# Timestamp('2022-11-09 00:00:00')
>>> df2 = df.loc[pd.to_datetime(df['DATE'], dayfirst=True) > maxdate]
Here is the execution for the sample data you have added to the question:
# Getting the max date from first dataframe
max_date=pd.to_datetime(df1['DATE'],dayfirst=True).max()
max_date
Timestamp('2022-09-11 00:00:00')
# Filtering second dataframe based on maximum date
df2[pd.to_datetime(df2['DATE'], dayfirst=True)>max_date]
Empty DataFrame
Columns: [DATE, TIME, X, Y, Z]
Index: []
# Result is an empty dataframe for the sample data because no record matches the condition
# Records for maximum date:
df2[pd.to_datetime(df2['DATE'], dayfirst=True)==max_date]
DATE TIME X Y Z
180 11/09/22 17:13:00 734691.153 9551392.6276 9551392.6276
# Records for dates older than the maximum date:
df2[pd.to_datetime(df2['DATE'], dayfirst=True)<max_date]
DATE TIME X Y Z
134 23/08/22 16:55:00 734154.2872 9551189.3534 2845.237
135 23/08/22 16:55:00 734199.2516 9551070.6664 2842.9929
136 23/08/22 5:02:00 734669.613 9551361.8645 2845.0122
138 24/08/22 17:25:00 734215.991 9551068.2954 2842.1106
139 24/08/22 17:25:00 734684.8444 9551383.618 2846.0492
147 27/08/22 17:20:00 734214.1851 9551061.2423 2841.501
149 28/08/22 17:00:00 734669.613 9551361.8645 2845.0122
151 29/08/22 17:30:00 - - -
153 30/08/22 5:25:00 734176.3412 9551168.5498 2844.325
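The filter-then-append step the question asks for can be sketched as follows. This is a minimal sketch on made-up rows, assuming df1 uses four-digit years and df2 two-digit years as in the samples; the explicit `format` strings and the name `combined` are assumptions, not part of the original answer.

```python
import pandas as pd

# Tiny stand-ins for the real frames (illustrative data only).
df1 = pd.DataFrame({"DATE": ["10/09/2022", "11/09/2022"], "X": [1.0, 2.0]})
df2 = pd.DataFrame({"DATE": ["11/09/22", "12/09/22", "01/10/22"], "X": [2.0, 3.0, 4.0]})

# Parse both columns as real datetimes, day first, so the comparison is not string-based.
max_date = pd.to_datetime(df1["DATE"], format="%d/%m/%Y").max()
df2_dates = pd.to_datetime(df2["DATE"], format="%d/%m/%y")

# Keep only the df2 rows strictly after the latest date in df1, then append them.
df3 = df2.loc[df2_dates > max_date]
combined = pd.concat([df1, df3], ignore_index=True)

print(combined)
```

The row dated 01/10/22 is kept because it is read as 1 October, which is the case a month-first parse would get wrong.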

Concatenate Decimal Seconds in to a Time Column in Pandas

I have a column with hh:mm:ss and a separate column with the decimal seconds.
I have some quite horrible text files to process, and the decimal part of my time values is separated into another column. Now I'd like to concatenate them back together.
For example:
df = {'Time':['01:00:00','01:00:00 AM','01:00:01 AM','01:00:01 AM'],
'DecimalSecond':['14','178','158','75']}
I tried the following but it didn't work. It gives me "01:00:00 AM.14" LOL
df = df['Time2'] = df['Time'].map(str) + '.' + df['DecimalSecond'].map(str)
The goal is to come up with one column named "Time2" which has 01:00:00.14 AM in the first row, 01:00:00.178 AM in the second row, etc.
Thank you for the help.
You can convert the output to datetimes and then call Series.dt.time:
# The Time column is split on whitespace and the part before the first space is kept
s = df['Time'].astype(str).str.split().str[0] + '.' + df['DecimalSecond'].astype(str)
df['Time2'] = pd.to_datetime(s).dt.time
print (df)
Time DecimalSecond Time2
0 01:00:00 14 01:00:00.140000
1 01:00:00 AM 178 01:00:00.178000
2 01:00:01 AM 158 01:00:01.158000
3 01:00:01 AM 75 01:00:01.750000
Please see the python code below
In [1]:
import pandas as pd
In [2]:
df = pd.DataFrame({'Time':['01:00:00','01:00:00','01:00:01','01:00:01'],
'DecimalSecond':['14','178','158','75']})
In [3]:
df['Time2'] = df[['Time','DecimalSecond']].apply(lambda x: ' '.join(x), axis = 1)
print(df)
Time DecimalSecond Time2
0 01:00:00 14 01:00:00 14
1 01:00:00 178 01:00:00 178
2 01:00:01 158 01:00:01 158
3 01:00:01 75 01:00:01 75
In [4]:
df.iloc[:,2]
Out[4]:
0 01:00:00 14
1 01:00:00 178
2 01:00:01 158
3 01:00:01 75
Name: Time2, dtype: object

Convert a column in pandas of HH:MM to minutes

I want to convert a column in a dataset from hh:mm format to minutes. I tried the following code, but it fails with "AttributeError: 'Series' object has no attribute 'split'". The data is in the following format. I also have NaN values in the dataset, and the plan is to compute the median of the values and then fill the rows that have NaN with the median.
02:32
02:14
02:31
02:15
02:28
02:15
02:22
02:16
02:22
02:14
I have tried this so far
s = dataset['Enroute_time_(hh mm)']
hours, minutes = s.split(':')
int(hours) * 60 + int(minutes)
I suggest you avoid row-wise calculations. You can use a vectorised approach with Pandas / NumPy:
df = pd.DataFrame({'time': ['02:32', '02:14', '02:31', '02:15', '02:28', '02:15',
'02:22', '02:16', '02:22', '02:14', np.nan]})
values = df['time'].fillna('00:00').str.split(':', expand=True).astype(int)
factors = np.array([60, 1])
df['mins'] = (values * factors).sum(1)
print(df)
time mins
0 02:32 152
1 02:14 134
2 02:31 151
3 02:15 135
4 02:28 148
5 02:15 135
6 02:22 142
7 02:16 136
8 02:22 142
9 02:14 134
10 NaN 0
If you want to use split, you will need to use the str accessor, i.e. s.str.split(':').
However I think that in this case it makes more sense to use apply:
df = pd.DataFrame({'Enroute_time_(hh mm)': ['02:32', '02:14', '02:31',
'02:15', '02:28', '02:15',
'02:22', '02:16', '02:22', '02:14']})
def convert_to_minutes(value):
    hours, minutes = value.split(':')
    return int(hours) * 60 + int(minutes)
df['Enroute_time_(hh mm)'] = df['Enroute_time_(hh mm)'].apply(convert_to_minutes)
print(df)
# Enroute_time_(hh mm)
# 0 152
# 1 134
# 2 151
# 3 135
# 4 148
# 5 135
# 6 142
# 7 136
# 8 142
# 9 134
I understood that you have a column in a DataFrame with multiple Timedeltas as Strings. Then you want to extract the total minutes of the Deltas. After that you want to fill the NaN values with the median of the total minutes.
import pandas as pd
df = pd.DataFrame(
{'hhmm' : ['02:32',
'02:14',
'02:31',
'02:15',
'02:28',
'02:15',
'02:22',
'02:16',
'02:22',
'02:14']})
Your Timedeltas are not Timedeltas. They are strings. So you need to convert them first.
df.hhmm = pd.to_datetime(df.hhmm, format='%H:%M')
# pd.datetime was removed from pandas; pd.Timestamp gives the same 1900-01-01 reference date
df.hhmm = pd.to_timedelta(df.hhmm - pd.Timestamp(1900, 1, 1))
This gives you the following values (Note the dtype: timedelta64[ns] here)
0 02:32:00
1 02:14:00
2 02:31:00
3 02:15:00
4 02:28:00
5 02:15:00
6 02:22:00
7 02:16:00
8 02:22:00
9 02:14:00
Name: hhmm, dtype: timedelta64[ns]
Now that you have true timedeltas, you can use some cool functions like total_seconds() and then calculate the minutes.
df.hhmm.dt.total_seconds() / 60
If that is not what you wanted, you can also use the following.
df.hhmm.dt.components.minutes
This gives you the minutes from the HH:MM string as if you would have split it.
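Fill the NaN values with the median. Compute the minutes first, so that the fill value has the same type as the column being filled:
minutes = df.hhmm.dt.total_seconds() / 60
minutes = minutes.fillna(minutes.median())
or
minutes = df.hhmm.dt.components.minutes
minutes = minutes.fillna(minutes.median())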
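The whole plan from the question (convert to minutes, then fill missing values with the median) can also be sketched in one pass with pd.to_timedelta. This is a minimal sketch on a shortened sample, assuming every non-missing value has the form HH:MM; the column names `time` and `mins` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"time": ["02:32", "02:14", "02:31", np.nan]})

# pd.to_timedelta expects HH:MM:SS, so append the seconds; missing values stay missing.
td = pd.to_timedelta(df["time"] + ":00")

# Total minutes as floats, with NaN where the input was missing.
df["mins"] = td.dt.total_seconds() / 60

# Fill the gaps with the median of the observed values.
df["mins"] = df["mins"].fillna(df["mins"].median())

print(df)
```

Filling after the conversion keeps the median in minutes, rather than mixing a number into a column of strings or timedeltas.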

Pandas get specific rows from HDF5 by index

I have a pandas DataFrame that I have written to an HDF5 file. The data is indexed by Timestamps and looks like this:
In [5]: df
Out[5]:
Codes Price Size
Time
2015-04-27 01:31:08-04:00 T 111.75 23
2015-04-27 01:31:39-04:00 T 111.80 23
2015-04-27 01:31:39-04:00 T 113.00 35
2015-04-27 01:34:14-04:00 T 113.00 85
2015-04-27 01:55:15-04:00 T 113.50 203
... ... ... ...
2015-05-26 11:35:00-04:00 CA 110.55 196
2015-05-26 11:35:00-04:00 CA 110.55 98
2015-05-26 11:35:00-04:00 CA 110.55 738
2015-05-26 11:35:00-04:00 CA 110.55 19
2015-05-26 11:37:01-04:00 110.55 12
What I would like is to create a function that I can pass a pandas DatetimeIndex and it will return a DataFrame with the rows at or right before each Timestamp in the DatetimeIndex.
The problem I'm running into is that concatenated read_hdf queries won't work if I am looking for more than 30 rows (see "pandas read_hdf with 'where' condition limitation?").
What I am doing now is this, but there has to be a better solution:
from pandas import read_hdf, DatetimeIndex
from datetime import timedelta
import pytz
def getRows(file, dataset, index):
    if len(index) == 1:
        start = index.date[0]
        end = (index.date + timedelta(days=1))[0]
    else:
        start = index.date.min()
        end = (index.date.max() + timedelta(days=1))
    where = '(index >= "' + str(start) + '") & (index < "' + str(end) + '")'
    df = read_hdf(file, dataset, where=where)
    df = df.groupby(level=0).last().reindex(index, method='pad')
    return df
This is an example of using a where mask
In [22]: pd.set_option('display.max_rows',10)
In [23]: df = pd.DataFrame({'A' : np.random.randn(100), 'B' : pd.date_range('20130101',periods=100)}).set_index('B')
In [24]: df
Out[24]:
A
B
2013-01-01 0.493144
2013-01-02 0.421045
2013-01-03 -0.717824
2013-01-04 0.159865
2013-01-05 -0.485890
... ...
2013-04-06 -0.805954
2013-04-07 -1.014333
2013-04-08 0.846877
2013-04-09 -1.646908
2013-04-10 -0.160927
[100 rows x 1 columns]
Store the test frame
In [25]: store = pd.HDFStore('test.h5',mode='w')
In [26]: store.append('df',df)
Create a random selection of dates.
In [27]: dates = df.index.take(np.random.randint(0,100,10))
In [28]: dates
Out[28]: DatetimeIndex(['2013-03-29', '2013-02-16', '2013-01-15', '2013-02-06', '2013-01-12', '2013-02-24', '2013-02-18', '2013-01-06', '2013-03-17', '2013-03-21'], dtype='datetime64[ns]', name=u'B', freq=None, tz=None)
Select the index column (in its entirety)
In [29]: c = store.select_column('df','index')
In [30]: c
Out[30]:
0 2013-01-01
1 2013-01-02
2 2013-01-03
3 2013-01-04
4 2013-01-05
...
95 2013-04-06
96 2013-04-07
97 2013-04-08
98 2013-04-09
99 2013-04-10
Name: B, dtype: datetime64[ns]
Select the indexers that you want. This could actually be somewhat complicated, e.g. you might want a .reindex(method='nearest')
In [34]: c[c.isin(dates)]
Out[34]:
5 2013-01-06
11 2013-01-12
14 2013-01-15
36 2013-02-06
46 2013-02-16
48 2013-02-18
54 2013-02-24
75 2013-03-17
79 2013-03-21
87 2013-03-29
Name: B, dtype: datetime64[ns]
Select the rows that you want
In [32]: store.select('df',where=c[c.isin(dates)].index)
Out[32]:
A
B
2013-01-06 0.680930
2013-01-12 0.165923
2013-01-15 -0.517692
2013-02-06 -0.351020
2013-02-16 1.348973
2013-02-18 0.448890
2013-02-24 -1.078522
2013-03-17 -0.358597
2013-03-21 -0.482301
2013-03-29 0.343381
In [33]: store.close()
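The isin step above only finds exact matches, while the question asks for the row at or right before each timestamp. One way to compute those row positions is a sorted search on the index column. This is a minimal sketch, assuming the stored index is sorted; `stored` stands in for the column returned by store.select_column('df', 'index'), and the sample timestamps are made up.

```python
import pandas as pd

# Stand-in for the stored index column (sorted, duplicates allowed).
stored = pd.DatetimeIndex(["2013-01-01", "2013-01-02", "2013-01-02", "2013-01-05"])

# Timestamps we want to look up.
wanted = pd.DatetimeIndex(["2013-01-02", "2013-01-04", "2012-12-31"])

# side="right" counts the stored values <= each wanted value;
# subtracting 1 gives the position of the last row at or before it (-1 if there is none).
positions = stored.searchsorted(wanted, side="right") - 1

# Drop lookups that fall before the first stored row.
valid = positions[positions >= 0]

print(positions, valid)
```

The `valid` positions play the same role as c[c.isin(dates)].index in the example above, so they could be passed to store.select through its where argument in the same way.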

No item named 'timestamp' for DataFrame while there is really one

I have extracted the table below from a csv file :
date user_id whole_cost cost1
02/10/2012 00:00:00 1 1790 12
07/10/2012 00:00:00 1 364 15
30/01/2013 00:00:00 1 280 10
02/02/2013 00:00:00 1 259 24
05/03/2013 00:00:00 1 201 39
02/10/2012 00:00:00 3 623 1
07/12/2012 00:00:00 3 90 0
30/01/2013 00:00:00 3 312 90
02/02/2013 00:00:00 5 359 45
05/03/2013 00:00:00 5 301 34
02/02/2013 00:00:00 5 359 1
05/03/2013 00:00:00 5 801 12
For this purpose I used the following statement :
import pandas as pd
newnames = ['date','user_id', 'whole_cost', 'cost1']
df = pd.read_csv('expenses.csv', names = newnames, index_col = 'timestamp')
pivoted = df.pivot('timestamp','user_id')
But this last line generate the error message : no item named timestamp.
Many thanks in advance for your help.
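It looks like the column name timestamp is not present in the dataframe.
Try index_col = 'date' instead of index_col = 'timestamp', and also pass parse_dates = ['date'] to pd.read_csv. Note that header must be an integer rather than False: header = 0 replaces an existing header row with names (assuming the file starts with one), and dayfirst = True matches the dd/mm/yyyy dates shown in the question.
This should work:
df = pd.read_csv('expenses.csv', header = 0, names = newnames, index_col = 'date', parse_dates = ['date'], dayfirst = True)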
Hope this helps.
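For the pivot step itself, the column has to be referred to by the name it actually has, which is date. This is a minimal sketch on three of the rows from the question, assuming a comma-separated file with a header row; the in-memory `csv_text` replaces expenses.csv for illustration, and date is kept as a regular column so that pivot can use it.

```python
import io
import pandas as pd

# A few rows from the question, written as comma-separated text (assumed layout).
csv_text = """date,user_id,whole_cost,cost1
02/10/2012 00:00:00,1,1790,12
07/10/2012 00:00:00,1,364,15
02/10/2012 00:00:00,3,623,1
"""

df = pd.read_csv(io.StringIO(csv_text))

# Parse the day-first dates explicitly so 02/10/2012 is read as 2 October.
df["date"] = pd.to_datetime(df["date"], format="%d/%m/%Y %H:%M:%S")

# Pivot on the column that exists: one row per date, one column per user.
pivoted = df.pivot(index="date", columns="user_id", values="whole_cost")

print(pivoted)
```

The full table in the question repeats some (date, user_id) pairs, and pivot raises an error on duplicates, so the real data would need an aggregation such as pivot_table instead.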
