Reset index after groupby operation

Reset index after groupby operation - python

by_month = df_omsk_last_year.groupby(df_omsk_last_year.index.month, as_index=False).agg({'T': ['mean', 'min', 'max']})
by_month = by_month.reset_index()
by_month = by_month.rename(columns={'mean':'mean__'})
by_month.info()
by_month['mean__']
I have key error, of course.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 4 columns):
(index, ) 12 non-null int64
(T, mean__) 12 non-null float64
(T, min) 12 non-null float64
(T, max) 12 non-null float64
dtypes: float64(3), int64(1)
memory usage: 464.0 bytes
What I should do? I have tried a lot of ways.
index is datetime, T is float.

Problem is MultiIndex in columns with same level T. You can prevent it by specify column after groupby for processing:
df_omsk_last_year = pd.DataFrame({
'A':list('abcdef'),
'B':[4,5,4,5,5,4],
'T':[7,8,9,4,2,3],
}, index=pd.date_range('2015-01-01', periods=6, freq='10D'))
by_month = (df_omsk_last_year.groupby(df_omsk_last_year.index.month.rename('month'))['T']
.agg(['mean', 'min', 'max'])
.rename(columns={'mean':'mean__'})
.reset_index())
print (by_month)
month mean__ min max
0 1 7.0 4 9
1 2 2.5 2 3
Or by named aggregations:
by_month = (df_omsk_last_year.groupby(df_omsk_last_year.index.month)
.agg(mean__=('T', 'mean'),
min__=('T', 'min'),
max__=('T', 'max'))
.reset_index())

Related

Including minutes column in CSV breaks date parsing

Problem
I have a CSV file with components of the date and time in separate columns. When I use pandas.read_csv, I can use the parse_date kwarg to combine the components into a single datetime column if I don't include the minutes column.
Example
Consider the following example:
from io import StringIO
import pandas
data = """\
GAUGE,YEAR,MONTH,DAY,HOUR,MINUTE,PRECIP
1,2008,03,27,19,30,0.02
1,2008,03,27,19,45,0.06
1,2008,03,27,20,0,0.01
1,2008,03,27,20,30,0.01
1,2008,03,27,21,0,0.12
1,2008,03,27,21,15,0.02
1,2008,03,27,23,15,0.02
1,2008,03,27,23,30,0.01
1,2008,03,30,04,0,0.05
1,2008,03,30,04,15,0.24
"""
df_has_dt = pandas.read_csv(
StringIO(data),
parse_dates={'datetime': ['YEAR', 'MONTH', 'DAY', 'HOUR']},
)
df_no_dt = pandas.read_csv(
StringIO(data),
parse_dates={'datetime': ['YEAR', 'MONTH', 'DAY', 'HOUR', 'MINUTE']},
)
If I look at the .info() method of each dataframe, I get:
The first:
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 datetime 10 non-null datetime64[ns] # <--- good, but doesn't have minutes of course
1 GAUGE 10 non-null int64
2 MINUTE 10 non-null int64
3 PRECIP 10 non-null float64
The second:
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 datetime 10 non-null object # <--- bad!
1 GAUGE 10 non-null int64
2 PRECIP 10 non-null float64
Indeed, df_no_dt.head() shows a very strange "datetime" column:
datetime GAUGE PRECIP
2008 03 27 19 30 1 0.02
2008 03 27 19 45 1 0.06
2008 03 27 20 0 1 0.01
2008 03 27 20 30 1 0.01
2008 03 27 21 0 1 0.12
Question:
What's causing this and how should I efficiently get the minutes of the time into the datetime column?

I'm not sure why just adding on the minutes column for datetime parsing isn't working. But you can specify a function to parse them like so:
from io import StringIO
import pandas
data = """\
GAUGE,YEAR,MONTH,DAY,HOUR,MINUTE,PRECIP
1,2008,03,27,19,30,0.02
1,2008,03,27,19,45,0.06
1,2008,03,27,20,0,0.01
1,2008,03,27,20,30,0.01
1,2008,03,27,21,0,0.12
1,2008,03,27,21,15,0.02
1,2008,03,27,23,15,0.02
1,2008,03,27,23,30,0.01
1,2008,03,30,04,0,0.05
1,2008,03,30,04,15,0.24
"""
DT_COLS = ['YEAR', 'MONTH', 'DAY', 'HOUR', 'MINUTE']
def dt_parser(*args):
return pandas.to_datetime(pandas.DataFrame(zip(*args), columns=DT_COLS))
df = pandas.read_csv(
StringIO(data),
parse_dates={'datetime': DT_COLS},
date_parser=dt_parser,
)
Which outputs:
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 datetime 10 non-null datetime64[ns]
1 GAUGE 10 non-null int64
2 PRECIP 10 non-null float64
dtypes: datetime64[ns](1), float64(1), int64(1)
memory usage: 368.0 bytes
How dt_parser works. It relies on an feature of pd.to_datetime that can recognise common names for date/time types. These aren't fully documented, but at the time of writing they can be found in the pandas source. They all are:
"year", "years",
"month", "months",
"day", "days",
"hour", "hours",
"minute", "minutes",
"second", "seconds",
"ms", "millisecond", "milliseconds",
"us", "microsecond", "microseconds",
"ns", "nanosecond", "nanoseconds",

Pandas convert datetime64 [ns] columns to datetime64 [ns, UTC] for mutliple column at once

I have a dataframe called query_df and some of the columns are in datetime[ns] datatype.
I want to convert all datetime[ns] to datetime[ns, UTC] all at once.
This is what I've done so far by retrieving columns that are datetime[ns]:
dt_columns = [col for col in query_df.columns if query_df[col].dtype == 'datetime64[ns]']
To convert it, I can use pd.to_datetime(query_df["column_name"], utc=True).
Using dt_columns, I want to convert all columns in dt_columns.
How can I do it all at once?
Attempt:
query_df[dt_columns] = pd.to_datetime(query_df[dt_columns], utc=True)
Error:
ValueError: to assemble mappings requires at least that [year, month,
day] be specified: [day,month,year] is missing

You have to use lambda function to achieve this. Try doing this
df[dt_columns] = df[dt_columns].apply(pd.to_datetime, utc=True)

First part of the process is already done by you i.e. grouping the names of the columns whose datatype is to be converted , by using :
dt_columns = [col for col in query_df.columns if query_df[col].dtype == 'datetime64[ns]']
Now , all you have to do ,is to convert all the columns to datetime all at once using pandas apply() functionality :
query_df[dt_columns] = query_df[dt_columns].apply(pd.to_datetime)
This will convert the required columns to the data type you specify.
EDIT:
Without using the lambda function
step 1: Create a dictionary with column names (columns to be changed) and their datatype :
convert_dict = {}
Step 2: Iterate over column names which you extracted and store in the dictionary as key with their respective value as datetime :
for col in dt_columns:
convert_dict[col] = datetime
Step 3: Now convert the datatypes by passing the dictionary into the astype() function like this :
query_df = query_df.astype(convert_dict)
By doing this, all the values of keys will be applied to the columns matching the keys.

Your attempt query_df[dt_columns] = pd.to_datetime(query_df[dt_columns], utc=True) is interpreting dt_columns as year, month, day. Below the example in the help of to_datetime():
Assembling a datetime from multiple columns of a DataFrame. The keys can be
common abbreviations like ['year', 'month', 'day', 'minute', 'second',
'ms', 'us', 'ns']) or plurals of the same
>>> df = pd.DataFrame({'year': [2015, 2016],
... 'month': [2, 3],
... 'day': [4, 5]})
>>> pd.to_datetime(df)
0 2015-02-04
1 2016-03-05
dtype: datetime64[ns]
Below a code snippet that gives you a solution with a little example. Bear in mind that depending in your data format or your application the UTC might not give your the right date.
import pandas as pd
query_df = pd.DataFrame({"ts1":[1622098447.2419431, 1622098447], "ts2":[1622098427.370945,1622098427], "a":[1,2], "b":[0.0,0.1]})
query_df.info()
# convert to datetime in nano seconds
query_df[["ts1","ts2"]] = query_df[["ts1","ts2"]].astype("datetime64[ns]")
query_df.info()
#convert to datetime with UTC
query_df[["ts1","ts2"]] = query_df[["ts1","ts2"]].astype("datetime64[ns, UTC]")
query_df.info()
which outputs:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ts1 2 non-null float64
1 ts2 2 non-null float64
2 a 2 non-null int64
3 b 2 non-null float64
dtypes: float64(3), int64(1)
memory usage: 192.0 bytes
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ts1 2 non-null datetime64[ns]
1 ts2 2 non-null datetime64[ns]
2 a 2 non-null int64
3 b 2 non-null float64
dtypes: datetime64[ns](2), float64(1), int64(1)
memory usage: 192.0 bytes
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ts1 2 non-null datetime64[ns, UTC]
1 ts2 2 non-null datetime64[ns, UTC]
2 a 2 non-null int64
3 b 2 non-null float64
dtypes: datetime64[ns, UTC](2), float64(1), int64(1)
memory usage: 192.0 byte

Currency/Date datframe merge failing

Having issues with merging two datframes (xrate and df) based on currency_str and created_date_time
display(xrate.info())
Int64Index: 1611 entries, 6 to 112
Data columns (total 3 columns):
Date 1611 non-null datetime64[ns]
PX_LAST 1611 non-null object
Currency 1611 non-null object
display(xrate.head(3))
Date PX_LAST Currency
2018-05-30 1 CAD
2018-05-29 1 CAD
2018-05-28 1 CAD
I created a new date to merge on:
#df['formatted_created_date_time'] = df['created_date_time'].dt.strftime('%d%m%Y')
df['formatted_created_date_time'] = df['created_date_time'].dt.strftime('%d-%m-%Y')
#convert to date
#df['formatted_created_date_time'] = pd.to_datetime(df['formatted_created_date_time'], format='%d%m%Y')
df['formatted_created_date_time'] = pd.to_datetime(df['formatted_created_date_time'], format='%d-%m-%Y')
display(df.info())
RangeIndex: 3488 entries, 0 to 3487
Data columns (total 43 columns):
created_date_time 3488 non-null datetime64[ns]
rfq_create_date_time 3488 non-null datetime64[ns]
currency_str 3488 non-null object
display(df.head(3))
Now the two dataframes are merged:
result = pd.merge(df, xrate, left_on=['currency_str', 'formatted_created_date_time'], right_on=['Currency', 'Date'], how='left')
display(result.info())
RangeIndex: 3488 entries, 0 to 3487
Data columns (total 43 columns):
created_date_time 3488 non-null datetime64[ns]
rfq_create_date_time 3488 non-null datetime64[ns]
.
.
formatted_created_date_time 3488 non-null datetime64[ns]
The match has failed:
display(result.head(3))
System Datetime:
Any ideas on this one?

It should working nice.
But another solution is merge by strings:
df['formatted_created_date_time'] = df['created_date_time'].dt.strftime('%d-%m-%Y')
xrate['Date'] = xrate['Date'].dt.strftime('%d-%m-%Y')
result = pd.merge(df, xrate, left_on=['currency_str', 'formatted_created_date_time'],
right_on=['Currency', 'Date'], how='left')
Your solution should be simplify by floor or date
df['formatted_created_date_time'] = df['created_date_time'].dt.floor('d')
xrate['Date'] = xrate['Date'].dt.floor('d')
result = pd.merge(df, xrate, left_on=['currency_str', 'formatted_created_date_time'],
right_on=['Currency', 'Date'], how='left')
df['formatted_created_date_time'] = df['created_date_time'].dt.date
xrate['Date'] = xrate['Date'].dt.date
result = pd.merge(df, xrate, left_on=['currency_str', 'formatted_created_date_time'],
right_on=['Currency', 'Date'], how='left')

Using set_index within a custom function

I would like to convert the date observations from a column into the index for my dataframe. I am able to do this with the code below:
Sample data:
test = pd.DataFrame({'Values':[1,2,3], 'Date':["1/1/2016 17:49","1/2/2016 7:10","1/3/2016 15:19"]})
Indexing code:
test['Date Index'] = pd.to_datetime(test['Date'])
test = test.set_index('Date Index')
test['Index'] = test.index.date
However when I try to include this code in a function, I am able to create the 'Date Index' column but set_index does not seem to work as expected.
def date_index(df):
df['Date Index'] = pd.to_datetime(df['Date'])
df = df.set_index('Date Index')
df['Index'] = df.index.date
If I inspect the output of not using a function info() returns:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 3 entries, 2016-01-01 17:49:00 to 2016-01-03 15:19:00
Data columns (total 3 columns):
Date 3 non-null object
Values 3 non-null int64
Index 3 non-null object
dtypes: int64(1), object(2)
memory usage: 96.0+ bytes
If I inspect the output of the function info() returns:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
Date 3 non-null object
Values 3 non-null int64
dtypes: int64(1), object(1)
memory usage: 120.0+ bytes
I would like the DatetimeIndex.
How can set_index be used within a function? Am I using it incorrectly?

IIUC return df is missing:
df1 = pd.DataFrame({'Values':[1,2,3], 'Exam Completed Date':["1/1/2016 17:49","1/2/2016 7:10","1/3/2016 15:19"]})
def date_index(df):
df['Exam Completed Date Index'] = pd.to_datetime(df['Exam Completed Date'])
df = df.set_index('Exam Completed Date Index')
df['Index'] = df.index.date
return df
print (date_index(df1))
Exam Completed Date Values Index
Exam Completed Date Index
2016-01-01 17:49:00 1/1/2016 17:49 1 2016-01-01
2016-01-02 07:10:00 1/2/2016 7:10 2 2016-01-02
2016-01-03 15:19:00 1/3/2016 15:19 3 2016-01-03
print (date_index(df1).info())
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 3 entries, 2016-01-01 17:49:00 to 2016-01-03 15:19:00
Data columns (total 3 columns):
Exam Completed Date 3 non-null object
Values 3 non-null int64
Index 3 non-null object
dtypes: int64(1), object(2)
memory usage: 96.0+ bytes
None

Merge two dataframes with python

I have two dataframes :dfDepas and df7 ;
dfDepas.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7 entries, 0 to 6
Data columns (total 4 columns):
day_of_week 7 non-null object
P_ACT_KW 7 non-null float64
P_SOUSCR 7 non-null float64
depassement 7 non-null float64
dtypes: float64(3), object(1)
memory usage: 280.0+ bytes
df7.info()
<class 'pandas.core.frame.DataFrame'>
Index: 7 entries, Fri to Thurs
Data columns (total 6 columns):
ACT_TIME_AERATEUR_1_F1 7 non-null float64
ACT_TIME_AERATEUR_1_F3 7 non-null float64
ACT_TIME_AERATEUR_1_F5 7 non-null float64
ACT_TIME_AERATEUR_1_F6 7 non-null float64
ACT_TIME_AERATEUR_1_F7 7 non-null float64
ACT_TIME_AERATEUR_1_F8 7 non-null float64
dtypes: float64(6)
memory usage: 392.0+ bytes
I try to merge these two dataframes according ['day_of_week'] which is the index in dfDepas dataframe.
I don't know how can I use this : merged_df = pd.merge(dfDepas, df7, how='inner',on=['day_of_week'])
Any idea to help me please?
Thank you
Kind regards
EDIT
dfDepas
day_of_week P_ACT_KW P_SOUSCR depassement
Fri 157.258929 427.142857 0.0
Mon 157.788110 426.875000 0.0
Sat 166.989236 426.875000 0.0
Sun 149.676215 426.875000 0.0
Thurs 157.339286 427.142857 0.0
Tues 151.122913 427.016021 0.0
Weds 159.569444 427.142857 0.0
df7
ACT_TIME_AERATEUR_1_F1 ACT_TIME_AERATEUR_1_F3 ACT_TIME_AERATEUR_1_F5 ACT_TIME_AERATEUR_1_F6 ACT_TIME_AERATEUR_1_F7 ACT_TIME_AERATEUR_1_F8
Fri 0.326258 0.330253 0.791144 0.654682 3.204544 1.008550
Sat -0.201327 -0.228196 0.044616 0.184003 -0.579214 0.292886
Sun 5.068735 5.250199 5.407271 5.546657 7.823564 5.786713
Mon -0.587129 -0.559986 -0.294890 -0.155503 2.013379 -0.131496
Tues-1.244922 -1.510025 -0.788717 -1.098790 -0.996845 -0.718881
Weds-3.264598 -3.391776 -3.188409 -3.041306 -4.846189 -4.668533
Thurs -0.178179 0.011002 -1.907544 -2.084516 -6.119337

You can use reset_index and rename column 0 to day_of_week for matching:
merged_df = pd.merge(dfDepas,
df7.reset_index().rename(columns={0:'day_of_week'}),
on=['day_of_week'])
Thank you Quickbeam2k1 for another solution:
merged_df = pd.merge(dfDepas.set_index('day_of_week'),
df7,
right_index=True,
left_index =True)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Reset index after groupby operation - python

Related

Including minutes column in CSV breaks date parsing

Pandas convert datetime64 [ns] columns to datetime64 [ns, UTC] for mutliple column at once

Currency/Date datframe merge failing

Using set_index within a custom function

Merge two dataframes with python

Categories

Resources