My dataset is similar to the below:
import pandas as pd

data = [['Jane', 10, 10.5, 11, 45, 66, 21, 88, 99, 77, 41, 22],
        ['John', 11, 22, 55, 34, 44, 22, 44, 55, 88, 56, 47],
        ['Tom', 23, 32, 43, 12, 11, 44, 77, 85, 99, 45, 63]]
df = pd.DataFrame(data, columns=['Name', '09-Aug-21', 'Aug-21', '02-Sep-21', 'Sep-21', '18-Oct-21', 'Oct-21',
                                 '02-Nov-21', 'Nov-21', '14-Dec-21', 'Dec-21', '15-Jan-22'])
df
How can I add columns to this which show the quarterly figure, i.e. an average of the preceding three months? For example, suppose we start by adding a column after 'Dec-21' called 'Q4 2021' which takes the average of the columns 'Oct-21', 'Nov-21' and 'Dec-21'.
Will I need to create a function which takes the preceding three values and returns an average, and then concatenate the result to my dataframe? The new columns do not have to sit directly after each period; I am also happy to add all of the quarterly averages right at the end.
import pandas as pd
def get_quarter_name(timestamp):
    """Convert a timestamp like '2021-12-01' to 'Q4-2021'."""
    return f"Q{timestamp.quarter}-{timestamp.year}"
# your data
data = [['Jane', 10, 10.5, 11, 45, 66, 21, 88, 99, 77, 41, 22],
        ['John', 11, 22, 55, 34, 44, 22, 44, 55, 88, 56, 47],
        ['Tom', 23, 32, 43, 12, 11, 44, 77, 85, 99, 45, 63]]
df = pd.DataFrame(data, columns=['Name', '09-Aug-21', 'Aug-21', '02-Sep-21', 'Sep-21', '18-Oct-21', 'Oct-21',
                                 '02-Nov-21', 'Nov-21', '14-Dec-21', 'Dec-21', '15-Jan-22'])
# keep only the monthly columns, i.e. those that do not start with a digit
cols = [col for col in df.columns if not col[0].isdigit()]
# extract only relevant columns and transpose
df_T = df[cols].set_index("Name").T
# convert index values ('Aug-21' etc.) to dates
df_T.index = pd.to_datetime(df_T.index, format='%b-%y')
# resample by Quarters and transpose again to original format
df_quarter = df_T.resample("Q").mean().T
# rename columns to quarter-like descriptions
df_quarter.columns = [get_quarter_name(col) for col in df_quarter.columns]
df_quarter is your final answer, which you can merge back into the original df (see the sketch after the output below).
Output:
Q3-2021 Q4-2021
Name
Jane 27.75 53.666667
John 28.00 44.333333
Tom 22.00 58.000000
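To attach the quarterly columns to the original frame, a merge on Name should work (a minimal sketch; df_quarter carries Name as its index):
# df_quarter is indexed by Name, so join the original frame on that index
df = df.merge(df_quarter, left_on="Name", right_index=True)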
Here is one way to do it:
# define the months belonging to each quarter (the labels q1..q3 are arbitrary)
q1=['Aug','Sep']
q2=['Oct','Nov']
q3=['Dec','Jan']
df['q1'] = df[df.columns[df.columns.str.contains('|'.join(q1))]].mean(axis=1)
df['q2'] = df[df.columns[df.columns.str.contains('|'.join(q2))]].mean(axis=1)
df['q3'] = df[df.columns[df.columns.str.contains('|'.join(q3))]].mean(axis=1)
df
Name 09-Aug-21 Aug-21 02-Sep-21 Sep-21 18-Oct-21 Oct-21 02-Nov-21 Nov-21 14-Dec-21 Dec-21 15-Jan-22 q1 q2 q3
0 Jane 10 10.5 11 45 66 21 88 99 77 41 22 19.125 68.50 46.666667
1 John 11 22.0 55 34 44 22 44 55 88 56 47 30.500 41.25 63.666667
2 Tom 23 32.0 43 12 11 44 77 85 99 45 63 27.500 54.25 69.000000
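Note that str.contains also matches the daily columns such as '09-Aug-21', so these figures average the daily values together with the monthly ones (e.g. Jane's q1 of 19.125 is the mean of 10, 10.5, 11 and 45). If you want only the month-level columns, an anchored regex via df.filter is one option (a sketch):
# match monthly columns like 'Aug-21' but not daily ones like '09-Aug-21'
df['q1'] = df.filter(regex=rf"^({'|'.join(q1)})-\d{{2}}$").mean(axis=1)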
This is kinda messy, but it should allow you to dynamically generate a column for each quarter (it does not include the quarter's year; you could add that logic if you want, as sketched after the code).
data = [['Jane', 10, 10.5, 11, 45, 66, 21, 88, 99, 77, 41, 22],
        ['John', 11, 22, 55, 34, 44, 22, 44, 55, 88, 56, 47],
        ['Tom', 23, 32, 43, 12, 11, 44, 77, 85, 99, 45, 63]]
df = pd.DataFrame(data, columns=['Name', '09-Aug-21', 'Aug-21', '02-Sep-21', 'Sep-21', '18-Oct-21', 'Oct-21',
                                 '02-Nov-21', 'Nov-21', '14-Dec-21', 'Dec-21', '15-Jan-22'])
columns_to_use = [column for column in df.columns if column[0].isalpha()]
df = df[columns_to_use]
df = df.melt(id_vars = 'Name')
df['variable'] = '01-' + df['variable']
df['variable'] = pd.to_datetime(df['variable'], format='%d-%b-%y')
df['Quarter'] = df['variable'].dt.quarter
df['Quarter_Avg'] = df.groupby(['Name', 'Quarter'])['value'].transform('mean')
df1 = df.groupby(['Name', 'Quarter'])['Quarter_Avg'].agg('mean').reset_index()
df1['Quarter'] = 'Quarter ' + df1['Quarter'].astype(str)
df1 = df1.pivot_table(index = 'Name', columns = 'Quarter', values = 'Quarter_Avg').reset_index()
df['variable'] = df['variable'].astype(str)
df['variable'] = df['variable'].apply(lambda x : '-'.join(x.split('-')[0:2]))
df = df.pivot_table(index = 'Name', columns = 'variable', values = 'value').reset_index()
df_final = df.merge(df1, on = 'Name')
df_final
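One caveat, as noted above: dt.quarter returns only the quarter number, so quarters from different years (e.g. Q3 2021 and Q3 2022) would be pooled together. Swapping the Quarter line for a period-based label keeps the year (a sketch):
# label each date with its year-qualified quarter, e.g. '2021Q3'
df['Quarter'] = df['variable'].dt.to_period('Q').astype(str)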
A fair number of steps, but it gives you the expected result:
import pandas as pd
data = [['Jane', 10, 10.5, 11, 45, 66, 21, 88, 99, 77, 41, 22, 22],
        ['John', 11, 22, 55, 34, 44, 22, 44, 55, 88, 56, 47, 47],
        ['Tom', 23, 32, 43, 12, 11, 44, 77, 85, 99, 45, 63, 63]]
df = pd.DataFrame(data, columns=['Name', '09-Aug-21', 'Aug-21', '02-Sep-21', 'Sep-21', '18-Oct-21', 'Oct-21',
                                 '02-Nov-21', 'Nov-21', '14-Dec-21', 'Dec-21', '15-Jan-22', 'Jan-22'])
# Melt the data frame by date
meltedDF = df.melt(id_vars=["Name"], var_name=["Date"])
# Remove the dates that don't match the "Month-year" format
meltedDF = meltedDF[pd.to_datetime(meltedDF.Date, format='%b-%y', errors='coerce').notna()].reset_index(drop=True)
# Convert those dates to datetime objects
meltedDF["Date"] = pd.to_datetime(meltedDF.Date, format='%b-%y')
# Find the quarter each date falls into and append the year to that quarter
meltedDF["Quarter"] = "Q" + meltedDF.Date.dt.quarter.astype(str) + " " + meltedDF.Date.dt.year.astype(str)
# Group by the quarter and the person's name then get the mean of their values
meltedDF = meltedDF.groupby(["Quarter", "Name"], as_index=False).mean().round(1)
# Pivot the table's Quarter values to be column names
meltedDF = pd.pivot_table(meltedDF, index=['Name'], values=['value'], columns="Quarter")
# Combine the names and the Quarter total values
meltedDF = pd.concat([meltedDF.reset_index()["Name"], meltedDF.reset_index()["value"]], axis=1)
# Merge these values back into the original Dataframe
df = df.merge(meltedDF, left_on='Name', right_on='Name')
Output:
Name 09-Aug-21 Aug-21 02-Sep-21 Sep-21 18-Oct-21 Oct-21 02-Nov-21 Nov-21 14-Dec-21 Dec-21 15-Jan-22 Jan-22 Q1 2022 Q3 2021 Q4 2021
0 Jane 10 10.5 11 45 66 21 88 99 77 41 22 22 22.0 27.8 53.7
1 John 11 22.0 55 34 44 22 44 55 88 56 47 47 47.0 28.0 44.3
2 Tom 23 32.0 43 12 11 44 77 85 99 45 63 63 63.0 22.0 58.0
Here's a sample of the data I'm using:
SCENARIO DATE POD AREA IDOC STATUS TYPE
AAA 02.06.2015 JKJKJKJKJKK 4210 713375 51 1
AAA 02.06.2015 JWERWERE 4210 713375 51 1
AAA 02.06.2015 JAFDFDFDFD 4210 713375 51 9
BBB 02.06.2015 AAAAAAAA 5400 713504 51 43
CCC 05.06.2015 BBBBBBBBBB 4100 756443 51 187
AAA 05.06.2015 EEEEEEEE 4100 756457 53 228
I have written the following code in pandas to group by:
import pandas as pd
import numpy as np
xl = pd.ExcelFile("MRD.xlsx")
df = xl.parse("Sheet3")
#print(df.columns.values)
# The following gave ValueError: Cannot label index with a null key
# dfi = df.pivot('SCENARIO')
# Here I do not actually need it to count every column, just a specific one
table = df.groupby(["SCENARIO", "STATUS", "TYPE"]).agg(['count'])
writer = pd.ExcelWriter('pandas.out.xlsx', engine='xlsxwriter')
table.to_excel(writer, sheet_name='Sheet1')
writer.save()
table2 = pd.DataFrame(df.groupby(["SCENARIO", "STATUS", "TYPE"])['TYPE'].count())
print (table2)
writer2 = pd.ExcelWriter('pandas2.out.xlsx', engine='xlsxwriter')
table2.to_excel(writer2, sheet_name='Sheet1')
writer2.save()
This yields the result:
SCENARIO STATUS TYPE TYPE
AAA 51 1 2
9 1
53 228 1
BBB 51 43 1
CCC 51 187 1
Name: TYPE, dtype: int64
How could I add subtotals per group? Ideally I would want to achieve something like:
SCENARIO STATUS TYPE TYPE
AAA 51 1 2
9 1
Total 3
53 228 1
Total 1
BBB 51 43 1
Total 1
CCC 51 187 1
Total 1
Name: TYPE, dtype: int64
Is this possible?
Use:
#if necessary convert TYPE column to string
df['TYPE'] = df['TYPE'].astype(str)
df = df.groupby(["SCENARIO", "STATUS", "TYPE"])['TYPE'].count()
#aggregate sum by first 2 levels
df1 = df.groupby(["SCENARIO", "STATUS"]).sum()
#add a third level to the MultiIndex
df1.index = [df1.index.get_level_values(0),
             df1.index.get_level_values(1),
             ['Total'] * len(df1)]
#thanks MaxU for improving
#df1 = df1.set_index(np.array(['Total'] * len(df1)), append=True)
print (df1)
SCENARIO STATUS
AAA 51 Total 3
53 Total 1
BBB 51 Total 1
CCC 51 Total 1
Name: TYPE, dtype: int64
#join together and sort
df = pd.concat([df, df1]).sort_index(level=[0,1])
print (df)
SCENARIO STATUS TYPE
AAA 51 1 2
9 1
Total 3
53 228 1
Total 1
BBB 51 43 1
Total 1
CCC 51 187 1
Total 1
Name: TYPE, dtype: int64
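If you need this pattern on several frames, the count-plus-subtotal steps can be wrapped in a small helper (a sketch; the function name is my own, and it is applied to the original DataFrame):
def count_with_subtotals(frame, outer_cols, inner_col):
    # innermost-level counts; inner_col should be string-typed
    # so the 'Total' rows sort after the regular values
    counts = frame.groupby(outer_cols + [inner_col])[inner_col].count()
    # per-group subtotals, labelled 'Total' on a new innermost level
    totals = counts.groupby(level=outer_cols).sum()
    totals.index = pd.MultiIndex.from_arrays(
        [totals.index.get_level_values(name) for name in outer_cols]
        + [['Total'] * len(totals)],
        names=outer_cols + [inner_col])
    return pd.concat([counts, totals]).sort_index(level=list(range(len(outer_cols))))

# usage, on the original frame:
# result = count_with_subtotals(df, ["SCENARIO", "STATUS"], "TYPE")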
The same counts can be achieved with a pandas pivot table:
table = pd.pivot_table(df, values=['TYPE'], index=['SCENARIO', 'STATUS'], aggfunc='count')
table
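Note that this reproduces only the per-group counts; pivot_table's margins option adds a single grand total labelled 'All' rather than per-group subtotals (a sketch):
# margins=True appends one overall 'All' row, not per-group 'Total' rows
table = pd.pivot_table(df, values='TYPE', index=['SCENARIO', 'STATUS'],
                       aggfunc='count', margins=True)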
Chris Moffitt has created a library named sidetable to ease this process; it can be used on a grouped result via an accessor, which makes this very easy. That said, the accepted answer and its comments are a gold mine that I feel is worth checking out first.
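From memory of sidetable's API (treat this as an unverified sketch and check the project's README), the subtotal accessor is applied to an aggregated frame; POD is counted here since TYPE is a grouping key:
# pip install sidetable; importing it registers the .stb accessor
import sidetable
out = df.groupby(['SCENARIO', 'STATUS', 'TYPE']).agg({'POD': 'count'}).stb.subtotal()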
I have a dataframe that looks like:
[date1] [date1] [date2] [date2]
[Min:] [Max:] [Min:] [Max:]
A B C D
and my desired output would look like:
['Date'] ['Min'] ['Max']
[date 1] A B
[date 2] C D
How would I do this in pandas?
I'm simply importing a CSV I have saved locally.
import pandas as pd
import csv
import datetime
SampleWeatherDate = pd.read_csv('weatherdata.csv')
This is what my data looks like in Excel.
You can use T and pivot if the first and second rows are the columns:
print(df)
date1 date2
Min Max Min Max
0 A B C D
print(df.columns)
MultiIndex(levels=[[u'date1', u'date2'], [u'Max', u'Min']],
labels=[[0, 0, 1, 1], [1, 0, 1, 0]])
#transpose and reset_index
df = df.T.reset_index()
#set column names
df.columns = ['a', 'b', 'c']
print(df)
a b c
0 date1 Min A
1 date1 Max B
2 date2 Min C
3 date2 Max D
#pivot
print(df.pivot(index='a', columns='b', values='c'))
b Max Min
a
date1 B A
date2 D C
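Since the dates already form the first level of the columns, the same reshape can be written with stack (a sketch):
# move the date level of the columns into the index, then drop the old row label
print(df.stack(level=0).droplevel(0))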
Solution with data:
import pandas as pd
import io
temp=u"""Date;2/4/17;2/4/17;2/5/17;2/5/17;2/6/17;2/6/17
City:;Min:;Max:;Min:;Max:;Min:;Max:
New York;28;34;29;35;30;36
Los Angeles;80;86;81;87;82;88"""
#after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), sep=";", index_col=0, header=[0,1])
print(df)
Date 2/4/17 2/5/17 2/6/17
City: Min: Max: Min: Max: Min: Max:
New York 28 34 29 35 30 36
Los Angeles 80 86 81 87 82 88
#transpose and reset_index
df = df.T.reset_index()
#convert column Date to datetime
df['Date'] = pd.to_datetime(df['Date'])
#strip : from column City:
df['City:'] = df['City:'].str.strip(':')
#remove : from column name City:
df.rename(columns={'City:':'City'}, inplace=True)
print(df)
Date City New York Los Angeles
0 2017-02-04 Min 28 80
1 2017-02-04 Max 34 86
2 2017-02-05 Min 29 81
3 2017-02-05 Max 35 87
4 2017-02-06 Min 30 82
5 2017-02-06 Max 36 88
print(df.pivot(index='Date', columns='City'))
New York Los Angeles
City Max Min Max Min
Date
2017-02-04 34 28 86 80
2017-02-05 35 29 87 81
2017-02-06 36 30 88 82
You don't need the csv module, as you can read it directly with Pandas.
df = sample_weather_data = pd.read_csv('weatherdata.csv')
Your source data is formatted poorly, so there is quite a bit of munging to do.
>>> df
Date 2/4/17 2/4/17.1 2/5/17 2/5/17.1 2/6/17 2/6/17.1
0 City: Min: Max: Min: Max: Min: Max:
1 New York 28 34 29 35 30 36
2 Los Angeles 80 86 81 87 82 88
First, note how the dates repeat, with .1 appended to the second occurrence of each. Also note that the first column is Date:
>>> df.columns
Index([u'Date', u'2/4/17', u'2/4/17.1', u'2/5/17', u'2/5/17.1', u'2/6/17', u'2/6/17.1'], dtype='object')
Let's extract every other date column, starting with the first (note that Python uses zero-based indexing).
dates = df.columns[1::2]
>>> dates
Index([u'2/4/17', u'2/5/17', u'2/6/17'], dtype='object')
While we're at it, we can convert them to timestamps.
dates = pd.to_datetime(dates)
>>> dates
DatetimeIndex(['2017-02-04', '2017-02-05', '2017-02-06'], dtype='datetime64[ns]', freq=None)
We can use the same technique to extract the City, Min and Max values. iloc does integer-location selection and takes (row, column) indices. We are ignoring the first row (index zero), so we use [1:] to select all rows except the first one.
cities = df.iloc[1:, 0] # Column 0
min_max_vals = df.iloc[1:, 1:] # Every column starting at 1, ignoring first row.
We can index min_max_vals with cities:
min_max_vals.index = cities
We now need to create a MultiIndex with the dates and Min/Max and assign it to the dataframe.
min_max_vals.columns = pd.MultiIndex.from_product([dates, ['Min', 'Max']])
Your desired output above is missing the city, so I assume you really want something like this:
['City 1'] ['City 2']
['Date'] ['Min'] ['Max'] ['Min'] ['Max']
[date 1] A B E F
[date 2] C D G H
Transposing the results and unstacking:
>>> min_max_vals.T.unstack()
Date New York Los Angeles
Max Min Max Min
2017-02-04 34 28 86 80
2017-02-05 35 29 87 81
2017-02-06 36 30 88 82
Summary
df = sample_weather_data = pd.read_csv('weatherdata.csv')
dates = pd.to_datetime(df.columns[1::2])
min_max_vals = df.iloc[1:, 1:]
min_max_vals.index = df.iloc[1:, 0]
min_max_vals.columns = pd.MultiIndex.from_product([dates, ['Min', 'Max']])
df = min_max_vals.T.unstack()