Grouping data by column name - python

I have a dataframe that looks like:
[date1] [date1] [date2] [date2]
[Min:] [Max:] [Min:] [Max:]
A B C D
and my desired output would look like:
['Date'] ['Min'] ['Max']
[date 1] A B
[date 2] C D
How would I do this in pandas?
I'm simply importing a csv I have locally saved.
import pandas as pd
import csv
import datetime
SampleWeatherDate = pd.read_csv('weatherdata.csv')
This is what my data looks like in excel

You can use T and pivot if the first and second rows are the column headers:
print(df)
date1 date2
Min Max Min Max
0 A B C D
print(df.columns)
MultiIndex(levels=[['date1', 'date2'], ['Max', 'Min']],
           labels=[[0, 0, 1, 1], [1, 0, 1, 0]])
#transpose and reset_index
df = df.T.reset_index()
#set columns names
df.columns = ['a','b','c']
print(df)
a b c
0 date1 Min A
1 date1 Max B
2 date2 Min C
3 date2 Max D
#pivot
print(df.pivot(index='a', columns='b', values='c'))
b Max Min
a
date1 B A
date2 D C
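If you also need the exact layout from the question (Date as a plain column, Min before Max), a small follow-up sketch:
out = df.pivot(index='a', columns='b', values='c')[['Min', 'Max']].reset_index()
out.columns = ['Date', 'Min', 'Max']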
Solution with data:
import pandas as pd
import io
temp=u"""Date;2/4/17;2/4/17;2/5/17;2/5/17;2/6/17;2/6/17
City:;Min:;Max:;Min:;Max:;Min:;Max:
New York;28;34;29;35;30;36
Los Angeles;80;86;81;87;82;88"""
#after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), sep=";", index_col=0, header=[0,1])
print(df)
Date 2/4/17 2/5/17 2/6/17
City: Min: Max: Min: Max: Min: Max:
New York 28 34 29 35 30 36
Los Angeles 80 86 81 87 82 88
#transpose and reset_index
df = df.T.reset_index()
#convert column Date to datetime
df['Date'] = pd.to_datetime(df['Date'])
#strip : from column City:
df['City:'] = df['City:'].str.strip(':')
#remove : from column name City:
df.rename(columns={'City:':'City'}, inplace=True)
print(df)
Date City New York Los Angeles
0 2017-02-04 Min 28 80
1 2017-02-04 Max 34 86
2 2017-02-05 Min 29 81
3 2017-02-05 Max 35 87
4 2017-02-06 Min 30 82
5 2017-02-06 Max 36 88
print(df.pivot(index='Date', columns='City'))
New York Los Angeles
City Max Min Max Min
Date
2017-02-04 34 28 86 80
2017-02-05 35 29 87 81
2017-02-06 36 30 88 82

You don't need the csv module, as you can read it directly with Pandas.
df = sample_weather_data = pd.read_csv('weatherdata.csv')
Your source data is formatted poorly, so there is quite a bit of munging to do.
>>> df
Date 2/4/17 2/4/17.1 2/5/17 2/5/17.1 2/6/17 2/6/17.1
0 City: Min: Max: Min: Max: Min: Max:
1 New York 28 34 29 35 30 36
2 Los Angeles 80 86 81 87 82 88
First, note how the dates repeat, with .1 appended to the second occurrence of each date. Also note that the first column is Date:
>>> df.columns
Index(['Date', '2/4/17', '2/4/17.1', '2/5/17', '2/5/17.1', '2/6/17', '2/6/17.1'], dtype='object')
Let's extract every other date starting with the first (note that Python uses zero based indexing).
dates = df.columns[1::2]
>>> dates
Index(['2/4/17', '2/5/17', '2/6/17'], dtype='object')
While we're at it, we can convert them to timestamps.
dates = pd.to_datetime(dates)
>>> dates
DatetimeIndex(['2017-02-04', '2017-02-05', '2017-02-06'], dtype='datetime64[ns]', freq=None)
We can use the same technique to extract the City, Min and Max values. iloc is for integer-location selection and uses (row, column) indices. We are ignoring the first row (index zero), so we use [1:] to select all rows except the first one.
cities = df.iloc[1:, 0] # Column 0
min_max_vals = df.iloc[1:, 1:] # Every column starting at 1, ignoring first row.
We can index min_max_vals with cities:
min_max_vals.index = cities
We now need to create a MultiIndex with the dates and Min/Max and assign it to the dataframe.
min_max_vals.columns = pd.MultiIndex.from_product([dates, ['Min', 'Max']])
Your desired output above is missing the city, so I assume you really want something like this:
['City 1'] ['City 2']
['Date'] ['Min'] ['Max'] ['Min'] ['Max']
[date 1] A B E F
[date 2] C D G H
Transposing the results and unstacking:
>>> min_max_vals.T.unstack()
Date New York Los Angeles
Max Min Max Min
2017-02-04 34 28 86 80
2017-02-05 35 29 87 81
2017-02-06 36 30 88 82
Summary
df = sample_weather_data = pd.read_csv('weatherdata.csv')
dates = pd.to_datetime(df.columns[1::2])
min_max_vals = df.iloc[1:, 1:]
min_max_vals.index = df.iloc[1:, 0]
min_max_vals.columns = pd.MultiIndex.from_product([dates, ['Min', 'Max']])
df = min_max_vals.T.unstack()
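One caveat to hedge: because the Min:/Max: labels sit in the first data row of the csv, read_csv types every column as strings, so the final frame likely needs a numeric cast before any math, e.g.:
df = df.apply(pd.to_numeric)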

Related

How would I find the quarterly averages of these monthly figures?

My dataset is similar to the below:
import pandas as pd

data = [['Jane', 10,10.5,11,45,66,21,88,99,77,41,22], ['John',11,22,55,34,44,22,44,55,88,56,47],['Tom',23,32,43,12,11,44,77,85,99,45,63]]
df = pd.DataFrame(data, columns = ['Name', '09-Aug-21', 'Aug-21', '02-Sep-21', 'Sep-21', '18-Oct-21', 'Oct-21', '02-Nov-21','Nov-21','14-Dec-21', 'Dec-21', '15-Jan-22'])
df
How can I add columns to this which show the quarterly figure, which is an average of the preceding three months? E.g., suppose we start by adding a column after 'Dec-21' called Q4 2021, which takes the average of the columns called 'Oct-21', 'Nov-21' and 'Dec-21'.
Will I need to create a function which takes the preceding three values and returns an average, and then concatenate this to my dataframe? It does not have to be directly after each period; e.g., I am also happy to add all of the quarterly averages right at the end.
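For the single hard-coded quarter described above, a minimal sketch (using the sample column names; 'Q4 2021' is just an illustrative label) would be:
df['Q4 2021'] = df[['Oct-21', 'Nov-21', 'Dec-21']].mean(axis=1)
The answers below generalize this to every quarter.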
import pandas as pd
from datetime import datetime

def get_quarter_name(timestamp):
    """Convert '2021-12-01' to 'Q4-2021'"""
    return f"Q{timestamp.quarter}-{timestamp.year}"
# your data
data = [['Jane', 10,10.5,11,45,66,21,88,99,77,41,22], ['John',11,22,55,34,44,22,44,55,88,56,47],['Tom',23,32,43,12,11,44,77,85,99,45,63]]
df = pd.DataFrame(data, columns = ['Name', '09-Aug-21', 'Aug-21', '02-Sep-21', 'Sep-21', '18-Oct-21', 'Oct-21', '02-Nov-21','Nov-21','14-Dec-21', 'Dec-21', '15-Jan-22'])
# filter only relevant columns, which start with an alphabetical character
cols = [col for col in df.columns if not col[0].isdigit()]
# extract only relevant columns and transpose
df_T = df[cols].set_index("Name").T
# convert index values to dates
df_T.index = pd.Index([pd.Timestamp(datetime.strptime(d,'%b-%y').strftime('%Y-%m-%d')) for d in df_T.index])
# resample by Quarters and transpose again to original format
df_quarter = df_T.resample("Q").mean().T
# rename columns to quarter-like descriptions
df_quarter.columns = [get_quarter_name(col) for col in df_quarter.columns]
df_quarter is your final answer, which you can merge back into the original df.
Output:
Q3-2021 Q4-2021
Name
Jane 27.75 53.666667
John 28.00 44.333333
Tom 22.00 58.000000
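To attach the quarterly columns back to the original frame, a merge on the name should work (a sketch, assuming Name values are unique):
df = df.merge(df_quarter, left_on='Name', right_index=True)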
Here is one way to do it:
# Define your quarter months
q1=['Aug','Sep']
q2=['Oct','Nov']
q3=['Dec','Jan']
df['q1']=df[df.columns[df.columns.str.contains('|'.join(q1))]].mean(axis=1)
df['q2']=df[df.columns[df.columns.str.contains('|'.join(q2))]].mean(axis=1)
df['q3']=df[df.columns[df.columns.str.contains('|'.join(q3))]].mean(axis=1)
df
Name 09-Aug-21 Aug-21 02-Sep-21 Sep-21 18-Oct-21 Oct-21 02-Nov-21 Nov-21 14-Dec-21 Dec-21 15-Jan-22 q1 q2 q3
0 Jane 10 10.5 11 45 66 21 88 99 77 41 22 19.125 68.50 46.666667
1 John 11 22.0 55 34 44 22 44 55 88 56 47 30.500 41.25 63.666667
2 Tom 23 32.0 43 12 11 44 77 85 99 45 63 27.500 54.25 69.000000
This is kinda messy, but it should allow you to dynamically generate a column for each quarter (it does not include the quarter's year; you could add that logic if you want).
import pandas as pd

data = [['Jane', 10,10.5,11,45,66,21,88,99,77,41,22], ['John',11,22,55,34,44,22,44,55,88,56,47],['Tom',23,32,43,12,11,44,77,85,99,45,63]]
df = pd.DataFrame(data, columns = ['Name', '09-Aug-21', 'Aug-21', '02-Sep-21', 'Sep-21', '18-Oct-21', 'Oct-21', '02-Nov-21','Nov-21','14-Dec-21', 'Dec-21', '15-Jan-22'])
columns_to_use = [column for column in df.columns if column[0].isalpha()]
df = df[columns_to_use]
df = df.melt(id_vars = 'Name')
df['variable'] = '01-' + df['variable']
df['variable'] = pd.to_datetime(df['variable'],infer_datetime_format=True)
df['Quarter'] = df['variable'].dt.quarter
df['Quarter_Avg'] = df.groupby(['Name', 'Quarter'])['value'].transform('mean')
df1 = df.groupby(['Name', 'Quarter'])['Quarter_Avg'].agg('mean').reset_index()
df1['Quarter'] = 'Quarter ' + df1['Quarter'].astype(str)
df1 = df1.pivot_table(index = 'Name', columns = 'Quarter', values = 'Quarter_Avg').reset_index()
df['variable'] = df['variable'].astype(str)
df['variable'] = df['variable'].apply(lambda x : '-'.join(x.split('-')[0:2]))
df = df.pivot_table(index = 'Name', columns = 'variable', values = 'value').reset_index()
df_final = df.merge(df1, on = 'Name')
df_final
A fair number of steps, but it gives you the expected result:
import pandas as pd
data = [['Jane', 10,10.5,11,45,66,21,88,99,77,41,22,22], ['John',11,22,55,34,44,22,44,55,88,56,47,47],['Tom',23,32,43,12,11,44,77,85,99,45,63,63]]
df = pd.DataFrame(data, columns = ['Name', '09-Aug-21', 'Aug-21', '02-Sep-21', 'Sep-21', '18-Oct-21', 'Oct-21', '02-Nov-21','Nov-21','14-Dec-21', 'Dec-21', '15-Jan-22', 'Jan-22'])
# Melt the data frame by date
meltedDF = df.melt(id_vars=["Name"], var_name="Date")
# Remove the dates that don't match the "Month-year" format
meltedDF = meltedDF[pd.to_datetime(meltedDF.Date, format='%b-%y', errors='coerce').notna()].reset_index(drop=True)
# Convert those dates to datetime objects
meltedDF["Date"] = pd.to_datetime(meltedDF.Date, format='%b-%y')
# Find the quarter that those dates fall into and add the year string to that quarter
meltedDF["Quarter"] = "Q" + meltedDF.Date.dt.quarter.astype(str) + " " + meltedDF.Date.dt.year.astype(str)
# Group by the quarter and the person's name then get the mean of their values
meltedDF = meltedDF.groupby(["Quarter", "Name"], as_index=False).mean().round(1)
# Pivot the table's Quarter values to be column names
meltedDF = pd.pivot_table(meltedDF, index=['Name'], values=['value'], columns="Quarter")
# Combine the names and the Quarter total values
meltedDF = pd.concat([meltedDF.reset_index()["Name"], meltedDF.reset_index()["value"]], axis=1)
# Merge these values back into the original Dataframe
df = df.merge(meltedDF, left_on='Name', right_on='Name')
Output:
Name 09-Aug-21 Aug-21 02-Sep-21 Sep-21 18-Oct-21 Oct-21 02-Nov-21 Nov-21 14-Dec-21 Dec-21 15-Jan-22 Jan-22 Q1 2022 Q3 2021 Q4 2021
0 Jane 10 10.5 11 45 66 21 88 99 77 41 22 22 22.0 27.8 53.7
1 John 11 22.0 55 34 44 22 44 55 88 56 47 47 47.0 28.0 44.3
2 Tom 23 32.0 43 12 11 44 77 85 99 45 63 63 63.0 22.0 58.0

using usecols when specifying a multi-index header in Python Pandas

I have a huge file to read with a two-row header, but when I use the multi-index approach I am unable to use 'usecols' in pandas.
When I use
df = pd.read_csv(files, delimiter=' ', header=[0,1])
it takes too much time and memory.
Another approach I tried is
df = pd.read_csv(files, delimiter=' ', usecols = ["80.375"])
but it takes only one column, whereas it should take all four columns with the header '80.375'.
Desired output
Please suggest any alternative approach.
Thanks in advance
You can use two passes to extract the data and the headers.
# read_csv common options
opts = {'sep': ' ', 'header': None}
# Extract headers, create MultiIndex
headers = pd.read_csv('data.csv', **opts, nrows=2)
mi = pd.MultiIndex.from_frame(headers.T)
# Keep desired columns
dti = [0, 1, 2] # Year, Month, Day
cols = mi.get_locs([80.375]).tolist()
# Build dataframe
df = pd.read_csv('data.csv', **opts, skiprows=2, index_col=dti, usecols=dti+cols)
df.columns = mi[cols]
df = df.rename_axis(index=['Year', 'Month', 'Day'], columns=['Lvl1', 'Lvl2'])
df.index = pd.to_datetime(df.index.to_frame()).rename('DateTime')
Output:
>>> df
Lvl1 80.375
Lvl2 28.625 28.875 29.125 29.375
DateTime
2015-01-01 21 22 23 24
2015-01-02 31 32 33 34
2015-01-03 41 42 43 44
2015-01-04 51 52 53 54
Input csv file:
80.125 80.375 80.375 80.375 80.375 80.625
28.875 28.625 28.875 29.125 29.375 28.875
2015 1 1 20 21 22 23 24 25
2015 1 2 30 31 32 33 34 35
2015 1 3 40 41 42 43 44 45
2015 1 4 50 51 52 53 54 55
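As a side note (not required by the question): if you ever need the columns under several top-level headers at once, get_locs accepts a list per level, e.g.:
cols = mi.get_locs([[80.125, 80.375]]).tolist()  # columns under either header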
Update
I need to convert the output to a single header row.
# Extract headers, create MultiIndex
headers = pd.read_csv('data.csv', sep=' ', header=None, nrows=2)
mi = pd.MultiIndex.from_frame(headers.T)
# Keep desired columns
dti_cols = [0, 1, 2] # Year, Month, Day
dti_names = ['Year', 'Month', 'Day']
dat_cols = mi.get_locs([80.375]).tolist()
dat_names = mi[dat_cols].to_flat_index().map(lambda x: f"{x[0]}_{x[1]}").tolist()
# Build dataframe
df = (pd.read_csv('data.csv', sep=' ', header=None, skiprows=2,
                  usecols=dti_cols+dat_cols, names=dti_names+dat_names,
                  parse_dates={'Date': ['Year', 'Month', 'Day']}))
Output:
>>> df
Date 80.375_28.625 80.375_28.875 80.375_29.125 80.375_29.375
0 2015-01-01 21 22 23 24
1 2015-01-02 31 32 33 34
2 2015-01-03 41 42 43 44
3 2015-01-04 51 52 53 54

How can I calculate pct changes between groups of columns efficiently?

I have a set of columns like so:
q1_cash_total, q2_cash_total, q3_cash_total,
q1_shop_us, q2_shop_us, q3_shop_us,
etc. I have about 40 similarly named columns like this, and I wish to calculate the pct changes within each of these groups of 3. E.g., I know that individually I can do:
df[['q1_cash_total', 'q2_cash_total', 'q3_cash_total']].pct_change().add_suffix('_PCT_CHG')
To do this for every group of 3, I do:
q1 = [col for col in df.columns if 'q1' in col ]
q2 = [col for col in df.columns if 'q2' in col ]
q3 = [col for col in df.columns if 'q3' in col ]
q_cols = q1+q2+q3
dflist = []
for col in df[q_cols].columns:
    #col[3:] to just get the col name without the q1_/q2_ etc prefix
    print(col[3:])
    cols = [c for c in df.columns if col[3:] in c]
    pct = df[cols].pct_change().add_suffix('_PCT_CHG')
    dflist.append(pct)
pcts_df = pd.concat(dflist)
I cannot think of a cleaner way to do this. Does anybody have any ideas? How can I also do the pct change between q1 and q3 directly, instead of only successively?
You could create a dataframe containing only the desired columns; for that, filter column names starting with q immediately followed by one or more digits and an underscore (^q\d+?_). Remove the prefix and keep only unique column names using pd.unique. For each unique column name, filter the columns with that specific name and apply the percentage change along the columns axis (.pct_change(axis='columns')) to obtain the changes between q1, q2 and q3.
To get the percentage change between q1 and q3, you can select those columns by name from the previously created dataframe (df_q) and apply the same pct_change executed earlier.
df used as input
q1_cash_total q1_shop_us q2_cash_total q2_shop_us q3_cash_total q3_shop_us another_col numCols dataCols
0 52 93 15 72 61 21 83 87 75
1 75 88 24 3 22 53 2 88 30
2 38 2 64 60 21 33 76 58 22
3 89 49 91 59 42 92 60 80 15
4 62 62 47 62 51 55 64 3 51
df_q = df.filter(regex=r'^q\d+?_')
unique_cols = pd.unique([c[3:] for c in df_q.columns])
dflist = []
for col in unique_cols:
    q_name = df_q.filter(like=col)
    df_s = q_name.pct_change(axis='columns').add_suffix('_PCT_CHG')
    dflist.append(df_s)
    df_s = df_q[[f'q1_{col}', f'q3_{col}']].pct_change(axis='columns').add_suffix('_Q1-Q3')
    dflist.append(df_s)
pcts_df = pd.concat(dflist, axis=1)
Output from pcts_df
q1_cash_total_PCT_CHG q2_cash_total_PCT_CHG q3_cash_total_PCT_CHG ... q3_shop_us_PCT_CHG q1_shop_us_Q1-Q3 q3_shop_us_Q1-Q3
0 NaN -0.711538 3.066667 ... -0.708333 NaN -0.774194
1 NaN -0.680000 -0.083333 ... 16.666667 NaN -0.397727
2 NaN 0.684211 -0.671875 ... -0.450000 NaN 15.500000
3 NaN 0.022472 -0.538462 ... 0.559322 NaN 0.877551
4 NaN -0.241935 0.085106 ... -0.112903 NaN -0.112903
[5 rows x 10 columns]
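An alternative sketch, using a different reshaping route than the loop above: split each column name into (quarter, metric) levels, so a single pct_change(axis=1) covers every group at once:
tmp = df.filter(regex=r'^q\d+_').copy()
tmp.columns = pd.MultiIndex.from_tuples(
    [tuple(c.split('_', 1)) for c in tmp.columns], names=['quarter', 'metric'])
# move metric into the row index; the columns are then just q1..q3
chg = tmp.stack('metric').sort_index(axis=1).pct_change(axis=1).unstack('metric')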

weighted average aggregation on multiple columns of df

I'm trying to calculate a weighted average for multiple columns in a dataframe.
This is a sample of my data
Group  Year  Month  Weight(kg)  Nitrogen  Calcium
A      2020  01     10000       10        70
A      2020  01     15000       4         78
A      2021  05     12000       5         66
A      2021  05     10000       8         54
B      2021  08     14000       10        90
C      2021  08     50000       20        92
C      2021  08     40000       10        95
My desired result would look something like this:
What I've tried:
I can get the correct weighted average values for a single column using this function:
(similar to: link)
def wavg(df, value, weight):
    d = df[value]
    w = df[weight]
    try:
        return (d * w).sum() / w.sum()
    except ZeroDivisionError:
        return d.mean()
I can apply this function to a single column of my df:
df2 = df.groupby(["Group", "year", "month"]).apply(wavg, "Calcium", "Weight(kg)").to_frame()
(Don't mind the different values, they are correct for the data in my notebook)
The obvious problem is that this function only works for a single column, whilst I have dozens of columns. I therefore tried a for loop:
column_list=[]
for column in df.columns:
    column_list.append(df.groupby(["Group", "year", "month"]).apply(wavg, column, "Weight(kg)").to_frame())
It calculates the values correctly, but the columns are placed on top of each other instead of next to each other. They also lack useful column names:
How could I adapt my code to return the desired df?
Change the function to work with multiple columns, and to avoid losing the grouping columns, convert them to a MultiIndex:
def wavg(x, value, weight):
    d = x[value]
    w = x[weight]
    try:
        return (d.mul(w, axis=0)).div(w.sum())
    except ZeroDivisionError:
        return d.mean()
#columns used for groupby
groups = ["Group", "Year", "Month"]
#processing all another columns
cols = df.columns.difference(groups + ["Weight(kg)"], sort=False)
#create index and processing all columns by variable cols
df1 = (df.set_index(groups)
         .groupby(level=groups)
         .apply(wavg, cols, "Weight(kg)")
         .reset_index())
print (df1)
Group Year Month Calcium Nitrogen
0 A 2020 1 28.000000 4.000000
1 A 2020 1 46.800000 2.400000
2 A 2021 5 36.000000 2.727273
3 A 2021 5 24.545455 3.636364
4 B 2021 8 90.000000 10.000000
5 C 2021 8 51.111111 11.111111
6 C 2021 8 42.222222 4.444444
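Since each row above holds only that row's weighted contribution, a final per-group sum yields the actual weighted averages (a small follow-up sketch using the names defined above):
df2 = df1.groupby(groups, as_index=False)[cols.tolist()].sum()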
Try via concat() and reset_index():
df=pd.concat(column_list,axis=1).reset_index()
OR
you can make changes here:
column_list=[]
for column in df.columns:
    column_list.append(df.groupby(["Group", "year", "month"]).apply(wavg, column, "Weight(kg)").reset_index())
#Finally:
df=pd.concat(column_list,axis=1)
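For reference, numpy's average also supports weights directly, so a groupby-apply is another option (a sketch, assuming the column names from the sample data):
import numpy as np
out = (df.groupby(["Group", "Year", "Month"])
         .apply(lambda g: pd.Series({c: np.average(g[c], weights=g["Weight(kg)"])
                                     for c in ["Nitrogen", "Calcium"]}))
         .reset_index())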

How to rename the column names post flattening

This is my dataframe:
Date Group Value Duration
2018-01-01 A 20 30
2018-02-01 A 10 60
2018-03-01 A 25 88
2018-01-01 B 15 180
2018-02-01 B 30 210
2018-03-01 B 25 238
I want to pivot the above df
My Approach:
df_pivot = dealer_f.pivot_table(index='Group', columns='Date', fill_value=0)
df_pivot.columns = df_pivot.columns.map('_'.join)
df_pivot = df_pivot.reset_index()
I am getting an error: TypeError: sequence item 1: expected str instance, int found.
If I simply reset_index, then I get column names like ('Value', 2018-01-01), ('Value', 2018-02-01), etc.
I want to flatten the columns so that my output looks like below
df_pivot.columns.tolist()
['2018-01-01_Value','2018-02-01_Value',.....'2018-01-01_Duration',...]
Any clue as to where I am going wrong?
Use an f-string, which converts both levels to strings ('_'.join fails because the Date level does not hold strings):
df_pivot.columns = [f'{b}_{a}' for a, b in df_pivot.columns]
Or:
df_pivot.columns = [f'{b.strftime("%Y-%m-%d")}_{a}' for a, b in df_pivot.columns]
df_pivot = df_pivot.reset_index()
print (df_pivot)
Group 2018-01-01_Duration 2018-02-01_Duration 2018-03-01_Duration \
0 A 30 60 88
1 B 180 210 238
2018-01-01_Value 2018-02-01_Value 2018-03-01_Value
0 20 10 25
1 15 30 25
