I have a DataFrame of employees and their hours in different categories.
I need to recalculate only specific categories, and only if the OT category is present for that Empl_Id: OT, MILE and REST must NOT be updated; all other categories should be.
import pandas as pd
data = {'Empl_Id': [1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3],
        'Category': ["MILE", "REST", "OT", "TRVL", "REG", "ADMIN", "REST", "REG", "MILE", "OT", "TRVL", "REST", "MAT", "REG"],
        'Value': [43, 0.7, 6.33, 2.67, 52, 22, 1.17, 16.5, 73.6, 4.75, 1.33, 2.5, 5.5, 52.25]}
df = pd.DataFrame(data=data)
df
Empl_Id  Category  Value
1        MILE      43
1        REST      0.7
1        OT        6.33
1        TRVL      2.67
1        REG       52
2        ADMIN     22
2        REST      1.17
2        REG       16.5
3        MILE      73.6
3        OT        4.75
3        TRVL      1.33
3        REST      2.5
3        MAT       5.5
3        REG       52.25
The logic is to:
1) Find OT hours as a percentage of total hours (OT, REST and MILE are excluded from the total):
1st Empl_Id: 6.33 (OT) / (2.67 (TRVL) + 52 (REG)) = 6.33 / 54.67 = 11.58 %
2nd Empl_Id: no OT hours present, so nothing should be updated
3rd Empl_Id: 4.75 (OT) / (1.33 (TRVL) + 5.5 (MAT) + 52.25 (REG)) = 4.75 / 59.08 = 8.04 %
2) Subtract that OT percentage from each category (except OT, REST and MILE):
Empl_Id  Category  Value
1        MILE      43
1        REST      0.7
1        OT        6.33
1        TRVL      2.67 - 11.58 % (0.31) = 2.36
1        REG       52 - 11.58 % (6.02) = 45.98
2        ADMIN     22
2        REST      1.17
2        REG       16.5
3        MILE      73.6
3        OT        4.75
3        TRVL      1.33 - 8.04 % (0.11) = 1.22
3        REST      2.5
3        MAT       5.5 - 8.04 % (0.44) = 5.06
3        REG       52.25 - 8.04 % (4.2) = 48.05
You can use:
keep = ['OT', 'MILE', 'REST']
# get factor: 1 - (OT hours / sum of the hours that may be updated), per employee
factor = (df.groupby(df['Empl_Id'])
            .apply(lambda g: g.loc[g['Category'].eq('OT'), 'Value'].sum()
                             / g.loc[~g['Category'].isin(keep), 'Value'].sum()
                   )
            .rsub(1)
          )
# update only the categories that are not in `keep`
df.loc[~df['Category'].isin(keep), 'Value'] *= df['Empl_Id'].map(factor)
output:
    Empl_Id Category      Value
0         1     MILE  43.000000
1         1     REST   0.700000
2         1       OT   6.330000
3         1     TRVL   2.360852
4         1      REG  45.979148
5         2    ADMIN  22.000000
6         2     REST   1.170000
7         2      REG  16.500000
8         3     MILE  73.600000
9         3       OT   4.750000
10        3     TRVL   1.223069
11        3     REST   2.500000
12        3      MAT   5.057803
13        3      REG  48.049128
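As a possible variation, the same factor can be computed with groupby(...).transform, which broadcasts the per-employee sums back onto every row and avoids apply. This is only a sketch on the question's data; the names adjustable, ot and base are illustrative and do not come from the answer above.

```python
import pandas as pd

data = {'Empl_Id': [1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3],
        'Category': ["MILE", "REST", "OT", "TRVL", "REG", "ADMIN", "REST",
                     "REG", "MILE", "OT", "TRVL", "REST", "MAT", "REG"],
        'Value': [43, 0.7, 6.33, 2.67, 52, 22, 1.17, 16.5, 73.6, 4.75,
                  1.33, 2.5, 5.5, 52.25]}
df = pd.DataFrame(data)

keep = ['OT', 'MILE', 'REST']
adjustable = ~df['Category'].isin(keep)  # rows that may be updated

# Per-employee OT hours, repeated on every row of that employee
ot = (df['Value'].where(df['Category'].eq('OT'), 0)
        .groupby(df['Empl_Id']).transform('sum'))
# Per-employee total of the adjustable hours, repeated on every row
base = (df['Value'].where(adjustable, 0)
          .groupby(df['Empl_Id']).transform('sum'))

# Employees without OT get ot == 0, so their factor is 1 (no change)
df.loc[adjustable, 'Value'] *= (1 - ot / base)[adjustable]
print(df)
```

Employees that have no adjustable rows at all would produce a division by zero in base, but since none of their rows are selected by adjustable, nothing is written back for them.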
I have a huge CSV file loaded as a DataFrame. It has no date column, only the sales for every month from Jan-2022 until Dec-2034. Below is an example of my DataFrame:
import pandas as pd
data = [[6661, 'Mobile Phone', 43578, 5000, 78564, 52353, 67456, 86965, 43634, 32546, 56332, 58944, 98878, 68588, 43634, 3463, 74533, 73733, 64436, 45426, 57333, 89762, 4373, 75457, 74845, 86843, 59957, 74563, 745335, 46342, 463473, 52352, 23622],
[6672, 'Play Station', 4475, 2546, 5757, 2352, 57896, 98574, 53536, 56533, 88645, 44884, 76585, 43575, 74573, 75347, 57573, 5736, 53737, 35235, 5322, 54757, 74573, 75473, 77362, 21554, 73462, 74736, 1435, 4367, 63462, 32362, 56332],
[6631, 'Laptop', 35347, 36376, 164577, 94584, 78675, 76758, 75464, 56373, 56343, 54787, 7658, 76584, 47347, 5748, 8684, 75373, 57573, 26626, 25632, 73774, 847373, 736646, 847457, 57346, 43732, 347346, 75373, 6473, 85674, 35743, 45734],
[6600, 'Camera', 14365, 60785, 25436, 46747, 75456, 97644, 63573, 56433, 25646, 32548, 14325, 64748, 68458, 46537, 7537, 46266, 7457, 78235, 46223, 8747, 67453, 4636, 3425, 4636, 352236, 6622, 64625, 36346, 46346, 35225, 6436],
[6643, 'Lamp', 324355, 143255, 696954, 97823, 43657, 66686, 56346, 57563, 65734, 64484, 87685, 54748, 9868, 573, 73472, 5735, 73422, 86352, 5325, 84333, 7473, 35252, 7547, 73733, 7374, 32266, 654747, 85743, 57333, 46346, 46266]]
ds = pd.DataFrame(data, columns = ['ID', 'Product', 'SalesJan-22', 'SalesFeb-22', 'SalesMar-22', 'SalesApr-22', 'SalesMay-22', 'SalesJun-22', 'SalesJul-22', 'SalesAug-22', 'SalesSep-22', 'SalesOct-22', 'SalesNov-22', 'SalesDec-22', 'SalesJan-23', 'SalesFeb-23', 'SalesMar-23', 'SalesApr-23', 'SalesMay-23', 'SalesJun-23', 'SalesJul-23', 'SalesAug-23', 'SalesSep-23', 'SalesOct-23', 'SalesNov-23', 'SalesDec-23', 'SalesJan-24', 'SalesFeb-24', 'SalesMar-24', 'SalesApr-24', 'SalesMay-24', 'SalesJun-24', 'SalesJul-24'])
Since I have many monthly sales columns, I want to loop over them and insert a date column (named after the month) right after each one. The new columns for the first 6 months should contain the number 1, those for the next 12 months the number 2, the following 12 months the number 3, the 12 months after that the number 4, and so on.
Below is a sample of the result that I want:
Is there any way to loop over the columns and add such a date column beside each monthly sales column?
Here is the simplest approach I can think of:
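# str.removeprefix needs Python 3.9+
for i, col in enumerate(ds.columns[2:]):
    # 2 * i + 3 places the new column directly after its sales column
    ds.insert(2 * i + 3, col.removeprefix("Sales"), (i - 6) // 12 + 2)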
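To see why (i - 6) // 12 + 2 produces the requested numbering, it can help to evaluate the expression on its own. In this sketch, group_number is a hypothetical helper name and 31 is the number of monthly columns in the sample frame. Python's floor division rounds towards negative infinity, so the indices 0 to 5 give -1 before the + 2 is added.

```python
def group_number(i: int) -> int:
    """Map a 0-based month index to its group number (illustrative helper)."""
    # i = 0..5   -> (i - 6) // 12 == -1 -> 1
    # i = 6..17  -> (i - 6) // 12 ==  0 -> 2
    # i = 18..29 -> (i - 6) // 12 ==  1 -> 3, and so on
    return (i - 6) // 12 + 2

numbers = [group_number(i) for i in range(31)]
print(numbers)
```

For the 31 sample months this yields six 1s, twelve 2s, twelve 3s and a single 4.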
Here is a vectorized approach (calling insert repeatedly is inefficient):
import numpy as np
# convert (valid) columns to datetime
cols = pd.to_datetime(ds.columns, format='Sales%b-%y', errors='coerce')
# identify valid dates
m = cols.notna()
# get year
y = cols[m].year
# calculate number (1 for first 6 months, then +1 per 12 months)
num = ((cols[m].month+12*(y-y.min()))+5)//12+1
# slice date columns, assign the number, rename
df2 = (ds.loc[:, m].assign(**dict(zip(ds.columns[m], num)))
         .rename(columns=lambda x: x[5:])
       )
# get new order of columns
idx = np.r_[np.zeros((~m).sum()), np.tile(np.arange(m.sum()), 2)+1]
# concat and reorder (a stable sort keeps each sales column ahead of its number column)
out = pd.concat([ds, df2], axis=1).iloc[:, np.argsort(idx, kind='stable')]
print(out)
output:
ID Product SalesJan-22 Jan-22 SalesFeb-22 Feb-22 SalesMar-22 Mar-22 SalesApr-22 Apr-22 SalesMay-22 May-22 SalesJun-22 Jun-22 SalesJul-22 Jul-22 SalesAug-22 Aug-22 SalesSep-22 Sep-22 SalesOct-22 Oct-22 SalesNov-22 Nov-22 SalesDec-22 Dec-22 SalesJan-23 Jan-23 SalesFeb-23 Feb-23 SalesMar-23 Mar-23 SalesApr-23 Apr-23 SalesMay-23 May-23 SalesJun-23 Jun-23 SalesJul-23 Jul-23 SalesAug-23 Aug-23 SalesSep-23 Sep-23 SalesOct-23 Oct-23 SalesNov-23 Nov-23 SalesDec-23 Dec-23 SalesJan-24 Jan-24 SalesFeb-24 Feb-24 SalesMar-24 Mar-24 SalesApr-24 Apr-24 SalesMay-24 May-24 SalesJun-24 Jun-24 SalesJul-24 Jul-24
0 6661 Mobile Phone 43578 1 5000 1 78564 1 52353 1 67456 1 86965 1 43634 2 32546 2 56332 2 58944 2 98878 2 68588 2 43634 2 3463 2 74533 2 73733 2 64436 2 45426 2 57333 3 89762 3 4373 3 75457 3 74845 3 86843 3 59957 3 74563 3 745335 3 46342 3 463473 3 52352 3 23622 4
1 6672 Play Station 4475 1 2546 1 5757 1 2352 1 57896 1 98574 1 53536 2 56533 2 88645 2 44884 2 76585 2 43575 2 74573 2 75347 2 57573 2 5736 2 53737 2 35235 2 5322 3 54757 3 74573 3 75473 3 77362 3 21554 3 73462 3 74736 3 1435 3 4367 3 63462 3 32362 3 56332 4
2 6631 Laptop 35347 1 36376 1 164577 1 94584 1 78675 1 76758 1 75464 2 56373 2 56343 2 54787 2 7658 2 76584 2 47347 2 5748 2 8684 2 75373 2 57573 2 26626 2 25632 3 73774 3 847373 3 736646 3 847457 3 57346 3 43732 3 347346 3 75373 3 6473 3 85674 3 35743 3 45734 4
3 6600 Camera 14365 1 60785 1 25436 1 46747 1 75456 1 97644 1 63573 2 56433 2 25646 2 32548 2 14325 2 64748 2 68458 2 46537 2 7537 2 46266 2 7457 2 78235 2 46223 3 8747 3 67453 3 4636 3 3425 3 4636 3 352236 3 6622 3 64625 3 36346 3 46346 3 35225 3 6436 4
4 6643 Lamp 324355 1 143255 1 696954 1 97823 1 43657 1 66686 1 56346 2 57563 2 65734 2 64484 2 87685 2 54748 2 9868 2 573 2 73472 2 5735 2 73422 2 86352 2 5325 3 84333 3 7473 3 35252 3 7547 3 73733 3 7374 3 32266 3 654747 3 85743 3 57333 3 46346 3 46266 4
Here's a small solution (I used the year instead of your 1, 2, ... numbering since I thought it was more representative, but you can change it easily):
idx_counter = 0  # number of columns inserted so far
for idx, col in enumerate(ds.columns):
    if col.startswith('Sales'):
        date = col.replace('Sales', '')
        year = col.split('-')[1]
        # shift the target position by the columns already inserted
        ds.insert(loc=idx + 1 + idx_counter, column=date, value=[year] * ds.shape[0])
        idx_counter += 1
output:
ID Product SalesJan-22 Jan-22 SalesFeb-22 Feb-22 SalesMar-22 Mar-22 SalesApr-22 Apr-22 ... SalesMar-24 Mar-24 SalesApr-24 Apr-24 SalesMay-24 May-24 SalesJun-24 Jun-24 SalesJul-24 Jul-24
0 6661 Mobile Phone 43578 22 5000 22 78564 22 52353 22 ... 745335 24 46342 24 463473 24 52352 24 23622 24
1 6672 Play Station 4475 22 2546 22 5757 22 2352 22 ... 1435 24 4367 24 63462 24 32362 24 56332 24
2 6631 Laptop 35347 22 36376 22 164577 22 94584 22 ... 75373 24 6473 24 85674 24 35743 24 45734 24
3 6600 Camera 14365 22 60785 22 25436 22 46747 22 ... 64625 24 36346 24 46346 24 35225 24 6436 24
4 6643 Lamp 324355 22 143255 22 696954 22 97823 22 ... 654747 24 85743 24 57333 24 46346 24 46266 24
This should do the trick.
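import math
new_cols = []
old_cols = [x for x in ds.columns if x.startswith('Sales')]
for i, col in enumerate(old_cols):
    new_cols.append(col[5:])  # column name without the 'Sales' prefix
    if i < 6:
        val = 1  # first 6 months
    else:
        val = ((i+6)/12)+1  # then one step per 12 months
    ds[col[5:]] = math.floor(val)
# interleave each sales column with its number column
out = ds[['ID', 'Product'] + [x for y in zip(old_cols, new_cols) for x in y]]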
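With the full Jan-2022 to Dec-2034 range, adding columns one at a time may trigger pandas' fragmentation warning. A possible alternative, sketched here on a shortened, made-up frame, is to build all number columns at once and join them with pd.concat. The names sales_cols, numbers and order are illustrative, and (i + 6) // 12 + 1 is just the branch-free form of the rule used above.

```python
import pandas as pd

# Small stand-in for the real frame (values are made up for illustration)
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug"]
ds = pd.DataFrame({
    "ID": [6661, 6672],
    "Product": ["Mobile Phone", "Play Station"],
    **{f"Sales{m}-22": [i, i * 10] for i, m in enumerate(months, start=1)},
})

sales_cols = [c for c in ds.columns if c.startswith("Sales")]

# One scalar per new column, broadcast over all rows via the shared index
numbers = pd.DataFrame(
    {c[5:]: (i + 6) // 12 + 1 for i, c in enumerate(sales_cols)},
    index=ds.index,
)

# Interleave: sales column first, then its number column
order = [c for pair in zip(sales_cols, numbers.columns) for c in pair]
out = pd.concat([ds, numbers], axis=1)[["ID", "Product"] + order]
print(out)
```

This leaves ds untouched and returns the interleaved result as a new frame.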
I have a dataframe with columns like this:
['id', 't_dur0', 't_dur1', 't_dur2', 't_dance0', 't_dance1', 't_dance2', 't_energy0',
't_energy1', 't_energy2']
And I have code that returns the average of three columns sharing the same name stem:
# Takes in a dataframe with three columns and returns a dataframe with one column of their means as floats
def average_column(dataframe):
    dataframe = dataframe.copy()  # To avoid SettingWithCopyWarning
    # Create new column name without the 't_' prefix and the trailing integers (removeprefix needs Python 3.9+)
    temp = dataframe.columns.tolist()[0]
    col_name = temp.removeprefix('t_').rstrip('0123456789')
    dataframe[col_name] = dataframe.mean(axis=1)  # Add column to the dataframe (axis=1 means the mean() is applied row-wise)
    mean_df = dataframe.iloc[:, -1:]  # Isolate the mean column by selecting all rows (:) for the last column (-1:)
    print("Original:\n{}\nAverage columns:\n{}".format(dataframe, mean_df))
    return mean_df.astype(float)
This function gives me this output:
Original:
t_dance0 t_dance1 t_dance2 dance
0 0.549 0.623 0.5190 0.563667
1 0.871 0.702 0.4160 0.663000
2 0.289 0.328 0.2340 0.283667
3 0.886 0.947 0.8260 0.886333
4 0.724 0.791 0.7840 0.766333
... ... ... ... ...
Average columns:
dance
0 0.563667
1 0.663000
2 0.283667
3 0.886333
4 0.766333
... ...
I previously asked how to split the dataframe into unique and duplicate columns, which led me to this code:
# Function that splits a dataframe into two separate dataframes, one with all unique columns and one with all duplicates
def sub_dataframes(dataframe):
    # Extract common prefix -> remove trailing digits
    cols = dataframe.columns.str.replace(r'\d*$', '', regex=True).to_series().value_counts()
    # Split columns
    unq_cols = cols[cols == 1].index
    dup_cols = dataframe.columns[~dataframe.columns.isin(unq_cols)]  # All columns from dataframe that are not in unq_cols
    return dataframe[unq_cols], dataframe[dup_cols]
unq_df, dup_df = sub_dataframes(df)
print("Unique columns:\n\n{}\n\nDuplicate columns:\n\n{}".format(unq_df, dup_df))
Which gives me this output:
Unique columns:
id
0 22352
1 106534
2 23608
3 8655
4 49670
... ...
Duplicate columns:
t_dur0 t_dur1 t_dur2 t_dance0 t_dance1 t_dance2
0 292720 293760.0 292733.0 0.549 0.623 0.5190
1 213760 181000.0 245973.0 0.871 0.702 0.4160
2 157124 130446.0 152450.0 0.289 0.328 0.2340
3 127896 176351.0 166968.0 0.886 0.947 0.8260
4 210320 226253.0 211880.0 0.724 0.791 0.7840
... ... ... ... ... ... ...
2828 70740 262400.0 220680.0 0.224 0.609 0.7110
2829 252226 222400.0 214973.0 0.526 0.623 0.4820
2830 269146 251560.0 172760.0 0.551 0.756 0.7820
2831 344764 425613.0 249652.0 0.473 0.572 0.8230
2832 210955 339869.0 304124.0 0.112 0.523 0.0679
I have tried to combine these functions into another function that takes a dataframe and returns it with all duplicate columns replaced by their mean, but I have trouble splitting dup_df into smaller dataframes. Is there a simpler way to do this?
An example of the desired output:
Original:
total_tracks t_dur0 t_dur1 t_dur2 t_dance0 t_dance1 t_dance2 \
0 4 292720 293760.0 292733.0 0.549 0.623 0.5190
1 12 213760 181000.0 245973.0 0.871 0.702 0.4160
2 59 157124 130446.0 152450.0 0.289 0.328 0.2340
3 8 127896 176351.0 166968.0 0.886 0.947 0.8260
4 17 210320 226253.0 211880.0 0.724 0.791 0.7840
... ... ... ... ... ... ... ...
After function:
total_tracks popularity duration dance
0 4 21 293071.000000 0.563667
1 12 14 213577.666667 0.663000
2 59 41 146673.333333 0.283667
3 8 1 157071.666667 0.886333
4 17 47 216151.000000 0.766333
... ... ... ...
Use wide_to_long to reshape the original DataFrame first, then aggregate with mean:
cols = ['total_tracks']
df1 = (pd.wide_to_long(df,
                       stubnames=['t_dur','t_dance'],
                       i=cols,
                       j='tmp')
         .reset_index()
         .drop(columns='tmp')  # positional axis argument was removed in pandas 2.0
         .groupby(cols, as_index=False)
         .mean())
print (df1)
total_tracks t_dur t_dance
0 4 293071.000000 0.563667
1 8 157071.666667 0.886333
2 12 213577.666667 0.663000
3 17 216151.000000 0.766333
4 59 146673.333333 0.283667
Details:
cols = ['total_tracks']
print(pd.wide_to_long(df,
                      stubnames=['t_dur','t_dance'],
                      i=cols,
                      j='tmp'))
t_dur t_dance
total_tracks tmp
4 0 292720.0 0.549
12 0 213760.0 0.871
59 0 157124.0 0.289
8 0 127896.0 0.886
17 0 210320.0 0.724
4 1 293760.0 0.623
12 1 181000.0 0.702
59 1 130446.0 0.328
8 1 176351.0 0.947
17 1 226253.0 0.791
4 2 292733.0 0.519
12 2 245973.0 0.416
59 2 152450.0 0.234
8 2 166968.0 0.826
17 2 211880.0 0.784
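If the values in total_tracks are not unique, wide_to_long cannot use that column as its identifier. A possible alternative, sketched below on the first two rows of the example, is to group the columns themselves by their name without trailing digits and average each group. The name stems is illustrative. The sketch assumes all columns are numeric, and the transpose turns total_tracks into floats.

```python
import pandas as pd

# First two rows of the example data from the question
df = pd.DataFrame({
    "total_tracks": [4, 12],
    "t_dur0": [292720, 213760],
    "t_dur1": [293760.0, 181000.0],
    "t_dur2": [292733.0, 245973.0],
    "t_dance0": [0.549, 0.871],
    "t_dance1": [0.623, 0.702],
    "t_dance2": [0.519, 0.416],
})

# Column names with trailing digits removed, e.g. 't_dur0' -> 't_dur'
stems = df.columns.str.replace(r"\d+$", "", regex=True)

# Transpose so columns become rows, average rows that share a stem,
# then transpose back; sort=False keeps the original column order
means = df.T.groupby(stems.to_numpy(), sort=False).mean().T
print(means)
```

Columns that have no numbered siblings, such as total_tracks here, form a group of one and pass through unchanged apart from the float conversion.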