Combine rows and average column if another column is minimum - python

I have a pandas dataframe:
             Server  Clock 1  Clock 2  Power   diff
0  PhysicalWindows1     3400   3300.0   58.5  100.0
1  PhysicalWindows1     3400   3500.0   63.0  100.0
2  PhysicalWindows1     3400   2900.0   25.0  500.0
3  PhysicalWindows2     3600   3300.0   83.8  300.0
4  PhysicalWindows2     3600   3500.0   65.0  100.0
5  PhysicalWindows2     3600   2900.0   10.0  700.0
6    PhysicalLinux1     2600      NaN    NaN    NaN
7    PhysicalLinux1     2600      NaN    NaN    NaN
8              Test     2700   2700.0   30.0    0.0
Basically, I would like to average the Power for each server, but only over the rows where diff is at its minimum. For example, if you look at the 'PhysicalWindows1' server, I have 3 rows: two have a diff of 100 and one has a diff of 500. Since two rows share the minimum diff of 100, I would like to average their Power values of 58.5 and 63.0. For 'PhysicalWindows2', since only one row has the least diff, we return the Power for that one row (65.0). If everything is NaN, return NaN, and if there is only one match, return the Power for that one match.
My resultant dataframe would look like this:
             Server  Clock 1          Power
0  PhysicalWindows1     3400  (58.5+63.0)/2
1  PhysicalWindows2     3600           65.0
2    PhysicalLinux1     2600            NaN
3              Test     2700           30.0

Use groupby with dropna=False to avoid dropping PhysicalLinux1 and sort=True to sort the index levels (so the lowest diff comes first within each group), then drop_duplicates to keep only the first instance of each (Server, Clock 1):
out = (df.groupby(['Server', 'Clock 1', 'diff'], dropna=False, sort=True)['Power']
.mean().droplevel('diff').reset_index().drop_duplicates(['Server', 'Clock 1']))
# Output
Server Clock 1 Power
0 PhysicalLinux1 2600 NaN
1 PhysicalWindows1 3400 60.75
3 PhysicalWindows2 3600 65.00
6 Test 2700 30.00
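As a quick sanity check (assuming out is the frame produced above), the averaged value for PhysicalWindows1 should equal (58.5 + 63.0) / 2 = 60.75:
assert out.loc[out['Server'] == 'PhysicalWindows1', 'Power'].item() == (58.5 + 63.0) / 2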

Here is a possible solution using df.groupby() and pd.merge(). The first groupby averages Power per (Server, diff) pair; because groupby sorts the group keys, taking .first() per Server then picks the row with the smallest diff:
grp_df = df.groupby(['Server', 'diff'])['Power'].mean().reset_index()
grp_df = grp_df.groupby('Server').first().reset_index()
grp_df = grp_df.rename(columns={'diff': 'min_diff', 'Power': 'Power_avg'})
df_out = (pd.merge(df[['Server', 'Clock 1']].drop_duplicates(), grp_df, on='Server', how='left')
.drop(['min_diff'], axis=1))
print(df_out)
Server Clock 1 Power_avg
0 PhysicalWindows1 3400 60.75
1 PhysicalWindows2 3600 65.00
2 PhysicalLinux1 2600 NaN
3 Test 2700 30.00

Use a double groupby: first groupby.transform to mask the Power values whose diff is not the group minimum, then groupby.agg to aggregate:
m = df.groupby('Server')['diff'].transform('min').eq(df['diff'])
(df.assign(Power=df['Power'].where(m))
.groupby('Server', sort=False, as_index=False)
.agg({'Clock 1': 'first', 'Power': 'mean'})
)
Output:
Server Clock 1 Power
0 PhysicalWindows1 3400 60.75
1 PhysicalWindows2 3600 65.00
2 PhysicalLinux1 2600 NaN
3 Test 2700 30.00
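For reference, m simply flags the rows whose diff is the per-server minimum; a quick way to inspect it on the sample data (assuming df is the frame from the question):
print(df.assign(is_min_diff=m)[['Server', 'diff', 'Power', 'is_min_diff']])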

import pandas as pd
import numpy as np
df = pd.DataFrame(
{
"Server": ['PhysicalWindows1', 'PhysicalWindows1', 'PhysicalWindows1', 'PhysicalWindows2',
'PhysicalWindows2', 'PhysicalWindows2', 'PhysicalLinux1', 'PhysicalLinux1', 'Test'],
"Clock 1": [3400, 3400, 3400, 3600, 3600, 3600, 2600, 2600, 2700],
"Clock 2": [3300.0, 3500.0, 2900.0, 3300.0, 3500.0, 2900.0, np.nan, np.nan, 2700.0],
"Power": [58.5, 63.0, 25.0, 83.8, 65.0, 10.0, np.nan, np.nan, 30.0],
"diff": [100.0, 100.0, 500.0, 300.0, 100.0, 700.0, np.nan, np.nan, 0.0]
}
)
r = (df.groupby(['Server'])
       .apply(lambda d: d[d['diff'] == d['diff'].min()])
       .reset_index(drop=True)
       .groupby(['Server'])
       .agg({"Clock 1": 'mean', "Power": 'mean', "diff": 'first'})
       .reset_index()
     )
# DataFrame.append was removed in pandas 2.0, so concatenate the all-NaN servers with pd.concat instead
r = (pd.concat([r, df[df['diff'].isnull()].drop_duplicates()])
       .drop(['Clock 2', 'diff'], axis=1)
       .reset_index(drop=True)
     )
print(r)
Server Clock 1 Power
0 PhysicalWindows1 3400.0 60.75
1 PhysicalWindows2 3600.0 65.00
2 Test 2700.0 30.00
3 PhysicalLinux1 2600.0 NaN

def function1(dd: pd.DataFrame):
    # keep only the rows whose diff equals the group's minimum
    dd1 = dd.loc[dd.loc[:, "diff"].eq(dd.loc[:, "diff"].min())]
    return pd.DataFrame({"Clock 1": dd1[["Clock 1"]].min().squeeze(), "Power": dd1.Power.mean()}, index=[dd.name])

df.groupby('Server', sort=False).apply(function1).droplevel(1)
Output:
Clock 1 Power
Server
PhysicalWindows1 3400.0 60.75
PhysicalWindows2 3600.0 65.00
PhysicalLinux1 NaN NaN
Test 2700.0 30.00

Related

DataFrame groupby and divide by group sum

In order to build stock portfolios for a backtest I am trying to get the market capitalization (me) weight of each stock within its portfolio. For test purposes I built the following DataFrame of price and return observations. Every day I am assigning the stocks to quantiles based on price and all stocks in the same quantile that day will be in one portfolio:
d = {'date' : ['202211', '202211', '202211','202211', '202212', '202212', '202212', '202212'],
'price' : [1, 1.2, 1.3, 1.5, 1.7, 2, 1.5, 1],
'shrs' : [100, 100, 100, 100, 100, 100, 100, 100]}
df = pd.DataFrame(data = d)
df.set_index('date', inplace=True)
df.index = pd.to_datetime(df.index, format='%Y%m%d')
df["me"] = df['price'] * df['shrs']
df['rank'] = df.groupby('date')['price'].transform(lambda x: pd.qcut(x, 2, labels=range(1,3), duplicates='drop'))
df
price shrs me rank
date
2022-01-01 1.0 100 100.0 1
2022-01-01 1.2 100 120.0 1
2022-01-01 1.3 100 130.0 2
2022-01-01 1.5 100 150.0 2
2022-01-02 1.7 100 170.0 2
2022-01-02 2.0 100 200.0 2
2022-01-02 1.5 100 150.0 1
2022-01-02 1.0 100 100.0 1
In the next step I am grouping by 'date' and 'rank' and divide each observation's market cap by the sum of the groups market cap in order to obtain the stocks weight in the portfolio:
df['weight'] = df.groupby(['date', 'rank'], group_keys=False).apply(lambda x: x['me'] / x['me'].sum()).sort_index()
print(df)
price shrs me rank weight
date
2022-01-01 1.0 100 100.0 1 0.454545
2022-01-01 1.2 100 120.0 1 0.545455
2022-01-01 1.3 100 130.0 2 0.464286
2022-01-01 1.5 100 150.0 2 0.535714
2022-01-02 1.7 100 170.0 2 0.600000
2022-01-02 2.0 100 200.0 2 0.400000
2022-01-02 1.5 100 150.0 1 0.459459
2022-01-02 1.0 100 100.0 1 0.540541
Now comes the flaw. On my test df this works perfectly fine. However, on the real data (a DataFrame of shape 160000 x 21) the calculation takes forever and I always have to interrupt the Jupyter kernel at some point. Is there a more efficient way to do this? What am I missing?
Interestingly I am using the same code as some colleagues on similar DataFrames and for them it takes seconds only.
Use GroupBy.transform with 'sum' to build a new Series and divide the me column by it:
df['weight'] = df['me'].div(df.groupby(['date', 'rank'])['me'].transform('sum'))
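If you want to see the difference yourself, here is a rough (non-rigorous) timing comparison on the sample frame, assuming df is the example frame built above; the gap grows with the number of groups:
import timeit

apply_way = lambda: df.groupby(['date', 'rank'], group_keys=False).apply(lambda x: x['me'] / x['me'].sum())
transform_way = lambda: df['me'].div(df.groupby(['date', 'rank'])['me'].transform('sum'))

print('apply:    ', timeit.timeit(apply_way, number=100))
print('transform:', timeit.timeit(transform_way, number=100))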
It might not be the most elegant solution, but if you run into performance issues you can try to split the work into multiple parts: store the grouped sum of me in a Series and then merge it back:
temp = df.groupby(['date', 'rank'], group_keys=False).apply(lambda x: x['me'].sum())
temp = temp.reset_index(name='weight')
df = df.merge(temp, on=['date', 'rank'])
df['weight'] = df['me'] / df['weight']
df.set_index('date', inplace=True)
df
which should lead to the output:
price shrs me rank weight
date
2022-01-01 1.0 100 100.0 1 0.454545
2022-01-01 1.2 100 120.0 1 0.545455
2022-01-01 1.3 100 130.0 2 0.464286
2022-01-01 1.5 100 150.0 2 0.535714
2022-01-02 1.7 100 170.0 2 0.459459
2022-01-02 2.0 100 200.0 2 0.540541
2022-01-02 1.5 100 150.0 1 0.600000
2022-01-02 1.0 100 100.0 1 0.400000

Take rowmean of sorted columns for first 4 that are not NA (python)

I have a dataframe (df) with populations as columns, instances as rows, and frequencies as entries (see attached screenshot; there are about 2.3M rows and 40 columns).
I have a way of sorting populations by their geographical distance. What I would like to do is take the row mean of the four closest populations whose frequency is not NA.
If there weren't any NAs, I would do something like:
closest = get_four_closest_pops(focalpop) # get a list of the four closest pops
whatiwant = df[closest].mean(axis=1) # doing it this way does not take into account NAs
If I loop through rows it will take way longer than I want it to, even if I use ipyparallel. The pandas.DataFrame.mean() method is quite quick. I was wondering if there isn't a simple method where I could supply a list of all pops ordered by proximity to a focalpop and then take the mean of the four closest non-NA values.
example
For instance, I would like a pd.Series returned (or a dict) that has the rownames and the mean of the first four pops that are not NA.
for row jcf7190000000000-77738 in the screenshot (if for instance columns are ordered by proximity to a pop that is not shown) I would want output as 0.666625 (ie from DF_p1, DF_p24, DF_p25, DF_p26)
for row jcf7190000000000-77764 I would want output as 0.771275 (ie from DF_p1, DF_p24, DF_p25, DF_p26)
for row jcf7190000000004-54418 I would want output as 0.28651 (ie from DF_1, DF_2, DF_23, DF_24)
For future questions, see How to Ask and How to create a Minimal, Reproducible Example (just a suggestion). Now, with a dummy df (since you didn't add a sample of your dataframe to work with), and taking into account that your columns are already ordered by proximity, you could try this:
import pandas as pd
from math import isnan
from statistics import mean
#creation of dummy df
df = pd.DataFrame({'df_1': ['700','ABC','500','XYZ','1200','DDD','150','350','400','5000', '100'],
'df_2': ['DDD','150','350','400','5000','500','XYZ','1200','DDD','150','350'] ,
'df_3': ['700','ABC','500','XYZ','1200','DDD','150','350','400','5000', '100'],
'df_4': ['DDD','150','350','400','5000','500','XYZ','1200','DDD','150','350'],
'df_5': ['700','ABC','500','XYZ','1200','DDD','150','350','400','5000', '100'],
'df_6': ['DDD','150','350','400','5000','500','XYZ','1200','DDD','150','350'],
'df_7': ['700','ABC','500','XYZ','1200','DDD','150','350','400','5000', '100'],
'df_8': ['DDD','150','350','400','5000','500','XYZ','1200','DDD','150','350'],
'df_9': ['DDD','150','350','400','5000','500','XYZ','1200','DDD','150','350'],
'df_10': ['700','ABC','500','XYZ','1200','DDD','150','350','400','5000', '100']
})
df=df.apply(pd.to_numeric, errors='coerce')
df.index.name='instances'
#approach of a solution to your problem
def func(row):
    # row comes from df.to_records(), so row[0] is the index value; skip it
    values = list(row)[1:]
    meanx = mean([float(x) for x in values if not isnan(x)][:4])
    columns = [df.columns[i] for i, x in enumerate(values) if not isnan(x)][:4]
    return {'columns': columns, 'mean': meanx}

dc = {row[0]: func(row) for row in df.to_records()}
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
print(df)
print(dc)
Output:
df
df_1 df_2 df_3 df_4 df_5 df_6 df_7 df_8 df_9 df_10
instances
0 700.0 NaN 700.0 NaN 700.0 NaN 700.0 NaN NaN 700.0
1 NaN 150.0 NaN 150.0 NaN 150.0 NaN 150.0 150.0 NaN
2 500.0 350.0 500.0 350.0 500.0 350.0 500.0 350.0 350.0 500.0
3 NaN 400.0 NaN 400.0 NaN 400.0 NaN 400.0 400.0 NaN
4 1200.0 5000.0 1200.0 5000.0 1200.0 5000.0 1200.0 5000.0 5000.0 1200.0
5 NaN 500.0 NaN 500.0 NaN 500.0 NaN 500.0 500.0 NaN
6 150.0 NaN 150.0 NaN 150.0 NaN 150.0 NaN NaN 150.0
7 350.0 1200.0 350.0 1200.0 350.0 1200.0 350.0 1200.0 1200.0 350.0
8 400.0 NaN 400.0 NaN 400.0 NaN 400.0 NaN NaN 400.0
9 5000.0 150.0 5000.0 150.0 5000.0 150.0 5000.0 150.0 150.0 5000.0
10 100.0 350.0 100.0 350.0 100.0 350.0 100.0 350.0 350.0 100.0
dc
{0: {'columns': ['df_1', 'df_3', 'df_5', 'df_7'], 'mean': 700.0}, 1: {'columns': ['df_2', 'df_4', 'df_6', 'df_8'], 'mean': 150.0}, 2: {'columns': ['df_1', 'df_2', 'df_3', 'df_4'], 'mean': 425.0}, 3: {'columns': ['df_2', 'df_4', 'df_6', 'df_8'], 'mean': 400.0}, 4: {'columns': ['df_1', 'df_2', 'df_3', 'df_4'], 'mean': 3100.0}, 5: {'columns': ['df_2', 'df_4', 'df_6', 'df_8'], 'mean': 500.0}, 6: {'columns': ['df_1', 'df_3', 'df_5', 'df_7'], 'mean': 150.0}, 7: {'columns': ['df_1', 'df_2', 'df_3', 'df_4'], 'mean': 775.0}, 8: {'columns': ['df_1', 'df_3', 'df_5', 'df_7'], 'mean': 400.0}, 9: {'columns': ['df_1', 'df_2', 'df_3', 'df_4'], 'mean': 2575.0}, 10: {'columns': ['df_1', 'df_2', 'df_3', 'df_4'], 'mean': 225.0}}
Note: it should be clarified that the program is designed assuming that all instances have at least 4 non-NaN values.
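Since the real frame has about 2.3M rows, a row-by-row dict comprehension can be slow. Here is a possible vectorized sketch with NumPy; ordered_cols is a hypothetical list of population column names already sorted by proximity to the focal pop, analogous to the closest list in the question:
import numpy as np
import pandas as pd

arr = df[ordered_cols].to_numpy(dtype=float)
mask = ~np.isnan(arr)
running = np.cumsum(mask, axis=1)            # running count of non-NaN values within each row
keep = mask & (running <= 4)                 # only the first four non-NaN entries of each row
counts = keep.sum(axis=1)
sums = np.where(keep, arr, 0.0).sum(axis=1)
means = np.divide(sums, counts, out=np.full(len(arr), np.nan), where=counts > 0)
whatiwant = pd.Series(means, index=df.index)
The result is NaN for rows with no non-NaN entries, and it averages fewer than four values when fewer are available.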

Build rows in Python Dataframe, based on values in previous column

My input looks like this:
import datetime as dt
import pandas as pd
some_money = [34,42,300,450,550]
df = pd.DataFrame({'TIME': ['2020-01', '2019-12', '2019-11', '2019-10', '2019-09'], \
'MONEY':some_money})
df
Producing the following:
      TIME  MONEY
0  2020-01     34
1  2019-12     42
2  2019-11    300
3  2019-10    450
4  2019-09    550
I want to add 3 more columns (m-1, m-2, m-3), each holding the MONEY value from the previous month, the month before that, and so on; the original post showed this with a color-coded screenshot.
This is what I have tried:
prev_period_money = ["m-1", "m-2", "m-3"]
for m in prev_period_money:
df[m] = df["MONEY"] - 10 #well, it "works", but it gives df["MONEY"]- 10...
The TIME column is sorted, so one should not have to care about it. (But it would be great if someone showed the "magic" of getting the data from the TIME column directly.)
For pandas 0.24+, use fill_value=0 in Series.shift; that way the new columns also stay correct integer columns:
for x in range(1, 4):
    df[f"m-{x}"] = df["MONEY"].shift(periods=-x, fill_value=0)

print(df)
TIME MONEY m-1 m-2 m-3
0 2020-01 34 42 300 450
1 2019-12 42 300 450 550
2 2019-11 300 450 550 0
3 2019-10 450 550 0 0
4 2019-09 550 0 0 0
For pandas below 0.24 it is necessary to replace missing values and convert to integers:
for x in range(1, 4):
    df[f"m-{x}"] = df["MONEY"].shift(periods=-x).fillna(0).astype(int)
It is quite easy if you use shift
That would give you the desired output:
df["m-1"] = df["MONEY"].shift(periods=-1)
df["m-2"] = df["MONEY"].shift(periods=-2)
df["m-3"] = df["MONEY"].shift(periods=-3)
df = df.fillna(0)
This would work only if it's ordered. Otherwise you have to order it before.
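If you would rather derive the columns from the TIME values themselves instead of relying on row order, here is a possible sketch (it assumes TIME holds unique 'YYYY-MM' strings; the m-1, m-2, m-3 names follow the question):
months = pd.PeriodIndex(df['TIME'], freq='M')
money_by_month = pd.Series(df['MONEY'].to_numpy(), index=months)
for x in range(1, 4):
    # look up the MONEY value x months before each row's month; months not present become 0
    df[f'm-{x}'] = money_by_month.reindex(months - x).fillna(0).to_numpy()
Cast with astype(int) afterwards if you want integer columns.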
My suggestion: use a list comprehension with shift to get your three columns, concat them along the columns, and concatenate the result to the original dataframe:
(pd.concat([df, pd.concat([df.MONEY.shift(-i) for i in range(1, 4)], axis=1)], axis=1)
   .fillna(0)
)
TIME MONEY MONEY MONEY MONEY
0 2020-01 34 42.0 300.0 450.0
1 2019-12 42 300.0 450.0 550.0
2 2019-11 300 450.0 550.0 0.0
3 2019-10 450 550.0 0.0 0.0
4 2019-09 550 0.0 0.0 0.0
import pandas as pd

some_money = [34, 42, 300, 450, 550]
df = pd.DataFrame({'TIME': ['2020-01', '2019-12', '2019-11', '2019-10', '2019-09'], 'MONEY': some_money})
prev_period_money = ["m-1", "m-2", "m-3"]
count = 1
for m in prev_period_money:
    # shift MONEY up by `count` rows by slicing and realigning to the default index
    df[m] = df['MONEY'].iloc[count:].reset_index(drop=True)
    count += 1
df = df.fillna(0)
Output:
TIME MONEY m-1 m-2 m-3
0 2020-01 34 42.0 300.0 450.0
1 2019-12 42 300.0 450.0 550.0
2 2019-11 300 450.0 550.0 0.0
3 2019-10 450 550.0 0.0 0.0
4 2019-09 550 0.0 0.0 0.0

pandas dataframe: convert 2 columns (value, value) into 2 columns (value, type)

Let's say I have the following dataframe "A"
utilization utilization_billable
service
1 10.0 5.0
2 30.0 20.0
3 40.0 30.0
4 40.0 32.0
I need to convert it into the following dataframe "B"
utilization type
service
1 10.0 total
2 30.0 total
3 40.0 total
4 40.0 total
1 5.0 billable
2 20.0 billable
3 30.0 billable
4 32.0 billable
so that the values from the first dataframe are categorized into a type column with the values total or billable.
data = {
'utilization': [10.0, 30.0, 40.0, 40.0],
'utilization_billable': [5.0, 20.0, 30.0, 32.0],
'service': [1, 2, 3, 4]
}
df = pd.DataFrame.from_dict(data).set_index('service')
print(df)
data = {
'utilization': [10.0, 30.0, 40.0, 40.0, 5.0, 20.0, 30.0, 32.0],
'service': [1, 2, 3, 4, 1, 2, 3, 4],
'type': [
'total',
'total',
'total',
'total',
'billable',
'billable',
'billable',
'billable',
]
}
df = pd.DataFrame.from_dict(data).set_index('service')
print(df)
Is there a way to transform the data frame and perform such categorization?
You could use pd.melt:
import pandas as pd
data = {
'utilization': [10.0, 30.0, 40.0, 40.0],
'utilization_billable': [5.0, 20.0, 30.0, 32.0],
'service': [1, 2, 3, 4]}
df = pd.DataFrame(data)
result = pd.melt(df, var_name='type', value_name='utilization', id_vars='service')
print(result)
yields
service type utilization
0 1 utilization 10.0
1 2 utilization 30.0
2 3 utilization 40.0
3 4 utilization 40.0
4 1 utilization_billable 5.0
5 2 utilization_billable 20.0
6 3 utilization_billable 30.0
7 4 utilization_billable 32.0
Then result.set_index('service') would make service the index,
but I would recommend avoiding that since service values are not unique.
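To get exactly the type labels from the question, one possible extra step (assuming result is the melted frame above) is to map the original column names afterwards:
result['type'] = result['type'].map({'utilization': 'total', 'utilization_billable': 'billable'})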
This can be done with pd.wide_to_long after adding a suffix to the first column.
import pandas as pd
df = df.rename(columns={'utilization': 'utilization_total'})
pd.wide_to_long(df.reset_index(), stubnames='utilization', sep='_',
i='service', j='type', suffix='.*').reset_index(1)
Output:
type utilization
service
1 total 10.0
2 total 30.0
3 total 40.0
4 total 40.0
1 billable 5.0
2 billable 20.0
3 billable 30.0
4 billable 32.0
Looks like a job for df.stack() with multiple DataFrame.rename() calls:
df.rename(index=str, columns={"utilization": "total", "utilization_billable": "billable"})\
.stack().reset_index(1).rename(index=str, columns={"level_1": "type", 0: "utilization"})\
.sort_values(by='type', ascending = False)
Output:
type utilization
service
1 total 10.0
2 total 30.0
3 total 40.0
4 total 40.0
1 billable 5.0
2 billable 20.0
3 billable 30.0
4 billable 32.0

Removing the name of a multilevel index

I created this multilevel DataFrame:
Price
Country England Germany US
sys dis
23 0.8 300.0 300.0 800.0
24 0.8 1600.0 600.0 600.0
27 1.0 4000.0 4000.0 5500.0
30 1.0 1000.0 3000.0 1000.0
Now I want to remove the name Country and add a default index from 0 to n:
Price
sys dis England Germany US
0 23 0.8 300.0 300.0 800.0
1 24 0.8 1600.0 600.0 600.0
2 27 1.0 4000.0 4000.0 5500.0
3 30 1.0 1000.0 3000.0 1000.0
This is my code:
df = pd.DataFrame({'sys':[23,24,27,30],'dis': [0.8, 0.8, 1.0,1.0], 'Country':['US', 'England', 'US', 'Germany'], 'Price':[500, 1000, 1500, 2000]})
df = df.set_index(['sys','dis', 'Country']).unstack().fillna(0)
Can I have some hints on how to solve it? I don't have much experience with multilevel DataFrames.
Try this:
df = df.reset_index()
df.columns.names = [None, None]
df.columns
MultiIndex(levels=[[u'Price'], [u'England', u'Germany', u'US']],
labels=[[0, 0, 0], [0, 1, 2]],
names=[None, u'Country'])
Best I've got for now:
df.rename_axis(columns=[None, None]).reset_index()
(the older positional form df.rename_axis([None, None], 1) no longer works in current pandas versions)
For the index:
df.reset_index(inplace=True)
For the columns (drop the 'Price' level so only the country names remain):
df.columns = df.columns.droplevel(0)
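Putting the pieces together, a possible one-liner for current pandas versions that keeps both column levels and only clears their names (a sketch, not the only way):
out = df.rename_axis(columns=[None, None]).reset_index()
print(out)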
