Removing the name of a multilevel index - python

I created this multilevel DataFrame:
            Price
Country   England  Germany       US
sys dis
23  0.8     300.0    300.0    800.0
24  0.8    1600.0    600.0    600.0
27  1.0    4000.0   4000.0   5500.0
30  1.0    1000.0   3000.0   1000.0
Now I want to remove the name Country and add a default index from 0 onwards:
   sys  dis    Price
               England  Germany       US
0   23  0.8    300.0     300.0    800.0
1   24  0.8   1600.0     600.0    600.0
2   27  1.0   4000.0    4000.0   5500.0
3   30  1.0   1000.0    3000.0   1000.0
This is my code:
df = pd.DataFrame({'sys': [23, 24, 27, 30], 'dis': [0.8, 0.8, 1.0, 1.0],
                   'Country': ['US', 'England', 'US', 'Germany'],
                   'Price': [500, 1000, 1500, 2000]})
df = df.set_index(['sys','dis', 'Country']).unstack().fillna(0)
Can I have some hints on how to solve this? I don't have much experience with multilevel DataFrames.

Try this:
df = df.reset_index()
df.columns.names = [None, None]
Note that reset_index returns a new frame unless you assign it back (or pass inplace=True). For reference, the columns before clearing the names:
df.columns
MultiIndex(levels=[[u'Price'], [u'England', u'Germany', u'US']],
           labels=[[0, 0, 0], [0, 1, 2]],
           names=[None, u'Country'])

Best I've got for now (rename_axis with the columns keyword clears both column level names, then reset_index restores a 0-based index):
df.rename_axis(columns=[None, None]).reset_index()

For the index:
df.reset_index(inplace=True)
For the columns (dropping the 'Price' level flattens them down to the country names):
df.columns = df.columns.droplevel(0)
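Putting the pieces together, a minimal end-to-end sketch using the construction code from the question:
import pandas as pd

df = pd.DataFrame({'sys': [23, 24, 27, 30], 'dis': [0.8, 0.8, 1.0, 1.0],
                   'Country': ['US', 'England', 'US', 'Germany'],
                   'Price': [500, 1000, 1500, 2000]})
df = df.set_index(['sys', 'dis', 'Country']).unstack().fillna(0)

# clear both column level names, then move sys/dis back out of the index
out = df.rename_axis(columns=[None, None]).reset_index()
print(out)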


Combine rows and average column if another column is minimum

I have a pandas dataframe:
             Server  Clock 1  Clock 2  Power   diff
0  PhysicalWindows1     3400   3300.0   58.5  100.0
1  PhysicalWindows1     3400   3500.0   63.0  100.0
2  PhysicalWindows1     3400   2900.0   25.0  500.0
3  PhysicalWindows2     3600   3300.0   83.8  300.0
4  PhysicalWindows2     3600   3500.0   65.0  100.0
5  PhysicalWindows2     3600   2900.0   10.0  700.0
6    PhysicalLinux1     2600      NaN    NaN    NaN
7    PhysicalLinux1     2600      NaN    NaN    NaN
8              Test     2700   2700.0   30.0    0.0
Basically, I would like to average the Power for each server, but only over the rows where diff is at its minimum. For example, the 'PhysicalWindows1' server has 3 rows: two have a diff of 100 and one has a diff of 500. Since two rows share the minimum diff of 100, I would like to average their Power values of 58.5 and 63.0. For 'PhysicalWindows2' only one row has the least diff, so we return the Power of that one row, 65.0. If diff is NaN, return NaN, and if there is only one match, return the Power of that one match.
My resultant dataframe would look like this:
             Server  Clock 1          Power
0  PhysicalWindows1     3400  (58.5+63.0)/2
1  PhysicalWindows2     3600           65.0
2    PhysicalLinux1     2600            NaN
3              Test     2700           30.0
Use groupby with dropna=False so that PhysicalLinux1 is not removed, and sort=True to sort the index levels (lowest diff on top), then drop_duplicates to keep only one row per (Server, Clock 1):
out = (df.groupby(['Server', 'Clock 1', 'diff'], dropna=False, sort=True)['Power']
         .mean().droplevel('diff').reset_index().drop_duplicates(['Server', 'Clock 1']))
# Output
             Server  Clock 1  Power
0    PhysicalLinux1     2600    NaN
1  PhysicalWindows1     3400  60.75
3  PhysicalWindows2     3600  65.00
6              Test     2700  30.00
Here is a possible solution using df.groupby() and pd.merge():
grp_df = df.groupby(['Server', 'diff'])['Power'].mean().reset_index()
grp_df = grp_df.groupby('Server').first().reset_index()  # groupby sorts by diff, so first() picks the minimum
grp_df = grp_df.rename(columns={'diff': 'min_diff', 'Power': 'Power_avg'})
df_out = (pd.merge(df[['Server', 'Clock 1']].drop_duplicates(), grp_df, on='Server', how='left')
            .drop(['min_diff'], axis=1))
print(df_out)
             Server  Clock 1  Power_avg
0  PhysicalWindows1     3400      60.75
1  PhysicalWindows2     3600      65.00
2    PhysicalLinux1     2600        NaN
3              Test     2700      30.00
Use a double groupby: first groupby.transform to mask out the Power values whose diff is not the group minimum, then groupby.agg to aggregate:
m = df.groupby('Server')['diff'].transform('min').eq(df['diff'])
(df.assign(Power=df['Power'].where(m))
   .groupby('Server', sort=False, as_index=False)
   .agg({'Clock 1': 'first', 'Power': 'mean'})
)
Output:
             Server  Clock 1  Power
0  PhysicalWindows1     3400  60.75
1  PhysicalWindows2     3600  65.00
2    PhysicalLinux1     2600    NaN
3              Test     2700  30.00
import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
        "Server": ['PhysicalWindows1', 'PhysicalWindows1', 'PhysicalWindows1', 'PhysicalWindows2',
                   'PhysicalWindows2', 'PhysicalWindows2', 'PhysicalLinux1', 'PhysicalLinux1', 'Test'],
        "Clock 1": [3400, 3400, 3400, 3600, 3600, 3600, 2600, 2600, 2700],
        "Clock 2": [3300.0, 3500.0, 2900.0, 3300.0, 3500.0, 2900.0, np.nan, np.nan, 2700.0],
        "Power": [58.5, 63.0, 25.0, 83.8, 65.0, 10.0, np.nan, np.nan, 30.0],
        "diff": [100.0, 100.0, 500.0, 300.0, 100.0, 700.0, np.nan, np.nan, 0.0]
    }
)

# keep only the rows with the minimum diff per server, then aggregate
r = (df.groupby(['Server'])
       .apply(lambda d: d[d['diff'] == d['diff'].min()])
       .reset_index(drop=True)
       .groupby(['Server'])
       .agg({"Clock 1": 'mean', "Power": 'mean', "diff": 'first'})
       .reset_index()
    )
# re-attach the all-NaN servers that the min-diff filter dropped
r = (r.append(df[df['diff'].isnull()].drop_duplicates())
      .drop(['Clock 2', 'diff'], axis=1)
      .reset_index(drop=True)
    )
print(r)
             Server  Clock 1  Power
0  PhysicalWindows1   3400.0  60.75
1  PhysicalWindows2   3600.0  65.00
2              Test   2700.0  30.00
3    PhysicalLinux1   2600.0    NaN
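Note that DataFrame.append was removed in pandas 2.0; on recent versions the same step can be written with pd.concat (a sketch of the equivalent call):
r = (pd.concat([r, df[df['diff'].isnull()].drop_duplicates()])
       .drop(['Clock 2', 'diff'], axis=1)
       .reset_index(drop=True))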
def function1(dd: pd.DataFrame):
    # keep only the rows where diff equals the group's minimum
    dd1 = dd.loc[dd.loc[:, "diff"].eq(dd.loc[:, "diff"].min())]
    return pd.DataFrame({"Clock 1": dd1[["Clock 1"]].min().squeeze(),
                         "Power": dd1.Power.mean()}, index=[dd.name])

df.groupby('Server', sort=False).apply(function1).droplevel(1)
Output:
                  Clock 1  Power
Server
PhysicalWindows1   3400.0  60.75
PhysicalWindows2   3600.0  65.00
PhysicalLinux1        NaN    NaN
Test               2700.0  30.00
For PhysicalLinux1 every diff is NaN, so dd1 comes out empty and both aggregates are NaN.

How to set new values to a row based on the same substring from another column?

This is an example of a bigger dataset. Imagine I have a dataframe like this:
df = pd.DataFrame({"CLASS": ["AG_1", "AG_2", "AG_3", "MAR", "GOM"],
                   "TOP": [200, np.nan, np.nan, 600, np.nan],
                   "BOT": [230, 250, 380, np.nan, 640]})
df
Out[49]:
  CLASS    TOP    BOT
0  AG_1  200.0  230.0
1  AG_2    NaN  250.0
2  AG_3    NaN  380.0
3   MAR  600.0    NaN
4   GOM    NaN  640.0
I would like to set the TOP values on lines 1 and 2. My condition is that each of these values must be the BOT value from the row above, provided the CLASS begins with the same substring "AG". The output should look like this:
  CLASS    TOP    BOT
0  AG_1  200.0  230.0
1  AG_2  230.0  250.0
2  AG_3  250.0  380.0
3   MAR  600.0    NaN
4   GOM    NaN  640.0
Could anyone show me how to do that?
generic case: filling all groups
I would use fillna with groupby.shift, using a custom group key that extracts the leading substring from CLASS with str.extract:
group = df['CLASS'].str.extract('([^_]+)', expand=False)
df['TOP'] = df['TOP'].fillna(df.groupby(group)['BOT'].shift())
Output:
  CLASS    TOP    BOT
0  AG_1  200.0  230.0
1  AG_2  230.0  250.0
2  AG_3  250.0  380.0
3   MAR  600.0    NaN
4   GOM    NaN  640.0
Intermediate group:
0     AG
1     AG
2     AG
3    MAR
4    GOM
Name: CLASS, dtype: object
special case: only the AG group
m = df['CLASS'].str.startswith('AG')
df.loc[m, 'TOP'] = df.loc[m, 'TOP'].fillna(df.loc[m, 'BOT'].shift())
Example:
   CLASS    TOP    BOT
0   AG_1  200.0  230.0
1   AG_2  230.0  250.0
2   AG_3  250.0  380.0
3  MAR_1  600.0  601.0
4  MAR_2    NaN    NaN   # this is not filled
5    GOM    NaN  640.0
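If the group key is always the text before the first underscore, an equivalent sketch of the generic case without a regex:
group = df['CLASS'].str.split('_').str[0]
df['TOP'] = df['TOP'].fillna(df.groupby(group)['BOT'].shift())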

Stick the dataframe rows and columns into one row + replace the NaN values with the day before or after

I have a df and I want to stick its values together in one row. First I want to select a specific time span and replace the NaN values with the value from the day before. Here is a simple example: I only want the values from 2020, stuck together ordered by time, with each NaN replaced by the value from the day before.
df = pd.DataFrame()
df['day'] = ['2020-01-01', '2019-01-01', '2020-01-02', '2020-01-03', '2018-01-01', '2020-01-15', '2020-03-01', '2020-02-01', '2017-01-01']
df['value_1'] = [1, np.nan, 32, 48, 5, -1, 5, 10, 2]
df['value_2'] = [np.nan, 121, 23, 34, 15, 21, 15, 12, 39]
df
          day  value_1  value_2
0  2020-01-01      1.0      NaN
1  2019-01-01      NaN    121.0
2  2020-01-02     32.0     23.0
3  2020-01-03     48.0     34.0
4  2018-01-01      5.0     15.0
5  2020-01-15     -1.0     21.0
6  2020-03-01      5.0     15.0
7  2020-02-01     10.0     12.0
8  2017-01-01      2.0     39.0
The output:
   _1   _2  _3  _4  _5  _6  _7  _8  _9  _10  _11  _12
0   1  121   1  23  48  34  -1  21  10   12   -1   21
I have tried to use the following code, but it does not solve my problem:
val_cols = df.filter(like='value_').columns
output = (df.pivot('day', val_cols)
            .groupby(level=0, axis=1)
            .apply(lambda x: x.ffill(axis=1).bfill(axis=1))
            .sort_index(axis=1, level=1))
I don't know exactly what the output is supposed to be, but I think this should do at least part of what you're trying to do:
df['day'] = pd.to_datetime(df['day'], format='%Y-%m-%d')
df = df.sort_values(by=['day'])
filter_2020 = df['day'].dt.year == 2020
val_cols = df.filter(like='value_').columns
df.loc[filter_2020, val_cols] = df.loc[:, val_cols].ffill().loc[filter_2020]
print(df)
          day  value_1  value_2
8  2017-01-01      2.0     39.0
4  2018-01-01      5.0     15.0
1  2019-01-01      NaN    121.0
0  2020-01-01      1.0    121.0
2  2020-01-02     32.0     23.0
3  2020-01-03     48.0     34.0
5  2020-01-15     -1.0     21.0
7  2020-02-01     10.0     12.0
6  2020-03-01      5.0     15.0
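To then stick the filled 2020 rows into one wide row, a sketch (assuming the goal is the per-day value_1/value_2 pairs laid out left to right; the column count follows the data, here 6 days x 2 values = 12 columns):
flat = df.loc[filter_2020, val_cols].to_numpy().ravel()  # interleave value_1, value_2 per day
out = pd.DataFrame([flat], columns=[f"_{i}" for i in range(1, len(flat) + 1)])
print(out)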

Take row mean of sorted columns for the first 4 that are not NA (python)

I have a dataframe (df) with populations as columns, instances as rows, and frequencies as entries (see attached screenshot; there are about 2.3M rows and 40 columns).
I have a way of sorting populations by their geographical distance. What I would like to do is take the row mean for the top four closest populations that do not have frequency as NA.
If there weren't any NAs, I would do something like:
closest = get_four_closest_pops(focalpop)  # get a list of the four closest pops
whatiwant = df[closest].mean(axis=1)  # doing it this way does not take NAs into account
If I loop through the rows it will take way longer than I want it to, even if I use ipyparallel. The pandas.DataFrame.mean() method is quite quick. I was wondering if there isn't a simple method where I could supply a list of all pops ordered by proximity to a focalpop and then take the four closest non-NAs.
example
For instance, I would like a pd.Series (or a dict) returned that has the row names and the mean of the first four pops that are not NA.
For row jcf7190000000000-77738 in the screenshot (if, for instance, the columns are ordered by proximity to a pop that is not shown) I would want the output 0.666625 (i.e. from DF_p1, DF_p24, DF_p25, DF_p26).
For row jcf7190000000000-77764 I would want the output 0.771275 (i.e. from DF_p1, DF_p24, DF_p25, DF_p26).
For row jcf7190000000004-54418 I would want the output 0.28651 (i.e. from DF_1, DF_2, DF_23, DF_24).
For future questions, see How to Ask and How to create a Minimal, Reproducible Example (just a suggestion). Now, with a dummy df, since you didn't add a sample of your dataframe to work with, and taking into account that your columns are already ordered by proximity, you could try this:
import pandas as pd
from math import isnan
from statistics import mean

# creation of a dummy df
df = pd.DataFrame({'df_1': ['700', 'ABC', '500', 'XYZ', '1200', 'DDD', '150', '350', '400', '5000', '100'],
                   'df_2': ['DDD', '150', '350', '400', '5000', '500', 'XYZ', '1200', 'DDD', '150', '350'],
                   'df_3': ['700', 'ABC', '500', 'XYZ', '1200', 'DDD', '150', '350', '400', '5000', '100'],
                   'df_4': ['DDD', '150', '350', '400', '5000', '500', 'XYZ', '1200', 'DDD', '150', '350'],
                   'df_5': ['700', 'ABC', '500', 'XYZ', '1200', 'DDD', '150', '350', '400', '5000', '100'],
                   'df_6': ['DDD', '150', '350', '400', '5000', '500', 'XYZ', '1200', 'DDD', '150', '350'],
                   'df_7': ['700', 'ABC', '500', 'XYZ', '1200', 'DDD', '150', '350', '400', '5000', '100'],
                   'df_8': ['DDD', '150', '350', '400', '5000', '500', 'XYZ', '1200', 'DDD', '150', '350'],
                   'df_9': ['DDD', '150', '350', '400', '5000', '500', 'XYZ', '1200', 'DDD', '150', '350'],
                   'df_10': ['700', 'ABC', '500', 'XYZ', '1200', 'DDD', '150', '350', '400', '5000', '100']
                   })
df = df.apply(pd.to_numeric, errors='coerce')
df.index.name = 'instances'

# approach to a solution: per record, take the first four non-NaN values
def func(row):
    meanx = mean([float(x) for x in list(row)[1:] if not isnan(x)][:4])
    columns = [df.columns[i] for i, x in enumerate(list(row)[1:]) if not isnan(x)][:4]
    return {'columns': columns, 'mean': meanx}

dc = {row[0]: func(row) for row in df.to_records()}

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
print(df)
print(dc)
Output:
df
             df_1    df_2    df_3    df_4    df_5    df_6    df_7    df_8    df_9   df_10
instances
0           700.0     NaN   700.0     NaN   700.0     NaN   700.0     NaN     NaN   700.0
1             NaN   150.0     NaN   150.0     NaN   150.0     NaN   150.0   150.0     NaN
2           500.0   350.0   500.0   350.0   500.0   350.0   500.0   350.0   350.0   500.0
3             NaN   400.0     NaN   400.0     NaN   400.0     NaN   400.0   400.0     NaN
4          1200.0  5000.0  1200.0  5000.0  1200.0  5000.0  1200.0  5000.0  5000.0  1200.0
5             NaN   500.0     NaN   500.0     NaN   500.0     NaN   500.0   500.0     NaN
6           150.0     NaN   150.0     NaN   150.0     NaN   150.0     NaN     NaN   150.0
7           350.0  1200.0   350.0  1200.0   350.0  1200.0   350.0  1200.0  1200.0   350.0
8           400.0     NaN   400.0     NaN   400.0     NaN   400.0     NaN     NaN   400.0
9          5000.0   150.0  5000.0   150.0  5000.0   150.0  5000.0   150.0   150.0  5000.0
10          100.0   350.0   100.0   350.0   100.0   350.0   100.0   350.0   350.0   100.0
dc
{0: {'columns': ['df_1', 'df_3', 'df_5', 'df_7'], 'mean': 700.0}, 1: {'columns': ['df_2', 'df_4', 'df_6', 'df_8'], 'mean': 150.0}, 2: {'columns': ['df_1', 'df_2', 'df_3', 'df_4'], 'mean': 425.0}, 3: {'columns': ['df_2', 'df_4', 'df_6', 'df_8'], 'mean': 400.0}, 4: {'columns': ['df_1', 'df_2', 'df_3', 'df_4'], 'mean': 3100.0}, 5: {'columns': ['df_2', 'df_4', 'df_6', 'df_8'], 'mean': 500.0}, 6: {'columns': ['df_1', 'df_3', 'df_5', 'df_7'], 'mean': 150.0}, 7: {'columns': ['df_1', 'df_2', 'df_3', 'df_4'], 'mean': 775.0}, 8: {'columns': ['df_1', 'df_3', 'df_5', 'df_7'], 'mean': 400.0}, 9: {'columns': ['df_1', 'df_2', 'df_3', 'df_4'], 'mean': 2575.0}, 10: {'columns': ['df_1', 'df_2', 'df_3', 'df_4'], 'mean': 225.0}}
Note: it should be clarified that the program assumes every instance has at least 4 non-NaN values (with fewer, it averages however many exist; with none, statistics.mean raises an error).
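For the full 2.3M-row frame a row-wise Python loop will be slow. A vectorized sketch (assuming closest_order is a hypothetical list of all column names ordered by proximity to the focal pop):
ordered = df[closest_order]  # hypothetical: columns reordered by proximity
first4 = ordered.notna() & ordered.notna().cumsum(axis=1).le(4)  # first <= 4 non-NA cells per row
whatiwant = ordered.where(first4).mean(axis=1)  # NaN-skipping row mean over those cells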

Build rows in Python Dataframe, based on values in previous column

My input looks like this:
import datetime as dt
import pandas as pd

some_money = [34, 42, 300, 450, 550]
df = pd.DataFrame({'TIME': ['2020-01', '2019-12', '2019-11', '2019-10', '2019-09'],
                   'MONEY': some_money})
df
Producing a five-row frame with TIME and MONEY columns. I want to add 3 more columns, m-1, m-2 and m-3, each getting the MONEY value of the previous months, i.e. from the rows below (shown in the original as a color-coded screenshot).
This is what I have tried:
prev_period_money = ["m-1", "m-2", "m-3"]
for m in prev_period_money:
    df[m] = df["MONEY"] - 10  # well, it "works", but it just gives df["MONEY"] - 10...
The TIME column is already sorted, so one should not have to care about it. (But it would be great if someone showed the "magic" of actually using it.)
For pandas 0.24+, use fill_value=0 in Series.shift, which also keeps the new columns as proper integers:
for x in range(1, 4):
    df[f"m-{x}"] = df["MONEY"].shift(periods=-x, fill_value=0)

print(df)
      TIME  MONEY  m-1  m-2  m-3
0  2020-01     34   42  300  450
1  2019-12     42  300  450  550
2  2019-11    300  450  550    0
3  2019-10    450  550    0    0
4  2019-09    550    0    0    0
For pandas below 0.24 it is necessary to replace the missing values and convert back to integers:
for x in range(1, 4):
    df[f"m-{x}"] = df["MONEY"].shift(periods=-x).fillna(0).astype(int)
It is quite easy if you use shift. That would give you the desired output:
df["m-1"] = df["MONEY"].shift(periods=-1)
df["m-2"] = df["MONEY"].shift(periods=-2)
df["m-3"] = df["MONEY"].shift(periods=-3)
df = df.fillna(0)
This works only if the frame is ordered; otherwise you have to order it first.
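If the ordering is not guaranteed, a sketch (assuming TIME always uses the '%Y-%m' format) that sorts by the parsed TIME column before shifting:
df['TIME'] = pd.to_datetime(df['TIME'], format='%Y-%m')
df = df.sort_values('TIME', ascending=False).reset_index(drop=True)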
My suggestion: use a list comprehension with the shift function to get your three columns, concat them along the columns, and concatenate that again to the original dataframe:
(pd.concat([df, pd.concat([df.MONEY.shift(-i) for i in range(1, 4)], axis=1)],
           axis=1)
   .fillna(0)
)
      TIME  MONEY  MONEY  MONEY  MONEY
0  2020-01     34   42.0  300.0  450.0
1  2019-12     42  300.0  450.0  550.0
2  2019-11    300  450.0  550.0    0.0
3  2019-10    450  550.0    0.0    0.0
4  2019-09    550    0.0    0.0    0.0
import pandas as pd

some_money = [34, 42, 300, 450, 550]
df = pd.DataFrame({'TIME': ['2020-01', '2019-12', '2019-11', '2019-10', '2019-09'],
                   'MONEY': some_money})

prev_period_money = ["m-1", "m-2", "m-3"]
count = 1
for m in prev_period_money:
    # MONEY shifted up by `count` rows; the missing tail becomes NaN
    df[m] = df['MONEY'].iloc[count:].reset_index(drop=True)
    count += 1
df = df.fillna(0)
Output:
      TIME  MONEY    m-1    m-2    m-3
0  2020-01     34   42.0  300.0  450.0
1  2019-12     42  300.0  450.0  550.0
2  2019-11    300  450.0  550.0    0.0
3  2019-10    450  550.0    0.0    0.0
4  2019-09    550    0.0    0.0    0.0
