I'm trying to pivot my dataframe so that there is a single row and a single cell for each summary X metric comparison. I have tried pivoting this, but can't figure out a sensible index column.
Here is my current output.
Does anyone know how to achieve my expected output?
To reproduce:
import pandas as pd
df = pd.DataFrame({'summary': {0: 'mean',
1: 'stddev',
2: 'mean',
3: 'stddev',
4: 'mean',
5: 'stddev'},
'metric': {0: 'A', 1: 'A', 2: 'B', 3: 'B', 4: 'C', 5: 'C'},
'value': {0: '2.0',
1: '1.5811388300841898',
2: '0.4',
3: '0.5477225575051661',
4: None,
5: None}})
Remove missing values by DataFrame.dropna, join columns together, convert to index and transpose by DataFrame.T:
df = df.dropna(subset=['value'])
df['g'] = df['summary'] + '_' + df['metric']
df = df.set_index('g')[['value']].T.reset_index(drop=True).rename_axis(None, axis=1)
print (df)
mean_A stddev_A mean_B stddev_B
0 2.0 1.5811388300841898 0.4 0.5477225575051661
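As a self-contained variation on the same idea, here is a minimal sketch that leaves the original frame untouched and builds the single row from a Series. The names clean, labels and wide are my own and are not part of the answer above; the values stay as strings, as in the question's data.

```python
import pandas as pd

df = pd.DataFrame({
    'summary': ['mean', 'stddev', 'mean', 'stddev', 'mean', 'stddev'],
    'metric': ['A', 'A', 'B', 'B', 'C', 'C'],
    'value': ['2.0', '1.5811388300841898', '0.4', '0.5477225575051661', None, None],
})

# Drop rows without a value, then build the combined "summary_metric" labels.
clean = df.dropna(subset=['value'])
labels = clean['summary'] + '_' + clean['metric']

# A Series indexed by the labels becomes a single row after to_frame().T.
wide = (pd.Series(clean['value'].to_numpy(), index=labels.to_numpy(), name=0)
          .to_frame()
          .T)
print(wide)
```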
I have a df such as
Letter | Stats
B 0
B 1
C 22
B 0
C 0
B 3
How can I filter for a value in the Letter column and also then convert the stats column for that value into an array?
Basically want to filter for B and convert the Stats column to an array, Thanks!
Here is one way to do it:
# the function receives the DataFrame and a letter as parameters
# and returns the Stats values for that letter as a list
def grp(df, letter):
    return df.loc[df['Letter'].eq(letter), 'Stats'].tolist()
# pass the dataframe, and the letter
result=grp(df,'B')
print(result)
[0, 1, 0, 3]
data used
data ={'Letter': {0: 'B', 1: 'B', 2: 'C', 3: 'B', 4: 'C', 5: 'B'},
'Stats': {0: 0, 1: 1, 2: 22, 3: 0, 4: 0, 5: 3}}
df=pd.DataFrame(data)
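If a NumPy array is wanted rather than a Python list, a minimal sketch along the same lines could use Series.to_numpy(). The variable name arr is my own choice for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Letter': ['B', 'B', 'C', 'B', 'C', 'B'],
                   'Stats': [0, 1, 22, 0, 0, 3]})

# Select the rows and the column in a single .loc call,
# then convert the resulting Series to a NumPy array.
arr = df.loc[df['Letter'].eq('B'), 'Stats'].to_numpy()
print(arr)
```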
Although I believe the solution proposed by @Naveed is enough for this problem, one small extension could be suggested.
If you would like to get the result as a pandas Series and compute some statistics on it:
data ={'Letter': {0: 'B', 1: 'B', 2: 'C', 3: 'B', 4: 'C', 5: 'B'},
'Stats': {0: 0, 1: 1, 2: 22, 3: 0, 4: 0, 5: 3}}
df = pd.DataFrame(data)
letter = 'B'
ser = pd.Series(name=letter, data=df.loc[df['Letter'].eq(letter), 'Stats'].values)
print(f"Max value: {ser.max()} | Min value: {ser.min()} | Median value: {ser.median()}")  # etc.
Output:
Max value: 3 | Min value: 0 | Median value: 0.5
given the following df:
data = {'identifier': {0: 'a',
1: 'a',
3: 'b',
4: 'b',
5: 'c'},
'gt_50': {0: 1, 1: 1, 3: 0, 4: 0, 5: 0},
'gt_10': {0: 1, 1: 1, 3: 1, 4: 1, 5: 1}}
df = pd.DataFrame(data)
I want to find the number of unique values in the column "identifier" for each column that starts with "gt_", counting only the rows where the value is 1.
Expected output:
- gt_50 1
- gt_10 3
I could make a for loop and filter the frame in each loop on one gt column and then count the uniques but I think it's not very clean.
Is there a way to do this in a clean way?
Use DataFrame.melt on the filtered gt_ columns to unpivot, then keep the rows equal to 1 with DataFrame.query, and finally count the unique values with SeriesGroupBy.nunique:
out = (df.melt('identifier', value_vars=df.filter(regex='^gt_').columns)
.query('value == 1')
.groupby('variable')['identifier']
.nunique())
print (out)
variable
gt_10 3
gt_50 1
Name: identifier, dtype: int64
Or:
s = df.set_index('identifier').filter(regex='^gt_').stack()
out = s[s.eq(1)].reset_index().groupby('level_1')['identifier'].nunique()
print (out)
level_1
gt_10 3
gt_50 1
Name: identifier, dtype: int64
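Another option that skips the reshape is sketched below: build a boolean mask per gt_ column and count the unique identifiers under each mask. This is only an illustration under the question's sample data, not part of the answer above, and the result keeps the original column order rather than the sorted order shown there.

```python
import pandas as pd

df = pd.DataFrame({'identifier': ['a', 'a', 'b', 'b', 'c'],
                   'gt_50': [1, 1, 0, 0, 0],
                   'gt_10': [1, 1, 1, 1, 1]})

# One boolean column per gt_ column: True where the value is 1.
masks = df.filter(regex='^gt_').eq(1)

# For each mask, count the distinct identifiers among the matching rows.
out = masks.apply(lambda m: df.loc[m, 'identifier'].nunique())
print(out)
```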
The above table is the entry data. I am trying to get the total sum of points for each student, the maximum point for each student, and the name of the subject in which that maximum was scored.
The below table is the result. What is the most efficient way to do this with groupby?
Use groupby() with named aggregation:
resdf=df.groupby('Name').agg(Sum=('Point','sum'),MaxSub=('Sub','last'),Point=('Point','max')).reset_index()
Output of resdf:
Name Sum MaxSub Point
0 A 210 Socio 90
1 B 115 Com 70
2 C 150 Eng 90
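Note that 'last' returns the subject of the final row in each group, which matches the highest score here only because of how the sample rows happen to be ordered. A minimal sketch that does not rely on row order, using idxmax to locate the top-scoring row per student, could look like this (the names top, totals and res are my own):

```python
import pandas as pd

df = pd.DataFrame({'SN': [1, 2, 3, 4, 5, 6, 7],
                   'Name': ['A', 'A', 'A', 'B', 'B', 'C', 'C'],
                   'Sub': ['Math', 'Eng', 'Socio', 'Geo', 'Com', 'Com', 'Eng'],
                   'Point': [70, 50, 90, 45, 70, 60, 90]})

# Row labels of the highest Point within each Name group.
top = df.loc[df.groupby('Name')['Point'].idxmax(), ['Name', 'Sub', 'Point']]

# Per-student totals, indexed by Name.
totals = df.groupby('Name')['Point'].sum()

# Attach the totals, then rename and reorder the columns.
res = (top.assign(Sum=top['Name'].map(totals))
          .rename(columns={'Sub': 'MaxSub'})
          [['Name', 'Sum', 'MaxSub', 'Point']]
          .reset_index(drop=True))
print(res)
```

On ties, idxmax keeps the first row that reaches the maximum.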
Let's try groupby transform to get the sum for each group, sort_values to ensure that the row with the max points is at the end of each group, then drop_duplicates with keep='last' to keep only the max row:
import pandas as pd
df = pd.DataFrame({'SN': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7},
'Name': {0: 'A', 1: 'A', 2: 'A', 3: 'B', 4: 'B', 5: 'C',
6: 'C'},
'Sub': {0: 'Math', 1: 'Eng', 2: 'Socio', 3: 'Geo', 4: 'Com',
5: 'Com', 6: 'Eng'},
'Point': {0: 70, 1: 50, 2: 90, 3: 45, 4: 70, 5: 60, 6: 90}})
# Get Sum For each Group
df['Sum'] = df.groupby('Name')['Point'].transform('sum')
# Sort Values so Highest Point value is the last in each group
df = df.sort_values(['Name', 'Point'])
# Keep Only the last from each Group
df = df.drop_duplicates('Name', keep='last').reset_index(drop=True)
# Re-order and rename Columns
df = df[['Name', 'Sum', 'Sub', 'Point']].rename(columns={'Sub': 'Max Sub'})
print(df)
df:
Name Sum Max Sub Point
0 A 210 Socio 90
1 B 115 Com 70
2 C 150 Eng 90
Since "most efficient way" was noted in the question, here's a perfplot:
Which option is more performant depends on whether the DataFrame has more or fewer than 10,000 rows.
import string
import numpy as np
import pandas as pd
import perfplot
def gen_data(n):
    return pd.DataFrame({
        'Name': np.random.choice(list(string.ascii_uppercase)[:max(3, n // 2)],
                                 size=n),
        'Sub': np.random.choice(['Math', 'Eng', 'Socio', 'Com'], size=n),
        'Point': np.random.randint(50, 90, size=n)
    }).sort_values('Name') \
        .reset_index(drop=True) \
        .reset_index() \
        .rename(columns={'index': 'SN'}) \
        .assign(SN=lambda s: s.SN + 1)
def anurag_dabas(df):
    return df.groupby('Name').agg(Sum=('Point', 'sum'),
                                  MaxSub=('Sub', 'last'),
                                  Point=('Point', 'max'))
def henry_ecker(df):
    df['Sum'] = df.groupby('Name')['Point'].transform('sum')
    return df.sort_values(['Name', 'Point']) \
        .drop_duplicates('Name', keep='last') \
        .reset_index(drop=True)[['Name', 'Sum', 'Sub', 'Point']] \
        .rename(columns={'Sub': 'Max Sub'})
if __name__ == '__main__':
    out = perfplot.bench(
        setup=gen_data,
        kernels=[
            anurag_dabas,
            henry_ecker
        ],
        labels=['Anurag Dabas', 'Henry Ecker'],
        n_range=[2 ** k for k in range(25)],
        equality_check=None
    )
    out.save('perfplot_results.png', transparent=False)
I have the following dataframe:
df = pd.DataFrame({'Variable': {0: 'Abs', 1: 'rho0', 2: 'cp', 3: 'K0'},
'Value': {0: 0.585, 1: 8220.000, 2: 435.000, 3: 11.400},
'Description': {0: 'foo', 1: 'foo', 2: 'foo', 3: 'foo'}})
I would like to reshape it like this:
df2 = pd.DataFrame({'Abs': {0: 0.585},
'rho0': {0: 8220.000},
'cp': {0: 435.000},
'K0': {0: 11.400}})
How can I do it?
df3 = df.pivot_table(columns='Variable', values='Value')
print(df3)
Variable Abs K0 cp rho0
Value 0.585 11.4 435.0 8220.0
gets very close to what I was looking for, but I'd rather do without the first column Variable, if at all possible.
You can try rename_axis():
df3 = df.pivot_table(values='Value', columns='Variable').rename_axis(None, axis=1)
Additionally, if you want to reset the index:
df3 = df.pivot_table(values='Value', columns='Variable').rename_axis(None, axis=1).reset_index(drop=True)
df3.to_dict()
# Output
{'Abs': {0: 0.585},
'K0': {0: 11.4},
'cp': {0: 435.0},
'rho0': {0: 8220.0}}
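pivot_table sorts the new column labels alphabetically, which is why K0 comes before cp above. If the original row order (Abs, rho0, cp, K0) should be kept, one possible sketch is to set the index and transpose instead. This is offered as an alternative, not as what the answer above does.

```python
import pandas as pd

df = pd.DataFrame({'Variable': ['Abs', 'rho0', 'cp', 'K0'],
                   'Value': [0.585, 8220.0, 435.0, 11.4],
                   'Description': ['foo'] * 4})

# Use Variable as the index, keep only Value, and flip rows into columns.
df2 = (df.set_index('Variable')[['Value']]
         .T
         .reset_index(drop=True)       # replace the 'Value' row label with 0
         .rename_axis(None, axis=1))   # drop the 'Variable' axis name
print(df2)
```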
I have a pandas dataframe with a two-level column index. It's read in from a spreadsheet where the author used a lot of whitespace to accomplish things like alignment (for example, one column is called 'Tank #').
I've been able to remove the whitespace on the levels individually...
level0 = raw.columns.levels[0].str.replace(r'\s', '', regex=True)
level1 = raw.columns.levels[1].str.replace(r'\s', '', regex=True)
raw.columns = raw.columns.set_levels([level0, level1])
...but I'm curious if there is a way to do it without having to change each individual level one at a time.
I tried raw.columns.set_levels(raw.columns.str.replace(r'\s', '', regex=True))
but got AttributeError: Can only use .str accessor with Index, not MultiIndex.
Here is a small sample subset of the data-- my best attempt at SO table formatting :D, followed by a picture where I've highlighted in yellow the indices as received.
     Run Info  Run Info  Run Data   Run Data
     run #     Tank #    Step A ph  conc. %
0    6931      5         5.29       33.14
1    6932      1         5.28       33.13
2    6933      2         5.32       33.40
3    6934      3         5.19       32.98
Thanks for any insight!
Edit: adding to_dict()
df.to_dict()
Out[5]:
{'Unnamed: 0': {0: nan, 1: 0.0, 2: 1.0, 3: 2.0, 4: 3.0, 5: 4.0},
'Run Info': {0: 'run #',
1: '6931',
2: '6932',
3: '6933',
4: '6934',
5: '6935'},
'Run Info.1': {0: 'Tank #',
1: '5',
2: '1',
3: '2',
4: '3',
5: '4'},
'Run Data': {0: 'Step A\npH',
1: '5.29',
2: '5.28',
3: '5.32',
4: '5.19',
5: '5.28'},
'Run Data.1': {0: 'concentration',
1: '33.14',
2: '33.13',
3: '33.4',
4: '32.98',
5: '32.7'}}
How about rename:
import re
df.rename(columns=lambda x: re.sub(r'\s+', ' ', x.strip()), inplace=True)
If you don't want to keep any of the spaces, you can just replace ' ' with ''.
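To my understanding, when the columns are a MultiIndex and no level is given, rename applies the callable to the labels of every level, so the same call covers both levels at once. A minimal sketch with a hypothetical two-level header (the labels below are made up to resemble the question's, not taken from the real spreadsheet):

```python
import re
import pandas as pd

# Hypothetical two-level header with stray whitespace.
cols = pd.MultiIndex.from_tuples([
    ('Run  Info', ' run #'),
    ('Run  Info', 'Tank   #'),
    ('Run Data ', 'Step A\npH'),
])
raw = pd.DataFrame([[6931, 5, 5.29]], columns=cols)

# The callable is applied to each label in each level of the MultiIndex.
clean = raw.rename(columns=lambda s: re.sub(r'\s+', ' ', s.strip()))
print(clean.columns.tolist())
```

This returns a new frame instead of modifying raw in place; pass '' as the replacement to remove the whitespace entirely.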