Groupby column name and add results as additional columns - python

I have a dataframe like this:
import pandas as pd
df = pd.DataFrame({
'stuff_1_var_1': range(5),
'stuff_1_var_2': range(2, 7),
'stuff_2_var_1': range(3, 8),
'stuff_2_var_2': range(5, 10)
})
stuff_1_var_1 stuff_1_var_2 stuff_2_var_1 stuff_2_var_2
0 0 2 3 5
1 1 3 4 6
I would like to groupby based on the column headers and then add the mean and median of each group as new columns. So my expected output looks like this:
stuff_1_var_mean stuff_1_var_median stuff_2_var_mean stuff_2_var_median
0 1 1 4 4
1 2 2 5 5
Brief explanation:
we have two groups stuff_1_var_ and stuff_2_var_ for which would calculate the mean and median per row. So, e.g. for stuff_1_var_ it would be:
# values from stuff_1_var_1 and stuff_1_var_2
(0 + 2) / 2 = 1 and
( 1 + 3) / 2 = 2
The values are then added as a new column stuff_1_var_mean; analogue for meadian and stuff_2_var_.
I got until:
df = df.T
pattern = df.index.str.extract('(^stuff_\d_var_)', expand=False)
dfgb = df.groupby(pattern).agg(['mean', 'median']).T
stuff_1_var_ stuff_2_var_
0 mean 1 4
median 1 4
1 mean 2 5
median 2 5
How can I do the final step(s)?

Your solution should be changed:
df = df.T
pattern = df.index.str.extract('(^stuff_\d_var_)', expand=False)
dfgb = df.groupby(pattern).agg(['mean', 'median']).T.unstack()
dfgb.columns = dfgb.columns.map(lambda x: f'{x[0]}{x[1]}')
print (dfgb)
stuff_1_var_mean stuff_1_var_median stuff_2_var_mean stuff_2_var_median
0 1 1 4 4
1 2 2 5 5
2 3 3 6 6
3 4 4 7 7
4 5 5 8 8
Unfortunately for axis=1 is not implemented agg, so possible solution is create mean and median separately and then concat:
dfgb = df.groupby(pattern, axis=1).agg(['mean','median'])
NotImplementedError: axis other than 0 is not supported
pattern = df.columns.str.extract('(^stuff_\d_var_)', expand=False)
g = df.groupby(pattern, axis=1)
dfgb = pd.concat([g.mean().add_suffix('mean'),
g.median().add_suffix('median')], axis=1)
dfgb = dfgb.iloc[:, [0,2,1,3]]
print (dfgb)
stuff_1_var_mean stuff_1_var_median stuff_2_var_mean stuff_2_var_median
0 1 1 4 4
1 2 2 5 5
2 3 3 6 6
3 4 4 7 7
4 5 5 8 8

Here's a way you can do:
col = 'stuff_1_var_'
use_col = [x for x in df.columns if 'stuff_1' in x]
df[f'{col}mean'] = df[use_col].mean(1)
df[f'{col}median'] = df[use_col].median(1)
col2 = 'stuff_2_var_'
use_col = [x for x in df.columns if 'stuff_2' in x]
df[f'{col2}mean'] = df[use_col].mean(1)
df[f'{col2}median'] = df[use_col].median(1)
print(df.iloc[:,-4:]) # showing last four new columns
stuff_1_var_mean stuff_1_var_median stuff_2_var_mean stuff_2_var_median
0 1.0 1.0 4.0 4.0
1 2.0 2.0 5.0 5.0
2 3.0 3.0 6.0 6.0
3 4.0 4.0 7.0 7.0
4 5.0 5.0 8.0 8.0
Ofcourse, you can put it in a function to avoid repeating the same code.

Related

Separation of the dataframes by row values

I want to split my dataframe based on the first row to generate four separate dataframes (for subgroup analysis). I have a 172x106 Excel file, where the first row consists of either a 1, 2, 3, or 4. The other 171 lines are radiomic features, which I want to copy to the ''new'' dataset. The columns do not have headernames. My data looks like the following:
{0: [4.0, 0.65555056, 0.370511262, 16.5876203, 44.76954415, 48.0, 32.984845, 49.47726751, 49.47726751, 13133.33333, 29.34869973, 0.725907513, 3708.396349, 0.282365204, 13696.0, 2.122884402, 3.039611259, 1419.058749, 1.605529827, 0.488297449], 1: [2.0, 0.82581372, 0.33201741, 20.65753167, 62.21821817, 50.59644256, 62.60990337, 55.56977596, 77.35631842, 23890.66667, 51.38065822, 0.521666786, 7689.706847, 0.321870752, 25152.0, 1.022813615, 1.360453239, 548.2156387, 0.314035581, 0.181204079]}
I wanted to use groupby, but since the column headers have no name, it makes it hard to implement. This is my current code:
import numpy as np
df = pd.read_excel(r'H:\Documenten\MATLAB\sample_file.xlsx',header=None)
Class_1=df.groupby(df.T.loc[:,0])
df_new = Class_1.get_group("1")
print(df_new)
The error I get is the following:
Traceback (most recent call last):
File "H:/PycharmProjects/RadiomicsPipeline/spearman_subgroups.py", line 5, in <module>
df_new = Class_1.get_group("1")
File "C:\Users\cpullen\AppData\Roaming\Python\Python37\site-packages\pandas\core\groupby\groupby.py", line 754, in get_group
raise KeyError(name)
KeyError: '1'
How do I implement the separation of the dataframes by row values?
I am not sure that this is the result you want. If not, please clearly show the desired output.
You can simply achieve what you want by transposing your dataframe.
import numpy as np
import pandas as pd
np.random.seed(0)
n = 5
data = np.stack((np.random.choice([1, 2, 3, 4], n), np.random.rand(n), np.random.rand(n), np.random.rand(n)), axis=0)
df = pd.DataFrame(data)
df.head():
0 1 2 3 4
0 1.000000 4.000000 2.000000 1.000000 4.000000
1 0.857946 0.847252 0.623564 0.384382 0.297535
2 0.056713 0.272656 0.477665 0.812169 0.479977
3 0.392785 0.836079 0.337396 0.648172 0.368242
df_transpose = df.transpose()
df_transpose.columns = ['group'] + list(df_transpose.columns.values[1:])
df_transpose.head():
group 1 2 3
0 1.0 0.857946 0.056713 0.392785
1 4.0 0.847252 0.272656 0.836079
2 2.0 0.623564 0.477665 0.337396
3 1.0 0.384382 0.812169 0.648172
4 4.0 0.297535 0.479977 0.368242
list_df = dict()
for group in set(df_transpose['group']):
list_df[group] = df_transpose.loc[df_transpose['group'] == group]
list_df:
{
1.0:
group 1 2 3
0 1.0 0.857946 0.056713 0.392785
3 1.0 0.384382 0.812169 0.648172,
2.0:
group 1 2 3
2 2.0 0.623564 0.477665 0.337396,
4.0:
group 1 2 3
1 4.0 0.847252 0.272656 0.836079
4 4.0 0.297535 0.479977 0.368242
}
Individual dataframes:
list_df[1.0], list_df[2.0], list_df[3.0] (which does not exist in this example), list_df[4.0]
(BTW. I defined a wrong variable name. The name should be dict_df, not list_df.)
First of all, after importing the dataframe, sort the value of the first row in order
df
Out[26]:
0 1 2 3
0 0 1 0 1
1 1 2 5 8
2 2 3 6 9
3 3 4 7 0
df = df.sort_values(by = 0, axis = 1)
Out[30]:
0 2 1 3
0 0 0 1 1
1 1 5 2 8
2 2 6 3 9
3 3 7 4 0
You should have a dataframe with ordered 1st row values. After that, you can use df.iloc to rename your column name and you will drop the first row.
df.columns = df.iloc[0]
0 0 0 1 1
0 0 0 1 1
1 1 5 2 8
2 2 6 3 9
3 3 7 4 0
df.drop(0, inplace=True)
df.drop(0)
Out[51]:
0 0 0 1 1
1 1 5 2 8
2 2 6 3 9
3 3 7 4 0
Eventually, you can do slicing based on the column name.
df_1 = df[1]
df_1
Out[56]:
0 1 1
1 2 8
2 3 9
3 4 0

How to rest a row value to the nths rows values of another dataframe

I have this two df's
df1:
lon lat
0 -60.7 -2.8333333333333335
1 -55.983333333333334 -2.4833333333333334
2 -51.06666666666667 -0.05
3 -66.96666666666667 -0.11666666666666667
4 -48.483333333333334 -1.3833333333333333
5 -54.71666666666667 -2.4333333333333336
6 -44.233333333333334 -2.6
7 -59.983333333333334 -3.15
df2:
lon lat
0 -24.109 -2.0035
1 -17.891 -1.70911
2 -14.5822 -1.7470700000000001
3 -12.8138 -1.72322
4 -14.0688 -1.5028700000000002
5 -13.8406 -1.44416
6 -12.1292 -0.671266
7 -13.8406 -0.8824270000000001
8 -15.12 -18.223
I want to rest each value of df1['lat'] with all values of df2
Something like this :
results0=df1.loc[0,'lat']-df2.loc[:,'lat']
results1=df1.loc[1,'lat']-df2.loc[:,'lat']
#etc etc....
So i tried this:
for i,j in zip(range(len(df1)), range(len(df2))):
exec(f"result{i}=df1.loc[{i},'lat']-df2.loc[{j},'lat']")
But it only gave me one result value for each result, instead of 8 values for each result.
I will appreciate any possible solution. Thanks!
You can create list of Series:
L = [df1.loc[i,'lat']-df2['lat'] for i in df1.index]
Or you can use numpy for new DataFrame:
arr = df1['lat'].to_numpy() - df2['lat'].to_numpy()[:, None]
df3 = pd.DataFrame(arr, index=df2.index, columns=df1.index)
print (df3)
0 1 2 3 4 5 \
0 -0.829833 -0.479833 1.953500 1.886833 0.620167 -0.429833
1 -1.124223 -0.774223 1.659110 1.592443 0.325777 -0.724223
2 -1.086263 -0.736263 1.697070 1.630403 0.363737 -0.686263
3 -1.110113 -0.760113 1.673220 1.606553 0.339887 -0.710113
4 -1.330463 -0.980463 1.452870 1.386203 0.119537 -0.930463
5 -1.389173 -1.039173 1.394160 1.327493 0.060827 -0.989173
6 -2.162067 -1.812067 0.621266 0.554599 -0.712067 -1.762067
7 -1.950906 -1.600906 0.832427 0.765760 -0.500906 -1.550906
8 15.389667 15.739667 18.173000 18.106333 16.839667 15.789667
6 7
0 -0.596500 -1.146500
1 -0.890890 -1.440890
2 -0.852930 -1.402930
3 -0.876780 -1.426780
4 -1.097130 -1.647130
5 -1.155840 -1.705840
6 -1.928734 -2.478734
7 -1.717573 -2.267573
8 15.623000 15.073000
Since df1 has one less row than df2
df1['lat'] = df1['lat'] - df2.loc[:df1.shape[0]-1, 'lat']
output:
0 -0.829833
1 -0.774223
2 1.697070
3 1.606553
4 0.119537
5 -0.989173
6 -1.928734
7 -2.267573
Name: lat, dtype: float64

populate missing values for multiple columns with multiple values

I have gone through the posts that are similar to filling out the multiple columns for pandas in one go, however it appears that my problem here is a little different, in the sense that I need to be able to populate a missing column value with a specific column value and be able to do that for multiple columns in one go.
Eg: I can use the commands as below individually to fill the NA's
result1_copy['BASE_B'] = np.where(pd.isnull(result1_copy['BASE_B']), result1_copy['BASE_S'], result1_copy['BASE_B'])
result1_copy['QWE_B'] = np.where(pd.isnull(result1_copy['QWE_B']), result1_copy['QWE_S'], result1_copy['QWE_B'])
However, if I were to try populating it one go, it does not work:
result1_copy['BASE_B','QWE_B'] = result1_copy['BASE_B', 'QWE_B'].fillna(result1_copy['BASE_S','QWE_S'])
Do we know why ?
Please note I have only used 2 columns here for ease of purpose, however I have 10s of columns to impute. And they are either object, float or datetime.
Is datatypes the issue here ?
You need add [] for filtered DataFrame and for align columns add rename:
d = {'BASE_S':'BASE_B', 'QWE_S':'QWE_B'}
result1_copy[['BASE_B','QWE_B']] = result1_copy[['BASE_B', 'QWE_B']]
.fillna(result1_copy[['BASE_S','QWE_S']]
.rename(columns=d))
More dynamic solution:
L = ['BASE_','QWE_']
orig = ['{}B'.format(x) for x in L]
new = ['{}S'.format(x) for x in L]
d = dict(zip(new, orig))
result1_copy[orig] = (result1_copy[orig].fillna(result1_copy[new]
.rename(columns=d)))
Another solution if match columns with B and S:
for x in ['BASE_','QWE_']:
result1_copy[x + 'B'] = result1_copy[x + 'B'].fillna(result1_copy[x + 'S'])
Sample:
result1_copy = pd.DataFrame({'A':list('abcdef'),
'BASE_B':[np.nan,5,4,5,5,np.nan],
'QWE_B':[np.nan,8,9,4,2,np.nan],
'BASE_S':[1,3,5,7,1,0],
'QWE_S':[5,3,6,9,2,4],
'F':list('aaabbb')})
print (result1_copy)
A BASE_B BASE_S F QWE_B QWE_S
0 a NaN 1 a NaN 5
1 b 5.0 3 a 8.0 3
2 c 4.0 5 a 9.0 6
3 d 5.0 7 b 4.0 9
4 e 5.0 1 b 2.0 2
5 f NaN 0 b NaN 4
d = {'BASE_S':'BASE_B', 'QWE_S':'QWE_B'}
result1_copy[['BASE_B','QWE_B']] = (result1_copy[['BASE_B', 'QWE_B']]
.fillna(result1_copy[['BASE_S','QWE_S']]
.rename(columns=d)))
print (result1_copy)
A BASE_B BASE_S F QWE_B QWE_S
0 a 1.0 1 a 5.0 5
1 b 5.0 3 a 8.0 3
2 c 4.0 5 a 9.0 6
3 d 5.0 7 b 4.0 9
4 e 5.0 1 b 2.0 2
5 f 0.0 0 b 4.0 4

Pandas groupby treat nonconsecutive as different variables?

I want to treat non consecutive ids as different variables during groupby, so that I can take return the first value of stamp, and the sum of increment as a new dataframe. Here is sample input and output.
import pandas as pd
import numpy as np
df = pd.DataFrame([np.array(['a','a','a','b','c','b','b','a','a','a']),
np.arange(1, 11), np.ones(10)]).T
df.columns = ['id', 'stamp', 'increment']
df_result = pd.DataFrame([ np.array(['a','b','c','b','a']),
np.array([1,4,5,6,8]), np.array([3,1,1,2,3])]).T
df_result.columns = ['id', 'stamp', 'increment_sum']
In [2]: df
Out[2]:
id stamp increment
0 a 1 1
1 a 2 1
2 a 3 1
3 b 4 1
4 c 5 1
5 b 6 1
6 b 7 1
7 a 8 1
8 a 9 1
9 a 10 1
In [3]: df_result
Out[3]:
id stamp increment_sum
0 a 1 3
1 b 4 1
2 c 5 1
3 b 6 2
4 a 8 3
I can accomplish this via
def get_result(d):
sum = d.increment.sum()
stamp = d.stamp.min()
name = d.id.max()
return name, stamp, sum
#idea from http://stackoverflow.com/questions/25147091/combine-consecutive-rows-with-the-same-column-values
df['key'] = (df['id'] != df['id'].shift(1)).astype(int).cumsum()
result = zip(*df.groupby([df.key]).apply(get_result))
df = pd.DataFrame(np.array(result).T)
df.columns = ['id', 'stamp', 'increment_sum']
But I'm sure there must be a more elegant solution
Not that good in terms of optimum code, but solves the problem
> df_group = df.groupby('id')
we cant use id alone for groupby, so adding another new column to groupby within id based whether it is continuous or not
> df['group_diff'] = df_group['stamp'].diff().apply(lambda v: float('nan') if v == 1 else v).ffill().fillna(0)
> df
id stamp increment group_diff
0 a 1 1 0
1 a 2 1 0
2 a 3 1 0
3 b 4 1 0
4 c 5 1 0
5 b 6 1 2
6 b 7 1 2
7 a 8 1 5
8 a 9 1 5
9 a 10 1 5
Now we can the new column group_diff for secondary grouping.. Added sort function in the end as suggested in the comments to get the exact function
> df.groupby(['id','group_diff']).agg({'increment':sum, 'stamp': 'first'}).reset_index()[['id', 'stamp','increment']].sort('stamp')
id stamp increment
0 a 1 3
2 b 4 1
4 c 5 1
3 b 6 2
1 a 8 3

Python pandas, multindex, slicing

I have got a pd.DataFrame
Time Value
a 1 1 1
2 2 5
3 5 7
b 1 1 5
2 2 9
3 10 11
I want to multiply the column Value with the column Time - Time(t-1) and write the result to a column Product, starting with row b, but separately for each top level index.
For example Product('1','b') should be (Time('1','b') - Time('1','a')) * Value('1','b'). To do this, i would need a "shifted" version of column Time "starting" at row b so that i could do df["Product"] = (df["Time"].shifted - df["Time"]) * df["Value"]. The result should look like this:
Time Value Product
a 1 1 1 0
2 2 5 5
3 5 7 21
b 1 1 5 0
2 2 9 9
3 10 11 88
This should do it:
>>> time_shifted = df['Time'].groupby(level=0).apply(lambda x: x.shift())
>>> df['Product'] = ((df.Time - time_shifted)*df.Value).fillna(0)
>>> df
Time Value Product
a 1 1 1 0
2 2 5 5
3 5 7 21
b 1 1 5 0
2 2 9 9
3 10 11 88
Hey this should do what you need it to. Comment if I missed anything.
import pandas as pd
import numpy as np
df = pd.DataFrame({'Time':[1,2,5,1,2,10],'Value':[1,5,7,5,9,11]},
index = [['a','a','a','b','b','b'],[1,2,3,1,2,3]])
def product(x):
x['Product'] = (x['Time']-x.shift()['Time'])*x['Value']
return x
df = df.groupby(level =0).apply(product)
df['Product'] = df['Product'].replace(np.nan, 0)
print df

Categories