Pandas divide multiple MultiIndex columns


I've got a dataframe that looks like this:
and I'd like to divide the x columns by the y columns, but at the moment I get the following result:
Full example:
import pandas as pd
# create example dataframe
data = {'x': [2, 4, 6], 'y': [1, 2, 3]}
df = pd.DataFrame(data)
df = pd.concat([df, df*10], axis=1, keys=['apple', 'orange'])
# slice just x and y columns
x = df.loc[:, (slice(None), 'x')]
y = df.loc[:, (slice(None), 'y')]
# divide (this doesn't work)
result = x / y
Ideally I'd like to add the result back as a separate column:
Is there an elegant way to do this?

Your approach works once both frames carry the same second-level label, which rename can provide:
new = x.rename(columns={'x':'x/y'}) / y.rename(columns={'y':'x/y'})
print (new)
  apple orange
    x/y    x/y
0   2.0    2.0
1   2.0    2.0
2   2.0    2.0
Alternatively, use DataFrame.xs. By default it removes the selected level, so the division works directly (x and y then have the same columns). The second level then has to be recreated with MultiIndex.from_product:
x = df.xs('x', axis=1, level=1)
y = df.xs('y', axis=1, level=1)
new = x / y
new.columns = pd.MultiIndex.from_product([new.columns, ['x/y']])
print (new)
  apple orange
    x/y    x/y
0   2.0    2.0
1   2.0    2.0
2   2.0    2.0
And then use concat with DataFrame.sort_index and DataFrame.reindex:
df = pd.concat([df, new], axis=1).sort_index(axis=1).reindex(['x','x/y','y'], axis=1, level=1)
print (df)
  apple        orange
      x  x/y y      x  x/y   y
0     2  2.0 1     20  2.0  10
1     4  2.0 2     40  2.0  20
2     6  2.0 3     60  2.0  30
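As a recap of why the plain x / y in the question fails: pandas aligns the two frames on their full column labels, and ('apple', 'x') never matches ('apple', 'y'), so every cell comes out as NaN. The following is a minimal sketch of that behaviour and of the xs-based fix, assuming the same example data as the question; the names base, plain, ratio and out are only illustrative.

```python
import pandas as pd

# Rebuild the example frame from the question
base = pd.DataFrame({'x': [2, 4, 6], 'y': [1, 2, 3]})
df = pd.concat([base, base * 10], axis=1, keys=['apple', 'orange'])

# Plain division: the labels differ on the second level, so pandas
# aligns on the union of the columns and fills everything with NaN
x = df.loc[:, (slice(None), 'x')]
y = df.loc[:, (slice(None), 'y')]
plain = x / y

# xs drops the selected level, so both operands share the columns
# ['apple', 'orange'] and the division lines up
ratio = df.xs('x', axis=1, level=1) / df.xs('y', axis=1, level=1)
ratio.columns = pd.MultiIndex.from_product([ratio.columns, ['x/y']])

# Sorting the columns places 'x/y' between 'x' and 'y' here
out = pd.concat([df, ratio], axis=1).sort_index(axis=1)
print(out)
```

The column sort only gives the desired order because 'x' < 'x/y' < 'y' lexicographically; with other labels the explicit reindex from the answer is the safer choice.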

Related

Adding new columns to Pandas Data Frame which the length of new column value is bigger than length of index

I'm having trouble adding a new column to a pandas DataFrame when the new column has more values than the index has entries.
The data may look like this:
import pandas as pd
df = pd.DataFrame(
{
"bar": ["A","B","C"],
"zoo": [1,2,3],
})
As you can see, the length of this df's index is 3.
Next I want to add a new column, in one of the two ways below:
df["new_col"] = [1,2,3,4]
This raises an error: Length of values does not match length of index.
Or:
df["new_col"] = pd.Series([1,2,3,4])
With this I only get the values [1,2,3] in my data frame df.
(The number of new column values can't exceed the length of the index.)
Now, what I want looks like this:
Is there a better way?
Looking forward to your answer, thanks!

Use DataFrame.join with a right join, after giving the Series a name:
#if not default index
#df = df.reset_index(drop=True)
df = df.join(pd.Series([1,2,3,4]).rename('new_col'), how='right')
print (df)
bar zoo new_col
0 A 1.0 1
1 B 2.0 2
2 C 3.0 3
3 NaN NaN 4
Another idea is to reindex by the new s.index first:
s = pd.Series([1,2,3,4])
df = df.reindex(s.index)
df["new_col"] = s
print (df)
bar zoo new_col
0 A 1.0 1
1 B 2.0 2
2 C 3.0 3
3 NaN NaN 4
The same in one chain with DataFrame.assign:
s = pd.Series([1,2,3,4])
df = df.reindex(s.index).assign(new_col = s)

df = pd.DataFrame(
{
"bar": ["A","B","C"],
"zoo": [1,2,3],
})
new_col = pd.Series([1,2,3,4])
df = pd.concat([df,new_col],axis=1)
print(df)
bar zoo 0
0 A 1.0 1
1 B 2.0 2
2 C 3.0 3
3 NaN NaN 4
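The concat approach above leaves the new column labelled 0, because the Series has no name. A small sketch of the same idea with a named Series, assuming the default RangeIndex from the question (out is just an illustrative name):

```python
import pandas as pd

df = pd.DataFrame({"bar": ["A", "B", "C"], "zoo": [1, 2, 3]})

# Naming the Series sets the resulting column label
new_col = pd.Series([1, 2, 3, 4], name="new_col")

# concat along axis=1 takes the union of both indexes (an outer join),
# so the frame grows to four rows and the missing cells become NaN
out = pd.concat([df, new_col], axis=1)
print(out)
```

Since row 3 has no value for zoo, that column is upcast to float, as in the outputs shown above.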

Find the mean of columns with matching column names

I have a dataframe similar to the following but with thousands of rows and columns:
x y ghb_00hr_rep1 ghb_00hr_rep2 ghb_00hr_rep3 ghl_06hr_rep1 ghl_06hr_rep2
x y 2 3 2 1 3
x y 5 7 6 2 1
I would like my output to look like this:
ghb_00hr ghl_06hr
2.3 2
6 1.5
My goal is to find the average of the matching columns. I have come up with this: temp = df.groupby(name, axis=1).agg('mean') But I am not sure how to define 'name' as the matching columns.
My previous strategy was the following:
name = pd.Series(['_'.join(i.split('_')[:-1])
for i in df.columns[3:]],
index = df.columns[3:]
)
temp = df.groupby(name, axis=1).agg('mean')
avg = pd.concat([df.iloc[:, :3], temp],
axis=1
)
However the number of 'replicates' ranges from 1-4 so grouping by index location isn't an option.
Not sure if there is a better way to do this or if I am on the right track.

An option is to groupby level=0:
(df.set_index(['name','x','y'])
.groupby(level=0, axis=1)
.mean().reset_index()
)
Output:
name x y ghb_00hr ghl_06hr
0 gene1 x y 2.333333 2.0
1 gene2 x y 6.000000 1.5
Update: for the modified question:
d = df.filter(like='gh')
# or d = df.iloc[:, 2:]
# depending on your columns of interest
names = d.columns.str.rsplit('_', n=1).str[0]
d.groupby(names, axis=1).mean()
Output:
ghb_00hr ghl_06hr
0 2.333333 2.0
1 6.000000 1.5
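Recent pandas releases (2.1 and later, as far as I know) emit a deprecation warning for groupby(..., axis=1). A hedged sketch of the same computation that avoids the axis argument by transposing first; the sample frame is rebuilt from the question's numbers, and means is an illustrative name:

```python
import pandas as pd

# Replicate columns from the question (only the numeric part)
df = pd.DataFrame({
    'ghb_00hr_rep1': [2, 5], 'ghb_00hr_rep2': [3, 7], 'ghb_00hr_rep3': [2, 6],
    'ghl_06hr_rep1': [1, 2], 'ghl_06hr_rep2': [3, 1],
})

# Strip the trailing '_repN' so replicates share one group key
names = df.columns.str.rsplit('_', n=1).str[0]

# Transpose so the columns become rows, group those rows by the key,
# average, and transpose back
means = df.T.groupby(names).mean().T
print(means)
```

This works for any number of replicates per condition, since the grouping is by name rather than by position.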

You can convert df.columns to a set, then iterate:
df = pd.DataFrame([[1, 2, 3, 4, 5, 6]], columns=['a', 'a', 'a', 'b', 'b', 'b'])
for column in set(df.columns):
    print(column, df[column].mean(axis=1))
which outputs
a 0 2.0
dtype: float64
b 0 5.0
dtype: float64
Use sorted if the order matters:
for column in sorted(set(df.columns)):
    print(column, df[column].mean(axis=1))
From here you can get the output in pretty much any format you want.

Loop through dataframe (cols and rows) and replace data

I have:
df = pd.DataFrame([[1, 2,3], [2, 4,6],[3, 6,9]], columns=['A', 'B','C'])
and I need to calculate the difference between the i+1 and i value of each row and column, and store it again in the same column. The output needed would be:
Out[2]:
A B C
0 1 2 3
1 1 2 3
2 1 2 3
I have tried to do this, but I finally get a list with all values appended, and I need to have them stored separately (in lists, or in the same dataframe).
Is there a way to do it?
difs=[]
for column in df:
    for i in range(len(df)-1):
        a = df[column]
        b = a[i+1]-a[i]
        difs.append(b)
for x in difs:
    for column in df:
        df[column]=x

You can use the pandas shift function to achieve your goal. This is what it does (more in the docs):
Shift index by desired number of periods with an optional time freq.
for col in df:
    df[col] = df[col] - df[col].shift(1).fillna(0)
df
Out[1]:
A B C
0 1.0 2.0 3.0
1 1.0 2.0 3.0
2 1.0 2.0 3.0
Added
If you want to keep a loop, iterrows is probably a good approach, as it yields (index, Series) pairs.
difs = []
for i, row in df.iterrows():
    if i == 0:
        x = row.values.tolist() ## so we preserve the first row
    else:
        x = (row.values - df.loc[i-1, df.columns]).values.tolist()
    difs.append(x)
difs
Out[1]:
[[1, 2, 3], [1, 2, 3], [1, 2, 3]]
## Create new / replace old dataframe
cols = [col for col in df.columns]
new_df = pd.DataFrame(difs, columns=cols)
new_df
Out[2]:
A B C
0 1.0 2.0 3.0
1 1.0 2.0 3.0
2 1.0 2.0 3.0
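For completeness, the same row-to-row difference can be written without any loop using DataFrame.diff. This is a minimal sketch on the question's data; keeping the first row unchanged by filling it from the original frame is my reading of the desired output, and diffs is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [2, 4, 6], [3, 6, 9]], columns=['A', 'B', 'C'])

# diff() computes row i minus row i-1 for every column at once;
# the first row has no predecessor and comes back as NaN,
# so it is filled with the original first-row values
diffs = df.diff().fillna(df)
print(diffs)
```

The result is float because of the intermediate NaN; append .astype(int) if integer output matters.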

groupby with conditions pandas

I am trying to generate a summarized dataframe (using groupby). While I have done basic aggregations before, this one has more complex aggregation conditions. I have searched online for help but have been unable to work my way through it.
Sample data:
import numpy as np
import pandas as pd
df = pd.DataFrame({'indi_id': [1,1,1,2,2],
'co_id': [1,1,2,2,3],
'relationship': ['shareholder', 'signatory', 'shareholder', 'shareholder', 'director'],
'co_type': ['SP', 'SP', 'PT', 'PT', 'SP'],
'co_nw': [10,10,100,100,2],
'sh_perc': [100, np.nan, 3, 4, np.nan]})
What I need to do is generate a summary dataframe below (groupby: indi_id):
indi_id: 'groupby field'
num_cos_assoc: 'nunique'(co_ID) - no problems here
num_companies_assoc_sh: nunique(co_ID) where relationship = 'shareholder'
num_SP_companies_assoc: nunique(co_ID) where co_type = 'SP'
total_nw_co_sh: sum(co_nw*sh_perc) where relationship = 'shareholder'
sample Outcome below:
Indi_ID num_cos_assoc num_companies_assoc_sh num_SP_companies_assoc total_nw_co_sh
1 2 2 1 1300
2 2 1 0 400

Use a custom function with GroupBy.apply. By design, agg works on each column separately, so filtering by other columns inside it is really problematic:
def f(x):
    a = x['co_id'].nunique()
    b = x.loc[x['relationship'] == 'shareholder', 'co_id'].nunique()
    c = x.loc[x['co_type'] == 'SP', 'co_id'].nunique()
    d = x.loc[x['relationship'] == 'shareholder', ['co_nw', 'sh_perc']]
    d = d['co_nw'].mul(d['sh_perc'], fill_value=1).sum()
    cols =['num_cos_assoc','num_companies_assoc_sh','num_SP_companies_assoc','total_nw_co_sh']
    return pd.Series([a,b,c,d], index=cols)
df1 = df.groupby('indi_id').apply(f).reset_index()
print (df1)
indi_id num_cos_assoc num_companies_assoc_sh num_SP_companies_assoc \
0 1 2.0 2.0 1.0
1 2 2.0 1.0 1.0
total_nw_co_sh
0 1300.0
1 400.0

Here is a way to do it without a named function, though it really amounts to doing the same thing as jezrael's solution:
df.groupby('indi_id').apply(lambda x: pd.Series([
x.co_id.nunique(),
x.loc[x.relationship == 'shareholder'].co_id.nunique(),
x.loc[x.co_type == 'SP'].co_id.nunique(),
x.loc[x.relationship == 'shareholder'][['co_nw','sh_perc']].prod(axis = 1).sum()],
index = ['num_cos_assoc','num_companies_assoc_sh','nump_SP_companies_assoc','total_nw_co_sh']))
And the corresponding output:
num_cos_assoc num_companies_assoc_sh nump_SP_companies_assoc total_nw_co_sh
indi_id
1 2.0 2.0 1.0 1300.0
2 2.0 1.0 1.0 400.0
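If apply turns out to be slow on large data, one possible alternative is to move the conditions into helper columns first and then use ordinary named aggregation. This is only a sketch: the helper columns sh_co, sp_co and nw_sh are hypothetical names, and unlike the fill_value=1 version above, a shareholder row with a missing sh_perc contributes nothing to the total here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'indi_id': [1, 1, 1, 2, 2],
                   'co_id': [1, 1, 2, 2, 3],
                   'relationship': ['shareholder', 'signatory', 'shareholder',
                                    'shareholder', 'director'],
                   'co_type': ['SP', 'SP', 'PT', 'PT', 'SP'],
                   'co_nw': [10, 10, 100, 100, 2],
                   'sh_perc': [100, np.nan, 3, 4, np.nan]})

is_sh = df['relationship'].eq('shareholder')
is_sp = df['co_type'].eq('SP')

# where() keeps the value on matching rows and sets NaN elsewhere;
# nunique and sum both skip NaN, which applies the condition
tmp = df.assign(
    sh_co=df['co_id'].where(is_sh),
    sp_co=df['co_id'].where(is_sp),
    nw_sh=(df['co_nw'] * df['sh_perc']).where(is_sh),
)

summary = tmp.groupby('indi_id').agg(
    num_cos_assoc=('co_id', 'nunique'),
    num_companies_assoc_sh=('sh_co', 'nunique'),
    num_SP_companies_assoc=('sp_co', 'nunique'),
    total_nw_co_sh=('nw_sh', 'sum'),
).reset_index()
print(summary)
```

On the sample data this gives the same numbers as the two apply-based answers, with integer counts instead of floats.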

How to create a matrix that is the sum of multiple matrices using pandas dataframe?

I have multiple data frames that I saved in a concatenated list like below. Each df represents a matrix.
my_df = pd.concat([df1, df2, df3, .....])
How do I sum all these dfs (matrices) into one df (matrix)?
I found a discussion here, but it only answers how to add two data frames, by using code like below.
df_x.add(df_y, fill_value=0)
Should I use the code above in a loop, or is there a more concise way?
I tried to do print(my_df.sum()) but got a very confusing result (it's suddenly turned into a one row instead of two-dimensional matrix).
Thank you.

I believe you need functools.reduce, provided each DataFrame in the list has the same index and column values:
np.random.seed(2018)
df1 = pd.DataFrame(np.random.choice([1,np.nan,2], size=(3,3)), columns=list('abc'))
df2 = pd.DataFrame(np.random.choice([1,np.nan,3], size=(3,3)), columns=list('abc'))
df3 = pd.DataFrame(np.random.choice([1,np.nan,4], size=(3,3)), columns=list('abc'))
print (df1)
a b c
0 2.0 2.0 2.0
1 NaN NaN 1.0
2 1.0 2.0 NaN
print (df2)
a b c
0 NaN NaN 1.0
1 3.0 3.0 3.0
2 NaN 1.0 3.0
print (df3)
a b c
0 4.0 NaN NaN
1 4.0 1.0 1.0
2 4.0 NaN 1.0
from functools import reduce
my_df = [df1,df2, df3]
df = reduce(lambda x, y: x.add(y, fill_value=0), my_df)
print (df)
a b c
0 6.0 2.0 3.0
1 7.0 4.0 5.0
2 5.0 3.0 4.0

I believe the idiomatic solution to this is to preserve the information about different DataFrames with the help of the keys parameter and then use sum on the innermost level:
dfs = [df1, df2, df3]
my_df = pd.concat(dfs, keys=['df{}'.format(i+1) for i in range(len(dfs))])
my_df.groupby(level=1).sum()  # my_df.sum(level=1) was removed in pandas 2.0
which yields
a b c
0 6.0 2.0 3.0
1 7.0 4.0 5.0
2 5.0 3.0 4.0
with jezrael's sample DataFrames.

One method is to use sum with a list of arrays. The output here will be an array rather than a dataframe.
This assumes you need to replace np.nan with 0:
res = sum([x.fillna(0).values for x in [df1, df2, df3]])
Alternatively, you can use numpy directly in a couple of different ways:
res_np1 = np.add.reduce([x.fillna(0).values for x in [df1, df2, df3]])
res_np2 = np.nansum([x.values for x in [df1, df2, df3]], axis=0)
numpy.nansum assumes np.nan equals zero for summing purposes.
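The approaches above agree on the sample data, but they can differ where a cell is NaN in every input. A small sketch with made-up frames (the values are illustrative, not from the question) that compares three of them:

```python
from functools import reduce

import numpy as np
import pandas as pd

# Cell (0, 'b') is NaN in all three frames
df1 = pd.DataFrame([[1.0, np.nan], [2.0, 3.0]], columns=list('ab'))
df2 = pd.DataFrame([[np.nan, np.nan], [1.0, 1.0]], columns=list('ab'))
df3 = pd.DataFrame([[4.0, np.nan], [np.nan, 2.0]], columns=list('ab'))
frames = [df1, df2, df3]

# add(fill_value=0) only fills when one side is missing;
# if both sides are NaN the result stays NaN
via_reduce = reduce(lambda x, y: x.add(y, fill_value=0), frames)

# Stack the frames and sum rows that share an index label;
# an all-NaN group sums to 0
via_groupby = pd.concat(frames).groupby(level=0).sum()

# nansum treats NaN as zero and returns a plain array
via_numpy = np.nansum([f.to_numpy() for f in frames], axis=0)

print(via_reduce, via_groupby, via_numpy, sep='\n\n')
```

So the reduce version keeps a NaN for a cell that is missing everywhere, while the groupby and nansum versions report 0 there; which one is right depends on what a missing entry means in the matrices.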
