I've got a dataframe that looks like this:
and I'd like to divide the x columns by the y columns, but at the moment I get the following result:
Full example:
import pandas as pd
# create example dataframe
data = {'x': [2, 4, 6], 'y': [1, 2, 3]}
df = pd.DataFrame(data)
df = pd.concat([df, df*10], axis=1, keys=['apple', 'orange'])
# slice just x and y columns
x = df.loc[:, (slice(None), 'x')]
y = df.loc[:, (slice(None), 'y')]
# divide (this doesn't work)
result = x / y
Ideally I'd like to add the result back as a separate column:
Is there an elegant way to do this?
Your approach works once both frames share the same second-level label, which you can create with rename:
new = x.rename(columns={'x':'x/y'}) / y.rename(columns={'y':'x/y'})
print (new)
apple orange
x/y x/y
0 2.0 2.0
1 2.0 2.0
2 2.0 2.0
Alternatively, use DataFrame.xs. By default it drops the selected level, so the division works directly (x and y then have the same columns). The second level then has to be recreated with MultiIndex.from_product:
x = df.xs('x', axis=1, level=1)
y = df.xs('y', axis=1, level=1)
new = x / y
new.columns = pd.MultiIndex.from_product([new.columns, ['x/y']])
print (new)
apple orange
x/y x/y
0 2.0 2.0
1 2.0 2.0
2 2.0 2.0
And then use concat with DataFrame.sort_index and DataFrame.reindex:
df = pd.concat([df, new], axis=1).sort_index(axis=1).reindex(['x','x/y','y'], axis=1, level=1)
print (df)
apple orange
x x/y y x x/y y
0 2 2.0 1 20 2.0 10
1 4 2.0 2 40 2.0 20
2 6 2.0 3 60 2.0 30
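As a rough alternative to the slice-and-concat approach above, the ratio can also be assigned one top-level key at a time. This is only a sketch that reuses the question's example frame; the loop variable name fruit is made up for illustration.

```python
import pandas as pd

# rebuild the example frame from the question
data = {'x': [2, 4, 6], 'y': [1, 2, 3]}
df = pd.DataFrame(data)
df = pd.concat([df, df * 10], axis=1, keys=['apple', 'orange'])

# add one 'x/y' column under each top-level key
for fruit in df.columns.get_level_values(0).unique():
    df[(fruit, 'x/y')] = df[(fruit, 'x')] / df[(fruit, 'y')]

# new columns are appended at the end, so sort to regroup them per key
df = df.sort_index(axis=1)
print(df)
```

Sorting happens to give the order x, x/y, y here because of how the labels compare as strings; with other labels the reindex step from the answer above may still be needed.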
I'm having trouble adding a new column to a pandas dataframe when the new column has more values than the index has rows.
The data may look like this:
import pandas as pd
df = pd.DataFrame(
{
"bar": ["A","B","C"],
"zoo": [1,2,3],
})
As you can see, the length of this df's index is 3.
Next I want to add a new column. I tried the two ways below:
df["new_col"] = [1,2,3,4]
It'll raise an error : Length of values does not match length of index.
Or:
df["new_col"] = pd.Series([1,2,3,4])
In that case I only get the values [1,2,3] in my data frame df.
(The new column can't hold more values than the index has rows.)
Now, what I want is this:
Is there a better way?
Looking forward to your answer,thanks!
Use DataFrame.join with a renamed Series and a right join:
#if not default index
#df = df.reset_index(drop=True)
df = df.join(pd.Series([1,2,3,4]).rename('new_col'), how='right')
print (df)
bar zoo new_col
0 A 1.0 1
1 B 2.0 2
2 C 3.0 3
3 NaN NaN 4
Another idea is to reindex df by the new s.index first:
s = pd.Series([1,2,3,4])
df = df.reindex(s.index)
df["new_col"] = s
print (df)
bar zoo new_col
0 A 1.0 1
1 B 2.0 2
2 C 3.0 3
3 NaN NaN 4
s = pd.Series([1,2,3,4])
df = df.reindex(s.index).assign(new_col = s)
df = pd.DataFrame(
{
"bar": ["A","B","C"],
"zoo": [1,2,3],
})
new_col = pd.Series([1,2,3,4])
df = pd.concat([df,new_col],axis=1)
print(df)
bar zoo 0
0 A 1.0 1
1 B 2.0 2
2 C 3.0 3
3 NaN NaN 4
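In the concat output above the new column ends up labelled 0 because the Series has no name. A small sketch of the same idea with the Series named up front, so the column arrives as new_col:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "bar": ["A", "B", "C"],
        "zoo": [1, 2, 3],
    })

# naming the Series gives the concatenated column its label
new_col = pd.Series([1, 2, 3, 4], name="new_col")

# concat along columns takes the union of both indexes,
# so the extra row is filled with NaN in bar and zoo
df = pd.concat([df, new_col], axis=1)
print(df)
```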
I have a dataframe similar to the following but with thousands of rows and columns:
x y ghb_00hr_rep1 ghb_00hr_rep2 ghb_00hr_rep3 ghl_06hr_rep1 ghl_06hr_rep2
x y 2 3 2 1 3
x y 5 7 6 2 1
I would like my output to look like this:
ghb_00hr ghl_06hr
2.3 2
6 1.5
My goal is to find the average of the matching columns. I have come up with this:
temp = df.groupby(name, axis=1).agg('mean')
But I am not sure how to define 'name' so that it identifies the matching columns.
My previous strategy was the following:
name = pd.Series(['_'.join(i.split('_')[:-1])
for i in df.columns[3:]],
index = df.columns[3:]
)
temp = df.groupby(name, axis=1).agg('mean')
avg = pd.concat([df.iloc[:, :3], temp],
axis=1
)
However the number of 'replicates' ranges from 1-4 so grouping by index location isn't an option.
Not sure if there is a better way to do this or if I am on the right track.
An option is to groupby level=0:
(df.set_index(['name','x','y'])
.groupby(level=0, axis=1)
.mean().reset_index()
)
Output:
name x y ghb_00hr ghl_06hr
0 gene1 x y 2.333333 2.0
1 gene2 x y 6.000000 1.5
Update: for the modified question:
d = df.filter(like='gh')
# or d = df.iloc[:, 2:]
# depending on your columns of interest
names = d.columns.str.rsplit('_', n=1).str[0]
d.groupby(names, axis=1).mean()
Output:
ghb_00hr ghl_06hr
0 2.333333 2.0
1 6.000000 1.5
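Newer pandas releases (2.1 and later, as far as I know) deprecate groupby(..., axis=1). A sketch of the same rsplit idea that avoids it by transposing, grouping the rows, and transposing back; the frame below is rebuilt from the sample rows in the question.

```python
import pandas as pd

df = pd.DataFrame({
    'x': ['x', 'x'], 'y': ['y', 'y'],
    'ghb_00hr_rep1': [2, 5], 'ghb_00hr_rep2': [3, 7], 'ghb_00hr_rep3': [2, 6],
    'ghl_06hr_rep1': [1, 2], 'ghl_06hr_rep2': [3, 1],
})

# keep only the replicate columns
d = df.filter(like='gh')

# strip the trailing '_repN' part to get one group label per column
names = d.columns.str.rsplit('_', n=1).str[0]

# transpose so columns become rows, group those rows, transpose back
means = d.T.groupby(names).mean().T
print(means)
```

Because the labels come from the column names, this works for any number of replicates per group.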
You can convert df.columns to set then iterate:
df = pd.DataFrame([[1, 2, 3, 4, 5, 6]], columns=['a', 'a', 'a', 'b', 'b', 'b'])
for column in set(df.columns):
    print(column, df[column].mean(axis=1))
outputs
a 0 2.0
dtype: float64
b 0 5.0
dtype: float64
Use sorted if the order matters:
for column in sorted(set(df.columns)):
    print(column, df[column].mean(axis=1))
From here you can get the output in pretty much any format you want.
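If the goal is a single frame rather than printed Series, one possible shortcut for duplicated column names is to transpose and group on the index. This is a sketch on the same toy frame, not necessarily what the answer above intended.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4, 5, 6]], columns=['a', 'a', 'a', 'b', 'b', 'b'])

# after transposing, the duplicated column names form the index,
# so grouping on index level 0 averages the columns that share a name
result = df.T.groupby(level=0).mean().T
print(result)
```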
I have:
df = pd.DataFrame([[1, 2,3], [2, 4,6],[3, 6,9]], columns=['A', 'B','C'])
and I need to calculate the difference between the i+1 and i values of each column, and store it back in the same column. The output needed would be:
Out[2]:
A B C
0 1 2 3
1 1 2 3
2 1 2 3
I have tried to do this, but I finally get a list with all values appended, and I need to have them stored separately (in lists, or in the same dataframe).
Is there a way to do it?
difs=[]
for column in df:
    for i in range(len(df)-1):
        a = df[column]
        b = a[i+1]-a[i]
        difs.append(b)
for x in difs:
    for column in df:
        df[column]=x
You can use the pandas function shift to achieve this. From its documentation:
Shift index by desired number of periods with an optional time freq.
Subtracting the shifted column from the original gives the row-to-row difference, and fillna(0) keeps the first row unchanged:
for col in df:
df[col] = df[col] - df[col].shift(1).fillna(0)
df
Out[1]:
A B C
0 1.0 2.0 3.0
1 1.0 2.0 3.0
2 1.0 2.0 3.0
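The shift-based loop above can probably be shortened with DataFrame.diff, which computes the same row-to-row difference for every column at once. A sketch, assuming the first row should keep its original values as in the desired output:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [2, 4, 6], [3, 6, 9]], columns=['A', 'B', 'C'])

# diff() leaves NaN in the first row; fill those cells from the original frame
result = df.diff().fillna(df)
print(result)
```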
Added
If you would rather keep an explicit loop, a good approach is probably iterrows, as it yields (index, Series) pairs.
difs = []
for i, row in df.iterrows():
    if i == 0:
        x = row.values.tolist()  ## so we preserve the first row
    else:
        x = (row.values - df.loc[i-1, df.columns]).values.tolist()
    difs.append(x)
difs
Out[1]:
[[1, 2, 3], [1, 2, 3], [1, 2, 3]]
## Create new / replace old dataframe
cols = [col for col in df.columns]
new_df = pd.DataFrame(difs, columns=cols)
new_df
Out[2]:
A B C
0 1.0 2.0 3.0
1 1.0 2.0 3.0
2 1.0 2.0 3.0
I am trying to generate a summarized dataframe (using groupby). While I have done basic aggregations before, this one has more complex aggregation conditions. I have searched the web for help but have been unable to work my way through it.
sample data :
import numpy as np
import pandas as pd
df = pd.DataFrame({'indi_id': [1,1,1,2,2],
'co_id': [1,1,2,2,3],
'relationship': ['shareholder', 'signatory', 'shareholder', 'shareholder', 'director'],
'co_type': ['SP', 'SP', 'PT', 'PT', 'SP'],
'co_nw': [10,10,100,100,2],
'sh_perc': [100, np.nan, 3, 4, np.nan]})
What I need to do is generate a summary dataframe below (groupby: indi_id):
indi_id: 'groupby field'
num_cos_assoc: 'nunique'(co_ID) - no problems here
num_companies_assoc_sh: nunique(co_ID) where relationship = 'shareholder'
num_SP_companies_assoc: nunique(co_ID) where co_type = 'SP'
total_nw_co_sh: sum(co_nw*sh_perc) where relationship = 'shareholder'
sample Outcome below:
Indi_ID num_cos_assoc num_companies_assoc_sh num_SP_companies_assoc total_nw_co_sh
1 2 2 1 1300
2 2 1 0 400
Use a custom function with GroupBy.apply, because agg by design works on each column separately, so filtering by other columns is really problematic:
def f(x):
    a = x['co_id'].nunique()
    b = x.loc[x['relationship'] == 'shareholder', 'co_id'].nunique()
    c = x.loc[x['co_type'] == 'SP', 'co_id'].nunique()
    d = x.loc[x['relationship'] == 'shareholder', ['co_nw', 'sh_perc']]
    d = d['co_nw'].mul(d['sh_perc'], fill_value=1).sum()
    cols = ['num_cos_assoc','num_companies_assoc_sh','num_SP_companies_assoc','total_nw_co_sh']
    return pd.Series([a,b,c,d], index=cols)

df1 = df.groupby('indi_id').apply(f).reset_index()
print (df1)
indi_id num_cos_assoc num_companies_assoc_sh num_SP_companies_assoc \
0 1 2.0 2.0 1.0
1 2 2.0 1.0 1.0
total_nw_co_sh
0 1300.0
1 400.0
Here is a way to do it without a custom function, though it really sums up to doing the same thing as jezrael's solution:
df.groupby('indi_id').apply(lambda x: pd.Series([
    x.co_id.nunique(),
    x.loc[x.relationship == 'shareholder'].co_id.nunique(),
    x.loc[x.co_type == 'SP'].co_id.nunique(),
    x.loc[x.relationship == 'shareholder'][['co_nw','sh_perc']].prod(axis = 1).sum()],
    index = ['num_cos_assoc','num_companies_assoc_sh','num_SP_companies_assoc','total_nw_co_sh']))
And the corresponding output:
num_cos_assoc num_companies_assoc_sh num_SP_companies_assoc total_nw_co_sh
indi_id
1 2.0 2.0 1.0 1300.0
2 2.0 1.0 1.0 400.0
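For comparison, here is a sketch that avoids apply altogether by filtering first and grouping afterwards. The names is_sh, is_sp and the helper column weighted are made up for illustration. Note one difference from the answers above: shareholder rows with a missing sh_perc contribute nothing to the total here, rather than being treated as a factor of 1.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'indi_id': [1, 1, 1, 2, 2],
                   'co_id': [1, 1, 2, 2, 3],
                   'relationship': ['shareholder', 'signatory', 'shareholder',
                                    'shareholder', 'director'],
                   'co_type': ['SP', 'SP', 'PT', 'PT', 'SP'],
                   'co_nw': [10, 10, 100, 100, 2],
                   'sh_perc': [100, np.nan, 3, 4, np.nan]})

# boolean masks for the two filter conditions
is_sh = df['relationship'].eq('shareholder')
is_sp = df['co_type'].eq('SP')

# hypothetical helper column holding co_nw * sh_perc per row
tmp = df.assign(weighted=df['co_nw'] * df['sh_perc'])

# each entry is a Series indexed by indi_id; the constructor aligns them
summary = pd.DataFrame({
    'num_cos_assoc': df.groupby('indi_id')['co_id'].nunique(),
    'num_companies_assoc_sh': df[is_sh].groupby('indi_id')['co_id'].nunique(),
    'num_SP_companies_assoc': df[is_sp].groupby('indi_id')['co_id'].nunique(),
    'total_nw_co_sh': tmp[is_sh].groupby('indi_id')['weighted'].sum(),
}).fillna(0).rename_axis('indi_id').reset_index()
print(summary)
```

The fillna(0) covers groups that have no matching rows for one of the filters, since those would otherwise show up as NaN after alignment.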
I have multiple data frames that I saved in a concatenated list like below. Each df represents a matrix.
my_df = pd.concat([df1, df2, df3, .....])
How do I sum all these dfs (matrices) into one df (matrix)?
I found a discussion here, but it only answers how to add two data frames, by using code like below.
df_x.add(df_y, fill_value=0)
Should I use the code above in a loop, or is there a more concise way?
I tried to do print(my_df.sum()) but got a very confusing result (it's suddenly turned into a one row instead of two-dimensional matrix).
Thank you.
I believe you need functools.reduce, provided each DataFrame in the list has the same index and column values:
np.random.seed(2018)
df1 = pd.DataFrame(np.random.choice([1,np.nan,2], size=(3,3)), columns=list('abc'))
df2 = pd.DataFrame(np.random.choice([1,np.nan,3], size=(3,3)), columns=list('abc'))
df3 = pd.DataFrame(np.random.choice([1,np.nan,4], size=(3,3)), columns=list('abc'))
print (df1)
a b c
0 2.0 2.0 2.0
1 NaN NaN 1.0
2 1.0 2.0 NaN
print (df2)
a b c
0 NaN NaN 1.0
1 3.0 3.0 3.0
2 NaN 1.0 3.0
print (df3)
a b c
0 4.0 NaN NaN
1 4.0 1.0 1.0
2 4.0 NaN 1.0
from functools import reduce
my_df = [df1,df2, df3]
df = reduce(lambda x, y: x.add(y, fill_value=0), my_df)
print (df)
a b c
0 6.0 2.0 3.0
1 7.0 4.0 5.0
2 5.0 3.0 4.0
I believe the idiomatic solution to this is to preserve the information about different DataFrames with the help of the keys parameter and then use sum on the innermost level:
dfs = [df1, df2, df3]
my_df = pd.concat(dfs, keys=['df{}'.format(i+1) for i in range(len(dfs))])
# sum(level=1) was removed in pandas 2.0; grouping on the level is equivalent
my_df.groupby(level=1).sum()
which yields
a b c
0 6.0 2.0 3.0
1 7.0 4.0 5.0
2 5.0 3.0 4.0
with jezrael's sample DataFrames.
One method is to use sum with a list of arrays. The output here will be an array rather than a dataframe.
This assumes you need to replace np.nan with 0:
res = sum([x.fillna(0).values for x in [df1, df2, df3]])
Alternatively, you can use numpy directly in a couple of different ways:
res_np1 = np.add.reduce([x.fillna(0).values for x in [df1, df2, df3]])
res_np2 = np.nansum([x.values for x in [df1, df2, df3]], axis=0)
numpy.nansum assumes np.nan equals zero for summing purposes.
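Since the NumPy variants above return a plain array, the labels can be restored afterwards if a DataFrame is wanted. A sketch, assuming all frames share the index and columns of df1; the three frames are typed in by hand from the printed sample output rather than regenerated randomly.

```python
import numpy as np
import pandas as pd

nan = np.nan
df1 = pd.DataFrame([[2, 2, 2], [nan, nan, 1], [1, 2, nan]], columns=list('abc'))
df2 = pd.DataFrame([[nan, nan, 1], [3, 3, 3], [nan, 1, 3]], columns=list('abc'))
df3 = pd.DataFrame([[4, nan, nan], [4, 1, 1], [4, nan, 1]], columns=list('abc'))

dfs = [df1, df2, df3]

# stack the underlying arrays and sum across frames, treating NaN as zero
total = np.nansum([d.to_numpy() for d in dfs], axis=0)

# wrap the array back into a DataFrame with the original labels
res = pd.DataFrame(total, index=df1.index, columns=df1.columns)
print(res)
```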