Create new columns and fill with calculated values from same dataframe - python

Here is a simplified example of my df:
ds = pd.DataFrame(np.abs(randn(3, 4)), index=[1,2,3], columns=['A','B','C','D'])
ds['sum'] = ds.sum(axis=1)
which looks like
A B C D sum
1 0.095389 0.556978 1.646888 1.959295 4.258550
2 1.076190 2.668270 0.825116 1.477040 6.046616
3 0.245034 1.066285 0.967124 0.791606 3.070049
I would like to create 4 new columns and calculate the percentage value from the total (sum) in every row. So first value in the first new column should be (0.095389/4.258550), first value in the second new column (0.556978/4.258550)...and so on.

You can do this easily manually for each column like this:
df['A_perc'] = df['A']/df['sum']
If you want to do this in one step for all columns, you can use the div method (http://pandas.pydata.org/pandas-docs/stable/basics.html#matching-broadcasting-behavior):
ds.div(ds['sum'], axis=0)
And if you want this in one step added to the same dataframe:
>>> ds.join(ds.div(ds['sum'], axis=0), rsuffix='_perc')
A B C D sum A_perc B_perc \
1 0.151722 0.935917 1.033526 0.941962 3.063127 0.049532 0.305543
2 0.033761 1.087302 1.110695 1.401260 3.633017 0.009293 0.299283
3 0.761368 0.484268 0.026837 1.276130 2.548603 0.298739 0.190013
C_perc D_perc sum_perc
1 0.337409 0.307517 1
2 0.305722 0.385701 1
3 0.010530 0.500718 1

In [56]: df = pd.DataFrame(np.abs(randn(3, 4)), index=[1,2,3], columns=['A','B','C','D'])
In [57]: df.divide(df.sum(axis=1), axis=0)
Out[57]:
A B C D
1 0.319124 0.296653 0.138206 0.246017
2 0.376994 0.326481 0.230464 0.066062
3 0.036134 0.192954 0.430341 0.340571

You can convert sum column in a numpy column array and broadcast division.
new_df = df / df[['sum']].values # note the double-brackets around 'sum'
To add the percentages as new columns,
df[df.columns.drop('sum') + '_perc'] = df.drop(columns='sum') / df[['sum']].values

Related

Applying function to multiple columns in pandas dataframe to make 1 new column

fruits.xlsx
I'm trying to apply a function that:
takes the value of each cell in a column divided by the mean of its respective column.
Then create a column called ['Score'] that has the sum of each cell value in a row computed from step 1.
My code so far:
import pandas as pd
df = pd.DataFrame(pd.read_excel('fruits.xlsx'))
def func(column):
out = df[column].values / df[column].mean()
return out
Im really unsure of how to execute this with pandas properly.
Try this one will calculate exactly what you need in one single line:
df['Score'] = df.apply(lambda x: sum([x[i]/df[i].mean() for i in df.columns]),axis=1)
You can do it like this
df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]], columns=['a', 'b', 'c'])
df['score'] = df.div(df.mean()).sum(axis=1)
Output
a b c score
0 1 2 3 1.15
1 4 5 6 3.00
2 7 8 9 4.85
you can make the output as a column in the dataframe
df["Score"] = df[<col_name>] / df[<col_name>].mean()
and you can use
df["Score"] = df[<col_name>].values / df[<col_name>].mean()
I tested both and both gave me the same output in the dataframe

Pandas: Get top n columns based on a row values

Having a dataframe with a single row, I need to filter it into a smaller one with filtered columns based on a value in a row.
What's the most effective way?
df = pd.DataFrame({'a':[1], 'b':[10], 'c':[3], 'd':[5]})
a
b
c
d
1
10
3
5
For example top-3 features:
b
c
d
10
3
5
Use sorting per row and select first 3 values:
df1 = df.sort_values(0, axis=1, ascending=False).iloc[:, :3]
print (df1)
b d c
0 10 5 3
Solution with Series.nlargest:
df1 = df.iloc[0].nlargest(3).to_frame().T
print (df1)
b d c
0 10 5 3
You can transpose T, and use nlargest():
new = df.T.nlargest(columns = 0, n = 3).T
print(new)
b d c
0 10 5 3
You can use np.argsort to get the solution. This Numpy method, in the below code, gives the indices of the column values in descending order. Then slicing selects the largest n values' indices.
import pandas as pd
import numpy as np
# Your dataframe
df = pd.DataFrame({'a':[1], 'b':[10], 'c':[3], 'd':[5]})
# Pick the number n to find n largest values
nlargest = 3
# Get the order of the largest value columns by their indices
order = np.argsort(-df.values, axis=1)[:, :nlargest]
# Find the columns with the largest values
top_features = df.columns[order].tolist()[0]
# Filter the dateframe by the columns
top_features_df = df[top_features]
top_features_df
output:
b d c
0 10 5 3

Indexing a dataframe from a dataframe of row indexes

I have two python dataframes with equal shape, for example:
df1 = pd.DataFrame(np.random.randn(3,2), index=np.arange(3), columns=['a','b'] )
df2 = pd.DataFrame(np.random.randint(0, high=3, size=(3,2)), index=np.arange(3), columns=['a','b'] )
print df1
a b
0 0.336811 -2.132993
1 -1.492770 0.278024
2 -2.355762 -0.894376
print df2
a b
0 1 2
1 0 2
2 2 1
I would like to use the values in df2 as row indexes to select the values in df1 and create a new dataframe of equal shape.
Expected result:
print df3
a b
0 -1.492770 -0.894376
1 0.336811 -0.894376
2 -2.355762 0.278024
I have tried using .loc and it works well for a single column:
df3 = df1.loc[df2['a'], 'a']
print df3
0 -1.492770
1 0.336811
2 -2.355762
But I was not able to use .loc or .iloc on all columns at the same time.
I would like to avoid loops to optimize performance since I am working on a large dataframe.
Any ideas?
Using numpy selection
pd.DataFrame([df1[col].values[df2[col]] for col in df1.columns], index=['a','b']).T
a b
0 -1.492770 -0.894376
1 0.336811 -0.894376
2 -2.355762 0.278024
If you want to avoid for loops, you have to play with raveling and unraveling. In a nutshell, you flatten all your data frame in a single vector, sum len(df1) at each block to jump indexes to the beginning of the next column, and then reshape back to the original size. All operations in this context are vectorized, so should be fast.
For example,
df1.T.values.ravel()[df2.T.values.ravel() + np.repeat(np.arange(0, len(df1)+1, len(df1)), len(df1))].reshape(df1.T.shape).T
Gives
array([[-1.49277 , -0.894376],
[ 0.336811, -0.894376],
[-2.355762, 0.278024]])

How to transform the result of a Pandas `GROUPBY` function to the original dataframe

Suppose I have a Pandas DataFrame with 6 columns and a custom function that takes counts of the elements in 2 or 3 columns and produces a boolean output. When a groupby object is created from the original dataframe and the custom function is applied df.groupby('col1').apply(myfunc), the result is a series whose length is equal to the number of categories of col1. How do I expand this output to match the length of the original dataframe? I tried transform, but was not able to use the custom function myfunc with it.
EDIT:
Here is an example code:
A = pd.DataFrame({'X':['a','b','c','a','c'], 'Y':['at','bt','ct','at','ct'], 'Z':['q','q','r','r','s']})
print (A)
def myfunc(df):
return ((df['Z'].nunique()>=2) and (df['Y'].nunique()<2))
A.groupby('X').apply(myfunc)
I would like to expand this output as a new column Result such that where there is a in column X, the Result will be True.
You can map the groupby back to the original dataframe
A['Result'] = A['X'].map(A.groupby('X').apply(myfunc))
Result would look like:
X Y Z Result
0 a at q True
1 b bt q False
2 c ct r True
3 a at r True
4 c ct s True
My solution may not be the best one, which uses a loop, but it's pretty good I think.
The core idea is you can traverse all the sub-dataframe (gdf) by for i, gdf in gp. Then add the column result (in my example it is c) for each sub-dataframe. Finally concat all the sub-dataframe into one.
Here is an example:
import pandas as pd
df = pd.DataFrame({'a':[1,2,1,2],'b':['a','b','c','d']})
gp = df.groupby('a') # group
s = gp.apply(sum)['a'] # apply a func
adf = []
# then create a new dataframe
for i, gdf in gp:
tdf = gdf.copy()
tdf.loc[:,'c'] = s.loc[i]
adf.append(tdf)
pd.concat(adf)
from:
a b
0 1 a
1 2 b
2 1 c
3 2 d
to:
a b c
0 1 a 2
2 1 c 2
1 2 b 4
3 2 d 4

Duplicating Pandas Dataframe rows based on string split, without iteration

I have a dataframe with a multiindex, where one of thecolumns represents multiple values, separated by a "|", like this:
value
left right
x a|b 2
y b|c|d -1
I want to duplicate the rows based on the "right" column, to get something like this:
values
left right
x a 2
x b 2
y b -1
y c -1
y d -1
The solution I have to this feels wrong and runs slow, because it's based on iteration:
df2 = df.iloc[:0]
for index, row in df.iterrows():
stgs = index[1].split("|")
for s in stgs:
row.name = (index[0], s)
df2 = df2.append(row)
Is there a more vectored way to do this?
Pandas Series have a dedicated method split to perform this operation
split works only on Series so isolate the Column you want
SO = df['right']
Now 3 steps at once: spilt return A Series of array. apply(pd.Series, 1) convert array in columns. stack stacks you columns into a unique column
S1 = SO.str.split(',').apply(pd.Series, 1).stack()
The only issue is that you have now a multi-index. So just drop the level you don`t need
S1.index.droplevel(-1)
Full example
SO = pd.Series(data=["a,b", "b,c,d"])
S1 = SO.str.split(',').apply(pd.Series, 1).stack()
S1
Out[4]:
0 0 a
1 b
1 0 b
1 c
2 d
S1.index = S1.index.droplevel(-1)
S1
Out[5]:
0 a
0 b
1 b
1 c
1 d
Building upon the answer #xNoK, I am adding here the additional step needed to include the result back in the original DataFrame.
We have this data:
arrays = [['x', 'y'], ['a|b', 'b|c|d']]
midx = pd.MultiIndex.from_arrays(arrays, names=['left', 'right'])
df = pd.DataFrame(index=midx, data=[2, -1], columns=['value'])
df
Out[17]:
value
left right
x a|b 2
y b|c|d -1
First, let's generate the values for right index as #xNoK suggested. First take the Index level we want to work on by index.levels[1] and convert it it to series so that we can perform the str.split() function, and finally stack() it to get the result we want.
new_multi_idx_val = df.index.levels[1].to_series().str.split('|').apply(pd.Series).stack()
new_multi_idx_val
Out[18]:
right
a|b 0 a
1 b
b|c|d 0 b
1 c
2 d
dtype: object
Now we want to put this value in the original DataFrame df. To do that, let's change its shape so that result we generated in the previous step could be copied.
In order to do that, we can repeat the rows (including the indexes) by a number of | present in right level of multi-index. df.index.levels[1].to_series().str.split('|').apply(lambda x: len(x)) gives the number of times a row (including index) should be repeated. We apply this to the function index.repeat() and fetch values at those indexes to create a new DataFrame df_repeted.
df_repeted = df.loc[df.index.repeat(df.index.levels[1].to_series().str.split('|').apply(lambda x: len(x)))]
df_repeted
Out[19]:
value
left right
x a|b 2
a|b 2
y b|c|d -1
b|c|d -1
b|c|d -1
Now df_repeted DataFrame is in a shape where we could change the index to get the answer we want.
Replace the index of df_repeted with desired values as following:
df_repeted.index = [df_repeted.index.droplevel(1), new_multi_idx_val]
df_repeted.index.rename(names=['left', 'right'], inplace=True)
df_repeted
Out[20]:
value
left right
x a 2
b 2
y b -1
c -1
d -1

Categories