I have a dataframe:
df = pd.DataFrame([[1, 2], [1, 3], [4, 6]], columns=['A', 'B'])
A B
0 1 2
1 1 3
2 4 6
I want to return a dataframe of the same size containing the mean of each column:
A B
0 2 3.666
1 2 3.666
2 2 3.666
Is there a simple way of doing this?
You only need to provide a single row at DataFrame creation time:
pd.DataFrame(data = [df.mean()], index = df.index)
It gives:
A B
0 2.0 3.666667
1 2.0 3.666667
2 2.0 3.666667
Here's one with assign:
df.assign(**df.mean())
A B
0 2.0 3.666667
1 2.0 3.666667
2 2.0 3.666667
Details
The mean is easily obtained with DataFrame.mean:
df.mean()
A    2.000000
B    3.666667
dtype: float64
From the above Series, we can use dictionary unpacking to replace the existing columns with the resulting values. Note that we can unpack the Series into a dictionary using **:
{**df.mean()}
# {'A': 2.0, 'B': 3.6666666666666665}
Since assign takes new columns as keyword arguments, as in df.assign(a_given_column=a_value, another_column=some_other_value), the unpacking turns the dictionary keys into the function's keyword arguments. And since the original dataframe's index is respected, df.assign(**df.mean()) will replace each column's values with that column's mean.
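As a minimal sketch of the equivalence described above, the unpacked call can be compared with the same call written out by hand. The names means, unpacked and explicit are only illustrative, and this assumes the column labels are strings that are valid keyword names:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [1, 3], [4, 6]], columns=['A', 'B'])
means = df.mean()

# Unpacking the Series passes each column label as a keyword argument...
unpacked = df.assign(**means)
# ...which is the same call as spelling the keywords out by hand.
explicit = df.assign(A=means['A'], B=means['B'])

print(unpacked.equals(explicit))
```

If the column labels were integers or contained spaces, the unpacking would not work as keyword arguments, and one of the constructor-based approaches would be the safer choice.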
Recreate the DataFrame. Send the mean Series to a dict, then the index defines the number of rows.
pd.DataFrame(df.mean().to_dict(), index=df.index)
# A B
#0 2.0 3.666667
#1 2.0 3.666667
#2 2.0 3.666667
Same concept, but creating the full array first saves a decent amount of time.
pd.DataFrame(np.broadcast_to(df.mean(), df.shape),
             index=df.index,
             columns=df.columns)
Here are some timings. Of course this will depend slightly on the number of columns, but you can see there are pretty large differences when you provide the entire array to begin with.
import perfplot
import pandas as pd
import numpy as np
perfplot.show(
    setup=lambda N: pd.DataFrame(np.random.randint(1, 100, (N, 5)),
                                 columns=[str(x) for x in range(5)]),
    kernels=[
        lambda df: pd.DataFrame(np.broadcast_to(df.mean(), df.shape), index=df.index, columns=df.columns),
        lambda df: df.assign(**df.mean()),
        lambda df: pd.DataFrame(df.mean().to_dict(), index=df.index)
    ],
    labels=['numpy broadcast', 'assign', 'dict'],
    n_range=[2 ** k for k in range(1, 22)],
    equality_check=np.allclose,
    xlabel="Len(df)"
)
I need to apply a function that takes subcolumns (aka Series) of multiindexed columns as arguments. I have come up with a solution that works, but I was curious if there was a more pythonic/proper pandas way to do this.
Let's say we have a function that takes two series as arguments and performs some user-defined operation on those series and returns a single series.
import pandas as pd
def user_defined_function(series1, series2):
    return 12 * (series1 * series2 / 3)
Let's create a dataframe with multiindexed columns.
data = [[1, 2, 3, 4],
[5, 6, 7, 8],
[10, 11, 12, 13],
[14, 15, 16, 17]]
columns = (('A', 'sub_col_1'),
('A', 'sub_col_2'),
('B', 'sub_col_1'),
('B', 'sub_col_2'))
df = pd.DataFrame(data, columns=columns)
print(df)
          A                   B
  sub_col_1 sub_col_2 sub_col_1 sub_col_2
0         1         2         3         4
1         5         6         7         8
2        10        11        12        13
3        14        15        16        17
I want to apply my user_defined_function() to the sub columns of A and B.
Now, if you try to use apply in the usual way, pandas will traverse each column individually, passing a single series to the function. So you can't just do this.
df.apply(lambda x: user_defined_function(x['sub_col_1'], x['sub_col_2']))
You'll end up getting a KeyError because pandas is passing a series, not a normally indexed "sub dataframe."
So this is the solution I came up with.
level1_labels = set(df.columns.get_level_values(0))
processed_df = pd.DataFrame()
for label in level1_labels:
    data_to_apply_function_to = df[label]
    processed_series = user_defined_function(data_to_apply_function_to['sub_col_1'],
                                             data_to_apply_function_to['sub_col_2'])
    processed_df[label] = processed_series
print(processed_df)
A B
0 8.0 48.0
1 120.0 224.0
2 440.0 624.0
3 840.0 1088.0
This returns what I want it to. However, I am curious if there is a cleaner, more pythonic, proper way to do this.
You can groupby over the columns axis. Your function requires a Series so we'll need to squeeze if we want to select by label.
(df.groupby(level=0, axis=1)
   .apply(lambda gp: user_defined_function(gp.xs('sub_col_1', level=1, axis=1).squeeze(),
                                           gp.xs('sub_col_2', level=1, axis=1).squeeze()))
)
# A B
#0 8.0 48.0
#1 120.0 224.0
#2 440.0 624.0
#3 840.0 1088.0
A bit more error-prone, though fine if you know every group has its two Series in the same positions:
(df.groupby(level=0, axis=1)
   .apply(lambda gp: user_defined_function(gp.iloc[:, 0], gp.iloc[:, 1]))
)
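Another option, sketched here under the assumption that the function only uses element-wise arithmetic (as the example user_defined_function does), is to skip groupby altogether. Taking a cross-section of each sub-column with xs gives two frames whose columns are both A and B, so the function can be applied to them directly. This also sidesteps groupby(axis=1), which recent pandas releases deprecate. The names first, second and result are illustrative:

```python
import pandas as pd

def user_defined_function(series1, series2):
    return 12 * (series1 * series2 / 3)

columns = pd.MultiIndex.from_tuples([('A', 'sub_col_1'), ('A', 'sub_col_2'),
                                     ('B', 'sub_col_1'), ('B', 'sub_col_2')])
df = pd.DataFrame([[1, 2, 3, 4],
                   [5, 6, 7, 8],
                   [10, 11, 12, 13],
                   [14, 15, 16, 17]], columns=columns)

# Each cross-section drops the inner level, leaving columns 'A' and 'B'.
first = df.xs('sub_col_1', level=1, axis=1)
second = df.xs('sub_col_2', level=1, axis=1)

# The arithmetic aligns on the shared column labels, one result column per group.
result = user_defined_function(first, second)
print(result)
```

This would not work for a function that genuinely needs a Series (for example one that calls Series-only methods), in which case the loop or groupby versions remain the fallback.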
It looks to me that this is a very custom case. It is actually possible to use apply within the level-0 columns as follows:
import pandas as pd
# Renamed because the original name was very long
def udf(series1, series2):
    return 12 * (series1 * series2 / 3)
col = "A"
df[col].apply(lambda x: udf(x['sub_col_1'], x['sub_col_2']), axis=1)\
       .to_frame()\
       .rename(columns={0: col})
returns
A
0 8.0
1 120.0
2 440.0
3 840.0
But again, for the output you are looking for you would still need to loop.
out = []
# unique() keeps the column order; set() does not guarantee it
for col in df.columns.get_level_values(0).unique():
    out.append(
        df[col].apply(lambda x: udf(x['sub_col_1'],
                                    x['sub_col_2']),
                      axis=1)\
               .to_frame()\
               .rename(columns={0: col}))
out = pd.concat(out, axis=1)
I have:
df = pd.DataFrame([[1, 2,3], [2, 4,6],[3, 6,9]], columns=['A', 'B','C'])
and I need to calculate the difference between the i+1 and i value of each row and column, and store it again in the same column. The output needed would be:
Out[2]:
A B C
0 1 2 3
1 1 2 3
2 1 2 3
I have tried to do this, but I finally get a list with all values appended, and I need to have them stored separately (in lists, or in the same dataframe).
Is there a way to do it?
difs = []
for column in df:
    for i in range(len(df)-1):
        a = df[column]
        b = a[i+1]-a[i]
        difs.append(b)
for x in difs:
    for column in df:
        df[column] = x
You can use the pandas shift method to achieve your intended goal. This is what it does, according to the docs:
Shift index by desired number of periods with an optional time freq.
for col in df:
    df[col] = df[col] - df[col].shift(1).fillna(0)
df
Out[1]:
A B C
0 1.0 2.0 3.0
1 1.0 2.0 3.0
2 1.0 2.0 3.0
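The same result can also be sketched without a loop. DataFrame.diff subtracts the previous row in every column at once, leaving NaN in the first row, and fillna(df) then puts the original first-row values back. The name result is illustrative, and this assumes the first row should be kept unchanged, as in the expected output:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [2, 4, 6], [3, 6, 9]], columns=['A', 'B', 'C'])

# diff() gives row i minus row i-1; the first row has no predecessor and becomes NaN.
# fillna(df) fills those NaNs with the matching cells from the original frame.
result = df.diff().fillna(df)
print(result)
```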
Added
If you want to use a loop, a good approach is probably iterrows, as it provides (index, Series) pairs.
difs = []
for i, row in df.iterrows():
    if i == 0:
        x = row.values.tolist() ## so we preserve the first row
    else:
        x = (row.values - df.loc[i-1, df.columns]).values.tolist()
    difs.append(x)
difs
Out[1]:
[[1, 2, 3], [1, 2, 3], [1, 2, 3]]
## Create new / replace old dataframe
cols = [col for col in df.columns]
new_df = pd.DataFrame(difs, columns=cols)
new_df
Out[2]:
A B C
0 1.0 2.0 3.0
1 1.0 2.0 3.0
2 1.0 2.0 3.0
Given the following DataFrame:
A B
0 -10.0 NaN
1 NaN 20.0
2 -30.0 NaN
I want to merge columns A and B, filling the NaN cells in column A with the values from column B and then drop column B, resulting in a DataFrame like this:
A
0 -10.0
1 20.0
2 -30.0
I have managed to solve this problem by using the iterrows() function.
Complete code example:
import numpy as np
import pandas as pd
# np.nan is used here because the np.NaN alias was removed in NumPy 2.0
example_data = [[-10, np.nan], [np.nan, 20], [-30, np.nan]]
example_df = pd.DataFrame(example_data, columns = ['A', 'B'])
for index, row in example_df.iterrows():
    if pd.isnull(row['A']):
        row['A'] = row['B']
example_df = example_df.drop(columns = ['B'])
example_df
This seems to work fine, but I find this information in the documentation for iterrows():
You should never modify something you are iterating over.
So it seems like I'm doing it wrong.
What would be a better/recommended approach for achieving the same result?
Use Series.fillna with Series.to_frame:
df = df['A'].fillna(df['B']).to_frame()
#alternative
#df = df['A'].combine_first(df['B']).to_frame()
print (df)
A
0 -10.0
1 20.0
2 -30.0
If there are more columns and you need the first non-missing value in each row, back-fill the missing values along the columns and then select the first column with a one-element list, so that the result is a one-column DataFrame:
df = df.bfill(axis=1).iloc[:, [0]]
print (df)
A
0 -10.0
1 20.0
2 -30.0
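To see the back-fill approach on more than two columns, here is a small sketch with a hypothetical third column C (not part of the original question). The names wide and merged are illustrative:

```python
import numpy as np
import pandas as pd

# Hypothetical three-column frame; column C is made up for this illustration.
wide = pd.DataFrame({'A': [-10.0, np.nan, np.nan],
                     'B': [np.nan, 20.0, np.nan],
                     'C': [1.0, 2.0, -30.0]})

# Back-filling along axis=1 pulls the next non-missing value leftwards in each row,
# so the first column ends up holding the first non-missing value of that row.
merged = wide.bfill(axis=1).iloc[:, [0]]
print(merged)
```

Selecting with [0] rather than 0 is what keeps the result a DataFrame instead of a Series.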
I start with a dictionary, which is the way my data was already formatted:
import pandas as pd
dict2 = {'A': {'a':1.0, 'b':2.0, 'd':4.0}, 'B':{'a':2.0, 'c':2.0, 'd':5.0},
'C':{'b':1.0,'c':2.0, 'd':4.0}}
I then convert it to a pandas dataframe:
df = pd.DataFrame(dict2)
print(df)
A B C
a 1.0 2.0 NaN
b 2.0 NaN 1.0
c NaN 2.0 2.0
d 4.0 5.0 4.0
Of course, I can get the difference one at a time by doing this:
df['A'] - df['B']
Out[643]:
a -1.0
b NaN
c NaN
d -1.0
dtype: float64
I figured out how to loop through and calculate A-A, A-B, A-C:
for column in df:
print(df['A'] - df[column])
a 0.0
b 0.0
c NaN
d 0.0
Name: A, dtype: float64
a -1.0
b NaN
c NaN
d -1.0
dtype: float64
a NaN
b 1.0
c NaN
d 0.0
dtype: float64
What I would like to do is iterate through the columns so as to calculate |A-B|, |A-C|, and |B-C| and store the results in another dictionary.
I want to do this so as to calculate the Euclidean distance between all combinations of columns later on. If there is an easier way to do this I would like to see it as well. Thank you.
You can use numpy broadcasting to compute vectorised Euclidean distance (L2-norm), ignoring NaNs using np.nansum.
i = df.values.T
j = np.nansum((i - i[:, None]) ** 2, axis=2) ** .5
If you want a DataFrame representing a distance matrix, here's what that would look like:
df = (lambda v, c: pd.DataFrame(v, c, c))(j, df.columns)
df
A B C
A 0.000000 1.414214 1.0
B 1.414214 0.000000 1.0
C 1.000000 1.000000 0.0
The entry in row i and column j of this matrix is the distance between the ith and jth columns of the original DataFrame.
The code below iterates through columns to calculate the difference.
# Import libraries
import pandas as pd
import numpy as np
# Create dataframe
df = pd.DataFrame({'A': {'a':1.0, 'b':2.0, 'd':4.0}, 'B':{'a':2.0, 'c':2.0, 'd':5.0},'C':{'b':1.0,'c':2.0, 'd':4.0}})
df2 = pd.DataFrame()
# Calculate difference
clist = df.columns
for i in range(len(clist) - 1):
    # Start j after i so each pair of columns is visited exactly once
    for j in range(i + 1, len(clist)):
        var = clist[i] + '-' + clist[j]
        df[var] = abs(df[clist[i]] - df[clist[j]]) # optional
        df2[var] = abs(df[clist[i]] - df[clist[j]]) # optional
Output in same dataframe
df.head()
Output in a new dataframe
df2.head()
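Since the question asks for the results in a dictionary, a possible sketch uses itertools.combinations to enumerate each unordered pair of columns once. The name abs_diffs and the 'A-B' style keys are illustrative choices, not something the question prescribes:

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame({'A': {'a': 1.0, 'b': 2.0, 'd': 4.0},
                   'B': {'a': 2.0, 'c': 2.0, 'd': 5.0},
                   'C': {'b': 1.0, 'c': 2.0, 'd': 4.0}})

# combinations(df.columns, 2) yields (A, B), (A, C), (B, C): every pair exactly once.
abs_diffs = {f'{left}-{right}': (df[left] - df[right]).abs()
             for left, right in combinations(df.columns, 2)}

for name, series in abs_diffs.items():
    print(name)
    print(series)
```

Rows where either column is missing stay NaN in the stored Series, so they still have to be handled (for example with np.nansum) when computing distances later.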
I have multiple data frames that I saved in a concatenated list like below. Each df represents a matrix.
my_df = pd.concat([df1, df2, df3, .....])
How do I sum all these dfs (matrices) into one df (matrix)?
I found a discussion here, but it only answers how to add two data frames, by using code like below.
df_x.add(df_y, fill_value=0)
Should I use the code above in a loop, or is there a more concise way?
I tried print(my_df.sum()) but got a very confusing result (it suddenly turned into a single row instead of a two-dimensional matrix).
Thank you.
I believe you need functools.reduce, provided each DataFrame in the list has the same index and column values:
np.random.seed(2018)
df1 = pd.DataFrame(np.random.choice([1,np.nan,2], size=(3,3)), columns=list('abc'))
df2 = pd.DataFrame(np.random.choice([1,np.nan,3], size=(3,3)), columns=list('abc'))
df3 = pd.DataFrame(np.random.choice([1,np.nan,4], size=(3,3)), columns=list('abc'))
print (df1)
a b c
0 2.0 2.0 2.0
1 NaN NaN 1.0
2 1.0 2.0 NaN
print (df2)
a b c
0 NaN NaN 1.0
1 3.0 3.0 3.0
2 NaN 1.0 3.0
print (df3)
a b c
0 4.0 NaN NaN
1 4.0 1.0 1.0
2 4.0 NaN 1.0
from functools import reduce
my_df = [df1,df2, df3]
df = reduce(lambda x, y: x.add(y, fill_value=0), my_df)
print (df)
a b c
0 6.0 2.0 3.0
1 7.0 4.0 5.0
2 5.0 3.0 4.0
I believe the idiomatic solution to this is to preserve the information about different DataFrames with the help of the keys parameter and then use sum on the innermost level:
dfs = [df1, df2, df3]
my_df = pd.concat(dfs, keys=['df{}'.format(i+1) for i in range(len(dfs))])
my_df.groupby(level=1).sum()  # replaces my_df.sum(level=1), which was removed in pandas 2.0
which yields
a b c
0 6.0 2.0 3.0
1 7.0 4.0 5.0
2 5.0 3.0 4.0
with jezrael's sample DataFrames.
One method is to use sum with a list of arrays. The output here will be an array rather than a dataframe.
This assumes you need to replace np.nan with 0:
res = sum([x.fillna(0).values for x in [df1, df2, df3]])
Alternatively, you can use numpy directly in a couple of different ways:
res_np1 = np.add.reduce([x.fillna(0).values for x in [df1, df2, df3]])
res_np2 = np.nansum([x.values for x in [df1, df2, df3]], axis=0)
numpy.nansum treats np.nan as zero for summing purposes.
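If a dataframe is wanted rather than an array, the array can be wrapped again with the labels of one of the inputs. This is a minimal sketch that assumes every frame shares the same index and columns in the same order; the small frames below are made up for illustration and are not the random samples used above:

```python
import numpy as np
import pandas as pd

# Made-up 2x2 frames with identical labels, just for illustration.
df1 = pd.DataFrame([[1.0, np.nan], [2.0, 3.0]], columns=['a', 'b'])
df2 = pd.DataFrame([[np.nan, np.nan], [1.0, 1.0]], columns=['a', 'b'])
df3 = pd.DataFrame([[4.0, 5.0], [np.nan, 2.0]], columns=['a', 'b'])

# Stack the underlying arrays, sum across frames while ignoring NaN,
# then reattach the labels from the first frame.
total = pd.DataFrame(np.nansum([x.values for x in [df1, df2, df3]], axis=0),
                     index=df1.index, columns=df1.columns)
print(total)
```

Because the arrays are combined by position, frames with differing or reordered labels would need the label-aligned approaches (add with fill_value, or concat with keys) instead.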