I am using the following code to normalize a pandas DataFrame:
df_norm = (df - df.mean()) / (df.max() - df.min())
This works fine when all columns are numeric. However, I now have some string columns in df, and the normalization above raises errors. Is there a way to normalize only the numeric columns of a data frame (keeping the string columns unchanged)?
You can use select_dtypes to pick out the numeric columns and compute the values on those only:
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c'], 'c': [4, 5, 6]})
df
a b c
0 1 a 4
1 2 b 5
2 3 c 6
df_num = df.select_dtypes(include='number')
df_num
a c
0 1 4
1 2 5
2 3 6
And then you can assign them back to the original df:
df_norm = (df_num - df_num.mean()) / (df_num.max() - df_num.min())
df[df_norm.columns] = df_norm
df
a b c
0 -0.5 a -0.5
1 0.0 b 0.0
2 0.5 c 0.5
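For reuse, the two steps above can be wrapped in a small helper that leaves the input frame untouched. This is only a sketch: the function name `normalize_numeric` is made up for illustration, and it assumes every numeric column has at least two distinct values (a constant column has max - min = 0 and would come out as NaN).

```python
import pandas as pd


def normalize_numeric(df):
    """Mean-center and range-scale numeric columns; other columns are copied as-is.

    Hypothetical helper, not part of pandas.
    """
    out = df.copy()  # work on a copy so the caller's frame is not modified
    num_cols = out.select_dtypes(include="number").columns
    num = out[num_cols]
    out[num_cols] = (num - num.mean()) / (num.max() - num.min())
    return out


df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"], "c": [4, 5, 6]})
df_norm = normalize_numeric(df)
print(df_norm)
```

Because the helper copies first, the original df keeps its integer columns, which is handy if the raw values are still needed later.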
Take the following data frame and groupby object.
df = pd.DataFrame([[1, 2, 3],[1, 4, 5],[2, 5, 6]], columns=['a', 'b', 'c'])
print(df)
a b c
0 1 2 3
1 1 4 5
2 2 5 6
dfGrouped = df.groupby(['a'])
How would I apply a function to the groupby object dfGrouped that multiplies each element of b and c together and then takes the sum? For this example, that is 2*3 + 4*5 = 26 for group 1 and 5*6 = 30 for group 2.
So my desired output for the groupby object is:
a f
0 1 26
2 2 30
Do:
df = pd.DataFrame([[1, 2, 3],[1, 4, 5],[2, 5, 6]], columns=['a', 'b', 'c'])
df['f'] = df['c'] * df['b']
res = df.groupby('a', as_index=False)['f'].sum()
print(res)
Output
a f
0 1 26
1 2 30
If you need to multiply all columns except a, use DataFrame.prod and then aggregate with sum:
df = df.drop(columns='a').prod(axis=1).groupby(df['a']).sum().reset_index(name='f')
print (df)
a f
0 1 26
1 2 30
Alternative with helper column:
df = df.assign(f = df.drop(columns='a').prod(axis=1)).groupby("a", as_index=False).f.sum()
If you need to multiply only some columns, one idea is to use @sammywemmy's solution from the comments:
df = df.assign(f = df.b.mul(df.c)).groupby("a", as_index=False).f.sum()
print (df)
a f
0 1 26
1 2 30
Code:
df=(df.b * df.c).groupby(df['a']).sum().reset_index(name="f")
print(df)
Output:
a f
0 1 26
1 2 30
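Note that the snippets above each reassign df, so they cannot be run one after another on the same frame. The sketch below runs the three variants side by side on an unmodified frame; the names `v1`, `v2` and `v3` are only labels chosen for this example.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [1, 4, 5], [2, 5, 6]], columns=["a", "b", "c"])

# Variant 1: product of every column except the group key, then sum per group.
v1 = (
    df.drop(columns="a")
    .prod(axis=1)
    .groupby(df["a"])
    .sum()
    .reset_index(name="f")
)

# Variant 2: helper column f = b * c, then a grouped sum of f.
v2 = df.assign(f=df["b"] * df["c"]).groupby("a", as_index=False)["f"].sum()

# Variant 3: group the product Series directly by column a.
v3 = (df["b"] * df["c"]).groupby(df["a"]).sum().reset_index(name="f")

print(v1)
```

Variant 1 multiplies whatever columns are left after dropping a, so it only matches the other two while b and c are the only remaining columns.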
I have a pandas dataframe like this:
df = pd.DataFrame({'A': [2, 3], 'B': [1, 2], 'C': [0, 1], 'D': [1, 0], 'total': [4, 6]})
A B C D total
0 2 1 0 1 4
1 3 2 1 0 6
I'm trying to perform a rowwise calculation and create a new column with the result. The calculation is to divide each column ABCD by the total, square it, and sum it up rowwise. This should be the result (0 if total is 0):
A B C D total result
0 2 1 0 1 4 0.375
1 3 2 1 0 6 0.389
This is what I've tried so far, but it always returns 0:
df['result'] = df[['A', 'B', 'C', 'D']].apply(lambda x: ((x/df['total'])**2).sum(), axis=1)
I guess the problem is df['total'] in the lambda function, because if I replace this by a number it works fine. I don't know how to work around this though. Appreciate any suggestions.
A combination of div, pow and sum can solve this (drop the total column first, then divide row-wise):
df["result"] = df.drop(columns="total").div(df.total, axis=0).pow(2).sum(axis=1)
df
A B C D total result
0 2 1 0 1 4 0.375000
1 3 2 1 0 6 0.388889
You could do:
df['result'] = (df.loc[:, "A": 'D'].divide(df.total, axis=0) ** 2).sum(axis=1)
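The attempt in the question returns 0 because, with axis=1, each x is a row whose index is the column labels, while df['total'] is indexed by row labels; the division aligns on those non-matching labels, yields only NaN, and the sum skips them. The sketch below shows the label mismatch and a vectorized version that also covers the "0 if total is 0" requirement. The third row with total 0 and the name `safe_total` are additions made for this example only.

```python
import numpy as np
import pandas as pd

# The third row (total == 0) is a hypothetical addition to exercise the zero case.
df = pd.DataFrame({
    "A": [2, 3, 0], "B": [1, 2, 0], "C": [0, 1, 0], "D": [1, 0, 0],
    "total": [4, 6, 0],
})
cols = ["A", "B", "C", "D"]

# Why the row-wise lambda fails: the two operands share no index labels.
row = df.loc[0, cols]
print(list(row.index))          # column labels of one row
print(list(df["total"].index))  # row labels of the total column

# Turn zero totals into NaN so the division yields NaN instead of inf;
# sum() skips NaN by default, so those rows end up as 0.
safe_total = df["total"].replace(0, np.nan)
df["result"] = df[cols].div(safe_total, axis=0).pow(2).sum(axis=1)
print(df)
```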
I have a DataFrame as shown below:
import pandas as pd
df = pd.DataFrame({'a': [0, -1, 2], 'b': [-3, 2, 1]})
In my real data, I have more than 100 columns. Excluding two of them, I would like to replace the negative values in all other columns with zero.
I tried this, but it applies to all columns:
df[df < 0] = 0
Is the only way to put all the column names in a list and loop over them, as shown below?
col_list = ['a1','a2','a3','a4',..........'a100'] # `a21` and `a22` are left out of the list
for col in col_list:
    df.loc[df[col] < 0, col] = 0
As you can see it's lengthy and inefficient.
Can you help me with any efficient approach to do this?
The problem is that df[col_list] < 0 returns a boolean DataFrame for only the selected columns, so the df[df < 0] = 0 pattern cannot be applied directly with specific column names. Use DataFrame.mask on the selected columns instead:
col_list = df.columns.difference(['a21','a22'])
m = df[col_list] < 0
df[col_list] = df[col_list].mask(m, 0)
EDIT:
To work only on numeric columns and leave out a21 and a22, use DataFrame.select_dtypes with Index.difference:
import numpy as np
import pandas as pd

df = pd.DataFrame({
'a21':list('abcdef'),
'B':[4,5,4,5,5,4],
'C':[-7,8,9,4,2,3],
'D':[1,3,5,-7,1,'a'],  # object column because of the last 'a'
'E':[5,3,-6,9,2,-4],
'a22':list('aaabbb')
})
col_list = df.select_dtypes(np.number).columns.difference(['a21','a22'])
m = df[col_list] < 0
df[col_list] = df[col_list].mask(m, 0)
print (df)
a21 B C D E a22
0 a 4 0 1 5 a
1 b 5 8 3 3 a
2 c 4 9 5 0 a
3 d 5 4 -7 9 b
4 e 5 2 1 2 b
5 f 4 3 a 0 b
How about simple clipping at 0?
df[col_list] = df[col_list].clip(0)
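Putting the column selection and the clipping together on a small frame gives the sketch below. The column name `skip` and the list `exclude` are made up for this example and stand in for the two columns that should stay untouched.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [0, -1, 2],
    "b": [-3, 2, 1],
    "skip": [-5, -6, 7],  # hypothetical column whose negatives must be kept
})
exclude = ["skip"]  # hypothetical list of columns to leave alone

# Numeric columns minus the excluded ones.
cols = df.select_dtypes(include="number").columns.difference(exclude)

# Clip everything below 0 up to 0 in the selected columns only.
df[cols] = df[cols].clip(lower=0)
print(df)
```

Restricting to numeric columns first avoids comparing strings with 0, which is what would otherwise fail on object columns.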
I have a DataFrame whose integer index is missing some values (i.e. it is not equally spaced). I want to create a new DataFrame with an equally spaced index and forward-filled column values. Below is a simple example.
What I have:
import pandas as pd
df = pd.DataFrame(['A', 'B', 'C'], index=[0, 2, 4])
0
0 A
2 B
4 C
What I want to create from it:
0
0 A
1 A
2 B
3 B
4 C
Use reindex with method='ffill':
df = df.reindex(np.arange(0, df.index.max()+1), method='ffill')
Or:
df = df.reindex(np.arange(df.index.min(), df.index.max() + 1), method='ffill')
print (df)
0
0 A
1 A
2 B
3 B
4 C
Using reindex and ffill:
df = df.reindex(range(df.index[0],df.index[-1]+1)).ffill()
print(df)
0
0 A
1 A
2 B
3 B
4 C
You can do this:
In [319]: df.reindex(list(range(df.index.min(),df.index.max()+1))).ffill()
Out[319]:
0
0 A
1 A
2 B
3 B
4 C
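The answers above give the same result on this example, but `reindex(..., method='ffill')` and `reindex(...).ffill()` are not interchangeable in general: the first only fills rows introduced by the reindex, while the second also fills missing values that were already in the data. A small sketch, using a frame with a pre-existing gap and a column name `v` that are assumptions made for this illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical data: the value at index 2 is already missing before reindexing.
df = pd.DataFrame({"v": ["A", None, "C"]}, index=[0, 2, 4])
full_index = np.arange(df.index.min(), df.index.max() + 1)

# Fills only the newly created rows, taking the value of the previous
# existing index label; the original missing value at index 2 is kept.
a = df.reindex(full_index, method="ffill")

# Inserts new rows as missing, then forward-fills every missing value.
b = df.reindex(full_index).ffill()

print(a)
print(b)
```

Also note that `method='ffill'` requires the original index to be monotonic, whereas the two-step version does not.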
I have a DataFrame like this:
df = pd.DataFrame({'x': ['a', 'a', 'b', 'b', 'b', 'c'],
'y': [1, 2, 3, 4, 5, 6],
})
which looks like:
x y
0 a 1
1 a 2
2 b 3
3 b 4
4 b 5
5 c 6
I need to reshape it so that the 'x' column holds unique values:
x y_1 y_2 y_3
0 a 1 2 NaN
1 b 3 4 5
2 c 6 NaN NaN
So the number of y_N columns has to equal
max(df.groupby('x').count().values)
and the x column has to contain unique values.
So far I don't see how to build the y_N columns.
Thanks.
You can use pandas.crosstab with cumcount column as the columns parameter:
(pd.crosstab(df.x, df.groupby('x').cumcount() + 1, df.y,
aggfunc = lambda x: x.iloc[0])
.rename(columns="y_{}".format).reset_index())
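An alternative sketch, not taken from the answer above, builds the same table with pivot instead of crosstab: number the rows within each x group, then spread those numbers across columns. The helper column name `n` is an assumption chosen for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "x": ["a", "a", "b", "b", "b", "c"],
    "y": [1, 2, 3, 4, 5, 6],
})

out = (
    df.assign(n=df.groupby("x").cumcount() + 1)  # position within each x group
    .pivot(index="x", columns="n", values="y")   # one column per position
    .add_prefix("y_")                            # 1, 2, 3 -> y_1, y_2, y_3
    .reset_index()
)
print(out)
```

Groups shorter than the longest one are padded with NaN, so the y_N columns come out as floats.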