I have a dataframe to which I'd like to add a level of columns.
The correct new level for each column can be looked up in my_dict.
df = pd.DataFrame(index=list('abcde'), data={'A': range(5), 'B': range(5)})
my_dict = {"B": "BB","A": "AA","C": "CC"}
This is what I expect:
Out[92]:
A B
AA BB
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4
Thanks
You can use df.columns.map and then convert the result to a MultiIndex:
df.columns = pd.MultiIndex.from_arrays((df.columns,df.columns.map(my_dict)))
A B
AA BB
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4
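For reference, here is the from_arrays approach end-to-end, using the sample frame from the question:

```python
import pandas as pd

df = pd.DataFrame(index=list('abcde'), data={'A': range(5), 'B': range(5)})
my_dict = {"B": "BB", "A": "AA", "C": "CC"}

# Map each existing column name through the dict to build the second level
df.columns = pd.MultiIndex.from_arrays((df.columns, df.columns.map(my_dict)))
print(df.columns.tolist())  # [('A', 'AA'), ('B', 'BB')]
```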
Use Index.map and assign the result back to the column names as a nested list; columns with no match get NaN:
df.columns = [df.columns, df.columns.map(my_dict)]
print (df)
A B
AA BB
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4
A solution with rename; columns with no match keep their original names:
df.columns = [df.columns, df.rename(columns=my_dict).columns]
df.columns = [df.columns, df.columns]
df = df.rename(columns=my_dict, level=1)
Test with a column that has no match:
df = pd.DataFrame(index=list('abcde'), data={'A': range(5), 'B': range(5), 'D': range(5)})
my_dict = {"B": "BB","A": "AA","C": "CC"}
df.columns = [df.columns, df.columns.map(my_dict)]
print (df)
A B D
AA BB NaN
a 0 0 0
b 1 1 1
c 2 2 2
d 3 3 3
e 4 4 4
df.columns = [df.columns, df.columns]
df = df.rename(columns=my_dict, level=1)
#df.columns = [df.columns, df.rename(columns=my_dict).columns]
print (df)
A B D
AA BB D
a 0 0 0
b 1 1 1
c 2 2 2
d 3 3 3
e 4 4 4
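Both no-match behaviours can be checked in isolation; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame(index=list('abcde'),
                  data={'A': range(5), 'B': range(5), 'D': range(5)})
my_dict = {"B": "BB", "A": "AA", "C": "CC"}

# Variant 1: Index.map -- unmatched columns get NaN in the new level
mapped = df.columns.map(my_dict)

# Variant 2: rename -- unmatched columns keep their original name
renamed = df.rename(columns=my_dict).columns
print(renamed.tolist())  # ['AA', 'BB', 'D']
```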
I want to fill some rows' values using other rows' values.
Here is an example:
In [7]: df = pd.DataFrame([['a', 'b', 'c', 'aa', 'ba'], [1,2,3,np.nan,np.nan]]).T
In [8]: df
Out[8]:
0 1
0 a 1
1 b 2
2 c 3
3 aa NaN
4 ba NaN
What I want is to fill df.loc[3, 1] with the value of df.loc[0, 1],
and df.loc[4, 1] with df.loc[1, 1],
because of a given condition: 'a' and 'aa' (df.loc[0, 0] and df.loc[3, 0]) share the same
first letter 'a', and 'b' and 'ba' share 'b'.
Are there any good methods to do this?
If it is possible to combine values by the first letter with forward filling, use:
df[1] = df.groupby(df[0].str[0])[1].ffill()
print (df)
0 1
0 a 1
1 b 2
2 c 3
3 aa 1
4 ba 2
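A self-contained version of the forward-fill approach, rebuilding the question's frame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame([['a', 'b', 'c', 'aa', 'ba'], [1, 2, 3, np.nan, np.nan]]).T

# Group rows by the first letter of column 0, then forward-fill column 1
df[1] = df.groupby(df[0].str[0])[1].ffill()
print(df[1].tolist())  # [1, 2, 3, 1, 2]
```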
If you need to replace by the first non-missing value, use GroupBy.transform with GroupBy.first:
df = pd.DataFrame([['aa', 'b', 'c', 'a', 'ba'], [np.nan,2,3,1,np.nan]]).T
print (df)
0 1
0 aa NaN
1 b 2
2 c 3
3 a 1
4 ba NaN
df[1] = df.groupby(df[0].str[0])[1].transform('first')
print (df)
0 1
0 aa 1
1 b 2
2 c 3
3 a 1
4 ba 2
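The transform variant as one runnable piece; transform('first') broadcasts each group's first non-missing value to every row of that group:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame([['aa', 'b', 'c', 'a', 'ba'], [np.nan, 2, 3, 1, np.nan]]).T

# For each first-letter group, take the first non-missing value and
# broadcast it back to all rows in the group
df[1] = df.groupby(df[0].str[0])[1].transform('first')
print(df[1].tolist())  # [1, 2, 3, 1, 2]
```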
Using map, I can only think of this:
map_val = df.dropna().set_index(0).to_dict()[1]
df[1] = df[1].fillna(df[0].map(lambda x:map_val[x[0]]))
df
0 1
0 a 1
1 b 2
2 c 3
3 aa 1
4 ba 2
The df I have:
A B C
a 1 2 3
b 2 1 4
c 1 1 1
The df I want:
A B C
a 1 2 3
b 2 1 4
c 1 1 1
d 1 -1 1
I am able to get df want by using:
df.loc['d']=df.loc['b']-df.loc['a']
However, my actual df has 'a','b','c' rows for multiple IDs 'X', 'Y' etc.
A B C
X a 1 2 3
b 2 1 4
c 1 1 1
Y a 1 2 3
b 2 2 4
c 1 1 1
How can I create the same output with multiple IDs?
My original method:
df.loc['d']=df.loc['b']-df.loc['a']
fails with KeyError: 'b'
Desired output:
A B C
X a 1 2 3
b 2 1 4
c 1 1 1
d 1 -1 1
Y a 1 2 3
b 2 2 4
c 1 1 1
d 1 0 1
IIUC,
for i, sub in df.groupby(df.index.get_level_values(0)):
    df.loc[(i, 'd'), :] = sub.loc[(i, 'b')] - sub.loc[(i, 'a')]
print(df.sort_index())
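A self-contained version of the loop, with sample data shaped like the question's; wrapping the groupby in list() is a small defensive tweak so that enlarging df inside the loop cannot interfere with the iteration:

```python
import pandas as pd

idx = pd.MultiIndex.from_product([['X', 'Y'], ['a', 'b', 'c']])
df = pd.DataFrame([[1, 2, 3], [2, 1, 4], [1, 1, 1],
                   [1, 2, 3], [2, 2, 4], [1, 1, 1]],
                  index=idx, columns=['A', 'B', 'C'])

# For each outer-level ID, append a 'd' row equal to b - a
for i, sub in list(df.groupby(level=0)):
    df.loc[(i, 'd'), :] = sub.loc[(i, 'b')] - sub.loc[(i, 'a')]
df = df.sort_index()
```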
Or maybe
k = (df.groupby(df.index.get_level_values(0), as_index=False)
       .apply(lambda s: pd.DataFrame([s.loc[(s.name, 'b')].values - s.loc[(s.name, 'a')].values],
                                     columns=s.columns,
                                     index=pd.MultiIndex(levels=[[s.name], ['d']], codes=[[0], [0]])))
       .reset_index(drop=True, level=0))
pd.concat([k, df]).sort_index()
Data reshaping is a useful trick if you want to do manipulation on a particular level of a multiindex. See code below,
result = (df.unstack(0).T
.assign(d=lambda x:x.b-x.a)
.stack()
.unstack(0))
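The reshaping trick as a runnable sketch, with the sample data rebuilt inline:

```python
import pandas as pd

idx = pd.MultiIndex.from_product([['X', 'Y'], ['a', 'b', 'c']])
df = pd.DataFrame([[1, 2, 3], [2, 1, 4], [1, 1, 1],
                   [1, 2, 3], [2, 2, 4], [1, 1, 1]],
                  index=idx, columns=['A', 'B', 'C'])

# Move the ID level into the columns, transpose so a/b/c become columns,
# derive d = b - a, then reshape back to the original layout
result = (df.unstack(0).T
            .assign(d=lambda x: x.b - x.a)
            .stack()
            .unstack(0))
```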
Use pd.IndexSlice to slice the a and b rows. Call diff, slice the b rows from the result, and rename them to d. Finally, append the result to the original df:
idx = pd.IndexSlice
df1 = df.loc[idx[:,['a','b']],:].diff().loc[idx[:,'b'],:].rename({'b': 'd'})
df2 = df.append(df1).sort_index().astype(int)
Out[106]:
A B C
X a 1 2 3
b 2 1 4
c 1 1 1
d 1 -1 1
Y a 1 2 3
b 2 2 4
c 1 1 1
d 1 0 1
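The same IndexSlice recipe as a self-contained sketch; note that DataFrame.append was removed in pandas 2.x, so pd.concat is used here instead:

```python
import pandas as pd

mi = pd.MultiIndex.from_product([['X', 'Y'], ['a', 'b', 'c']])
df = pd.DataFrame([[1, 2, 3], [2, 1, 4], [1, 1, 1],
                   [1, 2, 3], [2, 2, 4], [1, 1, 1]],
                  index=mi, columns=['A', 'B', 'C'])

idx = pd.IndexSlice
# diff() over the a/b rows leaves b - a on each 'b' row; keep those rows
# and relabel them 'd'
df1 = df.loc[idx[:, ['a', 'b']], :].diff().loc[idx[:, 'b'], :].rename({'b': 'd'})
df2 = pd.concat([df, df1]).sort_index().astype(int)
```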
I've got a matrix like this:
df = pd.DataFrame({'a':[7, 0, 3], 'b':[0, 4, 2], 'c':[3, 2, 9]})
df.index = list(df)
df
a b c
a 7 0 3
b 0 4 2
c 3 2 9
And I'd like to get something like this:
C1 C2 V
0 a a 7
1 a b 0
2 a c 3
3 b a 0
4 b b 4
5 b c 2
6 c a 3
7 c b 2
8 c c 9
For which I've written the following code:
vv = pd.DataFrame(columns=['C1', 'C2', 'V'])
i = 0
for cat1 in df.index:
    for cat2 in df.index:
        vv.loc[i] = [cat1, cat2, df[cat1][cat2]]
        i += 1
vv['V'] = vv['V'].astype(int)
Is there a better/faster/more elegant way of doing this?
In [90]: df = df.stack().reset_index()
In [91]: df.columns = ['C1', 'C2', 'v']
In [92]: df
Out[92]:
C1 C2 v
0 a a 7
1 a b 0
2 a c 3
3 b a 0
4 b b 4
5 b c 2
6 c a 3
7 c b 2
8 c c 9
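The stack approach runs end-to-end like so, with the question's frame rebuilt inline:

```python
import pandas as pd

df = pd.DataFrame({'a': [7, 0, 3], 'b': [0, 4, 2], 'c': [3, 2, 9]})
df.index = list(df)

# stack() turns the matrix into a Series keyed by (row, column) pairs
long_df = df.stack().reset_index()
long_df.columns = ['C1', 'C2', 'V']
```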
You can use the stack() method, followed by resetting the index and renaming the columns.
df = pd.DataFrame({'a':[7, 0, 3], 'b':[0, 4, 2], 'c':[3, 2, 9]})
df.index = list(df)
result = df.stack().reset_index().rename(columns={'level_0':'C1', 'level_1':'C2',0:'V'})
print(result)
C1 C2 V
0 a a 7
1 a b 0
2 a c 3
3 b a 0
4 b b 4
5 b c 2
6 c a 3
7 c b 2
8 c c 9
Use:
df = (df.rename_axis('C2')
.reset_index()
.melt('C2', var_name='C1', value_name='V')
.reindex(columns=['C1','C2','V']))
print (df)
C1 C2 V
0 a a 7
1 a b 0
2 a c 3
3 b a 0
4 b b 4
5 b c 2
6 c a 3
7 c b 2
8 c c 9
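A runnable version of the melt pipeline, with the frame rebuilt inline; because this matrix is symmetric, melt produces the same values as the stack-based answers even though it walks the data column by column:

```python
import pandas as pd

df = pd.DataFrame({'a': [7, 0, 3], 'b': [0, 4, 2], 'c': [3, 2, 9]},
                  index=list('abc'))

# Name the index axis, lift it into a column, then melt the rest
out = (df.rename_axis('C2')
         .reset_index()
         .melt('C2', var_name='C1', value_name='V')
         .reindex(columns=['C1', 'C2', 'V']))
```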
You can use stack:
df.stack()
a a 7
b 0
c 3
b a 0
b 4
c 2
c a 3
b 2
c 9
dtype: int64
The pd.set_option('display.multi_sparse', False) setting de-sparsifies the display, showing the index values in every row.
Additionally, with proper renaming in a pipeline:
(df.stack()
   .reset_index()
   .rename(columns={'level_0': 'C1', 'level_1': 'C2', 0: 'V'}))
yields:
C1 C2 V
0 a a 7
1 a b 0
2 a c 3
3 b a 0
4 b b 4
5 b c 2
6 c a 3
7 c b 2
8 c c 9
To complete the answer and get the same output, I've added the following code:
vv = df.stack().reset_index()
vv.columns = ['C1', 'C2', 'V']
I have a pandas data frame that looks something like this:
data = {'1' : [0, 2, 0, 0], '2' : [5, 0, 0, 2], '3' : [2, 0, 0, 0], '4' : [0, 7, 0, 0]}
df = pd.DataFrame(data, index = ['a', 'b', 'c', 'd'])
df
1 2 3 4
a 0 5 2 0
b 2 0 0 7
c 0 0 0 0
d 0 2 0 0
I know I can get the maximum value and the corresponding column name for each row by doing (respectively):
df.max(1)
df.idxmax(1)
How can I get the values and the column name for every cell that is not zero?
So in this case, I'd want 2 tables, one giving me each value != 0 for each row:
a 5
a 2
b 2
b 7
d 2
And one giving me the column names for those values:
a 2
a 3
b 1
b 4
d 2
Thanks!
You can use stack to get a Series, then filter it by boolean indexing, then rename_axis and reset_index; last, drop a column or select a subset of columns:
s = df.stack()
df1 = s[s!= 0].rename_axis(['a','b']).reset_index(name='c')
print (df1)
a b c
0 a 2 5
1 a 3 2
2 b 1 2
3 b 4 7
4 d 2 2
df2 = df1.drop('b', axis=1)
print (df2)
a c
0 a 5
1 a 2
2 b 2
3 b 7
4 d 2
df3 = df1.drop('c', axis=1)
print (df3)
a b
0 a 2
1 a 3
2 b 1
3 b 4
4 d 2
df3 = df1[['a','c']]
print (df3)
a c
0 a 5
1 a 2
2 b 2
3 b 7
4 d 2
df3 = df1[['a','b']]
print (df3)
a b
0 a 2
1 a 3
2 b 1
3 b 4
4 d 2
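Putting the pieces together as one runnable sketch that keeps both result tables (the level names 'row'/'col'/'value' here are arbitrary labels, not part of the original answer):

```python
import pandas as pd

data = {'1': [0, 2, 0, 0], '2': [5, 0, 0, 2],
        '3': [2, 0, 0, 0], '4': [0, 7, 0, 0]}
df = pd.DataFrame(data, index=['a', 'b', 'c', 'd'])

# Stack to (row, col) pairs, then keep only the non-zero entries
s = df.stack()
nonzero = s[s != 0].rename_axis(['row', 'col']).reset_index(name='value')

values = nonzero[['row', 'value']]  # value != 0 for each row
cols = nonzero[['row', 'col']]      # column names for those values
```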
I have a dataframe like this
A B
0 a 1
1 b 2
2 c 3
3 d nan
4 e nan
I would like to add a column C like below:
A B C
0 a 1 a1
1 b 2 b2
2 c 3 c3
3 d nan d
4 e nan e
So I tried
df["C"] = df.A + df.B
but it returns
C
a1
b2
c3
nan
nan
How can I get the correct result?
In your code, I think the elements in the dataframe are strings, so try fillna:
In [10]: import pandas as pd
In [11]: import numpy as np
In [12]: df = pd.DataFrame({'A': ['a', 'b', 'c', 'd', 'e'],
'B': ['1', '2', '3', np.nan, np.nan]})
In [13]: df.B.fillna('')
Out[13]:
0 1
1 2
2 3
3
4
Name: B, dtype: object
In [14]: df
Out[14]:
A B
0 a 1
1 b 2
2 c 3
3 d NaN
4 e NaN
[5 rows x 2 columns]
In [15]: df.B = df.B.fillna('')
In [16]: df["C"]=df.A+df.B
In [17]: df
Out[17]:
A B C
0 a 1 a1
1 b 2 b2
2 c 3 c3
3 d d
4 e e
[5 rows x 3 columns]
df['C'] = pd.Series(df.fillna('').values.tolist()).str.join('')
You can use the add method with the fill_value parameter:
df['C'] = df.A.add(df.B, fill_value='')
df
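A runnable sketch of the add approach; fill_value='' substitutes an empty string for the missing values before the two string columns are concatenated:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': ['a', 'b', 'c', 'd', 'e'],
                   'B': ['1', '2', '3', np.nan, np.nan]})

# Where B is NaN, '' is used instead, so those rows get just the A value
df['C'] = df.A.add(df.B, fill_value='')
print(df['C'].tolist())  # ['a1', 'b2', 'c3', 'd', 'e']
```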