I have a dataframe like this:
   A    B
0  a    1
1  b    2
2  c    3
3  d  NaN
4  e  NaN
I would like to add column C like below:
   A    B   C
0  a    1  a1
1  b    2  b2
2  c    3  c3
3  d  NaN   d
4  e  NaN   e
So I tried
df["C"] = df.A + df.B
but it returns
     C
0   a1
1   b2
2   c3
3  NaN
4  NaN
How can I get the correct result?
In your code, I think the elements in the dataframe are strings, so try fillna:
In [10]: import pandas as pd
In [11]: import numpy as np
In [12]: df = pd.DataFrame({'A': ['a', 'b', 'c', 'd', 'e'],
'B': ['1', '2', '3', np.nan, np.nan]})
In [13]: df.B.fillna('')
Out[13]:
0 1
1 2
2 3
3
4
Name: B, dtype: object
In [14]: df
Out[14]:
A B
0 a 1
1 b 2
2 c 3
3 d NaN
4 e NaN
[5 rows x 2 columns]
In [15]: df.B = df.B.fillna('')
In [16]: df["C"]=df.A+df.B
In [17]: df
Out[17]:
A B C
0 a 1 a1
1 b 2 b2
2 c 3 c3
3 d d
4 e e
[5 rows x 3 columns]
df['C'] = pd.Series(df.fillna('').values.tolist()).str.join('')
You can use the add method with the fill_value parameter:
df['C'] = df.A.add(df.B, fill_value='')
df
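Another option worth knowing is str.cat, which concatenates string Series element-wise and takes an na_rep parameter for missing values. A minimal sketch using the same sample data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': ['a', 'b', 'c', 'd', 'e'],
                   'B': ['1', '2', '3', np.nan, np.nan]})

# str.cat joins two string Series element-wise; na_rep supplies a
# replacement for NaN so those rows are not turned into NaN results
df['C'] = df['A'].str.cat(df['B'], na_rep='')
print(df)
```

Rows where B is missing end up with C equal to A alone, matching the desired output.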
What I want to do is:
1- Group the dataframe by two columns
2- From each group, check if the values of a column are not in another column of the group.
x = pd.DataFrame({'x': [1,1,1,1,1,1,2], 'y': [4,4,4,5,5,5,4], 'z':['a', 'b', 'c', 'a', 'b', 'c', 'a'], 's':['a', 'a', 'b', 'a', 'a', 'a', 'b']})
x:
x y z s
0 1 4 a a
1 1 4 b a
2 1 4 c b
3 1 5 a a
4 1 5 b a
5 1 5 c a
6 2 4 a b
What I would like to check is whether the values of column z are not in column s, with the dataframe grouped by x and y.
For example, in the following group (x=1 and y=4):
x y z s
0 1 4 a a
1 1 4 b a
2 1 4 c b
The result would be the third row:
   x  y  z  s
2  1  4  c  b
I have tried something like this but it gets stuck:
x= x.groupby(['x', 'y'])[(~x.z.isin(x.s)).index]
Any suggestions?
Thanks in advance!
Left merge:
m = x.merge(x, left_on=['x','y','z'],
right_on=['x','y','s'],
how='left', suffixes=['','_']
)
You would see:
x y z s z_ s_
0 1 4 a a a a
1 1 4 a a b a
2 1 4 b a c b
3 1 4 c b NaN NaN
4 1 5 a a a a
5 1 5 a a b a
6 1 5 a a c a
7 1 5 b a NaN NaN
8 1 5 c a NaN NaN
9 2 4 a b NaN NaN
Then the rows you want are those where s_ is NaN, so:
m.loc[m['s_'].isna(), x.columns]
Output:
x y z s
3 1 4 c b
7 1 5 b a
8 1 5 c a
9 2 4 a b
Option 2: use apply with isin on a groupby:
(x.groupby(['x', 'y'])
  .apply(lambda d: d[~d['z'].isin(d['s'])])
  .reset_index(level=['x', 'y'], drop=True)
)
Output:
x y z s
2 1 4 c b
4 1 5 b a
5 1 5 c a
6 2 4 a b
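A variant of the merge idea, assuming the same x as above, is to pass indicator=True and keep the rows marked 'left_only' instead of testing a column for NaN. Dropping duplicates on the right side also avoids the duplicated matches visible in the merged table above. A sketch:

```python
import pandas as pd

x = pd.DataFrame({'x': [1, 1, 1, 1, 1, 1, 2], 'y': [4, 4, 4, 5, 5, 5, 4],
                  'z': ['a', 'b', 'c', 'a', 'b', 'c', 'a'],
                  's': ['a', 'a', 'b', 'a', 'a', 'a', 'b']})

# drop_duplicates keeps each (x, y, s) key once, so every left row
# matches at most one right row; indicator marks unmatched rows 'left_only'
m = x.merge(x[['x', 'y', 's']].drop_duplicates(),
            left_on=['x', 'y', 'z'], right_on=['x', 'y', 's'],
            how='left', suffixes=['', '_'], indicator=True)
out = m.loc[m['_merge'] == 'left_only', x.columns]
print(out)
```

This produces the same four rows as the NaN-based filter.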
I have a dataframe to which I'd like to add a level of columns.
The correct new level of column can be found using my_dict.
df = pd.DataFrame(index=list('abcde'), data={'A': range(5), 'B': range(5)})
my_dict = {"B": "BB","A": "AA","C": "CC"}
This is what I expect:
Out[92]:
A B
AA BB
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4
Thanks
You can use df.columns.map, then convert to a MultiIndex:
df.columns = pd.MultiIndex.from_arrays((df.columns,df.columns.map(my_dict)))
A B
AA BB
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4
Use Index.map and assign back to the column names as a nested list - if there is no match, you get NaN:
df.columns = [df.columns, df.columns.map(my_dict)]
print (df)
A B
AA BB
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4
Solution with rename - if there is no match, the original values are kept:
df.columns = [df.columns, df.rename(columns=my_dict).columns]
df.columns = [df.columns, df.columns]
df = df.rename(columns=my_dict, level=1)
Test with a column that has no match:
df = pd.DataFrame(index=list('abcde'), data={'A': range(5), 'B': range(5), 'D': range(5)})
my_dict = {"B": "BB","A": "AA","C": "CC"}
df.columns = [df.columns, df.columns.map(my_dict)]
print (df)
A B D
AA BB NaN
a 0 0 0
b 1 1 1
c 2 2 2
d 3 3 3
e 4 4 4
df.columns = [df.columns, df.columns]
df = df.rename(columns=my_dict, level=1)
#df.columns = [df.columns, df.rename(columns=my_dict).columns]
print (df)
A B D
AA BB D
a 0 0 0
b 1 1 1
c 2 2 2
d 3 3 3
e 4 4 4
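The same fall-back-to-original behaviour can also be sketched explicitly with MultiIndex.from_tuples, using dict.get so an unmapped column keeps its own name:

```python
import pandas as pd

df = pd.DataFrame(index=list('abcde'),
                  data={'A': range(5), 'B': range(5), 'D': range(5)})
my_dict = {"B": "BB", "A": "AA", "C": "CC"}

# dict.get(c, c) falls back to the original name when c is not in my_dict
df.columns = pd.MultiIndex.from_tuples([(c, my_dict.get(c, c))
                                        for c in df.columns])
print(df)
```

Column D, which has no entry in my_dict, keeps 'D' on the second level instead of NaN.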
I want to fill some rows' values using other rows' values.
Let me give an example:
In [7]: df = pd.DataFrame([['a', 'b', 'c', 'aa', 'ba'], [1,2,3,np.nan,np.nan]]).T
In [8]: df
Out[8]:
0 1
0 a 1
1 b 2
2 c 3
3 aa NaN
4 ba NaN
What I want is to fill df.loc[3, 1] with the value of df.loc[0, 1], and df.loc[4, 1] with the value of df.loc[1, 1], because of a given condition: 'a' and 'aa' (loc[0, 0] and loc[3, 0]) share the same first letter 'a', and 'b' and 'ba' share 'b'.
Is there a good method to do this?
If it's possible to combine values by first letter with forward filling, use:
df[1] = df.groupby(df[0].str[0])[1].ffill()
print (df)
0 1
0 a 1
1 b 2
2 c 3
3 aa 1
4 ba 2
If you need to replace by the first non-missing value, use GroupBy.transform with GroupBy.first:
df = pd.DataFrame([['aa', 'b', 'c', 'a', 'ba'], [np.nan,2,3,1,np.nan]]).T
print (df)
0 1
0 aa NaN
1 b 2
2 c 3
3 a 1
4 ba NaN
df[1] = df.groupby(df[0].str[0])[1].transform('first')
print (df)
0 1
0 aa 1
1 b 2
2 c 3
3 a 1
4 ba 2
Using map, I can only think of this:
map_val = df.dropna().set_index(0).to_dict()[1]
df[1] = df[1].fillna(df[0].map(lambda x:map_val[x[0]]))
df
0 1
0 a 1
1 b 2
2 c 3
3 aa 1
4 ba 2
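A more defensive version of the map idea, assuming the lookup should go by first letter, could build the mapping with a groupby and then use Series.map, which returns NaN for an unseen letter instead of raising a KeyError the way a plain dict lookup would:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame([['a', 'b', 'c', 'aa', 'ba'], [1, 2, 3, np.nan, np.nan]]).T

# build a first-letter -> first non-missing value mapping, then map it
known = df.dropna(subset=[1])
first = known.groupby(known[0].str[0])[1].first()
df[1] = df[1].fillna(df[0].str[0].map(first))
print(df)
```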
I've got a matrix like this:
df = pd.DataFrame({'a':[7, 0, 3], 'b':[0, 4, 2], 'c':[3, 2, 9]})
df.index = list(df)
df
a b c
a 7 0 3
b 0 4 2
c 3 2 9
And I'd like to get something like this:
C1 C2 V
0 a a 7
1 a b 0
2 a c 3
3 b a 0
4 b b 4
5 b c 2
6 c a 3
7 c b 2
8 c c 9
For which I've written the following code:
vv = pd.DataFrame(columns=['C1', 'C2', 'V'])
i = 0
for cat1 in df.index:
    for cat2 in df.index:
        vv.loc[i] = [cat1, cat2, df[cat1][cat2]]
        i += 1
vv['V'] = vv['V'].astype(int)
Is there a better/faster/more elegant way of doing this?
In [90]: df = df.stack().reset_index()
In [91]: df.columns = ['C1', 'C2', 'v']
In [92]: df
Out[92]:
C1 C2 v
0 a a 7
1 a b 0
2 a c 3
3 b a 0
4 b b 4
5 b c 2
6 c a 3
7 c b 2
8 c c 9
You can use the stack() method followed by resetting the index and renaming the columns.
df = pd.DataFrame({'a':[7, 0, 3], 'b':[0, 4, 2], 'c':[3, 2, 9]})
df.index = list(df)
result = df.stack().reset_index().rename(columns={'level_0':'C1', 'level_1':'C2',0:'V'})
print(result)
C1 C2 V
0 a a 7
1 a b 0
2 a c 3
3 b a 0
4 b b 4
5 b c 2
6 c a 3
7 c b 2
8 c c 9
Use:
df = (df.rename_axis('C2')
        .reset_index()
        .melt('C2', var_name='C1', value_name='V')
        .reindex(columns=['C1', 'C2', 'V']))
print (df)
C1 C2 V
0 a a 7
1 a b 0
2 a c 3
3 b a 0
4 b b 4
5 b c 2
6 c a 3
7 c b 2
8 c c 9
You can use stack:
df.stack()
a a 7
b 0
c 3
b a 0
b 4
c 2
c a 3
b 2
c 9
dtype: int64
Running pd.set_option('display.multi_sparse', False) will turn off the sparse display, showing the index values in every row.
Additionally, with proper renaming in a pipeline:
(df.stack()
   .reset_index()
   .rename(columns={'level_0': 'C1', 'level_1': 'C2', 0: 'V'}))
yields:
C1 C2 V
0 a a 7
1 a b 0
2 a c 3
3 b a 0
4 b b 4
5 b c 2
6 c a 3
7 c b 2
8 c c 9
To complete the answer and get the same output, I've added the following code:
vv = df.stack().reset_index()
vv.columns = ['C1', 'C2', 'V']
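For larger matrices, a NumPy-based sketch that builds the long table directly from the values, assuming the same df as in the question, is also possible:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [7, 0, 3], 'b': [0, 4, 2], 'c': [3, 2, 9]})
df.index = list(df)

# repeat the row labels, tile the column labels, and flatten the
# values row-major so the three arrays line up
vv = pd.DataFrame({'C1': np.repeat(df.index, len(df.columns)),
                   'C2': np.tile(df.columns, len(df.index)),
                   'V': df.values.ravel()})
print(vv)
```

This gives the same rows as stack() without building a MultiIndex first.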
Given the following data frame:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'B': ['a', 'a', 'b', 'a', 'a', 'a']})
df
A B
0 A a
1 A a
2 A b
3 B a
4 B a
5 B a
I'd like to create column 'C', which numbers the rows within each group in columns A and B like this:
A B C
0 A a 1
1 A a 2
2 A b 1
3 B a 1
4 B a 2
5 B a 3
I've tried this so far:
df['C']=df.groupby(['A','B'])['B'].transform('rank')
...but it doesn't work!
Use groupby/cumcount:
In [25]: df['C'] = df.groupby(['A','B']).cumcount()+1; df
Out[25]:
A B C
0 A a 1
1 A a 2
2 A b 1
3 B a 1
4 B a 2
5 B a 3
Use the groupby rank function.
Here is a working example:
df = pd.DataFrame({'C1':['a', 'a', 'a', 'b', 'b'], 'C2': [1, 2, 3, 4, 5]})
df
C1 C2
a 1
a 2
a 3
b 4
b 5
df["RANK"] = df.groupby("C1")["C2"].rank(method="first", ascending=True)
df
C1 C2 RANK
a 1 1
a 2 2
a 3 3
b 4 1
b 5 2
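Note that rank(method='first') matches a plain row number here only because C2 is already increasing within each group; groupby.cumcount numbers rows by position regardless of their values. A quick check of the two on this data:

```python
import pandas as pd

df = pd.DataFrame({'C1': ['a', 'a', 'a', 'b', 'b'], 'C2': [1, 2, 3, 4, 5]})

# cumcount numbers rows by position; rank(method='first') numbers them
# by value order, which coincides here since C2 is sorted within groups
pos = df.groupby('C1').cumcount() + 1
rnk = df.groupby('C1')['C2'].rank(method='first').astype(int)
print(pos.tolist(), rnk.tolist())
```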