How do I combine N non-numerical columns while removing null values?

Building on this question Combining columns and removing NaNs Pandas,
I have a dataframe that looks like this:
col x y z
a1 a NaN NaN
a2 NaN b NaN
a3 NaN c NaN
a4 NaN NaN d
a5 NaN e NaN
a6 f NaN NaN
a7 g NaN NaN
a8 NaN NaN NaN
The cell values are strings and the NaNs are arbitrary null values.
I would like to combine the columns to add a new combined column thus:
col w
a1 a
a2 b
a3 c
a4 d
a5 e
a6 f
a7 g
a8 NaN
The elegant solution proposed in the question above uses
df['w'] = df[['x','y','z']].sum(axis=1)
but sum does not work for non-numerical values.
How, in this case for strings, do I combine the columns into a single column?
You can assume:
Each row only has one of x, y, z that is non-null.
The individual columns must be referenced by name (since they are a subset of all of the available columns in the dataframe).
In general there are N and not just 3 columns in the subset.
Hopefully without resorting to iloc or for loops :\
Update: (apologies to those who have already given answers :\ )
I have added a final row where every column contains NaN, and I would like the combined row to reflect that. Thanks + sorry!
Thanks as ever for all help

Here is yet another solution:
df['res'] = df.fillna('').sum(axis=1).replace('', np.nan)
The result is
x y z res
col
a1 a NaN NaN a
a2 NaN b NaN b
a3 NaN c NaN c
a4 NaN NaN d d
a5 NaN e NaN e
a6 f NaN NaN f
a7 g NaN NaN g
a8 NaN NaN NaN NaN
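For reference, a self-contained sketch of this answer (the frame below just mirrors the question's example, including the all-NaN row a8):
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': ['a', np.nan, np.nan, np.nan, np.nan, 'f', 'g', np.nan],
                   'y': [np.nan, 'b', 'c', np.nan, 'e', np.nan, np.nan, np.nan],
                   'z': [np.nan, np.nan, np.nan, 'd', np.nan, np.nan, np.nan, np.nan]},
                  index=pd.Index(['a1', 'a2', 'a3', 'a4', 'a5', 'a6', 'a7', 'a8'], name='col'))

# NaN -> '' so the row-wise sum becomes string concatenation; rows that stay ''
# (the all-NaN row a8) are mapped back to NaN.
df['res'] = df.fillna('').sum(axis=1).replace('', np.nan)
print(df)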

I think you need:
s = df[['x','y','z']]
# Each row has exactly one non-null value, so the boolean mask yields exactly one value per row, in row order.
df['w'] = s.values[s.notnull()]
df[['col','w']]
Or, after the edit of the question (the all-NaN row a8 makes the mask above yield fewer values than rows, so use apply instead):
df['w'] = pd.DataFrame(df[['x','y','z']].apply(lambda x: x.values[x.notnull()],axis=1).tolist())
df[['col','w']].fillna(np.nan)
Which gives
col w
0 a1 a
1 a2 b
2 a3 c
3 a4 d
4 a5 e
5 a6 f
6 a7 g
7 a8 NaN

Instead of the generic sum, you have to apply a custom function.
This one, for example, works on your data:
import numpy as np
f = lambda x: x[x.notnull()].iloc[0] if x.notnull().any() else np.nan
df['w'] = df[list('xyz')].apply(f, axis=1)
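Under the question's one-non-null-per-row assumption, a vectorised alternative (a sketch using the same backfill idea as one of the related answers below) is to backfill across the rows and keep the first column; all-NaN rows stay NaN:
import numpy as np
import pandas as pd

# Frame as in the question: exactly one non-null value per row, plus an all-NaN row.
df = pd.DataFrame({'col': ['a1', 'a2', 'a3', 'a4', 'a5', 'a6', 'a7', 'a8'],
                   'x': ['a', np.nan, np.nan, np.nan, np.nan, 'f', 'g', np.nan],
                   'y': [np.nan, 'b', 'c', np.nan, 'e', np.nan, np.nan, np.nan],
                   'z': [np.nan, np.nan, np.nan, 'd', np.nan, np.nan, np.nan, np.nan]})

# After bfill(axis=1) the first of the selected columns holds each row's first
# (here: only) non-null value, so just keep that column.
df['w'] = df[['x', 'y', 'z']].bfill(axis=1).iloc[:, 0]
print(df[['col', 'w']])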

Related

Merging columns using pandas

I am trying to merge multiple-choice question columns using pandas so I can then manipulate them. An example of what my questions look like is:
  C1 C2 C3
0  A     A
1     B  B
2     C  C
3  D     D
The data is currently spread across C1 and C2, but I need it combined into one column, as shown in C3.
One option, assuming the empty cells are NaN, is to backfill along the rows and take the first column:
df['C3'] = df[['C1', 'C2']].bfill(axis=1)['C1']
This way is extensible to any number of initial columns.
Output:
C1 C2 C3
0 A NaN A
1 NaN B B
2 NaN C C
3 D NaN D
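As a quick illustration of the "any number of initial columns" point, the same line works with a third (made-up) source column C0:
import numpy as np
import pandas as pd

# Hypothetical frame with three source columns instead of two.
df = pd.DataFrame({'C0': [np.nan, np.nan, 'C', np.nan],
                   'C1': ['A', np.nan, np.nan, 'D'],
                   'C2': [np.nan, 'B', np.nan, np.nan]})

# Backfill across the rows and keep the first of the source columns.
df['C3'] = df[['C0', 'C1', 'C2']].bfill(axis=1)['C0']
print(df)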
You may try fillna:
df['C3'] = df['C1'].fillna(df['C2'])
df
Out[483]:
C1 C2 C3
0 A NaN A
1 NaN B B
2 NaN C C
3 D NaN D
You can also use combine_first:
df['C3'] = df['C1'].combine_first(df['C2'])
print(df)
# Output
C1 C2 C3
0 A NaN A
1 NaN B B
2 NaN C C
3 D NaN D
If your cells contain empty strings and not null values, replace them temporarily with NaN:
df['C3'] = df['C1'].replace('', np.nan).combine_first(df['C2'])
print(df)
# Output
  C1 C2 C3
0  A     A
1     B  B
2     C  C
3  D     D
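If there are more than two source columns, the same idea can be chained or reduced over a list of columns; a sketch with a made-up extra column C0 (left-most non-null value wins):
from functools import reduce
import numpy as np
import pandas as pd

df = pd.DataFrame({'C1': ['A', np.nan, np.nan, 'D'],
                   'C2': [np.nan, 'B', np.nan, np.nan],
                   'C0': [np.nan, np.nan, 'C', np.nan]})  # hypothetical extra column

# Priority runs left to right through the list: C1, then C2, then C0.
df['C3'] = reduce(lambda acc, col: acc.combine_first(df[col]), ['C2', 'C0'], df['C1'])
print(df)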

How to add new rows to a dataframe?

I have this dataframe:
df = pd.DataFrame({'Type':['A','A','B','B'], 'Variants':['A3','A6','Bxy','Byz']})
it shows like this
Type Variants
0 A A3
1 A A6
2 B Bxy
3 B Byz
I need to write a function that adds n new rows below the existing rows of each Type group.
For example, with n=2 it should go like this:
Type Variants
0 A A3
1 A A6
2 A NaN
3 A NaN
4 B Bxy
5 B Byz
6 B NaN
7 B NaN
Can anyone help me with this? I'd appreciate it a lot, thanks in advance.
Create a dataframe of the extra rows and concatenate it with your original one:
import numpy as np
import pandas as pd

def add_rows(df, n):
    # n extra rows per unique Type; concat fills the missing Variants column with NaN.
    df1 = pd.DataFrame(np.repeat(df['Type'].unique(), n), columns=['Type'])
    return pd.concat([df, df1]).sort_values('Type').reset_index(drop=True)
out = add_rows(df, 2)
print(out)
# Output
Type Variants
0 A A3
1 A A6
2 A NaN
3 A NaN
4 B Bxy
5 B Byz
6 B NaN
7 B NaN
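If you would rather keep the original within-group order without relying on the final sort, a per-group version is also possible; a sketch (add_rows_per_group is a made-up name):
import numpy as np
import pandas as pd

def add_rows_per_group(df, n):
    # Append n NaN-variant rows after each Type group, preserving the original order.
    pieces = []
    for key, group in df.groupby('Type', sort=False):
        extra = pd.DataFrame({'Type': [key] * n, 'Variants': [np.nan] * n})
        pieces.append(pd.concat([group, extra]))
    return pd.concat(pieces).reset_index(drop=True)

df = pd.DataFrame({'Type': ['A', 'A', 'B', 'B'], 'Variants': ['A3', 'A6', 'Bxy', 'Byz']})
print(add_rows_per_group(df, 2))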

Calculate mean of row after X entries

I want to calculate the mean of all values in a row, but only after e.g. 5 valid entries have appeared in that row, which leads to different "start" points for the mean calculation. As soon as there are 5 values in a row, the mean of those values should be calculated.
Note: there might be some NaNs in the rows which should not count towards the 5 entries; I want to use valid values only.
Example, if I wanted to calculate after e.g. 5 entries:
Index D1 D2 D3 D4 D5 D6 D7
1 NaN NaN 2 3 4 5 6
2 1 1 2 3 4 5 6
3 2 1 NaN 3 4 5 6
4 NaN NaN NaN 3 4 5 6
My desired output looks like this:
Index D1 D2 D3 D4 D5 D6 D7
1 NaN NaN NaN NaN NaN NaN 4
2 NaN NaN NaN NaN 2.2 2.66 3.14
3 NaN NaN NaN NaN NaN 3 3.5
4 NaN NaN NaN NaN NaN NaN NaN
I was trying to use the .count method, but I got NaNs in all cells using my code below:
B = A.copy()
for i in range(A.shape[0]):
    for j in range(A.shape[1]):
        if A.iloc[i,0:j].count() > 5:
            B.iloc[i,j] = B.iloc[i,0:j].sum()/B.iloc[i,0:j].count()
        else:
            B.iloc[i,j] = np.nan
Edit:
It looks like I found a solution, changing this line inside the for loop:
# Old version
B.iloc[i,j] = B.iloc[i,0:j].sum()/B.iloc[i,0:j].count()
# New version
B.iloc[i,j] = A.iloc[i,0:j].sum()/A.iloc[i,0:j].count()
If someone has a faster/prettier solution, let me know anyway; I don't really like this one.
What you want is the expanding mean:
df.loc[:, 'D1':].expanding(5, axis=1).mean()
I'm not sure whether Index is a column or the index of your dataframe. If it's the index, you can remove the .loc[...] call.
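Note that, depending on your pandas version, axis=1 in expanding may be deprecated; transposing first gives the same row-wise expanding mean. A self-contained sketch, assuming Index is the index of the frame:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'D1': [np.nan, 1, 2, np.nan], 'D2': [np.nan, 1, 1, np.nan],
     'D3': [2, 2, np.nan, np.nan], 'D4': [3, 3, 3, 3], 'D5': [4, 4, 4, 4],
     'D6': [5, 5, 5, 5], 'D7': [6, 6, 6, 6]},
    index=pd.Index([1, 2, 3, 4], name='Index'))

# min_periods=5 counts non-NaN values only, so a row produces output starting
# from the position where it has seen at least 5 valid entries.
print(df.T.expanding(5).mean().T)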

Merge columns with \n

Example:
C1 C2 C3 C4 C5 C6
0 A B nan C A nan
1 B C D nan B nan
2 D E F nan C nan
3 nan nan A nan nan B
I'm merging columns, but I want to put '\n\n' between the values in the merging process.
The output I want is:
C
0 A
B
C
A
1 B
C
D
B
2 D
E
F
C
3 A
B
I want the 'nan' values to be dropped.
I tried
df['merge'] = df['C1'].map(str) + '\n\n' + df['C2'].map(str) + '\n\n' + df['C3'].map(str) + '\n\n' + df['C4'].map(str)
However, this includes all the nan values.
Thank you for reading.
Use DataFrame.stack to get a Series; missing values are removed, so you can aggregate with join:
df['merge'] = df.stack().groupby(level=0).agg('\n\n'.join)
# to filter only the C columns
df['merge'] = df.filter(like='C').stack().groupby(level=0).agg('\n\n'.join)
Or remove missing values per row with Series.dropna and then join:
df['merge'] = df.apply(lambda x: '\n\n'.join(x.dropna()), axis=1)
# to filter only the C columns
df['merge'] = df.filter(like='C').apply(lambda x: '\n\n'.join(x.dropna()), axis=1)
print (df)
C1 C2 C3 C4 C5 C6 merge
0 A B NaN C A NaN A\n\nB\n\nC\n\nA
1 B C D NaN B NaN B\n\nC\n\nD\n\nB
2 D E F NaN C NaN D\n\nE\n\nF\n\nC
3 NaN NaN A NaN NaN B A\n\nB

NaN values in pivot_table index cause loss of data

Here is a simple DataFrame:
> df = pd.DataFrame({'a': ['a1', 'a2', 'a3'],
                     'b': ['optional1', None, 'optional3'],
                     'c': ['c1', 'c2', 'c3'],
                     'd': [1, 2, 3]})
> df
a b c d
0 a1 optional1 c1 1
1 a2 None c2 2
2 a3 optional3 c3 3
Pivot method 1
The data can be pivoted to this:
> df.pivot_table(index=['a','b'], columns='c')
d
c c1 c3
a b
a1 optional1 1.0 NaN
a3 optional3 NaN 3.0
Downside: data in the 2nd row is lost because df['b'][1] == None.
Pivot method 2
> df.pivot_table(index=['a'], columns='c')
d
c c1 c2 c3
a
a1 1.0 NaN NaN
a2 NaN 2.0 NaN
a3 NaN NaN 3.0
Downside: column b is lost.
How can the two methods be combined so that column b and the 2nd row are kept, like so:
d
c c1 c2 c3
a b
a1 optional1 1.0 NaN NaN
a2 None NaN 2.0 NaN
a3 optional3 NaN NaN 3.0
More generally: how can information from a row be retained during pivoting if a key has a NaN value?
Use set_index and unstack to perform the pivot:
df = df.set_index(['a', 'b', 'c']).unstack('c')
This is essentially what pandas does under the hood for pivot. The stack and unstack methods are closely related to pivot, and can generally be used to perform pivot-like operations that don't quite conform with the built-in pivot functions.
The resulting output:
d
c c1 c2 c3
a b
a1 optional1 1.0 NaN NaN
a2 NaN NaN 2.0 NaN
a3 optional3 NaN NaN 3.0
You could use fillna to replace the None entry:
df['b'] = df['b'].fillna('foo')
df.pivot_table(index=['a','b'], columns=['c'])
----
d
c c1 c2 c3
a b
a1 optional1 1.0 NaN NaN
a2 foo NaN 2.0 NaN
a3 optional3 NaN NaN 3.0
You can use a generic helper like this one:
def pivot_table(df, index, columns, values):
    df = df[index + columns + values]
    i = len(index)
    df = df.set_index(index + columns).unstack(columns).reset_index()
    # Flatten the column MultiIndex: keep the index column names, then the pivoted labels.
    df.columns = df.columns.droplevel(1)[:i].append(df.columns.droplevel(0)[i:])
    return df
pivot_table(df, index =['a', 'b'], columns= ['c'], values= ['d'])
You can use fillna to replace the None values with the string "NULL".
Say...
df.fillna("NULL").pivot_table(index=['a', 'b'], columns='c')
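Putting that together with b included in the index (a self-contained sketch of the fillna-then-pivot approach):
import pandas as pd

df = pd.DataFrame({'a': ['a1', 'a2', 'a3'],
                   'b': ['optional1', None, 'optional3'],
                   'c': ['c1', 'c2', 'c3'],
                   'd': [1, 2, 3]})

# Filling the None first means the a2 row keeps a placeholder key in the index,
# so no data is dropped by the pivot.
print(df.fillna("NULL").pivot_table(index=['a', 'b'], columns='c'))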
