Given the following data frame:
import pandas as pd
import numpy as np

d = pd.DataFrame({'a': [1, 2, 3], 'b': [np.nan, 5, 6]})
d
   a    b
0  1  NaN
1  2  5.0
2  3  6.0
I would like to replace all non-null values with the column name.
Desired result:
   a    b
0  a  NaN
1  a    b
2  a    b
In reality, I have many columns.
Thanks in advance!
Update to answer from root:
To perform this on a subset of columns:
d.loc[:, d.columns[3:]] = np.where(d.loc[:, d.columns[3:]].notnull(),
                                   d.columns[3:],
                                   d.loc[:, d.columns[3:]])
Using numpy.where and notnull:
d[:] = np.where(d.notnull(), d.columns, d)
The resulting output:
   a    b
0  a  NaN
1  a    b
2  a    b
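This works because NumPy broadcasts d.columns (one label per column) across all the rows, so every non-null cell picks up the name of its own column. A quick way to inspect the broadcast grid (a minimal sketch, reusing d and np from above):

np.broadcast_to(d.columns, d.shape)
# array([['a', 'b'],
#        ['a', 'b'],
#        ['a', 'b']], dtype=object)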
Edit
To select specific columns:
cols = d.columns[3:] # or whatever Index/list-like of column names
d[cols] = np.where(d[cols].notnull(), cols, d[cols])
I can think of one possibility using apply/transform:
In [1610]: d.transform(lambda x: np.where(x.isnull(), x, x.name))
Out[1610]:
   a    b
0  a  nan
1  a    b
2  a    b
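Note that np.where mixes the float column with the string column name, so NumPy casts the whole result to strings and the missing value comes out as the literal 'nan' above. If keeping a real NaN matters, a variant using Series.where (a sketch; it preserves missing values because it never leaves pandas):

d.transform(lambda x: x.where(x.isnull(), x.name))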
You could also use df.where:
In [1627]: d.where(d.isnull(), d.columns.values.repeat(len(d)).reshape(d.shape, order='F'))
Out[1627]:
   a    b
0  a  NaN
1  a    b
2  a    b
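The repeat/reshape construction is easy to get wrong; a simpler way to build the same grid of column names is plain NumPy broadcasting (a sketch):

d.where(d.isnull(), np.broadcast_to(d.columns, d.shape))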
Related
I have a dataframe and would like to assign multiple values from one row to multiple other rows. I can get it to work with .iloc, but for some reason, when I use conditions with .loc it only returns NaN.
df = pd.DataFrame(dict(A=[1, 2, 0, 0], B=[0, 0, 0, 10], C=[3, 4, 5, 6]))
df.index = ['a', 'b', 'c', 'd']
When I use loc with conditions or with direct index names:
df.loc[df['A']>0, ['B','C']] = df.loc['d',['B','C']]
df.loc[['a','b'], ['B','C']] = df.loc['d',['B','C']]
it will return
     A     B    C
a  1.0   NaN  NaN
b  2.0   NaN  NaN
c  0.0   0.0  5.0
d  0.0  10.0  6.0
but when I use .iloc it actually works as expected
df.iloc[0:2, 1:3] = df.iloc[3, 1:3]

   A   B  C
a  1  10  6
b  2  10  6
c  0   0  5
d  0  10  6
Is there a way to do this with .loc, or do I need to rewrite my code to get the row numbers from my mask?
When you use labels, pandas performs index alignment, and in your case there are no common indices, hence the NaNs; location-based indexing does not align.
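A minimal standalone illustration of label alignment: two Series combine only where their labels match, and everything else becomes NaN.

s = pd.Series([1, 2], index=['x', 'y'])
t = pd.Series([10, 20], index=['y', 'z'])
s + t
# x     NaN
# y    12.0
# z     NaN
# dtype: float64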
You can assign a numpy array to prevent index alignment:
df.loc[['a','b'], ['B','C']] = df.loc['d',['B','C']].values
output:

   A   B  C
a  1  10  6
b  2  10  6
c  0   0  5
d  0  10  6
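In recent pandas, .to_numpy() is the recommended spelling of .values and works the same way here:

df.loc[['a', 'b'], ['B', 'C']] = df.loc['d', ['B', 'C']].to_numpy()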
I have a dataframe that has n columns. These contain letters; the number of letters a column contains varies, and a letter can appear in multiple columns. I need pandas code to convert the sheet into columns headed by the letters, where the rows contain the numbers of the columns that each letter was in.
(The original post illustrated this with an image: input columns of letters on the left, an arrow, and per-letter columns of column numbers on the right.) Thank you in advance for any help.
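Since the example itself was an image, here is one possible input consistent with the outputs shown below (this exact frame is an assumption, constructed to match the printed results):

import pandas as pd

# hypothetical input: four numbered columns holding letters
df = pd.DataFrame({1: ['a', 'b', 'c', 'd'],
                   2: ['c', 'f', None, None],
                   3: ['a', 'b', 'e', None],
                   4: ['b', 'd', 'e', None]})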
Use DataFrame.stack with DataFrame.reset_index to reshape, then DataFrame.sort_values, aggregate lists per letter, and finally create the DataFrame with the constructor and a transpose:
s = (df.stack()
       .reset_index(name='a')
       .sort_values('level_1')
       .groupby('a')['level_1']
       .agg(list))
df1 = pd.DataFrame(s.tolist(), index=s.index).T
print (df1)
a     a  b     c     d     e     f
0     1  1     1     1     3     2
1     3  3     2     4     4  None
2  None  4  None  None  None  None
Or use GroupBy.cumcount for a counter and reshape with DataFrame.pivot:

df2 = df.stack().reset_index(name='a').sort_values('level_1')
df2['g'] = df2.groupby('a').cumcount()
df2 = df2.pivot(index='g', columns='a', values='level_1')
print (df2)
a     a  b    c    d    e    f
g
0     1  1    1    1    3    2
1     3  3    2    4    4  NaN
2   NaN  4  NaN  NaN  NaN  NaN
Finally, if necessary, remove the index and column names:
df1 = df1.rename_axis(index=None)
df2 = df2.rename_axis(index=None, columns=None)
Given the following df
   id val1 val2 val3
0   1    A    A    B
1   1    A    B    B
2   1    B    C  NaN
3   1  NaN    B    D
4   2    A    D  NaN
I would like to sum the value counts within each id group across all columns; however, values that appear more than once in the same row should be counted only once, so the expected output is:
id
1  B    4
   A    2
   C    1
   D    1
2  A    1
   D    1
I can accomplish this with
import pandas as pd
df.set_index('id').apply(lambda x: list(set(x)), axis=1).apply(pd.Series).stack().groupby(level=0).value_counts()
but the apply(...axis=1) (and perhaps apply(pd.Series)) really kills the performance on large DataFrames. Since I have a small number of columns, I guess I could just check for all pairwise duplicates, replace one with np.nan, and then use df.set_index('id').stack().groupby(level=0).value_counts(), but that doesn't seem like the right approach when the number of columns gets large.
Any ideas on a faster way around this?
Here are the missing steps that remove row duplicates from your dataframe:
# long format: the index holds the column label, 'level_0' the original row;
# drop_duplicates removes repeated (row, value) pairs
nodups = df.stack().reset_index(level=0).drop_duplicates()
# back to wide format, one row per original row
nodups = nodups.set_index(['level_0', nodups.index]).unstack()
nodups.columns = nodups.columns.levels[1]
#          id val1 val2 val3
# level_0
# 0         1    A None    B
# 1         1    A    B None
# 2         1    B    C None
# 3         1 None    B    D
# 4         2    A    D None
Now you can follow with:
nodups.set_index('id').stack().groupby(level=0).value_counts()
Perhaps you can further optimize the code.
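One possible tightening (a sketch with the same semantics, staying in long form instead of unstacking back to wide):

s = df.set_index('id', append=True).stack().reset_index(name='val')
s = s.drop_duplicates(subset=['level_0', 'val'])   # one hit per original row
out = s.groupby('id')['val'].value_counts()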
I am using get_dummies
s = (df.set_index('id', append=True)
       .stack()
       .str.get_dummies()
       .groupby(level=[0, 1]).sum()  # letter counts per original row
       .gt(0)                        # present in the row at least once?
       .groupby(level=1).sum()       # rows per id containing each letter
       .stack()
       .astype(int))
s[s.gt(0)]
Out[234]:
id
1  A    2
   B    4
   C    1
   D    1
2  A    1
   D    1
dtype: int32
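To unpack what the chain does step by step (a sketch; the intermediate names are only for illustration):

stacked = df.set_index('id', append=True).stack()    # (row, id, column) -> letter
dummies = stacked.str.get_dummies()                  # one indicator column per letter
per_row = dummies.groupby(level=[0, 1]).sum().gt(0)  # is the letter in that row?
counts = per_row.groupby(level=1).sum()              # rows per id containing it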
I have a Pandas dataset that I want to clean up prior to applying my ML algorithm. I am wondering whether it is possible to remove a row if an element of its columns does not match a set of values. For example, if I have the dataframe:
   a  b
0  1  6
1  4  7
2  2  4
3  3  7
...
And I want the values of a to be one of [1, 3] and the values of b to be one of [6, 7], such that my final dataset is:
   a  b
0  1  6
1  3  7
...
Currently, my implementation is not working because some of my data rows have erroneous strings attached to the value. For example, instead of a value of 1 I'll have something like 1abc. Hence I would like to remove anything that is not a clean integer value. My workaround is also a bit archaic, as I am removing entries in column a one value at a time via:
dataset = dataset[(dataset.commute != 1)]
dataset = dataset[(dataset.commute != 3)]
You can use boolean indexing, with isin for each column combined by &:
df1 = df[(df['a'].isin([1,3])) & (df['b'].isin([6,7]))]
print (df1)
   a  b
0  1  6
3  3  7
Or use numpy.isin (the modern replacement for numpy.in1d):

df1 = df[np.isin(df['a'], [1, 3]) & np.isin(df['b'], [6, 7])]
print (df1)
   a  b
0  1  6
3  3  7
But if you need to remove all rows with non-numeric values, use to_numeric with errors='coerce', which returns NaN for unparseable entries, and then filter with notnull:
df = pd.DataFrame({'a': ['1abc', '2', '3'],
                   'b': ['4', '5', 'dsws7']})
print (df)

      a      b
0  1abc      4
1     2      5
2     3  dsws7
mask = (pd.to_numeric(df['a'], errors='coerce').notnull() &
        pd.to_numeric(df['b'], errors='coerce').notnull())
df1 = df[mask].astype(int)
print (df1)
   a  b
1  2  5
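With many columns, the same idea can be applied to the whole frame at once (a sketch; this assumes every column is supposed to be numeric):

num = df.apply(pd.to_numeric, errors='coerce')
df1 = df[num.notnull().all(axis=1)].astype(int)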
If you need to check whether some value is NaN or None:
df = pd.DataFrame({'a': ['1abc', None, '3'],
                   'b': ['4', '5', np.nan]})
print (df)

      a    b
0  1abc    4
1  None    5
2     3  NaN

print (df[df.isnull().any(axis=1)])

      a    b
1  None    5
2     3  NaN
You can use pandas isin()
df = df[df.a.isin([1, 3]) & df.b.isin([6, 7])]

   a  b
0  1  6
3  3  7
Suppose I have a dataframe as follows,
import pandas as pd
columns = ['A', 'B', 'C', 'D', 'E', 'F']
index = ['1', '2', '3', '4', '5', '6']
df = pd.DataFrame(columns=columns, index=index)
# use .loc to avoid chained assignment (df['D']['1'] = 1 may set a copy)
df.loc['1', 'D'] = 1
df['E'] = 1
df.loc['1', 'F'] = 1
df.loc['2', 'A'] = 1
df.loc['3', 'B'] = 1
df.loc['4', 'C'] = 1
df.loc['5', 'A'] = 1
df.loc['5', 'B'] = 1
df.loc['5', 'C'] = 1
df.loc['6', 'D'] = 1
df.loc['6', 'F'] = 1
df
     A    B    C    D  E    F
1  NaN  NaN  NaN    1  1    1
2    1  NaN  NaN  NaN  1  NaN
3  NaN    1  NaN  NaN  1  NaN
4  NaN  NaN    1  NaN  1  NaN
5    1    1    1  NaN  1  NaN
6  NaN  NaN  NaN    1  1    1
My condition: I want to remove the columns that have values only when A, B, C (together) don't have a value; that is, I want to find which columns are mutually exclusive with A, B, C, and keep only the columns that have values when A or B or C has a value. For this sample the answer is to remove columns D and F. But my dataframe has 400 columns, and I want a way to check A, B, C against all of the rest.
One way I can think of is to remove the NA rows from A, B, C:
df = df[np.isfinite(df['A'])]
df = df[np.isfinite(df['B'])]
df = df[np.isfinite(df['C'])]
then get the NA count of every column and compare it with the total number of rows,
df.isnull().sum()
and remove the columns whose NA counts match.
Is there a better and efficient way to do this?
Thanks
Rather than deleting rows, just select the rows where A, B, C are not all NaN at the same time:
mask = df[["A", "B", "C"]].isnull().all(axis=1)
df = df[~mask]
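To finish the job as stated, one more step drops the columns that are empty on the remaining rows (a sketch; this assumes that is the intended final output):

kept = df[~mask]
to_drop = kept.columns[kept.isnull().all()]   # Index(['D', 'F'], dtype='object') here
df = kept.drop(columns=to_drop)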