Non-reducing variant of the ANY() function that respects NaN

It's hard to explain in words, but the example should make it clear:
df = DataFrame({'x': [0, 1], 'y': [np.NaN, 0], 'z': [0, np.NaN]}, index=['a', 'b'])
   x    y    z
a  0  NaN    0
b  1    0  NaN
I want to replace all non-NaN values with a '1', if there is a '1' anywhere in that row. Just like this:
   x    y    z
a  0  NaN    0
b  1    1  NaN
This sort of works, but unfortunately overwrites the NaN:
df[df.any(1)] = 1
   x    y  z
a  0  NaN  0
b  1    1  1
I thought there might be some non-reducing form of any (like cumsum is a non-reducing form of sum), but I can't find anything like that so far...

You could combine a multiplication by zero (to give a frame of zeros that remembers the NaN locations) with an add on axis=0:
>>> df
   x    y    z
a  0  NaN    0
b  1    0  NaN
>>> (df * 0).add(df.any(1), axis=0)
   x    y    z
a  0  NaN    0
b  1    1  NaN
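Another way to get the same result, as a minimal sketch: DataFrame.mask can overwrite every non-NaN cell with its row's any() result, and it leaves the NaN cells alone because they fail the notna() condition:
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [0, 1], 'y': [np.nan, 0], 'z': [0, np.nan]},
                  index=['a', 'b'])
# Replace each non-NaN cell with the row-wise any(); NaN cells are untouched.
out = df.mask(df.notna(), df.any(axis=1).astype(int), axis=0)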

Related

How to compare two dataframes, and add the rows and columns which one of the two doesn't have

I have a small dataframe with fewer rows and columns than a bigger dataframe.
How can I add the rows and columns which are in the bigger dataframe, and populate them with zeros? Basically I want to add the cells that exist only in the bigger dataframe.
A toy example is below. I have tried pandas.concat, but I end up with all the values from the bigger dataframe.
import numpy as np
import pandas as pd
df_big = pd.DataFrame(index=["a", "b", "c", "d"])
df_big["x"] = np.arange(4)
df_big["y"] = df_big.x * 2
df_big["z"] = df_big.x * 3
df_small = pd.DataFrame(index=["a", "b"])
df_small["x"] = [8, 10]
df_small["y"] = [30, 40]
out = pd.concat([df_big, df_small], axis=0)
This looks like a good use case for DataFrame.align:
_, out = df_big.align(df_small, fill_value=0)
out
    x   y  z
a   8  30  0
b  10  40  0
c   0   0  0
d   0   0  0
You can also use DataFrame.reindex_like on df_small:
df_small.reindex_like(df_big).fillna(0, downcast='infer')
    x   y  z
a   8  30  0
b  10  40  0
c   0   0  0
d   0   0  0
Using mul with notnull:
df_small.mul(df_big.notnull(), fill_value=0).astype(int)
Out[275]:
    x   y  z
a   8  30  0
b  10  40  0
c   0   0  0
d   0   0  0
# df_small.mul(df_big.astype(bool), fill_value=0).astype(int)  # astype(bool) achieves the same result
Late answer, but you can also use pandas.DataFrame.update, i.e.:
df_big[:] = 0
df_big.update(df_small, join='left', overwrite=True)
      x     y  z
a   8.0  30.0  0
b  10.0  40.0  0
c   0.0   0.0  0
d   0.0   0.0  0
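For completeness, a minimal sketch of a single-call alternative: DataFrame.reindex accepts a fill_value, so df_small can be expanded to df_big's full shape in one step:
# Expand df_small to df_big's row and column labels; cells that exist only
# in df_big are filled with 0, and df_small's own values are kept.
out = df_small.reindex(index=df_big.index, columns=df_big.columns, fill_value=0)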

Pandas taking values in columns order

Given this df:
Name  i  j  k
A     1  0  3
B     0  5  4
C     0  0  4
D     0     5
My goal is to add a column "Final" that takes values in the order i, j, k:
Name  i  j  k  Final
A     1  0  3      1
B     0  5  4      5
C     0  0  4      4
D     0     5        <-- this one is tricky. We do count the NaN in column j here.
Here is my attempt: df['Final'] = df[['i', 'j', 'k']].bfill(axis=1).iloc[:, 0]. This doesn't work, since bfill only fills NaN and therefore it always takes the value of the first column. Any help would be appreciated. :)
Many thanks!
If by "taking values in column order", you mean "taking the first non-zero value in each row, or zero if all values are zero", you could use DataFrame.lookup after doing a boolean comparison:
In [113]: df["final"] = df.lookup(df.index, (df[["i","j","k"]] != 0).idxmax(axis=1))
In [114]: df
Out[114]:
  Name  i    j  k  final
0    A  1  0.0  3    1.0
1    B  0  5.0  4    5.0
2    C  0  0.0  4    4.0
3    D  0  NaN  5    NaN
where first we compare everything with zero:
In [115]: df[["i","j","k"]] != 0
Out[115]:
       i      j     k
0   True  False  True
1  False   True  True
2  False  False  True
3  False   True  True
and then we use idxmax to find the first True (or the first False if you have a row of zeroes):
In [116]: (df[["i","j","k"]] != 0).idxmax(axis=1)
Out[116]:
0    i
1    j
2    k
3    j
dtype: object
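Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0; a minimal sketch of the same logic with plain NumPy indexing:
import numpy as np

vals = df[["i", "j", "k"]].to_numpy()
# Column position of the first non-zero entry per row; NaN compares as
# non-zero too, matching the lookup-based output above.
first = (vals != 0).argmax(axis=1)
df["final"] = vals[np.arange(len(df)), first]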
Is this what you need?
df['Final'] = df[['i', 'j', 'k']].mask((df == '') | (df == 0)).bfill(axis=1).iloc[:, 0][(df != '').all(1)]
df
Out[1290]:
  Name  i  j  k  Final
0    A  1  0  3    1.0
1    B  0  5  4    5.0
2    C  0  0  4    4.0
3    D  0     5    NaN
Using pandas.Series.nonzero, the solution can be expressed succinctly:
df['Final'] = df.apply(lambda x: x.iloc[x.nonzero()[0][0]], axis=1)
How this works:
nonzero() returns the indices of elements that are not zero (and will match np.nan as well).
We take the first index location and return the value at that location to construct the Final column.
We apply this to the dataframe with axis=1 so it runs row by row.
A benefit of this approach is that it does not depend on naming the individual columns ['i', 'j', 'k'].
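Series.nonzero itself was removed in pandas 1.0; a minimal sketch of the same idea through NumPy, assuming a numeric-only frame (e.g. with Name as the index):
import numpy as np

# flatnonzero returns the positions of entries that are not zero (NaN counts
# as non-zero); take the first one per row. A row of all zeros would raise
# IndexError, exactly as with the original nonzero() version.
df['Final'] = df.apply(lambda x: x.iloc[np.flatnonzero(x.to_numpy())[0]], axis=1)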

Fill a few missing values in Python

I want to fill the missing values of a specific column, but only if a condition is met.
e.g.
   A  B
 NaN  0
 NaN  0
   0  0
 NaN  1
 NaN  1
 ...
In the above case I want to fill the NaN values in column A only when the corresponding value in column B is 0. The other NaN values in A should not change.
Use mask with fillna:
df['A'] = df['A'].mask(df['B'] == 0, df['A'].fillna(3))
Alternatives with loc, numpy.where:
df.loc[df['B'] == 0, 'A'] = df['A'].fillna(3)
df['A'] = np.where(df['B'] == 0, df['A'].fillna(3), df['A'])
print(df)
     A  B
0  3.0  0
1  3.0  0
2  0.0  0
3  NaN  1
4  NaN  1
np.where is a quick and simple solution.
In [47]: df['A'] = np.where(np.isnan(df['A']) & (df['B'] == 0), 3, df['A'])
In [48]: df
Out[48]:
     A  B
0  3.0  0
1  3.0  0
2  0.0  0
3  NaN  1
4  NaN  1
You could use a loop over all the elements, something like this:
for i in range(len(A)):
    if numpy.isnan(A[i]) and B[i] == 0:
        A[i] = value
There are nicer ways to implement these loops, but I don't know what structures you are using.
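If plain in-place assignment is enough, a minimal sketch of the boolean-mask form with .loc (using 3 as the fill value, like the answers above):
# Fill A only where A is NaN and the corresponding B is 0.
df.loc[df['A'].isna() & df['B'].eq(0), 'A'] = 3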

How to replace a value in a pandas dataframe with column name based on a condition?

I have a dataframe with mostly 0's and 1's in columns A through D, plus a numeric column E (you can recreate it with the code below).
I want to replace all 1's in the range A:D with the name of the column, so that a row of 1's in A:D becomes A B C D.
How can I do that?
You can recreate my dataframe with this:
dfz = pd.DataFrame({'A': [1, 0, 0, 1, 0, 0],
                    'B': [1, 0, 0, 1, 0, 1],
                    'C': [1, 0, 0, 1, 3, 1],
                    'D': [1, 0, 0, 1, 0, 0],
                    'E': [22.0, 15.0, None, 10., None, 557.0]})
One way could be to use replace and pass in a Series mapping column labels to values (those same labels in this case):
>>> dfz.loc[:, 'A':'D'].replace(1, pd.Series(dfz.columns, dfz.columns))
   A  B  C  D
0  A  B  C  D
1  0  0  0  0
2  0  0  0  0
3  A  B  C  D
4  0  0  3  0
5  0  B  C  0
To make the change permanent, you'd assign the returned DataFrame back to dfz.loc[:, 'A':'D'].
Solutions aside, it's useful to keep in mind that you may lose a lot of performance benefits when you mix numeric and string types in columns, as pandas is forced to use the generic 'object' dtype to hold the values.
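For example, a quick check on the frame above shows the replaced block ends up as object dtype:
>>> dfz.loc[:, 'A':'D'].replace(1, pd.Series(dfz.columns, dfz.columns)).dtypes
A    object
B    object
C    object
D    object
dtype: object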
A solution using where:
>>> dfz.where(dfz != 1, dfz.columns.to_series(), axis=1)
   A  B  C  D      E
0  A  B  C  D   22.0
1  0  0  0  0   15.0
2  0  0  0  0    NaN
3  A  B  C  D   10.0
4  0  0  3  0    NaN
5  0  B  C  0  557.0
Maybe it's not so elegant, but... just loop through the columns and replace:
for i in dfz[['A', 'B', 'C', 'D']].columns:
    dfz[i].replace(1, i, inplace=True)
I do prefer the very elegant solution from @ajcr.
In case you have column names that you can't easily use for slicing, here is my solution:
cols = dfz.filter(regex=r'(A|B|C|D)').columns
dfz.loc[:, cols] = dfz[dfz != 1].loc[:, cols].apply(lambda x: x.fillna(x.name))
Output:
In [207]: dfz
Out[207]:
   A  B  C  D      E
0  A  B  C  D   22.0
1  0  0  0  0   15.0
2  0  0  0  0    NaN
3  A  B  C  D   10.0
4  0  0  3  0    NaN
5  0  B  C  0  557.0
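A minimal sketch of one more vectorized variant using np.where, which broadcasts the column names across the block (the astype(object) stops NumPy from promoting the remaining integers to strings):
import numpy as np

cols = ['A', 'B', 'C', 'D']
# Where a cell equals 1, take its column name; otherwise keep the value.
dfz[cols] = np.where(dfz[cols] == 1, cols, dfz[cols].astype(object))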

Pandas align multiindex dataframe with other with regular index

I have one dataframe, let's call it df1, with a MultiIndex (just a snippet; there are many more columns and rows):
                                             M1_01  M1_02  M1_03  M1_04  M1_05
Eventloc                  Exonloc
chr10:52619746-52623793|- 52622648-52622741      0      0      0      0      0
chr19:58859211-58865080|+ 58864686-58864827      0      0      0      0      0
                          58864686-58864840      0      0      0      0      0
                          58864744-58864840      0      0      0      0      0
chr19:58863054-58863649|- 58863463-58863550      0      0      0      0      0
And another dataframe, let's go with the creative name df2, like this (these are the results of different algorithms, which is why they have different indices). The columns are the same, though in the first df they are not sorted.
M1_01 M1_02 M1_03 M1_04 M1_05
chr3:53274267:53274364:-#chr3:53271813:53271836:-#chr3:53268999:53269190:- 0.02 NaN NaN NaN NaN
chr2:9002720:9002852:-#chr2:9002401:9002452:-#chr2:9000743:9000894:- 0.04 NaN NaN NaN NaN
chr1:160192441:160192571:-#chr1:160190249:160190481:-#chr1:160188639:160188758:- NaN NaN NaN NaN NaN
chr7:100473194:100473333:+#chr7:100478317:100478390:+#chr7:100478906:100479034:+ NaN NaN NaN NaN NaN
chr11:57182088:57182204:-#chr11:57177408:57177594:-#chr11:57176648:57176771:- NaN NaN NaN NaN NaN
And I have this dataframe, again let's be creative and call it df3, which unifies the indices of df1 and df2:
Eventloc Exonloc
event_id
chr3:53274267:53274364:-#chr3:53271813:53271836:-#chr3:53268999:53269190:- chr3:53269191-53274267|- 53271812-53271836
chr2:9002720:9002852:-#chr2:9002401:9002452:-#chr2:9000743:9000894:- chr2:9000895-9002720|- 9002400-9002452
chr1:160192441:160192571:-#chr1:160190249:160190481:-#chr1:160188639:160188758:- chr1:160188759-160192441|- 160190248-160190481
chr7:100473194:100473333:+#chr7:100478317:100478390:+#chr7:100478906:100479034:+ chr7:100473334-100478906|+ 100478316-100478390
chr4:55124924:55124984:+#chr4:55127262:55127579:+#chr4:55129834:55130094:+ chr4:55124985-55129834|+ 55127261-55127579
I need to do a 1:1 comparison of these results, so I tried doing both
df1.ix[df3.head().values]
and
df1.ix[pd.MultiIndex.from_tuples(df3.head().values.tolist(), names=['Eventloc', 'Exonloc'])]
But they both give me dataframes of NAs. The only thing that works is:
event_id = df2.index[0]
df1.ix[df3.ix[event_id]]
But this is obviously suboptimal, as it is not vectorized and very slow. I think I'm missing some critical concept of MultiIndexes.
Thanks,
Olga
If I understand what you are doing, you need to either explicitly construct the tuples (they must be fully qualified tuples, i.e. have a value for EACH level), or, easier, construct a boolean indexer.
In [9]: df1 = DataFrame(0, index=MultiIndex.from_product([list('abc'), list(range(2))]), columns=['A'])
In [10]: df1
Out[10]:
     A
a 0  0
  1  0
b 0  0
  1  0
c 0  0
  1  0
[6 rows x 1 columns]
In [11]: df3 = DataFrame(0, index=['a', 'b'], columns=['A'])
In [12]: df3
Out[12]:
   A
a  0
b  0
[2 rows x 1 columns]
These are all the values of level 0 in the first frame
In [13]: df1.index.get_level_values(level=0)
Out[13]: Index([u'a', u'a', u'b', u'b', u'c', u'c'], dtype='object')
Construct a boolean indexer of the result
In [14]: df1.index.get_level_values(level=0).isin(df3.index)
Out[14]: array([ True, True, True, True, False, False], dtype=bool)
In [15]: df1.loc[df1.index.get_level_values(level=0).isin(df3.index)]
Out[15]:
     A
a 0  0
  1  0
b 0  0
  1  0
[4 rows x 1 columns]
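And as a minimal sketch of the first option, selecting with a list of fully qualified tuples (one value for each level) works as well:
In [16]: df1.loc[[('a', 0), ('b', 1)]]
Out[16]:
     A
a 0  0
b 1  0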
