One common thing people seem to want to do in pandas is to fill NaN values with the nearest preceding or following non-NaN value. This is easily done with .fillna (or .ffill/.bfill). I however want to do something similar but different.
I have a dataframe, df, with some entries. Every row has a different number of entries and they are all "left-adjusted" (if the df is 10 columns wide and some row has n<10 entries the first n columns hold the entries and the remaining columns are Nones).
What I want to do is find the last non-None entry in every row and change it to also be a None. This could be any of the columns from the first to the last.
I could of course do this with a for-loop but my dfs can be quite large so something quicker would be preferable. Any ideas?
Thanks!
With help from numpy this is quite easy. By counting the number of NaN values in each row, you can find for each row the column holding the last non-NaN value, and then use NumPy to set that value to NaN:
import numpy as np
import pandas as pd

data = np.random.random((4, 6))
df = pd.DataFrame(data)
df.iloc[0, 3:] = None
df.iloc[1, 5:] = None
df.iloc[2, 3:] = None
df.iloc[3, 4:] = None
Original dataframe looks like this:
0 1 2 3 4 5
0 0.992337 0.651785 0.521422 NaN NaN NaN
1 0.912962 0.292458 0.620195 0.507071 0.010205 NaN
2 0.061320 0.565979 0.344755 NaN NaN NaN
3 0.521936 0.057917 0.359699 0.484009 NaN NaN
isnull = df.isnull()
# column index of the last non-NaN value in each row
col = data.shape[1] - isnull.sum(axis=1) - 1
df.values[range(len(df)), col] = None
Updated dataframe looks like this:
0 1 2 3 4 5
0 0.992337 0.651785 NaN NaN NaN NaN
1 0.912962 0.292458 0.620195 0.507071 NaN NaN
2 0.061320 0.565979 NaN NaN NaN NaN
3 0.521936 0.057917 0.359699 NaN NaN NaN
You can find the index of the element to replace in each row with np.argmax():
# column of the first NaN in each row; the last non-NaN value sits just before it
indices = np.isnan(df.to_numpy()).argmax(axis=1) - 1
df.to_numpy()[range(len(df)), indices] = None
I have a DataFrame like this:
    a1   a2   a3  Last_Not_NaN_Value
0    1  NaN  NaN                   1
1    0    0  NaN                   0
2  NaN    5  NaN                   5
I've managed so far to get the last non-NaN value in each row this way:
data.ffill(axis=1).iloc[:, -1]
But, I also need to replace that value with NaN (drop it from the DataFrame)
Create a boolean mask identifying the non-NaN values, compute its cumsum along axis=1, and then mask the values in the original dataframe where the cumsum reaches its row maximum:
m = df.notna()                   # True where a value is present
s = m.cumsum(1)                  # running count of present values per row
df.mask(s.eq(s.max(1), axis=0))  # hide the last present value (and any trailing NaNs)
a1 a2 a3
0 NaN NaN NaN
1 0.0 NaN NaN
2 NaN NaN NaN
PS: There is no need to create an intermediate column Last_Not_NaN_Value
One way is to use last_valid_index on each row:
import numpy as np

df = df[['a1', 'a2', 'a3']]  # just in case
for i, r in df.iterrows():
    df.loc[i, r.last_valid_index()] = np.nan
import pandas as pd

seq = (
    df  # set index and column values by their ordinal numbers
    .set_axis(range(df.shape[0]), axis=0)
    .set_axis(range(df.shape[1]), axis=1)
    .agg(pd.DataFrame.last_valid_index, 1)
)
df.values[seq.index, seq] = pd.NA
Here
df is the given data frame;
seq associates each row with the ordinal number of its last valid column;
df.values is a numpy.ndarray and is a view into the values of df;
values[seq.index, seq] is integer array indexing, which selects arbitrary items of df (and, since it is a view onto the original data, we can assign to it to change those values), as illustrated below.
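As a minimal illustration of that integer array indexing (the array and indices here are made up for demonstration):
import numpy as np

a = np.arange(12).reshape(3, 4)
rows = np.array([0, 1, 2])  # one entry per row
cols = np.array([3, 1, 0])  # which column to touch in each row
a[rows, cols] = -1          # assigns in place through the array
print(a)
# [[ 0  1  2 -1]
#  [ 4 -1  6  7]
#  [-1  9 10 11]]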
There's something fundamental about manipulating pandas dataframes which I am not getting.
TL;DR: passing a boolean Series to the indexing operator [] of a pandas dataframe returns the rows or columns of that df where the Series is True. But passing a boolean dataframe (i.e. multidimensional) returns a weird dataframe consisting only of NaN values.
Edit: to rephrase: why is it possible to pass a dataframe of boolean values to another dataframe's indexing operator, and what does it do? With a Series, this makes sense, but with a dataframe, I don't understand what's happening 'under the hood', and why in my example I get a dataframe of NaN values.
In detail with examples:
When I pass a pandas boolean Series to the indexing operator, it returns the rows at the indices where the Series is True:
import pandas as pd

test_list = [[1, 2, 3, 4], [3, 4, 5], [4, 5]]
test_df = pd.DataFrame(test_list)
test_df
0 1 2 3
0 1 2 3.0 4.0
1 3 4 5.0 NaN
2 4 5 NaN NaN
test_df[test_df[2].isnull()]
0 1 2 3
2 4 5 NaN NaN
So far, so good. But what happens when I do this:
test_df[test_df.isnull()]
0 1 2 3
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
Why does this return a dataframe consisting of only NaN values? I would expect it to either return an error, or perhaps to return a new dataframe truncated using the boolean mask dataframe. But I find this output completely confusing.
Edit: As an outcome I would expect to get an error. I don't understand why it's possible to pass a dataframe under these circumstances, or why it returns this dataframe of NaN values
test_df[..] calls an indexing method __getitem__(). From the source code:
def __getitem__(self, key):
    ...
    # Do we have a (boolean) DataFrame?
    if isinstance(key, DataFrame):
        return self.where(key)

    # Do we have a (boolean) 1d indexer?
    if com.is_bool_indexer(key):
        return self._getitem_bool_array(key)
As you can see, if the key is a boolean DataFrame, it will call pandas.DataFrame.where(). The function of where() is to replace values where the condition is False with NaN by default.
# print(test_df.isnull())
0 1 2 3
0 False False False False
1 False False False True
2 False False True True
# print(test_df)
0 1 2 3
0 1 2 3.0 4.0
1 3 4 5.0 NaN
2 4 5 NaN NaN
test_df.where(test_df.isnull()) replaces not null values with NaN.
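You can confirm the equivalence with a quick sanity check (using the same test_df as above):
test_df[test_df.isnull()].equals(test_df.where(test_df.isnull()))
# True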
I believe all values are transformed to NaN because you passed the entire df. The 'error message', in effect, is that every returned value is NaN (including those that were not NaN before), which lets you see that something went wrong. But surely a more experienced user will be able to answer you in more detail. Also note that most of the time you want to remove or transform these NaN values, not just flag them.
Following my comment above and LoukasPap's answer, here is a way to flag, count, and then remove or transform these NaN values:
First flag NaN values:
test_df.isnull()
You might also be interested to count your NaN values:
test_df.isnull().sum() # sum NaN by column
test_df.isnull().sum().sum() # get grand total of NaN
You can now drop NaN values by row:
test_df.dropna()
Or by column:
test_df.dropna(axis=1)
Or replace NaN values with the column medians:
test_df.fillna(test_df.median())
For my program I need a comparison check between an already existing DataFrame and a new DataFrame that comes in as input. The comparison should compare each cell of the two DataFrames.
The case I need to find is: both the old and the new DataFrame hold a value at a given position, but the values differ; only then should the new value go into a third, reference DataFrame. The existing DataFrame looks e.g. like this:
A B
0 1 nan
1 2 nan
2 nan 3
3 nan 4
Input DataFrame
A B
0 nan 2
1 3 5
2 4 4
3 nan nan
Reference DataFrame
A B
0 3 4
I figured the best way is to compare each column with np.where
Since the existing and the input DataFrame can differ in shape and columns, the challenge is that this method can only compare identically-labelled Series.
Therefore I excluded the columns that are not shared and sorted the remaining ones into the same order, so all column names are now the same and in the same order (see the sketch after this paragraph).
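A sketch of how that column alignment could look (using the df and df1 names from the code below; just one possible way):
# keep only the columns both frames share, in a common order
common_cols = sorted(set(df.columns) & set(df1.columns))
df = df[common_cols]
df1 = df1[common_cols]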
Also I used this loop to align the number of records in both DataFrames:
dfshape = df.shape[0]
df1shape = df1.shape[0]
i = 0
if dfshape < df1shape:
    while i < (df1shape - dfshape):
        df = df.append(pd.Series(0, index=df.columns), ignore_index=True)
        i += 1
else:
    while i < (dfshape - df1shape):
        df1 = df1.append(pd.Series(0, index=df1.columns), ignore_index=True)
        i += 1
With both DataFrames brought into the same shape, I tried the following operation:
for column in df1:
    for idx in df1.index:
        if df.loc[idx, column] is not None:
            dfRefercence = np.where(df1[column] != df[column], df1.loc[idx, column])
But this raises ValueError: Can only compare identically-labeled Series objects.
At this point I've run out of ideas to tackle this problem and also could not identify the cause of the thrown exception.
Maybe there is another way to achieve the desired result?
pandas.DataFrame.compare seems not to do the trick for me with this problem:
df2 = df.compare(df1)
      A           B
   self other  self other
0     1   NaN   NaN     2
1     2     3   NaN     5
2   NaN     4     3     4
3   NaN   NaN     4   NaN
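For what it's worth, a minimal sketch of the masking idea described above, assuming both frames have already been aligned to the same columns and length as in the question:
import numpy as np
import pandas as pd

# existing and input frames from the question, already aligned
df = pd.DataFrame({'A': [1, 2, np.nan, np.nan], 'B': [np.nan, np.nan, 3, 4]})
df1 = pd.DataFrame({'A': [np.nan, 3, 4, np.nan], 'B': [2, 5, 4, np.nan]})

# positions where both frames hold a value and those values differ
both_present = df.notna() & df1.notna()
changed = both_present & df.ne(df1)

# keep the new value at those positions, then pack the surviving values upwards
reference = df1.where(changed).apply(lambda s: s.dropna().reset_index(drop=True))
print(reference)
#      A    B
# 0  3.0  4.0
The "pack upwards" step only reproduces the single-row reference frame shown above; depending on the real data it may be preferable to keep the differing values at their original row positions instead.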
I'm trying to match values in a matrix on python using pandas dataframes. Maybe this is not the best way to express it.
Imagine you have the following dataset:
import numpy as np
import pandas as pd

d = {'stores': [np.nan] * 5,
     'col1': ['x', 'price', np.nan, np.nan, 1],
     'col2': ['y', 'quantity', np.nan, 1, np.nan],
     'col3': ['z', np.nan, 1, np.nan, np.nan]}
df = pd.DataFrame(data=d)
stores col1 col2 col3
0 NaN x y z
1 NaN price quantity NaN
2 NaN NaN NaN 1
3 NaN NaN 1 NaN
4 NaN 1 NaN NaN
I'm trying to get the following:
stores col1 col2 col3
0 NaN x y z
1 NaN price quantity NaN
2 z NaN NaN 1
3 y NaN 1 NaN
4 x 1 NaN NaN
Any ideas how this might work? I've tried running loops on lists but I'm not quite sure how to do it.
This is what I have so far but it's just terrible (and obviously not working) and I am sure there is a much simpler way of doing this but I just can't get my head around it.
stores = ['x', 'y', 'z']
for i in stores:
    for v in df.iloc[0, :]:
        if i == v:
            df['stores'] = i
It yields the following:
stores col1 col2 col3
0 z x y z
1 z price quantity NaN
2 z NaN NaN 1
3 z NaN 1 NaN
4 z 1 NaN NaN
Thank you in advance.
You can complete this task with a loop as follows. It loops through each column except the first (the one you want to write the data into), takes the index positions where the value is 1, and writes the value from the first row of that column into 'stores'.
Be careful if a row has a 1 in more than one column; in that case 'stores' ends up holding the header value of the last such column.
for col in df.columns[1:]:
    # rows in which this column holds a 1
    index_values = df[col][df[col] == 1].index.tolist()
    # write the column's header value (row 0) into 'stores' for those rows
    df.loc[index_values, 'stores'] = df[col][0]
You can fill the whole column at once, like this:
df["stores"] = df[["col1", "col2", "col3"]].rename(columns=df.loc[0]).eq(1).idxmax(axis=1)
This first creates a version of the dataframe with the columns renamed "x", "y", and "z" after the values in the first row; then idxmax(axis=1) returns the column heading associated with the max value in each row (which is the True one).
However this adds an "x" in rows where none of the columns has a 1. If that is a problem you could do something like this:
df["NA"] = 1 # add a column of ones
df["stores"] = df[["col1", "col2", "col3", "NA"]].rename(columns=df.loc[0]).eq(1).idxmax(axis=1)
df["stores"].replace(1, np.NaN, inplace=True) # replace the 1s with NaNs
When using the drop_duplicates() method I reduce duplicates but also merge all NaNs into one entry. How can I drop duplicates while preserving rows with an empty entry (like np.nan, None or '')?
import numpy as np
import pandas as pd

df = pd.DataFrame({'col': ['one', 'two', np.nan, np.nan, np.nan, 'two', 'two']})
Out[]:
col
0 one
1 two
2 NaN
3 NaN
4 NaN
5 two
6 two
df.drop_duplicates(['col'])
Out[]:
col
0 one
1 two
2 NaN
Try
df[(~df.duplicated()) | (df['col'].isnull())]
The result is:
col
0 one
1 two
2 NaN
3 NaN
4 NaN
Well, one workaround that is not really beautiful is to first save the NaN and put them back in:
temp = df[df.isnull().any(axis=1)]
asd = df.drop_duplicates('col')
pd.merge(temp, asd, how='outer')
Out[81]:
col
0 one
1 two
2 NaN
3 NaN
4 NaN
Use:
pd.concat([df.dropna(subset=['col']).drop_duplicates('col'), df[df['col'].isna()]])