When using the drop_duplicates() method I reduce duplicates but also merge all NaNs into one entry. How can I drop duplicates while preserving rows with an empty entry (like np.nan, None or '')?
import pandas as pd
df = pd.DataFrame({'col':['one','two',np.nan,np.nan,np.nan,'two','two']})
Out[]:
col
0 one
1 two
2 NaN
3 NaN
4 NaN
5 two
6 two
df.drop_duplicates(['col'])
Out[]:
col
0 one
1 two
2 NaN
Try
df[(~df.duplicated()) | (df['col'].isnull())]
The result is :
col
0 one
1 two
2 NaN
3 NaN
4 NaN
Well, one workaround that is not really beautiful is to first save the NaN and put them back in:
temp = df.iloc[pd.isnull(df).any(1).nonzero()[0]]
asd = df.drop_duplicates('col')
pd.merge(temp, asd, how='outer')
Out[81]:
col
0 one
1 two
2 NaN
3 NaN
4 NaN
use:
df.drop_duplicates('col').append(df[df['col'].isna()])
Related
One common thing people seem to want to do in pandas is to replace None-values with the next or previous None-value. This is easily done with .fillna. I however want to do something similar but different.
I have a dataframe, df, with some entries. Every row has a different number of entries and they are all "left-adjusted" (if the df is 10 columns wide and some row has n<10 entries the first n columns hold the entries and the remaining columns are Nones).
What I want to do is find the last non-None entry in every row and change it to also be a None. This could be any of the columns from the first to the last.
I could of course do this with a for-loop but my dfs can be quite large so something quicker would be preferable. Any ideas?
Thanks!
With help from numpy, this is quite easy. By counting the number of None in each row one can find for each row the column with the last non-None value. Then using Numpy change this value to None:
data = np.random.random((6,10))
df = pd.DataFrame(data)
df.iloc[0, 7:] = None
df.iloc[1, 6:] = None
df.iloc[2, 5:] = None
df.iloc[3, 8:] = None
df.iloc[4, 5:] = None
df.iloc[5, 4:] = None
Original dataframe looks like this:
0 1 2 3 4 5
0 0.992337 0.651785 0.521422 NaN NaN NaN
1 0.912962 0.292458 0.620195 0.507071 0.010205 NaN
2 0.061320 0.565979 0.344755 NaN NaN NaN
3 0.521936 0.057917 0.359699 0.484009 NaN NaN
isnull = df.isnull()
col = data.shape[1] - isnull.sum(axis = 1) - 1
df.values[range(len(df)), col] = None
Updated dataframe looks like this:
0 1 2 3 4 5
0 0.992337 0.651785 NaN NaN NaN NaN
1 0.912962 0.292458 0.620195 0.507071 NaN NaN
2 0.061320 0.565979 NaN NaN NaN NaN
3 0.521936 0.057917 0.359699 NaN NaN NaN
You can find the index of the element to replace in each row with np.argmax():
indices = np.isnan(df.to_numpy()).argmax(axis=1) - 1
df.to_numpy()[range(len(df)), indices] = None
how can I set all my values in df1 as missing if their position equivalent is a missing value in df2?
Data df1:
Index Data
1 3
2 8
3 9
Data df2:
Index Data
1 nan
2 2
3 nan
desired output:
Index Data
1 nan
2 8
3 nan
So I would like to keep the data of df1, but only for the positions for which df2 also has data entries. For all nans in df2 I would like to replace the value of df1 with nan as well.
I tried the following, but this replaced all data points with nan.
df1 = df1.where(df2== np.nan, np.nan)
Thank you very much for your help.
Use mask, which is doing exactly the inverse of where:
df3 = df1.mask(df2.isna())
output:
Index Data
0 1 NaN
1 2 8.0
2 3 NaN
In your case, you were setting all elements matching a non-NaN as NaN, and because equality is not the correct way to check for NaN (np.nan == np.nan yields False), you were setting all to NaN.
Change df2 == np.nan by df2.notna():
df3 = df1.where(df2.notna(), np.nan)
print(df3)
# Output
Index Data
0 1 NaN
1 2 8.0
2 3 NaN
When reading csv file my dataframe has these column names:
df.columns:
Index([nan,"A", nan, "B", "C", nan],dtype='object')
For unknown reasons it does not automatically name them as "Unnamed:0" and so on as it usually does.
Therefore is it possible to rename the multiple nan columns to Unnamed:0, Unnamed:1 and so on, depending on how many nan columns are there- the number of nan columns varies.
first convert your columns to a series then apply a cumulative count cumcount to a boolean condition which is True if there is a null occurrence. then use the conditional value to fill the null values.
s = pd.Series(df.columns)
print(s)
0 NaN
1 A
2 NaN
3 B
4 C
5 NaN
s = s.fillna('unnamed:' + (s.groupby(s.isnull()).cumcount() + 1).astype(str))
print(s)
0 unnamed:1
1 A
2 unnamed:2
3 B
4 C
5 unnamed:3
dtype: object
df.columns = s
I have the following dataframe as "w":
A B
0 Alex Benedict
1 John NaN
I want to find the maximum from these 2 columns and store it in "A" column
I used the following method:
w["A"] = w[['A','B']].max(axis=1)
A B
0 NaN Benedict
1 NaN NaN
I don't want this output of "NaN" in the "A" column. How should I get rid of this?
It is possible with max per rows with removing missing values:
w['A'] = w[['A','B']].apply(lambda x: x.dropna().max(), axis=1)
print (w)
A B
0 Benedict Benedict
1 John NaN
The nanmax() function of numpy does the job
w['A'] = w[['A', 'B']].apply(np.nanmax)
I'm trying to match values in a matrix on python using pandas dataframes. Maybe this is not the best way to express it.
Imagine you have the following dataset:
import pandas as pd
d = {'stores':['','','','',''],'col1': ['x','price','','',1],'col2':['y','quantity','',1,''], 'col3':['z','',1,'',''] }
df = pd.DataFrame(data=d)
stores col1 col2 col3
0 NaN x y z
1 NaN price quantity NaN
2 NaN NaN Nan 1
3 NaN NaN 1 NaN
4 NaN 1 NaN NaN
I'm trying to get the following:
stores col1 col2 col3
0 NaN x y z
1 NaN price quantity NaN
2 z NaN Nan 1
3 y NaN 1 NaN
4 x 1 NaN NaN
Any ideas how this might work? I've tried running loops on lists but I'm not quite sure how to do it.
This is what I have so far but it's just terrible (and obviously not working) and I am sure there is a much simpler way of doing this but I just can't get my head around it.
stores = ['x','y','z']
for i in stores:
for v in df.iloc[0,:]:
if i==v :
df['stores'] = i
It yields the following:
stores col1 col2 col3
0 z x y z
1 z price quantity NaN
2 z NaN NaN 1
3 z NaN 1 NaN
4 z 1 NaN NaN
Thank you in advance.
You can complete this task with a loop by doing the following. It loops through each column excluding the first where you want to write the data. Takes the index values where the value is 1 and writes the value from the first row to the column 'stores'.
Be careful where you might have 1's in multiple rows, in which case it will fill the stores column with the last column that had a 1 value.
for col in df.columns[1:]:
index_values = df[col][df[col]==1].index.tolist()
df.loc[index_values, 'stores'] = df[col][0]
You can fill the whole column at once, like this:
df["stores"] = df[["col1", "col2", "col3"]].rename(columns=df.loc[0]).eq(1).idxmax(axis=1)
This first creates a version of the dataframe with the columns renamed "x", "y", and "z" after the values in the first row; then idxmax(axis=1) returns the column heading associated with the max value in each row (which is the True one).
However this adds an "x" in rows where none of the columns has a 1. If that is a problem you could do something like this:
df["NA"] = 1 # add a column of ones
df["stores"] = df[["col1", "col2", "col3", "NA"]].rename(columns=df.loc[0]).eq(1).idxmax(axis=1)
df["stores"].replace(1, np.NaN, inplace=True) # replace the 1s with NaNs