How can I match the missing values (nan) of two dataframes? - python

How can I set all values in df1 to missing if the value at the same position in df2 is missing?
Data df1:
Index  Data
1      3
2      8
3      9
Data df2:
Index  Data
1      nan
2      2
3      nan
Desired output:
Index  Data
1      nan
2      8
3      nan
So I would like to keep the data of df1, but only for the positions for which df2 also has data entries. For all nans in df2 I would like to replace the value of df1 with nan as well.
I tried the following, but this replaced all data points with nan.
df1 = df1.where(df2 == np.nan, np.nan)
Thank you very much for your help.

Use mask, which does exactly the inverse of where:
df3 = df1.mask(df2.isna())
output:
   Index  Data
0      1   NaN
1      2   8.0
2      3   NaN
In your case, df2 == np.nan evaluated to False everywhere, because equality is not the correct way to check for NaN (np.nan == np.nan yields False). Since where keeps values only where the condition is True, every element was replaced with NaN.
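For reference, a minimal runnable sketch of the mask approach, using the sample data from the question:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Data': [3, 8, 9]}, index=[1, 2, 3])
df2 = pd.DataFrame({'Data': [np.nan, 2, np.nan]}, index=[1, 2, 3])

# mask replaces df1's values with NaN wherever the condition (df2 is NaN) is True
df3 = df1.mask(df2.isna())
print(df3)
#    Data
# 1   NaN
# 2   8.0
# 3   NaN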

Replace df2 == np.nan with df2.notna():
df3 = df1.where(df2.notna(), np.nan)
print(df3)
# Output
   Index  Data
0      1   NaN
1      2   8.0
2      3   NaN
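Note that where already fills non-matching positions with NaN by default (the other argument defaults to NaN), so the explicit np.nan can be dropped:
df3 = df1.where(df2.notna())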

Related

Pandas renaming multiple NaN column names

When reading a csv file, my dataframe has these column names:
df.columns:
Index([nan, "A", nan, "B", "C", nan], dtype='object')
For unknown reasons it does not automatically name them "Unnamed: 0" and so on, as it usually does.
Is it therefore possible to rename the multiple nan columns to Unnamed:0, Unnamed:1 and so on, depending on how many nan columns there are? The number of nan columns varies.
First convert the columns to a Series, then apply a cumulative count (cumcount) within groups defined by a boolean mask that is True where the label is null. Then use the resulting counter to fill the null values.
s = pd.Series(df.columns)
print(s)
0    NaN
1      A
2    NaN
3      B
4      C
5    NaN
dtype: object
s = s.fillna('unnamed:' + (s.groupby(s.isnull()).cumcount() + 1).astype(str))
print(s)
0    unnamed:1
1            A
2    unnamed:2
3            B
4            C
5    unnamed:3
dtype: object
df.columns = s
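Put together, a minimal runnable sketch (the one-row frame here is hypothetical; only the column labels come from the question):
import numpy as np
import pandas as pd

# hypothetical data carrying the question's column labels
df = pd.DataFrame([[1, 2, 3, 4, 5, 6]],
                  columns=[np.nan, 'A', np.nan, 'B', 'C', np.nan])

s = pd.Series(df.columns)
# count the null labels cumulatively (1, 2, 3, ...) and use that to fill them
df.columns = s.fillna('unnamed:' + (s.groupby(s.isnull()).cumcount() + 1).astype(str))
print(df.columns)
# Index(['unnamed:1', 'A', 'unnamed:2', 'B', 'C', 'unnamed:3'], dtype='object')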

How can I match values in a matrix in Python using pandas?

I'm trying to match values in a matrix in Python using pandas dataframes. Maybe this is not the best way to express it.
Imagine you have the following dataset:
import pandas as pd
import numpy as np
d = {'stores': [np.nan] * 5,
     'col1': ['x', 'price', np.nan, np.nan, 1],
     'col2': ['y', 'quantity', np.nan, 1, np.nan],
     'col3': ['z', np.nan, 1, np.nan, np.nan]}
df = pd.DataFrame(data=d)
  stores   col1      col2 col3
0    NaN      x         y    z
1    NaN  price  quantity  NaN
2    NaN    NaN       NaN    1
3    NaN    NaN         1  NaN
4    NaN      1       NaN  NaN
I'm trying to get the following:
  stores   col1      col2 col3
0    NaN      x         y    z
1    NaN  price  quantity  NaN
2      z    NaN       NaN    1
3      y    NaN         1  NaN
4      x      1       NaN  NaN
Any ideas how this might work? I've tried running loops on lists but I'm not quite sure how to do it.
This is what I have so far, but it's just terrible (and obviously not working); I am sure there is a much simpler way of doing this, but I just can't get my head around it.
stores = ['x', 'y', 'z']
for i in stores:
    for v in df.iloc[0, :]:
        if i == v:
            df['stores'] = i
It yields the following:
  stores   col1      col2 col3
0      z      x         y    z
1      z  price  quantity  NaN
2      z    NaN       NaN    1
3      z    NaN         1  NaN
4      z      1       NaN  NaN
Thank you in advance.
You can complete this task with a loop as follows. It iterates over each column except the first (the one you want to write to), takes the index values where the value is 1, and writes that column's first-row label into the 'stores' column.
Be careful if a row has 1s in several columns: in that case the stores column ends up with the label of the last such column.
for col in df.columns[1:]:
    index_values = df[col][df[col] == 1].index.tolist()
    df.loc[index_values, 'stores'] = df[col][0]
You can fill the whole column at once, like this:
df["stores"] = df[["col1", "col2", "col3"]].rename(columns=df.loc[0]).eq(1).idxmax(axis=1)
This first creates a version of the dataframe with the columns renamed "x", "y", and "z" after the values in the first row; then idxmax(axis=1) returns the column heading associated with the max value in each row (which is the True one).
However this adds an "x" in rows where none of the columns has a 1. If that is a problem you could do something like this:
df["NA"] = 1 # add a column of ones
df["stores"] = df[["col1", "col2", "col3", "NA"]].rename(columns=df.loc[0]).eq(1).idxmax(axis=1)
df["stores"].replace(1, np.NaN, inplace=True) # replace the 1s with NaNs

Check whether a dataframe cell contains a value that is in another dataframe's cell

I'm trying to do the following:
Given a row in df1, if str(row['code']) appears in any row of df2['code'], then I would like those rows of df2['lamer_url_1'] and df2['shopee_url_1'] to take the corresponding values from df1.
Then carry on with the next row of df1['code']...
'''
==============
Initial Tables:
df1
         code lamer_url_1 shopee_url_1
0  L61B18H089           b            a
1  L61S19H014           e            d
2  L61S19H015           z            y
df2
               code lamer_url_1 shopee_url_1 lamer_url_2 shopee_url_2
0  L61B18H089-F1424         NaN          NaN         NaN          NaN
1  L61S19H014-S1500         NaN          NaN         NaN          NaN
2  L61B18H089-F1424         NaN          NaN         NaN          NaN
==============
Expected output:
df2
               code lamer_url_1 shopee_url_1 lamer_url_2 shopee_url_2
0  L61B18H089-F1424           b            a         NaN          NaN
1  L61S19H014-S1500           e            d         NaN          NaN
2  L61B18H089-F1424           b            a         NaN          NaN
'''
I assumed that the common part of "code" in "df2" is the part before "-". I also assumed that from "df1" we want 'lamer_url_1' and 'shopee_url_1', and from "df2" we want 'lamer_url_2' and 'shopee_url_2' (correct me in a comment if I am wrong so I can polish the code):
df1.set_index(df1['code'], inplace=True)
df2.set_index(df2['code'].apply(lambda x: x.split('-')[0]), inplace=True)
df2.index.names = ['code_join']
df3 = pd.merge(df2[['code', 'lamer_url_2', 'shopee_url_2']],
               df1[['lamer_url_1', 'shopee_url_1']],
               left_index=True, right_index=True)
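A minimal runnable sketch of this merge, using the sample frames from the question:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'code': ['L61B18H089', 'L61S19H014', 'L61S19H015'],
                    'lamer_url_1': ['b', 'e', 'z'],
                    'shopee_url_1': ['a', 'd', 'y']})
df2 = pd.DataFrame({'code': ['L61B18H089-F1424', 'L61S19H014-S1500', 'L61B18H089-F1424'],
                    'lamer_url_1': np.nan, 'shopee_url_1': np.nan,
                    'lamer_url_2': np.nan, 'shopee_url_2': np.nan})

# index df1 by its full code, df2 by the part of its code before the '-'
df1.set_index(df1['code'], inplace=True)
df2.set_index(df2['code'].apply(lambda x: x.split('-')[0]), inplace=True)
df2.index.names = ['code_join']
df3 = pd.merge(df2[['code', 'lamer_url_2', 'shopee_url_2']],
               df1[['lamer_url_1', 'shopee_url_1']],
               left_index=True, right_index=True)
print(df3.reset_index(drop=True))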

How to compare two dataframes and filter rows and columns where a difference is found

I am testing dataframes for equality.
df_diff = (df1 != df2)
I get df_diff, which has the same shape as df1 and df2 and contains boolean True/False values.
Now I would like to keep only the columns and rows of df1 where there is at least one differing value.
If I simply do
df1 = df1[df_diff.values]
I get all the rows where there was at least one True in df_diff, but also lots of columns that originally contained only False.
As a second step, I would then like to replace all the values (element-wise) that were equal (where df_diff == False) with NaNs.
example:
df1 = pd.DataFrame(data=[[1, 2, 3], [4, 5, 6], [7, 8, 9]])
df2 = pd.DataFrame(data=[[1, 99, 3], [4, 5, 99], [7, 8, 9]])
I would like to get from df1
   0  1  2
0  1  2  3
1  4  5  6
2  7  8  9
to
     1    2
0    2  NaN
1  NaN    6
I think you need DataFrame.any to check for at least one True per row or per column:
df = df_diff[df_diff.any(axis=1)]
It is possible to filter both of the original dataframes like so:
df11 = df1[df_diff.any(axis=1)]
df22 = df2[df_diff.any(axis=1)]
If you want to filter both the rows and the columns:
df = df_diff.loc[df_diff.any(axis=1), df_diff.any()]
EDIT: To filter df1 and add the NaNs, use where:
df_diff = (df1 != df2)
m1 = df_diff.any(axis=1)
m2 = df_diff.any()
out = df1.loc[m1, m2].where(df_diff.loc[m1, m2])
print(out)
     1    2
0  2.0  NaN
1  NaN  6.0
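As a side note, pandas 1.1+ also offers DataFrame.compare, which produces a similar difference view directly (rows and columns without any difference are dropped); a quick sketch with the frames above:
out = df1.compare(df2)
print(out)
#      1            2
#   self other  self other
# 0  2.0  99.0   NaN   NaN
# 1  NaN   NaN   6.0  99.0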

Drop duplicates while preserving NaNs in pandas

When using the drop_duplicates() method I reduce duplicates but also merge all NaNs into one entry. How can I drop duplicates while preserving rows with an empty entry (like np.nan, None or '')?
import numpy as np
import pandas as pd
df = pd.DataFrame({'col': ['one', 'two', np.nan, np.nan, np.nan, 'two', 'two']})
Out[]:
   col
0  one
1  two
2  NaN
3  NaN
4  NaN
5  two
6  two
df.drop_duplicates(['col'])
Out[]:
   col
0  one
1  two
2  NaN
Try
df[(~df.duplicated()) | (df['col'].isnull())]
The result is:
   col
0  one
1  two
2  NaN
3  NaN
4  NaN
Well, one workaround that is not really beautiful is to first save the NaN rows and then put them back in:
temp = df[pd.isnull(df).any(axis=1)]
asd = df.drop_duplicates('col')
pd.merge(temp, asd, how='outer')
Out[81]:
   col
0  one
1  two
2  NaN
3  NaN
4  NaN
Use pd.concat (DataFrame.append was removed in pandas 2.0), and drop the NaN rows before deduplicating so the first NaN is not kept twice:
pd.concat([df.dropna(subset=['col']).drop_duplicates('col'), df[df['col'].isna()]])
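For reference, a quick end-to-end run of that variant on the sample frame from the question:
out = pd.concat([df.dropna(subset=['col']).drop_duplicates('col'),
                 df[df['col'].isna()]])
print(out)
#    col
# 0  one
# 1  two
# 2  NaN
# 3  NaN
# 4  NaN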
