Pandas - from duplicate rows, keeping rows without null values - Python

I have the following dataframe:
df = pd.DataFrame({'id': [1, 1, 2, 3, 3, 4], 'test': ['a', np.nan, 'b', 'w', 'd', np.nan]})
As you can see, the "id" column has some duplicate values, with different values in the "test" column. From the duplicate rows, I want to keep only the rows without null values. If a duplicated row does not have any null values, I want to keep it.
The output should be like this:
id test
0 1 a
1 2 b
2 3 w
3 3 d
4 4 NaN
I tried this, but it does not work because it keeps only one of the duplicate rows where id = 3:
df = df.groupby('id', as_index=False, sort=False)['test'].first()
Any suggestions?

For your sample data:
dup_id = df['id'].duplicated(keep=False)
df[~(dup_id & df.test.isna())]
gives what you want:
id test
0 1 a
2 2 b
3 3 w
4 3 d
5 4 NaN
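For copy-paste convenience, here is the same approach as a self-contained sketch (assuming only the usual pandas and numpy imports), with comments on each step:
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2, 3, 3, 4],
                   'test': ['a', np.nan, 'b', 'w', 'd', np.nan]})

# flag every row whose id occurs more than once
dup_id = df['id'].duplicated(keep=False)
# keep a row unless it is both a duplicate and null in 'test'
print(df[~(dup_id & df['test'].isna())])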

Related

How to add values of one dataframe to another dataframe?

I want to add a row of one dataframe to every row of another dataframe.
df1 = pd.DataFrame({"a": [1, 2],
                    "b": [3, 4]})
df2 = pd.DataFrame({"a": [4], "b": [5]})
I want to add df2's values to every row of df1.
I used df1 + df2 and got the following result:
a b
0 5.0 8.0
1 NaN NaN
But I want to get the following result:
a b
0 5 7
1 7 9
Any help would be dearly appreciated!
If you really need to add the values per column (this requires the number of columns in df2 to equal the number of rows in df1), use:
df = df1.add(df2.loc[0].to_numpy(), axis=0)
print (df)
a b
0 5 7
1 7 9
If you need to add by rows instead, each value of df2's row is added to the matching column of df1, so the output is different:
df = df1.add(df2.loc[0], axis=1)
print (df)
a b
0 5 8
1 6 9
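Put together as a runnable sketch of both variants side by side (just the sample data from the question):
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df2 = pd.DataFrame({'a': [4], 'b': [5]})

# per column: [4, 5] is added down the rows, so row 0 gets +4 and row 1 gets +5
print(df1.add(df2.loc[0].to_numpy(), axis=0))
# per row: alignment is by column label, so column 'a' gets +4 and column 'b' gets +5
print(df1.add(df2.loc[0], axis=1))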

Select multiple rows from a pandas data frame where one of the columns contains NaN values

Select columns ['A','B'] of the rows where column 'C' contains NaN values in Python (pandas).
I have a pandas data frame with three columns 'A', 'B', 'C'.
In column 'C' there are some rows which contain NaN values.
Now I want to select columns 'A' and 'B' of the data frame where column 'C' contains NaN values.
If I need to select all columns or only one column, I can do the following:
df['A'][df['C'].isnull()]
or
df[df['C'].isnull()]
but I cannot figure out how to select multiple columns.
You can put multiple column names in the first form (note this chained indexing is fine for reading, but not for assignment):
df[['A','B']][df['C'].isnull()]
You can use loc, and select a list of columns:
df.loc[df['C'].isnull(), ['A','B']]
For example
>>> df = pd.DataFrame({'A':[1,2,3,4], 'B':[5,6,7,8], 'C':[np.nan,1,np.nan,2]})
>>> df
A B C
0 1 5 NaN
1 2 6 1.0
2 3 7 NaN
3 4 8 2.0
>>> df.loc[df['C'].isnull(), ['A','B']]
A B
0 1 5
2 3 7
I like dropna and drop, since we will not get a SettingWithCopyWarning if we forget to add .copy(). Note that dropna keeps the rows where 'C' is present, which is the complement of what was asked:
sub = df.dropna(subset=['C']).drop(columns='C')
sub
Out[26]:
A B
1 2 6
3 4 8
To get the rows where 'C' is NaN instead, select with the mask and then drop the column:
sub = df[df['C'].isnull()].drop(columns='C')

Code multiple columns based on lists and dictionaries in Python

I have the following dataframe in Pandas
OfferPreference_A OfferPreference_B OfferPreference_C
A B A
B C C
C S G
I have the following dictionary of the unique values found under all the columns:
dict1 = {'A': 1, 'B': 2, 'C': 3, 'S': 4, 'G': 5, 'D': 6}
I also have a list of the column names:
columnlist = ['OfferPreference_A', 'OfferPreference_B', 'OfferPreference_C']
I am trying to get the following table as the output:
OfferPreference_A OfferPreference_B OfferPreference_C
1 2 1
2 3 3
3 4 5
How do I do this?
Use:
# values with no match become NaN
# (applymap is called DataFrame.map in pandas 2.1+)
df = df[columnlist].applymap(dict1.get)
Or:
# values with no match keep their original value
df = df[columnlist].replace(dict1)
Or:
# values with no match become NaN
df = df[columnlist].stack().map(dict1).unstack()
print(df)
OfferPreference_A OfferPreference_B OfferPreference_C
0 1 2 1
1 2 3 3
2 3 4 5
You can use map for this as shown below, assuming the values always match (unmatched values become NaN):
for col in columnlist:
    df[col] = df[col].map(dict1)
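A complete sketch with the sample data from the question (note the dictionary keys must be strings for the lookup to match):
import pandas as pd

df = pd.DataFrame({'OfferPreference_A': ['A', 'B', 'C'],
                   'OfferPreference_B': ['B', 'C', 'S'],
                   'OfferPreference_C': ['A', 'C', 'G']})
dict1 = {'A': 1, 'B': 2, 'C': 3, 'S': 4, 'G': 5, 'D': 6}
columnlist = ['OfferPreference_A', 'OfferPreference_B', 'OfferPreference_C']

for col in columnlist:
    # values missing from dict1 would become NaN
    df[col] = df[col].map(dict1)
print(df)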

Pandas: Create dataframe column based on other dataframe

If I have 2 dataframes like these two:
import pandas as pd
df1 = pd.DataFrame({'Type':list('AABAC')})
df2 = pd.DataFrame({'Type':list('ABCDEF'), 'Value':[1,2,3,4,5,6]})
Type
0 A
1 A
2 B
3 A
4 C
Type Value
0 A 1
1 B 2
2 C 3
3 D 4
4 E 5
5 F 6
I would like to add a column in df1 based on the values in df2. df2 only contains unique values, whereas df1 has multiple entries of each value.
So the resulting df1 should look like this:
Type Value
0 A 1
1 A 1
2 B 2
3 A 1
4 C 3
My actual dataframe df1 is quite long, so I need something that is efficient (I tried it in a loop but this takes forever).
As requested I am posting a solution that uses map without the need to create a temporary dict:
In[3]:
df1['Value'] = df1['Type'].map(df2.set_index('Type')['Value'])
df1
Out[3]:
Type Value
0 A 1
1 A 1
2 B 2
3 A 1
4 C 3
This relies on a couple of things: the 'Type' values being looked up should exist in df2 (values with no match become NaN rather than raising an error), and df2 must not contain duplicate 'Type' entries, otherwise mapping against the resulting non-unique index raises InvalidIndexError: Reindexing only valid with uniquely valued Index objects.
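A tiny illustration of the lookup behaviour (a sketch; 'Z' is a made-up value that does not appear in df2):
lookup = df2.set_index('Type')['Value']
# 'A' resolves to 1; 'Z' has no match and becomes NaN
print(pd.Series(['A', 'Z']).map(lookup))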
You could create a dict from your df2 with the to_dict method and then map the result onto the Type column of df1:
replace_dict = dict(df2.to_dict('split')['data'])
In [50]: replace_dict
Out[50]: {'A': 1, 'B': 2, 'C': 3, 'D': 4, 'E': 5, 'F': 6}
df1['Value'] = df1['Type'].map(replace_dict)
In [52]: df1
Out[52]:
Type Value
0 A 1
1 A 1
2 B 2
3 A 1
4 C 3
Another way to do this is by using the label-based indexer loc. First make the Type column the index using .set_index, then look up df1's Type values, and restore the original columns with .reset_index:
df2.set_index('Type').loc[df1['Type'],:].reset_index()
Either use this as your new df1 or extract the Value column:
df1['Value'] = df2.set_index('Type').loc[df1['Type'],:].reset_index()['Value']
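A small variant (my own addition, not part of the original answer): handing over the raw values with .to_numpy() makes the assignment positional, so it cannot be tripped up by index alignment if df1 does not have a default RangeIndex:
df1['Value'] = df2.set_index('Type').loc[df1['Type'], 'Value'].to_numpy()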

Choosing a different value for NaN entries from appending DataFrames with different columns

I am concatenating multiple months of CSVs, where newer versions have additional columns. As a result, putting them all together fills certain rows of certain columns with NaN.
The issue with this behavior is that it mixes these NaNs with true null entries from the data set, which need to be easily distinguishable.
My only solution as of now is to replace the original NaNs with a unique string, concatenate the CSVs, replace the new NaNs with a second unique string, and then replace the first unique string with NaN.
Given the amount of data I am processing, this is a very inefficient solution. I thought there was some way to control how pandas DataFrames fill these entries, but couldn't find anything on it.
Updated example:
A B
1 NaN
2 3
And append
A B C
1 2 3
Gives
A B C
1 NaN NaN
2 3 NaN
1 2 3
But I want
A B C
1 NaN 'predated'
2 3 'predated'
1 2 3
In case you have a core set of columns, as represented here by df1, you could apply .fillna() to the .difference() between the core set and any new columns in more recent DataFrames.
df1 = pd.DataFrame(data={'A': [1, 2], 'B': [np.nan, 3]})
A B
0 1 NaN
1 2 3
df2 = pd.DataFrame(data={'A': 1, 'B': 2, 'C': 3}, index=[0])
A B C
0 1 2 3
df = pd.concat([df1, df2], ignore_index=True)
new_cols = df2.columns.difference(df1.columns).tolist()
df[new_cols] = df[new_cols].fillna(value='predated')
A B C
0 1 NaN predated
1 2 3 predated
2 1 2 3
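End to end, a minimal runnable sketch of this answer (assuming the usual imports):
import numpy as np
import pandas as pd

df1 = pd.DataFrame(data={'A': [1, 2], 'B': [np.nan, 3]})
df2 = pd.DataFrame(data={'A': 1, 'B': 2, 'C': 3}, index=[0])

df = pd.concat([df1, df2], ignore_index=True)
# columns that exist in df2 but not in the core set df1
new_cols = df2.columns.difference(df1.columns).tolist()
# NaNs in those columns predate the columns' introduction
df[new_cols] = df[new_cols].fillna(value='predated')
print(df)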
