Drop duplicate rows, but only if column equals NaN - python

I only want to drop rows where two columns (ID, Code) are duplicated and the third column (Descrip) is NaN. My initial dataframe, df, is shown below, and df2 is what I want instead.
df:
ID Descrip Code
1 NaN CC
1 3 SS
2 4 CC
2 7 SS
3 NaN CC
3 1 CC
3 NaN SS
4 20 CC
4 22 SS
5 15 CC
5 10 SS
6 100 CC
6 NaN CC
6 4 SS
6 NaN SS
df2:
ID Descrip Code
1 NaN CC
1 3 SS
2 4 CC
2 7 SS
3 1 CC
3 NaN SS
4 20 CC
4 22 SS
5 15 CC
5 10 SS
6 100 CC
6 4 SS
I know that df.drop_duplicates(subset=['ID', 'Code'], keep='first') would remove the duplicate rows, but I only want this where 'Descrip' is NaN.

You can use groupby and take the max value (max skips NaN, so any number in a group wins over NaN):
df2 = df.groupby(["ID", "Code"])["Descrip"].max().reset_index()
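A caveat worth checking before relying on the groupby approach: it collapses every (ID, Code) group to a single row, even when several rows in the group hold real numbers. The sketch below uses a small made-up frame (not the data from the question) to compare it with a boolean-mask filter that only drops the NaN duplicates.

```python
import numpy as np
import pandas as pd

# Hypothetical data: group (1, CC) has two real values plus one NaN,
# group (2, SS) has a single NaN row and no duplicates.
df = pd.DataFrame({
    "ID": [1, 1, 1, 2],
    "Descrip": [5, 8, np.nan, np.nan],
    "Code": ["CC", "CC", "CC", "SS"],
})

# groupby + max: one row per (ID, Code); NaN survives only if the group is all NaN.
by_max = df.groupby(["ID", "Code"], as_index=False)["Descrip"].max()

# Mask filter: drop a row only if its (ID, Code) pair is duplicated AND Descrip is NaN.
dup_nan = df.duplicated(["ID", "Code"], keep=False) & df["Descrip"].isna()
by_mask = df[~dup_nan]

print(by_max)   # 2 rows: (1, CC, 8.0) and (2, SS, NaN)
print(by_mask)  # 3 rows: the 5 and 8 rows of (1, CC) plus the lone NaN row of (2, SS)
```

For the data in the question both approaches give the same rows, because no group there has more than one non-NaN value.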

I think you could use:
df = df[~(df.duplicated(['ID','Code'], keep=False) & df['Descrip'].isna())]
Where (and I'll try my best to explain as to my understanding):
df.duplicated(['ID','Code'], keep=False) - Returns a boolean Series that is True for every row whose (ID, Code) pair occurs more than once. keep=False ensures that all occurrences are marked, not just the later ones.
df['Descrip'].isna() - Checks whether or not Descrip holds NaN.
df[~(....first point above .... & .... second point above ....)] - The ampersand combines the two masks with an element-wise and, and the tilde is the not operator that inverts the combined mask, so the rows that are both duplicated and NaN are filtered out.
Result:
ID Descrip Code
0 1 NaN CC
1 1 3 SS
2 2 4 CC
3 2 7 SS
5 3 1 CC
6 3 NaN SS
7 4 20 CC
8 4 22 SS
9 5 15 CC
10 5 10 SS
11 6 100 CC
13 6 4 SS
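For anyone who wants to reproduce the result above, here is a self-contained sketch. The frame is rebuilt by hand from the question's table, and the intermediate names is_dup and is_nan are only for illustration.

```python
import numpy as np
import pandas as pd

# Data retyped from the question's df.
df = pd.DataFrame({
    "ID": [1, 1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 6, 6, 6, 6],
    "Descrip": [np.nan, 3, 4, 7, np.nan, 1, np.nan, 20, 22, 15, 10,
                100, np.nan, 4, np.nan],
    "Code": ["CC", "SS", "CC", "SS", "CC", "CC", "SS", "CC", "SS",
             "CC", "SS", "CC", "CC", "SS", "SS"],
})

is_dup = df.duplicated(["ID", "Code"], keep=False)  # True for every member of a duplicated pair
is_nan = df["Descrip"].isna()                       # True where Descrip is missing

df2 = df[~(is_dup & is_nan)]  # keep everything except duplicated rows with NaN
print(df2)
```

Rows 4, 12 and 14 are dropped; row 0 and row 6 keep their NaN because their (ID, Code) pair is unique.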

Related

How to find the next row that has a value in a column in a pandas dataframe?

I have a dataframe such as:
id info date group label
1 aa 02/05 1 7
2 ba 02/05 1 8
3 cp 09/05 2 7
4 dd 09/05 2 8
5 ii 09/05 2 9
Every group should have the numbers 7, 8 and 9. In the example above, the group 1 does not have the three numbers, the number 9 is missing. In that case, I would like to find the closest row with a 9 in the label, and add it to the dataframe, also changing the date to the group's date.
So the desired result would be:
id info date group label
1 aa 02/05 1 7
2 ba 02/05 1 8
6 ii 02/05 1 9
3 cp 09/05 2 7
4 dd 09/05 2 8
5 ii 09/05 2 9
Welcome to SO. It's good to include what you have tried so far, so keep that in mind. For this question, break your thought process down into pandas steps. The first step is to check which groups are missing which label from {8, 9}:
dfs = df.groupby(['group', 'date']).agg({'label':set}).reset_index().sort_values('group')
dfs['label'] = dfs['label'].apply(lambda x: {8, 9}.difference(x)).explode() # This is the missing label
dfs
Which will give you:
group date label
1 02/05 9
2 09/05 nan
Now merge it with original on label and have info filled in:
final_df = pd.concat([df, dfs.merge(df[['label', 'info']], on='label', suffixes=['','_grouped'])])
final_df
id info date group label
1 aa 02/05 1 7
2 ba 02/05 1 8
3 cp 09/05 2 7
4 dd 09/05 2 8
5 ii 09/05 2 9
nan ii 02/05 1 9
And prettify:
final_df.reset_index(drop=True).reset_index().assign(id=lambda x:x['index']+1).drop(columns=['index']).sort_values(['group', 'id'])
id info date group label
1 aa 02/05 1 7
2 ba 02/05 1 8
6 ii 02/05 1 9
3 cp 09/05 2 7
4 dd 09/05 2 8
5 ii 09/05 2 9
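The same idea can also be written as an explicit loop, which may be easier to follow and covers all three labels (7, 8 and 9). This is only a sketch under two assumptions that are not in the question: "closest" is simplified to "the first row in the frame that carries the missing label", and new rows get the next free id. The names required, donors and new_rows are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "info": ["aa", "ba", "cp", "dd", "ii"],
    "date": ["02/05", "02/05", "09/05", "09/05", "09/05"],
    "group": [1, 1, 2, 2, 2],
    "label": [7, 8, 7, 8, 9],
})

required = {7, 8, 9}          # labels every group should contain
new_rows = []
next_id = df["id"].max() + 1  # assumption: new rows continue the id sequence

for (group, date), part in df.groupby(["group", "date"]):
    for label in sorted(required - set(part["label"])):
        donors = df[df["label"] == label]
        if donors.empty:
            continue  # no row anywhere carries this label, nothing to copy
        # Assumption: the first matching row stands in for the "closest" one.
        row = donors.iloc[0].to_dict()
        row.update({"id": next_id, "group": group, "date": date})
        new_rows.append(row)
        next_id += 1

result = (
    pd.concat([df, pd.DataFrame(new_rows)], ignore_index=True)
    .sort_values(["group", "label"])
    .reset_index(drop=True)
)
print(result)
```

If "closest" should mean nearest by date or by position, the donors.iloc[0] line is the place to change the selection rule.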

Fill down zeros for specific column where there are no values (Python)

I have a dataframe, df, where I would like to fill down zeroes in a specific column where there are no values.
Data
id type count ppr
aa cool 12 9
aa hey 7
aa hi 12 7
bb no 7
bb yes 7
Desired
id type count ppr
aa cool 12 9
aa hey 0 7
aa hi 12 7
bb no 0 7
bb yes 0 7
Doing
df['count'] = df.fillna(0).astype(int)
However, this only gives output of a single column and not the full dataset
Any suggestion is appreciated
If you have no values, maybe it's because you have an empty string:
>>> df['count']
0 12
1
2 12
3
4
Name: count, dtype: object # <- HERE
So, this should work:
df['count'] = df['count'].replace('', 0).astype(int)
>>> df
id type count ppr
0 aa cool 12 9
1 aa hey 0 7
2 aa hi 12 7
3 bb no 0 7
4 bb yes 0 7
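If it is unclear whether the blanks are empty strings or real NaN values, pd.to_numeric with errors="coerce" handles both cases. The sketch below assumes the column was read in as strings, as in the dtype: object output above; the sample frame is retyped from the question.

```python
import pandas as pd

# Assumption: 'count' arrived as text, with '' where the value is missing.
df = pd.DataFrame({
    "id": ["aa", "aa", "aa", "bb", "bb"],
    "type": ["cool", "hey", "hi", "no", "yes"],
    "count": ["12", "", "12", "", ""],
    "ppr": [9, 7, 7, 7, 7],
})

# Anything that cannot be parsed as a number becomes NaN, then NaN becomes 0.
df["count"] = pd.to_numeric(df["count"], errors="coerce").fillna(0).astype(int)
print(df)
```

Note that errors="coerce" also turns genuinely malformed entries into 0, so it is worth inspecting the column first if that would hide data problems.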
df['count']=df['count'].fillna(0).astype(int)
Here is the code that worked for me (and that should have been added to the question to make solving it easier):
import pandas as pd
import numpy as np
df_sample = pd.DataFrame({
    "day": ["day1", "day2", "day1", "day2", "day1", "day2"],
    "count": [None, 160, None, 180, 110, None],
})
df_sample['count']=df_sample['count'].fillna(0).astype(int)
print(df_sample)
Your sample code uses fillna(), which works on a single column, so I suppose your "no values" are actually NaN values.
Since your last paragraph mentions difficulty in applying it to the full dataset, I further suppose you want to apply it to multiple columns. If so, see below:
If you want to apply it to ALL numeric columns and convert them to integer type where possible, you can try:
df.loc[:, df.select_dtypes(include='number').columns] = df.select_dtypes(include='number').fillna(0, downcast='infer')
Demo
# Before conversion:
print(df)
id type count ppr
0 aa cool 12.0 9.0
1 aa hey NaN 7.0
2 aa hi 12.0 NaN
3 bb no NaN 7.0
4 bb yes NaN 7.0
df.loc[:, df.select_dtypes(include='number').columns] = df.select_dtypes(include='number').fillna(0, downcast='infer')
# After conversion:
print(df)
id type count ppr
0 aa cool 12 9
1 aa hey 0 7
2 aa hi 12 0
3 bb no 0 7
4 bb yes 0 7
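The downcast argument of fillna is deprecated in recent pandas releases, and assigning through df.loc[:, cols] may keep the original float dtype there. A version-tolerant sketch of the same idea, assuming every numeric column should end up as a plain integer, is to assign the columns back by name after an explicit astype(int). The name num_cols is only for illustration.

```python
import numpy as np
import pandas as pd

# Same data as the demo above, retyped by hand.
df = pd.DataFrame({
    "id": ["aa", "aa", "aa", "bb", "bb"],
    "type": ["cool", "hey", "hi", "no", "yes"],
    "count": [12, np.nan, 12, np.nan, np.nan],
    "ppr": [9, 7, np.nan, 7, 7],
})

num_cols = df.select_dtypes(include="number").columns

# Replace NaN with 0 and convert explicitly; assigning by column name
# replaces the columns, so the integer dtype is kept.
df[num_cols] = df[num_cols].fillna(0).astype(int)
print(df)
```

Unlike downcast='infer', astype(int) truncates columns that hold genuine fractional values, so restrict num_cols if that matters.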

Filling na in a column in pandas from multiple columns condition

I would like to fill NaN values in a pandas dataframe, but only in rows where two columns are both NaN.
A B C
2 3 5
Nan nan 7
4 7 9
Nan 4 9
12 5 8
Nan Nan 6
In the above dataframe, I would like to replace NaN with "Not Available" only in the rows where both columns A and B are NaN.
Thus:
A B C
2 3 5
Not available not available 7
4 7 9
Nan 4 9
12 5 8
Not available not available 6
I have tried multiple approaches, but I am getting undesirable results.
If you want to test only columns A and B, build a mask with isna and DataFrame.all to check that both are missing, then assign with DataFrame.loc:
m = df[['A','B']].isna().all(axis=1)
df.loc[m, ['A','B']] = 'Not Available'
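If you need to test any 2 columns, first count the number of missing values per row. The mask below marks rows with exactly one missing value, which where leaves unchanged; every other row is filled with fillna: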
m = df.isna().sum(axis=1).eq(1)
df = df.where(m, df.fillna('Not Available'))
print (df)
A B C
0 2 3 5
1 Not Available Not Available 7
2 4 7 9
3 NaN 4 9
4 12 5 8
5 Not Available Not Available 6
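A runnable sketch of the first approach follows, with the frame retyped from the question. One addition that is not in the answer above: columns A and B are cast to object first, because recent pandas versions warn (or refuse) when a string is written into a float column. The name both_missing is only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [2, np.nan, 4, np.nan, 12, np.nan],
    "B": [3, np.nan, 7, 4, 5, np.nan],
    "C": [5, 7, 9, 9, 8, 6],
})

# True only where A and B are missing in the same row.
both_missing = df[["A", "B"]].isna().all(axis=1)

# Assumption: mixed numbers and strings in A and B are acceptable,
# so the columns are made object dtype before writing the text.
df[["A", "B"]] = df[["A", "B"]].astype(object)
df.loc[both_missing, ["A", "B"]] = "Not Available"
print(df)
```

Row 3 keeps its NaN in A because B holds a value there.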

Replace specific values in dataframe

How do I replace particular values in a DataFrame? For example, in the dataframe below I want to replace the values in the rows where column A starts with one of [AA, CB, EZ], and the replacement value is ''.
df = pandas.DataFrame({'A': ['AA','BB','CB','DD','EZ'],'B':[6,7,8,9,10],'C':[11,12,13,14,15]})
$ df
A B C
0 AA 6 11
1 BB 7 12
2 CB 8 13
3 DD 9 14
4 EZ 10 15
$ Expected Output df
A B C
0 AA
1 BB 7 12
2 CB
3 DD 9 14
4 EZ
You can use a boolean mask to replace the values with empty strings, but you then get mixed types (strings and numbers) in the same column, and some functions will fail:
mask = df['A'].str.startswith(('AA','CB','EZ'))
df.loc[mask, ['B', 'C']] = ''
print (df)
A B C
0 AA
1 BB 7 12
2 CB
3 DD 9 14
4 EZ
It is better to replace the values with NaN:
import numpy as np
df.loc[mask, ['B', 'C']] = np.nan
print (df)
A B C
0 AA NaN NaN
1 BB 7.0 12.0
2 CB NaN NaN
3 DD 9.0 14.0
4 EZ NaN NaN
Another solution:
df[['B', 'C']] = df[['B', 'C']].mask(mask)
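Replacing with NaN turns the integer columns into floats (7.0, 12.0). If the integers should stay integers, one option not used in the answers above is pandas' nullable Int64 dtype with pd.NA as the missing marker. A minimal sketch, assuming nullable dtypes are acceptable downstream:

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["AA", "BB", "CB", "DD", "EZ"],
    "B": [6, 7, 8, 9, 10],
    "C": [11, 12, 13, 14, 15],
})

# Rows whose A value starts with any of the given prefixes.
mask = df["A"].str.startswith(("AA", "CB", "EZ"))

# Switch to the nullable integer dtype so missing values do not force floats.
df[["B", "C"]] = df[["B", "C"]].astype("Int64")
df.loc[mask, ["B", "C"]] = pd.NA
print(df)
```

The masked cells print as <NA>, while the remaining values stay 7, 12, 9 and 14 rather than 7.0 and so on.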

Compare 2 consecutive rows and assign increasing value if different (using Pandas)

I have a dataframe df_in like so:
import pandas as pd
dic_in = {'A':['aa','aa','bb','cc','cc','cc','cc','dd','dd','dd','ee'],
'B':['200','200','200','400','400','500','700','700','900','900','200'],
'C':['da','cs','fr','fs','se','at','yu','j5','31','ds','sz']}
df_in = pd.DataFrame(dic_in)
I would like to investigate the 2 columns A and B in the following way.
If 2 consecutive rows[['A','B']] are equal, they are assigned the same value (according to a specific rule which I am about to describe).
I will give an example to be clearer: the first row gets 1; if the second row[['A','B']] is equal to the first one, it also gets 1, and if the third one is equal to the second one it gets 1 as well. Every time two consecutive rows are different, I increase the value to set by 1.
The result should look like this:
A B C value
0 aa 200 da 1
1 aa 200 cs 1
2 bb 200 fr 2
3 cc 400 fs 3
4 cc 400 se 3
5 cc 500 at 4
6 cc 700 yu 5
7 dd 700 j5 6
8 dd 900 31 7
9 dd 900 ds 7
10 ee 200 sz 8
Can you suggest a smart way to achieve this?
Use shift and any to compare consecutive rows, using True to indicate where the value should change. Then take the cumulative sum with cumsum to get the increasing value:
df_in['value'] = (df_in[['A', 'B']] != df_in[['A', 'B']].shift()).any(axis=1)
df_in['value'] = df_in['value'].cumsum()
Alternatively, condensing it to one line:
df_in['value'] = (df_in[['A', 'B']] != df_in[['A', 'B']].shift()).any(axis=1).cumsum()
The resulting output:
A B C value
0 aa 200 da 1
1 aa 200 cs 1
2 bb 200 fr 2
3 cc 400 fs 3
4 cc 400 se 3
5 cc 500 at 4
6 cc 700 yu 5
7 dd 700 j5 6
8 dd 900 31 7
9 dd 900 ds 7
10 ee 200 sz 8
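To see why this works, it can help to look at the intermediate boolean mask before the cumulative sum. The sketch below rebuilds df_in from the question; the names keys and changed are only for illustration.

```python
import pandas as pd

df_in = pd.DataFrame({
    "A": ["aa", "aa", "bb", "cc", "cc", "cc", "cc", "dd", "dd", "dd", "ee"],
    "B": ["200", "200", "200", "400", "400", "500", "700", "700", "900", "900", "200"],
    "C": ["da", "cs", "fr", "fs", "se", "at", "yu", "j5", "31", "ds", "sz"],
})

keys = df_in[["A", "B"]]

# True where A or B differs from the previous row. The first row is compared
# with a missing value, so it is always True and the counter starts at 1.
changed = (keys != keys.shift()).any(axis=1)

# Each True starts a new block; cumsum turns the flags into block numbers.
df_in["value"] = changed.cumsum()
print(df_in)
```

Because only neighbouring rows are compared, a pair that reappears later (such as A='aa', B='200') would get a new number instead of reusing the old one.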
