find duplicate subset of columns with nan values in dataframe


I have a dataframe with four columns, some of which can contain np.nan:
df =
i_example i_frame OId HId
0 0 20 3.0 0.0
1 3 13 NaN 8.0
2 3 13 NaN 10.0
3 0 21 3.0 NaN
4 0 21 3.0 0.0
5 1 22 0.0 4.0
6 1 22 NaN 4.0
7 2 20 0.0 4.0
8 2 20 1.0 4.0
I am looking for invalid rows.
A row is invalid if either:
[1] it duplicates another row on the columns [i_example, i_frame, OId], or
[2] it duplicates another row on the columns [i_example, i_frame, HId].
So in the example above, all rows except the first three are invalid.
valid_df =
i_example i_frame OId HId
0 0 20 3.0 0.0
1 3 13 NaN 8.0
2 3 13 NaN 10.0
and
invalid_df =
i_example i_frame OId HId
3 0 21 3.0 NaN
4 0 21 3.0 0.0
5 1 22 0.0 4.0
6 1 22 NaN 4.0
7 2 20 0.0 4.0
8 2 20 1.0 4.0
Specifically,
3 0 21 3.0 NaN
4 0 21 3.0 0.0
are invalid because of condition [1],
5 1 22 0.0 4.0
6 1 22 NaN 4.0
are invalid because of condition [2], and
7 2 20 0.0 4.0
8 2 20 1.0 4.0
are invalid for the same reason.
I tried duplicated, but it does not work with NaN values: rows 1 and 2 should stay valid even though both have NaN in OId.

I am not sure whether df.duplicated() has an option to ignore NaNs. But you can add a condition that checks whether the value is NaN, and combine it with the duplicate check:
df[df.duplicated(['i_example', 'i_frame', 'OId'], keep=False) & df['OId'].notna()]
Result:
i_example i_frame OId HId
3 0 21 3.0 NaN
4 0 21 3.0 0.0
So, for your question, I would check that the value is not NaN, find the duplicates with df.duplicated(), and combine both into a boolean mask. Then use that mask to split the df into valid and invalid rows.
dupes = (
    (df['OId'].notna() & df.duplicated(['i_example', 'i_frame', 'OId'], keep=False))
    | (df['HId'].notna() & df.duplicated(['i_example', 'i_frame', 'HId'], keep=False))
)
invalid_df = df[dupes]
valid_df = df[~dupes]
Result:
valid_df =
i_example i_frame OId HId
0 0 20 3.0 0.0
1 3 13 NaN 8.0
2 3 13 NaN 10.0
invalid_df =
i_example i_frame OId HId
3 0 21 3.0 NaN
4 0 21 3.0 0.0
5 1 22 0.0 4.0
6 1 22 NaN 4.0
7 2 20 0.0 4.0
8 2 20 1.0 4.0
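
For reference, the approach above can be put together as a self-contained sketch. The frame is rebuilt from the question's data; the helper name duplicated_ignoring_nan is made up for this illustration and is not a pandas function.

```python
import numpy as np
import pandas as pd

# Example frame copied from the question.
df = pd.DataFrame({
    "i_example": [0, 3, 3, 0, 0, 1, 1, 2, 2],
    "i_frame": [20, 13, 13, 21, 21, 22, 22, 20, 20],
    "OId": [3.0, np.nan, np.nan, 3.0, 3.0, 0.0, np.nan, 0.0, 1.0],
    "HId": [0.0, 8.0, 10.0, np.nan, 0.0, 4.0, 4.0, 4.0, 4.0],
})


def duplicated_ignoring_nan(frame, keys, id_col):
    """Flag rows duplicated on keys + [id_col], skipping rows where id_col is NaN.

    Hypothetical helper: duplicated() treats NaNs as equal to each other,
    so the notna() mask is what keeps NaN rows out of the result.
    """
    dup = frame.duplicated(keys + [id_col], keep=False)
    return dup & frame[id_col].notna()


keys = ["i_example", "i_frame"]
invalid_mask = (
    duplicated_ignoring_nan(df, keys, "OId")
    | duplicated_ignoring_nan(df, keys, "HId")
)
invalid_df = df[invalid_mask]
valid_df = df[~invalid_mask]
```

Factoring the check into a helper keeps the mask readable if more ID columns need the same treatment later.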

Related

Pandas: start a new group on every non-NA value

I am looking for a way to create an array of numbers that labels groups, based on the values in the 'number' column. Is that possible?
With this abbreviated example DF:
import numpy as np
import pandas as pd
nan = np.nan
number = [nan,nan,1,nan,nan,nan,2,nan,nan,3,nan,nan,nan,nan,nan,4,nan,nan]
df = pd.DataFrame({'number': number})
Ideally I would like to make a new column, 'group', based on the ints in column 'number', so that there would effectively be runs of 1, 2, 3, etc. FWIW, the DF is thousands of lines long, with sporadically placed ints.
The result would be a new column, something like this:
number group
0 NaN 0
1 NaN 0
2 1.0 1
3 NaN 1
4 NaN 1
5 NaN 1
6 2.0 2
7 NaN 2
8 NaN 2
9 3.0 3
10 NaN 3
11 NaN 3
12 NaN 3
13 NaN 3
14 NaN 3
15 4.0 4
16 NaN 4
17 NaN 4
All advice much appreciated!

You can use notna combined with cumsum:
df['group'] = df['number'].notna().cumsum()
NB. If the gaps were zeros rather than NaN, use df['group'] = df['number'].ne(0).cumsum() instead.
output:
number group
0 NaN 0
1 NaN 0
2 1.0 1
3 NaN 1
4 NaN 1
5 NaN 1
6 2.0 2
7 NaN 2
8 NaN 2
9 3.0 3
10 NaN 3
11 NaN 3
12 NaN 3
13 NaN 3
14 NaN 3
15 4.0 4
16 NaN 4
17 NaN 4

You can use forward fill:
df['number'].ffill().fillna(0)
Output:
0 0.0
1 0.0
2 1.0
3 1.0
4 1.0
5 1.0
6 2.0
7 2.0
8 2.0
9 3.0
10 3.0
11 3.0
12 3.0
13 3.0
14 3.0
15 4.0
16 4.0
17 4.0
Name: number, dtype: float64
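
The two answers above give the same labels on the question's data only because the markers happen to be 1, 2, 3, 4. A small sketch of the difference, using made-up markers (5, 5, 2) that are neither consecutive nor unique:

```python
import numpy as np
import pandas as pd

# Hypothetical markers, chosen to be non-consecutive and repeated.
s = pd.Series([np.nan, 5, np.nan, 5, np.nan, np.nan, 2, np.nan], name="number")

# Counts how many markers have been seen so far: a new label per marker.
by_cumsum = s.notna().cumsum()

# Propagates the marker value itself; rows before the first marker become 0.
by_ffill = s.ffill().fillna(0)

print(by_cumsum.tolist())  # [0, 1, 1, 2, 2, 2, 3, 3]
print(by_ffill.tolist())   # [0.0, 5.0, 5.0, 5.0, 5.0, 5.0, 2.0, 2.0]
```

If every marker should start its own group regardless of its value, the cumsum version is likely the safer choice; ffill is enough when the marker values are themselves the desired labels.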

Fill Nan based on multiple column condition in Pandas

The objective is to fill in NaN rows with respect to two columns (i.e., a and b).
a,b,c,d
2,0,1,4
5,0,5,6
6,0,1,1
1,1,1,4
4,1,5,6
5,1,5,6
6,1,1,1
1,2,2,3
6,2,5,6
For each fixed value in column b, column a should contain every value from 1 to 6. Rows that have to be added for this get NaN in the other columns.
The following code snippet does the trick:
import numpy as np
import pandas as pd
maxval_col_a=6
lowval_col_a=1
maxval_col_b=2
lowval_col_b=0
r=list(range(lowval_col_b,maxval_col_b+1))
df=pd.DataFrame(np.column_stack([[2,5,6,1,4,5,6,1,6,],
[0,0,0,1,1,1,1,2,2,], [1,5,1,1,5,5,1,2,5,],[4,6,1,4,6,6,1,3,6,]]),columns=['a','b','c','d'])
all_df=[]
for idx in r:
    k=df.loc[df['b']==idx].set_index('a').reindex(range(lowval_col_a, maxval_col_a+1, 1)).reset_index()
    k['b']=idx
    all_df.append(k)
df=pd.concat(all_df)
But I am curious whether there is a more efficient and better way of doing this with Pandas.
The expected output
a b c d
0 1 0 NaN NaN
1 2 0 1.0 4.0
2 3 0 NaN NaN
3 4 0 NaN NaN
4 5 0 5.0 6.0
5 6 0 1.0 1.0
0 1 1 1.0 4.0
1 2 1 NaN NaN
2 3 1 NaN NaN
3 4 1 5.0 6.0
4 5 1 5.0 6.0
5 6 1 1.0 1.0
0 1 2 2.0 3.0
1 2 2 NaN NaN
2 3 2 NaN NaN
3 4 2 NaN NaN
4 5 2 NaN NaN
5 6 2 5.0 6.0

Create the cartesian product of combinations:
mi = pd.MultiIndex.from_product([df['b'].unique(), range(1, 7)],
names=['b', 'a']).swaplevel()
out = df.set_index(['a', 'b']).reindex(mi).reset_index()
print(out)
# Output
a b c d
0 1 0 NaN NaN
1 2 0 1.0 4.0
2 3 0 NaN NaN
3 4 0 NaN NaN
4 5 0 5.0 6.0
5 6 0 1.0 1.0
6 1 1 1.0 4.0
7 2 1 NaN NaN
8 3 1 NaN NaN
9 4 1 5.0 6.0
10 5 1 5.0 6.0
11 6 1 1.0 1.0
12 1 2 2.0 3.0
13 2 2 NaN NaN
14 3 2 NaN NaN
15 4 2 NaN NaN
16 5 2 NaN NaN
17 6 2 5.0 6.0
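
The same idea as a self-contained sketch, with the question's data rebuilt inline. The function name complete_grid and its parameters a_min and a_max are made up for this illustration, and it assumes each (a, b) pair appears at most once in the input.

```python
import pandas as pd

# Data copied from the question.
df = pd.DataFrame({
    "a": [2, 5, 6, 1, 4, 5, 6, 1, 6],
    "b": [0, 0, 0, 1, 1, 1, 1, 2, 2],
    "c": [1, 5, 1, 1, 5, 5, 1, 2, 5],
    "d": [4, 6, 1, 4, 6, 6, 1, 3, 6],
})


def complete_grid(frame, a_min, a_max):
    """Return one row per (a, b) pair, with NaN in c and d where the pair was missing."""
    full = pd.MultiIndex.from_product(
        [sorted(frame["b"].unique()), range(a_min, a_max + 1)],
        names=["b", "a"],
    )
    # reindex requires the (b, a) pairs in the input to be unique.
    out = frame.set_index(["b", "a"]).reindex(full).reset_index()
    return out[["a", "b", "c", "d"]]


out = complete_grid(df, 1, 6)
print(out)
```

Building the index with b as the outer level gives the grouped ordering directly, so no swaplevel is needed in this variant.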

First set columns [a, b] as a MultiIndex, then build a new MultiIndex with all the combinations, and finally reindex with the new MultiIndex
(showing all steps):
# set both a and b as index (it's a multiindex)
df.set_index(['a','b'],drop=True,inplace=True)
# create the new multindex
new_idx_a=np.tile(np.arange(0,6+1),3)
new_idx_b=np.repeat([0,1,2],6+1)
new_multidx=pd.MultiIndex.from_arrays([new_idx_a,
new_idx_b])
# reindex
df=df.reindex(new_multidx)
# convert the multindex back to columns
df.index.names=['a','b']
df.reset_index()
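Results (the new index for a starts at 0 here, so each group has an extra a=0 row compared with the expected output; use np.arange(1, 6+1) and np.repeat([0,1,2], 6) to match it):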
a b c d
0 0 0 NaN NaN
1 1 0 NaN NaN
2 2 0 1.0 4.0
3 3 0 NaN NaN
4 4 0 NaN NaN
5 5 0 5.0 6.0
6 6 0 1.0 1.0
7 0 1 NaN NaN
8 1 1 1.0 4.0
9 2 1 NaN NaN
10 3 1 NaN NaN
11 4 1 5.0 6.0
12 5 1 5.0 6.0
13 6 1 1.0 1.0
14 0 2 NaN NaN
15 1 2 2.0 3.0
16 2 2 NaN NaN
17 3 2 NaN NaN
18 4 2 NaN NaN
19 5 2 NaN NaN
20 6 2 5.0 6.0

We can group by column b, set a as the index within each group, and add the missing values of a using numpy.arange.
To finish, reset the index to get the expected result:
import numpy as np
# select the non-key columns before apply so that b is not duplicated in the result
df.groupby('b')[['a', 'c', 'd']].apply(lambda x: x.set_index('a').reindex(np.arange(1, 7))).reset_index()
Output :
b a c d
0 0 1 NaN NaN
1 0 2 1.0 4.0
2 0 3 NaN NaN
3 0 4 NaN NaN
4 0 5 5.0 6.0
5 0 6 1.0 1.0
6 1 1 1.0 4.0
7 1 2 NaN NaN
8 1 3 NaN NaN
9 1 4 5.0 6.0
10 1 5 5.0 6.0
11 1 6 1.0 1.0
12 2 1 2.0 3.0
13 2 2 NaN NaN
14 2 3 NaN NaN
15 2 4 NaN NaN
16 2 5 NaN NaN
17 2 6 5.0 6.0

fill NA of a column with elements of another column

My df looks like this:
A B
0 0.0 2.0
1 3.0 4.0
2 NaN 1.0
3 2.0 NaN
4 NaN 1.0
5 4.8 NaN
6 NaN 1.0
and I want to apply this line of code:
df['A'] = df['B'].fillna(df['A'])
and I expect an intermediate step and a final output like these (first table, then second table):
A B
0 2.0 2.0
1 4.0 4.0
2 1.0 1.0
3 NaN NaN
4 1.0 1.0
5 NaN NaN
6 1.0 1.0
A B
0 2.0 2.0
1 4.0 4.0
2 1.0 1.0
3 2.0 NaN
4 1.0 1.0
5 4.8 NaN
6 1.0 1.0
but I receive this error:
TypeError: Unsupported type Series
probably because each time there is an NA it tries to fill it with the whole series rather than with the single element at the same index.
I receive the same error with syntax like this:
df['C'] = df['B'].fillna(df['A'])
so the problem does not seem to be that I first overwrite the values of A with those of B and then try to fill the NAs of "B" with a column that is technically the same as B.
I'm in a Databricks environment and I'm working with Koalas data frames, but they are supposed to work like the pandas ones.
Can you help me?

Another option
Suppose the following dataset
import pandas as pd
import numpy as np
df = pd.DataFrame(data={'State':[1,2,3,4,5,6, 7, 8, 9, 10],
'Sno Center': ["Guntur", "Nellore", "Visakhapatnam", "Biswanath", "Doom-Dooma", "Guntur", "Labac-Silchar", "Numaligarh", "Sibsagar", "Munger-Jamalpu"],
'Mar-21': [121, 118.8, 131.6, 123.7, 127.8, 125.9, 114.2, 114.2, 117.7, 117.7],
'Apr-21': [121.1, 118.3, 131.5, np.nan, 128.2, 128.2, 115.4, 115.1, np.nan, 118.3]})
df
State Sno Center Mar-21 Apr-21
0 1 Guntur 121.0 121.1
1 2 Nellore 118.8 118.3
2 3 Visakhapatnam 131.6 131.5
3 4 Biswanath 123.7 NaN
4 5 Doom-Dooma 127.8 128.2
5 6 Guntur 125.9 128.2
6 7 Labac-Silchar 114.2 115.4
7 8 Numaligarh 114.2 115.1
8 9 Sibsagar 117.7 NaN
9 10 Munger-Jamalpu 117.7 118.3
Then
df.loc[(df["Mar-21"].notnull()) & (df["Apr-21"].isna()), "Apr-21"] = df["Mar-21"]
df
State Sno Center Mar-21 Apr-21
0 1 Guntur 121.0 121.1
1 2 Nellore 118.8 118.3
2 3 Visakhapatnam 131.6 131.5
3 4 Biswanath 123.7 123.7
4 5 Doom-Dooma 127.8 128.2
5 6 Guntur 125.9 128.2
6 7 Labac-Silchar 114.2 115.4
7 8 Numaligarh 114.2 115.1
8 9 Sibsagar 117.7 117.7
9 10 Munger-Jamalpu 117.7 118.3

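If I understand correctly, you can try max() (this gives the expected output here because B is never smaller than A in the rows where both are present):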
df['A']=df[['A','B']].max(axis=1)
output of df:
A B
0 2.0 2.0
1 4.0 4.0
2 1.0 1.0
3 2.0 NaN
4 1.0 1.0
5 4.8 NaN
6 1.0 1.0
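
For comparison, here is a sketch in plain pandas of "take B where it is present, otherwise fall back to A". It uses Series.combine_first and has not been checked against Koalas, so treat its behaviour in the Databricks setting from the question as an open assumption. The second frame, other, is made up to show where max() and the fallback differ.

```python
import numpy as np
import pandas as pd

# Data copied from the question.
df = pd.DataFrame({
    "A": [0.0, 3.0, np.nan, 2.0, np.nan, 4.8, np.nan],
    "B": [2.0, 4.0, 1.0, np.nan, 1.0, np.nan, 1.0],
})

# B where available, otherwise A; values are aligned on the index.
filled = df["B"].combine_first(df["A"])

# Row-wise maximum; NaN is skipped by default.
by_max = df[["A", "B"]].max(axis=1)

# Hypothetical row where A is larger than B: the two approaches disagree.
other = pd.DataFrame({"A": [9.0], "B": [1.0]})
fallback_other = other["B"].combine_first(other["A"])
max_other = other[["A", "B"]].max(axis=1)

print(filled.tolist())          # [2.0, 4.0, 1.0, 2.0, 1.0, 4.8, 1.0]
print(fallback_other.tolist())  # [1.0]
print(max_other.tolist())       # [9.0]
```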

transform on multiple columns to interpolate/copy missing values

I'm trying to fill out missing values in a pandas dataframe by interpolating or copying the last-known value within a group (identified by trip). My data looks like this:
brake speed trip
0 0.0 NaN 1
1 1.0 NaN 1
2 NaN 1.264 1
3 NaN 0.000 1
4 0.0 NaN 1
5 NaN 1.264 1
6 NaN 6.704 1
7 1.0 NaN 1
8 0.0 NaN 1
9 NaN 11.746 2
10 1.0 NaN 2
11 0.0 NaN 2
12 NaN 16.961 3
13 1.0 NaN 3
14 NaN 11.832 3
15 0.0 NaN 3
16 NaN 17.082 3
17 NaN 22.435 3
18 NaN 28.707 3
19 NaN 34.216 3
I have found "Pandas interpolate within a groupby", but I need brake to simply be copied from the last-known value, while speed should be interpolated (my actual dataset has 12 columns that each need such treatment).

You can apply separate methods to each column. For example:
# interpolate speed
df['speed'] = df.groupby('trip').speed.transform(lambda x: x.interpolate())
# fill brake with last known value
df['brake'] = df.groupby('trip').brake.transform(lambda x: x.ffill())
>>> df
brake speed trip
0 0.0 NaN 1
1 1.0 NaN 1
2 1.0 1.2640 1
3 1.0 0.0000 1
4 0.0 0.6320 1
5 0.0 1.2640 1
6 0.0 6.7040 1
7 1.0 6.7040 1
8 0.0 6.7040 1
9 NaN 11.7460 2
10 1.0 11.7460 2
11 0.0 11.7460 2
12 NaN 16.9610 3
13 1.0 14.3965 3
14 1.0 11.8320 3
15 0.0 14.4570 3
16 0.0 17.0820 3
17 0.0 22.4350 3
18 0.0 28.7070 3
19 0.0 34.2160 3
Note that some NaN values remain in brake, because there is no "last known value" for the first row of a trip, and in speed when the first few rows of a trip are NaN. You can replace these as you see fit with fillna().
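
With many columns, the per-column treatment can be driven by a small mapping. This is only a sketch: the strategies dict is a made-up convention for this example, and the data is the first twelve rows of the question's frame.

```python
import numpy as np
import pandas as pd

nan = np.nan
# First twelve rows of the question's data (trips 1 and 2).
df = pd.DataFrame({
    "brake": [0.0, 1.0, nan, nan, 0.0, nan, nan, 1.0, 0.0, nan, 1.0, 0.0],
    "speed": [nan, nan, 1.264, 0.0, nan, 1.264, 6.704, nan, nan, 11.746, nan, nan],
    "trip": [1] * 9 + [2] * 3,
})

# Hypothetical mapping from column name to the fill strategy it needs.
strategies = {"brake": "ffill", "speed": "interpolate"}

for column, how in strategies.items():
    grouped = df.groupby("trip")[column]
    if how == "ffill":
        # Copy the last known value forward within each trip.
        df[column] = grouped.ffill()
    else:
        # Linear interpolation within each trip.
        df[column] = grouped.transform(lambda s: s.interpolate())

print(df)
```

Extending this to the twelve columns mentioned in the question would then only mean adding entries to the mapping.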

Indexing columns based on cell value in pandas

I have a dataframe of race results. I'd like to create a series that takes the last stage position and subtracts from it the average of all the stages before that. Here is a small slice of the df (it could have more stages, countries and rows):
race_location stage1_position stage2_position stage3_position number_of_stages
AUS 2.0 2.0 NaN 2
AUS 1.0 5.0 NaN 2
AUS 3.0 4.0 NaN 2
AUS 4.0 8.0 NaN 2
AUS 10.0 6.0 NaN 2
AUS 9.0 7.0 NaN 2
FRA 23.0 1.0 10.0 3
FRA 6.0 12.0 24.0 3
FRA 14.0 11.0 14.0 3
FRA 18.0 10.0 1.0 3
FRA 15.0 14.0 4.0 3
USA 24.0 NaN NaN 1
USA 7.0 NaN NaN 1
USA 22.0 NaN NaN 1
USA 11.0 NaN NaN 1
USA 8.0 NaN NaN 1
USA 16.0 NaN NaN 1
USA 13.0 NaN NaN 1
USA 19.0 NaN NaN 1
USA 5.0 NaN NaN 1
USA 25.0 NaN NaN 1
The output would be
last_stage_minus_average
0
4
1
4
-4
-2
-2
15
1.5
-13
-10.5
0
0
0
0
0
0
0
0
0
0
This won't work, but I was thinking of something like this:
new_series = []
for country in country_list:
    num_stages = df.loc[df['race_location'] == country, 'number_of_stages']
    difference = df.ix[df['race_location'] == country, num_stages] - \
        df.iloc[:, 0:num_stages-1].mean(axis=1)
    new_series.append(difference)
I'm not sure how to go about doing this. Any help or direction would be amazing!

#use pandas apply to take the mean for the first n-1 stages and subtract from last stage.
df.apply(lambda x: x.iloc[x.number_of_stages]-np.mean(x.iloc[1:x.number_of_stages]),axis=1).fillna(0)
Out[264]:
0 0.0
1 4.0
2 1.0
3 4.0
4 -4.0
5 -2.0
6 -2.0
7 15.0
8 1.5
9 -13.0
10 -10.5
11 0.0
12 0.0
13 0.0
14 0.0
15 0.0
16 0.0
17 0.0
18 0.0
19 0.0
20 0.0
dtype: float64

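I'd use filter to get just the stage columns, then stack and groupby:
stages = df.filter(regex=r'^stage\d+.*')
# dropna() is a no-op where stack already drops NaN, and keeps this working where it does not
stages.stack().dropna().groupby(level=0).apply(
    lambda x: x.iloc[-1] - x.iloc[:-1].mean()
).fillna(0)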
0 0.0
1 4.0
2 1.0
3 4.0
4 -4.0
5 -2.0
6 -2.0
7 15.0
8 1.5
9 -13.0
10 -10.5
11 0.0
12 0.0
13 0.0
14 0.0
15 0.0
16 0.0
17 0.0
18 0.0
19 0.0
20 0.0
dtype: float64
How it works:
stack drops the NaN values by default when converting to a Series (in its legacy implementation; chain .dropna() if your pandas version keeps them).
After grouping by the first level of the new MultiIndex, position -1 is the last value within each group.
So we use a lambda that takes the mean of everything before the last value, x.iloc[:-1].mean(),
and subtracts it from the last value, x.iloc[-1].
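
A vectorised sketch of the same calculation, without stack or apply. It assumes stage columns are filled left to right with no gaps, and it uses a handful of rows taken from the question; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# A few rows from the question's data.
df = pd.DataFrame({
    "race_location": ["AUS", "AUS", "FRA", "FRA", "USA"],
    "stage1_position": [2.0, 1.0, 23.0, 14.0, 24.0],
    "stage2_position": [2.0, 5.0, 1.0, 11.0, np.nan],
    "stage3_position": [np.nan, np.nan, 10.0, 14.0, np.nan],
    "number_of_stages": [2, 2, 3, 3, 1],
})

stages = df.filter(regex=r"^stage\d+_position$")

# Number of stages actually raced in each row.
n = stages.notna().sum(axis=1)

# Forward-fill across columns so the last column holds the last raced stage.
last = stages.ffill(axis=1).iloc[:, -1]

# Mean of the earlier stages; 0/0 gives NaN when only one stage was raced.
prior_mean = (stages.sum(axis=1) - last) / (n - 1)

result = (last - prior_mean).fillna(0)
print(result.tolist())  # [0.0, 4.0, -2.0, 1.5, 0.0]
```

Dropping the final fillna(0) leaves NaN for single-stage rows, which may be preferable if those rows should be distinguishable from a genuine zero difference.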

"subtracts that by the average of all the stages before that"
It's not a big deal, but I'm curious. Going by your description rather than your desired output: if a racer finished only one stage, shouldn't their result be inf or NaN instead of 0? That would distinguish them from a racer who has done two or three stages and whose last result is exactly the same as the average of the earlier ones (like racer #1 vs racers #11~20).
df_sp = df.filter(regex=r'^stage\d+.*')
df['last'] = df_sp.ffill(axis=1).iloc[:, -1]
df['mean'] = (df_sp.sum(axis=1) - df['last']) / (df['number_of_stages'] - 1)
print(df['last'] - df['mean'])
0 0.0
1 4.0
2 1.0
3 4.0
4 -4.0
5 -2.0
6 -2.0
7 15.0
8 1.5
9 -13.0
10 -10.5
11 NaN
12 NaN
13 NaN
14 NaN
15 NaN
16 NaN
17 NaN
18 NaN
19 NaN
20 NaN
