Remove NaN from each column and rearrange it with python pandas/numpy

I have an issue similar to my previous question:
Remove zero from each column and rearranging it with python pandas/numpy
But in this case, I need to remove NaN. I have tried many solutions, including modifying the solutions from my previous post:
a = a[a != np.nan].reshape(-1,3)
but it gave me a weird result (a != np.nan is True for every element, because NaN never compares equal to anything, so nothing is filtered out).
Here is my initial matrix from the DataFrame:
A B C D E F
nan nan nan 0.0 27.7 nan
nan nan nan 5.0 27.5 nan
nan nan nan 10.0 27.4 nan
0.0 29.8 nan nan nan nan
5.0 29.9 nan nan nan nan
10.0 30.0 nan nan nan nan
nan nan 0.0 28.6 nan nan
nan nan 5.0 28.6 nan nan
nan nan 10.0 28.5 nan nan
nan nan 15.0 28.4 nan nan
nan nan 20.0 28.3 nan nan
nan nan 25.0 28.2 nan nan
And I expect to have a result like this:
A B
0.0 27.7
5.0 27.5
10.0 27.4
0.0 29.8
5.0 29.9
10.0 30.0
0.0 28.6
5.0 28.6
10.0 28.5
15.0 28.4
20.0 28.3
25.0 28.2

Solution:
Given the input dataframe a:
a.apply(lambda x: pd.Series(x.dropna().values)).dropna(axis='columns')
This drops the NaNs from each column, re-packs the remaining values into a fresh Series (shifting them to the top), and finally drops every column that still contains NaN, giving the desired output.
Example:
import numpy as np
import pandas as pd
a = pd.DataFrame({'A':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
                  'B':[np.nan,np.nan,np.nan,np.nan,np.nan,4],
                  'C':[7,np.nan,9,np.nan,2,np.nan],
                  'D':[1,3,np.nan,7,np.nan,np.nan],
                  'E':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan]})
print (a)
A B C D E
0 NaN NaN 7.0 1.0 NaN
1 NaN NaN NaN 3.0 NaN
2 NaN NaN 9.0 NaN NaN
3 NaN NaN NaN 7.0 NaN
4 NaN NaN 2.0 NaN NaN
5 NaN 4.0 NaN NaN NaN
a_new = a.apply(lambda x: pd.Series(x.dropna().values)).dropna(axis='columns')
print(a_new)
C D
0 7.0 1.0
1 9.0 3.0
2 2.0 7.0
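To see why the trailing .dropna(axis='columns') is needed, look at the intermediate result of the apply step alone: each column's non-NaN values are shifted to the top and the shorter columns are padded with NaN, so A, B and E still contain missing values:
print(a.apply(lambda x: pd.Series(x.dropna().values)))
    A    B    C    D   E
0 NaN  4.0  7.0  1.0 NaN
1 NaN  NaN  9.0  3.0 NaN
2 NaN  NaN  2.0  7.0 NaN
Only C and D survive the final column-wise dropna.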

Use np.isnan to test for missing values, with ~ to invert the mask; this works if there are always exactly 2 non-missing values per row:
a = df.to_numpy()
df = pd.DataFrame(a[~np.isnan(a)].reshape(-1,2))
print (df)
0 1
0 0.0 27.7
1 5.0 27.5
2 10.0 27.4
3 0.0 29.8
4 5.0 29.9
5 10.0 30.0
6 0.0 28.6
7 5.0 28.6
8 10.0 28.5
9 15.0 28.4
10 20.0 28.3
11 25.0 28.2
Another idea is to use the justify function and then drop the all-NaN columns.
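justify is not part of pandas or NumPy itself; it is the widely shared NumPy helper from Stack Overflow. A sketch of that implementation, for reference:
import numpy as np

def justify(a, invalid_val=0, axis=1, side='left'):
    # Mask of valid entries; NaN needs isnan rather than !=
    if invalid_val is np.nan:
        mask = ~np.isnan(a)
    else:
        mask = a != invalid_val
    # Sorting the boolean mask pushes the True (valid) slots to one end
    justified_mask = np.sort(mask, axis=axis)
    if (side == 'up') | (side == 'left'):
        justified_mask = np.flip(justified_mask, axis=axis)
    # Write the valid values into the justified positions, in original order
    out = np.full(a.shape, invalid_val)
    if axis == 1:
        out[justified_mask] = a[mask]
    else:
        out.T[justified_mask.T] = a.T[mask.T]
    return out
With justify defined, rebuild the frame and drop the all-NaN columns: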
df1 = (pd.DataFrame(justify(a, invalid_val=np.nan), columns=df.columns)
         .dropna(how='all', axis=1))
print (df1)
A B
0 0.0 27.7
1 5.0 27.5
2 10.0 27.4
3 0.0 29.8
4 5.0 29.9
5 10.0 30.0
6 0.0 28.6
7 5.0 28.6
8 10.0 28.5
9 15.0 28.4
10 20.0 28.3
11 25.0 28.2
EDIT:
df = pd.concat([df] * 1000, ignore_index=True)
a = df.to_numpy()
print (a.shape)
(12000, 6)
In [168]: %timeit df.apply(lambda x: pd.Series(x.dropna().values)).dropna(axis='columns')
8.06 ms ± 597 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [172]: %%timeit
     ...: a = df.to_numpy()
     ...: pd.DataFrame(a[~np.isnan(a)].reshape(-1,2))
422 µs ± 3.52 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [173]: %timeit pd.DataFrame(justify(a, invalid_val=np.nan),columns=df.columns).dropna(how='all', axis=1)
2.88 ms ± 112 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
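The pure NumPy reshape is the clear winner, but it relies on every row having exactly two non-missing values; justify is slower yet also handles rows with a ragged number of valid entries.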

Related

how to remove a row in a dataframe if the specified column values are NaN

This is my dataframe. I need to delete the rows where all the values in d1, d2, d3, c1, c2 and c3 are NaN.
no d1 d2 d3 c1 c2 c3
0 59890 28.4 32.2 31.3 40.7 40.0 39.6
1 55679 NaN 32.8 31.5 37.3 39.2 39.4
2 58900 NaN NaN NaN NaN NaN NaN
3 76522 34.0 32.4 32.6 45.4 NaN 46.9
4 89525 32.7 31.9 32.0 44.1 44.4 46.1
... ... ... ... ... ... ... ...
The expected output:
no d1 d2 d3 c1 c2 c3
0 59890 28.4 32.2 31.3 40.7 40.0 39.6
1 55679 NaN 32.8 31.5 37.3 39.2 39.4
3 76522 34.0 32.4 32.6 45.4 NaN 46.9
4 89525 32.7 31.9 32.0 44.1 44.4 46.1
... ... ... ... ... ... ... ...
You can use dropna() parameters:
df = df.dropna(subset=['d1','d2','d3','c1','c2','c3'], how='all')
Alternatively, if the subset is simply every column except the first:
df = df.dropna(subset=df.columns[1:], how='all')
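A minimal check against the sample above (only row 2, where all six measurement columns are NaN, is dropped; the frame is rebuilt by hand here):
import numpy as np
import pandas as pd

df = pd.DataFrame({'no': [59890, 55679, 58900, 76522, 89525],
                   'd1': [28.4, np.nan, np.nan, 34.0, 32.7],
                   'd2': [32.2, 32.8, np.nan, 32.4, 31.9],
                   'd3': [31.3, 31.5, np.nan, 32.6, 32.0],
                   'c1': [40.7, 37.3, np.nan, 45.4, 44.1],
                   'c2': [40.0, 39.2, np.nan, np.nan, 44.4],
                   'c3': [39.6, 39.4, np.nan, 46.9, 46.1]})
# how='all' drops a row only when every column in the subset is NaN
print(df.dropna(subset=df.columns[1:], how='all'))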

Avoid SettingWithCopyWarning in python using iloc

Usually, to avoid SettingWithCopyWarning, I replace values using .loc or .iloc, but this does not work when I want to forward-fill my column (from the first to the last non-NaN value).
Do you know why this happens and how to get around it?
My test dataframe :
df3 = pd.DataFrame({'Timestamp':[11.1,11.2,11.3,11.4,11.5,11.6,11.7,11.8,11.9,12.0,
                                 12.1,12.2,12.3,12.4,12.5,12.6,12.7,12.8,12.9],
                    'test':[np.nan,np.nan,np.nan,2,22,8,np.nan,4,5,4,5,np.nan,
                            -3,-54,-23,np.nan,89,np.nan,np.nan]})
and the code that raises the warning:
df3['test'].iloc[df3['test'].first_valid_index():df3['test'].last_valid_index()+1] = df3['test'].iloc[df3['test'].first_valid_index():df3['test'].last_valid_index()+1].fillna(method="ffill")
I would like something like the output shown below in the end.
Use first_valid_index and last_valid_index to determine the range that you want to ffill, and then select that range of your dataframe:
df = pd.DataFrame({'Timestamp':[11.1,11.2,11.3,11.4,11.5,11.6,11.7,11.8,11.9,12.0,
                                12.1,12.2,12.3,12.4,12.5,12.6,12.7,12.8,12.9],
                   'test':[np.nan,np.nan,np.nan,2,22,8,np.nan,4,5,4,5,np.nan,
                           -3,-54,-23,np.nan,89,np.nan,np.nan]})
first = df['test'].first_valid_index()
last = df['test'].last_valid_index() + 1
df['test'] = df['test'][first:last].ffill()
print(df)
Timestamp test
0 11.1 NaN
1 11.2 NaN
2 11.3 NaN
3 11.4 2.0
4 11.5 22.0
5 11.6 8.0
6 11.7 8.0
7 11.8 4.0
8 11.9 5.0
9 12.0 4.0
10 12.1 5.0
11 12.2 5.0
12 12.3 -3.0
13 12.4 -54.0
14 12.5 -23.0
15 12.6 -23.0
16 12.7 89.0
17 12.8 NaN
18 12.9 NaN
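The warning arises because df3['test'].iloc[...] = ... is chained indexing: the column is selected first, and the assignment then targets a slice of that intermediate object, so pandas cannot guarantee the write reaches the original frame. A single .loc assignment avoids the chain; a sketch assuming the default RangeIndex (labels coincide with positions, and .loc slices are inclusive):
first = df['test'].first_valid_index()
last = df['test'].last_valid_index()
df.loc[first:last, 'test'] = df.loc[first:last, 'test'].ffill()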

delete consecutive rows conditionally pandas

I have a df with columns (A, B, C, D, E). I want to:
1) Compare consecutive rows
2) If the absolute difference between consecutive E values is <= 1 AND the absolute difference between consecutive C values is > 7, delete the row with the lower C value.
Sample Data:
A B C D E
0 94.5 4.3 26.0 79.0 NaN
1 34.0 8.8 23.0 58.0 54.5
2 54.2 5.4 25.5 9.91 50.2
3 42.2 3.5 26.0 4.91 5.1
4 98.0 8.2 13.0 193.7 5.5
5 20.5 9.6 17.0 157.3 5.3
6 32.9 5.4 24.5 45.9 79.8
Desired result:
A B C D E
0 94.5 4.3 26.0 79.0 NaN
1 34.0 8.8 23.0 58.0 54.5
2 54.2 5.4 25.5 9.91 50.2
3 42.2 3.5 26.0 4.91 5.1
4 32.9 5.4 24.5 45.9 79.8
Row 4 was deleted after being compared with row 3; row 5 then became row 4 and was deleted after being compared with row 3 as well.
This code returns the result as a boolean Series (not a dataframe of values) and does not satisfy all the conditions:
df = (abs(df.E.diff(-1)) <= 1) & (abs(df.C.diff(-1)) > 7.)
The result of the code:
0 False
1 False
2 False
3 True
4 False
5 False
6 False
dtype: bool
Any help appreciated.
Use shift() to compare consecutive rows, and a while loop to iterate until no further change happens:
while True:
    rows = len(df)
    df = df[~((abs(df.E - df.E.shift(1)) <= 1) & (abs(df.C - df.C.shift(1)) > 7))]
    df.reset_index(inplace=True, drop=True)
    if rows == len(df):
        break
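The loop is needed because each deletion creates a new consecutive pair that may itself satisfy the condition (old row 5 only becomes comparable to row 3 after old row 4 is removed), so the filter is re-applied until a full pass removes nothing. Note that this drops the later row of each offending pair, which coincides with the lower C value in this data.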
It produces the desired output:
A B C D E
0 94.5 4.3 26.0 79.00 NaN
1 34.0 8.8 23.0 58.00 54.5
2 54.2 5.4 25.5 9.91 50.2
3 42.2 3.5 26.0 4.91 5.1
4 32.9 5.4 24.5 45.90 79.8

Setting nan to rows in pandas dataframe based on column value

Using:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
a = pd.read_csv('file.csv', na_values=['-9999.0'], decimal=',')
a.index = pd.to_datetime(a[['Year', 'Month', 'Day', 'Hour', 'Minute']])
pd.options.mode.chained_assignment = None
The dataframe is something like:
Index A B C D
2016-07-20 18:00:00 9 4.0 NaN 2
2016-07-20 19:00:00 9 2.64 0.0 3
2016-07-20 20:00:00 12 2.59 0.0 1
2016-07-20 21:00:00 9 4.0 NaN 2
The main objective is to set the entire row to np.nan if the value in column A is 9 and the value in column D is 2 at the same time, for example:
Output expectation
Index A B C D
2016-07-20 18:00:00 NaN NaN NaN NaN
2016-07-20 19:00:00 9 2.64 0.0 3
2016-07-20 20:00:00 12 2.59 0.0 1
2016-07-20 21:00:00 NaN NaN NaN NaN
Would be thankful if someone could help.
Option 1
This is the complement of @jezrael's mask solution.
a.where(a.A.ne(9) | a.D.ne(2))
A B C D
Index
2016-07-20 18:00:00 NaN NaN NaN NaN
2016-07-20 19:00:00 9.0 2.64 0.0 3.0
2016-07-20 20:00:00 12.0 2.59 0.0 1.0
2016-07-20 21:00:00 NaN NaN NaN NaN
Option 2
pd.DataFrame.reindex
a[a.A.ne(9) | a.D.ne(2)].reindex(a.index)
A B C D
Index
2016-07-20 18:00:00 NaN NaN NaN NaN
2016-07-20 19:00:00 9.0 2.64 0.0 3.0
2016-07-20 20:00:00 12.0 2.59 0.0 1.0
2016-07-20 21:00:00 NaN NaN NaN NaN
Try this:
df.loc[df.A.eq(9) & df.D.eq(2)] = [np.nan] * len(df.columns)
Demo:
In [158]: df
Out[158]:
A B C D
Index
2016-07-20 18:00:00 9 4.00 NaN 2
2016-07-20 19:00:00 9 2.64 0.0 3
2016-07-20 20:00:00 12 2.59 0.0 1
2016-07-20 21:00:00 9 4.00 NaN 2
In [159]: df.loc[df.A.eq(9) & df.D.eq(2)] = [np.nan] * len(df.columns)
In [160]: df
Out[160]:
A B C D
Index
2016-07-20 18:00:00 NaN NaN NaN NaN
2016-07-20 19:00:00 9.0 2.64 0.0 3.0
2016-07-20 20:00:00 12.0 2.59 0.0 1.0
2016-07-20 21:00:00 NaN NaN NaN NaN
Alternatively, we can use the DataFrame.where() method:
In [174]: df = df.where(~(df.A.eq(9) & df.D.eq(2)))
In [175]: df
Out[175]:
A B C D
Index
2016-07-20 18:00:00 NaN NaN NaN NaN
2016-07-20 19:00:00 9.0 2.64 0.0 3.0
2016-07-20 20:00:00 12.0 2.59 0.0 1.0
2016-07-20 21:00:00 NaN NaN NaN NaN
Use mask, which creates NaNs by default:
df = a.mask((a['A'] == 9) & (a['D'] == 2))
print (df)
A B C D
Index
2016-07-20 18:00:00 NaN NaN NaN NaN
2016-07-20 19:00:00 9.0 2.64 0.0 3.0
2016-07-20 20:00:00 12.0 2.59 0.0 1.0
2016-07-20 21:00:00 NaN NaN NaN NaN
Or use boolean indexing to assign NaN:
a[(a['A'] == 9) & (a['D'] == 2)] = np.nan
print (a)
A B C D
Index
2016-07-20 18:00:00 NaN NaN NaN NaN
2016-07-20 19:00:00 9.0 2.64 0.0 3.0
2016-07-20 20:00:00 12.0 2.59 0.0 1.0
2016-07-20 21:00:00 NaN NaN NaN NaN
Timings:
np.random.seed(123)
N = 1000000
L = list('abcdefghijklmnopqrst'.upper())
a = pd.DataFrame(np.random.choice([np.nan,2,9], size=(N,20)), columns=L)
#jez2
In [256]: %timeit a[(a['A'] == 9) & (a['D'] == 2)] = np.nan
10 loops, best of 3: 25.8 ms per loop
#jez2upr
In [257]: %timeit a.loc[(a['A'] == 9) & (a['D'] == 2)] = np.nan
10 loops, best of 3: 27.6 ms per loop
#Wen
In [258]: %timeit a.mul(np.where((a.A==9)&(a.D==2),np.nan,1),0)
10 loops, best of 3: 90.5 ms per loop
#jez1
In [259]: %timeit a.mask((a['A'] == 9) & (a['D'] == 2))
1 loop, best of 3: 316 ms per loop
#maxu2
In [260]: %timeit a.where(~(a.A.eq(9) & a.D.eq(2)))
1 loop, best of 3: 318 ms per loop
#pir1
In [261]: %timeit a.where(a.A.ne(9) | a.D.ne(2))
1 loop, best of 3: 316 ms per loop
#pir2
In [263]: %timeit a[a.A.ne(9) | a.D.ne(2)].reindex(a.index)
1 loop, best of 3: 355 ms per loop
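The in-place boolean assignments (#jez2, #jez2upr) are roughly an order of magnitude faster than the mask/where variants here, since the latter build a full copy of the frame.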
Or you can try using .mul after np.where:
a = np.where((df2.A == 9) & (df2.D == 2), np.nan, 1)
df2.mul(a, 0)
# as one line: df2.mul(np.where((df2.A == 9) & (df2.D == 2), np.nan, 1), 0)
A B C D
Index
2016-07-20 18:00:00 NaN NaN NaN NaN
2016-07-20 19:00:00 9.0 2.64 0.0 3.0
2016-07-20 20:00:00 12.0 2.59 0.0 1.0
2016-07-20 21:00:00 NaN NaN NaN NaN

Merging pandas dataframes: empty columns created in left

I have several datasets that I am trying to merge into one. Below, I created smaller fictitious datasets to test the method, and it worked perfectly fine.
examplelog = pd.DataFrame({'Depth':[10,20,30,40,50,60,70,80],
                           'TVD':[10,19.9,28.8,37.7,46.6,55.5,64.4,73.3],
                           'T1':[11,11.3,11.5,12.,12.3,12.6,13.,13.8],
                           'T2':[11.3,11.5,11.8,12.2,12.4,12.7,13.1,14.1]})
log1 = pd.DataFrame({'Depth':[30,40,50,60],'T3':[12.1,12.6,13.7,14.]})
log2 = pd.DataFrame({'Depth':[20,30,40,50,60],'T4':[12.0,12.2,12.4,13.2,14.1]})
logs = [log1, log2]
result = examplelog.copy()
for i in logs:
    result = result.merge(i, how='left', on='Depth')
print(result)
The result is, as expected:
Depth T1 T2 TVD T3 T4
0 10 11.0 11.3 10.0 NaN NaN
1 20 11.3 11.5 19.9 NaN 12.0
2 30 11.5 11.8 28.8 12.1 12.2
3 40 12.0 12.2 37.7 12.6 12.4
4 50 12.3 12.4 46.6 13.7 13.2
5 60 12.6 12.7 55.5 14.0 14.1
6 70 13.0 13.1 64.4 NaN NaN
7 80 13.8 14.1 73.3 NaN NaN
Happy with the result, I applied this method to my actual data, but for T3 and T4 the resulting dataframe contained only empty columns (all values were NaN). I suspect the problem is floating-point precision, because my datasets were created on different machines by different software. Although "Depth" has a precision of two decimal places in all of the files, it may not be exactly 20.05 in both of them: one might hold 20.049999999999999 while the other holds 20.05000000000001. In that case the merge function will not match the rows, as shown in the following example:
examplelog = pd.DataFrame({'Depth':[10,20,30,40,50,60,70,80],
                           'TVD':[10,19.9,28.8,37.7,46.6,55.5,64.4,73.3],
                           'T1':[11,11.3,11.5,12.,12.3,12.6,13.,13.8],
                           'T2':[11.3,11.5,11.8,12.2,12.4,12.7,13.1,14.1]})
log1 = pd.DataFrame({'Depth':[30.05,40.05,50.05,60.05],'T3':[12.1,12.6,13.7,14.]})
log2 = pd.DataFrame({'Depth':[20.01,30.01,40.01,50.01,60.01],'T4':[12.0,12.2,12.4,13.2,14.1]})
logs = [log1, log2]
result = examplelog.copy()
for i in logs:
    result = result.merge(i, how='left', on='Depth')
print(result)
Depth T1 T2 TVD T3 T4
0 10 11.0 11.3 10.0 NaN NaN
1 20 11.3 11.5 19.9 NaN NaN
2 30 11.5 11.8 28.8 NaN NaN
3 40 12.0 12.2 37.7 NaN NaN
4 50 12.3 12.4 46.6 NaN NaN
5 60 12.6 12.7 55.5 NaN NaN
6 70 13.0 13.1 64.4 NaN NaN
7 80 13.8 14.1 73.3 NaN NaN
Do you know how to fix this?
Thanks!
Round the Depth values to the appropriate precision:
for df in [examplelog, log1, log2]:
    df['Depth'] = df['Depth'].round(1)
import numpy as np
import pandas as pd
examplelog = pd.DataFrame({'Depth':[10,20,30,40,50,60,70,80],
                           'TVD':[10,19.9,28.8,37.7,46.6,55.5,64.4,73.3],
                           'T1':[11,11.3,11.5,12.,12.3,12.6,13.,13.8],
                           'T2':[11.3,11.5,11.8,12.2,12.4,12.7,13.1,14.1]})
log1 = pd.DataFrame({'Depth':[30.05,40.05,50.05,60.05],'T3':[12.1,12.6,13.7,14.]})
log2 = pd.DataFrame({'Depth':[20.01,30.01,40.01,50.01,60.01],
                     'T4':[12.0,12.2,12.4,13.2,14.1]})

for df in [examplelog, log1, log2]:
    df['Depth'] = df['Depth'].round(1)

logs = [log1, log2]
result = examplelog.copy()
for i in logs:
    result = result.merge(i, how='left', on='Depth')
print(result)
yields
Depth T1 T2 TVD T3 T4
0 10 11.0 11.3 10.0 NaN NaN
1 20 11.3 11.5 19.9 NaN 12.0
2 30 11.5 11.8 28.8 12.1 12.2
3 40 12.0 12.2 37.7 12.6 12.4
4 50 12.3 12.4 46.6 13.7 13.2
5 60 12.6 12.7 55.5 14.0 14.1
6 70 13.0 13.1 64.4 NaN NaN
7 80 13.8 14.1 73.3 NaN NaN
Per the comments, rounding does not appear to work for the OP on the actual
data. To debug the problem, find some rows which should merge:
subframes = []
for frame in [examplelog, log2]:
    mask = (frame['Depth'] < 20.051) & (frame['Depth'] >= 20.0)
    subframes.append(frame.loc[mask])
Then post the output of:
for frame in subframes:
    print(frame.to_dict('list'))
    print(frame.info())  # shows the dtypes of the columns
This might give us the info we need to reproduce the problem.
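Meanwhile, if exact-key matching keeps failing, an approximate nearest-key join sidesteps the precision issue entirely. A sketch using pd.merge_asof, assuming a tolerance of 0.1 is acceptable for this data (merge_asof is a left join that matches each key to the nearest right-hand key within the tolerance, and it requires sorted keys of the same dtype):
result = examplelog.copy()
result['Depth'] = result['Depth'].astype(float)  # match key dtypes across frames
for log in [log1, log2]:
    result = pd.merge_asof(result.sort_values('Depth'),
                           log.sort_values('Depth'),
                           on='Depth', direction='nearest', tolerance=0.1)
print(result)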
