Removing samples based on unique and nan values - python

I have a data-frame like this:
dtf:
id f1 f2 f3 f4 f5
t1 34 12 5 nan 6
t1 nan 4 2 9 7
t1 34 nan 5 nan 6
t2 nan nan nan nan nan
t2 nan nan nan nan nan
t2 nan nan nan nan nan
t3 23 7 8 1 32
t3 12 3 nan 45 56
t3 nan nan nan nan nan
I want to remove the rows for any id whose feature values are 'nan' in every row (like t2). My desired data-frame should then look like this:
dtf_new:
id f1 f2 f3 f4 f5
t1 34 12 5 nan 6
t1 nan 4 2 9 7
t1 34 nan 5 nan 6
t3 23 7 8 1 32
t3 12 3 nan 45 56
t3 nan nan nan nan nan
I have tried converting it to a dictionary using the code below and then looking for the nan values, but I still could not find the right solution.
dict=dict(enumerate(dtf.id.unique()))

You could do groupby and isna:
>>> dtf
id f1 f2 f3 f4 f5
0 t1 34.0 12.0 5.0 NaN 6.0
1 t1 NaN 4.0 2.0 9.0 7.0
2 t1 34.0 NaN 5.0 NaN 6.0
3 t2 NaN NaN NaN NaN NaN
4 t2 NaN NaN NaN NaN NaN
5 t2 NaN NaN NaN NaN NaN
6 t3 23.0 7.0 8.0 1.0 32.0
7 t3 12.0 3.0 NaN 45.0 56.0
8 t3 NaN NaN NaN NaN NaN
>>> dtf_new = dtf[~dtf['id'].map(dtf.groupby('id').apply(lambda x: x.drop(columns='id').isna().all(axis=None)))]
>>> dtf_new
id f1 f2 f3 f4 f5
0 t1 34.0 12.0 5.0 NaN 6.0
1 t1 NaN 4.0 2.0 9.0 7.0
2 t1 34.0 NaN 5.0 NaN 6.0
6 t3 23.0 7.0 8.0 1.0 32.0
7 t3 12.0 3.0 NaN 45.0 56.0
8 t3 NaN NaN NaN NaN NaN
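As an aside, a sketch of an equivalent approach using transform instead of groupby.apply, which tends to be faster on large frames (it assumes the feature columns are everything except id):
# Flag rows whose feature columns are all NaN, then mark ids where every
# row in the group is flagged; keep the rest.
feature_cols = dtf.columns.drop('id')
all_nan_ids = dtf[feature_cols].isna().all(axis=1).groupby(dtf['id']).transform('all')
dtf_new = dtf[~all_nan_ids]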

#sacse is right, dropna does the job
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html
Just change the default "how" parameter...
These kinds of processing needs are basic and common to most pandas users, so you can assume there is a feature for it; a few minutes in the documentation will find your answer, and other interesting features along the way :)
Worth some time reading!
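For the record, a sketch of that dropna suggestion, using the column names from the example above. One caveat: dropna(how='all') on the feature columns drops every all-NaN row individually, so unlike the accepted answer it would also drop t3's last row:
# Drop rows where all feature columns are NaN (id is excluded via subset).
dtf.dropna(subset=['f1', 'f2', 'f3', 'f4', 'f5'], how='all')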

Related

Setting Values with pandas DataFrame.loc

Consider I have a data frame :
>>> data
c0 c1 c2 _c1 _c2
0 0 1 2 18.0 19.0
1 3 4 5 NaN NaN
2 6 7 8 20.0 21.0
3 9 10 11 NaN NaN
4 12 13 14 NaN NaN
5 15 16 17 NaN NaN
I want to update the values in the c1 and c2 columns with the values in the _c1 and _c2 columns whenever those latter values are not NaN. Why won't the following work, and what is the correct way to do this?
>>> data.loc[~(data._c1.isna()),['c1','c2']]=data.loc[~(data._c1.isna()),['_c1','_c2']]
>>> data
c0 c1 c2 _c1 _c2
0 0 NaN NaN 18.0 19.0
1 3 4.0 5.0 NaN NaN
2 6 NaN NaN 20.0 21.0
3 9 10.0 11.0 NaN NaN
4 12 13.0 14.0 NaN NaN
5 15 16.0 17.0 NaN NaN
For completeness's sake I want the result to look like
>>> data.loc[~(data._c1.isna()),['c1','c2']]=data.loc[~(data._c1.isna()),['_c1','_c2']]
>>> data
c0 c1 c2 _c1 _c2
0 0 18.0 19.0 18.0 19.0
1 3 4.0 5.0 NaN NaN
2 6 20.0 21.0 20.0 21.0
3 9 10.0 11.0 NaN NaN
4 12 13.0 14.0 NaN NaN
5 15 16.0 17.0 NaN NaN
I recommend update after rename
df.update(df[['_c1','_c2']].rename(columns={'_c1':'c1','_c2':'c2'}))
df
Out[266]:
c0 c1 c2 _c1 _c2
0 0 18.0 19.0 18.0 19.0
1 3 4.0 5.0 NaN NaN
2 6 20.0 21.0 20.0 21.0
3 9 10.0 11.0 NaN NaN
4 12 13.0 14.0 NaN NaN
5 15 16.0 17.0 NaN NaN
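A related sketch, if you prefer building the result explicitly over mutating in place with update: combine_first takes values from the first frame where they are not NaN and falls back to the second (names match the example above):
# Rename the override columns so they align with c1/c2, then prefer
# their non-NaN values over the originals.
overrides = data[['_c1', '_c2']].rename(columns={'_c1': 'c1', '_c2': 'c2'})
data[['c1', 'c2']] = overrides.combine_first(data[['c1', 'c2']])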
You can use np.where:
df[['c1', 'c2']] = np.where(df[['_c1', '_c2']].notna(),
df[['_c1', '_c2']],
df[['c1', 'c2']])
print(df)
# Output:
c0 c1 c2 _c1 _c2
0 0 18.0 19.0 18.0 19.0
1 3 4.0 5.0 NaN NaN
2 6 20.0 21.0 20.0 21.0
3 9 10.0 11.0 NaN NaN
4 12 13.0 14.0 NaN NaN
5 15 16.0 17.0 NaN NaN
Update
Do you know by any chance WHY the above doesn't work the way I thought it would?
The column names on the left and right sides of your expression are different, so pandas can't align the values even though the shapes are the same.
# Left side of your expression
>>> data.loc[~(data._c1.isna()),['c1','c2']]
c1 c2 # <- note the column names
0 18.0 19.0
2 20.0 21.0
# Right side of your expression
>>> data.loc[~(data._c1.isna()),['_c1','_c2']]
_c1 _c2 # <- Your column names are different from the left side
0 18.0 19.0
2 20.0 21.0
How to solve it? Simply use .values on the right side. Since the right side is then a plain array with no row/column labels, pandas uses the shape to set the values.
data.loc[~(data._c1.isna()),['c1','c2']] = \
data.loc[~(data._c1.isna()),['_c1','_c2']].values
print(data)
# Output:
c0 c1 c2 _c1 _c2
0 0 18.0 19.0 18.0 19.0
1 3 4.0 5.0 NaN NaN
2 6 20.0 21.0 20.0 21.0
3 9 10.0 11.0 NaN NaN
4 12 13.0 14.0 NaN NaN
5 15 16.0 17.0 NaN NaN

Is there a way for inserting/adding NaN rows and columns on a DataFrame?

I want to turn a DataFrame (or a numpy array):
df1:
0 1 2
0 1. 5. 9.
1 2. 6. 10.
2 3. 7. 11.
3 4. 8. 12.
into a DataFrame like this:
df1
0 1 2 3 4 5 6
0 NaN NaN NaN NaN NaN NaN NaN
1 NaN 1. NaN 5. NaN 9. NaN
2 NaN NaN NaN NaN NaN NaN NaN
3 NaN 2. NaN 6. NaN 10. NaN
4 NaN NaN NaN NaN NaN NaN NaN
5 NaN 3. NaN 7. NaN 11. NaN
6 NaN NaN NaN NaN NaN NaN NaN
7 NaN 4. NaN 8. NaN 12. NaN
8 NaN NaN NaN NaN NaN NaN NaN
i.e., I want to insert NaN rows and columns into df1 (as many as I want).
Could you make this work even for a large DataFrame, where you cannot do this manually?
So far, I have this:
import numpy as np
import pandas as pd
p = np.arange(1,13).reshape(4,3)
p1 = pd.DataFrame(p)
#Add a row of NaN's on p1
p1.index = range(1, 2*len(p1)+1, 2)
p1 = p1.reindex(index=range(2*len(p1)))
#Repeat for rows... I know it's a bit clumsy
p1 = pd.DataFrame(p1)
p1.index = range(1, 2*len(p1)+1, 2)
p1 = p1.reindex(index=range(2*len(p1)))
#etc...
p1 = pd.DataFrame(p1)
p1.index = range(1, 2*len(p1)+1, 2)
p1 = p1.reindex(index=range(2*len(p1)))
It seems to work, but only for rows until now...
e.g., see this
Based on this answer you can interleave two dataframes on a particular axis.
pd.concat([df1, df2]).sort_index().reset_index(drop=True)
You can start by interleaving df1 by rows (axis=0) with a dataframe of nan values, then do the same on the columns (axis=1) with another dataframe of nan values.
df1 = pd.DataFrame([[1., 5., 9.], [2., 6., 10.], [3., 7., 11.], [4., 8., 12.]])
rows, cols = df1.shape
Tricky part is getting the sizes right:
nan1 = pd.DataFrame([[np.nan]*cols]*(rows+1))
nan2 = pd.DataFrame([[np.nan]*(cols + 1)]*(2*rows + 1))
Then perform two consecutives concatenations, on axis=0 (default one) and axis=1:
df2_r = pd.concat([nan1, df1]).sort_index().reset_index(drop=True)
df2 = pd.concat([nan2, df2_r], axis=1).sort_index(axis=1).T.reset_index(drop=True).T
Edit: it seems there is no built-in method to reset the column indexing. However, this will do:
df.T.reset_index(drop=True).T
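As an aside, on reasonably recent pandas versions set_axis can relabel the columns without the double transpose (a sketch, not what the original answer used):
# Reset the column labels to 0..n-1 in place of df.T.reset_index(drop=True).T
df2 = df2.set_axis(range(df2.shape[1]), axis=1)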
Here are the results for each operation:
df1
0 1 2
0 1.0 5.0 9.0
1 2.0 6.0 10.0
2 3.0 7.0 11.0
3 4.0 8.0 12.0
nan1
0 1 2
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
concat on axis=0
0 1 2
0 NaN NaN NaN
1 1.0 5.0 9.0
2 NaN NaN NaN
3 2.0 6.0 10.0
4 NaN NaN NaN
5 3.0 7.0 11.0
6 NaN NaN NaN
7 4.0 8.0 12.0
8 NaN NaN NaN
nan2
0 1 2 3
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
5 NaN NaN NaN NaN
6 NaN NaN NaN NaN
7 NaN NaN NaN NaN
8 NaN NaN NaN NaN
concat on axis=1
0 1 2 3 4 5 6
0 NaN NaN NaN NaN NaN NaN NaN
1 NaN 1.0 NaN 5.0 NaN 9.0 NaN
2 NaN NaN NaN NaN NaN NaN NaN
3 NaN 2.0 NaN 6.0 NaN 10.0 NaN
4 NaN NaN NaN NaN NaN NaN NaN
5 NaN 3.0 NaN 7.0 NaN 11.0 NaN
6 NaN NaN NaN NaN NaN NaN NaN
7 NaN 4.0 NaN 8.0 NaN 12.0 NaN
8 NaN NaN NaN NaN NaN NaN NaN
I am curious to see what you have tried so far, but here is an easy "quick and dirty" way to do it for your example. This is not a definitive answer: I'll let you figure out how to generalize it to any dataframe sizes/content you might have.
I am providing this code for your example so you have an idea which pandas functions/properties to use.
import pandas as pd
import numpy as np
# Making your base DataFrame
df = pd.DataFrame([[1,5,9], [2,6,8], [3,7,4]])
df:
0 1 2
0 1 5 9
1 2 6 8
2 3 7 4
Spacing out your existing column numbers and filling the new column numbers with NaN:
df.columns = [1, 3, 5]
for i in range(0, 8, 2):
    df[i] = np.NaN
df:
1 3 5 0 2 4 6
0 1 5 9 NaN NaN NaN NaN
1 2 6 8 NaN NaN NaN NaN
2 3 7 4 NaN NaN NaN NaN
Now add the extra rows of NaN data (we need 4 more, with 7 columns each):
df2 = pd.DataFrame([[np.NaN] * 7] * 4)
df3 = pd.concat([df, df2])
df3:
0 1 2 3 4 5 6
0 NaN 1.0 NaN 5.0 NaN 9.0 NaN
1 NaN 2.0 NaN 6.0 NaN 8.0 NaN
2 NaN 3.0 NaN 7.0 NaN 4.0 NaN
0 NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN
As you can see: we have the right data, and it is now only a matter of ordering your rows.
df3.index = [1,3,5,0,2,4,6]
df3 = df3.sort_index()
df3:
0 1 2 3 4 5 6
0 NaN NaN NaN NaN NaN NaN NaN
1 NaN 1.0 NaN 5.0 NaN 9.0 NaN
2 NaN NaN NaN NaN NaN NaN NaN
3 NaN 2.0 NaN 6.0 NaN 8.0 NaN
4 NaN NaN NaN NaN NaN NaN NaN
5 NaN 3.0 NaN 7.0 NaN 4.0 NaN
6 NaN NaN NaN NaN NaN NaN NaN
I think this is a very elegant way to solve this.
array = np.array([[1, 5, 9], [2, 6, 10], [3, 7, 11], [4, 8, 12]])
Data = pd.DataFrame(array)
Data.index = Data.index * 2 + 1
Data.columns = Data.columns * 2 + 1
Data = Data.reindex(list(range(0, 9)))
Data = Data.T.reindex(list(range(0, 7))).T
A fast way using numpy (works with a dataframe as well):
# Sample data
a = np.arange(1,13).reshape(4,3)
df = pd.DataFrame(a)
# New data with empty values
a2 = np.empty([i*2+1 for i in a.shape])
a2[:] = np.nan
a2[1::2, 1::2] = a
Output of pd.DataFrame(a2):
0 1 2 3 4 5 6
0 NaN NaN NaN NaN NaN NaN NaN
1 NaN 1.0 NaN 2.0 NaN 3.0 NaN
2 NaN NaN NaN NaN NaN NaN NaN
3 NaN 4.0 NaN 5.0 NaN 6.0 NaN
4 NaN NaN NaN NaN NaN NaN NaN
5 NaN 7.0 NaN 8.0 NaN 9.0 NaN
6 NaN NaN NaN NaN NaN NaN NaN
7 NaN 10.0 NaN 11.0 NaN 12.0 NaN
8 NaN NaN NaN NaN NaN NaN NaN
Note: If you have a DataFrame, just replace a.shape by df.shape, and a by df.values.
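Building on that, a small sketch that generalizes the slicing trick to any number k of interleaved NaN rows/columns; insert_nan is a hypothetical helper name, not a pandas API:
import numpy as np
import pandas as pd

def insert_nan(df, k=1):
    # Allocate an all-NaN grid with k NaN rows/columns between and around
    # the data, then drop the original values into every (k+1)-th slot.
    rows, cols = df.shape
    out = np.full(((k + 1) * rows + k, (k + 1) * cols + k), np.nan)
    out[k::k + 1, k::k + 1] = df.to_numpy()
    return pd.DataFrame(out)

# insert_nan(df, 1) reproduces the 9x7 result above from the 4x3 input.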

Merge rows of the same dataframe by columns

I am new to Python and I can't solve the following problem.
I have a dataframe that looks like that:
ID X1 x2 x3
1 15 NaN NaN
2 NaN 2 NaN
3 NaN NaN 5
1 NaN 16 NaN
2 1 NaN NaN
3 6 NaN NaN
4 NaN NaN 75
5 NaN 67 NaN
I want to merge the rows by ID; the result should look like this:
ID x1 x2 x3
1 15 16 NaN
2 1 2 NaN
3 6 NaN 5
4 NaN NaN 75
5 NaN 67 NaN
I have tried a lot with df.groupby("ID"), without success.
Can someone fix that and supply the code for me? Thanks.
You can change your existing groupby like this. You can remove the replace part if you would like 0.0 instead of NaN:
import numpy as np
df = df.fillna(0).astype(int).groupby('ID').sum().replace(0,np.nan)
print(df)
Output:
ID X1 x2 x3
1 15.0 16.0 NaN
2 1.0 2.0 NaN
3 6.0 NaN 5.0
4 NaN NaN 75.0
5 NaN 67.0 NaN
If you don't want ID as index, you can add reset_index:
import numpy as np
df = df.fillna(0).astype(int).groupby('ID').sum().replace(0,np.nan).reset_index()
print(df)
Output:
ID X1 x2 x3
0 1 15.0 16.0 NaN
1 2 1.0 2.0 NaN
2 3 6.0 NaN 5.0
3 4 NaN NaN 75.0
4 5 NaN 67.0 NaN
Try this:
df1 = df.groupby('ID',as_index=False,sort=False).last()
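That works because last() skips NaN within each group. Equivalently, first() takes the first non-NaN value per column, and either one avoids the fillna(0) round-trip above, which would turn genuine zeros in the data into NaN:
df1 = df.groupby('ID', as_index=False).first()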

Pandas row filtering reproduces entire table and turns data into NaNs [duplicate]

This question already has answers here:
How to change column names in pandas Dataframe using a list of names?
(5 answers)
Closed 3 years ago.
I have a Pandas dataframe with several columns. I want to create a new dataframe which contains all the rows in the original dataframe for which the boolean value "Present" is True.
Normally the way you are supposed to do this is by calling grades[grades['Present']], but I get the following unexpected result:
It reproduces the entire dataframe, except that it changes the True values in the "Present" column to 1s (the False ones become NaNs).
Any idea why this might be happening?
Here is my full script:
import pandas as pd
# read CSV and clean up data
grades = pd.read_csv("2학기 speaking test grades - 2·3학년.csv")
grades = grades[["Year","Present?","내용 / 30","유찬성 / 40","태도 / 30"]]
grades.columns = [["Year","Present","Content","Fluency","Attitude"]]
# Change integer Present to a boolean
grades['Present']=grades['Present']==1
print(grades.head())
print(grades.dtypes)
print(grades[grades['Present']])
And terminal output:
Year Present Content Fluency Attitude
0 2 True 30.0 40.0 30.0
1 2 True 30.0 40.0 30.0
2 2 True 30.0 40.0 30.0
3 2 True 30.0 40.0 30.0
4 2 True 30.0 40.0 30.0
Year int64
Present bool
Content float64
Fluency float64
Attitude float64
dtype: object
Year Present Content Fluency Attitude
0 NaN 1.0 NaN NaN NaN
1 NaN 1.0 NaN NaN NaN
2 NaN 1.0 NaN NaN NaN
3 NaN 1.0 NaN NaN NaN
4 NaN 1.0 NaN NaN NaN
5 NaN 1.0 NaN NaN NaN
6 NaN 1.0 NaN NaN NaN
7 NaN 1.0 NaN NaN NaN
8 NaN 1.0 NaN NaN NaN
9 NaN 1.0 NaN NaN NaN
10 NaN 1.0 NaN NaN NaN
11 NaN 1.0 NaN NaN NaN
12 NaN 1.0 NaN NaN NaN
13 NaN 1.0 NaN NaN NaN
14 NaN 1.0 NaN NaN NaN
15 NaN 1.0 NaN NaN NaN
16 NaN 1.0 NaN NaN NaN
17 NaN 1.0 NaN NaN NaN
18 NaN 1.0 NaN NaN NaN
19 NaN 1.0 NaN NaN NaN
20 NaN 1.0 NaN NaN NaN
21 NaN 1.0 NaN NaN NaN
22 NaN 1.0 NaN NaN NaN
23 NaN 1.0 NaN NaN NaN
24 NaN 1.0 NaN NaN NaN
25 NaN 1.0 NaN NaN NaN
26 NaN 1.0 NaN NaN NaN
27 NaN 1.0 NaN NaN NaN
28 NaN 1.0 NaN NaN NaN
29 NaN 1.0 NaN NaN NaN
.. ... ... ... ... ...
91 NaN NaN NaN NaN NaN
92 NaN NaN NaN NaN NaN
93 NaN 1.0 NaN NaN NaN
94 NaN 1.0 NaN NaN NaN
95 NaN NaN NaN NaN NaN
96 NaN 1.0 NaN NaN NaN
97 NaN 1.0 NaN NaN NaN
98 NaN 1.0 NaN NaN NaN
99 NaN 1.0 NaN NaN NaN
100 NaN 1.0 NaN NaN NaN
101 NaN 1.0 NaN NaN NaN
102 NaN 1.0 NaN NaN NaN
103 NaN 1.0 NaN NaN NaN
104 NaN 1.0 NaN NaN NaN
105 NaN 1.0 NaN NaN NaN
106 NaN 1.0 NaN NaN NaN
107 NaN 1.0 NaN NaN NaN
108 NaN 1.0 NaN NaN NaN
109 NaN 1.0 NaN NaN NaN
110 NaN 1.0 NaN NaN NaN
111 NaN 1.0 NaN NaN NaN
112 NaN 1.0 NaN NaN NaN
113 NaN 1.0 NaN NaN NaN
114 NaN 1.0 NaN NaN NaN
115 NaN 1.0 NaN NaN NaN
116 NaN 1.0 NaN NaN NaN
117 NaN 1.0 NaN NaN NaN
118 NaN 1.0 NaN NaN NaN
119 NaN 1.0 NaN NaN NaN
120 NaN 1.0 NaN NaN NaN
[121 rows x 5 columns]
Here is the CSV file. SE won't let me upload it directly, so if you paste it into your own CSV file you'll need to modify the Python code above to specify that it's in the EUC-KR encoding, like so: pd.read_csv("paste.csv",encoding="EUC-KR")
Year,Class,Year / class * presence (used to filter for averages),Present?,내용 / 30,유찬성 / 40,태도 / 30,Total,,,Averages (평균점),,
2,2,22,1,30,40,30,100,,,Grade distribution (점수 막대 그래프),,
2,2,22,1,30,40,30,100,,,The graph below includes the scores of all students in grades 2 and 3. ,,
2,2,22,1,30,40,30,100,,,아래 그래프에는 2·3학년에서 모든 학생의 점수가 정리됩니다.,,
2,2,22,1,30,40,30,100,,,,,
2,2,22,1,30,40,30,100,,,,,
2,1,21,1,30,40,30,100,,,,,
2,1,21,1,30,40,30,100,,,,,
2,1,21,1,30,40,30,100,,,,,
2,1,21,1,30,40,30,100,,,,,
2,1,21,1,30,40,30,100,,,,,
2,1,21,1,30,40,30,100,,,,,
3,2,32,1,30,40,30,100,,,,,
3,2,32,1,30,40,30,100,,,,,
3,2,32,1,30,40,30,100,,,,,
3,2,32,1,30,40,30,100,,,,,
3,2,32,1,30,40,30,100,,,,,
3,2,32,1,30,40,30,100,,,,,
3,2,32,1,30,40,30,100,,,,,
3,2,32,1,30,40,30,100,,,,,
2,2,22,1,30,30,30,90,,,,,
2,2,22,1,30,30,30,90,,,,,
2,2,22,1,30,30,30,90,,,,,
2,2,22,1,30,30,30,90,,,Average scores (평균점),,
2,2,22,1,30,30,30,90,,,These averages only count students who were present for the test.,,
2,2,22,1,30,30,30,90,,,평균점에는 참석한 학생의 점수만 포함됩니다.,,
2,2,22,1,30,30,30,90,,,,,
2,2,22,1,30,30,30,90,,,2학년 1반,,77.1
2,1,21,1,30,30,30,90,,,2학년 2반,,77.6
2,1,21,1,30,30,30,90,,,3학년 1반,,71.5
2,1,21,1,30,30,30,90,,,3학년 2반,,77.4
2,1,21,1,30,30,30,90,,,,,
2,1,21,1,30,30,30,90,,,,,
2,1,21,1,30,30,30,90,,,,,
2,1,21,1,30,30,30,90,,,,,
2,1,21,1,30,30,30,90,,,,,
2,1,21,1,30,30,30,90,,,,,
2,1,21,1,30,30,30,90,,,,,
2,1,21,1,30,30,30,90,,,,,
3,2,32,1,30,30,30,90,,,,,
3,2,32,1,30,30,30,90,,,,,
3,2,32,1,30,30,30,90,,,,,
3,2,32,1,30,30,30,90,,,,,
3,2,32,1,20,40,30,90,,,,,
2,2,22,1,20,30,30,80,,,,,
2,2,22,1,20,30,30,80,,,,,
2,2,22,1,30,20,30,80,,,,,
2,2,22,1,30,20,30,80,,,,,
2,2,22,1,30,30,20,80,,,,,
2,2,22,1,30,20,30,80,,,,,
2,1,21,1,20,30,30,80,,,,,
2,1,21,1,20,30,30,80,,,,,
2,1,21,1,30,30,20,80,,,,,
3,2,32,1,20,30,30,80,,,,,
3,2,32,1,30,20,30,80,,,,,
3,2,32,1,20,30,30,80,,,,,
3,2,32,1,30,30,20,80,,,,,
3,2,32,1,30,20,30,80,,,,,
2,2,22,1,10,30,30,70,,,,,
2,2,22,1,20,20,30,70,,,,,
2,2,22,1,30,20,20,70,,,,,
2,2,22,1,20,20,30,70,,,,,
2,2,22,1,20,20,30,70,,,,,
3,2,32,1,30,10,30,70,,,,,
3,2,32,1,20,30,20,70,,,,,
3,2,32,1,20,20,30,70,,,,,
2,1,21,1,20,20,20,60,,,,,
2,1,21,1,10,20,30,60,,,,,
2,2,22,1,10,20,20,50,,,,,
2,2,22,1,10,10,30,50,,,,,
2,1,21,1,10,10,30,50,,,,,
2,1,21,1,20,20,10,50,,,,,
3,2,32,1,10,10,30,50,,,,,
3,2,32,1,10,10,30,50,,,,,
2,2,22,1,10,0,30,40,,,,,
2,1,21,1,10,0,30,40,,,,,
3,2,32,1,10,0,30,40,,,,,
3,2,32,1,10,10,20,40,,,,,
2,2,22,1,0,0,30,30,,,,,
2,1,21,1,0,0,30,30,,,,,
2,1,21,1,0,0,30,30,,,,,
3,2,32,1,0,0,30,30,,,,,
3,2,32,1,0,0,20,20,,,,,
2,1,21,1,0,0,10,10,,,,,
2,2,22,1,0,0,30,30,,,,,
2,2,0,0,,,,0,,,,,
2,2,0,0,,,,0,,,,,
2,1,0,0,,,,0,,,,,
2,1,0,0,,,,0,,,,,
2,1,0,0,,,,0,,,,,
2,1,0,0,,,,0,,,,,
3,2,0,0,,,,0,,,,,
3,2,0,0,,,,0,,,,,
3,1,0,0,,,,0,,,,,
3,1,31,1,30,20,30,80,,,,,
3,1,31,1,0,0,30,30,,,,,
3,1,0,0,,,,0,,,,,
3,1,31,1,30,20,10,60,,,,,
3,1,31,1,30,30,30,90,,,,,
3,1,31,1,30,30,30,90,,,,,
3,1,31,1,20,20,20,60,,,,,
3,1,31,1,30,20,30,80,,,,,
3,1,31,1,30,40,30,100,,,,,
3,1,31,1,30,30,30,90,,,,,
3,1,31,1,30,30,30,90,,,,,
3,1,31,1,30,30,30,90,,,,,
3,1,31,1,20,30,20,70,,,,,
3,1,31,1,30,30,30,90,,,,,
3,1,31,1,30,40,30,100,,,,,
3,1,31,1,30,20,10,60,,,,,
3,1,31,1,20,10,20,50,,,,,
3,1,31,1,30,20,30,80,,,,,
3,1,31,1,0,0,20,20,,,,,
3,1,31,1,30,30,30,90,,,,,
3,1,31,1,30,30,30,90,,,,,
3,1,31,1,30,30,30,90,,,,,
3,1,31,1,0,0,20,20,,,,,
3,1,31,1,20,10,10,40,,,,,
3,1,31,1,30,30,30,90,,,,,
3,1,31,1,20,20,30,70,,,,,
3,1,31,1,30,20,10,60,,,,,
3,1,31,1,10,10,30,50,,,,,
Thank you.
You forgot to filter the Present column by True.
You can do it this way.
grades = grades[grades["Present"] == True]
If the boolean is stored as a string, compare against the string "True" instead:
grades = grades[grades["Present"] == "True"]
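Given the duplicate this question was closed against, the likely root cause is the column assignment grades.columns = [[...]]: the double brackets create a MultiIndex, so grades['Present'] returns a one-column DataFrame rather than a Series, and indexing with a boolean DataFrame masks cell-by-cell (like df.where) instead of filtering rows, which is exactly the NaN-filled output above. A sketch of the fix:
# Assign a flat list of names, not a list of lists.
grades.columns = ["Year", "Present", "Content", "Fluency", "Attitude"]
grades[grades['Present']]  # boolean Series -> row filtering works as expected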

Pandas Dataframe interpolating in sections delimited by indexes

My sample code is as follows:
import pandas as pd
dictx = {'col1': [1, 'nan', 'nan', 'nan', 5, 'nan', 7, 'nan', 9, 'nan', 'nan', 'nan', 13],
         'col2': [20, 'nan', 'nan', 'nan', 22, 'nan', 25, 'nan', 30, 'nan', 'nan', 'nan', 25],
         'col3': [15, 'nan', 'nan', 'nan', 10, 'nan', 14, 'nan', 13, 'nan', 'nan', 'nan', 9]}
df = pd.DataFrame(dictx).astype(float)
I'm trying to interpolate various segments which contain the value 'nan'.
For context, I'm trying to track bus speeds using GPS data provided by the city (São Paulo, Brazil), but the data is scarce, with stretches that provide no information at all, as in the example. There are also segments where I know for a fact the buses are stopped, such as at dawn, but that information comes through as 'nan' as well.
What I need:
I've been experimenting with dataframe.interpolate() parameters (limit and limit_direction) but came up short. If I set df.interpolate(limit=2), I interpolate not only the data that I need but also data that I shouldn't. So I need to interpolate only within sections whose gaps are bounded by a limit.
Desired output:
Out[7]:
col1 col2 col3
0 1.0 20.00 15.00
1 nan nan nan
2 nan nan nan
3 nan nan nan
4 5.0 22.00 10.00
5 6.0 23.50 12.00
6 7.0 25.00 14.00
7 8.0 27.50 13.50
8 9.0 30.00 13.00
9 nan nan nan
10 nan nan nan
11 nan nan nan
12 13.0 25.00 9.00
The logic that I've been trying to apply is basically to find the nan's, calculate the difference between their indexes, create a temporary dataframe to interpolate, and only then add it into a new final dataframe. But this has become hard to achieve, partly because NaN == NaN returns False.
This is a hack but may still be useful. Likely Pandas 0.23 will have a better solution.
https://pandas-docs.github.io/pandas-docs-travis/whatsnew.html#dataframe-interpolate-has-gained-the-limit-area-kwarg
df_fw = df.interpolate(limit=1)
df_bk = df.interpolate(limit=1, limit_direction='backward')
df_fw.where(df_bk.notna())
col1 col2 col3
0 1.0 20.0 15.0
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 5.0 22.0 10.0
5 6.0 23.5 12.0
6 7.0 25.0 14.0
7 8.0 27.5 13.5
8 9.0 30.0 13.0
9 NaN NaN NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 13.0 25.0 9.0
Not a Hack
More legitimate way of handling it.
Generalized to handle any limit.
def interp(df, limit):
    # For each position: does any real value appear in the window of the
    # last limit+1 rows? (NaN runs longer than limit yield False somewhere.)
    d = df.notna().rolling(limit + 1).agg(any).fillna(1)
    # A position sits inside a too-long NaN run iff some window of limit+1
    # rows covering it is all-NaN; the product over shifted copies is 0 there.
    # (On pandas 2.x, replace .prod(level=1) with .groupby(level=1).prod().)
    d = pd.concat({
        i: d.shift(-i).fillna(1)
        for i in range(limit + 1)
    }).prod(level=1)
    # Interpolate, then keep only positions outside too-long NaN runs.
    return df.interpolate(limit=limit).where(d.astype(bool))
df.pipe(interp, 1)
col1 col2 col3
0 1.0 20.0 15.0
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 5.0 22.0 10.0
5 6.0 23.5 12.0
6 7.0 25.0 14.0
7 8.0 27.5 13.5
8 9.0 30.0 13.0
9 NaN NaN NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 13.0 25.0 9.0
Can also handle variation in NaN from column to column. Consider a different df
dictx = {'col1': [1, 'nan', 'nan', 'nan', 5, 'nan', 'nan', 7, 'nan', 9, 'nan', 'nan', 'nan', 13],
         'col2': [20, 'nan', 'nan', 'nan', 22, 'nan', 25, 'nan', 'nan', 30, 'nan', 'nan', 'nan', 25],
         'col3': [15, 'nan', 'nan', 'nan', 10, 'nan', 14, 'nan', 13, 'nan', 'nan', 'nan', 9, 'nan']}
df = pd.DataFrame(dictx).astype(float)
df
col1 col2 col3
0 1.0 20.0 15.0
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 5.0 22.0 10.0
5 NaN NaN NaN
6 NaN 25.0 14.0
7 7.0 NaN NaN
8 NaN NaN 13.0
9 9.0 30.0 NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 NaN NaN 9.0
13 13.0 25.0 NaN
Then with limit=1
df.pipe(interp, 1)
col1 col2 col3
0 1.0 20.0 15.0
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 5.0 22.0 10.0
5 NaN 23.5 12.0
6 NaN 25.0 14.0
7 7.0 NaN 13.5
8 8.0 NaN 13.0
9 9.0 30.0 NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 NaN NaN 9.0
13 13.0 25.0 9.0
And with limit=2
df.pipe(interp, 2).round(2)
col1 col2 col3
0 1.00 20.00 15.0
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 5.00 22.00 10.0
5 5.67 23.50 12.0
6 6.33 25.00 14.0
7 7.00 26.67 13.5
8 8.00 28.33 13.0
9 9.00 30.00 NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 NaN NaN 9.0
13 13.00 25.00 9.0
Here is a way to selectively ignore rows which are consecutive runs of NaNs whose length is greater than a certain size (given by limit):
import numpy as np
import pandas as pd
dictx = {'col1': [1, 'nan', 'nan', 'nan', 5, 'nan', 7, 'nan', 9, 'nan', 'nan', 'nan', 13],
         'col2': [20, 'nan', 'nan', 'nan', 22, 'nan', 25, 'nan', 30, 'nan', 'nan', 'nan', 25],
         'col3': [15, 'nan', 'nan', 'nan', 10, 'nan', 14, 'nan', 13, 'nan', 'nan', 'nan', 9]}
df = pd.DataFrame(dictx).astype(float)
limit = 2
notnull = pd.notnull(df).all(axis=1)
# assign group numbers to the rows of df. Each group starts with a non-null row,
# followed by null rows
group = notnull.cumsum()
# find the index of groups having length > limit
ignore = (df.groupby(group).filter(lambda grp: len(grp)>limit)).index
# only ignore rows which are null
ignore = df.loc[~notnull].index.intersection(ignore)
keep = df.index.difference(ignore)
# interpolate only the kept rows
df.loc[keep] = df.loc[keep].interpolate()
print(df)
prints
col1 col2 col3
0 1.0 20.0 15.0
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 5.0 22.0 10.0
5 6.0 23.5 12.0
6 7.0 25.0 14.0
7 8.0 27.5 13.5
8 9.0 30.0 13.0
9 NaN NaN NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 13.0 25.0 9.0
By changing the value of limit you can control how big the group has to be before it should be ignored.
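For comparison, a hedged sketch of the same idea in a reusable form: label each NaN run, measure its length with transform, interpolate, then blank out the runs longer than the limit:
import numpy as np

def interp_runs(df, limit):
    # Interpolate everything, then re-NaN members of runs longer than limit.
    out = df.interpolate(limit=limit)
    for col in df.columns:
        isna = df[col].isna()
        # Rows in the same NaN run share a group id; transform('sum') is the run length.
        run_len = isna.groupby((~isna).cumsum()).transform('sum')
        out.loc[isna & (run_len > limit), col] = np.nan
    return out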
This is a partial answer.
for i in list(df):
    for x in range(len(df[i])):
        # NaN > -100 is False, so this catches (and zeroes) the NaN cells
        if not df[i][x] > -100:
            df[i][x] = 0
df
col1 col2 col3
0 1.0 20.0 15.0
1 0.0 0.0 0.0
2 0.0 0.0 0.0
3 0.0 0.0 0.0
4 5.0 22.0 10.0
5 0.0 0.0 0.0
6 7.0 25.0 14.0
7 0.0 0.0 0.0
8 9.0 30.0 13.0
9 0.0 0.0 0.0
10 0.0 0.0 0.0
11 0.0 0.0 0.0
12 13.0 25.0 9.0
Now,
df["col1"][1] == df["col2"][1]
True
