NaNs remaining after pandas interpolate - python

I have the following pandas series:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
11 2.291958
12 NaN
13 NaN
14 NaN
15 NaN
16 NaN
17 NaN
18 NaN
19 NaN
20 NaN
21 NaN
22 NaN
23 NaN
24 NaN
25 NaN
26 0.378826
27 NaN
28 NaN
29 NaN
...
123 NaN
124 NaN
125 NaN
126 NaN
127 1.170094
128 NaN
129 NaN
130 NaN
131 0.008531
132 NaN
133 NaN
134 NaN
135 NaN
136 NaN
137 NaN
138 NaN
139 NaN
140 NaN
141 NaN
142 NaN
143 NaN
144 NaN
145 NaN
146 NaN
147 NaN
148 NaN
149 NaN
150 NaN
151 NaN
152 NaN
Length: 153, dtype: float64
I interpolate it as follows:
ts.interpolate(method='cubic', limit_direction='both', limit=75)
I would have expected all NaNs to be filled by this, but NaNs still remain in the output. Why is that, and how can I fix it in the interpolate call? The output is as follows:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
11 2.291958
12 1.733142
13 1.255447
14 0.854370
15 0.525409
16 0.264062
17 0.065826
18 -0.073801
19 -0.159321
20 -0.195237
21 -0.186051
22 -0.136265
23 -0.050382
24 0.067095
25 0.211666
26 0.378826
27 0.564074
28 0.762908
29 0.970824
...
123 1.649933
124 1.579817
125 1.479152
126 1.343917
127 1.170094
128 0.953663
129 0.690605
130 0.376900
131 0.008531
132 NaN
133 NaN
134 NaN
135 NaN
136 NaN
137 NaN
138 NaN
139 NaN
140 NaN
141 NaN
142 NaN
143 NaN
144 NaN
145 NaN
146 NaN
147 NaN
148 NaN
149 NaN
150 NaN
151 NaN
152 NaN
Length: 153, dtype: float64

Cubic interpolation cannot fill the NaNs that lie outside the first and last valid values; it only fills between them, so the leading and trailing NaNs remain. If you change the method to linear, limit_direction='both' fills the edges as well, by extending the nearest valid value:
s.interpolate('linear', limit_direction='both', limit=75)
Out[62]:
0 2.291958
1 2.291958
2 2.291958
3 2.291958
4 2.291958
5 2.291958
6 2.291958
7 2.291958
8 2.291958
9 2.291958
10 2.291958
11 2.291958
12 2.164416
13 2.036874
14 1.909332
15 1.781789
16 1.654247
17 1.526705
18 1.399163
19 1.271621
20 1.144079
21 1.016537
22 0.888995
23 0.761452
24 0.633910
25 0.506368
26 0.378826
27 0.378826
28 0.378826
29 0.378826
Name: s, dtype: float64
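If you want the cubic shape in the interior but also need the edge NaNs filled, one workaround (a sketch, not part of the answer above) is to interpolate cubically and then extend the edges:
# Cubic fills only between the first and last valid points;
# ffill/bfill then extend those edge values outward as constants.
filled = ts.interpolate(method='cubic').ffill().bfill()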

Related

Merge/join two tables, missing information from second table

The df1 I have
OBJECTID County State WeightedAverage
0 1 Allegan MI 33.088148
1 2 Arenac MI 15.000000
2 3 Branch MI 43.000000
3 4 Calhoun MI 12.931455
4 5 Charlevoix MI 7.679045
The df2 I have
County ConfirmedCases ConfirmedDeaths ProbableCases TotalCases TotalDeaths Weighed_Pos Weighed_Death OBJECTID ProbableDeaths PrevTotCases PrevTotDeaths Population Shape__Area Shape__Length
0 Eaton 7657 159 1042 8699 167 0.078908 0.001515 1 8 8265 164 110243 16150919896 5.079141e+05
1 Alcona 475 21 96 571 26 0.053018 0.002414 2 5 539 25 10770 19379884867 5.663069e+05
2 Alger 295 1 177 472 5 0.051121 0.000542 3 4 462 5 9233 26217416016 1.285307e+06
3 Allegan 8124 97 952 9076 120 0.074833 0.000989 4 23 8738 118 121283 23483935830 6.319381e+05
4 Alpena 1468 51 307 1775 52 0.061853
The exact information is not important; I can clearly see that Allegan appears in both tables. However, when I join the two tables:
df=pd.merge(q1w, co, on='County', how='left')
I get all NaN in the columns coming from the second table.
OBJECTID_x County State WeightedAverage ConfirmedCases ConfirmedDeaths ProbableCases TotalCases TotalDeaths Weighed_Pos Weighed_Death OBJECTID_y ProbableDeaths PrevTotCases PrevTotDeaths Population Shape__Area Shape__Length
0 1 Allegan MI 33.088148 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2 Arenac MI 15.000000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 3 Branch MI 43.000000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 4 Calhoun MI 12.931455 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
I believe there might be stray whitespace in the County column. Try stripping it from both frames first, then merging:
df1['County'] = df1['County'].str.strip()
df2['County'] = df2['County'].str.strip()
merged = pd.merge(df1, df2, on='County', how='left')
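To confirm that whitespace is the culprit before fixing it, you can compare the key sets, a quick diagnostic sketch (not part of the original answer):
# Counties present in df1 but absent from df2 indicate mismatched keys,
# e.g. 'Allegan ' vs 'Allegan'.
print(set(df1['County']) - set(df2['County']))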

Get values from between two other values for each row in the dataframe

I want to extract the integer values for each Hole_ID between the From and To values (inclusive) and save them to a new data frame with the Hole IDs as the column headers.
import pandas as pd
import numpy as np
df=pd.DataFrame(np.array([['Hole_1',110,117],['Hole_2',220,225],['Hole_3',112,114],['Hole_4',248,252],['Hole_5',116,120],['Hole_6',39,45],['Hole_7',65,72],['Hole_8',79,83]]),columns=['HOLE_ID','FROM', 'TO'])
Example starting data
HOLE_ID FROM TO
0 Hole_1 110 117
1 Hole_2 220 225
2 Hole_3 112 114
3 Hole_4 248 252
4 Hole_5 116 120
5 Hole_6 39 45
6 Hole_7 65 72
7 Hole_8 79 83
This is what I would like:
Out[5]:
Hole_1 Hole_2 Hole_3 Hole_4 Hole_5 Hole_6 Hole_7 Hole_8
0 110 220 112 248 116 39 65 79
1 111 221 113 249 117 40 66 80
2 112 222 114 250 118 41 67 81
3 113 223 NaN 251 119 42 68 82
4 114 224 NaN 252 120 43 69 83
5 115 225 NaN NaN NaN 44 70 NaN
6 116 NaN NaN NaN NaN 45 71 NaN
7 117 NaN NaN NaN NaN NaN 72 NaN
I have tried to use the range function, which works if I manually define the range:
for i in df['HOLE_ID']:
    df2[i] = range(int(1), int(10))
gives
Hole_1 Hole_2 Hole_3 Hole_4 Hole_5 Hole_6 Hole_7 Hole_8
0 1 1 1 1 1 1 1 1
1 2 2 2 2 2 2 2 2
2 3 3 3 3 3 3 3 3
3 4 4 4 4 4 4 4 4
4 5 5 5 5 5 5 5 5
5 6 6 6 6 6 6 6 6
6 7 7 7 7 7 7 7 7
7 8 8 8 8 8 8 8 8
8 9 9 9 9 9 9 9 9
but this won't take the df To and From values as inputs to the range.
df2 = pd.DataFrame()
for i in df['HOLE_ID']:
    df2[i] = range(df['To'], df['From'])
gives an error.
Apply a function that returns a Series of the range between FROM and TO for each row, then transpose the result. Note that FROM and TO are strings in the example frame, so cast them to int first, e.g.:
import numpy as np
df[['FROM', 'TO']] = df[['FROM', 'TO']].astype(int)
df.set_index('HOLE_ID').apply(lambda v: pd.Series(np.arange(v['FROM'], v['TO'] + 1)), axis=1).T
Gives you:
HOLE_ID Hole_1 Hole_2 Hole_3 Hole_4 Hole_5 Hole_6 Hole_7 Hole_8
0 110.0 220.0 112.0 248.0 116.0 39.0 65.0 79.0
1 111.0 221.0 113.0 249.0 117.0 40.0 66.0 80.0
2 112.0 222.0 114.0 250.0 118.0 41.0 67.0 81.0
3 113.0 223.0 NaN 251.0 119.0 42.0 68.0 82.0
4 114.0 224.0 NaN 252.0 120.0 43.0 69.0 83.0
5 115.0 225.0 NaN NaN NaN 44.0 70.0 NaN
6 116.0 NaN NaN NaN NaN 45.0 71.0 NaN
7 117.0 NaN NaN NaN NaN NaN 72.0 NaN
Let's try:
df[['FROM','TO']] = df[['FROM', 'TO']].apply(pd.to_numeric)
dfe = df.set_index('HOLE_ID').apply(lambda x: np.arange(x['FROM'], x['TO']+1), axis=1).explode().to_frame()
dfe.set_index(dfe.groupby(level=0).cumcount(), append=True)[0].unstack(0)
Output:
HOLE_ID Hole_1 Hole_2 Hole_3 Hole_4 Hole_5 Hole_6 Hole_7 Hole_8
0 110 220 112 248 116 39 65 79
1 111 221 113 249 117 40 66 80
2 112 222 114 250 118 41 67 81
3 113 223 NaN 251 119 42 68 82
4 114 224 NaN 252 120 43 69 83
5 115 225 NaN NaN NaN 44 70 NaN
6 116 NaN NaN NaN NaN 45 71 NaN
7 117 NaN NaN NaN NaN NaN 72 NaN
Here is another way that creates a range from the two columns and builds a DataFrame:
out = (pd.DataFrame(df[['FROM', 'TO']].astype(int).agg(tuple, 1)
                    .map(lambda x: list(range(x[0], x[1] + 1))).tolist(),
                    index=df['HOLE_ID']).T)
HOLE_ID Hole_1 Hole_2 Hole_3 Hole_4 Hole_5 Hole_6 Hole_7 Hole_8
0 110.0 220.0 112.0 248.0 116.0 39.0 65.0 79.0
1 111.0 221.0 113.0 249.0 117.0 40.0 66.0 80.0
2 112.0 222.0 114.0 250.0 118.0 41.0 67.0 81.0
3 113.0 223.0 NaN 251.0 119.0 42.0 68.0 82.0
4 114.0 224.0 NaN 252.0 120.0 43.0 69.0 83.0
5 115.0 225.0 NaN NaN NaN 44.0 70.0 NaN
6 116.0 NaN NaN NaN NaN 45.0 71.0 NaN
7 117.0 NaN NaN NaN NaN NaN 72.0 NaN
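For completeness, a plain-loop sketch (not one of the answers above) that leans on the DataFrame constructor to pad unequal lengths with NaN:
import pandas as pd

df[['FROM', 'TO']] = df[['FROM', 'TO']].astype(int)

# Build one Series per hole; the constructor aligns them on the default
# RangeIndex and pads the shorter ones with NaN.
out = pd.DataFrame({row.HOLE_ID: pd.Series(range(row.FROM, row.TO + 1))
                    for row in df.itertuples()})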

Filter pandas DataFrame based on last_valid_index value

Is there any proper way of filtering a pandas DataFrame based on the last_valid_index of each row?
For example, I want to keep all rows where the value at the last valid index matches the pattern (\d{13}).
Input:
0 ... 15 16 17 18
24 10.0 ... 1107 8712566328208 NaN NaN
25 6.0 ... 363K 1243 8712100849084 NaN
26 10.0 ... 758 3251510550005 NaN NaN
27 8.0 ... 245K 780 3560070774425 NaN
29 6.0 ... 1485 7613034528971 NaN NaN
29 6.0 ... 1485 test1 NaN NaN
29 6.0 ... 1485 280 test NaN
Output:
0 ... 15 16 17 18
24 10.0 ... 1107 8712566328208 NaN NaN
25 6.0 ... 363K 1243 8712100849084 NaN
26 10.0 ... 758 3251510550005 NaN NaN
27 8.0 ... 245K 780 3560070774425 NaN
29 6.0 ... 1485 7613034528971 NaN NaN
Thanks!
You can use .apply with axis=1 to get the last_valid_index of each row, then df.lookup to fetch the actual values, and .str.match to compare them against the regex.
Try this:
from io import StringIO
import pandas as pd
s = """
0 15 16 17 18
24 10.0 1107 8712566328208 NaN NaN
25 6.0 363K 1243 8712100849084 NaN
26 10.0 758 3251510550005 NaN NaN
27 8.0 245K 780 3560070774425 NaN
29 6.0 1485 7613034528971 NaN NaN
30 6.0 1485 test1 NaN NaN
31 6.0 1485 280 test NaN"""
df = pd.read_csv(StringIO(s), sep=r"\s+")
last_valid_indices = df.apply(lambda row: row.last_valid_index(), axis=1)
last_valid_vals = pd.Series(df.lookup(last_valid_indices.index, last_valid_indices.values), index=last_valid_indices.index)
print(df[last_valid_vals.str.match(r"\d{13}")])
Output:
0 15 16 17 18
24 10.0 1107 8712566328208 NaN NaN
25 6.0 363K 1243 8712100849084 NaN
26 10.0 758 3251510550005 NaN NaN
27 8.0 245K 780 3560070774425 NaN
29 6.0 1485 7613034528971 NaN NaN
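Note that df.lookup was deprecated and later removed in newer pandas versions. A per-row equivalent without it (a sketch, assuming the same df as above):
# Grab the value at each row's last valid index, then test it against the regex.
last_vals = df.apply(lambda row: row[row.last_valid_index()], axis=1).astype(str)
print(df[last_vals.str.match(r"\d{13}")])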
Here is one way using ffill:
df[(pd.to_numeric(df.ffill(axis=1).iloc[:, -1], errors='coerce') // 1e12).between(1, 9)]
0 ... 15 16 17 18
24 10.0 ... 1107 8712566328208 NaN NaN
25 6.0 ... 363K 1243 8712100849084 NaN
26 10.0 ... 758 3251510550005 NaN NaN
27 8.0 ... 245K 780 3560070774425 NaN
29 6.0 ... 1485 7613034528971 NaN NaN

How to work with Series in pandas

for i, r in data.iterrows():
    print(r)
Each row is a Series object, and the printed output looks like:
QuantifierId
18 0.0
19 0.0
20 0.0
21 NaN
23 NaN
24 NaN
25 NaN
26 NaN
27 NaN
28 NaN
63 NaN
64 NaN
81 NaN
82 NaN
83 NaN
84 NaN
85 NaN
86 NaN
87 NaN
88 NaN
89 NaN
91 NaN
93 NaN
94 NaN
95 NaN
96 NaN
121 NaN
Name: 52466, dtype: float64
I want to:
remove all QuantifierIds with value == NaN or 0 (and retain all QuantifierIds with value == 1)
get that Name field from each row
How to do that?
remove all QuantifierIds with value == NaN or 0 (and retain all QuantifierIds with value == 1)
data = data.loc[(data.QuantifierId.notnull()) & (data.QuantifierId != 0)]
get that Name field from each row
data.index.tolist()
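A minimal end-to-end sketch on toy data (the index labels and the column layout are assumptions made to mirror the question):
import numpy as np
import pandas as pd

data = pd.DataFrame({'QuantifierId': [1.0, 0.0, np.nan, 1.0]},
                    index=[52466, 52467, 52468, 52469])

# Drop NaN and 0, keeping only rows where QuantifierId == 1.
data = data.loc[data.QuantifierId.notnull() & (data.QuantifierId != 0)]

# The "Name" printed for each row is its index label.
print(data.index.tolist())  # [52466, 52469]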

Plotting a heatmap for trajectory data from a pandas dataframe

I have a dataframe in pandas containing information that I would like to display as a heatmap of sorts. The dataframe holds the x and y co-ordinates of several objects at varying points in time and includes other information in extra columns (e.g. mass).
time object x y mass
3 1.0 216 12 12
4 1.0 218 13 12
5 1.0 217 12 12
6 1.0 234 13 13
1 2.0 361 289 23
2 2.0 362 287 22
3 2.0 362 286 22
5 3.0 124 56 18
6 3.0 126 52 17
I would like to create a heatmap with the x and y values corresponding to the x and y axes of the heatmap. The greater the number of objects at a particular x/y location, the more intense I would like the color to be. Any ideas on how you would accomplish this?
One idea is to use a seaborn heatmap. First, pivot your dataframe on the desired fields, in this case x, y and, say, mass:
In [4]: df
Out[4]:
time object x y mass
0 3 1.0 216 12 12
1 4 1.0 218 13 12
2 5 1.0 217 12 12
3 6 1.0 234 13 13
4 1 2.0 361 289 23
5 2 2.0 362 287 22
6 3 2.0 362 286 22
7 5 3.0 124 56 18
8 6 3.0 126 52 17
In [5]: d = df.pivot(index='x', columns='y', values='mass')
In [6]: d
Out[6]:
y 12 13 52 56 286 287 289
x
124 NaN NaN NaN 18.0 NaN NaN NaN
126 NaN NaN 17.0 NaN NaN NaN NaN
216 12.0 NaN NaN NaN NaN NaN NaN
217 12.0 NaN NaN NaN NaN NaN NaN
218 NaN 12.0 NaN NaN NaN NaN NaN
234 NaN 13.0 NaN NaN NaN NaN NaN
361 NaN NaN NaN NaN NaN NaN 23.0
362 NaN NaN NaN NaN 22.0 22.0 NaN
Then you can render a simple heatmap (assuming import seaborn as sns) with:
ax = sns.heatmap(d)
The result is the heatmap image (omitted here). If you need a more complex attribute instead of the single mass column, you can add a new column to the original dataframe. The seaborn documentation also has samples on defining colormaps, styles, etc.
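Since the question actually asks for intensity by the number of objects at each (x, y) location rather than by mass, a count-based variant (a sketch, assuming df is the frame above) uses pd.crosstab:
import matplotlib.pyplot as plt
import seaborn as sns

# Count observations per (y, x) cell; cells with more objects plot as more intense colors.
counts = pd.crosstab(df['y'], df['x'])
sns.heatmap(counts)
plt.show()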
