pandas.Series.interpolate() along "index" shows unexpected results - python

A pandas.Series() called "bla" in my example contains pressures in Pa as the index and wind speeds in m/s as values:
bla
100200.0 2.0
97600.0 NaN
91100.0 NaN
85000.0 3.0
82600.0 NaN
...
6670.0 NaN
5000.0 2.0
4490.0 NaN
3880.0 NaN
3000.0 9.0
Length: 29498, dtype: float64
bla.index
Float64Index([100200.0, 97600.0, 91100.0, 85000.0, 82600.0, 81400.0,
79200.0, 73200.0, 70000.0, 68600.0,
...
11300.0, 10000.0, 9970.0, 9100.0, 7000.0, 6670.0,
5000.0, 4490.0, 3880.0, 3000.0],
dtype='float64', length=29498)
Since the wind speed values are NaN more often than not, I intended to interpolate across the different pressure levels in order to have more wind speed values to work with.
The docs of interpolate() state that there is a method called "index" which interpolates based on the index values, but the results don't make sense compared to the initial values:
bla.interpolate(method="index", axis=0, limit=1, limit_direction="both")
100200.0 **2.00**
97600.0 10.40
91100.0 8.00
85000.0 **3.00**
82600.0 9.75
...
6670.0 3.00
5000.0 **2.00**
4490.0 9.00
3880.0 5.00
3000.0 **9.00**
Length: 29498, dtype: float64
I marked the original values in boldface.
I'd rather expect something like the result of using "linear":
bla.interpolate(method="linear", axis=0, limit=1, limit_direction="both")
100200.0 **2.000000**
97600.0 2.333333
91100.0 2.666667
85000.0 **3.000000**
82600.0 4.600000
...
6670.0 4.500000
5000.0 **2.000000**
4490.0 4.333333
3880.0 6.666667
3000.0 **9.000000**
Nevertheless, I'd like to use "index" properly as the interpolation method, since it should be the most accurate: the pressure levels mark the "distance" between the wind speed values.
By and large, I'd like to understand how the interpolation results using "index" with the pressure levels could turn out so counterintuitive, and how I could make them sound.

Thanks to @ALollz in the first comment underneath my question, I figured out where the issue lay:
My dataframe had 2 index levels, the outer being unique measurement timestamps, the inner being a standard range index.
I should have looked at each subset associated with a unique timestamp separately.
Within these subsets, interpolation makes sense and the results come out just right.
Example:
# Loop over all unique timestamps in the outermost index level
for timestamp in df.index.get_level_values(level=0).unique():
    # Extract the current subset
    df_subset = df.loc[timestamp, :]
    # Carry out interpolation on a column of interest
    df_subset["column of interest"] = df_subset[
        "column of interest"].interpolate(method="linear",
                                          axis=0,
                                          limit=1,
                                          limit_direction="both")
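If every timestamp's subset should be interpolated in one go, a groupby/transform can replace the explicit loop. This is only a minimal sketch with hypothetical index and column names (the real ones differ):
import numpy as np
import pandas as pd

# Hypothetical frame with the structure described above: an outer level of
# measurement timestamps, an inner running index, and NaN gaps in the values.
idx = pd.MultiIndex.from_product(
    [pd.to_datetime(["2020-01-01", "2020-01-02"]), range(4)],
    names=["timestamp", "step"])
df = pd.DataFrame({"wind_speed": [2.0, np.nan, np.nan, 3.0,
                                  1.0, np.nan, 4.0, np.nan]}, index=idx)

# Interpolate each timestamp's profile separately; transform keeps the
# original MultiIndex alignment.
df["wind_speed"] = df.groupby(level="timestamp")["wind_speed"].transform(
    lambda s: s.interpolate(method="linear", limit=1, limit_direction="both"))
print(df)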

Related

What happens if I change dropna to True/False

If I write this code:
train['id_03'].value_counts(dropna=False, normalize =True).head()
I am getting
NaN 0.887689233582822
0.0 0.108211128797372
1.0 0.001461374335354
3.0 0.001131168083449
2.0 0.000712906831036
Name: id_03, dtype: float64
If I change dropna=True, I get:
0.0 0.963497
1.0 0.013012
3.0 0.010072
2.0 0.006348
5.0 0.001643
Name: id_03, dtype: float64
I think the key is that you specified normalize=True. According to the documentation: "If True then the object returned will contain the relative frequencies of the unique values."
Before you removed the NaNs, their counts were included when calculating the relative frequencies; after you removed them, the denominator of the relative frequencies changed, hence the values changed.
You are normalising the result. The count for NaN is very large with respect to the others, hence the other values end up with very small relative frequencies.
If you look at the ratio between the entries for 1.0 and 2.0, you'll see that it is the same in both results.
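To see the denominator change in isolation, here is a tiny made-up series (not the real id_03 data):
import numpy as np
import pandas as pd

s = pd.Series([0.0, 0.0, 1.0, np.nan, np.nan, np.nan])

# NaN counts toward the total of 6: 0.0 -> 2/6, 1.0 -> 1/6, NaN -> 3/6
print(s.value_counts(dropna=False, normalize=True))

# NaN excluded, total is 3: 0.0 -> 2/3, 1.0 -> 1/3
print(s.value_counts(dropna=True, normalize=True))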

How to remove NaN values from corr() function output

EDITED TO SHOW EXAMPLE OF ORIGINAL DATAFRAME:
df.head(4)
shop category subcategory season
date
2013-09-04 abc weddings shoes winter
2013-09-04 def jewelry watches summer
2013-09-05 ghi sports sneakers spring
2013-09-05 jkl jewelry necklaces fall
I've successfully generated the following dataframe using get_dummies():
wedding_seasons = pd.get_dummies(df.loc[df['category']=='weddings',['category','season']],prefix = '', prefix_sep = '' )
wedding_seasons.head(3)
weddings winter summer spring fall
71654 1.0 0.0 1.0 0.0 0.0
72168 1.0 0.0 1.0 0.0 0.0
72080 1.0 0.0 1.0 0.0 0.0
The goal of the above is to help assess frequency of weddings across seasons, so I've used corr() to generate the following result:
weddings fall spring summer winter
weddings NaN NaN NaN NaN NaN
fall NaN 1.000000 0.054019 -0.331866 -0.012122
spring NaN 0.054019 1.000000 -0.857205 0.072420
summer NaN -0.331866 -0.857205 1.000000 -0.484578
winter NaN -0.012122 0.072420 -0.484578 1.000000
I'm unsure why the weddings column is generating NaN values, but my gut feeling is that it originates from how I originally created wedding_seasons. Any guidance would be greatly appreciated so that I can properly assess the column correlations.
I don't think what you're interested in seeing here is the "correlation".
All of the columns in the dataframe wedding_seasons contain floating point values; however, if my suspicions are correct, the rows in your original dataframe df contain something like transaction records, where each row corresponds to an individual.
Please tell me if I'm incorrect, but I'll proceed with my reasoning.
Correlation will measure, intuitively, the tendency for values to vary together or against each other within the same observation (e.g. if X and Y are negatively correlated, then when we see X above its mean, we'd expect Y to be below its mean).
However, what you have here is data where, if one transaction is summer, then categorically it cannot possibly be winter at the same time. When you create wedding_seasons, Pandas is creating dummy variables that are treated as floating point values when computing your correlation matrix; since it's impossible for any row to contain two 1.0 entries at the same time, clearly your resulting correlation matrix is going to have negative entries everywhere.
The weddings column itself is constant (every row is 1.0 after filtering on category=='weddings'), so its standard deviation is zero and corr() returns NaN for it. You could drop the weddings column before doing corr().
wedding_seasons.drop(columns = ['weddings'])
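Here is a small sketch with made-up dummy rows (not the original df) that reproduces the NaN row/column and shows the fix:
import pandas as pd

wedding_seasons = pd.DataFrame({
    "weddings": [1.0, 1.0, 1.0, 1.0],   # constant column: zero variance
    "winter":   [0.0, 1.0, 0.0, 0.0],
    "summer":   [1.0, 0.0, 1.0, 0.0],
    "spring":   [0.0, 0.0, 0.0, 1.0],
})

print(wedding_seasons.corr())                              # 'weddings' row/column is all NaN
print(wedding_seasons.drop(columns=['weddings']).corr())   # NaN-free matrix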

Aligning 2 python lists according to 2 other lists

I have two arrays, nlxTTL and ttlState. Both consist of a repeating pattern of 0s and 1s indicating an input voltage, which can be HIGH (1) or LOW (0), and both are recorded from the same source, which sends a TTL pulse (HIGH and LOW) with a 1-second pulse width.
But due to a logging mistake, some drops happen in the ttlState list, i.e. it doesn't log a repeating sequence of 0s and 1s and ends up dropping values.
The good part is that I also log a timestamp for each TTL input received, for both lists. The inter-TTL-event timestamp difference clearly shows when one of the pulses has been missed.
Here is an example of what data looks like:
nlxTTL, ttlState, nlxTime, ttlTime
0,0,1000,1000
1,1,2000,2000
0,1,3000,4000
1,1,4000,6000
0,0,5000,7000
1,1,6000,8000
0,0,7000,9000
1,1,8000,10000
As you can see, nlxTime and ttlTime are clearly different from each other. How can I then use these timestamps to align all 4 lists?
When dealing with tabular data such as a CSV file, it's a good idea to use a library to make the process easier. I like the pandas dataframe library.
Now for your question, one way to think about this problem is that you really have two datasets... An nlx dataset and a ttl dataset. You want to join those datasets together by timestamp. Pandas makes tasks like this very easy.
import pandas as pd
from io import StringIO
data = """\
nlxTTL, ttlState, nlxTime, ttlTime
0,0,1000,1000
1,1,2000,2000
0,1,3000,4000
1,1,4000,6000
0,0,5000,7000
1,1,6000,8000
0,0,7000,9000
1,1,8000,10000
"""
# Load data into dataframe.
df = pd.read_csv(StringIO(data))
# Remove spaces from column names.
df.columns = [x.strip() for x in df.columns]
# Split the data into an nlx dataframe and a ttl dataframe.
nlx = df[['nlxTTL', 'nlxTime']].reset_index()
ttl = df[['ttlState', 'ttlTime']].reset_index()
# Merge the dataframes back together based on their timestamps.
# Use an outer join so missing data gets filled with NaNs instead
# of just dropping the rows.
merged_df = nlx.merge(ttl, left_on='nlxTime', right_on='ttlTime', how='outer')
# Get back to the original set of columns
merged_df = merged_df[df.columns]
# Print out the results.
print(merged_df)
This produces the following output.
nlxTTL ttlState nlxTime ttlTime
0 0.0 0.0 1000.0 1000.0
1 1.0 1.0 2000.0 2000.0
2 0.0 NaN 3000.0 NaN
3 1.0 1.0 4000.0 4000.0
4 0.0 NaN 5000.0 NaN
5 1.0 1.0 6000.0 6000.0
6 0.0 0.0 7000.0 7000.0
7 1.0 1.0 8000.0 8000.0
8 NaN 0.0 NaN 9000.0
9 NaN 1.0 NaN 10000.0
You'll notice that it fills in the dropped values with NaN values because we are doing an outer join. If this is undesirable, change the how='outer' parameter to how='inner' to perform an inner join. This will only keep records for which you have both an nlx and ttl response at that timestamp.
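For example, if the inner join is what you want, the only change is the how parameter (reusing the nlx, ttl, and df objects defined above):
# Inner join: keep only timestamps present in both recordings,
# i.e. the rows that showed NaN above are dropped.
merged_inner = nlx.merge(ttl, left_on='nlxTime', right_on='ttlTime', how='inner')
merged_inner = merged_inner[df.columns]
print(merged_inner)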

Ordinary Least Squares Regression for multiple columns in Pandas Dataframe

I'm trying to find a way to iterate a linear regression over many, many columns, all the way up to Z3. Here is a snippet of the dataframe, called df1:
Time A1 A2 A3 B1 B2 B3
1 1.00 6.64 6.82 6.79 6.70 6.95 7.02
2 2.00 6.70 6.86 6.92 NaN NaN NaN
3 3.00 NaN NaN NaN 7.07 7.27 7.40
4 4.00 7.15 7.26 7.26 7.19 NaN NaN
5 5.00 NaN NaN NaN NaN 7.40 7.51
6 5.50 7.44 7.63 7.58 7.54 NaN NaN
7 6.00 7.62 7.86 7.71 NaN NaN NaN
This code returns the slope coefficient of a linear regression for ONE column only and concatenates the value to a numpy array called series; here is what it looks like for extracting the slope of the first column:
import numpy as np
from sklearn.linear_model import LinearRegression

series = np.array([])  # blank array to append results to
df2 = df1[~np.isnan(df1['A1'])]  # removes NaN values for each column to apply sklearn function
df3 = df2[['Time', 'A1']]
npMatrix = np.matrix(df3)
X, Y = npMatrix[:, 0], npMatrix[:, 1]
slope = LinearRegression().fit(X, Y)  # either this or the next line
m = slope.coef_[0]
series = np.concatenate((series, m), axis=0)
As it stands now, I am reusing this block of code, replacing "A1" with a new column name all the way up to "Z3", which is extremely inefficient. I know there are many easy ways to do this with some modules, but the intermediate NaN values in the time series seem to limit me to this method, or something like it.
I tried using a for loop such as:
for col in df1.columns:
and replacing 'A1', for example, with col in the code, but this does not seem to work.
Is there any way I can do this more efficiently?
Thank you!
One liner (or three)
time = df[['Time']]
pd.DataFrame(np.linalg.pinv(time.T.dot(time)).dot(time.T).dot(df.fillna(0)),
             ['Slope'], df.columns)
Broken down with a bit of explanation
Using the closed form of OLS, the coefficient estimate is β = (XᵀX)⁻¹XᵀY.
In this case X is time where we define time as df[['Time']]. I used the double brackets to preserve the dataframe and its two dimensions. If I'd done single brackets, I'd have gotten a series and its one dimension. Then the dot products aren't as pretty.
The (XᵀX)⁻¹Xᵀ part is np.linalg.pinv(time.T.dot(time)).dot(time.T)
Y is df.fillna(0). Yes, we could have done one column at a time, but why, when we can do it all at once? You have to deal with the NaNs somehow. How would you deal with them? Only fit over the times where you have data? That is equivalent to placing zeroes in the NaN spots. So that's what I did.
Finally, I use pd.DataFrame(stuff, ['Slope'], df.columns) to contain all slopes in one place with the original labels.
Note that I calculated the slope of the regression for Time against itself. Why not? It was there. Its value is 1.0. Great! I probably did it right!
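For reference, here is the same one-liner spelled out step by step (a sketch assuming df still holds the Time column plus the measurement columns; the variable names here are just for illustration):
import numpy as np
import pandas as pd

X = df[['Time']]                                 # double brackets keep it 2-D
pinv_term = np.linalg.pinv(X.T.dot(X)).dot(X.T)  # (X'X)^-1 X'
Y = df.fillna(0)                                 # NaNs treated as zeros
slopes = pd.DataFrame(pinv_term.dot(Y), ['Slope'], df.columns)
print(slopes)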
Looping is a decent strategy for a modest number (say, fewer than thousands) of columns. Without seeing your implementation, I can't say what's wrong, but here's my version, which works:
slopes = []
for c in df1.columns:
    if c == "Time":
        continue  # skip the x-variable itself
    mask = ~np.isnan(df1[c])
    x = np.atleast_2d(df1.Time[mask].values).T
    y = np.atleast_2d(df1[c][mask].values).T
    reg = LinearRegression().fit(x, y)
    slopes.append(reg.coef_[0])
I've simplified your code a bit to avoid creating so many temporary DataFrame objects, but it should work fine your way too.

Replace NaN or missing values with rolling mean or other interpolation

I have a pandas dataframe with monthly data that I want to compute a 12-month moving average for. Data for every month of January is missing, however (NaN), so I am using
pd.rolling_mean(data["variable"], 12, center=True)
but it just gives me all NaN values.
Is there a simple way that I can ignore the NaN values? I understand that in practice this would become an 11-month moving average.
The dataframe has other variables which do have January data, so I don't want to just throw out the January rows and do an 11-month moving average.
There are several ways to approach this, and the best way will depend on whether the January data is systematically different from other months. Most real-world data is likely to be somewhat seasonal, so let's use the average high temperature (Fahrenheit) of a random city in the northern hemisphere as an example.
import numpy as np
import pandas as pd

df = pd.DataFrame({'month': [10, 11, 12, 1, 2, 3],
                   'temp': [65, 50, 45, np.nan, 40, 43]}).set_index('month')
You could use a rolling mean as you suggest, but the issue is that you will get an average temperature over the entire year, which ignores the fact that January is the coldest month. To correct for this, you could reduce the window to 3, which results in the January temp being the average of the December and February temps. (I am also using min_periods=1 as suggested in @user394430's answer.)
df['rollmean12'] = df['temp'].rolling(12,center=True,min_periods=1).mean()
df['rollmean3'] = df['temp'].rolling( 3,center=True,min_periods=1).mean()
Those are improvements but still have the problem of overwriting existing values with rolling means. To avoid this you could combine with the update() method (see documentation here).
df['update'] = df['rollmean3']
df['update'].update( df['temp'] ) # note: this is an inplace operation
There are even simpler approaches that leave the existing values alone while filling the missing January temps with either the previous month, next month, or the mean of the previous and next month.
df['ffill'] = df['temp'].ffill() # previous month
df['bfill'] = df['temp'].bfill() # next month
df['interp'] = df['temp'].interpolate() # mean of prev/next
In this case, interpolate() defaults to simple linear interpolation, but you have several other interpolation options as well. See the documentation on pandas interpolate for more info, or this Stack Overflow question:
Interpolation on DataFrame in pandas
Here is the sample data with all the results:
temp rollmean12 rollmean3 update ffill bfill interp
month
10 65.0 48.6 57.500000 65.0 65.0 65.0 65.0
11 50.0 48.6 53.333333 50.0 50.0 50.0 50.0
12 45.0 48.6 47.500000 45.0 45.0 45.0 45.0
1 NaN 48.6 42.500000 42.5 45.0 40.0 42.5
2 40.0 48.6 41.500000 40.0 40.0 40.0 40.0
3 43.0 48.6 41.500000 43.0 43.0 43.0 43.0
In particular, note that "update" and "interp" give the same results in all months. While it doesn't matter which one you use here, in other cases one way or the other might be better.
The real key is having min_periods=1. Also, as of version 0.18, the proper way to call this is on a Rolling object. Therefore, your code should be
data["variable"].rolling(window=12, center=True, min_periods=1).mean()
