Spline interpolation on dataframes by row - python

I have the following data frame:
OBJECTID 2017 2018 2019 2020 2021
1.0 NaN NaN 7569.183179 7738.162829 7907.142480
2.0 NaN NaN 766.591146 783.861122 801.131099
3.0 NaN NaN 8492.215747 8686.747704 8881.279662
4.0 NaN NaN 40760.327825 41196.877473 41633.427120
5.0 NaN NaN 6741.819674 6788.981231 6836.142788
I am trying to apply a spline interpolation on each row to get the values for 2017 and 2018 using the following code:
years = list(range(2017,2022))
df[years] = df[years].interpolate(method="spline", order =1, limit_direction="both", axis=1)
However, I get the following error:
ValueError: Index column must be numeric or datetime type when using spline method other than linear. Try setting a numeric or datetime index column before interpolating.
The dataframe in this question is just a subset of a much larger dataset I am using. All of the examples I have seen do the spline interpolation down each column, but I can't seem to get it to work across each row. I feel like it's a simple solution and I'm just missing it. Could someone please help?

This appears to be because the dtype of the index (really the columns, for axis=1) is object in your case, since the columns also contain a string name ('OBJECTID'). Even though you are grabbing a slice that contains only the integer years, the overall index dtype remains object, and interpolate looks at that dtype and punts when it sees object.
Example - even though the years are stored as integers, the overall dtype is object:
df.columns
Index(['OBJECTID', 2017, 2018, 2019, 2020, 2021], dtype='object')
If we did this:
df.drop(columns=['OBJECTID'], inplace=True)
df.columns = df.columns.astype('uint64')
df.columns
UInt64Index([2017, 2018, 2019, 2020, 2021], dtype='uint64')
Then the axis=1 interpolation works:
years = list(range(2017,2022))
df[years] = df[years].interpolate(method="spline", order =1, limit_direction="both", axis=1)
2017 2018 2019 2020 2021
0 7231.223878 7400.203528 7569.183179 7738.162829 7907.142480
1 732.051193 749.321169 766.591146 783.861122 801.131099
2 8103.151832 8297.683789 8492.215747 8686.747704 8881.279662
3 39887.228530 40323.778178 40760.327825 41196.877473 41633.427120
4 6647.496560 6694.658117 6741.819674 6788.981231 6836.142788
Dropping OBJECTID was done only to illustrate what is going on.
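If you need to keep OBJECTID, a minimal alternative sketch (my addition, not part of the original answer) is to interpolate a copy of just the year columns under a numeric column index and write the values back:
import pandas as pd

years = list(range(2017, 2022))
block = df[years].copy()                        # year columns only, OBJECTID untouched
block.columns = block.columns.astype("int64")   # spline needs a numeric index along axis=1
filled = block.interpolate(method="spline", order=1, limit_direction="both", axis=1)
df[years] = filled.to_numpy()                   # write the interpolated values back by position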

Related

What is happening when pandas.Series converts int64s into NaNs?

I have a csv with dates and integers (Headers: Date, Number), separated by a tab.
I'm trying to create a calendar heatmap with CalMap (demo on that page). The function that creates the chart takes data that's indexed by DateTime.
df = pd.read_csv("data.csv",delimiter="\t")
df['Date'] = df['Date'].astype('datetime64[ns]')
events = pd.Series(df['Date'],index = df['Number'])
calmap.yearplot(events)
But when I check events.head(5), it gives the date followed by NaN. I check df['Number'].head(5) and they appear as int64.
What am I doing wrong that is causing this conversion?
Edit: Data below
Date Number
7/9/2018 40
7/10/2018 40
7/11/2018 40
7/12/2018 70
7/13/2018 30
Edit: Output of events.head(5)
2018-07-09 NaN
2018-07-10 NaN
2018-07-11 NaN
2018-07-12 NaN
2018-07-13 NaN
dtype: float64
First of all, it is not NaN but NaT (Not a Timestamp), which is specific to pandas. Pandas makes it compatible with NaN and uses it the way NaN is used in floating-point columns: to mark missing data.
What pd.Series(data, index=index) does depends on the type of data. If data is a list, then index has to be of equal length, and a new Series is constructed pairing the data values with the index labels. However, if data is already a Series (such as df['Date']), pandas instead reindexes it: it takes the rows whose existing labels match index and builds a new Series out of those rows. For example:
pd.Series(df['Date'], [1, 1, 4])
will give you
1 2018-07-10
1 2018-07-10
4 2018-07-13
Here 2018-07-10 comes from row #1 and 2018-07-13 from row #4 of df['Date']. However, there is no row labelled 40, 70 or 30 in your sample input data, so missing data is presumed and NaT is inserted instead.
In contrast, this is what you get when you use a list instead:
pd.Series(df['Date'].to_list(), index=df['Number'])
# => Number
# 40 2018-07-09
# 40 2018-07-10
# 40 2018-07-11
# 70 2018-07-12
# 30 2018-07-13
# dtype: datetime64[ns]
I was able to fix this by converting the series to lists via df['Date'].tolist() and df['Number'].tolist(); calmap.calendarplot(events) accepted these where it would not accept the original Series arguments.
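Putting the pieces together, a minimal sketch of building the events series for calmap (my addition; it assumes, as calmap's examples suggest, that yearplot wants numeric values indexed by a DatetimeIndex):
import calmap
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("data.csv", delimiter="\t")
df["Date"] = pd.to_datetime(df["Date"])

# Pass plain values so pandas builds a new Series instead of
# reindexing the existing 'Date' Series by the 'Number' labels.
events = pd.Series(df["Number"].to_list(), index=df["Date"])

calmap.yearplot(events)
plt.show()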

Rounding down values in Pandas dataframe column with NaNs

I have a Pandas dataframe that contains a column of float64 values:
tempDF = pd.DataFrame({ 'id': [12,12,12,12,45,45,45,51,51,51,51,51,51,76,76,76,91,91,91,91],
'measure': [3.2,4.2,6.8,5.6,3.1,4.8,8.8,3.0,1.9,2.1,2.4,3.5,4.2,5.2,4.3,3.6,5.2,7.1,6.5,7.3]})
I want to create a new column containing just the integer part. My first thought was to use .astype(int):
tempDF['int_measure'] = tempDF['measure'].astype(int)
This works fine but, as an extra complication, the column I have contains a missing value:
tempDF.loc[10,'measure'] = np.nan
This missing value causes the .astype(int) method to fail with:
ValueError: Cannot convert NA to integer
I thought I could round down the floats in the column of data. However, the .round(0) function will round to the nearest integer (higher or lower) rather than rounding down. I can't find a function equivalent to ".floor()" that will act on a column of a Pandas dataframe.
Any suggestions?
You could just apply numpy.floor:
import numpy as np
tempDF['int_measure'] = tempDF['measure'].apply(np.floor)
id measure int_measure
0 12 3.2 3
1 12 4.2 4
2 12 6.8 6
...
9 51 2.1 2
10 51 NaN NaN
11 51 3.5 3
...
19 91 7.3 7
You could also try:
df.apply(lambda s: s // 1)
Using np.floor is faster, however.
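As a quick check (a toy sketch, not taken from either answer), both approaches leave the NaN untouched because the result stays float64:
import numpy as np
import pandas as pd

s = pd.Series([3.2, 4.2, np.nan, 5.6])

floored_apply = s.apply(np.floor)   # 3.0, 4.0, NaN, 5.0
floored_div = s // 1                # same values via floor division

print(floored_apply.equals(floored_div))  # True; .equals treats aligned NaNs as equal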
The answers here are pretty dated; as of pandas 0.25.2 (and perhaps earlier) you may instead run into the warning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Following that suggestion, the assignment for one particular column would be:
df.iloc[:,0] = df.iloc[:,0].astype(int)

Decomposing trend, seasonal and residual time series elements

I have a DataFrame with a few time series:
divida movav12 var varmovav12
Date
2004-01 0 NaN NaN NaN
2004-02 0 NaN NaN NaN
2004-03 0 NaN NaN NaN
2004-04 34 NaN inf NaN
2004-05 30 NaN -0.117647 NaN
2004-06 44 NaN 0.466667 NaN
2004-07 35 NaN -0.204545 NaN
2004-08 31 NaN -0.114286 NaN
2004-09 30 NaN -0.032258 NaN
2004-10 24 NaN -0.200000 NaN
2004-11 41 NaN 0.708333 NaN
2004-12 29 24.833333 -0.292683 NaN
2005-01 31 27.416667 0.068966 0.104027
2005-02 28 29.750000 -0.096774 0.085106
2005-03 27 32.000000 -0.035714 0.075630
2005-04 30 31.666667 0.111111 -0.010417
2005-05 31 31.750000 0.033333 0.002632
2005-06 39 31.333333 0.258065 -0.013123
2005-07 36 31.416667 -0.076923 0.002660
I want to decompose the first time series divida in a way that I can separate its trend from its seasonal and residual components.
I found an answer here, and am trying to use the following code:
import statsmodels.api as sm
s=sm.tsa.seasonal_decompose(divida.divida)
However I keep getting this error:
Traceback (most recent call last):
  File "/Users/Pred_UnBR_Mod2.py", line 78, in <module>
    s=sm.tsa.seasonal_decompose(divida.divida)
  File "/Library/Python/2.7/site-packages/statsmodels/tsa/seasonal.py", line 58, in seasonal_decompose
    _pandas_wrapper, pfreq = _maybe_get_pandas_wrapper_freq(x)
  File "/Library/Python/2.7/site-packages/statsmodels/tsa/filters/_utils.py", line 46, in _maybe_get_pandas_wrapper_freq
    freq = index.inferred_freq
AttributeError: 'Index' object has no attribute 'inferred_freq'
How can I proceed?
Works fine when you convert your index to a DatetimeIndex:
df.reset_index(inplace=True)
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')
s=sm.tsa.seasonal_decompose(df.divida)
<statsmodels.tsa.seasonal.DecomposeResult object at 0x110ec3710>
Access the components via:
s.resid
s.seasonal
s.trend
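As a small addition (not part of the original answer), recent statsmodels versions also expose a plot() method on the DecomposeResult, which draws all the components in one figure:
import matplotlib.pyplot as plt

fig = s.plot()   # observed, trend, seasonal and residual panels
plt.show()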
Statsmodels will decompose the series only if it can determine a frequency. Usually a time series index carries a frequency (e.g. daily, business-daily, weekly); here it does not, which is why the error is raised. You can remove this error in two ways:
What Stefan did is pass the index column to pandas' datetime conversion; pandas uses its internal infer_freq machinery to work out the frequency and returns an index that carries it.
Alternatively, set the frequency on your index yourself, e.g. df.index.asfreq(freq='M'), where 'M' represents month. This is the option to use when you know the frequency from domain knowledge.
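A minimal sketch of that second option (my own variant, using DataFrame.asfreq on the question's df and assuming the monthly data is stamped at the start of each month, hence the 'MS' alias):
import pandas as pd
import statsmodels.api as sm

df.index = pd.to_datetime(df.index)   # '2004-01' strings become month-start timestamps
df = df.asfreq("MS")                  # attach an explicit month-start frequency
res = sm.tsa.seasonal_decompose(df["divida"])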
It depends on the index format: you can have a DatetimeIndex or a PeriodIndex. Stefan presented the example for a DatetimeIndex; here is my example for a PeriodIndex.
My original DataFrame has a MultiIndex with the year in the first level and the month in the second level. Here is how I convert it to a PeriodIndex:
df["date"] = pd.PeriodIndex(df.index.map(lambda x: "{0}{1:02d}".format(*x)), freq="M")
df = df.set_index("date")
Now it is ready to be used by seasonal_decompose.
Make it simple. Follow three steps:
1. If not already done, put the date column in yyyy-mm-dd or dd-mm-yyyy format (e.g. using Excel).
2. Then convert it to a datetime with pandas:
df['Date'] = pd.to_datetime(df['Date'])
3. Decompose it using:
from statsmodels.tsa.seasonal import seasonal_decompose
decomposition = seasonal_decompose(ts_log)
And finally:
Try parsing the date column using parse_dates, and set it as the index with index_col:
from statsmodels.tsa.seasonal import seasonal_decompose
data=pd.read_csv(airline,header=0,squeeze=True,index_col=[0],parse_dates=[0])
res=seasonal_decompose(data)

pandas column division ValueError (putmask: mask and data must be the same size)

I am attempting to divide one column by another inside of a function:
lcontrib=lcontrib_lev.div(lcontrib_lev['base'],axis='index')
As can be seen, I am dividing by a column within the DataFrame, but I am getting a rather strange error:
ValueError: putmask: mask and data must be the same size
I must confess, this is the first time I have seen this error. It seems to suggest that the DF and the column are of different lengths, but clearly (since the column comes from the DataFrame) they are not.
A further twist is that I am using this function to loop a data management procedure over year-specific sets (the data are from the Quarterly Census of Employment and Wages 'singlefiles' in the beta series). The sets associated with the 1990-2000 time period go off without a hitch, but 2001 throws this error. I am afraid I have not been able to identify a difference in structure across years, and even if I could, how would it explain the length mismatch?
Any thoughts would be greatly appreciated.
EDIT (2/1/2014): Thanks for taking a look Tom. As requested, the pandas version is 0.13.0, and the data file in question is located here on the BLS FTP site. Just to clarify what I meant by consistent structure, every year has the same variable set and dtype (in addition to a consistent data code structure).
EDIT (2/1/2014): Perhaps it would be useful to share the entire function:
def qcew(f,m_dict):
    '''Function reads in file and captures county level aggregations with government contributions'''
    #Read in file
    cew=pd.read_csv(f)
    #Create string version of area fips
    cew['fips']=cew['area_fips'].astype(str)
    #Generate description variables
    cew['area']=cew['fips'].map(m_dict['area'])
    cew['industry']=cew['industry_code'].map(m_dict['industry'])
    cew['agglvl']=cew['agglvl_code'].map(m_dict['agglvl'])
    cew['own']=cew['own_code'].map(m_dict['ownership'])
    cew['size']=cew['size_code'].map(m_dict['size'])
    #Generate boolean masks
    lagg_mask=cew['agglvl_code']==73
    lsize_mask=cew['size_code']==0
    #Subset data to above specifications
    cew_super=cew[lagg_mask & lsize_mask]
    #Define column subset
    lsub_cols=['year','fips','area','industry_code','industry','own','annual_avg_estabs_count','annual_avg_emplvl',\
               'total_annual_wages','own_code']
    #Subset to desired columns
    cew_sub=cew_super[lsub_cols]
    #Rename columns
    cew_sub.columns=['year','fips','cty','ind_code','industry','own','estabs','emp','tot_wages','own_code']
    #Set index
    cew_sub.set_index(['year','fips','cty'],inplace=True)
    #Capture total wage base and the contributions of Federal, State, and Local
    cew_base=cew_sub['tot_wages'].groupby(level=['year','fips','cty']).sum()
    cew_fed=cew_sub[cew_sub['own_code']==1]['tot_wages'].groupby(level=['year','fips','cty']).sum()
    cew_st=cew_sub[cew_sub['own_code']==2]['tot_wages'].groupby(level=['year','fips','cty']).sum()
    cew_loc=cew_sub[cew_sub['own_code']==3]['tot_wages'].groupby(level=['year','fips','cty']).sum()
    #Convert to DFs for join
    lbase=DataFrame(cew_base).rename(columns={0:'base'})
    lfed=DataFrame(cew_fed).rename(columns={0:'fed_wage'})
    lstate=DataFrame(cew_st).rename(columns={0:'st_wage'})
    llocal=DataFrame(cew_loc).rename(columns={0:'loc_wage'})
    #Join these series
    lcontrib_lev=pd.concat([lbase,lfed,lstate,llocal],axis='index').fillna(0)
    #Diag prints
    print f
    print lcontrib_lev.head()
    print lcontrib_lev.describe()
    print '*****************************\n'
    #Calculate proportional contributions (failure point)
    lcontrib=lcontrib_lev.div(lcontrib_lev['base'],axis='index')
    #Group base data by year, county, and industry
    cew_g=cew_sub.reset_index().groupby(['year','fips','cty','ind_code','industry']).sum().reset_index()
    #Join contributions to joined data
    cew_contr=cew_g.set_index(['year','fips','cty']).join(lcontrib[['fed_wage','st_wage','loc_wage']])
    return cew_contr[[x for x in cew_contr.columns if x != 'own_code']]
Works OK for me (this is on 0.13.1, but IIRC nothing in this particular area changed; it's possible it was a bug that has since been fixed).
In [48]: lcontrib_lev.div(lcontrib_lev['base'],axis='index').head()
Out[48]:
base fed_wage st_wage loc_wage
year fips cty
2001 1000 1000 NaN NaN NaN NaN
1000 NaN NaN NaN NaN
10000 10000 NaN NaN NaN NaN
10000 NaN NaN NaN NaN
10001 10001 NaN NaN NaN NaN
[5 rows x 4 columns]
In [49]: lcontrib_lev.div(lcontrib_lev['base'],axis='index').tail()
Out[49]:
base fed_wage st_wage loc_wage
year fips cty
2001 CS566 CS566 1 0.000000 0.000000 0.000000
US000 US000 1 0.022673 0.027978 0.073828
USCMS USCMS 1 0.000000 0.000000 0.000000
USMSA USMSA 1 0.000000 0.000000 0.000000
USNMS USNMS 1 0.000000 0.000000 0.000000
[5 rows x 4 columns]
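For readers unfamiliar with the pattern, here is a toy sketch of dividing every column by one column with axis='index' (the column names are made up):
import pandas as pd

toy = pd.DataFrame({"base": [10.0, 20.0], "fed_wage": [1.0, 5.0], "st_wage": [2.0, 4.0]})
shares = toy.div(toy["base"], axis="index")   # each column divided row-wise by 'base'
print(shares)
#    base  fed_wage  st_wage
# 0   1.0      0.10      0.2
# 1   1.0      0.25      0.2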

Group by multiple time units in pandas data frame

I have a data frame that consists of a time series data with 15-second intervals:
date_time value
2012-12-28 11:11:00 103.2
2012-12-28 11:11:15 103.1
2012-12-28 11:11:30 103.4
2012-12-28 11:11:45 103.5
2012-12-28 11:12:00 103.3
The data spans many years. I would like to group by both year and time to look at the distribution of time-of-day effect over many years. For example, I may want to compute the mean and standard deviation of every 15-second interval across days, and look at how the means and standard deviations change from 2010, 2011, 2012, etc. I naively tried data.groupby(lambda x: [x.year, x.time]) but it didn't work. How can I do such grouping?
In case date_time is not your index, a date_time-indexed DataFrame could be created with:
dfts = df.set_index('date_time')
From there you can group by intervals using
dfts.groupby(lambda x : x.month).mean()
to see mean values for each month. Similarly, you can do
dfts.groupby(lambda x : x.year).std()
for standard deviations across the years.
If I understood the example task you would like to achieve, you could simply split the data into years using xs, group each year, and concatenate the results into a new DataFrame.
years = range(2012, 2015)
yearly_month_stats = [dfts.xs(str(year)).groupby(lambda x : x.month).mean() for year in years]
df2 = pd.concat(yearly_month_stats, axis=1, keys = years)
From which you get something like
2012 2013 2014
value value value
1 NaN 5.324165 15.747767
2 NaN -23.193429 9.193217
3 NaN -14.144287 23.896030
4 NaN -21.877975 16.310195
5 NaN -3.079910 -6.093905
6 NaN -2.106847 -23.253183
7 NaN 10.644636 6.542562
8 NaN -9.763087 14.335956
9 NaN -3.529646 2.607973
10 NaN -18.633832 0.083575
11 NaN 10.297902 14.059286
12 33.95442 13.692435 22.293245
You were close:
data.groupby([lambda x: x.year, lambda x: x.time])
Also be sure to set date_time as the index, as in kermit666's answer
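For completeness, a sketch of the same grouping with explicit keys instead of lambdas (my own variant, assuming date_time has been set as the index as above):
import pandas as pd

dfts = df.set_index("date_time")

# Mean and standard deviation of 'value' for every (year, time-of-day) pair
stats = dfts.groupby([dfts.index.year, dfts.index.time])["value"].agg(["mean", "std"])
print(stats.head())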
