When I shift my time series data, I get some NaNs in the dataframe. The only interpolation method that can replace these NaNs with numbers is 'linear'. The NaN are replaced by the same number, which isn't preferable.
Is there some way to instead use a different method like 'cubic' or 'quadratic'?
import numpy as np
import pandas as pd
# original data
df = pd.DataFrame()
np.random.seed(0)
days = pd.date_range(start='2015-01-01', end='2015-01-10', freq='1D')
df = pd.DataFrame({'Date': days, 'col1': np.random.randn(len(days))})
df = df.set_index('Date')
# add lags
df['lag1'] = df['col1'].shift(1)
df['lag3'] = df['col1'].shift(3)
print(df)
def interp(dfObj):
if dfObj.isna().sum()>0:
dfObj0 = dfObj.interpolate(method='linear', limit_direction='both')
return dfObj0
else:
return dfObj
df['lag1'] = interp(df['lag1'])
df['lag3'] = interp(df['lag3'])
print(df)
I believe this is due to the fact that the null values are at the beginning of the data frame, and as such the interpolation has no values to interpolate between. In this case, you need to perform extrapolation, for which pandas does not have a built in method. See this thread for more details.
Extrapolate Dataframe
Related
I have two datasets. Below you can see codes and data
import pandas as pd
import numpy as np
pd.set_option('max_columns', None)
import matplotlib.pyplot as plt
data = {'type_sale': ['group_1','group_2','group_3','group_4','group_5','group_6','group_7','group_8','group_9','group_10'],
'id':[70,20,24,80,20,20,60,20,20,20],
}
df1 = pd.DataFrame(data, columns = ['type_sale',
'id',])
data = {'type_sale': ['group_1','group_2','group_3'],
'id':[70,20,24],
}
df2 = pd.DataFrame(data, columns = ['type_sale',
'id',])
These codes created two datasets that are shown below :
Now I want to create a new data set df3 with values from df1 that are different (distinct values) from the values df2 in the column id.
The final results should as pic below
I tried with these codes but are not giving desired results.
df = pd.concat((df1, df2))
print(df.drop_duplicates('id'))
So can anybody help me how to solve this problem?
Try as follows:
Use df.isin to check for each value in df['id'] whether it is contained in df2['id'].
Next, invert the resulting boolean pd.Series by using the unary operator ~ (tilde) and select from d1.
Finally, reset the index.
In a one-liner:
df3 = df1[~df1['id'].isin(df2['id'])].reset_index(drop=True)
print(df3)
type_sale id
0 group_4 80
1 group_7 60
I cannot manage to update a cell value when the dataframe index is a sub-second timeseries. For example:
import numpy as np
import pandas as pd
t0 = '2019-01-05 22:00:00.000'
t1 = '2019-01-05 22:00:05.000'
df_times = pd.date_range(t0, t1, freq = '500L')
df = pd.DataFrame()
df['datetime'] = df_times
df['Value']=[20,21,22,23,24,25,26,27,28,29,30]
df['Target'] = range(len(df_times))
df = df.set_index('datetime')
df
will result in this dataframe
Dataframe contents
If I try to update a cell 'Target' at index '2019-01-05 22:00:02.000', I end up updating also the 'Target' cell at index '2019-01-05 22:00:02.500'.
two cells updated instead of one
How can I work around this?
This will do the trick:
df.loc[pd.to_datetime('2019-01-05 22:00:02.000'), 'Target']=57
It apparently implicitly casts string to different (less precise) date time than pandas version.
Also using .loc[] will be better in this case.
I have two DataFrames: df1 and df2
both df1 and df2 are derived from the same original data set, which has a DatetimeIndex.
df2 still has a DatetimeIndex.
Whereas, df1 has been oversampled and now has an int index with the prior DatetimeIndex as a 'Date' column within it.
I need to reconstruct a df2 so that it aligns with df1, i.e. I'll need to oversample the rows that are oversampled and then order them and set them onto the same int index that df1 has.
Currently, I'm using these two functions below, but they are painfully slow. Is there any way to speed this up? I haven't been able to find any built-in function that does this. Is there?
def align_data(idx_col,data):
new_data = pd.DataFrame(index=idx_col.index,columns=data.columns)
for label,group in idx_col.groupby(idx_col):
if len(group.index) > 1:
slice = expanded(data.loc[label],len(group.index)).values
else:
slice = data.loc[label]
new_data.loc[group.index] = slice
return new_data
def expanded(row,l):
return pd.DataFrame(data=[row for i in np.arange(l)],index=np.arange(l),columns=row.index)
A test can be generated using the code below:
import pandas as pd
import numpy as np
import datetime as dt
dt_idx = pd.DatetimeIndex(start='1990-01-01',end='2018-07-02',freq='B')
df1 = pd.DataFrame(data=np.zeros((len(dt_idx),20)),index=dt_idx)
df1.index.name = 'Date'
df2 = df1.copy()
df1 = pd.concat([df1,df1.sample(len(dt_idx)/2)],axis=0)
df1.reset_index(drop=False,inplace=True)
t = dt.datetime.now()
df2_aligned = align_data(df1['Date'],df2)
print(dt.datetime.now()-t)
I have a dataframe with multiple columns along with a date column. The date format is 12/31/15 and I have set it as a datetime object.
I set the datetime column as the index and want to perform a regression calculation for each month of the dataframe.
I believe the methodology to do this would be to split the dataframe into multiple dataframes based on month, store into a list of dataframes, then perform regression on each dataframe in the list.
I have used groupby which successfully split the dataframe by month, but am unsure how to correctly convert each group in the groupby object into a dataframe to be able to run my regression function on it.
Does anyone know how to split a dataframe into multiple dataframes based on date, or a better approach to my problem?
Here is my code I've written so far
import pandas as pd
import numpy as np
import statsmodels.api as sm
from patsy import dmatrices
df = pd.read_csv('data.csv')
df['date'] = pd.to_datetime(df['date'], format='%Y%m%d')
df = df.set_index('date')
# Group dataframe on index by month and year
# Groupby works, but dmatrices does not
for df_group in df.groupby(pd.TimeGrouper("M")):
y,X = dmatrices('value1 ~ value2 + value3', data=df_group,
return_type='dataframe')
If you must loop, you need to unpack the key and the dataframe when you iterate over a groupby object:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from patsy import dmatrices
df = pd.read_csv('data.csv')
df['date'] = pd.to_datetime(df['date'], format='%Y%m%d')
df = df.set_index('date')
Note the use of group_name here:
for group_name, df_group in df.groupby(pd.Grouper(freq='M')):
y,X = dmatrices('value1 ~ value2 + value3', data=df_group,
return_type='dataframe')
If you want to avoid iteration, do have a look at the notebook in Paul H's gist (see his comment), but a simple example of using apply would be:
def do_regression(df_group, ret='outcome'):
"""Apply the function to each group in the data and return one result."""
y,X = dmatrices('value1 ~ value2 + value3',
data=df_group,
return_type='dataframe')
if ret == 'outcome':
return y
else:
return X
outcome = df.groupby(pd.Grouper(freq='M')).apply(do_regression, ret='outcome')
This is a split per year.
import pandas as pd
import dateutil.parser
dfile = 'rg_unificado.csv'
df = pd.read_csv(dfile, sep='|', quotechar='"', encoding='latin-1')
df['FECHA'] = df['FECHA'].apply(lambda x: dateutil.parser.parse(x))
#http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
#use to_period
per = df['FECHA'].dt.to_period("Y")
#group by that period
agg = df.groupby([per])
for year, group in agg:
#this simple save the data
datep = str(year).replace('-', '')
filename = '%s_%s.csv' % (dfile.replace('.csv', ''), datep)
group.to_csv(filename, sep='|', quotechar='"', encoding='latin-1', index=False, header=True)
I have a dataframe defined as follows:
import datetime
import pandas as pd
import random
import numpy as np
todays_date = datetime.datetime.today().date()
index = pd.date_range(todays_date - datetime.timedelta(10), periods=10, freq='D')
index = index.append(index)
idname = ['A']*10 + ['B']*10
values = random.sample(xrange(100), 20)
data = np.vstack((idname, values)).T
tmp_df = pd.DataFrame(data, columns=['id', 'value'])
tmp_index = pd.DataFrame(index, columns=['date'])
tmp_df = pd.concat([tmp_index, tmp_df], axis=1)
tmp_df = tmp_df.set_index('date')
Note that there are 2 values for each date. I would like to resample the dataframe tmp_df on a weekly basis but keep the two separate values. I tried tmp_df.resample('W-FRI') but it doesn't seem to work.
The solution you're looking for is groupby, which lets you perform operations on dataframe slices (here 'A' and 'B') independently:
df.groupby('id').resample('W-FRI')
Note: your code produces an error (No numeric types to aggregate) because the 'value' column is not converted to int. You need to convert it first:
df['value'] = pd.to_numeric(df['value'])