Pandas - Split dataframe into multiple dataframes based on dates? - python

I have a dataframe with multiple columns along with a date column. The date format is 12/31/15 and I have set it as a datetime object.
I set the datetime column as the index and want to perform a regression calculation for each month of the dataframe.
I believe the approach would be to split the dataframe into multiple dataframes by month, store them in a list of dataframes, and then perform the regression on each dataframe in the list.
I have used groupby, which successfully splits the dataframe by month, but I am unsure how to correctly convert each group in the groupby object into a dataframe so that I can run my regression function on it.
Does anyone know how to split a dataframe into multiple dataframes based on date, or a better approach to my problem?
Here is the code I've written so far:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from patsy import dmatrices
df = pd.read_csv('data.csv')
df['date'] = pd.to_datetime(df['date'], format='%Y%m%d')
df = df.set_index('date')
# Group dataframe on index by month and year
# Groupby works, but dmatrices does not
for df_group in df.groupby(pd.TimeGrouper("M")):
    y, X = dmatrices('value1 ~ value2 + value3', data=df_group,
                     return_type='dataframe')

If you must loop, you need to unpack the key and the dataframe when you iterate over a groupby object:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from patsy import dmatrices
df = pd.read_csv('data.csv')
df['date'] = pd.to_datetime(df['date'], format='%Y%m%d')
df = df.set_index('date')
Note the use of group_name here:
for group_name, df_group in df.groupby(pd.Grouper(freq='M')):
    y, X = dmatrices('value1 ~ value2 + value3', data=df_group,
                     return_type='dataframe')
If you want to avoid iteration, do have a look at the notebook in Paul H's gist (see his comment), but a simple example of using apply would be:
def do_regression(df_group, ret='outcome'):
    """Apply the function to each group in the data and return one result."""
    y, X = dmatrices('value1 ~ value2 + value3',
                     data=df_group,
                     return_type='dataframe')
    if ret == 'outcome':
        return y
    else:
        return X
outcome = df.groupby(pd.Grouper(freq='M')).apply(do_regression, ret='outcome')
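Since the eventual goal is a regression per month rather than just the design matrices, a minimal sketch of the same apply pattern (assuming the same value1/value2/value3 columns and statsmodels' OLS) could return the fitted coefficients for each group:
def fit_ols(df_group):
    # build this month's design matrices and fit an ordinary least squares model
    y, X = dmatrices('value1 ~ value2 + value3', data=df_group,
                     return_type='dataframe')
    return sm.OLS(y, X).fit().params

monthly_params = df.groupby(pd.Grouper(freq='M')).apply(fit_ols)
print(monthly_params)  # one row of coefficients per month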

This is a split per year.
import pandas as pd
import dateutil.parser
dfile = 'rg_unificado.csv'
df = pd.read_csv(dfile, sep='|', quotechar='"', encoding='latin-1')
df['FECHA'] = df['FECHA'].apply(lambda x: dateutil.parser.parse(x))
#http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
#use to_period
per = df['FECHA'].dt.to_period("Y")
#group by that period
agg = df.groupby([per])
for year, group in agg:
    # this simply saves each year's data to its own CSV
    datep = str(year).replace('-', '')
    filename = '%s_%s.csv' % (dfile.replace('.csv', ''), datep)
    group.to_csv(filename, sep='|', quotechar='"', encoding='latin-1', index=False, header=True)
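If you would rather keep the pieces in memory than write them to disk, the same grouping can be collected into a dict of per-year dataframes (a small sketch on top of the agg object above):
per_year = {str(year): group for year, group in agg}
# e.g. per_year['2015'] would be the dataframe holding only that year's rows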

Related

Pandas interpolate NaN of shifted time series data

When I shift my time series data, I get some NaNs in the dataframe. The only interpolation method that can replace these NaNs with numbers is 'linear'. The NaNs are all replaced by the same number, which isn't preferable.
Is there some way to instead use a different method like 'cubic' or 'quadratic'?
import numpy as np
import pandas as pd
# original data
df = pd.DataFrame()
np.random.seed(0)
days = pd.date_range(start='2015-01-01', end='2015-01-10', freq='1D')
df = pd.DataFrame({'Date': days, 'col1': np.random.randn(len(days))})
df = df.set_index('Date')
# add lags
df['lag1'] = df['col1'].shift(1)
df['lag3'] = df['col1'].shift(3)
print(df)
def interp(dfObj):
    if dfObj.isna().sum() > 0:
        dfObj0 = dfObj.interpolate(method='linear', limit_direction='both')
        return dfObj0
    else:
        return dfObj
df['lag1'] = interp(df['lag1'])
df['lag3'] = interp(df['lag3'])
print(df)
I believe this is due to the fact that the null values are at the beginning of the data frame, and as such the interpolation has no values to interpolate between. In this case, you need to perform extrapolation, for which pandas does not have a built-in method. See this thread for more details.
Extrapolate Dataframe
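A sketch of one such extrapolation, assuming scipy is available (interp1d with fill_value='extrapolate' fills the values pandas cannot): fit on the non-NaN points and evaluate over the whole index.
import numpy as np
import pandas as pd
from scipy.interpolate import interp1d

def extrap(series, kind='cubic'):
    # fit an interpolator on the known points and evaluate it at every position,
    # which also extrapolates beyond the first/last non-NaN values
    x = np.arange(len(series))
    known = series.notna().to_numpy()
    f = interp1d(x[known], series[known].to_numpy(), kind=kind, fill_value='extrapolate')
    return pd.Series(f(x), index=series.index)

df['lag1'] = extrap(df['lag1'])
df['lag3'] = extrap(df['lag3'])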

Pandas category group by category sorting

I need to be able to sort the result of Pandas' second groupby by Category.
The first groupby creates a list from another column, and the second one is the groupby result I need. The problem is that the second groupby does not honour the original sorted categorical index of the DataFrame.
import pandas as pd
import numpy as np
import numpy.ma as ma
from pathlib import Path
fr = Path('../data/rules-1.xlsx')
df = pd.read_excel(fr, sheet_name='MS')
from pandas.api.types import CategoricalDtype
print('Before:')
display(df)
ms_cat = ['Parent-C', 'Parent-A', 'Parent-B']
df['ParentMS'] = df['ParentMS'].astype(CategoricalDtype(ms_cat, ordered=True))
df = df.reset_index()
df = df.set_index('ParentMS')
df = df.sort_index()
print('After:')
display(df)
df_g = df.groupby(['ParentMS', 'Milestone'])['Tasks'].apply(list)
df_g = df_g.groupby('ParentMS')
# Category sort is not honored after the second groupby()
for name, group in df_g:
print(name, group)
This is the input file: https://i.stack.imgur.com/KZnZD.png
Combining the two "df_g" lines into a single chained call did the trick for me. I cannot explain why, but it works:
df_g = df.groupby(['ParentMS', 'Milestone'])['RN'].apply(list).groupby('ParentMS')

how to update a cell value in sub-second time-series pandas data frame

I cannot manage to update a cell value when the dataframe index is a sub-second timeseries. For example:
import numpy as np
import pandas as pd
t0 = '2019-01-05 22:00:00.000'
t1 = '2019-01-05 22:00:05.000'
df_times = pd.date_range(t0, t1, freq = '500L')
df = pd.DataFrame()
df['datetime'] = df_times
df['Value']=[20,21,22,23,24,25,26,27,28,29,30]
df['Target'] = range(len(df_times))
df = df.set_index('datetime')
df
will result in this dataframe
Dataframe contents
If I try to update the 'Target' cell at index '2019-01-05 22:00:02.000', I end up also updating the 'Target' cell at index '2019-01-05 22:00:02.500'.
two cells updated instead of one
How can I work around this?
This will do the trick:
df.loc[pd.to_datetime('2019-01-05 22:00:02.000'), 'Target']=57
A plain string apparently gets implicitly cast to a different (less precise) datetime than the pandas version, which is why more than one row was matched.
Also, using .loc[] is better in this case.
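As a side note (not part of the original answer), a single scalar write with an explicit Timestamp and .at should behave the same way:
df.at[pd.Timestamp('2019-01-05 22:00:02.000'), 'Target'] = 57  # touches only the exact 22:00:02.000 row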

Pandas dataframe resample without aggregation

I have a dataframe defined as follows:
import datetime
import pandas as pd
import random
import numpy as np
todays_date = datetime.datetime.today().date()
index = pd.date_range(todays_date - datetime.timedelta(10), periods=10, freq='D')
index = index.append(index)
idname = ['A']*10 + ['B']*10
values = random.sample(range(100), 20)
data = np.vstack((idname, values)).T
tmp_df = pd.DataFrame(data, columns=['id', 'value'])
tmp_index = pd.DataFrame(index, columns=['date'])
tmp_df = pd.concat([tmp_index, tmp_df], axis=1)
tmp_df = tmp_df.set_index('date')
Note that there are 2 values for each date. I would like to resample the dataframe tmp_df on a weekly basis but keep the two separate values. I tried tmp_df.resample('W-FRI') but it doesn't seem to work.
The solution you're looking for is groupby, which lets you perform operations on dataframe slices (here 'A' and 'B') independently:
tmp_df.groupby('id').resample('W-FRI')
Note: your code produces an error (No numeric types to aggregate) because the 'value' column is not numeric. You need to convert it first:
tmp_df['value'] = pd.to_numeric(tmp_df['value'])
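On recent pandas versions the grouped resampler is lazy, so an explicit reducer has to be chained after it; a minimal sketch on top of the converted tmp_df (mean chosen arbitrarily):
weekly = tmp_df.groupby('id').resample('W-FRI')['value'].mean()
print(weekly)  # one value per id per week ending on Friday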

groupby by years and generate new columns

Let's say I have the following info about the number of trades done in the past, and I group them by year:
import pandas as pd
import numpy as np
dates = pd.date_range('19990101', periods=6000)
df = pd.DataFrame(np.random.randint(0,50,size=(6000,2)), index = dates)
df.columns = ['winners','losers']
grouped = df.groupby(lambda x: x.year)
print(grouped.sum())
How can I generate one column in this "grouped" data that shows the percentage of winners per year, and another column that shows the maximum number of consecutive losing trades per year?
I was trying to follow this example, Understanding groupby in pandas, but couldn't figure out how to do it by year in my case.
First, create a new DataFrame, then create the necessary columns from winners and losers:
new_df = pd.DataFrame()
new_df['winners'] = df.groupby(df.index.year, as_index=True)['winners'].sum()
new_df['losers'] = df.groupby(df.index.year, as_index=True)['losers'].sum()
Then, with that, you can aggregate winners and losers (which return like-indexed data) to calculate a percentage of winners and losers.
You can do it like this:
import pandas as pd
import numpy as np
dates = pd.date_range('19990101', periods=6000)
df = pd.DataFrame(np.random.randint(0,50,size=(6000,2)), index = dates)
df.columns = ['winners','losers']
new_df = pd.DataFrame()
new_df['winners'] = df.groupby(df.index.year, as_index=True)['winners'].sum()
new_df['losers'] = df.groupby(df.index.year, as_index=True)['losers'].sum()
new_df['winners_Percent'] = new_df['winners']/new_df['winners'].sum()
new_df['losers_Percent'] = new_df['losers']/new_df['losers'].sum()
Output:
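If "percentage winners per year" is meant relative to all trades within each year (rather than relative to the other years), a hedged variant on the same new_df would be:
new_df['winners_percent'] = new_df['winners'] / (new_df['winners'] + new_df['losers'])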
