I have a dataframe defined as follows:
import datetime
import pandas as pd
import random
import numpy as np
todays_date = datetime.datetime.today().date()
index = pd.date_range(todays_date - datetime.timedelta(10), periods=10, freq='D')
index = index.append(index)
idname = ['A']*10 + ['B']*10
values = random.sample(xrange(100), 20)
data = np.vstack((idname, values)).T
tmp_df = pd.DataFrame(data, columns=['id', 'value'])
tmp_index = pd.DataFrame(index, columns=['date'])
tmp_df = pd.concat([tmp_index, tmp_df], axis=1)
tmp_df = tmp_df.set_index('date')
Note that there are 2 values for each date. I would like to resample the dataframe tmp_df on a weekly basis but keep the two separate values. I tried tmp_df.resample('W-FRI') but it doesn't seem to work.
The solution you're looking for is groupby, which lets you perform operations on dataframe slices (here 'A' and 'B') independently:
df.groupby('id').resample('W-FRI')
Note: your code produces an error (No numeric types to aggregate) because the 'value' column is not converted to int. You need to convert it first:
df['value'] = pd.to_numeric(df['value'])
Related
I have a dataframe with several columns with dates - formatted as datetime.
I am trying to get the min/max value of a date, based on another date column being NaN
For now, I am doing this in two separate steps:
temp_df = df[(df['date1'] == np.nan)]
max_date = max(temp_df['date2'])
temp_df = None
I get the result I want, but I am using an unnecesary temporary dataframe.
How can I do this without it?
Is there any reference material to read on this?
Thanks
Here is an MCVE that can be played with to obtain statistics from other columns where the value in one isnull() (NaN or NaT). This can be done in a one-liner.
import pandas as pd
import numpy as np
print(pd.__version__)
# sample date columns
daterange1 = pd.date_range('2017-01-01', '2018-01-01', freq='MS')
daterange2 = pd.date_range('2017-04-01', '2017-07-01', freq='MS')
daterange3 = pd.date_range('2017-06-01', '2018-02-01', freq='MS')
df1 = pd.DataFrame(data={'date1': daterange1})
df2 = pd.DataFrame(data={'date2': daterange2})
df3 = pd.DataFrame(data={'date3': daterange3})
# jam them together, making NaT's in non-overlapping ranges
df = pd.concat([df1, df2, df3], axis=0, sort=False)
df.reset_index(inplace=True)
max_date = df[(df['date1'].isnull())]['date2'].max()
print(max_date)
I have two DataFrames: df1 and df2
both df1 and df2 are derived from the same original data set, which has a DatetimeIndex.
df2 still has a DatetimeIndex.
Whereas, df1 has been oversampled and now has an int index with the prior DatetimeIndex as a 'Date' column within it.
I need to reconstruct a df2 so that it aligns with df1, i.e. I'll need to oversample the rows that are oversampled and then order them and set them onto the same int index that df1 has.
Currently, I'm using these two functions below, but they are painfully slow. Is there any way to speed this up? I haven't been able to find any built-in function that does this. Is there?
def align_data(idx_col,data):
new_data = pd.DataFrame(index=idx_col.index,columns=data.columns)
for label,group in idx_col.groupby(idx_col):
if len(group.index) > 1:
slice = expanded(data.loc[label],len(group.index)).values
else:
slice = data.loc[label]
new_data.loc[group.index] = slice
return new_data
def expanded(row,l):
return pd.DataFrame(data=[row for i in np.arange(l)],index=np.arange(l),columns=row.index)
A test can be generated using the code below:
import pandas as pd
import numpy as np
import datetime as dt
dt_idx = pd.DatetimeIndex(start='1990-01-01',end='2018-07-02',freq='B')
df1 = pd.DataFrame(data=np.zeros((len(dt_idx),20)),index=dt_idx)
df1.index.name = 'Date'
df2 = df1.copy()
df1 = pd.concat([df1,df1.sample(len(dt_idx)/2)],axis=0)
df1.reset_index(drop=False,inplace=True)
t = dt.datetime.now()
df2_aligned = align_data(df1['Date'],df2)
print(dt.datetime.now()-t)
I have a dataframe with multiple columns along with a date column. The date format is 12/31/15 and I have set it as a datetime object.
I set the datetime column as the index and want to perform a regression calculation for each month of the dataframe.
I believe the methodology to do this would be to split the dataframe into multiple dataframes based on month, store into a list of dataframes, then perform regression on each dataframe in the list.
I have used groupby which successfully split the dataframe by month, but am unsure how to correctly convert each group in the groupby object into a dataframe to be able to run my regression function on it.
Does anyone know how to split a dataframe into multiple dataframes based on date, or a better approach to my problem?
Here is my code I've written so far
import pandas as pd
import numpy as np
import statsmodels.api as sm
from patsy import dmatrices
df = pd.read_csv('data.csv')
df['date'] = pd.to_datetime(df['date'], format='%Y%m%d')
df = df.set_index('date')
# Group dataframe on index by month and year
# Groupby works, but dmatrices does not
for df_group in df.groupby(pd.TimeGrouper("M")):
y,X = dmatrices('value1 ~ value2 + value3', data=df_group,
return_type='dataframe')
If you must loop, you need to unpack the key and the dataframe when you iterate over a groupby object:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from patsy import dmatrices
df = pd.read_csv('data.csv')
df['date'] = pd.to_datetime(df['date'], format='%Y%m%d')
df = df.set_index('date')
Note the use of group_name here:
for group_name, df_group in df.groupby(pd.Grouper(freq='M')):
y,X = dmatrices('value1 ~ value2 + value3', data=df_group,
return_type='dataframe')
If you want to avoid iteration, do have a look at the notebook in Paul H's gist (see his comment), but a simple example of using apply would be:
def do_regression(df_group, ret='outcome'):
"""Apply the function to each group in the data and return one result."""
y,X = dmatrices('value1 ~ value2 + value3',
data=df_group,
return_type='dataframe')
if ret == 'outcome':
return y
else:
return X
outcome = df.groupby(pd.Grouper(freq='M')).apply(do_regression, ret='outcome')
This is a split per year.
import pandas as pd
import dateutil.parser
dfile = 'rg_unificado.csv'
df = pd.read_csv(dfile, sep='|', quotechar='"', encoding='latin-1')
df['FECHA'] = df['FECHA'].apply(lambda x: dateutil.parser.parse(x))
#http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
#use to_period
per = df['FECHA'].dt.to_period("Y")
#group by that period
agg = df.groupby([per])
for year, group in agg:
#this simple save the data
datep = str(year).replace('-', '')
filename = '%s_%s.csv' % (dfile.replace('.csv', ''), datep)
group.to_csv(filename, sep='|', quotechar='"', encoding='latin-1', index=False, header=True)
Lets say I have the following info about number of trades done in the past and I group them by year:
import pandas as pd
import numpy as np
dates = pd.date_range('19990101', periods=6000)
df = pd.DataFrame(np.random.randint(0,50,size=(6000,2)), index = dates)
df.columns = ['winners','losers']
grouped = df.groupby(lambda x: x.year)
print grouped.sum()
How can I generate one column in this "grouped" data that shows the percentage winners per year? and another column that shows the maximum consecutive losing trades per year?
Was trying to follow this example Understanding groupby in pandas, but couldn't figure out in my case how to do it by year.
Firstly Create a new DataFrame, then create necessary column according winners and losers:
new_df = pd.DataFrame()
new_df ['winners'] = df.groupby(df.index.year, as_index=True)['winners'].sum()
new_df ['losers'] = df.groupby(df.index.year, as_index=True)['losers'].sum()
Then with that, you can aggregate by winners, losers (which returns like indexed data) to calculate a percent of winners, losers.
You can do it like:
import pandas as pd
import numpy as np
dates = pd.date_range('19990101', periods=6000)
df = pd.DataFrame( np.random.randint(0,50,size=(6000,2)), index = dates)
df.columns = ['winners','losers']
new_df = pd.DataFrame()
new_df ['winners'] = df.groupby(df.index.year, as_index=True)['winners'].sum()
new_df ['losers'] = df.groupby(df.index.year, as_index=True)['losers'].sum()
new_df['winners_Percent'] = new_df['winners']/new_df['winners'].sum()
new_df['losers_Percent'] = new_df['losers']/new_df['losers'].sum()
Output:
Consider the following code:
import datetime
import pandas as pd
import numpy as np
todays_date = datetime.datetime.now().date()
index = pd.date_range(todays_date-datetime.timedelta(10), periods=10, freq='D')
columns = ['A','B', 'C']
df_ = pd.DataFrame(index=index, columns=columns)
df_ = df_.fillna(0) # with 0s rather than NaNs
data = np.array([np.arange(10)]*3).T
df = pd.DataFrame(data, index=index, columns=columns)
df
Here we create an empty DataFrame in Python using Pandas and then fill it to any extent. However, is it possible to add columns dynamically in a similar manner, i.e., for columns = ['A','B', 'C'], it must be possible to add columns D,E,F etc till a specified number.
I think the
pandas.DataFrame.append
method is what you are after.
e.g.
output_frame=input_frame.append(appended_frame)
There are additional examples in the documentation Pandas merge join and concatenate documentation