I am trying to add the values returned by my function to my dataset as columns. I have seven columns, which are:
'DATE','Max_R','Total_R','Avg_R','MAX_T','TOTAL_T','AVG_T'
Then I divided my DATE column into three columns as Day, Month and year respectively. Here is my code in python:
import pandas as pd
import numpy as np
df=pd.read_csv('moving_average_calculation.csv', sep=',')
#df = pd.DataFrame(columns=['DATE','Max_R','Total_R','Avg_R','MAX_T','TOTAL_T','AVG_T'])
df = pd.DataFrame(pd.date_range('1-Jan-08', periods=2558),columns=['DATE'])
def f(df):
    df = df.copy()
    df['Day'] = pd.DatetimeIndex(df['DATE']).day
    df['Month'] = pd.DatetimeIndex(df['DATE']).month
    df['Year'] = pd.DatetimeIndex(df['DATE']).year
    return df
print(f(df).head(10))
Now I want my dataframe to end up with these columns:
'Day','Month','Year','Max_R','Total_R','Avg_R','MAX_T','TOTAL_T','AVG_T'
How can I do this? Thank you.
Your question is a bit unclear, as you define df twice (+1 in a comment), but if I understand correctly (that is: you already have 'DATE' in the .csv file) this may help:
df = pd.read_csv('moving_average_calculation.csv', sep=',')
df['Day'] = pd.DatetimeIndex(df['DATE']).day
df['Month'] = pd.DatetimeIndex(df['DATE']).month
df['Year'] = pd.DatetimeIndex(df['DATE']).year
df.drop('DATE', axis=1, inplace=True)
df = df[['Day','Month','Year','Max_R','Total_R','Avg_R','MAX_T','TOTAL_T','AVG_T']]
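The same split can also be written with the `.dt` accessor after a single `pd.to_datetime` conversion. Here is a minimal sketch, assuming a small made-up frame in place of the CSV (only two of the value columns are included to keep it short):

```python
import pandas as pd

# Made-up stand-in for the CSV; the real file has more value columns.
df = pd.DataFrame({
    'DATE': ['2008-01-01', '2008-01-02', '2008-02-15'],
    'Max_R': [1.0, 2.0, 3.0],
    'AVG_T': [10.0, 11.0, 12.0],
})

# Parse the dates once, then read the parts through the .dt accessor.
dates = pd.to_datetime(df['DATE'])
parts = pd.DataFrame({
    'Day': dates.dt.day,
    'Month': dates.dt.month,
    'Year': dates.dt.year,
})

# Put the date parts first and drop the original DATE column.
result = pd.concat([parts, df.drop(columns='DATE')], axis=1)
print(result)
```

Building the three date columns first and concatenating avoids having to list every value column by hand when reordering.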
I am importing the data with this command
df = pd.read_excel('C:/Users/Me/Data.xlsx', sheet_name='Prices')
and this is the result:
The date is a common column and I want it like this:
I found the answer: add parse_dates=True, index_col=0 to the import command, like this:
df = pd.read_excel('C:/Users/Me/Data.xlsx', sheet_name='Prices', parse_dates=True, index_col=0)
The output is this:
What you are trying to do is to set Date as an index, if I get it right:
df = df.set_index('Date')  # set_index returns a new DataFrame, so assign the result
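One detail that is easy to miss is that `set_index` returns a new DataFrame and leaves the original untouched unless the result is assigned. A small sketch with made-up data (the `Price` column is hypothetical):

```python
import pandas as pd

# Made-up stand-in for the Excel sheet.
df = pd.DataFrame({
    'Date': pd.to_datetime(['2020-01-01', '2020-01-02']),
    'Price': [100.0, 101.5],
})

df.set_index('Date')            # result discarded: df is unchanged
still_a_column = 'Date' in df.columns

indexed = df.set_index('Date')  # keep the returned frame instead
print(indexed)
```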
I found a better way to import the data, because calculating the monthly returns did not work with the previous approach. With this new code the dates come out correctly:
df = pd.read_excel('C:/Users/Me/Data.xlsx', sheet_name='Prices')
df.index = pd.to_datetime(df['Date'])
df.drop(['Date'], axis = 'columns', inplace=True)
df.head()
and this is the result:
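Once the index is a DatetimeIndex, monthly returns can be computed by grouping on the calendar month. This is a minimal sketch, assuming a hypothetical `Close` column and made-up prices (neither comes from the original spreadsheet):

```python
import pandas as pd

# Made-up prices; 'Close' is a hypothetical column name.
raw = pd.DataFrame({
    'Date': ['2020-01-30', '2020-01-31', '2020-02-28', '2020-03-31'],
    'Close': [100.0, 110.0, 121.0, 108.9],
})

# Same idea as above: datetime index, then drop the original column.
df = raw.set_index(pd.to_datetime(raw['Date'])).drop(columns='Date')

# Last price of each month, then the month-over-month percentage change.
month_end = df['Close'].groupby(df.index.to_period('M')).last()
monthly_returns = month_end.pct_change()
print(monthly_returns)
```

Grouping on `to_period('M')` is used here instead of `resample` only to keep the sketch independent of the frequency alias spelling, which differs between pandas versions.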
I have a dataframe with several columns with dates - formatted as datetime.
I am trying to get the min/max value of a date, based on another date column being NaN
For now, I am doing this in two separate steps:
temp_df = df[df['date1'].isna()]  # "== np.nan" never matches, because NaN != NaN
max_date = max(temp_df['date2'])
temp_df = None
I get the result I want, but I am using an unnecessary temporary dataframe.
How can I do this without it?
Is there any reference material to read on this?
Thanks
Here is an MCVE that can be played with to obtain statistics from other columns where the value in one isnull() (NaN or NaT). This can be done in a one-liner.
import pandas as pd
import numpy as np
print(pd.__version__)
# sample date columns
daterange1 = pd.date_range('2017-01-01', '2018-01-01', freq='MS')
daterange2 = pd.date_range('2017-04-01', '2017-07-01', freq='MS')
daterange3 = pd.date_range('2017-06-01', '2018-02-01', freq='MS')
df1 = pd.DataFrame(data={'date1': daterange1})
df2 = pd.DataFrame(data={'date2': daterange2})
df3 = pd.DataFrame(data={'date3': daterange3})
# jam them together, making NaT's in non-overlapping ranges
df = pd.concat([df1, df2, df3], axis=0, sort=False)
df.reset_index(inplace=True)
max_date = df[(df['date1'].isnull())]['date2'].max()
print(max_date)
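The reason for using `isnull()` (or its alias `isna()`) rather than an equality test is that NaN and NaT never compare equal to anything, including themselves. A small sketch with made-up dates shows the difference and gets both the min and the max without a temporary dataframe:

```python
import numpy as np
import pandas as pd

# Made-up data: date1 is missing in the last two rows.
df = pd.DataFrame({
    'date1': pd.to_datetime(['2017-01-01', None, None]),
    'date2': pd.to_datetime(['2017-04-01', '2017-05-01', '2017-06-01']),
})

eq_mask = df['date1'] == np.nan   # equality never matches missing values
na_mask = df['date1'].isna()      # True where date1 is NaT

# Boolean mask and column selection in one step, no temporary frame.
max_date = df.loc[na_mask, 'date2'].max()
min_date = df.loc[na_mask, 'date2'].min()
print(min_date, max_date)
```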
I have a dataframe with multiple columns along with a date column. The date format is 12/31/15 and I have set it as a datetime object.
I set the datetime column as the index and want to perform a regression calculation for each month of the dataframe.
I believe the methodology to do this would be to split the dataframe into multiple dataframes based on month, store into a list of dataframes, then perform regression on each dataframe in the list.
I have used groupby which successfully split the dataframe by month, but am unsure how to correctly convert each group in the groupby object into a dataframe to be able to run my regression function on it.
Does anyone know how to split a dataframe into multiple dataframes based on date, or a better approach to my problem?
Here is my code I've written so far
import pandas as pd
import numpy as np
import statsmodels.api as sm
from patsy import dmatrices
df = pd.read_csv('data.csv')
df['date'] = pd.to_datetime(df['date'], format='%Y%m%d')
df = df.set_index('date')
# Group dataframe on index by month and year
# Groupby works, but dmatrices does not
for df_group in df.groupby(pd.TimeGrouper("M")):
    y,X = dmatrices('value1 ~ value2 + value3', data=df_group,
                    return_type='dataframe')
If you must loop, you need to unpack the key and the dataframe when you iterate over a groupby object:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from patsy import dmatrices
df = pd.read_csv('data.csv')
df['date'] = pd.to_datetime(df['date'], format='%Y%m%d')
df = df.set_index('date')
Note the use of group_name here:
for group_name, df_group in df.groupby(pd.Grouper(freq='M')):
    y,X = dmatrices('value1 ~ value2 + value3', data=df_group,
                    return_type='dataframe')
If you want to avoid iteration, do have a look at the notebook in Paul H's gist (see his comment), but a simple example of using apply would be:
def do_regression(df_group, ret='outcome'):
    """Apply the function to each group in the data and return one result."""
    y,X = dmatrices('value1 ~ value2 + value3',
                    data=df_group,
                    return_type='dataframe')
    if ret == 'outcome':
        return y
    else:
        return X
outcome = df.groupby(pd.Grouper(freq='M')).apply(do_regression, ret='outcome')
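If the goal is literally one dataframe per month, the groupby iteration can be collected into a dict. The sketch below is an illustration under assumptions, not the original poster's setup: the data is synthetic, the groups are keyed by `to_period('M')` rather than `pd.Grouper`, and a plain NumPy least-squares fit stands in for the patsy/statsmodels regression so that it runs without those packages. `fit_ols` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Synthetic, noise-free data: value1 = 1 + 2*value2 - 3*value3.
idx = pd.date_range('2015-01-01', '2015-02-28', freq='D')
rng = np.random.default_rng(0)
df = pd.DataFrame({'value2': rng.normal(size=len(idx)),
                   'value3': rng.normal(size=len(idx))}, index=idx)
df['value1'] = 1.0 + 2.0 * df['value2'] - 3.0 * df['value3']

# One DataFrame per calendar month, keyed by a string such as '2015-01'.
monthly_frames = {str(period): group
                  for period, group in df.groupby(df.index.to_period('M'))}

def fit_ols(group):
    """Least-squares fit of value1 on an intercept, value2 and value3."""
    X = np.column_stack([np.ones(len(group)), group['value2'], group['value3']])
    coef, *_ = np.linalg.lstsq(X, group['value1'].to_numpy(), rcond=None)
    return pd.Series(coef, index=['Intercept', 'value2', 'value3'])

# One row of coefficients per month.
coefs = pd.DataFrame({key: fit_ols(frame)
                      for key, frame in monthly_frames.items()}).T
print(coefs)
```

Because the synthetic data has no noise, each month recovers the same coefficients; with real data the rows would differ from month to month.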
This is a split per year.
import pandas as pd
import dateutil.parser
dfile = 'rg_unificado.csv'
df = pd.read_csv(dfile, sep='|', quotechar='"', encoding='latin-1')
df['FECHA'] = df['FECHA'].apply(lambda x: dateutil.parser.parse(x))
#http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
#use to_period
per = df['FECHA'].dt.to_period("Y")
#group by that period
agg = df.groupby([per])
for year, group in agg:
    # this simply saves the data, one file per year
    datep = str(year).replace('-', '')
    filename = '%s_%s.csv' % (dfile.replace('.csv', ''), datep)
    group.to_csv(filename, sep='|', quotechar='"', encoding='latin-1', index=False, header=True)
I have a dataframe defined as follows:
import datetime
import pandas as pd
import random
import numpy as np
todays_date = datetime.datetime.today().date()
index = pd.date_range(todays_date - datetime.timedelta(10), periods=10, freq='D')
index = index.append(index)
idname = ['A']*10 + ['B']*10
values = random.sample(range(100), 20)  # range replaces Python 2's xrange
data = np.vstack((idname, values)).T
tmp_df = pd.DataFrame(data, columns=['id', 'value'])
tmp_index = pd.DataFrame(index, columns=['date'])
tmp_df = pd.concat([tmp_index, tmp_df], axis=1)
tmp_df = tmp_df.set_index('date')
Note that there are 2 values for each date. I would like to resample the dataframe tmp_df on a weekly basis but keep the two separate values. I tried tmp_df.resample('W-FRI') but it doesn't seem to work.
The solution you're looking for is groupby, which lets you perform operations on dataframe slices (here 'A' and 'B') independently:
tmp_df.groupby('id').resample('W-FRI')
Note: your code produces an error (No numeric types to aggregate) because the 'value' column is stored as strings rather than numbers. You need to convert it first. In current pandas, resample is also lazy, so follow it with an aggregation such as .sum() or .mean():
tmp_df['value'] = pd.to_numeric(tmp_df['value'])
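Putting the two steps together, here is a minimal sketch with fixed, made-up dates and values. It uses `pd.Grouper` inside `groupby`, which is an alternative way to express the same per-id weekly aggregation as `groupby('id').resample('W-FRI')`; the choice of `sum` as the aggregation is an assumption, since the question does not say which one is wanted.

```python
import pandas as pd

# Ten consecutive days starting on Monday 2024-01-01, repeated for ids A and B.
index = pd.date_range('2024-01-01', periods=10, freq='D')
tmp_df = pd.DataFrame({
    'id': ['A'] * 10 + ['B'] * 10,
    # Stored as strings on purpose, mimicking the np.vstack result in the question.
    'value': [str(v) for v in range(1, 21)],
}, index=index.append(index))
tmp_df.index.name = 'date'

# Convert to numbers first; string columns cannot be aggregated numerically.
tmp_df['value'] = pd.to_numeric(tmp_df['value'])

# One weekly (Friday-ending) total per id.
weekly = tmp_df.groupby(['id', pd.Grouper(freq='W-FRI')])['value'].sum()
print(weekly)
```

The result is a Series with an (id, week-ending date) MultiIndex, so the A and B series stay separate.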
Let's say I have the following info about the number of trades done in the past, and I group them by year:
import pandas as pd
import numpy as np
dates = pd.date_range('19990101', periods=6000)
df = pd.DataFrame(np.random.randint(0,50,size=(6000,2)), index = dates)
df.columns = ['winners','losers']
grouped = df.groupby(lambda x: x.year)
print(grouped.sum())
How can I generate one column in this "grouped" data that shows the percentage winners per year? and another column that shows the maximum consecutive losing trades per year?
I was trying to follow the example in "Understanding groupby in pandas", but couldn't figure out how to do it by year in my case.
First create a new DataFrame, then add the yearly totals of winners and losers as columns:
new_df = pd.DataFrame()
new_df['winners'] = df.groupby(df.index.year, as_index=True)['winners'].sum()
new_df['losers'] = df.groupby(df.index.year, as_index=True)['losers'].sum()
Both results are indexed by year, so you can then divide each column by its total to get each year's share of winners and losers.
You can do it like:
import pandas as pd
import numpy as np
dates = pd.date_range('19990101', periods=6000)
df = pd.DataFrame(np.random.randint(0,50,size=(6000,2)), index = dates)
df.columns = ['winners','losers']
new_df = pd.DataFrame()
new_df['winners'] = df.groupby(df.index.year, as_index=True)['winners'].sum()
new_df['losers'] = df.groupby(df.index.year, as_index=True)['losers'].sum()
new_df['winners_Percent'] = new_df['winners']/new_df['winners'].sum()
new_df['losers_Percent'] = new_df['losers']/new_df['losers'].sum()
Output:
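Note that the two percent columns above express each year's count as a share of the all-years total. If "percentage winners per year" is instead read as winners / (winners + losers) within each year, a sketch could look like the following. The data is made up, the column names `winners_pct` and `max_losing_streak` and the helper `longest_streak` are hypothetical, and the streak part rests on an assumption the question does not settle: a "losing day" is taken to be a day with more losers than winners.

```python
import pandas as pd

# Small made-up sample spanning two calendar years.
dates = pd.date_range('1999-12-29', periods=6)
df = pd.DataFrame({'winners': [3, 1, 1, 4, 0, 2],
                   'losers':  [1, 2, 3, 4, 5, 1]}, index=dates)

# Yearly totals and the share of winners among all trades in that year.
yearly = df.groupby(df.index.year).sum()
yearly['winners_pct'] = 100 * yearly['winners'] / (yearly['winners'] + yearly['losers'])

# ASSUMPTION: a losing day is one where losers outnumber winners.
is_loss = df['losers'] > df['winners']

def longest_streak(flags):
    """Length of the longest run of consecutive True values."""
    run_id = (~flags).cumsum()   # the id changes at every False value
    return int(flags.groupby(run_id).sum().max())

yearly['max_losing_streak'] = is_loss.groupby(is_loss.index.year).apply(longest_streak)
print(yearly)
```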