Function takes values from a dataframe as a parameter - python

I have a function which calculates the holidays for a given year, like this:
holidays = bf.Holidays(year)
The problem is, there is no way to edit the Holidays function, so I need another solution.
I have a dataframe with some years, for example:
year
0 2005
1 2011
2 2015
3 2017
Right now, if I do this:
yearX = year.get_value(0, 0)
and run
holidays = bf.Holidays(yearX)
it just calculates the holidays for the first year in the dataframe (2005).
How can I make the function take every year and append the results? Using a for loop?
Example of how it works now:
year = df['YEAR']
yearX = year.get_value(0,0)
holidays = bf.Holidays(year)
holidays = holidays.get_holiday_list()
print(holidays)
output:
DATE
2005-01-01
2005-03-25
2005-03-27
2005-03-28
2005-05-01
but I want it to calculate for every dataframe row, not only the first one.

Looks like you're looking for pandas.DataFrame.apply:
holidays = df.apply(bf.Holidays, axis=1)
It will apply function bf.Holidays to each row in your df DataFrame.
For the df from your question:
In [50]: df
Out[50]:
year
0 2010
1 2011
2 2015
3 2017
In [51]: def test(x):
    ...:     return x % 13
    ...:
In [52]: df.apply(test, axis=1)
Out[52]:
year
0 8
1 9
2 0
3 2

I think you can follow this example and just write a little wrapper function to return the dates to their respective columns:
def holiday_mapper(row):
    holidays = bf.Holidays(row['year'], 'HH').get_holiday_list()
    row['holiday1'], row['holiday2']...row['holidayN'] = holidays
    return row
df = df.apply(holiday_mapper, axis=1)
Assuming your get_holiday_list() function actually returns a list, and that you want to store the holiday dates in columns for each holiday, rather than append a single column with all the dates.
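If apply feels heavy for a single column, a plain loop (or comprehension) over the year column also works. A minimal sketch, using a stand-in get_holidays function since the bf.Holidays library isn't available here:

```python
import pandas as pd

def get_holidays(year):
    # Stand-in for bf.Holidays(year).get_holiday_list();
    # any function mapping a year to a list of dates works the same way.
    return [f"{year}-01-01", f"{year}-05-01"]

df = pd.DataFrame({'year': [2005, 2011, 2015, 2017]})

# Collect the holiday list of every year into one flat list
all_holidays = [d for y in df['year'] for d in get_holidays(y)]
print(all_holidays)
```

With the real library this would read `[d for y in df['year'] for d in bf.Holidays(y).get_holiday_list()]`.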


Groupby count per category per month (Current month vs Remaining past months) in separate columns in pandas

Let's say I have the following dataframe:
I am trying to get something like this.
I was thinking of maybe using the rolling function, having separate dataframes for each count type (current month and past 3 months), and then merging them based on ID.
I am new to Python and pandas, so please bear with me if it's a simple question. I am still learning :)
EDIT:
@furas so I started with calculating cumulative sums for all the counts as separate columns:
df['f_count_cum'] = df.groupby(["ID"])['f_count'].transform(lambda x: x.expanding().sum())
df['t_count_cum'] = df.groupby(["ID"])['t_count'].transform(lambda x: x.expanding().sum())
and then just get the current month df by
df_current = df[df.index == max(df.index)]
df_past_month = df[df.index == max(df.index) - 1]
and then just merge the two dataframes based on the ID?
I am not sure if it's correct, but this is my first take on this.
A few assumptions, looking at the input sample:
The Month index is of datetime64[ns] type. If not, please use the below to typecast it:
df['Month'] = pd.to_datetime(df.Month)
The Month column is the index. If not, please set it as the index:
df = df.set_index('Month')
The last month of the df is considered the current month, and the first 3 months the 'past 3 months'. If not, modify the last and first functions accordingly in df1 and df2 respectively.
Code
df1 = df.last('M').groupby('ID').sum().reset_index().rename(
    columns={'f_count': 'f_count(current month)',
             't_count': 't_count(current month)'})
df2 = df.first('3M').groupby('ID').sum().reset_index().rename(
    columns={'f_count': 'f_count(past 3 months)',
             't_count': 't_count(past 3 months)'})
df = pd.merge(df1, df2, on='ID', how='inner').reindex(columns=[
    'ID', 'f_count(current month)', 'f_count(past 3 months)',
    't_count(current month)', 't_count(past 3 months)'])
Output
ID f_count(current month) f_count(past 3 months) t_count(current month) t_count(past 3 months)
0 A 3 13 8 14
1 B 3 5 7 5
2 C 1 3 2 4
Another version of the same code, if you prefer a function and a single statement:
def get_df(freq):
    if freq == 'M':
        return df.last('M').groupby('ID').sum().reset_index()
    return df.first('3M').groupby('ID').sum().reset_index()

df = pd.merge(
    get_df('M').rename(columns={'f_count': 'f_count(current month)',
                                't_count': 't_count(current month)'}),
    get_df('3M').rename(columns={'f_count': 'f_count(past 3 months)',
                                 't_count': 't_count(past 3 months)'}),
    on='ID').reindex(columns=['ID',
                              'f_count(current month)', 'f_count(past 3 months)',
                              't_count(current month)', 't_count(past 3 months)'])
EDIT:
For the previous two months before the current month (we can use different combinations of the first and last functions as per our need):
df2 = df.last('3M').first('2M').groupby('ID').sum().reset_index().rename(
    columns={'f_count': 'f_count(past 3 months)',
             't_count': 't_count(past 3 months)'})
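Since first and last are deprecated in recent pandas, the same current-vs-past split can also be written with plain index comparisons. A sketch on made-up monthly data (the question's sample frame isn't shown, so the values here are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for the question's data: a monthly panel per ID
months = pd.to_datetime(['2021-01-31', '2021-02-28', '2021-03-31', '2021-04-30'])
df = pd.concat([
    pd.DataFrame({'ID': 'A', 'f_count': [1, 2, 3, 4], 't_count': [1, 1, 1, 1]}, index=months),
    pd.DataFrame({'ID': 'B', 'f_count': [5, 6, 7, 8], 't_count': [2, 2, 2, 2]}, index=months),
]).rename_axis('Month').sort_index()

# Last month = current month; first three distinct months = the past window
current = df[df.index == df.index.max()]
past3 = df[df.index.isin(df.index.unique().sort_values()[:3])]

out = (current.groupby('ID')[['f_count', 't_count']].sum()
       .join(past3.groupby('ID')[['f_count', 't_count']].sum(),
             lsuffix='(current month)', rsuffix='(past 3 months)'))
print(out)
```

The join suffixes reproduce the column names from the answer above without any deprecated calls.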

Function that concatenates a string and a integer converts to float before converting to string

I've written an if-else function that, based on the number of the quarter, concatenates a string (e.g. '1/1/') with an integer converted to a string (e.g. str(2017)). I have three data frames I want to use this on. Two of the data frames produce the expected result (e.g. '1/1/2017'). The last data frame produces '1/1/2017.0', which makes it not convert to a datetime.
I'm at a loss because, based on the dtypes, all three dataframes list both quarter and year as int64, and all three dataframes originally come from the same csv.
My first guess was that I had converted my years to a float at some point while preparing the last data frame. I tried to ensure that the year column was an integer with .astype(). The year column is listed under .dtypes as int64 both before and after the function is applied.
Data Frame
from pandas import DataFrame

Data = {'quarter': [1, 2, 3, 4],
        'year': [2017, 2017, 2017, 2017]}
df = DataFrame(Data, columns=['quarter', 'year'])
This is the function I am using
def f(row):
    if row['quarter'] == 1:
        val = '1/1/' + str(row['year'])
    elif row['quarter'] == 2:
        val = '4/1/' + str(row['year'])
    elif row['quarter'] == 3:
        val = '7/1/' + str(row['year'])
    else:
        val = '10/1/' + str(row['year'])
    return val
My expected result would be '1/1/2017', '4/1/2017', '7/1/2017', '10/1/2017'
I don't receive any error messages or warnings.
Not sure why your code doesn't work with the third dataset, but you could use pandas' functions instead of writing your own. It might resolve your problem.
>>> df['date'] = pd.to_datetime(
... df['year'].astype(str).str.cat(df['quarter'].astype(str), sep='Q'))
>>> df
quarter year date
0 1 2017 2017-01-01
1 2 2017 2017-04-01
2 3 2017 2017-07-01
3 4 2017 2017-10-01
You could change date format like:
>>> df['date'].dt.strftime('%m/%d/%Y')
0 01/01/2017
1 04/01/2017
2 07/01/2017
3 10/01/2017
Name: date, dtype: object
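One likely explanation for the trailing .0, sketched under the assumption that the third dataframe also carries a float column somewhere: df.apply(..., axis=1) hands each row over as a Series, and a Series has a single common dtype. If any column is float64, the whole row is upcast to float, so row['year'] becomes 2017.0 and str() keeps the decimal:

```python
import pandas as pd

clean = pd.DataFrame({'quarter': [1], 'year': [2017]})
mixed = pd.DataFrame({'quarter': [1], 'year': [2017], 'ratio': [0.5]})  # one float column

# All-int frame: the row Series stays int64, so str() gives '2017'
print(clean.apply(lambda row: str(row['year']), axis=1)[0])  # '2017'
# Mixed frame: the row Series is upcast to float64, so str() gives '2017.0'
print(mixed.apply(lambda row: str(row['year']), axis=1)[0])  # '2017.0'
```

Casting the column with .astype(int) doesn't help here, because the upcast happens per row inside apply; selecting only the needed columns (e.g. df[['quarter', 'year']].apply(f, axis=1)) or building the string vectorially avoids it.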

Changing format of date in pandas dataframe

I have a pandas dataframe, in which a column is a string formatted as
yyyymmdd
which should be a date. Is there an easy way to convert it to a recognizable form of date?
And then what python libraries should I use to handle them?
Let's say, for example, that I would like to consider all the events (rows) whose date field is a working day (so mon-fri). What is the smoothest way to handle such a task?
Ok, so you want to select Monday-Friday. Do that by converting your column to datetime and checking whether dt.dayofweek is less than 5 (Monday-Friday --> 0-4):
m = pd.to_datetime(df['date']).dt.dayofweek < 5
df2 = df[m]
Full example:
import pandas as pd
df = pd.DataFrame({
    'date': ['20180101', '20180102', '20180103', '20180104',
             '20180105', '20180106', '20180107'],
    'value': range(7)
})
m = pd.to_datetime(df['date']).dt.dayofweek < 5
df2 = df[m]
print(df2)
Returns:
date value
0 20180101 0
1 20180102 1
2 20180103 2
3 20180104 3
4 20180105 4
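Since the column is in a fixed yyyymmdd layout, it may be safer (and faster) to pass an explicit format to pd.to_datetime rather than let pandas guess. A small sketch on a shortened version of the same frame:

```python
import pandas as pd

df = pd.DataFrame({'date': ['20180101', '20180106', '20180107'],
                   'value': [0, 5, 6]})

# Parse explicitly as yyyymmdd, then keep weekdays only (Mon-Fri -> 0-4)
dates = pd.to_datetime(df['date'], format='%Y%m%d')
weekdays_only = df[dates.dt.dayofweek < 5]
print(weekdays_only)
```

2018-01-01 was a Monday and 2018-01-06/07 a weekend, so only the first row survives the filter.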

first date of week greater than x

I would like to get the first Monday in July that falls on or after July 10th, for a list of dates, and I am wondering if there's an elegant solution that avoids for loops/list comprehensions.
Here is my code so far, which gives all July Mondays on or after the 10th:
import pandas as pd

last_date = '08-Jul-2016'
monday2_dates = pd.date_range('1-Jan-1999', last_date, freq='W-MON')
g1 = pd.DataFrame(1.0, columns=['dummy'], index=monday2_dates)
g1 = g1.loc[(g1.index.month == 7) & (g1.index.day >= 10)]
IIUC you can do it this way:
Get the list of 2nd Mondays within the specified date range:
In [116]: rng = pd.date_range('1-Jan-1999', last_date, freq='WOM-2MON')
Filter them so that we have only those in July with day >= 10:
In [117]: rng = rng[(rng.month == 7) & (rng.day >= 10)]
Create a corresponding DF:
In [118]: df = pd.DataFrame({'dummy': [1] * len(rng)}, index=rng)
In [119]: df
Out[119]:
dummy
1999-07-12 1
2000-07-10 1
2003-07-14 1
2004-07-12 1
2005-07-11 1
2006-07-10 1
2008-07-14 1
2009-07-13 1
2010-07-12 1
2011-07-11 1
2014-07-14 1
2015-07-13 1
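Put together as a self-contained script (the WOM-2MON alias means "week of month: the 2nd Monday of each month"):

```python
import pandas as pd

last_date = '08-Jul-2016'

# Second Monday of every month in the range
rng = pd.date_range('1-Jan-1999', last_date, freq='WOM-2MON')

# Keep only July dates on or after the 10th
rng = rng[(rng.month == 7) & (rng.day >= 10)]

df = pd.DataFrame({'dummy': [1] * len(rng)}, index=rng)
print(df)
```

Years where the 2nd Monday of July falls before the 10th (e.g. 2001, 2002, 2007) drop out of the result, which is exactly the filtering the question asks for.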

Aggregating unbalanced panel to time series using pandas

I have an unbalanced panel that I'm trying to aggregate up to a regular, weekly time series. The panel looks as follows:
Group Date value
A 1/1/2000 5
A 1/17/2000 10
B 1/9/2000 3
B 1/23/2000 7
C 1/22/2000 20
To give a better sense of what I'm looking for, I'm including an intermediate step, which I'd love to skip if possible. Basically some data needs to be filled in so that it can be aggregated. As you can see, missing weeks in between observations are interpolated. All other values are set equal to zero.
Group Date value
A 1/1/2000 5
A 1/8/2000 5
A 1/15/2000 10
A 1/22/2000 0
B 1/1/2000 0
B 1/8/2000 3
B 1/15/2000 3
B 1/22/2000 7
C 1/1/2000 0
C 1/8/2000 0
C 1/15/2000 0
C 1/22/2000 20
The final result that I'm looking for is as follows:
Date value
1/1/2000 5 = 5 + 0 + 0
1/8/2000 8 = 5 + 3 + 0
1/15/2000 13 = 10 + 3 + 0
1/22/2000 27 = 0 + 7 + 20
I haven't gotten very far; I managed to create a panel:
panel = df.set_index(['Group','week']).to_panel()
Unfortunately, if I try to resample, I get an error:
panel.resample('W')
TypeError: Only valid with DatetimeIndex or PeriodIndex
Assuming df is your second dataframe with weeks, you can try the following:
df.groupby('week').sum()['value']
The documentation for groupby() and its application is here. It's similar to the GROUP BY clause in SQL.
To obtain the second dataframe from the first one, try the following:
Firstly, prepare a function to map the day to week
def d2w_map(day):
    if day <= 7:
        return 1
    elif day <= 14:
        return 2
    elif day <= 21:
        return 3
    else:
        return 4
In the method above, days from 29 to 31 are considered in week 4. But you get the idea. You can modify it as needed.
Secondly, take the lists out from the first dataframe, and convert days to weeks
df['Week'] = df['Day'].apply(d2w_map)
del df['Day']
Thirdly, initialize your second dataframe with only the 'Group' and 'Week' columns, leaving 'value' out. Assuming your initialized new dataframe is result, you can now do a join:
result = result.join(df, on=['Group', 'Week'])
Last, write a function to fill the NaNs in the 'value' column with the nearby elements. The NaNs are what you need to interpolate. Since I am not sure how you want the interpolation to work, I will leave it to you.
Here is how you can change d2w_map to convert a date string to an integer day of the week:
from datetime import datetime

def d2w_map(day_str):
    return datetime.strptime(day_str, '%m/%d/%Y').weekday()
Returned value of 0 means Monday, 1 means Tuesday and so on.
If you have the package dateutil installed, the function can be more robust:
from dateutil.parser import parse

def d2w_map(day_str):
    return parse(day_str).weekday()
Sometimes, things you want are already implemented by magic :)
Turns out the key is to resample a groupby object like so:
df_temp = (df.set_index('date')
             .groupby('Group')
             .resample('W', how='sum', fill_method='ffill'))
ts = (df_temp.reset_index()
             .groupby('date')
             .sum()['value'])
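The how= and fill_method= keywords have since been removed from resample; in current pandas the same idea reads as a method chain. A sketch with the sample data inlined (this gives plain weekly totals; the forward-fill interpolation step from fill_method='ffill' would need a separate ffill() before summing):

```python
import pandas as pd
from io import StringIO

data = """Group,Date,value
A,1/1/2000,5
A,1/17/2000,10
B,1/9/2000,3
B,1/23/2000,7
C,1/22/2000,20"""

df = pd.read_csv(StringIO(data))
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')

# Weekly sum per group, then collapse the groups into one weekly series
ts = (df.set_index('Date')
        .groupby('Group')['value']
        .resample('W').sum()
        .groupby(level='Date').sum())
print(ts)
```

Empty weeks inside a group sum to 0 rather than NaN in modern pandas, so the week of 2000-01-16 shows up as 0.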
Used this tab-delimited test.txt:
Group Date value
A 1/1/2000 5
A 1/17/2000 10
B 1/9/2000 3
B 1/23/2000 7
C 1/22/2000 20
You can skip the intermediate dataframe as follows. I don't have time now, so just play around with it to get it right.
import pandas as pd
import datetime

time_format = '%m/%d/%Y'
Y = pd.read_csv('test.txt', sep="\t")
dates = Y['Date']
# map() returns an iterator in Python 3, so build a list explicitly
dates_right_format = [datetime.datetime.strptime(s, time_format) for s in dates]
values = Y['value']
X = pd.DataFrame(values)
X.index = dates_right_format
print(X)
X = X.sort_index()  # DataFrame.sort() is gone; sort by the date index
print(X)
print(X.resample('W', closed='right', label='right').sum())
Last print:
value
2000-01-02 5
2000-01-09 3
2000-01-16 NaN
2000-01-23 37
(recent pandas versions report 0 rather than NaN for the empty week)
