I need to calculate the compound interest rate. Let's say I have a DataFrame like this:
   days
1    10
2    15
3    20
What I want to get is this (suppose the interest rate is 1% every day):
   days  interest rate
1    10         10.46%
2    15         16.10%
3    20         22.02%
My code is as follows:
def inclusao_juros(x):
    dias = df_arrumada_4['Prazo Medio']
    return ((1.0009723)^dias) - 1

df_arrumada_4['juros_acumulado'] = df_arrumada_4['Prazo Medio'].apply(inclusao_juros)
What should I do? Thanks!
I think you need numpy.power:
import numpy as np

df['new'] = np.power(1.01, df['days']) - 1
print (df)
   days       new
1    10  0.104622
2    15  0.160969
3    20  0.220190
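For context: in Python, ^ is the bitwise XOR operator, so ((1.0009723)^dias) raises a TypeError on floats; exponentiation is **. A minimal sketch of the fixed apply-based version, assuming the original column names:

def inclusao_juros(x):
    # x is a single 'Prazo Medio' value; use ** (power), not ^ (XOR)
    return (1.0009723 ** x) - 1

df_arrumada_4['juros_acumulado'] = df_arrumada_4['Prazo Medio'].apply(inclusao_juros)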
IIUC (if I understand correctly):
pd.Series([1.01]*len(df)).pow(df.reset_index().days,0).sub(1)
Out[695]:
0 0.104622
1 0.160969
2 0.220190
dtype: float64
Jez's version, keeping the original index:
pd.Series([1.01] * len(df), index=df.index).pow(df.days, 0).sub(1)
Or, using your apply approach:
df.days.apply(lambda x: 1.01**x -1)
Out[697]:
1 0.104622
2 0.160969
3 0.220190
Name: days, dtype: float64
I have the following df:
Week Sales
1 10
2 15
3 10
4 20
5 20
6 10
7 15
8 10
I would like to group every 3 weeks and sum up sales, starting with the bottom 3 weeks. If fewer than 3 weeks are left at the top, as in this example, those weeks should be ignored. Desired output is this:
Week Sales
5-3 50
8-6 35
I tried this on my original df:
df.reset_index(drop=True).groupby(by=lambda x: x // N, axis=0).sum()
but this solution does not start from the bottom rows.
Can anyone point me in the right direction here? Thanks!
You can try reversing the data with .iloc[::-1]:
import numpy as np

N = 3
(df.iloc[::-1]
   .groupby(np.arange(len(df)) // N)
   .agg({'Week': lambda x: f'{x.iloc[0]}-{x.iloc[-1]}',
         'Sales': 'sum'})
)
Output:
Week Sales
0 8-6 35
1 5-3 50
2 2-1 25
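Note that this output still contains the incomplete top group (2-1). A hedged sketch of one way to drop groups smaller than N, assuming the same df, N, and np import as above:

rev = df.iloc[::-1]
out = (rev.groupby(np.arange(len(rev)) // N)
          .agg(Week=('Week', lambda x: f'{x.iloc[0]}-{x.iloc[-1]}'),
               Sales=('Sales', 'sum'),
               n=('Week', 'size')))           # group size, to spot partial bins
out = out[out['n'] == N].drop(columns='n')    # keep only complete 3-week groups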
When dealing with period aggregation, I usually use .resample, as it is flexible for binning data over different time periods.
import io
from datetime import timedelta
import pandas as pd
dataf = pd.read_csv(io.StringIO("""Week Sales
1 10
2 15
3 10
4 20
5 20
6 10
7 15
8 10"""), sep='\s+',).astype(int)
# reverse data and transform int weeks to actual date time
dataf = dataf.iloc[::-1]
dataf['Week'] = dataf['Week'].map(lambda x: timedelta(weeks=x))
# set date object to index for resampling
dataf = dataf.set_index('Week')
# now we resample
dataf.resample('21D').sum()  # 21 days = 3 weeks
Note: the resulting bin labels are misleading, and setting kind='period' raises an error.
I have some monthly data with a date column in the format: YYYY.fractional month. For example:
0 1960.500
1 1960.583
2 1960.667
3 1960.750
4 1960.833
5 1960.917
Where the first index is June, 1960 (6/12=.5), the second is July, 1960 (7/12=.583) and so on.
The answers in this question don't seem to apply well, though I feel like pd.to_datetime should be able to help somehow. Obviously I can use a map to split this into components and build a datetime, but I'm hoping for a faster and more rigorous method since the data is large.
I think you need a bit of math:
a = df['date'].astype(int)
print (a)
0 1960
1 1960
2 1960
3 1960
4 1960
5 1960
Name: date, dtype: int32
b = df['date'].sub(a).add(1/12).mul(12).round(0).astype(int)
print (b)
0 7
1 8
2 9
3 10
4 11
5 12
Name: date, dtype: int32
c = pd.to_datetime(a.astype(str) + '.' + b.astype(str), format='%Y.%m')
print (c)
0 1960-07-01
1 1960-08-01
2 1960-09-01
3 1960-10-01
4 1960-11-01
5 1960-12-01
Name: date, dtype: datetime64[ns]
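A hedged alternative that skips the string round-trip: pd.to_datetime also accepts a DataFrame with year/month/day columns, so (assuming the a and b computed above):

c = pd.to_datetime(pd.DataFrame({'year': a, 'month': b, 'day': 1}))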
Solution with map:
d = {'500':'7','583':'8','667':'9','750':'10','833':'11','917':'12'}
#if necessary
#df['date'] = df['date'].astype(str)
a = df['date'].str[:4]
b = df['date'].str[5:].map(d)
c = pd.to_datetime(a + '.' + b, format='%Y.%m')
print (c)
0 1960-07-01
1 1960-08-01
2 1960-09-01
3 1960-10-01
4 1960-11-01
5 1960-12-01
Name: date, dtype: datetime64[ns]
For future reference, here's the map I was using before. I actually made a mistake in the question; the data is set so that January 1960 is 1960.0, which means 1/12 must be added to each fractional component.
import datetime

def date_conv(d):
    y, frac_m = str(d).split('.')
    y = int(y)
    m = int(round((float('0.{}'.format(frac_m)) + 1/12) * 12, 0))
    d = 1
    try:
        date = datetime.datetime(year=y, month=m, day=d)
    except ValueError:
        print(y, m, frac_m)
        raise
    return date

dates_series = dates_series.map(lambda d: date_conv(d))
The try/except block was just something I added for troubleshooting while writing it.
I have a table that has multiple subgroups. For example, person A has a total of three visits and person B has a total of two visits. I also have the time of each visit:
id visit time_of_visit
A 1 2002-01-15
A 2 2003-01-15
A 3 2003-02-15
B 1 1996-08-09
B 2 1998-08-09
I want to compute how long apart each visit is in terms of years for each person. So I want something like this:
id visit time_of_visit difference_in_time
A 1 2002-01-15 na
A 2 2003-01-15 1
A 3 2003-02-15 0.0833
B 1 1996-08-09 na
B 2 1998-08-09 2
Any ideas how to do this in python pandas? Thanks!
groupby.diff on a datetime column will give you the differences as timedeltas:
df['time_of_visit'] = pd.to_datetime(df['time_of_visit'])
df.groupby('id')['time_of_visit'].diff()
Out:
0 NaT
1 365 days
2 31 days
3 NaT
4 730 days
Name: time_of_visit, dtype: timedelta64[ns]
However, timedeltas cannot be expressed in years, since a year is not a fixed-length unit. You can always convert by your own rule, of course (for example, divide the day count by 365).
df.groupby('id')['time_of_visit'].diff().dt.days / 365
Out:
0 NaN
1 1.000000
2 0.084932
3 NaN
4 2.000000
Name: time_of_visit, dtype: float64
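Putting it together to reproduce the requested difference_in_time column (a minimal sketch; the frame is reconstructed from the sample in the question):

import pandas as pd

df = pd.DataFrame({'id': ['A', 'A', 'A', 'B', 'B'],
                   'visit': [1, 2, 3, 1, 2],
                   'time_of_visit': ['2002-01-15', '2003-01-15', '2003-02-15',
                                     '1996-08-09', '1998-08-09']})
df['time_of_visit'] = pd.to_datetime(df['time_of_visit'])
# difference to the previous visit of the same person, in 365-day years
df['difference_in_time'] = df.groupby('id')['time_of_visit'].diff().dt.days / 365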
This question is related to my previous question. I have the following dataframe:
df =
QUEUE_1  QUEUE_2  DAY  HOUR  TOTAL_SERVICE_TIME  TOTAL_WAIT_TIME  EVAL
ABC123   DEF656     1     7                  20               30     1
ABC123              1     7                  22               32     0
DEF656   ABC123     1     8                  15               12     0
FED456   DEF656     2     8                  15               16     1
I need to get the following dataframe (it's similar to the one I wanted to get in my previous question, but here I need to add 2 additional columns AVG_COUNT_PER_DAY_HOUR and AVG_PERCENT_EVAL_1).
QUEUE HOUR AVG_TOT_SERVICE_TIME AVG_TOT_WAIT_TIME AVG_COUNT_PER_DAY_HOUR AVG_PERCENT_EVAL_1
ABC123 7 21 31 1 50
ABC123 8 15 12 0.5 100
DEF656 7 20 30 0.5 100
DEF656 8 15 14 1 50
FED456 7 0 0 0 0
FED456 8 15 14 0.5 100
The column AVG_COUNT_PER_DAY_HOUR should contain the average count of a corresponding HOUR value over days (DAY) grouped by QUEUE. For example, in df, in case of ABC123, the HOUR 7 appears 2 times for the DAY 1 and 0 times for the DAY 2. Therefore the average is 1. The same logic is applied to the HOUR 8. It appears 1 time in DAY 1 and 0 times in DAY 2 for ABC123. Therefore the average is 0.5.
The column AVG_PERCENT_EVAL_1 should contain the percent of EVAL equal to 1 over hours, grouped by QUEUE. For example, in case of ABC123, the EVAL is equal to 1 one time when HOUR is 7. It is also equal to 0 one time when HOUR is 7. So, AVG_PERCENT_EVAL_1 is 50 for ABC123 and hour 7.
I use this approach:
df = pd.lreshape(df, {'QUEUE': df.columns[df.columns.str.startswith('QUEUE')].tolist()})
piv_df = df.pivot_table(index=['QUEUE'], columns=['HOUR'], fill_value=0)
result = piv_df.stack().add_prefix('AVG_').reset_index()
I get stuck with adding the columns AVG_COUNT_PER_DAY_HOUR and AVG_PERCENT_EVAL_1. For instance, to add the column AVG_COUNT_PER_DAY_HOUR I am thinking of using .apply(pd.value_counts, 1).notnull().groupby(level=0).sum().astype(int), while for calculating AVG_PERCENT_EVAL_1 I am thinking of using [df.EVAL==1].agg({'EVAL' : 'count'}). However, I don't know how to incorporate these into my current code to get the correct result.
UPDATE:
Perhaps it is easier to adapt this solution to what I need in this question:
result = pd.lreshape(df, {'QUEUE': ['QUEUE_1', 'QUEUE_2']})
mux = pd.MultiIndex.from_product([result.QUEUE.dropna().unique(),
                                  result.DAY.dropna().unique(),
                                  result.HOUR.dropna().unique()],
                                 names=['QUEUE', 'DAY', 'HOUR'])
print (result.groupby(['QUEUE', 'DAY', 'HOUR'])
             .mean()
             .reindex(mux, fill_value=0)
             .add_prefix('AVG_')
             .reset_index())
Steps:
1) To compute AVG_COUNT_PER_DAY_HOUR:
With pd.crosstab(), compute the counts of HOUR per DAY (so that missing days also show up), grouped by QUEUE.
Stack the DF so that HOUR, which was part of a hierarchical column before, becomes an index level, leaving just DAY as columns. Take the mean across the DAY columns after filling NaNs with 0.
2) To compute AVG_PERCENT_EVAL_1:
After getting the pivoted frame (same as before): since EVAL is binary (1/0) and the pivot already computed means (default agg=np.mean), its mean is the fraction of rows with EVAL equal to 1. So we simply take EVAL from this DF and multiply it by 100.
Finally, we join all these frames.
same as in the linked post:
df = pd.lreshape(df, {'QUEUE': df.columns[df.columns.str.startswith('QUEUE')].tolist()})
piv_df = df.pivot_table(index='QUEUE', columns='HOUR', fill_value=0).stack()
avg_tot = piv_df[['TOTAL_SERVICE_TIME', 'TOTAL_WAIT_TIME']].add_prefix("AVG_")
additional portion:
avg_cnt = pd.crosstab(df['QUEUE'], [df['DAY'], df['HOUR']]).stack().fillna(0).mean(1)
avg_pct = piv_df['EVAL'].mul(100).astype(int)
avg_tot.join(
avg_cnt.to_frame("AVG_COUNT_PER_DAY_HOUR")
).join(avg_pct.to_frame("AVG_PERCENT_EVAL_1")).reset_index()
avg_cnt looks like:
QUEUE HOUR
ABC123 7 1.0
8 0.5
DEF656 7 0.5
8 1.0
FED456 7 0.0
8 0.5
dtype: float64
avg_pct looks like:
QUEUE HOUR
ABC123 7 50
8 0
DEF656 7 100
8 50
FED456 7 0
8 100
Name: EVAL, dtype: int32
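For reference, a minimal sketch to reproduce the whole pipeline end to end (df reconstructed from the question; pd.lreshape is undocumented but available in pandas):

import numpy as np
import pandas as pd

df = pd.DataFrame({'QUEUE_1': ['ABC123', 'ABC123', 'DEF656', 'FED456'],
                   'QUEUE_2': ['DEF656', np.nan, 'ABC123', 'DEF656'],
                   'DAY': [1, 1, 1, 2],
                   'HOUR': [7, 7, 8, 8],
                   'TOTAL_SERVICE_TIME': [20, 22, 15, 15],
                   'TOTAL_WAIT_TIME': [30, 32, 12, 16],
                   'EVAL': [1, 0, 0, 1]})
df = pd.lreshape(df, {'QUEUE': ['QUEUE_1', 'QUEUE_2']})
piv_df = df.pivot_table(index='QUEUE', columns='HOUR', fill_value=0).stack()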
I am a beginner programmer learning Python (and pandas), and I hope I can explain this well enough. I have a large time-series DataFrame of over 3 million rows and initially 12 columns, spanning a number of years. It covers people taking a ticket from different locations, denoted by Id numbers (350 of them). Each row is one instance (one ticket taken).
I have searched many similar questions, like counting records per hour per day and getting an average per hour over several years. However, I run into trouble when including the 'Id' variable.
I'm looking to get the mean value of people taking a ticket for each hour, for each day of the week (mon-fri) and per station.
I have the following, setting datetime to index:
Id Start_date Count Day_name_no
149 2011-12-31 21:30:00 1 5
150 2011-12-31 20:51:00 1 0
259 2011-12-31 20:48:00 1 1
3015 2011-12-31 19:38:00 1 4
28 2011-12-31 19:37:00 1 4
Using groupby and Start_date.index.hour, I can't seem to include the 'Id'.
My alternative approach is to split the hour out of the date and have the following:
Id Count Day_name_no Trip_hour
149 1 2 5
150 1 4 10
153 1 2 15
1867 1 4 11
2387 1 2 7
I first get the counts with:
Count_Item = TestFreq.groupby([TestFreq['Id'], TestFreq['Day_name_no'], TestFreq['Hour']]).count().reset_index()
Id Day_name_no Trip_hour Count
1 0 7 24
1 0 8 48
1 0 9 31
1 0 10 28
1 0 11 26
1 0 12 25
Then use groupby and mean:
Mean_Count = Count_Item.groupby(['Id', 'Day_name_no', 'Hour']).mean().reset_index()
However, this does not give the desired result as the mean values are incorrect.
I hope I have explained this issue clearly. I am looking for the mean per hour, per day, per Id, as I plan to do clustering to separate my dataset into groups before applying a predictive model on these groups.
Any help would be appreciated, and if possible an explanation of what I am doing wrong, either in my code or in my approach.
Thanks in advance.
I have edited this to try to make it a little clearer. Writing a question with a lack of sleep is probably not advisable.
A toy dataset that I start with:
Date Id Dow Hour Count
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
26/12/2014 1234 0 10 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
04/01/2015 1234 1 11 1
I now realise I would have to use the date first and get something like:
Date Id Dow Hour Count
12/12/2014 1234 0 9 5
19/12/2014 1234 0 9 3
26/12/2014 1234 0 10 1
27/12/2014 1234 1 11 4
04/01/2015 1234 1 11 1
Then I calculate the mean per Id, per Dow, per Hour, to get this:
Id Dow Hour Mean
1234 0 9 4
1234 0 10 1
1234 1 11 2.5
I hope this makes it a bit clearer. My real dataset spans 3 years, has 3 million rows, and contains 350 Id numbers.
Your question is not very clear, but I hope this helps:
df.reset_index(inplace=True)
# helper columns with date, hour and dow
df['date'] = df['Start_date'].dt.date
df['hour'] = df['Start_date'].dt.hour
df['dow'] = df['Start_date'].dt.dayofweek
# sum of counts for all combinations
df = df.groupby(['Id', 'date', 'dow', 'hour']).sum()
# take the mean over all dates
df = df.reset_index().groupby(['Id', 'dow', 'hour']).mean()
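A worked sketch of the same idea on the toy data from the edit (reconstructed here; column names as in the question):

import pandas as pd

df = pd.DataFrame({
    'Date': ['12/12/2014'] * 5 + ['19/12/2014'] * 3 + ['26/12/2014']
            + ['27/12/2014'] * 4 + ['04/01/2015'],
    'Id': [1234] * 14,
    'Dow': [0] * 9 + [1] * 5,
    'Hour': [9] * 8 + [10] + [11] * 5,
    'Count': [1] * 14,
})
# daily totals per Id/Dow/Hour
daily = df.groupby(['Id', 'Date', 'Dow', 'Hour'])['Count'].sum()
# mean over dates -> 4.0 for (1234, 0, 9), 1.0 for (1234, 0, 10), 2.5 for (1234, 1, 11)
mean = daily.groupby(['Id', 'Dow', 'Hour']).mean().reset_index(name='Mean')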
You can also use the groupby function on the 'Id' column and then use the resample function, aggregating with sum.
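A hedged sketch of that suggestion in modern pandas (the how= keyword was removed; aggregation is now a method call), assuming Start_date is a datetime column:

hourly = (df.set_index('Start_date')
            .groupby('Id')['Count']
            .resample('h')
            .sum())  # ticket count per Id per hour; the per-dow/hour mean can follow as above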