I have missing data at the start of a DataFrame for one series, and I want to fill those NAs by growing back the series using the growth rate of another.
import numpy as np
import pandas as pd

df = pd.DataFrame({'X': [np.nan, np.nan, np.nan, 6, 6.7, 6.78, 7, 9.1],
                   'Y': [5.4, 5.7, 5.5, 6.1, 6.5, 6.80, 7.1, 9.12]})
X Y
0 NaN 5.40
1 NaN 5.70
2 NaN 5.50
3 6.00 6.10
4 6.70 6.50
5 6.78 6.80
6 7.00 7.10
7 9.10 9.12
i.e. what I want is:
df2 = pd.DataFrame({'X': [5.31147, 5.60656, 5.40984, 6, 6.7, 6.78, 7, 9.1],
                    'Y': [5.4, 5.7, 5.5, 6.1, 6.5, 6.80, 7.1, 9.12]})
So that the two series have the same growth rates over those first few originally missing values:
df2.pct_change()
X Y
0 NaN NaN
1 0.055556 0.055556
2 -0.035088 -0.035088
3 0.109091 0.109091
4 0.116667 0.065574
5 0.011940 0.046154
6 0.032448 0.044118
7 0.300000 0.284507
Any ideas? I've figured out how to iterate backwards and save the output to a list, but that approach is bulky, and I still need to prepend the result to the original DataFrame.
You could find the first non-NaN position and walk backwards, dividing by Y's growth factors:
first_non_nan = df.X.isnull().idxmin()  # index of the first non-NaN value in X
changes = df.Y[:first_non_nan + 1].pct_change()
while first_non_nan > 0:
    df.loc[first_non_nan - 1, 'X'] = df.loc[first_non_nan, 'X'] / (changes[first_non_nan] + 1)
    first_non_nan -= 1
Result:
In [48]: df
Out[48]:
X Y
0 5.311475 5.40
1 5.606557 5.70
2 5.409836 5.50
3 6.000000 6.10
4 6.700000 6.50
5 6.780000 6.80
6 7.000000 7.10
7 9.100000 9.12
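Incidentally, because forcing X to match Y's growth rates over the gap makes X proportional to Y there, the whole backfill reduces to rescaling Y. A vectorized sketch (my own variant, assuming the NaNs form a single leading block):
first = df['X'].first_valid_index()
# scale Y so it passes through X's first valid value, then fill the leading gap
df['X'] = df['X'].fillna(df['Y'] * df.loc[first, 'X'] / df.loc[first, 'Y'])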
I need to transform a data frame into what I think are adjacency matrices or some sort of pivot table using a datetime column. I have tried a lot of googling but haven't found anything, so any help in how to do this or even what I should be googling would be appreciated.
Here is a simplified version of my data:
df = pd.DataFrame({'Location': [1]*7 + [2]*7,
                   'Postcode': ['XXX XXX']*7 + ['YYY YYY']*7,
                   'Date': ['03-12-2021', '04-12-2021', '05-12-2021', '06-12-2021', '07-12-2021',
                            '08-12-2021', '09-12-2021', '03-12-2021', '04-12-2021', '05-12-2021',
                            '06-12-2021', '07-12-2021', '08-12-2021', '09-12-2021'],
                   'Var 1': [6.9, 10.2, 9.2, 7.6, 9.8, 8.6, 10.6, 9.9, 9.4, 9, 9.4, 9.1, 8, 9.9],
                   'Var 2': [14.5, 6.2, 9.7, 12.7, 14.8, 12, 12.2, 12.3, 14.2, 13.8, 11.7, 17.8,
                             10.7, 12.3]})
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
Location Postcode Date Var 1 Var 2
0 1 XXX XXX 2021-12-03 6.9 14.5
1 1 XXX XXX 2021-12-04 10.2 6.2
2 1 XXX XXX 2021-12-05 9.2 9.7
3 1 XXX XXX 2021-12-06 7.6 12.7
4 1 XXX XXX 2021-12-07 9.8 14.8
5 1 XXX XXX 2021-12-08 8.6 12.0
6 1 XXX XXX 2021-12-09 10.6 12.2
7 2 YYY YYY 2021-12-03 9.9 12.3
8 2 YYY YYY 2021-12-04 9.4 14.2
9 2 YYY YYY 2021-12-05 9.0 13.8
10 2 YYY YYY 2021-12-06 9.4 11.7
11 2 YYY YYY 2021-12-07 9.1 17.8
12 2 YYY YYY 2021-12-08 8.0 10.7
13 2 YYY YYY 2021-12-09 9.9 12.3
The output I want to create is, for each row, what each variable will be +1, +2, +3 etc. days on from that row's Date.
But I have no idea how or where to start. My only thought is several for loops, but in reality I have hundreds of locations and 10 variables for 14 dates each, so the dataset is large and that would be very inefficient. I feel like there should be a function or a simpler way to achieve this.
Create a DatetimeIndex, then use DataFrameGroupBy.shift with a negative freq offset, and add a suffix to the shifted columns with DataFrame.add_suffix, formatting the day number as {i:02} (01, 02, ..., 10, 11) so the column names sort correctly in the last step:
df = df.set_index('Date')
for i in range(1, 7):
    df = df.join(df.groupby('Location')[['Var 1', 'Var 2']].shift(freq=f'-{i}d')
                   .add_suffix(f'+ Day {i:02}'),
                 on=['Location', 'Date'])
df = df.set_index(['Location', 'Postcode'], append=True).sort_index(axis=1)
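If the loop-join feels opaque, here is a merge-based sketch of the same idea (my own variant, assuming Date is still an ordinary column rather than the index): for each offset, shift a copy's dates back by i days and left-merge, so every row picks up the variables from i days ahead.
out = df.copy()
for i in range(1, 7):
    shifted = df[['Location', 'Date', 'Var 1', 'Var 2']].copy()
    shifted['Date'] -= pd.Timedelta(days=i)  # relabel day d+i as day d
    shifted = shifted.rename(columns={'Var 1': f'Var 1 + Day {i:02}',
                                      'Var 2': f'Var 2 + Day {i:02}'})
    out = out.merge(shifted, on=['Location', 'Date'], how='left')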
I would like to substitute the NaN and NaT values in the Value1 column with values calculated by a function that takes as input Value2 and Value3 (if they exist) from the same row. This is done for each ID. To do this, I tried 'groupby' and then 'apply', but I get an error: 'Series' objects are mutable, thus they cannot be hashed. Could you help me? Thanks in advance!
ID1 = [2002070, 2002070, 2002740, 2002740, 2003010]
ID2 = [2002070, 200800, 200800, 2002740, 2002740]
ID3 = [2002740, 2002740, 2002070, 2002070, 2003010]
Value1 = [4.5, 4.2, 3.7, 4.8, 4.4]
Value2 = [7.2, 6.4, 10, 2.3, 1.5]
Value3 = [8.4, 8.4, 8.4, 7.4, 7.4]
date1 = ['2008-05-14', '2005-12-07', '2008-10-27', '2009-04-20', '2012-03-01']
date2 = ['2005-12-07', '2003-10-10', '2004-05-14', '2011-06-03', '2015-07-05']
date3 = ['2010-10-22', '2012-03-01', '2013-11-28', '2005-12-07', '2012-03-01']
date1 = pd.to_datetime(date1)
date2 = pd.to_datetime(date2)
date3 = pd.to_datetime(date3)
df1 = pd.DataFrame({'ID': ID1, 'Value1': Value1, 'Date1': date1}).sort_values('Date1')
df2 = pd.DataFrame({'ID': ID2, 'Value2': Value2, 'Date2': date2}).sort_values('Date2')
df3 = pd.DataFrame({'ID': ID3, 'Value3': Value3, 'Date3': date3}).sort_values('Date3')
ok = df1.merge(df2, left_on=['ID', 'Date1'], right_on=['ID', 'Date2'], how='outer', sort=True)
ok1 = ok.merge(df3, on='ID', how='inner', sort=True)
The df I obtain is this:
ID Value1 Date1 Value2 Date2 Value3 Date3
0 2002070 4.2 2005-12-07 7.2 2005-12-07 7.4 2005-12-07
1 2002070 4.2 2005-12-07 7.2 2005-12-07 8.4 2013-11-28
2 2002070 4.5 2008-05-14 NaN NaT 7.4 2005-12-07
3 2002070 4.5 2008-05-14 NaN NaT 8.4 2013-11-28
4 2002740 3.7 2008-10-27 NaN NaT 8.4 2010-10-22
5 2002740 3.7 2008-10-27 NaN NaT 8.4 2012-03-01
6 2002740 4.8 2009-04-20 NaN NaT 8.4 2010-10-22
7 2002740 4.8 2009-04-20 NaN NaT 8.4 2012-03-01
8 2002740 NaN NaT 2.3 2011-06-03 8.4 2010-10-22
9 2002740 NaN NaT 2.3 2011-06-03 8.4 2012-03-01
10 2002740 NaN NaT 1.5 2015-07-05 8.4 2010-10-22
11 2002740 NaN NaT 1.5 2015-07-05 8.4 2012-03-01
12 2003010 4.4 2012-03-01 NaN NaT 7.4 2012-03-01
This is the function I made:
def func(Value2, Value3):
    return Value2 / ((Value3 / 100) ** 2)

result = ok1.groupby("ID").Value1.apply(func(ok1.Value2, ok1.Value3))
Do you know how to apply this function only to rows where Value1 is NaN? And how to set a NaT Date1 equal to Date2?
The output of func is another Series, and pandas is not sure what you want to do with it - what would it mean to apply this series to the groups?
Is it that you want the values of this series to be assigned wherever there is a missing Value1 in the original DataFrame?
In that case
imputes = ok1.Value2.div(ok1.Value3.div(100).pow(2))  # same as your function
# overwrite missing values with the corresponding imputed values
ok1['Value1'] = ok1['Value1'].fillna(imputes)
# overwrite missing dates with dates from another column
ok1['Date1'] = ok1['Date1'].fillna(ok1['Date2'])
However, it's not clear to me that this is quite what you wanted, given the presence of the groupby.
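If you really do need per-ID logic, here is a sketch of the groupby route (hypothetical here, since your formula is purely row-wise and the grouping adds nothing):
def impute_group(g):
    g = g.copy()
    mask = g['Value1'].isna()
    # impute only the missing rows, using the same formula as func
    g.loc[mask, 'Value1'] = g.loc[mask, 'Value2'] / ((g.loc[mask, 'Value3'] / 100) ** 2)
    return g

ok1 = ok1.groupby('ID', group_keys=False).apply(impute_group)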
Let's say I have the following dataframe "A"
utilization utilization_billable
service
1 10.0 5.0
2 30.0 20.0
3 40.0 30.0
4 40.0 32.0
I need to convert it into the following dataframe "B"
utilization type
service
1 10.0 total
2 30.0 total
3 40.0 total
4 40.0 total
1 5.0 billable
2 20.0 billable
3 30.0 billable
4 32.0 billable
so that the values from the first frame are categorized by a type column whose values are total or billable.
data = {
    'utilization': [10.0, 30.0, 40.0, 40.0],
    'utilization_billable': [5.0, 20.0, 30.0, 32.0],
    'service': [1, 2, 3, 4]
}
df = pd.DataFrame.from_dict(data).set_index('service')
print(df)
data = {
    'utilization': [10.0, 30.0, 40.0, 40.0, 5.0, 20.0, 30.0, 32.0],
    'service': [1, 2, 3, 4, 1, 2, 3, 4],
    'type': ['total', 'total', 'total', 'total',
             'billable', 'billable', 'billable', 'billable'],
}
df = pd.DataFrame.from_dict(data).set_index('service')
print(df)
Is there a way to transform the data frame and perform such categorization?
You could use pd.melt:
import pandas as pd

data = {'utilization': [10.0, 30.0, 40.0, 40.0],
        'utilization_billable': [5.0, 20.0, 30.0, 32.0],
        'service': [1, 2, 3, 4]}
df = pd.DataFrame(data)
result = pd.melt(df, var_name='type', value_name='utilization', id_vars='service')
print(result)
yields
service type utilization
0 1 utilization 10.0
1 2 utilization 30.0
2 3 utilization 40.0
3 4 utilization 40.0
4 1 utilization_billable 5.0
5 2 utilization_billable 20.0
6 3 utilization_billable 30.0
7 4 utilization_billable 32.0
Then result.set_index('service') would make service the index,
but I would recommend avoiding that since service values are not unique.
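Note that melt keeps the original column names as the type values; to get the total/billable labels from your desired output, you can map them afterwards (my addition, not part of the melt call itself):
result['type'] = result['type'].map({'utilization': 'total',
                                     'utilization_billable': 'billable'})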
This can be done with pd.wide_to_long after adding a suffix to the first column.
import pandas as pd

df = df.rename(columns={'utilization': 'utilization_total'})
pd.wide_to_long(df.reset_index(), stubnames='utilization', sep='_',
                i='service', j='type', suffix='.*').reset_index(1)
Output:
type utilization
service
1 total 10.0
2 total 30.0
3 total 40.0
4 total 40.0
1 billable 5.0
2 billable 20.0
3 billable 30.0
4 billable 32.0
Looks like a job for df.stack() with a couple of DataFrame.rename() calls:
df.rename(index=str, columns={"utilization": "total", "utilization_billable": "billable"})\
  .stack().reset_index(1)\
  .rename(index=str, columns={"level_1": "type", 0: "utilization"})\
  .sort_values(by='type', ascending=False)
Output:
type utilization
service
1 total 10.0
2 total 30.0
3 total 40.0
4 total 40.0
1 billable 5.0
2 billable 20.0
3 billable 30.0
4 billable 32.0
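Note that the final sort_values(by='type', ascending=False) only puts total before billable because of reverse-alphabetical order; to make the ordering explicit you could use an ordered categorical instead (my addition; assumes the chained result was saved to a variable out):
out['type'] = pd.Categorical(out['type'], categories=['total', 'billable'], ordered=True)
out = out.sort_values('type')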
I have the following series:
0 79.0
1 220.0
2 185.0
3 199.0
4 226.0
5 141.0
6 341.0
7 151.0
8 57.0
9 313.0
10 273.0
11 113.0
12 328.0
If I use pandas.cut() on this, this is what I get:
series equal_intvls
0 79.0 (0.979, 306.1]
1 220.0 (0.979, 306.1]
2 185.0 (0.979, 306.1]
3 199.0 (0.979, 306.1]
4 226.0 (0.979, 306.1]
5 141.0 (0.979, 306.1]
6 341.0 (306.1, 608.2]
7 151.0 (0.979, 306.1]
8 57.0 (0.979, 306.1]
9 313.0 (306.1, 608.2]
10 273.0 (0.979, 306.1]
11 113.0 (0.979, 306.1]
12 328.0 (306.1, 608.2]
pandas.cut() is giving me intervals of equal width: each bin spans (max value - min value) divided by the number of bins (here I asked for 2 bins), but the number of values that fall inside each interval can differ from bin to bin.
So with pandas.cut() I get intervals of the same width; how could I instead split this series into intervals that each contain the same number of elements?
What I would like to obtain is a new column containing intervals with the same number of elements in each. Taking as an example the following array:
[1, 7, 7, 4, 6, 3]
what I would like to obtain is this series of intervals, each holding the same number of items:
[(0.999, 3.667], (3.667, 6.333], (6.333, 7.0]]
(0.999, 3.667] - there are 2 values in this interval: (1, 3)
(3.667, 6.333] - there are 2 values in this interval: (4, 6)
(6.333, 7.0] - and again, 2 values within this interval: (7, 7)
I would like to get the intervals in series form so I can insert them as a new column in my original df.
I have tried np.split and np.array_split without success, and I have also visited some other posts on this website that are similar to what I want, but none seems to really fit my case. Please help.
What's the best way to get these kinds of intervals?
Thank you very much in advance
I think you are looking for pd.qcut:
>>> pd.qcut(pd.Series([1, 7, 7, 4, 6, 3]), 3)
0 (0.999, 3.667]
1 (6.333, 7.0]
2 (6.333, 7.0]
3 (3.667, 6.333]
4 (3.667, 6.333]
5 (0.999, 3.667]
dtype: category
Categories (3, interval[float64]): [(0.999, 3.667] < (3.667, 6.333] < (6.333, 7.0]]
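To attach these equal-count bins to your original df as a new column (my addition; assumes the numeric column is named 'series' as in your printout):
df['equal_count_intvls'] = pd.qcut(df['series'], q=3)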
I have the following MultiIndex dataframe.
Close ATR
Date Symbol
1990-01-01 A 24 2
1990-01-01 B 72 7
1990-01-01 C 40 3.4
1990-01-02 A 21 1.5
1990-01-02 B 65 6
1990-01-02 C 45 4.2
1990-01-03 A 19 2.5
1990-01-03 B 70 6.3
1990-01-03 C 51 5
I want to calculate three columns:
Shares = previous day's Equity * 0.02 / ATR, rounded down to whole number
Profit = Shares * Close
Equity = previous day's Equity + sum of Profit for each Symbol
Equity has an initial value of 10,000.
The expected output is:
Close ATR Shares Profit Equity
Date Symbol
1990-01-01 A 24 2 0 0 10000
1990-01-01 B 72 7 0 0 10000
1990-01-01 C 40 3.4 0 0 10000
1990-01-02 A 21 1.5 133 2793 17053
1990-01-02 B 65 6 33 2145 17053
1990-01-02 C 45 4.2 47 2115 17053
1990-01-03 A 19 2.5 136 2584 26885
1990-01-03 B 70 6.3 54 3780 26885
1990-01-03 C 51 5 68 3468 26885
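For example, on 1990-01-02 for symbol A: Shares = floor(10000 * 0.02 / 1.5) = 133, Profit = 133 * 21 = 2793, and the day's Equity = 10000 + 2793 + 2145 + 2115 = 17053.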
I suppose I need a for loop or a function applied to each row. With these I have two issues: first, I'm not sure how to write a for loop for this logic with a MultiIndex dataframe; second, my dataframe is pretty large (something like 10 million rows), so I'm not sure a for loop would be a good idea. How else can I create these columns?
This solution can surely be cleaned up, but will produce your desired output. I've included your initial conditions in the construction of your sample dataframe:
import pandas as pd
import numpy as np

df = pd.DataFrame({'Date': ['1990-01-01', '1990-01-01', '1990-01-01',
                            '1990-01-02', '1990-01-02', '1990-01-02',
                            '1990-01-03', '1990-01-03', '1990-01-03'],
                   'Symbol': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
                   'Close': [24, 72, 40, 21, 65, 45, 19, 70, 51],
                   'ATR': [2, 7, 3.4, 1.5, 6, 4.2, 2.5, 6.3, 5],
                   'Shares': [0, 0, 0] + [np.nan] * 6,
                   'Profit': [0, 0, 0] + [np.nan] * 6})
Gives:
Date Symbol Close ATR Shares Profit
0 1990-01-01 A 24 2.0 0.0 0.0
1 1990-01-01 B 72 7.0 0.0 0.0
2 1990-01-01 C 40 3.4 0.0 0.0
3 1990-01-02 A 21 1.5 NaN NaN
4 1990-01-02 B 65 6.0 NaN NaN
5 1990-01-02 C 45 4.2 NaN NaN
6 1990-01-03 A 19 2.5 NaN NaN
7 1990-01-03 B 70 6.3 NaN NaN
8 1990-01-03 C 51 5.0 NaN NaN
Then use groupby() with apply() and track your Equity globally. Took me a second to realize that the nature of this problem requires you to group on two separate columns individually (Symbol and Date):
start = 10000
Equity = 10000

def calcs(x):
    global Equity
    if x.index[0] == 0:
        return x  # skip the first group; its initial conditions are already set
    x['Shares'] = np.floor(Equity * 0.02 / x['ATR'])
    x['Profit'] = x['Shares'] * x['Close']
    Equity += x['Profit'].sum()
    return x

df = df.groupby('Date').apply(calcs)
df['Equity'] = df.groupby('Date')['Profit'].transform('sum')
df['Equity'] = df.groupby('Symbol')['Equity'].cumsum() + start
This yields:
Date Symbol Close ATR Shares Profit Equity
0 1990-01-01 A 24 2.0 0.0 0.0 10000.0
1 1990-01-01 B 72 7.0 0.0 0.0 10000.0
2 1990-01-01 C 40 3.4 0.0 0.0 10000.0
3 1990-01-02 A 21 1.5 133.0 2793.0 17053.0
4 1990-01-02 B 65 6.0 33.0 2145.0 17053.0
5 1990-01-02 C 45 4.2 47.0 2115.0 17053.0
6 1990-01-03 A 19 2.5 136.0 2584.0 26885.0
7 1990-01-03 B 70 6.3 54.0 3780.0 26885.0
8 1990-01-03 C 51 5.0 68.0 3468.0 26885.0
Can you try using shift and groupby? Once you have the value of the previous row, all the column operations are straightforward.
table2['previous'] = table2['close'].groupby('symbol').shift(1)
table2
date symbol close atr previous
1990-01-01 A 24 2 NaN
B 72 7 NaN
C 40 3.4 NaN
1990-01-02 A 21 1.5 24
B 65 6 72
C 45 4.2 40
1990-01-03 A 19 2.5 21
B 70 6.3 65
C 51 5 45
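Note, though, that the Shares/Profit/Equity recursion cannot be resolved by a single shift: today's Shares depend on yesterday's Equity, which accumulates every prior day's Profit. Here is a sketch of a sequential pass over dates rather than rows (my own variant, assuming df carries the (Date, Symbol) MultiIndex from the question); it stays fast even for millions of rows as long as the number of distinct dates is modest:
import numpy as np
import pandas as pd

equity = 10000.0
shares_parts, profit_parts, equity_parts = [], [], []
for n, (date, day) in enumerate(df.groupby(level='Date')):
    if n == 0:
        # no previous day's Equity exists yet, so Shares and Profit are zero
        shares = pd.Series(0.0, index=day.index)
        profit = pd.Series(0.0, index=day.index)
    else:
        shares = np.floor(equity * 0.02 / day['ATR'])
        profit = shares * day['Close']
        equity += profit.sum()
    shares_parts.append(shares)
    profit_parts.append(profit)
    equity_parts.append(pd.Series(equity, index=day.index))

df['Shares'] = pd.concat(shares_parts)
df['Profit'] = pd.concat(profit_parts)
df['Equity'] = pd.concat(equity_parts)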