Merge DFs on more than 2 conditions - python
Consider the following data frames:
base_df = pd.DataFrame({
'id': [1, 2, 3, 4, 5, 6, 7],
'type_a': ['nan', 'type3', 'type1', 'type2', 'type3', 'type5', 'type4'],
'q_a': [0, 0.9, 5.1, 3.0, 1.6, 1.1, 0.7],
'p_a': [0, 0.53, 0.71, 0.6, 0.53, 0.3, 0.33]
})
Edit: This is an extract of base_df. The original df has 100 columns and around 500 observations.
table_df = pd.DataFrame({
'type': ['type1', 'type2', 'type3', 'type3', 'type3', 'type3', 'type4', 'type4', 'type4', 'type4', 'type5', 'type5', 'type5', 'type6', 'type6'],
'q_value': [5.1, 3.1, 1.6, 1.3, 0.9, 0.85, 0.7, 0.7, 0.7, 0.5, 1.2, 1.1, 1.1, 0.4, 0.4],
'p_value': [0.71, 0.62, 0.71, 0.54, 0.53, 0.44, 0.5, 0.54, 0.33, 0.33, 0.32, 0.31, 0.28, 0.31, 0.16],
'sigma':[2.88, 2.72, 2.73, 2.79, 2.91, 2.41, 2.63, 2.44, 2.7, 2.69, 2.59, 2.67, 2.4, 2.67, 2.35]
})
Edit: The original table_df looks exactly like this one.
For every observation in base_df, I'd like to check whether the type matches an entry in table_df. If it does:
I'd like to check whether there is an entry in table_df with a matching value q_a == q_value. If there is:
And there is only one such q_value, assign its sigma to base_df.
If there is more than one matching q_value, compare p_a as well and assign the correct sigma to base_df.
If there is no exactly matching value for q_a or p_a, use the next bigger value; if there is no bigger value, use the next lower one, and assign the corresponding sigma to column sigma_a in base_df.
The resulting DF should look like this:
id  type_a  q_a  p_a   sigma_a
1   nan     0.0  0.00  0.00
2   type3   0.9  0.53  2.91
3   type1   5.1  0.71  2.88
4   type2   3.0  0.60  2.72
5   type3   1.6  0.53  2.41
6   type5   1.1  0.30  2.67
7   type4   0.7  0.33  2.70
So far I use the code below:
# backward (default) match: the next lower q_value within each type
mapping = (pd.merge_asof(base_df.sort_values('q_a'),
                         table_df.sort_values('q_value'),
                         left_on='q_a',
                         left_by='type_a',
                         right_on='q_value',
                         right_by='type')
           .set_index('id'))

# forward match: the next bigger q_value within each type,
# falling back to the backward match where no bigger value exists
base_df = (pd.merge_asof(base_df.sort_values('q_a'),
                         table_df.sort_values('q_value'),
                         left_on='q_a',
                         left_by='type_a',
                         right_on='q_value',
                         right_by='type',
                         direction='forward')
           .set_index('id')
           .combine_first(mapping)
           .sort_index()
           .reset_index()
           )
This "two step check routine" works, but I'd like to add the third step checking p_value.
How can I realize it?
Actually, I think the metrics should not be separated into an A-segment and a B-segment;
they could be concatenated into the same columns, with an additional column indicating the segment.
Anyway, according to your description,
table_df is a reference table and the same criteria apply to the _a and the _b columns,
so I order it hierarchically with the following manipulation:
table_df = table_df.sort_values(by=["type", "q_value", "p_value"]).reset_index(drop=True)
type q_value p_value sigma
0 type1 5.10 0.71 2.88
1 type2 3.10 0.62 2.72
2 type3 0.85 0.44 2.41
3 type3 0.90 0.53 2.91
4 type3 1.30 0.54 2.79
5 type3 1.60 0.71 2.73
6 type4 0.50 0.33 2.69
7 type4 0.70 0.33 2.70
8 type4 0.70 0.50 2.63
9 type4 0.70 0.54 2.44
10 type5 1.10 0.28 2.40
11 type5 1.10 0.31 2.67
12 type5 1.20 0.32 2.59
13 type6 0.40 0.16 2.35
14 type6 0.40 0.31 2.67
The lookup rules against table_df are:
type: a strictly matching condition.
q_value & p_value: if there is no exactly matching value for q_a or p_a, use the next bigger value and assign the corresponding sigma to column sigma_a in base_df. If there is no bigger one, use the previous (last) value for that type in the reference table.
Define the functions for _a and _b (yes, they are the same):
find_sigma_a and find_sigma_b
def find_sigma_a(row):
    sigma_value = table_df[
        (table_df["type"] == row["type_a"]) &
        (table_df["q_value"] >= row["q_a"]) &
        (table_df["p_value"] >= row["p_a"])
    ]
    if row["type_a"] == 'nan':
        sigma_value = 0
    elif len(sigma_value) == 0:
        # no bigger entry: take the last (largest) row for this type
        sigma_value = table_df[table_df["type"] == row["type_a"]].iloc[-1, 3]
        # .iloc[-1, 3] is equivalent to ["sigma"].tail(1).values[0]
    else:
        # take the first (smallest) entry that is >= both q_a and p_a
        sigma_value = sigma_value.iloc[0, 3]
        # .iloc[0, 3] is equivalent to ["sigma"].head(1).values[0]
    return sigma_value
def find_sigma_b(row):
    sigma_value = table_df[
        (table_df["type"] == row["type_b"]) &
        (table_df["q_value"] >= row["q_b"]) &
        (table_df["p_value"] >= row["p_b"])
    ]
    if row["type_b"] == 'nan':
        sigma_value = 0
    elif len(sigma_value) == 0:
        # no bigger entry: take the last (largest) row for this type
        sigma_value = table_df[table_df["type"] == row["type_b"]].iloc[-1, 3]
        # .iloc[-1, 3] is equivalent to ["sigma"].tail(1).values[0]
    else:
        # take the first (smallest) entry that is >= both q_b and p_b
        sigma_value = sigma_value.iloc[0, 3]
        # .iloc[0, 3] is equivalent to ["sigma"].head(1).values[0]
    return sigma_value
Then use pandas.DataFrame.apply to apply these two functions row-wise:
base_df["sigma_a"] = base_df.apply(find_sigma_a, axis=1)
base_df["sigma_b"] = base_df.apply(find_sigma_b, axis=1)
type_a q_a p_a type_b q_b p_b sigma_a sigma_b
0 nan 0.0 0.00 type6 0.4 0.11 0.00 2.35
1 type3 0.9 0.53 type3 1.4 0.60 2.91 2.73
2 type1 5.1 0.71 type3 0.9 0.53 2.88 2.91
3 type2 3.0 0.60 type6 0.5 0.40 2.72 2.67
4 type3 1.6 0.53 type6 0.4 0.11 2.73 2.35
5 type5 1.1 0.30 type1 4.9 0.70 2.67 2.88
6 type4 0.7 0.33 type4 0.7 0.20 2.70 2.70
Finally, arrange the columns:
base_df.iloc[:, [0, 1, 2, 6, 3, 4, 5, 7]]
type_a q_a p_a sigma_a type_b q_b p_b sigma_b
0 nan 0.0 0.00 0.00 type6 0.4 0.11 2.35
1 type3 0.9 0.53 2.91 type3 1.4 0.60 2.73
2 type1 5.1 0.71 2.88 type3 0.9 0.53 2.91
3 type2 3.0 0.60 2.72 type6 0.5 0.40 2.67
4 type3 1.6 0.53 2.73 type6 0.4 0.11 2.35
5 type5 1.1 0.30 2.67 type1 4.9 0.70 2.88
6 type4 0.7 0.33 2.70 type4 0.7 0.20 2.70
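Side note: since find_sigma_a and find_sigma_b differ only in the column suffix, they could be collapsed into a single parameterized helper. This is only a sketch, assuming every segment follows the type_x / q_x / p_x naming pattern and that table_df is sorted as above:

def find_sigma(row, suffix):
    # suffix is "a" or "b"; the columns type_<suffix>, q_<suffix>, p_<suffix> are assumed to exist
    t, q, p = row[f"type_{suffix}"], row[f"q_{suffix}"], row[f"p_{suffix}"]
    if t == 'nan':
        return 0
    candidates = table_df[
        (table_df["type"] == t) &
        (table_df["q_value"] >= q) &
        (table_df["p_value"] >= p)
    ]
    if len(candidates) == 0:
        # no bigger entry for this type: fall back to its last row in the sorted table
        return table_df.loc[table_df["type"] == t, "sigma"].iloc[-1]
    return candidates["sigma"].iloc[0]

base_df["sigma_a"] = base_df.apply(find_sigma, axis=1, suffix="a")
base_df["sigma_b"] = base_df.apply(find_sigma, axis=1, suffix="b")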
Notebook_file
Related
Find values that exceed the minimum or maximum
I am attempting to flag values that fall outside lower or upper quantile bounds. My dataframe looks similar to this:

Name  DateTime   Na   Na Err  Mg   Mg Err  Al   Al Err  Si    Si Err
STD1  2/11/2020  0.3  0.11    1.6  0.08    0.6  0.12    21.5  0.14
STD2  2/11/2020  0.2  0.10    1.6  0.08    0.2  0.12    21.6  0.14
STD3  2/11/2020  0.2  0.10    1.6  0.08    0.5  0.12    21.7  0.14
STD4  2/11/2020  0.1  0.10    1.3  0.08    0.5  0.12    21.4  0.14

Here is what I have:

elements = ['Na', 'Mg', 'Al', 'Si', ...]
quant = df[elements].quantile([lower, upper])  # obtain upper/lower limits
outsideBounds = (quant.loc[lower_bound, elements] < df[elements].to_numpy()) \
    & (df[elements].to_numpy() < quant.loc[lower_bound, elements])

However, this gives me a "ValueError: Lengths must match to compare". Any help would be appreciated.
Here's a solution (I chose 0.3 and 0.7 for lower and upper bounds, respectively, but that can be changed of course):

lower = 0.3
upper = 0.7
elements = ['Na', 'Mg', 'Al', 'Si']
df[elements]
bounds = df[elements].quantile([lower, upper])  # obtain upper/lower limits
out_of_bounds = df[elements].lt(bounds.loc[lower, :]) | df[elements].gt(bounds.loc[upper, :])
df[elements][out_of_bounds]

The resulting bounds are:

       Na    Mg    Al     Si
0.3  0.19  1.57  0.47  21.49
0.7  0.21  1.60  0.51  21.61

The result of df[elements][out_of_bounds] is:

    Na   Mg   Al    Si
0  0.3  NaN  0.6   NaN
1  NaN  NaN  0.2   NaN
2  NaN  NaN  NaN  21.7
3  0.1  1.3  NaN  21.4
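If you instead want to keep only the rows that contain at least one out-of-bounds value, a small follow-up on the same df and out_of_bounds mask from the snippet above would be:

df[out_of_bounds.any(axis=1)]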
Pandas: dynamically shifting values across columns
I have the following df:

        sales2001  sales2002  sales2003  sales2004
200012      19.12       0.98
200101       19.1       0.98        2.3
200102         21       0.97        0.8
...
200112      19.12       0.99        2.4
200201                  0.98        2.5
200202                  0.97        0.8        1.2

I would like to shift the content in order to align it to a timegap view, as follows:

        sales+1y  sales+2y
200012     19.12      0.98
200101      0.98       2.3
200102      0.97       0.8
...
200112      0.99       2.4
200201      0.98       2.5
200202       0.8       1.2

basically aligning the forecasted data points to a fixed timegap from the index. I tried with iterrows and dynamically calling the columns given the index, but cannot make it work. Do you guys have any suggestions?
Use justify with DataFrame.dropna and axis=1 to remove all columns with at least one NaN:

df1 = (pd.DataFrame(justify(df.values, invalid_val=np.nan, side='right'),
                    index=df.index)
         .dropna(axis=1))

If you need to select the last columns by position:

df1 = pd.DataFrame(justify(df.values, invalid_val=np.nan, side='right')[:, -2:],
                   index=df.index)

Or:

df1 = (pd.DataFrame(justify(df.values, invalid_val=np.nan, side='right'),
                    index=df.index)
         .iloc[:, -2:])

df1.columns = [f'sales+{i+1}y' for i in range(len(df1.columns))]
print (df1)

        sales+1y  sales+2y
200012     19.12      0.98
200101      0.98      2.30
200102      0.97      0.80
200112      0.99      2.40
200201      0.98      2.50
200202      0.80      1.20
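Note that justify is not a pandas or NumPy built-in; the answer above assumes a NumPy helper of that name that pushes the valid values of each row to one side of the array. A sketch along those lines (adapted from a commonly shared snippet) could look like this:

import numpy as np

def justify(a, invalid_val=0, axis=1, side='left'):
    # push the valid (non-invalid) entries of each row/column to one side
    if invalid_val is np.nan:
        mask = ~np.isnan(a)
    else:
        mask = a != invalid_val
    justified_mask = np.sort(mask, axis=axis)
    if (side == 'up') | (side == 'left'):
        justified_mask = np.flip(justified_mask, axis=axis)
    out = np.full(a.shape, invalid_val)
    if axis == 1:
        out[justified_mask] = a[mask]
    else:
        out.T[justified_mask.T] = a.T[mask.T]
    return out

With side='right' and invalid_val=np.nan it shifts every row's non-NaN values to the right, which is what the snippets above rely on.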
Another option is to use pd.wide_to_long and pivot:

# here I assume the index name is index
new_df = pd.wide_to_long(df.reset_index(), 'sales', i='index', j='sale_end').reset_index()

# if index is datetime, then use dt.year
new_df['periods'] = new_df['sale_end'] - new_df['index'] // 100

# pivot
new_df.dropna().pivot(index='index', columns='periods', values='sales')

output:

periods    -1      0      1     2
idx
200012    NaN    NaN  19.12  0.98
200101    NaN  19.10   0.98  2.30
200102    NaN  21.00   0.97  0.80
200112    NaN  19.12   0.99  2.40
200201   0.98   2.50    NaN   NaN
200202   0.97   0.80   1.20   NaN
Pandas keep highest value in every n consecutive rows
I have a pandas dataframe called df_initial with two columns 'a' and 'b' and N rows. I would like to halve the number of rows by deleting, from each pair of consecutive rows, the one where the value of 'b' is lower. Thus between row 0 and row 1 I will keep row 1, between row 2 and row 3 I will keep row 3, etc. This is the result that I would like to obtain:

print(df_initial)
      a     b
0   0.04  0.01
1   0.05  0.22
2   0.06  0.34
3   0.07  0.49
4   0.08  0.71
5   0.09  0.09
6   0.10  0.98
7   0.11  0.42
8   0.12  1.32
9   0.13  0.39
10  0.14  0.97
11  0.15  0.05
12  0.16  0.36
13  0.17  1.72
....

print(df_reduced)
      a     b
0   0.05  0.22
1   0.07  0.49
2   0.08  0.71
3   0.10  0.98
4   0.12  1.32
5   0.14  0.97
6   0.17  1.72
....

Is there some Pandas function to do this? I saw that there is a resample function, DataFrame.resample(), but it works with a DatetimeIndex, TimedeltaIndex or PeriodIndex, so not in this case. Thanks to whoever will help me.
You can groupby every two rows (a simple way of doing so is taking the floor division of the index) and take the idxmax of column b to index the dataframe:

df.loc[df.groupby(df.index // 2).b.idxmax(), :]

       a     b
1   0.05  0.22
3   0.07  0.49
4   0.08  0.71
6   0.10  0.98
8   0.12  1.32
10  0.14  0.97
13  0.17  1.72

Or using DataFrame.rolling:

df.loc[df.b.rolling(2).max()[1::2].index, :]
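If you prefer the clean 0-based index shown in df_reduced rather than the original row labels, you can chain a reset onto the same expression:

df_reduced = df.loc[df.groupby(df.index // 2).b.idxmax(), :].reset_index(drop=True)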
This is an application on a simple example; you can apply it to your dataframe.

import numpy as np
import pandas as pd

ar = np.array([[1.1, 1.0], [3.3, 0.2], [2.7, 10], [5.4, 7], [5.3, 9], [1.5, 15]])
df = pd.DataFrame(ar, columns=['a', 'b'])

for i in range(len(df)):
    if df['b'][i] < df['a'][i]:
        df = df.drop(index=i)

print(df)
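The same filtering can also be expressed without an explicit loop; a boolean mask keeps exactly the rows the loop above keeps (those where b is not smaller than a):

df = df[df['b'] >= df['a']]
print(df)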
Calculating momentum signal in python using 1 month and 12 month lag
I want to calculate a simple momentum signal. The method I am following is: 1-month lagged cumret divided by 12-month lagged cumret, minus 1.

date starts at 1/5/14 and ends at 1/5/16. As a 12-month lag is required, the first mom signal has to start 12 months after the start of date. Hence the first mom signal starts at 1/5/15.

Here is the data utilized:

import pandas as pd

data = {'date': ['1/5/14', '1/6/14', '1/7/14', '1/8/14', '1/9/14', '1/10/14', '1/11/14', '1/12/14',
                 '1/1/15', '1/2/15', '1/3/15', '1/4/15', '1/5/15', '1/6/15', '1/7/15', '1/8/15',
                 '1/9/15', '1/10/15', '1/11/15', '1/12/15', '1/1/16', '1/2/16', '1/3/16', '1/4/16', '1/5/16'],
        'id': ['a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a',
               'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a'],
        'ret': [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.10, 0.11, 0.12, 0.13,
                0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.20, 0.21, 0.22, 0.23, 0.24, 0.25],
        'cumret': [1.01, 1.03, 1.06, 1.1, 1.15, 1.21, 1.28, 1.36, 1.45, 1.55, 1.66, 1.78, 1.91,
                   2.05, 2.2, 2.36, 2.53, 2.71, 2.9, 3.1, 3.31, 3.53, 3.76, 4, 4.25]}
df = pd.DataFrame(data).set_index(['date', 'id'])

Desired output:

             ret  cumret  mom
date     id
1/5/14   a   .01    1.01
1/6/14   a   .02    1.03
1/7/14   a   .03    1.06
1/8/14   a   .04    1.1
1/9/14   a   .05    1.15
1/10/14  a   .06    1.21
1/11/14  a   .07    1.28
1/12/14  a   .08    1.36
1/1/15   a   .09    1.45
1/2/15   a   .1     1.55
1/3/15   a   .11    1.66
1/4/15   a   .12    1.78
1/5/15   a   .13    1.91   .8
1/6/15   a   .14    2.05   .9
1/7/15   a   .15    2.2    .9
1/8/15   a   .16    2.36   1
1/9/15   a   .17    2.53   1.1
1/10/15  a   .18    2.71   1.1
1/11/15  a   .19    2.9    1.1
1/12/15  a   .2     3.1    1.1
1/1/16   a   .21    3.31   1.1
1/2/16   a   .22    3.53   1.1
1/3/16   a   .23    3.76   1.1
1/4/16   a   .24    4      1.1
1/5/16   a   .25    4.25   1.1

This is the code I tried to calculate mom:

df['mom'] = ((df['cumret'].shift(-1) / (df['cumret'].shift(-12))) - 1).groupby(level=['id'])

The entire dataset has more ids, e.g. a, b, c. I just included one for this example. Any help would be awesome! :)
As far as I know, momentum is simply the rate of change. Pandas has a built-in method for this:

df['mom'] = df['ret'].pct_change(12)  # 12 month change

Also, I am not sure why you are using cumret instead of ret to calculate momentum.

Update: If you have multiple IDs that you need to go through, I'd recommend:

for i in df.index.levels[1]:
    temp = df.loc[(slice(None), i), "ret"].pct_change(11)
    df.loc[(slice(None), i), "mom"] = temp

# or, for short:
# df.loc[(slice(None), i), "mom"] = df.loc[(slice(None), i), "ret"].pct_change(11)

Output:

              ret  cumret        mom
date     id
1/5/14   a   0.01    1.01        NaN
1/6/14   a   0.02    1.03        NaN
1/7/14   a   0.03    1.06        NaN
1/8/14   a   0.04    1.10        NaN
1/9/14   a   0.05    1.15        NaN
1/10/14  a   0.06    1.21        NaN
1/11/14  a   0.07    1.28        NaN
1/12/14  a   0.08    1.36        NaN
1/1/15   a   0.09    1.45        NaN
1/2/15   a   0.10    1.55        NaN
1/3/15   a   0.11    1.66        NaN
1/4/15   a   0.12    1.78  11.000000
1/5/15   a   0.13    1.91   5.500000
1/6/15   a   0.14    2.05   3.666667
1/7/15   a   0.15    2.20   2.750000
1/8/15   a   0.16    2.36   2.200000
1/9/15   a   0.17    2.53   1.833333
1/10/15  a   0.18    2.71   1.571429
1/11/15  a   0.19    2.90   1.375000
1/12/15  a   0.20    3.10   1.222222
1/1/16   a   0.21    3.31   1.100000
1/2/16   a   0.22    3.53   1.000000
1/3/16   a   0.23    3.76   0.916667
1/4/16   a   0.24    4.00   0.846154
1/5/16   a   0.25    4.25   0.785714
1/5/14   b   0.01    1.01        NaN
1/6/14   b   0.02    1.03        NaN
1/7/14   b   0.03    1.06        NaN
1/8/14   b   0.04    1.10        NaN
1/9/14   b   0.05    1.15        NaN
1/10/14  b   0.06    1.21        NaN
1/11/14  b   0.07    1.28        NaN
1/12/14  b   0.08    1.36        NaN
1/1/15   b   0.09    1.45        NaN
1/2/15   b   0.10    1.55        NaN
1/3/15   b   0.11    1.66        NaN
1/4/15   b   0.12    1.78  11.000000
1/5/15   b   0.13    1.91   5.500000
1/6/15   b   0.14    2.05   3.666667
1/7/15   b   0.15    2.20   2.750000
1/8/15   b   0.16    2.36   2.200000
1/9/15   b   0.17    2.53   1.833333
1/10/15  b   0.18    2.71   1.571429
1/11/15  b   0.19    2.90   1.375000
1/12/15  b   0.20    3.10   1.222222
1/1/16   b   0.21    3.31   1.100000
1/2/16   b   0.22    3.53   1.000000
1/3/16   b   0.23    3.76   0.916667
1/4/16   b   0.24    4.00   0.846154
1/5/16   b   0.25    4.25   0.785714
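As an alternative to looping over the ids, groupby can compute the percentage change per id directly. A sketch using the same 11-period change as the loop above, assuming the second index level is named 'id':

df['mom'] = df.groupby(level='id')['ret'].pct_change(11)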
How to use describe () by group for all variables?
I would appreciate it if you could let me know how to apply describe() to calculate summary statistics by group. My data (TrainSet) is like the following, but there are a lot of columns:

Financial Distress    x1     x2    x3
0                   1.28   0.02  0.87
0                   1.27   0.01  0.82
0                   1.05  -0.06  0.92
1                   1.11  -0.02  0.86
0                   1.06   0.11  0.81
0                   1.06   0.08  0.88
1                   0.87  -0.03  0.79

I want to compute the summary statistics by "Financial Distress" as shown below:

           count  mean   std   min   25%   50%   75%    max
cat index
x1  0       2474   1.4   1.3  0.07  0.95   1.1  1.54   38.1
    1         95   0.7  -1.7  0.02   2.9   2.1  1.75   11.2
x2  0       2474   0.9   1.7  0.02   1.9   1.4  1.75   11.2
    1         95   .45  1.95  0.07   2.8   1.6  2.94  20.12
x3  0       2474   2.4   1.5  0.07  0.85   1.2   1.3   30.1
    1         95   1.9   2.3  0.33   6.1  0.15  1.66   12.3

I wrote the following code, but it does not provide the answer in the aforementioned format:

Statistics = pd.concat([TrainSet[TrainSet["Financial Distress"] == 0].describe(),
                        TrainSet[TrainSet["Financial Distress"] == 1].describe()])
Statistics.to_csv("Descriptive Statistics1.csv")

Thanks in advance.

The result of coldspeed's solution:

      Financial Distress  count          mean          std
x1                     0   2474   1.398623286  1.320468688
x1                     1     95   1.028107053  0.360206966
x10                    0   2474   0.143310534  0.136257947
x10                    1     95  -0.032919408  0.080409407
x100                   0   2474   0.141875505  0.348992946
x100                   1     95   0.115789474  0.321669776
You can use DataFrameGroupBy.describe with unstack first, but by default this changes the column ordering, hence the reindex:

print (df)
   Financial Distress    x1    x2   x10
0                   0  1.28  0.02  0.87
1                   0  1.27  0.01  0.82
2                   0  1.05 -0.06  0.92
3                   1  1.11 -0.02  0.86
4                   0  1.06  0.11  0.81
5                   0  1.06  0.08  0.88
6                   1  0.87 -0.03  0.79

df1 = (df.groupby('Financial Distress')
         .describe()
         .unstack()
         .unstack(1)
         .reindex(df.columns[1:], level=0))
print (df1)
                        count   mean       std   min     25%    50%     75%  \
    Financial Distress
x1  0                     5.0  1.144  0.119708  1.05  1.0600  1.060  1.2700
    1                     2.0  0.990  0.169706  0.87  0.9300  0.990  1.0500
x2  0                     5.0  0.032  0.066106 -0.06  0.0100  0.020  0.0800
    1                     2.0 -0.025  0.007071 -0.03 -0.0275 -0.025 -0.0225
x10 0                     5.0  0.860  0.045277  0.81  0.8200  0.870  0.8800
    1                     2.0  0.825  0.049497  0.79  0.8075  0.825  0.8425

                         max
    Financial Distress
x1  0                   1.28
    1                   1.11
x2  0                   0.11
    1                  -0.02
x10 0                   0.92
    1                   0.86
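If you only need a subset of the statistics, as in the coldspeed-style output shown in the question, another option (a sketch on the same df) is to stack the variable level instead of unstacking:

df2 = (df.groupby('Financial Distress')
         .describe()
         .stack(level=0)      # move the variable names into the index
         .swaplevel()
         .sort_index()
         [['count', 'mean', 'std']])
print (df2)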