I am attempting to flag values that fall outside upper/lower quantile bounds. My dataframe looks similar to this:
Name DateTime Na Na Err Mg Mg Err Al Al Err Si Si Err
STD1 2/11/2020 0.3 0.11 1.6 0.08 0.6 0.12 21.5 0.14
STD2 2/11/2020 0.2 0.10 1.6 0.08 0.2 0.12 21.6 0.14
STD3 2/11/2020 0.2 0.10 1.6 0.08 0.5 0.12 21.7 0.14
STD4 2/11/2020 0.1 0.10 1.3 0.08 0.5 0.12 21.4 0.14
Here is what I have:
elements=['Na','Mg', 'Al', 'Si',...]
quant=df[elements].quantile([lower, upper]) #obtain upper/lower limits
outsideBounds=(quant.loc[lower_bound, elements] < df[elements].to_numpy()) \
& (df[elements].to_numpy()<quant.loc[lower_bound, elements])
However, this gives me "ValueError: Lengths must match to compare". Any help would be appreciated.
Here's a solution (I chose 0.3 and 0.7 for lower and upper bounds, respectively, but that can be changed of course):
lower = 0.3
upper = 0.7
elements=['Na','Mg', 'Al', 'Si']
df[elements]
bounds = df[elements].quantile([lower, upper]) #obtain upper/lower limits
out_of_bounds = df[elements].lt(bounds.loc[lower, :]) | df[elements].gt(bounds.loc[upper, :])
df[elements][out_of_bounds]
The resulting bounds are:
Na Mg Al Si
0.3 0.19 1.57 0.47 21.49
0.7 0.21 1.60 0.51 21.61
The result of df[elements][out_of_bounds] is:
Na Mg Al Si
0 0.3 NaN 0.6 NaN
1 NaN NaN 0.2 NaN
2 NaN NaN NaN 21.7
3 0.1 1.3 NaN 21.4
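If you then want to flag or drop whole rows, one possibility (a small sketch, not part of the original answer) is to collapse the boolean frame row-wise:
# rows where at least one element falls outside its quantile band
bad_rows = out_of_bounds.any(axis=1)
outliers = df[bad_rows]      # rows containing an out-of-bounds value
clean = df[~bad_rows]        # rows with every element inside the band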
I have two dataframes:
Actual_Values
0 0.60
1 0.60
2 0.60
3 0.60
4 0.60
Predicted_Values
0 0.60
1 0.60
2 0.60
and I want something like this:
Actual_Values Predicted_Values
0 0.60 NaN
1 0.60 NaN
2 0.60 0.6
3 0.60 0.6
4 0.60 0.6
I have tried pandas' join, merge, concat, but none works.
Try assigning df2 a new index so it aligns with the tail of df1:
df2.index=df1.index[-len(df2):]
out = df1.join(df2)
Out[283]:
Actual_Values Predicted_Values
0 0.6 NaN
1 0.6 NaN
2 0.6 0.6
3 0.6 0.6
4 0.6 0.6
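A self-contained sketch of the same idea (the frames here are reconstructed from the question):
import pandas as pd

df1 = pd.DataFrame({'Actual_Values': [0.60] * 5})
df2 = pd.DataFrame({'Predicted_Values': [0.60] * 3})

# give df2 the last len(df2) index labels of df1, then join on the shared index
df2.index = df1.index[-len(df2):]
out = df1.join(df2)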
Consider the following data frames:
base_df = pd.DataFrame({
'id': [1, 2, 3, 4, 5, 6, 7],
'type_a': ['nan', 'type3', 'type1', 'type2', 'type3', 'type5', 'type4'],
'q_a': [0, 0.9, 5.1, 3.0, 1.6, 1.1, 0.7],
'p_a': [0, 0.53, 0.71, 0.6, 0.53, 0.3, 0.33]
})
Edit: This is an extract of base_df. The original df has 100 columns and around 500 observations.
table_df = pd.DataFrame({
'type': ['type1', 'type2', 'type3', 'type3', 'type3', 'type3', 'type4', 'type4', 'type4', 'type4', 'type5', 'type5', 'type5', 'type6', 'type6'],
'q_value': [5.1, 3.1, 1.6, 1.3, 0.9, 0.85, 0.7, 0.7, 0.7, 0.5, 1.2, 1.1, 1.1, 0.4, 0.4],
'p_value': [0.71, 0.62, 0.71, 0.54, 0.53, 0.44, 0.5, 0.54, 0.33, 0.33, 0.32, 0.31, 0.28, 0.31, 0.16],
'sigma':[2.88, 2.72, 2.73, 2.79, 2.91, 2.41, 2.63, 2.44, 2.7, 2.69, 2.59, 2.67, 2.4, 2.67, 2.35]
})
Edit: The original table_df looks exactly like this one.
For every observation in base_df, I'd like to check whether its type matches an entry in table_df. If it does:
I'd like to check whether there is an entry in table_df with q_a == q_value. If there is:
And there is only one such q_value, assign that row's sigma to base_df.
If there is more than one matching q_value, compare p_a against p_value and assign the correct sigma to base_df.
If there is no exactly matching value for q_a or p_a, use the next bigger value; if there is no bigger value, use the next lower one, and assign the corresponding sigma to the column sigma_a in base_df.
The resulting DF should look like this:
id type_a q_a p_a sigma_a
1 nan 0 0 0
2 type3 0.9 0.53 2.91
3 type1 5.1 0.71 2.88
4 type2 3 0.6 2.72
5 type3 1.6 0.53 2.41
6 type5 1.1 0.3 2.67
7 type4 0.7 0.33 2.7
So far I use the code below:
mapping = (pd.merge_asof(base_df.sort_values('q_a'),
table_df.sort_values('q_value'),
left_on='q_a',
left_by='type_a',
right_on='q_value',
right_by='type').set_index('id'))
base_df= (pd.merge_asof(base_df.sort_values('q_a'),
table_df.sort_values('q_value'),
left_on='q_a',
left_by='type_a',
right_on='q_value',
right_by='type',
direction = 'forward')
.set_index('id')
.combine_first(mapping)
.sort_index()
.reset_index()
)
This "two step check routine" works, but I'd like to add the third step checking p_value.
How can I implement it?
Actually, I think the metrics should not be separated into an A segment and a B segment;
they should be concatenated into the same column, with a separate column such as Segment as the label.
Anyway, according to your description,
table_df is a reference table and the same criteria apply to _a and _b,
so I order it hierarchically with the following manipulation:
table_df = table_df.sort_values(by=["type", "q_value", "p_value"]).reset_index(drop=True)
table_df
type q_value p_value sigma
0 type1 5.10 0.71 2.88
1 type2 3.10 0.62 2.72
2 type3 0.85 0.44 2.41
3 type3 0.90 0.53 2.91
4 type3 1.30 0.54 2.79
5 type3 1.60 0.71 2.73
6 type4 0.50 0.33 2.69
7 type4 0.70 0.33 2.70
8 type4 0.70 0.50 2.63
9 type4 0.70 0.54 2.44
10 type5 1.10 0.28 2.40
11 type5 1.10 0.31 2.67
12 type5 1.20 0.32 2.59
13 type6 0.40 0.16 2.35
14 type6 0.40 0.31 2.67
In table_df:
type: a strict match condition.
q_value & p_value: if there is no exactly matching value for q_a or p_a, use the next bigger value and assign the corresponding sigma to the column sigma_a in base_df. If there is no bigger one, use the previous value in the reference table.
Define the lookup function for _a and _b (yes, they are the same apart from the column suffix):
find_sigma_a and find_sigma_b
def find_sigma_a(row):
    # candidate reference rows: same type, with q_value and p_value at or above the observation
    sigma_value = table_df[
        (table_df["type"] == row["type_a"]) &
        (table_df["q_value"] >= row["q_a"]) &
        (table_df["p_value"] >= row["p_a"])
    ]
    if row["type_a"] == 'nan':
        sigma_value = 0
    elif len(sigma_value) == 0:
        # no bigger value available: fall back to the last (largest) entry for that type
        sigma_value = table_df[table_df["type"] == row["type_a"]].iloc[-1, 3]
        # .iloc[-1, 3] can alternatively be written ["sigma"].tail(1).values[0]
    else:
        # take the first (smallest) candidate at or above the observation
        sigma_value = sigma_value.iloc[0, 3]
        # .iloc[0, 3] can alternatively be written ["sigma"].head(1).values[0]
    return sigma_value
def find_sigma_b(row):
sigma_value = table_df[
(table_df["type"] == row["type_b"]) &
(table_df["q_value"] >= row["q_b"]) &
(table_df["p_value"] >= row["p_b"])
]
if row["type_b"] == 'nan':
sigma_value = 0
elif len(sigma_value) == 0:
sigma_value = table_df[table_df["type"]==row["type_b"]].iloc[-1,3]
# .iloc[-1,3] alternatively term ["sigma"].tail(1).values[0]
else:
sigma_value = sigma_value.iloc[0,3]
# .iloc[0,3] alternatively term ["sigma"].head(1).values[0]
return sigma_value
Then use pandas.DataFrame.apply to apply these two functions row-wise:
base_df["sigma_a"] = base_df.apply(find_sigma_a, axis = 1)
base_df["sigma_b"] = base_df.apply(find_sigma_b, axis = 1)
type_a q_a p_a type_b q_b p_b sigma_a sigma_b
0 nan 0.0 0.00 type6 0.4 0.11 0.00 2.35
1 type3 0.9 0.53 type3 1.4 0.60 2.91 2.73
2 type1 5.1 0.71 type3 0.9 0.53 2.88 2.91
3 type2 3.0 0.60 type6 0.5 0.40 2.72 2.67
4 type3 1.6 0.53 type6 0.4 0.11 2.73 2.35
5 type5 1.1 0.30 type1 4.9 0.70 2.67 2.88
6 type4 0.7 0.33 type4 0.7 0.20 2.70 2.70
Rearrange the columns:
base_df.iloc[:,[0,1,2,6,3,4,5,7]]
type_a q_a p_a sigma_a type_b q_b p_b sigma_b
0 nan 0.0 0.00 0.00 type6 0.4 0.11 2.35
1 type3 0.9 0.53 2.91 type3 1.4 0.60 2.73
2 type1 5.1 0.71 2.88 type3 0.9 0.53 2.91
3 type2 3.0 0.60 2.72 type6 0.5 0.40 2.67
4 type3 1.6 0.53 2.73 type6 0.4 0.11 2.35
5 type5 1.1 0.30 2.67 type1 4.9 0.70 2.88
6 type4 0.7 0.33 2.70 type4 0.7 0.20 2.70
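Since find_sigma_a and find_sigma_b differ only in the column suffix, a single parameterized helper would also work; here is a minimal sketch of that variant (not part of the original answer, under the same assumptions about table_df being sorted):
def find_sigma(row, suffix):
    # same lookup as above, with the _a/_b suffix passed in as a parameter
    candidates = table_df[
        (table_df["type"] == row["type_" + suffix]) &
        (table_df["q_value"] >= row["q_" + suffix]) &
        (table_df["p_value"] >= row["p_" + suffix])
    ]
    if row["type_" + suffix] == 'nan':
        return 0
    if len(candidates) == 0:
        # no bigger value available: fall back to the last entry for that type
        return table_df.loc[table_df["type"] == row["type_" + suffix], "sigma"].iloc[-1]
    return candidates["sigma"].iloc[0]

base_df["sigma_a"] = base_df.apply(find_sigma, axis=1, suffix="a")
base_df["sigma_b"] = base_df.apply(find_sigma, axis=1, suffix="b")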
I would like to insert a new row whenever a D1 value exists for some IDs but is missing for others, leaving df['Value'] blank (NaN) for the inserted rows. Your help is appreciated.
Input
D1 ID Value
0.02 1 1.2
0.04 1 1.6
0.06 1 1.9
0.08 1 2.8
0.02 2 4.5
0.04 2 4.1
0.08 2 3.6
0.02 3 2.7
0.04 3 2.9
0.06 3 2.4
0.08 3 2.1
0.1 3 1.9
Expected output:
D1 ID Value
0.02 1 1.2
0.04 1 1.6
0.06 1 1.9
0.08 1 2.8
0.1 1
0.02 2 4.5
0.04 2 4.1
0.06 2
0.08 2 3.6
0.1 2
0.02 3 2.7
0.04 3 2.9
0.06 3 2.4
0.08 3 2.1
0.1 3 1.9
Unfortunately, the code I have written has been way off or simply produces multiple error messages, so unlike my other questions I do not have an example to show.
Use unstack and stack. Chain an additional sort_index and reset_index to achieve the desired order:
df_final = (df.set_index(['D1', 'ID']).unstack().stack(dropna=False)
.sort_index(level=[1,0]).reset_index())
Out[952]:
D1 ID Value
0 0.02 1 1.2
1 0.04 1 1.6
2 0.06 1 1.9
3 0.08 1 2.8
4 0.10 1 NaN
5 0.02 2 4.5
6 0.04 2 4.1
7 0.06 2 NaN
8 0.08 2 3.6
9 0.10 2 NaN
10 0.02 3 2.7
11 0.04 3 2.9
12 0.06 3 2.4
13 0.08 3 2.1
14 0.10 3 1.9
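An equivalent approach (a sketch, not part of the original answer) builds the full D1 x ID grid explicitly and reindexes onto it; the frame below is reconstructed from the question:
import pandas as pd

df = pd.DataFrame({
    'D1': [0.02, 0.04, 0.06, 0.08, 0.02, 0.04, 0.08, 0.02, 0.04, 0.06, 0.08, 0.10],
    'ID': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3],
    'Value': [1.2, 1.6, 1.9, 2.8, 4.5, 4.1, 3.6, 2.7, 2.9, 2.4, 2.1, 1.9],
})

# every combination of D1 and ID; combinations missing from df become NaN
full_index = pd.MultiIndex.from_product(
    [sorted(df['D1'].unique()), sorted(df['ID'].unique())],
    names=['D1', 'ID'])
df_full = (df.set_index(['D1', 'ID'])
             .reindex(full_index)
             .sort_index(level=['ID', 'D1'])
             .reset_index())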
I want to calculate a simple momentum signal. The method I am following is the 1-month lagged cumret divided by the 12-month lagged cumret, minus 1.
The date column starts at 1/5/14 and ends at 1/5/16. Since a 12-month lag is required, the first mom signal has to start 12 months after the first date, which is why the first mom signal starts at 1/5/15.
Here is the data utilized:
import pandas as pd
data = {'date':['1/5/14','1/6/14','1/7/14','1/8/14','1/9/14','1/10/14','1/11/14','1/12/14','1/1/15','1/2/15','1/3/15','1/4/15','1/5/15','1/6/15','1/7/15','1/8/15','1/9/15','1/10/15','1/11/15','1/12/15','1/1/16','1/2/16','1/3/16','1/4/16','1/5/16'],
'id': ['a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a' ],
'ret':[0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.10, 0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.20, 0.21, 0.22, 0.23, 0.24, 0.25],
'cumret':[1.01,1.03, 1.06,1.1 ,1.15,1.21,1.28, 1.36,1.45,1.55,1.66, 1.78,1.91,2.05,2.2,2.36, 2.53,2.71,2.9,3.1,3.31,3.53, 3.76,4,4.25]}
df = pd.DataFrame(data).set_index(['date', 'id'])
Desired output
ret cumret mom
date id
1/5/14 a .01 1.01
1/6/14 a .02 1.03
1/7/14 a .03 1.06
1/8/14 a .04 1.1
1/9/14 a .05 1.15
1/10/14 a .06 1.21
1/11/14 a .07 1.28
1/12/14 a .08 1.36
1/1/15 a .09 1.45
1/2/15 a .1 1.55
1/3/15 a .11 1.66
1/4/15 a .12 1.78
1/5/15 a .13 1.91 .8
1/6/15 a .14 2.05 .9
1/7/15 a .15 2.2 .9
1/8/15 a .16 2.36 1
1/9/15 a .17 2.53 1.1
1/10/15 a .18 2.71 1.1
1/11/15 a .19 2.9 1.1
1/12/15 a .2 3.1 1.1
1/1/16 a .21 3.31 1.1
1/2/16 a .22 3.53 1.1
1/3/16 a .23 3.76 1.1
1/4/16 a .24 4 1.1
1/5/16 a .25 4.25 1.1
This is the code I tried for calculating mom:
df['mom'] = ((df['cumret'].shift(-1) / (df['cumret'].shift(-12))) - 1).groupby(level = ['id'])
The entire dataset has more ids, e.g. a, b, c; I just included one id for this example.
Any help would be awesome! :)
As far as I know, momentum is simply rate of change. Pandas has a built-in method for this:
df['mom'] = df['ret'].pct_change(12) # 12 month change
Also, I am not sure why you are using cumret instead of ret to calculate momentum.
Update: If you have multiple IDs that you need to go through, I'd recommend:
for i in df.index.levels[1]:
temp = df.loc[(slice(None), i), "ret"].pct_change(11)
df.loc[(slice(None), i), "mom"] = temp
# or df.loc[(slice(None), i), "mom"] = df.loc[(slice(None), i), "ret"].pct_change(11) for short
Output:
ret cumret mom
date id
1/5/14 a 0.01 1.01 NaN
1/6/14 a 0.02 1.03 NaN
1/7/14 a 0.03 1.06 NaN
1/8/14 a 0.04 1.10 NaN
1/9/14 a 0.05 1.15 NaN
1/10/14 a 0.06 1.21 NaN
1/11/14 a 0.07 1.28 NaN
1/12/14 a 0.08 1.36 NaN
1/1/15 a 0.09 1.45 NaN
1/2/15 a 0.10 1.55 NaN
1/3/15 a 0.11 1.66 NaN
1/4/15 a 0.12 1.78 11.000000
1/5/15 a 0.13 1.91 5.500000
1/6/15 a 0.14 2.05 3.666667
1/7/15 a 0.15 2.20 2.750000
1/8/15 a 0.16 2.36 2.200000
1/9/15 a 0.17 2.53 1.833333
1/10/15 a 0.18 2.71 1.571429
1/11/15 a 0.19 2.90 1.375000
1/12/15 a 0.20 3.10 1.222222
1/1/16 a 0.21 3.31 1.100000
1/2/16 a 0.22 3.53 1.000000
1/3/16 a 0.23 3.76 0.916667
1/4/16 a 0.24 4.00 0.846154
1/5/16 a 0.25 4.25 0.785714
1/5/14 b 0.01 1.01 NaN
1/6/14 b 0.02 1.03 NaN
1/7/14 b 0.03 1.06 NaN
1/8/14 b 0.04 1.10 NaN
1/9/14 b 0.05 1.15 NaN
1/10/14 b 0.06 1.21 NaN
1/11/14 b 0.07 1.28 NaN
1/12/14 b 0.08 1.36 NaN
1/1/15 b 0.09 1.45 NaN
1/2/15 b 0.10 1.55 NaN
1/3/15 b 0.11 1.66 NaN
1/4/15 b 0.12 1.78 11.000000
1/5/15 b 0.13 1.91 5.500000
1/6/15 b 0.14 2.05 3.666667
1/7/15 b 0.15 2.20 2.750000
1/8/15 b 0.16 2.36 2.200000
1/9/15 b 0.17 2.53 1.833333
1/10/15 b 0.18 2.71 1.571429
1/11/15 b 0.19 2.90 1.375000
1/12/15 b 0.20 3.10 1.222222
1/1/16 b 0.21 3.31 1.100000
1/2/16 b 0.22 3.53 1.000000
1/3/16 b 0.23 3.76 0.916667
1/4/16 b 0.24 4.00 0.846154
1/5/16 b 0.25 4.25 0.785714
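For completeness, a groupby-based version of the same calculation is possible instead of the explicit loop (a sketch, assuming 'id' is an index level as in the question):
# one pct_change per id group; same 11-period change as in the loop above
df['mom'] = df.groupby(level='id')['ret'].pct_change(11)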
I am running Python 3.6 and Pandas 0.19.2 in PyCharm Community Edition 2016.3.2 and am trying to ensure a set of rows in my dataframe adds up to 1.
Initially my dataframe looks as follows:
hello world label0 label1 label2
abc def 1.0 0.0 0.0
why not 0.33 0.34 0.33
hello you 0.33 0.38 0.15
I proceed as follows:
# get list of label columns (all column headers that contain the string 'label')
label_list = df.filter(like='label').columns
# ensure every row adds to 1
if (df[label_list].sum(axis=1) != 1).any():
print('ERROR')
Unfortunately, this code does not work for me. What seems to be happening is that instead of summing my rows, I just get the value of the first column in my filtered data. In other words, df[label_list].sum(axis=1) returns:
0 1.0
1 0.33
2 0.33
This should be trivial, but I just can't figure out what I'm doing wrong. Thanks in advance for the help!
UPDATE:
This is an excerpt from my original data after I have filtered for label columns:
label0 label1 label2 label3 label4 label5 label6 label7 label8
1 0.34 0.1 0.1 0.1 0.2 0.4 0.1 0.1 1.2
2 0.34 0.1 0.1 0.1 0.2 0.4 0.1 0.1 1.2
3 0.34 0.1 0.1 0.1 0.2 0.4 0.1 0.1 1.2
4 0.34 0.1 0.1 0.1 0.2 0.4 0.1 0.1 1.2
5 0.34 0.1 0.1 0.1 0.2 0.4 0.1 0.1 1.2
6 0.34 0.1 0.1 0.1 0.2 0.4 0.1 0.1 1.2
7 0.34 0.1 0.1 0.1 0.2 0.4 0.1 0.1 1.2
8 0.34 0.1 0.1 0.1 0.2 0.4 0.1 0.1 1.2
9 0.34 0.1 0.1 0.1 0.2 0.4 0.1 0.1 1.2
My code from above still does not work, and I still have absolutely no idea why. When I run my code in the Python console everything works perfectly fine, but when I run it in PyCharm 2016.3.2, label_data.sum(axis=1) just returns the values of the first column.
With your sample data it works for me. I reproduced your sample and added a new column, check, to verify the sum:
In [3]: df
Out[3]:
hello world label0 label1 label2
0 abc def 1.00 0.00 0.00
1 why not 0.33 0.34 0.33
2 hello you 0.33 0.38 0.15
In [4]: df['check'] = df.sum(axis=1)
In [5]: df
Out[5]:
hello world label0 label1 label2 check
0 abc def 1.00 0.00 0.00 1.00
1 why not 0.33 0.34 0.33 1.00
2 hello you 0.33 0.38 0.15 0.86
In [6]: label_list = df.filter(like='label').columns
In [7]: label_list
Out[7]: Index([u'label0', u'label1', u'label2'], dtype='object')
In [8]: df[label_list].sum(axis=1)
Out[8]:
0 1.00
1 1.00
2 0.86
dtype: float64
In [9]: if (df[label_list].sum(axis=1) != 1).any():
...: print('ERROR')
...:
ERROR
Turns out my data type was not consistent. I used astype(float) and things worked out.
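For reference, a minimal sketch of that fix plus a tolerance-based check (the float cast assumes the label columns were read in as strings; not part of the original post):
import numpy as np

# cast the label columns to float so sum(axis=1) adds numbers rather than concatenating strings
df[label_list] = df[label_list].astype(float)

# compare against 1 with a tolerance, since float sums rarely match exactly
if not np.isclose(df[label_list].sum(axis=1), 1).all():
    print('ERROR')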