Consider the following data frames:
import pandas as pd

base_df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7],
    'type_a': ['nan', 'type3', 'type1', 'type2', 'type3', 'type5', 'type4'],
    'q_a': [0, 0.9, 5.1, 3.0, 1.6, 1.1, 0.7],
    'p_a': [0, 0.53, 0.71, 0.6, 0.53, 0.3, 0.33]
})
Edit: This is an extract of base_df. The original df has 100 columns and around 500 observations.
table_df = pd.DataFrame({
    'type': ['type1', 'type2', 'type3', 'type3', 'type3', 'type3', 'type4', 'type4', 'type4', 'type4', 'type5', 'type5', 'type5', 'type6', 'type6'],
    'q_value': [5.1, 3.1, 1.6, 1.3, 0.9, 0.85, 0.7, 0.7, 0.7, 0.5, 1.2, 1.1, 1.1, 0.4, 0.4],
    'p_value': [0.71, 0.62, 0.71, 0.54, 0.53, 0.44, 0.5, 0.54, 0.33, 0.33, 0.32, 0.31, 0.28, 0.31, 0.16],
    'sigma': [2.88, 2.72, 2.73, 2.79, 2.91, 2.41, 2.63, 2.44, 2.7, 2.69, 2.59, 2.67, 2.4, 2.67, 2.35]
})
Edit: The original table_df looks exactly like this one.
For every observation in base_df, I'd like to look up whether type_a matches an entry in table_df. If it does:
If there is exactly one entry with q_a == q_value, assign its sigma to base_df.
If there is more than one entry with that q_value, compare p_a with p_value and assign the sigma of the matching row (e.g., for id 7, type4 has three rows with q_value 0.7, and p_a == 0.33 selects the row with sigma 2.7).
If there is no exactly matching value for q_a or p_a, just use the next bigger value; if there is no bigger value, use the next lower one. Assign the corresponding sigma to the column sigma_a in base_df.
The resulting DF should look like this:
id type_a q_a p_a sigma_a
1 nan 0 0 0
2 type3 0.9 0.53 2.91
3 type1 5.1 0.71 2.88
4 type2 3 0.6 2.72
5 type3 1.6 0.53 2.41
6 type5 1.1 0.3 2.67
7 type4 0.7 0.33 2.7
So far I use the code below:
# Fallback lookup (default backward direction): nearest q_value <= q_a within the same type
mapping = (pd.merge_asof(base_df.sort_values('q_a'),
                         table_df.sort_values('q_value'),
                         left_on='q_a',
                         left_by='type_a',
                         right_on='q_value',
                         right_by='type').set_index('id'))

# Primary lookup (forward direction): next bigger q_value >= q_a; rows with no
# bigger value come back as NaN and are filled from the backward fallback
base_df = (pd.merge_asof(base_df.sort_values('q_a'),
                         table_df.sort_values('q_value'),
                         left_on='q_a',
                         left_by='type_a',
                         right_on='q_value',
                         right_by='type',
                         direction='forward')
           .set_index('id')
           .combine_first(mapping)
           .sort_index()
           .reset_index())
This "two step check routine" works, but I'd like to add the third step checking p_value.
How can I realize it?
Actually, I think the metrics should not be separated into A-segment and B-segment columns; they should be concatenated into the same columns, with something like a "segment" indicator column. Anyway, according to your description, table_df is a reference table and the same criteria apply to the _a and _b columns, so I order it hierarchically with the following manipulation:
table_df.sort_values(by=["type","q_value","p_value"]).reset_index(drop = True)
type q_value p_value sigma
0 type1 5.10 0.71 2.88
1 type2 3.10 0.62 2.72
2 type3 0.85 0.44 2.41
3 type3 0.90 0.53 2.91
4 type3 1.30 0.54 2.79
5 type3 1.60 0.71 2.73
6 type4 0.50 0.33 2.69
7 type4 0.70 0.33 2.70
8 type4 0.70 0.50 2.63
9 type4 0.70 0.54 2.44
10 type5 1.10 0.28 2.40
11 type5 1.10 0.31 2.67
12 type5 1.20 0.32 2.59
13 type6 0.40 0.16 2.35
14 type6 0.40 0.31 2.67
In table_df:
type: a strict, exact-match condition.
q_value & p_value: if there is no exactly matching value for q_a or p_a, just use the next bigger value and assign the corresponding sigma to column sigma_a in base_df. If there is no bigger value, use the last entry for that type in the reference table.
Define the lookup function for _a and _b (yes, they are the same apart from the column suffix): find_sigma_a and find_sigma_b.
def find_sigma_a(row):
    # candidate rows: same type, with q_value and p_value at least as big as the observation
    sigma_value = table_df[
        (table_df["type"] == row["type_a"]) &
        (table_df["q_value"] >= row["q_a"]) &
        (table_df["p_value"] >= row["p_a"])
    ]
    if row["type_a"] == 'nan':
        sigma_value = 0
    elif len(sigma_value) == 0:
        # no bigger value: use the last entry for this type
        sigma_value = table_df[table_df["type"] == row["type_a"]].iloc[-1, 3]
        # .iloc[-1, 3] is equivalent to ["sigma"].tail(1).values[0]
    else:
        # first entry that is >= both q_a and p_a
        sigma_value = sigma_value.iloc[0, 3]
        # .iloc[0, 3] is equivalent to ["sigma"].head(1).values[0]
    return sigma_value
def find_sigma_b(row):
    # same logic as find_sigma_a, applied to the _b columns
    sigma_value = table_df[
        (table_df["type"] == row["type_b"]) &
        (table_df["q_value"] >= row["q_b"]) &
        (table_df["p_value"] >= row["p_b"])
    ]
    if row["type_b"] == 'nan':
        sigma_value = 0
    elif len(sigma_value) == 0:
        sigma_value = table_df[table_df["type"] == row["type_b"]].iloc[-1, 3]
    else:
        sigma_value = sigma_value.iloc[0, 3]
    return sigma_value
Then use pandas.DataFrame.apply to apply these two functions row-wise:
base_df["sigma_a"] = base_df.apply(find_sigma_a, axis = 1)
base_df["sigma_b"] = base_df.apply(find_sigma_b, axis = 1)
type_a q_a p_a type_b q_b p_b sigma_a sigma_b
0 nan 0.0 0.00 type6 0.4 0.11 0.00 2.35
1 type3 0.9 0.53 type3 1.4 0.60 2.91 2.73
2 type1 5.1 0.71 type3 0.9 0.53 2.88 2.91
3 type2 3.0 0.60 type6 0.5 0.40 2.72 2.67
4 type3 1.6 0.53 type6 0.4 0.11 2.73 2.35
5 type5 1.1 0.30 type1 4.9 0.70 2.67 2.88
6 type4 0.7 0.33 type4 0.7 0.20 2.70 2.70
Arrange the columns so each sigma column follows its segment:
base_df.iloc[:,[0,1,2,6,3,4,5,7]]
type_a q_a p_a sigma_a type_b q_b p_b sigma_b
0 nan 0.0 0.00 0.00 type6 0.4 0.11 2.35
1 type3 0.9 0.53 2.91 type3 1.4 0.60 2.73
2 type1 5.1 0.71 2.88 type3 0.9 0.53 2.91
3 type2 3.0 0.60 2.72 type6 0.5 0.40 2.67
4 type3 1.6 0.53 2.73 type6 0.4 0.11 2.35
5 type5 1.1 0.30 2.67 type1 4.9 0.70 2.88
6 type4 0.7 0.33 2.70 type4 0.7 0.20 2.70
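Side note: since find_sigma_a and find_sigma_b differ only in the column suffix, a single parametrized helper could replace both. A minimal sketch (find_sigma is a name I'm introducing here, not from the code above):
def find_sigma(row, suffix):
    # look up sigma for one row, using the type_/q_/p_ columns with the given suffix
    t, q, p = row[f"type_{suffix}"], row[f"q_{suffix}"], row[f"p_{suffix}"]
    if t == 'nan':                                   # missing type -> sigma 0
        return 0
    match = table_df[(table_df["type"] == t) &
                     (table_df["q_value"] >= q) &
                     (table_df["p_value"] >= p)]
    if len(match) == 0:                              # no bigger value: fall back to the last entry
        return table_df.loc[table_df["type"] == t, "sigma"].iloc[-1]
    return match["sigma"].iloc[0]                    # first entry >= both q and p

base_df["sigma_a"] = base_df.apply(find_sigma, axis=1, suffix="a")
base_df["sigma_b"] = base_df.apply(find_sigma, axis=1, suffix="b")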
I have data in the below format.
Input
From To Zone 1 Zone 2 Zone 3 Zone 4 Zone 5
10.1 20 0.45 0.45 0.35 0.45 0.45
20.1 40 0.45 0.45 0.45 0.45 0.70
40.1 50 0.50 0.50 0.55 0.55 0.55
50.1 250 0.75 0.79 0.79 0.80 0.79
Desired Output
From To Kg Attribute Value
10.1 20 0.5 Zone 1 0.45
10.1 20 0.5 Zone 2 0.45
10.1 20 0.5 Zone 3 0.35
10.1 20 0.5 Zone 4 0.45
10.1 20 0.5 Zone 5 0.45
20.1 40 0.5 Zone 1 0.45
20.1 40 0.5 Zone 2 0.45
20.1 40 0.5 Zone 3 0.45
20.1 40 0.5 Zone 4 0.45
20.1 40 0.5 Zone 5 0.70
How can this be done with pandas in Python?
You can set From and To as index and use stack().
(
df.set_index(['From', 'To']).stack().to_frame('Value')
.rename_axis(['From', 'To', 'Attribute'])
.assign(Kg=0.5)
.reset_index()
)
From To Attribute Value Kg
0 10.1 20 Zone1 0.45 0.5
1 10.1 20 Zone2 0.45 0.5
2 10.1 20 Zone3 0.35 0.5
3 10.1 20 Zone4 0.45 0.5
4 10.1 20 Zone5 0.45 0.5
5 20.1 40 Zone1 0.45 0.5
6 20.1 40 Zone2 0.45 0.5
7 20.1 40 Zone3 0.45 0.5
8 20.1 40 Zone4 0.45 0.5
9 20.1 40 Zone5 0.70 0.5
10 40.1 50 Zone1 0.50 0.5
11 40.1 50 Zone2 0.50 0.5
12 40.1 50 Zone3 0.55 0.5
13 40.1 50 Zone4 0.55 0.5
14 40.1 50 Zone5 0.55 0.5
15 50.1 250 Zone1 0.75 0.5
16 50.1 250 Zone2 0.79 0.5
17 50.1 250 Zone3 0.79 0.5
18 50.1 250 Zone4 0.80 0.5
19 50.1 250 Zone5 0.79 0.5
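If you want the exact column order of the desired output (From, To, Kg, Attribute, Value), pd.melt is an equivalent route. A minimal sketch, assuming df holds the input table:
# melt turns the Zone columns into (Attribute, Value) pairs
out = (df.melt(id_vars=['From', 'To'], var_name='Attribute', value_name='Value')
         .assign(Kg=0.5)                    # Kg is a constant 0.5 in the desired output
         .sort_values(['From', 'Attribute'])
         .reset_index(drop=True)
         [['From', 'To', 'Kg', 'Attribute', 'Value']])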
I'm trying to reshape this CSV file in Python to have 1,000 rows and all 12 columns, but it shows up as only one column.
I've tried using df.iloc[:][:1000]; it shortens it to 1,000 rows, but it still only gives me one column.
Dataframe for Wine
df_w= pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv')
df_w= df_w.iloc[:12][:1000]
df_w
The dataframe that shows up has 1,000 rows with only one column, and I'm wondering how to get the data into its respective column titles.
Use sep=';' for the semicolon delimiter. Also note that df_w.iloc[:12][:1000] first takes the first 12 rows and then up to 1,000 of those, so it returns only 12 rows; use .iloc[:1000] to keep the first 1,000 rows:
df_w = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv', sep=';')
df_w = df_w.iloc[:1000]  # first 1,000 rows; all 12 columns are kept
df_w.head(12)
Output:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.9968 3.20 0.68 9.8 5
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.9970 3.26 0.65 9.8 5
3 11.2 0.28 0.56 1.9 0.075 17.0 60.0 0.9980 3.16 0.58 9.8 6
4 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5
5 7.4 0.66 0.00 1.8 0.075 13.0 40.0 0.9978 3.51 0.56 9.4 5
6 7.9 0.60 0.06 1.6 0.069 15.0 59.0 0.9964 3.30 0.46 9.4 5
7 7.3 0.65 0.00 1.2 0.065 15.0 21.0 0.9946 3.39 0.47 10.0 7
8 7.8 0.58 0.02 2.0 0.073 9.0 18.0 0.9968 3.36 0.57 9.5 7
9 7.5 0.50 0.36 6.1 0.071 17.0 102.0 0.9978 3.35 0.80 10.5 5
10 6.7 0.58 0.08 1.8 0.097 15.0 65.0 0.9959 3.28 0.54 9.2 5
11 7.5 0.50 0.36 6.1 0.071 17.0 102.0 0.9978 3.35 0.80 10.5 5
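A quick sanity check after parsing (assuming the snippet above has run; the full red-wine file has 1,599 rows):
# with sep=';' every field lands in its own column
print(df_w.shape)                  # (1000, 12) after the slice; (1599, 12) for the full file
print(df_w.columns.tolist()[:3])   # ['fixed acidity', 'volatile acidity', 'citric acid']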
I want to find the cumulative product of returns. I have tried the following code:
df['cumret'] = df.groupby(level=['date','id']).(1 + df.ret).cumprod() - 1
However, it returns an error message of
SyntaxError: invalid syntax
Any help would be greatly appreciated
import pandas as pd
data = {'date': ['2014-05-01', '2014-05-01', '2014-05-01', '2014-05-01',
                 '2014-05-02', '2014-05-02', '2014-05-02', '2014-05-02',
                 '2014-05-03', '2014-05-03', '2014-05-03', '2014-05-03'],
        'id': ['a', 'b', 'c', 'd', 'a', 'b', 'c', 'd', 'a', 'b', 'c', 'd'],
        'sd': [0.05, 0.01, 0.03, 0.05, 0.10, 0.04, 0.01, 0.03, 0.06, 0.07,
               0.10, 0.20],
        'ret': [0.01, 0.05, -0.06, -0.10, 0.20, 0.08, 0.09, 0.10, 0.20, 0.03,
                0.30, -0.15]}
df = pd.DataFrame(data).set_index(['date', 'id']).sort_index(level='date')
df
                 sd   ret
date       id
2014-05-01 a   0.05  0.01
           b   0.01  0.05
           c   0.03 -0.06
           d   0.05 -0.10
2014-05-02 a   0.10  0.20
           b   0.04  0.08
           c   0.01  0.09
           d   0.03  0.10
2014-05-03 a   0.06  0.20
           b   0.07  0.03
           c   0.10  0.30
           d   0.20 -0.15
Desired output
                 sd   ret  cumret
date       id
2014-05-01 a   0.05  0.01    1.01
           b   0.01  0.05    1.05
           c   0.03 -0.06    0.94
           d   0.05 -0.10    0.90
2014-05-02 a   0.10  0.20    1.21
           b   0.04  0.08    1.13
           c   0.01  0.09    1.03
           d   0.03  0.10    1.00
2014-05-03 a   0.06  0.20    1.41
           b   0.07  0.03    1.16
           c   0.10  0.30    1.33
           d   0.20 -0.15    0.85
I believe you need to add 1 to the ret column and group by the id level only:
df['cumret'] = (df['ret'] + 1).groupby(level=['id']).cumprod()
print(df)
sd ret cumret
date id
2014-05-01 a 0.05 0.01 1.01000
b 0.01 0.05 1.05000
c 0.03 -0.06 0.94000
d 0.05 -0.10 0.90000
2014-05-02 a 0.10 0.20 1.21200
b 0.04 0.08 1.13400
c 0.01 0.09 1.02460
d 0.03 0.10 0.99000
2014-05-03 a 0.06 0.20 1.45440
b 0.07 0.03 1.16802
c 0.10 0.30 1.33198
d 0.20 -0.15 0.84150
If you want to group by both levels (each (date, id) group then contains a single row, so cumret is simply 1 + ret):
df['cumret'] = (df['ret'] + 1).groupby(level=['date', 'id']).cumprod()
print(df)
sd ret cumret
date id
2014-05-01 a 0.05 0.01 1.01
b 0.01 0.05 1.05
c 0.03 -0.06 0.94
d 0.05 -0.10 0.90
2014-05-02 a 0.10 0.20 1.20
b 0.04 0.08 1.08
c 0.01 0.09 1.09
d 0.03 0.10 1.10
2014-05-03 a 0.06 0.20 1.20
b 0.07 0.03 1.03
c 0.10 0.30 1.30
d 0.20 -0.15 0.85
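And if you want cumulative returns rather than growth factors (the - 1 from your original attempt), subtract 1 after the cumulative product:
df['cumret'] = (df['ret'] + 1).groupby(level=['id']).cumprod() - 1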
I would appreciate it if you could let me know how to apply describe() to calculate summary statistics by group. My data (TrainSet) looks like the following, but there are a lot of columns:
Financial Distress x1 x2 x3
0 1.28 0.02 0.87
0 1.27 0.01 0.82
0 1.05 -0.06 0.92
1 1.11 -0.02 0.86
0 1.06 0.11 0.81
0 1.06 0.08 0.88
1 0.87 -0.03 0.79
I want to compute the summary statistics by "Financial Distress" as it is shown below:
count mean std min 25% 50% 75% max
cat index
x1 0 2474 1.4 1.3 0.07 0.95 1.1 1.54 38.1
1 95 0.7 -1.7 0.02 2.9 2.1 1.75 11.2
x2 0 2474 0.9 1.7 0.02 1.9 1.4 1.75 11.2
1 95 .45 1.95 0.07 2.8 1.6 2.94 20.12
x3 0 2474 2.4 1.5 0.07 0.85 1.2 1.3 30.1
1 95 1.9 2.3 0.33 6.1 0.15 1.66 12.3
I wrote the following code, but it does not provide the answer in the aforementioned format.
Statistics = pd.concat([TrainSet[TrainSet["Financial Distress"] == 0].describe(),
                        TrainSet[TrainSet["Financial Distress"] == 1].describe()])
Statistics.to_csv("Descriptive Statistics1.csv")
Thanks in advance.
The result of coldspeed's solution:
Financial Distress count mean std
x1 0 2474 1.398623286 1.320468688
x1 1 95 1.028107053 0.360206966
x10 0 2474 0.143310534 0.136257947
x10 1 95 -0.032919408 0.080409407
x100 0 2474 0.141875505 0.348992946
x100 1 95 0.115789474 0.321669776
You can use DataFrameGroupBy.describe with unstack, but note that describe sorts the columns alphabetically (x1, x10, x2), so the original ordering is restored with reindex:
print (df)
Financial Distress x1 x2 x10
0 0 1.28 0.02 0.87
1 0 1.27 0.01 0.82
2 0 1.05 -0.06 0.92
3 1 1.11 -0.02 0.86
4 0 1.06 0.11 0.81
5 0 1.06 0.08 0.88
6 1 0.87 -0.03 0.79
df1 = (df.groupby('Financial Distress')
         .describe()                         # stats in a (column, stat) column MultiIndex
         .unstack()                          # long Series: (column, stat, group)
         .unstack(1)                         # stats back into columns
         .reindex(df.columns[1:], level=0))  # restore the original column order
print (df1)
count mean std min 25% 50% 75% \
Financial Distress
x1 0 5.0 1.144 0.119708 1.05 1.0600 1.060 1.2700
1 2.0 0.990 0.169706 0.87 0.9300 0.990 1.0500
x2 0 5.0 0.032 0.066106 -0.06 0.0100 0.020 0.0800
1 2.0 -0.025 0.007071 -0.03 -0.0275 -0.025 -0.0225
x10 0 5.0 0.860 0.045277 0.81 0.8200 0.870 0.8800
1 2.0 0.825 0.049497 0.79 0.8075 0.825 0.8425
max
Financial Distress
x1 0 1.28
1 1.11
x2 0 0.11
1 -0.02
x10 0 0.92
1 0.86
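Since your original code exported the statistics, df1 can be written out the same way (reusing the filename from your snippet):
df1.to_csv("Descriptive Statistics1.csv")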