How to use describe() by group for all variables? - python

I would appreciate it if you could let me know how to apply describe() to calculate summary statistics by group. My data (TrainSet) looks like the following, but there are many more columns:
Financial Distress x1 x2 x3
0 1.28 0.02 0.87
0 1.27 0.01 0.82
0 1.05 -0.06 0.92
1 1.11 -0.02 0.86
0 1.06 0.11 0.81
0 1.06 0.08 0.88
1 0.87 -0.03 0.79
I want to compute the summary statistics by "Financial Distress" as it is shown below:
count mean std min 25% 50% 75% max
cat index
x1 0 2474 1.4 1.3 0.07 0.95 1.1 1.54 38.1
1 95 0.7 -1.7 0.02 2.9 2.1 1.75 11.2
x2 0 2474 0.9 1.7 0.02 1.9 1.4 1.75 11.2
1 95 .45 1.95 0.07 2.8 1.6 2.94 20.12
x3 0 2474 2.4 1.5 0.07 0.85 1.2 1.3 30.1
1 95 1.9 2.3 0.33 6.1 0.15 1.66 12.3
I wrote the following code, but it does not produce the answer in the format shown above.
Statistics = pd.concat([TrainSet[TrainSet["Financial Distress"] == 0].describe(),
                        TrainSet[TrainSet["Financial Distress"] == 1].describe()])
Statistics.to_csv("Descriptive Statistics1.csv")
Thanks in advance.
The result of coldspeed's solution:
Financial Distress count mean std
x1 0 2474 1.398623286 1.320468688
x1 1 95 1.028107053 0.360206966
x10 0 2474 0.143310534 0.136257947
x10 1 95 -0.032919408 0.080409407
x100 0 2474 0.141875505 0.348992946
x100 1 95 0.115789474 0.321669776

You can use DataFrameGroupBy.describe with unstack first, but by default this changes the ordering, so reindex is used to restore the original column order:
print (df)
Financial Distress x1 x2 x10
0 0 1.28 0.02 0.87
1 0 1.27 0.01 0.82
2 0 1.05 -0.06 0.92
3 1 1.11 -0.02 0.86
4 0 1.06 0.11 0.81
5 0 1.06 0.08 0.88
6 1 0.87 -0.03 0.79
df1 = (df.groupby('Financial Distress')
         .describe()
         .unstack()
         .unstack(1)
         .reindex(df.columns[1:], level=0))
print (df1)
count mean std min 25% 50% 75% \
Financial Distress
x1 0 5.0 1.144 0.119708 1.05 1.0600 1.060 1.2700
1 2.0 0.990 0.169706 0.87 0.9300 0.990 1.0500
x2 0 5.0 0.032 0.066106 -0.06 0.0100 0.020 0.0800
1 2.0 -0.025 0.007071 -0.03 -0.0275 -0.025 -0.0225
x10 0 5.0 0.860 0.045277 0.81 0.8200 0.870 0.8800
1 2.0 0.825 0.049497 0.79 0.8075 0.825 0.8425
max
Financial Distress
x1 0 1.28
1 1.11
x2 0 0.11
1 -0.02
x10 0 0.92
1 0.86
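As a side note, the same shape can be reached with a single stack plus swaplevel. This is a sketch of an equivalent chain, not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame({
    'Financial Distress': [0, 0, 0, 1, 0, 0, 1],
    'x1': [1.28, 1.27, 1.05, 1.11, 1.06, 1.06, 0.87],
    'x2': [0.02, 0.01, -0.06, -0.02, 0.11, 0.08, -0.03],
    'x10': [0.87, 0.82, 0.92, 0.86, 0.81, 0.88, 0.79],
})

# stack(level=0) moves the variable names into the index, swaplevel puts
# them in front of the group, and reindex restores the original column order.
stats = (df.groupby('Financial Distress')
           .describe()
           .stack(level=0)
           .swaplevel()
           .reindex(df.columns[1:], level=0))
print(stats)
```

Note that stack with an explicit level argument is deprecated in recent pandas versions in favor of the future_stack behavior, so this may emit a warning.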

Related

How to transform row data into column data format using pandas in python

I have data in the below format.
Input
From To Zone 1 Zone 2 Zone 3 Zone 4 Zone 5
10.1 20 0.45 0.45 0.35 0.45 0.45
20.1 40 0.45 0.45 0.45 0.45 0.70
40.1 50 0.50 0.50 0.55 0.55 0.55
50.1 250 0.75 0.79 0.79 0.80 0.79
Desired Output
From To Kg Attribute Value
10.1 20 0.5 Zone 1 0.45
10.1 20 0.5 Zone 2 0.45
10.1 20 0.5 Zone 3 0.35
10.1 20 0.5 Zone 4 0.45
10.1 20 0.5 Zone 5 0.45
20.1 40 0.5 Zone 1 0.45
20.1 40 0.5 Zone 2 0.45
20.1 40 0.5 Zone 3 0.45
20.1 40 0.5 Zone 4 0.45
20.1 40 0.5 Zone 5 0.70
How can this be done in pandas python?
You can set From and To as index and use stack().
(
    df.set_index(['From', 'To']).stack().to_frame('Value')
      .rename_axis(['From', 'To', 'Attribute'])
      .assign(Kg=0.5)
      .reset_index()
)
From To Attribute Value Kg
0 10.1 20 Zone1 0.45 0.5
1 10.1 20 Zone2 0.45 0.5
2 10.1 20 Zone3 0.35 0.5
3 10.1 20 Zone4 0.45 0.5
4 10.1 20 Zone5 0.45 0.5
5 20.1 40 Zone1 0.45 0.5
6 20.1 40 Zone2 0.45 0.5
7 20.1 40 Zone3 0.45 0.5
8 20.1 40 Zone4 0.45 0.5
9 20.1 40 Zone5 0.70 0.5
10 40.1 50 Zone1 0.50 0.5
11 40.1 50 Zone2 0.50 0.5
12 40.1 50 Zone3 0.55 0.5
13 40.1 50 Zone4 0.55 0.5
14 40.1 50 Zone5 0.55 0.5
15 50.1 250 Zone1 0.75 0.5
16 50.1 250 Zone2 0.79 0.5
17 50.1 250 Zone3 0.79 0.5
18 50.1 250 Zone4 0.80 0.5
19 50.1 250 Zone5 0.79 0.5
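melt is an equivalent route to the same long format; a sketch, with the constant Kg=0.5 column assumed from the desired output:

```python
import pandas as pd

df = pd.DataFrame({
    'From': [10.1, 20.1, 40.1, 50.1],
    'To': [20, 40, 50, 250],
    'Zone 1': [0.45, 0.45, 0.50, 0.75],
    'Zone 2': [0.45, 0.45, 0.50, 0.79],
    'Zone 3': [0.35, 0.45, 0.55, 0.79],
    'Zone 4': [0.45, 0.45, 0.55, 0.80],
    'Zone 5': [0.45, 0.70, 0.55, 0.79],
})

# melt turns the zone columns into (Attribute, Value) rows in one call;
# sort_values groups the five zones back under each (From, To) pair.
out = (df.melt(id_vars=['From', 'To'],
               var_name='Attribute', value_name='Value')
         .assign(Kg=0.5)
         .sort_values(['From', 'To', 'Attribute'])
         .reset_index(drop=True))
print(out)
```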

Pandas change value of column if other column values don't meet criteria

I have the following data frame. I want to check the values of each row for the columns of "mental_illness", "feeling", and "flavor". If all the values for those three columns per row are less than 0.5, I want to change the corresponding value of the "unclassified" column to 1.0.
sent_no pos unclassified mental_illness feeling flavor
0 0 word_1 0.0 0.75 0.30 0.28
1 1 word_2 0.0 0.17 0.72 0.16
2 2 word_3 0.0 0.19 0.38 0.16
3 3 word_4 0.0 0.39 0.20 0.14
4 4 word_5 0.0 0.72 0.30 0.14
Expected result:
sent_no pos unclassified mental_illness feeling flavor
0 0 word_1 0.0 0.75 0.30 0.28
1 1 word_2 0.0 0.17 0.72 0.16
2 2 word_3 1.0 0.19 0.38 0.16
3 3 word_4 1.0 0.39 0.20 0.14
4 4 word_5 0.0 0.72 0.30 0.14
How do I go about doing so?
Use .le and .all over axis=1:
m = df[['mental_illness', 'feeling', 'flavor']].le(0.5).all(axis=1)
df['unclassified'] = m.astype(int)
sent_no pos unclassified mental_illness feeling flavor
0 0 word_1 0 0.75 0.30 0.28
1 1 word_2 0 0.17 0.72 0.16
2 2 word_3 1 0.19 0.38 0.16
3 3 word_4 1 0.39 0.20 0.14
4 4 word_5 0 0.72 0.30 0.14
Would this work?
mask1 = df["mental_illness"] < 0.5
mask2 = df["feeling"] < 0.5
mask3 = df["flavor"] < 0.5
df.loc[mask1 & mask2 & mask3, 'unclassified'] = 1
Here is my solution:
data.unclassified = (
    data[['mental_illness', 'feeling', 'flavor']]
    .apply(lambda x: x.le(0.5))
    .apply(lambda x: 1 if sum(x) == 3 else 0, axis=1)
)
output
sent_no pos unclassified mental_illness feeling flavor
0 0 Word_1 0 0.75 0.30 0.28
1 1 Word_2 0 0.17 0.72 0.16
2 2 Word_3 1 0.19 0.38 0.16
3 3 Word_4 1 0.39 0.20 0.14
4 4 Word_5 0 0.72 0.30 0.14
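Since the question says strictly "less than 0.5", lt is a closer match than le; a minimal sketch that sets only the qualifying rows and leaves the rest of the column untouched:

```python
import pandas as pd

df = pd.DataFrame({
    'pos': ['word_1', 'word_2', 'word_3', 'word_4', 'word_5'],
    'unclassified': [0.0, 0.0, 0.0, 0.0, 0.0],
    'mental_illness': [0.75, 0.17, 0.19, 0.39, 0.72],
    'feeling': [0.30, 0.72, 0.38, 0.20, 0.30],
    'flavor': [0.28, 0.16, 0.16, 0.14, 0.14],
})

# lt(0.5) is strictly "less than", matching the question's wording;
# all(axis=1) is True only when every one of the three columns qualifies.
cond = df[['mental_illness', 'feeling', 'flavor']].lt(0.5).all(axis=1)
df.loc[cond, 'unclassified'] = 1.0
```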

Calculating momentum signal in python using 1 month and 12 month lag

I want to calculate a simple momentum signal. The method I am following is the 1-month lagged cumret divided by the 12-month lagged cumret, minus 1.
date starts at 1/5/14 and ends at 1/5/16. Since a 12-month lag is required, the first mom signal starts 12 months after the start of date; hence the first mom signal is at 1/5/15.
Here is the data utilized:
import pandas as pd
data = {'date': ['1/5/14','1/6/14','1/7/14','1/8/14','1/9/14','1/10/14','1/11/14','1/12/14',
                 '1/1/15','1/2/15','1/3/15','1/4/15','1/5/15','1/6/15','1/7/15','1/8/15',
                 '1/9/15','1/10/15','1/11/15','1/12/15','1/1/16','1/2/16','1/3/16','1/4/16','1/5/16'],
        'id': ['a'] * 25,
        'ret': [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.10, 0.11, 0.12, 0.13,
                0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.20, 0.21, 0.22, 0.23, 0.24, 0.25],
        'cumret': [1.01, 1.03, 1.06, 1.1, 1.15, 1.21, 1.28, 1.36, 1.45, 1.55, 1.66, 1.78, 1.91,
                   2.05, 2.2, 2.36, 2.53, 2.71, 2.9, 3.1, 3.31, 3.53, 3.76, 4, 4.25]}
df = pd.DataFrame(data).set_index(['date', 'id'])
Desired output
ret cumret mom
date id
1/5/14 a .01 1.01
1/6/14 a .02 1.03
1/7/14 a .03 1.06
1/8/14 a .04 1.1
1/9/14 a .05 1.15
1/10/14 a .06 1.21
1/11/14 a .07 1.28
1/12/14 a .08 1.36
1/1/15 a .09 1.45
1/2/15 a .1 1.55
1/3/15 a .11 1.66
1/4/15 a .12 1.78
1/5/15 a .13 1.91 .8
1/6/15 a .14 2.05 .9
1/7/15 a .15 2.2 .9
1/8/15 a .16 2.36 1
1/9/15 a .17 2.53 1.1
1/10/15 a .18 2.71 1.1
1/11/15 a .19 2.9 1.1
1/12/15 a .2 3.1 1.1
1/1/16 a .21 3.31 1.1
1/2/16 a .22 3.53 1.1
1/3/16 a .23 3.76 1.1
1/4/16 a .24 4 1.1
1/5/16 a .25 4.25 1.1
This is the code I tried to calculate mom:
df['mom'] = ((df['cumret'].shift(-1) / (df['cumret'].shift(-12))) - 1).groupby(level = ['id'])
The entire dataset has more ids, e.g. a, b, c. I've included just one for this example.
Any help would be awesome! :)
As far as I know, momentum is simply a rate of change. Pandas has a built-in method for this:
df['mom'] = df['ret'].pct_change(12) # 12 month change
Also, I am not sure why you are using cumret instead of ret to calculate momentum.
Update: If you have multiple IDs that you need to go through, I'd recommend:
for i in df.index.levels[1]:
    temp = df.loc[(slice(None), i), "ret"].pct_change(11)
    df.loc[(slice(None), i), "mom"] = temp
    # or, for short:
    # df.loc[(slice(None), i), "mom"] = df.loc[(slice(None), i), "ret"].pct_change(11)
Output:
ret cumret mom
date id
1/5/14 a 0.01 1.01 NaN
1/6/14 a 0.02 1.03 NaN
1/7/14 a 0.03 1.06 NaN
1/8/14 a 0.04 1.10 NaN
1/9/14 a 0.05 1.15 NaN
1/10/14 a 0.06 1.21 NaN
1/11/14 a 0.07 1.28 NaN
1/12/14 a 0.08 1.36 NaN
1/1/15 a 0.09 1.45 NaN
1/2/15 a 0.10 1.55 NaN
1/3/15 a 0.11 1.66 NaN
1/4/15 a 0.12 1.78 11.000000
1/5/15 a 0.13 1.91 5.500000
1/6/15 a 0.14 2.05 3.666667
1/7/15 a 0.15 2.20 2.750000
1/8/15 a 0.16 2.36 2.200000
1/9/15 a 0.17 2.53 1.833333
1/10/15 a 0.18 2.71 1.571429
1/11/15 a 0.19 2.90 1.375000
1/12/15 a 0.20 3.10 1.222222
1/1/16 a 0.21 3.31 1.100000
1/2/16 a 0.22 3.53 1.000000
1/3/16 a 0.23 3.76 0.916667
1/4/16 a 0.24 4.00 0.846154
1/5/16 a 0.25 4.25 0.785714
1/5/14 b 0.01 1.01 NaN
1/6/14 b 0.02 1.03 NaN
1/7/14 b 0.03 1.06 NaN
1/8/14 b 0.04 1.10 NaN
1/9/14 b 0.05 1.15 NaN
1/10/14 b 0.06 1.21 NaN
1/11/14 b 0.07 1.28 NaN
1/12/14 b 0.08 1.36 NaN
1/1/15 b 0.09 1.45 NaN
1/2/15 b 0.10 1.55 NaN
1/3/15 b 0.11 1.66 NaN
1/4/15 b 0.12 1.78 11.000000
1/5/15 b 0.13 1.91 5.500000
1/6/15 b 0.14 2.05 3.666667
1/7/15 b 0.15 2.20 2.750000
1/8/15 b 0.16 2.36 2.200000
1/9/15 b 0.17 2.53 1.833333
1/10/15 b 0.18 2.71 1.571429
1/11/15 b 0.19 2.90 1.375000
1/12/15 b 0.20 3.10 1.222222
1/1/16 b 0.21 3.31 1.100000
1/2/16 b 0.22 3.53 1.000000
1/3/16 b 0.23 3.76 0.916667
1/4/16 b 0.24 4.00 0.846154
1/5/16 b 0.25 4.25 0.785714
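The exact formula the question states (1-month lagged cumret divided by 12-month lagged cumret, minus 1) can also be written with group-aware shifts, avoiding the explicit loop. A minimal sketch on a single id, using the question's cumret values; the first value it produces matches the ~.8 shown for 1/5/15 in the desired output:

```python
import pandas as pd

# Single-id frame with the question's cumret values (dates omitted for brevity).
df = pd.DataFrame({
    'id': ['a'] * 25,
    'cumret': [1.01, 1.03, 1.06, 1.1, 1.15, 1.21, 1.28, 1.36, 1.45, 1.55,
               1.66, 1.78, 1.91, 2.05, 2.2, 2.36, 2.53, 2.71, 2.9, 3.1,
               3.31, 3.53, 3.76, 4.0, 4.25],
})

# 1-month lagged cumret over 12-month lagged cumret, minus 1,
# computed within each id so values never leak across ids.
g = df.groupby('id')['cumret']
df['mom'] = g.shift(1) / g.shift(12) - 1
```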

Row based chart plot (Seaborn or Matplotlib)

Given that my data is a pandas dataframe and looks like this:
Ref +1 +2 +3 +4 +5 +6 +7
2013-05-28 1 -0.44 0.03 0.06 -0.31 0.13 0.56 0.81
2013-07-05 2 0.84 1.03 0.96 0.90 1.09 0.59 1.15
2013-08-21 3 0.09 0.25 0.06 0.09 -0.09 -0.16 0.56
2014-10-15 4 0.35 1.16 1.91 3.44 2.75 1.97 2.16
2015-02-09 5 0.09 -0.10 -0.38 -0.69 -0.25 -0.85 -0.47
How can I plot a chart of the 5 lines (1 for each ref), where the x axis is the columns (+1, +2, ...) and each line starts from 0? If it is in seaborn, even better, but matplotlib solutions are also welcome.
Plotting a dataframe in pandas is generally all about reshaping the table so that the individual lines you want are in separate columns, and the x-values are in the index. Some of these reshape operations are a bit ugly, but you can do:
df = pd.read_clipboard()
plot_table = pd.melt(df.reset_index(), id_vars=['index', 'Ref'])
plot_table = plot_table.pivot(index='variable', columns='Ref', values='value')
# Add extra row to have all lines start from 0:
plot_table.loc['+0', :] = 0
plot_table = plot_table.sort_index()
plot_table
Ref 1 2 3 4 5
variable
+0 0.00 0.00 0.00 0.00 0.00
+1 -0.44 0.84 0.09 0.35 0.09
+2 0.03 1.03 0.25 1.16 -0.10
+3 0.06 0.96 0.06 1.91 -0.38
+4 -0.31 0.90 0.09 3.44 -0.69
+5 0.13 1.09 -0.09 2.75 -0.25
+6 0.56 0.59 -0.16 1.97 -0.85
+7 0.81 1.15 0.56 2.16 -0.47
Now that you have a table with the right shape, plotting is pretty automatic:
plot_table.plot()
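For a seaborn version, the same data can instead be reshaped to long form. A sketch using only the first three value columns for brevity; the lineplot call is left commented because it assumes seaborn is available:

```python
import pandas as pd

df = pd.DataFrame({
    'Ref': [1, 2, 3, 4, 5],
    '+1': [-0.44, 0.84, 0.09, 0.35, 0.09],
    '+2': [0.03, 1.03, 0.25, 1.16, -0.10],
    '+3': [0.06, 0.96, 0.06, 1.91, -0.38],
})

# Seaborn wants long-form data: one row per (Ref, step, value).
long = df.melt(id_vars='Ref', var_name='step', value_name='value')
# Prepend a zero row per Ref so every line starts at 0:
zeros = pd.DataFrame({'Ref': df['Ref'], 'step': '+0', 'value': 0.0})
long = pd.concat([zeros, long], ignore_index=True)

# Plotting (assumes seaborn is installed):
# import seaborn as sns
# sns.lineplot(data=long, x='step', y='value', hue='Ref')
```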

feed empty pandas.dataframe with several files

I would like to fill an empty dataframe by appending several files of the same type and structure. However, I can't see what's wrong here:
def files2df(colnames, ext):
    df = DataFrame(columns=colnames)
    for inf in sorted(glob.glob(ext)):
        dfin = read_csv(inf, sep='\t', skiprows=1)
        print(dfin.head(), '\n')
        df.append(dfin, ignore_index=True)
    return df
The resulting dataframe is empty. Could someone give me a hand?
1.0 16.59 0.597 0.87 1.0.1 3282 100.08
0 0.953 14.52 0.561 0.80 0.99 4355 -
1 1.000 31.59 1.000 0.94 1.00 6322 -
2 1.000 6.09 0.237 0.71 1.00 10568 -
3 1.000 31.29 1.000 0.94 1.00 14363 -
4 1.000 31.59 1.000 0.94 1.00 19797 -
1.0 6.69 0.199 0.74 1.0.1 186 13.16
0 1 0.88 0.020 0.13 0.99 394 -
1 1 0.75 0.017 0.11 0.99 1052 -
2 1 3.34 0.097 0.57 1.00 1178 -
3 1 1.50 0.035 0.26 1.00 1211 -
4 1 20.59 0.940 0.88 1.00 1583 -
1.0 0.12 0.0030 0.04 0.97 2285 2.62
0 1 1.25 0.135 0.18 0.99 2480 -
1 1 0.03 0.001 0.04 0.97 7440 -
2 1 0.12 0.003 0.04 0.97 8199 -
3 1 1.10 0.092 0.16 0.99 11174 -
4 1 0.27 0.007 0.06 0.98 11310 -
0.244 0.07 0.0030 0.02 0.76 41314 1.32
0 0.181 0.64 0.028 0.03 0.36 41755 -
1 0.161 0.18 0.008 0.01 0.45 42420 -
2 0.161 0.18 0.008 0.01 0.45 42461 -
3 0.237 0.25 0.011 0.02 0.56 43060 -
4 0.267 1.03 0.047 0.07 0.46 43321 -
0.163 0.12 0.0060 0.01 0.5 103384 1.27
0 0.243 0.27 0.014 0.02 0.56 104693 -
1 0.215 0.66 0.029 0.04 0.41 105192 -
2 0.190 0.10 0.005 0.01 0.59 105758 -
3 0.161 0.12 0.006 0.01 0.50 109783 -
4 0.144 0.16 0.007 0.01 0.42 110067 -
Empty DataFrame
Columns: array([D, LOD, r2, CIlow, CIhi, Dist, T-int], dtype=object)
Index: array([], dtype=object)
df.append(dfin, ignore_index=True) returns a new DataFrame; it does not change df in place.
Use df = df.append(dfin, ignore_index=True). append does combine the data along axis=0 (rows), so with that change the loop will work, but appending inside a loop copies the frame on every iteration.
In this scenario (reading multiple files and using all the data to build a single DataFrame), I would use pandas.concat(). The code below gives you a frame with columns named by colnames, whose rows come from the data in the csv files.
def files2df(colnames, ext):
    files = sorted(glob.glob(ext))
    frames = [read_csv(inf, sep='\t', skiprows=1, names=colnames) for inf in files]
    return concat(frames, ignore_index=True)
I did not try this code, just wrote it here; you may need to tweak it to get it running, but the idea is clear (I hope).
Also, I found another solution, but I don't know which one is faster.
def files2df(colnames, ext):
    dflist = []
    for inf in sorted(glob.glob(ext)):
        dflist.append(read_csv(inf, names=colnames, sep='\t', skiprows=1))
    df = concat(dflist, axis=0, ignore_index=True)
    return df
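To sanity-check the concat pattern without touching the filesystem, StringIO objects can stand in for the paths glob would return; the two-column file contents below are made up for illustration:

```python
from io import StringIO
import pandas as pd

# StringIO objects stand in for the file paths returned by glob.glob;
# the first line of each "file" is a header that skiprows=1 discards.
file1 = "colA\tcolB\n1.0\t16.59\n0.953\t14.52\n"
file2 = "colA\tcolB\n1.0\t6.69\n0.88\t0.02\n"
colnames = ['D', 'LOD']

frames = [pd.read_csv(StringIO(f), sep='\t', skiprows=1, names=colnames)
          for f in (file1, file2)]
df = pd.concat(frames, ignore_index=True)
print(df)
```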
