how to highlight pandas data frame on selected rows - python

I have the data like this:
df:
A-A A-B A-C A-D A-E
Tg 0.37 10.24 5.02 0.63 20.30
USL 0.39 10.26 5.04 0.65 20.32
LSL 0.35 10.22 5.00 0.63 20.28
1 0.35 10.23 5.05 0.65 20.45
2 0.36 10.19 5.07 0.67 20.25
3 0.34 10.25 5.03 0.66 20.33
4 0.35 10.20 5.08 0.69 20.22
5 0.33 10.17 5.05 0.62 20.40
Max 0.36 10.25 5.08 0.69 20.45
Min 0.33 10.17 5.03 0.62 20.22
I would like to color-highlight the data (index 1-5 in this df) by comparing Max and Min of the data (last two rows) to USL and LSL respectively. if Max > USL or Min < LSL, I would like to highlight the corresponding data points as red. if Max == USL or Min == LSL, corresponding data point as yellow and otherwise everything green.
I tried this :
highlight = np.where(df.loc['Max']>df.loc['USL'], 'background-color: red', '')
df.style.apply(lambda _: highlight)
but i get the error:
ValueError: Function <function <lambda> at 0x7fb681b601f0> created invalid index labels.
Usually, this is the result of the function returning a Series which contains invalid labels, or returning an incorrectly shaped, list-like object which cannot be mapped to labels, possibly due to applying the function along the wrong axis.
Result index has shape: (5,)
Expected index shape: (10,)
Out[58]:
<pandas.io.formats.style.Styler at 0x7fb681b52e20>

Use custom function for create DataFrame of styles by conditions:
#changed data for test
print (df)
A-A A-B A-C A-D
Tg 0.37 10.24 5.02 0.63
USL 0.39 10.26 5.04 0.65
LSL 0.33 0.22 5.00 10.63
1 0.35 10.23 5.05 0.65
2 0.36 10.19 5.07 0.67
3 0.34 10.25 5.03 0.66
4 0.35 10.20 5.08 0.69
5 0.33 10.17 5.05 0.62
Max 0.36 10.25 5.08 0.69
Min 0.33 10.17 5.03 0.62
def hightlight(x):
c1 = 'background-color:red'
c2 = 'background-color:yellow'
c3 = 'background-color:green'
#if values of index are strings
r = list('12345')
#if values of index are integers
r = [1,2,3,4,5]
m1 = (x.loc['Max']>x.loc['USL']) | (x.loc['Min']<x.loc['LSL'])
print (m1)
m2 = (x.loc['Max']==x.loc['USL']) | (x.loc['Min']==x.loc['LSL'])
print (m2)
#DataFrame with same index and columns names as original filled empty strings
df1 = pd.DataFrame('', index=x.index, columns=x.columns)
#modify values of df1 columns by boolean mask
df1.loc[r, :] = np.select([m1, m2], [c1, c2], default=c3)
return df1
df.style.apply(hightlight, axis=None)
EDIT: For compare 1-5 and Min/Max use:
def hightlight(x):
c1 = 'background-color:red'
c2 = 'background-color:yellow'
c3 = 'background-color:green'
#if values of index are strings
r = list('12345')
#if values of index are integers
# r = [1,2,3,4,5]
r += ['Max','Min']
m1 = (x.loc[r]>x.loc['USL']) | (x.loc[r]<x.loc['LSL'])
m2 = (x.loc[r]==x.loc['USL']) | (x.loc[r]==x.loc['LSL'])
#DataFrame with same index and columns names as original filled empty strings
df1 = pd.DataFrame('', index=x.index, columns=x.columns)
#modify values of df1 columns by boolean mask
df1.loc[r, :] = np.select([m1, m2], [c1, c2], default=c3)
return df1
df.style.apply(hightlight, axis=None)

Related

how do you Apply math multiply a number to a decimal point python

i want apply lambda to do multiplication which is condition type data float value like this
0.412
0.0036
0.0467
0.000678
0.00000342
expected output
0.41
0.36
0.47
0.68
0.34
You can use replace with astype and round.
Try this :
df["col"] = df["col"].replace("\.0*", ".", regex=True).astype(float).round(2)
# Output :
print(df)
col
0 0.41
1 0.36
2 0.47
3 0.68
4 0.34
Try this:
import re
lambda_func = lambda x: re.sub(r'(\.0*)', r'.', str(x))

Remove Columns from DataFrame based on Standard Deviation

I am trying to do something that I think should be rather simple but I am stuck.
I would like to be able to get the standard deviation of each column in my dataframe and remove that column if the standard deviation is below a set number. This is as far as I have gotten.
stdev_min = 0.6
df = pd.DataFrame(np.random.randn(20, 5), columns=list('ABCDE'))
namelist = list(df.columns.values.tolist())
stdev = pd.DataFrame(df.std())
I've tried a few things but nothing worth mentioning, any help would be greatly appreciated.
You don't need any loops.
You rarely do with pandas.
In this case, you need boolean indexing:
import pandas
import numpy
numpy.random.seed(37)
stdev_min = 0.95
df = pandas.DataFrame(numpy.random.randn(20, 5), columns=list('ABCDE'))
So now df.std() gives me:
A 0.928547
B 0.859394
C 0.998692
D 1.187380
E 1.092970
dtype: float64
so I can do
df.loc[:, df.std() > stdev_min]
And get:
C D E
0 0.35 -1.30 1.52
1 -0.45 0.96 -0.83
2 0.52 -0.06 -0.03
3 1.89 0.40 0.19
4 -0.27 -2.07 -0.71
5 -1.72 -0.40 1.27
6 0.44 -2.05 -0.23
7 1.76 0.06 0.36
8 -0.30 -2.05 1.68
9 0.34 1.26 -1.08
10 0.10 -0.48 -1.74
11 1.95 -0.08 1.51
12 0.43 -0.06 -0.63
13 -0.30 -1.06 0.57
14 -0.95 -1.45 0.93
15 -1.13 2.23 -0.88
16 -0.77 0.86 0.58
17 0.93 -0.11 -1.29
18 -0.82 0.03 -0.44
19 0.40 1.13 -1.89
Here's a way to do this.
Iterate through each column. Get the Standard Deviation for the column. Check if it is less than the minimum standard deviation value. If it is, drop the column using inplace=True
stdev_min = 0.6
df = pd.DataFrame(np.random.randn(10, 5), columns=list('ABCDE'))
for col in df.columns:
print (col, df[col].std())
if df[col].std() < stdev_min:
df.drop(col,axis='columns', inplace=True)
print (df)
Output:
A 0.5046725928657507
B 1.1382221163449697
C 1.0318169576864502
D 0.7129102193331575
E 1.3805207184389312
The value of A is less than 0.6 and so it got dropped.
B C D E
0 -0.923822 1.155547 -0.601033 -0.066207
1 0.068844 0.426304 -0.376052 0.368574
2 0.585187 -0.367270 0.530934 0.086811
3 0.021466 1.381579 0.483134 -0.300033
4 0.351492 -0.648734 -0.736213 0.827953
5 0.155731 -0.004504 0.315432 0.310515
6 -1.092933 1.341933 -0.672240 -3.482960
7 -0.587766 0.227846 0.246781 1.978528
8 1.565055 0.527668 -0.371854 -0.030196
9 -2.634862 -1.973874 1.508080 -0.362073
Did a few more runs. Here's an example with before and after.
DF before
A B C D E
0 0.496740 0.799021 1.655287 0.091138 0.309186
1 -0.580667 -0.749337 -0.521909 -0.529410 1.010981
2 0.212731 0.126389 -2.244500 0.400540 -0.148761
3 -0.424375 -0.832478 -0.030865 -0.561107 0.196268
4 0.229766 0.688040 0.580294 0.941885 1.554929
5 0.676926 -0.062092 -1.452619 0.952388 -0.963857
6 0.683216 0.747429 -1.834337 -0.402467 -0.383881
7 0.834815 -0.770804 1.299346 1.694612 1.171190
8 0.500445 -1.517488 0.610287 -0.601442 0.343389
9 -0.182286 -0.713332 0.526507 1.042717 1.229628
Standard Deviations for each column of DF:
A 0.49088743174291477
B 0.8047513692231202
C 1.333382184686379
D 0.8248456756163864
E 0.8033725216710547
df['A'] is less than 0.6 and so got dropped.
DF after dropping the column.
B C D E
0 0.799021 1.655287 0.091138 0.309186
1 -0.749337 -0.521909 -0.529410 1.010981
2 0.126389 -2.244500 0.400540 -0.148761
3 -0.832478 -0.030865 -0.561107 0.196268
4 0.688040 0.580294 0.941885 1.554929
5 -0.062092 -1.452619 0.952388 -0.963857
6 0.747429 -1.834337 -0.402467 -0.383881
7 -0.770804 1.299346 1.694612 1.171190
8 -1.517488 0.610287 -0.601442 0.343389
9 -0.713332 0.526507 1.042717 1.229628

Swap and group column names in a pandas DataFrame

I have a data frame with some quantitative data and one qualitative data. I would like to use describe to compute stats and group by column using the qualitative data. But I do not obtain the order I want for the level. Hereafter is an example:
df = pd.DataFrame({k: np.random.random(10) for k in "ABC"})
df["qual"] = 5 * ["init"] + 5 * ["final"]
The DataFrame looks like:
A B C qual
0 0.298217 0.675818 0.076533 init
1 0.015442 0.264924 0.624483 init
2 0.096961 0.702419 0.027134 init
3 0.481312 0.910477 0.796395 init
4 0.166774 0.319054 0.645250 init
5 0.609148 0.697818 0.151092 final
6 0.715744 0.067429 0.761562 final
7 0.748201 0.803647 0.482738 final
8 0.098323 0.614257 0.232904 final
9 0.033003 0.590819 0.943126 final
Now I would like to group by the qual column and compute statistical descriptors using describe. I did the following:
ddf = df.groupby("qual").describe().transpose()
ddf.unstack(level=0)
And I got
qual final init
A B C A B C
count 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000
mean 0.440884 0.554794 0.514284 0.211741 0.574539 0.433959
std 0.347138 0.284931 0.338057 0.182946 0.274135 0.355515
min 0.033003 0.067429 0.151092 0.015442 0.264924 0.027134
25% 0.098323 0.590819 0.232904 0.096961 0.319054 0.076533
50% 0.609148 0.614257 0.482738 0.166774 0.675818 0.624483
75% 0.715744 0.697818 0.761562 0.298217 0.702419 0.645250
max 0.748201 0.803647 0.943126 0.481312 0.910477 0.796395
I am close to what I want but I would like to swap and group the column index such as:
A B C
qual initial final initial final initial final
Is there a way to do it ?
Use columns.swaplevel and then sort_index by level=0 and axis='columns':
ddf = df.groupby('qual').describe().T.unstack(level=0)
ddf.columns = ddf.columns.swaplevel(0,1)
ddf = ddf.sort_index(level=0, axis='columns')
Or in one line using DataFrame.swaplevel instead of index.swaplevel:
ddf = ddf.swaplevel(0,1, axis=1).sort_index(level=0, axis='columns')
A B C
qual final init final init final init
count 5.00 5.00 5.00 5.00 5.00 5.00
mean 0.44 0.21 0.55 0.57 0.51 0.43
std 0.35 0.18 0.28 0.27 0.34 0.36
min 0.03 0.02 0.07 0.26 0.15 0.03
25% 0.10 0.10 0.59 0.32 0.23 0.08
50% 0.61 0.17 0.61 0.68 0.48 0.62
75% 0.72 0.30 0.70 0.70 0.76 0.65
max 0.75 0.48 0.80 0.91 0.94 0.80
Try ddf.stack().unstack(level=[0,2]), inplace of ddf.unstack(level=0)

Convert elements in masked astropy Table to np.nan

Consider the simple process of reading a data file with some non-valid entries. This is my test.dat file:
16 1035.22 1041.09 24.54 0.30 1.39 0.30 1.80 0.30 2.26 0.30 1.14 0.30 0.28 0.30 0.2884
127 824.57 1105.52 25.02 0.29 0.87 0.29 1.30 0.29 2.12 0.29 0.66 0.29 0.10 0.29 0.2986
182 1015.83 904.93 INDEF 0.28 1.80 0.28 1.64 0.28 2.38 0.28 1.04 0.28 0.06 0.28 0.3271
185 1019.15 1155.09 24.31 0.28 1.40 0.28 1.78 0.28 2.10 0.28 0.87 0.28 0.35 0.28 0.3290
192 1024.80 1045.57 24.27 0.27 1.24 0.27 2.01 0.27 2.40 0.27 0.90 0.27 0.09 0.27 0.3328
197 1035.99 876.04 24.10 0.27 1.23 0.27 1.52 0.27 2.59 0.27 0.45 0.27 0.25 0.27 0.3357
198 1110.80 1087.97 24.53 0.27 1.49 0.27 1.71 0.27 2.33 0.27 0.22 0.27 0.00 0.27 0.3362
1103 1168.39 1065.97 24.35 0.27 1.28 0.27 1.29 0.27 2.68 0.27 0.43 0.27 0.26 0.27 0.3388
And this is the code to read it, and replace the "bad" values (INDEF) with a float (99.999)
import numpy as np
from astropy.io import ascii
data = ascii.read("test.dat", fill_values=[('INDEF', '0')])
data = data.filled(99.999)
This works just fine, but if I instead try to replace the bad values with a np.nan (i.e., I use the line data = data.filled(np.nan)) I get:
ValueError: cannot convert float NaN to integer
why is this and how can I get around it?
As noted the issue is that the numpy MaskedArray.filled() method seems to try converting the fill value to the appropriate type before checking if there is actually anything to fill. Since the table in the example has an int column, this fails within numpy (and astropy.Table is just calling the filled() method on each column).
This should work:
In [44]: def fill_cols(tbl, fill=np.nan, kind='f'):
...: """
...: In-place fill of ``tbl`` columns which have dtype ``kind``
...: with ``fill`` value.
...: """
...: for col in tbl.itercols():
...: if col.dtype.kind == kind:
...: col[...] = col.filled(fill)
...:
In [45]: t = simple_table(masked=True)
In [46]: t
Out[46]:
<Table masked=True length=3>
a b c
int64 float64 str1
----- ------- ----
-- 1.0 c
2 2.0 --
3 -- e
In [47]: fill_cols(t)
In [48]: t
Out[48]:
<Table masked=True length=3>
a b c
int64 float64 str1
----- ------- ----
-- 1.0 c
2 2.0 --
3 nan e
I don't think it's primarily a numpy problem, as it works with individual columns:
>>> data['col4'].filled(np.nan)
<Column name='col4' dtype='float64' length=8>
24.54
25.02
nan
24.31
24.27
24.1
24.53
24.35
but you still can't construct a Table from this -
Table([data[n].filled(np.nan) for n in data.colnames])
raises the same error in np.ma.core.
You can explicitly set
data['col4'] = data['col4'].filled(np.nan)
but this apparently lets the table lose its .filled() method...
I am not that familiar with masked arrays and tables, but as you've already filed a related issue on Github, you might want to add this problem.
This is happening fairly deep in numpy, in numpy.ma.filled. fill values have to be scalars, basically.
A messy solution that fills with nan's and still returns a table could look like:
import numpy as np
from astropy.io import ascii
from astropy.table import Table
def fill_with_nan(t):
arr = t.as_array()
arr_list = arr.tolist()
arr = np.array(arr_list)
arr[np.equal(arr, None)] = np.nan
arr = np.array(arr.tolist())
return Table(arr)
data = ascii.read("test.dat", fill_values=[('INDEF', '0')])
data = fill_with_nan(data)
Cut out the middleman? fill_values=[('INDEF', np.nan)]) seems to work.

Pivot table and merge column with headers

I have DataFrame as below:
Ranges Relative_17-Sep Relative_17-Oct Relative_17-Nov
<=20% 0.65 0.36 0.29
>20% 99.35 99.64 99.71
I am trying to find a way to convert this to :
"Sep17<=20%" "Sep17>20%" "Oct17<=20%" "Oct17>20%" "Nov17<=20%" "Nov17>20%"
0.65 99.35 0.36 99.64 0.29 99.71
Any help in this.
Thanks
Option 1
melt
v = df.melt('Ranges')
df = pd.DataFrame(
v['value'].values,
index=v['variable'].str.split('_').str[-1] + v['Ranges']
).T
df
17-Sep<=20% 17-Sep>20% 17-Oct<=20% 17-Oct>20% 17-Nov<=20% 17-Nov>20%
0 0.65 99.35 0.36 99.64 0.29 99.71
Option 2
Modify df.columns, followed by a stacking operation.
df.columns = df.columns.str.split('_').str[-1]
v = df.set_index('Ranges').stack()
df = pd.DataFrame(
v.values,
index=v.index.get_level_values(1) + v.index.get_level_values(0)
).T
df
17-Sep<=20% 17-Oct<=20% 17-Nov<=20% 17-Sep>20% 17-Oct>20% 17-Nov>20%
0 0.65 0.36 0.29 99.35 99.64 99.71

Categories