Pivot table and merge column with headers - python

I have DataFrame as below:
Ranges  Relative_17-Sep  Relative_17-Oct  Relative_17-Nov
<=20%              0.65             0.36             0.29
>20%              99.35            99.64            99.71
I am trying to find a way to convert this to :
"Sep17<=20%" "Sep17>20%" "Oct17<=20%" "Oct17>20%" "Nov17<=20%" "Nov17>20%"
0.65 99.35 0.36 99.64 0.29 99.71
Any help with this would be appreciated.
Thanks
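For reference, the answers below assume the frame is rebuilt as follows (a minimal reconstruction of the data shown above):
import pandas as pd

df = pd.DataFrame({
    'Ranges': ['<=20%', '>20%'],
    'Relative_17-Sep': [0.65, 99.35],
    'Relative_17-Oct': [0.36, 99.64],
    'Relative_17-Nov': [0.29, 99.71],
})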

Option 1
melt
v = df.melt('Ranges')
df = pd.DataFrame(
    v['value'].values,
    index=v['variable'].str.split('_').str[-1] + v['Ranges']
).T
df
17-Sep<=20% 17-Sep>20% 17-Oct<=20% 17-Oct>20% 17-Nov<=20% 17-Nov>20%
0 0.65 99.35 0.36 99.64 0.29 99.71
Option 2
Modify df.columns, followed by a stacking operation.
df.columns = df.columns.str.split('_').str[-1]
v = df.set_index('Ranges').stack()
df = pd.DataFrame(
    v.values,
    index=v.index.get_level_values(1) + v.index.get_level_values(0)
).T
df
17-Sep<=20% 17-Oct<=20% 17-Nov<=20% 17-Sep>20% 17-Oct>20% 17-Nov>20%
0 0.65 0.36 0.29 99.35 99.64 99.71


How do you apply math to multiply a number to a decimal point in Python

I want to apply a lambda that transforms float values like this:
0.412
0.0036
0.0467
0.000678
0.00000342
Expected output:
0.41
0.36
0.47
0.68
0.34
You can use replace (with a regex) together with astype and round. Note that Series.replace with regex=True only acts on string values, so this assumes the column holds strings.
Try this:
df["col"] = df["col"].replace(r"\.0*", ".", regex=True).astype(float).round(2)
# Output :
print(df)
col
0 0.41
1 0.36
2 0.47
3 0.68
4 0.34
Try this (note that str(x) assumes the value does not render in scientific notation):
import re
# strip the zeros right after the decimal point, then round to two decimals
lambda_func = lambda x: round(float(re.sub(r'\.0*', '.', str(x))), 2)
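Applied to the column (assuming, as in the first answer, the values live in df["col"]):
df["col"] = df["col"].apply(lambda_func)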

how to highlight pandas data frame on selected rows

I have the data like this:
df:
A-A A-B A-C A-D A-E
Tg 0.37 10.24 5.02 0.63 20.30
USL 0.39 10.26 5.04 0.65 20.32
LSL 0.35 10.22 5.00 0.63 20.28
1 0.35 10.23 5.05 0.65 20.45
2 0.36 10.19 5.07 0.67 20.25
3 0.34 10.25 5.03 0.66 20.33
4 0.35 10.20 5.08 0.69 20.22
5 0.33 10.17 5.05 0.62 20.40
Max 0.36 10.25 5.08 0.69 20.45
Min 0.33 10.17 5.03 0.62 20.22
I would like to color-highlight the data (index 1-5 in this df) by comparing the Max and Min of the data (last two rows) to USL and LSL respectively: if Max > USL or Min < LSL, highlight the corresponding data points red; if Max == USL or Min == LSL, highlight them yellow; otherwise everything green.
I tried this:
highlight = np.where(df.loc['Max']>df.loc['USL'], 'background-color: red', '')
df.style.apply(lambda _: highlight)
but I get the error:
ValueError: Function <function <lambda> at 0x7fb681b601f0> created invalid index labels.
Usually, this is the result of the function returning a Series which contains invalid labels, or returning an incorrectly shaped, list-like object which cannot be mapped to labels, possibly due to applying the function along the wrong axis.
Result index has shape: (5,)
Expected index shape: (10,)
Out[58]:
<pandas.io.formats.style.Styler at 0x7fb681b52e20>
Your attempt fails because df.style.apply defaults to axis=0, so it passes each full column (length 10) to the function and expects a result of the same shape, while your highlight array has length 5. Use a custom function with axis=None to create a DataFrame of styles based on the conditions:
# changed data for test
print(df)
A-A A-B A-C A-D
Tg 0.37 10.24 5.02 0.63
USL 0.39 10.26 5.04 0.65
LSL 0.33 0.22 5.00 10.63
1 0.35 10.23 5.05 0.65
2 0.36 10.19 5.07 0.67
3 0.34 10.25 5.03 0.66
4 0.35 10.20 5.08 0.69
5 0.33 10.17 5.05 0.62
Max 0.36 10.25 5.08 0.69
Min 0.33 10.17 5.03 0.62
def hightlight(x):
    c1 = 'background-color:red'
    c2 = 'background-color:yellow'
    c3 = 'background-color:green'
    # if values of index are strings
    r = list('12345')
    # if values of index are integers
    # r = [1, 2, 3, 4, 5]
    m1 = (x.loc['Max'] > x.loc['USL']) | (x.loc['Min'] < x.loc['LSL'])
    print(m1)
    m2 = (x.loc['Max'] == x.loc['USL']) | (x.loc['Min'] == x.loc['LSL'])
    print(m2)
    # DataFrame with same index and column names as the original, filled with empty strings
    df1 = pd.DataFrame('', index=x.index, columns=x.columns)
    # modify values of df1 rows by boolean mask
    df1.loc[r, :] = np.select([m1, m2], [c1, c2], default=c3)
    return df1

df.style.apply(hightlight, axis=None)
EDIT: To compare rows 1-5 as well as Min/Max, use:
def hightlight(x):
    c1 = 'background-color:red'
    c2 = 'background-color:yellow'
    c3 = 'background-color:green'
    # if values of index are strings
    r = list('12345')
    # if values of index are integers
    # r = [1, 2, 3, 4, 5]
    r += ['Max', 'Min']
    m1 = (x.loc[r] > x.loc['USL']) | (x.loc[r] < x.loc['LSL'])
    m2 = (x.loc[r] == x.loc['USL']) | (x.loc[r] == x.loc['LSL'])
    # DataFrame with same index and column names as the original, filled with empty strings
    df1 = pd.DataFrame('', index=x.index, columns=x.columns)
    # modify values of df1 rows by boolean mask
    df1.loc[r, :] = np.select([m1, m2], [c1, c2], default=c3)
    return df1

df.style.apply(hightlight, axis=None)
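Outside a notebook (where the Styler renders itself), the result can be exported; a minimal sketch, assuming openpyxl is installed:
styled = df.style.apply(hightlight, axis=None)
styled.to_excel("highlighted.xlsx", engine="openpyxl")  # background colors become cell fills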

Swap and group column names in a pandas DataFrame

I have a data frame with some quantitative data and one qualitative column. I would like to use describe to compute stats grouped by the qualitative column, but I do not obtain the order I want for the levels. Here is an example:
df = pd.DataFrame({k: np.random.random(10) for k in "ABC"})
df["qual"] = 5 * ["init"] + 5 * ["final"]
The DataFrame looks like:
A B C qual
0 0.298217 0.675818 0.076533 init
1 0.015442 0.264924 0.624483 init
2 0.096961 0.702419 0.027134 init
3 0.481312 0.910477 0.796395 init
4 0.166774 0.319054 0.645250 init
5 0.609148 0.697818 0.151092 final
6 0.715744 0.067429 0.761562 final
7 0.748201 0.803647 0.482738 final
8 0.098323 0.614257 0.232904 final
9 0.033003 0.590819 0.943126 final
Now I would like to group by the qual column and compute statistical descriptors using describe. I did the following:
ddf = df.groupby("qual").describe().transpose()
ddf.unstack(level=0)
And I got
qual final init
A B C A B C
count 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000
mean 0.440884 0.554794 0.514284 0.211741 0.574539 0.433959
std 0.347138 0.284931 0.338057 0.182946 0.274135 0.355515
min 0.033003 0.067429 0.151092 0.015442 0.264924 0.027134
25% 0.098323 0.590819 0.232904 0.096961 0.319054 0.076533
50% 0.609148 0.614257 0.482738 0.166774 0.675818 0.624483
75% 0.715744 0.697818 0.761562 0.298217 0.702419 0.645250
max 0.748201 0.803647 0.943126 0.481312 0.910477 0.796395
I am close to what I want but I would like to swap and group the column index such as:
A B C
qual initial final initial final initial final
Is there a way to do it?
Use columns.swaplevel and then sort_index by level=0 and axis='columns':
ddf = df.groupby('qual').describe().T.unstack(level=0)
ddf.columns = ddf.columns.swaplevel(0,1)
ddf = ddf.sort_index(level=0, axis='columns')
Or in one line using DataFrame.swaplevel instead of index.swaplevel:
ddf = ddf.swaplevel(0,1, axis=1).sort_index(level=0, axis='columns')
A B C
qual final init final init final init
count 5.00 5.00 5.00 5.00 5.00 5.00
mean 0.44 0.21 0.55 0.57 0.51 0.43
std 0.35 0.18 0.28 0.27 0.34 0.36
min 0.03 0.02 0.07 0.26 0.15 0.03
25% 0.10 0.10 0.59 0.32 0.23 0.08
50% 0.61 0.17 0.61 0.68 0.48 0.62
75% 0.72 0.30 0.70 0.70 0.76 0.65
max 0.75 0.48 0.80 0.91 0.94 0.80
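Note that sort_index orders the qual level alphabetically, which is why final lands before init. To get the order from the question (init before final), one option is to reindex that level explicitly:
ddf = ddf.reindex(['init', 'final'], axis=1, level=1)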
Try ddf.stack().unstack(level=[0,2]) in place of ddf.unstack(level=0).

Convert elements in masked astropy Table to np.nan

Consider the simple process of reading a data file with some non-valid entries. This is my test.dat file:
16 1035.22 1041.09 24.54 0.30 1.39 0.30 1.80 0.30 2.26 0.30 1.14 0.30 0.28 0.30 0.2884
127 824.57 1105.52 25.02 0.29 0.87 0.29 1.30 0.29 2.12 0.29 0.66 0.29 0.10 0.29 0.2986
182 1015.83 904.93 INDEF 0.28 1.80 0.28 1.64 0.28 2.38 0.28 1.04 0.28 0.06 0.28 0.3271
185 1019.15 1155.09 24.31 0.28 1.40 0.28 1.78 0.28 2.10 0.28 0.87 0.28 0.35 0.28 0.3290
192 1024.80 1045.57 24.27 0.27 1.24 0.27 2.01 0.27 2.40 0.27 0.90 0.27 0.09 0.27 0.3328
197 1035.99 876.04 24.10 0.27 1.23 0.27 1.52 0.27 2.59 0.27 0.45 0.27 0.25 0.27 0.3357
198 1110.80 1087.97 24.53 0.27 1.49 0.27 1.71 0.27 2.33 0.27 0.22 0.27 0.00 0.27 0.3362
1103 1168.39 1065.97 24.35 0.27 1.28 0.27 1.29 0.27 2.68 0.27 0.43 0.27 0.26 0.27 0.3388
And this is the code to read it and replace the "bad" values (INDEF) with a float (99.999):
import numpy as np
from astropy.io import ascii
data = ascii.read("test.dat", fill_values=[('INDEF', '0')])
data = data.filled(99.999)
This works just fine, but if I instead try to replace the bad values with a np.nan (i.e., I use the line data = data.filled(np.nan)) I get:
ValueError: cannot convert float NaN to integer
why is this and how can I get around it?
As noted, the issue is that the numpy MaskedArray.filled() method seems to try converting the fill value to the appropriate type before checking if there is actually anything to fill. Since the table in the example has an int column, this fails within numpy (and astropy.Table is just calling the filled() method on each column).
This should work:
In [44]: def fill_cols(tbl, fill=np.nan, kind='f'):
    ...:     """
    ...:     In-place fill of ``tbl`` columns which have dtype ``kind``
    ...:     with ``fill`` value.
    ...:     """
    ...:     for col in tbl.itercols():
    ...:         if col.dtype.kind == kind:
    ...:             col[...] = col.filled(fill)
    ...:
In [45]: t = simple_table(masked=True)  # simple_table comes from astropy.table.table_helpers
In [46]: t
Out[46]:
<Table masked=True length=3>
a b c
int64 float64 str1
----- ------- ----
-- 1.0 c
2 2.0 --
3 -- e
In [47]: fill_cols(t)
In [48]: t
Out[48]:
<Table masked=True length=3>
a b c
int64 float64 str1
----- ------- ----
-- 1.0 c
2 2.0 --
3 nan e
I don't think it's primarily a numpy problem, as it works with individual columns:
>>> data['col4'].filled(np.nan)
<Column name='col4' dtype='float64' length=8>
24.54
25.02
nan
24.31
24.27
24.1
24.53
24.35
but you still can't construct a Table from this -
Table([data[n].filled(np.nan) for n in data.colnames])
raises the same error in np.ma.core.
You can explicitly set
data['col4'] = data['col4'].filled(np.nan)
but this apparently makes the table lose its .filled() method...
I am not that familiar with masked arrays and tables, but as you've already filed a related issue on GitHub, you might want to add this problem there.
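A workaround for the root cause, assuming (as in the sample file) all columns are numeric: cast the integer columns to float first, so that nan is representable, and then filled(np.nan) succeeds:
for name in data.colnames:
    if data[name].dtype.kind == 'i':
        # nan cannot be stored in an int column, so promote to float
        data[name] = data[name].astype(float)
data = data.filled(np.nan)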
This is happening fairly deep in numpy, in numpy.ma.filled: fill values have to be scalars, basically.
A messy solution that fills with nan's and still returns a table could look like:
import numpy as np
from astropy.io import ascii
from astropy.table import Table
def fill_with_nan(t):
    # round-trip through a plain list so masked entries become None,
    # replace those with nan, then rebuild the table
    arr = t.as_array()
    arr_list = arr.tolist()
    arr = np.array(arr_list)
    arr[np.equal(arr, None)] = np.nan
    arr = np.array(arr.tolist())
    return Table(arr)
data = ascii.read("test.dat", fill_values=[('INDEF', '0')])
data = fill_with_nan(data)
Cut out the middleman? fill_values=[('INDEF', np.nan)] seems to work.
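Spelled out as code (a sketch of that suggestion; the comment above only notes that it seems to work):
import numpy as np
from astropy.io import ascii

# substitute INDEF with nan at read time, so the column parses as float
data = ascii.read("test.dat", fill_values=[('INDEF', np.nan)])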

python matplotlib plotfile: explicitly use floating-point numbers

I have a simple data file to plot.
Here are the contents of the data file, which I named "ttry":
0.27 0
0.28 0
0.29 0
0.3 0
0.31 0
0.32 0
0.33 0
0.34 0
0.35 0
0.36 0
0.37 0
0.38 0.00728737997257
0.39 0.0600137174211
0.4 0.11488340192
0.41 0.157321673525
0.42 0.193158436214
0.43 0.233882030178
0.44 0.273319615912
0.45 0.311556927298
0.46 0.349879972565
0.47 0.387602880658
0.48 0.424211248285
0.49 0.460390946502
0.5 0.494855967078
0.51 0.529406721536
0.52 0.561814128944
0.53 0.594307270233
0.54 0.624228395062
0.55 0.654492455418
0.56 0.683984910837
0.57 0.711762688615
0.58 0.739368998628
0.59 0.765775034294
0.6 0.790895061728
0.61 0.815586419753
0.62 0.840192043896
0.63 0.863082990398
0.64 0.886231138546
0.65 0.906292866941
0.66 0.915809327846
0.67 0.911436899863
0.68 0.908179012346
0.69 0.904749657064
0.7 0.899519890261
0.71 0.895147462277
0.72 0.891632373114
0.73 0.888803155007
0.74 0.884687928669
0.75 0.879029492455
0.76 0.876114540466
0.77 0.872170781893
0.78 0.867541152263
0.79 0.86274005487
0.8 0.858367626886
0.81 0.854080932785
0.82 0.850994513032
0.83 0.997170781893
0.84 1.13477366255
0.85 1.24296982167
0.86 1.32690329218
0.87 1.40397805213
0.88 1.46836419753
0.89 1.52306241427
0.9 1.53232167353
0.91 1.52906378601
0.92 1.52211934156
0.93 1.516718107
0.94 1.51543209877
0.95 1.50660150892
0.96 1.50137174211
0.97 1.49408436214
0.98 1.48816872428
0.99 1.48088134431
1 1.4723079561
And then I use matplotlib.pyplot.plotfile to plot it. Here is my Python script:
from matplotlib import pyplot
pyplot.plotfile("ttry", cols=(0,1), delimiter=" ")
pyplot.show()
However, the following error appears:
C:\WINDOWS\system32\cmd.exe /c ttry.py
Traceback (most recent call last):
File "E:\research\ttry.py", line 2, in <module>
pyplot.plotfile("ttry",col=(0,1),delimiter=" ")
File "C:\Python33\lib\site-packages\matplotlib\pyplot.py", line 2311, in plotfile
checkrows=checkrows, delimiter=delimiter, names=names)
File "C:\Python33\lib\site-packages\matplotlib\mlab.py", line 2163, in csv2rec
rows.append([func(name, val) for func, name, val in zip(converters, names, row)])
File "C:\Python33\lib\site-packages\matplotlib\mlab.py", line 2163, in <listcomp>
rows.append([func(name, val) for func, name, val in zip(converters, names, row)])
File "C:\Python33\lib\site-packages\matplotlib\mlab.py", line 2031, in newfunc
return func(val)
ValueError: invalid literal for int() with base 10: '0.00728737997257'
shell returned 1
Hit any key to close this window...
Obviously, Python just treats the y-axis data as int. So how do I tell Python that the y-axis data are floats?
plotfile infers an int type for your second column from the first few values, which are all ints. To make it check all rows, add checkrows=0 to the arguments, that is:
pyplot.plotfile("ttry", cols=(0,1), delimiter=" ", checkrows = 0)
It's an argument coming from matplotlib.mlab.csv2rec; see its documentation for more info.
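Note that pyplot.plotfile was deprecated and later removed from Matplotlib, so on current versions an equivalent approach is numpy.loadtxt, which parses every column as float and sidesteps the type inference entirely:
import numpy as np
import matplotlib.pyplot as plt

x, y = np.loadtxt("ttry", unpack=True)  # loadtxt defaults to float
plt.plot(x, y)
plt.show()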
