Collapse (or combine) multiple columns into two separate columns - python

I have a dataframe as shown below:
8964_real 8964_imag 8965_real 8965_imag 8966_real 8966_imag 8967_real ... 8984_imag 8985_real 8985_imag 8986_real 8986_imag 8987_real 8987_imag
0 112.653120 0.000000 117.104887 0.000000 127.593406 0.000000 129.522106 ... 0.000000 125.423552 0.000000 127.888477 0.000000 136.160979 0.000000
1 -0.315831 16.363974 -2.083329 22.443628 -2.166950 15.026253 0.110502 ... -26.613220 8.454297 -35.000742 11.871405 -24.914035 7.448329 -16.370041
2 -1.863497 10.672129 -6.152232 15.980813 -5.679352 18.976117 -5.775777 ... -11.131600 -18.990022 -9.520732 -11.947319 -4.641286 -17.104710 -5.691642
3 -6.749938 14.870590 -12.222749 15.012352 -10.501423 9.345518 -9.103459 ... -2.860546 -29.862724 -5.237663 -28.791194 -5.685985 -24.565608 -10.385683
4 -2.991405 -10.332938 -4.097638 -10.204587 -12.056221 -5.684882 -12.861357 ... 0.821902 -8.787235 -1.521650 -3.798446 -2.390519 -6.527762 -1.145998
I have to convert the above dataframe so that the values in the "_real" columns end up under one column and the values in the "_imag" columns end up under another.
That is, in the end there should be exactly two columns, one for real and one for imag. What would be the most efficient way to do this?
I referred to this link, but it covers a single column and I need two.
Another idea I had was to use a regex to select the columns containing "real" and proceed as in that link (and similarly for imag), but that felt a bit roundabout.
Any help appreciated.
EDIT:
For example, real should be like
real
112.653120
-0.315831
-1.863497
-6.749938
-2.991405
---------
117.104887
-2.083329
-6.152232
-12.222749
-4.097638
---------
127.593406
-2.166950
-5.679352
-10.501423
-12.056221
I have added the dotted lines only to make the boundaries clear.

Create a MultiIndex by splitting the column names, which makes it possible to reshape with DataFrame.stack:
# split '8964_real' into the two-level MultiIndex ('8964', 'real')
df.columns = df.columns.str.split('_', expand=True)
print(df.head(10))
8964 8965 8966 \
real imag real imag real imag
0 112.653120 0.000000 117.104887 0.000000 127.593406 0.000000
1 -0.315831 16.363974 -2.083329 22.443628 -2.166950 15.026253
2 -1.863497 10.672129 -6.152232 15.980813 -5.679352 18.976117
3 -6.749938 14.870590 -12.222749 15.012352 -10.501423 9.345518
4 -2.991405 -10.332938 -4.097638 -10.204587 -12.056221 -5.684882
8967 8984 8985 8986 \
real imag real imag real imag
0 129.522106 0.000000 125.423552 0.000000 127.888477 0.000000
1 0.110502 -26.613220 8.454297 -35.000742 11.871405 -24.914035
2 -5.775777 -11.131600 -18.990022 -9.520732 -11.947319 -4.641286
3 -9.103459 -2.860546 -29.862724 -5.237663 -28.791194 -5.685985
4 -12.861357 0.821902 -8.787235 -1.521650 -3.798446 -2.390519
8987
real imag
0 136.160979 0.000000
1 7.448329 -16.370041
2 -17.104710 -5.691642
3 -24.565608 -10.385683
4 -6.527762 -1.145998
# move the first column level (the numeric id) into the rows, drop the old
# row labels, and name the remaining index level 'a' before making it a column
df = df.stack(0).reset_index(level=0, drop=True).rename_axis('a').reset_index()
print(df.head(10))
a imag real
0 8964 0.000000 112.653120
1 8965 0.000000 117.104887
2 8966 0.000000 127.593406
3 8967 NaN 129.522106
4 8984 0.000000 NaN
5 8985 0.000000 125.423552
6 8986 0.000000 127.888477
7 8987 0.000000 136.160979
8 8964 16.363974 -0.315831
9 8965 22.443628 -2.083329
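If you only need the two value columns, ordered as in the question's EDIT (all rows of 8964 first, then 8965, ...), here is a small follow-up sketch (my addition, not part of the original answer); the stable sort keeps the original row order within each id:
# sort by the id column, then keep just the two value columns
out = df.sort_values('a', kind='stable')[['real', 'imag']].reset_index(drop=True)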
EDIT: For the new structure of the data, it is possible to reshape the values with ravel:
import numpy as np
import pandas as pd

a = df.filter(like='real')
b = df.filter(like='imag')
c = a.columns.str.replace('_real', '').astype(int)
print(c)
Int64Index([8964, 8965, 8966, 8967, 8985, 8986, 8987], dtype='int64')
# a.T.ravel() walks column by column (all rows of 8964 first, then 8965, ...),
# so each id must be repeated len(df) times to stay aligned with the values
# (np.repeat, not np.tile, which would interleave the ids incorrectly)
df = pd.DataFrame({'r': a.T.to_numpy().ravel(), 'i': b.T.to_numpy().ravel()},
                  index=np.repeat(c, len(df)))
print(df.head(10))
               r          i
8964  112.653120   0.000000
8964   -0.315831  16.363974
8964   -1.863497  10.672129
8964   -6.749938  14.870590
8964   -2.991405 -10.332938
8965  117.104887   0.000000
8965   -2.083329  22.443628
8965   -6.152232  15.980813
8965  -12.222749  15.012352
8965   -4.097638 -10.204587
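As a side note, pd.wide_to_long can produce the same long shape starting again from the original wide frame df (a sketch of my own, not part of the original answer). It expects stub-first names like real_8964, so flip the name parts first:
import pandas as pd

tmp = df.rename(columns=lambda c: '_'.join(c.split('_')[::-1])).reset_index()
out = pd.wide_to_long(tmp, stubnames=['real', 'imag'],
                      i='index', j='id', sep='_').reset_index()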

Related

how to keep same values beyond 2 end points when using make_interp_spline()?

I am using make_interp_spline to do curve fitting. I'd like the curve to hold the endpoint values beyond the min and max of my data. I thought I could use bc_type='clamped' to do that, but the result I got is not correct.
Here is what I have
the data is
df_krow
Sw Krw Krow
0 0.247000 0.000000 1.000000
1 0.281562 0.000006 0.850997
2 0.316125 0.000098 0.716177
3 0.350688 0.000494 0.595057
4 0.385250 0.001563 0.487139
5 0.419813 0.003815 0.391906
6 0.454375 0.007910 0.308816
7 0.488938 0.014655 0.237305
8 0.523500 0.025000 0.176777
9 0.558063 0.040045 0.126603
10 0.592625 0.061035 0.086115
11 0.627188 0.089362 0.054592
12 0.661750 0.126562 0.031250
13 0.696313 0.174323 0.015223
14 0.730875 0.234473 0.005524
15 0.765437 0.308990 0.000977
16 0.800000 0.400000 0.000000
sw_loc_list = [0.2, 0.4, 0.6, 0.8, 0.9, 1.0]
from scipy.interpolate import make_interp_spline
krw_loc_list = make_interp_spline(df_krow['Sw'], df_krow['Krw'], k=3, bc_type='clamped')(sw_loc_list)
After running the above, you can see that for Sw < 0.247 or Sw > 0.8, Krw goes negative. The result I'd like is Krw = 0 when Sw < 0.247 and Krw = 0.4 when Sw > 0.8. How can I do that? Thanks
print(krw_loc_list)
[-4.21675500e-05 2.34342851e-03 6.64037797e-02 4.00000000e-01
-2.73531731e+00 -1.91966201e+01]
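One way to get that behavior (a sketch of my own, not from the original thread): a spline only interpolates inside [Sw.min(), Sw.max()], so clamp the query points to that range before evaluating. Queries below 0.247 then return Krw(0.247) = 0 and queries above 0.8 return Krw(0.8) = 0.4.
import numpy as np
from scipy.interpolate import make_interp_spline

# clamp the queries to the fitted range so out-of-range points
# evaluate to the endpoint values of the spline
sw_query = np.clip(sw_loc_list, df_krow['Sw'].min(), df_krow['Sw'].max())
spline = make_interp_spline(df_krow['Sw'], df_krow['Krw'], k=3)
print(spline(sw_query))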

Pandas DataFrame (long) to Series ("wide")

I have the following DataFrame:
      completeness  homogeneity  label_f1_score  label_precision  label_recall  mean_bbox_iou   mean_iou  \
mean           0.1            1            0.92             0.92          0.92       0.729377   0.784934
std      0.0707107            0       0.0447214        0.0447214     0.0447214      0.0574177  0.0313196

      px_accuracy  px_f1_score     px_iou  px_precision  px_recall     t_eval    v_score
mean     0.843802     0.898138   0.774729      0.998674   0.832576    1.10854        0.1
std     0.0341158    0.0224574  0.0299977   0.000432499  0.0327758  0.0588322  0.0707107
What I would like to obtain is a Series composed of completeness_mean, completeness_std, homogeneity_mean, homogeneity_std, ..., i.e. a label {column}_{index} for every cell.
Does Pandas have a function for this or do I have to iterate over all cells myself to build the desired result?
EDIT: I mean a Series with {column}_{index} as index and the corresponding values from the table.
(I believe this is not a duplicate of the other questions on SO related to wide-to-long reshaping.)
IIUC, unstack and flatten the index:
df2 = df.unstack()
df2.index = df2.index.map('_'.join)
output:
completeness_mean 0.100000
completeness_std 0.070711
homogeneity_mean 1.000000
homogeneity_std 0.000000
label_f1_score_mean 0.920000
label_f1_score_std 0.044721
label_precision_mean 0.920000
label_precision_std 0.044721
label_recall_mean 0.920000
label_recall_std 0.044721
mean_bbox_iou_mean 0.729377
mean_bbox_iou_std 0.057418
mean_iou_mean 0.784934
mean_iou_std 0.031320
px_accuracy_mean 0.843802
px_accuracy_std 0.034116
px_f1_score_mean 0.898138
px_f1_score_std 0.022457
px_iou_mean 0.774729
px_iou_std 0.029998
px_precision_mean 0.998674
px_precision_std 0.000432
px_recall_mean 0.832576
px_recall_std 0.032776
t_eval_mean 1.108540
t_eval_std 0.058832
v_score_mean 0.100000
v_score_std 0.070711
dtype: float64
or with stack for a different order:
df2 = df.stack()
df2.index = df2.swaplevel().index.map('_'.join)
output:
completeness_mean 0.100000
homogeneity_mean 1.000000
label_f1_score_mean 0.920000
label_precision_mean 0.920000
label_recall_mean 0.920000
mean_bbox_iou_mean 0.729377
mean_iou_mean 0.784934
px_accuracy_mean 0.843802
px_f1_score_mean 0.898138
px_iou_mean 0.774729
px_precision_mean 0.998674
px_recall_mean 0.832576
t_eval_mean 1.108540
v_score_mean 0.100000
completeness_std 0.070711
homogeneity_std 0.000000
label_f1_score_std 0.044721
label_precision_std 0.044721
label_recall_std 0.044721
mean_bbox_iou_std 0.057418
mean_iou_std 0.031320
px_accuracy_std 0.034116
px_f1_score_std 0.022457
px_iou_std 0.029998
px_precision_std 0.000432
px_recall_std 0.032776
t_eval_std 0.058832
v_score_std 0.070711
dtype: float64
Is this what you're looking for?
pd.merge(df.columns.to_frame(), df.index.to_frame(), 'cross').apply('_'.join, axis=1)
# OR
pd.Series(df.unstack().index.map('_'.join))
Output:
0 completeness_mean
1 completeness_std
2 homogeneity_mean
3 homogeneity_std
4 label_f1_score_mean
5 label_f1_score_std
6 label_precision_mean
7 label_precision_std
8 label_recall_mean
9 label_recall_std
10 mean_bbox_iou_mean
11 mean_bbox_iou_std
12 mean_iou_mean
13 mean_iou_std
14 px_accuracy_mean
15 px_accuracy_std
16 px_f1_score_mean
17 px_f1_score_std
18 px_iou_mean
19 px_iou_std
20 px_precision_mean
21 px_precision_std
22 px_recall_mean
23 px_recall_std
24 t_eval_mean
25 t_eval_std
26 v_score_mean
27 v_score_std
dtype: object
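The cross-merge above only builds the labels; to pair them with the values (my own extension of this answer, relying on unstack producing the same column-major order as the cross product):
flat = df.unstack()
s = pd.Series(flat.to_numpy(), index=flat.index.map('_'.join))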

Pandas: Get the col name for the min value within each row and get % difference compared to the rest of the columns

I have a dataframe:
LF RF LR RR
11 22 33 44
23 43 23 12
33 23 12 43
What I want to accomplish: for each row, identify which column has the lowest value and express that value as a percentage of the mean of the remaining columns.
For example:
Identify the min value in row 1, which is 11, in column LF. The mean of the remaining columns is (22+33+44)/3 = 33, so the percentage is 11/33 = 0.333.
Expected output:
LF RF LR RR Min_Col dif(%)
11 22 33 44 LF 0.333
23 43 23 12 RR 0.404
33 23 12 43 LR 0.364
a proper way of writing the equation would be:
(min_value)/(sum_rest_of_cols/3)
Note: I need a column that indicates, for each row, which column has the lowest value (this is a program to identify problems, so within the error message we want to be able to tell the user which column is causing them).
EDITED:
My code (df_inter is the original df, which I slice with .loc to get only the columns needed for this calculation):
df_exc = df_inter.loc[:,['LF_Strut_Pressure', 'RF_Strut_Pressure', 'LR_Strut_Pressure' ,'RR_Strut_Pressure']]
df_exc['dif(%)'] = df_exc.min(1) * 3 / (df_exc.sum(1) - df_inter.min(1))
df_exc['Min_Col'] = df_exc.iloc[:, :-1].idxmin(1)
print(df_exc)
My Output:
LF_Strut RF_Strut LR_Strut RR_Strut dif(%) Min_Col
truck_id
EX7057 0.000000 0.000000 0.000000 0.000000 0.0000 LF_Strut
EX7105 0.000000 0.000000 0.000000 0.000000 0.0000 LF_Strut
EX7106 0.000000 0.000000 0.000000 0.000000 0.0000 LF_Strut
EX7107 0.000000 0.000000 0.000000 0.000000 0.0000 LF_Strut
TD6510 36588.000000 36587.000 36587.00000 36587.00 0.8204 RF_Strut
TD6511 36986.000000 36989.000 36987.00000 36989.00 0.8220 LF_Strut
TD6512 27704.000000 27705.000 27702.00000 27705.00 0.7757 LR_Strut
The problem is: when doing the calculation for TD6510, 36587 / ((36587 + 36587 + 36588) / 3) = 0.9999999..., not 0.8204. I tried to work out where 0.8204 came from, but was unsuccessful. Thanks for all the help and support.
Compute the ratio first, then use idxmin to get the column name. (Regarding the 0.8204: your code subtracts df_inter.min(1), the row minimum over all of the original columns, rather than df_exc.min(1); that mismatched denominator is the likely source of the unexpected values.)
df['dif(%)'] = df.min(1) * 3 / (df.sum(1) - df.min(1))
df['Min_Col'] = df.iloc[:, :-1].idxmin(1)
df
LF RF LR RR dif(%) Min_Col
0 11 22 33 44 0.333333 LF
1 23 43 23 12 0.404494 RR
2 33 23 12 43 0.363636 LR
I wrote the data to a file called "textfile.txt". This should be useful:
import pandas as pd

df = pd.read_csv('textfile.txt', sep=' ')
df['min'] = df[['LF', 'RF', 'LR', 'RR']].min(axis=1)
df['sum_3'] = df[['LF', 'RF', 'LR', 'RR']].sum(axis=1) - df['min']
df['sum_3_div3'] = df['sum_3'] / 3
# the two remaining pieces: the ratio and the name of the min column
df['dif(%)'] = df['min'] / df['sum_3_div3']
df['Min_Col'] = df[['LF', 'RF', 'LR', 'RR']].idxmin(axis=1)
You can just do the calculation directly; the min column is given by idxmin:
# find the min in each row
mins = df.min(axis=1)
# compute the mean of the other values
other_means = (df.sum(1) - mins).div(df.shape[1] - 1)
(mins / other_means) * 100
Output:
0 33.333333
1 40.449438
2 36.363636
dtype: float64
Using idxmin and df.mask() with df.isin() and df.min():
final = df.assign(Min_Col=df.idxmin(1),
                  Diff=df.min(1).div(df.mask(df.isin(df.min(1))).mean(1)))
print(final)
LF RF LR RR Min_Col Diff
0 11 22 33 44 LF 0.333333
1 23 43 23 12 RR 0.404494
2 33 23 12 43 LR 0.363636

How to pass a variable column name to a pandas dataframe in python

I have a data set like this:
Id 1456 1457 1458 1459 1460
MSSubClass 60 20 70 20 20
MSZoning RL RL RL RL RL
LotFrontage 62 85 66 68 75
LotArea 7917 13175 9042 9717 9937
Street Pave Pave Pave Pave Pave
Alley NaN NaN NaN NaN NaN
LotShape Reg Reg Reg Reg Reg
LandContour Lvl Lvl Lvl Lvl Lvl
I converted the string columns to pandas categoricals. Now I need to convert them to numerical data. To do that, I take the output of the following:
display_all(df_raw.isnull().sum().sort_index()/len(df_raw))
1stFlrSF 0.000000
2ndFlrSF 0.000000
3SsnPorch 0.000000
Alley 0.937671
BedroomAbvGr 0.000000
BldgType 0.000000
BsmtCond 0.025342
BsmtExposure 0.026027
BsmtFinSF1 0.000000
Wherever the fraction is non-zero, I convert the column to numerical values.
train_cats(df_raw)  # convert strings to pandas categoricals
op1 = df_raw.isnull().sum().sort_index() / len(df_raw)
i = 0
while i < op1.shape[0]:
    if op1[i] != 0.0:
        variable_name = op1.index[i]
        df_raw.variable_name = df_raw.variable_name.cat.codes  # <----
    i += 1
So, written out for the Alley column, the line would be:
df_raw.alley = df_raw.alley.cat.codes
Alley needs to be passed as a variable name.
My question is: how can I pass a variable name there instead of a literal, so that I can loop through the columns? I tried #variable_name but it just gives me errors.
Maybe I am doing this wrong. Would there be a better way of doing this?
Your help would be very much appreciated.
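The usual fix (a sketch of my own, not from the original thread) is bracket indexing: df_raw[col] accepts a column name held in a variable, whereas df_raw.variable_name looks up a literal attribute called variable_name. The dtype guard is my addition, so that numeric columns which merely contain NaNs are skipped:
import pandas as pd

for col, null_frac in op1.items():
    # only convert categorical columns that have missing values
    if null_frac != 0.0 and isinstance(df_raw[col].dtype, pd.CategoricalDtype):
        df_raw[col] = df_raw[col].cat.codes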

Problem to implement count, groupby, np.repeat and agg with pandas

I have a pandas dataframe like this:
df = pd.DataFrame({'x': np.random.rand(61800), 'y':np.random.rand(61800), 'z':np.random.rand(61800)})
I need to process my dataset to get the following result:
extract = df.assign(count=np.repeat(range(10),10)).groupby('count',as_index=False).agg(['mean','min', 'max'])
But if I use np.repeat(range(150), 150), I receive this error:
This doesn't work because the .assign you're performing needs to have enough values to fit the original dataframe:
In [81]: df = pd.DataFrame({'x': np.random.rand(61800), 'y':np.random.rand(61800), 'z':np.random.rand(61800)})
In [82]: df.assign(count=np.repeat(range(10),10))
ValueError: Length of values does not match length of index
In this case, everything works fine if we use 10 groups repeated 6,180 times (61800 rows / 10 groups = 6180):
In [83]: df.assign(count=np.repeat(range(10),6180))
Out[83]:
x y z count
0 0.781364 0.996545 0.756592 0
1 0.609127 0.981688 0.626721 0
2 0.547029 0.167678 0.198857 0
3 0.184405 0.484623 0.219722 0
4 0.451698 0.535085 0.045942 0
... ... ... ... ...
61795 0.783192 0.969306 0.974836 9
61796 0.890720 0.286384 0.744779 9
61797 0.512688 0.945516 0.907192 9
61798 0.526564 0.165620 0.766733 9
61799 0.683092 0.976219 0.524048 9
[61800 rows x 4 columns]
In [84]: extract = df.assign(count=np.repeat(range(10),6180)).groupby('count',as_index=False).agg(['mean','min', 'max'])
In [85]: extract
Out[85]:
x y z
mean min max mean min max mean min max
count
0 0.502338 0.000230 0.999546 0.501603 0.000263 0.999842 0.503807 0.000113 0.999826
1 0.500392 0.000059 0.999979 0.499935 0.000012 0.999767 0.500114 0.000230 0.999811
2 0.498377 0.000023 0.999832 0.496921 0.000003 0.999475 0.502887 0.000028 0.999828
3 0.504970 0.000637 0.999680 0.500943 0.000256 0.999902 0.497370 0.000257 0.999969
4 0.501195 0.000290 0.999992 0.498617 0.000149 0.999779 0.497895 0.000022 0.999877
5 0.499476 0.000186 0.999956 0.503227 0.000308 0.999907 0.504688 0.000100 0.999756
6 0.495488 0.000378 0.999606 0.499893 0.000119 0.999740 0.495924 0.000031 0.999556
7 0.498443 0.000005 0.999417 0.495728 0.000262 0.999972 0.501255 0.000087 0.999978
8 0.494110 0.000014 0.999888 0.495197 0.000074 0.999970 0.493215 0.000166 0.999718
9 0.496333 0.000365 0.999307 0.502074 0.000110 0.999856 0.499164 0.000035 0.999927
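For the 150-group case from the question, the same arithmetic applies (a sketch of my own extending this answer): 61800 rows / 150 groups = 412 rows per group, so each label must repeat 412 times.
import numpy as np

# assumes len(df) is exactly divisible by n_groups
n_groups = 150
extract = (df.assign(count=np.repeat(range(n_groups), len(df) // n_groups))
             .groupby('count', as_index=False)
             .agg(['mean', 'min', 'max']))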
