Transposing in a specific order using Pandas - python

I have a df that looks like this:
EmpID Obs_ID Component Rating avg_2a avg_2d avg_3b avg_3c avg_3d
01301 13943 2a 3 3.0 3.0 1.0 1.33 1.0
01301 13944 2a 3 3.0 3.0 1.0 1.33 1.0
01301 13945 2a 3 3.0 3.0 1.0 1.33 1.0
01301 13945 2d 3 3.0 3.0 1.0 1.33 1.0
01301 13945 3b 3 3.0 3.0 1.0 1.33 1.0
And I need it to look like this:
EmpID comp_2a_obs_1 comp_2a_obs_2 comp_2a_obs_3 comp_2d_obs_1 ... ... comp_2a_avg comp_2d_avg comp_3b_avg comp_3c_avg comp_3d_avg
01301 3 3 3 3 ... ... 3.0 3.0 1.0 1.33 1.0
Where the obs order (obs_1, obs_2, obs_3) is based on Obs_ID (smallest to largest), and the values of the comp_2a_obs_1, etc. columns are drawn from Rating.
Can I do this with pd.Transpose or would I need a for loop or something else?
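This isn't a straight transpose, since a per-employee observation counter has to be built first. A sketch of one way to do it with groupby().cumcount() plus pivot_table, on a trimmed copy of the sample data (the helper column name obs_n is my own, and only avg_2a is carried through for brevity):

```python
import pandas as pd

df = pd.DataFrame({'EmpID': ['01301'] * 5,
                   'Obs_ID': [13943, 13944, 13945, 13945, 13945],
                   'Component': ['2a', '2a', '2a', '2d', '3b'],
                   'Rating': [3, 3, 3, 3, 3],
                   'avg_2a': [3.0] * 5})  # other avg_* columns trimmed

# Number each employee's observations per component, smallest Obs_ID first
df = df.sort_values('Obs_ID')
df['obs_n'] = df.groupby(['EmpID', 'Component']).cumcount() + 1

# Pivot Rating out into one column per (Component, obs_n) pair
wide = df.pivot_table(index='EmpID', columns=['Component', 'obs_n'],
                      values='Rating')
wide.columns = [f'comp_{c}_obs_{n}' for c, n in wide.columns]

# The avg_* columns are constant per employee, so take the first value
avgs = df.groupby('EmpID')[['avg_2a']].first()
avgs.columns = ['comp_2a_avg']
result = wide.join(avgs).reset_index()
```

Components with fewer observations simply produce fewer comp_*_obs_* columns (or NaN cells once several employees are mixed), so no explicit for loop is needed.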

How to get "true" decimal place precision with pandas round?

What am I missing? I tried appending .round(3) to the end of the API call, but it doesn't work, and it also doesn't work as a separate call. The data type of every column is numpy.float32.
>>> summary_data = api._get_data(units=list(units.id),
downsample=downsample,
table='summary_tb',
db=db).astype(np.float32)
>>> summary_data.head()
id asset_id cycle hs alt Mach TRA T2
0 10.0 1.0 1.0 1.0 3081.0 0.37945 70.399887 522.302124
1 20.0 1.0 1.0 1.0 3153.0 0.38449 70.575668 522.428162
2 30.0 1.0 1.0 1.0 3229.0 0.39079 70.575668 522.645020
3 40.0 1.0 1.0 1.0 3305.0 0.39438 70.575668 522.651184
4 50.0 1.0 1.0 1.0 3393.0 0.39690 70.663559 522.530090
>>> summary_data = summary_data.round(3)
>>> summary_data.head()
id asset_id cycle hs alt Mach TRA T2
0 10.0 1.0 1.0 1.0 3081.0 0.379 70.400002 522.302002
1 20.0 1.0 1.0 1.0 3153.0 0.384 70.575996 522.427979
2 30.0 1.0 1.0 1.0 3229.0 0.391 70.575996 522.645020
3 40.0 1.0 1.0 1.0 3305.0 0.394 70.575996 522.651001
4 50.0 1.0 1.0 1.0 3393.0 0.397 70.664001 522.530029
>>> print(type(summary_data))
pandas.core.frame.DataFrame
>>> print([type(summary_data[col][0]) for col in summary_data.columns])
[numpy.float32,
numpy.float32,
numpy.float32,
numpy.float32,
numpy.float32,
numpy.float32,
numpy.float32,
numpy.float32]
Some form of rounding is clearly taking place, but the stored values are not what I expect. Thanks in advance.
EDIT
The point of this is to use 32-bit floating-point numbers, not 64-bit. I have since used pd.set_option('precision', 3), but according to the documentation this only affects the display, not the underlying value. As mentioned in a comment below, I am trying to minimize the number of atomic operations: calculations on 70.575996 vs 70.576 are more expensive, and that is the issue I am trying to tackle. Thanks in advance.
Hmm, this might be a floating-point issue. You could change the dtype to float instead of np.float32:
>>> summary_data.astype(float).round(3)
id asset_id cycle hs alt Mach TRA T2
0 10.0 1.0 1.0 1.0 3081.0 0.379 70.400 522.302
1 20.0 1.0 1.0 1.0 3153.0 0.384 70.576 522.428
2 30.0 1.0 1.0 1.0 3229.0 0.391 70.576 522.645
3 40.0 1.0 1.0 1.0 3305.0 0.394 70.576 522.651
4 50.0 1.0 1.0 1.0 3393.0 0.397 70.664 522.530
If you change it back to np.float32 afterwards, it re-exhibits the issue:
>>> summary_data.astype(float).round(3).astype(np.float32)
id asset_id cycle hs alt Mach TRA T2
0 10.0 1.0 1.0 1.0 3081.0 0.379 70.400002 522.302002
1 20.0 1.0 1.0 1.0 3153.0 0.384 70.575996 522.427979
2 30.0 1.0 1.0 1.0 3229.0 0.391 70.575996 522.645020
3 40.0 1.0 1.0 1.0 3305.0 0.394 70.575996 522.651001
4 50.0 1.0 1.0 1.0 3393.0 0.397 70.664001 522.530029
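The underlying cause (my reading, not stated in the thread) is that most decimal fractions have no exact binary representation, so a float32 can only store the nearest representable value; 70.576 is stored as roughly 70.575996 no matter how you round. A small sketch of this:

```python
import numpy as np

# 70.576 has no exact binary representation; the nearest float32
# is slightly smaller, which is what displays as 70.575996.
x = np.float32(70.576)
assert float(x) != 70.576               # stored value is only approximate
assert abs(float(x) - 70.576) < 1e-5    # but within float32 precision

# Rounding in float64 and casting back lands on the same nearest
# float32, so the trailing digits reappear -- it is not a rounding bug:
assert np.float32(np.float64(70.576).round(3)) == x
```

In other words, round(3) is working; float32 simply cannot hold the exact rounded decimal, so the choice is between float64 exactness and float32 storage.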

python sum values in df1 if loc in df2 is True

I'm still quite new to Python, and after looking intensively here on SO, I've decided to just ask.
I have a DataFrame, df
df
NO2 NO2 NO3 DK1 DK2
0 1.0 3.0 2.0 1.0 1.0
1 1.0 3.0 3.0 3.0 1.0
2 2.0 2.0 2.0 1.0 1.0
Now, what I want to do is sum up all values in row 0 that are equal to the value in column "DK1" (incl. itself) and return it in a new column.
Then after doing that for row 0, the same procedure for row 1, then row 2, etc.
Desired output:
df2
NO2 NO2 NO3 DK1 DK2 Sum
0 1.0 3.0 2.0 1.0 1.0 3.0
1 1.0 3.0 3.0 3.0 1.0 9.0
2 2.0 2.0 2.0 1.0 1.0 2.0
Compare all values against the DK1 column, then multiply the resulting boolean mask by that column, and finally sum per row:
df['sum'] = df.eq(df['DK1'], axis=0).mul(df['DK1'], axis=0).sum(axis=1)
print (df)
NO2 NO2.1 NO3 DK1 DK2 sum
0 1.0 3.0 2.0 1.0 1.0 3.0
1 1.0 3.0 3.0 3.0 1.0 9.0
2 2.0 2.0 2.0 1.0 1.0 2.0
Details:
print (df.eq(df['DK1'], axis=0))
NO2 NO2.1 NO3 DK1 DK2
0 True False False True True
1 False True True True False
2 False False False True True
print (df.eq(df['DK1'], axis=0).mul(df['DK1'], axis=0))
NO2 NO2.1 NO3 DK1 DK2
0 1.0 0.0 0.0 1.0 1.0
1 0.0 3.0 3.0 3.0 0.0
2 0.0 0.0 0.0 1.0 1.0
#jezrael, I didn't know how to put this in a comment
NO1 NO2 ... DK1 DK2 sum
0 28.4 28.4 ... 21.0 21.0 2121
1 28.2 28.2 ... 25.1 25.1 25,125,125,125,125,125,125,125,125,1
2 28.0 28.0 ... 25.1 25.1 25,125,125,125,125,125,125,125,125,1
3 28.0 28.0 ... 16.0 16.0 1616
4 28.0 28.0 ... 16.4 16.4 16,416,4
Naturally, my actual dataset is not as simple as the one I started out with - these are my actual values, and the result that I get. Does that help?
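The concatenated "sum" output is a strong hint that these columns hold strings with decimal commas, so eq compares text and mul/sum concatenate instead of adding. A hedged sketch of a fix, converting to real floats first (the sample values here are my own, mimicking the output above):

```python
import pandas as pd

# Strings with decimal commas, as the concatenated sums suggest
df = pd.DataFrame({'NO1': ['28,4', '28,2'],
                   'DK1': ['21,0', '25,1'],
                   'DK2': ['21,0', '25,1']})

# Convert every column to float before doing any arithmetic
df = df.apply(lambda s: s.str.replace(',', '.', regex=False).astype(float))

# Now the original eq/mul/sum approach adds numbers instead of text
df['sum'] = df.eq(df['DK1'], axis=0).mul(df['DK1'], axis=0).sum(axis=1)
```

With numeric columns, row 0 sums the two matching 21.0 values to 42.0 rather than gluing "21,0" strings together.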

Subtract column over column avoiding a string contains in pandas Dataframe

I have the following dataframe, with cumulative results quarter by quarter that reset at each 1°Q.
I need the quarterly net variation, so I need to subtract each column from the previous one, except the 1°Q columns.
from pandas import DataFrame

data = {'Financials': ['EPS', 'Earnings', 'Sales', 'Margin'],
        '1°Q19': [1, 2, 3, 4],
        '2°Q19': [2, 4, 6, 8],
        '3°Q19': [3, 6, 9, 12],
        '4°Q19': [4, 8, 12, 16],
        '1°Q20': [1, 2, 3, 4],
        '2°Q20': [2, 4, 6, 8],
        '3°Q20': [3, 6, 9, 12],
        '4°Q20': [4, 8, 12, 16]}
df = DataFrame(data, columns=['Financials', '1°Q19', '2°Q19', '3°Q19', '4°Q19',
                              '1°Q20', '2°Q20', '3°Q20', '4°Q20'])
print(df)
Financials 1°Q19 2°Q19 3°Q19 4°Q19 1°Q20 2°Q20 3°Q20 4°Q20
0 EPS 1 2 3 4 1 2 3 4
1 Earnings 2 4 6 8 2 4 6 8
2 Sales 3 6 9 12 3 6 9 12
3 Margin 4 8 12 16 4 8 12 16
I've started like this and then I got stuck big time:
if ~df.columns.str.contains('1°Q'):
# here I want to substract (1°Q remains unchanged), 2°Q - 1°Q, 3°Q - 2°Q, 4°Q - 3°Q
In order to get this desired result:
Financials 1°Q19 2°Q19 3°Q19 4°Q19 1°Q20 2°Q20 3°Q20 4°Q20
0 EPS 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
1 Earnings 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0
2 Sales 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
3 Margin 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0
I've tried
new_df = df.diff(axis=1).fillna(df)
print(new_df)
But the result in this case is not the desired one for the 1°Q20 column:
Financials 1°Q19 2°Q19 3°Q19 4°Q19 1°Q20 2°Q20 3°Q20 4°Q20
0 EPS 1.0 1.0 1.0 1.0 -3.0 1.0 1.0 1.0
1 Earnings 2.0 2.0 2.0 2.0 -6.0 2.0 2.0 2.0
2 Sales 3.0 3.0 3.0 3.0 -9.0 3.0 3.0 3.0
3 Margin 4.0 4.0 4.0 4.0 -12.0 4.0 4.0 4.0
IIUC, use DataFrame.diff with axis=1 and then fill the NaN with DataFrame.fillna:
new_df = df.diff(axis=1).fillna(df)
print(new_df)
Financials 1°Q 2°Q 3°Q 4°Q
0 EPS 1.0 1.0 1.0 1.0
1 Earnings 2.0 2.0 2.0 2.0
2 Sales 3.0 3.0 3.0 3.0
3 Margin 4.0 4.0 4.0 4.0
For the expected output as integers:
new_df = new_df.astype(int)
EDIT
df.groupby(df.columns.str.contains('1°Q').cumsum(),axis=1).diff(axis=1).fillna(df)
Financials 1°Q19 2°Q19 3°Q19 4°Q19 1°Q20 2°Q20 3°Q20 4°Q20
0 EPS 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
1 Earnings 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0
2 Sales 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
3 Margin 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0
or
df.diff(axis=1).T.mask(df.columns.to_series().str.contains('1°Q')).T.fillna(df)
You can leverage df.shift for the subtraction, and fillna to fix the NaN values left by the shift:
df=df.set_index('Financials')
df-(df.shift(1, axis=1).fillna(0))
1°Q 2°Q 3°Q 4°Q
Financials
EPS 1.0 1.0 1.0 1.0
Earnings 2.0 2.0 2.0 2.0
Sales 3.0 3.0 3.0 3.0
Margin 4.0 4.0 4.0 4.0
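The reset-at-1°Q diff can also be sketched without a column-wise groupby (my own variant, not from the answers above): take a plain diff along the columns, then put the original values back into every 1°Q column, since each year restarts there.

```python
import pandas as pd

df = pd.DataFrame({'Financials': ['EPS', 'Earnings', 'Sales', 'Margin'],
                   '1°Q19': [1, 2, 3, 4], '2°Q19': [2, 4, 6, 8],
                   '3°Q19': [3, 6, 9, 12], '4°Q19': [4, 8, 12, 16],
                   '1°Q20': [1, 2, 3, 4], '2°Q20': [2, 4, 6, 8],
                   '3°Q20': [3, 6, 9, 12], '4°Q20': [4, 8, 12, 16]})

qcols = [c for c in df.columns if c != 'Financials']
out = df.copy()
out[qcols] = df[qcols].diff(axis=1)

# Each year restarts at 1°Q, so restore the original 1°Q values
first_q = [c for c in qcols if c.startswith('1°Q')]
out[first_q] = df[first_q]
```

This avoids diffing 1°Q20 against 4°Q19, which is exactly where the plain diff(axis=1).fillna(df) attempt went wrong.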

Using fillna to replace missing data

When I am trying to use fillna to replace NaNs in a column with its mean, the column changes from float64 to object, and the filled cells show something like:
bound method Series.mean of 0 NaN\n1
Here is the code:
mean = df['texture_mean'].mean
df['texture_mean'] = df['texture_mean'].fillna(mean)
You cannot use mean = df['texture_mean'].mean: without parentheses this assigns the bound method itself rather than the computed mean, and fillna then stuffs that method object into the NaN cells, turning the column into object dtype. Call it as df['texture_mean'].mean() instead. The following code will work -
df=pd.DataFrame({'texture_mean':[2,4,None,6,1,None],'A':[1,2,3,4,5,None]}) # Example
df
A texture_mean
0 1.0 2.0
1 2.0 4.0
2 3.0 NaN
3 4.0 6.0
4 5.0 1.0
5 NaN NaN
df['texture_mean']=df['texture_mean'].fillna(df['texture_mean'].mean())
df
A texture_mean
0 1.0 2.00
1 2.0 4.00
2 3.0 3.25
3 4.0 6.00
4 5.0 1.00
5 NaN 3.25
In case you want to replace all the NaNs with the respective means of that column in all columns, then just do this -
df=df.fillna(df.mean())
df
A texture_mean
0 1.0 2.00
1 2.0 4.00
2 3.0 3.25
3 4.0 6.00
4 5.0 1.00
5 3.0 3.25
Let me know if this is what you want.

Pandas fillna() not working as expected

I'm trying to replace NaN values in my dataframe with means from the same row.
sample_df = pd.DataFrame({'A':[1.0,np.nan,5.0],
'B':[1.0,4.0,5.0],
'C':[1.0,1.0,4.0],
'D':[6.0,5.0,5.0],
'E':[1.0,1.0,4.0],
'F':[1.0,np.nan,4.0]})
sample_mean = sample_df.apply(lambda x: np.mean(x.dropna().values.tolist()) ,axis=1)
Produces:
0 1.833333
1 2.750000
2 4.500000
dtype: float64
But when I try to use fillna() to fill the missing dataframe values with values from the series, it doesn't seem to work.
sample_df.fillna(sample_mean, inplace=True)
A B C D E F
0 1.0 1.0 1.0 6.0 1.0 1.0
1 NaN 4.0 1.0 5.0 1.0 NaN
2 5.0 5.0 4.0 5.0 4.0 4.0
What I expect is:
A B C D E F
0 1.0 1.0 1.0 6.0 1.0 1.0
1 2.75 4.0 1.0 5.0 1.0 2.75
2 5.0 5.0 4.0 5.0 4.0 4.0
I've reviewed the other similar questions and can't seem to uncover the issue. Thanks in advance for your help.
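The likely cause (my reading of the pandas docs, not stated in the thread): when fillna is given a Series, its index is aligned against the DataFrame's column labels. Here the row-means Series is indexed 0/1/2, which never matches columns A..F, so nothing gets filled. A sketch:

```python
import numpy as np
import pandas as pd

sample_df = pd.DataFrame({'A': [1.0, np.nan, 5.0],
                          'B': [1.0, 4.0, 5.0],
                          'C': [1.0, 1.0, 4.0],
                          'D': [6.0, 5.0, 5.0],
                          'E': [1.0, 1.0, 4.0],
                          'F': [1.0, np.nan, 4.0]})
row_means = sample_df.mean(axis=1)  # indexed 0, 1, 2

# Series index 0/1/2 never matches column labels A..F -> nothing filled
unfilled = sample_df.fillna(row_means)
assert unfilled.isna().sum().sum() == 2

# Transposing first makes the row labels the columns, so they align
filled = sample_df.T.fillna(row_means).T
```

That alignment rule is why the transpose-based answers below work: after .T the Series index matches the (now) column labels.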
By using pandas - transpose so the rows become columns, fill, and transpose back:
sample_df.T.fillna(sample_df.T.mean()).T
Out[1284]:
A B C D E F
0 1.00 1.0 1.0 6.0 1.0 1.00
1 2.75 4.0 1.0 5.0 1.0 2.75
2 5.00 5.0 4.0 5.0 4.0 4.00
Here's one way -
sample_df[:] = np.where(np.isnan(sample_df), sample_df.mean(1)[:,None], sample_df)
Sample output -
sample_df
Out[61]:
A B C D E F
0 1.00 1.0 1.0 6.0 1.0 1.00
1 2.75 4.0 1.0 5.0 1.0 2.75
2 5.00 5.0 4.0 5.0 4.0 4.00
Another pandas way:
>>> sample_df.where(pd.notnull(sample_df), sample_df.mean(axis=1), axis='rows')
A B C D E F
0 1.00 1.0 1.0 6.0 1.0 1.00
1 2.75 4.0 1.0 5.0 1.0 2.75
2 5.00 5.0 4.0 5.0 4.0 4.00
A where-condition-is-True operation is at work here: where elements of pd.notnull(sample_df) are True, keep the corresponding elements of sample_df; otherwise use the elements from sample_df.mean(axis=1), aligned along axis='rows'.
