Suppose I have a DataFrame like the one below. How can I add a row of sum() values to it?
df.columns=['value_a','value_b','name','up_or_down','difference']
df
value_a value_b name up_or_down difference
project_name
# sum 27.56 25.04 sum down -1.31
2021-project11 0.43 0.48 2021-project11 up 0.05
2021-project1 0.62 0.56 2021-project1 down -0.06
2021-project2 0.51 0.47 2021-project2 down -0.04
2021-porject3 0.37 0.34 2021-porject3 down -0.03
2021-porject4 0.64 0.61 2021-porject4 down -0.03
2021-project5 0.32 0.25 2021-project5 down -0.07
2021-project6 0.75 0.81 2021-project6 up 0.06
2021-project7 0.60 0.60 2021-project7 down 0.00
2021-project8 0.85 0.74 2021-project8 down -0.11
2021-project10 0.67 0.67 2021-project10 down 0.00
2021-project9 0.73 0.73 2021-project9 down 0.00
2021-project11 0.54 0.54 2021-project11 down 0.00
2021-project12 0.40 0.40 2021-project12 down 0.00
2021-project13 0.76 0.77 2021-project13 up 0.01
2021-project14 1.16 1.28 2021-project14 up 0.12
2021-project15 1.01 0.94 2021-project15 down -0.07
2021-project16 1.23 1.24 2021-project16 up 0.01
2022-project17 0.40 0.36 2022-project17 down -0.04
2022-project_11 0.40 0.40 2022-project_11 down 0.00
2022-project4 1.01 0.80 2022-project4 down -0.21
2022-project1 0.65 0.67 2022-project1 up 0.02
2022-project2 0.75 0.57 2022-project2 down -0.18
2022-porject3 0.32 0.32 2022-porject3 down 0.00
2022-project18 0.91 0.56 2022-project18 down -0.35
2022-project5 0.84 0.89 2022-project5 up 0.05
2022-project19 0.61 0.48 2022-project19 down -0.13
2022-project6 0.77 0.80 2022-project6 up 0.03
2022-project20 0.63 0.54 2022-project20 down -0.09
2022-project8 0.59 0.55 2022-project8 down -0.04
2022-project21 0.58 0.54 2022-project21 down -0.04
2022-project10 0.76 0.76 2022-project10 down 0.00
2022-project9 0.70 0.71 2022-project9 up 0.01
2022-project22 0.62 0.56 2022-project22 down -0.06
2022-project23 2.03 1.74 2022-project23 down -0.29
2022-project12 0.39 0.39 2022-project12 down 0.00
2022-project24 1.35 1.55 2022-project24 up 0.20
project25 0.45 0.42 project25 down -0.03
project26 0.53 NaN project26 down NaN
project27 0.68 NaN project27 down NaN
I tried
df.sum().columns=['value_a_sun','value_b_sum','difference_sum']
I would like to add the sum values below as a row in the DataFrame above:
sum 27.56 25.04 sum down -1.31
But I got AttributeError: 'Series' object has no attribute 'column'. How can I fix this? Thanks so much for any advice.
Select the columns to sum with [] before calling sum(), then assign the result as a new row with DataFrame.loc:
df.loc['sum'] = df[['value_a','value_b','difference']].sum()
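To put the sum in the first row instead (note that ignore_index=True replaces the existing index labels with a default integer index):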
df1 = df[['value_a','value_b','difference']].sum().to_frame().T
df = pd.concat([df1, df], ignore_index=True)
Suppose we have a DataFrame with a sum() row, like the one below (thanks so much to @jezrael for the answer). We have many different DataFrames like this one, each with different columns. Is it possible to wrap these three lines of code in a function?
df.columns=['value_a','value_b','name','up_or_down','difference']
# from here
df.loc['sum'] = df[['value_a','value_b','difference']].sum()
df1 = df[['value_a','value_b','difference']].sum().to_frame().T
df = pd.concat([df1, df], ignore_index=True)
# end here
df
value_a value_b name up_or_down difference
project_name
sum 27.56 25.04 -1.31
2021-project11 0.43 0.48 2021-project11 up 0.05
2021-project1 0.62 0.56 2021-project1 down -0.06
2021-project2 0.51 0.47 2021-project2 down -0.04
2021-porject3 0.37 0.34 2021-porject3 down -0.03
2021-porject4 0.64 0.61 2021-porject4 down -0.03
2021-project5 0.32 0.25 2021-project5 down -0.07
2021-project6 0.75 0.81 2021-project6 up 0.06
2021-project7 0.60 0.60 2021-project7 down 0.00
2021-project8 0.85 0.74 2021-project8 down -0.11
2021-project10 0.67 0.67 2021-project10 down 0.00
2021-project9 0.73 0.73 2021-project9 down 0.00
2021-project11 0.54 0.54 2021-project11 down 0.00
2021-project12 0.40 0.40 2021-project12 down 0.00
2021-project13 0.76 0.77 2021-project13 up 0.01
2021-project14 1.16 1.28 2021-project14 up 0.12
2021-project15 1.01 0.94 2021-project15 down -0.07
2021-project16 1.23 1.24 2021-project16 up 0.01
2022-project17 0.40 0.36 2022-project17 down -0.04
2022-project_11 0.40 0.40 2022-project_11 down 0.00
2022-project4 1.01 0.80 2022-project4 down -0.21
2022-project1 0.65 0.67 2022-project1 up 0.02
2022-project2 0.75 0.57 2022-project2 down -0.18
2022-porject3 0.32 0.32 2022-porject3 down 0.00
2022-project18 0.91 0.56 2022-project18 down -0.35
2022-project5 0.84 0.89 2022-project5 up 0.05
2022-project19 0.61 0.48 2022-project19 down -0.13
2022-project6 0.77 0.80 2022-project6 up 0.03
2022-project20 0.63 0.54 2022-project20 down -0.09
2022-project8 0.59 0.55 2022-project8 down -0.04
2022-project21 0.58 0.54 2022-project21 down -0.04
2022-project10 0.76 0.76 2022-project10 down 0.00
2022-project9 0.70 0.71 2022-project9 up 0.01
2022-project22 0.62 0.56 2022-project22 down -0.06
2022-project23 2.03 1.74 2022-project23 down -0.29
2022-project12 0.39 0.39 2022-project12 down 0.00
2022-project24 1.35 1.55 2022-project24 up 0.20
project25 0.45 0.42 project25 down -0.03
project26 0.53 NaN project26 down NaN
project27 0.68 NaN project27 down NaN
Can I write a function with conditions like the one below, so that our other DataFrames can use it directly?
def sum_handler(x):
    if .......
        return .....
    elif .......
        return .....
    else:
        return .....
Thanks so much for any advice.
You could try a different approach to summing your DataFrame, using numeric_only=True so that only the numeric columns are summed:
df.loc['Total'] = df.sum(numeric_only=True, axis=0)
Since this is a single line of code, there is no need for a custom function. For future reference, though, you can create a custom function and apply it to a DataFrame like this:
import pandas as pd

def double_columns(df: pd.DataFrame, columns: list[str]):
    """ Doubles chosen columns of a dataframe """
    df[columns] = df[columns] * 2
    return df

df = pd.DataFrame({'col1': [1,2], 'col2': [2,3]})
df = double_columns(df, ['col1'])
print(df)
This would print:
col1 col2
0 2 2
1 4 3
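For the original question, a reusable helper might look like the sketch below. The name add_sum_row and its label and first parameters are hypothetical, not part of pandas. The sketch assumes every numeric column should be summed, and it leaves non-numeric columns as NaN in the sum row. Unlike concatenating with ignore_index=True, it keeps the existing index labels.

```python
import pandas as pd


def add_sum_row(df: pd.DataFrame, label: str = "sum", first: bool = False) -> pd.DataFrame:
    """Return a copy of df with one extra row holding the sums of its numeric columns.

    Hypothetical helper: `label` names the new row, `first` puts it on top.
    """
    totals = df.sum(numeric_only=True)   # NaN values are skipped by default
    sum_row = totals.to_frame().T        # Series -> one-row DataFrame
    sum_row.index = pd.Index([label], name=df.index.name)
    parts = [sum_row, df] if first else [df, sum_row]
    # Restore the original column order; non-numeric columns are NaN in the sum row.
    return pd.concat(parts)[list(df.columns)]


# Small made-up frame in the shape of the question's data.
df = pd.DataFrame(
    {
        "value_a": [0.5, 0.25],
        "value_b": [0.25, None],
        "name": ["project1", "project2"],
        "difference": [-0.25, None],
    },
    index=pd.Index(["project1", "project2"], name="project_name"),
)
result = add_sum_row(df, first=True)
print(result)
```

Because the helper returns a new DataFrame and does not modify its input, it can be called on any of the other DataFrames without side effects.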
How do I fix this code? Do I need to make features_train and features_test DataFrames?
Does anyone have an idea how to fix this code? I really can't understand the problem.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import Normalizer
from sklearn.metrics import r2_score
admissions_data = pd.read_csv('admissions_data.csv')
labels = admissions_data.iloc[:, -1]
features = admissions_data.iloc[:, 1:8]
features_train, labels_train, features_test, labels_test = train_test_split(features, labels, test_size=0.2, random_state=13)
sc = StandardScaler()
features_train_scaled = sc.fit_transform(features_train)
features_test_scale = sc.transform(features_test)
features_train_scaled = pd.DataFrame(features_train_scaled)
features_test_scale = pd.DataFrame(features_test_scale)
The error is:
Traceback (most recent call last):
  File "script.py", line 26, in <module>
    features_test_scale = sc.transform(features_test)
  File "/usr/local/lib/python3.6/dist-packages/sklearn/preprocessing/_data.py", line 794, in transform
    force_all_finite='allow-nan')
  File "/usr/local/lib/python3.6/dist-packages/sklearn/base.py", line 420, in _validate_data
    X = check_array(X, **check_params)
  File "/usr/local/lib/python3.6/dist-packages/sklearn/utils/validation.py", line 73, in inner_f
    return f(**kwargs)
  File "/usr/local/lib/python3.6/dist-packages/sklearn/utils/validation.py", line 624, in check_array
    "if it contains a single sample.".format(array))
ValueError: Expected 2D array, got 1D array instead:
array=[0.57 0.78 0.59 0.64 0.47 0.63 0.65 0.89 0.84 0.73 0.75 0.64 0.46 0.78
0.62 0.53 0.85 0.67 0.84 0.94 0.64 0.53 0.47 0.86 0.62 0.7 0.77 0.61
0.61 0.63 0.86 0.82 0.65 0.58 0.7 0.7 0.84 0.72 0.71 0.77 0.69 0.8
0.52 0.62 0.79 0.71 0.9 0.84 0.6 0.86 0.67 0.61 0.71 0.52 0.62 0.37
0.73 0.64 0.71 0.8 0.88 0.78 0.45 0.62 0.62 0.86 0.74 0.94 0.58 0.7
0.92 0.64 0.65 0.83 0.34 0.66 0.67 0.7 0.71 0.54 0.68 0.61 0.68 0.79
0.57 0.94 0.59 0.79 0.73 0.91 0.86 0.95 0.9 0.92 0.68 0.84 0.69 0.72
0.94 0.53 0.45 0.77 0.77 0.91 0.61 0.78 0.77 0.82 0.9 0.92 0.54 0.92
0.72 0.5 0.68 0.78 0.72 0.53 0.79 0.49 0.68 0.72 0.73 0.93 0.72 0.52
0.54 0.86 0.65 0.93 0.89 0.72 0.34 0.64 0.96 0.79 0.73 0.49 0.73 0.94
0.7 0.95 0.65 0.86 0.78 0.75 0.89 0.94 0.91 0.87 0.93 0.81 0.94 0.89
0.57 0.77 0.39 0.46 0.78 0.64 0.76 0.58 0.56 0.53 0.79 0.9 0.92 0.96
0.67 0.65 0.64 0.58 0.94 0.76 0.78 0.88 0.84 0.68 0.66 0.42 0.56 0.66
0.46 0.65 0.58 0.72 0.48 0.68 0.89 0.95 0.46 0.71 0.79 0.52 0.57 0.76
0.52 0.8 0.77 0.91 0.75 0.49 0.72 0.72 0.61 0.97 0.8 0.85 0.73 0.64
0.87 0.63 0.97 0.72 0.82 0.54 0.71 0.45 0.8 0.49 0.77 0.93 0.89 0.93
0.81 0.62 0.81 0.66 0.78 0.76 0.48 0.61 0.82 0.68 0.7 0.68 0.62 0.81
0.87 0.94 0.38 0.67 0.64 0.84 0.62 0.7 0.62 0.5 0.79 0.78 0.36 0.77
0.57 0.87 0.74 0.71 0.61 0.57 0.64 0.73 0.81 0.74 0.8 0.69 0.66 0.64
0.93 0.64 0.59 0.71 0.82 0.69 0.69 0.89 0.93 0.74 0.64 0.84 0.91 0.97
0.55 0.74 0.72 0.71 0.93 0.96 0.8 0.8 0.81 0.88 0.64 0.38 0.87 0.73
0.78 0.89 0.56 0.61 0.76 0.46 0.78 0.71 0.81 0.59 0.47 0.7 0.42 0.76
0.8 0.67 0.94 0.65 0.51 0.73 0.9 0.8 0.65 0.7 0.96 0.96 0.73 0.79
0.86 0.89 0.85 0.76 0.76 0.71 0.83 0.76 0.42 0.9 0.58 0.66 0.86 0.71
0.8 0.51 0.65 0.58 0.76 0.8 0.7 0.61 0.71 0.69 0.95 0.72 0.79 0.97
0.74 0.96 0.47 0.56 0.73 0.94 0.76 0.79 0.71 0.58 0.94 0.66 0.75 0.76
0.84 0.59 0.68 0.75 0.76 0.72 0.87 0.78 0.67 0.79 0.91 0.57 0.77 0.69
0.73 0.43 0.93 0.68 0.82 0.67 0.74 0.82 0.85 0.62 0.54 0.71 0.92 0.85
0.79 0.63 0.59 0.73 0.66 0.74 0.9 0.81].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
You made a mistake when unpacking the split. train_test_split() returns features_train, features_test, labels_train, labels_test, in that order.
Your code therefore assigned the 1D training labels to features_test, and since transform() expects a 2D array, it raises this error.
So, change your code like this:
#features_train, labels_train, features_test, labels_test = train_test_split(features, labels, test_size=0.2, random_state=13)
features_train, features_test, labels_train, labels_test = train_test_split(features, labels, test_size=0.2, random_state=13)
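A quick way to confirm the unpacking order is to check the shapes after the split. The sketch below uses made-up random data in place of admissions_data.csv, which is an assumption since the file is not available here. It also suggests that the features do not need to be converted before scaling, because StandardScaler accepts a DataFrame directly and returns a NumPy array.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up stand-in for the admissions data: 50 samples, 7 features, 1 label.
rng = np.random.default_rng(0)
features = pd.DataFrame(rng.random((50, 7)), columns=[f"f{i}" for i in range(7)])
labels = pd.Series(rng.random(50), name="target")

# Order of the returned values: train features, test features, train labels, test labels.
features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.2, random_state=13
)

sc = StandardScaler()
features_train_scaled = sc.fit_transform(features_train)  # fit on training data only
features_test_scaled = sc.transform(features_test)        # reuse the training statistics

print(features_train.shape, features_test.shape)  # both 2D
print(labels_train.shape, labels_test.shape)      # both 1D
```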
I have the following dataframe:
>>> mes1 mes2 mes3 mes4 mes5
A1 0.45 0.21 0.53 0.33 0.11
A2 0.44 0.32 0.11 0.38 0.91
A3 0.78 0.31 0.53 0.32 0.14
A4 0.12 0.33 0.56 0.43 0.12
posUp 0.52 0.40 0.62 0.48 0.54
posDown 0.32 0.15 0.45 0.24 0.05
I want to filter my dataframe so that I'm left only with the rows whose values are between "posDown" and "posUp" in all the columns, so the result should be:
>>> mes1 mes2 mes3 mes4 mes5
A1 0.45 0.21 0.53 0.33 0.11
posUp 0.52 0.40 0.62 0.48 0.54
posDown 0.32 0.15 0.45 0.24 0.05
I tried slicing the dataframe into Series and then applying a condition like this:
for i in df:
    db=df[i]
    vmin=db.loc['posUp']
    vmax=db.loc['posDown']
    db=db[(db>vmin)&(db<vmax)]
Then I wanted to drop the rows that do not survive the last db filter, but it didn't filter anything, and when I print db I get "Series([],Name: ..."
Besides that, I believe there is a more convenient and efficient way to do this than a for loop.
So my end goal is to keep only the rows whose values are between posDown and posUp in all the columns.
Try with le and ge:
mask = (df.le(df.loc['posUp']) # compare with `posUp` row-wise
& df.ge(df.loc['posDown']) # compare with `posDown` row-wise
).all(1) # check for all True along the rows
df[mask]
Output:
mes1 mes2 mes3 mes4 mes5
A1 0.45 0.21 0.53 0.33 0.11
posUp 0.52 0.40 0.62 0.48 0.54
posDown 0.32 0.15 0.45 0.24 0.05
You can also use sub followed by all. PS: A3 should not be included, since its mes1 value is 0.78.
out = df[(df.sub(df.loc['posUp']).le(0) & df.sub(df.loc['posDown']).ge(0)).all(1)]
Output:
mes1 mes2 mes3 mes4 mes5
A1 0.45 0.21 0.53 0.33 0.11
posUp 0.52 0.40 0.62 0.48 0.54
posDown 0.32 0.15 0.45 0.24 0.05
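For reference, here is a self-contained sketch of the mask approach on the question's data. It also hints at why the loop in the question returned an empty Series: there vmin is read from posUp and vmax from posDown, so the condition asks for values above the upper bound and below the lower bound at the same time. The variable names lower, upper and filtered are only illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "mes1": [0.45, 0.44, 0.78, 0.12, 0.52, 0.32],
        "mes2": [0.21, 0.32, 0.31, 0.33, 0.40, 0.15],
        "mes3": [0.53, 0.11, 0.53, 0.56, 0.62, 0.45],
        "mes4": [0.33, 0.38, 0.32, 0.43, 0.48, 0.24],
        "mes5": [0.11, 0.91, 0.14, 0.12, 0.54, 0.05],
    },
    index=["A1", "A2", "A3", "A4", "posUp", "posDown"],
)

lower = df.loc["posDown"]  # per-column lower bound
upper = df.loc["posUp"]    # per-column upper bound

# Each row is compared with the bounds column by column;
# a row is kept only if every column falls inside [lower, upper].
mask = (df.ge(lower) & df.le(upper)).all(axis=1)
filtered = df[mask]
print(filtered)
```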
I am trying to concatenate two pivot tables, but after joining them the column-level names (SATISFIED_CHECKOUT and SATISFIED_FOOD) are lost.
Pivot1:
SATISFIED_CHECKOUT 1.0 2.0 3.0 4.0 5.0
SEGMENT
BOTH_TX_SPEND_GROWN 0.01 0.03 0.04 0.14 0.80
BOTH_TX_SPEND_NO_GROWTH 0.01 0.03 0.04 0.14 0.78
ONLY_SHOPPED_2018 NaN 0.03 0.04 0.15 0.78
ONLY_SHOPPED_2019 0.01 0.02 0.05 0.13 0.78
ONLY_SPEND_GROWN 0.01 0.02 0.03 0.12 0.82
ONLY_TX_GROWN 0.01 0.03 0.03 0.14 0.79
SHOPPED_NEITHER NaN 0.04 0.02 0.15 0.79
Pivot2:
SATISFIED_FOOD 1.0 2.0 3.0 4.0 5.0
SEGMENT
BOTH_TX_SPEND_GROWN 0.00 0.01 0.07 0.20 0.71
BOTH_TX_SPEND_NO_GROWTH 0.00 0.01 0.08 0.19 0.71
ONLY_SHOPPED_2018 0.01 0.01 0.07 0.19 0.71
ONLY_SHOPPED_2019 0.00 0.01 0.10 0.19 0.69
ONLY_SPEND_GROWN 0.00 0.01 0.08 0.18 0.72
ONLY_TX_GROWN 0.00 0.02 0.07 0.19 0.72
SHOPPED_NEITHER NaN NaN 0.10 0.20 0.70
The original df looks like below:
SATISFIED_CHECKOUT SATISFIED_FOOD Persona
1 1 BOTH_TX_SPEND_GROWN
2 3 BOTH_TX_SPEND_NO_GROWTH
3 2 ONLY_SHOPPED_2019
.... .... ............
5 3 ONLY_SHOPPED_2019
I am using the code:
a = pd.pivot_table(df,index=["SEGMENT"], columns=["SATISFIED_FOOD"], aggfunc='size').apply(lambda x: x / x.sum(), axis=1).round(2)
b = pd.pivot_table(df,index=["SEGMENT"], columns=["SATISFIED_CHECKOUT"], aggfunc='size').apply(lambda x: x / x.sum(), axis=1).round(2)
pd.concat([a, b],axis=1)
The result looks like this:
1.0 2.0 3.0 4.0 ... 2.0 3.0 4.0 5.0
SEGMENT ...
BOTH_TX_SPEND_GROWN 0.01 0.03 0.07 0.23 ... 0.03 0.04 0.14 0.80
BOTH_TX_SPEND_NO_GROWTH 0.01 0.03 0.06 0.22 ... 0.03 0.04 0.14 0.78
ONLY_SHOPPED_2018 0.01 0.04 0.08 0.24 ... 0.03 0.04 0.15 0.78
ONLY_SHOPPED_2019 0.01 0.03 0.08 0.25 ... 0.02 0.05 0.13 0.78
ONLY_SPEND_GROWN 0.00 0.03 0.07 0.22 ... 0.02 0.03 0.12 0.82
ONLY_TX_GROWN 0.01 0.02 0.05 0.22 ... 0.03 0.03 0.14 0.79
SHOPPED_NEITHER NaN 0.01 0.07 0.28 ... 0.04 0.02 0.15 0.79
[7 rows x 15 columns]
But what I want is a result like this:
SATISFIED_CHECKOUT SATISFIED_FOOD
1.0 2.0 3.0 4.0 ... 2.0 3.0 4.0 5.0
SEGMENT ...
BOTH_TX_SPEND_GROWN 0.01 0.03 0.07 0.23 ... 0.03 0.04 0.14 0.80
BOTH_TX_SPEND_NO_GROWTH 0.01 0.03 0.06 0.22 ... 0.03 0.04 0.14 0.78
ONLY_SHOPPED_2018 0.01 0.04 0.08 0.24 ... 0.03 0.04 0.15 0.78
ONLY_SHOPPED_2019 0.01 0.03 0.08 0.25 ... 0.02 0.05 0.13 0.78
ONLY_SPEND_GROWN 0.00 0.03 0.07 0.22 ... 0.02 0.03 0.12 0.82
ONLY_TX_GROWN 0.01 0.02 0.05 0.22 ... 0.03 0.03 0.14 0.79
SHOPPED_NEITHER NaN 0.01 0.07 0.28 ... 0.04 0.02 0.15 0.79
[7 rows x 15 columns]
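One possible approach is to pass keys to pd.concat, which adds an outer column level naming each table. The sketch below uses made-up data and assumes the grouping column is called SEGMENT, as in the code above (the sample of the original df shows it as Persona). It also swaps in pd.crosstab with normalize="index" for the pivot_table plus apply step, which is a substitution and not the original code.

```python
import pandas as pd

# Made-up survey data; column names follow the question's code.
df = pd.DataFrame(
    {
        "SEGMENT": ["A", "A", "A", "A", "B", "B", "B", "B"],
        "SATISFIED_CHECKOUT": [1, 2, 2, 2, 1, 1, 2, 2],
        "SATISFIED_FOOD": [3, 3, 4, 4, 4, 4, 4, 3],
    }
)


def share_table(frame: pd.DataFrame, column: str) -> pd.DataFrame:
    """Row-wise share of each rating value per segment (hypothetical helper)."""
    return pd.crosstab(frame["SEGMENT"], frame[column], normalize="index").round(2)


checkout = share_table(df, "SATISFIED_CHECKOUT")
food = share_table(df, "SATISFIED_FOOD")

# keys= labels each block, producing two-level column headers.
combined = pd.concat(
    [checkout, food], axis=1, keys=["SATISFIED_CHECKOUT", "SATISFIED_FOOD"]
)
print(combined)
```

A single cell can then be selected with a tuple, for example combined.loc["A", ("SATISFIED_FOOD", 3)].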
I have a pandas dataframe with about 100 columns of the following type:
X1 Y1 X2 Y2 X3 Y3
0.78 0.22 0.19 0.42 0.04 0.65
0.43 0.29 0.43 0.84 0.14 0.42
0.57 0.70 0.59 0.86 0.11 0.40
0.92 0.52 0.81 0.33 0.54 1.00
where (X, Y) are basically pairs of values.
I need to create the following from the above:
X Y
0.78 0.22
0.43 0.29
0.57 0.70
0.92 0.52
0.19 0.42
0.43 0.84
0.59 0.86
0.81 0.33
0.04 0.65
0.14 0.42
0.11 0.40
0.54 1.00
i.e. stack all the X columns (the odd-numbered ones) and then stack all the Y columns (the even-numbered ones).
I have no clue where to even start. For a small number of columns I could easily have used the column names.
You can use lreshape; build the lists of column names with list comprehensions:
x = [col for col in df.columns if 'X' in col]
y = [col for col in df.columns if 'Y' in col]
df = pd.lreshape(df, {'X': x,'Y': y})
print (df)
X Y
0 0.78 0.22
1 0.43 0.29
2 0.57 0.70
3 0.92 0.52
4 0.19 0.42
5 0.43 0.84
6 0.59 0.86
7 0.81 0.33
8 0.04 0.65
9 0.14 0.42
10 0.11 0.40
11 0.54 1.00
Solution with MultiIndex and stack:
import numpy as np
df.columns = [np.arange(len(df.columns)) % 2, np.arange(len(df.columns)) // 2]
df = df.stack().reset_index(drop=True)
df.columns = ['X','Y']
print (df)
X Y
0 0.78 0.22
1 0.19 0.42
2 0.04 0.65
3 0.43 0.29
4 0.43 0.84
5 0.14 0.42
6 0.57 0.70
7 0.59 0.86
8 0.11 0.40
9 0.92 0.52
10 0.81 0.33
11 0.54 1.00
It may also be worth noting that you could just construct a new DataFrame explicitly with the X-Y values. This will most likely be quicker, but it assumes that the X-Y column pairs are the entirety of your DataFrame.
pd.DataFrame(dict(X=df.values[:,::2].reshape(-1),
Y=df.values[:,1::2].reshape(-1)))
Demo
>>> pd.DataFrame(dict(X=df.values[:,::2].reshape(-1),
Y=df.values[:,1::2].reshape(-1)))
X Y
0 0.78 0.22
1 0.19 0.42
2 0.04 0.65
3 0.43 0.29
4 0.43 0.84
5 0.14 0.42
6 0.57 0.70
7 0.59 0.86
8 0.11 0.40
9 0.92 0.52
10 0.81 0.33
11 0.54 1.00
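Note that the construction above returns the pairs in row-major order (X1, X2, X3 of the first row, then the second row, and so on). If the order requested in the question matters (all of X1/Y1, then X2/Y2, ...), one option is to flatten in column-major order instead. This is a sketch that assumes the columns strictly alternate X, Y; the name stacked is illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "X1": [0.78, 0.43, 0.57, 0.92],
        "Y1": [0.22, 0.29, 0.70, 0.52],
        "X2": [0.19, 0.43, 0.59, 0.81],
        "Y2": [0.42, 0.84, 0.86, 0.33],
        "X3": [0.04, 0.14, 0.11, 0.54],
        "Y3": [0.65, 0.42, 0.40, 1.00],
    }
)

values = df.to_numpy()
stacked = pd.DataFrame(
    {
        # Even positions hold the X columns, odd positions the Y columns;
        # order="F" walks down each column before moving to the next one.
        "X": values[:, 0::2].ravel(order="F"),
        "Y": values[:, 1::2].ravel(order="F"),
    }
)
print(stacked)
```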
You can use the documented pd.wide_to_long, but you will need a 'dummy' column to uniquely identify each row. You can drop this column later.
pd.wide_to_long(df.reset_index(),
stubnames=['X', 'Y'],
i='index',
j='dropme').reset_index(drop=True)
X Y
0 0.78 0.22
1 0.43 0.29
2 0.57 0.70
3 0.92 0.52
4 0.19 0.42
5 0.43 0.84
6 0.59 0.86
7 0.81 0.33
8 0.04 0.65
9 0.14 0.42
10 0.11 0.40
11 0.54 1.00