How to concat two pivot tables without losing column names - python

I am trying to concatenate two pivot tables, but after joining them the column names are lost.
Pivot1:
SATISFIED_CHECKOUT 1.0 2.0 3.0 4.0 5.0
SEGMENT
BOTH_TX_SPEND_GROWN 0.01 0.03 0.04 0.14 0.80
BOTH_TX_SPEND_NO_GROWTH 0.01 0.03 0.04 0.14 0.78
ONLY_SHOPPED_2018 NaN 0.03 0.04 0.15 0.78
ONLY_SHOPPED_2019 0.01 0.02 0.05 0.13 0.78
ONLY_SPEND_GROWN 0.01 0.02 0.03 0.12 0.82
ONLY_TX_GROWN 0.01 0.03 0.03 0.14 0.79
SHOPPED_NEITHER NaN 0.04 0.02 0.15 0.79
Pivot2:
SATISFIED_FOOD 1.0 2.0 3.0 4.0 5.0
SEGMENT
BOTH_TX_SPEND_GROWN 0.00 0.01 0.07 0.20 0.71
BOTH_TX_SPEND_NO_GROWTH 0.00 0.01 0.08 0.19 0.71
ONLY_SHOPPED_2018 0.01 0.01 0.07 0.19 0.71
ONLY_SHOPPED_2019 0.00 0.01 0.10 0.19 0.69
ONLY_SPEND_GROWN 0.00 0.01 0.08 0.18 0.72
ONLY_TX_GROWN 0.00 0.02 0.07 0.19 0.72
SHOPPED_NEITHER NaN NaN 0.10 0.20 0.70
The original df looks like below:
SATISFIED_CHECKOUT SATISFIED_FOOD Persona
1 1 BOTH_TX_SPEND_GROWN
2 3 BOTH_TX_SPEND_NO_GROWTH
3 2 ONLY_SHOPPED_2019
.... .... ............
5 3 ONLY_SHOPPED_2019
I am using the code:
a = pd.pivot_table(df,index=["SEGMENT"], columns=["SATISFIED_FOOD"], aggfunc='size').apply(lambda x: x / x.sum(), axis=1).round(2)
b = pd.pivot_table(df,index=["SEGMENT"], columns=["SATISFIED_CHECKOUT"], aggfunc='size').apply(lambda x: x / x.sum(), axis=1).round(2)
pd.concat([a, b],axis=1)
The result like below:
1.0 2.0 3.0 4.0 ... 2.0 3.0 4.0 5.0
SEGMENT ...
BOTH_TX_SPEND_GROWN 0.01 0.03 0.07 0.23 ... 0.03 0.04 0.14 0.80
BOTH_TX_SPEND_NO_GROWTH 0.01 0.03 0.06 0.22 ... 0.03 0.04 0.14 0.78
ONLY_SHOPPED_2018 0.01 0.04 0.08 0.24 ... 0.03 0.04 0.15 0.78
ONLY_SHOPPED_2019 0.01 0.03 0.08 0.25 ... 0.02 0.05 0.13 0.78
ONLY_SPEND_GROWN 0.00 0.03 0.07 0.22 ... 0.02 0.03 0.12 0.82
ONLY_TX_GROWN 0.01 0.02 0.05 0.22 ... 0.03 0.03 0.14 0.79
SHOPPED_NEITHER NaN 0.01 0.07 0.28 ... 0.04 0.02 0.15 0.79
[7 rows x 15 columns]
But what I want to see is a result like below:
SATISFIED_CHECKOUT SATISFIED_FOOD
1.0 2.0 3.0 4.0 ... 2.0 3.0 4.0 5.0
SEGMENT ...
BOTH_TX_SPEND_GROWN 0.01 0.03 0.07 0.23 ... 0.03 0.04 0.14 0.80
BOTH_TX_SPEND_NO_GROWTH 0.01 0.03 0.06 0.22 ... 0.03 0.04 0.14 0.78
ONLY_SHOPPED_2018 0.01 0.04 0.08 0.24 ... 0.03 0.04 0.15 0.78
ONLY_SHOPPED_2019 0.01 0.03 0.08 0.25 ... 0.02 0.05 0.13 0.78
ONLY_SPEND_GROWN 0.00 0.03 0.07 0.22 ... 0.02 0.03 0.12 0.82
ONLY_TX_GROWN 0.01 0.02 0.05 0.22 ... 0.03 0.03 0.14 0.79
SHOPPED_NEITHER NaN 0.01 0.07 0.28 ... 0.04 0.02 0.15 0.79
[7 rows x 15 columns]
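One way that should keep each source question visible as a top-level column label is to pass keys= to pd.concat, which builds a two-level column index with the pivot names on top. A minimal sketch on made-up toy responses, assuming the grouping column is really named SEGMENT as in the posted code:

import pandas as pd

# Hypothetical toy survey data standing in for the real df.
df = pd.DataFrame({
    "SEGMENT": ["BOTH_TX_SPEND_GROWN", "BOTH_TX_SPEND_GROWN",
                "ONLY_SHOPPED_2019", "ONLY_SHOPPED_2019"],
    "SATISFIED_CHECKOUT": [5, 4, 5, 3],
    "SATISFIED_FOOD": [4, 5, 5, 5],
})

a = (pd.pivot_table(df, index=["SEGMENT"], columns=["SATISFIED_FOOD"], aggfunc="size")
       .apply(lambda x: x / x.sum(), axis=1).round(2))
b = (pd.pivot_table(df, index=["SEGMENT"], columns=["SATISFIED_CHECKOUT"], aggfunc="size")
       .apply(lambda x: x / x.sum(), axis=1).round(2))

# keys= labels each block of columns with the name of the pivot it came from.
out = pd.concat([b, a], axis=1, keys=["SATISFIED_CHECKOUT", "SATISFIED_FOOD"])
print(out)

pd.concat also takes a names= argument if the new outer column level itself should carry a name.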

Related

How to write a function with DataFrame in pandas/Python?

Suppose we have a df with a sum() row like the DataFrame below (thanks so much for jezrael's answer here). We have many different DataFrames like this one with different columns; is it possible to put those three lines of code into a function?
df.columns=['value_a','value_b','name','up_or_down','difference']
# from here
df.loc['sum'] = df[['value_a','value_b','difference']].sum()
df1 = df[['value_a','value_b','difference']].sum().to_frame().T
df = pd.concat([df1, df], ignore_index=True)
# end here
df
value_a value_b name up_or_down difference
project_name
sum 27.56 25.04 -1.31
2021-project11 0.43 0.48 2021-project11 up 0.05
2021-project1 0.62 0.56 2021-project1 down -0.06
2021-project2 0.51 0.47 2021-project2 down -0.04
2021-porject3 0.37 0.34 2021-porject3 down -0.03
2021-porject4 0.64 0.61 2021-porject4 down -0.03
2021-project5 0.32 0.25 2021-project5 down -0.07
2021-project6 0.75 0.81 2021-project6 up 0.06
2021-project7 0.60 0.60 2021-project7 down 0.00
2021-project8 0.85 0.74 2021-project8 down -0.11
2021-project10 0.67 0.67 2021-project10 down 0.00
2021-project9 0.73 0.73 2021-project9 down 0.00
2021-project11 0.54 0.54 2021-project11 down 0.00
2021-project12 0.40 0.40 2021-project12 down 0.00
2021-project13 0.76 0.77 2021-project13 up 0.01
2021-project14 1.16 1.28 2021-project14 up 0.12
2021-project15 1.01 0.94 2021-project15 down -0.07
2021-project16 1.23 1.24 2021-project16 up 0.01
2022-project17 0.40 0.36 2022-project17 down -0.04
2022-project_11 0.40 0.40 2022-project_11 down 0.00
2022-project4 1.01 0.80 2022-project4 down -0.21
2022-project1 0.65 0.67 2022-project1 up 0.02
2022-project2 0.75 0.57 2022-project2 down -0.18
2022-porject3 0.32 0.32 2022-porject3 down 0.00
2022-project18 0.91 0.56 2022-project18 down -0.35
2022-project5 0.84 0.89 2022-project5 up 0.05
2022-project19 0.61 0.48 2022-project19 down -0.13
2022-project6 0.77 0.80 2022-project6 up 0.03
2022-project20 0.63 0.54 2022-project20 down -0.09
2022-project8 0.59 0.55 2022-project8 down -0.04
2022-project21 0.58 0.54 2022-project21 down -0.04
2022-project10 0.76 0.76 2022-project10 down 0.00
2022-project9 0.70 0.71 2022-project9 up 0.01
2022-project22 0.62 0.56 2022-project22 down -0.06
2022-project23 2.03 1.74 2022-project23 down -0.29
2022-project12 0.39 0.39 2022-project12 down 0.00
2022-project24 1.35 1.55 2022-project24 up 0.20
project25 0.45 0.42 project25 down -0.03
project26 0.53 NaN project26 down NaN
project27 0.68 NaN project27 down NaN
Can I write a function with conditions like below, so that our other DataFrames can use it directly?
def sum_handler(x):
    if .......
        return .....
    elif .......
        return .....
    else:
        return .....
Thanks so much for any advice
You could try a different approach for summing up your dataframe, as shown in this answer.
df.loc['Total'] = df.sum(numeric_only=True, axis=0)
Since this is one line of code, there is no need to create a custom function for it. But for future reference, you can create a custom function and apply it to a dataframe like this:
import pandas as pd

def double_columns(df: pd.DataFrame, columns: list[str]):
    """ Doubles chosen columns of a dataframe """
    df[columns] = df[columns] * 2
    return df

df = pd.DataFrame({'col1': [1,2], 'col2': [2,3]})
df = double_columns(df, ['col1'])
print(df)
would return
col1 col2
0 2 2
1 4 3
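If you do want the three posted lines wrapped up, a minimal sketch of such a helper follows; the name add_sum_row, the columns parameter, and the two-row example frame are all just illustrative, not an established API:

import pandas as pd

def add_sum_row(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Prepend a row holding the sums of the chosen numeric columns."""
    totals = df[columns].sum().to_frame().T      # one-row frame of column sums
    return pd.concat([totals, df], ignore_index=True)

df = pd.DataFrame({'value_a': [0.43, 0.62], 'value_b': [0.48, 0.56],
                   'name': ['2021-project11', '2021-project1'],
                   'up_or_down': ['up', 'down'],
                   'difference': [0.05, -0.06]})
df = add_sum_row(df, ['value_a', 'value_b', 'difference'])
print(df)

Note that ignore_index=True, as in the original snippet, replaces the project-name index with 0..n, so keep it only if you do not need those labels.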

How to add a sum() value above the df column values?

Suppose I have a df as below; how do I add a sum() row to the DataFrame?
df.columns=['value_a','value_b','name','up_or_down','difference']
df
value_a value_b name up_or_down difference
project_name
# sum 27.56 25.04 sum down -1.31
2021-project11 0.43 0.48 2021-project11 up 0.05
2021-project1 0.62 0.56 2021-project1 down -0.06
2021-project2 0.51 0.47 2021-project2 down -0.04
2021-porject3 0.37 0.34 2021-porject3 down -0.03
2021-porject4 0.64 0.61 2021-porject4 down -0.03
2021-project5 0.32 0.25 2021-project5 down -0.07
2021-project6 0.75 0.81 2021-project6 up 0.06
2021-project7 0.60 0.60 2021-project7 down 0.00
2021-project8 0.85 0.74 2021-project8 down -0.11
2021-project10 0.67 0.67 2021-project10 down 0.00
2021-project9 0.73 0.73 2021-project9 down 0.00
2021-project11 0.54 0.54 2021-project11 down 0.00
2021-project12 0.40 0.40 2021-project12 down 0.00
2021-project13 0.76 0.77 2021-project13 up 0.01
2021-project14 1.16 1.28 2021-project14 up 0.12
2021-project15 1.01 0.94 2021-project15 down -0.07
2021-project16 1.23 1.24 2021-project16 up 0.01
2022-project17 0.40 0.36 2022-project17 down -0.04
2022-project_11 0.40 0.40 2022-project_11 down 0.00
2022-project4 1.01 0.80 2022-project4 down -0.21
2022-project1 0.65 0.67 2022-project1 up 0.02
2022-project2 0.75 0.57 2022-project2 down -0.18
2022-porject3 0.32 0.32 2022-porject3 down 0.00
2022-project18 0.91 0.56 2022-project18 down -0.35
2022-project5 0.84 0.89 2022-project5 up 0.05
2022-project19 0.61 0.48 2022-project19 down -0.13
2022-project6 0.77 0.80 2022-project6 up 0.03
2022-project20 0.63 0.54 2022-project20 down -0.09
2022-project8 0.59 0.55 2022-project8 down -0.04
2022-project21 0.58 0.54 2022-project21 down -0.04
2022-project10 0.76 0.76 2022-project10 down 0.00
2022-project9 0.70 0.71 2022-project9 up 0.01
2022-project22 0.62 0.56 2022-project22 down -0.06
2022-project23 2.03 1.74 2022-project23 down -0.29
2022-project12 0.39 0.39 2022-project12 down 0.00
2022-project24 1.35 1.55 2022-project24 up 0.20
project25 0.45 0.42 project25 down -0.03
project26 0.53 NaN project26 down NaN
project27 0.68 NaN project27 down NaN
I tried
df.sum().columns=['value_a_sun','value_b_sum','difference_sum']
And I would like to add the sum values below as a row above the column values:
sum 27.56 25.04 sum down -1.31
But I got AttributeError: 'Series' object has no attribute 'column'. How do I fix this? Thanks so much for any advice.
Filter the column names in a subset by [] before sum, then assign the result to a new row with DataFrame.loc:
df.loc['sum'] = df[['value_a','value_b','difference']].sum()
To add it as the first row instead:
df1 = df[['value_a','value_b','difference']].sum().to_frame().T
df = pd.concat([df1, df], ignore_index=True)
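For reference, a small runnable demo of the same idea on two made-up rows; the rename('sum') is just one optional way to keep an index label when the totals row is placed first instead of using ignore_index=True:

import pandas as pd

df = pd.DataFrame({'value_a': [0.43, 0.62], 'value_b': [0.48, 0.56],
                   'name': ['2021-project11', '2021-project1'],
                   'up_or_down': ['up', 'down'],
                   'difference': [0.05, -0.06]})

totals = df[['value_a', 'value_b', 'difference']].sum().rename('sum')
df = pd.concat([totals.to_frame().T, df])   # totals row first, labelled 'sum'
print(df)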

How to combine Date object with float Vector

I am reading two data frames from two separate CSVs and trying to combine them into a single data frame. Both df1 and df2 should be combined row by row; df1 contains floating-point numbers and df2 contains dates.
df1=pd.read_csv("Weights.csv")
print(df1.head(5))
df2=pd.read_csv("Date.csv")
print(df2.head(5))
0 1 2 3 4 5 6 7 8 9 10 11 12
0 0.06 0.06 -0.0 -0.0 0.11 0.06 0.37 0.01 0.05 0.10 -0.00 0.01 0.0
1 0.09 0.05 -0.0 -0.0 0.12 0.05 0.36 0.00 0.05 0.08 0.00 0.00 -0.0
2 0.14 0.07 -0.0 0.0 0.13 0.04 0.33 0.01 0.04 0.05 0.00 0.00 0.0
3 0.13 0.07 0.0 -0.0 0.12 0.06 0.34 0.01 0.05 0.04 0.01 0.00 -0.0
4 0.11 0.08 0.0 0.0 0.08 0.10 0.35 0.05 0.05 0.06 0.02 0.00 0.0
0
0 2010-12-29
1 2011-01-05
2 2011-01-12
3 2011-01-19
4 2011-01-26
I am facing a problem using pd.concat in pandas.
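Without the two CSVs it is hard to be certain, but the usual pd.concat pitfall here is index misalignment; a minimal sketch, assuming both files have the same number of rows and that the single column in Date.csv should simply sit next to the weights (the "Date" column name is only illustrative):

import pandas as pd

df1 = pd.read_csv("Weights.csv")   # 13 float columns, as shown above
df2 = pd.read_csv("Date.csv")      # one date column

combined = pd.concat(
    [df2.reset_index(drop=True).rename(columns={df2.columns[0]: "Date"}),
     df1.reset_index(drop=True)],
    axis=1,
)
combined["Date"] = pd.to_datetime(combined["Date"])   # parse the strings as real dates
print(combined.head())

The reset_index calls are what prevent NaN rows when the two frames carry different indexes.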

Is there a way to interpolate values while maintaining a ratio?

I have a dataframe of percentages, and I want to interpolate the intermediate values
0 5 10 15 20 25 30 35 40
A 0.50 0.50 0.50 0.49 0.47 0.41 0.35 0.29 0.22
B 0.31 0.31 0.31 0.29 0.28 0.24 0.22 0.18 0.13
C 0.09 0.09 0.09 0.09 0.08 0.07 0.06 0.05 0.04
D 0.08 0.08 0.08 0.08 0.06 0.06 0.05 0.04 0.03
E 0.01 0.01 0.01 0.01 0.01 0.02 0.02 0.03 0.04
F 0.01 0.01 0.01 0.04 0.10 0.20 0.30 0.41 0.54
So far, I've been using scipy's interp1d and iterating row by row, but it doesn't always maintain the ratios perfectly down the column. Is there a way to do this all together in one function?
Reindex, then interpolate:
r = range(df.columns.min(), df.columns.max() + 1)
df.reindex(columns=r).interpolate(axis=1)
0 1 2 3 4 5 6 7 8 9 ... 31 32 33 34 35 36 37 38 39 40
A 0.50 0.50 0.50 0.50 0.50 0.50 0.50 0.50 0.50 0.50 ... 0.338 0.326 0.314 0.302 0.29 0.276 0.262 0.248 0.234 0.22
B 0.31 0.31 0.31 0.31 0.31 0.31 0.31 0.31 0.31 0.31 ... 0.212 0.204 0.196 0.188 0.18 0.170 0.160 0.150 0.140 0.13
C 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 ... 0.058 0.056 0.054 0.052 0.05 0.048 0.046 0.044 0.042 0.04
D 0.08 0.08 0.08 0.08 0.08 0.08 0.08 0.08 0.08 0.08 ... 0.048 0.046 0.044 0.042 0.04 0.038 0.036 0.034 0.032 0.03
E 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 ... 0.022 0.024 0.026 0.028 0.03 0.032 0.034 0.036 0.038 0.04
F 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 ... 0.322 0.344 0.366 0.388 0.41 0.436 0.462 0.488 0.514 0.54
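For completeness, a self-contained version of that answer; the frame is rebuilt from the question, and the column labels must be integers (not strings) for range() and reindex to line up:

import pandas as pd

df = pd.DataFrame(
    [[0.50, 0.50, 0.50, 0.49, 0.47, 0.41, 0.35, 0.29, 0.22],
     [0.31, 0.31, 0.31, 0.29, 0.28, 0.24, 0.22, 0.18, 0.13],
     [0.09, 0.09, 0.09, 0.09, 0.08, 0.07, 0.06, 0.05, 0.04],
     [0.08, 0.08, 0.08, 0.08, 0.06, 0.06, 0.05, 0.04, 0.03],
     [0.01, 0.01, 0.01, 0.01, 0.01, 0.02, 0.02, 0.03, 0.04],
     [0.01, 0.01, 0.01, 0.04, 0.10, 0.20, 0.30, 0.41, 0.54]],
    index=list("ABCDEF"),
    columns=range(0, 41, 5),   # integer labels 0, 5, ..., 40
)

r = range(df.columns.min(), df.columns.max() + 1)
out = df.reindex(columns=r).interpolate(axis=1)
print(out.round(3))

Because each interpolated column is a weighted average of its two neighbours, columns that sum to 1 stay summing to 1 down the frame (up to the rounding already present in the data), which should address the ratio concern.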

Pandas stack column pairs

I have a pandas dataframe with about 100 columns of following type:
X1 Y1 X2 Y2 X3 Y3
0.78 0.22 0.19 0.42 0.04 0.65
0.43 0.29 0.43 0.84 0.14 0.42
0.57 0.70 0.59 0.86 0.11 0.40
0.92 0.52 0.81 0.33 0.54 1.00
where (X, Y) are basically pairs of values.
I need to create the following from above.
X Y
0.78 0.22
0.43 0.29
0.57 0.70
0.92 0.52
0.19 0.42
0.43 0.84
0.59 0.86
0.81 0.33
0.04 0.65
0.14 0.42
0.11 0.40
0.54 1.00
i.e. stack all the X columns (the odd-numbered ones) and then stack all the Y columns (the even-numbered ones).
I have no clue where to even start. For a small number of columns I could easily have used the column names.
You can use lreshape; for the column names, use a list comprehension:
x = [col for col in df.columns if 'X' in col]
y = [col for col in df.columns if 'Y' in col]
df = pd.lreshape(df, {'X': x,'Y': y})
print (df)
X Y
0 0.78 0.22
1 0.43 0.29
2 0.57 0.70
3 0.92 0.52
4 0.19 0.42
5 0.43 0.84
6 0.59 0.86
7 0.81 0.33
8 0.04 0.65
9 0.14 0.42
10 0.11 0.40
11 0.54 1.00
Solution with MultiIndex and stack:
import numpy as np

df.columns = [np.arange(len(df.columns)) % 2, np.arange(len(df.columns)) // 2]
df = df.stack().reset_index(drop=True)
df.columns = ['X','Y']
print (df)
X Y
0 0.78 0.22
1 0.19 0.42
2 0.04 0.65
3 0.43 0.29
4 0.43 0.84
5 0.14 0.42
6 0.57 0.70
7 0.59 0.86
8 0.11 0.40
9 0.92 0.52
10 0.81 0.33
11 0.54 1.00
It may also be worth noting that you could just construct a new DataFrame explicitly with the X-Y values. This will most likely be quicker, but it assumes that the X-Y column pairs are the entirety of your DataFrame.
pd.DataFrame(dict(X=df.values[:,::2].reshape(-1),
                  Y=df.values[:,1::2].reshape(-1)))
Demo
>>> pd.DataFrame(dict(X=df.values[:,::2].reshape(-1),
...                   Y=df.values[:,1::2].reshape(-1)))
X Y
0 0.78 0.22
1 0.19 0.42
2 0.04 0.65
3 0.43 0.29
4 0.43 0.84
5 0.14 0.42
6 0.57 0.70
7 0.59 0.86
8 0.11 0.40
9 0.92 0.52
10 0.81 0.33
11 0.54 1.00
You can use the documented pd.wide_to_long but you will need to use a 'dummy' column to uniquely identify each row. You can drop this column later.
pd.wide_to_long(df.reset_index(),
                stubnames=['X', 'Y'],
                i='index',
                j='dropme').reset_index(drop=True)
X Y
0 0.78 0.22
1 0.43 0.29
2 0.57 0.70
3 0.92 0.52
4 0.19 0.42
5 0.43 0.84
6 0.59 0.86
7 0.81 0.33
8 0.04 0.65
9 0.14 0.42
10 0.11 0.40
11 0.54 1.00
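All three answers assume df already exists; if you want to try them, the example frame from the question can be rebuilt like this:

import pandas as pd

# The four-row example from the question, for testing the answers above.
df = pd.DataFrame(
    [[0.78, 0.22, 0.19, 0.42, 0.04, 0.65],
     [0.43, 0.29, 0.43, 0.84, 0.14, 0.42],
     [0.57, 0.70, 0.59, 0.86, 0.11, 0.40],
     [0.92, 0.52, 0.81, 0.33, 0.54, 1.00]],
    columns=['X1', 'Y1', 'X2', 'Y2', 'X3', 'Y3'],
)
print(df)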
