Suppose we have a DataFrame with a sum() row, like the one below (thanks to @jezrael's answer here). We have many other DataFrames like this one, but with different columns. Is it possible to wrap those three lines of code in a function?
df.columns=['value_a','value_b','name','up_or_down','difference']
# from here
df.loc['sum'] = df[['value_a','value_b','difference']].sum()
df1 = df[['value_a','value_b','difference']].sum().to_frame().T
df = pd.concat([df1, df], ignore_index=True)
# end here
df
value_a value_b name up_or_down difference
project_name
sum 27.56 25.04 -1.31
2021-project11 0.43 0.48 2021-project11 up 0.05
2021-project1 0.62 0.56 2021-project1 down -0.06
2021-project2 0.51 0.47 2021-project2 down -0.04
2021-porject3 0.37 0.34 2021-porject3 down -0.03
2021-porject4 0.64 0.61 2021-porject4 down -0.03
2021-project5 0.32 0.25 2021-project5 down -0.07
2021-project6 0.75 0.81 2021-project6 up 0.06
2021-project7 0.60 0.60 2021-project7 down 0.00
2021-project8 0.85 0.74 2021-project8 down -0.11
2021-project10 0.67 0.67 2021-project10 down 0.00
2021-project9 0.73 0.73 2021-project9 down 0.00
2021-project11 0.54 0.54 2021-project11 down 0.00
2021-project12 0.40 0.40 2021-project12 down 0.00
2021-project13 0.76 0.77 2021-project13 up 0.01
2021-project14 1.16 1.28 2021-project14 up 0.12
2021-project15 1.01 0.94 2021-project15 down -0.07
2021-project16 1.23 1.24 2021-project16 up 0.01
2022-project17 0.40 0.36 2022-project17 down -0.04
2022-project_11 0.40 0.40 2022-project_11 down 0.00
2022-project4 1.01 0.80 2022-project4 down -0.21
2022-project1 0.65 0.67 2022-project1 up 0.02
2022-project2 0.75 0.57 2022-project2 down -0.18
2022-porject3 0.32 0.32 2022-porject3 down 0.00
2022-project18 0.91 0.56 2022-project18 down -0.35
2022-project5 0.84 0.89 2022-project5 up 0.05
2022-project19 0.61 0.48 2022-project19 down -0.13
2022-project6 0.77 0.80 2022-project6 up 0.03
2022-project20 0.63 0.54 2022-project20 down -0.09
2022-project8 0.59 0.55 2022-project8 down -0.04
2022-project21 0.58 0.54 2022-project21 down -0.04
2022-project10 0.76 0.76 2022-project10 down 0.00
2022-project9 0.70 0.71 2022-project9 up 0.01
2022-project22 0.62 0.56 2022-project22 down -0.06
2022-project23 2.03 1.74 2022-project23 down -0.29
2022-project12 0.39 0.39 2022-project12 down 0.00
2022-project24 1.35 1.55 2022-project24 up 0.20
project25 0.45 0.42 project25 down -0.03
project26 0.53 NaN project26 down NaN
project27 0.68 NaN project27 down NaN
Can I write a function with conditions like the one below, so that our other DataFrames can use it directly?
def sum_handler(x):
    if .......
        return .....
    elif .......
        return .....
    else
        return .....
Thanks so much for any advice
You could try a different approach to summing your DataFrame, as shown in this answer:
df.loc['Total'] = df.sum(numeric_only=True, axis=0)
Since this is a single line of code, there is no need for a custom function. For future reference, though, you can write a custom function and apply it to a DataFrame like this:
import pandas as pd
def double_columns(df: pd.DataFrame, columns: list[str]):
    """ Doubles chosen columns of a dataframe """
    df[columns] = df[columns] * 2
    return df
df = pd.DataFrame({'col1': [1,2], 'col2': [2,3]})
df = double_columns(df, ['col1'])
print(df)
which prints:
col1 col2
0 2 2
1 4 3
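The three lines from the question can be wrapped in a helper in the same way, so that any DataFrame can reuse them. This is only a minimal sketch: `add_sum_row` and its parameters (`columns`, `label`, `first`) are hypothetical names, not part of pandas, and the sample data is made up.

```python
import pandas as pd


def add_sum_row(df, columns=None, label="sum", first=False):
    """Return a copy of df with one extra row holding column sums.

    columns: columns to sum; None means every numeric column.
    label:   index label given to the new row.
    first:   put the sum row at the top instead of the bottom.
    """
    if columns is None:
        # Assumption: "numeric" means whatever select_dtypes reports as a number.
        columns = df.select_dtypes(include="number").columns
    # Series of sums -> one-row DataFrame whose index label is `label`.
    total = df[columns].sum().to_frame(label).T
    parts = [total, df] if first else [df, total]
    # Columns that were not summed hold NaN in the new row.
    return pd.concat(parts)[df.columns]


df = pd.DataFrame(
    {"value_a": [0.43, 0.62], "value_b": [0.48, 0.56], "name": ["p1", "p2"]},
    index=["p1", "p2"],
)
out = add_sum_row(df, first=True)
print(out)
```

Because the helper returns a new DataFrame rather than modifying its argument, calling it twice on the same input does not add a second sum row to the original.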
Suppose I have a DataFrame like the one below. How can I add a row of sum() values to it?
df.columns=['value_a','value_b','name','up_or_down','difference']
df
value_a value_b name up_or_down difference
project_name
# sum 27.56 25.04 sum down -1.31
2021-project11 0.43 0.48 2021-project11 up 0.05
2021-project1 0.62 0.56 2021-project1 down -0.06
2021-project2 0.51 0.47 2021-project2 down -0.04
2021-porject3 0.37 0.34 2021-porject3 down -0.03
2021-porject4 0.64 0.61 2021-porject4 down -0.03
2021-project5 0.32 0.25 2021-project5 down -0.07
2021-project6 0.75 0.81 2021-project6 up 0.06
2021-project7 0.60 0.60 2021-project7 down 0.00
2021-project8 0.85 0.74 2021-project8 down -0.11
2021-project10 0.67 0.67 2021-project10 down 0.00
2021-project9 0.73 0.73 2021-project9 down 0.00
2021-project11 0.54 0.54 2021-project11 down 0.00
2021-project12 0.40 0.40 2021-project12 down 0.00
2021-project13 0.76 0.77 2021-project13 up 0.01
2021-project14 1.16 1.28 2021-project14 up 0.12
2021-project15 1.01 0.94 2021-project15 down -0.07
2021-project16 1.23 1.24 2021-project16 up 0.01
2022-project17 0.40 0.36 2022-project17 down -0.04
2022-project_11 0.40 0.40 2022-project_11 down 0.00
2022-project4 1.01 0.80 2022-project4 down -0.21
2022-project1 0.65 0.67 2022-project1 up 0.02
2022-project2 0.75 0.57 2022-project2 down -0.18
2022-porject3 0.32 0.32 2022-porject3 down 0.00
2022-project18 0.91 0.56 2022-project18 down -0.35
2022-project5 0.84 0.89 2022-project5 up 0.05
2022-project19 0.61 0.48 2022-project19 down -0.13
2022-project6 0.77 0.80 2022-project6 up 0.03
2022-project20 0.63 0.54 2022-project20 down -0.09
2022-project8 0.59 0.55 2022-project8 down -0.04
2022-project21 0.58 0.54 2022-project21 down -0.04
2022-project10 0.76 0.76 2022-project10 down 0.00
2022-project9 0.70 0.71 2022-project9 up 0.01
2022-project22 0.62 0.56 2022-project22 down -0.06
2022-project23 2.03 1.74 2022-project23 down -0.29
2022-project12 0.39 0.39 2022-project12 down 0.00
2022-project24 1.35 1.55 2022-project24 up 0.20
project25 0.45 0.42 project25 down -0.03
project26 0.53 NaN project26 down NaN
project27 0.68 NaN project27 down NaN
I tried
df.sum().columns=['value_a_sun','value_b_sum','difference_sum']
I would like to add the sum row below to the DataFrame above:
sum 27.56 25.04 sum down -1.31
But I got AttributeError: 'Series' object has no attribute 'column'. How can I fix this? Thanks so much for any advice.
Select the columns to sum with [] first, then assign the result as a new row with DataFrame.loc:
df.loc['sum'] = df[['value_a','value_b','difference']].sum()
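To put the sum in the first row instead (note that ignore_index=True replaces the project_name index with a default 0..n index):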
df1 = df[['value_a','value_b','difference']].sum().to_frame().T
df = pd.concat([df1, df], ignore_index=True)
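As for the attempt in the question: `df.sum()` returns a Series whose index holds the original column names, so it has no columns attribute to assign to. If the goal was only to relabel the sums, something like the sketch below may be enough. `add_suffix` is an existing pandas method; the sample values are made up.

```python
import pandas as pd

df = pd.DataFrame(
    {"value_a": [0.43, 0.62], "value_b": [0.48, 0.56], "difference": [0.05, -0.06]}
)

# Summing a DataFrame yields a Series indexed by column name.
totals = df.sum()

# Relabel the index entries instead of assigning to a columns attribute.
totals_renamed = totals.add_suffix("_sum")
print(totals_renamed)
```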
I'm having trouble converting my data from wide format to long format using the pd.wide_to_long() method. The error reads: IndexError: Too many levels: Index has only 1 level, not 2
My code:
import pandas as pd
df = pd.read_csv('data/data.csv', index_col=False)
print(df)
df.reset_index(inplace=True,drop=True)
df['ID'] = df.index
pd.wide_to_long(df, ['OT_', 'NT_'], i='ID', j=['MISS', 'HIT', 'CR', 'FA']).reset_index().rename(columns={'OT_': 'OT', 'NT_': 'NT'})
CSV (it's just junk data):
PID,OT_MISS,OT_HIT,OT_CR,OT_FA,NT_MISS,NT_HIT,NT_CR,NT_FA
111,0.1,0.23,0.56,0.11,0.9,1.0,0.92,0.68
121,0.1,0.23,0.56,0.11,0.9,1.0,0.92,0.68
212,0.1,0.23,0.56,0.11,0.9,1.0,0.92,0.68
321,0.1,0.23,0.56,0.11,0.9,1.0,0.92,0.68
423,0.1,0.23,0.56,0.11,0.9,1.0,0.92,0.68
534,0.1,0.23,0.56,0.11,0.9,1.0,0.92,0.68
621,0.1,0.23,0.56,0.11,0.9,1.0,0.92,0.68
721,0.1,0.23,0.56,0.11,0.9,1.0,0.92,0.68
812,0.1,0.23,0.56,0.11,0.9,1.0,0.92,0.68
922,0.1,0.23,0.56,0.11,0.9,1.0,0.92,0.68
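In pandas, you can also use the melt() function to go from wide to long format. Note that melt() takes id_vars rather than i and j, and that it leaves the combined names (OT_MISS, NT_HIT, ...) in a single column that still has to be split. Applied to the DataFrame as read from the CSV:
df2 = pd.melt(df, id_vars='PID', var_name='variable', value_name='value')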
The issue is that pandas.wide_to_long did not recognize the suffixes: by default it expects numeric suffixes, whereas these are words. In addition, i should be 'PID' rather than 'ID', and j must be a single string, not a list.
r'\w+' is a regular expression that matches one or more word characters.
import pandas as pd
df2 = pd.wide_to_long(df, ['OT', 'NT'], i='PID', j='stubs', sep='_', suffix=r'\w+')
print(df2)
OT NT
PID stubs
111 MISS 0.10 0.90
121 MISS 0.10 0.90
212 MISS 0.10 0.90
321 MISS 0.10 0.90
423 MISS 0.10 0.90
534 MISS 0.10 0.90
621 MISS 0.10 0.90
721 MISS 0.10 0.90
812 MISS 0.10 0.90
922 MISS 0.10 0.90
111 HIT 0.23 1.00
121 HIT 0.23 1.00
212 HIT 0.23 1.00
321 HIT 0.23 1.00
423 HIT 0.23 1.00
534 HIT 0.23 1.00
621 HIT 0.23 1.00
721 HIT 0.23 1.00
812 HIT 0.23 1.00
922 HIT 0.23 1.00
111 CR 0.56 0.92
121 CR 0.56 0.92
212 CR 0.56 0.92
321 CR 0.56 0.92
423 CR 0.56 0.92
534 CR 0.56 0.92
621 CR 0.56 0.92
721 CR 0.56 0.92
812 CR 0.56 0.92
922 CR 0.56 0.92
111 FA 0.11 0.68
121 FA 0.11 0.68
212 FA 0.11 0.68
321 FA 0.11 0.68
423 FA 0.11 0.68
534 FA 0.11 0.68
621 FA 0.11 0.68
721 FA 0.11 0.68
812 FA 0.11 0.68
922 FA 0.11 0.68
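The reshape can be reproduced without the CSV file. The sketch below feeds two of the sample rows through `io.StringIO` and compares `wide_to_long` with a melt-based route that splits the old column names by hand. The names `long_w2l`, `long_melt` and `cond` are illustrative only.

```python
import io

import pandas as pd

csv_text = """PID,OT_MISS,OT_HIT,OT_CR,OT_FA,NT_MISS,NT_HIT,NT_CR,NT_FA
111,0.1,0.23,0.56,0.11,0.9,1.0,0.92,0.68
121,0.1,0.23,0.56,0.11,0.9,1.0,0.92,0.68
"""
df = pd.read_csv(io.StringIO(csv_text))

# Route 1: wide_to_long splits "OT_MISS" into stub "OT" and suffix "MISS".
long_w2l = pd.wide_to_long(
    df, stubnames=["OT", "NT"], i="PID", j="stubs", sep="_", suffix=r"\w+"
)

# Route 2: melt everything, split the old column names on "_",
# then pivot the OT/NT part back into columns.
melted = df.melt(id_vars="PID", var_name="column", value_name="value")
melted[["cond", "stubs"]] = melted["column"].str.split("_", expand=True)
long_melt = melted.pivot(index=["PID", "stubs"], columns="cond", values="value")

print(long_w2l)
print(long_melt)
```

Both results are indexed by (PID, stubs) and hold the same values; only the row and column order differs.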
I have a pandas dataframe with about 100 columns of following type:
X1 Y1 X2 Y2 X3 Y3
0.78 0.22 0.19 0.42 0.04 0.65
0.43 0.29 0.43 0.84 0.14 0.42
0.57 0.70 0.59 0.86 0.11 0.40
0.92 0.52 0.81 0.33 0.54 1.00
where (X, Y) are pairs of values.
I need to create the following from above.
X Y
0.78 0.22
0.43 0.29
0.57 0.70
0.92 0.52
0.19 0.42
0.43 0.84
0.59 0.86
0.81 0.33
0.04 0.65
0.14 0.42
0.11 0.40
0.54 1.00
i.e. stack all the X columns (the odd-numbered ones) into one column, and all the Y columns (the even-numbered ones) into another.
I have no clue where to even start. For a small number of columns I could easily have used the column names.
You can use lreshape; build the lists of column names with list comprehensions:
x = [col for col in df.columns if 'X' in col]
y = [col for col in df.columns if 'Y' in col]
df = pd.lreshape(df, {'X': x,'Y': y})
print (df)
X Y
0 0.78 0.22
1 0.43 0.29
2 0.57 0.70
3 0.92 0.52
4 0.19 0.42
5 0.43 0.84
6 0.59 0.86
7 0.81 0.33
8 0.04 0.65
9 0.14 0.42
10 0.11 0.40
11 0.54 1.00
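Solution with MultiIndex and stack (note that the rows come out row by row here, not column by column as with lreshape):
import numpy as np
df.columns = [np.arange(len(df.columns)) % 2, np.arange(len(df.columns)) // 2]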
df = df.stack().reset_index(drop=True)
df.columns = ['X','Y']
print (df)
X Y
0 0.78 0.22
1 0.19 0.42
2 0.04 0.65
3 0.43 0.29
4 0.43 0.84
5 0.14 0.42
6 0.57 0.70
7 0.59 0.86
8 0.11 0.40
9 0.92 0.52
10 0.81 0.33
11 0.54 1.00
It may also be worth noting that you could just construct a new DataFrame explicitly with the X-Y values. This will most likely be quicker, but it assumes that the X-Y column pairs are the entirety of your DataFrame.
pd.DataFrame(dict(X=df.values[:,::2].reshape(-1),
                  Y=df.values[:,1::2].reshape(-1)))
Demo
>>> pd.DataFrame(dict(X=df.values[:,::2].reshape(-1),
...                   Y=df.values[:,1::2].reshape(-1)))
X Y
0 0.78 0.22
1 0.19 0.42
2 0.04 0.65
3 0.43 0.29
4 0.43 0.84
5 0.14 0.42
6 0.57 0.70
7 0.59 0.86
8 0.11 0.40
9 0.92 0.52
10 0.81 0.33
11 0.54 1.00
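The reshape above flattens row by row, so the output order differs from the one requested in the question. If the column-by-column order matters (all of X1, then all of X2, and so on), flattening in Fortran order is one way to get it. A small sketch, assuming as before that the X-Y pairs make up the whole DataFrame; the data is the first two pairs from the question.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "X1": [0.78, 0.43, 0.57, 0.92],
        "Y1": [0.22, 0.29, 0.70, 0.52],
        "X2": [0.19, 0.43, 0.59, 0.81],
        "Y2": [0.42, 0.84, 0.86, 0.33],
    }
)

values = df.to_numpy()
# Even positions hold the X columns, odd positions the Y columns.
# order="F" flattens column by column, so all of X1 comes before X2.
stacked = pd.DataFrame(
    {
        "X": values[:, ::2].ravel(order="F"),
        "Y": values[:, 1::2].ravel(order="F"),
    }
)
print(stacked)
```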
You can use the documented pd.wide_to_long but you will need to use a 'dummy' column to uniquely identify each row. You can drop this column later.
pd.wide_to_long(df.reset_index(),
                stubnames=['X', 'Y'],
                i='index',
                j='dropme').reset_index(drop=True)
X Y
0 0.78 0.22
1 0.43 0.29
2 0.57 0.70
3 0.92 0.52
4 0.19 0.42
5 0.43 0.84
6 0.59 0.86
7 0.81 0.33
8 0.04 0.65
9 0.14 0.42
10 0.11 0.40
11 0.54 1.00
Given that my data is a pandas dataframe and looks like this:
Ref +1 +2 +3 +4 +5 +6 +7
2013-05-28 1 -0.44 0.03 0.06 -0.31 0.13 0.56 0.81
2013-07-05 2 0.84 1.03 0.96 0.90 1.09 0.59 1.15
2013-08-21 3 0.09 0.25 0.06 0.09 -0.09 -0.16 0.56
2014-10-15 4 0.35 1.16 1.91 3.44 2.75 1.97 2.16
2015-02-09 5 0.09 -0.10 -0.38 -0.69 -0.25 -0.85 -0.47
How can I plot a chart with 5 lines (one for each Ref), where the x-axis shows the columns (+1, +2, ...) and every line starts from 0? A seaborn solution would be even better, but matplotlib solutions are also welcome.
Plotting a dataframe in pandas is generally all about reshaping the table so that the individual lines you want are in separate columns, and the x-values are in the index. Some of these reshape operations are a bit ugly, but you can do:
df = pd.read_clipboard()
plot_table = pd.melt(df.reset_index(), id_vars=['index', 'Ref'])
plot_table = plot_table.pivot(index='variable', columns='Ref', values='value')
# Add extra row to have all lines start from 0:
plot_table.loc['+0', :] = 0
plot_table = plot_table.sort_index()
plot_table
Ref 1 2 3 4 5
variable
+0 0.00 0.00 0.00 0.00 0.00
+1 -0.44 0.84 0.09 0.35 0.09
+2 0.03 1.03 0.25 1.16 -0.10
+3 0.06 0.96 0.06 1.91 -0.38
+4 -0.31 0.90 0.09 3.44 -0.69
+5 0.13 1.09 -0.09 2.75 -0.25
+6 0.56 0.59 -0.16 1.97 -0.85
+7 0.81 1.15 0.56 2.16 -0.47
Now that you have a table with the right shape, plotting is pretty automatic:
plot_table.plot()
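A shorter route to a table of the same shape is to set Ref as the index and transpose. The sketch below types in part of the sample data by hand rather than reading the clipboard; the helper name `build_plot_table` and the conversion of the "+n" labels to integers are choices made for this sketch, not part of the answer above.

```python
import pandas as pd


def build_plot_table(df):
    """Return a table with one column per Ref and the offsets 0..n as index."""
    table = df.set_index("Ref").T
    # Turn the "+1", "+2", ... labels into integers so the x-axis is numeric.
    table.index = table.index.str.lstrip("+").astype(int)
    # Extra row so that every line starts from 0.
    table.loc[0] = 0.0
    return table.sort_index()


df = pd.DataFrame(
    {
        "Ref": [1, 2],
        "+1": [-0.44, 0.84],
        "+2": [0.03, 1.03],
        "+3": [0.06, 0.96],
    },
    index=pd.to_datetime(["2013-05-28", "2013-07-05"]),
)
plot_table = build_plot_table(df)
print(plot_table)
# plot_table.plot() then draws one line per Ref (this needs matplotlib).
```

With an integer index the offsets also sort correctly once they reach +10 and beyond, which string labels such as "+10" would not.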