Pandas: Pairwise concatenation of column vectors - python

I'm working with a frame like
df = pd.DataFrame({
'G1':[1.00,0.69,0.23,0.22,0.62],
'G2':[0.03,0.41,0.74,0.35,0.62],
'G3':[0.05,0.40,0.15,0.32,0.19],
'G4':[0.30,0.20,0.51,0.70,0.67],
'G5':[0.40,0.36,0.88,0.10,0.19]
})
and I want to reshape it so that the columns are the pairwise permutations of the current columns. Every new column would be 10 elements long; for example, column 'G1:G2' would be column 'G2' appended to column 'G1'. I have attached a mock-up pic. Note that the pic has named indices, unlike the example code above. I can work with or without the indices.
How could I approach this? I can write a function that acts on each column, but I think it would have to return a data frame built by concatenating that column with every other column, and I'm not sure what that would look like.

I'd do it like this
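import numpy as np
import pandas as pd
from itertools import permutations
# positions (i, j) with i != j, split into two parallel lists
l1, l2 = map(list, zip(*permutations(range(len(df.columns)), 2)))
v = df.values
pd.DataFrame(
    np.vstack([v[:, l1], v[:, l2]]),                     # second column of each pair stacked under the first
    list(map('S{}'.format, range(1, len(df) + 1))) * 2,  # index labels S1..S5, repeated twice
    df.columns.values[l1] + ':' + df.columns.values[l2]  # column names such as 'G1:G2'
)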

Here is one way, although I suspect there might also be a way to do this directly in pandas
from itertools import permutations
import numpy as np
import pandas as pd
# Get all the column permutations
lst = [x for x in permutations(df.columns, 2)]
# Create a list of column names
names = [x[0]+'_'+x[1] for x in lst]
# Create the new arrays by vertically stacking pairs of column values
cols = [np.vstack((df[x[0]].values,df[x[1]].values)).ravel() for x in lst]
# Create a dictionary with column names as keys and the arrays as values
d = dict(zip(names, cols))
# Create new dataframe from dict
df2 = pd.DataFrame(d)
df2
G1_G2 G1_G3 G1_G4 G1_G5 G2_G1 G2_G3 G2_G4 G2_G5 G3_G1 G3_G2 \
0 1.00 1.00 1.00 1.00 0.03 0.03 0.03 0.03 0.05 0.05
1 0.69 0.69 0.69 0.69 0.41 0.41 0.41 0.41 0.40 0.40
2 0.23 0.23 0.23 0.23 0.74 0.74 0.74 0.74 0.15 0.15
3 0.22 0.22 0.22 0.22 0.35 0.35 0.35 0.35 0.32 0.32
4 0.62 0.62 0.62 0.62 0.62 0.62 0.62 0.62 0.19 0.19
5 0.03 0.05 0.30 0.40 1.00 0.05 0.30 0.40 1.00 0.03
6 0.41 0.40 0.20 0.36 0.69 0.40 0.20 0.36 0.69 0.41
7 0.74 0.15 0.51 0.88 0.23 0.15 0.51 0.88 0.23 0.74
8 0.35 0.32 0.70 0.10 0.22 0.32 0.70 0.10 0.22 0.35
9 0.62 0.19 0.67 0.19 0.62 0.19 0.67 0.19 0.62 0.62
This is only part of the output.
To avoid creating the intermediate lists, take advantage of the fact that itertools.permutations returns a lazy iterator:
d = dict((x[0]+'_'+x[1], np.vstack((df[x[0]].values, df[x[1]].values)).ravel())
         for x in permutations(df.columns, 2))
df2 = pd.DataFrame(d)
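For what it's worth, the same result can probably be had with pandas alone. This is only a minimal sketch, assuming the df from the question; pairwise_concat and its sep parameter are names made up for this example, not part of pandas.

```python
from itertools import permutations

import pandas as pd

df = pd.DataFrame({
    'G1': [1.00, 0.69, 0.23, 0.22, 0.62],
    'G2': [0.03, 0.41, 0.74, 0.35, 0.62],
    'G3': [0.05, 0.40, 0.15, 0.32, 0.19],
    'G4': [0.30, 0.20, 0.51, 0.70, 0.67],
    'G5': [0.40, 0.36, 0.88, 0.10, 0.19],
})


def pairwise_concat(frame, sep=':'):
    """Hypothetical helper: one column per ordered pair (a, b), with b appended to a."""
    pieces = {
        f'{a}{sep}{b}': pd.concat([frame[a], frame[b]], ignore_index=True)
        for a, b in permutations(frame.columns, 2)
    }
    return pd.DataFrame(pieces)


result = pairwise_concat(df)
print(result.shape)  # 5 columns give 5 * 4 ordered pairs, each 2 * 5 rows long
```

ignore_index=True gives a plain 0..9 index; drop it if you would rather keep the original row labels repeated.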

Related

How to write a function with DataFrame in pandas/Python?

Suppose we have a df with a sum row, as in the DataFrame below (thanks so much for @jezrael's answer here). We have many different dfs like this one, each with different columns. Is it possible to wrap the three lines of code below in a function?
df.columns=['value_a','value_b','name','up_or_down','difference']
# from here
df.loc['sum'] = df[['value_a','value_b','difference']].sum()
df1 = df[['value_a','value_b','difference']].sum().to_frame().T
df = pd.concat([df1, df], ignore_index=True)
# end here
df
value_a value_b name up_or_down difference
project_name
sum 27.56 25.04 -1.31
2021-project11 0.43 0.48 2021-project11 up 0.05
2021-project1 0.62 0.56 2021-project1 down -0.06
2021-project2 0.51 0.47 2021-project2 down -0.04
2021-porject3 0.37 0.34 2021-porject3 down -0.03
2021-porject4 0.64 0.61 2021-porject4 down -0.03
2021-project5 0.32 0.25 2021-project5 down -0.07
2021-project6 0.75 0.81 2021-project6 up 0.06
2021-project7 0.60 0.60 2021-project7 down 0.00
2021-project8 0.85 0.74 2021-project8 down -0.11
2021-project10 0.67 0.67 2021-project10 down 0.00
2021-project9 0.73 0.73 2021-project9 down 0.00
2021-project11 0.54 0.54 2021-project11 down 0.00
2021-project12 0.40 0.40 2021-project12 down 0.00
2021-project13 0.76 0.77 2021-project13 up 0.01
2021-project14 1.16 1.28 2021-project14 up 0.12
2021-project15 1.01 0.94 2021-project15 down -0.07
2021-project16 1.23 1.24 2021-project16 up 0.01
2022-project17 0.40 0.36 2022-project17 down -0.04
2022-project_11 0.40 0.40 2022-project_11 down 0.00
2022-project4 1.01 0.80 2022-project4 down -0.21
2022-project1 0.65 0.67 2022-project1 up 0.02
2022-project2 0.75 0.57 2022-project2 down -0.18
2022-porject3 0.32 0.32 2022-porject3 down 0.00
2022-project18 0.91 0.56 2022-project18 down -0.35
2022-project5 0.84 0.89 2022-project5 up 0.05
2022-project19 0.61 0.48 2022-project19 down -0.13
2022-project6 0.77 0.80 2022-project6 up 0.03
2022-project20 0.63 0.54 2022-project20 down -0.09
2022-project8 0.59 0.55 2022-project8 down -0.04
2022-project21 0.58 0.54 2022-project21 down -0.04
2022-project10 0.76 0.76 2022-project10 down 0.00
2022-project9 0.70 0.71 2022-project9 up 0.01
2022-project22 0.62 0.56 2022-project22 down -0.06
2022-project23 2.03 1.74 2022-project23 down -0.29
2022-project12 0.39 0.39 2022-project12 down 0.00
2022-project24 1.35 1.55 2022-project24 up 0.20
project25 0.45 0.42 project25 down -0.03
project26 0.53 NaN project26 down NaN
project27 0.68 NaN project27 down NaN
Can I add a function with conditions like below, and our other df values can use the function directly?
def sum_handler(x):
    if .......
        return .....
    elif .......
        return .....
    else
        return .....
Thanks so much for any advice
You could try a different approach to summing your dataframe, as shown in this answer.
df.loc['Total'] = df.sum(numeric_only=True, axis=0)
Since this is a single line of code, there is no need for a custom function. But for future reference, you can create a custom function and apply it to a dataframe like this:
import pandas as pd
def double_columns(df: pd.DataFrame, columns: list[str]):
    """ Doubles chosen columns of a dataframe """
    df[columns] = df[columns] * 2
    return df

df = pd.DataFrame({'col1': [1,2], 'col2': [2,3]})
df = double_columns(df, ['col1'])
print(df)
would return
col1 col2
0 2 2
1 4 3
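If you do want the sum row wrapped in a reusable function for frames with different columns, something along these lines might work. It is a rough sketch, not a drop-in answer: prepend_sum_row and its label parameter are hypothetical names, and the two-row frame is made-up demo data using the question's column names.

```python
import pandas as pd


def prepend_sum_row(frame: pd.DataFrame, label: str = 'sum') -> pd.DataFrame:
    """Hypothetical helper: return a copy of frame with a totals row on top."""
    # Sum only the numeric columns, whatever they happen to be called
    totals = frame.sum(numeric_only=True).to_frame().T
    totals.index = [label]
    # Non-numeric columns have no total, so they are NaN in the new row
    return pd.concat([totals, frame]).reindex(columns=frame.columns)


demo = pd.DataFrame({
    'value_a': [0.43, 0.62],
    'value_b': [0.48, 0.56],
    'name': ['2021-project11', '2021-project1'],
    'up_or_down': ['up', 'down'],
    'difference': [0.05, -0.06],
})
out = prepend_sum_row(demo)
print(out)
```

Because the numeric columns are picked with numeric_only=True, the same function can be called on any of your other dataframes without listing column names.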

How do you give weights to dataframe columns iteratively for weighted mean average?

I have a dataframe with multiple columns having numerical float values. What I want to do is give fractional weights to each column and calculate its average to store and append it to the same df.
Let's say we have the columns: s1, s2, s3
I want to give the weights: w1, w2, w3 to them respectively
I was able to do this manually with all the values in hand, but when I move the columns and weights into lists and iterate over them, I get an error.
Both versions are below: the loop that fails, and the manual code that works but needs every column and weight written out by hand.
Code which didn't work:
score_df["weighted_avg"] += weight * score_df[feature]
Manual Code which worked but not with lists:
df["weighted_scores"] = 0.5*df["s1"] + 0.25*df["s2"] + 0.25*df["s3"]
We can use numpy broadcasting for this, since weights has the same shape as your column axis:
# given the following example df
df = pd.DataFrame(np.random.rand(10,3), columns=["s1", "s2", "s3"])
print(df)
s1 s2 s3
0 0.49 1.00 0.50
1 0.65 0.87 0.75
2 0.45 0.85 0.87
3 0.91 0.53 0.30
4 0.96 0.44 0.50
5 0.67 0.87 0.24
6 0.87 0.41 0.29
7 0.06 0.15 0.73
8 0.76 0.92 0.69
9 0.92 0.28 0.29
weights = [0.5, 0.25, 0.25]
df["weighted_scores"] = df.mul(weights).sum(axis=1)
print(df)
s1 s2 s3 weighted_scores
0 0.49 1.00 0.50 0.62
1 0.65 0.87 0.75 0.73
2 0.45 0.85 0.87 0.66
3 0.91 0.53 0.30 0.66
4 0.96 0.44 0.50 0.71
5 0.67 0.87 0.24 0.61
6 0.87 0.41 0.29 0.61
7 0.06 0.15 0.73 0.25
8 0.76 0.92 0.69 0.78
9 0.92 0.28 0.29 0.60
You can use dot
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(10,3), columns=["s1", "s2", "s3"])
df['weighted_scores'] = df.dot([.5,.25,.25])
df
Out
s1 s2 s3 weighted_scores
0 0.053543 0.659316 0.033540 0.199985
1 0.631627 0.257241 0.494959 0.503863
2 0.220939 0.870247 0.875165 0.546822
3 0.890487 0.519320 0.944459 0.811188
4 0.029416 0.016780 0.987503 0.265779
5 0.843882 0.784933 0.677096 0.787448
6 0.396092 0.297580 0.965454 0.513805
7 0.109894 0.011217 0.443796 0.168700
8 0.202096 0.637105 0.959876 0.500293
9 0.847020 0.949703 0.668615 0.828090
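If you would rather keep the loop from the question, here is a sketch of how it could be made to work. The error message isn't shown, so this assumes the problem was that the "weighted_avg" column did not exist before the first +=; the weights dict and the small score_df are made up for the example.

```python
import pandas as pd

score_df = pd.DataFrame({
    's1': [1.0, 2.0],
    's2': [2.0, 4.0],
    's3': [4.0, 8.0],
})
# Assumed layout: one weight per column name
weights = {'s1': 0.5, 's2': 0.25, 's3': 0.25}

# The accumulator column has to exist before += can add to it
score_df['weighted_avg'] = 0.0
for feature, weight in weights.items():
    score_df['weighted_avg'] += weight * score_df[feature]

# Vectorised equivalent: the Series of weights aligns on column names
vectorised = score_df[list(weights)].mul(pd.Series(weights)).sum(axis=1)
print(score_df)
```

Keeping the weights in a dict (or a Series) keyed by column name means the alignment does not depend on column order.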

Drop pandas rows if value is not between two other values on the same column

I have the following dataframe:
>>> mes1 mes2 mes3 mes4 mes5
A1 0.45 0.21 0.53 0.33 0.11
A2 0.44 0.32 0.11 0.38 0.91
A3 0.78 0.31 0.53 0.32 0.14
A4 0.12 0.33 0.56 0.43 0.12
posUp 0.52 0.40 0.62 0.48 0.54
posDown 0.32 0.15 0.45 0.24 0.05
I want to filter my dataframe so that I am left only with the rows whose values lie between the values of "posDown" and "posUp" in every column, so the result should be:
>>> mes1 mes2 mes3 mes4 mes5
A1 0.45 0.21 0.53 0.33 0.11
posUp 0.52 0.40 0.62 0.48 0.54
posDown 0.32 0.15 0.45 0.24 0.05
I have tried to do it by slicing the dataframe into series and then applying a condition like this:
for i in df:
    db=df[i]
    vmin=db.loc['posUp']
    vmax=db.loc['posDown']
    db=db[(db>vmin)&(db<vmax)]
and then I wanted to drop the rows that are missing from the last db filter, but it didn't keep anything, and when I print db I get "Series([],Name: ..."
Besides that, I believe there is a more convenient / efficient way to do it than for loops.
So my end goal is to keep only the rows whose values are between posDown and posUp in all the columns.
Try with le and ge:
mask = (df.le(df.loc['posUp']) # compare with `posUp` row-wise
& df.ge(df.loc['posDown']) # compare with `posDown` row-wise
).all(1) # check for all True along the rows
df[mask]
Output:
mes1 mes2 mes3 mes4 mes5
A1 0.45 0.21 0.53 0.33 0.11
posUp 0.52 0.40 0.62 0.48 0.54
posDown 0.32 0.15 0.45 0.24 0.05
You can try all after sub. PS: A3 should not be included, since its mes1 value is 0.78.
out = df[(df.sub(df.loc['posUp']).le(0) & df.sub(df.loc['posDown']).ge(0)).all(1)]
Out[107]:
mes1 mes2 mes3 mes4 mes5
A1 0.45 0.21 0.53 0.33 0.11
posUp 0.52 0.40 0.62 0.48 0.54
posDown 0.32 0.15 0.45 0.24 0.05
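As a side note on why the loop in the question returned an empty Series: vmin is read from 'posUp' and vmax from 'posDown', so the bounds are swapped and no value can be above the upper bound and below the lower one at the same time. Here is a self-contained sketch of the vectorised filter on the question's data; upper, lower and filtered are just names chosen for this example.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'mes1': [0.45, 0.44, 0.78, 0.12, 0.52, 0.32],
        'mes2': [0.21, 0.32, 0.31, 0.33, 0.40, 0.15],
        'mes3': [0.53, 0.11, 0.53, 0.56, 0.62, 0.45],
        'mes4': [0.33, 0.38, 0.32, 0.43, 0.48, 0.24],
        'mes5': [0.11, 0.91, 0.14, 0.12, 0.54, 0.05],
    },
    index=['A1', 'A2', 'A3', 'A4', 'posUp', 'posDown'],
)

upper = df.loc['posUp']    # per-column upper bound
lower = df.loc['posDown']  # per-column lower bound

# Compare every row against the bounds column by column, then require
# the condition to hold in all columns of a row
mask = (df.ge(lower, axis=1) & df.le(upper, axis=1)).all(axis=1)
filtered = df[mask]
print(filtered)
```

The bounds are inclusive here, which is why the posUp and posDown rows themselves survive the filter.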

Pandas histogram plot with kde?

I have a Pandas dataframe (Dt) like this:
Pc Cvt C1 C2 C3 C4 C5 C6 C7 C8 C9 C10
0 1 2 0.08 0.17 0.16 0.31 0.62 0.66 0.63 0.52 0.38
1 2 2 0.09 0.15 0.13 0.49 0.71 1.28 0.42 1.04 0.43
2 3 2 0.13 0.24 0.22 0.17 0.66 0.17 0.28 0.11 0.30
3 4 1 0.21 0.10 0.23 0.08 0.53 0.14 0.59 0.06 0.53
4 5 1 0.16 0.21 0.18 0.13 0.44 0.08 0.29 0.12 0.52
5 6 1 0.14 0.14 0.13 0.20 0.29 0.35 0.40 0.29 0.53
6 7 1 0.21 0.16 0.19 0.21 0.28 0.23 0.40 0.19 0.52
7 8 1 0.31 0.16 0.34 0.19 0.60 0.32 0.56 0.30 0.55
8 9 1 0.20 0.19 0.26 0.19 0.63 0.30 0.68 0.22 0.58
9 10 2 0.12 0.18 0.13 0.22 0.59 0.40 0.50 0.24 0.36
10 11 2 0.10 0.10 0.19 0.17 0.89 0.36 0.65 0.23 0.37
11 12 2 0.19 0.20 0.17 0.17 0.38 0.14 0.48 0.08 0.36
12 13 1 0.16 0.17 0.15 0.13 0.35 0.12 0.50 0.09 0.52
13 14 2 0.19 0.19 0.29 0.16 0.62 0.19 0.43 0.14 0.35
14 15 2 0.01 0.16 0.17 0.20 0.89 0.38 0.63 0.27 0.46
15 16 2 0.09 0.19 0.33 0.15 1.11 0.16 0.87 0.16 0.29
16 17 2 0.07 0.18 0.19 0.15 0.61 0.19 0.37 0.15 0.36
17 18 2 0.14 0.23 0.23 0.20 0.67 0.38 0.45 0.27 0.33
18 19 1 0.27 0.15 0.20 0.10 0.40 0.05 0.53 0.02 0.52
19 20 1 0.12 0.13 0.18 0.22 0.60 0.49 0.66 0.39 0.66
20 21 2 0.15 0.20 0.18 0.32 0.74 0.58 0.51 0.45 0.37
.
.
.
From this I want to plot a histogram with a KDE for each column from C1 to C10, arranged in a grid just like the one I get when I plot with pandas:
Dt.iloc[:,2:].hist()
But so far I have not been able to add the KDE to each histogram; I want something like this:
Any ideas on how to accomplish this?
You want to first plot your histogram then plot the kde on a secondary axis.
Minimal, Complete, and Verifiable Example (MCVE)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.DataFrame(np.random.randn(1000, 4)).add_prefix('C')
k = len(df.columns)
n = 2
m = (k - 1) // n + 1
fig, axes = plt.subplots(m, n, figsize=(n * 5, m * 3))
for i, (name, col) in enumerate(df.items()):  # iteritems() was removed in pandas 2.0
    r, c = i // n, i % n
    ax = axes[r, c]
    col.hist(ax=ax)
    ax2 = col.plot.kde(ax=ax, secondary_y=True, title=name)
    ax2.set_ylim(0)
fig.tight_layout()
How It Works
Keep track of total number of subplots
k = len(df.columns)
n will be the number of chart columns. Change this to suit individual needs. m will be the calculated number of required rows based on k and n
n = 2
m = (k - 1) // n + 1
Create a figure and array of axes with required number of rows and columns.
fig, axes = plt.subplots(m, n, figsize=(n * 5, m * 3))
Iterate through columns, tracking the column name and which number we are at i. Within each iteration, plot.
for i, (name, col) in enumerate(df.items()):  # iteritems() was removed in pandas 2.0
    r, c = i // n, i % n
    ax = axes[r, c]
    col.hist(ax=ax)
    ax2 = col.plot.kde(ax=ax, secondary_y=True, title=name)
    ax2.set_ylim(0)
Use tight_layout() as an easy way to sharpen up the layout spacing
fig.tight_layout()
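One caveat with the layout above: axes[r, c] only works when the grid really is two-dimensional, and an odd number of columns leaves an empty panel. The variation below is a sketch rather than the method described above. It normalises the histogram with density=True so the KDE can share the same axis instead of a secondary one, and it hides unused panels. hist_kde_grid and ncols are hypothetical names, and the KDE needs scipy installed.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch also runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def hist_kde_grid(frame, ncols=2):
    """Hypothetical helper: one histogram plus KDE per column, in a grid."""
    k = len(frame.columns)
    nrows = (k - 1) // ncols + 1
    # squeeze=False always returns a 2-D array of axes, even for one row
    fig, axes = plt.subplots(nrows, ncols,
                             figsize=(ncols * 5, nrows * 3), squeeze=False)
    flat = axes.ravel()
    for ax, (name, col) in zip(flat, frame.items()):
        col.hist(ax=ax, density=True)    # density scale matches the KDE
        col.plot.kde(ax=ax, title=name)  # drawn on the same axis
    for ax in flat[k:]:
        ax.set_visible(False)            # hide panels with no data
    fig.tight_layout()
    return fig, axes


rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.standard_normal((200, 3))).add_prefix('C')
fig, axes = hist_kde_grid(demo)
```

Remove the matplotlib.use('Agg') line when running interactively and call plt.show() at the end.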
Here is a pure seaborn solution, using FacetGrid.map_dataframe as explained here.
Stealing the example from @piRSquared:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(1000, 4)).add_prefix('C')
Get the data in the required format:
df = df.stack().reset_index(level=1, name="val")
Result:
level_1 val
0 C0 0.879714
0 C1 -0.927096
0 C2 -0.929429
0 C3 -0.571176
1 C0 -1.127939
Then:
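import matplotlib.pyplot as plt
import seaborn as sns
def distplot(x, **kwargs):
    ax = plt.gca()
    data = kwargs.pop("data")
    # sns.distplot is deprecated; histplot with kde=True draws the same kind of plot
    sns.histplot(x=data[x], kde=True, stat="density", ax=ax, **kwargs)
# FacetGrid's `size` argument was renamed to `height` in seaborn 0.9
g = sns.FacetGrid(df, col="level_1", col_wrap=2, height=3.5)
g = g.map_dataframe(distplot, "val")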
You can adjust col_wrap as needed.

Pandas stack column pairs

I have a pandas dataframe with about 100 columns of following type:
X1 Y1 X2 Y2 X3 Y3
0.78 0.22 0.19 0.42 0.04 0.65
0.43 0.29 0.43 0.84 0.14 0.42
0.57 0.70 0.59 0.86 0.11 0.40
0.92 0.52 0.81 0.33 0.54 1.00
where (X,Y) are basically pairs of values
I need to create the following from above.
X Y
0.78 0.22
0.43 0.29
0.57 0.70
0.92 0.52
0.19 0.42
0.43 0.84
0.59 0.86
0.81 0.33
0.04 0.65
0.14 0.42
0.11 0.40
0.54 1.00
i.e. stack all the X columns which are odd numbered and then stack all the Y columns which are even numbered.
I have no clue where to even start. For a small number of columns I could easily have used the column names.
You can use lreshape, for column names use list comprehension:
x = [col for col in df.columns if 'X' in col]
y = [col for col in df.columns if 'Y' in col]
df = pd.lreshape(df, {'X': x,'Y': y})
print (df)
X Y
0 0.78 0.22
1 0.43 0.29
2 0.57 0.70
3 0.92 0.52
4 0.19 0.42
5 0.43 0.84
6 0.59 0.86
7 0.81 0.33
8 0.04 0.65
9 0.14 0.42
10 0.11 0.40
11 0.54 1.00
Solution with MultiIndex and stack:
df.columns = [np.arange(len(df.columns)) % 2, np.arange(len(df.columns)) // 2]
df = df.stack().reset_index(drop=True)
df.columns = ['X','Y']
print (df)
X Y
0 0.78 0.22
1 0.19 0.42
2 0.04 0.65
3 0.43 0.29
4 0.43 0.84
5 0.14 0.42
6 0.57 0.70
7 0.59 0.86
8 0.11 0.40
9 0.92 0.52
10 0.81 0.33
11 0.54 1.00
It may also be worth noting that you could just construct a new DataFrame explicitly with the X-Y values. This will most likely be quicker, but it assumes that the X-Y column pairs are the entirety of your DataFrame.
pd.DataFrame(dict(X=df.values[:,::2].reshape(-1),
Y=df.values[:,1::2].reshape(-1)))
Demo
>>> pd.DataFrame(dict(X=df.values[:,::2].reshape(-1),
Y=df.values[:,1::2].reshape(-1)))
X Y
0 0.78 0.22
1 0.19 0.42
2 0.04 0.65
3 0.43 0.29
4 0.43 0.84
5 0.14 0.42
6 0.57 0.70
7 0.59 0.86
8 0.11 0.40
9 0.92 0.52
10 0.81 0.33
11 0.54 1.00
You can use the documented pd.wide_to_long but you will need to use a 'dummy' column to uniquely identify each row. You can drop this column later.
pd.wide_to_long(df.reset_index(),
stubnames=['X', 'Y'],
i='index',
j='dropme').reset_index(drop=True)
X Y
0 0.78 0.22
1 0.43 0.29
2 0.57 0.70
3 0.92 0.52
4 0.19 0.42
5 0.43 0.84
6 0.59 0.86
7 0.81 0.33
8 0.04 0.65
9 0.14 0.42
10 0.11 0.40
11 0.54 1.00
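Note that the stack and reshape answers above return the rows interleaved (row 0 of every pair first), while lreshape and wide_to_long return them pair by pair, as in the question. If the pair-by-pair order matters and you still want plain NumPy, a transpose before the final reshape should do it. This is a sketch that assumes the columns are strictly alternating X, Y pairs; n_pairs, stacked and out are names made up here.

```python
import pandas as pd

df = pd.DataFrame({
    'X1': [0.78, 0.43, 0.57, 0.92], 'Y1': [0.22, 0.29, 0.70, 0.52],
    'X2': [0.19, 0.43, 0.59, 0.81], 'Y2': [0.42, 0.84, 0.86, 0.33],
    'X3': [0.04, 0.14, 0.11, 0.54], 'Y3': [0.65, 0.42, 0.40, 1.00],
})

n_pairs = df.shape[1] // 2
stacked = (
    df.to_numpy()
      .reshape(len(df), n_pairs, 2)  # axes: (row, pair, x-or-y)
      .transpose(1, 0, 2)            # axes: (pair, row, x-or-y)
      .reshape(-1, 2)                # all rows of pair 1, then pair 2, ...
)
out = pd.DataFrame(stacked, columns=['X', 'Y'])
print(out)
```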
