Add selected interactions as columns to pandas dataframe - python

I'm fairly new to pandas and Python. I'm trying to compute a few selected interaction terms (out of all possible interactions) in a data frame and return them as new features in the df.
My solution was to calculate the interactions of interest using sklearn's PolynomialFeatures() and attach them to the df in a for loop. See example:
import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
np.random.seed(1111)
a1 = np.random.randint(2, size = (5,3))
a2 = np.round(np.random.random((5,3)),2)
df = pd.DataFrame(np.concatenate([a1, a2], axis = 1), columns = ['a','b','c','d','e','f'])
combinations = [['a', 'e'], ['a', 'f'], ['b', 'f']]
for comb in combinations:
    polynomizer = PolynomialFeatures(interaction_only=True, include_bias=False).fit(df[comb])
    newcol_nam = polynomizer.get_feature_names(comb)[2]
    newcol_val = polynomizer.transform(df[comb])[:,2]
    df[newcol_nam] = newcol_val
df
a b c d e f a e a f b f
0 0.0 1.0 1.0 0.51 0.45 0.10 0.00 0.00 0.10
1 1.0 0.0 0.0 0.67 0.36 0.23 0.36 0.23 0.00
2 0.0 0.0 0.0 0.97 0.79 0.02 0.00 0.00 0.00
3 0.0 1.0 0.0 0.44 0.37 0.52 0.00 0.00 0.52
4 0.0 0.0 0.0 0.16 0.02 0.94 0.00 0.00 0.00
Another solution would be to run
PolynomialFeatures(2, interaction_only=True, include_bias=False).fit_transform(df)
and then drop the interactions I'm not interested in.
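For reference, that fit-everything-then-drop approach would look roughly like this (just a sketch, not profiled; get_feature_names was renamed get_feature_names_out in newer scikit-learn versions):
poly = PolynomialFeatures(2, interaction_only=True, include_bias=False)
arr = poly.fit_transform(df)
names = poly.get_feature_names(df.columns)
wanted = [' '.join(c) for c in combinations]
df = df.join(pd.DataFrame(arr, columns=names, index=df.index)[wanted])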
However, neither option is ideal in terms of performance and I'm wondering if there is a better solution.

As commented, you can try:
df = df.join(pd.DataFrame({
    f'{x} {y}': df[x]*df[y] for x, y in combinations
}))
Or simply:
for comb in combinations:
    df[' '.join(comb)] = df[comb].prod(1)
Output:
a b c d e f a e a f b f
0 0.0 1.0 1.0 0.51 0.45 0.10 0.00 0.00 0.10
1 1.0 0.0 0.0 0.67 0.36 0.23 0.36 0.23 0.00
2 0.0 0.0 0.0 0.97 0.79 0.02 0.00 0.00 0.00
3 0.0 1.0 0.0 0.44 0.37 0.52 0.00 0.00 0.52
4 0.0 0.0 0.0 0.16 0.02 0.94 0.00 0.00 0.00
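Note that prod(1) is just prod(axis=1), the row-wise product of the selected columns. As one more sketch of a vectorized variant (an assumption on my part, not part of the comment above), df.eval can build the same columns from expression strings:
for x, y in combinations:
    df[f'{x} {y}'] = df.eval(f'{x} * {y}')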

Related

How to write a function with DataFrame in pandas/Python?

Suppose we have a df with a sum() row like the DataFrame below (thanks so much for #jezrael's answer here). We have many different dfs like this one, with different columns. Is it possible to wrap those three lines of code in a function?
df.columns=['value_a','value_b','name','up_or_down','difference']
# from here
df.loc['sum'] = df[['value_a','value_b','difference']].sum()
df1 = df[['value_a','value_b','difference']].sum().to_frame().T
df = pd.concat([df1, df], ignore_index=True)
# end here
df
value_a value_b name up_or_down difference
project_name
sum 27.56 25.04 -1.31
2021-project11 0.43 0.48 2021-project11 up 0.05
2021-project1 0.62 0.56 2021-project1 down -0.06
2021-project2 0.51 0.47 2021-project2 down -0.04
2021-porject3 0.37 0.34 2021-porject3 down -0.03
2021-porject4 0.64 0.61 2021-porject4 down -0.03
2021-project5 0.32 0.25 2021-project5 down -0.07
2021-project6 0.75 0.81 2021-project6 up 0.06
2021-project7 0.60 0.60 2021-project7 down 0.00
2021-project8 0.85 0.74 2021-project8 down -0.11
2021-project10 0.67 0.67 2021-project10 down 0.00
2021-project9 0.73 0.73 2021-project9 down 0.00
2021-project11 0.54 0.54 2021-project11 down 0.00
2021-project12 0.40 0.40 2021-project12 down 0.00
2021-project13 0.76 0.77 2021-project13 up 0.01
2021-project14 1.16 1.28 2021-project14 up 0.12
2021-project15 1.01 0.94 2021-project15 down -0.07
2021-project16 1.23 1.24 2021-project16 up 0.01
2022-project17 0.40 0.36 2022-project17 down -0.04
2022-project_11 0.40 0.40 2022-project_11 down 0.00
2022-project4 1.01 0.80 2022-project4 down -0.21
2022-project1 0.65 0.67 2022-project1 up 0.02
2022-project2 0.75 0.57 2022-project2 down -0.18
2022-porject3 0.32 0.32 2022-porject3 down 0.00
2022-project18 0.91 0.56 2022-project18 down -0.35
2022-project5 0.84 0.89 2022-project5 up 0.05
2022-project19 0.61 0.48 2022-project19 down -0.13
2022-project6 0.77 0.80 2022-project6 up 0.03
2022-project20 0.63 0.54 2022-project20 down -0.09
2022-project8 0.59 0.55 2022-project8 down -0.04
2022-project21 0.58 0.54 2022-project21 down -0.04
2022-project10 0.76 0.76 2022-project10 down 0.00
2022-project9 0.70 0.71 2022-project9 up 0.01
2022-project22 0.62 0.56 2022-project22 down -0.06
2022-project23 2.03 1.74 2022-project23 down -0.29
2022-project12 0.39 0.39 2022-project12 down 0.00
2022-project24 1.35 1.55 2022-project24 up 0.20
project25 0.45 0.42 project25 down -0.03
project26 0.53 NaN project26 down NaN
project27 0.68 NaN project27 down NaN
Can I write a function with conditions like the one below, so that our other dfs can use it directly?
def sum_handler(x):
    if .......
        return .....
    elif .......
        return .....
    else:
        return .....
Thanks so much for any advice
You could try a different approach for summing up your dataframe, as shown in this answer:
df.loc['Total'] = df.sum(numeric_only=True, axis=0)
Since this is a single line of code, there is no need to create a custom function for it. But for future reference, you can create a custom function and apply it to a dataframe like this:
import pandas as pd
def double_columns(df: pd.DataFrame, columns: list[str]):
""" Doubles chosen columns of a dataframe """
df[columns] = df[columns] * 2
return df
df = pd.DataFrame({'col1': [1,2], 'col2': [2,3]})
df = double_columns(df, ['col1'])
print(df)
would return
col1 col2
0 2 2
1 4 3
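Applying the same pattern to the three lines from the question, a minimal sketch of a reusable helper could look like this (the function name and column list are only placeholders):
def add_sum_row(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Prepend a row containing the sum of the given columns."""
    sums = df[columns].sum().to_frame().T
    return pd.concat([sums, df], ignore_index=True)
# usage, mirroring the question:
# df = add_sum_row(df, ['value_a', 'value_b', 'difference'])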

Add values to specific rows and columns in pandas df

I'm trying to add values to a pandas dataframe based on the inputs of a user and an agent. This is an example of what I am working on so far.
import numpy as np
import pandas as pd
import random
ls = np.zeros((9,3))
choices = ['R','P','S']
df = pd.DataFrame(ls, columns=['R','P','S'], index = ['RR','RP','RS','PR','PP','PS','SR','SP','SS'])
for _ in range(100):
    user_choice = random.choice(choices)
    agent_choice = random.choice(choices)
    #print(user_choice, agent_choice)
    for _ in range(len(df)):
        for _ in range(len(df['R'])):
            df[user_choice + agent_choice][agent_choice] += 1
Desired result will look something like this (image omitted):
Any help will be much appreciated
Not sure this is really what you want, but Python, NumPy, and Pandas provide some nice conveniences to do these things:
>>> import random
>>> import numpy as np, pandas as pd
>>> from itertools import product
>>> choices = 'RPS'
>>> df = pd.DataFrame(np.zeros((9,3)), columns=list(choices), index=[''.join(l) for l in product(choices, repeat=2)])
>>> user_choices = np.array([random.choice(choices) for _ in range(100)], dtype=str)
>>> agent_choices = np.array([random.choice(choices) for _ in range(100)], dtype=str)
>>> for ac, cc in zip(agent_choices, np.char.add(user_choices, agent_choices)):
...     if ac == cc[-1]:
...         df[ac][cc] += 1
...
>>> df
R P S
RR 14.0 0.0 0.0
RP 0.0 14.0 0.0
RS 0.0 0.0 7.0
PR 11.0 0.0 0.0
PP 0.0 13.0 0.0
PS 0.0 0.0 8.0
SR 10.0 0.0 0.0
SP 0.0 8.0 0.0
SS 0.0 0.0 15.0
Since you seem to want it normalized to a percentage:
>>> df / 100
R P S
RR 0.14 0.00 0.00
RP 0.00 0.14 0.00
RS 0.00 0.00 0.07
PR 0.11 0.00 0.00
PP 0.00 0.13 0.00
PS 0.00 0.00 0.08
SR 0.10 0.00 0.00
SP 0.00 0.08 0.00
SS 0.00 0.00 0.15
The obvious issue is that this will always give you a sparse matrix. If you're looking to count (user_choice, agent_choice) pairs by agent_choice, then the only cells that can ever be filled in are those where the second character of the row index matches the column header. You may as well collapse that: make both the index and the column headers ['R', 'P', 'S'] and count how many times a user chose 'R' when the agent chose 'R', and so on.
>>> df = pd.DataFrame(np.zeros((3,3)), columns=list(choices), index=list(choices))
>>> for a, c in zip(agent_choices, user_choices):
...     df[a][c] += 1
...
>>> df
R P S
R 14.0 14.0 7.0
P 11.0 13.0 8.0
S 10.0 8.0 15.0
>>> df / 100
R P S
R 0.14 0.14 0.07
P 0.11 0.13 0.08
S 0.10 0.08 0.15
You can see that it contains the same information in a smaller matrix.
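As a further sketch (not part of the original answer), pd.crosstab can build the same 3x3 count table directly from the two choice arrays, and normalize='all' turns the counts into proportions:
counts = pd.crosstab(pd.Series(user_choices, name='user'),
                     pd.Series(agent_choices, name='agent'))
proportions = pd.crosstab(pd.Series(user_choices, name='user'),
                          pd.Series(agent_choices, name='agent'),
                          normalize='all')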

Get proportionate values of columns in a dataframe - Pandas

I have a dataframe like this,
ds 0 1 2 4 5 6
0 1991Q3 nan nan nan nan 1.0 nan
1 2014Q2 1.0 3.0 nan nan 1.0 nan
2 2014Q3 1.0 nan nan 1.0 4.0 nan
3 2014Q4 nan nan nan 2.0 3.0 nan
4 2015Q1 nan 1.0 2.0 4.0 4.0 nan
I would like the proportions for each column 0-6 like this,
ds 0 1 2 4 5 6
0 1991Q3 0.00 0.00 0.00 0.00 1.00 0.00
1 2014Q2 0.20 0.60 0.00 0.00 0.20 0.00
2 2014Q3 0.16 0.00 0.00 0.16 0.67 0.00
3 2014Q4 0.00 0.00 0.00 0.40 0.60 0.00
4 2015Q1 0.00 0.09 0.18 0.36 0.36 0.00
Is there a pandas way to do this? Any suggestion would be great.
You can do this:
df = df.replace(np.nan, 0)
df = df.set_index('ds')
In [3194]: df.div(df.sum(1),0).reset_index()
Out[3194]:
ds 0 1 2 4 5 6
0 1991Q3 0.00 0.00 0.00 0.00 1.00 0.00
1 2014Q2 0.20 0.60 0.00 0.00 0.20 0.00
2 2014Q3 0.17 0.00 0.00 0.17 0.67 0.00
3 2014Q4 0.00 0.00 0.00 0.40 0.60 0.00
4 2015Q1 0.00 0.09 0.18 0.36 0.36 0.00
OR you can use df.apply:
In [3196]: df = df.replace(np.nan, 0)
In [3197]: df.iloc[:,1:] = df.iloc[:,1:].apply(lambda x: x/x.sum(), axis=1)
In [3198]: df
Out[3198]:
ds 0 1 2 4 5 6
0 1991Q3 0.00 0.00 0.00 0.00 1.00 0.00
1 2014Q2 0.20 0.60 0.00 0.00 0.20 0.00
2 2014Q3 0.17 0.00 0.00 0.17 0.67 0.00
3 2014Q4 0.00 0.00 0.00 0.40 0.60 0.00
4 2015Q1 0.00 0.09 0.18 0.36 0.36 0.00
Set the first column as the index, fill the null entries with 0, and divide the dataframe by the sum of each row:
res = df.set_index("ds")
res.fillna(0).div(res.sum(1),axis=0)
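A compact variant of the same idea, as a sketch (the .round(2) is only there to mimic the display in the question):
tmp = df.set_index('ds').fillna(0)
out = tmp.div(tmp.sum(axis=1), axis=0).round(2).reset_index()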

Return column names for 3 highest values in rows

I'm trying to come up with a way to return the column names for the 3 highest values in each row of the table below. So far I've been able to return the highest value using idxmax but I haven't been able to figure out how to get the 2nd and 3rd highest.
Clust Stat1 Stat2 Stat3 Stat4 Stat5 Stat6
0 9 0.00 0.15 0.06 0.11 0.23 0.01
1 4 0.00 0.25 0.04 0.10 0.10 0.00
2 11 0.00 0.34 0.00 0.09 0.24 0.00
3 12 0.00 0.16 0.00 0.11 0.00 0.00
4 0 0.00 0.35 0.00 0.04 0.02 0.00
5 17 0.01 0.21 0.02 0.18 0.27 0.01
Expected output:
Clust Stat1 Stat2 Stat3 Stat4 Stat5 Stat6 TopThree
0 9 0.00 0.15 0.06 0.11 0.23 0.01 [Stat5,Stat2,Stat4]
1 4 0.00 0.25 0.04 0.10 0.10 0.00 [Stat2,Stat4,Stat5]
2 11 0.00 0.34 0.00 0.09 0.24 0.00 [Stat2,Stat5,Stat4]
3 12 0.00 0.16 0.00 0.19 0.00 0.01 [Stat4,Stat2,Stat6]
4 0 0.00 0.35 0.00 0.04 0.02 0.00 [Stat2,Stat4,Stat5]
5 17 0.01 0.21 0.02 0.18 0.27 0.01 [Stat5,Stat2,Stat4]
If anyone has ideas on how to do this I'd appreciate it.
Use numpy.argsort to get the positions of the sorted values, excluding the first column:
a = df.iloc[:, 1:].to_numpy()
df['TopThree'] = df.columns[1:].to_numpy()[np.argsort(-a, axis=1)[:, :3]].tolist()
print (df)
Clust Stat1 Stat2 Stat3 Stat4 Stat5 Stat6 TopThree
0 9 0.00 0.15 0.06 0.11 0.23 0.01 [Stat5, Stat2, Stat4]
1 4 0.00 0.25 0.04 0.10 0.10 0.00 [Stat2, Stat4, Stat5]
2 11 0.00 0.34 0.00 0.09 0.24 0.00 [Stat2, Stat5, Stat4]
3 12 0.00 0.16 0.00 0.11 0.00 0.00 [Stat2, Stat4, Stat1]
4 0 0.00 0.35 0.00 0.04 0.02 0.00 [Stat2, Stat4, Stat5]
5 17 0.01 0.21 0.02 0.18 0.27 0.01 [Stat5, Stat2, Stat4]
If performance is not important:
df['TopThree'] = df.iloc[:, 1:].apply(lambda x: x.nlargest(3).index.tolist(), axis=1)
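For wider frames, a partial sort can avoid fully sorting every row; here is a sketch (an assumption on my part, valid when there are at least four stat columns so kth=3 is in range):
a = df.iloc[:, 1:].to_numpy()
top3 = np.argpartition(-a, 3, axis=1)[:, :3]                     # 3 largest per row, unordered
order = np.argsort(-np.take_along_axis(a, top3, axis=1), axis=1)
top3 = np.take_along_axis(top3, order, axis=1)                   # ordered largest-first
df['TopThree'] = df.columns[1:].to_numpy()[top3].tolist()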

How to combine Date object with float Vector

I am reading two data frames from two separate CSVs and trying to combine them into a single data frame. Both df1 and df2 should be combined row by row; df1 contains floating-point numbers and df2 contains dates.
df1=pd.read_csv("Weights.csv")
print(df1.head(5))
df2=pd.read_csv("Date.csv")
print(df2.head(5))
0 1 2 3 4 5 6 7 8 9 10 11 12
0 0.06 0.06 -0.0 -0.0 0.11 0.06 0.37 0.01 0.05 0.10 -0.00 0.01 0.0
1 0.09 0.05 -0.0 -0.0 0.12 0.05 0.36 0.00 0.05 0.08 0.00 0.00 -0.0
2 0.14 0.07 -0.0 0.0 0.13 0.04 0.33 0.01 0.04 0.05 0.00 0.00 0.0
3 0.13 0.07 0.0 -0.0 0.12 0.06 0.34 0.01 0.05 0.04 0.01 0.00 -0.0
4 0.11 0.08 0.0 0.0 0.08 0.10 0.35 0.05 0.05 0.06 0.02 0.00 0.0
0
0 2010-12-29
1 2011-01-05
2 2011-01-12
3 2011-01-19
4 2011-01-26
I am running into problems using pd.concat in pandas.
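A minimal sketch of the row-by-row combination (assuming both CSVs have the same number of rows and default integer indexes; the 'date' column name is just an illustration):
df1 = df1.reset_index(drop=True)
df2 = df2.reset_index(drop=True)
combined = pd.concat([df2, df1], axis=1)
combined.columns = ['date'] + [str(c) for c in df1.columns]
print(combined.head())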
