Creating interaction terms in Python

I'm trying to create interaction terms in a dataset. Is there another (simpler) way of creating interaction terms of columns in a dataset? For example, creating interaction terms in combinations of columns 4:98 and 98:106. I tried looping over the columns using numpy arrays, but with the following code, the kernel keeps dying.
col1 = df.columns[4:98] #94 columns
col2 = df.columns[98:106] #8 columns
var1_np = df_np[:, 4:98]
var2_np = df_np[:, 98:106]
for i in range(94):
    for j in range(8):
        name = col1[i] +"*" + col2[j]
        df[name] = var1_np[:,i]*var2_np[:,j]
Here, df is the dataframe and df_np is df converted to a NumPy array.

You could use itertools.product, which is roughly equivalent to nested for-loops in a generator expression. Join each resulting pair of column names to build the new column name, then use the pandas prod method with axis=1 to multiply the two columns row by row.
import pandas as pd
import numpy as np
from itertools import product
#setup
np.random.seed(12345)
data = np.random.rand(5, 10).round(2)
df = pd.DataFrame(data)
df.columns = [f'col_{c}' for c in range(0,10)]
print(df)
#code
col1 = df.columns[3:5]
col2 = df.columns[5:8]
df_new = pd.DataFrame()
for i in product(col1, col2):
    name = "*".join(i)
    df_new[name] = df[list(i)].prod(axis=1)
print(df_new)
Output from df
col_0 col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8 col_9
0 0.93 0.32 0.18 0.20 0.57 0.60 0.96 0.65 0.75 0.65
1 0.75 0.96 0.01 0.11 0.30 0.66 0.81 0.87 0.96 0.72
2 0.64 0.72 0.47 0.33 0.44 0.73 0.99 0.68 0.79 0.17
3 0.03 0.80 0.90 0.02 0.49 0.53 0.60 0.05 0.90 0.73
4 0.82 0.50 0.81 0.10 0.22 0.26 0.47 0.46 0.71 0.18
Output from df_new
col_3*col_5 col_3*col_6 col_3*col_7 col_4*col_5 col_4*col_6 col_4*col_7
0 0.1200 0.1920 0.1300 0.3420 0.5472 0.3705
1 0.0726 0.0891 0.0957 0.1980 0.2430 0.2610
2 0.2409 0.3267 0.2244 0.3212 0.4356 0.2992
3 0.0106 0.0120 0.0010 0.2597 0.2940 0.0245
4 0.0260 0.0470 0.0460 0.0572 0.1034 0.1012
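If the kernel still dies in the full 94 x 8 case, one possible cause (an assumption, since the question shows no traceback) is adding 752 columns to df one at a time. A minimal sketch that computes every product in a single NumPy broadcast and builds the result frame once; the helper name interaction_terms is hypothetical:

```python
import numpy as np
import pandas as pd

def interaction_terms(df, cols_a, cols_b, sep="*"):
    """Return a new DataFrame with one product column per (a, b) pair."""
    a = df[cols_a].to_numpy()            # shape (n_rows, len(cols_a))
    b = df[cols_b].to_numpy()            # shape (n_rows, len(cols_b))
    # Broadcast to (n_rows, len(cols_a), len(cols_b)), then flatten the pairs.
    prod = (a[:, :, None] * b[:, None, :]).reshape(len(df), -1)
    # Same ordering as the flattened array: cols_a outer, cols_b inner.
    names = [f"{x}{sep}{y}" for x in cols_a for y in cols_b]
    return pd.DataFrame(prod, columns=names, index=df.index)

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((5, 10)).round(2),
                  columns=[f"col_{c}" for c in range(10)])
df_new = interaction_terms(df, list(df.columns[3:5]), list(df.columns[5:8]))
print(df_new.columns.tolist())
```

To attach the result to the original frame, pd.concat([df, df_new], axis=1) adds all the new columns in one step.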

Related

Matching to a specific year column in pandas

I am trying to take a "given" value and match it to a "year" in the same row using the following dataframe:
data = {
'Given' : [0.45, 0.39, 0.99, 0.58, None],
'Year 1' : [0.25, 0.15, 0.3, 0.23, 0.25],
'Year 2' : [0.39, 0.27, 0.55, 0.3, 0.4],
'Year 3' : [0.43, 0.58, 0.78, 0.64, 0.69],
'Year 4' : [0.65, 0.83, 0.95, 0.73, 0.85],
'Year 5' : [0.74, 0.87, 0.99, 0.92, 0.95]
}
df = pd.DataFrame(data)
print(df)
Output:
Given Year 1 Year 2 Year 3 Year 4 Year 5
0 0.45 0.25 0.39 0.43 0.65 0.74
1 0.39 0.15 0.27 0.58 0.83 0.87
2 0.99 0.30 0.55 0.78 0.95 0.99
3 0.58 0.23 0.30 0.64 0.73 0.92
4 NaN 0.25 0.40 0.69 0.85 0.95
However, the matching process has a few caveats. I am trying to match the given value to the closest year before calculating the time to the first "year" above 70%. So row 0 would match to "Year 3", and we can see in the same row that it will take two years to reach "Year 5", which is the first occurrence in the row above 70%.
For any "given" value already above 70%, we can just output "full", and for any "given" values that don't contain data, we can just output the first year above 70%. The output will look like the following:
Given Year 1 Year 2 Year 3 Year 4 Year 5 Output
0 0.45 0.25 0.39 0.43 0.65 0.74 2
1 0.39 0.15 0.27 0.58 0.83 0.87 2
2 0.99 0.30 0.55 0.78 0.95 0.99 full
3 0.58 0.23 0.30 0.64 0.73 0.92 1
4 NaN 0.25 0.40 0.69 0.85 0.95 4
It has taken me a horrendously long time to clean up this data so at the moment I can't think of a way to begin other than some use of .abs() to begin the matching process. All help appreciated.
Vectorized Pandas Approach:
Transpose with .T and call reset_index(drop=True) to replace the column names with integer positions, so that both dataframes have the same column labels and can be subtracted from each other in a vectorized way. pd.concat() with * builds a dataframe that repeats the Given column once per year column, so that you can take the absolute difference of the two dataframes instead of looping through columns.
Use idxmax and idxmin to identify the column numbers according to your criteria.
Use np.select according to your criteria.
import pandas as pd
import numpy as np
# identify 70% columns
pct_70 = (df.T.reset_index(drop=True).T > .7).idxmax(axis=1)
# identify column number of lowest absolute difference to Given
nearest_col = ((df.iloc[:,1:].T.reset_index(drop=True).T
                - pd.concat([df.iloc[:,0]] * len(df.columns[1:]), axis=1)
                .T.reset_index(drop=True).T)).abs().idxmin(axis=1)
# Generate an output series
output = pct_70 - nearest_col - 1
# Conditionally apply the output series
df['Output'] = np.select([output.gt(0),output.lt(0),output.isnull()],
                         [output, 'full', pct_70],np.nan)
df
Out[1]:
Given Year 1 Year 2 Year 3 Year 4 Year 5 Output
0 0.45 0.25 0.39 0.43 0.65 0.74 2.0
1 0.39 0.15 0.27 0.58 0.83 0.87 2.0
2 0.99 0.30 0.55 0.78 0.95 0.99 full
3 0.58 0.23 0.30 0.64 0.73 0.92 1.0
4 NaN 0.25 0.40 0.69 0.85 0.95 4
Here you go!
import numpy as np
def output(df):
    output = []
    for i in df.iterrows():
        row = i[1].to_list()
        given = row[0]
        compare = np.array(row[1:])
        # position of the first year above 70%
        first_70 = np.argmax(compare > 0.7)
        if np.isnan(given):
            output.append(first_70 + 1)
            continue
        if given > 0.7:
            output.append('full')
            continue
        # position of the year closest to the given value
        diff = np.abs(np.array(compare) - np.array(given))
        closest_year = diff.argmin()
        output.append(first_70 - closest_year)
    return output
df['output'] = output(df)
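The same rules can also be written without a Python loop. This is only a sketch under assumptions: every row is assumed to contain at least one year above 70% (argmax returns 0 otherwise), and the names years, given, first_70 and closest are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Given':  [0.45, 0.39, 0.99, 0.58, None],
    'Year 1': [0.25, 0.15, 0.3, 0.23, 0.25],
    'Year 2': [0.39, 0.27, 0.55, 0.3, 0.4],
    'Year 3': [0.43, 0.58, 0.78, 0.64, 0.69],
    'Year 4': [0.65, 0.83, 0.95, 0.73, 0.85],
    'Year 5': [0.74, 0.87, 0.99, 0.92, 0.95],
})

years = df.filter(like='Year').to_numpy()
given = df['Given'].to_numpy(dtype=float)
missing = np.isnan(given)

# 0-based position of the first year above 70% in each row
first_70 = (years > 0.7).argmax(axis=1)
# 0-based position of the year closest to Given (NaN rows are overwritten below)
closest = np.abs(years - np.nan_to_num(given)[:, None]).argmin(axis=1)

out = (first_70 - closest).astype(object)
out[missing] = first_70[missing] + 1   # no Given: report the first year above 70%
out[given > 0.7] = 'full'              # Given already above 70%
df['Output'] = out
print(df['Output'].tolist())
```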

Multiple time range selection in Pandas Python

I have time-series data in CSV format. I want to calculate the mean for several selected time periods in a single run of the script, e.g. 01-05-2017 to 30-04-2018, 01-05-2018 to 30-04-2019, and so on. Below is sample data.
I have a script, but it handles only one time period. I want to pass multiple time periods, as mentioned above.
from datetime import datetime
import pandas as pd
df = pd.read_csv(r'D:\Data\RT_2015_2020.csv', index_col=[0],parse_dates=[0])
z = df['2016-05-01' : '2017-04-30']
# Want to make like this way
#z = df[['2016-05-01' : '2017-04-30'], ['2017-05-01' : '2018-04-30']]
# It will calculate the mean for the selected time period
z.mean()
If you use the dates as the index, you can select the rows that fall within the desired range by slicing with date strings.
import pandas as pd
import numpy as np
import io
data = '''
Date Mean
18-05-2016 0.31
07-06-2016 0.32
17-07-2016 0.50
15-09-2016 0.62
25-10-2016 0.63
04-11-2016 0.56
24-11-2016 0.56
14-12-2016 0.22
13-01-2017 0.22
23-01-2017 0.23
12-02-2017 0.21
22-02-2017 0.21
'''
df = pd.read_csv(io.StringIO(data), sep=r'\s+')
# the dates are day-first (dd-mm-yyyy), so say so explicitly
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df.set_index('Date', inplace=True)
df.loc['2016'].head()
Mean
Date
2016-05-18 0.31
2016-06-07 0.32
2016-07-17 0.50
2016-09-15 0.62
2016-10-25 0.63
df.loc['2016-05-01':'2017-01-30']
Mean
Date
2016-05-18 0.31
2016-06-07 0.32
2016-07-17 0.50
2016-09-15 0.62
2016-10-25 0.63
2016-11-04 0.56
2016-11-24 0.56
2016-12-14 0.22
2017-01-13 0.22
2017-01-23 0.23
df.loc['2016-05-01':'2017-01-30'].mean()
Mean 0.417
dtype: float64
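The question asks for several periods in one run. A minimal sketch, assuming the frame has a DatetimeIndex: keep the periods in a list of (start, end) pairs and compute one mean per slice. The data and the periods list below are made up for illustration and stand in for the CSV from the question:

```python
import pandas as pd

# Hypothetical daily data standing in for the CSV in the question.
idx = pd.date_range("2016-05-01", "2019-04-30", freq="D")
df = pd.DataFrame({"Mean": range(len(idx))}, index=idx)

periods = [
    ("2016-05-01", "2017-04-30"),
    ("2017-05-01", "2018-04-30"),
    ("2018-05-01", "2019-04-30"),
]

# One row of column means per period; .loc slicing includes both endpoints.
means = pd.DataFrame(
    {f"{start}:{end}": df.loc[start:end].mean() for start, end in periods}
).T
print(means)
```

Each row of means is labelled with its period, so further periods only need another pair in the list.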

Sum All Positive and All Negative Values Pandas

I have the following dataframe (df):
Row Number
Row 0 0.24 0.16 -0.18 -0.20 1.24
Row 1 0.18 0.12 -0.73 -0.36 -0.54
Row 2 -0.01 0.25 -0.35 -0.08 -0.43
Row 3 -0.43 0.21 0.53 0.55 -1.03
Row 4 -0.24 -0.20 0.49 0.08 0.61
Row 5 -0.19 -0.29 -0.08 -0.16 0.34
I am attempting to sum all the negative and positive numbers respectively, e.g. sum(neg_numbers) = n and sum(pos_numbers) = x
I have tried:
df.groupby(df.agg([('negative' , lambda x : x[x < 0].sum()) , ('positive' , lambda x : x[x > 0].sum())])
to no avail.
How would I sum these values?
Thank you in advance!
You can do
sum_pos = df[df>0].sum(1)
sum_neg = df[df<0].sum(1)
if you want to get the sums per row. If you want to sum all values regardless of rows/columns, you can use np.nansum:
sum_pos = np.nansum(df[df>0])
You can also multiply by a boolean mask:
df.mul(df.gt(0)).sum().sum()
Out[447]: 5.0
df.mul(~df.gt(0)).sum().sum()
Out[448]: -5.5
If you need the sum per column:
df.mul(df.gt(0)).sum()
Out[449]:
1 0.42
2 0.74
3 1.02
4 0.63
5 2.19
dtype: float64
Yet another way for the total sums:
sum_pos = df.to_numpy().flatten().clip(min=0).sum()
sum_neg = df.to_numpy().flatten().clip(max=0).sum()
And for sums by columns:
sum_pos_col = sum(df.to_numpy().clip(min=0))
sum_neg_col = sum(df.to_numpy().clip(max=0))
If you have string columns in dataframe and want to get the sum for particular column, then
df[df['column_name']>0]['column_name'].sum()
df[df['column_name']<0]['column_name'].sum()
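For reference, a self-contained sketch using DataFrame.clip on the data from the question. The column labels 1 to 5 are an assumption, since the question does not show the headers:

```python
import pandas as pd

df = pd.DataFrame(
    [[0.24, 0.16, -0.18, -0.20, 1.24],
     [0.18, 0.12, -0.73, -0.36, -0.54],
     [-0.01, 0.25, -0.35, -0.08, -0.43],
     [-0.43, 0.21, 0.53, 0.55, -1.03],
     [-0.24, -0.20, 0.49, 0.08, 0.61],
     [-0.19, -0.29, -0.08, -0.16, 0.34]],
    index=[f"Row {i}" for i in range(6)],
    columns=[1, 2, 3, 4, 5],   # assumed labels
)

pos = df.clip(lower=0)   # negatives become 0
neg = df.clip(upper=0)   # positives become 0

# Per-row sums side by side
summary = pd.DataFrame({"positive": pos.sum(axis=1), "negative": neg.sum(axis=1)})
print(summary)

# Grand totals over the whole frame
print(round(pos.to_numpy().sum(), 2), round(neg.to_numpy().sum(), 2))
```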

dataframe values multiply by 2

I have a one-row dataframe:
4DS.AX A2B.AX A2M.AX AAC.AX ABC.AX ABP.AX ACW.AX ADH.AX
2018-12-14 0.00 0.00 0.14 0.01 0.12 0.01 0.00 0.01
expected output
4DS.AX A2B.AX A2M.AX AAC.AX ABC.AX ABP.AX ACW.AX ADH.AX
2018-12-14 0.00 0.00 0.28 0.02 0.24 0.02 0.00 0.02
I want to multiply all the values by 2
this is my attempt:
[in] df=df.iloc[0,:]*2.0
[out]
A2B.AX A2M.AX AAC.AX ABC.AX ABP.AX
2018-12-14 0.000.00 0.140.14 0.010.01 0.120.12....
It seems the columns are of str/object type, so the values are being repeated (string concatenation) rather than multiplied.
Example
import pandas as pd
pd.DataFrame({'x':['0.1']})*2
Output:
0.10.1
While
pd.DataFrame({'x':[0.1]})*2
Output:
0.2
You can check the type of the column(s):
print(df.dtypes)
To change the type of the column(s):
for col in ['4DS.AX','A2B.AX','A2M.AX','AAC.AX','ABC.AX','ABP.AX','ACW.AX','ADH.AX']:
    df[col] = df[col].astype('float')
After that, the multiplication should work:
df.iloc[0,:]*2.0
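A shorter variant of the same fix, as a sketch: pd.to_numeric applied to every column converts the whole frame at once, and errors="coerce" turns unparseable values into NaN instead of raising. The small frame below is hypothetical and only mimics the question's data:

```python
import pandas as pd

# Hypothetical frame mimicking the question: numbers stored as strings.
df = pd.DataFrame(
    {"A2M.AX": ["0.14"], "AAC.AX": ["0.01"], "ABC.AX": ["0.12"]},
    index=["2018-12-14"],
)

# Convert every column; values that cannot be parsed become NaN.
df = df.apply(pd.to_numeric, errors="coerce")

doubled = df * 2
print(doubled)
```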

pd.to_csv set float_format with list

I need to write a df to a text file. To save some space on disk, I would like to set the number of decimal places for each column, i.e. give each column a different width.
I have tried:
df = pd.DataFrame(np.random.random(size=(10, 4)))
df.to_csv(path, float_format=['%.3f', '%.3f', '%.3f', '%.10f'])
But this does not work:
TypeError: unsupported operand type(s) for %: 'list' and 'float'
Any suggestions on how to do this with pandas (version 0.23.0)?
You can do in this way:
df.iloc[:,0:3] = df.iloc[:,0:3].round(3)
df['d'] = df['d'].round(10)
df.to_csv('path')
Thanks for all the answers. Inspired by @Joe, I came up with:
df = df.round({'a':3, 'b':3, 'c':3, 'd':10})
or more generically
df = df.round({c:r for c, r in zip(df.columns, [3, 3, 3, 10])})
This is a workaround and does not answer the original question: round modifies the underlying dataframe, which may be undesirable.
I usually do it this way:
a['column_name'] = round(a['column_name'], 3)
And then you can export it to csv as usual.
You can use applymap, which applies a function to every value in the dataframe.
df = pd.DataFrame(np.random.random(size=(10, 4)))
df.applymap(lambda x: round(x,2))
Out[58]:
0 1 2 3
0 0.12 0.63 0.47 0.19
1 0.06 0.81 0.09 0.56
2 0.78 0.85 0.42 0.98
3 0.58 0.39 0.73 0.68
4 0.79 0.56 0.77 0.34
5 0.16 0.20 0.94 0.89
6 0.34 0.79 0.54 0.27
7 0.70 0.58 0.05 0.28
8 0.75 0.53 0.37 0.64
9 0.57 0.68 0.59 0.84
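If rounding the dataframe itself is undesirable, one option is to format each column into strings in a copy and write that copy. This is a sketch under assumptions: the column names a to d and the formats dict are illustrative, and an in-memory buffer stands in for the file path:

```python
import io
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random(size=(10, 4)), columns=list("abcd"))
formats = {"a": "%.3f", "b": "%.3f", "c": "%.3f", "d": "%.10f"}

# Format into a copy; df itself keeps full precision.
out = df.copy()
for col, fmt in formats.items():
    out[col] = out[col].map(lambda x, fmt=fmt: fmt % x)

buf = io.StringIO()          # stand-in for a file path
out.to_csv(buf, index=False)
print(buf.getvalue().splitlines()[:2])
```

The trade-off is that the copy holds strings, so it is only useful for writing; keep working with df for any further calculations.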
