I am using .feature_importances_ to export the importance values from a regression tree model to a CSV. The tree is run multiple times with different numbers of variables. I am looking to add an extra column that lists the number of variables used when calculating the importance.
Currently I am able to export the importance values and variable names to a CSV with the code below
import pandas as pd

Importance = list(zip(model.feature_importances_, list(X)))
Importance = pd.DataFrame(Importance, columns=["Importance", "Feature_Name"])
Importance.to_csv('File.csv', mode='a', header=False)
and I get a CSV that looks like this
Imp VarName
0 1.00 0mm
0 0.58 0mm
1 0.42 0.63mm
0 0.54 0.63mm
1 0.36 0mm
2 0.10 0.125mm
What I am looking for is to add a 4th column with the number of variables used in that tree so the output would be like this
Imp VarName NumVar
0 1.00 0mm 1
0 0.58 0mm 2
1 0.42 0.63mm 2
0 0.54 0.63mm 3
1 0.36 0mm 3
2 0.10 0.125mm 3
This should work: I add an array of length len(X) filled with the value len(X).
import numpy as np
import pandas as pd

Importance = list(zip(model.feature_importances_, list(X), len(X) * np.ones(len(X))))
Importance = pd.DataFrame(Importance, columns=["Importance", "Feature_Name", "NumVar"])
Importance.to_csv('File.csv')
When running for the first time, I think you can drop the mode='a' and header=False arguments.
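One caveat, on the assumption that X is the feature DataFrame (which list(X) in the question suggests): len(X) counts rows, not feature columns, so the variable count should come from the column axis. A rough, self-contained sketch of that variant with dummy data:
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Dummy stand-ins for the asker's X and model, purely for illustration.
X = pd.DataFrame({"0mm": [1.0, 2.0, 3.0, 4.0], "0.63mm": [0.5, 0.4, 0.3, 0.2]})
model = DecisionTreeRegressor().fit(X, [1.0, 2.0, 3.0, 4.0])

n_vars = X.shape[1]  # number of feature columns; len(X) would count rows instead
Importance = pd.DataFrame(
    list(zip(model.feature_importances_, list(X), [n_vars] * n_vars)),
    columns=["Importance", "Feature_Name", "NumVar"],
)
Importance.to_csv('File.csv', mode='a', header=False)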
I have a pandas dataframe with the first column being dates and then lots of adjusted stock prices (some of which have 16 decimals). I would like to truncate all the dataframe values to 8 decimals so I tried the following:
df = df.set_index("Day").pipe(lambda x: math.trunc(100000000 * x) / 100000000).reset_index()
But I get the following error:
type DataFrame doesn't define __trunc__ method
Have you tried formatting?
dec = [1.2736, 9.3745, 5.412783, 8.25389]
to_3dp = lambda x: '%.3f' % x
rounded = [to_3dp(i) for i in dec]
print(rounded)  # strings with 3 decimal places; note that '%.3f' rounds rather than truncates
So in your case:
df['column'] = df['column'].apply(lambda x: '%.8f'%(x))
If you want to round:
df['column'] = df['column'].apply(lambda x: round(x,8))
Use numpy.trunc for a vectorized solution:
import numpy as np

n = 10**8
out = np.trunc(df.set_index("Day").mul(n)).div(n).reset_index()
IIUC, you are trying to apply the truncate-based lambda function to multiple columns at once. That's the reason for the error; try using applymap, which applies your function to each cell independently. You have to first set your date column as the index, leaving only the float columns in the dataframe. Try this -
f = lambda x: math.trunc(100000000 * x) / 100000000 #<-- your function
df.set_index("Day").applymap(f).reset_index() #<-- applied on each cell
Since I don't have the sample dataset you are using, here is a working dummy example.
import math
import numpy as np
import pandas as pd

# Dummy dataframe
df = pd.DataFrame(np.random.random((10, 3)),
                  columns=['col1', 'col2', 'col3'])

f = lambda x: math.trunc(100 * x) / 100
df.applymap(f)
col1 col2 col3
0 0.80 0.76 0.14
1 0.40 0.48 0.85
2 0.58 0.40 0.76
3 0.82 0.04 0.10
4 0.23 0.04 0.91
5 0.57 0.41 0.12
6 0.72 0.71 0.71
7 0.32 0.59 0.99
8 0.11 0.70 0.32
9 0.95 0.80 0.24
A simpler alternative is to use df.set_index("Day").round(8) directly, if that works for you, but that rounds your numbers to 8 decimals instead of truncating them.
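To see the difference between the two approaches, a small sketch on a made-up value (not the asker's data):
import numpy as np
import pandas as pd

# Made-up one-row frame, purely to show truncation vs. rounding at 8 decimals.
df = pd.DataFrame({"Day": ["2021-01-01"], "price": [1.2345678999]})

n = 10**8
truncated = np.trunc(df.set_index("Day").mul(n)).div(n)  # price becomes 1.23456789
rounded = df.set_index("Day").round(8)                   # price becomes 1.23456790
print(truncated)
print(rounded)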
I'm trying to mimic an SPSS-style correlation table in my pandas output to make it easier to read for supervisors who are used to seeing matrices laid out this way (and who are annoyed that I don't use SPSS anymore, because this is harder for them to read).
This means a table where the p-value is placed directly above the correlation coefficient. I have easily produced both the p-values and the coefficients and saved each into a separate dataframe like the ones below.
pvals
T 4 Rw Af
T |0.00|0.05|0.24|0.01
4 |0.05|0.00|0.76|0.03
Rw|0.24|0.76|0.00|0.44
...
rs
T 4 Rw Af
T |1.00|0.65|0.28|0.44
4 |0.65|1.00|0.01|0.03
Rw|-0.03|0.01|1.00|0.32
...
What I'd like to do is make a table where the two dataframes are merged without changing the order of the index. It would look like
T |P |0.00|0.05|0.24|0.01
|r |1.00|0.65|0.28|0.44
4 |P |0.05|0.00|0.76|0.03
|r |0.65|1.00|0.01|0.03
...
Now, I understand that if my columns had alphabetically ordered names I could use something like
pd.concat([pvals, rs]).sort_index(kind='merge')
However, my columns are named with descriptive, non-ordered names and so this doesn't work because it reorders the index into alphabetical order. I also know that
df.corr()
will produce a matrix like the rs example I've given above but this is not what I'm looking for.
If anyone has any advice I'd really appreciate it.
Kev
You can use a helper MultiIndex level created with np.arange and DataFrame.set_index with append=True, add the keys parameter for the P and r values, sort by that helper level with a merge sort, remove the level, and finally change the order of the remaining levels with DataFrame.swaplevel:
s1 = pvals.set_index(np.arange(len(pvals)), append=True)
s2 = rs.set_index(np.arange(len(rs)), append=True)
df = (pd.concat([s1, s2], keys=('P','r'))
        .sort_index(kind='merge', level=2)
        .reset_index(level=2, drop=True)
        .swaplevel(0,1))
print(df)
T 4 Rw Af
T P 0.00 0.05 0.24 0.01
r 1.00 0.65 0.28 0.44
4 P 0.05 0.00 0.76 0.03
r 0.65 1.00 0.01 0.03
Rw P 0.24 0.76 0.00 0.44
r -0.03 0.01 1.00 0.32
Asker Edit
This answer worked once the code was changed to
s1 = pvals.assign(a = np.arange(len(pvals))).set_index('a', append=True)
s2 = rs.assign(a = np.arange(len(rs))).set_index('a', append=True)
df = (pd.concat([s1, s2], keys=('P','r'))
        .sort_index(kind='merge', level=2)
        .reset_index(level=2, drop=True)
        .swaplevel(0,1))
which was recommended by the answerer.
It's probably pretty obvious in the end but I cannot think of a way to do this. Thanks for your help!
I did a prediction task and as a result I have a DataFrame with the percentages and a column with the predicted class, like so:
Class1 Class2 Class3 Prediction
0 0.99 0.01 0.00 Class1
1 0.15 0.14 0.71 Class3
2 0.05 0.80 0.15 Class2
Now I want to access the probability with which a class was predicted. So I want to have a list like below so I can work with it further.
0 0.99
1 0.71
2 0.80
I have problems finding a way to access only one value of df.Prediction at a time and have no idea how to search for it. How do I get this value, or alternatively my desired list? I tried this:
values = []
for row in df.Prediction:
    values.append(row)
    print(values)
but it returns the whole column for each iteration. It also doesn't feel very pandas-like. I am using Python 3.5 in case it makes a difference.
IIUC:
In [15]: df.lookup(df.index, df.Prediction)
Out[15]: array([ 0.99, 0.71, 0.8 ])
or
In [23]: df.max(axis=1)
Out[23]:
0 0.99
1 0.71
2 0.80
dtype: float64
or
In [24]: df.filter(regex='^Class').idxmax(axis=1)
Out[24]:
0 Class1
1 Class3
2 Class2
dtype: object
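One caveat if you are on a newer pandas: DataFrame.lookup was deprecated and has since been removed. A rough NumPy-indexing equivalent (a sketch using the question's example values):
import numpy as np
import pandas as pd

df = pd.DataFrame({"Class1": [0.99, 0.15, 0.05],
                   "Class2": [0.01, 0.14, 0.80],
                   "Class3": [0.00, 0.71, 0.15],
                   "Prediction": ["Class1", "Class3", "Class2"]})

# Position of each predicted class among the probability columns, then fancy-index.
class_cols = df.filter(regex='^Class')
col_pos = class_cols.columns.get_indexer(df['Prediction'])
probs = class_cols.to_numpy()[np.arange(len(df)), col_pos]
print(probs)  # [0.99 0.71 0.8]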
I'm trying to add a slope calculation on individual subsets of two fields in a dataframe and have that slope value applied to all rows in each subset. (I've used the "slope" function in Excel previously, although I'm not married to that exact algorithm.) The "desired_output" field is what I'm expecting as the output. The subsets are distinguished by the "strike_order" column, with each subset starting at 1 and not having a specific highest value.
"IV" is the y value
"Strike" is the x value
Any help would be appreciated as I don't even know where to begin with this....
import pandas

df = pandas.DataFrame([[1200, 1, .4, 0.005], [1210, 2, .35, 0.005], [1220, 3, .3, 0.005],
                       [1230, 4, .25, 0.005], [1200, 1, .4, 0.003], [1210, 2, .37, .003]],
                      columns=["strike", "strike_order", "IV", "desired_output"])
df
strike strike_order IV desired_output
0 1200 1 0.40 0.005
1 1210 2 0.35 0.005
2 1220 3 0.30 0.005
3 1230 4 0.25 0.005
4 1200 1 0.40 0.003
5 1210 2 0.37 0.003
Let me know if this isn't a well posed question and I'll try to make it better.
You can use NumPy's least squares.
We can rewrite the line equation y = mx + c as y = Ap, where A = [[x 1]] and p = [[m], [c]]. Then use lstsq to solve for p; to build A we add a column of ones to df:
import numpy as np

df['ones'] = 1
A = df[['strike', 'ones']]
y = df['IV']
m, c = np.linalg.lstsq(A, y, rcond=None)[0]  # rcond=None avoids a FutureWarning on newer NumPy
Alternatively you can use scikit-learn's LinearRegression from linear_model.
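A minimal sketch of that alternative (my own example, fitting the same single line to the question's df):
import pandas as pd
from sklearn.linear_model import LinearRegression

# df as constructed in the question
df = pd.DataFrame([[1200, 1, .4, 0.005], [1210, 2, .35, 0.005], [1220, 3, .3, 0.005],
                   [1230, 4, .25, 0.005], [1200, 1, .4, 0.003], [1210, 2, .37, .003]],
                  columns=["strike", "strike_order", "IV", "desired_output"])

reg = LinearRegression().fit(df[['strike']], df['IV'])
m, c = reg.coef_[0], reg.intercept_  # slope and intercept of the fitted line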
You can verify the result by plotting the data as a scatter plot with the fitted line on top:
import matplotlib.pyplot as plt

plt.scatter(df['strike'], df['IV'], color='r', marker='d')
x = df['strike']
# plug x into the equation y = mx + c
y_line = c + m * x
plt.plot(x, y_line)
plt.xlabel('Strike')
plt.ylabel('IV')
plt.show()
The resulting plot shows the data points with the fitted line overlaid.
Try this.
First create a subset column by iterating over the dataframe, using the strike_order value transitioning to 1 as the boundary between subsets
# create subset column
subset_counter = 0
for index, row in df.iterrows():
    if row["strike_order"] == 1:
        df.loc[index, 'subset'] = subset_counter
        subset_counter += 1
    else:
        df.loc[index, 'subset'] = df.loc[index - 1, 'subset']

df['subset'] = df['subset'].astype(int)
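As an aside, assuming each subset always starts where strike_order resets to 1, the same subset column can be built without the explicit loop (a sketch):
# Vectorized alternative: each reset of strike_order to 1 starts a new subset,
# so a cumulative count of those resets labels the subsets (0-based).
df['subset'] = (df['strike_order'] == 1).cumsum() - 1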
Then run a linear regression over each subset using groupby
# run linear regression on subsets of the dataframe using groupby
from sklearn import linear_model

model = linear_model.LinearRegression()
for (group, df_gp) in df.groupby('subset'):
    X = df_gp[['strike']]
    y = df_gp.IV
    model.fit(X, y)
    df.loc[df.subset == df_gp.iloc[0].subset, 'slope'] = model.coef_
df
strike strike_order IV desired_output subset slope
0 1200 1 0.40 0.005 0 -0.005
1 1210 2 0.35 0.005 0 -0.005
2 1220 3 0.30 0.005 0 -0.005
3 1230 4 0.25 0.005 0 -0.005
4 1200 1 0.40 0.003 1 -0.003
5 1210 2 0.37 0.003 1 -0.003
@Scott This worked, except that the subset values went 0, 1 and then all subsequent subsets were labelled 2. I added an extra conditional at the beginning and a rather clumsy "seed" value to stop it looking for row -1.
import scipy

seed = df.loc[0, "date_exp"]
# seed = "08/11/200015/06/2001C"
# print(seed)
subset_counter = 0
for index, row in df.iterrows():
    # if index['strike_order'] == 0:
    if row['date_exp'] == seed:
        df.loc[index, 'subset'] = 0
    elif row["strike_order"] == 1:
        df.loc[index, 'subset'] = subset_counter
        subset_counter = 1 + df.loc[index - 1, 'subset']
    else:
        df.loc[index, 'subset'] = df.loc[index - 1, 'subset']

df['subset'] = df['subset'].astype(int)
This now does exactly what I want, although I think using the seed value is clunky; I would have preferred something like if row == 0. But it's Friday and this works.
Cheers
I have a pandas dataframe with multiple columns. My goal is to apply a complicated function to 3 columns and get a new column of values. Yet I will want to apply the same function to different triplets of columns. Is there a way to use smart string formatting so I don't have to hardcode the column names 5 (or more) times?
Rough sketch:
Columns('A1','A2','A3','B1','B2','B3',...)

def function(row):
    return row['A1']**2 + row['A2']**3 + row['A3']**4  ### String format here?

# do the same for B1, B2, B3; C1, C2, C3; etc.
Thank you!
Using @Milo's setup dataframe df:
np.random.seed(42)
col_names = 'A1 A2 A3 B1 B2 B3 C1 C2 C3'.split()
df = pd.DataFrame(np.random.rand(5,9), columns=col_names)
print(df)
A1 A2 A3 B1 B2 B3 C1 C2 C3
0 0.37 0.95 0.73 0.60 0.16 0.16 0.06 0.87 0.60
1 0.71 0.02 0.97 0.83 0.21 0.18 0.18 0.30 0.52
2 0.43 0.29 0.61 0.14 0.29 0.37 0.46 0.79 0.20
3 0.51 0.59 0.05 0.61 0.17 0.07 0.95 0.97 0.81
4 0.30 0.10 0.68 0.44 0.12 0.50 0.03 0.91 0.26
Then use groupby on the columns (axis=1), with the first letter of each column header as the grouping key.
df.pow(2).groupby(df.columns.str[0], axis=1).sum().pow(.5)
A B C
0 1.256962 0.638019 1.055923
1 1.201048 0.878128 0.633695
2 0.803589 0.488905 0.929715
3 0.785843 0.634367 1.576812
4 0.755317 0.673667 0.946051
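If the chained one-liner is hard to read, here are the same steps spelled out (a sketch on the same setup df):
import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame(np.random.rand(5, 9),
                  columns='A1 A2 A3 B1 B2 B3 C1 C2 C3'.split())

squared = df.pow(2)                                     # square every value
by_letter = squared.groupby(df.columns.str[0], axis=1)  # group columns by their first letter
result = by_letter.sum().pow(0.5)                       # per-row Euclidean norm per letter group
print(result)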
If I understand your question correctly, you want to name your columns according to a specific scheme like "Anumber" and then apply the same operation to them.
One way you can do that is to filter for the naming scheme of the columns you want to address by using regular expressions and then use the apply method to apply your function.
Let's look at an example. I will first construct a DataFrame like so:
import pandas as pd
import numpy as np
np.random.seed(42)
col_names = 'A1 A2 A3 B1 B2 B3 C1 C2 C3'.split()
df = pd.DataFrame(np.random.rand(5,9), columns=col_names)
print(df)
A1 A2 A3 B1 B2 B3 C1 \
0 0.374540 0.950714 0.731994 0.598658 0.156019 0.155995 0.058084
1 0.708073 0.020584 0.969910 0.832443 0.212339 0.181825 0.183405
2 0.431945 0.291229 0.611853 0.139494 0.292145 0.366362 0.456070
3 0.514234 0.592415 0.046450 0.607545 0.170524 0.065052 0.948886
4 0.304614 0.097672 0.684233 0.440152 0.122038 0.495177 0.034389
C2 C3
0 0.866176 0.601115
1 0.304242 0.524756
2 0.785176 0.199674
3 0.965632 0.808397
4 0.909320 0.258780
Then use the filter method in combination with regular expressions. As an example I will square every value using a lambda, but you can use whatever function/operation you like:
print(df.filter(regex=r'A\d+').apply(lambda x: x*x))
A1 A2 A3
0 0.140280 0.903858 0.535815
1 0.501367 0.000424 0.940725
2 0.186576 0.084814 0.374364
3 0.264437 0.350955 0.002158
4 0.092790 0.009540 0.468175
Edit (2017-07-10)
Taking the above examples you could proceed with what you ultimately want to calculate. For example we can calculate the euclidean distance across all A-columns as follows:
df.filter(regex=r'A\d+').apply(lambda x: x*x).sum(axis=1).apply(np.sqrt)
Which results in:
0 1.256962
1 1.201048
2 0.803589
3 0.785843
4 0.755317
So what we essentially computed is sqrt(A1^2 + A2^2 + A3^2 + ... + An^2) for every row.
But since you want to apply separate transformations to separate column naming schemes you would have to hardcode the above method concatenation.
A much more elegant solution to this would be using pipelines. Pipelines basically allow you to define operations on your DataFrame and then combine them the way you need. Again using the example of computing the Euclidean Distance, we could construct a pipeline as follows:
def filter_columns(dataframe, regex):
    """Keep only the columns of `dataframe` matched by `regex`."""
    return dataframe.filter(regex=regex)

def op_on_vals(dataframe, op_vals):
    """Apply `op_vals` to every value in the columns of `dataframe`."""
    return dataframe.apply(op_vals)

def op_across_columns(dataframe, op_cols):
    """Apply `op_cols` across the columns of `dataframe`."""
    # Catch the exception that would be raised if the function
    # were applied to a pandas.Series.
    try:
        return dataframe.apply(op_cols, axis=1)
    except TypeError:
        return dataframe.apply(op_cols)
For every column naming scheme you can then define the transformations to apply and the order in which they have to be applied. This can for example be done by creating a dictionary that holds the column naming schemes as keys and the arguments for the pipes as values:
pipe_dict = {r'A\d+': [(op_on_vals, np.square), (op_across_columns, np.sum), (op_across_columns, np.sqrt)],
             r'B\d+': [(op_on_vals, np.square), (op_across_columns, np.mean)],
             r'C\d+': [(op_on_vals, lambda x: x**3), (op_across_columns, np.max)]}
# First pipe: Euclidean distance
# Second pipe: Mean of squares
# Third pipe: Maximum cube
df_list = []
for scheme in pipe_dict.keys():
    df_list.append(df.pipe(filter_columns, scheme))
    for (operation, func) in pipe_dict[scheme]:
        df_list[-1] = df_list[-1].pipe(operation, func)

print(df_list[0])
0 1.256962
1 1.201048
2 0.803589
3 0.785843
4 0.755317
Getting the same result as above.
Now, this is just an example use and neither very elegant, nor computationally very efficient. It is just to demonstrate the concept of DataFrame pipelines. Taking these concepts, you can go really fancy with this - for example defining pipelines of pipelines etc.
However, taking this example you can achieve your goal of defining an arbitrary order of functions to be executed on your columns. You can now go one step further and apply one function at a time to specific columns, instead of applying functions across all columns.
For example, you can take my op_on_vals function and modify it so that it achieves what you outlined with row['A1']**2, row['A2']**3 and then use .pipe(op_across_columns, np.sum) to implement what you sketched with
def function(row):
    return row['A1']**2 + row['A2']**3 + row['A3']**4
This shouldn't be too difficult, so I will leave the details of this implementation to you.
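For what it's worth, one rough sketch of that modification (the exponent_map name and op_per_column helper are mine, not part of the original answer): raise each column to its own exponent, then sum across the columns.
import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame(np.random.rand(5, 3), columns='A1 A2 A3'.split())

# Hypothetical per-column exponents, reproducing row['A1']**2 + row['A2']**3 + row['A3']**4.
exponent_map = {'A1': 2, 'A2': 3, 'A3': 4}

def op_per_column(dataframe, exponents):
    """Raise each column to the exponent registered for its name."""
    return dataframe.apply(lambda col: col ** exponents[col.name])

result = df.pipe(op_per_column, exponent_map).sum(axis=1)
print(result)  # one value per row: A1**2 + A2**3 + A3**4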
Edit (2017-07-11)
Here is another piece of code that uses functools.partial in order to create 'function prototypes' of a power function. These can be used to variably set an exponent for the power according to the number in the column names of the DataFrame.
This way we can use the numbers in A1, A2 etc. to calculate value**1, value**2 for each value in the corresponding column. Finally, we can sum them in order to get what you sketched with
row['A1']**2 + row['A2']**3 + row['A3']**4
You can find an excellent explanation of what functools.partial does on PyDanny's Blog. Let's look at the code:
import pandas as pd
import numpy as np
import re
from functools import partial

def power(base, exponent):
    return base ** exponent

# Create example DataFrame.
np.random.seed(42)
col_names = 'A1 A2 A3 B1 B2 B3 C1 C2 C3'.split()
df = pd.DataFrame(np.random.rand(5, 9), columns=col_names)

# Separate 'letter''number' column names into tuples of (letter, number).
match = re.findall(r"([A-Z]+)([0-9]+)", ''.join(df.columns.tolist()))

# Dictionary with 'prototype' functions for each column naming scheme.
func_dict = {'A': power, 'B': power, 'C': power}

# Initialize result columns with zeros.
for letter, _ in match:
    df[letter + '_result'] = np.zeros_like(df[letter + '1'])

# Apply functions to columns.
for letter, number in match:
    col_name = ''.join([letter, number])
    the_function = partial(func_dict[letter], exponent=int(number))
    df[letter + '_result'] += df[col_name].apply(the_function)

print(df)
Output:
A1 A2 A3 B1 B2 B3 C1 \
0 0.374540 0.950714 0.731994 0.598658 0.156019 0.155995 0.058084
1 0.708073 0.020584 0.969910 0.832443 0.212339 0.181825 0.183405
2 0.431945 0.291229 0.611853 0.139494 0.292145 0.366362 0.456070
3 0.514234 0.592415 0.046450 0.607545 0.170524 0.065052 0.948886
4 0.304614 0.097672 0.684233 0.440152 0.122038 0.495177 0.034389
C2 C3 A_result B_result C_result
0 0.866176 0.601115 1.670611 0.626796 1.025551
1 0.304242 0.524756 1.620915 0.883542 0.420470
2 0.785176 0.199674 0.745815 0.274016 1.080532
3 0.965632 0.808397 0.865290 0.636899 2.409623
4 0.909320 0.258780 0.634494 0.576463 0.878582
You can replace the power functions in the func_dict with your own functions, for example one that sums the values with another value or performs some sort of fancy statistical calculations with them.
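As a small illustration of swapping in your own function (a toy example of mine, not from the original answer), any callable taking (base, exponent) can be registered per letter in func_dict:
import numpy as np

def power(base, exponent):
    return base ** exponent

def scaled_log(base, exponent):
    """Same signature as power; here the column number acts as a scale factor."""
    return exponent * np.log1p(base)

# A and C columns keep the power behaviour, B columns get the log-based one.
func_dict = {'A': power, 'B': scaled_log, 'C': power}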
Using this in combination with the pipeline approach from my earlier edit should give you the tools to get the results that you need.