Merging pandas dataframes, alternating rows without sorting rows - python

I'm trying to mimic an SPSS-style correlation table in my pandas output to make it easier to read for supervisors who are used to seeing matrices laid out this way (and who are annoyed that I don't use SPSS anymore because it's harder for them to read).
This means a table where the p-value is placed directly above the correlation coefficient. I have easily produced both the p-values and the coefficients and saved each into a separate dataframe, like the ones below.
pvals
T 4 Rw Af
T |0.00|0.05|0.24|0.01
4 |0.05|0.00|0.76|0.03
Rw|0.24|0.76|0.00|0.44
...
rs
T 4 Rw Af
T |1.00|0.65|0.28|0.44
4 |0.65|1.00|0.01|0.03
Rw|-0.03|0.01|1.00|0.32
...
What I'd like to do is make a table where the two dataframes are merged without changing the order of the index. It would look like
T |P |0.00|0.05|0.24|0.01
  |r |1.00|0.65|0.28|0.44
4 |P |0.05|0.00|0.76|0.03
  |r |0.65|1.00|0.01|0.03
...
Now, I understand that if my columns had alphabetically ordered names I could use something like
pd.concat([pvals, rs]).sort_index(kind='merge')
However, my columns are named with descriptive, non-ordered names and so this doesn't work because it reorders the index into alphabetical order. I also know that
df.corr()
will produce a matrix like the rs example I've given above but this is not what I'm looking for.
If anyone has any advice I'd really appreciate it.
Kev

You can use a helper MultiIndex level created with np.arange and DataFrame.set_index with append=True, add the keys parameter for the P/r labels in pd.concat, sort by that helper level with a stable merge sort (which keeps the P row ahead of the r row for each label), drop the helper level, and finally change the order of the remaining levels with DataFrame.swaplevel:
import numpy as np
import pandas as pd

# Append a positional helper level so the original row order can be restored after concat.
s1 = pvals.set_index(np.arange(len(pvals)), append=True)
s2 = rs.set_index(np.arange(len(rs)), append=True)

df = (pd.concat([s1, s2], keys=('P', 'r'))
        .sort_index(kind='merge', level=2)   # stable sort on the helper level interleaves the P/r rows
        .reset_index(level=2, drop=True)     # drop the helper level
        .swaplevel(0, 1))                    # put the original labels first, P/r second
print(df)
          T     4    Rw    Af
T  P   0.00  0.05  0.24  0.01
   r   1.00  0.65  0.28  0.44
4  P   0.05  0.00  0.76  0.03
   r   0.65  1.00  0.01  0.03
Rw P   0.24  0.76  0.00  0.44
   r  -0.03  0.01  1.00  0.32
Asker Edit
This answer worked once the code was changed to
s1 = pvals.assign(a=np.arange(len(pvals))).set_index('a', append=True)
s2 = rs.assign(a=np.arange(len(rs))).set_index('a', append=True)

df = (pd.concat([s1, s2], keys=('P', 'r'))
        .sort_index(kind='merge', level=2)
        .reset_index(level=2, drop=True)
        .swaplevel(0, 1))
which was recommended by the answerer.
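For completeness, a shorter alternative sketch (not the answerer's code): build the target row order explicitly with a MultiIndex and reindex to it.
import pandas as pd

stacked = pd.concat([pvals, rs], keys=['P', 'r']).swaplevel(0, 1)
order = pd.MultiIndex.from_product([pvals.index, ['P', 'r']])
out = stacked.reindex(order)   # rows interleave as (T, P), (T, r), (4, P), (4, r), ...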

Related

Truncate all values in dataframe

I have a pandas dataframe with the first column being dates and then lots of adjusted stock prices (some of which have 16 decimals). I would like to truncate all the dataframe values to 8 decimals so I tried the following:
df = df.set_index("Day").pipe(lambda x: math.trunc(100000000 * x) / 100000000).reset_index()
But I get the following error:
type DataFrame doesn't define __trunc__ method
Have you tried formatting?
dec = [1.2736, 9.3745, 5.412783, 8.25389]
to_3dp = lambda x: '%.3f' % x
rounded = [to_3dp(i) for i in dec]
print(rounded)  # note: the results are strings, and %-formatting rounds (1.2736 -> '1.274') rather than truncating
So in your case:
df['column'] = df['column'].apply(lambda x: '%.8f'%(x))
If you want to round:
df['column'] = df['column'].apply(lambda x: round(x,8))
Use numpy.trunc for a vectorized solution:
n = 10**8
out = np.trunc(df.set_index("Day").mul(n)).div(n).reset_index()
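Note that np.trunc truncates toward zero, like math.trunc, which differs from np.floor for negative values; a quick sketch:
import numpy as np

print(np.trunc([1.987, -1.987]))   # [ 1. -1.]  -> toward zero
print(np.floor([1.987, -1.987]))   # [ 1. -2.]  -> toward negative infinity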
IIUC, you are trying to apply the truncate-based lambda function to multiple columns at once. That's the reason for the error; try using applymap, which applies your function to each cell independently. You have to first set your date column as the index, leaving only the float columns in the dataframe. Try this -
f = lambda x: math.trunc(100000000 * x) / 100000000 #<-- your function
df.set_index("Day").applymap(f).reset_index() #<-- applied on each cell
Since I don't have the sample dataset you are using, here is a working dummy example.
import math
import numpy as np
import pandas as pd

# Dummy dataframe
df = pd.DataFrame(np.random.random((10, 3)),
                  columns=['col1', 'col2', 'col3'])
f = lambda x: math.trunc(100 * x) / 100
df.applymap(f)
   col1  col2  col3
0  0.80  0.76  0.14
1  0.40  0.48  0.85
2  0.58  0.40  0.76
3  0.82  0.04  0.10
4  0.23  0.04  0.91
5  0.57  0.41  0.12
6  0.72  0.71  0.71
7  0.32  0.59  0.99
8  0.11  0.70  0.32
9  0.95  0.80  0.24
A simpler alternative is to use df.set_index("Day").round(8) directly, if that works for you, but that rounds your numbers to 8 decimal places instead of truncating them.
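To make the difference concrete, here is a small sketch contrasting the two on a hypothetical value:
import numpy as np

x = 0.123456789
print(round(x, 8))              # 0.12345679 -- the trailing 9 rounds the 8th decimal up
print(np.trunc(x * 1e8) / 1e8)  # 0.12345678 -- digits past the 8th are simply dropped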

Adding values in a column by "formula" in a pandas dataframe

I am trying to add values to a column using a formula, using the information from this question: Is there a way in Pandas to use previous row value in dataframe.apply when previous value is also calculated in the apply?
I already have the first value of column B and I want to fill in the rest of column B using a formula.
The dataframe looks something like this:
A B C
0.16 0.001433 25.775485
0.28 0 25.784443
0.28 0 25.792396
...
And the method I tried was:
for i in range(1, len(df)):
    df.loc[i, "B"] = df.loc[i-1, "B"] + df.loc[i, "A"] * (df.loc[i, "C"] - df.loc[i-1, "C"])
But this code produces an infinite loop. Can someone help me with this?
You can use shift and a simple assignment.
The general rule in pandas: if you are using loops, you're doing something wrong; they are considered an anti-pattern.
df['B_new'] = df['B'].shift(-1) - df['A'] * ((df['C'] - df['C'].shift(-1)))
A B C B_new
0 0.16 0.001433 25.775485 0.001433
1 0.28 0.000000 25.784443 0.002227
2 0.28 0.000000 25.792396 NaN
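Side note, not part of the original answer: the shift line above computes B_new only from the B values that already exist. If the goal is the running recurrence from the question, where each B builds on the previously computed B, the loop can still be avoided with a cumulative sum; a sketch, assuming the first B value is the known starting point:
# B[i] = B[i-1] + A[i] * (C[i] - C[i-1]) expands to B[i] = B[0] + cumulative sum of A * C.diff()
increments = df['A'] * df['C'].diff()             # NaN in the first row, so no contribution there
df['B_cum'] = df['B'].iloc[0] + increments.fillna(0).cumsum()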

Adding extra column when exporting Feature Importance

I am using .feature_importances_ to export the importance values from a regression tree model to a CSV. The tree is run multiple times with different numbers of variables. I am looking to add an extra column that lists the number of variables used when calculating the importance.
Currently I am able to export the importance values and variable names to a CSV with the code below
Importance = list(zip(model.feature_importances_,list(X)))
Importance = pd.DataFrame(Importance,columns=["Importance","Feature_Name"])
Importance.to_csv('File.csv', mode='a', header=False)
and I get a CSV that looks like this
Imp VarName
0 1.00 0mm
0 0.58 0mm
1 0.42 0.63mm
0 0.54 0.63mm
1 0.36 0mm
2 0.10 0.125mm
What I am looking for is to add a 4th column with the number of variables used in that tree so the output would be like this
Imp VarName NumVar
0 1.00 0mm 1
0 0.58 0mm 2
1 0.42 0.63mm 2
0 0.54 0.63mm 3
1 0.36 0mm 3
2 0.10 0.125mm 3
This should work; I zip in an array whose every value is the number of variables (the number of feature columns in X):
import numpy as np

n_vars = len(list(X))  # number of feature columns, assuming X is the feature DataFrame
Importance = list(zip(model.feature_importances_, list(X), n_vars * np.ones(n_vars)))
Importance = pd.DataFrame(Importance, columns=["Importance", "Feature_Name", "NumVar"])
Importance.to_csv('File.csv')
When running this for the first time, I think you can drop the append mode (mode='a') and header=False.
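A slightly simpler variant is also possible (a sketch, assuming X is the feature DataFrame passed to the model): build the two-column frame as before and assign the count as a scalar, which pandas broadcasts to every row.
import pandas as pd

Importance = pd.DataFrame({"Importance": model.feature_importances_,
                           "Feature_Name": list(X)})
Importance["NumVar"] = len(list(X))   # scalar value broadcasts down the whole column
Importance.to_csv('File.csv', mode='a', header=False)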

How to divide one dataframe by the other without converting to numpy first?

I have a dataframe with two columns, x and y, and a few hundred rows.
I have another dataframe with only one row and two columns, x and y.
I want to divide column x of the big dataframe by the value in x of the small dataframe, and column y by column y.
If I divide one dataframe by the other, I get all NaNs. For the division to work, I must convert the small dataframe to numpy.
Why can't I divide one dataframe by the other? What am I missing? I have a toy example below.
import numpy as np
import pandas as pd
df = pd.DataFrame()
r = int(10)
df['x'] = np.arange(0,r)
df['y'] = df['x'] * 2
other_df = pd.DataFrame()
other_df['x'] = [100]
other_df['y'] = [400]
# This doesn't work - I get all nans
new = df / other_df
# this works - it gives me what I want
new2 = df / [100,400]
# this also works
new3 = df / other_df.to_numpy()
You can convert the one-row DataFrame to a Series so that the columns align correctly, e.g. by selecting the first row with DataFrame.iloc:
new = df / other_df.iloc[0]
print (new)
      x      y
0  0.00  0.000
1  0.01  0.005
2  0.02  0.010
3  0.03  0.015
4  0.04  0.020
5  0.05  0.025
6  0.06  0.030
7  0.07  0.035
8  0.08  0.040
9  0.09  0.045
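An equivalent spelling of the same idea (a sketch, not from the original answer): DataFrame.squeeze collapses the single-row frame into a Series, which then aligns on the column labels.
# other_df has exactly one row, so squeeze() returns a Series indexed by ['x', 'y']
new = df / other_df.squeeze()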
You can use numpy.divide() to divide, since NumPy has the great property of broadcasting.
new = np.divide(df,other_df)
Please check this link for more details.

Pandas dataframe string formatting

I have a pandas dataframe with multiple columns. My goal is to apply a complicated function to 3 columns and get a new column of values. However, I will want to apply the same function to different triplets of columns. Is there a way to use smart string formatting so that I don't have to hardcode the different column names 5 (or more) times?
Rough sketch:
Columns('A1','A2','A3','B1','B2','B3',...)
def function(row):
    return row['A1']**2 + row['A2']**3 + row['A3']**4  ### String format here?
do the same for B1, B2, B3; C1, C2, C3, etc.
Thank you!
Using @Milo's setup dataframe df:
np.random.seed(42)
col_names = 'A1 A2 A3 B1 B2 B3 C1 C2 C3'.split()
df = pd.DataFrame(np.random.rand(5,9), columns=col_names)
print(df)
A1 A2 A3 B1 B2 B3 C1 C2 C3
0 0.37 0.95 0.73 0.60 0.16 0.16 0.06 0.87 0.60
1 0.71 0.02 0.97 0.83 0.21 0.18 0.18 0.30 0.52
2 0.43 0.29 0.61 0.14 0.29 0.37 0.46 0.79 0.20
3 0.51 0.59 0.05 0.61 0.17 0.07 0.95 0.97 0.81
4 0.30 0.10 0.68 0.44 0.12 0.50 0.03 0.91 0.26
Then use groupby over the columns (axis=1), using the first letter of each column header as the grouping key.
df.pow(2).groupby(df.columns.str[0], axis=1).sum().pow(0.5)
A B C
0 1.256962 0.638019 1.055923
1 1.201048 0.878128 0.633695
2 0.803589 0.488905 0.929715
3 0.785843 0.634367 1.576812
4 0.755317 0.673667 0.946051
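A side note not in the original answer: newer pandas releases deprecate the axis=1 form of groupby, so the same result can be sketched by grouping the transpose instead, assuming the same df as above.
# Group the transposed frame by the first letter of each (former) column name, then transpose back.
df.pow(2).T.groupby(df.columns.str[0]).sum().T.pow(0.5)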
If I understand your question correctly, you want to name your columns according to a specific scheme like "Anumber" and then apply the same operation to them.
One way you can do that is to filter for the naming scheme of the columns you want to address by using regular expressions and then use the apply method to apply your function.
Let's look at an example. I will first construct a DataFrame like so:
import pandas as pd
import numpy as np
np.random.seed(42)
col_names = 'A1 A2 A3 B1 B2 B3 C1 C2 C3'.split()
df = pd.DataFrame(np.random.rand(5,9), columns=col_names)
print(df)
A1 A2 A3 B1 B2 B3 C1 \
0 0.374540 0.950714 0.731994 0.598658 0.156019 0.155995 0.058084
1 0.708073 0.020584 0.969910 0.832443 0.212339 0.181825 0.183405
2 0.431945 0.291229 0.611853 0.139494 0.292145 0.366362 0.456070
3 0.514234 0.592415 0.046450 0.607545 0.170524 0.065052 0.948886
4 0.304614 0.097672 0.684233 0.440152 0.122038 0.495177 0.034389
C2 C3
0 0.866176 0.601115
1 0.304242 0.524756
2 0.785176 0.199674
3 0.965632 0.808397
4 0.909320 0.258780
Then use the filter method in combination with regular expressions. As an example, I will square every value using a lambda, but you can use whatever function/operation you like:
print(df.filter(regex=r'A\d+').apply(lambda x: x*x))
A1 A2 A3
0 0.140280 0.903858 0.535815
1 0.501367 0.000424 0.940725
2 0.186576 0.084814 0.374364
3 0.264437 0.350955 0.002158
4 0.092790 0.009540 0.468175
Edit (2017-07-10)
Taking the above examples you could proceed with what you ultimately want to calculate. For example we can calculate the euclidean distance across all A-columns as follows:
df.filter(regex=r'A\d+').apply(lambda x: x*x).sum(axis=1).apply(np.sqrt)
Which results in:
0 1.256962
1 1.201048
2 0.803589
3 0.785843
4 0.755317
So what we essentially computed is sqrt(A1^2 + A2^2 + A3^2 + ... + An^2) for every row.
But since you want to apply separate transformations to separate column naming schemes, you would have to hardcode the above method chain for each scheme.
A much more elegant solution to this would be using pipelines. Pipelines basically allow you to define operations on your DataFrame and then combine them the way you need. Again using the example of computing the Euclidean Distance, we could construct a pipeline as follows:
def filter_columns(dataframe, regex):
    """Keep only the columns of `dataframe` matched by `regex`."""
    return dataframe.filter(regex=regex)

def op_on_vals(dataframe, op_vals):
    """Apply `op_vals` to every value in the columns of `dataframe`."""
    return dataframe.apply(op_vals)

def op_across_columns(dataframe, op_cols):
    """Apply `op_cols` across the columns of `dataframe`."""
    # Catch the exception that would be raised if the function
    # were applied to a pandas.Series.
    try:
        return dataframe.apply(op_cols, axis=1)
    except TypeError:
        return dataframe.apply(op_cols)
For every column naming scheme you can then define the transformations to apply and the order in which they have to be applied. This can for example be done by creating a dictionary that holds the column naming schemes as keys and the arguments for the pipes as values:
pipe_dict = {r'A\d+': [(op_on_vals, np.square), (op_across_columns, np.sum), (op_across_columns, np.sqrt)],
             r'B\d+': [(op_on_vals, np.square), (op_across_columns, np.mean)],
             r'C\d+': [(op_on_vals, lambda x: x**3), (op_across_columns, np.max)]}
# First pipe: Euclidean distance
# Second pipe: Mean of squares
# Third pipe: Maximum cube

df_list = []
for scheme in pipe_dict.keys():
    df_list.append(df.pipe(filter_columns, scheme))
    for (operation, func) in pipe_dict[scheme]:
        df_list[-1] = df_list[-1].pipe(operation, func)
print(df_list[0])
0 1.256962
1 1.201048
2 0.803589
3 0.785843
4 0.755317
Getting the same result as above.
Now, this is just an example and is neither very elegant nor computationally very efficient; it is just meant to demonstrate the concept of DataFrame pipelines. Taking these concepts, you can get really fancy with this - for example, defining pipelines of pipelines, etc.
However, taking this example you can achieve your goal of defining an arbitrary order of functions to be executed on your columns. You can now go one step further and apply one function at a time to specific columns, instead of applying functions across all columns.
For example, you can take my op_on_vals function and modify it so that it achieves what you outlined with row['A1']**2, row['A2']**3 and then use .pipe(op_across_columns, np.sum) to implement what you sketched with
def function(row):
    return row['A1']**2 + row['A2']**3 + row['A3']**4
This shouldn't be too difficult, so I will leave the details of this implementation to you.
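For reference, one possible sketch of that modification (the helper name op_power_by_colnum is hypothetical, and it assumes the exponent for each column is its trailing digit plus one, as in the question's sketch):
import numpy as np

def op_power_by_colnum(dataframe, offset=1):
    """Raise each column to (its trailing digit + offset), e.g. A1**2, A2**3, A3**4."""
    return dataframe.apply(lambda col: col ** (int(col.name[-1]) + offset))

df.filter(regex=r'A\d+').pipe(op_power_by_colnum).pipe(op_across_columns, np.sum)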
Edit (2017-07-11)
Here is another piece of code that uses functools.partial in order to create 'function prototypes' of a power function. These can be used to variably set an exponent for the power according to the number in the column names of the DataFrame.
This way we can use the numbers in A1, A2 etc. to calculate value**1, value**2 for each value in the corresponding column. Finally, we can sum them in order to get what you sketched with
row['A1']**2 + row['A2']**3 + row['A3']**4
You can find an excellent explanation of what functools.partial does on PyDanny's Blog. Let's look at the code:
import pandas as pd
import numpy as np
import re
from functools import partial

def power(base, exponent):
    return base ** exponent

# Create example DataFrame.
np.random.seed(42)
col_names = 'A1 A2 A3 B1 B2 B3 C1 C2 C3'.split()
df = pd.DataFrame(np.random.rand(5, 9), columns=col_names)

# Separate 'letter''number' strings of columns into tuples of (letter, number).
match = re.findall(r"([A-Z]+)([0-9]+)", ''.join(df.columns.tolist()))

# Dictionary with 'prototype' functions for each column naming scheme.
func_dict = {'A': power, 'B': power, 'C': power}

# Initialize result columns with zeros.
for letter, _ in match:
    df[letter + '_result'] = np.zeros_like(df[letter + '1'])

# Apply functions to columns.
for letter, number in match:
    col_name = ''.join([letter, number])
    the_function = partial(func_dict[letter], exponent=int(number))
    df[letter + '_result'] += df[col_name].apply(the_function)

print(df)
Output:
A1 A2 A3 B1 B2 B3 C1 \
0 0.374540 0.950714 0.731994 0.598658 0.156019 0.155995 0.058084
1 0.708073 0.020584 0.969910 0.832443 0.212339 0.181825 0.183405
2 0.431945 0.291229 0.611853 0.139494 0.292145 0.366362 0.456070
3 0.514234 0.592415 0.046450 0.607545 0.170524 0.065052 0.948886
4 0.304614 0.097672 0.684233 0.440152 0.122038 0.495177 0.034389
C2 C3 A_result B_result C_result
0 0.866176 0.601115 1.670611 0.626796 1.025551
1 0.304242 0.524756 1.620915 0.883542 0.420470
2 0.785176 0.199674 0.745815 0.274016 1.080532
3 0.965632 0.808397 0.865290 0.636899 2.409623
4 0.909320 0.258780 0.634494 0.576463 0.878582
You can replace the power functions in the func_dict with your own functions, for example one that sums the values with another value or performs some sort of fancy statistical calculations with them.
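As a purely hypothetical illustration of that, a different function could be dropped in for one scheme, reusing the power helper and the apply loop from the snippet above:
import numpy as np

def scaled_log(base, exponent):
    """Example replacement: log of the value, weighted by the column number."""
    return np.log(base) * exponent

func_dict = {'A': power, 'B': scaled_log, 'C': power}
# Re-running the "Apply functions to columns" loop above would now accumulate
# B_result as log(B1)*1 + log(B2)*2 + log(B3)*3 instead of powers.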
Using this in combination with the pipeline approach from my earlier edit should give you the tools to get the results that you need.
