Pandas dataframe string formatting - python

I have a pandas dataframe with multiple columns. My goal is to apply a complicated function to 3 columns and get a new column of values. However, I will want to apply the same function to different triplets of columns. Is there a way to use smart string formatting so I don't have to hardcode the column names 5 (or more) times?
Rough sketch:
Columns('A1','A2','A3','B1','B2','B3',...)
def function(row):
    return row['A1']**2 + row['A2']**3 + row['A3']**4  # String format here?
# do the same for B1, B2, B3; C1, C2, C3; etc.
Thank you!

Using @Milo's setup DataFrame df:
np.random.seed(42)
col_names = 'A1 A2 A3 B1 B2 B3 C1 C2 C3'.split()
df = pd.DataFrame(np.random.rand(5,9), columns=col_names)
print(df)
A1 A2 A3 B1 B2 B3 C1 C2 C3
0 0.37 0.95 0.73 0.60 0.16 0.16 0.06 0.87 0.60
1 0.71 0.02 0.97 0.83 0.21 0.18 0.18 0.30 0.52
2 0.43 0.29 0.61 0.14 0.29 0.37 0.46 0.79 0.20
3 0.51 0.59 0.05 0.61 0.17 0.07 0.95 0.97 0.81
4 0.30 0.10 0.68 0.44 0.12 0.50 0.03 0.91 0.26
Then use groupby with axis=1, taking the first letter of each column header as the grouping key:
df.pow(2).groupby(df.columns.str[0], axis=1).sum().pow(0.5)
A B C
0 1.256962 0.638019 1.055923
1 1.201048 0.878128 0.633695
2 0.803589 0.488905 0.929715
3 0.785843 0.634367 1.576812
4 0.755317 0.673667 0.946051
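Note that grouping with axis=1 is deprecated in recent pandas versions (2.x). A minimal sketch of an equivalent on newer pandas, grouping the transposed frame instead:
# groupby(axis=1) is deprecated in pandas 2.x: transpose, group the rows
# (the former columns) by their first letter, sum, and transpose back.
df.pow(2).T.groupby(df.columns.str[0]).sum().T.pow(0.5)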

If I understand your question correctly, you want to name your columns according to a specific scheme like "Anumber" and then apply the same operation to them.
One way you can do that is to filter for the naming scheme of the columns you want to address by using regular expressions and then use the apply method to apply your function.
Let's look at an example. I will first construct a DataFrame like so:
import pandas as pd
import numpy as np
np.random.seed(42)
col_names = 'A1 A2 A3 B1 B2 B3 C1 C2 C3'.split()
df = pd.DataFrame(np.random.rand(5,9), columns=col_names)
print(df)
A1 A2 A3 B1 B2 B3 C1 \
0 0.374540 0.950714 0.731994 0.598658 0.156019 0.155995 0.058084
1 0.708073 0.020584 0.969910 0.832443 0.212339 0.181825 0.183405
2 0.431945 0.291229 0.611853 0.139494 0.292145 0.366362 0.456070
3 0.514234 0.592415 0.046450 0.607545 0.170524 0.065052 0.948886
4 0.304614 0.097672 0.684233 0.440152 0.122038 0.495177 0.034389
C2 C3
0 0.866176 0.601115
1 0.304242 0.524756
2 0.785176 0.199674
3 0.965632 0.808397
4 0.909320 0.258780
Then use the filter method in combination with a regular expression. As an example, I will square every value using a lambda, but you can use whatever function/operation you like:
print(df.filter(regex=r'A\d+').apply(lambda x: x*x))
A1 A2 A3
0 0.140280 0.903858 0.535815
1 0.501367 0.000424 0.940725
2 0.186576 0.084814 0.374364
3 0.264437 0.350955 0.002158
4 0.092790 0.009540 0.468175
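As an aside, the elementwise lambda can also be written with the built-in pow method, which avoids the lambda entirely:
# equivalent to .apply(lambda x: x*x)
print(df.filter(regex=r'A\d+').pow(2))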
Edit (2017-07-10)
Taking the above examples, you can proceed to what you ultimately want to calculate. For example, we can calculate the Euclidean distance across all A-columns as follows:
df.filter(regex=r'A\d+').apply(lambda x: x*x).sum(axis=1).apply(np.sqrt)
Which results in:
0 1.256962
1 1.201048
2 0.803589
3 0.785843
4 0.755317
So what we essentially computed is sqrt(A1^2 + A2^2 + A3^2 + ... + An^2) for every row.
But since you want to apply separate transformations to separate column naming schemes, you would have to hardcode the above method chain.
A much more elegant solution is to use pipelines. Pipelines allow you to define operations on your DataFrame and then combine them the way you need. Again using the example of computing the Euclidean distance, we could construct a pipeline as follows:
def filter_columns(dataframe, regex):
    """Keep only the columns of `dataframe` matched by `regex`."""
    return dataframe.filter(regex=regex)

def op_on_vals(dataframe, op_vals):
    """Apply `op_vals` to every value in the columns of `dataframe`."""
    return dataframe.apply(op_vals)

def op_across_columns(dataframe, op_cols):
    """Apply `op_cols` across the columns of `dataframe`."""
    # Catch the TypeError that would be raised if the function
    # were applied to a pandas.Series.
    try:
        return dataframe.apply(op_cols, axis=1)
    except TypeError:
        return dataframe.apply(op_cols)
For every column naming scheme you can then define the transformations to apply and the order in which they have to be applied. This can for example be done by creating a dictionary that holds the column naming schemes as keys and the arguments for the pipes as values:
pipe_dict = {r'A\d+': [(op_on_vals, np.square), (op_across_columns, np.sum), (op_across_columns, np.sqrt)],
             r'B\d+': [(op_on_vals, np.square), (op_across_columns, np.mean)],
             r'C\d+': [(op_on_vals, lambda x: x**3), (op_across_columns, np.max)]}
# First pipe: Euclidean distance
# Second pipe: mean of squares
# Third pipe: maximum cube
df_list = []
for scheme in pipe_dict.keys():
    df_list.append(df.pipe(filter_columns, scheme))
    for (operation, func) in pipe_dict[scheme]:
        df_list[-1] = df_list[-1].pipe(operation, func)
print(df_list[0])
0 1.256962
1 1.201048
2 0.803589
3 0.785843
4 0.755317
Getting the same result as above.
Now, this is just an example use and neither very elegant, nor computationally very efficient. It is just to demonstrate the concept of DataFrame pipelines. Taking these concepts, you can go really fancy with this - for example defining pipelines of pipelines etc.
However, taking this example you can achieve your goal of defining an arbitrary order of functions to be executed on your columns. You can now go one step further and apply one function at a time to specific columns, instead of applying functions across all columns.
For example, you can take my op_on_vals function and modify it so that it achieves what you outlined with row['A1']**2, row['A2']**3 and then use .pipe(op_across_columns, np.sum) to implement what you sketched with
def function(row):
    return row['A1']**2 + row['A2']**3 + row['A3']**4
This shouldn't be too difficult, so I will leave the details of this implementation to you.
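For reference, here is one possible sketch of such a modification. This is just an illustration; op_per_column and the powers dict are names introduced here, not part of the answer above:
def op_per_column(dataframe, funcs):
    """Apply a different function to each column, looked up by column name."""
    return dataframe.apply(lambda col: funcs[col.name](col))

# Hypothetical per-column exponents matching the original sketch.
powers = {'A1': lambda s: s**2, 'A2': lambda s: s**3, 'A3': lambda s: s**4}
result = (df.pipe(filter_columns, r'A\d+')
            .pipe(op_per_column, powers)
            .pipe(op_across_columns, np.sum))
This reproduces row['A1']**2 + row['A2']**3 + row['A3']**4 for every row.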
Edit (2017-07-11)
Here is another piece of code that uses functools.partial in order to create 'function prototypes' of a power function. These can be used to variably set an exponent for the power according to the number in the column names of the DataFrame.
This way we can use the numbers in A1, A2 etc. to calculate value**1, value**2 for each value in the corresponding column. Finally, we can sum them in order to get what you sketched with
row['A1']**2 + row['A2']**3 + row['A3']**4
You can find an excellent explanation of what functools.partial does on PyDanny's Blog. In short, partial freezes some of a function's arguments and returns a new callable.
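A tiny illustration (nothing here is specific to this answer; it is just the standard library behavior):
from functools import partial

def power(base, exponent):
    return base ** exponent

square = partial(power, exponent=2)  # a new callable computing base ** 2
print(square(3))  # prints 9
With that in mind, let's look at the code: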
import pandas as pd
import numpy as np
import re
from functools import partial

def power(base, exponent):
    return base ** exponent

# Create example DataFrame.
np.random.seed(42)
col_names = 'A1 A2 A3 B1 B2 B3 C1 C2 C3'.split()
df = pd.DataFrame(np.random.rand(5, 9), columns=col_names)

# Split 'letter''number' column names into tuples of (letter, number).
match = re.findall(r"([A-Z]+)([0-9]+)", ''.join(df.columns.tolist()))

# Dictionary with 'prototype' functions for each column naming scheme.
func_dict = {'A': power, 'B': power, 'C': power}

# Initialize result columns with zeros.
for letter, _ in match:
    df[letter + '_result'] = np.zeros_like(df[letter + '1'])

# Apply the functions to the columns.
for letter, number in match:
    col_name = ''.join([letter, number])
    the_function = partial(func_dict[letter], exponent=int(number))
    df[letter + '_result'] += df[col_name].apply(the_function)

print(df)
Output:
A1 A2 A3 B1 B2 B3 C1 \
0 0.374540 0.950714 0.731994 0.598658 0.156019 0.155995 0.058084
1 0.708073 0.020584 0.969910 0.832443 0.212339 0.181825 0.183405
2 0.431945 0.291229 0.611853 0.139494 0.292145 0.366362 0.456070
3 0.514234 0.592415 0.046450 0.607545 0.170524 0.065052 0.948886
4 0.304614 0.097672 0.684233 0.440152 0.122038 0.495177 0.034389
C2 C3 A_result B_result C_result
0 0.866176 0.601115 1.670611 0.626796 1.025551
1 0.304242 0.524756 1.620915 0.883542 0.420470
2 0.785176 0.199674 0.745815 0.274016 1.080532
3 0.965632 0.808397 0.865290 0.636899 2.409623
4 0.909320 0.258780 0.634494 0.576463 0.878582
You can replace the power functions in the func_dict with your own functions, for example one that sums the values with another value or performs some sort of fancy statistical calculations with them.
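For example, here is a hypothetical variant where the B-columns are scaled rather than exponentiated; the only requirement is that the function accepts the keyword argument that partial freezes:
def scale(base, exponent):
    # 'exponent' is reused as a plain multiplier here; the parameter name
    # just has to match the keyword frozen by partial in the loop above.
    return base * exponent

func_dict = {'A': power, 'B': scale, 'C': power}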
Using this in combination with the pipeline approach from my earlier edit should give you the tools to get the results that you need.

Related

How to find similarity score between two rows in a pandas data frame

I want to find the similarity between the sentences in any two rows.
In my sample data frame:
import pandas as pd
data = [f'Sent {i}' for i in range(10)]
df = pd.DataFrame(data=data, columns=['Sentences'])
Sentences
0 Sent 0
1 Sent 1
2 Sent 2
3 Sent 3
4 Sent 4
5 Sent 5
6 Sent 6
7 Sent 7
8 Sent 8
9 Sent 9
I want to find the similarity score between every pair of sentences, for n sentences in total.
Approach #1: Create two new columns: the first contains each sentence copied $n$ times (where $n$ is the total number of sentences), giving $n^2$ rows; the second contains all the sentences copied $n$ times as well (but in groups), again giving $n^2$ rows.
From here I can get the similarities and put them in just one column.
Approach #2: Write a loop that iterates over the sentences and computes all $\binom{n}{2}$ similarity scores. (For now I don't know how to do this.)
How would I do approach #2? Are there better ways to do this?
One option:
from difflib import SequenceMatcher
from itertools import combinations
import numpy as np
import pandas as pd
df = pd.DataFrame({'col': ['ABC', 'ABCD', 'DEF', 'GHI']})
# set up empty array
a = np.zeros((len(df), len(df)))
# compute difference for each unique pair and assign upper triangle
a[np.triu_indices(len(df), k=1)] = [SequenceMatcher(None, s1, s2).ratio()
                                    for s1, s2 in combinations(df['col'], r=2)]
# complete the lower triangle and the diagonal
a += a.T
np.fill_diagonal(a, 1)
# convert to DataFrame
out = pd.DataFrame(a, columns=df['col'].values, index=df['col'].values).round(2)
Output:
ABC ABCD DEF GHI
ABC 1.00 0.86 0.00 0.0
ABCD 0.86 1.00 0.29 0.0
DEF 0.00 0.29 1.00 0.0
GHI 0.00 0.00 0.00 1.0
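The square form also makes individual lookups straightforward, e.g.:
print(out.loc['ABC', 'ABCD'])  # 0.86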

How to decode column value from rare label by matching column names

I have two dataframes like as shown below
import numpy as np
import pandas as pd
from numpy.random import default_rng
rng = default_rng(100)
cdf = pd.DataFrame({'Id': [1, 2, 3, 4, 5],
                    'grade': rng.choice(list('ACD'), size=5),
                    'dash': rng.choice(list('PQRS'), size=5),
                    'dumeel': rng.choice(list('QWER'), size=5),
                    'dumma': rng.choice(1234, size=5),
                    'target': rng.choice([0, 1], size=5)})
tdf = pd.DataFrame({'Id': [1, 1, 1, 1, 3, 3, 3],
                    'feature': ['grade=Rare', 'dash=Q', 'dumma=rare', 'dumeel=R', 'dash=Rare', 'dumma=rare', 'grade=D'],
                    'value': [0.2, 0.45, -0.32, 0.56, 1.3, 1.5, 3.7]})
My objective is to:
a) Replace the Rare or rare values in the feature column of the tdf dataframe with the original value from the cdf dataframe.
b) To identify the original value, use the string before the = in entries like =Rare, =rare, etc. That string is the column name in the cdf dataframe from which the original value (to replace the rare label) can be looked up.
I was trying something like the below, but I am not sure how to go on from here:
replace_df = cdf.merge(tdf, how='inner', on='Id')
replace_df["replaced_feature"] = np.where(replace_df["feature"].str.contains('rare', regex=True)
                                          & replace_df["feature"].str.split('='))  # incomplete
I have to apply this to big data, where I have a million rows and more than 1000 replacements to make like this.
I expect my output to be as shown below.
Here is one possible approach using MultiIndex.map to substitute values from cdf into tdf:
s = tdf['feature'].str.split('=')
m = s.str[1].isin(['rare', 'Rare'])  # rows whose value is the Rare/rare label
# Map (Id, column name) pairs onto the flattened (stacked) cdf to get the original values.
v = tdf[m].set_index(['Id', s[m].str[0]]).index.map(cdf.set_index('Id').stack())
tdf.loc[m, 'feature'] = s[m].str[0] + '=' + v.astype(str)
print(tdf)
Id feature value
0 1 grade=D 0.20
1 1 dash=Q 0.45
2 1 dumma=1123 -0.32
3 1 dumeel=R 0.56
4 3 dash=P 1.30
5 3 dumma=849 1.50
6 3 grade=D 3.70
Another option, using a list comprehension and a merge:
# list comprehension: drop the rare label where present, keep 'col=value' otherwise
tdf['feature'] = [x if y.lower() == 'rare' else x + '=' + y
                  for x, y in tdf['feature'].str.split('=')]
# create a mask where feature is one of the columns of cdf
mask = tdf['feature'].isin(cdf.columns)
# use loc to filter the frame and merge the stacked cdf on Id and feature
tdf.loc[mask, 'feature'] = (tdf.loc[mask, 'feature'] + '=' +
                            tdf.loc[mask].merge(cdf.set_index('Id').stack().to_frame(),
                                                right_index=True, left_on=['Id', 'feature'])[0].astype(str))
Id feature value
0 1 grade=D 0.20
1 1 dash=Q 0.45
2 1 dumma=1123 -0.32
3 1 dumeel=R 0.56
4 3 dash=P 1.30
5 3 dumma=849 1.50
6 3 grade=D 3.70
My feeling is there's no need to look for Rare values at all.
Extract the column name from tdf to look up in cdf, then flatten (unstack) cdf to extract the right values. Note that this replaces every feature value from cdf, not just the Rare ones, which is why row 6 ends up as grade=A below:
r = tdf.set_index('Id')['feature'].str.split('=').str[0].str.lower()
tdf['feature'] = r.values + '=' + cdf.set_index('Id').unstack() \
                                     .loc[list(zip(r.values, r.index))] \
                                     .astype(str).values
Output:
>>> tdf
Id feature value
0 1 grade=D 0.20
1 1 dash=Q 0.45
2 1 dumma=1123 -0.32
3 1 dumeel=R 0.56
4 3 dash=P 1.30
5 3 dumma=849 1.50
6 3 grade=A 3.70
>>> r
Id # <- the index is the row of cdf
1 grade # <- the values are the column of cdf
1 dash
1 dumma
1 dumeel
3 dash
3 dumma
3 grade
Name: feature, dtype: object

Merging pandas dataframes, alternating rows without sorting rows

I'm trying to mimic an SPSS-style correlation table in my pandas output to make it easier to read for supervisors who are used to seeing matrices laid out this way (and are annoyed that I don't use SPSS anymore because the output is harder for them to read).
This means a table where each p-value is placed directly above the corresponding correlation coefficient. I have easily produced both the p-values and the coefficients and saved each into a separate DataFrame like the ones below.
pvals
T 4 Rw Af
T |0.00|0.05|0.24|0.01
4 |0.05|0.00|0.76|0.03
Rw|0.24|0.76|0.00|0.44
...
rs
T 4 Rw Af
T |1.00|0.65|0.28|0.44
4 |0.65|1.00|0.01|0.03
Rw|-0.03|0.01|1.00|0.32
...
What I'd like to do is make a table where the two dataframes are merged without changing the order of the index. It would look like
T |P |0.00|0.05|0.24|0.01
  |r |1.00|0.65|0.28|0.44
4 |P |0.05|0.00|0.76|0.03
  |r |0.65|1.00|0.01|0.03
...
Now, I understand that if my row labels were in alphabetical order I could use something like
pd.concat([pvals, rs]).sort_index(kind='merge')
However, my rows have descriptive, non-ordered names, so this doesn't work because it reorders the index alphabetically. I also know that
df.corr()
will produce a matrix like the rs example I've given above but this is not what I'm looking for.
If anyone has any advice I'd really appreciate it.
Kev
You can use a helper MultiIndex level created with np.arange and DataFrame.set_index with append=True, add the keys parameter for the P and r labels, merge-sort by the helper level, remove that level, and finally change the order of the remaining levels with DataFrame.swaplevel:
s1 = pvals.set_index(np.arange(len(pvals)), append=True)
s2 = rs.set_index(np.arange(len(rs)), append=True)
df = (pd.concat([s1, s2], keys=('P','r'))
.sort_index(kind='merge', level=2)
.reset_index(level=2, drop=True)
.swaplevel(0,1))
print(df)
T 4 Rw Af
T P 0.00 0.05 0.24 0.01
r 1.00 0.65 0.28 0.44
4 P 0.05 0.00 0.76 0.03
r 0.65 1.00 0.01 0.03
Rw P 0.24 0.76 0.00 0.44
r -0.03 0.01 1.00 0.32
Asker Edit
This answer worked once the code was changed to
s1 = pvals.assign(a=np.arange(len(pvals))).set_index('a', append=True)
s2 = rs.assign(a=np.arange(len(rs))).set_index('a', append=True)
df = (pd.concat([s1, s2], keys=('P', 'r'))
        .sort_index(kind='merge', level=2)
        .reset_index(level=2, drop=True)
        .swaplevel(0, 1))
as recommended by the answerer.
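A minimal alternative sketch that skips the helper level entirely: since .loc with a list of labels preserves the order you pass in, you can concatenate with keys, swap the levels, and select by the original row order:
df = (pd.concat([pvals, rs], keys=('P', 'r'))
        .swaplevel(0, 1)
        .loc[pvals.index])  # .loc keeps the original (unsorted) label order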

Calculate mean of each subsequent group of 2 rows with pandas

I'm trying to calculate the mean of each subsequent group of 2 rows across the whole data frame. I think I got that with the following line:
df.groupby(np.arange(len(df))//2).mean()
However, the problem is that not all values are numeric. If the second row of a group is numeric while the first is not, the result should simply be the second row's value rather than a mean. If both rows are non-numeric, the value should be set to 0.
For a better visualization, I have this dataframe:
Well Ct
0 A1 Undetermined
1 A2 Undetermined
2 A3 Undetermined
3 A4 41.2
4 B1 42
5 B2 43
What I'm trying to obtain is:
Well Ct
0 A1-A2 0.0
1 A3-A4 41.2
2 B1-B2 42.5
Is there any way to do this, or a similar question that has already been posted?
Use pandas.to_numeric to coerce non-numeric values to NaNs (which pandas will ignore by default when calculating means), then use groupby + agg to assign your final groups.
df.Ct = pd.to_numeric(df.Ct, errors='coerce')
df.groupby(np.arange(df.shape[0]) // 2).agg({'Well': '-'.join, 'Ct': 'mean'}).fillna(0)
Well Ct
0 A1-A2 0.0
1 A3-A4 41.2
2 B1-B2 42.5
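If you prefer not to overwrite df.Ct in place, the same result can be written as one chained expression:
out = (df.assign(Ct=pd.to_numeric(df.Ct, errors='coerce'))
         .groupby(np.arange(len(df)) // 2)
         .agg({'Well': '-'.join, 'Ct': 'mean'})
         .fillna(0))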

Creating DataFrame with Hierarchical Columns

What is the easiest way to create a DataFrame with hierarchical columns?
I am currently creating a DataFrame from a dict of names -> Series using:
df = pd.DataFrame(data=serieses)
I would like to use the same columns names but add an additional level of hierarchy on the columns. For the time being I want the additional level to have the same value for columns, let's say "Estimates".
I am trying the following but that does not seem to work:
pd.DataFrame(data=serieses,columns=pd.MultiIndex.from_tuples([(x, "Estimates") for x in serieses.keys()]))
All I get is a DataFrame with all NaNs.
For example, what I am looking for is roughly:
l1 Estimates
l2 one two one two one two one two
r1 1 2 3 4 5 6 7 8
r2 1.1 2 3 4 5 6 71 8.2
where l1 and l2 are the labels for the MultiIndex
This appears to work:
import pandas as pd
data = {'a': [1,2,3,4], 'b': [10,20,30,40],'c': [100,200,300,400]}
df = pd.concat({"Estimates": pd.DataFrame(data)}, axis=1, names=["l1", "l2"])
l1 Estimates
l2 a b c
0 1 10 100
1 2 20 200
2 3 30 300
3 4 40 400
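The same dict-of-frames pattern extends to more than one top-level label, e.g. (with a hypothetical second frame standing in for errors):
df = pd.concat({"Estimates": pd.DataFrame(data),
                "Errors": pd.DataFrame(data)},  # placeholder second frame
               axis=1, names=["l1", "l2"])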
I know the question is really old, but for pandas version 0.19.1 one can use direct dict initialization:
d = {('a','b'): [1,2,3,4], ('a','c'): [5,6,7,8]}
df = pd.DataFrame(d, index=['r1','r2','r3','r4'])
df.columns.names = ('l1', 'l2')
print(df)
l1 a
l2 b c
r1 1 5
r2 2 6
r3 3 7
r4 4 8
I'm not sure, but I think a dict as input for your DataFrame and a MultiIndex don't play well together. Using an array as input instead makes it work.
I often prefer dicts as input though; one way is to set the columns after creating the df:
import numpy as np
import pandas as pd
data = {'a': [1,2,3,4], 'b': [10,20,30,40], 'c': [100,200,300,400]}
df = pd.DataFrame(np.array(list(data.values())).T, index=['r1','r2','r3','r4'])
tups = list(zip(*[['Estimates']*len(data), data.keys()]))
df.columns = pd.MultiIndex.from_tuples(tups, names=['l1','l2'])
l1 Estimates
l2 a b c
r1 1 10 100
r2 2 20 200
r3 3 30 300
r4 4 40 400
Or when using an array as input for the df:
data_arr = np.array([[1,2,3,4],[10,20,30,40],[100,200,300,400]])
tups = list(zip(*[['Estimates']*data_arr.shape[0], ['a','b','c']]))
df = pd.DataFrame(data_arr.T, index=['r1','r2','r3','r4'],
                  columns=pd.MultiIndex.from_tuples(tups, names=['l1','l2']))
Which gives the same result.
The solution by Rutger Kassies worked in my case, but I have more than one column in the "upper level" of the column hierarchy. I just want to provide what worked for me as an example, since it is a more general case.
First, I have data that looks like this:
> df
(A, a) (A, b) (B, a) (B, b)
0 0.00 9.75 0.00 0.00
1 8.85 8.86 35.75 35.50
2 8.51 9.60 66.67 50.70
3 0.03 508.99 56.00 8.58
I would like it to look like this:
> df
A B
a b a b
0 0.00 9.75 0.00 0.00
1 8.85 8.86 35.75 35.50
...
The solution is:
tuples = df.transpose().index
new_columns = pd.MultiIndex.from_tuples(tuples, names=['Upper', 'Lower'])
df.columns = new_columns
This is counter-intuitive because in order to create the columns, I have to go through the transposed frame's index.
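Note that the transpose is not strictly necessary here; assuming the existing columns really are tuples (not strings), df.columns holds the same tuples, so this should work just as well:
df.columns = pd.MultiIndex.from_tuples(df.columns, names=['Upper', 'Lower'])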
