Create columns having similarity index values - python

How can I create columns that show the respective similarity indices for each row?
This code
def func(name):
    matches = try_test.apply(lambda row: (fuzz.partial_ratio(row['name'], name) >= 85), axis=1)
    return [try_test.word[i] for i, x in enumerate(matches) if x]

try_test.apply(lambda row: func(row['name']), axis=1)
returns the indices that match the condition >= 85. However, I would also be interested in having the values obtained by comparing each field with all the others.
The dataset is
try_test = pd.DataFrame({'word': ['apple', 'orange', 'diet', 'energy', 'fire', 'cake'],
                         'name': ['dog', 'cat', 'mad cat', 'good dog', 'bad dog', 'chicken']})
Help would be very appreciated.
Expected output (values are just an example)
  word     name  sim_index1  sim_index2  sim_index3  ...  sim_index6
 apple      dog         100           0
orange      cat                      100
   ...  mad cat                      0.6         100
On the diagonal there is a value of 100, as I am comparing dog with dog, ...
I might also consider another approach if you think it would be better.

IIUC, you can slightly change your function to get what you want:
def func(name):
    return try_test.apply(lambda row: fuzz.partial_ratio(row['name'], name), axis=1)

print(try_test.apply(lambda row: func(row['name']), axis=1))
0 1 2 3 4 5
0 100 0 33 100 100 0
1 0 100 100 0 33 33
2 33 100 100 29 43 14
3 100 0 29 100 71 0
4 100 33 43 71 100 0
5 0 33 14 0 0 100
That said, more than half of the calculation is not necessary, as the result is a symmetrical matrix and the diagonal is always 100. So if your data is bigger, you could compute the partial_ratio only against the rows before the current row. Then, with a reindex and by completing the full matrix using T (transpose) and np.diag, you can do:
def func_pr(row):
    return (try_test.loc[:row.name-1, 'name']
                    .apply(lambda name: fuzz.partial_ratio(name, row['name'])))

# start at index 1 (second row)
pr = (try_test.loc[1:].apply(func_pr, axis=1)
              .reindex(index=try_test.index,
                       columns=try_test.index)
              .fillna(0)
              .add_prefix('sim_idx')
      )

# complete the result with transpose and diag
pr += pr.to_numpy().T + np.diag(np.ones(pr.shape[0]))*100

# concat
res = pd.concat([try_test, pr.astype(int)], axis=1)
and you get
print(res)
word name sim_idx0 sim_idx1 sim_idx2 sim_idx3 sim_idx4 \
0 apple dog 100 0 33 100 100
1 orange cat 0 100 100 0 33
2 diet mad cat 33 100 100 29 43
3 energy good dog 100 0 29 100 71
4 fire bad dog 100 33 43 71 100
5 cake chicken 0 33 14 0 0
sim_idx5
0 0
1 33
2 14
3 0
4 0
5 100
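An alternative, if the apply-based version gets slow on larger data, is to fill only the upper triangle with a plain Python loop and mirror it afterwards. This is a minimal sketch along the same lines, with try_test as defined in the question, and assuming fuzz is the thefuzz/fuzzywuzzy module used there:
import numpy as np
import pandas as pd
from itertools import combinations
from thefuzz import fuzz   # assumption: same fuzz module as in the question (or: from fuzzywuzzy import fuzz)

names = try_test['name'].tolist()
n = len(names)
sim = np.full((n, n), 100.0)   # the diagonal stays at 100 (each name compared with itself)

# compute each pair only once (upper triangle) and mirror it
for i, j in combinations(range(n), 2):
    score = fuzz.partial_ratio(names[i], names[j])
    sim[i, j] = sim[j, i] = score

pr = pd.DataFrame(sim, index=try_test.index, columns=try_test.index).add_prefix('sim_idx')
res = pd.concat([try_test, pr.astype(int)], axis=1)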


Combine two number columns but exclude zero

I have the following dataframe from a database download that I cleaned up a bit. Unfortunately some of the single numbers were split across two columns (row 9). I'm trying to merge the two columns but exclude the zero values.
           city  crashes  crashes_1  total_crashes
1      ABERDEEN      710          0            710
2  ACHERS LODGE        1          0              1
3          ACME        1          0              1
4       ADVANCE       55          0             55
5         AFTON        2          0              2
6       AHOSKIE      393          0            393
7  AKERS CENTER        1          0              1
8      ALAMANCE       50          0             50
9     ALBEMARLE        1         58             59
So for row 9 I want to end with:
9 ALBEMARLE 1 58 158
I tried a few snippets but nothing seems to work:
df['total_crashes'] = df['crashes'].astype(str).str.zfill(0) + df['crashes_1'].astype(str).str.zfill(0)
df['total_crashes'] = df['total_crashes'].astype(str).replace('\0', '', regex=True)
df['total_crashes'] = df['total_crashes'].apply(lambda x: ''.join(x[x!=0]))
df['total_crashes'] = df['total_crashes'].str.cat(df['total_crashes'], x[x!=0])
df['total_crashes'] = df.drop[0].sum(axis=1)
Thanks for any help.
You can use a where condition:
df['total_crashes'] = df['crashes'].astype(str) + df['crashes_1'].astype(str).where(df['crashes_1'] != 0, "")
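For reference, a minimal self-contained run of that idea (column names as in the question): the where keeps crashes_1 as text only when it is non-zero, otherwise it appends an empty string so the first number is left untouched.
import pandas as pd

df = pd.DataFrame({'city': ['ABERDEEN', 'ALBEMARLE'],
                   'crashes': [710, 1],
                   'crashes_1': [0, 58]})

# concatenate the two numbers as strings, skipping crashes_1 where it is zero
df['total_crashes'] = (df['crashes'].astype(str)
                       + df['crashes_1'].astype(str).where(df['crashes_1'] != 0, ""))
print(df)
#         city  crashes  crashes_1 total_crashes
# 0   ABERDEEN      710          0           710
# 1  ALBEMARLE        1         58           158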

Pandas column aggregation for duplicate values using custom function

I have a dataframe and I want to aggregate the similar ids in a column.
X_train['freq_qd1'] = X_train.groupby('qid1')['qid1'].transform('count')
X_train['freq_qd2'] = X_train.groupby('qid2')['qid2'].transform('count')
I understand the code above, but I want to build a custom function to apply on multiple columns.
I have attached a snapshot of the dataframe for reference. On this dataframe I tried to apply a custom function on qid1 and qid2.
I tried the code below:
def frequency(qid):
    freq = []
    for i in str(qid):
        if i not in freq:
            freq.append(i)
        ids = set()
        if i not in ids:
            ids.add(i)
        freq.append(ids)
    return freq
def extract_simple_feat(fe):
    fe['question1'] = fe['question1'].fillna(' ')
    fe['question2'] = fe['question2'].fillna(' ')
    fe['qid1'] = fe['qid1']
    fe['qid2'] = fe['qid2']
    token_feat = fe.apply(lambda x: get_simple_features(x['question1'],
                                                        x['question2']), axis=1)
    fe['q1_len'] = list(map(lambda x: x[0], token_feat))
    fe['q2_len'] = list(map(lambda x: x[1], token_feat))
    fe['freq_qd1'] = fe.apply(lambda x: frequency(x['qid1']), axis=1)
    fe['freq_qd2'] = fe.apply(lambda x: frequency(x['qid2']), axis=1)
    fe['q1_n_words'] = list(map(lambda x: x[2], token_feat))
    fe['q2_n_words'] = list(map(lambda x: x[3], token_feat))
    fe['word_common'] = list(map(lambda x: x[4], token_feat))
    fe['word_total'] = list(map(lambda x: x[5], token_feat))
    fe['word_share'] = list(map(lambda x: x[6], token_feat))
    return fe

X_train = extract_simple_feat(X_train)
After applying my own implementation I am not getting the desired result; I am attaching a snapshot of the result I got.
The desired result is below.
If someone can help, I would appreciate it, because I am really stuck and cannot rectify it properly.
Here's a small sample input:
  qid1    qid2
    23      24
    25      26
    27      28
318830  318831
359558  318831
384105  318831
413505  318831
451953  318831
530151  318831
I want the aggregation output as:
  qid1    qid2  freq_qid1  freq_qid2
    23      24          1          1
    25      26          1          1
    27      28          1          1
318830  318831          1          6
359558                  1          6
384105                  1          6
413505                  1          6
451953                  1          6
530151                  1          6
Given: (I added an extra row for an edge case)
     qid1    qid2
0      23      24
1      25      26
2      27      28
3  318830  318831
4  359558  318831
5  384105  318831
6  413505  318831
7  451953  318831
8  530151  318831
9  495894    4394
Doing:
def get_freqs(df, cols):
    temp_df = df.copy()
    for col in cols:
        temp_df['freq_' + col] = temp_df.groupby(col)[col].transform('count')
        temp_df.loc[temp_df[col].duplicated(), col] = ''
    return temp_df

df = get_freqs(df, ['qid1', 'qid2'])
print(df)
Output:
     qid1    qid2  freq_qid1  freq_qid2
0      23      24          1          1
1      25      26          1          1
2      27      28          1          1
3  318830  318831          1          6
4  359558                  1          6
5  384105                  1          6
6  413505                  1          6
7  451953                  1          6
8  530151                  1          6
9  495894    4394          1          1
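As a side note, if you only need the counts (applied to the original frame, before the duplicated ids are blanked out), mapping each column to its own value_counts gives the same frequencies without a groupby; a small sketch under that assumption:
# frequency of each id within its own column, without groupby/transform
for col in ['qid1', 'qid2']:
    df['freq_' + col] = df[col].map(df[col].value_counts())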
If I wanted to do more of what you're doing...
Given:
id qid1 qid2 question1 question2 is_duplicate
0 0 1 2 Why is the sky blue? Why isn't the sky blue? 0
1 1 3 4 Why is the sky blue and green? Why isn't the sky pink? 0
2 2 5 6 Where are we? Moon landing a hoax? 0
3 3 7 8 Am I real? Chickens aren't real. 0
4 4 9 10 If this Fake, surely it is? Oops I did it again. 0
Doing:
def do_stuff(df):
    t_df = df.copy()
    quids = [x for x in t_df.columns if 'qid' in x]
    questions = [x for x in t_df.columns if 'question' in x]
    for col in quids:
        t_df['freq_' + col] = t_df.groupby(col)[col].transform('count')
        t_df.loc[t_df[col].duplicated(), col] = ''
    for i, col in enumerate(questions):
        t_df[f'q{i+1}_len'] = t_df[col].str.len()
        t_df[f'q{i+1}_n_words'] = t_df[col].str.split(' ').apply(lambda x: len(x))
    return t_df

df = do_stuff(df)
print(df)
Output:
id qid1 qid2 question1 question2 is_duplicate freq_qid1 freq_qid2 q1_len q1_n_words q2_len q2_n_words
0 0 1 2 Why is the sky blue? Why isn't the sky blue? 0 1 1 20 5 23 5
1 1 3 4 Why is the sky blue and green? Why isn't the sky pink? 0 1 1 30 7 23 5
2 2 5 6 Where are we? Moon landing a hoax? 0 1 1 13 3 20 4
3 3 7 8 Am I real? Chickens aren't real. 0 1 1 10 3 21 3
4 4 9 10 If this Fake, surely it is? Oops I did it again. 0 1 1 27 6 20 5

new column based on multi column condition

import pandas as pd

df = pd.DataFrame({
    'cakeName': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'],
    'chocolate%': ['20', '70', '30', '50', '50', '10', '75', '20'],
    'milk%': ['50', '20', '40', '0', '30', '80', '15', '10'],
    'straberry%': ['30', '10', '30', '50', '20', '10', '10', '70'],
})
df.head(10)
I would like to create a new column 'cakeType' based on the column values.
Objective:
- scan through each cakeName
- if a single ingredient stands out, >= 75, then return a value in 'cakeType'
- for example: cake 'G' has chocolate% >= 75, so return 'choco', etc.
- else, if none of the ingredients reaches 75, it's just a 'normal cake'
I looked for an answer in the forum, but nothing seems to quite fit, as I will have many, many ingredient columns.
So is scanning each row for a value >= 75 the better way to do it?
Thanks a lot
Method 1: np.select:
Good use case for np.select where we define our conditions and based on those conditions we select choices. Plus we have a default value if none of the conditions is met:
import numpy as np

# the percentage columns in the sample data are strings, so make them numeric first
df.iloc[:, 1:] = df.iloc[:, 1:].astype(int)

conditions = [
    df['chocolate%'].ge(75),
    df['milk%'].ge(75),
    df['straberry%'].ge(75)
]
choices = ['choco', 'milk', 'strawberry']

df['cakeType'] = np.select(conditions, choices, default='normal cake')
cakeName chocolate% milk% straberry% cakeType
0 A 20 50 30 normal cake
1 B 70 20 10 normal cake
2 C 30 40 30 normal cake
3 D 50 0 50 normal cake
4 E 50 30 20 normal cake
5 F 10 80 10 milk
6 G 75 15 10 choco
7 H 20 10 70 normal cake
Method 2: idxmax, Series.where and fillna:
First we get, for each row, the name of the first column whose value is >= 75. Then we mask the rows that do not have any value >= 75, strip the trailing '%' from the column name, and fill the masked rows with 'normal cake' (as above, this assumes the percentage columns are numeric):
m1 = df.iloc[:, 1:].ge(75).idxmax(axis=1)
newcol = m1.where(df.iloc[:, 1:].ge(75).any(axis=1)).str[:-1].fillna('normal cake')
df['cakeType'] = newcol
cakeName chocolate% milk% straberry% cakeType
0 A 20 50 30 normal cake
1 B 70 20 10 normal cake
2 C 30 40 30 normal cake
3 D 50 0 50 normal cake
4 E 50 30 20 normal cake
5 F 10 80 10 milk
6 G 75 15 10 chocolate
7 H 20 10 70 normal cake
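Since the real data will have many ingredient columns, a version that doesn't hard-code the column names may be closer to what's needed. A sketch along the lines of Method 2, assuming every ingredient column ends in '%':
# treat every column ending in '%' as an ingredient column
ing_cols = [c for c in df.columns if c.endswith('%')]
ing = df[ing_cols].astype(int)          # the sample data stores the percentages as strings

has_dominant = ing.ge(75).any(axis=1)   # rows where some ingredient is >= 75
dominant = ing.idxmax(axis=1).str[:-1]  # name of the largest ingredient, '%' stripped

df['cakeType'] = dominant.where(has_dominant, 'normal cake')
Note that, like Method 2, this returns the full column name ('chocolate', 'milk', ...) rather than a custom label such as 'choco'.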

Cumulative count on two additional columns based on a few conditions

I am struggling with pandas conditions, especially inside a group by.
I have the following dataframe
data = {'script': ['a', 'a', 'a', 'b', 'b'],
        'call_put': ['C', 'P', 'P', 'C', 'P'],
        'strike': [280, 260, 275, 280, 285],
        'premium': [10, 20, 35, 38, 50]}
df = pd.DataFrame(data)
df['t'] = df['premium'].cumsum()
df
script call_put strike premium t
0 a C 280 10 10
1 a P 260 20 30
2 a P 275 35 65
3 b C 280 38 103
4 b P 285 50 153
I want two additional columns with a running count based on script and call_put, where premium > 0. Expected output:
script  call_put    t  k1  k2
     a         C   10   1   1   call_put is "C", so the first value should be 1; k2 should also be one as the count of "P" is 0
     a         P   30   1   1   call_put is "P", so the second column counts 1
     a         P   65   1   2   the value is "P" again, so increase the cumulative count by 1
     b         C  103   1   1   script value changed; "C" is 1 and "P" = 0, so 1
     b         P  153   1   1   "C" = 1 and "P" = 1
can you please tell me how to do this?
Based on your explanation, this is what you need.
df['k1'] = df.loc[df["premium"]>0].groupby(["script"])['call_put'].apply(lambda x: np.cumsum(x=='C'))
df['k2'] = df.loc[df["premium"]>0].groupby(["script"])['call_put'].apply(lambda x: np.cumsum(x=='P'))
Output
script call_put strike premium t k1 k2
a C 280 10 10 1 0
a P 260 20 30 1 1
a P 275 35 65 1 2
b C 280 38 103 1 0
b P 285 50 153 1 1
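A variant of the same idea that avoids groupby.apply, sketched under the assumption that df is the frame defined above: build boolean masks for "C", "P" and premium > 0, then take their running sum within each script group.
# running count of "C" (k1) and "P" (k2) rows with positive premium, per script
valid = df['premium'] > 0
df['k1'] = ((df['call_put'] == 'C') & valid).astype(int).groupby(df['script']).cumsum()
df['k2'] = ((df['call_put'] == 'P') & valid).astype(int).groupby(df['script']).cumsum()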
Maybe you need four columns to represent the cumulative sum, as there will be 4 different combinations of script and call_put. The following code does what you described; the count starts from zero here.
data = {'script': ['a', 'a', 'a', 'b', 'b'],
        'call_put': ['C', 'P', 'P', 'C', 'P'],
        'strike': [280, 260, 275, 280, 285],
        'premium': [10, 20, 35, 38, 50]}
df = pd.DataFrame(data)
df['t'] = df['premium'].cumsum()

## column cond_col holds the unique combination of script, call_put and premium > 0
df["cond_col"] = df["script"] + "-" + df["call_put"] + "-" + (df["premium"] > 0).astype(str)

## a new column for each unique combination
for col in np.unique(df["cond_col"]):
    df[col] = df["cond_col"] == col

## do cumsum on each unique combination column
for col in np.unique(df["cond_col"]):
    df[col] = df[col].cumsum()

## maybe the solution you want is up to here
## if you want to combine the columns, you can do the following
df["k1"] = df["a-C-True"].where(df["cond_col"] == "a-C-True", df["b-C-True"])
df["k2"] = df["a-P-True"].where(df["cond_col"] == "a-P-True", df["b-P-True"])
df
Output
script call_put strike premium t cond_col a-C-True a-P-True b-C-True b-P-True k1 k2
0 a C 280 10 10 a-C-True 1 0 0 0 1 0
1 a P 260 20 30 a-P-True 1 1 0 0 0 1
2 a P 275 35 65 a-P-True 1 2 0 0 0 2
3 b C 280 38 103 b-C-True 1 2 1 0 1 0
4 b P 285 50 153 b-P-True 1 2 1 1 1 1
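As a side note, the two loops that build and cumsum the indicator columns could be collapsed with pd.get_dummies; a small sketch, assuming df and cond_col are as defined above and the loops have not been run yet:
# one 0/1 indicator column per script/call_put/premium>0 combination,
# then a running count of each combination
counts = pd.get_dummies(df['cond_col']).cumsum()
df = df.join(counts)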

Attributes/information contained in DataFrame column names

I have some data imported from a csv, to create something similar I used this:
data = pd.DataFrame([[1,0,2,3,4,5],[0,1,2,3,4,5],[1,1,2,3,4,5],[0,0,2,3,4,5]], columns=['split','sex', 'group0Low', 'group0High', 'group1Low', 'group1High'])
means = data.groupby(['split','sex']).mean()
so the dataframe looks something like this:
group0Low group0High group1Low group1High
split sex
0 0 2 3 4 5
1 2 3 4 5
1 0 2 3 4 5
1 2 3 4 5
You'll notice that each column actually contains 2 variables (group# and level, Low/High). (It was set up this way for running repeated measures anova in SPSS.)
I want to split the columns up, so I can also group by "group", like this (I actually screwed up the order of the numbers, but hopefully the idea is clear):
                 low  high
split sex group
0     0   0       95   265
          1      123    54
      1   0      120   220
          1       98   111
1     0   0      150   190
          1      211   300
      1   0      139    86
          1      132   250
How do I achieve this?
The first trick is to gather the columns into a single column using stack:
In [6]: means
Out[6]:
group0Low group0High group1Low group1High
split sex
0 0 2 3 4 5
1 2 3 4 5
1 0 2 3 4 5
1 2 3 4 5
In [13]: stacked = means.stack().reset_index(level=2)
In [14]: stacked.columns = ['group_level', 'mean']
In [15]: stacked.head(2)
Out[15]:
group_level mean
split sex
0 0 group0Low 2
0 group0High 3
Now we can do whatever string operations we want on group_level using pd.Series.str as follows:
In [18]: stacked['group'] = stacked.group_level.str[:6]
In [21]: stacked['level'] = stacked.group_level.str[6:]
In [22]: stacked.head(2)
Out[22]:
group_level mean group level
split sex
0 0 group0Low 2 group0 Low
0 group0High 3 group0 High
Now you're in business and you can do whatever you want. For example, sum each group/level:
In [31]: stacked.groupby(['group', 'level']).sum()
Out[31]:
mean
group level
group0 High 12
Low 8
group1 High 20
Low 16
How do I group by everything?
If you want to group by split, sex, group and level you can do:
In [113]: stacked.reset_index().groupby(['split', 'sex', 'group', 'level']).sum().head(4)
Out[113]:
                       mean
split sex group  level
0     0   group0 High     3
                 Low      2
          group1 High     5
                 Low      4
What if the split is not always at location 6?
This SO answer will show you how to do the splitting more intelligently.
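One option in that direction (not the linked answer, just a common pandas pattern) is a regular expression with Series.str.extract, which handles group names of any length:
# split "group0Low" into group="group0" and level="Low", regardless of how many digits follow "group"
stacked[['group', 'level']] = stacked['group_level'].str.extract(r'(group\d+)(Low|High)')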
This can be done by first constructing a multi-level index on the column names and then reshaping the dataframe with stack.
import pandas as pd
import numpy as np
# some artificial data
# ==================================
multi_index = pd.MultiIndex.from_arrays([[0,0,1,1], [0,1,0,1]], names=['split', 'sex'])
np.random.seed(0)
df = pd.DataFrame(np.random.randint(50,300, (4,4)), columns='group0Low group0High group1Low group1High'.split(), index=multi_index)
df
group0Low group0High group1Low group1High
split sex
0 0 222 97 167 242
1 117 245 153 59
1 0 261 71 292 86
1 137 120 266 138
# processing
# ==============================
level_group = np.where(df.columns.str.contains('0'), 0, 1)
# output: array([0, 0, 1, 1])
level_low_high = np.where(df.columns.str.contains('Low'), 'low', 'high')
# output: array(['low', 'high', 'low', 'high'], dtype='<U4')
multi_level_columns = pd.MultiIndex.from_arrays([level_group, level_low_high], names=['group', 'val'])
df.columns = multi_level_columns
df.stack(level='group')
val high low
split sex group
0 0 0 97 222
1 242 167
1 0 245 117
1 59 153
1 0 0 71 261
1 86 292
1 0 120 137
1 138 266
