New column based on multi-column condition - python

import pandas as pd
df = pd.DataFrame({
    'cakeName': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'],
    'chocolate%': ['20', '70', '30', '50', '50', '10', '75', '20'],
    'milk%': ['50', '20', '40', '0', '30', '80', '15', '10'],
    'straberry%': ['30', '10', '30', '50', '20', '10', '10', '70'],
})
df.head(10)
I would like to create a new column 'cakeType' based on the column values.
Objective:
- scan through each cakeName
- if a single ingredient stands out, i.e. >= 75, then return a value in 'cakeType'
- for example: cake 'G' has chocolate% >= 75, so return 'choco', etc.
- else, if no ingredient reaches 75, it is just a 'normal cake'
I have searched the forums for an answer, but nothing quite fits, as I will have many, many ingredient columns.
So is scanning each row for a value >= 75 a better way to do it?
Thanks a lot

Method 1: np.select:
Good use case for np.select: we define our conditions, and based on those conditions we select choices, with a default value if none of the conditions is met. Since the ingredient columns were created as strings, they are converted to numeric first:
import numpy as np

# the ingredient columns were built as strings, so make them numeric before comparing
cols = ['chocolate%', 'milk%', 'straberry%']
df[cols] = df[cols].apply(pd.to_numeric)

conditions = [
    df['chocolate%'].ge(75),
    df['milk%'].ge(75),
    df['straberry%'].ge(75),
]
choices = ['choco', 'milk', 'strawberry']
df['cakeType'] = np.select(conditions, choices, default='normal cake')
cakeName chocolate% milk% straberry% cakeType
0 A 20 50 30 normal cake
1 B 70 20 10 normal cake
2 C 30 40 30 normal cake
3 D 50 0 50 normal cake
4 E 50 30 20 normal cake
5 F 10 80 10 milk
6 G 75 15 10 choco
7 H 20 10 70 normal cake
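If there really are many ingredient columns, the conditions and choices lists don't have to be written out by hand. As a rough sketch (not part of the original answer, reusing the numeric conversion above, and assuming every ingredient column name ends in '%'), the labels then come straight from the column names, e.g. 'chocolate' rather than 'choco':
ingredient_cols = [c for c in df.columns if c.endswith('%')]
conditions = [df[c].ge(75) for c in ingredient_cols]
choices = [c[:-1] for c in ingredient_cols]   # strip the trailing '%'
df['cakeType'] = np.select(conditions, choices, default='normal cake')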
Method 2: idxmax, Series.where and fillna:
First we get, for each row, the name of the first column whose value is >= 75 (idxmax). Then we blank out the rows that do not have any value >= 75 at all (Series.where), strip the trailing '%', and fill those rows with 'normal cake':
mask = df[cols].ge(75)                 # reuses the numeric ingredient columns from Method 1
m1 = mask.idxmax(axis=1)               # per row, the first column name with a value >= 75
newcol = m1.where(mask.any(axis=1)).str[:-1].fillna('normal cake')
df['cakeType'] = newcol
cakeName chocolate% milk% straberry% cakeType
0 A 20 50 30 normal cake
1 B 70 20 10 normal cake
2 C 30 40 30 normal cake
3 D 50 0 50 normal cake
4 E 50 30 20 normal cake
5 F 10 80 10 milk
6 G 75 15 10 chocolate
7 H 20 10 70 normal cake
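Why the mask with any(axis=1) is needed: for a row with no value >= 75, idxmax on the boolean mask still returns the first column label, so without it cake 'A' would be tagged 'chocolate' instead of 'normal cake'. A quick check, using the mask variable from Method 2:
mask.idxmax(axis=1)   # every row gets a column name, even rows that are all False
mask.any(axis=1)      # False for rows with no value >= 75; Series.where turns those into NaN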

Related

How to group various groups in python into one

I have a dataset which I want to group by age.
So, here is the first part of the dataset:
It is a simulation of inventory data. Größe is the number of people with the age (Alter) 15, Risiko gives every person a number, and Geschlecht is female or male.
I want to add a column "Group" and give everyone aged 15-19 one number, everyone aged 20-24 the next number, and so on.
How can I do this?
You can use map with a helper function to create the new column like so:
def return_age_from_range(age):
    # the max value in range() is excluded, so remember to add +1 to the range you want
    if age in range(15, 20):
        return 1
    elif age in range(20, 25):
        return 2
    # and so on...

df['Group'] = df['Alter'].map(return_age_from_range)
Use numpy.select:
In [488]: import numpy as np
In [489]: conds = [df['Alter'].between(15,19), df['Alter'].between(20,24), df['Alter'].between(25,29)]
In [490]: choices = [1,2,3]
In [493]: df['Group'] = np.select(conds, choices)
In [494]: df
Out[494]:
Größe Risiko Geschlecht Alter Group
0 95 1 F 15 1
1 95 2 F 15 1
2 95 3 M 15 1
3 95 4 F 15 1
4 95 5 M 15 1
5 95 6 M 15 1
6 95 7 M 15 1
7 95 8 F 15 1
8 95 9 M 15 1
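If the bins keep following the same 5-year pattern, pd.cut can build the group numbers without listing every range by hand. A rough sketch (not from either answer above; the upper edge of 99 is just an assumption):
import pandas as pd

bins = list(range(14, 100, 5))                        # edges 14, 19, 24, ... -> bins (14, 19], (19, 24], ...
df['Group'] = pd.cut(df['Alter'], bins=bins, labels=False) + 1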

Most pythonic way to search for values in one column whose value in another column is above a threshold

I have a pandas DataFrame df as follows:
siren ratio
1 20
2 25
1 40
3 16
3 19
4 35
My goal is to have a df2 with only the siren values whose ratio is above 30 at least once, as follows:
siren ratio
1 20
1 40
4 35
Today, I do it in two steps:
First, I use a filter to get all the unique siren values with a ratio above 30:
value_30 = df[df["ratio"] > 30]["siren"].unique()
Then, I use value_30 as a list to filter my df and get my df2.
However, I'm not satisfied with this solution and I think there is a more pythonic way to do this. Any ideas?
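For reference, the second filtering step described above comes down to something like (the isin call is a paraphrase of "use value_30 as a list to filter my df"):
df2 = df[df["siren"].isin(value_30)]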
Use groupby.filter
res = df.groupby(df.siren).filter(lambda x: x["ratio"].max() > 30)
print(res)
Output
siren ratio
0 1 20
2 1 40
5 4 35
Try with groupby and transform:
value_30 = df[df.groupby("siren")["ratio"].transform("max")>30]
>>> value_30
siren ratio
0 1 20
2 1 40
5 4 35
df[df['ratio'].gt(30).groupby(df['siren']).transform('max')]
siren ratio
0 1 20
2 1 40
5 4 35

Python: Memory efficient, quick lookup in python for 100 million pairs of data?

This is my first time asking a question on here, so apologies if I am doing something wrong.
I am looking to create some sort of dataframe/dict/list where I can check if the ID in one column has seen a specific value in another column before.
For example, for one pandas DataFrame like this (90 million rows):
ID Another_ID
1 10
1 20
2 50
3 10
3 20
4 30
And another like this (10 million rows):
ID Another_ID
1 30
2 30
2 50
2 20
4 30
5 70
I want to end up with a third column that is like this:
ID Another_ID seen_before
1 30 0
2 30 0
2 50 1
2 20 0
4 30 1
5 70 0
I am looking for a memory efficient but quick way to do this, any ideas? Thanks!
Merge is a good idea; here, you want to merge on both columns:
df1['seen_before'] = 1
df2.merge(df1, on=['ID', 'Another_ID'], how='left')
Output:
ID Another_ID seen_before
0 1 30 NaN
1 2 30 NaN
2 2 50 1.0
3 2 20 NaN
4 4 30 1.0
5 5 70 NaN
Note: this assumes that df1 has no duplicates. If you are not sure about this, replace df1 with df1.drop_duplicates() in merge.
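To get the 0/1 integers from the expected output, the NaN values can then be filled and cast, for example:
out = df2.merge(df1, on=['ID', 'Another_ID'], how='left')
out['seen_before'] = out['seen_before'].fillna(0).astype(int)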
Note: the how argument is important in the merge; see the comments in the code as well. np.where is quite efficient, but I have never worked with 100 million rows, so I'd ask the OP to let us know how it goes.
Code:
import pandas as pd
import numpy as np
left = pd.DataFrame(data = {'ID':[1, 1, 2, 3, 3, 4], 'Another_ID': [10, 20, 50, 10, 20, 30]})
right = pd.DataFrame(data = {'ID':[1 , 2 , 2 , 2 , 4 , 5], 'Another_ID': [30 , 30 , 50 , 20 , 30 , 70]})
print(left, '\n', right)
res = pd.merge(left, right, how='right', on='ID')
# Another_ID_x showed up as float despite dtype as int on both right and left
res.fillna(value=0, inplace=True) # required for astype to work in next step
res['Another_ID_x'] = res['Another_ID_x'].astype(int)
res['Another_ID_x'] = np.where(res.Another_ID_x == res.Another_ID_y, 1, 0 )
res.rename(columns={'Another_ID_x': 'seen_before'}, inplace=True)
res.drop_duplicates(inplace=True)
print(res)
Output:
    Another_ID
ID
1           10
1           20
2           50
3           10
3           20
4           30
    Another_ID
ID
1           30
2           30
2           50
2           20
4           30
5           70
ID seen_before Another_ID_y
0 1 0 30
2 2 0 30
3 2 1 50
4 2 0 20
5 4 1 30
6 5 0 70
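A different sketch, closer to the plain set lookup the question mentions and not taken from either answer above (using the df1/df2 names from the first answer); note that a Python set of roughly 90 million tuples is itself several gigabytes, so this only helps if it fits in memory:
seen = set(zip(df1['ID'], df1['Another_ID']))     # all (ID, Another_ID) pairs from the large frame
df2['seen_before'] = [int(pair in seen) for pair in zip(df2['ID'], df2['Another_ID'])]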
Update:
Thanks to everybody for all the replies on my first post!
Quang Hong's solution worked amazingly in this case, as there were so many rows.
The total time it took on my laptop was 36.6s.

Create columns having similarity index values

How can I create columns that show the respective similarity indices for each row?
This code
def func(name):
    matches = try_test.apply(lambda row: (fuzz.partial_ratio(row['name'], name) >= 85), axis=1)
    return [try_test.word[i] for i, x in enumerate(matches) if x]

try_test.apply(lambda row: func(row['name']), axis=1)
returns the entries that match the condition >= 85. However, I would also be interested in having the actual values from comparing each field to all the others.
The dataset is
try_test = pd.DataFrame({'word': ['apple', 'orange', 'diet', 'energy', 'fire', 'cake'],
                         'name': ['dog', 'cat', 'mad cat', 'good dog', 'bad dog', 'chicken']})
Help would be very much appreciated.
Expected output (values are just an example)
word name sim_index1 sim_index2 sim_index3 ...index 6
apple dog 100 0
orange cat 100
... mad cat 0.6 100
On the diagonal there is a value of 100, as I am comparing dog with dog, and so on.
I might consider also another approach if you think it would be better.
IIUC, you can slightly change your function to get what you want:
def func(name):
    return try_test.apply(lambda row: fuzz.partial_ratio(row['name'], name), axis=1)

print(try_test.apply(lambda row: func(row['name']), axis=1))
0 1 2 3 4 5
0 100 0 33 100 100 0
1 0 100 100 0 33 33
2 33 100 100 29 43 14
3 100 0 29 100 71 0
4 100 33 43 71 100 0
5 0 33 14 0 0 100
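To line this up with the expected output in the question, the full matrix can simply be concatenated back onto try_test, for example:
sim = try_test.apply(lambda row: func(row['name']), axis=1).add_prefix('sim_idx')
res = pd.concat([try_test, sim], axis=1)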
That said, more than half of the calculation is not necessary, as the result is a symmetric matrix and the diagonal is 100. So if your data is bigger, you could compute partial_ratio only against the rows before the current row, reindex, and then rebuild the full matrix using T (transpose) and np.diag:
def func_pr(row):
    return (try_test.loc[:row.name - 1, 'name']
                    .apply(lambda name: fuzz.partial_ratio(name, row['name'])))

# start at index 1 (second row)
pr = (try_test.loc[1:].apply(func_pr, axis=1)
              .reindex(index=try_test.index,
                       columns=try_test.index)
              .fillna(0)
              .add_prefix('sim_idx')
      )

# complete the result with transpose and diag
pr += pr.to_numpy().T + np.diag(np.ones(pr.shape[0])) * 100

# concat
res = pd.concat([try_test, pr.astype(int)], axis=1)
and you get
print(res)
word name sim_idx0 sim_idx1 sim_idx2 sim_idx3 sim_idx4 \
0 apple dog 100 0 33 100 100
1 orange cat 0 100 100 0 33
2 diet mad cat 33 100 100 29 43
3 energy good dog 100 0 29 100 71
4 fire bad dog 100 33 43 71 100
5 cake chicken 0 33 14 0 0
sim_idx5
0 0
1 33
2 14
3 0
4 0
5 100

pandas apply User defined function to grouped dataframe on multiple columns

I would like to apply a function f1 by group to a dataframe:
import pandas as pd
import numpy as np
data = np.array([['id1', 'id2', 'u', 'v0', 'v1'],
                 ['A', 'A', 10, 1, 7],
                 ['A', 'A', 10, 2, 8],
                 ['A', 'B', 20, 3, 9],
                 ['B', 'A', 10, 4, 10],
                 ['B', 'B', 30, 5, 11],
                 ['B', 'B', 30, 6, 12]])
z = pd.DataFrame(data=data[1:, :], columns=data[0, :])

def f1(u, v):
    return u * np.cumprod(v)
The result of the function depends on the column u and the columns v0 or v1 (there can be thousands of v columns because I'm running a simulation over a lot of paths).
The result should be like this
id1 id2 new_v0 new_v1
0 A A 10 70
1 A A 20 560
2 A B 60 180
3 B A 40 100
4 B B 150 330
5 B B 900 3960
I tried for a start
output = z.groupby(['id1', 'id2']).apply(lambda x: f1(u = x.u,v =x.v0))
but I can't even get a result with just one column.
Thank you very much!
You can filter the column names starting with v into a list and pass that list to groupby:
v_cols = z.columns[z.columns.str.startswith('v')].tolist()
z[['u']+v_cols] = z[['u']+v_cols].apply(pd.to_numeric)
out = z.assign(**z.groupby(['id1','id2'])[v_cols].cumprod()
.mul(z['u'],axis=0).add_prefix('new_'))
print(out)
id1 id2 u v0 v1 new_v0 new_v1
0 A A 10 1 7 10 70
1 A A 10 2 8 20 560
2 A B 20 3 9 60 180
3 B A 10 4 10 40 100
4 B B 30 5 11 150 330
5 B B 30 6 12 900 3960
The way you create your data frame makes the numeric columns object dtype, so we convert first, then use groupby + cumprod:
z[['u','v0','v1']]=z[['u','v0','v1']].apply(pd.to_numeric)
s=z.groupby(['id1','id2'])[['v0','v1']].cumprod().mul(z['u'],0)
#z=z.join(s.add_prefix('New_'))
v0 v1
0 10 70
1 20 560
2 60 180
3 40 100
4 150 330
5 900 3960
If you want to handle more than two v columns, it's better not to reference them explicitly:
(
    z.apply(lambda x: pd.to_numeric(x, errors='ignore'))
     .groupby(['id1', 'id2']).apply(lambda x: x.cumprod().mul(x.u.min()))
)
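And if the goal is to keep the original f1 helper, one rough sketch (not from the answers above, reusing the numeric conversion) applies it group by group; recent pandas versions may warn about the grouping columns being passed to apply:
z[['u', 'v0', 'v1']] = z[['u', 'v0', 'v1']].apply(pd.to_numeric)

for c in ['v0', 'v1']:
    # each group returns a Series indexed by the original rows, so assignment aligns back to z
    z['new_' + c] = (z.groupby(['id1', 'id2'], group_keys=False)
                      .apply(lambda g: f1(g['u'], g[c])))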
