import pandas as pd
df = pd.DataFrame({
'cakeName': ['A','B','C','D','E','F','G','H'],
'chocolate%': ['20','70','30','50','50','10','75','20'],
'milk%' : ['50','20','40','0', '30','80','15','10'],
'straberry%' : ['30','10','30','50','20','10','10','70'],
})
df.head(10)
I would like to create a new column 'cakeType' based on the column values.
Objective:
- scan through each cakeName
- if a single ingredient stands out (>= 75), return a value in 'cakeType'
- for example: cake 'G' has chocolate% >= 75, so return 'choco', etc.
- otherwise, if no ingredient reaches 75, it's just a 'normal cake'
I have searched the forum for an answer, but nothing quite fits, as I will have many, many ingredient columns.
So is scanning each row for a value >= 75 a better way to do it?
Thanks a lot.
Method 1: np.select:
This is a good use case for np.select: we define our conditions and, based on those conditions, select the corresponding choices, plus a default value if none of the conditions is met:
import numpy as np

# The ingredient columns were created as strings, so convert them to numbers first
cols = ['chocolate%', 'milk%', 'straberry%']
df[cols] = df[cols].apply(pd.to_numeric)

conditions = [
    df['chocolate%'].ge(75),
    df['milk%'].ge(75),
    df['straberry%'].ge(75)
]
choices = ['choco', 'milk', 'strawberry']
df['cakeType'] = np.select(conditions, choices, default='normal cake')
cakeName chocolate% milk% straberry% cakeType
0 A 20 50 30 normal cake
1 B 70 20 10 normal cake
2 C 30 40 30 normal cake
3 D 50 0 50 normal cake
4 E 50 30 20 normal cake
5 F 10 80 10 milk
6 G 75 15 10 choco
7 H 20 10 70 normal cake
Method 2: idxmax, Series.where and fillna:
First we get, for each row, the name of the first ingredient column whose value is >= 75. Then we mask the rows that have no value >= 75 at all and fill them with 'normal cake':
m = df[cols].ge(75)    # reuse the ingredient columns from Method 1
m1 = m.idxmax(axis=1)  # first ingredient column per row that reaches 75
newcol = m1.where(m.any(axis=1)).str[:-1].fillna('normal cake')  # strip the trailing '%', default otherwise
df['cakeType'] = newcol
cakeName chocolate% milk% straberry% cakeType
0 A 20 50 30 normal cake
1 B 70 20 10 normal cake
2 C 30 40 30 normal cake
3 D 50 0 50 normal cake
4 E 50 30 20 normal cake
5 F 10 80 10 milk
6 G 75 15 10 chocolate
7 H 20 10 70 normal cake
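Note that Method 2 returns the full column name, so cake G comes out as 'chocolate' rather than 'choco' as in Method 1. If the short labels matter, a small mapping can be applied afterwards; a minimal sketch, where the labels dictionary is hypothetical and would be maintained by you:
labels = {'chocolate': 'choco', 'straberry': 'strawberry'}  # hypothetical label map, one entry per ingredient
df['cakeType'] = df['cakeType'].replace(labels)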
I have a dataset which I want to group by age.
So, here is the first part of the dataset:
It is a simulation of inventory data. Größe is the number of people with the age (Alter) 15, Risiko gives every person a number, and Geschlecht is feminine or masculine.
I want to add a column "Group" and give every person with age 15-19 one number, every person with age 20-24 another number, and so on.
How can I do this?
You can use map with a small helper function to create the new column, like so:
def return_age_from_range(age):
    # Max value in range is excluded, so remember to add +1 to the range you want
    if age in range(15, 20):
        return 1
    elif age in range(20, 25):
        return 2
    # and so on...

df['Group'] = df['Alter'].map(return_age_from_range)
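If there are many ranges, a binning helper such as pd.cut can generate the group numbers without spelling out each branch. A minimal sketch, assuming the ages start at 15 and run in 5-year steps (the upper bound of 65 is an assumption):
# labels=False returns 0-based bin numbers, so add 1 to start the groups at 1
df['Group'] = pd.cut(df['Alter'], bins=range(15, 70, 5), right=False, labels=False) + 1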
Use numpy.select:
In [488]: import numpy as np
In [489]: conds = [df['Alter'].between(15,19), df['Alter'].between(20,24), df['Alter'].between(25,29)]
In [490]: choices = [1,2,3]
In [493]: df['Group'] = np.select(conds, choices)
In [494]: df
Out[494]:
Größe Risiko Geschlecht Alter Group
0 95 1 F 15 1
1 95 2 F 15 1
2 95 3 M 15 1
3 95 4 F 15 1
4 95 5 M 15 1
5 95 6 M 15 1
6 95 7 M 15 1
7 95 8 F 15 1
8 95 9 M 15 1
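Note that np.select falls back to 0 for any row where none of the conditions match; if a different fallback is preferred, pass default explicitly (the -1 below is just an assumed sentinel value):
df['Group'] = np.select(conds, choices, default=-1)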
I have a pandas DataFrame df as follows:
siren ratio
1 20
2 25
1 40
3 16
3 19
4 35
My goal is to have a df2 with only the sirens whose ratio value is above 30 at least once, as follows:
siren ratio
1 20
1 40
4 35
Today, I do it in two steps:
First, I use a filter to get all the unique sirens with a value above 30:
value_30 = df[df["ratio"] > 30]["siren"].unique()
Then, I use value_30 as a list to filter my df and get my df2.
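That second step is presumably something along these lines (a sketch using isin on the array from the first step):
df2 = df[df["siren"].isin(value_30)]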
However, I'm not satisfied with this solution and I think there is a more pythonic way to do this. Any idea?
Use groupby.filter
res = df.groupby(df.siren).filter(lambda x: x["ratio"].max() > 30)
print(res)
Output
siren ratio
0 1 20
2 1 40
5 4 35
Try with groupby and transform:
value_30 = df[df.groupby("siren")["ratio"].transform("max")>30]
>>> value_30
siren ratio
0 1 20
2 1 40
5 4 35
Or transform the boolean condition itself, grouped by siren:
df[df['ratio'].gt(30).groupby(df['siren']).transform('max')]
siren ratio
0 1 20
2 1 40
5 4 35
This is my first time asking a question on here, so apologies if I am doing something wrong.
I am looking to create some sort of dataframe/dict/list where I can check if the ID in one column has seen a specific value in another column before.
For example, for one pandas DataFrame like this (90 million rows):
ID Another_ID
1 10
1 20
2 50
3 10
3 20
4 30
And another like this (10 million rows):
ID Another_ID
1 30
2 30
2 50
2 20
4 30
5 70
I want to end up with a third column that is like this:
ID Another_ID seen_before
1 30 0
2 30 0
2 50 1
2 20 0
4 30 1
5 70 0
I am looking for a memory efficient but quick way to do this, any ideas? Thanks!
Merge is a good idea; here, you want to merge on both columns:
df1['seen_before'] = 1
df2.merge(df1, on=['ID', 'Another_ID'], how='left')
Output:
ID Another_ID seen_before
0 1 30 NaN
1 2 30 NaN
2 2 50 1.0
3 2 20 NaN
4 4 30 1.0
5 5 70 NaN
Note: this assumes that df1 has no duplicates. If you are not sure about this, replace df1 with df1.drop_duplicates() in merge.
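To match the 0/1 column in the expected output, the NaN values from the left merge can then be filled and cast; a small sketch on top of the merge above:
res = df2.merge(df1, on=['ID', 'Another_ID'], how='left')
res['seen_before'] = res['seen_before'].fillna(0).astype(int)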
Note: the how argument is important in the merge; see the comments in the code as well. np.where is quite efficient, but I have never worked with 100 million rows, so I'd ask the OP to let us know how it goes.
Code:
import pandas as pd
import numpy as np
left = pd.DataFrame(data = {'ID':[1, 1, 2, 3, 3, 4], 'Another_ID': [10, 20, 50, 10, 20, 30]})
right = pd.DataFrame(data = {'ID':[1 , 2 , 2 , 2 , 4 , 5], 'Another_ID': [30 , 30 , 50 , 20 , 30 , 70]})
print(left, '\n', right)
res = pd.merge(left, right, how='right', on='ID')
# Another_ID_x showed up as float despite dtype as int on both right and left
res.fillna(value=0, inplace=True) # required for astype to work in next step
res['Another_ID_x'] = res['Another_ID_x'].astype(int)
res['Another_ID_x'] = np.where(res.Another_ID_x == res.Another_ID_y, 1, 0 )
res.rename(columns={'Another_ID_x': 'seen_before'}, inplace=True)
res.drop_duplicates(inplace=True)
print(res)
Output:
ID Another_ID
0 1 10
1 1 20
2 2 50
3 3 10
4 3 20
5 4 30
ID Another_ID
0 1 30
1 2 30
2 2 50
3 2 20
4 4 30
5 5 70
ID seen_before Another_ID_y
0 1 0 30
2 2 0 30
3 2 1 50
4 2 0 20
5 4 1 30
6 5 0 70
Update:
Thanks to everybody for all the replies on my first post!
@Quang Hong's solution worked amazingly well in this case, as there were so many rows.
Total time it took on my laptop was 36.6s
How can I create columns that show the respective similarity indices for each row?
This code
from fuzzywuzzy import fuzz  # assumed import; thefuzz/rapidfuzz also expose partial_ratio

def func(name):
    matches = try_test.apply(lambda row: (fuzz.partial_ratio(row['name'], name) >= 85), axis=1)
    return [try_test.word[i] for i, x in enumerate(matches) if x]

try_test.apply(lambda row: func(row['name']), axis=1)
returns the indices that match the condition >= 85. However, I would also be interested in having the values themselves, comparing each field to all the others.
The dataset is
try_test = pd.DataFrame({'word': ['apple', 'orange', 'diet', 'energy', 'fire', 'cake'],
'name': ['dog', 'cat', 'mad cat', 'good dog', 'bad dog', 'chicken']})
Help will be very appreciated.
Expected output (values are just an example)
word name sim_index1 sim_index2 sim_index3 ...index 6
apple dog 100 0
orange cat 100
... mad cat 0.6 100
On the diagonal there is a value of 100 as I am comparing dog with dog,...
I might also consider another approach if you think it would be better.
IIUC, you can slightly change your function to get what you want:
def func(name):
    return try_test.apply(lambda row: fuzz.partial_ratio(row['name'], name), axis=1)

print(try_test.apply(lambda row: func(row['name']), axis=1))
0 1 2 3 4 5
0 100 0 33 100 100 0
1 0 100 100 0 33 33
2 33 100 100 29 43 14
3 100 0 29 100 71 0
4 100 33 43 71 100 0
5 0 33 14 0 0 100
That said, more than half of the calculation is not necessary, since the result is a symmetric matrix and the diagonal is always 100. So if your data is bigger, you could compute the partial_ratio only against the rows before the current row. Adding a reindex and then rebuilding the full matrix with T (transpose) and np.diag, you can do:
import numpy as np

def func_pr(row):
    # compare the current name only against the rows *before* it
    return (try_test.loc[:row.name-1, 'name']
            .apply(lambda name: fuzz.partial_ratio(name, row['name'])))

#start at index 1 (second row)
pr = (try_test.loc[1:].apply(func_pr, axis=1)
      .reindex(index=try_test.index,
               columns=try_test.index)
      .fillna(0)
      .add_prefix('sim_idx')
      )
#complete the result with transpose and diag
pr += pr.to_numpy().T + np.diag(np.ones(pr.shape[0]))*100
# concat
res = pd.concat([try_test, pr.astype(int)], axis=1)
and you get
print(res)
word name sim_idx0 sim_idx1 sim_idx2 sim_idx3 sim_idx4 \
0 apple dog 100 0 33 100 100
1 orange cat 0 100 100 0 33
2 diet mad cat 33 100 100 29 43
3 energy good dog 100 0 29 100 71
4 fire bad dog 100 33 43 71 100
5 cake chicken 0 33 14 0 0
sim_idx5
0 0
1 33
2 14
3 0
4 0
5 100
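As a quick check, the similarity between row 0 ('dog') and row 3 ('good dog') can then be read off directly from res:
res.loc[0, 'sim_idx3']  # 100, per the table above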
I would like to apply a function f1 by group to a dataframe:
import pandas as pd
import numpy as np
data = np.array([['id1','id2','u','v0','v1'],
['A','A',10,1,7],
['A','A',10,2,8],
['A','B',20,3,9],
['B','A',10,4,10],
['B','B',30,5,11],
['B','B',30,6,12]])
z = pd.DataFrame(data = data[1:,:], columns=data[0,:])
def f1(u, v):
    return u * np.cumprod(v)
The result of the function depends on the column u and the columns v0 or v1 (there can be thousands of v columns, because I'm running a simulation over a lot of paths).
The result should be like this
id1 id2 new_v0 new_v1
0 A A 10 70
1 A A 20 560
2 A B 60 180
3 B A 40 100
4 B B 150 330
5 B B 900 3960
As a start, I tried
output = z.groupby(['id1', 'id2']).apply(lambda x: f1(u = x.u,v =x.v0))
but I can't even get a result with just one column.
Thank you very much!
You can filter the column names starting with v into a list and pass it to groupby:
v_cols = z.columns[z.columns.str.startswith('v')].tolist()
z[['u']+v_cols] = z[['u']+v_cols].apply(pd.to_numeric)
out = z.assign(**z.groupby(['id1','id2'])[v_cols].cumprod()
.mul(z['u'],axis=0).add_prefix('new_'))
print(out)
id1 id2 u v0 v1 new_v0 new_v1
0 A A 10 1 7 10 70
1 A A 10 2 8 20 560
2 A B 20 3 9 60 180
3 B A 10 4 10 40 100
4 B B 30 5 11 150 330
5 B B 30 6 12 900 3960
The way you create your data frame makes the numeric columns object dtype, so we convert them first, then use groupby + cumprod:
z[['u','v0','v1']]=z[['u','v0','v1']].apply(pd.to_numeric)
s=z.groupby(['id1','id2'])[['v0','v1']].cumprod().mul(z['u'],0)
#z=z.join(s.add_prefix('New_'))
v0 v1
0 10 70
1 20 560
2 60 180
3 40 100
4 150 330
5 900 3960
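Uncommenting the join line above attaches the new columns back to z under the New_ prefix:
z = z.join(s.add_prefix('New_'))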
If you want to handle more than two v columns, it's better not to reference them explicitly:
(
z.apply(lambda x: pd.to_numeric(x, errors='ignore'))
.groupby(['id1', 'id2']).apply(lambda x: x.cumprod().mul(x.u.min()))
)