Is there an elegant way to assign values based on multiple columns in a dataframe in pandas? Let's say I have a dataframe with 2 columns: FruitType and Color.
import pandas as pd
df = pd.DataFrame({'FruitType':['apple', 'banana','kiwi','orange','loquat'],
'Color':['red_black','yellow','greenish_yellow', 'orangered','orangeyellow']})
I would like to assign the value of a third column, 'isYellowSeedless', based on both 'FruitType' and 'Color' columns.
I have a list of fruits that I consider seedless, and would like to check the Color column to see if it contains the str "yellow".
seedless = ['banana', 'loquat']
How do I string this all together elegantly?
This is my attempt that didn't work:
df[(df['FruitType'].isin(seedless)) & (df['Color'].str.contains("yellow"))]['isYellowSeedless'] = True
Use loc with a boolean mask:
m = (df['FruitType'].isin(seedless)) & (df['Color'].str.contains("yellow"))
df.loc[m, 'isYellowSeedless'] = True
print (df)
Color FruitType isYellowSeedless
0 red_black apple NaN
1 yellow banana True
2 greenish_yellow kiwi NaN
3 orangered orange NaN
4 orangeyellow loquat True
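For context on why the attempt in the question has no effect: boolean indexing such as df[mask] returns a new object, so a column assigned on it never reaches df. The sketch below contrasts the two forms on the question's data. The warnings filter is only there to keep the output quiet (which warning pandas emits depends on the version), and the name chained_worked is made up for illustration.

```python
import warnings
import pandas as pd

df = pd.DataFrame({'FruitType': ['apple', 'banana', 'kiwi', 'orange', 'loquat'],
                   'Color': ['red_black', 'yellow', 'greenish_yellow',
                             'orangered', 'orangeyellow']})
seedless = ['banana', 'loquat']
m = df['FruitType'].isin(seedless) & df['Color'].str.contains('yellow')

# Chained indexing: df[m] is a copy, so the new column lands on the copy.
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    df[m]['isYellowSeedless'] = True
chained_worked = 'isYellowSeedless' in df.columns  # hypothetical helper name

# Single .loc call: rows and column are selected in one step on df itself.
df.loc[m, 'isYellowSeedless'] = True

print(chained_worked)
print(df)
```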
If you need True and False output:
df['isYellowSeedless'] = m
print (df)
Color FruitType isYellowSeedless
0 red_black apple False
1 yellow banana True
2 greenish_yellow kiwi False
3 orangered orange False
4 orangeyellow loquat True
For an if-else choice between two scalars, use numpy.where (this needs import numpy as np):
df['isYellowSeedless'] = np.where(m, 'a', 'b')
print (df)
Color FruitType isYellowSeedless
0 red_black apple b
1 yellow banana a
2 greenish_yellow kiwi b
3 orangered orange b
4 orangeyellow loquat a
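The numpy.where snippet relies on np, which is never imported in the answer. A self-contained version might look like the sketch below. The na=False argument is an addition of mine, not part of the answer: it treats a missing Color as "does not contain yellow", which only matters if the column can hold NaN.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'FruitType': ['apple', 'banana', 'kiwi', 'orange', 'loquat'],
                   'Color': ['red_black', 'yellow', 'greenish_yellow',
                             'orangered', 'orangeyellow']})
seedless = ['banana', 'loquat']

# na=False is an assumption: missing colours count as "no match".
m = df['FruitType'].isin(seedless) & df['Color'].str.contains('yellow', na=False)

# np.where(condition, value_if_true, value_if_false), evaluated element-wise
df['isYellowSeedless'] = np.where(m, 'a', 'b')
print(df)
```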
To convert to 0 and 1:
df['isYellowSeedless'] = m.astype(int)
print (df)
Color FruitType isYellowSeedless
0 red_black apple 0
1 yellow banana 1
2 greenish_yellow kiwi 0
3 orangered orange 0
4 orangeyellow loquat 1
Or you can try:
df['isYellowSeedless']=df.loc[df.FruitType.isin(seedless),'Color'].str.contains('yellow')
df
Out[546]:
Color FruitType isYellowSeedless
0 red_black apple NaN
1 yellow banana True
2 greenish_yellow kiwi NaN
3 orangered orange NaN
4 orangeyellow loquat True
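The NaN values in this last output come from index alignment: the right-hand side only has rows for the seedless fruits, so every other row has nothing to align with. If False is preferred there, one possible tweak (my assumption about what is wanted, not part of the answer) is to reindex with a fill value before assigning. The name partial is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'FruitType': ['apple', 'banana', 'kiwi', 'orange', 'loquat'],
                   'Color': ['red_black', 'yellow', 'greenish_yellow',
                             'orangered', 'orangeyellow']})
seedless = ['banana', 'loquat']

# Only the seedless rows are present here (index labels 1 and 4).
partial = df.loc[df['FruitType'].isin(seedless), 'Color'].str.contains('yellow')

# Expand back to the full index; rows that were filtered out become False.
df['isYellowSeedless'] = partial.reindex(df.index, fill_value=False)
print(df)
```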
Let's say we have an example dataframe like the one below:
df = pd.DataFrame(np.array([['strawberry', 'red', 3], ['apple', 'red', 6], ['apple', 'red', 5],
['banana', 'yellow', 9], ['pineapple', 'yellow', 5], ['pineapple', 'yellow', 7],
['apple', 'green', 2],['apple', 'green', 6], ['kiwi', 'green', 6]
]),
columns=['Fruit', 'Color', 'Quantity'])
df
Fruit Color Quantity
0 strawberry red 3
1 apple red 6
2 apple red 5
3 banana yellow 9
4 pineapple yellow 5
5 pineapple yellow 7
6 apple green 2
7 apple green 6
8 kiwi green 6
In this df, I am checking whether the Fruit column changes from one row to the next.
The shift() method offsets the rows by 1, fillna() fills the resulting NaN values, and finally ne() produces the True/False labels.
For example, at index 1 strawberry changes to apple, so the value is True.
At index 2 there is no change, so the value is False.
df['Fruit_Check'] = df.Fruit.shift().fillna(df.Fruit).ne(df.Fruit)
df
Fruit Color Quantity Fruit_Check
0 strawberry red 3 False
1 apple red 6 True
2 apple red 5 False
3 banana yellow 9 True
4 pineapple yellow 5 True
5 pineapple yellow 7 False
6 apple green 2 True
7 apple green 6 False
8 kiwi green 6 True
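To make the shift / fillna / ne chain easier to follow, here is a small sketch that keeps each intermediate result. The variable names (previous, previous_filled, changed, steps) are made up for illustration.

```python
import pandas as pd

fruit = pd.Series(['strawberry', 'apple', 'apple', 'banana', 'pineapple',
                   'pineapple', 'apple', 'apple', 'kiwi'], name='Fruit')

# Step 1: move every value down one row; row 0 has no predecessor and becomes NaN.
previous = fruit.shift()

# Step 2: fill that NaN with the row's own value so row 0 compares equal to itself.
previous_filled = previous.fillna(fruit)

# Step 3: True wherever the previous value differs from the current one.
changed = previous_filled.ne(fruit)

steps = pd.DataFrame({'Fruit': fruit, 'previous': previous_filled, 'changed': changed})
print(steps)
```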
My problem is that I also want to take the "Color" column into account: if Color changes at a row, Fruit_Check must default to False there. So df should look like this:
df
Fruit Color Quantity Fruit_Check
0 strawberry red 3 False
1 apple red 6 True
2 apple red 5 False
3 banana yellow 9 False
4 pineapple yellow 5 True
5 pineapple yellow 7 False
6 apple green 2 False
7 apple green 6 False
8 kiwi green 6 True
Also, I can't use a for loop, because it takes too much time on my original data.
Use DataFrameGroupBy.shift to shift within each group:
df['Fruit_Check'] = df.groupby('Color').Fruit.shift().fillna(df.Fruit).ne(df.Fruit)
print (df)
Fruit Color Quantity Fruit_Check
0 strawberry red 3 False
1 apple red 6 True
2 apple red 5 False
3 banana yellow 9 False
4 pineapple yellow 5 True
5 pineapple yellow 7 False
6 apple green 2 False
7 apple green 6 False
8 kiwi green 6 True
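One detail worth knowing: the groupby version compares each row with the previous row of the same Color, wherever that row is. If the colours are contiguous blocks, as in this sample, a plain adjacent-row comparison gives the same answer. The sketch below computes both and is only an illustration under that assumption. The names by_group and adjacent are made up.

```python
import pandas as pd

df = pd.DataFrame({
    'Fruit': ['strawberry', 'apple', 'apple', 'banana', 'pineapple',
              'pineapple', 'apple', 'apple', 'kiwi'],
    'Color': ['red'] * 3 + ['yellow'] * 3 + ['green'] * 3,
})

# Version 1: previous Fruit within the same Color group (the answer's approach).
prev_in_group = df.groupby('Color')['Fruit'].shift()
by_group = prev_in_group.fillna(df['Fruit']).ne(df['Fruit'])

# Version 2: Fruit differs from the row directly above AND Color is unchanged.
fruit_changed = df['Fruit'].ne(df['Fruit'].shift())
same_color = df['Color'].eq(df['Color'].shift())
adjacent = fruit_changed & same_color

df['Fruit_Check'] = by_group
print(df)
print(by_group.equals(adjacent))
```

If the same colour can reappear later in the data, the two versions can disagree, so pick the one that matches the intended meaning of "change".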
Suppose my dataframe is:
Name Value
0 A apple
1 A banana
2 A orange
3 B grape
4 B apple
5 C apple
6 D apple
7 D orange
8 E banana
I want to show the items for each name (with duplicates removed).
The output I want:
Name Values
0 A apple, banana, orange
1 B grape, apple
2 C apple
3 D apple, orange
4 E banana
thank you for reading
Sample data changed to include duplicates:
print (df)
Name Value
0 A apple
1 A apple
2 A banana
3 A banana
4 A orange
5 B grape
6 B apple
7 C apple
8 D apple
9 D orange
10 E banana
If duplicates across both columns need to be removed, first use DataFrame.drop_duplicates and then aggregate with join:
df1 = (df.drop_duplicates(['Name','Value'])
.groupby('Name')['Value']
.agg(','.join)
.reset_index())
print (df1)
Name Value
0 A apple,banana,orange
1 B grape,apple
2 C apple
3 D apple,orange
4 E banana
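The question's desired output uses ", " as the separator and a column called Values. A possible variant that removes duplicates inside the aggregation and matches that layout is sketched below. It relies on Series.unique keeping the order of first appearance. The frame is rebuilt from the sample data shown above.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'C', 'D', 'D', 'E'],
    'Value': ['apple', 'apple', 'banana', 'banana', 'orange',
              'grape', 'apple', 'apple', 'apple', 'orange', 'banana'],
})

out = (df.groupby('Name')['Value']
         # unique() drops repeats within each group, keeping first-seen order
         .agg(lambda s: ', '.join(s.unique()))
         .reset_index(name='Values'))
print(out)
```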
If duplicates are not removed, the output is:
df2 = (df.groupby('Name')['Value']
.agg(','.join)
.reset_index())
print (df2)
Name Value
0 A apple,apple,banana,banana,orange
1 B grape,apple
2 C apple
3 D apple,orange
4 E banana
I have two dataframes.
df1 is
name colour count
0 apple red 3
1 orange orange 2
3 kiwi green 12
df2 is
name count
0 kiwi 2
1 apple 1
I want to update df1 with count from df2 based on name match.
Expected result:
name colour count
0 apple red 1
1 orange orange 2
3 kiwi green 2
I am using this but the results are incorrect.
df3 = df2.combine_first(df1).reindex(df1.index)
How can I do this correctly?
Set name as the index in both DataFrames with DataFrame.set_index so that rows are matched by name, then use DataFrame.combine_first, and finally DataFrame.reset_index to turn the index back into a column:
df = df2.set_index('name').combine_first(df1.set_index('name')).reset_index()
print (df)
name colour count
0 apple red 1.0
1 kiwi green 2.0
2 orange orange 2.0
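Note that the combine_first result is sorted by name and count has become float. If the original row order, index labels (0, 1, 3) and integer dtype of df1 should be kept, as in the expected result, one alternative is to look up the new counts with Series.map. This is a sketch of a different technique than the answer's, and it assumes every name appears at most once in df2.

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['apple', 'orange', 'kiwi'],
                    'colour': ['red', 'orange', 'green'],
                    'count': [3, 2, 12]}, index=[0, 1, 3])
df2 = pd.DataFrame({'name': ['kiwi', 'apple'], 'count': [2, 1]})

# Look up each df1 name in df2; names missing from df2 give NaN.
new_counts = df1['name'].map(df2.set_index('name')['count'])

# Keep the old count where there was no match, then restore the integer dtype.
df1['count'] = new_counts.fillna(df1['count']).astype(int)
print(df1)
```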
An example dataframe is shown:
Fruit FruitA FruitB
Apple Banana Mango
Banana Apple Apple
Mango Apple Banana
Banana Mango Banana
Mango Banana Apple
Apple Mango Mango
I want to add new columns Fruit-Apple, Fruit-Banana and Fruit-Mango to the dataframe, one-hot encoded according to the rows in which each fruit is present. So, the desired output is:
Fruit FruitA FruitB Fruit-Apple Fruit-Banana Fruit-Mango
Apple Banana Mango 1 1 1
Banana Apple Apple 1 1 0
Mango Apple Banana 1 1 1
Banana Mango Banana 0 1 1
Mango Banana Apple 1 1 1
Apple Mango Mango 1 0 1
My code to do this is:
for i in range(len(data)):
    if (data['Fruits'][i] == 'Apple' or data['FruitsA'][i] == 'Apple' or data['FruitsB'][i] == 'Apple'):
        data['Fruits-Apple'][i]=1
        data['Fruits-Banana'][i]=0
        data['Fruits-Mango'][i]=0
    elif (data['Fruits'][i] == 'Banana' or data['FruitsA'][i] == 'Banana' or data['FruitsB'][i] == 'Banana'):
        data['Fruits-Apple'][i]=0
        data['Fruits-Banana'][i]=1
        data['Fruits-Mango'][i]=0
    elif (data['Fruits'][i] == 'Mango' or data['FruitsA'][i] == 'Mango' or data['FruitsB'][i] == 'Mango'):
        data['Fruits-Apple'][i]=0
        data['Fruits-Banana'][i]=0
        data['Fruits-Mango'][i]=1
But I notice that the running time of this code increases dramatically when there are many types of 'fruits'. In my actual data there are only 1074 rows, and the column I'm trying to "normalize" with one-hot encoding has 18 different values. So there are 18 if conditions inside the for loop, and the code still hasn't finished after 15 minutes. That's absurd. (It would be great to know why it's taking so long: for another column that contained only 6 different values, the code took much less time to execute, about 3 minutes.)
So, what's the best (vectorized) way to achieve this output?
Use join with get_dummies and add_prefix:
df = df.join(pd.get_dummies(df['Fruit']).add_prefix('Fruit-'))
print (df)
Fruit Fruit-Apple Fruit-Banana Fruit-Mango
0 Apple 1 0 0
1 Banana 0 1 0
2 Mango 0 0 1
3 Banana 0 1 0
4 Mango 0 0 1
5 Apple 1 0 0
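EDIT: If the input has multiple columns, use get_dummies and take the max per column name:
# The level argument of DataFrame.max is not available in pandas 2.0 and later,
# so the dummies are transposed, grouped by label, and transposed back.
# dtype=int keeps the 0/1 output (newer pandas versions default to booleans).
df = (df.join(pd.get_dummies(df, prefix='', prefix_sep='', dtype=int)
                .T.groupby(level=0).max().T
                .add_prefix('Fruit-')))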
print (df)
Fruit FruitA FruitB Fruit-Apple Fruit-Banana Fruit-Mango
0 Apple Banana Mango 1 1 1
1 Banana Apple Apple 1 1 0
2 Mango Apple Banana 1 1 1
3 Banana Mango Banana 0 1 1
4 Mango Banana Apple 1 1 1
5 Apple Mango Mango 1 0 1
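To spell out what each new column means, namely "does this fruit appear anywhere in the row?", here is a direct sketch that loops over the categories rather than the rows. It is an illustration of the semantics, not the answer's method. The names source_cols and categories are made up.

```python
import pandas as pd

df = pd.DataFrame({
    'Fruit':  ['Apple', 'Banana', 'Mango', 'Banana', 'Mango', 'Apple'],
    'FruitA': ['Banana', 'Apple', 'Apple', 'Mango', 'Banana', 'Mango'],
    'FruitB': ['Mango', 'Apple', 'Banana', 'Banana', 'Apple', 'Mango'],
})

source_cols = ['Fruit', 'FruitA', 'FruitB']  # hypothetical helper name
categories = sorted(pd.unique(df[source_cols].values.ravel()))

for cat in categories:
    # Compare all three columns to the category at once, then reduce per row.
    df[f'Fruit-{cat}'] = df[source_cols].eq(cat).any(axis=1).astype(int)

print(df)
```

The loop runs once per distinct value (18 in the question's real data), and each iteration is a vectorized comparison, so the cost no longer depends on a Python-level pass over every row.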
For better performance, use MultiLabelBinarizer with the DataFrame converted to lists:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df = df.join(pd.DataFrame(mlb.fit_transform(df.values.tolist()),
columns=mlb.classes_,
index=df.index).add_prefix('Fruit-'))
print (df)
Fruit FruitA FruitB Fruit-Apple Fruit-Banana Fruit-Mango
0 Apple Banana Mango 1 1 1
1 Banana Apple Apple 1 1 0
2 Mango Apple Banana 1 1 1
3 Banana Mango Banana 0 1 1
4 Mango Banana Apple 1 1 1
5 Apple Mango Mango 1 0 1
I need to combine three columns of categorical data into a single set of binary category-named columns. This is similar to a "one-hot" but the source rows have up to three categories instead of just one. Also, note that there are 100+ categories and I will not know them beforehand.
id, fruit1, fruit2, fruit3
1, apple, orange,
2, orange, ,
3, banana, apple,
should generate...
id, apple, banana, orange
1, 1, 0, 1
2, 0, 0, 1
3, 1, 1, 0
You could use pd.melt to combine all the fruit columns into one column, and then use pd.crosstab to create a frequency table:
import numpy as np
import pandas as pd
df = pd.read_csv('data')
df = df.replace(r' ', np.nan)
# id fruit1 fruit2 fruit3
# 0 1 apple orange NaN
# 1 2 orange NaN NaN
# 2 3 banana apple NaN
melted = pd.melt(df, id_vars=['id'])
result = pd.crosstab(melted['id'], melted['value'])
print(result)
yields
value apple banana orange
id
1 1 0 1
2 0 0 1
3 1 1 0
Explanation: The melted DataFrame looks like this:
In [148]: melted = pd.melt(df, id_vars=['id']); melted
Out[149]:
id variable value
0 1 fruit1 apple
1 2 fruit1 orange
2 3 fruit1 banana
3 1 fruit2 orange
4 2 fruit2 NaN
5 3 fruit2 apple
6 1 fruit3 NaN
7 2 fruit3 NaN
8 3 fruit3 NaN
We can ignore the variable column; it is the id and value columns that are important.
pd.crosstab can be used to create a frequency table with melted['id'] values in the index and melted['value'] values as the columns:
In [150]: pd.crosstab(melted['id'], melted['value'])
Out[150]:
value apple banana orange
id
1 1 0 1
2 0 0 1
3 1 1 0
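For readers without the CSV file, the same steps can be tried on a frame built inline, as in the sketch below. The clip call is an addition of mine: crosstab counts occurrences, so a fruit listed twice for one id would show 2, and clipping keeps the table binary.

```python
import numpy as np
import pandas as pd

# Inline stand-in for the CSV from the question; blanks are already NaN.
df = pd.DataFrame({
    'id': [1, 2, 3],
    'fruit1': ['apple', 'orange', 'banana'],
    'fruit2': ['orange', np.nan, 'apple'],
    'fruit3': [np.nan, np.nan, np.nan],
})

# One row per (id, fruit column) pair.
melted = pd.melt(df, id_vars=['id'])

# Count each value per id; NaN values are left out of the table.
result = pd.crosstab(melted['id'], melted['value'])

# Assumption: cap counts at 1 so repeated fruits still give a 0/1 indicator.
result = result.clip(upper=1)
print(result)
```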
You can apply value counts to each row:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'fruit1': ['Apple', 'Banana', np.nan],
'fruit2': ['Banana', np.nan, 'Apple'],
'fruit3': ['Grape', np.nan, np.nan],
})
df = df.apply(lambda row: row.value_counts(), axis=1).fillna(0).astype(int)
Before:
fruit1 fruit2 fruit3
0 Apple Banana Grape
1 Banana NaN NaN
2 NaN Apple NaN
After:
Apple Banana Grape
0 1 1 1
1 0 1 0
2 1 0 0
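Applying a Python function to every row can get slow on large frames. A possible alternative, sketched here as an assumption about what scales better rather than a measured result, reshapes to long form first and then builds the indicators in one pass. The names long_values and onehot are made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'fruit1': ['Apple', 'Banana', np.nan],
    'fruit2': ['Banana', np.nan, 'Apple'],
    'fruit3': ['Grape', np.nan, np.nan],
})

# Long form: one value per row, keeping the original row label as the index.
long_values = df.melt(ignore_index=False)['value'].dropna()

# One indicator column per distinct value, then collapse back to one row per label.
onehot = pd.get_dummies(long_values, dtype=int).groupby(level=0).max()
print(onehot)
```

A row whose cells are all NaN would drop out of the result. Reindexing with onehot.reindex(df.index, fill_value=0) would bring it back as a row of zeros.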