I have two dataframes.
df1 is
name colour count
0 apple red 3
1 orange orange 2
3 kiwi green 12
df2 is
name count
0 kiwi 2
1 apple 1
I want to update df1 with count from df2 based on name match.
Expected result:
name colour count
0 apple red 1
1 orange orange 2
3 kiwi green 2
I am using this but the results are incorrect.
df3 = df2.combine_first(df1).reindex(df1.index)
How can I do this correctly?
Create an index by name in both DataFrames for matching with DataFrame.set_index, then use DataFrame.combine_first, and finally DataFrame.reset_index to move name back from the index to a column:
df = df2.set_index('name').combine_first(df1.set_index('name')).reset_index()
print (df)
name colour count
0 apple red 1.0
1 kiwi green 2.0
2 orange orange 2.0
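A note beyond the original answer: combine_first sorts by the new index and promotes count to float. If df1's row order and integer dtype should be preserved, one possible sketch (my addition, using DataFrame.update) is:

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['apple', 'orange', 'kiwi'],
                    'colour': ['red', 'orange', 'green'],
                    'count': [3, 2, 12]}, index=[0, 1, 3])
df2 = pd.DataFrame({'name': ['kiwi', 'apple'], 'count': [2, 1]})

# update aligns on the index, so index both frames by name first
out = df1.set_index('name')
out.update(df2.set_index('name'))   # overwrites matching cells in place
out = out.reset_index()

# update promotes the column to float, so restore the integer dtype
out['count'] = out['count'].astype(int)
```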
Suppose my dataframe is
Name Value
0 A apple
1 A banana
2 A orange
3 B grape
4 B apple
5 C apple
6 D apple
7 D orange
8 E banana
I want to show the items for each name (removing duplicates).
Output I want:
Name Values
0 A apple, banana, orange
1 B grape, apple
2 C apple
3 D apple, orange
4 E banana
Thank you for reading.
Sample data, changed to include duplicates:
print (df)
Name Value
0 A apple
1 A apple
2 A banana
3 A banana
4 A orange
5 B grape
6 B apple
7 C apple
8 D apple
9 D orange
10 E banana
If duplicates across both columns need to be removed first, use DataFrame.drop_duplicates and then aggregate with join:
df1 = (df.drop_duplicates(['Name','Value'])
.groupby('Name')['Value']
.agg(','.join)
.reset_index())
print (df1)
Name Value
0 A apple,banana,orange
1 B grape,apple
2 C apple
3 D apple,orange
4 E banana
If duplicates are not removed, the output is:
df2 = (df.groupby('Name')['Value']
.agg(','.join)
.reset_index())
print (df2)
Name Value
0 A apple,apple,banana,banana,orange
1 B grape,apple
2 C apple
3 D apple,orange
4 E banana
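To match the comma-space separator from the expected output and drop duplicates in one step, a variant (my sketch, not part of the original answer) aggregates unique values directly, since Series.unique keeps first-seen order:

```python
import pandas as pd

df = pd.DataFrame({'Name': list('AAAAABBCDDE'),
                   'Value': ['apple', 'apple', 'banana', 'banana', 'orange',
                             'grape', 'apple', 'apple', 'apple', 'orange',
                             'banana']})

# join the unique values per group, so no separate drop_duplicates is needed
df1 = (df.groupby('Name')['Value']
         .agg(lambda x: ', '.join(x.unique()))
         .reset_index())
```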
Some columns of an example dataframe are shown:
Fruit FruitA FruitB
Apple Banana Mango
Banana Apple Apple
Mango Apple Banana
Banana Mango Banana
Mango Banana Apple
Apple Mango Mango
I want to introduce new columns Fruit-Apple, Fruit-Mango and Fruit-Banana in the dataframe, one-hot encoded for the rows in which each fruit is present. So, the desired output is:
Fruit FruitA FruitB Fruit-Apple Fruit-Banana Fruit-Mango
Apple Banana Mango 1 1 1
Banana Apple Apple 1 1 0
Mango Apple Banana 1 1 1
Banana Mango Banana 0 1 1
Mango Banana Apple 1 1 1
Apple Mango Mango 1 0 1
My code to do this is:
for i in range(len(data)):
    if (data['Fruit'][i] == 'Apple' or data['FruitA'][i] == 'Apple' or data['FruitB'][i] == 'Apple'):
        data['Fruit-Apple'][i] = 1
        data['Fruit-Banana'][i] = 0
        data['Fruit-Mango'][i] = 0
    elif (data['Fruit'][i] == 'Banana' or data['FruitA'][i] == 'Banana' or data['FruitB'][i] == 'Banana'):
        data['Fruit-Apple'][i] = 0
        data['Fruit-Banana'][i] = 1
        data['Fruit-Mango'][i] = 0
    elif (data['Fruit'][i] == 'Mango' or data['FruitA'][i] == 'Mango' or data['FruitB'][i] == 'Mango'):
        data['Fruit-Apple'][i] = 0
        data['Fruit-Banana'][i] = 0
        data['Fruit-Mango'][i] = 1
But I notice that the time taken to run this code increases dramatically when there are many types of 'fruits'. My actual data has only 1074 rows, and the column I am trying to one-hot encode has 18 different values, so there are 18 if conditions inside the for loop, and the code has not finished running for 15 minutes now. (It would also be great to know why it takes so long: on another column that contained only 6 different values, the code took much less time, about 3 minutes.)
So, what's the best (vectorized) way to achieve this output?
Use join with get_dummies and add_prefix:
df = df.join(pd.get_dummies(df['Fruit']).add_prefix('Fruit-'))
print (df)
Fruit Fruit-Apple Fruit-Banana Fruit-Mango
0 Apple 1 0 0
1 Banana 0 1 0
2 Mango 0 0 1
3 Banana 0 1 0
4 Mango 0 0 1
5 Apple 1 0 0
EDIT: If the input is multiple columns, use get_dummies with max across the columns:
df = (df.join(pd.get_dummies(df, prefix='', prefix_sep='')
.max(level=0, axis=1)
.add_prefix('Fruit-')))
print (df)
Fruit FruitA FruitB Fruit-Apple Fruit-Banana Fruit-Mango
0 Apple Banana Mango 1 1 1
1 Banana Apple Apple 1 1 0
2 Mango Apple Banana 1 1 1
3 Banana Mango Banana 0 1 1
4 Mango Banana Apple 1 1 1
5 Apple Mango Mango 1 0 1
For better performance use MultiLabelBinarizer with DataFrame converted to lists:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df = df.join(pd.DataFrame(mlb.fit_transform(df.values.tolist()),
columns=mlb.classes_,
index=df.index).add_prefix('Fruit-'))
print (df)
Fruit FruitA FruitB Fruit-Apple Fruit-Banana Fruit-Mango
0 Apple Banana Mango 1 1 1
1 Banana Apple Apple 1 1 0
2 Mango Apple Banana 1 1 1
3 Banana Mango Banana 0 1 1
4 Mango Banana Apple 1 1 1
5 Apple Mango Mango 1 0 1
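Another vectorized option (my sketch, not part of the original answer; useful on modern pandas, where `max(level=0, axis=1)` is no longer supported) is to stack the columns into one Series first:

```python
import pandas as pd

df = pd.DataFrame({'Fruit':  ['Apple', 'Banana', 'Mango', 'Banana', 'Mango', 'Apple'],
                   'FruitA': ['Banana', 'Apple', 'Apple', 'Mango', 'Banana', 'Mango'],
                   'FruitB': ['Mango', 'Apple', 'Banana', 'Banana', 'Apple', 'Mango']})

# stack -> one fruit per row of a MultiIndexed Series,
# get_dummies -> indicator columns,
# groupby(level=0).max() -> collapse back to one row per original row
dummies = (pd.get_dummies(df.stack())
             .groupby(level=0).max()
             .astype(int)
             .add_prefix('Fruit-'))
out = df.join(dummies)
```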
I have 2 data sets with a common unique ID (with duplicates in the 2nd data frame).
I want to map all records with respect to each ID.
df1
id
1
2
3
4
5
df2
id col1
1 mango
2 melon
1 straw
3 banana
3 papaya
I want the output like:
df1
id col1
1 mango
straw
2 melon
3 banana
papaya
4 not available
5 not available
Thanks in advance
You're looking to do an outer df.merge:
df1 = df1.merge(df2, how='outer').set_index('id').fillna('not available')
>>> df1
col1
id
1 mango
1 straw
2 melon
3 banana
3 papaya
4 not available
5 not available
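Since df1 contains every id, a left merge gives the same result; a minimal runnable sketch (data re-created here for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3, 4, 5]})
df2 = pd.DataFrame({'id': [1, 2, 1, 3, 3],
                    'col1': ['mango', 'melon', 'straw', 'banana', 'papaya']})

# a left merge keeps all ids from df1; ids missing from df2 get NaN,
# which fillna then replaces with the placeholder text
out = (df1.merge(df2, on='id', how='left')
          .set_index('id')
          .fillna('not available'))
```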
Is there an elegant way to assign values based on multiple columns in a dataframe in pandas? Let's say I have a dataframe with 2 columns: FruitType and Color.
import pandas as pd
df = pd.DataFrame({'FruitType':['apple', 'banana','kiwi','orange','loquat'],
'Color':['red_black','yellow','greenish_yellow', 'orangered','orangeyellow']})
I would like to assign the value of a third column, 'isYellowSeedless', based on both 'FruitType' and 'Color' columns.
I have a list of fruits that I consider seedless, and would like to check the Color column to see if it contains the str "yellow".
seedless = ['banana', 'loquat']
How do I string this all together elegantly?
This is my attempt that didn't work:
df[(df['FruitType'].isin(seedless)) & (culture_table['Color'].str.contains("yellow"))]['isYellowSeedless'] = True
Use loc with mask:
m = (df['FruitType'].isin(seedless)) & (df['Color'].str.contains("yellow"))
df.loc[m, 'isYellowSeedless'] = True
print (df)
Color FruitType isYellowSeedless
0 red_black apple NaN
1 yellow banana True
2 greenish_yellow kiwi NaN
3 orangered orange NaN
4 orangeyellow loquat True
If True/False output is needed:
df['isYellowSeedless'] = m
print (df)
Color FruitType isYellowSeedless
0 red_black apple False
1 yellow banana True
2 greenish_yellow kiwi False
3 orangered orange False
4 orangeyellow loquat True
For an if-else assignment of scalars, use numpy.where:
import numpy as np
df['isYellowSeedless'] = np.where(m, 'a', 'b')
print (df)
Color FruitType isYellowSeedless
0 red_black apple b
1 yellow banana a
2 greenish_yellow kiwi b
3 orangered orange b
4 orangeyellow loquat a
And to convert to 0 and 1:
df['isYellowSeedless'] = m.astype(int)
print (df)
Color FruitType isYellowSeedless
0 red_black apple 0
1 yellow banana 1
2 greenish_yellow kiwi 0
3 orangered orange 0
4 orangeyellow loquat 1
Or you can try
df['isYellowSeedless']=df.loc[df.FruitType.isin(seedless),'Color'].str.contains('yellow')
df
Out[546]:
Color FruitType isYellowSeedless
0 red_black apple NaN
1 yellow banana True
2 greenish_yellow kiwi NaN
3 orangered orange NaN
4 orangeyellow loquat True
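Note that this variant leaves NaN for the non-seedless rows. If False is wanted there instead, a possible adjustment (my sketch, not part of the original answer) is to reindex the partial result with a fill value before assigning:

```python
import pandas as pd

df = pd.DataFrame({'FruitType': ['apple', 'banana', 'kiwi', 'orange', 'loquat'],
                   'Color': ['red_black', 'yellow', 'greenish_yellow',
                             'orangered', 'orangeyellow']})
seedless = ['banana', 'loquat']

# contains('yellow') is computed only on the seedless rows, then the result
# is expanded back to the full index with False everywhere else
df['isYellowSeedless'] = (df.loc[df.FruitType.isin(seedless), 'Color']
                            .str.contains('yellow')
                            .reindex(df.index, fill_value=False))
```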
I need to combine three columns of categorical data into a single set of binary category-named columns. This is similar to a "one-hot" but the source rows have up to three categories instead of just one. Also, note that there are 100+ categories and I will not know them beforehand.
id, fruit1, fruit2, fruit3
1, apple, orange,
2, orange, ,
3, banana, apple,
should generate...
id, apple, banana, orange
1, 1, 0, 1
2, 0, 0, 1
3, 1, 1, 0
You could use pd.melt to combine all the fruit columns into one column, and then use pd.crosstab to create a frequency table:
import numpy as np
import pandas as pd
df = pd.read_csv('data')
df = df.replace(r' ', np.nan)
# id fruit1 fruit2 fruit3
# 0 1 apple orange NaN
# 1 2 orange NaN NaN
# 2 3 banana apple NaN
melted = pd.melt(df, id_vars=['id'])
result = pd.crosstab(melted['id'], melted['value'])
print(result)
yields
value apple banana orange
id
1 1 0 1
2 0 0 1
3 1 1 0
Explanation: The melted DataFrame looks like this:
In [148]: melted = pd.melt(df, id_vars=['id']); melted
Out[149]:
id variable value
0 1 fruit1 apple
1 2 fruit1 orange
2 3 fruit1 banana
3 1 fruit2 orange
4 2 fruit2 NaN
5 3 fruit2 apple
6 1 fruit3 NaN
7 2 fruit3 NaN
8 3 fruit3 NaN
We can ignore the variable column; it is the id and value which is important.
pd.crosstab can be used to create a frequency table with melted['id'] values in the index and melted['value'] values as the columns:
In [150]: pd.crosstab(melted['id'], melted['value'])
Out[150]:
value apple banana orange
id
1 1 0 1
2 0 0 1
3 1 1 0
You can apply value counts to each row:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'fruit1': ['Apple', 'Banana', np.nan],
'fruit2': ['Banana', np.nan, 'Apple'],
'fruit3': ['Grape', np.nan, np.nan],
})
df = df.apply(lambda row: row.value_counts(), axis=1).fillna(0).applymap(int)
Before:
fruit1 fruit2 fruit3
0 Apple Banana Grape
1 Banana NaN NaN
2 NaN Apple NaN
After:
Apple Banana Grape
0 1 1 1
1 0 1 0
2 1 0 0
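One caveat worth adding (my note, not in the original answers): value_counts, like crosstab, counts occurrences, so a fruit repeated within the same row produces a 2 rather than a 1. Clipping restores pure 0/1 indicators; hypothetical data with a repeat:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'fruit1': ['Apple', 'Banana', np.nan],
    'fruit2': ['Apple', np.nan, 'Apple'],   # note: Apple repeated in row 0
    'fruit3': ['Grape', np.nan, np.nan],
})

# per-row counts: Apple appears twice in row 0, so its count is 2
counts = df.apply(lambda row: row.value_counts(), axis=1).fillna(0)

# cap counts at 1 to get pure 0/1 indicator columns
onehot = counts.clip(upper=1).astype(int)
```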