Let's say we have an example dataframe like the one below:
df = pd.DataFrame(np.array([['strawberry', 'red', 3], ['apple', 'red', 6], ['apple', 'red', 5],
['banana', 'yellow', 9], ['pineapple', 'yellow', 5], ['pineapple', 'yellow', 7],
['apple', 'green', 2],['apple', 'green', 6], ['kiwi', 'green', 6]
]),
columns=['Fruit', 'Color', 'Quantity'])
df
Fruit Color Quantity
0 strawberry red 3
1 apple red 6
2 apple red 5
3 banana yellow 9
4 pineapple yellow 5
5 pineapple yellow 7
6 apple green 2
7 apple green 6
8 kiwi green 6
In this df, I'm checking row by row whether the value in the Fruit column changes.
The shift() method offsets the rows by 1, fillna() fills the resulting NaN in the first row, and ne() labels each row True or False.
So at index 1, where strawberry changes to apple, the result is True.
At index 2 there is no change, so the result is False.
df['Fruit_Check'] = df.Fruit.shift().fillna(df.Fruit).ne(df.Fruit)
df
Fruit Color Quantity Fruit_Check
0 strawberry red 3 False
1 apple red 6 True
2 apple red 5 False
3 banana yellow 9 True
4 pineapple yellow 5 True
5 pineapple yellow 7 False
6 apple green 2 True
7 apple green 6 False
8 kiwi green 6 True
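To see what each step of that chain contributes, the intermediate results can be inspected one at a time. This is only a minimal sketch on the Fruit column; the names shifted, filled and changed are introduced here for illustration.

```python
import pandas as pd

fruit = pd.Series(['strawberry', 'apple', 'apple', 'banana', 'pineapple',
                   'pineapple', 'apple', 'apple', 'kiwi'], name='Fruit')

shifted = fruit.shift()          # previous row's value; row 0 becomes NaN
filled = shifted.fillna(fruit)   # row 0 gets its own value, so it compares equal
changed = filled.ne(fruit)       # True where the value differs from the previous row

print(pd.DataFrame({'Fruit': fruit, 'shifted': shifted, 'changed': changed}))
```

Without the fillna() step, row 0 would be compared against NaN and come out True.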
My problem is that I also want to check the "Color" column. If the color changes, Fruit_Check must default to False. So df should look like this:
df
Fruit Color Quantity Fruit_Check
0 strawberry red 3 False
1 apple red 6 True
2 apple red 5 False
3 banana yellow 9 False
4 pineapple yellow 5 True
5 pineapple yellow 7 False
6 apple green 2 False
7 apple green 6 False
8 kiwi green 6 True
Also, I can't use a for loop, because it takes too much time on my original data.
Use DataFrameGroupBy.shift to shift within each group:
df['Fruit_Check'] = df.groupby('Color').Fruit.shift().fillna(df.Fruit).ne(df.Fruit)
print (df)
Fruit Color Quantity Fruit_Check
0 strawberry red 3 False
1 apple red 6 True
2 apple red 5 False
3 banana yellow 9 False
4 pineapple yellow 5 True
5 pineapple yellow 7 False
6 apple green 2 False
7 apple green 6 False
8 kiwi green 6 True
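Note that the grouped shift compares each row with the previous row of the same color, wherever that row sits in the frame. If only adjacent rows should be compared, a possible alternative (an assumption about the intent, not part of the answer above) is to combine two row-wise checks. The sketch below builds both versions on the sample data; the names by_group and adjacent are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Fruit': ['strawberry', 'apple', 'apple', 'banana', 'pineapple',
              'pineapple', 'apple', 'apple', 'kiwi'],
    'Color': ['red', 'red', 'red', 'yellow', 'yellow',
              'yellow', 'green', 'green', 'green'],
})

# Version 1: shift within each color group, as in the answer above
by_group = df.groupby('Color')['Fruit'].shift().fillna(df['Fruit']).ne(df['Fruit'])

# Version 2: the fruit changed AND the color is the same as in the previous row
fruit_changed = df['Fruit'].ne(df['Fruit'].shift())
same_color = df['Color'].eq(df['Color'].shift())
adjacent = fruit_changed & same_color

print(pd.DataFrame({'by_group': by_group, 'adjacent': adjacent}))
```

The two agree here because each color occupies one contiguous block of rows; they can differ when a color reappears later in the frame.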
Related
Say I have a column in my df with only three distinct values: Apple, Banana and Kiwi.
Category
Apple
Apple
Apple
Banana
Kiwi
Banana
Banana
Banana
And I would like to insert a new column of their corresponding color
Category Color
Apple Red
Banana yellow
Kiwi green
How can I create such a column?
df['Color'] = ['Red', 'yellow', 'green']
To accommodate the edit, where Category has repeated values, map each category to its color:
df['Color'] = df['Category'].map({'Apple': 'Red', 'Banana': 'yellow', 'Kiwi': 'green'})
Category Color
0 Apple Red
1 Apple Red
2 Apple Red
3 Banana yellow
4 Kiwi green
5 Banana yellow
6 Banana yellow
7 Banana yellow
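One thing to watch with map and a dict: a category that is missing from the dict comes out as NaN. The sketch below shows this with a hypothetical extra category, 'Mango', which is not in the original data, and fills the gap with a placeholder label chosen for illustration.

```python
import pandas as pd

# 'Mango' is a made-up category used only to show the unmapped case
df = pd.DataFrame({'Category': ['Apple', 'Banana', 'Kiwi', 'Banana', 'Mango']})

color_map = {'Apple': 'Red', 'Banana': 'yellow', 'Kiwi': 'green'}

df['Color'] = df['Category'].map(color_map)          # unmapped keys become NaN
df['Color_filled'] = df['Color'].fillna('unknown')   # optional placeholder

print(df)
```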
Say I have the following dataframe:
>>> df = pd.DataFrame({'Person': ['bob', 'jim', 'joe', 'bob', 'jim', 'joe'], 'Color':['blue', 'green', 'orange', 'yellow', 'pink', 'purple']})
>>> df
Color Person
0 blue bob
1 green jim
2 orange joe
3 yellow bob
4 pink jim
5 purple joe
And I want to create a new column that represents the first color seen for each person:
Color Person First Color
0 blue bob blue
1 green jim green
2 orange joe orange
3 yellow bob blue
4 pink jim green
5 purple joe orange
I have come to a solution but it seems really inefficient:
>>> df['First Color'] = 0
>>> groups = df.groupby(['Person'])['Color']
>>> for g in groups:
... first_color = g[1].iloc[0]
... df['First Color'].loc[df['Person']==g[0]] = first_color
Is there a faster way to do this all at once where it doesn't have to iterate through the groupby object?
You need transform with first:
print (df.groupby('Person')['Color'].transform('first'))
0 blue
1 green
2 orange
3 blue
4 green
5 orange
Name: Color, dtype: object
df['First_Col'] = df.groupby('Person')['Color'].transform('first')
print (df)
Color Person First_Col
0 blue bob blue
1 green jim green
2 orange joe orange
3 yellow bob blue
4 pink jim green
5 purple joe orange
Use the transform() method:
In [177]: df['First_Col'] = df.groupby('Person')['Color'].transform('first')
In [178]: df
Out[178]:
Color Person First_Col
0 blue bob blue
1 green jim green
2 orange joe orange
3 yellow bob blue
4 pink jim green
5 purple joe orange
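As a cross-check, the same column can be built without groupby by taking the first row per person and mapping it back. This is a sketch for comparison only; first_rows and via_map are names introduced here. One difference to keep in mind: groupby's 'first' returns the first non-null value in each group, while drop_duplicates keeps the first row even if its Color is missing.

```python
import pandas as pd

df = pd.DataFrame({
    'Person': ['bob', 'jim', 'joe', 'bob', 'jim', 'joe'],
    'Color': ['blue', 'green', 'orange', 'yellow', 'pink', 'purple'],
})

# Broadcast the first color of each group back to every row
via_transform = df.groupby('Person')['Color'].transform('first')

# Alternative: one lookup row per person, then map by name
first_rows = df.drop_duplicates('Person').set_index('Person')['Color']
via_map = df['Person'].map(first_rows)

print(via_transform.equals(via_map))
```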
I have an input dataframe as follows:
ID Visit11 Visit12 Visit13 Visit1Int4 Visit15
1 Orange
2 Orange Apple
3 Grapes
4 Apple
5 Orange Apple
6 Apple
7 Banana
8 Banana Apple Banana Apple Banana
I want to fill the first NA of each row with 'Exit' (so for ID 1, Visit12 should be 'Exit', for ID 2, Visit13 should be 'Exit', etc.). The final output should look like this:
ID Visit11 Visit12 Visit13 Visit1Int4 Visit15
1 Orange Exit
2 Orange Apple Exit
3 Grapes Exit
4 Apple Exit
5 Orange Apple Exit
6 Apple Exit
7 Banana Exit
8 Banana Apple Banana Apple Banana E
You could start by replacing empty values with np.nan, and take the cumsum of DataFrame.isna. Then use np.where to assign Exit where cumsum is 1, or the value in df otherwise:
import numpy as np
m = df.replace('',np.nan).isna().cumsum(axis=1)
r = np.where(m == 1, 'Exit', df)
pd.DataFrame(r, columns=df.columns).fillna('')
ID Visit11 Visit12 Visit13 Visit1Int4 Visit15
0 1 Orange Exit
1 2 Orange Apple Exit
2 3 Grapes Exit
3 4 Apple Exit
4 5 Orange Apple Exit
5 6 Apple Exit
6 7 Banana Exit
7 8 Banana Apple Banana Apple Banana
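The cumsum trick relies on the empty cells being trailing: a filled cell that comes after the first gap also has a running count of 1 and would be overwritten. If gaps can occur in the middle of a row, one possible variation is to require that the cell itself is empty as well. The sketch below uses a small made-up frame (row ID 3 has a gap in the middle) and DataFrame.mask instead of np.where.

```python
import numpy as np
import pandas as pd

# Illustrative data; ID 3 has an empty cell followed by a filled one
df = pd.DataFrame({
    'ID': [1, 2, 3],
    'Visit11': ['Orange', 'Orange', 'Banana'],
    'Visit12': ['', 'Apple', ''],
    'Visit13': ['', '', 'Apple'],
})

na = df.replace('', np.nan).isna()

# True only for the first empty cell in each row
first_na = na & (na.cumsum(axis=1) == 1)

out = df.mask(first_na, 'Exit')
print(out)
```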
I have a data frame which has the structure as follows
code value
1 red
2 blue
3 yellow
1
4
4 pink
2 blue
So basically I want to update the value column so that the blank rows are filled with values from other rows with the same code. For example, I know that code 4 refers to the value pink, so I want pink filled in on every row with code 4 where the value is missing.
Use groupby with ffill and bfill:
df.groupby('code').value.ffill().bfill()
0 red
1 blue
2 yellow
3 red
4 pink
5 pink
6 blue
Name: value, dtype: object
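In the one-liner above only the ffill is applied per group; the trailing bfill runs over the whole column, so a leading blank could in principle pick up a value from a different code. A sketch that keeps both fills inside each group, assuming the blanks are NaN:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'code': [1, 2, 3, 1, 4, 4, 2],
    'value': ['red', 'blue', 'yellow', np.nan, np.nan, 'pink', 'blue'],
})

# Forward-fill then back-fill within each code group only
filled = df.groupby('code')['value'].transform(lambda s: s.ffill().bfill())

print(filled)
```

The lambda makes this slower than the built-in groupby methods on large frames, so it is a trade of speed for safety.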
You could use the first value of the given code group:
In [379]: df.groupby('code')['value'].transform('first')
Out[379]:
0 red
1 blue
2 yellow
3 red
4 pink
5 pink
6 blue
Name: value, dtype: object
To assign back
In [380]: df.assign(value=df.groupby('code')['value'].transform('first'))
Out[380]:
code value
0 1 red
1 2 blue
2 3 yellow
3 1 red
4 4 pink
5 4 pink
6 2 blue
Or
df['value'] = df.groupby('code')['value'].transform('first')
You can create a series of your code-value pairs, and use that to map:
my_map = df[df['value'].notnull()].set_index('code')['value'].drop_duplicates()
df['value'] = df['code'].map(my_map)
>>> df
code value
0 1 red
1 2 blue
2 3 yellow
3 1 red
4 4 pink
5 4 pink
6 2 blue
Just to see what is happening, you are passing the following series to map:
>>> my_map
code
1 red
2 blue
3 yellow
4 pink
Name: value, dtype: object
So it says: "Where you find 1, give the value red, where you find 2, give blue..."
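One caveat: calling drop_duplicates() on the series removes repeated values, not repeated codes, so if two different codes shared the same value one of them would drop out of the map. A variation that deduplicates on code instead is sketched below; code 5 is a made-up addition to the sample data to show the case.

```python
import numpy as np
import pandas as pd

# Code 5 (also 'red') is hypothetical, added to show two codes sharing a value
df = pd.DataFrame({
    'code': [1, 2, 3, 1, 4, 4, 2, 5, 5],
    'value': ['red', 'blue', 'yellow', np.nan, np.nan, 'pink', 'blue', 'red', np.nan],
})

# Keep one non-null row per code, then use it as a lookup table
lookup = (df.dropna(subset=['value'])
            .drop_duplicates('code')
            .set_index('code')['value'])

df['value'] = df['code'].map(lookup)
print(df)
```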
You can sort_values, ffill and then sort_index. The last step may not be necessary if order is not important. If it is, then the double sort may be unreasonably expensive.
df = df.sort_values(['code', 'value']).ffill().sort_index()
print(df)
code value
0 1 red
1 2 blue
2 3 yellow
3 1 red
4 4 pink
5 4 pink
6 2 blue
Using reindex
df.dropna().drop_duplicates('code').set_index('code').reindex(df.code).reset_index()
Out[410]:
code value
0 1 red
1 2 blue
2 3 yellow
3 1 red
4 4 pink
5 4 pink
6 2 blue
Is there an elegant way to assign values based on multiple columns in a dataframe in pandas? Let's say I have a dataframe with 2 columns: FruitType and Color.
import pandas as pd
df = pd.DataFrame({'FruitType':['apple', 'banana','kiwi','orange','loquat'],
'Color':['red_black','yellow','greenish_yellow', 'orangered','orangeyellow']})
I would like to assign the value of a third column, 'isYellowSeedless', based on both 'FruitType' and 'Color' columns.
I have a list of fruits that I consider seedless, and would like to check the Color column to see if it contains the str "yellow".
seedless = ['banana', 'loquat']
How do I string this all together elegantly?
This is my attempt that didn't work:
df[(df['FruitType'].isin(seedless)) & (culture_table['Color'].str.contains("yellow"))]['isYellowSeedless'] = True
Use loc with mask:
m = (df['FruitType'].isin(seedless)) & (df['Color'].str.contains("yellow"))
df.loc[m, 'isYellowSeedless'] = True
print (df)
Color FruitType isYellowSeedless
0 red_black apple NaN
1 yellow banana True
2 greenish_yellow kiwi NaN
3 orangered orange NaN
4 orangeyellow loquat True
If you need True and False output:
df['isYellowSeedless'] = m
print (df)
Color FruitType isYellowSeedless
0 red_black apple False
1 yellow banana True
2 greenish_yellow kiwi False
3 orangered orange False
4 orangeyellow loquat True
For an if-else between two scalar values, use numpy.where:
df['isYellowSeedless'] = np.where(m, 'a', 'b')
print (df)
Color FruitType isYellowSeedless
0 red_black apple b
1 yellow banana a
2 greenish_yellow kiwi b
3 orangered orange b
4 orangeyellow loquat a
And to convert to 0 and 1:
df['isYellowSeedless'] = m.astype(int)
print (df)
Color FruitType isYellowSeedless
0 red_black apple 0
1 yellow banana 1
2 greenish_yellow kiwi 0
3 orangered orange 0
4 orangeyellow loquat 1
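If the Color column can contain missing values, str.contains returns a missing value for those rows by default, which then leaks into the mask. Passing na=False treats them as "no match". The sketch below adds one hypothetical row with a missing color to the sample data to show this; regex=False is also passed so that "yellow" is matched as a literal string.

```python
import pandas as pd

# The last row (banana with no color) is made up for illustration
df = pd.DataFrame({
    'FruitType': ['apple', 'banana', 'kiwi', 'orange', 'loquat', 'banana'],
    'Color': ['red_black', 'yellow', 'greenish_yellow', 'orangered',
              'orangeyellow', None],
})
seedless = ['banana', 'loquat']

# na=False makes rows with a missing Color count as not containing "yellow"
m = (df['FruitType'].isin(seedless)
     & df['Color'].str.contains('yellow', na=False, regex=False))

df['isYellowSeedless'] = m
print(df)
```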
Or you can try
df['isYellowSeedless']=df.loc[df.FruitType.isin(seedless),'Color'].str.contains('yellow')
df
Out[546]:
Color FruitType isYellowSeedless
0 red_black apple NaN
1 yellow banana True
2 greenish_yellow kiwi NaN
3 orangered orange NaN
4 orangeyellow loquat True