Missing value replaced by another column [duplicate] - python

This question already has answers here:
Pandas: filling missing values by mean in each group
(12 answers)
Closed 6 months ago.
import pandas as pd

data = {
    "Food": ['apple', 'apple', 'apple', 'orange', 'apple', 'apple', 'orange', 'orange', 'orange'],
    "Calorie": [50, 40, 50, 30, 'Nan', 'Nan', 50, 30, 'Nan']
}

# load data into a DataFrame object:
df = pd.DataFrame(data)
print(df)
I have a DataFrame as above and need to replace each missing value with the median of its group: if the food is apple, the 'Nan' values should be replaced by the median Calorie of the apple rows, and likewise for orange. The output needs to look like this:
Food Calorie
0 apple 50
1 apple 40
2 apple 50
3 orange 30
4 apple 50
5 apple 50
6 orange 50
7 orange 30
8 orange 30

You could do:
import numpy as np

df = df.replace('Nan', np.nan)
df['Calorie'] = df['Calorie'].fillna(df.groupby('Food')['Calorie'].transform('median'))
df
Out[170]:
Food Calorie
0 apple 50.0
1 apple 40.0
2 apple 50.0
3 orange 30.0
4 apple 50.0
5 apple 50.0
6 orange 50.0
7 orange 30.0
8 orange 30.0
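As an aside, if the column may contain other non-numeric placeholders besides the literal string 'Nan', a sketch using pd.to_numeric with errors='coerce' (same df as above assumed) handles them all in one pass:
df['Calorie'] = pd.to_numeric(df['Calorie'], errors='coerce')  # coerce any non-numeric token to real NaN
df['Calorie'] = df['Calorie'].fillna(df.groupby('Food')['Calorie'].transform('median'))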

Related

Search N consecutive rows with same value in one dataframe

I need to write Python code that, for a variable N, finds N consecutive rows in a DataFrame column with the same value (ignoring NaN), like the table below.
I can't figure out how to do it with a for loop because I don't know which row I'm looking at in each case. Any idea how I can do this?
Fruit   2 matches  5 matches
Apple   No         No
NaN     No         No
Pear    No         No
Pear    Yes        No
Pear    Yes        No
Pear    Yes        No
Pear    Yes        Yes
NaN     No         No
NaN     No         No
NaN     No         No
NaN     No         No
NaN     No         No
Banana  No         No
Banana  Yes        No
Update: testing the solution by @Corralien:
counts = (df.groupby(df['Fruit'].ne(df['Fruit'].shift()).cumsum())  # virtual groups
            .transform('cumcount').add(1)                           # cumulative counter
            .where(df['Fruit'].notna(), other=0))                   # set NaN to 0
N = 2
df['Matches'] = df.where(counts >= N, other='No')
VS Code returns the message 'Frame skipped from debugging during step-in.' when executing the last line, and an exception is raised in the previous for loop.
Compute the consecutive-value counter and set NaN rows to 0. Once you have the cumulative counter, you just have to check whether it is greater than or equal to N:
counts = (df.groupby(df['Fruit'].ne(df['Fruit'].shift()).cumsum())  # virtual groups
            .transform('cumcount').add(1)                           # cumulative counter
            .where(df['Fruit'].notna(), other=0))                   # set NaN to 0
N = 2
df['2 matches'] = counts.ge(N).replace({True: 'Yes', False: 'No'})
N = 5
df['5 matches'] = counts.ge(N).replace({True: 'Yes', False: 'No'})
Output:
>>> df
Fruit 2 matches 5 matches
0 Apple No No
1 NaN No No
2 Pear No No
3 Pear Yes No
4 Pear Yes No
5 Pear Yes No
6 Pear Yes Yes
7 NaN No No
8 NaN No No
9 NaN No No
10 NaN No No
11 NaN No No
12 Banana No No
13 Banana Yes No
>>> counts
0 1
1 0
2 1
3 2
4 3
5 4
6 5
7 0
8 0
9 0
10 0
11 0
12 1
13 2
dtype: int64
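To see why the grouper works, here is a minimal breakdown on a toy Series (values assumed for illustration):
import pandas as pd

s = pd.Series(['Pear', 'Pear', None, 'Pear'])
run_id = s.ne(s.shift()).cumsum()             # a new id every time the value changes
print(run_id.tolist())                        # [1, 1, 2, 3] -> consecutive equal rows share an id
counter = s.groupby(run_id).cumcount().add(1)
print(counter.tolist())                       # [1, 2, 1, 1] -> 1-based position within each run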
Update
If I need to replace "Yes" with the fruit name, for example:
N = 2
df['2 matches'] = df.where(counts >= N, other='No')
print(df)
# Output
Fruit 2 matches
0 Apple No
1 NaN No
2 Pear No
3 Pear Pear
4 Pear Pear
5 Pear Pear
6 Pear Pear
7 NaN No
8 NaN No
9 NaN No
10 NaN No
11 NaN No
12 Banana No
13 Banana Banana

Pandas: Create column with rolling sum of previous n rows of another column for within the same id/group

Sample dataset:
id fruit
0 7 NaN
1 7 apple
2 7 NaN
3 7 mango
4 7 apple
5 7 potato
6 3 berry
7 3 olive
8 3 olive
9 3 grape
10 3 NaN
11 3 mango
12 3 potato
In the fruit column, NaN and potato count as 0; every other string counts as 1. I want to generate a new column sum_last3 where each row is the sum over the previous 3 rows (inclusive) of the fruit column. When a new id appears, the sum should restart from the beginning.
Output I want:
id fruit sum_last3
0 7 NaN 0
1 7 apple 1
2 7 NaN 1
3 7 mango 2
4 7 apple 2
5 7 potato 2
6 3 berry 1
7 3 olive 2
8 3 olive 3
9 3 grape 3
10 3 NaN 2
11 3 mango 2
12 3 potato 1
My Code:
df['sum_last5'] = ((df['fruit'].ne('potato') & df['fruit'].notna())
                   .groupby('id', sort=False, as_index=False)['fruit']
                   .rolling(min_periods=1, window=3).sum().astype(int).values)
You can modify your code slightly, as follows:
df['sum_last3'] = ((df['fruit'].ne('potato') & df['fruit'].notna())
                   .groupby(df['id'], sort=False)
                   .rolling(min_periods=1, window=3).sum().astype(int)
                   .droplevel(0)
                  )
or use .values as in your code:
df['sum_last3'] = ((df['fruit'].ne('potato') & df['fruit'].notna())
                   .groupby(df['id'], sort=False)
                   .rolling(min_periods=1, window=3).sum().astype(int)
                   .values
                  )
Your code is close; you just need to change 'id' to df['id'] in the .groupby() call. Since the object calling .groupby() is now a boolean Series rather than df itself, .groupby() cannot resolve the column label 'id' on its own and needs the DataFrame reference to fully identify the grouping column.
Also remove as_index=False, since that parameter applies to a DataFrame rather than the (boolean) Series here.
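For the first variant, .droplevel(0) is what realigns the result: a groupby-rolling on a Series returns a MultiIndex of (group key, original row label), and dropping the group level restores the original index so the column assignment aligns. A minimal sketch on toy data (values assumed):
import pandas as pd

s = pd.Series([1, 1, 0], index=[0, 1, 2])
key = pd.Series([7, 7, 3], index=[0, 1, 2])
rolled = s.groupby(key, sort=False).rolling(window=3, min_periods=1).sum()
print(rolled.index)               # MultiIndex: [(7, 0), (7, 1), (3, 2)]
print(rolled.droplevel(0).index)  # back to [0, 1, 2], so the assignment aligns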
Result:
print(df)
id fruit sum_last3
0 7 NaN 0
1 7 apple 1
2 7 NaN 1
3 7 mango 2
4 7 apple 2
5 7 potato 2
6 3 berry 1
7 3 olive 2
8 3 olive 3
9 3 grape 3
10 3 NaN 2
11 3 mango 2
12 3 potato 1

Add values to the sum of a column based on another column in csv using Pandas Python

Let's say I have this DataFrame:
Fruits Price Quantity
apple 12 10
pear 50 5
kiwi 42 20
kiwi 30 35
I want to compute the sum like this, grouping by fruits:
df.groupby(['Fruits'])['Price'].sum()
All good until now, but I want the price to be added to the sum halved (price/2) for the rows where the quantity is above 10. How do I do this?
You can try making changes to the dataframe first and then calculate the sum after grouping.
df_clone = df.copy()
df_clone['Price'] = [df_clone['Price'].loc[i]/2 if df_clone['Quantity'].loc[i] > 10 else df_clone['Price'].loc[i] for i in range(df_clone.shape[0])]
print(df_clone)
which will give:
Fruits Price Quantity
0 apple 12.0 10
1 pear 50.0 5
2 kiwi 21.0 20
3 kiwi 15.0 35
and now you can group this new dataframe to get your output:
df_clone.groupby(['Fruits'])['Price'].sum()
which results in:
Fruits
apple 12.0
kiwi 36.0
pear 50.0
Name: Price, dtype: float64
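A vectorized sketch of the same idea, using Series.mask instead of a Python-level loop (equivalent under the same assumptions):
df_clone = df.copy()
# halve the price wherever the quantity exceeds 10, leave the rest untouched
df_clone['Price'] = df_clone['Price'].mask(df_clone['Quantity'] > 10, df_clone['Price'] / 2)
print(df_clone.groupby(['Fruits'])['Price'].sum())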

update pandas groupby group with column value

I have a test df like this:
df = pd.DataFrame({'A': ['Apple','Apple', 'Apple','Orange','Orange','Orange','Pears','Pears'],
'B': [1,2,9,6,4,3,2,1]
})
A B
0 Apple 1
1 Apple 2
2 Apple 9
3 Orange 6
4 Orange 4
5 Orange 3
6 Pears 2
7 Pears 1
Now I need to add a new column with the respective % differences in col 'B'. How is this possible? I cannot get this to work.
I have looked at
update column value of pandas groupby().last()
Not sure that it is pertinent to my problem.
And this which looks promising
Pandas Groupby and Sum Only One Column
I need to compute the maximum percentage change in col 'B' per group of col 'A' and insert it into the col maxpercchng for all rows in the group.
So I have come up with this code:
grouppercchng = ((df.groupby['A'].max() - df.groupby['A'].min())/df.groupby['A'].iloc[0])*100
and try to add it to the group col 'maxpercchng' like so
group['maxpercchng'] = grouppercchng
Or like so
df_kpi_hot.groupby(['A'], as_index=False)['maxpercchng'] = grouppercchng
Does anyone know how to add the maxpercchng value to all rows in a group?
I believe you need transform, which returns a Series the same size as the original DataFrame, filled with the aggregated values:
g = df.groupby('A')['B']
df['maxpercchng'] = (g.transform('max') - g.transform('min')) / g.transform('first') * 100
print (df)
A B maxpercchng
0 Apple 1 800.0
1 Apple 2 800.0
2 Apple 9 800.0
3 Orange 6 50.0
4 Orange 4 50.0
5 Orange 3 50.0
6 Pears 2 50.0
7 Pears 1 50.0
Or:
g = df.groupby('A')['B']
df1 = ((g.max() - g.min()) / g.first() * 100).reset_index()
print (df1)
A B
0 Apple 800.0
1 Orange 50.0
2 Pears 50.0
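If you have already computed the per-group values as in df1, a sketch of broadcasting them back with map (same df assumed):
import pandas as pd

df = pd.DataFrame({'A': ['Apple', 'Apple', 'Apple', 'Orange', 'Orange', 'Orange', 'Pears', 'Pears'],
                   'B': [1, 2, 9, 6, 4, 3, 2, 1]})
g = df.groupby('A')['B']
pct = (g.max() - g.min()) / g.first() * 100  # one value per group, indexed by 'A'
df['maxpercchng'] = df['A'].map(pct)         # broadcast each group's value to its rows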

Pandas: find all unique values in one column and normalize all values in another column to their last value

Problem
I want to find all unique values in one column and normalize the corresponding values in another column to their last value. I want to achieve this via the pandas module using Python 3.
Example:
Original dataset
Fruit | Amount
Orange | 90
Orange | 80
Orange | 10
Apple | 100
Apple | 50
Orange | 20
Orange | 60 --> latest value of Orange. Used to normalize Orange
Apple | 75
Apple | 25
Apple | 40 --> latest value of Apple. Used to normalize Apple
Desired output
Ratio column with normalized values for unique values in the 'Fruit' column
Fruit | Amount | Ratio
Orange | 90 | 90/60 --> 150%
Orange | 80 | 80/60 --> 133.3%
Orange | 10 | 10/60 --> 16.7%
Apple | 100 | 100/40 --> 250%
Apple | 50 | 50/40 --> 125%
Orange | 20 | 20/60 --> 33.3%
Orange | 60 | 60/60 --> 100%
Apple | 75 | 75/40 --> 187.5%
Apple | 25 | 25/40 --> 62.5%
Apple | 40 | 40/40 --> 100%
Python code attempt
import pandas as pd
filename = r'C:\fruitdata.dat'
df = pd.read_csv(filename, delimiter='|')
print(df)
print(df.loc[df['Fruit '] == 'Orange '])
print(df[df['Fruit '] == 'Orange '].tail(1))
Python output (IPython)
In [1]: df
Fruit Amount
0 Orange 90
1 Orange 80
2 Orange 10
3 Apple 100
4 Apple 50
5 Orange 20
6 Orange 60
7 Apple 75
8 Apple 25
9 Apple 40
In [2]: df.loc[df['Fruit '] == 'Orange ']
Fruit Amount
0 Orange 90
1 Orange 80
2 Orange 10
5 Orange 20
6 Orange 60
In [3]: df[df['Fruit '] == 'Orange '].tail(1)
Out[3]:
Fruit Amount
6 Orange 60
Question
How can I loop through each unique item in 'Fruit' and normalize all values against its
tail value?
You can use iloc with groupby:
df.groupby('Fruit').Amount.apply(lambda x: x/x.iloc[-1])
Out[38]:
0 1.500000
1 1.333333
2 0.166667
3 2.500000
4 1.250000
5 0.333333
6 1.000000
7 1.875000
8 0.625000
9 1.000000
Name: Amount, dtype: float64
Then assign it back:
df['New']=df.groupby('Fruit').Amount.apply(lambda x: x/x.iloc[-1])
df
Out[40]:
Fruit Amount New
0 Orange 90 1.500000
1 Orange 80 1.333333
2 Orange 10 0.166667
3 Apple 100 2.500000
4 Apple 50 1.250000
5 Orange 20 0.333333
6 Orange 60 1.000000
7 Apple 75 1.875000
8 Apple 25 0.625000
9 Apple 40 1.000000
Without using lambda:
df.Amount/df.groupby('Fruit',sort=False).Amount.transform('last')
Out[46]:
0 1.500000
1 1.333333
2 0.166667
3 2.500000
4 1.250000
5 0.333333
6 1.000000
7 1.875000
8 0.625000
9 1.000000
Name: Amount, dtype: float64
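If you want the percent strings from the desired output rather than raw ratios, a small formatting sketch (column names as in the question):
ratio = df['Amount'] / df.groupby('Fruit', sort=False)['Amount'].transform('last')
df['Ratio'] = (ratio * 100).round(1).astype(str) + '%'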
