Find the index of multiple rows - Python

Suppose I have a dataframe called df as follows:
A_column B_column C_column
0 Apple 100 15
1 Banana 80 20
2 Orange 110 10
3 Apple 150 16
4 Apple 90 13
[Q] How can I get the list of indices [0, 3, 4] for Apple in A_column?

You can just pass the row indexes as a list to df.iloc:
>>> df
A_column B_column C_column
0 Apple 100 15
1 Banana 80 20
2 Orange 110 10
3 Apple 150 16
4 Apple 90 13
>>> df.iloc[[0,3,4]]
A_column B_column C_column
0 Apple 100 15
3 Apple 150 16
4 Apple 90 13
EDIT: it seems I misunderstood your question.
You want a list containing the index labels of the rows containing 'Apple'; for that you can use df.index[df['A_column']=='Apple'].tolist():
>>> df.index[df['A_column']=='Apple'].tolist()
[0, 3, 4]
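As an alternative sketch, if what you need are positional indices rather than index labels (the two coincide here because the index is the default RangeIndex), `np.where` on the boolean mask also works:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A_column': ['Apple', 'Banana', 'Orange', 'Apple', 'Apple'],
    'B_column': [100, 80, 110, 150, 90],
    'C_column': [15, 20, 10, 16, 13],
})

# Positional indices of the rows matching the condition.
positions = np.where(df['A_column'] == 'Apple')[0].tolist()
```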


Missing value replaced by another column [duplicate]

This question already has answers here:
Pandas: filling missing values by mean in each group (12 answers)
Closed 6 months ago.
data = {
"Food": ['apple', 'apple', 'apple','orange','apple','apple','orange','orange','orange'],
"Calorie": [50, 40, 50,30,'Nan','Nan',50,30,'Nan']
}
#load data into a DataFrame object:
df = pd.DataFrame(data)
print(df)
I have the data frame above. I need to replace each missing value with the median of its food group: if the food is apple, the Nan values should be replaced by the apple median, and the same goes for orange. The output needs to be like this:
Food Calorie
0 apple 50
1 apple 40
2 apple 50
3 orange 30
4 apple 50
5 apple 50
6 orange 50
7 orange 30
8 orange 30
You could do
import numpy as np
df = df.replace('Nan', np.nan)
df.Calorie.fillna(df.groupby('Food')['Calorie'].transform('median'), inplace=True)
df
Out[170]:
Food Calorie
0 apple 50.0
1 apple 40.0
2 apple 50.0
3 orange 30.0
4 apple 50.0
5 apple 50.0
6 orange 50.0
7 orange 30.0
8 orange 30.0
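A sketch of the same group-median fill via plain assignment, which also avoids the chained `inplace=True` fillna that newer pandas versions warn about (the `'Nan'` strings must first become real missing values so the column is numeric):

```python
import pandas as pd

data = {
    "Food": ['apple', 'apple', 'apple', 'orange', 'apple',
             'apple', 'orange', 'orange', 'orange'],
    "Calorie": [50, 40, 50, 30, 'Nan', 'Nan', 50, 30, 'Nan'],
}
df = pd.DataFrame(data)

# Turn the 'Nan' strings into real NaN and make the column numeric.
df['Calorie'] = pd.to_numeric(df['Calorie'], errors='coerce')

# Fill each missing value with the median of its Food group.
df['Calorie'] = df['Calorie'].fillna(df.groupby('Food')['Calorie'].transform('median'))
```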

Python_Cumulative sum based on two conditions

I'm trying to compute a cumulative sum in Python based on two different conditions.
As you can see in the example below, the Calculation column takes the same value as the Number column as long as the Cat1 and Cat2 columns don't change.
Once the Cat1 column changes, the calculation resets.
When the Cat2 column changes while Cat1 keeps the same value, the Calculation column takes the last value reached and continues adding the Number increments to it.
Example of data below:
Cat1 Cat2 Number CALCULATION
a orange 1 1
a orange 2 2
a orange 3 3
a orange 4 4
a orange 5 5
a orange 6 6
a orange 7 7
a orange 8 8
a orange 9 9
a orange 10 10
a orange 11 11
a orange 12 12
a orange 13 13
b purple 1 1
b purple 2 2
b purple 3 3
b purple 4 4
b purple 5 5
b purple 6 6
b purple 7 7
b purple 8 8
b silver 1 9
b silver 2 10
b silver 3 11
b silver 4 12
b silver 5 13
b silver 6 14
b silver 7 15
Are you looking for:
import pandas as pd
df = pd.DataFrame({'Cat1': ['a']*13 + ['b']*15,
                   'Cat2': ['orange']*13 + ['purple']*8 + ['silver']*7})
df['Number'] = df.groupby(['Cat1', 'Cat2']).cumcount()+1
df['CALCULATION'] = df.groupby('Cat1').cumcount()+1
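The cumcount trick works because Number happens to be a 1..n counter within each (Cat1, Cat2) group. A more general sketch (on a smaller made-up frame, assuming Number can hold arbitrary values that still restart per Cat2 group) sums row-to-row increments within each Cat1 block:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Cat1':   ['a', 'a', 'a', 'b', 'b', 'b', 'b', 'b'],
    'Cat2':   ['orange']*3 + ['purple']*3 + ['silver']*2,
    'Number': [1, 2, 3, 1, 2, 3, 1, 2],
})

# While (Cat1, Cat2) is unchanged, the increment is the row-to-row
# difference of Number; when either category changes, the increment is
# Number itself (a new Cat1 restarts the sum via the groupby below,
# a new Cat2 carries the running total on).
changed = (df['Cat1'] != df['Cat1'].shift()) | (df['Cat2'] != df['Cat2'].shift())
inc = np.where(changed, df['Number'], df['Number'].diff())

# Cumulative-sum the increments within each Cat1 block.
df['CALCULATION'] = (pd.Series(inc, index=df.index)
                       .groupby(df['Cat1']).cumsum().astype(int))
```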

Filter rows with only Zero values with 2 columns

Sample DF:
ID Name Price Amount Fit_Test
1 Apple 10 15 Super_Fit
2 Apple 10 0 Super_Fit
3 Apple 10 0 Super_Fit
4 Orange 12 20 Not_Fit
5 Orange 12 0 Not_Fit
6 Banana 15 17 Medium_Fit
7 Banana 15 0 Medium_Fit
8 Pineapple 25 19 Medium_Fit
9 Pineapple 25 18 Medium_Fit
10 Cherry 30 56 Super_Fit
11 XXX 50 0 Medium_Fit
12 XXX 50 0 Medium_Fit
Expected DF:
ID Name Price Amount Fit_Test
1 Apple 10 15 Super_Fit
2 Apple 10 0 Super_Fit
3 Apple 10 0 Super_Fit
4 Orange 12 20 Not_Fit
6 Banana 15 17 Medium_Fit
8 Pineapple 25 19 Medium_Fit
9 Pineapple 25 18 Medium_Fit
10 Cherry 30 56 Super_Fit
11 XXX 50 0 Medium_Fit
12 XXX 50 0 Medium_Fit
Problem Statement:
I want to group by Name and Price and then filter on Amount, with Fit_Test as a conditional column:
If Fit_Test is Super_Fit, no operation is needed (rows 1, 2, 3 and 10 are the same in the input and expected DF).
Within a Name/Price group whose Fit_Test is not Super_Fit, if an Amount is 0 (and the group also has non-zero amounts), then and only then remove that row (of IDs 4, 5, 6, 7, rows 5 and 7 are deleted in the expected DF).
Within a Name/Price group whose Fit_Test is not Super_Fit, if all amounts are greater than zero, don't remove any rows (rows 8 and 9 are the same in the input and expected DF).
Within a Name/Price group whose Fit_Test is not Super_Fit, if all amounts are equal to zero, don't remove any rows (rows 11 and 12 are the same in the input and expected DF).
I can write a solution that removes all zero rows, but I can't make it respect the conditional column.
You can chain two conditions: the first mask uses GroupBy.transform('all') to check whether every Fit_Test in a Name/Price group is Super_Fit, and the second keeps rows whose Amount is not equal to zero:
m1 = df['Fit_Test'].eq('Super_Fit').groupby([df['Name'],df['Price']]).transform('all')
m2 = df['Amount'].ne(0)
df = df[m1 | m2]
print (df)
ID Name Price Amount Fit_Test
0 1 Apple 10 15 Super_Fit
1 2 Apple 10 0 Super_Fit
2 3 Apple 10 0 Super_Fit
3 4 Orange 12 20 Not_Fit
5 6 Banana 15 17 Medium_Fit
7 8 Pineapple 25 19 Medium_Fit
8 9 Pineapple 25 18 Medium_Fit
9 10 Cherry 30 56 Super_Fit
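Note that the expected DF also keeps the XXX rows (the fourth rule: a group where every Amount is zero stays untouched). A sketch that adds a third mask for that case, on data reconstructed from the sample DF:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': range(1, 13),
    'Name': ['Apple']*3 + ['Orange']*2 + ['Banana']*2 + ['Pineapple']*2
            + ['Cherry'] + ['XXX']*2,
    'Price': [10]*3 + [12]*2 + [15]*2 + [25]*2 + [30] + [50]*2,
    'Amount': [15, 0, 0, 20, 0, 17, 0, 19, 18, 56, 0, 0],
    'Fit_Test': ['Super_Fit']*3 + ['Not_Fit']*2 + ['Medium_Fit']*4
                + ['Super_Fit'] + ['Medium_Fit']*2,
})

# m1: every Fit_Test in the Name/Price group is Super_Fit.
m1 = df['Fit_Test'].eq('Super_Fit').groupby([df['Name'], df['Price']]).transform('all')
# m2: the row's Amount is non-zero.
m2 = df['Amount'].ne(0)
# m3: every Amount in the Name/Price group is zero (keep such groups whole).
m3 = df['Amount'].eq(0).groupby([df['Name'], df['Price']]).transform('all')

out = df[m1 | m2 | m3]
```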

Rank the elements in a pandas dataframe

Hi, I am working with a pandas DataFrame like the one below:
Name Quality
Carrot 50
Potato 34
Raddish 43
Ginger 50
Tomato 43
Cabbage 12
I want to associate a rank to the dataframe. I have successfully been able to sort the dataframe based on the field Quality like below:
Name Quality
Carrot 50
Ginger 50
Raddish 43
Tomato 43
Potato 34
Cabbage 12
Now what I want to do is, add a new column called Position and have the rank at which they exist.
The point is, the same rank can be given to two different elements if their quality is the same.
Sample Output Dataframe:
Name Quality Position
Carrot 50 1
Ginger 50 1
Raddish 43 2
Tomato 43 2
Potato 34 3
Cabbage 12 4
Notice how two elements with the same quality share the same position, while the elements below them get the next position. Also, the dataframe has around 10 million records on average.
How can I do this in Pandas.Dataframe?
I Sort my Dataframe like below:
df_sort = dataframe.sort_values(by=attribute, ascending=order)
df_sort.reset_index(drop=True)
You're going to want to use rank.
There are a few variations of rank; the one you want is dense. That ensures that ties don't cause the ranking to skip values.
df['Position'] = df.Quality.rank(method='dense', ascending = False).astype(int)
df
Name Quality Position
0 Carrot 50 1
1 Ginger 50 1
2 Raddish 43 2
3 Tomato 43 2
4 Potato 34 3
5 Cabbage 12 4
For demonstration purposes, if you didn't use dense but rather min, your dataframe would look like this:
Name Quality Position
0 Carrot 50 1
1 Ginger 50 1
2 Raddish 43 3
3 Tomato 43 3
4 Potato 34 5
5 Cabbage 12 6
The key here is to use ascending=False, so that the highest quality gets rank 1.
For a pre-sorted dataframe, you can use pandas.factorize:
df['Rank'] = pd.factorize(df['Quality'])[0] + 1
print(df)
Name Quality Rank
0 Carrot 50 1
1 Ginger 50 1
2 Raddish 43 2
3 Tomato 43 2
4 Potato 34 3
5 Cabbage 12 4
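To make the tie-handling differences concrete, a small sketch comparing three rank methods side by side ('average' is the default and is what produces the fractional "half" ranks mentioned above):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Carrot', 'Ginger', 'Raddish', 'Tomato', 'Potato', 'Cabbage'],
    'Quality': [50, 50, 43, 43, 34, 12],
})

# 'dense' never skips ranks, 'min' leaves gaps after ties,
# and 'average' assigns the mean of the tied positions.
for method in ('dense', 'min', 'average'):
    df[method] = df['Quality'].rank(method=method, ascending=False)
```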

Pandas: find all unique values in one column and normalize all values in another column to their last value

Problem
I want to find all unique values in one column and normalize the
corresponding values in another column to its last value. I want to achieve this via the pandas module using python3.
Example:
Original dataset
Fruit | Amount
Orange | 90
Orange | 80
Orange | 10
Apple | 100
Apple | 50
Orange | 20
Orange | 60 --> latest value of Orange. Use to normalize Orange
Apple | 75
Apple | 25
Apple | 40 --> latest value of Apple. Used to normalize Apple
Desired output
Ratio column with normalized values for unique values in the 'Fruit' column
Fruit | Amount | Ratio
Orange | 90 | 90/60 --> 150%
Orange | 80 | 80/60 --> 133.3%
Orange | 10 | 10/60 --> 16.7%
Apple | 100 | 100/40 --> 250%
Apple | 50 | 50/40 --> 125%
Orange | 20 | 20/60 --> 33.3%
Orange | 60 | 60/60 --> 100%
Apple | 75 | 75/40 --> 187.5%
Apple | 25 | 25/40 --> 62.5%
Apple | 40 | 40/40 --> 100%
Python code attempt
import pandas as pd
filename = r'C:\fruitdata.dat'
df = pd.read_csv(filename, delimiter='|')
print(df)
print(df.loc[df['Fruit '] == 'Orange '])
print(df[df['Fruit '] == 'Orange '].tail(1))
Python output (IPython)
In [1]: df
Fruit Amount
0 Orange 90
1 Orange 80
2 Orange 10
3 Apple 100
4 Apple 50
5 Orange 20
6 Orange 60
7 Apple 75
8 Apple 25
9 Apple 40
In [2]: df.loc[df['Fruit '] == 'Orange ']
Fruit Amount
0 Orange 90
1 Orange 80
2 Orange 10
5 Orange 20
6 Orange 60
In [3]: df[df['Fruit '] == 'Orange '].tail(1)
Out[3]:
Fruit Amount
6 Orange 60
Question
How can I loop through each unique item in 'Fruit' and normalize all values against its
tail value?
You can use iloc with groupby:
df.groupby('Fruit').Amount.apply(lambda x: x/x.iloc[-1])
Out[38]:
0 1.500000
1 1.333333
2 0.166667
3 2.500000
4 1.250000
5 0.333333
6 1.000000
7 1.875000
8 0.625000
9 1.000000
Name: Amount, dtype: float64
Then assign it back:
df['New']=df.groupby('Fruit').Amount.apply(lambda x: x/x.iloc[-1])
df
Out[40]:
Fruit Amount New
0 Orange 90 1.500000
1 Orange 80 1.333333
2 Orange 10 0.166667
3 Apple 100 2.500000
4 Apple 50 1.250000
5 Orange 20 0.333333
6 Orange 60 1.000000
7 Apple 75 1.875000
8 Apple 25 0.625000
9 Apple 40 1.000000
Without using lambda
df.Amount/df.groupby('Fruit',sort=False).Amount.transform('last')
Out[46]:
0 1.500000
1 1.333333
2 0.166667
3 2.500000
4 1.250000
5 0.333333
6 1.000000
7 1.875000
8 0.625000
9 1.000000
Name: Amount, dtype: float64
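The desired output shows percentages; the transform('last') ratio can be scaled and formatted in one more step (a sketch assuming the padded column names from the file have been stripped):

```python
import pandas as pd

df = pd.DataFrame({
    'Fruit': ['Orange', 'Orange', 'Orange', 'Apple', 'Apple',
              'Orange', 'Orange', 'Apple', 'Apple', 'Apple'],
    'Amount': [90, 80, 10, 100, 50, 20, 60, 75, 25, 40],
})

# Divide each Amount by the last Amount of its Fruit group...
ratio = df['Amount'] / df.groupby('Fruit')['Amount'].transform('last')

# ...then scale to percent and format as strings like the desired output.
df['Ratio'] = (ratio * 100).round(1).astype(str) + '%'
```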
