Flagging NaN values based on a condition and year - python

I am trying to flag values as NaN based on a condition and a particular year. Below is my code:
import pandas as pd
import numpy as np
s={'Fruits':['Apple','Orange', 'Banana', 'Mango'],'month':['201401','201502','201603','201604'],'weight':[2,4,1,6],'Quant':[251,178,298,300]}
p=pd.DataFrame(data=s)
upper = 250
How would I be able to flag NaN values for months 201603 and 201604 (03 and 04 are the months) when Quant exceeds upper? Basically, my intention is to check whether the Quant value is greater than the defined upper value, but only for specific dates, i.e. 201603 and 201604.
This is how the output should look:
Fruits month weight Quant
0 Apple 201401 2 251.0
1 Orange 201502 4 178.0
2 Banana 201603 1 NaN
3 Mango 201604 6 NaN

You can use .loc:
p.loc[(p.Quant > upper) & (p.month.str[-2:].isin(['03','04'])), 'Quant'] = np.nan
Output:
Fruits month weight Quant
0 Apple 201401 2 251.0
1 Orange 201502 4 178.0
2 Banana 201603 1 NaN
3 Mango 201604 6 NaN

You could build a boolean condition that checks whether "Quant" is greater than "upper" and the month is "03" or "04", and mask the "Quant" column:
p['Quant'] = p['Quant'].mask(p['Quant'].gt(upper) & p['month'].str[-2:].isin(['03','04']))
Output:
Fruits month weight Quant
0 Apple 201401 2 251.0
1 Orange 201502 4 178.0
2 Banana 201603 1 NaN
3 Mango 201604 6 NaN

Use boolean indexing with index alignment. Rows that match the condition are excluded from the selection, so assigning the result back (here to a new column, Quant1) leaves NaN in those positions:
p['Quant1'] = p[~(((p['month']=='201603')|(p['month']=='201604'))&(p['Quant']>250))]['Quant']

Related

Search N consecutive rows with same value in one dataframe

I need to write Python code that searches a dataframe column for "N" (a variable) consecutive rows with the same value, excluding NaN, like this.
I can't figure out how to do it with a for loop because I don't know which row I'm looking at in each case. Any idea how I can do this?
Fruit    2 matches  5 matches
Apple    No         No
NaN      No         No
Pear     No         No
Pear     Yes        No
Pear     Yes        No
Pear     Yes        No
Pear     Yes        Yes
NaN      No         No
NaN      No         No
NaN      No         No
NaN      No         No
NaN      No         No
Banana   No         No
Banana   Yes        No
Update: testing the solution by @Corralien
counts = (df.groupby(df['Fruit'].ne(df['Fruit'].shift()).cumsum()) # virtual groups
.transform('cumcount').add(1) # cumulative counter
.where(df['Fruit'].notna(), other=0)) # set NaN to 0
N = 2
df['Matches'] = df.where(counts >= N, other='No')
VSCode shows the 'Frame skipped from debugging during step-in.' message when it executes the last line, and an exception is raised in the previous for loop.
Compute consecutive values and set NaN to 0. Once you have calculated the cumulative counter, you just have to check if the counter is greater than or equal to N:
counts = (df.groupby(df['Fruit'].ne(df['Fruit'].shift()).cumsum()) # virtual groups
.transform('cumcount').add(1) # cumulative counter
.where(df['Fruit'].notna(), other=0)) # set NaN to 0
N = 2
df['2 matches'] = counts.ge(N).replace({True: 'Yes', False: 'No'})
N = 5
df['5 matches'] = counts.ge(N).replace({True: 'Yes', False: 'No'})
Output:
>>> df
Fruit 2 matches 5 matches
0 Apple No No
1 NaN No No
2 Pear No No
3 Pear Yes No
4 Pear Yes No
5 Pear Yes No
6 Pear Yes Yes
7 NaN No No
8 NaN No No
9 NaN No No
10 NaN No No
11 NaN No No
12 Banana No No
13 Banana Yes No
>>> counts
0 1
1 0
2 1
3 2
4 3
5 4
6 5
7 0
8 0
9 0
10 0
11 0
12 1
13 2
dtype: int64
Update
If I need to change "Yes" to the fruit name, for example:
N = 2
df['2 matches'] = df.where(counts >= N, other='No')
print(df)
# Output
Fruit 2 matches
0 Apple No
1 NaN No
2 Pear No
3 Pear Pear
4 Pear Pear
5 Pear Pear
6 Pear Pear
7 NaN No
8 NaN No
9 NaN No
10 NaN No
11 NaN No
12 Banana No
13 Banana Banana
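For comparison, a plain loop over the rows (the approach the asker was trying) could look like the following minimal sketch; the column name 'Fruit' and the window logic are assumptions based on the example above, and the groupby/cumcount answer is preferable for large frames.
import pandas as pd

def consecutive_matches(series, n):
    # flag row i with 'Yes' when rows i-n+1 .. i all hold the same non-NaN value
    flags = []
    for i in range(len(series)):
        window = series.iloc[max(0, i - n + 1): i + 1]
        ok = (
            len(window) == n           # enough preceding rows for a run of n
            and window.notna().all()   # no NaN inside the window
            and window.nunique() == 1  # all values identical
        )
        flags.append('Yes' if ok else 'No')
    return flags

df['2 matches'] = consecutive_matches(df['Fruit'], 2)
df['5 matches'] = consecutive_matches(df['Fruit'], 5)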

Missing value replaced by another column [duplicate]

This question already has answers here:
Pandas: filling missing values by mean in each group
import pandas as pd

data = {
"Food": ['apple', 'apple', 'apple','orange','apple','apple','orange','orange','orange'],
"Calorie": [50, 40, 50,30,'Nan','Nan',50,30,'Nan']
}
#load data into a DataFrame object:
df = pd.DataFrame(data)
print(df)
I have a data frame as above. I need to replace the missing values using the group median: if the food is apple, the Nan values need to be replaced by the median of the apple rows, and likewise for orange. The output needs to look like this:
Food Calorie
0 apple 50
1 apple 40
2 apple 50
3 orange 30
4 apple 50
5 apple 50
6 orange 50
7 orange 30
8 orange 30
You could do
import numpy as np
df = df.replace('Nan', np.nan)
df.Calorie.fillna(df.groupby('Food')['Calorie'].transform('median'), inplace=True)
df
Out[170]:
Food Calorie
0 apple 50.0
1 apple 40.0
2 apple 50.0
3 orange 30.0
4 apple 50.0
5 apple 50.0
6 orange 50.0
7 orange 30.0
8 orange 30.0
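Note that on recent pandas versions the chained fillna(..., inplace=True) call above may emit a warning or stop modifying the original frame; a safer equivalent is to assign the result back directly:
df['Calorie'] = df['Calorie'].fillna(df.groupby('Food')['Calorie'].transform('median'))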

How can I group multiple columns in a Data Frame?

I don't know if this is possible but I have a data frame like this one:
df
State County Homicides Man Woman Not_Register
Gto Celaya 2 2 0 0
NaN NaN 8 4 2 2
NaN NaN 3 2 1 0
NaN Yiriria 2 1 1 0
Nan Acambaro 1 1 0 0
Sin Culiacan 3 1 1 1
NaN Nan 5 4 0 1
Chih Juarez 1 1 0 0
I want to group by State, County, Man, Woman, Homicides and Not_Register, like this:
State County Homicides Man Woman Not_Register
Gto Celaya 13 8 3 2
Gto Yiriria 2 1 1 0
Gto Acambaro 1 1 0 0
Sin Culiacan 8 5 1 2
Chih Juarez 1 1 0 0
So far, I have been able to group by State and County and fill the NaN rows with the right State and County names. My code and result:
import numpy as np
import math
df = df.fillna(method='pad')  # forward-fill the State and County names so they repeat in the right order
#To group
df = df.groupby(["State","County"]).agg('sum')
df =df.reset_index()
df
State County Homicides
Gto Celaya 13
Gto Yiriria 2
Gto Acambaro 1
Sin Culiacan 8
Chih Juarez 1
But when I tried to add the Man and Woman columns:
df1 = df.groupby(["State","County", "Man", "Women", "Not_Register"]).agg('sum')
df1 =df.reset_index()
df1
My result repeats the counties instead of giving me a unique County per State.
How can I resolve this issue?
Thanks for your help.
Change to:
df[['Homicides','Man','Woman','Not_Register']]=df[['Homicides','Man','Woman','Not_Register']].apply(pd.to_numeric,errors = 'coerce')
df = df.groupby(['State',"County"]).sum().reset_index()
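Putting the pieces together, a minimal end-to-end sketch might look like this. It assumes the blank State/County cells are real NaN values; the literal 'Nan' strings shown in the sample would need to be replaced first (e.g. df[['State','County']] = df[['State','County']].replace('Nan', pd.NA)).
import pandas as pd

# forward-fill the State and County labels so every row belongs to a group
df[['State', 'County']] = df[['State', 'County']].ffill()

# force the count columns to numeric; stray strings become NaN
cols = ['Homicides', 'Man', 'Woman', 'Not_Register']
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')

# aggregate per State/County
out = df.groupby(['State', 'County'], sort=False, as_index=False)[cols].sum()
print(out)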

Pandas: find all unique values in one column and normalize all values in another column to their last value

Problem
I want to find all unique values in one column and normalize the corresponding values in another column to each group's last value. I want to achieve this with the pandas module using Python 3.
Example:
Original dataset
Fruit | Amount
Orange | 90
Orange | 80
Orange | 10
Apple | 100
Apple | 50
Orange | 20
Orange | 60 --> latest value of Orange. Used to normalize Orange
Apple | 75
Apple | 25
Apple | 40 --> latest value of Apple. Used to normalize Apple
Desired output
Ratio column with normalized values for unique values in the 'Fruit' column
Fruit | Amount | Ratio
Orange | 90 | 90/60 --> 150%
Orange | 80 | 80/60 --> 133.3%
Orange | 10 | 10/60 --> 16.7%
Apple | 100 | 100/40 --> 250%
Apple | 50 | 50/40 --> 125%
Orange | 20 | 20/60 --> 33.3%
Orange | 60 | 60/60 --> 100%
Apple | 75 | 75/40 --> 187.5%
Apple | 25 | 25/40 --> 62.5%
Apple | 40 | 40/40 --> 100%
Python code attempt
import pandas as pd
filename = r'C:\fruitdata.dat'
df = pd.read_csv(filename, delimiter='|')
print(df)
print(df.loc[df['Fruit '] == 'Orange '])
print(df[df['Fruit '] == 'Orange '].tail(1))
Python output (IPython)
In [1]: df
Fruit Amount
0 Orange 90
1 Orange 80
2 Orange 10
3 Apple 100
4 Apple 50
5 Orange 20
6 Orange 60
7 Apple 75
8 Apple 25
9 Apple 40
In [2]: df.loc[df['Fruit '] == 'Orange ']
Fruit Amount
0 Orange 90
1 Orange 80
2 Orange 10
5 Orange 20
6 Orange 60
In [3]: df[df['Fruit '] == 'Orange '].tail(1)
Out[3]:
Fruit Amount
6 Orange 60
Question
How can I loop through each unique item in 'Fruit' and normalize all values against its
tail value?
You can use iloc with groupby:
df.groupby('Fruit').Amount.apply(lambda x: x/x.iloc[-1])
Out[38]:
0 1.500000
1 1.333333
2 0.166667
3 2.500000
4 1.250000
5 0.333333
6 1.000000
7 1.875000
8 0.625000
9 1.000000
Name: Amount, dtype: float64
Then assign it back:
df['New']=df.groupby('Fruit').Amount.apply(lambda x: x/x.iloc[-1])
df
Out[40]:
Fruit Amount New
0 Orange 90 1.500000
1 Orange 80 1.333333
2 Orange 10 0.166667
3 Apple 100 2.500000
4 Apple 50 1.250000
5 Orange 20 0.333333
6 Orange 60 1.000000
7 Apple 75 1.875000
8 Apple 25 0.625000
9 Apple 40 1.000000
Without using lambda
df.Amount/df.groupby('Fruit',sort=False).Amount.transform('last')
Out[46]:
0 1.500000
1 1.333333
2 0.166667
3 2.500000
4 1.250000
5 0.333333
6 1.000000
7 1.875000
8 0.625000
9 1.000000
Name: Amount, dtype: float64
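If you also want the percentage formatting shown in the desired output, a small sketch along the same lines (the column name 'Ratio' is an assumption) could be:
df['Ratio'] = (df.Amount / df.groupby('Fruit', sort=False).Amount.transform('last') * 100).round(1).astype(str) + '%'
This gives, e.g., '150.0%' for the first Orange row.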

FuzzyWuzzy using two pandas dataframes python

I want to find the fuzz.ratio of strings that are in two dataframes. Let's say I have two dataframes: df with columns A, B and bt_df with columns A1, B1. I want to compare the columns df['B'] and bt_df['B1'] and return the best matching score and its corresponding id in df['A'].
df
Out[8]:
A B
0 11111111111111111111 Cheesesalad
1 22222222222222222222 Cheese
2 33333333333333333333 salad
3 44444444444444444444 BMWSalad
4 55555555555555555555 BMW
5 66666666666666666666 Apple
6 77777777777777777777 Apple####
7 88888888888888888888 Macrooni!
bt_df
Out[9]:
A1 B1
0 180336 NaN
1 154263 Cheese
2 130876 Salad
3 204430 Macrooni
4 153490 NaN
5 48879 NaN
6 185495 NaN
7 105099 NaN
8 8645 Apple
9 54038 NaN
10 156523 NaN
11 18156 BWM
Hence the result should be:
B1 matchedstring score id
Cheese Cheese 100 22222222222222222222
.....
.....
Thanks in advance.
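One possible starting point, assuming fuzzywuzzy is installed and the frames are named df and bt_df exactly as above, is a minimal sketch using process.extractOne (the helper name best_match is made up here):
import pandas as pd
from fuzzywuzzy import fuzz, process

def best_match(name):
    # skip missing values in bt_df['B1']
    if pd.isna(name):
        return pd.Series([None, None, None])
    # best fuzzy match of `name` against every string in df['B']
    match, score = process.extractOne(name, df['B'].tolist(), scorer=fuzz.ratio)
    # id from df['A'] for the matched string (first occurrence)
    matched_id = df.loc[df['B'] == match, 'A'].iloc[0]
    return pd.Series([match, score, matched_id])

result = bt_df.copy()
result[['matchedstring', 'score', 'id']] = result['B1'].apply(best_match)
print(result[['B1', 'matchedstring', 'score', 'id']])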
