Pandas DataFrame removing NaN rows based on condition? - python

Pandas DataFrame removing NaN rows based on condition.
I'm trying to remove the rows where gender == male and status == NaN.
Sample df:
name status gender leaves
0 tom NaN male 5
1 tom True male 6
2 tom True male 7
3 mary True female 1
4 mary NaN female 10
5 mary True female 15
6 john NaN male 2
7 mark True male 3
Expected Output:
name status gender leaves
0 tom True male 6
1 tom True male 7
2 mary True female 1
3 mary NaN female 10
4 mary True female 15
5 mark True male 3

You can use the isna (or isnull) function to get the rows with a value of NaN.
With this knowledge, you can filter your dataframe using something like:
conditions = (df.gender == 'male')&(df.status.isna())
filtered_df = df[~conditions]
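For completeness, here is a minimal runnable sketch that rebuilds the sample data from the question and applies that filter (nothing beyond what the question already shows):
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'name':   ['tom', 'tom', 'tom', 'mary', 'mary', 'mary', 'john', 'mark'],
    'status': [np.nan, True, True, True, np.nan, True, np.nan, True],
    'gender': ['male', 'male', 'male', 'female', 'female', 'female', 'male', 'male'],
    'leaves': [5, 6, 7, 1, 10, 15, 2, 3],
})
# Drop rows that are male AND have a missing status
conditions = (df.gender == 'male') & (df.status.isna())
filtered_df = df[~conditions].reset_index(drop=True)
print(filtered_df)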

Good one given by @Derlin. Another way I tried is using fillna() to fill the NaN values with -1 and then filter on them, like below:
>>> df[~((df.fillna(-1)['status']==-1)&(df['gender']=='male'))]
Just for reference, the ~ operator is the same as numpy's np.logical_not(). So this:
df[np.logical_not((df.fillna(-1)['status']==-1)&(df['gender']=='male'))]
means the same thing (don't forget to import numpy as np).
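A quick sanity check of that equivalence on a toy mask (hypothetical data, just to illustrate):
import pandas as pd
import numpy as np
s = pd.Series([np.nan, True, True, np.nan])        # toy 'status' column
g = pd.Series(['male', 'male', 'female', 'male'])  # toy 'gender' column
mask = (s.fillna(-1) == -1) & (g == 'male')
print((~mask).equals(np.logical_not(mask)))        # True: both invert the boolean mask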

Related

Update column values to reflect other column values having two different size dataframes

I have the following two data frames.
df_1:
unique_id amount
1 NaN
2 5
df_2:
unique_id amount city email
1 90 Kansas True
2 100 Miami False
3 NaN Kent True
4 123 Newport True
I would like to only update the amount column where unique_id is 1 or 2, or any other rows that match on unique_id. The output should be:
unique_id amount city email
1 NaN Kansas True
2 5 Miami False
3 NaN Kent True
4 123 Newport True
I've tried merging and concatenating but I am not getting the desired result. I just want an idea of what the best approach is when two data frames are of different sizes and I want to update certain column values. Any guidance is greatly appreciated.
Try with mask:
df_2['amount'] = df_2['amount'].mask(df_2['unique_id'].isin(df_1['unique_id']),
                                     df_2['unique_id'].map(df_1.set_index('unique_id')['amount']))
Output:
unique_id amount city email
0 1 NaN Kansas True
1 2 5.0 Miami False
2 3 NaN Kent True
3 4 123.0 Newport True
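A self-contained sketch of the same idea, with the two frames rebuilt from the sample data in the question:
import pandas as pd
import numpy as np
df_1 = pd.DataFrame({'unique_id': [1, 2], 'amount': [np.nan, 5]})
df_2 = pd.DataFrame({'unique_id': [1, 2, 3, 4],
                     'amount': [90, 100, np.nan, 123],
                     'city': ['Kansas', 'Miami', 'Kent', 'Newport'],
                     'email': [True, False, True, True]})
# Where unique_id exists in df_1, overwrite amount with the value mapped from df_1
df_2['amount'] = df_2['amount'].mask(df_2['unique_id'].isin(df_1['unique_id']),
                                     df_2['unique_id'].map(df_1.set_index('unique_id')['amount']))
print(df_2)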

Pandas comparing dataframes and changing column value based on number of similar rows in another dataframe

Suppose I have two dataframes:
df1:
Person Number Type
0 Kyle 12 Male
1 Jacob 15 Male
2 Jacob 15 Male
df2:
A much larger dataset with a similar format, except there is a Count column that needs to be incremented based on df1:
Person Number Type Count
0 Kyle 12 Male 0
1 Jacob 15 Male 0
3 Sally 43 Female 0
4 Mary 15 Female 5
What I am looking to do is increase the count column based on the number of occurrences of the same person in df1
Expected output for this example:
Person Number Type Count
0 Kyle 12 Male 1
1 Jacob 15 Male 2
3 Sally 43 Female 0
4 Mary 15 Female 5
Increase Count to 1 for Kyle because there is one instance, and to 2 for Jacob because there are two instances. Don't change the values for Sally and Mary; keep them the same.
How do I do this? I have tried using .loc but I can't figure out how to account for two instances of the same row, meaning that I can only get Count to increase by one for Jacob even though there are two Jacobs in df1.
I have tried
df2.loc[df2['Person'].values == df1['Person'].values, 'Count'] += 1
However this does not account for duplicates.
df1 = df1.groupby(df1.columns.tolist()).size().to_frame('Count').reset_index()
df1 = df1.set_index(['Person','Number','Type'])
df2 = df2.set_index(['Person','Number','Type'])
df1.add(df2, fill_value=0).reset_index()
Or
df1 = df1.groupby(df1.columns.tolist()).size().to_frame('Count').reset_index()
df2.merge(df1, on=['Person','Number','Type'], how='left').set_index(['Person','Number','Type']).sum(axis=1).to_frame('Count').reset_index()
value_counts + Index alignment.
u = df2.set_index("Person")
u.assign(Count=df1["Person"].value_counts().add(u["Count"], fill_value=0))
Number Type Count
Person
Kyle 12 Male 1.0
Jacob 15 Male 2.0
Sally 43 Female 0.0
Mary 15 Female 5.0
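For a copy-paste run, here is the same idea with the frames from the question built inline (a sketch; column names exactly as in the question):
import pandas as pd
df1 = pd.DataFrame({'Person': ['Kyle', 'Jacob', 'Jacob'],
                    'Number': [12, 15, 15],
                    'Type': ['Male', 'Male', 'Male']})
df2 = pd.DataFrame({'Person': ['Kyle', 'Jacob', 'Sally', 'Mary'],
                    'Number': [12, 15, 43, 15],
                    'Type': ['Male', 'Male', 'Female', 'Female'],
                    'Count': [0, 0, 0, 5]})
# Count occurrences per Person in df1 and add them to df2's existing Count
u = df2.set_index('Person')
result = u.assign(Count=df1['Person'].value_counts().add(u['Count'], fill_value=0))
print(result.reset_index())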

Could not convert string to float error from the Titanic competition

I'm trying to solve the Titanic survival problem from Kaggle. It's my first step in actually learning Machine Learning. I have a problem where the gender column causes an error. The stack trace says could not convert string to float: 'female'. How did you guys deal with this issue? I don't want solutions. I just want a practical approach to this problem because I do need the gender column to build my model.
This is my code:
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
train_path = "C:\\Users\\Omar\\Downloads\\Titanic Data\\train.csv"
train_data = pd.read_csv(train_path)
columns_of_interest = ['Survived','Pclass', 'Sex', 'Age']
filtered_titanic_data = train_data.dropna(axis=0)
x = filtered_titanic_data[columns_of_interest]
y = filtered_titanic_data.Survived
train_x, val_x, train_y, val_y = train_test_split(x, y, random_state=0)
titanic_model = DecisionTreeRegressor()
titanic_model.fit(train_x, train_y)
val_predictions = titanic_model.predict(val_x)
print(filtered_titanic_data)
There are a couple ways to deal with this, and it kind of depends what you're looking for:
You could encode your categories to numeric values, i.e. transform each level of your category to a distinct number,
or
dummy code your category, i.e. turn each level of your category into a separate column, which gets a value of 0 or 1.
In lots of machine learning applications, factors are better handled as dummy codes.
Note that in the case of a 2-level category, encoding to numeric according to the methods outlined below is essentially equivalent to dummy coding: all the values that are not level 0 are necessarily level 1. In fact, in the dummy code example I've given below, there is redundant information, as I've given each of the 2 classes its own column. It's just to illustrate the concept. Typically, one would only create n-1 columns, where n is the number of levels, and the omitted level is implied (i.e. make a column for Female, and all the 0 values are implied to be Male).
Encoding Categories to numeric:
Method 1: pd.factorize
pd.factorize is a simple, fast way of encoding to numeric:
For example, if your column gender looks like this:
>>> df
gender
0 Female
1 Male
2 Male
3 Male
4 Female
5 Female
6 Male
7 Female
8 Female
9 Female
df['gender_factor'] = pd.factorize(df.gender)[0]
>>> df
gender gender_factor
0 Female 0
1 Male 1
2 Male 1
3 Male 1
4 Female 0
5 Female 0
6 Male 1
7 Female 0
8 Female 0
9 Female 0
Method 2: categorical dtype
Another way would be to use category dtype:
df['gender_factor'] = df['gender'].astype('category').cat.codes
This would result in the same output.
Method 3: sklearn.preprocessing.LabelEncoder()
This method comes with some bonuses, such as easy back transforming:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
# Transform the gender column
df['gender_factor'] = le.fit_transform(df.gender)
>>> df
gender gender_factor
0 Female 0
1 Male 1
2 Male 1
3 Male 1
4 Female 0
5 Female 0
6 Male 1
7 Female 0
8 Female 0
9 Female 0
# Easy to back transform:
df['gender_factor'] = le.inverse_transform(df.gender_factor)
>>> df
gender gender_factor
0 Female Female
1 Male Male
2 Male Male
3 Male Male
4 Female Female
5 Female Female
6 Male Male
7 Female Female
8 Female Female
9 Female Female
Dummy Coding:
Method 1: pd.get_dummies
df.join(pd.get_dummies(df.gender))
gender Female Male
0 Female 1 0
1 Male 0 1
2 Male 0 1
3 Male 0 1
4 Female 1 0
5 Female 1 0
6 Male 0 1
7 Female 1 0
8 Female 1 0
9 Female 1 0
Note, if you want to omit one column to get a non-redundant dummy code (see my note at the beginning of this answer), you can use:
df.join(pd.get_dummies(df.gender, drop_first=True))
gender Male
0 Female 0
1 Male 1
2 Male 1
3 Male 1
4 Female 0
5 Female 0
6 Male 1
7 Female 0
8 Female 0
9 Female 0
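Applied back to the Titanic code in the question, a hedged sketch of the dummy-coding route could look like this (column names as in the Kaggle train.csv; I also drop the target from the features and limit dropna to the columns of interest, so adjust as you see fit):
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
train_data = pd.read_csv(train_path)  # train_path as defined in the question
columns_of_interest = ['Survived', 'Pclass', 'Sex', 'Age']
filtered = train_data[columns_of_interest].dropna(axis=0)
# One-hot encode the categorical 'Sex' column; drop_first avoids the redundant column
x = pd.get_dummies(filtered.drop(columns='Survived'), columns=['Sex'], drop_first=True)
y = filtered.Survived
train_x, val_x, train_y, val_y = train_test_split(x, y, random_state=0)
titanic_model = DecisionTreeRegressor()
titanic_model.fit(train_x, train_y)
val_predictions = titanic_model.predict(val_x)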

Pandas Fillna with the MAX value_counts of each group

There are two columns in the DataFrame named "country" & "taster_name". The "taster_name" column has some missing values in it. I want to fillna the missing values with the MAX VALUE_COUNTS of the taster_name of each country (depending on which country the missing value belongs to). I don't know how I can do it.
From the code below, we can check the MAX VALUE_COUNTS of the taster_name of each country.
wine[['country','taster_name']].groupby('country').taster_name.value_counts()
Try this:
df.groupby('country')['taster_name'].apply(lambda x: x.fillna(x.value_counts().index.tolist()[0]))
As you didn't provide sample data, I created some myself.
Sample Input:
country taster_name
0 A abraham
1 B silva
2 A abraham
3 A NaN
4 B NaN
5 C john
6 C NaN
7 C john
8 C jacob
9 A NaN
10 B silva
11 A william
Output:
country taster_name
0 A abraham
1 B silva
2 A abraham
3 A abraham
4 B silva
5 C john
6 C john
7 C john
8 C jacob
9 A abraham
10 B silva
11 A william
Explanation:
Group by country and fill the NaN values using value_counts. By default value_counts sorts in descending order, so you can take the first element and use it to fill the NaN.
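For reference, an equivalent fill using transform keeps the result aligned with the original index; a sketch assuming no country group is entirely NaN:
df['taster_name'] = df.groupby('country')['taster_name'].transform(
    lambda s: s.fillna(s.value_counts().idxmax())
)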

How to do complex data cleaning in pandas

For example, I have a DataFrame as following.
lineNum id name Cname score
1 001 Jack Math 99
2 002 Jack English 110
3 003 Jack Chinese 90
4 003 Jack Chinese 90
5 004 Tom Math Nan
6 005 Tom English 75
7 006 Tom Chinese 85
As you can see, I want to do some data cleaning on this data.
1) Delete the duplicate rows, which are line 3 and line 4.
2) Deal with the unreasonable value. In line 2, Jack's English score is 110, which is over the maximum value of 100. I want to set his score to the mean of all students' English scores.
3) Deal with the NaN value. Tom's Math score is NaN. I want to change it to the mean of all students' Math scores.
I can do each requirement separately, but I don't know how to do all three together. Thanks!
plan
I drop duplicates to start.
use mask to make scores greater than 100 null
filter new dataframe and group by with mean
map means and use it to fill nulls
d = df.drop_duplicates(['id', 'name', 'Cname'])
s0 = d.score
s1 = s0.mask(s0 > 100)
m = s1.notnull()
d.assign(score=s1.fillna(d.Cname.map(d[m].groupby('Cname').score.mean())))
lineNum id name Cname score
0 1 1 Jack Math 99.0
1 2 2 Jack English 75.0
2 3 3 Jack Chinese 90.0
4 5 4 Tom Math 99.0
5 6 5 Tom English 75.0
6 7 6 Tom Chinese 85.0
You can use:
import numpy as np
cols = ['id','name','Cname','score']
#remove duplicates by columns
df = df.drop_duplicates(subset=cols)
#replace values > 100 to NaN
df.loc[df['score'] > 100, 'score'] = np.nan
#replace NaN by mean for all students by subject
df['score'] = df.groupby('Cname')['score'].transform(lambda x: x.fillna(x.mean()))
print (df)
lineNum id name Cname score
0 1 1 Jack Math 99.0
1 2 2 Jack English 75.0
2 3 3 Jack Chinese 90.0
4 5 4 Tom Math 99.0
5 6 5 Tom English 75.0
6 7 6 Tom Chinese 85.0
Alternative solution with mask for NaN:
cols = ['id','name','Cname','score']
df = df.drop_duplicates(subset=cols)
df['score'] = df['score'].mask(df['score'] > 100)
df['score'] = df.groupby('Cname')['score'].apply(lambda x: x.fillna(x.mean()))
print (df)
lineNum id name Cname score
0 1 1 Jack Math 99.0
1 2 2 Jack English 75.0
2 3 3 Jack Chinese 90.0
4 5 4 Tom Math 99.0
5 6 5 Tom English 75.0
6 7 6 Tom Chinese 85.0
You should consider .apply(func) if the data is not too big.
import pandas as pd
df = pd.read_table('sample.txt', delimiter=r'\s+', na_values='Nan')  # Your sample data
df = df.set_index('lineNum').drop_duplicates()
def deal_with(x):
    if (x['score'] > 100.) or (pd.isnull(x['score'])):
        df_ = df[df['id'] != x['id']]
        x['score'] = df_.loc[df_['Cname'] == x['Cname'], 'score'].mean()
    return x
print(df.apply(deal_with, axis=1))
id name Cname score
lineNum
1 1 Jack Math 99.0
2 2 Jack English 75.0
3 3 Jack Chinese 90.0
5 4 Tom Math 99.0
6 5 Tom English 75.0
7 6 Tom Chinese 85.0
