Combining is in and where

How can I create a new column based on an odd/even flag in pandas?
This is my data:
id Flag
001 1
002 2
003 3
004 4
I would like to have this output: if Flag is an even number then Female, if Flag is an odd number then Male:
id Flag Gender
001 1 Male
002 2 Female
003 3 Male
004 4 Female

Use numpy.where with modulo 2 to check for even and odd numbers:
df['Gender'] = np.where(df['Flag'] % 2,'Male','Female')
print (df)
id Flag Gender
0 1 1 Male
1 2 2 Female
2 3 3 Male
3 4 4 Female

Try apply:
Id =['001','002','003','004']
Flag=[1,2,3,4]
df=pd.DataFrame({'id':Id,'flag':Flag})
df['gender']=df['flag'].apply(lambda x: 'Male' if x%2 else 'Female')
output:
id flag gender
0 001 1 Male
1 002 2 Female
2 003 3 Male
3 004 4 Female
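For reference, here is a self-contained sketch of the same idea. The explicit `== 1` comparison and the map variant (with the column name Gender_map) are additions for illustration and are not part of the answers above.

```python
import numpy as np
import pandas as pd

# Sample data from the question; ids are kept as strings to preserve the leading zeros.
df = pd.DataFrame({'id': ['001', '002', '003', '004'], 'Flag': [1, 2, 3, 4]})

# Vectorised: odd flags (remainder 1) become 'Male', even flags become 'Female'.
df['Gender'] = np.where(df['Flag'] % 2 == 1, 'Male', 'Female')

# Equivalent lookup-table variant (illustrative column name).
df['Gender_map'] = (df['Flag'] % 2).map({1: 'Male', 0: 'Female'})

print(df)
```

On large frames the vectorised forms are usually preferable to apply with a lambda, which calls a Python function once per row.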

Related

How to add multiple calculated columns to dataframe at once

I have a dataframe as below:
Slno Name_x Age_x Sex_x Name_y Age_y Sex_y
0 1 A 27 Male A 32 Male
1 2 B 28 Female B 28 Female
2 3 C 8 Female C 1 Female
3 4 D 28 Male D 72 Male
4 5 E 25 Female E 64 Female
I need to create calculated columns that check whether the ages match (difference of zero) and whether the gender matches. To do this in one go I am using:
DF3.loc[:,["Gendermatch","Agematch"]]= pd.DataFrame([np.where(DF3["Name_x"]==DF3["Name_y"],True,False),np.where(DF3["Age_x"]-DF3["Age_y"]==0,True,False)])
and the resulting dataframe looks like this:
Slno Name_x Age_x Sex_x Name_y Age_y Sex_y Gendermatch Agematch
0 1 A 27 Male A 32 Male NaN NaN
1 2 B 28 Female B 28 Female NaN NaN
2 3 C 8 Female C 1 Female NaN NaN
3 4 D 28 Male D 72 Male NaN NaN
4 5 E 25 Female E 64 Female NaN NaN
The resulting columns show NaN. What am I doing wrong here?

DF3[["Gendermatch","Agematch"]]= np.where(DF3["Name_x"]==DF3["Name_y"],True,False),np.where(DF3["Age_x"]-DF3["Age_y"]==0,True,False)

DF3[["Gendermatch","Agematch"]] = pd.DataFrame([np.where(DF3["Name_x"]==DF3["Name_y"],True,False),np.where(DF3["Age_x"]-DF3["Age_y"]==0,True,False)]).T

np.where is unnecessary here: a Series comparison already returns a boolean Series.
DF3["Gendermatch"] = DF3["Name_x"]==DF3["Name_y"]
DF3["Agematch"] = DF3["Age_x"]==DF3["Age_y"]
# or in one line
DF3["Gendermatch"], DF3["Agematch"] = (DF3["Name_x"]==DF3["Name_y"]), (DF3["Age_x"]==DF3["Age_y"])
print(DF3)
Slno Name_x Age_x Sex_x Name_y Age_y Sex_y Gendermatch Agematch
0 1 A 27 Male A 32 Male True False
1 2 B 28 Female B 28 Female True True
2 3 C 8 Female C 1 Female True False
3 4 D 28 Male D 72 Male True False
4 5 E 25 Female E 64 Female True False
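A short sketch of why the original attempt produced NaN, and of one way to add both columns in a single call. It assumes the sample frame from the question; the use of assign is a suggestion, not something taken from the answers above. Like the question, it compares the Name columns for Gendermatch; swap in Sex_x and Sex_y if that is what is actually meant.

```python
import pandas as pd

DF3 = pd.DataFrame({
    'Slno': [1, 2, 3, 4, 5],
    'Name_x': list('ABCDE'),
    'Age_x': [27, 28, 8, 28, 25],
    'Sex_x': ['Male', 'Female', 'Female', 'Male', 'Female'],
    'Name_y': list('ABCDE'),
    'Age_y': [32, 28, 1, 72, 64],
    'Sex_y': ['Male', 'Female', 'Female', 'Male', 'Female'],
})

name_match = (DF3['Name_x'] == DF3['Name_y']).to_numpy()
age_match = (DF3['Age_x'] == DF3['Age_y']).to_numpy()

# A DataFrame built from a list of two arrays has 2 rows and 5 columns labelled 0..4,
# so none of its labels line up with "Gendermatch"/"Agematch" and the assignment yields NaN.
wrong_shape = pd.DataFrame([name_match, age_match]).shape  # (2, 5)

# assign adds both boolean columns at once and returns a new frame.
result = DF3.assign(Gendermatch=name_match, Agematch=age_match)
print(result)
```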

Pandas: Group by two parameters and sort by third parameter

I want to group my dataframe by two columns (Name and Budget) and then sort the aggregated results by a third parameter (Prio).
Name Budget Prio Quantity
peter A 2 12
B 1 123
joe A 3 34
B 1 51
C 2 43
I already checked this post, which was very helpful and leads to the following output. However, I cannot manage sorting by the third parameter (Prio).
df_agg = df.groupby(['Name','Budget','Prio']).agg({'Quantity':sum})
g = df_agg['Quantity'].groupby(level=0, group_keys=False)
res = g.apply(lambda x: x.sort_values(ascending=True))
I would now like to sort the prio in ascending order within each of the groups. To get something like:
Name Budget Prio Quantity
peter B 1 123
A 2 12
joe B 1 51
C 2 43
A 3 34

IIUC,
df.groupby(['Name','Budget','Prio']).agg({'Quantity':'sum'}).sort_values(['Name','Prio'])
Output:
Quantity
Name Budget Prio
joe B 1 51
C 2 43
A 3 34
peter B 1 123
A 2 12

If you only want to sort by Prio within each Name, you can use sort_index:
(df.groupby(['Name','Budget','Prio'])
   .agg({'Quantity':'sum'})
   .sort_index(level=['Name', 'Prio'],
               ascending=[False, True])
)
Output:
Quantity
Name Budget Prio
peter B 1 123
A 2 12
joe B 1 51
C 2 43
A 3 34
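A runnable sketch of the same result using plain columns instead of index levels. It assumes the blank Name cells in the question stand for the name above them, and it swaps in sort_values on a flat frame, which some may find easier to reason about than sorting a MultiIndex.

```python
import pandas as pd

# Sample data from the question, with the repeated names filled in (assumption).
df = pd.DataFrame({
    'Name': ['peter', 'peter', 'joe', 'joe', 'joe'],
    'Budget': ['A', 'B', 'A', 'B', 'C'],
    'Prio': [2, 1, 3, 1, 2],
    'Quantity': [12, 123, 34, 51, 43],
})

# Aggregate while keeping the group keys as ordinary columns.
agg = df.groupby(['Name', 'Budget', 'Prio'], as_index=False)['Quantity'].sum()

# Name descending (peter before joe), Prio ascending within each name.
res = (agg.sort_values(['Name', 'Prio'], ascending=[False, True])
          .set_index(['Name', 'Budget', 'Prio']))
print(res)
```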

pandas: transform based on count of row value in another dataframe

I have two dataframes:
df1:
Gender Registered
female 1
male 0
female 0
female 1
male 1
male 0
df2:
Gender
female
female
male
male
I want to modify df2, so that there is a new column 'Count' with the count of registered = 1 for corresponding gender values from df1. For example, in df1 there are 2 registered females and 1 registered male. I want to transform the df2 so that the output is as follows:
output:
Gender Count
female 2
female 2
male 1
male 1
I tried many things and got close but couldn't make it fully work.

sum + map:
v = df1.groupby('Gender').Registered.sum()
df2.assign(Count=df2.Gender.map(v))
Gender Count
0 female 2
1 female 2
2 male 1
3 male 1

pd.merge
pd.merge(df2, df1.groupby('Gender', as_index=False)['Registered'].sum().rename(columns={'Registered': 'Count'}))
Gender Count
0 female 2
1 female 2
2 male 1
3 male 1
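Both approaches can be checked end to end with the data from the question. This is only a sketch; the intermediate names counts, df2_map and df2_merge are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Gender': ['female', 'male', 'female', 'female', 'male', 'male'],
    'Registered': [1, 0, 0, 1, 1, 0],
})
df2 = pd.DataFrame({'Gender': ['female', 'female', 'male', 'male']})

# Registered is 0/1, so the per-gender sum is the count of registered rows.
counts = df1.groupby('Gender')['Registered'].sum()

# Variant 1: look up each gender in the counts Series.
df2_map = df2.assign(Count=df2['Gender'].map(counts))

# Variant 2: left merge keeps every row of df2 in its original order.
df2_merge = df2.merge(counts.rename('Count').reset_index(), on='Gender', how='left')

print(df2_map)
```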

Could not convert string to float error from the Titanic competition

I'm trying to solve the Titanic survival program from Kaggle. It's my first step in actually learning Machine Learning. I have a problem where the gender column causes an error. The stacktrace says could not convert string to float: 'female'. How did you guys come across this issue? I don't want solutions. I just want a practical approach to this problem because I do need the gender column to build my model.
This is my code:
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
train_path = "C:\\Users\\Omar\\Downloads\\Titanic Data\\train.csv"
train_data = pd.read_csv(train_path)
columns_of_interest = ['Survived','Pclass', 'Sex', 'Age']
filtered_titanic_data = train_data.dropna(axis=0)
x = filtered_titanic_data[columns_of_interest]
y = filtered_titanic_data.Survived
train_x, val_x, train_y, val_y = train_test_split(x, y, random_state=0)
titanic_model = DecisionTreeRegressor()
titanic_model.fit(train_x, train_y)
val_predictions = titanic_model.predict(val_x)
print(filtered_titanic_data)

There are a couple of ways to deal with this, and it depends on what you're looking for:
You could encode your categories to numeric values, i.e. transform each level of your category to a distinct number,
or
dummy code your category, i.e. turn each level of your category into a separate column, which gets a value of 0 or 1.
In many machine learning applications, factors are better handled as dummy codes, because plain integer codes can suggest an ordering between levels that does not exist.
Note that in the case of a 2-level category, encoding to numeric according to the methods outlined below is essentially equivalent to dummy coding: all the values that are not level 0 are necessarily level 1. In fact, in the dummy code example I've given below, there is redundant information, as I've given each of the 2 classes its own column. It's just to illustrate the concept. Typically, one would only create n-1 columns, where n is the number of levels, and the omitted level is implied (i.e. make a column for Female, and all the 0 values are implied to be Male).
Encoding Categories to numeric:
Method 1: pd.factorize
pd.factorize is a simple, fast way of encoding to numeric:
For example, if your column gender looks like this:
>>> df
gender
0 Female
1 Male
2 Male
3 Male
4 Female
5 Female
6 Male
7 Female
8 Female
9 Female
df['gender_factor'] = pd.factorize(df.gender)[0]
>>> df
gender gender_factor
0 Female 0
1 Male 1
2 Male 1
3 Male 1
4 Female 0
5 Female 0
6 Male 1
7 Female 0
8 Female 0
9 Female 0
Method 2: categorical dtype
Another way would be to use category dtype:
df['gender_factor'] = df['gender'].astype('category').cat.codes
This would result in the same output here, because 'Female' both appears first and sorts first.
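The two methods do not always agree on which level gets which number: pd.factorize numbers levels in order of first appearance, while cat.codes follows the sorted order of the categories. A small sketch on a made-up column that starts with 'Male':

```python
import pandas as pd

# Illustrative column whose first value is not the alphabetically first level.
gender = pd.Series(['Male', 'Female', 'Male', 'Female'])

# Order of first appearance: 'Male' -> 0, 'Female' -> 1.
factorize_codes = pd.factorize(gender)[0].tolist()          # [0, 1, 0, 1]

# Sorted categories: 'Female' -> 0, 'Male' -> 1.
category_codes = gender.astype('category').cat.codes.tolist()  # [1, 0, 1, 0]

print(factorize_codes, category_codes)
```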
Method 3: sklearn.preprocessing.LabelEncoder()
This method comes with some bonuses, such as easy back transforming:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
# Transform the gender column
df['gender_factor'] = le.fit_transform(df.gender)
>>> df
gender gender_factor
0 Female 0
1 Male 1
2 Male 1
3 Male 1
4 Female 0
5 Female 0
6 Male 1
7 Female 0
8 Female 0
9 Female 0
# Easy to back transform:
df['gender_factor'] = le.inverse_transform(df.gender_factor)
>>> df
gender gender_factor
0 Female Female
1 Male Male
2 Male Male
3 Male Male
4 Female Female
5 Female Female
6 Male Male
7 Female Female
8 Female Female
9 Female Female
Dummy Coding:
Method 1: pd.get_dummies
df.join(pd.get_dummies(df.gender, dtype=int))  # dtype=int gives 0/1; newer pandas defaults to True/False
gender Female Male
0 Female 1 0
1 Male 0 1
2 Male 0 1
3 Male 0 1
4 Female 1 0
5 Female 1 0
6 Male 0 1
7 Female 1 0
8 Female 1 0
9 Female 1 0
Note, if you want to omit one column to get a non-redundant dummy code (see my note at the beginning of this answer), you can use:
df.join(pd.get_dummies(df.gender, drop_first=True, dtype=int))
gender Male
0 Female 0
1 Male 1
2 Male 1
3 Male 1
4 Female 0
5 Female 0
6 Male 1
7 Female 0
8 Female 0
9 Female 0
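Applied to the question, a minimal sketch could look like the following. The four rows are invented stand-ins for train.csv, and Survived is kept out of the feature columns since it is the target. The conversion to a float array stands in for what the model does internally when it is fitted.

```python
import pandas as pd

# Hypothetical miniature of the Titanic training data (values invented for illustration).
train = pd.DataFrame({
    'Survived': [0, 1, 1, 0],
    'Pclass': [3, 1, 3, 2],
    'Sex': ['male', 'female', 'female', 'male'],
    'Age': [22.0, 38.0, 26.0, 35.0],
})

features = train[['Pclass', 'Sex', 'Age']]  # the target is not part of the features
y = train['Survived']

# The raw features cannot be turned into floats because of the string column.
try:
    features.to_numpy(dtype=float)
    raw_converts = True
except (ValueError, TypeError):
    raw_converts = False

# Dummy-code 'Sex' into one 0/1 column (Sex_male); 'female' is the implied level.
encoded = pd.get_dummies(features, columns=['Sex'], drop_first=True, dtype=int)
X = encoded.to_numpy(dtype=float)  # now purely numeric

print(encoded)
```

The encoded frame (or X) can then be passed to train_test_split and to the model's fit method in place of the original columns.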

How to make complex data cleaning in pandas

For example, I have the following DataFrame:
lineNum id name Cname score
1 001 Jack Math 99
2 002 Jack English 110
3 003 Jack Chinese 90
4 003 Jack Chinese 90
5 004 Tom Math Nan
6 005 Tom English 75
7 006 Tom Chinese 85
As you can see, I want to clean this data:
1) Delete the duplicated row (lines 3 and 4 are duplicates).
2) Deal with the unreasonable value. In line 2, Jack's English score is 110, which is over the maximum of 100. I want to set his score to the mean of all students' English scores.
3) Deal with the Nan value. Tom's Math score is Nan. I want to change it to the mean of all students' Math scores.
I can do each requirement separately, but I don't know how to do all three together. Thanks!

Plan:
drop duplicates to start
use mask to make scores greater than 100 null
filter the new dataframe and group by Cname to get the mean
map the means and use them to fill nulls
d = df.drop_duplicates(['id', 'name', 'Cname'])
s0 = d.score
s1 = s0.mask(s0 > 100)
m = s1.notnull()
d.assign(score=s1.fillna(d.Cname.map(d[m].groupby('Cname').score.mean())))
lineNum id name Cname score
0 1 1 Jack Math 99.0
1 2 2 Jack English 75.0
2 3 3 Jack Chinese 90.0
4 5 4 Tom Math 99.0
5 6 5 Tom English 75.0
6 7 6 Tom Chinese 85.0

You can use:
cols = ['id','name','Cname','score']
#remove duplicates by columns
df = df.drop_duplicates(subset=cols)
#replace values > 100 to NaN
df.loc[df['score'] > 100, 'score'] = np.nan
#replace NaN by mean for all students by subject
df['score'] = df.groupby('Cname')['score'].transform(lambda x: x.fillna(x.mean()))
print (df)
lineNum id name Cname score
0 1 1 Jack Math 99.0
1 2 2 Jack English 75.0
2 3 3 Jack Chinese 90.0
4 5 4 Tom Math 99.0
5 6 5 Tom English 75.0
6 7 6 Tom Chinese 85.0
Alternative solution with mask for NaN:
cols = ['id','name','Cname','score']
df = df.drop_duplicates(subset=cols)
df['score'] = df['score'].mask(df['score'] > 100)
df['score'] = df.groupby('Cname', group_keys=False)['score'].apply(lambda x: x.fillna(x.mean()))
print (df)
lineNum id name Cname score
0 1 1 Jack Math 99.0
1 2 2 Jack English 75.0
2 3 3 Jack Chinese 90.0
4 5 4 Tom Math 99.0
5 6 5 Tom English 75.0
6 7 6 Tom Chinese 85.0
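The three steps can also be wrapped in one function. This is a sketch under the assumption that score is already numeric (the Nan entry read as a missing value); the function name clean_scores and the max_score parameter are illustrative.

```python
import numpy as np
import pandas as pd

# Sample data from the question, with the missing score as NaN.
df = pd.DataFrame({
    'lineNum': [1, 2, 3, 4, 5, 6, 7],
    'id': [1, 2, 3, 3, 4, 5, 6],
    'name': ['Jack'] * 4 + ['Tom'] * 3,
    'Cname': ['Math', 'English', 'Chinese', 'Chinese', 'Math', 'English', 'Chinese'],
    'score': [99, 110, 90, 90, np.nan, 75, 85],
})

def clean_scores(frame, max_score=100):
    # 1) Drop rows that repeat the same id/name/subject/score.
    out = frame.drop_duplicates(subset=['id', 'name', 'Cname', 'score']).copy()
    # 2) Keep only scores within range; out-of-range values become NaN.
    out['score'] = out['score'].where(out['score'] <= max_score)
    # 3) Fill every NaN with the mean of the valid scores for that subject.
    subject_mean = out.groupby('Cname')['score'].transform('mean')
    out['score'] = out['score'].fillna(subject_mean)
    return out

cleaned = clean_scores(df)
print(cleaned)
```

Masking the out-of-range values before computing the means matters: otherwise the 110 would be included in the English mean.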

You should consider `.apply(func)` if the data is not too big.
import pandas as pd
df = pd.read_table('sample.txt', delimiter=r'\s+', na_values='Nan') # Your sample data
df = df.set_index('lineNum').drop_duplicates()
def deal_with(x):
    # Replace an out-of-range or missing score with the subject mean of the other rows
    if (x['score'] > 100.) or (pd.isnull(x['score'])):
        df_ = df[df['id'] != x['id']]
        x['score'] = df_.loc[df_['Cname'] == x['Cname'], 'score'].mean()
    return x
print(df.apply(deal_with, axis=1))
id name Cname score
lineNum
1 1 Jack Math 99.0
2 2 Jack English 75.0
3 3 Jack Chinese 90.0
5 4 Tom Math 99.0
6 5 Tom English 75.0
7 6 Tom Chinese 85.0
