How to change categorical column values? - python

In my data frame I have a column 'countries', and I am trying to change its values to 'developed' and 'developing'. My data frame is as follows:
countries age gender
1 India 21 Male
2 China 22 Female
3 USA 23 Male
4 UK 25 Male
I have the following two lists:
developed = ['USA','UK']
developing = ['India', 'China']
I want to convert the column so the data frame looks like this:
countries age gender
1 developing 21 Male
2 developing 22 Female
3 developed 23 Male
4 developed 25 Male
I tried the following code, but I got a SettingWithCopyWarning:
df[df['countries'].isin(developed)]['countries'] = 'developed'
I also tried the following code, but I got the same SettingWithCopyWarning and my Jupyter notebook hung:
for i, x in enumerate(df['countries']):
    if x in developed:
        df['countries'][i] = 'developed'
Is there an alternative way to change the column categories?

Use np.where:
import numpy as np
df['countries'] = np.where(df['countries'].isin(developed), 'developed', 'developing')
print(df)
countries age gender
1 developing 21 Male
2 developing 22 Female
3 developed 23 Male
4 developed 25 Male
You can also use DataFrame.loc:
c = df['countries'].isin(developed)
df.loc[c, 'countries'] = 'developed'
df.loc[~c, 'countries'] = 'developing'
print(df)
countries age gender
1 developing 21 Male
2 developing 22 Female
3 developed 23 Male
4 developed 25 Male
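If there were more than two groups, np.select is one way to generalize the np.where idea. A minimal sketch (the sample frame and group lists are rebuilt here so the snippet runs on its own):
import numpy as np
import pandas as pd

# rebuild the question's frame so the snippet runs on its own
df = pd.DataFrame({'countries': ['India', 'China', 'USA', 'UK'],
                   'age': [21, 22, 23, 25],
                   'gender': ['Male', 'Female', 'Male', 'Male']},
                  index=[1, 2, 3, 4])

developed = ['USA', 'UK']
developing = ['India', 'China']

# conditions are checked in order; the first match wins,
# and the default is used when nothing matches
conditions = [df['countries'].isin(developed), df['countries'].isin(developing)]
choices = ['developed', 'developing']
df['countries'] = np.select(conditions, choices, default='other')
print(df)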

You can also use the replace method; it won't raise that warning:
Updated_DataSet1 = data_set.replace("India", "developing")
Updated_DataSet2 = Updated_DataSet1.replace("China", "developing")
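A single mapping dict does the same job without chaining replace calls. A minimal sketch that rebuilds the question's sample frame so it runs on its own:
import pandas as pd

# rebuild the question's frame so the snippet runs on its own
data_set = pd.DataFrame({'countries': ['India', 'China', 'USA', 'UK'],
                         'age': [21, 22, 23, 25],
                         'gender': ['Male', 'Female', 'Male', 'Male']})

# one mapping dict instead of chained replace calls
mapping = {'India': 'developing', 'China': 'developing',
           'USA': 'developed', 'UK': 'developed'}
# note: with map, any country missing from the dict would become NaN
data_set['countries'] = data_set['countries'].map(mapping)
print(data_set)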


Add a column in pandas based on sum of the subgroup values in another column [duplicate]

Here is a simplified version of my dataframe (the number of persons in my dataframe is way more than 3):
df = pd.DataFrame({'Person': ['John', 'David', 'Mary', 'John', 'David', 'Mary'],
                   'Sales': [10, 15, 20, 11, 12, 18]})
Person Sales
0 John 10
1 David 15
2 Mary 20
3 John 11
4 David 12
5 Mary 18
I would like to add a column "Total" to this data frame, holding the total sales per person:
Person Sales Total
0 John 10 21
1 David 15 27
2 Mary 20 38
3 John 11 21
4 David 12 27
5 Mary 18 38
What would be the easiest way to achieve this?
I have tried
df.groupby('Person').sum()
but the shape of the output is not congruent with the shape of df.
Sales
Person
David 27
John 21
Mary 38
What you want is the transform method, which can apply a function to each group:
df['Total'] = df.groupby('Person')['Sales'].transform('sum')
It gives the expected result:
Person Sales Total
0 John 10 21
1 David 15 27
2 Mary 20 38
3 John 11 21
4 David 12 27
5 Mary 18 38
The easiest way to achieve this is with the pandas groupby and sum functions, mapping the per-person sums back onto the Person column:
df['Total'] = df['Person'].map(df.groupby('Person')['Sales'].sum())
This will add a column to the dataframe with the total sales per person.
Your 'Person' column in the dataframe contains repeated values, so you cannot assign the groupby result to a new column directly. I would suggest making a new dataframe based on the sales sum. The code below will help you with that:
newDf = pd.DataFrame(df.groupby('Person')['Sales'].sum()).reset_index()
This will create a new dataframe with 'Person' and 'Sales' as columns.
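If the totals are still wanted as a column on the original frame, the summary can be merged back. A minimal sketch reusing the names from above (the frame is rebuilt here so the snippet runs on its own):
import pandas as pd

df = pd.DataFrame({'Person': ['John', 'David', 'Mary', 'John', 'David', 'Mary'],
                   'Sales': [10, 15, 20, 11, 12, 18]})

# per-person totals as their own dataframe
newDf = df.groupby('Person', as_index=False)['Sales'].sum().rename(columns={'Sales': 'Total'})

# attach the totals back onto the original rows
df = df.merge(newDf, on='Person', how='left')
print(df)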

Python PanelOLS different statistics with single categorical and multiple dummy columns

I am trying to run a balance check on a Pandas DataFrame using an OLS with entity fixed effects. An example DataFrame is below:
county     year  treatment_vs_control  age  gender
Jefferson  2022  1                     24   M
Jackson    2022  1                     31   M
Jefferson  2022  0                     28   F
Jackson    2022  1                     24   null
Adams      2022  0                     72   F
First I try to run the model with the gender field as-is.
model_as_is = PanelOLS.from_formula(
    formula="treatment_vs_control ~ age + gender + EntityEffects",
    data=df
).fit()
model_as_is.summary
I get an F statistic of ~3.05 with a p-value of 0.0001.
Then, I try to run the model with one-hot encoded dummy gender columns. The DataFrame now looks like this:
county     year  treatment_vs_control  age  gender_m  gender_f
Jefferson  2022  1                     24   1         0
Jackson    2022  1                     31   1         0
Jefferson  2022  0                     28   0         1
Jackson    2022  1                     24   0         0
Adams      2022  0                     72   0         1
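For reference, a dummy layout like this can be produced with pd.get_dummies. A minimal sketch with a made-up gender column; the exact column names and ordering here are illustrative:
import pandas as pd

# made-up gender column matching the table above; None stands in for the null row
df = pd.DataFrame({'gender': ['M', 'M', 'F', None, 'F']})

# one 0/1 column per category; the row with a missing gender gets all zeros
dummies = pd.get_dummies(df['gender'], prefix='gender', dtype=int).rename(columns=str.lower)
df = pd.concat([df.drop(columns='gender'), dummies], axis=1)
print(df)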
My model now looks like:
model_dummy = PanelOLS(
    dependent=df["treatment_vs_control"],
    exog=df[["age", "gender_m", "gender_f"]],
    entity_effects=True,
    time_effects=False,
).fit()
model_dummy.summary
My F statistic is now ~2.61 with a p-value of 0.0002.
If I instead keep a single gender column but make it numeric rather than string-typed, I get yet a third set of statistics.
Why might this happen?

Is it possible to do full text search in pandas dataframe

Currently I'm using pandas DataFrame.filter to filter the records of the dataset. If I give a single word, I get all the records matching that word. But if I give two words that are present in the dataset, just not in the same record, I get an empty set. Is there any way, in pandas or another Python module, to search for multiple words that are not all in one record?
With a Python list comprehension we can build a full-text search by mapping, while pandas DataFrame.filter uses indexing. Is there any difference between mapping and indexing? If yes, what is it, and which gives better performance?
CustomerID Genre Age AnnualIncome (k$) SpendingScore (1-100)
1 Male 19 15 39
2 Male 21 15 81
3 Female 20 16 6
4 Female 23 16 77
5 Female 31 17 40
pokemon[pokemon['CustomerID'].isin(['200','5'])]
Output:
CustomerID Genre Age AnnualIncome (k$) SpendingScore (1-100)
5 Female 31 17 40
200 Male 30 137 83
Name Qty.
0 Apple 3
1 Orange 4
2 Cake 5
Considering the above dataframe, if you want to find quantities of Apples and Oranges, you can do it like this:
result = df[df['Name'].isin(['Apple','Orange'])]
print (result)
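For searching several words across all columns at once, even when they do not all appear in the same record, one option is a regex with str.contains. A minimal sketch with a made-up frame and search terms:
import re
import pandas as pd

# made-up frame and search terms
df = pd.DataFrame({'Name': ['Apple', 'Orange', 'Cake'],
                   'Qty.': [3, 4, 5]})
words = ['apple', 'cake']

# one case-insensitive pattern that matches any of the words
pattern = '|'.join(re.escape(w) for w in words)

# treat every cell as text and keep rows where any cell matches any word
mask = (df.astype(str)
          .apply(lambda col: col.str.contains(pattern, case=False, regex=True))
          .any(axis=1))
print(df[mask])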

Filling missing values with groupby object on Pandas - Titanic dataset

I've already seen similar questions, but none of them answers mine, or I didn't see/understand it. I am a newbie in ML and am trying to learn numpy and pandas with well-known datasets on Kaggle. Currently I am on the Titanic dataset. I have two distinct datasets: train and test. I have to fill missing values in the "Age" column of both the train and test datasets. My criterion is a grouped object I created from the train dataset, grouping by "Sex", "Pclass", and "Title" (which comes from the title in every passenger's name).
grouped = train.groupby(["Sex","Title","Pclass"])
grouped_m = grouped.median()
grouped_m = grouped_m.reset_index()[["Sex","Title","Pclass", "Age"]]
Output is:
Sex Title Pclass Age
0 female Miss 1 30.0
1 female Miss 2 24.0
2 female Miss 3 18.0
3 female Mrs 1 40.0
4 female Mrs 2 32.0
5 female Mrs 3 31.0
6 female Officer 1 49.0
7 female Royalty 1 40.5
8 male Master 1 4.0
9 male Master 2 1.0
10 male Master 3 4.0
11 male Mr 1 40.0
12 male Mr 2 31.0
13 male Mr 3 26.0
14 male Officer 1 51.0
15 male Officer 2 46.5
16 male Royalty 1 40.0
This is my criterion to apply to the "Age" column of the test dataset. For example: for a row in the test dataset with Sex = female, Title = Miss, Pclass = 1, Age = NaN, the NaN must be filled with the value from the output above, which would be Age = 30.
Before filling:
train["Age"].isna().sum()
Output is:
177
I tried this:
train["Age"] = train["Age"].fillna(grouped["Age"].transform("median"))
It perfectly filled NaN values on train set.
After filling:
train["Age"].isna().sum()
Output is:
0
But when I apply this to the test dataset, it changes nothing at all and doesn't give any errors.
Before filling:
test["Age"].isna().sum()
Output is:
86
Then I apply the same fill with the grouped object that I created from the train dataset:
test["Age"] = test["Age"].fillna(grouped["Age"].transform("median"))
test["Age"].isna().sum()
Output is:
86
The NaN values are still there in the test dataset. How should I apply this fill to the test dataset using the grouped object I created from the train dataset?
We want to fill in the missing age data instead of just dropping those rows. One way to do this is by filling in the mean age of all the passengers (imputation). Check the average age by passenger class first. For example:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
#Data visualization to see the age difference due to Passenger class
plt.figure(figsize=(12, 7))
sns.boxplot(x='Pclass',y='Age',data=train,palette='winter')
def impute_age(cols):
    Age = cols['Age']
    Pclass = cols['Pclass']
    if pd.isnull(Age):
        if Pclass == 1:
            return 37
        elif Pclass == 2:
            return 29
        else:
            return 24
    else:
        return Age
train['Age'] = train[['Age','Pclass']].apply(impute_age, axis=1)  # fill the missing values
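A variant of the same idea that avoids hard-coding the medians from the boxplot; a minimal sketch with a toy stand-in for the train frame:
import numpy as np
import pandas as pd

# toy stand-in for the Titanic train frame
train = pd.DataFrame({'Pclass': [1, 1, 2, 2, 3, 3],
                      'Age': [37.0, np.nan, 29.0, np.nan, 24.0, np.nan]})

# fill each missing Age with the median Age of that passenger's Pclass,
# computed from the data instead of read off the boxplot
train['Age'] = train['Age'].fillna(train.groupby('Pclass')['Age'].transform('median'))
print(train)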
Edit:
I merged the data using the DataFrame.merge() method as @ALollz suggests, and apparently it works. Here's the code:
# First filling NaN on train set as I did before.
grouped = train.groupby(["Sex","Title", "Pclass"])
grouped_m = grouped.median().reset_index()[["Sex", "Title", "Pclass", "Age"]]
train["Age"] = train["Age"].fillna(grouped["Age"].transform("median"))
# Then used pd.DataFrame.merge() to apply the same grouped features on the test data.
med = train.groupby(['Sex', 'Pclass', 'Title'], as_index=False)['Age'].median()
test = test.merge(med, on=['Sex','Pclass','Title'], how='left', suffixes=('','_'))
test['Age'] = test['Age'].fillna(test.pop('Age_'))
Thank you everyone!

Python - Creating a data frame, transposing and merging it to get a table

I am learning Python and I have a question about creating a data frame for every 5 rows, transposing it, and merging the resulting data frames.
I have a .txt file with the following input. It has thousands of rows and I need to go through each line until the end of the file.
Name,Kamath
Age,23
Sex,Male
Company,ACC
Vehicle,Car
Name,Ram
Age,32
Sex,Male
Company,CCA
Vehicle,Bike
Name,Reena
Age,26
Sex,Female
Company,BARC
Vehicle,Cycle
I need to get this as my output:
Name,Age,Sex,Company,Vehicle
Kamath,23,Male,ACC,Car
Ram,32,Male,CCA,Bike
Reena,26,Female,BARC,Cycle
Use read_csv to build the DataFrame and then pivot, with cumcount as the counter for the new index:
import io
import pandas as pd
temp=u"""Name,Kamath
Age,23
Sex,Male
Company,ACC
Vehicle,Car
Name,Ram
Age,32
Sex,Male
Company,CCA
Vehicle,Bike
Name,Reena
Age,26
Sex,Female
Company,BARC
Vehicle,Cycle"""
# after testing, replace 'io.StringIO(temp)' with 'filename.txt'
df = pd.read_csv(io.StringIO(temp), names=['a','b'])
print (df)
a b
0 Name Kamath
1 Age 23
2 Sex Male
3 Company ACC
4 Vehicle Car
5 Name Ram
6 Age 32
7 Sex Male
8 Company CCA
9 Vehicle Bike
10 Name Reena
11 Age 26
12 Sex Female
13 Company BARC
14 Vehicle Cycle
# modern pandas needs column names for pivot, so add the counter as a helper column
df = (df.assign(idx=df.groupby('a').cumcount())
        .pivot(index='idx', columns='a', values='b'))
df.index.name = None  # drop the helper index name so it prints as below
print (df)
a Age Company Name Sex Vehicle
0 23 ACC Kamath Male Car
1 32 CCA Ram Male Bike
2 26 BARC Reena Female Cycle
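Since every record is exactly five lines, a plain reshape of the values column is another option. A minimal sketch that rebuilds the two-column frame inline so it runs on its own:
import pandas as pd

# rebuild the two-column frame from the question
rows = [('Name', 'Kamath'), ('Age', 23), ('Sex', 'Male'), ('Company', 'ACC'), ('Vehicle', 'Car'),
        ('Name', 'Ram'), ('Age', 32), ('Sex', 'Male'), ('Company', 'CCA'), ('Vehicle', 'Bike'),
        ('Name', 'Reena'), ('Age', 26), ('Sex', 'Female'), ('Company', 'BARC'), ('Vehicle', 'Cycle')]
df = pd.DataFrame(rows, columns=['a', 'b'])

# every record is exactly five consecutive rows, so reshape the values column
values = df['b'].to_numpy().reshape(-1, 5)
out = pd.DataFrame(values, columns=df['a'].iloc[:5].tolist())
print(out)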
