I need to convert my Sex and Embarked values into numbers.
My solution doesn't work properly; it changes the values in all columns:
titset[titset['Sex']=='male'] = 2
titset[titset['Sex']=='female'] = 1
titset
Piece of my dataframe:
import pandas as pd
id = [1,2,3,4,5,6,7,8,9,10]
data = {
'Fare': ['male', 'female', 'male', 'male', 'female','male','male','female','male','female'],
'Embarked': ['S','C','Q','S','S','S','C','C','S','S']
}
titanic = pd.DataFrame(data=data, index=id)
titanic
Use Series.map:
titanic["Fare"] = titanic["Fare"].map({"male": 2, "female": 1})
You can use the code from #Code_Different, or you can use the DataFrame.replace function:
titanic = titanic.replace({'Fare':{'male':2,'female':1}})
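For reference, the original attempt changed every column because titset[mask] = 2 assigns to whole rows wherever the mask is True, not only to the Sex column. Below is a minimal sketch that encodes both columns with Series.map. The small titset frame and the Embarked codes (S=0, C=1, Q=2) are assumptions made for illustration and do not come from the question.

```python
import pandas as pd

# Hypothetical stand-in for the asker's titset frame
titset = pd.DataFrame({
    "Sex": ["male", "female", "male"],
    "Embarked": ["S", "C", "Q"],
})

sex_codes = {"male": 2, "female": 1}       # codes taken from the question
embarked_codes = {"S": 0, "C": 1, "Q": 2}  # assumed codes, pick your own

# map() works on one column at a time, so the other columns stay untouched
encoded = titset.assign(
    Sex=titset["Sex"].map(sex_codes),
    Embarked=titset["Embarked"].map(embarked_codes),
)
print(encoded)
```

Values that are missing from a mapping dictionary come back as NaN, which makes unexpected categories easy to spot.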
I have two data frames:
one (multiindex) of size (1113, 7987) containing values for different countries and sectors in the columns and different IDs in the rows, for example:
F_frame:
AT BE ...
Food Energy Food Energy ...
ID1
ID2
...
In another dataframe (CF_LO) I have factor values with corresponding countries and IDs that I would like to match with the former dataframe (F_frame), so that I multiply the values in F_frame by the factor values in CF_LO if they match by country and ID. If they do not match, I put a zero.
The code I have so far seems to work, but it runs very slowly. Is there a smarter way to match the tables based on the index/header names?
(The code loops over 49 countries and multiplies by the same factor for each of the 163 sectors within the country.)
LO_impacts = pd.DataFrame(np.zeros((1113,7987)))
for i in range(0, len(F_frame)):
    for j in range(0, 49):
        for k in range(0, len(CF_LO)):
            if (F_frame.index.get_level_values(1)[i] == CF_LO.iloc[k,1] and
                    F_frame.columns.get_level_values(0)[j*163] == CF_LO.iloc[k,2]):
                LO_impacts.iloc[i,(j*163):((j+1)*163)] = F_frame.iloc[i,(j*163):((j+1)*163)] * CF_LO.iloc[k,4]
            else:
                LO_impacts.iloc[i,(j*163):((j+1)*163)] == 0
I made two DataFrames and set a new index on the second one.
Then I used the assign() function to create a new column in df2:
df2=df2.assign(gre_multiply=lambda x: x.gre*df1.gre)
Don't forget the df2= assignment; I left it out in the picture.
The result is df2 with the new gre_multiply column.
The multiplication is matched on the index, which you can check with a calculator. It returns the values as floats, which are easy to convert to int later with df2.gre_multiply.astype(int).
Before that you need to call fillna, because rows whose index does not match between the two DataFrames come back as NaN:
df2.gre_multiply=df2.gre_multiply.fillna(0).astype(int)
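A small self-contained sketch of that index alignment follows. The frames df1 and df2, their values, and the index labels a, b, c are made up here, since the original pictures are not available.

```python
import pandas as pd

# Hypothetical frames sharing a 'gre' column; 'c' exists only in df2
df1 = pd.DataFrame({"gre": [2, 3]}, index=["a", "b"])
df2 = pd.DataFrame({"gre": [10, 20, 30]}, index=["a", "b", "c"])

# Series multiplication aligns on the index, so 'c' has no partner and becomes NaN
df2 = df2.assign(gre_multiply=lambda x: x.gre * df1.gre)

# Replace the unmatched rows with 0, then convert the float column to int
df2["gre_multiply"] = df2["gre_multiply"].fillna(0).astype(int)
print(df2)
```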
import pandas as pd
# Creating dummy data
# The column MultiIndex must exist before it is passed to the DataFrame constructor
mindex = pd.MultiIndex.from_product([["AT", "DK"], ["Food", "Energy"]])
data = pd.DataFrame([
    [2.0, 1.1, 6.7, 4.5],
    [4.3, 5.7, 8.6, 9.0],
    [5.5, 6.8, 9.0, 4.7],
    [5.5, 6.8, 9.0, 4.7],
], index = ["S1", "S1", "S2", "S2"], columns = mindex)
mul_factor = pd.DataFrame({"Country": ['AT', 'DK', 'AT', 'DK'],
"Value": [1.0, 0.8, 0.9, 0.6],
}, index = ['S1', 'S1', 'S2', 'S2'])
new_data = data.copy()
new_data.columns = data.columns.to_frame()[0].to_list()
# Reshaping the second Dataframe
mat = mul_factor.reset_index().pivot(index = 'Country', columns='index')
mat.index.name = None
mat = mat.T.reset_index(0, drop = True)
mat.index.name = None
new_data.multiply(mat) # Required result
Please let me know if I've misunderstood your question. You might have to modify the code a bit to accommodate missing country values.
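Another possible way to avoid the triple loop from the question is to pivot the factor table into an ID-by-country grid and reindex it to the shape of F_frame. This is only a sketch under assumptions: the column names ID, Country and Factor in CF_LO are invented, F_frame here uses a plain (single-level) row index, and each (ID, Country) pair is assumed to appear at most once.

```python
import pandas as pd

# Hypothetical miniature of F_frame: countries and sectors in the columns, IDs in the rows
cols = pd.MultiIndex.from_product([["AT", "DK"], ["Food", "Energy"]])
F_frame = pd.DataFrame(
    [[2.0, 1.1, 6.7, 4.5],
     [4.3, 5.7, 8.6, 9.0],
     [5.5, 6.8, 9.0, 4.7]],
    index=["ID1", "ID2", "ID3"], columns=cols,
)

# Hypothetical factor table (column names are assumptions)
CF_LO = pd.DataFrame({
    "ID": ["ID1", "ID1", "ID2"],
    "Country": ["AT", "DK", "DK"],
    "Factor": [0.5, 1.0, 2.0],
})

# One row per ID, one column per country; requires unique (ID, Country) pairs
factors = CF_LO.pivot(index="ID", columns="Country", values="Factor")

# Repeat each country's factor for all of its sector columns; unmatched cells become 0
factors = factors.reindex(
    index=F_frame.index,
    columns=F_frame.columns.get_level_values(0),
).fillna(0.0)

# Shapes now agree, so a plain element-wise product replaces the loops
LO_impacts = F_frame * factors.to_numpy()
print(LO_impacts)
```

With a MultiIndex on the rows, as in the question, the ID level would have to be extracted first (for example with get_level_values) before reindexing.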
The class is composed of a set of attributes and functions including:
Attributes:
df : a pandas dataframe.
numerical_feature_names: df columns with a numeric value.
label_column_names: df string columns to be grouped.
Functions:
mean(nums): takes a list of numbers as input and returns the mean
fill_na(df, numerical_feature_names, label_column_names): takes class attributes as inputs and returns a transformed df.
And here's the class:
class PLUMBER():
    def __init__(self):
        ################# attributes ################
        self.df=df
        # specify label and numerical features names:
        self.numerical_feature_names=numerical_feature_names
        self.label_column_names=label_column_names
    ##################### mean ##############################
    def mean(self, nums):
        total=0.0
        for num in nums:
            total=total+num
        return total/len(nums)
    ############ fill the numerical features ##################
    def fill_na(self, df, numerical_feature_names, label_column_names):
        # declaring parameters:
        df=self.df
        numerical_feature_names=self.numerical_feature_names
        label_column_names=self.label_column_names
        # now replacing NaN with group mean
        for numerical_feature_name in numerical_feature_names:
            df[numerical_feature_name]=df.groupby([label_column_names]).transform(lambda x: x.fillna(self.mean(x)))
        return df
When trying to apply it to a pandas df:
if __name__=="__main__":
    # initialize class
    plumber=PLUMBER()
    # replace NaN with group mean
    df=plumber.fill_na(df=df, numerical_feature_names=numerical_feature_names, label_column_names=label_column_names)
The next error arises:
ValueError: Grouper and axis must be same length
Data and class parameters:
import numpy as np
import pandas as pd
d={'month': ['01/01/2020', '01/02/2020', '01/03/2020', '01/01/2020', '01/02/2020', '01/03/2020'],
'country': ['Japan', 'Japan', 'Japan', 'Poland', 'Poland', 'Poland'],
'level':['A01', 'A01', 'A01', 'A00','A00', 'A00'],
'job title':['Insights Manager', 'Insights Manager', 'Insights Manager', 'Sales Director', 'Sales Director', 'Sales Director'],
'number':[np.nan, 450, 299, np.nan, 19, 29],
'age':[np.nan, 30, 28, np.nan, 29, 18]}
df=pd.DataFrame(d)
# headers
column_names=df.columns.values.tolist()
column_names= [column_name.strip() for column_name in column_names]
# label_column_names (to be grouped)
label_column_names=['country', 'level', 'job title']
# numerical_features:
numerical_feature_names = [x for x in column_names if x not in label_column_names]
numerical_feature_names.remove('month')
How could I change the class in order to get the transformed df (i.e. the one that replaces np.nan with its group mean)?
First, the error occurs because label_column_names is already a list, so in the groupby you don't need the [] around it. It should be df.groupby(label_column_names)... instead of df.groupby([label_column_names])...
Now, to actually solve your problem, in the fill_na function of your class, replace the for loop (you don't actually need it) with
df[numerical_feature_names] = (
df[numerical_feature_names]
.fillna(
df.groupby(label_column_names)
[numerical_feature_names].transform('mean')
)
)
Here you fill the NaN values in the numerical_feature_names columns with the result of groupby.transform, which holds the group mean of each of these columns.
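Put together, the class might look like the sketch below. Passing the frame and the column lists into the constructor (instead of reading globals) and working on a copy are design choices assumed here, not part of the original class. The custom mean helper is dropped because transform('mean') already skips NaN.

```python
import numpy as np
import pandas as pd


class PLUMBER:
    def __init__(self, df, numerical_feature_names, label_column_names):
        # Assumed change: receive everything explicitly rather than from globals
        self.df = df
        self.numerical_feature_names = numerical_feature_names
        self.label_column_names = label_column_names

    def fill_na(self):
        df = self.df.copy()  # leave the caller's frame untouched
        # Per-group means, broadcast back to the original row order
        group_means = (
            df.groupby(self.label_column_names)[self.numerical_feature_names]
            .transform("mean")
        )
        df[self.numerical_feature_names] = df[self.numerical_feature_names].fillna(group_means)
        return df


df = pd.DataFrame({
    "country": ["Japan", "Japan", "Japan", "Poland", "Poland", "Poland"],
    "level": ["A01", "A01", "A01", "A00", "A00", "A00"],
    "job title": ["Insights Manager"] * 3 + ["Sales Director"] * 3,
    "number": [np.nan, 450, 299, np.nan, 19, 29],
    "age": [np.nan, 30, 28, np.nan, 29, 18],
})

plumber = PLUMBER(df, ["number", "age"], ["country", "level", "job title"])
filled = plumber.fill_na()
print(filled)
```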
I'm reading some code that has the following lines:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(df[1])
Where df[1] is of type pandas.core.series.Series and contains string values such as "basketball", "football", "soccer", etc.
What does the method le.fit() do? I saw that some other fit methods are used to train the model, but that doesn't make sense to me because the input here is purely the labels, not the training data. The documentation simply says "Fit label encoder." What does that mean?
It takes a categorical column and converts/maps it to numerical values.
If, for example, we have a dataset of people and their favorite sports, and we want to do some machine learning (which relies on mathematics) on that dataframe, we can't do any computations on the strings 'basketball' or 'football'. What we can do is map each of those sports to a value that machine learning algorithms can work with:
For example: 'basketball' = 0, 'football' = 1, 'soccer' = 2, etc.
We could do that manually using a dictionary, and just apply that mapping to a column, or we can use the le.fit() to do that for us.
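The manual route could look like the following sketch. The Series and the dictionary are hypothetical and simply mirror the example values above.

```python
import pandas as pd

# Hypothetical column of sports
sports = pd.Series(["basketball", "football", "soccer", "basketball"])

# Hand-written mapping, mirroring the example values above
manual_mapping = {"basketball": 0, "football": 1, "soccer": 2}

encoded = sports.map(manual_mapping)
print(encoded.tolist())
```

The drawback is that the dictionary has to be kept in sync with the data by hand, which is the bookkeeping the encoder takes over.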
So we use it on our training data, and it will figure out the unique values and assign a value to it:
import pandas as pd
from sklearn import preprocessing
train_df = pd.DataFrame(
[
['Person1', 'basketball'],
['Person2', 'football'],
['Person3', 'basketball'],
['Person4', 'basketball'],
['Person5', 'soccer'],
['Person6', 'soccer'],
['Person7', 'soccer'],
['Person8', 'basketball'],
['Person9', 'football'],
],
columns=['person', 'sport']
)
le = preprocessing.LabelEncoder()
le.fit(train_df['sport'])
And now, we can transform the 'sport' column in our test data using that determined mapping from the le.fit()
test_df = pd.DataFrame(
[
['Person11', 'soccer'],
['Person12', 'soccer'],
['Person13', 'basketball'],
['Person14', 'football'],
['Person15', 'football'],
['Person16', 'soccer'],
['Person17', 'soccer'],
['Person18', 'basketball'],
['Person19', 'soccer'],
],
columns=['person', 'sport']
)
le.transform(test_df['sport'])
And if you want to see how that mapping looks, we'll just throw that on the test set as a column:
test_df['encoded'] = le.transform(test_df['sport'])
And now we see it assigned 'soccer' to the value 2, 'basketball' to 0, and 'football' to 1.
print(test_df)
     person       sport  encoded
0  Person11      soccer        2
1  Person12      soccer        2
2  Person13  basketball        0
3  Person14    football        1
4  Person15    football        1
5  Person16      soccer        2
6  Person17      soccer        2
7  Person18  basketball        0
8  Person19      soccer        2
As #PSK says, the fit() method of LabelEncoder stores the unique values of the array you pass to it. For example, if it is a numerical array it will call numpy.unique():
import numpy as np
import pandas as pd
d = {'col1': [1, 2, 2, 3], 'col2': ['A', 'B', 'B', 'C']}
df = pd.DataFrame(data=d)
# For numerical array
np.unique(df.col1)
>>> array([1, 2, 3])
or essentially set() if it is an object type:
set(df.col2)
>>> {'A', 'B', 'C'}
and stores this result in the .classes_ attribute of LabelEncoder, which can later be accessed by other methods of the class, such as transform(), to encode new data.
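As a rough illustration of that idea using only NumPy, the sketch below mimics the fit/transform behaviour described above. It is not scikit-learn's actual implementation, and the variable names are invented.

```python
import numpy as np

train_labels = np.array(["basketball", "football", "basketball", "soccer"])

# "fit": remember the sorted unique labels (plays the role of .classes_)
classes = np.unique(train_labels)

# "transform": each label is encoded as its position in the sorted classes
new_labels = np.array(["soccer", "football", "soccer"])
codes = np.searchsorted(classes, new_labels)

# "inverse transform": look the labels up again by position
decoded = classes[codes]
print(classes, codes, decoded)
```

Note that this shortcut silently gives wrong codes for labels that were not seen during the "fit" step, so it is only meant to show where the numbers come from.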
Sometimes, it seems that the more I use Python (and Pandas), the less I understand. So I apologise if I'm just not seeing the wood for the trees here but I've been going round in circles and just can't see what I'm doing wrong.
Basically, I have an example script (that I'd like to implement on a much larger dataframe) but I can't get it to work to my satisfaction.
The dataframe consists of columns of various datatypes. I'd like to group the dataframe on 2 columns and then produce a new dataframe that contains lists of all the unique values for each variable in each group. (Ultimately, I'd like to concatenate the list items into a single string – but that's a different question.)
The initial script I used was:
import numpy as np
import pandas as pd
def tempFuncAgg(tempVar):
    tempList = set(tempVar.dropna()) # Drop NaNs and create set of unique values
    print(tempList)
    return tempList
# Define dataframe
tempDF = pd.DataFrame({ 'id': [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
'date': ["02/04/2015 02:34","06/04/2015 12:34","09/04/2015 23:03","12/04/2015 01:00","15/04/2015 07:12","21/04/2015 12:59","29/04/2015 17:33","04/05/2015 10:44","06/05/2015 11:12","10/05/2015 08:52","12/05/2015 14:19","19/05/2015 19:22","27/05/2015 22:31","01/06/2015 11:09","04/06/2015 12:57","10/06/2015 04:00","15/06/2015 03:23","19/06/2015 05:37","23/06/2015 13:41","27/06/2015 15:43"],
'gender': ["male","female","female","male","male","female","female",np.nan,"male","male","female","male","female","female","male","female","male","female",np.nan,"male"],
'age': ["young","old","old","old","old","old",np.nan,"old","old","young","young","old","young","young","old",np.nan,"old","young",np.nan,np.nan]})
# Groupby based on 2 categorical variables
tempGroupby = tempDF.groupby(['gender','age'])
# Aggregate for each variable in each group using function defined above
dfAgg = tempGroupby.agg(lambda x: tempFuncAgg(x))
print(dfAgg)
The output from this script is as expected: a series of lines containing the sets of values and a dataframe containing the returned sets:
{'09/04/2015 23:03', '21/04/2015 12:59', '06/04/2015 12:34'}
{'01/06/2015 11:09', '12/05/2015 14:19', '27/05/2015 22:31', '19/06/2015 05:37'}
{'15/04/2015 07:12', '19/05/2015 19:22', '06/05/2015 11:12', '04/06/2015 12:57', '15/06/2015 03:23', '12/04/2015 01:00'}
{'02/04/2015 02:34', '10/05/2015 08:52'}
{2, 3, 6}
{18, 11, 13, 14}
{4, 5, 9, 12, 15, 17}
{1, 10}
date \
gender age
female old set([09/04/2015 23:03, 21/04/2015 12:59, 06/04...
young set([01/06/2015 11:09, 12/05/2015 14:19, 27/05...
male old set([15/04/2015 07:12, 19/05/2015 19:22, 06/05...
young set([02/04/2015 02:34, 10/05/2015 08:52])
id
gender age
female old set([2, 3, 6])
young set([18, 11, 13, 14])
male old set([4, 5, 9, 12, 15, 17])
young set([1, 10])
The problem occurs when I try to convert the sets to lists. Bizarrely, it produces 2 duplicated rows containing identical lists but then fails with a 'ValueError: Function does not reduce' error.
def tempFuncAgg(tempVar):
    tempList = list(set(tempVar.dropna())) # This is the only difference
    print(tempList)
    return tempList
tempDF = pd.DataFrame({ 'id': [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
'date': ["02/04/2015 02:34","06/04/2015 12:34","09/04/2015 23:03","12/04/2015 01:00","15/04/2015 07:12","21/04/2015 12:59","29/04/2015 17:33","04/05/2015 10:44","06/05/2015 11:12","10/05/2015 08:52","12/05/2015 14:19","19/05/2015 19:22","27/05/2015 22:31","01/06/2015 11:09","04/06/2015 12:57","10/06/2015 04:00","15/06/2015 03:23","19/06/2015 05:37","23/06/2015 13:41","27/06/2015 15:43"],
'gender': ["male","female","female","male","male","female","female",np.nan,"male","male","female","male","female","female","male","female","male","female",np.nan,"male"],
'age': ["young","old","old","old","old","old",np.nan,"old","old","young","young","old","young","young","old",np.nan,"old","young",np.nan,np.nan]})
tempGroupby = tempDF.groupby(['gender','age'])
dfAgg = tempGroupby.agg(lambda x: tempFuncAgg(x))
print(dfAgg)
But now the output is:
['09/04/2015 23:03', '21/04/2015 12:59', '06/04/2015 12:34']
['09/04/2015 23:03', '21/04/2015 12:59', '06/04/2015 12:34']
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
...
ValueError: Function does not reduce
Any help to troubleshoot this problem would be appreciated and I apologise in advance if it's something obvious that I'm just not seeing.
EDIT
Incidentally, converting the set to a tuple rather than a list works with no problem.
Lists can sometimes cause odd problems in pandas. You can either:
Use tuples (as you've already noticed)
If you really need lists, do the conversion in a second step like this:
dfAgg = dfAgg.applymap(lambda x: list(x))
Full example:
import numpy as np
import pandas as pd
def tempFuncAgg(tempVar):
    tempList = set(tempVar.dropna()) # Drop NaNs and create set of unique values
    print(tempList)
    return tempList
# Define dataframe
tempDF = pd.DataFrame({ 'id': [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
'date': ["02/04/2015 02:34","06/04/2015 12:34","09/04/2015 23:03","12/04/2015 01:00","15/04/2015 07:12","21/04/2015 12:59","29/04/2015 17:33","04/05/2015 10:44","06/05/2015 11:12","10/05/2015 08:52","12/05/2015 14:19","19/05/2015 19:22","27/05/2015 22:31","01/06/2015 11:09","04/06/2015 12:57","10/06/2015 04:00","15/06/2015 03:23","19/06/2015 05:37","23/06/2015 13:41","27/06/2015 15:43"],
'gender': ["male","female","female","male","male","female","female",np.nan,"male","male","female","male","female","female","male","female","male","female",np.nan,"male"],
'age': ["young","old","old","old","old","old",np.nan,"old","old","young","young","old","young","young","old",np.nan,"old","young",np.nan,np.nan]})
# Groupby based on 2 categorical variables
tempGroupby = tempDF.groupby(['gender','age'])
# Aggregate for each variable in each group using function defined above
dfAgg = tempGroupby.agg(lambda x: tempFuncAgg(x))
# Transform the sets into lists (assign the result, applymap does not work in place)
dfAgg = dfAgg.applymap(lambda x: list(x))
print(dfAgg)
There are many such bizarre behaviours in pandas. It is generally better to go with a workaround (like this one) than to look for a perfect solution.
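A different workaround, sketched here on a shortened, made-up version of the frame, is to select one column and use apply on the grouped Series, which returns a Series of lists. The last step joins each list into a single string, as mentioned at the end of the question. Sorting the values is an extra choice made here so that the output order is predictable.

```python
import numpy as np
import pandas as pd

# Shortened, hypothetical version of the question's frame
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "gender": ["male", "female", "female", "male", np.nan],
    "age": ["young", "old", "old", "young", "old"],
})

# Rows with NaN in a grouping key are dropped by groupby by default
ids = (
    df.groupby(["gender", "age"])["id"]
    .apply(lambda s: sorted(s.dropna().unique().tolist()))
)

# Concatenate the list items of each group into one string
joined = ids.map(lambda values: ", ".join(str(v) for v in values))
print(ids)
print(joined)
```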