What does LabelEncoder().fit() do? - python

I'm reading some code that has the following lines:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(df[1])
Where df[1] is of type pandas.core.series.Series and contains string values such as "basketball", "football", "soccer", etc.
What does the method le.fit() do? I saw that some other fit methods are used to train the model, but that doesn't make sense to me because the input here is purely the labels, not the training data. The documentation simply says "Fit label encoder." What does that mean?

It takes a categorical column and converts/maps it to numerical values.
Say, for example, we have a dataset of people and their favorite sport, and we want to do some machine learning (which uses mathematics) on that dataframe. Mathematically, we can't do any computation on the strings 'basketball' or 'football'. But what we can do is map each of those sports to a number, which allows machine learning algorithms to do their thing:
For example: 'basketball' = 0, 'football' = 1, 'soccer' = 2, etc.
We could do that manually using a dictionary and just apply that mapping to a column, or we can use le.fit() to do it for us.
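For illustration, the manual dictionary approach might look like this (a hypothetical sketch; the Series and mapping are made up for the example):

```python
import pandas as pd

# A made-up column of sports and a hand-written mapping
sports = pd.Series(['basketball', 'football', 'soccer', 'basketball'])
mapping = {'basketball': 0, 'football': 1, 'soccer': 2}

# Series.map applies the dictionary element-wise
print(sports.map(mapping).tolist())  # [0, 1, 2, 0]
```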
So we use it on our training data, and it will figure out the unique values and assign a numeric code to each:
import pandas as pd
from sklearn import preprocessing
train_df = pd.DataFrame(
    [
        ['Person1', 'basketball'],
        ['Person2', 'football'],
        ['Person3', 'basketball'],
        ['Person4', 'basketball'],
        ['Person5', 'soccer'],
        ['Person6', 'soccer'],
        ['Person7', 'soccer'],
        ['Person8', 'basketball'],
        ['Person9', 'football'],
    ],
    columns=['person', 'sport']
)
le = preprocessing.LabelEncoder()
le.fit(train_df['sport'])
And now we can transform the 'sport' column in our test data using the mapping that le.fit() determined:
test_df = pd.DataFrame(
    [
        ['Person11', 'soccer'],
        ['Person12', 'soccer'],
        ['Person13', 'basketball'],
        ['Person14', 'football'],
        ['Person15', 'football'],
        ['Person16', 'soccer'],
        ['Person17', 'soccer'],
        ['Person18', 'basketball'],
        ['Person19', 'soccer'],
    ],
    columns=['person', 'sport']
)
le.transform(test_df['sport'])
And if you want to see how that mapping looks, we'll just throw that on the test set as a column:
test_df['encoded'] = le.transform(test_df['sport'])
And now we see it assigned 'soccer' to the value 2, 'basketball' to 0, and 'football' to 1.
print(test_df)
     person       sport  encoded
0  Person11      soccer        2
1  Person12      soccer        2
2  Person13  basketball        0
3  Person14    football        1
4  Person15    football        1
5  Person16      soccer        2
6  Person17      soccer        2
7  Person18  basketball        0
8  Person19      soccer        2
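As a side note, the fitted encoder exposes the mapping it learned through its classes_ attribute, and inverse_transform() reverses the encoding. A small sketch (the sample labels are made up, but the methods are standard LabelEncoder API):

```python
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(['basketball', 'football', 'soccer', 'basketball'])

# classes_ holds the unique labels in sorted order; the code of a label
# is simply its position in this array
print(le.classes_)                      # ['basketball' 'football' 'soccer']
print(le.transform(['soccer']))         # [2]
print(le.inverse_transform([0, 1, 2]))  # ['basketball' 'football' 'soccer']
```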

As PSK says, LabelEncoder's fit() stores the unique values of the array you pass to it. For a numerical array, this is roughly equivalent to calling numpy.unique():
import numpy as np
import pandas as pd
d = {'col1': [1, 2, 2, 3], 'col2': ['A', 'B', 'B', 'C']}
df = pd.DataFrame(data=d)
# For a numerical array
np.unique(df.col1)
>>> array([1, 2, 3])
or, roughly, to building a set if it is of object type:
set(df.col2)
>>> {'A', 'B', 'C'}
This result is stored in the .classes_ attribute of the LabelEncoder, which can later be accessed by other methods of the class, such as transform(), to encode new data.
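One caveat worth knowing: transform() only accepts labels that were seen during fit(); an unseen label raises a ValueError instead of silently getting a new code. A minimal sketch (labels made up):

```python
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(['A', 'B', 'C'])

# 'D' was not in the data passed to fit(), so transform() rejects it
try:
    le.transform(['D'])
except ValueError as e:
    print('unseen label rejected:', e)
```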

Related

Using 'isin' in python for three filters

I have the following dataframe
# Import pandas library
import pandas as pd
import numpy as np
# initialize list elements
data = ['george',
        'instagram',
        'nick',
        'basketball',
        'tennis']
# Create the pandas DataFrame with the column name provided explicitly
df = pd.DataFrame(data, columns=['Unique Words'])
# print dataframe
df
and I want to create a new column, based on the following two lists, that looks like this:
key_words = ["football", "basketball", "tennis"]
usernames = ["instagram", "facebook", "snapchat"]
Label
-----
0
2
0
1
1
So words in the list key_words take the label 1, words in the list usernames take the label 2, and all others take the label 0.
Thank you so much for your time and help!
One way to do this is to create a label map, numbering all elements of the first list as 1 and those of the second as 2. Then you can use .map in pandas to map the values and fillna with 0.
# Import pandas library
import pandas as pd
import numpy as np
# initialize list elements
data = ['george',
        'instagram',
        'nick',
        'basketball',
        'tennis']
# Create the pandas DataFrame with the column name provided explicitly
df = pd.DataFrame(data, columns=['Unique Words'])
key_words = ["football", "basketball", "tennis"]
usernames = ["instagram", "facebook", "snapchat"]
label_map = {e: i+1 for i, l in enumerate([key_words,usernames]) for e in l}
print(label_map)
df['Label'] = df['Unique Words'].map(label_map).fillna(0).astype(int)
print(df)
Output
{'football': 1, 'basketball': 1, 'tennis': 1, 'instagram': 2, 'facebook': 2, 'snapchat': 2}
  Unique Words  Label
0       george      0
1    instagram      2
2         nick      0
3   basketball      1
4       tennis      1
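Since the question title mentions isin, an alternative sketch using isin together with np.select should produce the same labels (this reuses the question's data):

```python
import pandas as pd
import numpy as np

data = ['george', 'instagram', 'nick', 'basketball', 'tennis']
df = pd.DataFrame(data, columns=['Unique Words'])
key_words = ["football", "basketball", "tennis"]
usernames = ["instagram", "facebook", "snapchat"]

# np.select evaluates the boolean masks in order and picks the matching label,
# falling back to 0 when neither condition is true
conditions = [df['Unique Words'].isin(key_words),
              df['Unique Words'].isin(usernames)]
df['Label'] = np.select(conditions, [1, 2], default=0)
print(df)
```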

Extract values from dictionary and conditionally assign them to columns in pandas

I am trying to extract values from a column of dictionaries in pandas and assign them to their respective columns that already exist. I have hardcoded an example below of the data set that I have:
import numpy as np
import pandas as pd

df_have = pd.DataFrame({
    'value_column': [np.nan, np.nan, np.nan],
    'date': [np.nan, np.nan, np.nan],
    'string_column': [np.nan, np.nan, np.nan],
    'dict': [[{'value_column': 40}, {'date': '2017-08-01'}],
             [{'value_column': 30}, {'string_column': 'abc'}],
             [{'value_column': 10}, {'date': '2016-12-01'}]]
})
df_have
df_want = pd.DataFrame({
    'value_column': [40, 30, 10],
    'date': ['2017-08-01', np.nan, '2016-12-01'],
    'string_column': [np.nan, 'abc', np.nan],
    'dict': [[{'value_column': 40}, {'date': '2017-08-01'}],
             [{'value_column': 30}, {'string_column': 'abc'}],
             [{'value_column': 10}, {'date': '2016-12-01'}]]
})
df_want
I have managed to extract the values out of the dictionaries using loops:
for row in range(len(df_have)):
    row_holder = df_have.dict[row]
    number_of_dictionaries_in_the_row = len(row_holder)
    for dictionary in range(number_of_dictionaries_in_the_row):
        variable_holder = df_have.dict[row][dictionary].keys()
        variable = list(variable_holder)[0]
        value = df_have.dict[row][dictionary].get(variable)
I now need to somehow conditionally turn df_have into df_want. I am happy to take a completely new approach and recreate the whole thing from scratch. We could even assume that I only have a dataframe with the dictionaries and nothing else.
You could use pandas string methods to pull the data out, although I think nesting data structures within pandas is inefficient:
df_have.loc[:, "value_column"] = df_have["dict"].str.get(0).str.get("value_column")
df_have.loc[:, "date"] = df_have["dict"].str.get(-1).str.get("date")
df_have.loc[:, "string_column"] = df_have["dict"].str.get(-1).str.get("string_column")
value_column date string_column dict
0 40 2017-08-01 None [{'value_column': 40}, {'date': '2017-08-01'}]
1 30 None abc [{'value_column': 30}, {'string_column': 'abc'}]
2 10 2016-12-01 None [{'value_column': 10}, {'date': '2016-12-01'}]
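An alternative sketch, assuming each row holds a list of single-key dicts as in the question: merge them into one dict per row and let pandas expand the result into columns (the column names come from the question's data):

```python
import pandas as pd

df_have = pd.DataFrame({
    'dict': [[{'value_column': 40}, {'date': '2017-08-01'}],
             [{'value_column': 30}, {'string_column': 'abc'}],
             [{'value_column': 10}, {'date': '2016-12-01'}]]
})

# Collapse each row's list of single-key dicts into one flat dict,
# then build a frame from those dicts; missing keys become NaN
merged = df_have['dict'].apply(lambda ds: {k: v for d in ds for k, v in d.items()})
df_want = pd.concat([pd.DataFrame(merged.tolist()), df_have], axis=1)
print(df_want)
```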

Convert panda dataframe group of values to multiple lists

I have pandas dataframe, where I listed items, and categorised them:
col_name    | col_group
------------|----------
id          | Metadata
listing_url | Metadata
scrape_id   | Metadata
name        | Text
summary     | Text
space       | Text
To reproduce:
import pandas
df = pandas.DataFrame([
    ['id', 'metadata'],
    ['listing_url', 'metadata'],
    ['scrape_id', 'metadata'],
    ['name', 'Text'],
    ['summary', 'Text'],
    ['space', 'Text']],
    columns=['col_name', 'col_group'])
Can you suggest how I can convert this dataframe to multiple lists based on "col_group":
Metadata = ['id', 'listing_url', 'scrape_id']
Text = ['name', 'summary', 'space']
This is to allow me to pass these lists of columns to pandas and drop columns.
I googled a lot and got stuck: all answers are about converting lists to df, not vice versa. Should I aim to convert into dictionary, or list of lists?
I have over 100 rows, belonging to 10 categories, so would like to avoid manual hard-coding.
I've tried this code:
import pandas
df = pandas.DataFrame([
    [1, 'url_a', 'scrap_a', 'name_a', 'summary_a', 'space_a'],
    [2, 'url_b', 'scrap_b', 'name_b', 'summary_b', 'space_b'],
    [3, 'url_c', 'scrap_c', 'name_c', 'summary_c', 'space_ac']],
    columns=['id', 'listing_url', 'scrape_id', 'name', 'summary', 'space'])
print(df)
for row in df.iterrows():
    print(row[1].to_list())
which gives this output:
[1, 'url_a', 'scrap_a', 'name_a', 'summary_a', 'space_a']
[2, 'url_b', 'scrap_b', 'name_b', 'summary_b', 'space_b']
[3, 'url_c', 'scrap_c', 'name_c', 'summary_c', 'space_ac']
You can use
for row in df[['name', 'summary', 'space']].iterrows():
to iterate over specific columns only.
Like this:
In [245]: res = df.groupby('col_group', as_index=False)['col_name'].apply(list)
In [248]: res.tolist()
Out[248]: [['id', 'listing_url', 'scrape_id'], ['name', 'summary', 'space']]
my_vars = df.groupby('col_group').agg(list)['col_name'].to_dict()
Output:
>>> my_vars
{'Text': ['name', 'summary', 'space'], 'metadata': ['id', 'listing_url', 'scrape_id']}
The recommended usage would be just my_vars['Text'] to access the Text list, etc. If you must have these as distinct names, you can force them upon your target scope, e.g. globals:
globals().update(df.groupby('col_group').agg(list)['col_name'].to_dict())
Result:
>>> Text
['name', 'summary', 'space']
>>> metadata
['id', 'listing_url', 'scrape_id']
However I would advise against that as you might unwittingly overwrite some of your other objects, or they might not be in the proper scope you needed (e.g. locals).

Python Error: 'list' has no attribute 'mean'

I am trying to get the mean value for a list of percentages from an Excel file which has data. My current code is as follows:
import numpy as pd
data = pd.DataFrame =({'Percentages': [.20, .10, .05], 'Nationality':['American', 'Mexican', 'Russian'],
'Gender': ['Male', 'Female'], 'Question': ['They have good looks']})
pref = data[data.Nationality == 'American']
prefPref = pref.pivot_table(data.Percentage.mean(), index=['Question'], column='Gender')
The error is coming from where I try to get the .mean() from my ['Percentage'] list. So, how can I get the mean from the list of Percentages? Do I need to create a variable for the mean value, and if so how to I implement that into the code?
["Percentage"] is a list containging the single string item "Percentage". It isn't possible to calculate a mean from lists of text.
In addition, the method .mean() doesn't exist in Python for generic lists, have a look at numpy for calculating means and other mathematical operations.
For example:
import numpy
numpy.array([4,2,6,5]).mean()
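As a side note, if all you have is a plain Python list of numbers, the standard-library statistics module can compute the mean without numpy:

```python
import statistics

percentages = [0.20, 0.10, 0.05]
print(statistics.mean(percentages))  # ≈ 0.1167
```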
Here is a reworked version of your pd.pivot_table. See also How to pivot a dataframe.
import pandas as pd
import numpy as np

data = pd.DataFrame({'Percentages': [0.20, 0.10, 0.05],
                     'Nationality': ['American', 'American', 'Russian'],
                     'Gender': ['Male', 'Female', 'Male'],
                     'Question': ['Q1', 'Q2', 'Q3']})
pref = data[data['Nationality'] == 'American']
prefPref = pref.pivot_table(values='Percentages', index='Question',
                            columns='Gender', aggfunc='mean')
# Gender Female Male
# Question
# Q1 NaN 0.2
# Q2 0.1 NaN

Converting a set to a list with Pandas groupby agg function causes 'ValueError: Function does not reduce'

Sometimes, it seems that the more I use Python (and Pandas), the less I understand. So I apologise if I'm just not seeing the wood for the trees here but I've been going round in circles and just can't see what I'm doing wrong.
Basically, I have an example script (that I'd like to implement on a much larger dataframe) but I can't get it to work to my satisfaction.
The dataframe consists of columns of various datatypes. I'd like to group the dataframe on 2 columns and then produce a new dataframe that contains lists of all the unique values for each variable in each group. (Ultimately, I'd like to concatenate the list items into a single string – but that's a different question.)
The initial script I used was:
import numpy as np
import pandas as pd
def tempFuncAgg(tempVar):
    tempList = set(tempVar.dropna())  # Drop NaNs and create set of unique values
    print(tempList)
    return tempList
# Define dataframe
tempDF = pd.DataFrame({ 'id': [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
'date': ["02/04/2015 02:34","06/04/2015 12:34","09/04/2015 23:03","12/04/2015 01:00","15/04/2015 07:12","21/04/2015 12:59","29/04/2015 17:33","04/05/2015 10:44","06/05/2015 11:12","10/05/2015 08:52","12/05/2015 14:19","19/05/2015 19:22","27/05/2015 22:31","01/06/2015 11:09","04/06/2015 12:57","10/06/2015 04:00","15/06/2015 03:23","19/06/2015 05:37","23/06/2015 13:41","27/06/2015 15:43"],
'gender': ["male","female","female","male","male","female","female",np.nan,"male","male","female","male","female","female","male","female","male","female",np.nan,"male"],
'age': ["young","old","old","old","old","old",np.nan,"old","old","young","young","old","young","young","old",np.nan,"old","young",np.nan,np.nan]})
# Groupby based on 2 categorical variables
tempGroupby = tempDF.groupby(['gender','age'])
# Aggregate for each variable in each group using function defined above
dfAgg = tempGroupby.agg(lambda x: tempFuncAgg(x))
print(dfAgg)
The output from this script is as expected: a series of lines containing the sets of values and a dataframe containing the returned sets:
{'09/04/2015 23:03', '21/04/2015 12:59', '06/04/2015 12:34'}
{'01/06/2015 11:09', '12/05/2015 14:19', '27/05/2015 22:31', '19/06/2015 05:37'}
{'15/04/2015 07:12', '19/05/2015 19:22', '06/05/2015 11:12', '04/06/2015 12:57', '15/06/2015 03:23', '12/04/2015 01:00'}
{'02/04/2015 02:34', '10/05/2015 08:52'}
{2, 3, 6}
{18, 11, 13, 14}
{4, 5, 9, 12, 15, 17}
{1, 10}
date \
gender age
female old set([09/04/2015 23:03, 21/04/2015 12:59, 06/04...
young set([01/06/2015 11:09, 12/05/2015 14:19, 27/05...
male old set([15/04/2015 07:12, 19/05/2015 19:22, 06/05...
young set([02/04/2015 02:34, 10/05/2015 08:52])
id
gender age
female old set([2, 3, 6])
young set([18, 11, 13, 14])
male old set([4, 5, 9, 12, 15, 17])
young set([1, 10])
The problem occurs when I try to convert the sets to lists. Bizarrely, it produces 2 duplicated rows containing identical lists but then fails with a 'ValueError: Function does not reduce' error.
def tempFuncAgg(tempVar):
    tempList = list(set(tempVar.dropna()))  # This is the only difference
    print(tempList)
    return tempList
tempDF = pd.DataFrame({ 'id': [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
'date': ["02/04/2015 02:34","06/04/2015 12:34","09/04/2015 23:03","12/04/2015 01:00","15/04/2015 07:12","21/04/2015 12:59","29/04/2015 17:33","04/05/2015 10:44","06/05/2015 11:12","10/05/2015 08:52","12/05/2015 14:19","19/05/2015 19:22","27/05/2015 22:31","01/06/2015 11:09","04/06/2015 12:57","10/06/2015 04:00","15/06/2015 03:23","19/06/2015 05:37","23/06/2015 13:41","27/06/2015 15:43"],
'gender': ["male","female","female","male","male","female","female",np.nan,"male","male","female","male","female","female","male","female","male","female",np.nan,"male"],
'age': ["young","old","old","old","old","old",np.nan,"old","old","young","young","old","young","young","old",np.nan,"old","young",np.nan,np.nan]})
tempGroupby = tempDF.groupby(['gender','age'])
dfAgg = tempGroupby.agg(lambda x: tempFuncAgg(x))
print(dfAgg)
But now the output is:
['09/04/2015 23:03', '21/04/2015 12:59', '06/04/2015 12:34']
['09/04/2015 23:03', '21/04/2015 12:59', '06/04/2015 12:34']
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
...
ValueError: Function does not reduce
Any help to troubleshoot this problem would be appreciated and I apologise in advance if it's something obvious that I'm just not seeing.
EDIT
Incidentally, converting the set to a tuple rather than a list works with no problem.
Lists can sometimes cause odd problems in pandas. You can either:
Use tuples (as you've already noticed), or
If you really need lists, do the conversion in a second operation, like this:
dfAgg = dfAgg.applymap(list)
Full example:
import numpy as np
import pandas as pd
def tempFuncAgg(tempVar):
    tempList = set(tempVar.dropna())  # Drop NaNs and create set of unique values
    print(tempList)
    return tempList
# Define dataframe
tempDF = pd.DataFrame({ 'id': [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
'date': ["02/04/2015 02:34","06/04/2015 12:34","09/04/2015 23:03","12/04/2015 01:00","15/04/2015 07:12","21/04/2015 12:59","29/04/2015 17:33","04/05/2015 10:44","06/05/2015 11:12","10/05/2015 08:52","12/05/2015 14:19","19/05/2015 19:22","27/05/2015 22:31","01/06/2015 11:09","04/06/2015 12:57","10/06/2015 04:00","15/06/2015 03:23","19/06/2015 05:37","23/06/2015 13:41","27/06/2015 15:43"],
'gender': ["male","female","female","male","male","female","female",np.nan,"male","male","female","male","female","female","male","female","male","female",np.nan,"male"],
'age': ["young","old","old","old","old","old",np.nan,"old","old","young","young","old","young","young","old",np.nan,"old","young",np.nan,np.nan]})
# Groupby based on 2 categorical variables
tempGroupby = tempDF.groupby(['gender','age'])
# Aggregate for each variable in each group using function defined above
dfAgg = tempGroupby.agg(lambda x: tempFuncAgg(x))
# Convert the sets to lists (applymap returns a new frame, so reassign it)
dfAgg = dfAgg.applymap(list)
print(dfAgg)
There are many such bizarre behaviours in pandas; it is generally better to go with a workaround like this than to hunt for a perfect solution.
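To illustrate the tuple route from the EDIT, a condensed sketch on a smaller made-up frame: aggregate to tuples (which reduce cleanly), then convert to lists in a second step:

```python
import pandas as pd

tempDF = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'gender': ['male', 'female', 'male', 'female'],
    'age': ['young', 'old', 'young', 'old']})

# Tuples are accepted by .agg as a reduction, unlike lists
dfAgg = tempDF.groupby(['gender', 'age']).agg(lambda x: tuple(set(x.dropna())))

# applymap returns a new frame, so reassign to keep the list version
dfAgg = dfAgg.applymap(list)
print(dfAgg)
```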
