OneHotEncoder on only a single feature which is a string - Python

I want ONLY ONE of my features to be converted to separate binary features:
df["pattern_id"]
Out[202]:
0 3
1 3
...
7440 2
7441 2
7442 3
Name: pattern_id, Length: 7443, dtype: int64
df["pattern_id"]
Out[202]:
0 0 0 1
1 0 0 1
...
7440 0 1 0
7441 0 1 0
7442 0 0 1
Name: pattern_id, Length: 7443, dtype: int64
I want to use OneHotEncoder; the data is int, so there's no need to label-encode it first:
onehotencoder = OneHotEncoder(categorical_features=["pattern_id"])
df = onehotencoder.fit_transform(df).toarray()
ValueError: could not convert string to float: 'http://www.zaragoza.es/sedeelectronica/'
Interestingly enough, I receive an error: sklearn tried to encode another column, not the one I wanted.
We have to encode pattern_id to be an integer value
I used this link: Issue with OneHotEncoder for categorical features
#transform the pattern_id feature to int
encoding_feature = ["pattern_id"]
enc = LabelEncoder()
enc.fit(encoding_feature)
working_feature = enc.transform(encoding_feature)
working_feature = working_feature.reshape(-1, 1)
ohe = OneHotEncoder(sparse=False)
#convert the pattern_id feature to separate binary features
onehotencoder = OneHotEncoder(categorical_features=working_feature, sparse=False)
df = onehotencoder.fit_transform(df).toarray()
And I get the same error. What am I doing wrong?
Edit
source:
https://github.com/martin-varbanov96/scraper/blob/master/logo_scrape/logo_scrape/analysis.py
df
Out[259]:
   found_img  is_http                                           link_img  \
0       True        0                                  img/aahoteles.svg
           //www.zaragoza.es/cont/paginas/img/sede/logo_e...

   pattern_id                                   current_link  site_id  \
0           3         https://www.aa-hoteles.com/es/reservas        3
6           3  https://www.aa-hoteles.com/es/ofertas-hoteles        3
7           2      http://about.pressreader.com/contact-us/        4
8           3      http://about.pressreader.com/contact-us/        4

    status                       link_id
0      200   https://www.aa-hoteles.com/
1      200   https://www.365travel.asia/
2      200   https://www.365travel.asia/
3      200   https://www.365travel.asia/
4      200   https://www.aa-hoteles.com/
5      200   https://www.aa-hoteles.com/
6      200   https://www.aa-hoteles.com/
7      200   http://about.pressreader.com
8      200   http://about.pressreader.com
9      200   https://www.365travel.asia/
10     200   https://www.365travel.asia/
11     200   https://www.365travel.asia/
12     200   https://www.365travel.asia/
13     200   https://www.365travel.asia/
14     200   https://www.365travel.asia/
15     200   https://www.365travel.asia/
16     200   https://www.365travel.asia/
17     200   https://www.365travel.asia/
18     200   http://about.pressreade

[7443 rows x 8 columns]

If you take a look at the documentation for OneHotEncoder, you can see that the categorical_features argument expects '“all” or array of indices or mask', not a string. You can make your code work by changing it to the following lines:
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Create a dataframe of random ints
df = pd.DataFrame(np.random.randint(0, 4, size=(100, 4)),
                  columns=['pattern_id', 'B', 'C', 'D'])

onehotencoder = OneHotEncoder(categorical_features=[df.columns.tolist().index('pattern_id')])
df = onehotencoder.fit_transform(df)
However, df will no longer be a DataFrame; I would suggest working directly with the NumPy arrays. (Note that the categorical_features argument only exists in older scikit-learn versions; it was later removed in favour of ColumnTransformer.)

You can also do it like this
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
onehotenc = OneHotEncoder()
X = onehotenc.fit_transform(df.required_column.values.reshape(-1, 1)).toarray()
We need to reshape the column because fit_transform requires a 2-D array. Then you can add columns to this NumPy array and merge it with your DataFrame.
Seen from this link here
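If you then want those binary columns back inside your DataFrame, a minimal sketch of that merge step (the toy data and the pattern_* column names are assumptions for illustration, not from the original post):
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# toy stand-in for the real data
df = pd.DataFrame({'pattern_id': [3, 3, 2, 2, 3],
                   'current_link': ['a', 'b', 'c', 'd', 'e']})

onehotenc = OneHotEncoder()
X = onehotenc.fit_transform(df.pattern_id.values.reshape(-1, 1)).toarray()

# wrap the binary columns in a DataFrame and join them back on the original index
dummies = pd.DataFrame(X,
                       columns=[f'pattern_{c}' for c in onehotenc.categories_[0]],
                       index=df.index)
df = pd.concat([df.drop(columns='pattern_id'), dummies], axis=1)
print(df)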

The recommended way to work with different column types is detailed in the sklearn documentation here.
Representative example:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_features = ['age', 'fare']
numeric_transformer = Pipeline(steps=[('scaler', StandardScaler())])

categorical_features = ['embarked', 'sex', 'pclass']
categorical_transformer = OneHotEncoder(handle_unknown='ignore')

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])
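Continuing that example, the preprocessor is typically chained with an estimator in a Pipeline and fitted on the raw DataFrame. A minimal sketch (X_train, y_train, and X_test stand for your own data split, and LogisticRegression is just an illustrative choice):
from sklearn.linear_model import LogisticRegression

clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression())])
clf.fit(X_train, y_train)         # raw DataFrame in; preprocessing happens inside the pipeline
predictions = clf.predict(X_test)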

Related

How to convert the data type from object to numeric and then find the mean for each row in pandas? E.g. convert '<17,500, >=15,000' to 16250 (mean value)

data['family_income'].value_counts()
>=35,000 2517
<27,500, >=25,000 1227
<30,000, >=27,500 994
<25,000, >=22,500 833
<20,000, >=17,500 683
<12,500, >=10,000 677
<17,500, >=15,000 634
<15,000, >=12,500 629
<22,500, >=20,000 590
<10,000, >= 8,000 563
< 8,000, >= 4,000 402
< 4,000 278
Unknown 128
The column should be shown as a mean value instead of a range of values:
data['family_income']
0 <17,500, >=15,000
1 <27,500, >=25,000
2 <30,000, >=27,500
3 <15,000, >=12,500
4 <30,000, >=27,500
...
10150 <30,000, >=27,500
10151 <25,000, >=22,500
10152 >=35,000
10153 <10,000, >= 8,000
10154 <27,500, >=25,000
Name: family_income, Length: 10155, dtype: object
Desired output, as mean-imputed values:
0 16250
1 26250
3 28750
...
10152 35000
10153 9000
10154 26500
data['family_income']=data['family_income'].str.replace(',', ' ').str.replace('<',' ')
data[['income1','income2']] = data['family_income'].apply(lambda x: pd.Series(str(x).split(">=")))
data['income1']=pd.to_numeric(data['income1'], errors='coerce')
data['income1']
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
..
10150 NaN
10151 NaN
10152 NaN
10153 NaN
10154 NaN
Name: income1, Length: 10155, dtype: float64
In this case, conversion of the datatype from object to numeric doesn't seem to work, since all the values are returned as NaN. So, how can I convert to a numeric data type and find the mean-imputed values?
The to_numeric call returns NaN because replacing ',' and '<' with spaces leaves spaces embedded inside the numbers (e.g. ' 17 500 '), which cannot be parsed. You can use the following snippet instead:
# Importing Dependencies
import pandas as pd
import string
# Replicating Your Data
data = ['<17,500, >=15,000', '<27,500, >=25,000', '< 4,000 ', '>=35,000']
df = pd.DataFrame(data, columns = ['family_income'])
# Removing punctuation from family_income column
df['family_income'] = df['family_income'].apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)))
# Splitting ranges to two columns A and B
df[['A', 'B']] = df['family_income'].str.split(' ', 1, expand=True)
# Converting cols A and B to float
df[['A', 'B']] = df[['A', 'B']].apply(pd.to_numeric)
# Creating mean column from A and B
df['mean'] = df[['A', 'B']].mean(axis=1)
# Input DataFrame
family_income
0 <17,500, >=15,000
1 <27,500, >=25,000
2 < 4,000
3 >=35,000
# Result DataFrame
mean
0 16250.0
1 26250.0
2 4000.0
3 35000.0

Preserve column order after applying sklearn.compose.ColumnTransformer

I'm using the Pipeline and ColumnTransformer modules from the sklearn library to perform feature engineering on my dataset.
The dataset initially looks like this:
date        date_block_num  shop_id  item_id  item_price
02.01.2013  0               59       22154    999.00
03.01.2013  0               25       2552     899.00
05.01.2013  0               25       2552     899.00
06.01.2013  0               25       2554     1709.05
15.01.2013  0               25       2555     1099.00
$> data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2935849 entries, 0 to 2935848
Data columns (total 6 columns):
 #   Column          Dtype
---  ------          -----
 0   date            object
 1   date_block_num  object
 2   shop_id         object
 3   item_id         object
 4   item_price      float64
dtypes: float64(2), int64(3), object(1)
memory usage: 134.4+ MB
Then I have the following transformations:
num_column_transformer = ColumnTransformer(
    transformers=[
        ("std_scaler", StandardScaler(), make_column_selector(dtype_include=np.number)),
    ],
    remainder="passthrough"
)

num_pipeline = Pipeline(
    steps=[
        ("percent_item_cnt_day_per_shop", PercentOverTotalAttributeWholeAdder(
            attribute_percent_name="shop_id",
            attribute_total_name="item_cnt_day",
            new_attribute_name="%_item_cnt_day_per_shop")
         ),
        ("percent_item_cnt_day_per_item", PercentOverTotalAttributeWholeAdder(
            attribute_percent_name="item_id",
            attribute_total_name="item_cnt_day",
            new_attribute_name="%_item_cnt_day_per_item")
         ),
        ("percent_sales_per_shop", SalesPerAttributeOverTotalSalesAdder(
            attribute_percent_name="shop_id",
            new_attribute_name="%_sales_per_shop")
         ),
        ("percent_sales_per_item", SalesPerAttributeOverTotalSalesAdder(
            attribute_percent_name="item_id",
            new_attribute_name="%_sales_per_item")
         ),
        ("num_column_transformer", num_column_transformer),
    ]
)
The first four Transformers create four new different numeric variables and the last one applies StandardScaler over all the numerical values of the dataset.
After executing it, I get the following data:
         0         1         2         3         4           5  6   7      8
 -0.092652 -0.765612 -0.173122 -0.756606 -0.379775  02.01.2013  0  59  22154
 -0.092652  1.557684 -0.175922  1.563224 -0.394319  03.01.2013  0  25   2552
 -0.856351  1.557684 -0.175922  1.563224 -0.394319  05.01.2013  0  25   2552
 -0.092652  1.557684 -0.17613   1.563224 -0.396646  06.01.2013  0  25   2554
 -0.092652  1.557684 -0.173278  1.563224 -0.380647  15.01.2013  0  25   2555
I'd like to have the following output:
date        date_block_num  shop_id  item_id  item_price  %_item_cnt_day_per_shop  %_item_cnt_day_per_item  %_sales_per_shop  %_sales_per_item
02.01.2013  0               59       22154    -0.092652   -0.765612                -0.173122                -0.756606         -0.379775
03.01.2013  0               25       2552     -0.092652   1.557684                 -0.175922                1.563224          -0.394319
05.01.2013  0               25       2552     -0.856351   1.557684                 -0.175922                1.563224          -0.394319
06.01.2013  0               25       2554     -0.092652   1.557684                 -0.17613                 1.563224          -0.396646
15.01.2013  0               25       2555     -0.092652   1.557684                 -0.173278                1.563224          -0.380647
As you can see, columns 5, 6, 7, and 8 of the output correspond to the first four columns of the original dataset. For example, I don't know where the item_price feature lies in the output table.
How could I preserve the column order and names? After that, I want to do feature engineering over categorical variables and my Transformers make use of the feature column name.
Am I using the Scikit-Learn API correctly?
There's one point to be aware of when dealing with ColumnTransformer, which is reported within the doc as follows:
The order of the columns in the transformed feature matrix follows the order of how the columns are specified in the transformers list.
That's the reason why your ColumnTransformer instance messes things up. Indeed, consider this simplified example which resembles your setting:
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

df = pd.DataFrame({
    'date': ['02.01.2013', '03.01.2013', '05.01.2013', '06.01.2013', '15.01.2013'],
    'date_block_num': ['0', '0', '0', '0', '0'],
    'shop_id': ['59', '25', '25', '25', '25'],
    'item_id': ['22514', '2252', '2252', '2254', '2255'],
    'item_price': [999.00, 899.00, 899.00, 1709.05, 1099.00]})

ct = ColumnTransformer([
    ('std_scaler', StandardScaler(), make_column_selector(dtype_include=np.number))],
    remainder='passthrough')

pd.DataFrame(ct.fit_transform(df), columns=ct.get_feature_names_out())
As you might notice, the first column in the transformed dataframe turns out to be the numeric one, i.e. the one which undergoes the scaling (and the first in the transformers list).
Conversely, here's an example of how you can bypass this issue by postponing the scaling of numeric variables until after all the string variables have been passed through, thus ensuring the columns end up in your desired order:
ct = ColumnTransformer([
    ('pass', 'passthrough', make_column_selector(dtype_include=object)),
    ('std_scaler', StandardScaler(), make_column_selector(dtype_include=np.number))
])

pd.DataFrame(ct.fit_transform(df), columns=ct.get_feature_names_out())
To complete the picture, here is an attempt to reproduce your Pipeline (though the custom transformer is for sure slightly different from yours):
from sklearn.base import BaseEstimator, TransformerMixin

class PercentOverTotalAttributeWholeAdder(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_percent_name='shop_id', new_attribute_name='%_item_cnt_day_per_shop'):
        self.attribute_percent_name = attribute_percent_name
        self.new_attribute_name = new_attribute_name

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # count rows per group and divide by the total number of rows
        X[self.new_attribute_name] = X.groupby(by=self.attribute_percent_name)[self.attribute_percent_name].transform('count') / X.shape[0]
        return X
ct_pipe = ColumnTransformer([
    ('pass', 'passthrough', make_column_selector(dtype_include=object)),
    ('std_scaler', StandardScaler(), make_column_selector(dtype_include=np.number))
], verbose_feature_names_out=False)

pipe = Pipeline([
    ('percent_item_cnt_day_per_shop', PercentOverTotalAttributeWholeAdder(
        attribute_percent_name='shop_id',
        new_attribute_name='%_item_cnt_day_per_shop')
     ),
    ('percent_item_cnt_day_per_item', PercentOverTotalAttributeWholeAdder(
        attribute_percent_name='item_id',
        new_attribute_name='%_item_cnt_day_per_item')
     ),
    ('column_trans', ct_pipe),
])

pd.DataFrame(pipe.fit_transform(df), columns=pipe[-1].get_feature_names_out())
As a final remark, observe that the verbose_feature_names_out=False parameter ensures that the names of the columns of the transformed dataframe do not show prefixes which refer to the different transformers in ColumnTransformer.
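For instance, on the toy df above, the prefixed and unprefixed names look like this (a sketch reusing the imports, df, and ct_pipe defined earlier in this answer; outputs shown as comments):
ct_prefixed = ColumnTransformer([
    ('pass', 'passthrough', make_column_selector(dtype_include=object)),
    ('std_scaler', StandardScaler(), make_column_selector(dtype_include=np.number))
])  # verbose_feature_names_out defaults to True
ct_prefixed.fit(df)
print(ct_prefixed.get_feature_names_out())
# ['pass__date' 'pass__date_block_num' 'pass__shop_id' 'pass__item_id'
#  'std_scaler__item_price']

print(ct_pipe.fit(df).get_feature_names_out())  # verbose_feature_names_out=False
# ['date' 'date_block_num' 'shop_id' 'item_id' 'item_price']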
Answer using scikit-learn 1.2.1
In scikit-learn 1.2 it's possible to set the output of the ColumnTransformer to a pandas DataFrame, avoiding that conversion as a second step. Besides this, in the answer proposed by @amiola the ColumnTransformer makes use of a passthrough phase to preserve the order of string-type columns with respect to numeric ones, but this only works if all the string-type columns come before the numeric ones. To show this, I use the same example with the shop_id column converted to numeric:
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

df = pd.DataFrame({
    'date': ['02.01.2013', '03.01.2013', '05.01.2013', '06.01.2013', '15.01.2013'],
    'date_block_num': ['0', '0', '0', '0', '0'],
    'shop_id': [59, 25, 25, 25, 25],
    'item_id': ['22514', '2252', '2252', '2254', '2255'],
    'item_price': [999.00, 899.00, 899.00, 1709.05, 1099.00]})

ct = ColumnTransformer([
    ('pass', 'passthrough', make_column_selector(dtype_include=object)),
    ('std_scaler', StandardScaler(), make_column_selector(dtype_include=np.number))
]).set_output(transform='pandas')

out_df = ct.fit_transform(df)
out_df
   pass__date  pass__date_block_num  pass__item_id  std_scaler__shop_id  std_scaler__item_price
0  02.01.2013  0                     22514          2.0                  -0.402369
1  03.01.2013  0                     2252           -0.5                 -0.732153
2  05.01.2013  0                     2252           -0.5                 -0.732153
3  06.01.2013  0                     2254           -0.5                  1.939261
4  15.01.2013  0                     2255           -0.5                 -0.072585
As you can see, the shop_id column is moved towards the end, for the same reason explained in amiola's answer (i.e. columns are reordered following the order of the transformers in the ColumnTransformer). To overcome this issue, you can reorder the DataFrame columns after the transformation, with verbose_feature_names_out set to False to preserve the original column names (beware that those names must be unique, see the docs). There's also no need to create a specific passthrough step.
ct = ColumnTransformer([
    ('std_scaler', StandardScaler(), make_column_selector(dtype_include=np.number))],
    remainder='passthrough',
    verbose_feature_names_out=False).set_output(transform='pandas')

out_df = ct.fit_transform(df)
out_df = out_df[df.columns]
out_df
   date        date_block_num  shop_id  item_id  item_price
0  02.01.2013  0               2.0      22514    -0.402369
1  03.01.2013  0               -0.5     2252     -0.732153
2  05.01.2013  0               -0.5     2252     -0.732153
3  06.01.2013  0               -0.5     2254      1.939261
4  15.01.2013  0               -0.5     2255     -0.072585

Getting an array with empty values in Flask but correct values in a Python notebook

I am making a basic web application which takes the inputs for a logistic regression model and returns the class in which it lies. Here is the code for the prediction:
test_data = pd.Series([battery_power, blue, clock_speed, dual_sim, fc, four_g,
                       int_memory, m_dep, mobile_wt, n_cores, pc, px_height,
                       px_width, ram, sc_h, sc_w, talk_time, three_g,
                       touch_screen, wifi])

df = pd.read_csv("Users\ADMIN\Desktop\project\mobiledata_clean.csv")
df.drop(['Unnamed: 0', 'price_range'], inplace=True, axis=1)
print(df)
print(test_data)

# scaling the values
xpred = np.array((test_data - df.min()) / (df.max() - df.min())).reshape(1, -1)
print(xpred)
the test_data is:
0 842
1 0
2 2.2
3 0
4 1
5 0
6 7
7 0.6
8 188
9 2
10 2
11 20
12 756
13 2549
14 9
15 7
16 19
17 0
18 0
19 1
dtype: object
Here's the dataframe in df:
I get a (1, 40) array of null values in the xpred variable. Can someone tell me why this is happening?
The data type of test_data is showing as object; try casting it to float and then doing the operations.
For series: s.astype('int64')
For dataframe: df.astype({'col1': 'int64'})
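A minimal sketch of that suggestion; note that the Series also needs to share the DataFrame's column labels as its index, otherwise the element-wise arithmetic aligns on mismatched labels and produces NaN (the two columns below are hypothetical stand-ins for the real CSV):
import numpy as np
import pandas as pd

# hypothetical training data standing in for mobiledata_clean.csv
df = pd.DataFrame({'battery_power': [500, 1500], 'clock_speed': [0.5, 2.5]})

# build the test sample with the SAME labels as the training columns,
# then cast to float so the min-max arithmetic works element-wise
test_data = pd.Series([842, 2.2], index=df.columns).astype('float64')

xpred = np.array((test_data - df.min()) / (df.max() - df.min())).reshape(1, -1)
print(xpred)  # scaled values, no NaNs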

NaN values introduced into test and train data

I am working on a Data Science project with the FIFA dataset. I cleaned the data and took care of any NaN values to get it ready to be split into test and train sets. I need to use StratifiedShuffleSplit in order to split the data. I updated to a cleaner way to divide the value data into groups, but I am still getting NaN values once it goes through the split.
Link to the data set I am using: https://www.kaggle.com/karangadiya/fifa19
n = fifa['value'].count()
folds = 3
fifa.sort_values('value', ascending=False, inplace=True)
fifa['group_id'] = np.floor(np.arange(n)/folds)
fifa['value_cat'] = fifa.groupby('group_id', as_index = False)['name'].transform(lambda x: np.random.choice(v_cats, size=x.size, replace = False))
At this point, when I check the test and train data, I now have mystery NaN values introduced. I think the NaN values may be a result of .loc, since I am getting a warning in Jupyter.
c:\python37\lib\site-packages\ipykernel_launcher.py:6: FutureWarning:
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.
Code below:
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(fifa, fifa['value_cat']):
    strat_train_set = fifa.loc[train_index]
    strat_test_set = fifa.loc[test_index]

fifa = strat_train_set.drop('value', axis=1)
value_labels = strat_train_set['value'].copy()
PLEASE HELP MY POOR SOUL!!
Here's one solution.
import numpy as np
import pandas as pd
n = 100
folds = 3
# Make some data
df = pd.DataFrame({'id':np.arange(n), 'value':np.random.lognormal(mean=10, sigma=1, size=n)})
# Sort by value
df.sort_values('value', ascending=False, inplace=True)
# Insert 'group' ids, 0, 0, 0, 1, 1, 1, 2, 2, 2, ...
df['group_id'] = np.floor(np.arange(n)/folds)
# Randomly assign folds within each group
df['fold'] = df.groupby('group_id', as_index=False)['id'].transform(lambda x: np.random.choice(folds, size=x.size, replace=False))
# Inspect
df.head(10)
id value group_id fold
46 46 208904.679048 0.0 0
3 3 175730.118616 0.0 2
0 0 137067.103600 0.0 1
87 87 101894.243831 1.0 2
11 11 100570.573379 1.0 1
90 90 93681.391254 1.0 0
73 73 92462.150435 2.0 2
13 13 90349.408620 2.0 1
86 86 87568.402021 2.0 0
88 88 82581.010789 3.0 1
Assuming you want k folds, the idea is to sort the data by value, then randomly assign folds 1, 2, ..., k to the first k rows, then do the same to the next k rows, etc.
By the way, you will have more luck getting answers to questions here if you can create reproducible examples with data that make it easy for others to tinker with. :)
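If the goal is then a stratified train/test split, one sketch that continues from the df above (assuming one fold, say fold 0, is held out as the test set):
test_mask = df['fold'] == 0                                   # hold out one fold as the test set
strat_train_set = df.loc[~test_mask].drop(columns=['group_id', 'fold'])
strat_test_set = df.loc[test_mask].drop(columns=['group_id', 'fold'])
print(len(strat_train_set), len(strat_test_set))              # roughly a 2:1 split with 3 folds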

python pandas text block to data frame mixed types

I am a python and pandas newbie. I have a text block that has data arranged in columns. The data in the first six columns are integers and the rest are floating point. I tried to create two DataFrames that I could then concatenate:
from pandas import DataFrame

sect1 = DataFrame(dtype=int)
sect2 = DataFrame(dtype=float)
i = 0
# The first 26 lines are header text
for line in txt[26:]:
    colmns = line.split()
    sect1[i] = colmns[:6]  # Columns with integers
    sect2[i] = colmns[6:]  # Columns with floating point
    i += 1
This causes an AssertionError: Length of values does not match length of index
Here are two lines of data
2013 11 15 0000 56611 0 1.36e+01 3.52e-01 7.89e-02 4.33e-02 3.42e-02 1.76e-02 2.89e+04 5.72e+02 -1.00e+05
2013 11 15 0005 56611 300 1.08e+01 5.50e-01 2.35e-01 4.27e-02 3.35e-02 1.70e-02 3.00e+04 5.50e+02 -1.00e+05
Thanks in advance for the help.
You can use the pandas CSV parser along with StringIO; there's an example in the pandas documentation.
For your sample, that will be:
>>> import pandas as pd
>>> from StringIO import StringIO
>>> data = """2013 11 15 0000 56611 0 1.36e+01 3.52e-01 7.89e-02 4.33e-02 3.42e-02 1.76e-02 2.89e+04 5.72e+02 -1.00e+05
... 2013 11 15 0005 56611 300 1.08e+01 5.50e-01 2.35e-01 4.27e-02 3.35e-02 1.70e-02 3.00e+04 5.50e+02 -1.00e+05"""
Load data
>>> df = pd.read_csv(StringIO(data), sep=r'\s+', header=None)
Convert the first three columns to a datetime (optional)
>>> df[0] = df.iloc[:,:3].apply(lambda x:'{}.{}.{}'.format(*x), axis=1).apply(pd.to_datetime)
>>> del df[1]
>>> del df[2]
>>> df
0 3 4 5 6 7 8 9 10 \
0 2013-11-15 00:00:00 0 56611 0 13.6 0.352 0.0789 0.0433 0.0342
1 2013-11-15 00:00:00 5 56611 300 10.8 0.550 0.2350 0.0427 0.0335
11 12 13 14
0 0.0176 28900 572 -100000
1 0.0170 30000 550 -100000
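The same approach works on Python 3, where StringIO lives in the io module; a minimal sketch with a shortened version of the sample data:
import pandas as pd
from io import StringIO  # Python 3 location of StringIO

data = """2013 11 15 0000 56611 0 1.36e+01 3.52e-01
2013 11 15 0005 56611 300 1.08e+01 5.50e-01"""

df = pd.read_csv(StringIO(data), sep=r'\s+', header=None)
print(df.dtypes)  # the first six columns parse as int64, the rest as float64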
