Channel: How to perform OneHotEncoding in Sklearn, getting value error - Stack Overflow

Answer by Ken Syme for How to perform OneHotEncoding in Sklearn, getting value error


An alternative, if you do want to encode multiple categorical features, is to use a Pipeline with a FeatureUnion and a couple of custom transformers.

First, you need two transformers: one that selects a single column, and one that makes LabelEncoder usable in a Pipeline. (LabelEncoder's fit_transform method only takes a single argument, but a Pipeline calls it with both X and y, so the wrapper has to accept an optional y.)

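    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.preprocessing import LabelEncoder


    class SingleColumnSelector(TransformerMixin, BaseEstimator):
        """Select one column of a 2-D array and return it as a column vector."""

        def __init__(self, column):
            self.column = column

        def fit(self, X, y=None):
            return self

        def transform(self, X, y=None):
            return X[:, self.column].reshape(-1, 1)


    class PipelineAwareLabelEncoder(TransformerMixin, BaseEstimator):
        """LabelEncoder wrapper whose methods accept the optional y a Pipeline passes."""

        def fit(self, X, y=None):
            # Learn the label-to-integer mapping once, at fit time, so that
            # transform() reuses it instead of re-fitting on every call.
            # LabelEncoder expects 1-D input, hence the ravel().
            self.encoder_ = LabelEncoder().fit(X.ravel())
            return self

        def transform(self, X, y=None):
            return self.encoder_.transform(X.ravel()).reshape(-1, 1)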
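The reason a wrapper is needed can be checked directly. The snippet below is a small sketch, not part of the original answer: it uses the country labels from the example data and calls LabelEncoder the way a Pipeline would call an intermediate step, with both X and y. The `accepts_y` flag is a name introduced here purely for illustration.

```python
from sklearn.preprocessing import LabelEncoder

labels = ["AUS", "USA", "IND", "USA"]

# Called with a single argument, LabelEncoder works as expected:
# classes are sorted, so AUS -> 0, IND -> 1, USA -> 2.
codes = LabelEncoder().fit_transform(labels)

# A Pipeline calls fit_transform(X, y) on each intermediate step.
# LabelEncoder's fit_transform takes only one argument, so this fails.
try:
    LabelEncoder().fit_transform(labels, None)
    accepts_y = True
except TypeError:
    accepts_y = False

print(codes.tolist())  # [0, 2, 1, 2]
print(accepts_y)       # False
```

The TypeError is why the wrapper class defines fit and transform with an optional y and lets TransformerMixin supply a compatible fit_transform.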

Next, create a Pipeline (or just a FeatureUnion) with two branches, one for each categorical column. Each branch selects one column, encodes its labels as integers, and then one-hot encodes them.

    import pandas as pd
    import numpy as np
    from sklearn.preprocessing import LabelEncoder, OneHotEncoder, FunctionTransformer
    from sklearn.pipeline import Pipeline, make_pipeline, FeatureUnion

    pipeline = Pipeline([
        ('encoded_features', FeatureUnion([
            ('countries', make_pipeline(
                SingleColumnSelector(0),
                PipelineAwareLabelEncoder(),
                OneHotEncoder()
            )),
            ('names', make_pipeline(
                SingleColumnSelector(1),
                PipelineAwareLabelEncoder(),
                OneHotEncoder()
            ))
        ]))
    ])

Finally, run your full DataFrame through the Pipeline. It one-hot encodes each column separately and concatenates the results at the end.

    df = pd.DataFrame(
        [["AUS", "Sri"], ["USA", "Vignesh"], ["IND", "Pechi"], ["USA", "Raj"]],
        columns=['Country', 'Name']
    )
    X = df.values
    transformed_X = pipeline.fit_transform(X)
    print(transformed_X.toarray())

Which returns the following (the first 3 columns are the countries, the last 4 are the names):

    [[ 1.  0.  0.  0.  0.  1.  0.]
     [ 0.  0.  1.  0.  0.  0.  1.]
     [ 0.  1.  0.  1.  0.  0.  0.]
     [ 0.  0.  1.  0.  1.  0.  0.]]
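To see which label each output column stands for, the same matrix can be rebuilt by hand. This is a NumPy-only sketch for illustration, not what scikit-learn runs internally; `one_hot_column` is a hypothetical helper introduced here. It relies on the fact that both `np.unique` and LabelEncoder order labels by sorting them.

```python
import numpy as np

X = np.array([["AUS", "Sri"], ["USA", "Vignesh"], ["IND", "Pechi"], ["USA", "Raj"]])


def one_hot_column(values):
    """Return the sorted distinct labels and a one-hot block for one column."""
    # np.unique gives the sorted labels and, for each row, the index of
    # its label in that sorted list (the same integers LabelEncoder assigns).
    labels, codes = np.unique(values, return_inverse=True)
    # Row i of the identity matrix is the one-hot vector for code i.
    return labels, np.eye(len(labels))[codes]


country_labels, country_block = one_hot_column(X[:, 0])
name_labels, name_block = one_hot_column(X[:, 1])

# Concatenate the per-column blocks side by side, as the FeatureUnion does.
manual = np.hstack([country_block, name_block])
column_names = country_labels.tolist() + name_labels.tolist()

print(column_names)
# ['AUS', 'IND', 'USA', 'Pechi', 'Raj', 'Sri', 'Vignesh']
print(manual)
```

Reading the first row against these names: "AUS" sets column 0 and "Sri" sets column 5, which matches the first row of the output above.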
