Avoid Getting Burned by OneHotEncoder: A Cool Hack for Newbies

Marina Saito
4 min read · Mar 14, 2021

There are plenty of blogs out there designed to benefit those who have already developed meaningful experience in the data science field. These blogs provide exquisite detail on highly technical topics. I have not yet achieved a level of skill that would enable me to understand, let alone draft, a blog that offers insights for intermediate and advanced data scientists. My blog is intended for the newbie, the beginner who may glean some insight from reading about my trial-and-error experiences. I launch my “If Only I Had Known” blog with a lesson I learned through some frustrating missteps I made while trying to use scikit-learn’s OneHotEncoder.

Those familiar with inferential modeling (a topic of which I am just scratching the surface, so my apologies if I mangle or misuse the terms or concepts here) will recognize that in order to analyze categorical data, the categories need to be transformed into numeric data. One form of transformation is “one hot encoding.”

One hot encoding creates a separate column for each category, and encodes each column with a one or a zero to denote whether that category applies. For example, in a home improvement analysis I recently performed, I analyzed whether the addition of a porch would increase the value of a home. My analysis included four porch categories: (1) no porch; (2) an open porch; (3) an enclosed porch; and (4) both an open and an enclosed porch. Because the four categories are mutually exclusive and collectively exhaustive, one, and only one, of the categories is applicable to each home. Therefore, if you know the values for three of the four columns, you can determine the value of the fourth. In other words, the value in that fourth column depends on the values in the other three.
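To make the idea concrete, here is a minimal sketch of one hot encoding the four porch categories. The labels "none", "open", "enclosed", and "both" and the five sample homes are made up for illustration; they are not the data from my analysis.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical porch labels, one row per home (illustrative data only).
porch = np.array([["none"], ["open"], ["enclosed"], ["both"], ["open"]])

# With default settings, no category is dropped.
encoder = OneHotEncoder()
encoded = encoder.fit_transform(porch).toarray()  # convert sparse result to a dense 0/1 array

print(encoder.categories_[0])  # the categories, one per output column
print(encoded)

# Exactly one category applies to each home, so every row sums to 1.
row_sums = encoded.sum(axis=1)
print(row_sums)
```

Because every row sums to one, any one column can be reconstructed from the other three, which is the dependence described above.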

Because the predictor columns in an inferential model should not be fully determined by one another, one of the four categories must be dropped in order to avoid a multicollinearity problem. Since I wanted to determine whether adding one or more porches would increase the value of a home, I wanted to drop the “no porch” category, making it the baseline, so that any model I derived would compare each of the other three categories to the “no porch” category. That comparison, in turn, would enable me to identify how much the value of a home would increase with the addition of one or more porches. So far, so good. I just needed to drop a category. That is where I ran into my problem with scikit-learn’s OneHotEncoder.
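One way to see the multicollinearity problem is to check the rank of the design matrix. The sketch below assumes a model with an intercept column (my assumption for illustration) and uses made-up porch labels. With all four dummy columns, the columns are not linearly independent; after one dummy column is removed, they are.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical porch labels (illustrative data only).
porch = np.array([["none"], ["open"], ["enclosed"], ["both"], ["open"], ["none"]])
dummies = OneHotEncoder().fit_transform(porch).toarray()

# Assumption: the model includes an intercept, represented by a column of ones.
intercept = np.ones((dummies.shape[0], 1))

full_design = np.hstack([intercept, dummies])            # intercept + 4 dummies = 5 columns
reduced_design = np.hstack([intercept, dummies[:, 1:]])  # intercept + 3 dummies = 4 columns

rank_full = np.linalg.matrix_rank(full_design)
rank_reduced = np.linalg.matrix_rank(reduced_design)

# A rank lower than the column count means at least one column is redundant.
print(full_design.shape[1], rank_full)
print(reduced_design.shape[1], rank_reduced)
```

Which column is removed does not matter for the rank; it only changes which category serves as the baseline for comparison.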

I started by using the option in which OneHotEncoder will drop the “first” column. But I had no idea which of my four columns was the “first” column that OneHotEncoder would drop. I went ahead and applied OneHotEncoder to my data anyway, hoping that it would somehow select the category I wanted to drop as the “first” column. Needless to say, that didn’t work.

In my next attempt, I noticed that OneHotEncoder allowed me to “feed” in a list of categories. I thought that since I could feed in the categories in an order I selected, it would recognize the category that I listed first in the feed and drop that category as the “first” column. Nope. That didn’t work, either. Rather than recognizing the order that I provided, OneHotEncoder (I learned after some trial and error) evidently had reorganized the categories and dropped the first category in the reorganized order that it had determined. But since I did not know how it was organizing the data, I could not control which category was dropped.

This is a detail that I was unable to find in the docstring for OneHotEncoder. Unable to find any information explaining how categories get dropped, I played with the program for a while, eventually figuring out that OneHotEncoder was sorting the categories alphabetically by category name. Accordingly, it would always drop the category listed first in alphabetical order.
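The behavior can be checked directly by inspecting the fitted encoder. This is a small sketch with made-up labels, where "none" stands in for my “no porch” category. The `categories_` attribute lists the categories in the order the encoder uses, and `drop_idx_` gives the position of the dropped one.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical porch labels, deliberately listed with "none" first.
porch = np.array([["none"], ["open"], ["enclosed"], ["both"]])

encoder = OneHotEncoder(drop="first")
encoded = encoder.fit_transform(porch).toarray()

# The encoder's own ordering of the categories (sorted, not input order).
all_categories = list(encoder.categories_[0])

# Position of the dropped category within that ordering.
drop_position = int(encoder.drop_idx_[0])
dropped = all_categories[drop_position]

print(all_categories)
print(dropped)
```

With these labels the dropped category is "both", not "none", even though "none" appears first in the data.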

Once I realized that the program was sorting alphabetically, I was able to control which column was dropped by renaming the category I wanted to drop so that its name (“aaa” was my go-to category name) sorted to the top of the alphabet and was therefore dropped as the “first” column.
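Here is a rough sketch of that renaming workaround, again with made-up labels. The only trick is replacing the label of the category to be dropped (here the hypothetical "none") with "aaa" before encoding.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical porch labels (illustrative data only).
porch = np.array([["none"], ["open"], ["enclosed"], ["both"]])

# Rename the category to be dropped so that it sorts first alphabetically.
renamed = np.where(porch == "none", "aaa", porch)

encoder = OneHotEncoder(drop="first")
encoded = encoder.fit_transform(renamed).toarray()

all_categories = list(encoder.categories_[0])
dropped = all_categories[int(encoder.drop_idx_[0])]
kept = [c for c in all_categories if c != dropped]

print(dropped)  # the renamed "no porch" category
print(kept)     # the three porch categories that remain as columns
```

It works, but the data now carries a label that means nothing to anyone reading it later.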

Having gone through that time-consuming and frustrating exercise, I later learned that there is an easier and more efficient way to control which column gets dropped. To avoid manipulating the naming protocol to accommodate OneHotEncoder’s alphabetical sorting process, I simply instructed OneHotEncoder not to drop any category column. After OneHotEncoder created the new columns, I could manually delete whichever category column I chose, without having to worry about category names.
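A minimal sketch of this approach follows. It assumes pandas is available and uses made-up labels; the "porch_" column prefix is just a naming choice for the example.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical porch labels (illustrative data only).
porch = np.array([["none"], ["open"], ["enclosed"], ["both"]])

# Encode without dropping anything.
encoder = OneHotEncoder()
encoded = encoder.fit_transform(porch).toarray()

# Label each column with its category so it can be deleted by name.
columns = [f"porch_{category}" for category in encoder.categories_[0]]
dummies = pd.DataFrame(encoded, columns=columns)

# Manually delete the baseline category, whatever its alphabetical position.
dummies = dummies.drop(columns="porch_none")

print(dummies)
```

The original category names stay intact, and the baseline is chosen explicitly rather than by alphabetical accident.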

This tip may not impress the data processing gurus of the world, but hopefully it will help another newbie avoid a couple of hours of frustration when trying to accomplish a task as seemingly simple as dropping a category in one hot encoding.
