Steps for a Machine Learning project

This is a short note of steps to follow when doing a machine learning project.

Main steps

In a machine learning (ML) project, we work normally within 7 follwing steps:

  • Step 1: Gather the data.
  • Step 2: Prepare the data including dealing with missing values and categorical data.
  • Step 3: Select the model there are many types of model can be applied in a machine learning project. It depends on the problem and the objective of your project. When we get started, we should try all possible models to see wether which one gives the most efficient result as we desire.
  • Step 4: Train the model fit the model chosen with the training data.
  • Step 5: Evaluate the model use the validation set to evaluate how well our model fits with the unseen data.
  • Step 6: Tune parameters tune the model’s parameters to get better performance. A method usually used is XGBoost.
  • Step 7: Get predictions generate the trained model to get the predictions for the test set.

Prepare the data: Dealing with missing data and categorical data

Missing data

There are 3 approches to dealing with the missing data:

  1. Simple option: Drop all the columns containing missing entries.
  2. Imputaion: We fill the missing values (NAs) with appropriate numbers, for instant, the mean of all values along the column. This method can lead to better models than the ones of dropping the entire column.
  3. Extension imputation: Missing values may play an important role for the prediction and may be unique in some way. In order not to underestimate the missing value, we add a new column precising which information were originally missing.

Categorical data

There are 3 approaches to dealing with categorical variables:

  1. Simply drop the categorical variables from the data set. This way may remove the useful information.
  2. Label Encoding: we assign each class with an integer. For example, in a survey of asking for colors of smartphone line including 3 colors: black, white, red, the answer ‘black’ is assigned to $0$; ‘white’ to $1$ and ‘red’ to $2$.
  3. One-hot Encoding: we create new columns indicating the presence (or absence) of each possible value.

One-hot encoding doesn’t make importance of raning the labels. So, this approach is usually used when the order of categories is not taken into account. Furthermore, the one-hot encoding becomes complicated if the categorical variables take so many values. Normally, we don’t use this method when the number of categories is bigger than 15.

Tune parameters

XGBoost - Extreme gradient boost

The XGBoost libary is usually used for working with standard tabular data such as data stored in the Pandas DataFrames, opposed to more exotic types of data like videos or images.

Top