The 5 Phases of Every Machine Learning Project

Machine learning and predictive analytics are pervasive in our lives today. AI impacts nearly everything we do and interact with, including retail and wholesale pricing, consumer habits and behaviors, marketing and advertising, politics, entertainment, sports, medicine, business logistics and planning, fraud and risk detection, airline and truck route planning, gaming, speech recognition, image recognition, self-driving cars, and robotics.

Yet whether you are building a self-driving car, predicting customer churn, or creating a product recommendation system, all machine learning projects follow the same process and the same five basic phases.

Phase 1: Data Collection

Data is the new oil. It is quickly becoming the most valuable commodity in the world. Data is like oil because it fuels machine learning projects. Without data, there is no machine learning and no predictive analytics. And just like grades of oil, there are grades of data. Supreme data is like rocket fuel for machine learning models, and buyers pay a premium for it. Just as physical oil must be removed from the ground before it can be refined, data must be extracted from its storage devices before it can be processed. Gathering that data is typically the task of a data analyst.

The process of data collection can be very quick and easy, or it can be complicated and painstaking. When data is stored in one location, like a relational database, the process of extracting the data is straightforward and could be accomplished within hours. However, this is often not the reality. Usually, data is spread over many systems and storage devices. For example, a data analyst may gather data from spreadsheets dispersed over several employee laptops, database tables from a remote database, and even data from several third-party systems the business relies on. A restaurant may have employee schedules on spreadsheets, inventory and pricing in a database, and customer orders in a third-party ordering system. It is the task of the data analyst to gather all the data from the disparate systems into one location.
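To make the restaurant scenario a little more concrete, here is a minimal sketch of pulling three sources into one place using only Python's standard library. Everything in it is an assumption made for illustration: the CSV text stands in for a spreadsheet export, the in-memory SQLite table stands in for the pricing database, the JSON string stands in for the third-party ordering system, and all names and values are invented.

```python
import csv
import io
import json
import sqlite3

# Hypothetical stand-ins for three separate systems (all names and values are invented).
SCHEDULE_CSV = "employee,shift_date\nAna,2024-05-01\nBen,2024-05-02\n"
ORDERS_JSON = (
    '[{"order_id": 1, "item": "burger", "qty": 2},'
    ' {"order_id": 2, "item": "salad", "qty": 1}]'
)


def load_schedules(csv_text):
    """Read employee schedules exported from a spreadsheet as CSV."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def load_prices(connection):
    """Read item prices from a relational database table."""
    rows = connection.execute("SELECT item, price FROM menu").fetchall()
    return {item: price for item, price in rows}


def load_orders(json_text):
    """Read customer orders as a third-party system might return them."""
    return json.loads(json_text)


def gather():
    """Pull all three sources into one in-memory structure."""
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE menu (item TEXT, price REAL)")
    connection.executemany(
        "INSERT INTO menu VALUES (?, ?)", [("burger", 9.5), ("salad", 7.0)]
    )
    prices = load_prices(connection)
    connection.close()

    orders = load_orders(ORDERS_JSON)
    # Join each order with its price so later phases see one combined record.
    for order in orders:
        order["total"] = order["qty"] * prices[order["item"]]

    return {"schedules": load_schedules(SCHEDULE_CSV), "orders": orders}


combined = gather()
print(combined["orders"])
```

In a real project each loader would point at an actual file, database, or API, and that is where most of the painstaking work tends to hide.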

Phase 2: Exploring and Preparing the Data (Exploratory Data Analysis)

Once physical oil has been removed from the ground, it must be refined. The refining process depends on the intended final product, like gasoline or petroleum jelly. Some refinement processes are easier than others. And some crude oils are more difficult to refine than others. This is the same for data. Recall that in Phase 1, data may have come from one clean source or from many disparate sources. Generally, the more sources from which the data is extracted, the longer it will take to prepare the data in Phase 2.

Phase 2 is the process of cleaning the data and preparing it for a machine learning model, and also doing some basic exploration of the data. This phase is commonly referred to as EDA, or Exploratory Data Analysis. So, what does it mean to “clean” data? In a perfect world, your data will be perfect. But alas, we do not live in a perfect world, and we rarely have clean data. The data we work with is usually riddled with holes, missing values, and erroneous values.

Let’s look at the data for 3 days of watching movies for 8 people. Our goal is to create a movie recommender that will suggest a movie for a person based on the movies they watched previously, as well as their gender and age. Notice that our data is not entirely clean. There are 2 moviegoers who do not have their gender reported, and 1 moviegoer who does not have their age reported. Also, there are gaps in the movies watched for several of the moviegoers on various days.

When cleaning the data, a data analyst will work closely with a domain expert in the field. She may ask questions like, “Why are there missing values in the movies watched? Is it because they did not watch a movie that day, or was the movie watched not reported?” She also needs to decide if the rows that contain missing gender and age should be discarded from the dataset and not used.
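A first pass at those questions usually starts with counting where the holes are. The sketch below assumes a small, made-up table in the spirit of the moviegoer data (the names, movies, and number of rows are hypothetical), counts missing values per column, and shows one possible decision: discarding rows that lack gender or age.

```python
# Hypothetical rows standing in for the moviegoer table; None marks a missing value.
moviegoers = [
    {"name": "A", "gender": "F", "age": 34, "day1": "Up", "day2": None, "day3": "Heat"},
    {"name": "B", "gender": None, "age": 27, "day1": "Alien", "day2": "Up", "day3": None},
    {"name": "C", "gender": "M", "age": None, "day1": None, "day2": "Heat", "day3": "Up"},
    {"name": "D", "gender": "M", "age": 45, "day1": "Up", "day2": "Alien", "day3": "Heat"},
]


def count_missing(rows):
    """Count missing values per column so the analyst can see where the holes are."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            if value is None:
                counts[column] = counts.get(column, 0) + 1
    return counts


def drop_incomplete(rows, required):
    """Keep only the rows where every required column has a value."""
    return [row for row in rows if all(row[col] is not None for col in required)]


missing = count_missing(moviegoers)
clean = drop_incomplete(moviegoers, required=("gender", "age"))

print(missing)
print([row["name"] for row in clean])
```

Whether dropping rows is the right call, or whether a missing movie simply means "did not watch," is exactly the kind of judgment the domain expert helps with. The code only makes the gaps visible.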

Sometimes data cannot be cleaned solely by a human. Some datasets contain dozens to hundreds to thousands of features (dimensions). Data gathered by clinical drug studies and data gathered by website interactions usually have hundreds to thousands of features. A recent dataset from Amazon that studied consumer reviews had 10,001 dimensions; and many datasets from clinical drug studies can have 50,000 – 100,000 features. However, most of these features are extraneous and do not add value to the machine learning model. The process of “dimensionality reduction” can reduce the number of features in a dataset by as much as 90%. Dimensionality reduction is a class of machine learning algorithms that explores the features of a dataset and attempts to eliminate the extraneous features while identifying the principal features. There are several main algorithms used to approach dimensionality reduction, including Principal Component Analysis (PCA), as well as actually running a Regression or Classification model on the large feature set to identify extraneous and principal features. Dimensionality reduction is an endeavor that involves a data analyst or scientist, as well as a domain expert.

Because dirty data may require evaluating and updating numerous details, this phase is usually the lengthiest phase in a machine learning project.

After cleaning the data, the data analyst will explore the data for some basic insights, and create graphs and charts that summarize the data. These graphs and charts will visually depict the nature of the data, including statistical insights. The graphs and charts, along with the clean data, are then provided to a data scientist to begin the machine learning.


Phase 3: Training a Model on the Data

Now that the data has been gathered and cleaned, the data scientist can begin the process of training a machine learning model on the data. There are three basic classes of machine learning models:

  • Regression
  • Classification
  • Clustering



Regression is a class of machine learning used to determine a real value, like a number. The most popular regression algorithm is linear regression. An example of a two-dimensional linear regression is below. Each car model is plotted on the 2D graph based on the car’s horsepower (HP) and the car’s fuel efficiency (MPG). Then a trend line is drawn (by the algorithm) through the plotted points. In a linear regression model, the trend line takes the familiar form y=mx+b from basic algebra. Once a trend line is drawn, we can make predictions. For example, if my new car has 400 horsepower, I can predict that its fuel efficiency will be about 14 miles per gallon. Maybe I should get an electric car.

Note that the example above used two dimensions, or “features,” as they are called in data science. Humans can work with and understand two dimensions at one time. Some humans, some very smart humans, can work with and understand three-dimensional linear regression charts. Computers can work with and understand multi-dimensional charts. When retailers use machine learning to create pricing models or when hedge funds use machine learning to predict stock prices, the machine learning model usually uses hundreds of features. I cannot draw you a 125-dimensional linear regression chart… sorry.
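For the two-dimensional case, the trend line can be computed in a few lines. This is a minimal sketch of ordinary least squares for y=mx+b with a single feature. The horsepower and MPG pairs are invented for illustration, so the prediction it prints will not match the figure quoted earlier.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = m*x + b with a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: how x and y vary together, divided by how much x varies on its own.
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    m = covariance / variance
    # Intercept: forces the line through the point (mean_x, mean_y).
    b = mean_y - m * mean_x
    return m, b


# Hypothetical (horsepower, MPG) pairs, invented for illustration.
horsepower = [100, 150, 200, 250, 300]
mpg = [36, 31, 28, 25, 20]

m, b = fit_line(horsepower, mpg)
predicted = m * 400 + b
print(f"slope={m:.3f}, intercept={b:.1f}, predicted MPG at 400 HP={predicted:.1f}")
```

With hundreds of features the idea is the same, but the arithmetic is handled with matrix operations by a library rather than by hand.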


Classification models create categories. These models are typically used to classify people, events, or things. Classification models are typically easier for a human to understand, in most cases. For example, the popular K-Nearest Neighbor (KNN) algorithm can be visually depicted and understood in two dimensions. In the graph below, we plot foods based on how crunchy the food is and how sweet the food is. The crunchier the food, the higher the food is plotted on the y-axis; the sweeter the food, the further to the right the food is plotted on the x-axis.

Once the foods have been plotted, we can cluster the foods into categories based on their proximity to each other. (The proximity determination is a parameter in the machine learning model that the algorithm tunes.) In our example, we cluster the plotted foods into 3 categories: vegetables, proteins, and fruits.

Now that we have plotted the foods with two features, crunchiness and sweetness, and categorized each cluster, we can start making predictions. In the graph below, we add a tomato, which is moderately sweet and moderately crunchy. We now need to predict in what category the tomato resides, and the tomato clearly does not fall within any of the predefined categories. We notice that the tomato has 4 neighbors: 1 protein, 1 vegetable, and 2 fruits. Its nearest neighbor is a protein. The final determination of whether the tomato is a protein, vegetable, or fruit is dependent on how the KNN model is configured. First, there is the measure used to determine the distance between the tomato and the other objects. Several popular distance measures are used, including Euclidean and Manhattan distance. Second, the K in KNN refers to the number of nearest neighbors we should consider when making our final determination.

So, is the tomato a protein, vegetable, or fruit? Surely you did not expect that age-old question to be answered in this article. Instead, I invite you to run a KNN model yourself and make your own determination.
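If you want a starting point, here is a minimal KNN sketch in plain Python. The sweetness and crunchiness scores are hypothetical, chosen so that the tomato's nearest neighbor is a protein while two of its four nearest neighbors are fruits, mirroring the situation described above. The function names are my own and not from any library.

```python
from collections import Counter


def euclidean(a, b):
    """Straight-line distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def manhattan(a, b):
    """Distance measured along the axes, like walking city blocks."""
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_predict(training, point, k, distance=euclidean):
    """Return the most common label among the k training points closest to `point`."""
    neighbors = sorted(training, key=lambda item: distance(item[0], point))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical (sweetness, crunchiness) scores on a 1-10 scale, invented for illustration.
foods = [
    ((7, 5), "fruit"),
    ((7, 6), "fruit"),
    ((9, 3), "fruit"),
    ((3, 7), "vegetable"),
    ((2, 9), "vegetable"),
    ((4, 4), "protein"),
    ((2, 3), "protein"),
    ((1, 2), "protein"),
]
tomato = (5, 5)

print(knn_predict(foods, tomato, k=1))                      # only the nearest neighbor votes
print(knn_predict(foods, tomato, k=4))                      # four neighbors vote
print(knn_predict(foods, tomato, k=4, distance=manhattan))  # same vote, different measure
```

With this made-up data, K=1 calls the tomato a protein and K=4 calls it a fruit, which is the whole point: the answer depends on how the model is configured.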

The above example explores a KNN with two dimensions (features). Just like regression models, a KNN can efficiently work with hundreds of dimensions. This is typically the case, as with Netflix’s movie recommender and Amazon’s product recommender.

Another very popular machine learning model for classification is a decision tree. Decision trees are popular because they are efficient, and humans can understand them very well, even when there are multiple dimensions. What makes decision trees so easy for humans to understand is not how they are calculated; that’s actually quite complex. What makes decision trees easy to understand is the output prediction tree. In the example below, a decision tree model crunched data on hundreds of thousands of customers to whom a leading bank provided loans. Based on some key features, and loan default history, the algorithm was able to create a prediction tree that’s easy for any human to understand and follow.


It seems that people under 30 with a high income are credit risks. Why? Also, any student with a moderate income is credit worthy? Again, why? I’ll leave those questions for you to ponder.


In our previous examples, we had clearly defined data. But data is not always clearly defined. Clustering is a special machine learning algorithm used when the structure of the data is not clearly defined. In the previous examples, we supervised the machine learning models. In clustering, the model runs unsupervised, and creates clusters on its own. We may define the number of clusters to create, or we may allow the model to run on its own and see what it discovers.

Clustering can be used in Phase 2 to help the data analyst create structure from unstructured data. The most popular cluster model is the K-Means Clustering model. Do you know what the K in K-Means stands for? Hint: It’s the number of clusters you ask the model to create. You may also leave this value undefined if you choose to allow the model to experiment with different cluster sizes.

In the example below, we run a clustering algorithm on a set of data that contains three features (dimensions): age, education, and income. Note—and this is very important—that in this example we do not have a target prediction in mind (or a “label,” as it is called in machine learning). We are not trying to predict credit worthiness, movie preferences, or gym membership churn. All we are doing at this stage is asking the model to segment the population into clusters. In this example, the K-means clustering algorithm segmented the population into 5 clusters.
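The loop at the heart of K-means is short enough to sketch in plain Python. The version below makes several simplifying assumptions: the (age, years of education, income in thousands) rows are invented, it asks for 3 clusters rather than 5 to keep the data small, it seeds the centroids with the first K rows instead of choosing them randomly, and it skips feature scaling, which a real project would normally want.

```python
def squared_distance(a, b):
    """Squared Euclidean distance; enough for comparing which centroid is closer."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Component-wise average of a list of points."""
    return tuple(sum(values) / len(points) for values in zip(*points))


def k_means(points, k, iterations=20):
    """Assign each point to its nearest centroid, recompute centroids, repeat."""
    centroids = list(points[:k])  # simplification: seed with the first k points
    assignments = []
    for _ in range(iterations):
        assignments = [
            min(range(k), key=lambda c: squared_distance(point, centroids[c]))
            for point in points
        ]
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = mean_point(members)
    return assignments, centroids


# Hypothetical (age, years of education, income in $1,000s) rows. Note: no label column.
people = [
    (22, 12, 25), (45, 16, 90), (68, 12, 40),
    (24, 13, 28), (23, 12, 22),
    (47, 18, 95), (44, 16, 85),
    (70, 12, 38), (66, 14, 42),
]

assignments, centroids = k_means(people, k=3)
print(assignments)
print(centroids)
```

Notice that nothing in the input says what the clusters mean. The model only reports which rows sit together; naming the segments is left to the humans.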

As mentioned, clustering is an unsupervised learning model that can be applied in Phase 2 or Phase 3. It is typically applied in Phase 2 when we want to segment the data into clusters for the sake of running a different model on the cluster in Phase 3 in order to make a prediction on an identified label. We may run the clustering algorithm in Phase 3 when our objective is to just cluster. For example, an advertising agency may want to identify clusters for the purpose of creating targeted advertising. In this scenario, the advertisers are not making a prediction of future events as much as predicting existing clusters.

Deep Learning

Deep learning is the newest entrant to the world of machine learning models. It has also shown the greatest potential. Whereas many of the algorithms we have discussed thus far have their roots in applied statistics dating back hundreds of years, deep learning is modeled on the human brain and, specifically, neuron interactions.

Deep learning is responsible for major advances in natural language processing, speech recognition, and image and facial recognition. It is also at the core of self-driving cars and robotics. The technology is still nascent, but the potential it has demonstrated and the advances it has made put deep learning in a class by itself. Many experts agree that if deep learning matures at the rate anticipated, the world could see a conscious and sentient artificial agent within 50 years. Wow!


Phase 4: Evaluating Model Performance

In the previous phase, we trained a model on the data. When a data scientist trains a model, she does not use all the data. True, the more data used for training, the better the model will be trained and the more accurate the predictions will be. However, we must keep about 30% of the data hidden from the model. We want to train our model on about 70% of the data, leaving the blind 30% as a test.

We evaluate the model’s performance by testing the model on the 30% of the data the model did not see during the training phase. This ensures that the model’s predictions are not biased by data it has already seen. Sometimes, a portion of the data is never released to the data scientist to ensure no contamination or bias occurs.
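The 70/30 split itself is simple to sketch. The helper below is a hypothetical function of my own (not a library call) that shuffles a copy of the records with a fixed seed so the split is repeatable, then holds out the last 30% as the test set. The list of 100 integers is just a stand-in for 100 labeled records.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of the rows, then hold out the last `test_fraction` for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cutoff = round(len(shuffled) * (1 - test_fraction))
    return shuffled[:cutoff], shuffled[cutoff:]


records = list(range(100))  # hypothetical stand-in for 100 labeled records
train_rows, test_rows = train_test_split(records)

print(len(train_rows), len(test_rows))
```

Shuffling before splitting matters when the records arrive in some meaningful order, such as by date or by customer. For time-ordered data a random shuffle may not be appropriate at all, which is a judgment call outside this sketch.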

When the model is evaluated, it is presented with the 30% of data, minus the “label,” which is what we are trying to predict. For example, when evaluating the decision tree of credit worthiness, we would provide the decision tree with the population data of consumers, but not include the one piece of data that indicates whether they defaulted on their loan. Instead, we ask the model to predict loan default for each of the consumers based on the data provided. Once the model makes a prediction, we then compare the prediction with the actual value: either the consumer defaulted or not.

When evaluating the performance of a model, several factors must be evaluated. We cannot simply score a model on how many times the model made a correct prediction. That type of evaluation is called “accuracy,” and is just one type of measurement. For example, if 95% of consumers do not default on a loan, then a model that blindly predicts that no one will ever default has an accuracy of 95%. That’s not very impressive, because it escorted the 5% of defaulters right through the front door. Instead, we need several means of evaluating a model.

First, let’s examine the results of the credit-default model. We can see that the model predicted that 1,370 consumers would default. Those who actually defaulted are called “true positives” (TP). Those predicted to default who did not default are called “false positives” (FP). The model also predicted that 8,509 consumers would not default. The ones who actually did not default are called “true negatives” (TN), and the consumers who were predicted not to default but actually did default are called “false negatives” (FN).

The most common performance metrics for classification models are detailed in the image below. They are:

Accuracy: The proportion of the total number of predictions that were correct. This is most commonly known to laypeople.

Precision: The proportion of predicted positive cases that were actually positive.

Sensitivity or Recall: The proportion of actual positive cases that are correctly identified.

Specificity: The proportion of actual negative cases that are correctly identified.

While accuracy is the most common evaluation metric known to a layperson, other evaluation metrics are of great concern to data scientists, domain experts, and business professionals. For example, let’s say a model needs to predict a rare disease, one that is present in 0.01 percent of the population. If the model predicts every time that no one has the disease, then the model has an accuracy of 99.99%…but the people who actually have the disease are never identified. This is where the other performance metrics help to create an overall evaluation of the model’s effectiveness.

The F1-score is an aggregate score that combines recall and precision (it is their harmonic mean), the two most commonly used metrics when evaluating categorical model performance. The F1 is a good measure of overall model performance, but should not be considered on its own. It is important to examine the breadth of performance metrics to truly understand how well a model performs.
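All of these metrics fall out of the four counts TP, FP, TN, and FN. The sketch below uses the two totals mentioned earlier (1,370 predicted defaults and 8,509 predicted non-defaults), but how each total splits into correct and incorrect predictions is made up for illustration, so treat the printed numbers as an example and not as the credit model's real results.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute common classification metrics from the four confusion-matrix counts.

    Assumes at least one predicted positive, one actual positive,
    and one actual negative, so no denominator is zero.
    """
    precision = tp / (tp + fp)    # of those predicted positive, how many really were
    recall = tp / (tp + fn)       # of the actual positives, how many were caught
    specificity = tn / (tn + fp)  # of the actual negatives, how many were cleared
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": 2 * precision * recall / (precision + recall),  # harmonic mean
    }


# Hypothetical split: 1,370 predicted defaults = 1,000 TP + 370 FP,
# and 8,509 predicted non-defaults = 8,000 TN + 509 FN.
metrics = classification_metrics(tp=1000, fp=370, tn=8000, fn=509)

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these invented counts the accuracy looks comfortable while recall is noticeably lower, which is the gap that accuracy alone tends to hide.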

Regression models predict real numbers, like prices, quantities, or measurements. These are not categories, and hence cannot be evaluated with categorical performance metrics like accuracy and recall. Instead, we must use statistical analysis tools suited to regression models.

The most popular regression evaluation metric is the root mean squared error (RMSE). The formula is below, and not nearly as foreboding as it seems. Let’s say a stock-picking regression model predicts that the price of Amazon stock will be 936.89, 939.16, 949.88, 960.34, 962.36 over the next 5 days, and the actual values were 939.79, 938.60, 950.87, 956.40, 961.35. The difference between the first predicted value and the first actual value is 2.9. The difference between the second predicted value and the second actual value is 0.56. Calculate this difference for every value, and square the result each time so that negative values do not adversely bias the metric. The difference is called the “error”; and when it is squared, it’s called the squared error. It is a measure of how close the prediction was to the actual value, measured in squared error.

Then, sum all of the squared errors and divide that sum by the total number of predictions (an average, from 5th grade math…told you this is easy stuff). Finally, we take the square root of that average so that we can work with a more manageable number (about 234.3 is more manageable than 54,895). The RMSE is a very good indicator of the performance of a regression model.
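Here are those steps applied to the five Amazon predictions from the example, as a short sketch. The function name is my own; the numbers are the ones quoted above.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error: square each error, average them, take the square root."""
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    mean_squared_error = sum(squared_errors) / len(squared_errors)
    return math.sqrt(mean_squared_error)


predicted = [936.89, 939.16, 949.88, 960.34, 962.36]
actual = [939.79, 938.60, 950.87, 956.40, 961.35]

print(round(rmse(predicted, actual), 2))
```

For these five days the RMSE works out to about 2.29, meaning the predictions were typically off by a little over two dollars.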

Phase 5: Improving Model Performance

Are we happy with our model’s performance? If so, we can skip this step. More often than not, though, there’s room for improvement. Just like when we learn a new skill in life, like boxing or playing the piano, there is always room for improvement in performance.

If the performance of our model is abysmal, we go back to Phase 2 or even Phase 1. Horrible performance may be the fault of the model, but more often, when we have truly bad performance, the fault resides in the data. Perhaps the data itself is just not of good quality, and no model will get it to talk. Or perhaps the data was not cleaned properly, and we need to go back to the raw data and reexamine how we clean it. In this case, it’s best for the data analyst to consult with a domain expert.

If the performance of our model is no better than guessing, we may consider an entirely new model. This is not an uncommon scenario, and in fact, happens quite often. When this happens, we repeat Phase 3, and train a new model on a portion of our data, and then Phase 4, where we evaluate our new model’s performance. Then, we’re back here again.

This iterative process of training multiple models is so common that there are software applications dedicated exclusively to Phase 3 and Phase 4. These run through many models, training and testing each one on the data and identifying the best-performing models.

If the performance of the model is better than guessing, but nothing to write home about, the best course of action may be to stick with the current model and tune the model’s hyperparameters. A model’s hyperparameters are set prior to the model being trained on the data. Most models contain several hyperparameters, and each hyperparameter can be tuned in a variety of ways. In fact, there are often hundreds to thousands of hyperparameter combinations for any given model. For some models, tuning hyperparameters can make an enormous difference to the performance of the model.
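One straightforward way to work through those combinations is to try every one and keep the best, an approach commonly called grid search. The sketch below applies it to a tiny hand-written KNN with two hyperparameters: K, and the order p of the distance measure (1 for Manhattan, 2 for Euclidean). The data points, the candidate values, and the use of a separate "validation" set for scoring are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def knn_predict(training, point, k, p):
    """k-nearest-neighbor vote using a Minkowski distance of order p."""
    def distance(a, b):
        return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

    neighbors = sorted(training, key=lambda item: distance(item[0], point))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]


def accuracy(training, validation, k, p):
    """Fraction of validation points whose predicted label matches the true label."""
    hits = sum(knn_predict(training, x, k, p) == label for x, label in validation)
    return hits / len(validation)


# Hypothetical labeled points, invented for illustration.
training = [
    ((1, 1), "a"), ((1, 2), "a"), ((2, 1), "a"),
    ((6, 6), "b"), ((6, 7), "b"), ((7, 6), "b"), ((4, 4), "b"),
]
validation = [((2, 2), "a"), ((5, 5), "b"), ((3, 3), "a")]

# Every combination of candidate values gets trained and scored once.
grid = {"k": [1, 3, 5], "p": [1, 2]}
results = {
    (k, p): accuracy(training, validation, k, p)
    for k, p in product(grid["k"], grid["p"])
}
best = max(results, key=results.get)

print(results)
print("best (k, p):", best)
```

Two hyperparameters with a handful of values each already produce 6 combinations. Add a few more hyperparameters and the count quickly reaches the hundreds or thousands mentioned above, which is why this step is usually automated.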


Machine learning is part of our everyday life, and has transformed business intelligence. Predictive analytics applications are favorites for customer segmentation, risk assessment, churn prevention, sales forecasting, market analysis, and financial modeling. Using the three main types of machine learning, plus deep learning, data scientists can provide business analysts with what some have called “an unfair competitive advantage,” due to the power and accuracy of a machine learning model’s predictions.

Yet even the most powerful, accurate, and sophisticated machine learning models all follow the same five phases: data collection, data exploration and preparation, training a model, evaluating the model’s performance, and improving the model’s performance.

What secrets reside in your data? What insights is your data waiting to tell you? Find out by contacting Vincent Serpico at SerpicoDEV for a no-charge data-checkup.