Understanding Train, Test, and Validation Sets in Machine Learning

When building machine learning models, a key practice is to split your dataset into separate subsets for training, testing, and validation. This lets you assess how well your model generalizes to new, unseen data. In this blog post, we'll walk through the train, test, and validation sets, their purpose, and how to split your data using scikit-learn's train_test_split function.




Why Split the Dataset?

The main goal of splitting your dataset is to evaluate how effectively the trained model will generalize to new, real-world data. By splitting the data into different subsets, we can avoid overfitting and ensure that the model doesn't just memorize the training data but can predict new data accurately.

1. Training Set

The training set is the subset of the dataset used to train the model. This is where the model learns the patterns and relationships within the data to make predictions. The quality and diversity of the training data directly affect the performance of the model. Typically, around 60-80% of the total dataset is used for training.

Here’s an example of how to create a training set using scikit-learn:

import numpy as np
from sklearn.model_selection import train_test_split

# Create a dummy dataset for demonstration
x = np.arange(16).reshape((8, 2))  # Features
y = range(8)                       # Target variable

# Split the data into 80% training and 20% testing
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, random_state=42)

# Print the training data
print("Training set features (x): ", x_train)
print("Training set target (y): ", y_train)

In this example, we split the dataset into 80% for training and 20% for testing. The train_test_split function handles this for us, making it easy to divide the data.

2. Testing Set

The testing set is used to evaluate the model after it has been trained. It contains data that the model hasn’t seen before, so it's crucial to assess how well the model generalizes. If a model performs well on the training data but poorly on the testing data, it's a sign that the model has overfitted.

The testing set typically makes up 20-25% of the total dataset.

Here’s an example to create a testing set:

import numpy as np
from sklearn.model_selection import train_test_split

# Create a dummy dataset
x = np.arange(16).reshape((8, 2))  # Features
y = range(8)                       # Target variable

# Split the data into 80% training and 20% testing
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Print the testing data
print("Testing set features (x): ", x_test)
print("Testing set target (y): ", y_test)

3. Validation Set

The validation set plays a crucial role in fine-tuning the model. It is used to adjust the hyperparameters of the model, such as learning rates, the number of layers in a neural network, or regularization strength. Unlike the training set, the model doesn’t learn from the validation data but uses it to evaluate how changes in hyperparameters affect its performance.

The validation set usually makes up 10-15% of the total dataset, though the exact percentage can vary depending on the complexity of the model.

Here’s how you can split your dataset into training, validation, and testing sets:

import numpy as np
from sklearn.model_selection import train_test_split

# Create a dummy dataset
x = np.arange(24).reshape((8, 3))  # Features
y = range(8)                       # Target variable

# First split: 80% for training, 20% for combined validation and testing
x_train, x_combine, y_train, y_combine = train_test_split(x, y, train_size=0.8, random_state=42)

# Second split: divide the remaining 20% into 50% validation and 50% testing
x_val, x_test, y_val, y_test = train_test_split(x_combine, y_combine, test_size=0.5, random_state=42)

# Print the datasets
print("Training set features (x): ", x_train)
print("Training set target (y): ", y_train)
print()
print("Validation set features (x): ", x_val)
print("Validation set target (y): ", y_val)
print()
print("Testing set features (x): ", x_test)
print("Testing set target (y): ", y_test)
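To show what the validation set is actually for, here is a hedged sketch of hyperparameter tuning: it trains a KNeighborsRegressor for several candidate values of n_neighbors, picks the one with the lowest validation error, and only then evaluates once on the held-out test set. The synthetic sine-wave dataset and the candidate values are made up for demonstration, not taken from the post.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# Synthetic regression dataset (values chosen for demonstration only)
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=(200, 1))
y = np.sin(x).ravel() + rng.normal(0, 0.1, size=200)

# 80/10/10 split: carve out training first, then halve the remainder
x_train, x_combine, y_train, y_combine = train_test_split(x, y, train_size=0.8, random_state=42)
x_val, x_test, y_val, y_test = train_test_split(x_combine, y_combine, test_size=0.5, random_state=42)

# Compare several candidate values of the hyperparameter n_neighbors
# using validation error -- the model never trains on x_val
best_k, best_mse = None, float("inf")
for k in [1, 3, 5, 7, 9]:
    model = KNeighborsRegressor(n_neighbors=k).fit(x_train, y_train)
    mse = mean_squared_error(y_val, model.predict(x_val))
    if mse < best_mse:
        best_k, best_mse = k, mse

# The final, unbiased evaluation happens exactly once, on the untouched test set
final_model = KNeighborsRegressor(n_neighbors=best_k).fit(x_train, y_train)
test_mse = mean_squared_error(y_test, final_model.predict(x_test))
print("Best n_neighbors:", best_k)
print("Test MSE:", round(test_mse, 4))
```

The key design point is that the test set plays no part in choosing n_neighbors; if it did, the final test score would be optimistically biased.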

Key Takeaways

  1. Training Set: The model learns from this data. It usually makes up 60-80% of the total dataset.
  2. Testing Set: Used to evaluate the model’s performance after training. Typically 20-25% of the dataset.
  3. Validation Set: Helps fine-tune hyperparameters to improve the model. It generally accounts for 10-15% of the data.

For any single split, the three fractions must of course add up to 100%; common concrete choices are 70/15/15 or 80/10/10, the latter being what the example above produces.



Properly splitting your data into these three sets is crucial for building a machine learning model that generalizes well to new, unseen data. Always keep your training, validation, and testing sets separate to avoid data leakage and ensure unbiased evaluation.
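A common source of the data leakage mentioned above is fitting a preprocessing step on the full dataset before splitting. As a minimal sketch of the leak-free pattern (StandardScaler is just one example of a fitted preprocessing step; the dummy data is for demonstration):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Dummy feature matrix for demonstration
x = np.arange(20, dtype=float).reshape((10, 2))
y = range(10)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Fit the scaler on the training data only ...
scaler = StandardScaler().fit(x_train)

# ... then apply the learned statistics to both sets.
# Fitting on the full dataset would leak test-set information into training.
x_train_scaled = scaler.transform(x_train)
x_test_scaled = scaler.transform(x_test)
print(x_train_scaled.mean(axis=0))  # approximately 0 by construction
```

Fitting the scaler before the split would let the mean and variance of the test rows influence the training features, which is exactly the kind of leakage the post warns against.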
