Today I learned #1
Introduction to machine learning
Supervised ML
- Always has labels associated with features. Think of the features as "questions" and the corresponding labels as "answers": supervised ML teaches the model by showing it combinations of questions and answers, with the goal that it picks up the pattern and can produce the "answers" on its own
- Train the model with features + known labels -> model can make predictions on new features
The goal of supervised ML is to come up with a function $g$ that takes the feature matrix (i.e., a matrix of inputs), $X$, as a parameter and makes predictions as close as possible to the target, $y$; in short, $g(X) \approx y$
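A minimal sketch of the idea $g(X) \approx y$, using scikit-learn's `LinearRegression` as one possible choice of $g$ (the tiny dataset here is made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])  # feature matrix X
y = np.array([2.0, 4.0, 6.0, 8.0])          # target y (here y = 2x)

g = LinearRegression().fit(X, y)  # learn g from known (X, y) pairs
pred = g.predict(np.array([[5.0]]))  # predict on a new, unseen feature
```

Any model that maps features to predictions plays the role of $g$; linear regression is just the simplest stand-in.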
Types of supervised ML problems
- Regression: target variable is a continuous number (car’s price)
- Classification: target variable is categorical (email is spam or not)
- Binary
- Multi-class
- Ranking: target variable is scores associated with particular items (e.g., used in recommender systems)
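As a sketch of the binary classification case, here is a toy spam classifier; the features (link count, exclamation-mark count) and the use of `LogisticRegression` are my own assumptions, not anything canonical:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# toy features: [num_links, num_exclamation_marks]; label 1 = spam
X = np.array([[0, 0], [1, 0], [8, 5], [9, 7]])
y = np.array([0, 0, 1, 1])

clf = LogisticRegression().fit(X, y)
label = clf.predict([[10, 6]])[0]  # a link-heavy email: classified as spam
```

A multi-class problem looks the same, just with more than two values in `y`.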
Model selection process
- Split the data into three datasets (training, validation, and test)
- The test set is crucial to prevent the multiple comparison problem (MCP)
- MCP occurs when you repeatedly test different models, potentially leading to an overestimation (by pure chance) of model performance
- Train various models on the training set
- Validate the models and tune hyperparameters on the validation set: this gives you a performance score for each model
- Select the best model
- Test the selected model on the test set to ensure generalization on unseen data
- Check that the scores on the validation and test sets are close enough to confirm the performance was not due to pure chance or overfitting to the validation set (i.e., there was no multiple comparison problem in the validation process)
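The three-way split above can be sketched with two calls to scikit-learn's `train_test_split`; the 60/20/20 proportions are an assumption, not a rule:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)
y = np.arange(100)

# first split off the test set (20%), then split the rest into train/val
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)
# 0.25 of the remaining 80% = 20% of the full data
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=1
)
```

The test set is then set aside and only touched once, after the best model has been chosen on the validation set.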
What happens if, after using the test set, the performance is significantly worse than on the validation set, indicating overfitting or MCP?
**If performance on the test set is poor, focus on refining your model selection and validation process (e.g., using cross-validation) rather than tweaking the model based on the test results.**
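Cross-validation, mentioned above as a way to make the validation process more robust, can be sketched with scikit-learn's `cross_val_score`; the synthetic data, the `Ridge` model, and the 5-fold choice are all assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# synthetic regression data: y is a noisy linear function of X
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.1, size=100)

# 5-fold cross-validation: each fold takes a turn as the validation set,
# so the score does not hinge on one particular split
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=5)  # R^2 per fold
mean_score = scores.mean()
```

Averaging over folds reduces the chance that a model looks good purely because of a lucky validation split, which is exactly the MCP risk described above.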