Today I learned #1
Introduction to machine learning
Supervised ML
- Always has labels associated with features. Think of the features as "questions" and the corresponding labels as "answers": supervised ML teaches the model by showing it combinations of questions and answers, with the goal that it picks up the pattern and can produce the "answers" on its own
- Train the model with features + known labels -> model can make predictions on new features
The goal of supervised ML is to come up with a function $g$ that takes the feature matrix (i.e., a matrix of inputs), $X$, as a parameter and makes predictions as close as possible to the target, $y$; in short, $g(X) \approx y$
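A minimal sketch of the idea $g(X) \approx y$, using scikit-learn's `LinearRegression` as one possible choice of $g$ (the tiny dataset here is made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])  # feature matrix X
y = np.array([2.0, 4.0, 6.0, 8.0])          # target y (here y = 2x)

g = LinearRegression().fit(X, y)  # learn g from known (X, y) pairs
pred = g.predict(np.array([[5.0]]))  # predict on a new, unseen feature
```

Any model that maps features to predictions plays the role of $g$; linear regression is just the simplest stand-in.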
Types of supervised ML problems
- Regression: target variable is a continuous number (car’s price)
- Classification: target variable is categorical (email is spam or not)
- Binary
- Multi-class
- Ranking: target variable is scores associated with particular items (e.g., used in recommender systems)
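As a sketch of the binary classification case, here is a toy spam classifier; the features (link count, exclamation-mark count) and the use of `LogisticRegression` are my own assumptions, not anything canonical:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# toy features: [num_links, num_exclamation_marks]; label 1 = spam
X = np.array([[0, 0], [1, 0], [8, 5], [9, 7]])
y = np.array([0, 0, 1, 1])

clf = LogisticRegression().fit(X, y)
label = clf.predict([[10, 6]])[0]  # a link-heavy email: classified as spam
```

A multi-class problem looks the same, just with more than two values in `y`.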
Model selection process
- Split the data into three datasets (training, validation, and test)
- The test set is crucial to prevent the multiple comparison problem (MCP)
- MCP occurs when you repeatedly test different models, potentially leading to an overestimation (by pure chance) of model performance
- Train various models on the training set
- Validate the models and tune hyperparameters on the validation set: this gives you a performance score for each model
- Select the best model
- Test the selected model on the test set to ensure generalization on unseen data
- Check that the scores on the validation and test sets are close enough to confirm the performance was not due to pure chance or overfitting to the validation set (i.e., there was no multiple comparison problem in the validation process)
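The three-way split above can be sketched with two calls to scikit-learn's `train_test_split`; the 60/20/20 proportions are an assumption, not a rule:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)
y = np.arange(100)

# first split off the test set (20%), then split the rest into train/val
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)
# 0.25 of the remaining 80% = 20% of the full data
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=1
)
```

The test set is then set aside and only touched once, after the best model has been chosen on the validation set.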
What happens if, after using the test set, the performance is significantly worse than on the validation set, indicating overfitting or MCP?
**If performance on the test set is poor, focus on refining your model selection and validation process (e.g., using cross-validation) rather than tweaking the model based on the test results.**
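Cross-validation, mentioned above as a way to make the validation process more robust, can be sketched with scikit-learn's `cross_val_score`; the synthetic data, the `Ridge` model, and the 5-fold choice are all assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# synthetic regression data: y is a noisy linear function of X
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.1, size=100)

# 5-fold cross-validation: each fold takes a turn as the validation set,
# so the score does not hinge on one particular split
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=5)  # R^2 per fold
mean_score = scores.mean()
```

Averaging over folds reduces the chance that a model looks good purely because of a lucky validation split, which is exactly the MCP risk described above.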