Becoming an Artificial Intelligence Expert Part 2: Statistical Learning

Angel Sanchez
Aug 27, 2024

To begin, we will talk about statistical learning. Statistical learning provides a framework for understanding relationships between independent and dependent variables. These relationships are represented through mathematical formulations that contain both kinds of variables. Statistical learning is important for developing accurate predictive models in machine learning, where the primary aim is often to predict unknown outcomes based on known inputs. Machine learning also focuses on pattern recognition and prediction efficiency. The main difference between the two is one of emphasis: statistical learning stresses interpretability and theoretical understanding, while machine learning often prioritizes predictive accuracy and computational efficiency. To put it simply, statistical learning helps ensure that machine learning models are not only effective but also comprehensible and reliable in their predictions.

What is Statistical Learning?

Statistical learning is about finding a relationship between independent and dependent variables. This relationship can be written as Y = f(X) + ϵ (Gareth James, 2023, p. 15). The dependent variable is denoted Y. It is the variable that is measured, and it is often called the output variable or the response. Independent variables are denoted X; multiple independent variables are written X1, X2, and so on. Independent variables are the variables you can change, and they go by different names, such as input variables, predictors, and features. The function f represents the systematic information that X provides about Y, and it may involve many input variables. In effect, f models the expected value of Y given X (Gareth James, 2023, p. 16). Epsilon (ϵ) is a random error term that captures the variability in the data not explained by f. It is assumed to be independent of X and to have a mean of zero, and it is often modeled as following a normal distribution. Because f is unknown in practice, we make predictions with an estimated function, fhat, that approximates the true function f.

This estimate lets us apply the model to new data. The equation Yhat = fhat(X) states that the predicted response Yhat is obtained by applying the estimated function fhat to the input predictors X. The “hat” notation (^) signifies an estimated value, as opposed to the actual observed value. In regression tasks, Yhat might represent predicted house prices, while in classification tasks it might represent predicted probabilities or class labels. fhat is the model that results from fitting to the training data. It could be a linear regression model, a neural network, a decision tree, or any other type of statistical or machine learning model, and it is used to make predictions on new data.
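To make the notation concrete, here is a minimal simulation of the model Y = f(X) + ϵ described above. The functions `true_f` and `f_hat` are hypothetical and chosen only for illustration; with real data the true f is unknown and fhat would come from fitting a model.

```python
import random
import statistics

def true_f(x):
    # Hypothetical "true" f, assumed here only so we can simulate data.
    return 2.0 + 3.0 * x

def f_hat(x):
    # Hypothetical estimate of f, standing in for a fitted model.
    return 2.1 + 2.9 * x

def simulate(n, noise_sd, seed=0):
    """Draw n observations from Y = f(X) + eps, with eps ~ Normal(0, noise_sd)."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    eps = [rng.gauss(0.0, noise_sd) for _ in range(n)]
    ys = [true_f(x) + e for x, e in zip(xs, eps)]
    return xs, ys, eps

xs, ys, eps = simulate(n=5000, noise_sd=1.0)
y_hats = [f_hat(x) for x in xs]  # Yhat = fhat(X)

print("mean of the error term:", round(statistics.fmean(eps), 3))
print("first observed Y:", round(ys[0], 2), "| first predicted Yhat:", round(y_hats[0], 2))
```

The mean of the simulated error term should land close to zero, which is the assumption that lets us drop ϵ when predicting.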

Why Estimate f?

There are two main reasons for estimating the function f. The first is prediction, where the inputs X are known but the output Y cannot be easily observed. A prediction-focused approach is useful when the outcome matters more than the mechanism behind it. Since the error term (ϵ) averages to zero, we predict Y using Yhat = fhat(X). In this case fhat is often treated as a black box: the focus is on accuracy rather than on the model’s form. A typical prediction question is ‘Is this house under- or overvalued?’ Such questions focus on accurately estimating outcomes based on available data. Non-linear models are often favored for prediction because of their ability to capture complex patterns (Gareth James, 2023, p. 20). The second reason for estimating f is inference. If a deeper understanding is the goal, then inference is the appropriate choice (Gareth James, 2023, p. 19). Inference aims to understand the relationships between variables and generalize these findings (Gareth James, 2023, p. 17). It involves questions like ‘How much extra will a house be worth if it has a view of the river?’ The goal is to quantify the impact of specific features on outcomes. Linear models are mostly associated with inference due to their straightforward interpretability.

Prediction

In prediction, the accuracy of Yhat depends on two types of error. The first is the reducible error, which comes from the difference between the true function f and our estimate fhat. Reducible error is influenced by model bias and variance, and we can shrink it by refining fhat with appropriate statistical learning techniques so that it approximates f more closely. The second is the irreducible error. No matter how well we estimate f, the irreducible error cannot be eliminated, because Y also depends on ϵ, which cannot be predicted from X. It is associated with inherent variability in the data, such as measurement errors and unobserved variables, and it limits how accurate any prediction can be (Gareth James, 2023, p. 17). Two quantities are used to describe these errors. The expected value is the average, or mean, of a random variable and represents the central tendency of the data or predictions. Variance measures how much a random variable spreads around its expected value, so it indicates both the dispersion in the data and the uncertainty in the model’s predictions.
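The split into reducible and irreducible error can be checked numerically. The sketch below assumes a hypothetical true function `true_f` and a fixed, imperfect estimate `f_hat` (neither comes from real data) and compares the simulated average squared prediction error at one input with the sum of the squared gap between f and fhat (reducible) and the variance of ϵ (irreducible).

```python
import random

def true_f(x):
    return 2.0 + 3.0 * x   # hypothetical true function

def f_hat(x):
    return 2.5 + 2.8 * x   # hypothetical imperfect estimate, held fixed

def expected_squared_error(x, noise_sd, n_draws=200_000, seed=1):
    """Monte Carlo estimate of E[(Y - Yhat)^2] at a fixed input x."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        y = true_f(x) + rng.gauss(0.0, noise_sd)
        total += (y - f_hat(x)) ** 2
    return total / n_draws

x0 = 4.0
noise_sd = 1.5
reducible = (true_f(x0) - f_hat(x0)) ** 2   # [f(x0) - fhat(x0)]^2
irreducible = noise_sd ** 2                 # Var(eps)
simulated = expected_squared_error(x0, noise_sd)

print("reducible + irreducible:", round(reducible + irreducible, 3))
print("simulated average squared error:", round(simulated, 3))
```

Improving `f_hat` can drive the first term toward zero, but the second term stays at the noise variance no matter how good the estimate is.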

Inference

In inference, our aim is still to estimate f, but the goal is not necessarily to make predictions for Y. Instead, we focus on understanding the relationships within the data and on how the response changes as the predictors change (Gareth James, 2023, p. 18). Three questions are central to inference. The first is: which predictors are significantly associated with the response? This involves identifying the most influential predictors. The second is: what is the nature of the relationship between the response and each predictor? This includes determining whether the relationship is positive or negative. The third is: can the relationship between the response and each predictor be summarized using a linear equation, or is it more complex? Often the true relationship is more intricate, which means that a linear model may not provide an accurate representation.

How Do We Estimate f?

Statistical learning begins by assuming that we have observed a set of n distinct data points. These data points are referred to as training data, because they are used to train, or teach, a statistical learning method how to estimate the function f. The function f represents the underlying pattern or relationship in the data, and the goal is to apply a method to the training data that estimates this unknown function accurately. Most statistical learning methods fall into two categories. The first category is parametric. Parametric methods assume a specific form for f and then estimate the parameters of that form from the data; linear regression is an example. The second category is non-parametric. Non-parametric methods do not make explicit assumptions about the form of f. Instead, they construct a model that adapts to the patterns observed in the data, which lets them follow the inherent complexity of the data more flexibly (Gareth James, 2023, p. 20).

Parametric Methods

Parametric methods involve a two-step, model-based approach. The first step is to make an assumption about the functional form, or shape, of f; in other words, to select the model. The second step is to fit the model to the training data, for example with the least squares method. Assuming a form simplifies the problem, because estimating f reduces to estimating a set of parameters. A disadvantage is that the chosen model might not accurately reflect the true f, which leads to a poor fit. To address this, more flexible models may be considered. Flexible models allow for greater adaptability to the data’s inherent complexities, but caution is needed to avoid overfitting. Overfitting occurs when a model follows the errors, or noise, in the training data too closely, which impairs the model’s performance on new data.
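The two steps can be sketched with a single predictor. The code below assumes the linear form f(X) = intercept + slope · X (step one) and then fits the two parameters by least squares using the standard closed-form solution (step two). The simulated data and the helper names `fit_least_squares` and `predict` are illustrative assumptions, not part of the original text.

```python
import random
import statistics

def fit_least_squares(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    x_bar = statistics.fmean(xs)
    y_bar = statistics.fmean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

def predict(params, x):
    """Apply the fitted line: Yhat = intercept + slope * x."""
    intercept, slope = params
    return intercept + slope * x

# Simulated training data from a hypothetical linear f with noise.
rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [1.0 + 2.0 * x + rng.gauss(0, 1) for x in xs]

params = fit_least_squares(xs, ys)
print("estimated (intercept, slope):", tuple(round(p, 2) for p in params))
print("prediction at x = 5:", round(predict(params, 5.0), 2))
```

Because the whole problem is reduced to two numbers, the fit is easy to compute and easy to read, which is the appeal of the parametric approach when the assumed form is reasonable.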

Non-parametric Methods

Non-parametric methods do not make assumptions about the form of the function f. This avoids the risk, inherent in parametric methods, of specifying the wrong model. The cost is that non-parametric methods do not simplify the problem by reducing it to a few parameters, so they generally require more data, especially as the complexity of the data increases. An example of a non-parametric method is the thin-plate spline. It is particularly useful for smoothing and interpolation in multi-dimensional data, and it can capture complex patterns without assuming a specific form for f (Gareth James, 2023, p. 22).

In order to fit a thin-plate spline, the data analyst must select a level of smoothness, and choosing it well is crucial. The smoother the fit, the less it responds to fluctuations in the training data. With too little smoothness, the fit follows those fluctuations closely and deviates from the true function f. This excessive variability is an example of overfitting: the model captures noise instead of the underlying signal, which leads to poor performance on new data (Gareth James, 2023, p. 23). In short, the flexibility that lets non-parametric methods capture complex patterns has to be balanced by a careful choice of smoothness.
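The effect of the smoothness setting can be illustrated with a simpler non-parametric method. The sketch below does not fit a thin-plate spline; it swaps in a Gaussian kernel smoother (a weighted moving average), where the `bandwidth` parameter plays the role of the smoothness level. The sine-shaped true function and the noise level are assumptions made for the simulation.

```python
import math
import random

def kernel_smooth(x0, xs, ys, bandwidth):
    """Gaussian-kernel weighted average of ys around x0 (a Nadaraya-Watson smoother)."""
    weights = [math.exp(-0.5 * ((x - x0) / bandwidth) ** 2) for x in xs]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)

def mean_squared_gap(fitted, targets):
    return sum((a - b) ** 2 for a, b in zip(fitted, targets)) / len(targets)

# Simulated data: evenly spaced inputs, hypothetical true f(x) = sin(x), plus noise.
rng = random.Random(7)
n = 100
xs = [2 * math.pi * i / (n - 1) for i in range(n)]
truth = [math.sin(x) for x in xs]
ys = [t + rng.gauss(0.0, 0.3) for t in truth]

results = {}
for bandwidth in (0.01, 0.3, 5.0):  # rough fit, moderate fit, very smooth fit
    fitted = [kernel_smooth(x, xs, ys, bandwidth) for x in xs]
    gap_to_data = mean_squared_gap(fitted, ys)      # training error
    gap_to_truth = mean_squared_gap(fitted, truth)  # distance from the true f
    results[bandwidth] = (gap_to_data, gap_to_truth)
    print(f"bandwidth={bandwidth}: gap to data={gap_to_data:.4f}, gap to true f={gap_to_truth:.4f}")
```

The smallest bandwidth reproduces the training data almost exactly yet sits far from the true function, the largest bandwidth flattens out the signal, and the moderate setting comes closest to f.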

Prediction Accuracy and Model Interpretability Trade-Off

In general, as the flexibility of a method increases, its interpretability decreases. Restrictive models have greater interpretability, so they are favored when inference is the primary interest. Linear regression and the lasso are two linear models used for inference, because the relationships between the response and the predictors are easy to see. Linear regression using least squares, while relatively inflexible, is quite interpretable. The lasso is a more restrictive form of linear regression: it sets some coefficients exactly to zero, so the final model relates the response to only a subset of the predictors, which makes it somewhat more interpretable than plain linear regression. Generalized additive models are non-linear models that are also used for inference. They offer more flexibility than linear regression by allowing non-linear relationships between the predictors and the response, at some cost in interpretability. Highly flexible non-linear methods like bagging, boosting, support vector machines with non-linear kernels, and neural networks provide powerful predictive capabilities, often at the cost of interpretability.

Supervised Versus Unsupervised Learning

Most statistical learning problems are either supervised or unsupervised. This distinction was discussed at length previously; here it is explained in the context of inference and prediction. Supervised learning involves fitting a model that relates predictors to a response. It aims for accurate prediction or for inference (understanding the relationships between the predictors and the response variable). Both classical and modern methods are used in supervised learning. Linear regression and logistic regression are classical methods. More modern approaches include generalized additive models (GAMs), boosting, and support vector machines (SVMs).

In unsupervised learning, we only have predictor measurements, without a response variable. This makes it impossible to fit models like linear regression, because there is no response to predict. Unsupervised learning instead aims to understand the relationships between the variables or between the observations. Clustering is a common unsupervised learning method. It identifies groups within the data based on the predictors alone, which is useful in situations like market segmentation where response information is unavailable. Clustering problems can be straightforward when groups are well separated but challenging when the groups overlap. Visual inspection of scatterplots can help identify clusters when there are only a few variables. With more variables, automated clustering methods are necessary.
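As a small illustration of clustering without a response variable, the sketch below runs a basic K-means procedure on two simulated, well-separated groups. K-means is one common automated clustering method, chosen here as an example; the farthest-point initialization and the simulated data are simplifying assumptions.

```python
import math
import random

def k_means(points, k, n_iter=20):
    """Basic K-means: alternate between assigning points and updating centroids."""
    # Simplified initialization: start from the first point, then repeatedly
    # add the point farthest from the centroids chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids

# Two simulated groups of customers described by two predictors; no response is used.
rng = random.Random(3)
group_a = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(50)]
group_b = [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(50)]
points = group_a + group_b

labels, centroids = k_means(points, k=2)
print("cluster sizes:", labels.count(0), labels.count(1))
print("centroids:", [tuple(round(c, 2) for c in centroid) for centroid in centroids])
```

With groups this far apart the two clusters are recovered cleanly; when groups overlap, the assignments near the boundary become much less certain.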

Regression Versus Classification Problems

Statistical learning methods are selected based on the response type. Responses can be characterized as either quantitative or qualitative. Quantitative variables take numerical values. Examples include a person’s age, height, or income, the value of a house, and the price of a stock. Problems with a quantitative response are usually called regression problems. For example, least squares linear regression is used with a quantitative response.

Qualitative variables are categorical. Examples include a person’s marital status (married or not married), the brand of product purchased (brand A, B, or C), whether a person defaults on a debt, and a cancer diagnosis. Problems with a qualitative response are usually called classification problems. Logistic regression is typically used with a qualitative (two-class, or binary) response. Some statistical methods can be used for either quantitative or qualitative responses; examples include K-nearest neighbors and boosting (Gareth James, 2023, p. 27).

Assessing Model Accuracy

In statistics, there is no universal solution. No single method outperforms all others across every possible data set, because different methods work better on different data. This is why it is important to know a variety of statistical learning methods, and why choosing the best method for a given data set is both challenging and crucial.

Measuring the Quality of Fit

To evaluate the performance of a statistical learning method, we need to measure how well its predictions fit a given data set. In regression, the most common measure is the Mean Squared Error (MSE), the average of the squared differences between the observed responses and the predicted responses. A small MSE indicates that the predicted responses closely match the actual responses, whereas a large MSE indicates substantial discrepancies. What we really care about is the accuracy of predictions on unseen test data, since that reflects the model’s usefulness in real-world scenarios (Gareth James, 2023, p. 28). The goal is therefore to identify the method that achieves the lowest test MSE, not the lowest training MSE; a low training MSE does not guarantee a low test MSE. This underscores the difficulty of finding a model that generalizes well beyond the training data (Gareth James, 2023, p. 29).
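The gap between training and test MSE is easy to see in a simulation. The sketch below uses K-nearest neighbors regression purely as an illustrative method (it averages the responses of the K closest training points); the sine-shaped true function, the noise level, and the helper names are assumptions.

```python
import math
import random

def mse(ys, y_hats):
    """Mean squared error: the average of (y_i - yhat_i)^2."""
    return sum((y - y_hat) ** 2 for y, y_hat in zip(ys, y_hats)) / len(ys)

def knn_regress(x0, train_xs, train_ys, k):
    """Predict at x0 by averaging the responses of the k nearest training points."""
    nearest = sorted(range(len(train_xs)), key=lambda i: abs(train_xs[i] - x0))[:k]
    return sum(train_ys[i] for i in nearest) / k

def make_data(n, rng):
    # Hypothetical data-generating process: Y = sin(X) + noise.
    xs = [rng.uniform(0.0, 2 * math.pi) for _ in range(n)]
    ys = [math.sin(x) + rng.gauss(0.0, 0.5) for x in xs]
    return xs, ys

rng = random.Random(11)
train_xs, train_ys = make_data(200, rng)
test_xs, test_ys = make_data(500, rng)

scores = {}
for k in (1, 9):
    train_fit = [knn_regress(x, train_xs, train_ys, k) for x in train_xs]
    test_fit = [knn_regress(x, train_xs, train_ys, k) for x in test_xs]
    scores[k] = (mse(train_ys, train_fit), mse(test_ys, test_fit))
    print(f"K={k}: training MSE={scores[k][0]:.3f}, test MSE={scores[k][1]:.3f}")
```

With K = 1 every training point is its own nearest neighbor, so the training MSE is zero, yet the test MSE is worse than with K = 9. Picking a method by training MSE alone would choose the wrong one here.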

Smoothing splines fit a model by minimizing a residual sum of squares criterion, subject to a penalty for roughness. This lets the fit follow the data’s inherent patterns without overfitting (Gareth James, 2023, p. 290). The flexibility of a model, as measured by its degrees of freedom, influences the training MSE. Generally, increased flexibility leads to a lower training MSE, but it may also result in a higher test MSE if the model overfits the training data. Overfitting occurs when a model picks up patterns in the training data that do not exist in the test data, leading to poor predictive performance (Gareth James, 2023, p. 30). Cross-validation is a crucial technique for estimating the test MSE when separate test data are not available. It holds out parts of the training data to play the role of test data, which shows how well a model is likely to perform in practice and helps us choose the method with the lowest test MSE.
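Here is a minimal sketch of k-fold cross-validation, assuming a least squares line as the model being assessed. The data are simulated and the function names are hypothetical; the point is only to show how held-out folds stand in for test data.

```python
import random
import statistics

def fit_line(xs, ys):
    """Least squares fit of y = intercept + slope * x."""
    x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def k_fold_mse(xs, ys, n_folds=5, seed=0):
    """Estimate the test MSE of a least squares line by k-fold cross-validation."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    squared_errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        # Fit on the remaining folds only, then score on the held-out fold.
        intercept, slope = fit_line([xs[i] for i in train], [ys[i] for i in train])
        squared_errors += [(ys[i] - (intercept + slope * xs[i])) ** 2 for i in held_out]
    return sum(squared_errors) / len(squared_errors)

# Simulated data from a hypothetical linear f with noise variance 1.
rng = random.Random(5)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [1.0 + 2.0 * x + rng.gauss(0, 1) for x in xs]

cv_estimate = k_fold_mse(xs, ys)
print("cross-validated estimate of test MSE:", round(cv_estimate, 3))
```

Every observation is predicted by a model that never saw it during fitting, so the resulting average behaves like a test MSE even though no separate test set exists. In this simulation it should land near the noise variance of 1.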

The Bias-Variance Trade-Off

The expected test Mean Squared Error (MSE) at a given value x0 can be decomposed into the sum of three fundamental quantities: the variance of fhat(x0), the squared bias of fhat(x0), and the variance of the error term ϵ. To minimize the expected test MSE, we need to select a statistical learning method that simultaneously achieves low variance and low bias (Gareth James, 2023, p. 32). Variance is the amount by which fhat would change if it were estimated using a different training data set. How well a method balances variance against bias directly affects the effectiveness and reliability of its predictions (Gareth James, 2023, p. 32).
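Written out, the decomposition described above takes the following standard form, where y0 is the response observed at x0 and the expectation averages over training sets and over the error term:

```latex
E\left[\left(y_0 - \hat{f}(x_0)\right)^2\right]
  = \operatorname{Var}\left(\hat{f}(x_0)\right)
  + \left[\operatorname{Bias}\left(\hat{f}(x_0)\right)\right]^2
  + \operatorname{Var}(\epsilon)
```

All three terms are non-negative, so the expected test MSE can never fall below Var(ϵ), the irreducible error.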

Bias refers to the error that is introduced by approximating a real-life problem, which may be extremely complicated, with a much simpler model. As a general rule, as we use more flexible methods, the variance will increase and the bias will decrease (Gareth James, 2023, p. 33). The relationship between bias, variance, and test MSE is referred to as the bias-variance trade-off (Gareth James, 2023, p. 34). It is easy to find a method with very low bias but high variance, or very low variance but high bias; the challenge is to find one for which both are low.
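The trade-off can be observed by refitting a method on many simulated training sets and looking at its predictions at a single point. The sketch below uses K-nearest neighbors regression as an illustrative method, with K = 1 as the flexible fit and K = 25 as the rigid one; the sine-shaped true function and all parameter values are assumptions.

```python
import math
import random
import statistics

def knn_predict(x0, xs, ys, k):
    """Average the responses of the k training points nearest to x0."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))[:k]
    return sum(ys[i] for i in nearest) / k

def bias_variance_at(x0, k, n_train=50, n_reps=500, noise_sd=0.3, seed=0):
    """Refit on many simulated training sets and summarize the predictions at x0."""
    rng = random.Random(seed)
    preds = []
    for _ in range(n_reps):
        xs = [rng.uniform(0.0, math.pi) for _ in range(n_train)]
        ys = [math.sin(x) + rng.gauss(0.0, noise_sd) for x in xs]
        preds.append(knn_predict(x0, xs, ys, k))
    squared_bias = (statistics.fmean(preds) - math.sin(x0)) ** 2
    variance = statistics.pvariance(preds)
    return squared_bias, variance

x0 = math.pi / 2  # the peak of the hypothetical true function sin(x)
flexible = bias_variance_at(x0, k=1)
rigid = bias_variance_at(x0, k=25)

print("K=1  (flexible): squared bias=%.4f, variance=%.4f" % flexible)
print("K=25 (rigid):    squared bias=%.4f, variance=%.4f" % rigid)
```

The flexible fit has almost no bias but a variance close to the noise variance, while the rigid fit averages away the noise at the price of systematically underestimating the peak.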

The Classification Setting

The error rate is the proportion of mistakes a model makes when the estimate fhat is applied to the training data (Gareth James, 2023, p. 34). It is built from an indicator variable, I(yi≠yhati), that equals 1 if yi does not equal yhati and zero if yi=yhati. If I(yi≠yhati) = 1, then the ith observation was incorrectly classified by the classification method (Gareth James, 2023, p. 35). Averaging the indicator over the data used to train the classifier gives the training error rate (Gareth James, 2023, p. 35). The test error rate is calculated in the same way on a set of test observations of the form (x0,y0) that were not used in training. A good classifier is one for which the test error rate is smallest (Gareth James, 2023, p. 35). A conditional probability in this context is the probability that Y=j, given the observed predictor vector x0 (Gareth James, 2023, p. 35). The Bayes classifier assigns each observation to the class with the highest conditional probability, and it has the lowest possible test error rate of any classifier (Gareth James, 2023, p. 35). In a two-class problem, the points where the probability is exactly 50% form the Bayes decision boundary, which indicates where the classifier switches from predicting one class to the other (Gareth James, 2023, p. 35). The lowest possible test error rate, achieved by the Bayes classifier, is known as the Bayes error rate.

The Bayes error rate is analogous to the irreducible error. For real data we do not know the conditional distribution of Y given X, so the Bayes classifier cannot be computed; it is an unattainable gold standard against which other methods are compared. Many methods instead estimate the conditional distribution of Y given X and then classify each observation to the class with the highest estimated probability (Gareth James, 2023, p. 36). One such method is the K-nearest neighbors (KNN) classifier. Given a positive integer K and a test observation x0, the KNN classifier first identifies the K points in the training data that are closest to x0, denoted N0 (Gareth James, 2023, p. 36). It then estimates the conditional probability for class j as the fraction of points in N0 that belong to class j. As K grows, the method becomes less flexible and produces a decision boundary that is close to linear (Gareth James, 2023, p. 37). KNN therefore illustrates a trade-off between flexibility and the simplicity of the decision boundary, controlled by the number of neighbors considered. Selecting the appropriate flexibility is crucial for the success of any model.

This holds in both regression and classification settings. The challenge is compounded by the bias-variance trade-off, which typically produces a U-shaped curve in the test error: too little flexibility leaves the bias high, while too much inflates the variance. This makes the task of optimizing model performance complex (Gareth James, 2023, p. 38).
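The KNN classifier and the error rate described in this section can be sketched in a few lines. The tiny two-class data set below is made up for illustration, and the function names are hypothetical. Conditional probabilities are estimated as the fraction of the K nearest training points in each class, and the error rate is the average of the indicator I(yi≠yhati).

```python
import math
from collections import Counter

def knn_probabilities(x0, train_points, train_labels, k):
    """Estimate Pr(Y = j | X = x0) as the fraction of the k nearest points in class j."""
    nearest = sorted(range(len(train_points)),
                     key=lambda i: math.dist(train_points[i], x0))[:k]
    counts = Counter(train_labels[i] for i in nearest)
    return {label: count / k for label, count in counts.items()}

def knn_classify(x0, train_points, train_labels, k):
    """Assign x0 to the class with the highest estimated probability."""
    probs = knn_probabilities(x0, train_points, train_labels, k)
    return max(probs, key=probs.get)

def error_rate(true_labels, predicted_labels):
    """Average of the indicator I(y_i != yhat_i)."""
    mistakes = sum(1 for y, y_hat in zip(true_labels, predicted_labels) if y != y_hat)
    return mistakes / len(true_labels)

# Made-up training data with two classes.
train_points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_labels = ["blue", "blue", "blue", "orange", "orange", "orange"]

# Made-up test observations of the form (x0, y0).
test_points = [(0.5, 0.5), (4, 4), (2, 2)]
test_labels = ["blue", "orange", "orange"]

train_predictions = [knn_classify(p, train_points, train_labels, k=3) for p in train_points]
test_predictions = [knn_classify(p, train_points, train_labels, k=3) for p in test_points]

print("training error rate:", error_rate(train_labels, train_predictions))
print("test error rate:", error_rate(test_labels, test_predictions))
print("estimated probabilities at (0.5, 0.5) with K=5:",
      knn_probabilities((0.5, 0.5), train_points, train_labels, k=5))
```

In this toy example the training error rate is zero while one of the three test observations is misclassified, a reminder that the test error rate is the one that matters when judging a classifier.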

Conclusion

As we navigate the realms of statistical and machine learning, it becomes imperative to understand their distinct focuses and applications. Statistical learning focuses on theoretical underpinnings and interpretability, while machine learning prioritizes predictive accuracy and computational efficiency. Statistical learning involves understanding and modeling the relationships between independent (input) and dependent (output) variables using mathematical formulations. Estimating the function f is essential for predicting unknown outcomes from known inputs and understanding the relationships between variables for deeper insights.

Prediction in machine learning involves using models to forecast unknown outcomes based on known data, focusing primarily on accuracy and efficiency. Inference seeks to understand and quantify the relationships and effects between variables, often using models to draw conclusions about the data’s underlying structures. Estimating f involves selecting a statistical model that best captures the relationship between inputs and outputs from the available data. Parametric methods assume a specific form for the function f, using a predefined equation to model the relationship between variables. Non-parametric methods make no assumptions about the form of f, adapting more flexibly to the data’s inherent complexities.

Increasing a model’s flexibility often reduces its interpretability, so a balance is needed to achieve both accurate predictions and understandable results. Supervised learning uses labeled data to predict outcomes, while unsupervised learning finds patterns and structures in data without predefined labels. Regression predicts continuous outcomes, whereas classification assigns data to predefined groups; the two address different types of predictive modeling challenges. Evaluating a model’s performance involves comparing its predictions against actual outcomes to determine its effectiveness.

The quality of fit for a model is typically assessed using statistical measures like Mean Squared Error (MSE), which indicate how well the model’s predictions match the actual data. The bias-variance trade-off involves balancing bias (error from erroneous assumptions in the model) against variance (error from sensitivity to small fluctuations in the training set) to optimize model accuracy. In classification, the goal is to predict discrete labels for data points, and models are assessed by their error rates. With these ideas in hand, you should be well equipped to apply statistical and machine learning methods in a variety of practical scenarios.

Works cited

James, G., Witten, D., Hastie, T., Tibshirani, R., & Taylor, J. (2023). An Introduction to Statistical Learning: with Applications in Python. Springer. https://www.statlearning.com/
