{% extends "layout.html" %} {% block content %}
Story-style intuition: The Wisdom of Crowds
Imagine you want to guess the number of jellybeans in a giant jar. If you ask one person, their guess might be way off. They might be an expert, or they might be terrible at guessing. Their prediction has high variance. But what if you ask 100 different people and take the average of all their guesses? The final averaged guess is almost always much closer to the true number than any single individual's guess. This is the "wisdom of crowds" effect. Bagging applies this same logic to machine learning. Instead of trusting one complex model (one expert guesser), we train many models on slightly different perspectives of the data and combine their predictions to get a more stable and accurate result.
Bagging, short for Bootstrap Aggregating, is a powerful ensemble machine learning technique. Its primary goal is to reduce the variance of a model, thereby preventing overfitting and improving its stability. It works by training multiple instances of the same base model on different random subsets of the training data and then aggregating their predictions.
The process of Bagging is a straightforward three-step method: (1) create several bootstrap samples of the training data, (2) train a separate copy of the base model on each sample, and (3) aggregate the individual models' predictions into a single final prediction.
Example: If our original dataset is `[A, B, C, D]`, a bootstrap sample might be `[B, A, D, B]`. Notice that 'B' was picked twice and 'C' was not picked at all. Each bootstrap sample is the same size as the original dataset.
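A minimal sketch of how such a bootstrap sample could be drawn in code, using NumPy's `rng.choice` with `replace=True` (the array and variable names here are illustrative, not part of the example above):

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Original dataset of 4 items (stands in for the rows of a training set).
data = np.array(["A", "B", "C", "D"])

# A bootstrap sample: same size as the original, drawn WITH replacement,
# so some items can appear more than once and others not at all.
bootstrap_sample = rng.choice(data, size=len(data), replace=True)
print(bootstrap_sample)  # e.g. ['B' 'A' 'D' 'B'] -- exact output depends on the seed
```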
The aggregation step is what combines the "wisdom" of the individual models. For a new data point \(x\) and \(m\) trained models \(f_1, \dots, f_m\), a regression ensemble averages the predictions:
$$ \hat{y} = \frac{1}{m} \sum_{i=1}^{m} f_i(x) $$
while a classification ensemble takes a majority vote:
$$ \hat{y} = \text{majority\_vote}\{f_1(x), ..., f_m(x)\} $$
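A minimal sketch of both aggregation rules with NumPy, assuming the individual models' predictions are already collected in arrays (the numbers are made up for illustration):

```python
import numpy as np

# Regression: predictions of m = 5 models for one data point x.
regression_preds = np.array([10.2, 9.8, 10.5, 10.1, 9.9])
y_hat_regression = regression_preds.mean()  # average of the m predictions

# Classification: class labels predicted by m = 5 models for one data point x.
classification_preds = np.array([1, 0, 1, 1, 0])
values, counts = np.unique(classification_preds, return_counts=True)
y_hat_classification = values[np.argmax(counts)]  # majority vote

print(y_hat_regression)      # 10.1
print(y_hat_classification)  # 1
```

In practice, scikit-learn's `BaggingClassifier` and `BaggingRegressor` handle this aggregation internally, as the full example further below shows.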
| Advantages | Disadvantages |
|---|---|
| ✅ Significantly reduces overfitting and variance. | ❌ Increased Computational Cost: You have to train multiple models instead of just one, which takes more time and resources. |
| ✅ Often leads to a major improvement in accuracy and stability. | ❌ Loss of Interpretability: It's easy to understand and visualize a single decision tree, but it's very difficult to interpret the combined logic of 100 different trees. |
| ✅ Can be applied to almost any type of base model (e.g., trees, SVMs, neural networks); see the sketch below the table. | ❌ Less effective for models that are already stable and have low variance (like Linear Regression). |
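To illustrate the point about arbitrary base models, here is a hedged sketch that swaps the usual decision tree for a support vector machine as the base estimator (the dataset and parameter values are arbitrary choices, not taken from the example that follows):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Bagging works with (almost) any base estimator -- here an SVM instead of a tree.
# The parameter is named base_estimator in scikit-learn < 1.2.
svm_bagging = BaggingClassifier(estimator=SVC(), n_estimators=10, random_state=0)
svm_bagging.fit(X, y)
print(svm_bagging.score(X, y))
```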
In this example, we'll compare a single, complex Decision Tree to a Bagging ensemble of many Decision Trees. We expect the single tree to overfit and perform perfectly on the training data but poorly on the test data. The Bagging classifier should be more robust and perform well on both.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
# --- 1. Create a Sample Dataset ---
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           n_redundant=0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# --- 2. Train a Single Decision Tree (High Variance Model) ---
single_tree = DecisionTreeClassifier(random_state=42)
single_tree.fit(X_train, y_train)
y_pred_tree = single_tree.predict(X_test)
print(f"Single Decision Tree Accuracy: {accuracy_score(y_test, y_pred_tree):.2%}")
# --- 3. Train a Bagging Ensemble of Decision Trees ---
# We create an ensemble of 100 decision trees.
bagging_clf = BaggingClassifier(
    estimator=DecisionTreeClassifier(random_state=42),  # named base_estimator in scikit-learn < 1.2
    n_estimators=100,
    random_state=42
)
bagging_clf.fit(X_train, y_train)
y_pred_bagging = bagging_clf.predict(X_test)
print(f"Bagging Classifier Accuracy: {accuracy_score(y_test, y_pred_bagging):.2%}")
1. Bootstrap refers to creating random subsamples of the data with replacement. Aggregating refers to combining the predictions of the models trained on these subsamples (e.g., by averaging or voting).
2. The main goal of Bagging is to reduce variance. It helps to stabilize unstable models that are prone to overfitting.
3. You would take the average of the price predictions from all the individual models in the ensemble (see the sketch after this list).
4. Linear Regression is a low-variance (stable) model. Its predictions don't change drastically even when the training data is slightly modified. Since Bagging's main strength is reducing variance, it provides little benefit to an already stable model.
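As a sketch of that regression case, scikit-learn's `BaggingRegressor` performs exactly this averaging of the individual models' predictions; the synthetic "price" data below is purely illustrative:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for a house-price dataset.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# Each tree predicts a price; BaggingRegressor averages those predictions.
bagging_reg = BaggingRegressor(estimator=DecisionTreeRegressor(),  # base_estimator in scikit-learn < 1.2
                               n_estimators=50, random_state=0)
bagging_reg.fit(X, y)
print(bagging_reg.predict(X[:3]))  # averaged price predictions for three samples
```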
The Story: Decoding the Jellybean Guesser's Strategy