“Two heads are better than one.”
In the field of machine learning, one of the most effective ways to improve the accuracy of predictive models is through ensemble learning.
Ensemble learning is a machine learning technique that combines multiple models to create a more accurate and robust model than any individual model could achieve on its own. It leverages multiple machine learning algorithms collectively to address classification or regression problems. This is done by training several models, either on the same dataset or on different subsets of it, and then aggregating their predictions to make a final decision. The algorithms used can be of the same type (homogeneous ensemble learning) or of different types (heterogeneous ensemble learning).
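As a toy illustration of the aggregation step, consider a simple majority vote over the hard predictions of three models. The prediction arrays below are hypothetical and exist only to show how the votes are combined:
import numpy as np
# Hypothetical hard predictions (0/1) from three models for five samples.
predictions = np.array([
    [1, 0, 1, 1, 0],  # model A
    [1, 1, 1, 0, 0],  # model B
    [0, 0, 1, 1, 0],  # model C
])
# Majority vote: predict 1 whenever at least two of the three models predict 1.
final_prediction = (predictions.sum(axis=0) >= 2).astype(int)
print(final_prediction)  # [1 0 1 1 0]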
Ensemble learning offers numerous advantages. Here, we’ll examine its most compelling benefits and explain why it’s such a powerful tool.
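The code examples below assume that a training and test split (X_train, X_test, y_train, y_test) is already available. As a minimal setup for running them, a synthetic dataset such as the following would work; make_classification and the 70/30 split are illustrative assumptions, not part of the original examples:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
# Synthetic binary classification data used only to make the snippets runnable.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
The first example builds a bagging ensemble of decision trees.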
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
# Create a decision tree classifier
tree = DecisionTreeClassifier()
# Create a bagging classifier (the 'estimator' argument was called 'base_estimator' before scikit-learn 1.2)
bag = BaggingClassifier(estimator=tree, n_estimators=10, random_state=42)
# Train the bagging classifier
bag.fit(X_train, y_train)
# Make predictions on the test data
y_pred = bag.predict(X_test)
In this example, we create a decision tree classifier and use it as the base estimator for the bagging classifier. We set the number of estimators to 10, which means we will train 10 models on different subsets of the data.
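As a quick sanity check, assuming the shared setup above, you can compare the bagged ensemble against a single decision tree trained on the same split; averaging over bootstrapped trees reduces variance, so the ensemble typically scores at least as well:
from sklearn.metrics import accuracy_score
# Compare the bagged ensemble against a single decision tree on the same split.
single_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print(f"Single tree accuracy: {accuracy_score(y_test, single_tree.predict(X_test)):.2f}")
print(f"Bagging accuracy:     {accuracy_score(y_test, y_pred):.2f}")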
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
# Create the individual classifiers
lr = LogisticRegression(random_state=42)
knn = KNeighborsClassifier()
tree = DecisionTreeClassifier(random_state=42)
# Create the ensemble classifier
ensemble = VotingClassifier(estimators=[('lr', lr), ('knn', knn), ('tree', tree)], voting='hard')
# Train the ensemble classifier
ensemble.fit(X_train, y_train)
# Make predictions on the test data
y_pred = ensemble.predict(X_test)
In this example, we create three different classifiers (logistic regression, k-nearest neighbors, and decision tree) and use them as the base estimators for the voting classifier. We set the voting parameter to ‘hard’, which means that the predictions are based on a simple majority vote.
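An alternative is soft voting, which averages the predicted class probabilities instead of counting votes; a minimal variant of the same ensemble, assuming all base estimators support predict_proba (which logistic regression, k-nearest neighbors, and decision trees do), looks like this:
# Soft voting averages predicted class probabilities rather than counting votes.
soft_ensemble = VotingClassifier(
    estimators=[('lr', lr), ('knn', knn), ('tree', tree)],
    voting='soft'
)
soft_ensemble.fit(X_train, y_train)
y_pred_soft = soft_ensemble.predict(X_test)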
from sklearn.ensemble import RandomForestClassifier
# Create a random forest classifier
rf = RandomForestClassifier(n_estimators=100, max_features='sqrt', random_state=42)
# Train the random forest classifier
rf.fit(X_train, y_train)
# Make predictions on the test data
y_pred = rf.predict(X_test)
In this example, we create a random forest classifier with 100 trees and use the square root of the number of features as the maximum number of features to consider at each split. We then train the classifier on the training data and make predictions on the test data.
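Beyond predictions, a fitted random forest also exposes impurity-based feature importances, which offer a quick (if rough) view of which inputs the trees relied on. A minimal sketch, assuming the rf model above:
import numpy as np
# Impurity-based feature importances, averaged over all trees in the forest.
importances = rf.feature_importances_
top = np.argsort(importances)[::-1][:5]
for i in top:
    print(f"feature {i}: importance {importances[i]:.3f}")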
from imblearn.ensemble import BalancedBaggingClassifier
from sklearn.tree import DecisionTreeClassifier
# Create a decision tree classifier
tree = DecisionTreeClassifier()
# Create a balanced bagging classifier (recent imbalanced-learn versions use 'estimator' instead of 'base_estimator')
bag = BalancedBaggingClassifier(estimator=tree, n_estimators=10, random_state=42)
# Train the balanced bagging classifier
bag.fit(X_train, y_train)
# Make predictions on the test data
y_pred = bag.predict(X_test)
In this example, we create a decision tree classifier and use it as the base estimator for the balanced bagging classifier. We set the number of estimators to 10, which means we will train 10 models on different subsets of the data. The balanced bagging classifier uses a resampling strategy to balance the class distribution in each subset of the data.
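With imbalanced classes, overall accuracy alone can be misleading, so a per-class breakdown is usually more informative. A minimal check on the predictions above:
from sklearn.metrics import classification_report
# Per-class precision, recall, and F1 for the balanced bagging predictions.
print(classification_report(y_test, y_pred))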
import joblib
from dask.distributed import Client
from sklearn.ensemble import RandomForestClassifier
# Connect to a Dask cluster (a local cluster by default)
client = Client()
# Create a random forest classifier
rf = RandomForestClassifier(n_estimators=100, max_features='sqrt', random_state=42)
# Train the random forest classifier, letting Dask distribute the work via joblib
with joblib.parallel_backend('dask'):
    rf.fit(X_train, y_train)
# Make predictions on the test data
y_pred = rf.predict(X_test)
In this example, we create a random forest classifier with 100 trees and use the square root of the number of features as the maximum number of features to consider at each split. We train the classifier using Dask's joblib backend, which distributes the fitting of the individual trees across the workers of a Dask cluster. This allows us to use multiple processors or even multiple machines.
There are different types of ensemble learning techniques, which differ mainly by the type of models used (homogeneous or heterogeneous), the data sampling (with or without replacement, k-fold, etc.), and the decision function (voting, averaging, a meta-model, etc.). Ensemble learning techniques can therefore be broadly classified as bagging, boosting, stacking, and voting; the examples below illustrate each of them on the iris dataset.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
print(f"Random Forest Accuracy: {accuracy_score(y_test, y_pred):.2f}")
from sklearn.ensemble import AdaBoostClassifier
adb = AdaBoostClassifier(n_estimators=100, random_state=42)
adb.fit(X_train, y_train)
y_pred_adb = adb.predict(X_test)
print(f"AdaBoost Accuracy: {accuracy_score(y_test, y_pred_adb):.2f}")
from sklearn.ensemble import StackingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
estimators = [
('rf', RandomForestClassifier(n_estimators=10, random_state=42)),
('adb', AdaBoostClassifier(n_estimators=10, random_state=42)),
('dt', DecisionTreeClassifier(random_state=42))
]
stacking_clf = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression())
stacking_clf.fit(X_train, y_train)
y_pred_stack = stacking_clf.predict(X_test)
print(f"Stacking Accuracy: {accuracy_score(y_test, y_pred_stack):.2f}")
from sklearn.ensemble import VotingClassifier
voting_clf = VotingClassifier(estimators=estimators, voting='hard')
voting_clf.fit(X_train, y_train)
y_pred_voting = voting_clf.predict(X_test)
print(f"Voting Accuracy: {accuracy_score(y_test, y_pred_voting):.2f}")
Random Forests: An ensemble of decision trees, typically trained via bagging, that introduces randomness in tree construction to further reduce correlation between individual trees.
AdaBoost: Iteratively reweights data samples, focusing more on previously misclassified instances. Each learner’s contribution to the final model is weighted based on accuracy.
Gradient Boosting: Constructs models sequentially by fitting the residuals of previous models. Widely used due to high predictive accuracy and flexibility (a minimal scikit-learn sketch follows this list).
XGBoost and LightGBM: Optimized gradient boosting algorithms designed for speed and scalability, popular for structured/tabular data.
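Since the snippets above cover random forests, AdaBoost, stacking, and voting but not gradient boosting, here is a minimal sketch using scikit-learn's GradientBoostingClassifier on the same iris split; the hyperparameters shown are illustrative defaults, not tuned values:
from sklearn.ensemble import GradientBoostingClassifier
# Each stage fits a new tree to the residual errors of the ensemble built so far.
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
gb.fit(X_train, y_train)
y_pred_gb = gb.predict(X_test)
print(f"Gradient Boosting Accuracy: {accuracy_score(y_test, y_pred_gb):.2f}")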
Recent research has focused on improving ensemble methods through better aggregation strategies, deeper integration with neural network architectures (deep ensembles), and optimizing computational efficiency through distributed computing frameworks. Techniques such as Bayesian ensembles and reinforcement learning-based stacking methods have emerged to further enhance model robustness and accuracy.
Future research in ensemble learning is likely to build further on these directions.
Ensemble learning is a foundational technique in machine learning, substantially enhancing predictive performance. Understanding its methods, algorithms, and practical considerations is crucial for applying ensembles effectively to real-world problems.