Boosting
There is another important approach to ensembling, called boosting, where we add models instead of averaging them. Here is how boosting works:

1. Train a small model that underfits your dataset.
2. Calculate the predictions in the training set for this model.
3. Subtract the predictions from the targets; these are called the "residuals" and represent the error for each point in the training set.
4. Go back to step 1, but instead of using the original targets, use the residuals as the targets for the training.
5. Continue doing this until you reach some stopping criterion, such as a maximum number of trees, or you observe your validation set error getting worse.
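The steps above can be sketched in a few lines of plain Python. This is only a minimal illustration, assuming one-dimensional inputs and a hand-written one-split "stump" as the small model; the names `fit_stump`, `boost`, and `predict` and the toy data are made up for this example and do not come from any library.

```python
def fit_stump(xs, targets):
    """Fit a one-split regression stump (hypothetical helper for this sketch)."""
    best = None
    # Candidate thresholds: every distinct x except the largest,
    # so both sides of the split are always non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds=20):
    """Train n_rounds stumps, each on the residuals left by the previous ones."""
    stumps = []
    history = []           # mean squared residual after each round
    residuals = list(ys)   # the first model is trained on the original targets
    for _ in range(n_rounds):
        stump = fit_stump(xs, residuals)          # train a small model
        preds = [stump(x) for x in xs]            # predict on the training set
        # Subtract the predictions; the result is the next round's target.
        residuals = [r - p for r, p in zip(residuals, preds)]
        stumps.append(stump)
        history.append(sum(r * r for r in residuals) / len(residuals))
    return stumps, history


def predict(stumps, x):
    """The ensemble prediction is the sum of every stump's prediction."""
    return sum(stump(x) for stump in stumps)


# Toy data, invented for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 6.3, 6.8, 8.1]
stumps, history = boost(xs, ys, n_rounds=20)
print("mean squared residual after round 1:", history[0])
print("mean squared residual after round 20:", history[-1])
```

The stopping criterion here is just a fixed number of rounds; checking a validation set, as the last step suggests, would need a held-out split that this sketch leaves out.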

Using this approach, each new tree will be attempting to fit the error of all of the previous trees combined. Because we are continually creating new residuals, by subtracting the predictions of each new tree from the residuals from the previous tree, the residuals will get smaller and smaller.

To make predictions with an ensemble of boosted trees, we calculate the predictions from each tree, and then add them all together.
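One way to see why adding works is to write the residuals out explicitly. The notation below is an assumption made for this note, not something from the quoted text: $y$ is the target for one training point, $m_k$ is the prediction of the $k$-th tree for that point, and $r_k$ is the residual left after $k$ trees.

```latex
\begin{aligned}
r_1 &= y - m_1 \\
r_2 &= r_1 - m_2 = y - (m_1 + m_2) \\
&\;\;\vdots \\
r_n &= r_{n-1} - m_n = y - (m_1 + m_2 + \cdots + m_n) \\[4pt]
\Rightarrow\quad y &= (m_1 + m_2 + \cdots + m_n) + r_n
\end{aligned}
```

So the sum of the tree predictions differs from the target by exactly the last residual $r_n$. Only the later trees predict small values; the first tree predicts something on the scale of the target itself. For example, if the target is 10 and the trees predict 7, then 2, then 0.8, the residuals are 3, 1, and 0.2, and the summed prediction is 9.8.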

"so instead of using the original targets, use the residuals as the targets for the training…"
So here, instead of trying to predict the targets, we predict the residuals, which keep getting smaller as the error shrinks, so the predictions are small too. After that, the text says "To make predictions with an ensemble of boosted trees, we calculate the predictions from each tree, and then add them all together". I don't see what logic is being used to calculate the predictions. How can you just add up the predictions of n_trees, which are very small, and hope the sum matches the expected value? Please help me out here.

In the context of boosting, residuals are the differences between the actual target values and the predictions made by the ensemble of weak models built so far. In the residual-fitting scheme described above, they become the training targets for the next model. Other boosting algorithms use each example's error differently: AdaBoost, for instance, uses it to update per-example weights that guide the subsequent iterations.

To understand how errors can guide training through example weights, let's consider the AdaBoost algorithm as an example. In AdaBoost, each base model (weak learner) is trained on the dataset with updated weights assigned to each training example. Initially, all examples are assigned equal weights, but as the boosting process progresses, the weights are adjusted based on the performance of the previous base models.

During training, after each base model makes its predictions, the errors are measured by comparing the actual target values with those predictions. AdaBoost is a classification algorithm, so these are misclassifications rather than numeric residuals. Because the weights accumulate across rounds, they highlight the examples that the current ensemble finds difficult to classify correctly.

The next base model is then trained on the same dataset with updated weights. The algorithm assigns higher weights to the examples that the latest base model misclassified and lower weights to those it got right. As a result, the subsequent base model focuses more on the difficult examples that the previous models got wrong.

The process of updating the weights and training subsequent models is repeated for a fixed number of iterations or until a stopping criterion is met. At the end of the boosting process, the final prediction is obtained by aggregating the predictions of all the base models, usually using a weighted combination.
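To make the weight update concrete, here is a minimal sketch of one round of the standard binary AdaBoost update, assuming labels and predictions in {-1, +1} and example weights that sum to 1. The function names `adaboost_round` and `ensemble_predict` and the toy values are invented for this illustration.

```python
import math


def adaboost_round(weights, labels, predictions):
    """One AdaBoost weight update (labels and predictions are -1 or +1).

    Assumes the incoming weights already sum to 1.
    """
    # Weighted error: total weight of the misclassified examples.
    err = sum(w for w, y, p in zip(weights, labels, predictions) if y != p)
    # Clamp to avoid dividing by zero or taking log(0) for a perfect/terrible learner.
    err = min(max(err, 1e-10), 1 - 1e-10)
    # The model's vote: larger when its weighted error is smaller.
    alpha = 0.5 * math.log((1 - err) / err)
    # y * p is +1 for a correct prediction (weight shrinks)
    # and -1 for a mistake (weight grows).
    new_weights = [w * math.exp(-alpha * y * p)
                   for w, y, p in zip(weights, labels, predictions)]
    total = sum(new_weights)
    return alpha, [w / total for w in new_weights]


def ensemble_predict(alphas, votes):
    """Weighted combination of the base models' -1/+1 votes for one example."""
    score = sum(a * v for a, v in zip(alphas, votes))
    return 1 if score >= 0 else -1


# Toy round: five equally weighted examples, the weak learner misses the last one.
labels = [1, 1, -1, -1, 1]
predictions = [1, 1, -1, -1, -1]
weights = [0.2] * 5
alpha, new_weights = adaboost_round(weights, labels, predictions)
print("alpha:", alpha)
print("new weights:", new_weights)
```

After this round the misclassified example carries more weight than each of the others, so the next weak learner is pushed to get it right.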

By using errors to adjust the weights of training examples, boosting algorithms such as AdaBoost effectively emphasize the examples that are challenging for the current ensemble. This iterative approach allows the boosting model to focus on misclassified examples, gradually improving its performance and creating a stronger predictive model.

So what's happening is that the weights are changed to bring the residuals down to a minimal value, which results in the best possible weights for the model.
Thank you for the explanation.

How can I predict the true value if I'm adding the predicted error? How do you come to the conclusion that this value is close to the real expected value?
Or is it a "predicted" - "error" kind of logic?
Isn't this the same as just using the sum of the predictions from the many weak models in the ensemble, because the predicted values, when summed, cancel out the error of the "predicted" variable?