Advantages and Disadvantages of Random Forest
Random Forest is an extension of bagging. In addition to training each tree on a random subset of the data, it also selects a random subset of the features rather than using all features to grow the trees. Combining many such randomized trees gives a Random Forest.
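The two sources of randomness described above can be sketched in a few lines of Python. This is only an illustration, not a full implementation: the function name `draw_tree_inputs` and the square-root rule for the number of features are assumptions made for the example, and the features are drawn once per tree for simplicity, whereas common implementations redraw the candidate features at every split.

```python
import math
import random

def draw_tree_inputs(n_rows, n_features, rng):
    """Return (row_indices, feature_indices) for one tree (hypothetical helper)."""
    # Bagging step: sample rows WITH replacement (a bootstrap sample).
    rows = [rng.randrange(n_rows) for _ in range(n_rows)]
    # Extra Random Forest step: keep only a random subset of the features.
    # sqrt(n_features) is a common heuristic, assumed here for illustration.
    k = max(1, int(math.sqrt(n_features)))
    features = sorted(rng.sample(range(n_features), k))  # WITHOUT replacement
    return rows, features

rng = random.Random(0)
rows, features = draw_tree_inputs(n_rows=8, n_features=16, rng=rng)
print("rows used by this tree:    ", rows)
print("features used by this tree:", features)
```

Each tree in the forest would get its own draw, so no two trees see exactly the same rows and features.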
Advantages of Random Forest:
1. The algorithm can solve both classification and regression problems and gives decent estimates for both.
2. The benefit of Random Forest that excites me most is its ability to handle large data sets with high dimensionality. It can handle thousands of input variables and identify the most significant ones, so it is sometimes considered a dimensionality reduction method. Further, the model outputs the importance of each variable, which can be a very handy feature.
3. It has an effective method for estimating missing data and maintains accuracy when a large proportion of the data is missing.
4. It has methods for balancing errors in data sets where classes are imbalanced.
5. The capabilities above can be extended to unlabeled data, leading to unsupervised clustering, data views and outlier detection.
6. Random Forest samples the input data with replacement, which is called bootstrap sampling. About one third of the data is not used to train a given tree and can be used for testing. These are called the out-of-bag samples.
7. It handles higher dimensionality data very well.
8. It also handles missing values and maintains accuracy for missing data.
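The "about one third" figure for out-of-bag samples in point 6 can be checked with a short simulation. The sketch below uses only the standard library; the function name `oob_fraction` and the sample sizes are assumptions chosen for illustration. For a data set of n rows, the chance that a given row is never drawn in n draws with replacement is (1 - 1/n)^n, which approaches 1/e (roughly 0.368) as n grows.

```python
import random

def oob_fraction(n_rows, rng):
    """Fraction of rows that a single bootstrap sample never draws."""
    drawn = {rng.randrange(n_rows) for _ in range(n_rows)}
    return 1 - len(drawn) / n_rows

rng = random.Random(42)
# Average over many bootstrap samples to smooth out the noise.
estimates = [oob_fraction(1000, rng) for _ in range(200)]
mean_oob = sum(estimates) / len(estimates)
print(f"average out-of-bag fraction: {mean_oob:.3f}")  # expected to be near 1/e
```

So "one third" is a convenient rounding: slightly more than a third of the rows are left out of each tree.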
Disadvantages of Random Forest:
1. It does a good job at classification but is not as good for regression, as it does not give precise continuous predictions. In regression, it cannot predict beyond the range of the training data, and it may over-fit data sets that are particularly noisy.
2. Random Forest can feel like a black box approach for statistical modelers: you have very little control over what the model does. At best, you can try different parameters and random seeds.
3. Since the final prediction is the mean of the predictions from the individual trees, it won’t give precise values for a regression model.
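The regression limitation in points 1 and 3 follows from the averaging itself: every tree predicts a mean of training targets, so the averaged prediction can never leave the range of the training targets. The sketch below illustrates this with a deliberately simplified stand-in for a forest: bagged one-split "stumps" on a single feature, with no feature subsampling. All function names and the toy data are assumptions made for the example.

```python
import random

def fit_stump(xs, ys):
    """Fit a one-split regression stump by minimising squared error."""
    best = None
    for t in sorted(set(xs))[1:]:  # skip the smallest x so the left side is never empty
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:  # bootstrap sample contained a single distinct x
        m = sum(ys) / len(ys)
        return (float("inf"), m, m)
    return best[1:]  # (threshold, left_mean, right_mean)

def predict_stump(stump, x):
    t, lm, rm = stump
    return lm if x < t else rm

def fit_forest(xs, ys, n_trees=50, seed=0):
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample of row indices
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest

def predict_forest(forest, x):
    # Final prediction is the mean of the individual tree predictions.
    return sum(predict_stump(s, x) for s in forest) / len(forest)

# Toy data: y = 2x for x in 0..9, so the training targets range from 0 to 18.
xs = list(range(10))
ys = [2.0 * x for x in xs]
forest = fit_forest(xs, ys)

at_edge = predict_forest(forest, 9)
outside = predict_forest(forest, 100)  # the true value would be 200
print("prediction at x=9:  ", at_edge)
print("prediction at x=100:", outside)
```

The prediction at x=100 is the same as at x=9 and stays within the training target range, even though the underlying relationship keeps growing. Deeper trees and more features change the details but not this bound.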