Machine Learning Terms Every Data Scientist Should Know

Overfitting and Underfitting:

Overfitting occurs when a statistical model or machine learning algorithm captures the noise of the data.  Intuitively, overfitting occurs when the model or the algorithm fits the data too well.  Specifically, overfitting occurs if the model or algorithm shows low bias but high variance.  Overfitting is often a result of an excessively complicated model, and it can be prevented by fitting multiple models and using validation or cross-validation to compare their predictive accuracies on test data.

Underfitting occurs when a statistical model or machine learning algorithm cannot capture the underlying trend of the data.  Intuitively, underfitting occurs when the model or the algorithm does not fit the data well enough.  Specifically, underfitting occurs if the model or algorithm shows low variance but high bias.  Underfitting is often a result of an excessively simple model.
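A minimal sketch of how this shows up in practice (assuming scikit-learn and a synthetic make_classification data set, which are not part of the original text): a depth-1 tree tends to underfit, an unconstrained tree tends to overfit, and cross-validation lets you compare their predictive accuracy on held-out data.

```python
# Illustrative sketch: comparing under-, over-, and reasonably-fit trees
# with 5-fold cross-validation (scikit-learn assumed to be installed).
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=42)

for name, model in [
    ("underfit (max_depth=1)", DecisionTreeClassifier(max_depth=1, random_state=42)),
    ("overfit (max_depth=None)", DecisionTreeClassifier(max_depth=None, random_state=42)),
    ("balanced (max_depth=4)", DecisionTreeClassifier(max_depth=4, random_state=42)),
]:
    # Cross-validation scores each model on held-out folds, which is where
    # high-variance (overfit) and high-bias (underfit) models fall short.
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")
```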

Ensemble methods

Ensemble methods are meta-algorithms that combine several machine learning techniques into one predictive model to decrease variance (bagging), decrease bias (boosting), or improve predictions (stacking).

Bagging: short for Bootstrap Aggregating, this uses the statistical technique of bootstrap sampling to select samples from a data set.  Each data point has an equal probability of being picked and can be picked repeatedly (that is, sampling with replacement).
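Here is a bare-bones sketch of bootstrap sampling using plain NumPy (the data set and variable names are illustrative only): each draw is made with replacement, so rows can repeat within a sample.

```python
# Illustrative bootstrap sampling: each base learner in a bagging ensemble
# would be fit on one of these resampled sets of rows.
import numpy as np

rng = np.random.default_rng(0)
X = np.arange(10)                      # stand-in for a data set of 10 rows

n_estimators = 3
for i in range(n_estimators):
    # Sample len(X) row indices with replacement: every row has an equal
    # chance of being picked and may appear more than once.
    sample_idx = rng.choice(len(X), size=len(X), replace=True)
    print(f"bootstrap sample {i}: rows {sample_idx}")
```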

Random Subspace: randomly picking a subset of features to build each decision tree, so that multiple trees are trained on different feature subsets.  You could build a single tree with all the features, but a more complex model does not necessarily increase the model’s accuracy.
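One way to sketch the random subspace method (assuming scikit-learn; this particular configuration is my illustration, not from the original text) is BaggingClassifier with row bootstrapping turned off and max_features below 1.0, so each tree sees only a random subset of the features.

```python
# Illustrative random subspace ensemble: all rows, half the features per tree.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=42)

subspace = BaggingClassifier(
    n_estimators=50,
    max_features=0.5,      # each tree is built on a random half of the features
    bootstrap=False,       # keep all rows; only the feature set varies per tree
    random_state=42,
)
print(cross_val_score(subspace, X, y, cv=5).mean())
```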

Random Forest uses a combination of both bagging and the random subspace method.  You can use either technique independently, but when used together you get the parallel ensemble method known as random forest.  The primary motivation of parallel methods is to exploit independence between the base learners, since the error can be reduced dramatically by averaging.  For more detail, see Random Forest.
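As a rough sketch (assuming scikit-learn and the same synthetic data set as above), RandomForestClassifier wires both ideas together: bootstrap row sampling plus a random feature subset considered at each split, with the independent trees averaged.

```python
# Illustrative random forest: bagging of rows + random subspace of features.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=42)

forest = RandomForestClassifier(
    n_estimators=100,
    max_features="sqrt",   # random subset of features considered at each split
    bootstrap=True,        # bootstrap sampling of the training rows
    random_state=42,
)
print(cross_val_score(forest, X, y, cv=5).mean())
```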

Homogeneous ensembles: ensembles built from a single base learning algorithm (for example, many decision trees) to create the model.

Combining stable (low-variance) models is less advantageous, since averaging them does little to improve generalization performance.

Boosting: each tree is built sequentially, each from a different (reweighted) subset of the training set, so later trees concentrate on the examples earlier trees handled poorly. See more details on Boosting.
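A minimal sketch of sequential boosting (assuming scikit-learn's AdaBoostClassifier, which is my choice of example rather than something named in the original text): shallow trees are built one after another, and each round re-weights the training points the previous trees misclassified.

```python
# Illustrative boosting ensemble: trees are added sequentially, each focusing
# on the points the ensemble so far gets wrong.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=42)

booster = AdaBoostClassifier(n_estimators=100, random_state=42)
print(cross_val_score(booster, X, y, cv=5).mean())
```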

Gradient Descent: each tree incorporates some learning from the previous tree that was built, adjusting the probability of a data point appearing in the next training set.
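For reference, gradient descent itself is an optimization step: move a parameter a small amount against the gradient of the error. Here is a bare-bones sketch in plain NumPy (the slope-fitting example and its numbers are purely illustrative).

```python
# Illustrative gradient descent on a one-parameter squared-error loss.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 3.0 * x                              # true slope is 3

w = 0.0                                  # initial guess for the slope
learning_rate = 0.01
for step in range(200):
    pred = w * x
    grad = 2 * np.mean((pred - y) * x)   # d/dw of the mean squared error
    w -= learning_rate * grad            # step against the gradient
print(round(w, 3))                       # converges toward 3.0
```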

Gradient Boosted: the combination of the boosting and gradient descent techniques. This is another popular ensemble technique in which correctly classified data points are down-weighted and incorrectly classified ones are up-weighted, so each subsequent iteration concentrates on the incorrectly classified data points.
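A minimal sketch (assuming scikit-learn's GradientBoostingClassifier and the same synthetic data as above): one shallow tree is added per iteration, each fit to the errors of the ensemble so far, so later trees concentrate on the points it still gets wrong.

```python
# Illustrative gradient-boosted trees: sequential trees, each correcting the
# residual errors of the current ensemble, scaled by a learning rate.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=42)

gbt = GradientBoostingClassifier(
    n_estimators=100,      # number of sequential trees
    learning_rate=0.1,     # how much each new tree contributes
    max_depth=3,
    random_state=42,
)
print(cross_val_score(gbt, X, y, cv=5).mean())
```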

Note for gradient-boosted trees: for larger data sets, use fewer estimators; for smaller data sets, use more estimators.

Hyperparameter Tuning: an optimization technique that examines a limited set of combinations while still finding the best parameters for building the model or decision tree.  Brute force could do it, but there may be too many combinations to traverse to find the best parameters for the algorithm.  This series of scored trials over the specified combination range is called hyperparameter tuning.
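As a rough sketch (assuming scikit-learn's RandomizedSearchCV; the parameter ranges here are illustrative, not from the original text): instead of brute-forcing every combination, a limited number of combinations is sampled from the specified ranges, each trial is scored by cross-validation, and the best-scoring parameters are kept.

```python
# Illustrative hyperparameter tuning: 10 scored trials drawn from the ranges
# below, far fewer than the full grid, with the best combination reported.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=500, n_features=20, random_state=42)

param_distributions = {
    "n_estimators": [50, 100, 200, 400],
    "max_depth": [3, 5, 10, None],
    "max_features": ["sqrt", "log2", None],
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=10,             # number of scored trials
    cv=5,
    random_state=42,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```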

Thanks for reading ❤

Follow me: Instagram | Facebook | LinkedIn