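Mixture of Experts (MoE) models combine multiple sub-models, called experts, that each handle a different part of the input space, with a router (or gating mechanism) that decides which experts process each input. Training effectively divides the data among the experts: the router learns where to direct each input, and each expert specializes on the inputs routed to it. Activation is sparse, meaning only a few experts, and therefore only a fraction of the model's parameters, are used for any given input. Techniques such as load balancing and expert capacity limits keep the work spread evenly across experts during training. MoE models can be built through upcycling (initializing the experts from an existing dense model) or through sparse splitting. Compared with a dense model of the same total parameter count, MoEs offer faster pretraining and inference, because each input touches only part of the network. They also present training challenges, such as imbalanced routing, where a few experts receive most of the inputs, and high resource requirements, since every expert must be kept in memory even though only a few are active at a time. These can be mitigated with techniques such as regularization and specialized algorithms.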
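To make routing, expert capacity, and load balancing concrete, the following is a minimal sketch in plain Python rather than a definitive implementation. The function names `route_top_k` and `load_balance_loss`, the capacity factor of 1.25, and the random router logits are all illustrative assumptions. The balance term uses one commonly used form of auxiliary loss: the number of experts times the sum, over experts, of the fraction of inputs whose first choice is that expert multiplied by the mean router probability for that expert.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k, capacity):
    """Send each input to its top-k experts, subject to a per-expert capacity.

    router_logits: one list of logits per input, one logit per expert.
    Returns (assignments, all_probs, load), where assignments[t] is a list of
    (expert_index, gate_weight) pairs for input t. Assignments that would push
    an expert past `capacity` are dropped (an assumption of this sketch).
    """
    num_experts = len(router_logits[0])
    load = [0] * num_experts
    assignments, all_probs = [], []
    for logits in router_logits:
        probs = softmax(logits)
        all_probs.append(probs)
        # Sparse activation: only the k highest-probability experts are considered.
        top = sorted(range(num_experts), key=lambda e: probs[e], reverse=True)[:k]
        kept = []
        for e in top:
            if load[e] < capacity:
                load[e] += 1
                kept.append(e)
        # Renormalize gate weights over the experts that were actually kept.
        norm = sum(probs[e] for e in kept)
        assignments.append([(e, probs[e] / norm) for e in kept])
    return assignments, all_probs, load


def load_balance_loss(all_probs, num_experts):
    """Auxiliary balance term: num_experts * sum_i(frac_i * mean_prob_i).

    frac_i is the fraction of inputs whose first choice is expert i, and
    mean_prob_i is the average router probability given to expert i.
    """
    num_inputs = len(all_probs)
    frac = [0.0] * num_experts
    mean_prob = [0.0] * num_experts
    for probs in all_probs:
        top1 = max(range(num_experts), key=lambda e: probs[e])
        frac[top1] += 1 / num_inputs
        for e in range(num_experts):
            mean_prob[e] += probs[e] / num_inputs
    return num_experts * sum(f * p for f, p in zip(frac, mean_prob))


# Illustrative run with random logits standing in for a learned router.
random.seed(0)
num_inputs, num_experts, k = 8, 4, 2
capacity = math.ceil(1.25 * k * num_inputs / num_experts)  # assumed capacity factor
logits = [[random.gauss(0, 1) for _ in range(num_experts)] for _ in range(num_inputs)]
assignments, probs, load = route_top_k(logits, k, capacity)
print("per-expert load:", load, "capacity:", capacity)
print("auxiliary balance loss:", load_balance_loss(probs, num_experts))
```

In this sketch the balance term is about 1 when the router spreads probability uniformly and approaches the number of experts when every input is routed to the same expert, so adding it to the training loss as a regularizer would penalize imbalanced routing. Dropping over-capacity assignments is only one possible policy; a real system might reroute or pass such inputs through unchanged instead.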