Training large language models (LLMs) is challenging due to the large amount of GPU memory and long training times required. Several parallelism paradigms enable model training across multiple GPUs, and various model architecture and memory-saving designs make it possible to train very large neural networks. The optimal model size and number of training tokens should be scaled equally, with a doubling of model size requiring a doubling of training tokens. Current large language models are significantly under-trained. Techniques such as data parallelism, model parallelism, pipeline parallelism, and tensor parallelism can be used to distribute the training workload. Other strategies include CPU offloading, activation recomputation, mixed-precision training, and compression to save memory.