The Qwen2 series of large language models introduces several enhancements over its predecessors. It adopts Grouped Query Attention (GQA), which shrinks the key-value (KV) cache and so improves inference efficiency, and it combines Dual Chunk Attention (DCA) with YARN, which rescales attention weights, to handle long contexts. The series' Mixture-of-Experts (MoE) model uses fine-grained experts. The pre-training corpus was expanded to 7 trillion tokens, with a larger share of code, mathematics, and multilingual content, and post-training consists of supervised fine-tuning (SFT) followed by direct preference optimization (DPO). Together, these changes improve performance in coding, mathematics, multilingual tasks, and long-context scenarios.
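
To make the link between GQA and a smaller KV cache concrete, the following is a minimal sketch in Python, not the Qwen2 implementation. The function names and the layer, head, and sequence-length values are hypothetical and chosen only for illustration. It assumes the usual GQA layout, in which consecutive query heads share one key-value head, and counts cached elements as one key and one value vector per layer, per KV head, per token.

```python
def kv_head_for_query_head(q_head: int, n_q_heads: int, n_kv_heads: int) -> int:
    """Return the index of the KV head that a given query head reads from.

    Assumes consecutive query heads are grouped onto one shared KV head.
    """
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be a multiple of n_kv_heads")
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size


def kv_cache_elements(n_layers: int, n_kv_heads: int, head_dim: int, seq_len: int) -> int:
    """Count cached scalars: keys and values (factor 2) for every layer, KV head, and token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len


# Hypothetical configuration, for illustration only (not a Qwen2 model size).
N_LAYERS, N_Q_HEADS, N_KV_HEADS, HEAD_DIM, SEQ_LEN = 4, 8, 2, 64, 1024

# Which KV head each of the 8 query heads uses under GQA.
head_map = [kv_head_for_query_head(h, N_Q_HEADS, N_KV_HEADS) for h in range(N_Q_HEADS)]

# Standard multi-head attention keeps one KV head per query head.
mha_cache = kv_cache_elements(N_LAYERS, N_Q_HEADS, HEAD_DIM, SEQ_LEN)
gqa_cache = kv_cache_elements(N_LAYERS, N_KV_HEADS, HEAD_DIM, SEQ_LEN)

print(head_map)
print(mha_cache, gqa_cache, mha_cache // gqa_cache)
```

Under these assumptions the cache shrinks by the ratio of query heads to KV heads, which is why GQA matters most at long sequence lengths, where the cache dominates inference memory.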