Meituan Open-Sources LongCat-Next: 3B Parameters Unifying Visual Understanding, Generation, and Speech
LongCat-Next, open-sourced by Meituan's LongCat team, is a multimodal model built on a Mixture-of-Experts (MoE) architecture that integrates five capabilities, including text, visual understanding, image generation, and speech. Its core design, DiNA, uses discrete tokens to handle all of these tasks in a unified way, while dNaViT, used on the vision side, enables strong image generation performance. In benchmark comparisons with similar models, LongCat-Next leads on a range of metrics, showing its strength in both multimodal understanding and generation.
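To make the idea of unifying tasks through discrete tokens more concrete, here is a minimal toy sketch of one common way to place several modalities in a single token space: each modality keeps its own codebook, and an offset maps its local indices into a shared vocabulary. This is an illustration under stated assumptions, not the DiNA or dNaViT implementation; the vocabulary sizes, the offset scheme, and the helper names `to_unified` and `from_unified` are all hypothetical.

```python
# Toy sketch of a shared discrete-token vocabulary across modalities.
# All sizes and names are hypothetical, chosen only for illustration;
# they are not taken from LongCat-Next.

TEXT_VOCAB = 1000      # assumed number of text tokens
IMAGE_CODEBOOK = 512   # assumed number of image codebook entries
AUDIO_CODEBOOK = 256   # assumed number of audio codebook entries

SIZES = {"text": TEXT_VOCAB, "image": IMAGE_CODEBOOK, "audio": AUDIO_CODEBOOK}

# Each modality occupies its own contiguous range of the shared vocabulary.
OFFSETS = {
    "text": 0,
    "image": TEXT_VOCAB,
    "audio": TEXT_VOCAB + IMAGE_CODEBOOK,
}


def to_unified(modality, local_ids):
    """Map modality-local token ids into the shared vocabulary."""
    size = SIZES[modality]
    offset = OFFSETS[modality]
    for i in local_ids:
        if not 0 <= i < size:
            raise ValueError(f"id {i} is out of range for {modality}")
    return [offset + i for i in local_ids]


def from_unified(token_id):
    """Recover (modality, local id) from a shared-vocabulary token id."""
    # Check the ranges from the highest offset down.
    for modality in ("audio", "image", "text"):
        if token_id >= OFFSETS[modality]:
            local = token_id - OFFSETS[modality]
            if local >= SIZES[modality]:
                raise ValueError(f"token id {token_id} is out of range")
            return modality, local
    raise ValueError(f"token id {token_id} is out of range")


# One mixed sequence that a single autoregressive model could consume.
sequence = (
    to_unified("text", [5, 42])
    + to_unified("image", [0, 511])
    + to_unified("audio", [7])
)
decoded = [from_unified(t) for t in sequence]
```

Once every modality lives in one id space, understanding and generation can both be framed as next-token prediction over the same sequence, which is the general appeal of a discrete-token design.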