Google has released Gemma 4 12B, an open-weights multimodal model with about 12 billion parameters, designed to run locally on everyday laptops. Multiple sources say the model can run on devices with around 16 GB of VRAM or unified memory, bringing text, image, and audio understanding to local workflows rather than relying on cloud inference. The central technical change is what sources call an “encoder-free” or “unified” architecture: instead of separate vision and audio encoders feeding representations into an LLM, Gemma 4 12B routes vision and audio inputs directly into the LLM backbone. According to the sources, vision input passes through a lightweight embedding module, and raw audio is projected into the same token space used by text.
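To put the 16 GB figure in context, the back-of-the-envelope calculation below estimates how much memory the weights alone would need at a few numeric precisions. This is a minimal sketch: the sources do not say which precision or quantization the local builds use, so the 16-, 8-, and 4-bit levels are assumptions chosen for illustration, and the estimate ignores the KV cache, activations, and runtime overhead.

```python
# Rough weight-memory estimate for a ~12B-parameter model.
# Assumption: only the weights are counted; KV cache, activations, and
# runtime overhead are ignored, so real usage will be higher.

PARAMS = 12e9          # "about 12 billion parameters", per the article
MEMORY_BUDGET_GB = 16  # the ~16 GB figure cited by the sources


def weight_gb(params: float, bits_per_param: int) -> float:
    """Return weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_param / 8 / 1e9


# Hypothetical precision levels; the sources do not specify any of them.
estimates = {bits: weight_gb(PARAMS, bits) for bits in (16, 8, 4)}

for bits, gb in estimates.items():
    verdict = "under" if gb < MEMORY_BUDGET_GB else "over"
    print(f"{bits}-bit weights: {gb:.0f} GB ({verdict} the {MEMORY_BUDGET_GB} GB budget)")
```

Under these assumptions, 16-bit weights come to about 24 GB, 8-bit to about 12 GB, and 4-bit to about 6 GB, which suggests that fitting in 16 GB would require some form of reduced-precision weights.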

The model is positioned for agentic use cases, including tool/function calling and multi-step “thinking” (reasoning) modes, and it includes Multi-Token Prediction (MTP) drafters to reduce latency. Sources also report a 256K-token context window and benchmark performance approaching that of Google’s larger Gemma 4 26B model on standard tests. Gemma 4 12B is released under the Apache 2.0 license and is available via platforms including Hugging Face and Kaggle. Sources also mention local deployment through common tooling (e.g., LM Studio and Ollama) and Google AI Edge applications for offline audio use.
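The sources do not describe how the MTP drafters work internally. One common way a drafter can reduce latency is draft-and-verify (speculative) decoding: a cheap drafter proposes several tokens, and the main model checks them, keeping the agreeing prefix. The toy sketch below illustrates that general idea only. The functions `toy_target` and `toy_drafter` are hypothetical stand-ins, not Gemma components, and a real system would verify all drafted tokens in one batched forward pass rather than in a Python loop.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # maps a token context to the next token


def speculative_step(context: List[int], target: NextToken,
                     drafter: NextToken, k: int) -> List[int]:
    """Draft k tokens, then keep the prefix the target model agrees with."""
    draft: List[int] = []
    for _ in range(k):
        draft.append(drafter(context + draft))

    accepted: List[int] = []
    for tok in draft:
        expected = target(context + accepted)
        if tok != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(tok)
    # Every draft matched: the target also contributes one extra token.
    accepted.append(target(context + accepted))
    return accepted


def speculative_generate(prompt: List[int], target: NextToken,
                         drafter: NextToken, n_new: int, k: int = 3) -> List[int]:
    out = list(prompt)
    while len(out) < len(prompt) + n_new:
        out += speculative_step(out, target, drafter, k)
    return out[: len(prompt) + n_new]


def greedy_generate(prompt: List[int], target: NextToken, n_new: int) -> List[int]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target(out))
    return out


# Hypothetical toy models: the target counts upward modulo 10; the drafter
# agrees with it except right after token 4, where it guesses wrong.
def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def toy_drafter(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


spec = speculative_generate([0], toy_target, toy_drafter, n_new=12)
base = greedy_generate([0], toy_target, n_new=12)
print(spec == base)  # the verified output matches plain greedy decoding
```

The design point is that verification makes the output identical to what the main model would have produced on its own, so any speedup comes from accepting several drafted tokens per verification step rather than from changing the result.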