Tech

Google releases Gemma 4 12B, an encoder-free open multimodal model for 16GB laptops

AI-Generated Summary

7 sources

1 week ago

4 views

Key Points

Gemma 4 12B is an open-weights multimodal model released under the Apache 2.0 license.
The model is designed to run locally on laptops with about 16GB of VRAM or unified memory.
Gemma 4 12B uses a unified, encoder-free architecture where vision and audio inputs feed directly into the LLM backbone.
Sources describe multimodal inputs including native audio and vision processing, with agentic capabilities such as tool/function calling.
Google reports performance near its larger Gemma 4 26B model and provides local deployment via tools and Google AI Edge apps.

Google releases Gemma 4 12B, an open-weights multimodal model with about 12 billion parameters, designed to run locally on everyday laptops. Multiple sources say the model can execute on devices with around 16GB of VRAM or unified memory, aiming to bring text, image, and audio understanding to local workflows rather than relying on cloud inference. A central technical change is the model’s “encoder-free” or “unified” architecture: instead of separate vision and audio encoders feeding representations into an LLM, Gemma 4 12B routes vision and audio inputs directly into the LLM backbone. The sources describe vision processing as using a lightweight embedding module and describe audio handling as projecting raw audio into the same token space used by text.

The model is positioned for agentic use cases, including tool/function calling and multi-step “thinking” or reasoning modes, and it includes Multi-Token Prediction (MTP) drafters to reduce latency. Sources also report a large context window (cited as 256K tokens) and benchmark performance that approaches or nears Google’s larger Gemma 4 26B model on standard tests. Gemma 4 12B is released under the Apache 2.0 license and is available via platforms including Hugging Face and Kaggle, with local deployment options referenced through common tooling (e.g., LM Studio and Ollama) and Google AI Edge applications for offline audio use.

How Outlets Covered This Story

GOO

Google Developers Blog

Gemma 4 12B: The Developer Guide

The newly released Gemma 4 12B is a dense, multimodal model designed for high-performance local AI execution on consumer devices. By introducing a novel, encoder-free architecture, it bypasses traditional visual and audio encoders to feed multimodal data directly into the LLM backbone.

GOO

Google Developers Blog

Bringing Gemma 4 12B to your Laptop: Unlocking Local, Agentic Workflows with Google AI Edge

Google DeepMind’s Gemma 4 12B model brings agentic, multimodal AI capabilities to everyday laptops with 16GB of RAM, enabling local data processing and visual insight generation. Users can leverage this model on macOS through the Google AI Edge Gallery for dynamic Python code execution and visualization, as well as via Google AI Edge Eloquent for completely offline voice dictation and text editing. Additionally, developer workflows are enhanced by the LiteRT-LM CLI's new serve command, which creates an industry-compatible local endpoint to power fully-local AI tools and agents.

INF

InfoQ

Gemma 4 12B Enables On-Device, Multimodal Agentic Workflows with an Encoder-free Architecture

Google says Gemma 4 12B is "designed to bring agentic, multimodal intelligence directly to your laptop", further noting that the new model can be combined with Google AI Edge to "build and experiment locally, on everyday machines". This integration allows for a wide range of capabilities, from autonomous data processing to generating visual insights and even building webpages or executing tools. By Sergio De Simone

3 days ago

DEV

Dev.to

Gemma 4 12B: Google's encoder-free multimodal AI now runs on a laptop

Google shipped Gemma 4 12B this week — a model that packs near-26B performance into something that runs on a consumer laptop with 16GB of RAM or unified memory. That alone would be notable. But the more significant move is the architecture: no multimodal encoders at all. Vision and audio go straight into the LLM backbone. "Gemma 4 12B packages powerful capabilities inside a reduced memory footprint. It is also our first mid-sized model to feature native audio inputs." — Google DeepMind What actually changed Encoder-free multimodal: Traditional multimodal models pipe images and audio through separate encoder networks before the LLM ever sees them. Gemma 4 12B removes those entirely. Vision gets a lightweight embedding module (a single matrix multiplication + positional embedding). Audio skips encoding altogether — the raw signal is projected directly into the same token space as text. Near-26B benchmark performance at half the footprint: On standard benchmarks it runs neck-and-neck with Gemma 4 26B, and actually surpasses it on DocVQA (document visual question answering). A new slot in the lineup: April's Gemma 4 release had E2B/E4B for mobile/IoT, and 26B/31B for heavier compute. The 12B fills the gap — more capable than edge models, runnable without a GPU server. Drafter-ready: Ships with Multi-Token Prediction (MTP) drafters to reduce inference latency. Apache 2.0: Open weights, available now on Hugging Face, Kaggle, Ollama, and LM Studio. Why the architecture matters Encoder-free isn't just an efficiency hack — it's a different architectural bet. Separate encoders add latency, memory overhead, and a seam in the stack that limits how tightly vision and language reasoning can be integrated. Removing them means the LLM backbone handles the full chain from pixels and audio waveforms to text output, which allows for tighter cross-modal understanding rather than bolted-on modalities. Whether that bet pays off at scale is still an open question. But for local deployment, the operational benefit is immediate: fewer moving parts, smaller footprint, and native audio without needing a separate pipeline. Google's own Eloquent app demo shows the model doing offline transcription, formatting, and translation entirely on-device — that's the kind of capability that used to require API calls. Gemma 4 as a family has now crossed 150 million downloads. Developers have built everything from wearable robotic assistants to enterprise AI security tooling on top of it. The 12B gives that community a laptop-sized option that doesn't require stripping out multimodal capabilities to fit. What to do Building local AI apps: 16GB RAM is now the floor for a capable multimodal model. ollama run gemma4:12b is the fastest path to testing it. On the audio pipeline side: Worth a serious look for offline transcription and voice-to-text — the encoder-free approach means no extra audio infrastructure to manage. Deploying on GKE or Cloud Run: Google published tutorials for both — links in the official blog post below. Building agents: Google released a Gemma Skills Repository alongside this, specifically targeting agentic workflows using the latest Gemma models. Source: The New Stack · Google Blog ✏️ Drafted with KewBot (AI), edited and approved by Drew.

6 days ago

DEV

Dev.to

Introducing Gemma 4 12B: a unified, encoder-free multimodal model

Gemma 4 12B is designed to bring high-performance multimodal intelligence directly to your laptop, combining mobile-first efficiency with advanced reasoning. Today, we are introducing Gemma 4 12B, our latest model designed to bring agentic multimodal intelligence directly to laptops. Bridging the gap between our edge-friendly E4B and our more advanced 26B Mixture of Experts (MoE), Gemma 4 12B packages powerful capabilities inside a reduced memory footprint. It is also our first mid-sized model to feature native audio inputs. Thanks to the developer community, Gemma 4 models have now crossed 150 million downloads. You've built everything from wearable robotic arms for physical assistance to enterprise-grade AI security. We're excited to see what you build with this latest addition. Here's an overview of what makes Gemma 4 12B unique: Novel unified architecture: No multimodal encoders. The vision and audio inputs flow directly into the LLM backbone. Advanced reasoning: Benchmark performance nearing our 26B model, unlocking powerful multi-step reasoning and agentic workflows. Laptop ready: Small enough to run locally with just 16GB of VRAM or unified memory. Open and accessible: Released under an Apache 2.0 license with support across the developer ecosystem. Drafter-ready: Gemma 4 12B comes equipped with Multi-Token Prediction (MTP) drafters to reduce latency. Together, these features bring advanced multimodal capabilities to everyday hardware without sacrificing speed or reasoning. Let's now take a closer look at how Gemma 4 12B achieves this. Run state-of-the-art agents locally Gemma 4 12B delivers performance nearing our larger 26B MoE model on standard benchmarks, but at less than half the total memory footprint. Small enough to run locally on consumer laptops with 16GB of RAM, it unlocks powerful multimodal and agentic experiences right on your machine. Experience a uniquely efficient, unified architecture What makes Gemma 4 12B stand out is its streamlined approach to processing visual and audio inputs. Traditional multimodal models typically rely on separate encoders to translate images and audio before passing those representations to the language model. Because these split encoders add latency and increase memory usage, we trained Gemma 4 12B with an encoder-free architecture to integrate audio and vision input directly. Here is how Gemma 4 12B processes multimodal inputs natively: Vision: We replaced Gemma 4's vision encoder with a lightweight embedding module consisting of a single matrix multiplication, positional embedding and normalizations. This allows the LLM backbone to take over visual processing. Audio: We simplified audio processing even further. We removed the audio encoder entirely and projected the raw audio signal into the same dimensional space as text tokens. For developers who want a breakdown, head over to our companion Gemma 4 12B Developer Guide. See native audio processing in action: Watch Gemma 4 12B transcribe, format, and translate voice inputs entirely offline using the Google AI Edge Eloquent app. Get started today Try it yourself: Experiment with a couple of clicks in LM Studio, Ollama, Google AI Edge Gallery App, the Google AI Edge Eloquent app and the LiteRT-LM CLI Download the weights: Download the pre-trained and instruction-tuned checkpoints directly from Hugging Face and Kaggle. Integrate & learn: Review the developer documentation and the quick start notebook. Use your favorite development tools: Implement local inference pipelines with Hugging Face Transformers, llama.cpp, MLX, SGLang, and vLLM, or fine-tune with efficiency using Unsloth. Unlock Agentic Development with Gemma Skills: To support agents to build with the latest Gemma advancements, we are releasing our official Skills Repository. This is a library of skills designed specifically to enable agents to build with Gemma models. Deploy your way: Spin up endpoints in production using Google Cloud. Deploy your way through Gemini Enterprise Agent Platform Model Garden, Cloud Run and GKE.

6 days ago

AI Business

Google’s Gemma 4 12B Shows AI Race Moving to Edge Devices

The model, released under the Apache 2.0 license, is another example of how cloud providers are enabling enterprises to run models on local devices for agentic workflows.

1 week ago

Katie Price returns to UK after visiting Lee Andrews in Dubai prison

Katie Price returns to the UK after a five-day trip to Dubai during which she tries to see her husband, Lee Andrews, who...

6 sources 4 weeks ago

Tech

Science podcasts examine consciousness, brain science, and AI approaches

Two Science Quickly episodes from Scientific American discuss why consciousness remains difficult to define and study, d...

1 sources 1 hour ago

Tech

Scientific American Discusses How Algorithmic Social Media Drives New Slang

A Scientific American “60-Second Science” episode examines how algorithmic social media shapes everyday language, focusi...

1 sources 1 hour ago