Gemma 4: 8 Real-World Tests (JSON, Code, Vision, Reasoning)
We pushed Google's Gemma 4 through 8 real-world execution tests covering structured JSON extraction, advanced vision QA, and live code evaluations. Here is what we found, along with the architectural lessons we learned along the way.
The video version · same thesis, looser edits
We recently evaluated Google’s Gemma 4 model across 8 distinct, real-world tests to see where it breaks. We skipped the generic leaderboards and instead pushed it through structured JSON extraction, advanced vision Q&A, and live architectural code evaluations.
The Setup
For these tests, we ran Gemma 4 (26B MoE and 31B Dense variants) locally, with “Thinking” mode selectively toggled to see if we could force logical failures.
We specifically engineered Tests 4, 6, 7, and 8 to trigger failure states. We wanted to find the exact boundary where Gemma 4's reasoning breaks down.
Instead, it passed them.
The Evaluation Breakdown
1. Structured JSON Extraction & Code Generation
We threw a messy, unstructured e-commerce product string at the model and asked it to extract eight strict fields (SKU, currency, stock status, ratings). Both the 26B and 31B models nailed the schema flawlessly on the first pass, with the 26B Mixture-of-Experts (MoE) variant firing back noticeably faster.
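For reference, the target looked roughly like the sketch below. The field names and sample values here are illustrative stand-ins, not the exact eight fields from our prompt (those are in the repo):

```python
import json

# Illustrative target schema for the extraction test. Field names
# (sku, currency, in_stock, rating, ...) are representative, not verbatim.
expected_shape = {
    "sku": "KB-750-BLK",                      # string identifier from the listing
    "name": "Mechanical Keyboard",
    "price": 74.99,                           # numeric, not a formatted string
    "currency": "USD",                        # ISO 4217 code, not a symbol
    "in_stock": True,                         # boolean stock status
    "rating": 4.6,                            # float, 0-5 scale
    "review_count": 312,
    "categories": ["peripherals", "keyboards"],
}

def is_valid(raw: str) -> bool:
    """A strict pass: the model's raw output parses and matches the schema keys."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return set(data) == set(expected_shape)
```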
For single-function code generation (writing a regex-based validate_email function in Python with docstrings and assertions), both models produced clean, executing code instantly.
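The function we asked for was in this spirit; the regex below is a simplified stand-in for whatever pattern the models actually chose:

```python
import re

EMAIL_RE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")

def validate_email(address: str) -> bool:
    """Return True if `address` looks like a syntactically valid email.

    This is a pragmatic regex check, not a full RFC 5322 parser.
    """
    return bool(EMAIL_RE.match(address))

# The prompt also required inline assertions as a self-test.
assert validate_email("user@example.com")
assert not validate_email("not-an-email")
assert not validate_email("user@no-tld")
```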
2. Multi-File Architecture (Surprise Pass)
Single functions are easy. We escalated to Test 6: generating a full-stack, multi-file application (a Python FastAPI backend with Pydantic schemas, and a React/TypeScript frontend with matching API typings).
Historically, smaller local models generate plausible individual files, but cross-file coherence crumbles. The frontend typings inevitably drift from the backend schemas. We expected Gemma to fail here. It didn’t. The TypeScript interfaces mapped perfectly to the generated Pydantic models.
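To make the pass criterion concrete, here is a minimal sketch of the kind of coherence we were checking for. The Product model and route are our own illustration, not the generated app:

```python
# backend/schemas.py -- illustrative Pydantic schema (our example, not the model's output)
from pydantic import BaseModel
from fastapi import FastAPI

class Product(BaseModel):
    id: int
    name: str
    price: float
    in_stock: bool

app = FastAPI()

@app.get("/products/{product_id}", response_model=Product)
def get_product(product_id: int) -> Product:
    # Stub handler; the real test generated a full CRUD backend.
    return Product(id=product_id, name="placeholder", price=0.0, in_stock=True)

# The pass criterion: the generated frontend typings mirror this exactly, e.g. a
# TypeScript `interface Product { id: number; name: string; price: number; in_stock: boolean }`
# with no drifted or missing fields relative to the serialized JSON.
```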
3. Deep Logical Reasoning (The “Thinking” Constraint)
Test 7 is a notorious 10-step logic puzzle involving five people, five floors, five jobs, and five pets, with heavily interwoven constraints (“The Lawyer lives directly above the Teacher”).
With “Thinking” turned OFF, the 31B model broke down. It lost track of constraints around step 8, hallucinating floor assignments. However, when we enabled Thinking mode, the model successfully mapped the entire constraint matrix and solved the puzzle. For external validation, we ran the same puzzle through Gemini 3.1 Pro, which solved it immediately—highlighting the closing gap between local 31B reasoning engines and frontier cloud models.
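The full constraint set is in the repo; for intuition, a puzzle like this can be verified mechanically by brute force over floor assignments. The sketch below uses hypothetical names and only two sample constraints, one of them the Lawyer/Teacher rule quoted above:

```python
from itertools import permutations

JOBS = ["Lawyer", "Teacher", "Doctor", "Chef", "Pilot"]

def solve():
    """Brute-force sketch: try every job-to-floor assignment and keep the ones
    that satisfy the constraints. Only two constraints are shown here; the real
    puzzle interweaves ten of them across people, floors, jobs, and pets."""
    solutions = []
    for jobs_by_floor in permutations(JOBS):            # index 0..4 = floor 1..5
        floor_of = {job: i + 1 for i, job in enumerate(jobs_by_floor)}
        # Constraint from the puzzle: the Lawyer lives directly above the Teacher.
        if floor_of["Lawyer"] != floor_of["Teacher"] + 1:
            continue
        # Hypothetical second constraint: the Chef is not on the top floor.
        if floor_of["Chef"] == 5:
            continue
        solutions.append(floor_of)
    return solutions

print(len(solve()), "candidate assignments survive the sampled constraints")
```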
4. Advanced Vision & OCR
We ran Gemma’s vision encoder through two heavy tests:
- Test 4: Extracting 72 exact data points from a screenshot of a dense benchmark table.
- Test 5: Analyzing an architectural diagram of Cloudflare’s Composable AI infrastructure.
For the tabular data, we anticipated hallucinated floating-point scores. Actual result? Zero mistakes. All 72 points were extracted perfectly.
For the architecture diagram, the model didn’t just transcribe text; it correctly identified the four architectural layers, explained the value of composable redundancy, and successfully deduced that vendor lock-in was the primary underlying risk of the design.
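Results like Test 4 can be scored mechanically rather than eyeballed. A minimal sketch, assuming the extracted values are parsed into a dict keyed by (row, column) — the keys and numbers below are placeholders, not real benchmark scores:

```python
def score_extraction(extracted: dict, reference: dict, tol: float = 1e-6) -> float:
    """Return the fraction of reference cells the model reproduced exactly
    (within a small float tolerance for numeric values)."""
    hits = 0
    for key, truth in reference.items():
        value = extracted.get(key)
        if isinstance(truth, float):
            hits += value is not None and abs(value - truth) <= tol
        else:
            hits += value == truth
    return hits / len(reference)

# Toy slice with placeholder keys/values; the real reference has all 72 cells.
reference = {("model-a", "bench-1"): 12.3, ("model-b", "bench-1"): 45.6}
extracted = {("model-a", "bench-1"): 12.3, ("model-b", "bench-1"): 45.6}
print(score_extraction(extracted, reference))  # 1.0 on this toy slice
```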
5. Complex Nested Medical JSON
In our final test, we fed the model a dense medical patient record (diagnoses, ICD-10 codes, multi-visit histories) and enforced a strict, deeply nested JSON schema. We fully expected dropped brackets or hallucinated keys (like outputting medication instead of drug).
Once again, the model adhered to the nested arrays and returned a highly accurate, parsable payload.
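One way to check adherence mechanically is to mirror the schema as Pydantic models and parse the raw output against them. The field names below approximate the shape we enforced but are not verbatim:

```python
from pydantic import BaseModel

class Diagnosis(BaseModel):
    icd10: str                # e.g. "E11.9"
    description: str

class Prescription(BaseModel):
    drug: str                 # the key the model must NOT rename to "medication"
    dose_mg: float

class Visit(BaseModel):
    date: str
    diagnoses: list[Diagnosis]
    prescriptions: list[Prescription]

class PatientRecord(BaseModel):
    patient_id: str
    visits: list[Visit]

# Toy payload standing in for the model's raw output; the pass criterion is that
# the output parses with no missing brackets, dropped arrays, or invented keys.
sample = """{"patient_id": "P-001", "visits": [{"date": "2024-03-01",
  "diagnoses": [{"icd10": "E11.9", "description": "Type 2 diabetes"}],
  "prescriptions": [{"drug": "Metformin", "dose_mg": 500}]}]}"""
record = PatientRecord.model_validate_json(sample)
print(record.visits[0].prescriptions[0].drug)
```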
The Verdict
Our core takeaway is that the threshold for local compute is shifting dramatically. Gemma 4 31B demonstrates cross-file architectural coherence, elite OCR extraction, and deep logic-solving capabilities that, until recently, required a massive API budget.
For the complete repository of tests, prompts, and raw JSON outputs, check the GitHub link in the video description.