Building a Doodle Detector with Gemini Embedding 2
What if you could draw a rough shape on a screen and have an AI instantly recognize it? We break down how to orchestrate Gemini Embedding 2 and ChromaDB to build a real-time multimodal vector search application.
The video version · same thesis, looser edits
What if you could draw a rough shape on a screen and have an AI instantly tell you what it is?
That is the core capability of Google’s Gemini Embedding 2 model. Unlike traditional embedding models that handle only text, this newly released model maps text, images, and video into a single shared vector space.
To test its capability, I built the Doodle Detector. You sketch a rough concept, the system converts it into a vector, and it matches that shape against a database in milliseconds. Here is exactly how the architecture works under the hood.
The Architecture of the Doodle Detector
The application is simple but relies on a robust real-time pipeline tying the front end to the vector database.
- The Canvas: You draw directly on an HTML5 canvas managed by React.
- Dynamic Cropping: Before leaving the browser, a content-aware crop removes empty white space. This is critical for embedding accuracy—you want the model focusing on the strokes, not the negative space.
- The WebSockets Pipeline: The cropped image payload is immediately transported over WebSockets to a lightweight Python backend orchestrator.
- Vector Generation: The Python orchestrator calls the Gemini Embedding 2 API. In milliseconds, Google returns a 768-dimensional vector representing the semantic “meaning” of your doodle.
- Similarity Search: That freshly generated vector is compared against 20 pre-indexed emoji vectors stored in a local ChromaDB instance, using cosine similarity as the distance metric.
- The Result Stream: The top 3 closest matches stream back to the React front end alongside explicit confidence scores.
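The content-aware crop in step two can be sketched in pure Python. This is an illustration, not the app's actual front-end code (which operates on canvas pixel data in the browser); here a 2D grid of 0s (background) and 1s (ink) stands in for the image.

```python
# Content-aware crop: trim rows and columns that contain only background,
# so the embedding model sees the strokes, not the negative space.
# 0 = white background, 1 = ink. A stand-in for the browser-side crop.

def content_aware_crop(grid):
    """Return the sub-grid bounding all non-background pixels."""
    rows = [i for i, row in enumerate(grid) if any(row)]
    cols = [j for j in range(len(grid[0])) if any(row[j] for row in grid)]
    if not rows:  # blank canvas: nothing to crop
        return []
    top, bottom = rows[0], rows[-1]
    left, right = cols[0], cols[-1]
    return [row[left:right + 1] for row in grid[top:bottom + 1]]

# A 5x5 canvas with a 2x2 doodle in the middle
canvas = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
cropped = content_aware_crop(canvas)
# cropped is [[1, 1], [1, 1]] — only the inked bounding box survives
```

The same bounding-box idea applies at full resolution: scan the pixel buffer once, find the min/max inked row and column, and crop before encoding the payload.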
Matryoshka Representation Learning
The standout feature powering this setup is Matryoshka Representation Learning.
Traditionally, an embedding model gives you a single fixed-size vector. With Matryoshka learning, the most important information is concentrated in the leading dimensions, so you can truncate the vector without retraining the model.
If you want maximum accuracy, you run the full 768 dimensions. If speed is crucial or storage is tight (for instance, when indexing millions of images instead of 20), you can truncate down to 512, 256, or even 128 dimensions. You get much faster search and lower compute costs, with only a graceful degradation in recognition accuracy.
A single model handles multiple modalities with tunable precision.
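The truncate-and-renormalize mechanic is simple enough to show directly. The vectors below are random stand-ins, not real model output, but they illustrate how cosine similarity between two related embeddings stays close as the dimension shrinks:

```python
import math
import random

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def truncate(vec, dim):
    """Matryoshka-style truncation: keep the first `dim` components
    and re-normalize to unit length."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

# Two synthetic 768-d vectors standing in for real embeddings:
# `emoji` is `doodle` plus noise, i.e. a semantically "close" neighbor.
random.seed(0)
doodle = [random.gauss(0, 1) for _ in range(768)]
emoji = [d + random.gauss(0, 0.3) for d in doodle]

for dim in (768, 512, 256, 128):
    # Similarity degrades only slightly as dimensions shrink
    print(dim, round(cosine(truncate(doodle, dim), truncate(emoji, dim)), 3))
```

Storage and search cost scale linearly with dimension, so dropping from 768 to 128 cuts both by 6x while the nearest-neighbor ranking stays largely intact.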
The Scale of Multimodal Vector Search
The tech stack here is deceptively lean:
- Model: gemini-embedding-2-preview
- Database: ChromaDB (local persistent vector store)
- Backend: Python + WebSockets
- Frontend: React + Vite + TypeScript
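At 20 vectors, the query ChromaDB answers reduces to a brute-force scan, which makes it easy to see what the database is actually doing. This is a minimal sketch with toy 4-dimensional vectors (real embeddings are 768-dimensional, and ChromaDB swaps the loop for an approximate-nearest-neighbor index at scale):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query, index, k=3):
    """Brute-force nearest neighbors by cosine similarity —
    the same question ChromaDB answers with an ANN index at scale."""
    scored = [(name, cosine(query, vec)) for name, vec in index.items()]
    return sorted(scored, key=lambda s: s[1], reverse=True)[:k]

# Toy vectors standing in for the 20 pre-indexed emoji embeddings
index = {
    "sun":   [0.9, 0.1, 0.0, 0.1],
    "moon":  [0.1, 0.9, 0.1, 0.0],
    "star":  [0.8, 0.3, 0.1, 0.0],
    "heart": [0.0, 0.1, 0.9, 0.2],
}

# A "doodle" vector close to the sun entry
matches = top_k([0.88, 0.15, 0.02, 0.08], index)
# → [("sun", ...), ("star", ...), ("moon", ...)] with descending scores
```

The `(name, score)` pairs are exactly what streams back to the React front end as the top-3 matches with confidence scores.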
While my Doodle Detector demonstrates this with 20 pre-indexed emojis, the underlying engine is the same one behind Google-scale infrastructure. Google’s internal FindMeMedia demo uses this same technology layer to search across 1.1 million images and 570,000 videos. You pick an image, and it retrieves visually and semantically similar content across millions of assets in roughly one second.
Semantic similarity, document retrieval, fact verification, and real-time visual recommendation engines are no longer gated behind millions of dollars in R&D. The primitives are here: open-source vector databases paired with massively capable multimodal APIs.
For the complete Doodle Detector pipeline including source code, check the GitHub repository linked in the video.