Multimodal Search Optimization: A 2026 How-To

Answer first

Optimizing for Google's new multimodal Search box means treating your non-text assets as primary ranking surfaces, not decoration. At I/O 2026, Google rebuilt the Search box to accept text, images, files, videos, and Chrome tabs in a single query and reason across them. To stay visible, ship high-quality original images with descriptive filenames and alt text, mark them up with structured data and image sitemaps, surround them with topical context, and make your entities unambiguous so Google can match a photo or a spoken question to your page.

At a glance

What changed: Search box now reasons across text, images, files, videos, and Chrome tabs
Multimodal share: More than 16% of Google searches are already multimodal
Default engine: AI Mode runs on Gemini 3.5 Flash, past 1 billion monthly users
Highest leverage: Image SEO: original assets, descriptive metadata, schema, sitemaps
Connective tissue: Entity clarity so the model resolves image, tab, and sentence to one thing
Cadence: Re-audit quarterly as the multimodal query share keeps rising

For two decades, image and video optimization was a checklist item below titles, links, and copy. That ordering is now wrong. Google calls its I/O 2026 redesign the biggest upgrade to the Search box in over 25 years, and the query is no longer a string. It is a string plus a picture plus a tab plus context, and a reasoning model weighs all of it at once. This guide walks the work modality by modality: what changed, why non-text optimization moved to the core, then image SEO, structured data and sitemaps, Google Lens, video, and the entity clarity that lets Google reason across all of it.

CH.01 What actually changed at I/O 2026

Google called the redesign the biggest upgrade to its Search box in over 25 years. Per the official Search at I/O 2026 announcement, the box dynamically expands to give you space to describe exactly what you need, surfaces AI-powered suggestions that go beyond autocomplete, and lets you search across modalities, using text, images, files, videos, or Chrome tabs as inputs. It started rolling out in all countries and languages where AI Mode is available, on both desktop and mobile.

Key fact

Google says more than 16 percent of searches are now multimodal, and AI Mode now runs by default on the new Gemini 3.5 Flash model, a reasoning model that can interpret an image, a screenshot, and a follow-up sentence together. AI Mode has surpassed 1 billion monthly users, so this is the front door, not a niche feature.

Two facts make this more than a UI refresh. First, roughly a sixth of all queries already involve something other than typed text. Second, the system answering a multimodal query reasons over the inputs together rather than treating the picture and the words as separate lookups. We cover the broader shift in our breakdown of what AI Mode being the default means for search visibility.

Reported but not yet fully confirmed: trade coverage describes the box adding a "Talk" or Search Live voice option and a "plus" menu to attach images from gallery or camera and to attach documents, with Search Live letting users interact in real time through the phone camera. Treat those specifics as press-reported until Google documents them, but plan for them, because the direction is clear.

The practitioner takeaway: your job is to make every modality on your pages legible to that reasoning. For how this fits alongside every other I/O 2026 change, see our complete breakdown of Google I/O 2026 for search marketers.

CH.02 Why non-text optimization is now core SEO

When roughly a sixth of queries are multimodal and the answer engine is a model that reads pixels as readily as text, the image you ship is a ranking input on equal footing with your H1. The surfaces that win are the ones that hand the model clean, structured, unambiguous signals.

A blurry stock photo with a filename of IMG_4821.jpg and empty alt text is invisible to a system trying to visually match a user's camera input. An original, high-resolution product shot, named descriptively, marked up with Product structured data, and surrounded by accurate copy, is a strong candidate. Optimizing the non-text assets is now a core discipline.

"The query is no longer a string. It is a string plus a picture plus a tab plus context, and the engine reasons over all of it." (Capconvert GEO practice)

Here is how to do it, modality by modality, starting with the highest-leverage work for almost every site.

CH.03 Image SEO: the foundation of multimodal visibility

Image optimization is the highest-leverage multimodal work for most sites, because images feed Google Images, Google Lens, visual matches, shopping surfaces, and AI Mode's visual reasoning at once. Google's image SEO best practices define the floor. Hit every item.

Ship original, high-quality images. Use relevant, high-resolution photos and avoid generic logos or extreme aspect ratios. Original imagery beats stock for visual matching because it is distinctive.
Write descriptive filenames. Google explicitly prefers my-new-black-kitten.jpg over IMG00023.JPG. Name files for what the image shows, in plain words.
Write accurate, informative alt text. Describe the content, not keywords. Google's own example is "Dalmatian puppy playing fetch" over generic terms.
Use real HTML image elements. Google finds images in the src of an img element and does not index CSS background images.
Place images in topical context. Position images near relevant text on topically appropriate pages so Google can confirm what a photo depicts.
Serve responsive images with picture or srcset and a fallback src. Supported formats include JPEG, PNG, WebP, SVG, and AVIF; match the file extension to the actual file type.
Optimize for speed. Compress and size images correctly and verify with PageSpeed Insights, because heavy images suppress the page they live on.
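The filename and alt-text items above lend themselves to a quick automated pass. The following is a minimal sketch, assuming you can get a page's rendered HTML as a string; the `GENERIC_NAME` pattern and the `audit_images` helper are illustrative names, not part of any Google tool, and the pattern is only a rough guess at what a camera-default filename looks like.

```python
import posixpath
import re
from html.parser import HTMLParser
from urllib.parse import urlparse

# Assumption: filenames like IMG_4821 or IMG00023 count as "generic".
# Adjust the pattern to match your own camera or CMS defaults.
GENERIC_NAME = re.compile(r"^(img|dsc|image|photo|screenshot)?[_\-\s]?\d+$", re.IGNORECASE)


class ImgAudit(HTMLParser):
    """Collect img elements and flag missing src, empty alt, or generic filenames."""

    def __init__(self):
        super().__init__()
        self.issues = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr = dict(attrs)
        src = attr.get("src") or ""
        alt = (attr.get("alt") or "").strip()
        # Strip the directory and the extension to get the bare filename.
        stem = posixpath.splitext(posixpath.basename(urlparse(src).path))[0]
        if not src:
            self.issues.append((src, "missing src"))
        if not alt:
            self.issues.append((src, "missing alt text"))
        if stem and GENERIC_NAME.match(stem):
            self.issues.append((src, "generic filename"))


def audit_images(html: str):
    """Return a list of (src, issue) tuples for every flagged img element."""
    parser = ImgAudit()
    parser.feed(html)
    return parser.issues
```

Treat the output as a review queue rather than a verdict: an empty alt attribute is legitimate on purely decorative images, and the sketch only sees img elements, so CSS background images never appear in it at all.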

Designate the page's hero image explicitly with the primaryImageOfPage property or an og:image tag so Google knows which image represents the page. This is small effort with outsized payoff on social and AI surfaces that pull a single representative image.

CH.04 Image structured data and image sitemaps

Structured data and image sitemaps are the two technical levers here: structured data makes already-indexed images eligible for rich results and prominent badges in Google Images, and sitemaps make sure the images are reliably discovered in the first place. The structured data type depends on the page: Product for commerce pages, Recipe for recipes, and ImageObject where you need to describe a standalone image with its license, creator, and caption. Licensable-image markup can earn a badge that links to your licensing terms, which is useful for any site whose imagery gets reused.
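As an illustration of the ImageObject case, here is a minimal sketch that builds the JSON-LD for one licensable image and wraps it in a script tag. The property names (`contentUrl`, `license`, `acquireLicensePage`, `creditText`, `creator`, `caption`) come from schema.org; the function name, the choice of Organization as the creator type, and all URLs in the usage example are assumptions for illustration.

```python
import json


def image_object_jsonld(content_url, license_url, acquire_license_page,
                        creator_name, credit_text, caption):
    """Return a JSON-LD script tag describing one standalone, licensable image."""
    data = {
        "@context": "https://schema.org",
        "@type": "ImageObject",
        "contentUrl": content_url,
        "license": license_url,
        "acquireLicensePage": acquire_license_page,
        "creditText": credit_text,
        # Assumption: the creator is an organization; use "Person" for an individual.
        "creator": {"@type": "Organization", "name": creator_name},
        "caption": caption,
    }
    return ('<script type="application/ld+json">'
            + json.dumps(data, indent=2)
            + "</script>")


if __name__ == "__main__":
    print(image_object_jsonld(
        content_url="https://example.com/media/dalmatian-puppy-playing-fetch.jpg",
        license_url="https://example.com/image-license",
        acquire_license_page="https://example.com/how-to-license",
        creator_name="Example Studio",
        credit_text="Example Studio",
        caption="Dalmatian puppy playing fetch",
    ))
```

Validate the generated markup with Google's Rich Results Test before relying on it, since eligibility rules are Google's to define and can change.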

Submit an image sitemap to surface images Google might not otherwise find, including images loaded via JavaScript. The rules are specific.

Nest one or more image:image tags inside each url entry; each URL can contain up to 1,000 image:image tags.
Put the image URL in image:loc inside each image:image.
Declare the namespace http://www.google.com/schemas/sitemap-image/1.1.
Images can be hosted on a different domain, for example a CDN; Google encourages verifying ownership of that domain in Search Console so crawl errors get reported to you. Image URLs must not be blocked by robots.txt.
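The rules above can be sketched as a small generator. This is a minimal example using Python's standard library, assuming you already have a mapping of page URLs to the image URLs on each page; the function name `build_image_sitemap` and the example URLs are hypothetical.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
IMAGE_NS = "http://www.google.com/schemas/sitemap-image/1.1"
MAX_IMAGES_PER_URL = 1000  # limit stated in the rules above


def build_image_sitemap(pages: dict[str, list[str]]) -> str:
    """Build an image sitemap from {page_url: [image_url, ...]}."""
    # Register prefixes so the output uses xmlns and xmlns:image
    # instead of auto-generated ns0/ns1 prefixes.
    ET.register_namespace("", SITEMAP_NS)
    ET.register_namespace("image", IMAGE_NS)

    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for page_url, image_urls in pages.items():
        if len(image_urls) > MAX_IMAGES_PER_URL:
            raise ValueError(f"{page_url} lists more than {MAX_IMAGES_PER_URL} images")
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = page_url
        for image_url in image_urls:
            # One image:image per image, with the URL in image:loc.
            image = ET.SubElement(url, f"{{{IMAGE_NS}}}image")
            ET.SubElement(image, f"{{{IMAGE_NS}}}loc").text = image_url

    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    print(build_image_sitemap({
        "https://example.com/kittens": [
            "https://cdn.example.com/my-new-black-kitten.jpg",
        ],
    }))
```

The image URLs may point at a different host than the page URL, which is what makes this format workable for CDN-hosted assets. Write the result to a file, reference it from your sitemap index, and regenerate it in the same build step as your regular sitemap.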

Why it matters

If your images are injected client-side or served from a CDN, the image sitemap is not optional. It is often the only reliable way Google discovers them. Generate one as part of your standard sitemap build and resubmit when your asset library changes.

CH.05 Google Lens and visual-match optimization

To win Google Lens and visual-match queries, give the system distinctive, original images of the exact thing a user is likely to point a camera at. Lens and Circle to Search work by matching a real-world or on-screen image against indexed images, so your odds rise when your images are unique, well-lit, shot from the angles people actually photograph, and unambiguously tied to a product or entity.

Lead with clean, isolated product shots plus in-context lifestyle images. The isolated shot helps the visual match; the lifestyle shot helps the model understand use and intent.
Show the distinctive features a person would photograph. The logo, the silhouette, the label, the packaging. If your product has a recognizable shape, make sure an image captures it cleanly.
Keep product imagery consistent across your site, your Merchant feed, and third-party listings. The same entity reinforces itself everywhere Google looks.
Tie images to structured product data: GTIN, brand, price, and availability, so a visual match can resolve straight into a shopping result.
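To make the last item concrete, here is a minimal sketch of Product markup that carries the images alongside GTIN, brand, price, and availability. The property names are schema.org's; the function name and every value in the usage example, including the GTIN, are made up for illustration.

```python
import json


def product_jsonld(name, image_urls, gtin, brand, price, currency, in_stock):
    """Return a schema.org Product dict tying images to purchase data."""
    availability = "InStock" if in_stock else "OutOfStock"
    return {
        "@context": "https://schema.org",
        "@type": "Product",
        "name": name,
        # List every angle you publish; the first is usually the isolated shot.
        "image": image_urls,
        "gtin": gtin,
        "brand": {"@type": "Brand", "name": brand},
        "offers": {
            "@type": "Offer",
            "price": f"{price:.2f}",
            "priceCurrency": currency,
            "availability": f"https://schema.org/{availability}",
        },
    }


if __name__ == "__main__":
    markup = product_jsonld(
        name="Trail Runner 2",
        image_urls=[
            "https://example.com/media/trail-runner-2-side.jpg",
            "https://example.com/media/trail-runner-2-sole.jpg",
        ],
        gtin="00012345600012",  # placeholder value, not a real GTIN
        brand="ExampleBrand",
        price=89.5,
        currency="USD",
        in_stock=True,
    )
    print(json.dumps(markup, indent=2))
```

Generating this from the same product record that feeds your Merchant feed is one way to keep names, images, and prices from drifting apart across surfaces.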

Visual search rewards specificity. A retailer with original, multi-angle product images backed by structured data will out-match a competitor relying on the manufacturer's stock photo that a hundred other sites also use. This matters more as Google moves toward AI-driven shopping; see our piece on getting your products into Google's AI shopping agent.

CH.06 Video discoverability for multimodal and conversational queries

Video is now a first-class input and output: users can attach a video to the new Search box, and conversational queries increasingly surface video answers. The same legibility principles apply, so give the engine clean metadata and structure.

Mark up videos with VideoObject structured data including a thumbnail, name, description, upload date, and duration.
Provide a representative, high-quality thumbnail.
Supply transcripts and captions so the spoken content is machine-readable.
Submit a video sitemap so Google reliably discovers embedded and JavaScript-loaded videos.
Answer the specific question early and clearly, because the model is extracting an answer, not ranking a ten-minute watch.
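The VideoObject fields in the first item can be sketched as follows. This is a minimal example limited to the five properties named above; the function names and example values are hypothetical, and the duration helper assumes you store video length in whole seconds, since schema.org expects an ISO 8601 duration such as PT1M35S.

```python
import json
from datetime import datetime, timezone


def iso8601_duration(total_seconds: int) -> str:
    """Convert whole seconds to an ISO 8601 duration, e.g. 95 -> 'PT1M35S'."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    result = "PT"
    if hours:
        result += f"{hours}H"
    if minutes:
        result += f"{minutes}M"
    if seconds or result == "PT":
        result += f"{seconds}S"
    return result


def video_object_jsonld(name, description, thumbnail_url, uploaded_at, duration_seconds):
    """Return a schema.org VideoObject dict with the five fields listed above."""
    return {
        "@context": "https://schema.org",
        "@type": "VideoObject",
        "name": name,
        "description": description,
        "thumbnailUrl": thumbnail_url,
        # A timezone-aware datetime serializes with its UTC offset.
        "uploadDate": uploaded_at.isoformat(),
        "duration": iso8601_duration(duration_seconds),
    }


if __name__ == "__main__":
    markup = video_object_jsonld(
        name="How to submit an image sitemap",
        description="A short walkthrough that answers the question in the first 20 seconds.",
        thumbnail_url="https://example.com/media/image-sitemap-walkthrough.jpg",
        uploaded_at=datetime(2026, 6, 1, 9, 0, tzinfo=timezone.utc),
        duration_seconds=95,
    )
    print(json.dumps(markup, indent=2))
```

Transcripts and captions are not covered by this sketch; they still need to be published on the page or attached to the video so the spoken content is readable.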

Video and YouTube deserve their own playbook, especially now that YouTube has its own conversational search. We go deep on this in how conversational video search changes YouTube SEO; treat that as the companion to this section.

CH.07 Entity and topical clarity so Google can reason across modalities

The connective tissue of multimodal optimization is entity clarity. Google can only reason across an image, a tab, and a sentence if it can resolve all three to the same well-defined entity. If your brand, products, people, and topics are ambiguous, the model has nothing stable to anchor a cross-modal query to.

Use Organization, Product, and Person structured data consistently, and connect them with sameAs links to authoritative profiles like Wikipedia, Wikidata, LinkedIn, and official social accounts.
Keep names, attributes, and descriptions consistent across your site, your structured data, your image alt text, and your off-site presence.
Build genuine topical depth so that when a query spans text and an image, your page is the obvious match on the subject, not a thin page that happens to contain the keyword.
Anchor every asset to its topic, so the image, the video, the copy, and the schema all describe the same thing. Conflicting signals dilute the match.
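The first two items can be sketched in code: Organization markup with sameAs links, plus a simple check that the brand name reads the same everywhere you publish it. Both helpers are illustrative assumptions, not a standard tool, and the normalization (case and whitespace only) is deliberately crude.

```python
import json


def organization_jsonld(name, url, logo_url, same_as):
    """Return a schema.org Organization dict with sameAs profile links."""
    return {
        "@context": "https://schema.org",
        "@type": "Organization",
        "name": name,
        "url": url,
        "logo": logo_url,
        "sameAs": list(same_as),
    }


def name_mismatches(canonical_name, observed):
    """Return the sources whose name differs from the canonical one.

    `observed` maps a source label (e.g. "merchant feed") to the name used there.
    Comparison ignores case and repeated whitespace only.
    """
    def normalize(value):
        return " ".join(value.casefold().split())

    target = normalize(canonical_name)
    return [source for source, value in observed.items() if normalize(value) != target]


if __name__ == "__main__":
    org = organization_jsonld(
        name="Example Outdoor Co.",
        url="https://example.com",
        logo_url="https://example.com/media/example-outdoor-logo.png",
        same_as=[
            "https://www.wikidata.org/wiki/Q0000000",  # placeholder ID
            "https://www.linkedin.com/company/example-outdoor",
        ],
    )
    print(json.dumps(org, indent=2))
    print(name_mismatches("Example Outdoor Co.", {
        "structured data": "Example Outdoor Co.",
        "image alt text": "example  outdoor co.",
        "merchant feed": "Example Outdoors",
    }))
```

In the usage example only the merchant feed entry is reported, because the alt-text variant differs from the canonical name in case and spacing alone.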

This is the same E-E-A-T and entity work that drives every other AI surface; multimodal search just raises the stakes by adding pixels and spoken language to the inputs the model weighs.

CH.08 A practical multimodal optimization checklist

Work through this list for any page you want visible in multimodal and visual search.

Replace stock and low-resolution images. Use original, high-quality assets shot from the angles users actually photograph.
Rename and re-caption every important image. Descriptive filenames and accurate, content-specific alt text.
Use real img markup, served responsively and compressed. Real elements, srcset, and speed verified in PageSpeed Insights.
Add page-appropriate image structured data, and designate the primary image of the page.
Generate and submit an image sitemap. Add a video sitemap if you host video, and resubmit when assets change.
Mark up videos with VideoObject. Add transcripts and captions, and front-load the answer.
Tie product images to complete structured product data, for shopping and visual-match resolution.
Reinforce entity clarity. Organization, Product, and Person schema plus sameAs links.
Place every asset in topically relevant context, with supporting copy that describes it.
Re-audit quarterly, because the multimodal share of queries is rising and the surfaces keep expanding.

FAQ: Common questions

What is Google's new multimodal Search box?

It is a redesigned Search box, announced at Google I/O 2026 as the biggest upgrade in over 25 years, that dynamically expands as you type, offers AI-powered suggestions beyond autocomplete, and accepts text, images, files, videos, and Chrome tabs as inputs in a single query. Per Google, it reasons across those inputs together and started rolling out on desktop and mobile in all countries and languages where AI Mode is available, running on the Gemini 3.5 Flash model.

How do I optimize images for Google's multimodal and visual search?

Ship original, high-resolution images; give each a descriptive filename and accurate, content-specific alt text; place them in real img markup surrounded by topical copy; add page-appropriate structured data; and submit an image sitemap so Google discovers everything, including JavaScript-loaded images. For Google Lens and visual matches, use distinctive, multi-angle product shots tied to complete structured product data. Compress images for speed, since slow pages suppress the images on them.

Does alt text still matter for SEO in 2026?

Yes, more than ever. Google reads alt text to understand image content, screen readers rely on it for accessibility, and the reasoning model interpreting a multimodal query uses it as a text signal about the pixels. Google's guidance is to write accurate, informative descriptions of what the image shows and to avoid keyword stuffing; its own example favors "Dalmatian puppy playing fetch" over a generic keyword string.

What is an image sitemap and do I need one?

An image sitemap tells Google about images on your site it might not otherwise find, especially images loaded via JavaScript or served from a CDN. You nest up to 1,000 image:image tags inside each url, put each image URL in image:loc, and declare the http://www.google.com/schemas/sitemap-image/1.1 namespace. If your images are client-side rendered or on a separate verified domain, it is often the only reliable way Google indexes them, so yes, generate and submit one.

How does the multimodal Search box affect video SEO?

Video became a first-class input and answer surface, so videos need to be machine-legible. Mark them up with VideoObject structured data, provide a strong thumbnail, add transcripts and captions so spoken content is readable, submit a video sitemap, and answer the query early since the model extracts answers rather than ranking long watches. Conversational video search, including on YouTube, rewards clear, specific, well-structured content over production length.

Is image and video optimization still optional in 2026?

No. With more than 16 percent of Google searches now multimodal and the default answer engine a model that reads images and reasons across modalities, non-text assets are primary ranking inputs, not afterthoughts. Treating image and video optimization as a core discipline, with original assets, descriptive metadata, structured data, sitemaps, and entity clarity, is now table stakes for visibility in both classic search and AI Mode.

References

Google. "Search at I/O 2026: the biggest upgrade to our Search box in over 25 years." blog.google/products-and-platforms/products/search/search-io-2026/
Google. "Google I/O 2026: all our announcements." blog.google/innovation-and-ai/technology/ai/google-io-2026-all-our-announcements/
Google Search Central. "Image SEO best practices." developers.google.com/search/docs/appearance/google-images
Google Search Central. "Image sitemaps." developers.google.com/search/docs/crawling-indexing/sitemaps/image-sitemaps
Schema.org. "ImageObject." schema.org/ImageObject
Schema.org. "VideoObject." schema.org/VideoObject

Cortex

Search Marketing Intelligence, Capconvert

Cortex is Capconvert's search marketing intelligence system. This guide synthesizes Google's I/O 2026 Search announcements and published image and video SEO documentation into a modality-by-modality playbook for multimodal visibility. Reviewed by Jacque.