SAM 3 Visual Prompts
With SAM 3 visual prompts, you don’t need the perfect text prompt or class name. Tap, draw a box, or rough out a mask, and SAM 3 turns those simple visual hints into precise, pixel-level segmentation of any object in your image or video. It’s segmentation that feels more like sketching than scripting.
SAM 3 Visual Prompts: From Click-to-Segment to Concept-Aware Interactions
Segment Anything Model 3 (SAM 3) is known for its powerful text and concept prompts, but that doesn’t mean the classic visual prompts have disappeared. In fact, SAM 3 keeps and improves the click, box, and mask style of interaction that made earlier SAM versions popular, then layers it into a much richer promptable concept segmentation (PCS) pipeline.
If you’re coming from SAM 1 or SAM 2, “visual prompts” were the main way you interacted with the model:
- Click on an object → get a mask
- Draw a box → refine the region
- Paste a rough mask → improve its edges
With SAM 3 visual prompts, you can still do all of that, but now inside a model that also understands text, image exemplars, hybrid prompts, and video tracking. Visual prompts become one flexible input channel among many.
In this article, we’ll walk through:
- What “visual prompts” mean in SAM 3
- How they relate to concept prompts and PCS
- Types of visual prompts (points, boxes, masks, regions)
- Workflows on images and video
- When to use visual prompts instead of text
- UI, UX, and integration ideas
- Limitations, trade-offs, and best practices
1. What Are Visual Prompts in SAM 3?
A visual prompt is any spatial, visual hint you give the model on top of the raw image or frame—such as:
- A point (or multiple points) you click on
- A bounding box around an area of interest
- An existing mask (from a previous prediction or another model)
- A region or polygon scribble
Instead of only telling SAM 3 what you want via language (“yellow bus”), you tell it where to look, or which pixels are foreground vs background.
SAM 3 supports visual prompts as a legacy from earlier Segment Anything models, but now they can be:
- Used alone, like in SAM 1/SAM 2
- Combined with text prompts, e.g.:
  - “Person” + a click on one person
  - “Car” + a box around a cluster of cars
- Used to refine concept-based outputs, by cleaning up or correcting masks returned by text or hybrid prompts
2. Why Visual Prompts Still Matter in a Text-Prompt World
Text prompts are incredibly powerful, but visual prompts still solve real problems that language alone can’t handle cleanly.
2.1 Disambiguation in crowded scenes
Example:
- Text prompt: “person with backpack”
- There are 8 people with backpacks in the image.
- You only want one of them.
You can combine a visual prompt (a click/box on the specific person) with the text concept to guide SAM 3:
- Visual prompt → this particular person here
- Text concept → the “person-with-backpack” semantics
Result: SAM 3 focuses on the right instance, instead of returning masks for all matches.
2.2 Fine control for creative workflows
In video editing, graphic design, or VFX, artists often:
- Want pixel-perfect edges around a subject
- Need to control exactly which object is segmented
- Prefer interactive refinement over text alone
Visual prompts make the process feel like painting with AI:
- Text prompt for the general concept: “bride’s dress”
- Visual click to choose which person (if multiple)
- Visual corrections to fix sleeves, hair, veil edges
2.3 “I don’t know what this is called, but I can point at it”
Sometimes, text fails because:
- You don’t know the proper name of the object
- The concept is vague (“this texture”, “this exact logo”)
- There’s no clean noun phrase
Visual prompts let you say:
“I don’t know the word, but segment this thing right here.”
3. Types of Visual Prompts in SAM 3
While the exact API details depend on the implementation, SAM 3 visual prompts can be grouped into a few core types:
3.1 Point Prompts
The classic: single or multiple clicks on the image.
- Positive points: “this pixel belongs to the object”
- Negative points: “this pixel does not belong to the object”
Usage examples:
- One positive point on a person’s shirt
- Several positive points along a car’s body
- Negative points on the background to prevent the mask from leaking
SAM-like models use these points to help the segmentation head infer the full shape.
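For concreteness, here is a minimal sketch of positive and negative point prompts, written against the SamPredictor-style predict() interface from earlier SAM releases. Whether SAM 3 ships the same signature is an assumption, so treat the call as illustrative:

```python
import numpy as np

# `predictor` is assumed to be a SamPredictor-style object with the image
# already set via predictor.set_image(image); SAM 3's API may differ.
point_coords = np.array([
    [420, 310],  # positive click on the shirt
    [455, 360],  # second positive click along the body
    [120, 500],  # negative click on the background to prevent leaking
])
point_labels = np.array([1, 1, 0])  # 1 = foreground, 0 = background

masks, scores, logits = predictor.predict(
    point_coords=point_coords,
    point_labels=point_labels,
    multimask_output=True,  # let the model propose several candidate masks
)
best_mask = masks[np.argmax(scores)]  # keep the highest-scoring candidate
```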
3.2 Box Prompts
A bounding box around a region of interest:
- Quick way to tell SAM 3 where to look
- Useful for rough localization when text is too broad
Example workflow:
- Draw a box around a cluster of objects (e.g., a shelf with multiple products).
- Use text: “blue cereal box” inside that box.
- SAM 3 returns masks only inside that spatial region.
Boxes are powerful for narrowing down context, especially in cluttered images.
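A box prompt looks like this in the same hedged SamPredictor-style interface. The [x1, y1, x2, y2] layout matches earlier SAM releases; the exact SAM 3 signature is an assumption:

```python
import numpy as np

# Box in [x1, y1, x2, y2] pixel coordinates around the region of interest.
box = np.array([150, 80, 620, 540])

masks, scores, _ = predictor.predict(
    box=box,
    multimask_output=False,  # a box is usually unambiguous enough for one mask
)
```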
3.3 Mask Prompts
You can feed an existing mask to SAM 3 as a prompt:
- A previous SAM 3 output
- A mask from another model (e.g., an instance segmentation model)
- A very rough hand-drawn mask
Why use masks as prompts?
- To refine edges, fill holes, or fix jagged boundaries
- To propagate an initial mask across video frames (with tracking)
- To treat one segmented region as a “seed” to grow or shrink
SAM 3 can treat the mask as an initial guess and adjust it using its deeper features and concept awareness.
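In the earlier SamPredictor API, the mask prompt is passed as low-resolution mask logits via `mask_input`; the sketch below assumes SAM 3 keeps a similar refinement loop:

```python
import numpy as np

# First pass: a single click yields a rough mask plus its low-res logits
# (shape (1, 256, 256) in SAM 1; assumed similar here).
masks, scores, low_res_logits = predictor.predict(
    point_coords=np.array([[420, 310]]),
    point_labels=np.array([1]),
    multimask_output=False,
)

# Second pass: feed the previous logits back in as an initial guess,
# together with one correction click.
refined_masks, refined_scores, _ = predictor.predict(
    point_coords=np.array([[400, 280]]),
    point_labels=np.array([1]),
    mask_input=low_res_logits,  # previous prediction acts as the seed
    multimask_output=False,
)
```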
3.4 Region or Scribble Prompts
Depending on your tool or UI, you might also support:
- Scribbles: quick strokes marking foreground vs background
- Polygon selections: manual outlines passed as prompts
These are basically more expressive variants of mask/point prompts, especially useful in annotation tools.
4. Visual Prompts vs Text Prompts vs Hybrid Prompts in SAM 3
SAM 3 unifies three kinds of prompting:
- Visual prompts: points, boxes, masks
- Text prompts: short noun phrases, concept phrases
- Image exemplars: reference crops or images
So what’s the difference in usage?
4.1 When to use only visual prompts
Use purely visual prompts when:
- You’re doing interactive segmentation (like a Photoshop-style helper)
- You don’t care about the object category; you just need masks
- You want fine manual control, as in annotation tools
This feels very similar to SAM 1 / SAM 2 workflows.
4.2 When to use only text prompts
Use only text prompts when:
-
You want open-vocabulary segmentation without interaction
-
You want all instances of a concept (“all red cars”, “soccer balls”)
-
You’re running large-scale batch processing or automation
This is where SAM 3’s concept engine shines.
4.3 When to use hybrid visual + text prompts
This is often the sweet spot for SAM 3:
- Text defines the semantic concept
- The visual prompt pins down the spatial instance or subset
Example:
- Text: “glass bottle”
- Visual: one click on the specific bottle you care about
- Result: SAM 3 segments the correct bottle, not every bottle in the scene
In complex scenes or video, hybrid prompts often produce the cleanest, most controllable results.
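No public hybrid-prompt signature is assumed known here, so the sketch below is purely illustrative: `sam3`, `segment_with_concept`, and all of its parameters are hypothetical stand-ins for whatever the real SDK exposes.

```python
import numpy as np

# Hypothetical hybrid call: text supplies the concept, the click selects
# the instance. None of these names come from a confirmed SAM 3 API.
result = sam3.segment_with_concept(
    image=image,
    text="glass bottle",              # semantic concept
    points=np.array([[512, 340]]),    # click on the one bottle we care about
    point_labels=np.array([1]),
)

# Expected behavior: one instance mask for the clicked bottle, rather than
# masks for every bottle that matches the concept.
mask = result.masks[0]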
5. SAM 3 Visual Prompt Workflow on Images
Let’s walk through a typical image workflow in a UI or script.
Step 1: Load Image and Model
You load the image into your tool and initialize SAM 3 (from GitHub, Hugging Face, or an integrated SDK).
Step 2: User Adds a Visual Prompt
The user interacts:
- One click on the object
- Or a box around it
- Or a rough mask
The UI converts this into prompt coordinates / masks for the model.
Step 3: Run SAM 3 with Visual Prompt
The backend sends something like:
- Image tensor
- Prompt type: point/box/mask
- Optional: text prompt
SAM 3 processes the prompt through its prompt encoder and segmentation decoder, returning:
- One or more masks
- Confidence scores
- (If text is involved) concept presence signals
Step 4: Display and Refine
The UI overlays the mask:
- User can add more points (positive or negative)
- User can adjust or erase parts of the mask
- New prompts are sent for iterative refinement
Step 5: Export
Once the user is satisfied:
- Export masks as PNGs, alpha channels, vector shapes, or polygon coordinates
- Feed them into downstream pipelines (VFX, product cutouts, data labeling, etc.)
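Putting Steps 1–5 together, a minimal script might look like the following. `Sam3Predictor` and its loader are hypothetical placeholders; the predict() shape follows earlier SAM releases:

```python
import numpy as np
from PIL import Image

# Step 1: load the image and the model. `Sam3Predictor` is a hypothetical
# wrapper; use whatever loader your checkpoint or SDK actually provides.
image = np.array(Image.open("scene.jpg").convert("RGB"))
predictor = Sam3Predictor.from_pretrained("sam3")
predictor.set_image(image)  # backbone features are computed once here

# Steps 2-3: the user's click becomes a point prompt.
masks, scores, logits = predictor.predict(
    point_coords=np.array([[640, 420]]),
    point_labels=np.array([1]),
    multimask_output=True,
)
best = int(np.argmax(scores))

# Step 4: refine with an extra negative click, reusing the cached features.
masks, scores, logits = predictor.predict(
    point_coords=np.array([[640, 420], [700, 520]]),
    point_labels=np.array([1, 0]),
    mask_input=logits[best:best + 1],
    multimask_output=False,
)

# Step 5: export the final binary mask as a PNG.
Image.fromarray((masks[0] * 255).astype(np.uint8)).save("mask.png")
```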
6. SAM 3 Visual Prompt Workflow on Video
Visual prompts are also important in video.
6.1 Initialize on a Key Frame
You start with a key frame:
- Choose a frame where the object is clearly visible.
- Add a visual prompt:
  - Click on the object
  - Draw a box
- (Optionally) Add a text concept for semantic hints.
6.2 Run “First Frame” Segmentation
SAM 3 segments the object in that frame, producing:
- A high-quality mask
- An instance ID
6.3 Track Through Time
Then SAM 3’s video tracker:
- Propagates the mask across subsequent frames
- Maintains the same instance ID
- Adjusts masks as the object moves, rotates, or changes scale
6.4 Correct with Additional Visual Prompts
When tracking drifts or fails (occlusions, fast motion):
- The user clicks on a frame where the mask is wrong
- Additional visual prompts are provided (e.g., correction clicks)
- SAM 3 re-aligns the track from that point onward
This is extremely powerful for:
- Rotoscoping
- Object-focused video analytics
- Automated highlight generation (e.g., tracking “ball” in sports)
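The loop in 6.1–6.4 maps naturally onto the video predictor API that SAM 2 shipped (init_state, add_new_points_or_box, propagate_in_video). Assuming SAM 3’s video interface stays in that family, a sketch looks like this:

```python
import numpy as np

# 6.1-6.2: seed the tracker with a click on a clear keyframe.
state = predictor.init_state(video_path="clip.mp4")
predictor.add_new_points_or_box(
    inference_state=state,
    frame_idx=0,
    obj_id=1,  # instance ID that stays stable across frames
    points=np.array([[480, 300]], dtype=np.float32),
    labels=np.array([1], dtype=np.int32),
)

# 6.3: propagate the mask through the rest of the clip.
video_masks = {}
for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
    video_masks[frame_idx] = (mask_logits[0] > 0.0).cpu().numpy()

# 6.4: if the track drifts at, say, frame 120, add a correction click there
# and re-run propagation from that point.
predictor.add_new_points_or_box(
    inference_state=state,
    frame_idx=120,
    obj_id=1,
    points=np.array([[455, 332]], dtype=np.float32),
    labels=np.array([1], dtype=np.int32),
)
```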
7. Implementing Visual Prompts: Practical Considerations
If you’re a developer, here’s how to think about integrating SAM 3 visual prompts into your tooling.
7.1 Coordinate Systems
Your frontend UI:
- Works in pixel coordinates (x, y) relative to the displayed image.
- May be scaled, cropped, or padded compared to the model’s input.
You must:
- Convert UI coordinates → model input coordinates
- Apply any resizing/padding transformations consistently to the input and to the mask output
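As a concrete example, here is a deliberately simplified mapping from a click in the displayed image to model-input coordinates. It assumes the display is a uniformly scaled, optionally padded view of the model input; adapt it if your UI also crops:

```python
def ui_to_model_coords(x_ui, y_ui, display_size, model_size, pad=(0, 0)):
    """Map a click from displayed-image pixels to model-input pixels.

    Assumes the model input is a uniformly scaled version of the display,
    with optional symmetric padding on the model side.
    """
    disp_w, disp_h = display_size
    model_w, model_h = model_size
    pad_x, pad_y = pad

    scale_x = (model_w - 2 * pad_x) / disp_w
    scale_y = (model_h - 2 * pad_y) / disp_h
    return x_ui * scale_x + pad_x, y_ui * scale_y + pad_y


# A click at (200, 150) on an 800x600 preview of a 1024x1024 model input:
print(ui_to_model_coords(200, 150, (800, 600), (1024, 1024)))
# -> (256.0, 256.0)
```

The inverse of the same transform has to be applied to the returned masks before overlaying them in the UI.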
7.2 Prompt Encoding
Visual prompts are usually encoded as:
-
Point arrays:
[x, y, label]where label is positive/negative -
Box arrays:
[x1, y1, x2, y2] -
Binary masks: same spatial resolution as input (or downscaled/upscaled)
Your library or SDK (GitHub/HF) will often provide helper functions to build these prompt tensors.
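In practice these encodings are just small arrays. The shapes below are typical of SAM-family predictors, but confirm the exact layout against your SDK’s helpers:

```python
import numpy as np

points = np.array([[420, 310], [120, 500]], dtype=np.float32)  # (N, 2) of [x, y]
labels = np.array([1, 0], dtype=np.int32)               # 1 = positive, 0 = negative
box = np.array([150, 80, 620, 540], dtype=np.float32)   # [x1, y1, x2, y2]
mask_prompt = np.zeros((256, 256), dtype=np.float32)    # low-res mask grid
```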
7.3 State Management for Interactive Sessions
In an ideal UI:
- Each interaction (click, box, correction) becomes a new prompt added to the “state” for that object.
- You may keep a list of:
  - Points so far
  - Current mask
  - Object ID
- When the user adds a new prompt, you send the full prompt history (or a summarized version) so SAM 3 can refine the segmentation more intelligently.
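A small sketch of that per-object state, with illustrative field names not tied to any SAM 3 API:

```python
from dataclasses import dataclass, field

import numpy as np


@dataclass
class ObjectPromptState:
    """Prompt history for one object in an interactive session."""
    object_id: int
    points: list = field(default_factory=list)  # [(x, y), ...]
    labels: list = field(default_factory=list)  # 1 = positive, 0 = negative
    current_mask: np.ndarray | None = None      # latest mask for this object

    def add_click(self, x, y, positive=True):
        self.points.append((x, y))
        self.labels.append(1 if positive else 0)

    def as_arrays(self):
        """Full prompt history in the array layout a predictor expects."""
        return (np.array(self.points, dtype=np.float32),
                np.array(self.labels, dtype=np.int32))


state = ObjectPromptState(object_id=1)
state.add_click(420, 310)                  # initial positive click
state.add_click(120, 500, positive=False)  # negative correction
coords, labs = state.as_arrays()           # send the whole history to the model
```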
7.4 Performance
Visual prompt workflows are often interactive: humans are waiting for feedback.
- You want response times under ~300–500 ms for a good UX.
- That may mean:
  - Using a smaller SAM 3 variant
  - Caching the image features so you don’t recompute the backbone every time
  - Running on a GPU
- A common approach:
  - Compute the image’s backbone features once.
  - For each new visual prompt, run only the prompt encoder and segmentation head.
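This split already exists in the earlier SamPredictor interface, where set_image() runs the heavy backbone once and each predict() call is cheap. Assuming SAM 3 keeps that separation, the interactive loop looks like this (`render_overlay` is a hypothetical UI callback):

```python
import numpy as np

predictor.set_image(image)  # slow: runs the backbone once and caches features

for point, label in user_clicks:  # e.g., [((420, 310), 1), ((120, 500), 0)]
    masks, scores, _ = predictor.predict(  # fast: prompt encoder + decoder only
        point_coords=np.array([point]),
        point_labels=np.array([label]),
        multimask_output=False,
    )
    render_overlay(masks[0])  # hypothetical UI callback
```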
8. Advantages of SAM 3 Visual Prompting
8.1 Fine-Grained Control
Visual prompts give per-object, per-pixel control that pure text can’t match, especially in creative and manual workflows.
8.2 Human-in-the-loop labeling
For dataset construction:
- SAM 3 can propose masks
- Annotators use visual prompts to quickly fix them
- This drastically cuts annotation time compared to drawing masks from scratch
8.3 Backward Compatibility with Existing Tools
If you previously integrated SAM 1 / SAM 2-style interactions:
- You can upgrade to SAM 3’s open-vocabulary power
- Still keep your existing UI pattern (clicks, boxes, etc.)
9. Limitations and Failure Modes of Visual Prompts in SAM 3
Visual prompting isn’t magic; it has some limitations.
9.1 Ambiguous Regions
If you click on a blurry, mixed region (like motion blur or transparent glass), SAM 3 may incorrectly estimate boundaries.
Mitigation:
- Place points on clear, well-defined object pixels
- Use multiple points for large or complex shapes
9.2 Complex Overlaps
When objects overlap heavily (people in a crowd, tangled objects):
- A single click might not be enough to isolate the exact instance.
Mitigation:
- Combine text (“child in blue jacket”) + click on that specific child
- Add negative points on adjacent objects
9.3 Domain Shift
In extremely unusual domains (e.g., medical scans, thermal images):
- Visual prompts help, but the underlying features may not generalize well.
- SAM 3 may still need domain-specific fine-tuning for high accuracy.
9.4 Video Drift
In video:
- Even with visual prompts, tracking can drift if frames are heavily blurred, occluded, or jump-cut.
Mitigation:
- Use keyframe correction: periodically add fresh visual prompts on new frames to realign tracking.
10. Best Practices for “SAM 3 Visual Prompt” Workflows
Here are some practical guidelines to get the best results.
10.1 Use Clear Foreground Points
- Place clicks on visually distinctive parts of the object (edges, patterns, colors).
- Avoid extremely ambiguous regions like reflections or shadows.
10.2 Combine Positive and Negative Points
- Use positive points to define the object
- Use negative points to carve away background or neighboring objects
This dramatically improves mask quality.
10.3 Use Text When the Scene is Busy
In busy scenes:
- Start with a text prompt (“yellow excavator”)
- Add a visual prompt to pick exactly which excavator you care about
You get both semantic understanding and spatial precision.
10.4 Cache Features for Interactivity
For UI apps:
- Precompute and cache backbone features
- Only recompute the prompt-specific parts
This makes visual prompting feel instant rather than “click and wait”.
10.5 Save Prompt Histories
If you’re building a professional tool:
- Store each object’s prompt history (points, boxes, etc.)
- Allow users to reload and re-edit objects without starting from scratch

This is also helpful if you upgrade to a new SAM version: prompt histories can be re-run with improved models.
11. Building Products Around SAM 3 Visual Prompts
Some concrete product ideas where SAM 3 visual prompts are central:
11.1 Video Editing Panel
- Timeline view with frames
- Click on an actor → create a tracked mask with SAM 3
- Offer a “Refine mask” button with additional prompt clicks
11.2 Labeling/Annotation Tool
- Display images from training datasets
- Auto-suggest masks via concept prompts
- Annotators fix them via visual prompts
- Export cleaned masks + polygons
11.3 AR/Camera App
- Live camera feed
- User taps an object → SAM 3 segments it
- Apply filters, color changes, or overlays only on that object
Visual prompts here are just taps on the screen.
12. Visual Prompts as Part of the Bigger SAM 3 Story
SAM 3 is often described in terms of:
- Text/image exemplars
- Tracking and the SA-Co benchmark
Visual prompts are the bridge between that high-level vision and the proven, practical workflows from earlier SAM generations. They let humans:
- Intervene when text alone is not enough
- Correct and refine masks
- Align the model’s understanding with their intent
In other words, SAM 3 visual prompts are how you “steer” the model interactively, one click, box, or scribble at a time.
13. Summary: Key Takeaways About SAM 3 Visual Prompts
- Visual prompts are still first-class citizens in SAM 3: points, boxes, masks, and regions remain essential.
- They work alongside text and exemplar prompts, not instead of them.
- Use visual prompts when you need:
  - Per-instance control
  - Fine-grained edits
  - Manual correction in labeling and VFX workflows
- Use hybrid prompts (visual + text) for the best of both worlds in cluttered scenes.
- In video, visual prompts seed the tracker, and additional prompts correct drift.
- Implementations should focus on:
  - Good coordinate handling
  - Fast, interactive response times
  - Storing prompt histories for re-editing
- If you’re building tools on top of SAM 3, leaning into visual prompts gives your users a feeling of direct, tactile control instead of just typing prompts and hoping for the best.