SAM 3 Visual Prompts
With SAM 3 visual prompts, you don’t need the perfect text prompt or class name. Tap, draw a box, or rough out a mask, and SAM 3 turns those simple visual hints into precise, pixel-level segmentation of any object in your image or video. It’s segmentation that feels more like sketching than scripting.
SAM 3 Visual Prompts: From Click-to-Segment to Concept-Aware Interactions
Segment Anything Model 3 (SAM 3) is known for its powerful text and concept prompts, but that doesn’t mean the classic visual prompts have disappeared. In fact, SAM 3 keeps and improves the click, box, and mask style of interaction that made earlier SAM versions popular, then layers it into a much richer promptable concept segmentation (PCS) pipeline.
If you’re coming from SAM 1 or SAM 2, “visual prompts” were the main way you interacted with the model:
- Click on an object → get a mask
- Draw a box → refine the region
- Paste a rough mask → improve its edges
With SAM 3 visual prompts, you can still do all of that, but now inside a model that also understands text, image exemplars, hybrid prompts, and video tracking. Visual prompts become one flexible input channel among many.
In this article, we’ll walk through:
- What “visual prompts” mean in SAM 3
- How they relate to concept prompts and PCS
- Types of visual prompts (points, boxes, masks, regions)
- Workflows on images and video
- When to use visual prompts instead of text
- UI, UX, and integration ideas
- Limitations, trade-offs, and best practices
1. What Are Visual Prompts in SAM 3?
A visual prompt is any spatial, visual hint you give the model on top of the raw image or frame—such as:
- A point (or multiple points) you click on
- A bounding box around an area of interest
- An existing mask (from a previous prediction or another model)
- A region or polygon scribble
Instead of only telling SAM 3 what you want via language (“yellow bus”), you tell it where to look, or which pixels are foreground vs background.
SAM 3 supports visual prompts as a legacy from earlier Segment Anything models, but now they can be:
- Used alone, like in SAM 1/SAM 2
- Combined with text prompts, e.g.:
  - “Person” + a click on one person
  - “Car” + a box around a cluster of cars
- Used to refine concept-based outputs, by cleaning up or correcting masks returned by text or hybrid prompts
2. Why Visual Prompts Still Matter in a Text-Prompt World
Text prompts are incredibly powerful, but visual prompts still solve real problems that language alone can’t handle cleanly.
2.1 Disambiguation in crowded scenes
Example:
- Text prompt: “person with backpack”
- There are 8 people with backpacks in the image.
- You only want one of them.
You can combine a visual prompt (a click/box on the specific person) with the text concept to guide SAM 3:
- Visual prompt → this particular person here
- Text concept → the “person-with-backpack” semantics
Result: SAM 3 focuses on the right instance, instead of returning masks for all matches.
2.2 Fine control for creative workflows
In video editing, graphic design, or VFX, artists often:
- Want pixel-perfect edges around a subject
- Need to control exactly which object is segmented
- Prefer interactive refinement over text alone
Visual prompts make the process feel like painting with AI:
- Text prompt for the general concept: “bride’s dress”
- Visual click to choose which person (if multiple)
- Visual corrections to fix sleeves, hair, veil edges
2.3 “I don’t know what this is called, but I can point at it”
Sometimes, text fails because:
- You don’t know the proper name of the object
- The concept is vague (“this texture”, “this exact logo”)
- There’s no clean noun phrase
Visual prompts let you say:
“I don’t know the word, but segment this thing right here.”
3. Types of Visual Prompts in SAM 3
While the exact API details depend on the implementation, SAM 3 visual prompts can be grouped into a few core types:
3.1 Point Prompts
The classic: single or multiple clicks on the image.
- Positive points: “this pixel belongs to the object”
- Negative points: “this pixel does not belong to the object”
Usage examples:
- One positive point on a person’s shirt
- Several positive points along a car’s body
- Negative points on the background to prevent the mask from leaking
SAM-like models use these points to help the segmentation head infer the full shape.
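For concreteness, here is a minimal sketch of positive and negative point prompts, written against the SamPredictor-style predict() interface from earlier SAM releases. Whether SAM 3 ships the same signature is an assumption, so treat the call as illustrative:

```python
import numpy as np

# `predictor` is assumed to be a SamPredictor-style object with the image
# already set via predictor.set_image(image); SAM 3's API may differ.
point_coords = np.array([
    [420, 310],  # positive click on the shirt
    [455, 360],  # second positive click along the body
    [120, 500],  # negative click on the background to prevent leaking
])
point_labels = np.array([1, 1, 0])  # 1 = foreground, 0 = background

masks, scores, logits = predictor.predict(
    point_coords=point_coords,
    point_labels=point_labels,
    multimask_output=True,  # let the model propose several candidate masks
)
best_mask = masks[np.argmax(scores)]  # keep the highest-scoring candidate
```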
3.2 Box Prompts
A bounding box around a region of interest:
- Quick way to tell SAM 3 where to look
- Useful for rough localization when text is too broad
Example workflow:
- Draw a box around a cluster of objects (e.g., a shelf with multiple products).
- Use text: “blue cereal box” inside that box.
- SAM 3 returns masks only inside that spatial region.
Boxes are powerful for narrowing down context, especially in cluttered images.
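A box prompt looks like this in the same hedged SamPredictor-style interface. The [x1, y1, x2, y2] layout matches earlier SAM releases; the exact SAM 3 signature is an assumption:

```python
import numpy as np

# Box in [x1, y1, x2, y2] pixel coordinates around the region of interest.
box = np.array([150, 80, 620, 540])

masks, scores, _ = predictor.predict(
    box=box,
    multimask_output=False,  # a box is usually unambiguous enough for one mask
)
```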
3.3 Mask Prompts
You can feed an existing mask to SAM 3 as a prompt:
- A previous SAM 3 output
- A mask from another model (e.g., an instance segmentation model)
- A very rough hand-drawn mask
Why use masks as prompts?
- To refine edges, fill holes, or fix jagged boundaries
- To propagate an initial mask across video frames (with tracking)
- To treat one segmented region as a “seed” to grow or shrink
SAM 3 can treat the mask as an initial guess and adjust it using its deeper features and concept awareness.
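In the earlier SamPredictor API, the mask prompt is passed as low-resolution mask logits via `mask_input`; the sketch below assumes SAM 3 keeps a similar refinement loop:

```python
import numpy as np

# First pass: a single click yields a rough mask plus its low-res logits
# (shape (1, 256, 256) in SAM 1; assumed similar here).
masks, scores, low_res_logits = predictor.predict(
    point_coords=np.array([[420, 310]]),
    point_labels=np.array([1]),
    multimask_output=False,
)

# Second pass: feed the previous logits back in as an initial guess,
# together with one correction click.
refined_masks, refined_scores, _ = predictor.predict(
    point_coords=np.array([[400, 280]]),
    point_labels=np.array([1]),
    mask_input=low_res_logits,  # previous prediction acts as the seed
    multimask_output=False,
)
```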
3.4 Region or Scribble Prompts
Depending on your tool or UI, you might also support:
- Scribbles: quick strokes marking foreground vs background
- Polygon selections: manual outlines passed as prompts
These are basically more expressive variants of mask/point prompts, especially useful in annotation tools.
4. Visual Prompts vs Text Prompts vs Hybrid Prompts in SAM 3
SAM 3 unifies three kinds of prompting:
- Visual prompts: points, boxes, masks
- Text prompts: short noun phrases, concept phrases
- Image exemplars: reference crops or images
So what’s the difference in usage?
4.1 When to use only visual prompts
Use purely visual prompts when:
- You’re doing interactive segmentation (like a Photoshop-style helper)
- You don’t care about the object category; you just need masks
- You want fine manual control, as in annotation tools
This feels very similar to SAM 1 / SAM 2 workflows.
4.2 When to use only text prompts
Use only text prompts when:
-
You want open-vocabulary segmentation without interaction
-
You want all instances of a concept (“all red cars”, “soccer balls”)
-
You’re running large-scale batch processing or automation
This is where SAM 3’s concept engine shines.
4.3 When to use hybrid visual + text prompts
This is often the sweet spot for SAM 3:
- Text defines the semantic concept
- The visual prompt pins down the spatial instance or subset
Example:
- Text: “glass bottle”
- Visual: one click on the specific bottle you care about
- Result: SAM 3 segments the correct bottle, not every bottle in the scene
In complex scenes or video, hybrid prompts often produce the cleanest, most controllable results.
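No public hybrid-prompt signature is assumed known here, so the sketch below is purely illustrative: `sam3`, `segment_with_concept`, and all of its parameters are hypothetical stand-ins for whatever the real SDK exposes.

```python
import numpy as np

# Hypothetical hybrid call: text supplies the concept, the click selects
# the instance. None of these names come from a confirmed SAM 3 API.
result = sam3.segment_with_concept(
    image=image,
    text="glass bottle",              # semantic concept
    points=np.array([[512, 340]]),    # click on the one bottle we care about
    point_labels=np.array([1]),
)

# Expected behavior: one instance mask for the clicked bottle, rather than
# masks for every bottle that matches the concept.
mask = result.masks[0]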
5. SAM 3 Visual Prompt Workflow on Images
Let’s walk through a typical image workflow in a UI or script.
Step 1: Load Image and Model
You load the image into your tool and initialize SAM 3 (from GitHub, Hugging Face, or an integrated SDK).
Step 2: User Adds a Visual Prompt
The user interacts:
- One click on the object
- Or a box around it
- Or a rough mask
The UI converts this into prompt coordinates / masks for the model.
Step 3: Run SAM 3 with Visual Prompt
The backend sends something like:
- Image tensor
- Prompt type: point/box/mask
- Optional: text prompt
SAM 3 processes the prompt through its prompt encoder and segmentation decoder, returning:
- One or more masks
- Confidence scores
- (If text is involved) concept presence signals
Step 4: Display and Refine
The UI overlays the mask:
- User can add more points (positive or negative)
- User can adjust or erase parts of the mask
- New prompts are sent for iterative refinement
Step 5: Export
Once the user is satisfied:
- Export masks as PNGs, alpha channels, vector shapes, or polygon coordinates
- Feed them into downstream pipelines (VFX, product cutouts, data labeling, etc.)
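Putting Steps 1–5 together, a minimal script might look like the following. `Sam3Predictor` and its loader are hypothetical placeholders; the predict() shape follows earlier SAM releases:

```python
import numpy as np
from PIL import Image

# Step 1: load the image and the model. `Sam3Predictor` is a hypothetical
# wrapper; use whatever loader your checkpoint or SDK actually provides.
image = np.array(Image.open("scene.jpg").convert("RGB"))
predictor = Sam3Predictor.from_pretrained("sam3")
predictor.set_image(image)  # backbone features are computed once here

# Steps 2-3: the user's click becomes a point prompt.
masks, scores, logits = predictor.predict(
    point_coords=np.array([[640, 420]]),
    point_labels=np.array([1]),
    multimask_output=True,
)
best = int(np.argmax(scores))

# Step 4: refine with an extra negative click, reusing the cached features.
masks, scores, logits = predictor.predict(
    point_coords=np.array([[640, 420], [700, 520]]),
    point_labels=np.array([1, 0]),
    mask_input=logits[best:best + 1],
    multimask_output=False,
)

# Step 5: export the final binary mask as a PNG.
Image.fromarray((masks[0] * 255).astype(np.uint8)).save("mask.png")
```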
6. SAM 3 Visual Prompt Workflow on Video
Visual prompts are also important in video.
6.1 Initialize on a Key Frame
You start with a key frame:
- Choose a frame where the object is clearly visible.
- Add a visual prompt:
  - Click on the object
  - Draw a box
- (Optionally) Add a text concept for semantic hints.
6.2 Run “First Frame” Segmentation
SAM 3 segments the object in that frame, producing:
- A high-quality mask
- An instance ID
6.3 Track Through Time
Then SAM 3’s video tracker:
- Propagates the mask across subsequent frames
- Maintains the same instance ID
- Adjusts masks as the object moves, rotates, or changes scale
6.4 Correct with Additional Visual Prompts
When tracking drifts or fails (occlusions, fast motion):
- The user clicks on a frame where the mask is wrong
- Additional visual prompts are provided (e.g., correction clicks)
- SAM 3 re-aligns the track from that point onward
This is extremely powerful for:
- Rotoscoping
- Object-focused video analytics
- Automated highlight generation (e.g., tracking “ball” in sports)
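The loop in 6.1–6.4 maps naturally onto the video predictor API that SAM 2 shipped (init_state, add_new_points_or_box, propagate_in_video). Assuming SAM 3’s video interface stays in that family, a sketch looks like this:

```python
import numpy as np

# 6.1-6.2: seed the tracker with a click on a clear keyframe.
state = predictor.init_state(video_path="clip.mp4")
predictor.add_new_points_or_box(
    inference_state=state,
    frame_idx=0,
    obj_id=1,  # instance ID that stays stable across frames
    points=np.array([[480, 300]], dtype=np.float32),
    labels=np.array([1], dtype=np.int32),
)

# 6.3: propagate the mask through the rest of the clip.
video_masks = {}
for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
    video_masks[frame_idx] = (mask_logits[0] > 0.0).cpu().numpy()

# 6.4: if the track drifts at, say, frame 120, add a correction click there
# and re-run propagation from that point.
predictor.add_new_points_or_box(
    inference_state=state,
    frame_idx=120,
    obj_id=1,
    points=np.array([[455, 332]], dtype=np.float32),
    labels=np.array([1], dtype=np.int32),
)
```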
7. Implementing Visual Prompts: Practical Considerations
If you’re a developer, here’s how to think about integrating SAM 3 visual prompts into your tooling.
7.1 Coordinate Systems
Your frontend UI:
- Works in pixel coordinates (x, y) relative to the displayed image.
- May be scaled, cropped, or padded compared to the model’s input.
You must:
- Convert UI coordinates → model input coordinates
- Apply any resizing/padding transformations consistently to the input and to the mask output
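As a concrete example, here is a deliberately simplified mapping from a click in the displayed image to model-input coordinates. It assumes the display is a uniformly scaled, optionally padded view of the model input; adapt it if your UI also crops:

```python
def ui_to_model_coords(x_ui, y_ui, display_size, model_size, pad=(0, 0)):
    """Map a click from displayed-image pixels to model-input pixels.

    Assumes the model input is a uniformly scaled version of the display,
    with optional symmetric padding on the model side.
    """
    disp_w, disp_h = display_size
    model_w, model_h = model_size
    pad_x, pad_y = pad

    scale_x = (model_w - 2 * pad_x) / disp_w
    scale_y = (model_h - 2 * pad_y) / disp_h
    return x_ui * scale_x + pad_x, y_ui * scale_y + pad_y


# A click at (200, 150) on an 800x600 preview of a 1024x1024 model input:
print(ui_to_model_coords(200, 150, (800, 600), (1024, 1024)))
# -> (256.0, 256.0)
```

The inverse of the same transform has to be applied to the returned masks before overlaying them in the UI.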
7.2 Prompt Encoding
Visual prompts are usually encoded as:
-
Point arrays:
[x, y, label]where label is positive/negative -
Box arrays:
[x1, y1, x2, y2] -
Binary masks: same spatial resolution as input (or downscaled/upscaled)
Your library or SDK (GitHub/HF) will often provide helper functions to build these prompt tensors.
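In practice these encodings are just small arrays. The shapes below are typical of SAM-family predictors, but confirm the exact layout against your SDK’s helpers:

```python
import numpy as np

points = np.array([[420, 310], [120, 500]], dtype=np.float32)  # (N, 2) of [x, y]
labels = np.array([1, 0], dtype=np.int32)               # 1 = positive, 0 = negative
box = np.array([150, 80, 620, 540], dtype=np.float32)   # [x1, y1, x2, y2]
mask_prompt = np.zeros((256, 256), dtype=np.float32)    # low-res mask grid
```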
7.3 State Management for Interactive Sessions
In an ideal UI:
- Each interaction (click, box, correction) becomes a new prompt added to the “state” for that object.
- You may keep a list of:
  - Points so far
  - Current mask
  - Object ID
- When the user adds a new prompt, you send the full prompt history (or a summarized version) so SAM 3 can refine the segmentation more intelligently.
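A small sketch of that per-object state, with illustrative field names not tied to any SAM 3 API:

```python
from dataclasses import dataclass, field

import numpy as np


@dataclass
class ObjectPromptState:
    """Prompt history for one object in an interactive session."""
    object_id: int
    points: list = field(default_factory=list)  # [(x, y), ...]
    labels: list = field(default_factory=list)  # 1 = positive, 0 = negative
    current_mask: np.ndarray | None = None      # latest mask for this object

    def add_click(self, x, y, positive=True):
        self.points.append((x, y))
        self.labels.append(1 if positive else 0)

    def as_arrays(self):
        """Full prompt history in the array layout a predictor expects."""
        return (np.array(self.points, dtype=np.float32),
                np.array(self.labels, dtype=np.int32))


state = ObjectPromptState(object_id=1)
state.add_click(420, 310)                  # initial positive click
state.add_click(120, 500, positive=False)  # negative correction
coords, labs = state.as_arrays()           # send the whole history to the model
```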
7.4 Performance
Visual prompt workflows are often interactive: humans are waiting for feedback.
- You want response times under ~300–500 ms for a good UX.
- That may mean:
  - Using a smaller SAM 3 variant
  - Caching the image features so you don’t recompute the backbone every time
  - Running on a GPU
- A common approach:
  - Compute the image’s backbone features once.
  - For each new visual prompt, run only the prompt encoder and segmentation head.
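This split already exists in the earlier SamPredictor interface, where set_image() runs the heavy backbone once and each predict() call is cheap. Assuming SAM 3 keeps that separation, the interactive loop looks like this (`render_overlay` is a hypothetical UI callback):

```python
import numpy as np

predictor.set_image(image)  # slow: runs the backbone once and caches features

for point, label in user_clicks:  # e.g., [((420, 310), 1), ((120, 500), 0)]
    masks, scores, _ = predictor.predict(  # fast: prompt encoder + decoder only
        point_coords=np.array([point]),
        point_labels=np.array([label]),
        multimask_output=False,
    )
    render_overlay(masks[0])  # hypothetical UI callback
```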
8. Advantages of SAM 3 Visual Prompting
8.1 Fine-Grained Control
Visual prompts give per-object, per-pixel control that pure text can’t match, especially in creative and manual workflows.
8.2 Human-in-the-loop labeling
For dataset construction:
- SAM 3 can propose masks
- Annotators use visual prompts to quickly fix them
- This drastically cuts annotation time compared to drawing masks from scratch
8.3 Backward Compatibility with Existing Tools
If you previously integrated SAM 1 / SAM 2-style interactions:
- You can upgrade to SAM 3’s open-vocabulary power
- Still keep your existing UI pattern (clicks, boxes, etc.)
9. Limitations and Failure Modes of Visual Prompts in SAM 3
Visual prompting isn’t magic; it has some limitations.
9.1 Ambiguous Regions
If you click on a blurry, mixed region (like motion blur or transparent glass), SAM 3 may incorrectly estimate boundaries.
Mitigation:
- Place points on clear, well-defined object pixels
- Use multiple points for large or complex shapes
9.2 Complex Overlaps
When objects overlap heavily (people in a crowd, tangled objects):
- A single click might not be enough to isolate the exact instance.
Mitigation:
- Combine text (“child in blue jacket”) + click on that specific child
- Add negative points on adjacent objects
9.3 Domain Shift
In extremely unusual domains (e.g., medical scans, thermal images):
- Visual prompts help, but the underlying features may not generalize well.
- SAM 3 may still need domain-specific fine-tuning for high accuracy.
9.4 Video Drift
In video:
- Even with visual prompts, tracking can drift if frames are heavily blurred, occluded, or jump-cut.
Mitigation:
- Use keyframe correction: periodically add fresh visual prompts on new frames to realign tracking.
10. Best Practices for “SAM 3 Visual Prompt” Workflows
Here are some practical guidelines to get the best results.
10.1 Use Clear Foreground Points
- Place clicks on visually distinctive parts of the object (edges, patterns, colors).
- Avoid extremely ambiguous regions like reflections or shadows.
10.2 Combine Positive and Negative Points
- Use positive points to define the object
- Use negative points to carve away background or neighboring objects
This dramatically improves mask quality.
10.3 Use Text When the Scene is Busy
In busy scenes:
- Start with a text prompt (“yellow excavator”)
- Add a visual prompt to pick exactly which excavator you care about
You get both semantic understanding and spatial precision.
10.4 Cache Features for Interactivity
For UI apps:
- Precompute and cache backbone features
- Only recompute the prompt-specific parts
This makes visual prompting feel instant rather than “click and wait”.
10.5 Save Prompt Histories
If you’re building a professional tool:
- Store each object’s prompt history (points, boxes, etc.)
- Allow users to reload and re-edit objects without starting from scratch

This is also helpful if you upgrade to a new SAM version: prompt histories can be re-run with improved models.
11. Building Products Around SAM 3 Visual Prompts
Some concrete product ideas where SAM 3 visual prompts are central:
11.1 Video Editing Panel
- Timeline view with frames
- Click on an actor → create a tracked mask with SAM 3
- Offer a “Refine mask” button with additional prompt clicks
11.2 Labeling/Annotation Tool
- Display images from training datasets
- Auto-suggest masks via concept prompts
- Annotators fix them via visual prompts
- Export cleaned masks + polygons
11.3 AR/Camera App
- Live camera feed
- User taps an object → SAM 3 segments it
- Apply filters, color changes, or overlays only on that object
Visual prompts here are just taps on the screen.
12. Visual Prompts as Part of the Bigger SAM 3 Story
SAM 3 is often described in terms of:
- Text/image exemplars
- Tracking and the SA-Co benchmark
Visual prompts are the bridge between that high-level vision and the proven, practical workflows from earlier SAM generations. They let humans:
- Intervene when text alone is not enough
- Correct and refine masks
- Align the model’s understanding with their intent
In other words, SAM 3 visual prompts are how you “steer” the model interactively, one click, box, or scribble at a time.
13. Summary: Key Takeaways About SAM 3 Visual Prompts
- Visual prompts are still first-class citizens in SAM 3: points, boxes, masks, and regions remain essential.
- They work alongside text and exemplar prompts, not instead of them.
- Use visual prompts when you need:
  - Per-instance control
  - Fine-grained edits
  - Manual correction in labeling and VFX workflows
- Use hybrid prompts (visual + text) for the best of both worlds in cluttered scenes.
- In video, visual prompts seed the tracker, and additional prompts correct drift.
- Implementations should focus on:
  - Good coordinate handling
  - Fast, interactive response times
  - Storing prompt histories for re-editing
- If you’re building tools on top of SAM 3, leaning into visual prompts gives your users a feeling of direct, tactile control instead of just typing prompts and hoping for the best.