Segment Anything Model 3 (SAM 3)
Stop drawing boxes and clicking frame by frame. Segment Anything Model 3 (SAM 3) lets you describe what you want, like "red cars" or "person holding a phone," and it segments every matching object and tracks it across video. In this guide, you'll see what's new in SAM 3, how concept prompts work, and how to use it for fast masks, labeling, and real-world workflows.
Segment Anything Model 3 (SAM 3): What It Is, What’s New, and How to Use It
Meta AI continues to push the boundaries of computer vision with the release of Segment Anything Model 3 (SAM 3), the latest in its line of powerful segmentation foundation models. Unlike its predecessors, SAM 3 isn’t just about “click-to-segment” anymore: it’s a multi-modal, concept-aware model that can detect, segment, and track objects in both images and videos using concept prompts, such as short text phrases, image exemplars, or a combination of both.
Released on November 19, 2025, SAM 3 represents a significant leap in segmentation performance, open-vocabulary support, and real-world usability across domains like video editing, robotics, e-commerce, and privacy-focused workflows.
🔍 What Is SAM 3?
SAM 3 is Meta AI’s third-generation segmentation model, part of the Segment Anything project. It's built to support promptable segmentation, which means it responds to prompts like text descriptions or images to isolate and identify objects within a scene.
While SAM 1 and SAM 2 focused on interactive segmentation using points, boxes, or clicks, SAM 3 introduces Promptable Concept Segmentation (PCS): a powerful system for detecting all instances of a concept (like “all red cars”) across images or video sequences automatically.
🆕 What’s New in SAM 3?
✅ Promptable Concept Segmentation (PCS)
SAM 3’s flagship feature is PCS, which enables:
- Segmentation masks for all instances matching a concept.
- Unique IDs for each instance, crucial for video tracking.
- Support for concept prompts, including:
  - Text-only (e.g., “yellow school bus”)
  - Image-only (an exemplar image)
  - Hybrid (text + image)

This transforms how creators, developers, and researchers interact with segmentation tools. A rough sketch of what a text-prompted call could look like follows below.
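To make the idea concrete, here is a minimal sketch of a text-prompted PCS call. The class and method names (`Sam3ConceptPredictor`, `predict_concept`, the checkpoint id) are placeholders, not the official API; check the facebookresearch/sam3 repo or the Transformers docs for the real entry points.

```python
# Hypothetical sketch of a text-prompted PCS call.
# All SAM 3-specific names below are placeholders, not the official API.
from PIL import Image

# from sam3 import Sam3ConceptPredictor  # placeholder import path

def run_pcs(predictor, image_path: str, prompt: str):
    """Run Promptable Concept Segmentation with a short noun-phrase prompt."""
    image = Image.open(image_path).convert("RGB")
    # Expected output: one mask + unique ID + score per matching instance.
    results = predictor.predict_concept(image, text=prompt)
    for inst in results:
        print(inst["id"], inst["score"], inst["mask"].shape)
    return results

# Usage, once a real predictor object exists:
# predictor = Sam3ConceptPredictor.from_pretrained("facebook/sam3")  # placeholder name
# run_pcs(predictor, "street.jpg", "red cars")
```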
✅ Unified Framework for Detection + Segmentation + Tracking
SAM 3 blends three tasks into a single model:
| Task | Purpose |
|---|---|
| Detection | Identify object locations |
| Segmentation | Generate pixel-level object masks |
| Tracking | Maintain object identity across frames |
No need for three separate models; SAM 3 does it all.
✅ Large-Scale Training on 4M Concept Labels
SAM 3 was trained with a massive 4-million-label dataset, including:
- Hard negatives (to teach the model what not to segment)
- Mixed image and video sources
- A scalable data engine for rapid dataset expansion
This scale enables broad generalization, even on unfamiliar concepts.
⚙️ How SAM 3 Works: High-Level Architecture
Meta’s paper describes SAM 3 as a unified vision model combining:
- Image-level detector: for spatial recognition
- Memory-based video tracker: for object continuity
- Shared visual backbone: for feature extraction
It uses a “presence head” to decouple recognition and localization, improving detection accuracy across object types and scenes.
Even if you're not an ML engineer, the key takeaway is:
SAM 3 can handle images and videos in a single pipeline from prompt → segmentation masks → tracked instances.
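To picture what flows through that pipeline, here is a small, purely illustrative data model. The class and field names are our own, not taken from the SAM 3 codebase.

```python
# Illustrative data model for a prompt -> masks -> tracked instances pipeline.
# Names are our own invention, not the official SAM 3 types.
from dataclasses import dataclass, field
from typing import Optional
import numpy as np

@dataclass
class ConceptPrompt:
    text: Optional[str] = None             # e.g. "yellow school bus"
    exemplar: Optional[np.ndarray] = None  # HxWx3 exemplar image crop, if any

@dataclass
class TrackedInstance:
    instance_id: int      # stays stable across video frames
    frame_index: int
    mask: np.ndarray      # HxW boolean segmentation mask
    score: float          # detection / presence confidence

@dataclass
class PCSResult:
    prompt: ConceptPrompt
    instances: list[TrackedInstance] = field(default_factory=list)

    def instances_in_frame(self, frame_index: int) -> list[TrackedInstance]:
        return [i for i in self.instances if i.frame_index == frame_index]
```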
📊 Benchmarking: SA-Co
To evaluate SAM 3’s performance on concept prompts, Meta introduced the Segment Anything with Concepts (SA-Co) benchmark.
- Why it matters: SA-Co provides a shared metric and dataset for comparing models that support concept-based segmentation.
- Outcome: SAM 3 establishes new state-of-the-art results across open-vocabulary segmentation and tracking.
🆚 SAM 3 vs. SAM 2: Key Differences
| Feature | SAM 2 | SAM 3 |
|---|---|---|
| Prompt type | Points, boxes, masks | Points, boxes, masks + text + image exemplars |
| Concept-level segmentation | ❌ | ✅ |
| Tracking across frames | ✅ | ✅ (with unique IDs) |
| Real-world robustness | Moderate | High (more generalizable) |
| Data scale | Millions of masks | 4M labeled concepts across modalities |
🎥 Use Cases for SAM 3
SAM 3 is not just for research. Here’s how it’s being used:
1. Video Editing & VFX
- Generate high-quality segmentation masks for visual effects
- Track subjects across a timeline for compositing (see the matte-export sketch below)
- Eliminate the need for frame-by-frame manual rotoscoping
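A common hand-off to compositing tools is writing each frame's mask as an 8-bit matte PNG. A minimal sketch, assuming you already have per-frame boolean masks from the segmentation step:

```python
# Write per-frame boolean masks as 8-bit matte PNGs for a compositing timeline.
# Assumes `masks` is a list of HxW boolean arrays, one per frame.
import os
import numpy as np
from PIL import Image

def export_mattes(masks, out_dir="mattes"):
    os.makedirs(out_dir, exist_ok=True)
    for frame_idx, mask in enumerate(masks):
        matte = mask.astype(np.uint8) * 255  # 0 = background, 255 = subject
        Image.fromarray(matte, mode="L").save(
            os.path.join(out_dir, f"matte_{frame_idx:05d}.png")
        )
```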
2. Computer Vision Data Labeling
- Speed up annotation for training datasets
- Auto-generate masks to reduce human labor (see the COCO export sketch below)
- Improve dataset quality with less time and cost
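For labeling pipelines, a typical hand-off format is COCO run-length encoding. A minimal sketch using pycocotools, assuming the binary mask comes from the segmentation step:

```python
# Convert a binary mask into a COCO-style annotation entry.
# Requires pycocotools; the mask itself comes from your segmentation step.
import numpy as np
from pycocotools import mask as mask_utils

def mask_to_coco_annotation(binary_mask: np.ndarray) -> dict:
    """binary_mask: HxW array of 0/1 values."""
    rle = mask_utils.encode(np.asfortranarray(binary_mask.astype(np.uint8)))
    area = int(mask_utils.area(rle))
    bbox = [float(x) for x in mask_utils.toBbox(rle)]
    rle["counts"] = rle["counts"].decode("utf-8")  # make it JSON-serializable
    return {"segmentation": rle, "area": area, "bbox": bbox, "iscrowd": 0}
```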
3. E-commerce & Retail Imaging
- Create product cutouts with clean backgrounds (see the cutout sketch below)
- Apply consistent image masks across catalog images
- Automate photo editing workflows
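A minimal cutout sketch, assuming the mask is an HxW boolean array aligned with the product photo:

```python
# Turn a product photo plus its segmentation mask into a PNG cutout
# with a transparent background.
import numpy as np
from PIL import Image

def product_cutout(image_path: str, mask: np.ndarray, out_path: str):
    rgb = np.array(Image.open(image_path).convert("RGB"))
    alpha = mask.astype(np.uint8) * 255        # mask becomes the alpha channel
    rgba = np.dstack([rgb, alpha])
    Image.fromarray(rgba, mode="RGBA").save(out_path)
```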
4. Robotics & Autonomous Systems
- Detect and track multiple objects in real time
- Support for dynamic, real-world environments
- Use in mobile robots, drones, and AR/VR systems
5. Privacy-Preserving Redaction
- Mask faces, license plates, screens, and documents
- Protect privacy in surveillance footage
- Automate content redaction pipelines (see the blur sketch below)
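A simple redaction pass blurs everything inside the mask and leaves the rest of the frame untouched. A minimal sketch with OpenCV, assuming the mask is an HxW boolean array for one frame:

```python
# Blur the regions covered by a redaction mask (faces, plates, screens).
# Requires opencv-python; the mask comes from your segmentation step.
import cv2
import numpy as np

def redact(frame: np.ndarray, mask: np.ndarray) -> np.ndarray:
    blurred = cv2.GaussianBlur(frame, (51, 51), 0)   # heavy blur of the whole frame
    mask3 = np.repeat(mask[:, :, None], 3, axis=2)   # broadcast mask to 3 channels
    return np.where(mask3, blurred, frame)           # keep blur only inside the mask
```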
🔧 How to Use SAM 3: Practical Workflow
Here’s a recommended workflow for using SAM 3:
Step 1: Choose Your Prompt
- Text: Use short, descriptive noun phrases (“red bike”, “white chair”)
- Image: Upload an example if the object is hard to describe
- Hybrid: Combine both for stronger results
Step 2: Run PCS
Run the model to get:
- Segmentation masks
- Instance IDs (especially for video)
Step 3: For Video
- Start on a clean frame
- Let SAM 3 track objects across frames
- Post-process (a small sketch follows this list):
  - Filter small masks
  - Clean edges
  - Refine with domain-specific adjustments
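A minimal post-processing sketch, assuming each mask is an HxW boolean array; the area threshold and kernel size are illustrative and should be tuned per domain:

```python
# Drop tiny masks and smooth edges with a morphological open/close pass.
# Thresholds are illustrative; tune them for your data.
import cv2
import numpy as np

def postprocess(masks, min_area=500, kernel_size=5):
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    cleaned = []
    for mask in masks:                                    # each mask: HxW bool array
        if mask.sum() < min_area:                         # filter small masks
            continue
        m = mask.astype(np.uint8)
        m = cv2.morphologyEx(m, cv2.MORPH_OPEN, kernel)   # remove speckles
        m = cv2.morphologyEx(m, cv2.MORPH_CLOSE, kernel)  # fill small holes
        cleaned.append(m.astype(bool))
    return cleaned
```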
📥 How to Access SAM 3
You can find and use SAM 3 through these official sources:
- Meta Research Page: publication, code, and model links
- GitHub (facebookresearch/sam3): codebase, inference scripts, fine-tuning support
- Hugging Face Model Page: pretrained weights, transformers integration
- Transformers Docs: SAM3 + SAM3 Tracker examples
- Ultralytics Integration: workflow examples for YOLO/SAM pipelines (may require model access)
🔗 Access: You may need to request weights depending on platform restrictions.
❗ Limitations of SAM 3
While powerful, SAM 3 is not flawless:
- Ambiguous prompts (“tool”, “bag”, “thing”) can yield inconsistent results
- Rare or niche concepts may be underrepresented
- Fast motion or occlusion can confuse tracking
- Out-of-domain inputs (e.g., thermal images, medical scans) require fine-tuning
✅ Best Practices for Accurate Results
- Use short, specific prompts (e.g., “yellow jacket” rather than just “jacket”)
- If results are too broad, add qualifiers (color, size, material)
- If performance is poor, switch to image prompts
- Start video tracking on a high-quality frame (minimal blur, object centered)
📈 Why SAM 3 Matters
SAM 3 isn’t just an academic model; it’s a new interface for vision AI:
- Concept-based segmentation is intuitive and fast
- It makes AI accessible to creators, editors, developers, and researchers
- It paves the way for language-driven image understanding

By combining detection, segmentation, and tracking into a unified, promptable pipeline, SAM 3 pushes vision models closer to general-purpose, real-world usability.
🧠 Summary: SAM 3 in One Paragraph
Segment Anything Model 3 (SAM 3) by Meta AI is a cutting-edge foundation model for object segmentation in images and videos. With Promptable Concept Segmentation, SAM 3 lets users supply text descriptions or image examples to detect and track all relevant instances, whether in still images or video sequences. Built on a massive concept-level dataset and a unified architecture, SAM 3 offers strong, flexible performance across detection, segmentation, and tracking tasks, making it valuable for creators, developers, researchers, and engineers alike.