Meta SAM 2: Segment Anything in Images and Video

Meta SAM 2 lets you click once and then keeps segmenting and tracking your object across every frame, in both images and video.

Meta SAM 2 is the second generation of Meta’s Segment Anything family and the first one truly designed for both images and videos.
Where SAM 1 focused on “click and segment” in still images, SAM 2 adds tracking and streaming, so you can follow objects across frames with simple prompts like clicks and boxes. 


1. What Is Meta SAM 2?

Meta SAM 2 is a promptable visual segmentation model for:

  • Images – like SAM 1

  • Videos – short clips and long streams

You give it:

  • A visual prompt – clicks, a box, or a mask on the object you care about

And SAM 2 returns:

  • Pixel-accurate masks of the objects you asked for

  • Tracks over time when you’re working with video

Meta describes SAM 2 as a “foundation model for image and video segmentation, built to support interactive editing and real-time style applications.”


2. Key Features of Meta SAM 2

2.1 Unified image + video segmentation

Unlike SAM 1 (images only), SAM 2:

  • Works on single images and full videos

  • Uses one architecture for both, so you don’t need separate models

  • Keeps segmentation quality similar to SAM 1 on images, while adding time awareness for video


2.2 Streaming memory for long videos

A core innovation in SAM 2 is its streaming memory system:

  • Frames are processed incrementally (streaming), not all at once.

  • The model maintains memory tokens that carry information from previous frames.

  • This lets SAM 2 segment and track objects over longer videos with reasonable GPU memory.

This is what makes SAM 2 suitable for:

  • Real-time-ish applications (video editing previews, camera feeds)

  • Long-form content, not just short clips
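
The constant-memory idea behind streaming can be sketched as a bounded memory bank: each new frame is encoded, conditioned on the stored entries, and the oldest entry is evicted once the bank is full. This is a toy illustration only; the real SAM 2 uses learned memory attention, and `encode` and the bank size here are made-up stand-ins.

```python
from collections import deque

class StreamingMemory:
    """Toy sketch of SAM 2-style streaming memory: a bounded FIFO of
    per-frame features, so memory use stays constant over long videos."""

    def __init__(self, max_frames=3):
        self.bank = deque(maxlen=max_frames)  # oldest entry evicted automatically

    def encode(self, frame):
        # Stand-in for the real image encoder: just summarize the frame.
        return sum(frame) / len(frame)

    def step(self, frame):
        features = self.encode(frame)
        # A real model would cross-attend `features` against self.bank here.
        context = list(self.bank)
        self.bank.append(features)
        return features, context

memory = StreamingMemory(max_frames=3)
for frame in [[1, 2], [3, 4], [5, 6], [7, 8]]:
    memory.step(frame)

# After 4 frames with a 3-slot bank, only the 3 most recent summaries remain.
print(list(memory.bank))  # → [3.5, 5.5, 7.5]
```

The point of the eviction policy is that cost per frame is fixed, which is why long clips and live feeds stay feasible.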


2.3 Visual prompts: points, boxes, and masks

Like the original SAM, Meta SAM 2 is promptable but still visual-only (no text prompts).

You can guide the model with:

  • Points

    • Positive clicks on the object you want

    • Negative clicks on background or unwanted regions

  • Bounding boxes

    • Rough rectangle around the target object

  • Masks

    • An existing rough mask to refine or carry forward

These prompts are applied on a single frame, and SAM 2 will:

  • Segment that object precisely

  • Track it across future frames using its memory and temporal features
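
In SAM-family code, these prompts are conventionally passed as coordinate arrays with a parallel label array (1 = positive click, 0 = negative click), and a box as `[x1, y1, x2, y2]` in pixels. The sketch below builds that data in plain Python without loading any model; the exact shapes and naming are an assumption based on the SAM family's conventions, not a verified API call.

```python
# Visual prompts in SAM-style form (plain Python lists here; the real repo
# uses NumPy arrays). Points: (N, 2) pixel coordinates plus parallel labels,
# where 1 marks a positive click and 0 a negative (background) click.
point_coords = [[320, 180],   # positive click on the object
                [50, 40]]     # negative click on background
point_labels = [1, 0]

# Box prompt: one rectangle as [x1, y1, x2, y2] in pixel coordinates.
box = [250, 120, 400, 260]

width, height = box[2] - box[0], box[3] - box[1]
print(f"box {width}x{height}, {point_labels.count(1)} positive click(s)")
```

A negative click is how you tell the model "not this part": for example, one positive click on a person plus one negative click on the bag they are carrying.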


2.4 Object tracking across frames

SAM 2 is built to segment and track:

  • Once you prompt an object in one frame, it gets an ID.

  • As the video progresses, SAM 2:

    • Predicts that object’s location

    • Updates its mask

    • Keeps the same ID through motion, partial occlusions, and viewpoint changes

This turns simple clicks into full object tracks you can use for:

  • Video editing (mask layers over time)

  • Sports & analytics overlays

  • Training data for tracking models
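
The "prompt once, track forever" loop above reduces to: assign an ID at the prompted frame, then ask the model for an updated mask under that ID at every later frame. In this toy sketch, `predict_mask` is a hypothetical stand-in for SAM 2's memory-conditioned propagation; it just shifts the mask to simulate motion.

```python
# Toy sketch of object tracking with stable IDs. `predict_mask` stands in
# for SAM 2's per-frame mask prediction; here it shifts pixels right by 1.

def predict_mask(prev_mask, frame_idx):
    return {(x + 1, y) for (x, y) in prev_mask}

tracks = {}      # object_id -> latest mask (set of (x, y) pixels)
next_id = 0

# Frame 0: the user clicks an object; it gets ID 0 and an initial mask.
tracks[next_id] = {(10, 5), (11, 5)}
next_id += 1

# Frames 1..3: propagate every tracked object, keeping its ID stable.
for frame_idx in range(1, 4):
    for obj_id, mask in tracks.items():
        tracks[obj_id] = predict_mask(mask, frame_idx)

print(tracks[0])  # same ID, mask shifted 3 px right: {(13, 5), (14, 5)}
```

The stable ID is what makes downstream uses (editing layers, analytics, training labels) possible: every per-frame mask is attributable to the same clicked object.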


2.5 SA-V dataset – large-scale video segmentation

To train SAM 2, Meta introduced SA-V, a huge dataset for video segmentation built with a model-in-the-loop engine:

  • Starts with a model like SAM

  • Humans correct and refine segmentations

  • Loop repeats to create high-quality masks at scale

SA-V gives SAM 2:

  • Better understanding of motion and temporal consistency

  • Strong performance on multiple public video segmentation benchmarks


3. Meta SAM 2 vs SAM 1 vs SAM 3

To position SAM 2 within the whole family:

3.1 Compared to SAM 1

  • SAM 1 – images only, visual prompts, no tracking

  • SAM 2 – images + videos, visual prompts, tracking + streaming memory

If you’re only doing still images, SAM 1 may be enough.
If you work with video, SAM 2 is the natural upgrade.

3.2 Compared to SAM 3

  • SAM 2

    • Prompts: visual only (points/boxes/masks)

    • Task: interactive segmentation + tracking

  • SAM 3

    • Prompts: text and exemplars, plus points/boxes

    • Task: concept-level segmentation + tracking (“find all X”)

So SAM 2 is best if you:

  • Don’t need text prompts

  • Want a solid, efficient model for visual-only segmentation and tracking

SAM 3 is better if you:

  • Want to say things like “all red cars” or “all players”

  • Need concept-level segmentation plus video tracking


4. Real-World Use Cases of Meta SAM 2

4.1 Video editing & post-production

  • Track a subject once (person, car, product) with a few clicks.

  • Automatically generate masks across frames.

  • Apply color grading, blur, or VFX only to the segmented subject or background.

Great for: YouTube creators, short-form video editors, and video tools that want AI masking.


4.2 Sports and broadcast

  • Click players or the ball in a key frame.

  • Use SAM 2 to propagate masks through the clip.

  • Build heat maps, highlights, and tactical diagrams using tracked positions.

SAM 2’s temporal consistency is particularly useful for sports analytics and overlays.


4.3 Surveillance and traffic analysis

  • Segment and track vehicles, people, bikes in CCTV feeds.

  • Count objects, measure dwell time, or detect unusual activity based on motion.

Because it uses visual prompts, SAM 2 fits well in semi-automatic pipelines where operators choose what to track.


4.4 Data labeling for video datasets

  • Label a few frames with clicks/boxes/masks.

  • SAM 2 extends those masks across many frames.

  • You get high-quality tracks and segmentation masks as training data for other models.

This can dramatically reduce manual labeling time for segmentation & tracking datasets.
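
The labeling economics can be sketched as: hand-label a handful of keyframes, let the model fill in the rest. Below, a stand-in `propagate` simply carries each keyframe's mask forward until the next keyframe; the real SAM 2 would predict a fresh, motion-aware mask per frame, but the workflow shape is the same.

```python
# Toy sketch of keyframe labeling + propagation: annotate a few frames by
# hand, auto-fill the rest. Carry-forward stands in for real propagation.

def propagate(keyframe_masks, num_frames):
    """Fill every frame with the most recent hand-labeled mask."""
    labels, current = {}, None
    for f in range(num_frames):
        if f in keyframe_masks:      # human-labeled frame
            current = keyframe_masks[f]
        labels[f] = current          # auto-filled otherwise
    return labels

# Hand-label 2 of 100 frames; the other 98 are filled automatically.
keyframes = {0: "mask_a", 50: "mask_b"}
dense = propagate(keyframes, num_frames=100)

manual = len(keyframes)
print(f"hand-labeled {manual} frames, auto-labeled {100 - manual}")
```

Even with periodic human corrections at drift points, the ratio of clicked frames to labeled frames is what drives the time savings.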


5. How to Try Meta SAM 2

5.1 Official SAM 2 repository

Meta released SAM 2 at facebookresearch/sam2 on GitHub, with:

  • Pretrained weights

  • Example notebooks

  • Scripts for image and video segmentation

5.2 Segment Anything Playground

Meta’s Segment Anything Playground lets you try SAM 2 in the browser:

  • Upload images or short videos

  • Click on objects

  • Watch masks and tracks update interactively

5.3 Third-party integrations

Libraries and platforms (e.g., Ultralytics, Roboflow, and others) have added SAM 2 support, letting you:

  • Run it from simple Python APIs

  • Use it in web UIs for annotation and editing


6. Strengths and Limitations of Meta SAM 2

Strengths

  • Handles both images and videos with one model.

  • Adds streaming memory so you can work on longer clips.

  • Great for interactive video editing and semi-automatic labeling.

  • Builds directly on the success of SAM 1’s segmentation quality.

Limitations

  • No text prompts – you must use clicks, boxes, or masks.

  • Concept-level “find all X in the scene” is better covered by SAM 3.

  • Real-time performance still depends heavily on GPU power and resolution; on low-end hardware, it’s mostly offline/batch.

 

Meta SAM 2 vs SAM 1 vs SAM 3 – Which Model Should You Use?

  • SAM 1 – Main focus: general image segmentation. Media: images only. Prompts: points, boxes, masks. Big use case: click-based cutouts in single images.

  • SAM 2 – Main focus: segmentation + tracking with streaming memory. Media: images + videos. Prompts: points, boxes, masks. Big use case: tracking prompted objects across video frames.

  • SAM 3 – Main focus: concept-level segmentation + tracking. Media: images + videos. Prompts: text, exemplars, points, boxes. Big use case: “all instances of concept X” in images and video.

2. Meta SAM 2 vs SAM 1

2.1 Core difference

  • SAM 1:

    • Built for images only.

    • Great at interactive segmentation with clicks/boxes.

    • No notion of time or tracking.

  • SAM 2:

    • Built for images and videos.

    • Keeps SAM 1–style segmentation quality.

    • Adds streaming memory and object tracking over frames.

👉 If you move from static photos to video clips, SAM 2 is the natural upgrade from SAM 1.

2.2 Prompts

Both use visual prompts only:

  • Points (positive/negative)

  • Bounding boxes

  • Masks

Neither SAM 1 nor SAM 2 understands text prompts like “all red cars.” You still interact by clicking or drawing.

2.3 Use cases comparison

  • Choose SAM 1 when:

    • You only work with photos.

    • You need fast, lightweight segmentation (e.g., background removal, quick annotation).

    • You don’t care about tracking or video.

  • Choose SAM 2 when:

    • You work with video: editing, sports, surveillance, etc.

    • You want your clicks to propagate as masks across frames.

    • You want one model to handle both image and video in your app.


3. Meta SAM 2 vs SAM 3

Now compare SAM 2 to the newer, “smarter” SAM 3.

3.1 Prompting and concepts

  • SAM 2:

    • Visual prompts only (points, boxes, masks).

    • Great if a human is in the loop clicking things.

    • Doesn’t understand language concepts—only what you show.

  • SAM 3:

    • Adds text and exemplar prompts on top of visual ones.

    • Understands concepts, so you can ask for “all red cars” or “all players.”

👉 SAM 3 lets you describe what you want; SAM 2 needs you to click what you want.

3.2 Tasks and intelligence level

  • SAM 2:

    • Best for: “Track this object I clicked” across a clip.

    • More like a super-powered, promptable roto/track tool.

  • SAM 3:

    • Best for: “Find every X and follow them” (all cars, all players, all boxes).

    • Combines detection + segmentation + tracking + language in one model.

If you’re building smart analytics (sports analysis, traffic stats, dataset labeling by category), SAM 3 is usually the better choice.
If you’re building a video editing tool where users click specific things, SAM 2 might be simpler and cheaper.


4. Performance & Complexity

4.1 Model complexity

  • SAM 1 – simplest; image-only; easiest to run.

  • SAM 2 – more complex than SAM 1 because of:

    • Streaming memory

    • Video handling

  • SAM 3 – most complex; adds:

    • Text encoder

    • Exemplar handling

    • Concept heads for “presence” and multi-instance detection.

In practice:

  • SAM 1 → lightest, good for many real-time image tools.

  • SAM 2 → medium-heavy, tuned for video workloads.

  • SAM 3 → heaviest, best reserved for servers or strong GPUs.

4.2 Typical usage pattern

  • SAM 1 → image cutout tools, annotation UIs.

  • SAM 2 → video editing plugins, annotation of clips, semi-automatic tracking.

  • SAM 3 → high-level analytics (find all objects of type X), dataset mining, open-vocabulary segmentation, and 2D→3D workflows with SAM 3D.


5. Which SAM Should You Choose?

In short:

  • Pick SAM 1 if you:

    • Only segment photos

    • Want the simplest and most lightweight option

  • Pick SAM 2 if you:

    • Need video + tracking

    • Are happy with visual prompts (clicks/boxes)

    • Are building editing or semi-automatic labeling tools

  • Pick SAM 3 if you:

    • Want text and exemplar prompts

    • Need to find all instances of a concept in images/videos

    • Plan to connect segmentation to 3D (via SAM 3D) or advanced analytics
