I indexed 669 GB of my GoPro videos using my M1 Max computer and local ML models

321 points|79 comments|by iliashad|Jun 14, 2026

Organizing 669 GB of GoPro Footage via Local ML on an M1 Max

For years, my GoPro footage sat in a digital graveyard. I had accumulated 669 GB of raw video, but finding a specific moment—like a particular sunset or a specific trail ride—was a nightmare. I tried manual tagging, but that was an impossible task given the volume of data.

I wanted a way to search my videos using natural language, but I refused to upload my private memories to a cloud provider. The solution: run local machine learning models on the GPU of my M1 Max MacBook Pro.

The Objective

The goal was to create a searchable index of every single video clip. Instead of relying on filenames like GH010234.mp4, I wanted to type "mountain biking in the rain" and have the system point me to the exact second the action occurs.

Hardware Specifications

To pull this off, I utilized the following setup:

| Component | Specification | Role |
|---|---|---|
| CPU/GPU | Apple M1 Max | Heavy lifting & tensor acceleration |
| RAM | 64 GB Unified Memory | Loading models and frame buffers |
| Storage | NVMe SSD | Fast I/O for 669 GB of source files |
| OS | macOS | Host environment |

The Technical Strategy: CLIP

The core of this project is CLIP (Contrastive Language-Image Pre-training), a model developed by OpenAI. Unlike traditional image classifiers that recognize a fixed set of labels (e.g., "dog," "car"), CLIP learns visual concepts from natural language descriptions.

Key Insight: CLIP maps both images and text into the same embedding space. If an image of a beach and the word "beach" are close to each other in this multi-dimensional space, the model considers them a match.

The Workflow Pipeline

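The process follows a linear path from raw video to a searchable database: find the video files, sample frames, embed each frame with CLIP, store the vectors, then embed the text query and rank the stored frames by similarity.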


Implementation Details

1. Data Extraction

I couldn't process every single frame (that would be overkill and computationally expensive). Instead, I sampled frames at a specific interval.

  • Scan directory for .mp4 files.
  • Extract one frame every $N$ seconds.
  • Resize frames to $224 \times 224$ pixels (CLIP's required input).
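
The three steps above can be sketched with the standard library plus the `ffmpeg` command-line tool. This is a minimal sketch, not necessarily the tooling used for this project: the helper names `find_videos` and `ffmpeg_sample_command`, the 5-second default interval, and the output file pattern are all assumptions chosen for illustration. The function only builds the command, so nothing is executed until you choose to run it.

```python
from pathlib import Path


def find_videos(root):
    """Return every .mp4 file under root, matching the extension case-insensitively."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() == ".mp4"
    )


def ffmpeg_sample_command(video, out_dir, interval_s=5, size=224):
    """Build (but do not run) an ffmpeg command that writes one resized
    frame every interval_s seconds. interval_s=5 is an assumed default."""
    # Hypothetical naming scheme: <video stem>_<frame counter>.jpg
    pattern = Path(out_dir) / f"{Path(video).stem}_%06d.jpg"
    return [
        "ffmpeg", "-i", str(video),
        # fps=1/N keeps one frame per N seconds; scale forces size x size pixels
        "-vf", f"fps=1/{interval_s},scale={size}:{size}",
        str(pattern),
    ]
```

To actually extract frames, each command could be passed to `subprocess.run(cmd, check=True)` after creating the output directory. Note that `scale=224:224` stretches the frame rather than cropping it, and that the `preprocess` transform returned by `clip.load` resizes and center-crops on its own, so resizing at extraction time mainly saves disk space.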

2. Generating Embeddings

Using PyTorch and the clip library, I passed each sampled frame through the image encoder. This converts the frame into a vector (a list of 512 numbers for the ViT-B/32 model used here).

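import clip
import torch
from PIL import Image

# Use the Apple Silicon GPU (Metal) when available, otherwise fall back to the CPU
device = "mps" if torch.backends.mps.is_available() else "cpu"
# Load the model together with its matching preprocessing transform
model, preprocess = clip.load("ViT-B/32", device=device)

# One sampled frame as a PIL image ("frame.jpg" is a placeholder path)
frame = Image.open("frame.jpg")

# Preprocess the frame, add a batch dimension, and encode it
image = preprocess(frame).unsqueeze(0).to(device)
with torch.no_grad():
    image_features = model.encode_image(image)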
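
The post does not say how the embeddings were stored, so the following is only one possible layout: a minimal sketch that keeps each vector as a float32 blob in SQLite next to the video name and timestamp it came from. The table name, column names, and the helpers `open_index`, `add_frame`, and `load_frames` are all hypothetical.

```python
import sqlite3
from array import array


def open_index(path=":memory:"):
    """Open (or create) the index. The schema is an assumption for illustration."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS frames ("
        "video TEXT NOT NULL, t_seconds REAL NOT NULL, embedding BLOB NOT NULL)"
    )
    return conn


def add_frame(conn, video, t_seconds, embedding):
    """Store one embedding (an iterable of floats) as packed float32 bytes."""
    blob = array("f", embedding).tobytes()
    conn.execute("INSERT INTO frames VALUES (?, ?, ?)", (video, t_seconds, blob))


def load_frames(conn):
    """Return a list of (video, t_seconds, embedding_as_list) tuples."""
    rows = conn.execute("SELECT video, t_seconds, embedding FROM frames").fetchall()
    result = []
    for video, t_seconds, blob in rows:
        vec = array("f")
        vec.frombytes(blob)  # unpack the float32 bytes back into numbers
        result.append((video, t_seconds, list(vec)))
    return result
```

With the tensor from the snippet above, the call might look like `add_frame(conn, "GH010234.mp4", 12.0, image_features[0].float().cpu().tolist())`, followed by `conn.commit()` when writing to a file on disk. Storing the timestamp alongside the vector is what lets a search hit point back to a specific second of a clip.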

3. The Search Math

To find the most relevant video clip, the system calculates the Cosine Similarity between the vector of the search query ($\mathbf{q}$) and the vector of the stored frame ($\mathbf{f}$).

The formula used is:

$$\text{similarity} = \cos(\theta) = \frac{\mathbf{q} \cdot \mathbf{f}}{\|\mathbf{q}\| \, \|\mathbf{f}\|}$$

Where a value closer to $1.0$ indicates a near-perfect match.
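
To make the formula concrete, here is a small pure-Python sketch that scores stored frames against a query vector and returns the best matches. The toy two-dimensional vectors and the helper names `cosine_similarity` and `top_k` are illustrative assumptions. In the real pipeline the query vector would come from the text encoder, for example `model.encode_text(clip.tokenize(["mountain biking in the rain"]).to(device))`, and the frame vectors from the image encoder.

```python
import math


def cosine_similarity(q, f):
    """cos(theta) = (q . f) / (|q| * |f|); returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(q, f))
    norm_q = math.sqrt(sum(a * a for a in q))
    norm_f = math.sqrt(sum(b * b for b in f))
    if norm_q == 0.0 or norm_f == 0.0:
        return 0.0
    return dot / (norm_q * norm_f)


def top_k(query, frames, k=3):
    """frames is a list of (label, vector) pairs; return the k best (score, label) pairs."""
    scored = [(cosine_similarity(query, vec), label) for label, vec in frames]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Toy example with made-up 2-D vectors standing in for real embeddings
frames = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
best = top_k([1.0, 0.1], frames, k=2)  # "a" ranks first, then "c"
```

Because only the ranking matters for search, the absolute score is less important than which frames come out on top. For a large index, the same computation is usually done as a single matrix product over pre-normalized vectors rather than a Python loop.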


Results and Performance

The result is a local system where I can query my footage instantly.

[Figure: Conceptual UI of video search. Search query "Surfing at Sunset" → video clip found.]

Performance Observations

  • Processing Speed: The M1 Max's mps (Metal Performance Shaders) backend significantly accelerated the inference.
  • Storage: The resulting vector database is surprisingly small compared to the 669 GB of source video.
  • Accuracy: It handles general concepts (e.g., "forest") well, though it occasionally struggles with very specific or niche objects.
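
The storage point can be checked with back-of-the-envelope arithmetic. ViT-B/32 produces 512-dimensional embeddings, so one frame stored as float32 takes 2,048 bytes. The post gives neither the total footage duration nor the sampling interval, so the 50 hours and 5 seconds below are purely hypothetical inputs, and `index_size_bytes` is an illustrative helper.

```python
def index_size_bytes(total_video_seconds, interval_s, dim=512, bytes_per_value=4):
    """Estimate raw embedding storage: frames sampled x values per frame x bytes per value."""
    n_frames = int(total_video_seconds // interval_s)
    return n_frames * dim * bytes_per_value


# Hypothetical inputs: 50 hours of footage, one frame every 5 seconds
size = index_size_bytes(50 * 3600, 5)
print(f"{size / 1e6:.1f} MB")  # prints "73.7 MB"
```

Under those assumptions the raw vectors occupy tens of megabytes, several thousand times smaller than 669 GB of video. Metadata and database overhead add to this, and storing float16 values instead would roughly halve it.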

Final Thoughts

By moving the intelligence to the edge (my own laptop), I achieved three things:

  1. Privacy: No data ever left my machine.
  2. Cost: Zero monthly subscription fees for cloud AI.
  3. Utility: I can now actually use my archives instead of letting them rot on a hard drive.

It turns out that the "Max" in M1 Max is exactly what's needed for local ML indexing.