I indexed 669 GB of my GoPro videos using my M1 Max computer and local ML models
Organizing 669 GB of GoPro Footage via Local ML on an M1 Max
For years, my GoPro footage sat in a digital graveyard. I had accumulated 669 GB of raw video, but finding a specific moment—like a particular sunset or a specific trail ride—was a nightmare. I tried manual tagging, but that was an impossible task given the volume of data.
I wanted a way to search my videos using natural language, but I refused to upload my private memories to a cloud provider. The solution? Running machine learning models locally on the GPU of my M1 Max MacBook Pro.
The Objective
The goal was to create a searchable index of every single video clip. Instead of relying on filenames like GH010234.mp4, I wanted to type "mountain biking in the rain" and have the system point me to the exact second the action occurs.
Hardware Specifications
To pull this off, I utilized the following setup:
| Component | Specification | Role |
|---|---|---|
| CPU/GPU | Apple M1 Max | Heavy lifting & Tensor acceleration |
| RAM | 64GB Unified Memory | Loading models and frame buffers |
| Storage | NVMe SSD | Fast I/O for 669 GB of source files |
| OS | macOS | Host environment |
The Technical Strategy: CLIP
The core of this project is CLIP (Contrastive Language-Image Pre-training), a model developed by OpenAI. Unlike traditional image classifiers that recognize a fixed set of labels (e.g., "dog," "car"), CLIP learns visual concepts from natural language descriptions.
Key Insight: CLIP maps both images and text into the same embedding space. If an image of a beach and the word "beach" are close to each other in this multi-dimensional space, the model considers them a match.
The Workflow Pipeline
The process follows a linear path from raw binary video to a searchable database.
Implementation Details
1. Data Extraction
I couldn't process every single frame (that would be overkill and computationally expensive). Instead, I sampled frames at a specific interval.
- Scan the directory for `.mp4` files.
- Extract one frame at a fixed interval of N seconds.
- Resize frames to 224×224 pixels (the input size CLIP's ViT-B/32 model expects).
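The three steps above can be sketched as follows. This is a minimal illustration, not the exact tooling used for this project: it assumes `ffmpeg` is installed, and the helper names `find_videos` and `ffmpeg_sample_cmd`, the output naming pattern, and the crop-to-square strategy are all hypothetical choices. The sampling interval is left as a parameter.

```python
from pathlib import Path


def find_videos(root):
    """Return every .mp4 file under root (case-insensitive), in a stable order."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() == ".mp4"
    )


def ffmpeg_sample_cmd(video, out_dir, interval_s, size=224):
    """Build an ffmpeg command that writes one square JPEG every interval_s seconds.

    Assumption: frames are scaled so the short side equals `size`, then
    center-cropped to size x size, to avoid distorting the aspect ratio.
    """
    video = Path(video)
    filters = (
        f"fps=1/{interval_s},"
        f"scale={size}:{size}:force_original_aspect_ratio=increase,"
        f"crop={size}:{size}"
    )
    pattern = Path(out_dir) / f"{video.stem}_%06d.jpg"
    return ["ffmpeg", "-i", str(video), "-vf", filters, str(pattern)]
```

Each command can then be run with `subprocess.run(cmd, check=True)`. Building the command as a list rather than a shell string keeps filenames with spaces safe.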
2. Generating Embeddings
Using PyTorch and the clip library, I passed each frame through the image encoder. This converted a visual image into a vector (a long list of numbers).
```python
import clip
import torch

# Load the model and its matching preprocessing transform
device = "mps"  # Use the Apple Silicon GPU via Metal
model, preprocess = clip.load("ViT-B/32", device=device)

# Process a frame (`frame` is a PIL image of one sampled video frame)
image = preprocess(frame).unsqueeze(0).to(device)
with torch.no_grad():
    image_features = model.encode_image(image)
```
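Each embedding then needs to be stored alongside the video name and timestamp it came from. The post does not say which database was used, so the following is only a sketch under my own assumptions: a single SQLite table, with each vector packed as 32-bit floats. The table layout and the function names `open_index`, `add_frame`, and `load_frames` are hypothetical. A tensor could be converted to a plain list first, for example with `image_features[0].float().cpu().tolist()`.

```python
import sqlite3
from array import array


def open_index(path=":memory:"):
    """Open (or create) the index database with one row per sampled frame."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS frames ("
        "video TEXT NOT NULL, t_seconds REAL NOT NULL, embedding BLOB NOT NULL)"
    )
    return db


def add_frame(db, video, t_seconds, vector):
    """Store one embedding as a packed float32 blob."""
    blob = array("f", vector).tobytes()
    db.execute("INSERT INTO frames VALUES (?, ?, ?)", (video, t_seconds, blob))


def load_frames(db):
    """Return all rows as (video, t_seconds, vector) tuples."""
    out = []
    for video, t_seconds, blob in db.execute(
        "SELECT video, t_seconds, embedding FROM frames"
    ):
        vec = array("f")
        vec.frombytes(blob)
        out.append((video, t_seconds, vec.tolist()))
    return out
```

Keeping the timestamp next to each vector is what lets a search result point to a specific second of a clip rather than just a file.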
3. The Search Math
To find the most relevant video clip, the system calculates the cosine similarity between the vector of the search query ($A$) and the vector of the stored frame ($B$).
The formula used is:
$$\text{similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}$$
Where a value closer to $1$ indicates a near-perfect match.
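To make the search step concrete, here is a small pure-Python sketch of cosine similarity and a brute-force top-k lookup. The function names and the `(video, t_seconds, vector)` tuple layout are hypothetical. A real index would likely use NumPy or PyTorch for speed, but the arithmetic is the same.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def top_matches(query_vec, index, k=3):
    """Rank stored frames by similarity to the query, best first.

    `index` is an iterable of (video, t_seconds, vector) tuples.
    """
    scored = [
        (cosine_similarity(query_vec, vec), video, t_seconds)
        for video, t_seconds, vec in index
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]
```

If every vector is normalized to unit length when it is indexed, the denominator is always 1 and the search reduces to a plain dot product.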
Results and Performance
The result is a local system where I can query my footage instantly.

Performance Observations
- Processing Speed: The M1 Max's `mps` (Metal Performance Shaders) backend significantly accelerated the inference.
- Storage: The resulting vector database is surprisingly small compared to the 669 GB of source video.
- Accuracy: It handles general concepts (e.g., "forest") perfectly, though it occasionally struggles with very specific niche objects.
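The storage point is easy to sanity-check with arithmetic. ViT-B/32 produces a 512-dimensional embedding per image, so each sampled frame costs about 2 KB when stored as 32-bit floats. The sketch below is only an estimate: the helper name is hypothetical, and the footage duration and sampling interval in the example are made-up numbers, not figures from this project.

```python
def index_size_bytes(n_frames, dim=512, bytes_per_value=4):
    """Raw size of the embeddings alone, ignoring database overhead."""
    return n_frames * dim * bytes_per_value


# Hypothetical example: 50 hours of footage, one frame every 5 seconds.
hypothetical_frames = (50 * 3600) // 5
estimate_mb = index_size_bytes(hypothetical_frames) / 1_000_000
```

Under those assumed numbers the embeddings come to roughly 74 MB, several orders of magnitude smaller than the source video.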
Final Thoughts
By moving the intelligence to the edge (my own laptop), I achieved three things:
- Privacy: No data ever left my machine.
- Cost: Zero monthly subscription fees for cloud AI.
- Utility: I can now actually use my archives instead of letting them rot on a hard drive.
It turns out that the "Max" in M1 Max is exactly what's needed for local ML indexing.