Organizing 669 GB of GoPro Footage via Local ML on an M1 Max

For years, my GoPro footage sat in a digital graveyard. I had accumulated 669 GB of raw video, but finding a specific moment, such as a particular sunset or trail ride, was a nightmare. I tried manual tagging, but that was an impossible task given the volume of data.

I wanted a way to search my videos using natural language, but I refused to upload my private memories to a cloud provider. The solution? Running machine learning models locally on the GPU of my M1 Max MacBook Pro.

The Objective

The goal was to create a searchable index of every single video clip. Instead of relying on filenames like GH010234.mp4, I wanted to type "mountain biking in the rain" and have the system point me to the exact second the action occurs.

Hardware Specifications

To pull this off, I utilized the following setup:

| Component | Specification | Role |
| --- | --- | --- |
| CPU/GPU | Apple M1 Max | Heavy lifting and tensor acceleration |
| RAM | 64 GB unified memory | Loading models and frame buffers |
| Storage | NVMe SSD | Fast I/O for 669 GB of source files |
| OS | macOS | Host environment |

The Technical Strategy: CLIP

The core of this project is CLIP (Contrastive Language-Image Pre-training), a model developed by OpenAI. Unlike traditional image classifiers that recognize a fixed set of labels (e.g., "dog," "car"), CLIP learns visual concepts from natural language descriptions.

Key Insight: CLIP maps both images and text into the same embedding space. If an image of a beach and the word "beach" are close to each other in this multi-dimensional space, the model considers them a match.

The Workflow Pipeline

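The process follows a linear path from raw video to a searchable database: scan the video files, sample frames, encode each frame with CLIP, store the resulting vectors, and compare them against the encoded search query.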

Implementation Details

1. Data Extraction

I couldn't process every single frame (that would be overkill and computationally expensive). Instead, I sampled frames at a specific interval.

Scan directory for .mp4 files.
Extract one frame every $N$ seconds.
Resize frames to $224 \times 224$ pixels (CLIP's required input).
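
The three steps above can be sketched as follows. This is a minimal sketch, not the exact extraction code used for this project: it assumes the `ffmpeg` command-line tool does the frame extraction, and the function names, the `_%06d.jpg` output pattern, and the 2-second interval in the usage note are all hypothetical. The code only builds the commands; running them would be a matter of passing each one to `subprocess.run`.

```python
from pathlib import Path


def find_videos(root: Path) -> list[Path]:
    """Return all .mp4 files under root (case-insensitive), sorted by path."""
    return sorted(p for p in root.rglob("*") if p.suffix.lower() == ".mp4")


def build_ffmpeg_command(video: Path, out_dir: Path, every_n_seconds: int,
                         size: int = 224) -> list[str]:
    """Build an ffmpeg command that saves one resized frame every N seconds."""
    if every_n_seconds <= 0:
        raise ValueError("every_n_seconds must be positive")
    # fps=1/N keeps one frame per N seconds; scale resizes it to size x size.
    filters = f"fps=1/{every_n_seconds},scale={size}:{size}"
    # Hypothetical naming scheme: <video name>_<frame number>.jpg
    pattern = out_dir / f"{video.stem}_%06d.jpg"
    return ["ffmpeg", "-i", str(video), "-vf", filters, str(pattern)]


def frame_timestamp(frame_number: int, every_n_seconds: int) -> int:
    """Approximate timestamp in seconds of the k-th extracted frame (1-based)."""
    return (frame_number - 1) * every_n_seconds
```

For example, `build_ffmpeg_command(Path("GH010234.mp4"), Path("frames"), 2)` yields a command that writes `frames/GH010234_000001.jpg`, `frames/GH010234_000002.jpg`, and so on. Note that `scale=224:224` stretches a widescreen frame into a square. The `preprocess` transform returned by `clip.load` resizes and center-crops instead, so the resize step here could also be left to it.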

2. Generating Embeddings

Using PyTorch and the clip library, I passed each frame through the image encoder, which converts the image into an embedding vector (a list of 512 numbers for the ViT-B/32 model).

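import clip
import torch
from PIL import Image

# Use the Apple Silicon GPU through the MPS backend, falling back to the CPU
device = "mps" if torch.backends.mps.is_available() else "cpu"
# Load the model and its matching preprocessing transform
model, preprocess = clip.load("ViT-B/32", device=device)

# Load one extracted frame as a PIL image ("frame.jpg" is a placeholder path)
frame = Image.open("frame.jpg")

# Preprocess the frame, add a batch dimension, and encode it
image = preprocess(frame).unsqueeze(0).to(device)
with torch.no_grad():
    image_features = model.encode_image(image)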
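
An embedding is only useful for search if it stays linked to the clip and timestamp it came from. Below is a minimal sketch of one way to keep that link, assuming a SQLite table that stores each vector as a blob of packed float32 values. The table layout and the helper names (`open_index`, `add_frame`, `load_frames`) are hypothetical, not the storage format actually used here. The helpers take plain Python lists of floats, which a tensor such as `image_features` could be converted to with `image_features[0].cpu().tolist()`.

```python
import sqlite3
from array import array


def open_index(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the index database with one row per sampled frame."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS frames ("
        "video TEXT NOT NULL, t_seconds REAL NOT NULL, embedding BLOB NOT NULL)"
    )
    return conn


def add_frame(conn: sqlite3.Connection, video: str, t_seconds: float,
              embedding: list[float]) -> None:
    """Store one frame embedding as packed float32 values."""
    blob = array("f", embedding).tobytes()
    conn.execute("INSERT INTO frames VALUES (?, ?, ?)", (video, t_seconds, blob))


def load_frames(conn: sqlite3.Connection) -> list[tuple[str, float, list[float]]]:
    """Read every row back, unpacking the blobs into lists of floats."""
    rows = []
    for video, t_seconds, blob in conn.execute(
        "SELECT video, t_seconds, embedding FROM frames ORDER BY video, t_seconds"
    ):
        vec = array("f")
        vec.frombytes(blob)
        rows.append((video, t_seconds, vec.tolist()))
    return rows
```

With an on-disk database, `conn.commit()` would need to be called after each batch of `add_frame` calls so the rows are persisted.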

3. The Search Math

To find the most relevant video clip, the system calculates the cosine similarity between the vector of the search query ($\mathbf{q}$) and the vector of each stored frame ($\mathbf{f}$).

The formula used is: $\text{similarity} = \cos(\theta) = \frac{\mathbf{q} \cdot \mathbf{f}}{\|\mathbf{q}\| \|\mathbf{f}\|}$

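The score ranges from $-1$ to $1$, and a value closer to $1.0$ indicates a closer match. For search, the stored frames are ranked by this score and the highest-scoring ones are returned.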
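
To make the ranking step concrete, here is a small sketch in plain Python. It is illustrative only: the toy 3-dimensional vectors stand in for real 512-dimensional CLIP embeddings, the `search` helper and the second filename are hypothetical, and a real index would more likely use vectorized NumPy or PyTorch operations. In a real run the query vector would come from the text encoder, for example `model.encode_text(clip.tokenize(["mountain biking in the rain"]).to(device))`.

```python
import math


def cosine_similarity(q: list[float], f: list[float]) -> float:
    """cos(theta) = (q . f) / (|q| |f|), as in the formula above."""
    dot = sum(a * b for a, b in zip(q, f))
    norm_q = math.sqrt(sum(a * a for a in q))
    norm_f = math.sqrt(sum(b * b for b in f))
    if norm_q == 0.0 or norm_f == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_q * norm_f)


def search(query_vec, frames, top_k=3):
    """Rank (video, t_seconds, vector) records by similarity to the query."""
    scored = [(cosine_similarity(query_vec, vec), video, t)
              for video, t, vec in frames]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]


# Toy vectors; the filenames and timestamps are made up for illustration.
frames = [
    ("GH010234.mp4", 0.0, [1.0, 0.0, 0.0]),
    ("GH010234.mp4", 2.0, [0.0, 1.0, 0.0]),
    ("GH010567.mp4", 4.0, [3.0, 4.0, 0.0]),
]
query = [1.0, 0.0, 0.0]
for score, video, t in search(query, frames, top_k=2):
    print(f"{score:.2f}  {video} @ {t:.0f}s")
# prints:
# 1.00  GH010234.mp4 @ 0s
# 0.60  GH010567.mp4 @ 4s
```

The third frame scores 0.60 even though its vector is five times longer than the query, because cosine similarity depends only on direction, not magnitude.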

Results and Performance

The result is a local system where I can query my footage instantly.

![Conceptual UI of Video Search](https://via.placeholder.com/600x300?text=Search+Query:+'Surfing+at+Sunset'+→+Video+Clip+Found)

Performance Observations

Processing Speed: The M1 Max's mps (Metal Performance Shaders) backend significantly accelerated the inference.
Storage: The resulting vector database is surprisingly small compared to the 669 GB of source video.
Accuracy: It handles general concepts (e.g., "forest") very well, though it occasionally struggles with very specific, niche objects.
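
The storage observation follows from simple arithmetic: a ViT-B/32 embedding has 512 values, so one sampled frame costs about 2 KB as float32. The sketch below estimates the raw index size. The 50 hours of footage and the 2-second sampling interval are hypothetical inputs chosen for illustration, not measurements from this project, and the `index_size_bytes` helper is likewise an assumption.

```python
def index_size_bytes(total_video_seconds: float, every_n_seconds: float,
                     dims: int = 512, bytes_per_value: int = 4) -> int:
    """Estimate the raw size of the stored embeddings, ignoring database overhead."""
    n_frames = int(total_video_seconds // every_n_seconds)
    return n_frames * dims * bytes_per_value


# Hypothetical inputs: 50 hours of footage, one frame every 2 seconds.
size = index_size_bytes(total_video_seconds=50 * 3600, every_n_seconds=2)
print(f"{size / 1e6:.1f} MB")  # prints "184.3 MB"
```

Under those assumptions the embeddings take roughly 184 MB, a tiny fraction of the 669 GB of source video.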

Final Thoughts

By moving the intelligence to the edge (my own laptop), I achieved three things:

Privacy: No data ever left my machine.
Cost: Zero monthly subscription fees for cloud AI.
Utility: I can now actually use my archives instead of letting them rot on a hard drive.

It turns out that the "Max" in M1 Max is exactly what's needed for local ML indexing.