The 90-Year-Old Foundation of JEPA: Canonical Correlation Analysis

By Shon Czinner | Published May 4, 2026

Introduction

The conceptual roots of modern embedding prediction—specifically Joint-Embedding Predictive Architectures (JEPA)—stretch back nearly a century. The foundation was laid in 1936 by the economist and statistician Harold Hotelling.

"Concepts of correlation and regression may be applied not only to ordinary one-dimensional variates but also to variates of two or more dimensions." — Hotelling (1936), "Relations Between Two Sets of Variates"

In contemporary terms, Canonical Correlation Analysis (CCA) is a method designed to extract a shared signal from two distinct, large matrices (Bykhovskaya and Gorin 2025). In the context of JEPA, the objective is identical, with one key twist: the second matrix is simply an alternative "view" or transformation of the first dataset.

The Lineage of the Idea

Recent scholarship (Huang 2026) suggests that JEPA models are essentially non-linear generalizations of CCA. This connection is central to the ongoing debate between Jürgen Schmidhuber and Yann LeCun regarding the "invention" of JEPA. While the debate continues, the core intellectual credit for maximizing correlation within an embedding space arguably belongs to Hotelling.

The evolution of this idea has followed a clear path:

1936: Original CCA proposed by Hotelling.
1961: Horst generalized CCA to handle more than two sets of variables.
2013: "Deep CCA" (Andrew et al.) introduced non-linear neural variants.
Modern Era: JEPA models apply these principles to large-scale AI.

Note: It is entirely possible that future JEPAs could be expanded to integrate more than two data views, mirroring Horst's 1961 generalization.

Technical Deep Dive: CCA vs. JEPA

1. Canonical Correlation Analysis (CCA)

Imagine we have two zero-mean matrices: $X \in \mathbb R^{n\times d_x}$ and $Y \in \mathbb R^{n\times d_y}$.

We seek projection matrices $A \in \mathbb R^{d_x\times k}$ and $B \in \mathbb R^{d_y\times k}$ to create embeddings: $z_x = XA$ and $z_y = YB$ (where $z_x, z_y \in \mathbb R^{n \times k}$).

The goal of CCA is to solve this maximization problem: $\max_{A,B} \text{tr}\left(\frac{1}{n}z_x^Tz_y\right)$ subject to $\frac{1}{n}z_x^Tz_x = \frac{1}{n}z_y^Tz_y = I$

This maximizes the trace of the cross-correlation matrix while requiring each embedding dimension to have unit variance and distinct dimensions to be uncorrelated (whitening).
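As a rough illustration of the optimization above, here is a minimal NumPy sketch of linear CCA. It uses the standard whitening-plus-SVD route, which is one common way to solve the problem and not necessarily the procedure used in any of the cited papers. The helper names `inv_sqrt` and `linear_cca`, the synthetic data, and the `eps` guard are all assumptions made for this example.

```python
import numpy as np

def inv_sqrt(C, eps=1e-8):
    """Inverse matrix square root of a symmetric positive semi-definite matrix."""
    vals, vecs = np.linalg.eigh(C)
    return vecs @ np.diag(1.0 / np.sqrt(np.maximum(vals, eps))) @ vecs.T

def linear_cca(X, Y, k):
    """Return projections A, B and the top-k canonical correlations."""
    n = X.shape[0]
    X = X - X.mean(axis=0)  # enforce the zero-mean assumption
    Y = Y - Y.mean(axis=0)
    Cxx, Cyy, Cxy = X.T @ X / n, Y.T @ Y / n, X.T @ Y / n
    Wx, Wy = inv_sqrt(Cxx), inv_sqrt(Cyy)      # whitening transforms
    U, s, Vt = np.linalg.svd(Wx @ Cxy @ Wy)    # singular values = canonical correlations
    A = Wx @ U[:, :k]
    B = Wy @ Vt[:k].T
    return A, B, s[:k]

# Synthetic example (assumed data): two views that share a k-dimensional signal
rng = np.random.default_rng(0)
n, k = 2000, 2
shared = rng.normal(size=(n, k))
X = np.hstack([shared + 0.1 * rng.normal(size=(n, k)), rng.normal(size=(n, 3))])
Y = np.hstack([rng.normal(size=(n, 2)), shared + 0.1 * rng.normal(size=(n, k))])
X -= X.mean(axis=0)
Y -= Y.mean(axis=0)

A, B, corrs = linear_cca(X, Y, k)
z_x, z_y = X @ A, Y @ B
print("canonical correlations:", corrs)
print("whitened:", np.allclose(z_x.T @ z_x / n, np.eye(k), atol=1e-6))
```

The trace objective evaluated at this solution equals the sum of the top $k$ singular values, which is why the SVD of the whitened cross-covariance yields the maximizer.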

The Link to Prediction Error

Just as PCA links variance maximization to error minimization, CCA links cross-correlation to the Mean Squared Error (MSE): $\frac{1}{n}\sum_{i=1}^n ||z_x^{(i)}-z_y^{(i)}||^2 = \frac{1}{n}||z_x-z_y||_F^2 = \frac{1}{n}\text{tr}(z_x^Tz_x) + \frac{1}{n}\text{tr}(z_y^Tz_y) - \frac{2}{n}\text{tr}(z_x^Tz_y)$

Given the whitening constraints, each of the first two terms equals $\text{tr}(I_k) = k$, so the expression simplifies to: $2k - \frac{2}{n}\text{tr}(z_x^Tz_y)$

Thus, maximizing correlation under whitening is mathematically equivalent to: $\min_{A,B} \frac{1}{n}\sum_{i=1}^n ||z_x^{(i)}-z_y^{(i)}||^2 \quad \text{s.t.} \quad \frac{1}{n}z_x^Tz_x = \frac{1}{n}z_y^Tz_y = I$
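The identity can be checked numerically. The sketch below whitens two arbitrary embedding matrices and compares the MSE against $2k - \frac{2}{n}\text{tr}(z_x^Tz_y)$. The `whiten` helper and the random data are assumptions introduced only for this check.

```python
import numpy as np

def whiten(Z):
    """Center Z and rescale it so that Z.T @ Z / n is the identity."""
    Z = Z - Z.mean(axis=0)
    n = Z.shape[0]
    vals, vecs = np.linalg.eigh(Z.T @ Z / n)
    return Z @ vecs @ np.diag(vals ** -0.5) @ vecs.T

rng = np.random.default_rng(1)
n, k = 500, 3
z_x = whiten(rng.normal(size=(n, k)))
z_y = whiten(z_x + 0.5 * rng.normal(size=(n, k)))  # a correlated second embedding

# Left-hand side: mean squared distance between paired embeddings
mse = np.mean(np.sum((z_x - z_y) ** 2, axis=1))
# Right-hand side: 2k minus twice the trace of the cross-correlation matrix
identity_rhs = 2 * k - 2 * np.trace(z_x.T @ z_y) / n

print(np.isclose(mse, identity_rhs))
```

Because both embeddings satisfy the whitening constraint, the two quantities agree up to floating-point error, whatever the correlation between them.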

2. Joint-Embedding Predictive Architecture (JEPA)

In JEPA, we assume $d_x = d_y = d$. We use an encoder $f_\theta$ and a predictor $g_\varphi$.

$z_x^{(i)} = g_\varphi(f_\theta(x_i))$
$z_y^{(i)} = f_\theta(y_i)$

The objective function is: $\min_{\theta,\varphi}\frac{1}{n} \sum_{i=1}^n ||z_x^{(i)}-z_y^{(i)}||^2$

The Problem of Collapse

Unlike CCA, standard JEPA lacks explicit whitening constraints. This permits representational collapse, where the model finds a "cheat" solution that drives the MSE to zero without encoding anything about the input: $z_x^{(i)} = z_y^{(i)} = c$ (a constant vector).
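A small NumPy sketch makes the collapse concrete. It assumes a linear encoder with a bias and a linear predictor, which is a simplification of the neural networks used in practice. The function `jepa_mse` and all data here are hypothetical and exist only for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, k = 256, 8, 4
x = rng.normal(size=(n, d))
y = x + 0.1 * rng.normal(size=(n, d))  # second view: a noisy copy of x

def jepa_mse(W_enc, b_enc, W_pred):
    """Unregularized JEPA loss with a linear encoder and a linear predictor."""
    z_y = y @ W_enc + b_enc
    z_x = (x @ W_enc + b_enc) @ W_pred
    return np.mean(np.sum((z_x - z_y) ** 2, axis=1)), z_y

# Collapsed solution: the encoder ignores its input and emits a constant c
c = np.ones(k)
collapsed_loss, collapsed_z = jepa_mse(np.zeros((d, k)), c, np.eye(k))

# A non-degenerate random encoder for comparison
random_loss, random_z = jepa_mse(rng.normal(size=(d, k)), np.zeros(k), np.eye(k))

print("collapsed loss:", collapsed_loss, "variance:", collapsed_z.var(axis=0))
print("random encoder loss:", random_loss)
```

The collapsed encoder achieves a lower loss than the informative one even though its embeddings have zero variance, which is exactly the failure mode that the whitening constraint rules out in CCA.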

To solve this, SIGReg (Balestriero and LeCun 2025) was introduced to force embeddings to be isotropic, effectively recreating the CCA constraint: $\frac{1}{n}z_x^Tz_x = \frac{1}{n}z_y^Tz_y = I$
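To show how a constraint of this form can be turned into a differentiable penalty, the sketch below uses a plain covariance penalty, $\|\text{Cov}(z) - I\|_F^2$. This is a simpler stand-in and is not SIGReg, whose actual procedure is not reproduced here. The function name `covariance_penalty` and the test data are assumptions for this example.

```python
import numpy as np

def covariance_penalty(z):
    """Squared Frobenius distance between the embedding covariance and I.

    Stand-in regularizer for illustration only; this is not SIGReg.
    """
    z = z - z.mean(axis=0)
    cov = z.T @ z / z.shape[0]
    return np.sum((cov - np.eye(z.shape[1])) ** 2)

rng = np.random.default_rng(3)
n, k = 1000, 4
collapsed = np.ones((n, k))          # every sample maps to the same vector
isotropic = rng.normal(size=(n, k))  # approximately identity covariance

penalty_collapsed = covariance_penalty(collapsed)
penalty_isotropic = covariance_penalty(isotropic)
print(penalty_collapsed, penalty_isotropic)
```

A collapsed embedding has zero covariance, so its penalty equals $\|I\|_F^2 = k$, while roughly isotropic embeddings score close to zero. Adding such a term to the MSE removes the incentive to collapse.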

Comparison Summary

| Feature | Canonical Correlation Analysis (CCA) | JEPA Models |
| --- | --- | --- |
| Core Goal | Maximize correlation between matrices | Predict one embedding from another |
| Linearity | Originally linear | Non-linear (neural networks) |
| Constraints | Explicit whitening ($I$) | Implicit, or added via SIGReg |
| Risk | Overfitting | Dimensional/representational collapse |

Conclusion: The Debate on Innovation

The tension between Schmidhuber and LeCun boils down to a disagreement on what constitutes "invention." Schmidhuber argues that JEPA is essentially the same as his 1992 Predictability Maximization system. LeCun counters that JEPA is a general concept, and the real achievement lies in making it work at scale on complex, non-toy problems.

My Perspective:

~~Ideas are a dime a dozen.~~ $\rightarrow$ While implementation is where the value is realized, the chain of citation is vital for scientific progress.
If foundational citations (like Hotelling's) are omitted, they should be added.
JEPA and Predictability Maximization are essentially architectural layers built upon the bedrock of CCA.

Ultimately, all these models share a singular, 90-year-old goal: finding the transformations that maximize the correlation between multidimensional data sets.

Implementation Checklist for JEPA-like Systems

Define two views of the same data.
Implement an encoder $f_\theta$.
Implement a predictor $g_\varphi$.
Define an MSE loss function.
Apply a regularization technique (e.g., SIGReg) to prevent collapse.

Conceptual Pseudo-code for Embedding Alignment

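import torch

def compute_jepa_loss(x, y, encoder, predictor, regularizer):
    """MSE between predicted and target embeddings plus an anti-collapse term."""
    # Generate embeddings, each of shape (n, k)
    z_y = encoder(y)
    z_x = predictor(encoder(x))

    # MSE loss (the CCA-linked objective): squared distance per sample,
    # summed over embedding dimensions and averaged over the batch
    mse_loss = ((z_x - z_y) ** 2).sum(dim=1).mean()

    # Anti-collapse term: `regularizer` is a caller-supplied callable
    # (e.g., a SIGReg implementation) that encourages isotropic embeddings;
    # it is a placeholder and is not defined here
    reg_loss = regularizer(z_x, z_y)

    return mse_loss + reg_loss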

References

Andrew et al. (2013): "Deep Canonical Correlation Analysis." ICML.
Balestriero & LeCun (2025): LeJEPA: Provable and Scalable Self-Supervised Learning.
Benton et al.: Generalized Canonical Correlations.
Hotelling (1936): "Relations Between Two Sets of Variates." Biometrika.
Huang (2026): VJEPA: Variational Joint Embedding Predictive Architectures.