The 90-year-old idea behind JEPA models: Canonical Correlation Analysis
The 90-Year-Old Foundation of JEPA: Canonical Correlation Analysis
By Shon Czinner | Published May 4, 2026
Introduction
The conceptual roots of modern embedding prediction—specifically Joint-Embedding Predictive Architectures (JEPA)—stretch back nearly a century. The foundation was laid in 1936 by the economist and statistician Harold Hotelling.
"Concepts of correlation and regression may be applied not only to ordinary one-dimensional variates but also to variates of two or more dimensions." — Hotelling (1936), "Relations Between Two Sets of Variates"
In contemporary terms, Canonical Correlation Analysis (CCA) is a method designed to extract a shared signal from two distinct, large matrices (Bykhovskaya and Gorin 2025). In the context of JEPA, the objective is identical, with one key twist: the second matrix is simply an alternative "view" or transformation of the first dataset.
The Lineage of the Idea
Recent scholarship (Huang 2026) suggests that JEPA models are essentially non-linear generalizations of CCA. This connection is central to the ongoing debate between Jürgen Schmidhuber and Yann LeCun regarding the "invention" of JEPA. While the debate continues, the core intellectual credit for maximizing correlation within an embedding space arguably belongs to Hotelling.
The evolution of this idea has followed a clear path:
- 1936: Original CCA proposed by Hotelling.
- 1961: Horst generalized CCA to handle more than two sets of variables.
- 2013: "Deep CCA" (Andrew et al.) introduced non-linear neural variants.
- Modern Era: JEPA models apply these principles to large-scale AI.
Note: It is entirely possible that future JEPAs could be expanded to integrate more than two data views, mirroring Horst's 1961 generalization.
Technical Deep Dive: CCA vs. JEPA
1. Canonical Correlation Analysis (CCA)
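Imagine we have two zero-mean matrices whose rows are paired observations:
$X \in \mathbb{R}^{n \times p}$ and $Y \in \mathbb{R}^{n \times q}$.
We seek projection matrices $W_x \in \mathbb{R}^{p \times k}$ and $W_y \in \mathbb{R}^{q \times k}$ to create embeddings: $Z_x = X W_x$ and $Z_y = Y W_y$ (where $k \le \min(p, q)$).
The goal of CCA is to solve this maximization problem:
$$\max_{W_x, W_y} \ \frac{1}{n}\operatorname{tr}\left(Z_x^\top Z_y\right) \quad \text{subject to} \quad \frac{1}{n} Z_x^\top Z_x = \frac{1}{n} Z_y^\top Z_y = I_k$$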
This process maximizes the trace of the cross-correlation matrix while ensuring the embeddings maintain unit variance and zero covariance (whitening).
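The maximization described above can be sketched numerically. The following is a minimal NumPy illustration, not a reference implementation: it whitens each view with an inverse matrix square root and then takes an SVD of the whitened cross-covariance, which is one standard way to solve linear CCA. The helper names `inv_sqrt` and `linear_cca`, the `eps` floor, and the synthetic two-view data are all assumptions made for the example.

```python
import numpy as np

def inv_sqrt(c, eps=1e-10):
    """Inverse square root of a symmetric positive-definite matrix."""
    vals, vecs = np.linalg.eigh(c)
    return vecs @ np.diag(1.0 / np.sqrt(np.maximum(vals, eps))) @ vecs.T

def linear_cca(x, y, k):
    """Return projections (w_x, w_y) and the top-k canonical correlations."""
    n = x.shape[0]
    x = x - x.mean(axis=0)                      # enforce the zero-mean assumption
    y = y - y.mean(axis=0)
    c_xx, c_yy, c_xy = x.T @ x / n, y.T @ y / n, x.T @ y / n
    s_x, s_y = inv_sqrt(c_xx), inv_sqrt(c_yy)   # whitening transforms
    u, s, vt = np.linalg.svd(s_x @ c_xy @ s_y)  # singular values = canonical correlations
    return s_x @ u[:, :k], s_y @ vt.T[:, :k], s[:k]

# Synthetic data (assumed for illustration): two noisy views of a shared 2-D signal.
rng = np.random.default_rng(0)
n = 2000
shared = rng.normal(size=(n, 2))
x = shared @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(n, 5))
y = shared @ rng.normal(size=(2, 4)) + 0.1 * rng.normal(size=(n, 4))

w_x, w_y, corrs = linear_cca(x, y, k=2)
z_x = (x - x.mean(axis=0)) @ w_x
z_y = (y - y.mean(axis=0)) @ w_y

print("canonical correlations:", corrs)
print("embedding covariance:\n", np.round(z_x.T @ z_x / n, 3))
print("trace of cross-correlation:", np.trace(z_x.T @ z_y) / n)
```

In this sketch the embedding covariance comes out as the identity by construction, and the trace of the cross-correlation matrix equals the sum of the canonical correlations.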
The Link to Prediction Error
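Just as PCA links variance maximization to error minimization, CCA links cross-correlation to the Mean Squared Error (MSE). For embeddings $Z_x$ and $Z_y$ with $n$ rows and $k$ columns:
$$\frac{1}{n}\lVert Z_x - Z_y \rVert_F^2 = \frac{1}{n}\operatorname{tr}\left(Z_x^\top Z_x\right) + \frac{1}{n}\operatorname{tr}\left(Z_y^\top Z_y\right) - \frac{2}{n}\operatorname{tr}\left(Z_x^\top Z_y\right)$$
Given the whitening constraints, the first two terms each equal $\operatorname{tr}(I_k) = k$, so this simplifies to:
$$\frac{1}{n}\lVert Z_x - Z_y \rVert_F^2 = 2k - \frac{2}{n}\operatorname{tr}\left(Z_x^\top Z_y\right)$$
Thus, maximizing correlation under whitening is mathematically equivalent to:
$$\min_{W_x, W_y} \ \frac{1}{n}\lVert Z_x - Z_y \rVert_F^2 \quad \text{subject to} \quad \frac{1}{n} Z_x^\top Z_x = \frac{1}{n} Z_y^\top Z_y = I_k$$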
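The link between MSE and cross-correlation can be checked numerically. The sketch below assumes the MSE is normalized as the squared Frobenius norm divided by the number of samples, and it uses a hypothetical `whiten` helper on synthetic data. It verifies that, for any pair of whitened $k$-dimensional embeddings, the MSE equals $2k$ minus twice the trace of the cross-correlation matrix.

```python
import numpy as np

def whiten(z):
    """Center z and rescale it so that z.T @ z / n is the identity."""
    z = z - z.mean(axis=0)
    cov = z.T @ z / z.shape[0]
    vals, vecs = np.linalg.eigh(cov)
    return z @ vecs @ np.diag(vals ** -0.5) @ vecs.T

# Synthetic embeddings (assumed for illustration): two noisy copies of one signal.
rng = np.random.default_rng(1)
n, k = 1000, 3
base = rng.normal(size=(n, k))
z_x = whiten(base + 0.5 * rng.normal(size=(n, k)))
z_y = whiten(base + 0.5 * rng.normal(size=(n, k)))

mse = np.sum((z_x - z_y) ** 2) / n          # (1/n) * squared Frobenius norm
trace_term = np.trace(z_x.T @ z_y) / n      # trace of the cross-correlation matrix

print("MSE:              ", mse)
print("2k - 2 * trace:   ", 2 * k - 2 * trace_term)
```

The two printed values should agree up to floating-point error, which is why minimizing the MSE and maximizing the trace select the same solution once whitening is enforced.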
2. Joint-Embedding Predictive Architecture (JEPA)
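In JEPA, we assume $Y$ is an alternative view of $X$, so both can pass through the same encoder. We utilize an encoder $f_\theta$ and a predictor $g_\phi$.
The objective function is:
$$\min_{\theta, \phi} \ \frac{1}{n}\sum_{i=1}^{n} \left\lVert g_\phi\left(f_\theta\left(x^{(i)}\right)\right) - f_\theta\left(y^{(i)}\right) \right\rVert_2^2$$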
The Problem of Collapse
Unlike CCA, standard JEPA lacks explicit whitening constraints. This leads to representational collapse, where the model finds a "cheat" solution:
$z_x^{(i)} = z_y^{(i)} = c$ (a constant vector).
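A small numerical illustration of why this solution counts as a "cheat": if every sample is mapped to the same vector, the prediction error is exactly zero, yet the embeddings carry no variance and are as far from whitened as possible. The sizes and the constant vector `c` below are arbitrary choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 500, 4
c = rng.normal(size=k)                  # arbitrary constant vector

# Collapsed solution: every sample receives the same embedding c.
z_x = np.tile(c, (n, 1))
z_y = np.tile(c, (n, 1))

mse = np.mean(np.sum((z_x - z_y) ** 2, axis=1))
cov = np.cov(z_x, rowvar=False)         # k x k sample covariance
total_variance = np.trace(cov)
distance_from_identity = np.linalg.norm(cov - np.eye(k))

print("MSE:", mse)
print("total variance:", total_variance)
print("Frobenius distance from identity covariance:", distance_from_identity)
```

The loss is perfectly minimized while the covariance is essentially the zero matrix, which is exactly the situation the CCA whitening constraint rules out.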
To solve this, SIGReg (Balestriero and LeCun 2025) was introduced to force embeddings to be isotropic, effectively recreating the CCA constraint:
$$\operatorname{Cov}(z) \approx I$$
Comparison Summary
| Feature | Canonical Correlation Analysis (CCA) | JEPA Models |
|---|---|---|
| Core Goal | Maximize correlation between matrices | Predict one embedding from another |
| Linearity | Originally Linear | Non-linear (Neural Networks) |
| Constraints | Explicit Whitening ($Z^\top Z / n = I$) | Implicit/Added via SIGReg |
| Risk | Overfitting | Dimensional/Representational Collapse |
Conclusion: The Debate on Innovation
The tension between Schmidhuber and LeCun boils down to a disagreement on what constitutes "invention." Schmidhuber argues that JEPA is essentially the same as his 1992 Predictability Maximization system. LeCun counters that JEPA is a general concept, and the real achievement lies in making it work at scale on complex, non-toy problems.
My Perspective:
- Ideas are a dime a dozen. While implementation is where the value is realized, the chain of citation is vital for scientific progress.
- If foundational citations (like Hotelling's) are omitted, they should be added.
- JEPA and Predictability Maximization are essentially architectural layers built upon the bedrock of CCA.
Ultimately, all these models share a singular, 90-year-old goal: finding the transformations that maximize the correlation between multidimensional data sets.
Implementation Checklist for JEPA-like Systems
- Define two views of the same data.
- Implement an encoder.
- Implement a predictor.
- Define an MSE loss function.
- Apply a regularization technique (e.g., SIGReg) to prevent collapse.
Conceptual Pseudo-code for Embedding Alignment
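```python
import torch

def compute_jepa_loss(x, y, encoder, predictor):
    # Target embedding: encode the second view directly
    z_y = encoder(y)
    # Predicted embedding: encode the first view, then predict the target
    z_x = predictor(encoder(x))

    # MSE loss (the CCA-linked objective)
    mse_loss = torch.mean((z_x - z_y) ** 2)

    # SIGReg-style term: prevent collapse by encouraging identity covariance.
    # compute_isotropic_regularization is a placeholder that must be supplied.
    reg_loss = compute_isotropic_regularization(z_x, z_y)

    return mse_loss + reg_loss
```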
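The pseudo-code leaves `compute_isotropic_regularization` undefined. One possible stand-in, offered purely as a sketch, is a plain covariance penalty that measures how far each batch of embeddings is from identity covariance. This is an assumption made for illustration and is not SIGReg itself; the inner helper `cov_penalty` and the batch sizes are hypothetical.

```python
import torch

def compute_isotropic_regularization(z_x, z_y):
    """Hypothetical stand-in: penalize embedding covariance that deviates from identity."""
    def cov_penalty(z):
        z = z - z.mean(dim=0, keepdim=True)               # center the batch
        cov = z.T @ z / (z.shape[0] - 1)                  # sample covariance
        eye = torch.eye(z.shape[1], device=z.device, dtype=z.dtype)
        return ((cov - eye) ** 2).sum() / z.shape[1]      # squared distance, scaled by dimension
    return cov_penalty(z_x) + cov_penalty(z_y)

torch.manual_seed(0)
collapsed = torch.ones(256, 8)      # every sample identical: the collapsed case
spread = torch.randn(256, 8)        # roughly isotropic embeddings

collapsed_penalty = compute_isotropic_regularization(collapsed, collapsed)
spread_penalty = compute_isotropic_regularization(spread, spread)

print("penalty on collapsed embeddings:", collapsed_penalty.item())
print("penalty on spread embeddings:   ", spread_penalty.item())
```

With a term like this added to the loss, the constant-vector solution is no longer free: it minimizes the MSE but pays a large penalty, while well-spread embeddings pay almost none.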
References
- Andrew et al. (2013): "Deep Canonical Correlation Analysis." ICML.
- Balestriero & LeCun (2025): LeJEPA: Provable and Scalable Self-Supervised Learning.
- Benton et al.: Generalized Canonical Correlations.
- Hotelling (1936): "Relations Between Two Sets of Variates." Biometrika.
- Huang (2026): VJEPA: Variational Joint Embedding Predictive Architectures.