Control-oriented Clustering of
Visual Latent Representation

1School of Engineering and Applied Sciences, Harvard University
2Department of Computer Science, ETH Zürich
*Equal contribution. Work done during a visit to the Harvard Computational Robotics Lab.

Abstract

We initiate a study of the geometry of the visual representation space --the information channel from the vision encoder to the action decoder-- in an image-based control pipeline learned from behavior cloning. Inspired by the phenomenon of neural collapse (NC) in image classification (Papyan et al., 2020), we investigate whether a similar law of clustering emerges in the visual representation space. Since image-based control is a regression task without explicitly defined classes, the central piece of the puzzle lies in determining according to what implicit classes the visual features cluster, if such a law exists.

Focusing on image-based planar pushing, we posit the most important role of the visual representation in a control task is to convey a goal to the action decoder; for instance, rotate the object clockwise and push it northeast. We then classify training samples of expert demonstrations into eight control-oriented classes --based on (a) the relative pose between the object and the target in the input or (b) the relative pose of the object induced by expert actions in the output-- where one class corresponds to one relative pose orthant (Repos). Across four different instantiations of the vision-based control architecture, we report the prevalent emergence of control-oriented clustering (similar to NC) in the visual representation space according to the eight Repos.
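As a rough illustration of the Repos labeling, the sketch below assigns a relative pose to one of eight classes by its sign pattern. It assumes the planar relative pose is a triple (dx, dy, dtheta), so that the 2^3 sign combinations give the eight orthants; the function name `repos_label`, the bit encoding, and the treatment of exact zeros are hypothetical choices for illustration, not taken from the paper.

```python
import numpy as np

def repos_label(rel_pose):
    """Map a relative pose (dx, dy, dtheta) to an orthant label in 0..7.

    Assumption: the label is determined only by the sign of each coordinate.
    Exact zeros are counted as non-negative (an arbitrary tie-breaking choice).
    """
    rel_pose = np.asarray(rel_pose, dtype=float)
    bits = (rel_pose >= 0).astype(int)  # one sign bit per coordinate
    return int(bits[0] * 4 + bits[1] * 2 + bits[2])

# Three example relative poses and their orthant labels.
poses = [(0.1, 0.2, 0.3), (-0.1, 0.2, -0.3), (-1.0, -1.0, -1.0)]
labels = [repos_label(p) for p in poses]
print(labels)  # [7, 2, 0]
```

The same function could be applied either to the object-to-target relative pose in the input (option a) or to the relative pose induced by the expert actions in the output (option b).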

Beyond empirical observation, we show that such a law of clustering can be leveraged as an algorithmic tool to improve test-time performance when training a policy with a limited number of expert demonstrations. Specifically, we pretrain the vision encoder using NC as a regularizer to encourage control-oriented clustering of the visual features. Surprisingly, such an NC-pretrained vision encoder, when finetuned end-to-end with the action decoder, boosts test-time performance by 10% to 35% in the low-data regime. Real-world vision-based planar pushing experiments confirm the surprising advantage of control-oriented visual representation pretraining.

Neural Collapse in Image Classification

Neural Collapse (NC) is a clustering phenomenon first observed by Papyan et al. (2020). It describes an elegant geometric structure of the last-layer features and classifier of a well-trained model in classification tasks. In particular, NC refers to a set of four manifestations in the representation space (i.e., the penultimate layer):

  1. (NC1) Variability collapse: feature vectors of the same class converge to their class mean.
  2. (NC2) Simplex ETF: globally-centered class mean vectors converge to a geometric configuration known as Simplex Equiangular Tight Frame (ETF), i.e., mean vectors have the same lengths and form equal angles pairwise.
  3. (NC3) Self-duality: the class means and the last-layer linear classifiers are self-dual.
  4. (NC4) Nearest class-center prediction: the network predicts the class whose mean vector has the minimum Euclidean distance to the feature of the test image.
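For reference, the Simplex ETF in (NC2) can be written explicitly. The block below restates the standard definition for K classes in a d-dimensional feature space (d >= K); the symbols M, U, and K are notation introduced here for illustration and may differ from the notation used in the paper.

```latex
% Standard Simplex ETF with K unit-norm columns m_1, ..., m_K in R^d.
% U is any d x K matrix with orthonormal columns (U^T U = I_K).
\[
  M = \sqrt{\frac{K}{K-1}}\; U \left( I_K - \frac{1}{K}\,\mathbf{1}_K \mathbf{1}_K^{\top} \right),
  \qquad
  M^{\top} M = \frac{K}{K-1}\left( I_K - \frac{1}{K}\,\mathbf{1}_K \mathbf{1}_K^{\top} \right).
\]
% Consequently every column has unit length and every pair forms the same angle:
\[
  \| m_i \| = 1, \qquad
  \langle m_i, m_j \rangle = -\frac{1}{K-1} \quad (i \neq j).
\]
```

With K = 8 classes, as for the eight Repos, the pairwise cosine would be -1/7, i.e., an angle of roughly 98.2 degrees between any two class means.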

Control-oriented Neural Collapse in Vision-based Control

Evaluation example demo for the last (300th) training epoch

Evaluation score w.r.t. training epochs

We evaluate (NC1) using the class-distance normalized variance (CDNV) metric, which depends on the ratio of within-class to between-class variability. We evaluate (NC2) using the standard deviations of the angles between, and the lengths of, the cluster mean vectors. We do not evaluate (NC3) and (NC4) because they require a linear classifier from the representation space to the output, which our vision-based control setup does not have. For comprehensive details, including the mathematical definitions of all three metrics, please refer to Section 3 of the paper.
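The three metrics can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's exact implementation: the CDNV of a class pair is taken as (var_i + var_j) / (2 * ||mu_i - mu_j||^2) averaged over all pairs, class means are centered by the global feature mean before measuring lengths and angles, and no further normalization is applied. The function names are hypothetical; Section 3 of the paper gives the authoritative definitions.

```python
import numpy as np

def class_stats(features, labels):
    """Return per-class means and variances (mean squared distance to the class mean)."""
    features, labels = np.asarray(features, dtype=float), np.asarray(labels)
    classes = np.unique(labels)
    means = np.stack([features[labels == c].mean(axis=0) for c in classes])
    variances = np.array([
        np.mean(np.sum((features[labels == c] - m) ** 2, axis=1))
        for c, m in zip(classes, means)
    ])
    return means, variances

def cdnv(features, labels):
    """NC1: class-distance normalized variance, averaged over all class pairs."""
    means, variances = class_stats(features, labels)
    vals = []
    for i in range(len(means)):
        for j in range(i + 1, len(means)):
            dist2 = np.sum((means[i] - means[j]) ** 2)
            vals.append((variances[i] + variances[j]) / (2.0 * dist2))
    return float(np.mean(vals))

def nc2_stats(features, labels):
    """NC2: std of lengths and std of pairwise angles of globally-centered class means."""
    features = np.asarray(features, dtype=float)
    means, _ = class_stats(features, labels)
    centered = means - features.mean(axis=0)        # subtract the global mean
    norms = np.linalg.norm(centered, axis=1)
    unit = centered / norms[:, None]
    iu = np.triu_indices(len(means), k=1)           # each unordered pair once
    angles = np.arccos(np.clip((unit @ unit.T)[iu], -1.0, 1.0))
    return float(np.std(norms)), float(np.std(angles))

# Usage on random features with 8 classes of 10 samples each (synthetic data).
rng = np.random.default_rng(0)
feats = rng.normal(size=(80, 16))
labs = np.repeat(np.arange(8), 10)
print(cdnv(feats, labs), nc2_stats(feats, labs))
```

Under this sketch, all three quantities approaching zero would indicate stronger clustering: features collapse to their class means, and the class means become equal in length and equiangular.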

Class-Distance Normalized Variance

Standard Deviation of Angle Spanned by Cluster Mean Vectors

Standard Deviation of Length of Cluster Mean Vectors

Real-World Validation

Example Evaluation Result

(Ours) Real-world evaluation trajectory of test 8 for NC-pretrained policy trained under 100 demos.

Real-world evaluation trajectory of test 8 for baseline policy trained under 100 demos.

(Ours) Real-world evaluation video of test 8 for NC-pretrained policy trained under 100 demos. Success.

Real-world evaluation video of test 8 for baseline policy trained under 100 demos. Failure.

All real-world evaluation results:

Our NC-pretrained policy consistently outperforms the baseline policy. It succeeds on every test that the baseline completes, and also on a subset of tests where the baseline fails. The improvement is therefore monotonic: on no individual test does the NC-pretrained policy do worse than the baseline.

(Ours) NC-pretrained (8 out of 10) vs. Baseline (5 out of 10) policy trained under 100 demos

NC-pretrained Policy Test 0: Success.

Baseline Policy Test 0: Success.

NC-pretrained Policy Test 1: Success.

Baseline Policy Test 1: Success.

NC-pretrained Policy Test 2: Failure.

Baseline Policy Test 2: Failure.

NC-pretrained Policy Test 3: Success.

Baseline Policy Test 3: Success.

NC-pretrained Policy Test 4: Success.

Baseline Policy Test 4: Success.

NC-pretrained Policy Test 5: Success.

Baseline Policy Test 5: Failure.

NC-pretrained Policy Test 6: Failure.

Baseline Policy Test 6: Failure.

NC-pretrained Policy Test 7: Success.

Baseline Policy Test 7: Success.

NC-pretrained Policy Test 8: Success.

Baseline Policy Test 8: Failure.

NC-pretrained Policy Test 9: Success.

Baseline Policy Test 9: Failure.

(Ours) NC-pretrained (4 out of 10) vs. Baseline (2 out of 10) policy trained under 50 demos

NC-pretrained Policy Test 0: Success.

Baseline Policy Test 0: Success.

NC-pretrained Policy Test 1: Success.

Baseline Policy Test 1: Failure.

NC-pretrained Policy Test 2: Failure.

Baseline Policy Test 2: Failure.

NC-pretrained Policy Test 3: Failure.

Baseline Policy Test 3: Failure.

NC-pretrained Policy Test 4: Failure.

Baseline Policy Test 4: Failure.

NC-pretrained Policy Test 5: Failure.

Baseline Policy Test 5: Failure.

NC-pretrained Policy Test 6: Failure.

Baseline Policy Test 6: Failure.

NC-pretrained Policy Test 7: Success.

Baseline Policy Test 7: Failure.

NC-pretrained Policy Test 8: Failure.

Baseline Policy Test 8: Failure.

NC-pretrained Policy Test 9: Success.

Baseline Policy Test 9: Success.

BibTeX

  @article{qi2024control_oriented_NC,
    title={{Control-oriented Clustering of Visual Latent Representation}},
    author={Qi, Han and Yin, Haocheng and Yang, Heng},
    journal={arXiv preprint arXiv:2410.05063},
    year={2024},
  }