We initiate a study of the geometry of the visual representation space --the information channel from the vision encoder to the action decoder-- in an image-based control pipeline learned from behavior cloning. Inspired by the phenomenon of neural collapse (NC) in image classification (Papyan et al., 2020), we investigate whether a similar law of clustering emerges in the visual representation space. Since image-based control is a regression task without explicitly defined classes, the central piece of the puzzle lies in determining according to what implicit classes the visual features cluster, if such a law exists.
Focusing on image-based planar pushing, we posit the most important role of the visual representation in a control task is to convey a goal to the action decoder; for instance, "rotate the object clockwise and push it northeast". We then classify training samples of expert demonstrations into eight control-oriented classes --based on (a) the relative pose between the object and the target in the input or (b) the relative pose of the object induced by expert actions in the output-- where one class corresponds to one relative pose orthant (Repos). Across four different instantiations of the vision-based control architecture, we report the prevalent emergence of control-oriented clustering (similar to NC) in the visual representation space according to the eight Repos.
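As a rough illustration of the labeling idea, the sketch below assigns a planar relative pose to one of eight orthants by the signs of its three components. The function name `repos_labels`, the component order (dx, dy, dtheta), the treatment of zero as positive, and the bit encoding of the label are all assumptions made for this example; the paper defines the actual convention.

```python
import numpy as np

def repos_labels(rel_poses):
    """Map planar relative poses (dx, dy, dtheta) to orthant labels in 0..7.

    Assumptions (not from the source): component order, zero counted as
    non-negative, and the 3-bit encoding 4*sign(dx) + 2*sign(dy) + sign(dtheta).
    """
    rel_poses = np.asarray(rel_poses, dtype=float).reshape(-1, 3)
    # One bit per component: 1 if the component is non-negative, else 0.
    bits = (rel_poses >= 0).astype(int)
    # Combine the three sign bits into a single integer class index.
    return bits @ np.array([4, 2, 1])

# Example: positive dx, negative dy, counter-clockwise rotation.
labels = repos_labels([[0.3, -0.1, 0.5]])  # array([5])
```

Either choice of relative pose from the text, (a) object-to-target in the input or (b) the pose change induced by the expert actions, could be passed to the same function.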
Beyond empirical observation, we show such a law of clustering can be leveraged as an algorithmic tool to improve test-time performance when training a policy with a limited amount of expert demonstrations. In particular, we pretrain the vision encoder using NC as a regularizer to encourage control-oriented clustering of the visual features. Surprisingly, such an NC-pretrained vision encoder, when finetuned end-to-end with the action decoder, boosts the test-time performance by 10% to 35% in the low-data regime. Real-world vision-based planar pushing experiments confirm the surprising advantage of control-oriented visual representation pretraining.
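To make the idea of an NC-style regularizer concrete, here is one plausible form, not necessarily the loss used in the paper. It penalizes within-class scatter and the gap between the Gram matrix of normalized, globally centered class means and that of a simplex equiangular tight frame. The function name `nc_regularizer`, the weight `weight_etf`, and the requirement that every class appears in the batch are assumptions of this sketch. It only evaluates the penalty in NumPy; actual pretraining would need the same expression in an autodiff framework.

```python
import numpy as np

def nc_regularizer(features, labels, num_classes=8, weight_etf=1.0):
    """Hypothetical NC-style penalty on a batch of visual features.

    features: (N, D) array; labels: (N,) integer class indices in 0..num_classes-1.
    Assumes every class occurs at least once in the batch and that no
    centered class mean is exactly zero.
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    global_mean = features.mean(axis=0)
    means = np.stack([features[labels == c].mean(axis=0) for c in range(num_classes)])

    # Within-class scatter: mean squared distance of each feature to its class mean.
    within = np.mean(np.sum((features - means[labels]) ** 2, axis=1))

    # Gram matrix of unit-normalized, globally centered class means.
    centered = means - global_mean
    unit = centered / np.linalg.norm(centered, axis=1, keepdims=True)
    gram = unit @ unit.T

    # Simplex ETF Gram matrix: 1 on the diagonal, -1/(K-1) off the diagonal.
    k = num_classes
    target = (k / (k - 1)) * (np.eye(k) - np.ones((k, k)) / k)
    etf_gap = np.mean((gram - target) ** 2)

    return float(within + weight_etf * etf_gap)
```

The penalty is zero only when features coincide with their class means and the class means form a simplex ETF, which requires a feature dimension of at least K-1.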
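Neural collapse (NC) is a clustering phenomenon first observed by Papyan et al. (2020). It describes an elegant geometric structure of the last-layer features and classifier of a well-trained model in classification tasks. In particular, NC refers to a set of four manifestations in the representation space (i.e., the penultimate layer):
(NC1) Variability collapse: within-class feature variability vanishes as features concentrate at their class means.
(NC2) Convergence to a simplex equiangular tight frame (ETF): the globally centered class means become equal in length and equally, maximally separated in angle.
(NC3) Self-duality: the last-layer classifier weights align with the class means, up to rescaling.
(NC4) Nearest class-center decision: the classifier's prediction reduces to choosing the nearest class mean.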
Evaluation example demo for the last (300th) training epoch
Evaluation score versus training epoch
We evaluate (NC1) using the class-distance normalized variance (CDNV) metric, which depends on the ratio of within-class to between-class variability. We evaluate (NC2) using the standard deviation of the angles and of the lengths spanned by the cluster mean vectors. We do not evaluate (NC3) and (NC4) because they require a linear classifier from the representation space to the output, which does not exist in our vision-based control setup. For comprehensive details, including the mathematical definitions of all three metrics, please refer to Section 3 of the paper.
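The three metrics can be sketched in a few lines of NumPy. This is an illustrative version under stated assumptions, not the paper's reference implementation: CDNV follows the commonly used pairwise form (sum of the two class variances divided by twice the squared distance between class means, averaged over class pairs), class means are centered by the global feature mean before angles and lengths are measured, and angles are in radians. The helper names `class_stats`, `cdnv`, and `nc2_stds` are hypothetical; Section 3 of the paper gives the exact definitions.

```python
import numpy as np
from itertools import combinations

def class_stats(features, labels):
    """Return per-class means (K, D) and per-class variances (K,)."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    means = np.stack([features[labels == c].mean(axis=0) for c in classes])
    # Variance of a class: mean squared distance of its features to the class mean.
    variances = np.array([
        np.mean(np.sum((features[labels == c] - m) ** 2, axis=1))
        for c, m in zip(classes, means)
    ])
    return means, variances

def cdnv(features, labels):
    """Average pairwise CDNV; lower values mean tighter, better separated clusters."""
    means, variances = class_stats(features, labels)
    ratios = [
        (variances[i] + variances[j]) / (2.0 * np.sum((means[i] - means[j]) ** 2))
        for i, j in combinations(range(len(means)), 2)
    ]
    return float(np.mean(ratios))

def nc2_stds(features, labels):
    """Std of pairwise angles and std of lengths of globally centered class means."""
    means, _ = class_stats(features, labels)
    centered = means - np.asarray(features, dtype=float).mean(axis=0)
    lengths = np.linalg.norm(centered, axis=1)
    unit = centered / lengths[:, None]
    cosines = np.clip(unit @ unit.T, -1.0, 1.0)
    upper = np.triu_indices(len(means), k=1)  # each unordered pair once
    angles = np.arccos(cosines[upper])
    return float(np.std(angles)), float(np.std(lengths))
```

Here `features` would be the vision encoder outputs for the training samples and `labels` their Repos classes. All three quantities approaching zero indicates stronger control-oriented clustering.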
Class-Distance Normalized Variance
Standard Deviation of Angle Spanned by Cluster Mean Vectors
Standard Deviation of Length of Cluster Mean Vectors
(Ours) Real-world evaluation trajectory of test 8 for NC-pretrained policy trained under 100 demos.
Real-world evaluation trajectory of test 8 for baseline policy trained under 100 demos.
(Ours) Real-world evaluation video of test 8 for NC-pretrained policy trained under 100 demos. Success.
Real-world evaluation video of test 8 for baseline policy trained under 100 demos. Failure.
Our NC-pretrained policy consistently outperforms the baseline policy. It succeeds on every test that the baseline completes, and additionally succeeds on a subset of tests (highlighted in background) where the baseline fails. In this sense the improvement is monotonic: across the test scenarios, NC pretraining never turns a baseline success into a failure.
NC-pretrained Policy Test 0: Success.
Baseline Policy Test 0: Success.
NC-pretrained Policy Test 1: Success.
Baseline Policy Test 1: Success.
NC-pretrained Policy Test 2: Failure.
Baseline Policy Test 2: Failure.
NC-pretrained Policy Test 3: Success.
Baseline Policy Test 3: Success.
NC-pretrained Policy Test 4: Success.
Baseline Policy Test 4: Success.
NC-pretrained Policy Test 5: Success.
Baseline Policy Test 5: Failure.
NC-pretrained Policy Test 6: Failure.
Baseline Policy Test 6: Failure.
NC-pretrained Policy Test 7: Success.
Baseline Policy Test 7: Success.
NC-pretrained Policy Test 8: Success.
Baseline Policy Test 8: Failure.
NC-pretrained Policy Test 9: Success.
Baseline Policy Test 9: Failure.
NC-pretrained Policy Test 0: Success.
Baseline Policy Test 0: Success.
NC-pretrained Policy Test 1: Success.
Baseline Policy Test 1: Failure.
NC-pretrained Policy Test 2: Failure.
Baseline Policy Test 2: Failure.
NC-pretrained Policy Test 3: Failure.
Baseline Policy Test 3: Failure.
NC-pretrained Policy Test 4: Failure.
Baseline Policy Test 4: Failure.
NC-pretrained Policy Test 5: Failure.
Baseline Policy Test 5: Failure.
NC-pretrained Policy Test 6: Failure.
Baseline Policy Test 6: Failure.
NC-pretrained Policy Test 7: Success.
Baseline Policy Test 7: Failure.
NC-pretrained Policy Test 8: Failure.
Baseline Policy Test 8: Failure.
NC-pretrained Policy Test 9: Success.
Baseline Policy Test 9: Success.
@article{qi2024control_oriented_NC,
title={{Control-oriented Clustering of Visual Latent Representation}},
author={Qi, Han and Yin, Haocheng and Yang, Heng},
journal={arXiv preprint arXiv:2410.05063},
year={2024},
}