Explain the meaning of self-supervised learning.
What is self-supervised learning in machine learning? How is it different from supervised learning?
Introduction
The term self-supervised learning (SSL) has been used, sometimes with different meanings, in different contexts and fields, such as representation learning [1], neural networks, robotics [2], natural language processing, and reinforcement learning. In all cases, the basic idea is to automatically generate some kind of supervisory signal to solve some task (typically, to learn representations of the data or to automatically label a dataset). I will describe what SSL means more specifically in three contexts: representation learning, neural networks, and robotics.
Representation learning
The term self-supervised learning has been widely used to refer to techniques that do not use human-annotated datasets to learn (visual) representations of the data (i.e. representation learning).
Example In [1], two patches are randomly selected and cropped from an unlabelled image, and the goal is to predict the relative position of the two patches. Of course, we know the relative position of the two patches once we have chosen them (i.e. we can keep track of their centers), so, in this case, the relative position is the automatically generated supervisory signal. The idea is that, to solve this task (known as a pretext or auxiliary task in the literature [3, 4, 5, 6]), the neural network needs to learn features of the images. These learned representations can then be used to solve the so-called downstream tasks, i.e. the tasks you are actually interested in (e.g. object detection or semantic segmentation).
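To make this more concrete, here is a minimal sketch (in Python/NumPy) of how such patch pairs and their automatically generated labels could be produced. The patch size, the function name sample_patch_pair, and the omission of details from [1] (such as the gap between patches used to avoid trivial cues) are illustrative assumptions, not the exact setup of that paper.

```python
import numpy as np

# The 8 possible positions of the neighbouring patch relative to the central
# patch (row offset, column offset); the index of the chosen offset is the label.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1),
           ( 0, -1),          ( 0, 1),
           ( 1, -1), ( 1, 0), ( 1, 1)]

def sample_patch_pair(image, patch_size=32, rng=np.random.default_rng()):
    """Return (central_patch, neighbour_patch, label) from an unlabelled image."""
    h, w = image.shape[:2]
    # Pick a central patch location that leaves room for all 8 neighbours.
    row = rng.integers(patch_size, h - 2 * patch_size)
    col = rng.integers(patch_size, w - 2 * patch_size)
    central = image[row:row + patch_size, col:col + patch_size]

    # Pick one of the 8 neighbouring positions; its index is the (free) label.
    label = int(rng.integers(0, 8))
    dr, dc = OFFSETS[label]
    r2, c2 = row + dr * patch_size, col + dc * patch_size
    neighbour = image[r2:r2 + patch_size, c2:c2 + patch_size]
    return central, neighbour, label

# Example: generate one training pair from a random stand-in "image".
image = np.random.rand(256, 256)
central, neighbour, label = sample_patch_pair(image)
print(central.shape, neighbour.shape, label)  # (32, 32) (32, 32) and a label in 0..7
```

A network trained to predict the label from the pair of patches has to pick up on visual structure (edges, object parts, spatial layout), which is what makes the learned features reusable.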
So, you first learn representations of the data (by SSL pre-training), then you can transfer these learned representations to solve the task that you actually want to solve, and you can do this by fine-tuning the neural network that contains the learned representations on a labeled (but smaller) dataset, i.e. you can use SSL for transfer learning.
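For illustration, here is a minimal PyTorch-style sketch of that transfer step; the architecture, the number of classes, and the choice to fine-tune all weights (rather than freezing the encoder) are assumptions made just for the example.

```python
import torch
import torch.nn as nn

# Stand-in for the encoder obtained from SSL pre-training on the pretext task.
encoder = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
head = nn.Linear(16, 10)              # new head for the downstream task (e.g. 10 classes)
model = nn.Sequential(encoder, head)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

# One fine-tuning step on a (small) labeled batch for the downstream task.
images = torch.randn(8, 3, 64, 64)    # stand-in labeled images
labels = torch.randint(0, 10, (8,))   # stand-in labels
loss = loss_fn(model(images), labels)
loss.backward()
optimizer.step()
```

If the labeled downstream dataset is very small, a common alternative is to freeze the encoder and train only the new head.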
Neural networks
Some neural networks, for example autoencoders (AEs) [7], are sometimes called self-supervised learning tools. In fact, you can train AEs without images that have been manually labeled by a human. More concretely, consider a denoising AE, whose goal is to reconstruct the original image when given a noisy version of it. During training, you actually have the original image, given that you have a dataset of uncorrupted images and you corrupt these images with some noise yourself, so you can calculate some kind of distance between the original image and the reconstruction of its noisy version, where the original image is the supervisory signal. In this sense, AEs are self-supervised learning tools, but it's more common to say that AEs are unsupervised learning tools, so SSL has also been used to refer to unsupervised learning techniques.
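Here is a minimal sketch of this training setup; the fully connected architecture, the Gaussian noise, and the MSE loss are illustrative choices, not the ones from any particular paper.

```python
import torch
import torch.nn as nn

# A tiny denoising autoencoder: the clean image itself is the supervisory
# signal, so no human-provided labels are needed.
autoencoder = nn.Sequential(
    nn.Linear(784, 128), nn.ReLU(),    # encoder
    nn.Linear(128, 784), nn.Sigmoid()  # decoder
)
optimizer = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

clean = torch.rand(32, 784)                    # stand-in batch of clean (flattened) images
noisy = clean + 0.1 * torch.randn_like(clean)  # we corrupt the images ourselves

reconstruction = autoencoder(noisy)
loss = loss_fn(reconstruction, clean)          # distance between reconstruction and clean image
loss.backward()
optimizer.step()
```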
Robotics
In [2], the training data is automatically but approximately labeled by finding and exploiting the relations or correlations between inputs coming from different sensor modalities (and this technique is called SSL by the authors). So, as opposed to representation learning or autoencoders, in this case, an actual labeled dataset is produced automatically.
Example Consider a robot that is equipped with a proximity sensor (a short-range sensor capable of detecting objects immediately in front of the robot) and a camera (a long-range sensor, but one that does not provide a direct way of detecting objects). You can also assume that this robot is capable of performing odometry. An example of such a robot is Mighty Thymio.
Consider now the task of detecting objects in front of the robot at longer ranges than the range the proximity sensor allows. In general, we could train a CNN to achieve that. However, to train such a CNN, in supervised learning, we would first need a labeled dataset, which contains labeled images (or videos), where the labels could e.g. be "object in the image" or "no object in the image". In supervised learning, this dataset would need to be manually labeled by a human, which clearly would require a lot of work. To overcome this issue, we can use a self-supervised learning approach. In this example, the basic idea is to associate the output of the proximity sensor at a time step t′ > t with the output of the camera at the earlier time step t.
More specifically, suppose that the robot is initially at coordinates (x, y) (on the plane), at time step t. At this point, we still do not have enough info to label the output of the camera (at the same time step t). Suppose now that, at time step t′, the robot has moved forward to position (x′, y′), i.e. roughly into the area that the camera was observing at time t (we know this thanks to odometry). At time step t′, the output of the proximity sensor will e.g. be "object in front of the robot" or "no object in front of the robot". Without loss of generality, suppose that the output of the proximity sensor at t′ > t is "no object in front of the robot"; then the label associated with the output of the camera (an image frame) at time t will be "no object in front of the robot".
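A minimal sketch of this automatic labelling step is given below; the data layout (per-time-step lists of poses, camera frames, and binary proximity readings), the look-ahead distance, and the function name auto_label are assumptions made for illustration, and the matching rule is a simplification of the procedure in [2].

```python
import numpy as np

def auto_label(poses, frames, proximity, lookahead=0.5, tol=0.05):
    """poses[t] = (x, y) from odometry, frames[t] = camera frame at time step t,
    proximity[t] = True if the proximity sensor detects an object at time step t.
    The proximity reading at the later time t2, when the robot has moved roughly
    `lookahead` metres from its pose at t, becomes the label of the frame at t."""
    dataset = []
    for t, (x, y) in enumerate(poses):
        for t2 in range(t + 1, len(poses)):
            x2, y2 = poses[t2]
            if abs(np.hypot(x2 - x, y2 - y) - lookahead) < tol:
                dataset.append((frames[t], proximity[t2]))  # label frame at t with reading at t2
                break
    return dataset

# Example with fake logs: the robot moves 0.1 m per time step along x and only
# detects an object with the proximity sensor near the end of the run.
poses = [(0.1 * t, 0.0) for t in range(20)]
frames = [np.zeros((64, 64)) for _ in range(20)]  # stand-in camera frames
proximity = [t >= 15 for t in range(20)]
labelled = auto_label(poses, frames, proximity)
print(len(labelled))  # number of frames that received an automatic label
```

The resulting (frame, label) pairs can then be used to train the CNN exactly as if the dataset had been labeled by a human.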