When can I use 3D convolution?
3D convolutions are used when you want to extract features across 3 dimensions, or to learn relationships between all 3 dimensions. Essentially it's the same as 2D convolution, but the kernel now moves in 3 dimensions, which lets it capture dependencies along the third dimension as well and changes the output shape after convolution. The kernel of a 3D convolution moves in 3 dimensions only when its depth is less than the feature map's depth. A 2D convolution on 3D data, on the other hand, means the kernel traverses in 2 dimensions only; this happens when the kernel's depth equals the feature map's depth (i.e., the number of channels).
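To make that kernel-depth rule concrete, here is a minimal shape check. It assumes PyTorch, and the single-channel 16x64x64 volume and kernel sizes are purely illustrative:

```python
import torch
import torch.nn as nn

volume = torch.randn(1, 1, 16, 64, 64)      # (N, C, D, H, W): depth-16 volume

# 3D convolution: kernel depth (3) < feature-map depth (16), so the kernel
# also slides along the depth axis and the output keeps a depth dimension.
conv3d = nn.Conv3d(in_channels=1, out_channels=8, kernel_size=3)
print(conv3d(volume).shape)                  # torch.Size([1, 8, 14, 62, 62])

# 2D convolution on the same data: depth becomes the channel axis, so the
# kernel's depth equals the feature map's depth and it only slides in H and W.
stacked = volume.squeeze(1)                  # (N, D, H, W): depth as channels
conv2d = nn.Conv2d(in_channels=16, out_channels=8, kernel_size=3)
print(conv2d(stacked).shape)                 # torch.Size([1, 8, 62, 62])
```

Note how the 3D output retains a (shrunken) depth dimension while the 2D output collapses it entirely.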
Some use cases that make this clearer: MRI scans, where the relationship between a stack of image slices has to be understood; and low-level feature extraction from spatio-temporal data, such as video for gesture recognition, weather forecasting, etc. (3D CNNs are used as low-level feature extractors only over multiple short intervals, since 3D CNNs fail to capture long-term spatio-temporal dependencies; for more on that, check out ConvLSTM or an alternate perspective here.)
Most CNN models that learn from video data use a 3D CNN as a low-level feature extractor, as sketched below.
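A rough, hypothetical sketch of that pattern (again assuming PyTorch; the clip length, layer sizes, and pooling choice are placeholders): split a long video into short temporal windows, run a small 3D-conv block over each window, and hand the resulting per-clip features to a sequence model such as a ConvLSTM.

```python
import torch
import torch.nn as nn

video = torch.randn(1, 3, 64, 112, 112)      # (N, C, T, H, W): 64 RGB frames
clip_len = 8                                  # short interval the 3D CNN sees

feature_extractor = nn.Sequential(
    nn.Conv3d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool3d(1),                  # one feature vector per clip
)

clips = video.split(clip_len, dim=2)          # 8 clips of 8 frames each
features = [feature_extractor(clip).flatten(1) for clip in clips]
print(torch.stack(features, dim=1).shape)     # torch.Size([1, 8, 16])
```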
In the example you mentioned above regarding the number 5, 2D convolutions would probably perform better, as you're treating each channel's intensity as an aggregate of the information it holds, so the learning would be almost the same as on a black-and-white image. Using 3D convolution here, on the other hand, would try to learn relationships between the channels that do not exist in this case! (Also, 3D convolution on an image with a depth of 3 would require a very uncommon kernel, especially for this use case.)
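To illustrate that last point, here is a hedged comparison, assuming PyTorch and a 28x28 RGB digit image; the kernel sizes are arbitrary. A 2D kernel already spans all 3 channels at once, whereas forcing a 3D convolution over them requires an unusual kernel depth (2 here) just so the kernel has room to move along the channel axis:

```python
import torch
import torch.nn as nn

rgb_image = torch.randn(1, 3, 28, 28)        # (N, C, H, W), e.g. a digit image

# Standard 2D conv: the kernel covers all 3 channels at once and only
# slides over H and W.
conv2d = nn.Conv2d(3, 8, kernel_size=3)
print(conv2d(rgb_image).shape)               # torch.Size([1, 8, 26, 26])

# Treating the channels as a depth axis instead: the kernel depth must be
# smaller than 3 (here 2) for the kernel to move along it at all.
as_volume = rgb_image.unsqueeze(1)           # (N, 1, D=3, H, W)
conv3d = nn.Conv3d(1, 8, kernel_size=(2, 3, 3))
print(conv3d(as_volume).shape)               # torch.Size([1, 8, 2, 26, 26])
```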