To train a neural network, GeoDict-AI uses the UNet architecture (Ronneberger et al., 2015), which implements an image-to-image transformation. As illustrated below, the UNet has a U-shape (hence the name), with a convolutional constricting branch on the left and a deconvolutional expanding branch on the right. The constricting branch analyzes and simplifies the input structure into features. The expanding branch uses these features to decide, for example, which of the input voxels to label as binder.
The UNet always works on a fixed input window size and produces results for a smaller output window centered within the input window. To analyze a given 3D structure, which is usually larger than the input window, the window is automatically shifted across the whole structure and the network is applied at each window location, yielding results for the whole domain.
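The tiling of the domain by output windows can be made concrete with a minimal sketch. The function name and the exact boundary handling are illustrative assumptions; the 52-voxel input / 12-voxel output sizes anticipate the depth-2 example worked out later in this section:

```python
def output_window_origins(axis_size, input_win=52, output_win=12):
    """Origins of the output windows along one axis, so that the output
    windows tile the whole domain. Each input window extends `margin`
    voxels beyond its output window on every side, providing the context
    the convolutions need (boundary handling is an assumption here)."""
    margin = (input_win - output_win) // 2   # 20 voxels of context per side
    return margin, list(range(0, axis_size, output_win))

margin, origins = output_window_origins(100)
print(margin)        # 20
print(origins[:4])   # [0, 12, 24, 36]
```

The same origins apply independently along the Y- and Z-axes, so the network is evaluated once per 3D grid position.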
In the diagram, the given size for this input window is 52 voxels in X-, Y- and Z-direction. The window contents are encoded in the left branch of the UNet to an abstract feature map. This feature map is then decoded within the right branch of the UNet to obtain the output image.
The diagram below shows a UNet with depth=2. The depth is defined by the number of max-pool operations (and corresponding deconvolutions) leading from one layer to the next. Thus, a depth of 2 results in three layers.
In each layer of the UNet, two convolutions are applied to the current cutout, encoding it into discriminative features. The first convolution in the first layer encodes the input image into a given number of features, or feature maps, describing the patterns in the cutout. In the diagram above there are 16 features in the first layer.
A convolution analyzes a cutout using kernels. A kernel is a feature detector, i.e., a 3x3x3 voxel box with a defined pattern, for example to recognize edges. In the figure below, three exemplary kernels are shown. The features are learned by the network, so each trained network produces its own set of features.
Kernel Examples
For each kernel, the convolution encodes the considered structure into one feature map. For this, the kernel scans the cutout voxel by voxel, and for each 3x3x3 voxel group it measures the similarity between the kernel and the underlying image cutout using the dot product between the kernel values and the image values. The resulting so-called feature map indicates where the feature is found in the input image. For the given network topology, we have 16 features in the first layer, so this process is repeated for each feature/kernel, resulting in 16 different feature maps.
As the context information from surrounding voxels has to be taken into account for the convolution, the kernel has to stay within the bounds of the input window. Thus, after each convolution the resulting cutout decreases, losing the border voxels. In the UNet diagram above, this results in a cutout size of 48x48x48 voxels after the two convolutions in the first layer.
In the figure below, one kernel is applied to a 4x4x4 cutout, resulting in a 2x2x2 feature map where large values correspond to a strong detection of the feature. Regions where the feature is not present will result in low or even negative values in the feature map.
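The dot-product scan described above can be sketched as a "valid" 3D convolution in NumPy. This is a minimal illustration, not GeoDict-AI's actual implementation; like most CNN frameworks, it computes a cross-correlation (the kernel is not flipped):

```python
import numpy as np

def conv3d_valid(volume, kernel):
    """Slide the kernel over the volume and take the dot product at each
    position. The output shrinks by (kernel size - 1) per dimension."""
    k = kernel.shape[0]
    out_size = volume.shape[0] - k + 1
    out = np.empty((out_size,) * 3)
    for x in range(out_size):
        for y in range(out_size):
            for z in range(out_size):
                out[x, y, z] = np.sum(volume[x:x+k, y:y+k, z:z+k] * kernel)
    return out

cutout = np.random.rand(4, 4, 4)   # 4x4x4 cutout as in the figure
kernel = np.random.rand(3, 3, 3)   # one 3x3x3 feature detector
fmap = conv3d_valid(cutout, kernel)
print(fmap.shape)                  # (2, 2, 2)
```

Applying this twice to a 52x52x52 window yields the 48x48x48 cutout mentioned above, since each 3x3x3 convolution removes one border voxel on each side.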
The next layer of the UNet is then reached with a Max Pool operation. Max pooling halves each spatial dimension by replacing each 2x2x2 block of voxel values by a single voxel containing the maximum value within the block. In our example, this means that the second layer starts with a cutout of 24x24x24 voxels.
Max-Pool Operation
This provides the neural network with a limited form of translational invariance, meaning that it becomes tolerant to slight variations in the location of detected features at the expense of spatial accuracy. We will see later how the UNet is able to compensate for that.
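The 2x2x2 max-pool halving can be sketched with a NumPy reshape trick (illustrative only; it assumes even spatial dimensions):

```python
import numpy as np

def max_pool_2x2x2(volume):
    """Replace each 2x2x2 block by its maximum, halving every dimension."""
    s = volume.shape[0] // 2
    # split each axis of size 2s into (s, 2): axes 0/2/4 index the blocks,
    # axes 1/3/5 index the positions within each block
    blocks = volume.reshape(s, 2, s, 2, s, 2)
    return blocks.max(axis=(1, 3, 5))

vol = np.arange(4 ** 3, dtype=float).reshape(4, 4, 4)
pooled = max_pool_2x2x2(vol)
print(pooled.shape)     # (2, 2, 2)
print(pooled[0, 0, 0])  # 21.0 : the maximum of vol[0:2, 0:2, 0:2]
```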
In each new layer of the UNet the first convolution doubles the number of features. Thus, the input image is encoded into feature representations at multiple different levels. This is followed by a second convolution with the same number of output features.
At the end of the last layer, we start moving up the right (expanding) branch of the UNet, which progressively increases the resolution of the image while at the same time reducing the number of features, essentially mirroring the transformation seen in the left (constricting) branch.
As we move up the right branch, the max-pool operation is replaced by a so-called deconvolution (transposed convolution), which doubles the spatial resolution and halves the number of feature maps. In detail, a deconvolution first applies an up-sampling to the features: each voxel is expanded into 8 voxels with the same value, doubling the dimensions in all three directions.
Up-sampling
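The voxel-duplication up-sampling can be sketched with `np.repeat` (illustrative only):

```python
import numpy as np

def upsample_2x(volume):
    """Nearest-neighbour up-sampling: each voxel becomes a 2x2x2 block
    of identical values, doubling every spatial dimension."""
    return volume.repeat(2, axis=0).repeat(2, axis=1).repeat(2, axis=2)

fmap = np.array([[[1.0, 2.0], [3.0, 4.0]],
                 [[5.0, 6.0], [7.0, 8.0]]])   # 2x2x2 feature map
up = upsample_2x(fmap)
print(up.shape)                   # (4, 4, 4)
print(up[0, 0, 0], up[1, 1, 1])   # 1.0 1.0 : both come from the same source voxel
```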
Then, a regular convolution concretizes the features by "painting" the voxels of a feature. Here, smaller kernels are used, i.e., 2x2x2 voxel boxes with defined patterns. The feature values are then multiplied with the kernel values as shown in the figure below; the example uses a kernel containing only the values -1 and 1. The kernel is applied to each 2x2x2 voxel group in the input feature map, and the dot product gives the value at the corresponding position in the new feature map.
Convolution of an up-sampled feature map
In this case, however, to keep the volume dimensions, zeros are appended to the input feature map on three sides before it is multiplied with the kernel (padding).
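Combining the up-sampled feature map with the padded 2x2x2 convolution gives a sketch like the following. Whether the zeros are appended at the low- or high-index end of each dimension is an implementation detail we assume here:

```python
import numpy as np

def deconv_conv_step(up, kernel):
    """2x2x2 convolution after up-sampling. One plane of zeros is appended
    per dimension (padding) so the output keeps the up-sampled size.
    The padding side is an assumption for illustration."""
    padded = np.pad(up, ((0, 1), (0, 1), (0, 1)))   # zeros on three sides
    n = up.shape[0]
    out = np.empty_like(up)
    for x in range(n):
        for y in range(n):
            for z in range(n):
                out[x, y, z] = np.sum(padded[x:x+2, y:y+2, z:z+2] * kernel)
    return out

up = np.ones((4, 4, 4))                            # up-sampled feature map
kernel = np.array([[[-1.0, 1.0], [1.0, -1.0]],
                   [[1.0, -1.0], [-1.0, 1.0]]])    # only -1 and 1, as in the figure
out = deconv_conv_step(up, kernel)
print(out.shape)      # (4, 4, 4) -- dimensions are preserved
print(out[0, 0, 0])   # 0.0 : the +1s and -1s cancel on the all-ones input
```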
Afterwards, the resulting feature maps are concatenated with the high-resolution cutouts from the last features of the same layer in the left branch, doubling the number of feature maps. This way, small, high-resolution features are combined with abstract, low-resolution features carrying context information. This combination helps the network build a more precise output, i.e., to "paint" the voxels correctly.
Since the cutout in the right branch is still convolved twice in each layer, the final output window is smaller than the input window. In our example with a UNet of depth 2 and a window size of 52x52x52 voxels, the resulting cutout has a size of 12x12x12 voxels. This last cutout is, in fact, an image again instead of a feature map.
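The window-size arithmetic (52 → 12 for depth 2) can be traced with a small helper, a sketch under the stated assumptions: two valid 3x3x3 convolutions per layer, and a deconvolution whose padded 2x2x2 convolution preserves the up-sampled size:

```python
def unet_output_size(input_size, depth=2, kernel=3):
    """Trace the spatial window size through the UNet described above."""
    loss = 2 * (kernel - 1)      # two valid convolutions per layer lose 2*(k-1)
    size = input_size
    for _ in range(depth):       # constricting branch
        size -= loss             # two 3x3x3 convolutions: e.g. 52 -> 48
        size //= 2               # max pool:               e.g. 48 -> 24
    size -= loss                 # two convolutions in the bottom layer
    for _ in range(depth):       # expanding branch
        size *= 2                # up-sampling (padded 2x2x2 conv keeps size)
        size -= loss             # two valid convolutions
    return size

print(unet_output_size(52))   # 12
```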
At this point we have described the topology of the neural network, but the specific values within the convolutional kernels are not yet specified. These so-called weights are determined during the training process using an optimizer such as Adam (Kingma and Ba, 2015). In our supervised learning setup, we specify pairs of input and desired output images. The optimizer applies the neural network to a given input image, computes the difference between the output and the desired output (the so-called loss), and adjusts the weights of the neural network to reduce that error. This process is repeated iteratively until the loss converges to an acceptable value.
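The weight-update idea can be illustrated on a toy one-parameter model with plain gradient descent. This is a deliberately minimal sketch: real UNet training optimizes millions of kernel weights with Adam, but the loop structure is the same:

```python
import numpy as np

# Toy supervised setup: one scalar "weight" w scales the input, the loss
# is the mean squared error against the desired output, and gradient
# descent adjusts w to reduce it. The "true" weight is 3.
rng = np.random.default_rng(0)
x = rng.random(100)          # input "image" (flattened)
target = 3.0 * x             # desired output
w = 0.0                      # initial weight
lr = 0.5                     # learning rate
for _ in range(100):
    pred = w * x                              # apply the "network"
    loss = np.mean((pred - target) ** 2)      # compute the loss
    grad = np.mean(2 * (pred - target) * x)   # dLoss/dw
    w -= lr * grad                            # adjust the weight
print(round(w, 3))   # converges to 3.0
```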