VL_NNLOSS
 CNN categorical or attribute loss.
Y = VL_NNLOSS(X, C) computes the loss incurred by the prediction scores X given the categorical labels C.
The prediction scores X are organised as a field of prediction vectors, represented by a H x W x D x N array. The first two dimensions, H and W, are spatial and correspond to the height and width of the field; the third dimension D is the number of categories or classes; finally, the dimension N is the number of data items (images) packed in the array.
While often one has H = W = 1, the case W, H > 1 is useful in
dense labelling problems such as image segmentation. In the latter
case, the loss is summed across pixels (contributions can be
weighed using the InstanceWeights
option described below).
The array C contains the categorical labels. In the simplest case, C is an array of integers in the range [1, D] with N elements specifying one label for each of the N images. If H, W > 1, the same label is implicitly applied to all spatial locations.
In the second form, C has dimension H x W x 1 x N and specifies a categorical label for each spatial location.
In the third form, C has dimension H x W x D x N and specifies attributes rather than categories. Here elements in C are either

1 or 1 and C, where +1 denotes that an attribute is present and

1 that it is not. The key difference is that multiple attributes
can be active at the same time, while categories are mutually
exclusive. By default, the loss is summed across attributes
(unless otherwise specified using the InstanceWeights
option
described below).
DZDX = VL_NNLOSS(X, C, DZDY) computes the derivative of the block projected onto the output derivative DZDY. DZDX and DZDY have the same dimensions as X and Y respectively.
VL_NNLOSS() supports several loss functions, which can be selected
by using the option type
described below. When each scalar c in
C is interpreted as a categorical label (first two forms above),
the following losses can be used:

Classification error [
classerror
]L(X,c) = (argmax_q X(q) ~= c). Note that the classification error derivative is flat; therefore this loss is useful for assessment, but not for training a model.

TopK classification error [
topkerror
]L(X,c) = (rank X(c) in X <= K). The top rank is the one with highest score. For K=1, this is the same as the classification error. K is controlled by the
topK
option. 
Log loss [
log
]L(X,c) =  log(X(c)). This function assumes that X(c) is the predicted probability of class c (hence the vector X must be non negative and sum to one).

Softmax log loss (multinomial logistic loss) [
softmaxlog
]L(X,c) =  log(P(c)) where P(c) = exp(X(c)) / sum_q exp(X(q)). This is the same as the
log
loss, but renormalizes the predictions using the softmax function. 
Multiclass hinge loss [
mhinge
]L(X,c) = max{0, 1  X(c)}. This function assumes that X(c) is the score margin for class c against the other classes. See also the
mmhinge
loss below. 
Multiclass structured hinge loss [
mshinge
]L(X,c) = max{0, 1  M(c)} where M(c) = X(c)  max_{q ~= c} X(q). This is the same as the
mhinge
loss, but computes the margin between the prediction scores first. This is also known the CrammerSinger loss, an example of a structured prediction loss.
When C is a vector of binary attribures c in (+1,1), each scalar prediction score x is interpreted as voting for the presence or absence of a particular attribute. The following losses can be used:

Binary classification error [
binaryerror
]L(x,c) = (sign(x  t) ~= c). t is a threshold that can be specified using the
threshold
option and defaults to zero. If x is a probability, it should be set to 0.5. 
Binary log loss [
binarylog
]L(x,c) =  log(c(x0.5) + 0.5). x is assumed to be the probability that the attribute is active (c=+1). Hence x must be a number in the range [0,1]. This is the binary version of the
log
loss. 
Logistic log loss [
logistic
]L(x,c) = log(1 + exp( cx)). This is the same as the
binarylog
loss, but implicitly normalizes the score x into a probability using the logistic (sigmoid) function: p = sigmoid(x) = 1 / (1 + exp(x)). This is also equivalent tosoftmaxlog
loss where class c=+1 is assigned score x and class c=1 is assigned score 0. 
Hinge loss [
hinge
]L(x,c) = max{0, 1  cx}. This is the standard hinge loss for binary classification. This is equivalent to the
mshinge
loss if class c=+1 is assigned score x and class c=1 is assigned score 0.
VL_NNLOSS(...,'OPT', VALUE, ...) supports these additionals options:

InstanceWeights [[]]
Allows to weight the loss as L'(x,c) = WGT L(x,c), where WGT is a perinstance weight extracted from the array
InstanceWeights
. For categorical losses, this is either a H x W x 1 or a H x W x 1 x N array. For attribute losses, this is either a H x W x D or a H x W x D x N array. 
TopK [5]
TopK value for the topK error. Note that K should not exceed the number of labels.
See also: VL_NNSOFTMAX().