# How Did Softmax Get Its Name?

#### by Jiyang Wang (e-mail: jiyang_wang@yahoo.com)

When I was learning about multiclass classifiers such as SVMs and neural networks, the name "Softmax" struck me as somewhat mysterious. I wondered why it was named so, and whether there was a "Hardmax" function as its sibling or even its ancestor. I checked Wikipedia but failed to find any useful information there. The same question about Softmax's name had been asked on Quora, and one of the answers, as far as I remember, revealed part of the mystery. However, I cannot find that Quora link anymore.

Like most people, I continued to use it naturally in my work without thinking about its origin. Engineers and researchers are mostly pragmatic, aren't they?

But recently, when I was preparing for a few job interviews, Softmax drew my attention again (because I was concerned that an interviewer would ask me to explain the Softmax function or the Softmax classifier in detail). I decided to figure out why it got this funny name and whether it really stems from another related function nicknamed "Hardmax" (or not; there was no chance of finding it on Wikipedia, anyway).

## Hardmax

Before elaborating on Softmax, I'll jump straight to the conclusion: there **is** a "Hardmax" function. It is usually called the **hinge loss** function and is used in linear classifiers such as SVM:

$$L_i = \sum_{j \neq i} \max(0,\ s_j - s_i + \Delta)$$

where *sj* and *si* are the classification scores of the *j-th* and *i-th* elements of the output vector of the model, *Li* is the loss for an input whose ground-truth class is the *i-th* class, and Δ is a margin (threshold).

Stanford's CS231n course (Convolutional Neural Networks for Visual Recognition) has a very good explanation of the above loss function. Please check it out here (http://cs231n.github.io/linearclassify/#softmax).

And here is an example from it:

Wikipedia has an article on hinge loss (https://en.wikipedia.org/wiki/Hinge_loss) as well.

Basically, hinge loss has a threshold Δ: whenever the score *sj* of an incorrect class is lower than *si* by at least Δ, that element of the output vector (score vector) contributes zero loss. The threshold Δ, which functions as a margin between the classification boundary (a.k.a. decision boundary) and the samples nearest to the boundary, is applied to *sj* for all j ≠ i, so that *sj* adds to the overall loss *Li* only when *si* − *sj* is smaller than the threshold.

Thus the hinge loss function has the form of the max function **max(0, x)**, and it is "hard" by nature. We'll see what this means later on when we draw the graph of the max function. Now here is an example of how hinge loss is calculated (from Stanford CS231n), in which i = 0, i.e., the ground-truth label of the input picture is "cat", and Δ = 10:
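A calculation of this kind can be sketched in a few lines of Python. This is only a minimal sketch: `hinge_loss` is a hypothetical helper name, and the scores `[13, -7, 11]` are assumed for illustration, with index 0 standing for "cat" and Δ = 10 as stated above.

```python
def hinge_loss(scores, true_index, delta):
    """Multiclass hinge loss: sum over j != i of max(0, s_j - s_i + delta)."""
    s_i = scores[true_index]
    return sum(
        max(0.0, s_j - s_i + delta)
        for j, s_j in enumerate(scores)
        if j != true_index
    )

# Illustrative scores (an assumption); index 0 is the ground-truth class "cat".
scores = [13.0, -7.0, 11.0]
loss = hinge_loss(scores, true_index=0, delta=10.0)

# max(0, -7 - 13 + 10) + max(0, 11 - 13 + 10) = 0 + 8
print(loss)  # 8.0
```

The second score is lower than the correct one by more than the margin, so it adds nothing. The third score is only 2 below the correct one, so it is penalized by 8.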

Next, we want to see what each *sj* contributes to the hinge loss *Li*, treating *sj* as the variable and keeping *si* fixed, by drawing a graph of it. I simplify the graph by using only integer values for *sj* while fixing *si* to 0 and Δ to 0.

No wonder the **max** function is also called the "hinge" function: its shape looks like a hinge. It can be called "hardmax" because the loss that *sj* contributes to *Li* is "harshly" zeroed out as soon as *sj* falls below *si* by more than the threshold, regardless of its own value. Or, from another perspective, there is a point (at *sj* = 0 in the graph) where the function is not differentiable (which is 'hard', i.e., not smooth, as compared to 'soft').
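The kink can also be checked numerically. The sketch below assumes *si* = 0 and Δ = 0 as in the graph; `hinge_term` is a hypothetical helper name and the step size `h` is an arbitrary small value.

```python
def hinge_term(s_j, s_i=0.0, delta=0.0):
    """Contribution of one score s_j to the hinge loss: max(0, s_j - s_i + delta)."""
    return max(0.0, s_j - s_i + delta)

# Values at integer scores: zero up to s_j = 0, then growing linearly.
for s_j in range(-3, 4):
    print(s_j, hinge_term(s_j))

# One-sided finite-difference slopes at the kink s_j = 0.
h = 1e-6
left_slope = (hinge_term(0.0) - hinge_term(-h)) / h
right_slope = (hinge_term(h) - hinge_term(0.0)) / h
print(left_slope, right_slope)
```

The slope is 0 just to the left of the kink and 1 just to the right. That mismatch is why the function has no derivative at that point.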

## Interpretation of Scores

Let's put the hinge/hardmax function aside for a while and talk about the Softmax function. Again, Stanford CS231n provides a very clear description of why the Softmax function is applied to classification scores, quoted below:

*Unlike the SVM which treats the outputs f(xi, W) as (uncalibrated and possibly difficult to interpret) scores for each class, the Softmax classifier gives a slightly more intuitive output (normalized class probabilities) and also has a probabilistic interpretation that we will describe shortly. In the Softmax classifier, the function mapping*

$$f(x_i; W) = W x_i$$

*stays unchanged, but we now interpret these scores as the unnormalized log probabilities for each class and replace the hinge loss with a cross-entropy loss that has the form:*

$$L_i = -\log\left(\frac{e^{s_i}}{\sum_k e^{s_k}}\right)$$

*The function*

$$f_j(z) = \frac{e^{z_j}}{\sum_k e^{z_k}}$$

*is called the Softmax function: It takes a vector of arbitrary real-valued scores (in z) and squashes it to a vector of values between zero and one that sum to one.*
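The squashing behaviour is easy to see in code. This is a minimal sketch using only the standard library. The input scores are illustrative, and subtracting the maximum score before exponentiating is a common numerical-stability step assumed here, not something the quoted passage describes.

```python
import math

def softmax(z):
    """Map a list of real-valued scores to values in (0, 1) that sum to one."""
    # Subtracting max(z) leaves the result unchanged but avoids overflow in exp.
    m = max(z)
    exps = [math.exp(z_j - m) for z_j in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])  # illustrative scores
print(probs)
print(sum(probs))
```

Larger scores receive larger probabilities, but every class keeps a non-zero share.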

## Cross-Entropy Loss

Pay attention to the final form of *Li* just above: it is not obvious where the cross-entropy loss comes in. Let's delve into more detail.

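As said above, the score *si* is interpreted as the unnormalized log probability for class *i*. Writing $\tilde{q}(s_i)$ for that unnormalized probability, we have:

$$s_i = \log \tilde{q}(s_i)$$

Or

$$\tilde{q}(s_i) = e^{s_i}$$

Now we normalize this probability:

$$q(s_i) = \frac{e^{s_i}}{\sum_k e^{s_k}}$$

This is the **Softmax** function. Then we calculate the cross-entropy, quoting from Stanford CS231n: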

*The cross-entropy between a “true” distribution p and an estimated distribution q is defined as:*

$$H(p, q) = -\sum_x p(x) \log q(x)$$

*The Softmax classifier is hence minimizing the cross-entropy between the estimated class probabilities ($q(s_i) = e^{s_i} / \sum_k e^{s_k}$ as seen above) and the “true” distribution, which in this interpretation is the distribution where all probability mass is on the correct class (i.e. p = [0, …, 1, …, 0] contains a single 1 at the i-th position).*

In a nutshell, cross-entropy measures the difference between two probability distributions, here represented as vectors. In our case, we want to compare the ground-truth label, one-hot encoded as **p** = [0, . . . , 1, . . . , 0], with the output vector of the model, **q**, whose elements are *q(sj)*. As vector **p** has 1 at the *i-th* position and 0's at all the other positions, the cross-entropy between vector **p** and vector **q** keeps only the *i-th* element of vector **q**:

$$H(p, q) = -\sum_j p_j \log q(s_j) = -\log q(s_i) = -\log\left(\frac{e^{s_i}}{\sum_k e^{s_k}}\right)$$

This is exactly *Li*. It is the negative log of the Softmax function.
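A short numerical check makes the collapse to a single term concrete. This is a sketch under stated assumptions: the scores and the ground-truth index `i = 0` are illustrative, and `softmax` and `cross_entropy` are hypothetical helper names.

```python
import math

def softmax(scores):
    """Normalized class probabilities computed from raw scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) * log q(x); terms with p(x) = 0 contribute nothing."""
    return -sum(p_x * math.log(q_x) for p_x, q_x in zip(p, q) if p_x > 0)

scores = [2.0, 1.0, -1.0]   # illustrative scores
i = 0                       # assumed ground-truth class index
p = [1.0 if j == i else 0.0 for j in range(len(scores))]  # one-hot label
q = softmax(scores)

loss_full = cross_entropy(p, q)   # full cross-entropy sum
loss_short = -math.log(q[i])      # only the i-th element of q
print(loss_full, loss_short)
```

Both numbers agree, because the one-hot vector multiplies every other term by zero.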

## Softmax Function

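Let's rewrite the cross-entropy loss over the Softmax function as below, which makes it clear that the loss function is in fact a function of score differences (consistent with hinge loss):

$$L_i = -\log\left(\frac{e^{s_i}}{\sum_{k=1}^{K} e^{s_k}}\right) = \log\left(\sum_{k=1}^{K} e^{s_k - s_i}\right) = \log\left(1 + \sum_{k \neq i} e^{s_k - s_i}\right)$$

K is the number of classes.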
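That the loss depends only on score differences can be checked numerically. In this sketch, the function names and score values are illustrative assumptions. It computes the loss from the Softmax probability and from the differences *sk* − *si*, then shifts every score by the same constant.

```python
import math

def loss_from_probability(scores, i):
    """L_i = -log(e^{s_i} / sum_k e^{s_k})."""
    total = sum(math.exp(s_k) for s_k in scores)
    return -math.log(math.exp(scores[i]) / total)

def loss_from_differences(scores, i):
    """L_i = log(sum_k e^{s_k - s_i}), written purely in score differences."""
    return math.log(sum(math.exp(s_k - scores[i]) for s_k in scores))

scores = [2.0, 1.0, -1.0]             # illustrative scores
shifted = [s + 5.0 for s in scores]   # same differences, different absolute values

a = loss_from_probability(scores, 0)
b = loss_from_differences(scores, 0)
c = loss_from_differences(shifted, 0)
print(a, b, c)
```

All three values coincide: adding a constant to every score leaves the loss untouched, just as it would for hinge loss.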

Unlike in hinge loss, though, every element of the output score vector in the cross-entropy loss over the Softmax function has some influence on the final loss, regardless of its score value *sk*. Let's evaluate the contribution of one element's score *sk*, treated as the variable with *si* fixed, so that we can compare the result with the contribution of *sj* in hinge loss. Keeping only the term for that one element gives:

$$f(s_k) = \log\left(1 + e^{s_k - s_i}\right)$$

Now let's draw the graph of *f(sk)*. As we are only interested in the shape of *f(sk)*, we can fix *si* to 0 as we did when drawing the graph of the **max** function. For comparison, I draw the **max** function together with the **Softmax** function.

Can you tell why **Softmax** is a 'softened' version of the **max** function? I'm sure you can now.
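The comparison can also be tabulated. This sketch assumes the single-element Softmax contribution is taken to be log(1 + e^(sk − si)), with *si* fixed to 0 and a hinge margin of 0; `hard_term` and `soft_term` are hypothetical helper names.

```python
import math

def hard_term(s_k, s_i=0.0):
    """Hinge-style contribution with margin 0: max(0, s_k - s_i)."""
    return max(0.0, s_k - s_i)

def soft_term(s_k, s_i=0.0):
    """Softmax-style contribution: log(1 + e^(s_k - s_i))."""
    return math.log1p(math.exp(s_k - s_i))

for s_k in range(-5, 6):
    hard = hard_term(s_k)
    soft = soft_term(s_k)
    print(f"{s_k:3d}  max={hard:.4f}  soft={soft:.4f}  gap={soft - hard:.4f}")
```

The soft curve always lies slightly above the hard one. The gap is largest at the kink (log 2 at *sk* = 0) and shrinks toward zero on both sides, so the corner of the hinge is rounded off rather than removed.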