
Clarify that the Softmax derivative is good-enough #5

Open
MadLittleMods wants to merge 1 commit into SebLague:main from MadLittleMods:madlittlemods/note-on-multi-input-softmax

Conversation

@MadLittleMods

First, thank you so much for this amazing resource and video series! 🙇 Your videos are a gold standard for understanding the concepts, and the polished end products impress everyone 🌠

While following along and writing my own implementation in Zig, I added some gradient check tests to ensure my backpropagation code/math was correct, and saw that they were failing whenever I used Softmax. I banged my head against this for a long while and even compared my network's outputs against this implementation, only to find they were exactly the same.
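
(For anyone unfamiliar, a gradient check compares the analytic gradients from backpropagation against numerical finite differences of the loss. Here's a minimal sketch in Python/NumPy of the kind of test I mean; the function shape is just for illustration and not this repo's API:)

```python
import numpy as np

def gradient_check(loss_fn, analytic_grad, params, eps=1e-5):
    """Compare backprop gradients against central finite differences.

    loss_fn: params -> scalar loss; analytic_grad: gradient from backprop.
    Hypothetical helper for illustration, not part of this repo.
    """
    numeric = np.zeros_like(params)
    for i in range(params.size):
        plus, minus = params.copy(), params.copy()
        plus.flat[i] += eps
        minus.flat[i] -= eps
        numeric.flat[i] = (loss_fn(plus) - loss_fn(minus)) / (2 * eps)
    # Relative error; values much above ~1e-4 usually signal a backprop bug.
    denom = np.maximum(1e-8, np.abs(analytic_grad) + np.abs(numeric))
    return np.abs(analytic_grad - numeric) / denom
```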

Finally, after some external help, I realized the difference between single-input activation functions like Sigmoid, TanH, and ReLU, and multi-input activation functions like Softmax, which require more work to find the full derivative. I wrote some notes on the difference; or perhaps the source code I ended up with is easier to understand.
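
Concretely, every Softmax output depends on every input, so the true derivative is the full Jacobian $\frac{\partial s_i}{\partial z_j} = s_i(\delta_{ij} - s_j)$, and the backward pass needs a Jacobian-vector product rather than an elementwise multiply by the diagonal. A rough sketch of both versions in Python/NumPy (names are just for illustration, not this repo's code):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift by the max for numerical stability
    return e / e.sum()

def softmax_backward_full(z, upstream):
    """Exact gradient: multiply upstream by the full Jacobian J[i][j] = s_i * (delta_ij - s_j)."""
    s = softmax(z)
    jacobian = np.diag(s) - np.outer(s, s)
    return jacobian @ upstream

def softmax_backward_diagonal(z, upstream):
    """'Good-enough' version: keep only the diagonal s_i * (1 - s_i), as if Softmax were a single-input activation."""
    s = softmax(z)
    return s * (1 - s) * upstream
```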

Just wanted to add a note to the code here so others don't hit the same pitfall as hard.

It's really interesting how the "good-enough" derivative of Softmax, using only the diagonal elements of the Jacobian matrix, empirically still works well enough for the neural network to converge. The best way I found to understand this, and relate it to a concept with more research/documentation behind it, is stochastic gradient descent, which trains on mini-batches to take quick, imperfect, but good-enough steps down the cost gradient.
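
Continuing the sketch above (reusing the `softmax_backward_*` helpers), a quick numerical comparison shows the intuition: the diagonal-only gradient has the wrong magnitude, but at least in examples like this one it still points in roughly the same direction as the exact gradient, so descent steps still make progress:

```python
z = np.array([1.0, 2.0, 0.5])
upstream = np.array([0.1, -0.3, 0.2])  # some incoming gradient from the loss

exact = softmax_backward_full(z, upstream)
approx = softmax_backward_diagonal(z, upstream)

cosine = exact @ approx / (np.linalg.norm(exact) * np.linalg.norm(approx))
print(cosine)  # ~0.97 here: different magnitudes, similar direction
```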
