Clarify that the Softmax derivative is good-enough #5
Open

MadLittleMods wants to merge 1 commit into SebLague:main
First, thank you so much for this amazing resource and video series! 🙇 Your videos are a gold standard for understanding these concepts, and the polished end-products impress everyone 🌠
While following along and writing my own implementation in Zig, I added some gradient check tests to ensure my backpropagation code/math was correct, and saw that they were failing whenever I used `Softmax`. I banged my head against this for a long while and even compared my network's outputs against this implementation, only to find they were exactly the same.

Finally, after some external help, I realized the difference between single-input activation functions like `Sigmoid`, `TanH`, and `ReLU` and multi-input activation functions like `Softmax`, which require more work to find the full derivative. I wrote some notes on the difference, or perhaps the source code I ended up with is easier to understand. Just wanted to add a note to the code here so others don't hit the same pitfall as hard.
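To make the difference concrete, here is a minimal sketch (NumPy rather than the code in this repo, and all of the function names are my own) of the full-Jacobian backward pass, the diagonal-only version, and the kind of numerical gradient check that caught this for me:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())              # shift for numerical stability
    return e / e.sum()

def softmax_backward_full(z, upstream):
    """Full derivative: J[i, j] = s_i * (delta_ij - s_j), then a matrix-vector product."""
    s = softmax(z)
    jacobian = np.diag(s) - np.outer(s, s)
    return jacobian @ upstream           # J is symmetric, so J == J^T here

def softmax_backward_diagonal_only(z, upstream):
    """The "good-enough" version: keep only the diagonal terms s_i * (1 - s_i)."""
    s = softmax(z)
    return s * (1.0 - s) * upstream

def numerical_gradient(f, z, upstream, eps=1e-6):
    """Central-difference check of d(upstream . f(z)) / dz, used as ground truth."""
    grad = np.zeros_like(z)
    for i in range(z.size):
        zp, zm = z.copy(), z.copy()
        zp[i] += eps
        zm[i] -= eps
        grad[i] = (upstream @ f(zp) - upstream @ f(zm)) / (2 * eps)
    return grad

z = np.array([0.3, -1.2, 2.0])           # pre-activation values for one layer
upstream = np.array([0.1, -0.4, 0.25])   # pretend dCost/dActivation from the layer above

print(softmax_backward_full(z, upstream))
print(numerical_gradient(softmax, z, upstream))
print(softmax_backward_diagonal_only(z, upstream))
```

The full-Jacobian result should agree with the numerical check up to floating-point error, while the diagonal-only result will generally differ, which is exactly what my gradient check tests were flagging.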
It's really interesting how the "good-enough" derivative of `Softmax`, using only the diagonal elements of the Jacobian matrix, empirically still works so well for the neural network to converge. The best way I was able to understand this, and relate it to a concept with more research/documentation behind it, is stochastic gradient descent, which trains on mini-batches to take quick, imperfect, but good-enough steps down the cost gradient.
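For reference, writing the Jacobian out (my notation, nothing from the repo) shows which terms the good-enough derivative keeps and which it drops:

```math
\frac{\partial s_i}{\partial z_j} = s_i \,(\delta_{ij} - s_j) =
\begin{cases}
  s_i \,(1 - s_i) & i = j, \text{ the diagonal terms the code keeps} \\
  -\, s_i \, s_j  & i \neq j, \text{ the cross terms that get dropped}
\end{cases}
```

Here $s = \mathrm{Softmax}(z)$ and $\delta_{ij}$ is $1$ when $i = j$ and $0$ otherwise. Single-input functions like `Sigmoid` only ever have the diagonal, which is why their element-wise derivative is already the whole story.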