Very well-made video about the somewhat mysterious phenomenon of 'grokking' in transformers, where a model quickly memorizes all of the training data in a few hundred steps while still failing completely on test data, then seems not to improve at all for thousands of steps before suddenly generalizing perfectly: https://www.youtube.com/watch?v=D8GOeCFFby4
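
For anyone who wants to see the effect firsthand, here is a minimal sketch of the setup most grokking experiments use (modular addition with a small transformer and strong weight decay, as in Power et al. 2022). The architecture and hyperparameters below are my illustrative assumptions, not taken from the video:

```python
# Minimal grokking sketch: learn (a + b) mod P with a tiny transformer.
# All hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn

P = 97                      # modulus; the task is (a + b) mod P
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Enumerate every (a, b) pair as a 2-token sequence; label is (a + b) mod P.
pairs = torch.cartesian_prod(torch.arange(P), torch.arange(P))
labels = (pairs[:, 0] + pairs[:, 1]) % P

# Random 50/50 train/test split of the full input space.
perm = torch.randperm(len(pairs))
split = len(pairs) // 2
train_idx, test_idx = perm[:split], perm[split:]

class TinyTransformer(nn.Module):
    def __init__(self, vocab=P, d_model=128, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.pos = nn.Parameter(torch.randn(2, d_model) * 0.02)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, x):                     # x: (batch, 2) token ids
        h = self.embed(x) + self.pos          # add learned positions
        h = self.encoder(h)
        return self.head(h[:, -1])            # predict from the last token

model = TinyTransformer().to(DEVICE)
# Strong weight decay is the ingredient most often credited with grokking.
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

x_train, y_train = pairs[train_idx].to(DEVICE), labels[train_idx].to(DEVICE)
x_test, y_test = pairs[test_idx].to(DEVICE), labels[test_idx].to(DEVICE)

for step in range(50_000):                    # full-batch training
    model.train()
    opt.zero_grad()
    loss = loss_fn(model(x_train), y_train)
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        model.eval()
        with torch.no_grad():
            train_acc = (model(x_train).argmax(-1) == y_train).float().mean()
            test_acc = (model(x_test).argmax(-1) == y_test).float().mean()
        # Expect train_acc -> 1.0 early while test_acc sits near chance for
        # a long plateau, then jumps late: the "grokking" transition.
        print(f"{step:6d}  loss {loss.item():.4f}  "
              f"train {train_acc:.3f}  test {test_acc:.3f}")
```

The weight decay is doing real work here: in most reported experiments, setting it to zero makes the delayed generalization take far longer or never arrive, which is one reason regularization features so heavily in explanations of grokking.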