From: Misha Gromov <
Sent: Wednesday, October 11, 2017 7:47 PM
To: Jeffrey E.
Subject: Re: Fwd: Like Bach's comments:)

On Wed, 11 Oct 2017 20:01:46 +0200, Jeffrey E. wrote:

Forwarded message
From: Joscha Bach <
Date: Wed, Oct 11, 2017 at 7:55 PM
Subject: Re:
To: Jeffrey Epstein <[email protected]>

After skimming their paper, the idea seemed unexciting to me at first: basically, if we have enough feature dimensions we can almost always find a linear separation. This is also related to how Support Vector Machines work: they project the data into an extremely high-dimensional space, find a separating hyperplane with linear regression, and then project that plane back into the original space as the separator. A similar idea is behind Echo State networks, which use a randomly wired recurrent neural network and then only train the output layer with a single linear regression.

The authors take an existing trained neural network, and whenever it makes a mistake, they train a linear classifier on the network state and data, i.e. they try to find out when the network goes wrong. Instead of improving the network (which is also likely to make it worse in other cases), they add an additional layer to it. For engineering, this makes a lot of sense, because large neural networks are cheap to use and deploy but expensive to train.

On a more philosophical level, it is tempting to ask if that might be a general learning principle for brains: when you don't perform well, add more control structure on top. It probably makes sense whenever you are confident that training the existing structure won't improve it that much, but unlike training the weights in an existing network, it also adds quite a few milliseconds to the processing time. There is probably an optimal tradeoff for this. The other thing is that the new layer is a linear classifier only (at least in this paper), and it is creating a local override on the s