In a recent episode of the excellent InControl podcast, Alberto Padoan asked Stephen Boyd about the value of convex optimization. What follows is a transcript of his response, with a few of my own comments at the end.
It's an incredibly important tool, to solve all kinds of optimization problems. The usual sort of story on optimization problems is, "Here's what I'd like to solve." And people don't like to admit it when they're teaching these classes, but the bottom line is that we can't solve most of these problems. You just can't. Now maybe it doesn't matter; maybe we have heuristics that do well. If you get a good model, and you get good predictions, and your hedge fund makes money, you're fine, you don't need to worry about that.
But the fact is that there is a group of mathematical optimization problems that you can solve numerically, really really well. I used to go around and say that the advantage is that you get the global optimum. There's no apologies. There's no star and a footnote, where you say, "Well, maybe I didn't."
I used to do that, until I met someone who was a senior executive at a large semiconductor firm. So I was chatting with him, and I said, "This is the best design of the thing that there is." And he said to me, "How could you possibly know that?" And I realized that, for a normal person, that's the correct response. What hubris! What would cause a person to say, "This is the best there is." So I started my usual spiel, which is, "Basically, it has to do with the mathematical properties of the functions involved." And I'm blabbering on and on. And he just puts his hand up and says, stop right there. And he says, "Look, fine. But that's only the best design for this particular circuit configuration. But there's lots of others. How do you know that there's not another one?" And I said, "You're right." And he said, "Now you be quiet, and I'm going to tell you what's useful about convex optimization."
So here I am being lectured on convex optimization by a senior circuit designer. And he says to me, "I don't care about your global stuff. Who cares! If you do real engineering, nothing is global. I don't even know what that means." I said, "Granted." And he says, "Look, here's what's useful. You don't need a starting point, number one. Number two, you don't need to babysit the algorithm."
He said, "We've looked at a lot of these algorithms that do circuit design. They require a starting point. Then they take off and they'll go from a good design to a quite-good design. Do you know what we call a good design around here?" And I said, "No." And he said, "Done. Done. We are under unbelievable business pressure to get things out there. We don't want another 15% of performance; we want to tape it out."
My thoughts…
It’s important to think about what “not needing to babysit the algorithm” buys you. Consider the following optimization problem, which we will assume to be convex in the parameters θ:

$$\hat{\theta}(D) = \arg\min_{\theta} f(\theta; D)$$
The optimization problem is a function of some dataset D. Convexity means that the mapping from our data to our estimated parameters is deterministic (strictly speaking, this needs the minimizer to be unique, as it is under strict convexity; otherwise only the optimal value is pinned down). The result does not depend on anything other than the data: not on a starting point, not on a random seed. If we change how we curate our dataset and performance gets worse, we know that the curation decision was the wrong one. If f had been non-convex, there is always the possibility that our curation was actually the better approach, but optimization issues prevented us from taking advantage of the improved data, perhaps so much so that performance got worse.
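To make the "no starting point" property concrete, here is a minimal sketch. The two objectives, the step size, and the starting points are all made up for illustration; the only claim is that plain gradient descent lands on the same answer from every start on the convex function, while on the non-convex one the answer depends on where it began.

```python
def gradient_descent(grad, x0, lr=0.01, steps=5000):
    """Fixed-step gradient descent on a scalar function, given its derivative."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x


def convex_grad(x):
    # Derivative of the convex toy objective f(x) = (x - 3)^2.
    return 2.0 * (x - 3.0)


def nonconvex_grad(x):
    # Derivative of the non-convex toy objective g(x) = x^4 - 3x^2 + x,
    # which has two separate local minima.
    return 4.0 * x**3 - 6.0 * x + 1.0


# Hypothetical starting points, chosen to straddle both basins of g.
starts = [-2.0, -0.5, 0.5, 2.0]

convex_results = [round(gradient_descent(convex_grad, s), 4) for s in starts]
nonconvex_results = [round(gradient_descent(nonconvex_grad, s), 4) for s in starts]

print("convex:    ", convex_results)     # one value, repeated for every start
print("non-convex:", nonconvex_results)  # two distinct values, depending on the start
```

In the non-convex case, a worse result after a data change could be blamed on either the data or the start; in the convex case only the data is left to blame.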
On the other hand, the mapping from our data to our parameters can still be arbitrarily complex. Optimizing data-centric parameters remains a non-convex problem, and there are no obvious solutions that avoid trying all possibilities.
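A small sketch of that split, under heavy assumptions: the three data sources, the validation points, and the idea of modeling "curation" as choosing which sources to include are all hypothetical. The inner fit is a one-parameter least-squares problem (convex, solved in closed form), so each candidate dataset maps to exactly one parameter value; the outer choice of dataset is handled by brute-force enumeration.

```python
from itertools import combinations

# Hypothetical data sources, each a list of (x, y) pairs.
sources = {
    "clean": [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)],
    "noisy": [(1.0, 2.6), (2.0, 3.2), (3.0, 6.8)],
    "mislabeled": [(1.0, -2.0), (2.0, -4.0), (3.0, -6.0)],
}
# Hypothetical held-out points used to score each curation choice.
validation = [(4.0, 8.0), (5.0, 10.0)]


def fit_slope(points):
    """Closed-form minimizer of sum((y - w*x)^2) over w: the convex inner problem."""
    return sum(x * y for x, y in points) / sum(x * x for x, _ in points)


def validation_error(w):
    """Squared error of the fitted slope on the held-out points."""
    return sum((y - w * x) ** 2 for x, y in validation)


def best_curation():
    """Outer problem: try every non-empty subset of sources and keep the best."""
    best = None
    names = sorted(sources)
    for k in range(1, len(names) + 1):
        for subset in combinations(names, k):
            points = [p for name in subset for p in sources[name]]
            err = validation_error(fit_slope(points))
            if best is None or err < best[0]:
                best = (err, subset)
    return best


best_err, best_subset = best_curation()
print(best_subset, best_err)
```

With n sources the outer loop visits 2^n - 1 subsets, which is the "trying all possibilities" cost; the payoff of the convex inner step is that each score reflects the subset itself and not an unlucky optimization run.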
From this discussion, it becomes clear that there is an analogy: in-context learning (prompting) is to training (and fine-tuning) as convex optimization is to non-convex optimization. In-context learning does not require understanding of learning rates, batch size, and other ML engineering hacks. A “prompt engineer”, with no real understanding of deep learning optimization, can focus purely on manipulating their dataset. Runtime is also simple to think about; a longer context will increase the computational cost, but you’re always just going through the forward pass of a neural network. You really don’t need to babysit that process.
This suggests that in-context learning, even if it is less computationally or statistically efficient than SGD-based training, has an important role to play in the future. Just as people modified non-convex optimization problems to obtain convexity, perhaps losing fidelity to what they actually cared about in the process, people will use in-context learning to obtain simplicity.