Discussion about this post

MoltenOak

Very interesting, thanks!

> 13. Tabular datasets often to have high irreducible error. For LLMs, even on extremely hard problems like writing IMO math proofs, it is plausible (though a very difficult ongoing research question) to get the LLMs to work perfectly. But for even "simple" tabular classification problems, the different classes overlap so much in distribution that (without adding new features or improving one's measurement process) perfect accuracy is impossible.

Isn't this a feature of the problem type rather than of the dataset or model? In any classification task, as long as the classes overlap in the features you're using, perfect accuracy is impossible. And the more they overlap, the lower the best achievable accuracy. The same is true for a prediction task: unless your features fully determine the outcome, there will be some inaccuracy.
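To make the overlap point concrete, here is a minimal sketch under assumptions of my own (none of this is from the post): two equally likely classes, a single feature, and Gaussian class distributions with unit variance whose means are `separation` apart. The function names are made up for illustration. In this toy setup the best possible classifier thresholds at the midpoint, and its accuracy is capped by how much the two distributions overlap, no matter which model you fit.

```python
import math
import random


def bayes_accuracy(separation: float) -> float:
    """Best achievable accuracy in the assumed toy setup.

    Class 0 has x ~ N(-separation/2, 1), class 1 has x ~ N(+separation/2, 1),
    and both classes are equally likely. The optimal rule predicts class 1
    when x > 0, which is correct with probability Phi(separation / 2).
    """
    # Phi(z) = 0.5 * (1 + erf(z / sqrt(2))), evaluated at z = separation / 2.
    return 0.5 * (1.0 + math.erf(separation / (2.0 * math.sqrt(2.0))))


def simulated_accuracy(separation: float, n: int = 200_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the accuracy of the midpoint-threshold rule."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n):
        label = rng.random() < 0.5                      # equal class priors
        mean = separation / 2 if label else -separation / 2
        x = rng.gauss(mean, 1.0)                        # draw the single feature
        correct += (x > 0) == label                     # optimal threshold at 0
    return correct / n


if __name__ == "__main__":
    # Smaller separation means more overlap and a lower accuracy ceiling.
    for sep in (0.0, 0.5, 1.0, 2.0, 4.0):
        print(f"separation={sep}: ceiling={bayes_accuracy(sep):.4f}, "
              f"simulated={simulated_accuracy(sep):.4f}")
```

With zero separation the ceiling is chance level, and it only approaches 1 as the classes pull apart. Adding a feature that separates the classes better raises the ceiling; a better model on the same feature does not.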

I feel like proofs are a different type of problem, because you "just" have to construct one and confirm it works - there isn't a set of answers, each of which has some plausibility given the facts, but only one of which is correct. And for the IMO problems, of course, we know that proofs exist and that they can be found (by humans).

However, tabular data *is* much lower-dimensional than both text and image data, as you said. As such, classification tasks on tabular data may in general be more difficult than those on image data (think cats vs. dogs classification).

Also a minor correction: Missing or superfluous word in "Tabular datasets often to have high irreducible error."
