Science indeed but not perfection - you can't have that.
As to 'why', well, simple practicalities such as time, effort and money. I only argue that blind testing is superior to sighted, not that it ensures a perfect result every time. If you have the budget of a multinational plc and need to produce results on which the reputation and shareholder value of your company stand, go ahead and spend a fortune on a test in which every conceivable variable has been eliminated. You can also have a huge sample.
You fall into the classic trap of making something that is essentially very simple into something that is needlessly complex and, with respect, what you propose is a world away from the requirements of simply comparing two items of audio equipment at home.
So let's go back to practical examples.
We compare two cables in a sighted test and ask, say, three participants to give their responses. All hear clear differences.
Changing nothing in the system or room, you repeat the session with the cable identities disguised and again take the responses.
If the differences are indeed real, those differences should be heard in both tests and the comments from each test should match. If they don't, well, it is highly likely that factors other than sound have influenced the result.
That's it - it really is that simple.
Could you go global and claim that 'all cables sound the same' - absolutely not.
Have we shown that what the participants heard initially is likely to be false?
Yes we have.
Moreover, what if the direct "ABX" identification returns a "null" but a following (otherwise identical) preference test shows a marked (and statistically significant) preference for one of the two items (which previously could not be distinguished)? Where do we find ourselves?
We find ourselves questioning one or other of the tests.
Or it could simply be that one test used less than perceptive listeners.
Which is why I said results have to be fairly parochial unless the test is conducted on a grand scale, with the resources that would require.
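For anyone who wants to put a number on what a "null" or a "statistically significant" result means in a small home test, here is a rough sketch in Python. It is only an illustration, not a method anyone in this thread has described: it assumes each trial is an independent two-way choice where pure guessing is right half the time, and the function name `abx_p_value` and the trial counts are made up for the example.

```python
from math import comb


def abx_p_value(correct: int, trials: int, chance: float = 0.5) -> float:
    """Probability of scoring at least `correct` out of `trials` by guessing alone.

    Assumes independent trials, each with probability `chance` of a lucky hit
    (0.5 for a two-way ABX or A/B preference choice). Hypothetical helper.
    """
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(correct, trials + 1)
    )


# 12 right out of 16: unlikely to be pure guessing
print(round(abx_p_value(12, 16), 3))  # 0.038

# 9 right out of 16: entirely consistent with guessing, i.e. a "null"
print(round(abx_p_value(9, 16), 3))  # 0.402
```

With only a handful of trials and listeners, a genuine but small difference can easily produce a "null", which is another reason to treat home results as parochial rather than global.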